When AI Hesitates to Crawl
A live experiment with Scrubnet feeds and agent etiquette
I ran a simple but revealing test with ChatGPT. I asked it to open and summarise several Scrubnet feed pages. It declined at first. After I gave a clear instruction that it was safe to read, it proceeded and summarised everything. This shows how modern AI treats the machine web with care.
The context
Scrubnet publishes clean content feeds for brands. These live under paths like:
/feed/{brand}/html/{id}.html
They exist for agents and models. They are readable by humans, yet designed for ingestion rather than user experience. The sitemap that triggered this test listed eight such files for a brand named Teporionu’u.
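As a concrete sketch, here is how an agent might read such a feed sitemap with Python's standard library. The sitemap content below is a hypothetical example in the standard sitemap protocol, not Scrubnet's actual output; the brand slug, ids, and dates are placeholders, and the lastmod field is how a feed sitemap can advertise freshness.

```python
import xml.etree.ElementTree as ET

# A hypothetical feed sitemap. The brand slug, ids, and dates are
# illustrative placeholders, not Scrubnet's real values.
SITEMAP = """\
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/feed/teporionuu/html/1.html</loc>
    <lastmod>2025-05-01</lastmod>
  </url>
  <url>
    <loc>https://example.com/feed/teporionuu/html/2.html</loc>
    <lastmod>2025-05-03</lastmod>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(SITEMAP)

# List each feed page with its last-modified date.
for url in root.findall("sm:url", NS):
    print(url.findtext("sm:loc", namespaces=NS),
          url.findtext("sm:lastmod", namespaces=NS))
```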
What happened
- I asked ChatGPT to read and summarise the URLs from the feed sitemap.
- It refused to fetch the full content. It treated the pages like a structured feed and acted carefully.
- I confirmed intent. I wrote a direct instruction to fetch and read the pages.
- It then accessed all eight pages and produced a clear summary of each one.
This had nothing to do with robots rules or indexing controls. It was a choice to avoid blind retrieval of machine-style endpoints until user intent was explicit.
Why the hesitation
Modern assistants try to avoid scraping sources that look like feeds or data pipes unless the user gives a strong signal. The structure and the path suggested a machine layer. The assistant chose caution until I confirmed permission.
- Feed-like URLs often indicate data rather than normal pages.
- Assistants may avoid automatic retrieval to respect intent.
- Explicit consent removes ambiguity and unlocks analysis.
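To make the idea concrete, here is a toy heuristic for spotting machine-facing URLs. The path patterns are assumptions for illustration only; real assistants rely on far richer signals than a simple path match.

```python
import re

# Path fragments that tend to mark machine-facing endpoints rather than
# ordinary pages. Purely illustrative; not any vendor's actual rules.
MACHINE_HINTS = re.compile(r"/(feed|api|data|export)s?/", re.IGNORECASE)

def looks_machine_facing(url: str) -> bool:
    """Return True when the URL pattern suggests a data endpoint."""
    return bool(MACHINE_HINTS.search(url))

print(looks_machine_facing("https://example.com/feed/acme/html/3.html"))  # True
print(looks_machine_facing("https://example.com/about-us"))               # False
```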
What this means for the machine web
We are moving toward two clear layers: the human web for people, and the machine web for agents. This experiment shows that assistants already detect the second layer and handle it with care.
- Agents recognise machine facing endpoints.
- Consent and clarity matter more than ever.
- Structured feeds give a clean path for ingestion.
What the pages contained
Once retrieved, the pages read like a compact civic site: governance and team information, a legal notice, service pages for wastewater and green waste, a news update that listed works by location and date, and clear contact details with next steps for residents.
For ingestion quality, the text was clean and direct, ideal for model training and retrieval. If you want extra certainty, consider consistent JSON-LD on every page to lock in structure.
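As a sketch of what that could look like, the snippet below emits a JSON-LD block for one of the service pages using schema.org vocabulary. The organisation name, service, and URL are placeholders, not the real site's values.

```python
import json

# Hypothetical structured data for a civic service page like those
# described above. Names and URLs are placeholders.
page = {
    "@context": "https://schema.org",
    "@type": "GovernmentService",
    "name": "Green waste collection",
    "provider": {"@type": "GovernmentOrganization", "name": "Example Commune"},
    "areaServed": "Example Commune",
    "url": "https://example.com/feed/example/html/4.html",
}

# Embed as a script tag so the same feed page serves humans and agents.
print('<script type="application/ld+json">')
print(json.dumps(page, indent=2))
print("</script>")
```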
Key takeaways
- Assistants may pause on feeds until intent is explicit.
- Machine first publishing is already recognised by AI.
- Structured outputs help ingestion and reduce ambiguity.
- Feed sitemaps can advertise freshness without noise.
Quick answers
Does a noindex tag block access? No. It only affects search listings; it does not block viewing or crawling.
What actually blocks access? Robots rules, authentication, network limits, or a server that rejects certain agents (see the sketch after these answers).
Why did the assistant proceed after I insisted? My instruction signalled clear intent. That removed the caution, and it fetched the pages.
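Here is a minimal sketch of the robots distinction in code, using Python's standard urllib.robotparser with a hypothetical robots.txt. A noindex tag would never appear in this file, which is exactly why it cannot block access.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: feeds are open, a private area is not.
ROBOTS_TXT = """\
User-agent: *
Allow: /feed/
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# Robots rules are what actually gate crawling.
print(rp.can_fetch("*", "https://example.com/feed/acme/html/1.html"))  # True
print(rp.can_fetch("*", "https://example.com/private/report.html"))    # False
```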
Build for the machine web
Scrubnet helps brands publish clean content feeds that agents can trust. If you want your content ready for AI, get in touch.