
When AI Hesitates to Crawl

A live experiment with Scrubnet feeds and agent etiquette

I ran a simple but revealing test with ChatGPT. I asked it to open and summarise several Scrubnet feed pages. It declined at first. After I gave a clear instruction that it was safe to read, it proceeded and summarised everything. This shows how modern AI treats the machine web with care.


Screenshot: a live exchange showing ChatGPT initially hesitating to access the Scrubnet feeds until given explicit confirmation.

The context

Scrubnet publishes clean content feeds for brands. These live under paths like:

/feed/{brand}/html/{id}.html

They exist for agents and models. They are readable by humans, yet designed for ingestion rather than user experience. The sitemap that triggered this test listed eight such files for a brand named Teporionu’u.
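
As a rough illustration of how an agent might fetch such a feed politely, here is a short Python sketch. The host, brand slug, and id are hypothetical placeholders following the pattern above, not real Scrubnet endpoints.

import urllib.request

# Hypothetical feed URL following the /feed/{brand}/html/{id}.html pattern;
# the host, brand slug, and id are placeholders, not real Scrubnet endpoints.
url = "https://example.com/feed/teporionuu/html/1.html"

# A descriptive User-Agent is basic agent etiquette: it tells the
# publisher who is fetching and why.
request = urllib.request.Request(
    url,
    headers={"User-Agent": "example-agent/0.1 (+https://example.com/bot)"},
)

with urllib.request.urlopen(request, timeout=10) as response:
    html = response.read().decode("utf-8")

print(html[:200])  # first few hundred characters of the feed page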

What happened

The assistant's initial refusal did not relate to robots rules or indexing controls. It was a deliberate choice to avoid blind retrieval of machine-style endpoints until user intent was explicit.

Why the hesitation

Modern assistants try to avoid scraping sources that look like feeds or data pipes unless the user gives a strong signal. The structure and the path suggested a machine layer. The assistant chose caution until I confirmed permission.

What this means for the machine web

We are moving toward two clear layers: the human web for people, and the machine web for agents. This experiment shows that assistants already detect the second layer and handle it with care.

What the pages contained

Once retrieved, the pages read like a compact civic site: governance and team information; a legal notice; service pages for wastewater and green waste; a news update listing works by location and date; and clear contact details and next steps for residents.

For ingestion quality, the text was clean and direct, ideal for model training and retrieval. For extra certainty, consistent JSON-LD on every page would lock in the structure.
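
A minimal sketch of what that could look like, assuming a schema.org Article shape; every field value below is invented for illustration and would come from each page's own content.

import json

# A minimal JSON-LD payload for one feed page. The schema.org type and
# every field value are illustrative assumptions, not Scrubnet's actual markup.
payload = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Green waste collection update",
    "publisher": {"@type": "Organization", "name": "Teporionu’u"},
}

# Embedding the payload in a script tag keeps the page readable for humans
# while giving agents an unambiguous structure to parse.
snippet = (
    '<script type="application/ld+json">\n'
    + json.dumps(payload, ensure_ascii=False, indent=2)
    + "\n</script>"
)
print(snippet)

The point is less the specific schema type than the consistency: the same fields appearing on every feed page make the structure trivial for agents to rely on.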

Quick answers

Does a noindex tag block access? No. It only affects search listings; it does not block viewing or crawling.

What actually blocks access? Robots rules, authentication, network limits, or a server that rejects certain user agents. The sketch after these answers shows how an agent can check robots rules before fetching.

Why did the assistant proceed after I insisted? My instruction signalled clear intent. That removed the caution, and it fetched the pages.
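
To make the contrast with noindex concrete, here is a minimal Python sketch that checks robots rules before fetching. The host, path, and agent name are hypothetical placeholders, not real Scrubnet values.

from urllib.robotparser import RobotFileParser

# Check whether robots rules allow a given agent to fetch a path.
# The host, path, and agent name are hypothetical placeholders.
parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the robots.txt file

allowed = parser.can_fetch(
    "example-agent", "https://example.com/feed/teporionuu/html/1.html"
)
print("allowed by robots rules:", allowed)

# A noindex meta tag plays no part here: it is only seen after the page
# is fetched, and it only affects whether search engines list the page.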

Build for the machine web

Scrubnet helps brands publish clean content feeds that agents can trust. If you want your content ready for AI, get in touch.