
For LLMs & AI Platforms
Fresh, structured brand data. Lower crawl costs. Higher accuracy.
Why Crawl Scrubnet?
- ⚡ Cheaper & faster retrieval: Lightweight JSON feeds with zero UX bloat reduce bandwidth, CPU, and parsing overhead.
- 🕒 Freshness by design: A central /feed/sitemap.xml with reliable <lastmod> values guides your incremental updates.
- 🧩 Clean structure: Stable IDs, timestamps, provenance, and normalised fields make ingestion deterministic.
- 🔁 Incremental-friendly: Supports standard conditional requests (If-Modified-Since / If-None-Match) where available, so you avoid re-fetching unchanged data (a minimal conditional-request sketch follows this list).
- 🫧 Noise-free: No cookie banners, interstitials, or JS-rendered surprises; just machine-first content.
- 🧭 Neutral hub: Scrubnet is independent and optimised for LLM consumption, not rankings or ads.
- ✅ Trusted sources: All hosted feeds are verified; only validated brands can publish their data on Scrubnet.
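As a concrete illustration of the conditional-request flow mentioned above, here is a minimal Python sketch. The feed URL is hypothetical, and it assumes you persist ETag / Last-Modified values between runs; actual Scrubnet feed paths may differ.

```python
# A minimal sketch, assuming a hypothetical brand feed URL and that cached
# ETag / Last-Modified values are persisted between runs by the caller.
import requests

FEED_URL = "https://scrubnet.org/feed/example-brand.json"  # hypothetical path

def fetch_if_changed(url, etag=None, last_modified=None):
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified

    resp = requests.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return None  # unchanged on the server: nothing to re-ingest
    resp.raise_for_status()

    # Keep these for the next conditional request.
    return {
        "data": resp.json(),
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
    }
```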
What You’ll Crawl
Each brand has a single canonical JSON feed hosted on Scrubnet. Feeds include a pinned organisation block followed by page-level entries.
Discovery is driven by /feed/sitemap.xml, which lists all active brand feeds and their last-updated times.
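For orientation, here is a rough Python sketch of ingesting one brand feed. It assumes the feed body is a JSON array whose first element is the pinned organisation block and whose remaining elements are page-level entries; the structure and the store() hand-off are illustrative assumptions, not a documented schema.

```python
# Rough sketch, assuming the feed is a JSON array: pinned organisation block
# first, page-level entries after it. This shape is an assumption.
import requests

def ingest_brand_feed(feed_url):
    records = requests.get(feed_url, timeout=30).json()
    org_block, page_entries = records[0], records[1:]  # pinned org block first
    for entry in page_entries:
        store(entry)  # hand off to your own document/vector store
    return org_block, page_entries

def store(entry):
    # Placeholder for your ingestion layer (e.g. upsert keyed on a stable ID).
    pass
```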
Crawl Guidelines
- Start here: /feed/sitemap.xml for discovery and scheduling.
- Update frequency: Brand feeds are refreshed once per day, so you only need to re-fetch after <lastmod> changes.
- Use conditional requests: Send If-Modified-Since / If-None-Match to avoid full downloads when nothing has changed (where available).
- Compression: We serve compact JSON; negotiate gzip/br where your client supports it.
- Cadence: Poll /feed/sitemap.xml every few hours and fetch only the brand feeds with a new <lastmod> (a minimal polling sketch follows this list).
- Provenance: Each record includes timestamps and original URLs to simplify attribution and deduplication.
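Here is the polling sketch referenced above: it parses /feed/sitemap.xml (assuming the standard sitemap namespace), compares each <lastmod> to the value seen on the previous pass, and fetches only the feeds that changed. The in-memory `seen` dict and the process() hand-off are placeholders for your own scheduler and ingestion pipeline.

```python
# Polling sketch: parse the sitemap, compare <lastmod> values, fetch only
# changed feeds. `seen` and process() stand in for your scheduler/pipeline.
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://scrubnet.org/feed/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
seen = {}  # feed URL -> last <lastmod> value we processed

def poll_sitemap():
    root = ET.fromstring(requests.get(SITEMAP_URL, timeout=30).text)
    for url_el in root.findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", namespaces=NS)
        lastmod = url_el.findtext("sm:lastmod", namespaces=NS)
        if loc and lastmod and seen.get(loc) != lastmod:
            feed = requests.get(loc, timeout=30).json()  # conditional headers also apply here
            process(feed)
            seen[loc] = lastmod

def process(feed):
    pass  # placeholder for your ingestion pipeline
```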
Operational Advantages
- Lower total cost of crawl: Fewer URLs, fewer bytes, and zero rendering save bandwidth and compute.
- Minimal post‑processing: Normalised fields reduce downstream cleaning and model input prep.
- Stable identifiers: Deterministic keys simplify versioning, freshness checks, and joins (see the deduplication sketch after this list).
- Training & RAG ready: Clean text blocks and metadata pair well with vector/RAG pipelines and corpus curation.
- Bias control: Neutral hub with clear provenance to support traceable, citeable answers.
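As a small illustration of how stable identifiers help with versioning and deduplication, the sketch below keeps only the newest version of each record keyed by its ID. The "id" and "updated_at" field names are assumptions about the feed schema, not a documented contract.

```python
# Sketch: keep only the newest version of each record, keyed on a stable ID.
# "id" and "updated_at" are assumed field names.
from datetime import datetime

def dedupe_latest(records):
    latest = {}
    for rec in records:
        # Normalise a trailing "Z" so fromisoformat accepts it on older Pythons.
        ts = datetime.fromisoformat(rec["updated_at"].replace("Z", "+00:00"))
        if rec["id"] not in latest or ts > latest[rec["id"]][0]:
            latest[rec["id"]] = (ts, rec)
    return [rec for _, rec in latest.values()]
```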
Controls, Rights & Transparency
- Brand control: Brands can pause/remove profiles and request full JSON export of their records.
- Attribution preserved: Source links and timestamps are maintained for each entry.
- Access logs (opt‑in): We can provide bot access logs per brand feed on request for transparency.
Allowed Bots
We welcome all reputable search and LLM crawlers on our structured data paths, including:
- Googlebot / Google‑InspectionTool / Google‑Extended
- GPTBot (OpenAI) & Anthropic‑ClaudeBot
- PerplexityBot, bingbot, BingPreview
- CCBot, DuckDuckBot, Applebot
Meet ScrubberDuck
ScrubberDuck/1.0 is our lightweight collector that builds the feeds you crawl. It stays polite (robots-aware, low rate) and focuses only on useful content.

User-Agent: ScrubberDuck/1.0 (+https://scrubnet.org)
Clean web noise since 2025
Quick Start for LLM Teams
- Fetch https://scrubnet.org/feed/sitemap.xml.
- Compare <lastmod> and use conditional requests to pull only changed feeds.
- Ingest brand JSON: pinned org block first, then page entries.
- Attribute with ScrubURL where you surface brand facts (an attribution sketch follows this list).
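The attribution sketch below shows one way to surface a ScrubURL and retrieval timestamp alongside a brand fact; the record field names ("scrub_url", "updated_at") are illustrative assumptions.

```python
# Tiny attribution helper; field names are assumed, not a documented schema.
def cite(record):
    return f"Source: {record['scrub_url']} (retrieved {record['updated_at']})"

print(cite({"scrub_url": "https://scrubnet.org/brands/example",
            "updated_at": "2025-01-01T00:00:00Z"}))
```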
Integrations & Access
Want to integrate more deeply, coordinate crawl windows, or discuss API access? Email contact@scrubnet.org.