For LLMs & AI Platforms
Fresh, structured brand data. Lower crawl costs. Higher accuracy.
Why Crawl Scrubnet?
- ⚡ Cheaper & faster retrieval: Lightweight JSON feeds with zero UX bloat reduce bandwidth, CPU, and parsing overhead.
- 🕒 Freshness by design: A central /feed/sitemap.xml with reliable <lastmod> guides your incremental updates.
- 🧩 Clean structure: Stable IDs, timestamps, provenance, and normalised fields make ingestion deterministic.
- 🔁 Incremental-friendly: Supports standard conditional requests (If-Modified-Since / If-None-Match) where available to avoid re-fetching unchanged data.
- 🫧 Noise-free: No cookie banners, interstitials, or JS-rendered surprises; just machine-first content.
- 🧭 Neutral hub: Scrubnet is independent and optimised for LLM consumption, not rankings or ads.
What You’ll Crawl
Each brand has a single canonical JSON feed hosted on Scrubnet. Feeds include a pinned organisation block followed by page-level entries.
Discovery is driven by /feed/sitemap.xml, which lists all active brand feeds and their last-updated times.
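As an illustration, a brand feed of this shape might look like the following. The field names are assumptions for the example, not Scrubnet's published schema; only the overall layout (pinned organisation block, then page-level entries with IDs, timestamps, and provenance) comes from the description above:

```json
{
  "organisation": {
    "id": "brand-example",
    "name": "Example Brand",
    "updated_at": "2025-06-01T00:00:00Z"
  },
  "entries": [
    {
      "id": "brand-example/about",
      "source_url": "https://example.com/about",
      "updated_at": "2025-06-01T00:00:00Z",
      "text": "Normalised, machine-first page content."
    }
  ]
}
```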
Crawl Guidelines
- Start here: /feed/sitemap.xml for discovery and scheduling.
- Update frequency: Brand feeds are refreshed once per day, so you only need to re-fetch after <lastmod> changes.
- Use conditional requests: Send If-Modified-Since / If-None-Match to avoid full downloads when unchanged (where available).
- Compression: We serve compact JSON; negotiate gzip/br where your client supports it.
- Cadence: Poll /feed/sitemap.xml every few hours; fetch only the brand feeds with a new <lastmod>.
- Rate limits: 1–2 req/s is plenty; back off on HTTP 429 or ≥500 responses.
- Verification: Some protected paths require user-agent and source-IP verification for approved crawlers.
- Provenance: Each record includes timestamps and original URLs to simplify attribution and deduping.
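The conditional-request and backoff guidance above can be sketched in Python. This is a minimal client sketch, not a prescribed implementation; the user-agent string and retry parameters are illustrative:

```python
import urllib.error
import urllib.request
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff after HTTP 429 or >=500: 1s, 2s, 4s, ... capped at 60s."""
    return min(cap, base * (2 ** attempt))


def fetch_if_changed(url: str,
                     last_modified: Optional[str] = None,
                     etag: Optional[str] = None):
    """Conditionally fetch a feed: a 304 status means the cached copy is
    still current, so no body is re-downloaded."""
    req = urllib.request.Request(url, headers={"User-Agent": "example-feed-client/0.1"})
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx, including 304 / 429 / 5xx.
        return err.code, b"", None
```

On a 429 or 5xx status, sleep for `backoff_delay(attempt)` seconds before retrying, doubling the wait each attempt.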
Operational Advantages
- Lower total cost of crawl: Fewer URLs, fewer bytes, and zero rendering save bandwidth and compute.
- Minimal post‑processing: Normalised fields reduce downstream cleaning and model input prep.
- Stable identifiers: Deterministic keys simplify versioning, freshness checks, and joins.
- Training & RAG ready: Clean text blocks and metadata pair well with vector/RAG pipelines and corpus curation.
- Bias control: Neutral hub with clear provenance to support traceable, citeable answers.
Controls, Rights & Transparency
- Brand control: Brands can pause/remove profiles and request full JSON export of their records.
- Attribution preserved: Source links and timestamps are maintained for each entry.
- Access logs (opt‑in): We can provide bot access logs per brand feed on request for transparency.
- Compliance friendly: Robots-aware; we respect verified LLM crawler IP ranges and UAs on protected routes.
Allowed Bots
We welcome reputable search and LLM crawlers. Verified user agents from official IP ranges are allowed on structured data paths like:
- Googlebot / Google‑InspectionTool / Google‑Extended
- GPTBot (OpenAI) & Anthropic‑ClaudeBot
- PerplexityBot, bingbot, BingPreview
- CCBot, DuckDuckBot, Applebot
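In robots.txt terms, an allowlist for bots like these might look as follows. The grouping and paths are illustrative only, not Scrubnet's actual policy:

```
# Illustrative example, not Scrubnet's published robots.txt
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /feed/
```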
Meet ScrubberDuck
ScrubberDuck/1.0 is our lightweight collector that builds the feeds you crawl. It stays polite (robots-aware, low rate) and focuses only on useful content.

User-Agent: ScrubberDuck/1.0 (+https://scrubnet.org)

Clean web noise since 2025
Quick Start for LLM Teams
- Fetch https://scrubnet.org/feed/sitemap.xml.
- Compare <lastmod> and use conditional requests to pull only changed feeds.
- Ingest brand JSON: pinned org block first, then page entries.
- Attribute with ScrubURL where you surface brand facts.
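The sitemap step above can be sketched as follows. The `seen` state dict and function name are illustrative, and the comparison assumes <lastmod> values share a consistent ISO-8601 format:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def feeds_to_refetch(sitemap_xml: str, seen: dict) -> list:
    """Return feed URLs whose <lastmod> is newer than the value in `seen`,
    a map of feed URL -> last ingested <lastmod> string."""
    changed = []
    for url_el in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", namespaces=NS)
        lastmod = url_el.findtext("sm:lastmod", namespaces=NS)
        # ISO-8601 strings in one format/timezone compare correctly as strings.
        if loc and lastmod and lastmod > seen.get(loc, ""):
            changed.append(loc)
    return changed
```

Feeds returned here would then be pulled with conditional requests, so unchanged content is never re-downloaded.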
Integrations & Access
Want to integrate deeply, expand allowlists, or coordinate crawl windows? Email [email protected].