For LLMs & AI Platforms
Fresh, structured brand data. Lower crawl costs. Higher accuracy.
Why Crawl Scrubnet?
- ⚡ Cheaper & faster retrieval: Lightweight JSON feeds with zero UX bloat reduce bandwidth, CPU, and parsing overhead.
- 🕒 Freshness by design: A central /feed/sitemap.xml with reliable <lastmod> guides your incremental updates.
- 🧩 Clean structure: Stable IDs, timestamps, provenance, and normalised fields make ingestion deterministic.
- 🔁 Incremental-friendly: Supports standard conditional requests (If-Modified-Since / If-None-Match) where available to avoid re-fetching unchanged data.
- 🫧 Noise-free: No cookie banners, interstitials, or JS-rendered surprises; just machine-first content.
- 🧭 Neutral hub: Scrubnet is independent and optimised for LLM consumption, not rankings or ads.
What You’ll Crawl
Each brand has a single canonical JSON feed hosted on Scrubnet. Feeds include a pinned organisation block followed by page-level entries.
Discovery is driven by /feed/sitemap.xml, which lists all active brand feeds and their last-updated times.
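As an illustration, a brand feed of this shape might look like the following. The field names are assumptions for the example, not Scrubnet's published schema; only the overall layout (pinned organisation block, then page-level entries with IDs, timestamps, and provenance) comes from the description above:

```json
{
  "organisation": {
    "id": "brand-example",
    "name": "Example Brand",
    "updated_at": "2025-06-01T00:00:00Z"
  },
  "entries": [
    {
      "id": "brand-example/about",
      "source_url": "https://example.com/about",
      "updated_at": "2025-06-01T00:00:00Z",
      "text": "Normalised, machine-first page content."
    }
  ]
}
```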
Crawl Guidelines
- Start here: /feed/sitemap.xml for discovery and scheduling.
- Update frequency: Brand feeds are refreshed once per day, so you only need to re-fetch after <lastmod> changes.
- Use conditional requests: Send If-Modified-Since / If-None-Match to avoid full downloads when unchanged (where available).
- Compression: We serve compact JSON; negotiate gzip/br where your client supports it.
- Cadence: Poll /feed/sitemap.xml every few hours; fetch only the brand feeds with a new <lastmod>.
- Rate limits: 1–2 req/s is plenty; back off on HTTP 429 or ≥500 responses.
- Verification: Some protected paths require user-agent and source-IP verification for approved crawlers.
- Provenance: Each record includes timestamps and original URLs to simplify attribution and deduping.
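The conditional-request and backoff guidance above can be sketched in Python. This is a minimal client sketch, not a prescribed implementation; the user-agent string and retry parameters are illustrative:

```python
import urllib.error
import urllib.request
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff after HTTP 429 or >=500: 1s, 2s, 4s, ... capped at 60s."""
    return min(cap, base * (2 ** attempt))


def fetch_if_changed(url: str,
                     last_modified: Optional[str] = None,
                     etag: Optional[str] = None):
    """Conditionally fetch a feed: a 304 status means the cached copy is
    still current, so no body is re-downloaded."""
    req = urllib.request.Request(url, headers={"User-Agent": "example-feed-client/0.1"})
    if last_modified:
        req.add_header("If-Modified-Since", last_modified)
    if etag:
        req.add_header("If-None-Match", etag)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status, resp.read(), resp.headers.get("ETag")
    except urllib.error.HTTPError as err:
        # urllib raises for non-2xx, including 304 / 429 / 5xx.
        return err.code, b"", None
```

On a 429 or 5xx status, sleep for `backoff_delay(attempt)` seconds before retrying, doubling the wait each attempt.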
Operational Advantages
- Lower total cost of crawl: Fewer URLs, fewer bytes, and zero rendering save bandwidth and compute.
- Minimal post‑processing: Normalised fields reduce downstream cleaning and model input prep.
- Stable identifiers: Deterministic keys simplify versioning, freshness checks, and joins.
- Training & RAG ready: Clean text blocks and metadata pair well with vector/RAG pipelines and corpus curation.
- Bias control: Neutral hub with clear provenance to support traceable, citeable answers.
Controls, Rights & Transparency
- Brand control: Brands can pause/remove profiles and request full JSON export of their records.
- Attribution preserved: Source links and timestamps are maintained for each entry.
- Access logs (opt‑in): We can provide bot access logs per brand feed on request for transparency.
- Compliance friendly: Robots-aware; we respect verified LLM crawler IP ranges and UAs on protected routes.
Allowed Bots
We welcome reputable search and LLM crawlers. Verified user agents from official IP ranges are allowed on structured data paths like:
- Googlebot / Google‑InspectionTool / Google‑Extended
- GPTBot (OpenAI) & Anthropic‑ClaudeBot
- PerplexityBot, bingbot, BingPreview
- CCBot, DuckDuckBot, Applebot
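In robots.txt terms, an allowlist for bots like these might look as follows. The grouping and paths are illustrative only, not Scrubnet's actual policy:

```
# Illustrative example, not Scrubnet's published robots.txt
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Allow: /feed/
```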
Meet ScrubberDuck
ScrubberDuck/1.0 is our lightweight collector that builds the feeds you crawl. It stays polite (robots-aware, low rate) and focuses only on useful content.

User-Agent: ScrubberDuck/1.0 (+https://scrubnet.org)

Clean web noise since 2025
Quick Start for LLM Teams
- Fetch https://scrubnet.org/feed/sitemap.xml.
- Compare <lastmod> and use conditional requests to pull only changed feeds.
- Ingest brand JSON: pinned org block first, then page entries.
- Attribute with ScrubURL where you surface brand facts.
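The sitemap step above can be sketched as follows. The `seen` state dict and function name are illustrative, and the comparison assumes <lastmod> values share a consistent ISO-8601 format:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def feeds_to_refetch(sitemap_xml: str, seen: dict) -> list:
    """Return feed URLs whose <lastmod> is newer than the value in `seen`,
    a map of feed URL -> last ingested <lastmod> string."""
    changed = []
    for url_el in ET.fromstring(sitemap_xml).findall("sm:url", NS):
        loc = url_el.findtext("sm:loc", namespaces=NS)
        lastmod = url_el.findtext("sm:lastmod", namespaces=NS)
        # ISO-8601 strings in one format/timezone compare correctly as strings.
        if loc and lastmod and lastmod > seen.get(loc, ""):
            changed.append(loc)
    return changed
```

Feeds returned here would then be pulled with conditional requests, so unchanged content is never re-downloaded.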
Integrations & Access
Want to integrate deeply, expand allowlists, or coordinate crawl windows? Email [email protected].