Ultimate Technical AI Search Optimisation Guide
A practical technical framework for AI crawlers, retrieval systems, answer engines, and agentic browsing.
Last updated:
AI search optimisation extends technical SEO for systems that crawl, retrieve, chunk, embed, rank, reason, summarise and cite. This guide collects the technical foundations that make content easier to access, extract, understand, refresh, verify and use as evidence.
Contents
- Executive summary
- What AI search optimisation is
- The AI search pipeline
- Documented facts versus sensible inference
- Crawler categories and access policy
- Recommended robots.txt strategy
- Rendering and extraction
- Canonicalisation and duplicate control
- Chunkability and passage-level retrieval
- Content that supports AI answers
- Structured data and entity consistency
- Freshness and change signalling
- Machine-readable discovery paths
- Performance and fetch reliability
- Accessibility and agentic usability
- Measurement framework
- Governance and legal risk
- Technical audit checklist
- Prioritised implementation plan
- Useful reporting dashboard structure
- Myths and corrections
- Reference implementation snippets
- Final strategic model
1. Executive summary
AI search optimisation is not a replacement for technical SEO. It is technical SEO extended for systems that crawl, retrieve, chunk, embed, rank, reason, summarise and cite.
The strongest current evidence from Google, OpenAI, Anthropic, Perplexity, Apple, Bing, Common Crawl, enterprise retrieval documentation and RAG research points to one practical conclusion:
The sites most likely to perform well in AI search are the sites whose content can be reliably accessed, extracted, canonicalised, chunked, understood, refreshed, verified and cited.
This means the priorities are not magic AI tags or prompt-targeted content farms. The priorities are:
- Let the right crawlers access the right content.
- Keep essential content available in the initial HTML.
- Use canonical, indexable URLs with clean sitemap coverage.
- Structure pages so they split cleanly into useful evidence passages.
- Use visible facts, dates, prices, entities and relationships.
- Reinforce visible content with accurate JSON-LD.
- Keep freshness signals honest and precise.
- Separate training bot policy from AI search bot policy.
- Measure server-side bot activity, referral traffic and prompt visibility.
- Treat robots.txt as policy signalling, not security.
Google currently says there are no extra technical requirements to appear in AI Overviews or AI Mode beyond standard Google Search eligibility. But the retrieval layer changes how pages are used. Traditional search asks, "Should this page rank?" AI search also asks, "Can this passage support a reliable answer?"
That is the key shift.
2. What AI search optimisation is
AI search optimisation is the process of improving a website so AI retrieval systems can:
- discover the right URLs
- fetch them without friction
- extract the main content
- understand the entities and relationships
- split content into useful passages
- retrieve those passages for relevant queries
- trust the facts enough to use them
- cite the source or link back when appropriate
- keep answers current when content changes
It includes classic SEO, but it places more emphasis on machine extraction, factual clarity, semantic structure, freshness and crawler governance.
What it is not
AI search optimisation is not:
- keyword stuffing
- creating thousands of thin prompt pages
- adding unsupported "best" or "number one" claims
- relying on llms.txt alone
- assuming every AI crawler renders JavaScript
- assuming analytics tools capture all AI activity
- blocking all AI bots while expecting AI search visibility
- allowing all AI bots without thinking about training, licensing and load
3. The AI search pipeline
AI search systems differ by provider, but most follow a similar pipeline.
flowchart TD
A[User prompt or query] --> B[Query understanding]
B --> C[Query fan-out and sub-queries]
C --> D[Discovery through search indexes, sitemaps, links, feeds and direct fetches]
D --> E[HTTP fetch]
E --> F[Canonicalisation and duplicate clustering]
F --> G[Content extraction]
G --> H[Chunking into passages]
H --> I[Lexical, semantic and metadata representation]
I --> J[Hybrid retrieval]
J --> K[Re-ranking, freshness checks, policy checks and trust assessment]
K --> L[Answer synthesis]
L --> M[Citations, links, summaries or actions]
The important point is that AI systems do not only deal with complete pages. They often deal with passages, entities, attributes, dates, snippets, media, schema fields and metadata.
A strong AI-search page is therefore not just a page that can rank. It is a page that can be used as evidence.
4. Documented facts versus sensible inference
A lot of public discussion around AI search mixes official documentation, experiments, vendor claims and speculation. Keep the distinction clear.
Strongly documented
These are well supported by public provider documentation:
- Google AI Overviews and AI Mode use normal Google Search eligibility. To appear as a supporting link, a page must be indexed and eligible to be shown with a snippet.
- Google says there are no extra special AI files or special schema required for AI features in Search.
- Google AI Overviews and AI Mode may use query fan-out, issuing related searches across subtopics and data sources.
- OpenAI separates OAI-SearchBot for ChatGPT search, GPTBot for training and ChatGPT-User for user-triggered actions.
- Anthropic separates ClaudeBot for training, Claude-SearchBot for search quality and Claude-User for user-initiated retrieval.
- Perplexity separates PerplexityBot for search results from Perplexity-User for user-requested fetches.
- Applebot powers Apple search experiences, while Applebot-Extended is a control for whether Applebot-crawled content can be used for Apple foundation-model training.
- Bing has emphasised XML sitemaps, accurate lastmod and IndexNow for AI-powered search freshness.
- robots.txt is not access authorisation. It is a protocol that crawlers are requested to honour.
Reasonable inference
These are not fully disclosed for public consumer AI search, but are strongly suggested by RAG literature and enterprise retrieval systems:
- AI search systems often operate on passages rather than whole pages.
- Hybrid retrieval is likely important, combining keyword search, semantic search and metadata filters.
- Clear headings and short, scoped sections make chunking and retrieval easier.
- Dates, named entities, product identifiers and structured attributes help exact retrieval and filtering.
- Duplicate URLs can cause the wrong version of a page or product to be selected as source material.
- Freshness signals are more important when an answer is generated from retrieved evidence.
Unknown or provider-specific
These are not reliably public:
- exact chunk sizes for consumer AI search indexes
- exact embedding models used by each public AI search system
- exact citation-ranking weights
- exact source selection logic for AI Overviews, ChatGPT, Claude or Perplexity
- whether a specific platform renders JavaScript in every context
- whether llms.txt materially affects visibility on major public AI search engines
Treat hidden internals as unknown. Optimise around robust principles instead.
5. Crawler categories and access policy
Do not treat all AI bots the same. They have different purposes.
5.1 Automatic search and retrieval bots
These crawlers are most closely connected to AI search visibility, citations and live answers.
Examples:
- OAI-SearchBot
- Claude-SearchBot
- PerplexityBot
- Googlebot for Google Search AI features
- Applebot for Apple search experiences
These should usually be allowed if the goal is visibility in AI search and answer engines.
5.2 Training bots and training controls
These are used, or can be used, to collect public content for model training or model improvement.
Examples:
- GPTBot
- ClaudeBot
- CCBot
- Google-Extended as a control token for some Google Gemini training and grounding uses
- Applebot-Extended as a control for Apple foundation-model training use
Allowing training bots is a commercial, legal and licensing decision. It is not the same decision as allowing AI search visibility.
5.3 User-triggered fetchers
These fetch pages because a user or agent requested them.
Examples:
- ChatGPT-User
- Claude-User
- Perplexity-User
- Google user-triggered fetchers
- agentic browser fetchers
These may behave differently from normal crawlers and may not obey robots.txt in the same way, because the fetch is user-directed. Manage them with a combination of server-side monitoring, rate limiting, authentication, WAF rules, API controls and robots policy where supported.
5.4 Agentic action bots
These are not just reading pages. They may interact with forms, carts, booking flows, logins, filters, APIs and on-site tools.
For these, classic SEO checks are not enough. You also need:
- accessible form labels
- robust ARIA where appropriate
- clear button states
- stable URLs
- predictable error messages
- safe rate limits
- CSRF protection
- permission checks
- separation of public content from protected workflows
6. Recommended robots.txt strategy
Use robots.txt to express policy by purpose.
This is an illustrative policy for a site that wants AI search visibility but wants to limit model training.
# robots.txt
User-agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml
# Google Search and Google AI features in Search
User-agent: Googlebot
Allow: /
# Google-Extended controls some Gemini training and grounding uses.
# It is not the control for Google Search AI Overviews or AI Mode.
User-agent: Google-Extended
Disallow: /
# ChatGPT search visibility
User-agent: OAI-SearchBot
Allow: /
# OpenAI training
User-agent: GPTBot
Disallow: /
# Anthropic search visibility
User-agent: Claude-SearchBot
Allow: /
# Anthropic training
User-agent: ClaudeBot
Disallow: /
# Anthropic user-directed retrieval.
# Decide based on whether you want Claude to fetch pages for user queries.
User-agent: Claude-User
Allow: /
# Perplexity search visibility
User-agent: PerplexityBot
Allow: /
# Apple search visibility
User-agent: Applebot
Allow: /
# Apple foundation-model training control
User-agent: Applebot-Extended
Disallow: /
# Common Crawl dataset collection
User-agent: CCBot
Disallow: /
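Before deploying a policy like the one above, it can help to check how each crawler token would be treated. The following is a minimal sketch, not a full Robots Exclusion Protocol implementation: it assumes exact, case-insensitive token matching, plain path prefixes with no wildcards, and a shortened excerpt of the policy. The names `parse_groups` and `is_allowed` are hypothetical helpers.

```python
ROBOTS_TXT = """\
User-agent: *
Disallow:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Applebot
Allow: /

User-agent: Applebot-Extended
Disallow: /
"""


def parse_groups(text):
    """Map each lowercased user-agent token to its (directive, path) rules."""
    groups, agents, seen_rule = {}, [], False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:  # a rule line closed the previous group
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups


def is_allowed(groups, agent, path):
    """Longest matching prefix wins; Allow wins a tie; no match means allowed."""
    rules = groups.get(agent.lower(), groups.get("*", []))
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty Disallow blocks nothing
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


groups = parse_groups(ROBOTS_TXT)
for bot in ("OAI-SearchBot", "GPTBot", "Applebot", "Applebot-Extended"):
    print(bot, "allowed" if is_allowed(groups, bot, "/guides/") else "blocked")
```

Exact token matching is used deliberately here so that overlapping names such as Applebot and Applebot-Extended resolve to their own groups. Real crawlers may interpret edge cases differently, so treat the output as a policy review aid only.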
Important caveats
- robots.txt is not a security layer.
- Unknown, malicious or spoofed bots may ignore it.
- User-agent matching is weak on its own.
- Some user-triggered fetchers generally ignore robots.txt, because the fetch is requested by a user rather than scheduled by a crawler.
- If the content is sensitive, private, licensed or paywalled, use real access control.
- Some providers publish IP ranges or verification methods. Use them where possible.
7. Rendering and extraction
The first technical requirement is simple:
Put the important content and signals in the initial HTML wherever possible.
AI crawlers and retrieval systems do not all render pages consistently. Some may render. Some may not. Some may fetch only raw HTML. Some may time out before hydration. Some may rely on upstream indexes that have their own rendering limitations.
7.1 Safe rendering hierarchy
| Architecture | AI extraction risk | Notes |
|---|---|---|
| Static Site Generation | Low | Excellent for stable content, documentation, articles and landing pages. |
| Server-Side Rendering | Low | Good for dynamic websites where content must be current at request time. |
| Incremental Static Regeneration | Low to medium | Good balance of speed and freshness if regeneration is reliable. |
| Progressive enhancement | Low to medium | Safe if core content exists before JavaScript enhancement. |
| Client-Side Rendering only | High | Risky if the initial HTML is a blank app shell. |
| Content behind API calls only | High | Risky unless the API response is also discoverable and indexable. |
| Content hidden in interactive components only | Medium to high | Tabs, carousels, filters and accordions can obscure key facts. |
7.2 What must be present in HTML
For priority pages, aim to expose:
- H1 and section headings
- main body copy
- product names and descriptions
- prices, availability and currency
- specifications and identifiers
- internal links
- canonical tags
- robots meta tags
- hreflang where relevant
- structured data
- author and reviewer signals where relevant
- publication and modification dates
- important comparison data
- FAQs and policy content
- breadcrumbs
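One way to spot-check a few of these signals is to parse the raw HTML response, before any JavaScript runs, and see what is actually there. The sketch below uses only the Python standard library and works on an HTML string; fetching the page is left out. The class and function names are hypothetical, and the checks cover only the H1, canonical tag, JSON-LD presence and visible text length.

```python
from html.parser import HTMLParser

NON_VISIBLE_TAGS = ("script", "style", "title")


class InitialHtmlAudit(HTMLParser):
    """Collects a few signals from raw HTML without executing JavaScript."""

    def __init__(self):
        super().__init__()
        self.h1_count = 0
        self.canonical = None
        self.has_json_ld = False
        self.text_chars = 0
        self._skip_depth = 0  # greater than zero while inside non-visible tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self.h1_count += 1
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")
        elif tag == "script" and attrs.get("type") == "application/ld+json":
            self.has_json_ld = True
        if tag in NON_VISIBLE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in NON_VISIBLE_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text_chars += len(data.strip())


def audit_initial_html(html):
    parser = InitialHtmlAudit()
    parser.feed(html)
    parser.close()
    return {
        "h1_count": parser.h1_count,
        "canonical": parser.canonical,
        "has_json_ld": parser.has_json_ld,
        "text_chars": parser.text_chars,
    }


app_shell = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script src="/app.js"></script></body></html>'
)
print(audit_initial_html(app_shell))
```

A result with no H1, no canonical and almost no text suggests a blank app shell, which is the high-risk case in the rendering table above.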
7.3 Semantic HTML pattern
<main>
<article>
<header>
<h1>Technical AI Search Optimisation Guide</h1>
<p class="summary">
AI search visibility depends on crawl access, extractable HTML,
canonical clarity, structured data, freshness and measurable bot activity.
</p>
<p>Last updated: <time datetime="2026-06-07">7 June 2026</time></p>
</header>
<section id="crawler-access">
<h2>Control AI bot access by purpose</h2>
<p>
Separate search crawlers, training crawlers and user-triggered fetchers.
</p>
</section>
</article>
</main>
7.4 Common extraction failures
- blank app shell in source HTML
- missing canonical in raw HTML
- metadata only injected after hydration
- content loaded only after scroll
- key facts only shown inside images
- product specs only available behind tabs
- infinite scroll with no crawlable pagination
- filters creating crawl traps
- same content duplicated across many parameter URLs
- blocking JS or CSS needed for graceful rendering
- WAF returning 403 to AI retrieval bots
- server returning different content by user-agent without deliberate, documented rules
8. Canonicalisation and duplicate control
AI search systems need to know which URL is the source of truth. Duplicate or near-duplicate pages make retrieval harder and can cause the wrong version to be selected.
8.1 Canonical signals to align
Use the same preferred URL across:
- internal links
- XML sitemaps
- rel canonical tags
- redirects
- hreflang annotations
- structured data mainEntityOfPage or url
- Open Graph and social metadata
- product feeds
- llms.txt or context files
- canonical Markdown alternatives
- external profiles where possible
8.2 Canonical HTML example
<head>
<title>How AI crawlers see JavaScript websites</title>
<link rel="canonical" href="https://www.example.com/guides/ai-crawlers-javascript" />
<meta name="robots" content="index,follow,max-snippet:-1,max-image-preview:large" />
</head>
8.3 Duplicate patterns to clean up
- tracking parameters
- sort and filter URLs
- internal search pages
- duplicated product variants
- print pages
- trailing slash inconsistency
- HTTP and HTTPS duplication
- www and non-www duplication
- uppercase and lowercase duplication
- duplicated locale paths
- paginated pages canonicalising incorrectly
- canonical tags that conflict with redirects or sitemaps
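Several of these patterns can be surfaced from a crawl export by normalising each URL and grouping the results. The sketch below is illustrative only: the preferred host, the alias list and the tracking parameter names are assumptions to adapt per site, and it deliberately does not lowercase paths because some servers treat paths as case-sensitive.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed values for illustration; adjust to the site being audited.
PREFERRED_HOST = "www.example.com"
HOST_ALIASES = {"example.com", "www.example.com"}
TRACKING_PREFIXES = ("utm_",)
TRACKING_PARAMS = {"gclid", "fbclid"}


def normalise_url(url):
    """Force HTTPS and the preferred host, drop tracking parameters,
    sort the remaining parameters and remove the trailing slash."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in HOST_ALIASES:
        host = PREFERRED_HOST
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_PARAMS
    ]
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(("https", host, path, urlencode(sorted(query)), ""))


def duplicate_clusters(urls):
    """Group crawled URLs that collapse to the same normalised form."""
    clusters = defaultdict(list)
    for url in urls:
        clusters[normalise_url(url)].append(url)
    return {key: group for key, group in clusters.items() if len(group) > 1}


crawled = [
    "http://example.com/guides/ai/?utm_source=news",
    "https://www.example.com/guides/ai",
    "https://www.example.com/guides/ai?sort=price",
]
print(duplicate_clusters(crawled))
```

Sort and filter parameters are kept on purpose here, because whether they should collapse into the base URL is a per-site canonical decision rather than a mechanical one.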
8.4 Why this matters more in AI search
In AI search, duplicate control is not only about ranking consolidation. It is also about source selection. If the system clusters duplicates and selects an outdated URL, that outdated page may become the evidence used in an answer.
9. Chunkability and passage-level retrieval
Modern retrieval systems often work with chunks or passages. A page may rank as a whole, but the AI system may retrieve and quote only one section of it.
Optimise pages so each section can stand alone as useful evidence.
9.1 Good section design
Each important section should have:
- a descriptive heading
- one clear question or topic
- a short direct answer near the top
- supporting detail below
- dates, units and named entities where relevant
- minimal boilerplate
- no dependency on previous sections for basic meaning
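A rough way to review pages against this list is to split the content on headings and inspect each section on its own. The sketch below assumes the page is available as Markdown and uses a heading-based split, which is only one plausible chunking approach; as noted earlier, the chunking used by public AI search systems is not disclosed. The vague-heading list and the 300-word threshold are arbitrary assumptions.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.+)$")
VAGUE_HEADINGS = {"overview", "details", "more information"}


def split_into_sections(markdown):
    """Split Markdown into sections, one per heading."""
    sections = []
    current = {"heading": "", "lines": []}
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            sections.append(current)
            current = {"heading": match.group(1).strip(), "lines": []}
        else:
            current["lines"].append(line)
    sections.append(current)
    return [
        {"heading": s["heading"], "text": "\n".join(s["lines"]).strip()}
        for s in sections
        if s["heading"] or any(line.strip() for line in s["lines"])
    ]


def review_section(section, max_words=300):
    """Return a list of issues for one section (empty list means none found)."""
    issues = []
    if section["heading"].lower() in VAGUE_HEADINGS:
        issues.append("vague heading")
    if not section["text"]:
        issues.append("empty section")
    if len(section["text"].split()) > max_words:
        issues.append("long section")
    return issues


page = """# Torque wrench calibration
Calibrate torque wrenches at a regular interval set by the manufacturer.

## Overview

## How is a torque wrench calibrated?
The wrench is tested against a reference transducer at several set points.
"""
for section in split_into_sections(page):
    print(section["heading"], review_section(section))
```

The checks are intentionally shallow. They flag sections worth a manual look rather than judging whether a passage can stand alone as evidence.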
9.2 Bad section design
Avoid:
- vague headings such as "Overview", "Details" or "More information"
- long introductions before the answer
- multiple unrelated topics in one section
- claims without evidence
- facts only implied through design
- tables without textual context
- pages where the first 500 words say almost nothing
9.3 Page structure template
# Main topic
Short answer summary.
## What is it?
Direct definition.
## Who is it for?
Specific audience and use cases.
## How it works
Step-by-step explanation.
## Technical requirements
Concrete implementation details.
## Pricing or cost
Clear numbers, currency, exclusions and update date.
## Limitations
Caveats, edge cases and situations where it is not suitable.
## Comparison with alternatives
Comparison table with honest differences.
## Evidence and sources
Awards, reviews, references, case studies, documentation links.
## Last updated
Visible date and explanation of meaningful changes.
10. Content that supports AI answers
AI answer systems need usable facts, not just marketing prose.
10.1 Use answer-led writing
Start each page and major section with the answer, then add detail.
Poor:
In today's fast-moving digital landscape, brands are increasingly looking for innovative ways to improve discoverability...
Better:
AI search optimisation improves how AI systems access, extract, understand and cite website content. The core work is crawl access, server-rendered content, canonical clarity, structured data, freshness and measurement.
10.2 Write for decision-making
AI systems often answer comparison and recommendation queries. Support that directly.
Useful page types:
- What is X?
- How does X work?
- How much does X cost?
- Is X suitable for Y?
- How to choose X
- X vs Y
- Best X for Y
- Alternatives to X
- Common problems with X
- X implementation checklist
- X limitations
- X compliance requirements
10.3 Make facts explicit
Spell out:
- brand name
- legal entity name
- product names
- product categories
- target users
- location and service areas
- prices and currencies
- dates and validity periods
- stock and availability
- model numbers and SKUs
- measurements and units
- terms and exclusions
- policies and limitations
10.4 Back strong claims with evidence
Avoid unsupported claims such as:
- best
- leading
- number one
- most trusted
- fastest
- cheapest
- most accurate
If you use them, support them with:
- awards
- independent reviews
- certifications
- customer numbers
- case studies
- benchmark methodology
- third-party references
- public datasets
- transparent dates
10.5 GEO study takeaways
Academic GEO research found that content modifications such as adding citations, quotations and statistics can improve visibility in generative responses, with reported gains up to around 40 percent in the benchmarked settings.
Practical takeaway:
- Add useful statistics, but only when true and relevant.
- Add expert quotes, but avoid generic filler quotes.
- Cite credible sources for factual claims.
- Improve fluency and structure.
- Avoid keyword stuffing.
- Optimise by topic and domain, not by one universal formula.
11. Structured data and entity consistency
Structured data does not replace content quality, and it should not describe information that is not visible to users. Its job is to reinforce meaning.
11.1 Core principle
Use schema to make visible facts easier for machines to understand.
Good structured data is:
- accurate
- visible on the page
- consistent with feeds and profiles
- stable across the site
- specific to the page type
- connected through persistent identifiers
Bad structured data is:
- copied from templates without page-level accuracy
- inconsistent with visible content
- over-marked for rich results
- missing dates or identifiers
- disconnected from the brand entity
- used to make claims the page does not support
11.2 Useful schema types
Depending on the site, prioritise:
- Organization
- LocalBusiness and relevant subtypes
- Person
- Article
- BlogPosting
- Product
- Offer
- AggregateRating
- Review
- FAQPage
- QAPage
- BreadcrumbList
- ItemList
- WebPage
- Service
- HowTo where genuinely applicable
11.3 Entity home
Every organisation should have a clear entity home, usually the About page or homepage. This page should state:
- official organisation name
- brand name
- what the organisation does
- who it serves
- where it operates
- founding details if relevant
- contact information
- official social profiles
- parent or subsidiary relationships
- awards or certifications
- authoritative third-party profiles
- sameAs links
11.4 Connected JSON-LD pattern
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@graph": [
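{
"@type": "Organization",
"@id": "https://www.example.com/#organization",
"name": "Example Tools",
"url": "https://www.example.com/",
"logo": "https://www.example.com/logo.png",
"sameAs": [
"https://www.linkedin.com/company/example-tools"
]
},
{
"@type": "WebSite",
"@id": "https://www.example.com/#website",
"url": "https://www.example.com/",
"name": "Example Tools",
"publisher": {
"@id": "https://www.example.com/#organization"
}
},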
{
"@type": "WebPage",
"@id": "https://www.example.com/guides/torque-wrench-calibration#webpage",
"url": "https://www.example.com/guides/torque-wrench-calibration",
"name": "How to calibrate a torque wrench",
"isPartOf": {
"@id": "https://www.example.com/#website"
},
"about": {
"@id": "https://www.example.com/#organization"
},
"datePublished": "2026-05-30",
"dateModified": "2026-06-07"
},
{
"@type": "Article",
"@id": "https://www.example.com/guides/torque-wrench-calibration#article",
"headline": "How to calibrate a torque wrench",
"mainEntityOfPage": {
"@id": "https://www.example.com/guides/torque-wrench-calibration#webpage"
},
"author": {
"@type": "Person",
"name": "Jordan Smith"
},
"publisher": {
"@id": "https://www.example.com/#organization"
}
}
]
}
</script>
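A connected graph only helps if every `@id` reference points at a node that is actually defined. The following sketch checks that for a single JSON-LD document. It assumes a top-level `@graph` array and treats any nested object containing only an `@id` key as a reference; `find_dangling_ids` is a hypothetical helper, not part of any schema tooling.

```python
import json


def find_dangling_ids(json_ld_text):
    """Return @id references in @graph that no node in the graph defines."""
    nodes = json.loads(json_ld_text).get("@graph", [])
    defined = {node["@id"] for node in nodes if "@id" in node}
    referenced = set()

    def walk(value, is_top_level):
        if isinstance(value, dict):
            # A nested object holding only "@id" is a pointer to another node.
            if not is_top_level and set(value) == {"@id"}:
                referenced.add(value["@id"])
            for child in value.values():
                walk(child, False)
        elif isinstance(value, list):
            for item in value:
                walk(item, False)

    for node in nodes:
        walk(node, True)
    return sorted(referenced - defined)


sample = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization", "@id": "https://www.example.com/#organization"},
        {
            "@type": "WebPage",
            "@id": "https://www.example.com/page#webpage",
            "isPartOf": {"@id": "https://www.example.com/#website"},
            "publisher": {"@id": "https://www.example.com/#organization"},
        },
    ],
})
print(find_dangling_ids(sample))
```

An identifier that is defined on another page of the site would also be reported here, so a non-empty result is a prompt to check, not proof of an error.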
11.5 Ecommerce schema checklist
For products, expose:
- product name
- clear description
- image
- brand
- SKU
- GTIN where available
- price
- currency
- availability
- item condition
- delivery information
- return policy where supported
- valid review and rating data
- variant attributes
- canonical product URL
Category pages should include:
- useful intro copy
- clear H1 and H2s
- crawlable product links
- breadcrumbs
- ItemList where appropriate
- buying guidance
- FAQs where genuinely present
- indexation and canonical rules for filters
12. Freshness and change signalling
AI answers can become wrong when source material is stale. No single field proves that a page is current, so freshness is best treated as a stack of signals that should all agree.
12.1 Freshness signals to align
Use:
- visible last updated date
- datePublished and dateModified in structured data
- accurate XML sitemap lastmod
- HTTP Last-Modified where possible
- ETag and conditional request support
- changelog sections for critical pages
- IndexNow for Bing and participating engines
- updated product feeds and merchant data
- updated business profiles
- current author and reviewer information
12.2 Sitemap lastmod example
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.example.com/guides/ai-search-optimisation</loc>
<lastmod>2026-06-07T09:14:00+00:00</lastmod>
</url>
</urlset>
12.3 Rules for honest lastmod
Do:
- update lastmod only when the page content materially changes
- use W3C Datetime values, the ISO 8601 profile that the sitemap protocol expects
- include a time zone offset whenever a time is given
- keep visible dates aligned with schema and sitemap dates
- use changelogs when the topic is volatile
Do not:
- set lastmod to the sitemap generation time
- change all dates daily to look fresh
- hide old content behind a new date
- let schema and visible dates contradict each other
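One way to keep lastmod honest is to derive it from a fingerprint of the main content instead of from the build or publish time. The sketch below assumes the main body text has already been extracted and that the previous fingerprint and lastmod are stored somewhere per URL. What counts as a material change is a judgement call; this version only ignores whitespace differences.

```python
import hashlib
from datetime import datetime, timezone


def content_fingerprint(main_text):
    """Hash the main content with whitespace collapsed."""
    normalised = " ".join(main_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def next_lastmod(previous, main_text, now=None):
    """Return the stored record unchanged unless the content changed.

    `previous` is None or a dict with "fingerprint" and "lastmod" keys.
    """
    fingerprint = content_fingerprint(main_text)
    if previous and previous["fingerprint"] == fingerprint:
        return previous
    now = now or datetime.now(timezone.utc)
    return {
        "fingerprint": fingerprint,
        "lastmod": now.isoformat(timespec="seconds"),
    }


first = next_lastmod(
    None,
    "Calibration interval: 12 months",
    now=datetime(2026, 6, 7, 9, 14, tzinfo=timezone.utc),
)
rebuilt = next_lastmod(first, "Calibration interval:   12 months\n")
print(first["lastmod"], rebuilt["lastmod"])
```

A rebuild that only changes whitespace keeps the original timestamp, so regenerating the sitemap does not move every date forward.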
12.4 IndexNow example
curl -X POST "https://api.indexnow.org/indexnow" \
-H "Content-Type: application/json; charset=utf-8" \
-d '{
"host": "www.example.com",
"key": "YOUR_KEY",
"keyLocation": "https://www.example.com/YOUR_KEY.txt",
"urlList": [
"https://www.example.com/guides/ai-search-optimisation",
"https://www.example.com/products/example-product"
]
}'
Use IndexNow for important additions, updates and deletions, especially on ecommerce, news, jobs, listings, pricing, stock and other time-sensitive pages.
13. Machine-readable discovery paths
Traditional discovery files still matter most:
- robots.txt
- XML sitemaps
- RSS or Atom feeds where relevant
- internal linking
- canonical URLs
- structured data
- product feeds
- Merchant Center or equivalent feeds
- Business Profile data
13.1 XML sitemaps
Keep XML sitemaps clean.
Include only:
- canonical URLs
- indexable URLs
- final 200 URLs
- important content
- accurate lastmod
Remove:
- redirected URLs
- noindexed URLs
- blocked URLs
- canonicalised-away URLs
- soft 404s
- parameter noise
- internal search URLs
- duplicate locale URLs
- expired product pages unless intentionally indexable
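These include and remove rules can be applied mechanically when the sitemap is generated. The sketch below assumes each page is described by a record with `url`, `status`, `indexable`, `canonical` and an optional `lastmod`; those field names are hypothetical and would come from a CMS or crawl export in practice.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """Emit only final 200, indexable, self-canonical URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        keep = (
            page["status"] == 200
            and page["indexable"]
            and page["canonical"] == page["url"]
        )
        if not keep:
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["url"]
        if page.get("lastmod"):
            ET.SubElement(url, "lastmod").text = page["lastmod"]
    return ET.tostring(urlset, encoding="unicode")


pages = [
    {"url": "https://www.example.com/guides/a", "status": 200, "indexable": True,
     "canonical": "https://www.example.com/guides/a", "lastmod": "2026-06-07"},
    {"url": "https://www.example.com/old", "status": 301, "indexable": True,
     "canonical": "https://www.example.com/old"},
    {"url": "https://www.example.com/guides/a?sort=price", "status": 200,
     "indexable": True, "canonical": "https://www.example.com/guides/a"},
    {"url": "https://www.example.com/search?q=x", "status": 200, "indexable": False,
     "canonical": "https://www.example.com/search?q=x"},
]
print(build_sitemap(pages))
```

Only the first record survives: the redirect, the canonicalised-away parameter URL and the noindexed search page are all dropped.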
13.2 llms.txt
llms.txt is an emerging convention, not a guaranteed ranking or inclusion mechanism.
Use it as an enhancement, especially for documentation, developer tools, support hubs, complex products and technical sites.
It can include:
- a concise site or product summary
- links to canonical docs
- links to Markdown versions
- API documentation
- policy pages
- product feeds
- support resources
- examples of important resources
Example:
# Example Tools
> Example Tools provides calibration equipment and technical guides for engineering teams.
## Core pages
- [About Example Tools](https://www.example.com/about)
- [Product catalogue](https://www.example.com/products)
- [Calibration guides](https://www.example.com/guides)
## Clean Markdown resources
- [Torque wrench calibration guide](https://www.example.com/guides/torque-wrench-calibration.md)
- [Return policy](https://www.example.com/policies/returns.md)
## API and feeds
- [Product feed](https://www.example.com/feeds/products.json)
- [OpenAPI specification](https://www.example.com/openapi.json)
13.3 llms-full.txt and context files
For some sites, a larger context file can help agents or developer tools. Keep it scoped and curated.
Include:
- stable documentation
- canonical definitions
- API references
- short examples
- policy summaries
- support instructions
Exclude:
- private content
- customer data
- unpublished commercial information
- outdated content
- duplicated pages
- legal text you do not want summarised incorrectly
- massive unstructured dumps
13.4 Markdown alternatives
Markdown can reduce token overhead and layout noise for machines.
Useful patterns:
/page-name.html
/page-name.md
/docs/getting-started
/docs/getting-started.md
/api/reference
/api/reference.md
Do not use Markdown alternatives as a substitute for normal HTML SEO. They should support the main site, not replace it.
14. Performance and fetch reliability
AI retrieval systems often work with short timeouts and strict cost constraints. Fast, stable pages are easier to fetch and reuse.
14.1 Technical priorities
Prioritise:
- fast TTFB
- stable 200 responses
- minimal redirects
- CDN caching where appropriate
- lightweight HTML
- graceful degradation
- correct cache headers
- robust origin capacity
- no accidental WAF blocks
- no bot-specific server errors
- no soft 404s
- no redirect chains
- no inconsistent canonical signals
14.2 Status code rules
| Status | Use | AI search risk |
|---|---|---|
| 200 | Live canonical content | Good |
| 301 or 308 | Permanent redirect | Fine if single hop and intentional |
| 302 or 307 | Temporary redirect | Fine if genuinely temporary |
| 304 | Not modified | Good for efficient recrawling |
| 401 | Authentication required | Correct for private content |
| 403 | Forbidden | Dangerous if accidental bot block |
| 404 | Not found | Correct for missing content |
| 410 | Permanently gone | Useful for deliberate removals |
| 429 | Rate limited | Useful with Retry-After |
| 5xx | Server failure | High risk if frequent |
14.3 Caching headers
Use:
Cache-Control: public, max-age=300, stale-while-revalidate=3600
ETag: "abc123"
Last-Modified: Sun, 07 Jun 2026 09:14:00 GMT
Support conditional requests:
- If-None-Match
- If-Modified-Since
This helps reduce repeated downloads and improves crawl efficiency.
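The decision a server makes for a conditional request can be sketched as a small function. This is a simplified illustration for GET requests, not a complete HTTP implementation: it assumes headers arrive as a plain dict, gives `If-None-Match` precedence over `If-Modified-Since`, and compares entity tags weakly. The function name is hypothetical.

```python
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Strip the weak-validator prefix so tags can be compared weakly."""
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, else 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        candidates = [_opaque(tag) for tag in if_none_match.split(",")]
        if "*" in candidates or _opaque(etag) in candidates:
            return 304
        return 200  # If-Modified-Since is ignored when If-None-Match is sent
    since = request_headers.get("If-Modified-Since")
    if since:
        try:
            if parsedate_to_datetime(last_modified) <= parsedate_to_datetime(since):
                return 304
        except (TypeError, ValueError):
            pass  # unparseable date: fall through to a full response
    return 200


current_etag = '"abc123"'
current_modified = "Sun, 07 Jun 2026 09:14:00 GMT"
print(conditional_status({"If-None-Match": '"abc123"'}, current_etag, current_modified))
```

Most frameworks and CDNs already implement this logic. The sketch is mainly useful for reading logs and understanding why a recrawl returned 200 when 304 was expected.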
15. Accessibility and agentic usability
Agentic systems increasingly interact with pages more like assistive technologies than classic crawlers. Accessibility improvements can help both users and agents.
Prioritise:
- clear labels on form fields
- descriptive button text
- ARIA only where needed and correctly implemented
- keyboard navigation
- visible error messages
- stable page landmarks
- form validation that is machine-readable
- descriptive link text
- clear table headers
- meaningful alt text
- no critical instructions hidden only in images
For transactional journeys, agents need to understand:
- what the next step is
- what each field expects
- what is required or optional
- what went wrong
- what action was completed
- whether a price or booking is final
16. Measurement framework
Standard analytics capture only part of AI visibility.
You need three measurement layers.
flowchart TD
A[AI visibility measurement] --> B[Human referral tracking]
A --> C[Server-side bot logging]
A --> D[Prompt-based share of voice]
B --> B1[GA4 channels]
B --> B2[UTM parameters]
B --> B3[Referral source domains]
C --> C1[CDN logs]
C --> C2[WAF logs]
C --> C3[Server logs]
C --> C4[Edge workers]
D --> D1[Brand mentioned?]
D --> D2[Cited URLs]
D --> D3[Sentiment]
D --> D4[Accuracy]
D --> D5[Competitors]
16.1 AI assistant referral tracking
Track known referrers and UTM sources such as:
- chatgpt.com
- openai.com
- perplexity.ai
- claude.ai
- gemini.google.com
- copilot.microsoft.com
- bing.com where relevant
Example GA4 custom channel regex:
.*(chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|bing\.com).*
Limitations:
- AI Overview clicks may appear as Google organic.
- Mobile apps may strip referrers.
- Some AI browsers or agentic contexts may appear as direct.
- Zero-click answers send no user session at all.
- Crawlers and fetchers are not measured by client-side analytics.
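The same classification can be applied outside GA4, for example to server logs or a data warehouse export. The sketch below matches on the parsed hostname instead of a substring, which avoids accidental matches inside paths or query strings. The default host list is taken from the list above without bing.com, on the assumption that Bing traffic is mostly ordinary organic search; add it where relevant.

```python
from urllib.parse import urlsplit

AI_REFERRER_HOSTS = (
    "chatgpt.com",
    "openai.com",
    "perplexity.ai",
    "claude.ai",
    "gemini.google.com",
    "copilot.microsoft.com",
)


def classify_referrer(referrer, hosts=AI_REFERRER_HOSTS):
    """Return the matching AI assistant host, or None for anything else."""
    hostname = (urlsplit(referrer or "").hostname or "").lower()
    for known in hosts:
        if hostname == known or hostname.endswith("." + known):
            return known
    return None


print(classify_referrer("https://www.perplexity.ai/search?q=torque+wrench"))
print(classify_referrer("https://www.google.com/"))
```

An empty or missing referrer returns None, which matches the limitation above that some app and agent traffic appears as direct.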
16.2 Server-side AI bot logging
Log AI bot activity separately from human sessions.
Useful fields:
- timestamp
- requested URL
- method
- status code
- user-agent
- detected bot name
- bot category
- verified bot status
- referrer where present
- country or region where appropriate
- cache status
- response time
- response size
- robots policy result
- WAF action
- crawl purpose if classified
Recommended categories:
- search and retrieval bot
- training bot
- user-triggered fetcher
- agentic action bot
- unknown AI-like traffic
- spoofed or failed verification
- non-AI bot
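A first-pass classifier for the "detected bot name" and "bot category" fields can be built from the crawler tokens listed in section 5. The sketch below matches tokens against the User-Agent header only, so its output is a claim to be verified, not a confirmed identity. Control tokens such as Google-Extended are left out because they are robots.txt controls and are not expected to appear as request user-agents.

```python
BOT_CATEGORIES = {
    "OAI-SearchBot": "search and retrieval bot",
    "Claude-SearchBot": "search and retrieval bot",
    "PerplexityBot": "search and retrieval bot",
    "Googlebot": "search and retrieval bot",
    "Applebot": "search and retrieval bot",
    "GPTBot": "training bot",
    "ClaudeBot": "training bot",
    "CCBot": "training bot",
    "ChatGPT-User": "user-triggered fetcher",
    "Claude-User": "user-triggered fetcher",
    "Perplexity-User": "user-triggered fetcher",
}


def classify_user_agent(user_agent):
    """Return (bot name, category), or (None, None) when no token matches."""
    lowered = (user_agent or "").lower()
    # Check longer tokens first so a more specific name wins over a shorter one.
    for token in sorted(BOT_CATEGORIES, key=len, reverse=True):
        if token.lower() in lowered:
            return token, BOT_CATEGORIES[token]
    return None, None


sample_ua = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(classify_user_agent(sample_ua))
```

Requests that return `(None, None)` still need a decision between human traffic, non-AI bots and unknown AI-like traffic, which usually depends on behaviour and verification signals more than the header alone.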
16.3 Bot verification
Do not rely only on user-agent strings.
Use:
- published IP ranges
- reverse DNS
- forward-confirmed reverse DNS
- ASN checks
- TLS fingerprinting where available
- request patterns
- robots.txt fetch behaviour
- crawl rate behaviour
- WAF verified bot categories
- known provider JSON IP files
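Where a provider publishes IP ranges, the "verified bot status" field can be filled by checking the request IP against them. The sketch below uses reserved documentation ranges and a made-up bot name as placeholders; real ranges have to be loaded from each provider's published source and refreshed regularly.

```python
import ipaddress

# Placeholder data: documentation-only ranges and a hypothetical bot name.
PUBLISHED_RANGES = {
    "ExampleSearchBot": ["192.0.2.0/24", "2001:db8::/32"],
}


def ip_in_published_ranges(ip, bot_name, ranges=PUBLISHED_RANGES):
    """Return True if the IP falls inside a range published for the bot."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # malformed IP in the log line
    for cidr in ranges.get(bot_name, []):
        network = ipaddress.ip_network(cidr)
        if address.version == network.version and address in network:
            return True
    return False


print(ip_in_published_ranges("192.0.2.10", "ExampleSearchBot"))
print(ip_in_published_ranges("198.51.100.5", "ExampleSearchBot"))
```

A request whose user-agent claims a bot name but whose IP fails this check belongs in the "spoofed or failed verification" category from section 16.2.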
16.4 Privacy and GDPR notes
Avoid exposing raw IP addresses in public or client-facing dashboards. For AI bot analytics, prefer:
- aggregation
- truncation
- hashing with rotating salt
- short retention periods
- role-based access
- clear lawful basis
- separation of security logs from marketing reports
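Truncation and salted hashing from the list above could look like the following sketch. The prefix lengths (/24 for IPv4, /48 for IPv6) and the 16-character digest are assumptions, and the salt is expected to be supplied and rotated by the caller, for example once per day. This reduces exposure in reports but is not a legal assessment of whether the result is anonymous.

```python
import hashlib
import hmac
import ipaddress


def truncate_ip(ip):
    """Zero the host part of the address: /24 for IPv4, /48 for IPv6."""
    address = ipaddress.ip_address(ip)
    prefix = 24 if address.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def pseudonymise_ip(ip, salt):
    """Keyed hash of the IP; values are only comparable within one salt period."""
    digest = hmac.new(salt, ip.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


todays_salt = b"rotate-me-daily"  # placeholder; generate and store securely
print(truncate_ip("203.0.113.77"))
print(pseudonymise_ip("203.0.113.77", todays_salt))
```

Rotating the salt means the same IP hashes to a different value in each period, which limits long-term tracking while still allowing per-period counts.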
For UK and EU contexts, treat IP addresses carefully as personal data where they can relate to individuals, even if many AI bot IPs are public infrastructure addresses.
16.5 Prompt-based share of voice
Track what AI systems say, not just traffic.
Measure:
- whether your brand appears
- where it appears in the answer
- whether it is cited
- which URL is cited
- whether competitors appear
- sentiment and framing
- factual accuracy
- product, price and availability accuracy
- whether old URLs are cited
- whether the answer changes over time
Example prompt set:
What is the best [category] for [use case]?
Which [provider type] is suitable for [audience]?
Compare [brand] with [competitor].
How much does [product] cost?
Is [brand] available in [location]?
What are the alternatives to [brand]?
What are the main problems with [category]?
Run prompts regularly, but treat outputs as variable. AI answers can change by location, session, model, query phrasing and time.
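Once answers are collected, a few of the measures above can be scored automatically. The sketch below assumes the answer text and its cited URLs have already been captured by some other means; it does not call any AI provider. It covers brand mention, position, citation and competitor mentions only. Sentiment and factual accuracy need human or separate review. All names and sample data are hypothetical.

```python
from urllib.parse import urlsplit


def score_answer(answer, brand, competitors, cited_urls, own_domain):
    """Score one captured answer for mention and citation signals."""
    text = answer.lower()
    position = text.find(brand.lower())

    def is_own(url):
        host = (urlsplit(url).hostname or "").lower()
        return host == own_domain or host.endswith("." + own_domain)

    own_citations = [url for url in cited_urls if is_own(url)]
    return {
        "brand_mentioned": position >= 0,
        "brand_position": position if position >= 0 else None,
        "cited": bool(own_citations),
        "cited_urls": own_citations,
        "competitors_mentioned": [c for c in competitors if c.lower() in text],
    }


result = score_answer(
    answer="For torque tools, Example Tools and Acme Calibration are common choices.",
    brand="Example Tools",
    competitors=["Acme Calibration", "Other Co"],
    cited_urls=["https://www.example.com/guides/x", "https://acme.test/a"],
    own_domain="example.com",
)
print(result)
```

Storing one such record per prompt, provider and run date makes it possible to see how answers drift over time instead of relying on a single snapshot.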
17. Governance and legal risk
AI search visibility is not only an SEO decision. It touches copyright, licensing, privacy, infrastructure cost and commercial strategy.
17.1 Separate the decisions
Make separate policy decisions for:
- traditional search indexing
- AI search inclusion
- model training
- user-triggered retrieval
- agentic interaction
- Common Crawl and third-party datasets
- paywalled content
- premium documentation
- personal data
- partner or client content
17.2 robots.txt is not enough for sensitive content
If content must not be accessed, use:
- authentication
- paywalls
- signed URLs
- firewall rules
- contractual controls
- licence terms
- noindex where search exclusion is needed
- removal requests where appropriate
- WAF enforcement
- access logs and alerts
Do not rely on robots.txt to protect confidential or regulated material.
17.3 UK and Google AI search controls
As of June 2026, Google has begun testing a new Search Console control in the UK that lets some website owners manage whether their content appears in and helps ground generative AI Search features such as AI Overviews and AI Mode. Sites that opt out of those generative AI features should not receive traffic or impressions from them, while Google says the control will not be used as a ranking signal outside those generative AI features.
This is new and may be limited in availability while it is tested. Treat it as an evolving governance control, not a replacement for technical SEO, robots policy, snippets or access control.
18. Technical audit checklist
18.1 Crawl and access
- Important pages return 200.
- robots.txt does not block desired search crawlers.
- OAI-SearchBot is allowed if ChatGPT search visibility is desired.
- Claude-SearchBot is allowed if Claude search visibility is desired.
- PerplexityBot is allowed if Perplexity visibility is desired.
- Googlebot is allowed for Google Search and Google AI features.
- Training bots are allowed or blocked deliberately.
- User-triggered fetcher policy is documented.
- WAF rules do not accidentally block desired bots.
- Bot verification is implemented where possible.
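The robots.txt items in this checklist can be spot-checked offline with Python's standard `urllib.robotparser`. The sketch below is illustrative: the helper name and the sample policy are assumptions, and the standard-library parser applies the first matching rule and does not support path wildcards, so its verdicts can differ from how an individual crawler interprets the same file.

```python
from typing import Dict, List
from urllib.robotparser import RobotFileParser


def robots_access_report(robots_txt: str, user_agents: List[str], url: str) -> Dict[str, bool]:
    """Return, per user agent token, whether the robots.txt text allows fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in user_agents}


# Hypothetical policy: block one training bot, allow search bots, hide /private/.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: *
Disallow: /private/
"""

report = robots_access_report(
    SAMPLE_ROBOTS,
    ["GPTBot", "OAI-SearchBot", "Claude-SearchBot", "PerplexityBot", "Googlebot"],
    "https://www.example.com/guides/ai-search-optimisation",
)
```

Running the same report against a list of priority URLs gives a quick table of which bot purposes are allowed or blocked, which can then be compared with the documented policy.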
18.2 Rendering and extraction
- Main content appears in initial HTML.
- Internal links appear in crawlable HTML.
- Canonical tags appear in raw source.
- Robots meta tags are not injected late by JavaScript.
- Structured data appears in source or reliably rendered HTML.
- Important facts are not only in images.
- Product details are not only in tabs or API calls.
- Infinite scroll has crawlable pagination or links.
- Pages work with graceful degradation.
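Several of these checks can be run against the raw, unrendered HTML response. The following is a minimal sketch using only the standard library; the class and function names are hypothetical, and it inspects source HTML only, so it says nothing about what a rendering crawler would see after JavaScript runs.

```python
from html.parser import HTMLParser
from typing import Dict, List


class RawSignals(HTMLParser):
    """Collect canonical, robots meta, JSON-LD count and visible text from source HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None
        self.robots = None
        self.jsonld_blocks = 0
        self.text: List[str] = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and "canonical" in (a.get("rel") or "").lower().split():
            self.canonical = a.get("href")
        elif tag == "meta" and (a.get("name") or "").lower() == "robots":
            self.robots = a.get("content")
        elif tag == "script" and (a.get("type") or "").lower() == "application/ld+json":
            self.jsonld_blocks += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text.append(data)


def audit_raw_html(html: str, must_contain: List[str]) -> Dict[str, object]:
    """Report source-level signals and any required phrases missing from the text."""
    parser = RawSignals()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.text).split()).lower()
    return {
        "canonical": parser.canonical,
        "robots": parser.robots,
        "jsonld_blocks": parser.jsonld_blocks,
        "missing_phrases": [p for p in must_contain if p.lower() not in text],
    }
```

A phrase listed under `missing_phrases` is a hint that the fact is injected later by JavaScript, sits inside an image, or is loaded from an API call.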
18.3 Canonical and duplication
- One canonical URL per content item.
- Redirects, canonicals and sitemaps agree.
- Internal links point to canonical URLs.
- Parameter URLs are controlled.
- Facets are indexed only where useful.
- Hreflang references canonical equivalents.
- Noindexed pages are not listed in sitemaps.
- Redirected URLs are not listed in sitemaps.
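The sitemap items above can be checked by joining the sitemap against crawl results. This is a sketch under assumptions: the `crawl` mapping (URL to `status`, `noindex` and `canonical`) is a hypothetical shape that would come from your own crawler export, and only a plain `urlset` sitemap is handled, not a sitemap index.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> List[str]:
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def sitemap_issues(xml_text: str, crawl: Dict[str, dict]) -> List[Tuple[str, str]]:
    """Flag sitemap URLs that redirect, error, are noindexed or canonicalise elsewhere."""
    issues = []
    for url in sitemap_urls(xml_text):
        page = crawl.get(url)
        if page is None:
            issues.append((url, "not crawled"))
        elif 300 <= page["status"] < 400:
            issues.append((url, f"redirects ({page['status']})"))
        elif page["status"] != 200:
            issues.append((url, f"status {page['status']}"))
        elif page["noindex"]:
            issues.append((url, "noindex"))
        elif page["canonical"] != url:
            issues.append((url, "canonical points elsewhere"))
    return issues
```

An empty result means redirects, canonicals and the sitemap agree for every listed URL, at least for the pages covered by the crawl.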
18.4 Structure and content
- Each priority page starts with a clear summary.
- Sections have descriptive headings.
- Important sections answer one clear question.
- Dates, prices, locations and identifiers are visible.
- Comparisons include clear criteria.
- Claims are supported by evidence.
- Thin prompt-targeted pages are avoided.
- Content is unique and useful.
18.5 Structured data
- JSON-LD matches visible content.
- Organization schema is consistent.
- Product schema includes price, currency and availability where relevant.
- Article schema includes author and dates where relevant.
- Breadcrumb schema matches visible breadcrumbs.
- FAQ schema is used only for visible FAQs.
- sameAs links point to trusted profiles.
- dateModified is accurate.
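The first item, JSON-LD matching visible content, can be partly automated. The sketch below compares a Product object's name and offer price with the page's visible text; the function name is hypothetical, and the number matching is deliberately simple, so prices written with thousands separators or in words would need extra handling.

```python
import re
from decimal import Decimal
from typing import List


def jsonld_mismatches(product: dict, visible_text: str) -> List[str]:
    """Return a list of Product JSON-LD fields that do not appear in the visible text."""
    problems = []
    text = " ".join(visible_text.split()).lower()

    name = product.get("name", "")
    if not name or name.lower() not in text:
        problems.append("name not found in visible text")

    price = product.get("offers", {}).get("price")
    # Compare numerically so that "49.00" in markup matches "49" on the page.
    visible_numbers = [Decimal(n) for n in re.findall(r"\d+(?:\.\d+)?", visible_text)]
    if price is None or Decimal(str(price)) not in visible_numbers:
        problems.append("price not found in visible text")

    return problems
```

The same pattern extends to availability, currency and dates, as long as each field has a clear visible counterpart on the page.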
18.6 Freshness
- XML sitemap lastmod is accurate.
- Visible last updated dates are used where useful.
- dateModified matches real content changes.
- IndexNow is implemented for Bing and other participating search engines where appropriate.
- Product feeds are current.
- Merchant and business profiles are current.
- Changelogs exist for volatile or compliance-sensitive pages.
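One way to keep lastmod and dateModified tied to real content changes is to update them only when a fingerprint of the main content changes. The sketch below is one possible approach, not a standard: the function names and the stored record shape are assumptions, and `main_text` is expected to be the extracted main content without navigation, footers or other boilerplate.

```python
import hashlib
from datetime import date
from typing import Optional


def content_fingerprint(main_text: str) -> str:
    """Hash the main content after collapsing whitespace, so purely cosmetic
    whitespace changes do not count as updates."""
    normalised = " ".join(main_text.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def next_lastmod(previous: Optional[dict], main_text: str, today: date) -> dict:
    """Return the stored record unchanged when content is identical;
    otherwise return a new record dated today."""
    fingerprint = content_fingerprint(main_text)
    if previous is not None and previous["fingerprint"] == fingerprint:
        return previous
    return {"fingerprint": fingerprint, "lastmod": today.isoformat()}
```

The resulting `lastmod` value can feed the XML sitemap, the visible last updated date and the `dateModified` property, so all three stay consistent.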
18.7 Measurement
- AI referrers are grouped in analytics.
- ChatGPT referrals tagged with utm_source=chatgpt.com are tracked where present.
- Server-side bot logging is active.
- Bot categories are separated.
- Raw IP exposure is minimised.
- Prompt-based share of voice is monitored.
- Cited URLs are tracked.
- Answer accuracy is reviewed regularly.
19. Prioritised implementation plan
Phase 1: Access and extraction
Goal: Make sure AI systems can fetch and read priority content.
Actions:
- Identify priority URL groups.
- Crawl the site using the user-agent strings of Googlebot and the major AI bots, and compare the responses.
- Compare source HTML versus rendered HTML.
- Fix blocked, redirected, noindexed or unstable priority pages.
- Make core content available in initial HTML.
- Reduce unnecessary DOM and boilerplate.
- Check WAF and CDN bot handling.
- Add or correct robots.txt policy by bot purpose.
Phase 2: Canonical and discovery foundation
Goal: Make source selection unambiguous.
Actions:
- Clean XML sitemaps.
- Align internal links with canonical URLs.
- Fix canonical conflicts.
- Remove redirected and noindexed URLs from sitemaps.
- Control parameter and faceted URLs.
- Add accurate lastmod.
- Add IndexNow where appropriate.
- Reference sitemaps in robots.txt.
Phase 3: Understanding and entities
Goal: Help machines understand what pages, products and organisations mean.
Actions:
- Improve semantic HTML.
- Restructure pages into clear sections.
- Add or refine JSON-LD.
- Build consistent Organization and Product entity patterns.
- Add sameAs links where useful.
- Align schema, visible content and external profiles.
- Update author and reviewer profiles where trust matters.
Phase 4: Answer usefulness
Goal: Make pages usable as evidence in answers.
Actions:
- Add direct summaries at the top of pages.
- Add comparison tables and decision guidance.
- Clarify pricing, suitability and limitations.
- Add evidence for strong claims.
- Add dates and change notes.
- Improve images, captions and alt text.
- Remove thin or duplicated AI prompt pages.
Phase 5: Measurement and governance
Goal: Prove visibility and manage risk.
Actions:
- Add AI referral channels.
- Build server-side AI bot dashboards.
- Verify known bots.
- Track prompt-based share of voice.
- Monitor cited URLs and competitor presence.
- Review training bot policy quarterly.
- Review WAF and rate limits.
- Review privacy and IP handling.
20. Useful reporting dashboard structure
20.1 AI bot activity dashboard
Dimensions:
- date
- bot name
- bot category
- verified status
- requested URL
- URL type
- status code
- response time
- cache status
- country or region
- WAF action
Metrics:
- requests
- unique URLs requested
- 200 responses
- 3xx responses
- 4xx responses
- 5xx responses
- average response time
- cache hit ratio
- top crawled directories
- crawl spikes
- failed verification count
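Most of these metrics can be derived from structured log records. The sketch below assumes each record carries the fields shown in the server log example in section 22.3 (`bot_name`, `url`, `status`, `cache_status`, `response_time_ms`, `verified_bot`); the function name and output shape are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def summarise_bot_logs(records: Iterable[dict]) -> Dict[str, dict]:
    """Aggregate per-bot request counts, status classes, latency and cache ratio."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["bot_name"]].append(record)

    summary = {}
    for bot, rows in grouped.items():
        # Bucket status codes into 2xx, 3xx, 4xx and 5xx classes.
        status_classes = Counter(f"{row['status'] // 100}xx" for row in rows)
        cache_hits = sum(1 for row in rows if row["cache_status"] == "HIT")
        summary[bot] = {
            "requests": len(rows),
            "unique_urls": len({row["url"] for row in rows}),
            "status_classes": dict(status_classes),
            "avg_response_time_ms": sum(row["response_time_ms"] for row in rows) / len(rows),
            "cache_hit_ratio": cache_hits / len(rows),
            "failed_verification": sum(1 for row in rows if not row["verified_bot"]),
        }
    return summary
```

Grouping by date or by top-level directory instead of bot name follows the same pattern and covers the crawl spike and top crawled directory views.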
20.2 AI referral dashboard
Dimensions:
- source
- medium
- landing page
- device
- country
- content group
- conversion type
Metrics:
- sessions
- engaged sessions
- conversions
- revenue
- average engagement time
- assisted conversions
- new users
- returning users
20.3 Prompt share-of-voice dashboard
Dimensions:
- platform
- prompt
- prompt category
- date
- country or locale
- brand mentioned
- cited URL
- competitor cited
- sentiment
- answer accuracy
Metrics:
- brand appearance rate
- citation rate
- first citation rate
- competitor share
- incorrect answer rate
- outdated URL rate
- missing price rate
- missing availability rate
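A few of these rates can be computed from recorded answers as sketched below. The definitions are assumptions made for illustration: citation rate is the share of answers citing at least one URL on your own host, first citation rate is the share whose first cited URL is on your host, and outdated URL rate is the share citing any URL from a list of retired URLs. The observation shape (`brand_mentioned`, `cited_urls`) is also hypothetical.

```python
from typing import Dict, FrozenSet, List
from urllib.parse import urlparse


def share_of_voice(
    observations: List[dict],
    own_host: str,
    retired_urls: FrozenSet[str] = frozenset(),
) -> Dict[str, float]:
    """Compute appearance, citation, first-citation and outdated-URL rates."""
    total = len(observations)
    if total == 0:
        return {}

    def is_own(url: str) -> bool:
        return urlparse(url).hostname == own_host

    mentioned = sum(1 for o in observations if o["brand_mentioned"])
    cited = sum(1 for o in observations if any(is_own(u) for u in o["cited_urls"]))
    first = sum(1 for o in observations if o["cited_urls"] and is_own(o["cited_urls"][0]))
    outdated = sum(1 for o in observations if any(u in retired_urls for u in o["cited_urls"]))

    return {
        "brand_appearance_rate": mentioned / total,
        "citation_rate": cited / total,
        "first_citation_rate": first / total,
        "outdated_url_rate": outdated / total,
    }
```

Because answers vary between runs, these rates are more useful as trends over repeated samples per prompt than as single readings.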
21. Myths and corrections
Myth: AI search needs special AI schema
Correction: Google says there is no special schema required for AI Overviews or AI Mode. Use normal structured data that matches visible content.
Myth: llms.txt replaces sitemaps
Correction: llms.txt is an experimental enhancement. XML sitemaps, internal links, canonical tags and accessible HTML still matter more.
Myth: Blocking Google-Extended blocks AI Overviews
Correction: Google-Extended is for some Gemini training and grounding uses. Google Search AI features are controlled through Googlebot access and Search preview controls.
Myth: Allowing all AI bots is always good
Correction: Training, search, user-triggered and agentic fetches are different. Decide by purpose.
Myth: robots.txt protects private content
Correction: robots.txt is not access authorisation. Use authentication or other enforcement for private content.
Myth: JavaScript rendering is always fine because Google can render
Correction: Some crawlers can render, some cannot, and rendering can fail. Initial HTML remains the safest place for critical content.
Myth: AI traffic is fully visible in GA4
Correction: GA4 captures some AI referrals, but it does not see server-side crawler requests, visits with stripped referrers or zero-click exposure, and it cannot separate AI Overview clicks from other Google organic traffic.
Myth: AI only uses semantic embeddings, so keywords no longer matter
Correction: Hybrid retrieval uses both semantic and lexical signals. Exact names, dates, product IDs and terminology still matter.
22. Reference implementation snippets
22.1 HTML summary block
<section class="answer-summary">
  <h2>Summary</h2>
  <p>
    AI search optimisation improves how AI systems discover, retrieve,
    interpret and cite a website. The most important technical foundations
    are crawl access, server-rendered content, canonical clarity,
    structured data, freshness and server-side measurement.
  </p>
</section>
22.2 Product facts block
<section id="product-facts">
  <h2>Product facts</h2>
  <dl>
    <dt>Product name</dt>
    <dd>Example Analytics Pro</dd>
    <dt>Price</dt>
    <dd>£49 per month, excluding VAT</dd>
    <dt>Availability</dt>
    <dd>Available in the United Kingdom and European Union</dd>
    <dt>Last updated</dt>
    <dd><time datetime="2026-06-07">7 June 2026</time></dd>
  </dl>
</section>
22.3 Server log fields
{
  "timestamp": "2026-06-07T09:14:00Z",
  "url": "https://www.example.com/guides/ai-search-optimisation",
  "method": "GET",
  "status": 200,
  "user_agent": "OAI-SearchBot/1.0",
  "bot_name": "OAI-SearchBot",
  "bot_category": "ai_search_retrieval",
  "verified_bot": true,
  "cache_status": "HIT",
  "response_time_ms": 84,
  "waf_action": "allow"
}
22.4 AI referral regex
(chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|bing\.com)
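Applied to raw referrer strings, an unanchored pattern like this also matches lookalike hosts such as notchatgpt.com, and the bing.com alternative matches ordinary Bing search referrals as well as Copilot. The sketch below is one hedged way to apply the same domain list to the parsed hostname only, with a fallback to a `utm_source` parameter on the landing URL for visits whose referrer was stripped. The function name is hypothetical, and returning the matched domain lets bing.com be grouped separately if desired.

```python
import re
from typing import Optional
from urllib.parse import parse_qs, urlparse

# Same domain list as the regex above, anchored to the end of the hostname
# and to a dot boundary so lookalike hosts do not match.
AI_DOMAINS = (
    r"chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|"
    r"gemini\.google\.com|copilot\.microsoft\.com|bing\.com"
)
AI_HOST = re.compile(rf"(?:^|\.)({AI_DOMAINS})$", re.IGNORECASE)


def ai_referral_source(referrer: str, landing_url: str = "") -> Optional[str]:
    """Return the matched AI domain for a session, or None if nothing matches."""
    host = urlparse(referrer).hostname or ""
    match = AI_HOST.search(host)
    if match:
        return match.group(1).lower()
    # Fall back to utm_source when the referrer is missing or stripped.
    utm_values = parse_qs(urlparse(landing_url).query).get("utm_source", [])
    if utm_values:
        match = AI_HOST.search(utm_values[0])
        if match:
            return match.group(1).lower()
    return None
```

The returned domain can be mapped to a custom channel group, keeping AI referrals separate from organic search and generic referral traffic.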
23. Final strategic model
The best technical AI search optimisation strategy answers ten questions:
- Access: Can the right systems reach the content?
- Extraction: Can they read the important content without rendering friction?
- Canonicalisation: Can they identify the correct source URL?
- Chunking: Can they split the page into useful passages?
- Understanding: Can they identify entities, facts and relationships?
- Trust: Are claims supported, current and consistent?
- Freshness: Can changes be detected quickly and honestly?
- Governance: Are training, retrieval and user-triggered access controlled separately?
- Measurement: Can you see bot activity, referrals and answer visibility?
- Iteration: Are prompt outputs, citations and errors reviewed continuously?
The ultimate goal is not to trick AI systems. It is to make your site the easiest, clearest and most reliable source for the facts your audience already needs.