The Ultimate Technical AI Search Optimisation Guide

Last updated: 7 June 2026

AI search optimisation extends technical SEO for systems that crawl, retrieve, chunk, embed, rank, reason, summarise and cite. This guide collects the technical foundations that make content easier to access, extract, understand, refresh, verify and use as evidence.

1. Executive summary

AI search optimisation is not a replacement for technical SEO. It applies the same discipline to systems that retrieve passages and generate cited answers, not only to systems that rank whole pages.

The strongest current evidence from Google, OpenAI, Anthropic, Perplexity, Apple, Bing, Common Crawl, enterprise retrieval documentation and RAG research points to one practical conclusion:

The sites most likely to perform well in AI search are the sites whose content can be reliably accessed, extracted, canonicalised, chunked, understood, refreshed, verified and cited.

This means the priorities are not magic AI tags or prompt-targeted content farms. The priorities are:

Let the right crawlers access the right content.
Keep essential content available in the initial HTML.
Use canonical, indexable URLs with clean sitemap coverage.
Structure pages so they split cleanly into useful evidence passages.
Use visible facts, dates, prices, entities and relationships.
Reinforce visible content with accurate JSON-LD.
Keep freshness signals honest and precise.
Separate training bot policy from AI search bot policy.
Measure server-side bot activity, referral traffic and prompt visibility.
Treat robots.txt as policy signalling, not security.

Google currently says there are no extra technical requirements to appear in AI Overviews or AI Mode beyond standard Google Search eligibility. But the retrieval layer changes how pages are used. Traditional search asks, "Should this page rank?" AI search also asks, "Can this passage support a reliable answer?"

That is the key shift.

2. What AI search optimisation is

AI search optimisation is the process of improving a website so AI retrieval systems can:

discover the right URLs
fetch them without friction
extract the main content
understand the entities and relationships
split content into useful passages
retrieve those passages for relevant queries
trust the facts enough to use them
cite the source or link back when appropriate
keep answers current when content changes

It includes classic SEO, but it places more emphasis on machine extraction, factual clarity, semantic structure, freshness and crawler governance.

What it is not

AI search optimisation is not:

keyword stuffing
creating thousands of thin prompt pages
adding unsupported "best" or "number one" claims
relying on llms.txt alone
assuming every AI crawler renders JavaScript
assuming analytics tools capture all AI activity
blocking all AI bots while expecting AI search visibility
allowing all AI bots without thinking about training, licensing and load

3. The AI search pipeline

AI search systems differ by provider, but most follow a similar pipeline.

flowchart TD
    A[User prompt or query] --> B[Query understanding]
    B --> C[Query fan-out and sub-queries]
    C --> D[Discovery through search indexes, sitemaps, links, feeds and direct fetches]
    D --> E[HTTP fetch]
    E --> F[Canonicalisation and duplicate clustering]
    F --> G[Content extraction]
    G --> H[Chunking into passages]
    H --> I[Lexical, semantic and metadata representation]
    I --> J[Hybrid retrieval]
    J --> K[Re-ranking, freshness checks, policy checks and trust assessment]
    K --> L[Answer synthesis]
    L --> M[Citations, links, summaries or actions]

The important point is that AI systems do not only deal with complete pages. They often deal with passages, entities, attributes, dates, snippets, media, schema fields and metadata.

A strong AI-search page is therefore not just a page that can rank. It is a page that can be used as evidence.

4. Documented facts versus sensible inference

A lot of public discussion around AI search mixes official documentation, experiments, vendor claims and speculation. Keep the distinction clear.

Strongly documented

These are well supported by public provider documentation:

Google AI Overviews and AI Mode use normal Google Search eligibility. To appear as a supporting link, a page must be indexed and eligible to be shown with a snippet.
Google says there are no extra special AI files or special schema required for AI features in Search.
Google AI Overviews and AI Mode may use query fan-out, issuing related searches across subtopics and data sources.
OpenAI separates OAI-SearchBot for ChatGPT search, GPTBot for training and ChatGPT-User for user-triggered actions.
Anthropic separates ClaudeBot for training, Claude-SearchBot for search quality and Claude-User for user-initiated retrieval.
Perplexity separates PerplexityBot for search results from Perplexity-User for user-requested fetches.
Applebot powers Apple search experiences, while Applebot-Extended is a control for whether Applebot-crawled content can be used for Apple foundation-model training.
Bing has emphasised XML sitemaps, accurate lastmod and IndexNow for AI-powered search freshness.
robots.txt is not access authorisation. It is a protocol that crawlers are requested to honour.

Reasonable inference

These are not fully disclosed for public consumer AI search, but are strongly suggested by RAG literature and enterprise retrieval systems:

AI search systems often operate on passages rather than whole pages.
Hybrid retrieval is likely important, combining keyword search, semantic search and metadata filters.
Clear headings and short, scoped sections make chunking and retrieval easier.
Dates, named entities, product identifiers and structured attributes help exact retrieval and filtering.
Duplicate URLs can cause the wrong version of a page or product to be selected as source material.
Freshness signals are more important when an answer is generated from retrieved evidence.
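
As an illustration of the hybrid retrieval idea above, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion, a technique widely used in open RAG stacks. Nothing here is confirmed for any public AI search engine: the passage IDs are invented, and `k = 60` is simply a commonly used default.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of passage IDs into one ranked list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            # Each list contributes more for passages it ranks higher
            scores[passage_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output
    return sorted(scores, key=lambda pid: (-scores[pid], pid))

# Hypothetical passage IDs returned by two independent retrievers
keyword_hits = ["pricing#2026", "faq#returns", "guide#intro"]
semantic_hits = ["guide#intro", "pricing#2026", "blog#news"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Passages that score well in both lists rise to the top, which is one reason a section with exact identifiers and clear topical wording tends to be easier to retrieve than one with only one of the two.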

Unknown or provider-specific

These are not reliably public:

exact chunk sizes for consumer AI search indexes
exact embedding models used by each public AI search system
exact citation-ranking weights
exact source selection logic for AI Overviews, ChatGPT, Claude or Perplexity
whether a specific platform renders JavaScript in every context
whether llms.txt materially affects visibility on major public AI search engines

Treat hidden internals as unknown. Optimise around robust principles instead.

5. Crawler categories and access policy

Do not treat all AI bots the same. They have different purposes.

5.1 Automatic search and retrieval bots

These crawlers are most closely connected to AI search visibility, citations and live answers.

Examples:

OAI-SearchBot
Claude-SearchBot
PerplexityBot
Googlebot for Google Search AI features
Applebot for Apple search experiences

These should usually be allowed if the goal is visibility in AI search and answer engines.

5.2 Training bots and training controls

These are used, or can be used, to collect public content for model training or model improvement.

Examples:

GPTBot
ClaudeBot
CCBot
Google-Extended as a control token for some Google Gemini training and grounding uses
Applebot-Extended as a control for Apple foundation-model training use

Allowing training bots is a commercial, legal and licensing decision. It is not the same decision as allowing AI search visibility.

5.3 User-triggered fetchers

These fetch pages because a user or agent requested them.

Examples:

ChatGPT-User
Claude-User
Perplexity-User
Google user-triggered fetchers
agentic browser fetchers

These may behave differently from normal crawlers and may not obey robots.txt in the same way, because the fetch is user-directed. Manage them with a combination of server-side monitoring, rate limiting, authentication, WAF rules, API controls and robots policy where supported.

5.4 Agentic action bots

These are not just reading pages. They may interact with forms, carts, booking flows, logins, filters, APIs and on-site tools.

For these, classic SEO checks are not enough. You also need:

accessible form labels
robust ARIA where appropriate
clear button states
stable URLs
predictable error messages
safe rate limits
CSRF protection
permission checks
separation of public content from protected workflows

6. Recommended robots.txt strategy

Use robots.txt to express policy by purpose.

This is an illustrative policy for a site that wants AI search visibility but wants to limit model training.

# robots.txt

User-agent: *
Disallow:

Sitemap: https://www.example.com/sitemap.xml

# Google Search and Google AI features in Search
User-agent: Googlebot
Allow: /

# Google-Extended controls some Gemini training and grounding uses.
# It is not the control for Google Search AI Overviews or AI Mode.
User-agent: Google-Extended
Disallow: /

# ChatGPT search visibility
User-agent: OAI-SearchBot
Allow: /

# OpenAI training
User-agent: GPTBot
Disallow: /

# Anthropic search visibility
User-agent: Claude-SearchBot
Allow: /

# Anthropic training
User-agent: ClaudeBot
Disallow: /

# Anthropic user-directed retrieval.
# Decide based on whether you want Claude to fetch pages for user queries.
User-agent: Claude-User
Allow: /

# Perplexity search visibility
User-agent: PerplexityBot
Allow: /

# Apple search visibility
User-agent: Applebot
Allow: /

# Apple foundation-model training control
User-agent: Applebot-Extended
Disallow: /

# Common Crawl dataset collection
User-agent: CCBot
Disallow: /

Important caveats

robots.txt is not a security layer.
Unknown, malicious or spoofed bots may ignore it.
User-agent matching is weak on its own.
Some user-triggered fetchers may ignore robots.txt because the request is made on behalf of a user.
If the content is sensitive, private, licensed or paywalled, use real access control.
Some providers publish IP ranges or verification methods. Use them where possible.
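
Before deploying a policy like the one above, it can help to test it locally. The following is a minimal sketch using Python's standard-library `urllib.robotparser` against a trimmed copy of the policy. The URL is a placeholder, and `SomeOtherBot` is a made-up agent used to exercise the wildcard group.

```python
from urllib.robotparser import RobotFileParser

# Trimmed copy of the illustrative policy: search bots allowed, training bots blocked
POLICY = """\
User-agent: *
Disallow:

User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /
"""

def build_parser(robots_txt):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(POLICY)
url = "https://www.example.com/guides/ai-search-optimisation"
bots = ["OAI-SearchBot", "GPTBot", "Claude-SearchBot", "ClaudeBot", "SomeOtherBot"]
results = {bot: parser.can_fetch(bot, url) for bot in bots}

for bot, allowed in results.items():
    print(f"{bot}: {'allowed' if allowed else 'blocked'}")
```

This only shows how one parser reads the file. The standard-library parser matches agent tokens by substring, so a group for `Applebot` would also match `Applebot-Extended` here, and real crawlers may resolve groups differently. Treat a passing check as a sanity test, not proof of crawler behaviour.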

7. Rendering and extraction

The first technical requirement is simple:

Put the important content and signals in the initial HTML wherever possible.

AI crawlers and retrieval systems do not all render pages consistently. Some may render. Some may not. Some may fetch only raw HTML. Some may time out before hydration. Some may rely on upstream indexes that have their own rendering limitations.

7.1 Safe rendering hierarchy

Architecture	AI extraction risk	Notes
Static Site Generation	Low	Excellent for stable content, documentation, articles and landing pages.
Server-Side Rendering	Low	Good for dynamic websites where content must be current at request time.
Incremental Static Regeneration	Low to medium	Good balance of speed and freshness if regeneration is reliable.
Progressive enhancement	Low to medium	Safe if core content exists before JavaScript enhancement.
Client-Side Rendering only	High	Risky if the initial HTML is a blank app shell.
Content behind API calls only	High	Risky unless the API response is also discoverable and indexable.
Content hidden in interactive components only	Medium to high	Tabs, carousels, filters and accordions can obscure key facts.

7.2 What must be present in HTML

For priority pages, aim to expose:

H1 and section headings
main body copy
product names and descriptions
prices, availability and currency
specifications and identifiers
internal links
canonical tags
robots meta tags
hreflang where relevant
structured data
author and reviewer signals where relevant
publication and modification dates
important comparison data
FAQs and policy content
breadcrumbs
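
A rough way to check a few of these signals is to parse the raw HTML response without executing JavaScript. The sketch below uses only the standard library and looks for four of the items above. The two sample documents are invented, and the choice of signals is an assumption to adapt per page type.

```python
from html.parser import HTMLParser

class RawHtmlSignals(HTMLParser):
    """Collect a few signals from un-rendered HTML (no JavaScript is executed)."""

    def __init__(self):
        super().__init__()
        self.found = {"h1": False, "canonical": False, "json_ld": False, "time": False}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "h1":
            self.found["h1"] = True
        elif tag == "link" and (attr.get("rel") or "").lower() == "canonical" and attr.get("href"):
            self.found["canonical"] = True
        elif tag == "script" and (attr.get("type") or "").lower() == "application/ld+json":
            self.found["json_ld"] = True
        elif tag == "time" and attr.get("datetime"):
            self.found["time"] = True

def missing_signals(raw_html):
    """Return the names of signals that are absent from the raw HTML."""
    parser = RawHtmlSignals()
    parser.feed(raw_html)
    return sorted(name for name, present in parser.found.items() if not present)

# Hypothetical responses: a client-side app shell and a server-rendered page
app_shell = ('<html><head><title>App</title></head>'
             '<body><div id="root"></div><script src="/app.js"></script></body></html>')
server_rendered = """<html><head>
<link rel="canonical" href="https://www.example.com/guides/x" />
<script type="application/ld+json">{"@context": "https://schema.org"}</script>
</head><body><main><h1>Guide</h1>
<p>Last updated: <time datetime="2026-06-07">7 June 2026</time></p>
</main></body></html>"""

print(missing_signals(app_shell))
print(missing_signals(server_rendered))
```

Running the same check on the rendered DOM and comparing the two results is a simple way to spot content that exists only after hydration.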

7.3 Semantic HTML pattern

<main>
  <article>
    <header>
      <h1>Technical AI Search Optimisation Guide</h1>
      <p class="summary">
        AI search visibility depends on crawl access, extractable HTML,
        canonical clarity, structured data, freshness and measurable bot activity.
      </p>
      <p>Last updated: <time datetime="2026-06-07">7 June 2026</time></p>
    </header>

    <section id="crawler-access">
      <h2>Control AI bot access by purpose</h2>
      <p>
        Separate search crawlers, training crawlers and user-triggered fetchers.
      </p>
    </section>
  </article>
</main>

7.4 Common extraction failures

blank app shell in source HTML
missing canonical in raw HTML
metadata only injected after hydration
content loaded only after scroll
key facts only shown inside images
product specs only available behind tabs
infinite scroll with no crawlable pagination
filters creating crawl traps
same content duplicated across many parameter URLs
blocking JS or CSS needed for graceful rendering
WAF returning 403 to AI retrieval bots
server returning different content to different user-agents unintentionally

8. Canonicalisation and duplicate control

AI search systems need to know which URL is the source of truth. Duplicate or near-duplicate pages make retrieval harder and can cause the wrong version to be selected.

8.1 Canonical signals to align

Use the same preferred URL across:

internal links
XML sitemaps
rel canonical tags
redirects
hreflang annotations
structured data mainEntityOfPage or url
Open Graph and social metadata
product feeds
llms.txt or context files
canonical Markdown alternatives
external profiles where possible

8.2 Canonical HTML example

<head>
  <title>How AI crawlers see JavaScript websites</title>
  <link rel="canonical" href="https://www.example.com/guides/ai-crawlers-javascript" />
  <meta name="robots" content="index,follow,max-snippet:-1,max-image-preview:large" />
</head>

8.3 Duplicate patterns to clean up

tracking parameters
sort and filter URLs
internal search pages
duplicated product variants
print pages
trailing slash inconsistency
HTTP and HTTPS duplication
www and non-www duplication
uppercase and lowercase duplication
duplicated locale paths
paginated pages canonicalising incorrectly
canonical tags that conflict with redirects or sitemaps
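
Several of these patterns can be audited by normalising crawled URLs and grouping the ones that collapse to the same value. The sketch below is one possible normaliser. The preferred host, the tracking parameter list and the decision to strip trailing slashes are assumptions, and path case is deliberately left alone because lowercasing is not safe on every site.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

TRACKING_PREFIXES = ("utm_",)           # assumed list, extend per site
TRACKING_PARAMS = {"gclid", "fbclid"}   # assumed list, extend per site

def normalise_url(url, preferred_host="www.example.com"):
    """Map common duplicate variants of a URL onto one preferred form."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()          # hostname drops any port
    if host == preferred_host.removeprefix("www."):
        host = preferred_host                      # non-www to www
    path = parts.path.rstrip("/") or "/"           # trailing slash policy: none
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
        and not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Force HTTPS, sort remaining parameters, drop the fragment
    return urlunsplit(("https", host, path, urlencode(sorted(query)), ""))

print(normalise_url("HTTP://Example.com/Guides/AI/?utm_source=x&b=2&a=1#top"))
print(normalise_url("https://www.example.com/"))
```

The output is only a grouping key for audits. The actual fix still has to be made with redirects, canonical tags, internal links and sitemap entries that all agree.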

8.4 Why this matters more in AI search

In AI search, duplicate control is not only about ranking consolidation. It is also about source selection. If the system clusters duplicates and selects an outdated URL, that outdated page may become the evidence used in an answer.

9. Chunkability and passage-level retrieval

Modern retrieval systems often work with chunks or passages. A page may rank as a whole, but an AI system often retrieves only one section of it.

Optimise pages so each section can stand alone as useful evidence.

9.1 Good section design

Each important section should have:

a descriptive heading
one clear question or topic
a short direct answer near the top
supporting detail below
dates, units and named entities where relevant
minimal boilerplate
no dependency on previous sections for basic meaning
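
To see why these properties matter, it helps to look at a page the way a simple chunker might. The sketch below splits Markdown into one passage per heading and keeps the heading trail as context. It is illustrative only: real systems use undisclosed chunking strategies, the sample page and its figures are invented, and lines starting with `#` inside code blocks would be misread as headings.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")

def chunk_by_heading(markdown_text):
    """Split Markdown into passages, one per heading, keeping the heading trail."""
    chunks, trail, body = [], [], []

    def flush():
        text = "\n".join(body).strip()
        # Text before the first heading has no trail and is dropped
        if trail and text:
            chunks.append({"heading_path": " > ".join(trail), "text": text})
        body.clear()

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del trail[level - 1:]          # leave deeper or equal levels behind
            trail.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks

# Invented sample page
page = """# Torque wrench calibration

Calibrate click-type wrenches every 12 months or 5,000 cycles.

## Pricing or cost

Calibration costs 45 GBP per wrench, excluding VAT (updated 7 June 2026).
"""

chunks = chunk_by_heading(page)
for chunk in chunks:
    print(chunk["heading_path"], "=>", chunk["text"])
```

A section headed "Overview" with the answer buried in its third paragraph produces a much weaker passage under this kind of split than a section with a descriptive heading and a direct first sentence.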

9.2 Bad section design

Avoid:

vague headings such as "Overview", "Details" or "More information"
long introductions before the answer
multiple unrelated topics in one section
claims without evidence
facts only implied through design
tables without textual context
pages where the first 500 words say almost nothing

9.3 Page structure template

# Main topic

Short answer summary.

## What is it?

Direct definition.

## Who is it for?

Specific audience and use cases.

## How it works

Step-by-step explanation.

## Technical requirements

Concrete implementation details.

## Pricing or cost

Clear numbers, currency, exclusions and update date.

## Limitations

Caveats, edge cases and situations where it is not suitable.

## Comparison with alternatives

Comparison table with honest differences.

## Evidence and sources

Awards, reviews, references, case studies, documentation links.

## Last updated

Visible date and explanation of meaningful changes.

10. Content that supports AI answers

AI answer systems need usable facts, not just marketing prose.

10.1 Use answer-led writing

Start each page and major section with the answer, then add detail.

Poor:

In today's fast-moving digital landscape, brands are increasingly looking for innovative ways to improve discoverability...

Better:

AI search optimisation improves how AI systems access, extract, understand and cite website content. The core work is crawl access, server-rendered content, canonical clarity, structured data, freshness and measurement.

10.2 Write for decision-making

AI systems often answer comparison and recommendation queries. Support that directly.

Useful page types:

What is X?
How does X work?
How much does X cost?
Is X suitable for Y?
How to choose X
X vs Y
Best X for Y
Alternatives to X
Common problems with X
X implementation checklist
X limitations
X compliance requirements

10.3 Make facts explicit

Spell out:

brand name
legal entity name
product names
product categories
target users
location and service areas
prices and currencies
dates and validity periods
stock and availability
model numbers and SKUs
measurements and units
terms and exclusions
policies and limitations

10.4 Back strong claims with evidence

Avoid unsupported claims such as:

best
leading
number one
most trusted
fastest
cheapest
most accurate

If you use them, support them with:

awards
independent reviews
certifications
customer numbers
case studies
benchmark methodology
third-party references
public datasets
transparent dates

10.5 GEO study takeaways

Academic GEO research found that content modifications such as adding citations, quotations and statistics can improve visibility in generative responses, with reported gains up to around 40 percent in the benchmarked settings.

Practical takeaway:

Add useful statistics, but only when true and relevant.
Add expert quotes, but avoid generic filler quotes.
Cite credible sources for factual claims.
Improve fluency and structure.
Avoid keyword stuffing.
Optimise by topic and domain, not by one universal formula.

11. Structured data and entity consistency

Structured data does not replace content quality, and it should not describe information that is not visible to users. Its job is to reinforce meaning.

11.1 Core principle

Use schema to make visible facts easier for machines to understand.

Good structured data is:

accurate
visible on the page
consistent with feeds and profiles
stable across the site
specific to the page type
connected through persistent identifiers

Bad structured data is:

copied from templates without page-level accuracy
inconsistent with visible content
over-marked for rich results
missing dates or identifiers
disconnected from the brand entity
used to make claims the page does not support

11.2 Useful schema types

Depending on the site, prioritise:

Organization
LocalBusiness and relevant subtypes
Person
Article
BlogPosting
Product
Offer
AggregateRating
Review
FAQPage
QAPage
BreadcrumbList
ItemList
WebPage
Service
HowTo where genuinely applicable

11.3 Entity home

Every organisation should have a clear entity home, usually the About page or homepage. This page should state:

official organisation name
brand name
what the organisation does
who it serves
where it operates
founding details if relevant
contact information
official social profiles
parent or subsidiary relationships
awards or certifications
authoritative third-party profiles
sameAs links

11.4 Connected JSON-LD pattern

<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Organization",
      "@id": "https://www.example.com/#organization",
      "name": "Example Tools",
      "url": "https://www.example.com/",
      "logo": "https://www.example.com/logo.png",
      "sameAs": [
        "https://www.linkedin.com/company/example-tools"
      ]
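    },
    {
      "@type": "WebSite",
      "@id": "https://www.example.com/#website",
      "url": "https://www.example.com/",
      "name": "Example Tools",
      "publisher": {
        "@id": "https://www.example.com/#organization"
      }
    },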
    {
      "@type": "WebPage",
      "@id": "https://www.example.com/guides/torque-wrench-calibration#webpage",
      "url": "https://www.example.com/guides/torque-wrench-calibration",
      "name": "How to calibrate a torque wrench",
      "isPartOf": {
        "@id": "https://www.example.com/#website"
      },
      "publisher": {
        "@id": "https://www.example.com/#organization"
      },
      "datePublished": "2026-05-30",
      "dateModified": "2026-06-07"
    },
    {
      "@type": "Article",
      "@id": "https://www.example.com/guides/torque-wrench-calibration#article",
      "headline": "How to calibrate a torque wrench",
      "mainEntityOfPage": {
        "@id": "https://www.example.com/guides/torque-wrench-calibration#webpage"
      },
      "author": {
        "@type": "Person",
        "name": "Jordan Smith"
      },
      "publisher": {
        "@id": "https://www.example.com/#organization"
      }
    }
  ]
}
</script>
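
In a connected graph, a useful check is that every `{"@id": ...}` reference points at a node that is actually defined. The sketch below lists references with no matching node in the same `@graph`. The sample graph is invented. A reference to a node defined on another page is not necessarily an error, so treat the output as a review list.

```python
import json

def dangling_ids(json_ld_text):
    """Return @id references that have no matching node in the same @graph."""
    graph = json.loads(json_ld_text).get("@graph", [])
    defined = {node["@id"] for node in graph if "@id" in node}
    referenced = set()

    def walk(value, is_top_level_node):
        if isinstance(value, dict):
            # A nested dict holding only "@id" is a reference to another node
            if not is_top_level_node and set(value) == {"@id"}:
                referenced.add(value["@id"])
            for child in value.values():
                walk(child, False)
        elif isinstance(value, list):
            for item in value:
                walk(item, False)

    for node in graph:
        walk(node, True)
    return sorted(referenced - defined)

# Invented sample: the WebSite node is referenced but never defined
sample = json.dumps({
    "@context": "https://schema.org",
    "@graph": [
        {"@type": "Organization",
         "@id": "https://www.example.com/#organization",
         "name": "Example Tools"},
        {"@type": "WebPage",
         "@id": "https://www.example.com/page#webpage",
         "isPartOf": {"@id": "https://www.example.com/#website"},
         "publisher": {"@id": "https://www.example.com/#organization"}},
    ],
})

print(dangling_ids(sample))
```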

11.5 Ecommerce schema checklist

For products, expose:

product name
clear description
image
brand
SKU
GTIN where available
price
currency
availability
item condition
delivery information
return policy where supported
valid review and rating data
variant attributes
canonical product URL

Category pages should include:

useful intro copy
clear H1 and H2s
crawlable product links
breadcrumbs
ItemList where appropriate
buying guidance
FAQs where genuinely present
indexation and canonical rules for filters

12. Freshness and change signalling

AI answers can become wrong when source material is stale. No single date field proves that a page is current, so freshness is best communicated through several signals that agree with each other.

12.1 Freshness signals to align

Use:

visible last updated date
datePublished and dateModified in structured data
accurate XML sitemap lastmod
HTTP Last-Modified where possible
ETag and conditional request support
changelog sections for critical pages
IndexNow for Bing and participating engines
updated product feeds and merchant data
updated business profiles
current author and reviewer information

12.2 Sitemap lastmod example

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/guides/ai-search-optimisation</loc>
    <lastmod>2026-06-07T09:14:00+00:00</lastmod>
  </url>
</urlset>

12.3 Rules for honest lastmod

Do:

update lastmod only when the page content materially changes
use W3C Datetime (ISO 8601) timestamps where possible
include the time zone offset whenever a time is given
keep visible dates aligned with schema and sitemap dates
use changelogs when the topic is volatile

Do not:

set lastmod to the sitemap generation time
change all dates daily to look fresh
hide old content behind a new date
let schema and visible dates contradict each other
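
One way to keep lastmod honest is to derive it from a fingerprint of the main content instead of the build time. The sketch below is a minimal version of that idea. The record shape and function names are hypothetical, and a hash detects any text change, which is a looser test than "material" change, so editorial judgement may still be needed.

```python
import hashlib
from datetime import datetime, timezone

def content_fingerprint(main_content):
    # Collapse whitespace so formatting churn does not count as a change
    normalised = " ".join(main_content.split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def next_lastmod(previous, main_content, now):
    """previous: dict with 'fingerprint' and 'lastmod' from the last build, or None."""
    fingerprint = content_fingerprint(main_content)
    if previous and previous["fingerprint"] == fingerprint:
        return previous  # unchanged content keeps its old lastmod
    return {"fingerprint": fingerprint, "lastmod": now.isoformat(timespec="seconds")}

first_build = datetime(2026, 6, 7, 9, 14, tzinfo=timezone.utc)
later_build = datetime(2026, 6, 9, 12, 0, tzinfo=timezone.utc)

first = next_lastmod(None, "Price: 45 GBP", first_build)
same = next_lastmod(first, "Price:   45 GBP\n", later_build)     # whitespace only
changed = next_lastmod(first, "Price: 49 GBP", later_build)      # real change

print(first["lastmod"], same["lastmod"], changed["lastmod"])
```

Fingerprint only the main content, not the full template, otherwise a footer change would bump every page on the site.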

12.4 IndexNow example

curl -X POST "https://api.indexnow.org/indexnow" \
  -H "Content-Type: application/json; charset=utf-8" \
  -d '{
    "host": "www.example.com",
    "key": "YOUR_KEY",
    "keyLocation": "https://www.example.com/YOUR_KEY.txt",
    "urlList": [
      "https://www.example.com/guides/ai-search-optimisation",
      "https://www.example.com/products/example-product"
    ]
  }'

Use IndexNow for important additions, updates and deletions, especially on ecommerce, news, jobs, listings, pricing, stock and other time-sensitive pages.

13. Machine-readable discovery paths

Traditional discovery files still matter most:

robots.txt
XML sitemaps
RSS or Atom feeds where relevant
internal linking
canonical URLs
structured data
product feeds
Merchant Center or equivalent feeds
Business Profile data

13.1 XML sitemaps

Keep XML sitemaps clean.

Include only:

canonical URLs
indexable URLs
final 200 URLs
important content
accurate lastmod

Remove:

redirected URLs
noindexed URLs
blocked URLs
canonicalised-away URLs
soft 404s
parameter noise
internal search URLs
duplicate locale URLs
expired product pages unless intentionally indexable
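
The include and remove rules above can be enforced when the sitemap is generated. The sketch below assumes a hypothetical list of page records with `url`, `status`, `indexable`, `canonical` and `lastmod` keys, and writes only the entries that pass all three checks.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Emit sitemap XML for pages that return 200, are indexable and self-canonical."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        keep = (
            page["status"] == 200
            and page["indexable"]
            and page["canonical"] == page["url"]
        )
        if not keep:
            continue
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["url"]
        ET.SubElement(url, "lastmod").text = page["lastmod"]
    return ET.tostring(urlset, encoding="unicode")

# Invented crawl records
pages = [
    {"url": "https://www.example.com/guides/a", "status": 200, "indexable": True,
     "canonical": "https://www.example.com/guides/a", "lastmod": "2026-06-07"},
    {"url": "https://www.example.com/guides/a?sort=price", "status": 200, "indexable": True,
     "canonical": "https://www.example.com/guides/a", "lastmod": "2026-06-07"},
    {"url": "https://www.example.com/old", "status": 301, "indexable": False,
     "canonical": "https://www.example.com/guides/a", "lastmod": "2025-01-01"},
]

sitemap_xml = build_sitemap(pages)
print(sitemap_xml)
```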

13.2 llms.txt

llms.txt is an emerging convention, not a guaranteed ranking or inclusion mechanism.

Use it as an enhancement, especially for documentation, developer tools, support hubs, complex products and technical sites.

It can include:

a concise site or product summary
links to canonical docs
links to Markdown versions
API documentation
policy pages
product feeds
support resources
examples of important resources

Example:

# Example Tools

> Example Tools provides calibration equipment and technical guides for engineering teams.

## Core pages

- [About Example Tools](https://www.example.com/about)
- [Product catalogue](https://www.example.com/products)
- [Calibration guides](https://www.example.com/guides)

## Clean Markdown resources

- [Torque wrench calibration guide](https://www.example.com/guides/torque-wrench-calibration.md)
- [Return policy](https://www.example.com/policies/returns.md)

## API and feeds

- [Product feed](https://www.example.com/feeds/products.json)
- [OpenAPI specification](https://www.example.com/openapi.json)

13.3 llms-full.txt and context files

For some sites, a larger context file can help agents or developer tools. Keep it scoped and curated.

Include:

stable documentation
canonical definitions
API references
short examples
policy summaries
support instructions

Exclude:

private content
customer data
unpublished commercial information
outdated content
duplicated pages
legal text you do not want summarised incorrectly
massive unstructured dumps

13.4 Markdown alternatives

Markdown can reduce token overhead and layout noise for machines.

Useful patterns:

/page-name.html
/page-name.md

/docs/getting-started
/docs/getting-started.md

/api/reference
/api/reference.md

Do not use Markdown alternatives as a substitute for normal HTML SEO. They should support the main site, not replace it.

14. Performance and fetch reliability

AI retrieval systems often work with short timeouts and strict cost constraints. Fast, stable pages are easier to fetch and reuse.

14.1 Technical priorities

Prioritise:

fast TTFB
stable 200 responses
minimal redirects
CDN caching where appropriate
lightweight HTML
graceful degradation
correct cache headers
robust origin capacity
no accidental WAF blocks
no bot-specific server errors
no soft 404s
no redirect chains
no inconsistent canonical signals

14.2 Status code rules

Status	Use	AI search risk
200	Live canonical content	Good
301 or 308	Permanent redirect	Fine if single hop and intentional
302 or 307	Temporary redirect	Fine if genuinely temporary
304	Not modified	Good for efficient recrawling
401	Authentication required	Correct for private content
403	Forbidden	Dangerous if accidental bot block
404	Not found	Correct for missing content
410	Permanently gone	Useful for deliberate removals
429	Rate limited	Useful with Retry-After
5xx	Server failure	High risk if frequent

14.3 Caching headers

Use:

Cache-Control: public, max-age=300, stale-while-revalidate=3600
ETag: "abc123"
Last-Modified: Sun, 07 Jun 2026 09:14:00 GMT

Support conditional requests:

If-None-Match
If-Modified-Since

This helps reduce repeated downloads and improves crawl efficiency.
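
The decision a server makes for a conditional request can be sketched in a few lines. The function below is a simplified illustration, not a full HTTP implementation: it assumes headers arrive in a plain dict with the exact header names shown, splits `If-None-Match` naively on commas, and compares entity tags weakly.

```python
from email.utils import parsedate_to_datetime

def conditional_status(request_headers, etag, last_modified):
    """Return 304 if the client's cached copy is still valid, otherwise 200."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # If-None-Match takes precedence over If-Modified-Since
        candidates = [tag.strip().removeprefix("W/") for tag in if_none_match.split(",")]
        matched = "*" in candidates or etag.removeprefix("W/") in candidates
        return 304 if matched else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since:
        try:
            unchanged = (parsedate_to_datetime(last_modified)
                         <= parsedate_to_datetime(if_modified_since))
        except (TypeError, ValueError):
            unchanged = False  # unparseable date: send the full response
        if unchanged:
            return 304
    return 200

ETAG = '"abc123"'
LAST_MODIFIED = "Sun, 07 Jun 2026 09:14:00 GMT"

print(conditional_status({"If-None-Match": '"abc123"'}, ETAG, LAST_MODIFIED))
print(conditional_status({"If-Modified-Since": LAST_MODIFIED}, ETAG, LAST_MODIFIED))
print(conditional_status({}, ETAG, LAST_MODIFIED))
```

Most web servers and CDNs handle this automatically for static files. The point of checking is to confirm that dynamic pages also emit validators and answer with 304 when nothing has changed.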

15. Accessibility and agentic usability

Agentic systems increasingly interact with pages more like assistive technologies than classic crawlers. Accessibility improvements can help both users and agents.

Prioritise:

clear labels on form fields
descriptive button text
ARIA only where needed and correctly implemented
keyboard navigation
visible error messages
stable page landmarks
form validation that is machine-readable
descriptive link text
clear table headers
meaningful alt text
no critical instructions hidden only in images

For transactional journeys, agents need to understand:

what the next step is
what each field expects
what is required or optional
what went wrong
what action was completed
whether a price or booking is final

16. Measurement framework

Standard analytics capture only part of AI visibility.

You need three measurement layers.

flowchart TD
    A[AI visibility measurement] --> B[Human referral tracking]
    A --> C[Server-side bot logging]
    A --> D[Prompt-based share of voice]

    B --> B1[GA4 channels]
    B --> B2[UTM parameters]
    B --> B3[Referral source domains]

    C --> C1[CDN logs]
    C --> C2[WAF logs]
    C --> C3[Server logs]
    C --> C4[Edge workers]

    D --> D1[Brand mentioned?]
    D --> D2[Cited URLs]
    D --> D3[Sentiment]
    D --> D4[Accuracy]
    D --> D5[Competitors]

16.1 AI assistant referral tracking

Track known referrers and UTM sources such as:

chatgpt.com
openai.com
perplexity.ai
claude.ai
gemini.google.com
copilot.microsoft.com
bing.com where relevant

Example GA4 custom channel regex:

.*(chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|bing\.com).*

Limitations:

AI Overview clicks may appear as Google organic.
Mobile apps may strip referrers.
Some AI browsers or agentic contexts may appear as direct.
Zero-click answers send no user session at all.
Crawlers and fetchers are not measured by client-side analytics.
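
The channel pattern above can be tested offline before it goes into an analytics configuration. The sketch below applies the same expression with Python's `re` module to the referrer hostname. It assumes full-match semantics, which is why the pattern is wrapped in `.*`, and the sample referrers are invented.

```python
import re
from urllib.parse import urlsplit

AI_REFERRER = re.compile(
    r".*(chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|"
    r"gemini\.google\.com|copilot\.microsoft\.com|bing\.com).*"
)

def is_ai_referral(referrer):
    """True if the referrer hostname matches the AI assistant pattern."""
    host = urlsplit(referrer).hostname or ""
    return AI_REFERRER.fullmatch(host) is not None

samples = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=torque+wrench",
    "https://www.google.com/",
    "https://notchatgpt.com.example.org/",   # lookalike host
    "",                                      # stripped referrer
]
for referrer in samples:
    print(repr(referrer), is_ai_referral(referrer))
```

The lookalike host matches because the pattern is unanchored. If that matters for reporting, a stricter variant could require the domain to sit at the start of the host or directly after a dot.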

16.2 Server-side AI bot logging

Log AI bot activity separately from human sessions.

Useful fields:

timestamp
requested URL
method
status code
user-agent
detected bot name
bot category
verified bot status
referrer where present
country or region where appropriate
cache status
response time
response size
robots policy result
WAF action
crawl purpose if classified

Recommended categories:

search and retrieval bot
training bot
user-triggered fetcher
agentic action bot
unknown AI-like traffic
spoofed or failed verification
non-AI bot
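
A first-pass classifier for log lines can map user-agent tokens to these categories. The sketch below uses only the crawler names listed earlier in this guide. The sample user-agent strings are invented, and a match is a claim by the client, not proof of identity, so the result should be combined with verification before it is trusted.

```python
BOT_CATEGORIES = {
    "OAI-SearchBot": "search and retrieval bot",
    "Claude-SearchBot": "search and retrieval bot",
    "PerplexityBot": "search and retrieval bot",
    "GPTBot": "training bot",
    "ClaudeBot": "training bot",
    "CCBot": "training bot",
    "ChatGPT-User": "user-triggered fetcher",
    "Claude-User": "user-triggered fetcher",
    "Perplexity-User": "user-triggered fetcher",
}

def classify_user_agent(user_agent):
    """Return (token, category) for the first known token found in the string."""
    lowered = user_agent.lower()
    # Check longer tokens first so a short token cannot shadow a longer one
    for token in sorted(BOT_CATEGORIES, key=len, reverse=True):
        if token.lower() in lowered:
            return token, BOT_CATEGORIES[token]
    return None, "unclassified"

# Invented user-agent strings for illustration
print(classify_user_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(classify_user_agent("Mozilla/5.0 (compatible; Perplexity-User/1.0)"))
print(classify_user_agent("Mozilla/5.0 (Windows NT 10.0) Safari/537.36"))
```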

16.3 Bot verification

Do not rely only on user-agent strings.

Use:

published IP ranges
reverse DNS
forward-confirmed reverse DNS
ASN checks
TLS fingerprinting where available
request patterns
robots.txt fetch behaviour
crawl rate behaviour
WAF verified bot categories
known provider JSON IP files
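
Forward-confirmed reverse DNS can be sketched as follows: resolve the IP to a hostname, check the hostname against an expected suffix, then resolve the hostname back and confirm it returns the same IP. The suffix `bot.example.net` and the addresses are placeholders, not real provider values, and not every provider supports this method. The lookups are injectable so the logic can be tested without network access. `socket.gethostbyname_ex` handles IPv4 only.

```python
import socket

def forward_confirmed_rdns(ip, allowed_suffixes,
                           reverse_lookup=socket.gethostbyaddr,
                           forward_lookup=socket.gethostbyname_ex):
    """True if ip reverse-resolves to an allowed hostname that resolves back to ip."""
    try:
        hostname = reverse_lookup(ip)[0].rstrip(".").lower()
    except OSError:
        return False
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False
    return ip in forward_ips

# Fake resolvers standing in for DNS (placeholder hostnames and addresses)
def fake_reverse(ip):
    return ("crawl-1.bot.example.net", [], [ip])

def fake_forward(hostname):
    return (hostname, [], ["192.0.2.10"])

genuine = forward_confirmed_rdns("192.0.2.10", ["bot.example.net"], fake_reverse, fake_forward)
spoofed = forward_confirmed_rdns("192.0.2.99", ["bot.example.net"], fake_reverse, fake_forward)
print(genuine, spoofed)
```

In production, cache the results and run the check asynchronously, since DNS lookups on the request path add latency.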

16.4 Privacy and GDPR notes

Avoid exposing raw IP addresses in public or client-facing dashboards. For AI bot analytics, prefer:

aggregation
truncation
hashing with rotating salt
short retention periods
role-based access
clear lawful basis
separation of security logs from marketing reports

For UK and EU contexts, treat IP addresses carefully as personal data where they can relate to individuals, even if many AI bot IPs are public infrastructure addresses.
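
Truncation and salted hashing can both be done with the standard library. The sketch below is illustrative: the /24 and /48 prefix lengths and the 16-character digest are assumptions, and a keyed hash is pseudonymisation, not anonymisation, so it reduces exposure without removing the data from scope.

```python
import hashlib
import hmac
import ipaddress

def truncate_ip(ip):
    """Zero the host part: keep /24 for IPv4 and /48 for IPv6."""
    address = ipaddress.ip_address(ip)
    prefix = 24 if address.version == 4 else 48
    network = ipaddress.ip_network(f"{address}/{prefix}", strict=False)
    return str(network.network_address)

def pseudonymise_ip(ip, salt):
    """Keyed hash of the address. Rotate the salt regularly and discard old salts."""
    return hmac.new(salt, ip.encode("ascii"), hashlib.sha256).hexdigest()[:16]

# Documentation-range addresses and placeholder salts
print(truncate_ip("203.0.113.77"))
print(truncate_ip("2001:db8:abcd:12::1"))

monday = pseudonymise_ip("203.0.113.77", b"salt-for-monday")
tuesday = pseudonymise_ip("203.0.113.77", b"salt-for-tuesday")
print(monday != tuesday)
```

Within one salt period the same address always maps to the same value, so per-bot counts still work. After rotation the values can no longer be joined across periods.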

16.5 Prompt-based share of voice

Track what AI systems say, not just traffic.

Measure:

whether your brand appears
where it appears in the answer
whether it is cited
which URL is cited
whether competitors appear
sentiment and framing
factual accuracy
product, price and availability accuracy
whether old URLs are cited
whether the answer changes over time

Example prompt set:

What is the best [category] for [use case]?
Which [provider type] is suitable for [audience]?
Compare [brand] with [competitor].
How much does [product] cost?
Is [brand] available in [location]?
What are the alternatives to [brand]?
What are the main problems with [category]?

Run prompts regularly, but treat outputs as variable. AI answers can change by location, session, model, query phrasing and time.
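
Once prompt runs are stored, two of the simplest measures are mention rate and citation rate. The sketch below assumes a hypothetical record shape with an `answer` string and a `cited_urls` list. The brand, domain and sample answers are invented, and plain substring matching on the brand name will miss variants and misspellings.

```python
from urllib.parse import urlsplit

def cites_domain(url, domain):
    host = urlsplit(url).hostname or ""
    return host == domain or host.endswith("." + domain)

def share_of_voice(runs, brand, domain):
    """Fraction of runs that mention the brand and that cite the domain."""
    total = len(runs)
    if total == 0:
        return {"runs": 0, "mention_rate": 0.0, "citation_rate": 0.0}
    mentions = sum(brand.lower() in run["answer"].lower() for run in runs)
    citations = sum(
        any(cites_domain(url, domain) for url in run["cited_urls"]) for run in runs
    )
    return {
        "runs": total,
        "mention_rate": mentions / total,
        "citation_rate": citations / total,
    }

# Invented prompt runs
runs = [
    {"answer": "Example Tools offers calibration kits.",
     "cited_urls": ["https://www.example.com/guides/torque-wrench-calibration"]},
    {"answer": "Options include example tools and others.", "cited_urls": []},
    {"answer": "Several vendors sell calibration kits.",
     "cited_urls": ["https://competitor.example.org/kits"]},
    {"answer": "No clear recommendation.", "cited_urls": []},
]

summary = share_of_voice(runs, brand="Example Tools", domain="example.com")
print(summary)
```

Because outputs vary between runs, rates like these are more meaningful as trends over repeated samples than as single readings.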

17. Governance and legal risk

AI search visibility is not only an SEO decision. It touches copyright, licensing, privacy, infrastructure cost and commercial strategy.

17.1 Separate the decisions

Make separate policy decisions for:

traditional search indexing
AI search inclusion
model training
user-triggered retrieval
agentic interaction
Common Crawl and third-party datasets
paywalled content
premium documentation
personal data
partner or client content

17.2 robots.txt is not enough for sensitive content

If content must not be accessed, use:

authentication
paywalls
signed URLs
firewall rules
contractual controls
licence terms
noindex where search exclusion is needed
removal requests where appropriate
WAF enforcement
access logs and alerts

Do not rely on robots.txt to protect confidential or regulated material.

17.3 UK and Google AI search controls

As of June 2026, Google has begun testing a new Search Console control in the UK that lets some website owners manage whether their content appears in and helps ground generative AI Search features such as AI Overviews and AI Mode. Sites that opt out of those generative AI features should not receive traffic or impressions from them, while Google says the control will not be used as a ranking signal outside those generative AI features.

This is new and may be limited in availability while it is tested. Treat it as an evolving governance control, not a replacement for technical SEO, robots policy, snippets or access control.

18. Technical audit checklist

18.1 Crawl and access

Important pages return 200.
robots.txt does not block desired search crawlers.
OAI-SearchBot is allowed if ChatGPT search visibility is desired.
Claude-SearchBot is allowed if Claude search visibility is desired.
PerplexityBot is allowed if Perplexity visibility is desired.
Googlebot is allowed for Google Search and Google AI features.
Training bots are allowed or blocked deliberately.
User-triggered fetcher policy is documented.
WAF rules do not accidentally block desired bots.
Bot verification is implemented where possible.
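
The robots.txt items in this checklist can be spot-checked with Python's standard `urllib.robotparser`. This is a sketch under stated assumptions: the sample policy and the `audit` helper are hypothetical, and Python's parser applies the first matching rule in file order, which may differ from how individual crawlers resolve conflicting rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: allow one AI search bot, block one training bot.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def audit(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Report whether each user agent may fetch the URL under this policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = audit(
    ROBOTS_TXT,
    ["OAI-SearchBot", "GPTBot", "Claude-SearchBot"],
    "https://www.example.com/guides/ai-search-optimisation",
)
```

A check like this only confirms what the policy says. It does not confirm that a WAF or CDN lets the same bots through, so it complements log review rather than replacing it.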

18.2 Rendering and extraction

Main content appears in initial HTML.
Internal links appear in crawlable HTML.
Canonical tags appear in raw source.
Robots meta tags are not injected late by JavaScript.
Structured data appears in source or reliably rendered HTML.
Important facts are not only in images.
Product details are not only in tabs or API calls.
Infinite scroll has crawlable pagination or links.
Pages work with graceful degradation.
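
Several of these checks come down to what is present in the unrendered source. The sketch below uses the standard `html.parser` module to pull the canonical tag, robots meta tag, JSON-LD block count and anchor links out of raw HTML. The `RawSignalParser` class, the `raw_signals` helper and the sample markup are illustrative assumptions; a real audit would fetch the source without executing JavaScript and compare it with the rendered DOM.

```python
from html.parser import HTMLParser


class RawSignalParser(HTMLParser):
    """Collects crawl-relevant signals from unrendered HTML source."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None
        self.robots_meta = None
        self.jsonld_blocks = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if tag == "link" and attr.get("rel", "").lower() == "canonical":
            self.canonical = attr.get("href")
        elif tag == "meta" and attr.get("name", "").lower() == "robots":
            self.robots_meta = attr.get("content")
        elif tag == "script" and attr.get("type", "").lower() == "application/ld+json":
            self.jsonld_blocks += 1
        elif tag == "a" and attr.get("href"):
            self.links.append(attr["href"])


def raw_signals(html: str) -> dict:
    """Parse raw HTML and return the signals found in it."""
    parser = RawSignalParser()
    parser.feed(html)
    parser.close()
    return {
        "canonical": parser.canonical,
        "robots_meta": parser.robots_meta,
        "jsonld_blocks": parser.jsonld_blocks,
        "links": parser.links,
    }


# Hypothetical page source used only to demonstrate the parser.
RAW_HTML = """<html><head>
<link rel="canonical" href="https://www.example.com/guides/ai-search-optimisation">
<meta name="robots" content="index, follow">
<script type="application/ld+json">{"@type": "Article"}</script>
</head><body><h1>Guide</h1><a href="/guides/">All guides</a></body></html>"""

signals = raw_signals(RAW_HTML)
```

If any of these values is missing from the raw source but present after rendering, the signal depends on JavaScript and is at risk with crawlers that do not render.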

18.3 Canonical and duplication

One canonical URL per content item.
Redirects, canonicals and sitemaps agree.
Internal links point to canonical URLs.
Parameter URLs are controlled.
Facets are indexed only where useful.
Hreflang references canonical equivalents.
Noindexed pages are not listed in sitemaps.
Redirected URLs are not listed in sitemaps.
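
The sitemap items above can be checked by joining sitemap URLs against crawl results. The following sketch assumes a crawl export shaped as a dictionary keyed by URL with `status`, `noindex` and `canonical` fields. That shape, the function names and the sample data are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Extract every <loc> value from a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def sitemap_issues(urls: list[str], crawl: dict[str, dict]) -> dict[str, str]:
    """Flag sitemap URLs that redirect, error, are noindexed or are non-canonical."""
    issues = {}
    for url in urls:
        page = crawl.get(url)
        if page is None:
            issues[url] = "not crawled"
        elif 300 <= page["status"] < 400:
            issues[url] = "redirects"
        elif page["status"] != 200:
            issues[url] = f"status {page['status']}"
        elif page["noindex"]:
            issues[url] = "noindex"
        elif page["canonical"] != url:
            issues[url] = "canonical points elsewhere"
    return issues


# Hypothetical sitemap and crawl export.
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/guides/</loc></url>
  <url><loc>https://www.example.com/old-guide</loc></url>
  <url><loc>https://www.example.com/guides/?sort=new</loc></url>
</urlset>"""

CRAWL = {
    "https://www.example.com/guides/": {
        "status": 200, "noindex": False, "canonical": "https://www.example.com/guides/"},
    "https://www.example.com/old-guide": {
        "status": 301, "noindex": False, "canonical": None},
    "https://www.example.com/guides/?sort=new": {
        "status": 200, "noindex": False, "canonical": "https://www.example.com/guides/"},
}

issues = sitemap_issues(sitemap_urls(SITEMAP), CRAWL)
```

An empty result means redirects, canonicals and the sitemap agree for every listed URL.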

18.4 Structure and content

Each priority page starts with a clear summary.
Sections have descriptive headings.
Important sections answer one clear question.
Dates, prices, locations and identifiers are visible.
Comparisons include clear criteria.
Claims are supported by evidence.
Thin prompt-targeted pages are avoided.
Content is unique and useful.

18.5 Structured data

JSON-LD matches visible content.
Organization schema is consistent.
Product schema includes price, currency and availability where relevant.
Article schema includes author and dates where relevant.
Breadcrumb schema matches visible breadcrumbs.
FAQ schema is used only for visible FAQs.
sameAs links point to trusted profiles.
dateModified is accurate.
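
The first item, JSON-LD matching visible content, can be approximated with a simple comparison. The sketch below is deliberately crude: it assumes a Product with a single Offer object and uses plain substring matching, which can produce false positives (a price of 49 would also match visible text containing 149). The `jsonld_mismatches` helper and the sample data are hypothetical.

```python
import json

# Hypothetical Product markup and the visible text of the same page.
PRODUCT_JSONLD = """{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Analytics Pro",
  "offers": {"@type": "Offer", "price": "49", "priceCurrency": "GBP"}
}"""

VISIBLE_TEXT = "Example Analytics Pro costs £49 per month, excluding VAT."


def jsonld_mismatches(jsonld_text: str, visible_text: str) -> list[str]:
    """Return JSON-LD fields whose values do not appear in the visible text."""
    data = json.loads(jsonld_text)
    offer = data.get("offers") or {}
    expected = {"name": data.get("name"), "price": offer.get("price")}
    return [
        field
        for field, value in expected.items()
        if value is None or str(value) not in visible_text
    ]


mismatches = jsonld_mismatches(PRODUCT_JSONLD, VISIBLE_TEXT)
```

Anything the function returns is a candidate for manual review, not proof of an error.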

18.6 Freshness

XML sitemap lastmod is accurate.
Visible last updated dates are used where useful.
dateModified matches real content changes.
IndexNow is implemented for Bing where appropriate.
Product feeds are current.
Merchant and business profiles are current.
Changelogs exist for volatile or compliance-sensitive pages.
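
One way to test whether freshness signals are honest is to compare the dates a page exposes in different places. This sketch assumes each date is an ISO 8601 string whose first ten characters are the calendar date. The `freshness_flags` helper and the source labels are hypothetical, and time zone differences are ignored.

```python
from datetime import date


def freshness_flags(dates: dict[str, str], today: date) -> list[str]:
    """Flag disagreement between date sources and any date in the future."""
    # Keep only the YYYY-MM-DD part so datetimes and dates compare cleanly.
    parsed = {source: date.fromisoformat(value[:10]) for source, value in dates.items()}
    flags = []
    if len(set(parsed.values())) > 1:
        flags.append("dates disagree")
    flags.extend(
        f"{source} is in the future" for source, day in parsed.items() if day > today
    )
    return flags


# Hypothetical values for one page.
consistent = freshness_flags(
    {
        "sitemap_lastmod": "2026-06-07T09:14:00Z",
        "jsonld_dateModified": "2026-06-07",
        "visible_last_updated": "2026-06-07",
    },
    today=date(2026, 6, 7),
)
```

A page that passes still needs a human check that the content actually changed on that date.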

18.7 Measurement

AI referrers are grouped in analytics.
ChatGPT UTM referrals are tracked where present.
Server-side bot logging is active.
Bot categories are separated.
Raw IP exposure is minimised.
Prompt-based share of voice is monitored.
Cited URLs are tracked.
Answer accuracy is reviewed regularly.

19. Prioritised implementation plan

Phase 1: Access and extraction

Goal: Make sure AI systems can fetch and read priority content.

Actions:

Identify priority URL groups.
Crawl the site using the Googlebot user agent and the user agents of major AI bots.
Compare source HTML versus rendered HTML.
Fix blocked, redirected, noindexed or unstable priority pages.
Make core content available in initial HTML.
Reduce unnecessary DOM and boilerplate.
Check WAF and CDN bot handling.
Add or correct robots.txt policy by bot purpose.

Phase 2: Canonical and discovery foundation

Goal: Make source selection unambiguous.

Actions:

Clean XML sitemaps.
Align internal links with canonical URLs.
Fix canonical conflicts.
Remove redirected and noindexed URLs from sitemaps.
Control parameter and faceted URLs.
Add accurate lastmod.
Add IndexNow where appropriate.
Reference sitemaps in robots.txt.

Phase 3: Understanding and entities

Goal: Help machines understand what pages, products and organisations mean.

Actions:

Improve semantic HTML.
Restructure pages into clear sections.
Add or refine JSON-LD.
Build consistent Organization and Product entity patterns.
Add sameAs links where useful.
Align schema, visible content and external profiles.
Update author and reviewer profiles where trust matters.

Phase 4: Answer usefulness

Goal: Make pages usable as evidence in answers.

Actions:

Add direct summaries at the top of pages.
Add comparison tables and decision guidance.
Clarify pricing, suitability and limitations.
Add evidence for strong claims.
Add dates and change notes.
Improve images, captions and alt text.
Remove thin or duplicated AI prompt pages.

Phase 5: Measurement and governance

Goal: Prove visibility and manage risk.

Actions:

Add AI referral channels.
Build server-side AI bot dashboards.
Verify known bots.
Track prompt-based share of voice.
Monitor cited URLs and competitor presence.
Review training bot policy quarterly.
Review WAF and rate limits.
Review privacy and IP handling.

20. Useful reporting dashboard structure

20.1 AI bot activity dashboard

Dimensions	Metrics
date; bot name; bot category; verified status; requested URL; URL type; status code; response time; cache status; country or region; WAF action	requests; unique URLs requested; 200 responses; 3xx responses; 4xx responses; 5xx responses; average response time; cache hit ratio; top crawled directories; crawl spikes; failed verification count
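
Several of these metrics can be derived directly from structured log records. The sketch below groups records by bot and computes request counts, unique URLs, status classes, average response time and cache hit ratio. It assumes records that use the field names from the server log example in section 22.3. The `summarise_by_bot` function and the sample records are hypothetical.

```python
from collections import Counter, defaultdict


def summarise_by_bot(records: list[dict]) -> dict[str, dict]:
    """Aggregate per-bot dashboard metrics from structured log records."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["bot_name"]].append(record)

    summary = {}
    for bot, rows in grouped.items():
        total = len(rows)
        summary[bot] = {
            "requests": total,
            "unique_urls": len({row["url"] for row in rows}),
            # Bucket status codes into classes such as "2xx" and "4xx".
            "status_classes": dict(Counter(f"{row['status'] // 100}xx" for row in rows)),
            "avg_response_time_ms": sum(row["response_time_ms"] for row in rows) / total,
            "cache_hit_ratio": sum(row["cache_status"] == "HIT" for row in rows) / total,
        }
    return summary


# Hypothetical records for illustration.
RECORDS = [
    {"bot_name": "OAI-SearchBot", "url": "/guides/a", "status": 200,
     "response_time_ms": 80, "cache_status": "HIT"},
    {"bot_name": "OAI-SearchBot", "url": "/guides/a", "status": 200,
     "response_time_ms": 120, "cache_status": "MISS"},
    {"bot_name": "OAI-SearchBot", "url": "/guides/b", "status": 404,
     "response_time_ms": 100, "cache_status": "MISS"},
    {"bot_name": "PerplexityBot", "url": "/guides/a", "status": 200,
     "response_time_ms": 90, "cache_status": "HIT"},
]

summary = summarise_by_bot(RECORDS)
```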

20.2 AI referral dashboard

Dimensions	Metrics
source; medium; landing page; device; country; content group; conversion type	sessions; engaged sessions; conversions; revenue; average engagement time; assisted conversions; new users; returning users

20.3 Prompt share-of-voice dashboard

Dimensions	Metrics
platform; prompt; prompt category; date; country or locale; brand mentioned; cited URL; competitor cited; sentiment; answer accuracy	brand appearance rate; citation rate; first citation rate; competitor share; incorrect answer rate; outdated URL rate; missing price rate; missing availability rate
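
The rate metrics in this dashboard are simple proportions over recorded prompt runs. The sketch below shows one possible set of definitions. The guide does not define these metrics formally, so the formulas, the row fields (`brand_mentioned`, `cited_url`, `answer_accurate`) and the sample rows are assumptions.

```python
def share_of_voice(rows: list[dict]) -> dict[str, float]:
    """Compute simple rates over a list of recorded prompt observations."""
    total = len(rows)
    if total == 0:
        return {}

    def rate(predicate) -> float:
        return sum(1 for row in rows if predicate(row)) / total

    return {
        # Share of answers that mention the brand at all.
        "brand_appearance_rate": rate(lambda row: row["brand_mentioned"]),
        # Share of answers that cite one of the site's URLs.
        "citation_rate": rate(lambda row: row["cited_url"] is not None),
        # Share of answers a reviewer marked as inaccurate.
        "incorrect_answer_rate": rate(lambda row: not row["answer_accurate"]),
    }


# Hypothetical observations from four prompt runs.
ROWS = [
    {"brand_mentioned": True, "cited_url": "https://www.example.com/pricing", "answer_accurate": True},
    {"brand_mentioned": True, "cited_url": None, "answer_accurate": False},
    {"brand_mentioned": False, "cited_url": None, "answer_accurate": True},
    {"brand_mentioned": True, "cited_url": "https://www.example.com/guides/a", "answer_accurate": True},
]

metrics = share_of_voice(ROWS)
```

Because AI answers vary by session and location, rates from a small sample are noisy. Trends over repeated runs are more informative than any single figure.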

21. Myths and corrections

Myth: AI search needs special AI schema

Correction: Google says there is no special schema required for AI Overviews or AI Mode. Use normal structured data that matches visible content.

Myth: llms.txt replaces sitemaps

Correction: llms.txt is an experimental enhancement. XML sitemaps, internal links, canonical tags and accessible HTML still matter more.

Myth: Blocking Google-Extended blocks AI Overviews

Correction: Google-Extended governs some Gemini training and grounding uses. It does not affect inclusion in Google Search. Google Search AI features are controlled through Googlebot access and Search preview controls.

Myth: Allowing all AI bots is always good

Correction: Training, search, user-triggered and agentic fetches are different. Decide by purpose.

Myth: robots.txt protects private content

Correction: robots.txt is not access authorisation. Use authentication or other enforcement for private content.

Myth: JavaScript rendering is always fine because Google can render

Correction: Some crawlers can render, some cannot, and rendering can fail. Initial HTML remains the safest place for critical content.

Myth: AI traffic is visible in GA4

Correction: GA4 captures some AI referrals, but it cannot see server-side crawler requests, visits where the referrer is stripped, or zero-click exposure, and it does not separate AI Overview clicks from other Google organic traffic.


Myth: AI only uses semantic embeddings, so keywords no longer matter

Correction: Hybrid retrieval uses both semantic and lexical signals. Exact names, dates, product IDs and terminology still matter.

22. Reference implementation snippets

22.1 HTML summary block

<section class="answer-summary">
  <h2>Summary</h2>
  <p>
    AI search optimisation improves how AI systems discover, retrieve,
    interpret and cite a website. The most important technical foundations
    are crawl access, server-rendered content, canonical clarity,
    structured data, freshness and server-side measurement.
  </p>
</section>

22.2 Product facts block

<section id="product-facts">
  <h2>Product facts</h2>
  <dl>
    <dt>Product name</dt>
    <dd>Example Analytics Pro</dd>

    <dt>Price</dt>
    <dd>£49 per month, excluding VAT</dd>

    <dt>Availability</dt>
    <dd>Available in the United Kingdom and European Union</dd>

    <dt>Last updated</dt>
    <dd><time datetime="2026-06-07">7 June 2026</time></dd>
  </dl>
</section>

22.3 Server log fields

{
  "timestamp": "2026-06-07T09:14:00Z",
  "url": "https://www.example.com/guides/ai-search-optimisation",
  "method": "GET",
  "status": 200,
  "user_agent": "OAI-SearchBot/1.0",
  "bot_name": "OAI-SearchBot",
  "bot_category": "ai_search_retrieval",
  "verified_bot": true,
  "cache_status": "HIT",
  "response_time_ms": 84,
  "waf_action": "allow"
}

22.4 AI referral regex

(chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|bing\.com)
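
Applied unanchored to a full referrer string, the pattern above also matches look-alike hosts such as notchatgpt.com. The sketch below applies the same host list to the parsed hostname only and anchors it, so a match must be the whole host or a dot-separated suffix. The `is_ai_referrer` function is hypothetical. Note that bing.com also matches ordinary Bing search referrals, so that alternative may need narrowing depending on how Copilot traffic appears in your data.

```python
import re
from urllib.parse import urlparse

# Same host alternatives as the guide's pattern, anchored to the hostname.
AI_REFERRER_HOST = re.compile(
    r"(^|\.)(chatgpt\.com|openai\.com|perplexity\.ai|claude\.ai|"
    r"gemini\.google\.com|copilot\.microsoft\.com|bing\.com)$"
)


def is_ai_referrer(referrer: str) -> bool:
    """Return True when the referrer's hostname is one of the listed AI hosts."""
    host = (urlparse(referrer).hostname or "").lower()
    return bool(AI_REFERRER_HOST.search(host))


examples = {
    "https://chatgpt.com/": is_ai_referrer("https://chatgpt.com/"),
    "https://www.google.com/": is_ai_referrer("https://www.google.com/"),
}
```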

23. Final strategic model

The best technical AI search optimisation strategy answers ten questions:

Access: Can the right systems reach the content?
Extraction: Can they read the important content without rendering friction?
Canonicalisation: Can they identify the correct source URL?
Chunking: Can they split the page into useful passages?
Understanding: Can they identify entities, facts and relationships?
Trust: Are claims supported, current and consistent?
Freshness: Can changes be detected quickly and honestly?
Governance: Are training, retrieval and user-triggered access controlled separately?
Measurement: Can you see bot activity, referrals and answer visibility?
Iteration: Are prompt outputs, citations and errors reviewed continuously?

The ultimate goal is not to trick AI systems. It is to make your site the easiest, clearest and most reliable source for the facts your audience already needs.

Chrome extension

Test the technical signals with SEO Scrubbox

SEO Scrubbox is a Chrome extension for technical SEO and AI crawl optimisation. It helps compare view-source and rendered signals, spot canonical drift, validate JSON-LD, audit sitemaps, hreflang, redirects, headers, CrUX metrics, and crawler access without leaving the page.

It is built for the workflow in this guide: checking whether bots can see the right HTML, schema, crawl controls, links and response signals before you move on to logs, dashboards and longer-term monitoring.


Sources

The references below were used while preparing this guide and related Scrubnet AI visibility guidance. External sources were accessed on 7 June 2026 unless noted otherwise.
