Why this map exists
Between mid-2023 and early 2026 the web acquired a new layer of plumbing. Where site owners used to have exactly
one machine-readable control — robots.txt — they now ship four overlapping protocols
and a handful of HTML / HTTP signals, all aimed at a single problem: tell the new species of crawler
what it may read, what it may train on, and where to find the canonical version of the truth.
The protocols evolved out of order. OpenAI shipped GPTBot in August 2023.[1]
Google followed with Google-Extended in September 2023.[2]
Apple followed with Applebot-Extended in mid-2024.[3]
Jeremy Howard published the llms.txt proposal in September 2024.[4]
Spawning's ai.txt appeared as an opt-out registry around the same time.[5]
Anthropic split its single ClaudeBot into a fleet of named agents over 2024–2025.[6]
Perplexity introduced Perplexity-User as a distinct live-fetch agent.[7]
Nothing was coordinated.
The result is a quiet mess. Most site owners we audit have a robots.txt that blocks
GPTBot "because legal said so" and an llms.txt they were told to create by an SEO blog,
and the two contradict each other. This guide is a single reference: every directive
that exists, who reads it, and what to put where.
The four files that shape AI-crawler behaviour
Before we go field by field, the map. There are four files. They do different jobs. None of them replaces another.
| File | Layer | Format | What it answers |
|---|---|---|---|
| /robots.txt | Gate-keeping | RFC 9309 plain text | Which crawlers may fetch which paths. |
| /llms.txt | Explanatory | Markdown | What's on this site, organised for AI consumption. |
| /llms-full.txt | Ingestion | Markdown, concatenated | The actual content, pre-flattened so an LLM can read the whole site in one fetch. |
| /ai.txt & meta tags | Licensing | Plain text / HTML | Whether content may be used for training, separately from access. |
Think of it as four concentric rings: robots.txt decides if the bot may knock on the door,
llms.txt hands it a map of the rooms, llms-full.txt hands it the contents of every room
in one bundle, and ai.txt tells it which rooms the photographs may leave with.
Section 1. llms.txt: the explanatory layer
Jeremy Howard's llms.txt proposal solves a specific problem: an LLM has a small context window
and your site has hundreds of pages, half of them noise (legal, footer links, marketing chrome). The model
doesn't know which pages are the canonical reference. llms.txt answers that question in a
format the model already excels at: Markdown.[4]
1.1. The required field: H1
Exactly one top-level heading, at the start of the file. It is the name of the project or product. Not the company tagline, not "Welcome." A reader (human or model) should be able to identify the entity in under two seconds.
# Acme Analytics
1.2. The recommended field: blockquote summary
A single Markdown blockquote, immediately after the H1. One to three sentences. This is what an LLM will quote verbatim when summarising your site.
> Acme Analytics is a usage-analytics platform for self-serve SaaS,
> built to surface revenue-correlated product events without manual instrumentation.
1.3. Optional intro prose
Free-form Markdown between the blockquote and the first H2. Use it sparingly — every paragraph here
consumes context that could otherwise be a link. Most well-formed llms.txt files keep this
section under 200 words and reserve elaboration for the linked Markdown pages themselves.
1.4. H2 sections (link collections)
Every ## heading introduces a labelled collection of links. The label is free-form,
but a small set of conventions has emerged: ## Docs, ## Examples,
## API, ## Blog. Links inside each section follow a strict grammar:
## Docs
- [Quickstart](https://acme.com/docs/quickstart.md): A 5-minute walk-through of event ingestion.
- [Reference](https://acme.com/docs/reference.md): Every event property, with examples.
Each list item is one link, with the title in brackets and an optional one-line description after a colon.
Linking to a Markdown version of the page (note .md in the example) is best practice — it lets
the crawler skip your HTML chrome.
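The link grammar above is regular enough to check mechanically. The following is a minimal sketch, not a reference parser: the `LlmsLink` type, the `parse_link_line` helper, and the exact regular expression are our own illustrative assumptions about how a consumer might read one list item.

```python
import re
from typing import NamedTuple, Optional


class LlmsLink(NamedTuple):
    """One list item from an llms.txt link section (hypothetical helper type)."""
    title: str
    url: str
    description: Optional[str]


# "- [Title](URL)" with an optional ": description" tail.
LINK_LINE = re.compile(
    r"^-\s+\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<description>.+))?$"
)


def parse_link_line(line: str) -> Optional[LlmsLink]:
    """Return the parsed link, or None if the line does not fit the grammar."""
    match = LINK_LINE.match(line.strip())
    if match is None:
        return None
    return LlmsLink(match["title"], match["url"], match["description"])
```

Running every list item in a file through a check like this before deploying catches the most common formatting slips, such as a missing bracket or a description placed before the link.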
1.5. The special ## Optional section
This is the one piece of unusual semantics in the spec. A heading literally named ## Optional
marks links that may be dropped if the model's context budget is tight.[4]
Put the truly canonical material (overview, getting started, pricing) above it; put deep-cut reference
material (changelogs, edge-case docs) inside it.
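To make the semantics concrete, here is a rough sketch of how a consumer might treat the section. The function names are hypothetical, and the character count is only a crude stand-in for a real token budget; actual tools may decide differently.

```python
def drop_optional(llms_txt: str) -> str:
    """Remove the '## Optional' section, keeping every other line as-is."""
    kept = []
    skipping = False
    for line in llms_txt.splitlines():
        if line.startswith("## "):
            # Each H2 either starts the skipped region or ends it.
            skipping = line[3:].strip() == "Optional"
        if not skipping:
            kept.append(line)
    return "\n".join(kept)


def fit_to_budget(llms_txt: str, max_chars: int) -> str:
    """Return the full file if it fits, else the file minus '## Optional'."""
    if len(llms_txt) <= max_chars:
        return llms_txt
    return drop_optional(llms_txt)
```

The practical consequence for authors is the one stated above: anything under the Optional heading should be material you can afford to lose.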
1.6. The companion file: llms-full.txt
llms.txt is a map. llms-full.txt is the territory: the entire
documentation pre-concatenated as one Markdown file, so a model can ingest it in a single fetch without
traversing a sitemap. Anthropic's own documentation serves both: see
docs.anthropic.com/llms.txt and docs.anthropic.com/llms-full.txt.[8]
For most marketing sites the map alone is enough; for product documentation, both files are worth shipping.
1.7. Common mistakes
- Writing HTML or JSON instead of Markdown. The format is not negotiable. A crawler that doesn't see Markdown will skip the file.
- Skipping the H1. Some sites lead with a blockquote or prose. Without an H1 the file is technically invalid.
- Broken links to the listed pages. Every URL in the file must resolve to 200 OK and serve the content the description promises.
- Linking only HTML versions. The whole point is to give the model clean content. Serve a .md next to your .html wherever possible.
- Treating it as marketing copy. Density matters more than warmth: facts, prices, feature names, supported integrations.
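Several of the mistakes above can be caught offline. The sketch below is an illustrative linter, not an official validator: the function name and message strings are our own, it treats any line starting with "# " as an H1 (so a fenced code sample with shell comments would confuse it), and it does not fetch the linked URLs to confirm they return 200 OK.

```python
import re

# Captures the URL of every Markdown link in the file.
LINK_URL = re.compile(r"\]\(([^)\s]+)\)")


def find_llms_txt_problems(text: str) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if text.lstrip().startswith(("<", "{")):
        problems.append("file looks like HTML or JSON, not Markdown")
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith("# "):
        problems.append("file does not start with an H1")
    if sum(1 for line in lines if line.startswith("# ")) > 1:
        problems.append("file has more than one H1")
    for url in LINK_URL.findall(text):
        if not url.endswith(".md"):
            problems.append(f"link is not a Markdown page: {url}")
    return problems
```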
Section 2. robots.txt: the gate-keeping layer
The 1994 standard, formalised as RFC 9309 in 2022.[9]
Same syntax it always had — User-agent blocks with Allow / Disallow
directives — but a completely new cast of crawlers to address. Below is the working census as of May 2026.
2.1. OpenAI agents
OpenAI splits its crawler estate into three distinct, separately-controllable agents.[1]
- GPTBot: the training crawler. What it reads is used to train future GPT models. Blocking it means OpenAI does not train on your content. Honoured.
- ChatGPT-User: the live-fetch agent. Runs when an active ChatGPT user clicks a citation or invokes browsing. Blocking it means ChatGPT users cannot see your site through the assistant. Honoured.
- OAI-SearchBot: the search-index crawler powering SearchGPT. Blocking it removes you from SearchGPT's result surface. Honoured.
A defensible default for most commercial sites is to allow OAI-SearchBot and
ChatGPT-User (these surface you to potential buyers) and decide consciously about
GPTBot (which trains a model that may compete with your content).
2.2. Anthropic agents
Anthropic's documentation now enumerates three current agents with distinct purposes, alongside two legacy names.[6]
- ClaudeBot: training crawler. Modern name, replaces anthropic-ai.
- Claude-User: live fetch when a user types a URL in Claude.
- Claude-SearchBot: the search index used by Claude's web-search tool.
- anthropic-ai & Claude-Web: legacy names; some sites still list them. Anthropic has moved away from the older names, but most operators still block them defensively.
2.3. Google's two-tier model
Google deliberately decoupled "search" from "AI training" with the introduction of Google-Extended.[2]
- Googlebot: the classic search crawler. Blocking it removes you from Google Search results.
- Google-Extended: the AI opt-out token. It is not a separate crawler; blocking it tells Google not to use crawled content to train Gemini, Vertex AI, or future products. You still appear in Search.
- GoogleOther: a catch-all for one-off internal fetches.
2.4. Apple's same trick
Applebot-Extended mirrors Google's split: regular Applebot indexes for Spotlight and
Siri suggestions; Applebot-Extended is the AI training opt-out for Apple Intelligence.[3]
2.5. Perplexity's split-personality crawler
Perplexity's posture was contested through 2024 — researchers documented that the platform ignored
robots.txt by issuing fetches from undeclared user-agents.[7]
Perplexity has since named two distinct agents and committed to honouring directives for both:
- PerplexityBot: search index crawler.
- Perplexity-User: live-fetch agent triggered by user prompts.
Treat this fleet with the same logic as OpenAI's: block training, allow live-retrieval, decide on the index based on whether you want Perplexity referral traffic.
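The same three-way decision (block training, allow live retrieval, choose on the index) can be written down once and rendered for any provider. This is a sketch under our own assumptions: `render_group` and `build_robots_txt` are hypothetical helpers, and the output uses only whole-site rules, so per-path exceptions would need to be added by hand.

```python
def render_group(agents, directive, path="/"):
    """Render one robots.txt group: several User-agent lines sharing one rule."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append(f"{directive}: {path}")
    return "\n".join(lines)


def build_robots_txt(training, retrieval, index, allow_index):
    """Block trainers, allow live-retrieval agents, and apply the index choice."""
    groups = []
    if training:
        groups.append(render_group(training, "Disallow"))
    if retrieval:
        groups.append(render_group(retrieval, "Allow"))
    if index:
        groups.append(render_group(index, "Allow" if allow_index else "Disallow"))
    return "\n\n".join(groups) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(
        training=["GPTBot"],
        retrieval=["ChatGPT-User", "Perplexity-User"],
        index=["OAI-SearchBot", "PerplexityBot"],
        allow_index=True,
    ))
```

Keeping the agent lists in one place makes the quarterly refresh a data change, not a rewrite of the file.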
2.6. The wider ecosystem
Beyond the five providers above, a handful of additional agents matter:
| Agent | Owner | Purpose |
|---|---|---|
| CCBot | Common Crawl | Open dataset used by virtually every open-source LLM training run. |
| Bytespider | ByteDance | Training crawler for Doubao and downstream models. |
| meta-externalagent | Meta | AI assistant crawler; replaces older FacebookBot for AI purposes. |
| Amazonbot | Amazon | Alexa and AI-product crawling. |
| Diffbot | Diffbot | Structured-data extractor used by many AI products as a RAG source. |
| Bingbot | Microsoft | Search index, used by Copilot for grounding. |
| DuckDuckBot | DuckDuckGo | Search index, used by DuckAssist. |
2.7. The two robots.txt mistakes we see weekly
- Ordering Allow / Disallow incorrectly. RFC 9309 says the longest match wins, but several crawlers fall back to first-match. If you write

  User-agent: GPTBot
  Disallow: /
  Allow: /blog/

  you may or may not get the blog crawled depending on the agent. Reverse the order (narrow Allow first, broad Disallow last) to be safe with both interpretations.
- Blocking the live-retrieval agent while allowing the trainer. Far more common than the inverse, and exactly backwards from what most operators actually want. Live-retrieval is what surfaces you to a buyer; training is what feeds a competing model. If you must pick one to block, block the trainer.
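The ordering problem is easier to see with both interpretations side by side. The sketch below is a simplified model, not any crawler's actual code: it handles plain path prefixes only (no wildcards or `$` anchors), and the function names are ours.

```python
def first_match(rules, path):
    """Legacy behaviour: the first rule whose prefix matches decides."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # no rule matched, so the path is allowed


def longest_match(rules, path):
    """RFC 9309 behaviour: the longest matching prefix decides; Allow wins ties."""
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


risky = [("disallow", "/"), ("allow", "/blog/")]
safe = [("allow", "/blog/"), ("disallow", "/")]

for name, rules in (("risky", risky), ("safe", safe)):
    print(name, first_match(rules, "/blog/post"), longest_match(rules, "/blog/post"))
```

With the risky ordering the two strategies disagree about `/blog/post`; with the safe ordering they agree, and both still block everything outside `/blog/`.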
Section 3. ai.txt and the licensing layer
ai.txt is a separate file maintained by Spawning Inc, originally aimed at
image-training datasets but extended to any media.[5]
Where robots.txt answers "may you fetch?", ai.txt answers "having fetched, may
you train on it?" — a distinction that legal teams care about even if it is hard to enforce technically.
The format is intentionally narrow: a robots.txt-style allow / deny grammar, keyed on paths or file types.
User-Agent: *
# Block training on static assets
Disallow: /assets/
# Allow training on docs
Allow: /docs/
Adopters in practice include Stability AI and several of the open-source image-dataset registries. Major
commercial LLM providers do not currently consume ai.txt, but Article 53 of the EU AI Act obliges providers of
general-purpose models to identify and comply with reservations of rights, including machine-readable ones. That
language is arguably wide enough to include this file.[10] Ship it.
Section 4. HTML and HTTP signals
A site can also speak to crawlers per-page, in two places.
4.1. The <meta name="robots"> tag
Two AI-specific values appeared in 2023, pushed by news publishers and adopted by several model providers:
- noai: do not use this page to train AI.
- noimageai: same, but for images on the page (relevant for stock photography and editorial illustrations).
<meta name="robots" content="noai, noimageai">
4.2. The X-Robots-Tag HTTP header
Same directives, but delivered as an HTTP response header — the only practical option for non-HTML assets like PDFs, images, and JSON endpoints.
X-Robots-Tag: noai, noimageai
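In most deployments this header is set in the web server or CDN configuration. For sites that serve assets from a Python application, the sketch below shows one way to attach it; the middleware name and the list of file suffixes are illustrative assumptions, and it relies only on the standard WSGI calling convention.

```python
def add_x_robots_tag(app, value="noai, noimageai",
                     suffixes=(".pdf", ".png", ".jpg", ".json")):
    """Wrap a WSGI app so matching asset responses carry an X-Robots-Tag header."""

    def middleware(environ, start_response):
        path = environ.get("PATH_INFO", "").lower()

        def tagged_start_response(status, headers, exc_info=None):
            # Only tag the non-HTML assets; HTML pages can use the meta tag.
            if path.endswith(suffixes):
                headers = list(headers) + [("X-Robots-Tag", value)]
            return start_response(status, headers, exc_info)

        return app(environ, tagged_start_response)

    return middleware
```

Wrapping an existing application is then a one-liner, for example `application = add_x_robots_tag(application)`.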
4.3. The TDM Reservation Protocol
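Drafted by a W3C Community Group with backing from European publishers, in response to the text-and-data-mining
exception in the EU's 2019 copyright directive, the TDMRep proposal defines a tdm-reservation property that can be
sent as an HTML meta tag, an HTTP header, or a /.well-known/tdmrep.json file. It is not yet widely supported but is
the most likely candidate to become a legally binding signal in the EU. Add it defensively if you care about future
European liability: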
<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://example.com/tdm-policy.json">
Section 5. What each LLM actually honours
This is the question every operator wants answered. Below is the honest scorecard as of May 2026, with a note on confidence in each cell. "Honoured" means we have either a public statement from the operator or repeated empirical confirmation. "Partial" means the signal is read but the response varies by product surface. "Unconfirmed" means no public commitment and no consistent empirical evidence.
| Signal | OpenAI | Anthropic | Google (Gemini) | Perplexity | Common Crawl |
|---|---|---|---|---|---|
| robots.txt Disallow on own agent | Honoured | Honoured | Honoured (with Google-Extended for AI) | Honoured (post-2024) | Honoured |
| llms.txt as a discovery hint | Partial (used in some Plus surfaces) | Partial (consumed by Claude's tools team) | Unconfirmed | Partial (referenced in citations) | Unconfirmed |
| llms-full.txt for documentation ingestion | Partial | Honoured for first-party docs ingestion | Unconfirmed | Partial | Unconfirmed |
| ai.txt training restriction | Unconfirmed | Unconfirmed | Unconfirmed | Unconfirmed | Partial (some dataset producers honour it) |
| meta name="robots" content="noai" | Honoured | Honoured | Honoured | Honoured | Partial |
| X-Robots-Tag: noai on assets | Honoured | Honoured | Honoured | Honoured | Partial |
| TDMRep tdm-reservation | Unconfirmed | Unconfirmed | Partial (EU surfaces) | Unconfirmed | Unconfirmed |
Two patterns are obvious. First, blocking signals (Disallow, noai) are
better-honoured than discovery signals (llms.txt): the legal cost of ignoring an
opt-out exceeds the cost of skipping a hint. Second, llms.txt adoption is happening
on the tooling side, not the foundation-model side. The agentic-framework ecosystem (Cursor,
Continue, Claude's own SDK examples) reads llms.txt reliably; the trillion-parameter base
models do not yet treat it as canonical. Ship the file anyway — the cost is ten minutes, and the upside
compounds as the standard matures.
Section 6. A working example
Here is what the four signals look like for a representative B2B SaaS site — call it Acme Analytics. Adapt the paths and product names to your own; the structure is the bit that travels.
6.1. /robots.txt
# --- Search crawlers: allow everywhere ---
User-agent: Googlebot
User-agent: Bingbot
User-agent: Applebot
User-agent: DuckDuckBot
Allow: /
# --- AI training: block by default, allow docs ---
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: meta-externalagent
Allow: /docs/
Allow: /blog/
Disallow: /
# --- AI search and live-retrieval: allow (this is your discovery) ---
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
Allow: /
# --- Vendor-specific AI opt-out tokens ---
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /
# --- Sitemap ---
Sitemap: https://acme.com/sitemap.xml
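One way to sanity-check a policy like this before deploying is Python's standard-library `urllib.robotparser`. The sketch below embeds a shortened copy of the file; the `can_fetch` wrapper is our own helper. Treat the result as a rough check only: the standard-library parser matches agent names by substring and may evaluate rule precedence differently from a given crawler.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the example policy, without blank lines inside groups.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Allow: /docs/
Allow: /blog/
Disallow: /
User-agent: ChatGPT-User
User-agent: Claude-User
Allow: /
User-agent: Google-Extended
Disallow: /
"""


def can_fetch(robots_txt: str, agent: str, url: str) -> bool:
    """Parse the robots.txt text and ask whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("GPTBot", "ChatGPT-User", "Google-Extended"):
        for path in ("/pricing", "/docs/quickstart.md"):
            url = "https://acme.com" + path
            print(agent, path, can_fetch(ROBOTS_TXT, agent, url))
```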
6.2. /llms.txt
# Acme Analytics
> Acme Analytics is a usage-analytics platform for self-serve SaaS,
> built to surface revenue-correlated product events without manual instrumentation.
Acme is used by 400+ SaaS teams to detect activation friction, predict churn,
and attribute revenue to product behaviour. Pricing starts at $99/mo; free for
teams under 10k MAU.
## Docs
- [Quickstart](https://acme.com/docs/quickstart.md): Five-minute event ingestion.
- [Reference](https://acme.com/docs/reference.md): Every event, property, and webhook.
- [SDKs](https://acme.com/docs/sdks.md): JavaScript, Python, Ruby, Go.
## Pricing
- [Plans](https://acme.com/pricing.md): All tiers with feature deltas.
## Blog
- [Founders' guide to PLG analytics](https://acme.com/blog/plg-analytics.md)
- [Why activation is the only metric that matters early](https://acme.com/blog/activation.md)
## Optional
- [Changelog](https://acme.com/changelog.md)
- [Security and compliance](https://acme.com/security.md)
- [On-prem deployment guide](https://acme.com/docs/on-prem.md)
6.3. /llms-full.txt
For most sites, generate llms-full.txt automatically by concatenating the linked Markdown
documents in llms.txt with a separator. A small Makefile step or a
post-deploy cron is enough; do not hand-maintain it.
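The concatenation step might look like the sketch below. Everything specific in it is an assumption on our part: the `build_llms_full` name, the horizontal-rule separator, and the `<!-- source: ... -->` comment are illustrative choices, since the proposal does not prescribe a layout. The `fetch` argument is any callable that returns the Markdown for a URL, for example a function that reads from your build output.

```python
import re
from typing import Callable

# URL of each "- [Title](URL)" list item, in file order.
LINK_URL = re.compile(r"^-\s+\[[^\]]+\]\(([^)\s]+)\)", re.MULTILINE)


def build_llms_full(llms_txt: str, fetch: Callable[[str], str],
                    separator: str = "\n\n---\n\n") -> str:
    """Concatenate every page linked from llms.txt into one Markdown document."""
    urls = []
    for url in LINK_URL.findall(llms_txt):
        if url not in urls:  # keep first occurrence, preserve order
            urls.append(url)
    documents = [f"<!-- source: {url} -->\n{fetch(url).strip()}" for url in urls]
    return separator.join(documents) + "\n"
```

Passing the fetcher in as a parameter keeps the function testable without network access and lets the same code run against local files or live URLs.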
6.4. HTML meta — for pages you must not train on
<!-- in the <head> of customer-only or licensed-content pages -->
<meta name="robots" content="noai, noimageai">
<meta name="tdm-reservation" content="1">
Section 7. How to audit your own site
Five checks. Run them once. Re-run after every major site change.
- Does /robots.txt resolve at the apex domain? Many sites serve it from www. only. Crawlers checking the apex see a 404 and fall back to "no restrictions."
- Are training-bot and live-retrieval-bot directives different? If you've copy-pasted the same block for both, you're either over-blocking (no AI traffic at all) or under-blocking (training without consent).
- Does /llms.txt validate? H1 present, blockquote present, all linked URLs return 200 OK, all link descriptions present. Free validators exist; the simplest check is to feed the file to ChatGPT or Claude and ask "what is this product?" The answer reveals if your description is doing its job.
- Do customer-only pages carry noai? Marketing pages should not. Customer dashboards, internal docs, and licensed editorial should.
- Are the named agents up to date? The bot estate shifts every quarter. A robots.txt last updated in 2023 will list anthropic-ai but miss Claude-User, Claude-SearchBot, OAI-SearchBot, and meta-externalagent.
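The last check in the list is the easiest to automate. The sketch below compares the agents named in a robots.txt against a list taken from this guide; the `EXPECTED_AGENTS` set and the `missing_agents` helper are our own, the list will go stale as the bot estate shifts, and the check deliberately ignores the fact that a `User-agent: *` group applies to unnamed agents.

```python
import re

# Agent names taken from this guide; update the set as vendors rename things.
EXPECTED_AGENTS = {
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Applebot-Extended", "meta-externalagent",
}

USER_AGENT_LINE = re.compile(
    r"^\s*user-agent\s*:\s*([^#\s]+)", re.IGNORECASE | re.MULTILINE
)


def missing_agents(robots_txt, expected=EXPECTED_AGENTS):
    """Return the expected agents that robots.txt never names, sorted by name."""
    declared = {name.lower() for name in USER_AGENT_LINE.findall(robots_txt)}
    return sorted(agent for agent in expected if agent.lower() not in declared)
```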
If you'd rather not run the audit by hand, Chetver's scanner runs all five — plus a hundred-odd related checks — and reports back in under a minute with a colour-coded breakdown of what's present, what's missing, and what's actively misconfigured. The free version of the scan is enough to surface the high-impact issues on most sites.
Conclusion
The directive landscape is not converging — yet. Treat the four-file model as a layered defence rather
than a single decision: robots.txt for gate-keeping, llms.txt for discovery,
llms-full.txt for clean ingestion, ai.txt plus meta tags for licensing. Each
of them is cheap to ship, and each of them is read by a different slice of the crawler ecosystem.
Three operating principles cover most of the day-to-day choices:
- Block training, allow retrieval. Live-fetch agents bring buyers; training agents feed competitors.
- Ship llms.txt even though it's not yet universally honoured. The cost is ten minutes and the standard is accelerating.
- Re-audit quarterly. The named agents shift faster than annual SEO reviews can absorb.
Ten minutes of plumbing today will determine whether — eighteen months from now — an LLM mentions you, misrepresents you, or skips you entirely. The map is here. The territory is yours.
Sources
1. OpenAI Platform. GPTBot and SearchGPT documentation. Accessed May 2026. https://platform.openai.com/docs/bots
2. Google Search Central. Overview of Google crawlers and the Google-Extended AI control. Accessed May 2026. https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
3. Apple Developer. About Applebot and Applebot-Extended. Accessed May 2026. https://support.apple.com/en-us/119829
4. Howard, J. The /llms.txt file proposal. September 2024. https://llmstxt.org/
5. Spawning. The ai.txt opt-out specification. Accessed May 2026. https://spawning.ai/ai-txt
6. Anthropic. Claude's web crawlers and how to control them. Accessed May 2026. https://docs.anthropic.com/en/docs/agents-and-tools/web-search-tool/robots
7. Perplexity. Crawler documentation and Perplexity-User agent disclosure. Accessed May 2026. https://docs.perplexity.ai/guides/bots
8. Anthropic Documentation. llms.txt and llms-full.txt. Accessed May 2026. https://docs.anthropic.com/llms.txt
9. IETF RFC 9309. Robots Exclusion Protocol. 2022. https://www.rfc-editor.org/rfc/rfc9309
10. European Union. Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act), Article 53. https://eur-lex.europa.eu/eli/reg/2024/1689/oj
For the broader strategic picture this file sits inside, read
our GEO / AIEO and AI Page strategic analysis —
which makes the case that llms.txt is one half of a two-half infrastructure (the other half being a clean AI Page).
For the measurement layer on top — how to tell whether all this plumbing actually moves the needle — see the Share of Model framework.