Inside llms.txt and friends: every AI-crawler signal you can send in 2026

Why this map exists

Between mid-2023 and early 2026 the web acquired a new layer of plumbing. Where site owners used to have exactly one machine-readable control — robots.txt — they now ship four overlapping protocols and a handful of HTML / HTTP signals, all aimed at a single problem: tell the new species of crawler what it may read, what it may train on, and where to find the canonical version of the truth.

The protocols evolved out of order. OpenAI shipped GPTBot in August 2023.¹ Google followed with Google-Extended in September 2023.² Apple followed suit with Applebot-Extended in mid-2024.³ Jeremy Howard published the llms.txt proposal in September 2024.⁴ Spawning's ai.txt appeared as an opt-out registry around the same time.⁵ Anthropic split its single ClaudeBot into a fleet of named agents over 2024–2025.⁶ Perplexity introduced Perplexity-User as a distinct live-fetch agent.⁷ Nothing was coordinated.

The result is a quiet mess. Most site owners we audit have a robots.txt that blocks GPTBot "because legal said so" and an llms.txt they were told to create by an SEO blog, and the two contradict each other. This guide is a single reference: every directive that exists, who reads it, and what to put where.

The four files that shape AI-crawler behaviour

Before we go field by field, the map. There are four files. They do different jobs. None of them replaces another.

| File | Layer | Format | What it answers |
|---|---|---|---|
| `/robots.txt` | Gate-keeping | RFC 9309 plain text | Which crawlers may fetch which paths. |
| `/llms.txt` | Explanatory | Markdown | What's on this site, organised for AI consumption. |
| `/llms-full.txt` | Ingestion | Markdown, concatenated | The actual content, pre-flattened so an LLM can read the whole site in one fetch. |
| `/ai.txt` & meta tags | Licensing | Plain text / HTML | Whether content may be used for training, separately from access. |

Think of it as four concentric rings: robots.txt decides whether the bot may knock on the door, llms.txt hands it a map of the rooms, llms-full.txt hands it the contents of every room in one bundle, and ai.txt tells it which of those contents it may take away to train on.

Section 1. `llms.txt`: the explanatory layer

Jeremy Howard's llms.txt proposal solves a specific problem: an LLM has a small context window and your site has hundreds of pages, half of them noise (legal, footer links, marketing chrome). The model doesn't know which pages are the canonical reference. llms.txt answers that question in a format the model already excels at — Markdown.⁴

1.1. The required field: H1

Exactly one top-level heading, at the start of the file. It is the name of the project or product. Not the company tagline, not "Welcome." A reader (human or model) should be able to identify the entity in under two seconds.

# Acme Analytics

1.2. The recommended field: blockquote summary

A single Markdown blockquote, immediately after the H1. One to three sentences. This is the text an LLM is most likely to quote verbatim when summarising your site.

> Acme Analytics is a usage-analytics platform for self-serve SaaS,
> built to surface revenue-correlated product events without manual instrumentation.

1.3. Optional intro prose

Free-form Markdown between the blockquote and the first H2. Use it sparingly — every paragraph here consumes context that could otherwise be a link. Most well-formed llms.txt files keep this section under 200 words and reserve elaboration for the linked Markdown pages themselves.

1.4. H2 sections (link collections)

Every ## heading introduces a labelled collection of links. The label is free-form, but a small set of conventions has emerged: ## Docs, ## Examples, ## API, ## Blog. Links inside each section follow a strict grammar:

## Docs

- [Quickstart](https://acme.com/docs/quickstart.md): A 5-minute walk-through of event ingestion.
- [Reference](https://acme.com/docs/reference.md): Every event property, with examples.

Each list item is one link, with the title in brackets and an optional one-line description after a colon. Linking to a Markdown version of the page (note .md in the example) is best practice — it lets the crawler skip your HTML chrome.
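The link grammar above is regular enough to parse with one expression. The following is a minimal sketch, not a reference parser: the names `LinkItem`, `LINK_RE` and `parse_link_item` are our own, and the pattern assumes one link per list item with no nested brackets in the title.

```python
import re
from typing import NamedTuple, Optional


class LinkItem(NamedTuple):
    title: str
    url: str
    description: Optional[str]


# "- [Title](url)" with an optional ": description" tail (hypothetical pattern).
LINK_RE = re.compile(
    r"^\s*[-*]\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)"
    r"(?:\s*:\s*(?P<desc>.+))?\s*$"
)


def parse_link_item(line: str) -> Optional[LinkItem]:
    """Return the parsed link, or None if the line is not a link list item."""
    match = LINK_RE.match(line)
    if match is None:
        return None
    desc = match.group("desc")
    return LinkItem(
        title=match.group("title"),
        url=match.group("url"),
        description=desc.strip() if desc else None,
    )
```

Running it over every line under an H2 heading gives a quick check that each item follows the grammar; lines that return `None` are either prose or malformed links.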

1.5. The special `## Optional` section

This is the one piece of unusual semantics in the spec. A heading literally named ## Optional marks links that may be dropped if the model's context budget is tight.⁴ Put the truly canonical material (overview, getting started, pricing) above it; put deep-cut reference material (changelogs, edge-case docs) inside it.
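To make the Optional semantics concrete, here is a sketch of how a consumer might apply them. It is an assumption about consumer behaviour, not something the proposal prescribes: the function names are ours, and a character count stands in for a real token budget.

```python
def split_sections(llms_txt: str) -> dict[str, list[str]]:
    """Map each H2 heading to the lines beneath it; text before the first H2 goes under ''."""
    sections: dict[str, list[str]] = {"": []}
    current = ""
    for line in llms_txt.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        else:
            sections[current].append(line)
    return sections


def trim_to_budget(llms_txt: str, max_chars: int) -> str:
    """Return the file unchanged if it fits, otherwise without its '## Optional' section."""
    if len(llms_txt) <= max_chars:
        return llms_txt
    kept: list[str] = []
    for heading, lines in split_sections(llms_txt).items():
        if heading == "Optional":
            continue  # the only section a consumer is invited to drop
        if heading:
            kept.append(f"## {heading}")
        kept.extend(lines)
    return "\n".join(kept).rstrip() + "\n"
```

The sketch assumes heading names are unique within the file; a repeated heading would overwrite the earlier section.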

1.6. The companion file: `llms-full.txt`

llms.txt is a map. llms-full.txt is the territory: the entire documentation pre-concatenated as one Markdown file, so a model can ingest it in a single fetch without traversing a sitemap. Anthropic's own documentation serves both — see docs.anthropic.com/llms.txt and docs.anthropic.com/llms-full.txt.⁸ For most marketing sites the map alone is enough; for product documentation, both files are worth shipping.

1.7. Common mistakes

Writing HTML or JSON instead of Markdown. The format is not negotiable. A crawler that doesn't see Markdown will skip the file.
Skipping the H1. Some sites lead with a blockquote or prose. Without an H1 the file is technically invalid.
Broken links to the listed pages. Every URL in the file must resolve to 200 OK and serve the content the description promises.
Linking only HTML versions. The whole point is to give the model clean content. Serve a .md next to your .html wherever possible.
Treating it as marketing copy. Density matters more than warmth — facts, prices, feature names, supported integrations.
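Most of the mistakes above can be caught offline before the file ships. The linter below is a rough sketch under our own assumptions: `lint_llms_txt` is a hypothetical helper, it treats the recommended blockquote as expected, and it does not make network requests, so the 200 OK check still has to be done separately.

```python
import re


def lint_llms_txt(text: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems: list[str] = []
    lines = [ln for ln in text.splitlines() if ln.strip()]

    if text.lstrip().startswith(("<", "{")):
        problems.append("looks like HTML or JSON, not Markdown")
    if not lines or not lines[0].startswith("# "):
        problems.append("first non-blank line is not an H1")
    if sum(1 for ln in lines if ln.startswith("# ")) > 1:
        problems.append("more than one H1")
    if len(lines) < 2 or not lines[1].startswith(">"):
        problems.append("no blockquote summary directly after the H1")

    # Every Markdown link target should be a clean .md document.
    for url in re.findall(r"\]\((\S+?)\)", text):
        if not url.endswith(".md"):
            problems.append(f"link does not point at a Markdown file: {url}")
    return problems
```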

Section 2. `robots.txt`: the gate-keeping layer

The 1994 standard, formalised as RFC 9309 in 2022.⁹ Same syntax it always had — User-agent blocks with Allow / Disallow directives — but a completely new cast of crawlers to address. Below is the working census as of May 2026.

2.1. OpenAI agents

OpenAI splits its crawler estate into three distinct, separately-controllable agents.¹

GPTBot — the training crawler. Reading it teaches future GPT models. Blocking it means OpenAI does not train on your content. Honoured.
ChatGPT-User — the live-fetch agent. Runs when an active ChatGPT user clicks a citation or invokes browsing. Blocking it means ChatGPT users cannot see your site through the assistant. Honoured.
OAI-SearchBot — the search-index crawler powering SearchGPT. Blocking it removes you from SearchGPT's result surface. Honoured.

A defensible default for most commercial sites is to allow OAI-SearchBot and ChatGPT-User (these surface you to potential buyers) and decide consciously about GPTBot (which trains a model that may end up competing with your content).

2.2. Anthropic agents

Anthropic's documentation now enumerates three current agents with distinct purposes, alongside two legacy names.⁶

ClaudeBot — training crawler. Modern name, replaces anthropic-ai.
Claude-User — live fetch when a user types a URL in Claude.
Claude-SearchBot — the search index used by Claude's web-search tool.
anthropic-ai & Claude-Web — legacy names; some sites still list them. Anthropic has moved away from the older names, but most operators still block them defensively.

2.3. Google's two-tier model

Google deliberately decoupled "search" from "AI training" with the introduction of Google-Extended.²

Googlebot — the classic search crawler. Blocking it removes you from Google Search results.
Google-Extended — the AI opt-out token. It is not a separate crawler; blocking it tells Google not to use crawled content to train Gemini, Vertex AI, or future products. You still appear in Search.
GoogleOther — a catch-all for one-off internal fetches.

2.4. Apple's same trick

Applebot-Extended mirrors Google's split: regular Applebot indexes for Spotlight and Siri suggestions; Applebot-Extended is the AI training opt-out for Apple Intelligence.³

2.5. Perplexity's split-personality crawler

Perplexity's posture was contested through 2024 — researchers documented that the platform ignored robots.txt by issuing fetches from undeclared user-agents.⁷ Perplexity has since named two distinct agents and committed to honouring directives for both:

PerplexityBot — search index crawler.
Perplexity-User — live-fetch agent triggered by user prompts.

Treat this fleet with the same logic as OpenAI's: allow live retrieval, and decide on the index based on whether you want Perplexity referral traffic. Neither named agent is a training crawler, so there is no separate training block to write here.

2.6. The wider ecosystem

Beyond the five providers above, a handful of additional agents matter:

| Agent | Owner | Purpose |
|---|---|---|
| `CCBot` | Common Crawl | Open dataset used by virtually every open-source LLM training run. |
| `Bytespider` | ByteDance | Training crawler for Doubao and downstream models. |
| `meta-externalagent` | Meta | AI assistant crawler; replaces older `FacebookBot` for AI purposes. |
| `Amazonbot` | Amazon | Alexa and AI-product crawling. |
| `Diffbot` | Diffbot | Structured-data extractor used by many AI products as a RAG source. |
| `Bingbot` | Microsoft | Search index, used by Copilot for grounding. |
| `DuckDuckBot` | DuckDuckGo | Search index, used by DuckAssist. |
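With this many agents in play, it can help to keep the census as data and generate the robots.txt groups from it. The sketch below is one possible arrangement, not an established tool: the role labels and `build_groups` are our own, and the mapping covers only a subset of the agents named in this section.

```python
# Roles follow this guide's census; the mapping is illustrative, not exhaustive.
AGENT_ROLES = {
    "GPTBot": "training",
    "ClaudeBot": "training",
    "CCBot": "training",
    "Bytespider": "training",
    "ChatGPT-User": "live",
    "Claude-User": "live",
    "Perplexity-User": "live",
    "OAI-SearchBot": "search",
    "Claude-SearchBot": "search",
    "PerplexityBot": "search",
}


def build_groups(blocked_roles: set[str]) -> str:
    """Emit one robots.txt group per role: Disallow for blocked roles, Allow otherwise."""
    by_role: dict[str, list[str]] = {}
    for agent, role in AGENT_ROLES.items():
        by_role.setdefault(role, []).append(agent)

    blocks = []
    for role, agents in by_role.items():
        lines = [f"# --- {role} ---"]
        lines += [f"User-agent: {agent}" for agent in agents]
        lines.append("Disallow: /" if role in blocked_roles else "Allow: /")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

Calling `build_groups({"training"})` yields three groups, with only the training group disallowed. When a new agent appears, the change is one line in the mapping rather than a hand edit of the file.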

2.7. The two robots.txt mistakes we see weekly

Ordering Allow / Disallow incorrectly. RFC 9309 says the longest match wins, but several crawlers fall back to first-match. If you write
```
User-agent: GPTBot
Disallow: /
Allow: /blog/
```
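you may or may not get the blog crawled depending on the agent. A first-match parser stops at Disallow: / and never reaches the Allow line. Reverse the order, with the narrow Allow first and the broad Disallow last, so that both interpretations reach the same answer.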
Blocking the live-retrieval agent while allowing the trainer. Far more common than the inverse, and exactly backwards from what most operators actually want. Live-retrieval is what surfaces you to a buyer; training is what feeds a competing model. If you must pick one to block, block the trainer.
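The first mistake is easiest to see by evaluating the same rules under both interpretations. The sketch below is a simplified model, assuming plain prefix rules with no `*` or `$` wildcards; `first_match` and `longest_match` are our own names, and the tie-break in favour of Allow follows RFC 9309.

```python
def first_match(rules, path):
    """Rules are (directive, prefix) pairs in file order; the first matching prefix decides."""
    for directive, prefix in rules:
        if path.startswith(prefix):
            return directive == "allow"
    return True  # no rule matched: crawling is allowed


def longest_match(rules, path):
    """RFC 9309 style: the longest matching prefix decides; Allow wins ties."""
    best_len, allowed = -1, True
    for directive, prefix in rules:
        if path.startswith(prefix):
            is_allow = directive == "allow"
            if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
                best_len, allowed = len(prefix), is_allow
    return allowed


# The ordering from the example above, and the reversed ordering.
risky = [("disallow", "/"), ("allow", "/blog/")]
safe = [("allow", "/blog/"), ("disallow", "/")]

for name, rules in (("risky", risky), ("safe", safe)):
    print(name, first_match(rules, "/blog/post"), longest_match(rules, "/blog/post"))
```

With the risky ordering the two functions disagree about `/blog/post`; with the safe ordering they agree on every path, which is the property you want when you cannot know which parser a crawler uses.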

Section 3. `ai.txt` and the licensing layer

ai.txt is a separate file maintained by Spawning Inc, originally aimed at image-training datasets but extended to any media.⁵ Where robots.txt answers "may you fetch?", ai.txt answers "having fetched, may you train on it?" — a distinction that legal teams care about even if it is hard to enforce technically.

The format is intentionally narrow: a robots.txt-style allow / deny grammar, scoped by path or file type.

User-Agent: *
Disallow: /assets/

# Allow training on docs
User-Agent: *
Allow: /docs/

Adopters in practice include Stability AI and several of the open-source image-dataset registries. Major commercial LLM providers do not currently consume ai.txt, but Article 53 of the EU AI Act requires providers of general-purpose AI models to identify and comply with rights reservations expressed under Article 4(3) of the EU Copyright Directive, which for online content means machine-readable means.¹⁰ That language is arguably wide enough to include this file. Ship it.

Section 4. HTML and HTTP signals

A site can also speak to crawlers per-page, in two places.

4.1. The `<meta name="robots">` tag

Two AI-specific values, first introduced by DeviantArt in late 2022 and since pushed by publishers, are non-standard but reportedly adopted by several model providers:

noai — do not use this page to train AI.
noimageai — same, but for images on the page (relevant for stock photography and editorial illustrations).

<meta name="robots" content="noai, noimageai">

4.2. The `X-Robots-Tag` HTTP header

Same directives, but delivered as an HTTP response header — the only practical option for non-HTML assets like PDFs, images, and JSON endpoints.

X-Robots-Tag: noai, noimageai
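How the header gets attached depends on your server, but the decision itself is small. The function below is a sketch of that decision only: `ai_opt_out_headers` and the suffix list are assumptions for illustration, and you would call it from whatever middleware or CDN rule builds your responses.

```python
from pathlib import PurePosixPath

# Asset types that cannot carry a <meta> tag (illustrative list, extend as needed).
NON_HTML_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".json"}


def ai_opt_out_headers(path: str) -> dict[str, str]:
    """Return the extra response headers for a URL path, or an empty dict for HTML pages."""
    suffix = PurePosixPath(path).suffix.lower()
    if suffix in NON_HTML_SUFFIXES:
        return {"X-Robots-Tag": "noai, noimageai"}
    return {}
```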

4.3. The `TDM Reservation Protocol`

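Drafted by a W3C Community Group with input from European publishers, in response to the text-and-data-mining opt-out in the EU Copyright Directive, the TDMRep proposal defines tdm-reservation and tdm-policy properties that can be delivered as HTTP headers, in a /.well-known/tdmrep.json file, or as HTML meta tags. It is not yet widely supported but is the most likely candidate to become a legally binding signal in the EU. Add it defensively if you care about future European liability: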

<meta name="tdm-reservation" content="1">
<meta name="tdm-policy" content="https://example.com/tdm-policy.json">
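To confirm that a rendered page actually carries these per-page signals, you can read them back with the standard library's HTML parser. This is a minimal sketch: `MetaSignalParser` and `read_signals` are our own names, and it only inspects the three meta names discussed in this section.

```python
from html.parser import HTMLParser


class MetaSignalParser(HTMLParser):
    """Collect the content of the AI-related <meta> tags found in a page."""

    WATCHED = {"robots", "tdm-reservation", "tdm-policy"}

    def __init__(self):
        super().__init__()
        self.signals: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in self.WATCHED:
            self.signals[name] = attr.get("content") or ""


def read_signals(html: str) -> dict[str, str]:
    """Return a mapping such as {'robots': 'noai, noimageai', 'tdm-reservation': '1'}."""
    parser = MetaSignalParser()
    parser.feed(html)
    return parser.signals
```

An empty result for a customer-only page means the tags never made it into the rendered head, which is a common outcome when they are added in a template that a particular route does not use.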

Section 5. What each LLM actually honours

This is the question every operator wants answered. Below is the honest scorecard as of May 2026, with a note on confidence in each cell. "Honoured" means we have either a public statement from the operator or repeated empirical confirmation. "Partial" means the signal is read but the response varies by product surface. "Unconfirmed" means no public commitment and no consistent empirical evidence.

| Signal | OpenAI | Anthropic | Google (Gemini) | Perplexity | Common Crawl |
|---|---|---|---|---|---|
| `robots.txt` Disallow on own agent | Honoured | Honoured | Honoured (with `Google-Extended` for AI) | Honoured (post-2024) | Honoured |
| `llms.txt` as a discovery hint | Partial (used in some Plus surfaces) | Partial (consumed by Claude's tools team) | Unconfirmed | Partial (referenced in citations) | Unconfirmed |
| `llms-full.txt` for documentation ingestion | Partial | Honoured for first-party docs ingestion | Unconfirmed | Partial | Unconfirmed |
| `ai.txt` training restriction | Unconfirmed | Unconfirmed | Unconfirmed | Unconfirmed | Partial (some dataset producers honour it) |
| `meta name="robots" content="noai"` | Honoured | Honoured | Honoured | Honoured | Partial |
| `X-Robots-Tag: noai` on assets | Honoured | Honoured | Honoured | Honoured | Partial |
| TDMRep `tdm-reservation` | Unconfirmed | Unconfirmed | Partial (EU surfaces) | Unconfirmed | Unconfirmed |

Two patterns are obvious. First, blocking signals (Disallow, noai) are better-honoured than discovery signals (llms.txt): the legal cost of ignoring an opt-out exceeds the cost of skipping a hint. Second, llms.txt adoption is happening on the tooling side, not the foundation-model side. The agentic-framework ecosystem (Cursor, Continue, Claude's own SDK examples) reads llms.txt reliably; the trillion-parameter base models do not yet treat it as canonical. Ship the file anyway — the cost is ten minutes, and the upside compounds as the standard matures.

Section 6. A working example

Here is what the four signals look like for a representative B2B SaaS site — call it Acme Analytics. Adapt the paths and product names to your own; the structure is the bit that travels.

6.1. `/robots.txt`

# --- Search crawlers: allow everywhere ---
User-agent: Googlebot
User-agent: Bingbot
User-agent: Applebot
User-agent: DuckDuckBot
Allow: /

# --- AI training: block by default, allow docs ---
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: anthropic-ai
User-agent: CCBot
User-agent: Bytespider
User-agent: meta-externalagent
Allow: /docs/
Allow: /blog/
Disallow: /

# --- AI search and live-retrieval: allow (this is your discovery) ---
User-agent: OAI-SearchBot
User-agent: ChatGPT-User
User-agent: Claude-User
User-agent: Claude-SearchBot
User-agent: PerplexityBot
User-agent: Perplexity-User
Allow: /

# --- Vendor-specific AI opt-out tokens ---
User-agent: Google-Extended
User-agent: Applebot-Extended
Disallow: /

# --- Sitemap ---
Sitemap: https://acme.com/sitemap.xml
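Before deploying a file like this, it is worth checking that it answers the way you expect for a few agent and path pairs. The sketch below uses Python's standard `urllib.robotparser` against a shortened copy of the file above; `ROBOTS_TXT` is abbreviated for space and `can_fetch` is our own wrapper.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the example file: one training group, one live-retrieval group.
ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: ClaudeBot
Allow: /docs/
Allow: /blog/
Disallow: /

User-agent: OAI-SearchBot
User-agent: ChatGPT-User
Allow: /
"""


def can_fetch(agent: str, path: str) -> bool:
    """Ask the standard-library parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


for agent, path in [
    ("GPTBot", "/docs/quickstart.md"),
    ("GPTBot", "/pricing"),
    ("ChatGPT-User", "/pricing"),
]:
    print(agent, path, can_fetch(agent, path))
```

Because the narrow Allow lines come before the broad Disallow, the answers should be the same whether a parser applies rules in file order or by longest match.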

6.2. `/llms.txt`

# Acme Analytics

> Acme Analytics is a usage-analytics platform for self-serve SaaS,
> built to surface revenue-correlated product events without manual instrumentation.

Acme is used by 400+ SaaS teams to detect activation friction, predict churn,
and attribute revenue to product behaviour. Pricing starts at $99/mo; free for
teams under 10k MAU.

## Docs

- [Quickstart](https://acme.com/docs/quickstart.md): Five-minute event ingestion.
- [Reference](https://acme.com/docs/reference.md): Every event, property, and webhook.
- [SDKs](https://acme.com/docs/sdks.md): JavaScript, Python, Ruby, Go.

## Pricing

- [Plans](https://acme.com/pricing.md): All tiers with feature deltas.

## Blog

- [Founders' guide to PLG analytics](https://acme.com/blog/plg-analytics.md)
- [Why activation is the only metric that matters early](https://acme.com/blog/activation.md)

## Optional

- [Changelog](https://acme.com/changelog.md)
- [Security and compliance](https://acme.com/security.md)
- [On-prem deployment guide](https://acme.com/docs/on-prem.md)

6.3. `/llms-full.txt`

For most sites, generate llms-full.txt automatically by concatenating the linked Markdown documents in llms.txt with a separator. A small Makefile step or a post-deploy cron is enough; do not hand-maintain it.
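As a sketch of that build step, the function below walks the links in llms.txt and joins the fetched documents. It is one possible approach, not a standard tool: `build_llms_full` is a hypothetical name, the `---` separator is our choice, and the `fetch` callable is left to you (reading from the build directory is usually simpler than going over HTTP).

```python
import re
from typing import Callable

# Captures the target of every Markdown link: "](url)".
LINK_URL_RE = re.compile(r"\]\((\S+?)\)")


def build_llms_full(
    llms_txt: str,
    fetch: Callable[[str], str],
    separator: str = "\n\n---\n\n",
) -> str:
    """Concatenate every document linked from llms.txt, in file order, skipping repeats."""
    seen: set[str] = set()
    parts: list[str] = []
    for url in LINK_URL_RE.findall(llms_txt):
        if url in seen:
            continue
        seen.add(url)
        parts.append(fetch(url).strip())
    return separator.join(parts) + "\n"
```

Keeping the fetch step injectable means the same function works in a Makefile target, a post-deploy job, or a unit test with an in-memory dictionary.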

6.4. HTML meta — for pages you must not train on

<!-- in the <head> of customer-only or licensed-content pages -->
<meta name="robots" content="noai, noimageai">
<meta name="tdm-reservation" content="1">

Section 7. How to audit your own site

Five checks. Run them once. Re-run after every major site change.

Does /robots.txt resolve at the apex domain? Many sites serve it from www. only. Crawlers checking the apex see a 404 and fall back to "no restrictions."
Are training-bot and live-retrieval-bot directives different? If you've copy-pasted the same block for both, you're either over-blocking (no AI traffic at all) or under-blocking (training without consent).
Does /llms.txt validate? H1 present, blockquote present, all linked URLs return 200 OK, all link descriptions present. Free validators exist; the simplest check is to feed the file to ChatGPT or Claude and ask "what is this product?" — the answer reveals if your description is doing its job.
Do customer-only pages carry noai? Marketing pages should not. Customer dashboards, internal docs, and licensed editorial should.
Are the named agents up to date? The bot estate shifts every quarter. A robots.txt last updated in 2023 will list anthropic-ai but miss Claude-User, Claude-SearchBot, OAI-SearchBot, and meta-externalagent.
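The last check is the easiest to automate. The sketch below compares the agents named in a robots.txt against an expected list; `EXPECTED_AGENTS` reflects the agents covered in this guide and will need updating as the estate shifts, and `missing_agents` is our own helper.

```python
import re

# Agents this guide covers; treat the list as a starting point, not a registry.
EXPECTED_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Google-Extended", "Applebot-Extended", "meta-externalagent",
]

USER_AGENT_RE = re.compile(r"(?im)^\s*user-agent\s*:\s*([^#\n]+)")


def missing_agents(robots_txt: str, expected=EXPECTED_AGENTS) -> list[str]:
    """Return the expected agents that no User-agent line in the file names."""
    named = {match.strip().lower() for match in USER_AGENT_RE.findall(robots_txt)}
    return [agent for agent in expected if agent.lower() not in named]
```

An agent missing from the file is not necessarily a mistake, since unnamed agents fall through to the `*` group if you have one, but the list tells you which decisions have not been made explicitly.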

If you'd rather not run the audit by hand, Chetver's scanner runs all five — plus a hundred-odd related checks — and reports back in under a minute with a colour-coded breakdown of what's present, what's missing, and what's actively misconfigured. The free version of the scan is enough to surface the high-impact issues on most sites.

Conclusion

The directive landscape is not converging — yet. Treat the four-file model as a layered defence rather than a single decision: robots.txt for gate-keeping, llms.txt for discovery, llms-full.txt for clean ingestion, ai.txt plus meta tags for licensing. Each of them is cheap to ship, and each of them is read by a different slice of the crawler ecosystem.

Three operating principles cover most of the day-to-day choices:

Block training, allow retrieval. Live-fetch agents bring buyers; training agents feed competitors.
Ship llms.txt even though it's not yet universally honoured. The cost is ten minutes and adoption is accelerating.
Re-audit quarterly. The named agents shift faster than annual SEO reviews can absorb.

Ten minutes of plumbing today will determine whether — eighteen months from now — an LLM mentions you, misrepresents you, or skips you entirely. The map is here. The territory is yours.

Sources

1. OpenAI Platform. GPTBot and SearchGPT documentation, accessed May 2026, https://platform.openai.com/docs/bots
2. Google Search Central. Overview of Google crawlers and the Google-Extended AI control, accessed May 2026, https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
3. Apple Developer. About Applebot and Applebot-Extended, accessed May 2026, https://support.apple.com/en-us/119829
4. Howard, J. The /llms.txt file proposal, September 2024, https://llmstxt.org/
5. Spawning. The ai.txt opt-out specification, accessed May 2026, https://spawning.ai/ai-txt
6. Anthropic. Claude's web crawlers and how to control them, accessed May 2026, https://docs.anthropic.com/en/docs/agents-and-tools/web-search-tool/robots
7. Perplexity. Crawler documentation and Perplexity-User agent disclosure, accessed May 2026, https://docs.perplexity.ai/guides/bots
8. Anthropic Documentation. llms.txt and llms-full.txt, accessed May 2026, https://docs.anthropic.com/llms.txt
9. IETF RFC 9309. Robots Exclusion Protocol, 2022, https://www.rfc-editor.org/rfc/rfc9309
10. European Union. Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act), Article 53, https://eur-lex.europa.eu/eli/reg/2024/1689/oj

For the broader strategic picture this file sits inside, read our GEO / AIEO and AI Page strategic analysis — which makes the case that llms.txt is one half of a two-half infrastructure (the other half being a clean AI Page).

For the measurement layer on top — how to tell whether all this plumbing actually moves the needle — see the Share of Model framework.
