
The well-known files every site needs for AI agents

By Project Auxo · 2026-06-28

TL;DR. Modern websites expose a small set of "well-known" files that automated readers fetch by convention: robots.txt (what a crawler may access), sitemap.xml (what pages exist), llms.txt (what is worth reading), and capabilities.txt (what the site can do). They compose rather than compete: robots.txt and sitemap.xml serve crawlers, llms.txt serves LLMs that read, and capabilities.txt serves agents that act. To be ready for AI agents, publish all four and allow the AI crawlers in robots.txt.

The four files

File	Answers	For	Path
`robots.txt`	What may a crawler access?	Crawlers	`/robots.txt`
`sitemap.xml`	What pages exist?	Search engines	`/sitemap.xml`
`llms.txt`	What's worth reading?	LLMs reading	`/llms.txt`
`capabilities.txt`	What can this site do?	Agents acting	`/capabilities.txt`

The web taught machines to read in layers. Each file answers one narrow question for an automated reader, and together they let an agent go from "may I look" to "what can I do" with nothing but static fetches.

robots.txt — permissions

The original well-known file. It tells crawlers what they may and may not access. For the agentic web, the important move is to explicitly allow the AI crawlers, both training (GPTBot, ClaudeBot, Google-Extended, CCBot) and search (OAI-SearchBot, PerplexityBot), so the rest of your files are discoverable and citable.
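As a minimal sketch of that allow-list, the snippet below generates a robots.txt with one group per AI crawler token named above, plus a Sitemap line. The function name `build_robots_txt` and the `example.com` URL are illustrative assumptions, not part of any standard tooling; adapt the token list to your own policy.

```python
# Crawler tokens taken from the paragraph above; adjust to your own policy.
AI_CRAWLERS = [
    "GPTBot", "Google-Extended", "CCBot",
    "OAI-SearchBot", "PerplexityBot", "ClaudeBot",
]


def build_robots_txt(sitemap_url, crawlers=AI_CRAWLERS):
    """Return robots.txt text that explicitly allows each listed crawler."""
    lines = []
    for agent in crawlers:
        # One group per crawler: a User-agent line followed by its rule.
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    # The Sitemap line is independent of any User-agent group.
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


# Hypothetical site URL, for illustration only.
robots_txt = build_robots_txt("https://example.com/sitemap.xml")
print(robots_txt)
```

Crawlers without a matching group are unaffected by this file, so the sketch only adds explicit permission; it does not restrict anyone.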

sitemap.xml — inventory

A list of the pages that exist, for search engines and crawlers. It answers "what is here," not "what can I do." Reference it from robots.txt with a Sitemap: line.
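A minimal sitemap can be produced with Python's standard library. The sketch below emits only the required `urlset`, `url`, and `loc` elements from the sitemaps.org schema; the helper name `build_sitemap` and the page URLs are placeholders chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(page_urls):
    """Return sitemap XML listing each page URL in a <url><loc> entry."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in page_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Hypothetical pages, for illustration only.
sitemap_xml = build_sitemap([
    "https://example.com/",
    "https://example.com/docs",
])
print(sitemap_xml)
```

The matching robots.txt entry would then be a single line of the form `Sitemap: https://example.com/sitemap.xml`.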

llms.txt — what to read

A markdown file pointing an LLM at the content worth reading (docs, key pages, context) so it does not have to crawl and parse your whole HTML site. It spread because IDE agents fetch it at inference time, where a static file is more context-efficient than parsed HTML. It is the reading sibling of capabilities.txt.
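The sketch below assembles such a file. It assumes the layout commonly described in the llms.txt proposal (an H1 title, a blockquote summary, then H2 sections of annotated links); check the current proposal before relying on it. The function name, site title, and URLs are hypothetical.

```python
def build_llms_txt(title, summary, sections):
    """Return llms.txt markdown: title, summary, then link lists per section.

    `sections` maps a heading to a list of (name, url, note) tuples.
    """
    lines = [f"# {title}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines += [f"## {heading}", ""]
        for name, url, note in links:
            lines.append(f"- [{name}]({url}): {note}")
        lines.append("")
    return "\n".join(lines)


# Hypothetical content, for illustration only.
llms_txt = build_llms_txt(
    "Example Site",
    "Hypothetical product docs, summarized for LLM readers.",
    {
        "Docs": [
            ("Quickstart", "https://example.com/docs/quickstart.md",
             "Install and first request"),
        ],
    },
)
print(llms_txt)
```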

capabilities.txt — what you can do

The missing layer none of the others cover: a declaration of the capabilities an agent can invoke, with a pointer to where the call goes (your API or an MCP server). It is static and crawlable — an agent reads it with no live connection — and it hands off invocation to whatever you already run. This is the file that turns a readable site into an actionable one.

New to it? Start with the 10-minute guide, or read how AI agents discover what a site can do.

What is a well-known file?

A file published at a predictable, conventional path (the site root or under /.well-known/) that automated clients fetch without being told where it is. robots.txt is the original example; sitemap.xml, llms.txt, and capabilities.txt follow the same idea. The value is zero-coordination discovery: any agent knows the path in advance.
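Zero-coordination discovery means a client can compute every URL from the origin alone. A minimal sketch, using the four root paths from the table above (the helper name `well_known_urls` and the example origin are assumptions):

```python
from urllib.parse import urljoin

# Root paths from the table of four files.
WELL_KNOWN_PATHS = ["/robots.txt", "/sitemap.xml", "/llms.txt", "/capabilities.txt"]


def well_known_urls(origin):
    """Map each well-known path to its absolute URL on the given site."""
    base = origin.rstrip("/") + "/"
    # Joining with a leading-slash path always resolves against the site root,
    # even if the caller passes a deeper URL such as https://example.com/blog/.
    return {path: urljoin(base, path) for path in WELL_KNOWN_PATHS}


urls = well_known_urls("https://example.com")
for path, url in urls.items():
    print(path, "->", url)
```

An agent would then issue plain GET requests against these URLs; no prior registration or handshake is involved.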

Do these files replace each other?

No — they answer different questions and compose. robots.txt sets crawl permissions, sitemap.xml lists pages, llms.txt points at the content worth reading, and capabilities.txt declares what the site can do (its invocable capabilities). A complete site publishes all four; each serves a different automated reader.

Which ones do AI agents actually use?

Agentic and IDE tooling already fetches robots.txt and llms.txt at inference time because static files are more context-efficient than parsing HTML. capabilities.txt extends that same habit to action: it tells an agent what it can do and where to call, with no live connection required. As agents shift from reading to acting, capabilities.txt is the file that matters for invocation.

Where do I put capabilities.txt — root or /.well-known/?

Publish the human- and agent-readable markdown form at the root: /capabilities.txt, sibling to robots.txt and llms.txt. Optionally also publish the structured form at /.well-known/capabilities.json for agents that want machine-resolvable descriptors. Both are valid; the root file is the primary one.

How do I make sure AI crawlers can read them?

Explicitly allow both the training crawlers (GPTBot, ClaudeBot, Google-Extended, CCBot) and the search crawlers (OAI-SearchBot, PerplexityBot) in robots.txt. Files an agent or answer engine cannot fetch cannot be discovered or cited. Welcoming the AI crawlers is a best practice for adopters, not a risk: these files only advertise; they grant nothing.
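One way to check an existing robots.txt is Python's standard `urllib.robotparser`. The sketch below reports every (crawler, file) pair that the rules would block. The sample rules and the `blocked_fetches` helper are illustrative assumptions, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = [
    "GPTBot", "Google-Extended", "CCBot",
    "OAI-SearchBot", "PerplexityBot", "ClaudeBot",
]
WELL_KNOWN_PATHS = ["/robots.txt", "/sitemap.xml", "/llms.txt", "/capabilities.txt"]


def blocked_fetches(robots_txt, origin="https://example.com"):
    """Return (crawler, path) pairs that the given robots.txt text disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        (agent, path)
        for agent in AI_CRAWLERS
        for path in WELL_KNOWN_PATHS
        if not parser.can_fetch(agent, origin + path)
    ]


# Hypothetical rules that shut out one AI crawler while allowing everyone else.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = blocked_fetches(SAMPLE)
for agent, path in blocked:
    print(f"{agent} cannot fetch {path}")
```

An empty result means every listed crawler can reach all four files under this parser's reading of the rules.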

Publish all four

Beyond discovery, the next questions for any consequential action — may I, what happened, can I prove it — are governance and evidence, defined by the Capability Host Protocol. capabilities.txt is the public face of that stack.
