
llms.txt vs robots.txt vs sitemap.xml

What each file does, and why you need all three.

Three tiny text files live at the root of your domain. Two have existed for over twenty years. The third was published in 2024 and started getting real traction in 2026. Together, they tell crawlers and AI systems three completely different things — and yet teams still confuse them every week. This is the plain-English breakdown.

The 30-second answer

robots.txt tells crawlers what they may not crawl. sitemap.xml tells crawlers what URLs exist. llms.txt tells AI systems what your site is actually about, in a format optimised for them to read.

They’re not redundant. Three different jobs, three different audiences. A modern site benefits from publishing all three.

                 robots.txt                    sitemap.xml                    llms.txt
Job              Set crawl permissions         List all URLs                  Curate the AI summary
Audience         All crawlers                  Search engines                 AI systems
Format           Plain text                    XML                            Markdown
Year published   1994 (de facto), 2022 (RFC)   2005                           2024
Tone             Imperative — “don’t”          Inventory — “here are URLs”    Editorial — “here’s what matters”
Typical size     < 1 KB                        10 KB – 50 MB                  1–20 KB
Required?        No (recommended)              No (recommended)               No (becoming expected)

1. robots.txt — the bouncer

Purpose: tell crawlers which paths they are not allowed to fetch.

Location: always https://yoursite.com/robots.txt. Each subdomain needs its own. A robots.txt placed in a subdirectory is ignored.

Format: plain text, one rule per line. Defined by the Robots Exclusion Protocol, formalised as RFC 9309 in September 2022.

Who reads it: well-behaved crawlers — Googlebot, Bingbot, GPTBot, ClaudeBot, PerplexityBot, etc. Malicious scrapers ignore it; that’s a feature, not a bug. robots.txt is a request, not enforcement.

User-agent: *
Disallow: /admin
Disallow: /cart

User-agent: GPTBot
Allow: /

Sitemap: https://yoursite.com/sitemap.xml

Common mistake: blocking GPTBot, ClaudeBot, or PerplexityBot with a default-deny rule. If you do, your llms.txt is useless — those crawlers won’t fetch it. Our checker flags this automatically.
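
You can also check this locally. Python’s standard library ships a robots.txt parser; the sketch below (yoursite.com is a placeholder) asks whether the major AI crawlers may fetch your llms.txt:

# Check whether AI crawlers may fetch /llms.txt under your live robots.txt.
# Sketch only; swap the placeholder domain for your own.
from urllib import robotparser

SITE = "https://yoursite.com"  # placeholder domain

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()  # fetches and parses the file over HTTP

for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    ok = rp.can_fetch(agent, f"{SITE}/llms.txt")
    print(f"{agent}: {'allowed' if ok else 'BLOCKED'}")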

2. sitemap.xml — the index card

Purpose: tell crawlers which URLs exist on your site, plus optional metadata (last modified, change frequency, priority).

Location: conventionally at /sitemap.xml, but the canonical reference comes from either Google Search Console or a Sitemap: directive in robots.txt. Large sites split into multiple sitemaps under a sitemap index.

Format: XML. The schema is at sitemaps.org (last revised 2008, still authoritative).

Who reads it: search engines that want a complete URL inventory — Google, Bing, Yandex, etc. Not directly consumed by AI assistants today.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yoursite.com/</loc>
    <lastmod>2026-05-01</lastmod>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://yoursite.com/pricing</loc>
    <lastmod>2026-04-15</lastmod>
  </url>
</urlset>
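
If your site is large enough to need the sitemap index mentioned above, it’s the same idea one level up. File names here are illustrative:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yoursite.com/sitemap-pages.xml</loc>
    <lastmod>2026-05-01</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://yoursite.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>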

Common mistake: believing a sitemap forces Google to index a URL. It doesn’t. A sitemap is a hint about existence and recency; ranking and indexing decisions remain the search engine’s.

3. llms.txt — the table of contents for AI

Purpose: give an AI system a curated, Markdown summary of your site so it can answer questions about you without parsing your full HTML, navigation, scripts, and cookie banners.

Location: always https://yoursite.com/llms.txt. Larger sites may also publish llms-full.txt, a concatenation of full article bodies in Markdown.

Format: Markdown with a fixed structure: an H1 site name, a blockquote summary, ## sections, and link list items in - [name](url): description form. Spec by Jeremy Howard at Answer.AI, published September 2024.

Who reads it: AI agents and crawlers when they need to ground an answer in your site. Already adopted by Anthropic, Stripe, Cloudflare, Vercel, Mintlify, and a growing list of major SaaS sites. Adoption among the top 1,000 sites is still under 1%, but the curve is steep.

# Acme Corp

> Open-source database for full-text search across structured documents.

## Docs
- [Quickstart](https://acme.example/docs/quickstart): Get a cluster running in 5 minutes.
- [API reference](https://acme.example/docs/api): Full HTTP API.

## Optional
- [Architecture](https://acme.example/blog/architecture): How the index is sharded.

Common mistake: putting it in the wrong place (/.well-known/llms.txt, or in a subdirectory). The spec is unambiguous: it lives at the root.
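
Before you publish, you can sanity-check the structure yourself. The rough Python sketch below mirrors the shape the spec requires; it’s a loose check, not the official validator, and the fallback URL is a placeholder:

import re
import sys
import urllib.request

# Fetch the file (pass a URL as the first argument, or edit the placeholder).
url = sys.argv[1] if len(sys.argv) > 1 else "https://yoursite.com/llms.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")
lines = [line for line in text.splitlines() if line.strip()]

checks = {
    "starts with an H1 site name": bool(lines) and lines[0].startswith("# "),
    "has a blockquote summary": any(line.startswith("> ") for line in lines),
    "has at least one ## section": any(line.startswith("## ") for line in lines),
    "uses - [name](url) link items": any(re.match(r"- \[.+\]\(.+\)", line) for line in lines),
}
for name, ok in checks.items():
    print("PASS" if ok else "FAIL", name)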

How the three work together

They overlap less than you’d think.

  • robots.txt sets the permissions boundary — what may be fetched at all.
  • sitemap.xml describes the full surface area — every URL you want a search engine to know about.
  • llms.txt selects the important subset — the curated “here’s what matters” pointer for AI.

A typical small site has 50–500 URLs in its sitemap, but only 5–25 entries in its llms.txt. That’s the point: llms.txt is editorial, not exhaustive.
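
Curious what that ratio looks like for your own site? A quick sketch, assuming both files live at their conventional root paths and the sitemap is a single urlset rather than an index (the domain is a placeholder):

import re
import urllib.request
import xml.etree.ElementTree as ET

SITE = "https://yoursite.com"  # placeholder domain

# Count <url> entries in the sitemap (assumes a urlset, not a sitemap index).
sitemap_xml = urllib.request.urlopen(f"{SITE}/sitemap.xml").read()
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
url_count = len(ET.fromstring(sitemap_xml).findall("sm:url", ns))

# Count curated "- [name](url)" link items in llms.txt.
llms = urllib.request.urlopen(f"{SITE}/llms.txt").read().decode("utf-8")
entry_count = len(re.findall(r"^- \[.+?\]\(.+?\)", llms, flags=re.M))

print(f"sitemap.xml lists {url_count} URLs; llms.txt curates {entry_count} entries")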

What happens if I’m missing one?

No robots.txt

All crawlers default to “everything allowed.” This is usually fine; you only need a robots.txt if you have paths to hide (admin, staging, search result pages). One catch: robots.txt is where the Sitemap: directive lives, so without it search engines must find your sitemap via Search Console or plain link discovery, which is slower.

No sitemap.xml

Search engines crawl your site and discover URLs by following links. For sites with strong internal linking this works fine. For deeply nested or paginated content (large catalogues, archives), you’ll see slower indexing without a sitemap.

No llms.txt

AI assistants parse your full HTML — navigation, scripts, cookie banners, the lot — and try to summarise your site from the noise. Some succeed, but you’re leaving the answer to chance. Sites that publish a clean llms.txt report up to 10× lower token usage when AI systems read them, which directly affects how often they get cited.

The setup checklist

  1. Publish a sitemap.xml — most CMSes (WordPress, Webflow, Shopify, Next.js) generate this automatically. Confirm it’s there at /sitemap.xml.
  2. Publish a robots.txt — even a minimal one is better than nothing (see the minimal example after this list). Make sure you’re not accidentally blocking AI crawlers.
  3. Generate an llms.txt — paste your URL into our generator to get a draft you can edit and upload.
  4. Validate it — run the file through the validator to confirm it’s spec-compliant before announcing it.
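
The minimal robots.txt from step 2, spelled out:

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml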

FAQ

Does llms.txt replace robots.txt or sitemap.xml?

No. They serve different audiences and different purposes. Search engines still rely on sitemaps; crawlers still respect robots.txt. llms.txt is purely additive — a new layer for AI.

Do AI systems actually fetch llms.txt today?

ChatGPT (with browsing), Claude, Perplexity, and Cursor are all known to read llms.txt when they fetch a site for grounding. As of 2026 it's not universal, but the major players support it and the list keeps growing.

If I block AI crawlers in robots.txt, does llms.txt still help?

No. robots.txt takes precedence: blocking GPTBot or ClaudeBot in robots.txt means those crawlers won't fetch llms.txt either. If you want AI to know about your site, you have to let them in.

Should I list every page in llms.txt?

No. llms.txt is a curated table of contents, not a sitemap. Aim for 5–25 entries that represent your most important content: docs, pricing, key product pages, foundational articles. Put the long tail elsewhere (sitemap) or in llms-full.txt.

What about llms-full.txt?

Optional. It's the full Markdown of your most important pages concatenated. Big documentation sites (Stripe, Anthropic) publish both. For most sites, just llms.txt is enough.

Will llms.txt affect my Google ranking?

Not directly. Google uses sitemap.xml for discovery, not llms.txt. llms.txt affects AI search visibility — ChatGPT, Claude, Perplexity, etc. — which is increasingly a separate channel.
