The default position
For 95% of B2B brands, the answer is: allow every major AI crawler. Blocking them does not protect your content; it just makes you invisible on the surfaces where buyers now do their research.
If you are not in ChatGPT, you are not in the funnel. The cost of getting indexed is approximately zero (your content is already public on the open web); the cost of being blocked is the entire AI-citation channel.
The robots.txt allow-list
Drop this into /robots.txt on every site we engage with:
User-agent: *
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: cohere-ai
Allow: /
Sitemap: https://yourdomain.com/sitemap-index.xml
That covers OpenAI (GPTBot), Anthropic (ClaudeBot plus the legacy anthropic-ai token), Perplexity (PerplexityBot), Google's AI training (Google-Extended), Common Crawl, which feeds many models (CCBot), Apple's AI training (Applebot-Extended), and Cohere (cohere-ai). Note that Google-Extended and Applebot-Extended are opt-out tokens honoured by Googlebot and Applebot rather than separate fetchers.
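To confirm the live file parses the way you intend, here is a quick check using Python's standard-library robot parser (a sketch; yourdomain.com is a placeholder):

from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
             "CCBot", "anthropic-ai", "Applebot-Extended", "cohere-ai"]

rp = RobotFileParser("https://yourdomain.com/robots.txt")
rp.read()  # fetch and parse the live file

for agent in AI_AGENTS:
    verdict = "allowed" if rp.can_fetch(agent, "/") else "BLOCKED"
    print(f"{agent}: {verdict}")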
The legitimate exceptions
There are content types where blocking is correct:
- Paywalled content — block the crawler from the paid section, not the marketing site
- MNDA / customer-only documentation — block crawlers from the /docs/[customer]/... paths
- Internal-only knowledge base — block the entire subdomain
- Personally identifying user-generated content — case-by-case
For a typical B2B site without those content categories, there is nothing legitimate to hide from AI crawlers.
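Where one of those exceptions does apply, scope the Disallow to the path rather than the whole site. A sketch, with hypothetical paths (/paid/ and /docs/customers/ stand in for your own structure):

User-agent: *
Disallow: /paid/
Disallow: /docs/customers/

Everything not listed stays crawlable, so the marketing site keeps its AI-citation surface while the gated content stays out.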
What blocking actually costs
Brands that explicitly block AI crawlers see:
- Zero citations in the blocked LLM (obvious)
- A roughly 30–60% drop in citations across the LLMs you did not block, because the AI authority graph still picks your site up through other crawlers, but with a weaker signal
- An implicit "this brand opted out" signal that some AI extractors use to down-weight even content they can technically still see
The trade-off is not “block to protect content vs. allow to gain citations”. It is “be invisible vs. be cited”. Almost no commercial brand wins by being invisible.
The /llms.txt header layer
Pair the allow-list with proper headers on /llms.txt:
/llms.txt
  Content-Type: text/plain; charset=utf-8
  Cache-Control: public, max-age=3600

/robots.txt
  Content-Type: text/plain; charset=utf-8
  Cache-Control: public, max-age=86400
This is the Cloudflare Pages _headers format; other hosts have equivalent config. Without these headers, /llms.txt may be served as application/octet-stream and skipped by crawlers.
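You can verify what is actually being served with a few lines of standard-library Python (yourdomain.com is again a placeholder):

from urllib.request import urlopen

with urlopen("https://yourdomain.com/llms.txt") as resp:
    print(resp.headers.get("Content-Type"))   # want: text/plain; charset=utf-8
    print(resp.headers.get("Cache-Control"))  # want: public, max-age=3600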
What we will not do
- Block Google-Extended on a site where Google is the primary search source (cuts off your own AIO surface)
- Block CCBot then ask why citations are flat (Common Crawl feeds many smaller models)
- Add per-LLM cloaking that serves different content to different crawlers (crawlers detect cloaking and penalise it)
What you should do today
Open your robots.txt. Confirm none of those user-agents have Disallow: /. If they do, fix it — either remove the line or change it to Allow: /.
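The fix looks like this, using GPTBot as the example agent:

# before: blocks OpenAI entirely
User-agent: GPTBot
Disallow: /

# after: open again
User-agent: GPTBot
Allow: /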
That is a five-minute fix that removes a substantial blocker on AI citation.
If your site does not have /llms.txt yet, that is the next step — see our guide for the spec and a copy-pasteable starter.