The default position
For 95% of B2B brands, the answer is: allow every major AI crawler. Blocking them does not protect your content; it just makes you invisible on the surfaces where buyers now do their research.
If you are not in ChatGPT, you are not in the funnel. The cost of getting indexed is approximately zero (your content is already public on the open web); the cost of being blocked is the entire AI-citation channel.
The robots.txt allow-list
Drop this into /robots.txt on every site we engage with:
User-agent: *
Allow: /
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: CCBot
Allow: /
User-agent: anthropic-ai
Allow: /
User-agent: Applebot-Extended
Allow: /
User-agent: cohere-ai
Allow: /
Sitemap: https://yourdomain.com/sitemap-index.xml
That covers OpenAI (GPTBot), Anthropic (ClaudeBot plus the legacy anthropic-ai token), Perplexity (PerplexityBot), Google's AI training (Google-Extended), Common Crawl, which feeds many models (CCBot), Apple's AI training (Applebot-Extended), and Cohere (cohere-ai). Note that Google-Extended and Applebot-Extended are opt-out tokens honoured by Googlebot and Applebot rather than separate fetchers.
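To confirm the live file parses the way you intend, here is a quick check using Python's standard-library robot parser (a sketch; yourdomain.com is a placeholder):

from urllib.robotparser import RobotFileParser

AI_AGENTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
             "CCBot", "anthropic-ai", "Applebot-Extended", "cohere-ai"]

rp = RobotFileParser("https://yourdomain.com/robots.txt")
rp.read()  # fetch and parse the live file

for agent in AI_AGENTS:
    verdict = "allowed" if rp.can_fetch(agent, "/") else "BLOCKED"
    print(f"{agent}: {verdict}")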
The legitimate exceptions
There are content types where blocking is correct:
- Paywalled content — block the crawler from the paid section, not the marketing site
- MNDA / customer-only documentation — block crawlers from the /docs/[customer]/... paths
- Internal-only knowledge base — block the entire subdomain
- Personally identifying user-generated content — case-by-case
For a typical B2B site without those content categories, there is nothing legitimate to hide from AI crawlers.
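Where one of those exceptions does apply, scope the Disallow to the path rather than the whole site. A sketch, with hypothetical paths (/paid/ and /docs/customers/ stand in for your own structure):

User-agent: *
Disallow: /paid/
Disallow: /docs/customers/

Everything not listed stays crawlable, so the marketing site keeps its AI-citation surface while the gated content stays out.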
What blocking actually costs
Brands that explicitly block AI crawlers see:
- Zero citations in the blocked LLM (obvious)
- A roughly 30–60% drop in citations across the LLMs you did not block, because the AI authority graph still picks your site up through other crawlers, but with a weaker signal
- An implicit "this brand opted out" signal that some AI extractors use to down-weight even content they can technically still see
The trade-off is not “block to protect content vs. allow to gain citations”. It is “be invisible vs. be cited”. Almost no commercial brand wins by being invisible.
The /llms.txt header layer
Pair the allow-list with proper headers on /llms.txt:
/llms.txt
  Content-Type: text/plain; charset=utf-8
  Cache-Control: public, max-age=3600

/robots.txt
  Content-Type: text/plain; charset=utf-8
  Cache-Control: public, max-age=86400
This is the Cloudflare Pages _headers format; other hosts have equivalent config. Without these headers, /llms.txt may be served as application/octet-stream and skipped by crawlers.
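You can verify what is actually being served with a few lines of standard-library Python (yourdomain.com is again a placeholder):

from urllib.request import urlopen

with urlopen("https://yourdomain.com/llms.txt") as resp:
    print(resp.headers.get("Content-Type"))   # want: text/plain; charset=utf-8
    print(resp.headers.get("Cache-Control"))  # want: public, max-age=3600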
What we will not do
- Block Google-Extended on a site where Google is the primary search source (cuts off your own AIO surface)
- Block CCBot then ask why citations are flat (Common Crawl feeds many smaller models)
- Add per-LLM cloaking that serves different content to different crawlers (crawlers detect cloaking and penalise it)
What you should do today
Open your robots.txt. Confirm none of those user-agents have Disallow: /. If they do, fix it — either remove the line or change it to Allow: /.
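The fix looks like this, using GPTBot as the example agent:

# before: blocks OpenAI entirely
User-agent: GPTBot
Disallow: /

# after: open again
User-agent: GPTBot
Allow: /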
That is a five-minute fix that removes a substantial blocker on AI citation.
If your site does not have /llms.txt yet, that is the next step — see our guide for the spec and a copy-pasteable starter.