Is your opt-out policy for AI training crawlers consistent across every major vendor?

If you opt out of AI training, the rule must apply to every training crawler - not just the famous ones.

What this signal tests

We check whether the AI-training crawlers from every major vendor - OpenAI's GPTBot, Anthropic's ClaudeBot, Google-Extended, Apple's Applebot-Extended, Common Crawl's CCBot, Meta's training crawler, ByteDance's Bytespider, and Cohere's training bot - are all treated the same way by your robots.txt. Either all allowed, or all disallowed.

Why it matters for your visibility in AI

An inconsistent training opt-out leaks your content to whichever vendor you forgot. If you block GPTBot and ClaudeBot but leave CCBot allowed, your pages still end up in Common Crawl, which then feeds many open-source language models and downstream commercial products. The block on the big names becomes purely cosmetic. For businesses with legal, brand, or licensing reasons to opt out of AI training, a leak through one obscure crawler can undermine the entire policy. Conversely, sites that intend to allow training but accidentally block a single vendor lose coverage in that vendor's models and gain nothing elsewhere. A coherent, deliberate stance across all training bots is what gets you the result you actually want.

Pass criteria at a glance

Passes when	All training-purpose bots share a single intentional posture.

How we test it

We identify every training-purpose AI bot in your robots.txt and classify each one as currently allowed or disallowed at the site root. We then determine which stance the majority of those bots share. Any bot whose policy disagrees with that majority is flagged as a likely oversight rather than an intentional split, since intentional per-vendor differences in training policy are extremely rare.

Technical detection method

Classify each training bot as Allow/Disallow at `/`; flag deviations from majority stance.
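
The classification described above can be sketched in Python with the standard library's `urllib.robotparser`. This is a minimal illustration under stated assumptions, not the scanner's actual implementation: the `TRAINING_BOTS` list, the function names, and the sample file are hypothetical, and the parser's matching rules only approximate how each vendor's crawler reads robots.txt. Unlike the test described above, the sketch checks every bot in the list, so a bot without its own block is judged by the wildcard group or, if there is none, counted as allowed.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumed list of user-agent tokens, taken from the crawlers named in this
# article. Cohere's token is left out because the article does not name it.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Meta-ExternalAgent", "Bytespider",
]


def classify(robots_txt: str) -> dict[str, str]:
    """Return 'allow' or 'disallow' at the site root for each training bot."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: "allow" if parser.can_fetch(bot, "/") else "disallow"
        for bot in TRAINING_BOTS
    }


def find_deviations(stances: dict[str, str]) -> tuple[str, list[str]]:
    """Return the majority stance and the bots that disagree with it."""
    majority, _count = Counter(stances.values()).most_common(1)[0]
    return majority, sorted(b for b, s in stances.items() if s != majority)


# Hypothetical robots.txt that blocks six training bots but forgets CCBot.
SAMPLE = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: Google-Extended
User-agent: Applebot-Extended
User-agent: Meta-ExternalAgent
User-agent: Bytespider
Disallow: /
"""

majority, deviations = find_deviations(classify(SAMPLE))
print(majority, deviations)  # disallow ['CCBot']
```

In an exact tie, `Counter.most_common` returns whichever stance it saw first, so a real checker would probably want to report a tie explicitly instead of picking a side.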

If your site fails: how to fix it

Decide your overall training policy first: do you want your site used to train AI models, yes or no? Write it down clearly before editing.
Open robots.txt and locate the User-agent blocks for each training crawler: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended, CCBot, Meta-ExternalAgent, Bytespider, and any others on your radar.
Apply the same Disallow rule (or absence of one) to every training bot. If you are opting out, give every one of them an explicit Disallow line rather than relying on the wildcard group: a crawler that finds a User-agent block naming it follows that block and ignores the `*` rules.
Pay particular attention to CCBot. Common Crawl is the upstream feed for many open-source LLMs, and missing it is the single most common cause of an inconsistent opt-out.
Save the file and re-run the AI Ready Test to confirm every training bot now shares the same stance.
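
For a full opt-out, the steps above amount to writing one User-agent block per training crawler, each with its own Disallow line. The sketch below generates those blocks and checks them with Python's `urllib.robotparser`. The bot list is an assumption based on the crawlers named in this article (Cohere's token is omitted because it is not given here), and `render_opt_out` and `all_blocked` are hypothetical helper names.

```python
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the training crawlers named in this article.
TRAINING_BOTS = [
    "GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended",
    "CCBot", "Meta-ExternalAgent", "Bytespider",
]


def render_opt_out(bots: list[str]) -> str:
    """Build one User-agent group per bot, each with its own Disallow line."""
    return "\n\n".join(f"User-agent: {bot}\nDisallow: /" for bot in bots) + "\n"


def all_blocked(robots_txt: str, bots: list[str]) -> bool:
    """True when no listed bot may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not any(parser.can_fetch(bot, "/") for bot in bots)


rules = render_opt_out(TRAINING_BOTS)
print(rules)                              # paste into robots.txt after review
print(all_blocked(rules, TRAINING_BOTS))  # True
```

The generated text only covers the training crawlers. Any existing rules for search bots or the wildcard group stay as they are in the rest of the file.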

Quick facts

Maturity	ESTABLISHED
Weight	medium
Category	Crawlability

Frequently asked questions

Is blocking training bots the same as opting out legally?

Not entirely. A robots.txt block is a clear technical signal that compliant crawlers respect, and it strengthens your legal position. But it is not a substitute for a written content licence or terms of service if you want to make the opt-out legally enforceable against non-compliant scrapers.

What is Common Crawl and why does it matter?

Common Crawl is a non-profit that crawls the web at scale and publishes the result as an open dataset. Many open-source and academic language models train on this dataset. Allowing CCBot effectively means allowing your content into a wide ecosystem of downstream models you cannot individually opt out of.

Can I allow some training bots but not others?

Technically yes - robots.txt supports per-bot rules. In practice, an inconsistent stance is rarely intentional and usually means you forgot one. If you genuinely want a split policy (for example, allowing one trusted vendor only), document the reasoning in a comment above the rule so future editors do not normalise it.

Will blocking training bots hurt my search visibility?

Usually no. Training bots are separate from search bots. Blocking GPTBot does not block OAI-SearchBot, and blocking Google-Extended does not block Googlebot. You can opt out of training while remaining fully visible in AI search results.
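
The separation between training and search tokens can be checked locally. The following is a small sketch using Python's `urllib.robotparser`. The robots.txt content and the `root_access` helper are hypothetical, and the parser only approximates how each vendor's crawler matches user-agent tokens.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that opts out of two training crawlers only.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""


def root_access(robots_txt: str, agents: list[str]) -> dict[str, bool]:
    """Map each user-agent token to whether it may fetch the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, "/") for agent in agents}


access = root_access(
    ROBOTS_TXT, ["GPTBot", "OAI-SearchBot", "Google-Extended", "Googlebot"]
)
for agent, allowed in access.items():
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

With this file the two training tokens are blocked, while OAI-SearchBot and Googlebot match no group and remain allowed.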

Run your own scan

Run a free scan and see how your site grades across all 155 AI-readiness signals.
