Is your firewall or CDN silently blocking AI crawlers despite your robots.txt allowing them?

A bot policy in robots.txt means nothing if your edge firewall blocks the bot's IP address before it ever reaches your robots file.

Scan your site

What this signal tests

We check whether the published IP address ranges of major AI vendors - OpenAI, Anthropic, Perplexity, Common Crawl, Google - are able to reach your site without being challenged or blocked by your CDN, WAF, or bot-management layer. This is a separate question from whether robots.txt allows them, and the two are often in silent conflict.

Why it matters for your visibility in AI

Cloudflare, Akamai, and other edge platforms ship default bot-management rules that can block traffic from many AI crawlers, even when your robots.txt explicitly allows those same crawlers. The result is a silent contradiction: your stated policy welcomes AI bots, but your infrastructure turns them away with a 403 response or a JavaScript challenge they cannot solve. The problem is hard to spot because it is invisible from inside your site: your CMS shows no errors and your robots.txt looks perfect, yet your content never appears in AI answers, while a competitor on simpler hosting may be cited freely. The fix belongs at the CDN layer, not the application layer.

Pass criteria at a glance

Passes when: no 403, 429, or JavaScript-challenge response is returned on robots-allowed paths from sampled vendor IPs.

How we test it

We download the published IP range files that each major AI vendor maintains - OpenAI, Anthropic, Perplexity, Common Crawl, and Google. We then attempt to fetch your site using each vendor's user agent string from a sample of those IPs, and we watch for any blocked response, rate-limit error, or JavaScript challenge on pages that your robots.txt clearly allows. Any such block is a contradiction we flag.

Technical detection method
Pull the vendor IP JSON files (claude.com/crawling/bots.json, openai.com/gptbot-ranges.json, perplexity.com/*.json); simulate a vendor-UA fetch from those ranges; flag 403/429 responses and challenges on robots-allowed paths.
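
The flagging step can be illustrated with a minimal Python sketch. It is not the scanner's actual implementation: the function names, the marker strings, and the choice of status codes that count as a challenge are all assumptions made for illustration, and real challenge pages differ between CDNs.

```python
# Hypothetical marker strings; real challenge pages vary by CDN and change over time.
CHALLENGE_MARKERS = ("just a moment", "captcha", "enable javascript")


def classify_response(status, body=""):
    """Return 'rate_limited', 'challenge', 'blocked', or 'ok' for one fetch result."""
    if status == 429:
        return "rate_limited"
    lowered = body.lower()
    # Assumption: interstitial challenges arrive with a 403 or 503 status.
    if status in (403, 503) and any(m in lowered for m in CHALLENGE_MARKERS):
        return "challenge"
    if status == 403:
        return "blocked"
    return "ok"


def find_contradictions(results):
    """Keep only robots-allowed paths whose fetch was not a clean response.

    `results` is a list of (path, allowed_by_robots, status, body) tuples.
    """
    flagged = []
    for path, allowed, status, body in results:
        verdict = classify_response(status, body)
        if allowed and verdict != "ok":
            flagged.append((path, verdict))
    return flagged
```

A disallowed path that returns 403 is not flagged, because the block agrees with the stated policy; only allowed paths that are blocked count as contradictions.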

If your site fails: how to fix it

  1. Log in to your CDN or WAF dashboard (Cloudflare, Akamai, Fastly, AWS WAF, or similar) and locate the bot-management or firewall rules section.
  2. Add the published vendor IP ranges to an explicit allowlist. Anthropic, OpenAI, and Perplexity publish their crawler IPs as JSON files at known URLs; record those URLs in your remediation ticket so the allowlist can be refreshed from them automatically.
  3. Disable JavaScript challenges and CAPTCHA prompts for verified AI user agents. These crawlers do not execute JavaScript, so they fail any interactive challenge and your site silently drops out of their index.
  4. If your CDN offers a managed AI-crawler ruleset (Cloudflare and Akamai both do), review whether its default behaviour is to block or allow, and adjust to match your policy.
  5. Test by simulating a vendor fetch from outside your network - many CDNs have a built-in test tool - and confirm a clean 200 response on your homepage and a sample article.
  6. Re-run the AI Ready Test to confirm no silent blocks remain.
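
As a rough illustration of step 2, the Python sketch below turns a vendor range file into a list of CIDR networks ready to load into an allowlist. The JSON shape (a `prefixes` array whose entries carry `ipv4Prefix` or `ipv6Prefix`) is an assumption; check each vendor's file before relying on it. The sample data uses reserved documentation ranges, not real crawler IPs, and `extract_networks` is a hypothetical helper name.

```python
import ipaddress
import json


def extract_networks(doc):
    """Collect CIDR networks from a parsed range file (assumed 'prefixes' layout)."""
    networks = []
    for entry in doc.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            # ip_network raises ValueError on malformed input, so bad entries fail loudly.
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


# Stand-in for a downloaded file; these are documentation ranges, not vendor IPs.
SAMPLE = json.loads(
    '{"prefixes": [{"ipv4Prefix": "192.0.2.0/24"}, {"ipv6Prefix": "2001:db8::/32"}]}'
)
ALLOWLIST = extract_networks(SAMPLE)
```

How the resulting networks are pushed into the CDN depends on the platform's own rule or list API, which is outside this sketch.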

Quick facts

Maturity: Established
Weight: High
Category: Crawlability

Frequently asked questions

Why would my CDN block bots that I want to allow?

Most bot-management defaults are tuned for stopping scrapers, ad fraud, and automated abuse. They cannot tell a welcome AI crawler apart from a hostile scraper by default. Without an explicit allowlist, the well-intentioned default behaviour quietly hurts your AI visibility.

How do I know if Cloudflare is blocking AI bots on my site?

Check your Cloudflare dashboard under Security > Events and filter by recent traffic from known AI user agents like GPTBot or ClaudeBot. If you see many blocked or challenged requests, that is your signal. Cloudflare also offers a dedicated AI Bots view in its UI.
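
If you export those events for offline review, a short Python sketch like the one below can tally blocked or challenged requests per AI user agent. The field names (`userAgent`, `action`) and the action values are assumptions about the export format, not a documented schema, so adjust them to match what your CDN actually produces.

```python
from collections import Counter

# User agent tokens named in the answer above; extend with any others you care about.
AI_TOKENS = ("GPTBot", "ClaudeBot")

# Hypothetical action labels that mean the request did not get through cleanly.
BLOCKING_ACTIONS = {"block", "challenge", "managed_challenge", "js_challenge"}


def count_blocked_ai_requests(events):
    """Count blocked or challenged events per AI crawler token."""
    counts = Counter()
    for event in events:
        if event.get("action") not in BLOCKING_ACTIONS:
            continue
        user_agent = event.get("userAgent", "").lower()
        for token in AI_TOKENS:
            if token.lower() in user_agent:
                counts[token] += 1
    return counts
```

A non-zero count for a crawler that your robots.txt allows is the contradiction this signal looks for.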

What if vendor IP ranges change?

They do change occasionally. Vendors publish their ranges as machine-readable JSON files specifically so you can automate refreshes. Set up a cron job or scheduled workflow that re-downloads the file weekly and updates your CDN allowlist accordingly.
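
The core of such a refresh job is a comparison between what the allowlist holds now and what the vendor currently publishes. A minimal Python sketch, assuming both sides are available as lists of CIDR strings (`diff_ranges` is a hypothetical helper, and the addresses are documentation ranges):

```python
def diff_ranges(current, published):
    """Return (to_add, to_remove) so the allowlist matches the published ranges."""
    current_set = set(current)
    published_set = set(published)
    to_add = sorted(published_set - current_set)
    to_remove = sorted(current_set - published_set)
    return to_add, to_remove


to_add, to_remove = diff_ranges(
    current=["192.0.2.0/24", "198.51.100.0/24"],
    published=["198.51.100.0/24", "203.0.113.0/24"],
)
```

Removing stale ranges matters as much as adding new ones, since an address a vendor has given up may later belong to someone else.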

Is there a risk of letting in fake bots that spoof a vendor's user agent?

Yes, which is why the IP allowlist matters. A spoofed user agent from a random IP will still be blocked because it does not match a published vendor range. The combination of user agent and IP is what makes this a safe, targeted permission.
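
The combined check can be sketched in a few lines of Python. The ranges below are reserved documentation addresses standing in for real vendor ranges, and `is_verified_crawler` is a hypothetical helper, not part of any CDN's API.

```python
import ipaddress

# Placeholder ranges for illustration only; load the real ones from each vendor's file.
VENDOR_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
    "ClaudeBot": [ipaddress.ip_network("198.51.100.0/24")],
}


def is_verified_crawler(user_agent, ip, vendor_ranges=VENDOR_RANGES):
    """True only when the user agent names a vendor AND the IP is in that vendor's ranges."""
    address = ipaddress.ip_address(ip)
    lowered = user_agent.lower()
    for token, networks in vendor_ranges.items():
        if token.lower() in lowered:
            return any(address in network for network in networks)
    return False
```

A request that claims to be GPTBot from an address outside the GPTBot ranges fails the check, as does one that borrows another vendor's IP.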

Run your own scan

Run a free scan and see how your site grades across all 155 AI-readiness signals.
