How to Check if AI is Crawling Your Site (2026 Guide)
The 2026 guide to detecting AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended — in server logs, robots.txt, and llms.txt.
TL;DR: AI crawlers identify themselves with specific user-agent strings — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and others. To check if they're crawling your site: (1) grep your server access logs for the user-agent strings, (2) inspect /robots.txt and /llms.txt for explicit allow/disallow rules, (3) use a free tool like our LLM Bot Checker to confirm what's reachable. If you're not being crawled, you can't be cited in AI answers.
In 2026, being crawled by an LLM is the new "being indexed by Google." If GPTBot, ClaudeBot, PerplexityBot, and Google-Extended cannot read your pages, ChatGPT, Claude, Perplexity, and Google AI Overviews cannot cite you. The vast majority of sites we audit have at least one major AI crawler accidentally blocked.
This guide gives you a 15-minute audit you can run today.
Skip the manual work — run our free LLM Bot Checker on your URL →
Which user-agents identify the major AI crawlers?
| Crawler | User-agent string | Run by | Purpose |
|---|---|---|---|
| GPTBot | GPTBot | OpenAI | Training data |
| OAI-SearchBot | OAI-SearchBot | OpenAI | ChatGPT search results |
| ChatGPT-User | ChatGPT-User | OpenAI | On-demand fetch when a user asks |
| ClaudeBot | ClaudeBot | Anthropic | Training data |
| Claude-User | Claude-User | Anthropic | On-demand fetch when a user asks |
| Claude-SearchBot | Claude-SearchBot | Anthropic | Claude search citations (added 2026) |
| PerplexityBot | PerplexityBot | Perplexity | Search index |
| Perplexity-User | Perplexity-User | Perplexity | On-demand fetch |
| Google-Extended | Google-Extended | Google | Gemini training (robots.txt signal only) |
| GoogleOther | GoogleOther | Google | AI Overviews and experimental crawling |
| Bingbot | bingbot | Microsoft | Bing + Copilot |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training |
A clean robots.txt in 2026 makes a deliberate choice for each of these. "I forgot they exist" is not a strategy.
How to check your server access logs
If your site is behind nginx, Apache, or a CDN with log access (Cloudflare, Fastly, Vercel), grep the user-agent column for these strings. Examples:
```
# nginx / apache combined log format
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended|Applebot-Extended" /var/log/nginx/access.log | tail -50

# count hits by crawler across current and rotated logs
grep -oE "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" /var/log/nginx/access.log* | sort | uniq -c | sort -rn
```
If your hosting uses Cloudflare, run an Analytics → Logs query filtered on `cf.bot_management.verified_bot = true` and inspect the user-agent column. For Vercel, use `vercel logs --follow` and filter.
What you're looking for: at least some hits from GPTBot and ClaudeBot in the last 30 days. Zero hits = you are likely being blocked, throttled, or your sitemap is incomplete.
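For larger logs, the same check is easy to script. Here's a minimal Python sketch, assuming the combined log format shown in the grep examples above; the sample lines are synthetic, not real crawler traffic:

```python
import re
from collections import Counter

# User-agent tokens for the major AI crawlers (see table above)
AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "PerplexityBot", "Google-Extended", "Applebot-Extended"]
PATTERN = re.compile("|".join(re.escape(bot) for bot in AI_CRAWLERS))

def count_ai_hits(log_lines):
    """Tally hits per AI crawler across access-log lines."""
    counts = Counter()
    for line in log_lines:
        match = PATTERN.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts

# Two synthetic combined-format log lines for illustration
sample = [
    '1.2.3.4 - - [01/Jan/2026:00:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '5.6.7.8 - - [01/Jan/2026:00:00:01 +0000] "GET /docs HTTP/1.1" 200 1024 "-" "ClaudeBot/1.0"',
]
print(count_ai_hits(sample))
```

Point it at `open("/var/log/nginx/access.log")` instead of `sample` to run it against a real log.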
How to inspect your robots.txt and llms.txt
Start with https://yoursite.com/robots.txt. Look for blocks like:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```
A surprising number of sites still ship a robots.txt — often a plugin default or a template copied during the 2023 scraping panic — that blanket-disallows AI crawlers, a decision nobody revisited. In 2026, blocking the very crawlers that fuel ChatGPT and Claude search is equivalent to opting out of being recommended.
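By contrast, a deliberate robots.txt makes an explicit call per bot. One possible policy — allow the search and on-demand bots that produce citations, opt out of training — looks like this (a sketch of one choice, not a universal recommendation):

```
# Citation and search bots: allowed
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Training crawlers: opted out
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```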
Then check https://yoursite.com/llms.txt. This is the emerging markdown-based standard for telling LLMs what your site contains. If you don't have one, our llms.txt explainer walks you through generating one.
llms.txt vs robots.txt — which matters more?
Both. They do different things:
- robots.txt is a crawl control file. It tells crawlers what they're allowed to fetch. AI crawlers respect it (most of them, most of the time).
- llms.txt is a comprehension hint file. It tells LLMs what your site is about and where the canonical pages live for each topic. It does not block or allow anything.
If you want to be cited, you need (a) robots.txt that allows AI crawlers, and (b) llms.txt that summarizes your site. The order matters — without robots.txt access, llms.txt is invisible.
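For reference, a minimal llms.txt follows the emerging markdown format: an H1 with the site name, a one-line blockquote summary, then sections linking to canonical pages. The names and URLs below are placeholders:

```
# Example SaaS

> Example SaaS automates widget workflows for small teams.

## Docs

- [Quickstart](https://example.com/docs/quickstart): install and first run
- [API Reference](https://example.com/docs/api): endpoints and authentication

## Pricing

- [Plans](https://example.com/pricing): Free, Pro, and Enterprise tiers
```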
What to do if you're being scraped without attribution
This comes up a lot. The honest answer:
- GPTBot, ClaudeBot, and PerplexityBot all document that they respect robots.txt. If you don't want them, disallow them, then verify in your logs that they stop.
- Some crawlers misrepresent their user-agent. These are rare but exist. Cloudflare's "AI Scrapers and Crawlers" rule, Fastly's similar feature, or AWS WAF can block by ASN/IP for the worst offenders.
- You usually want to be cited, not blocked. The economics of "show up in ChatGPT answers" are better than "no one ever finds my site." Block selectively, not by default.
A 15-minute audit you can run today
- Open our free LLM Bot Checker and enter your URL. It tests robots.txt for all 12 major AI crawlers and tells you which can fetch your pages.
- Grep your server logs for the user-agent strings in the table above. Confirm real-world hits.
- Open /llms.txt. If you don't have one, generate it.
- Open /robots.txt. Make a deliberate allow/disallow decision for each crawler.
- Add an FAQ block to your top 5 pages. AI crawlers love structured Q&A; it's the highest-leverage content edit you can make for citation.
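The robots.txt step can be checked offline with Python's standard library. A sketch that parses a robots.txt body (no network fetch) and reports which crawlers may fetch a given path:

```python
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot",
               "Claude-User", "PerplexityBot", "Google-Extended"]

def check_robots(robots_txt, path="/"):
    """Map each AI crawler to True/False: may it fetch `path`?"""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse from text, no HTTP request
    return {bot: rp.can_fetch(bot, path) for bot in AI_CRAWLERS}

# Example policy: GPTBot blocked, everyone else allowed
sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
for bot, allowed in check_robots(sample).items():
    print(f"{bot}: {'allowed' if allowed else 'BLOCKED'}")
```

Swap `sample` for the body of your live robots.txt (fetched however you like) to audit your own site.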
Frequently Asked Questions
How do I know if ChatGPT is crawling my site?
ChatGPT uses three user-agents: GPTBot (training), OAI-SearchBot (search index), and ChatGPT-User (on-demand when a user asks a question). Grep your access logs for any of those strings. If you see ChatGPT-User hits, real users are referencing your URL inside ChatGPT.
Does Claude crawl my site?
Yes — Anthropic runs ClaudeBot (training), Claude-User (on-demand fetch), and as of 2026 Claude-SearchBot (search citations). All three respect robots.txt. Test reachability with the LLM Bot Checker.
Is Google-Extended the same as Googlebot?
No. Googlebot is the regular search crawler. Google-Extended is a robots.txt-only signal that controls whether Gemini and Vertex AI can use your content for training. Blocking Google-Extended does not block your site from Google search.
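In robots.txt terms, opting out of Gemini training without touching search looks like this:

```
# Opt out of Gemini/Vertex AI training only
User-agent: Google-Extended
Disallow: /

# Googlebot (search indexing) is unaffected by the block above
User-agent: Googlebot
Allow: /
```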
Why aren't my pages showing up in ChatGPT answers?
Three common reasons in 2026: (1) GPTBot or OAI-SearchBot is blocked in robots.txt; (2) your pages don't have FAQ schema or answer-first structure that LLMs can extract; (3) your site has low authority and AI search systems prefer better-known sources. Run the LLM Bot Checker to rule out (1) first.
Can I block AI crawlers from my whole site?
Yes — add `Disallow: /` rules for each AI user-agent in robots.txt. Most respect it. Some sites use Cloudflare's "Block AI Scrapers" rule for stricter enforcement. We recommend you don't block by default — being cited in AI answers is the new top-of-funnel.
Do AI crawlers obey crawl-delay?
GPTBot and ClaudeBot respect Crawl-delay directives in robots.txt. PerplexityBot mostly does. If you see aggressive fetching, set a Crawl-delay: 10 rule and watch logs over 48 hours.
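A throttle rule is a one-line addition per crawler — though Crawl-delay is advisory, so confirm in your logs that the fetch rate actually drops:

```
# Ask PerplexityBot to wait 10 seconds between requests
User-agent: PerplexityBot
Crawl-delay: 10
```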
Test your site against 12 AI crawlers in 5 seconds
The free Hyperleap LLM Bot Checker tests robots.txt for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and 7 more. Instant. No signup.
Run the LLM Bot Checker

Related Resources
- LLM Bot Checker — free tool to test robots.txt for 12 AI crawlers
- Top LLM Crawlers and What They Do (2026) — directory of every major AI crawler
- llms.txt Explained — the emerging discoverability standard
- Will ChatGPT and Perplexity Recommend Your Business? — GEO playbook
- Free Schema Generator — JSON-LD for FAQ, Article, Product
Related Articles
Top 12 LLM Crawlers and What They Do (2026 Directory)
Every major LLM crawler in 2026 — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and 7 more. User-agents, behavior, robots.txt rules.
Will ChatGPT and Perplexity Recommend Your Business? The 2026 GEO Playbook
Generative Engine Optimization is how you show up in ChatGPT, Perplexity, and Gemini answers. Here's the practical 2026 playbook for getting recommended.
llms.txt Explained: Should Your SaaS Site Have One in 2026?
llms.txt is an emerging standard for telling AI crawlers what your site is about. Here's what it is, what it does, and whether your SaaS needs one.
AI Chatbot for Nonprofits: Donor Engagement Without a Bigger Team
How nonprofits use AI chatbots to answer donor questions, capture volunteer signups, and route program inquiries — plus an exclusive Hyperleap nonprofit discount.