How to Check if AI is Crawling Your Site (2026 Guide)

The 2026 guide to detecting AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended — in server logs, robots.txt, and llms.txt.

Gopi Krishna Lakkepuram
May 3, 2026
7 min read

TL;DR: AI crawlers identify themselves with specific user-agent strings — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and others. To check if they're crawling your site: (1) grep your server access logs for the user-agent strings, (2) inspect /robots.txt and /llms.txt for explicit allow/disallow rules, (3) use a free tool like our LLM Bot Checker to confirm what's reachable. If you're not being crawled, you can't be cited in AI answers.

In 2026, being crawled by an LLM is the new "being indexed by Google." If GPTBot, ClaudeBot, PerplexityBot, and Google-Extended cannot read your pages, then ChatGPT, Claude, Perplexity, and Google AI Overviews cannot cite you. The vast majority of sites we audit have at least one major AI crawler accidentally blocked.

This guide gives you a 15-minute audit you can run today.

Skip the manual work — run our free LLM Bot Checker on your URL →

Which user-agents identify the major AI crawlers?

| Crawler | User-agent string | Run by | Purpose |
| --- | --- | --- | --- |
| GPTBot | GPTBot | OpenAI | Training data |
| OAI-SearchBot | OAI-SearchBot | OpenAI | ChatGPT search results |
| ChatGPT-User | ChatGPT-User | OpenAI | On-demand fetch when a user asks |
| ClaudeBot | ClaudeBot | Anthropic | Training data |
| Claude-User | Claude-User | Anthropic | On-demand fetch when a user asks |
| Claude-SearchBot | Claude-SearchBot | Anthropic | Claude search citations (added 2026) |
| PerplexityBot | PerplexityBot | Perplexity | Search index |
| Perplexity-User | Perplexity-User | Perplexity | On-demand fetch |
| Google-Extended | Google-Extended | Google | Gemini training (signal in robots.txt only) |
| GoogleOther | GoogleOther | Google | AI Overviews and experimental |
| Bingbot | bingbot | Microsoft | Bing + Copilot |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training |

A clean robots.txt in 2026 makes a deliberate choice for each of these. "I forgot they exist" is not a strategy.

How to check your server access logs

If your site is behind nginx, Apache, or a CDN with log access (Cloudflare, Fastly, Vercel), grep the user-agent column for these strings. Examples:

```shell
# nginx / apache combined log format
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended|Applebot-Extended" /var/log/nginx/access.log | tail -50

# count hits by crawler across all (rotated) access logs
grep -oE "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" /var/log/nginx/access.log* | sort | uniq -c | sort -rn
```

If your hosting uses Cloudflare, run an Analytics → Logs query filtered by `cf.bot_management.verified_bot = true` and inspect the user-agent column. For Vercel, use `vercel logs --follow` and filter.

What you're looking for: at least some hits from GPTBot and ClaudeBot in the last 30 days. Zero hits means the crawlers are likely blocked or throttled, or your sitemap is incomplete.
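The grep approach extends naturally to a per-day breakdown. Below is a minimal sketch assuming the standard nginx/Apache combined log format; the `ai_bot_hits` function name and the two sample log entries are made up for illustration:

```shell
# Summarize AI-crawler hits per (day, crawler) from a combined-format access log.
ai_bot_hits() {  # $1 = path to access log
  awk -v bots='GPTBot|ClaudeBot|PerplexityBot|Google-Extended' '
    $0 ~ bots {
      split($4, t, ":")                 # $4 looks like [03/May/2026:06:25:01
      day = substr(t[1], 2)             # drop the leading [
      match($0, bots)                   # which crawler matched this line
      count[day " " substr($0, RSTART, RLENGTH)]++
    }
    END { for (k in count) print count[k], k }
  ' "$1" | sort -rn                     # busiest (day, crawler) pairs first
}

# Demo against a fabricated two-line sample log:
cat > /tmp/sample_access.log <<'EOF'
203.0.113.7 - - [03/May/2026:06:25:01 +0000] "GET /blog HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
203.0.113.8 - - [03/May/2026:07:11:45 +0000] "GET /docs HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
EOF
ai_bot_hits /tmp/sample_access.log
```

Point it at /var/log/nginx/access.log (or each rotated log) to see which crawlers actually hit you, and when.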

How to inspect your robots.txt and llms.txt

Start with https://yoursite.com/robots.txt. Look for blocks like:

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

A surprising number of sites still ship a robots.txt that disallows AI crawlers, usually a blanket block added in a panicked decision in 2023 that nobody revisited. In 2026, blocking the very crawlers that fuel ChatGPT and Claude search is equivalent to opting out of being recommended.

Then check https://yoursite.com/llms.txt. This is the emerging markdown-based standard for telling LLMs what your site contains. If you don't have one, our llms.txt explainer walks you through generating one.
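You can also script a quick check instead of eyeballing the file. The sketch below is deliberately naive: it only flags a bare Disallow: / under an exact user-agent name, and ignores wildcard groups, Allow overrides, and path-prefix rules, so treat it as a first pass rather than a robots.txt parser. The `blocked_in_robots` name and the sample file are illustrative:

```shell
# Naive check: does robots.txt fully block a given user-agent?
blocked_in_robots() {  # $1 = robots.txt file, $2 = exact bot name
  awk -v bot="$2" '
    tolower($1) == "user-agent:" { ua = $2 }
    tolower($1) == "disallow:" && $2 == "/" && ua == bot { found = 1 }
    END { exit !found }
  ' "$1"
}

# Demo against a sample file; in real use, fetch yours first:
#   curl -s https://yoursite.com/robots.txt > /tmp/robots.txt
cat > /tmp/robots.txt <<'EOF'
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/
EOF

for bot in GPTBot ClaudeBot PerplexityBot; do
  if blocked_in_robots /tmp/robots.txt "$bot"; then
    echo "$bot: fully blocked"
  else
    echo "$bot: not fully blocked"
  fi
done
```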

llms.txt vs robots.txt — which matters more?

Both. They do different things:

  • robots.txt is a crawl control file. It tells crawlers what they're allowed to fetch. AI crawlers respect it (most of them, most of the time).
  • llms.txt is a comprehension hint file. It tells LLMs what your site is about and where the canonical pages live for each topic. It does not block or allow anything.

If you want to be cited, you need (a) robots.txt that allows AI crawlers, and (b) llms.txt that summarizes your site. The order matters — without robots.txt access, llms.txt is invisible.
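For reference, a minimal llms.txt follows the structure in the llms.txt proposal: an H1 with the site name, a blockquote summary, then H2 sections listing canonical links. Everything below is placeholder content for a hypothetical site:

```markdown
# Example Site

> One-sentence summary of what the site offers and who it is for.

## Guides

- [How to Check if AI is Crawling Your Site](https://example.com/blog/ai-crawler-check): detecting AI crawlers in logs and robots.txt
- [llms.txt explainer](https://example.com/blog/llms-txt): what the file is and how to generate one
```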

What to do if you're being scraped without attribution

This comes up a lot. The honest answer:

  1. GPTBot, ClaudeBot, and PerplexityBot are all documented to respect robots.txt. If you don't want them, disallow them, then verify in your logs that they stop.
  2. Some crawlers misrepresent their user-agent. These are rare but exist. Cloudflare's "AI Scrapers and Crawlers" rule, Fastly's similar feature, or AWS WAF can block by ASN/IP for the worst offenders.
  3. You usually want to be cited, not blocked. The economics of "show up in ChatGPT answers" are better than "no one ever finds my site." Block selectively, not by default.

A 15-minute audit you can run today

  1. Open our free LLM Bot Checker and enter your URL. It tests robots.txt for all 12 major AI crawlers and tells you which can fetch your pages.
  2. Grep your server logs for the user-agent strings in the table above. Confirm real-world hits.
  3. Open /llms.txt. If you don't have one, generate it.
  4. Open /robots.txt. Make a deliberate allow/disallow decision for each crawler.
  5. Add an FAQ block to your top 5 pages. AI crawlers love structured Q&A; it's the highest-leverage content edit you can make for citation.
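One concrete way to add that structured Q&A is schema.org FAQPage markup in a JSON-LD block; the question and answer text below is illustrative:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do I know if ChatGPT is crawling my site?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Grep your server access logs for the GPTBot, OAI-SearchBot, or ChatGPT-User user-agent strings."
      }
    }
  ]
}
```

Embed it in the page inside a `<script type="application/ld+json">` tag, one Question entry per FAQ item.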

Frequently Asked Questions

How do I know if ChatGPT is crawling my site?

ChatGPT uses three user-agents: GPTBot (training), OAI-SearchBot (search index), and ChatGPT-User (on-demand when a user asks a question). Grep your access logs for any of those strings. If you see ChatGPT-User hits, real users are referencing your URL inside ChatGPT.

Does Claude crawl my site?

Yes — Anthropic runs ClaudeBot (training), Claude-User (on-demand fetch), and as of 2026 Claude-SearchBot (search citations). All three respect robots.txt. Test reachability with the LLM Bot Checker.

Is Google-Extended the same as Googlebot?

No. Googlebot is the regular search crawler. Google-Extended is a robots.txt-only signal that controls whether Gemini and Vertex AI can use your content for training. Blocking Google-Extended does not block your site from Google search.
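In robots.txt terms, the distinction looks like this: a fragment that opts out of Gemini training while leaving regular search crawling untouched (Googlebot is allowed by default, so its group here is only for explicitness):

```txt
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```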

Why aren't my pages showing up in ChatGPT answers?

Three common reasons in 2026: (1) GPTBot or OAI-SearchBot is blocked in robots.txt; (2) your pages don't have FAQ schema or answer-first structure that LLMs can extract; (3) your site has low authority and AI search systems prefer better-known sources. Run the LLM Bot Checker to rule out (1) first.

Can I block AI crawlers from my whole site?

Yes — add Disallow: / rules for each AI user-agent in robots.txt. Most respect it. Some sites use Cloudflare's "Block AI Scrapers" rule for stricter enforcement. We recommend you don't block by default — being cited in AI answers is the new top-of-funnel.

Do AI crawlers obey crawl-delay?

GPTBot and ClaudeBot respect Crawl-delay directives in robots.txt. PerplexityBot mostly does. If you see aggressive fetching, set a Crawl-delay: 10 rule and watch logs over 48 hours.
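Crawl-delay applies per user-agent group, so a throttling rule for one crawler can sit alongside your other directives; for example:

```txt
User-agent: PerplexityBot
Crawl-delay: 10
```

The value is conventionally read as an approximate number of seconds between fetches; since support varies by crawler, re-check your logs after deploying it.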

Test your site against 12 AI crawlers in 5 seconds

The free Hyperleap LLM Bot Checker tests robots.txt for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and 7 more. Instant. No signup.

Run the LLM Bot Checker

Gopi Krishna Lakkepuram

Founder & CEO

Gopi leads Hyperleap AI with a vision to transform how businesses implement AI. Before founding Hyperleap AI, he built and scaled systems serving billions of users at Microsoft on Office 365 and Outlook.com. He holds an MBA from ISB and combines technical depth with business acumen.
