How to Check if AI is Crawling Your Site (2026 Guide)

The 2026 guide to detecting AI crawlers — GPTBot, ClaudeBot, PerplexityBot, Google-Extended — in server logs, robots.txt, and llms.txt.

Gopi Krishna Lakkepuram
May 3, 2026
7 min read

TL;DR: AI crawlers identify themselves with specific user-agent strings — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and others. To check if they're crawling your site: (1) grep your server access logs for the user-agent strings, (2) inspect /robots.txt and /llms.txt for explicit allow/disallow rules, (3) use a free tool like our LLM Bot Checker to confirm what's reachable. If you're not being crawled, you can't be cited in AI answers.

In 2026, being crawled by an LLM is the new "being indexed by Google." If GPTBot, ClaudeBot, PerplexityBot, and Google-Extended cannot read your pages, then ChatGPT, Claude, Perplexity, and Google AI Overviews cannot cite you. The vast majority of sites we audit have at least one major AI crawler accidentally blocked.

This guide gives you a 15-minute audit you can run today.

Skip the manual work — run our free LLM Bot Checker on your URL →

Which user-agents identify the major AI crawlers?

| Crawler | User-agent string | Run by | Purpose |
| --- | --- | --- | --- |
| GPTBot | GPTBot | OpenAI | Training data |
| OAI-SearchBot | OAI-SearchBot | OpenAI | ChatGPT search results |
| ChatGPT-User | ChatGPT-User | OpenAI | On-demand fetch when a user asks |
| ClaudeBot | ClaudeBot | Anthropic | Training data |
| Claude-User | Claude-User | Anthropic | On-demand fetch when a user asks |
| Claude-SearchBot | Claude-SearchBot | Anthropic | Claude search citations (added 2026) |
| PerplexityBot | PerplexityBot | Perplexity | Search index |
| Perplexity-User | Perplexity-User | Perplexity | On-demand fetch |
| Google-Extended | Google-Extended | Google | Gemini training (signal in robots.txt only) |
| GoogleOther | GoogleOther | Google | AI Overviews and experimental |
| Bingbot | bingbot | Microsoft | Bing + Copilot |
| Applebot-Extended | Applebot-Extended | Apple | Apple Intelligence training |

A clean robots.txt in 2026 makes a deliberate choice for each of these. "I forgot they exist" is not a strategy.

How to check your server access logs

If your site is behind nginx, Apache, or a CDN with log access (Cloudflare, Fastly, Vercel), grep the user-agent column for these strings. Examples:

```shell
# nginx / apache combined log format
grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended|Applebot-Extended" /var/log/nginx/access.log | tail -50

# count hits by crawler across all (rotated) access logs
grep -oE "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" /var/log/nginx/access.log* | sort | uniq -c | sort -rn
```

If your hosting uses Cloudflare, run an Analytics → Logs query filtered by `cf.bot_management.verified_bot = true` and inspect the user-agent column. For Vercel, use `vercel logs --follow` and filter.

What you're looking for: at least some hits from GPTBot and ClaudeBot in the last 30 days. Zero hits means the crawlers are likely blocked or throttled, or your sitemap is incomplete.
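The grep approach extends naturally to a per-day breakdown. Below is a minimal sketch assuming the standard nginx/Apache combined log format; the `ai_bot_hits` function name and the two sample log entries are made up for illustration:

```shell
# Summarize AI-crawler hits per (day, crawler) from a combined-format access log.
ai_bot_hits() {  # $1 = path to access log
  awk -v bots='GPTBot|ClaudeBot|PerplexityBot|Google-Extended' '
    $0 ~ bots {
      split($4, t, ":")                 # $4 looks like [03/May/2026:06:25:01
      day = substr(t[1], 2)             # drop the leading [
      match($0, bots)                   # which crawler matched this line
      count[day " " substr($0, RSTART, RLENGTH)]++
    }
    END { for (k in count) print count[k], k }
  ' "$1" | sort -rn                     # busiest (day, crawler) pairs first
}

# Demo against a fabricated two-line sample log:
cat > /tmp/sample_access.log <<'EOF'
203.0.113.7 - - [03/May/2026:06:25:01 +0000] "GET /blog HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"
203.0.113.8 - - [03/May/2026:07:11:45 +0000] "GET /docs HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"
EOF
ai_bot_hits /tmp/sample_access.log
```

Point it at /var/log/nginx/access.log (or each rotated log) to see which crawlers actually hit you, and when.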

How to inspect your robots.txt and llms.txt

Start with https://yoursite.com/robots.txt. Look for blocks like:

```txt
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```

A surprising number of sites still ship a robots.txt that disallows AI crawlers, usually a blanket block added in a panicked decision in 2023 that nobody revisited. In 2026, blocking the very crawlers that fuel ChatGPT and Claude search is equivalent to opting out of being recommended.

Then check https://yoursite.com/llms.txt. This is the emerging markdown-based standard for telling LLMs what your site contains. If you don't have one, our llms.txt explainer walks you through generating one.
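You can also script a quick check instead of eyeballing the file. The sketch below is deliberately naive: it only flags a bare Disallow: / under an exact user-agent name, and ignores wildcard groups, Allow overrides, and path-prefix rules, so treat it as a first pass rather than a robots.txt parser. The `blocked_in_robots` name and the sample file are illustrative:

```shell
# Naive check: does robots.txt fully block a given user-agent?
blocked_in_robots() {  # $1 = robots.txt file, $2 = exact bot name
  awk -v bot="$2" '
    tolower($1) == "user-agent:" { ua = $2 }
    tolower($1) == "disallow:" && $2 == "/" && ua == bot { found = 1 }
    END { exit !found }
  ' "$1"
}

# Demo against a sample file; in real use, fetch yours first:
#   curl -s https://yoursite.com/robots.txt > /tmp/robots.txt
cat > /tmp/robots.txt <<'EOF'
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /private/
EOF

for bot in GPTBot ClaudeBot PerplexityBot; do
  if blocked_in_robots /tmp/robots.txt "$bot"; then
    echo "$bot: fully blocked"
  else
    echo "$bot: not fully blocked"
  fi
done
```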

llms.txt vs robots.txt — which matters more?

Both. They do different things:

  • robots.txt is a crawl control file. It tells crawlers what they're allowed to fetch. AI crawlers respect it (most of them, most of the time).
  • llms.txt is a comprehension hint file. It tells LLMs what your site is about and where the canonical pages live for each topic. It does not block or allow anything.

If you want to be cited, you need (a) robots.txt that allows AI crawlers, and (b) llms.txt that summarizes your site. The order matters — without robots.txt access, llms.txt is invisible.
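For reference, a minimal llms.txt follows the structure in the llms.txt proposal: an H1 with the site name, a blockquote summary, then H2 sections listing canonical links. Everything below is placeholder content for a hypothetical site:

```markdown
# Example Site

> One-sentence summary of what the site offers and who it is for.

## Guides

- [How to Check if AI is Crawling Your Site](https://example.com/blog/ai-crawler-check): detecting AI crawlers in logs and robots.txt
- [llms.txt explainer](https://example.com/blog/llms-txt): what the file is and how to generate one
```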

What to do if you're being scraped without attribution

This comes up a lot. The honest answer:

  1. GPTBot, ClaudeBot, and PerplexityBot are all documented to respect robots.txt. If you don't want them, disallow them, then verify in your logs that they stop.
  2. Some crawlers misrepresent their user-agent. These are rare but exist. Cloudflare's "AI Scrapers and Crawlers" rule, Fastly's similar feature, or AWS WAF can block by ASN/IP for the worst offenders.
  3. You usually want to be cited, not blocked. The economics of "show up in ChatGPT answers" are better than "no one ever finds my site." Block selectively, not by default.

A 15-minute audit you can run today

  1. Open our free LLM Bot Checker and enter your URL. It tests robots.txt for all 12 major AI crawlers and tells you which can fetch your pages.
  2. Grep your server logs for the user-agent strings in the table above. Confirm real-world hits.
  3. Open /llms.txt. If you don't have one, generate it.
  4. Open /robots.txt. Make a deliberate allow/disallow decision for each crawler.
  5. Add an FAQ block to your top 5 pages. AI crawlers love structured Q&A; it's the highest-leverage content edit you can make for citation.
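One concrete way to add that structured Q&A is schema.org FAQPage markup in a JSON-LD block; the question and answer text below is illustrative:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "How do I know if ChatGPT is crawling my site?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Grep your server access logs for the GPTBot, OAI-SearchBot, or ChatGPT-User user-agent strings."
      }
    }
  ]
}
```

Embed it in the page inside a `<script type="application/ld+json">` tag, one Question entry per FAQ item.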

Frequently Asked Questions

How do I know if ChatGPT is crawling my site?

ChatGPT uses three user-agents: GPTBot (training), OAI-SearchBot (search index), and ChatGPT-User (on-demand when a user asks a question). Grep your access logs for any of those strings. If you see ChatGPT-User hits, real users are referencing your URL inside ChatGPT.

Does Claude crawl my site?

Yes — Anthropic runs ClaudeBot (training), Claude-User (on-demand fetch), and as of 2026 Claude-SearchBot (search citations). All three respect robots.txt. Test reachability with the LLM Bot Checker.

Is Google-Extended the same as Googlebot?

No. Googlebot is the regular search crawler. Google-Extended is a robots.txt-only signal that controls whether Gemini and Vertex AI can use your content for training. Blocking Google-Extended does not block your site from Google search.
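In robots.txt terms, the distinction looks like this: a fragment that opts out of Gemini training while leaving regular search crawling untouched (Googlebot is allowed by default, so its group here is only for explicitness):

```txt
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
```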

Why aren't my pages showing up in ChatGPT answers?

Three common reasons in 2026: (1) GPTBot or OAI-SearchBot is blocked in robots.txt; (2) your pages don't have FAQ schema or answer-first structure that LLMs can extract; (3) your site has low authority and AI search systems prefer better-known sources. Run the LLM Bot Checker to rule out (1) first.

Can I block AI crawlers from my whole site?

Yes — add Disallow: / rules for each AI user-agent in robots.txt. Most respect it. Some sites use Cloudflare's "Block AI Scrapers" rule for stricter enforcement. We recommend you don't block by default — being cited in AI answers is the new top-of-funnel.

Do AI crawlers obey crawl-delay?

GPTBot and ClaudeBot respect Crawl-delay directives in robots.txt. PerplexityBot mostly does. If you see aggressive fetching, set a Crawl-delay: 10 rule and watch logs over 48 hours.
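Crawl-delay applies per user-agent group, so a throttling rule for one crawler can sit alongside your other directives; for example:

```txt
User-agent: PerplexityBot
Crawl-delay: 10
```

The value is conventionally read as an approximate number of seconds between fetches; since support varies by crawler, re-check your logs after deploying it.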

Test your site against 12 AI crawlers in 5 seconds

The free Hyperleap LLM Bot Checker tests robots.txt for GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, and 7 more. Instant. No signup.

Run the LLM Bot Checker

Gopi Krishna Lakkepuram

Founder & CEO

Gopi leads Hyperleap AI with a vision to transform how businesses implement AI. Before founding Hyperleap AI, he built and scaled systems serving billions of users at Microsoft on Office 365 and Outlook.com. He holds an MBA from ISB and combines technical depth with business acumen.
