Sitemap URL Extractor

Extract every URL from a sitemap.xml — get a plain list, CSV, or copy-ready text in seconds

Why Would You Want to Extract URLs from a Sitemap?

A sitemap is the most reliable way to enumerate every page a website considers public. Crawlers stop where robots.txt tells them to, navigation menus miss orphaned pages, and many competitor-analysis tools cap their reports at a few hundred URLs, but the sitemap is the site's own ground-truth list. Extracting it gives you a flat URL list you can plug into other tools.

Common use cases:

  • Competitor research. Get the full URL list of a competitor and analyze what pages they prioritize, which categories they emphasize, and where they have content gaps.
  • Content audits. Pipe your own sitemap into a Screaming Frog or Semrush bulk audit to check status codes, page titles, meta descriptions, and Core Web Vitals across every URL.
  • Migrations. When you redesign a site, the sitemap URLs are the redirect map — extract the old sitemap, map every URL to its new location.
  • AI knowledge bases. Hyperleap AI agents use sitemaps to discover what pages to learn from. Extract your sitemap to confirm every URL that should be in the agent's knowledge base.
  • Dead-link checks. Run the URL list through a status code checker to find broken pages and 301 redirect chains.
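The extraction itself is simple enough to sketch. Every standard sitemap is XML in the sitemaps.org namespace, with one <loc> element per page. A minimal Python version (the sample XML below is illustrative, not a real site):

```python
import xml.etree.ElementTree as ET

# Sitemap files declare this XML namespace (per sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value from a <urlset> sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

sample = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""

print("\n".join(extract_urls(sample)))
```

The namespace dict matters: without it, `findall("url/loc")` silently matches nothing, which is the most common bug when parsing sitemaps by hand.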

How Do You Extract URLs from a Sitemap-Index File?

Large sites use a sitemap-index file — a sitemap-of-sitemaps that lists multiple child sitemaps. The root element is <sitemapindex> instead of <urlset>, and instead of <url> entries it contains <sitemap> entries that point to other sitemap files.

When you paste a sitemap-index, this tool returns the list of child sitemap URLs. To extract the URLs from a child sitemap, fetch each one and paste it into the tool. For most sites the child sitemaps are organized by content type (sitemap-pages.xml, sitemap-products.xml, sitemap-blog.xml) — extract the ones you actually need rather than aggregating them all.
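Telling the two file types apart comes down to checking the root element's local name. A small Python sketch of that branching (sample XML is illustrative):

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def classify(sitemap_xml: str) -> tuple[str, list[str]]:
    """Return ('index', child sitemap URLs) for a <sitemapindex>,
    or ('urlset', page URLs) for a regular sitemap."""
    root = ET.fromstring(sitemap_xml)
    tag = root.tag.split("}")[-1]  # strip the "{namespace}" prefix
    if tag == "sitemapindex":
        locs = root.findall("sm:sitemap/sm:loc", NS)
        return "index", [loc.text.strip() for loc in locs]
    locs = root.findall("sm:url/sm:loc", NS)
    return "urlset", [loc.text.strip() for loc in locs]

index = """<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>
</sitemapindex>"""

kind, children = classify(index)
print(kind, children)
```

When `classify` returns `'index'`, each child URL is itself a sitemap you can fetch and feed back through the same function.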

What's the Difference Between Plain Text and CSV Output?

Plain text gives you one URL per line — perfect for piping into a shell script (while read url; do curl -I "$url"; done < sitemap-urls.txt), feeding into a crawler, or doing a quick visual scan.

CSV includes the URL plus the optional metadata fields (lastmod, changefreq, priority) that the sitemap declared. Use CSV when you want to import into Excel or Google Sheets for sorting, filtering, or pivoting — for instance, to find all URLs not modified in the last 90 days.
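If you want to reproduce the CSV shape yourself, the optional fields map one-to-one to columns, with missing fields left as empty cells. A minimal sketch (sample XML is illustrative):

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_to_csv(sitemap_xml: str) -> str:
    """Convert a <urlset> sitemap to CSV with one row per URL."""
    root = ET.fromstring(sitemap_xml)
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["url", "lastmod", "changefreq", "priority"])
    for url in root.findall("sm:url", NS):
        # findtext returns "" for fields the sitemap didn't declare
        writer.writerow([
            url.findtext(f"sm:{field}", default="", namespaces=NS)
            for field in ("loc", "lastmod", "changefreq", "priority")
        ])
    return out.getvalue()

sample = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc><lastmod>2024-01-15</lastmod><priority>1.0</priority></url>
  <url><loc>https://example.com/blog</loc><changefreq>weekly</changefreq></url>
</urlset>"""

print(sitemap_to_csv(sample))
```

Open the result in Excel or Google Sheets and sort on the lastmod column to surface stale pages, such as anything untouched in the last 90 days.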

How Many URLs Can a Sitemap Hold?

The sitemaps.org spec sets two hard limits per sitemap file: 50,000 URLs and 50 MB uncompressed. Exceeding either limit triggers a Google Search Console error.

When your site grows past either limit, split the URLs across multiple sitemaps (sitemap-pages.xml, sitemap-products.xml, etc.) and create a sitemap-index file that lists all of them. Submit the index URL to Search Console — Google then follows the index to discover each child sitemap.

For sites under 50,000 URLs, a single sitemap is simpler and works fine. Don't over-engineer with a sitemap-index until you actually need it.
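The split-and-index step can be sketched in a few lines: chunk the flat URL list at the 50,000 mark, then render a <sitemapindex> pointing at the child files (the example.com URLs and filenames below are placeholders):

```python
MAX_URLS = 50_000  # per-file limit from the sitemaps.org spec

def chunk(urls: list[str], size: int = MAX_URLS) -> list[list[str]]:
    """Split a flat URL list into sitemap-sized chunks."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_index(child_sitemap_urls: list[str]) -> str:
    """Render a <sitemapindex> listing each child sitemap."""
    entries = "\n".join(
        f"  <sitemap><loc>{u}</loc></sitemap>" for u in child_sitemap_urls
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</sitemapindex>"
    )

# 120,000 URLs split into three child sitemaps of 50,000 + 50,000 + 20,000
urls = [f"https://example.com/page-{i}" for i in range(120_000)]
parts = chunk(urls)
children = [f"https://example.com/sitemap-{n}.xml" for n in range(1, len(parts) + 1)]
print([len(p) for p in parts])
print(build_index(children))
```

Submit only the index URL to Search Console; the child sitemap files themselves just need to be reachable at the locations the index declares.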

How Does Hyperleap AI Use Your Sitemap?

When you point a Hyperleap AI agent at your website, the first thing it does is read your sitemap.xml to discover what pages to crawl. Every URL in the sitemap becomes a candidate for the agent's knowledge base.

That means the cleaner and more accurate your sitemap is, the smarter your AI agent becomes. Extract your sitemap with this tool, audit the URL list, remove any pages that shouldn't be in the agent's knowledge base (admin, staging, search results), and re-publish the sitemap. Then your AI agent answers visitor questions only from the pages you actually want it to learn from. See how Hyperleap AI agents use your sitemap →

Need an AI chatbot for your website?

Hyperleap AI Agents answer customer questions, capture leads, and work 24/7.

Get Started Free