{"id":9216,"date":"2025-01-07T13:49:01","date_gmt":"2025-01-07T10:49:01","guid":{"rendered":"https:\/\/www.evenzia.com\/?p=9216"},"modified":"2025-01-07T13:49:02","modified_gmt":"2025-01-07T10:49:02","slug":"how-to-find-all-existing-and-archived-urls-on-a-website","status":"publish","type":"post","link":"https:\/\/www.evenzia.com\/fr\/how-to-find-all-existing-and-archived-urls-on-a-website\/","title":{"rendered":"How to Find All Existing and Archived URLs on a Website"},"content":{"rendered":"\n<p>There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what you\u2019re searching for. For instance, you may want to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Identify\u00a0<strong>every indexed URL<\/strong>\u00a0to analyze issues like cannibalization or index bloat<\/li>\n\n\n\n<li>Collect\u00a0<strong>current and historic URLs\u00a0<\/strong>Google has seen, especially for site migrations<\/li>\n\n\n\n<li>Find\u00a0<strong>all 404 URLs<\/strong>\u00a0to recover from post-migration errors<\/li>\n<\/ul>\n\n\n\n<p>In each scenario, a single tool won\u2019t give you everything you need. Unfortunately, Google Search Console isn\u2019t exhaustive, and a \u201csite:example.com\u201d search is limited and difficult to extract data from.<\/p>\n\n\n\n<p>In this post, I\u2019ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website\u2019s size.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"old-sitemaps-and-crawl-exports\">Old sitemaps and crawl exports<\/h2>\n\n\n\n<p>If you\u2019re looking for URLs that disappeared from the live site recently, there\u2019s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven\u2019t already, check for these files; they can often provide what you need. 
But, if you\u2019re reading this, you probably did not get so lucky.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"archive.org\">Archive.org<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/1-archivedotorg.png?w=839&amp;h=236&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978573&amp;s=55f5b57d0ad8d6602cc1ea6acce0ae1a\" alt=\"Archive.org\"\/><\/figure>\n\n\n\n<p><a href=\"http:\/\/archive.org\/\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">Archive.org<\/a>&nbsp;is an invaluable tool for&nbsp;<a href=\"https:\/\/moz.com\/blog\/prioritize-seo-tasks\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">SEO tasks<\/a>, funded by donations. If you search for a domain and select the \u201cURLs\u201d option, you can access up to 10,000 listed URLs.<\/p>\n\n\n\n<p><strong>However, there are a few limitations:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>URL limit:\u00a0<\/strong>You can only retrieve up to 10,000\u00a0<a href=\"https:\/\/moz.com\/learn\/seo\/url\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">URLs<\/a>, which is insufficient for larger sites.<\/li>\n\n\n\n<li><strong>Quality:\u00a0<\/strong>Many URLs may be malformed or reference resource files (e.g., images or scripts).<\/li>\n\n\n\n<li><strong>No export option:<\/strong>\u00a0There isn\u2019t a built-in way to export the list.<\/li>\n<\/ul>\n\n\n\n<p>To bypass the lack of an export button, use a browser scraping plugin like&nbsp;Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. 
Also, Archive.org doesn\u2019t indicate whether Google indexed a URL\u2014but if Archive.org found it, there\u2019s a good chance Google did, too.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"moz-pro\">Moz Pro<\/h2>\n\n\n\n<p>While you might typically use a&nbsp;<a href=\"https:\/\/moz.com\/help\/link-explorer\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">link index<\/a>&nbsp;to find external sites linking to you, these tools also discover URLs on your site in the process.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/2-moz-pro.png?w=1274&amp;h=738&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978575&amp;s=f26c4587459a03215ac816c14dd1e512\" alt=\"\"\/><\/figure>\n\n\n\n<p><strong>How to use it:<\/strong><br>Export your inbound links in&nbsp;<a href=\"https:\/\/moz.com\/products\/pro\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">Moz Pro<\/a>&nbsp;to get a quick and easy list of target URLs from your site. If you\u2019re dealing with a massive website, consider using the&nbsp;<a href=\"https:\/\/moz.com\/products\/api\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">Moz API<\/a>&nbsp;to export data beyond what\u2019s manageable in Excel or Google Sheets.<\/p>\n\n\n\n<p>It\u2019s important to note that Moz Pro doesn\u2019t confirm if URLs are indexed or discovered by Google. 
However, since most sites apply the same&nbsp;<a href=\"https:\/\/moz.com\/learn\/seo\/robotstxt\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">robots.txt<\/a>&nbsp;rules to Moz\u2019s bots as they do to Google\u2019s, this method generally works well as a proxy for&nbsp;<a href=\"https:\/\/moz.com\/blog\/how-to-view-website-as-googlebot#how-to-set-up-your-googlebot-browser\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">Googlebot\u2019s discoverability<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"google-search-console\">Google Search Console<\/h2>\n\n\n\n<p><a href=\"https:\/\/moz.com\/blog\/a-beginners-guide-to-the-google-search-console\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">Google Search Console<\/a>&nbsp;offers several valuable sources for building your list of URLs.<\/p>\n\n\n\n<p><strong>Links reports:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/3-google-sc.png?w=1290&amp;h=1094&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978579&amp;s=0d56b7bec9f6fbd2b7b06d719ce48900\" alt=\"\"\/><\/figure>\n\n\n\n<p>Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at&nbsp;<strong>1,000 URLs&nbsp;<\/strong>each. You can apply filters for specific pages, but since filters don\u2019t apply to the export, you might need to rely on browser scraping tools\u2014limited to&nbsp;<strong>500 filtered URLs<\/strong>&nbsp;at a time. 
Not ideal.<\/p>\n\n\n\n<p><strong>Performance \u2192 Search Results:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/4-perf-search-results.png?w=967&amp;h=749&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978581&amp;s=ff1be8eaf25ba43553db24e4fb2cf44a\" alt=\"\"\/><\/figure>\n\n\n\n<p>This export gives you a list of pages receiving search impressions. While the export is limited, you can use&nbsp;<a href=\"https:\/\/developers.google.com\/webmaster-tools\" rel=\"nofollow noopener\" target=\"_blank\">Google Search Console API<\/a>&nbsp;for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.<\/p>\n\n\n\n<p><strong>Indexing \u2192 Pages report:<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/5-indexing.png?w=938&amp;h=573&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978583&amp;s=765c3344eb2f449b4aa5f0f80c0c1777\" alt=\"\"\/><\/figure>\n\n\n\n<p>This section provides exports filtered by issue type, though these are also limited in scope.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"google-analytics\">Google Analytics<\/h2>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/6-GA.png?w=937&amp;h=139&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978584&amp;s=94d20f1ac1000cdeabd6efa9fe9bf323\" alt=\"Google Analytics\"\/><\/figure>\n\n\n\n<p>The&nbsp;<strong>Engagement \u2192 Pages and Screens&nbsp;<\/strong>default report in&nbsp;<a href=\"https:\/\/moz.com\/beginners-guide-to-google-analytics\" rel=\"noreferrer noopener nofollow\" target=\"_blank\">GA4<\/a>&nbsp;is an excellent source for collecting URLs, with a 
generous limit of&nbsp;<strong>100,000 URLs<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/7-pages-and-screens.png?w=500&amp;h=736&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978586&amp;s=7340ce16296b27c963458595066d3c88\" alt=\"\"\/><\/figure>\n\n\n\n<p>Even better, you can apply filters to create different URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:<\/p>\n\n\n\n<p>Step 1: Add a segment to the report<\/p>\n\n\n\n<p>Step 2: Click&nbsp;\u201cCreate a new segment.\u201d<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/8-add-segment.png?w=1028&amp;h=68&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978588&amp;s=5fd5a0455b1294c04a6776bd3b49f020\" alt=\"\"\/><\/figure>\n\n\n\n<p>Step 3: Define the segment with a narrower URL pattern, such as URLs containing&nbsp;\/blog\/<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/moz.com\/images\/blog\/Blog-Posts\/How-to-find-all-existing-and-archived-URLs-on-a-website\/9-segments.png?w=718&amp;h=265&amp;auto=compress%2Cformat&amp;fit=crop&amp;dm=1734978589&amp;s=ad9fee21786fb1cabb882506ff8fab22\" alt=\"\"\/><\/figure>\n\n\n\n<p><strong>Note:<\/strong>&nbsp;URLs found in Google Analytics might not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"server-log-files\">Server log files<\/h2>\n\n\n\n<p>Server or CDN log files are perhaps the ultimate tool at your disposal. 
These logs capture an exhaustive list of every URL path queried by users, Googlebot, or other bots during the recorded period.<\/p>\n\n\n\n<p><strong>Considerations:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Data size:<\/strong>\u00a0Log files can be massive, so many sites only retain the last two weeks of data.<\/li>\n\n\n\n<li><strong>Complexity:\u00a0<\/strong>Analyzing log files can be challenging, but various tools are available to simplify the process.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"combine,-and-good-luck\">Combine, and good luck<\/h2>\n\n\n\n<p>Once you\u2019ve gathered URLs from all these sources, it\u2019s time to combine them. If your site is small enough, use a spreadsheet such as Excel or Google Sheets; for larger datasets, use a Jupyter Notebook. Ensure all URLs are consistently formatted (for example, the same protocol, host, and trailing-slash convention), then deduplicate the list.<\/p>\n\n\n\n<p>And voil\u00e0, you now have a comprehensive list of current, old, and archived URLs. Good luck!<\/p>\n\n\n\n<p>Source: https:\/\/moz.com\/blog\/how-to-find-all-existing-and-archived-urls-on-a-website<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There are many reasons you might need to find all the URLs on a website, but your exact goal will determine what 
you\u2019re&hellip;<\/p>\n","protected":false},"author":1,"featured_media":9217,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-9216","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog"],"_links":{"self":[{"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/posts\/9216","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/comments?post=9216"}],"version-history":[{"count":1,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/posts\/9216\/revisions"}],"predecessor-version":[{"id":9218,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/posts\/9216\/revisions\/9218"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/media\/9217"}],"wp:attachment":[{"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/media?parent=9216"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/categories?post=9216"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.evenzia.com\/fr\/wp-json\/wp\/v2\/tags?post=9216"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}