Technical SEO | May 20, 2026 | 9 min read

Crawl Budget Optimization for Ecommerce: How Google Allocates Crawl Resources

Large ecommerce catalogs waste crawl budget on duplicate pages, faceted nav, and session parameters. Here's how to audit and fix your crawl allocation so Google indexes what matters.

StoreVitals Team

Crawl budget is the number of pages Googlebot will crawl on your site within a given time period. For most small stores (under 1,000 pages), it's never a concern. For stores with 10,000+ SKUs, faceted navigation, or session-based URLs, wasted crawl budget directly delays how quickly new and updated pages get indexed — and by extension, how fast they rank.

Google determines crawl budget based on two factors: crawl rate limit (how fast it can crawl without overwhelming your servers) and crawl demand (how popular and fresh your pages are perceived to be). Your job is to eliminate pages that waste budget so Google focuses on pages that generate traffic and revenue.

The Biggest Crawl Budget Drains

1. Faceted Navigation

A product catalog with 10 filter dimensions can generate millions of URLs: /shoes?color=red&size=9&brand=nike&price=50-100 and every permutation. Each unique URL Googlebot crawls costs budget — and these pages typically have no unique content, thin inventory, and zero backlinks.
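To see how quickly this grows, here is a rough count. It assumes each filter is either unset or set to exactly one value, and the option counts are made up for illustration; multi-select filters and parameter reordering would push the number higher still.

```python
from math import prod


def facet_url_count(options_per_filter):
    """Count filter combinations with at least one filter applied.

    Each filter is either unset or set to exactly one of its options,
    so a filter with n options contributes (n + 1) choices. Subtract 1
    to exclude the unfiltered category page itself.
    """
    return prod(n + 1 for n in options_per_filter) - 1


# Hypothetical catalog: 10 filters with 4 options each.
print(facet_url_count([4] * 10))  # 9765624
```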

The fix: use robots.txt to block parameter-based URLs from crawling (not indexing — noindex in robots meta has no crawl budget benefit since Googlebot still fetches the page to read the tag). Use disallow directives for parameters you never want indexed:

User-agent: *
Disallow: /*?*color=
Disallow: /*?*size=
Disallow: /*?*sort=
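Wildcard placement matters here: a pattern such as /*?color= only matches when color is the first parameter in the query string, while /*?*color= matches it anywhere after the question mark. The sketch below is a simplified matcher for Google-style patterns (* as a wildcard, $ as an end anchor) that you could use to sanity-check rules before deploying them. It is an illustration, not Google's parser, and it ignores Allow rules and rule precedence.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Supports only '*' (any run of characters) and a trailing '$'
    (end of URL). Patterns match from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, disallow_patterns):
    """Return True if any Disallow pattern matches the path + query."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


# Parameter order changes the outcome for the narrower pattern.
print(is_blocked("/shoes?color=red", ["/*?color="]))          # True
print(is_blocked("/shoes?size=9&color=red", ["/*?color="]))   # False
print(is_blocked("/shoes?size=9&color=red", ["/*?*color="]))  # True
```

Note that the broader pattern also matches any parameter whose name merely ends in "color" (for example bgcolor=), so check it against a sample of real URLs first.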

Alternatively, leave the faceted URLs crawlable and add a canonical pointing to the clean category page. Don't combine the two on the same URLs: Googlebot can't read a canonical tag on a page that robots.txt forbids it to fetch. Google Search Console's URL Parameters tool used to handle this but was retired in 2022; canonical tags and robots directives are now the way to control parameter URLs.

2. Session IDs and Tracking Parameters

URLs like /product?sessionid=abc123 or /cart?ref=email_campaign create thousands of unique URLs in Googlebot's eyes, each requiring a crawl. Session IDs in URLs are a legacy pattern; modern stores using cookies avoid this. If your store still appends session IDs to URLs, fix it at the server level or add a canonical pointing to the parameterless version.

For tracking parameters (utm_source, utm_medium, etc.), ensure your canonical tag always points to the clean URL. Google usually recognizes utm_ parameters on its own, but rel=canonical is the strongest explicit hint you can give about which URL to index.
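As a minimal sketch of how the clean URL for a canonical tag could be derived, the function below strips session and tracking parameters and keeps everything else. The parameter names in TRACKING_NAMES are assumptions taken from the examples above; adjust them to whatever your store actually appends.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed parameter names; extend this to match your own URLs.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"sessionid", "ref"}


def canonical_url(url):
    """Return the URL without tracking/session parameters or fragment."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(canonical_url("https://example.com/product?sessionid=abc123&utm_source=news&variant=blue"))
# https://example.com/product?variant=blue
```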

3. Soft 404s

A soft 404 is a page that returns HTTP 200 but has no real content — out-of-stock product pages that show "this product is no longer available" instead of a 404 or 301, empty search results pages (/search?q=xyzabc), or category pages with zero products. Google wastes crawl budget discovering and revisiting these.

The correct handling depends on the situation:

  • Permanently discontinued product: 301 redirect to the category page or a similar product
  • Temporarily out of stock: keep the page, add availability schema, maintain link equity
  • Search results with no results: return 404, or add a robots meta noindex (this keeps the page out of the index but does not save the crawl)
  • Empty category pages: 301 to the parent category
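The rules above can be sketched as a small lookup that a storefront's routing layer might consult. The state names and the response_for helper are hypothetical, not part of any framework, and the noindex alternative for empty search results is left out for brevity.

```python
def response_for(page_state, fallback_url=None):
    """Map a page's state to (HTTP status, redirect target or None)."""
    if page_state in ("discontinued", "empty_category"):
        # Permanent redirect to a similar product or the parent category.
        if fallback_url is None:
            raise ValueError("a redirect target is required")
        return 301, fallback_url
    if page_state == "out_of_stock":
        # Keep the page live; availability markup is handled in the template.
        return 200, None
    if page_state == "empty_search":
        return 404, None
    raise ValueError(f"unknown page state: {page_state!r}")


print(response_for("discontinued", "/collections/shirts"))  # (301, '/collections/shirts')
print(response_for("out_of_stock"))                         # (200, None)
```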

4. Infinite Scroll and Pagination Misimplementation

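If your infinite scroll implementation exposes crawlable /products?page=2, /products?page=3 URLs, or your pagination runs from /products/page/1 through /products/page/10000, Googlebot can spend a large share of its budget on deep listing pages. Google's guidance is to give each paginated page its own self-referencing canonical rather than pointing every page at page one, because page 2 onward lists different products that still need to be discovered. To cut the crawl cost, reduce pagination depth instead: show more products per page and link to products from subcategory pages, so that pages beyond 3 or 4, which rarely receive organic traffic, aren't the only path to them.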

5. Duplicate Content URLs

Product pages accessible via multiple paths — /products/red-shirt, /collections/shirts/red-shirt, /sale/red-shirt — each require a crawl unless canonical tags are set. Run a crawl of your site and look for URLs with near-identical content fingerprints.
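A minimal sketch of the fingerprinting idea, assuming you already have the extracted main text for each crawled URL (the pages dict below is invented sample data). Hashing normalized text only catches content that is identical after whitespace and case normalization; true near-duplicate detection would need something like shingling or SimHash.

```python
import hashlib
import re
from collections import defaultdict


def fingerprint(text):
    """Hash page text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def duplicate_groups(pages):
    """Group URLs whose text has the same fingerprint.

    pages maps URL -> extracted page text. Returns a list of URL lists,
    one per group containing more than one URL.
    """
    groups = defaultdict(list)
    for url, text in pages.items():
        groups[fingerprint(text)].append(url)
    return sorted(sorted(urls) for urls in groups.values() if len(urls) > 1)


pages = {
    "/products/red-shirt": "Red Shirt\n100% cotton.",
    "/collections/shirts/red-shirt": "Red  Shirt 100% cotton.",
    "/products/blue-shirt": "Blue Shirt 100% cotton.",
}
print(duplicate_groups(pages))
# [['/collections/shirts/red-shirt', '/products/red-shirt']]
```

Each group it reports is a candidate for a single canonical URL.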

How to Audit Your Crawl Budget Usage

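Google Search Console's Page indexing report (formerly Coverage) shows which URLs Google has discovered, crawled, and indexed. "Crawled - currently not indexed" lists URLs Googlebot fetched but chose not to index, which is where crawl budget is being spent without results. "Discovered - currently not indexed" lists URLs Google knows about but hasn't crawled yet, often a sign that crawl capacity is running short.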

To estimate your crawl budget vs. discovery rate:

  1. Go to GSC → Settings → Crawl Stats (bottom of settings page)
  2. Look at "Total crawl requests" per day — that's your effective budget
  3. Compare against your total page count
  4. If crawl requests per day × 30 < total pages, Google isn't covering your full site monthly
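The arithmetic in step 4 can be sketched as follows. The figures are invented, and the estimate is a best case: it assumes every crawl request hits a distinct page, which real crawls never do because Googlebot revisits popular URLs.

```python
import math


def crawl_coverage(total_pages, crawl_requests_per_day, window_days=30):
    """Estimate whether the crawl rate could cover the site in a window."""
    crawled_in_window = crawl_requests_per_day * window_days
    return {
        "covered_in_window": crawled_in_window >= total_pages,
        "days_for_full_pass": math.ceil(total_pages / crawl_requests_per_day),
    }


# Hypothetical store: 50,000 URLs, 1,200 crawl requests per day.
print(crawl_coverage(50_000, 1_200))
# {'covered_in_window': False, 'days_for_full_pass': 42}
```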

Screaming Frog can also simulate a Googlebot crawl and show you which URL types consume the most crawl requests.
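If you also have server logs, a similar breakdown can be approximated by bucketing the URLs Googlebot requested. The sketch below assumes the requested paths have already been extracted from the logs; the parameter names and path prefixes are illustrative guesses, not a standard.

```python
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

# Assumed filter parameters and path layout; adapt to your own store.
FACET_PARAMS = {"color", "size", "brand", "price", "sort"}


def url_type(url):
    """Assign a crawled URL to a coarse page-type bucket."""
    parts = urlsplit(url)
    params = {key for key, _ in parse_qsl(parts.query, keep_blank_values=True)}
    if params & FACET_PARAMS:
        return "faceted"
    if parts.path.startswith("/search"):
        return "search"
    if parts.path.startswith("/products/"):
        return "product"
    if parts.path.startswith("/collections/"):
        return "category"
    return "other"


def crawl_share(urls):
    """Count crawl requests per page type."""
    return Counter(url_type(url) for url in urls)


requested = [
    "/shoes?color=red&size=9",
    "/shoes?sort=price",
    "/products/red-shirt",
    "/search?q=xyzabc",
    "/collections/shirts",
]
print(crawl_share(requested).most_common())
```

A large share of requests landing in the "faceted" or "search" buckets points at the drains described earlier.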

Technical Fixes That Improve Crawl Efficiency

Sitemap hygiene

Your XML sitemap should only include URLs you want indexed. Remove URLs that return 301, 404, or 410, URLs whose canonical tag points to a different URL, and noindex pages. A sitemap that lists 50,000 URLs when only 30,000 should be indexed sends Googlebot to 20,000 URLs it will never index and weakens the sitemap as a signal of which pages matter.
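These rules can be expressed as a simple filter over crawl results. The CrawledUrl record below is a hypothetical shape for whatever your crawler exports, not a real tool's format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CrawledUrl:
    url: str
    status: int
    noindex: bool = False
    canonical: Optional[str] = None  # None means no canonical tag was found


def sitemap_eligible(pages: List[CrawledUrl]) -> List[str]:
    """Keep only URLs that return 200, are indexable, and are self-canonical."""
    return [
        page.url
        for page in pages
        if page.status == 200
        and not page.noindex
        and (page.canonical is None or page.canonical == page.url)
    ]


pages = [
    CrawledUrl("/products/red-shirt", 200, canonical="/products/red-shirt"),
    CrawledUrl("/sale/red-shirt", 200, canonical="/products/red-shirt"),
    CrawledUrl("/products/old-shirt", 301),
    CrawledUrl("/search", 200, noindex=True),
]
print(sitemap_eligible(pages))  # ['/products/red-shirt']
```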

Server response time

Googlebot's crawl rate limit is partially determined by your server's response time. A server that responds in 200ms can be crawled faster than one that takes 2 seconds. Improving TTFB not only helps users — it directly expands your crawl budget. Target under 500ms TTFB for Googlebot.

Internal linking to new pages

Crawl demand (the other crawl budget factor) increases when pages have backlinks and internal links. New product pages with zero internal links take much longer to get crawled. Add new products to your homepage featured section temporarily, include them in your sitemap immediately, and ensure they're linked from their category page from day one.
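One way to catch new products that slipped through without internal links is to compare the sitemap against the internal link graph from a site crawl. A minimal sketch, assuming internal_links maps each crawled page to the URLs it links to (sample data invented):

```python
def orphan_pages(sitemap_urls, internal_links):
    """Return sitemap URLs that no crawled page links to."""
    linked = {target for targets in internal_links.values() for target in targets}
    return sorted(set(sitemap_urls) - linked)


sitemap_urls = ["/", "/collections/shirts", "/products/red-shirt", "/products/new-shirt"]
internal_links = {
    "/": ["/collections/shirts"],
    "/collections/shirts": ["/", "/products/red-shirt"],
}
print(orphan_pages(sitemap_urls, internal_links))  # ['/products/new-shirt']
```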

Priority Rules for Ecommerce Crawl Budget

Page Type | Crawl Priority | Action
Product pages (in-stock) | High | Allow, include in sitemap
Category pages | High | Allow, include in sitemap
Blog posts | Medium | Allow, include in sitemap
Faceted filter URLs | Zero | Block via robots.txt
Discontinued products | Zero | Return 301 or 410
Search result pages | Zero | Disallow in robots.txt (or noindex if left crawlable)
Cart/checkout pages | Zero | Disallow in robots.txt
User account pages | Zero | Disallow in robots.txt
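The plain prefix rules in this table (cart, checkout, account) can be checked with Python's standard-library robots.txt parser before you deploy them. The paths /cart, /checkout, and /account are assumptions about your URL layout. Note that urllib.robotparser does not understand the * wildcard, so it cannot be used to test faceted-navigation patterns.

```python
from urllib.robotparser import RobotFileParser

# Assumed paths; replace with the ones your platform uses.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /account
"""


def can_googlebot_fetch(path, robots_txt=ROBOTS_TXT):
    """Check a path against prefix-only robots.txt rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("Googlebot", "https://example.com" + path)


print(can_googlebot_fetch("/products/red-shirt"))  # True
print(can_googlebot_fetch("/checkout/shipping"))   # False
```

Because these are prefix matches, Disallow: /cart would also block a path like /cartoon-socks; use /cart/ or a more specific prefix if that is a risk.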

StoreVitals and Crawl Budget

StoreVitals checks robots.txt for common crawl budget mistakes, validates sitemap health (URLs returning non-200, noindex pages included in the sitemap), and flags soft 404 patterns. The crawler checks detect whether your core pages (homepage, category, product) are properly canonicalized and report any duplicate URL patterns found. Run a free scan to see how much of your crawl budget might be going to the wrong pages.
