How crawl budget has changed in the last 2 years

Crawl budget is an often-overlooked part of SEO, and a two-year-old post my team wrote about the topic is practically ancient history in this industry. So, in this article, I’ll explain how our understanding of crawl budget has changed in the past couple of years, what’s stayed the same, and what it all means for your crawl budget optimization efforts.

What is crawl budget and why does it matter?

Computer programs designed to collect information from web pages are called web spiders, crawlers or bots. These can be malicious (e.g., hacker spiders) or beneficial (e.g., search engine and web service spiders). For example, my company’s backlink index is built using a spider called BLEXBot, which crawls up to 7.5 billion pages daily gathering backlink data.

When we talk about crawl budget, we’re actually talking about the frequency with which search engine spiders crawl your web pages. According to Google, crawl budget is a combination of your crawl rate limit (i.e., limits that ensure bots like Googlebot don’t crawl your pages so often that it hurts your server) and your crawl demand (i.e., how much Google wants to crawl your pages).

Optimizing your crawl budget means increasing how often spiders can “visit” each page, collect information and send that data to other algorithms in charge of indexing and evaluating content quality. Simply put, the better your crawl budget, the faster your information will be updated in search engine indexes when you make changes to your site.

But don’t worry. Unless you’re running a large-scale website (millions or billions of URLs), you will likely never need to worry about crawl budget:

IMO crawl-budget is over-rated. Most sites never need to worry about this. It’s an interesting topic, and if you’re crawling the web or running a multi-billion-URL site, it’s important, but for the average site owner less so.

— 🍌 John 🍌 (@JohnMu) May 30, 2018

So why bother with crawl budget optimization? Because even if you don’t need to improve your crawl budget, these tips include a lot of good practices that improve the overall health of your site.

I think it’s worth being clear about it all though. Removing 25 useless pages is great for a leaner site, and can help users from getting lost there, but it’s not a crawl-budget question. Would people only do it for a SEO bonus? How can we help you to help them?

— 🍌 John 🍌 (@JohnMu) May 30, 2018

And, as John Mueller explains in that same thread, the potential benefits of having a leaner site include higher conversions even if they’re not guaranteed to impact a page’s rank in SERPs.

Sure, but it’s worth being honest about the size of the potential effect. If we can crawl 50k pages/day from your site, will going from 1000 to 900 pages in total change anything for crawling? Not really, but maybe it increases conversions, right?

— 🍌 John 🍌 (@JohnMu) May 30, 2018

What’s stayed the same?

In a Google Webmaster Hangout on Dec. 14, 2018, John was asked how site owners could determine their crawl budget. He explained that it’s tough to pin down because crawl budget is not an external-facing metric.

He also says:

“[Crawl budget] is something that changes quite a bit over time. Our algorithms are very dynamic and they try to react fairly quickly to changes that you make on your website … it’s not something that’s assigned one time to a website.”

He illustrates this with a few examples:

You could reduce your crawl budget if you did something such as improperly setting up a CMS. Googlebot might notice how slow your pages are and slow down crawling within a day or two.
You could increase your crawl budget if you improved your website (by moving to a CDN or serving content more quickly). Googlebot would notice and your crawl demand would go up.

This is consistent with what we knew about crawl budget a couple of years ago. Many best practices for optimizing crawl budget still apply today:

1. Don’t block important pages

You need to make sure that all of your important pages are crawlable. Content won’t provide you with any value if your .htaccess and robots.txt are inhibiting search bots’ ability to crawl essential pages.

Conversely, you can use a script to direct search bots away from unimportant pages. Just note that Googlebot may assume you’ve made a mistake if you disallow lots of content, or if a restricted page receives a lot of incoming links, and it may still crawl these pages.
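
One way to check that your robots.txt isn’t blocking important pages is to test a list of URLs against its rules. The following is a minimal sketch using Python’s standard urllib.robotparser module. The robots.txt content, the URL list and the blocked_urls helper are hypothetical, and Python’s parser does simple prefix matching, so it may not agree with Googlebot on every rule (wildcards, for instance).

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cart/
Disallow: /search
"""

# Hypothetical list of pages you consider important.
IMPORTANT_URLS = [
    "https://www.example.com/",
    "https://www.example.com/products/blue-widget",
    "https://www.example.com/search?q=widgets",
]


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given robots.txt rules disallow for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


blocked = blocked_urls(ROBOTS_TXT, IMPORTANT_URLS)
for url in blocked:
    print("Blocked by robots.txt:", url)
```

In this made-up example, only the internal search URL is reported, because it falls under the Disallow: /search rule.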

The following meta tag in the <head> section of your page will prevent most search engine bots from indexing a page on your site: <meta name="robots" content="noindex">

You can also prevent Google specifically from indexing your page with the following meta tag: <meta name="googlebot" content="noindex">

Alternatively, you can return a “noindex” X-Robots-Tag header which instructs spiders not to index your page: X-Robots-Tag: noindex
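
To audit which pages carry a noindex directive, you can check both the meta tags and the response header. Here is a rough sketch using only Python’s standard library. The class and function names are my own, and the headers are passed in as a plain dictionary rather than fetched from a live server.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {key.lower(): (value or "") for key, value in attrs}
        name = attr_map.get("name", "").lower()
        if name in ("robots", "googlebot"):
            self.directives[name] = attr_map.get("content", "").lower()


def is_noindexed(html, headers):
    """Return True if a meta tag or the X-Robots-Tag header contains noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    in_meta = any("noindex" in value for value in parser.directives.values())
    # Header names are case-insensitive, so normalize the keys first.
    normalized = {key.lower(): value for key, value in headers.items()}
    in_header = "noindex" in normalized.get("x-robots-tag", "").lower()
    return in_meta or in_header


page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(is_noindexed(page, {}))
```

Running this over your important pages is a quick way to catch a noindex that was left behind by accident.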

2. Stick to HTML whenever possible

Googlebot has gotten a lot better at crawling rich media files like JavaScript, Flash and XML but other search engine bots still struggle with a lot of these files. I recommend avoiding these files in favor of plain HTML whenever possible. You may also want to provide search engine bots with text versions of pages that rely heavily on these rich media files.

3. Fix long redirect chains

Each redirected URL squanders a little bit of your crawl budget. Worse, search bots may stop following redirects if they encounter an unreasonable number of 301 and 302 redirects in a row. Try to limit the number of redirects on your website and chain no more than two of them in a row.
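
If you can export your redirect rules, finding long chains is straightforward. The sketch below assumes a hypothetical redirect map (source path to target path) instead of making live HTTP requests, and it flags any chain longer than two hops. The function names and the sample data are illustrative only.

```python
# Hypothetical redirect rules exported from a server or CMS: source -> target.
REDIRECTS = {
    "/old-page": "/newer-page",
    "/newer-page": "/newest-page",
    "/newest-page": "/final-page",
    "/summer-sale": "/sale",
}


def resolve_chain(start, redirects, max_hops=10):
    """Follow redirects from start and return every URL visited, in order."""
    chain = [start]
    # max_hops guards against redirect loops in the exported rules.
    while chain[-1] in redirects and len(chain) - 1 < max_hops:
        chain.append(redirects[chain[-1]])
    return chain


def long_chains(redirects, limit=2):
    """Return {start: chain} for every chain with more than `limit` hops."""
    report = {}
    for start in redirects:
        chain = resolve_chain(start, redirects)
        if len(chain) - 1 > limit:
            report[start] = chain
    return report


for start, chain in long_chains(REDIRECTS).items():
    print(f"{start}: {len(chain) - 1} hops, could point straight to {chain[-1]}")
```

The usual fix is to point every URL in a long chain directly at the final destination.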

4. Tell Googlebot about URL parameters

If your CMS generates lots of dynamic URLs (as many of the popular ones do), then you may be wasting your crawl budget, and maybe even raising red flags about duplicate content. To inform Googlebot about URL parameters that your website engine or CMS has added and that don’t impact page content, all you have to do is add those parameters in your Google Search Console (go to Crawl > URL Parameters).
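
Before you register parameters, it helps to know which ones are producing duplicate URLs. This sketch strips a hypothetical set of content-neutral parameters and groups URLs that collapse to the same address. The parameter names and sample URLs are assumptions, so substitute the ones your own CMS generates.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that do not change page content on this example site.
IGNORED_PARAMS = {"sessionid", "utm_source", "utm_medium", "ref"}

# Hypothetical URLs taken from a crawl or a server log.
URLS = [
    "https://www.example.com/shoes?color=red&sessionid=abc",
    "https://www.example.com/shoes?color=red&utm_source=news",
    "https://www.example.com/shoes?color=blue",
]


def normalize(url, ignored=IGNORED_PARAMS):
    """Drop ignored query parameters and any fragment from a URL."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each normalized URL to the variants that collapse into it."""
    groups = {}
    for url in urls:
        groups.setdefault(normalize(url), []).append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


duplicates = group_duplicates(URLS)
for canonical, variants in duplicates.items():
    print(canonical, "has", len(variants), "parameter variants")
```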

5. Correct HTTP errors

John corrected a common misconception in late 2017, clarifying that 404 and 410 pages do in fact use your crawl budget. Since you don’t want to waste your crawl budget on error pages — or confuse users who try to reach those pages — it’s in your best interest to search for HTTP errors and fix them ASAP.
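
If your crawler or log analyzer can export URLs with their status codes, grouping the errors takes only a few lines. This is a minimal sketch over hypothetical crawl results. The data and the error_report function are made up for illustration.

```python
from collections import defaultdict

# Hypothetical (url, status code) pairs exported from a site crawl.
CRAWL_RESULTS = [
    ("https://www.example.com/", 200),
    ("https://www.example.com/old-offer", 404),
    ("https://www.example.com/retired-product", 410),
    ("https://www.example.com/checkout", 500),
]


def error_report(results):
    """Group URLs by status code, keeping only client and server errors."""
    report = defaultdict(list)
    for url, status in results:
        if status >= 400:
            report[status].append(url)
    return dict(report)


errors = error_report(CRAWL_RESULTS)
for status, urls in sorted(errors.items()):
    print(status, urls)
```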

6. Keep your sitemap up to date

A clean XML sitemap will help users and bots alike understand where internal links lead and how your site is structured. Your sitemap should only include canonical URLs (a sitemap is a canonicalization signal where Google is concerned) and it should be consistent with your robots.txt file (don’t tell spiders to crawl a page you’ve blocked them from).
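
To check that your sitemap and robots.txt agree, you can parse both and list any sitemap URL that your rules disallow. The sketch below uses Python’s standard library with made-up sitemap and robots.txt content. As before, Python’s robots.txt parser is simpler than Googlebot’s, so treat the result as a first pass.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap and robots.txt content, for illustration only.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/private/report</loc></url>
</urlset>
"""

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def sitemap_conflicts(sitemap_xml, robots_txt, user_agent="Googlebot"):
    """Return sitemap URLs that robots.txt tells user_agent not to crawl."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.findall("sm:url/sm:loc", SITEMAP_NS)]
    return [loc for loc in locs if not robots.can_fetch(user_agent, loc)]


conflicts = sitemap_conflicts(SITEMAP_XML, ROBOTS_TXT)
for loc in conflicts:
    print("Listed in sitemap but blocked in robots.txt:", loc)
```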

7. Use rel="canonical" to avoid duplicate content

Speaking of canonicalization, you can use rel="canonical" to tell bots which URL is the main version of a page. However, it’s in your best interest to ensure that the content across the various versions of your page lines up, just in case. Since Google introduced mobile-first indexing back in 2016, it often treats the mobile version of a page as the canonical version.

8. Use hreflang tags to indicate country/language

Bots use hreflang tags to understand localized versions of your pages, including language- and region-specific content. You can use either HTML tags, HTTP headers, or your sitemap to indicate localized pages to Google. To do this:

You can add the following link element to your page’s header: <link rel="alternate" hreflang="lang_code" href="url_of_page" />

You can return an HTTP header that tells Google about the language variants on the page (you can also use this for non-HTML files such as PDFs) by specifying a supported language/region code. Your header format should look something like this: Link: <url1>; rel="alternate"; hreflang="lang_code_1"

In your sitemap, you can add a <loc> element for a specific URL and give it child entries that list each localized version of the page. This page will teach you more about how to set up language- and region-specific pages that will help search engine bots crawl your site.
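
The three options above can be generated from the same list of localized URLs. Here is a rough sketch. The URLs and function names are hypothetical, and the sitemap entry assumes the enclosing <urlset> declares the xhtml namespace (xmlns:xhtml="http://www.w3.org/1999/xhtml").

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical localized versions of one page: language code -> URL.
VARIANTS = {
    "en": "https://www.example.com/en/page",
    "de": "https://www.example.com/de/page",
}


def html_links(variants):
    """Build one <link> element per localized version for the page's <head>."""
    return [
        f"<link rel=\"alternate\" hreflang={quoteattr(lang)} href={quoteattr(url)} />"
        for lang, url in variants.items()
    ]


def link_header(variants):
    """Build a single HTTP Link header listing every localized version."""
    parts = [
        f"<{url}>; rel=\"alternate\"; hreflang=\"{lang}\""
        for lang, url in variants.items()
    ]
    return "Link: " + ", ".join(parts)


def sitemap_entry(page_url, variants):
    """Build a sitemap <url> entry with one xhtml:link child per version."""
    lines = ["<url>", f"  <loc>{escape(page_url)}</loc>"]
    for lang, url in variants.items():
        lines.append(
            f"  <xhtml:link rel=\"alternate\" hreflang={quoteattr(lang)} href={quoteattr(url)} />"
        )
    lines.append("</url>")
    return "\n".join(lines)


print("\n".join(html_links(VARIANTS)))
print(link_header(VARIANTS))
print(sitemap_entry(VARIANTS["en"], VARIANTS))
```

Whichever method you pick, each localized version should list all of the versions, including itself.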

What’s changed?

Two main things have changed since we wrote that original article in 2017.

First, I no longer recommend RSS feeds. RSS had a small resurgence in the wake of the Cambridge Analytica scandal as many users shied away from social media algorithms – but it’s not widely used (except maybe by news reporters) and it’s not making a significant comeback.

Second, as part of the original article, we conducted an experiment that suggested a strong correlation between external links and crawl budget. It seemed to suggest that growing your link profile would help your site’s crawl budget grow proportionally.

The aforementioned Google Webmaster Hangout seemed to corroborate this finding; John mentions that a site’s crawl budget is “based a lot on demand from our side.”

But when we tried to update the study on our end, we couldn’t recreate those original findings. The correlation was very loose, suggesting that Google’s algorithm has grown quite a bit more sophisticated since 2017.

That said, please don’t read this and think, “Great, I can stop link building!”

Links remain one of the most important signals that Google and other search engines use to judge relevancy and quality. So, while link building may not be essential for improving your crawl budget, it should be a priority when you want to improve your SEO.

And that’s it! If you want to learn more about crawl budget, I recommend checking out Stephan Spencer’s three-part guide to bot herding and spider wrangling.

Opinions expressed in this article are those of the guest author and not necessarily Search Engine Land. Staff authors are listed here.

About The Author

Aleh Barysevich is Founder and Chief Marketing Officer at companies behind SEO PowerSuite, professional software for full-cycle SEO campaigns, and Awario, a social media monitoring app. He is a seasoned SEO expert and speaker at major industry conferences, including 2018’s SMX London, BrightonSEO and SMX East.
