No matter how well you think you know your site, a crawler will always turn up something new. Sometimes it’s the things you don’t know about that can sink your SEO ship.
Search engines use highly developed bots to crawl the web looking for content to index. If a search engine’s crawlers can’t find the content on your site, it won’t rank or drive natural search traffic. Even if the content is findable, it still won’t rank or drive natural search traffic unless it sends the appropriate relevance signals.
Since they mimic the actions of more sophisticated search engine crawlers, third-party crawlers, such as DeepCrawl and Screaming Frog’s SEO Spider, can uncover a wide variety of technical and content issues to improve natural search performance.
7 Reasons to Use a Site Crawler
What’s out there? Owners and managers think of their websites as the pieces that customers will (hopefully) see. But search engines find and remember all the obsolete and orphaned areas of sites, as well. A crawler can help catalog the outdated content so that you can determine what to do next. Maybe some of it is still useful if it’s refreshed. Maybe some of it can be 301 redirected so that its link authority can strengthen other areas of the site.
How is this page performing? Some crawlers can pull analytics data in from Google Search Console and Google Analytics. They make it easy to view correlations between the performance of individual pages and the data found on the page itself.
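Some crawlers handle this correlation natively, but you can approximate it yourself by joining a crawl export with a Search Console performance export. Here is a minimal sketch in Python using pandas; the file names and column names are assumptions, so adjust them to match whatever your crawler and Google Search Console actually export.

```python
# A minimal sketch of joining a crawl export with a Search Console export.
# File names and column names below are hypothetical.
import pandas as pd

# One row per URL from your crawler (status code, title, word count, etc.).
crawl = pd.read_csv("crawl_export.csv")          # expects a "url" column
# One row per URL from a Search Console performance export.
gsc = pd.read_csv("search_console_export.csv")   # expects "url", "clicks", "impressions"

# Left-join so every crawled URL is kept, even if it earned no search traffic.
merged = crawl.merge(gsc, on="url", how="left")
merged[["clicks", "impressions"]] = merged[["clicks", "impressions"]].fillna(0)

# Pages that bots can reach but that earn no clicks are worth a closer look.
no_traffic = merged[merged["clicks"] == 0]
print(no_traffic[["url", "impressions"]].head(20))
```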
Not enough indexation or way too much? By omission, crawlers can identify what’s potentially not accessible to bots. If your crawl report has holes where you know sections of your site should be, can bots access that content? If not, there might be a problem with disallows, noindex directives, or the way the content is coded that is keeping bots out.
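If you want to spot-check a few of those missing URLs yourself, the rough sketch below tests whether robots.txt disallows each page and whether a meta robots noindex tag is present. It assumes the requests and beautifulsoup4 libraries, and the domain and URLs are placeholders.

```python
# A rough sketch: check robots.txt disallows and meta robots noindex
# for a handful of URLs that were missing from the crawl report.
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

SITE = "https://www.example.com"  # assumption: your own domain
missing_urls = [                  # hypothetical URLs absent from the crawl
    f"{SITE}/category/widgets/",
    f"{SITE}/blog/old-post/",
]

robots = RobotFileParser(f"{SITE}/robots.txt")
robots.read()

for url in missing_urls:
    allowed = robots.can_fetch("*", url)
    response = requests.get(url, timeout=10)
    meta = BeautifulSoup(response.text, "html.parser").find("meta", attrs={"name": "robots"})
    noindex = bool(meta and "noindex" in (meta.get("content") or "").lower())
    print(f"{url} -> status {response.status_code}, "
          f"robots.txt allows: {allowed}, noindex tag: {noindex}")
```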
Alternatively, a crawler can show you when you have duplicate content. When you’re sifting through the URLs listed, look for telltale signs, such as redundant product ID numbers or duplicate title tags, that the content might be the same on two or more pages.
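As a rough illustration, the following sketch fetches a handful of URLs, pulls each title tag, and groups the URLs by title; any title shared by two or more URLs is a duplicate-content candidate. The URL list is a placeholder for the URLs in your crawl export.

```python
# A minimal sketch of the duplicate-title check described above.
from collections import defaultdict

import requests
from bs4 import BeautifulSoup

urls = [  # placeholders; feed in the URLs from your crawl export
    "https://www.example.com/product/123",
    "https://www.example.com/product/123?color=blue",
    "https://www.example.com/product/456",
]

titles = defaultdict(list)
for url in urls:
    html = requests.get(url, timeout=10).text
    title_tag = BeautifulSoup(html, "html.parser").title
    title = title_tag.get_text(strip=True) if title_tag else "(missing title)"
    titles[title].append(url)

# Any title shared by two or more URLs is a candidate for duplicate content.
for title, matching_urls in titles.items():
    if len(matching_urls) > 1:
        print(f"Possible duplicates under the title '{title}':")
        for url in matching_urls:
            print(f"  {url}")
```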
Keep in mind that the ability to crawl does not equate to indexation, merely the ability to be indexed.
What’s that error, and why is that redirecting? Crawlers make finding and reviewing technical fixes much faster. A quick crawl of the site automatically returns the server header status code for every page encountered. Simply filter for the 404s and you have a list of errors to track down. Need to test those redirects that just went live? Switch to list mode and specify the old URLs to crawl. Your crawler will tell you which are redirecting and where they’re sending visitors now.
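The same list-mode idea is easy to approximate with a short script. This hedged sketch requests each old URL without following redirects, then reports the status code and, for redirects, the Location header; the URLs are placeholders for the list you would actually test.

```python
# A sketch of a list-mode check: status codes and redirect targets
# for a list of old URLs, without following the redirects.
import requests

old_urls = [  # placeholders for the URLs you want to verify
    "https://www.example.com/old-page",
    "https://www.example.com/discontinued-product",
]

for url in old_urls:
    response = requests.get(url, allow_redirects=False, timeout=10)
    status = response.status_code
    if status in (301, 302, 307, 308):
        print(f"{url} -> {status} redirect to {response.headers.get('Location')}")
    elif status == 404:
        print(f"{url} -> 404, needs a redirect or a fix")
    else:
        print(f"{url} -> {status}")
```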
Is the metadata complete? Without a crawler, it’s too difficult to identify existing metadata and create a plan to optimize it on a larger scale. Use one to quickly gather data about title tags, meta descriptions, meta keywords, H headings, language tags, and more.
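Under the hood, a crawler is doing something like the following for every page it finds. This sketch pulls the title tag, meta description, first H1, and language attribute from a single placeholder URL.

```python
# A small sketch of extracting the metadata fields mentioned above from one page.
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com/some-page"  # placeholder URL
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

title = soup.title.get_text(strip=True) if soup.title else None
description_tag = soup.find("meta", attrs={"name": "description"})
description = description_tag.get("content") if description_tag else None
h1 = soup.h1.get_text(strip=True) if soup.h1 else None
lang = soup.html.get("lang") if soup.html else None

print({"url": url, "title": title, "description": description, "h1": h1, "lang": lang})
```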
Does the site send mixed signals? When not structured correctly, the data on individual pages can tie bots into knots. Canonical tags and robots directives, combined with redirects and disallows affecting the same pages, can send confusing signals to search engines that hurt your indexation and your ability to perform in natural search.
If a key page suddenly has a performance problem, check for a noindex directive and confirm which URL its canonical tag specifies. Does either contradict a redirect sending traffic to the page, or a disallow in the robots.txt file? You never know when something could change accidentally as part of another release that developers pushed out.
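A quick diagnostic along those lines can be scripted as well. The sketch below, with a placeholder URL, reports the page’s meta robots content, its canonical target, and whether robots.txt allows crawling, so you can see at a glance whether the signals contradict one another.

```python
# A quick diagnostic sketch for one underperforming page:
# is it noindexed, where does its canonical point, and does robots.txt block it?
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

page = "https://www.example.com/key-landing-page"  # placeholder for the page in question
parts = urlparse(page)
root = f"{parts.scheme}://{parts.netloc}"

soup = BeautifulSoup(requests.get(page, timeout=10).text, "html.parser")
robots_meta = soup.find("meta", attrs={"name": "robots"})
canonical = soup.find("link", attrs={"rel": "canonical"})

robots_txt = RobotFileParser(f"{root}/robots.txt")
robots_txt.read()

print("meta robots:", robots_meta.get("content") if robots_meta else "none")
print("canonical:", canonical.get("href") if canonical else "none")
print("allowed by robots.txt:", robots_txt.can_fetch("*", page))
```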
Is the text correct? Some crawlers also allow you to search for custom bits of text on a page. Maybe your company is rebranding and you want to be sure that you find every instance of the old brand on the site. Or maybe you recently updated schema on a page template and you want to be sure that it’s found on certain pages. If it’s something that involves searching for and reporting on a piece of text within the source code of a group of web pages, your crawler can help.
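If your crawler lacks a custom search feature, a few lines of Python can approximate it. In this sketch, both the old brand name and the page list are placeholders; it simply counts occurrences of the pattern in each page’s source code.

```python
# A hedged sketch of a custom text search: flag pages that still contain
# an old brand name anywhere in their source.
import re

import requests

old_brand = re.compile(r"Old Brand Name", re.IGNORECASE)  # placeholder pattern
pages = [  # placeholders for the pages you want to check
    "https://www.example.com/",
    "https://www.example.com/about",
    "https://www.example.com/products",
]

for url in pages:
    html = requests.get(url, timeout=10).text
    hits = old_brand.findall(html)
    if hits:
        print(f"{url}: {len(hits)} occurrence(s) of the old brand in the source")
```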
Plan Crawl Times
It’s important to remember, however, that third-party crawlers can put a heavy burden on your servers. They tend to be set to crawl too quickly by default, and the rapid-fire requests can stress your servers if they’re already handling a high customer volume. Your development team may even have blocked crawlers like yours in the past on suspicion of scraping by spammers.
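The fix is simply to throttle the request rate. As a minimal illustration, the sketch below pauses between requests; the one-second delay and the URL list are assumptions, so ask your developers what pace they are comfortable with.

```python
# A minimal illustration of throttling: a delay between requests keeps a
# homegrown check from hammering the server.
import time

import requests

urls = [  # placeholders for the URLs you plan to check
    "https://www.example.com/page-1",
    "https://www.example.com/page-2",
    "https://www.example.com/page-3",
]

for url in urls:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    time.sleep(1.0)  # assumed one-second pause between requests to limit server load
```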
Talk to your developers to explain what you need to accomplish and ask for the best time to do it. They almost certainly have a crawler that they use — they may even be able to give you access to their software license. Or they may volunteer to do the crawl for you and send you the file. At the least, they’ll want to advise you as to the best times of day to crawl and the frequency at which to set the bot’s requests. It’s a small courtesy that helps build respect.