Backlink Blindspots: The State of Robots.txt
Here at Moz we have committed to making Link Explorer as similar to Google as possible, specifically in the way we crawl the web. I have discussed in previous articles some of the metrics we use to measure that performance, but today I want to spend a little time discussing the impact of robots.txt on crawling the web.
Most of you are familiar with robots.txt as the method by which webmasters can direct Google and other bots to visit only certain pages on a site. Webmasters can be selective, allowing certain bots to visit some pages while denying other bots access to the same pages. This presents a problem for companies like Moz, Majestic, and Ahrefs: we try to crawl the web like Google, but certain websites deny access to our bots while allowing access to Googlebot. So, why exactly does this matter?
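To make the selective behavior concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the `example.com` URL are invented for illustration, and the user-agent tokens `AhrefsBot` and `rogerbot` are assumptions about how those crawlers identify themselves; only MJ12Bot is named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Googlebot may crawl everything, every other bot nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed_bots(robots_txt, url, bots):
    """Return a {bot: bool} map saying whether each bot may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in bots}


# Bot tokens other than Googlebot and MJ12bot are assumed for this example.
access = allowed_bots(
    ROBOTS_TXT,
    "https://example.com/",
    ["Googlebot", "MJ12bot", "AhrefsBot", "rogerbot"],
)
print(access)
```

With a file like this, Googlebot matches its own group and is let in, while every other crawler falls through to the catch-all group and is turned away.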
Why does it matter?
As we crawl the web, if a bot encounters a robots.txt file that disallows it, the bot is blocked from crawling specific content. We can see the links that point to the site, but we're blind to the content of the site itself. We can't see the outbound links from that site. This leads to an immediate deficiency in the link graph, at least in terms of being similar to Google (if Googlebot is not similarly blocked).
But that isn't the only issue. There is a cascading failure caused by bots being blocked by robots.txt, in the form of crawl prioritization. As a bot crawls the web, it discovers links and has to prioritize which links to crawl next. Let's say Google finds 100 links and prioritizes the top 50 to crawl. A different bot finds those same 100 links, but is blocked by robots.txt from crawling 10 of the top 50 pages. It is forced to crawl around those, so it picks a different 50 pages to crawl. This different set of crawled pages will return, of course, a different set of links. In the next round of crawling, the other bot will not only have a different set of pages it is allowed to crawl; the candidate set itself will differ from Google's because the two crawled different pages in the first place.
Long story short, much like the proverbial butterfly that flaps its wings and eventually leads to a hurricane, small differences in robots.txt that block some bots and allow others ultimately lead to very different results compared to what Google actually sees.
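The 100-link example can be sketched in a few lines of Python. This is a toy model, not how any real crawler prioritizes: the page names, the assumption that links arrive already ranked best-first, and the choice of which 10 pages are blocked are all made up for illustration.

```python
def pick_crawl_set(ranked_urls, blocked, budget):
    """Return the first `budget` URLs, in priority order, that are not blocked."""
    allowed = [url for url in ranked_urls if url not in blocked]
    return allowed[:budget]


# 100 discovered links, assumed to be sorted from highest to lowest priority.
ranked = [f"page-{i:03d}" for i in range(100)]

# Hypothetical: robots.txt blocks the other bot from 10 of the top 50 pages.
blocked_for_other_bot = set(ranked[0:50:5])

google_set = pick_crawl_set(ranked, set(), 50)
other_set = pick_crawl_set(ranked, blocked_for_other_bot, 50)

shared = set(google_set) & set(other_set)
only_other = sorted(set(other_set) - set(google_set))

print(len(shared), "pages crawled by both")
print(len(only_other), "pages crawled only by the other bot:", only_other)
```

The two crawl sets already differ by 10 pages after a single round. Each of those pages yields its own outbound links, so the divergence compounds in every later round.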
So, how are we doing?
You know I wasn't going to leave you hanging. Let's do some research. Let's analyze the top 1,000,000 websites on the Internet according to Quantcast and determine which bots are blocked, how frequently, and what impact that might have.
The methodology is rather straightforward.
- Download the Quantcast Top Million
- Download the robots.txt if available from all top million sites
- Parse the robots.txt to determine whether the home page and other pages are available
- Collect link data related to blocked sites
- Collect total pages on-site related to blocked sites
- Report the differences among crawlers
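The parsing and reporting steps above could look roughly like the sketch below. It is not the code used for this study: it skips the download step and works on an in-memory dictionary of invented robots.txt files, it checks only the home page, and the bot tokens `AhrefsBot` and `rogerbot` are assumptions.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumed user-agent tokens for the three SEO crawlers.
SEO_BOTS = ["MJ12bot", "AhrefsBot", "rogerbot"]


def home_page_allowed(robots_txt, bot):
    """True if `bot` may fetch the home page under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, "/")


def count_blocked(robots_by_site):
    """Count, per SEO bot, the sites that block it while allowing Googlebot."""
    blocked = Counter()
    for robots_txt in robots_by_site.values():
        if not home_page_allowed(robots_txt, "Googlebot"):
            continue  # Google is blocked too, so there is no disparity to count
        for bot in SEO_BOTS:
            if not home_page_allowed(robots_txt, bot):
                blocked[bot] += 1
    return blocked


# Hypothetical sample standing in for the downloaded robots.txt files.
sample = {
    "a.example": "User-agent: *\nDisallow:\n",
    "b.example": "User-agent: Googlebot\nDisallow:\n\nUser-agent: *\nDisallow: /\n",
    "c.example": "User-agent: AhrefsBot\nDisallow: /\n",
    "d.example": "User-agent: *\nDisallow: /\n",
}

report = count_blocked(sample)
print(dict(report))
```

Skipping sites that also block Googlebot keeps the count focused on the disparity this article cares about: places Google can go that another crawler cannot.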
Total sites blocked
The first and easiest metric to report is the number of sites which block individual crawlers (Moz, Majestic, Ahrefs) while allowing Google. Most sites that block one of the major SEO crawlers block them all. They simply write their robots.txt to allow the major search engines while blocking other bot traffic. Lower is better.
Of the sites analyzed, 27,123 blocked MJ12Bot (Majestic), 32,982 blocked Ahrefs, and 25,427 blocked Moz. This means that among the major industry crawlers, Moz is the least likely to be turned away from a site that allows Googlebot. But what does this really mean?
Total RLDs blocked
As discussed previously, one major problem with disparate robots.txt entries is that they stop the flow of PageRank. If Google can see a site, it can pass link equity from referring domains through the site's outbound domains on to other sites. If a site is blocked by robots.txt, it's as though the outbound lanes of traffic on all the roads going into the site are blocked. By counting all the inbound lanes of traffic, we can get an idea of the total impact on the link graph. Lower is better.
According to our research, Majestic ran into dead ends on 17,787,118 referring domains, Ahrefs on 20,072,690, and Moz on 16,598,365. Once again, Moz's robots.txt profile was the most similar to Google's. But referring domains aren't the only issue we need to be concerned about.
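To put the reported figures side by side, the short sketch below computes how many more blocked sites and dead-end referring domains each competitor has relative to Moz. The numbers are the ones quoted in this article; using Moz as the baseline is simply a convenient choice for the comparison.

```python
# Figures as reported in this article.
blocked_sites = {"Majestic": 27_123, "Ahrefs": 32_982, "Moz": 25_427}
blocked_rlds = {"Majestic": 17_787_118, "Ahrefs": 20_072_690, "Moz": 16_598_365}


def versus_baseline(counts, baseline="Moz"):
    """Return {crawler: (absolute difference, percent difference)} vs. the baseline."""
    base = counts[baseline]
    return {
        name: (value - base, (value - base) / base * 100)
        for name, value in counts.items()
        if name != baseline
    }


site_gap = versus_baseline(blocked_sites)
rld_gap = versus_baseline(blocked_rlds)

for name in site_gap:
    sites_abs, sites_pct = site_gap[name]
    rlds_abs, rlds_pct = rld_gap[name]
    print(f"{name}: +{sites_abs:,} sites ({sites_pct:.1f}%), "
          f"+{rlds_abs:,} referring domains ({rlds_pct:.1f}%)")
```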
Total pages blocked
Most pages on the web only have internal links. Google isn't interested in creating a link graph; they're interested in creating a search engine. Thus, a bot designed to act like Google needs to be just as concerned about pages that only receive internal links as it is about those that receive external links. Another metric we can measure is the total number of pages blocked, using Google's site: query to estimate the number of pages Google has access to that a different crawler does not. So, how do the competing industry crawlers perform? Lower is better.