Google has officially announced that Googlebot will no longer obey robots.txt directives related to indexing. If you are a publisher relying on noindex directives in robots.txt, you have until September 1, 2019 to remove them and switch to an alternative.
Why the Change?
Google will no longer support the directive because it was never an official one. Google has honored it informally in the past, but that will no longer be the case. This is a good time to take a look at your robots.txt file, see where you’re using the directive, and decide what you can do to prepare before support officially ends.
Google Mostly Obeyed the Directive in the Past
Google has informally supported this directive since at least 2008; both Matt Cutts and John Mueller have discussed it. In 2015, Perficient Digital ran a test to see how reliably Google obeyed the command. They concluded:
“Ultimately, the NoIndex directive in Robots.txt is pretty effective. It worked in 11 out of 12 cases we tested. It might work for your site, and because of how it’s implemented it gives you a path to prevent crawling of a page AND also have it removed from the index. That’s pretty useful in concept. However, our tests didn’t show 100 percent success, so it does not always work.
Further, bear in mind, even if you block a page from crawling AND use Robots.txt to NoIndex it, that page can still accumulate PageRank (or link juice if you prefer that term).
In addition, don’t forget what John Mueller said, which was that you should not depend on this approach. Google may remove this functionality at some point in the future, and the official status for the feature is ‘unsupported.’”
With Google’s announcement that noindex in robots.txt is no longer supported, you cannot expect it to work.
In that blog post, they went on to say: “In the interest of maintaining a healthy ecosystem and preparing for potential future open source releases, we’re retiring all code that handles unsupported and unpublished rules (such as noindex) on September 1, 2019.”
What to Use Instead
Instead of using noindex in the robots.txt file, use noindex in a robots meta tag. It is supported both as an HTTP response header (X-Robots-Tag) and as an HTML meta tag, making it the most effective way to remove URLs from the index when crawling is allowed.
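As a quick sketch, the two supported forms look like this (the page itself is hypothetical):

```html
<!-- Robots meta tag, placed in the page's <head> -->
<meta name="robots" content="noindex">

<!-- For non-HTML resources such as PDFs, send the equivalent
     HTTP response header instead:
     X-Robots-Tag: noindex -->
```

The meta tag works only for HTML pages; the X-Robots-Tag header covers everything your server returns.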
Other options include:
- Using 404 and 410 HTTP status codes: Both of these status codes mean the page does not exist, which will drop the URLs from the Google index once they are crawled and processed.
- Disallow in robots.txt: Search engines can only index pages they know about, so blocking a page from being crawled typically means it won’t be indexed. A search engine may still index a URL based on links from other pages without seeing the content itself, so Google says it aims to make such pages less visible in the future.
- Password protection: Unless you use markup to indicate paywalled or subscription-based content, hiding a page behind a login generally removes it from Google’s index.
- Search Console Remove URL tool: Use this tool to quickly and easily remove a URL from Google’s search results temporarily.
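To illustrate the Disallow option above, Python’s standard-library `urllib.robotparser` shows how a crawler that honors robots.txt interprets such a rule (the domain and paths here are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# A minimal robots.txt blocking one directory for all crawlers.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A compliant crawler may not fetch anything under /private/ ...
print(parser.can_fetch("*", "https://example.com/private/page.html"))  # False
# ...but the rest of the site remains crawlable.
print(parser.can_fetch("*", "https://example.com/index.html"))         # True
```

Note this only controls crawling; as the article says, a blocked URL can still end up indexed via links from other pages.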
Other Changes to Consider
All of this comes on the heels of Google’s announcement that it is working on making the Robots Exclusion Protocol an official standard, and this is likely the first change to come of it. Google released its robots.txt parser as an open-source project alongside this announcement.
Google has been looking to make this change for years, and standardizing the protocol allows it to move forward. In analyzing the usage of robots.txt rules, Google focused on how unsupported implementations such as nofollow, noindex, and crawl-delay affect things. Those rules were never documented by Google, so their usage in relation to Googlebot is low. These kinds of rules hurt a website’s presence in Google search results in ways Google doesn’t believe webmasters intend.
Take time to make sure you are not using the noindex directive in your robots.txt file. If you are, switch to one of the suggested methods before September 1st. It’s also a good idea to check whether you’re using the nofollow or crawl-delay commands; if you are, move to the properly supported methods for those directives going forward.
In the case of nofollow, instead of using the robots.txt file, use nofollow in the robots meta tags. If you need more granular control, you can use the rel="nofollow" attribute on an individual link.
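As a sketch of both forms (the link target is hypothetical):

```html
<!-- Page-wide: tell crawlers not to follow any links on this page -->
<meta name="robots" content="nofollow">

<!-- Per-link: mark a single outbound link as nofollow -->
<a href="https://example.com/untrusted" rel="nofollow">Example link</a>
```

The meta tag applies to every link on the page; the rel attribute gives you the more granular, per-link control mentioned above.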
Some webmasters opt for the crawl-delay setting when they have a lot of pages, many of which are linked from the site’s index. The bot starts crawling the site and may generate too many requests in a short period of time; that traffic peak could deplete hosting resources that are monitored hourly. To avoid problems like this, webmasters set a crawl delay of 1 to 2 seconds so bots crawl the website more moderately without causing load peaks.
However, Googlebot doesn’t take the crawl-delay setting into consideration, so you shouldn’t worry about the directive influencing your Google rankings. You can safely keep it in place for other, more aggressive bots you are trying to slow down. It’s not likely you’ll experience issues as a result of Googlebot crawling, but if you want to reduce its crawl rate, the only way to do so is through Google Search Console.
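For reference, a crawl-delay rule of the kind described above looks like this in robots.txt (the value is illustrative; Googlebot ignores it, though some other crawlers have historically honored it):

```
User-agent: *
Crawl-delay: 2
```

This asks compliant bots to wait roughly two seconds between requests, smoothing out the load peaks the paragraph above describes.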
Take a few minutes to look over everything in your robots.txt file to make sure you make the necessary adjustments ahead of the deadline. Though it may take a little bit of time to sort through everything and execute the changes, it will be well worth it come September.