
De-index/Block bulk pages from search engines for Drupal sites


Goal

How to exclude lower (non-production) Drupal environments from search engines

Overview

There are scenarios that require de-indexing numerous pages, PDFs, and documents from search engines such as Google and Bing, including:

  1. Lower (Non-Production) Environments
  2. Confidential Information
  3. Hacked Site Pages
  4. Duplicate Content 
  5. Indexation Bloat

A two-part approach can be taken to solve this problem:

  1. Removing the indexed pages/documents from the search engines
  2. Blocking search engine bots from crawling these assets in the future

Let’s look at both of these approaches in more detail.

Part One: Removing the indexed pages/documents from the search engines

1. Metatag module - Use the Metatag module to set the noindex and nofollow tags on all the nodes, and use the Config Split module to maintain separate configuration per environment. Proper testing will be required to ensure that the lower environment pages have these tags and that production does not!
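As an illustration, here is a minimal sketch of how the Metatag global defaults could be exported into a config split that is active only on lower environments. The split directory name is an assumption; adapt it to your site's configuration layout.

```yaml
# Assumed split directory for non-production environments:
# config/splits/nonprod/metatag.metatag_defaults.global.yml
langcode: en
status: true
dependencies: {}
id: global
label: Global
tags:
  # Ask crawlers not to index these pages or follow their links.
  robots: 'noindex, nofollow'
```

The split itself is typically enabled per environment in settings.php, so the same codebase carries both configurations while only one is active at a time.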

2. Submit a sitemap.xml - Once you've added the noindex and nofollow tags to the pages on the lower environment, submit the sitemap.xml to Google Search Console and Bing Webmaster Tools so that they can crawl these pages and start removing them from their indexes. This is an important step because it helps de-index the pages much faster than waiting for the search engines to crawl the pages on their own.
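The sitemap you submit only needs to list the URLs you want the crawlers to revisit. A minimal, hypothetical example (the domain and paths are placeholders) in the standard sitemaps.org format:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- List only the pages that now carry the noindex tag. -->
  <url>
    <loc>https://dev.example.com/node/1</loc>
  </url>
  <url>
    <loc>https://dev.example.com/node/2</loc>
  </url>
</urlset>
```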

Not long ago, I had the opportunity to volunteer for a non-profit organization where I noticed that “human bots” had been used to create user profiles, and all of those user profiles were filled with links to spam sites. To add to the complexity, more than 100,000 user profile pages were indexed in the search engines.

Mike Madison, who also volunteers for the same non-profit, and I followed the first two steps: we added noindex and nofollow to the user profile pages and configured the default sitemap to exclude user profile links. We then created another sitemap containing only the links to the user profile pages and submitted this user sitemap to Google Search Console so that Google would crawl these pages and read the noindex meta tag. The result was that in less than ten days, all the user profile pages were removed from Google’s search index.

Please note - we can't block search engines from crawling the pages/documents until we are sure that Google has removed the unwanted pages/documents from its index. If we block the search engines before they de-index the already crawled pages/documents, they won't be able to read the “noindex” meta tag from the HTML pages and will not remove them from their index.

3. Temporary removal from index - Google Search Console offers a tool to remove indexed pages temporarily, and Bing offers a similar tool. With these tools, one can request that a search engine de-index the desired pages/documents for about six months and clear the current snippet and cached version. The tool is quite powerful and can be very useful for removing indexed items quickly.

As always, though, one needs to be careful, since one wrong step can remove important pages that you want to keep in the search engine’s index.

Part Two: Blocking search engine bots from crawling these assets in the future

1. Robots.txt for different environments - The RobotsTxt module serves the contents of the /robots.txt path from configuration. Together with the Configuration Split module, you can then have it serve different robots.txt content per environment or per site (in case you are running Drupal as a multisite).
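As a sketch, assuming the RobotsTxt module is installed and a config split exists for lower environments (the split directory is an assumption), the override could look like this:

```yaml
# Assumed split directory for non-production environments:
# config/splits/nonprod/robotstxt.settings.yml
# Serve a robots.txt that disallows all crawling on lower environments.
content: |
  User-agent: *
  Disallow: /
```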

2. Block search engine bots using a WAF - Another solution for blocking search engines from accessing these resources is to use a WAF service. This allows you to define rules that completely block search engine access (even for verified bots) to lower environments. That way, once the search engines have removed the content from their indexes, they will not be able to crawl it again and will not add any load to the server either.

Conclusion

Removing information from search engines can take time and cause stress, but with the preceding information you can resolve any issues in a short amount of time and make sure that only the information you want on search engines shows up in search results.

Thanks to Malcolm Peralty for reviewing the post for accuracy and editing the post.