Using Robots.txt with Drupal to avoid duplicate content

Drupal is a great system, but one of its flaws is that it exposes multiple routes to the same content. This effectively creates duplicate content. Search engines like Google do not like duplicate content, and you are likely to be penalised for it.

You can control which content search engines index by using a robots.txt file. The robots.txt file gives you the ability to disallow certain pages from being indexed. Therefore, used correctly, you can prevent a duplicate content penalty.

Drupal 5 comes with a robots.txt file by default. If you are running a version prior to Drupal 5, you will need to add your own robots.txt file. It is fairly straightforward: open a text editor like Notepad, add the code (an example is shown below), save the file as "robots.txt", and FTP the file to the root folder on your web server.

Here is the default code that comes with Drupal 5's robots.txt file:

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
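If you want to sanity-check which paths these rules actually block, you can feed them to Python's standard urllib.robotparser. The sketch below uses an abridged copy of the rules above and a placeholder domain (example.com):

```python
import urllib.robotparser

# A few of the default Drupal 5 rules from above, fed to the parser inline.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /admin/
Disallow: /user/login/
Disallow: /update.php
"""

rp = urllib.robotparser.RobotFileParser()
rp.modified()                 # mark the rules as freshly loaded
rp.parse(rules.splitlines())

print(rp.can_fetch("*", "http://example.com/admin/settings"))  # False (blocked)
print(rp.can_fetch("*", "http://example.com/node/42"))         # True (crawlable)
print(rp.crawl_delay("*"))                                     # 10
```

This is handy for checking a rule before you upload the file and wait for crawlers to pick it up.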

I recommend adding the following:

# Directories
Disallow: /tracker/
Disallow: /xtracker/
Disallow: /user/
Disallow: /book/export/
Disallow: /forward/
Disallow: /comment/
Disallow: /feed/
Disallow: /comment/reply/
Disallow: /popular/

# Files
Disallow: /rss.xml

If you are using URL aliases to create specific URLs, then add:

Disallow: /node/*
Disallow: /taxonomy/
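A note on that trailing asterisk: the original robots.txt standard matches Disallow rules by simple prefix, so /node/ alone already covers every node path, while /node/* only behaves as intended for crawlers (such as Googlebot) that support the wildcard extension. A minimal sketch of the two matching behaviours, with hypothetical helper names:

```python
import re

def blocked_by_prefix(rule: str, path: str) -> bool:
    """Original robots.txt standard: a Disallow rule matches by simple prefix."""
    return path.startswith(rule)

def blocked_by_wildcard(rule: str, path: str) -> bool:
    """Extension used by major crawlers: '*' matches any run of characters."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None

print(blocked_by_prefix("/node/", "/node/42"))     # True
print(blocked_by_prefix("/node/*", "/node/42"))    # False: '*' taken literally
print(blocked_by_wildcard("/node/*", "/node/42"))  # True
```

In short, /node/ is the safer rule: it blocks node paths under both interpretations.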


Comments

Quick question, why /node/* (with a following asterisk), whereas /taxonomy/ (without asterisk)?

My robots.txt skills are a bit rusty so I thought I'd ask.

Thanks!

.:Joshua


Good question, Joshua!

It has been a while and I can't remember the reason why I added the * to node and not taxonomy. It does seem illogical. I will do a bit of research and see which way is considered to be the most correct. My feeling at the moment is that it does not matter whether you put an asterisk (that is just a wildcard) or not. But I'll see if I can find out for sure.

When I added /node/*, I checked it with http://www.sxw.org.uk/computing/robots/check.html (Robots.txt Syntax Checking) and got the report: "Unrecognised field. The field Disallow could not be recognised. Whilst the robots.txt standard allows for expansion by the use of undefined fields, it is likely that this line is a mistake in your file". Could you explain that? Thanks.
fajar

Thanks for sharing this. This is a serious matter that I have only just become aware of. My partner installed Drupal on various sites. The sites had been getting great traffic for a few years, then wham! Google killed them. Now I find out that I have thousands of useless pages indexed, which no doubt caused the ban. So this is something that HAS to be done even if you have been unaffected for years!
