Blair Wadman

Using Robots.txt with Drupal to avoid duplicate content

Drupal is a great system, but one of its flaws is that the same piece of content is often reachable at more than one URL (for example, /node/123 and a path alias pointing to the same node). This effectively creates duplicate content, which search engines like Google penalise.

You can control which pages search engines index by placing a robots.txt file at the root of your site. Each Disallow line tells compliant crawlers not to index paths beginning with that prefix. Used correctly, this prevents the duplicate routes from being indexed and so avoids a duplicate content penalty.

Drupal 5 comes with a robots.txt file by default. If you are running a version prior to Drupal 5, you will need to add your own. It is straightforward: open a text editor such as Notepad, add the rules (an example is shown below), save the file as "robots.txt", and FTP it to the root folder on your web server.

Here is the default code that comes with Drupal 5's robots.txt file:

User-agent: *
Crawl-delay: 10
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
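If you want to sanity-check your rules before uploading the file, Python's standard-library robots.txt parser can simulate how a well-behaved crawler reads them. This is just a quick sketch using a few of the rules above; example.com is a placeholder for your own domain:

```python
# Check robots.txt rules locally with Python's standard library.
from urllib import robotparser

# A small sample of the Drupal rules; paste your full file here instead.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /node/add/
Disallow: /user/login/
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Disallowed prefixes return False; everything else is allowed.
print(rp.can_fetch("*", "http://example.com/admin/settings"))  # False
print(rp.can_fetch("*", "http://example.com/about-us"))        # True
```

This only tells you what a crawler that honours robots.txt would skip; it does not guarantee how any particular search engine behaves.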

I recommend adding the following:

# Directories 
Disallow: /tracker/
Disallow: /xtracker/
Disallow: /user/
Disallow: /book/export/
Disallow: /forward/
Disallow: /comment/
Disallow: /feed/
Disallow: /comment/reply/
Disallow: /popular/

# Files
Disallow: /rss.xml

If you are using URL aliases to create search-friendly URLs, then also add:

Disallow: /node/*
Disallow: /taxonomy/

Note that the * wildcard is not part of the original robots.txt standard, but major crawlers such as Googlebot support it.