Drupal is a great system, but one of its flaws is that the same content can often be reached through more than one URL. To a search engine, this looks like duplicate content. Search engines like Google do not like duplicate content, and your rankings may suffer for it.
You can control which content search engines crawl by using a robots.txt file. The robots.txt file lets you disallow certain paths so that well-behaved crawlers skip them. Used correctly, it can therefore help you avoid a duplicate content penalty.
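To make the idea concrete, here is a minimal sketch in Python that uses the standard library's urllib.robotparser to check how a Disallow rule is read. The two-line rule set, the example.com host and the /category/news alias are assumptions made up for illustration; they are not taken from any particular site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: ask every crawler to skip Drupal's /taxonomy/ paths.
RULES = """\
User-agent: *
Disallow: /taxonomy/
"""


def is_allowed(rules, url, agent="*"):
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # example.com and both paths are invented for this illustration.
    for url in ("http://example.com/taxonomy/term/1",
                "http://example.com/category/news"):
        print(url, "allowed" if is_allowed(RULES, url) else "blocked")
```

The internal taxonomy path is reported as blocked while the aliased path stays crawlable, which is the effect you want when two URLs lead to the same page.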
Drupal 5 comes with a robots.txt file by default. If you are running a version prior to Drupal 5, you will need to add your own robots.txt file. It is fairly straightforward. Open a text editor like Notepad. Add the code (an example is shown below). Save the file as "robots.txt". Upload the file by FTP to the root folder on your web server.
Here is the default code that comes with Drupal 5's robots.txt file:
User-agent: *
Crawl-delay: 10
# Directories
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (clean URLs)
Disallow: /admin/
Disallow: /aggregator/
Disallow: /comment/reply/
Disallow: /contact/
Disallow: /logout/
Disallow: /node/add/
Disallow: /search/
Disallow: /user/register/
Disallow: /user/password/
Disallow: /user/login/
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=aggregator/
Disallow: /?q=comment/reply/
Disallow: /?q=contact/
Disallow: /?q=logout/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=user/password/
Disallow: /?q=user/register/
Disallow: /?q=user/login/
I recommend adding the following:
Disallow: /tracker/
Disallow: /xtracker/
Disallow: /user/
Disallow: /book/export/
Disallow: /forward/
Disallow: /comment/
Disallow: /feed/
Disallow: /comment/reply/
Disallow: /popular/
# Files
Disallow: /rss.xml
If you are using URL aliases to create specific URLs, then add:
Disallow: /taxonomy/
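If you want to see what a set of extra rules actually covers, a small check like the one below can help. It is only a sketch: it uses Python's urllib.robotparser, a subset of the additions listed above wrapped in a "User-agent: *" group, and sample paths (/user/3, /about-us and so on) that are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A subset of the recommended additions, placed in a catch-all group.
ADDITIONS = """\
User-agent: *
Disallow: /tracker/
Disallow: /user/
Disallow: /comment/
Disallow: /rss.xml
"""


def blocked(rules, paths):
    """Return the paths from `paths` that the rules tell all crawlers to skip."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return [p for p in paths if not parser.can_fetch("*", p)]


if __name__ == "__main__":
    # Hypothetical sample paths, chosen to show prefix matching at work.
    samples = ["/comment/reply/5", "/user/3", "/rss.xml", "/tracker", "/about-us"]
    print(blocked(ADDITIONS, samples))
```

Two things tend to show up in a check like this. Because rules are matched as prefixes, "Disallow: /comment/" already covers /comment/reply/, so listing both is harmless but redundant. And a rule that ends in a slash, such as "Disallow: /tracker/", does not match the bare path /tracker without the trailing slash.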

Why node and taxonomy different?
Quick question, why /node/* (with a following asterisk), whereas /taxonomy/ (without asterisk)?
My robots.txt skills are a bit rusty so I thought I'd ask.
Thanks!
.:Joshua
Good Question
Good Question Joshua!
It has been a while and I can't remember why I added the * to node and not to taxonomy. It does seem illogical. I will do a bit of research and see which way is considered the most correct. My feeling at the moment is that it does not matter whether you put an asterisk (that is just a wildcard) or not. But I'll see if I can find out for sure.
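For anyone who wants to test that hunch, here is a rough sketch of how wildcard-aware crawlers are generally described as matching a path: a rule is a prefix match, "*" stands for any run of characters, and a trailing "$" anchors the end. The rule_matches function is a hypothetical helper written for this illustration, not part of Drupal or of any crawler. Note that the original robots.txt convention had no wildcards at all, so a crawler that only does plain prefix matching may treat the asterisk literally.

```python
import re


def rule_matches(pattern, path):
    """Match a robots.txt path pattern against a URL path.

    '*' matches any run of characters, a trailing '$' anchors the end of
    the path, and everything else behaves as a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match only anchors at the start, which gives the prefix behaviour.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for path in ("/node/", "/node/12", "/node/12/edit", "/nodes", "/about"):
        print(path, rule_matches("/node/*", path), rule_matches("/node/", path))
```

Under these assumptions both forms give the same answer for every path, so "/node/*" and "/node/" are equivalent and the trailing asterisk is redundant. The version without the asterisk is the safer choice, since it also works for crawlers that do not understand wildcards.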