Pike's SEO Tactics - Robots File

Robots File

The robots file (robots.txt), which should be placed at the root of the web server (http://www.domain.com/robots.txt), tells well-behaved web bots like Google's and Yahoo's how to spider your pages. All pages are accessible by default, so a robots file is not strictly required. However, bots ask for the file whenever they crawl your site, so each request produces a "file not found" (404) error on your server if it is not present.
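
To illustrate the root location, the address of the robots file can be worked out from the address of any page on the same site. The following is a minimal Python sketch; the function name robots_url and the sample page address are made up for this example.

```python
from urllib.parse import urljoin


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    # The leading slash makes urljoin resolve against the web root,
    # discarding whatever folder the page itself lives in.
    return urljoin(page_url, "/robots.txt")


if __name__ == "__main__":
    # Hypothetical page address on the example domain.
    print(robots_url("http://www.domain.com/articles/seo/index.html"))
```

Whatever folder the page lives in, the result points at the single robots file at the top of the site.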

The robots file implements the Robots Exclusion Protocol: you state which parts of the site are off limits to which bots. By default, all bots are allowed to access everything. There is also no guarantee that a poorly behaved bot won't retrieve what you exclude. Bots are identified by their "User-agent" string, which is sent to the server with every request to say what program is accessing the page.
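
As a rough sketch of where the "User-agent" string comes from, here is how a bot written in Python might attach it to a request using the standard urllib module. The bot name example-bot/1.0 and the helper build_request are invented for illustration. The request is only prepared, not sent.

```python
from urllib.request import Request

# Hypothetical bot name; real bots publish their own User-agent strings.
BOT_NAME = "example-bot/1.0"


def build_request(url: str) -> Request:
    """Prepare (but do not send) a request that identifies the bot."""
    return Request(url, headers={"User-Agent": BOT_NAME})


if __name__ == "__main__":
    req = build_request("http://www.domain.com/index.html")
    # urllib stores the header name as "User-agent".
    print(req.get_header("User-agent"))
```

The name before the slash is the part a robots file would match against in its User-agent lines.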

The most basic robots file is a blank one. The next simplest names all bots and excludes nothing:

User-agent: *
Disallow:

This one excludes everything (and anti-optimizes your site!):

User-agent: *
Disallow: /
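
If you want to check how a well-behaved bot would read these basic files, Python's standard urllib.robotparser module can parse robots.txt text directly. This is only a sketch: the helper is_allowed, the bot name example-bot, and the page address are assumptions for the example, and real bots may use their own parsers.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    page = "http://www.domain.com/index.html"  # hypothetical page
    for name, text in [("blank", ""), ("allow all", ALLOW_ALL), ("block all", BLOCK_ALL)]:
        print(name, is_allowed(text, "example-bot", page))
```

The blank file and the "excludes nothing" file both let the bot in. Only the file with Disallow: / keeps it out.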

Here, * matches all user agents, and each Disallow line names a folder or file (as a path from the web root) that is off limits. The next example shows how to ban a bot and prevent bots from reaching a folder.

User-agent: bad-bot
Disallow: /

User-agent: *
Disallow: /private/

The first record bans bad-bot from the entire site. The second record bans all other bots from the /private/ folder. Anything not disallowed stays accessible, so no further record is needed, and a file should contain only one User-agent: * record. Want more examples? Go to any major website like Google (http://www.google.com/robots.txt) and look at their robots.txt file.
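
The bad-bot and /private/ rules can be tried out the same way with Python's standard urllib.robotparser. Again this is a sketch under assumptions: good-bot, the helper make_parser, and the page names are hypothetical, and the rules are written with a single User-agent: * record.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: bad-bot
Disallow: /

User-agent: *
Disallow: /private/
"""


def make_parser(text: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = make_parser(ROBOTS_TXT)
    site = "http://www.domain.com"
    # bad-bot is banned from every page.
    print(parser.can_fetch("bad-bot", site + "/index.html"))
    # Any other bot is kept out of /private/ only.
    print(parser.can_fetch("good-bot", site + "/private/notes.html"))
    print(parser.can_fetch("good-bot", site + "/index.html"))
```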

There is no real guarantee that a bot will honor what you write in the robots file, so don't use it to hide private and personal files. Anyone with an internet connection can look at your robots file and see what is excluded. They can then access those files if you do not password-protect them.
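
To see how little effort that takes, here is a short Python sketch that pulls every excluded path out of the text of a robots file. The function name disallowed_paths is made up for this example, and the parsing is deliberately simplified.

```python
def disallowed_paths(robots_txt: str) -> list[str]:
    """Collect every non-empty Disallow path, ignoring comments."""
    paths = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\n"
    # Every path listed here is visible to anyone who reads the file.
    print(disallowed_paths(sample))
```

Listing a folder in the robots file advertises it, so anything sensitive needs real access control on the server.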

For more information about the robots file, see the Web Robots Pages.

Last Updated: Wed, 30 Mar 2005 23:43:52 GMT
All rights reserved © 1999-2007 Matthew Bystedt