Using Robots.txt on Your Web Site



If you are a webmaster, you are probably already familiar with robots.txt, or have at least heard of it. It is a plain-text file that tells search engine robots which parts of your site they should not spider.

A Brief History 

Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

In 1993 and 1994 there were occasions where robots visited WWW servers where they weren’t welcome, for various reasons. Sometimes these reasons were robot-specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren’t suitable, e.g. very deep virtual trees, duplicated information, temporary information, or CGI scripts with side effects (such as voting).

These incidents showed the need for an established mechanism by which WWW servers can tell robots which parts of the server should not be accessed. The Robots Exclusion Standard addresses this need with an operational solution.

The method used to exclude robots from a server is to create a file on the server which specifies an access policy for robots. This file must be accessible via HTTP on the local URL “/robots.txt”.
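
As a rough sketch of what "the local URL /robots.txt" means in practice, the Python snippet below derives the robots.txt address for the server hosting any given page. The function name robots_url and the example.com address are assumptions made for illustration; they are not part of the standard.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the server that hosts page_url.

    robots_url is a hypothetical helper name used only in this sketch.
    """
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path
    # with /robots.txt and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("http://www.example.com/cyberworld/map/index.html?x=1"))
# prints http://www.example.com/robots.txt
```

Whatever page a robot starts from, the policy file is always looked up at the root of that server, never in a subdirectory.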

Example of a robots.txt file:

User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /foo.html
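
One way to see how a robot might read these rules is Python's standard urllib.robotparser module. The sketch below feeds it the example file above and checks a few paths. The robot name ExampleBot and the helper allowed are hypothetical, and real crawlers may interpret rules somewhat differently.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from this article, held in memory so that
# the sketch does not need network access.
RULES = """\
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /foo.html
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (hypothetical) robot `agent` may fetch `path`."""
    return parser.can_fetch(agent, path)


print(allowed("/cyberworld/map/index.html"))  # False: under a disallowed prefix
print(allowed("/foo.html"))                   # False: listed explicitly
print(allowed("/index.html"))                 # True: no rule matches
```

Each Disallow line is matched as a path prefix, which is why everything beneath /cyberworld/map/ is excluded while unrelated pages stay accessible.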

More information can be found at http://www.robotstxt.org. To prepare your own robots.txt, you can use the Robot Control Code Generation Tool.
