Wednesday, September 24, 2008

LPT730 Lab #3 - Part 2, The Robot Exclusion Standard

The Robot Exclusion Standard is a voluntary standard that tells web spiders and other automated downloading programs which otherwise publicly available content they should not download. The need for such a standard came about because search engines and other legitimate robots were downloading inappropriate content, such as a cgi-bin directory full of program code that has no place in search results. Because compliance is voluntary, it's a good example of an imperfect solution on the Internet.

Given how the Internet currently communicates, it's highly impractical, if not impossible, to hide content from only a subset of visitors to your web site. It would take nothing short of a redesign of basic protocols such as HTTP to make this happen. So a cooperative arrangement has evolved in which a website author creates a file on the site called robots.txt, telling web crawlers and other robots where they are and are not welcome. Here's an example of a robots.txt file that asks all robots to refrain from downloading any files from the entire web site:
# Tells Scanning Robots Where They Are And Are Not Welcome
# User-agent: can also specify by name; "*" is for everyone
# Disallow: if this matches first part of requested path, forget it
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
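The prefix rule described in the comments above can be sketched in a few lines of Python. This is only an illustration, not a full parser: the function name is_disallowed is made up for this example, and it ignores User-agent grouping and any extensions that individual crawlers support.

```python
def is_disallowed(path, disallow_rules):
    """Return True if the requested path starts with any Disallow prefix.

    An empty Disallow value matches nothing, which means "everything is allowed".
    """
    return any(rule and path.startswith(rule) for rule in disallow_rules)


# "Disallow: /" is a prefix of every path, so the whole site is off limits.
block_everything = ["/"]
print(is_disallowed("/index.html", block_everything))     # True
print(is_disallowed("/images/logo.png", block_everything))  # True

# A narrower rule only blocks paths that begin with that prefix.
print(is_disallowed("/index.html", ["/cgi-bin/"]))         # False
```

Because every path on a site begins with "/", that single Disallow line is enough to ask robots to stay away from everything.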
Here's an example that asks crawlers to avoid downloading CGI code:
User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi
Will doing this prevent all unwanted downloading? No. Because the standard is voluntary, some unscrupulous people will download whatever parts of your site they wish. However, it still makes sense to create the exclusion file: the majority of robots will obey it, so a website owner can save significant money on bandwidth, and avoid headaches, by keeping the site under an appropriate load. For more information about the robot exclusion standard, try this FAQ or the references below.
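
To see the cooperative side of this arrangement, here is a rough sketch of how a well-behaved robot might check the CGI rules shown earlier before fetching a page. It uses Python's standard urllib.robotparser module. The robot name "ExampleBot" and the example.com URLs are placeholders invented for this illustration, and the rules are parsed from a string rather than downloaded from a real site.

```python
from urllib.robotparser import RobotFileParser

# The CGI example from this post, embedded as a string for the demonstration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" and these URLs are hypothetical; a real crawler would use its own
# user-agent name and the URLs it actually intends to visit.
for url in (
    "http://example.com/index.html",
    "http://example.com/cgi-bin/search.cgi",
    "http://example.com/Ads/banner.cgi",
):
    verdict = "fetch" if parser.can_fetch("ExampleBot", url) else "skip"
    print(verdict, url)
```

Nothing forces a robot to run a check like this, which is exactly why the standard only works for robots that choose to cooperate.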

References
--
The Web Robots Page - http://www.robotstxt.org/
Wikipedia - http://en.wikipedia.org/wiki/Robots.txt
Web Developer's Virtual Library - http://www.wdvl.com/Location/Search/Robots.html
