Given the current nature of how the Internet communicates, it's highly impractical, if not impossible, to hide content from a subset of visitors to your web site. It would take nothing short of a redesign of basic protocols such as HTTP to make this happen. So a cooperative convention has evolved: a website author creates a file on the site called robots.txt telling web crawlers and other robots where they are and are not welcome. Here's an example of a robots.txt file that asks all robots to refrain from downloading any files from the entire site.
# Tells Scanning Robots Where They Are And Are Not Welcome
# User-agent: can also specify by name; "*" is for everyone
# Disallow: if this matches the first part of the requested path, forget it
User-agent: * # applies to all robots
Disallow: / # disallow indexing of all pages
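As the comments note, Disallow works by simple prefix matching: a request is off limits if its path begins with a Disallow value. Here's a minimal sketch of that rule in Python; the is_allowed helper and the rule lists are purely illustrative, not part of any standard library:

def is_allowed(path, disallowed_prefixes):
    # A polite robot skips any path that begins with a Disallow prefix.
    return not any(path.startswith(prefix) for prefix in disallowed_prefixes)

# With the rule above ("Disallow: /") every path is off limits:
print(is_allowed("/index.html", ["/"]))   # False
# With no Disallow lines at all, everything is allowed:
print(is_allowed("/index.html", []))      # True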
Here's an example that asks crawlers to avoid downloading CGI code:

User-agent: *
Disallow: /cgi-bin/
Disallow: /Ads/banner.cgi

Will doing this prevent all unwanted downloading? No. As it's a voluntary standard, some unscrupulous people will download whatever parts of your site they wish. However, it still makes sense to create the exclusion file: the majority of robots will obey it, so a website owner can save significant money in bandwidth, and plenty of headaches, by keeping the site under an appropriate load. For more information about the robot exclusion standard, try this FAQ or the references below.
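Most well-behaved crawlers check these rules automatically. If you're writing one in Python, the standard library's urllib.robotparser module does the parsing and matching for you. Here's a short sketch that feeds it the CGI example above; example.com is just a placeholder host:

import urllib.robotparser

# The CGI example above, fed to the parser directly (rp.read() would
# instead fetch robots.txt over HTTP from a URL given via set_url()).
rules = [
    "User-agent: *",
    "Disallow: /cgi-bin/",
    "Disallow: /Ads/banner.cgi",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

# can_fetch(user_agent, url) applies the same prefix matching a robot would.
print(rp.can_fetch("*", "http://example.com/index.html"))      # True
print(rp.can_fetch("*", "http://example.com/cgi-bin/search"))   # False
print(rp.can_fetch("*", "http://example.com/Ads/banner.cgi"))   # False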
References
The Web Robots Page - http://www.robotstxt.org/
Wikipedia - http://en.wikipedia.org/wiki/Robots.txt
Web Developer's Virtual Library - http://www.wdvl.com/Location/Search/Robots.html