
Download: suites.robotExclusion.sit.hqx
Created by John VanDyk of Iowa State University on Sunday, Feb 23, 1997 at 8:31:25 PM.
Version 1.0.2, released March 2, 1997
Version 1.0.1, released Feb. 28, 1997
Placing <META name="robots" content="noindex, nofollow"> inside the <HEAD> element of a page should prevent a robot from indexing that page or following links on that page.
See the example at robotExclusion.samples.politegetpage.
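As a rough illustration of the META convention described above, here is a minimal sketch in Python rather than UserTalk. It is not the suite's code: the names RobotsMetaParser and meta_robots are hypothetical, and it only looks for the two directives mentioned here (noindex and nofollow).

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <META name="robots"> tags found inside <HEAD>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attr = {name: (value or "") for name, value in attrs}
            if attr.get("name", "").strip().lower() == "robots":
                for item in attr.get("content", "").split(","):
                    if item.strip():
                        self.directives.add(item.strip().lower())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def meta_robots(html):
    """Return (may_index, may_follow) for a page's HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow
```

A crawler would call meta_robots on a page it has already retrieved, then skip indexing or link extraction as the result requires.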
Description
This suite checks whether a robot (crawler, spider) is allowed to retrieve a file from a web site. It is aimed primarily at those writing web crawlers and robots in Frontier.
The robots.txt files from the hosts you visit are cached automatically, and the refresh interval is user-configurable (user.robotExclusion.robotcacheexpires). By default, a host's robots.txt file is retrieved at most once every 24 hours.
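The caching behavior can be pictured with a small sketch, written in Python rather than UserTalk. It is an assumption about how such a cache might work, not the suite's implementation: RobotsCache, fetch, and clock are hypothetical names, and max_age stands in for the robotcacheexpires setting.

```python
import time


class RobotsCache:
    """Keep one robots.txt body per host, refetching after max_age seconds."""

    def __init__(self, fetch, max_age=24 * 60 * 60, clock=time.time):
        self.fetch = fetch      # hypothetical callable: host -> robots.txt text
        self.max_age = max_age  # plays the role of robotcacheexpires
        self.clock = clock      # injectable so expiry can be tested
        self.entries = {}       # host -> (time fetched, text)

    def get(self, host):
        host = host.lower()     # host names are case-insensitive
        now = self.clock()
        entry = self.entries.get(host)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]     # still fresh: no network traffic
        text = self.fetch(host)  # missing or expired: fetch again
        self.entries[host] = (now, text)
        return text
```

Passing the fetch function in from outside keeps the expiry logic separate from the network code, so the 24-hour default can be changed or tested without touching either.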
The robotExclusion suite follows the current (Dec. 4, 1996) Internet Draft of the robots exclusion protocol, with exceptions outlined below.
The main use of this suite is to check whether you may retrieve a page before you actually request it. For example:
if robotExclusion.robotsOK("http://www.ent.iastate.edu/List/complete.html", "MyCrawler/1.0", "jvandyk@iastate.edu")
    page = robotExclusion.httpgetpage("www.ent.iastate.edu", "/List/complete.html", 80, "MyCrawler/1.0", "jvandyk@iastate.edu")
This has been combined into one step at robotExclusion.samples.politegetpage.
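For readers who want to see the check-then-fetch logic outside Frontier, here is a minimal Python sketch using the standard library's urllib.robotparser. It is an illustration only and does not reproduce the suite's exceptions to the draft. The names robots_ok and polite_get are hypothetical, and the robots.txt text and the fetch function are passed in by the caller so that the sketch needs no network access.

```python
from urllib.robotparser import RobotFileParser


def robots_ok(robots_txt, url, agent):
    """Return True if the rules in robots_txt allow agent to retrieve url."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(agent, url)


def polite_get(robots_txt, url, agent, fetch):
    """Call fetch(url) only when the rules allow it; otherwise return None."""
    if not robots_ok(robots_txt, url, agent):
        return None
    return fetch(url)
```

With a robots.txt that contains "User-agent: *" followed by "Disallow: /List/" (made-up rules for illustration), polite_get would decline to fetch http://www.ent.iastate.edu/List/complete.html and return None without calling fetch.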
It is very important that you send the name and version of your robot along with your request. Using robotExclusion.httpgetpage instead of netevents.examples.httpget does this. Writing a robot that does NOT send its identity, version, and contact information is downright rude.
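A sketch of what identifying yourself can look like at the HTTP level, in Python rather than UserTalk. It assumes the robot's name and version travel in the User-Agent header and the contact address in the From header. Both are standard HTTP request headers, but this document does not say which headers robotExclusion.httpgetpage actually sends. The function name identified_request is hypothetical.

```python
from urllib.request import Request


def identified_request(url, agent, contact):
    """Build (but do not send) a GET request that says who is asking."""
    return Request(url, headers={
        "User-Agent": agent,  # robot name and version, e.g. "MyCrawler/1.0"
        "From": contact,      # e-mail address of the robot's operator
    })
```

Passing the returned object to urllib.request.urlopen would send the request with both headers attached.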
Advanced options:
There are several user-configurable options at user.robotExclusion:
Requirements:
This suite requires the NetEvents app.
The following improvements can be made to this suite:
If you are interested in tackling one of these areas, please contact me at