RootNode, Prefilters configuration options

Prefilters are run before objects are fetched, to determine whether they should be fetched at all.

Description

This set of filters is run before a URL is fetched, to determine whether the given URL is allowed by the current configuration. The filters can, in principle, restrict gathering on the basis of any information that can be obtained from the URL itself, but they have no knowledge of document content or type. Additional filters can easily be implemented in Perl.
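
For readers implementing their own filters, the following standalone Perl sketch shows the general idea only: each prefilter is modelled as a predicate over the URL string, and a URL is fetched only if every filter in the chain accepts it. The gatherer's actual plugin interface is not shown here, and all names in the sketch are illustrative.

    use strict;
    use warnings;

    # Conceptual sketch only -- not the gatherer's real plugin interface.
    # Each prefilter is a predicate over the URL string; a URL is fetched
    # only if every filter in the chain accepts it.
    my @prefilters = (
        sub { my ($url) = @_; $url =~ m{^(?:http|ftp)://}i },  # a Scheme-style check
        sub { my ($url) = @_; length($url) < 1024 },           # an arbitrary extra rule
    );

    sub url_allowed {
        my ($url) = @_;
        for my $filter (@prefilters) {
            return 0 unless $filter->($url);
        }
        return 1;
    }

    print url_allowed('http://www.example.org/'), "\n";           # 1
    print url_allowed('mailto:someone@example.org'), "\n";        # 0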

Configuration Options

Depth
The "depth" of a URL is the number of URLs that are traversed to get from the rootnode to it. The RootNode itself has a depth of 1, a URL linked from that object would have a depth of 2, and so on. Note that depth is relative to the Rootnode given to Harvest, not to the root of the server that the url lives on (and it has absolutely nothing to do with the number of /'s in the path ...).

Depth takes one argument: the maximum depth to index to. A depth of 1 will cause only the RootNode itself to be indexed, a depth of 2 will also include the URLs it links to, and so on.
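
As an illustration of the bookkeeping (not the gatherer's actual code), a minimal Perl sketch, assuming hypothetical names such as %depth and enqueue():

    use strict;
    use warnings;

    # Hypothetical bookkeeping sketch: depth is inherited from the referring
    # URL, not derived from the path.  %depth, @queue and enqueue() are
    # illustrative names, not part of the gatherer.
    my $max_depth = 2;       # the single argument given to the Depth filter
    my %depth;
    my @queue;

    sub enqueue {
        my ($url, $parent) = @_;
        my $d = defined $parent ? $depth{$parent} + 1 : 1;   # the RootNode has depth 1
        return if $d > $max_depth;                           # filtered out
        $depth{$url} = $d;
        push @queue, $url;
    }

    enqueue('http://www.example.org/');                      # depth 1 (the RootNode)
    enqueue('http://www.example.org/a/b/c/deep.html',
            'http://www.example.org/');                      # depth 2, despite the deep path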

HostLimit
HostLimit places a maximum limit on the number of hosts accessed for a RootNode. A "host" is defined as a unique netloc portion of the URL - that is, the combination of the scheme (http, ftp, ...), the hostname, and the port. This filter is most commonly used with a limit of 1 to restrict gathering to the same host as the RootNode.

HostLimit takes one parameter: the maximum number of hosts to index. A limit of 1 restricts gathering to the same server as the RootNode.
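
A hedged sketch of the netloc counting, using the CPAN URI module; the function and variable names are illustrative:

    use strict;
    use warnings;
    use URI;                 # CPAN module, used here only for illustration

    # Hypothetical sketch: count distinct scheme/host/port combinations and
    # reject any URL that would introduce a host beyond the limit.
    my $host_limit = 1;      # the single parameter given to HostLimit
    my %seen_netloc;

    sub host_limit_ok {
        my ($url) = @_;
        my $u      = URI->new($url);
        my $netloc = join ':', $u->scheme, $u->host, $u->port;
        return 1 if $seen_netloc{$netloc};                    # host already counted
        return 0 if keys(%seen_netloc) >= $host_limit;        # would exceed the limit
        $seen_netloc{$netloc} = 1;
        return 1;
    }

    print host_limit_ok('http://www.example.org/index.html'), "\n";    # 1 (first host)
    print host_limit_ok('http://www.example.org:80/other.html'), "\n"; # 1 (same netloc)
    print host_limit_ok('ftp://ftp.example.org/pub/'), "\n";           # 0 (a second host)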

NeverUp
This filter provides a means to keep a gatherer in a particular subsection of a tree by never allowing upwards traversal. It assumes that the tree is ordered as a conventional URL tree, with / acting as a directory separator.

NeverUp takes no parameters.
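
One plausible reading of the rule, sketched in Perl with an example RootNode; the names and URLs are illustrative only:

    use strict;
    use warnings;
    use URI;

    # Hedged sketch: a URL is rejected if it lies outside the directory
    # of the RootNode.
    my $root = URI->new('http://www.example.org/docs/manual/index.html');
    (my $root_dir = $root->path) =~ s{[^/]*$}{};   # "/docs/manual/"

    sub never_up_ok {
        my ($url) = @_;
        my $u = URI->new($url);
        return 0 unless lc($u->host) eq lc($root->host);
        return index($u->path, $root_dir) == 0 ? 1 : 0;   # must stay below $root_dir
    }

    print never_up_ok('http://www.example.org/docs/manual/ch1.html'), "\n";  # 1
    print never_up_ok('http://www.example.org/docs/index.html'), "\n";       # 0 (upwards)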

RobotsTxt
The robots.txt file is an accepted method for HTTP server administrators to advertise limits on robot activity on their sites. This module provides a means of filtering URLs so that a gatherer does not break these limits.

It currently works only for HTTP URLs (there is no accepted location for robots.txt files under other URL schemes), and it automatically allows objects from any other scheme.
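
The idea can be sketched with the standard CPAN module WWW::RobotRules; the gatherer's own implementation may differ, and the agent name and URLs below are examples:

    use strict;
    use warnings;
    use WWW::RobotRules;     # standard CPAN module; the gatherer's own code may differ
    use LWP::Simple qw(get);

    # Sketch of the idea only.  The agent name and URLs are examples.
    my $rules = WWW::RobotRules->new('ExampleGatherer/1.0');

    my $robots_url = 'http://www.example.org/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    sub robots_ok {
        my ($url) = @_;
        return 1 unless $url =~ m{^http://}i;          # other schemes pass straight through
        return $rules->allowed($url) ? 1 : 0;
    }

    print robots_ok('http://www.example.org/private/report.html'), "\n";
    print robots_ok('ftp://ftp.example.org/pub/file.txt'), "\n";          # always 1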

Scheme
This provides a simple way of limiting the schemes (i.e. http, ftp, gopher, etc.) that can be gathered. An alternative would be to use the URLregex module with suitable regular expressions to control the types of URL gathered.

Scheme takes a list of allowed schemes as parameters.
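
A minimal Perl illustration of the check, assuming the allowed schemes are http and ftp:

    use strict;
    use warnings;
    use URI;

    # Illustration only: a URL passes when its scheme appears in the list
    # given to the Scheme filter (here http and ftp).
    my %allowed_scheme = map { lc($_) => 1 } qw(http ftp);

    sub scheme_ok {
        my ($url) = @_;
        my $scheme = URI->new($url)->scheme or return 0;   # relative URLs have no scheme
        return $allowed_scheme{ lc $scheme } ? 1 : 0;
    }

    print scheme_ok('http://www.example.org/'), "\n";        # 1
    print scheme_ok('gopher://gopher.example.org/1/'), "\n"; # 0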

URLLimit
This limits the total number of URLs that can be fetched from a given RootNode. Note that it limits the number of URLs fetched; if some fetched URLs are not summarised, or are blocked by the SOIF filters, the number of objects stored will be lower.

URLLimit takes one argument: the maximum number of URLs to gather.
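
A minimal sketch of the counter, with illustrative names; the limit applies to fetches, so the number of objects finally stored may be smaller:

    use strict;
    use warnings;

    my $url_limit = 500;     # the single argument given to URLLimit
    my $fetched   = 0;

    sub url_limit_ok {
        return 0 if $fetched >= $url_limit;   # limit reached: no more fetches
        $fetched++;
        return 1;
    }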

URLregex
This allows the set of URLs to be indexed to be controlled by a file containing a list of regular expressions that allow or deny URLs. Matching uses Perl regular expressions directly, and is performed on the entire location portion of the URL, excluding any query parameters that may be passed in.

The action of the first rule that matches the URL is taken, so list the most specific rules first.

Configuration is passed as a list of field/regex pairs. The fields take the following form:

Default (Allow|Deny)
Set the default action, taken when no rule matches, to either allow or deny.
Allow regex
Any URL matching this regular expression will be allowed.
Deny regex
Any URL matching this regular expression will be denied.

Note that each regex is a Perl regular expression - the usual requirements for escaping metacharacters apply. See man perlre on your system for all the gory details.
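
For illustration only (the hostname and spacing are examples, and the exact file syntax may vary between installations), a small rule file using the fields described above might look like this:

    Default Deny
    Deny    \.(gif|jpe?g|png)$
    Allow   ^http://www\.example\.org/docs/

Because the first matching rule wins, image URLs under /docs/ are denied, other URLs under /docs/ are allowed, and everything else falls through to the default.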

Visited
Visited maintains a list of URLs already checked, to avoid processing the same URL repeatedly. It should be placed early in the Prefilters list to avoid unnecessary filter processing.
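
A sketch of the idea in Perl; %visited and not_yet_visited() are illustrative names:

    use strict;
    use warnings;

    # Remember every URL already seen and reject repeats.
    my %visited;

    sub not_yet_visited {
        my ($url) = @_;
        return 0 if $visited{$url}++;   # seen before: filter it out
        return 1;                       # first sighting: let it through
    }

    print not_yet_visited('http://www.example.org/'), "\n";   # 1
    print not_yet_visited('http://www.example.org/'), "\n";   # 0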

HostList
HostList restricts gathering to a list of hosts passed to it as arguments. That is, only URLs whose host portion exactly matches one of the given strings may be gathered. If more precise restrictions are required, try URLregex.

HostList takes a list of parameters: the URL netloc portions to allow, of the form scheme://host.
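
An illustrative check against a fixed host list, using the CPAN URI module; the hosts and names are examples only:

    use strict;
    use warnings;
    use URI;

    # The list entries follow the scheme://host form described above.
    my @host_list = ('http://www.example.org', 'http://docs.example.org');
    my %allowed   = map { lc($_) => 1 } @host_list;

    sub host_list_ok {
        my ($url) = @_;
        my $u = URI->new($url);
        return $allowed{ lc( $u->scheme . '://' . $u->host ) } ? 1 : 0;
    }

    print host_list_ok('http://www.example.org/index.html'), "\n";    # 1
    print host_list_ok('http://other.example.org/index.html'), "\n";  # 0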