This set of filters is run before a URL is fetched, to determine whether the
given URL is allowed by the current configuration. The filters can, in
principle, act on any information that can be obtained from the URL itself,
but they have no knowledge of document content or type.
Additional filters can easily be implemented in Perl.
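The filter interface itself is not shown in this section, so the sketches
that follow model each filter as a plain Perl predicate: a subroutine that
takes a URL and returns true to allow the fetch or false to block it. That
shape, and all names in the sketches, are assumptions for illustration, not
the actual Harvest API. A minimal sketch of how such predicates would combine:

    use strict;
    use warnings;

    # A URL is fetched only if every prefilter in the chain allows it.
    # (Assumed shape for illustration; not the real filter interface.)
    sub url_allowed {
        my ($url, @filters) = @_;
        for my $filter (@filters) {
            return 0 unless $filter->($url);
        }
        return 1;
    }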
- Depth
The "depth" of a URL is the number of URLs traversed to reach it from the
RootNode. The RootNode itself has a depth of 1, a URL linked from it has a
depth of 2, and so on. Note that depth is relative to the RootNode given to
Harvest, not to the root of the server the URL lives on (and it has
absolutely nothing to do with the number of /'s in the path).
Depth takes 1 argument, the maximum depth to index to. A depth of 1 will
cause only the RootNode itself to be indexed, a depth of 2 adds the pages it
links to, and so on.
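A hedged illustration of the depth rule (the subroutine name and the way
depth is passed in are assumptions, not the filter's real interface):

    use strict;
    use warnings;

    my $max_depth = 2;    # hypothetical configured limit

    # $depth is 1 for the RootNode itself, 2 for pages it links to, etc.
    sub allow_by_depth {
        my ($url, $depth) = @_;
        return $depth <= $max_depth;
    }

    print allow_by_depth('http://example.org/',       1) ? "fetch\n" : "skip\n";  # fetch
    print allow_by_depth('http://example.org/a.html', 2) ? "fetch\n" : "skip\n";  # fetch
    print allow_by_depth('http://example.org/b.html', 3) ? "fetch\n" : "skip\n";  # skip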
- HostLimit
HostLimit places a maximum limit on the number of hosts accessed for a
RootNode. A "host" is defined as a unique netloc portion of the URL, that is,
the combination of the scheme (http, ftp, ...), the hostname, and the port.
HostLimit takes 1 parameter, the maximum number of hosts to index. It is most
commonly used with a limit of 1, which restricts gathering to the same server
as the RootNode.
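A sketch of the netloc bookkeeping this implies, using the standard URI
module (the subroutine and variable names are illustrative assumptions):

    use strict;
    use warnings;
    use URI;

    my $host_limit = 1;    # hypothetical configured limit
    my %seen_hosts;

    sub allow_by_hostlimit {
        my ($url) = @_;
        my $uri = URI->new($url);    # assumes a server-style URL (http, ftp, ...)
        # The netloc is the scheme + hostname + port combination.
        my $netloc = sprintf '%s://%s:%s', $uri->scheme, $uri->host, $uri->port;
        $seen_hosts{$netloc} = 1
            if !exists $seen_hosts{$netloc} && keys(%seen_hosts) < $host_limit;
        return exists $seen_hosts{$netloc};
    }

With a limit of 1, the first netloc seen (normally the RootNode's own) claims
the single slot, so URLs on every other host are rejected.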
- NeverUp
This filter provides a means of keeping a gatherer within a particular
subsection of a tree by never allowing upwards traversal. It assumes that the
tree is ordered as a conventional URL tree, with / acting as a directory
separator.
NeverUp takes no parameters.
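A sketch of the path test this describes (the RootNode URL and subroutine
name are illustrative; the sketch checks only the path, leaving host
restrictions to other filters):

    use strict;
    use warnings;
    use URI;

    my $root = URI->new('http://example.org/docs/manual/index.html');  # hypothetical RootNode
    (my $root_dir = $root->path) =~ s{[^/]*$}{};    # directory part: /docs/manual/

    # Allow only URLs at or below the RootNode's directory.
    sub allow_by_neverup {
        my ($url) = @_;
        return index(URI->new($url)->path, $root_dir) == 0;
    }

    # /docs/manual/ch1.html passes; /docs/other.html and / do not.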
- RobotsTxt
The robots.txt file is an accepted method for HTTP server administrators to
advertise limits on robot activity on their site. This module provides a
means of filtering URLs so that a gatherer does not break these limits.
It currently only works for HTTP URLs (there is no accepted location for
robots.txt files under other URL schemes), and will automatically allow
objects from any other scheme.
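The standard WWW::RobotRules module (part of libwww-perl) implements the
robots.txt rules themselves; a sketch of how a filter might use it, assuming
the robots.txt content has already been fetched once per host:

    use strict;
    use warnings;
    use URI;
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('harvest');    # hypothetical User-Agent name

    # In a real gatherer this content is fetched from each host once.
    my $robots_content = "User-agent: *\nDisallow: /private/\n";
    $rules->parse('http://example.org/robots.txt', $robots_content);

    sub allow_by_robots {
        my ($url) = @_;
        return 1 unless URI->new($url)->scheme eq 'http';  # other schemes pass
        return $rules->allowed($url);
    }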
- Scheme
This provides a simple way of limiting the schemes (i.e. http, ftp, gopher,
etc.) that can be gathered. The same effect could be achieved with the
URLregex module and suitable regular expressions to control the types of URL
gathered.
Scheme takes a list of allowed schemes as parameters.
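A sketch of the scheme test (the allowed list here is an arbitrary example):

    use strict;
    use warnings;
    use URI;

    # Hypothetical parameter list of allowed schemes.
    my %allowed_scheme = map { $_ => 1 } qw(http ftp);

    sub allow_by_scheme {
        my ($url) = @_;
        return exists $allowed_scheme{ URI->new($url)->scheme };
    }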
- URLLimit
This limits the total number of URLs that can be fetched from a given
RootNode. Note that this limits the total number of URLs fetched; if some
fetched URLs aren't summarised, or are blocked by the SOIF filters, the
number stored will be smaller.
URLLimit takes 1 argument, the maximum number of URLs to gather.
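A sketch of the counting this implies (names and the limit are illustrative):

    use strict;
    use warnings;

    my $url_limit = 500;    # hypothetical maximum
    my $fetched   = 0;

    # Counts every URL allowed through, whether or not it is later
    # summarised or blocked by the SOIF filters.
    sub allow_by_urllimit {
        return ++$fetched <= $url_limit;
    }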
- URLregex
This allows the URLs to be indexed to be specified by a file giving a list
of regular expressions to allow or deny. Matching is done using Perl regular
expressions directly, and is performed on the entire location portion of the
URL, excluding any query parameters that may be passed in.
The action of the first rule that matches the URL is taken, so list your
rules with the most specific first (see the sketch after this list).
Configuration is passed as a list of field/regex pairs. The fields take the
following form:
- Default (Allow|Deny)
  Set the default action to be either allow or deny.
- Allow regex
  Any URLs matching this regular expression will be allowed for parsing.
- Deny regex
  Any URLs matching this regular expression will be denied.
Note that the regex is a Perl regular expression; the usual requirements
for escaping metacharacters apply. See man perlre on your system
for all the gory details.
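A sketch of the first-match-wins evaluation described above, with a
hypothetical rule list (the real filter reads its rules from a file):

    use strict;
    use warnings;
    use URI;

    # Most specific rules first: the first matching rule decides.
    my @rules = (
        [ Allow => qr{^http://example\.org/docs/} ],
        [ Deny  => qr{\.pdf$}                     ],
    );
    my $default = 'Allow';    # the configured Default action

    sub allow_by_urlregex {
        my ($url) = @_;
        (my $loc = URI->new($url)->as_string) =~ s/\?.*$//;  # drop query parameters
        for my $rule (@rules) {
            my ($action, $re) = @$rule;
            return $action eq 'Allow' if $loc =~ $re;
        }
        return $default eq 'Allow';
    }

Under these rules http://example.org/docs/guide.pdf is allowed, because the
more specific Allow rule matches before the Deny rule is consulted.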
- Visited
Visited maintains a list of URLs already checked, to avoid repeat
processing. It should be included early in the Prefilters list to avoid
unnecessary filter processing.
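A sketch of the duplicate check (a hash keyed on the URL string; names are
illustrative):

    use strict;
    use warnings;

    my %visited;

    # True the first time a URL is seen, false on every repeat, so the
    # more expensive filters after this one never run twice per URL.
    sub allow_by_visited {
        my ($url) = @_;
        return !$visited{$url}++;
    }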
- HostList
HostList restricts the gathering process to a list of hosts passed to it
as arguments; only URLs whose netloc exactly matches one of the given
strings may be gathered. If more precise restrictions are required,
try URLregex.
HostList takes a list of parameters: URL netloc portions
(of the form scheme://host).
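A sketch of the netloc comparison (the host list here is an arbitrary
example):

    use strict;
    use warnings;
    use URI;

    # Hypothetical parameter list of scheme://host netlocs.
    my %allowed_host = map { $_ => 1 } qw(
        http://www.example.org
        ftp://ftp.example.org
    );

    sub allow_by_hostlist {
        my ($url) = @_;
        my $uri = URI->new($url);    # assumes a server-style URL
        return exists $allowed_host{ $uri->scheme . '://' . $uri->host };
    }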