RootNode, Postfilters configuration options

Postfilters are run after objects are fetched and summarised, to determine whether they should be stored.

Description

This set of filters is run after the object has been fetched from the server and summarised, and only if both steps have succeeded. A filter can either amend the object summary or decide not to store the object at all, based on the information contained in the object summary and any management information generated in the fetching process. Note that the ordering of the filters is important: the first filter in the list is run first, then the second, and so on. If a filter rejects an object outright, no later filter will process it. Additional filters can be implemented in Perl.
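The ordering and rejection rules above can be sketched as follows. This is an illustration in Python rather than the Perl the filters are actually written in; the function names, the dictionary representation of a summary, and the convention of returning None to reject are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Optional

# Assumption: a summary is modelled as a dict of attribute name -> value.
Summary = Dict[str, str]
# Assumption: a filter returns the (possibly amended) summary, or None to reject.
Postfilter = Callable[[Summary], Optional[Summary]]


def run_postfilters(summary: Summary, filters: List[Postfilter]) -> Optional[Summary]:
    """Run the filters in list order and stop at the first rejection."""
    current: Optional[Summary] = summary
    for postfilter in filters:
        current = postfilter(current)
        if current is None:
            return None  # rejected: later filters never see the object
    return current


# Two hypothetical filters, used only to exercise the chain.
def reject_empty(summary: Summary) -> Optional[Summary]:
    return summary if summary.get("Full-Text") else None


def tag_language(summary: Summary) -> Optional[Summary]:
    return {**summary, "Language": "en"}


kept = run_postfilters({"Full-Text": "hello"}, [reject_empty, tag_language])
dropped = run_postfilters({}, [reject_empty, tag_language])
```

Here kept carries the amendment made by the second filter, while dropped is None because the first filter rejected the object and the second was never called.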

Configuration Options

CacheControl
The Cache-Control HTTP/1.1 header provides the server with a means of indicating that a document is intended for the recipient only. This module checks for that header in the data returned, and acts upon it accordingly.

If no parameters are passed to the filter, it simply rejects any response that has a Cache-Control header set. Otherwise it treats the parameter as the name of the SOIF attribute in which it should store the returned header value.
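A minimal sketch of the two modes, written in Python for illustration. The function name, the headers dictionary, and the exact-case lookup of the header name are assumptions; the real filter works on the management information produced by the fetch.

```python
from typing import Dict, Optional


def cache_control_filter(
    summary: Dict[str, str],
    headers: Dict[str, str],
    attribute: Optional[str] = None,
) -> Optional[Dict[str, str]]:
    """Return the summary to keep, or None to reject the object."""
    value = headers.get("Cache-Control")
    if value is None:
        return summary  # header absent: the object passes unchanged
    if attribute is None:
        return None  # no parameter given: reject the object
    kept = dict(summary)
    kept[attribute] = value  # parameter given: record the header value
    return kept
```

Called with no attribute, a response carrying the header is rejected; called with an attribute name such as the hypothetical "cache-control", the header value is stored under that name instead.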

CompatHack
Brokers from Harvest 1.6 and earlier require a number of SOIF attributes to occur in every piece of SOIF they receive. This filter provides a means of adding "standard" pieces of metadata to every resource description produced by the gathering process.

It takes as an argument a list of items of the format:

ATTRIBUTE VALUE
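As an illustration, the argument list could be applied like this. The sketch is Python, not the filter's own Perl; splitting each item on its first space and leaving already-present attributes untouched are assumptions, and the attribute names in the example are placeholders.

```python
from typing import Dict, List


def compat_hack(summary: Dict[str, str], items: List[str]) -> Dict[str, str]:
    """Add each "ATTRIBUTE VALUE" item to the summary."""
    result = dict(summary)
    for item in items:
        # Assumption: the attribute name ends at the first space.
        attribute, _, value = item.partition(" ")
        # Assumption: an attribute the summariser already set is kept as-is.
        result.setdefault(attribute, value)
    return result


padded = compat_hack(
    {"Title": "Home page"},
    ["Gatherer-Name Example Gatherer", "Title Untitled"],
)
```

In this example padded gains the Gatherer-Name attribute and keeps its original Title.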

DCDot
The Harvest schema is a bit of a mess at the moment, and should probably be refined to encapsulate Dublin Core attributes itself. When this happens this module will be redundant. Until then, this maps a number of Dublin Core attributes (encapsulated as DC. strings in the metadata) to document metadata.

The mappings are as follows:

DC.Title --> title
DC.Creator --> author
DC.Creator.Address --> address
DC.Publisher --> publisher
DC.Contributor --> author
DC.Contributor.Address --> author
DC.Subject --> keywords
DC.Date --> date
DC.Type --> type
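The table can be read as a simple lookup, sketched here in Python. The mapping is copied from the list above; holding each attribute as a list of values, and appending when several DC. names map to the same attribute, are assumptions of the sketch and not documented behaviour.

```python
from typing import Dict, List

# Copied from the mapping table above.
DC_MAP = {
    "DC.Title": "title",
    "DC.Creator": "author",
    "DC.Creator.Address": "address",
    "DC.Publisher": "publisher",
    "DC.Contributor": "author",
    "DC.Contributor.Address": "author",
    "DC.Subject": "keywords",
    "DC.Date": "date",
    "DC.Type": "type",
}


def dcdot(metadata: Dict[str, str], summary: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Copy Dublin Core metadata into the mapped summary attributes."""
    result = {name: list(values) for name, values in summary.items()}
    for dc_name, target in DC_MAP.items():
        if dc_name in metadata:
            # Assumption: values accumulate rather than overwrite.
            result.setdefault(target, []).append(metadata[dc_name])
    return result


mapped = dcdot({"DC.Creator": "A. Author", "DC.Contributor": "B. Helper"}, {})
```

Both names land under author, in the order they appear in the table.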

Description
The Harvest schema uses a "Description" attribute to contain a short description of the object. A number of summarisers create this from data contained within the object; however, many do not. If the attribute does not exist, this filter creates it from the first few characters of the Full-Text attribute (or, if that does not exist, the Partial-Text attribute).

It takes one parameter, a number, which is the number of characters to extract to form the description. This defaults to 240 characters, which is approximately three lines of text.
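The fallback order and the length parameter can be sketched as below. This is a Python illustration; the function name and the dictionary form of the summary are assumptions, and an empty text attribute is treated here the same as a missing one.

```python
from typing import Dict


def description_filter(summary: Dict[str, str], length: int = 240) -> Dict[str, str]:
    """Create Description from Full-Text, else Partial-Text, if it is missing."""
    if "Description" in summary:
        return summary  # a summariser already supplied one
    # Assumption: an empty Full-Text falls through to Partial-Text.
    text = summary.get("Full-Text") or summary.get("Partial-Text")
    if not text:
        return summary  # nothing to build a description from
    result = dict(summary)
    result["Description"] = text[:length]
    return result


short = description_filter({"Full-Text": "A long page body"}, 6)
```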

MD5Visited
This filter maintains a list of MD5 hashes of objects already processed to prevent objects with duplicate contents from entering the database. It relies on a summariser to create the MD5 hash in the first place, and objects without MD5 hashes are passed through.
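A minimal sketch of the duplicate check, in Python for illustration. The class name, the "MD5" attribute name, and the in-memory set are assumptions; how the real filter stores its list between runs is not described here.

```python
import hashlib
from typing import Dict, Optional, Set


class MD5Visited:
    """Reject objects whose MD5 hash has already been seen."""

    def __init__(self) -> None:
        self.seen: Set[str] = set()  # assumption: held in memory only

    def __call__(self, summary: Dict[str, str]) -> Optional[Dict[str, str]]:
        digest = summary.get("MD5")
        if digest is None:
            return summary  # no hash from the summariser: pass through
        if digest in self.seen:
            return None  # duplicate contents: reject
        self.seen.add(digest)
        return summary


visited = MD5Visited()
# Stand-in for the hash a summariser would have computed.
page = {"MD5": hashlib.md5(b"same contents").hexdigest()}
first = visited(page)
second = visited(dict(page))
```

The first object is kept and the second, with identical contents, is rejected.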

Prune
Summarisers can often provide more information than is really wanted, or more than there is disk space for. The Prune filter will filter the summary information and either:
  • Remove all attributes except those in a specific list
  • Remove all attributes in a specific list

The filter takes a string followed by a list of attributes. The string is either "include" or "remove", selecting which of the above two modes of operation is applied to the given list.
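The two modes can be sketched as follows, in Python for illustration. The function name and the error raised for an unknown mode are assumptions.

```python
from typing import Dict, List


def prune(summary: Dict[str, str], mode: str, attributes: List[str]) -> Dict[str, str]:
    """Keep only the listed attributes ("include") or drop them ("remove")."""
    names = set(attributes)
    if mode == "include":
        return {k: v for k, v in summary.items() if k in names}
    if mode == "remove":
        return {k: v for k, v in summary.items() if k not in names}
    # Assumption: any other mode string is treated as a configuration error.
    raise ValueError("mode must be 'include' or 'remove'")


summary = {"title": "Home", "Full-Text": "a very long body", "author": "A. Author"}
only_title = prune(summary, "include", ["title"])
no_body = prune(summary, "remove", ["Full-Text"])
```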

RobotsMeta
The Robots META tag protocol provides a means of allowing per-page control of robot spidering on HTML pages. This filter implements support for this control method, provided that the summarisers pass through the relevant information.
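As a rough illustration of the indexing side of that protocol, the sketch below rejects a page whose robots directives include noindex or none. It is Python, not the filter's own code; the "robots" attribute name is a hypothetical stand-in for whatever the summariser passes through, and handling of other directives such as nofollow is left out.

```python
from typing import Dict, Optional


def robots_meta_filter(
    summary: Dict[str, str], attribute: str = "robots"
) -> Optional[Dict[str, str]]:
    """Reject the object if its Robots META content forbids indexing."""
    content = summary.get(attribute, "")
    # Directives are comma separated and case-insensitive.
    directives = {part.strip().lower() for part in content.split(",")}
    if "noindex" in directives or "none" in directives:
        return None
    return summary
```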

Timing
This filter adds timing information to the management data of an object. It timestamps the object with the current time, and stores the number of seconds that the object is valid for (its time to live, which defaults to 28 days). The time to live is used when expiring old objects (that is, those which are no longer found during a run) from the database, and plays no part in the gathering process.

The filter can take one argument: the time to live, in seconds, to store for all objects.
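For reference, 28 days is 28 x 24 x 60 x 60 = 2419200 seconds. The sketch below, in Python for illustration, shows the stamping step and how an expiry check might use the stored values; the key names "timestamp" and "ttl" and the is_expired helper are assumptions, not the names the filter actually uses.

```python
import time
from typing import Dict, Optional

DEFAULT_TTL = 28 * 24 * 60 * 60  # 28 days in seconds


def timing_filter(
    management: Dict[str, int], ttl: int = DEFAULT_TTL, now: Optional[int] = None
) -> Dict[str, int]:
    """Stamp the management data with the current time and a time to live."""
    stamped = dict(management)
    stamped["timestamp"] = int(time.time()) if now is None else now
    stamped["ttl"] = ttl
    return stamped


def is_expired(management: Dict[str, int], now: int) -> bool:
    """Hypothetical expiry test: true once the time to live has elapsed."""
    return now > management["timestamp"] + management["ttl"]


stamped = timing_filter({}, now=1_000_000)
```

The expiry test is meant to be applied only to objects that were not found again during a run; it plays no part in gathering itself.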