|
This set of filters is run after the object has been fetched from the server
and summarised. They will only be run if this has occurred successfully.
They can either amend the object summary, or decide not to store the
object at all, based on the information contained in the object
summary and any management information generated in the fetching
process.
Note that the ordering of the filters is important: the first filter in the
list is run first, and so on. If a filter rejects an object outright,
then no other filters will process it.
Additional filters can be easily implemented in Perl.
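The ordering and rejection rules can be illustrated with a short sketch. It is
written in Python purely for illustration (real filters are Perl modules), and
it assumes a summary is a plain mapping and that a filter signals rejection by
returning None. The names run_filters, add_flag and reject_untitled are
hypothetical and not part of the software.

```python
def run_filters(summary, filters):
    """Apply filters in list order; stop at the first rejection."""
    for current_filter in filters:
        summary = current_filter(summary)
        if summary is None:
            # Rejected outright: no later filter sees the object.
            return None
    return summary


def add_flag(summary):
    """Hypothetical amending filter: returns an amended copy."""
    return {**summary, "seen": "yes"}


def reject_untitled(summary):
    """Hypothetical rejecting filter: drops summaries with no title."""
    return summary if summary.get("title") else None
```

With these definitions, run_filters({"title": "x"}, [add_flag, reject_untitled])
returns the amended summary, while a summary with no title is rejected and any
filters listed after reject_untitled are never called.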
|
|
- CacheControl
-
The Cache-Control HTTP/1.1 header provides the server with a means of
indicating that a document is intended for the recipient only. This module
checks for that header in the data returned, and acts upon it accordingly.
If no parameters are passed to the filter then it simply rejects any response
that has a Cache-Control header set. Otherwise it treats the parameter as
the name of the SOIF attribute in which it should store the returned value.
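The two modes described above might be sketched as follows. This is an
illustrative Python sketch, not the module's code: the function name, the use
of plain mappings for headers and summary, and None as the rejection signal
are all assumptions.

```python
def cache_control_filter(headers, summary, attribute=None):
    """Reject, or record, responses that carry a Cache-Control header."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    value = lowered.get("cache-control")
    if value is None:
        return summary            # header absent: pass through unchanged
    if attribute is None:
        return None               # no parameter given: reject the object
    amended = dict(summary)
    amended[attribute] = value    # parameter given: store the header value
    return amended
```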
- CompatHack
-
Brokers from Harvest 1.6 and earlier require a number of SOIF attributes to
occur in every piece of SOIF they receive. This filter provides a means
of adding "standard" pieces of metadata to every resource description
produced by the gathering process.
It takes as an argument a list of items of the format
ATTRIBUTE VALUE
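As a rough illustration of the argument format, the Python sketch below splits
each item on its first space and adds the resulting attribute to the summary.
The function name and the example attribute names are hypothetical, and the
choice to leave an attribute alone when the summariser has already set it is
an assumption rather than documented behaviour.

```python
def compat_hack(summary, items):
    """Add each 'ATTRIBUTE VALUE' item to a copy of the summary."""
    amended = dict(summary)
    for item in items:
        # Everything after the first space is the value.
        attribute, _, value = item.partition(" ")
        # Assumption: values already present in the summary are kept.
        amended.setdefault(attribute, value.strip())
    return amended
```

For example, compat_hack({"Type": "HTML"}, ["Type Unknown", "Gatherer-Name My Gatherer"])
keeps the existing Type and adds Gatherer-Name with the value "My Gatherer".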
- DCDot
-
The Harvest schema is a bit of a mess at the moment, and should probably
be refined to encapsulate Dublin Core attributes itself. When this happens,
this module will be redundant. Until then, it maps a number of Dublin
Core attributes (encapsulated as DC. strings in the metadata) to document
metadata.
The mappings are as follows:
DC.Title --> title
DC.Creator --> author
DC.Creator.Address --> address
DC.Publisher --> publisher
DC.Contributor --> author
DC.Contributor.Address --> author
DC.Subject --> keywords
DC.Date --> date
DC.Type --> type
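The table above amounts to a simple lookup. The Python sketch below copies the
mappings exactly as listed. The function name is hypothetical, and collecting
values into a list when several DC attributes share one target (for example
DC.Creator and DC.Contributor) is an assumption, since the text does not say
how collisions are handled.

```python
# Mappings copied verbatim from the list above.
DC_MAP = {
    "DC.Title": "title",
    "DC.Creator": "author",
    "DC.Creator.Address": "address",
    "DC.Publisher": "publisher",
    "DC.Contributor": "author",
    "DC.Contributor.Address": "author",
    "DC.Subject": "keywords",
    "DC.Date": "date",
    "DC.Type": "type",
}


def dcdot(metadata):
    """Return the document metadata derived from DC.* entries."""
    mapped = {}
    for name, value in metadata.items():
        target = DC_MAP.get(name)
        if target is not None:
            # Assumption: values sharing a target accumulate in order.
            mapped.setdefault(target, []).append(value)
    return mapped
```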
- Description
-
The Harvest schema uses a "Description" attribute to contain a short
description of the object. A number of summarisers create this from data
contained within the object, but many do not. This filter creates this
attribute, if one does not exist, from the first few characters of the
Full-Text (or if that does not exist, the Partial-Text) attribute.
It takes one parameter, a number, which is the number of characters to
extract to form the description. This defaults to being 240 characters,
which is approximately 3 lines of text.
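A minimal Python sketch of this fallback, assuming the summary is a plain
mapping keyed by attribute name. The function name is hypothetical, and an
empty Full-Text is treated the same as a missing one, which is an assumption.

```python
def description_filter(summary, length=240):
    """Fill in Description from the start of the text if it is missing."""
    if "Description" in summary:
        return summary                      # summariser already supplied one
    text = summary.get("Full-Text") or summary.get("Partial-Text")
    if not text:
        return summary                      # nothing to build it from
    amended = dict(summary)
    amended["Description"] = text[:length]  # first `length` characters
    return amended
```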
- MD5Visited
-
This filter maintains a list of MD5 hashes of objects already processed to
prevent objects with duplicate contents from entering the database. It
relies on a summariser to create the MD5 hash in the first place, and
objects without MD5 hashes are passed through.
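The duplicate check could look something like the Python sketch below. The
attribute name "MD5", the in-memory set and the class name are assumptions
made for illustration; how the real filter stores its list of hashes is not
described here.

```python
class MD5Visited:
    """Reject summaries whose MD5 hash has been seen before."""

    def __init__(self):
        self.seen = set()

    def __call__(self, summary):
        digest = summary.get("MD5")   # hypothetical attribute name
        if digest is None:
            return summary            # no hash: passed through
        if digest in self.seen:
            return None               # duplicate contents: reject
        self.seen.add(digest)
        return summary
```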
- Prune
-
Summarisers can often provide more information than is really wanted, or
more than there is disk space for. The Prune filter will filter the
summary information and either
- Remove all attributes except those in a specific list
- Remove all attributes in a specific list
The filter takes a string followed by a list of attributes. The string
is either "include" or "remove", and selects which of the above two modes
of operation is applied to the given list.
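The two modes can be sketched in Python as below, assuming the summary is a
plain mapping. The function name is hypothetical, and raising an error for any
other mode string is an assumption.

```python
def prune(summary, mode, attributes):
    """Keep only, or drop, the listed attributes depending on mode."""
    names = set(attributes)
    if mode == "include":
        # Remove all attributes except those in the list.
        return {k: v for k, v in summary.items() if k in names}
    if mode == "remove":
        # Remove all attributes in the list.
        return {k: v for k, v in summary.items() if k not in names}
    raise ValueError("mode must be 'include' or 'remove'")
```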
- RobotsMeta
-
The Robots META tag protocol provides a means of per-page control of robot
spidering on HTML pages. This filter implements support for this control
method, provided that the summarisers pass through the relevant information.
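As a hedged illustration, the Python sketch below rejects an object when the
content of its robots META tag contains the noindex (or none) directive. It
assumes the summariser passes the tag's content through in an attribute named
"robots"; that attribute name and the function name are hypothetical, and the
real filter may handle further directives.

```python
def robots_meta_filter(summary, attribute="robots"):
    """Reject objects whose robots META content forbids indexing."""
    content = summary.get(attribute, "")
    # Directives are comma-separated and case-insensitive.
    directives = {part.strip().lower() for part in content.split(",")}
    if directives & {"noindex", "none"}:
        return None
    return summary
```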
- Timing
-
This filter adds timing information to the management data of an object.
It timestamps the object with the current time, and stores the number of
seconds that the object is valid for (its time to live, which defaults
to 28 days). The time to live is used when expiring old objects (that is,
those which are no longer found during a run) from the database, and plays
no part in the gathering process.
The filter can take one argument, the time to live in seconds to store for
all objects.
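The arithmetic behind the default and the later expiry check can be sketched
in Python as follows. The field names "timestamp" and "ttl", the function
names and the optional now argument are assumptions made for illustration.

```python
import time

DEFAULT_TTL = 28 * 24 * 60 * 60   # 28 days in seconds = 2419200


def timing_filter(management, ttl=DEFAULT_TTL, now=None):
    """Stamp the management data with the current time and a time to live."""
    stamped = dict(management)
    stamped["timestamp"] = int(time.time() if now is None else now)
    stamped["ttl"] = ttl
    return stamped


def is_expired(management, now):
    """Expiry check of the kind used when clearing out old objects."""
    return now > management["timestamp"] + management["ttl"]
```

An object stamped at time 1000 with the default time to live would therefore
be treated as expired by this check once the clock passes 1000 + 2419200.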
|