Summarisers perform the donkey work of the reaping process. They take a
file as fetched from the network, and convert it into an attribute/value
content summary.
Summarisers are keyed on the MIME type of the file collected, and
are specified as a list of entries of the form
    mime-type Summariser Summariser arguments
In addition, a number of special types are supported. The FILTER type is
used to support requests for files by filters such as the RobotsTxt one.
The ALWAYS type indicates summarisers that will always be run, in addition
to the one selected by type. HTTP, NEWS, FTP and HTTPS can be used to
create summarisers that are run (again in addition to the selected one)
according to the protocol used to fetch the file.
Finally, partial mime-types (such as text) may be used to summarise all
items with that main type.
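The selection rules above can be sketched roughly as follows. This is an illustrative Python model rather than Harvest-NG's own code: the function name `select_summarisers`, the dictionary layout, the rule that an exact MIME type wins over a partial one, and the order of the returned list are all assumptions.

```python
def select_summarisers(table, mime_type, protocol):
    """Return the names of the summarisers to run for one fetched file.

    table maps a type key (full MIME type, main type, protocol name
    or ALWAYS) to a list of summariser names. Hypothetical layout.
    """
    main_type = mime_type.split("/", 1)[0]
    selected = []
    # The summariser selected by type: exact MIME type first (assumed),
    # otherwise a partial type such as "text".
    if mime_type in table:
        selected.extend(table[mime_type])
    elif main_type in table:
        selected.extend(table[main_type])
    # Protocol-level summarisers run in addition to the selected one.
    selected.extend(table.get(protocol.upper(), []))
    # ALWAYS summarisers run for every file.
    selected.extend(table.get("ALWAYS", []))
    return selected


# Example table built from summarisers named in this section.
table = {
    "text/html": ["HTML"],
    "text": ["Text"],
    "HTTP": ["HTTP"],
    "ALWAYS": ["MD5", "Schema"],
}
```

With this table, an HTML page fetched over HTTP would be passed to HTML, HTTP, MD5 and Schema, while a text/plain file would fall back to the partial `text` entry.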
- FTPListing
-
This is an internal summariser, necessary for handling FTP fetches. It
takes the directory listing returned by the FTP server when directories are
fetched, and converts it into a list of links to summarise further.
This summariser should be run when the text/ftp-dir-listing MIME
type is encountered.
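As a rough illustration of turning a directory listing into links, the sketch below assumes the server returns a Unix-style `ls -l` listing. Real FTP servers vary, and the function name `listing_to_links` is hypothetical rather than part of Harvest-NG.

```python
from urllib.parse import quote, urljoin


def listing_to_links(base_url, listing):
    """Turn a Unix-style 'ls -l' listing into a list of absolute URLs."""
    if not base_url.endswith("/"):
        base_url += "/"
    links = []
    for line in listing.splitlines():
        # Permissions, link count, owner, group, size, three date
        # fields, then the name (which may itself contain spaces).
        fields = line.split(None, 8)
        if len(fields) < 9:
            continue  # skips "total N" and blank lines
        name = fields[8]
        if fields[0].startswith("l"):
            name = name.split(" -> ", 1)[0]  # drop the symlink target
        if name in (".", ".."):
            continue
        if fields[0].startswith("d"):
            name += "/"  # directories are fetched as listings in turn
        links.append(urljoin(base_url, quote(name)))
    return links
```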
- Filter
-
This is another internal summariser, designed to handle fetches of files for
URLFilters (such as RobotsTxt) which require additional information before
they can approve a URL.
This summariser should be run for FILTER type objects.
- HTML
-
Summarise an HTML document.
This summariser should be run for items with a text/html type.
- HTTP
-
This is a protocol level summariser which creates management information
from information contained in the HTTP headers. In particular, it stores
information about the last modification time of the file, and so _must_ be
used if If-Modified-Since gathering is required.
The type for this should be HTTP or HTTPS.
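To show why the stored modification time matters, here is a minimal Python sketch of reading `Last-Modified` from a response and building an `If-Modified-Since` header for a later request. The function names and the plain-dictionary header representation are assumptions; this is not how Harvest-NG itself stores the value.

```python
from email.utils import format_datetime, parsedate_to_datetime


def last_modified(headers):
    """Return the Last-Modified header as a datetime, or None if absent."""
    value = headers.get("Last-Modified")
    if value is None:
        return None
    return parsedate_to_datetime(value)


def if_modified_since(stored):
    """Build the conditional request header from a stored UTC datetime."""
    return {"If-Modified-Since": format_datetime(stored, usegmt=True)}
```

A gatherer would keep the value from `last_modified` alongside the summary and send the header from `if_modified_since` on the next fetch. A server that answers 304 Not Modified is saying the object has not changed.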
- MD5
-
The MD5 summariser stores the MD5 hash of the object fetched (_not_ of the
summary information) in the object's management data, where it can then be
used by others to detect duplication or alteration of the object's contents.
MD5 hash generation can be time-consuming on larger files, so if duplicate
detection is not important, you may wish to exclude this module.
This should be run as an ALWAYS type summariser.
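The hashing step can be sketched in Python as below. Reading in fixed-size chunks keeps memory use flat on large objects, although the time cost still grows with file size. The function name and chunk size are illustrative choices, not taken from Harvest-NG.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=65536):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


# Two objects with identical contents produce identical digests,
# which is what makes duplicate detection possible.
example = md5_of_stream(io.BytesIO(b"hello"))
```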
- NewsHeaders
-
This summarises and stores the contents of headers from News requests (such
as the title and the author of the article, along with its references).
This should be run as a NEWS or NNTP summariser.
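A minimal Python sketch of extracting those fields from an article's header block follows. The attribute names in the returned dictionary are assumptions made for illustration, not the names Harvest-NG writes into its summaries.

```python
from email.parser import HeaderParser


def summarise_news_headers(raw_headers):
    """Pick the title, author and references out of article headers."""
    message = HeaderParser().parsestr(raw_headers)
    return {
        "title": message.get("Subject"),
        "author": message.get("From"),
        # References is a whitespace-separated list of message IDs.
        "references": (message.get("References") or "").split(),
    }
```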
- RPM
-
Extract information from a Red Hat Package Manager (RPM) file. At present this
will extract the title, description and packager of the file by running
the 'rpm' command, so it requires RPM to be installed on your system.
The type should be whatever you (using FileTypes) or the web server are
mapping RPM files to.
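One way to obtain these fields from the 'rpm' command is sketched below in Python. It queries a package file with `rpm -qp --queryformat`. Mapping "title" to the SUMMARY tag is an assumption, as are the function names; the command line Harvest-NG actually runs may differ.

```python
import subprocess

# rpm expands the \n escapes itself, so they are passed through literally.
# DESCRIPTION goes last because it can span several lines.
QUERY_FORMAT = "%{SUMMARY}\\n%{PACKAGER}\\n%{DESCRIPTION}"


def rpm_query_command(path):
    """Build the argument list that queries a package file."""
    return ["rpm", "-qp", "--queryformat", QUERY_FORMAT, path]


def parse_rpm_output(output):
    """Split the query output into the three fields."""
    title, packager, description = output.split("\n", 2)
    return {"title": title, "packager": packager, "description": description}


def summarise_rpm(path):
    """Run rpm on the file and return the extracted fields."""
    result = subprocess.run(
        rpm_query_command(path), capture_output=True, text=True, check=True
    )
    return parse_rpm_output(result.stdout)
```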
- RunProg
-
There are a large number of translation and summarisation programs already
available. In addition, classic Harvest ships with a large number of
summarisers. This summariser allows the construction of summariser chains
using external programs (which don't have to summarise to SOIF directly).
For instance, we can use mswordview to generate HTML from a MS-Word
document, and then use the built-in HTML summariser to summarise it.
Similarly, pdftotext can be used with the Text summariser to index PDF.
In addition, we support summarisers which directly produce SOIF (such as
the old Harvest summarisers).
A call to this summariser looks like
    Type Program, New Type
where Program is the program, complete with arguments, to run.
Substitutions are performed on this string: %infile and %outfile are
replaced with the names of the input and output files respectively.
New Type is the type of the data produced from this summariser.
If this is a "normal" MIME type the data will be chained through that
summariser (which, of course, could be another RunProg summariser).
Alternatively the strings FullSOIF or HarvestSOIF
may be used to indicate that the summariser produces either fully
qualified SOIF, or the reduced SOIF used by Harvest summarisers.
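The substitution and chaining behaviour described above can be sketched in Python as follows. The function names, the `converter` program in the usage note, and the choice to split the command into arguments before substituting are assumptions for illustration only.

```python
import shlex
import subprocess


def build_command(program, infile, outfile):
    """Substitute %infile and %outfile and return an argument list.

    The string is split first, so file names containing spaces
    remain single arguments.
    """
    return [
        arg.replace("%infile", infile).replace("%outfile", outfile)
        for arg in shlex.split(program)
    ]


def next_step(new_type):
    """Decide what happens to the program's output."""
    if new_type in ("FullSOIF", "HarvestSOIF"):
        return ("soif", new_type)  # output is already a summary
    return ("chain", new_type)  # pass on to the summariser for that type


def run_program(program, infile, outfile, new_type):
    """Run the external program, then report how to continue."""
    subprocess.run(build_command(program, infile, outfile), check=True)
    return next_step(new_type)
```

For a hypothetical entry whose Program is `converter %infile %outfile` and whose New Type is `text/html`, the output file would be handed on to whichever summariser is registered for `text/html`.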
- Schema
-
This summariser adds some additional information to the content summaries,
if that information does not already exist. If you intend to use the
summaries with applications that are more picky about their SOIF than
Harvest-NG (such as the original gatherer), then you must include this
summariser.
This should be run as an ALWAYS summariser.
- Text
-
Generate a summary of plain text (for instance, something with a text/plain
MIME type, or even with a type of text/*). This just copies the entire
contents of the text into the full-text attribute, making no attempt to
reduce or summarise the information, apart from some effort to keep the
text legible: large runs of adjacent white space are deleted and non-ASCII
characters are removed.
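The clean-up step might look something like the Python sketch below. Collapsing every run of white space to a single space is an assumption (the original may keep some line structure), and `clean_text` is a hypothetical name.

```python
import re


def clean_text(text):
    """Drop non-ASCII characters and squeeze runs of white space."""
    ascii_only = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()
```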