Summarisers configuration options

Summarisers are a set of filters which take a file and return a content summary.

Description

Summarisers perform the donkey work of the reaping process. They take a file as fetched from the network and convert it into an attribute/value content summary. Summarisers are keyed on the MIME type of the file collected, and are specified as a list of entries of the form

mime-type 	Summariser 	Summariser arguments

In addition, a number of special types are supported. The FILTER type is used to support requests for files by filters such as the RobotsTxt one. The ALWAYS type indicates summarisers that will always be run, in addition to the one selected by type. HTTP, NEWS, FTP and HTTPS can be used to create summarisers that are run (again in addition to the selected one) according to the protocol used to fetch the file. Finally, partial MIME types (such as text) may be used to summarise all items with that main type.
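
The selection rules described above can be illustrated with a short Python sketch. This is not Harvest-NG's own code: the select_summarisers function, the layout of the table, the order of the results, and the choice to prefer an exact MIME type over a partial one are all assumptions made for the example.

```python
def select_summarisers(table, mime_type, protocol):
    """Return the names of the summarisers to run for one fetched object.

    table maps a key (full MIME type, main type, protocol name, or ALWAYS)
    to a list of summariser names. This layout is hypothetical.
    """
    main_type = mime_type.split("/", 1)[0]
    # Assumption: an exact MIME type match wins over a partial (main type) match.
    if mime_type in table:
        selected = list(table[mime_type])
    else:
        selected = list(table.get(main_type, []))
    # Protocol-level and ALWAYS summarisers run in addition to the selected one.
    selected += table.get(protocol, [])
    selected += table.get("ALWAYS", [])
    return selected


# Hypothetical table, using summariser names from this document.
TABLE = {
    "text/html": ["HTML"],
    "text": ["Text"],
    "HTTP": ["HTTP"],
    "ALWAYS": ["MD5", "Schema"],
}
```

With this table, an HTML page fetched over HTTP would be passed to HTML, HTTP, MD5 and Schema, while a text/plain file fetched over FTP would fall back to the partial text entry and be passed to Text, MD5 and Schema.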

Available Summarisers

FTPListing
This is an internal summariser, necessary for handling FTP fetches. It takes the directory listing returned by the FTP server when directories are fetched, and converts it into a list of links to summarise further.

This summariser should be run when the text/ftp-dir-listing MIME type is encountered.

Filter
This is another internal summariser, designed to handle fetches of files for URLFilters (such as RobotsTxt) which require additional information before they can approve a URL.

This summariser should be run for FILTER type objects.

HTML
Summarise an HTML document.

This summariser should be run for items with a text/html type.

HTTP
This is a protocol level summariser which creates management information from information contained in the HTTP headers. In particular, it stores information about the last modification time of the file, and so _must_ be used if If-Modified-Since gathering is required.

The type for this should be HTTP or HTTPS.
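
To show why the stored modification time matters, here is a minimal Python sketch of how a later fetch could turn it into an If-Modified-Since request header. The function name and the idea that the time is held as a Unix timestamp are assumptions for illustration; the document does not say how Harvest-NG stores or uses the value internally.

```python
from email.utils import formatdate


def if_modified_since_header(last_modified_epoch):
    """Build an If-Modified-Since header from a stored modification time.

    last_modified_epoch is assumed to be seconds since the Unix epoch (UTC).
    """
    # usegmt=True produces the HTTP date form, ending in "GMT".
    return {"If-Modified-Since": formatdate(last_modified_epoch, usegmt=True)}
```

A server that honours the header can then answer with 304 Not Modified instead of resending an unchanged file.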

MD5
The MD5 summariser stores the MD5 hash of the object fetched (_not_ of the summary information) in the object's management data, where it can then be used by others to detect duplication or alteration of the object's contents. MD5 hash generation can be time consuming on larger files, so if duplicate detection is not important, you may wish to exclude this module.

This should be run as an ALWAYS type summariser.
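
As a rough illustration of the hashing step, the following Python sketch computes the MD5 hash of a fetched file in fixed-size chunks, so that large files need not be held in memory. The function name and chunk size are assumptions for the example and say nothing about how the MD5 summariser itself is written.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two objects with the same digest can be treated as duplicates, and a changed digest for the same URL indicates that the contents have been altered.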

NewsHeaders
This summarises and stores the contents of headers from News requests (such as the title and the author of the article, along with its references).

This should be run as a NEWS or NNTP summariser.

RPM
Extract information from a Red Hat Package Manager (RPM) file. At present this will extract the title, description and packager of the file by running the 'rpm' command, so it requires RPM to be installed on your system.

The type should be whatever you (using FileTypes) or the web server are mapping RPM files to.

RunProg
There are a large number of translation and summarisation programs already available. In addition, classic Harvest ships with a large number of summarisers. This summariser allows the construction of summariser chains using external programs (which don't have to summarise to SOIF directly). For instance, we can use mswordview to generate HTML from an MS-Word document, and then use the built-in HTML summariser to summarise it. Similarly, pdftotext can be used with the Text summariser to index PDF.

In addition, we support summarisers which directly produce SOIF (such as the old Harvest summarisers).

A call to this summariser looks like

Type 	Program, New Type

Where Program is the program, complete with arguments, to run. A number of substitutions will be performed on this string, specifically %infile and %outfile will be replaced with the name of the input and output files respectively.
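
The substitution step can be sketched in Python as follows. This is only an illustration under stated assumptions: the build_command function is hypothetical, and the document does not say whether Harvest-NG runs the program through a shell or as an argument list. The sketch splits the program string first and then substitutes, so file names containing spaces stay as single arguments.

```python
import shlex


def build_command(program, infile, outfile):
    """Expand %infile and %outfile in a RunProg program string.

    Returns an argument list that could be passed to subprocess.run.
    """
    # Split into arguments before substituting, so that a file name with
    # spaces does not get broken into several arguments.
    arguments = shlex.split(program)
    return [
        argument.replace("%infile", infile).replace("%outfile", outfile)
        for argument in arguments
    ]
```

For example, the program string "pdftotext %infile %outfile" with an input file of /tmp/in file.pdf and an output file of /tmp/out.txt expands to the three arguments pdftotext, /tmp/in file.pdf and /tmp/out.txt.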

New Type is the type of the data produced from this summariser. If this is a "normal" MIME type the data will be chained through that summariser (which, of course, could be another RunProg summariser). Alternatively the strings FullSOIF or HarvestSOIF may be used to indicate that the summariser produces either fully qualified SOIF, or the reduced SOIF used by Harvest summarisers.
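
The chaining behaviour described here can be pictured with the Python sketch below. It follows RunProg rules from a starting type until it reaches a type with no RunProg rule (where a built-in summariser would take over) or one of the SOIF markers. The resolve_chain function, the rule table layout, the step limit and the program names a2b and b2soif are all hypothetical.

```python
SOIF_TYPES = {"FullSOIF", "HarvestSOIF"}


def resolve_chain(runprog_rules, start_type, max_steps=10):
    """Follow RunProg rules from start_type and return (programs, final_type).

    runprog_rules maps an input type to a (program, new_type) pair.
    """
    programs = []
    current = start_type
    for _ in range(max_steps):
        # Stop when the data is already SOIF or no RunProg rule applies.
        if current in SOIF_TYPES or current not in runprog_rules:
            return programs, current
        program, current = runprog_rules[current]
        programs.append(program)
    # Assumption: a step limit guards against rules that point at each other.
    raise ValueError("RunProg chain too long; the rules may contain a loop")


# Hypothetical rules: type A is converted to type B, which is converted to SOIF.
RULES = {
    "application/x-a": ("a2b %infile %outfile", "application/x-b"),
    "application/x-b": ("b2soif %infile %outfile", "FullSOIF"),
}
```

Starting from application/x-a, the sketch returns both programs in order and a final type of FullSOIF. Starting from text/html, which has no RunProg rule here, it returns an empty program list and leaves the type as text/html.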

Schema
This summariser adds some additional information to the content summaries, if that information does not already exist. If you intend to use the summaries with applications that are more picky about their SOIF than Harvest-NG (such as the original gatherer), then you must include this summariser.

This should be run as an ALWAYS summariser.

Text
Generate a summary of plain text (for instance, something with a text/plain MIME type, or even with a type of text/*). This just copies the entire contents of the text into the full-text attribute, making no attempt to reduce or summarise the information. The only exception is some effort to ensure legibility of the text: large amounts of adjacent white space are deleted, and non-ASCII characters are removed.
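
The legibility clean-up might look something like the Python sketch below. The exact rules the Text summariser applies are not given in this document, so the clean_text function and its thresholds (collapsing any run of spaces or tabs to one space, and any run of blank lines to a single blank line) are assumptions.

```python
import re


def clean_text(text):
    """Remove non-ASCII characters and collapse large runs of white space."""
    # Drop every character that cannot be encoded as ASCII.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse runs of spaces and tabs into a single space.
    collapsed = re.sub(r"[ \t]+", " ", ascii_only)
    # Collapse two or more consecutive line breaks into one blank line.
    collapsed = re.sub(r"\n(?: ?\n)+", "\n\n", collapsed)
    return collapsed.strip()
```

The cleaned string is what would then be copied into the full-text attribute.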