Configuring reap
The real power of the reap spider is exposed when you control it through configuration files rather than from the command line. Reap configuration files are standard Harvest-NG configuration files, and obey all of the rules and structures of those files. For simplicity these rules are summarised in the configuration section below, but you're encouraged to read the detailed documentation. A reap configuration file is broadly split into a number of sections, which we will consider in turn. Because of the way the configuration file syntax works, only the generic section has to be in the file pointed to by the --config option.
Configuration files, a quick reminder
Harvest-NG configuration files consist of a set of tag-value pairs. These tag-value pairs can be written in a number of forms:
reap configuration files use this same format, with the addition that we allow 'nesting'. For a number of tags, specifically those which implement the sections described above, the value is itself another configuration structure. This leads to the example structure shown below. It's useful to remember that any tag can be represented in the ways shown above, making it straightforward to have a particular tag read its value from a file or from a helper program.
Skeletal configuration file
The above information leads us to the following skeletal configuration file, which can be fleshed out by adding the appropriate configuration tags, as detailed on the pages linked to above.

    # Skeletal configuration file for reap

    general directives

    <Database>
        Database format configuration directives
    </Database>

    <RootNode>
        Root Node general directives
        <Prefilters>
            List of prefilters
        </Prefilters>
        <Postfilters>
            List of postfilters
        </Postfilters>
    </RootNode>

    <Summarisers>
        List of summarisers
    </Summarisers>

    <FileTypes>
        List of mappings between file extensions and mimetypes
    </FileTypes>

    <Encoding>
        List of content-encodings tags and their modules
    </Encoding>
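
To make the idea of nesting concrete, here is a minimal Python sketch that reads a skeleton of this shape into nested dictionaries. It is only an illustration and is not the Harvest-NG parser. The function name `parse_sections` and the `_lines` key are hypothetical, and the sketch assumes that each opening and closing section tag sits on its own line and that lines starting with `#` are comments, as in the skeleton.

```python
import re

# Hypothetical helper for illustration only; this is NOT how Harvest-NG
# itself reads its configuration files.
SECTION_OPEN = re.compile(r"^<([A-Za-z]\w*)>$")
SECTION_CLOSE = re.compile(r"^</([A-Za-z]\w*)>$")


def parse_sections(text):
    """Parse <Tag> ... </Tag> blocks into nested dicts.

    Ordinary lines are collected, unparsed, under the "_lines" key of the
    section that encloses them.
    """
    root = {"_lines": []}
    stack = [("", root)]  # (section name, section dict); root has no name
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        opened = SECTION_OPEN.match(line)
        closed = SECTION_CLOSE.match(line)
        if opened:
            child = {"_lines": []}
            stack[-1][1][opened.group(1)] = child
            stack.append((opened.group(1), child))
        elif closed:
            name, _ = stack.pop()
            if name != closed.group(1):
                raise ValueError(f"mismatched close tag: {line}")
        else:
            stack[-1][1]["_lines"].append(line)
    if len(stack) != 1:
        raise ValueError(f"unclosed section: {stack[-1][0]}")
    return root


skeleton = """
# Skeletal configuration file for reap
<Database>
</Database>
<RootNode>
  <Prefilters>
  </Prefilters>
  <Postfilters>
  </Postfilters>
</RootNode>
"""

config = parse_sections(skeleton)
# List the sections nested inside RootNode.
print(sorted(k for k in config["RootNode"] if k != "_lines"))
# prints ['Postfilters', 'Prefilters']
```

The stack mirrors the nesting: each opening tag pushes a new dictionary, and each closing tag pops back to the enclosing section, so Prefilters and Postfilters end up inside RootNode rather than beside it.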