Configuring reap

The real power of the reap spider is exposed when you start controlling it using configuration files, rather than running it from the command line.

Reap configuration files are standard Harvest-NG configuration files, and obey all of the rules and structures of these files. For simplicity these rules are summarised in the quick reminder below, but you're encouraged to read the detailed documentation.

A reap configuration file is broadly split into a number of sections, which we will consider in turn. The configuration file syntax means that only the generic section has to be in the file pointed to by the --config option.

  • Generic options - These are settings that affect the entire gathering process, or that couldn't easily be placed in another file.
  • Database options - These control the setup of the backend database that reap uses to store gathered data.
  • Rootnode options - These control the Rootnode settings, which define a starting point for a gathering process.
  • Summariser mappings - This section maps content-types returned by the server to summariser modules, in order to create content summaries from the gathered information.
  • Filetype mappings - This section maps file extensions (such as .txt) to content-types, for use when the server does not return a valid content-type, or when indexing from non-HTTP servers.
  • Encoding mappings - This section maps content encodings (such as gzip) to modules which can decompress them for summarising.

Configuration files, a quick reminder

Harvest-NG configuration files consist of a set of tag/value pairs. Each pair can be written in any of the following forms:

  • Tag Value
  • Tag: Value
  • <Tag>
    Value
    </Tag>
  • <Tag File-containing-value>
  • <Tag |Program-producing-value>
Note the 'pipe' symbol (|) in the last example; some character sets seem to omit it.
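
As a concrete illustration, the first three forms might look like this for a single tag. The tag name Example-Tag and its value are hypothetical, chosen only to show the shapes; the three forms are alternatives and would not normally all appear in one file.

```
# Hypothetical tag, written in the plain form
Example-Tag some-value

# The same tag, written with a colon
Example-Tag: some-value

# The same tag, written in the block form
<Example-Tag>
some-value
</Example-Tag>
```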

Reap configuration files use this syntax, with the addition that we allow 'nesting'. For a number of tags, specifically those which implement the sections described above, the 'Value' is itself another configuration structure. This leads to the example structure shown below.

It's useful to remember that any tag can be written in any of the forms shown above, making it straightforward to have a particular tag read its value from a file, or from a helper program.
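
As a sketch of what this allows, the file passed to --config might hold only the general directives and pull every other section in from elsewhere. The section tags are the ones described above; the file names (database.cf, rootnode.cf and so on) and the helper program list-filetypes are hypothetical, chosen purely for illustration.

```
# Hypothetical top-level file, passed to reap with --config
general directives
<Database database.cf>
<RootNode rootnode.cf>
<Summarisers summarisers.cf>
# Hypothetical helper program that prints the mappings
<FileTypes |list-filetypes>
<Encoding encoding.cf>
```

Under this arrangement, each named file would contain only the directives for its own section.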

Skeletal configuration file

The above information leads us to the following skeletal configuration file, which can be fleshed out by adding the appropriate configuration tags, as detailed on the pages linked to above.

# Skeletal configuration file for reap
General directives
<Database>
        Database format configuration directives
</Database>
<RootNode>
        Root Node general directives
        <Prefilters>
                List of prefilters
        </Prefilters>
        <Postfilters>
                List of postfilters
        </Postfilters>
</RootNode>
<Summarisers>
        List of summarisers
</Summarisers>
<FileTypes>
        List of mappings between file extensions and mimetypes
</FileTypes>
<Encoding>
        List of content-encoding tags and their modules
</Encoding>