Top-level configuration options

Other sections

The following configuration sections are nested inside this one:

  • RootNode - A RootNode describes the work that a reap is to carry out.
  • Database - Database directives control the format in which the reaper stores the information that it has gathered.
  • Summarisers - Summarisers are a set of filters, each of which takes a file and returns a content summary.
  • Encoding - Content encoding is used when transporting compressed files over HTTP. This configuration set registers the means of dealing with such files.

Configuration Options

TimeOut
The time (in seconds) to wait for the remote server to return a response.

HttpProxy
The address of a proxy server through which all HTTP requests are passed.

FtpProxy
The address of a proxy server through which all FTP requests are passed.
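
As an illustration only, the effect of TimeOut, HttpProxy and FtpProxy on a client can be sketched in Python with the standard urllib.request module. This is not the reaper's own code; the function names make_opener and fetch, and any proxy address passed to them, are hypothetical.

```python
import urllib.request

def make_opener(http_proxy=None, ftp_proxy=None):
    """Build an opener that routes HTTP and FTP requests via the given proxies.

    http_proxy and ftp_proxy stand in for the HttpProxy and FtpProxy
    options; the address format is an assumption of this sketch.
    """
    proxies = {}
    if http_proxy:
        proxies["http"] = http_proxy
    if ftp_proxy:
        proxies["ftp"] = ftp_proxy
    # An explicit dictionary (even an empty one) disables proxy autodetection.
    handler = urllib.request.ProxyHandler(proxies)
    return urllib.request.build_opener(handler), handler

def fetch(opener, url, timeout):
    """Fetch url, giving up if the server does not respond within timeout seconds."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Here the timeout argument plays the role of TimeOut, measured in seconds.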

FileTypes
Web servers generally return unknown files with a default type. FileTypes makes it possible to further subdivide files which are sent with this default type, according to their extension.

FileTypes takes a list of extension/MIME-type pairs. For example:

rpm                  application/x-rpm
html                 text/html
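
The lookup that FileTypes describes can be sketched in Python as follows. This is a minimal sketch, not the reaper's implementation: the helper names are hypothetical, and it assumes that the server's default type is application/octet-stream, which the text above does not state.

```python
import posixpath
from urllib.parse import urlsplit

# Assumption: the "default type" returned for unknown files.
DEFAULT_TYPE = "application/octet-stream"

def parse_file_types(text):
    """Parse whitespace-separated 'extension mime-type' pairs, one per line."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            extension, mime_type = fields
            table[extension.lower()] = mime_type
    return table

def refine_type(url, served_type, table, default_type=DEFAULT_TYPE):
    """Return a more specific type for a file served with the default type."""
    if served_type != default_type:
        return served_type  # the server already gave a specific type
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    # Fall back to the served type when the extension is not listed.
    return table.get(extension, served_type)

FILE_TYPES = parse_file_types("""
rpm                  application/x-rpm
html                 text/html
""")
```

With this table, a URL ending in .rpm that arrives with the default type would be treated as application/x-rpm, while a file served with any other type is left alone.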

RunMode
Sets the process model that the server uses to do the gathering. This dictates whether gathering takes place sequentially (i.e. pages are gathered one at a time) or in parallel (multiple pages are gathered at once).

RunMode can be one of:

ClientServer
ClientServer provides a single-machine, multi-process reaper.
SingleThread
SingleThread provides a single-machine, single-threaded reaper.

Nntpserver
The default server from which to gather news articles.

TemporaryDir
A directory which can be used to store temporary files created during the reaping process.

Delay
The length of time, in seconds, to wait between making a request to a server and making a subsequent request.

If you're gathering from your own servers, what you set this to is your business (but be careful not to overload them). For gathering from servers that you don't maintain, the recommended minimum delay is between 1 and 5 minutes (60 to 300 seconds). In general, the further away (administratively) the server is, the more cautious you should be.
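
A delay of this kind can be sketched in Python as below. This is an illustration under assumptions rather than the reaper's actual logic: the class name HostDelay is hypothetical, and the sketch assumes that the delay is tracked separately for each host, which the description above does not specify.

```python
import time
from urllib.parse import urlsplit

class HostDelay:
    """Enforce a minimum gap (in seconds) between requests to the same host."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self.clock = clock    # injectable so the logic can be tested
        self.sleep = sleep
        self.last_request = {}  # host -> time of the most recent request

    def wait(self, url):
        """Sleep if needed before requesting url; return the seconds slept."""
        host = urlsplit(url).hostname
        last = self.last_request.get(host)
        waited = 0.0
        if last is not None:
            remaining = self.delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self.last_request[host] = self.clock()
        return waited
```

A gatherer would call wait(url) immediately before each request, for example with HostDelay(60) for a one-minute delay.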

NoIms
If set to any value, this will disable If-Modified-Since gathering.

Under normal operation, when the reaper has a resource description of an object in the database, it will only update that description if the server indicates that the modification time of the object is newer than the copy in the database.

This mode of operation is desirable in a production gatherer. Whilst prototyping, however, changes in the code or configuration can be missed if only modified documents are updated, so this option is available to force the gatherer to fetch and update every document.
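
For HTTP, the behaviour described above corresponds to a conditional request. The following Python sketch shows the idea; the function names and the no_ims flag are hypothetical stand-ins for the NoIms option, not the reaper's own code.

```python
from email.utils import formatdate

def request_headers(stored_mtime, no_ims=False):
    """Build request headers for an object that may already be in the database.

    stored_mtime is the modification time (seconds since the epoch) recorded
    in the database, or None if the object has not been gathered before.
    """
    headers = {}
    if stored_mtime is not None and not no_ims:
        # Ask the server to send the object only if it has changed since then.
        headers["If-Modified-Since"] = formatdate(stored_mtime, usegmt=True)
    return headers

def needs_update(status_code):
    """A 304 (Not Modified) response means the stored description is current."""
    return status_code != 304
```

With no_ims set, the header is never sent, so the server returns every document in full and each description is rebuilt.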

Debug
A list of debugging flags, as listed in the programmer's documentation for Harvest::Debug. (If you need to turn this on, you probably want to look at that documentation anyway.)

DBType
The type of DBM file to use for our internal data structures. This can be any *_File DBM structure supported by your Perl - common options are DB_File, GDBM_File and NDBM_File.