Using the reap spider

The reap spider is at the heart of the Harvest-NG system. It crawls the web, gathering pages and following links according to its configuration, and stores the gathered pages in its database for further use.
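As a rough illustration of the gather-and-follow cycle described above, here is a minimal sketch in Python. It is not reap's implementation and uses none of its code or options: the crawl function, the LinkExtractor class, the max_pages limit and the in-memory PAGES "web" are all hypothetical names chosen for this example, and a plain dictionary stands in for the database.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl; returns {url: html} for every page gathered.

    fetch is any callable that returns the page text, or None if the
    page is unavailable. max_pages is an assumed safety limit.
    """
    store = {}                    # stands in for the spider's database
    queue = deque([start_url])    # pages still to be visited
    seen = {start_url}            # every URL ever queued, to avoid repeats
    while queue and len(store) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment part.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return store


# A tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.org/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">home</a>',
    "http://example.org/b.html": '<a href="a.html#top">A again</a>',
}

gathered = crawl("http://example.org/", PAGES.get)
```

In this sketch the fetch function is passed in, so the same loop could be pointed at a real HTTP client. A real spider would also need the kinds of controls that configuration provides, such as limits on which links to follow.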

Reap is a highly configurable beast and can get relatively complicated. However, it has been designed from the ground up with sensible defaults, so you only need to set the options you want to change. Reap can be configured either from the command line or by using the Harvest-NG configuration file format.

This document assumes that you have already downloaded, installed and tested Harvest-NG. If you haven't, please see the earlier instructions for details of this process.

Contents

Command line usage
Simple use of reap from the command line.
How reap organises the gathering process
Details of how reap organises the gathering process. Some understanding of this is necessary to configure, control and use reap effectively.
Configuration file usage
How to use configuration files to control the gatherer.
Configuration directives
A complete, autogenerated listing of all of the available configuration directives.