Driving reap from the command line

Reap does sensible things when invoked with a single argument - the URL to parse. This creates a database called 'WORKING' in the current directory, containing every URL from the given site. For example:

reap http://webharvest.sourceforge.net
will fetch all of the content from this site. You won't actually see any output, as no debugging is enabled by default. See below for how to see what's going on!

In addition to turning debugging on and off, you can make many other changes to the reaping process through command line options. These override any configuration file options, so they can be useful even when you're using a config file.

Getting Debugging Information

To get information about the progress of the reaping process, use the --debug option. Many different tags are supported, and the full list is available in the Harvest::Debug documentation. The ones that you're most likely to need are:

Fetch
Print details of which URLs are being fetched
Filter
Print details of URLs which are rejected by the filtering process

For example:

reap --debug=Fetch,Filter http://webharvest.sourceforge.net

Configuring the Database

The database is used to store details of all of the pages fetched by reap, and its behaviour can be altered by means of a number of command line options.

--dbdir <file>
Set the directory used to store the database files. This defaults to WORKING, and is equivalent to the File option in the Database section of the configuration file.
--dbtype <module>
Set the type of DBM database to use for storage. If you're not sure what this is, you needn't worry about it: the default of GDBM_File should do you fine! Otherwise, this can be set to any Perl DBM module that supports tying. Suitable options include DB_File and NDBM_File. This is equivalent to the Type option in the Database section.

Customising the gatherer

You may have noticed by now that the gatherer runs very slowly, and that it doesn't bother fetching files that it's already seen. Both of these behaviours can be customised, allowing you to tune the way the gatherer works according to the site you're indexing.

--delay <seconds>
This specifies the number of seconds to wait between fetching URLs from the same server. Note that Harvest-NG allows URLs to be fetched from multiple servers in parallel, and that the delay only affects repeated fetches from the same site. The delay defaults to 60 seconds. Please be considerate of the servers you are indexing before changing this, especially if they are not under your control - many people view faster gathering rates as being highly anti-social. This is equivalent to the Delay configuration file option.
--noims
Disable the use of modification times, which reap normally relies on to speed up gathering. If reap has already visited a page, then instead of asking for the page again it just asks whether it has changed. If it hasn't, then it doesn't bother fetching and resummarising the file. This is normally good, as it makes things much faster - but if you've been experimenting, you may find that it ignores some of your changes, and that you want every file to be re-fetched and re-summarised. This option is equivalent to the NoIms one in the configuration file.
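The two behaviours described above can be sketched in a few lines of Python. This is only an illustration of the idea, not reap's actual code: the function names are hypothetical, and it assumes that "asking whether a page has changed" means sending an HTTP If-Modified-Since header and treating a 304 response as "unchanged".

```python
from email.utils import formatdate

DEFAULT_DELAY = 60  # seconds; the default delay described above


def seconds_to_wait(last_fetch, host, now, delay=DEFAULT_DELAY):
    """Return how long to wait before fetching from `host` again.

    `last_fetch` maps a host name to the time (in epoch seconds) of the
    most recent fetch from that host. Hosts not yet visited need no wait,
    which is why different servers can be fetched from in parallel.
    """
    if host not in last_fetch:
        return 0
    return max(0, last_fetch[host] + delay - now)


def request_headers(last_seen, use_ims=True):
    """Build the extra headers for a page last fetched at `last_seen`.

    With use_ims=False (the assumed effect of --noims) no conditional
    header is sent, so the server always returns the full page.
    """
    if use_ims and last_seen is not None:
        return {"If-Modified-Since": formatdate(last_seen, usegmt=True)}
    return {}


def must_resummarise(status):
    """A 304 (Not Modified) reply means the stored summary can be kept."""
    return status != 304
```

Under these assumptions, a page fetched 30 seconds ago from the same server has to wait another 30 seconds, while a page on a server that has not been visited yet can be fetched straight away.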

Controlling the URL workload

A number of reap options exist to control which URLs are fetched and stored. Most of these are equivalent to configuration file options, and override the corresponding settings if they have been specified in the configuration file.

--depth=<num>
Equivalent to the Depth option. Set the maximum depth to go to from the starting URL (the 'depth' is the number of links which have to be followed to reach the target URL from the one specified on the command line). The depth is given by num. If this option is omitted then the default depth is infinite.
--maxhost=<num>
Equivalent to the HostLimit option. Sets the maximum number of distinct hosts to visit. A host is made up of the URL scheme (such as http), the server, and the port. Two http URLs on the same server but with different ports count as different hosts. The default is 1, which means only URLs on the initial server will be indexed.
--maxurl=<num>
Equivalent to the URLLimit option. Sets the maximum number of unique URLs to fetch from the server. Note that this may be different from the number of URLs stored, as URLs which are rejected in the postfiltering stage will be counted as fetched. The default is for no limit to be imposed.
--ccontrol
Equivalent to the CacheControl option. Stop the gatherer from indexing HTTP items which are sent with Cache-Control: private in their headers. This directive is used by web servers to indicate documents that are "eyes only" for the client using them, and which should not be redistributed.
--neverup
Equivalent to the NeverUp option. This prevents the gatherer from going 'up' from the starting URL. Up is defined by treating the path of the URL (the part of the URL after the hostname) as a directory structure - anything attempting to access a higher level of this structure is rejected.
--nodata
Implemented using the Prune option. This stops the gatherer from storing any data about a document apart from its Title. This is useful when reap is being used for management functions, such as detecting bad links or generating sitemaps and graphs, where a full content summary of every page is not necessary.
--urlregex <file>
Implemented using the URLregex option. Apply the regular expressions contained in file to a URL to determine whether it should be fetched or not. For more information, please see the URLregex page.
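To make the definitions of a "host" (for --maxhost) and of going "up" (for --neverup) more concrete, here is a small Python sketch. It is an interpretation of the descriptions above rather than reap's implementation: the function names and the table of default ports are assumptions, and the "up" check looks only at the path, ignoring which host the candidate URL is on.

```python
from urllib.parse import urlsplit

# Assumed default ports, used when a URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def host_key(url):
    """Return the (scheme, server, port) triple that identifies a host."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def directory_of(path):
    """Return the directory part of a URL path, always ending in '/'."""
    path = path or "/"
    return path[: path.rfind("/") + 1]


def goes_up(start_url, candidate_url):
    """True if the candidate lies outside the starting URL's directory."""
    base = directory_of(urlsplit(start_url).path)
    candidate_path = urlsplit(candidate_url).path or "/"
    return not candidate_path.startswith(base)


# Same server, different ports: two distinct hosts.
same = host_key("http://www.example.org/a") == host_key("http://www.example.org:8080/a")
print(same)

# Starting from /docs/index.html, /about.html is 'up' and would be rejected.
print(goes_up("http://www.example.org/docs/index.html",
              "http://www.example.org/about.html"))
```

With the default --maxhost of 1, only URLs whose triple matches that of the starting URL would be fetched under this reading.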

Miscellaneous options

These are options that didn't really fit under any other heading, and include how to use configuration files to take advantage of some of reap's more powerful features.

--tmpdir
The location in which temporary files should be stored. This defaults to your TMPDIR environment variable if it's set, or to /tmp otherwise. This is equivalent to the TemporaryDir configuration file option.
--nntpserver
The address of your local news server (commonly 'newshost' or 'nntphost'). This has no default, and is equivalent to the NNTPserver configuration option.
--config <file>
Read in configuration information from file.