Driving reap from the command line
|
|
Reap does sensible things when invoked with only one argument - the URL
to start from. This creates a database called 'WORKING' in the
current directory containing every URL from the given site. For example:
reap http://webharvest.sourceforge.net
will fetch all of the content from this site. You won't actually see any
output, as no debugging is enabled by default; see below for how to
see what's going on!
In addition to turning debugging on and off, you can make many other
changes to the reaping process through command line options. These
override any configuration file options, so they can be useful even when
you're using a config file.
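The precedence rule above can be pictured with a small Python sketch. It is an illustration only, not reap's own code: the function name effective_settings and the dictionaries are hypothetical, and the keys simply reuse the Delay and Depth configuration option names mentioned later on this page.

```python
def effective_settings(config_file, command_line):
    """Merge two sets of options so that command line values win."""
    merged = dict(config_file)
    # Only options actually given on the command line (not None) override
    # what the configuration file says; everything else is left alone.
    merged.update({k: v for k, v in command_line.items() if v is not None})
    return merged


# Hypothetical values: the config file sets both options, while the
# command line only supplies --delay.
config_file = {"Delay": 60, "Depth": 3}
command_line = {"Delay": 120, "Depth": None}
print(effective_settings(config_file, command_line))
```

Here the Delay from the command line replaces the one from the file, while Depth keeps its configuration file value because no --depth was given.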
|
Getting Debugging Information
|
|
To get information about the progress of the reaping process, use the
--debug option. Many different tags are supported; the full list is
available in the Harvest::Debug documentation, but the ones that you're
most likely to need are:
- Fetch
- Print details of which URLs are being fetched
- Filter
- Print details of URLs which are rejected by the filtering process
For example:
reap --debug=Fetch,Filter http://webharvest.sourceforge.net
|
Configuring the Database
|
|
The database is used to store details of all of the pages fetched by
reap, and its behaviour can be altered by means of a number of
command line options.
- --dbdir <file>
- Set the directory used to store the database files. This defaults to
WORKING, and is equivalent to the File option in the Database
section of the configuration file.
- --dbtype <module>
- Set the type of DBM database to use for storage. If you're not sure
what this is, you needn't worry about it; the default of GDBM_File
should do you fine! Otherwise, this can be set to any Perl DBM database
that supports tie-ing. Suitable options include DB_File and
NDBM_File. This is equivalent to the Type option in the Database
section.
|
Customising the gatherer
|
|
You may have noticed by now that the gatherer is running very slowly,
and that it doesn't bother fetching files that it's already seen. Both
of these can be customised, allowing you to tune the way that the gatherer
works according to the site you're indexing.
- --delay <seconds>
- This specifies the number of seconds to wait between fetching URLs from
the same server. Note that Harvest-NG allows URLs to be fetched from
multiple servers in parallel, and that the delay only affects repeated
fetches from the same site. The delay defaults to 60 seconds. Please be
considerate of the servers you are indexing before changing this,
especially if they are not under your control - many people view faster
gathering rates as being highly anti-social. This is equivalent to the
Delay configuration file option.
- --noims
- Disable the use of modification times to speed up gathering. Normally,
if reap has already visited a page, then instead of asking for
the page again it just asks whether it has changed. If it hasn't, then
it doesn't bother fetching and resummarising the file. This is normally
good as it makes things much faster - but if you've been experimenting,
you may find that it ignores some of your changes, and you want every
file to be re-fetched and re-summarised. This option is equivalent to the
NoIms one in the configuration file.
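The two behaviours described above can be sketched in Python. This is a minimal illustration under stated assumptions, not reap's implementation: the class HostDelay and the function conditional_request are hypothetical names, and the sketch assumes that "asking whether a page has changed" is done with HTTP's If-Modified-Since request header. No network access is made; the request is only built.

```python
import time
from email.utils import formatdate
from urllib.parse import urlsplit
from urllib.request import Request


class HostDelay:
    """Track when each server was last contacted (hypothetical helper)."""

    def __init__(self, delay_seconds=60):
        self.delay = delay_seconds
        self.last_fetch = {}  # server name -> time of the most recent fetch

    def wait_needed(self, url, now=None):
        """Return how many seconds to wait before fetching url."""
        now = time.time() if now is None else now
        last = self.last_fetch.get(urlsplit(url).hostname)
        if last is None:
            return 0.0  # this server has not been contacted yet
        return max(0.0, self.delay - (now - last))

    def record_fetch(self, url, now=None):
        """Remember that url's server was contacted at time now."""
        now = time.time() if now is None else now
        self.last_fetch[urlsplit(url).hostname] = now


def conditional_request(url, last_seen_timestamp):
    """Build (but do not send) a request that asks only for changed content.

    Assumption: the change check is an If-Modified-Since header carrying
    the time the page was last seen.
    """
    stamp = formatdate(last_seen_timestamp, usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})
```

Because the delay is tracked per server, a fetch from a second server is not held up by a recent fetch from the first, which matches the parallel behaviour described for --delay.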
|
Controlling the URL workload
|
|
A number of reap options exist to control which URLs are fetched and
stored. Most of these are equivalent to configuration file options, and
override any of these options that may have been specified in the
configuration file.
- --depth=<num>
- Equivalent to the Depth option. Set the maximum depth to go to from the
starting URL (the 'depth' is the number of links which have to be followed
to reach the target URL from the one specified on the command line). The
depth is given by num. If this option is omitted then the default
depth is infinite.
- --maxhost=<num>
- Equivalent to the HostLimit option. Sets the maximum number of distinct
hosts to visit. A host consists of the URL scheme (such as http), the
server, and the port. Two http URLs on the same server with different ports
count as different hosts. The default is 1, which means only URLs on the
initial server will be indexed.
- --maxurl=<num>
- Equivalent to the URLLimit option. Sets the maximum number of unique
URLs to fetch from the server. Note that this may be different from the
number of URLs stored, as URLs which are rejected in the postfiltering
stage will be counted as fetched. The default is for no limit to be imposed.
- --ccontrol
- Equivalent to the CacheControl option. Stop the gatherer from indexing
HTTP items which are sent with Cache-Control: private in their headers. This
directive is used by web servers to indicate documents that are "eyes only"
for the client using them, and which should not be redistributed.
- --neverup
- Equivalent to the NeverUp option. This prevents the gatherer from going
'up' from the starting URL. Up is defined by treating the path of the URL
(the part of the URL after the hostname) as a directory structure - anything
attempting to access a higher level of this structure is rejected.
- --nodata
- Implemented using the Prune option. This stops the gatherer from storing
any data about a document apart from its Title. This is useful when
reap is being used for management functions, such as detecting
bad links or generating sitemaps and graphs, where a full content summary of
every page is not necessary.
- --urlregex <file>
- Implemented using the URLregex option. Apply the regular expressions
contained in file to a URL to determine whether it should be fetched
or not. For more information, please see the URLregex page.
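Three of the definitions above (what counts as a host, what counts as going 'up', and how depth is measured) can be made concrete with a short Python sketch. It is an illustration under assumptions rather than reap's own logic: the names host_key, goes_up and urls_within_depth are hypothetical, the link map is an in-memory stand-in for real pages, goes_up compares paths only and treats anything outside the starting directory as 'up', and default ports are filled in so that an explicit :80 matches plain http.

```python
import posixpath
from collections import deque
from urllib.parse import urlsplit

# Assumed default ports, used when a URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def host_key(url):
    """Return the (scheme, server, port) triple that identifies a host."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def directory_of(path):
    """Return the directory part of a URL path, always ending in '/'."""
    path = path or "/"
    if not path.endswith("/"):
        path = posixpath.dirname(path)
        if not path.endswith("/"):
            path += "/"
    return path


def goes_up(start_url, candidate_url):
    """True if candidate_url lies outside the starting URL's directory."""
    base = directory_of(urlsplit(start_url).path)
    # normpath resolves '..' segments so they cannot hide an upward step.
    candidate = posixpath.normpath(urlsplit(candidate_url).path or "/")
    return not (candidate + "/").startswith(base)


def urls_within_depth(links, start_url, max_depth=None):
    """Breadth-first walk of a link map, following at most max_depth links.

    links maps each URL to the URLs it links to; None means no limit.
    Returns a dict of every reachable URL and its depth.
    """
    depth_of = {start_url: 0}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if max_depth is not None and depth_of[url] >= max_depth:
            continue  # do not follow links beyond the depth limit
        for target in links.get(url, []):
            if target not in depth_of:
                depth_of[target] = depth_of[url] + 1
                queue.append(target)
    return depth_of
```

With these definitions, a limit on distinct hosts amounts to counting distinct host_key values, and a depth limit of 2 keeps the starting URL, the pages it links to, and the pages those link to.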
|
Miscellaneous options
|
|
These are options that didn't really fit under any other heading, and
include how to use configuration files to take advantage of some of
reap's more powerful features.
- --tmpdir
- The location in which temporary files should be stored. This defaults to
your TMPDIR environment variable if it's set, and to /tmp otherwise. This is
equivalent to the TemporaryDir configuration file option.
- --nntpserver
- The address of your local news server (commonly 'newshost' or 'nntphost').
This has no default, and is equivalent to the NNTPserver configuration
option.
- --config <file>
- Read in configuration information from file.