Downloading and Installing Harvest-NG

Harvest-NG is not currently available from CPAN because of the unusual way in which it is configured. We hope to make later releases available from CPAN once the package file structure has been tidied up and some namespace issues have been resolved.

Preinstallation Issues

Harvest-NG requires a fairly up-to-date perl and a number of additional modules. If you want to grab the code anyway, feel free to skip ahead to the next section.

Harvest-NG has only been tested with versions of perl later than 5.004. To check which version of perl you have installed, run the perl -v command.
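As a rough sketch, the version check can also be scripted. The function name check_perl_version is hypothetical and not part of Harvest-NG. It relies on perl's standard `require VERSION` form, which fails if the interpreter is older than the version given. Note that it accepts 5.004 itself, whereas the text above says only later versions have been tested.

```shell
#!/bin/sh
# check_perl_version (hypothetical helper): succeeds if the perl found on
# PATH is at least version 5.004, fails if perl is older or not installed.
check_perl_version() {
    perl -e 'require 5.004;' 2>/dev/null
}

if check_perl_version; then
    echo "perl is 5.004 or later"
else
    echo "perl is missing or older than 5.004"
fi
```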

In addition, we require the following extra modules, all of which are available from CPAN. This list is correct at the time of writing, but may change as CPAN modules are reorganised. Please let us know of any omissions.

Note that HTML-Parser version 3 is considerably faster than earlier versions.

On top of that list, a number of optional portions of the code require other modules:

  • Compress-Zlib - required by the Harvest gatherd compatibility code, and in order to summarise files which are gzip or deflate compressed (generally files with a .gz extension)
  • Netserver-Generic - required by the Harvest gatherd compatibility code, if it is to be run in standalone (as opposed to inet) mode.
  • Crypt-SSL - required to index SSL web servers.
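One way to see whether an optional module is already installed is to ask perl to load it. The sketch below assumes a hypothetical helper called have_module. The only module name used is Compress::Zlib, which is the Perl package provided by the Compress-Zlib distribution. For the other entries, substitute the package name that the distribution actually installs.

```shell
#!/bin/sh
# have_module NAME (hypothetical helper): succeeds if perl can load the
# module NAME, fails if the module or perl itself is missing.
have_module() {
    perl -M"$1" -e 1 2>/dev/null
}

# Report on the optional modules you care about; extend this list as needed.
for mod in Compress::Zlib; do
    if have_module "$mod"; then
        echo "$mod: installed"
    else
        echo "$mod: not found"
    fi
done
```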

Downloading Harvest-NG

The latest version of Harvest-NG is available from Sourceforge. Older versions are also available in this directory.

Installing Harvest-NG

Harvest-NG is designed to be run from the directory in which it is installed. You need only read the next paragraph if you wish to change this (to install it system-wide, for instance).

To separate the Harvest tree out, copy the Harvest directory to wherever you keep perl libraries, and copy the reap file, together with any utilities you wish to use, into a suitable binary directory. Then, if the directory containing the Harvest directory is not in your perl search path, add the following as the second line of reap and of any utilities you are using:
use lib '/path/to/libraries';
(replacing the path with the location of the Harvest directory). All of this will hopefully be automated in a later release.
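The manual edit described above could be scripted along the following lines. This is only a sketch: add_use_lib is a hypothetical helper, not something shipped with Harvest-NG, and /path/to/libraries is the same placeholder used in the text. It prints the file with a `use lib` line inserted as line 2 and leaves the original untouched.

```shell
#!/bin/sh
# add_use_lib FILE LIBDIR (hypothetical helper): print FILE to standard
# output with a "use lib 'LIBDIR';" line inserted directly after line 1.
# \047 is the octal escape for a single quote inside the awk format string.
add_use_lib() {
    awk -v libdir="$2" '
        { print }
        NR == 1 { printf "use lib \047%s\047;\n", libdir }
    ' "$1"
}
```

To apply it in place while keeping the file's permissions, work from a backup copy, for example: cp reap reap.orig && add_use_lib reap.orig /path/to/libraries > reap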

Finally, if your perl is not installed in /usr/bin, you'll need to alter the "shebang" (#!) lines of reap and of any of the utilities. In the first line of these files, you'll see
#!/usr/bin/perl -w
Replace /usr/bin/perl with the location of your system's perl executable (you can find this by typing which perl).
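If several files need the same change, the substitution can be sketched with sed. The helper name set_shebang is hypothetical. It replaces only the interpreter path on line 1, so the -w switch is kept, and it assumes the new path contains no | or & characters.

```shell
#!/bin/sh
# set_shebang FILE PERL (hypothetical helper): print FILE to standard output
# with the interpreter path on the first line replaced by PERL. Anything
# after the path on that line (such as -w) is left alone.
set_shebang() {
    sed "1s|^#![^ ]*|#!$2|" "$1"
}
```

For example, working from a backup copy so that the file keeps its permissions: cp reap reap.orig && set_shebang reap.orig "$(which perl)" > reap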

Doing your first run

Just to check that everything is working, try the following, using one of the configuration files shipped with the package in the config directory:

reap --config config/firstrun.conf http://localhost/
(If you're not running a web server on the local machine, replace localhost with another, locally run, web server.)

If all is OK, reap will fetch the first 10 URLs that it encounters on this server, and store and summarise as many of them as it can. It will create the storage files in your current directory.

For what to do next, read the more comprehensive user documentation.