Harvest-NG Howtos

This document lists common "How do I X?" questions raised by users of Harvest-NG. If you have a question that you'd like answered here, please raise it using one of the methods described for getting help. If you have an answer that you'd like to contribute, please mail it to me.

How do I ...

How do I integrate Harvest-NG with a classic broker

Old Harvest gatherers generated SOIF containing particular fields which the broker uses to identify data from them. To allow the broker to use Harvest-NG generated data, these fields must be added. The CompatHack post filter was designed to do this. In the postfilter section of your configuration file, include the following:

<CompatHack>
       Gatherer-Version        NG 1.0
       Gatherer-Name           Harvest-NG Example
       Gatherer-Host           localhost
</CompatHack>
(You can customise the attribute values as much as you want, but ensure that the names remain the same.)
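
As a quick sanity check, the following sketch reports which of the three compatibility fields are missing from a record. It assumes the record's attributes have already been parsed into a plain Python dictionary; the function name missing_compat_fields and that dictionary layout are illustrative only and are not part of Harvest-NG.

```python
# Attribute names taken from the CompatHack example in this howto.
REQUIRED_FIELDS = ("Gatherer-Version", "Gatherer-Name", "Gatherer-Host")


def missing_compat_fields(attributes):
    """Return the required field names that are absent or empty.

    `attributes` is a hypothetical dict mapping attribute name -> value.
    """
    return [name for name in REQUIRED_FIELDS if not attributes.get(name)]


if __name__ == "__main__":
    # Example record that lacks Gatherer-Name.
    record = {"Gatherer-Version": "NG 1.0", "Gatherer-Host": "localhost"}
    print(missing_compat_fields(record))
```

An empty list means all three names are present, whatever values you chose for them.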

Secondly, the broker requires timing information. This is added automatically by all distribution versions of reap, and by default gives the objects a TTL of 28 days.
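
To get a feel for what a 28 day TTL means in practice, the sketch below converts it to seconds and works out when an object gathered at a given moment would expire. It is plain date arithmetic; whether the broker stores the TTL in seconds is an assumption here, and the name expiry_time is illustrative only.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(days=28)  # the default TTL mentioned in this howto


def expiry_time(gathered_at):
    """Return the moment an object gathered at `gathered_at` would expire."""
    return gathered_at + TTL


if __name__ == "__main__":
    # The TTL expressed in seconds (assumed unit, see the text above).
    print(int(TTL.total_seconds()))
    # Expiry of an object gathered at midnight UTC on 1 January 2000.
    print(expiry_time(datetime(2000, 1, 1, tzinfo=timezone.utc)))
```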

Finally, you need to run a gatherd. The gather daemon is documented in more detail on the utilities page. For now, we will assume that you want to run it in standalone mode. Simply use

util/gatherd --db=dbfile --port=port
where dbfile is the argument you passed to the File tag in the Database section (or WORKING, if you didn't enter this), and port is the port on which you want to run the service (the one which the broker will use to find it).

Having done this you should be able to configure your broker to find the gatherer on this port - any collection type should work.
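
If the broker cannot reach the gatherer, it can help to confirm first that something is listening on the chosen port. The sketch below only tests whether a TCP connection can be opened; it does not speak the gatherd protocol, so a positive result shows a listener exists but not that it is gatherd. The port number 8500 is a placeholder, and port_is_open is an illustrative name.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unresolvable hosts.
        return False


if __name__ == "__main__":
    # 8500 is a placeholder; use the value you passed to --port.
    print(port_is_open("localhost", 8500))
```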

How do I stop reap downloading files I don't want

If you find that reap is spending all of its time downloading image files, PostScript, and other files that you have no intention of indexing, you can stop it gathering most of them by using URL regular expressions with the URLregex module. This isn't an ideal solution (see this bug for one that would be much better, but isn't yet supported).

To use URLregex to do this, add the following to your Prefilters:

<URLregex>
   Default       allow
   Deny          \.gif$
   Deny          \.jpeg$
   Deny          \.jpg$
</URLregex>
Note the backslash escaping the dot, so that the regular expression matches a literal dot rather than any character.
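
The effect of the escape can be seen with a small experiment. The sketch below uses Python's re module as a stand-in for the regular expression engine used by Harvest-NG; the two share this behaviour for the dot, but the example URLs and the helper name matches are made up for illustration.

```python
import re

escaped = re.compile(r"\.gif$")   # a literal dot, then "gif", at the end
unescaped = re.compile(r".gif$")  # any single character, then "gif", at the end

# Made-up URLs: the first is a real GIF, the second merely ends in "gif".
urls = [
    "http://www.example.com/logo.gif",
    "http://www.example.com/docs/agif",
]


def matches(pattern, url):
    """Return True if the pattern matches anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in urls:
        print(url, matches(escaped, url), matches(unescaped, url))
```

Without the backslash, the second URL would be denied even though it is not a GIF file.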

Remember that you can pull this in from a file by using the <URLregex filename> syntax, which is especially useful if you want to use the same deny list multiple times.

How do I stop reap indexing FrontPage configuration directories

FrontPage apparently stores some files in directories named _vti_cnf, which clutter the indexing process. To block these, use the URLregex module, as shown below.

<URLregex>
  Default allow
  Deny /_vti_cnf/
</URLregex>
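
To try a deny pattern against a few URLs before running reap, something like the sketch below can be used. It assumes a simple rule: a URL is denied if any Deny pattern matches anywhere in it, and otherwise the default applies. How the URLregex module actually orders and evaluates its rules is not described here, so treat is_allowed as an approximation; the URLs are made up.

```python
import re

# Pattern from the FrontPage example; matches the directory anywhere in the path.
DENY_PATTERNS = [re.compile(r"/_vti_cnf/")]
DEFAULT_ALLOW = True  # corresponds to "Default allow"


def is_allowed(url):
    """Deny if any deny pattern matches, else fall back to the default."""
    if any(pattern.search(url) for pattern in DENY_PATTERNS):
        return False
    return DEFAULT_ALLOW


if __name__ == "__main__":
    for url in (
        "http://www.example.com/site/_vti_cnf/index.htm",
        "http://www.example.com/site/index.htm",
        "http://www.example.com/_vti_cnf.html",
    ):
        print(url, is_allowed(url))
```

Because the pattern includes both slashes, it blocks only directories called _vti_cnf, not files whose names merely start with that string.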

How do I index PDF, Postscript or MS-Word documents

To do this you need an appropriate converter installed on your system. A list of packages, with the appropriate line for the Summarisers section of your configuration file, is given below.

PDF
  • pdftotext, part of xpdf, is a GPL'd tool for converting from PDF to text.
    Usage: application/pdf RunProg, pdftotext %infile %outfile, text/plain

Postscript
  • pstotext, from SRC's Virtual Paper project. This produces high-quality output from most PS files.
    Usage: application/postscript RunProg, pstotext %infile >%outfile, text/plain

MS Word documents
Results with MS Word documents tend to be variable, due to the ever-changing nature of the file format.
  • MSWordView is a translator for Word 8 (Office 97) format documents.
    Usage: application/msword RunProg, mswordview %infile >%outfile, text/html
  • strings is a good last resort for indexing Word documents that won't summarise any other way.
    Usage: application/msword RunProg, strings <%infile >%outfile, text/plain
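
Before adding one of these lines to your configuration, it is worth checking by hand that the converter works on a sample file. The sketch below fills the %infile and %outfile placeholders in a command template so that the resulting command can be pasted into a shell. It is only a testing aid: how RunProg itself expands the placeholders is not described here, and build_command and the file paths are illustrative.

```python
import re
import shlex


def build_command(template, infile, outfile):
    """Fill %infile and %outfile in a command template, shell-quoting both paths."""
    values = {
        "%infile": shlex.quote(infile),
        "%outfile": shlex.quote(outfile),
    }
    # Substitute in a single pass so a path containing "%outfile" is left alone.
    return re.sub(r"%infile|%outfile", lambda m: values[m.group(0)], template)


if __name__ == "__main__":
    # Placeholder paths; substitute a real sample document.
    print(build_command("pdftotext %infile %outfile", "/tmp/in.pdf", "/tmp/out.txt"))
    print(build_command("strings <%infile >%outfile", "my doc.doc", "out.txt"))
```

Quoting the paths matters for the templates that use shell redirection, since file names containing spaces would otherwise be split into several arguments.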