Harvest-NG How-tos

This document lists common "How do I X?" questions raised by users of Harvest-NG. If you have a question that you'd like answered here, please raise it using one of the methods described for getting help. If you have an answer that you'd like to contribute, please mail it to me.

How do I ...
	Integrate Harvest-NG with a Harvest broker
	Stop reap from downloading files I don't want
	Stop reap from indexing FrontPage configuration files
	Index PDF, Postscript or MS-Word documents
How do I integrate Harvest-NG with a classic broker
	Old Harvest gatherers generated SOIF containing particular fields which the broker uses to identify data from them. For the broker to use data generated by Harvest-NG, these fields must be added. The CompatHack post filter was designed to do this. In the postfilter section of your configuration file, include the following:

		<CompatHack>
			Gatherer-Version NG 1.0
			Gatherer-Name Harvest-NG Example
			Gatherer-Host localhost
		</CompatHack>

	You can customise the attribute values as much as you want, but ensure that the names remain the same.

	Secondly, the broker requires `Timing` information. This defaults to giving the objects a TTL of 28 days, and is added automatically by all distribution versions of reap.

	Finally, you need to run a gatherd. The gather daemon is documented in more detail on the utilities page. For now, we will assume that you want to run it in standalone mode. Simply use

		util/gatherd --db=dbfile --port=port

	where dbfile is the argument you passed to the File tag in the Database section (or `WORKING`, if you didn't enter this), and port is the port on which you want to run the service (the one which the broker will use to find it). Having done this, you should be able to configure your broker to find the gatherer on this port. Any collection type should work.
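	Before pointing a broker at the gatherer, it can help to confirm that something is actually listening on the chosen port. The following is a minimal sketch of such a check in Python. It only tests that a TCP connection can be opened and says nothing about whether the daemon is serving valid data. The host and the port number 8500 are placeholder assumptions; substitute the values you passed to gatherd.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


if __name__ == "__main__":
    # Hypothetical values: use the host and --port you gave gatherd.
    host, port = "localhost", 8500
    state = "accepting connections" if port_is_open(host, port) else "not reachable"
    print(f"{host}:{port} is {state}")
```

	If the port is reported as not reachable, check that gatherd is still running and that the port matches the one configured in the broker.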
How do I stop reap downloading files I don't want
	If you find that reap is spending all of its time downloading image files, Postscript, or other files that you have no intention of indexing, you can stop it gathering most of these by using URL regular expressions with the URLregex module. This isn't an ideal solution (see this bug for one that would be much better, but isn't yet supported). To use URLregex to do this, add the following to your `Prefilters` section:

		<URLregex>
			Default allow
			Deny \.gif$
			Deny \.jpeg$
			Deny \.jpg$
		</URLregex>

	Note the backslash escaping each dot, so that it matches a literal dot rather than any character. Remember that you can pull this in from a file by using the `<URLregex filename>` syntax, which is especially useful if you want to use the same deny list multiple times.
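	To see what a deny list will catch before running reap, the patterns can be tried against a few sample URLs. The sketch below uses Python's `re` module as a stand-in. It assumes that a Deny pattern applies when it matches anywhere in the URL and that everything else falls through to `Default allow`; URLregex itself may evaluate rules differently, and the function name and example URLs are made up for illustration.

```python
import re

# Patterns copied from the Deny lines above.
DENY_PATTERNS = [r"\.gif$", r"\.jpeg$", r"\.jpg$"]


def is_allowed(url, deny_patterns=DENY_PATTERNS):
    """Return False if any deny pattern matches somewhere in the URL.

    Assumption: this mimics "Default allow" plus Deny rules with an
    unanchored search; it is not the URLregex implementation.
    """
    return not any(re.search(p, url) for p in deny_patterns)


if __name__ == "__main__":
    samples = [
        "http://example.org/index.html",
        "http://example.org/logo.gif",
        "http://example.org/photo.jpg",
        "http://example.org/gifts.html",
    ]
    for url in samples:
        print(url, "allowed" if is_allowed(url) else "denied")
```

	The escaping matters here: with the unescaped pattern `.gif$`, a URL ending in `xgif` would also be denied, because the dot would match the `x`.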
How do I stop reap indexing FrontPage configuration directories
	FrontPage apparently stores some files in directories named `_vti_cnf`, which clutter the indexing process. To block these, use the URLregex module, as shown below.

		<URLregex>
			Default allow
			Deny /_vti_cnf/
		</URLregex>
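	The slashes on either side of `_vti_cnf` restrict the rule to a directory of exactly that name. A short Python sketch of this behaviour follows, assuming the pattern is applied as an unanchored search over the whole URL; the function name and URLs are illustrative only.

```python
import re

# The Deny pattern from the configuration above.
VTI_CNF = re.compile(r"/_vti_cnf/")


def in_vti_cnf(url):
    """Return True if the URL passes through a _vti_cnf directory."""
    return VTI_CNF.search(url) is not None


if __name__ == "__main__":
    for url in [
        "http://example.org/docs/_vti_cnf/index.html",
        "http://example.org/docs/my_vti_cnf.html",
    ]:
        print(url, "denied" if in_vti_cnf(url) else "allowed")
```

	A file that merely contains `_vti_cnf` in its name is left alone, while anything beneath a `_vti_cnf` directory is blocked.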
How do I index PDF, Postscript or MS-Word documents
	To do this you need an appropriate converter installed on your system. A list of packages, and the appropriate line for the `Summarisers` section of your configuration file, is below.

	PDF
		pdftotext, part of xpdf, is a GPL'd tool for converting from PDF to text.
		Usage: `application/pdf RunProg, pdftotext %infile %outfile, text/plain`

	Postscript
		pstotext, from SRC's Virtual Paper project. This produces high quality output from most PS files.
		Usage: `application/postscript RunProg, pstotext %infile >%outfile, text/plain`

	MS Word documents
		Results with MS Word documents tend to be variable, due to the ever-changing nature of the file format.
		MSWordView is a translator for Word 8 (Office 97) format documents.
		Usage: `application/msword RunProg, mswordview %infile >%outfile, text/html`
		`strings` is a good last resort for indexing Word documents that won't summarise any other way.
		Usage: `application/msword RunProg, strings <%infile >%outfile, text/plain`
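	Since each of these lines depends on an external program, it is worth checking that the programs can be found before starting a long gathering run. The following Python sketch looks each one up on the current PATH. It assumes reap will be run with the same PATH as the script; the function name is made up for illustration.

```python
import shutil

# Converter programs named in the Summarisers lines above.
CONVERTERS = ["pdftotext", "pstotext", "mswordview", "strings"]


def missing_programs(names):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_programs(CONVERTERS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All converters found.")
```

	Only the converters for the document types you actually configure need to be present.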
Please mail any comments or questions to sxw@users.sourceforge.net
