Harvest-NG Howtos
This document lists common "How do I X?" questions raised by users of Harvest-NG. If you have a question that you would like answered here, please raise it using one of the methods described for getting help. If you have an answer that you would like to contribute, please mail it to me.
How do I integrate Harvest-NG with a classic broker?
Old Harvest gatherers generated SOIF containing particular fields which the broker uses to identify data from them. To allow the broker to use Harvest-NG generated data, these fields must be added. The CompatHack post filter was designed to do this. In the postfilter section of your configuration file, include the following:

    <CompatHack>
    Gatherer-Version NG 1.0
    Gatherer-Name Harvest-NG Example
    Gatherer-Host localhost
    </CompatHack>

You can customise the attribute values as much as you want, but ensure that the names remain the same.

Secondly, the broker requires Timing information. This defaults to giving the objects a TTL of 28 days, and is added automatically by all distribution versions of reap.

Finally, you need to run a gatherd. The gather daemon is documented in more detail on the utilities page. For now, we will assume that you want to run it in standalone mode. Simply use

    util/gatherd --db=dbfile --port=port

where dbfile is the argument you passed to the File tag in the Database section (or WORKING, if you didn't enter this), and port is the port on which you want to run the service (the one which the broker will use to find it). Having done this, you should be able to configure your broker to find the gatherer on this port; any collection type should work.
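Before pointing the broker at the gatherer, it can help to confirm that something is actually listening on the chosen port. The following is a minimal sketch in Python, not part of Harvest-NG; the function name port_is_open and the port number 8500 are placeholders chosen for illustration. It only checks that a TCP connection can be opened and does not speak the gatherer protocol.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds, else False."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Hypothetical values: use the host running gatherd and the value
    # you passed to --port.
    print(port_is_open("localhost", 8500))
```

If this reports False, check that gatherd is running and that the port matches before looking at the broker configuration.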
How do I stop reap downloading files I don't want?
If you find that reap is spending all of its time downloading image files, PostScript, or other files that you have no intention of indexing, you can stop it gathering most of these by using URL regular expressions with the URLregex module. This isn't an ideal solution (see this bug for one that would be much better, but isn't yet supported). To use URLregex to do this, add the following to your Prefilters:

    <URLregex>
    Default allow
    Deny \.gif$
    Deny \.jpeg$
    Deny \.jpg$
    </URLregex>

Note the backslash escaping the dot, so that the regular expression matches a literal dot rather than any character. Remember that you can pull this in from a file by using the <URLregex filename> syntax, which is especially useful if you want to use the same deny list multiple times.
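To see which URLs such a deny list would catch, the patterns can be tried out in isolation. The sketch below uses Python's re module purely for illustration; the helper is_allowed is hypothetical and only models "Default allow" plus a list of Deny rules, which is an assumption about how the rules combine rather than Harvest-NG's actual evaluation. Matching here is case-sensitive; whether URLregex ignores case is not stated above.

```python
import re

# Deny patterns copied from the URLregex example above.
DENY_PATTERNS = [r"\.gif$", r"\.jpeg$", r"\.jpg$"]


def is_allowed(url, deny_patterns=DENY_PATTERNS, default_allow=True):
    """Return False if any deny pattern matches the URL, else the default.

    Assumption: a URL is denied as soon as one Deny pattern matches.
    """
    for pattern in deny_patterns:
        if re.search(pattern, url):
            return False
    return default_allow


print(is_allowed("http://example.org/logo.gif"))         # False
print(is_allowed("http://example.org/index.html"))       # True
# "$" anchors the match at the end, so a query string defeats the rule.
print(is_allowed("http://example.org/logo.gif?size=2"))  # True

# Without the backslash, "." matches any character, not just a dot.
print(bool(re.search(r".gif$", "http://example.org/notagif")))   # True
print(bool(re.search(r"\.gif$", "http://example.org/notagif")))  # False
```

The last two lines show why the escape matters: the unescaped pattern would also deny URLs that merely end in "gif".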
How do I stop reap indexing FrontPage configuration directories?
FrontPage apparently stores some files in directories named _vti_cnf, which clutter the indexing process. To block these, use the URLregex module, as shown below.

    <URLregex>
    Default allow
    Deny /_vti_cnf/
    </URLregex>
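Because the pattern /_vti_cnf/ has no anchors, it matches the directory at any depth in the URL path, but not a file that merely has _vti_cnf in its name. A short Python sketch of that behaviour, for illustration only (the helper in_vti_cnf_dir is hypothetical and is not Harvest-NG code):

```python
import re

# Pattern copied from the Deny rule above.
VTI_CNF = re.compile(r"/_vti_cnf/")


def in_vti_cnf_dir(url):
    """Return True if the URL passes through a _vti_cnf directory."""
    return VTI_CNF.search(url) is not None


print(in_vti_cnf_dir("http://example.org/_vti_cnf/index.html"))       # True
print(in_vti_cnf_dir("http://example.org/docs/_vti_cnf/index.html"))  # True
print(in_vti_cnf_dir("http://example.org/_vti_cnf.html"))             # False
```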
How do I index PDF, PostScript or MS-Word documents?
To do this you need an appropriate converter installed on your system. A list of packages, with the appropriate line for the Summarisers section of your configuration file, is below.