Generic configuration directives

The Generic class of options controls much of how the reaper works. There is an automatically generated list of all of these configuration options. We'll only consider the more useful and frequently used options here; the auto-generated list should be regarded as the definitive version.
Getting feedback

Directives: Debug

By default, reap doesn't produce much information about what it's doing. The Debug directive enables far more information about the reaping process. It takes as its arguments a list of tags describing the debugging information to be displayed. The full list is described in more detail in the programmer's documentation; of these, Fetch (which shows URLs being fetched) and Filter (which shows the URLs rejected by the filters) are particularly useful.
Process model

Directives: RunMode, Socket, Reapers

The gatherer can run in a number of different process models, which vary the way in which the controller and reaper communicate. In the single-threaded model, there is only one controller and one reaper, and both run as part of the same process. In the client/server model there is one controller and a number of reapers, which communicate with each other using TCP/IP sockets. Client/server mode is faster, as the reapers work in parallel and so mask network-based delays, but it imposes a bigger resource load on the machine from which you are gathering. When selecting the number of reapers to use in client/server mode, be careful to ensure that you are gathering from enough hosts to keep them all busy. In particular, even with your Delay set to 0, you can't have more than one reaper gathering from a host at any one time.
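The one-reaper-per-host limit puts a ceiling on how many reapers can be busy at once. A minimal Python sketch of that arithmetic follows; the function name `useful_reapers` and the sample URLs are illustrative assumptions, not anything reap itself provides.

```python
from urllib.parse import urlsplit


def useful_reapers(urls, configured_reapers):
    """Estimate how many reapers can be busy at the same time.

    Assumes, as described above, that at most one reaper may gather
    from a given host at any one time. Hypothetical helper, not part
    of reap itself.
    """
    hosts = {urlsplit(u).hostname for u in urls}
    hosts.discard(None)  # ignore URLs that carry no host
    return min(configured_reapers, len(hosts))


urls = [
    "http://example.org/a.html",
    "http://example.org/b.html",
    "http://example.com/index.html",
]
print(useful_reapers(urls, 8))  # only two distinct hosts, so prints 2
```

In this example, six of the eight configured reapers would have nothing to do, so a smaller Reapers value would serve just as well.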
Proxies and hosts

Directives: HttpProxy, FtpProxy, NntpServer

If your site is firewalled, you may need to gather from behind a proxy server. Similarly, you may need to tell reap where to fetch news articles from. The directives in this section allow you to do that.

If you are setting your HTTP proxy to a web cache, be careful: the fetching pattern exhibited by reap is very different from that of a normal human browser, and may well cause useful objects which should have remained in the cache to be evicted to make way for objects which reap will use only once. If you intend to do large-scale crawls through such a cache, ask the cache administrator to disable caching for your machine, so that the cache acts purely as a proxy.

The address which you pass to the NntpServer directive will be used for all news URLs which do not specify a host to fetch from.
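The fallback rule for news URLs can be sketched in a few lines of Python. This is an illustration only: `news_host` is a hypothetical helper, and `news.example.org` stands in for whatever address NntpServer is set to.

```python
from urllib.parse import urlsplit


def news_host(url, default_server):
    """Pick the server to contact for a news URL.

    Hypothetical helper: a URL that names its own host uses that host;
    otherwise the configured default (the NntpServer value) is used.
    """
    return urlsplit(url).hostname or default_server


# No host in the URL, so the default server is used.
print(news_host("news:comp.lang.python", "news.example.org"))
# The URL names its own host, so the default is ignored.
print(news_host("nntp://nntp.example.com/comp.lang.python/42", "news.example.org"))
```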
Delays, Timeouts and Intelligent gathering

Directives: Delay, Timeout, NoIms

These options control the time between requests to the same server, the time that a server is allowed to take to respond before we give up on it, and the intelligence of the gathering algorithm.

Delay is the amount of time that we wait between sending a request to a server and sending another request to it. For the purposes of Delay, a server is the combination of the URL's scheme, host and port. That is, http://example.org/, http://example.org:8080/ and ftp://example.org/ are all different servers.

What counts as a reasonable value for Delay is a subject of great contention. If you're gathering from your own servers, you can set it as low as you want. However, when gathering from other people's machines, several pieces of net-politeness come into play. Any robot has to be very careful not to overload the server it's fetching from (the server may, after all, be running all sorts of things to serve you your pages). What value avoids overloading the server depends a lot on who you ask. Our recommendation is not to lower Delay below 60 seconds if you're doing any kind of large-scale crawling, and to increase it to at least 300 seconds for repeated runs.

Timeout specifies exactly that: the time to wait for the server to respond. If the server hasn't replied in that time, the request is abandoned, and an attempt is made to continue with whatever information was received.

IMS (If-Modified-Since) gathering, as controlled by the NoIms directive, is a means of making the gathering process more intelligent. Instead of blindly asking for every document that it encounters, even if it has already seen it, IMS gathering makes reap ask for a document only if it has changed since it was last fetched. This greatly speeds up the gathering process: a document that has not changed need not be fetched, and need not be indexed. Unfortunately, it can also cause problems if the configuration of the summarisation section has been changed: documents may simply be skipped as "not modified", and the configuration changes never noticed. Hence the ability to disable this gathering method.
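Two of the mechanisms described above lend themselves to a short illustration: the per-server Delay, where a server is the scheme, host and port taken together, and the If-Modified-Since request. The Python below is only a sketch under stated assumptions. `server_key`, `DelayTracker` and `ims_headers` are hypothetical names, not part of reap, and since this section does not say how reap treats default ports, the port is compared exactly as written in the URL.

```python
import time
from email.utils import formatdate
from urllib.parse import urlsplit


def server_key(url):
    """Return the (scheme, host, port) triple that identifies a server.

    Assumption: the port is taken as written; None means no explicit port.
    """
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


class DelayTracker:
    """Remember when each server was last contacted (hypothetical class)."""

    def __init__(self, delay_seconds):
        self.delay = delay_seconds
        self.last_request = {}

    def wait_needed(self, url, now=None):
        """Seconds still to wait before this URL's server may be contacted."""
        now = time.time() if now is None else now
        last = self.last_request.get(server_key(url))
        if last is None:
            return 0.0
        return max(0.0, last + self.delay - now)

    def record(self, url, now=None):
        """Note that a request was just sent to this URL's server."""
        self.last_request[server_key(url)] = time.time() if now is None else now


def ims_headers(last_fetched):
    """Build the If-Modified-Since header for a conditional HTTP request."""
    return {"If-Modified-Since": formatdate(last_fetched, usegmt=True)}


tracker = DelayTracker(delay_seconds=60)
tracker.record("http://example.org/", now=1000.0)
print(tracker.wait_needed("http://example.org/page.html", now=1010.0))  # same server: 50.0
print(tracker.wait_needed("http://example.org:8080/", now=1010.0))      # different server: 0.0
print(ims_headers(0))  # header value for the Unix epoch
```

Over HTTP, a server that honours the header answers an unchanged document with a 304 Not Modified status and no body, which is what lets the fetch and the indexing be skipped.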
System related configuration

Directives: DBType, TemporaryDir

These directives provide a means of altering some fairly system-specific items. Unless you're having problems running reap, you probably won't need to change these at all.