The gathering process
Before delving deeper into the details of running reap, it is worth outlining how the pieces fit together to build a spider. You may find it easier to read this document in conjunction with the configuration guide, especially if you are not familiar with web spiders or the old Harvest code.
Controllers and Reapers
reap consists of two distinct parts: a controller, and one or more reapers. The controller does exactly what the name suggests: it oversees the gathering process, scheduling URL fetches based on its workload and handing URLs to a reaper to fetch. The reaper requests the URL from the remote site, summarises it, and returns the document to the controller. The controller then invokes a filter set which checks the document content and determines whether or not to store it. If the document is allowed, the controller stores it in the database, extracts the list of URLs from it, determines which of those the filters will allow it to fetch, and adds them to the workload. This architecture allows more than one reaper to be running at once, meaning that one slow server will not hold up the entire gathering process.
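As a very rough illustration of this split (a sketch only, in Python; it is not reap's own code, and allow_fetch, allow_store, extract_urls and store are placeholder callables standing in for the filters, link extraction and database), the controller owns the workload and the filters, while the reapers do nothing but fetch and summarise:

```python
# Minimal controller/reaper sketch.  Illustrative only.
import queue
import threading
import urllib.request

def reaper(work_q, result_q):
    """Worker: fetch a URL handed out by the controller and summarise it crudely."""
    while True:
        url = work_q.get()
        if url is None:                          # sentinel: shut down
            break
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                body = resp.read()
            summary = {"url": url,
                       "type": resp.headers.get_content_type(),
                       "size": len(body)}
        except OSError:
            summary, body = None, b""            # fetch failed
        result_q.put((url, summary, body))

def controller(seed_urls, allow_fetch, allow_store, extract_urls, store,
               n_reapers=4):
    """Schedule fetches from the workload, filter the results, grow the workload."""
    work_q, result_q = queue.Queue(), queue.Queue()
    reapers = [threading.Thread(target=reaper, args=(work_q, result_q))
               for _ in range(n_reapers)]
    for t in reapers:
        t.start()

    seen, outstanding = set(), 0
    for url in seed_urls:
        if url not in seen and allow_fetch(url):       # pre-fetch filter
            seen.add(url)
            work_q.put(url)
            outstanding += 1

    while outstanding:
        url, summary, body = result_q.get()
        outstanding -= 1
        if summary is not None and allow_store(summary):   # post-fetch filter
            store(summary)                                  # into the database
            for link in extract_urls(url, body):
                if link not in seen and allow_fetch(link):
                    seen.add(link)
                    work_q.put(link)
                    outstanding += 1

    for _ in reapers:                            # tell the reapers to stop
        work_q.put(None)
    for t in reapers:
        t.join()
```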
Workloads and Rootnodes
The set of URLs that the spider has to fetch is its workload. This is a constantly changing list of URLs which are to be indexed; when the workload is empty, the spider is done. reap's workload is seeded with a number of root nodes. These are starting points for the gathering process, and take their name from the old Harvest terminology (you can also think of them as the root of a tree taken through the directed graph of the web, if you're that way inclined). A root node consists of a set of URLs (previously, as the name suggests, a node was purely one URL, but this proved too restrictive) and a set of configuration information indicating what restrictions apply to the gathering process. This information provides the filters which the controller uses to determine whether a URL is to be fetched and, once fetched, whether it is to be stored.
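Purely as an illustration of what a root node amounts to (the field names below are invented for the example; the real syntax, and the restrictions actually available, are described in the configuration guide):

```python
# Illustrative only: a root node reduced to a plain data structure.
from dataclasses import dataclass, field

@dataclass
class RootNode:
    urls: list                                          # the seed URLs this node starts from
    allow_hosts: list = field(default_factory=list)     # hosts the spider may visit
    deny_patterns: list = field(default_factory=list)   # URL patterns never fetched

root = RootNode(
    urls=["http://www.example.org/", "http://docs.example.org/"],
    allow_hosts=["example.org", "docs.example.org"],
    deny_patterns=[r"\.cgi$"],
)
```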
Filters, both pre and post
As you may have guessed from the above description, reap has two sets of filters: a pre-fetch set, which is applied to a URL to decide whether it should be fetched at all, and a post-fetch set, which is applied to the summarised document to decide whether it should be stored in the database. Both are derived from the configuration information supplied with the root node.
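A hypothetical pair of filters, written to plug into the controller sketch above, might look like this; the key point is that a pre-fetch filter only ever sees a URL, while a post-fetch filter sees the summarised document:

```python
# Hypothetical filters for the controller sketch above.  Illustrative only.
from urllib.parse import urlparse

def allow_fetch(url):
    """Pre-fetch filter: should this URL be added to the workload at all?"""
    host = urlparse(url).hostname or ""
    return host.endswith("example.org") and not url.endswith(".cgi")

def allow_store(summary):
    """Post-fetch filter: should this summarised document go into the database?"""
    return summary["type"].startswith("text/") and summary["size"] < 500_000
```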
Reaping and Summarising
Once a reaper has been given a URL to fetch by the controller, it fetches that URL across the network (respecting any proxy settings the user may have made). It then summarises the content of this URL into a standard form. This standard form (currently SOIF, the Structured Object Interchange Format) allows us to easily read all of the objects in the database (of which more later). The summarising is done by a set of summarisers, chosen based on either the content-type that the server told us the document was (the preferred method) or the extension of the file part of the URL (i.e. a .txt file is likely to be plain text). A number of additional summarisers can extract information from the protocol headers and include this in the SOIF as well. This completed package is then returned to the controller, which filters it as described above and, hopefully, stores it in the database.
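To give a feel for the standard form: a SOIF object is a template type, the URL it describes, and a list of attribute-value pairs, each value prefixed by its length in bytes. The record below is purely illustrative; the template type and attributes you actually see depend on which summarisers ran.

```
@FILE { http://www.example.org/docs/readme.txt
Time-to-Live{7}:	2592000
Last-Modification-Time{9}:	935968442
Type{4}:	Text
Title{6}:	readme
Partial-Text{26}:	This is an example file...
}
```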
Databases
The database is where all of the successfully fetched page summaries reside, together with management information that helps the controller when revisiting these pages. The database is currently implemented purely as disk-based files; however, there is sufficient abstraction within the code that it could be an SQL database, or anything else that is required. Once an object enters the database it remains there until it is expired. Every object has a time to live associated with it, which determines how long the object should survive after it was last fetched; setting this allows the user to decide how long objects which are no longer accessible on the web should remain in the database. (Obviously, if the time to live is shorter than the interval at which reap is run, objects will not persist between runs.)
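A sketch of the expiry rule (the field names are illustrative, not reap's actual attribute names): an object survives for its time to live, counted from the moment it was last fetched.

```python
# Illustrative expiry check; field names are invented for the example.
import time

def is_expired(obj, now=None):
    """An object expires time-to-live seconds after it was last fetched."""
    now = time.time() if now is None else now
    return now > obj["last_fetch_time"] + obj["time_to_live"]

# For example, with a time to live of 30 days, a summary last fetched on day 0
# is dropped on the first run after day 30; so if reap only runs every 60 days,
# nothing fetched in one run survives to the next.
```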