The gathering process

Before delving deeper into the details of running reap, an overview of how it all fits together to build a spider may well be of some use.

You may find it easier to refer to this document in conjunction with the configuration guide, especially if you are not familiar with web spiders or the old Harvest code.

Controllers and Reapers

reap consists of two distinct parts: a controller, and one or more reapers. The controller does exactly what the name suggests: it oversees the gathering process, scheduling URL fetching based on its workload and handing the URLs to a reaper to fetch. The reaper requests the URL from the remote site, summarises it, and then returns the document to the controller. The controller then invokes a filter set which checks the document content and determines whether or not to store it. If the document is allowed, the controller stores it in the database, extracts the list of URLs from this data, determines which of them the filters will allow it to fetch, and adds these to the workload.
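
As a purely illustrative sketch of that control flow (the names gather, fetch_and_summarise and so on are invented here, not reap's actual API), the loop looks something like this:

    # Illustrative sketch only - not reap's actual code or API.

    def gather(workload, reaper, prefilters, postfilters, database):
        """One sequential pass over a workload (a collections.deque of URLs)."""
        while workload:
            url = workload.popleft()
            # A reaper fetches the URL from the remote site and summarises it.
            obj = reaper.fetch_and_summarise(url)
            if obj is None:
                continue
            # Post-filters check the document content and decide whether to store it.
            if not all(allow(obj) for allow in postfilters):
                continue
            database.store(obj)
            # Extract the links from the stored data, keep those the pre-filters
            # allow, and add them to the workload.
            for link in obj.links():
                if all(allow(link) for allow in prefilters):
                    workload.append(link)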

This architecture allows more than one reaper to be running at once - meaning that one slow server will not hold up the entire gathering process.
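
For illustration, several reapers can be driven from one workload with an ordinary worker pool, so a request that stalls on a slow server ties up only one worker (again, the names are invented for this sketch):

    # Illustrative sketch only: running several reapers at once.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def reap_batch(urls, reaper_count, reaper):
        """Fetch and summarise a batch of URLs with a pool of reapers."""
        summaries = []
        with ThreadPoolExecutor(max_workers=reaper_count) as pool:
            futures = [pool.submit(reaper.fetch_and_summarise, url) for url in urls]
            for future in as_completed(futures):
                obj = future.result()
                if obj is not None:
                    summaries.append(obj)  # handed back to the controller's filters
        return summaries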

Workloads and Rootnodes

The set of URLs that the spider has to fetch is its workload. This is a constantly changing list of URLs which are to be indexed - when the workload is empty, the spider is done. reap's workload is seeded with a number of root nodes. These are starting points for the gathering process, and take their name from the old Harvest terminology (you can also think of them as the root of a tree taken through the directed graph of the web, if you're that way inclined).

A root node consists of a set of URLs (previously, as the name suggests, a node was purely one URL; however, this proved too restrictive), and a set of configuration information which indicates what restrictions there are on the gathering process. This information provides the filters which the controller uses to determine whether a URL is to be fetched and, once fetched, whether it is to be stored.
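
Purely as an illustration of the idea (the field names below are invented and are not reap's configuration syntax), a root node and the seeding of the workload can be pictured like this:

    # Illustrative sketch only: root nodes seeding the workload.
    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class RootNode:
        urls: list                                        # starting points for gathering
        prefilters: list = field(default_factory=list)    # restrict what may be fetched
        postfilters: list = field(default_factory=list)   # restrict what may be stored

    def seed_workload(rootnodes):
        """Build the initial workload from a set of root nodes."""
        workload = deque()
        for node in rootnodes:
            workload.extend(node.urls)
        return workload

    workload = seed_workload([
        RootNode(urls=["http://www.example.org/", "http://docs.example.org/"]),
    ])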

Filters, both pre and post

As you may have guessed from the above description, reap has two sets of filters.

  • prefilters are run on URLs before they are passed to a reaper to fetch, in order to determine whether the request is allowed. A prefilter has only the URL of the object and the controller's past activity to go on in permitting or denying the request.
  • postfilters are run on objects after they have been fetched, to determine whether the object should be stored and have its links added to the workload. A postfilter has access to much more information: in addition to the URL, it can look at the content returned by the reaper.
As the name filter suggests, these are not restricted simply to allowing or denying requests. They can manipulate any part of the object that they see, allowing very powerful targeting of spidering processes.
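
A minimal sketch of the two stages, assuming for illustration that a summarised object is a dictionary of attributes and that filters are plain functions (reap's real filter interface may well differ):

    # Illustrative sketch only: the two filtering stages.
    import re

    def prefilter_allow(url, history):
        """Pre-filter: only the URL and the controller's past activity are available."""
        if url in history:                  # don't fetch the same URL twice
            return False
        return url.startswith("http://www.example.org/")

    def postfilter_allow(obj):
        """Post-filter: the fetched content is available as well as the URL."""
        body = obj.get("Full-Text", "")
        return re.search(r"gathering", body, re.IGNORECASE) is not None

    def postfilter_rewrite(obj):
        """Filters may also manipulate any part of the object they see."""
        obj["Keywords"] = "spider,example"
        return obj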

Reaping and Summarising

Once a reaper has been given a URL to fetch by the controller, it fetches that URL across the network (respecting any proxy settings the user may have made). It then summarises the content of this URL into a standard form. This standard form (currently SOIF, or Structured Object Interchange Format) allows us to easily read all of the objects in the database (of which more later). The summarising process is done by a set of summarisers, chosen based either on the content-type that the server told us the document was (the preferred method), or on the extension of the file part of the URL (i.e. a .txt file is likely to be plain text).
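
As a rough illustration of that selection step (the tables and function names here are invented):

    # Illustrative sketch only: choosing a summariser.
    import os
    from urllib.parse import urlparse

    # In reap these would be summariser routines; strings stand in for them here.
    SUMMARISER_BY_TYPE = {
        "text/html":       "summarise_html",
        "text/plain":      "summarise_text",
        "application/pdf": "summarise_pdf",
    }

    TYPE_BY_EXTENSION = {
        ".html": "text/html",
        ".htm":  "text/html",
        ".txt":  "text/plain",
        ".pdf":  "application/pdf",
    }

    def choose_summariser(url, content_type=None):
        """Prefer the server's content-type; fall back to the URL's file extension."""
        if content_type is None:
            extension = os.path.splitext(urlparse(url).path)[1].lower()
            content_type = TYPE_BY_EXTENSION.get(extension, "text/plain")
        return SUMMARISER_BY_TYPE.get(content_type, "summarise_text")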

A number of additional summarisers can extract information from the protocol headers and include this in the SOIF as well. This completed package is then returned to the controller, which filters it as described above and, hopefully, stores it in the database.
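
To give a feel for the format, a SOIF record looks roughly like the following; the attribute names and values are invented for illustration, and the number in braces gives the length of each value in bytes. The Server and Last-Modification-Time attributes are the kind of thing that comes from the protocol headers rather than the document body.

    @FILE { http://www.example.org/index.html
    Title{12}:	Example Page
    Content-Type{9}:	text/html
    Server{12}:	Apache/1.3.9
    Last-Modification-Time{9}:	928349291
    Full-Text{27}:	Welcome to the example site
    }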

Databases

The database is where all of the successfully fetched page summaries reside, together with management information that helps the controller when revisiting these pages. The database is currently implemented purely as disk-based files; however, there is sufficient abstraction within the code that it could be a SQL database, or anything else that is required.
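
Purely as an illustration of that abstraction (the method names are invented), the storage layer can be thought of as an interface behind which a disk-file or SQL implementation could sit:

    # Illustrative sketch only: an abstract storage interface.
    from abc import ABC, abstractmethod

    class Database(ABC):
        """What the controller needs from a storage back end."""

        @abstractmethod
        def store(self, url, record):
            """Store (or replace) the summary and management information for a URL."""

        @abstractmethod
        def fetch(self, url):
            """Return the stored record for a URL, or None if it is not present."""

        @abstractmethod
        def expire(self, now):
            """Remove objects whose time to live has run out."""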

Once an object enters the database it resides there until it is expired. Every object has a time to live associated with it, which determines how long the object should survive after it was last fetched - setting this allows the user to determine how long objects which are no longer accessible on the web should remain in the database. (Obviously, if this is shorter than the interval between runs of reap, objects will not persist between runs.)
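
A minimal sketch of that expiry rule, assuming for illustration that the database is a dictionary of records with last_fetched and time_to_live fields:

    # Illustrative sketch only: time-to-live based expiry.
    import time

    def is_expired(record, now):
        """An object expires once its time to live has elapsed since it was last fetched."""
        return now > record["last_fetched"] + record["time_to_live"]

    def expire(database, now=None):
        """Drop every object whose time to live has run out."""
        if now is None:
            now = time.time()
        for url, record in list(database.items()):
            if is_expired(record, now):
                del database[url]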