Harvest-NG System overview

This is an overview of the architecture of the Harvest-NG system, intended to assist developers who wish to use and expand upon the Harvest libraries, or the system commands that use them. Harvest-NG is written in object-oriented Perl. It consists of a large number of classes, organised in a hierarchical fashion. At the "top level" of this hierarchy are the Controller, Reaper, Object and Database classes. The Controller and Reaper classes contain the spidering code. The Controller takes care of URL filtering and scheduling, and interfaces with the Database to store the results of gatherer runs. The Reaper handles object fetching and summarising, getting its workload from, and returning results to, the Controller. The Object class encapsulates an object (a URL, with related headers and summary information) at all stages of this cycle, with the Database class providing persistent storage for Objects.

Objects and URIs

An "object" is a URI which is to be fetched and summarised. At various stages in the life cycle of an object, it may also include a content summary, some headers and assorted management information. Headers are information used within the system, and also those headers returned by the protocol that the URI is fetched by. Information contained in headers is transient, it is not stored persistently in any part of the system, and is not preserved across runs. Management information is data used internally by the system to control the gathering process. It is stored persistently when the object is saved, and is expected to be available when the object is retrieved from the database. All of this information is encapsulated by the Object class. The Object::Headers, and Object::Manage classes provide encapsulations of any protocol headers returned when the object is fetched, and of any management data associated with the object, respectively. There is purposefully no Harvest class to encapsulate the content summary, the intention being that the content summary in use is independent of the Harvest-NG code, and can be replaced. The Metadata::SOIF class is currently used to represent content summaries. At present, in order to change the content summary it is necessary to make alterations to the Harvest::Database, and all of the Harvest::SOIFFilter classes

Databases

The Database class provides a wrapper around a number of underlying database types. Currently both DBM and directory-based databases are available. The Database stores both the content summary and the management data of an object, and provides methods for accessing this information. Other back ends, such as ODBC interfaces to commercial databases, should also be possible. The Database does not have to perform any form of locking, as it is only accessed by the Controller set of objects, and there should only be one set of them running at any one time for a given gatherer.
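
As an illustration, storing and retrieving an Object might look something like the sketch below. The constructor arguments and the store/retrieve method names are assumptions for illustration, not the documented interface:

    # A hedged sketch of the Database as persistent Object storage.
    my $db = Harvest::Database->new(type => 'DBM',
                                    path => '/var/lib/harvest/gatherer.db');

    $db->store($object);                      # content summary + management data
    my $copy = $db->retrieve($object->uri);   # headers are NOT preserved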

Spidering

As detailed earlier, the remaining part of the Harvest code is split into two conceptual classes, the Controller and the Reaper, each of which is made up of a number of Perl modules. The Controller provides work for the Reaper; the Reaper does this work and informs the Controller of the results. This leads to an implementation that looks like this:

    while ($controller->more) {
        $controller->done($reaper->process($controller->next));
    }

This would actually work in some situations, but it's not really quite as simple as that, as you'll see. We'll start off by looking at the construction of the Controller class, then take a look at how fetching and summarising is implemented by the Reaper.

Controller

The Controller class encapsulates all of the scheduling and filtering aspects of the Harvest-NG system. This complicated task can be further split into a number of sections.

Scheduling

Perhaps the most important part of the Controller's task is that of scheduling URLs to be fetched. The scheduling process is contained in the Controller::Workload class, and currently works as follows. Objects to be fetched are sorted into "buckets" according to the netloc (protocol, server and port) portion of their URL. A bucket is represented by the Controller::Bucket class. For each bucket, the time at which the last request from that bucket was completed is stored. Each bucket is also marked as being either busy (that is, an object from that bucket has been passed out for fetching) or free. The scheduler then round-robins through the buckets. If a bucket is busy it is skipped. If a free bucket was last accessed longer ago than a configurable delay, then the next URL from that bucket is returned; otherwise the bucket is skipped. If all the buckets are checked without a URL being returned, the scheduler instead returns the amount of time to sleep until a new URL becomes available. There are a number of problems with this algorithm, not least of which is that if a bucket is marked as busy, and the client fetching from that bucket never tells the scheduler that it has finished, then that bucket will be ignored for the rest of the run.
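
The following sketch captures the algorithm just described; the data layout and method name are invented for illustration and do not mirror the real Controller::Workload internals:

    # Round-robin over buckets; returns either the next URL to fetch,
    # or the time to sleep until one becomes available.
    sub next_url {
        my ($self, $now) = @_;
        my $min_wait;
        for my $bucket (@{ $self->{buckets} }) {
            next if $bucket->{busy} || !@{ $bucket->{urls} };
            my $ready = $bucket->{last_done} + $self->{delay};
            if ($now >= $ready) {
                $bucket->{busy} = 1;   # stays busy until the fetch reports back
                return (shift(@{ $bucket->{urls} }), 0);
            }
            my $wait = $ready - $now;
            $min_wait = $wait if !defined $min_wait || $wait < $min_wait;
        }
        return (undef, $min_wait);     # nothing ready: sleep, then retry
    }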

Responses from reapers

As noted above, the Controller receives responses from Reapers. These Objects contain both content summaries and management information (the extraction of which is described below). The Controller then carries out some post-fetch filtering on the Objects (as described below), and stores them in the Database. It then takes the URL references section of the management information and passes each of these URLs through the URL filters discussed below. Those URLs which pass unhindered are added to the scheduler's workload, and the gathering process continues.
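
Put together, the handling of a Reaper response might look like this sketch; the method names used (post_filter, url_filter, enqueue, url_references) are placeholders for illustration, not the real Controller interface:

    # Illustrative handling of a completed Object returned by a Reaper.
    sub done {
        my ($self, $object) = @_;
        return unless $self->post_filter($object);   # post-fetch filter chain
        $self->{database}->store($object);           # persist the summary

        # Feed extracted links back into the scheduler's workload.
        for my $url (@{ $object->manage->url_references }) {
            $self->enqueue($url) if $self->url_filter($url);
        }
    }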

Filtering

Both URL and Post filtering are handled in similar ways. A "chain" of filters is constructed, with each filter in the chain getting the chance to modify or delete the Object as it is passed down the chain. The filter chains are given in Controller::URLFilter and Controller::SOIFFilter, although both of these classes inherit most of their functionality from Controller::FilterManager. Recall that filter chains are specific to RootNodes. As Objects carry the RootNode name with them, the Controller selects the relevant filter chain to apply to each object. Filter chains are created by calling the add method with a relevant filter; for a full list of the available filters, see the URLFilter and SOIFFilter sections of the Pod documentation. The URL filters work solely from the information available from the Object definition (depth, parent) and the URL itself. The Post (currently SOIF) filters work on the Object headers, the management information, and on the content summaries themselves. It is likely that the "Post" filters will be split into two sections at a later date, one to deal with content-summary-independent filtering, and the other to work with content summaries.
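
For example, assembling a URL filter chain might look like the sketch below. The add method is described above; the filter variables and the final filter call are assumptions for illustration:

    # Building a filter chain; each filter may modify or delete the
    # Object as it passes down the chain.
    my $chain = Controller::URLFilter->new;
    $chain->add($depth_filter);     # e.g. reject URLs beyond a depth limit
    $chain->add($pattern_filter);   # e.g. reject URLs matching a pattern

    my $kept = $chain->filter($object);   # undef would mean "deleted"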

Reaping

The Reaper performs the fetching, processing and summarising part of the gathering process.

Fetching

The Reaper uses Reaper::Fetcher to fetch the Object. This fetches the URL, mainly using the libwww-perl library to do so, and returns a status indication. The contents of the URL are stored in a temporary file on disk. The HarvestUA class is used to change some of LWP's default options, and is mentioned here for completeness.
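
As a rough illustration of the fetch step, here is how plain libwww-perl can retrieve a URL into a temporary file. This is ordinary LWP usage, not the Reaper::Fetcher interface itself:

    use LWP::UserAgent;
    use File::Temp qw(tempfile);

    my $ua = LWP::UserAgent->new(agent => 'Harvest-NG');
    my ($fh, $tmpfile) = tempfile();

    # Fetch the URL straight to disk and keep the status indication.
    my $response = $ua->get('http://example.org/',
                            ':content_file' => $tmpfile);
    print $response->status_line, "\n";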

Summarisers

Having fetched the URL, Reaper::Summarise handles the document summarising. The summariser works in a similar way to the filters described above, in that the user specifies a series of summarisers to use. In this case, however, the user specifies the MIME type to which each summariser corresponds. It is also possible to specify summarisers that are always run, and summarisers that are specific to a given protocol. The summarisers produce the content summary information for the object. They should also extract any hyperlinks within the object and record them in the urlreferences section of the management data.
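
Conceptually, the MIME-type dispatch can be pictured as in the sketch below; the table-based interface and routine names shown are an illustration, not the real Reaper::Summarise API:

    # Illustrative dispatch from MIME type to summariser routine.
    my %summariser = (
        'text/html'  => \&summarise_html,   # also extracts hyperlinks into
        'text/plain' => \&summarise_text,   # the urlreferences management data
    );

    my $type    = $object->headers->get('Content-Type');
    my $handler = $summariser{$type};
    $handler->($object, $tmpfile) if $handler;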