
Harvest::Controller - interface to the spider controller

DESCRIPTION

The Controller is the external interface to the spider's scheduler and filters.

METHODS

$overseer=new Controller($database,$delay,$noims);

Create a new Controller. There should be only one instance of the controller class per spider.

On its own, the controller class does nothing until rootnodes are entered using the add method.

$database is a Harvest::Database object which is the database in which all gathered data should be stored.

$delay sets the number of seconds to wait between accesses to the same server. For internet gathering this should be at least 60 (1 minute), and probably around 300 (5 minutes).

$noims, if set, disables If-Modified-Since based incremental gathering.
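
The effect of $delay can be pictured with a small sketch. It is not taken from the Harvest source: the helper name wait_for, the %last_access table and the explicit $now argument are all assumptions made for illustration. It only shows one plausible way a per-server delay might be computed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Harvest): seconds to wait before
# contacting $host again, given the time of the last access to it.
sub wait_for {
    my ($last_access, $host, $delay, $now) = @_;
    my $last = $last_access->{$host};
    return 0 unless defined $last;            # never visited: fetch now
    my $remaining = $last + $delay - $now;
    return $remaining > 0 ? $remaining : 0;   # never return a negative wait
}

my %last_access = ('example.com' => 1000);    # host => time of last request
my $delay       = 300;                        # the suggested 5 minutes

my $same_host_soon  = wait_for(\%last_access, 'example.com', $delay, 1100);
my $other_host      = wait_for(\%last_access, 'example.org', $delay, 1100);
my $same_host_later = wait_for(\%last_access, 'example.com', $delay, 1400);

print "$same_host_soon $other_host $same_host_later\n";
```

With these values the same server must wait 200 more seconds at time 1100, while a server that has not been contacted yet can be fetched immediately.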

$overseer->add($rootnode);

Add a new rootnode to the list being fetched by the spider. $rootnode should be an object of type Harvest::Controller::RootNode.

$root->more

Returns TRUE if there are more objects left to fetch.

$obj=$root->next

Returns the next Harvest::Object to fetch.

If no objects are available at the current time, it instead returns the number of seconds to sleep until more objects should become available.
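
Because next can return either an object or a sleep time, a caller has to tell the two apart. The sketch below is one possible driving loop, not the documented usage: it assumes an object is a blessed reference and a sleep time is a plain number, so ref() can distinguish them. Sketch::Controller, Sketch::Object and drain are hypothetical stand-ins so that the example runs without Harvest installed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the controller, only so the loop can run.
package Sketch::Controller;

sub new {
    my ($class, @queue) = @_;
    return bless { queue => [@queue] }, $class;
}

# TRUE while anything is left in the queue.
sub more { my ($self) = @_; return scalar @{ $self->{queue} } }

# Returns either a blessed object or a plain number of seconds to wait.
sub next { my ($self) = @_; return shift @{ $self->{queue} } }

package main;

# Drain the controller. Returns (objects seen, total seconds asked to wait).
sub drain {
    my ($root, $sleeper) = @_;
    my ($fetched, $waited) = (0, 0);
    while ($root->more) {
        my $obj = $root->next;
        if (ref $obj) {
            $fetched++;           # a real spider would fetch $obj here
        } else {
            $waited += $obj;      # $obj is a number of seconds
            $sleeper->($obj);     # e.g. sub { sleep $_[0] }
        }
    }
    return ($fetched, $waited);
}

my $page   = bless { url => 'http://example.com/' }, 'Sketch::Object';
my $root   = Sketch::Controller->new($page, 2, $page);
my @result = drain($root, sub { });   # no-op sleeper keeps the demo instant
print "fetched=$result[0] waited=$result[1]\n";
```

Passing the sleeper as a code reference is only a convenience for the demo; a real loop would presumably call sleep directly.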

$root->done($obj)

Should be called with the results of running a Harvest::Reaper fetch operation on the object $obj.

This method runs the appropriate post-summarising filters, extracts any URL references contained in the object, filters them and adds them to the workload, and finally adds the object to the database that was passed to the constructor of this Controller.
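
The order of the steps described for done can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: done_sketch, the %state hash, the regex-based URL extraction and the plain arrays standing in for the workload and the database are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sketch of the step order only; names and data layout are assumed.
sub done_sketch {
    my ($state, $obj) = @_;

    # 1. Run the post-summarising filters (code refs that may modify $obj).
    $_->($obj) for @{ $state->{post_filters} };

    # 2. Extract URL references; a naive regex stands in for real extraction.
    my @urls = $obj->{body} =~ m{href="([^"]+)"}g;

    # 3. Filter the URLs, then 4. add the survivors to the workload.
    my @kept = grep { $state->{url_filter}->($_) } @urls;
    push @{ $state->{workload} }, @kept;

    # 5. Finally store the object in the database.
    push @{ $state->{database} }, $obj;

    return scalar @kept;
}

my %state = (
    post_filters => [ sub { $_[0]{seen} = 1 } ],
    url_filter   => sub { $_[0] =~ m{^http://example\.com/} },
    workload     => [],
    database     => [],
);
my $obj = {
    body => '<a href="http://example.com/a">a</a> <a href="http://other.test/b">b</a>',
};
my $added = done_sketch(\%state, $obj);
print "added=$added workload=@{ $state{workload} }\n";
```

Here the URL filter keeps only links on example.com, so one of the two extracted references reaches the workload.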