Harvest::Controller::Workload - a class for maintaining a robot's workload

DESCRIPTION

Workload is a class for maintaining and scheduling a robot's workload. It maintains a working list of URLs and prioritises which URL to process next.

URLs are stored in server ``buckets'' according to the server from which each URL comes.
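The bucketing described above could be sketched as follows. This is only an illustration: the helper server_key, the default-port table and the use of plain URL strings are assumptions made to keep the example self-contained, whereas the module itself works on Harvest::Object instances.

```perl
use strict;
use warnings;

# Hypothetical default ports, used when the URL names none.
my %default_port = (http => 80, https => 443, ftp => 21);

# Hypothetical helper: derive a bucket key ("scheme://host:port") from a URL string.
sub server_key {
    my ($url) = @_;
    my ($scheme, $host, $port) =
        $url =~ m{^([a-z][a-z0-9+.-]*)://([^/:?#]+)(?::(\d+))?}i
        or return;
    $scheme = lc $scheme;
    $port //= $default_port{$scheme} // 0;
    return sprintf '%s://%s:%s', $scheme, lc $host, $port;
}

# Group URLs into one bucket per server.
my %buckets;
for my $url (qw(
    http://example.org/a.html
    http://example.org:80/b.html
    http://example.com/index.html
)) {
    my $key = server_key($url) or next;
    push @{ $buckets{$key} }, $url;
}
```

Here the first two URLs land in the same bucket because an explicit port 80 and the default HTTP port produce the same key.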

At present, little or no error checking is performed on the URLs. It is therefore possible to mark as ``done'' a URL that was never being fetched.

METHODS

$workload = new Workload($delay)

Constructor for the Workload. Constructs a new workload scheduler which will wait for a minimum of $delay seconds between URL fetches from the same server.
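As a rough sketch of the per-server delay rule, the check below decides whether a server may be contacted again. The names server_ready, note_fetch and %last_fetch are hypothetical, and the clock is passed in explicitly so the example stays deterministic; the module's own bookkeeping may well differ.

```perl
use strict;
use warnings;

# Toy model of the per-server delay rule. All names here are hypothetical.
my $delay = 5;      # minimum seconds between fetches from one server
my %last_fetch;     # server key => time of the most recent fetch

# True if the server may be contacted at time $now.
sub server_ready {
    my ($key, $now) = @_;
    return 1 unless exists $last_fetch{$key};
    return ($now - $last_fetch{$key}) >= $delay ? 1 : 0;
}

# Record a fetch so that later checks honour the delay.
sub note_fetch {
    my ($key, $now) = @_;
    $last_fetch{$key} = $now;
    return;
}

note_fetch('http://example.org:80', 100);
my @checks = (
    server_ready('http://example.org:80', 102),   # too soon
    server_ready('http://example.org:80', 105),   # delay has elapsed
    server_ready('http://example.com:80', 102),   # never fetched
);
```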

$workload->add($url)

Add a URL to the workload. $url should be an instance of Harvest::Object.

$workload->next

Return the next URL to be fetched, and mark that URL as ``pending''.

$workload->down($obj)

Mark the server carrying the URL as down.

This removes the URL from the ``pending'' list, puts it back in the bucket for the server, and reduces the priority of the bucket until the next ``done'' call for a URL on that server.

$obj should be an instance of Harvest::Object.

FIXME: This makes no sense at all. We need to remove the ``down'' object and replace it with something different. Saying that a server is down should just increase that server's delay to something huge, so we don't look at it for a while (probably use exponential backoff here).
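The exponential backoff suggested in the note above might look something like this. It is a sketch of one possible rule, not anything the module provides: backoff_delay and its parameters (base delay, consecutive failure count, upper cap) are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical backoff rule: double the delay for each consecutive failure,
# capped at a maximum. None of these names exist in the module.
sub backoff_delay {
    my ($base, $failures, $max) = @_;
    my $delay = $base * 2 ** $failures;
    return $delay > $max ? $max : $delay;
}

# Delays for 0 to 10 consecutive failures, with a 5 second base and a 1 hour cap:
# 5, 10, 20, 40, 80, 160, 320, 640, 1280, 2560, 3600
my @delays = map { backoff_delay(5, $_, 3600) } 0 .. 10;
```

A successful fetch would presumably reset the failure count to zero, so that the server returns to its normal delay.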

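Current implementation:

    sub down {
        my ($self, $obj) = @_;
        my $key = $self->_key($obj);

        # Put the URL back in its server bucket, flag the bucket as down,
        # then drop the URL from the pending list.
        $self->add($obj);
        $self->{$key}->status(DOWN);
        $self->_remove_from_pending($obj);
    }

$workload->block($obj)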

Block the server.

This will mark all URLs on the given server as blocked.

$obj should be an instance of URI::URL, but need only contain the scheme, host and port parts.

FIXME: BLOCK IS CURRENTLY NOT IMPLEMENTED

$workload->unblock($obj)

Unblock the server.

This will remove the blocked status of the server bucket.

$obj should be an instance of URI::URL.

FIXME: UNBLOCK IS CURRENTLY NOT IMPLEMENTED

$workload->done($obj)

Mark the URL as being completed.

This removes the URL from the pending queue. Note that no precautions are taken to prevent the URL from being fetched again.

$obj should be an instance of Harvest::Object.
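The caveat about refetching can be illustrated with a toy model. ToyWorkload below is a hypothetical stand-in (a single FIFO queue plus a pending set, with URLs as plain strings) and is not the module's implementation; it only shows why a completed URL can be handed out again when nothing records what has already been done.

```perl
use strict;
use warnings;

# Toy stand-in for the workload, NOT the module's implementation.
package ToyWorkload;

sub new {
    my ($class) = @_;
    return bless { queue => [], pending => {} }, $class;
}

sub add {
    my ($self, $url) = @_;
    push @{ $self->{queue} }, $url;
    return;
}

# Hand out the next URL and remember it as pending.
sub next {
    my ($self) = @_;
    my $url = shift @{ $self->{queue} };
    $self->{pending}{$url} = 1 if defined $url;
    return $url;
}

# Drop the URL from the pending set; nothing records that it was completed.
sub done {
    my ($self, $url) = @_;
    delete $self->{pending}{$url};
    return;
}

package main;

my $w = ToyWorkload->new;
$w->add('http://example.org/a.html');
my $first = $w->next;
$w->done($first);

# No record of completed URLs exists, so the same URL is accepted
# and handed out a second time.
$w->add('http://example.org/a.html');
my $second = $w->next;
```

A caller that needs each URL fetched only once would have to keep its own record of completed URLs and consult it before calling add.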

$workload->clear

Clear the entire workload.