About Harvest-NG
Harvest-NG is a collection of Perl modules and scripts that together provide a powerful web crawling and summarising agent. The code aims to provide an open source, standards-compliant tool for fetching content from a wide variety of information sources, summarising it into a set of resource descriptions, and storing these in an easily accessible database from which search services can be built and statistical information compiled.
Supported Information Services
An "information service" is something that makes content available over the World Wide Web. Contrary to popular belief, these are not limited to HTTP and its cousins such as HTTPS. Thanks to the power of the Perl LWP module, we support all of the sources in popular use today (that is, http, https, ftp, gopher, and nntp). In addition, the modular architecture means that further information sources can easily be added as needed. Support for HTTPS and other encrypted information services is provided through the Perl interface to the OpenSSL libraries; because this involves encryption, its use may be prohibited by your local laws.
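As a rough illustration of how a modular design lets new information sources be plugged in, the following Python sketch dispatches a URL to a handler chosen by its scheme. It is not Harvest-NG's own code (which is Perl and builds on LWP); the names FETCHERS, register and fetch are hypothetical, and the handlers only return a description instead of touching the network.

```python
from urllib.parse import urlsplit

# Hypothetical registry mapping URL schemes to fetcher callables.
FETCHERS = {}


def register(scheme):
    """Decorator that registers a fetcher for one URL scheme."""
    def wrap(func):
        FETCHERS[scheme] = func
        return func
    return wrap


@register("http")
@register("https")
def fetch_web(url):
    # Placeholder: a real fetcher would issue the request here.
    return f"would fetch {url} over HTTP(S)"


@register("ftp")
def fetch_ftp(url):
    # Placeholder: a real fetcher would open an FTP session here.
    return f"would fetch {url} over FTP"


def fetch(url):
    """Dispatch a URL to the fetcher registered for its scheme."""
    scheme = urlsplit(url).scheme.lower()
    try:
        handler = FETCHERS[scheme]
    except KeyError:
        raise ValueError(f"no fetcher registered for scheme {scheme!r}") from None
    return handler(url)
```

Under this arrangement, supporting a further source such as gopher would mean registering one more function, with no change to fetch itself.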
Document parsing and content types
Harvest-NG supports a wide variety of content formats, mainly through the use of external summarisers and convertors. The core code handles HTML and plain text itself; other content types are supported by adding convertors and translators. Harvest-NG can talk to many freely available convertors, such as mswordview, pdftotext and pstotext, which greatly increases the range of content types supported. In addition, content whose type is incorrectly identified by the server can still be indexed by examining its file type, allowing the indexing of items such as RPM files, which are commonly sent with incorrect MIME types.
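One common way to recover from a wrong or uninformative MIME type is to inspect the first few bytes of the content. The Python sketch below shows the idea under stated assumptions: the signature table is a small illustrative sample, the set of types treated as uninformative is a choice made for this example, and the function name effective_type is hypothetical. How Harvest-NG itself examines file types is not specified here.

```python
# Leading byte signatures for a few formats (a small, illustrative table).
MAGIC = [
    (b"\xed\xab\xee\xdb", "application/x-rpm"),   # RPM package lead
    (b"%PDF-", "application/pdf"),
    (b"%!PS", "application/postscript"),
]

# Server-supplied types treated as uninformative (an assumption of this sketch).
GENERIC = {"", "application/octet-stream", "text/plain"}


def effective_type(declared, head):
    """Return the content type to use, given the declared type and first bytes."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    declared = declared.split(";", 1)[0].strip().lower()
    if declared not in GENERIC:
        return declared
    for signature, mime in MAGIC:
        if head.startswith(signature):
            return mime
    # Nothing recognised: keep what the server said, or fall back to a default.
    return declared or "application/octet-stream"
```

A specific declared type is trusted as-is in this sketch; only generic declarations trigger sniffing, which keeps a correct "text/html" from being overridden by accident.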
Summarisation
Having collected the content from the servers and worked out how to read it, we convert the content into a single, standard format called SOIF (or Structured Object Interchange Format, to its friends). The resulting object is what we refer to as the "resource description". Using a standard format means that other tools can deal with the content without needing to worry about parsing multiple input file formats. We also use the summarisation stage to extract a list of URLs from the document, meaning that Harvest-NG can follow URLs expressed in any of the supported document formats, not just those given in HTML files.
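To give a feel for what a resource description looks like, the Python sketch below serialises a set of attributes in the SOIF layout as it is commonly described for Harvest: a template line carrying the URL, then one line per attribute giving its name, the value's size in bytes in braces, a colon, a tab and the value, and finally a closing brace. The function name to_soif and the attribute used in the example are hypothetical, and the exact records Harvest-NG writes may differ.

```python
def to_soif(url, attributes, template="FILE"):
    """Serialise attributes as one SOIF-style record (layout assumed, see above)."""
    lines = [f"@{template} {{ {url}"]
    for name, value in attributes.items():
        # The size is counted in bytes, not characters.
        size = len(value.encode("utf-8"))
        lines.append(f"{name}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_soif("http://example.org/", {"Title": "Hello"}), end="")
```

Because each value carries its own length, a reader can skip over values without scanning them for delimiters, which is part of what makes the format easy for other tools to consume.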
Databases and backend storage
Harvest-NG stores all of the resource descriptions in a database, along with some other information about the content. This database is currently managed internally, with no need for external systems. As with the other components, it can easily be replaced by writing a small quantity of Perl code, for those who want to interface with "real" SQL-talking database servers. The interface to the database is clear and well documented, making it trivial to write utilities that run over the gathered data, as can be seen from a number of the tools bundled with the program.
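The idea of a small, replaceable storage interface can be sketched as follows. This Python example is only an illustration: the class DescriptionStore, its methods and the table layout are all hypothetical and do not reflect Harvest-NG's actual database interface. It keeps resource descriptions keyed by URL in SQLite, standing in for an SQL-backed replacement of the internal store.

```python
import sqlite3


class DescriptionStore:
    """Hypothetical storage backend: resource descriptions keyed by URL."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS descriptions ("
            "url TEXT PRIMARY KEY, soif TEXT NOT NULL, fetched REAL NOT NULL)"
        )

    def put(self, url, soif, fetched):
        """Store or overwrite the description for a URL."""
        with self.db:  # commits on success, rolls back on error
            self.db.execute(
                "INSERT OR REPLACE INTO descriptions VALUES (?, ?, ?)",
                (url, soif, fetched),
            )

    def get(self, url):
        """Return the stored description, or None if the URL is unknown."""
        row = self.db.execute(
            "SELECT soif FROM descriptions WHERE url = ?", (url,)
        ).fetchone()
        return row[0] if row else None

    def urls(self):
        """Return all stored URLs in sorted order."""
        rows = self.db.execute("SELECT url FROM descriptions ORDER BY url")
        return [r[0] for r in rows]
```

A utility that runs over the gathered data then needs only the three methods, for example looping over urls() and calling get() on each, and would not care which backend sits behind them.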