2.0.0-RC1 Release Notes

Heritrix 2.0.0-RC1 Release Notes

Heritrix 2.0.0 is now available as a release candidate.

Heritrix 2.0.0 offers an updated user interface, a new settings framework and configuration file formats, new options for controlling the ordering of collection of URIs and sites, more options for remote-controlling a crawl engine, and many other big fixes and improvements.

What's New

For Crawl Operators

The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.

The model for configuring the crawler has changed significantly. The same settings exist, but now are collected on 'sheets', of a different on-disk format than in Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains settings from source code default values. A 'global' sheet collects whole-crawler and all-URL settings.

Other settings that only apply to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but any SURT prefix. A single sheet can be mapped to many URLs; an URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)

Both individual URIs and the queues of URIs within a crawler frontier may now have an integer 'precedence' value assigned to them, via separate URI and queue precedence policies configured via settings sheets. A URI's precedence affects the order in which it is collected relative to other URIs in the same queue. A queue's precedence affects which queues actively provide URIs for crawling when many queues are available and waiting. Lower numbers mean earlier (higher ranked) consideration - 1 is intended as the highest precedence. Policies already included with Heritrix, combined with sheet override configuration, can achieve a variety of desired prioritizations of crawl work. As with other crawler components, custom policies can also be implemented in Java.

For Developers

Heritrix has been split into subprojects of related (and reusable) functionality.

The innermost subproject, 'commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.

The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'commons'.

The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'commons' and 'modules'.

The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'archive'.

Many classes have moved or been renamed.

Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.

This somewhat encumbers usage in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with necessary 3rd-party-classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that local repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproejct or package they now reside.

Known Limitations

Heritrix 2 has received very little testing so far on platforms other than Linux; functionality and support scripts which worked in 1.x will likely need updating, even if currently included in the distribution.

The contributed AdaptiveRevisitFrontier and has not yet been updated for operation under 2.0.0.

Documentation is very limited compared to the 1.x user manual and developer manual.

Getting the 2.0.0-RC1 Release

This release may be obtained from our Sourceforge files area:

Heritrix2 Releases

Getting Started

For operators:

2.0 Tutorial

For developers:

Setting up the new Heritrix in Eclipse

Documentation for 2.0 is limited compared to 1.x, but a series of notes on using new features will be available on the project wiki.

Resolved Issues

The following tracked issues are recorded as addressed in this 2.0.0-RC1 release:


** New Feature
    * [HER-1328] - Link-value ('PR" or similar) offline calculation, loading into URI history DB

For issues fixed in previous test releases, see:

Reporting Issues

Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.

The project discussion list is hosted at Yahoo Groups.