2.0.0 Release Notes

Heritrix 2.0.0 Release Notes

Heritrix 2.0.0 is now available.

Heritrix 2.0.0 offers an updated user interface, a new settings framework and configuration file formats, new options for controlling the ordering of collection of URIs and sites, more options for remote-controlling a crawl engine, and many other big fixes and improvements.

What's New

For Crawl Operators

The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.

The model for configuring the crawler has changed significantly. The same settings exist, but now are collected on 'sheets', of a different on-disk format than in Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains settings from source code default values. A 'global' sheet collects whole-crawler and all-URL settings.

Other settings that only apply to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but any SURT prefix. A single sheet can be mapped to many URLs; an URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)

Both individual URIs and the queues of URIs within a crawler frontier may now have an integer 'precedence' value assigned to them, via separate URI and queue precedence policies configured via settings sheets. A URI's precedence affects the order in which it is collected relative to other URIs in the same queue. A queue's precedence affects which queues actively provide URIs for crawling when many queues are available and waiting. Lower numbers mean earlier (higher ranked) consideration - 1 is intended as the highest precedence. Policies already included with Heritrix, combined with sheet override configuration, can achieve a variety of desired prioritizations of crawl work. As with other crawler components, custom policies can also be implemented in Java.

For Developers

Heritrix has been split into subprojects of related (and reusable) functionality.

The innermost subproject, 'commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.

The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'commons'.

The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'commons' and 'modules'.

The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'archive'.

Many classes have moved or been renamed.

Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.

This somewhat encumbers usage in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with necessary 3rd-party-classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that local repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproejct or package they now reside.

Known Limitations

Converting 1.x crawl configurations to 2.0 remains a manual process. For now, we recommend consulting a 1.x order.xml (or various override settings.xml files) while using the web UI to construct a analogous crawl configuration. (While a few classes or settings have moved, names and capabilities remain substantially alike.) We are still collecting feedback on Heritrix 2 configuration formats and options and expect to make changes in future releases to ease crawl design, transition from previous configurations, and archival or the entire crawl configuration.

Heritrix 2 has received very little testing so far on platforms other than Linux; functionality and support scripts which worked in 1.x will likely need updating, even if currently included in the distribution. An issue to note under Windows is that viewing crawl job files/directories in other programs (including a command-shell) may prevent a runnable job from moving between states (ready to active, active to completed), as in Heritrix 2.0 this requires a directory rename. (See HER-1477: http://webteam.archive.org/jira/browse/HER-1477 .)

Documentation is very limited compared to the 1.x user manual and developer manual.

The focus for Heritrix 2.0 has been to deploy the new prioritization features, refactored internals, and settings options. In some cases lesser-used old functionality has been removed or is temporarily broken. A few notable areas of removed functionality:

The old specialized Scope classes have been eliminated; there is a 'scope' setting on relevant modules, which always expects a module of type DecideRule (which includes sequence compositions of DecideRules).

The contributed AdaptiveRevisitFrontier and has not yet been updated for operation under 2.0.0.

The 2.x settings system has no current analogue to the 'refinements' feature of Heritrix 1.x.

You can view a complete list of tracked issues reported in Heritrix 2.0.0 as released.

If you discover other gaps relative to 1.x, please report them via our public issue tracker..

Getting the 2.0.0 Release

This release may be obtained from our Sourceforge files area:

Heritrix2 Releases

Getting Started

For operators:

2.0 Tutorial

For developers:

Setting up the new Heritrix in Eclipse

Documentation for 2.0 is limited compared to 1.x, but a notes on using new features will be available on the project wiki. See Precedence Feature Notes for information about the URI and queue prioritization features.

Resolved Issues

The following tracked issues are recorded as addressed in this 2.0.0 release:


type key summary status

Unable to locate JIRA server for this macro. It may be due to Application Link configuration.

For issues fixed in previous test releases, see:

Reporting Issues

Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.

The project discussion list is hosted at Yahoo Groups.