2.0.0-alpha-1 Release Notes

Heritrix 2.0.0-alpha-1 Release Notes

This is a preview alpha release of a Heritrix version, 2.0.0, with some major architectural and user-interface changes. It is recommended for advanced crawl operators and developers who want to gain early familiarity with and offer feedback for our new directions. It is not recommended for production crawling, as it has not undergone the usual level of testing and still has gaps in functionality compared to the latest official (1.12.1) release. (For one example, the Web UI is not yet secured by any login challenge.)

What's New

For Crawl Operators

The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.

The model for configuring the crawler has changed significantly. The same settings exist, but now are collected on 'sheets', of a different on-disk format than in Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains settings from source code default values. A 'global' sheet collects whole-crawler and all-URL settings.

Other settings that only apply to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but any SURT prefix. A single sheet can be mapped to many URLs; an URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)

For Developers

Heritrix has been split into subprojects of related (and reusable) functionality.

The innermost subproject, 'archive-commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.

The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'archive-commons'.

The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'archive-commons' and 'modules'.

The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'archive-commons'.

Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.

This somewhat encumbers usage in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with necessary 3rd-party-classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that loca l repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproejct or package they now reside.

Getting the 2.0.0-alpha-1 Release

This release should be obtained directly from our continuous build box. The release christened "2.0.0-alpha' is specifically build ID heritrix-2.0.0-20070719.171117-4-heritrix.

Fetch the appropriate archive file from this repository directory:

heritrix-2.0.0-20070719.171117-4-heritrix.tar.gz
heritrix-2.0.0-20070719.171117-4-heritrix.zip

You may fetch a later build if available; be aware that often such 'bleeding edge' build box builds have known or unknown problems.

Getting Started

For operators:

2.0 Tutorial

For developers:

Setting up the new Heritrix in Eclipse

Resolved Issues

The following tracked issues are recorded as having been addressed in this 2.0.0-alpha release:

http://webteam.archive.org/jira/secure/ReleaseNote.jspa?version=10021&styleName=Html&projectId=10021&Create=Create

** Bug
    * [HER-266] - Remove item from global context should remove dependencies
    * [HER-1152] - Headers should be consistent in webui
    * [HER-1188] - Deadlock risk in WorkQueueFrontier.receive and WorkQueueFrontier.deepestUri and WorkQueueFrontier.next
    * [HER-1197] - input fields in 2.0 web UI 'logs' screen have no effect / don't submit
    * [HER-1198] - surt prefix sheet association overrides not working in crawl or persisting in JVM relaunches
    * [HER-1200] - unlaunched sheets should tolerate out-of-constraint and other illegal values
    * [HER-1201] - 'remove' setting from global sheet shows 'null' rather than true default value
    * [HER-1202] - sheet-edit page fails to display with NPE in Settings.canOverride()
    * [HER-1217] - sheet overrides still don't work for 'slowly' example

** Improvement
    * [HER-1176] - make queue-assignment-policy overridable, changeable
    * [HER-1199] - should have at least 2 sample/minimal profiles
    * [HER-1203] - highlight/factor-out must-change values (operator-contact-url, operator-from); adapt to use new constraints
    * [HER-1204] - reduce default max-hops of TransclusionDecideRule to 2

** New Feature
    * [HER-1177] - publicsuffix-based HashCrawlMapper reduce-pattern

** Task
    * [HER-1187] - 2.0.0-alpha release notes

Reporting Issues

Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.

The project discussion list is hosted at Yahoo Groups.