2.0.0-beta Release Notes

Heritrix 2.0.0-beta Release Notes

The autobuild "heritrix-2.0.0-20071204.012926-112-heritrix" is considered the 'beta' release.

This is a beta release of a Heritrix version, 2.0.0, with some major architectural and user-interface changes.

Compared to the 2.0.0-alpha releases, this release adds a new framework for controlling the order of URI collection, both within a queue of related URIs and between such queues.

All core/required features of 2.0.0 are completed, and additional work for 2.0 will be directed at bug fixes and enhanced usability/packaging. All users interested in trying 2.0.0 features and helping test the software before official release are invited to try this release.

What's New

For Crawl Operators

The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.

The model for configuring the crawler has changed significantly. The same settings exist, but now are collected on 'sheets', of a different on-disk format than in Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains settings from source code default values. A 'global' sheet collects whole-crawler and all-URL settings.

Other settings that only apply to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but any SURT prefix. A single sheet can be mapped to many URLs; an URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)

Both individual URIs and the queues of URIs within a crawler frontier may now have an integer 'precedence' value assigned to them, via separate URI and queue precedence policies configured via settings sheets. A URI's precedence affects the order in which it is collected relative to other URIs in the same queue. A queue's precedence affects which queues actively provide URIs for crawling when many queues are available and waiting. Lower numbers mean earlier (higher ranked) consideration - 1 is intended as the highest precedence. Policies already included with Heritrix, combined with sheet override configuration, can achieve a variety of desired prioritizations of crawl work. As with other crawler components, custom policies can also be implemented in Java.

For Developers

Heritrix has been split into subprojects of related (and reusable) functionality.

The innermost subproject, 'commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.

The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'archive-commons'.

The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'archive-commons' and 'modules'.

The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'archive-commons'.

Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.

This somewhat encumbers usage in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with necessary 3rd-party-classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that local repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproejct or package they now reside.

Getting the 2.0.0-beta Release

This release should be obtained directly from our continuous build box. The release christened "2.0.0-beta' is specifically build ID heritrix-2.0.0-20071204.012926-112-heritrix.

Fetch the appropriate archive file from this repository directory:

heritrix-2.0.0-20071204.012926-112-heritrix.tar.gz
heritrix-2.0.0-20071204.012926-112-heritrix.zip

You may fetch a later build if available; be aware that such 'bleeding edge' build box builds often have known or unknown problems.

Getting Started

For operators:

2.0 Tutorial

For developers:

Setting up the new Heritrix in Eclipse

Documentation for 2.0 is limited compared to 1.x, but a series of notes on using new features is planned for the wiki up to and through the official release.

Resolved Issues

The following tracked issues are recorded as addressed in this 2.0.0-beta release:

http://webteam.archive.org/jira/secure/ReleaseNote.jspa?version=10030&styleName=Html&projectId=10021&Create=Create

Release Notes - Heritrix - Version Heritrix 2.0.0-beta

** Bug
    * [HER-661] - "failed get" "user-mapped section open" exceptions
    * [HER-1155] - .utf8 metadata files need improvement
    * [HER-1205] - PREPARING/PREPARED statuses are inconsistent
    * [HER-1211] - Key.java should have flag for initialization
    * [HER-1257] - fixup of profile (eg: when settings keys evolve) awkward, perhaps impossible in web UI
    * [HER-1271] - sheet editor shouldn't lose edits in progress
    * [HER-1283] - After usual single-VM launch of Heritrix: can access UI, but launching job gets NPE
    * [HER-1286] - Processor's initialTasks not called if added mid-crawl
    * [HER-1287] - using webui 'copy' of completed job yields job that dies at launch
    * [HER-1302] - Missing/inconsistent command-line options
    * [HER-1304] - 2.x web ui needs quit/shutdown options
    * [HER-1305] - sheet details view, "Add New Element" -> "Create a new object" -> "Pick an object type to create" shows no available types
    * [HER-1306] - BeanShellProcesser, BeanShellDecideRule useless in 2.x
    * [HER-1314] - Seeds/Crawl reports broken for active crawl
    * [HER-1315] - Add/Import URIs is broken
    * [HER-1316] - ToeThread not honoring ProcessResult
    * [HER-1317] - Sheets and special characters
    * [HER-1319] - webui > about > web ui preferences broken
    * [HER-1320] - possibly incorrect ProtocolSocketFactory
    * [HER-1322] - Update relative URI derelativization, esp. of naked ?query-strings, to match RFC3986
    * [HER-1329] - requestCrawlStop doesn't wait for ToeThreads to finish
    * [HER-1330] - Committing a sheet via Web UI erases it
    * [HER-1331] - Jetty ClassLoader woe
    * [HER-1336] - NPE during checkpoint recovery
    * [HER-1337] - No usage message if incorrect command line specified
    * [HER-1341] - URIException in managerThread
    * [HER-1344] - Make path variables actually work
    * [HER-1345] - SelfTest mode on command line?
    * [HER-1349] - Sheets/Settings break for Test URL withempty or random input
    * [HER-1357] - webui 'terminate' on a crawl that is paused causes a small batch of new crawling (release of pause lock)
    * [HER-1359] - CookieSpecBase ConcurrentModificationException alerts; BdbCookieStorage needs to be default
    * [HER-1361] - ClassCast while closing StatisticsTracker
    * [HER-1362] - Processor initialTasks being invoked twice at beginning of a crawl
    * [HER-1363] - Module setting default values aren't immutable
    * [HER-1365] - large seed list submit in web ui gives blank page; java.io.IOException: FULL at crawler

** Improvement
    * [HER-1311] - Prioritzation refactoring: implement single privileged Frontier manager thread; mediation queues
    * [HER-1325] - Improve frontier report to display precedence-related information

** New Feature
    * [HER-1103] - feature request: implement skipHttpHeader/readHttpHeader for WARCRecords
    * [HER-1323] - core 'precedence' enhancements: precedence-per-URI and -per-queue; swappable policy; frontier preference for higher precedence
    * [HER-1324] - Advanced precedence policy: by discovered/collected count/size thresholds
    * [HER-1327] - Link-value ('PR'-like) precedence policy


Reporting Issues

Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.

The project discussion list is hosted at Yahoo Groups.