2.0.0-alpha-2 Release Notes

Heritrix 2.0.0-alpha-2 Release Notes

The autobuild "heritrix-2.0.0-20071024.165839-74-heritrix" is considered the 'alpha-2' release.

This is a preview alpha release of a Heritrix version, 2.0.0, with some major architectural and user-interface changes. It is recommended for volunteer testers, advanced crawl operators and developers who want to gain early familiarity with and offer feedback for our new directions. It is not recommended for production crawling, as it has not undergone the usual level of testing and still has gaps in functionality compared to the latest official (1.12.1) release.

What's New

For Crawl Operators

The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.

The model for configuring the crawler has changed significantly. The same settings exist, but now are collected on 'sheets', of a different on-disk format than in Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains settings from source code default values. A 'global' sheet collects whole-crawler and all-URL settings.

Other settings that only apply to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but any SURT prefix. A single sheet can be mapped to many URLs; an URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)

For Developers

Heritrix has been split into subprojects of related (and reusable) functionality.

The innermost subproject, 'commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.

The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'archive-commons'.

The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'archive-commons' and 'modules'.

The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'archive-commons'.

Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.

This somewhat encumbers usage in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with necessary 3rd-party-classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that local repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproejct or package they now reside.

Getting the 2.0.0-alpha-2 Release

This release should be obtained directly from our continuous build box. The release christened "2.0.0-alpha-2' is specifically build ID heritrix-2.0.0-20071024.165839-74-heritrix.

Fetch the appropriate archive file from this repository directory:

heritrix-2.0.0-20071024.165839-74-heritrix.tar.gz
heritrix-2.0.0-20071024.165839-74-heritrix.zip

You may fetch a later build if available; be aware that such 'bleeding edge' build box builds often have known or unknown problems.

Getting Started

For operators:

2.0 Tutorial

For developers:

Setting up the new Heritrix in Eclipse

Resolved Issues

The following tracked issues are recorded as having been addressed in this 2.0.0-alpha-2 release:

http://webteam.archive.org/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10010

** Bug
    * [HER-616] - The UURI class may throw NullPointerException in getReferencedHost()
    * [HER-1097] - ARCWriter cannot handle records larger than 2 GB
    * [HER-1128] - ExtractorHTML fails to extract FRAME SRC link without whitespace before SRC
    * [HER-1130] - Arc files not being closed
    * [HER-1131] - JVM doesn't terminate after CrawlJobManager.close
    * [HER-1132] - More/better status information via JMX
    * [HER-1133] - bin/heritrix directory woe
    * [HER-1134] - progress-statistics log not updating
    * [HER-1135] - lots of NPEs doing fetches for robot exclusions
    * [HER-1137] - User-agent string not being sent
    * [HER-1138] - crawl-report.txt tallies robot-excluded documents as 'crawled'
    * [HER-1147] - NPE in CachedBdbMap in pjack_settings
    * [HER-1149] - Nicer error when jobs/profile dirs do not exist
    * [HER-1151] - Console should be refreshable
    * [HER-1153] - *WriterProcessor missing their metadata handling
    * [HER-1156] - BdbModule should sync deferred-write DBs on checkpoint
    * [HER-1157] - URI errors are never logged
    * [HER-1158] - Delete reload() method in SheetManager
    * [HER-1159] - LocalizedError -> NonFatal error name, verification
    * [HER-1162] - Constraints should be more robust
    * [HER-1163] - "java.lang.ArithmeticException: / by zero" in org.archive.io.RecordingOutputStream.checkLimits(RecordingOutputStream.java:271)
    * [HER-1166] - BloomFilter32bitSplit artificially limits effective size
    * [HER-1168] - CrawlMapper broken by outlinks/outcandidates split
    * [HER-1169] - CrawlMapper 'check-uri' doesn't work in late position (if fetch status != 0)
    * [HER-1171] - Queues-of-queues (inactive,retired,etc.) grow without bound; OutOfMemoryError
    * [HER-1172] - (patch) ExtractorUniversal file handle leak
    * [HER-1173] - forms with large values get "ArrayIndexOutOfBoundsException: 200000" (jetty limit)
    * [HER-1174] - WebUI insecure
    * [HER-1178] - ARCReader cannot handle ARC-files with records larger than 2 GB
    * [HER-1188] - Deadlock risk in WorkQueueFrontier.receive and WorkQueueFrontier.deepestUri and WorkQueueFrontier.next
    * [HER-1191] - https: (no slashes) in recovery from recovery log causes NPE in UURIFactory.checkHttpSchemeSpecificPartSlashPrefix
    * [HER-1192] - Heritrix extracts and schedules invalid https: URI
    * [HER-1194] - Keys with a module type should have automatic non-null Constraint
    * [HER-1195] - Jobs should not be allowed to have same name as profiles
    * [HER-1196] - OpenMBean exceptions are in my face
    * [HER-1208] - Webui never closes open profiles
    * [HER-1210] - Invalid default Key values don't trigger error
    * [HER-1212] - Heritrix.main should display host:port of webui
    * [HER-1215] - support kickUpdate-like refresh of internal components in 2.x
    * [HER-1216] - Uneditable settings should have icon indicators in editor
    * [HER-1220] - Better null handling in Sheets
    * [HER-1226] - System properties considered harmful
    * [HER-1227] - CrawlURI.MAX_OUTLINKS not being enforced
    * [HER-1228] - Edit launched jobs before running them
    * [HER-1230] - Alerts MIA in web ui
    * [HER-1231] - Web ui needs to be able to recover from checkpoiint and recover from logs
    * [HER-1232] - Kill CrawlScope and its subclasses
    * [HER-1233] - Kill CrawlOrder, simplify CrawlController
    * [HER-1235] - crawlstatus.txt MIA
    * [HER-1239] - FetchHTTP throws NullPointerExcetpion on checkpoint if cookies are disabled
    * [HER-1240] - Editing sheet in active job triggers InstanceAlreadyExistExceptions
    * [HER-1241] - Report click-throughs don't work
    * [HER-1242] - 2.x needs during-crawl seed-add capability
    * [HER-1243] - Completed reports links says "Not Yet Implemented"
    * [HER-1244] - Logs dir shouldn't be hardcoded
    * [HER-1245] - Sheets/Seeds should be read-only for completed jobs
    * [HER-1246] - Seed limit should be 5000, webui should complain when limit exceeded
    * [HER-1248] - (sub)domains of infinite breadth kill crawler throughput
    * [HER-1249] - 2.x settings FetchHttp cookies handling functionality lost
    * [HER-1251] - Gordon can't launch a ready job after editing it.
    * [HER-1252] - Invalid active job shouldn't crash webui
    * [HER-1253] - Add ability to autowire modules
    * [HER-1258] - Constraint failure shows unhelpful default object string
    * [HER-1260] - port view/edit frontier page to 2.x
    * [HER-1262] - webui missing delete options: delete job/profile; delete sheet
    * [HER-1264] - Is stdout/stderr still redirecting to heritrix_out?
    * [HER-1272] - give a useful error on invalid config
    * [HER-1274] - job/profile name with '+' causes error
    * [HER-1276] - The dist tarball should have "jobs" directory
    * [HER-1280] - do not by default GET form action URLs declared as POST, because it can cause problems/complaints
    * [HER-1281] - crawl stalls with plenty of inactive queues; StoredQueue sync/corruption?
    * [HER-1283] - After usual single-VM launch of Heritrix: can access UI, but launching job gets NPE
    * [HER-1292] - ARCWriter writes incorrect length in ARC files for 304 responses.
    * [HER-1293] - Deleting job causes "Error closing environment" LogFileNotFoundException
    * [HER-1294] - Can't change max-toe-threads
    * [HER-1295] - broken-fetch retries triggering unwanted 304 responses
    * [HER-1298] - Too many alerts: nonfatals also becoming alerts
    * [HER-1301] - Internal version strings not replaced by build
    * [HER-1303] - 2.0.0-alpha-2 needs updated release/tester notes

** Improvement
    * [HER-811] - Externalize arc reader/writer as distinct package
    * [HER-907] - Log jmxclient start/pause/resume/stop/etc. events
    * [HER-963] - Refactor settings system so more amenable to remote control
    * [HER-987] - Move build on to  maven 2.0
    * [HER-1003] - Interruptible regex - TimeoutCharSequence?
    * [HER-1181] - apply scope to recovery replays
    * [HER-1206] - make SurtAuthorityQueueAssignmentPolicy the new default if unspecified
    * [HER-1229] - Use CrawlJob everywhere
    * [HER-1256] - upgrade BDB-JE to 3.2.44

** New Feature
    * [HER-1175] - publicsuffix-based queue assignment policy (TopmostAssignedSurtQueueAssignmentPolicy)
    * [HER-1180] - support iso-draft WARC (0.17) in writer/reader
    * [HER-1182] - web ui should show how long Heritrix process has been running
    * [HER-1254] - Per-host tallies of robots exclusions, uncrawled URIs at end of early-terminated crawl
    * [HER-1255] - Add option to dump queues to crawl.log at end of crawl
    * [HER-1297] - [2.x] JMX dumpUris method supporting the same behavior as the frontier search on the wui

** Task
    * [HER-1160] - Need a unit test for ExtractorHTML and META robots tag
    * [HER-1247] - Document config.txt

Reporting Issues

Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.

The project discussion list is hosted at Yahoo Groups.