Release Notes - 1.14.4 (May 2010)

These are the project wiki Release Notes for the 1.14.4 release.

Release 1.14.4 is a 'micro' release with a number of small bugfixes and newly requested features.

The 1.14.4 release is now available at TK.

Notable Changes

Support for FTP transactions in WARC records (HER-1577)

Heritrix now supports recording full FTP transactions in WARC records. For each FTP URL retrieved, the control conversation is recorded in a WARC metadata record with Content-Type: application/ftp; msgtype=control-conversation, the payload data is recorded in a WARC resource record with Content-Type: application/ftp; msgtype=payload-data, and FTP fetch metadata (as well as outlinks) are recorded in a corresponding WARC metadata record.
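As a rough illustration of the layout described above, the following sketch lists the three records written per FTP URL. The class and method names are hypothetical and are not part of Heritrix; the order shown is an assumption, and the Content-Type of the third record is left unset because these notes do not state it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the per-URL record layout; not Heritrix code.
final class FtpWarcLayoutSketch {

    /** One record planned for an FTP fetch. */
    static final class PlannedRecord {
        final String warcType;    // value of the WARC-Type header
        final String contentType; // null where the release notes give no value
        final String holds;       // what the record body contains

        PlannedRecord(String warcType, String contentType, String holds) {
            this.warcType = warcType;
            this.contentType = contentType;
            this.holds = holds;
        }
    }

    /** Returns the three records described in the release notes (order assumed). */
    public static List<PlannedRecord> recordsForFtpFetch() {
        return Arrays.asList(
            new PlannedRecord("metadata",
                "application/ftp; msgtype=control-conversation",
                "FTP control conversation"),
            new PlannedRecord("resource",
                "application/ftp; msgtype=payload-data",
                "retrieved payload data"),
            new PlannedRecord("metadata",
                null,
                "FTP fetch metadata and outlinks"));
    }
}
```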

Other WARC corrections (HER-1659)

Written WARC files now consistently identify as WARC version "1.0" (HER-1648) and will grow to the 1GB size recommended by the specification.

Windows annoyances fixed (HER-510, HER-1622, HER-1625)

Several issues that caused errors or other difficulties when using Heritrix on Windows, related to improper quoting or path separators, have been corrected.

Seeds with Internationalized Domain Names (IDN) better supported (HER-1711)

Encoding problems which interfered with specification of some Internationalized Domain Name seeds have been corrected.
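For readers unfamiliar with IDN, the sketch below shows the ASCII-compatible (Punycode) form of a non-ASCII hostname using the standard java.net.IDN class. It does not reproduce the Heritrix fix itself; the class name and the sample hostname are purely illustrative.

```java
import java.net.IDN;

// Illustrative only: shows the ASCII form of an internationalized hostname.
final class IdnSeedExample {

    /** Converts a possibly non-ASCII hostname to its ASCII-compatible form. */
    public static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "b\u00fccher" is "bucher" with a u-umlaut, written as an escape so
        // the example does not depend on the source file's encoding.
        System.out.println(toAsciiHost("b\u00fccher.example"));
        // prints "xn--bcher-kva.example"
    }
}
```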

Hosts report expanded to include novel/duplicate bytes/URLs counts (HER-1650)

Crawl statistics now collect, and the 'Hosts' report now includes, counts of URLs and total content bytes deemed either 'novel' or 'duplicate' by the duplication-reduction/persist-history mechanisms, when those mechanisms are enabled on a crawl.
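The kind of per-host tally this implies can be sketched as follows. This is a minimal illustration under assumed names (HostTallySketch, record, get), not the statistics code that ships with Heritrix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-host tally of novel vs. duplicate URLs and bytes.
final class HostTallySketch {

    // Per-host counters: [novelUrls, novelBytes, duplicateUrls, duplicateBytes]
    private final Map<String, long[]> tallies = new LinkedHashMap<>();

    /** Records one fetched URL of the given size for the given host. */
    public void record(String host, long bytes, boolean duplicate) {
        long[] t = tallies.computeIfAbsent(host, h -> new long[4]);
        int base = duplicate ? 2 : 0;
        t[base] += 1;         // URL count
        t[base + 1] += bytes; // byte total
    }

    /** Returns a copy of the counters for a host (all zeros if unseen). */
    public long[] get(String host) {
        long[] t = tallies.get(host);
        return t == null ? new long[4] : t.clone();
    }
}
```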

Trailing '*' tolerated in robots.txt Disallow/Allow rules (HER-1620)

Heritrix will now tolerate a trailing '*' wildcard sometimes added by webmasters (though not necessary) in their robots.txt Disallow/Allow rules. (Leading or internal wildcards are not yet supported.)
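Because Disallow/Allow rules are matched as path prefixes, a trailing '*' adds nothing and can simply be dropped before matching. The sketch below illustrates that idea; the class and method names are hypothetical and this is not the Heritrix implementation.

```java
// Hypothetical sketch of tolerating a trailing '*' in a robots.txt rule path.
final class TrailingWildcardRule {

    private TrailingWildcardRule() {}

    /** Strips any trailing '*' characters from a rule path. */
    public static String normalize(String rulePath) {
        int end = rulePath.length();
        while (end > 0 && rulePath.charAt(end - 1) == '*') {
            end--;
        }
        return rulePath.substring(0, end);
    }

    /** Prefix match of a URL path against the normalized rule path. */
    public static boolean matches(String rulePath, String urlPath) {
        String prefix = normalize(rulePath);
        // In this sketch a rule that normalizes to an empty path matches nothing.
        return !prefix.isEmpty() && urlPath.startsWith(prefix);
    }
}
```

With this normalization, "/private/*" and "/private/" behave identically, while a leading or internal '*' is left in place and treated as a literal character.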

CachedBdbMap changes, replacement (HER-1677, HER-1658, HER-1705, HER-1609)

A number of performance, memory-retention, and deadlock-risk issues occasionally affecting the implementation class CachedBdbMap were identified. Fixes have been applied, and the class has also been replaced with ObjectIdentityBdbCache, a simpler implementation focused specifically on Heritrix's common use cases.

Additional contributors

In addition to the usual suspects, this release includes contributed fixes or functionality from:

  • Paul Baclace
  • Sergey Khenkin

All Tracked Changes

The following 44 tracked issues are recorded as addressed in this 1.14.4 release:

T Key Summary Status
