Release Notes - 1.14.4 (May 2010)
These are the project wiki Release Notes for the 1.14.4 release.
Release 1.14.4 is a 'micro' release containing a number of small bug fixes and newly requested features.
The 1.14.4 release is now available at TK.
Support for FTP transactions in WARC records (HER-1577)
Heritrix now supports recording full FTP transactions in WARC records. For each FTP URL retrieved, the control conversation is recorded in a WARC metadata record with Content-Type: application/ftp; msgtype=control-conversation, the payload data is recorded in a WARC resource record with Content-Type: application/ftp; msgtype=payload-data, and FTP fetch metadata (as well as outlinks) are recorded in a corresponding WARC metadata record.
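The mapping from one FTP fetch to its three WARC records can be sketched as follows. This is an illustrative model only, not Heritrix code: the names PlannedRecord and plan_ftp_records are hypothetical, and the Content-Type of the fetch-metadata record is left as None because these notes do not state it.

```python
from typing import List, NamedTuple, Optional


class PlannedRecord(NamedTuple):
    """Hypothetical description of one WARC record for an FTP fetch."""
    warc_type: str               # WARC record type named in the notes
    content_type: Optional[str]  # None where the notes give no value
    holds: str                   # what the record carries


def plan_ftp_records(url: str) -> List[PlannedRecord]:
    """Return the three records described above for a single FTP URL."""
    return [
        PlannedRecord(
            "metadata",
            "application/ftp; msgtype=control-conversation",
            "control conversation for " + url,
        ),
        PlannedRecord(
            "resource",
            "application/ftp; msgtype=payload-data",
            "payload data for " + url,
        ),
        PlannedRecord(
            "metadata",
            None,  # not specified in these release notes
            "fetch metadata and outlinks for " + url,
        ),
    ]


# Example URL is made up for illustration.
for record in plan_ftp_records("ftp://ftp.example.org/pub/file.txt"):
    print(record.warc_type, record.content_type, record.holds, sep=" | ")
```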
Other WARC corrections (HER-1659)
WARC files written by Heritrix now consistently identify themselves as WARC version "1.0" (HER-1648) and will grow to the 1 GB size recommended by the specification.
Several issues that caused errors when running Heritrix on Windows, related to improper quoting or path separators, have been corrected.
Seeds with Internationalized Domain Names (IDN) better supported (HER-1711)
Encoding problems which interfered with specification of some Internationalized Domain Name seeds have been corrected.
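For readers unfamiliar with IDN handling, the sketch below shows the kind of conversion involved: a Unicode hostname is mapped to its ASCII ("xn--") form before it can be used in DNS lookups. It uses Python's built-in idna codec purely for illustration; the helper names and the example hostname are assumptions, and this does not show how Heritrix itself performs the conversion.

```python
def to_ascii_host(host: str) -> str:
    """Convert a Unicode hostname to its ASCII-compatible form."""
    return host.encode("idna").decode("ascii")


def to_unicode_host(ascii_host: str) -> str:
    """Convert an ASCII-compatible hostname back to Unicode."""
    return ascii_host.encode("ascii").decode("idna")


# Hypothetical seed host used only as an example.
seed_host = "bücher.example"
print(to_ascii_host(seed_host))
print(to_unicode_host(to_ascii_host(seed_host)))
```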
Hosts report expanded to include novel/duplicate bytes/URLs counts (HER-1650)
Crawl statistics now collect, and the 'Hosts' report includes, counts of the URLs and total content byte-sizes deemed either 'novel' or 'duplicate' by the duplication-reduction/persist-history mechanisms, if enabled on a crawl.
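The per-host bookkeeping described above amounts to keeping separate URL and byte counters for novel and duplicate content. A minimal sketch, assuming each fetch is available as a (host, content_bytes, is_duplicate) tuple; the HostTally class and tally_hosts function are hypothetical and do not mirror Heritrix's internal classes or report format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class HostTally:
    """Hypothetical per-host counters for novel and duplicate content."""
    novel_urls: int = 0
    novel_bytes: int = 0
    dup_urls: int = 0
    dup_bytes: int = 0


def tally_hosts(fetches: Iterable[Tuple[str, int, bool]]) -> Dict[str, HostTally]:
    """Accumulate URL and byte counts per host, split by duplicate status."""
    report: Dict[str, HostTally] = defaultdict(HostTally)
    for host, content_bytes, is_duplicate in fetches:
        entry = report[host]
        if is_duplicate:
            entry.dup_urls += 1
            entry.dup_bytes += content_bytes
        else:
            entry.novel_urls += 1
            entry.novel_bytes += content_bytes
    return dict(report)


# Made-up sample data for illustration.
sample = [
    ("example.org", 1200, False),
    ("example.org", 1200, True),
    ("example.org", 300, False),
    ("example.net", 50, True),
]
for host, counts in sorted(tally_hosts(sample).items()):
    print(host, counts)
```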
Trailing '*' tolerated in robots.txt Disallow/Allow rules (HER-1620)
Heritrix now tolerates the trailing '*' wildcard that webmasters sometimes add, unnecessarily, to their robots.txt Disallow/Allow rules. (Leading or internal wildcards are not yet supported.)
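Because robots.txt rules are prefix matches, a trailing '*' adds nothing, and tolerating it can be as simple as stripping it before comparing. The sketch below illustrates that idea; rule_matches is a hypothetical helper, not the Heritrix implementation. It assumes that an empty rule value applies to no path and that a rule consisting only of '*' applies to every path. Leading and internal '*' characters are compared literally, in line with the limitation noted above.

```python
def rule_matches(rule_path: str, url_path: str) -> bool:
    """Return True if a Disallow/Allow rule path applies to url_path.

    Trailing '*' characters are dropped before the prefix comparison.
    Any other '*' is treated as a literal character.
    """
    if not rule_path:
        # Assumption: an empty rule value applies to nothing.
        return False
    prefix = rule_path.rstrip("*")
    return url_path.startswith(prefix)


# Example rule and paths are made up for illustration.
print(rule_matches("/private*", "/private/page.html"))
print(rule_matches("/private", "/private/page.html"))
print(rule_matches("/private*", "/public/index.html"))
```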
A number of performance, memory-retention, and deadlock-risk issues occasionally affecting the implementation class CachedBdbMap were identified. Fixes have been applied, and the class has also been replaced with a simpler implementation, ObjectIdentityBdbCache, focused specifically on Heritrix's common use cases.
In addition to the usual suspects, this release includes contributed fixes or functionality from:
- Paul Baclace
- Sergey Khenkin
All Tracked Changes
The following 44 tracked issues are recorded as addressed in this 1.14.4 release: