Digest differs between ARCReader and Wayback index-arc.

Description

I noticed that the digest computed by the ARCReader/ARCRecord classes, as used by the NutchWAX Importer.java, differs from the digest computed by the Wayback-1.2.1 'index-arc' tool. On further examination, it only seems to happen on archive records that are redirects with 0-byte bodies. I'm guessing it's the 0-byte body that causes the difference, not the redirect; it just happens that in the test data I was using, the redirects are the only records with 0-byte bodies.

ARC: locdata140-bu.us.archive.org/2/CONGRESS-047-20071128224914-00005-crawling08.us.archive.org.arc.gz
URL: rules.senate.gov/history/images/hearings_nav_off.gif

To reproduce: generate a CDX for that ARC file with Wayback-1.2.1, then read the same record with the ARCReader/ARCRecord classes bundled with NutchWAX 0.12, and you'll see that the two digests differ.
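
A quick way to tell which of the two tools is hashing only the (empty) body is to compare each reported digest against the digest of zero bytes. The sketch below assumes both tools report a Base32-encoded SHA-1, optionally prefixed with "sha1:"; that assumption, and the helper names base32_sha1, normalize and is_empty_body_digest, are mine and not taken from either code base.

```python
import base64
import hashlib


def base32_sha1(payload: bytes) -> str:
    """Return the Base32-encoded SHA-1 digest of the given bytes."""
    return base64.b32encode(hashlib.sha1(payload).digest()).decode("ascii")


def normalize(digest: str) -> str:
    """Strip surrounding whitespace and an optional 'sha1:' prefix, then uppercase."""
    digest = digest.strip()
    if digest.lower().startswith("sha1:"):
        digest = digest[len("sha1:"):]
    return digest.upper()


# Digest of a zero-byte body (assumes Base32 SHA-1, as noted above).
EMPTY_BODY_DIGEST = base32_sha1(b"")


def is_empty_body_digest(reported: str) -> bool:
    """True if a reported digest equals the digest of zero bytes."""
    return normalize(reported) == EMPTY_BODY_DIGEST


if __name__ == "__main__":
    print("empty-body digest:", EMPTY_BODY_DIGEST)
```

Passing the digest from the CDX line and the digest from ARCRecord to is_empty_body_digest should show whether either one matches the zero-byte value; a digest that does not match was presumably computed over something other than the empty body.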

Environment

None
Obsolete

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Priority

Major