ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest

Description

This problem was first reported in the OpenWayback issue tracker (https://github.com/iipc/openwayback/issues/224).

Basically, when writing a Server-Not-Modified profile version of a revisit record, the WARC-Payload-Digest must either equal the digest of the original payload or be omitted.

The following two excerpts from the WARC spec make this clear:

5.9 WARC-Payload-Digest
An optional parameter indicating the algorithm name and calculated value of a digest applied to the
payload referred to or contained by the record - which is not necessarily equivalent to the record block.

6.7.3 Profile: Server Not Modified
...
For records using this profile, the payload is defined as the original payload content from which a 'Last-Modified' and/or 'ETag' value was taken.

Currently, the WarcWriterProcessor adds the WARC-Payload-Digest to ALL http records, including revisit records, using CrawlURI.getContentDigestSchemeString().

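To fix this, the digest of the fetched content should only be written as WARC-Payload-Digest for response records. For revisit records, use the data in the revisit profile attached to the CrawlURI instead, and omit the header if no original digest is available.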
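
A minimal sketch of the proposed rule, not the actual WarcWriterProcessor code. The class PayloadDigestSketch, the RecordType enum, the digestHeaders method and the digest strings are all hypothetical names and placeholder values made up for illustration. It assumes the original payload digest, when known, has already been looked up from the revisit profile data.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Heritrix.
final class PayloadDigestSketch {

    enum RecordType { RESPONSE, REVISIT_SERVER_NOT_MODIFIED }

    /**
     * Returns the digest headers to write for one record.
     *
     * @param fetchedDigest  digest of the content fetched just now
     * @param originalDigest digest of the original payload, if known
     *                       (assumed to come from the revisit profile data)
     */
    static Map<String, String> digestHeaders(RecordType type,
                                             String fetchedDigest,
                                             Optional<String> originalDigest) {
        Map<String, String> headers = new LinkedHashMap<>();
        switch (type) {
            case RESPONSE:
                // The record contains the payload, so its own digest applies.
                headers.put("WARC-Payload-Digest", fetchedDigest);
                break;
            case REVISIT_SERVER_NOT_MODIFIED:
                // The payload is the ORIGINAL content: write its digest if
                // known, otherwise leave the optional header out entirely.
                originalDigest.ifPresent(
                        d -> headers.put("WARC-Payload-Digest", d));
                break;
        }
        return headers;
    }

    public static void main(String[] args) {
        // Placeholder digest values, not real hashes.
        System.out.println(digestHeaders(RecordType.RESPONSE,
                "sha1:FETCHED-PLACEHOLDER", Optional.empty()));
        System.out.println(digestHeaders(RecordType.REVISIT_SERVER_NOT_MODIFIED,
                "sha1:FETCHED-PLACEHOLDER", Optional.of("sha1:ORIGINAL-PLACEHOLDER")));
        System.out.println(digestHeaders(RecordType.REVISIT_SERVER_NOT_MODIFIED,
                "sha1:FETCHED-PLACEHOLDER", Optional.empty()));
    }
}
```

The point of the sketch is that the digest of the freshly fetched content is never written for a Server-Not-Modified revisit record, since that content is not the payload the record refers to.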

Environment

None

Status

Assignee

Kristinn Sigurðsson

Reporter

Kristinn Sigurðsson

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Fix versions

Priority

Major