
ServerNotModified WARC revisit records incorrectly record WARC-Payload-Digest


This problem was first reported in the OpenWayback issue tracker (https://github.com/iipc/openwayback/issues/224).

Basically, when writing a revisit record that uses the Server-Not-Modified profile, the WARC-Payload-Digest must either equal the digest of the original payload or be omitted.

This is pretty clear when you read the following two excerpts from the WARC spec:

5.9 WARC-Payload-Digest
An optional parameter indicating the algorithm name and calculated value of a digest applied to the
payload referred to or contained by the record - which is not necessarily equivalent to the record block.

6.7.3 Profile: Server Not Modified
For records using this profile, the payload is defined as the original payload content from which a 'Last-Modified' and/or 'ETag' value was taken.

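Currently, the WarcWriterProcessor adds the WARC-Payload-Digest to ALL HTTP records using CrawlURI.getContentDigestSchemeString(). For a Server-Not-Modified revisit, that value is the digest of the content fetched in the current crawl, not of the original payload.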

To fix this, the content digest should only be written as the WARC-Payload-Digest for response records. For revisit records, use the data in the revisit profile attached to the CrawlURI.
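
The proposed rule can be sketched roughly as follows. This is only an illustration of the decision logic, not the actual WarcWriterProcessor code: the class PayloadDigestSketch, the RecordType enum, and the method payloadDigestFor are hypothetical names, and the sketch assumes the original payload digest (if known) has already been looked up from the revisit profile data.

```java
import java.util.Optional;

/** Hypothetical sketch; none of these names are part of the Heritrix API. */
final class PayloadDigestSketch {

    /** The two record kinds this issue is concerned with. */
    enum RecordType { RESPONSE, REVISIT_SERVER_NOT_MODIFIED }

    /**
     * Chooses the value of the WARC-Payload-Digest header.
     *
     * @param type                  kind of record being written
     * @param fetchedContentDigest  digest of the content fetched in this crawl
     * @param originalPayloadDigest digest of the original payload taken from the
     *                              revisit profile data, or null if unknown (assumption)
     * @return the header value, or empty if the header should be omitted
     */
    static Optional<String> payloadDigestFor(RecordType type,
                                             String fetchedContentDigest,
                                             String originalPayloadDigest) {
        if (type == RecordType.RESPONSE) {
            // Response records: the payload is what was just fetched.
            return Optional.ofNullable(fetchedContentDigest);
        }
        // Server-Not-Modified revisit: the payload is the ORIGINAL content,
        // so never fall back to the digest of the current fetch.
        return Optional.ofNullable(originalPayloadDigest);
    }
}
```

With this shape, a revisit record whose original digest is unknown simply gets no WARC-Payload-Digest header, which is the "omitted" case allowed above.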





Kristinn Sigurðsson

