This problem was first reported in the OpenWayback issue tracker (https://github.com/iipc/openwayback/issues/224).
Basically, when writing a revisit record that uses the Server Not Modified profile, the WARC-Payload-Digest must either equal the digest of the original payload or be omitted.
This is pretty clear when you read the two following excerpts from the WARC spec:
An optional parameter indicating the algorithm name and calculated value of a digest applied to the
payload referred to or contained by the record - which is not necessarily equivalent to the record block.
6.7.3 Profile: Server Not Modified
For records using this profile, the payload is defined as the original payload content from which a 'Last-Modified' and/or 'ETag' value was taken.
Currently, the WarcWriterProcessor adds the WARC-Payload-Digest to ALL HTTP records using CrawlURI.getContentDigestSchemeString().
The fix is to do this only for response records. For revisit records, use the data in the revisit profile attached to the CrawlURI instead.
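A rough sketch of the selection logic described above, not the actual WarcWriterProcessor code. The class PayloadDigestSketch, the RecordType enum, and the method payloadDigestFor are hypothetical names made up for illustration, and the digest strings are placeholders. It assumes the original payload digest, if known, has already been read from the revisit profile.

```java
import java.util.Optional;

// Hypothetical illustration only; none of these names exist in Heritrix.
public class PayloadDigestSketch {

    enum RecordType { RESPONSE, REVISIT_SERVER_NOT_MODIFIED }

    /**
     * Picks the value to write as WARC-Payload-Digest, or empty if the
     * header should be omitted.
     *
     * @param fetchedDigest  digest of the content fetched in this crawl
     * @param originalDigest digest of the original payload taken from the
     *                       revisit profile, or null if it is not known
     */
    static Optional<String> payloadDigestFor(RecordType type,
                                             String fetchedDigest,
                                             String originalDigest) {
        switch (type) {
            case REVISIT_SERVER_NOT_MODIFIED:
                // The payload is defined as the ORIGINAL payload, so the
                // digest of the (empty) 304 response must not be used.
                // Omit the header when the original digest is unknown.
                return Optional.ofNullable(originalDigest);
            case RESPONSE:
            default:
                // Response records carry the payload they describe.
                return Optional.ofNullable(fetchedDigest);
        }
    }

    public static void main(String[] args) {
        // Placeholder digest values, not real hashes.
        System.out.println(payloadDigestFor(
                RecordType.RESPONSE, "sha1:FETCHED", null));
        System.out.println(payloadDigestFor(
                RecordType.REVISIT_SERVER_NOT_MODIFIED, "sha1:FETCHED", "sha1:ORIGINAL"));
        System.out.println(payloadDigestFor(
                RecordType.REVISIT_SERVER_NOT_MODIFIED, "sha1:FETCHED", null));
    }
}
```

Returning an empty Optional for the revisit case keeps the "omit" branch explicit, so the caller only writes the header when a value is present.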
Confirmed the fix as working. Did discover two additional issues with this type of deduplication though.
1. No annotation is made.
2. Response headers are not saved to WARC.
Since #1 is very easy to fix, I'll roll it into this fix. #2 requires a closer look.