Edge case in parsing WARC header - org.archive.io.warc.WARCRecord.java:parseHeaders()

Description

org.archive.io.warc.WARCRecord.java:parseHeaders()
Ran into an edge case: the web server returned no Content-Type header, but the cdx recorded a Content-Type of application/http.

Offending Code - line 125:
Header[] h = HttpParser.parseHeaders(in, WARC_HEADER_ENCODING);
for (int i = 0; i < h.length; i++) {
    m.put(h[i].getName(), h[i].getValue());
}
The Apache parser returns an array of name/value pairs, but 'm' is a HashMap, so a field name that appears more than once keeps only its last value. In most cases the WARC wrapper's Content-Type (application/http) is overwritten by the HTTP response's own Content-Type, which appears later in the header, before the cdx is generated. In this record the original response had no Content-Type, so the WARC Content-Type was passed on to the indexer. The WARC Content-Type may need to be parsed into a separate field.
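
To illustrate the overwrite described above, here is a minimal standalone sketch. It does not use the Heritrix or Apache classes. The class name HeaderCollisionSketch, the methods lastWins and keepAll, and the text/html value are all hypothetical, chosen only for the example. lastWins mirrors the put() loop from the report. keepAll is one possible way to retain every value per field name, not a proposed patch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HeaderCollisionSketch {

    /** Mirrors the reported loop: a later field with the same name replaces the earlier one. */
    static Map<String, String> lastWins(String[][] fields) {
        Map<String, String> m = new HashMap<>();
        for (String[] f : fields) {
            m.put(f[0], f[1]);
        }
        return m;
    }

    /** Hypothetical alternative: keep every value per field name, in the order seen. */
    static Map<String, List<String>> keepAll(String[][] fields) {
        Map<String, List<String>> m = new LinkedHashMap<>();
        for (String[] f : fields) {
            m.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f[1]);
        }
        return m;
    }

    public static void main(String[] args) {
        // Usual case: the HTTP response carries its own Content-Type (value assumed here).
        String[][] withHttpType = {
            {"Content-Type", "application/http; msgtype=response"},
            {"Content-Type", "text/html"}
        };
        // Edge case from the report: only the WARC wrapper's Content-Type is present.
        String[][] withoutHttpType = {
            {"Content-Type", "application/http; msgtype=response"}
        };

        System.out.println(lastWins(withHttpType).get("Content-Type"));
        System.out.println(lastWins(withoutHttpType).get("Content-Type"));
        System.out.println(keepAll(withHttpType).get("Content-Type"));
    }
}
```

In the first case the wrapper type is silently replaced. In the second it survives and would reach the indexer, which matches the behaviour reported here.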

Offending WARC header:
WARC-Target-URI: http://cahs.webs.com/
WARC-Date: 2011-11-02T23:44:47Z
WARC-Payload-Digest: sha1:SSSAM4BJXECKXZOJGENR3BH642NNL7XM
WARC-IP-Address: 216.52.115.51
WARC-Record-ID: <urn:uuid:24818775-0ad4-405b-bc42-93027916ec46>
Content-Type: application/http; msgtype=response
Content-Length: 20522

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Date: Wed, 02 Nov 2011 23:44:46 GMT
Connection: close

Environment

None

Status

Assignee

Unassigned

Reporter

Adam Miller

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Heritrix 3.1.0

Priority

Major