ARCRecord.readHttpHeader() fails if HTTP response lacks empty line between HTTP headers and HTTP body.

Description

The ARC file TSUNAMI-20-20050508234209-00784-crawling001.archive.org.arc.gz contains a record whose HTTP response has no empty line between the Location: header and the start of the body (<HTML>). The HTTP spec requires an empty line between the headers and the body.

Without that empty line, the ARCRecord.readHttpHeader() method ends up calling the Apache Commons HttpParser.parseHeaders() on the HTTP body lines (starting with <HTML>), which throws an exception because the body lines are not valid headers.
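The failure mode can be reproduced without the ARC file. The sketch below is a hypothetical stand-in for a strict header parser. It is not the Apache Commons HttpParser code, and the class name, method name, exception type, and sample URL are all invented for illustration. It reads lines until an empty line and rejects any line that is neither a "Name: value" pair nor a folded continuation, so a response with no separator makes it run into the body and throw.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a strict header parser; NOT the Apache Commons implementation.
class StrictHeaderParser {

    // Parses "Name: value" lines until the first empty line or the end of the input.
    // Throws IllegalArgumentException on any line that is not a header.
    static Map<String, String> parseHeaders(String text) {
        Map<String, String> headers = new LinkedHashMap<>();
        String lastName = null;
        for (String line : text.split("\r?\n", -1)) {
            if (line.isEmpty()) {
                break; // the separator between headers and body
            }
            boolean folded = line.charAt(0) == ' ' || line.charAt(0) == '\t';
            if (folded && lastName != null) {
                // Continuation of the previous header value.
                headers.put(lastName, headers.get(lastName) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                // A body line such as "<HTML>" ends up here when the separator is missing.
                throw new IllegalArgumentException("Unable to parse header: " + line);
            }
            lastName = line.substring(0, colon).trim();
            headers.put(lastName, line.substring(colon + 1).trim());
        }
        return headers;
    }
}
```

Calling parseHeaders("Location: http://example.com/\r\n\r\n<HTML>\r\n") returns one header, while the same input without the empty line throws on the <HTML> line.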

This exception bubbles up into the ARCRecord constructor, and further up to the ArchiveReader. The ArchiveReader is unable to recover by skipping to the next record because the exception occurs before the ARCRecord is completely initialized.
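One way to picture the recovery problem is a record type whose constructor never lets a header failure escape. The sketch below uses hypothetical types (LenientRecord, requireHeaderLines, and readAll are invented names, not the real ARCRecord or ArchiveReader API) and is an illustration of the idea, not the fix proposed in this issue. The constructor stores the failure as data, so the record is always fully initialized and the caller can move on to the next one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; these are not the real ARCRecord/ArchiveReader classes.
class LenientRecord {
    final String rawHttp;     // raw HTTP response text of the record
    final String headerError; // null if the header check passed

    LenientRecord(String rawHttp) {
        this.rawHttp = rawHttp;
        String error = null;
        try {
            requireHeaderLines(rawHttp);
        } catch (IllegalArgumentException e) {
            // Keep the failure as data instead of aborting construction.
            error = e.getMessage();
        }
        this.headerError = error;
    }

    // Strict check: skipping the first line (the status line), every line
    // before the first empty line must contain a colon.
    static void requireHeaderLines(String rawHttp) {
        String[] lines = rawHttp.split("\r?\n", -1);
        for (int i = 1; i < lines.length && !lines[i].isEmpty(); i++) {
            if (lines[i].indexOf(':') < 0) {
                throw new IllegalArgumentException("Unable to parse header: " + lines[i]);
            }
        }
    }

    // Builds a record for every input, so one bad record does not stop iteration.
    static List<LenientRecord> readAll(List<String> rawRecords) {
        List<LenientRecord> out = new ArrayList<>();
        for (String raw : rawRecords) {
            out.add(new LenientRecord(raw));
        }
        return out;
    }
}
```

With this shape, a malformed record shows up as an entry with a non-null headerError rather than as an exception that ends the read.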

A simple potential fix is to remove the call to the Apache Commons HttpParser.parseHeaders() in ARCRecord.readHttpHeader(). The parsed headers do not appear to be used anywhere important; the HTTP status line is, but the headers are not.
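If only the status line matters, it can be read without touching the header lines at all. The following is a minimal sketch under that assumption. StatusLineOnly and statusCode are hypothetical names, not part of the ARCRecord API, and the 302 response in the usage note is an assumed example.

```java
// Hypothetical helper for illustration; not part of the ARCRecord API.
class StatusLineOnly {

    // Returns the status code from the first line of a raw HTTP response,
    // or -1 if that line does not look like a status line.
    static int statusCode(String rawHttp) {
        int eol = rawHttp.indexOf('\n');
        // trim() also drops the trailing '\r' of a CRLF line ending.
        String first = (eol < 0 ? rawHttp : rawHttp.substring(0, eol)).trim();
        String[] parts = first.split("\\s+", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

Because the header lines are never inspected, statusCode("HTTP/1.1 302 Found\r\nLocation: http://example.com/\r\n<HTML>\r\n") still yields 302 even though the empty line is missing.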

Environment

None

Assignee

Unassigned

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Priority

Major