The arc file: TSUNAMI-20-20050508234209-00784-crawling001.archive.org.arc.gz
contains the record
Notice that the empty line between the Location: header and the start of the body (<HTML>) is missing. The HTTP spec requires an empty line between them.
Without that empty line, the ARCRecord.readHttpHeader() method calls the Apache Commons HttpParser.parseHeaders() on the HTTP body lines (starting with <HTML>), which throws an exception because the body lines are not valid headers.
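The failure mode can be reproduced in isolation with a rough sketch like the one below. It is only a stand-in: StrictHeaderSketch and its parseHeaders() are hypothetical, not the commons-httpclient code, and the two sample responses are invented for illustration rather than copied from the ARC file. The sketch assumes a strict parser that rejects any line without a colon before it reaches the blank separator line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class StrictHeaderSketch {

    /** Hypothetical strict header parser; a stand-in, not the real HttpParser. */
    public static Map<String, String> parseHeaders(BufferedReader in) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        String line;
        // The header block ends at the first empty line (or at end of input).
        while ((line = in.readLine()) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                // With the separator missing, a body line such as "<HTML>" lands here.
                throw new IOException("Unable to parse header: " + line);
            }
            headers.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
        }
        return headers;
    }

    /** Returns true if the header block of the given response parses cleanly. */
    public static boolean parses(String response) {
        try (BufferedReader in = new BufferedReader(new StringReader(response))) {
            in.readLine(); // skip the status line
            parseHeaders(in);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Invented sample responses; the second one lacks the blank separator line.
        String wellFormed = "HTTP/1.1 302 Found\r\nLocation: http://example.com/\r\n\r\n<HTML></HTML>\r\n";
        String malformed = "HTTP/1.1 302 Found\r\nLocation: http://example.com/\r\n<HTML></HTML>\r\n";
        System.out.println("well-formed parses: " + parses(wellFormed));
        System.out.println("malformed parses:   " + parses(malformed));
    }
}
```

In this sketch the well-formed response stops at the blank line, while the malformed one runs straight into the body and fails on the first line that has no colon.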
This exception bubbles up into the ARCRecord constructor, and further up to the ArchiveReader. The ArchiveReader is unable to recover by skipping to the next record because the exception occurs before the ARCRecord is completely initialized.
An easy potential fix is to simply remove the call to the Apache Commons HttpParser.parseHeaders() in ARCReader. It looks like the parsed headers aren't actually used anywhere important. The HTTP status line is used, but the headers are not.
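As a rough illustration of relying on the status line alone, the sketch below pulls the status code from the first line of a response and never looks at the headers. StatusLineSketch and its methods are hypothetical names, not part of the ARC reader code, and the sample response is invented. The sketch returns -1 when the first line does not look like an HTTP status line.

```java
public class StatusLineSketch {

    /** Extracts the numeric status code from a status line, or -1 if it cannot be parsed. */
    public static int statusCode(String statusLine) {
        if (statusLine == null || !statusLine.startsWith("HTTP/")) {
            return -1;
        }
        // Expected shape: "HTTP/1.1 302 Found"; the reason phrase is optional.
        String[] parts = statusLine.trim().split("\\s+", 3);
        if (parts.length < 2) {
            return -1;
        }
        try {
            return Integer.parseInt(parts[1]);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    /** Reads only the first line of the response; headers and body are never parsed. */
    public static int statusCodeOfResponse(String response) {
        int eol = response.indexOf('\n');
        String firstLine = eol < 0 ? response : response.substring(0, eol);
        return statusCode(firstLine);
    }

    public static void main(String[] args) {
        // Invented sample: no blank line between the Location header and the body.
        String malformed = "HTTP/1.1 302 Found\r\nLocation: http://example.com/\r\n<HTML></HTML>\r\n";
        System.out.println(statusCodeOfResponse(malformed));
    }
}
```

Because nothing after the first line is inspected, the missing separator between the headers and the body cannot trigger a parse error in this approach.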