ARCRecord.readHttpHeader() fails if HTTP response lacks empty line between HTTP headers and HTTP body.

Description

The ARC file TSUNAMI-20-20050508234209-00784-crawling001.archive.org.arc.gz contains the following record:

http://www.pconline.com.cn/market/price/method=price2&areaId=5/&bigtypeId=20805/JavaScript1.2 219.136.244.102 20050509001107 text/html 653
HTTP/1.0 302 Not Found
Server: thttpd
Content-Type: text/html
Date: Mon, 09 May 2005 00:11:07 GMT
Last-Modified: Mon, 09 May 2005 00:11:07 GMT
Accept-Ranges: bytes
Connection: close
Location:http://arch.pconline.com.cn/market/price/method=price2&areaId=5/&bigtypeId=20805/JavaScript1.2
<HTML>
<HEAD><TITLE>302 Not Found</TITLE></HEAD>
<BODY BGCOLOR="#cc9999" TEXT="#000000" LINK="#2020ff" VLINK="#4040cc">
<H2>302 Not Found</H2>
The requested URL '/market/price/method=price2&areaId=5/&bigtypeId=20805/JavaScript1.2' was not found on this server.
<HR>
<ADDRESS><A HREF="http://www.acme.com/software/thttpd/">thttpd</A></ADDRESS>
</BODY>
</HTML>

Notice that there is no empty line between the Location: header and the start of the body (<HTML>). HTTP/1.0 (RFC 1945) requires an empty line (a bare CRLF) between the headers and the body.

Without that empty line, the ARCRecord.readHttpHeader() method passes the HTTP body lines (starting with <HTML>) to Apache Commons HttpClient's HttpParser.parseHeaders(), which throws an exception because the body lines are not valid headers.
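For illustration only, here is a minimal sketch of a more tolerant header reader. It is not the Heritrix or Commons HttpClient code; the class LenientHeaderReader and its methods are hypothetical names. It assumes the status line has already been consumed and that no single line exceeds the mark limit. The idea is to treat the first line that does not look like a header as the start of the body instead of raising an error.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Heritrix or Apache Commons HttpClient.
class LenientHeaderReader {

    // Assumption: no header or body line is longer than this many characters.
    private static final int MAX_LINE = 8192;

    /**
     * Reads "Name: value" lines until an empty line, end of input, or the
     * first line that does not look like a header. A non-header line is
     * pushed back so the caller can read it as the first line of the body.
     * Folded (continuation) header lines are not handled in this sketch.
     */
    static Map<String, String> readHeaders(BufferedReader in) throws IOException {
        Map<String, String> headers = new LinkedHashMap<>();
        while (true) {
            in.mark(MAX_LINE);
            String line = in.readLine();
            if (line == null || line.isEmpty()) {
                break; // end of input, or the normal header/body separator
            }
            int colon = line.indexOf(':');
            if (colon <= 0 || !isToken(line.substring(0, colon))) {
                in.reset(); // missing separator: this line belongs to the body
                break;
            }
            headers.put(line.substring(0, colon), line.substring(colon + 1).trim());
        }
        return headers;
    }

    /** True if s is a non-empty HTTP token (no controls, spaces, or separators). */
    static boolean isToken(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c <= 32 || c >= 127 || "()<>@,;:\\\"/[]?={}".indexOf(c) >= 0) {
                return false;
            }
        }
        return !s.isEmpty();
    }
}
```

With the record above, such a reader would stop at the <HTML> line and leave it available to whatever consumes the body.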

This exception bubbles up into the ARCRecord constructor, and further up to the ArchiveReader. The ArchiveReader is unable to recover by skipping to the next record because the exception occurs before the ARCRecord is completely initialized.
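One way to make the reader recoverable, sketched below under stated assumptions, is to finish initializing the record from the ARC metadata line before attempting any header parsing, and to record a header failure instead of throwing it from the constructor. RecordSketch and strictParseHeaders are hypothetical stand-ins, not Heritrix classes. The sketch assumes the five-field metadata line shown in the record above, where the last field is the record length.

```java
import java.util.List;

// Hypothetical stand-in for a record type; not the Heritrix ARCRecord.
class RecordSketch {
    final String url;
    final long length;          // byte count taken from the ARC metadata line
    final String headerProblem; // null when the strict header parse succeeded

    RecordSketch(String metadataLine, List<String> httpLines) {
        // Initialize everything a reader needs to skip this record first.
        String[] fields = metadataLine.split(" ");
        this.url = fields[0];
        this.length = Long.parseLong(fields[fields.length - 1]);

        // Only then attempt the header parse, and keep any failure as data.
        String problem = null;
        try {
            strictParseHeaders(httpLines);
        } catch (IllegalStateException e) {
            problem = e.getMessage();
        }
        this.headerProblem = problem;
    }

    /**
     * Stand-in for a strict parser: rejects any line before the empty
     * separator line that is not of the form "Name: value".
     */
    static void strictParseHeaders(List<String> lines) {
        for (int i = 1; i < lines.size(); i++) { // index 0 is the status line
            String line = lines.get(i);
            if (line.isEmpty()) {
                return; // reached the header/body separator
            }
            if (line.indexOf(':') <= 0) {
                throw new IllegalStateException("Unable to parse header: " + line);
            }
        }
    }
}
```

Because length is set before the parse is attempted, a caller could still advance past the record's bytes and continue with the next record even when headerProblem is non-null.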

A simple potential fix is to remove the call to HttpParser.parseHeaders() from ARCRecord.readHttpHeader(). The parsed headers do not appear to be used anywhere important; the HTTP status line is used, but the headers are not.
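As a rough sketch of what parsing only the status line could look like, the following uses plain string handling rather than any Heritrix or Commons HttpClient API. StatusLineOnly and statusCode are hypothetical names introduced for illustration.

```java
// Hypothetical helper illustrating status-line-only parsing.
class StatusLineOnly {

    /**
     * Extracts the numeric status code from a line such as
     * "HTTP/1.0 302 Not Found". Header lines are never inspected.
     *
     * @throws IllegalArgumentException if the line is not an HTTP status line
     */
    static int statusCode(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            throw new IllegalArgumentException("Not an HTTP status line: " + statusLine);
        }
        // NumberFormatException is a subclass of IllegalArgumentException.
        return Integer.parseInt(parts[1]);
    }
}
```

For the record above, statusCode("HTTP/1.0 302 Not Found") yields 302 regardless of what follows the status line.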

Environment

None

Status

Assignee

Unassigned

Reporter

Aaron Binns

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Heritrix 3.1.1

Priority

Major