Mime-type detection infinite loop due to control character in DOCTYPE declaration.

Description

While importing the NLA 2011 domain harvest, we ran into 6 warc files which contained records that caused the importer to spin in an infinite loop. If not infinite, at least 100% CPU for 24 hours, then I killed the job.

The offending warc records were all very small HTML pages, all of which shared a common attribute: a control character in the DOCTYPE header:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/cD/xhtml1-strict.dtd">

See the "cD/xhtml" bit...well in the actual warc file, there is a 0x01 byte between the 'c' and 'D'.

I suspect this control-character is the problem.

Environment

None

Status

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major
Configure