MIME-type detection infinite loop due to control character in DOCTYPE declaration.

Description

While importing the NLA 2011 domain harvest, we ran into six WARC files containing records that caused the importer to spin in what appears to be an infinite loop. Even if the loop is not truly infinite, the job ran at 100% CPU for 24 hours before I killed it.

The offending WARC records were all very small HTML pages with one thing in common, a control character in the DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/cD/xhtml1-strict.dtd">

Note the "cD/xhtml" portion of the URL: in the actual WARC file, there is a 0x01 byte between the 'c' and the 'D'.

I suspect this control character is the problem.
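
To find other affected records before running an import, something like the following might help. It is only a sketch in Python, not part of the importer: the function name doctype_control_bytes is hypothetical, and it assumes the record's HTML payload has already been extracted as raw bytes. It reports any control byte (other than tab, LF, and CR) found inside the first DOCTYPE declaration.

```python
import re
from typing import List, Tuple

# Matches a DOCTYPE declaration at the byte level, case-insensitively.
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)

# Control bytes other than tab (0x09), LF (0x0A), and CR (0x0D).
CONTROL_RE = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def doctype_control_bytes(payload: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte value) pairs for control bytes in the first DOCTYPE.

    Hypothetical helper: `payload` is assumed to be the raw HTML body of one
    record. Offsets are relative to the start of `payload`.
    """
    match = DOCTYPE_RE.search(payload)
    if match is None:
        return []
    return [
        (match.start() + hit.start(), hit.group()[0])
        for hit in CONTROL_RE.finditer(match.group())
    ]
```

A record for which this returns a non-empty list could be logged and skipped, or examined by hand, rather than handed to the mime-type detector.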

Environment

None

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major