MIME-type detection enters an infinite loop when the DOCTYPE declaration contains a control character.

Description

While importing the NLA 2011 domain harvest, we ran into six WARC files containing records that caused the importer to spin in what appears to be an infinite loop. If not truly infinite, it ran at 100% CPU for 24 hours before I killed the job.

The offending WARC records were all very small HTML pages that shared one attribute: a control character in the DOCTYPE declaration:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/cD/xhtml1-strict.dtd">

Note the "cD/xhtml" part: in the actual WARC file, there is a 0x01 byte between the 'c' and the 'D'.

I suspect this control character is the problem.
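
To find other affected records before they reach the importer, the record payload can be scanned for control bytes inside the DOCTYPE declaration. The following is a minimal sketch in Python, not part of the importer itself; the function name `doctype_control_bytes` and the `head_size` limit are hypothetical choices made for illustration, and it assumes the payload is already available as raw bytes.

```python
import re

# Matches a DOCTYPE declaration up to its closing '>' (bytes, case-insensitive).
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)

# C0 control bytes other than tab, line feed, and carriage return, plus DEL.
CONTROL_RE = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def doctype_control_bytes(payload: bytes, head_size: int = 4096):
    """Return (offset, byte value) pairs for control bytes inside the DOCTYPE.

    Only the first head_size bytes are examined (an assumed limit). Offsets
    are relative to the start of payload. The result is an empty list when
    there is no DOCTYPE or when the DOCTYPE contains no control bytes.
    """
    match = DOCTYPE_RE.search(payload[:head_size])
    if match is None:
        return []
    return [
        (match.start() + hit.start(), hit.group()[0])
        for hit in CONTROL_RE.finditer(match.group())
    ]
```

Calling it on a payload like the one described above, with a 0x01 byte between the 'c' and the 'D', should report a single hit whose byte value is 1. Control bytes outside the DOCTYPE are deliberately ignored.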

Environment

None

Activity

Aaron Binns
April 19, 2011, 4:26 PM

Unfortunately, unlike WAX-81, changing the underlying HTML parser from "neko" to "tagsoup" does not work around the problem.

Aaron Binns
April 19, 2011, 5:05 PM

A-ha! The problem is actually in the MIME detection via Tika. In the stack trace, you can see that the DOCTYPE is being processed to determine which version of XHTML/XML it is. I can only assume the control character confuses it.
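
If the control character really is what confuses the detector, one possible stopgap is to hand the detector a sanitized copy of the first bytes of the document. The sketch below is a hypothetical illustration in Python, not something tried in this ticket: the name `sanitized_head` and the 8192-byte window are assumptions, and whether a cleaned buffer actually avoids the loop is untested here.

```python
import re

# C0 control bytes other than tab, line feed, and carriage return, plus DEL.
CONTROL_RE = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def sanitized_head(payload: bytes, head_size: int = 8192) -> bytes:
    """Return the first head_size bytes with control bytes replaced by spaces.

    Each control byte is replaced by one space rather than removed, so byte
    offsets in the returned buffer still line up with the original payload.
    """
    return CONTROL_RE.sub(b" ", payload[:head_size])
```

Only the copy used for detection would be altered; the stored record content stays untouched.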

Aaron Binns
August 31, 2011, 4:03 AM

Another example found in the AIT collection 2438:

There is a control character at the end of the DOCTYPE declaration, which causes the MIME detection to explode.
Stack trace:

Aaron Binns
May 24, 2012, 6:04 PM

Updating the Xerces library to 2.11.0 fixes this.

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major