While importing the NLA 2011 domain harvest, we ran into 6 WARC files containing records that caused the importer to spin in what looks like an infinite loop. If not infinite, it at least ran at 100% CPU for 24 hours before I killed the job.
The offending WARC records are all very small HTML pages that share a common attribute: a control character in the DOCTYPE header:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/cD/xhtml1-strict.dtd">
See the "cD/xhtml" bit? In the actual WARC file, there is a 0x01 byte between the 'c' and the 'D'.
I suspect this control character is the problem.
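To find other records with the same pattern, a check along these lines might help. This is only a rough sketch: the class and method names are made up for illustration and are not part of the importer, Tika, or any parser mentioned here. It assumes the record body is already available as a byte array and only looks for control bytes between "<!DOCTYPE" and the first ">".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of the importer or Tika.
final class DoctypeControlCharCheck {
    private static final String MARKER = "<!DOCTYPE";

    /**
     * Returns the byte offset of the first control byte inside the DOCTYPE
     * declaration, or -1 if there is no DOCTYPE or it contains none.
     * Tab, LF and CR are treated as ordinary whitespace.
     */
    static int firstControlByteInDoctype(byte[] body) {
        // ISO-8859-1 maps every byte to exactly one char, so char offsets equal byte offsets.
        String text = new String(body, StandardCharsets.ISO_8859_1);

        // Case-insensitive search for the start of the DOCTYPE declaration.
        int start = -1;
        for (int i = 0; i + MARKER.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, MARKER, 0, MARKER.length())) {
                start = i;
                break;
            }
        }
        if (start < 0) {
            return -1;
        }

        // The declaration ends at the first '>' (or at end of input if it is never closed).
        int end = text.indexOf('>', start);
        if (end < 0) {
            end = text.length();
        }

        for (int i = start; i < end; i++) {
            char c = text.charAt(i);
            boolean plainWhitespace = c == '\t' || c == '\n' || c == '\r';
            if ((c < 0x20 && !plainWhitespace) || c == 0x7F) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same DOCTYPE as above, with the 0x01 byte restored between 'c' and 'D'.
        String sample = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/c\u0001D/xhtml1-strict.dtd\">";
        byte[] body = sample.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("first control byte at offset " + firstControlByteInDoctype(body));
    }
}
```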
Unfortunately, unlike WAX-81, changing the underlying HTML parser from "neko" to "tagsoup" does not work around the problem.
A-ha! The problem is actually in the MIME detection via Tika. In the stack trace, you can see that the DOCTYPE is being processed to determine which version of XHTML/XML it is. I can only assume the control character confuses it.
Another example found in the AIT collection 2438:
There is a control character at the end of the DOCTYPE header, which causes the MIME detection to explode.
Stack trace:
Updating the Xerces library to 2.11.0 fixes this.