While importing the NLA 2011 domain harvest, we ran into 6 WARC files containing records that caused the importer to spin in what looks like an infinite loop. If not truly infinite, it at least ran at 100% CPU for 24 hours before I killed the job.
The offending WARC records were all very small HTML pages that shared one feature, a control character in the DOCTYPE header:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/cD/xhtml1-strict.dtd">
Note the "cD/xhtml" part: in the actual WARC file, there is a 0x01 byte between the 'c' and the 'D'.
I suspect this control character is the problem.
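
To help confirm this, here is a minimal Python sketch for flagging such records before they reach the importer. It is not part of the importer; the function name find_doctype_control_bytes and the sample payload are hypothetical, and it assumes the HTML payload of a record has already been extracted as raw bytes.

```python
import re

# Matches a DOCTYPE declaration at the byte level, case-insensitively.
DOCTYPE_RE = re.compile(rb"<!DOCTYPE[^>]*>", re.IGNORECASE)

# C0 control bytes other than tab, LF and CR, plus DEL.
CONTROL_RE = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def find_doctype_control_bytes(payload: bytes):
    """Return (offset, byte value) pairs for control bytes inside the DOCTYPE.

    Hypothetical helper: offsets are relative to the start of the payload.
    """
    match = DOCTYPE_RE.search(payload)
    if match is None:
        return []
    return [
        (match.start() + m.start(), m.group()[0])
        for m in CONTROL_RE.finditer(match.group())
    ]


# Sample payload reproducing the reported case: 0x01 between 'c' and 'D'.
sample = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    b'"http://www.w3.org/TR/xhtml1/c\x01D/xhtml1-strict.dtd">'
)
hits = find_doctype_control_bytes(sample)
for offset, value in hits:
    print(f"control byte 0x{value:02x} at offset {offset}")
```

Running a check like this over the 6 affected files would show whether every record that hangs the importer carries the 0x01 byte, and whether any record with the byte imports cleanly.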