In this WARC file:
This seems to cause the HTML parser to spin at 100% CPU for hours.
Changing the Nutch plugin's HTML parser implementation from "neko" to "tagsoup" fixes the problem. TagSoup seems to be happy to parse the HTML. I don't know if there are any unwanted consequences of using "tagsoup" instead.
The file is truncated, ending in the unfinished 'script' tag.
This seems to be the cause of the problem for neko. If we edit the WARC record and close that script tag, then neko is happy.
TagSoup seems to be happy either way.
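For anyone who wants to test the "close the script tag" workaround without hand-editing the WARC record, here is a minimal Python sketch of that idea. It is only a heuristic and not part of Nutch or neko: the function names `has_unclosed_script` and `close_truncated_script` are made up for illustration, and it assumes the payload was cut off inside the script body (after the opening tag was complete), which may not match every truncated page.

```python
import re

# Heuristic patterns; these operate on raw bytes so no charset guess is needed.
_OPEN = re.compile(rb"<script\b", re.IGNORECASE)
_CLOSE = re.compile(rb"</script\s*>", re.IGNORECASE)


def has_unclosed_script(html: bytes) -> bool:
    """Return True if the last '<script' opening has no '</script>' after it."""
    last_open = None
    for match in _OPEN.finditer(html):
        last_open = match
    if last_open is None:
        return False  # no script element at all
    return _CLOSE.search(html, last_open.end()) is None


def close_truncated_script(html: bytes) -> bytes:
    """Append a closing tag when the payload ends inside a script element.

    Assumption: the truncation happened inside the script body. If the cut
    fell in the middle of the opening tag itself, this will not repair it.
    """
    if has_unclosed_script(html):
        return html + b"</script>"
    return html


if __name__ == "__main__":
    truncated = b"<html><head><script>var a = 1;"
    print(has_unclosed_script(truncated))
    print(close_truncated_script(truncated))
```

Running the patched bytes through neko would then show whether the unclosed script element alone is what triggers the spin.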
Attached the HTML page.