Corrupt script tag at end of page causes HTML parser infinite loop.

Description

In this WARC file:

http://aidata401-bu.us.archive.org:8008/31/ARCHIVEIT-194-20090920063827-00252-crawling109.us.archive.org.warc.gz

The record:

https://twitter.com/sustainword

seems to cause the HTML parser to spin at 100% CPU for hours on end.

Environment

None

Activity

Aaron Binns
April 19, 2011, 4:21 PM

Changing the Nutch plugin's HTML parser implementation from "neko" to "tagsoup" fixes the problem. TagSoup parses the HTML without trouble. I don't know whether using "tagsoup" instead has any unwanted consequences.

Aaron Binns
April 19, 2011, 5:08 PM

The file is truncated, ending in the unfinished 'script' tag.

This seems to be the cause of the problem for neko. If we edit the WARC record and close that script tag, then neko is happy.

TagSoup seems to be happy either way.
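
As a rough illustration of the manual fix described above (closing the dangling script tag before the page reaches the parser), here is a minimal Python sketch. It is not part of Nutch, NekoHTML, or TagSoup. The function names has_unclosed_script and close_trailing_script are hypothetical, and the regex-based detection is an assumption that only covers the simple case of a page truncated inside its last script element.

```python
import re

# Hypothetical helpers for illustration only; not part of Nutch, NekoHTML, or TagSoup.
OPEN_SCRIPT = re.compile(r"<script\b", re.IGNORECASE)
CLOSE_SCRIPT = re.compile(r"</script\s*>", re.IGNORECASE)


def has_unclosed_script(html: str) -> bool:
    """Return True if the last <script ...> in the page has no closing </script>."""
    opens = [m.start() for m in OPEN_SCRIPT.finditer(html)]
    if not opens:
        return False
    # Look for a closing tag anywhere after the last opening tag.
    return CLOSE_SCRIPT.search(html, opens[-1]) is None


def close_trailing_script(html: str) -> str:
    """Repair a page that was truncated inside its final script element."""
    if not has_unclosed_script(html):
        return html
    last_open = [m.start() for m in OPEN_SCRIPT.finditer(html)][-1]
    if ">" not in html[last_open:]:
        # The opening tag itself was cut off mid-attribute: drop the fragment.
        return html[:last_open]
    # The opening tag is complete but the element never ends: close it.
    return html + "</script>"
```

A pre-filter like this could run on the record body before it is handed to the parser. It would only sidestep the truncation case seen in this record, not fix the underlying loop in the parser.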

Aaron Binns
April 19, 2011, 5:10 PM

Attached the HTML page.

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major