All progress on a large-scale crawl stopped. On inspection, I found that the job page would not load, but I could still trigger most (although not all) of the reports. Further, progress-statistics.log stopped getting new lines once this condition was hit, which suggests that the statistics thread was being blocked by whatever caused the stall.
I proceeded to dump the heap and thread state (thread dump attached).
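For anyone wanting to capture the same thread state from inside the JVM rather than with an external tool, the following is a minimal sketch using the standard java.lang.management API (available since Java 6). The class name ThreadStateDump and its output format are made up for illustration; this is not part of Heritrix.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of Heritrix: renders the state of every live thread.
class ThreadStateDump {

    static String dump() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Only request lock details if this JVM can report them.
        boolean monitors = mx.isObjectMonitorUsageSupported();
        boolean synchronizers = mx.isSynchronizerUsageSupported();

        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : mx.dumpAllThreads(monitors, synchronizers)) {
            sb.append('"').append(info.getThreadName()).append("\" ")
              .append(info.getThreadState());
            // The lock the thread is blocked or waiting on, if any.
            if (info.getLockName() != null) {
                sb.append(" on ").append(info.getLockName());
            }
            // The thread currently holding that lock, if known.
            if (info.getLockOwnerName() != null) {
                sb.append(" owned by \"").append(info.getLockOwnerName()).append('"');
            }
            sb.append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(dump());
    }
}
```

The "owned by" part is the interesting bit for a hang like this one, since it shows which thread holds the lock that the blocked threads are waiting for.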
I then tried to end the crawl. This must have shaken loose whatever was causing the hang, as the end-of-crawl reports were all generated normally and the 'job page' in the web UI became responsive again.
I also noted that three URLs that had been locked 'in progress' were terminated at that point with a -5 (Unexpected runtime exception). All three threads holding those URLs seemed to be waiting to update the relevant queue in the Frontier, leading me to suspect that this has something to do with BDB.
I have another crawl, nearly identical except for its seeds, that has been running for much longer. So clearly this is not something that can be easily reproduced.
Java(TM) SE Runtime Environment (build 1.6.0-b105)
Red Hat Enterprise Linux Server release 6.4 (Santiago)
Thread report of second crawl to experience this.
Thread dump for the second crawl to become stuck
For the first stuck crawl.
The exception <DaemonThread name="Checkpointer"/> caught exception: java.lang.NullPointerException is entirely within a BDB-JE thread. BDB has its own notion of checkpointing, which is separate from Heritrix's and appears to be part of normal BDB operation.
https://forums.oracle.com/thread/2557641 makes reference to the same stack trace. "We assume this is a JE bug at this point. We'll fix it for the next patch release (Q3), under bug number [#22631]." (Not sure where to find that bug report, might require logging in according to http://stackoverflow.com/questions/2599961/where-is-the-oracle-bug-database)
I think we can tentatively conclude that the exception left BDB in some kind of borked state prone to deadlock.
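If this happens again, it might be worth asking the JVM directly whether it sees a lock cycle. Below is a rough sketch using ThreadMXBean.findDeadlockedThreads() (Java 6+). The class name DeadlockCheck is hypothetical. Note the assumption baked in: this only detects true cycles on Java monitors and java.util.concurrent ownable synchronizers, so a hang caused by threads waiting on a condition that never gets signalled would not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Heritrix or BDB-JE.
class DeadlockCheck {

    // Returns one line per thread the JVM considers deadlocked; empty if none.
    static List<String> deadlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // findDeadlockedThreads() also covers ownable synchronizers, but is
        // only usable when the JVM supports synchronizer usage monitoring.
        long[] ids = mx.isSynchronizerUsageSupported()
                ? mx.findDeadlockedThreads()
                : mx.findMonitorDeadlockedThreads();

        List<String> report = new ArrayList<String>();
        if (ids == null) {
            return report; // the JVM reports no lock cycle
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info == null) {
                continue; // thread exited between the two calls
            }
            report.add("\"" + info.getThreadName() + "\" waits on "
                    + info.getLockName() + " owned by \""
                    + info.getLockOwnerName() + "\"");
        }
        return report;
    }

    public static void main(String[] args) {
        List<String> report = deadlockedThreads();
        if (report.isEmpty()) {
            System.out.println("No deadlocked threads detected.");
        }
        for (String line : report) {
            System.out.println(line);
        }
    }
}
```

An empty result would not rule out the BDB theory; it would only mean the stall is not a plain lock cycle that the JVM can see.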
The second deadlocked thread dump looks like it didn't upload correctly; it's identical to the first one.