Open issues

Shutdown engine on SEVERE errors
HER-2054
occasional strange memory leak after crawl finishes
HER-2036
Heritrix in Amazon Cloud
HER-1991
MirrorWriterProcess Example
HER-1968
Fix/rethink default case-flattening canonicalization (LowercaseRule)
HER-1945
heritrix hitting non existent URLs in wix.com/app-market
HER-2096
Add support for extracting URLs from img srcset attribute
HER-2094
appCtx.getBean() no longer works in scripting console
HER-2093
Improve feedback after specifying erroneous command line arguments
HER-2091
heritrix is missing facility to shutdown from console
HER-2090
RuntimeException in AMQPUrlReceiver kills StarterRestarter?
HER-2088
HTML extractor fails to extract CSS from a link tag
HER-2086
password
HER-2079
Link Analysis with Apache Giraph (Cluster mode)
HER-2077
Enable configuration of log4j in libraries
HER-2075
Limited Parallelism
HER-2060
Rotate heritrix_out.log
HER-2055
support crawling without any dns resolution (can be useful when crawling through proxy)
HER-2050
do something about w/arc reading code
HER-2049
ftp protocol robots.txt
HER-2047
FetchWhois mishandles certain tlds
HER-2046
Possible deadlock
HER-2045
Controlling download time from a URI
HER-2043
should follow redirects from /robots.txt and respect directives found
HER-2038
Upgrade Guava to latest version
HER-2033
The Heritrix Crawler can proceed with active links only. In XPath we have to stop at //a; we can't make any changes in these links.
HER-2028
Error connecting to https site
HER-2026
ARCRecord.computeMetaData() fails due to space in mime-type in record header.
HER-2025
ARCRecord.readHttpHeader() fails if HTTP response lacks empty line between HTTP headers and HTTP body.
HER-2023
dns: scheme ignored when creating SURT
HER-2012
Possible race condition in org.archive.util.FileUtils.ensureWriteableDirectory() leading to data loss
HER-2009
Heritrix3: how to run an incremental crawl? If anyone knows, please mail me. Thank you. imnu06@126.com
HER-2008
option to let heritrix decide automatically on number of toe threads based on available heap
HER-2007
way to reenqueue failed url other than in original spot at the head of the queue
HER-2003
Website in .au Heritrix gave SEVERE log message regarding public suffix list
HER-2000
Expand ExtractorHTML to extract html from conditional comments
HER-1998
ToeThread Fatal Exception: "kryo.SerializationException: Buffer limit exceeded" in BdbMultipleWorkQueues.get
HER-1996
CLONE - Force generation of report files
HER-1995
web ui/api unresponsive during PersistLoadProcessor preload
HER-1992
Link with '%' character does not get encoded, leading to 400 - bad request
HER-1988
REST upload crawler-bean.cxml to XML extension rather than CXML
HER-1986
response codes 404, and 500 due to invalid URIs
HER-1982
H3: flawed failed-checkpoint recovery in CheckpointService.requestCrawlCheckpoint may prevent future checkpoints
HER-1980
Bug in shouldProcessRule in WriterPoolProcessor. It doesn't work
HER-1978
Bug in shouldProcessRule in WriterPoolProcessor. It doesn't work
HER-1977

Shutdown engine on SEVERE errors

When Heritrix encounters SEVERE errors, it should try to stop the running jobs and then shut down the engine. An NFS mount that contained Heritrix's job directory went down last night. This led to a flood of exceptions of the form:

Oct 06, 2013 5:19:40 PM org.archive.crawler.frontier.BdbWorkQueue peekItem
SEVERE: peekItem failure; retrying (in thread 'ToeThread #23: ')
com.sleepycat.je.EnvironmentFailureException: (JE 4.1.6) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.6) /mnt/sammy-data/heritrix_jobs/two-hops-no-requests-html-only/state fetchTarget of 0x112/0x941961 parent IN=62673966 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x183/0x2a17d parent.getDirty()=true state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed.
        at com.sleepycat.je.EnvironmentFailureException.wrapSelf(EnvironmentFailureException.java:196)
        at com.sleepycat.je.dbi.EnvironmentImpl.checkIfInvalid(EnvironmentImpl.java:1439)
        at com.sleepycat.je.Database.checkEnv(Database.java:1778)
        at com.sleepycat.je.Database.openCursor(Database.java:625)
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.getNextNearestItem(BdbMultipleWorkQueues.java:297)
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.get(BdbMultipleWorkQueues.java:258)
        at org.archive.crawler.frontier.BdbWorkQueue.peekItem(BdbWorkQueue.java:103)
        at org.archive.crawler.frontier.WorkQueue.peek(WorkQueue.java:173)
        at org.archive.crawler.frontier.WorkQueueFrontier.findEligibleURI(WorkQueueFrontier.java:651)
        at org.archive.crawler.frontier.AbstractFrontier.next(AbstractFrontier.java:452)
        at org.archive.crawler.framework.ToeThread.run(ToeThread.java:133)
Caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.6) /mnt/sammy-data/heritrix_jobs/two-hops-no-requests-html-only/state fetchTarget of 0x112/0x941961 parent IN=62673966 IN class=com.sleepycat.je.tree.BIN lastFullVersion=0x183/0x2a17d parent.getDirty()=true state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed.
        at com.sleepycat.je.tree.IN.fetchTarget(IN.java:1332)
        at com.sleepycat.je.tree.BIN.fetchTarget(BIN.java:1367)
        at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2499)
        at com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1545)
        at com.sleepycat.je.dbi.CursorImpl.getNextWithKeyChangeStatus(CursorImpl.java:1692)
        at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1617)
        at com.sleepycat.je.Cursor.retrieveNextAllowPhantoms(Cursor.java:2485)
        at com.sleepycat.je.Cursor.retrieveNext(Cursor.java:2304)
        at com.sleepycat.je.Cursor.getNext(Cursor.java:1013)
        at org.archive.crawler.frontier.BdbMultipleWorkQueues.getNextNearestItem(BdbMultipleWorkQueues.java:313)
        ... 6 more
Caused by: java.io.FileNotFoundException: /mnt/sammy-data/heritrix_jobs/two-hops-no-requests-html-only/state/00000112.jdb (Input/output error)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:233)
        at java.io.RandomAccessFile.<init>(RandomAccessFile.java:118)
        at com.sleepycat.je.log.FileManager$1.<init>(FileManager.java:995)
        at com.sleepycat.je.log.FileManager.openFileHandle(FileManager.java:994)
        at com.sleepycat.je.log.FileManager.getFileHandle(FileManager.java:890)
        at com.sleepycat.je.log.LogManager.getLogSource(LogManager.java:1074)

The local disk that contains heritrix_out.log filled up within minutes because of this flood of exceptions. In such non-recoverable cases, the engine should shut down while trying to save as much state as possible.
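
One possible way to approach this, sketched under assumptions rather than taken from the Heritrix code base: since the messages above are emitted through java.util.logging, a custom Handler could count SEVERE records and run a shutdown action once a threshold is reached. The class name SevereShutdownHandler, the threshold parameter, and the shutdownAction callback are all hypothetical; what the callback should actually do (checkpoint, pause jobs, stop the engine) is left open here.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/**
 * Hypothetical handler, not part of Heritrix: counts SEVERE log records and
 * runs a caller-supplied shutdown action once, when a threshold is reached.
 */
final class SevereShutdownHandler extends Handler {
    private final int threshold;
    private final Runnable shutdownAction; // assumption: supplied by the caller
    private final AtomicInteger severeCount = new AtomicInteger();
    private final AtomicBoolean triggered = new AtomicBoolean();

    SevereShutdownHandler(int threshold, Runnable shutdownAction) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be >= 1");
        }
        this.threshold = threshold;
        this.shutdownAction = shutdownAction;
    }

    @Override
    public void publish(LogRecord record) {
        // Ignore everything below SEVERE.
        if (record == null
                || record.getLevel().intValue() < Level.SEVERE.intValue()) {
            return;
        }
        // Fire at most once, even if many threads log concurrently.
        if (severeCount.incrementAndGet() >= threshold
                && triggered.compareAndSet(false, true)) {
            shutdownAction.run();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    boolean isTriggered() {
        return triggered.get();
    }

    public static void main(String[] args) {
        // Demo logger name is arbitrary; it is not a Heritrix logger.
        Logger logger = Logger.getLogger("demo.frontier");
        logger.setUseParentHandlers(false); // keep the demo output quiet
        logger.addHandler(new SevereShutdownHandler(3,
                () -> System.out.println("SEVERE threshold reached: save state and stop")));
        for (int i = 0; i < 5; i++) {
            logger.severe("peekItem failure; retrying");
        }
    }
}
```

In this sketch the action runs on the thread that logged the record, so a real shutdown would probably hand off to a separate thread to avoid blocking a ToeThread. Firing only once also means the handler does nothing to limit log volume on its own; stopping the threads that keep retrying is what would end the flood.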

Status