Issues
Summary | Assignee | Reporter | Resolution | Created | Updated | ||||||
ToeThread Fatal Exception: "kryo.SerializationException: Buffer limit exceeded" in BdbMultipleWorkQueues.get | Unassigned | Gordon Mohr | Unresolved | Mar 7, 2012 | Jan 24, 2019 | ||||||
heritrix hitting non existent URLs in wix.com/app-market | Unassigned | Vangelis Banos | Unresolved | Aug 31, 2017 | Aug 31, 2017 | ||||||
Heritrix ignores robots.txt | Unassigned | Robert Jäschke | Unresolved | Jun 2, 2016 | Jun 7, 2016 | ||||||
appCtx.getBean() no longer works in scripting console | Unassigned | Robert Jäschke | Unresolved | Jun 2, 2016 | Jun 7, 2016 | ||||||
heritrix is missing facility to shutdown from console | Unassigned | Karl-Philipp Richter | Unresolved | Feb 4, 2016 | Feb 8, 2016 | ||||||
Improve feedback after specifying erroneous command line arguments | Unassigned | Karl-Philipp Richter | Unresolved | Feb 4, 2016 | Feb 4, 2016 | ||||||
RuntimeException in AMQPUrlReceiver kills StarterRestarter? | Unassigned | Andrew Jackson | Unresolved | Sep 17, 2015 | Sep 23, 2015 | ||||||
HTML extractor fails to extract CSS from a link tag | Unassigned | Kristinn Sigurðsson | Unresolved | Aug 20, 2015 | Aug 20, 2015 | ||||||
duplicate user agent records in robots.txt cause overwriting of rules | Unassigned | Robert Jäschke | Unresolved | Jun 25, 2015 | Jun 27, 2015 | ||||||
password | Unassigned | connor taylor | Unresolved | Mar 1, 2015 | Mar 1, 2015 | ||||||
Link Analysis with Apache Giraph (Cluster mode) | Unassigned | Zhang Xiang | Unresolved | Nov 7, 2014 | Nov 7, 2014 | ||||||
url alone not sufficient to identify unique unit of web content, should be something like canonicalize(url+headers) | Unassigned | Noah Levitt | Unresolved | Aug 21, 2009 | Oct 20, 2014 | ||||||
Enable configuration of log4j in libraries | Unassigned | Kristinn Sigurðsson | Unresolved | Oct 3, 2014 | Oct 3, 2014 | ||||||
Using 'sun.security.tools.KeyTool' restricts to Oracle-based JVM's. | Unassigned | Thorbjørn Ravn Andersen | Unresolved | Aug 6, 2014 | Aug 20, 2014 | ||||||
Add option to prefer the non-DNS resolves | Unassigned | Andres Aguilar | Unresolved | Jun 6, 2014 | Jun 6, 2014 | ||||||
[Optionally?] accelerated transition to terminated state after STOP issued | Unassigned | Aaron Ximm | Unresolved | May 21, 2014 | May 21, 2014 | ||||||
Identify programs with minimal Closed Captioning | Unassigned | Roger G Macdonald | Unresolved | Jan 13, 2014 | Jan 13, 2014 | ||||||
H3: manifest of all files (esp. W/ARCs) from a job, access to W/ARCs, ability to delete/clear | Unassigned | Gordon Mohr | Unresolved | Jun 3, 2010 | Jan 10, 2014 | ||||||
WorkQueueFrontier - add log of queue lifecycle | Unassigned | Gordon Mohr | Unresolved | Feb 17, 2007 | Jan 10, 2014 | ||||||
improved completion time estimates (queue/total) | Unassigned | Gordon Mohr | Unresolved | Feb 17, 2007 | Jan 10, 2014 | ||||||
H3: improve crawler capacity/state reporting for participation in pool of crawling machines | Unassigned | Gordon Mohr | Unresolved | Jun 3, 2010 | Jan 10, 2014 | ||||||
console rates off after checkpoint-resume | Unassigned | Gordon Mohr | Unresolved | Sep 3, 2010 | Jan 10, 2014 | ||||||
H3: offer web operations to delete job/dir/files (cleaning up local machine/crawler state) | Unassigned | Gordon Mohr | Unresolved | Jun 3, 2010 | Jan 10, 2014 | ||||||
crawl-manifest.txt not produced by H3; update and improve manifest functionality | Unassigned | Hunter Stern | Unresolved | Jan 14, 2010 | Jan 10, 2014 | Jan 19, 2010 | |||||
evaluate H3 in context of IPv6 | Unassigned | Gordon Mohr | Unresolved | May 11, 2011 | Jan 10, 2014 | ||||||
BASE HREF of enclosing HTML not used by SWFExtractor | Unassigned | Gordon Mohr | Unresolved | Oct 25, 2010 | Jan 10, 2014 | ||||||
link named "tail alert log..." does not show all alerts | Unassigned | Travis Wellman | Unresolved | Sep 20, 2011 | Jan 10, 2014 | ||||||
H3: "add some color" | Unassigned | Gordon Mohr | Unresolved | Aug 31, 2010 | Jan 10, 2014 | ||||||
bring back the progress-bar | Unassigned | Gordon Mohr | Unresolved | Dec 7, 2009 | Jan 10, 2014 | ||||||
deprecate shortReportLineTo method in Reporter interface | Unassigned | Travis Wellman | Unresolved | Nov 16, 2011 | Jan 10, 2014 | ||||||
recovery-log scanning generates more error output than is reasonable | Unassigned | Gordon Mohr | Unresolved | Nov 8, 2010 | Jan 10, 2014 | ||||||
human readable number formats in console | Unassigned | Steve Sisney | Unresolved | Jul 9, 2009 | Jan 10, 2014 | ||||||
Write W/ARC per domain/host/seed/etc. | Unassigned | Gordon Mohr | Unresolved | Feb 17, 2007 | Jan 10, 2014 | ||||||
Springify(?):Simple guided field-based configuration UI | Unassigned | Gordon Mohr | Unresolved | Aug 28, 2008 | Jan 10, 2014 | ||||||
checkpoint directories for logs | Unassigned | Travis Wellman | Unresolved | Aug 17, 2011 | Jan 10, 2014 | ||||||
Limited Parallelism | Unassigned | Shaofeng Liu | Unresolved | Jan 9, 2014 | Jan 9, 2014 | ||||||
support Google's robots.txt wildcards ('*') and end-anchor ('$') | Unassigned | Gordon Mohr | Unresolved | Apr 3, 2009 | Dec 13, 2013 | ||||||
Heritrix install manual | Unassigned | Janis | Unresolved | Nov 22, 2013 | Nov 22, 2013 | ||||||
canonicalization losing docs: make content&result sensitive | Unassigned | Gordon Mohr | Unresolved | Feb 17, 2007 | Nov 14, 2013 | ||||||
Rotate heritrix_out.log | Unassigned | Jean-Pierre Bergamin | Unresolved | Oct 8, 2013 | Oct 8, 2013 | ||||||
Shutdown engine on SEVERE errors | Unassigned | Jean-Pierre Bergamin | Unresolved | Oct 8, 2013 | Oct 8, 2013 | ||||||
Bogus seed numbers in crawl-report | Unassigned | Jean-Pierre Bergamin | Unresolved | Oct 8, 2013 | Oct 8, 2013 | ||||||
support crawling without any dns resolution (can be useful when crawling through proxy) | Unassigned | Noah Levitt | Unresolved | Sep 10, 2013 | Sep 10, 2013 | ||||||
do something about w/arc reading code | Unassigned | Noah Levitt | Unresolved | Sep 10, 2013 | Sep 10, 2013 | ||||||
Possible deadlock | Unassigned | Kristinn Sigurðsson | Unresolved | Jul 29, 2013 | Sep 9, 2013 | ||||||
ftp protocol robots.txt | Unassigned | Noah Levitt | Unresolved | Sep 7, 2013 | Sep 7, 2013 | ||||||
FetchWhois mishandles certain tlds | Unassigned | Noah Levitt | Unresolved | Sep 6, 2013 | Sep 6, 2013 | ||||||
support 'nofollow' in links | Unassigned | Michael Stack | Unresolved | Feb 17, 2007 | Aug 13, 2013 | ||||||
Redirects of robots.txt treated as valid robots, null pointer exception | Unassigned | Niels van Hecke | Unresolved | Jul 3, 2013 | Jul 3, 2013 | ||||||
Controlling download time from a URI | Unassigned | Smriti Malhotra | Unresolved | Jun 24, 2013 | Jun 24, 2013 |
1-50 of 528