Corrupt script tag at end of page causes HTML parser infinite loop.
Mime-type detection infinite loop due to control character in DOCTYPE declaration.
Extract HTML meta tags for 'description' and 'keywords' and add to segment.
Option to skip an ARC record based on size or other filtering policy
nutchwax-0.13/src/java/org/archive/nutchwax/imagesearch/DocIndexer.java:309: error: method filter in class IndexingFilters cannot be applied to given types
Research sorting feature for NutchWAX
Hacks to use with Hadoop-0.20 from Cloudera
Nutch HTML parser infinite loop.
HTML noindex and nofollow enforced in HTMLParser?
DateAdder should have an option to determine if norms should be used.
In IndexSearcher.translateHits(), when de-duping use a FieldSelector when loading the document to only load the site field.
Add record to index for non-text documents
Digest differs between ARCReader and Wayback index-arc.
More aggressive collapsing by site in search results
NutchWAX requires very long timeouts on remotely hosted ARC files
Add URL canonicalization to pageranker
contrib/archive/README.txt needs clarifications
NutchWAX home page issue tracker still points to sf.net
Investigate malformed URL report during date-adder
Investigate why reading content from archive file uses such small chunks
Add DFS read/write support to DateAdder
Add reading of archive files from DFS
Integrate nutchwax with Access Control Oracle
Cannot use rsync URLs, no handler for rsync protocol.