Issues

Corrupt script tag at end of page causes HTML parser infinite loop.
WAX-81
MIME-type detection infinite loop due to control character in DOCTYPE declaration.
WAX-80
Extract HTML meta tags for 'description' and 'keywords' and add to segment.
WAX-79
Slow parsing
WAX-76
Option to skip an ARC record based on size or other filtering policy
WAX-15
nutchwax-0.13/src/java/org/archive/nutchwax/imagesearch/DocIndexer.java:309: error: method filter in class IndexingFilters cannot be applied to given types
WAX-83
Research sorting feature for NutchWAX
WAX-64
Hacks to use with Hadoop-0.20 from Cloudera
WAX-75
Nutch HTML parser infinite loop.
WAX-82
HTML noindex and nofollow enforced in HTMLParser?
WAX-78
DateAdder should have an option to determine if norms should be used.
WAX-60
In IndexSearcher.translateHits(), when de-duping use a FieldSelector when loading the document to only load the site field.
WAX-54
Add record to index for non-text documents
WAX-44
Digest differs between ARCReader and Wayback index-arc.
WAX-5
More aggressive collapsing by site in search results
WAX-17
NutchWAX requires very long timeouts on remotely hosted ARC files
WAX-30
Add URL canonicalization to pageranker
WAX-33
contrib/archive/README.txt needs clarification
WAX-31
NutchWAX home page issue tracker link still points to sf.net
WAX-29
Investigate malformed URL report during date-adder
WAX-28
Investigate why reading content from archive file uses such small chunks
WAX-14
Add DFS read/write support to DateAdder
WAX-13
Add reading of archive files from DFS
WAX-18
Integrate nutchwax with Access Control Oracle
WAX-40
Cannot use rsync URLs, no handler for rsync protocol.
WAX-70