All issues

Class not found when importing within a Hadoop MR job.
WAX-69
nutchwax-0.13/src/java/org/archive/nutchwax/imagesearch/DocIndexer.java:309: error: method filter in class IndexingFilters cannot be applied to given types
WAX-83
Nutch HTML parser infinite loop.
WAX-82
Corrupt script tag at end of page causes HTML parser infinite loop.
WAX-81
Mime-type detection infinite loop due to control character in DOCTYPE declaration.
WAX-80
Extract HTML meta tags for 'description' and 'keywords' and add to segment.
WAX-79
HTML noindex and nofollow enforced in HTMLParser?
WAX-78
JDK6u23 breaks GzippedInputStream & W/ARCReaders with different GZIP handling
WAX-77
Slow parsing
WAX-76
Hacks to use with Hadoop-0.20 from Cloudera
WAX-75
Add support for storing fields in compressed form.
WAX-74
Change default value of searcher.fieldcache in nutch-site.xml to 'false'
WAX-73
Simplify build system to copy NW files into Nutch dirs and use Nutch build.xml
WAX-72
NutchWAX-required libraries not included in nutch-1.0.job
WAX-71
Cannot use rsync URLs, no handler for rsync protocol.
WAX-70
Compatibility with {index+segment}s created by NutchWAX 0.10.
WAX-68
Nutch OpenOffice parser does not pass along metadata.
WAX-67
Index documents without crawldb or linkdb.
WAX-66
Some odd-ball characters display as '?' in search results.
WAX-65
Research sorting feature for NutchWAX
WAX-64
LengthNormUpdater returning error code if no fields in index have norms is inconvenient.
WAX-63
Add ability to configure HTTP headers to support caching.
WAX-62
DateAdder should have an option to determine if norms should be used.
WAX-60
Wrong log() function used in PageRankScoringFilter.
WAX-59
Need tool to update an existing index's norms based on pagerank information.
WAX-58
nutchwax command-driver doesn't properly enclose arguments in quotes.
WAX-57
Date-adder allows for duplicate dates to be added to a record.
WAX-56
NutchWaxBean's command-line searching should emit title along with other document metadata.
WAX-55
In IndexSearcher.translateHits(), when de-duping use a FieldSelector when loading the document to only load the site field.
WAX-54
IndexMerging parallel indexes fails when index is empty.
WAX-53
Add option to NutchWaxBean to specify directory where index+segments are to be found.
WAX-52
Enhance index merging to combine parallel indexes.
WAX-51
Add "num hits to find" option to NutchWaxBean
WAX-50
Add "hitsPerSite" option to NutchWaxBean
WAX-49
Use NutchWAX configurable query filter for site and url fields.
WAX-48
Stop storing document key in "orig" field in index, synthesize it as needed from the "url" and "digest" fields.
WAX-47
Add option to DumpParallelIndex to output only single field.
WAX-46
Add record to index for non-text documents
WAX-44
bug in Hurricane Katrina
WAX-43
Add option to continue importing if an arcfile cannot be read.
WAX-42
Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher.
WAX-41
Integrate nutchwax with Access Control Oracle
WAX-40
Write more efficient, specialized segment parse_text merging
WAX-39
Build omits necessary libraries from .job file.
WAX-38
Per-collection segments not supported in distributed/master-slave configuration.
WAX-37
Some additional diagnostics on connecting results to segments and snippets would be very helpful.
WAX-36
Add pagerankdb similar to linkdb but which only keeps counts rather than actual inlinks.
WAX-35
Add option to omit storing of content in segment
WAX-34
Add URL canonicalization to pageranker
WAX-33
500 error - java.lang.NegativeArraySizeException
WAX-32

Class not found when importing within a Hadoop MR job.

Description

There are two ways to run a NutchWAX MapReduce job. It can be invoked with the 'nutchwax' command-line driver, e.g.

nutchwax import <manifest> <segment>

or it can be submitted as a job to a Hadoop cluster, e.g.

hadoop jar $NUTCH_HOME/nutch-1.0.job org.archive.nutchwax.Importer <manifest> <segment>

When using the second method, the import job fails during its reduce step: when the key/value pairs for the segment's crawl_data are read, Hadoop cannot find the class

org.apache.nutch.protocol.ProtocolStatus

Everything works fine when using the 'nutchwax' command-line driver with a full-on NutchWAX installation.

I'm guessing that there are some differences in the way the classloaders are configured in the two contexts.
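
One way to test the classloader guess would be a small probe that reports which classloaders can see the missing class. The following is only a sketch under assumptions: ClassLoaderProbe and its methods are hypothetical names, not part of NutchWAX, Nutch, or Hadoop, and it has not been run inside an actual reduce task. It uses only standard JDK calls (Class.forName and Thread.getContextClassLoader).

```java
public class ClassLoaderProbe {

    /**
     * Returns true if the named class is visible to the given loader.
     * The class is not initialized, so no static initializers run.
     * A null loader means the bootstrap loader.
     */
    static boolean canLoad(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (LinkageError e) {
            // Covers NoClassDefFoundError and similar linkage failures.
            return false;
        }
    }

    /** Builds a one-line report comparing the two loaders of interest. */
    static String report(String className) {
        // Loader attached to the current thread (often set up by the framework).
        ClassLoader context = Thread.currentThread().getContextClassLoader();
        // Loader that defined this probe class itself.
        ClassLoader defining = ClassLoaderProbe.class.getClassLoader();
        return className
            + " contextLoader=" + canLoad(className, context)
            + " definingLoader=" + canLoad(className, defining);
    }

    public static void main(String[] args) {
        // Defaults to the class named in this issue; pass another name to override.
        String name = args.length > 0
            ? args[0]
            : "org.apache.nutch.protocol.ProtocolStatus";
        System.out.println(report(name));
    }
}
```

If the two values differ when the same check is made from the 'nutchwax' driver and from a job submitted with 'hadoop jar', that would support the idea that the class is on one classloader's path but not the other's.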

Environment: None
Status:
Assignee: Aaron Binns
Reporter: Aaron Binns
Labels: None
Group Assignee: None
ZendeskID: None
Estimated Difficulty: None
Actual Difficulty: None
Components:
Fix versions:
Affects versions: 0.13
Priority: Critical