Class not found when importing within a Hadoop MR job.

Description

There are two ways to run a NutchWAX MapReduce job. It can be invoked with the 'nutchwax' command-line driver, e.g.

nutchwax import <manifest> <segment>

or it can be submitted as a job to a Hadoop cluster

hadoop jar $NUTCH_HOME/nutch-1.0.job org.archive.nutchwax.Importer <manifest> <segment>

When using the second method, the reduce step of the import job fails while reading the key/value pairs for the segment's crawl_data: Hadoop cannot find the class

org.apache.nutch.protocol.ProtocolStatus

Everything works fine when using the 'nutchwax' command-line driver with a full-on NutchWAX installation.

I'm guessing there is some difference in how the classloaders are configured in the two contexts.
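A quick way to probe this would be a standalone classloader check along the following lines. This is purely illustrative and not code from Importer.java; the class name being looked up is the one from the failure above, and the use of the thread context classloader is an assumption about where the task JVM resolves classes.

// Illustrative check: is org.apache.nutch.protocol.ProtocolStatus visible
// to the classloader we run under? Run with the bare Hadoop classpath vs.
// a full NutchWAX installation to compare the two contexts.
public class ClassVisibilityCheck {
  public static void main(String[] args) {
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    try {
      Class<?> c = Class.forName(
          "org.apache.nutch.protocol.ProtocolStatus", false, loader);
      System.out.println("ProtocolStatus loaded by: " + c.getClassLoader());
    } catch (ClassNotFoundException e) {
      System.out.println("ProtocolStatus not visible to: " + loader);
    }
  }
}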

Environment

None

Activity

Aaron Binns
January 12, 2010, 10:18 PM

SVN 2943.

Since we don't use the segment's crawl_data anyway, I've just commented out the code in Importer.java that writes it.

Now the reduce step completes successfully, and the segment's crawl_data directory contains an empty Hadoop MapFile.

The segment's crawl_data directory can simply be ignored for the rest of NutchWAX processing.
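For context on why the directory still appears at all: a Hadoop MapFile writer creates its data and index files as soon as it is opened, so a map that is never appended to still shows up on disk as an empty MapFile. A minimal sketch of that behavior follows; the path and the key/value classes are placeholders, not the actual Importer output settings.

// Minimal sketch: a MapFile opened and closed without any append() calls
// still leaves "data" and "index" files on disk -- an empty map, which is
// what the crawl_data directory now holds and why it can be ignored.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class EmptyMapFileDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    // Placeholder path; Nutch segments key crawl_data by <Text url, CrawlDatum>.
    MapFile.Writer writer = new MapFile.Writer(conf, fs,
        "segment/crawl_data/part-00000", Text.class, CrawlDatum.class);
    writer.close();   // nothing written: the map file exists but is empty
  }
}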

Fixed

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Components

Fix versions

Affects versions

Priority

Critical