There are two ways to run a NutchWAX MapReduce job. It can be invoked with the 'nutchwax' command-line driver, e.g.
nutchwax import <manifest> <segment>
or it can be submitted as a job to a Hadoop cluster:
hadoop jar $NUTCH_HOME/nutch-1.0.job org.archive.nutchwax.Importer <manifest> <segment>
When using the second method, the import job fails during the reduce step: when the key/value pairs for the segment's crawl_data are read, Hadoop cannot find the class
org.apache.nutch.protocol.ProtocolStatus
Everything works fine when using the 'nutchwax' command-line driver with a full NutchWAX installation.
I'm guessing there are some differences in how the classloaders are configured in the two contexts.
SVN revision 2943:
Since we don't use the segment's crawl_data anyway, I've just commented out the code in Importer.java that writes it.
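For illustration, here is a minimal sketch of the kind of change involved, assuming an old-style org.apache.hadoop.mapred reducer that emits a CrawlDatum for the segment's crawl_data; the class name ImporterReduceSketch and the surrounding structure are hypothetical and do not reproduce the actual Importer.java source.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.crawl.NutchWritable;

    // Hypothetical reducer fragment, not the actual Importer.java source.
    public class ImporterReduceSketch {
      public void reduce(Text key, Iterator<NutchWritable> values,
                         OutputCollector<Text, NutchWritable> output,
                         Reporter reporter) throws IOException {
        while (values.hasNext()) {
          Writable value = values.next().get();
          if (value instanceof CrawlDatum) {
            // Workaround: skip the entry destined for the segment's
            // crawl_data. Its ProtocolStatus metadata is what Hadoop
            // fails to load under 'hadoop jar', and crawl_data is
            // unused by the rest of NutchWAX processing anyway.
            // was: output.collect(key, new NutchWritable(value));
            continue;
          }
          output.collect(key, new NutchWritable(value));
        }
      }
    }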
Now the reduce step completes successfully, and the segment's crawl_data directory contains only an empty Hadoop MapFile, which can simply be ignored for the rest of NutchWAX processing.