Cannot use rsync URLs, no handler for rsync protocol.

Description

There are two ways to run a NutchWAX MapReduce job. It can be invoked with the 'nutchwax' command-line driver, e.g.

nutchwax import <manifest> <segment>

or it can be submitted as a job to a Hadoop cluster, e.g.

hadoop jar $NUTCH_HOME/nutch-1.0.job org.archive.nutchwax.Importer <manifest> <segment>

When using the second method, the 'rsync' protocol handler cannot be found. Whatever mechanism registers a handler for 'rsync' apparently does not run when the job is launched via the hadoop driver. I presume some classloader difference is responsible.
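To confirm which launch method is affected, a small check like the following could be run from both drivers. This is only a sketch: the class name HandlerCheck and the host example.invalid are made up for illustration and are not part of NutchWAX. It relies on the fact that java.net.URL throws MalformedURLException when no handler is registered for a protocol.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical diagnostic, not part of NutchWAX.
public class HandlerCheck {

    // Returns true if the JVM can resolve a URLStreamHandler for the protocol.
    // No network connection is made; only the URL object is constructed.
    public static boolean hasHandler(String protocol) {
        try {
            // example.invalid is a placeholder host (assumption for illustration).
            new URL(protocol + "://example.invalid/");
            return true;
        } catch (MalformedURLException e) {
            // Thrown when no handler is found for the protocol.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.protocol.handler.pkgs = "
                + System.getProperty("java.protocol.handler.pkgs"));
        System.out.println("http handler found:  " + hasHandler("http"));
        System.out.println("rsync handler found: " + hasHandler("rsync"));
    }
}
```

If the rsync line reports false only under 'hadoop jar', that would narrow the problem to how the handler gets registered in that environment.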

Environment

None

Activity

Aaron Binns
January 12, 2010, 10:28 PM

Work-around is to use HTTP instead of rsync.

Aaron Binns
January 21, 2010, 12:59 AM

I think I found the problem:

org.archive.io.ArchiveReaderFactory:

static {
    if (System.getProperty("java.protocol.handler.pkgs") != null) {
        System.setProperty("java.protocol.handler.pkgs",
            System.getProperty("java.protocol.handler.pkgs")
            + "|" + "org.archive.net");
    } else {
        System.setProperty("java.protocol.handler.pkgs", "org.archive.net");
    }
}

I'll bet that when we invoke via the Hadoop 'jar' runner, the Java security permissions are such that these calls to System.setProperty() are not allowed.

Perhaps we can use another method to register the handler classes:
1. Add -Djava.protocol.handler.pkgs=org.archive.net to the command line. Messy.
2. Add a config file to the nutchwax .job file that does the same thing. This is probably the best way to go.
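Before settling on either option, the security-permission guess could be tested with something like the sketch below. The class HandlerPkgs and its method names are hypothetical, not existing NutchWAX or Heritrix code. It does the same append as the static block in ArchiveReaderFactory, but skips the package if it is already listed and reports a SecurityException instead of letting it propagate.

```java
// Hypothetical helper, not existing NutchWAX or Heritrix code.
public class HandlerPkgs {

    static final String KEY = "java.protocol.handler.pkgs";

    // Builds the new property value: packages are separated by '|'.
    public static String merge(String existing, String pkg) {
        if (existing == null || existing.isEmpty()) {
            return pkg;
        }
        for (String p : existing.split("\\|")) {
            if (p.equals(pkg)) {
                return existing; // already registered, nothing to add
            }
        }
        return existing + "|" + pkg;
    }

    // Tries to register the package; returns false if the JVM's
    // security policy refuses the property read or write.
    public static boolean register(String pkg) {
        try {
            System.setProperty(KEY, merge(System.getProperty(KEY), pkg));
            return true;
        } catch (SecurityException e) {
            System.err.println("Not allowed to set " + KEY + ": " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("registered: " + register("org.archive.net"));
        System.out.println(KEY + " = " + System.getProperty(KEY));
    }
}
```

If register() returns true under 'hadoop jar' and the rsync handler still cannot be found, then permissions are probably not the cause and the classloader theory from the description would be the next thing to look at.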

Obsolete

Assignee

Aaron Binns

Reporter

Aaron Binns

Priority

Major