contrib/archive/README.txt needs clarifications

Description

While following contrib/archive/README.txt, I went down a few blind alleys because of ambiguities in the text. The following doc changes would greatly help the next reader:

  • State that the "simple" directions in README.txt apply only to non-distributed (Standalone) mode.

    • Explain how to enter this mode and what it means.

    • Alternatively, turn overview-summary.html into a real getting-started guide and use it to replace the current README.txt.

  • State that in real distributed mode, "nutchwax import" can only obtain ARC files from S3 or via HTTP, not from HDFS.

    • The lack of import via hdfs:// was a surprise, since full interoperation with Hadoop is naturally assumed, so this current limitation should be pointed out.

    • Importing ARC files from the local filesystem is possible, but on a distributed Hadoop cluster the local files must be replicated to each slave machine. This is very inefficient, but it could be a convenient first test to see whether an old ARC file can be processed.
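
As an illustration of the limitation described above, here is a minimal Python sketch that pre-checks a list of ARC locations before they are handed to a distributed import. It is not part of NutchWAX: the function name `classify_arc_location` and the scheme sets are hypothetical, and the supported/unsupported split is taken only from the behavior reported in this ticket, not from the NutchWAX source.

```python
from urllib.parse import urlsplit

# Assumption: these sets mirror what this ticket reports for distributed
# mode. They are not read from NutchWAX itself.
FETCHABLE_SCHEMES = {"http", "s3"}
UNSUPPORTED_SCHEMES = {"hdfs"}


def classify_arc_location(location: str) -> str:
    """Classify an ARC location by its URL scheme (hypothetical helper)."""
    scheme = urlsplit(location).scheme.lower()
    if scheme in FETCHABLE_SCHEMES:
        return "ok"
    if scheme in UNSUPPORTED_SCHEMES:
        return "unsupported"
    if scheme in ("", "file"):
        # Local path: per the ticket, it must exist on every slave machine.
        return "local"
    return "unknown"


if __name__ == "__main__":
    for loc in ("http://example.org/a.arc.gz",
                "hdfs://namenode/arcs/a.arc.gz",
                "/data/arcs/a.arc.gz"):
        print(loc, "->", classify_arc_location(loc))
```

A check like this could flag hdfs:// entries up front, which is the surprise the README should warn about.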

Environment

None
Obsolete

Assignee

Aaron Binns

Reporter

Paul Baclace

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Priority

Major