When following the contrib/archive/README.txt, I went down a few blind alleys because of some ambiguities in the text. The next person reading it would be greatly helped if the following doc changes were made:
Point to a local copy of the comprehensive, in-depth Getting Started
Where is the source of this page? http://archive-access.sourceforge.net/projects/nutchwax/apidocs/overview-summary.html
State that the "simple" directions in the README.txt are for non-distributed, Standalone mode
Explain how to get into this mode and what it means
Alternatively, turn the overview-summary.html into a real Getting Started and use that to replace the current README.txt
In real distributed mode, "nutchwax import" can only obtain arc files from S3 or via http, not from hdfs
The lack of import via hdfs:// was a surprise, since full interoperation is naturally assumed, so this current limitation should be pointed out.
Import of arc files from the local filesystem is possible, but on a distributed Hadoop cluster the local files must be replicated to each slave machine. This is very inefficient, but it could be a convenient first test to see whether an old arc file can be processed.
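
To make the import limitations above concrete, here is a minimal Python sketch that sorts a list of arc locations by URL scheme before they are handed to "nutchwax import" in distributed mode. It is not part of NutchWAX. The function name, the category labels and the example locations are hypothetical, and the scheme rules encode only what this report observed (http and S3 work, hdfs does not, local paths need a copy on every slave), not a confirmed specification of the tool.

```python
from urllib.parse import urlsplit

# Schemes this report observed working with "nutchwax import" in
# distributed mode. Assumption: taken from the report, not from the code.
REPORTED_OK = {"http", "s3"}

# A bare path or a file:// URL refers to the local filesystem.
LOCAL = {"", "file"}


def classify_arc_location(location: str) -> str:
    """Return a hypothetical category label for one arc location."""
    scheme = urlsplit(location).scheme.lower()
    if scheme in REPORTED_OK:
        return "ok"
    if scheme in LOCAL:
        # Usable only if the file is replicated to every slave machine.
        return "local-needs-replication"
    if scheme == "hdfs":
        return "unsupported"
    # Anything else was not tested in this report.
    return "unknown"


if __name__ == "__main__":
    # Hypothetical example locations, for illustration only.
    for loc in [
        "http://example.org/arcs/a.arc.gz",
        "s3://example-bucket/arcs/b.arc.gz",
        "hdfs://namenode/arcs/c.arc.gz",
        "/data/arcs/d.arc.gz",
    ]:
        print(classify_arc_location(loc), loc)
```

A check like this could be run over the arc list before starting a distributed import, so that hdfs:// entries are caught up front instead of failing partway through the job.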