Index documents without crawldb or linkdb.

Description

It would be really nice if we could dispense with the crawldb and linkdb altogether. The crawldb contains info about the URLs/documents related to Nutch's crawling features, which we don't use at all. The linkdb is always empty in practice. However, the Nutch 'index' command requires both to exist. It would be great to hack up Nutch's 'index' command to simply ignore them altogether.

Environment

None

Activity

Aaron Binns
October 26, 2009, 11:04 PM

Fixed in SVN 2833 and SVN 2834.

Added a NutchWAX version of "Indexer.java", with a command-line driver, that runs the same indexing process as the Nutch indexer but without requiring the crawldb and linkdb. In fact, the NutchWAX Indexer doesn't want them on the command line at all.

I also added a command-line driver in the 'nutchwax' script, so one can run:

nutchwax index <indexes-dir> <segment>...
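
To illustrate the argument order, here is a minimal sketch. The directory names (`indexes`, `segments/segment-a`, `segments/segment-b`) are hypothetical and not taken from this ticket, and the script only prints the command instead of running it, so it works without NutchWAX installed:

```shell
#!/bin/sh
# Hypothetical layout: these directory names are illustrative assumptions.
indexes_dir="indexes"
segments="segments/segment-a segments/segment-b"

# Build the command line: the indexes directory first, then one or more segments.
# No crawldb or linkdb argument appears anywhere.
cmd="nutchwax index $indexes_dir $segments"

# Print rather than execute, so the sketch runs without NutchWAX installed.
echo "$cmd"
```

Dropping the `echo` and running the command directly would be the real invocation, assuming the 'nutchwax' script is on the PATH and the segment directories exist.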

Fixed

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Fix versions

Priority

Major