Compatibility with {index+segment}s created by NutchWAX 0.10.

Description

NutchWAX 0.10 and 0.12 use slightly different formats for the key field in the segment. A Nutch(WAX) segment contains Hadoop MapFiles, where the key is based on the URL and the value is the document-specific info. We are concerned with the segment's 'parse_text' MapFile, which contains the parsed text of the documents in the index. This text is used to generate the snippets shown in search results.

In NutchWAX 0.10, the format of the key was

c=<collectionId>,u=<url>

and in NutchWAX 0.12, the format was changed to

<url> <digest>

in order to support (de-)duplication.
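
To make the two key formats concrete, here is a minimal sketch of a helper that builds the 'parse_text' lookup key for either version. The class name SegmentKeys, the method keyFor and its parameters are hypothetical and are not taken from the NutchWAX source; the sketch also assumes the key is a plain string with a single space between the URL and the digest, as the format above suggests.

```java
// Hypothetical helper, not part of NutchWAX: builds the parse_text key
// for a document according to the key format of the segment's version.
final class SegmentKeys {
    private SegmentKeys() {}

    static String keyFor(int version, String collectionId, String url, String digest) {
        switch (version) {
            case 10:
                // NutchWAX 0.10 format: c=<collectionId>,u=<url>
                return "c=" + collectionId + ",u=" + url;
            case 12:
                // NutchWAX 0.12 format: <url> <digest>
                return url + " " + digest;
            default:
                throw new IllegalArgumentException("Unsupported segment version: " + version);
        }
    }
}
```

For instance, keyFor(10, "coll", "http://example.org/", "abc") yields "c=coll,u=http://example.org/", while the same call with version 12 yields "http://example.org/ abc".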

If one points a NutchWAX 0.12 searcher at a NutchWAX 0.10 {index+segment}, it will successfully search the index, but will be unable to generate snippets due to the change in the key format.

We need a method to tell NutchWAX which segments use the 0.10 format and which use the 0.12 format. Then it can generate the key accordingly and thus simultaneously search indexes created by both 0.10 and 0.12.

Environment

None

Activity

Aaron Binns
January 13, 2010, 12:19 AM

SVN 2870 & 2946

In the segments directory, create a file named "versions". In it,
place lines of the form:

<segment-name> <version>

where version is either "10" or "12" (without quotes). For example:

foo-segment 10
bar-segment 12

If a segment is not listed in the "versions" file, it will be treated as version 12.
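
As an illustration only, the "versions" file described above could be read along these lines. This is a sketch, not the code committed in SVN 2870 & 2946: the class SegmentVersions and its methods are hypothetical, and skipping blank lines and rejecting malformed lines are assumptions that the description above does not specify.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the "versions" file: one "<segment-name> <version>" per line.
final class SegmentVersions {
    // Segments not listed in the file are treated as version 12.
    static final int DEFAULT_VERSION = 12;

    private final Map<String, Integer> versions = new HashMap<>();

    static SegmentVersions parse(List<String> lines) {
        SegmentVersions result = new SegmentVersions();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // assumption: blank lines are ignored
            }
            String[] parts = line.split("\\s+");
            if (parts.length != 2 || !(parts[1].equals("10") || parts[1].equals("12"))) {
                // assumption: anything other than "<name> 10" or "<name> 12" is an error
                throw new IllegalArgumentException("Malformed versions line: " + raw);
            }
            result.versions.put(parts[0], Integer.parseInt(parts[1]));
        }
        return result;
    }

    int versionOf(String segmentName) {
        return versions.getOrDefault(segmentName, DEFAULT_VERSION);
    }
}
```

The lines could come from java.nio.file.Files.readAllLines on the "versions" file in the segments directory. With the example file above, versionOf("foo-segment") gives 10, versionOf("bar-segment") gives 12, and any unlisted segment gives 12.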

Fixed

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Fix versions

Priority

Major