In IndexSearcher.translateHits(), use a FieldSelector when de-duping so that only the 'site' field of each document is loaded.

Description

When we perform the "hitsPerSite=1" deduping, we iterate through the documents, checking the 'site' field of each against a list of sites we have already seen.

When the document is obtained from the IndexReader, all of the fields are loaded, i.e. read from disk. This is inefficient because all we care about is the 'site' field.

It should be more efficient to load only the 'site' field by using a custom FieldSelector, something like:

String site = reader.document( doc, new FieldSelector() {
    // Load only the "site" field and stop reading once it has been found.
    public FieldSelectorResult accept( String name )
    {
      if ( "site".equals( name ) ) return FieldSelectorResult.LOAD_AND_BREAK;

      return FieldSelectorResult.NO_LOAD;
    }
  } ).get( "site" );

which would only load the "site" field and also stop trying to load fields after it is loaded.
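
For illustration, here is a minimal sketch of the "hitsPerSite=1" de-duping loop described above, written without any Lucene dependency so it can be run on its own. It is not the actual translateHits() code: the class name SiteDedup, the method dedupBySite, and the siteLoader parameter are all hypothetical, and siteLoader merely stands in for a call such as reader.document( doc, selector ).get( "site" ).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

// Hypothetical helper; an assumption for illustration, not the real IndexSearcher code.
class SiteDedup {

    // Keeps the first hit for each site, preserving the original rank order.
    // siteLoader is assumed to return the value of the "site" field for a doc id,
    // ideally by loading only that one field.
    static List<Integer> dedupBySite(int[] docIds, IntFunction<String> siteLoader) {
        Set<String> seenSites = new HashSet<>();
        List<Integer> kept = new ArrayList<>();

        for (int doc : docIds) {
            String site = siteLoader.apply(doc);

            // Set.add returns false if the site was already present, so only the
            // first document from each site is kept. In this sketch, documents
            // with no site value (null) are treated as a single group.
            if (seenSites.add(site)) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

Because the loop calls siteLoader once per hit and uses nothing but the site value, the cost of each call dominates the de-duping pass, which is why restricting the load to a single field should help.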

Environment

None
Fixed

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major