research sorting feature for NutchWAX

Description

The following email from Chris Stockwell <CStockwell@mt.gov> prompted this JIRA:

---------
Searching from the search box on our Archive Montana web site, at http://msl.mt.gov/For_State_Employees/Archive_Montana/default.asp, for "Montana Administrative Rules" brings up links to Montana Administrative Rules, but older dates. I was able to get the most recent dates to come to the top by searching for "Montana Administrative Rules 2009." Can the search engine be tweaked so the latest crawls will appear at the top of the search results?

For "Montana Administrative Rules", the first item is from 2006. This page has been crawled regularly since then.
----------

Is it possible to research the sorting functionality of NutchWAX and determine if providing the user with a flag/parameter that will sort according to the indexed fields is feasible?

Environment

Firefox Mac

Activity

Hunter Stern
September 22, 2009, 10:37 PM

Here is a stab at the requirements for the sorting feature.

Add two parameters that allow sorting on an arbitrary indexed field. For the first version of this feature, the only field that would need to be supported is "archivedate". This would allow a partner to submit a URL that looks like this:

http://www.archive-it.org/public/search?query=Montana+Administrative+Rules&collection=499&Submit1=Search&sort=archivedate&sortorder=desc

The additional parameters are "sort=archivedate" (or whatever name is used to identify the date of capture) and "sortorder=desc". The sortorder parameter would have two possible values, desc and asc.
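As a rough illustration of the proposed request format, here is a small Python sketch that builds such a URL. The helper name build_search_url is hypothetical, and "sort" and "sortorder" are the parameters proposed above, not something the search front end is known to accept yet.

```python
from urllib.parse import urlencode

# Search endpoint taken from the example URL above.
BASE_URL = "http://www.archive-it.org/public/search"

def build_search_url(query, collection, sort=None, sortorder="desc"):
    """Build a search URL, optionally adding the proposed sort parameters.

    `sort` and `sortorder` are the parameters proposed in this ticket;
    they are assumptions, not an existing API.
    """
    if sortorder not in ("asc", "desc"):
        raise ValueError("sortorder must be 'asc' or 'desc'")
    params = [
        ("query", query),
        ("collection", str(collection)),
        ("Submit1", "Search"),
    ]
    if sort is not None:
        params.append(("sort", sort))
        params.append(("sortorder", sortorder))
    # urlencode encodes spaces as '+', matching the example URL.
    return BASE_URL + "?" + urlencode(params)

if __name__ == "__main__":
    print(build_search_url("Montana Administrative Rules", 499,
                           sort="archivedate", sortorder="desc"))
```

Leaving out sort would produce the same URL partners use today, so existing links would keep working under this scheme.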

The use case this feature would support is for the Montana State Library's search page at http://msl.mt.gov/For_State_Employees/Archive_Montana/default.asp. Currently a search on "Montana Administrative Rules" returns the following item at the top of the result list:

http://wayback.archive-it.org/499/20061008091110/http://sos.mt.gov/ARM/index.asp

With the sort=archivedate&sortorder=desc parameters, the first item displayed should be the most recently archived page with the highest pagerank.

Aaron Binns
September 23, 2009, 6:53 PM

How would this interact with the 'hitsPerSite' function? Normally we use hitsPerSite=1, which means that we only show the best result from a site and omit the rest. When ordering by date, would we want to take the one with the latest date rather than highest score?

Consider a search that has two hits from the same site:

URL                              Date        Score
http://www.example.com/foo.html  2007-08-15  95.5
http://www.example.com/bar.html  2009-01-04  34.3

Normally, when we do the hitsPerSite=1 collating, we will keep the first one since it has a better score and discard the second. If sorting by date, would we keep the second and chuck the first, even though the second has a much worse score?
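To make the question concrete, here is a minimal Python sketch of the two candidate behaviors. This is not NutchWAX code: Hit, collate_one_per_site, and treating the URL's hostname as the "site" are all assumptions made for illustration.

```python
from collections import namedtuple
from urllib.parse import urlsplit

# Hypothetical hit record; dates are ISO strings, which sort chronologically.
Hit = namedtuple("Hit", ["url", "date", "score"])

def collate_one_per_site(hits, key):
    """Keep one hit per site (hitsPerSite=1), choosing the hit with the
    largest key(hit). The hostname stands in for the site here."""
    best = {}
    for hit in hits:
        site = urlsplit(hit.url).hostname
        if site not in best or key(hit) > key(best[site]):
            best[site] = hit
    return list(best.values())

hits = [
    Hit("http://www.example.com/foo.html", "2007-08-15", 95.5),
    Hit("http://www.example.com/bar.html", "2009-01-04", 34.3),
]

# Current behavior: the best-scoring hit per site survives.
by_score = collate_one_per_site(hits, key=lambda h: h.score)

# Date-sorted alternative: the most recent hit per site survives.
by_date = collate_one_per_site(hits, key=lambda h: h.date)

if __name__ == "__main__":
    print(by_score[0].url)
    print(by_date[0].url)
```

With the two hits above, collating by score keeps foo.html and collating by date keeps bar.html, which is exactly the trade-off in question.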

Also, for URLs with many revisit dates, which one do we use for sorting? Imagine a government document, a PDF, that doesn't change. We revisit it each month for 3 years: 200601, 200602, 200603, ..., 200909. Which of those 36 dates would we use for date sorting?

Hunter Stern
September 24, 2009, 4:27 PM

For the first example we would keep the second and ignore the first, even though the first has a higher rank. It seems like that's the behavior the partner wanted.

For the second example, we would use the most recent date, even though the doc hasn't changed. Since we "virtually" crawled the doc on the revisit dates, we would want the search result to reflect the most recent crawl, even though that crawl "captured" a duplicate record.
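The revisit rule could be sketched in Python as follows. The function name, the dict-of-lists input, and the example URLs are hypothetical; the sketch assumes every URL has at least one capture date and that dates are YYYYMM strings, which compare correctly as plain text.

```python
def sort_by_latest_capture(captures, descending=True):
    """Order URLs by their most recent capture date.

    `captures` maps each URL to a non-empty list of capture dates
    (YYYYMM strings). Revisits that captured a duplicate record still
    count, so the sort date is simply the latest date in the list.
    """
    latest = {url: max(dates) for url, dates in captures.items()}
    ordered = sorted(latest, key=latest.get, reverse=descending)
    return [(url, latest[url]) for url in ordered]

if __name__ == "__main__":
    example = {
        # Unchanged PDF revisited many times; only the latest date matters.
        "http://www.example.com/report.pdf": ["200601", "200602", "200909"],
        "http://www.example.com/old.html": ["200701"],
    }
    for url, date in sort_by_latest_capture(example):
        print(date, url)
```

Under sortorder=desc the unchanged PDF would sort by its last revisit, so it would rank ahead of a page captured only once in 2007.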

Fixed

Assignee

Hunter Stern

Reporter

Hunter Stern

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Components

Due date

2009/09/22

Priority

Major