Add URL canonicalization to pageranker

Description

While working on AIT, I noticed that the pageranker can be "incorrect" because it doesn't canonicalize URLs. As a result, the pagerank.txt file it produces can contain entries like:

335097 http://mt.gov
378310 http://mt.gov/images/foot.gif
393938 http://mt.gov/images/mtgovlogo.gif
538614 http://mt.gov/code/handheld.css
646266 http://mt.gov/itsd/policy/policies/ENTINT030.asp
647305 http://mt.gov/discover/disclaimer.asp#accessibility
696642 http://mt.gov/

Notice the first and last lines. They should probably be combined, since the two URLs would (AFAIK) canonicalize to the same URL.

Also, if we canonicalize the URLs when generating the pagerank.txt file, we'll have to apply the same canonicalization during indexing so that the URLs match.
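
To make the idea concrete, here is a minimal Python sketch of the kind of merge described above. It is not the pageranker's actual code: the `canonicalize` and `merge_ranks` names are hypothetical, the canonicalization rules (lowercase scheme and host, empty path becomes "/", fragment dropped) are only one plausible choice, and summing the leading numbers of merged lines is an assumption about how entries should be combined.

```python
from collections import defaultdict
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Hypothetical canonical form; the real rules are an open question.

    Lowercases the scheme and host, turns an empty path into "/",
    and drops the fragment. Assumes no "user:password@" prefix in the URL.
    """
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = parts.path or "/"
    return urlunsplit((scheme, netloc, path, parts.query, ""))


def merge_ranks(lines):
    """Combine lines whose URLs share a canonical form.

    Assumption: the numbers of merged lines are summed. The correct
    way to combine them may differ.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        score, url = line.split(None, 1)
        totals[canonicalize(url)] += int(score)
    return dict(totals)
```

With these rules, `http://mt.gov` and `http://mt.gov/` from the listing above collapse into the single key `http://mt.gov/`. The indexer would need to call the same `canonicalize` function on each URL before looking it up, otherwise the keys would not match.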

Environment

None
Obsolete

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major