Remove ArchivalUrls from incoming requests

Description

Wayback in Archival URL mode should check incoming requests for embedded Archival URLs, remove them, and redirect to the modified form. When Wayback rewrites JavaScript, some URLs in the page need to be left as-is, but we don't have enough context to tell which ones at rewrite time. Only after the browser has made a request containing an embedded Archival URL can we remove it.

Example, from:

http://wayback.archive-it.org/1726/20091231154917/http://www.creightonmagazine.org/

Wayback server-side rewrite is changing the javascript function argument:

CurrentIssue/Source.Flash/sp_Buttons?title=Read Article&theLink=http://www.creightonmagazine.org/Issue.Fall_Winter_2009/President.asp

to:

CurrentIssue/Source.Flash/sp_Buttons?title=Read Article&theLink=http://wayback.archive-it.org/1726/20091231154917/http://www.creightonmagazine.org/Issue.Fall_Winter_2009/President.asp

The called function appends the argument as a CGI GET argument to a URL it constructs. The original form was crawled, but once server-side rewriting has added the Archival URL prefix, the altered form is not found.

The simplest solution seems to be to inspect incoming URLs: if they contain an embedded Archival URL (detected with a regex), the Archival URL prefix should be stripped and the client redirected to the unmodified form.
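A minimal sketch of that check, in Python for illustration only (Wayback itself is Java, and this is not its actual code). It assumes an Archival URL prefix has the shape replay base + timestamp of up to 14 digits + "/". REPLAY_BASE and strip_embedded_archival_urls are hypothetical names, and percent-encoded embedded prefixes (http%3A%2F%2F...) are not handled.

```python
import re

# Hypothetical configuration value: the replay base for one collection.
REPLAY_BASE = "http://wayback.archive-it.org/1726/"

# Assumed prefix shape: <replay base><timestamp, 1 to 14 digits>/
ARCHIVAL_PREFIX = re.compile(re.escape(REPLAY_BASE) + r"\d{1,14}/")


def strip_embedded_archival_urls(request_url):
    """Return the URL to redirect to, or None if no redirect is needed."""
    outer = ARCHIVAL_PREFIX.match(request_url)
    if outer is None:
        # Not an Archival URL request at all; leave it alone.
        return None
    # Keep the leading prefix, which is the legitimate one.
    head = request_url[:outer.end()]
    tail = request_url[outer.end():]
    # Remove any further prefixes embedded in the original URL part.
    cleaned = ARCHIVAL_PREFIX.sub("", tail)
    if cleaned == tail:
        return None
    return head + cleaned
```

For the example above, the request ending in theLink=http://wayback.archive-it.org/1726/20091231154917/http://www.creightonmagazine.org/Issue.Fall_Winter_2009/President.asp would be redirected to the same request with theLink=http://www.creightonmagazine.org/Issue.Fall_Winter_2009/President.asp.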

Environment

None

Activity

Brad Tofel
January 14, 2010, 2:21 AM

<snipped email conv from Kate>

Unless I'm totally out of it though, it looks like while it did get the url of the
button, it didn't actually follow the link through to collect that page. Any ideas,
or is that something that someone else should look into, since it's more of a
Heritrix issue?

For example, the wayback url of the button is at:
http://wayback.archive-it.org/1726/20091231154934/http://www.creightonmagazine.org/CurrentIssue/Source.Flash/sp_Buttons.swf?title=Read Article&theLink=http://www.creightonmagazine.org/Issue.Fall_Winter_2009/Freshmen.asp

The page that it links to should have the following wayback url, but gives a 'not in
archive' error:
http://wayback.archive-it.org/1726/*/http://www.creightonmagazine.org/Issue.Fall_Winter_2009/Freshmen.asp

</snipped>

You're totally right, Heritrix has not gotten everything it needed.

Here, the flash buttons were archived:

http://wayback.archive-it.org/1726/20091231154934/http://www.creightonmagazine.org/CurrentIssue/Source.Flash/sp_Buttons.swf?title=Read%20Article&theLink=http://www.creightonmagazine.org/Issue.Fall_Winter_2009/Freshmen.asp

But that little flash app drives the browser to:

http://www.creightonmagazine.org/Issue.Fall_Winter_2009/Freshmen.asp

Actually, they've gotten fancy: the same sp_Buttons.swf is used for all the buttons in the navigation menu, and the flash script inspects the two GET arguments, "title" and "theLink", and creates the button text and the link on the fly:

http://www.creightonmagazine.org/CurrentIssue/Source.Flash/sp_Buttons.swf?title=GoFoo&theLink=http://archive.org/

So this is a little more complicated than I'd thought.

There are 2 problems here:
1) Heritrix didn't capture the subsequent page.
2) Wayback will need to do some very site-specific acrobatics to make these buttons work, once Heritrix has captured them.

I'm tempted to say we have bigger fish to fry at the moment, and #2 will have to remain broken for the near term wrt Wayback playback. We should definitely make sure Heritrix is capturing the subsequent links; I think this could be done by adding the menu button targets as additional seeds.
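One way the menu button targets could be collected, sketched in Python under the assumption that the captured sp_Buttons.swf URLs are available as a list. The function name button_targets is hypothetical, and how the resulting seeds are handed to Heritrix is not shown here.

```python
from urllib.parse import urlsplit, parse_qs


def button_targets(button_urls, param="theLink"):
    """Collect the distinct link targets carried in the given GET argument."""
    seeds = []
    for url in button_urls:
        query = urlsplit(url).query
        # parse_qs maps each argument name to a list of decoded values.
        for target in parse_qs(query).get(param, []):
            if target not in seeds:
                seeds.append(target)
    return seeds
```

Applied to the Freshmen.asp button URL above, this yields http://www.creightonmagazine.org/Issue.Fall_Winter_2009/Freshmen.asp as a candidate seed.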

Assignee

Kenji Nagahashi

Reporter

Brad Tofel

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Components

Sprint

None

Fix versions

Affects versions

Due date

2010/03/08

Priority

Major