Seeds Report missing redirect URLs for 301 / 302 responses

Description

None

Environment

None

Activity

Adam Miller
October 22, 2014, 8:48 PM

The cause appears to be the changes to CandidatesProcessor, CrawlURI, and SeedRecord made in https://github.com/internetarchive/heritrix3/pull/76

The CandidatesProcessor clears the curi outlink list when it is done, and previously added all of the outlinks to an outCandidates list. SeedRecord then read the outCandidates list to determine the redirect location. That list no longer exists, and all calls to curi.getOutCandidates() were replaced with curi.getOutLinks(). This appears to be fine everywhere except SeedRecord, which tries to read from the outLinks list after it has been cleared.
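
To illustrate the ordering problem described above, here is a minimal, self-contained sketch. It does not use the real Heritrix classes: FakeCrawlUri, processCandidates() and redirectFor() are hypothetical stand-ins for CrawlURI, CandidatesProcessor and the lookup done by SeedRecord, and the real code may differ in detail.

```java
import java.util.ArrayList;
import java.util.List;

public class ClearedOutlinksDemo {
    // Hypothetical stand-in for CrawlURI: it only holds an outlink list.
    static class FakeCrawlUri {
        private final List<String> outLinks = new ArrayList<>();

        List<String> getOutLinks() {
            return outLinks;
        }
    }

    // Stand-in for the candidates step: walks the outlinks, then clears them.
    static void processCandidates(FakeCrawlUri curi) {
        for (String link : curi.getOutLinks()) {
            // Scheduling of each candidate would happen here.
        }
        curi.getOutLinks().clear();
    }

    // Stand-in for the seed report lookup: treats the first outlink as the redirect.
    static String redirectFor(FakeCrawlUri curi) {
        List<String> links = curi.getOutLinks();
        return links.isEmpty() ? null : links.get(0);
    }

    public static void main(String[] args) {
        FakeCrawlUri seed = new FakeCrawlUri();
        seed.getOutLinks().add("http://example.com/new-location");

        String before = redirectFor(seed);
        processCandidates(seed);
        String after = redirectFor(seed);

        System.out.println("before clear: " + before);
        System.out.println("after clear: " + after);
    }
}
```

In this sketch, any reader that runs after the clearing step sees an empty list, which mirrors how the redirect URL would go missing from the seeds report.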

Adam Miller
October 22, 2014, 9:13 PM

Removing the curi.getOutLinks().clear() call from CandidatesProcessor should fix the problem, but I'm not sure whether there is a reason to have the outlinks cleared.
https://github.com/adam-miller/heritrix3/commit/fb23b40e745be51e12a0e00ab3c774f961160eeb

Adam Miller
October 23, 2014, 6:42 PM

This may be a better solution: Allow the CandidatesProcessor to continue clearing the outlinks list, and instead store the redirect in the data attributes for seed CrawlURIs only.
https://github.com/adam-miller/heritrix3/compare/HER-2076
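
A rough sketch of that idea, under stated assumptions: the classes below are hypothetical stand-ins rather than the Heritrix API, the key name "seed-redirect-uri" is invented for illustration, and taking the first outlink as the redirect target is a simplification. The linked branch is the authoritative version.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SeedRedirectDataDemo {
    // Hypothetical key name; not taken from the Heritrix source.
    static final String REDIRECT_KEY = "seed-redirect-uri";

    // Hypothetical stand-in for CrawlURI with a per-URI data map.
    static class FakeCrawlUri {
        final boolean seed;
        final int status;
        final List<String> outLinks = new ArrayList<>();
        final Map<String, Object> data = new HashMap<>();

        FakeCrawlUri(boolean seed, int status) {
            this.seed = seed;
            this.status = status;
        }
    }

    // Stand-in for the candidates step: remember the redirect for seeds only,
    // then clear the outlinks as before.
    static void processCandidates(FakeCrawlUri curi) {
        boolean redirect = curi.status == 301 || curi.status == 302;
        if (curi.seed && redirect && !curi.outLinks.isEmpty()) {
            curi.data.put(REDIRECT_KEY, curi.outLinks.get(0));
        }
        curi.outLinks.clear();
    }

    // Stand-in for the seed report lookup: reads the data map, not the outlinks.
    static String redirectFor(FakeCrawlUri curi) {
        return (String) curi.data.get(REDIRECT_KEY);
    }

    public static void main(String[] args) {
        FakeCrawlUri seed = new FakeCrawlUri(true, 301);
        seed.outLinks.add("http://example.com/new-location");
        processCandidates(seed);
        System.out.println("redirect: " + redirectFor(seed));
        System.out.println("outlinks left: " + seed.outLinks.size());
    }
}
```

The outlink list is still emptied, so other consumers see the same behavior as before, while the report reads the redirect from a place that survives the clear. Non-seed URIs carry no extra data.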

Adam Miller
October 23, 2014, 6:44 PM

Gordon, can you take a look?

Hunter Stern
November 12, 2014, 2:06 AM

We (Noah and I) merged https://github.com/internetarchive/heritrix3/pull/103 to address a related issue for https://webarchive.jira.com/browse/ARI-4084. Scoping was affected because outlinks were cleared before the redirect of a seed was added to the list of seeds.

Assignee

Gordon Mohr

Reporter

Adam Miller

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major