should follow redirects from /robots.txt and respect directives found

Description

Currently, based on a long-ago understanding of what other major search engines and crawling projects did, Heritrix treats any response other than a '200' to a /robots.txt request as a robots-not-present situation.

It appears that Google, at least, follows redirects and respects the directives so found. (See for example the query [site:www.swims.nhs.uk] regarding the site mentioned in thread http://tech.groups.yahoo.com/group/archive-crawler/message/8078.)

Heritrix should match this behavior, since most webmasters craft their robots.txt based on expectations set by Google.

Open question: does Google follow redirects even to URIs that don't end in "/robots.txt", or to URIs on different hosts, and still apply the resulting rules to the original hostname?
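
For discussion, here is a rough sketch of the requested behavior. It is not Heritrix code: the class `RobotsRedirectSketch`, the `Response` holder, the `fetcher` callback, the `MAX_REDIRECTS` cap of 5, and the `sameHostOnly` flag are all hypothetical names and values chosen for illustration. The flag exists only to make the open question above concrete; which setting is right is still undecided.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

class RobotsRedirectSketch {

    /** Hypothetical stand-in for an HTTP response; not a Heritrix class. */
    static final class Response {
        final int status;
        final String location; // Location header value, or null
        final String body;     // response body, or null

        Response(int status, String location, String body) {
            this.status = status;
            this.location = location;
            this.body = body;
        }
    }

    /** Assumed cap on redirect hops; the ticket does not specify one. */
    static final int MAX_REDIRECTS = 5;

    /**
     * Follows redirects starting at robotsUrl and returns the robots.txt body
     * to apply to the ORIGINAL host, or empty to mean "robots not present".
     * sameHostOnly = true refuses redirect targets on a different host.
     */
    static Optional<String> resolveRobots(String robotsUrl,
            Function<String, Response> fetcher, boolean sameHostOnly) {
        URI original = URI.create(robotsUrl);
        URI current = original;
        for (int hop = 0; hop <= MAX_REDIRECTS; hop++) {
            Response r = fetcher.apply(current.toString());
            if (r == null) {
                return Optional.empty(); // fetch failed
            }
            if (r.status == 200) {
                return Optional.ofNullable(r.body); // directives found
            }
            if (!isRedirect(r.status) || r.location == null) {
                return Optional.empty(); // e.g. 404: same as current behavior
            }
            // Location may be relative, so resolve it against the current URI.
            URI next = current.resolve(r.location);
            if (sameHostOnly && !sameHost(original, next)) {
                return Optional.empty();
            }
            current = next;
        }
        return Optional.empty(); // too many hops, or a redirect loop
    }

    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
    }

    static boolean sameHost(URI a, URI b) {
        return a.getHost() != null && a.getHost().equalsIgnoreCase(b.getHost());
    }

    public static void main(String[] args) {
        // Canned responses instead of real network access.
        Map<String, Response> site = new HashMap<>();
        site.put("http://example.com/robots.txt",
                new Response(301, "http://www.example.com/robots.txt", null));
        site.put("http://www.example.com/robots.txt",
                new Response(200, null, "User-agent: *\nDisallow: /private/"));

        Optional<String> rules =
                resolveRobots("http://example.com/robots.txt", site::get, false);
        System.out.println(rules.orElse("(robots not present)"));
    }
}
```

With `sameHostOnly` set to `false`, the rules served by `www.example.com` are applied to `example.com`; with `true`, the same redirect falls back to the robots-not-present treatment. A real fix would also need to decide how redirect targets that don't end in "/robots.txt" are handled, which this sketch does not restrict.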

Environment

None

Status

Assignee

Unassigned

Reporter

Gordon Mohr

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Components

Affects versions

Heritrix 3.1.1

Priority

Major