Duplicate user-agent records in robots.txt cause rules to be overwritten

Description

According to the robots.txt standard, duplicate user-agent fields are not allowed:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

However, some servers do not follow this rule and serve files with records like these (example):

User-agent: *
Disallow: /secret

User-agent: *
Crawl-Delay: 10

In version 3.1.0 of Heritrix (and possibly later versions), the second record seems to overwrite the first, so the file is interpreted as if it looked like this:

User-agent: *
Crawl-Delay: 10

This means that pages below /secret are crawled, despite the very likely different intention of the website owner.
Although, technically, Heritrix is doing the right job here, I think it would be better from a practical point of view not to overwrite rules in these cases but to extend the set of rules, so that the interpretation is as follows:

User-agent: *
Disallow: /secret
Crawl-Delay: 10

This would avoid considerable friction between crawl operators and website owners.
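As a rough illustration of the suggested merging behaviour, here is a minimal Java sketch. The class and field handling are invented for this example and do not reflect Heritrix's actual Robotstxt implementation; it only shows how records repeating the same user-agent could extend, rather than replace, earlier directives:

// Sketch only: merge-on-duplicate parsing for robots.txt records.
// Names are illustrative, not Heritrix's org.archive.modules.net.Robotstxt API.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergingRobotsParser {

    /** Directives collected for one user-agent; duplicates are merged, not replaced. */
    public static class Directives {
        final List<String> disallows = new ArrayList<>();
        Double crawlDelay;          // null if never specified
        @Override
        public String toString() {
            return "disallows=" + disallows + ", crawlDelay=" + crawlDelay;
        }
    }

    /** Parses robots.txt text, merging records that repeat the same user-agent. */
    public static Map<String, Directives> parse(String robotsTxt) throws IOException {
        Map<String, Directives> byAgent = new LinkedHashMap<>();
        List<String> currentAgents = new ArrayList<>();
        boolean lastWasDirective = false;
        BufferedReader reader = new BufferedReader(new StringReader(robotsTxt));
        String line;
        while ((line = reader.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) line = line.substring(0, hash);   // strip comments
            line = line.trim();
            if (line.isEmpty()) continue;
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // A user-agent line following directives starts a new record.
                if (lastWasDirective) currentAgents.clear();
                String agent = value.toLowerCase();
                currentAgents.add(agent);
                // Reuse an existing Directives object if this agent was seen before,
                // so later records extend rather than overwrite earlier ones.
                byAgent.computeIfAbsent(agent, k -> new Directives());
                lastWasDirective = false;
            } else {
                for (String agent : currentAgents) {
                    Directives d = byAgent.get(agent);
                    if (field.equals("disallow") && !value.isEmpty()) {
                        d.disallows.add(value);
                    } else if (field.equals("crawl-delay")) {
                        try { d.crawlDelay = Double.parseDouble(value); }
                        catch (NumberFormatException ignored) { }
                    }
                }
                lastWasDirective = true;
            }
        }
        return byAgent;
    }

    public static void main(String[] args) throws IOException {
        String robots = "User-agent: *\nDisallow: /secret\n\nUser-agent: *\nCrawl-Delay: 10\n";
        // Prints: * -> disallows=[/secret], crawlDelay=10.0
        parse(robots).forEach((agent, d) -> System.out.println(agent + " -> " + d));
    }
}

Run against the example above, this sketch yields both the Disallow: /secret rule and the Crawl-Delay of 10 for '*', which is the interpretation proposed in this issue.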

Environment

None

Activity

Robert Jäschke
June 25, 2015, 3:21 PM

Unfortunately, it seems that I cannot update my issue to remove some errors, so here are some errata:

  • Duplicate user-agent fields are allowed, but having more than one record with the value '*' is not.

  • "technically, Heritrix is doing the right job here" is not correct. It handles an error with a default which I think should be different.

  • Sorry for the remaining spelling errors.

Noah Levitt
June 27, 2015, 12:06 AM

Do you have a patch?

Assignee

Unassigned

Reporter

Robert Jäschke

Labels

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

1 (Very Easy)

Actual Difficulty

None

Affects versions

Priority

Minor