Duplicate user-agent records in robots.txt cause overwriting of rules

Description

According to the robots.txt standard, duplicate user-agent records are not allowed:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

However, some servers don't follow this rule and serve robots.txt files like this (example):

User-agent: *
Disallow: /secret

User-agent: *
Crawl-Delay: 10

In version 3.1.0 of Heritrix (and possibly later versions), the second record seems to overwrite the first, so the file is interpreted as if it read:

User-agent: *
Crawl-Delay: 10

This means that pages below /secret are crawled, which is very likely not what the website owner intended.
Although Heritrix is technically doing the right thing here, from a practical point of view it would be better not to overwrite rules in these cases but to extend the set of rules, so that the interpretation is as follows:

User-agent: *
Disallow: /secret
Crawl-Delay: 10

This would avoid quite a bit of friction between crawl operators and website owners.
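
A minimal sketch of such a merging strategy, in Java (the language Heritrix is written in). The class and method names below are hypothetical and do not correspond to Heritrix's actual robots.txt parser; the only point is that directives for a repeated user-agent are accumulated instead of replaced.

import java.util.*;

/** Sketch: merge (rather than replace) directives when the same
 *  user-agent appears in more than one robots.txt record.
 *  Names are hypothetical, not Heritrix's actual classes. */
public class RobotsMergeSketch {

    /** Accumulated directives for one user-agent. */
    static class Directives {
        final List<String> disallows = new ArrayList<>();
        Float crawlDelay; // null if not set

        void merge(Directives other) {
            disallows.addAll(other.disallows);   // extend the rule set, don't overwrite it
            if (other.crawlDelay != null) {
                crawlDelay = other.crawlDelay;   // for scalar directives, keep the last explicit value
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Directives> byAgent = new HashMap<>();

        // First record:  User-agent: * / Disallow: /secret
        Directives first = new Directives();
        first.disallows.add("/secret");

        // Second record: User-agent: * / Crawl-Delay: 10
        Directives second = new Directives();
        second.crawlDelay = 10f;

        // Map.merge keeps the existing entry and folds the new record into it;
        // a plain Map.put would silently drop the first record.
        byAgent.merge("*", first, (a, b) -> { a.merge(b); return a; });
        byAgent.merge("*", second, (a, b) -> { a.merge(b); return a; });

        Directives star = byAgent.get("*");
        System.out.println("Disallow: " + star.disallows);     // [/secret]
        System.out.println("Crawl-Delay: " + star.crawlDelay); // 10.0
    }
}

With this approach the second "User-agent: *" record adds its Crawl-Delay to the already stored Disallow list, so /secret stays excluded, which matches the combined interpretation shown above.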

Environment

None

Status

Assignee

Unassigned

Reporter

Robert Jäschke

Labels

Group Assignee

None

ZendeskID

None

Estimated Difficulty

1 (Very Easy)

Actual Difficulty

None

Affects versions

Priority

Minor