According to the robots.txt standard, duplicate user-agent fields are not allowed:
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
However, some servers do not obey this rule and serve files like the following (example):
User-agent: *
Disallow: /secret
User-agent: *
Crawl-Delay: 10
In version 3.1.0 of Heritrix (and possibly later versions), the second record seems to overwrite the first, so the file is interpreted as if it read:
User-agent: *
Crawl-Delay: 10
This means that pages below /secret are crawled, which is very likely not what the web server's owner intended.
Although Heritrix is technically doing the right thing here, I think it would be better, from a practical point of view, not to overwrite rules in such cases but to extend the set of rules, so that the file is interpreted as follows:
User-agent: *
Disallow: /secret
Crawl-Delay: 10
This would avoid a good deal of friction between crawl operators and website owners.
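To make the suggestion concrete, here is a minimal, self-contained sketch of such a merging parser. This is not Heritrix's actual Robotstxt implementation; the class name MergingRobotsParser, the Directives holder, and the simplified line handling are assumptions made only for illustration.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch only, not Heritrix's Robotstxt class: when the same user-agent
 * token appears in more than one record, later records extend the
 * already collected directives instead of overwriting them.
 */
public class MergingRobotsParser {

    /** Directives collected for a single user-agent token. */
    public static class Directives {
        public final List<String> disallows = new ArrayList<>();
        public Double crawlDelay; // null if never set

        @Override
        public String toString() {
            return "disallow=" + disallows + ", crawl-delay=" + crawlDelay;
        }
    }

    /** user-agent token (lower-cased) -> merged directives */
    private final Map<String, Directives> agents = new LinkedHashMap<>();

    public void parse(String robotsTxt) {
        List<String> currentAgents = new ArrayList<>();
        boolean inDirectives = false; // true once directives for currentAgents started

        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash).trim(); // strip comments
            }
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // malformed line; ignore
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();

            if (field.equals("user-agent")) {
                // A user-agent line after directives starts a new record.
                if (inDirectives) {
                    currentAgents.clear();
                    inDirectives = false;
                }
                currentAgents.add(value.toLowerCase());
                // Key point: reuse an existing Directives object if this
                // agent was already seen, so later records extend it.
                agents.computeIfAbsent(value.toLowerCase(), k -> new Directives());
            } else if (field.equals("disallow")) {
                inDirectives = true;
                for (String agent : currentAgents) {
                    agents.get(agent).disallows.add(value);
                }
            } else if (field.equals("crawl-delay")) {
                inDirectives = true;
                for (String agent : currentAgents) {
                    try {
                        agents.get(agent).crawlDelay = Double.parseDouble(value);
                    } catch (NumberFormatException ignored) {
                        // keep previous value on unparsable delays
                    }
                }
            }
        }
    }

    public Directives directivesFor(String userAgent) {
        return agents.get(userAgent.toLowerCase());
    }

    public static void main(String[] args) {
        MergingRobotsParser p = new MergingRobotsParser();
        p.parse("User-agent: *\nDisallow: /secret\n\nUser-agent: *\nCrawl-Delay: 10\n");
        System.out.println(p.directivesFor("*"));
    }
}

Running this on the example above prints disallow=[/secret], crawl-delay=10.0 for the '*' agent, i.e. the second record extends rather than replaces the first.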
Unfortunately, it seems that I cannot update my issue to remove some errors, so here are some errata:
Duplicate user-agent fields are allowed, but the value '*' may not appear more than once.
"technically, Heritrix is doing the right job here" is not correct. It handles an error with a default which I think should be different.
Sorry for the remaining spelling errors.
Do you have a patch?