
Duplicate user-agent records in robots.txt cause overwriting of rules

Description

According to the robots.txt standard, duplicate user-agent records are not allowed:

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

However, some servers do not follow this rule and serve robots.txt files like the following example:

User-agent: *
Disallow: /secret

User-agent: *
Crawl-Delay: 10

In version 3.1.0 of Heritrix (and possibly later versions), the second record seems to overwrite the first, so the file is interpreted as if it read:

User-agent: *
Crawl-Delay: 10

As a result, pages below /secret are crawled, even though the web server owner almost certainly intended otherwise.
Although Heritrix is technically doing the right thing here, I think it would be better, from a practical point of view, not to overwrite rules in such cases but to extend the set of rules, so that the file is interpreted as follows:

User-agent: *
Disallow: /secret
Crawl-Delay: 10

This would avoid a good deal of hassle between crawl operators and website owners.
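
For illustration, here is a minimal sketch (in Java, since Heritrix is written in Java) of how a parser could merge directives from repeated user-agent records instead of discarding the earlier ones. The class and method names (MergingRobotsParser, Directives, parse) are hypothetical and not taken from Heritrix's actual Robotstxt implementation; the sketch also simplifies grouping by attaching each directive to the most recently seen User-agent line.

// Hypothetical sketch, not Heritrix code: merge directives for repeated
// User-agent records rather than letting a later record replace an earlier one.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MergingRobotsParser {

    /** Directives collected for one user-agent token. */
    public static class Directives {
        public final List<String> disallows = new ArrayList<>();
        public Float crawlDelay; // null if no Crawl-delay was given
    }

    /**
     * Parses robots.txt content. When a user-agent token appears again,
     * the existing entry is reused, so later Disallow lines are appended
     * and a later Crawl-delay is added instead of dropping earlier rules.
     */
    public static Map<String, Directives> parse(String robotsTxt) throws IOException {
        Map<String, Directives> byAgent = new LinkedHashMap<>();
        String currentAgent = null;
        BufferedReader reader = new BufferedReader(new StringReader(robotsTxt));
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                currentAgent = value;
                // Reuse an existing entry for the same agent instead of replacing it.
                byAgent.computeIfAbsent(currentAgent, a -> new Directives());
            } else if (currentAgent != null) {
                Directives d = byAgent.get(currentAgent);
                if (field.equals("disallow") && !value.isEmpty()) {
                    d.disallows.add(value);
                } else if (field.equals("crawl-delay")) {
                    try {
                        d.crawlDelay = Float.parseFloat(value);
                    } catch (NumberFormatException ignored) {
                        // ignore malformed delay values
                    }
                }
            }
        }
        return byAgent;
    }
}

With the example file above, parse(...) would yield a single entry for "*" containing both the Disallow for /secret and a crawl delay of 10, which matches the merged interpretation proposed here.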

Environment

None

Status

Assignee

Unassigned

Reporter

Robert Jäschke

Labels

Estimated Difficulty

1 (Very Easy)

Affects versions

Heritrix 3.1.0

Priority

Minor