Fix/rethink default case-flattening canonicalization (LowercaseRule)

Description

Our default canonicalization rules lowercase a URI (LowercaseRule) before using it as the token for checking uniqueness of inclusion. This was reasonable in the early days of the web, when some major (especially Windows-based) servers treated alternate casings as equivalent, so case-variant versions of URIs often kept working and thus proliferated. However, it is becoming increasingly common to use case-sensitive identifiers in URIs for compactness, for example at URL-shortening services. Thus our current default can cause multiple case-differing versions of (for example) a bit.ly URL, which are in fact distinct URLs, to be mistakenly treated as already seen.
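To make the failure mode concrete, here is a minimal sketch of the collision. It is not Heritrix code: the class, the `schedule` method, and the two bit.ly URLs are made up for illustration, and the "already seen" set is a plain HashSet standing in for the real uniqueness check.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for an already-seen set keyed by canonicalized URI.
// This is NOT Heritrix's actual LowercaseRule or frontier implementation.
class LowercaseCollisionDemo {
    private final Set<String> seen = new HashSet<>();

    // Mimics a lowercase-everything canonicalization step.
    static String canonicalize(String uri) {
        return uri.toLowerCase(Locale.ROOT);
    }

    // Returns true if the URI's canonical form is new (it would be scheduled).
    boolean schedule(String uri) {
        return seen.add(canonicalize(uri));
    }

    public static void main(String[] args) {
        LowercaseCollisionDemo demo = new LowercaseCollisionDemo();
        // Two made-up short URLs that differ only by case.
        System.out.println(demo.schedule("http://bit.ly/aBcD12")); // true
        System.out.println(demo.schedule("http://bit.ly/AbCd12")); // false: dropped as a duplicate
    }
}
```

The second URL is never scheduled even though, at a case-sensitive shortener, it may resolve to an entirely different target.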

The safe choice for ensuring completeness would be to eliminate the rule entirely, but that would result in some unknown amount of duplicate collection from those servers that still do case-flattening themselves. (The worst case would probably be an early path-segment which is discovered as both /Public/ and /public/, with extensive relative-URI-cross-linked material underneath, resulting in a doubling of all such content. Still, the amount of such duplication might not be that large nowadays.)

Perhaps the rule could still be applied in qualified situations: keying off the 'Server' response header, for example, so that lowercasing is applied only when the server identifies itself as IIS.
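A rough sketch of what such a qualified rule might look like follows. Everything here is an assumption for discussion: the class and method names are hypothetical, it does not use Heritrix's canonicalization-rule API, and it simply treats any 'Server' value containing "Microsoft-IIS" as a signal that the server is case-insensitive.

```java
import java.util.Locale;

// Hypothetical sketch only; not wired into Heritrix's canonicalization rules.
class ConditionalLowercaseSketch {

    // Assumption: a 'Server' header such as "Microsoft-IIS/7.5" indicates
    // a server that treats differently-cased paths as equivalent.
    static boolean looksCaseInsensitive(String serverHeader) {
        return serverHeader != null
                && serverHeader.toLowerCase(Locale.ROOT).contains("microsoft-iis");
    }

    // Lowercase the URI only when the server appears to be case-insensitive;
    // otherwise leave the casing untouched.
    static String canonicalize(String uri, String serverHeader) {
        return looksCaseInsensitive(serverHeader)
                ? uri.toLowerCase(Locale.ROOT)
                : uri;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("http://example.com/Public/Index.html", "Microsoft-IIS/7.5"));
        System.out.println(canonicalize("http://bit.ly/aBcD12", "nginx"));
    }
}
```

One open question with this approach: the 'Server' header is only known after at least one response from the host, so URIs discovered before any fetch from that host would have to be canonicalized without it (here, a null header means no lowercasing).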

Environment

None

Status

Assignee

Unassigned

Reporter

Gordon Mohr

Labels

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Affects versions

Heritrix 3.1.0-RC1

Priority

Critical