Are URLs including 'Japanese Full Space' supported?

Description

I'm writing to ask whether Heritrix supports the Japanese Full Space (U+3000, the full-width ideographic space).

When a target URL includes a Japanese Full Space ('E3 80 80' in UTF-8), Heritrix seems to escape it as '%3000'. As a result, Heritrix cannot access and collect the page, because the escaped URL contains '%3000' instead of '%E3%80%80'. This escaping seems to be done in org.archive.net.UURIFactory#escapeWhitespace.
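
To make the difference between the two escaped forms concrete, here is a minimal sketch in Java. It is not the actual UURIFactory code: the class and method names are hypothetical, and naiveEscape is only one plausible way an output like '%3000' could arise (formatting the code point as hex after a single '%').

```java
import java.nio.charset.StandardCharsets;

public class FullWidthSpaceEscape {

    // Hypothetical illustration only: writes the character's code point
    // as hex after a single '%'. For U+3000 this gives "%3000", which a
    // server will not decode back to the original character.
    static String naiveEscape(char c) {
        return "%" + Integer.toHexString(c).toUpperCase();
    }

    // Percent-encodes each byte of the character's UTF-8 encoding,
    // which is the form the report expects ("%E3%80%80" for U+3000).
    static String utf8PercentEncode(char c) {
        StringBuilder sb = new StringBuilder();
        for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        char fullWidthSpace = '\u3000';
        System.out.println(naiveEscape(fullWidthSpace));
        System.out.println(utf8PercentEncode(fullWidthSpace));
    }
}
```

The second form has one '%XX' triplet per UTF-8 byte, which is what a web server decodes back to U+3000.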

Is this because Heritrix doesn't yet support URLs that include Japanese characters? If so, I would like to know whether there is any concrete plan to support Japanese characters in the future.

Environment

None

Activity

Noah Levitt
October 28, 2015, 7:11 PM

Thank you for the report. This is certainly a bug. Other Japanese characters seem to work fine; only the special handling of whitespace causes this problem. In HEAD, the offending class has moved into the webarchive-commons project. I've filed a pull request to fix it: https://github.com/iipc/webarchive-commons/pull/50

Noah Levitt
October 28, 2015, 7:12 PM

In the meantime, if you have a seed with U+3000, you should be able to crawl it by escaping it yourself (%E3%80%80) in your cxml.
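
If there are many such seeds, the manual escaping could be scripted before the URLs go into the cxml. A minimal sketch, assuming plain string replacement is enough for your seeds; the class name, method name, and example URL are hypothetical and not part of Heritrix:

```java
public class SeedPreEscape {

    // Replaces every U+3000 (full-width space) in a seed URL with its
    // UTF-8 percent-encoded form, leaving all other characters untouched.
    static String escapeFullWidthSpace(String url) {
        return url.replace("\u3000", "%E3%80%80");
    }

    public static void main(String[] args) {
        // Hypothetical seed containing a full-width space in the path.
        String seed = "http://example.com/a\u3000b";
        System.out.println(escapeFullWidthSpace(seed));
    }
}
```

This only touches U+3000, so seeds that are already escaped pass through unchanged.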

Masahiro Shimada
October 29, 2015, 4:35 AM

Thank you so much for your quick reply! I look forward to the issue being fixed.

Noah Levitt
May 26, 2016, 6:02 PM

The pull request is merged; the fix should be in the next release of webarchive-commons.

Assignee

Unassigned

Reporter

Masahiro Shimada

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Minor