I'm writing to ask whether Heritrix supports the Japanese full-width space (U+3000).
When a target URL includes a Japanese full-width space ('E3 80 80' in UTF-8), Heritrix seems to escape it as '%3000'. As a result, Heritrix cannot access and collect the page, because the escaped URL contains '%3000' instead of '%E3%80%80'. This escaping seems to be done in org.archive.net.UURIFactory#escapeWhitespace.
Is this because Heritrix does not yet support URLs that include Japanese characters? If so, I would like to know whether there is a concrete plan to support them in the future.
Thank you for the report. This is certainly a bug. Other Japanese characters seem to work fine; only the special handling of whitespace causes this problem. In HEAD, the offending class has moved into the webarchive-commons project. I've filed a pull request to fix it: https://github.com/iipc/webarchive-commons/pull/50
In the meantime, if you have a seed containing U+3000, you should be able to crawl it by escaping the character yourself (as %E3%80%80) in your cxml.
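If you have many seeds to fix, the manual escaping could be scripted. The following is only a rough sketch, not code from Heritrix or webarchive-commons: the class name IdeographicSpaceEscape and the method escapeIdeographicSpace are made up for illustration, and the example URL is a placeholder. It replaces each U+3000 with the percent-encoded form of its UTF-8 bytes and leaves everything else in the string untouched.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Heritrix or webarchive-commons.
final class IdeographicSpaceEscape {

    // Replaces each U+3000 with the percent-encoded form of its UTF-8 bytes.
    // All other characters are copied through unchanged.
    static String escapeIdeographicSpace(String url) {
        StringBuilder out = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '\u3000') {
                // U+3000 is three bytes in UTF-8, so this appends three %XX groups.
                for (byte b : String.valueOf(c).getBytes(StandardCharsets.UTF_8)) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholder seed URL containing one full-width space.
        System.out.println(escapeIdeographicSpace("http://example.com/a\u3000b"));
    }
}
```

The output of this helper is what you would paste into the seed list in place of the original URL.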
Thank you so much for your quick reply! I look forward to the issue being fixed.
The pull request is merged; the fix should be in the next release of webarchive-commons.