Reported by British Library:
> From: "Hockx-Yu, Helen"
> Date: October 19, 2009 1:42:40 PM PDT
> Subject: One and other website and character encoding problem full text search
> Another minor problem which I would like to ask for your advice is
> that we recently released the full-text search function of the UK
> Web Archive: http://www.webarchive.org.uk. In the search result
> listing, empty spaces seem to be interpreted at questions marks. If
> you search for example the term "fish and chips", you will see a lot
> of ???:
> . The most typical example is BDA Sign Community : Fish Chips
> [05/06/2009], when clicking on the archived web page
> , it is clear that empty spaces are displayed as ?? in the search
> result. Very strange.
I looked at a few of the items in the results page linked above.
It appears that on this page, the HTML entity for the character '»' (e.g. &raquo;) is shown as '?' in the search results.
The search result snippet includes the page's title, which is:
<title>Tomorrow’s fish and chips by petite anglaise</title>
Again, it looks like the HTML entities are the problem: ’ and
I'm a bit surprised that would be a problem. It's so ubiquitous in HTML pages that if it were troublesome, we would have noticed it a long time ago, and it would be popping up in all NutchWAX deployments.
Brad pointed out that we should double-check the manner in which the results are displayed on the BL results pages, specifically which encoding they are using.
I looked at their results page and it uses "windows-1252". In the HTML, you can see:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
This is confused a bit further by the HTTP Content-Type header in the response that serves the page, which specifies "ISO-8859-1".
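To illustrate why the disagreement between the meta tag and the HTTP header matters, here is a small Python sketch that decodes the same bytes under both charsets. The byte string is an assumed example built around the byte 0x92; it is not data captured from the BL pages.

```python
# Illustrative only: this byte string is an assumed example, not taken from the BL site.
raw = b"Tomorrow\x92s fish and chips"

# Under windows-1252, byte 0x92 is U+2019 RIGHT SINGLE QUOTATION MARK.
as_cp1252 = raw.decode("windows-1252")

# Under ISO-8859-1, the same byte is U+0092, an invisible C1 control character.
as_latin1 = raw.decode("iso-8859-1")

print(repr(as_cp1252))
print(repr(as_latin1))
```

The two decodings differ only at that one position, which is the kind of discrepancy a client can run into when the header and the meta tag name different charsets.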
In either case, whether using "windows-1252" or "iso-8859-1", there are going to be characters that aren't representable within those encodings.
We strongly suggest that UTF-8 be used throughout, which avoids these encoding mismatch problems.
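As a rough sketch of the failure mode, the Python snippet below encodes a few characters into each charset with a "replace" error policy, which substitutes '?' for anything the charset cannot represent. Whether the BL front end substitutes characters in exactly this way is an assumption. The helper name encode_lossy and the euro and CJK samples are hypothetical additions for contrast.

```python
# Sample characters: the first two appear in the titles discussed above;
# the euro sign and CJK ideograph are assumed extras for contrast.
samples = {
    "right single quote": "\u2019",
    "right guillemet": "\u00bb",
    "euro sign": "\u20ac",
    "CJK ideograph": "\u4e2d",
}


def encode_lossy(text: str, charset: str) -> bytes:
    """Encode text, substituting '?' for characters the charset cannot represent."""
    return text.encode(charset, errors="replace")


for name, ch in samples.items():
    for charset in ("iso-8859-1", "windows-1252", "utf-8"):
        print(f"{name:20} {charset:13} {encode_lossy(ch, charset)!r}")
```

Only the UTF-8 column preserves every sample. Note that '»' is representable in both legacy charsets, so this sketch shows the general problem and does not diagnose every '?' seen in the results.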