Reported by British Library:
> From: "Hockx-Yu, Helen"
> Date: October 19, 2009 1:42:40 PM PDT
> Subject: One and other website and character encoding problem full text search
>
> Another minor problem which I would like to ask for your advice is
> that we recently released the full-text search function of the UK
> Web Archive: http://www.webarchive.org.uk. In the search result
> listing, empty spaces seem to be interpreted at questions marks. If
> you search for example the term "fish and chips", you will see a lot
> of ???:
> http://www.webarchive.org.uk/ukwa/searchtext?text_s=fish+and+chips
> . The most typical example is BDA Sign Community : Fish Chips
> [05/06/2009], when clicking on the archived web page
> http://www.webarchive.org.uk/wayback/archive/20090605150409/http://bda.org.uk/Fish_Chips-i-358.html
> , it is clear that empty spaces are displayed as ?? in the search
> result. Very strange.
I looked at a few of the items in the results page linked above.
1. http://www.webarchive.org.uk/wayback/archive/20090605150409/http://bda.org.uk/Fish_Chips-i-358.html
It appears that on this page, the HTML entity for the character '»' is shown as '?' in the search results.
The search result snippet includes the page's title, which is:
<title>Tomorrow’s fish and chips by petite anglaise</title>
Again, it looks like the HTML entities are the problem: ’ and
I'm a bit surprised that this would be a problem. HTML entities are so ubiquitous in HTML pages that if they were troublesome, we would have noticed a long time ago, and the problem would be popping up in all NutchWAX deployments.
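For reference, here is a minimal sketch of what decoding such entities looks like, using Python's standard `html.unescape`. The entity spellings below (`&raquo;`, `&#187;`, `&rsquo;`, `&#8217;`) are assumptions for illustration; the archived pages may use other forms, and this is not NutchWAX's own code.

```python
import html

# Assumed entity spellings; the archived pages may use different forms.
SAMPLES = ["&raquo;", "&#187;", "&rsquo;", "&#8217;"]


def decode_entities(text: str) -> str:
    """Replace named and numeric HTML character references with the characters they stand for."""
    return html.unescape(text)


if __name__ == "__main__":
    for raw in SAMPLES:
        decoded = decode_entities(raw)
        # Print the code point so the result is unambiguous whatever the terminal encoding is.
        print(f"{raw} -> U+{ord(decoded):04X}")
```

Once decoded, these are ordinary Unicode characters (U+00BB and U+2019), so whether they display correctly depends on the encoding used when the results page is written out.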
Brad pointed out that we should double-check the manner in which the results are displayed on the BL results pages, specifically which encoding they are using.
I looked at their results page and it declares "windows-1252". In the HTML, you can see:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
This is confused a bit further by the fact that the HTTP Content-Type header in the response that serves the page specifies "ISO-8859-1".
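A mismatch like this can be checked mechanically. The sketch below is only illustrative: the helper names are hypothetical, and the full header value `text/html; charset=ISO-8859-1` is an assumption (only the charset name was observed). It extracts the charset from a Content-Type header and from a `<meta>` tag and compares them.

```python
import re
from email.message import Message
from typing import Optional

# Matches the charset value inside a <meta ...> tag (either meta form).
META_CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def charset_from_header(content_type: str) -> Optional[str]:
    """Return the lower-cased charset parameter of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def charset_from_meta(html_text: str) -> Optional[str]:
    """Return the lower-cased charset declared in the first matching <meta> tag."""
    match = META_CHARSET_RE.search(html_text)
    return match.group(1).lower() if match else None


def declared_charsets_agree(content_type: str, html_text: str) -> bool:
    """True only if both declarations are present and name the same charset."""
    header = charset_from_header(content_type)
    meta = charset_from_meta(html_text)
    return header is not None and header == meta


if __name__ == "__main__":
    header_value = "text/html; charset=ISO-8859-1"  # assumed full header value
    meta_tag = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'
    print(charset_from_header(header_value), charset_from_meta(meta_tag))
    print(declared_charsets_agree(header_value, meta_tag))
```

When the two disagree, browsers generally give the HTTP header precedence over the `<meta>` declaration, so the header is the one to fix first.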
In either case, whether using "windows-1252" or "iso-8859-1", there are going to be characters that aren't representable within those encodings.
We strongly suggest using UTF-8 throughout, which avoids these encoding mismatch problems.
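As a small illustration of how a '?' can arise, the sketch below encodes the title seen above into each charset, substituting '?' for anything the charset cannot represent. This is an assumption about the mechanism, not a description of the BL display code, and `encode_for_page` is a hypothetical helper.

```python
def encode_for_page(text: str, charset: str) -> bytes:
    """Encode text for output, replacing unrepresentable characters with '?'."""
    return text.encode(charset, errors="replace")


# The title from the search result, with U+2019 (right single quotation mark).
TITLE = "Tomorrow\u2019s fish and chips"

if __name__ == "__main__":
    # ISO-8859-1 has no slot for U+2019, so it degrades to '?'.
    print(encode_for_page(TITLE, "iso-8859-1"))    # b'Tomorrow?s fish and chips'
    # windows-1252 can represent it, as the single byte 0x92.
    print(encode_for_page(TITLE, "windows-1252"))  # b'Tomorrow\x92s fish and chips'
    # UTF-8 can represent every Unicode character, so the round trip is lossless.
    print(encode_for_page(TITLE, "utf-8").decode("utf-8") == TITLE)  # True
```

Note that the two legacy charsets disagree even on this one character, which is why a page whose header and `<meta>` tag name different encodings can behave unpredictably.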