Some odd-ball characters display as '?' in search results.

Description

Reported by British Library:

> From: "Hockx-Yu, Helen"
> Date: October 19, 2009 1:42:40 PM PDT
> Subject: One and other website and character encoding problem full text search
>
> Another minor problem which I would like to ask for your advice is
> that we recently released the full-text search function of the UK
> Web Archive: http://www.webarchive.org.uk. In the search result
> listing, empty spaces seem to be interpreted at questions marks. If
> you search for example the term "fish and chips", you will see a lot
> of ???:
> http://www.webarchive.org.uk/ukwa/searchtext?text_s=fish+and+chips
> . The most typical example is BDA Sign Community : Fish Chips
> [05/06/2009], when clicking on the archived web page
> http://www.webarchive.org.uk/wayback/archive/20090605150409/http://bda.org.uk/Fish_Chips-i-358.html
> , it is clear that empty spaces are displayed as ?? in the search
> result. Very strange.

Environment

None

Activity

Aaron Binns
October 22, 2009, 5:20 PM

I looked at a few of the items in the results page linked above.

1. http://www.webarchive.org.uk/wayback/archive/20090605150409/http://bda.org.uk/Fish_Chips-i-358.html

It appears that on this page, the HTML entity for the character '»' is shown as '?' in the search results.

2. http://www.webarchive.org.uk/wayback/archive/20080531231756/http://www.petiteanglaise.com/archives/2008/02/15/tomorrows-fish-and-chips/

The search result snippet includes the page's title, which is:

<title>Tomorrow&#8217;s fish and chips&nbsp;by&nbsp;petite anglaise</title>

Again, it looks like the HTML entities are the problem: &#8217; and &nbsp;

I'm a bit surprised that &nbsp; would be a problem. It's so ubiquitous in HTML pages that if it were troublesome, we would have noticed it a long time ago, and it would be popping up in all NutchWAX deployments.
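
To make the two entities concrete, here is a minimal sketch in Python (not the NutchWAX code path, which is Java) that decodes the title quoted above and lists the non-ASCII characters it ends up containing. The variable names are illustrative only.

```python
import html

# Title text exactly as quoted from the archived page above.
title = "Tomorrow&#8217;s fish and chips&nbsp;by&nbsp;petite anglaise"

# html.unescape turns numeric and named entities into real characters.
decoded = html.unescape(title)

# Collect the distinct non-ASCII characters that the entities produced.
non_ascii = sorted({ch for ch in decoded if ord(ch) > 127})

for ch in non_ascii:
    print(f"U+{ord(ch):04X}")
```

Under these assumptions the decoded title contains U+00A0 (no-break space) and U+2019 (right single quotation mark), so whatever renders the snippet has to be able to represent both.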

Aaron Binns
October 26, 2009, 10:34 PM

Brad pointed out that we should double-check the manner in which the results are displayed on the BL results pages, specifically which encoding they are using.

I looked at their results page, and it declares "windows-1252". In the HTML, you can see:

<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">

This is confused a bit further by the fact that the HTTP Content-Type header in the response that serves the page specifies "ISO-8859-1".
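
A quick way to spot this kind of disagreement is to compare the charset named in the HTTP header with the one named in the page's meta tag. The following is a rough Python sketch using only the standard library. The helper names (charset_from_content_type, MetaCharsetFinder, declared_charsets) are made up for illustration, and the header value is reconstructed from the description above rather than copied from an actual response.

```python
from email.message import Message
from html.parser import HTMLParser


def charset_from_content_type(value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    if not value:
        return None
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


class MetaCharsetFinder(HTMLParser):
    """Records the first charset declared by a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attr = dict(attrs)
        if (attr.get("http-equiv") or "").lower() == "content-type":
            self.charset = charset_from_content_type(attr.get("content"))
        elif attr.get("charset"):
            self.charset = attr["charset"].lower()


def declared_charsets(header_value, html_text):
    """Return (charset from HTTP header, charset from meta tag)."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return charset_from_content_type(header_value), finder.charset


# Assumed header value, based on the comment above.
header = "text/html; charset=ISO-8859-1"
page = '<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">'

from_header, from_meta = declared_charsets(header, page)
if from_header != from_meta:
    print(f"mismatch: header says {from_header}, meta says {from_meta}")
```

With the values from this ticket, the two declarations come back as iso-8859-1 and windows-1252, which is the mismatch described above.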

In either case, whether using "windows-1252" or "iso-8859-1", there are going to be characters that aren't representable within those encodings.
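
As a rough illustration of how a '?' can arise, the Python sketch below encodes a few sample characters into each charset with errors="replace", which substitutes '?' for anything the target encoding cannot represent. This is only an analogy: it is not the code that renders the BL results page, and the sample set (including the CJK character, which does not come from the ticket) is chosen purely for demonstration.

```python
# Sample characters: the first three come from the pages discussed above,
# the last is an arbitrary character outside both single-byte charsets.
SAMPLES = {
    "RIGHT SINGLE QUOTATION MARK": "\u2019",
    "RIGHT-POINTING DOUBLE ANGLE QUOTE": "\u00bb",
    "NO-BREAK SPACE": "\u00a0",
    "CJK IDEOGRAPH (arbitrary example)": "\u4e2d",
}

CHARSETS = ("windows-1252", "iso-8859-1", "utf-8")


def encode_lossy(text, charset):
    """Encode text, replacing unrepresentable characters with '?'."""
    return text.encode(charset, errors="replace")


def unrepresentable(text, charset):
    """Return the characters of text that charset cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


for name, ch in SAMPLES.items():
    for charset in CHARSETS:
        print(f"{name:36} {charset:13} {encode_lossy(ch, charset)!r}")
```

In this sketch, U+2019 encodes in windows-1252 but becomes '?' in ISO-8859-1, the CJK character becomes '?' in both, and UTF-8 represents every sample without loss.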

We strongly suggest that UTF-8 be used throughout, which avoids these encoding mismatch problems.

Not a Bug

Assignee

Aaron Binns

Reporter

Aaron Binns

Labels

None

Issue Category

None

Group Assignee

None

ZendeskID

None

Estimated Difficulty

None

Actual Difficulty

None

Priority

Major