Wednesday 17 August 2011

A bug in Google search?

I think I have just found a (small) bug in Google search. Unfortunately this explanation may get a little technical for non-programmers, but I will try to explain the problem as simply as I can.

I was looking through some logs for my website and I found that someone had entered a search term akin to 'britishwalks walk 906'. The search led to this search query.

As you can see, Google search displayed the first couple of lines of each entry; in the case of my website the first hit contained the following: '7 Jan 2011 – Walk #906:'

This is strange, as the walk was actually walked on the 1st of July. This date is present in my webpage as 01/07/2011. The search engine is evidently reading this UK-format date (day/month/year) as if it were in American format (month/day/year), so that 01/07/2011 becomes the 7th of January, before displaying the month in three-letter textual form ('Jan' instead of 01).

The previous walk was walked on the 30th of June, and that displays correctly within Google search. This suggests that Google performs some data checking: as 30 is greater than the number of months in a year (12), 30/06/2011 is invalid as a US-format date, and therefore they read it in the more common UK format. This fault is present on every webpage I have checked where the day of the month is 12 or less.
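For the programmers among you, here is a minimal Python sketch of the behaviour I suspect. It is purely a guess at the logic: the function names are my own invention and I have no knowledge of how Google actually parses dates. It tries the US order first and only falls back to the UK order when the US reading is impossible.

```python
from datetime import date

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def guess_date(text):
    """Hypothetical US-first reading of an nn/nn/yyyy string."""
    first, second, year = (int(part) for part in text.split("/"))
    try:
        # Try month/day/year (US order) first.
        return date(year, first, second)
    except ValueError:
        # Not a valid US date (e.g. month 30), so fall back to
        # day/month/year (UK order).
        return date(year, second, first)


def snippet_date(text):
    """Format the guessed date the way it appears in the search result."""
    d = guess_date(text)
    return f"{d.day} {MONTHS[d.month - 1]} {d.year}"


print(snippet_date("01/07/2011"))  # 7 Jan 2011 (wrong: the walk was on 1 July)
print(snippet_date("30/06/2011"))  # 30 Jun 2011 (right, thanks to the fallback)
```

Any date whose day is 12 or less passes the first check, which would explain why only those dates come out wrong.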

I have had a quick (but hardly exhaustive) ponder and cannot think of any way my pages could be creating this problem. Likewise, I cannot think of a way of setting your locale in an HTML page to let them know the format of items like dates. I could use a locale-neutral format such as yyyy-mm-dd (e.g. 2011-08-17), but that is far less obvious to my readers, the vast majority of whom are in the UK.
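If I did decide to go down the locale-neutral route, the conversion itself would be trivial. The following is only a sketch, assuming every date on my pages is written as dd/mm/yyyy; the function name is made up for illustration.

```python
from datetime import datetime


def uk_to_iso(text):
    """Convert a dd/mm/yyyy string into yyyy-mm-dd form."""
    return datetime.strptime(text, "%d/%m/%Y").date().isoformat()


print(uk_to_iso("01/07/2011"))  # 2011-07-01
print(uk_to_iso("30/06/2011"))  # 2011-06-30
```

A date in this form cannot be misread as month/day/year, whichever country the reader (or the search engine) is in.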

Perhaps if the domain is a 'uk' one or the domain is registered in the UK Google could default to UK date format; this would be much more work for them and would still be prone to potential errors. It may be far simpler for them not to parse the displayed date to include a month name, and instead just to display it as it appears in the webpage. 
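As a rough illustration of the domain idea, here is a small Python sketch. The rule (treat any host name ending in '.uk' as day-first), the function names and the example.co.uk and example.com addresses are all my own assumptions, not anything Google is known to do.

```python
from datetime import date
from urllib.parse import urlsplit


def is_day_first(url):
    """Hypothetical rule: pages on a .uk domain use day/month/year."""
    host = urlsplit(url).hostname or ""
    return host.endswith(".uk")


def parse_for_site(text, url):
    """Read an nn/nn/yyyy string in the order suggested by the domain."""
    first, second, year = (int(part) for part in text.split("/"))
    if is_day_first(url):
        return date(year, second, first)  # day/month/year
    return date(year, first, second)      # month/day/year


print(parse_for_site("01/07/2011", "http://www.example.co.uk/walk"))  # 2011-07-01
print(parse_for_site("01/07/2011", "http://www.example.com/walk"))    # 2011-01-07
```

The weakness is plain to see: a UK site on a '.com' or '.org' domain would still be read in the American order.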

This is hardly a major bug or feature, but nonetheless is interesting. Why do they parse a plain-text date within a webpage and convert it to another format? Do they do this for any other date on the page, and if so are these conversions prone to similar errors?

I have done a quick search (with Google, naturally...) and cannot see this reported anywhere else. This means that the bug may only just have been introduced, or it may be a transient feature.

It should also be noted that Google's results are far more helpful than Bing's, which do not even include the obvious webpage.

2 comments:

Griffmonster said...

just a thought - would setting the html element lang attribute value to "en-GB" make any difference

David Cotton said...

Doh! I forgot about the region subtag for the language, which I usually try to avoid to keep the language as generic as possible.

I have altered walk 906 to be en-GB instead of en; however the change will not show in Google's search results until their bot has trawled the page.

However, it does not explain why they reformat the date string, which is in plaintext in the webpage. I can understand them removing formatting such as tables before displaying the information, but not reformatting the plaintext date string.

I shall wait and see if it fixes the problem...

cheers, Griffmonster. As usual, ask a question on the Internet and someone will come up with a clever and good answer :-)