more IR

Earlier this week I came across an interesting IR post by Dan Lemire that I had meant to write about.  He compares searching for “Kurt Gödel” with “Kurt Goedel”, and the comments add “Kurt Godel”.  Google returns different results for the first two but Bing doesn’t.  According to the comments, “Gödel” and “Godel” return the same results in both engines.

The problem is interesting for two reasons.  Firstly, the three queries refer to the same person, so you can make an intuitive argument that they should return the same results.

Secondly, the choice of spelling says a little about the person searching.  Someone searching with “Gödel” may want more formal documents, whereas someone searching with “Godel” may not have a preference.  Of course, it could reflect irrelevant factors – the person searching with “Godel” could assume that the spelling doesn’t matter to search engines, or could be in a rush, or the difference could reflect the relative ease of typing diacritics on OS X vs Windows.

In the comments of the post, Paul makes an apt point – Google’s “Did you mean?” feature can address most of the problem.

If I had to propose a solution, I’d suggest query expansion as a form of text normalization – “Gödel” might be expanded to 40% “Gödel”, 30% “Goedel”, and 30% “Godel”.  That would reflect the synonymy while still accounting for a slight preference for the version the user typed.
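
Here is a rough sketch of what that weighted expansion could look like in Python.  The 40/30/30 split is just the example above; the function names, the tiny German-style transliteration table, and the rule of splitting the leftover weight evenly are all my own assumptions, not anything a real search engine is known to do.

```python
import unicodedata

# Hypothetical, deliberately tiny transliteration table (German-style).
# A real system would need a much larger, language-aware mapping.
TRANSLITERATIONS = {
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
}


def transliterate(term):
    """Replace characters with their ASCII digraphs, e.g. 'ö' -> 'oe'."""
    return "".join(TRANSLITERATIONS.get(ch, ch) for ch in term)


def strip_diacritics(term):
    """Drop combining marks after decomposition, e.g. 'ö' -> 'o'."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_term(term, original_weight=0.4):
    """Return {variant: weight} for a query term.

    The spelling the user typed keeps `original_weight`; the remaining
    weight is split evenly across the other distinct variants (an
    assumption made for illustration).
    """
    # Use the precomposed form so 'o' + combining diaeresis matches the table.
    term = unicodedata.normalize("NFC", term)

    variants = []
    for candidate in (term, transliterate(term), strip_diacritics(term)):
        if candidate not in variants:
            variants.append(candidate)

    # Nothing to expand: the typed spelling carries all the weight.
    if len(variants) == 1:
        return {term: 1.0}

    share = (1.0 - original_weight) / (len(variants) - 1)
    weights = {term: original_weight}
    for variant in variants[1:]:
        weights[variant] = share
    return weights
```

Calling `expand_term("Gödel")` gives the typed spelling a weight of 0.4 and splits the rest between “Goedel” and “Godel”.  Note that this sketch only expands in one direction: `expand_term("Godel")` returns the term unchanged, since there is no way to guess from plain ASCII which diacritics the user left out.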


