Wednesday, August 23, 2006

Keywords, Semantics, and Disambiguation (Oh, My!)

Since the advent of the now-defunct Archie server in 1990, keyword queries have played a critical role in all search engine technology, reinventing -- if not destroying -- the traditional role of categorization as a means of organizing information. Indeed, one of the most profound shortcomings of most Web directories has been their failure to include a data entry field for keywords. To wit, if the important keywords for a particular category or site do not appear in the Web directory's path to that category, or in the title or description of a website listing, there is simply no way to find that category or listing with a flat-file search of the site.
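To make the directory problem concrete, here is a minimal Python sketch of a flat-file search. Everything in it is an assumption for illustration: the `Listing` record, the `search` function, the optional `keywords` field, and the sample listing are all hypothetical and do not describe any real directory's software.

```python
from dataclasses import dataclass, field


@dataclass
class Listing:
    path: str            # directory path to the category
    title: str           # title of the website listing
    description: str     # editor-written description
    # Hypothetical extra field that most directories lack.
    keywords: list[str] = field(default_factory=list)


def search(listings, query, use_keywords=False):
    """Case-insensitive substring search over each listing's text fields."""
    q = query.lower()
    hits = []
    for item in listings:
        haystack = [item.path, item.title, item.description]
        if use_keywords:
            haystack.extend(item.keywords)
        if any(q in text.lower() for text in haystack):
            hits.append(item)
    return hits


# Invented sample data.
listings = [
    Listing(
        path="Business/Automotive/Dealers",
        title="Smith Motors",
        description="Family-owned dealership serving the tri-county area.",
        keywords=["used cars", "trucks"],
    ),
]

without_field = search(listings, "used cars")
with_field = search(listings, "used cars", use_keywords=True)
```

In this toy example the phrase "used cars" appears nowhere in the path, title, or description, so the first search comes back empty; only the search that consults the keyword field finds the listing.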

At the other extreme are Overture and other purveyors of sponsored search listings. (Overture was formerly known as GoTo and is now known as Yahoo! Sponsored Search, not to be confused with Yahoo! Search Marketing's various other programs such as Yahoo! Search Submit and Yahoo! Directory Submit.) These companies have long been exploiting the fact that website owners will pay as much as $100.00 per click to have a sponsored link displayed alongside the organic search results for particular keywords. And because of the highly subjective and largely self-serving editorial policies of the major players in the pay-for-play search engine market space, quality control is sort of a non-starter. To wit, if end users have trouble finding what they're looking for, they'll just click on more sponsored links.

Somewhere between these two extremes, where the role of keywords is either ignored or exploited, are search solutions such as Technorati or Flickr, which employ user-created keyword tags to index the content found on websites. While these search solutions are still very much in their infancy, they suffer from the same sorts of problems that other end-user solutions have always suffered from: (1) a lack of competence on the part of webmasters in objective description and (2) cynical gaming of keyword algorithms. Assuming, arguendo, that these two quality control problems could be resolved, the two ongoing issues that would emerge when using keywords to organize information are (1) semantics and (2) disambiguation.

While attending Search Engine Strategies San Jose 2006, I stumbled upon the booth of Teragram, a company that seems to be on the cutting edge of linguistic analysis, and they confirmed what I already knew about keyword semantics and disambiguation. To wit, when you're trying to figure out what somebody really wants to know when they enter a particular keyword query into a search engine -- i.e., when you engage in semantic keyword analysis -- there's no easy way to do so. But what truly surprised me is that the Teragram representative with whom I spoke candidly admitted that they often turn to Wikipedia as a linguistic resource for disambiguation.

As far as I can tell, Wikipedia's disambiguation process is wholly human-driven, and it does not seem to rely upon linguistic expertise. Rather, rank-and-file Wikipedians who are fluent in a particular language are encouraged to ask themselves, "When a reader enters [a specific keyword or keyword phrase in this language] and pushes 'Go,' what article would they realistically be expecting to see as a result?" If, in the opinion of the Wikipedian asking this question, there is a reasonable chance of confusion, he or she is encouraged to add a "disambiguation link" near the top of the primary article.

In cases of extreme ambiguity, Wikipedians are encouraged to create a "disambiguation page" which is populated with disambiguation links. And when one disambiguation page doesn't do the job, the first disambiguation page will lead to one or more other disambiguation pages. While not disambiguation per se, some Wikipedia articles include one or more summaries narrating related topics, and these summaries are usually linked to larger articles on those related topics.
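The structure described in the last two paragraphs can be modeled as a small graph of pages. The following Python sketch is only a toy model under stated assumptions: the `PAGES` table, the `kind`, `hatnote`, and `links` keys, and the `candidates` and `go` functions are hypothetical names of my own, and the sample titles are invented rather than taken from Wikipedia. It is not how Wikipedia's software works.

```python
# Toy page table. "article" pages may carry a disambiguation link
# ("hatnote"); "disambiguation" pages only list links to other pages.
PAGES = {
    "Java": {"kind": "article", "hatnote": "Java (disambiguation)"},
    "Java (disambiguation)": {
        "kind": "disambiguation",
        "links": ["Java", "Java (programming language)", "Java (drinks)"],
    },
    "Java (programming language)": {"kind": "article"},
    # A second-level disambiguation page reached from the first one.
    "Java (drinks)": {
        "kind": "disambiguation",
        "links": ["Java coffee", "Java tea"],
    },
    "Java coffee": {"kind": "article"},
    "Java tea": {"kind": "article"},
}


def candidates(title, pages, seen=None):
    """Flatten a page, and any nested disambiguation pages, into article titles."""
    seen = set() if seen is None else seen
    if title in seen or title not in pages:
        return []          # skip missing pages and guard against link cycles
    seen.add(title)
    page = pages[title]
    if page["kind"] == "article":
        return [title]
    result = []
    for link in page["links"]:
        result.extend(candidates(link, pages, seen))
    return result


def go(query, pages):
    """Return (primary article or None, other articles the reader may have meant)."""
    page = pages.get(query)
    if page is None:
        return None, []
    if page["kind"] == "article":
        hatnote = page.get("hatnote")
        others = candidates(hatnote, pages) if hatnote else []
        return query, [t for t in others if t != query]
    return None, candidates(query, pages)


primary, alternatives = go("Java", PAGES)
```

The sketch shows why the human judgment matters: the code only follows links, while deciding which page deserves to be the primary article for "Java", and which alternatives are worth listing, is exactly the question each Wikipedian is asked to answer by hand.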

As I alluded to above, disambiguation is usually a painstaking, labor-intensive process that is properly informed by deep reflection. However, disambiguation seems quite painless when contrasted with the much larger challenge of semantic analysis, which presumes that an exhaustive disambiguation process has already been performed. At that point, the objective is to determine the meaning of a particular word or group of words in a particular context. This is not unlike the process by which a court interprets a statute promulgated by a legislature.

As I stated in a previous blog post, there are those who believe that, given the right tools, Wikipedians will be able to handle the challenging work of semantic analysis, but I do not share this optimism. Rather, I believe that semantic analysis will remain an esoteric discipline for decades to come, and that those who are proficient at semantic analysis will be able to write their own tickets in Corporate America until such time as a technological singularity is reached and artificial intelligence agents are created that can read and write and/or converse with human beings.


