Wednesday, August 23, 2006

Keywords, Semantics, and Disambiguation (Oh, My!)

Since the advent of the now defunct Archie server in 1990, keyword queries have played a critical role in all search engine technology, reinventing -- if not destroying -- the traditional role of categorization as a means of organizing information. Indeed, one of the most profound shortcomings of most Web directories has been their failure to include a data entry field for keywords. To wit, if the important keywords for a particular category or site are not found in the Web directory's path to that category or in the title/description of a website listing, there's simply no way that the category or listing can be found using a flat-file search of the site.
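
To make the point concrete, here's a rough sketch of the kind of flat-file search I have in mind, written in Python with made-up listing data (any real directory stores more than this, but the principle is the same): if a keyword appears in neither the category path nor the title/description, no search of the flat file will ever surface the listing.

```python
# Hypothetical directory listings: category path, title, and description only.
LISTINGS = [
    {"path": "Computers/Internet/Searching",
     "title": "Acme Search Tips",
     "description": "Advice on finding things online."},
]

def flat_file_search(query, listings=LISTINGS):
    """Return listings whose path, title, or description contains the query."""
    q = query.lower()
    return [l for l in listings
            if q in l["path"].lower()
            or q in l["title"].lower()
            or q in l["description"].lower()]

print(flat_file_search("search"))  # finds the listing
print(flat_file_search("SEO"))     # [] -- the keyword was never recorded, so the listing is invisible
```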

At the other extreme, Overture -- formerly known as GoTo, now known as Yahoo! Sponsored Search, not to be confused with Yahoo! Search Marketing's various other programs such as Yahoo! Search Submit and Yahoo! Directory Submit -- and other purveyors of sponsored search listings have long been exploiting the fact that website owners will pay as much as $100.00 per click to have a sponsored link displayed alongside the organic search results for particular keywords. And because of the highly subjective and largely self-serving editorial policies of the major players in the pay-for-play search engine market space, quality control is sort of a non-starter. To wit, if end users have trouble finding what they're looking for, they'll just click on more sponsored links.

Somewhere between these two extremes -- where the role of keywords is either ignored or exploited -- are search solutions such as Technorati, del.icio.us, and Flickr, which employ user-created keyword tags to index the content found on websites. While these search solutions are still very much in their infancy, they suffer from the same sorts of problems that other end-user solutions have always suffered from: (1) a lack of competence on the part of webmasters in objective description and (2) cynical gaming of keyword algorithms. Assuming, arguendo, that these two quality control problems could be resolved, the two ongoing issues that would emerge when using keywords to organize information are (1) semantics and (2) disambiguation.
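
For the uninitiated, the indexing model behind these tagging services boils down to an inverted index from user-supplied tags to URLs. The sketch below uses hypothetical data and is not any particular service's implementation, but it shows both the appeal and the Achilles' heel: you can only find what somebody bothered to tag.

```python
from collections import defaultdict

# Hypothetical user-supplied tags for a handful of URLs.
TAGGED_PAGES = {
    "http://example.com/recipes/pho": ["soup", "vietnamese", "recipe"],
    "http://example.org/travel/hanoi": ["vietnam", "travel"],
}

def build_tag_index(tagged_pages):
    """Invert the mapping: tag -> set of URLs carrying that tag."""
    index = defaultdict(set)
    for url, tags in tagged_pages.items():
        for tag in tags:
            index[tag.lower()].add(url)
    return index

index = build_tag_index(TAGGED_PAGES)
print(index["recipe"])   # pages somebody chose to label "recipe"
print(index["noodles"])  # empty -- nobody tagged it, so nobody finds it
```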

While attending Search Engine Strategies San Jose 2006, I stumbled upon the booth of Teragram, a company that seems to be on the cutting edge of linguistic analysis, and they confirmed what I already knew about keyword semantics and disambiguation. To wit, when you're trying to figure out what somebody really wants to know when they enter a particular keyword query into a search engine -- i.e., when you engage in semantic keyword analysis -- there's no easy way to do so. But what truly surprised me is that the Teragram representative that I spoke to candidly admitted that they often turn to Wikipedia as a linguistic resource for disambiguation.

As far as I can tell, Wikipedia's disambiguation process is wholly human-driven, and it does not seem to rely upon linguistic expertise. Rather, rank-and-file Wikipedians who are fluent in a particular language are encouraged to ask themselves, "When a reader enters [a specific keyword or keyword phrase in this language] and pushes 'Go,' what article would they realistically be expecting to see as a result?" If, in the opinion of the Wikipedian asking him- or herself this question, there is a reasonable chance of confusion, he or she is encouraged to add a "disambiguation link" near the top of the primary article.

In cases of extreme ambiguity, Wikipedians are encouraged to create a "disambiguation page" which is populated with disambiguation links. And when one disambiguation page doesn't do the job, it will lead to one or more additional disambiguation pages. While not disambiguation per se, some Wikipedia articles include one or more summaries narrating related topics, and these summaries are usually linked to larger articles on those related topics.
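
Here's a toy model of that editorial convention -- my own sketch, not Wikipedia's actual software -- in which each term maps to a primary article plus zero or more disambiguation links, and hopelessly overloaded terms get a dedicated disambiguation page:

```python
# Hypothetical disambiguation data, modeled loosely on Wikipedia's conventions.
DISAMBIGUATION = {
    "mercury": {
        "primary": "Mercury (planet)",
        "hatnotes": ["Mercury (element)", "Mercury (mythology)"],
    },
    "john smith": {
        "primary": None,  # no single obvious target
        "disambiguation_page": "John Smith (disambiguation)",
    },
}

def resolve(query):
    """Mimic the 'Go' button: primary article, hatnote links, or a disambiguation page."""
    entry = DISAMBIGUATION.get(query.lower())
    if entry is None:
        return "No article found"
    if entry.get("primary"):
        return (entry["primary"], entry.get("hatnotes", []))
    return entry["disambiguation_page"]

print(resolve("Mercury"))     # primary article plus "see also" links
print(resolve("John Smith"))  # straight to a disambiguation page
```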

As I alluded to above, disambiguation is usually a painstaking, labor-intensive process that is properly informed by deep reflection. However, the process of disambiguation seems quite painless when it is contrasted with the much larger challenge of semantic analysis, the latter presuming that an exhaustive disambiguation process has already been performed. At that point, the objective is to determine the meaning of a particular word or group of words in a particular context. This is not unlike the process by which a statute promulgated by a legislature is interpreted by a court.

As I stated in a previous blog post, there are those who believe that, given the right tools, Wikipedians will be able to handle the challenging work of semantic analysis, but I do not share this optimism. Rather, I believe that semantic analysis will remain an esoteric discipline for decades to come, and that those who are proficient at semantic analysis will be able to write their own tickets in Corporate America until such time as a technological singularity is reached and artificial intelligence agents are created that can read and write and/or converse with human beings.

Monday, August 21, 2006

Battle of the Internet Trade Shows

Search Engine Strategies (SES) Toronto 2006 went head to head with the much larger AdTech San Francisco on April 25th and 26th earlier this year, and I had a hard time deciding which event I should attend, if either. I finally decided that the smaller venue in Toronto would give me greater access to some of the heavy hitters in the Search Engine Optimization (SEO) industry whom I knew personally and who were committed to the SES Toronto show. I also thought SES Toronto would give me insight into the still-emerging bilingual search engine space in the United States. However, as Internet trade shows go, it is pretty clear that SES now ranks third behind AdTech and WebmasterWorld PubCon as a place to see and be seen by others in the SEO industry. Yahoo! didn't even bother sending a contingent to Toronto, and I ran into at least one person staffing a booth at Toronto who confessed that he worked for a completely different division of the company that he was representing and was simply filling in as a warm body while the more knowledgeable representatives at his company worked the San Francisco show. This made me wonder just how long the Battle of the Internet Trade Shows can continue without SES turning into AdTech roadkill.

I am often asked why I attend trade shows. Indeed, I often ask myself that same question when I contemplate the time, trouble, and expense that is involved in attending a show. However, at some point after I arrive at a show, some serendipitous meeting occurs that puts everything in perspective. To wit, someone working a booth, someone I meet at a party or reception, or (on the rare occasions that I attend a conference presentation) some conference speaker provides me with a key bit of critical data that allows me to stay one step ahead of the competition and/or troubleshoot one of the problems that my clients are having. I also find myself playing the role of matchmaker, putting together the people who need a product or service with the people who provide that product or service, generating goodwill (and sometimes extra income) in the process. In sum, whenever I attend a trade show, there is always some specific event that occurs that makes me glad I came.

As worthwhile as I find attending trade shows, there comes a point where the time, trouble, and expense that is involved in doing so yields diminishing returns. Back in the day, there was only one high-tech trade show: Comdex. However, while I was away at law school, Comdex became a has-been and Internet World became the new venue for technology geeks. A few years later, Internet World became a casualty of the dot-com bust, and sometime thereafter SES seemed to rise from the proverbial ashes. However, as I alluded to earlier in this post, SES has become too specialized. To wit, as important as search engine marketing is, it is not as important as some people seem to think it is when it comes to connecting people with the products, services, and information that they need, and I think that the organizers of SES will start finding it harder and harder to attract major vendors like Google, Yahoo!, and MSN to their events.

In his keynote address for SES Toronto, Danny Sullivan brought up the issue of "contextual pollution." To wit, contextual advertising programs like Google AdWords do not fit into the rubric of "traditional" search engine marketing, and neither Google nor Yahoo! provides metrics that break out contextual advertising from "traditional" search engine marketing. He then asserted that all online marketing media will need to start providing accountability through solid metrics and went on to predict that search engine marketing will enter a "third generation" where search becomes more vertical and more personalized. After clarifying what is and isn't search marketing, Sullivan reassured everyone that the current popularity of search engine marketing "is not a bubble," but that most players in the search engine space are now looking "beyond search" when hawking products and services to their customers. This made me think, paraphrasing a famous quote, that when someone says, "This is not a bubble," . . . it's a bubble.

Mark my words: Intellectual purity about what is and isn't search engine marketing will make search engine marketing as we know it obsolete, along with the SES trade shows if the organizers of said shows continue to limit themselves to exploring search engine strategies and only search engine strategies. Search engines are only one component of online information distribution and retrieval, and -- as I stated in a previous blog post -- the big picture goes far beyond search engine marketing. It involves a wide variety of traditional media, emerging technologies, and one or more disruptive information technologies that are flying under most people's radar at the moment, and it involves a wide variety of activities ranging from education to commerce. Even AdTech and PubCon are too commercial for my tastes, but they remain extremely relevant to their target audience, so much so that they may eclipse the various non-commercial activities that are usually associated with search engine marketing a la SES.

Saturday, August 19, 2006

Ghost URLs

In a previous blog post, I set forth the site configuration procedures that I go through prior to launching a static website and the ongoing content modification and release process that I go through to keep a static website Google-friendly. Few webmasters follow quality control procedures like these; even fewer check their server logs to find and correct 404 errors (i.e., "page not found") with 301 redirects. Consequently, search engine databases suffer from problems that can be summed up by the aphorism "garbage in; garbage out." To wit, "ghost URLs."
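
For those webmasters who do want to check, here's a minimal sketch of the kind of log check I have in mind. It assumes an Apache-style combined log format, and the file path is hypothetical:

```python
import re
from collections import Counter

# Matches the request and status fields of an Apache combined-log line.
LOG_LINE = re.compile(r'"(?:GET|HEAD|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def recurring_404s(log_path, threshold=3):
    """Count 404 responses per requested path and flag the repeat offenders."""
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            m = LOG_LINE.search(line)
            if m and m.group("status") == "404":
                counts[m.group("path")] += 1
    return {path: n for path, n in counts.items() if n >= threshold}

# Example usage (log location is an assumption):
# print(recurring_404s("/var/log/apache2/access.log"))
```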

I use the term ghost URL to refer to a wide variety of site configuration issues that haunt most of the websites that I encounter. One of these issues is caused by the fact that the mapping between URLs and content is ambiguous -- i.e., there are usually several different URLs that point to the exact same content because (1) a numerical IP address has more than one domain name associated with it; (2) a particular webserver fails to correct ambiguous requests for site content; and (3) the same content is mirrored in two or more documents. There are some legitimate reasons for mirroring static content, but most of the time these mirrors are unintentional spam, thereby diluting Google PageRank for the preferred URL.
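
Here's a rough sketch of the sort of normalization that collapses these duplicate URLs onto a single preferred URL. The preferred host name and its aliases are assumptions that would vary from site to site:

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.example.com"  # assumption: the host you want PageRank credited to
ALIASES = {"example.com", "www1.example.com", "192.0.2.10"}

def canonicalize(url):
    """Collapse the common sources of duplicate URLs onto one preferred form."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    host = netloc.lower().rstrip(".")
    if host.endswith(":80"):
        host = host[:-3]
    if host in ALIASES:
        host = PREFERRED_HOST
    if path in ("", "/index.html", "/index.htm"):
        path = "/"
    return urlunsplit((scheme.lower(), host, path, query, ""))

for u in ("http://EXAMPLE.COM/index.html",
          "http://192.0.2.10/",
          "http://www.example.com/"):
    print(canonicalize(u))  # all three collapse to http://www.example.com/
```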

The second type of ghost URL that I encounter is created when a webmaster moves content from one URL to another without implementing a 301 redirect or "refresh redirect." On rare occasions, a webmaster who has created a ghost URL will have had the foresight to set up a customized 404 error message, but most of the time webmasters who move content without implementing 301 redirects are oblivious to the problems caused by ghost URLs. Consequently, the end user will usually encounter the default 404 error message that his or her browser displays. As I stated previously, most recently in the blog post referenced above, a scheduled content modification and release process is the best way to avoid these types of ghost URLs.
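
For the sake of illustration, here's a bare-bones sketch of the 301 remedy using Python's standard http.server module, with a hypothetical old-to-new mapping; in practice, most webmasters would accomplish the same thing with their web server's own redirect or rewrite directives:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of moved content: old path -> new path.
REDIRECTS = {
    "/old-article.html": "/articles/new-article.html",
}

class RedirectHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = REDIRECTS.get(self.path)
        if target:
            self.send_response(301)              # permanent redirect
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_response(404)              # customized "page not found"
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<h1>Not found</h1><p>Try the site map.</p>")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), RedirectHandler).serve_forever()
```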

A third type of ghost URL that I encounter is created by inbound links from one website that point to non-existent content on another website. As a webmaster, you have no direct control over these inbound links. What you can do is set up a customized 404 error page, monitor your site referral logs for recurring 404 errors, contact the webmasters who set up the offending inbound links and ask them to correct the problem, and implement 301 redirects. Whether or not the offending inbound links are fixed, the 301 redirects should stay in place until the 404 errors for a particular content request disappear.
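
Building on the log check sketched above, the distinctive step here is noting where the broken requests come from, so that you know which webmasters to contact. Again, the log format and host name are assumptions:

```python
import re
from collections import defaultdict

# Request path, status code, and referrer from an Apache combined-log line.
LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)"')

def offending_inbound_links(log_path, my_host="www.example.com"):
    """Group 404'd paths by the external referrer that sent the visitor."""
    offenders = defaultdict(set)
    with open(log_path) as log:
        for line in log:
            m = LINE.search(line)
            if m and m.group("status") == "404":
                ref = m.group("referrer")
                if ref not in ("", "-") and my_host not in ref:
                    offenders[ref].add(m.group("path"))
    return offenders

# Example usage (log location is an assumption):
# for referrer, paths in offending_inbound_links("/var/log/apache2/access.log").items():
#     print(referrer, "->", sorted(paths))
```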

Some ghost URLs are inadvertently created by webmasters who are vetting pages of beta content. The links on these beta pages often point to non-existent URLs, and the webmasters who set up these beta pages will often mistakenly assume that no one else knows about these beta pages because they haven't linked to them or told anyone about them. Little do they realize that all they have to do to let the secret out is click on an outbound link on one of their beta pages and follow that link to another website. A curious webmaster or spider will then follow his, her, or its site referral logs and find the beta page along with all of its links.
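
The leak happens because the visitor's browser passes the beta page's URL along in the HTTP Referer header. From the receiving site's perspective, discovering the beta page is as simple as scanning one's own logs for referrers one has never seen before -- again, a rough sketch against an assumed Apache-style log:

```python
import re

# Pulls the referrer field out of an Apache combined-log line.
REFERRER = re.compile(r'" \d{3} \S+ "(?P<referrer>http[^"]+)"')

def unfamiliar_referrers(log_path, known_hosts=("www.example.com", "google.com")):
    """Collect referring URLs whose host isn't one we already know about."""
    found = set()
    with open(log_path) as log:
        for line in log:
            m = REFERRER.search(line)
            if m and not any(host in m.group("referrer") for host in known_hosts):
                found.add(m.group("referrer"))  # e.g. somebody's unlinked /beta/ page
    return found
```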

Given the highly decentralized nature of the Internet, there's very little hope of a centralized strategy of quality control for static content on the World Wide Web emerging anytime soon, and as an Internet consultant, I have my hands full trying to bring webmasters up to speed on the site configuration issues that I narrated above. However, when I was sitting in the WiFi lounge at Search Engine Strategies San Jose 2006, a somewhat inexperienced webmaster asked me if I "knew anything about 301 redirects," and then listened quite intently to what I had to say about site configuration issues, so I suspect that there are quite a few other people who are actively addressing these issues. Even so, I expect that ghost URLs will continue to haunt the Web for the foreseeable future.

Tuesday, August 08, 2006

The Advent of MySpace

According to recent news reports, MySpace.com is now the most highly trafficked website in the United States, eclipsing search engine giant Google. Like the blogosphere before it, MySpace has become a force to be reckoned with, a fact that has not escaped the attention of Google, which just inked a deal to provide search functionality on MySpace. This latter fact came to my attention as I was sitting in the WiFi lounge at Search Engine Strategies San Jose 2006 at a table with a couple of people who work for Yahoo!

I've known for quite some time that MySpace was becoming more and more popular, and while it holds very limited appeal for me, I have visited the site when directed to it by friends who have opened an account. What caught me by surprise is that there are people who use MySpace as a communications medium the way that I use e-mail, which explains why MySpace has become more popular than Google, as e-mail is still more of a killer application than search. What also caught me by surprise is the fact that most of the exit traffic from MySpace consists of search queries. As such, Google should see a huge surge in traffic.

Google could have acquired MySpace a while back, but apparently didn't see the potential. Now Google will be forced to make revenue-sharing payments to Fox Interactive Media totaling at least $900 million between the beginning of 2007’s first quarter and the end of 2010’s second quarter. In the long term, MySpace stands to gain even more from this marriage of convenience as it will be able to partner with service providers outside of the search engine space.

Monday, August 07, 2006

What Is the Semantic Web?

In a recent blog post, I said that the ongoing efforts to develop the Semantic Web provide a surprisingly coherent vision of how the Internet should be indexed. However, beyond a passing reference to the fact that the Semantic Web uses a declarative ontological language known as OWL, I didn't really say what the Semantic Web is or what it does. Tim Berners-Lee has published a draft of a road map for his vision of the Semantic Web, and while Berners-Lee's road map is purportedly still evolving, his draft road map was last updated in October of 1998, leading me to believe that his vision has not changed that much.

Berners-Lee's vision of the Semantic Web was and is a stylized version of Artificial Intelligence ("AI"). To wit, "Leaving aside the artificial intelligence problem of training machines to behave like people, the Semantic Web approach instead develops languages for expressing information in a machine processable form." As a long term vision, the idea seems to be that people who publish high quality Web-based resources in a machine-processable language will make other Web-based resources obsolete. At a more practical level, the Semantic Web relies upon the Resource Description Framework (RDF) to provide a standardized schema for meta data.

The fundamental problem with trying to impose a standardized schema for meta data on Web-based publishers is that the Internet is a highly decentralized information resource. Anyone can create their own schema for Web-based URIs/URLs, and said URIs/URLs are seldom unique. Indeed, to provide total anonymity and combat censorship, the Freenet Project relies upon user-designated URIs.

At the present time, the closest thing to a Web-based hegemony of meta data is Google. I'm not talking about Google's Sitemaps feature (still in beta and soon to be renamed "Google webmaster tools"), although that is a very good example of the phenomenon. Rather, what I'm talking about is that people who want their fair share of Google's referral traffic do their best to make their websites Google-friendly. Of course, few people who are optimizing their websites for Google are concerned about quality control, and Google's meta data is easily compromised by people who have learned how to game the Google algorithm.

Ignoring for the time being the fact that the quality of meta data for the Web is easily compromised, consider the fact that RDF pretends to use strings of URIs to create semantic meta data that is machine-processable. For the most part, this meta data is simplistic gibberish that mimics subject-verb-object grammar. However, when it comes to large scale online collaboration, RDF does provide a workable framework for sharing the kind of meta data that is currently locked away in private databases.
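
To make the subject-verb-object point concrete, here's a minimal sketch of the triple structure that RDF is built on, written with plain Python tuples rather than a real RDF toolkit; the people and example.org URIs are made up, though the FOAF vocabulary is real:

```python
# Each RDF statement is a (subject, predicate, object) triple of URIs or literals.
TRIPLES = [
    ("http://example.org/people#alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/people#alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/people#bob"),
]

def objects_of(subject, predicate, triples=TRIPLES):
    """All objects asserted for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects_of("http://example.org/people#alice",
                 "http://xmlns.com/foaf/0.1/knows"))
```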

Ongoing development of the Web Ontology Language (aka OWL) -- i.e., the declarative ontological language of the Semantic Web -- resulted in the publication of a W3C Recommendation on February 10, 2004, though the language is still somewhat esoteric. It remains to be seen whether OWL will ever garner enough proponents to make the Semantic Web a force to be reckoned with. But AI is not going to go away, and even if the Semantic Web isn't the killer application that makes AI a reality, it will still provide a coherent vision for how the Internet should be indexed for decades to come.

Tuesday, August 01, 2006

Wikipedia and the Semantic Web

On Monday July 31, 2006, The Colbert Report aired a feature on Wikipedia, and (at his prompting) Stephen Colbert's minions quickly descended on Wikipedia's article on elephants, deliberately introducing the clearly erroneous assertion that the population of elephants in Africa has tripled in the last six months. After checking out the action on Wikipedia, I searched the blogosphere for commentary on Wikipedia, hoping to find more media buzz about the impact of Colbert's broadcast. However, I ended up reviewing content that made me reflect (once again) on how important Wikipedia has become to the indexing of content on the Internet.

In a post entitled Wikipedia 3.0: The End of Google?, the Evolving Trends blog provided a rather esoteric treatise about the Semantic Web. To wit:
"The Semantic Web requires the use of a declarative ontological language like OWL to produce domain-specific ontologies that machines can use to reason about information and make new conclusions, not simply match keywords."
For those of you who are not familiar with OWL, it is an intentionally dyslexic acronym for Web Ontology Language. The problem with using OWL to power the Semantic Web is that it relies upon present-day Web-based publishers for quality control -- i.e., the same people who are likely to stay awake nights dreaming up scams and schemes to exploit the Web for personal profit. The bloggers at Evolving Trends would have you believe that, given the right tools, Wikipedians will be able to deal with these issues. However, I do not share their optimism.
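
For those wondering what "reason about information and make new conclusions, not simply match keywords" amounts to in practice, here's a toy sketch of the simplest kind of ontological inference -- transitive subclass reasoning over a made-up class hierarchy. Real OWL reasoners are vastly more elaborate:

```python
# Hypothetical ontology fragment: each class lists its direct superclasses.
SUBCLASS_OF = {
    "AfricanElephant": ["Elephant"],
    "Elephant": ["Mammal"],
    "Mammal": ["Animal"],
}

def is_a(cls, candidate, subclass_of=SUBCLASS_OF):
    """Infer class membership by walking the subclass hierarchy transitively."""
    if cls == candidate:
        return True
    return any(is_a(parent, candidate, subclass_of)
               for parent in subclass_of.get(cls, []))

# A keyword match would never connect these two strings, but the ontology can:
print(is_a("AfricanElephant", "Animal"))  # True
```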

Wikipedia is an unqualified success when it comes to large scale collaboration for online content generation, and -- as alluded to by the bloggers at Evolving Trends -- Wikipedia has the capacity to make Google obsolete. But not for the reasons alluded to by said bloggers -- i.e., by using a highfalutin ontology generated by conscientious Web-based publishers. Rather, Wikipedia has the capacity to out-Google Google by providing relevant responses to keyword-based queries. To this end, Wikipedia spends a great deal of time disambiguating keyword-based queries.

On a much smaller scale, albeit one that has provided proof of concept, I started doing something similar a few years ago with the XODP Web Guides. To wit, following much the same process that I used when creating a new ODP category back in the day, I review the search results for a particular search term, separate the Roman Meal from the Hormel, and provide appropriate titles and descriptions for the wheat. With the XODP Web Guides I also go two steps further: As appropriate, I provide an introductory blurb narrating the presumed relevance of a particular search term and parse the annotated links for that search term with appropriate interrogatories. Even so, the Semantic Web envisioned by Timothy Berners-Lee is not likely to be realized anytime soon. Certainly not in the context of Wikipedia, much less in the context of annotated link lists.

As one might expect, the biggest challenge with creating the Semantic Web is and will continue to be semantics. Most human beings take for granted their ability to understand what somebody else means when they say something, but this ability is hardly trivial and virtually impossible to code into a language processing program. Indeed, eloquent orators, writers, and translators are usually hard pressed to explain the rationale for their choice of words on a particular occasion. The words they use just seem right as they bubble up from their subconscious mind. If they're lucky, they have a chance to censor the words that might be misinterpreted, and after careful consideration can be relied upon to explain why the words that they did choose were in fact the right words.

Assuming that linguists are able to come up with an artificial intelligence (AI) that can read and write and/or converse with human beings, there will still be vast uncharted oceans of knowledge on the World Wide Web, unless said linguists are also able to give AI the ability to look at pictures and video and interpret their relevance. However, such abilities are not yet the province of AI; they are the province of sci-fi. Meanwhile, Wikipedia remains the most successful experiment in online content generation and indexing to date.

Inherent problems with quality control notwithstanding, Wikipedia has succeeded where ODP/dMOZ failed in that the Wikipedia community is truly open. To wit, as there are no meaningful barriers to entry, anyone can contribute to Wikipedia. That's not to say that there aren't some control freaks at Wikipedia doing their best to assert themselves and turn Wikipedia into a complicated bureaucracy. However, if one has a vested interest in quality control, there are meaningful ways of challenging the mediocre status quo at Wikipedia.

While Wikipedia is at the vanguard of progressive and pragmatic efforts for online content generation and indexing, the ongoing efforts to develop the Semantic Web provide a surprisingly coherent vision of how the Internet should be indexed. In the decades to come, this vision will probably be explored in the context of academia and inform the research conducted by cognitive scientists. However, by the time this vision reaches the public at large, it will almost certainly be watered down to something much less sublime.