Wednesday, February 28, 2007

Respected Wikipedian Lies to the Press

Stacy Schiff of The New Yorker wrote an article about Wikipedia that was first published online in July of 2006. One of Schiff's sources was a pseudonymous Wikipedian known as "Essjay" who lied to Schiff about virtually all of his biographical details:
" . . . [Essjay] was described in the piece as 'a tenured professor of religion at a private university' with 'a Ph.D. in theology and a degree in canon law.'

"Essjay was recommended to Ms. Schiff as a source by a member of Wikipedia's management team because of his respected position within the Wikipedia community. He was willing to describe his work as a Wikipedia administrator but would not identify himself other than by confirming the biographical details that appeared on his user page. . . . Essjay now says that his real name is Ryan Jordan, that he is twenty-four and holds no advanced degrees, and that he has never taught. He was recently hired by Wikia—a for-profit company affiliated with Wikipedia—as a 'community manager'; he continues to hold his Wikipedia positions. . . ."
While Daniel Brandt first broke this story quite some time ago, it's just now starting to get some traction in the blogosphere, and it took me quite a while to uncover the details. A good place to bring yourself up to speed is Jimbo Wales' Wikipedia User Page. However, the bottom line is that Essjay acknowledges no wrongdoing, claiming to be justified in perpetrating this fraud as a way of protecting himself from online stalkers.

While it's not a crime of moral turpitude, lying to a member of the press when the truth will do just as well is evidence of moral turpitude, not to mention just plain stupid when the lies involve easily falsifiable facts, and Essjay's excuse of necessity just doesn't wash. To wit, in addition to lying to the press, Essjay used his counterfeit credentials to bolster his and Wikipedia's credibility in a letter that he gave to students who wanted to cite Wikipedia as a reference work. He didn't have to give the interview, he didn't need to write the letter, and he didn't need to lie to the press to protect himself from online stalkers.

I am no stranger to online stalkers. Shortly after I founded the XODP Yahoo! eGroup, a (now former) ODP meta editor set up Netesqsucks.com, a hate site dedicated to persecuting me. And while said meta editor set up said hate site under an assumed name and a fake address, an investigative reporter uncovered his true identity by tracking down the credit card that said stalker used to pay for the site's domain registration. Eventually, this meta editor (a private investigator) was terminated for using sock puppets to promote his own websites on ODP and exclude his competitors.

I didn't need to talk to the investigative reporter who uncovered the truth about the person behind the Netesqsucks.com website, but I did, and I took a huge risk in revealing to her in confidence all sorts of personal information about myself on deep background. Because Essjay was referred to The New Yorker by Wikipedia management, he was assumed to be trustworthy without providing that background information. Simply put, Essjay violated that trust.

Tuesday, February 27, 2007

Vanity. . . Definitely My Favorite Sin

While there's still no trackback on the Citizendium Blog for my previous XODP Blog post entitled Larry Sanger Gets It Totally Wrong, Larry has condescended to comment on my more recent post entitled Dealing with Jackasses at Wikipedia and Citizendium:
"What argument or evidence can you offer to support this silly claim that I wish to establish 'centralized content control'?"
That's a good question. But I have a much better question: If it's such a silly claim, why are you bothering to respond to it? Seriously.

Larry continues:
"I'm surprised that you have jumped on the bandwagon of those who say, 'Oh my gosh, if there are experts involved, it must be Nupedia all over again!'"
Hardly a bandwagon, although I would have to agree that many other people worthy of note believe that Citizendium qualifies as a recapitulation of Nupedia. However, I have no problem with having experts involved in wiki projects, and my position on Wikipedia's unfair bias against experts would have been quite apparent to you if you'd carefully read the post to which you responded.

Larry continues:
"The Citizendium has editors who can weigh in as "resident experts" as necessary. . . . To suggest . . . that they simply insist on their views without argument--is in essence to malign a whole bunch of people you don't know at all."
I can't tell you what a joy it is to have you put words in my mouth that I never said. To wit, I did not say that "Citizendium editors simply insist on their views without argument," and suggesting that this was my position is what logicians refer to as a fallacious "straw man" argument. For future reference, if I mean to say that Citizendium editors simply insist on their view without argument, I won't have to suggest it; I'll just say it.

Larry continues:
"[C]redentials are necessary for being an editor. Not for being an author, of course, and most of our registered contributors are authors. But, yes, you have to prove that you're actually an expert."
The need to have a distinction between expert editors and other contributors is somewhat lost on me, but you have answered your own question regarding my characterization of your position on centralized content control.

Larry continues:
"Frankly, we put out on their ear far faster than Wikipedia ever did anyone who actually acts like a jackass. Your own intemperate post, for instance, is the sort of mean-spirited, vicious personal attack that would get you excluded."
Hardly an intemperate post by me, but definitely an intemperate response on your part wherein you stop holding court just long enough to step outside and challenge me to mutual combat.

Dealing with Jackasses at Wikipedia and Citizendium

A comment by one Mike on the XODP Blog eventually took me to said Mike's Modern Dragons Blog, which has a very lengthy post about Citizendium. And while Larry Sanger seems to be quite enamored with Mike's post, I think Mike is making extraordinary demands on his potential blogging audience by using over 5,000 words, not including the footnotes, appendices, and comments that follow the main post. Even so, as Mike's post seems to have drawn the attention of the usual suspects, I made a point of reading through the whole thing, and I was rewarded by the following quotable quote cited by Mike:
"Many experts who have left, or otherwise have expressed dissatisfaction with Wikipedia, fall into two categories: Those who have had repeated bad experiences dealing with jackassses, and are frustrated by Wikipedia’s inability to restrain said jackasses; and those who themselves are jackasses."
I have stated before that Wikipedia has an unfair bias against experts, just as I have stated before that project forking at Wikipedia is long overdue. However, as alluded to by the quote above, experts who have left Wikipedia and jackasses who have left Wikipedia are not mutually exclusive categories. Rather, the jackasses who make contributing to Wikipedia an unpleasant experience form a spectrum that runs from the most ignorant and uninformed contributors to the most educated and biased control freaks. I say this as someone who voluntarily limits the vast majority of his Wikipedia contributions to "Talk Pages" rather than getting into "Edit Wars" with said jackasses. Even then, I often find myself blown away by the recalcitrance of the jackasses for whom Wikipedia seems to be an attractive nuisance.

On this note, from the comments section of Mike's blog post comes the following gem from Wikipedian Fred Bauder:
"Most Wikipedians support Citizendium, but, having had experience with Larry, most old-timers are somewhat sceptical[sic] about a project controlled by him. Given a choice, he will nearly always choose a solution which involves top-down control. . . . Another problem, he does not so much respect expertise, as credentials. . . ."
Having had direct contact with Fred Bauder during my early editing experiences at Wikipedia, I can honestly say that he was one of the jackasses at Wikipedia that inspired me to give up the proverbial ghost when it came to quality control. I was particularly annoyed at his penchant for moving forward with dramatic changes to an existing Wikipedia article without bothering to form a consensus. And in all fairness to Fred, there are probably just as many people at Wikipedia who think that I'm a jackass.

Ironically enough, the dispute that I had with Fred involved his revisions to the Wikipedia Law article, and Fred is a retired lawyer, which should give him more than a passing familiarity with the topic. And while I do not practice law, I was a Member of Law Review and a Teacher's Assistant for Legal Writing before graduating from UC Davis Law School. As such, from a Larry Sanger-esque credentials standpoint, I think that I too would have been on pretty firm ground. Even so, the most constructive input on the Wikipedia Law article during my confrontation with Fred came from Wikipedian Slrubenstein (a very modest professor of anthropology), mysterious old-time Wikipedian SJK, and noted Wikipedian Lee Daniel Crocker (a computer programmer by trade who was the primary author of the current MediaWiki software).

By the time I started contributing to Wikipedia in August of 2002, the writing was already on the wall regarding Larry Sanger's once prominent role as Wikipedia's editor in chief. And having successfully cultivated more than one online community of several hundred members (i.e., Wherewithal and Project Napa) and then having had both of those communities taken away from me and destroyed by the people who hired me to cultivate those communities, I was inclined to feel sorry for Larry. However, the more I came to understand Larry's feelings about what Wikipedia should be, the more I came to realize that Wikipedia was much better off without someone like Larry in charge.

While Fred Bauder may or may not speak for the majority of old-timers at Wikipedia, he is (IMHO) spot on in his criticism of Larry Sanger. One need not look far for evidence of Larry's failings: Nupedia failed because it favored credentialism and centralized control over content, and once Wikipedia emerged as a viable replacement for Nupedia, Sanger sought to impose credentialism and centralized content control at Wikipedia. Even now, Larry hopes to use Citizendium to reassert the validity of credentialism and centralized content control as a panacea for what ails Wikipedia.

I mentioned above that during my tenure at UC Davis Law School I was a Teacher's Assistant for Legal Writing and a Member of Law Review. These positions gave me an enormous appreciation for just how difficult it is for most people to produce quality writing, just how defensive most people can be about really bad writing, and just how difficult it can be to reach a consensus on what good writing actually is when more than one editor's viewpoint must be appeased. As a result of these harsh realities, the vast majority of promising law review candidates would wash out of law review before the end of their first semester, and the author of an article that qualified him or her for law review membership was seldom willing to claim the final product as his or her own.

Dynamics similar to those that I experienced at UC Davis Law Review are at work at Wikipedia, but with a rather strange twist: Anybody can put in their own $.02 on a Wikipedia article, and quite a few people do. To wit, while most Wikipedia articles end up having a relatively small number of regular and interested contributors, the vast majority of content on Wikipedia is contributed by the rare occasional contributor. What this means is that the occasional contributor is the key ingredient in Wikipedia's exponential growth but is not even a significant part of Wikipedia's ongoing quality control problems. Those problems are caused by biased and/or uninformed experts and zealots -- i.e., jackasses.

Rather than encourage the occasional contributor who has helped Wikipedia grow and prosper and discourage jackasses from asserting ownership over particular articles, Larry Sanger hopes to improve upon the Wikipedia model by excluding from Citizendium the people who provide the vast majority of useful content and forming an enclave of like-minded expert jackasses. It's an interesting experiment, but -- given the enormous barriers to entry at Citizendium -- not one that I or any of the experts that I know are interested in joining. And it's probably just as well: Larry Sanger is apparently moderating all comments and trackbacks to the Citizendium blog, as all the comments appearing there harmonize with his point of view and no trackback reference to my previous blog post has yet appeared.

Sunday, February 25, 2007

Larry Sanger Gets It Totally Wrong

While exploring the blogosphere for commentary about Wikipedia, my attention was drawn to a post by Larry Sanger on the Citizendium Blog:
". . . Wikipedia's reach is now enormous, and if indeed it has gained a reputation, whether deserved or not, as a source of reasonably reliable information, and it defames someone for any significant length of time, such defamation can do very real harm to a person’s reputation. . . .

"[ . . .]

". . . Ethically, and probably legally, Wikipedia’s managers must face up to it, because the injustice the current situation perpetrates is completely and
obviously intolerable.
While Wikipedia may or may not have an ethical duty to prevent people from being defamed *on* Wikipedia, an issue on which reasonable minds may differ, the only thing that I find "obvious" on the issue of defamation *by* Wikipedia is that Larry Sanger knows absolutely nothing about defamation law.

BTW, as I am wont to say when commenting on legal issues, I do not practice law, and nothing I have written in this post or elsewhere should be construed as a legal opinion or as legal advice. That having been said, according to Scott D. Sheftall, the attorney representing Fuzzy Zoeller in the recent defamation suit involving Wikipedia, the law is pretty clear that Wikipedia has no legal exposure here:
"Zoeller's attorney, Scott D. Sheftall, said he filed the lawsuit against a Miami firm last week because the law won't allow him to sue St. Petersburg-based Wikipedia."
The irony here is that Sanger's ongoing indictment of Wikipedia holds that it fails to provide accurate information. In Sanger's own words:
"Articles that contain defamatory remarks are not 'objectionable comments' on a 'message board'; they are presented to the world and typically accepted as fact, which is something Jimmy himself encourages by saying, as he does, that Wikipedia is actually pretty reliable."
This is a classic case of the pot calling the kettle black, and a pretty good case could be made that Sanger's comments about Jimbo Wales are defamatory:
"It is reprehensible, moreover, to react to this situation lackadaisically, regarding false claims that are personally and professionally damaging as in effect 'collateral damage' that society must bear as the price of having such a wonderful project as Wikipedia. That reaction is reprehensible because it assumes that Wikipedia's managers cannot improve the way that it deals with this sort of problem. They could, without completely breaking its system; but they choose not to."
According to the guidelines offered by the Electronic Frontier Foundation in re defamation, an attorney could probably make a prima facie case that Sanger is publishing untrue statements about Wales with the intention of impugning Wales' reputation. However, given that Wales would probably qualify as a public figure, said attorney would also have to prove "actual malice" on the part of Sanger. Nonetheless, I think it's incumbent upon Sanger to live up to the standards that he hopes to enforce:
"[Citizendium] will have a zero tolerance policy toward any even possibly defamatory remarks: to say something that might tend to impugn someone’s reputation, even if true, will require extremely good documentation. If no such documentation is offered, or if it does not check out, the person who makes such claims will be "escorted to the door." We simply won't tolerate it."

Powerset Redux

In a previous XODP Blog post, I commented on the fact that Powerset had succeeded where Google had failed in wooing natural language processing (NLP) guru Ronald Kaplan to spearhead its efforts at developing a semantic search engine. Unlike most of the commentators that I've read on the blogosphere, I was willing to suspend judgment on Powerset's technology until it came out of stealth mode. However, I remained titillated by the prospect of semantic search finally coming of age, so I did a little more digging, and I found some more information about what Powerset is planning to debut:
"There are two key things here: the use of NLP and the disruption to the search interface. Finally, information retrieval will actually mean information retrieval, not document retrieval. One of the fundamental models of search that may be challenged . . . is . . . that search engines are designed to take people to pages. The more we can understand and summarize the information on those pages, the weaker this model becomes. . . ."
[The above summary comes from the blog of Matthew Hurst, a close friend of Powerset's CEO Barney Pell.]

As promising as this vision of search is, it fails to take into account the cost of search computations:
"The computation for each query has quantifiable costs and benefits. The costs include R&D, computing infrastructure amortization, energy, rent, maintenance. The direct benefit is advertising revenue from the query response; indirect benefits such as market share gained from greater search quality are harder to measure, but can still be estimated. Search engine success is ultimately given by the efficiency with which it delivers advertising revenue given these factors."
This sort of cost/benefit analysis gives rise to an inherent conflict of interest in search engine technology as we know it. To wit, if a search engine's natural search results are too good, there's no incentive to click on paid search results, and the business model breaks down. At the other extreme is an information resource like Wikipedia, with a business model that is based primarily upon a gifting economy. That sort of business model also breaks down when the quality of information it provides gets too good, as it attracts all sorts of people whose primary interest is in getting rich off of those who work for free.
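To make that conflict concrete, here's a back-of-the-envelope sketch (entirely my own, with made-up numbers) of the per-query economics described in the quote above; as the natural results improve, fewer users click on the paid results, and the margin per query shrinks:

```python
# A hedged, purely illustrative sketch of per-query search economics.
# All figures are hypothetical; they are not Powerset's or Google's actual numbers.
def margin_per_query(ad_click_rate, revenue_per_ad_click, cost_per_query):
    """Advertising revenue earned on a query minus the cost of answering it."""
    return ad_click_rate * revenue_per_ad_click - cost_per_query

COST_PER_QUERY = 0.005  # R&D amortization, infrastructure, energy, etc. (hypothetical)
for label, ad_click_rate in [("mediocre natural results", 0.15),
                             ("excellent natural results", 0.05)]:
    print(label, "->", round(margin_per_query(ad_click_rate, 0.25, COST_PER_QUERY), 4))
```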

To quote the late Keith Moon, "Sometimes you do alright to steer clear of quality." (A quote that I found misattributed to Pete Townshend.) Nowhere is this more true than in technology, where a successful disruptive technology (e.g., blogs) usually starts out as an inferior technology. Assuming, arguendo, that Powerset's technology truly is better than Google's, this is the real challenge that Powerset's evangelists will have to address.

Saturday, February 24, 2007

Rumors of Wikipedia's Imminent Demise Are Greatly Exaggerated

Following up on some recent XODP Blog posts about Wikipedia, I noticed that it has become somewhat fashionable to predict Wikipedia's imminent demise, the most recent example being Peter Da Vanzo's post at the v7n blog, Have DMOZ Taken Over Wikipedia?
"DMOZ and Wikipedia share much in common - other than Wikipedia isn't completely useless. Yet.

"[ . . . ]

". . . You build something that looks open, and appears to be open, but in reality, is locked up tight, and run by a small group of people making ever more insular decisions.

". . . [T]hey're under-resourced for the task, and as the task grows, the more under-resourced they become. In response, they compromise the very thing that made them valuable - accessibility."
I've had recurring issues with Wikipedia since I first started contributing to it back in August of 2002, and I think that some project forking is long overdue. However, I think that Da Vanzo's indictment of Wikipedia is gratuitously harsh. Ditto for the indictment of Wikipedia proffered by Nicholas Carr back in May of 2006:
"Wikipedia, the encyclopedia that "anyone can edit," was a nice experiment in the "democratization" of publishing, but it didn't quite work out. Wikipedia is dead. It died the way the pure products of idealism always do, slowly and quietly and largely in secret, through the corrosive process of compromise.

". . . A few months ago, in the wake of controversies about the quality and reliability of the free encyclopedia's content, the Wikipedian powers-that-be - . . . tightened the restrictions on editing. In addition to banning some contributors from the site, the administrators adopted an "official policy" of what they called, in good Orwellian fashion, "semi-protection" to prevent "vandals" (also known as people) from messing with their open encyclopedia."
A much more muted prediction of Wikipedia's demise was offered by Eric Goldman on his Technology and Marketing Law Blog in a post entitled Wikipedia Will Fail in Four Years:
". . . I'm . . . basing this prediction on the experiences of ODP. I think it's fair to say that (1) in its heyday, the ODP did an amazing job of aggregating free labor to produce a valuable database, and (2) the ODP is now effectively worthless."
While I agree that Wikipedia has all sorts of limitations and faces all sorts of challenges, I also think that predicting Wikipedia's demise based on comparisons to ODP/dMOZ is very simplistic thinking.

The single biggest problem with the purportedly Open Directory Project was that it was never, ever truly open. A close second was the fact that ODP never had a business plan. Neither of these things can be said about Wikipedia. Rather, the single biggest problem with Wikipedia is quality control, and a close second is the bizarre bureaucracy that has been slowly emerging in response to the problem of quality control. The fallback solution to both of these problems and most of the other problems that open projects encounter is project forking, something that could not happen with ODP because of its corporate ownership, its onerous licensing restrictions, and its failure and refusal to use an open source software platform.

Thursday, February 22, 2007

Grokking the Semantic Web

Although I've written previously about what the Semantic Web is, I've always felt that there should be a simpler way to explain it. And while exploring the blogosphere, I found a blog post on ....more semantic! that finally did just that by referencing Wikipedia as a global ontology:
"[Y]ou could . . . use a wikipedia reference to indicate the semantic concept that you are writing about. Thus, . . . you could use the link http://en.wikipedia.org/wiki/Rome to indicate that you are refering to the city of Rome, the capital of Italy."
This simple statement sums up just how straightforward it would be for someone to publish a document on the World Wide Web that is compatible with the Semantic Web. It also makes it abundantly clear just how tedious and time consuming such a task would be.

Most people who describe the Semantic Web talk about making web documents easier for machines to process. That's true enough, but that's really not what the Semantic Web is all about. At its essence, the Semantic Web is all about disambiguation, something that Wikipedia does quite well, even breaking through language barriers in the process.
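To make this concrete, here's a minimal sketch (my own illustration, not something from the ....more semantic! post) of what treating Wikipedia as a global ontology looks like in practice; the concept_links mapping and the annotate helper are hypothetical names:

```python
# A hedged sketch of using Wikipedia URLs as unambiguous concept identifiers.
# The mapping and function below are my own illustrations, not an established API.
concept_links = {
    "Rome": "http://en.wikipedia.org/wiki/Rome",  # the Italian capital, not the empire
    "Mercury": "http://en.wikipedia.org/wiki/Mercury_(planet)",  # the planet, not the element or the god
}

def annotate(text, concepts):
    """Wrap each known term in a link to its Wikipedia article so that a
    machine reader can tell which sense of the word is intended."""
    for term, url in concepts.items():
        text = text.replace(term, '<a href="%s">%s</a>' % (url, term))
    return text

print(annotate("Rome gets more tourists than Mercury gets probes.", concept_links))
```

Doing this by hand for every ambiguous term in a document is exactly the sort of tedium noted above.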

The other day, a friend of mine who is taking a conversational Spanish class asked me if I could help her find something that was written in Spanish to complete an assignment for her class. I immediately thought of Wikipedia and how it links from articles in one language to articles in another language, so I told my friend to find a Wikipedia article in English that she liked and follow the link to the Spanish Wikipedia. I then directed her to use Babelfish for a crude translation of the article on the Spanish Wikipedia.

The Wikipedia article that my friend picked was Cat. However, when I followed the link to the corresponding article on the Spanish Wikipedia, I discovered that the two articles were quite different, so I found the appropriate article on the Spanish Wikipedia and changed the outgoing link on the English Wikipedia. Prior to the change I made, the English Wikipedia considered Cat to be a synonym for House cat, but linked to the more generic Spanish Wikipedia article entitled Felis rather than the more specific article entitled Gato doméstico, which in turn redirected to Felis silvestris catus.

This simple exercise demonstrates how a semantic search engine algorithm could drastically improve the relevancy of keyword-based search results, something that I hinted at in my previous XODP Blog post entitled Wordnet, Disambiguation, and the Semantic Web. And contrary to what some have suggested, a semantic search engine would not need to be intimately familiar with a user and/or the context of a particular user's search. It could default statistically to the most likely meaning of a particular word occurrence and still allow an end user to provide feedback on what he or she really meant.
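Here's a minimal sketch (entirely my own, with made-up sense frequencies) of that fallback strategy: default to the statistically most likely sense of an ambiguous query term, but let the end user override it:

```python
# A hedged sketch of statistical-default disambiguation with user feedback.
# The sense inventory and frequencies below are hypothetical example data.
senses = {
    "cat": [
        ("house cat (Felis silvestris catus)", 0.87),
        ("CAT scan (computed tomography)", 0.08),
        ("Caterpillar Inc. (stock ticker CAT)", 0.05),
    ],
}

def disambiguate(term, chosen=None):
    """Return the user's chosen sense if given, otherwise the most likely one."""
    ranked = sorted(senses[term], key=lambda s: s[1], reverse=True)
    return ranked[chosen] if chosen is not None else ranked[0]

print(disambiguate("cat"))      # defaults to the house cat sense
print(disambiguate("cat", 1))   # user feedback selects the CAT scan sense
```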

All of this begs (or rather raises) the question of whether people need or want a user agent that is this sophisticated when it comes to searching the Web. I've done a significant amount of end user training, both for my private clients (most of whom are lawyers) and during presentations that I've made at Mandatory Continuing Legal Education (MCLE) seminars, and few people ever tax my knowledge base with their questions. In fact, I got my best reviews (5 out of a possible 5 for 90 percent of the queries asked of seminar attendees) by taking a full 30 minutes to explain the anatomy of hypertext links, knowledge that most Web-savvy individuals take for granted. Thus, I am inclined to believe that both the Semantic Web and semantic search will remain solutions looking for a problem for the foreseeable future; at best, they may become solutions made by geeks and for geeks.

Wednesday, February 21, 2007

Linking Wikipedia Articles to ODP/dMOZ Categories

While following up on a previous XODP Blog post narrating the death and resurrection of ODP, I found a link in a post on the Text Technologies blog pointing to a post on Joost de Valk's blog entitled DMOZ and Wikipedia: how it should work:
"Wikipedia is an online encyclopedia. . . . An encyclopedia should contain references to other articles in that encyclopedia, but doesn’t nescessarily have to have external links. A directory’s major purpose (at least the ODP’s major purpose) is to contain external links, chosen by editors to be of high quality.

"[ . . . ]

"On some articles, editors have linked to DMOZ, and removed almost all external links. That, to me, looks very good. The links in DMOZ have been checked, and should be of high value, in Wikipedia, this is impossible to do, as anyone can add his [or her] own link."
A truly open Web directory would be a great companion to Wikipedia, but linking a Wikipedia article to ODP and eliminating all other outbound links from that article will create many more problems than it will solve, a fortiori when one considers the uncertain future of ODP.

Astonishingly, no one seems to be pointing out just how bad this idea really is, much less why. When it comes to quality control, ODP's track record is much worse than Wikipedia's, so much so that both MSN and Google now allow webmasters to opt out of using ODP's meta data for their site descriptions, whereas Yahoo! has its own human editors to provide site descriptions. Add to this ODP's historical backlog of hundreds of thousands (perhaps millions?) of sites, with a typical wait of some six months or more for a site submission to be reviewed, along with the total lack of transparency in the purportedly Open Directory Project, and Joost's idea makes about as much sense as amending the United States Constitution so that the current President Bush can be elected for a third term and stay the course in Iraq.

Assuming, arguendo, that ODP did not have systemic issues with quality control, was not a black hole for site submissions, and actually was an open directory, the idea of removing relevant outbound links from Wikipedia and replacing them with one link to a purportedly authoritative and comprehensive Web directory would be bad enough all by itself. One of the best things about Wikipedia is that it is a centralized clearinghouse for information on keyword-based topics, and a list of relevant outbound links is a logical component of such an online reference. To this end, many (if not most) Wikipedia articles include easily verifiable citations that link to other websites. Consequently, I find it hard to believe that more than a handful of Wikipedians would take Joost's suggestion seriously, but well over a thousand Wikipedia articles currently use a DMOZ link template, and Wikipedia's external links guideline was revised on November 16, 2006 to make the use of this template the norm. Notwithstanding the six week outage of ODP's website, this recommendation remains the status quo.

ODP/dMOZ - In Memoriam, Once Again

While I occasionally mention ODP/dMOZ in a historical context, particularly so in reference to Wikipedia inheriting whatever creative genius ODP once had, I wrote off ODP quite some time ago. Indeed, I had almost written off Web directories altogether until I stumbled upon Robert Barger and Brian Prince at Search Engine Strategies 2005 and discovered that they and others like them had successfully reinvented the concept of indexing communities. Even so, when enough people find some meat on ODP's rotting horse carcass, I write yet another post-mortem on the topic.

Notwithstanding a dismissive post that I wrote on the XODP Blog on June 19, 2006 entitled Will ODP Ever Die?, my last extensive post-mortem on ODP was posted on the XODP Blog on March 23, 2006, just under a year ago, so I guess it's about time for yet another one. To this end, Robozilla posted a brief message on the XODP Yahoo! eGroup pointing to a post on Richard Skrenta's blog from December 16, 2006 entitled, DMOZ had 9 lives. Used up yet?. I'm sorry that I'm so late in getting to the party, but better late than never. In any event, the horse has been dead for quite some time and has been hoisted up on a rope and hit so many times that it's beginning to resemble a piñata, the difference being that there's no candy on the inside of this beast.

For those of you who are unaware of the recent drama over at ODP, here's a pithy excerpt from Skrenta's blog:
"Apparently the machine holding dmoz in AOL ops crashed. Standard backups had been discontinued for some reason; during unsuccessful attempts to restore some of the lost data, ops blew away the rest of the existing data on the system.

"So for the past 6 weeks, a few folks have been trying to patch the system back together again (reverse engineering from the latest RDF dump, I suppose).

"dmoz doesn't exactly operate on a model of transparency, to say the least, so they have been keeping the details of what happened private. . . ."
I first became aware of this drama when the ODP article at Wikipedia (which is still on my Wikipedia watchlist) started to see some really weird edits. Given what I know of the inner workings of AOL, I gave ODP about a 50/50 chance of disappearing from AOL, the way that Disney's Go Directory and Looksmart's Zeal directory disappeared, with the ODP editor community reconstituting itself in one form or another on one or more websites using one of the not-so-recent ODP RDF dumps. I also anticipated that Richard Skrenta and Jimbo Wales would step up to the plate and offer whatever assistance they could. But for the fact that someone over at AOL took notice of Skrenta's blog post, that's what would have happened. As it now stands, ODP's systems have been more or less restored to the status quo ante, for better or for worse.

With the exception of a very small handful of true believers, most of the people who have ever had any dealings with ODP are either very apathetic about it (i.e., me) or have a very negative opinion of it. Many people would like to see ODP completely extinguished, but the ongoing ODP RDF dumps and resulting ODP clones that currently populate the Web make that an impossibility. The best case scenario is one where a white knight like Jimbo Wales would be allowed to give the current ODP a new home.

More than one person with the resources to make it happen (not Jimbo Wales) has asked me if I would be interested in spearheading an effort to reorganize ODP, the last time being about a year and a half ago. And my answer has always been, "Sure, for the right price. However, if you have those sort of financial resources, there's quite a few other projects that I consider more worthwhile." And that really is the problem with ODP. With the rare exception of someone like Jimbo Wales, the only people with the wherewithal to salvage what's left of ODP are either totally disinterested, totally clueless, totally in denial, or totally corrupt. Moreover, while AOL does not seem to be interested in revitalizing ODP, it is dead set on making sure that no one takes it away from them.

Tuesday, February 20, 2007

WordNet Redux

In a recent XODP Blog post, I extolled the virtues of WordNet and declared that it was one of the best kept secrets when it comes to Natural Language Processing (NLP) resources. Following up on this post, I found that few people outside of the NLP arena seem to have even heard of WordNet, and even the experts in this field do not seem to appreciate WordNet's potential. For instance, in a post at the Artificial Artificial Intelligence Blog, Lukas Biewald laments:
"Are concepts really a hierarchy? I’ve heard cognitive scientists think so, but I disagree. And I think that trying to make all the concepts conform to this artificial hierarchal structure has turned WordNet into a much less useful resource.

"[ . . .]

". . . [Some] groups of concepts . . . actually have a hierarchical structure for an unrelated real-world reason. . . .

"[ . . .]

"But this hierarchy completely breaks down for more conceptual things. Is respect in the sense of 'respect for my Father,' a type of 'attitude'' or 'politeness' or 'filial duty' or 'affection?' Clearly it’s all these things. But the guys making WordNet didn’t want to believe that, so they make respect as a type of attitude one semantic category, and respect as a type of politeness another, and so on, until there are ten separate senses for respect."
In their book, Computational Linguistics, which I mentioned in my previous post about WordNet, Igor Bolshakov and Alexander Gelbukh anticipate these sorts of objections to creating linear representations (i.e., Text) of non-linear entities (i.e., Meaning):
". . . The human had to be satisfied with the instrument of speech given to him by nature. This is why we use while speaking a linear and rather slow method of acoustic coding of the information we want to communicate to someone else.

"[ . . . ]

"While the information contained in the text can have a very complicated structure, with many relationships between its elements, the text itself has always one-dimensional, linear nature, given letter by letter. . . . [A] text represents non-linear information transformed into linear form. What is more, the human cannot represent in usual texts even the restricted non-linear elements of spoken language, namely, intonation and logical stress. . . .

". . . A text consists of elementary pieces having their own, usually rather elementary, meaning. This meaning is determined by the meaning of each one of their components, though not always in a straightforward way. These structures are organized in even larger structures like sentences, etc. . . . Such organization provides linguistics with the means to develop the methods of intelligent text processing."
Like Wikipedia and its ongoing efforts at disambiguation, WordNet cannot be all things to all people, nor should it try, notwithstanding Lukas Biewald's assertion that WordNet should make allowances for words meaning "all of the above." What WordNet can be, is, and should remain, is a strong foundation for more sophisticated semantic analysis.
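Anyone who wants to poke at that foundation can do so in a few lines. Here's a minimal sketch using NLTK's WordNet interface (an assumption on my part that NLTK and its WordNet corpus are installed; the exact senses returned depend on the WordNet version):

```python
# A hedged sketch using NLTK's WordNet corpus reader; not part of the original post.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

# List the noun senses of "respect" that Biewald complains about.
for synset in wn.synsets('respect', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())

# Walk one sense up its hypernym chain to see the hierarchy in action.
sense = wn.synset('respect.n.01')  # assumes this sense name exists in the installed version
while sense.hypernyms():
    sense = sense.hypernyms()[0]
    print('is a kind of', sense.name())
```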

Monday, February 19, 2007

WordNet, Disambiguation, and the Semantic Web

While following up on a previous XODP Blog post that covered Powerset's attempts to outdo Google by using Natural Language Processing (NLP), I discovered a book entitled Computational Linguistics, available for free in both HTML and PDF format. As someone who has a background in both linguistics and cognitive science, I found the book a fascinating read, and about a third of the way through it, I discovered Princeton's WordNet. While WordNet is anything but obscure, as NLP resources go, I think WordNet is one of the world's best kept secrets.

A while back I heaped praise on Wikipedia for its ability to disambiguate keyword-based queries. Without taking anything away from Wikipedia's ongoing efforts at disambiguation, WordNet is already the most comprehensive open source and open content online resource for disambiguation of English nouns, verbs, and adjectives, providing extensive semantic analysis for just under 150,000 unique word strings. Noticeably missing from WordNet's database are articles, prepositions, pronouns, conjunctions, and word particles. Moreover, WordNet does not provide any information about any word's etymology or pronunciation. However, WordNet does provide operational definitions for word strings, information about their common usage, and comprehensive semantic grouping information, including synonyms, antonyms, hyponyms, hypernyms, meronyms, holonyms, and troponyms. And if you want to know what any of those words mean, I suggest that you go on over to WordNet and enter them into the online interface.
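For the code-inclined, those semantic groupings can also be queried programmatically. Here's a minimal sketch using NLTK's WordNet corpus reader (my choice of interface, not the only one; it assumes NLTK and the WordNet data are installed):

```python
# A hedged sketch of WordNet's semantic relations via NLTK; my own illustration.
# Requires: pip install nltk, then nltk.download('wordnet').
from nltk.corpus import wordnet as wn

cat = wn.synset('cat.n.01')  # the "house cat" sense, assuming the standard sense numbering
print(cat.definition())                            # operational definition
print([lemma.name() for lemma in cat.lemmas()])    # synonyms in the same synset
print(cat.hypernyms())                             # more general concepts
print(cat.hyponyms()[:3])                          # more specific concepts
print(cat.part_meronyms())                         # named parts, where WordNet records any
```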

WordNet has more or less solved one of the most basic challenges that was facing developers of the Semantic Web, as WordNet provides a relatively comprehensive operational lexical ontology for the English language where the taxonomy is both clear and definite, but still very, very flexible. In lieu of a standard search engine algorithm, which is limited to determining what I call "keyword relevancy," a properly configured semantic web user agent would be able to determine the actual semantic relevancy of online resources. Moreover, the end user would not have to resort to a series of keyword-based searches. Rather, an end user would have an ongoing conversation with his or her user agent, providing more and more feedback on what said end user wanted to know.

The Semantic Web envisioned by Tim Berners-Lee suffers from the perception that it's a highfalutin enterprise and is commonly referred to by most web developers as the Pedantic Web. This ivory tower disconnect could easily be remedied by creating user-friendly user agents. In time, casual end users would be willing to use a semantic search agent when they encountered problems finding useful online resources with a standard search engine. Meanwhile, people who want to use the Web for serious research would be able to carve out a semantic niche for themselves.

Sunday, February 18, 2007

Powerset Counts Coup with Google

A post on John Battelle's Searchblog points to an article on VentureBeat, the latter being journalist Matt Marshall's recently launched online publication that covers developments in the venture capital world:
"Powerset, a San Francisco search engine company, will announce Friday it has won exclusive rights to significant search engine technology it says may help propel it past Google.

"The technology, developed at Palo Alto Research Center (PARC) in Silicon Valley, seeks to understand the meanings between words, akin to the way humans understand language — and is thus called 'natural language. . . .'

"The deal is significant because practical use of linguistic technology has eluded Google. . . .

"[ . . . ]

" . . . Powerset could possibly steal a lead if it improves search results by a significant measure with natural language and simultaneously incorporates a near-equivalent to Google’s existing capabilities. Powerset has been hiring lots of Yahoo search experts and others, to help it do that.

"[ . . . ]

". . . Negotiations on the deal, just completed, were so secretive that Powerset’s executives hid a Xerox PARC scientist, Ron Kaplan, in a back room when VentureBeat stopped by for an interview last year. Kaplan, who has led the 'natural language' group for several years, joined Powerset as chief technology officer in July. This is a coup for Powerset, because Kaplan did not respond to some early probes from Google. In an interview, Kaplan said he didn't believe Google took natural language seriously enough. 'Deep analysis is not what they’re doing. Their orientation is toward shallow relevance, and they do it well.' Powerset, however, 'is much deeper, much more exciting. It really is the whole kit and caboodle.' While natural language has been a vexing problem for decades, Kaplan said he believes it is ready for prime-time."
The fact that a heavy-hitter like Kaplan chose Powerset over Google is quite remarkable, as Google has been responsible for a brain drain in the search engine space for quite some time. Of course, the technology is being hyped, sight unseen, and as pointed out by Matt at VentureBeat, it also remains to be seen whether people can be convinced to change their keyword-based search behavior.

Wikipedia's Feckless Attempts to Fight Link Spam

Although it's hardly breaking news, I recently discovered that Wikipedia developer Brion Vibber has reinstated nofollow attributes for outbound links on Wikipedia articles. Brion took this action at the request of Wikipedia founder and God-king Jimbo Wales, but Brion also mentioned that he had heard a rumor that a search engine optimization (SEO) contest threatened to overwhelm Wikipedia with spam. According to Brion, the change will be for the indefinite future:
"I'd prefer to see actual improvements (whitelisting, fading, flagging and approval system, etc) rather than just turn it off one day, though."
Hard on the heels of this policy change, most members of the SEO community started crying foul, even as a significant number of Wikipedians started chanting, "Ding, dong, the witch is dead." Both of these responses are what I would have expected. However, when the nofollow feature was first put to a vote on Wikipedia some time ago, a majority of Wikipedians voted to turn off the nofollow attribute in the MediaWiki software, returning Wikipedia to the status quo ante.

Even with the nofollow attribute enabled in the MediaWiki software, Wikipedia remains an attractive nuisance for link spammers, as Wikipedia is currently ranked 12th on Alexa for overall web traffic. Assuming that the nofollow attribute does what it purports to do, all the current Wikipedia nofollow policy does is make sure that a search engine like Google does not count links on Wikipedia as votes in favor of increasing a particular URL's Google PageRank. While this may have a phenomenal impact on search engine rankings and results over the next month or two, it will have little to no impact on Wikipedia link spam, at least not in the foreseeable future.
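For readers who haven't looked at the mechanics, here's a minimal sketch (my own illustration, not MediaWiki's actual implementation) of what the nofollow change amounts to at the HTML level:

```python
# A hedged sketch of adding rel="nofollow" to external links in rendered HTML.
# This illustrates the general idea only; it is not MediaWiki's real code path.
import re

def add_nofollow(html):
    """Insert rel="nofollow" into every anchor tag that lacks a rel attribute."""
    return re.sub(r'<a (?![^>]*rel=)', '<a rel="nofollow" ', html)

print(add_nofollow('See <a href="http://example.com/">an external site</a>.'))
# -> See <a rel="nofollow" href="http://example.com/">an external site</a>.
```

The attribute merely asks search engines not to treat the link as an endorsement when computing rankings; it does nothing to stop the link from being added in the first place.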

Simply put, Wikipedia's nofollow policy makes about as much sense as screen doors on a submarine. While the latter may keep the fish out, the floodgates will remain open, and the shameful joy of the Wikipedians who think that Jimbo has struck a blow for truth, justice, and the American way is sounding more and more like hollow and self-aggrandizing rhetoric:
"We are not a links directory which is one of the problems with various ideas to selectively turn off no follow[sic]. If we do that we are basicaly[sic] admiting[sic] we are a links directory and we would gain very little for doing so."
Notwithstanding claims to the contrary, Wikipedia is a trusted source of URLs, not unlike a links directory. Moreover, it's pretty clear that link spam is already a serious problem for Wikipedia, and that the problem is only going to get much worse for the foreseeable future, prompting me to revise my assertion in a previous blog post that the first practical limitation Wikipedia encounters will be its ability to attract additional contributors. Indeed, the problem that Wikipedia now has is that it is attracting too many contributors with a hidden agenda.

Saturday, February 17, 2007

What Are Wikipedia's Limitations?

In a previous XODP Blog post, I stated my belief that whatever creative and progressive genius ODP/dMOZ once had has been inherited by Wikipedia. Following up on this belief, I considered the fact that the number of articles on Wikipedia rivals the number of URLs indexed in the ODP database. And while my cursory review of Wikipedia's statistics failed to determine exactly how many outbound links Wikipedia has, I think it's safe to say that the number of outbound links in Wikipedia articles is already much larger than the number of URLs indexed in the ODP database. This prompts me to ask: What are Wikipedia's limitations?

Given the highly decentralized nature of Wikipedia editing practices and the scalability of Wikipedia's open source platform, there are virtually no internal limitations to Wikipedia's potential for growth, whether it be growth in the sheer amount of content, growth in the number of articles, growth in the number of outbound links, or growth in the number of contributors. Rather, Wikipedia has a demonstrated potential for exponential growth, and its practical limitations are found in the potential number of topics to be covered, the amount of outside content to be indexed, and the number of people who are willing to contribute their time and talents to Wikipedia. A derivative limitation is quality control, as Wikipedia's so-called "conflict of interest" policy scares away many subject matter experts.

Of the many growth metrics one can measure, the only one that indicates Wikipedia's growth is slowing down is its prominence on Alexa, where Wikipedia is currently ranked #12. To a certain degree, this is comparing apples and oranges, as none of the sites that are impeding Wikipedia's growth in this way are reference sites. Rather, Wikipedia is now going head to head with established search engines and community portals and is steadily gaining ground; as I have stated previously, there is every reason to believe that Wikipedia can and will make search engines as we know them obsolete because of Wikipedia's focus on disambiguating keyword-based queries.

Although it's impossible to predict when it will happen, the first practical limitation that Wikipedia will encounter will be in its ability to attract additional contributors. A related problem that I alluded to above is quality control based on Wikipedia's inherent bias against subject matter experts. Absent some incredible breakthroughs in artificial intelligence, these limitations will impede Wikipedia's growth long before any inherent limitations in the number of potential Wikipedia topics to be covered. Meanwhile, based on Wikipedia's success at being the purveyor of general knowledge, I think the various expert-based wikis that are currently emerging will find it easier to attract contributors.