ODP, hierarchical organization, and other thoughts

I went to a google@work seminar in Dallas last week. It was mostly a sales pitch for Google’s enterprise services, but there were a few interesting bits such as getting a glimpse of Google’s intranet. Another thing stood out that prompted this post. Part of Google’s pitch is that hierarchical organization is dead. More than that, all hierarchical models of organization are bad. Whether it’s directories on your hard disk, folders on your desktop, folders in your email program, categorical tagging of RSS feeds, or topical organization of website contents, it’s all bad, bad, bad. The one true way, they claim, is to dump all your data into a single chaotic mess and “embrace the chaos”. By which they mean, of course, purchase Google Enterprise products and services to search for what you need. After all, how else will you ever find what you’re looking for – your data is now lost in the chaotic mess. Asking a company that specializes in searching unorganized data how to organize your data strikes me as being very like asking the barber if you need a haircut. The answer will profit someone but probably not you.

Somewhere during the PowerPoint presentation was a frame actually titled “Hierarchical organization is dead” and it was illustrated by a full frame image of the Open Directory Project’s index page. The sad thing is not so much that they used this example, but that it was such a powerful example. It generated a fair amount of laughter from the audience as the Google guy talked about how sites like ODP used to think they could manually categorize the Internet. He asked how many of the 100+ people present used (or were even aware of) ODP or similar directories for finding things on the web; no hands were raised. Then he asked how many people used search engines like Google to find things on the web: all hands raised. More laughter.

This is one of two events that recently brought home to me just how dead ODP is. The other was when I tried to log in to my ODP editor account and discovered ODP was down. A little research revealed it had been down for quite a while. Apparently there was a hardware failure back in October of 2006. AOL techs managed to bungle the restore process somehow, resulting in the unrecoverable destruction of large amounts of ODP. Then they discovered they’d forgotten to make backups for the last few years. Oops. Since then, they’ve been slowly reconstructing things. The content itself was salvaged from one of the weekly data dumps but all or most of the editor metadata was lost. Information is scarce as AOL has mostly forgotten about ODP and ODP staff continue to be very secretive about everything that goes on. While a lot of public portions of ODP are back online, a lot of the editor functionality is still down six months later. At least one of the important servers used by the editors is still offline. The really surprising thing is not just that I hadn’t noticed ODP being down but that the web as a whole hadn’t noticed. There was a time when ODP being down for weeks would have been front page news on sites like Slashdot. Other than ODP editors and a few obscure SEO blogs, no one noticed it was gone.

While I don’t agree with Google’s conclusion that all hierarchical organization is bad, I think they are right in the case of web directories. It’s simply not a useful or reasonable method of organizing web sites compared to more modern social bookmarking systems like del.icio.us or reddit. It’s an adapt or die world and, sadly, ODP doesn’t seem to be the sort of organization that can adapt to the changes taking place.

I expect ODP will limp along if AOL continues to allow it but I don’t hold out any hope that ODP is ever going to fully return from the dead. I’m still an editor and I will continue to assist them with data integrity checking on the weekly XML data dumps (which have finally resumed, by the way). However, I’m in the process of working with another editor to migrate the data dump checking process to an ODP server, so it won’t take up my time or energy anymore. I’m also spending far less time on my other ODP-related projects.

Speaking of social information processing, there was an interesting paper published by Kristina Lerman of USC this month on the subject, Social Information Processing in Social News Aggregation (PDF format). The paper looks at the way Digg exploits the power of social information processing to solve the problem of rating aggregated news stories.

Free Software

I’ve also been catching up on a couple of free software projects the last couple of weeks. I’ve posted a new version of my fork of the mod_virgule code that includes the latest patches to the official version. It also includes several new features from my ToDo list including a configurable sitemap and article index page. Both of these elements are hard-coded in the official source making it difficult to use mod_virgule without editing the source code and recompiling. I also added a simple include function to the XML markup (suggested on the mod_virgule development list). For a full list of all the goodies in my version, like UTF-8 support, password reminders, libxml2 support, etc., see the web page. Keep the patches and suggestions coming…

I’ve also posted a new version of dumpcheck, the program I’ve been using to help debug the weekly ODP data dump generation problems. In this case, I got a nice pile of useful patches from Andreas Steinmetz including a few fixes for compiler warnings and a couple of new features. Thanks!

mod_virgule and UTF-8 weirdness

I’m seeing more and more UTF-8 related issues pop up in code lately for some reason. Much of the debugging work I’ve done with the ODP XML dumps has been tracking down illegal XML characters and invalid UTF-8 byte sequences.
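For readers who want to try this kind of check themselves, here is a minimal sketch in Python (not the C used by dumpcheck, and not its actual logic). It reports the byte offsets of invalid UTF-8 sequences and the code points of any characters outside the XML 1.0 Char production. The function name find_encoding_errors is made up for illustration, and the repeated slicing is fine for small inputs but would need a streaming approach for a full dump.

```python
import re

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHAR = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_encoding_errors(data: bytes):
    """Return (byte offsets of invalid UTF-8, hex code points of illegal XML chars).

    Hypothetical helper for illustration only.
    """
    bad_utf8 = []
    parts = []
    pos = 0
    while pos < len(data):
        try:
            parts.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice data[pos:]
            parts.append(data[pos:pos + err.start].decode("utf-8"))
            bad_utf8.append(pos + err.start)
            pos += err.end  # skip the offending bytes and keep scanning
    text = "".join(parts)
    bad_xml = [hex(ord(m.group())) for m in _ILLEGAL_XML_CHAR.finditer(text)]
    return bad_utf8, bad_xml
```

Called on b"ok\x01 caf\xc3\xa9 \xff end", it flags the stray 0xFF byte as invalid UTF-8 and the 0x01 control character as illegal XML, while leaving the properly encoded é alone.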

Now I’ve run across a related bug in mod_virgule. The trust metrics on robots.net stopped working a few days ago and today I took some time to track down the reason. It turned out to be an interesting little issue with the way mod_virgule handles the storage of data in the XML database. I’ve implemented a temporary work-around that has things working safely again but I think a longer term fix is needed.

I posted to the virgule_dev mailing list about the problem but it’s been pretty much dead for the past few months. Basically what happened is a foreign user posted some data to their user profile using a funky non-UTF-8 compatible character set. The result was a corrupt profile.xml file for that user account. That, in turn, led to Apache segfaulting during each subsequent attempt by mod_virgule to process the trust metric. Because of the segfault there was no error reporting to alert anyone of the problem and it took several days before anyone noticed that something was wrong.

The root of the problem seems to be that mod_virgule simply takes whatever raw data a user puts in a form and passes it directly to xmlSetProp(). This works great as long as you only give it valid UTF-8 data but it’s not designed to work on anything else. It seems to me that four things need to be done to fix this:

  • Pages need to explicitly specify UTF-8 as the character encoding
  • All form data needs to be validated before passing to libxml
  • Invalid data needs to be converted or rejected
  • The trust metric code needs some additional error handling
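The validate and convert-or-reject steps in the list above can be sketched roughly as follows. This is Python rather than mod_virgule’s C, and it only illustrates the idea; it is not a patch. The function name clean_form_value and the choice of cp1252 as a fallback encoding are assumptions made for the example, since the right fallback depends on what browsers actually send.

```python
from typing import Optional


def _is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def clean_form_value(raw: bytes,
                     fallback_encoding: Optional[str] = "cp1252") -> str:
    """Return text that is safe to store, or raise ValueError to reject it.

    Hypothetical helper: validate as UTF-8 first, then try to convert from
    an assumed fallback encoding, then reject characters XML cannot hold.
    """
    try:
        text = raw.decode("utf-8")                # validate
    except UnicodeDecodeError:
        if fallback_encoding is None:
            raise ValueError("form value is not valid UTF-8") from None
        try:
            text = raw.decode(fallback_encoding)  # convert
        except UnicodeDecodeError:
            raise ValueError("form value is not in a known encoding") from None
    for ch in text:
        if not _is_xml_char(ord(ch)):             # reject
            raise ValueError("illegal XML character %#x" % ord(ch))
    return text
```

With this in place the caller decides what a rejection means: redisplay the form with an error, or log it and drop the field. Either way, nothing unvalidated reaches the XML layer.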

If anyone has any thoughts on this or has had a similar experience with mod_virgule, I’d be curious to hear about it.

Ray Rainwater RIP

It’s been busy and I’ve fallen behind on posting anything new lately. It’s been a mixed month of good news and bad. The bad news was hearing that Ray Rainwater had died. While not totally unexpected, one never likes to lose a friend. In this case a friend Susan and I knew only through the Internet. Our paths crossed doing genealogical research on the Rainwater family and we’ve corresponded with Ray frequently over the last couple of years. We’d talked about making the trip to Alaska to meet him and he had hoped to make the trip to Dallas at one point as well. Neither happened in time.

He used to send me reminders when I hadn’t updated my weblog in a while (it’s always nice to know somebody actually reads this thing!) and often offered interesting, related anecdotes from his life. When I wrote about my feelings the morning after 9/11, he was reminded of his reaction to the 1941 attack on Pearl Harbor. Whether discussing current events or the specs of the latest digital cameras, there was always something interesting in his emails. He and Susan frequently discussed genealogical mysteries. He will be missed.

There’s been some good news this month as well. Business continues to pick up and there really do seem to be signs of an economic recovery going on. In what spare time I have, we’ve started a major reorganization of the ODP Robotics categories and have already doubled the size of the category. Meanwhile, after a week or so of downtime, ODP finally installed the new servers for the public side of the site. The new server for the editor site is also up and this week they’re replacing the server that runs the internal forums. The new ODP servers run Linux instead of Solaris and the proprietary forum software has been replaced with GPL’d software. So we’re one tiny step closer in the quest to run the Open Directory Project on Open software.

ODP/dmoz Update

(Sinus update: It’s been about one week since the surgery. I’m off all but a few of the drugs, I’m back to my usual routine at work, and I feel great; better than I’ve felt in a year. I can breathe, taste, and smell. I feel a few years younger.)

The latest RDF dump error report shows no XML character errors for the third week running. Invalid UTF-8 sequences are down from hundreds to just two this week. It’s definitely the best dump ever and I’m keeping my fingers crossed that this week’s dump will be 100% error-free at the character encoding level. In anticipation of that, I’ve started compiling an ODP RDF ToDo list of other bugs and optimizations that need work. I’ve made some progress with one of the oft-requested features for the dump which is to break the full 1GB dump into smaller, category-specific dumps. While testing things out, I’m hosting the smaller dumps locally but if they start seeing a lot of use, hopefully they’ll get moved to an ODP server with enough bandwidth to handle them.
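As a rough illustration of how a category-specific dump could be carved out of the full file, here is a minimal streaming sketch in Python. It is not the tool used to produce the smaller dumps. It assumes a layout in which each top-level record is either a Topic element carrying its category path in an id attribute or an ExternalPage element with a topic child; those names, the function extract_category, and the sample data are assumptions for the example.

```python
import io
import xml.etree.ElementTree as ET


def _local(name: str) -> str:
    """Strip any '{namespace}' prefix from a tag or attribute name."""
    return name.rsplit("}", 1)[-1]


def extract_category(source, prefix):
    """Yield serialized top-level records at or below the category `prefix`."""
    root = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem
            depth += 1
            continue
        depth -= 1
        if depth != 1:          # only act when a direct child of the root closes
            continue
        path = None
        if _local(elem.tag) == "Topic":
            path = next((v for k, v in elem.attrib.items()
                         if _local(k) == "id"), None)
        elif _local(elem.tag) == "ExternalPage":
            topic = next((c for c in elem if _local(c.tag) == "topic"), None)
            path = topic.text if topic is not None else None
        if path and (path == prefix or path.startswith(prefix + "/")):
            yield ET.tostring(elem, encoding="unicode")
        root.clear()            # drop finished records so memory stays flat


# Tiny made-up sample in the assumed layout.
SAMPLE = b"""<RDF xmlns:r="http://www.w3.org/TR/RDF/" xmlns="http://dmoz.org/rdf/">
<Topic r:id="Top/Computers/Robotics"><catid>1</catid></Topic>
<ExternalPage about="http://example.org/"><topic>Top/Computers/Robotics/Software</topic></ExternalPage>
<ExternalPage about="http://example.com/"><topic>Top/Arts</topic></ExternalPage>
</RDF>"""

records = list(extract_category(io.BytesIO(SAMPLE), "Top/Computers/Robotics"))
```

Because it streams with iterparse and clears each record once it has been handled, the approach should cope with a 1GB input without loading it all at once. ElementTree rewrites namespace prefixes on output (ns0, ns1), so a real tool would want to register the original prefixes and write a proper header and footer around the records.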

ODP RDF Exports

The RDF exports seem to be coming out like clockwork again from ODP. The first was riddled with errors but the second is much, much better. No illegal XML characters in either file and only one had UTF-8 errors. With luck, the next one will be error free. I’m going to attempt to create smaller RDFs of ODP subcats for those who only need one or two categories and don’t like downloading the full 1GB RDF.