Free Software

I’ve also been catching up on a couple of free software projects over the last few weeks. I’ve posted a new version of my fork of the mod_virgule code that includes the latest patches to the official version. It also includes several new features from my ToDo list, including a configurable sitemap and article index page. Both of these elements are hard-coded in the official source, making it difficult to use mod_virgule without editing the source code and recompiling. I also added a simple include function to the XML markup (suggested on the mod_virgule development list). For a full list of all the goodies in my version, like UTF-8 support, password reminders, libxml2 support, etc., see the web page. Keep the patches and suggestions coming…

I’ve also posted a new version of dumpcheck, the program I’ve been using to help debug the weekly ODP data dump generation problems. In this case, I got a nice pile of useful patches from Andreas Steinmetz including a few fixes for compiler warnings and a couple of new features. Thanks!

mod_virgule and UTF-8 weirdness

I’m seeing more and more UTF-8 related issues pop up in code lately for some reason. Much of the debugging work I’ve done with the ODP XML dumps has been tracking down illegal XML characters and invalid UTF-8 byte sequences.

Now I’ve run across a related bug in mod_virgule. The trust metrics on robots.net stopped working a few days ago and today I took some time to track down the reason. It turned out to be an interesting little issue with the way mod_virgule handles the storage of data in the XML database. I’ve implemented a temporary work-around that has things working safely again but I think a longer term fix is needed.

I posted to the virgule_dev mailing list about the problem but it’s been pretty much dead for the past few months. Basically what happened is a foreign user posted some data to their user profile using a funky character set that isn’t compatible with UTF-8. The result was a corrupt profile.xml file for that user account. That, in turn, led to Apache segfaulting during each subsequent attempt by mod_virgule to process the trust metric. Because of the segfault there was no error reporting to alert anyone to the problem, and it took several days before anyone noticed that something was wrong.

The root of the problem seems to be that mod_virgule simply takes whatever raw data a user puts in a form and passes it directly to xmlSetProp(). This works great as long as you only give it valid UTF-8 data, but it’s not designed to work on anything else. It seems to me that four things need to be done to fix this:

  • Pages need to explicitly specify UTF-8 as the character encoding
  • All form data needs to be validated before passing to libxml
  • Invalid data needs to be converted or rejected
  • The trust metric code needs some additional error handling
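
To make the validation step a bit more concrete, here is a rough sketch in C of the kind of check I have in mind. It is not code from mod_virgule; the function name is_valid_xml_utf8() is made up for illustration. It walks a buffer, rejects malformed or overlong UTF-8 sequences, and also rejects code points that XML 1.0 doesn’t allow (most control characters, surrogates, and so on). libxml2 also has xmlCheckUTF8(), which may well cover the well-formedness half, so treat this as one possible approach rather than the fix.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of mod_virgule or libxml2).
 * Returns 1 if buf[0..len) is well-formed UTF-8 and every decoded code
 * point is a legal XML 1.0 character, 0 otherwise.
 */
int is_valid_xml_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char b = buf[i];
        unsigned long cp;   /* decoded code point */
        unsigned long min;  /* smallest code point legal for this length */
        size_t extra;       /* continuation bytes expected after b */
        size_t k;

        if (b < 0x80) {
            cp = b;        extra = 0; min = 0;
        } else if ((b & 0xE0) == 0xC0) {
            cp = b & 0x1F; extra = 1; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            cp = b & 0x0F; extra = 2; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            cp = b & 0x07; extra = 3; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (extra > len - i - 1)
            return 0;       /* sequence is cut off at the end of the buffer */

        for (k = 1; k <= extra; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0;   /* expected a 10xxxxxx continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return 0;       /* overlong encoding */

        /* XML 1.0 "Char" production: tab, LF, CR, and the ranges below. */
        if (!(cp == 0x9 || cp == 0xA || cp == 0xD ||
              (cp >= 0x20 && cp <= 0xD7FF) ||
              (cp >= 0xE000 && cp <= 0xFFFD) ||
              (cp >= 0x10000 && cp <= 0x10FFFF)))
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

The idea would be to run each form field through a check like this before it ever reaches xmlSetProp(), and to reject or convert anything that fails instead of writing it to the profile.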

If anyone has any thoughts on this or has had a similar experience with mod_virgule, I’d be curious to hear about it.

XTM, RDF, DAML, OIL, and other uses of XML

My recent discussions with the ODP guys about open-sourcing the ODP backend software have led me to read up on RDF, which is the format used by ODP for exporting the ontological information and content of the directory. One thing I immediately ran into was XTM, the ISO standard for creating XML Topic Maps. These seem to me to be competing standards in that they both use XML to describe ontological information. RDF seems to be enjoying much more widespread use on the web but I’m playing catch-up in this particular area right now, so I may be missing some uses of XTM. One helpful document I’ve found is a paper by Lars Marius Garshol comparing XTM, RDF, and two RDF extensions, DAML and OIL. If anyone knows of other introductory-level documents describing the similarities and differences of XTM and RDF, I’d be curious to hear about them.