I’ve also been catching up on a couple of free software projects over the last couple of weeks. I’ve posted a new version of my fork of the mod_virgule code that includes the latest patches to the official version. It also includes several new features from my ToDo list, including a configurable sitemap and article index page. Both of these elements are hard-coded in the official source, making it difficult to use mod_virgule without editing the source code and recompiling. I also added a simple include function to the XML markup (suggested on the mod_virgule development list). For a full list of all the goodies in my version, like UTF-8 support, password reminders, libxml2 support, etc., see the web page. Keep the patches and suggestions coming…
I’ve also posted a new version of dumpcheck, the program I’ve been using to help debug the weekly ODP data dump generation problems. In this case, I got a nice pile of useful patches from Andreas Steinmetz including a few fixes for compiler warnings and a couple of new features. Thanks!
I’m seeing more and more UTF-8 related issues pop up in code lately for some reason. Much of the debugging work I’ve done with the ODP XML dumps has been tracking down illegal XML characters and invalid UTF-8 byte sequences.
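For anyone curious what that kind of check looks like, here is a rough Python sketch of the two tests involved: finding byte offsets where strict UTF-8 decoding fails, and finding characters that XML 1.0 does not allow. This is not dumpcheck's actual code; the function names `invalid_utf8_offsets` and `illegal_xml_chars` are made up for illustration.

```python
import re
from typing import List, Tuple

# Control characters and noncharacters excluded by the XML 1.0 "Char" rule.
# Tab (0x09), LF (0x0A) and CR (0x0D) are allowed. Surrogates are also
# illegal, but strict UTF-8 decoding already rejects them.
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def invalid_utf8_offsets(data: bytes) -> List[int]:
    """Return the byte offset of each invalid UTF-8 sequence in data."""
    offsets = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is clean
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded
            offsets.append(pos + err.start)
            pos += err.end
    return offsets


def illegal_xml_chars(text: str) -> List[Tuple[int, int]]:
    """Return (index, code point) for each character XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in _ILLEGAL_XML.finditer(text)]
```

Re-slicing the buffer after every error is wasteful on a 1GB dump, so a real tool would want to read in chunks, but it shows the idea.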
Now I’ve run across a related bug in mod_virgule. The trust metrics on robots.net stopped working a few days ago and today I took some time to track down the reason. It turned out to be an interesting little issue with the way mod_virgule handles the storage of data in the XML database. I’ve implemented a temporary work-around that has things working safely again but I think a longer term fix is needed.
I posted to the virgule_dev mailing list about the problem but the list has been pretty much dead for the past few months. Basically what happened is a foreign user posted some data to their user profile using a funky non-UTF-8 compatible character set. The result was a corrupt profile.xml file for that user account. That, in turn, led to Apache segfaulting during each subsequent attempt by mod_virgule to process the trust metric. Because of the segfault there was no error reporting to alert anyone to the problem and it took several days before anyone noticed that something was wrong.
The root of the problem seems to be that mod_virgule simply takes whatever raw data a user puts in a form and passes it directly to xmlSetProp(). This works great as long as you only give it valid UTF-8 data but it’s not designed to work on anything else. It seems to me that four things need to be done to fix this:
- Pages need to explicitly specify UTF-8 as the character encoding (charset)
- All form data needs to be validated before passing to libxml
- Invalid data needs to be converted or rejected
- The trust metric code needs some additional error handling
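To make the middle two items and the last one a bit more concrete, here is a minimal sketch in Python rather than mod_virgule's C. It is only an illustration of the approach, not a patch: `clean_form_value` and `load_profile` are hypothetical names, and the choice of fallback encoding is an assumption the caller would have to make.

```python
import re
import xml.etree.ElementTree as ET

# Characters XML 1.0 forbids (tab, LF and CR are allowed).
_ILLEGAL_XML = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def clean_form_value(raw: bytes, fallback_encoding=None):
    """Validate raw form bytes before they reach the XML layer.

    Returns clean text, or None if the data should be rejected.
    """
    try:
        text = raw.decode("utf-8")  # validate: strict UTF-8 only
    except UnicodeDecodeError:
        if fallback_encoding is None:
            return None  # reject
        # convert: reinterpret the bytes in the assumed legacy encoding
        text = raw.decode(fallback_encoding, errors="replace")
    return _ILLEGAL_XML.sub("", text)


def load_profile(xml_bytes: bytes):
    """Parse a stored profile; return None instead of crashing if corrupt."""
    try:
        return ET.fromstring(xml_bytes)
    except ET.ParseError:
        return None


# Usage: a profile that fails to load gets skipped, not fatal.
value = clean_form_value(b"caf\xe9", fallback_encoding="latin-1")
profile = ET.Element("profile")
profile.set("name", value)
stored = ET.tostring(profile, encoding="utf-8")
assert load_profile(stored) is not None
```

The point of the second function is that one bad profile should cost the trust metric a single skipped account and a log entry, not the whole run.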
If anyone has any thoughts on this or has had a similar experience with mod_virgule, I’d be curious to hear about it.
The RDF exports seem to be coming out like clockwork again from ODP. The first was riddled with errors but the second is much, much better. No illegal XML characters in either file and only one had UTF-8 errors. With luck, the next one will be error free. I’m going to attempt to create smaller RDFs of ODP subcats for those who only need one or two categories and don’t like downloading the full 1GB RDF.