I’m seeing more and more UTF-8 related issues pop up in code lately for some reason. Much of the debugging work I’ve done with the ODP XML dumps has been tracking down illegal XML characters and invalid UTF-8 byte sequences.
Now I’ve run across a related bug in mod_virgule. The trust metrics on robots.net stopped working a few days ago and today I took some time to track down the reason. It turned out to be an interesting little issue with the way mod_virgule handles the storage of data in the XML database. I’ve implemented a temprorary work-around that has things working safely again but I think a longer term fix is needed.
I posted to the virgule_dev mailing list about the problem but it’s been pretty much dead for the past few months. Basically what happened is a foreign user posted some data to their user profile using a funky non-UTF-8 compatible character set. The result was a corrupt profile.xml file for that user account. That, in turn, led to Apache segfaulting during each subsequent attempt by mod_virgule to process the trust metric. Because of the segfault there was no error reporting to alert anyone of the problem and it took several days before anyone noticed that something was wrong.
The root of the problem seems to be that mod_virgule is simply taking whatever raw data a user puts in a form and passes it directly to xmlSetProp(). This works great as long you only give it valid UTF-8 data but it’s not designed to work on anything else. It seems to me that four things need to be done to fix this:
- Pages need to explicitly specify UTF-8 as the doctype
- All form data needs to be validated before passing to libxml
- Invalid data needs to be converted or rejected
- The trust metric code needs some additional error handling
If anyone has any thoughts on this or has had a similar experience with mod_virgule, I’d be curious to hear about it.