ODP/dmoz Update

(Sinus update: It’s been about one week since the surgery. I’m off all but a few of the drugs, I’m back to my usual routine at work, and I feel great; better than I’ve felt in a year. I can breathe, taste, and smell. I feel a few years younger.)

The latest RDF dump error report shows no XML character errors for the third week running. Invalid UTF-8 sequences are down from hundreds to just two this week. It’s definitely the best dump ever, and I’m keeping my fingers crossed that this week’s dump will be 100% error-free at the character encoding level. In anticipation of that, I’ve started compiling an ODP RDF ToDo list of other bugs and optimizations that need work. I’ve made some progress with one of the oft-requested features for the dump: breaking the full 1GB dump into smaller, category-specific dumps. While testing things out, I’m hosting the smaller dumps locally, but if they start seeing a lot of use, hopefully they’ll get moved to an ODP server with enough bandwidth to handle them.

ODP RDF Exports

The RDF exports seem to be coming out like clockwork again from ODP. The first was riddled with errors, but the second is much, much better. There were no illegal XML characters in either file and only one had UTF-8 errors. With luck, the next one will be error-free. I’m going to attempt to create smaller RDFs of ODP subcats for those who only need one or two categories and don’t like downloading the full 1GB RDF.

BSA, ODP, RDF, and other TLAs

We received another BSA threat letter at NCC on Friday. That’s two in as many weeks. It was the usual collection of vague threats that if we didn’t rush out and buy some Microsoft, Adobe, and Macromedia software, the BSA might have to search our office for unlicensed software and fine us a few million dollars. This time I called their toll-free number and told them to remove us from their mailing list because we were tired of getting lame marketing letters disguised as legal threats. (Feel free to call them yourself and let them know what you think of them – hey, it’s a free call! 1-877-536-4BSA.) I also told them we’d instituted a company-wide policy to discontinue the use of all software products made by BSA members in favor of Free/Open equivalents because of the marketing-by-extortion methods of the BSA. The girl I was talking with claimed they had removed our address from their list and, after I specifically asked her twice to do so, claimed she had made a note in our file about our new policy. So, will they really remove us from their list or will they put us on the list of companies to target with audits? Time will tell.

ODP finally solved the problem with RDF generation and a new RDF dump showed up on the 13th. The downside is that the dump is still riddled with invalid UTF-8 sequences and illegal XML characters. On the last RDF, I provided offsets of a lot of the errors by waiting for an XML parser to bomb out and then looking for the problems with a hex editor (which is time-consuming when you have to start over on a 1GB file after each error). This time I decided to be lazy and wrote some quick C code to do it for me. Strangely, a search on the net failed to locate any UTF-8 or XML checkers that would work on arbitrarily large files. And most XML validators don’t check for illegal XML characters or invalid UTF-8 sequences; they simply fail unrecoverably when they hit one. Anyway, I processed the RDF files and posted a list of errors in the latest dump. So with a little luck, the next RDF dumps will be much cleaner.
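For anyone curious what such a checker might look like, here is a minimal sketch in C. It is not the code I actually ran, just an illustration under a few assumptions: the input is meant to be UTF-8, "illegal XML character" means anything outside the Char production of XML 1.0, and all of the names (checker, checker_feed, checker_finish, check_file) are made up for this example. The key idea is that the decoder state lives in a small struct, so the file can be read in fixed-size chunks and memory use stays constant no matter how big the dump is. Instead of stopping at the first problem, it records the byte offset and keeps going.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Decoder state carried across chunk boundaries (hypothetical names). */
typedef struct {
    unsigned long long offset;  /* offset of the next byte to be fed     */
    unsigned long long start;   /* offset of the current lead byte       */
    unsigned long long errors;  /* number of problems seen so far        */
    unsigned long cp;           /* code point being assembled            */
    unsigned long min;          /* smallest code point legal at this len */
    int need;                   /* continuation bytes still expected     */
} checker;

/* Char production from XML 1.0. */
static int xml_char_ok(unsigned long c) {
    return c == 0x9 || c == 0xA || c == 0xD ||
           (c >= 0x20 && c <= 0xD7FF) ||
           (c >= 0xE000 && c <= 0xFFFD) ||
           (c >= 0x10000 && c <= 0x10FFFF);
}

static void report(checker *s, FILE *out, const char *what,
                   unsigned long long at) {
    s->errors++;
    if (out) fprintf(out, "offset %llu: %s\n", at, what);
}

/* Called once a complete code point has been assembled. */
static void finish_cp(checker *s, FILE *out) {
    if (s->cp < s->min || s->cp > 0x10FFFF ||
        (s->cp >= 0xD800 && s->cp <= 0xDFFF))
        report(s, out, "invalid UTF-8 (overlong, surrogate, or out of range)",
               s->start);
    else if (!xml_char_ok(s->cp))
        report(s, out, "illegal XML character", s->start);
}

/* Feed one chunk; may be called any number of times. out may be NULL. */
void checker_feed(checker *s, const unsigned char *buf, size_t n, FILE *out) {
    assert(s != NULL && (buf != NULL || n == 0));
    for (size_t i = 0; i < n; i++, s->offset++) {
        unsigned char b = buf[i];
        if (s->need > 0) {
            if ((b & 0xC0) == 0x80) {            /* valid continuation */
                s->cp = (s->cp << 6) | (b & 0x3F);
                if (--s->need == 0) finish_cp(s, out);
                continue;
            }
            report(s, out, "truncated UTF-8 sequence", s->start);
            s->need = 0;          /* then re-examine b as a lead byte */
        }
        s->start = s->offset;
        if (b < 0x80) { s->cp = b; s->min = 0; finish_cp(s, out); }
        else if ((b & 0xE0) == 0xC0) { s->cp = b & 0x1F; s->min = 0x80;    s->need = 1; }
        else if ((b & 0xF0) == 0xE0) { s->cp = b & 0x0F; s->min = 0x800;   s->need = 2; }
        else if ((b & 0xF8) == 0xF0) { s->cp = b & 0x07; s->min = 0x10000; s->need = 3; }
        else report(s, out, "invalid UTF-8 lead byte", s->offset);
    }
}

/* Call after the last chunk to catch a sequence cut off by end of input. */
void checker_finish(checker *s, FILE *out) {
    if (s->need > 0) {
        report(s, out, "truncated UTF-8 sequence at end of input", s->start);
        s->need = 0;
    }
}

/* Check a whole stream in 64KB chunks; returns the number of errors. */
unsigned long long check_file(FILE *in, FILE *out) {
    checker s = {0};
    unsigned char buf[65536];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        checker_feed(&s, buf, n, out);
    checker_finish(&s, out);
    return s.errors;
}
```

To try it, call check_file(stdin, stdout) from a main() and pipe the dump through it. Note that this only looks at raw characters. It does not parse the XML, so a numeric character reference that points at an illegal character would slip past it.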

First Post: 2003

Well, I suppose it’s a bit late to be posting new year’s resolutions so I’ll just skip straight on to other things. I’ve picked up a couple of new ODP categories to play with. ODP still hasn’t gotten the RDF export fixed. They seem to be having some major scaling issues right now. Wish they’d accept some help from the many editors who’ve offered but they seem determined not to.

The DPRG had to find a new meeting place this year. For several years we’ve been meeting at the Bill Priest Institute. Starting this month we’ve been offered meeting space at The Science Place, where we usually hold the Roborama and other competitions. So expect to see more robots and hackers wandering around The Science Place this year.

XTM, RDF, DAML, OIL, and other uses of XML

My recent discussions with the ODP guys about open-sourcing the ODP backend software have led me to read up on RDF, which is the format used by ODP for exporting the ontological information and content of the directory. One thing I immediately ran into was XTM, the ISO standard for creating XML Topic Maps. These seem to me to be competing standards in that they both use XML to describe ontological information. RDF seems to be enjoying much more widespread use on the web but I’m playing catch-up in this particular area right now, so I may be missing some uses of XTM. One helpful document I’ve found is a paper by Lars Marius Garshol comparing XTM, RDF, and two RDF extensions, DAML and OIL. If anyone knows of other introductory-level documents describing the similarities and differences of XTM and RDF, I’d be curious to hear about them.