Open Directory Project

Back in the day I was an editor for the Open Directory Project (ODP) and also fooled around with the "RDF" dumps from time to time. The ODP dumps had a history of being rather dirty files, containing a wide assortment of errors and non-standard uses of XML and RDF syntax.

I started working with ODP to try to improve the quality of the data dumps. The result was the creation of the code below as well as a lengthy Data Dump ToDo List on which I tracked the various bugs and improvement requests related to the ODP dumps.

Search engines and social bookmarking eventually rendered manually created web directories like ODP irrelevant, and they are largely forgotten today.

The programs below are intended for developers and probably won't be very useful to you if your goal is to put ODP data on your website. If you're looking for a complete solution for that, I recommend having a look at the minimoz package written by Andreas Steinmetz. Minimoz parses ODP data, stores it in an SQL database, and generates an ODP-like template-based website from the data. It's written in C and is Free Software licensed under the GNU GPL. Another option is phpODPWorld. It's written in PHP and is also licensed under the GNU GPL. It can display selected categories or all of ODP using any SQL database supported by the PHP PEAR module.

dumpcheck

dumpcheck is a C program that scans the UTF-8 encoded XML data dump files exported by ODP and reports the location of invalid UTF-8 sequences, illegal XML characters, illegal Unicode characters, and XML well-formedness errors. This helps ODP staff find and correct problems in the data dump generation scripts. For invalid UTF-8 sequences and illegal characters, it reports the line number, byte offset, approximate ODP category ID, and hex values of the offending bytes or characters. For XML well-formedness errors, it reports the line number (thanks to libxml2!). While this program is intended to assist with debugging the ODP data dumps, it may be useful for other tasks as well.
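To give a rough idea of the kind of scan described above, here is a minimal sketch in Python. It is not the dumpcheck source (which is C and uses libxml2), and the names scan and is_legal_xml_char are made up for this illustration. It only covers invalid UTF-8 sequences and characters outside the XML 1.0 Char range, and it reports the line number, byte offset, and hex value for each problem. It does not track ODP category IDs or check well-formedness.

```python
# Hypothetical sketch, not the dumpcheck source: find invalid UTF-8 sequences
# and characters that XML 1.0 does not allow in a document.

def is_legal_xml_char(cp):
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x09, 0x0A, 0x0D)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )

def scan(data):
    """Return a list of (line, byte_offset, kind, hex_value) tuples."""
    problems = []
    pos = 0    # byte offset into data
    line = 1   # 1-based line number
    while pos < len(data):
        try:
            text = data[pos:].decode("utf-8")
            bad = None
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we tried to
            # decode; everything before err.start is valid UTF-8.
            text = data[pos:pos + err.start].decode("utf-8")
            bad = data[pos + err.start:pos + err.end]
        for ch in text:
            if not is_legal_xml_char(ord(ch)):
                problems.append(
                    (line, pos, "illegal XML character", "%04X" % ord(ch)))
            if ch == "\n":
                line += 1
            pos += len(ch.encode("utf-8"))
        if bad is not None:
            problems.append((line, pos, "invalid UTF-8", bad.hex()))
            pos += len(bad)   # skip the bad bytes and keep scanning
    return problems

if __name__ == "__main__":
    # A control character and a stray 0xFF byte, both on line 2.
    for problem in scan(b"<a>\n\x01x\xffy</a>\n"):
        print(problem)
```

Re-slicing the buffer after every bad byte is wasteful on a multi-gigabyte dump, so a real tool would work through the file in chunks. The sketch favors clarity over speed.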

dumpcheck-1.11.tar.gz [17Kb]

odp2db

odp2db is a collection of Perl programs that can be used to parse the ODP data dumps and insert the data into an SQL database. Both the structure.rdf.u8 and content.rdf.u8 files are parsed. A minimal table structure is included that is suitable for loading the database but probably not useful for any real work. The XML::Parser and DBI Perl modules are required. I developed this for use with PostgreSQL but have tried to stick with standard ANSI SQL as much as possible, so it should work with MySQL and anything else supported by DBI with only very minimal changes.
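The parse-and-insert approach can be sketched roughly as follows. This is not the odp2db code (which is Perl). It is a Python illustration that uses the standard library's streaming XML parser and an in-memory SQLite database as stand-ins for the Perl modules and PostgreSQL. The sample data, the page table, and the load_pages function are all made up for the example, and the element names and namespaces in the sample are an assumption about the general shape of content.rdf.u8, so check them against a real dump.

```python
# Hypothetical sketch, not the odp2db source: stream ExternalPage records
# out of an ODP-style dump and insert them into an SQL table.
import io
import sqlite3
import xml.etree.ElementTree as ET

# Made-up, simplified sample; element names and namespaces are assumptions.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example Site</d:Title>
    <d:Description>An example listing.</d:Description>
    <topic>Top/Computers/Example</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/">
    <d:Title>Another Site</d:Title>
    <d:Description>A second listing.</d:Description>
    <topic>Top/Computers/Example</topic>
  </ExternalPage>
</RDF>
"""

def local(tag):
    """Strip ElementTree's '{namespace}' prefix from a tag name."""
    return tag.rsplit("}", 1)[-1]

def load_pages(stream, conn):
    """Insert every ExternalPage found in stream; return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page "
        "(url TEXT, title TEXT, description TEXT, topic TEXT)")
    count = 0
    # iterparse streams the file, so the whole dump never sits in memory
    # as one parsed tree with all of its text.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "ExternalPage":
            continue
        fields = {local(child.tag): (child.text or "") for child in elem}
        conn.execute(
            "INSERT INTO page (url, title, description, topic) "
            "VALUES (?, ?, ?, ?)",
            (elem.get("about"), fields.get("Title", ""),
             fields.get("Description", ""), fields.get("topic", "")))
        count += 1
        elem.clear()   # release this record's children once it is stored
    conn.commit()
    return count

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_pages(io.BytesIO(SAMPLE), conn), "pages loaded")
```

The SQL is kept to plain CREATE TABLE and INSERT statements with placeholders, in the same spirit as sticking to standard SQL so the database can be swapped with few changes.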

odp2db-1.2.tar.gz [17Kb]

License: All software on this page is Free Software licensed under the GNU GPL.


Copyright © 2007 by R. Steven Rainwater, All Rights Reserved.