Dave Beckett's blog

Refactoring Raptor for RDF Atom

2006-04-02 19:49

I spend most of my work day writing TWiki pages or going to meetings, so I've been coding on my own time, and lately I've been refactoring the internals of Raptor's XML support. I'll explain why below.

It's been a long process, and probably a never-ending one. The reason I started writing Raptor in October 2000 was to have a conformant RDF/XML parser built on the best XML parser available. Which parser that was wasn't clear at the time, as you had the choice of:

  • libxml / libxml2: new and good
  • expat 1.95.x: old, i.e. mature and well known, but not seeing much development; also good.

So I made it work with both and as I needed namespace support for RDF/XML, made them both look like they generated something like SAX2 namespace events. At that time, only libxml2 supported namespaces. This libxml/expat + namespace support + RDF/XML parser was all done in one 140K C file. Which was a problem, but the parser did work!
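To give a feel for what "SAX2 namespace events" means here, this is a minimal sketch in Python rather than Raptor's C. It uses the standard library's expat-backed xml.sax reader with namespace processing switched on. The NamespaceEventRecorder class, the namespace_events function and the sample document are made up for illustration and are not part of Raptor.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class NamespaceEventRecorder(ContentHandler):
    """Collects SAX2-style namespace-aware events as tuples (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startPrefixMapping(self, prefix, uri):
        # prefix is None when a default namespace is declared
        self.events.append(("prefix", prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # name arrives already split into (namespace URI, local name)
        namespace_uri, local_name = name
        self.events.append(("start", namespace_uri, local_name))

    def endElementNS(self, name, qname):
        namespace_uri, local_name = name
        self.events.append(("end", namespace_uri, local_name))


def namespace_events(xml_bytes):
    """Parse xml_bytes and return the recorded namespace events."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    recorder = NamespaceEventRecorder()
    parser.setContentHandler(recorder)
    parser.parse(io.BytesIO(xml_bytes))
    return recorder.events


# Hypothetical sample input, not taken from any real feed
SAMPLE = (
    b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    b' xmlns:dc="http://purl.org/dc/elements/1.1/">'
    b'<rdf:Description><dc:title>Stuff</dc:title></rdf:Description>'
    b'</rdf:RDF>'
)

for event in namespace_events(SAMPLE):
    print(event)
```

The point is that the RDF/XML layer only ever sees (namespace URI, local name) pairs, so it does not need to know which XML parser produced them.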

Raptor slowly grew more features to support the updating of RDF/XML and I became the editor of what would be the revised RDF W3C Recommendation. It added: URIs, URI resolving, URI retrieval, XML QNames, XML Namespaces, XML Base, Unicode, UTF-8 and an XML Writer for the rdf:parseType="Literal" handling. Plus a few new parsers: N-Triples (I co-created this), Turtle (I created this; there's a theme here!) and RSS Tag Soup for the 9 flavours of RSS (I have nothing to do with this :) ) plus Atom. Plus a slew of serializers to match, the XML Writer being refactored into its own public API for this. It's not really an RDF parser library anymore; it's a web library with support for mapping between syntaxes and RDF triples.

Meanwhile, SAX2 and RDF/XML were still intertwined. Until this week in 2006. Finally I've pulled them apart, which makes a few neat things possible: the RSS tag soup parser has switched from using the libxml-only xmlReader API to the separate SAX2 API, so now you can do RSS and Atom with expat too. This also improves the Atom support, as it can handle the type='xhtml' and type='xml' markup and now uses the well-tested xml:base, QNames and Namespaces parts from Raptor. I hope that it'll also be able to deal with other XML formats inside Atom, so I'm guessing RDF/XML in Atom will be possible. DOAP over Atom anyone?
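As a rough illustration of why the content type matters, here is a small Python sketch (again, not Raptor's code) that reads atom:content using the standard library's ElementTree. The read_atom_content helper and the sample feed are assumptions made up for this example. It relies only on the Atom 1.0 rules that a missing type means "text", and that type="xhtml" content wraps a single XHTML div.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"


def read_atom_content(content):
    """Return (type, payload) for an atom:content element (illustrative helper)."""
    kind = content.get("type", "text")
    if kind in ("text", "html"):
        # The XML parser has already unescaped the character data.
        return kind, content.text or ""
    # type="xhtml" wraps a single XHTML div; other inline XML holds child elements.
    children = list(content)
    if not children:
        return kind, content.text or ""
    return kind, ET.tostring(children[0], encoding="unicode").strip()


# Hypothetical sample feed with one xhtml entry and one escaped-html entry
FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <content type="xhtml"><div xmlns="http://www.w3.org/1999/xhtml"><p>Content here</p></div></content>
  </entry>
  <entry>
    <content type="html">&lt;p&gt;Content here&lt;/p&gt;</content>
  </entry>
</feed>"""

root = ET.fromstring(FEED)
results = [read_atom_content(c) for c in root.iter("{%s}content" % ATOM_NS)]
for kind, payload in results:
    print(kind, payload)
```

The xhtml case comes back as real markup that needs a namespace-aware serializer, while the html case is just a string. ElementTree may pick its own namespace prefixes when it serializes the div.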

However, at this point I'm stopping, as the part that has me stumped is how best to represent Atom in RDF triples. The Atom OWL work seems to be going slowly (Aside: also the web site acts very oddly to HTTP wget/curl requests). Mostly I'd like readers not to have to care whether it was Atom or RSS tag soup to begin with, so I'm thinking of something like an Atom / RSS 1.0 hybrid format.

Handwaving: here's something I hand-edited:

<item rdf:about="http://example.org/blog/2006/04/01/stuff">
  <!-- the common bits -->
  <title>Stuff</title>
  <link>http://example.org/blog/2006/04/01/stuff</link>
  <description>&lt;div xmlns="http://www.w3.org/1999/xhtml"&gt;&lt;p&gt;Content here&lt;/p&gt;&lt;/div&gt;</description>

  <!-- The RSS 1.0 bits -->
  <dc:date>2006-04-02T20:19:00-08:00</dc:date>
  <content:encoded><![CDATA[<div xmlns="http://www.w3.org/1999/xhtml"><p>Content here</p></div>]]></content:encoded>

  <!-- The atom bits -->
  <atom:id>tag:example.org,2006:1234</atom:id>
  <atom:link rdf:parseType="Resource">
    <atom:link-href rdf:resource="http://example.org/blog/2006/04/01/stuff" />
    <atom:link-rel>alternate</atom:link-rel>
  </atom:link>
  <atom:updated>2006-04-02T20:19:00-08:00</atom:updated>
  <atom:content rdf:parseType="Resource">
    <atom:content-type>xhtml</atom:content-type>
    <atom:content-content rdf:parseType="Literal"><![CDATA[<div xmlns="http://www.w3.org/1999/xhtml"><p>Content here</p></div>]]></atom:content-content>
  </atom:content>
</item>
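
To show what "not having to care" could look like from the reader's side, here is a hedged Python sketch that pulls the common fields out of a trimmed copy of the hand-edited item above. The snippet above declares no namespaces, so the namespace URIs, the rdf:RDF wrapper, and the first_text and summarize helpers are all assumptions for illustration. This is not a real RDF/XML parse, just ElementTree lookups.

```python
import xml.etree.ElementTree as ET

# Assumed namespace bindings; the hand-edited snippet does not declare any.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rss": "http://purl.org/rss/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "atom": "http://www.w3.org/2005/Atom",
}

HYBRID = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:atom="http://www.w3.org/2005/Atom">
  <item rdf:about="http://example.org/blog/2006/04/01/stuff">
    <title>Stuff</title>
    <link>http://example.org/blog/2006/04/01/stuff</link>
    <dc:date>2006-04-02T20:19:00-08:00</dc:date>
    <atom:id>tag:example.org,2006:1234</atom:id>
    <atom:updated>2006-04-02T20:19:00-08:00</atom:updated>
  </item>
</rdf:RDF>"""


def first_text(item, *paths):
    """Return the text of the first matching child, trying each path in order."""
    for path in paths:
        node = item.find(path, NS)
        if node is not None and node.text:
            return node.text.strip()
    return None


def summarize(item):
    """Collect the fields a feed reader typically wants, whatever the source format."""
    return {
        "about": item.get("{%s}about" % NS["rdf"]),
        "title": first_text(item, "rss:title", "atom:title"),
        "link": first_text(item, "rss:link"),
        "date": first_text(item, "dc:date", "atom:updated"),
        "id": first_text(item, "atom:id"),
    }


root = ET.fromstring(HYBRID)
summaries = [summarize(item) for item in root.findall("rss:item", NS)]
for summary in summaries:
    print(summary)
```

Because the hybrid carries both the RSS 1.0 and the Atom properties, a consumer that only knows one vocabulary still finds a title, a link and a date.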

So why am I working on better Atom support in an RDF parser?

Because Atom 1.0 is the best way to encode data for blog entries. It's long past time to ditch the horror that is RSS, the worst ambiguously defined XML format since OPML.