Dave Beckett's blog

Munging Planet RDF

2004-10-21 18:58

Sam Ruby says in the slide Munging, from his slides on the pitfalls around Unicode, XML and HTTP: "Planet RDF will take HTML and run it through a iso-8859-1 to utf-8 conversion". This is not quite correct.

The code behind PlanetRDF uses the source blogroll to get the RSS feed URIs. These are fetched and parsed using the Ultra-liberal RSS parser, giving Unicode inside Python. This data is used to create a skeleton HTML document in UTF-8, which is passed to tidy to try to fix HTML escaping and tagging messes. Tidy is told to read and write UTF-8. The aggregation is then performed and the result is a new RDF/XML (RSS 1.0) feed in UTF-8, which is then XSLTed into XHTML in UTF-8. There is no explicit transcoding. If there is a problem, it'll be at the first RSS stage.
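To show why the distinction matters, here is a minimal sketch of what an explicit iso-8859-1 to utf-8 conversion would do to data that is already UTF-8. It is written in modern Python 3 and is not the PlanetRDF code; the sample title is made up for illustration.

```python
# Hypothetical sample title, not taken from any real feed.
title = "caf\u00e9"

# What a pipeline that is UTF-8 throughout carries for this title.
utf8_bytes = title.encode("utf-8")

# What would happen if those bytes were wrongly treated as iso-8859-1
# and converted to UTF-8 again: each non-ASCII byte gets re-encoded.
double_encoded = utf8_bytes.decode("iso-8859-1").encode("utf-8")

print(utf8_bytes)       # two bytes for the accented letter
print(double_encoded)   # four bytes: the classic double-encoding mess
```

Since every stage is told to read and write UTF-8, nothing like the second step should be happening anywhere in the chain.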

There are sometimes encoding errors in titles in the main page body. These are due to Python attempting to read the UTF-8 encoded bytes that tidy emits as ASCII. The right-hand side is always correct, since it is all done in RDF from the source blogroll, no munging.
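The failure mode is easy to reproduce. The sketch below is in modern Python 3 rather than the Python of the actual scripts, and `read_tidy_output` is a hypothetical helper standing in for whatever reads tidy's output; the bytes are a made-up title, not real tidy output.

```python
def read_tidy_output(raw: bytes, encoding: str) -> str:
    """Decode bytes as if they had come back from tidy (hypothetical helper)."""
    return raw.decode(encoding)

# Made-up stand-in for UTF-8 bytes emitted by tidy.
tidy_output = "na\u00efve title".encode("utf-8")

# Reading the bytes as ASCII fails on the first non-ASCII byte.
try:
    read_tidy_output(tidy_output, "ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Naming the encoding explicitly recovers the original title.
decoded = read_tidy_output(tidy_output, "utf-8")
print(ascii_failed, decoded)
```

The fix on the reading side is simply to state the encoding rather than rely on a default.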

I guess it's time to junk the "Ultra-liberal" parser and replace it with a real one. As all PlanetRDF feeds are RSS 1.0, not RSS tag soup, we can use an RDF/XML parser. At that point PlanetRDF will be triples all the way down :)

More detail on how PlanetRDF works was given in Planet Blog by Edd Dumbill.