Dave Beckett's blog

Parsing 8G of RDF/XML

2004-07-28 22:53

Last week UniProt in RDF Format was announced by Eric Jain. This includes a 700MB gzipped RDF/XML file in the data area. That's 8369854785 bytes of RDF/XML. So, of course, I had to throw my Raptor parser at it to see if it'd survive.

Round 1, it died with this error:

$ gunzip < uniprot.rdf.gz |rapper -c - http://example.org
rapper: Error - URI http://example.org:41640376 - Duplicated rdf:ID value '_501D28'

So there's a duplicate ID, around line 41.6M; I wonder if they know about that.
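
For anyone wondering why a parser tracks these at all: rdf:ID="X" names the resource base#X, so each rdf:ID value has to be unique within a document, and a checking parser has to remember every ID it has seen. As a minimal sketch of that kind of check - not the actual Raptor code, and the seen_id() helper and table size are made up for illustration:

/* Sketch of a duplicate rdf:ID check: remember every ID seen and
 * complain when one turns up twice.  Not Raptor's code, just the idea. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 65536

struct id_entry { char *id; struct id_entry *next; };
static struct id_entry *table[BUCKETS];

static unsigned hash_id(const char *s) {
  unsigned h = 5381;
  while (*s)
    h = h * 33 + (unsigned char)*s++;
  return h % BUCKETS;
}

/* Returns non-zero if the ID was already present (a duplicate). */
static int seen_id(const char *id) {
  unsigned h = hash_id(id);
  struct id_entry *e;
  for (e = table[h]; e; e = e->next)
    if (!strcmp(e->id, id))
      return 1;
  e = malloc(sizeof *e);
  e->id = strdup(id);   /* the whole ID string stays in memory */
  e->next = table[h];
  table[h] = e;
  return 0;
}

int main(void) {
  const char *ids[] = { "_501D27", "_501D28", "_501D28" };
  int i;
  for (i = 0; i < 3; i++)
    if (seen_id(ids[i]))
      fprintf(stderr, "Duplicated rdf:ID value '%s'\n", ids[i]);
  return 0;
}

Note the strdup: memory grows with every new ID in the document, which matters later.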

Round 2, disable errors:

$ gunzip < uniprot.rdf.gz | rapper -c --ignore-errors - http://example.org
rapper: Parsing returned 134091199 statements

This took about 26 minutes on my 2-year-old desktop PC to count the 134M triples, which works out to around 86,000 triples/second (134,091,199 triples in roughly 1,560 seconds). The PC was struggling with the CPU cost of the ID checking as well as that of gunzip. Parsing the raw uncompressed file might not work, since Raptor uses standard C I/O rather than large-file I/O and so could not seek through all of an 8G file.
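
For the record, large file support just means building with 64-bit file offsets so the stdio seek/tell calls can get past 2GB on a 32-bit system. This is the usual LFS incantation as a standalone illustration, not anything from Raptor's source:

/* Compile with:  cc -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 bigseek.c -o bigseek
 * With a 64-bit off_t, fseeko()/ftello() can reach the end of an 8G file;
 * plain fseek()/ftell() use a long, which overflows at 2GB on 32-bit machines. */
#include <stdio.h>
#include <sys/types.h>

int main(int argc, char *argv[]) {
  FILE *f;
  if (argc < 2)
    return 1;
  f = fopen(argv[1], "rb");
  if (!f) {
    perror("fopen");
    return 1;
  }
  /* seek to the last byte of the 8369854785-byte uniprot.rdf */
  if (fseeko(f, (off_t)8369854784LL, SEEK_SET) != 0) {
    perror("fseeko");
    return 1;
  }
  printf("now at offset %lld\n", (long long)ftello(f));
  fclose(f);
  return 0;
}

Reading from the gunzip pipe sidesteps this, presumably, because stdin is consumed sequentially and never needs seeking.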

rapper was also taking a huge amount of memory, which I suspect is down to the rdf:ID duplicate-value checking, which has never had to cope with data of this size before. While I was waiting I thought of a few ways to optimise it.
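
As a rough illustration of the kind of thing that could help - a sketch of one idea, not a decision about what Raptor will actually do: the check only has to answer "seen this ID before?", so instead of keeping every ID string it could set bits in a fixed-size bitmap indexed by hashes of the ID, Bloom-filter style. Memory stays constant however big the document gets, at the price of an occasional false positive, so it suits a warning better than a hard error. The probably_seen() helper and the 64MB bitmap size here are made up for illustration:

/* Sketch of a lower-memory duplicate check: a fixed-size bitmap indexed by
 * two hashes of each ID (a tiny Bloom filter).  Memory use is constant
 * (64MB here) instead of growing per ID, but a hit only means "probably
 * a duplicate". */
#include <stdio.h>
#include <stdlib.h>

#define BITMAP_BITS (1UL << 29)   /* 2^29 bits = 64MB of bitmap */

static unsigned char *bitmap;

static unsigned long hash_id(const char *s, unsigned long seed) {
  unsigned long h = seed;
  while (*s)
    h = h * 131 + (unsigned char)*s++;
  return h % BITMAP_BITS;
}

/* Set the bit and report whether it was already set. */
static int test_and_set(unsigned long bit) {
  unsigned char mask = (unsigned char)(1u << (bit & 7));
  int was_set = (bitmap[bit >> 3] & mask) != 0;
  bitmap[bit >> 3] |= mask;
  return was_set;
}

/* Returns non-zero if this ID has (probably) been seen before. */
static int probably_seen(const char *id) {
  int a = test_and_set(hash_id(id, 5381));
  int b = test_and_set(hash_id(id, 33));
  return a && b;
}

int main(void) {
  const char *ids[] = { "_501D27", "_501D28", "_501D28" };
  int i;
  bitmap = calloc(BITMAP_BITS / 8, 1);
  if (!bitmap)
    return 1;
  for (i = 0; i < 3; i++)
    if (probably_seen(ids[i]))
      fprintf(stderr, "rdf:ID value '%s' is probably a duplicate\n", ids[i]);
  free(bitmap);
  return 0;
}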