Last week UniProt in RDF format was
announced by Eric Jain. This includes a 700MB gzipped RDF/XML file in the data area; that's 8,369,854,785 bytes of RDF/XML uncompressed. So, of course, I had to throw my Raptor parser at it to see if it'd survive.
Round 1, it died with this error:
$ gunzip < uniprot.rdf.gz |rapper -c - http://example.org
rapper: Error - URI http://example.org:41640376 - Duplicated rdf:ID value '_501D28'
So a duplicate ID, at around line 41.6 million; I wonder if they know about that.
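For anyone who hasn't hit this error before: rdf:ID values have to be unique within a document. A made-up minimal fragment (not from the UniProt data) that would trigger the same complaint:

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <!-- first use of the ID is fine -->
  <rdf:Description rdf:ID="_501D28"/>
  <!-- second use of the same rdf:ID in the same document is an error -->
  <rdf:Description rdf:ID="_501D28"/>
</rdf:RDF>
```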
Round 2, disable errors:
$ gunzip < uniprot.rdf.gz | rapper -c --ignore-errors - http://example.org
rapper: Parsing returned 134091199 statements
This took about 26 minutes on my two-year-old desktop PC to count the 134M triples, which works out to around 86,000 triples/second. The PC's CPU was busy with both the ID checking and gunzip. Parsing the raw, uncompressed file might not work, since Raptor uses standard C I/O rather than large-file I/O, so it could not seek through the whole 8GB file.
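The throughput figure is just the statement count over the wall-clock time; a quick sanity check of the arithmetic:

```python
# Sanity-check the triples/second figure from the rapper run.
statements = 134_091_199   # statements reported by rapper
minutes = 26               # approximate wall-clock time

rate = statements / (minutes * 60)
print(round(rate))         # roughly 86,000 triples/second
```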
rapper was also taking a huge amount of memory, I suspect for the
rdf:ID
duplicate-value checking, which hasn't previously had to deal with
data of this size. While I was waiting I thought of a few ways to
optimise it.
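One obvious direction (my speculation, not necessarily what Raptor does or will do) is to store fixed-size hashes of the ID strings rather than the strings themselves, trading a tiny collision risk for bounded per-entry memory. A minimal sketch in Python:

```python
import hashlib

def make_id_checker():
    """Track seen rdf:ID values by fixed-size digest instead of the
    full string, so memory per ID is constant (20 bytes of digest
    plus set overhead) regardless of ID length."""
    seen = set()

    def check(rdf_id):
        # SHA-1 digest as a compact stand-in for the ID string;
        # a digest collision would cause a false duplicate report.
        digest = hashlib.sha1(rdf_id.encode("utf-8")).digest()
        if digest in seen:
            return True   # duplicate rdf:ID
        seen.add(digest)
        return False

    return check

check = make_id_checker()
print(check("_501D28"))  # False: first occurrence
print(check("_501D28"))  # True: duplicate, as rapper reported
```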