Dave Beckett's blog

Making Debian Docker Images Smaller

2015-04-18 14:00


  1. Use one RUN to prepare, configure, make, install and cleanup.
  2. Cleanup with: apt-get remove --purge -y $BUILD_PACKAGES $(apt-mark showauto) && rm -rf /var/lib/apt/lists/*

I've been packaging the nghttp2 HTTP/2.0 proxy and client by Tatsuhiro Tsujikawa in both Debian and with docker and noticed it takes some time to get the build dependencies (C++ cough) as well as to do the build.

In the Debian packaging case its easy to create minimal dependencies thanks to pbuilder and ensure the binary package contains only the right files. See debian nghttp2

For docker, since you work with containers it's harder to see what changed, but you still really want the containers as small as possible since you have to download them to run the app, as well as the disk use. While doing this I kept seeing huge images (480 MB), way larger than the base image I was using (123 MB) and it didn't make sense since I was just packaging a few binaries with some small files, plus their dependencies. My estimate was that it should be way less than 100 MB delta.

I poured over multiple blog posts about Docker images and how to make them small. I even looked at some of the squashing commands like docker-squash that involved import and export, but those seemed not quite the right approach.

It took me a while to really understand that each Dockerfile command creates a new container with the deltas. So when you see all those downloaded layers in a docker pull of an image, it sometimes is a lot of data which is mostly unused.

So if you want to make it small, you need to make each Dockerfile command touch the smallest amount of files and use a standard image, so most people do not have to download your custom l33t base.

It doesn't matter if you rm -rf the files in a later command; they continue exist in some intermediate layer container.

So: prepare configure, build, make install and cleanup in one RUN command if you can. If the lines get too long, put the steps in separate scripts and call them.

Lots of Docker images are based on Debian images because they are a small and practical base. The debian:jessie image is smaller than the Ubuntu (and CentOS) images. I haven't checked out the fancy 'cloud' images too much: Ubuntu Cloud Images, Snappy Ubuntu Core, Project Atomic, ...

In a Dockerfile building from some downloaded package, you generally need wget or curl and maybe git. When you install, for example curl and ca-certificates to get TLS/SSL certificates, it pulls in a lot of extra packages, such as openssl in the standard Debian curl build.

You are pretty unlikely to need curl or git after the build stage of your package. So if you don't need them, you could - and you should - remove them, but that's one of the tricky parts.

If $BUILD_PACKAGES contains the list of build dependency packages such as e.g. libxml2-dev and so on, you would think that this would get you back to the start state:

$ apt-get install -y $BUILD_PACKAGES
$ apt-get remove -y $BUILD_PACKAGES

However this isn't enough; you missed out those dependencies that got automatically installed and their dependencies.

You could try

$ apt-get autoremove -y

but that also doesn't grab them all. It's not clear why to me at this point. What you actually need is to remove all autoadded packages, which you can find with apt-mark showauto

So what you really need is

$ AUTO_ADDED_PACKAGES=`apt-mark showauto`
$ apt-get remove --purge -y $BUILD_PACKAGES $AUTO_ADDED_PACKAGES

I added --purge too since we don't need any config files in /etc for build packages we aren't using.

Having done that, you might have removed some runtime package dependencies of something you built. That's harder to automatically find, so you'll have to list and install those by hand

$ apt-get install -y $RUNTIME_PACKAGES

Finally you need to cleanup apt which you should do with rm -rf /var/lib/apt/lists/* which is great and removes all the index files that apt-get update installed. This is in many best practice documents and example Dockerfiles.

You could add apt-get clean which removes any cached downloaded packages, but that's not needed in the official Docker images of debian and ubuntu since the cached package archive feature is disabled.

Finally don't forget to delete your build tree and do it in the same RUN that you did a compile, so the tree never creates a new container. This might not make sense for some languages where you work from inside the extracted tree; but why not delete the src dirs? Definitely delete the tarball!

This is the delta for what I was working on with dajobe/nghttpx.

479.7 MB  separate prepare, build, cleanup 3x RUNs
186.8 MB  prepare, build and cleanup in one RUN
149.8 MB  after using apt-mark showauto in cleanup

You can use docker history IMAGE to see the detailed horror (edited for width):

...    /bin/sh -c /build/cleanup-nghttp2.sh && rm -r   7.595 MB
...    /bin/sh -c cd /build/nghttp2 && make install    76.92 MB
...    /bin/sh -c /build/prepare-nghttp2.sh            272.4 MB

and the smallest version:

...    /bin/sh -c /build/prepare-nghttp2.sh &&         27.05 MB

The massive difference is the source tree and the 232 MB of build dependencies that apt-get pulls in. If you don't clean all that up before the end of the RUN you end up with a huge transient layer.

The final size of 149.8 MB compared to the 122.8 MB debian/jessie base image size is a delta of 27 MB which for a few servers, a client and their libraries sounds great! I probably could get it down a little more if I just installed the binaries. The runtime libraries I use are 5.9 MB.

You can see my work at github and in the Docker Hub

... and of course this HTTP/2 setup is used on this blog!



Open to Rackspace

2013-06-24 21:34

I'm happy to announce that today I started work at Rackspace in San Francisco in a senior engineering role. I am excited anticipating these aspects:

  1. The company: a fun, fast moving organization with a culture of innovation and openness
  2. The people: lots of smart engineers and devops to work with
  3. The technologies: Openstack cloud, Hadoop big data and lots more
  4. The place: San Francisco technology world and nearby Silicon Valley

My personal technology interests and coding projects such as Planet RDF, the Redland librdf libraries and Flickcurl will continue in my own time.



2012-05-10 08:20

Digg just announced that Digg Engineering Team Joins SocialCode and The Washington Post reported SocialCode hires 15 employees from Digg.com

This acquihire does NOT include me. I will be changing jobs shortly but have nothing further to announce at this time.

I wish my former Digg colleagues the best of luck in their new roles. I had a great time at Digg and learned a lot about working in a small company, social news, analytics, public APIs and the technology stack there.


Releases = Tweets

2011-08-15 12:25

I got tired of posting release announcements to my blog so I just emailed the announcements to the redland-dev list, tweeted a link to it from @dajobe and announced it on Freshmeat which a lot of places still pick up..

Here are the tweets for the 13 releases I didn't blog since the start of 2011:

  • 3 Jan: Released Raptor RDF syntax library 2.0.0 at http://librdf.org/raptor/ only 10 years in the making :)
  • 12 Jan: Released Rasqal RDF Query Library 0.9.22: Raptor 2 only, ABI/API break, 16 new SPARQL Query 1.1 builtins and more http://bit.ly/fzb9xW #rdf
  • 27 Jan: Rasqal 0.9.23 RDF query library released with SPARQL update query structure fixes (for @theno23 and 4store ): http://bit.ly/gVDp57
  • 1 Feb: Released Redland librdf 1.0.13 C RDF API and Triplestores with Raptor 2 support + more http://bit.ly/hOr4HA
  • 22 Feb: Released Rasqal RDF Query Library 0.9.25 with many SPARQL 1.1 new things and fixes. RAND() and BIND() away! http://bit.ly/flFDH1
  • 20 Mar: Raptor RDF Syntax Library 2.0.1 released with minor fixes for N-Quads serialializer and internal librdfa parser http://bit.ly/fT3aPX
  • 26 Mar: Released my Flickcurl C API to Flickr 1.21 with some bug fixes and Raptor V2 support (optional) See http://bit.ly/f7QncO
  • 1 Jun: Released Raptor 2.0.3 RDF syntax library: a minor release adding raptor2.h header, Turtle / TRiG and ohter fixes. http://bit.ly/jHKaB8
  • 27 Jun: Rasqal RDF query library 0.9.26 released with better UNION execution, SPARQL 1.1 MD5, SHA* digests and more http://bit.ly/lI7lDW
  • 23 Jul: Released Redland librdf RDF API / triplestore C library 1.0.14: core code cleanups, bug fixes and a few new APIs. http://bit.ly/qqV1Rb
  • 25 Jul: Raptor RDF Syntax C library 2.0.4 released with YAJL V2, and latest curl support, SSL client certs, bug fixes and more http://bit.ly/oCIIDd

(yes 13; I didn't tweet 2 of them: Rasqal 0.9.24 and Raptor 2.0.2)

You know it's quite tricky to collapse months of changelogs (GIT history) into release notes, compress it further into a news summary of a few lines and even harder to compress that into less than 140 characters. It is way less if you include room for a link url and space for retweeting and sometimes need a hashtag for context.

So how do you measure a release? Let's try!


Released tarball files from the Redland download site.

date package old
tarball size
tarball size
byte diff
2011-01-03 raptor 1.4.21 2.0.0 1,651,843 1,635,566 -16,277 -0.99%
2011-01-12 rasqal 0.9.21 0.9.22 1,356,923 1,398,581 +41,658 3.07%
2011-01-27 rasqal 0.9.22 0.9.23 1,398,581 1,404,087 +5,506 0.39%
2011-01-30 rasqal 0.9.23 0.9.24 1,404,087 1,412,165 +8,078 0.58%
2011-02-01 redland 1.0.12 1.0.13 1,552,241 1,554,764 +2,523 0.16%
2011-02-22 rasqal 0.9.24 0.9.25 1,412,165 1,429,683 +17,518 1.24%
2011-03-20 raptor 2.0.0 2.0.1 1,635,566 1,637,928 +2,362 0.14%
2011-03-20 raptor 2.0.1 2.0.2 1,637,928 1,633,744 -4,184 -0.26%
2011-03-26 flickcurl 1.20 1.21 1,775,246 1,775,999 +753 0.04%
2011-06-01 raptor 2.0.2 2.0.3 1,633,744 1,652,679 +18,935 1.16%
2011-06-27 rasqal 0.9.25 0.9.26 1,429,683 1,451,819 +22,136 1.55%
2011-07-23 raptor 2.0.3 2.0.4 1,652,679 1,660,320 +7,641 0.46%
2011-07-25 redland 1.0.13 1.0.14 1,554,764 1,581,695 +26,931 1.73%
Barchart of %diffs between tarball releases.  Noticeable differences are raptor 2.0.0 with a big negative change and rasqal 0.9.22 with largest increase.
Click image to embiggen

Releases that stand out here are Raptor 2.0.0 which was a major release with lots of changes and Rasqal 0.9.21; that changed a lot upwards and it was both an API break as well as lots of new functionality.


Taken from my GitHub repositories extracting the tagged releases, excluding ChangeLog* files, and running diffstat over the output of a recursive diff -uRN.

date package old
2011-01-03 raptor 1.4.21 2.0.0 215 34,018 30,348 64,366
2011-01-12 rasqal 0.9.21 0.9.22 94 11,641 5,712 17,353
2011-01-27 rasqal 0.9.22 0.9.23 25 5,663 5,199 10,862
2011-01-30 rasqal 0.9.23 0.9.24 48 1,107 227 1,334
2011-02-01 redland 1.0.12 1.0.13 96 3,721 5,627 9,348
2011-02-22 rasqal 0.9.24 0.9.25 64 3,857 1,333 5,190
2011-03-20 raptor 2.0.0 2.0.1 42 6,163 5,833 11,996
2011-03-20 raptor 2.0.1 2.0.2 9 55 12 67
2011-03-26 flickcurl 1.20 1.21 19 737 308 1,045
2011-06-01 raptor 2.0.2 2.0.3 88 2,827 2,232 5,059
2011-06-27 rasqal 0.9.25 0.9.26 116 7,130 4,272 11,402
2011-07-23 raptor 2.0.3 2.0.4 33 808 103 911
2011-07-25 redland 1.0.13 1.0.14 75 3,681 5,477 9,158
Total 924 81,408 66,683 148,091
Barchart of number of source lines changed (insertions + deletions) in each release.  Raptor 2.0.0 stands out as much larger than all the others, nearly combined.
Click image to embiggen

Again Raptor 2.0.0 stands out as changing a huge number of files and lines. Also you can see the mistake that was Raptor 2.0.1 being corrected the same day with Raptor 2.0.2 with a few changes. This didn't seem to get tweeted. However also note that several of the Rasqal releases like 0.9.22 and 0.9.26 changed many files. The 'source lines net' column is the addition of the insert and deletes although some of those lines are the same.


Words from the changelog, the release notes and the news post comparing the number of words in the rendered output.

date package old
to release
word ratio
to news
word ratio
2011-01-03 raptor 1.4.21 2.0.0 15,411 2,709 5.69 365 42.22
2011-01-12 rasqal 0.9.21 0.9.22 3,465 1,199 2.89 162 21.39
2011-01-27 rasqal 0.9.22 0.9.23 318 135 2.36 52 6.12
2011-01-30 rasqal 0.9.23 0.9.24 450 254 1.77 59 7.63
2011-02-01 redland 1.0.12 1.0.13 778 235 3.31 73 10.66
2011-02-22 rasqal 0.9.24 0.9.25 1,649 558 2.96 136 12.13
2011-03-20 raptor 2.0.0 2.0.1 247 76 3.25 50 4.94
2011-03-20 raptor 2.0.1 2.0.2 42 27 1.56 42 1.00
2011-03-26 flickcurl 1.20 1.21 119 - - 68 -
2011-06-01 raptor 2.0.2 2.0.3 872 266 3.28 28 31.14
2011-06-27 rasqal 0.9.25 0.9.26 4,410 970 4.55 96 45.94
2011-07-23 raptor 2.0.3 2.0.4 517 345 1.50 77 6.71
2011-07-25 redland 1.0.13 1.0.14 1,347 620 2.17 88 15.31
Total 29,625 7,394 1,296
Bar chart of the ratio of the number of words in the changelog to those in the release news.  Rasqal 0.9.26 and Raptor 2.0.0 are the argest but the tiny Raptor 2.0.2 bug fix is also notable.
Click image to embiggen

So now we get to words. Yes, lots of words, most of them by me. Starting with the changelog which is a hand edited version of the SVN and later GIT changes was over 15K words for Raptor 2.0.0. And that gets boiled down lots into release notes, news and then a terse tweet. Since the changelog corresponds roughly to source changes but the news to user visible changes like APIs, you can see that the oddities are again Rasqal 0.9.26 where there were lots of changes but not so much news; it was mostly internal work.

Now I need to go summarise this blog post in a tweet: Releases = Tweets in 1156 words http://bit.ly/n88ZIQ


Rasqal 0.9.21 and SPARQL 1.1 Query aggregation

2010-12-05 17:24

Rasqal 0.9.21 was just released on Saturday 2010-12-04 (announcement) containing the following new features:

  • Updated to handle aggregate expression execution as defined by the SPARQL 1.1 Query W3C working draft of 14 October 2010
  • Executes grouping of results: GROUP BY
  • Executes aggregate expressions: AVG, COUNT, GROUP_CONCAT, MAX, MIN, SAMPLE, SUM
  • Executes filtering of aggregate expressions: HAVING
  • Parses new syntax: BINDINGS, isNUMERIC(), MINUS, sub SELECT and SERVICE.
  • The syntax format for parsing data graphs at URIs can be explictly declared.
  • The roqet utility can execute queries over SPARQL HTTP Protocol and operate over data from stdin.
  • Added several new APIs
  • Fixed Issue: #0000388

See the Rasqal 0.9.21 Release Notes for the full details of the changes.

I'd like to emphasis a couple of the changes to the roqet(1) utility program: you can now use it to query over data from standard input, i.e. use it as a filter, but only if you are querying over one graph. You can also specify the format of the data graphs either on standard input or from URIs, if the format can't be determined or guessed from the mime type, name or bytes. Finally roqet(1) can execute remote queries at a SPARQL Protocol HTTP service, sometimes called a "SPARQL endpoint".

The new support for SPARQL Query 1.1 aggregate queries (and other features) led me to make comments to the SPARQL working group about the latest SPARQL Query 1.1 working draft based on the implementation experience. The comments (below) were based on the implementation I previously outlined in Writing an RDF query engine. Twice at the end of October 2010.

The implementation work to create the new features was substantial but relatively simple to describe: new rowsources were added for each of the aggregation operations and then included in the execution plan when the query structure indicated their presence after parsing. There was some additional glue code that needed to be add to allow rows to indicate their presence in a group; a simple integer group ID was sufficient and the ID value has no semantics, just a check for a change of ID is enough to know a group started or ended.

I also introduced an internal variable to bind the result of SELECT aggregate expressions after grouping ($$aggID$$ which are illegal names in sparql). I then used that to replace the aggregate expression in the SELECT and the HAVING expressions and used the normal evaluation mechanisms. As I understand it, the SPARQL WG is now considering adding a way to name these explicitly in the GROUP statement. A happy coincidence since I had implemented it without knowing that.

To prepare this I did think about the approach a lot and developed a couple of diagrams for the grouping and aggregation rowsources that might help to understand how they work, how they can be implemented and tested as standalone unit modules, which they were.

Rasqal Group By Rowsource
Diagram showing input and outputs of group by rowsource in Rasqal.

As always, the above isn't quite how it is implemented. There is no group by node if there is an implicit group when GROUP BY is missing but an aggregate expression is used; instead the Rasqal rowsource class synthesizes 1 group around the incoming row, when grouping is requested.

Rasqal Aggregation Rowsource
Diagram showing inputs and outputs of Rasqal aggregation rowsource.

This shows the guts of the aggregate expression evaluation where internal variables are introduced, substituted into the SELECT and HAVING expressions and then created as the aggregate expressions are executed over the groups.

The rest of this post are my detailed thoughts on this draft of SPARQL 1.1 Query as posted to the SPARQL WG comments list but turned into HTML markup here.

Dave Beckett comments on SPARQL 1.1 Query Language W3C WD 2010-10-14

These are my personal comments (not speaking for any past or current employer) on: SPARQL 1.1 Query Language W3C Working Draft 14 October 2010

My comments are based on the work I did to add some SPARQL 1.1 query and update support to my Rasqal RDF query library (engine and API) in version 0.9.21 just released 2010-12-04 as


Some background to my work is given in a blog post: Writing an RDF query engine. Twice

I. General comments

I felt the specification introduced more optional features bundled together, where it was not entirely clear what the combination of those features would do. For example a query with no aggregate expression but has a GROUP BY and HAVING is allowed by the syntax and the main document doesn't say if it's allowed or what it means.

I found it hard to assemble all the pieces from the mathematical explanations into something I could code.

The spec has several terms in the grammar not in the query document. After asking, these turned out to be federated query (BINDINGS), or update (LOAD, ...) but these are not pointed out or linked to clearly although there is mention of the documents in the status section. Please make these more clear.

I decided to concentrate on the new Aggregates feature since I had already implemented SELECT expressions, leaving Subqueries and Negation to later. Property paths should be in the list of new features in the status section at the of the document.

"SPARQL 1.1 Uniform HTTP Protocol for Managing RDF Graphs" is rather a long title; what does 'Uniform' or 'HTTP' add? SOAP is dead. suggest "SPARQL 1.1 RDF Graph Management Protocol" or RDF dataset

With all the additions especially property paths (a new query language), update (data management language) and federated query (remote query support) and I understand \~30 additional keywords are being added beyond this draft for functions and operators, I see this as a major change to SPARQL 1.0, more of a SPARQL 2. You should consider renaming it.

II. Aggregates

I found the math in the aggregation and grouping sections rather hard to understand so I also looked what MySQL and SQLite did, and wrote my own diagram based on the data flow:

Guessed dataflow of SPARQL 1.1 query engine in executing a query
See SPARQL 1.1 Query Execution Sequence

so for me it was easier to see the individual components/stages (which roughly correspond to SPARQL algebra terms).

I had to make several of my own tests with my guess on what the answers should be. With all the pieces for aggregate expressions: grouping, aggregate expression, distinct, having, counting (count * vs count(expr)) there needs to be several tests with good coverage.

I felt aggregate functions can be broken down into these parts

  1. selecting of aggregate function value
  2. grouping of results - optional; explicit, implicit when agg func present
  3. execution of aggregate functions - optional; with some special cases
  4. filtering of group results with having - optional

(following my diagram above)

As it is clear they are all optional, it probably is worth explaining what it means when they are absent, such as group by + having with no aggregate expression as mentioned above.

III. Bindings is a new syntax

BINDINGS essentially gives a new way to write down a variable bindings result set. Even though it is discussed in the federated query spec about using it for SERVICE, it's not restricted to that by the grammar or specifications.

BINDINGS in the query grammar (rBindingsClause) I previously asked about on 2010-10-15 and this comment is an extension of that comment.

So as I read it this is a valid 'query' which does no real execution but just returns a set.

  ( "var1-value1"  "var2-value1" )
  ( "var1-value2"  "var2-value2" )

or if you really must you can leave out the WHERE:

  ( "var1-value1"  "var2-value1" )
  ( "var1-value2"  "var2-value2" )

My question is to ask if this is correct and to clarify in the spec the intended use, whether or not it is intended for use with SERVICE only.

IV. Section-by-section comments

Section: Status of this Document

Should mention property paths as new since that is a major addition after SPARQL 1.0

Please link to the documents in the status, these are just text.

Sections 1-8

Skipped, they are same as SPARQL 1.0 I hope

9 Property Paths

I am unlikely to ever implement any of this, it's a second query language inside SPARQL. How many systems implemented this before the SPARQL 1.1 work was started?

10 Aggregates

I took all the examples in this section and turned them into test cases where possible.


The explanation of errors and ListEvalE is rather opaque. It is still not clear to me what is done with errors in GROUP BY, HAVING and arguments to aggregate expressions. Some are skipped, some are ignored and return NULL. Examples and tests will enable checking this but the spec needs to be clearer.

Definition: Group and Aggregation were hard for me to understand. The input to Aggregation being a 'scalar' meaning actually a set of key:value pairs was confusing. It is not also not clear if those are a set or an ordered set of parameters. This is only used today for the 'separator' with GROUP_CONCAT.

10.2.1 HAVING

What happens when there is an expression error?

What variables and expressions can be used here and what is their scope?

10.2.2 Set Functions

Another confusing section. I mostly ignored this and did what SQL did.

None of the functions that I can tell, ever use 'err'.

10.2.3 Mapping from Abstract Syntax to Algebra

scalarvals argument is used here - I think this is called 'scalar' earlier.

Un-numbered Section after 10.2.3: Joining Aggregate Values

Never figured out what this was trying to define but my code executes the example.

11. Subqueries

(Ignored in my current work)

12 RDF Dataset

(Same as SPARQL 1.0 I assume so no comments)

13 Basic Federated Query

Yes, please merge in the text here.

14 Solution Sequences and Modifiers

( Aside: This is one of those SPARQL parts where everything mentioned is optional. Otherwise this section has no change from SPARQL 1.0, I am just mentioning it as a pointer of a trend. )

15. Query Forms

No comments.

16. Testing Values

16.3 Operator Mapping

Is it worth noting the new operators in SPARQL 1.1?

Operators: implemented isNUMERIC()

16.4 Operators Definitions

My current state of implementation of new to SPARQL 1.1 expressions

  • 16.4.16 IF - implemented
  • 16.4.17 IN - implemented
  • 16.4.18 NOT IN - implemented
  • 16.4.19 IRI - implemented
  • 16.4.20 URI - implemented
  • 16.4.21 BNODE - implemented
  • 16.4.22 STRDT - implemented
  • 16.4.23 STRLANG - implemented

No comments on the above


I am lumping these together with sub-SELECT to implement. My concern here is that the syntax gets super-complex since all the graph pattern syntax can now appear inside any expression syntax.

There is a filter operator "exists" that ...

Does this imply these can only appear in FILTER expressions? Please clarify.

17 Definition of SPARQL

I looked at the 17.2.3 for aggregate queries and it was more helpful than the math earlier. The pseudo code in Step 4 is a bit too unclear. Is that an example implementation or the required one?

17.6 Extending SPARQL Basic Graph Matching


18 SPARQL Grammar

Clearly this is not complete; there are lots of notes to update it.

19 Conformance

If property paths are not removed, please add a conformance level that includes SPARQL 1.1 without property paths.

Does SPARQL 1.1 Query require implementation of the dependent specs - federated query and update? Looks to me that protocol may also be dependent?


End of life of Raptor V1. End of support for Raptor V1 in Rasqal and librdf

2010-11-07 14:49

Raptor V1 was last released in January 2010 and Raptor V2 seems pretty stable and working. I am therefore announcing that from early 2011, Raptor V2 will replace Raptor V1 and be a requirement for Rasqal and librdf.

End of life timeline


  1. Raptor V1 last release remains 1.4.21 of 2010-01-30
  2. Raptor V2 release 2.0.0 will happen "soon".
  3. The next Rasqal release will support Raptor V1 and Raptor V2.
  4. The next librdf release will support Raptor V1 and Raptor V2 (and requires Rasqal built with the same Raptor version).


  1. The next Rasqal release will support Raptor V2 only.
  2. The next librdf release will support Raptor V2 only (and require a Rasqal built with Raptor V2).

In the style of open source I've been using for the Redland libraries, which might be described as "release when it's ready, not release by date", these dates may slip a little but the intention is that Raptor V2 becomes the mainline.

I do NOT rule out that there will be another Raptor V1 release but it will be ONLY for security issues. It will contain minimal changes and not add any new features or fix any other type of bug.

Developer Actions

If you use the Raptor V1 ABI/API directly, you will need to upgrade. If you want to write conditional code, that's possible. The redland librdf GIT source (or 1.0.12) uses the approach of macros that rewrite V2 into V1 forms and I recommend this way since dropping Raptor V1 support then amounts to removing the macros.

The Raptor V2 API documentation has a detailed section on the changes and there is also an upgrading document plus it points to a perl script docs/upgrade-script.pl (also in the Raptor V2 distribution) that automates some of the work (renames mostly) and leaves markers where a human has to fix.

The Raptor V1 API documentation will remain in a frozen state available at http://librdf.org/raptor/api-1.4/

Packager Actions

If you are a packager of the redland libraries, you need to prepare for the Raptor V1 / Raptor V2 transition which can vary depending on your distribution's standards. The two versions share two files: the rapper binary and the rapper.1 man page. I do not want to rename them to rapper2 etc. since rapper is a well known utility name in RDF and I want 'rapper' to provide the latest version.

In the Debian packaging which I maintain, these are already planned to be in separate packages so that both libraries can be installed and you can choose the raptor-utils2 package over raptor-utils (V1).

In other distributions where everything is in one package (BSD Ports for example) you may have to move the rapper/rapper.1 files to the raptor V2 package and create a new raptor1 package without them. i.e. something like this

Raptor V1 package 1.4.21-X:

/usr/lib/libraptor1.so.1* ...

(no /usr/bin/rapper or /usr/share/man/man1/rapper.1 )

Raptor V2 package 2.0.0-1:

/usr/lib/libraptor2.so.0* ...



conflicts with older raptor1 packages before 1.4.21-X

The other thing to deal with is that when Rasqal is built against Raptor V2, it has internal change that mean librdf has to also be built against rasqal-with-raptor2. This needs enforcing with packaging dependencies.

This packaging work can be done/started as soon as Raptor V2 2.0.0 is released which will be "soon".


Writing an RDF query engine. Twice

2010-10-24 15:00

"You'll end up writing a database" said Dan Brickley prophetically in early 2000. He was of course, correct. What started as an RDF/XML parser and a BerkeleyDB-based triple store and API, ended up as a much more complex system that I named Redland with the librdf API. It does indeed have persistence, transactions (when using a relational database) and querying. However, RDF query is not quite the same thing as SQL since the data model is schemaless and graph centric so when RDQL and later SPARQL came along, Redland gained a query engine component in 2003 named Rasqal: the RDF Syntax and Query Library for Redland. I still consider it not a 1.0 library after over 7 years of work.

Query Engine The First

The first query engine was written to execute RDQL which today looks like a relatively simple query language. There is one type of SELECT query returning sequences of sets of variable bindings in a tabular result like SQL. The query is a fixed pattern and doesn't allow any optional, union or conditional pattern matching. This was relatively easy to implement in what I've called a static execution model:

  1. Break the query up into a sequence of triple patterns: triples that can include variables in any position which will be found by matching against triples. A triple pattern returns a sequence of sets of variable bindings.
  2. Match each of the triple patterns in order, top to bottom, to bind the variables.
  3. If there is a query condition like ?var > 10 then check that it evaluates true.
  4. Return the result.
  5. Repeat at step #2.

The only state that needed saving was where in the sequence of triple patterns that the execution had got to - pretty much an integer, so that the looping could continue. When a particular triple pattern was exhausted it was reset, the previous one incremented and the execution continued.

This worked well and executes all of RDQL no problem. In particular it was a lazy execution model - it only did work when the application asked for an additional result. However, in 2004 RDF query standardisation started and the language grew.

Enter The Sparkle

The new standard RDF query language which was named SPARQL had many additions to the static patterns of the RDQL model, in particular it added OPTIONAL which allowed optionally (sic) matching an inner set of triple patterns (a graph pattern) and binding more variables. This is useful in querying heterogeneous data when there are sometimes useful bits of data that can be returned but not every graph has it.

This meant that the engine had to be able to match multiple graph patterns - the outer one and any inner optional graph pattern - as well as be able to reset execution of graph patterns, when optionals were retried. Optionals could also be nested to an arbitrary depth.

This combination meant that the state that had to be preserved for getting the next result became a lot more complex than an integer. Query engine #1 was updated to handle 1 level of nesting and a combination of outer fixed graph pattern plus one optional graph pattern. This mostly worked but it was clear that having the entire query have a fixed state model was not going to work when the query was getting more complex and dynamic. So query engine #1 could not handle the full SPARQL Optional model and would never implement Union which required more state tracking.

This meant that Query Engine #1 (QE1) needed replacing.

Query Engine The Second

The first step was a lot of refactoring. In QE1 there was a lot of shared state that needed pulling apart: the query itself (graph patterns, conditions, the result of the parse tree), the engine that executed it and the query result (sequence of rows of variable bindings). That needed pulling apart so that the query engine could be changed independent of the query or results.

Rasqal 0.9.15 at the end of 2007 was the first release with the start of the refactoring. During the work for that release it also became clear that an API and ABI break was necessary as well to introduce a Rasqal world object, to enable proper resource tracking - a lesson hard learnt. This was introduced in 0.9.16.

There were plenty of other changes to Rasqal going on outside the query engine model such as supporting reading and writing result formats, providing result ordering and distincting, completing the value expression and datatype handling data and general resilience fixes.

The goals of the refactoring were to produce a new query engine that was able to execute a more dynamic query, be broken into understandable components even for complex queries, be testable in small pieces and to continue to execute all the queries that QE1 could do. It should also continue to be a lazy-evaluation model where the user could request a single result and the engine should do the minimum work in order to return it.

Row Sources and SPARQL

The new query engine was designed around a new concept: a row source. This is an active object that on request, would return a row of variable bindings. It generates what corresponds to a row in a SQL result. This active object is the key for implementing the lazy evaluation. At the top level of the query execution, there would be basically one call to top_row_source.getRow() which itself calls inner rowsources' getRow() in order to execute the query to return the next result.

Each rowsource would correspond approximately to a SPARQL algebra concept, and since the algebra had a well defined way to turn a query structure into an executable structure, or query plan, the query engine's main role in preparation of the query was to become a SPARQL query algebra implementation. The algebra concepts were added to Rasqal enabling turning the hierarchical graph pattern structure into algebra concepts and performing the optimization and algebra transformations in the specification. These transformations were tested and validated against the examples in the specification. The resulting tree of "top down" algebra structures were then used to build the "bottom up" rowsource tree.

The rowsource concept also allowed breaking up the complete query engine execution into understandable and testable chunks. The rowsources implemented at this time include:

  • Assignment: allowing binding of a new variable from an input rowsource
  • Distinct: apply distinctness to an input rowsource
  • Empty: returns no rows; used in legitimate queries as well as in transformations
  • Filter: evaluates an expression for each row in an input rowsource and passes on those that return True.
  • Graph: matches against a graph URI and/or bind a graph variable
  • Join: (left-)joins two inner rowsources, used for OPTIONAL.
  • Project: projects a subset of input row variables to output row
  • Row Sequence: generates a rowsource from a static set of rows
  • Sort: sort an input rowsource by a list of order expressions
  • Triples: match a triple pattern against a graph and generate a row. This is the fundamental triple pattern or Basic Graph Pattern (BGP) in SPARQL terms.
  • Union: return results from the two input rowsources, in order

The QE1 entry point was refactored to look like getRow() and the query engines were tested against each other. In the end QE2 was identical, and eventually QE2 was improved such that it passed more DAWG SPARQL tests that than QE1.

So in summary QE2 works like this:

  • Parse the query string into a hierarchy of graph patterns such as basic, optional, graph, group, union, filter etc. (This is done in rasqal_query_prepare())
  • Create a SPARQL algebra expression from the graph pattern tree that describes how to evaluate the query. (This is in rasqal_query_execute() calling QE2 )
  • Invert the algebra expression to a hierarchy of rowsources where the top rowsource getRow() call will evaluate the entire query (Ditto)

(If you want to see some of the internals on a real query, run roqet -d debug query.rq from roqet built in maintainer mode and both the query structure and algebra version will be generated.

The big advantage from a maintenance point of view is that it is divided into small understandable components that can be easily added to.

The result was released in Rasqal 0.9.17 at the end of 2009; 15 months after the previous release. It's tempting to say nobody noticed the new query engine except that it did more work. There is no way to use the old query engine except by a configure argument when building it. The QE1 code is never called and should be removed from the sources.

Example execution

When QE2 is called by the command line utility roqet, there is a lot going on inside Rasqal and Raptor. A simplified version of what goes on when a query like this is run:

$ cat example.rq
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?website
FROM   <http://planetrdf.com/bloggers.rdf>
WHERE { ?person foaf:weblog ?website ;
                foaf:name ?name .
        ?website a foaf:Document
ORDER BY ?name

$ roqet example.rq 
roqet: Querying from file example.rq
roqet: Query has a variable bindings result
result: [name=string("AKSW Group - University of Leipzig"), website=uri<http://blog.aksw.org/>]

is described in the following picture if you follow the numbers in order:

Roqet query execution example

This doesn't include details of content negotiation, base URIs, result formatting, or the internals of the query execution described above.


Now it is the end of 2010 and SPARQL 1.1 work is underway to update the original SPARQL Query which was complete in January 2008. It is a substantial new version that adds greatly to the language. In the SPARQL 1.1 2010-10-14 draft version it adds (these items may or may not be in the final version):

  • Assignment with BIND(expr AS ?var)
  • Aggregate expressions such as SUM(), COUNT() including grouping and group filtering with HAVING
  • Negation between graph patterns using MINUS.
  • Property path triple matching.
  • Computed select expressions: SELECT ... (expr AS ?var)
  • Federated queries using SERVICE to make SPARQL HTTP query requests
  • Sub SELECT and BINDINGS to allow queries/results inside queries.
  • Updates allowing insertion, deletion and modification of graphs via queries as well as other graph and dataset management

The above is my reading of the major items in the latest draft SPARQL 1.1 query language or it's dependent required specifications.

Rasqal next steps

So does SPARQL 1.1 mean Rasqal Query Engine 3? Not yet, although the Rasqal API is still changing too much to call it stable and another API/ABI break is possible. There's also the question of making an optimizing query engine, a more substantial activity. At this time, I'm not motivated to implement property paths since it seems like a lot of work and there are other pieces I want to do first. Rasqal in GIT handles most of the syntax and is working towards implementing most of the execution of aggregate expressions, sub selects and SERVICE although no dates yet. I work on Rasqal in my spare time when I feel like it, so maybe it won't be mature with a stable API (which would be a 1.0) until SPARQL 2 rolls by.


Redland librdf 1.0.11 released

2010-09-25 12:44

I have just released Redland librdf library version 1.0.11 which has been in progress for some time, delayed by the large amount of work to get out Raptor V2 as well as initial SPARQL 1.1 draft work for Rasqal 0.9.20.

The main features in this release are as follows:

See the Redland librdf 1.0.11 Release Notes for the full details of the changes.

Note that the Redland language bindings works fine with Redland librdf 1.0.11 but the bindings will soon have a release to match.


Leaving Yahoo - Joining Digg

2010-08-26 11:44

I'm heading to a new adventure at Digg in San Francisco to be a lead software engineer working on APIs and syndication.

I've been at Yahoo! nearly 5 years so it is both a happy and sad time for me, and I wish all the excellent people I worked with the best of luck in future.

Here is a summary of the main changes:

  • Silicon Valley -> San Francisco
  • 15,000 staff -> 100 staff
  • Architect -> Software engineer
  • strategizing, meeting -> coding
  • Powerpoint, OmniGraffle, twiki -> emacs, eclipse, ...?
  • (No coding!) -> Python, Java, Hadoop, Cassandra, ...?
  • Sunny days -> Foggy days
  • 15 min commute -> 2.5hr commute (until I move to SF)
  • Public company -> private company



Rasqal RDF Query Library 0.9.20

2010-08-22 12:33

I just released a new version of my Rasqal RDF Query Library for two main new features:

  1. Support more of the new W3C SPARQL working drafts of 1 June 2010 for SPARQL 1.1 Query and SPARQL 1.1 Update.
  2. Support building with Raptor V2 API as well as Raptor V1 API..

The main change is to start to add to Rasqal's APIs and query engine changes for the new SPARQL 1.1 working drafts. This release adds support the syntax for all the changes for Query and Update. The new draft syntax is available via the 'laqrs' query language name, until the SPARQL 1.1 syntax is finalized. The 'sparql' query language provides SPARQL 1.0 support.

On Query 1.1, the addition is primarily syntax and API support for the new syntax. There is expression execution for the new functions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN() which are noew usable as part of the normal expression grammar. The existing aggregate function support was extended to add the new SAMPLE() and GROUP_CONCAT() but remains syntax-only. Finally the new GROUP BY with HAVING conditions were added to the syntax and had consequent API updates but no query engine execution of them.

For Update 1.1 the full set of update operations syntax were added and they create API structures. Note, however there seem to be some ambiguities in the draft syntax especially around multiple optional tokens in a row near WITH which are particularly hard to implement in flex and bison (aka "lex and yacc").

The main non-SPARQL 1.1 related change is to allow building Rasqal with Raptor V2 APIs rather than V1. Raptor V2 is in beta so this is not a final API and is thus not the default build, it has to be enabled with --enable-raptor2 with configure. When raptor V2 is stable (2.0.0), Rasqal will require it.

The changes to Rasqal in this release, in summary, are:

  • Updated to handle more of the new syntax defined by the SPARQL 1.1 Query and SPARQL 1.1 Update W3C working drafts of 1 June 2010
  • Added execution support for new SPARQL 1.1 query built-in expressions IF(), URI(), STRLANG(), STRDT(), BNODE(), IN() and NOT IN().
  • Added an 'html' query result table format from patch by Nicholas J Humfrey
  • Added API support for group by HAVING expressions.
  • Added XSD Date comparison support.
  • Support building with Raptor V2 API if configured with --with-raptor2.
  • Many other bug fixes and improvements were made.
  • Fixed Issues: #0000352, #0000353, #0000354, #0000360, #0000374, #0000377 and #0000378

See the Rasqal 0.9.20 Release Notes for the full details of the changes.

Get it at http://download.librdf.org/source/rasqal-0.9.20.tar.gz.

PS The source code control has also moved to GIT and hosted at GitHub.