Coding LLMs can write production code, pass test suites, and refactor entire codebases. That part is familiar enough now that it's easy to forget how recent it is.
Where they still struggle is in places that look superficially similar but behave quite differently in practice. Ask one to safely roll out a schema migration on a sharded database under live traffic, and the gap shows up almost immediately.
The usual explanation is that there isn't enough training data on production engineering, which isn't wrong but doesn't quite get at what's going on. What matters more is that the work doesn't have the same kind of feedback loop.
Why code works
When people talk about why coding works well for LLMs, they often point to the volume of public code. That's part of it, but it's mostly a surface answer. The more important property is that code carries its own notion of correctness with it. Given a candidate solution, you can usually do something concrete with it: compile it, run the tests, compare the output against what you expected, and get a clear signal back without needing a person to interpret it. Crucially, you can do that cheaply enough that it's practical to repeat at scale. You can even do it synthetically.
That's what made reinforcement learning from verifiable rewards (RLVR) actually work in practice for coding and maths, and what drove most of the 2025 capability jump. You don't need a person to score a million attempts if the problem can be reduced to something that returns a clear pass-or-fail signal, and once you have that, the model can iterate.
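To make that pass-or-fail loop concrete, here's a minimal sketch of what a verifiable reward for code can look like, assuming a pytest-style harness; the function name and file layout are invented for illustration, not a description of any particular training pipeline.

```python
import os
import subprocess
import tempfile

def code_reward(candidate_source: str, test_source: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the candidate passes its tests, else 0.0."""
    with tempfile.TemporaryDirectory() as workdir:
        with open(os.path.join(workdir, "solution.py"), "w") as f:
            f.write(candidate_source)
        with open(os.path.join(workdir, "test_solution.py"), "w") as f:
            f.write(test_source)
        try:
            # pytest exits 0 only if every test passed; that exit code is the
            # whole reward signal, cheap enough to collect millions of times.
            result = subprocess.run(
                ["pytest", "-q", "test_solution.py"],
                cwd=workdir, capture_output=True, timeout=60,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0
```

No rater, no interpretation: the exit code is the verdict, which is exactly the property production engineering work lacks.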
Andrej Karpathy's framing is the cleanest version of this: the predictive feature for "will AI do this well" used to be specifiability (whether a person can write the algorithm down), and the better predictor now is verifiability (whether the result can be checked or scored automatically). Code happens to satisfy both, as do maths problems and a certain class of logic tasks, which makes the recent progress less surprising once you look at it that way.
The shift: verifiability replaced specifiability as the predictor of which jobs LLMs absorb next.
Production engineering fails the test
Some pieces of production engineering are verifiable, such as SLO maths, error budget arithmetic, and capacity forecasting; I'll come back to those later. The central production engineering work isn't, and it fails the verifiability bar in a structural way.
Correctness is multi-dimensional and time-dependent: availability, latency, data integrity, blast radius, cost, recovery time. The signal arrives slowly, probabilistically, and often only via incidents. The same action can be correct in one process state and catastrophic in another, such as deploying during steady state versus during a partial regional outage. From the outside it looks like the same change, the same Terraform, the same rollout plan, but the outcome depends on the surrounding conditions. And there's no cheap way to explore the space of possible actions, because each attempt depends on a real distributed system, real traffic, and real time.
High-fidelity simulation or digital-twin approaches do work in some adjacent fields, but in production systems the cost of reproducing the traffic mix, dependency state, and partial-failure modes of a real system is enormous, and the simulator's own process model becomes a problem in its own right.
Even with abundant training data, there'd be no automated reward function to make an RLVR feedback loop work. The data scarcity that people usually point to is itself downstream of this: nobody builds large public corpora of "good rollout decisions" because there's no cheap way to label them.
You could try to use RLHF instead and let humans rate the outputs. The trouble is that production engineering decisions don't fit the shape RLHF works well on. A "completion" isn't a paragraph of text the rater can read in seconds; it's a sequence of actions taken over hours against a real system, with outcomes that resolve with significant lag. And the rater pool that can credibly judge whether a particular resharding sequence was the right call given that day's conditions is essentially the on-call engineers themselves, who are simultaneously the most expensive humans in the building and the people whose understanding of the system is necessarily incomplete. You end up with RLHF where the labels are themselves expert judgements under uncertainty, not preferences between two clear options. That can produce some signal, but it likely doesn't scale the way RLHF on text does.
There's no publishing format either
Even setting the verification problem aside, there's a second gap: production engineering doesn't have a standard way to publish itself.
Open source code ships with three co-located artifacts:
- Spec - README, docs, type signatures
- Code - the implementation
- Tests - executable verification
That triple is what fed coding LLMs. Anyone can publish, anyone can consume, the format is universal, and the parts are mechanically linked.
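As a toy illustration of how tightly those three artifacts sit together, here's a single file with an invented function: the docstring plays the spec, the body is the code, and the test is the executable verification. The function itself is made up; the co-location is the point.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Spec: return the rolling mean of `values` over a window of size
    `window`; raise ValueError if the window doesn't fit the input."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    # Code: the implementation the spec describes.
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]

def test_moving_average():
    # Tests: executable verification, mechanically linked to the code above.
    assert moving_average([1.0, 2.0, 3.0, 4.0], 2) == [1.5, 2.5, 3.5]
```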
Production engineering has fragments of the same ideas, but they don't line up the same way:
- IaC (Terraform, Helm) covers part of "the code", i.e. what gets deployed.
- Runbooks cover part of "the spec" but they're prose, mostly internal, and assume context the reader shares.
- Metrics, dashboards, and SLOs are the runtime evidence but they're snapshots of a specific system, not portable artifacts.
- Post-mortems are the closest thing to "tests that failed" but they're sparse, sanitised, and per-incident.
There is no production-engineering-pattern-of-the-week repo. No LeetCode for "safely roll out a sharded schema change." The format doesn't exist, so the corpus doesn't exist, so the training signal doesn't exist.
It's worth acknowledging what does get published. Google's SRE books are out there, Cloudflare, Stripe and others routinely write detailed post-mortems, the chaos engineering literature is reasonably mature, and there's a steady flow of conference talks and newsletters. This isn't to say that there's no public knowledge; there clearly is, and some of it is good. The point is that none of it composes into something an LLM training pipeline can consume the way it consumes code. A post-mortem is prose about a unique incident, not an executable artifact paired with the conditions that produced it. The pieces are there but they don't snap together.
The incentive to publish is also inverted for operations. Open source code gets shared because publishing implementations creates network effects: more users, more contributors, better libraries. Production engineering knowledge is the opposite: rollout playbooks are competitive advantage, incident patterns reveal architectural weaknesses, and runbooks contain information attackers would love. Companies have strong reasons not to publish, and legal review usually takes care of whatever they were willing to share.
So the corpus gap isn't a data-collection problem waiting for someone to solve it. The format and the incentives both work against it.
STPA: why this isn't fixable
System-Theoretic Process Analysis (Leveson) frames safety as a control problem: every system has a control structure, and accidents happen when controllers issue unsafe control actions or fail to issue needed ones. The key concept is the process model: each controller's current understanding of system state determines its actions, and unsafe actions come from models that are incomplete or wrong.
In a production system, that state is never fully captured in any single artifact. A database might be mid-migration while traffic has been shifted away, a feature flag is half-rolled, a downstream dependency is degraded, and the on-call has just paged for something unrelated. Some of that lives in dashboards, some in human decisions made minutes earlier, and some only in the running system's behaviour.
The 2017 AWS S3 outage is a well-known example. An operator ran a playbook command intended to remove a small number of servers from a subsystem used by S3's billing process. One of the inputs was mistyped, so a much larger set of servers was removed than intended, and those servers turned out to support two other S3 subsystems as well, including the index subsystem that holds metadata and location information for every object in the region. Both subsystems had to be fully restarted, which took hours because they hadn't been restarted at that scale in years.
Two layers of process-model gap show up here, and STPA names them both. The operator's model of the command itself didn't catch the typo before submission. And the operator's model of the wider system didn't anticipate that removing those particular servers would cascade into the index and placement subsystems. The artifacts (the playbook, the command) were correct in isolation; the live state of the system those artifacts were acting on wasn't fully captured anywhere the operator could check.
Correctness is a property of the control loop as a whole, not of the configuration or code in front of you. The right action depends on a live, evolving understanding of the system, which is hard to capture in a way that can be checked automatically, and harder still to turn into a reusable training signal. STPA gives a cleaner way to describe what's going on here: the verification function isn't a property of the artifact, it's a property of the loop, and that isn't something you can scrape.
The contrast in practice: for coding, writing a REST endpoint with input validation and a test suite fits neatly into a cycle where the result can be checked quickly and unambiguously. For operations, safely resharding a live distributed database handling sustained writes is different: the implementation matters, but most of the work sits around it, in sequencing, observability, coordination with other systems, and deciding when to proceed or stop. The first task fits in a pytest run; the second plays out over hours and is judged against broader system and business outcomes, primarily by people.
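For the first task, a minimal, framework-free sketch (all names invented) shows how small that loop really is: a validation function and its tests, with correctness settled by a single pytest run.

```python
import pytest

def validate_signup(payload: dict) -> dict:
    """Validate a hypothetical signup body; return cleaned fields or raise."""
    errors = {}
    email = payload.get("email", "")
    if "@" not in email:
        errors["email"] = "must be a valid email address"
    age = payload.get("age")
    if not isinstance(age, int) or age < 13:
        errors["age"] = "must be an integer of at least 13"
    if errors:
        raise ValueError(errors)
    return {"email": email, "age": age}

def test_rejects_bad_email():
    # The whole verification loop: run pytest, read pass or fail.
    with pytest.raises(ValueError):
        validate_signup({"email": "not-an-email", "age": 30})

def test_accepts_valid_payload():
    assert validate_signup({"email": "a@b.com", "age": 30}) == {"email": "a@b.com", "age": 30}
```

There is no equivalent file you could write that settles, in milliseconds, whether a resharding sequence was the right call that day.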
The verifiable slices
The pieces of production engineering that are verifiable are where AI tooling has already started landing, and where it'll keep landing fastest: SLO maths, error budget arithmetic, capacity forecasting, query optimisation, config linting, and IAM policy diffing. These have testable properties or numeric outputs you can check, which means they fit the RLVR shape that drove the 2025 capability jump. They're not the judgement-heavy core of the role, but they are the bounded, tractable domains where automated tooling has a real foothold. Anomaly detection, alert correlation, and AI-assisted RCA features are already shipping inside major observability platforms; that's the shape of "verifiable slice" automation in production.
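Error budget arithmetic is the clearest case. The helper names below are mine, but the arithmetic is the standard SLO calculation: a 99.9% availability SLO over 30 days allows roughly 43 minutes of downtime, and that's a number a machine can check.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for a given availability SLO over a window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if blown)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime; 20 minutes of
# incidents so far leaves a bit over half the budget.
print(error_budget_minutes(0.999))             # 43.2
print(round(budget_remaining(0.999, 20), 3))   # 0.537
```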
Two caveats matter. The first is that these are the practices of fairly sophisticated production engineering organisations. They're not where teams start, and they're not the bulk of what production engineering work looks like day-to-day. The second is that they depend on the basics being in place: working telemetry across metrics, logs, and traces, and joinable identifiers that let you correlate them. Without that foundation, "verifiable" is just numbers sitting next to each other.
Canary analysis and progressive rollouts are an interesting middle case. They're the industry's existing partial answer to the verification problem in large systems. You take a fundamentally non-verifiable action (a deploy) and approximate verifiability through controlled exposure and statistical comparison. These don't close the gap, but they show what manufactured verifiability looks like in production, and they're the closest thing to an automated reward function the field has built so far.
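A minimal sketch of the statistical comparison underneath that, assuming a single error-rate metric and a two-proportion z-test; real canary analysis compares many metrics with much more care, and the threshold here is an arbitrary placeholder.

```python
from math import sqrt

def canary_verdict(baseline_errors: int, baseline_requests: int,
                   canary_errors: int, canary_requests: int,
                   z_threshold: float = 2.0) -> str:
    """Two-proportion z-test on error rates: 'fail' if the canary looks
    significantly worse than the baseline, else 'pass'. The single metric
    and fixed threshold are simplifying assumptions."""
    p_base = baseline_errors / baseline_requests
    p_canary = canary_errors / canary_requests
    pooled = (baseline_errors + canary_errors) / (baseline_requests + canary_requests)
    stderr = sqrt(pooled * (1 - pooled) * (1 / baseline_requests + 1 / canary_requests))
    if stderr == 0:
        return "pass"
    z = (p_canary - p_base) / stderr
    return "fail" if z > z_threshold else "pass"

# 0.2% baseline error rate vs 0.5% on the canary slice: roll back.
print(canary_verdict(200, 100_000, 50, 10_000))  # fail
```

The deploy itself is still unverifiable; what's been manufactured is a proxy signal cheap enough to act on automatically.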
So when AI tooling shows up usefully in production environments, it shows up here: in the bounded, instrumented, statistically-tractable domains. Not in end-to-end autonomous incident response, which is where the demos always go but where the verifiability gap bites hardest.
One thing that is changing the picture, at least internally to organisations that capture it, is operational context. Incident calls, chat logs, and on-call coordination are increasingly recorded, which creates new internal sources of training signal that weren't really available before. That doesn't change the external picture much (the publishing-format gap remains) but it does mean the gap between organisations that have the data and those that don't is widening, not narrowing. There's a separate question about who actually builds production-engineering AI under these constraints, what kinds of provider end up dominating, and what that means if you're trying to adopt any of it. I'll come back to that in a follow-up post.
Thoughts
Coding has been a good fit for current AI approaches because it is verifiable, widely published, and structured in a way that links intent to outcome. Production engineering doesn't align with those properties in the same way, and the gap isn't really about volume of training material; it's that the work depends on a kind of system-level judgement built up over time through operating real systems, and that isn't something easily reduced to a corpus or a reward function.
The structural mismatch doesn't just limit what AI can do in production engineering, it also shapes who can build the tools at all. The data that matters is mostly internal or vendor-held, so capability follows access in a way that doesn't apply to coding models. The result is likely to look quite different from how coding capability spread.
More on that in another post.
