Bootstrapping RDF applications with Redland

David Beckett

Senior Technical Researcher
Institute for Learning and Research Technology (ILRT), University of Bristol

Bristol
UK
<www.dajobe.org>

Redland[Redland] is a set of mature RDF open source libraries written in C providing a foundation layer of technology for semantic web applications including language bindings to C#, Perl, Python and Ruby as well as others.

This paper describes how Redland aims to provide an out-of-the-box solution to either add semantic web support to an application or to build a new RDF application. The design approach of Redland is based on a C core with wrapping language bindings and this is explained along with the pros and cons. The different Redland libraries and their relationships are described along with how they provide parts of the functionality so that a full RDF API is not always even needed to access the world of RDF data.

Redland provides multiple ways to get RDF data in, either from RDF syntaxes, from messy RSS XML data or directly as native RDF triples. This data can be manipulated using the main API and the results used to generate new formats. The paper explains the process using one of the Redland language bindings.

The RDF query language support that Redland provides with the RDQL and W3C SPARQL languages enables compact but powerful ways to access and manipulate data, and allows typical semantic web scenarios of semi-structured data to be dealt with, along with tracking context. The paper also describes some of the novel features that SPARQL provides.

Table of Contents
1. Introduction
2. Design approach of Redland
3. Redland API style - Object-based in C
4. Language Bindings to C
5. RDF Input, Output and Storage
6. Data Management and Tracking with Redland Contexts
7. RDF API - RDF Data Access and Manipulation
8. Query Languages - RDQL and SPARQL
9. Conclusion
Bibliography

1. Introduction

Redland is a mature Free Software / Open Source library written in C designed to enable RDF support in existing applications and to provide a way for new RDF-based applications to be built easily. Ease for the application developer here means that the average programmer should be able to pick up the library and, in an afternoon, get the job done in a screenful of code at most. This meant Redland had to be usable from a wide variety of programming languages, be portable and flexible in using whatever functionality a wide variety of open systems provide, and be licensed such that it could work with many users.

Any required external libraries have to be widely available where critical - Redland's key requirements are an XML parser, where it allows either expat or libxml to be used, both of which are widely available, and some sort of persistent store, where there are many choices. The optional parts appear in Redland as modules or, in object terms, multiple implementations of interfaces that provide parsers, serialisers, triple stores, query languages and so on.

The portability and flexible configuration were aided by the use of the GNU autotools for discovering and enabling the features available on individual systems. automake, autoconf, and libtool (although quite complex in themselves) allow very system-specific detail, such as the handling of shared libraries / DLLs / OSX frameworks, bundles, modules and dynamic loading, to be removed from the main codebase.

One major consideration is what licenses the free software / open source software (FS/OSS) is under. Most developers are either using or familiar with the GNU GPL, so Redland is available under a compatible license set - the LGPL/GPL dual license. This is not always a suitable license, so a third alternative is the Apache 2.0 license which is used by a large number of projects from the Apache Foundation including the well known HTTP server.

Documentation is another consideration in ensuring easy access to software. All the Redland libraries have news, release notes, and detailed changelogs and also include reference documentation which needs to be kept in synchronisation with the code in order to be useful. The main Redland RDF API ensures this by creating documentation derived directly from the C sources in a Javadoc-style, building DocBook XML, that builds into an HTML API or PDF reference. Raptor and Rasqal have detailed Unix manual pages for the smaller APIs, as do each of the command line utilities. All of the APIs and language bindings come with examples and some have individual test suites (Perl, Python, C#). All of this is available in each release and is the main content that builds the Redland web site. The missing parts at present in documentation terms are tutorials and comprehensive examples of the entire API, although some of the test suites do exercise the majority of the API and provide useful example code.

2. Design approach of Redland

Redland was designed as a family of four C libraries covering major parts of the functionality.

Redland Components

Raptor RDF Parser Toolkit[Raptor]: parsing, serialising, URIs, WWW retrieval, Unicode and XML
Rasqal RDF Query Library[Rasqal]: query syntax support and query execution engine (using Raptor)
Redland[Redland]: user API to all features and triple storage (using Raptor and Rasqal)
Redland Bindings[Bindings]: language bindings to Redland (using Redland)

This split allowed development of the libraries to proceed at different rates, and to concentrate specific detail on one technology such as XML into just one place (Raptor, for parsing and serialising). This also allowed useful subsets of the full library set to be available and to enable separate use without the full Redland RDF API. In particular parsing (Raptor) and querying (Rasqal) are used in external applications to interoperate. An application can use Raptor alone to get RDF data in and out from syntaxes without having to even have any RDF processing or understanding in the code. (The command-line user need not even learn the RDF API of Redland, as the parsing, querying and API calls can be done from simple RDF utility programs, one for each library.)

The language bindings are a separate part so that those developers that just need a C, C++ or related language (Obj-C) interface do not need to use them. When these are packaged up into binary packages as done for Linux, they are split into further per-language packages, so that the minimum number of libraries need be installed by a developer starting from a set of compiled binary packages.

The Redland API library (aka librdf) uses and wraps all the functionality that Raptor and Rasqal deliver, and adds the extra RDF-specific functionality of storing RDF graphs, manipulating them and making a full and coherent user RDF API above the parsing and querying libraries. To a user of a language binding above the RDF API, the distinction between Raptor, Rasqal and Redland itself is not noticeable in normal use.

3. Redland API style - Object-based in C

Every web programming language provides access to C and most desktop applications are still written in C / C++. Thus to provide access across multiple systems and languages, the Redland libraries were written in portable ANSI C99. This, however, did not mean that the API needed to be solely a set of functions in the single flat namespace that C provides. This would deliver an unnatural API when reflected into other language bindings. As not all web languages are based on an object model, C provides a better basis than C++ (which requires objects) to deliver a consistent API that does not require object support in the language.

Redland provides an object model in C with constructor functions, destructor functions and method functions that can be accessed via any language that can call C. Languages such as Tcl where there are no objects can use them without need for further work, and those where object support is evolving such as moving from PHP3 to PHP4 to PHP5 can use them or not as the versions and support allow. Languages with objects such as Perl, Python and C# can map the Redland objects to match language concepts.

In Redland a class called FOO is defined as a C typedef librdf_foo. The constructors for the class are defined as functions with signatures like:

  librdf_foo* librdf_new_foo(void)

which takes no parameters. Constructors usually have parameters and are named in a similar way with an extra part appropriate for the name, for example:

  librdf_foo* librdf_new_foo_with_options(char *options)

A copy constructor may also be defined which has a signature:

  librdf_foo* librdf_new_foo_from_foo(librdf_foo* old_foo)

A destructor is always defined with the function signature:

  void librdf_free_foo(librdf_foo* foo)

Methods of the class have names starting with librdf_foo_ and examples could be:

  /* accessor functions to object part 'thing' */
  int librdf_foo_set_thing(librdf_foo* foo, char *thing)
  char *librdf_foo_get_thing(librdf_foo* foo)

Clearly there can be no inheritance or method overloading using this model. Redland handles this by having interface classes such as storage that have multiple implementations for the classes that implement the storage 'interface'. In the language bindings, native inheritance can be used to provide convenience wrapper classes, such as a particular storage implementation.
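The interface-class idea can be sketched in a few lines of Python. This is not Redland code: the names Storage, MemoryStorage, FileStorage and new_storage are hypothetical and chosen only to illustrate how one 'interface' with several named implementations lets the calling code stay unchanged when the implementation behind it changes.

```python
# Hypothetical sketch of an "interface class" with several implementations.
# None of these names come from Redland; they only illustrate the pattern.
import json
import os
import tempfile


class Storage:
    """The storage 'interface': every implementation provides these methods."""

    def add(self, triple):
        raise NotImplementedError

    def triples(self):
        raise NotImplementedError


class MemoryStorage(Storage):
    """Keeps triples in an in-memory list."""

    def __init__(self):
        self._triples = []

    def add(self, triple):
        self._triples.append(tuple(triple))

    def triples(self):
        return list(self._triples)


class FileStorage(Storage):
    """Appends triples to a file, one JSON array per line."""

    def __init__(self, path):
        self._path = path

    def add(self, triple):
        with open(self._path, "a") as f:
            f.write(json.dumps(list(triple)) + "\n")

    def triples(self):
        if not os.path.exists(self._path):
            return []
        with open(self._path) as f:
            return [tuple(json.loads(line)) for line in f if line.strip()]


# Registry mapping a storage type name to its implementation.
STORAGE_TYPES = {"memory": MemoryStorage, "file": FileStorage}


def new_storage(type_name, *args):
    """Constructor that picks an implementation by name."""
    return STORAGE_TYPES[type_name](*args)


def load(storage):
    """Caller code depends only on the Storage interface."""
    storage.add(("s1", "p1", "o1"))
    storage.add(("s2", "p2", "o2"))
    return storage.triples()


memory_result = load(new_storage("memory"))
file_path = os.path.join(tempfile.mkdtemp(), "triples.jsonl")
file_result = load(new_storage("file", file_path))
```

Both calls to load() return the same two triples; only the name passed to the constructor differs.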

Compiling in C provides an RDF core library with high performance and a small memory footprint, especially with the Redland use of streaming explained later. There are of course several well-known downsides to using C - the manual memory management and the absence in the language of OO support features like reference counting, exceptions and proper strings.

To try to reduce these problems as much as possible, Redland undergoes lots of testing especially by the use of debugging malloc systems and processor memory simulators such as Valgrind [Valgrind]. Each Redland C module also comes with multiple built in unit-tests.

Redland is regularly built and tested on over a dozen different Linux / Unix systems from BSDen to Solaris and architectures (such as the 11 architectures supported by Debian GNU/Linux) and is also built on Apple OSX and Win32 by external users. Redland is automatically tested using the SourceForge compile farm to track build and test failures, and a bug tracker is used to follow up user reports.

4. Language Bindings to C

There are several common web development languages that operate on a dynamic scripting model such as Perl, Python and Ruby. These present a class and object model to the user, which can map to Redland's model. Redland uses swig[SWIG] to generate bindings to these scripting languages based on a pseudo-C description. It generates the interface C code to do the marshalling/unmarshalling of arguments to functions and their return values along with whatever type-checking is possible. This allows concentrating on encoding more natural bindings on top of the raw SWIG conversion of the functions, to use the different features in the language bindings.

The class foo with a constructor, destructor and two methods could be mapped as shown in Example 1:

Example 1. Functions for Redland Objects in C for class foo

  librdf_foo* librdf_new_foo(void);

  void librdf_free_foo(librdf_foo* foo);

  int librdf_foo_set_thing(librdf_foo* foo, char *thing);
  char *librdf_foo_get_thing(librdf_foo* foo);

SWIG would convert that into similar-named functions that would be callable in the target binding language, converting the librdf_foo typedef into a typed object pointer which the marshalling code would check.

This would not look like objects to the target language, so all the Redland bindings that map to object-based languages provide wrapper classes. In Python it would be something like in Example 2 where SWIG presents the mapped functions as appearing in the Redland class:

Example 2. Python binding for the class foo

  class Foo:
    def __init__(self):
      self._native=Redland.librdf_new_foo()

    def __del__(self):
      Redland.librdf_free_foo(self._native)

    def set_thing(self, value):
      return Redland.librdf_foo_set_thing(self._native, value)

    def thing(self):
      return Redland.librdf_foo_get_thing(self._native)

So to the Python programmer, the Redland implementation detail is lost and it looks and works like a regular python class as shown in Example 3:

Example 3. Python example

  import RDF

  foo=RDF.Foo()
  foo.set_thing("blah")
  print(foo.thing())

This scheme has been fully implemented for the core binding languages for Redland - Python, Perl, Ruby. C# has its own equivalent to SWIG and calls C directly but the wrapper classes work in the same fashion. Tcl and PHP remain flat functional mappings and Java, although using this scheme, is not a core binding language for Redland. A third-party maintainer has developed Objective-C bindings in the same fashion for OSX.

Each language requires its own additional customisation for individual features that are in one language but not another, such as using Python generators to wrap Redland iterators, Perl arrays around lists or dealing with the additional management aspects that occur near strings and Unicode.
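As an illustration of the generator customisation, the sketch below wraps an explicit end / get / next style of iterator in a Python generator. The NativeIterator class is a hypothetical stand-in written for this example, not the object SWIG produces, and its method names are assumptions that only mimic that calling style.

```python
class NativeIterator:
    """Hypothetical stand-in for a C-style iterator with end/get_object/next."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def end(self):
        # True once the iterator has moved past the last item.
        return self._pos >= len(self._items)

    def get_object(self):
        # Return the current item without advancing.
        return self._items[self._pos]

    def next(self):
        # Advance to the following item.
        self._pos += 1


def iterate(native):
    """Wrap the explicit end/get/next protocol as a Python generator."""
    while not native.end():
        yield native.get_object()
        native.next()


nodes = list(iterate(NativeIterator(["n1", "n2", "n3"])))
```

With such a wrapper the caller can write an ordinary for loop over the results and never sees the explicit end test or advance call.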

5. RDF Input, Output and Storage

It is important that data can easily be input to and output from a system, and Redland provides multiple parsers (7+) and serialisers (3+) operating on syntaxes. The parsers can read RDF syntaxes such as RDF/XML and generate RDF triples; the serialisers do the reverse and generate syntaxes from an RDF graph. These are integrated into the API so that reading and writing to and from strings, local files and URIs is a single method call.
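The two directions can be sketched with a toy reader and writer for a small subset of the N-Triples syntax. This is a minimal sketch and not Raptor: the function names parse_ntriples and serialize_ntriples are hypothetical, and the subset assumed here covers only URI subjects and predicates and objects that are URIs or plain literals without escapes, language tags or datatypes.

```python
import re

# Matches only a small subset of N-Triples: URI subject and predicate, and an
# object that is either a URI or a plain literal without escapes.
_LINE = re.compile(
    r'^<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"\\]*)")\s*\.\s*$'
)


def parse_ntriples(text):
    """Yield (subject, predicate, object) tuples, one per statement line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        m = _LINE.match(line)
        if m is None:
            raise ValueError("unsupported line: %r" % line)
        s, p, o_uri, o_lit = m.groups()
        # Objects are tagged so that URIs and literals stay distinguishable.
        obj = ("uri", o_uri) if o_uri is not None else ("literal", o_lit)
        yield (s, p, obj)


def serialize_ntriples(triples):
    """Do the reverse: turn triples back into N-Triples text."""
    lines = []
    for s, p, (kind, value) in triples:
        obj = "<%s>" % value if kind == "uri" else '"%s"' % value
        lines.append("<%s> <%s> %s ." % (s, p, obj))
    return "\n".join(lines) + "\n"


data = (
    '<http://example.org/a> <http://example.org/name> "Alice" .\n'
    '<http://example.org/a> <http://example.org/knows> <http://example.org/b> .\n'
)
triples = list(parse_ntriples(data))
round_trip = serialize_ntriples(triples)
```

The application code between the two calls sees only triples, which is the property that lets a program move data between syntaxes without any further RDF processing.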

However, not all data is neatly stored in RDF files, so Redland includes two parsers that turn other input forms into RDF triples - an RSS tag soup parser that can read almost any version of "RSS" up to and including the ongoing Atom work, and a GRDDL parser that can read XHTML that includes built-in transformation information using XSLT. Development of this set of scraping tools is ongoing.

Redland provides multiple backends for storing RDF graphs, using a storage interface abstraction. The choice of storage can depend not only on the system that Redland is configured and built on, but on the features they provide as they include a range of stores going from in-memory storage without indexes, persistent stores in static files, simple database files up to using a relational database (MySQL). The tradeoffs between the storage modules and their different features are described in the documentation.

6. Data Management and Tracking with Redland Contexts

One frequent use of RDF data is to aggregate content from a variety of data sources into one RDF graph (data store) and then to manipulate and query the result. At this point the user often wants to find out where the result came from - which sources formed part of it. Redland provides a feature for managing this called contexts, which allows sets of triples in a data store to be marked with a URI when added and this URI to be retrieved when searching. This solves this common problem and also allows an additional one to be solved - updating a graph when the original source content changes, as the contexts API allows removal of all triples associated with a context, so that new data can be loaded if need be.

With Redland contexts, each triple can be marked as having multiple nodes called context nodes such as C1, C2. These are regular Redland nodes so can appear in triples themselves. Table 1 shows four triples marked with contexts, with the two context nodes also appearing as the subjects of two further triples. The context nodes are assigned when the triples are added to the Redland graph, but otherwise they are no different from any other graph node. The first 2 triples are in context C1, and could be from the same graph. Triples 1, 3 and 4 are in context C2 and could be from a separate graph that was aggregated. The last two triples provide extra information about the contexts (graphs) themselves such as where and when they were added to the Redland model or other data management information.

Table 1. Redland RDF graph with contexts

No.	Triple	Contexts
1.	(s1, p1, o1)	C1, C2
2.	(s2, p2, o2)	C1
3.	(s3, p3, o3)	C2
4.	(s4, p4, o4)	C2
5.	(C1, p5, o5)	(none)
6.	(C2, p6, o6)	(none)

This is made available by methods on the stream and iterator classes after making an API call that returns them. It allows getting the context node for the current result - triple or node returned from the stream. In pseudo code, like this:

  stream=model.find_statement(s, p, o)
  statement=stream.current()
  context_node=stream.context()

Methods of the model class provide additional control over contexts, including listing all the context nodes in a graph, listing all the triples with a given context and bulk adding or deleting the triples associated with a context - a common use case.
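The behaviour described for Table 1 can be modelled with a toy data structure. The ContextGraph class below is a hypothetical sketch, not the Redland model class, and its method names are assumptions; it loads the six triples of Table 1 and then removes context C1, as an application might when the source behind C1 has changed.

```python
class ContextGraph:
    """Toy graph that records, for each triple, the set of its context nodes."""

    def __init__(self):
        self._contexts = {}  # triple -> set of context nodes (may be empty)

    def add(self, triple, context=None):
        marks = self._contexts.setdefault(triple, set())
        if context is not None:
            marks.add(context)

    def contexts(self):
        """List every context node used in the graph."""
        return sorted(set().union(*self._contexts.values()))

    def triples_in(self, context):
        """List every triple marked with the given context."""
        return sorted(t for t, marks in self._contexts.items() if context in marks)

    def remove_context(self, context):
        """Bulk delete: unmark the context's triples and drop any of them
        that no other context still holds."""
        for triple in self.triples_in(context):
            marks = self._contexts[triple]
            marks.discard(context)
            if not marks:
                del self._contexts[triple]


g = ContextGraph()
g.add(("s1", "p1", "o1"), "C1")
g.add(("s1", "p1", "o1"), "C2")
g.add(("s2", "p2", "o2"), "C1")
g.add(("s3", "p3", "o3"), "C2")
g.add(("s4", "p4", "o4"), "C2")
g.add(("C1", "p5", "o5"))  # triples about the contexts themselves
g.add(("C2", "p6", "o6"))

in_c1_before = g.triples_in("C1")
g.remove_context("C1")
remaining_contexts = g.contexts()
```

After the removal, triple 1 survives because it is still held by C2, triple 2 is gone, and the triples that describe the contexts are untouched.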

Redland contexts can be used for several different techniques of recording context depending on how the context nodes are associated with the triples, which is application specific, and is done at the time the triples are added to the graph. The uses could include the following, although this is not an exhaustive list:

Enable true graph merging / updating / demerging: Identify the subgraphs (sets of triples from particular sources) with context nodes.
Triple Identity: Add each triple with a different context node. RDF's model does not assign identity to triples. There is reification also which might be used with this approach.
Triple Provenance: Use the context node as the subject of other triples about the triple that is returned.
Subgraphs: Similar to the merging approach but consider the RDF graph to be a set of graphs and manipulate them as such.

7. RDF API - RDF Data Access and Manipulation

The Redland API contains the full set of triple-level manipulations of an RDF graph, integrated with the parsing and serialising by streams that provide access to a sequence of triples, which allows data to be generated lazily, on demand by the application where possible. This enables the use of a minimum of resources when performing model operations or operating across a protocol or API.

In addition to the low-level API calls to add a triple, remove a triple and similar, the RDF graphs can be searched for patterns of triples using a triple pattern matching call find_statements:

  /* ... set statement parts to what is needed, leaving wildcards NULL .. */
  predicate = librdf_new_node_from_uri_string(world, (const unsigned char*)"http://example.org/pred");
  statement = librdf_new_statement_from_nodes(world, NULL, predicate, NULL);

  /* find matching triples in the graph ('model') matching statement */
  stream = librdf_model_find_statements(model, statement);

which gives a stream of RDF triple results (or in Python, a generator of them; and in Perl, an array).
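The combination of wildcard matching and lazy streams can be sketched in pure Python. The find_statements function below is a hypothetical stand-in that only borrows its name from the call above; it works on plain tuples, uses None as the wildcard, and the ex: names are made up for the example.

```python
def find_statements(triples, pattern):
    """Lazily yield triples matching pattern; None in a position is a wildcard."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


graph = [
    ("ex:alice", "ex:pred", "ex:bob"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:bob", "ex:pred", "ex:carol"),
]

consumed = []  # records how much of the source has been read so far


def source():
    """A source that notes each triple as it is pulled from the graph."""
    for triple in graph:
        consumed.append(triple)
        yield triple


stream = find_statements(source(), (None, "ex:pred", None))
first = next(stream)                  # reads the source only up to the first match
consumed_after_first = len(consumed)
rest = list(stream)                   # draining the stream reads the remainder
```

Nothing is read from the source until the application asks for a result, which is the property that keeps resource use low on large graphs.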

More complex API interactions require quite a few objects, classes and method calls. This also means that the application developer has to be involved with the detail of the API and cannot express application-centric problems, in RDF terms such as graph triples. This has been solved in other areas by the development of query languages, explained in the next section.

8. Query Languages - RDQL and SPARQL

Query languages help the application developer to do a lot of data model access, selection and formatting in a compact format, as well as provide a standardisation point for the developer across multiple implementations. Redland presently has support for two RDF query languages: RDQL[RDQL], which has been a de facto standard RDF query language for several years, and SPARQL[SPARQL], which is currently being standardised by the W3C RDF Data Access Working Group (DAWG). SPARQL is still under development and its final syntax and feature set are not yet decided.

SPARQL was developed as a standard RDF query language, that is, with RDF concepts at the core - triples and graphs and with support for the unique requirements of the semantic web that have been found from the earlier query language work. These include supporting the semi-structured nature and schemalessness of RDF by allowing queries with optional parts to succeed (SPARQL OPTIONAL) with partial matches and not requiring all the types of data to be pre-declared before a query (as SQL does).

SPARQL OPTIONAL allows such things as querying for multiple properties of a resource in one go, without necessarily requiring that all of them are present:

Example 4. SPARQL Optionals

  PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?name ?mbox
  WHERE  { ?r rdf:type foaf:Person .
           OPTIONAL { ?r foaf:name  ?name }
           OPTIONAL { ?r foaf:mbox  ?mbox }
         }

which could find for each FOAF Person object, bindings for the names and email address, where known. In a lower-level API call this would typically need several queries or multiple API calls to perform.
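To make the comparison concrete, the sketch below performs the same lookup with repeated triple-pattern calls over plain tuples. It is a simplified illustration under stated assumptions: match and first_object are hypothetical helpers, the data is invented, and only the first name and mailbox of each person are taken, whereas a query engine would return every combination.

```python
def match(graph, s=None, p=None, o=None):
    """Yield triples matching the pattern; None is a wildcard."""
    for triple in graph:
        if all(want is None or want == got
               for want, got in zip((s, p, o), triple)):
            yield triple


def first_object(graph, subject, predicate):
    """Return the first object for (subject, predicate), or None if absent."""
    for _, _, obj in match(graph, s=subject, p=predicate):
        return obj
    return None


def people_with_optional_name_and_mbox(graph):
    """One pattern call to find the people, then two more per person."""
    rows = []
    for person, _, _ in match(graph, p="rdf:type", o="foaf:Person"):
        rows.append((first_object(graph, person, "foaf:name"),
                     first_object(graph, person, "foaf:mbox")))
    return rows


graph = [
    ("ex:a", "rdf:type", "foaf:Person"),
    ("ex:a", "foaf:name", "Alice"),
    ("ex:a", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:b", "rdf:type", "foaf:Person"),
    ("ex:b", "foaf:name", "Bob"),
    ("ex:c", "rdf:type", "foaf:Document"),
]

rows = people_with_optional_name_and_mbox(graph)
```

The person without a mailbox still produces a row, with None standing in for the unbound variable, which is the behaviour the OPTIONAL clauses express in a single query.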

The experience of running RDF query languages on the web led to the language enabling web integration in several aspects. The contexts feature of Redland and similar implementations in some earlier RDQL support was used to return information on where aggregated RDF data came from originally, by tracking the source graph URIs. SPARQL makes this available both in forming queries and in binding results by the GRAPH operator, taking either a URI argument, to restrict a query to triples from a graph, or a variable, to return the provenance of a triple:

Example 5. SPARQL GRAPH

  PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  SELECT ?g ?name
  WHERE  { ?r rdf:type foaf:Person .
           GRAPH ?g { ?r foaf:name  ?name }
         }

which returns the names of the people (instances of the FOAF Person class) in the graph and the original graph in which the triple was found. In common semantic web applications where data is integrated, this allows the application developer to deal with issues of provenance, trust and data management. Redland supports manipulating contexts, which map directly to the GRAPH support.

Finally, SPARQL supports RDF literals and simple XML Schema datatypes (plus a few others such as dateTime), along with a full set of operators on them from XQuery Functions and Operators. It also has several other features including creating RDF graphs as results in addition to variable bindings, which allows novel possibilities of chaining RDF graph creation. Redland's query language support is provided by Rasqal[Rasqal] which lazily evaluates the queries, so that results are generated and produced only as they are needed - streaming.

9. Conclusion

Redland is a mature FS/OSS set of C libraries providing a core RDF library that enables fast development of RDF applications in multiple languages and high performance with its tested and compact C core.

Bibliography

Redland Redland RDF Application Framework D. Beckett available at http://librdf.org/

Raptor Raptor RDF Parser Toolkit D. Beckett available at http://librdf.org/raptor/

Rasqal Rasqal RDF Query Library D. Beckett available at http://librdf.org/rasqal/

Redland Bindings Redland Language Bindings D. Beckett available at http://librdf.org/bindings/

Valgrind Valgrind Julian Seward et al, an x86 process simulator, available at http://valgrind.org/

SWIG Simplified Wrapper and Interface Generator tool for connecting programs written in C and C++ with a variety of high-level programming languages. Available at http://www.swig.org/

SPARQL SPARQL Query Language for RDF Eric Prud'hommeaux and Andy Seaborne (editors), W3C Working Draft of 17 February 2005, work in progress. Available at http://www.w3.org/TR/2005/WD-rdf-sparql-query-20050217/

RDQL RDQL - A Query Language for RDF Andy Seaborne, HP Labs, W3C Member Submission, 9 January 2004. Available at http://www.w3.org/Submission/2004/SUBM-RDQL-20040109/