Modernising Semantic Web Markup

The Resource Description Framework (RDF) web metadata format has an XML syntax, RDF/XML, which has been described as ugly and flawed, mainly as a consequence of it being an early XML format dating from 1998. This presentation will describe the perceived and real problems and select appropriate modern XML and web best practices for improving RDF markup so that it can be better used with the latest XML technologies such as XSLT 2 and XQuery.

The presentation will distinguish a semantic web markup format from a format intended solely for software: the former is intended to be easier for end users to author and to be more clearly appropriate for typical application areas such as lightweight web metadata and authored web ontologies.

XML best practice in any area is a tricky subject to discuss and get agreement on, but the XML technologies considered include XML Namespaces and XML QNames in content, omitting some darker corners of the XML specification, along with the use of clear, user-friendly technologies such as the RELAX NG grammar-based XML schema language, part of the ISO DSDL work. The presentation will also discuss approaches starting from XHTML to generate semantic web data.

Introduction

[RDF/XML Revised] is the W3C Recommendation that I edited for the W3C [RDF Core WG] between 2001 and 2004. It re-defines the RDF/XML syntax designed in 1998 by the original RDF working group in terms of the XML Infoset (with XML Base). The original RDF/XML syntax was created with a variety of goals that somewhat clashed. The result is a syntax that is often criticised as not meeting modern XML best practice. This paper discusses some of the issues that have been raised with the syntax, looks at other work in creating new XML syntaxes for RDF and describes a strawman new XML syntax for RDF graphs, RXR.

The goal of this work is to design a simple XML syntax that covers RDF in a straightforward fashion and is quickly understandable. The intended users of this format are new authors, or those who have read the RDF documentation and want to write the triples down simply, for possible later XML-level processing.

Things that are out of scope for the syntax include RDF model extensions (contexts, quotation, literal subjects, nested graphs, named graphs) and complex triple structures in the RDF model (reification, collections). Keeping the XML format simple imposes restrictions that will be discussed in later sections.

Problems with RDF/XML

The RDF Core working group looked at comments on the original RDF Model & Syntax document and on later work from the community, and recorded these in the [RDF Core Issues List]. Not all of these issues could be addressed during the syntax revision without inventing a new syntax, which was out of scope for the working group. The major remaining problems were as follows:

These cannot all be addressed while keeping the resulting XML format simple and suited to its intended use; some of the triple structures that could be generated are complex, and the typical user would not easily see at a glance how to write them in XML.

Recent RDF XML alternatives

[RDF/XML Retrospective] described the history of the revision of RDF/XML and outlined some potential solutions to the various problems, for both users and machines, in XML as well as non-XML formats. This was not taken further on the XML side, but it led to the development of a non-XML format, [Turtle], intended for quick writing of RDF, which is not discussed further here.

Carroll and Stickler in [TRiX] propose a new XML syntax TRiX based on a triples-level markup with the following form:

This beyond-RDF extension runs counter to the simple, unsurprising approach intended here, and the use of XSLT (especially XSLT2 and W3C XML Schemas) takes the XML tool requirements well beyond what could be called core XML.

Selecting XML for improved RDF markup

A syntax is most likely to be user friendly if the terms used are minimal, consistent and appropriate. The syntax terms should correspond directly to RDF concepts such as triples and their parts (subject, predicate and object), so that what is written clearly maps to the concepts. You should not need to understand what either an RDF schema language or an XML schema language is. An RDF schema language is a description of the vocabulary in the RDF graph; an XML schema language describes what an XML syntax looks like and how it is structured and constrained - they work at different levels.

The things that can be subjects, predicates and objects in RDF are RDF URI References, blank node identifiers or literals. Literals can be datatyped (and the content can be XML) and may have a content language (xml:lang). There are also some restrictions on which types can be used in the subject, predicate and object fields. As far as possible, these constraints must be enforced by an appropriate XML schema language so that, if wanted, the user can use standard validation tools or work from the schema. The XML syntax should be simple enough that knowledge of an XML schema language is not required, although such knowledge is beneficial.
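The position restrictions mentioned above can be sketched outside any schema language. The following is a minimal illustration in Python, assuming the RDF abstract syntax rules that a subject is a URI reference or blank node, a predicate is a URI reference, and an object may be any of the three term types. The class and function names are hypothetical and are not part of RXR or of any RDF toolkit.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class URIRef:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    text: str
    datatype: Optional[str] = None  # datatype URI, if the literal is typed
    language: Optional[str] = None  # xml:lang value, if any


Term = Union[URIRef, BlankNode, Literal]


def check_triple(subject: Term, predicate: Term, obj: Term) -> List[str]:
    """Return the list of constraint violations (empty if the triple is valid)."""
    problems = []
    if not isinstance(subject, (URIRef, BlankNode)):
        problems.append("subject must be a URI reference or blank node")
    if not isinstance(predicate, URIRef):
        problems.append("predicate must be a URI reference")
    if not isinstance(obj, (URIRef, BlankNode, Literal)):
        problems.append("object must be a URI reference, blank node or literal")
    return problems
```

A schema language expresses the same rules declaratively, by allowing only certain child elements or attributes in each position; the function here just makes the rules explicit.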

The goals force restrictions on the complex XML detail that humans have problems constructing, in this case XML Literals in RDF, which use Exclusive XML Canonicalisation. If those are needed, RDF/XML provides that facility in a better form than can be given in a simple XML format. The XML format should also use the minimum of XML specifications and, in particular, stick to the ones most widely understood, used and deployed. Those include XML itself, possibly XML Namespaces and some XML schema language - taking care to use the minimum possible complexity of the common schema languages: DTDs, W3C XML Schemas or RELAX NG. Another choice is not to use some darker corners of the XML specifications such as processing instructions and entities, which show the SGML background of XML and are not seen in most modern XML designs.

XML namespaces are used in many modern XML formats, especially in order to use W3C XML Schema datatypes, which use QNames to identify the datatypes. QNames are thus good candidates for a technology to use in order to get familiar-looking and modern XML. The W3C TAG has an issue with QNames being used in places where they do not represent element names. This is most tricky when they are used in attribute values for content that does not have type support for QNames, as with DTDs. DTDs are therefore not so appropriate for use here: although they do handle namespaces, they cannot do the full type checking. The issue for RDF and QNames is that they have historically been used differently from how they are used as identifiers in XML Schema datatypes - RDF concatenates the namespace name and the local name to form a URI reference, whereas W3C XSD keeps them as a pair. This has an unexpected consequence for describing datatypes: the namespace URI for xsd has to be different when used in RDF, since constructing the URIs for the XML Schema datatypes requires a different namespace from that used in WXS schema documents.
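The difference between the two readings of a QName can be shown in a few lines. The following Python sketch uses hypothetical helper names (xsd_expand, rdf_expand); the two namespace strings are the XML Schema namespace as declared in schema documents and the form with a trailing '#' that is needed when datatype URIs are built by concatenation.

```python
# Namespace name as declared in W3C XML Schema documents.
XSD_SCHEMA_NS = "http://www.w3.org/2001/XMLSchema"
# Namespace name needed when datatype URIs are formed by concatenation.
XSD_RDF_NS = "http://www.w3.org/2001/XMLSchema#"


def xsd_expand(qname, namespaces):
    """XML Schema reading: a QName stays a (namespace name, local name) pair."""
    prefix, local = qname.split(":", 1)
    return (namespaces[prefix], local)


def rdf_expand(qname, namespaces):
    """RDF reading: namespace name and local name are concatenated into one URI."""
    prefix, local = qname.split(":", 1)
    return namespaces[prefix] + local


# The schema-document namespace gives a usable pair but a wrong concatenated URI.
pair = xsd_expand("xsd:integer", {"xsd": XSD_SCHEMA_NS})
wrong_uri = rdf_expand("xsd:integer", {"xsd": XSD_SCHEMA_NS})
# Only the '#' form gives the datatype URI when concatenating.
right_uri = rdf_expand("xsd:integer", {"xsd": XSD_RDF_NS})
```

Here wrong_uri is "http://www.w3.org/2001/XMLSchemainteger", while right_uri is "http://www.w3.org/2001/XMLSchema#integer", which is why the same prefix cannot safely be bound to the same namespace name for both uses.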

RXR (Regular XML RDF)

RXR takes the approach of mapping the RDF concepts to XML elements of the same name, in what might be called element-normal form, where every choice point gives a new element. If elements alone were used, this would generate a very deep and rather verbose tree of tags, as shown in Figure 1.

A lesson learned when designing [Turtle] was that RDF collections are tedious to write down as RDF triples and, as this is prone to error, are worthy of special support. Collections are used in the object part of triples, so an additional collection tag is introduced, allowing a sequence of object elements inside. Figure 4 shows an example of a collection of literals as the object of a triple.
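To give a sense of why writing collections as triples is tedious, the following Python sketch expands a list of items into the rdf:first / rdf:rest / rdf:nil chain that an RDF collection stands for. It is an illustration only: the function name, the tuple representation of triples and the "_:listN" blank node labels are assumptions made here, not part of RXR.

```python
from itertools import count

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDF_FIRST = RDF_NS + "first"
RDF_REST = RDF_NS + "rest"
RDF_NIL = RDF_NS + "nil"


def expand_collection(items):
    """Return (head, triples) for an RDF collection holding the given items.

    Each item gets its own blank node with one rdf:first triple pointing at
    the item and one rdf:rest triple pointing at the next node (or rdf:nil).
    """
    if not items:
        return RDF_NIL, []  # the empty collection is just rdf:nil
    counter = count(1)
    nodes = ["_:list%d" % next(counter) for _ in items]
    triples = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF_NIL
        triples.append((nodes[i], RDF_FIRST, item))
        triples.append((nodes[i], RDF_REST, rest))
    return nodes[0], triples


head, triples = expand_collection(["a", "b", "c"])
```

Three items need three blank nodes and six triples, all of which must be chained correctly by hand if no collection shorthand is available; the head node is then used as the object of the triple that refers to the collection.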

The XML schemas for RXR were written in RELAX NG compact and translated to W3C XML Schema and DTD using James Clark's trang tool. The resulting schemas are complete and relatively straightforward.

RDF and HTML/XHTML

Another approach to making easy-to-use, user-friendly semantic web markup is to start from XHTML markup and transform or annotate it into triples. In February 2004, the HTML working group announced a draft document [RDF/XHTML] which defines an approach to semantic markup for XHTML2 (intended to be an XHTML2 module), adding two main new features:

A second approach is to describe how parts of XHTML can be mapped into triples, typically via an XSLT transform. One of the most recently described of these is [GRDDL], which designates how RDF triples are generated via an HTML head profile attribute and value.

The new W3C Semantic Web Best Practices and Deployment (SWBPD) Working Group is coordinating this development with the HTML WG using both approaches, as together they provide a way to get semantic markup from existing XHTML as well as from a new XHTML2. This remains draft and ongoing work.

Conclusions

RXR describes a simple and mostly regular triple XML format for RDF that is straightforward to explain and match to the RDF triple model. It is compatible with XML schemas in several languages and does not use XML QNames.

Avoiding URI abbreviation with QNames does have the downside that the URIs of RDF are visible and verbose. Replacing these or adding QName alternatives would have a cost in usability, since it would be necessary to explain why some things in attributes with ':' in them are URIs and others are QNames. These would have to be clearly distinguished with new attributes. Adding support for, say, xsi:type attributes for XML Schema datatypes would also risk confusing the use of that attribute for datatypes with the RDF property often abbreviated as rdf:type.
