Modernising Semantic Web Markup

Keywords: DAML, DTD, Dublin Core, RDF, Semantic Web, XML, XSLT

David Beckett
Senior Technical Researcher
Institute for Learning and Research Technology (ILRT), University of Bristol
Bristol
UK
http://www.dajobe.org/

Biography

Dave is a researcher at the ILRT, University of Bristol and works on the Semantic Web Advanced Development Europe (SWAD-E) project. He has been working with metadata and the Dublin Core since 1995, RDF since 1998, is a member of the W3C RDF Core Working Group, and editor of the RDF/XML Syntax W3C Recommendation. Dave is the author of the Redland, Raptor and Rasqal RDF tools and maintainer of the RDF Resource Guide.


Abstract


The Resource Description Framework (RDF) web metadata format has an XML syntax, RDF/XML, which has been described as ugly and flawed, mainly as a consequence of it being an early XML format dating from 1998. This presentation will describe the perceived and real problems and select appropriate modern XML and web best practices for improving RDF markup, so that it can be better used with the latest XML technologies such as XSLT 2 and XQuery.

The presentation will distinguish a semantic web markup format from a format intended solely for software: the former is intended to be easier for end users to author and to be more clearly appropriate for typical application areas such as lightweight web metadata and authored web ontologies.

XML best practice in any area is a tricky subject to discuss and get agreement on. The XML technologies considered include XML Namespaces and XML QNames in content; the approach omits some darker corners of the XML specification and uses clear, user-friendly technologies such as the RELAX NG grammar-based XML schema language, part of the ISO DSDL work. The presentation will also discuss approaches starting from XHTML to generate semantic web data.


Table of Contents


Introduction
Problems with RDF/XML
Recent RDF XML alternatives
Selecting XML for improved RDF markup
RXR (Regular XML RDF)
RDF and HTML/XHTML
Conclusions
Appendix A - XML Schemas
Bibliography

Introduction

[RDF/XML Revised] is the W3C Recommendation that I edited for the W3C [RDF Core WG] between 2001 and 2004. It re-defines the RDF/XML syntax designed in 1998 by the original RDF working group in terms of the XML Infoset (with XML Base). The original RDF/XML syntax was created with a variety of goals that somewhat clashed. The result is a syntax that is often criticized as not meeting modern XML best practice. This paper discusses some of the issues that have been raised with the syntax, looks at other work in creating new XML syntaxes for RDF and describes a strawman new XML syntax for RDF graphs, RXR.

The goal of this work is to design a simple XML syntax that covers RDF in a straightforward fashion and is quickly understandable. The intended users of this format are new authors, or those who have read the RDF documentation and want to write the triples down simply, for possible later XML-level processing.

Things that are out of scope for the syntax include RDF model extensions (contexts, quotation, literal subjects, nested graphs, named graphs) and complex triple structures in the RDF model (reification, collections). Keeping the XML format simple imposes restrictions, which are discussed in later sections.

Problems with RDF/XML

The RDF Core working group looked at comments on the original RDF Model & Syntax document and the later work from the community and recorded these in the [RDF Core Issues List]. Not all of these issues could be addressed during the syntax revision without inventing a new syntax, which was out of scope for the working group. The major remaining problems were as follows:

  1. You cannot distinguish an RDF node element from a property element by simple inspection of the element in question without knowing the current striping, as described in [Striped RDF/XML].
  2. The frame-style approach of the description block does not clearly match the RDF model - triples in the RDF graph.
  3. There are excessive choices for users in choosing how to write RDF/XML.
  4. Elements, attributes and attribute values are used for the same purposes, for example, encoding an RDF URI reference.
  5. The way that XML QNames are used does not constrain the elements and attribute tags that can appear in RDF/XML.
  6. The unconstrained syntax cannot be described completely with XML schema languages such as DTDs and W3C XML Schema (WXS).
  7. It does not allow using xsi:type for specifying W3C XML Schema datatypes.
  8. The syntax is not easy to use with XML technologies such as XSLT, XQuery and other XML tools (mostly due to the unconstrained tags and many abbreviations).
  9. It is impossible to embed RDF/XML in XHTML while retaining DTD validation (although this is also true for any other XML syntax embedded in a DTD-constrained format).
  10. It is hard to emit human-readable RDF/XML from an RDF graph due to the range of choices ([Unparsing]).
  11. RDF/XML cannot describe collections of literals.
  12. Not all property URIs can be encoded.
  13. Various aesthetic criticisms have been leveled at the syntax, such as being "ugly".

These cannot all be addressed while keeping the resulting XML format simple and suited to its intended use: some of the triple structures that could be generated are complex, and the typical user could not easily see at a glance how to write them in XML.

Recent RDF XML alternatives

[RDF/XML Retrospective] described the history of the revision of RDF/XML and outlined some potential solutions to the various problems, for both users and machines, in XML and non-XML formats. This was not taken further on the XML side, but it led to the development of a non-XML format, [Turtle], intended for quick writing of RDF, which is not discussed further here.

Carroll and Stickler in [TRiX] propose a new XML syntax TRiX based on a triples-level markup with the following form:

  1. A TRiX document contains a set of RDF graphs.
  2. Each graph may have a name.
  3. Position in the triple element is significant and determines the triple subject, predicate or object.
  4. No XML QNames are allowed.
  5. It suggests making the syntax user-friendly via XSLT transforms, indicated using an XML processing instruction.

Naming graphs is an extension beyond RDF that runs counter to the simple, unsurprising approach intended here, and the use of XSLT (especially XSLT 2 and W3C XML Schema) takes the XML tool requirements well beyond what could be called core XML.

Selecting XML for improved RDF markup

A syntax is most likely to be user friendly if the terms used are minimal, consistent and appropriate. The syntax terms should correspond directly to RDF concepts such as triples and their parts (subject, predicate and object), so that how the syntax is written clearly maps to the concepts. You should not need to understand what either an RDF schema language or an XML schema language is. An RDF schema language is a description of the vocabulary in the RDF graph; an XML schema language describes what an XML syntax looks like and how it is structured and constrained - they work at different levels.

The types of things that can be subjects, predicates and objects in RDF are RDF URI References, blank node identifiers or literals. The literals can be datatyped (which can be XML content) and may have a content language (xml:lang). There are also some restrictions on which types can be used in the subject, predicate and object fields. As far as is possible, these constraints must be enforced by an appropriate XML schema language so that, if wanted, the user can use standard validation tools or work from the schema. The XML syntax should be simple enough that knowledge of any XML schema language is not required, although such knowledge is beneficial.

These goals force restrictions on the complex XML detail that humans have problems constructing, in this case XML Literals in RDF, which use Exclusive XML Canonicalisation. If those are needed, RDF/XML provides that facility in a better form than can be given in a simple XML format. The XML format should also use the minimum of XML specifications and, in particular, stick to the ones most widely understood, used and deployed. Those include XML itself, possibly XML Namespaces, and some XML schema language, taking care to use the minimum possible complexity of the common schema languages: DTDs, W3C XML Schema or RELAX NG. Another choice is to avoid some darker corners of the XML specifications, such as processing instructions and entities, which show the SGML background of XML and are not seen in most modern XML designs.

XML namespaces are used in many modern XML formats, especially in order to use W3C XML Schema datatypes, which use QNames to identify the datatypes. QNames are thus good candidates for technology to use in order to get familiar-looking and modern XML. The W3C TAG has an issue with QNames being used in places where they do not represent element names. This is most tricky when QNames are used in attribute values whose content has no type support for QNames, as with DTDs. DTDs are therefore not so appropriate for use here: although they do handle namespaces, they cannot do the full type checking. The issue for RDF and QNames is that they have historically been used differently from how they are used as identifiers in XML Schema datatypes: RDF concatenates the namespace name and the local name to form a URI reference, whereas W3C XML Schema keeps them as a pair. This has an unexpected consequence for describing datatypes: the namespace URI for xsd has to be different when used in RDF, since constructing the URIs for the XML Schema datatypes requires a different namespace than that used in WXS schema documents.
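The difference between the two readings of a QName can be made concrete with a small Python sketch. It is an illustration only: the function names rdf_expand and wxs_expand are hypothetical, and the prefix tables stand in for whatever namespace declarations are in scope.

```python
from typing import Dict, NamedTuple

# Namespace name as declared in W3C XML Schema documents (no trailing '#').
WXS_NAMESPACE = "http://www.w3.org/2001/XMLSchema"
# Namespace name needed on the RDF side so that concatenation
# produces the datatype URI (note the trailing '#').
RDF_XSD_NAMESPACE = "http://www.w3.org/2001/XMLSchema#"


class ExpandedName(NamedTuple):
    """A QName as W3C XML Schema reads it: a (namespace, local name) pair."""
    namespace: str
    local_name: str


def wxs_expand(qname: str, prefixes: Dict[str, str]) -> ExpandedName:
    """Keep the namespace name and the local name as a pair."""
    prefix, local_name = qname.split(":", 1)
    return ExpandedName(prefixes[prefix], local_name)


def rdf_expand(qname: str, prefixes: Dict[str, str]) -> str:
    """Concatenate the namespace name and the local name into one URI reference."""
    prefix, local_name = qname.split(":", 1)
    return prefixes[prefix] + local_name


# Reusing the schema-document namespace in RDF glues the parts together
# without a separator, so it does not yield the datatype URI.
wrong = rdf_expand("xsd:integer", {"xsd": WXS_NAMESPACE})
right = rdf_expand("xsd:integer", {"xsd": RDF_XSD_NAMESPACE})
```

Here right is the usual datatype URI ending in XMLSchema#integer, while wrong ends in XMLSchemainteger, which is why the same xsd prefix has to be bound to two different namespace names in the two settings.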

RXR (Regular XML RDF)

RXR takes the approach of mapping the RDF concepts to XML elements of the same names, in what might be called element-normal form, where every choice point gives a new element. If elements alone were used, this would generate a deep and rather verbose tree of tags, as shown in Figure 1.

<triple>
  <subject><uri>http://purl.org/net/dajobe/</uri></subject>
  <predicate><uri>http://purl.org/dc/elements/1.1/creator</uri></predicate>
  <object><literal>Dave Beckett</literal></object>
</triple>

Figure 1: An RDF XML syntax in element-normal form

Although very regular, this is rather verbose, and in particular it is not a natural way to write a string as a literal compared with a URI. It is more natural to make literal content appear as element content, with the other forms as alternatives; this suggested using attributes for the type indications rather than remaining totally in element-normal form.

The XML attributes then become the modifiers for the XML elements that represent the parts of the triple. The triple element content enforces the standard order of describing a triple as used in the RDF specifications. (The order is not significant in RDF, but a single order is used for consistency and to ease learning.)

Rewriting Figure 1 into the final RXR form gives Figure 2 with the root element added.

<graph xmlns="http://ilrt.org/discovery/2004/03/rxr/">
  <triple>
    <subject uri="http://purl.org/net/dajobe/"/>
    <predicate uri="http://purl.org/dc/elements/1.1/creator"/>
    <object>Dave Beckett</object>
  </triple>
</graph>

Figure 2: RDF Triple in RXR
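The rewrite from Figure 1 to Figure 2 is mechanical, and can be sketched in a few lines of Python using only the standard library. This is an illustration rather than part of RXR: the function name to_rxr is hypothetical, and putting a uri attribute on the object element for URI-valued objects is an assumption, since the figures only show that attribute on subject and predicate.

```python
import xml.etree.ElementTree as ET

RXR_NS = "http://ilrt.org/discovery/2004/03/rxr/"

# The element-normal form of Figure 1.
FIGURE_1 = """
<triple>
  <subject><uri>http://purl.org/net/dajobe/</uri></subject>
  <predicate><uri>http://purl.org/dc/elements/1.1/creator</uri></predicate>
  <object><literal>Dave Beckett</literal></object>
</triple>
"""


def to_rxr(element_normal_triple: str) -> ET.Element:
    """Turn one element-normal triple into an RXR graph element."""
    source = ET.fromstring(element_normal_triple)
    graph = ET.Element(f"{{{RXR_NS}}}graph")
    triple = ET.SubElement(graph, f"{{{RXR_NS}}}triple")
    # Emit the parts in the fixed subject, predicate, object order.
    for part in ("subject", "predicate", "object"):
        source_part = source.find(part)
        target = ET.SubElement(triple, f"{{{RXR_NS}}}{part}")
        uri = source_part.find("uri")
        if uri is not None:
            # A <uri> child element becomes a uri attribute.
            target.set("uri", uri.text)
        else:
            # A <literal> child element becomes plain element content.
            target.text = source_part.find("literal").text
    return graph


# Serialise with RXR as the default namespace, as in Figure 2.
ET.register_namespace("", RXR_NS)
rxr_text = ET.tostring(to_rxr(FIGURE_1), encoding="unicode")
```

Apart from whitespace, rxr_text has the same structure as Figure 2: the choice between URI and literal is carried by the presence of an attribute rather than by an extra level of elements.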

The remaining detail is primarily with literals. A literal can have a datatype URI, so a datatype attribute with URI content is added for that, and a language, for which xml:lang from XML can be reused. Figure 3 shows examples of some ways that literals can be written.

<graph xmlns="http://ilrt.org/discovery/2004/03/rxr/">

  <triple>
    <subject uri="http://example.org/res1"/>
    <predicate uri="http://example.org/pred1"/>
    <object>simple literal</object>
  </triple>

  <triple>
    <subject uri="http://example.org/res2"/>
    <predicate uri="http://example.org/pred2"/>
    <object datatype="http://example.org/mytype">1,2,3</object>
  </triple>

</graph>

Figure 3: RXR Literals - simple and datatyped
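Because every triple has the same shape, reading RXR back into triples needs little more than a loop over the triple elements. The following Python sketch assumes the forms shown in Figures 2 and 3 plus xml:lang on the object; the function name parse_rxr and the tuple layout are hypothetical, a uri attribute on object is an assumption, and blank nodes, collections and inherited xml:lang values are deliberately left out.

```python
import xml.etree.ElementTree as ET

RXR = "{http://ilrt.org/discovery/2004/03/rxr/}"
# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def parse_rxr(text):
    """Return a list of (subject, predicate, object) tuples from an RXR document.

    URI objects are returned as ("uri", value); literal objects as
    ("literal", value, datatype or None, language or None).
    """
    graph = ET.fromstring(text)
    triples = []
    for triple in graph.findall(RXR + "triple"):
        subject = triple.find(RXR + "subject").get("uri")
        predicate = triple.find(RXR + "predicate").get("uri")
        obj = triple.find(RXR + "object")
        if obj.get("uri") is not None:
            value = ("uri", obj.get("uri"))
        else:
            # An empty element is read as the empty string literal.
            value = ("literal", obj.text or "", obj.get("datatype"), obj.get(XML_LANG))
        triples.append((subject, predicate, value))
    return triples
```

Applied to Figure 3, this yields two triples: one whose object is the plain literal "simple literal" and one whose object is "1,2,3" with the datatype http://example.org/mytype.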

A lesson learned when designing [Turtle] was that RDF collections are tedious to write down as RDF triples and, as this is prone to error, are worthy of special support. Collections are used in the object part of triples, so an additional collection tag is introduced, allowing a sequence of object elements inside. Figure 4 shows an example of a collection of literals as the object of a triple.

<graph xmlns="http://ilrt.org/discovery/2004/03/rxr/">

  <triple>
    <subject uri="http://example.org/res"/>
    <predicate uri="http://example.org/pred"/>
    <collection>
      <object>a</object>
      <object>b</object>
      <object>c</object>
    </collection>
  </triple>

</graph>

Figure 4: RXR Collection of literals
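To see why writing collections out by hand is tedious, the following Python sketch expands a collection into the plain triples it stands for, using the standard rdf:first, rdf:rest and rdf:nil terms. The function name expand_collection and the _:listN blank node labels are hypothetical, and terms are kept as plain strings for brevity, so the sketch does not distinguish literals from URI references.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def expand_collection(subject, predicate, items):
    """Return the plain triples for a collection used as the object of a triple."""
    if not items:
        # The empty collection is just rdf:nil.
        return [(subject, predicate, RDF + "nil")]
    # One blank node per list cell (labels are illustrative only).
    nodes = [f"_:list{i}" for i in range(len(items))]
    triples = [(subject, predicate, nodes[0])]
    for i, item in enumerate(items):
        # The last cell points at rdf:nil, every other cell at the next cell.
        rest = nodes[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return triples


figure_4 = expand_collection(
    "http://example.org/res", "http://example.org/pred", ["a", "b", "c"]
)
```

The three-item collection of Figure 4 expands to seven triples with three blank nodes, against one triple element with a collection child in RXR.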

As already discussed, RXR omits complex parts of RDF such as XML Literals, which move between levels of abstraction - the XML level of elements and attributes, and the encoded version as a string. There is no easy way to move between these layers in XSLT, say, without essentially writing an XML serializer performing Exclusive XML Canonicalization in XSLT, which probably requires XSLT 2 at a minimum ([TRiX]).

The XML schemas for RXR were written in RELAX NG compact and translated to W3C XML Schema and DTD using James Clark's trang tool. The resulting schemas are complete and relatively straightforward.

RDF and HTML/XHTML

Another approach to making easy-to-use, user-friendly semantic web markup is to start from XHTML markup and transform or annotate it into triples. In February 2004, the HTML working group announced a draft document [RDF/XHTML] which defines an approach to semantic markup for XHTML2 (intended to be an XHTML2 module), adding two main new features:

  1. The <meta> element can take element content and new attributes to define the subject or object of an RDF statement including datatypes.
  2. A <span> element can take a resource attribute that allows the covered content to be the literal object of an RDF triple.

A second approach is to describe how parts of XHTML can be mapped into triples, typically via an XSLT transform. One of these described most recently is [GRDDL] which designates how RDF triples are generated via an HTML head profile attribute and value.

The new W3C Semantic Web Best Practices and Deployment (SWBPD) Working Group is coordinating this development with the HTML WG using both approaches, as together they provide a way to get semantic markup from existing XHTML as well as from a new XHTML2. This remains draft and ongoing work.

Conclusions

RXR describes a simple and mostly regular triple XML format for RDF that is straightforward to explain and match to the RDF triple model. It is compatible with XML schemas in several languages and does not use XML QNames.

Avoiding URI abbreviation with QNames does have the downside that the URIs of RDF are visible and verbose. Replacing these or adding QName alternatives would have a cost in usability, since it would be necessary to explain why some things in attributes with ':' in them are URIs and others are QNames. These would have to be clearly distinguished with new attributes. Adding support for, say, xsi:type attributes for XML Schema datatypes would also risk confusing that attribute with the datatype attribute and with the RDF property often abbreviated as rdf:type.

Appendix A - XML Schemas

The RELAX NG schema is available at http://ilrt.org/discovery/2004/03/rxr/rxr.rng (XML) and http://ilrt.org/discovery/2004/03/rxr/rxr.rnc (Compact); the W3C XML Schema at http://ilrt.org/discovery/2004/03/rxr/rxr.xsd and the DTD at http://ilrt.org/discovery/2004/03/rxr/rxr.dtd.

Bibliography

[RDF/XML Revised]
D. Beckett, RDF/XML Syntax Specification (Revised), W3C Recommendation, 10 February 2004, http://www.w3.org/TR/rdf-syntax-grammar/.
[RDF Core Issues List]
W3C RDF Core Working Group, RDF Core Issues List, http://www.w3.org/2000/03/rdf-tracking/.
[RDF Core WG]
W3C RDF Core Working Group, http://www.w3.org/2001/sw/RDFCore/.
[WebOnt WG]
W3C Web Ontology Working Group, http://www.w3.org/2001/sw/WebOnt/.
[Striped RDF/XML]
D. Brickley, Understanding the Striped RDF/XML Syntax, World Wide Web Consortium (W3C), October 2001, http://www.w3.org/2001/10/stripes/
[Unparsing]
J. J. Carroll, Unparsing RDF/XML, Proceedings of the eleventh international conference on World Wide Web, pages 454-461. ACM Press, 2002, http://www2002.org/CDROM/refereed/184/
[TRiX]
J. J. Carroll and P. Stickler, RDF Triples in XML, HP Labs Technical Report HPL-2003-268, 11 February 2004. http://www.hpl.hp.com/techreports/2003/HPL-2003-268.html
[RDF/XML Retrospective]
D. Beckett, A retrospective on the development of the RDF/XML Revised Syntax, ILRT, University of Bristol. ILRT Research Report Number: 1017, 11 June 2003. http://www.ilrt.bris.ac.uk/publications/researchreport/rr1017/report_html?ilrtyear=2003
[Turtle]
D. Beckett, Turtle - Terse RDF Triple Language, ILRT, University of Bristol, first announced January 2004. http://www.ilrt.bris.ac.uk/discovery/2004/01/turtle/
[RDF/XHTML]
M. Birbeck, XHTML and RDF, x-port.net Ltd, 14 February 2004. http://www.w3.org/MarkUp/2004/02/xhtml-rdf.html
[GRDDL]
D. Connolly, Gleaning Resource Descriptions from Dialects of Languages (GRDDL), W3C Coordination Group Note, 16 March 2004. http://www.w3.org/2004/01/rdxh/spec