Connecting XML, RDF and Web Technologies for Representing Knowledge on the Semantic Web

Keywords: RDF, XML, Metadata, Semantic Web, Topic Maps

Dave Beckett
Institute for Learning and Research Technology (ILRT), University of Bristol
Bristol
UK

http://www.dajobe.org/

Biography

Dave is a researcher at the ILRT, University of Bristol and works on RDF and Semantic Web projects for the Research and Teaching & Learning communities. He has been working with metadata and the Dublin Core since 1995, RDF since 1998, is a member of the W3C RDF Core Working Group, and editor of the RDF/XML syntax draft document. Dave is the author of the Redland RDF system.


Abstract


For knowledge to be usable web-wide and in interoperable ways, it should be represented using well-known and appropriate web technologies. These include the XML family of standards and the Resource Description Framework (RDF). RDF is a metadata format that uses a directed, labeled graph structure, URIs for identifiers, XML for its syntax, and a simple type and class structure for recording schema information. RDF can be used with many XML technologies such as XML Namespaces, RELAX NG, XML Schema and XSLT, and has relationships to others such as XTM, XPath, XQuery, SQL and relational databases.

When these technologies are used, the knowledge has to be described in web-wide concepts which are usually provided in what are variously called schemas, ontologies, thesauri or vocabularies. These vocabularies can include general ones such as the Dublin Core, ones from particular communities, ones derived from existing controlled vocabularies and new ones for particular purposes. It must be possible to mix and match these vocabularies, to discover them, and to understand new concepts (if only partially) so that relationships between the concepts can be made. These methods and forms are the technological basis of the Semantic Web idea.

This paper describes how these technologies are best used together, their relationships and where each of them can be appropriately applied. This is done using examples of how they are being used in different applications, communities and industries. The ongoing development of the mentioned standards is also explained.


Table of Contents


Introduction
The Web and the Semantic Web
XML for recording information
Resource Description Framework (RDF) Model and Syntax
Topic Maps
XML Schema Languages
RDF Schema Languages
Conclusions
Bibliography

Introduction

RDF (Resource Description Framework) is a simple descriptive metadata model with an XML syntax and a background in frame systems, simple knowledge representation and the DCMES (Dublin Core Metadata Element Set) [DCMES].

The Web and the Semantic Web

The current web consists primarily of content, or resources, written to be understood by the humans perceiving it (mostly by reading). This content is created mostly by writing HTML web pages with words, structure, links, presentational information, graphics and other features that are generally called markup. Apart from the structure, machines mostly cannot look at web pages and grasp what they are about. Standard information and text processing techniques such as natural language processing (NLP), word co-occurrence, and so on can be applied to the textual part of the content. More advanced use of the weblike nature of the linked pages, such as the citation model PageRank used by Google, uses the link structure to label the cited pages with better keywords, but that won't tell you that the link said:

<a href="otherpage">sucks as a resource on foo</a>

The web uses links to point to other pages which may or may not exist. This is a good design feature that allowed the web to scale, and the 404 Not Found error of HTTP meant that the web was not fragile: it did not require coordination at both ends of a link to generate the hypertext web of information. Thus links in web pages mean "may point to" and have no other property. They don't mean "is a good page about the words inside the anchor", which is what causes the citation model to break down a little when used by computers (although searching for "thing sucks" tends to work quite well).
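As a rough illustration of how little a machine gets from such a link, the following Python sketch uses only the standard library to pull out everything the markup offers: a target and some words. The class name AnchorCollector is made up for this example. Nothing in the result says whether the words praise or criticise the target.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text pieces seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


collector = AnchorCollector()
collector.feed('<a href="otherpage">sucks as a resource on foo</a>')
collector.close()
links = collector.links

# Each entry is only "this page may point to that one, using these words".
print(links)
```

A keyword-based citation model would happily index "otherpage" under "resource" and "foo"; the relationship the author intended is visible only to a human reader.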

So although the current web is great for people, who can perceive the context of the links and understand the anchor text to work out the relationship between the pages, this is no good for machines. To handle the ever increasing exchange of knowledge on the web, machine support is crucial and indeed, as richer information is webified, the new semantics it provides must be passed on to machines. This must be like the web of other content, and indeed be part of it. The web of machine-processable semantics must look like the web of markup: it has no root, weblike data must be able to point to other weblike data without coordination, and it must be able to scale. The semantic web uses the same features as the web itself:

  1. Everything is identified
  2. Must be able to support partial information or relationships; the semantic 404
  3. Scalable - anyone can say anything about anything. Corollary: there will always be more to find out
  4. Evolvable - no rigid structure, and a way to find out how new things relate to existing known things
  5. Minimal - standardize as little as possible, add more later as necessary

The main goal is to enable machines to do more work with data on the web, to let the data web be connected, and to provide a basis for layering future applications.

XML for recording information

XML is a family of formats based on the core XML specification plus a halo of related ones. The extensibility of XML comes partly from well-formedness but mostly from XML namespaces, which allow a URI to be associated with a set of names in an XML file. This enables applications to assign semantics to tags independently, by associating them with namespaces. The structure provided by XML documents is a mostly-tree of elements, anchored at the root element - a standard kind of hierarchical information structure. It is only mostly a tree because intra-document links are allowed through the ID and IDREF functionality, which lets one place in a document refer to an existing element elsewhere in the same document. The result is a self-contained description of some information in the XML document. To record different information, different XML formats are generally used, and these formats cannot be used together. This is because each of them models a tree-like hierarchical data structure whose content describes the information for the given purpose, and two such trees cannot be merged.
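The two mechanisms described here can be sketched in a few lines of Python with the standard library. The document, the http://example.org/catalog# namespace and the element names are invented for illustration. ElementTree does not read attribute types from a DTD, so plain attributes named id and ref stand in for true ID and IDREF attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing a made-up catalog namespace with Dublin Core.
DOC = """\
<catalog xmlns="http://example.org/catalog#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <person id="p1"><name>A. Author</name></person>
  <book>
    <dc:title>An Example Book</dc:title>
    <writer ref="p1"/>
  </book>
</catalog>
"""

CAT = "{http://example.org/catalog#}"

root = ET.fromstring(DOC)

# ElementTree expands every element name to {namespace-URI}local-name,
# so names from different vocabularies cannot collide.
names = [element.tag for element in root.iter()]

# The intra-document link: index elements by id, then follow the reference.
by_id = {e.get("id"): e for e in root.iter() if e.get("id") is not None}
writer = root.find(f"{CAT}book/{CAT}writer")
person = by_id[writer.get("ref")]
writer_name = person.find(f"{CAT}name").text

print(names)
print(writer_name)
```

The reference from writer to person is what makes the structure a mostly-tree rather than a tree, but it stops at the document boundary: the by_id index knows nothing about any other file.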

The web of markup (HTML) can easily be joined together since there is no single center, and the markup language came with a built-in pointer mechanism - the link - that allowed a graph of pages to be created easily. For XML, linking technologies were added afterwards. XLink (XML Linking Language) [XLink] uses URI references that may point into XML documents by means of XPointer (XML Pointer) [XPointer], which defines how XML fragment identifiers work (still under discussion at this date). A subset of XPointer is XPath (XML Path) [XPath], a method of describing elements inside XML. Together these allow XLink, XPointer and XPath-aware systems to create relationships between elements of different XML documents. Later on, schema languages for XML allowed more sophisticated typed relationships between tags to be expressed, which again requires a schema-aware XML system in order to work.

XLink has been used in a simple form in some specifications such as SVG (Scalable Vector Graphics) but does not seem to have widespread deployment in the standard XML toolbox. At present that toolbox seems to consist of XML, Namespaces, and DOM/SAX or Infoset interfaces, possibly with some validation mechanism: well-formedness checking, a DTD or a schema language such as W3C XML Schema. The latter is gaining implementations but cannot be guaranteed to be available to XML applications.

Connections:

Resource Description Framework (RDF) Model and Syntax

RDF is a simple web metadata format that is the W3C standard for describing resources on the web. The model RDF uses is a graph of information where the nodes in the graph are labeled with URIs, labeled with strings, or unlabeled. The nodes are connected by directed arcs which are labeled with URIs. This simple graph forms a web of data like the web of markup, and indeed can be part of it, and the simplicity of the model means it can be used to model many different information representations. The background to RDF is in frame systems, relational databases, and simple metadata formats such as the DCMES. These rather flat models can be directly represented in RDF, and richer structures can be constructed by using more detailed graphs.
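The graph model can be written down almost directly as a data structure. The Python sketch below is one possible rendering, not a reference implementation: the class names (URI, Literal, Blank, Arc), the example.org URIs and the helper arcs_from are all invented for illustration. Only the Dublin Core namespace is real.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class URI:
    value: str          # node or arc labeled with a URI


@dataclass(frozen=True)
class Literal:
    value: str          # node labeled with a string


@dataclass(frozen=True)
class Blank:
    local_id: str       # unlabeled node; the id is only a local handle


Node = Union[URI, Literal, Blank]


@dataclass(frozen=True)
class Arc:
    source: Node        # where the directed arc starts
    label: URI          # arcs are always labeled with URIs
    target: Node        # where the arc ends


DC = "http://purl.org/dc/elements/1.1/"
page = URI("http://example.org/page")
author = Blank("a1")

# A graph is simply a set of arcs.
graph = {
    Arc(page, URI(DC + "title"), Literal("An Example Page")),
    Arc(page, URI(DC + "creator"), author),
    Arc(author, URI("http://example.org/terms#name"), Literal("A. Author")),
}


def arcs_from(graph, node):
    """Return the arcs that leave the given node."""
    return {arc for arc in graph if arc.source == node}


print(len(arcs_from(graph, page)))
```

The first two arcs are the flat, Dublin Core style of description; the unlabeled author node shows how a richer structure is built by adding more graph rather than by changing the model.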

The RDF model is outward looking: it requires URIs for all terms, both the things being described and the properties used to describe them. RDF expects that these URIs will be defined by multiple communities, and that mixing and matching of terms will go on. Terms can also be defined inside RDF models, which gives them URIs; they are then at an equal level to terms defined with external URIs and can be reused by other documents. The URI approach means that there is openness in description; there is no restriction on what terms are used or what resources can be described. This enables good scalability like the web: graphs can be built up without coordination, and aggregation of data is easy and cheap, since URIs are used to identify concepts and the same concepts can be merged easily when they are recognized. Because the terms used in RDF are URIs, there is no way to define local or scoped concepts. URIs do allow different types of identifiers to be used, such as mailto:, that do not involve retrieval actions. RDF doesn't require URI retrieval in order to work; it just uses URIs as identifiers.
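Why aggregation is cheap can be seen in a small Python sketch. It assumes the simplest possible representation, a set of (subject, property, value) tuples of strings; the two data sources, the example.org URIs and the describe helper are hypothetical. Only the Dublin Core namespace is real.

```python
DC = "http://purl.org/dc/elements/1.1/"
REPORT = "http://example.org/report"

# Two descriptions written independently, with no coordination.
library_data = {
    (REPORT, DC + "title", "Annual Report"),
    (REPORT, DC + "date", "2002-03-25"),
}
review_data = {
    (REPORT, "http://example.org/review#rating", "4"),
    (REPORT, DC + "title", "Annual Report"),
}

# Merging is set union: statements using the same URIs collapse together.
merged = library_data | review_data


def describe(graph, subject):
    """Return every (property, value) pair known about a subject."""
    return {(p, o) for (s, p, o) in graph if s == subject}


print(len(merged))
print(sorted(describe(merged, REPORT)))
```

The duplicate title statement disappears in the union because both sources used the same URIs, and the rating from an unfamiliar vocabulary is carried along without either source needing to know about the other.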

The RDF model is encoded in concrete form by a syntax that is written in XML. This syntax uses XML namespaces to define the groups of terms that may be used in the mixing-and-matching style. The syntax is undergoing a revision by the W3C RDF Core Working Group [RDF Core WG] to update the original 1998 specification after comments and the development of newer specifications. The RDF syntax was originally defined in terms of a BNF grammar over XML, with no DTD, since a DTD could not express the flexibility that RDF required. Subsequently, several XML schema languages and other XML technologies have been developed.
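To make the role of namespaces in the syntax concrete, here is a deliberately tiny reader written in Python with the standard library. It is a sketch that assumes a very small subset of RDF/XML (rdf:Description elements with rdf:about, and property elements holding either text or rdf:resource) and is in no way the full grammar; the function names and the example.org URIs are invented for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page">
    <dc:title>An Example Page</dc:title>
    <dc:relation rdf:resource="http://example.org/other"/>
  </rdf:Description>
</rdf:RDF>
"""


def property_uri(tag):
    """Turn ElementTree's {namespace}local form into namespace + local."""
    namespace, local = tag[1:].split("}", 1)
    return namespace + local


def read_simple_rdfxml(text):
    """Read the small subset described above into a list of statements."""
    statements = []
    root = ET.fromstring(text)
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for prop in description:
            target = prop.get(f"{{{RDF}}}resource")
            if target is not None:
                value = ("resource", target)        # arc to a URI node
            else:
                value = ("literal", prop.text or "")  # arc to a string node
            statements.append((subject, property_uri(prop.tag), value))
    return statements


statements = read_simple_rdfxml(RDF_XML)
for statement in statements:
    print(statement)
```

The property URI is formed by joining the namespace URI to the element's local name, which is how a namespace declaration groups a set of terms that can then be mixed freely with terms from other namespaces.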

The current working draft for the RDF/XML Syntax [RDF/XML Syntax (Revised)] re-expresses the syntax based on the [XML Information Set], in the style of a SAX-like stream of events. [RELAX NG Non-XML Syntax] is used in the document as a help to check the structure of RDF/XML, and [XML Base] support was added as a useful addition to the language. The specification is also being updated for better internationalization support following the recent [Character Model for the WWW].

The XML syntax for RDF encodes a node- and edge-labeled directed graph. This is very close to the data model proposed in the SOAP Data Model section of [SOAP Part 2: Adjuncts]: "The SOAP data model represents information as a directed, edge-labeled graph of nodes."

Connections:

Topic Maps

I am no expert on TMs (Topic Maps) [Topicmaps.Org], so this is an overview of the issues as I understand them. Many people have examined how similar they are to RDF, and indeed both can address similar applications. TMs have a richer model than RDF that includes the scoping of the relations (associations) between concepts. The concepts or subjects are identified by subject indicators, which allow various forms of values, not just URIs, and provide ways to identify concepts that are more abstract without giving them global names; this helps with modeling the information correctly without requiring names for entities that have no identifier. RDF has recently emphasized this more in the concept of unnamed nodes in the RDF graph, which can be identified by a full description rather than by an identifier on the node. TMs have an XML syntax and a processing model for manipulating multiple topic maps and merging them. There have been demonstrations of RDF systems, such as querying, implemented using TMs and vice versa, although the mapping isn't generally quite one-to-one, and there is as yet no official mapping defined.

Connections:

XML Schema Languages

Powerful XML schema languages such as W3C [XML Schema Part 0: Primer] (XSD) and [RELAX NG Specification] have been developed that can be used with XML to enable simple to very sophisticated validation of the structure of the XML tree and datatyping validation of tag content. XSD uses XML namespaces to attach an XSD schema, so documents with mixed namespaces need to import the schemas for each. This means that all terms must generally be known when the document and applications are written, unless the schemas are passed along. This gives strict validation like a database schema but limits the flexibility if new terms need to be added.

Connections:

RDF Schema Languages

Validation of RDF is rather lightweight given the expectation that unexpected terms may turn up at any point. Profiles of the model can be restricted and validated for particular applications. For this, there is an RDF Schema Language [RDF Schema] which provides simple schema support for defining the relationships between the basic concepts in RDF - resource types and property types, which are instances of Class and Property respectively. These can be related in subclass or subproperty graphs. Again, the relationships are not hierarchies: loops are allowed. This leaves higher level languages that use RDF free to restrict this further on application demand, since some of them require loops in their type relationships. The schema language also contains domain and range constraints in the type system. These are not the same as XSD restrictions, which apply to the XML in sophisticated ways; the RDF constraints apply to what types of resource are allowed as the subject or object of an RDF statement.
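A minimal Python sketch of these two ideas follows, under stated assumptions: subclass relationships are held in a plain dictionary, domain and range are treated as constraints to be checked (the reading given in the text above), and the ex: names, data and function names are all hypothetical.

```python
def superclasses(cls, sub_class_of):
    """All classes reachable from cls via subclass arcs, tolerating loops."""
    seen = set()
    pending = [cls]
    while pending:
        current = pending.pop()
        for parent in sub_class_of.get(current, ()):
            if parent not in seen:      # the seen set is what makes loops safe
                seen.add(parent)
                pending.append(parent)
    return seen


def types_of(resource, rdf_type, sub_class_of):
    """Declared types of a resource plus everything they are subclasses of."""
    types = set()
    for cls in rdf_type.get(resource, ()):
        types.add(cls)
        types |= superclasses(cls, sub_class_of)
    return types


def satisfies_domain_and_range(statement, rdf_type, sub_class_of, domain, range_):
    """Check the subject against the domain and the object against the range."""
    subject, prop, obj = statement
    subject_ok = (prop not in domain
                  or domain[prop] in types_of(subject, rdf_type, sub_class_of))
    object_ok = (prop not in range_
                 or range_[prop] in types_of(obj, rdf_type, sub_class_of))
    return subject_ok and object_ok


SUB_CLASS_OF = {
    "ex:Report": {"ex:Document"},
    "ex:Document": {"ex:Work"},
    "ex:Work": {"ex:Document"},     # a loop, which the model permits
}
RDF_TYPE = {"ex:report1": {"ex:Report"}, "ex:alice": {"ex:Person"}}
DOMAIN = {"ex:author": "ex:Document"}
RANGE = {"ex:author": "ex:Person"}

good = ("ex:report1", "ex:author", "ex:alice")
bad = ("ex:alice", "ex:author", "ex:report1")
print(satisfies_domain_and_range(good, RDF_TYPE, SUB_CLASS_OF, DOMAIN, RANGE))
print(satisfies_domain_and_range(bad, RDF_TYPE, SUB_CLASS_OF, DOMAIN, RANGE))
```

A property with no declared domain or range passes the check unconditionally, which matches the lightweight style of validation: unexpected terms are allowed through rather than rejected.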

More advanced schemas are available as extensions to RDF; the DAML+OIL [DAML+OIL] rules, for example, include cardinality constraints and the beginnings of the use of XSD data types. The DAML+OIL work is feeding into the work of the W3C Web Ontology Working Group [WebOnt WG], which is developing an ontology language for the semantic web. The layering of the semantic web technologies then allows other levels beyond this, such as querying and rules; these are currently experimental.

The RDF Core WG is presently addressing how to add atomic XSD data types in RDF to validate literal content (not complex types at this time) so that the range of XML data types can be used in RDF systems.
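What validating literal content against an atomic data type might look like can be sketched in Python. Everything here is an assumption made for illustration: how RDF will finally attach data types is still being decided, the choice to name types by URIs under http://www.w3.org/2001/XMLSchema# is this example's, the function names are invented, and the checks cover only simplified lexical forms of three types (no time zones on dates, no whitespace handling).

```python
import re
from datetime import date

XSD = "http://www.w3.org/2001/XMLSchema#"


def _is_integer(text):
    return re.fullmatch(r"[+-]?[0-9]+", text) is not None


def _is_boolean(text):
    return text in {"true", "false", "1", "0"}


def _is_date(text):
    # Simplified: plain YYYY-MM-DD only, which must also be a real date.
    if re.fullmatch(r"[0-9]{4}-[0-9]{2}-[0-9]{2}", text) is None:
        return False
    try:
        date.fromisoformat(text)
    except ValueError:
        return False
    return True


CHECKERS = {
    XSD + "integer": _is_integer,
    XSD + "boolean": _is_boolean,
    XSD + "date": _is_date,
}


def literal_is_valid(lexical, datatype):
    """Return True or False for known types, None when the type is unknown."""
    checker = CHECKERS.get(datatype)
    if checker is None:
        return None     # an unknown type is not an error, just not checkable
    return checker(lexical)


print(literal_is_valid("42", XSD + "integer"))
print(literal_is_valid("2002-02-30", XSD + "date"))
print(literal_is_valid("red", "http://example.org/types#colour"))
```

Returning None for an unrecognised type, rather than failing, keeps the check in line with the partial-information style of the rest of the model.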

Connections:

Conclusions

XML works best on data or documents, with namespaces used when modularity and flexibility are required. This tends to give a strict description of documents, in which unexpected content cannot be understood later on, along with validation of the XML format, terms and content via DTDs or, latterly, schema languages. RDF works best for concepts that need relating, when unexpected terms are acceptable and when heterogeneous data will be used and handled via RDF Schema, bootstrapping into higher semantic web layers and potentially being used with rules and inference. Topic maps can be used for more sophisticated scoped relationships, with clarity on use and mention of the concepts and a formal processing model for manipulating them, and can be related to RDF if necessary.

Bibliography

[DCMES]
DCMI, 2 July 1999, Dublin Core Metadata Element Set, Version 1.1: Reference Description, http://dublincore.org/documents/1999/07/02/dces/.
[XLink]
S. DeRose, E. Maler and D. Orchard, 27 June 2001, XML Linking Language (XLink), http://www.w3.org/TR/xlink/.
[XPointer]
S. DeRose, E. Maler and R. Daniel Jr, 11 September 2001, XML Pointer Language (XPointer) Version 1.0, W3C Candidate Recommendation, http://www.w3.org/TR/xptr/.
[XPath]
J. Clark and S. DeRose, 16 November 1999, XML Path Language (XPath), W3C Recommendation, http://www.w3.org/TR/xpath/.
[RDF Core WG]
W3C RDF Core Working Group, http://www.w3.org/2001/sw/RDFCore/.
[RDF/XML Syntax (Revised)]
D. Beckett, 25 March 2002, RDF/XML Syntax Specification (Revised), W3C Working Draft, http://www.w3.org/TR/rdf-syntax-grammar/.
[XML Information Set]
J. Cowan and R. Tobin, 24 October 2001, XML Information Set, W3C Recommendation, http://www.w3.org/TR/xml-infoset/.
[RELAX NG Non-XML Syntax]
J. Clark, 2001, RELAX NG Non-XML Syntax, http://www.thaiopensource.com/relaxng/nonxml/.
[XML Base]
J. Marsh, 27 June 2001, XML Base, W3C Recommendation, http://www.w3.org/TR/xmlbase/.
[Character Model for the WWW]
M. Dürst et al, 20 February 2002, Character Model for the World Wide Web 1.0, W3C Working Draft, http://www.w3.org/TR/charmod/.
[Topicmaps.Org]
XTM - Topicmaps.org, http://www.topicmaps.org/.
[XML Schema Part 0: Primer]
D.C. Fallside, 2 May 2001, XML Schema Part 0: Primer, W3C Recommendation, http://www.w3.org/TR/xmlschema-0/.
[RELAX NG Specification]
J. Clark and MURATA Makoto, 3 December 2001, RELAX NG Specification, OASIS Committee Specification, http://www.oasis-open.org/committees/relax-ng/spec-20011203.html. Latest version at http://relaxng.org/.
[SOAP Part 2: Adjuncts]
M. Gudgin, M. Hadley, J-J. Moreau, H.F. Nielsen, 17 December 2001, SOAP Version 1.2 Part 2: Adjuncts, W3C Working Draft, http://www.w3.org/TR/soap12-part2/.
[RDF Schema]
D. Brickley and R.V. Guha, 27 March 2000, Resource Description Framework (RDF) Schema Specification 1.0, W3C Candidate Recommendation, http://www.w3.org/TR/rdf-schema/.
[DAML+OIL]
DAML+OIL (March 2001), http://www.daml.org/2001/03/daml+oil-index.html.
[WebOnt WG]
W3C Web Ontology Working Group, http://www.w3.org/2001/sw/WebOnt/.