Proposed Encodings for Dublin Core Metadata

Note: This is an OBSOLETE document, please see the official RFC 2731: Encoding Dublin Core Metadata in HTML

Version $Id: dc-encoding.shtml,v 1.7 1999/12/17 15:28:16 djb1 Exp $

Dave Beckett
D.J.Beckett@ukc.ac.uk
Computing Laboratory
University of Kent at Canterbury

1. Overview

The Dublin Metadata Core Element Set (or DCES for short) is a set of thirteen metadata elements created by the attendees of the OCLC/NCSA Metadata Workshop[1] at Dublin, Ohio, USA in April 1995. The DCES was designed for authors and information providers to describe Document-Like Objects or DLOs. It is intended to be easy to use and to promote "Metadata For All" by allowing interchange and interoperability of metadata.

This document describes proposals for:

The syntax of the Element names.
The syntax of the Element qualifiers.
The standard elements and qualifiers.
Recommendations on extending the DCES -- adding new elements, qualifiers and/or qualifier values.
Encoding the DCES into Attribute : Value metadata formats.
Encoding the DCES inside HTML pages.
Encoding the DCES in RFC822, News, IAFA Templates, WhoIs++ and PNG Images.

2. Introduction to the Dublin Core Element Set

The 13 core Elements are:

Subject: The field of knowledge to which the work belongs or topic (subject, content, abstract).
Title: The name of the resource.
Author: The person(s) or organisation(s) primarily responsible for the intellectual content of the resource.
Publisher: The agent or agency responsible for making the resource available.
OtherAgent: The person(s) or organisation(s), such as editors and transcribers, who have made other significant intellectual contributions to the work.
Date: Dates related to the resource.
ResourceType: The genre of the resource, such as novel, poem, or dictionary.
Form: The physical manifestation of the resource, such as Postscript file or HTML document.
Identifier: String or number used to uniquely identify the resource.
Relation: Relationship to other resources.
Source: Resources, either print or electronic, from which this resource is derived, if applicable.
Language: Language of the intellectual content.
Coverage: The spatial locations and temporal durations characteristic of the resource.

These 13 elements can also have qualifiers applied to them to allow further description of the encoded metadata. The qualifiers have a name and value so a fully-qualified Dublin Core Element has four parts:

The element name (e.g. "Author")
The element value (e.g. "D.J.Beckett@ukc.ac.uk")
Optional qualifier(s) qualifying the element and element value. There can be zero or more qualifiers and each one has a:
1. Name (e.g. "Scheme") and a
2. Value (e.g. "Email" for qualifier Scheme)

The DCES is also extensible in several ways. New qualifier values can be used, new qualifiers can be added, new elements can be created and the core itself can be made just one package in a larger framework of metadata. The ways in which changes can be made are discussed in Section 4.

3. Syntax of the DCES

3.1 Syntax of Element names

The Dublin Core Element Set consists of the 13 elements defined in [1]:

Subject Title Author Publisher OtherAgent Date ResourceType Form Identifier Relation Source Language Coverage

These element names are reserved and are case independent.

Element names must follow this syntax:

letter (letter | number)*
where letter is one of 'A' to 'Z' or 'a' to 'z', number is one of '0' to '9' and * means zero or more repetitions of the preceding group.

The resulting element names are case independent.

It is recommended that in choosing new element names, BiCapitalisation is used to separate words or sub-words as necessary. e.g. a new element "AccessRights".
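The naming rule above can be checked mechanically. The following is a minimal sketch in Python, not part of the proposal itself; the function names `is_valid_name` and `is_core_element` are hypothetical and chosen only for illustration.

```python
import re

# Grammar from Section 3.1: letter (letter | number)*
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# The 13 reserved core element names, as listed in Section 3.1.
CORE_ELEMENTS = ("Subject Title Author Publisher OtherAgent Date ResourceType "
                 "Form Identifier Relation Source Language Coverage").split()


def is_valid_name(name):
    """Return True if the whole of `name` matches letter (letter | number)*."""
    return NAME_RE.fullmatch(name) is not None


def is_core_element(name):
    """Case-independent test against the reserved core element names."""
    return name.lower() in {core.lower() for core in CORE_ELEMENTS}
```

With this sketch, "AccessRights" is a valid new name, while "Access-Rights" and "9Lives" are rejected by the grammar.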

3.2 Syntax of qualifiers

For each Dublin Core Element there can be zero or more qualifiers that qualify the element or the element value. There are several standard qualifiers defined for the 13 core elements:

Core Element  Qualifiers
==========================
Title         Scheme Type(*)
Subject       Scheme Type(*)
Author        Scheme Type(*)
Publisher     Scheme Type(*)
OtherAgent    Scheme Type(*) Role
Date          Scheme Type
ResourceType  Scheme Type(*)
Form          Scheme Type(*)
Identifier    Scheme Type(*)
Relation      Scheme Type Identifier
Source        Scheme Type(*)
Language      Scheme Type(*)
Coverage      Scheme Type

A full enumeration of the standard (and other) qualifiers, their values and their meaning is given in [2].
(*) indicates qualifiers that are reserved in that document but currently have no defined values.

New qualifiers can be defined for existing elements or for new elements and must follow this syntax:

letter (letter | number)*
where letter is one of 'A' to 'Z' or 'a' to 'z', number is one of '0' to '9' and * means zero or more repetitions of the preceding group.

The resulting qualifier names are case independent.

NOTE

Qualifiers CANNOT be duplicated within the same element.
Qualifiers CANNOT have an empty (all whitespace) or null value.
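The two restrictions in the note, together with the name syntax, could be enforced with a small check like the one below. This is an illustrative Python sketch only; `check_qualifiers` and its message strings are hypothetical, and qualifiers are assumed to be held as a list of (name, value) pairs.

```python
import re

NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9]*")


def check_qualifiers(qualifiers):
    """Return a list of problems found in a list of (name, value) pairs."""
    problems = []
    seen = set()
    for name, value in qualifiers:
        if not NAME_RE.fullmatch(name):
            problems.append("bad qualifier name: %r" % name)
        key = name.lower()  # qualifier names are case independent
        if key in seen:
            problems.append("duplicate qualifier: %r" % name)
        seen.add(key)
        if value is None or value.strip() == "":
            problems.append("empty value for qualifier: %r" % name)
    return problems
```

An empty result means the qualifiers satisfy the rules stated in this section; it says nothing about whether the values are meaningful for a given scheme.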

4. Extending the DCES

If you want to extend the Dublin Core Elements...

First see if an element in the core 13 elements can do what you want, possibly by using one of the qualifiers given in [2].
Will adding a new qualifier value to an existing appropriate element work?
How about adding a new qualifier and new qualifier values for it, to an existing appropriate element?
If none of the above are appropriate, then it is probably worth adding a new element with new qualifiers and qualifier values.

However, if many new elements are being added, it is strongly recommended to design a domain-specific metadata package instead and to use the Warwick Framework as described in [3], [4], [5] and [6] to combine it with a package of the core elements.

5. Encoding the DCES

5.1 Recommended Encoding Formats

The syntax working group at the Warwick metadata meeting recommended three encodings in its report[7]:

SGML[8]
    Advantages: Precise and can be validated. Canonical syntax.
    Disadvantages: Frightening to casual user.
Embedded in an HTML document in document <HEAD>
    Advantages: Easy to use, familiar. Minimal-effort approach.
    Disadvantages: Imprecise, unconstrained.
Embedded as a reference to an independent SGML document (in HTML or other documents)
    Advantages: Allows complex records.
    Disadvantages: Separates metadata and data - consistency, maintenance problems.

An example SGML DTD for the DCES was given in REF and an extended one including the Warwick Framework was given in [4]. An HTML syntax was also given in the latter document but it has encoding problems with qualifiers. These are reduced in the encoding method described here. Encoding the elements in HTML is described in Section 6.

5.2 Encoding Dublin Core into Attribute : Value form

Each Dublin Core element can have many parts (element name, value and optionally qualifier names and values), and to encode it in a flat Attribute : Value metadata form such as IAFA templates, these parts must be combined. This is done by mapping the element name to the Attribute, and the qualifiers and element value to the Value. The latter is done by prefixing the element value with the qualifiers enclosed in parentheses '(' and ')', which gives (for the example above):

Attribute: Author
Value: (Scheme=email)D.J.Beckett@ukc.ac.uk

5.2.1 Encoding qualifier names and values

There are some encoding rules to allow more flexibility and to allow the full range of element and qualifier values to be encoded without ambiguity.

For each qualifier, the name and value are joined with '=' between them. If the qualifier value contains any of the characters '(' ')' '%' ',' or starts or ends with whitespace (not recommended), each such character is replaced with '%' followed by the two hexadecimal digits of its ASCII code, i.e.:

' ' is replaced with %20
'%' is replaced with %25
'(' is replaced with %28
')' is replaced with %29
',' is replaced with %2C

(This is the %-encoding used in URLs to escape special characters.)

Example: For element "Date", qualifier "Scheme" with mythical value "ISO1234(1996)" and element-value "1996-01-01:01:01:01" this would give:

Attribute: Date
Value: (Scheme=ISO1234%281996%29)1996-01-01:01:01:01
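The escaping rule can be sketched in a few lines of Python. This is an illustration rather than a reference implementation: the names `escape_qualifier_value` and `encode_single` are hypothetical, and the sketch escapes every space rather than only leading or trailing ones, which is an assumption made here so that a decoder which ends a qualifier value at whitespace can read the result back.

```python
# Escape table taken from the list in Section 5.2.1.
ESCAPES = {" ": "%20", "%": "%25", "(": "%28", ")": "%29", ",": "%2C"}


def escape_qualifier_value(value):
    """Replace each special character with its %XX form.

    Assumption: all spaces are escaped, not just leading/trailing ones.
    """
    return "".join(ESCAPES.get(ch, ch) for ch in value)


def encode_single(qualifier, qualifier_value, element_value):
    """Build a Value string with one qualifier in front of the element value."""
    return "(%s=%s)%s" % (qualifier,
                          escape_qualifier_value(qualifier_value),
                          element_value)


print(encode_single("Scheme", "ISO1234(1996)", "1996-01-01:01:01:01"))
```

Run on the Date example above, the sketch prints the same Value string as shown in the example.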

For simple decoders that are not interested in the qualifiers, a count of the number of '(' and ')'s seen will allow the qualifiers to be skipped.
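A simple decoder of this kind might look like the Python sketch below. It is a hedged illustration: `skip_qualifiers` is a hypothetical name, and instead of keeping a running count it relies on the fact that '(' and ')' inside qualifier values are %-escaped, so the first ')' always closes the current group. It also honours the doubled '((' convention from Section 5.2.2.

```python
def skip_qualifiers(encoded):
    """Return the element value, ignoring any leading qualifier groups."""
    rest = encoded.lstrip()
    # A single '(' opens a qualifier group; '((' does not.
    while rest.startswith("(") and not rest.startswith("(("):
        close = rest.find(")")
        if close == -1:
            break  # malformed input: return what is left unchanged
        rest = rest[close + 1:].lstrip()
    # A doubled '((' marks an element value that itself begins with '('.
    return rest[1:] if rest.startswith("((") else rest
```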

The output format can also include multiple qualifiers, in two forms. The extra qualifiers can either be added inside the same parentheses, separated by ',', or be appended in an extra parenthesised group of their own.

Example: For element "Relation" value "http://www.oclc.org/" with qualifiers/values of "Scheme" of "URN" and "Type" of "ParentOf" this would give these two options:

Attribute: Relation
Value: (Scheme=URN,Type=ParentOf)http://www.oclc.org/

OR

Attribute: Relation
Value: (Scheme=URN)(Type=ParentOf)http://www.oclc.org/
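
Both output forms can be produced from the same list of qualifiers. The Python sketch below is illustrative only; `encode_grouped` and `encode_separate` are hypothetical names, and the qualifier values are assumed to have been %-escaped already.

```python
def encode_grouped(qualifiers, element_value):
    """All qualifiers inside one pair of parentheses, separated by ','."""
    if not qualifiers:
        return element_value
    inner = ",".join("%s=%s" % (name, value) for name, value in qualifiers)
    return "(%s)%s" % (inner, element_value)


def encode_separate(qualifiers, element_value):
    """Each qualifier in its own pair of parentheses."""
    groups = "".join("(%s=%s)" % (name, value) for name, value in qualifiers)
    return groups + element_value


quals = [("Scheme", "URN"), ("Type", "ParentOf")]
print(encode_grouped(quals, "http://www.oclc.org/"))
print(encode_separate(quals, "http://www.oclc.org/"))
```

The two printed lines correspond to the two Value forms of the Relation example.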

For clarity, whitespace can be added anywhere in the attribute value before or after the parentheses, the '='s or the ','s, i.e. in the places indicated by *:

*(*Scheme*=*URN*,*Type*=*ParentOf*)*http://www.oclc.org/

which can make the above look like:

Attribute: Relation
Value: (Scheme= URN, Type= ParentOf) http://www.oclc.org/

etc.

In Internet terms: be liberal in accepting the format and conservative in creating it. Whitespace around any part of the qualifier names and values is allowed when writing and must be ignored when reading.

5.2.2 Encoding element value

If the element value begins with '(' then there is some ambiguity as to whether a qualifier is beginning or not. In this case, the initial '(' should be duplicated, since '((' cannot legally begin a qualifier.

For example for element "Identifier" with (rather bogus) value "(none)" it would be encoded as:

Attribute: Identifier
Value: ((none)

This has the advantage that the original value is still available as a sub-string of the encoded value and can be easily extracted.
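Putting Sections 5.2.1 and 5.2.2 together, a complete encoder for the Value part could be sketched as follows. This is a Python illustration under stated assumptions: `encode_value` is a hypothetical name, qualifiers are passed as (name, value) pairs, each qualifier gets its own parenthesised group, and every space in a qualifier value is escaped.

```python
# Escape table taken from the list in Section 5.2.1.
ESCAPES = {" ": "%20", "%": "%25", "(": "%28", ")": "%29", ",": "%2C"}


def encode_value(qualifiers, element_value):
    """Build the Value string from qualifier pairs and an element value."""
    groups = []
    for name, qualifier_value in qualifiers:
        escaped = "".join(ESCAPES.get(ch, ch) for ch in qualifier_value)
        groups.append("(%s=%s)" % (name, escaped))
    # Section 5.2.2: double a leading '(' so that it cannot be mistaken
    # for the start of a qualifier.
    if element_value.startswith("("):
        element_value = "(" + element_value
    return "".join(groups) + element_value
```

Note that the doubling is applied whether or not qualifiers are present, because a decoder looks for a further qualifier group after each one it has read.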

5.3 Decoding Dublin Core from Attribute : Value encoded form

On decoding, the Attribute is the element name. The Value must be parsed (from left to right) using the following algorithm:

1. If '((' is seen, the element value starts at the second '('; stop.
2. Otherwise, if a single '(' is seen, a qualifier is beginning and it terminates with ')'. Once this is parsed, continue at step 1.
3. Otherwise the element value starts at this character; stop.

See also the example PERL5 code in [PERL1].

When parsing a qualifier, use the following algorithm:

1. Ignore any leading whitespace.
2. The qualifier name starts here and ends before the first '=' or any amount of whitespace.
3. Skip any amount of whitespace, the '=' and any amount of whitespace after it.
4. The qualifier value starts here and ends before the first ')', ',' or whitespace seen.
5. Skip a single ',' or any amount of whitespace and continue at step 1; if ')' is seen instead, stop.

See also the example PERL5 code in [PERL2].
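
For comparison, the two algorithms above can also be sketched in Python. This is an illustration, not a replacement for the Perl examples: the names `decode`, `unescape`, `PAIR_RE` and `ESCAPE_RE` are hypothetical, and the qualifier steps are implemented with one regular expression over the text between '(' and ')' rather than character by character.

```python
import re

# One "name = value" pair; pairs are separated by ',' or whitespace.
PAIR_RE = re.compile(r"([^\s=,]+)\s*=\s*([^\s,]*)")
ESCAPE_RE = re.compile(r"%([0-9A-Fa-f]{2})")


def unescape(text):
    """Turn %XX sequences back into the characters they stand for."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


def decode(encoded):
    """Split a Value into ([(qualifier, value), ...], element_value)."""
    qualifiers = []
    rest = encoded.lstrip()
    while True:
        if rest.startswith("(("):
            # Step 1: the element value starts at the second '('.
            return qualifiers, rest[1:]
        if rest.startswith("("):
            # Step 2: a qualifier group runs up to the next ')'.
            close = rest.find(")")
            if close == -1:
                raise ValueError("unterminated qualifier group")
            for name, value in PAIR_RE.findall(rest[1:close]):
                qualifiers.append((name, unescape(value)))
            rest = rest[close + 1:].lstrip()
            continue
        # Step 3: anything else is the element value.
        return qualifiers, rest
```

Because ')' is always %-escaped inside qualifier values, searching for the next ')' is enough to find the end of a group.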

6. Encoding Dublin Core Metadata in HTML

The Warwick metadata meeting syntax working group report[7] and the Embedding Metadata in HTML 2.0[9] document describe ways to add metadata to HTML using information stored in the <HEAD> of the document, either directly or by reference. This section describes how to do this in detail for the Dublin Core without breaking any existing HTML 2.0 standards (i.e. the HTML will still validate with a SGML parser and the appropriate DTD).

6.1 Including the metadata in the HTML document

Firstly, use the encoding method above in Section 5.2 for the given element parts. This will give an Attribute and a Value for each element. These are then written in HTML by encoding them in the <META> tags.

<META> tags are used inside <HEAD> of HTML documents for metadata. The Dublin Core is a specialised part of that and to distinguish it from other metadata that may be encoded, the Attribute / Dublin Core element name is prefixed with "DC.". This is then used in the <META> NAME attribute and the value is used in the <META> CONTENT attribute. For example, if the metadata is:

Attribute: Author
Value: (Scheme=email)D.J.Beckett@ukc.ac.uk

it is encoded in HTML as:

 <HTML>
   <HEAD>
     <META NAME="DC.Author" CONTENT="(Scheme=email)D.J.Beckett@ukc.ac.uk">
   </HEAD>
   ...
 </HTML>

Note: The element name, i.e. the value of the NAME attribute, is still case independent, so dc.author is equivalent to DC.Author.

Note: The double quotes (") around the NAME and CONTENT attribute values are compulsory.
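
Reading the elements back out of a page can be sketched with the Python standard library's `html.parser` module. The class and function names below (`DCMetaExtractor`, `extract_dc`) are hypothetical, and the sketch simply treats any <META> whose NAME starts with "DC." in any letter case as a Dublin Core element, since element names are case independent.

```python
from html.parser import HTMLParser


class DCMetaExtractor(HTMLParser):
    """Collect (element name, encoded value) pairs from DC <META> tags."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.lower().startswith("dc."):
            self.elements.append((name[3:], attributes.get("content") or ""))


def extract_dc(html_text):
    parser = DCMetaExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.elements
```

The CONTENT strings returned are still in the encoded form of Section 5.2 and need to be decoded separately.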

It is recommended in [9] that a reference is provided via the <LINK> tag to the definition of metadata that are put in headers. The proposed convention applied to the DCES means adding a reference to the document http://purl.org/metadata/dublin_core_elements like this:

 <HTML>
   <HEAD>
     <META NAME="DC.Author" CONTENT="(Scheme=email)D.J.Beckett@ukc.ac.uk">
     ... other <META> tags for other DCES elements
     <LINK REL="SCHEMA.dc" HREF="http://purl.org/metadata/dublin_core_elements">
   </HEAD>
   ...
 </HTML>
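
Generating such a <HEAD> from a list of already encoded elements could look like the Python sketch below. The function name `dc_head` is hypothetical. Escaping the CONTENT value with `html.escape` is an assumption added here so that a value containing a double quote does not break the attribute; it is not something this proposal specifies.

```python
import html

DC_SCHEMA_URL = "http://purl.org/metadata/dublin_core_elements"


def dc_head(elements):
    """Render (element name, encoded value) pairs as a <HEAD> fragment."""
    lines = ["<HEAD>"]
    for name, value in elements:
        # Assumption: quote characters in the value are entity-escaped.
        content = html.escape(value, quote=True)
        lines.append('  <META NAME="DC.%s" CONTENT="%s">' % (name, content))
    lines.append('  <LINK REL="SCHEMA.dc" HREF="%s">' % DC_SCHEMA_URL)
    lines.append("</HEAD>")
    return "\n".join(lines)


print(dc_head([("Author", "(Scheme=email)D.J.Beckett@ukc.ac.uk")]))
```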

It is recommended that for other elements or schemes that are not in the core, <LINK>s to definitions are added if they are known.

Note: According to the HTML 2.0 DTD, <LINK> tags must appear after all the <META> tags. FIXME

Note: The following syntax proposed in <http://info.ox.ac.uk/%7Elou/wip/metadata.syntax.html> is not supported since it requires breaking HTML validation by including illegal characters in the NAME attribute:

 <META NAME='DC:date(ISO)' CONTENT="1993-01-23">

ISSUE: Grouping. Example

 <META name='DC.GroupStart' content='group number 42'>
    <META name='DC.Something' content="something else">
    <!-- more METAS here -->
 <META name='DC.GroupEnd' content= 'group number 42'>

6.2 Including a reference to the metadata in the HTML document

As an alternative to directly including the metadata in the <HEAD>, there can be a reference to an external document that contains the metadata; this document would probably be an SGML representation of the Dublin Core. This is done by using a <LINK> tag with attribute REL set to metadata and attribute HREF set to the URL of the actual metadata. For example:

 <HTML>
   <HEAD>
     <LINK REL="metadata" HREF="document.dces">
   </HEAD>
   ...
 </HTML>

ISSUE: REL="metadata" seems vague

It is recommended that the suffix for an external SGML encoded Dublin Core Element Set be .dces (or Warwick Framework)?

7. Other encodings

7.1 Encoding the Dublin Core in RFC822-style headers

RFC822 headers[10] are a simple attribute:value form of header that is used in email, USENET news[11], IAFA Templates[12], WhoIs++[13], HyperText Transfer Protocol (HTTP)[14] and others.

Some of these formats have restrictions on the structure or allowed character set of the attributes and values but the current Dublin Core can be encoded according to the rules in Section 5.2.

Using the example:

Attribute: Author
Value: (Scheme=email)D.J.Beckett@ukc.ac.uk

RFC822 mail header

The attribute mustn't clash with existing RFC822 headers. The prescribed way to enforce this is to prefix the header with "X-". To further make the attribute unique, it is proposed that the prefix "DC-" is also used to give:

 X-DC-Author: (Scheme=email) D.J.Beckett@ukc.ac.uk

It is recommended in RFC822 that long lines (80 or more characters) are wrapped / broken at spaces. This is possible by the use of continuation lines which begin with some whitespace. This has implications on encoding values that contain whitespace. It is recommended that each run of whitespace (spaces, linefeeds, tabs) in an element value is replaced with a single space and the resulting value wrapped at convenient spaces.

On reading an RFC822-encoded Dublin Core element, the initial whitespace at the beginning of continued lines and the previous linefeed should be replaced with a single space before parsing further.
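
The folding and unfolding rules can be sketched in Python with the standard `textwrap` module. The function names `fold_header` and `unfold_header`, the 78-column default and the four-space continuation indent are all assumptions made for this illustration.

```python
import textwrap


def fold_header(element, value, width=78):
    """Return the header as a list of lines, wrapped at spaces."""
    # Replace every run of whitespace in the value with a single space.
    flat = " ".join(value.split())
    return textwrap.wrap("X-DC-%s: %s" % (element, flat),
                         width=width,
                         subsequent_indent="    ",  # continuation lines
                         break_long_words=False,
                         break_on_hyphens=False)


def unfold_header(lines):
    """Join continuation lines with single spaces; return (name, value)."""
    joined = lines[0]
    for continuation in lines[1:]:
        joined += " " + continuation.lstrip()
    name, _, value = joined.partition(":")
    return name, value.strip()
```

Unfolding a folded header gives back the whitespace-normalised value, not necessarily the original one, since line breaks and repeated spaces in the original are not preserved.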

For example:

Element: Title
Value: (Scheme=None) Online Computer Library Center (OCLC) /
National Center for Supercomputing Applications (NCSA) Metadata Workshop Report

could be encoded as:

 X-DC-Title: (Scheme = None) Online Computer Library Center
             (OCLC) / National Center for Supercomputing Applications
             (NCSA) Metadata Workshop Report

The problem with this encoding is the loss of whitespace.

USENET News

The same encoding for mail headers can be used.

IAFA Templates

The proposed draft standard for IAFA Templates allowed #- to introduce experimental attributes but it is recommended that ....

7.2 Encoding the Dublin Core in PNG Images

The Portable Network Graphics (PNG) image format[15] allows text keywords and values to be stored. The standard restricts the keywords and values to the ISO 8859-1 (Latin-1)[16] character set but this is sufficient for the Dublin Core. The values can contain newlines (decimal 10) but other control characters are not recommended.

It is recommended that the prefix "DC-" is used to encode the Dublin Core element name as the PNG text keyword and the text value field used to encode the element value and qualifier values according to the rules in Section 5.2.
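
As an illustration, the bytes of a PNG tEXt chunk carrying one element could be built as in the Python sketch below. The chunk layout (length, type, data, CRC over type and data, with keyword and text separated by a null byte) and the 1 to 79 byte keyword limit come from the PNG specification; the function name `png_text_chunk` is hypothetical, and the sketch does not insert the chunk into an image file.

```python
import struct
import zlib


def png_text_chunk(element, encoded_value):
    """Return the bytes of a tEXt chunk for one Dublin Core element."""
    keyword = ("DC-" + element).encode("latin-1")
    if not 1 <= len(keyword) <= 79:
        raise ValueError("PNG keywords must be 1 to 79 bytes long")
    # tEXt data: keyword, a null separator, then the Latin-1 text.
    data = keyword + b"\x00" + encoded_value.encode("latin-1")
    crc = zlib.crc32(b"tEXt" + data) & 0xFFFFFFFF
    return (struct.pack(">I", len(data)) + b"tEXt" + data +
            struct.pack(">I", crc))
```

A value containing characters outside Latin-1 makes `encode` raise an error, which matches the character set restriction noted above.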

PNG has some standard keywords already defined and this is a proposed mapping between them and the Dublin Core elements:

PNG Text Keyword  Description                                  Dublin Core Element (recommended scheme)
======================================================================================================
Title             Short (one line) title or caption for image  DC-Title
Author            Name of image's creator                      DC-Author (a)
Description       Description of image (possibly long)         DC-Subject
Copyright         Copyright notice                             None (b)
Creation Time     Time of original image creation              DC-Date (Type=Creation, Scheme=RFC822) (c)
Software          Software used to create the image            DC-Source (Scheme=Software)
Disclaimer        Legal disclaimer                             None (b)
Warning           Warning of nature of content                 None (d)
Source            Device used to create the image              DC-Source (Scheme=Device)
Comment           Miscellaneous comment; conversion from       None
                  GIF comment

(a) It isn't clear if this is name and/or email address but name <email-address> is probably common.
(b) Should go in an external terms and conditions container.
(c) The standard notes that the RFC1123 date format should be used, which is the same as the RFC822 date format but with a requirement to use 4-digit years.
(d) Should go in an external ratings container.

7.3 Encoding the Dublin Core in X.500/LDAP

See the Representing the Dublin Core within X.500 and LDAP proposal by Hamilton et al.[17]

Note: there isn't a way to encode qualifiers in that proposal.

7.4 Encoding the Dublin Core in USMARC

REFERENCE

8. Leverage of existing schema

i.e. how to say it matches a particular ISO, IETF, other standard. MORE

9. Standardisation / Registration Authority

ISSUE: Ask meta2 list, OCLC, UKOLN, IANA, ... maybe give part of scheme space to each? MORE

10. References

[1] OCLC/NCSA Metadata Workshop Report, S. Weibel, J. Godby, E. Miller & R. Daniel, Dublin, Ohio, USA, 1995, <URL:http://www.oclc.org:5047/oclc/research/publications/weibel/metadata/dublin_core_report.html>

[2] Dublin Core Qualifiers, J. Knight & M. Hamilton, Loughborough, UK, September 1996, <URL:http://www.roads.lut.ac.uk/Metadata/DC-SubElements.html>

[3] The Warwick Metadata Workshop Report, L. Dempsey & S. Weibel, D-Lib Magazine, ISSN 1082-9873, April 1996, <URL:http://www.dlib.org/dlib/july96/07weibel.html> and <URL:http://www.ukoln.ac.uk/dlib/dlib/july96/07weibel.html>

[4] The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata, C. Lagoze, C. Lynch & R. Daniel Jr., Cornell Computer Science Technical Report TR96-1593, <URL:http://cs-tr.cs.cornell.edu/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR96-1593>

[5] The Warwick Framework, C. Lagoze, Digital Library Research Group, Cornell University, D-Lib Magazine, ISSN 1082-9873, April 1996, <URL:http://www.dlib.org/dlib/july96/lagoze/07lagoze.html> and <URL:http://www.ukoln.ac.uk/dlib/dlib/july96/lagoze/07lagoze.html>

[6] A MIME Implementation for the Warwick Framework, J. Knight and M. Hamilton, University of Loughborough, 1996, <URL:http://weeble.lut.ac.uk/MIME-WF.html>.

[7] A Syntax for Dublin Core Metadata, L. Burnard, E. Miller, L. Quin & C.M. Sperberg-McQueen, April 1996, <URL:http://info.ox.ac.uk/%7Elou/wip/metadata.syntax.html>

[8] Standard Generalized Markup Language (SGML) ISO 8879:1986, <URL:http://www.iso.ch/cate/d16387.html>

[9] An Approach for Embedding Metadata in HTML 2.0, Stuart L. Weibel et al, June 2, 1996, W3C Distributed Indexing and Searching Workshop, MIT, USA, <URL:http://www.oclc.org:5046/~weibel/html-meta.html>

[10] RFC822, .... <URL:ftp://nic.merit.edu/documents/rfc/rfc1522.txt>

[11] NEWS

[12] IAFA Templates

[13] WhoIs ++

[14] HyperText Transfer Protocol (HTTP)

[15] PNG (Portable Network Graphics) Specification 1.0, T. Boutell (Editor) et al., July 1 1996, W3C Proposed Recommendation, W3C - MIT,USA / INRIA,France, <URL:http://www.w3.org/pub/WWW/TR/PR-png-960701.html>

[16] International Organization for Standardization, "Information Processing -- 8-bit Single-Byte Coded Graphic Character Sets -- Part 1: Latin Alphabet No. 1", ISO 8859-1, 1987.

[17] Representing the Dublin Core within X.500 and LDAP, M. Hamilton m.t.hamilton@lut.ac.uk, R. Iannella renato@dstc.edu.au and J. Knight j.p.knight@lut.ac.uk, August 1996, draft document posted to Metadata Workshop II mailing list.

Appendix A: Code Examples

PERL Example 1

 sub parse_value {
   my($value)=@_;

   while (1) {
     $value =~ s/^\s*//;
     # 1. If '((' is seen...
     if ($value =~ /^\(\(/) {
       # After removing leading '(', element value is rest of value
       $value =~ s/^\(//;
       last;
     }
     # 2. If a single '(' is seen...
     if ($value =~ /^\(/) {
       # Parse rest of value as a qualifier
       $value = parse_qualifier($value);
       # OR could skip the qualifier with:
       #  $value =~ s/^\([^\)]+\)//;
       next;
     }
     # 3. Otherwise, value is rest of this string, so end.
     last;
   }
   print "Element value: '$value'\n";
 }

PERL Example 2

 sub parse_qualifier {
   my($value)=@_;

   # 0. Remove leading '('
   $value =~ s/^\(//;

   while (length $value) {
     # 1. Ignore any leading whitespace
     $value =~ s/^\s*//;

     # 5.b If a trailing ')' is seen, remove it and end
     last if $value =~ s/^\)//;

     # 2. Get qualifier name, ending before = or whitespace
     # (use '' rather than a stale $1 if no name is present)
     my $qualifier = $value =~ s/^([^\s=]+)// ? $1 : '';

     # 3. Remove '=' and whitespace around it
     $value =~ s/^\s*=\s*//;

     # 4. Get qualifier value, ending before ')', ',' or whitespace
     # (use '' rather than a stale $1 if no value is present)
     my $qualifier_value = $value =~ s/^([^),\s]+)// ? $1 : '';

     # Replace %-encoded characters with their value
     $qualifier_value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;
     print "qualifier $qualifier:'$qualifier_value'\n";

     # 5a. Remove ',' if present
     $value =~ s/^,//;
   }

   return $value;
 }

Acknowledgements

Thanks to Jon Knight for his comments.