Note: This is an OBSOLETE document, please see the official RFC 2731: Encoding Dublin Core Metadata in HTML
Version $Id: dc-encoding.shtml,v 1.7 1999/12/17 15:28:16 djb1 Exp $
Dave Beckett
D.J.Beckett@ukc.ac.uk
Computing Laboratory
University of Kent at Canterbury
The Dublin Metadata Core Element Set (or DCES for short) is a set of thirteen metadata elements created by the members attending the OCLC/NCSA Metadata Workshop[1] at Dublin, Ohio, USA in April 1995. The DCES was designed for authors and information providers to describe Document-Like Objects or DLOs. It is intended to be easy to use and to promote "Metadata For All" by allowing interchange and interoperability of metadata.
This document describes proposals for:
The 13 core Elements are:
These 13 elements can also have qualifiers applied to them to allow further description of the encoded metadata. Each qualifier has a name and a value, so a fully-qualified Dublin Core Element has four parts: the element name, the element value, the qualifier name and the qualifier value.
The DCES is also extensible in several ways. New qualifier values can be used, new qualifiers can be added, new elements can be created and the core itself can be made just one package of a larger framework of metadata. The ways in which changes can be made are discussed in Section 4.
The Dublin Core Elements consist of the 13 elements defined in [1]:
Subject Title Author Publisher OtherAgent Date ResourceType Form Identifier Relation Source Language Coverage
These element names are reserved and are case independent.
Element names must follow this syntax:
letter (letter | number)*
where letter is one of 'A' to 'Z' or 'a' to 'z', number is one of '0' to '9' and * means 0 or more of the given character.
The resulting element names are case independent.
It is recommended that new element names use BiCapitalisation to separate words or sub-words as necessary, e.g. a new element "AccessRights".
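The name syntax above maps directly onto a regular expression. The following is a minimal Perl sketch, not part of the proposal itself; the helper names `is_valid_name` and `same_name` are hypothetical, and names are compared after lower-casing because they are case independent.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $name matches  letter (letter | number)*
sub is_valid_name {
    my ($name) = @_;
    return $name =~ /\A[A-Za-z][A-Za-z0-9]*\z/ ? 1 : 0;
}

# Hypothetical helper: case-independent comparison of two names
sub same_name {
    my ($first, $second) = @_;
    return lc($first) eq lc($second) ? 1 : 0;
}

# Report which candidate names satisfy the syntax
for my $name ("AccessRights", "Date2", "2ndAuthor", "Access-Rights") {
    printf "%-14s %s\n", $name, is_valid_name($name) ? "valid" : "invalid";
}
```

The same check applies unchanged to qualifier names, which follow the same syntax.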
For each Dublin Core Element there can be one or more qualifiers that qualify the element or the element value. There are several standard qualifiers defined for the 13 core elements:
Core Element    Qualifiers
==========================
Title           Scheme Type(*)
Description     Scheme Type(*)
Author          Scheme Type(*)
Publisher       Scheme Type(*)
OtherAgent      Scheme Type(*) Role
Date            Scheme Type
ResourceType    Scheme Type(*)
Form            Scheme Type(*)
Identifier      Scheme Type(*)
Relation        Scheme Type Identifier
Source          Scheme Type(*)
Language        Scheme Type(*)
Coverage        Scheme Type
A full enumeration of the standard (and other) qualifiers, their values and their meaning is given in [2].
(*) indicates reserved qualifiers in the above document that have no current value.
New qualifiers can be defined for existing elements or for new elements and must follow this syntax:
letter (letter | number)*
where letter is one of 'A' to 'Z' or 'a' to 'z', number is one of '0' to '9' and * means 0 or more of the given character.
The resulting qualifier names are case independent.
NOTE
If you want to extend the Dublin Core Elements...
However, it is strongly recommended that if a lot of new elements are being added it would be better to design a domain-specific metadata package and use the Warwick Framework as described in [3], [4], [5] and [6] to combine it with a package of the core elements.
The syntax working group at the Warwick metadata meeting recommended three encodings in its report[7], as follows:
An example SGML DTD for the DCES was given in REF and an extended one including the Warwick Framework was given in [4]. An HTML syntax was also given in the latter document, but it has encoding problems with qualifiers. These are reduced in the encoding method described here. Encoding the elements in HTML is described in Section 6.
Each Dublin Core element can have many parts (element name, element value and, optionally, qualifier names and values). To encode this in a flat Attribute : Value metadata form such as IAFA templates, the parts must be combined. This is done by mapping the element name to the Attribute, and the qualifiers and element value to the Value. The latter is done by prefixing the element value with the qualifiers enclosed in braces '(' and ')', which gives (for the example above):
Attribute: Author
Value: (Scheme=email)D.J.Beckett@ukc.ac.uk
There are some encoding rules to allow more flexibility and to allow the full range of element and qualifier values to be encoded without ambiguity.
For each qualifier name and qualifier value, the two are joined with '=' between them. If the qualifier value contains any of the characters '(' ')' '%' ',' OR starts or ends with whitespace (not recommended), then each such character is replaced with '%' followed by the two hexadecimal digits of its ASCII code, i.e.:
' ' is replaced with %20
'%' is replaced with %25
'(' is replaced with %28
')' is replaced with %29
',' is replaced with %2C
(This is the %-encoding used in URLs to escape special characters.)
Example: For element "Date", qualifier "Scheme" with mythical value "ISO1234(1996)" and element-value "1996-01-01:01:01:01" this would give:
Attribute: Date
Value: (Scheme=ISO1234%281996%29)1996-01-01:01:01:01
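The encoding rules above can be sketched in Perl as follows. This is an illustrative sketch only: the helper names `pct`, `encode_qualifier_value` and `encode_value` are hypothetical, and the sketch assumes the element value does not itself begin with '('.

```perl
use strict;
use warnings;

# Hypothetical helper: '%' followed by two hexadecimal digits of the code
sub pct {
    my ($char) = @_;
    return sprintf "%%%02X", ord $char;
}

# Hypothetical helper: %-encode '(' ')' '%' ',' anywhere in a qualifier
# value, and whitespace only where it starts or ends the value.
sub encode_qualifier_value {
    my ($v) = @_;
    my ($lead, $core, $trail) = $v =~ /\A(\s*)(.*?)(\s*)\z/s;
    $core =~ s/([()%,])/pct($1)/ge;
    s/(.)/pct($1)/gse for $lead, $trail;
    return $lead . $core . $trail;
}

# Hypothetical helper: build the flat Value from an element value and an
# ordered list of [name, value] qualifier pairs.
# Assumption: $element_value does not begin with '('.
sub encode_value {
    my ($element_value, @qualifiers) = @_;
    my $prefix = join '',
        map { sprintf "(%s=%s)", $_->[0], encode_qualifier_value($_->[1]) }
        @qualifiers;
    return $prefix . $element_value;
}

print "Attribute: Date\n";
print "Value: ",
    encode_value("1996-01-01:01:01:01", [Scheme => "ISO1234(1996)"]), "\n";
```

Encoding '%' in the same pass as the other characters avoids encoding it twice.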
For simple decoders that are not interested in the qualifiers, a count of the number of '(' and ')'s seen will allow the qualifiers to be skipped.
The output format can also include multiple qualifiers, in two forms. The extra qualifiers can be either added inside the braces, separated by ',', OR just appended in an extra set of braces.
Example: For element "Relation", value "http://www.oclc.org/", with qualifiers/values of "Scheme" of "URN" and "Type" of "ParentOf", this would give these two options:
Attribute: Relation
Value: (Scheme=URN,Type=ParentOf)http://www.oclc.org/
OR
Attribute: Relation
Value: (Scheme=URN)(Type=ParentOf)http://www.oclc.org/
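Both forms can be produced from the same list of qualifiers. A short Perl sketch, with hypothetical helper names `combined_form` and `separate_form`, and assuming the qualifier values need no %-encoding:

```perl
use strict;
use warnings;

# Form 1: all qualifiers inside one pair of braces, separated by ','
sub combined_form {
    my ($value, @q) = @_;
    return "(" . join(",", map { $_->[0] . "=" . $_->[1] } @q) . ")" . $value;
}

# Form 2: one pair of braces per qualifier, appended one after another
sub separate_form {
    my ($value, @q) = @_;
    return join("", map { sprintf "(%s=%s)", @$_ } @q) . $value;
}

# Ordered [name, value] pairs; these values need no %-encoding
my @relation_qualifiers = ([Scheme => "URN"], [Type => "ParentOf"]);

print "Value: ", combined_form("http://www.oclc.org/", @relation_qualifiers), "\n";
print "Value: ", separate_form("http://www.oclc.org/", @relation_qualifiers), "\n";
```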
For clarity, whitespace can be added anywhere in the attribute value before or after the braces, the '='s or the ','s, i.e. in the places indicated by *:
*(*Scheme*=*URN*,*Type*=*ParentOf*)*http://www.oclc.org/
which can make the above look like:
Attribute: Relation
Value: (Scheme= URN, Type= ParentOf) http://www.oclc.org/
etc.
In Internet terms: be liberal in accepting the format and conservative in creating it; whitespace is allowed, and ignored, around all the parts of the qualifier names / values on reading and writing.
If the element value begins with '(' then there is some ambiguity as to whether a qualifier is beginning or not. In this case, the initial '(' should be duplicated, since '((' is not legal at the start of a qualifier.
For example for element "Identifier" with (rather bogus) value "(none)" it would be encoded as:
Attribute: Identifier
Value: ((none)
This has the advantage that the original value is still available as a sub-string of the encoded value and can be easily extracted.
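The doubling rule, together with the fact that '(' and ')' are always %-encoded inside qualifier values, is enough for a simple decoder to recover the element value without interpreting the qualifiers. A minimal Perl sketch follows; the helper names `escape_element_value` and `strip_qualifiers` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: double a leading '(' so that it cannot be mistaken
# for the start of a qualifier.
sub escape_element_value {
    my ($value) = @_;
    $value =~ s/\A\(/((/;
    return $value;
}

# Hypothetical helper for a simple decoder: drop every qualifier group and
# return only the element value. Qualifier values never contain a literal
# '(' or ')', so each group ends at the first ')'.
sub strip_qualifiers {
    my ($value) = @_;
    while (1) {
        $value =~ s/\A\s+//;
        if ($value =~ s/\A\(\(/(/) { last }       # '((' marks a literal '('
        last unless $value =~ s/\A\([^)]*\)//;    # skip one "(...)" group
    }
    return $value;
}

print escape_element_value("(none)"), "\n";
print strip_qualifiers("(Scheme=URN)(Type=ParentOf) http://www.oclc.org/"), "\n";
```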
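On decoding, the Attribute is the element name. The Value must be parsed (from left to right) using the following algorithm, ignoring any leading whitespace before each step:
1. If '((' is seen, remove the leading '('; the element value is the rest of the Value, so end.
2. If a single '(' is seen, parse a qualifier as described below, then start again at step 1 with what remains.
3. Otherwise, the element value is the rest of the Value, so end.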
See also the example PERL5 code in [PERL1].
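When parsing a qualifier, use the following algorithm:
0. Remove the leading '('.
1. Ignore any leading whitespace.
2. Get the qualifier name, ending before '=' or whitespace.
3. Remove the '=' and any whitespace around it.
4. Get the qualifier value, ending before ')', ',' or whitespace, and replace %-encoded characters with their values.
5. If a ',' follows, remove it and continue from step 1; if a ')' follows, remove it and end.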
See also the example PERL5 code in [PERL2].
The Warwick metadata meeting syntax working group report[7] and the Embedding Metadata in HTML 2.0[9] document describe ways to add metadata to HTML using information stored in the <HEAD> of the document, either directly or by reference. This section describes how to do this in detail for the Dublin Core without breaking any existing HTML 2.0 standards (i.e. the HTML will still validate with an SGML parser and the appropriate DTD).
Firstly, use the encoding method above in Section 5.2 for the given element parts. This will give an Attribute and a Value for each element. These are then written in HTML by encoding them in the <META> tags.
<META> tags are used inside the <HEAD> of HTML documents for metadata. The Dublin Core is a specialised part of that, and to distinguish it from other metadata that may be encoded, the Attribute / Dublin Core element name is prefixed with "DC.". This is then used in the <META> NAME attribute and the value is used in the <META> CONTENT attribute. For example, if the metadata is:
Attribute: Author
Value: (Scheme=email)D.J.Beckett@ukc.ac.uk
it is encoded in HTML as:
<HTML>
<HEAD>
<META NAME="DC.Author" CONTENT="(Scheme=email)D.J.Beckett@ukc.ac.uk">
</HEAD>
...
</HTML>
Note: The element name, i.e. the value of the NAME attribute, is still case independent, so dc.author is equivalent to DC.Author.
Note: The double quotes (") around the NAME and CONTENT attribute values are compulsory.
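Writing the <META> tag from an Attribute / Value pair can be sketched in Perl as below. The helper names `escape_attribute` and `meta_tag` are hypothetical, and the entity escaping is an assumption added here so that a value containing '"' or '&' cannot end the quoted attribute early; the proposal itself does not discuss it.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not part of the proposal): escape the
# characters that would end or corrupt a double-quoted HTML attribute value.
sub escape_attribute {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: one <META> tag for an Attribute / Value pair, with
# the "DC." prefix on the name and compulsory double quotes.
sub meta_tag {
    my ($attribute, $value) = @_;
    return sprintf '<META NAME="DC.%s" CONTENT="%s">',
        escape_attribute($attribute), escape_attribute($value);
}

print meta_tag("Author", '(Scheme=email)D.J.Beckett@ukc.ac.uk'), "\n";
```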
It is recommended in [9] that a reference is provided via the <LINK> tag to the definition of metadata that are put in headers. The proposed convention applied to the DCES means adding a reference to the document http://purl.org/metadata/dublin_core_elements like this:
<HTML>
<HEAD>
<META NAME="DC.Author" CONTENT="(Scheme=email)D.J.Beckett@ukc.ac.uk">
... other <META> tags for other DCES elements
<LINK REL="SCHEMA.dc" HREF="http://purl.org/metadata/dublin_core_elements">
</HEAD>
...
</HTML>
It is recommended that for other elements or schemes that are not in the core, <LINK>s to definitions are added if they are known.
Note: According to the HTML 2.0 DTD, <LINK> tags must appear after all the <META> tags. FIXME
Note: The following syntax proposed in <http://info.ox.ac.uk/%7Elou/wip/metadata.syntax.html> is not supported, since it requires breaking HTML validation by including illegal characters in the NAME attribute:
<META NAME='DC:date(ISO)' CONTENT="1993-01-23">
ISSUE: Grouping. Example
<META name='DC.GroupStart' content='group number 42'>
<META name='DC.Something' content="something else">
<!-- more METAS here -->
<META name='DC.GroupEnd' content='group number 42'>
As an alternative to including the metadata directly in the <HEAD>, there can be a reference to an external document that contains the metadata; this document would probably be an SGML representation of the Dublin Core. This is done by using a <LINK> tag with the REL attribute set to metadata and the HREF attribute set to the URL of the actual metadata. For example:
<HTML>
<HEAD>
<LINK REL="metadata" HREF="document.dces">
</HEAD>
...
</HTML>
ISSUE: REL="metadata" seems vague
It is recommended that the suffix for an external SGML encoded Dublin Core Element Set be .dces (or Warwick Framework)?
RFC822 headers[10] are a simple attribute:value form of header that is used in email, USENET news[11], IAFA Templates[12], WhoIs++[13], HyperText Transfer Protocol (HTTP)[14] and others.
Some of these formats have restrictions on the structure or allowed character set of the attributes and values but the current Dublin Core can be encoded according to the rules in Section 5.2.
Using the example:
Attribute: Author
Value: (Scheme=email)D.J.Beckett@ukc.ac.uk
The attribute mustn't clash with existing RFC822 headers. The prescribed way to enforce this is to prefix the header with "X-". To further make the attribute unique, it is proposed that the prefix "DC-" is also used to give:
X-DC-Author: (Scheme=email) D.J.Beckett@ukc.ac.uk
It is recommended in RFC822 that long lines (80 or more characters) are wrapped / broken at spaces. This is possible by the use of continuation lines, which begin with some whitespace. This has implications for encoding values that contain whitespace. It is recommended that each run of whitespace (spaces, linefeeds, tabs) in an element value is replaced with a single space and the resulting value wrapped at a convenient space.
On reading an RFC822-encoded Dublin Core element, the initial whitespace at the beginning of continued lines and the previous linefeed should be replaced with a single space before parsing further.
For example:
Element: Title
Value: (Scheme=None) Online Computer Library Center (OCLC) /
National Center for Supercomputing Applications (NCSA) Metadata Workshop Report
could be encoded as:
X-DC-Title: (Scheme = None) Online Computer Library Center (OCLC) /
  National Center for Supercomputing Applications (NCSA) Metadata
  Workshop Report
The problem with this encoding is the loss of whitespace.
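The writing and reading rules for RFC822 headers can be sketched in Perl as follows. The helper names `fold_header` and `unfold_header` and the default width of 78 characters are assumptions made for this illustration. As noted above, the round trip normalises whitespace, so the original linefeed is not recovered.

```perl
use strict;
use warnings;

# Hypothetical helper: write an element as an "X-DC-" header, collapsing
# runs of whitespace to single spaces and folding at spaces so that no
# line exceeds $width characters (unless a single word is longer).
sub fold_header {
    my ($element, $value, $width) = @_;
    $width ||= 78;                      # assumed default, under 80 characters
    $value =~ s/\s+/ /g;
    $value =~ s/\A | \z//g;
    my @lines = ("X-DC-$element:");
    for my $word (split / /, $value) {
        if (length($lines[-1]) + 1 + length($word) > $width) {
            push @lines, " $word";      # continuation lines begin with a space
        } else {
            $lines[-1] .= " $word";
        }
    }
    return join "\n", @lines;
}

# Hypothetical helper: undo the folding; a linefeed plus the whitespace that
# begins the continuation line becomes a single space.
sub unfold_header {
    my ($header) = @_;
    $header =~ s/\r?\n[ \t]+/ /g;
    my ($name, $value) = $header =~ /\A([^:\s]+):\s*(.*)\z/s;
    return ($name, $value);
}

print fold_header("Title",
    "(Scheme=None) Online Computer Library Center (OCLC) /\n"
    . "National Center for Supercomputing Applications (NCSA) "
    . "Metadata Workshop Report", 70), "\n";
```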
The same encoding for mail headers can be used.
The proposed draft standard for IAFA Templates allowed #- to introduce experimental attributes but it is recommended that ....
The Portable Network Graphics (PNG) image format[15] allows text keywords and values to be stored. The standard restricts the keywords and values to the ISO 8859-1 (Latin-1)[16] character set, but this is sufficient for the Dublin Core. The values can contain newlines (decimal 10) but other control characters are not recommended.
It is recommended that the prefix "DC-" is used to encode the Dublin Core element name as the PNG text keyword and the text value field used to encode the element value and qualifier values according to the rules in Section 5.2.
PNG has some standard keywords already defined and this is a proposed mapping between them and the Dublin Core elements:
DC-Author
It isn't clear if this is name and/or email address but name <email-address> is probably common.
DC-Date (Type=Creation, Scheme=RFC822)
The standard notes that the RFC1123 date format should be used, which is the same as the RFC822 date format but with a requirement to use 4-digit years.
See the Representing the Dublin Core within X.500 and LDAP proposal by Hamilton et al.[17]
Note there isn't a way to encode qualifiers
REFERENCE
i.e. how to say it matches a particular ISO, IETF, other standard. MORE
ISSUE: Ask meta2 list, OCLC, UKOLN, IANA, ... maybe give part of scheme space to each? MORE
[1] OCLC/NCSA Metadata Workshop Report, S. Weibel, J. Godby, E. Miller & R. Daniel, Dublin, Ohio, USA, 1995, <URL:http://www.oclc.org:5047/oclc/research/publications/weibel/metadata/dublin_core_report.html>
[2] Dublin Core Qualifiers, J. Knight & M. Hamilton, Loughborough, UK, September 1996, <URL:http://www.roads.lut.ac.uk/Metadata/DC-SubElements.html>
[3] The Warwick Metadata Workshop Report, L. Dempsey & S. Weibel, D-Lib Magazine, ISSN 1082-9873, April 1996, <URL:http://www.dlib.org/dlib/july96/07weibel.html> and <URL:http://www.ukoln.ac.uk/dlib/dlib/july96/07weibel.html>
[4] The Warwick Framework: A Container Architecture for Aggregating Sets of Metadata, C. Lagoze, C. Lynch & R. Daniel Jr., Cornell Computer Science Technical Report TR96-1593, <URL:http://cs-tr.cs.cornell.edu/Dienst/UI/2.0/Describe/ncstrl.cornell%2fTR96-1593>
[5] The Warwick Framework, C. Lagoze, Digital Library Research Group, Cornell University, D-Lib Magazine, ISSN 1082-9873, April 1996, <URL:http://www.dlib.org/dlib/july96/lagoze/07lagoze.html> and <URL:http://www.ukoln.ac.uk/dlib/dlib/july96/lagoze/07lagoze.html>
[6] A MIME Implementation for the Warwick Framework, J. Knight & M. Hamilton, University of Loughborough, 1996, <URL:http://weeble.lut.ac.uk/MIME-WF.html>
[7] A Syntax for Dublin Core Metadata, L. Burnard, E. Miller, L. Quin & C.M. Sperberg-McQueen, April 1996, <URL:http://info.ox.ac.uk/%7Elou/wip/metadata.syntax.html>
[8] Standard Generalized Markup Language (SGML) ISO 8879:1986, <URL:http://www.iso.ch/cate/d16387.html>
[9] An Approach for Embedding Metadata in HTML 2.0, Stuart L. Weibel et al., June 2, 1996, W3C Distributed Indexing and Searching Workshop, MIT, USA, <URL:http://www.oclc.org:5046/~weibel/html-meta.html>
[10] RFC822, .... <URL:ftp://nic.merit.edu/documents/rfc/rfc1522.txt>
[11] NEWS
[12] IAFA Templates
[13] WhoIs ++
[14] HyperText Transfer Protocol (HTTP)
[15] PNG (Portable Network Graphics) Specification 1.0, T. Boutell (Editor) et al., July 1 1996, W3C Proposed Recommendation, W3C - MIT,USA / INRIA,France, <URL:http://www.w3.org/pub/WWW/TR/PR-png-960701.html>
[16] International Organization for Standardization, "Information Processing --- 8-bit Single-Byte Coded Graphic Character Sets --- Part 1: Latin Alphabet No. 1", ISO 8859-1, 1987.
[17] Representing the Dublin Core within X.500 and LDAP, M. Hamilton m.t.hamilton@lut.ac.uk, R. Iannella renato@dstc.edu.au and J. Knight j.p.knight@lut.ac.uk, August 1996, draft document posted to Metadata Workshop II mailing list.
[PERL1]
sub parse_value {
    my ($value) = @_;

    while (1) {
        # Ignore any leading whitespace
        $value =~ s/^\s*//;

        # 1. If '((' is seen...
        if ($value =~ /^\(\(/) {
            # After removing leading '(', element value is rest of value
            $value =~ s/^\(//;
            last;
        }

        # 2. If a single '(' is seen...
        if ($value =~ /^\(/) {
            # Parse rest of value as a qualifier
            $value = parse_qualifier($value);
            # OR could skip the qualifier with:
            #   $value =~ s/^\([^\)]+\)//;
            next;
        }

        # 3. Otherwise, value is rest of this string, so end.
        last;
    }

    print "Element value: '$value'\n";
}
[PERL2]
sub parse_qualifier {
    my ($value) = @_;

    # 0. Remove leading '('
    $value =~ s/^\(//;

    # Test the length so that a remaining value of "0" is still parsed
    while (length $value) {
        # 1. Ignore any leading whitespace
        $value =~ s/^\s*//;

        # 5b. If a trailing ')' is seen, remove it and end
        last if $value =~ s/^\)//;

        # 2. Get qualifier name, ending before '=' or whitespace;
        #    stop on malformed input rather than reuse a stale $1
        last unless $value =~ s/^([^\s=]+)//;
        my $qualifier = $1;

        # 3. Remove '=' and whitespace around it
        $value =~ s/^\s*=\s*//;

        # 4. Get qualifier value, ending before ')', ',' or whitespace
        my $qualifier_value = $value =~ s/^([^),\s]+)// ? $1 : '';

        # Replace %-encoded characters with their value
        $qualifier_value =~ s/%([a-fA-F0-9][a-fA-F0-9])/pack("C", hex($1))/eg;

        print "qualifier $qualifier:'$qualifier_value'\n";

        # 5a. Remove ',' (and any whitespace before it) if present
        $value =~ s/^\s*,//;
    }

    return $value;
}
Thanks to Jon Knight for his comments.