A REFERENCE MODEL FOR METADATA
A Strawman

Francis Bretherton
University of Wisconsin
DRAFT 3/2/94

Acknowledgements

Significant contributions to the concepts described here have been made by Paul Kanciruk and Paul Singley of Oak Ridge National Laboratory, Joe Rueden of the University of Wisconsin, and John Pfaltz of the University of Virginia. The author is also grateful for encouragement and assistance from Bernie O'Lear of the National Center for Atmospheric Research, and Otis Graf and Bob Coyne of IBM.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1. The challenge

Scientists in all disciplines face a computer-enabled explosion of data: about collisions between fundamental particles, about protein structure and folding, tomographic images of the human brain, or observations of the planet Earth from spacecraft. At the same time, the widespread use of networks and distributed servers is greatly facilitating the availability of data, AND the opportunities for misinterpreting it. What we typically regard as the data is but a small part of the information that has to be assembled, inter-related, and communicated if distant colleagues who were not closely involved in its collection are to make full use of it. This additional information (metadata) is central to our scientific objectives, yet we have few computer-based tools to assist in describing effectively what all these bits mean.

In the domain of Earth system science, metadata has even greater significance. With growing world population and expectations of increasing per capita resource use, humans are increasingly modifying our natural environment on a global scale, with profound but ill-understood consequences for our children and grandchildren. A central challenge to scientists seeking to understand how the Earth system functions is to document the changes that actually occur over the coming decades. A test of the adequacy of our present information systems is to imagine what our successors will think twenty years from now as they examine our records, at a time when everyone who is currently involved has passed from the scene, and try to determine whether the apparent changes between now and then are real, or are merely undetected artifacts of the way we took the measurements or analyzed the data. This requires standards of documentation and quality assurance that far exceed what we are presently able to achieve routinely, sufficient for fail-safe, unambiguous communication about scientifically crucial details without benefit of the interactive questioning that is usually the cornerstone of human discourse.

Existing database management systems are dominated by commercial applications. These tend to have schemas that are relatively static, though the contents may be updated frequently. Priorities are transactional integrity and efficiency of routine use, though as the tools (such as SQL queries) have become available, exploratory queries from management have become more prominent. To achieve these objectives, system design and operation have typically been maintained under tight central control. Scientific databases, on the other hand, are generally adding not only more data but also new types of data. Deletions tend to occur only en masse, when entire datasets are discarded as obsolete or not worth maintaining. Success is measured by the discovery of new relations within the data and by the new questions they stimulate, not by transactional efficiency.
The flexibility to deal with rapidly evolving schemas, and to handle exploratory queries effectively, must be our fundamental priorities. In addition, the same networks are encouraging individual initiative, challenging the very notion of a centralized authority able to impose standards across ANY realistic domain of participants. Thus new concepts are required, which redefine the functional relationships of users, contributors, designers, and operators of our information systems.

2. Purpose of this document

The purpose of this document is to provoke a discussion from which a consensus may emerge on some basic principles that must underlie any broadly applicable set of approaches to improving this situation. Though written by a scientist grounded in Earth system science, the focus is on the information that has to pass in both directions across the interface between humans and electronic representation in quasi-independent logical modules of our data systems, and also between such modules as they are loosely coupled to form a complete system. It attempts a functional analysis of this interface for an individual module (also known as a database), with the goal of isolating and abstracting a description that captures the essence of what is needed, both to enable the full power of computer-to-computer communications and at the same time to empower scientific users of, and contributors to, the information system with their customary modes and vocabulary of scientific discourse.

Such an analysis is a metadata reference model, a framework for specifying the logical structure of the external interfaces to a database with enough precision to be practically realizable in an efficient manner, yet deliberately independent of any particular implementation. Such a framework could then be used for specifying requirements and performance benchmarks in procurement of complete systems, hardware and software, and should enable explicit consideration of design and operational tradeoffs in the light of evolving scientific understanding and technological innovation. Codification of the interface structure should also encourage the development of interactive software tools to ease the pain and improve the productivity of scientists seeking information from the system, or contributing information to it. It also focuses attention on a few specialized services that MUST be centrally managed and coordinated.

A successful example of such a concept is the reference model for hierarchical storage (IEEE, 1994), developed in response to the needs of the national supercomputing centers. Its existence has stimulated vendors to offer complete systems tailored to particular circumstances, rather than individual parts whose integration remains the responsibility of the user. A similar, though more hardware-oriented, example is provided by the description of alternative architectures for Federated Database Systems in Sheth and Larson (1990).

3. What is metadata? A Parable

a. Innocence

Hearing the following message on the radio:

    `temperature 26, relative humidity 20%'

a typical U.S. listener would probably conclude, prompted by the juxtaposition of the words `temperature' and `humidity' and by past experience of similar messages, that it was a statement about the current weather, and that it was a cold dry day outside.
If she were Canadian, on the other hand, she might at first think it was a pleasant summer day, then realize it was January so that this could not be her weather that was being referred to, finally recalling that this radio station transmitted from south of the border and that Americans report temperatures in degrees Fahrenheit, not Celsius. Yet if the temperature had indeed been reported in Celsius, as -3.33333°C, she might well still have rejected the message, suspecting a lack of professionalism somewhere in its formulation. The clue, of course, would be her intuitive realization that the fine gradations implied by 5 decimal places are totally meaningless in describing real weather.

This simple example illustrates a key property of METADATA, which is usually loosely defined as something like `data about data', or `additional information that is necessary for data to be useful'. For any communication between two parties to be effective, there has to be a common body of knowledge shared between them that sets the context for a message. In human discourse, a large amount of this context is generally implicit, or hinted at in ambiguous ways. If different assumptions about it are made at either end, even the simplest piece of data is likely to be misunderstood. Metadata is required to flesh out this context. Of course, in a well understood situation, most of the metadata is redundant and frequently omitted, but if the situation or the parties were to change somewhat, more would be required. If the situation were to change greatly, much more would be required. Thus the statement of what is adequate metadata is itself situation dependent. This is why the meaning of the term itself seems so slippery.

b. Engagement

Suppose now that our listener is a weather freak, who regularly scans the dial and makes entries in her computer like:

    WERN  1/28/94  0700  CST  26   20
    KVOD  1/28/94  0600  MST  40  135
    WGBH  1/29/94  0800  EST  36   90

She also exchanges with like minds all over the continent by Internet, thus expanding her personal collection, which contains thousands of entries. Each one of these is, of course, similar to a record in a relational database, with the primary key provided by the combination of radio station call letters, date, and time, which, provided she has been careful to eliminate duplicate entries, provides a unique identifier for each record. The last two columns are the contents of the original radio messages, but these have now been tagged with additional information necessary for correct interpretation in a continental, climatological context rather than in a local, current-weather context. At the same time, to save typing and computer storage, the variable names which were explicit in the original messages have been compacted into the headers which label each column.

To complete the picture it is necessary to append two additional tables. The first has each entry uniquely identified by a radio station call letter, and lists the location of its service area and whether it reports in Fahrenheit or Celsius. The second has only one entry, listing the units (%) associated with relative humidity. In the relational data model, links are established between such tables through the variables that are common to each, and together these tables are sufficient to describe the logical connections that are needed. As this example makes clear, there is no LOGICAL distinction between metadata and data.
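To make the parable concrete, the three tables just described might be realized as below. This is only an illustrative sketch: the table and column names, the service-area locations (taken from later in the parable), and the use of Python's built-in sqlite3 module are choices made for the example, not anything prescribed by the text.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Observations: primary key = call letters + date + time
    CREATE TABLE observations (
        call_letters TEXT, obs_date TEXT, obs_time TEXT, time_zone TEXT,
        temperature REAL, relative_humidity REAL,
        PRIMARY KEY (call_letters, obs_date, obs_time)
    );
    -- One row per radio station: service area and reporting units
    CREATE TABLE stations (
        call_letters TEXT PRIMARY KEY, service_area TEXT, temperature_units TEXT
    );
    -- One row only: the units associated with relative humidity
    CREATE TABLE humidity_units (variable TEXT PRIMARY KEY, units TEXT);
    """)

    con.executemany("INSERT INTO observations VALUES (?,?,?,?,?,?)", [
        ("WERN", "1/28/94", "0700", "CST", 26.0, 20.0),
        ("KVOD", "1/28/94", "0600", "MST", 40.0, 135.0),
        ("WGBH", "1/29/94", "0800", "EST", 36.0, 90.0),
    ])
    con.executemany("INSERT INTO stations VALUES (?,?,?)", [
        ("WERN", "Madison WI", "Fahrenheit"),
        ("KVOD", "Denver CO", "Fahrenheit"),
        ("WGBH", "Boston MA", "Fahrenheit"),
    ])
    con.execute("INSERT INTO humidity_units VALUES ('relative_humidity', '%')")

    # The relational link described in the text: tables joined through the
    # shared variable `call_letters'.
    for row in con.execute("""
        SELECT o.call_letters, s.service_area, o.temperature, s.temperature_units
        FROM observations o JOIN stations s USING (call_letters)
    """):
        print(row)

The final query is the point of the exercise: location and units, which were implicit in the original radio messages, are recovered by joining through the variable the tables have in common.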
With a slight change in context, the variable names which were originally an integral part of each message have become an appendage to a whole column of numbers, whereas the information about location and time which was originally implicit has become an integral part of what most people would call the data. Even in a given context, more than one consistent database structure is possible. For example, coupling location data to a map of time zones and to the U.S. and Canadian calendars for daylight saving time might have avoided the entry of the whole column which qualifies the time tag. The Celsius - Fahrenheit issue might also have been handled in a similar way. The choice between such alternative structures is governed by convenience and efficiency of storage, entry, and access, rather than by logical necessity.

None of this is any surprise to anyone who has actually constructed a working database. Databases consist of variables linked by a permanent structure. Each variable (also known as an entity) will normally have several attributes, such as its name, color, numerical value, units, precision, etc., each of which may also vary or be constant within the span of the database. The permanent structure links variables and their attributes, and defines groups of variables. However, additional structure is imparted by the specific values of attributes. For example, if two occurrences of the attribute `call letters' have the same values, it is assumed that they refer to the same radio station, with all the consequences of that identity. This assumption is crucial for inferring location in our listener's database. Within her universe this works well, but had her sharing group included members from Europe, where radio stations do not use unique call letters, a different permanent structure would have been required.

A database is defined for a specific context and has a specific schema: the set of entities with which the database is concerned and the relationships between them which are known to the database. Each time the context is extended, attributes that had been constants (e.g. units) may become variables, and methods of linking that had been built deep into the permanent structure may become falsified. It is thus vital to distinguish a particular database implementation from the logical structure of the context for which it is intended. Understanding and characterizing the latter is key to a robust and effective design.

c. Awakening

Suppose our listener now goes to college, where she attends a course entitled Meteorology 101. Reexamining her database, she suddenly realizes that the record KVOD 1/28/94 0600 must be in error. By definition, values of relative humidity must lie in the range 0-100%. Thus this entry is logically inconsistent and must be used with extreme caution, if at all. She needs to annotate it to that effect. Unfortunately, this contingency was not in her original design, but by good luck she had included an extra vacant column of width 1 character. She thus marks this record with an * in that column, labels the column `data quality', and adds a new table with the one entry:

    *   error

The information conveyed is less specific than would be ideal, but is better than nothing.

She also finds new treasures in her database. Abstracting all the records for KVOD and WERN, she can describe the climate in Denver and Madison, using a graphical comparison as the centerpiece of a term paper. But wait - there is something odd about that graph.
It seems as if in 1995 the temperature in Madison abruptly became warmer, perhaps a symptom of global warming! However, exhaustive research reveals that in March of that year WERN had started reporting the weather from a new location, and the global warming was probably only an artifact of a missing piece of metadata. Disappointed but wiser, she decides to repair that omission and add a comment to that effect to her database, only to discover that with her database management software there was no way she could attach an electronic Post-It note to an arbitrarily defined category of data! So the critical information remained on a piece of paper, which was, of course, lost the next time she moved to a new apartment.

Her term paper addressed the question `How many days in the year is Denver warmer than Boston?' She realized that, for any given date and time, the statement `Denver is warmer than Boston' can in principle be evaluated as True, False, Not Resolvable, or NoData. This involves examining the values of the variable temperature and its attribute `units' which are associated with KVOD and WGBH, converting if necessary from Celsius to Fahrenheit or vice versa, and testing whether the result associated with the former is greater than that from the latter by an amount which exceeds the assigned measurement tolerances (a sketch of this test appears at the end of this subsection). Since she was in fact interested in this question for many pairs of cities, she wrote a FORTRAN program to perform these operations, and then added to her database a table which contained the result for every pair of radio stations. The next time the issue came up, she did not consult the raw data, but rather this derived product.

The context for a database is set not just by the scope of the data being input, but also by the type of questions being asked. As understanding of the contents grows, even with identical data, issues will arise that require additions to the contents in ways that cannot be foreseen in specific terms. Design decisions based on implementation efficiency must not preclude the ability to incorporate new information at a whole variety of levels. These levels range from additional varying attributes attached to all instances of particular variables or groups of variables, to occasional comments or processing branch points attached to whole blocks of instances, to derived products that will frequently be accessed without direct reference to the original, or perhaps processed `on the fly' as the requirement arises according to algorithms that are themselves entries in the database. A robust design must reflect a deep understanding of considerations such as uncertainty of measurement and human error, as well as an appreciation of the extent to which the apparent logical structure of entities within the database is based on arbitrary convention as opposed to fundamental principles.
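As promised above, here is a minimal sketch of the `Denver is warmer than Boston' test. The function names, the dictionary representation of an observation, and the size of the tolerance are assumptions made for illustration, and the listener's FORTRAN has become Python.

    from enum import Enum

    class Verdict(Enum):
        TRUE = "True"
        FALSE = "False"
        NOT_RESOLVABLE = "Not Resolvable"
        NO_DATA = "NoData"

    def to_fahrenheit(value, units):
        """Transformation algorithm: convert a temperature to Fahrenheit."""
        if units == "Fahrenheit":
            return value
        if units == "Celsius":
            return value * 9.0 / 5.0 + 32.0
        raise ValueError(f"unknown units: {units}")

    def warmer_than(obs_a, obs_b, tolerance=1.0):
        """Evaluate `A is warmer than B' as True, False, Not Resolvable, or NoData.

        Each observation is a dict with a `temperature' value and a `units'
        attribute; `tolerance' stands in for the assigned measurement tolerances.
        """
        if obs_a is None or obs_b is None or \
           obs_a.get("temperature") is None or obs_b.get("temperature") is None:
            return Verdict.NO_DATA
        a = to_fahrenheit(obs_a["temperature"], obs_a["units"])
        b = to_fahrenheit(obs_b["temperature"], obs_b["units"])
        if a - b > tolerance:
            return Verdict.TRUE
        if b - a > tolerance:
            return Verdict.FALSE
        return Verdict.NOT_RESOLVABLE      # difference lies within the tolerances

    # One entry in the derived product: the verdict for one pair on one date
    denver = {"temperature": 40.0, "units": "Fahrenheit"}   # KVOD
    boston = {"temperature": 36.0, "units": "Fahrenheit"}   # WGBH
    print(warmer_than(denver, boston))                      # Verdict.TRUE

A table of such verdicts for every pair of stations is exactly the kind of derived product that, once stored, deflects later queries away from the raw data.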
d. Frustration

Our listener has now become a serious student of fluid dynamics and Earth system science. She learns how the winds and ocean currents, and even the flow of water through the soil, are driven by gradients of pressure, and how the equations of motion can be simulated in a computer. She recognizes the importance of understanding the interactions between such flow processes in the atmosphere, the oceans, and on the land surface, and starts a comparative study. Her first task is to find some data that might be suitable. So she logs on to the NASA master directory and starts to search under the keyword `pressure'.

She finds pressure is a key variable for a large fraction of atmospheric data sets, but frequently as a label tagging where in the atmosphere an observation was taken, rather than as a variable at a fixed location, which was what she expected. For some ocean measurements pressure occurs in the same sense, but more often the tag is labelled `depth'. Never does it seem to refer to the expected physical driver for motion. In the hydrology section she finds no references to pressure at all! So she returns to her textbooks, and discovers that in computer models of this type the active variable driving the motion is actually dynamic height, which is the height of a surface of constant pressure, rather than the pressure at constant height. The two are closely related, and provided additional information is available to determine the fluid density, there is a mathematical transformation between them. Thus the words are often loosely used interchangeably, even though technically they are quite different. So she returns to her database and this time finds what she was seeking for the atmosphere and the ocean, but still draws a blank on the hydrology. In despair, she finally finds a hydrologist and explains her problem. He laughs, and remarks that what she is looking for is called by hydraulic engineers `head'.

Noting that oceanographers use `head' to refer to a water closet, she returns to the Master Directory and from the summaries there selects a few promising candidates. She now wants a more specific description of contents, but realizes her candidates are in five different data centers spread among three different federal agencies, each with its own access system and rules and regulations. After much time on the telephone, she is able to log on to one of them. The dataset seems to be the type of thing she would need, but there is no way she can get a brief sample without first obtaining a user authorization and then purchasing by mail special software to read their format, with a 90% chance that when she is able to visualize the sample it will prove unsuitable. After several similar experiences she abandons her project, and concentrates instead solely on what she can do with her existing database.

This example illustrates some of the practical realities of building connections and establishing interactions between different parts of the Earth sciences community. The concept of a database designed for a particular task is inapplicable to such an open, evolving context. An effective information system must not only provide efficient access to a wide diversity of specific data. It must also provide tutorials and help systems tailored to the starting points of many different users, which may range over a wide variety of disciplinary backgrounds and degrees of sophistication. Though humans may be able to sense ambiguity and conflict in the names used to describe related concepts within different disciplines, computer to computer communication requires a reliable procedure for resolving such issues. Different parts of the system have to be semi-autonomous and self-describing, not only to a human, but also to other computers and databases across a network. There are few standards, and no central authority capable of imposing any across more than a subset of the contexts the system has to serve, yet total anarchy is also unacceptable.
Patches have to be built between existing ways of doing things and existing formats for communication, even if they are not entirely consistent, at the same time as providing pathways for voluntary, incremental growth toward a system that benefits everyone. There should always be a few (not just one or infinitely many) alternative communication formats available, from which the parties can choose the most convenient, with the onus for adaptation on the party with the greater resources. The most precious and irreplaceable asset is the time and enthusiasm of the people who are potential users of the information system and whose knowledge and skills are needed to contribute to its development. Metadata includes all the data needed to make such an expanded vision work.

4. A Strawman Reference Model

a. Overview

i. A vision of the system

The metadata reference model referred to here is a logical analysis of the structure of the external interface of autonomous modules (here called databases) loosely linked within a complete information system (here called the association). For present purposes, the relevant communications within that system are between scientists and databases, and between databases themselves. For an important subset of applications in Earth system science, the ultimate objective is the effective one-way transfer of information between humans now and humans decades in the future, without benefit of the interactive exchanges that usually mediate human discourse.

The environment for such communication is presumed to be similar to that provided by the Internet: a rapidly evolving, open association of autonomous units, with a minimal set of operating rules, and no central authority capable of imposing uniform standards except by common agreement. Technologies such as Gopher and the World Wide Web, with user interfaces such as MOSAIC, provide navigational markers and flexible access tools, but useful user-oriented higher level data structures for full-service information systems have still to be evolved. The emergence of knowledge based software agents which navigate the network and perform chores on behalf of users only accentuates the need for clear external interfaces to logical units in the information system, which can be accessed effectively both by humans and by other computers.

ii. What is metadata?

Metadata is often defined as data about data, or, only a little less vaguely, as the information required to make scientific data useful. Indeed, the term means different things to different people, and defies precise definition. We will use it in only a general sense, taking refuge behind an operational shift to defining context instead. For successful communication it is essential that both parties share a common set of assumptions (here referred to as a context) according to which the messages that pass are to be interpreted. In natural language such contexts are generally largely implicit, and are elucidated by question and answer only when apparent inconsistencies indicate that there may be a problem in interpretation. However, computers typically require that all relevant assumptions be explicit.

At the center of traditional approaches to database design is a schema. A schema describes the conceptual or logical data structures of all the objects or entities with which the database is concerned, together with all the relationships between them known to the database (see for example Sheth and Larson 1990).
In such a well defined context, the difference between metadata and data disappears - metadata is simply data. However, when the context is extended or modified, new information, metadata, is needed to provide unambiguous interpretation and a new schema. Thus, according to this perspective, the distinction between metadata and data is merely one of use, and the focus shifts to another formidable task, that of defining context. Metadata becomes the additional data that must be invoked to implement the change in context. The prefix meta does not attach to the data itself, but derives from the circumstance of change.

Here a context is simply a set of assumptions, with a unique name or identifier. An approach to building such sets, including the categories of assumptions that are needed, is discussed below under `templates'. Subject to certain significant caveats, such contexts can then be combined or manipulated using the operations of set algebra such as union and intersection. Note, however, that each party involved in a communication must first declare by name or establish by enumeration an INITIAL CONTEXT for themselves. These initial contexts must then be reconciled (to a first approximation by set union) into a UNIFIED CONTEXT within which messages can be unambiguously interpreted. To achieve this unification, metadata (or, if you prefer, data) must be exchanged or otherwise invoked.

iii. A metadata reference model

A metadata reference model is an analysis of the uses of metadata (in the general sense of the word) in four different areas of scientific data management activity, each with its own characteristic requirements:
(1) query, browse, retrieval;
(2) ingest, quality assurance, reprocessing;
(3) machine to machine transfer;
(4) storage, archive;
in the context of a crosscutting set of
(5) disciplinary perspectives.
This analysis, expressed in natural language, has to result in a logical structure that is sufficiently precise and internally coherent to be used as a framework for defining the external interface of logical modules (here referred to as databases), which are loosely linked in an association which forms the information system. The logical structure of this external interface is quite distinct from any implementation of particular databases, focussing instead on the information that has to pass across the interface for the enterprise to be successful.

iv. Structured data expression

Each module or database appears to the outside world as a structured data expression that has the following properties:
(1) self describing to humans and machines;
(2) can be entered and manipulated at different levels;
(3) presents different views according to level;
(4) is persistent;
(5) is dynamically extensible;
(6) is portable to different implementations while preserving its external interface;
(7) contains an extensible set of methods and utilities for manipulating and transforming data;
(8) has a look and feel that can be tailored to user discipline and datastream structure.
An additional property which seems highly desirable is that it be possible, in constructing such modules, to include parts of others in a transparent fashion. It is at present unclear whether this is inconsistent with the autonomy of individual modules. It is assumed that individual structured data expressions are normally built around particular data holdings, often centered on an established data stream or research program, with ancillary information from elsewhere added as required.
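Properties (1)-(8) constrain the external interface, not any particular implementation. Purely as a hypothetical illustration of what `self describing', `entered at different levels', and `an extensible set of methods' might look like from outside, consider the following toy sketch; every class, method, and name in it is invented for the example.

    class StructuredDataExpression:
        """Toy external interface for one module (database) in the association."""

        def __init__(self, name, levels):
            self.name = name
            self.levels = levels      # e.g. {"atomic": [...tables...], "derived": [...]}
            self.methods = {}         # extensible set of transformation utilities

        def describe(self, audience="human", level="atomic"):
            """Self-description, for a person or for another computer."""
            entities = sorted(self.levels.get(level, []))
            if audience == "machine":
                return {"module": self.name, "level": level, "entities": entities}
            return f"{self.name} (level '{level}'): " + ", ".join(entities)

        def register_method(self, name, func):
            """Dynamic extensibility: algorithms are themselves entries in the database."""
            self.methods[name] = func

        def view(self, level):
            """Different views according to level of entry."""
            return self.levels.get(level, [])


    db = StructuredDataExpression(
        "weather_log",
        {"atomic": ["observations", "stations", "humidity_units"],
         "derived": ["station_pair_comparisons"]},
    )
    db.register_method("celsius_to_fahrenheit", lambda c: c * 9.0 / 5.0 + 32.0)
    print(db.describe("human", "atomic"))
    print(db.describe("machine", "derived"))

The point of the sketch is only that the same module answers both a human and a machine, at a level the caller chooses, and can grow new methods without a change to its interface.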
The lowest level, an atomic expression, is conceived as intellectually coherent and structurally simple (e.g. a few relational tables with SQL access), though it might contain very large volumes of data. Higher levels would build more complex structures. The term `level' is used here to specify entry points and the degree of complexity of constituent sub-expressions, but the number of such levels will vary between databases. Users will typically match the level to their prior familiarity with the subject matter, so some approximate naming convention for levels is desirable. These structured data expressions form quasi-autonomous but interacting modules (databases) within the broader, decentralized information system (the association).

The context in which data is to be interpreted defines the metadata that is required. Thus a central issue is how to define, structure, and describe the range of contexts within which a particular database must function. This range will be built to any particular level on the characteristics of individual data streams, drawing on a vocabulary and set of default conventions and rule based knowledge which are identified by discipline, subdiscipline, and specialty, expanded from the top down.

These databases will be complemented within the association by issue-oriented high-level modules, mostly based upon natural language, which provide structured learning paths for the intelligent but non-expert user through key documents and published literature down to selected derived products and original datasets. Such high-level modules will act essentially as multiple indices over the population of databases, relying on scientific interpretation of the database contents. The primary emphasis here is on the first type of database, where the interactions with original datasets will be mediated by individuals who are scientifically knowledgeable in the disciplinary area, though not necessarily expert in the specific formats and conventions used or in the details of instruments or analysis techniques.

v. Basic modes of expression

A structured data expression should be able to exchange information with humans in any of the basic modes of expression for scientific discourse:
(1) natural language - marked up to make it partially machine searchable;
(2) tables and data structures;
(3) graphs and diagrams;
(4) images;
(5) equations and mathematical models;
(6) algorithms and implementable procedures;
(7) electronic documents and other formats combining several modes.
Not all of these may be economically feasible at first, but full implementation will require them all. Because the voluntary judgement of scientists and information specialists is critical to effective functioning of the information system, emphasis in this analysis is placed upon a logical basis for structuring content and providing tools which ease the pain of information entry and retrieval. At the same time, the logic of the analysis must be sufficiently rigorous to allow the efficient implementation and automated management of very large and complex databases. Beyond a certain level of sophistication these requirements may well be inconsistent or impractical. However, technologies such as AGENTS and INTERACTIVE FORMS are potentially capable of greatly increasing the productivity of humans interacting with the system, and the vision of such tools should not be constrained by existing practice.
vi. Services

Though it is supposed that each database retains design autonomy over its internal implementation and contents, for the association of many databases to function there need to be certain system-wide services which are centrally administered. Besides standard network connectivity, transport, and communication protocols, special attention needs to be paid to a process for assembling community inputs and resolving conflicts on at least the following issues:

(1) User authorization and authentication. Even if the contents of the database are available for full and open exchange, the ideal of collaborative science, it will still be necessary to restrict the ability to modify those contents, and to account for the use of resources. Since many users will be other computers known only by a network address, there will have to be some centrally coordinated register of authorized users in various categories, and some technique for authenticating stated identities.

(2) Globally unique names and their disciplinary aliases. Discrete names for discrete entities form the glue that holds a computer network together. Likewise, scientific discourse requires consensus on the meaning of technical terms. Providing a universal name service in an unstructured, evolving, distributed environment touches some fundamental issues relating to conflict resolution and concurrency. However, it may well be that, even though no perfect solution may exist, for relatively slowly changing uses such as disciplinary templates practical compromises are effective. When a variable or data structure is generated by reprocessing for which no external name exists, the variable or structure can be described, at the cost of some inconvenience, by its pedigree. After careful analysis to distinguish homonyms and eliminate functional duplicates, the more significant of such items may in due course be assigned global names with appropriate disciplinary aliases, so that they are easier to use.

(3) Model description languages - particularly for structured data objects. A structured data expression must include methods to describe itself, both to another computer and to a human being. Such descriptions must be in one of a few model description languages, and the initial negotiations between communicating partners should include the selection of a suitable one. As at other levels, this negotiation protocol is the key to graceful evolution.

(4) A few ALTERNATIVE data transfer formats. These should cover the range of modes of expression of scientific discourse and a range of hardware and software capabilities. They may well be defined procedurally, as methods which are based upon transformation algorithms that are themselves entries in the database.

(5) Templates to help define context. A central concept in the reference model is the articulation of a set of hierarchically structured templates, named for example by a tag <discipline>.<subdiscipline>.<specialty>, or by <datastream>, each of which defines a set of default assumptions and can be used individually or by set union to construct approximations to initial contexts, as sketched below.
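The following is an illustration of that mechanism only; the template names, assumption identifiers, and values are invented, and a real system would carry far richer content in each category.

    # Hypothetical disciplinary and datastream templates: named sets of assumptions,
    # each assumption carried as a (unique identifier, value) pair.
    TEMPLATES = {
        "meteorology.climatology.surface_obs": {
            ("units.temperature", "Fahrenheit"),
            ("units.relative_humidity", "%"),
            ("range.relative_humidity", (0.0, 100.0)),
        },
        "datastream.radio_weather_reports": {
            ("key.primary", ("call_letters", "date", "time")),
            ("units.temperature", "Fahrenheit"),
        },
        "oceanography.dynamics": {
            ("units.temperature", "Celsius"),
            ("name.pressure_like_variable", "dynamic height"),
        },
    }

    def initial_context(*template_names):
        """Approximate an initial context by set union of the named templates."""
        context = set()
        for name in template_names:
            context |= TEMPLATES[name]
        return context

    def unify(context_a, context_b):
        """Reconcile two initial contexts; report conflicts
        (same identifier, different value)."""
        merged = context_a | context_b
        seen = {}
        conflicts = []
        for ident, value in merged:
            if ident in seen and seen[ident] != value:
                conflicts.append(ident)
            seen.setdefault(ident, value)
        return merged, sorted(set(conflicts))

    user = initial_context("oceanography.dynamics")
    database = initial_context("meteorology.climatology.surface_obs",
                               "datastream.radio_weather_reports")
    unified, conflicts = unify(user, database)
    print("conflicts to resolve:", conflicts)   # ['units.temperature']

The conflict reported at the end is exactly the kind of overlap that the information-system-wide process for unique identifiers and conflict resolution, discussed below, would have to handle before the unified context can be trusted.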
The assumptions in disciplinary templates should include coverage of the following categories:
(a) public names for variables;
(b) logical and associative relationships between variables;
(c) descriptions of standard measurement and analysis procedures, including suggestions for relevant metadata;
(d) descriptions of standard theoretical models;
(e) units, levels of precision;
each expressed in an appropriate format, rule base, or language. The primary information is semantic (i.e. science content), but (6) much value for database management would come from structuring relations in the template according to the following classes:
(a) Fundamental - logically based, can be hardwired into database manipulations;
(b) Proximity - physically/intellectually based associations, things most likely to be retrieved together, crucial for implementation efficiency;
(c) Transformation - associated with implementable algorithms that are themselves entries in the database, e.g. Celsius <-> Fahrenheit or internal <-> external formats;
(d) Derivative - value added products that become new entries in the database and deflect queries from the original; or
(e) Guide - explanations and cross-references driven by science content, intended to inform a human user.

Likewise, each instrument system or datastream has its own set of obvious metadata. This includes logbooks, calibrations, and other information that is necessary to proceed from pointer readings to calibrated physical units to derived products, and statements of priority which reflect the original purpose of the measurements or analysis. In special cases ad hoc formats have been developed for including such data (e.g. American Petroleum Institute, 1993), but a more systematic set of default templates for data streams, similar to that for disciplines, would aid communication.

Approximations to initial contexts for a dialogue between a user and a database would be provided by the default assumptions in, on the one hand, the template or templates associated with a discipline or disciplines declared by the user, and, on the other hand, a template associated with the datastream. Each set of default assumptions would then be modified if necessary, before merger into a single unified context. The templates will evolve slowly with time, as scientific understanding is enhanced and scope changes. Thus provision must also be made for version control, updating, and structural evolution.

Such templates would take much of the pain (for humans at least) out of establishing effective communication, and should stimulate the development of interactive forms and software agents to assist the two-way flow of information between humans and the database. However, to combine such contexts by set algebra into broader, interdisciplinary aggregates, and to use them to develop specific schemas suitable for machine to machine communication, it is also necessary to have an information-system-wide process for identifying overlaps and resolving conflicts, for example where the same name is used for different things, or different names for the same thing. Each distinct assumption should have a unique identifier. This requirement is a potentially serious caveat, of which the implications are unclear.

b. Query, browse, retrieval

This interface is driven by a human user's need to answer questions efficiently. This requires a response time that will keep the user engaged, and response information which is appropriate to the context as perceived by the user.
For a direct human user, a brief initial questionnaire (for example naming the user's discipline, subdiscipline, and specialty) may be enough to enable responses to be tailored intelligently. Alternatively, specific queries could be managed by an appropriate user based software agent, working to a more detailed user profile.

i. What datasets exist?

The Global Change Master Directory is already functioning on line. Individual U.S. holdings are summarized in Directory Information Format (DIF) entries, though not all agencies have populated it fully. The system works but is experiencing some trouble with keywords. Though it is adequate for the scientifically knowledgeable user in his area of expertise, who is the principal focus of this analysis, the system needs to be supplemented with hierarchically structured, issue-oriented introductory guides to the science, which start from highly summarized assessments, with successive layers presenting more and more detail on selected subtopics, based ultimately on published literature, well validated model outputs and data products, and a few reliable and especially significant original datasets.

ii. Is it likely to be of use to me?

(1) An index to the logical structure of the database
(2) A summary description of scientific context, including discussion of primary scientific objectives for the data, the variables, instrument systems, processing algorithms, and quality control procedures
(3) The spatial and temporal coverage and sampling
(4) The scientific credentials of this data, including evidence for its credibility and references to scientific publications which have used it or commented on its quality or deficiencies
Each of these items should be presented first in a quickly assimilated summary, with hot buttons leading to layers containing more detail if desired. The summaries will typically be electronic documents prepared for the purpose by knowledgeable scientists, and should also be accessible by library-type searches. However, to ensure currency, the most detailed layers on which they are based should be active datasets, with automatic updating of the summaries where appropriate.

iii. Is it really what I want?

Browse products:
(1) A typical sample
(2) Diagrams
(3) Graphs
(4) Derived products
In the interests of efficiency and effective presentation, this set of browse products will generally be prepared off line by reprocessing, though in some cases realizations may be invoked procedurally as required. Where possible they should be communicated in the mode of scientific discourse that is most appropriate for the user. For certain datasets, utilities may be provided enabling users to prepare their own.

iv. How do I get it?

There are several stages in a data request:
(1) An interactive order form, completed by the user, checked by the system
    (a) Name
    (b) Address
    (c) Medium
    (d) Preferred format
    (e) Variables
    (f) Scope
    (g) View
    (h) Level of metadata required
    (i) etc.
(2) Resource estimation
(3) Authorization
(4) Implementation - typically batch
(5) Statistical analysis - typically background
The statistical analysis is to provide information for the management and evolutionary design of the database. Data gathering should be built into the query-response and data request software, as sketched below.
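One minimal way of stringing these stages together is sketched here; the field names, the resource rule, and the authorization check are placeholders rather than a prescribed design.

    REQUIRED_FIELDS = ["name", "address", "medium", "preferred_format",
                       "variables", "scope", "view", "metadata_level"]

    def check_order_form(form):
        """Stage (1): the system checks the interactive order form completed by the user."""
        return [f for f in REQUIRED_FIELDS if not form.get(f)]

    def estimate_resources(form):
        """Stage (2): placeholder resource estimate; a real rule would use the scope."""
        return {"megabytes": 10 * len(form["variables"]), "tapes": 1}

    def process_request(form, authorized_users, usage_log):
        missing = check_order_form(form)
        if missing:
            return f"rejected: missing fields {missing}"
        estimate = estimate_resources(form)
        if form["name"] not in authorized_users:        # stage (3): authorization
            return "rejected: user not authorized"
        job = {"form": form, "estimate": estimate,
               "status": "queued for batch"}            # stage (4): implementation
        usage_log.append(job)                           # stage (5): statistics for design
        return job["status"]

    log = []
    form = {"name": "listener", "address": "Madison WI", "medium": "network",
            "preferred_format": "ASCII table", "variables": ["temperature"],
            "scope": "1994-1995", "view": "station time series",
            "metadata_level": "full"}
    print(process_request(form, authorized_users={"listener"}, usage_log=log))

The usage log accumulated in the last stage is the raw material for the statistical analysis that feeds back into the management and evolutionary design of the database.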
v. Lessons learned

The lesson learned from analysis of this aspect of the database external interface is that, besides the original data itself, an active archive needs to offer a rich set of specially prepared electronic documents with a high density of scientific guide information, built on top of an expanding range of products derived from the original data and other sources. The preparation of, and provision of access to, this supplementary information requires input from knowledgeable scientists and skilled communicators and involves considerable work. It is however an investment which both greatly enhances the utility of the database for the typical user and also, by deflecting ill-informed requests, may actually reduce the need for ad hoc, high-volume accesses to the original data itself. Particularly important is the selection of products derived in various ways from the original data. The great majority of potential users would probably be satisfied with a well executed higher level product with an established scientific pedigree, rather than have to derive something similar from scratch themselves. Easing the process of generating appropriate supplementary information, and of dynamically structuring the database to accommodate it, has to be a fundamental consideration of the information system design.

c. Ingest, quality assurance, and reprocessing

This interface is driven by the need to acquire a high quality dataset with a precisely defined data dictionary, and to ensure the logical and scientific integrity of the database. It requires input from both expert scientists and knowledge engineers, negotiated if necessary between them. Reprocessing is an intensive use of the database which contributes to quality control and is under the control of individuals who may be presumed expert in both the datastream and the internal database structure.

i. Ingest

The information that needs to be acquired falls into three categories, relating to:
(1) scientific content
(2) logical structure
(3) patterns of use

ii. Linkages

Linkages that need to be defined are expressed by:
(1) Internal representation of data dictionary - external representation
(2) Internal names - external names
(3) Assumptions about context - explicit representation
(4) Variables - attributes
(5) Assumed mathematical and logical equivalences - tests for database integrity
(6) Assumed transformation algorithms and utilities
(7) Content quality control - action in case of exceptions (sketched below)
    (a) permissible ranges of attributes
    (b) tolerances in transformations
    (c) missing data
    (d) attaching quality control flags
(8) Built in attributes
    (a) sticky notes attached to blocks of data
(9) Proximity relationships - efficient internal representations. Proximity relationships are criteria indicating the relative probability that data items will need to be accessed together. They provide information fundamental to efficient database design, such as hard-wired relationships and pointer structure.
(10) Dimensionality of data source
    (a) space and time
    (b) scientific associations between variables
    (c) relevance blocks for metadata
(11) Known and projected reprocessing algorithms - implications for database
(12) Linkage definitions - a decision model for database design and operations

iii. Quality assurance

(1) Identify quality assurance issues
(2) Add quality assurance flags and comments
(3) Document
    (a) algorithms
    (b) diagnostics
    (c) external inputs
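A minimal sketch of the content quality control step of item (7) above, echoing the relative humidity example from the parable; the variable names, permissible limits, and flag codes are hypothetical.

    # Permissible ranges per variable, of the kind a disciplinary template would supply.
    PERMISSIBLE_RANGES = {
        "temperature": (-130.0, 140.0),       # degrees Fahrenheit, illustrative limits
        "relative_humidity": (0.0, 100.0),    # by definition a percentage
    }

    MISSING, ERROR, OK = "M", "*", " "        # one-character quality control flags

    def quality_flag(variable, value):
        """Attach a quality control flag to a single value (item 7d)."""
        if value is None:
            return MISSING                    # item 7c: missing data
        low, high = PERMISSIBLE_RANGES[variable]
        if not (low <= value <= high):
            return ERROR                      # item 7a: outside the permissible range
        return OK

    def ingest(record):
        """Flag each value in a record; exceptions are annotated, not silently dropped."""
        flags = {var: quality_flag(var, record.get(var)) for var in PERMISSIBLE_RANGES}
        record["quality_flags"] = flags
        return record

    print(ingest({"call_letters": "KVOD", "temperature": 40.0,
                  "relative_humidity": 135.0}))
    # relative_humidity is flagged '*' (error), as in the listener's annotation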
iv. Reprocessing

(1) Produce derived products
(2) Add to database, and to browse information
(3) Analyze production experience for quality assurance information and act appropriately
(4) Analyze production experience for access efficiency

v. Lessons learned

This aspect of the database requires that semantic and logical relationships known only to the originating scientist be translated smoothly into an internal representation (sub-schema) which can be efficiently accessed and manipulated with available hardware and software, i.e. a blending of the skills and knowledge of both scientists and database designers. It is unclear whether scientists should be permitted to change the structure of a major database, but it is vital to develop tools such as interactive forms which efficiently elicit the information on the basis of which such changes may be made.

d. Machine - machine transfer

This interface is driven by the need to transfer, without human intervention, all or part of the information in the structured data object to another operating system on a different hardware configuration, while preserving the integrity of the data and metadata and all the logical and scientific relationships among them. Such a capability is also fundamental to transferring an existing database to a more efficient implementation within the same environment.

i. Steps in communication

(1) Establish computer-computer link
(2) Negotiate dialog protocols and transfer formats (a choice among a few)
(3) Establish level of shared knowledge
    (a) data model service (language for describing data dictionary)
    (b) disciplinary templates service (operational definition of discipline)
    (c) name service (global system names and disciplinary aliases)
    (d) previously exchanged information and required updates
(4) Identification of items to be passed
(5) Implement transfer (including concurrency and recovery procedures)
(6) Validation and integrity checks
(7) Reconciliation of exceptions
(8) Statistical analysis of this transaction

ii. Lessons learned

Machine to machine transfer between different operating environments places many demands on the completeness and robustness of the descriptions of data structure. In a loose association of autonomous units, a good way of enabling graceful evolution is to provide at each level of the communication process not a single interface or transfer standard, but rather a limited set of choices of such standards, together with a negotiation process whereby the parties can select the one which best meets their needs. As new technologies or circumstances arise, a new standard can always be added to the set of choices, and those users who find it more convenient will gradually make the necessary investment and adapt to using it. Likewise, outmoded or insufficient standards will decline in use and may then be subjected to sunset rules. A well endowed general purpose data server would be expected to implement all the choices, but a part time user with limited hardware capability might be able to invoke only one or two, with a corresponding degradation in the expected level of service. Thus even more important than the details of the interface are the protocols and language of the negotiation. This principle applies as much to the methods by which structured data expressions describe themselves, to data model description languages, and to definitions of the context surrounding a data exchange, as it does to the formats for the exchange itself.
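A minimal sketch of such a negotiation, with invented format names and a deliberately simple preference rule; the document leaves the actual protocol open.

    def negotiate(offered, accepted):
        """Select a transfer format from two small, ordered lists of supported standards.

        `offered' is the requesting party's preference-ordered list, `accepted' the
        other party's. Returns the first mutually supported choice, or None if the
        parties must fall back to a lower level of service.
        """
        usable = set(accepted)
        for fmt in offered:                     # honour the requester's preference order
            if fmt in usable:
                return fmt
        return None

    # A well endowed server implements all the choices; a part time user only a few.
    server_formats = ["structured_binary_v2", "structured_binary_v1", "tagged_ascii"]
    small_user_formats = ["tagged_ascii"]

    print(negotiate(server_formats, small_user_formats))   # 'tagged_ascii'
    print(negotiate(server_formats, ["netcdf_like"]))       # None: no common standard

Adding a new standard then amounts to appending one more name to the lists each party advertises, which is what allows the set of choices to evolve gracefully without breaking existing exchanges.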
e. Storage and archive

This interface is driven by the need for efficient implementation of search and retrieval, with the overall goal of total cost minimization.

i. Balance

This requires a balance between:
(1) Storage system and media costs
(2) Access and processing costs
(3) User time and satisfaction while seeking and retrieving information
(4) Scientist and knowledge engineer time importing information
(5) Extensibility and evolutionary potential of the system

ii. Required information

(1) Decision model for database design. A decision model is an analysis of the choices that have to be made at the design stage and at the operation stage, and of how they impact the overall goal. It provides a framework for assessing the utility of information being sought, both through interactive forms and through statistical analysis of database use, for the purposes of ensuring the logical integrity of the database and increasing its overall efficiency.
(2) Specification of design assumptions and anticipated changes
(3) Logical data dictionary and proximity information. This information has to be garnered systematically from various sources, including a careful review of scientists' insights into fundamental logical relations that can be relied upon for database structure, and proximity relations indicating which variables are most likely to be retrieved together.
(4) Expected frequency of accesses for various inputs
    (a) interactive
    (b) batch
(5) Analysis tools for estimating resource requirements
(6) Performance evaluation criteria and tools

iii. Lessons learned

A central requirement for this aspect is a decision model which shows, for each major database architecture, how such information would actually be used in selecting a design, or in modifying operating regimes. It may be necessary to formalize such decision models, using knowledge engineers to capture the heuristic rules of experienced designers for a variety of local data models. The use of interactive forms to capture from scientists the metadata necessary for efficient database implementation, and the collection of appropriate statistics on performance, both depend on a good analysis of what the implementation choices are. Of course, given the need for unspecified extensibility and evolution, there will be much uncertainty in such decision analyses, but, given appropriate models and the right input information, examining a variety of scenarios should indicate which designs are more likely to be robust at reasonable total cost. A greater choice of local data models may also be needed. Requirements for easy evolution and modification of the schema seem to imply greater flexibility than is available from existing relational or tree structured data models. Such flexibility would seem to be provided by an object oriented functional representation such as the ADAMS language (Pfaltz 1992).

5. References

American Petroleum Institute, 1993. Record Oriented Data Encapsulation Format Standard - A Proposed Schema under RP 66, Version 2.00. Prepared by the Format Workgroup.

IEEE, 1994. A Reference Model for Open Storage Systems Interconnection, Version 5. Unapproved draft, IEEE MS System Reference Model 1.3, February 11, 1994.

Pfaltz, J. L., J. C. French, A. S. Grimshaw, and R. D. McElrath, 1992. Functional Data Representation in Scientific Information Systems. Proceedings of the Conference on Earth and Space Science Information Systems, Pasadena, CA, Jet Propulsion Laboratory.

Sheth, A. P., and J. A. Larson, 1990.
Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22, 183-236.