A REFERENCE MODEL FOR METADATA
A Strawman

Francis Bretherton
University of Wisconsin
DRAFT 3/2/94

Acknowledgements

Significant contributions to the concepts described here have been made by Paul Kanciruk and Paul Singley of Oak Ridge National Laboratory, Joe Rueden of the University of Wisconsin, and John Pfaltz of the University of Virginia. The author is also grateful for encouragement and assistance from Bernie O'Lear of the National Center for Atmospheric Research, and Otis Graf and Bob Coyne of IBM.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1. The challenge

Scientists in all disciplines face a computer-enabled explosion of data: about collisions between fundamental particles, about protein structure and folding, tomographic images of the human brain, or observations of the planet Earth from spacecraft. At the same time, the widespread use of networks and distributed servers is greatly facilitating the availability of data, AND the opportunities for misinterpreting it. What we typically regard as the data is but a small part of the information that has to be assembled, inter-related, and communicated if distant colleagues who were not closely involved in its collection are to make full use of it. This additional information (metadata) is central to our scientific objectives, yet we have few computer-based tools to assist in describing effectively what all these bits mean.

In the domain of Earth system science, metadata has even greater significance. With growing world population and expectations of increasing per capita resource use, humans are increasingly modifying our natural environment on a global scale, with profound but ill-understood consequences for our children and grandchildren. A central challenge to scientists seeking to understand how the Earth system functions is to document the changes that actually occur over the coming decades. A test of the adequacy of our present information systems is to imagine what our successors will think twenty years from now as they examine our records, at a time when everyone who is currently involved has passed from the scene, and try to determine whether the apparent changes between now and then are real, or are merely undetected artifacts of the way we took the measurements or analyzed the data. This requires standards of documentation and quality assurance that far exceed what we are presently able to achieve routinely, sufficient for fail-safe, unambiguous communication about scientifically crucial details without benefit of the interactive questioning that is usually the cornerstone of human discourse.

Existing database management systems are dominated by commercial applications. These tend to have schemas that are relatively static, though the contents may be updated frequently. Priorities are transactional integrity and efficiency of routine use, though as the tools (such as SQL queries) have become available, exploratory queries from management have become more prominent. To achieve these objectives, system design and operation have typically been maintained under tight central control. Scientific databases, on the other hand, are generally adding not only more data but also new types of data. Deletions tend to occur only en masse, when entire datasets are discarded as obsolete or not worth maintaining. Success is measured by the discovery of new relations within the data and by the new questions they stimulate, not by transactional efficiency.
The flexibility to deal with rapidly evolving schemas, and to handle exploratory queries effectively, must be our fundamental priorities. In addition, the same networks are encouraging individual initiative, challenging the very notion of a centralized authority able to impose standards across ANY realistic domain of participants. Thus new concepts are required, which redefine the functional relationships of users, contributors, designers, and operators of our information systems.

2. Purpose of this document

The purpose of this document is to provoke a discussion from which a consensus may emerge on some basic principles that must underlie any broadly applicable set of approaches to improving this situation. Though written by a scientist grounded in Earth system science, the focus is on the information that has to pass in both directions across the interface between humans and electronic representation in quasi-independent logical modules of our data systems, and also between such modules as they are loosely coupled to form a complete system. It attempts a functional analysis of this interface for an individual module (also known as a database), with the goal of isolating and abstracting a description that captures the essence of what is needed, both to enable the full power of computer-to-computer communications and at the same time to empower scientific users of, and contributors to, the information system with their customary modes and vocabulary of scientific discourse.

Such an analysis is a metadata reference model, a framework for specifying the logical structure of the external interfaces to a database with enough precision to be practically realizable in an efficient manner, yet deliberately independent of any particular implementation. Such a framework could then be used for specifying requirements and performance benchmarks in procurement of complete systems, hardware and software, and should enable explicit consideration of design and operational tradeoffs in the light of evolving scientific understanding and technological innovation. Codification of the interface structure should also encourage the development of interactive software tools to ease the pain and improve the productivity of scientists seeking information from the system, or contributing information to it. It also focuses attention on a few specialized services that MUST be centrally managed and coordinated.

A successful example of such a concept is the reference model for hierarchical storage (IEEE, 1994), developed in response to the needs of the national supercomputing centers. Its existence has stimulated vendors to offer complete systems tailored to particular circumstances, rather than individual parts whose integration remains the responsibility of the user. A similar, though more hardware-oriented, example is provided by the description of alternative architectures for Federated Database Systems in Sheth and Larson (1990).

3. What is metadata? A Parable

a. Innocence

Hearing the following message on the radio:

    `temperature 26, relative humidity 20%'

a typical U.S. listener would probably conclude, prompted by the juxtaposition of the words `temperature' and `humidity' and by past experience of similar messages, that it was a statement about the current weather, and that it was a cold dry day outside.
If she were Canadian, on the other hand, she might at first think it was a pleasant summer day, then realize it was January so that this could not be her weather that was being referred to, finally recalling that this radio station transmitted from south of the border and that Americans report temperatures in degrees Fahrenheit, not Celsius. Yet if the temperature had indeed been reported in Celsius, as -3.33333°C, she might well still have rejected the message, suspecting a lack of professionalism somewhere in its formulation. The clue, of course, would be her intuitive realization that the fine gradations implied by 5 decimal places are totally meaningless in describing real weather.

This simple example illustrates a key property of METADATA, which is usually loosely defined as something like `data about data', or `additional information that is necessary for data to be useful'. For any communication between two parties to be effective, there has to be a common body of knowledge shared between them that sets the context for a message. In human discourse, a large amount of this context is generally implicit, or hinted at in ambiguous ways. If different assumptions about it are made at either end, even the simplest piece of data is likely to be misunderstood. Metadata is required to flesh out this context. Of course, in a well understood situation, most of the metadata is redundant and frequently omitted, but if the situation or the parties were to change somewhat, more would be required. If the situation were to change greatly, much more would be required. Thus the statement of what is adequate metadata is itself situation dependent. This is why the meaning of the term itself seems so slippery.

b. Engagement

Suppose now that our listener is a weather freak, who regularly scans the dial and makes entries in her computer like:

    WERN  1/28/94  0700  CST  26   20
    KVOD  1/28/94  0600  MST  40  135
    WGBH  1/29/94  0800  EST  36   90

She also exchanges with like minds all over the continent by Internet, thus expanding her personal collection, which contains thousands of entries. Each one of these is, of course, similar to a record in a relational database, with the primary key provided by the combination of radio station call letters, date, and time, which, provided she has been careful to eliminate duplicate entries, provides a unique identifier for each record. The last two columns are the contents of the original radio messages, but these have now been tagged with additional information necessary for correct interpretation in a continental, climatological context rather than in a local, current-weather context. At the same time, to save typing and computer storage, the variable names which were explicit in the original messages have been compacted into the headers which label each column.

To complete the picture it is necessary to append two additional tables. The first has each entry uniquely identified by a radio station call letter, and lists the location of its service area and whether it reports in Fahrenheit or Celsius. The second has only one entry, listing the units (%) associated with relative humidity. In the relational data model, links are established between such tables through the variables that are common to each, and together these tables are sufficient to describe the logical connections that are needed. As this example makes clear, there is no LOGICAL distinction between metadata and data.
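To make the parable concrete, the three tables just described might be realized as below. This is only an illustrative sketch: the table and column names, the service-area locations (taken from later in the parable), and the use of Python's built-in sqlite3 module are choices made for the example, not anything prescribed by the text.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- Observations: primary key = call letters + date + time
    CREATE TABLE observations (
        call_letters TEXT, obs_date TEXT, obs_time TEXT, time_zone TEXT,
        temperature REAL, relative_humidity REAL,
        PRIMARY KEY (call_letters, obs_date, obs_time)
    );
    -- One row per radio station: service area and reporting units
    CREATE TABLE stations (
        call_letters TEXT PRIMARY KEY, service_area TEXT, temperature_units TEXT
    );
    -- One row only: the units associated with relative humidity
    CREATE TABLE humidity_units (variable TEXT PRIMARY KEY, units TEXT);
    """)

    con.executemany("INSERT INTO observations VALUES (?,?,?,?,?,?)", [
        ("WERN", "1/28/94", "0700", "CST", 26.0, 20.0),
        ("KVOD", "1/28/94", "0600", "MST", 40.0, 135.0),
        ("WGBH", "1/29/94", "0800", "EST", 36.0, 90.0),
    ])
    con.executemany("INSERT INTO stations VALUES (?,?,?)", [
        ("WERN", "Madison WI", "Fahrenheit"),
        ("KVOD", "Denver CO", "Fahrenheit"),
        ("WGBH", "Boston MA", "Fahrenheit"),
    ])
    con.execute("INSERT INTO humidity_units VALUES ('relative_humidity', '%')")

    # The relational link described in the text: tables joined through the
    # shared variable `call_letters'.
    for row in con.execute("""
        SELECT o.call_letters, s.service_area, o.temperature, s.temperature_units
        FROM observations o JOIN stations s USING (call_letters)
    """):
        print(row)

The final query is the point of the exercise: location and units, which were implicit in the original radio messages, are recovered by joining through the variable the tables have in common.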
With a slight change in context, the variable names which were originally an integral part of each message have become an appendage to a whole column of numbers, whereas the information about location and time which was originally implicit has become an integral part of what most people would call the data. Even in a given context, more than one consistent database structure is possible. For example, coupling location data to a map of time zones and to the U.S. and Canadian calendars for daylight saving time might have avoided the entry of the whole column which qualifies the time tag. The Celsius - Fahrenheit issue might also have been handled in a similar way. The choice between such alternative structures is governed by convenience and efficiency of storage, entry, and access, rather than by logical necessity.

None of this is any surprise to anyone who has actually constructed a working database. Databases consist of variables linked by a permanent structure. Each variable (also known as an entity) will normally have several attributes, such as its name, color, numerical value, units, precision, etc., each of which may also vary or be constant within the span of the database. The permanent structure links variables and their attributes, and defines groups of variables. However, additional structure is imparted by the specific values of attributes. For example, if two occurrences of the attribute `call letters' have the same values, it is assumed that they refer to the same radio station, with all the consequences of that identity. This assumption is crucial for inferring location in our listener's database. Within her universe this works well, but had her sharing group included members from Europe, where radio stations do not use unique call letters, a different permanent structure would have been required.

A database is defined for a specific context and has a specific schema: the set of entities with which the database is concerned and the relationships between them which are known to the database. Each time the context is extended, attributes that had been constants (e.g. units) may become variables, and methods of linking that had been built deep into the permanent structure may become falsified. It is thus vital to distinguish a particular database implementation from the logical structure of the context for which it is intended. Understanding and characterizing the latter is key to a robust and effective design.

c. Awakening

Suppose our listener now goes to college, where she attends a course entitled Meteorology 101. Reexamining her database, she suddenly realizes that the record KVOD 1/28/94 0600 must be in error. By definition, values of relative humidity must lie in the range 0-100%. Thus this entry is logically inconsistent and must be used with extreme caution, if at all. She needs to annotate it to that effect. Unfortunately, this contingency was not in her original design, but by good luck she had included an extra vacant column of width 1 character. She thus marks this record with an * in that column, labels the column `data quality', and adds a new table with the one entry:

    *   error

The information conveyed is less specific than would be ideal, but is better than nothing.

She also finds new treasures in her database. Abstracting all the records for KVOD and WERN, she can describe the climate in Denver and Madison, using a graphical comparison as the centerpiece of a term paper. But wait - there is something odd about that graph.
It seems as if in 1995 the temperature in Madison abruptly became warmer, perhaps a symptom of global warming! However, exhaustive research reveals that in March of that year WERN had started reporting the weather from a new location, and the global warming was probably only an artifact of a missing piece of metadata. Disappointed but wiser, she decides to repair that omission and add a comment to that effect to her database, only to discover that with her database management software there was no way she could attach an electronic Post-It note to an arbitrarily defined category of data! So the critical information remained on a piece of paper, which was, of course, lost the next time she moved to a new apartment.

Her term paper addressed the question `How many days in the year is Denver warmer than Boston?' She realized that, for any given date and time, the statement `Denver is warmer than Boston' can in principle be evaluated as True, False, Not Resolvable, or NoData. This involves examining the values of the variable temperature and its attribute `units' which are associated with KVOD and WGBH, converting if necessary from Celsius to Fahrenheit or vice versa, and testing whether the result associated with the former is greater than that from the latter by an amount which exceeds the assigned measurement tolerances (a sketch of this test appears at the end of this subsection). Since she was in fact interested in this question for many pairs of cities, she wrote a FORTRAN program to perform these operations, and then added to her database a table which contained the result for every pair of radio stations. The next time the issue came up, she did not consult the raw data, but rather this derived product.

The context for a database is set not just by the scope of the data being input, but also by the type of questions being asked. As understanding of the contents grows, even with identical data, issues will arise that require additions to the contents in ways that cannot be foreseen in specific terms. Design decisions based on implementation efficiency must not preclude the ability to incorporate new information at a whole variety of levels. These levels range from additional varying attributes attached to all instances of particular variables or groups of variables, to occasional comments or processing branch points attached to whole blocks of instances, to derived products that will frequently be accessed without direct reference to the original, or perhaps processed `on the fly' as the requirement arises according to algorithms that are themselves entries in the database. A robust design must reflect a deep understanding of considerations such as uncertainty of measurement and human error, as well as an appreciation of the extent to which the apparent logical structure of entities within the database is based on arbitrary convention as opposed to fundamental principles.
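As promised above, here is a minimal sketch of the `Denver is warmer than Boston' test. The function names, the dictionary representation of an observation, and the size of the tolerance are assumptions made for illustration, and the listener's FORTRAN has become Python.

    from enum import Enum

    class Verdict(Enum):
        TRUE = "True"
        FALSE = "False"
        NOT_RESOLVABLE = "Not Resolvable"
        NO_DATA = "NoData"

    def to_fahrenheit(value, units):
        """Transformation algorithm: convert a temperature to Fahrenheit."""
        if units == "Fahrenheit":
            return value
        if units == "Celsius":
            return value * 9.0 / 5.0 + 32.0
        raise ValueError(f"unknown units: {units}")

    def warmer_than(obs_a, obs_b, tolerance=1.0):
        """Evaluate `A is warmer than B' as True, False, Not Resolvable, or NoData.

        Each observation is a dict with a `temperature' value and a `units'
        attribute; `tolerance' stands in for the assigned measurement tolerances.
        """
        if obs_a is None or obs_b is None or \
           obs_a.get("temperature") is None or obs_b.get("temperature") is None:
            return Verdict.NO_DATA
        a = to_fahrenheit(obs_a["temperature"], obs_a["units"])
        b = to_fahrenheit(obs_b["temperature"], obs_b["units"])
        if a - b > tolerance:
            return Verdict.TRUE
        if b - a > tolerance:
            return Verdict.FALSE
        return Verdict.NOT_RESOLVABLE      # difference lies within the tolerances

    # One entry in the derived product: the verdict for one pair on one date
    denver = {"temperature": 40.0, "units": "Fahrenheit"}   # KVOD
    boston = {"temperature": 36.0, "units": "Fahrenheit"}   # WGBH
    print(warmer_than(denver, boston))                      # Verdict.TRUE

A table of such verdicts for every pair of stations is exactly the kind of derived product that, once stored, deflects later queries away from the raw data.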
d. Frustration

Our listener has now become a serious student of fluid dynamics and Earth system science. She learns how the winds and ocean currents, and even the flow of water through the soil, are driven by gradients of pressure, and how the equations of motion can be simulated in a computer. She recognizes the importance of understanding the interactions between such flow processes in the atmosphere, the oceans, and on the land surface, and starts a comparative study. Her first task is to find some data that might be suitable. So she logs on to the NASA master directory and starts to search under the keyword `pressure'.

She finds pressure is a key variable for a large fraction of atmospheric data sets, but frequently as a label tagging where in the atmosphere an observation was taken, rather than as a variable at a fixed location, which was what she expected. For some ocean measurements pressure occurs in the same sense, but more often the tag is labelled `depth'. Never does it seem to refer to the expected physical driver for motion. In the hydrology section she finds no references to pressure at all! So she returns to her textbooks, and discovers that in computer models of this type the active variable driving the motion is actually dynamic height, which is the height of a surface of constant pressure, rather than the pressure at constant height. The two are closely related, and provided additional information is available to determine the fluid density, there is a mathematical transformation between them. Thus the words are often loosely used interchangeably, even though technically they are quite different. So she returns to her database and this time finds what she was seeking for the atmosphere and the ocean, but still draws a blank on the hydrology. In despair, she finally finds a hydrologist and explains her problem. He laughs, and remarks that what she is looking for is called by hydraulic engineers `head'.

Noting that oceanographers use `head' to refer to a water closet, she returns to the Master Directory and from the summaries there selects a few promising candidates. She now wants a more specific description of contents, but realizes her candidates are in five different data centers spread among three different federal agencies, each with its own access system and rules and regulations. After much time on the telephone, she is able to log on to one of them. The dataset seems to be the type of thing she would need, but there is no way she can get a brief sample without first obtaining a user authorization and then purchasing by mail special software to read their format, with a 90% chance that when she is able to visualize the sample it will prove unsuitable. After several similar experiences she abandons her project, and concentrates instead solely on what she can do with her existing database.

This example illustrates some of the practical realities of building connections and establishing interactions between different parts of the Earth sciences community. The concept of a database designed for a particular task is inapplicable to such an open, evolving context. An effective information system must not only provide efficient access to a wide diversity of specific data. It must also provide tutorials and help systems tailored to the starting points of many different users, which may range over a wide variety of disciplinary backgrounds and degrees of sophistication. Though humans may be able to sense ambiguity and conflict in the names used to describe related concepts within different disciplines, computer to computer communication requires a reliable procedure for resolving such issues. Different parts of the system have to be semi-autonomous and self-describing, not only to a human, but also to other computers and databases across a network. There are few standards, and no central authority capable of imposing any across more than a subset of the contexts the system has to serve, yet total anarchy is also unacceptable.
Patches have to be built between existing ways of doing things and existing formats for communication, even if they are not entirely consistent, at the same time as providing pathways for voluntary, incremental growth toward a system that benefits everyone. There should always be a few (not just one or infinitely many) alternative communication formats available, from which the parties can choose the most convenient, with the onus for adaptation on the party with the greater resources. The most precious and irreplaceable asset is the time and enthusiasm of the people who are potential users of the information system and whose knowledge and skills are needed to contribute to its development. Metadata includes all the data needed to make such an expanded vision work.

4. A Strawman Reference Model

a. Overview

i. A vision of the system

The metadata reference model referred to here is a logical analysis of the structure of the external interface of autonomous modules (here called databases) loosely linked within a complete information system (here called the association). For present purposes, the relevant communications within that system are between scientists and databases, and between databases themselves. For an important subset of applications in Earth system science, the ultimate objective is the effective one-way transfer of information between humans now and humans decades in the future, without benefit of the interactive exchanges that usually mediate human discourse.

The environment for such communication is presumed to be similar to that provided by the Internet: a rapidly evolving, open association of autonomous units, with a minimal set of operating rules, and no central authority capable of imposing uniform standards except by common agreement. Technologies such as Gopher and the World Wide Web, with user interfaces such as MOSAIC, provide navigational markers and flexible access tools, but useful user-oriented higher level data structures for full-service information systems have still to be evolved. The emergence of knowledge based software agents which navigate the network and perform chores on behalf of users only accentuates the need for clear external interfaces to logical units in the information system, which can be accessed effectively both by humans and by other computers.

ii. What is metadata?

Metadata is often defined as data about data, or, only a little less vaguely, as the information required to make scientific data useful. Indeed, the term means different things to different people, and defies precise definition. We will use it in only a general sense, taking refuge behind an operational shift to defining context instead. For successful communication it is essential that both parties share a common set of assumptions (here referred to as a context) according to which the messages that pass are to be interpreted. In natural language such contexts are generally largely implicit, and are elucidated by question and answer only when apparent inconsistencies indicate that there may be a problem in interpretation. However, computers typically require that all relevant assumptions be explicit.

At the center of traditional approaches to database design is a schema. A schema describes the conceptual or logical data structures of all the objects or entities with which the database is concerned, together with all the relationships between them known to the database (see for example Sheth and Larson 1990).
In such a well defined context, the difference between metadata and data disappears - metadata is simply data. However, when the context is extended or modified, new information, metadata, is needed to provide unambiguous interpretation and a new schema. Thus, according to this perspective, the distinction between metadata and data is merely one of use, and the focus shifts to another formidable task, that of defining context. Metadata becomes the additional data that must be invoked to implement the change in context. The prefix meta does not attach to the data itself, but derives from the circumstance of change.

Here a context is simply a set of assumptions, with a unique name or identifier. An approach to building such sets, including the categories of assumptions that are needed, is discussed below under `templates'. Subject to certain significant caveats, such contexts can then be combined or manipulated using the operations of set algebra such as union and intersection. Note, however, that each party involved in a communication must first declare by name or establish by enumeration an INITIAL CONTEXT for themselves. These initial contexts must then be reconciled (to a first approximation by set union) into a UNIFIED CONTEXT within which messages can be unambiguously interpreted. To achieve this unification, metadata (or, if you prefer, data) must be exchanged or otherwise invoked.

iii. A metadata reference model

A metadata reference model is an analysis of the uses of metadata (in the general sense of the word) in four different areas of scientific data management activity, each with its own characteristic requirements:
(1) query, browse, retrieval;
(2) ingest, quality assurance, reprocessing;
(3) machine to machine transfer;
(4) storage, archive;
in the context of a crosscutting set of
(5) disciplinary perspectives.
This analysis, expressed in natural language, has to result in a logical structure that is sufficiently precise and internally coherent to be used as a framework for defining the external interface of logical modules (here referred to as databases), which are loosely linked in an association which forms the information system. The logical structure of this external interface is quite distinct from any implementation of particular databases, focussing instead on the information that has to pass across the interface for the enterprise to be successful.

iv. Structured data expression

Each module or database appears to the outside world as a structured data expression that has the following properties:
(1) self describing to humans and machines;
(2) can be entered and manipulated at different levels;
(3) presents different views according to level;
(4) is persistent;
(5) is dynamically extensible;
(6) is portable to different implementations while preserving its external interface;
(7) contains an extensible set of methods and utilities for manipulating and transforming data;
(8) has a look and feel that can be tailored to user discipline and datastream structure.
An additional property which seems highly desirable is that it be possible, in constructing such modules, to include parts of others in a transparent fashion. It is at present unclear whether this is inconsistent with the autonomy of individual modules. It is assumed that individual structured data expressions are normally built around particular data holdings, often centered on an established data stream or research program, with ancillary information from elsewhere added as required.
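Properties (1)-(8) constrain the external interface, not any particular implementation. Purely as a hypothetical illustration of what `self describing', `entered at different levels', and `an extensible set of methods' might look like from outside, consider the following toy sketch; every class, method, and name in it is invented for the example.

    class StructuredDataExpression:
        """Toy external interface for one module (database) in the association."""

        def __init__(self, name, levels):
            self.name = name
            self.levels = levels      # e.g. {"atomic": [...tables...], "derived": [...]}
            self.methods = {}         # extensible set of transformation utilities

        def describe(self, audience="human", level="atomic"):
            """Self-description, for a person or for another computer."""
            entities = sorted(self.levels.get(level, []))
            if audience == "machine":
                return {"module": self.name, "level": level, "entities": entities}
            return f"{self.name} (level '{level}'): " + ", ".join(entities)

        def register_method(self, name, func):
            """Dynamic extensibility: algorithms are themselves entries in the database."""
            self.methods[name] = func

        def view(self, level):
            """Different views according to level of entry."""
            return self.levels.get(level, [])


    db = StructuredDataExpression(
        "weather_log",
        {"atomic": ["observations", "stations", "humidity_units"],
         "derived": ["station_pair_comparisons"]},
    )
    db.register_method("celsius_to_fahrenheit", lambda c: c * 9.0 / 5.0 + 32.0)
    print(db.describe("human", "atomic"))
    print(db.describe("machine", "derived"))

The point of the sketch is only that the same module answers both a human and a machine, at a level the caller chooses, and can grow new methods without a change to its interface.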
The lowest level, an atomic expression, is conceived as intellectually coherent and structurally simple (e.g. a few relational tables with SQL access), though it might contain very large volumes of data. Higher levels would build more complex structures. The term `level' is used here to specify entry points and the degree of complexity of constituent sub-expressions, but the number of such levels will vary between databases. Users will typically match the level to their prior familiarity with the subject matter, so some approximate naming convention for levels is desirable. These structured data expressions form quasi-autonomous but interacting modules (databases) within the broader, decentralized information system (the association).

The context in which data is to be interpreted defines the metadata that is required. Thus a central issue is how to define, structure, and describe the range of contexts within which a particular database must function. This range will be built to any particular level on the characteristics of individual data streams, drawing on a vocabulary and set of default conventions and rule based knowledge which are identified by discipline, subdiscipline, and specialty, expanded from the top down.

These databases will be complemented within the association by issue-oriented high-level modules, mostly based upon natural language, which provide structured learning paths for the intelligent but non-expert user through key documents and published literature down to selected derived products and original datasets. Such high-level modules will act essentially as multiple indices over the population of databases, relying on scientific interpretation of the database contents. The primary emphasis here is on the first type of database, where the interactions with original datasets will be mediated by individuals who are scientifically knowledgeable in the disciplinary area, though not necessarily expert in the specific formats and conventions used or in the details of instruments or analysis techniques.

v. Basic modes of expression

A structured data expression should be able to exchange information with humans in any of the basic modes of expression for scientific discourse:
(1) natural language - marked up to make it partially machine searchable;
(2) tables and data structures;
(3) graphs and diagrams;
(4) images;
(5) equations and mathematical models;
(6) algorithms and implementable procedures;
(7) electronic documents and other formats combining several modes.
Not all of these may be economically feasible at first, but full implementation will require them all. Because the voluntary judgement of scientists and information specialists is critical to effective functioning of the information system, emphasis in this analysis is placed upon a logical basis for structuring content and providing tools which ease the pain of information entry and retrieval. At the same time, the logic of the analysis must be sufficiently rigorous to allow the efficient implementation and automated management of very large and complex databases. Beyond a certain level of sophistication these requirements may well be inconsistent or impractical. However, technologies such as AGENTS and INTERACTIVE FORMS are potentially capable of greatly increasing the productivity of humans interacting with the system, and the vision of such tools should not be constrained by existing practice.
vi. Services

Though it is supposed that each database retains design autonomy over its internal implementation and contents, for the association of many databases to function there need to be certain system-wide services which are centrally administered. Besides standard network connectivity, transport, and communication protocols, special attention needs to be paid to a process for assembling community inputs and resolving conflicts on at least the following issues:

(1) User authorization and authentication. Even if the contents of the database are available for full and open exchange, the ideal of collaborative science, it will still be necessary to restrict the ability to modify those contents, and to account for the use of resources. Since many users will be other computers known only by a network address, there will have to be some centrally coordinated register of authorized users in various categories, and some technique for authenticating stated identities.

(2) Globally unique names and their disciplinary aliases. Discrete names for discrete entities form the glue that holds a computer network together. Likewise, scientific discourse requires consensus on the meaning of technical terms. Providing a universal name service in an unstructured, evolving, distributed environment touches some fundamental issues relating to conflict resolution and concurrency. However, it may well be that, even though no perfect solution may exist, for relatively slowly changing uses such as disciplinary templates practical compromises are effective. When a variable or data structure is generated by reprocessing for which no external name exists, the variable or structure can be described, at the cost of some inconvenience, by its pedigree. After careful analysis to distinguish homonyms and eliminate functional duplicates, the more significant of such items may in due course be assigned global names with appropriate disciplinary aliases, so that they are easier to use.

(3) Model description languages - particularly for structured data objects. A structured data expression must include methods to describe itself, both to another computer and to a human being. Such descriptions must be in one of a few model description languages, and the initial negotiations between communicating partners should include the selection of a suitable one. As at other levels, this negotiation protocol is the key to graceful evolution.

(4) A few ALTERNATIVE data transfer formats. These should cover the range of modes of expression of scientific discourse and a range of hardware and software capabilities. They may well be defined procedurally, as methods which are based upon transformation algorithms that are themselves entries in the database.

(5) Templates to help define context. A central concept in the reference model is the articulation of a set of hierarchically structured templates, named for example by a tag <discipline>.<subdiscipline>.<specialty>, or by <datastream>, each of which defines a set of default assumptions and can be used individually or by set union to construct approximations to initial contexts, as sketched below.
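The following is an illustration of that mechanism only; the template names, assumption identifiers, and values are invented, and a real system would carry far richer content in each category.

    # Hypothetical disciplinary and datastream templates: named sets of assumptions,
    # each assumption carried as a (unique identifier, value) pair.
    TEMPLATES = {
        "meteorology.climatology.surface_obs": {
            ("units.temperature", "Fahrenheit"),
            ("units.relative_humidity", "%"),
            ("range.relative_humidity", (0.0, 100.0)),
        },
        "datastream.radio_weather_reports": {
            ("key.primary", ("call_letters", "date", "time")),
            ("units.temperature", "Fahrenheit"),
        },
        "oceanography.dynamics": {
            ("units.temperature", "Celsius"),
            ("name.pressure_like_variable", "dynamic height"),
        },
    }

    def initial_context(*template_names):
        """Approximate an initial context by set union of the named templates."""
        context = set()
        for name in template_names:
            context |= TEMPLATES[name]
        return context

    def unify(context_a, context_b):
        """Reconcile two initial contexts; report conflicts
        (same identifier, different value)."""
        merged = context_a | context_b
        seen = {}
        conflicts = []
        for ident, value in merged:
            if ident in seen and seen[ident] != value:
                conflicts.append(ident)
            seen.setdefault(ident, value)
        return merged, sorted(set(conflicts))

    user = initial_context("oceanography.dynamics")
    database = initial_context("meteorology.climatology.surface_obs",
                               "datastream.radio_weather_reports")
    unified, conflicts = unify(user, database)
    print("conflicts to resolve:", conflicts)   # ['units.temperature']

The conflict reported at the end is exactly the kind of overlap that the information-system-wide process for unique identifiers and conflict resolution, discussed below, would have to handle before the unified context can be trusted.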
The assumptions in disciplinary templates should include coverage of the following categories:
(a) public names for variables;
(b) logical and associative relationships between variables;
(c) descriptions of standard measurement and analysis procedures, including suggestions for relevant metadata;
(d) descriptions of standard theoretical models;
(e) units, levels of precision;
each expressed in an appropriate format, rule base, or language. The primary information is semantic (i.e. science content), but (6) much value for database management would come from structuring relations in the template according to the following classes:
(a) Fundamental - logically based, can be hardwired into database manipulations;
(b) Proximity - physically/intellectually based associations, things most likely to be retrieved together, crucial for implementation efficiency;
(c) Transformation - associated with implementable algorithms that are themselves entries in the database, e.g. Celsius <-> Fahrenheit or internal <-> external formats;
(d) Derivative - value added products that become new entries in the database and deflect queries from the original; or
(e) Guide - explanations and cross-references driven by science content, intended to inform a human user.

Likewise, each instrument system or datastream has its own set of obvious metadata. This includes logbooks, calibrations, and other information that is necessary to proceed from pointer readings to calibrated physical units to derived products, and statements of priority which reflect the original purpose of the measurements or analysis. In special cases ad hoc formats have been developed for including such data (e.g. American Petroleum Institute, 1993), but a more systematic set of default templates for data streams, similar to that for disciplines, would aid communication.

Approximations to initial contexts for a dialogue between a user and a database would be provided by the default assumptions in, on the one hand, the template or templates associated with a discipline or disciplines declared by the user, and, on the other hand, a template associated with the datastream. Each set of default assumptions would then be modified if necessary, before merger into a single unified context. The templates will evolve slowly with time, as scientific understanding is enhanced and scope changes. Thus provision must also be made for version control, updating, and structural evolution.

Such templates would take much of the pain (for humans at least) out of establishing effective communication, and should stimulate the development of interactive forms and software agents to assist the two-way flow of information between humans and the database. However, to combine such contexts by set algebra into broader, interdisciplinary aggregates, and to use them to develop specific schemas suitable for machine to machine communication, it is also necessary to have an information-system-wide process for identifying overlaps and resolving conflicts, for example where the same name is used for different things, or different names for the same thing. Each distinct assumption should have a unique identifier. This requirement is a potentially serious caveat, of which the implications are unclear.

b. Query, browse, retrieval

This interface is driven by a human user's need to answer questions efficiently. This requires a response time that will keep the user engaged, and response information which is appropriate to the context as perceived by the user.
For a direct human user, a brief initial questionnaire (for example naming the user's discipline, subdiscipline, and specialty) may be enough to enable responses to be tailored intelligently. Alternatively, specific queries could be managed by an appropriate user based software agent, working to a more detailed user profile.

i. What datasets exist?

The Global Change Master Directory is already functioning on line. Individual U.S. holdings are summarized in Directory Information Format (DIF) entries, though not all agencies have populated it fully. The system works but is experiencing some trouble with keywords. Though it is adequate for the scientifically knowledgeable user in his area of expertise, who is the principal focus of this analysis, the system needs to be supplemented with hierarchically structured, issue-oriented introductory guides to the science, which start from highly summarized assessments, with successive layers presenting more and more detail on selected subtopics, based ultimately on published literature, well validated model outputs and data products, and a few reliable and especially significant original datasets.

ii. Is it likely to be of use to me?

(1) An index to the logical structure of the database
(2) A summary description of scientific context, including discussion of primary scientific objectives for the data, the variables, instrument systems, processing algorithms, and quality control procedures
(3) The spatial and temporal coverage and sampling
(4) The scientific credentials of this data, including evidence for its credibility and references to scientific publications which have used it or commented on its quality or deficiencies
Each of these items should be presented first in a quickly assimilated summary, with hot buttons leading to layers containing more detail if desired. The summaries will typically be electronic documents prepared for the purpose by knowledgeable scientists, and should also be accessible by library-type searches. However, to ensure currency, the most detailed layers on which they are based should be active datasets, with automatic updating of the summaries where appropriate.

iii. Is it really what I want?

Browse products:
(1) A typical sample
(2) Diagrams
(3) Graphs
(4) Derived products
In the interests of efficiency and effective presentation, this set of browse products will generally be prepared off line by reprocessing, though in some cases realizations may be invoked procedurally as required. Where possible they should be communicated in the mode of scientific discourse that is most appropriate for the user. For certain datasets, utilities may be provided enabling users to prepare their own.

iv. How do I get it?

There are several stages in a data request:
(1) An interactive order form, completed by the user, checked by the system
    (a) Name
    (b) Address
    (c) Medium
    (d) Preferred format
    (e) Variables
    (f) Scope
    (g) View
    (h) Level of metadata required
    (i) etc.
(2) Resource estimation
(3) Authorization
(4) Implementation - typically batch
(5) Statistical analysis - typically background
The statistical analysis is to provide information for the management and evolutionary design of the database. Data gathering should be built into the query-response and data request software, as sketched below.
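One minimal way of stringing these stages together is sketched here; the field names, the resource rule, and the authorization check are placeholders rather than a prescribed design.

    REQUIRED_FIELDS = ["name", "address", "medium", "preferred_format",
                       "variables", "scope", "view", "metadata_level"]

    def check_order_form(form):
        """Stage (1): the system checks the interactive order form completed by the user."""
        return [f for f in REQUIRED_FIELDS if not form.get(f)]

    def estimate_resources(form):
        """Stage (2): placeholder resource estimate; a real rule would use the scope."""
        return {"megabytes": 10 * len(form["variables"]), "tapes": 1}

    def process_request(form, authorized_users, usage_log):
        missing = check_order_form(form)
        if missing:
            return f"rejected: missing fields {missing}"
        estimate = estimate_resources(form)
        if form["name"] not in authorized_users:        # stage (3): authorization
            return "rejected: user not authorized"
        job = {"form": form, "estimate": estimate,
               "status": "queued for batch"}            # stage (4): implementation
        usage_log.append(job)                           # stage (5): statistics for design
        return job["status"]

    log = []
    form = {"name": "listener", "address": "Madison WI", "medium": "network",
            "preferred_format": "ASCII table", "variables": ["temperature"],
            "scope": "1994-1995", "view": "station time series",
            "metadata_level": "full"}
    print(process_request(form, authorized_users={"listener"}, usage_log=log))

The usage log accumulated in the last stage is the raw material for the statistical analysis that feeds back into the management and evolutionary design of the database.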
v. Lessons learned

The lesson learned from analysis of this aspect of the database external interface is that, besides the original data itself, an active archive needs to offer a rich set of specially prepared electronic documents with a high density of scientific guide information, built on top of an expanding range of products derived from the original data and other sources. The preparation of, and provision of access to, this supplementary information requires input from knowledgeable scientists and skilled communicators and involves considerable work. It is however an investment which both greatly enhances the utility of the database for the typical user and also, by deflecting ill-informed requests, may actually reduce the need for ad hoc, high-volume accesses to the original data itself. Particularly important is the selection of products derived in various ways from the original data. The great majority of potential users would probably be satisfied with a well executed higher level product with an established scientific pedigree, rather than have to derive something similar from scratch themselves. Easing the process of generating appropriate supplementary information, and of dynamically structuring the database to accommodate it, has to be a fundamental consideration of the information system design.

c. Ingest, quality assurance, and reprocessing

This interface is driven by the need to acquire a high quality dataset with a precisely defined data dictionary, and to ensure the logical and scientific integrity of the database. It requires input from both expert scientists and knowledge engineers, negotiated if necessary between them. Reprocessing is an intensive use of the database which contributes to quality control and is under the control of individuals who may be presumed expert in both the datastream and the internal database structure.

i. Ingest

The information that needs to be acquired falls into three categories, relating to:
(1) scientific content
(2) logical structure
(3) patterns of use

ii. Linkages

Linkages that need to be defined are expressed by:
(1) Internal representation of data dictionary - external representation
(2) Internal names - external names
(3) Assumptions about context - explicit representation
(4) Variables - attributes
(5) Assumed mathematical and logical equivalences - tests for database integrity
(6) Assumed transformation algorithms and utilities
(7) Content quality control - action in case of exceptions (sketched below)
    (a) permissible ranges of attributes
    (b) tolerances in transformations
    (c) missing data
    (d) attaching quality control flags
(8) Built in attributes
    (a) sticky notes attached to blocks of data
(9) Proximity relationships - efficient internal representations. Proximity relationships are criteria indicating the relative probability that data items will need to be accessed together. They provide information fundamental to efficient database design, such as hard-wired relationships and pointer structure.
(10) Dimensionality of data source
    (a) space and time
    (b) scientific associations between variables
    (c) relevance blocks for metadata
(11) Known and projected reprocessing algorithms - implications for database
(12) Linkage definitions - a decision model for database design and operations

iii. Quality assurance

(1) Identify quality assurance issues
(2) Add quality assurance flags and comments
(3) Document
    (a) algorithms
    (b) diagnostics
    (c) external inputs
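A minimal sketch of the content quality control step of item (7) above, echoing the relative humidity example from the parable; the variable names, permissible limits, and flag codes are hypothetical.

    # Permissible ranges per variable, of the kind a disciplinary template would supply.
    PERMISSIBLE_RANGES = {
        "temperature": (-130.0, 140.0),       # degrees Fahrenheit, illustrative limits
        "relative_humidity": (0.0, 100.0),    # by definition a percentage
    }

    MISSING, ERROR, OK = "M", "*", " "        # one-character quality control flags

    def quality_flag(variable, value):
        """Attach a quality control flag to a single value (item 7d)."""
        if value is None:
            return MISSING                    # item 7c: missing data
        low, high = PERMISSIBLE_RANGES[variable]
        if not (low <= value <= high):
            return ERROR                      # item 7a: outside the permissible range
        return OK

    def ingest(record):
        """Flag each value in a record; exceptions are annotated, not silently dropped."""
        flags = {var: quality_flag(var, record.get(var)) for var in PERMISSIBLE_RANGES}
        record["quality_flags"] = flags
        return record

    print(ingest({"call_letters": "KVOD", "temperature": 40.0,
                  "relative_humidity": 135.0}))
    # relative_humidity is flagged '*' (error), as in the listener's annotation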
iv. Reprocessing

(1) Produce derived products
(2) Add to database, and to browse information
(3) Analyze production experience for quality assurance information and act appropriately
(4) Analyze production experience for access efficiency

v. Lessons learned

This aspect of the database requires that semantic and logical relationships known only to the originating scientist be translated smoothly into an internal representation (sub-schema) which can be efficiently accessed and manipulated with available hardware and software, i.e. a blending of the skills and knowledge of both scientists and database designers. It is unclear whether scientists should be permitted to change the structure of a major database, but it is vital to develop tools such as interactive forms which efficiently elicit the information on the basis of which such changes may be made.

d. Machine - machine transfer

This interface is driven by the need to transfer, without human intervention, all or part of the information in the structured data object to another operating system on a different hardware configuration, while preserving the integrity of the data and metadata and all the logical and scientific relationships among them. Such a capability is also fundamental to transferring an existing database to a more efficient implementation within the same environment.

i. Steps in communication

(1) Establish computer-computer link
(2) Negotiate dialog protocols and transfer formats (a choice among a few)
(3) Establish level of shared knowledge
    (a) data model service (language for describing data dictionary)
    (b) disciplinary templates service (operational definition of discipline)
    (c) name service (global system names and disciplinary aliases)
    (d) previously exchanged information and required updates
(4) Identification of items to be passed
(5) Implement transfer (including concurrency and recovery procedures)
(6) Validation and integrity checks
(7) Reconciliation of exceptions
(8) Statistical analysis of this transaction

ii. Lessons learned

Machine to machine transfer between different operating environments places many demands on the completeness and robustness of the descriptions of data structure. In a loose association of autonomous units, a good way of enabling graceful evolution is to provide at each level of the communication process not a single interface or transfer standard, but rather a limited set of choices of such standards, together with a negotiation process whereby the parties can select the one which best meets their needs. As new technologies or circumstances arise, a new standard can always be added to the set of choices, and those users who find it more convenient will gradually make the necessary investment and adapt to using it. Likewise, outmoded or insufficient standards will decline in use and may then be subjected to sunset rules. A well endowed general purpose data server would be expected to implement all the choices, but a part time user with limited hardware capability might be able to invoke only one or two, with a corresponding degradation in the expected level of service. Thus even more important than the details of the interface are the protocols and language of the negotiation. This principle applies as much to the methods by which structured data expressions describe themselves, to data model description languages, and to definitions of the context surrounding a data exchange, as it does to the formats for the exchange itself.
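A minimal sketch of such a negotiation, with invented format names and a deliberately simple preference rule; the document leaves the actual protocol open.

    def negotiate(offered, accepted):
        """Select a transfer format from two small, ordered lists of supported standards.

        `offered' is the requesting party's preference-ordered list, `accepted' the
        other party's. Returns the first mutually supported choice, or None if the
        parties must fall back to a lower level of service.
        """
        usable = set(accepted)
        for fmt in offered:                     # honour the requester's preference order
            if fmt in usable:
                return fmt
        return None

    # A well endowed server implements all the choices; a part time user only a few.
    server_formats = ["structured_binary_v2", "structured_binary_v1", "tagged_ascii"]
    small_user_formats = ["tagged_ascii"]

    print(negotiate(server_formats, small_user_formats))   # 'tagged_ascii'
    print(negotiate(server_formats, ["netcdf_like"]))       # None: no common standard

Adding a new standard then amounts to appending one more name to the lists each party advertises, which is what allows the set of choices to evolve gracefully without breaking existing exchanges.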
e. Storage and archive

This interface is driven by the need for efficient implementation of search and retrieval, with the overall goal of total cost minimization.

i. Balance

This requires a balance between:
(1) Storage system and media costs
(2) Access and processing costs
(3) User time and satisfaction while seeking and retrieving information
(4) Scientist and knowledge engineer time importing information
(5) Extensibility and evolutionary potential of the system

ii. Required information

(1) Decision model for database design. A decision model is an analysis of the choices that have to be made at the design stage and at the operation stage, and of how they impact the overall goal. It provides a framework for assessing the utility of information being sought, both through interactive forms and through statistical analysis of database use, for the purposes of ensuring the logical integrity of the database and increasing its overall efficiency.
(2) Specification of design assumptions and anticipated changes
(3) Logical data dictionary and proximity information. This information has to be garnered systematically from various sources, including a careful review of scientists' insights into fundamental logical relations that can be relied upon for database structure, and proximity relations indicating which variables are most likely to be retrieved together.
(4) Expected frequency of accesses for various inputs
    (a) interactive
    (b) batch
(5) Analysis tools for estimating resource requirements
(6) Performance evaluation criteria and tools

iii. Lessons learned

A central requirement for this aspect is a decision model which shows, for each major database architecture, how such information would actually be used in selecting a design, or in modifying operating regimes. It may be necessary to formalize such decision models, using knowledge engineers to capture the heuristic rules of experienced designers for a variety of local data models. The use of interactive forms to capture from scientists the metadata necessary for efficient database implementation, and the collection of appropriate statistics on performance, both depend on a good analysis of what the implementation choices are. Of course, given the need for unspecified extensibility and evolution, there will be much uncertainty in such decision analyses, but, given appropriate models and the right input information, examining a variety of scenarios should indicate which designs are more likely to be robust at reasonable total cost. A greater choice of local data models may also be needed. Requirements for easy evolution and modification of the schema seem to imply greater flexibility than is available from existing relational or tree structured data models. Such flexibility would seem to be provided by an object oriented functional representation such as the ADAMS language (Pfaltz 1992).

5. References

American Petroleum Institute, 1993. Record Oriented Data Encapsulation Format Standard - A Proposed Schema under RP 66, Version 2.00. Prepared by the Format Workgroup.

IEEE, 1994. A Reference Model for Open Storage Systems Interconnection, Version 5. Unapproved draft, IEEE MS System Reference Model 1.3, February 11, 1994.

Pfaltz, J. L., J. C. French, A. S. Grimshaw, and R. D. McElrath, 1992. Functional Data Representation in Scientific Information Systems. Proceedings of the Conference on Earth and Space Science Information Systems, Pasadena, CA, Jet Propulsion Laboratory.

Sheth, A. P., and J. A. Larson, 1990.
Federated Database Systems for Managing Distributed, Heterogeneous, and Autonomous Databases. ACM Computing Surveys, 22, 183-236.