Toward Unified Metadata for the Department of Defense1

Arnon Rosenthal, Edward Sciore*, Scott Renner
The MITRE Corporation, Bedford MA
{arnie, sar}@mitre.org, [email protected]

ABSTRACT

Data sharing within and among large organizations is possible only if adequate metadata is captured. But administrative and technological boundaries have increased the cost and reduced the effectiveness of metadata exploitation. We examine practices in the Department of Defense (DOD) and in industry’s Electronic Data Interchange (EDI) to explore pragmatic difficulties in three major areas: the collection of metadata; the use of intermediary transfer structures such as formatted messages and file exchange formats; and the adoption of standards such as IDEF1X-97.

We realize that in large organizations, a complete metadata specification will need to evolve gradually. We are concerned here with initial steps. We therefore propose a simple framework, for both databases and transfer structures, which can accommodate varying degrees of metadata specification. We then propose some conceptually simple (but rarely practiced) techniques and policies to increase metadata reusability within this framework.

1. INTRODUCTION

The barriers toward shared and integrated data are severe and well known. They include heterogeneous requirements and practices, inadequate tools for documentation, and inadequate will to provide good documentation. Efforts to overcome these barriers through better data administration have had limited success. This paper examines some of the shapes these barriers take within the Department of Defense (DOD), and describes some steps taken by DOD and standards groups to ameliorate those difficulties. Our goal is to describe a vision of a metadata environment that supports data sharing with and within large-scale organizations, and to compare this vision with the current situation in DOD and with some relevant standards.

1 Presented at the IEEE Metadata Conference, Silver Spring, MD, 1997. Proceedings may be available at http://www.llnl.gov/liv_comp/metadata/md97.html or from the IEEE home page.

* Edward Sciore is also a member of the Computer Science Department, Boston College, Chestnut Hill MA.

In this paper, “data sharing” will refer both to situations where there is, in effect, an integrated database accessible to multiple applications (sometimes called integration), and also to situations where one application or database transmits data for use by another (sometimes called interoperability). In either case the key consideration is dealing with semantic and structural incompatibilities between the participating systems. We distinguish four ways to cope with such incompatibilities, involving successively lower uniformity, and hence greater autonomy and evolvability:

1. Require all systems to be implemented using the same interface schema (i.e., tables, attributes, and attribute details, but not necessarily indexes, tablespaces, etc.).
2. Require systems to communicate via a single, global interface schema, while allowing them to implement local data schemas in other ways.
3. Require systems to communicate via a single, global abstract interface schema. The abstraction omits representation details, such as implementation datatype or measurement units. It may also combine or split attributes; for example, mapping Date to Day, Month, and Year.
4. Allow systems to communicate through explicit correspondences between separately-developed interface schemas.

The first approach works on a small scale; it fails whenever the participants must organize their representation of the world in radically different forms. The second approach works on a medium scale, especially when there is a dominant partner to dictate the global schema. Neither of these approaches is feasible as a complete solution for an organization on the scale of the DOD [RRS96].

The third approach abandons the single concrete interface schema, and uses mediators to translate data from the server’s context to the client’s context [AK91, CHS91, GMS94]. Database administrators are required to describe the meaning, structure, and representation of their system’s data in terms of the abstract interface schema. The mediator uses this metadata from the source and client systems to ensure semantically-correct data sharing.

For the DOD, the single interface schema required for the third approach might involve tens of thousands of entity types. It is difficult to believe such a schema could ever be constructed in a single piece. This leaves the fourth approach, in which we have several interface schemas, and explicit correspondences between these schemas. While this approach places the greatest burden on tools, and is beyond the state of current prototypes, it is the only one that matches the scale and the degree of autonomy found in the DOD. Part of our goal is to show how an organization can prepare to collect the system descriptions and correspondences now, even though mediator technology is not yet mature.

The data administration process responsible for collecting this metadata is a collaboration between an enterprise-wide effort on the one hand, and the individual system builders on the other. The builders must be involved, because they possess the knowledge needed for writing system descriptions. There must be a central enterprise-wide effort, because individual builders will not put forth the effort to build infrastructure for everyone. In fact, builders often see metadata collection as a low-priority, low-return burden. In Section 2 we provide a framework that enables metadata to be specified and reused more easily, and suggest policies and tools that encourage administrators to specify reusable metadata.

There are two typical means for a system to pass data to another: have it converted to the required format (either explicitly or via a mediator), or put it into a standard transfer format (using a file or a message) and have the receiving system pull it out again. From the viewpoint of a database purist, the use of transfer structures should be dropped. However, their use is well-established within the DOD, the EDI community, and others; the impact on existing operations and legacy systems would be too traumatic.2 From a practical viewpoint, transfer structures are notoriously expensive to administer and maintain, and inhibit system flexibility. In Section 3 we show how transfer structures may be accommodated at a more reasonable cost.

Both DOD and various disciplinary groups have addressed requirements for metadata registries that combine all the above kinds of metadata. Where possible, this paper employs ideas and terminology from a draft standard (ISO 11179) [ISO97], elaborating parts of it to enable greater metadata reuse. We found the standard quite informative and useful, but fear that key omissions will greatly reduce its influence. Section 4 discusses this issue, along with problems associated with several other related standards efforts.

To summarize, we are proposing a framework for metadata and metadata administration which:

· describes a form for metadata that will be friendly to reuse and sharing,
· provides the proper incentives to align parochial interests with global ones, and
· unifies the treatment of metadata describing database and transfer structures.

We also identify some desiderata for any metadata framework, including good short term return on investment, flexibility to accommodate emerging technologies, and minimum personnel costs.

2 Also, we expect that transfer structures will have continuing use in performance optimizations, as a way of avoiding sending large numbers of similar messages and/or as a means of compressing information to be sent over a narrow channel.

For now, the challenge for user organizations with large scale problems is to do simple things on a large scale. Their next steps must be made despite uncertainties about which long-term approaches, formalisms, standards, and tools will attain success. We therefore focus on metadata-improvement steps that will be valuable in many scenarios and with many different mediators.

Finally, a disclaimer: The DOD is an enormous organization, with voluminous and sometimes inconsistent documentation. Our descriptions here are unavoidably incomplete, and some details may no longer be accurate. The opinions expressed represent only those of the authors. While we have had considerable exposure to the DOD central data administration organization (the Defense Information Systems Agency, or DISA), and to some organizations doing data administration for command and control (C2) and logistics, we judged that the outside metadata community would gain more from a rich, best-efforts description than if we confined discussion to the areas we know best. Our examples are made up rather than based on real systems, in order to avoid long and unnecessarily detailed explanations.

2. METADATA TO SUPPORT DATA SHARING

In this section we propose some metadata constructs to be used for describing data in databases and other structures. One may regard this as a partial specification for a repository or metadata registry for database information; a few parts of this have been prototyped. Several requirements drive our design choices in ways that differ from much previous research:

· reuse of definitions: Reuse leads to uniformity that greatly simplifies sharing. Also, local administrators will participate more fully if reuse can be made easier than reinvention.
· correspondences between definitions: It will frequently be necessary to combine information systems whose data definitions were developed separately. To share between such systems, we need the ability to identify correspondences that support sharing (i.e., “is substitutable for”), rather than correspondences that simply guide more detailed exploration (e.g., “is somehow related”).
· transfer structures and application program interfaces: The metamodel (constructs) and the atomic metadata instances here can be largely the same as those used for describing databases. Reuse and correspondence constructs still apply.
· incentives: The knowledge required for data sharing resides in system-building organizations, not in a central administration group. Incentives must be structured so that parochial interests will be aligned with the global interest.

Each of these issues is addressed in a corresponding subsection.

2.1 METADATA SPECIFICATION FOR REUSE

The knowledge representation literature identifies three major kinds of descriptive information: structure, meaning, and representation. Structure identifies the atomic units of information and the way the system combines them into aggregated units (such as entities and relationships). Meaning refers to the abstract concept corresponding to a data item. Representation refers to the way in which the system implements the item. Other information that repositories often contain (e.g., tracking “owners” and status of definitions) is not discussed here.

Much of the necessary technology for capturing these kinds of metadata in a repository is familiar. It needs to be brought together and adapted to meet the needs of sharing rather than documenting an individual system. The intent is that there be a repository that supports several constituencies: tools that capture metadata; programmers who write applications that touch multiple databases; and mediators that automatically translate data (when sufficient metadata has already been captured). The repository’s data descriptions should cover structured databases, transfer formats, and (eventually) arguments in application program interfaces.

2.1.1 Basic Definitions

Data modeling terminology is varied and often conflicting. To avoid confusion, we adopt a simple model, consisting of three kinds of data element: entities (e.g. Aircraft), properties (e.g. CurrentAirspeed), and domains (e.g. Velocity).3 We do not consider methods, inheritance rules, or heterogeneity. Relationships can be treated similarly to attributes.

Both entities and domains may have properties. Entity properties are often called attributes. Domain properties are often called meta-attributes, and are used to interpret the domain’s values. For example, the domain Velocity might have the three properties Datatype, SpeedUnit, and Precision. We assume that all domains have a (possibly implicit) property called Datatype, as a slot to hold the implementation type of the domain (e.g., float, integer).
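As a rough sketch, the entity/property/domain model above might be recorded as follows. (This is an illustrative Python rendering with invented names; it is not part of any DOD repository or of the ISO standard.)

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """A domain, e.g. Velocity; its meta-attributes interpret its values."""
    name: str
    meta_attributes: dict = field(default_factory=dict)  # e.g. {"Datatype": "float"}

@dataclass
class Entity:
    """An entity type, e.g. Aircraft."""
    name: str

@dataclass
class Property:
    """A property links an entity (or domain) to a domain."""
    name: str
    source: str   # name of the entity or domain it belongs to
    target: str   # name of the domain giving its values

# The Velocity domain with the three meta-attributes mentioned above;
# the specific values here (mph, precision 2) are invented:
velocity = Domain("Velocity", {"Datatype": "float", "SpeedUnit": "mph", "Precision": 2})
aircraft = Entity("Aircraft")
max_speed = Property("maxSpeed", source="Aircraft", target="Velocity")
```

Note that Datatype appears as an ordinary meta-attribute, matching the paper’s assumption that every domain carries an implicit Datatype slot.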

3 This definition is not commonly used in the DOD data administration community, where “data element” and “attribute” are often thought of as synonyms. Here we are following the ISO standard.

What is needed is a way to encourage reuse (or refinement) of concepts already known. This can be done by viewing the repository in three complementary (though not completely disjoint) ways: one to describe the structures, another for meanings of data elements, a third to describe representations.

2.1.2 Structure

Database administrators are more familiar with database schemas than with data dictionaries, and knowledge engineering terminology is likely to be quite foreign. Therefore, we want the structural description of the data to look like a database schema, with a little more information attached.

A schema contains several sorts of structural information, split into collection information and type information. (In a relational database, the only collections are sets and they are closely identified with types, i.e., relations. An object database offers more flexibility.) The type information has several parts: Each property is identified with a pair (entity, domain); each entity is identified with a set of properties; the entire database has a set of domains that it uses. The following table illustrates some structural metadata for a very simple example about aircraft:

NAME        KIND      INFO
Aircraft    entity    {id, maxSpeed, grossWt, hoursFlown}
Velocity    domain    {speedUnits}
Identifier  domain    {}
Weight      domain    {}
id          property  from Aircraft to Identifier
maxSpeed    property  from Aircraft to Velocity
grossWt     property  from Aircraft to Weight
hoursFlown  property  from Aircraft to Number
speedUnits  property  from Velocity to String

An equivalent model of structural information is as a directed graph. The nodes of the graph denote entities and domains, and the arrows denote properties.
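A minimal sketch of this graph view, using the names from the example table (the dictionary encoding and the `outgoing` helper are our own illustration, not a prescribed repository interface):

```python
# Structural metadata as a directed graph: nodes are entities and domains,
# arrows are properties. Each property maps to its (source, target) pair.
properties = {
    "id":         ("Aircraft", "Identifier"),
    "maxSpeed":   ("Aircraft", "Velocity"),
    "grossWt":    ("Aircraft", "Weight"),
    "hoursFlown": ("Aircraft", "Number"),
    "speedUnits": ("Velocity", "String"),   # a meta-attribute of the Velocity domain
}

def outgoing(node):
    """All properties whose source is the given entity or domain."""
    return sorted(p for p, (src, _) in properties.items() if src == node)

print(outgoing("Aircraft"))   # the attributes of the Aircraft entity
print(outgoing("Velocity"))   # the meta-attributes of the Velocity domain
```

The same traversal works whether the node is an entity or a domain, reflecting the model’s uniform treatment of the two.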

2.1.3 Meaning

Each data element appearing in a schema will be associated with a meaning. The meaning of a data element can be thought of as a pair, consisting of a Body (which gives the actual meaning) and a Name. The Body may be text, a document URL, or an object identifier (permitting comparisons for “is the meaning object identical”).

The following table gives the (grossly oversimplified) meanings of some elements of our aircraft example:

NAME        BODY
Aircraft    “flying thing”
Velocity    “how fast it goes”
Identifier  “unique id in Navy”
Weight      “its weight or mass”
id          “id of aircraft”
maxSpeed    “max air speed”
grossWt     “max weight, fully loaded”
hoursFlown  “flying time since last repair”
speedUnits  “mph, kph, etc.”

The combination of structural information and meanings resembles what the knowledge representation community calls an ontology. As a side point, we note that ontologies also have classification relationships among the nodes, such as generalization hierarchies. Generalization hierarchies are useful for finding the right definition and for factoring a concept’s definition (e.g., one can define triangle as 3-sided polygon), and are hence very desirable for a repository to support.
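The Name/Body pairing, and the comparison “is the meaning object identical” mentioned earlier, can be sketched as follows (a hypothetical illustration; the systems named here are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meaning:
    """A meaning pairs a Name with a Body; the Body may be text,
    a document URL, or an object identifier."""
    name: str
    body: str

aircraft_meaning = Meaning("Aircraft", "flying thing")

# One system reuses the registered meaning object; another writes its own
# textually identical record:
crew_planning_ref = aircraft_meaning
maintenance_ref = Meaning("Aircraft", "flying thing")

print(crew_planning_ref is aircraft_meaning)  # identity: known agreement
print(maintenance_ref is aircraft_meaning)    # only textual equality
```

Object identity is the strong guarantee: two systems referencing the same meaning object are known to agree, whereas equal definition text merely suggests agreement and still needs human confirmation.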

An ontological concept has no implied representation; i.e., the definition may be used by systems that implement the concept in different ways. In particular, the meaning of an entity or domain does not imply the existence of any of its properties; the specification of an entity’s (or domain’s) properties is part of its representation. This definition allows metadata sharing that might otherwise be impossible. For example, the Crew_Planning and Maintenance systems both involve aircraft, but have widely different ideas of what properties an Aircraft entity has. By circumventing the need to agree on all properties, the two communities will be able to agree on the same meaning for the Aircraft entity.

An ontological concept also has no context, which means that a concept cannot mean different things to different people, or somehow change its meaning based on the user or state of the system. For example, developers sometimes use a field of a record for multiple purposes; this is not supported within the ontology. (Instead, one could map the physical representation to a view schema in which each field has a specific meaning, and then relate elements of that view to the ontology.)

Properties usually greatly outnumber entity and domain definitions, and seem likely to require the bulk of administrators’ labor. If the meaning of each property is specified independently by each administrator, then overly fine distinctions will be made, inhibiting interoperability. By removing representational issues from the ontology, similar properties having different representations can share the same meaning. This reduces specification labor and makes it easier to share the information. For example, the ontology entry for CurrentAirspeed can be referenced for properties such as CurrentAirspeed_mph, CurrentAirspeed_kph, and Current_Airspeed_Long_Float. For a second example, the single property “id” in the above table denotes all aircraft identifiers, regardless of the detailed representation used for the Identifier domain, or the properties an organization attaches to Aircraft.
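A sketch of the CurrentAirspeed example (the representation details attached to each concrete property are invented for illustration):

```python
# One ontology entry, shared by several concrete properties that differ
# only in representation:
ontology = {"CurrentAirspeed": "current speed of the aircraft through the air"}

concrete_properties = {
    "CurrentAirspeed_mph":         {"meaning": "CurrentAirspeed", "unit": "mph"},
    "CurrentAirspeed_kph":         {"meaning": "CurrentAirspeed", "unit": "kph"},
    "Current_Airspeed_Long_Float": {"meaning": "CurrentAirspeed", "datatype": "long float"},
}

def share_meaning(p, q):
    """Candidate for mediation: both properties reference the same ontology entry."""
    return concrete_properties[p]["meaning"] == concrete_properties[q]["meaning"]
```

Because all three properties point at one ontology entry, a mediator knows they are candidates for conversion; only the representation (unit or datatype) needs translating.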

2.1.4 Representation

Representation deals with details that are not essential to meaning; a data element can often be translated from one representation to another without any change in what it denotes. A representation specification binds some of a concept’s properties to fixed values, while leaving others unbound, to be determined by the individual systems that implement the concept.

Current standards and systems restrict the kinds of representations that can be specified; these restrictions correspond to the extremes of our framework. For example:

· An entity representation in which all properties are fully bound corresponds to a traditional database relation scheme.
· An entity representation in which all properties are unbound corresponds to a reference scheme.
· A domain representation in which all specified properties are bound corresponds to what the DOD calls a generic data element [DOD94].
· A domain representation in which all specified properties are unbound corresponds to a semantic domain [SSR94].

Our framework avoids arbitrary restrictions on the kinds of representations that can be named and shared. We permit administrators to name and reuse any set of such specifications, such as a partial data element, or even a partial database description. There are many purposes for which such metadata might be used. In general, when two systems agree on something, it may be handy to define a concept that captures the amount agreed, even if there remain some disagreements. For example, WeightKg is such a domain: it specifies that the value must be in kilograms, but leaves the implementation datatype unspecified (to be chosen by the individual systems).
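The WeightKg example can be sketched as a partial representation that later gets refined (the dictionary shape and the `refine` helper are our own illustration, not a standardized registry format):

```python
# A representation binds some properties of a concept and leaves others
# open. WeightKg binds the unit but not the implementation datatype.
weight_kg = {
    "concept": "Weight",
    "bound":   {"WeightUnit": "kg"},        # agreed by all participants
    "unbound": ["Datatype", "Precision"],   # left to individual systems
}

def refine(representation, **bindings):
    """Produce a fuller representation by binding some open properties."""
    still_open = [p for p in representation["unbound"] if p not in bindings]
    return {
        "concept": representation["concept"],
        "bound":   {**representation["bound"], **bindings},
        "unbound": still_open,
    }

# One system completes WeightKg into a concrete data element:
weight_kg_float = refine(weight_kg, Datatype="float", Precision=1)
```

The partially-bound form is itself registrable and reusable, which is exactly the flexibility the framework argues for: the community shares what it has agreed on, no more and no less.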

An advantage of this flexible representation framework is that any number of property specifications is possible, depending on the data administrator’s needs. This promotes metadata sharing: an administrator can register representations in the repository as accurately as possible.

2.2 SPECIFYING CORRESPONDENCES BETWEEN DATA ELEMENTS

Data sharing requires some degree of mediation, based on metadata about correspondences between definitions. (For simplicity, we assume that all mediator-relevant definitions are in repositories, and that all the repositories use the same schema constructs for expressing this information; heterogeneous repository structures are a second-order problem, and outside our scope). The necessary correspondence information is substitutability, roughly defined as the appropriateness of using data conforming to one system’s metadata as the value for an element described in another system. (One can also define substitutability for definitions in registries, based on substitutability of data elements that conform).

Gathering such information is difficult and costly, but not impossible. Humans are typically needed to confirm whether it is appropriate to assert substitutability4; such assertions are implicit whenever programmers or users share information among databases. Unfortunately, they are generally not recorded as accessible knowledge, due to inadequate repository support.

4 Full automation seems infeasible, and techniques for automated assistance are outside the scope of this paper.

The construct for registering “substitutability” correspondences needs to say more than “Element X can substitute for element Y.” There is a long distance between “I think this textual definition fits my data” and “I am sharing data with three other systems that subscribe to this definition.” One cannot assume that all registered correspondences have met the second criterion, and therefore the repository needs to capture compliance information.

The repository schema must define the form of the assertions. Some additional fields that ought to be supported by any repository include:

· For what purpose (i.e., in what context). Distinctions that are irrelevant to one organization (“Price” versus “Bid Price”) may be critical to another purpose.
· Who says so.
· How certain they are (certainty).
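An assertion record covering these fields might look like the following. (All element names, the 0-to-1 certainty scale, and the example values are invented for illustration; the repository standard does not prescribe this shape.)

```python
from dataclasses import dataclass

@dataclass
class Substitutability:
    """An assertion that data conforming to one element may stand in
    for another, with the compliance context recorded alongside."""
    source_element: str
    target_element: str
    purpose: str       # in what context the substitution is claimed valid
    asserted_by: str   # who says so
    certainty: float   # 0.0 = "the textual definition looks right"
                       # 1.0 = "verified in operational data sharing"

claim = Substitutability(
    source_element="Logistics.Aircraft.grossWt",
    target_element="CrewPlanning.Aircraft.maxWeight",
    purpose="mission planning",
    asserted_by="Logistics DBA",
    certainty=0.6,
)
```

Keeping purpose, author, and certainty on every assertion is what lets the repository distinguish a hopeful textual match from a correspondence that systems actually rely on.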

The above limited features are intended to allow administrators to record some knowledge easily, to support discovery of what might be relevant, and to support exploitation of what is known to be relevant. They are intended as a core for supporting large organizations. If one were supporting schema integration tools over the repository, one would certainly want to record negative assertions (“don’t ask me again; these are different”), assertions about specialization, and perhaps assertions that certain pairs are candidates for further investigation.

One must be aware that overworked system builders may obey the letter but not necessarily the spirit of interoperability rules. We give several examples of (ill-advised) mandates that can create unintended consequences:

· Mandate that all data elements shared among systems must be recorded in standard form in a repository, without provisions for recording the state of uncertainty. Unless there are incentives and tools to facilitate reuse, it may be easier for system builders to record new elements than to reuse existing ones.
· Assume that all registered correspondences are accurate. They may not be: a short textual definition may not adequately capture the assumptions.
· Provide incentives for reuse of data elements, without estimating certainty of compliance. System builders may reuse existing elements, but may impose additional local meanings that prevent sharing.
· Require system builders to provide definitions of a quality sufficient to determine substitutability. Now system builders will try to avoid describing information that is currently local, even if it may someday be useful to others.

A better set of incentives would encourage system builders to speak honestly. It should be possible for a system builder to say “I agree with that textual definition (so perhaps this is an opportunity to share data),” or “I have checked that I can really substitute data from that source,” or “that source has checked and will accept data from me.” These statements can be accomplished using the constructs described above.

If the central administration wishes to measure how well each local administrator contributes to data sharing, they need a metric that balances factors of stand-alone quality, reuse, and accuracy of documentation. Stand-alone quality is the current practice; definition reuse is easily measured. To avoid the difficulties listed above, one should give bonuses for higher degrees of certainty, but penalties if the error rate proves higher than indicated by the certainty.
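One hypothetical shape for such a metric, with entirely invented weights (the paper proposes the balancing idea, not any particular formula):

```python
def contribution_score(quality, reuse_count, certainty, observed_error_rate):
    """Illustrative metric: reward stand-alone quality, reuse, and honestly
    high certainty; penalize overclaiming (errors beyond what the stated
    certainty implied). Weights here are arbitrary placeholders."""
    implied_error_rate = 1.0 - certainty
    overclaim = max(0.0, observed_error_rate - implied_error_rate)
    return quality + reuse_count + certainty - 10 * overclaim

# An administrator who honestly reports moderate certainty scores better
# than one who overclaims and is caught by the observed error rate:
honest = contribution_score(quality=3, reuse_count=5, certainty=0.5,
                            observed_error_rate=0.4)
overclaimed = contribution_score(quality=3, reuse_count=5, certainty=0.9,
                                 observed_error_rate=0.4)
```

The asymmetric penalty is the point: claiming high certainty is only profitable when the claim survives contact with actual error rates.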

2.3 METADATA FOR TRANSFER STRUCTURES

We now discuss the metadata needed to describe the intermediate transfer structures used to transmit data from one system to another. The key insight is that a transfer structure is a kind of database. As with a database, we need metadata to describe the meaning, representation, and structure of everything in the transfer structure. Fortunately, meaning and representation descriptions are exactly the same as with databases, and can be made with reference to the same repository. This by itself means that we manage one metadata resource, not two, and that unnecessary diversity is avoided.

The main difference is in describing structural information. A database can be described as a collection of relational tables. Transfer messages, however, are usually arranged in a hierarchical format; here, we must be able to describe the hierarchy. To transmit a transfer structure (or a database), one needs externalization functions that convert the contents of a table or message to a string of characters or bits, and internalization functions that recreate the database from the bitstream.
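The externalization/internalization pair can be illustrated in a few lines. (JSON stands in here for a real message syntax such as a USMTF or EDI format; the aircraft rows are invented.)

```python
import json

# A transfer structure is a kind of database: the same metadata describes
# both. Externalization flattens table contents to a character string;
# internalization recreates them on the receiving side.
aircraft_rows = [
    {"id": "N1234", "maxSpeed": 540, "grossWt": 83000},
    {"id": "N5678", "maxSpeed": 610, "grossWt": 91000},
]

def externalize(rows):
    """Convert table contents to a string of characters for transmission."""
    return json.dumps(rows)

def internalize(stream):
    """Recreate the table from the transmitted string."""
    return json.loads(stream)

# Round-tripping through the transfer structure loses nothing:
assert internalize(externalize(aircraft_rows)) == aircraft_rows
```

The round-trip property is the contract the two functions must satisfy, whatever the concrete message syntax; the hierarchy-arrangement problem discussed next is about how that syntax is defined and maintained.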

At present, it is necessary to define and administer the message hierarchy and the preparation/parsing rules by hand. This is not a simple task, and can consume quite a lot of effort. In the future, it should be possible to come up with a canonical hierarchy, or a self-describing structure, to automate the business of arranging the message contents. There is reason to believe that the automated arrangement could be as efficient (in terms of bandwidth) as the present manual approach.

2.4 ADMINISTRATIVE POLICIES AND INCENTIVES

It is now generally accepted that the more accurately and completely a system can specify its metadata, the greater the interoperability it can have with other systems. However, this goal is not always easy to achieve. A database administrator, when specifying metadata, may be uncertain whether data element descriptions really match an existing concept, or whether a new data element can be created that captures the meaning better. There is an unavoidable tension between the goals of sharing metadata and accurately specifying its meaning.

A solution is likely to involve a mix of policies and automated tools. Some of the required tools resemble tools that support reuse libraries. One needs to provide users easy access for browsing metadata, and to provide information-retrieval style searches to identify definitions that might be relevant. Ideally, one would also allow administrators to inspect existing data and (for more technically-oriented administrators) the code that manipulates it.

We propose that metadata specification tools satisfy the following requirements:

A. Be Incremental

· Accommodate gradual progress. Gather partial metadata as a byproduct of developers’ efforts to interchange high-priority information between pairs of systems. Provide good interoperability services for the parts of systems that are already documented.
· Avoid “everybody together” deadlines on upgrades. When all systems in a family must cut over simultaneously (as when a standard message format is changed), the speed of adaptation may be governed by the slowest development group. Instead, support coexistence using backward gateways, data pumps, and versioning.
· Insist on good short-term return on investment. Both the shape of the future and the chance of success are too uncertain to justify investments that are strictly long term.
· Stay flexible. Provide a good start toward supporting the emerging mediation technologies, positioning the organization to use whatever products emerge.

B. Minimize Burdens on Individual Administrators (especially within development programs)

· Ask administrators concrete rather than metaphysical questions. A question such as “are these two concepts the same?” may be difficult to answer, because the answer might be 90% true. The more concrete question “would you accept data from this source?” would elicit a more useful answer.
· Make metadata convenient to access. Information-retrieval tools should aid an administrator in finding an existing definition that suits a requirement. Automated tools, such as interface generators, mediators, and query tools, should be able to access and exploit the existing metadata.

· Minimize the granule of specification and evolution. For example, a user should not need to read a complex SQL query to arrange interchange of one additional data element.
· Allow import of composite definitions, not just atomic ones. The unit of import may be different from the unit of specification and change.
· Minimize the effort of data administration, including retraining effort. The metadata structure should be administrator-friendly; automated translators could then produce the representation required by a mediator product.

C. Plan to Serve a Muddled World

While parts of systems will become well behaved (e.g., communicating standardized data through standard interfaces), the scope of our systems will increase. New requirements and the need to collaborate closely with additional systems will guarantee that heterogeneity will persist. Therefore:

· Partially-compliant systems must be able to play. Systems that do not conform to the latest standards will continue to be important. (Some may be legacy systems; others may belong to other organizations.) We suspect that a little wrapping will make almost any system compliant to a small but useful degree.
· Categorization should not be critical. It was never possible to obtain consistency in what was modeled as an entity versus as a relationship. Such ambiguities will continue in our metadata. Sharing should not depend on getting the categorization “right,” and different communities should not be forced to argue about which approach is “right.”

3. CURRENT PRACTICES

This section discusses current practices in the DOD, with a few brief remarks about commercial practice.

The DOD is a very large organization with complex data needs. Organizations within the DOD at many levels in the hierarchy have a surprising amount of autonomy in acquiring information systems and, until recently, in defining separate data schemas to suit their needs. As a result, the DOD is filled with information systems that cannot share data, both because semantically-equivalent items are difficult to identify, and because different names, structures, and representations have been chosen for equivalent items. These failures in data interoperability have been a known problem for many years. Efforts to solve these data sharing problems have included standard message formats and, more recently, a data element standardization program. In this section we describe these two efforts at improved data sharing, note which parts correspond to our metadata framework, and point out where additional or different kinds of administration may be needed.

3.1 STANDARD MESSAGE FORMATS

Standard messages have been used to communicate between DOD information systems for decades. The JINTACCS5 program was created in 1977 as an attempt to link existing command and control systems across service boundaries (e.g. Army to/from Air Force). It is responsible for two families of standard messages: USMTF and TADIL. The USMTF message family is a set of character-based messages with a strong resemblance to EDI. These messages are intended to be man-readable as well as machine-processable. The TADIL message family consists of binary, fixed-length messages used in tactical data links. There are other similar standards, both for communicating within individual services, and with NATO allies.

These message standards were introduced in order to solve interoperability problems. They are widely judged to have failed at this task. The main reasons are the expense of maintaining the standard, the high cost of building and maintaining the interface software which produces and consumes the transmitted messages, and the resulting inflexible architecture of systems which can only communicate through messages that must be defined far in advance. This in no way means that their use can be abandoned. In many cases their continued use is required by international agreement. There is a huge software base that depends on their continued use. There are even advantages. Some messages are prepared once, then multicast to several receivers. Messages also enable communication between systems (e.g., in vehicles) that have not upgraded to modern workstations or even PCs, and cannot afford the overhead of a DBMS.

In this section we describe some problems in the administration and use of standard message formats and ways that our metadata framework might lead to solutions.

· The bridge software combines functions that should be separate. Frequently a bridge program searches the source system’s database to determine data to extract, transforms the data to match the target schema, transforms attributes to match the detailed target representations, and merges the result into the target’s database. The tight coupling makes such programs very difficult to maintain, and makes it difficult to use mediation. We would recommend separating the extraction, transformation, and merge steps, with a view toward replacing the transformations with automated mediation.
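The separation recommended above can be sketched as a three-stage pipeline. This is only an illustrative sketch; the source rows, field names, and conversions are all invented, not drawn from any actual DOD system.

```python
# Hypothetical sketch of a bridge decomposed into three independent stages;
# all table and field names are invented for illustration.

# Toy "source database": rows from the producing system.
source = [
    {"unit_name": "alpha", "status": "active",   "lat_rad": 0.7},
    {"unit_name": "bravo", "status": "inactive", "lat_rad": 0.9},
]

def extract(rows):
    """Stage 1: select only the data the target needs (query logic only)."""
    return [r for r in rows if r["status"] == "active"]

def transform(rows):
    """Stage 2: rename and recode to the target's schema and representations.
    This is the stage one would hope to replace with automated mediation."""
    return [{"callsign": r["unit_name"].upper(),      # name mapping
             "lat_deg": r["lat_rad"] * 57.29578}      # radians -> degrees
            for r in rows]

def merge(rows, target_db):
    """Stage 3: fold the transformed rows into the target store."""
    for r in rows:
        target_db[r["callsign"]] = r
    return target_db

target = merge(transform(extract(source)), {})
print(target)   # prints the merged target rows
```

Because each stage has a single responsibility, the transform stage can later be generated from declared correspondences without touching the extraction or merge logic.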

5 JINTACCS: Joint Interoperability of Tactical Command and Control Systems
TADIL: Tactical Digital Information Link
USMTF: United States Message Text Format

· Correspondences are buried in source code. Bridge programs as described above embody a great deal of knowledge of each system, but that knowledge is effectively lost. For example, it required weeks of analysis, using custom tools that reverse engineered C source with embedded SQL, to figure out what actually happens in one such bridge connecting two large Air Force applications.

· Transfer structure definitions and stringification code are manually maintained. Ideally, one would use commercial software for transferring between databases. Currently, each message family has different rules for externalization and internalization functions. Character-based messages cannot be easily extended to pass binary data (e.g., images), except through an inefficient uuencode-style encoding mechanism.

· No ad-hoc queries can be accommodated. Transfer structures can be large, and their generation can be slow. A program that generates the entire message does not provide a means of extracting a small amount of up-to-date information from the source.

· Producers and consumers may disagree about the need to change widely-used transfer structures. Some DOD organizations have bridges to old versions of transfer structures, and not all the changes have been backward compatible.
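The inefficiency of the character-encoding workaround mentioned above is easy to quantify. Using base64 as a stand-in for a uuencode-style mechanism, every 3 binary bytes expand to 4 message characters:

```python
import base64

# base64 as a stand-in for the uuencode-style mechanism mentioned above:
# every 3 binary bytes become 4 message characters (~33% expansion).
payload = bytes(300)                  # e.g., a small binary image fragment
encoded = base64.b64encode(payload)   # character-safe form for a text message
print(len(payload), len(encoded))     # 300 bytes become 400 characters
```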

Many of the same problems arise in the commercial EDI community. Message elements are not yet coordinated with elements in databases, or in business object standards; each participant organization must manually map its databases to files described in terms of the standard elements of a message. With some products, the mapping must be repeated for each message, even if the same element is used. Furthermore, pairwise negotiation is often needed about the details; as a result, dialects have emerged. For example, both Wal-Mart and Kmart are major EDI users, but a supplier who communicates with both will probably have redundant administrative tasks.

3.2 DATA ADMINISTRATION AND DATA ELEMENT STANDARDIZATION

Data administration within the DOD is governed by a policy and set of procedures known as the 8320.1 directive [DOD94]. This policy states that all new and modified information systems in the DOD must be built using only standard data elements. The intent is to eliminate data interoperability problems by eliminating all differences in name, structure, and representation of semantically-equivalent data items.6

Until very recently the emphasis in DOD data administration has been on collecting a critical mass of standard data elements. The intention is that these data elements will be reused in new and modified systems; new elements are to be added only when equivalent elements cannot be found. Every standard data element is tied to an attribute in an IDEF1X data schema7. (IDEF1X is an extended entity-relationship model [Bruce]; see also the standards discussion in Section 4.) The policy defines certain metadata to be collected for each data element (e.g., datatype, maximum length), and prescribes a naming convention to ensure a unique name (which embeds other metadata) for each element. Data elements are added to the standard through an approval process which attempts to assign control (or stewardship) of each data element to the functional group most concerned with its subject area. The approval process is also intended, through a step known as cross-functional review, to ensure that schemas and data elements submitted for review are fully integrated into the global DOD schema before acceptance, eliminating any redundant representations of semantically-equivalent concepts.

6 Problems in entity identification and conflict resolution are largely ignored in 8320.1.

The 8320.1 policy was established in 1991. While the DOD has been successful in collecting definitions for approximately 17,000 standard data elements, this has not yet provided major improvements in data interoperability. A consensus also holds that the schema integration that was supposed to occur during the review process has not really succeeded: the current data element repository is known to contain many redundant (and poorly-designed) data elements. However, there is no consensus on what ought to be done next: some hold that with renewed effort and stricter enforcement, the existing process can succeed; others believe that some changes and improvements to the process are needed.

Major software packages (such as R3 from SAP) also require extensive data administration to connect to an organization’s systems. Yet here too the lack of repositories and standards is costly – the results of data administration cannot easily be shared with other organizations.

4. STANDARDS DISCUSSION

This section discusses several “standards”8 that have a large potential role in the DOD:

· The International Standards Organization standard for data element and other metadata registries (ISO-11179)

· DOD’s guidance for defining data element dictionaries (8320.1)

· The IDEF1X formalism for describing data models

· STEP and CDIF

4.1 DATA ELEMENT REGISTRY STANDARD (ISO-11179)

The standard describes some important meta-properties and behaviors. However, the standard specifies only conceptual behaviors. It does not specify schemas or APIs that a registry must support, and as a result it provides no software interoperability; consequently it does not widen the market in the way that the SQL and ODBC standards did when they catalyzed the explosion of 4GL and application-development tools. We discuss these issues in greater detail below.

The standard does have significant strengths. It provides a very clear tutorial, and very useful guidance in many aspects of specifying a registry’s contents and behavior. It standardizes metadata attributes to be attached to element definitions. It aids politically in getting metadata providers to agree to a form for the metadata they provide (e.g., “We’re following an international standard, so don’t try to redefine it.”) And it enables a certification process for appropriateness of some of the registry’s policies (though for DOD, the benefits of this certification are uncertain).

Reflecting the interests of activists who provide large, public datasets (e.g., census or scientific data), the standard contains features that would make data easier for humans to find and judge, and to bridge for import into user-owned applications. It aims to make the necessary information visible, but does not aim for a high degree of automation. In contrast, DOD is more concerned with peer-to-peer sharing among its components.

For us, the main flaw of the standard is that its scope excludes features that are essential to attracting vendor support. We want the metadata registry to promote the development of an industry, with niches that include repository management software, metadata capture tools, and tools for exploiting the captured metadata. For such an industry to flourish, two things must occur: All the above metadata-driven software must interoperate, to minimize the number of interfaces each vendor must support; and the resulting mix should support information-sharing.

To enable software interoperation, one needs to standardize: a schema for (at least a core subset of) the metadata; a query language (for accessing this information); and a transfer format (for archiving and shipping metadata). The registry standard describes what needs to be shared, but does not provide such concrete mechanisms. The necessary tasks, then, are to choose a query language (e.g., a subset of SQL or ODL) and a transfer format (e.g., CDIF), and within the language’s constructs, to standardize a schema by which registries could be accessed.
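A minimal sketch of the idea: once a core registry schema and a SQL subset are agreed, any vendor's tool can interrogate any registry. The one-table schema, its columns, and the sample element below are invented for illustration (using SQLite as a stand-in for whatever DBMS hosts the registry).

```python
import sqlite3

# Invented one-table "core" registry schema; a real standard would cover
# far more (domains, stewardship, correspondences, ...).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_element (
        elem_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        definition TEXT,
        datatype   TEXT,
        max_length INTEGER
    )""")
conn.execute(
    "INSERT INTO data_element VALUES (1, 'person-birth-date', "
    "'Date on which a person was born', 'date', 10)")

# Any tool speaking the agreed SQL subset could interrogate the registry:
rows = conn.execute(
    "SELECT name, datatype FROM data_element WHERE datatype = 'date'"
).fetchall()
print(rows)   # [('person-birth-date', 'date')]
```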

The main purpose in populating the repository is not to describe a single database, but to share information among systems. For the industry’s tools to support such sharing, the repository standard must specify the form of the substitutability relationships discussed in section 2.2.

ISO-11179 needs correspondences, so buyers can have tools for data sharing. With its current scope, it describes information needed for accessing databases, but not for sharing between them; this greatly reduces its utility to our customers. Most organizations in DOD give low priority to their role as data providers to others, and are not clamoring for ways to improve. The need is for data sharing across organizations.

4.2 DOD Directive 8320.1

The role of 8320.1 within the DOD was discussed in section 3. We now briefly describe its technical underpinnings.

8320.1 is a standard for descriptions of data elements. It parallels the treatment in ISO-11179 rather closely, but contains more specifics. For example, 11179 says that one may define domains for values; 8320 defines numerous useful ones (but imposes central control on domain-creation).

8320.1 shares the weaknesses of 11179 by not supporting (representation-independent) semantic domains, by not having a mechanism to package arbitrary sets of definitions into a reusable unit, and by not having a special construct for substitutability correspondences. In addition, it inhibits reuse by not supporting 11179’s representation-independent “conceptual data elements.” Finally, 8320.1 requires a complex structure for names. While the structure seems well founded (and is reflected in our treatment of semantic domains and representation), it creates unusable names. Now that most users can afford good graphical tools, the information ought instead to be captured only in a data dictionary to be searched and displayed in many ways; in the meantime, usable names should be provided.
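To make the naming issue concrete, consider the contrast between a long structured name and a data-dictionary entry. The structured name, the short name, and all metadata values below are hypothetical, invented purely to illustrate the point, not taken from the 8320.1 registry.

```python
# Invented example: the same metadata carried once inside a long structured
# name, and once in a dictionary entry under a short, usable name.
structured_name = "PERSON-EMPLOYMENT-START-CALENDAR-DATE"  # hypothetical 8320.1-style name

data_dictionary = {
    "start_date": {                         # short, usable display name
        "entity": "PERSON-EMPLOYMENT",      # metadata formerly embedded in the name
        "semantic_domain": "calendar-date",
        "definition": "Date a person's employment began",
    }
}

# Tools can search and display the captured metadata in many ways,
# without anyone having to decode a structured name by eye.
print(data_dictionary["start_date"]["semantic_domain"])   # calendar-date
```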

4.3 IDEF1X VERSUS UML

DOD standard schemas use IDEF1X, an extended entity-relationship (EER) model. The IDEF1X model has a rich variety of constraint constructs. There are good tools for displaying and manipulating diagrams; LogicWorks’ ERWin seems very popular within DOD. The existing standard (IDEF1X-93) has several serious weaknesses, so a new standard is needed. Omissions from IDEF1X-93 include:

· There are no constructs for mapping between models, i.e., no language for defining views.

· There is no standard meta-schema or even file exchange format to enable interchange of models among tools from different vendors. Ideally, one would be able to ship both entire models and finer-grained descriptions (e.g., of individual entities) to another organization, to aid collaborative design.

· Multiple inheritance is not supported, making it harder to reuse portions of a specification, or to support an object model that does have multiple inheritance.

· There is no connection to process or organizational modeling.

· IDEF1X defines models, but the standard currently has no facilities for tying those models to database schemas. (Some tools can produce database schemas from models, or reverse-engineer models from schemas, but there is no vendor-independent mechanism for recording the connections.)

We note that some of the supporting tools do have constructs for some of the above features. But without a standard, one cannot expect consistency, let alone interoperability.

A major extension to IDEF1X (IDEF1X-97) has been proposed [IEEE96]. It remedies many of the above deficiencies, and gives improved support for object-oriented analysis and design, while preserving an easy transition path from IDEF1X-93. Most of its constructs seem well designed. Unfortunately, it does not seem in synch with major industry trends – de facto standards, and tighter coupling across lifecycle stages. We compare it with an emerging alternative, the Unified Modeling Language (UML).

IDEF1X-97 rightly includes an object model, a mapping (i.e., query) language, and a transfer syntax. Unfortunately, in each case, the specification goes its own way, rather than adopting and extending an existing standard. We note also that it is insufficient for IDEF1X-97 to receive support from data-modeling vendors – the object model and mapping language need support from vendors at many other stages of the software lifecycle. For example, one wants mappings between schemas to be executable (providing database translation), but this requires that the mappings be in SQL or (for some object database vendors) ODL. Furthermore, as the object-relational formalisms used for physical schemas approach EER constructs in richness, the burden of separate formalisms for physical and conceptual specification will become unjustified.

In recent months, UML has become an industry juggernaut for object-oriented analysis and design, winning support from both Microsoft and the Object Management Group. It provides substantial integration across the lifecycle, and seems assured of support from many leading tool vendors, not just those who serve the government. It contains analogs to most of the data modeling constructs of IDEF1X-97; the missing constructs could be provided as extensions to the UML model, and would be candidates for future standardization. Its close connections

with the object community indicate that it will remain consistent with their formalisms, and its ancestry guarantees that it suits life-cycle stages beyond data modeling.

One further motivator for the shape of IDEF1X-97 was to minimize the pain of transitioning existing conceptual models to the new standard. However, if one examines the entire lifecycle, the choice seems to be IDEF1X with UML, or UML alone. The former requires a translation each time a model is imported or exported from conceptual modeling tools and will inhibit maintaining consistency across stages. The UML-based strategy requires just a one-time translation of existing models (probably automatable). The new formalism should be fairly easy for existing staff to learn, while schools will include UML in their training of future software engineers.

4.4 STEP, CDIF, ETC.

Several existing standards point the way toward hiding the syntax used to transfer designs among systems. Both the CASE Data Interchange Format (CDIF) and the STEP format for exchange of engineering product data specify a rule for translating an instance of a given schema to a data stream (or file) that can be shipped or archived. STEP has, and CDIF is developing, a way to see the transferred information as an object database. (Because it was developed early, STEP defined its own database interface.)

The argument for using CDIF seems stronger. It already plays in the CASE arena (and is the basis for transferring UML models); also, its database interface will be OMG-compliant. Mappings from database format to the transfer form are built in, reused for each schema. This saves effort, but may be less efficient than a custom encoding.
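The "built-in, reused for each schema" idea can be sketched generically: one serializer, driven entirely by the schema, flattens any instance to a transfer stream. The text format, schema, and instance below are toy inventions, not the actual CDIF or STEP encodings.

```python
def serialize(schema, instance):
    """Schema-driven flattening of an instance to a text stream. The same
    function is reused unchanged for every schema (the built-in-mapping
    style), unlike a hand-tuned per-message encoder."""
    lines = []
    for entity, attrs in schema.items():
        for record in instance.get(entity, []):
            fields = ";".join(f"{a}={record[a]}" for a in attrs)
            lines.append(f"{entity}({fields})")
    return "\n".join(lines)

# Invented toy schema and instance.
schema = {"Entity": ["name", "kind"]}
instance = {"Entity": [{"name": "Aircraft", "kind": "class"}]}
print(serialize(schema, instance))   # Entity(name=Aircraft;kind=class)
```

The saving is exactly the trade-off noted above: no per-schema encoder to write or maintain, at the cost of a format less compact than a custom encoding.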

5. SUMMARY AND CONCLUSIONS

We have presented some “next steps” that large organizations can take to improve metadata capture and thereby improve data sharing. None requires a technology breakthrough, though several may require significant changes to organizations and incentives. The problems and opportunities were illustrated with examples from Department of Defense practices, augmented by comments that the Electronic Data Interchange world behaves rather similarly.

We adapted the conventional story about required metadata (i.e., descriptions of structures, concepts, and representation) to enhance sharing by allowing flexibility in the granularity of reuse. We also addressed the need to support both a shared pool of definitions, and to incrementally record connections among definitions that were developed independently. A form for such connections was proposed. The treatment can be considered a refinement of

existing standards (both ISO-11179 and DOD-8320), to make greater use of semantic domains and to define a relationship type to handle correspondence information.

We also discussed why metadata administration needs to be shared across technology boundaries, to capture metadata about databases, transfer formats, and even arguments in application interfaces. We gave several reasons why intermediate structures will not be replaced by direct communication among applications’ databases, but argued that the intermediate structures could be regarded as temporary databases whose schemas are fairly conventional even though their syntax is peculiar.

After discussing how these concepts play out in practice, we then surveyed some applicable standards. We discussed why, for different reasons, neither the metadata registries standard (ISO-11179) nor the proposed IDEF1X update seems to adequately answer the concerns of software vendors, and hence neither seems likely to be supported by many products. Yet there is much technical insight in both proposals, so they need to be extended (in the case of ISO-11179) or merged into UML (in the case of IDEF1X-97).

The major barrier to adopting our approaches seems to be the weak position of repositories. A new generation of repository products is emerging, and due to low price and web access, seems likely to receive greater acceptance. It would also be highly desirable for vendors and standards committees to devise repository schemas that are friendly to definition reuse and to capturing correspondences.

6. REFERENCE LIST

[AK91] Arens, Y., C. Knoblock, “Planning and reformulating queries for semantically-modeled multidatabase systems.” In Proceedings of the 1st International Conference on Information and Knowledge Management (1992), pp. 92-101.

[Bruce] Bruce, T., Designing Quality Databases With IDEF1X Information Models. New York: Dorset House, 1992.

[CHS91] Collet, C., M. Huhns, W. Shen, “Resource Integration Using a Large Knowledge Base in Carnot,” IEEE Computer, 24(12), December 1991.

[DOD94] Department of Defense, Data Element Standardization Procedures, March 1994. DOD 8320.1-M.

[GMS94] Goh, C., S. Madnick, M. Siegel. “Context interchange: overcoming the challenges of large-scale interoperable database systems in a dynamic environment.” Third International Conference on Information and Knowledge Management (1994), pp. 337-346.

[IEEE96] IEEE IDEF1X Standards Working Group. “IDEF1X-97 Conceptual Modeling, (IDEF-object) Semantics and Syntax”, 1996.

[ISO97] International Standards Organization (Gilman ed.) “ISO/IEC 11179 - Specification and Standardization of Data Elements”, http://www.lbl.gov/~olken/X3L8/drafts/draft.docs.html

[RRS96] Renner, S., A. Rosenthal, J. Scarano, “Data Interoperability: Standardization or Mediation” (poster presentation), IEEE Metadata Workshop, Silver Spring, MD, April 1996. http://www.computer.org/conferen/meta96/renner/data-interop.html

[SR96] Seligman, L., A. Rosenthal, “A Metadata Resource to Promote Data Integration”, IEEE Metadata Workshop, Silver Spring MD., April 1996 http://www.computer.org/conferen/meta96/seligman/seligman.html

[SSR94] Sciore, E., M. Siegel, A. Rosenthal. “Using semantic values to facilitate interoperability among heterogeneous information systems,” ACM Transactions on Database Systems, June 1994.

ACKNOWLEDGEMENTS

This work was sponsored by the US Air Force Electronic Systems Center (ESC) Office (SIO) under contract AF19628-94-C-0001.
