Toward Unified Metadata for the Department of Defense1

Arnon Rosenthal, Edward Sciore*, Scott Renner
The MITRE Corporation, Bedford MA
{arnie, sar}@mitre.org, [email protected]

ABSTRACT

Data sharing within and among large organizations is possible only if adequate metadata is captured. But administrative and technological boundaries have increased the cost and reduced the effectiveness of metadata exploitation. We examine practices in the Department of Defense (DOD) and in industry’s Electronic Data Interchange (EDI) to explore pragmatic difficulties in three major areas: the collection of metadata; the use of intermediary transfer structures such as formatted messages and file exchange formats; and the adoption of standards such as IDEF1X-97.

We realize that in large organizations, a complete metadata specification will need to evolve gradually. We are concerned here with initial steps. We therefore propose a simple framework, for both databases and transfer structures, which can accommodate varying degrees of metadata specification. We then propose some conceptually simple (but rarely practiced) techniques and policies to increase metadata reusability within this framework.

1. INTRODUCTION

The barriers toward shared and integrated data are severe and well known. They include heterogeneous requirements and practices, inadequate tools for documentation, and inadequate will to provide good documentation. Efforts to overcome these barriers through better data administration have had limited success. This paper examines some of the shapes these barriers take within the Department of Defense (DOD), and describes some steps taken by DOD and standards groups to ameliorate those difficulties. Our goal is to describe a vision of a metadata environment that supports data sharing with and within large-scale organizations, and to compare this vision with the current situation in DOD and with some relevant standards.

1 Presented at the IEEE Metadata Conference, Silver Spring, MD, 1997. Proceedings may be available at http://www.llnl.gov/liv_comp/metadata/md97.html or from the IEEE home page.

* Edward Sciore is also a member of the Computer Science Department, Boston College, Chestnut Hill MA.

In this paper, “data sharing” will refer both to situations where there is, in effect, an integrated database accessible to multiple applications (sometimes called integration), and also to situations where one application or database transmits data for use by another (sometimes called interoperability). In either case the key consideration is dealing with semantic and structural incompatibilities between the participating systems. We distinguish four ways to cope with such incompatibilities, involving successively lower uniformity, and hence greater autonomy and evolvability:

1. Require all systems to be implemented using the same interface schema (i.e., tables, attributes, and attribute details, but not necessarily indexes, tablespaces, etc.).
2. Require systems to communicate via a single, global interface schema, while allowing them to implement local data schemas in other ways.
3. Require systems to communicate via a single, global abstract interface schema. The abstraction omits representation details, such as implementation datatype or measurement units. It may also combine or split attributes; for example, mapping Date to Day, Month, and Year.
4. Allow systems to communicate through explicit correspondences between separately-developed interface schemas.

The first approach works on a small scale; it fails whenever the participants must organize their representation of the world in radically different forms. The second approach works on a medium scale, especially when there is a dominant partner to dictate the global schema. Neither of these approaches is feasible as a complete solution for an organization on the scale of the DOD [RRS96].

The third approach abandons the single concrete interface schema, and uses mediators to translate data from the server’s context to the client’s context [AK91, CHS91, GMS94]. Database administrators are required to describe the meaning, structure, and representation of their system’s data in terms of the abstract interface schema. The mediator uses this metadata from the source and client systems to ensure semantically-correct data sharing.

For the DOD, the single interface schema required for the third approach might involve tens of thousands of entity types. It is difficult to believe such a schema could ever be constructed in a single piece. This leaves the fourth approach, in which we have several interface schemas, and explicit correspondences between these schemas. While this approach places the greatest burden on tools, and is beyond the state of current prototypes, it is the only one that matches the scale and the degree of autonomy found in the DOD. Part of our goal is to show how an organization can prepare to collect the system descriptions and correspondences now, even though mediator technology is not yet mature.

The data administration process responsible for collecting this metadata is a collaboration between an enterprise-wide effort on the one hand, and the individual system builders on the other. The builders must be involved, because they possess the knowledge needed for writing system descriptions. There must be a central enterprise-wide effort, because individual builders will not put forth the effort to build infrastructure for everyone. In fact, builders often see metadata collection as a low-priority, low-return burden. In Section 2 we provide a framework that enables metadata to be specified and reused more easily, and suggest policies and tools that encourage administrators to specify reusable metadata.

There are two typical means for a system to pass data to another: have it converted to the required format (either explicitly or via a mediator), or put it into a standard transfer format (using a file or a message) and have the receiving system pull it out again. From the viewpoint of a database purist, the use of transfer structures should be dropped. However, their use is well-established within the DOD, the EDI community, and others; the impact on existing operations and legacy systems would be too traumatic.2 From a practical viewpoint, transfer structures are notoriously expensive to administer and maintain, and inhibit system flexibility. In Section 3 we show how transfer structures may be accommodated at a more reasonable cost.

Both DOD and various disciplinary groups have addressed requirements for metadata registries that combine all the above kinds of metadata. Where possible, this paper employs ideas and terminology from a draft standard (ISO 11179) [ISO97], elaborating parts of it to enable greater metadata reuse. We found the standard quite informative and useful, but fear that key omissions will greatly reduce its influence. Section 4 discusses this issue, along with problems associated with several other related standards efforts.

To summarize, we are proposing a framework for metadata and metadata administration which:

· describes a form for metadata that will be friendly to reuse and sharing,
· provides the proper incentives to align parochial interests with global ones, and
· unifies the treatment of metadata describing database and transfer structures.

We also identify some desiderata for any metadata framework, including good short term return on investment, flexibility to accommodate emerging technologies, and minimum personnel costs.

2 Also, we expect that transfer structures will have continuing use in performance optimizations, as a way of avoiding sending large numbers of similar messages and/or as a means of compressing information to be sent over a narrow channel.

For now, the challenge for user organizations with large scale problems is to do simple things on a large scale. Their next steps must be made despite uncertainties about which long-term approaches, formalisms, standards, and tools will attain success. We therefore focus on metadata-improvement steps that will be valuable in many scenarios and with many different mediators.

Finally, a disclaimer: The DOD is an enormous organization, with voluminous and sometimes inconsistent documentation. Our descriptions here are unavoidably incomplete, and some details may no longer be accurate. The opinions expressed represent only those of the authors. While we have had considerable exposure to the DOD central data administration organization (the Defense Information Systems Agency, or DISA), and to some organizations doing data administration for command and control (C2) and logistics, we judged that the outside metadata community would gain more from a rich, best-efforts description than if we confined discussion to the areas we know best. Our examples are made up rather than based on real systems, in order to avoid long and unnecessarily detailed explanations.

2. METADATA TO SUPPORT DATA SHARING

In this section we propose some metadata constructs to be used for describing data in databases and other structures. One may regard this as a partial specification for a repository or metadata registry for database information; a few parts of this have been prototyped. Several requirements drive our design choices in ways that differ from much previous research:

· reuse of definitions: Reuse leads to uniformity that greatly simplifies sharing. Also, local administrators will participate more fully if reuse can be made easier than reinvention.
· correspondences between definitions: It will frequently be necessary to combine information systems whose data definitions were developed separately. To share between such systems, we need the ability to identify correspondences that support sharing (i.e., “is substitutable for”), rather than correspondences that simply guide more detailed exploration (e.g., “is somehow related”).
· transfer structures and application program interfaces: The metamodel (constructs) and the atomic metadata instances here can be largely the same as those used for describing databases. Reuse and correspondence constructs still apply.
· incentives: The knowledge required for data sharing resides in system-building organizations, not in a central administration group. Incentives must be structured so that parochial interests will be aligned with the global interest.

Each of these issues is addressed in a corresponding subsection.

2.1 METADATA SPECIFICATION FOR REUSE

The knowledge representation literature identifies three major kinds of descriptive information: structure, meaning, and representation. Structure identifies the atomic units of information and the way the system combines them into aggregated units (such as entities and relationships). Meaning refers to the abstract concept corresponding to a data item. Representation refers to the way in which the system implements the item. Other information that repositories often contain (e.g., tracking “owners” and status of definitions) is not discussed here.

Much of the necessary technology for capturing these kinds of metadata in a repository is familiar. It needs to be brought together and adapted to meet the needs of sharing rather than documenting an individual system. The intent is that there be a repository that supports several constituencies: tools that capture metadata; programmers who write applications that touch multiple databases; and mediators that automatically translate data (when sufficient metadata has already been captured). The repository’s data descriptions should cover structured databases, transfer formats, and (eventually) arguments in application program interfaces.

2.1.1 Basic Definitions

Data modeling terminology is varied and often conflicting. To avoid confusion, we adopt a simple model, consisting of three kinds of data element: entities (e.g. Aircraft), properties (e.g. CurrentAirspeed), and domains (e.g. Velocity).3 We do not consider methods, inheritance rules, or heterogeneity. Relationships can be treated similarly to attributes.

Both entities and domains may have properties. Entity properties are often called attributes. Domain properties are often called meta-attributes, and are used to interpret the domain’s values. For example, the domain Velocity might have the three properties Datatype, SpeedUnit, and Precision. We assume that all domains have a (possibly implicit) property called Datatype, as a slot to hold the implementation type of the domain (e.g., float, integer).
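As a rough sketch, the entity/property/domain model above might be recorded as follows. (This is an illustrative Python rendering with invented names; it is not part of any DOD repository or of the ISO standard.)

```python
from dataclasses import dataclass, field

@dataclass
class Domain:
    """A domain, e.g. Velocity; its meta-attributes interpret its values."""
    name: str
    meta_attributes: dict = field(default_factory=dict)  # e.g. {"Datatype": "float"}

@dataclass
class Entity:
    """An entity type, e.g. Aircraft."""
    name: str

@dataclass
class Property:
    """A property links an entity (or domain) to a domain."""
    name: str
    source: str   # name of the entity or domain it belongs to
    target: str   # name of the domain giving its values

# The Velocity domain with the three meta-attributes mentioned above;
# the specific values here (mph, precision 2) are invented:
velocity = Domain("Velocity", {"Datatype": "float", "SpeedUnit": "mph", "Precision": 2})
aircraft = Entity("Aircraft")
max_speed = Property("maxSpeed", source="Aircraft", target="Velocity")
```

Note that Datatype appears as an ordinary meta-attribute, matching the paper’s assumption that every domain carries an implicit Datatype slot.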

3 This definition is not commonly used in the DOD data administration community, where “data element” and “attribute” are often thought of as synonyms. Here we are following the ISO standard.

What is needed is a way to encourage reuse (or refinement) of concepts already known. This can be done by viewing the repository in three complementary (though not completely disjoint) ways: one to describe the structures, another for meanings of data elements, a third to describe representations.

2.1.2 Structure

Database administrators are more familiar with database schemas than with data dictionaries, and knowledge engineering terminology is likely to be quite foreign. Therefore, we want the structural description of the data to look like a database schema, with a little more information attached.

A schema contains several sorts of structural information, split into collection information and type information. (In a relational database, the only collections are sets and they are closely identified with types, i.e., relations. An object database offers more flexibility.) The type information has several parts: Each property is identified with a pair (entity, domain); each entity is identified with a set of properties; the entire database has a set of domains that it uses. The following table illustrates some structural metadata for a very simple example about aircraft:

NAME        KIND      INFO
Aircraft    entity    {id, maxSpeed, grossWt, hoursFlown}
Velocity    domain    {speedUnits}
Identifier  domain    {}
Weight      domain    {}
id          property  from Aircraft to Identifier
maxSpeed    property  from Aircraft to Velocity
grossWt     property  from Aircraft to Weight
hoursFlown  property  from Aircraft to Number
speedUnits  property  from Velocity to String

An equivalent model of structural information is as a directed graph. The nodes of the graph denote entities and domains, and the arrows denote properties.
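A minimal sketch of this graph view, using the names from the example table (the dictionary encoding and the `outgoing` helper are our own illustration, not a prescribed repository interface):

```python
# Structural metadata as a directed graph: nodes are entities and domains,
# arrows are properties. Each property maps to its (source, target) pair.
properties = {
    "id":         ("Aircraft", "Identifier"),
    "maxSpeed":   ("Aircraft", "Velocity"),
    "grossWt":    ("Aircraft", "Weight"),
    "hoursFlown": ("Aircraft", "Number"),
    "speedUnits": ("Velocity", "String"),   # a meta-attribute of the Velocity domain
}

def outgoing(node):
    """All properties whose source is the given entity or domain."""
    return sorted(p for p, (src, _) in properties.items() if src == node)

print(outgoing("Aircraft"))   # the attributes of the Aircraft entity
print(outgoing("Velocity"))   # the meta-attributes of the Velocity domain
```

The same traversal works whether the node is an entity or a domain, reflecting the model’s uniform treatment of the two.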

2.1.3 Meaning

Each data element appearing in a schema will be associated with a meaning. The meaning of a data element can be thought of as a pair, consisting of a Body (which gives the actual meaning) and a Name. The Body may be text, a document URL, or an object identifier (permitting comparisons for “is the meaning object identical”).

The following table gives the (grossly oversimplified) meanings of some elements of our aircraft example:

NAME        BODY
Aircraft    “flying thing”
Velocity    “how fast it goes”
Identifier  “unique id in Navy”
Weight      “its weight or mass”
id          “id of aircraft”
maxSpeed    “max air speed”
grossWt     “max weight, fully loaded”
hoursFlown  “flying time since last repair”
speedUnits  “mph, kph, etc.”

The combination of structural information and meanings resembles what the knowledge representation community calls an ontology. As a side point, we note that ontologies also have classification relationships among the nodes, such as generalization hierarchies. Generalization hierarchies are useful for finding the right definition and for factoring a concept’s definition (e.g., one can define triangle as 3-sided polygon), and are hence very desirable for a repository to support.
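The Name/Body pairing, and the comparison “is the meaning object identical” mentioned earlier, can be sketched as follows (a hypothetical illustration; the systems named here are invented):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Meaning:
    """A meaning pairs a Name with a Body; the Body may be text,
    a document URL, or an object identifier."""
    name: str
    body: str

aircraft_meaning = Meaning("Aircraft", "flying thing")

# One system reuses the registered meaning object; another writes its own
# textually identical record:
crew_planning_ref = aircraft_meaning
maintenance_ref = Meaning("Aircraft", "flying thing")

print(crew_planning_ref is aircraft_meaning)  # identity: known agreement
print(maintenance_ref is aircraft_meaning)    # only textual equality
```

Object identity is the strong guarantee: two systems referencing the same meaning object are known to agree, whereas equal definition text merely suggests agreement and still needs human confirmation.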

An ontological concept has no implied representation; i.e., the definition may be used by systems that implement the concept in different ways. In particular, the meaning of an entity or domain does not imply the existence of any of its properties; the specification of an entity’s (or domain’s) properties is part of its representation. This definition allows metadata sharing that might otherwise be impossible. For example, the Crew_Planning and Maintenance systems both involve aircraft, but have widely different ideas of what properties an Aircraft entity has. By circumventing the need to agree on all properties, the two communities will be able to agree on the same meaning for the Aircraft entity.

An ontological concept also has no context, which means that a concept cannot mean different things to different people, or somehow change its meaning based on the user or state of the system. For example, developers sometimes use a field of a record for multiple purposes; this is not supported within the ontology. (Instead, one could map the physical representation to a view schema in which each field has a specific meaning, and then relate elements of that view to the ontology.)

Properties usually greatly outnumber entity and domain definitions, and seem likely to require the bulk of administrators’ labor. If the meaning of each property is specified independently by each administrator, then overly fine distinctions will be made, inhibiting interoperability. By removing representational issues from the ontology, similar properties having different representations can share the same meaning. This reduces specification labor and makes it easier to share the information. For example, the ontology entry for CurrentAirspeed can be referenced for properties such as CurrentAirspeed_mph, CurrentAirspeed_kph, and Current_Airspeed_Long_Float. For a second example, the single property “id” in the above table denotes all aircraft identifiers, regardless of the detailed representation used for the Identifier domain, or the properties an organization attaches to Aircraft.
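A sketch of the CurrentAirspeed example (the representation details attached to each concrete property are invented for illustration):

```python
# One ontology entry, shared by several concrete properties that differ
# only in representation:
ontology = {"CurrentAirspeed": "current speed of the aircraft through the air"}

concrete_properties = {
    "CurrentAirspeed_mph":         {"meaning": "CurrentAirspeed", "unit": "mph"},
    "CurrentAirspeed_kph":         {"meaning": "CurrentAirspeed", "unit": "kph"},
    "Current_Airspeed_Long_Float": {"meaning": "CurrentAirspeed", "datatype": "long float"},
}

def share_meaning(p, q):
    """Candidate for mediation: both properties reference the same ontology entry."""
    return concrete_properties[p]["meaning"] == concrete_properties[q]["meaning"]
```

Because all three properties point at one ontology entry, a mediator knows they are candidates for conversion; only the representation (unit or datatype) needs translating.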

2.1.4 Representation

Representation deals with details that are not essential to meaning; a data element can often be translated from one representation to another without any change in what it denotes. A representation specification binds some of a concept’s properties to fixed values, while leaving others unbound, to be determined by the individual systems that implement the concept.

Current standards and systems restrict the kinds of representations that can be specified; these restrictions correspond to the extremes of our framework. For example:

· An entity representation in which all properties are fully bound corresponds to a traditional database relation scheme.
· An entity representation in which all properties are unbound corresponds to a reference scheme.
· A domain representation in which all specified properties are bound corresponds to what the DOD calls a generic data element [DOD94].
· A domain representation in which all specified properties are unbound corresponds to a semantic domain [SSR94].

Our framework avoids arbitrary restrictions on the kinds of representations that can be named and shared. We permit administrators to name and reuse any set of such specifications, such as a partial data element, or even a partial database description. There are many purposes for which such metadata might be used. In general, when two systems agree on something, it may be handy to define a concept that captures the amount agreed, even if there remain some disagreements. For example, WeightKg is such a domain: it specifies that the value must be in kilograms, but leaves the implementation datatype unspecified (to be chosen by the individual systems).
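The WeightKg example can be sketched as a partial representation that later gets refined (the dictionary shape and the `refine` helper are our own illustration, not a standardized registry format):

```python
# A representation binds some properties of a concept and leaves others
# open. WeightKg binds the unit but not the implementation datatype.
weight_kg = {
    "concept": "Weight",
    "bound":   {"WeightUnit": "kg"},        # agreed by all participants
    "unbound": ["Datatype", "Precision"],   # left to individual systems
}

def refine(representation, **bindings):
    """Produce a fuller representation by binding some open properties."""
    still_open = [p for p in representation["unbound"] if p not in bindings]
    return {
        "concept": representation["concept"],
        "bound":   {**representation["bound"], **bindings},
        "unbound": still_open,
    }

# One system completes WeightKg into a concrete data element:
weight_kg_float = refine(weight_kg, Datatype="float", Precision=1)
```

The partially-bound form is itself registrable and reusable, which is exactly the flexibility the framework argues for: the community shares what it has agreed on, no more and no less.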

An advantage of this flexible representation framework is that any number of property specifications is possible, depending on the data administrator’s needs. This promotes metadata sharing: an administrator can register representations in the repository as accurately as possible.

2.2 SPECIFYING CORRESPONDENCES BETWEEN DATA ELEMENTS

Data sharing requires some degree of mediation, based on metadata about correspondences between definitions. (For simplicity, we assume that all mediator-relevant definitions are in repositories, and that all the repositories use the same schema constructs for expressing this information; heterogeneous repository structures are a second-order problem, and outside our scope). The necessary correspondence information is substitutability, roughly defined as the appropriateness of using data conforming to one system’s metadata as the value for an element described in another system. (One can also define substitutability for definitions in registries, based on substitutability of data elements that conform).

Gathering such information is difficult and costly, but not impossible. Humans are typically needed to confirm whether it is appropriate to assert substitutability4; such assertions are implicit whenever programmers or users share information among databases. Unfortunately, they are generally not recorded as accessible knowledge, due to inadequate repository support.

4 Full automation seems infeasible, and techniques for automated assistance are outside the scope of this paper.

The construct for registering “substitutability” correspondences needs to say more than “Element X can substitute for element Y.” There is a long distance between “I think this textual definition fits my data” and “I am sharing data with three other systems that subscribe to this definition.” One cannot assume that all registered correspondences have met the second criterion, and therefore the repository needs to capture compliance information.

The repository schema must define the form of the assertions. Some additional fields that ought to be supported by any repository include:

· For what purpose (i.e., in what context). Distinctions that are irrelevant to one organization (“Price” versus “Bid Price”) may be critical to another purpose.
· Who says so.
· How certain they are (certainty).
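An assertion record covering these fields might look like the following. (All element names, the 0-to-1 certainty scale, and the example values are invented for illustration; the repository standard does not prescribe this shape.)

```python
from dataclasses import dataclass

@dataclass
class Substitutability:
    """An assertion that data conforming to one element may stand in
    for another, with the compliance context recorded alongside."""
    source_element: str
    target_element: str
    purpose: str       # in what context the substitution is claimed valid
    asserted_by: str   # who says so
    certainty: float   # 0.0 = "the textual definition looks right"
                       # 1.0 = "verified in operational data sharing"

claim = Substitutability(
    source_element="Logistics.Aircraft.grossWt",
    target_element="CrewPlanning.Aircraft.maxWeight",
    purpose="mission planning",
    asserted_by="Logistics DBA",
    certainty=0.6,
)
```

Keeping purpose, author, and certainty on every assertion is what lets the repository distinguish a hopeful textual match from a correspondence that systems actually rely on.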

The above limited features are intended to allow administrators to record some knowledge easily, to support discovery of what might be relevant, and to support exploitation of what is known to be relevant. They are intended as a core for supporting large organizations. If one were supporting schema integration tools over the repository, one would certainly want to record negative assertions (“don’t ask me again; these are different”), assertions about specialization, and perhaps assertions that certain pairs are candidates for further investigation.

One must be aware that overworked system builders may obey the letter but not necessarily the spirit of interoperability rules. We give several examples of (ill-advised) mandates that can create unintended consequences:

· Mandate that all data elements shared among systems must be recorded in standard form in a repository, without provisions for recording the state of uncertainty. Unless there are incentives and tools to facilitate reuse, it may be easier for system builders to record new elements than to reuse existing ones.
· Assume that all registered correspondences are accurate. They may not be: a short textual definition may not adequately capture the assumptions.
· Provide incentives for reuse of data elements, without estimating certainty of compliance. System builders may reuse existing elements, but may impose additional local meanings that prevent sharing.
· Require system builders to provide definitions of a quality sufficient to determine substitutability. Now system builders will try to avoid describing information that is currently local, even if it may someday be useful to others.

A better set of incentives would encourage system builders to speak honestly. It should be possible for a system builder to say “I agree with that textual definition (so perhaps this is an opportunity to share data),” or “I have checked that I can really substitute data from that source,” or “that source has checked and will accept data from me.” These statements can be accomplished using the constructs described above.

If the central administration wishes to measure how well each local administrator contributes to data sharing, they need a metric that balances factors of stand-alone quality, reuse, and accuracy of documentation. Stand-alone quality is the current practice; definition reuse is easily measured. To avoid the difficulties listed above, one should give bonuses for higher degrees of certainty, but penalties if the error rate proves higher than indicated by the certainty.
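One hypothetical shape for such a metric, with entirely invented weights (the paper proposes the balancing idea, not any particular formula):

```python
def contribution_score(quality, reuse_count, certainty, observed_error_rate):
    """Illustrative metric: reward stand-alone quality, reuse, and honestly
    high certainty; penalize overclaiming (errors beyond what the stated
    certainty implied). Weights here are arbitrary placeholders."""
    implied_error_rate = 1.0 - certainty
    overclaim = max(0.0, observed_error_rate - implied_error_rate)
    return quality + reuse_count + certainty - 10 * overclaim

# An administrator who honestly reports moderate certainty scores better
# than one who overclaims and is caught by the observed error rate:
honest = contribution_score(quality=3, reuse_count=5, certainty=0.5,
                            observed_error_rate=0.4)
overclaimed = contribution_score(quality=3, reuse_count=5, certainty=0.9,
                                 observed_error_rate=0.4)
```

The asymmetric penalty is the point: claiming high certainty is only profitable when the claim survives contact with actual error rates.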

2.3 METADATA FOR TRANSFER STRUCTURES

We now discuss the metadata needed to describe the intermediate transfer structures used to transmit data from one system to another. The key insight is that a transfer structure is a kind of database. As with a database, we need metadata to describe the meaning, representation, and structure of everything in the transfer structure. Fortunately, meaning and representation descriptions are exactly the same as with databases, and can be made with reference to the same repository. This by itself means that we manage one metadata resource, not two, and that unnecessary diversity is avoided.

The main difference is in describing structural information. A database can be described as a collection of relational tables. Transfer messages, however, are usually arranged in a hierarchical format; here, we must be able to describe the hierarchy. To transmit a transfer structure (or a database), one needs externalization functions that convert the contents of a table or message to a string of characters or bits, and internalization functions that recreate the database from the bitstream.
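The externalization/internalization pair can be illustrated in a few lines. (JSON stands in here for a real message syntax such as a USMTF or EDI format; the aircraft rows are invented.)

```python
import json

# A transfer structure is a kind of database: the same metadata describes
# both. Externalization flattens table contents to a character string;
# internalization recreates them on the receiving side.
aircraft_rows = [
    {"id": "N1234", "maxSpeed": 540, "grossWt": 83000},
    {"id": "N5678", "maxSpeed": 610, "grossWt": 91000},
]

def externalize(rows):
    """Convert table contents to a string of characters for transmission."""
    return json.dumps(rows)

def internalize(stream):
    """Recreate the table from the transmitted string."""
    return json.loads(stream)

# Round-tripping through the transfer structure loses nothing:
assert internalize(externalize(aircraft_rows)) == aircraft_rows
```

The round-trip property is the contract the two functions must satisfy, whatever the concrete message syntax; the hierarchy-arrangement problem discussed next is about how that syntax is defined and maintained.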

At present, it is necessary to define and administer the message hierarchy and the preparation/parsing rules by hand. This is not a simple task, and can consume quite a lot of effort. In the future, it should be possible to come up with a canonical hierarchy, or a self-describing structure, to automate the business of arranging the message contents. There is reason to believe that the automated arrangement could be as efficient (in terms of bandwidth) as the present manual approach.

2.4 ADMINISTRATIVE POLICIES AND INCENTIVES

It is now generally accepted that the more accurately and completely a system can specify its metadata, the greater the interoperability it can have with other systems. However, this goal is not always easy to achieve. A database administrator, when specifying metadata, may be uncertain whether data element descriptions really match an existing concept, or whether a new data element can be created that captures the meaning better. There is an unavoidable tension between the goals of sharing metadata and accurately specifying its meaning.

A solution is likely to involve a mix of policies and automated tools. Some of the required tools resemble tools that support reuse libraries. One needs to provide users easy access for browsing metadata, and to provide information-retrieval style searches to identify definitions that might be relevant. Ideally, one would also allow administrators to inspect existing data and (for more technically-oriented administrators) the code that manipulates it.

We propose that metadata specification tools satisfy the following requirements:

A. Be Incremental

· Accommodate gradual progress. Gather partial metadata as a byproduct of developers’ efforts to interchange high-priority information between pairs of systems. Provide good interoperability services for the parts of systems that are already documented.
· Avoid “everybody together” deadlines on upgrades. When all systems in a family must cut over simultaneously (as when a standard message format is changed), the speed of adaptation may be governed by the slowest development group. Instead, support coexistence using backward gateways, data pumps, and versioning.
· Insist on good short-term return on investment. Both the shape of the future and the chance of success are too uncertain to justify investments that are strictly long term.
· Stay flexible. Provide a good start toward supporting the emerging mediation technologies, positioning the organization to use whatever products emerge.

B. Minimize Burdens on Individual Administrators (especially within development programs)

· Ask administrators concrete rather than metaphysical questions. A question such as “are these two concepts the same?” may be difficult to answer, because the answer might be 90% true. The more concrete question “would you accept data from this source?” would elicit a more useful answer.
· Make metadata convenient to access. Information-retrieval tools should aid an administrator in finding an existing definition that suits a requirement. Automated tools, such as interface generators, mediators, and query tools, should be able to access and exploit the existing metadata.

· Minimize the granule of specification and evolution. For example, a user should not need to read a complex SQL query to arrange interchange of one additional data element.
· Allow import of composite definitions, not just atomic ones. The unit of import may be different from the unit of specification and change.
· Minimize the effort of data administration, including retraining effort. The metadata structure should be administrator-friendly; automated translators could then produce the representation required by a mediator product.

C. Plan to Serve a Muddled World

While parts of systems will become well behaved (e.g., communicating standardized data through standard interfaces), the scope of our systems will increase. New requirements and the need to collaborate closely with additional systems will guarantee that heterogeneity will persist. Therefore:

· Partially-compliant systems must be able to play. Systems that do not conform to the latest standards will continue to be important. (Some may be legacy systems; others may belong to other organizations.) We suspect that a little wrapping will make almost any system compliant to a small but useful degree.
· Categorization should not be critical. It was never possible to obtain consistency in what was modeled as an entity versus as a relationship. Such ambiguities will continue in our metadata. Sharing should not depend on getting the categorization “right,” and different communities should not be forced to argue about which approach is “right.”

3. CURRENT PRACTICES

This section discusses current practices in the DOD, with a few brief remarks about commercial practice.

The DOD is a very large organization with complex data needs. Organizations within the DOD at many levels in the hierarchy have a surprising amount of autonomy in acquiring information systems and, until recently, in defining separate data schemas to suit their needs. As a result, the DOD is filled with information systems that cannot share data, both because semantically-equivalent items are difficult to identify, and because different names, structures, and representations have been chosen for equivalent items. These failures in data interoperability have been a known problem for many years. Efforts to solve these data sharing problems have included standard message formats and, more recently, a data element standardization program. In this section we describe these two efforts at improved data sharing, note which parts correspond to our metadata framework, and point out where additional or different kinds of administration may be needed.

3.1 STANDARD MESSAGE FORMATS

Standard messages have been used to communicate between DOD information systems for decades. The JINTACCS5 program was created in 1977 as an attempt to link existing command and control systems across service boundaries (e.g. Army to/from Air Force). It is responsible for two families of standard messages: USMTF and TADIL. The USMTF message family is a set of character-based messages with a strong resemblance to EDI. These messages are intended to be man-readable as well as machine-processable. The TADIL message family consists of binary, fixed-length messages used in tactical data links. There are other similar standards, both for communicating within individual services, and with NATO allies.

These message standards were introduced in order to solve interoperability problems. They are widely judged to have failed at this task. The main reasons are the expense of maintaining the standard, the high cost of building and maintaining the interface software which produces and consumes the transmitted messages, and the resulting inflexible architecture of systems which can only communicate through messages that must be defined far in advance. This in no way means that their use can be abandoned. In many cases their continued use is required by international agreement. There is a huge software base that depends on their continued use. There are even advantages. Some messages are prepared once, then multicast to several receivers. Messages also enable communication between systems (e.g., in vehicles) that have not upgraded to modern workstations or even PCs, and cannot afford the overhead of a DBMS.

In this section we describe some problems in the administration and use of standard message formats and ways that our metadata framework might lead to solutions.

· The bridge software combines functions that should be separate. Frequently a bridge program searches the source system’s database to determine data to extract, transforms the data to match the target schema, transforms attributes to match the detailed target representations, and merges the result into the target’s database. The tight coupling makes such programs very difficult to maintain, and makes it difficult to use mediation. We would recommend separating the extraction, transformation, and merge steps, with a view toward replacing the transformations with automated mediation.
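The separation recommended above can be sketched as a three-stage pipeline. This is only an illustrative sketch; the source rows, field names, and conversions are all invented, not drawn from any actual DOD system.

```python
# Hypothetical sketch of a bridge decomposed into three independent stages;
# all table and field names are invented for illustration.

# Toy "source database": rows from the producing system.
source = [
    {"unit_name": "alpha", "status": "active",   "lat_rad": 0.7},
    {"unit_name": "bravo", "status": "inactive", "lat_rad": 0.9},
]

def extract(rows):
    """Stage 1: select only the data the target needs (query logic only)."""
    return [r for r in rows if r["status"] == "active"]

def transform(rows):
    """Stage 2: rename and recode to the target's schema and representations.
    This is the stage one would hope to replace with automated mediation."""
    return [{"callsign": r["unit_name"].upper(),      # name mapping
             "lat_deg": r["lat_rad"] * 57.29578}      # radians -> degrees
            for r in rows]

def merge(rows, target_db):
    """Stage 3: fold the transformed rows into the target store."""
    for r in rows:
        target_db[r["callsign"]] = r
    return target_db

target = merge(transform(extract(source)), {})
print(target)   # prints the merged target rows
```

Because each stage has a single responsibility, the transform stage can later be generated from declared correspondences without touching the extraction or merge logic.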

5 JINTACCS: Joint Interoperability of Tactical Command and Control Systems
TADIL: Tactical Digital Information Link
USMTF: United States Message Text Format

· Correspondences are buried in source code. Bridge programs as described above embody a great deal of knowledge of each system, but that knowledge is effectively lost. For example, it required weeks of analysis, using custom tools that reverse engineered C source with embedded SQL, to figure out what actually happens in one such bridge connecting two large Air Force applications.

· Transfer structure definitions and stringification code are manually maintained. Ideally, one would use commercial software for transferring between databases. Currently, each message family has different rules for externalization and internalization functions. Character-based messages cannot be easily extended to pass binary data (e.g., images), except through an inefficient uuencode-style encoding mechanism.

· No ad-hoc queries can be accommodated. Transfer structures can be large, and their generation can be slow. A program that generates the entire message does not provide a means of extracting a small amount of up-to-date information from the source.

· Producers and consumers may disagree about the need to change widely-used transfer structures. Some DOD organizations have bridges to old versions of transfer structures, and not all the changes have been backward compatible.
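The inefficiency of the character-encoding workaround mentioned above is easy to quantify. Using base64 as a stand-in for a uuencode-style mechanism, every 3 binary bytes expand to 4 message characters:

```python
import base64

# base64 as a stand-in for the uuencode-style mechanism mentioned above:
# every 3 binary bytes become 4 message characters (~33% expansion).
payload = bytes(300)                  # e.g., a small binary image fragment
encoded = base64.b64encode(payload)   # character-safe form for a text message
print(len(payload), len(encoded))     # 300 bytes become 400 characters
```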

Many of the same problems arise in the commercial EDI community. Message elements are not yet coordinated with elements in databases, or in business object standards; each participant organization must manually map its databases to files described in terms of the standard elements of a message. With some products, the mapping must be repeated for each message, even if the same element is used. Furthermore, pairwise negotiation is often needed about the details; as a result, dialects have emerged. For example, both Wal-Mart and Kmart are major EDI users, but a supplier who communicates with both will probably have redundant administrative tasks.

3.2 DATA ADMINISTRATION AND DATA ELEMENT STANDARDIZATION

Data administration within the DOD is governed by a policy and set of procedures known as the 8320.1 directive [DOD94]. This policy states that all new and modified information systems in the DOD must be built using only standard data elements. The intent is to eliminate data interoperability problems by eliminating all differences in name, structure, and representation of semantically-equivalent data items.6

Until very recently the emphasis in DOD data administration has been on collecting a critical mass of standard data elements. The intention is that these data elements will be reused in new and modified systems; new elements are to be added only when equivalent elements cannot be found. Every standard data element is tied to an attribute in an IDEF1X data schema7. (IDEF1X is an extended entity-relationship model [Bruce]; see also the standards discussion in Section 4.) The policy defines certain metadata to be collected for each data element (e.g., datatype, maximum length), and prescribes a naming convention to ensure a unique name (which embeds other metadata) for each element. Data elements are added to the standard through an approval process which attempts to assign control (or stewardship) of each data element to the functional group most concerned with its subject area. The approval process is also intended, through a step known as cross-functional review, to ensure that schemas and data elements submitted for review are fully integrated into the global DOD schema before acceptance, eliminating any redundant representations of semantically-equivalent concepts.

6 Problems in entity identification and conflict resolution are largely ignored in 8320.1.

The 8320.1 policy was established in 1991. While the DOD has been successful in collecting definitions for approximately 17,000 standard data elements, this has not yet provided major improvements in data interoperability. A consensus also holds that the schema integration that was supposed to occur during the review process has not really succeeded: the current data element repository is known to contain many redundant (and poorly-designed) data elements. However, there is no consensus on what ought to be done next: some hold that with renewed effort and stricter enforcement, the existing process can succeed; others believe that some changes and improvements to the process are needed.

Major software packages (such as R3 from SAP) also require extensive data administration to connect to an organization’s systems. Yet here too the lack of repositories and standards is costly – the results of data administration cannot easily be shared with other organizations.

4. STANDARDS DISCUSSION

This section discusses several “standards”8 that have a large potential role in the DOD:

· The International Standards Organization standard for data element and other metadata registries (ISO-11179)

· DOD’s guidance for defining data element dictionaries (8320.1)

· The IDEF1X formalism for describing data models

· STEP and CDIF

4.1 DATA ELEMENT REGISTRY STANDARD (ISO-11179)

The standard describes some important meta-properties and behaviors. However, the standard specifies only conceptual behaviors. It does not specify schemas or APIs that a registry must support, and as a result it provides no software interoperability; consequently it does not widen the market in the way that the SQL and ODBC standards did when they catalyzed the explosion of 4GL and application-development tools. We discuss these issues in greater detail below.

The standard does have significant strengths. It provides a very clear tutorial, and very useful guidance in many aspects of specifying a registry’s contents and behavior. It standardizes metadata attributes to be attached to element definitions. It aids politically in getting metadata providers to agree to a form for the metadata they provide (e.g., “We’re following an international standard, so don’t try to redefine it.”) And it enables a certification process for appropriateness of some of the registry’s policies (though for DOD, the benefits of this certification are uncertain).

Reflecting the interests of activists who provide large, public datasets (e.g., census or scientific data), the standard contains features that would make data easier for humans to find and judge, and to bridge for import into user-owned applications. It aims to make the necessary information visible, but does not aim for a high degree of automation. In contrast, DOD is more concerned with peer-to-peer sharing among its components.

For us, the main flaw of the standard is that its scope excludes features that are essential to attracting vendor support. We want the metadata registry to promote the development of an industry, with niches that include repository management software, metadata capture tools, and tools for exploiting the captured metadata. For such an industry to flourish, two things must occur: All the above metadata-driven software must interoperate, to minimize the number of interfaces each vendor must support; and the resulting mix should support information-sharing.

To enable software interoperation, one needs to standardize: a schema for (at least a core subset of) the metadata; a query language (for accessing this information); and a transfer format (for archiving and shipping metadata). The registry standard describes what needs to be shared, but does not provide such concrete mechanisms. The necessary tasks, then, are to choose a query language (e.g., a subset of SQL or ODL) and a transfer format (e.g., CDIF), and within the language’s constructs, to standardize a schema by which registries could be accessed.
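A minimal sketch of the idea: once a core registry schema and a SQL subset are agreed, any vendor's tool can interrogate any registry. The one-table schema, its columns, and the sample element below are invented for illustration (using SQLite as a stand-in for whatever DBMS hosts the registry).

```python
import sqlite3

# Invented one-table "core" registry schema; a real standard would cover
# far more (domains, stewardship, correspondences, ...).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_element (
        elem_id    INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        definition TEXT,
        datatype   TEXT,
        max_length INTEGER
    )""")
conn.execute(
    "INSERT INTO data_element VALUES (1, 'person-birth-date', "
    "'Date on which a person was born', 'date', 10)")

# Any tool speaking the agreed SQL subset could interrogate the registry:
rows = conn.execute(
    "SELECT name, datatype FROM data_element WHERE datatype = 'date'"
).fetchall()
print(rows)   # [('person-birth-date', 'date')]
```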

The main purpose in populating the repository is not to describe a single database, but to share information among systems. For the industry’s tools to support such sharing, the repository standard must specify the form of the substitutability relationships discussed in section 2.2.

ISO-11179 needs correspondences, so buyers can have tools for data sharing. With its current scope, it describes information needed for accessing databases, but not for sharing between them; this greatly reduces its utility to our customers. Most organizations in DOD give low priority to their role as data providers to others, and are not clamoring for ways to improve. The need is for data sharing across organizations.

4.2 DOD Directive 8320.1

The role of 8320.1 within the DOD was discussed in section 3. We now briefly describe its technical underpinnings.

8320.1 is a standard for descriptions of data elements. It parallels the treatment in ISO-11179 rather closely, but contains more specifics. For example, 11179 says that one may define domains for values; 8320 defines numerous useful ones (but imposes central control on domain-creation).

8320.1 shares the weaknesses of 11179 by not supporting (representation-independent) semantic domains, by not having a mechanism to package arbitrary sets of definitions into a reusable unit, and by not having a special construct for substitutability correspondences. In addition, it inhibits reuse by not supporting 11179’s representation-independent “conceptual data elements.” Finally, 8320.1 requires a complex structure for names. While the structure seems well founded (and is reflected in our treatment of semantic domains and representation), it creates unusable names. Now that most users can afford good graphical tools, the information ought instead to be captured only in a data dictionary to be searched and displayed in many ways; in the meantime, usable names should be provided.
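To make the naming issue concrete, consider the contrast between a long structured name and a data-dictionary entry. The structured name, the short name, and all metadata values below are hypothetical, invented purely to illustrate the point, not taken from the 8320.1 registry.

```python
# Invented example: the same metadata carried once inside a long structured
# name, and once in a dictionary entry under a short, usable name.
structured_name = "PERSON-EMPLOYMENT-START-CALENDAR-DATE"  # hypothetical 8320.1-style name

data_dictionary = {
    "start_date": {                         # short, usable display name
        "entity": "PERSON-EMPLOYMENT",      # metadata formerly embedded in the name
        "semantic_domain": "calendar-date",
        "definition": "Date a person's employment began",
    }
}

# Tools can search and display the captured metadata in many ways,
# without anyone having to decode a structured name by eye.
print(data_dictionary["start_date"]["semantic_domain"])   # calendar-date
```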

4.3 IDEF1X VERSUS UML

DOD standard schemas use IDEF1X, an extended entity-relationship (EER) model. The IDEF1X model has a rich variety of constraint constructs. There are good tools for displaying and manipulating diagrams; LogicWorks’ ERWin seems very popular within DOD. The existing standard (IDEF1X-93) has several serious weaknesses, so a new standard is needed. Omissions from IDEF1X-93 include:

· There are no constructs for mapping between models, i.e., no language for defining views.

· There is no standard meta-schema or even file exchange format to enable interchange of models among tools from different vendors. Ideally, one would be able to ship both entire models and finer-grained descriptions (e.g., of individual entities) to another organization, to aid collaborative design.

· Multiple inheritance is not supported, making it harder to reuse portions of a specification, or to support an object model that does have multiple inheritance.

· There is no connection to process or organizational modeling.

· IDEF1X defines models, but the standard currently has no facilities for tying those models to database schemas. (Some tools can produce database schemas from models, or reverse-engineer models from schemas, but there is no vendor-independent mechanism for recording the connections.)

We note that some of the supporting tools do have constructs for some of the above features. But without a standard, one cannot expect consistency, let alone interoperability.

A major extension to IDEF1X (IDEF1X-97) has been proposed [IEEE96]. It remedies many of the above deficiencies, and gives improved support for object-oriented analysis and design, while preserving an easy transition path from IDEF1X-93. Most of its constructs seem well designed. Unfortunately, it does not seem in synch with major industry trends – de facto standards, and tighter coupling across lifecycle stages. We compare it with an emerging alternative, the Unified Modeling Language (UML).

IDEF1X-97 rightly includes an object model, a mapping (i.e., query) language, and a transfer syntax. Unfortunately, in each case, the specification goes its own way, rather than adopting and extending an existing standard. We note also that it is insufficient for IDEF1X-97 to receive support from data-modeling vendors – the object model and mapping language need support from vendors at many other stages of the software lifecycle. For example, one wants mappings between schemas to be executable (providing database translation), but this requires that the mappings be in SQL or (for some object database vendors) ODL. Furthermore, as the object-relational formalisms used for physical schemas approach EER constructs in richness, the burden of separate formalisms for physical and conceptual specification will become unjustified.

In recent months, UML has become an industry juggernaut for object-oriented analysis and design, winning support from both Microsoft and the Object Management Group. It provides substantial integration across the lifecycle, and seems assured of support from many leading tool vendors, not just those who serve the government. It contains analogs to most of the data modeling constructs of IDEF1X-97; the missing constructs could be provided as extensions to the UML model, and would be candidates for future standardization. Its close connections

with the object community indicate that it will remain consistent with their formalisms, and its ancestry guarantees that it suits life-cycle stages beyond data modeling.

One further motivator for the shape of IDEF1X-97 was to minimize the pain of transitioning existing conceptual models to the new standard. However, if one examines the entire lifecycle, the choice seems to be IDEF1X with UML, or UML alone. The former requires a translation each time a model is imported or exported from conceptual modeling tools and will inhibit maintaining consistency across stages. The UML-based strategy requires just a one-time translation of existing models (probably automatable). The new formalism should be fairly easy for existing staff to learn, while schools will include UML in their training of future software engineers.

4.4 STEP, CDIF, ETC.

Several existing standards point the way toward hiding the syntax used to transfer designs among systems. Both the CASE Data Interchange Format (CDIF) and the STEP format for exchange of engineering product data specify a rule for translating an instance of a given schema to a data stream (or file) that can be shipped or archived. STEP has, and CDIF is developing, a way to see the transferred information as an object database. (Because it was developed early, STEP defined its own database interface.)

The argument for using CDIF seems stronger. It already plays in the CASE arena (and is the basis for transferring UML models); also, its database interface will be OMG-compliant. Mappings from database format to the transfer form are built in, reused for each schema. This saves effort, but may be less efficient than a custom encoding.
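The "built-in, reused for each schema" idea can be sketched generically: one serializer, driven entirely by the schema, flattens any instance to a transfer stream. The text format, schema, and instance below are toy inventions, not the actual CDIF or STEP encodings.

```python
def serialize(schema, instance):
    """Schema-driven flattening of an instance to a text stream. The same
    function is reused unchanged for every schema (the built-in-mapping
    style), unlike a hand-tuned per-message encoder."""
    lines = []
    for entity, attrs in schema.items():
        for record in instance.get(entity, []):
            fields = ";".join(f"{a}={record[a]}" for a in attrs)
            lines.append(f"{entity}({fields})")
    return "\n".join(lines)

# Invented toy schema and instance.
schema = {"Entity": ["name", "kind"]}
instance = {"Entity": [{"name": "Aircraft", "kind": "class"}]}
print(serialize(schema, instance))   # Entity(name=Aircraft;kind=class)
```

The saving is exactly the trade-off noted above: no per-schema encoder to write or maintain, at the cost of a format less compact than a custom encoding.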

5. SUMMARY AND CONCLUSIONS

We have presented some “next steps” that large organizations can take to improve metadata capture and thereby improve data sharing. None requires a technology breakthrough, though several may require significant changes to organizations and incentives. The problems and opportunities were illustrated with examples from Department of Defense practices, augmented by comments that the Electronic Data Interchange world behaves rather similarly.

We adapted the conventional story about required metadata (i.e., descriptions of structures, concepts, and representation) to enhance sharing by allowing flexibility in the granularity of reuse. We also addressed the need to support both a shared pool of definitions, and to incrementally record connections among definitions that were developed independently. A form for such connections was proposed. The treatment can be considered a refinement of

existing standards (both ISO-11179 and DOD-8320), to make greater use of semantic domains and to define a relationship type to handle correspondence information.

We also discussed why metadata administration needs to be shared across technology boundaries, to capture metadata about databases, transfer formats, and even arguments in application interfaces. We gave several reasons why intermediate structures will not be replaced by direct communication among applications’ databases, but argued that the intermediate structures could be regarded as temporary databases whose schemas are fairly conventional even though their syntax is peculiar.

After discussing how these concepts play out in practice, we then surveyed some applicable standards. We discussed why, for different reasons, neither the metadata registries standard (ISO-11179) nor the proposed IDEF1X update seems to adequately answer the concerns of software vendors, and hence neither seems likely to be supported by many products. Yet there is much technical insight in both proposals, so they need to be extended (in the case of ISO-11179) or merged into UML (in the case of IDEF1X-97).

The major barrier to adopting our approaches seems to be the weak position of repositories. A new generation of repository products is emerging, and due to low price and web access, seems likely to receive greater acceptance. It would also be highly desirable for vendors and standards committees to devise repository schemas that are friendly to definition reuse and to capturing correspondences.

6. REFERENCE LIST

[AK91] Arens, Y., C. Knoblock, “Planning and reformulating queries for semantically-modeled multidatabase systems.” In Proceedings of the 1st International Conference on Information and Knowledge Management (1992), pp. 92-101.

[Bruce] Bruce, T., Designing Quality Databases With IDEF1X Information Models. New York: Dorset House, 1992.

[CHS91] Collet, C., M. Huhns, W. Shen, “Resource Integration Using a Large Knowledge Base in Carnot,” IEEE Computer, 24(12), December 1991.

[DOD94] Department of Defense, Data Element Standardization Procedures, March 1994. DOD 8320.1-M.

[GMS94] Goh, C., S. Madnick, M. Siegel. “Context interchange: overcoming the challenges of large-scale interoperable database systems in a dynamic environment.” Third International Conference on Information and Knowledge Management (1994), pp. 337-346.

[IEEE96] IEEE IDEF1X Standards Working Group. “IDEF1X-97 Conceptual Modeling, (IDEF-object) Semantics and Syntax”, 1996.

[ISO97] International Standards Organization (Gilman ed.) “ISO/IEC 11179 - Specification and Standardization of Data Elements”, http://www.lbl.gov/~olken/X3L8/drafts/draft.docs.html

[RRS96] Renner, S., A. Rosenthal, J. Scarano, “Data Interoperability: Standardization or Mediation” (poster presentation), IEEE Metadata Workshop, Silver Spring, MD, April 1996. http://www.computer.org/conferen/meta96/renner/data-interop.html

[SR96] Seligman, L., A. Rosenthal, “A Metadata Resource to Promote Data Integration”, IEEE Metadata Workshop, Silver Spring MD., April 1996 http://www.computer.org/conferen/meta96/seligman/seligman.html

[SSR94] Sciore, E., M. Siegel, A. Rosenthal. “Using semantic values to facilitate interoperability among heterogeneous information systems,” ACM Transactions on Database Systems, June 1994.

ACKNOWLEDGEMENTS

This work was sponsored by the US Air Force Electronic Systems Center (ESC) Office (SIO) under contract AF19628-94-C-0001.
