Comparative Study of Metadata Standards and Metadata Repositories
Gunjan Pahuja
JSSATE/CSE Department, U.P., India.

Abstract-Much work is under way in the national and international standards communities to reach consensus on standardizing metadata and the repositories that organize it. This paper describes several metadata standards and their importance to statistical agencies. Existing repositories based on these standards help to promote interoperability between organizations, systems, and people. Repositories are vehicles for collecting, managing, comparing, reusing, and disseminating the designs, specifications, procedures, and outputs of systems, e.g., statistical surveys.

Keywords: Metadata, Metainformation, Metadata Repositories, ISO 704, ISO 860, ISO 1087-1, ISO/IEC 11179, ISO 2788, ISO 5964, ISO/IEC 14481.

2nd International Conference on Methods and Models in Science and Technology (ICM2ST-11), AIP Conf. Proc. 1414, 72-76 (2011); doi: 10.1063/1.3669934. © 2011 American Institute of Physics 978-0-7354-0991-0/$30.00

I. INTRODUCTION

Metadata is defined as data about data. Metadata can be stored and managed in a database, called a registry or repository. Metadata management refers to the content, structure, and designs necessary to manage the vocabulary and other metadata that describe statistical data, designs, and processes. This includes developing metadata models to define the content of metadata within some context, building metadata repositories to organize the metadata defined in the model, developing statistical terminologies that define and organize terms into a structure with relationships, and identifying the relationships between the terminology structure and other metadata and data.

Much work is under way in the national and international standards communities, especially ANSI and ISO, to reach consensus on standardizing metadata and repositories. This work has had a large impact on efforts to build metadata systems in the statistical community.

The paper includes a general description of metadata and metadata repositories; a description of metadata management standards; a discussion of how metadata affects data quality, with some measures for the quality of metadata itself; and the benefits of implementing metadata standards and repositories.

II. STATISTICAL METADATA & REPOSITORIES

A. Statistical Metadata

The data collected and processed through surveys are called microdata, macrodata, or time series. Statistical metadata is defined as the data and documentation that describe statistical data over the lifetime of that data.

B. Metadata Repositories

A metadata registry is a database used to store, organize, manage, and share metadata. A metadata registry provides the entities necessary for registration of the objects.

Metadata and metadata repositories have two basic purposes:

1. End-user oriented purpose: The purpose is to support potential users of statistical information (e.g., through Internet data dissemination systems). A potential end-user of statistical information needs to identify, locate, retrieve, process, interpret, and analyze statistical data that may be relevant for a task that the user has at hand.
2. Production oriented purpose: The purpose is to support the planning, design, operation, processing, and evaluation of statistical surveys.

The efficient and effective use of data is made possible by the organized storage and use of metadata. Data sets become much more useful when good metadata descriptions are readily available. When metadata are centrally maintained for collections of data sets, users who need to determine which files are appropriate for their work can do so.

Many types of requests can be answered through metadata queries, such as: Which data sets contain specific information, such as yearly income? Which data sets share common information from which links can be made to form larger data sets? Metadata queries can also locate data sets by broad subjects through pointers to specific items under those subjects, and monitor data storage system usage by tracking file sizes.

III. TERMINOLOGY

Terminology is the study of terms and their use. Terms are words and compound words that are used in specific contexts. A concept is a unit of thought. At the most basic level, a terminology is a listing without structure and without reference to a specific language. However, concepts must be defined in a natural language, possibly employing other terms. Terminologies, or classifications, are

most useful when a structure is applied to them. The major categories of structure types are:

1. Thesaurus - A controlled set of terms covering a specific domain of knowledge, formally organized so that the a priori relationships between concepts are made explicit;
2. Taxonomy - A classification according to presumed natural relationships;
3. Ontology - A formal specification of a conceptualization.

So far, only the thesaurus structure type has been standardized.

IV. TERMINOLOGY STANDARDS

A. ISO 704

This is the most fundamental standard; it describes how to write standards for terminology. It is divided into three major sections: concepts; definitions; and terms. These are the basic constructs that are necessary for terminology.

An object is an observable phenomenon, and a concept is a mental construct serving to classify objects. Any object may have multiple concepts associated with it, depending on the context or point of view.

A characteristic is used to differentiate concepts in a terminology, and different types of characteristics are described. The totality of characteristics for a concept is called its intension. The totality of objects sharing all the characteristics of a concept is its extension.

Relationships are described in some detail because concepts are always related to other concepts in a terminology. Finally, systems of concepts are described. A system is the set of concepts of a given subject field.

B. ISO 860

ISO 860 is an extension of ISO 704. It specifies a methodology for the harmonization of concepts, definitions, terms, concept systems, and term systems. This standard addresses two types of harmonization:

1. Concept harmonization: The elimination of minor differences between two or more closely related concepts. Concept harmonization is not the transfer of a concept system to another language; it involves the comparison and matching of concepts and concept systems in one or more languages or subject fields.
2. Term harmonization: The designation of a single concept (in different languages) by terms that reflect similar characteristics. Term harmonization is possible only when the concepts the terms represent are almost exactly the same.

C. ISO 1087-1

ISO 1087 creates common sets of terms for the subject field of terminology application development. Part 1 of the standard identifies, defines, and classifies terms used to develop terminologies for specific applications. The main sections of the document are: Language and reality; Concepts; Definitions; Designations; Terminology; Aspects of terminology work; Terminological products; and Terminological entries.

D. Other Standards

There are other standardization efforts for terminology structures and knowledge management. The most important of these are:

a. ISO 2788: This standard describes how to build a thesaurus using one language. It uses the theory and structures defined in the standards described above.
b. ISO 5964: This standard covers multilingual thesauri. The content of ISO 2788 is assumed to hold for multilingual thesauri as well as monolingual ones.
c. ISO/IEC 14481: This standard defines the constructs to be contained in a modeling facility that can be used to create a formal description of some part of an enterprise. This may include a terminology. A modeling facility is the basis for building a language and a tool to support the activity of creating a formal description. The modeling facility defines the semantics of the language as a set of constructs and how the constructs are related, but not the language syntax.

E. ISO/IEC 11179

ISO/IEC 11179 describes the metadata and activities needed to manage data elements in a registry. Data elements are the fundamental units of data that an organization collects, processes, and disseminates. Metadata registries organize information about data elements, provide access to the information, facilitate standardization, identify duplicates, and facilitate data sharing. Data dictionaries are usually associated with single data sets, but a metadata registry contains descriptions of the data elements for an entire program or organization.

An important feature of a metadata registry is that data elements are described by a data element concept and a representation or value domain. The advantages of this are as follows:

1. Sets of similar data elements are linked to a shared concept, which reduces search time;
2. Every representation associated with a concept can be shown together, which increases flexibility;

3. Similar data elements are located through similar concepts, again assisting searches and administration of a registry.

Data elements are described by object class, property, and representation.

1. The object class is a set of ideas, abstractions, or things in the real world that can be identified with explicit boundaries and meaning and whose properties and behavior follow the same rules.
2. The property is a characteristic common to all members of an object class. Properties are what humans use to distinguish or describe objects. Examples of properties are color, model, sex, age, income, address, price, etc.
3. The representation describes how the data are represented. The most important aspect of the representation part of a data element is the value domain. A value domain is a set of permissible values for a data element. For example, the data element representing annual household income may have the set of non-negative integers as a set of valid values. The valid values may also be a pre-specified list of categories with some identifier for each category, such as:

1   $0 - $10,000
2   $10,001 - $30,000
3   $30,001 and above

A data element concept (DEC), the combination of an object class and a property, is a concept that can be represented in the form of a data element, described independently of any particular representation. In the examples above, annual household income actually names a DEC, which has two possible representations associated with it. Therefore, a data element can also be seen to be composed of two parts: a data element concept and a representation.

ISO/IEC 11179 is divided into six parts. The names of the parts, a short description of each, and their status are given below:

1. Part 1 - Framework for the Specification and Standardization of Data Elements - Provides an overview of data elements and the concepts used in the rest of the standard. This document is an International Standard.
2. Part 2 - Classification of Data Elements - Describes how to classify data elements. This document is an International Standard.
3. Part 3 - Basic Attributes of Data Elements - Defines the basic set of metadata for describing a data element. This document is an International Standard. It is currently being revised.
4. Part 4 - Rules and Guidelines for the Formulation of Data Definitions - Specifies rules and guidelines for building definitions of data elements. This document is an International Standard.
5. Part 5 - Naming and Identification Principles for Data Elements - Specifies rules and guidelines for naming and designing non-intelligent identifiers for data elements. This document is an International Standard.
6. Part 6 - Registration of Data Elements - Describes the functions and rules that govern a data element registration authority. This document is an International Standard.

The revision of ISO/IEC 11179-3 (Part 3) will include a conceptual model for a metadata registry (for data elements). The metamodel, or metadata model for data elements, provides a detailed description of the types of information that belong in a metadata registry. It provides a framework for how data elements are formed and the relationships among the parts. Implementing this scheme gives users the information they need to understand the data elements of an organization. Many agencies, including the U.S. Census Bureau, are implementing these standards as part of their metadata repository design efforts.

V. IMPLEMENTATION OF ISO/IEC 11179

Many organizations are implementing metadata registries based on ISO/IEC 11179. This section contains descriptions of several of these efforts.

A. The Intelligent Transportation System Data Registry Initiative

The Department of Transportation is working on developing 50 to 60 standards for interoperability among the systems that will comprise the nation's Intelligent Transportation System. Five Standards Development Organizations are cooperating in the development of these standards, in addition to identifying a controlled vocabulary and various access paths for information. A major project within this initiative is the Intelligent Transportation System (ITS) Data Registry. It is being designed and developed by IEEE and is based upon the concepts of the ISO/IEC 11179 standard. The Transportation Department explicitly uses ISO/IEC 11179 in its standards program for the intelligent highway projects in order to encourage and enable the transportation agencies of the 50 States and Territories to exchange and work with data that has consistent semantics across the various governmental and other organizations involved.

B. The Environmental Data Registry

The Environmental Protection Agency (EPA) is developing methods to: a) share environmental data across program systems; b) reduce the burden of

reporting regulatory and compliance information; c) improve access to environmental data via the web; and d) integrate environmental data from a wide variety of sources.

To accomplish these improvements, the Environmental Information Office is creating the Environmental Data Registry (EDR). The EDR is the Agency's central source of metadata describing environmental data. In support of environmental data standards, the EDR offers well-formed data elements along with their value domains. The EDR is also a vehicle for reusing data elements from other data standards-setting organizations.

C. Australian National Health Information Knowledgebase

The Australian Knowledgebase is an electronic storage site for Australian health metadata, and includes a powerful query tool. The Knowledgebase can be used to find out what data collections are available on a particular health-related topic or term, and any related official national agreements, definitions, standards and work programs, as well as any linked organizations, institutions, groups, committees or other entities. The Knowledgebase provides direct integrated access to the major elements of health information design in Australia:

1. The National Health Information Model;
2. The National Health Data Dictionary;
3. The National Health Information Work Program;
4. The National Health Information Agreement.

D. The United States Health Information Knowledgebase

The Department of Defense - Health Affairs, in collaboration with the Health Care Financing Administration, is developing the United States Health Information Knowledgebase (USHIK) Data Registry Project. The project goal is to build, populate, demonstrate, and make available for general use a data registry to assist in cataloging and harmonizing data elements across multiple organizations. The requirements team includes representatives from the Department of Veterans Affairs, the Health Level Seven (HL7) standards committee, the Health Care Financing Administration, and the Department of Defense Health Affairs office. The implementation builds on the Environmental Protection Agency and Australian Institute of Health and Welfare implementations and uses DoD - Health Affairs' Health Information Resource Service (HIRS) to develop and implement a data registry. The project uses selected Health Insurance Portability and Accountability Act (HIPAA) data elements for demonstration.

E. The Census Bureau Corporate Metadata Repository

The Census Bureau is building a unified framework for statistical metadata. The focus of the work is to integrate ISO/IEC 11179 and survey metadata, using the metadata to enhance business applications. The goal is to put metadata to work to guide survey design, processing, analysis, and dissemination.

Current project applications include the American FactFinder - Data Access and Dissemination System. This project is a large effort to disseminate Decennial Census, Economic Censuses, and American Community Survey data via the Internet.

F. Statistics Canada Integrated Meta DataBase

Statistics Canada is building a metadata registry, called the Integrated MetaDataBase, based on the same conceptual model for statistical metadata developed at the Census Bureau. This effort is still in the design and initial implementation stages. It will integrate all the surveys the agency conducts, contain many standardized and harmonized data elements, and link statistical data to the survey process.

VI. REGISTRATION & QUALITY

The quality of data is enhanced when the proper metadata is available for that data. Metadata describing sampling errors, non-sampling errors, estimations, questionnaire design and use, and other quality measures all need to be included in a well-designed statistical metadata registry. However, this does not say anything about the quality of the metadata itself.

Registration is the process of managing metadata content and quality. It includes:

1. Making sure mandatory attributes are filled out;
2. Determining that rules for naming conventions, forming definitions, classification, etc. are followed;
3. Maintaining and managing levels of quality.

Registration levels are a way for users to see at a glance what quality of metadata was provided for objects of interest. The lowest level is much like "getting some metadata"; a middle level is "getting it all" (i.e., all that is necessary); and the highest level is "getting the metadata right".

Semantic content addresses the meaning of an item described by metadata. But for statistical surveys, much more relevant information is necessary to describe an object. A data element has a definition, but additional information that is necessary to really understand it is the value domain; the question that is the source of the data; the universe for the question; the skip pattern in the

questionnaire that brought the interviewer to the question; interviewer instructions about how to ask the question; sample design for the survey; standard error estimates for the data; etc.

Once the semantic content is really known, the work to harmonize some data across surveys and agencies can begin. Harmonization can occur at many levels, e.g., data, data set, and survey.

Metadata has quality when it serves its purpose and allows the user to find or understand the data that is described. As such, metadata quality has several dimensions:

• the full set of metadata attributes and classification schemes is as complete as possible;
• the mandatory metadata attributes describe each object uniquely;
• naming conventions are fully specified and can be checked;
• guidelines for forming definitions are fully specified and can be checked;
• rules for classifying objects with classification schemes are specified and can be checked.

VII. BENEFITS

The benefits of implementing an ISO/IEC 11179 metadata registry are:

1. Central management of metadata describing data and other objects throughout the agency;
2. Increased chances of sharing data and metadata with other agencies that are also compliant with the standard;
3. Improved understandability of data and survey processes for users;
4. Single point of reference for data harmonization;
5. Central reference for survey re-engineering and re-design.

The structure of an ISO/IEC 11179 metadata registry has many points at which terminology will aid in searching and understanding the objects described. The meaning of an object is clarified through the set of all the terms linked to it. Although a definition is important for understanding an object, it often does not convey the full context in which the definition is made. A good example is a question in a questionnaire. The question wording itself serves as its definition, but the universe for which the question is asked is usually not specified. That context is inferred from the flow and answers to previous questions. However, appropriate terms associated with a question can convey some of this necessary information without resorting to following a complicated questionnaire down to the question under consideration.

In conclusion, many organizations are implementing ISO/IEC 11179 metadata registries, including the U.S. Census Bureau, the U.S. Environmental Protection Agency, the U.S. Health Care Financing Administration, and others. The international standards committee ISO/IEC JTC1/SC32/WG2 is responsible for developing and maintaining this and related standards. Participation by national statistical offices through the appropriate national standards organization will make this effort much stronger and provide a means to interoperability across national boundaries for statistics.

ACKNOWLEDGEMENTS

The author is thankful to the referees for their valuable comments for the improvement of the manuscript. The author is also grateful to the editors and their editorial staff for their cooperation and help during the whole process of evaluation and publication.

REFERENCES

[1] Lizhen Liu et al., "Metadata Extraction Based on Mutual Information in Digital Libraries", IEEE, 2007.
[2] Gillman, D. W., Appel, M. V., and Highsmith, S. N. Jr., "Building a Statistical Metadata Repository", presented at METIS Workshop, Geneva, Switzerland, February 1998.
[3] Graves, R. B. and Gillman, D. W., "Standards for Management of Statistical Metadata: A Framework for Collaboration", ISIS-96, Bratislava, Slovakia, May 21-24, 1996.
[4] ISO 704, "Principles and Methods of Terminology", 1987, International Standard.
[5] ISO 860, "Terminology Work - Harmonization of Concepts and Terms", 1996, International Standard.
[6] ISO 1087-1, "Terminology Work - Vocabulary - Theory and Application", 1995, International Standard.
[7] Sundgren, B., "Guidelines on the Design and Implementation of Statistical Metainformation Systems", R&D Report, Statistics Sweden, 1993.
[8] Sundgren, B., Gillman, D. W., Appel, M. V., and LaPlant, W. P., "Towards a Unified Data and Metadata System at the Census Bureau", Census Annual Research Conference, Arlington, VA, March 18-21, 1996.
[9] Anca Vaduva and Klaus R. Dittrich, "Metadata Management for Data Warehousing: Between Vision and Reality", pp. 129-135, IEEE, 2001.
[10] Keith G. Jeffery, "METADATA: The Future of Information Systems".
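
As an illustration of the data element model of Section IV.E (object class, property, data element concept, value domain) and of the mandatory-attribute check listed under registration in Section VI, a minimal Python sketch follows. It is not part of ISO/IEC 11179 or of any registry described in Section V. Every class, field, and function name is hypothetical, and treating "definition" as the only mandatory attribute is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class DataElementConcept:
    """Object class plus property, independent of any representation."""
    object_class: str
    property_name: str


@dataclass
class ValueDomain:
    """A set of permissible values, expressed here as a membership test."""
    description: str
    is_valid: Callable[[object], bool]


@dataclass
class DataElement:
    """A data element concept paired with one representation."""
    concept: DataElementConcept
    value_domain: ValueDomain
    definition: Optional[str] = None


def missing_mandatory(element: DataElement) -> List[str]:
    """Names of mandatory attributes left empty (assumed: only 'definition')."""
    return [] if element.definition else ["definition"]


class Registry:
    """Groups data elements under their shared data element concept."""

    def __init__(self) -> None:
        self._by_concept: Dict[DataElementConcept, List[DataElement]] = {}

    def register(self, element: DataElement) -> None:
        self._by_concept.setdefault(element.concept, []).append(element)

    def representations(self, concept: DataElementConcept) -> List[str]:
        """Every representation registered for one concept, shown together."""
        return [e.value_domain.description
                for e in self._by_concept.get(concept, [])]


# The annual household income example with its two representations.
income = DataElementConcept("household", "annual income")
integers = ValueDomain("non-negative integers",
                       lambda v: isinstance(v, int) and v >= 0)
CATEGORIES = {1: "$0 - $10,000", 2: "$10,001 - $30,000", 3: "$30,001 and above"}
categories = ValueDomain("income categories 1 to 3", lambda v: v in CATEGORIES)

registry = Registry()
registry.register(DataElement(income, integers, "Total yearly income of a household"))
registry.register(DataElement(income, categories))  # definition left empty
```

Both data elements are linked to the single concept `income`, so `registry.representations(income)` lists the two value domains together, and `missing_mandatory` flags the second element because no definition was supplied.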
