A Generic Data Model for Statistical Indicators and Measurement Units to Enable User-Specific Representation Formats

Michaela Denk¹, Wilfried Grossmann²
¹ International Monetary Fund, Statistics Department*, e-mail: [email protected]
² University of Vienna, Institute for Scientific Computing, e-mail: [email protected]

Abstract

Based on a review of existing standards and guidelines as well as the current international practice of modeling measurement units and related concepts in representation of economic and statistical data, a generic data model for statistical indicators and measurement units is introduced that may contribute to the further development of the SDMX content oriented guidelines in terms of harmonizing cross-domain concepts such as measurement units. Examples from databases of SDMX sponsor organizations are used to demonstrate the applicability of the proposed data model to a broad range of statistical indicators and its ability to serve as a basis for the creation of user-specific data representation formats. This is achieved through customizable combinations of the structural elements of the generic data model according to user requirements and illustrates the practical relevance of the presented ideas.

Keywords: Metadata, SDMX, semantic decomposition.

1. Introduction

In data exchange the concept of indicator plays a crucial role. Proper understanding of the indicator depends on knowledge of how the indicator was calculated and what kinds of units are used for measurement. The SDMX Content Oriented Guidelines (COG) (2009) can be regarded as the most prominent current effort focusing on the harmonization of cross-domain concepts for data exchange. The guidelines recommend practices for creating interoperable data and metadata sets using the SDMX technical standards with the intent of generic applicability across subject-matter domains.

* The views expressed herein are those of the authors and should not be attributed to the IMF, its Executive Board, or its management.

In the Metadata Common Vocabulary (Annex 4 of SDMX COG), five cross-domain concepts are described for the exchange of indicators and units of measure. The (statistical) indicator itself is defined (item 331) as “A data element that represents statistical data for a specified time, place, and other characteristics, and is corrected for at least one dimension (usually size) to allow for meaningful comparisons”. Four items in SDMX COG are related to the measurement unit, viz. unit of measure (item 384), adjustment (item 5; included in the unit or the indicator in many statistical databases, as illustrated in Denk and Grossmann (2010)), base period (item 19; relevant for the interpretation of index data, series in real terms, or changes with respect to a certain period), and unit multiplier (item 382; specifies the exponent of the power of 10 by which observation values were divided, usually for presentation purposes).

Of the 14 datasets investigated by Denk, Grossmann, and Froeschl (2010), four do not separate the economic indicator from the measurement unit or do not provide the unit information at all, whereas four other datasets even split other concepts, such as unit multiplier or adjustment method, from the unit. The other six databases separate unit of measure from economic indicator. A broad variety of unit types is used, such as index, count, ratio, rate, percentage, or changes. The cases with a single, mixed dimension combine at least information on the measured (economic) indicator, type of unit, unit of measure, adjustment method, and frequency. Several examples (e.g. "Personal computers" or "Youth unemployment rate, aged 15-24, men") omit the unit information completely, assuming that it is obvious from the indicator used. On the other hand, these indicators do give information about the underlying population to which the measured concept refers.
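The unit multiplier described above can be illustrated with a minimal sketch. It assumes only the reading given in the text (presented value = original value divided by 10 raised to the unit multiplier); the function names and the sample figures are hypothetical and are not taken from SDMX or from any of the databases reviewed.

```python
from decimal import Decimal


def to_base_units(presented_value: str, unit_multiplier: int) -> Decimal:
    """Undo the presentation scaling: presented value * 10**unit_multiplier.

    Both names are illustrative assumptions, not SDMX identifiers.
    """
    return Decimal(presented_value) * (Decimal(10) ** unit_multiplier)


def to_presented(base_value: str, unit_multiplier: int) -> Decimal:
    """Apply the presentation scaling: base value / 10**unit_multiplier."""
    return Decimal(base_value) * (Decimal(10) ** -unit_multiplier)


# Hypothetical observation published "in millions" (unit multiplier 6).
in_base_units = to_base_units("12.5", 6)
back_to_presented = to_presented("12500000", 6)
```

Decimal arithmetic is used here so that scaling by powers of ten does not introduce binary floating-point rounding; this is a design choice of the sketch, not a requirement of the guidelines.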

Contrary to the first impression one may have (viz. that this problem is a rather easy one that was resolved a long time ago, e.g. by the International System of Units), the analysis shows that current international practice is very diverse and that neither the recommendations provided in SDMX COG (2009) nor more general measurement unit codification systems have yet been adopted by the statistical organizations investigated. One reason may be that the units for indicators are in some sense simple; besides monetary units, dimensionless units like percent or counts dominate in applications. What makes the usage of such units difficult is that the measurement instruments are rather complicated (consider, for example, change in GDP over years), and often the computation of the indicator provides little information about the unit used. Hence, it is not surprising that the main issue in the analyzed databases is that the measurement unit dimension in the data models used does not represent a "pure" unit of measure. Even the SDMX recommendations do not treat all of these components separately.

The first step in harmonizing the structure and content of measurement units as currently used by statistical organizations is the identification of their basic building blocks and of the relations between these building blocks. Denk and Grossmann (2010) proposed a generic model for the semantic decomposition into four components, viz. Indicator, Measurement, Adjustment, and Reference.
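As a rough illustration of what such a decomposition buys, the sketch below keeps the four components named above in separate fields and shows how a single mixed label of the kind found in the reviewed databases could be assembled from them. The class name, the reading of each field, and all sample values are hypothetical assumptions made for this example; they are not the attribute definitions of the cited model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DecomposedDescriptor:
    """Four separately stored components (field meanings are assumed)."""
    indicator: str                     # what is measured
    measurement: str                   # how it is expressed (index, rate, ...)
    adjustment: Optional[str] = None   # e.g. a seasonal adjustment, if any
    reference: Optional[str] = None    # e.g. a base period, if any

    def mixed_label(self) -> str:
        """Join the non-empty components into one unstructured label."""
        parts = [self.indicator, self.measurement, self.adjustment, self.reference]
        return ", ".join(p for p in parts if p)


# Hypothetical example values, not taken from any reviewed database.
example = DecomposedDescriptor(
    indicator="Gross domestic product",
    measurement="index",
    adjustment="seasonally adjusted",
    reference="base period 2005",
)
label = example.mixed_label()
```

Going from components to a label is trivial, whereas recovering the components from a free-text label is not, which is one way to see why the decomposed form is the more useful one to store.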

The present paper further develops the proposed model by refining it with respect to three features: (i) introduction of a "Family" concept for indicators and units of measurement that allows grouping indicators and units into families that share the phenomena they are intended to measure and, if applicable, the derivation method by which they were obtained; (ii) generalization of the concepts of unit multiplier, adjustment, and reference, which seems necessary for covering complex indicators; (iii) inclusion of additional standard dimensions required to define the meaning of any statistical figure, such as geographical and temporal reference and measurement conditions.

The paper is structured as follows. Following some introductory remarks and basic definitions, the extended generic data model is presented in section 2. Section 3 illustrates the application of the model by means of examples, primarily from economic statistics. The derivation of customized data representation formats based on the needs of data consumers is described in section 4. Finally, the paper provides some concluding remarks and an outlook on future work.

2. A Generic Metadata Model

The starting point of our development is a look at other efforts aiming at the standardization of measurement units. Two prominent examples are the International System of Units (SI) and the Unified Code for Units of Measurement (UCUM), which mainly cover measurement in the physical sciences. These systems show a number of features that are also of interest for the standardization of measurement in international statistics. The most important feature is a unit typology with the fundamental distinction between base units and units derived from base units by mathematical formulas. Another important feature is the use of a prefix notation, which corresponds to the idea of the unit multiplier in SDMX.
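The distinction between base and derived units can be made concrete with a small sketch in which a unit is represented as a mapping from base-unit symbols to integer exponents. This representation and all function names are assumptions chosen for illustration; they are not the encoding used by SI or UCUM. The prefix table lists only three standard SI prefixes with their powers of ten.

```python
from typing import Dict

# A unit is modeled as: base-unit symbol -> integer exponent (assumed encoding).
UnitDims = Dict[str, int]


def base_unit(symbol: str) -> UnitDims:
    """A base unit is its own symbol raised to the power 1."""
    return {symbol: 1}


def power(unit: UnitDims, n: int) -> UnitDims:
    """Raise a unit to an integer power, dropping zero exponents."""
    return {s: e * n for s, e in unit.items() if e * n != 0}


def multiply(a: UnitDims, b: UnitDims) -> UnitDims:
    """Derive a unit as the product of two units by adding exponents."""
    result = dict(a)
    for symbol, exponent in b.items():
        result[symbol] = result.get(symbol, 0) + exponent
    return {s: e for s, e in result.items() if e != 0}


def divide(a: UnitDims, b: UnitDims) -> UnitDims:
    """Derive a unit as the quotient of two units."""
    return multiply(a, power(b, -1))


# Standard SI prefixes and their powers of ten (kilo, mega, giga).
SI_PREFIX_EXPONENT = {"k": 3, "M": 6, "G": 9}

metre = base_unit("m")
second = base_unit("s")
speed = divide(metre, second)          # derived unit: m * s**-1
dimensionless = divide(metre, metre)   # like-over-like cancels to {}
```

The empty mapping produced by dividing a unit by itself corresponds to a dimensionless quantity, which is the situation of ratios and percentages in statistical data.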

Besides these two major types of units, UCUM also considers, in section 3.2 (§24 - §26), so-called arbitrary or procedure defined units, defined as “units whose meaning entirely depends on the measurement procedure that are not related to any UCUM or SI unit but completely depend on the measurement procedure”, and, in section 4.2 (§34 - §42), customary units, which correspond roughly to the local and traditional usage of alternative measurement systems for quantities that can be measured by UCUM or SI (base) units. Customary units are grouped into common families defined according to the corresponding standard unit (for example, units of length like inch, foot, or yard).

Looking at the statistical applications in the examples, we can conclude that, in the sense of UCUM, there are two important types of application: arbitrary units, which in a strict sense use a dimensionless unit like percent or count with proper prefixes or multipliers (millions, thousands, etc.), and monetary units, which can be interpreted as customary units in a local setting of a fictitious or virtual universal currency unit. Moreover, for the dimensionless unit percent many different customary units are in use, for example ratio or rate. The indicator itself roughly corresponds to a combination of the UCUM name of the unit and the type of quantity measured, and requires additional specification of the measurement procedure in the sense of UCUM.

The following metadata model is an attempt to put these ideas into a more formal framework, as outlined in Figure 1. It is based on the semantic decomposition of a statistical observation into four basic components: Indicator, Measurement, Unit Family, and Unit. Each component has a label attribute, a simple textual descriptor that may contain, in an unstructured way, a combination of information that is also included in other attributes.

Figure 1. Semantic Model for Indicators and Corresponding Measurement

2.1. Indicator

Strictly speaking, the Indicator itself is not part of the measurement but, as mentioned above, a description of the quantity measured. In that sense it seems important to include the indicator in the model in order to avoid confusion between indicator and measurement unit, as occurs in several of the analyzed examples. The Indicator class consists of the following
