A Generic Data Model for Statistical Indicators and Measurement Units to Enable User-Specific Representation Formats

Michaela Denk (1), Wilfried Grossmann (2)
(1) International Monetary Fund, Statistics Department*, e-mail: [email protected]
(2) University of Vienna, Institute for Scientific Computing, e-mail: [email protected]

Abstract

Based on a review of existing standards and guidelines as well as the current international practice of modeling measurement units and related concepts in the representation of economic and statistical data, a generic data model for statistical indicators and measurement units is introduced. The model may contribute to the further development of the SDMX content oriented guidelines in terms of harmonizing cross-domain concepts such as measurement units. Examples from databases of SDMX sponsor organizations are used to demonstrate the applicability of the proposed data model to a broad range of statistical indicators and its ability to serve as a basis for the creation of user-specific data representation formats. Such formats are obtained through customizable combinations of the structural elements of the generic data model according to user requirements, which illustrates the practical relevance of the presented ideas.

Keywords: Metadata, SDMX, semantic decomposition.

1. Introduction

In data exchange the concept of indicator plays a crucial role. Proper understanding of the indicator depends on knowledge of how the indicator was calculated and what kinds of units are used for measurement. The SDMX Content Oriented Guidelines (COG) (2009) can be regarded as the most prominent current effort focusing on the harmonization of cross-domain concepts for data exchange. The guidelines recommend practices for creating interoperable data and metadata sets using the SDMX technical standards with the intent of generic applicability across subject-matter domains. In the Metadata Common Vocabulary (Annex 4 of SDMX COG) five cross-domain concepts are described for the exchange of indicators and units of measure. The (statistical) indicator itself is defined as item 331 by “A data element that represents statistical data for a specified time, place, and other characteristics, and is corrected for at least one dimension (usually size) to allow for meaningful comparisons”. Four items in SDMX COG are related to measurement unit, viz. unit of measure (item 384), adjustment (5; included in unit or indicator in many statistical databases as illustrated in Denk and Grossmann (2010)), base period (19; relevant for the interpretation of index data, series at real terms, or changes with respect to a certain period), and unit multiplier (382; specifies the exponent of the power of 10 by which observation values were divided, usually for presentation purposes).

* The views expressed herein are those of the authors and should not be attributed to the IMF, its Executive Board, or its management.

Of the 14 datasets investigated by Denk, Grossmann and Froeschl (2010), four do not separate the economic indicator from the measurement unit or do not provide the unit information at all, whereas four other datasets even split other concepts such as unit multiplier or adjustment method from the unit. The other six datasets separate unit of measure from economic indicator. A broad variety of unit types is used, such as index, count, ratio, rate, percentage, or changes. The cases with a single, mixed dimension at least combine information on measured (economic) indicator, type of unit, unit of measure, adjustment method, and frequency. Several examples (e.g. "Personal computers" or "Youth unemployment rate, aged 15-24, men") omit the unit information completely, assuming that it is obvious from the indicator used. On the other hand, these indicators do give information about the underlying population to which the measured concept refers.

Contrary to the first impression one may have (viz. that this problem is a rather easy one that was resolved a long time ago, e.g. by the International System of Units), the analysis showed that the current international practice is very diverse and that neither the recommendations provided in SDMX COG (2009) nor more general measurement unit codification systems have been adopted by the statistical organizations investigated. One reason may be that the units for indicators are in some sense simple: besides monetary units, dimensionless units like percent or counts dominate in applications. What makes the usage of such units difficult is the fact that the measurement instruments are rather complicated (consider for example change in GDP over years), and in many cases the computation of the indicator provides little information about the unit used. Hence, it is not surprising that the main issue in the analyzed databases is that the measurement unit dimension in the data models used does not represent a "pure" unit of measure. Even the SDMX recommendations do not treat all of these components separately.

The first step in harmonizing the structure and content of measurement units as currently used by statistical organizations is the identification of their basic building blocks and of the relations between these building blocks. Denk and Grossmann (2010) proposed a generic model for the semantic decomposition into four components, viz. Indicator, Measurement, Adjustment, and Reference. The present paper further develops the proposed model by refining it with respect to three features: (i) introduction of a "Family" concept for indicator and unit that allows grouping of indicators and units into families that share the phenomena they are destined to measure and, if applicable, the derivation method they were obtained from; (ii) generalization of the concepts of unit multiplier, adjustment, and reference, which seems necessary for covering complex indicators; (iii) inclusion of additional standard dimensions required to define the meaning of any statistical figure, such as geographical and temporal reference and measurement conditions.

The paper is structured as follows. Following some introductory remarks and basic definitions, the extended generic data model is presented in section 2. Section 3 illustrates the application of the model by means of examples, primarily from economic statistics. The derivation of customized data representation formats based on the needs of data consumers is described in section 4. Finally, the paper provides some concluding remarks and an outlook on future work.

2. A Generic Metadata Model

The starting point of our development is a look at other institutions aiming for standardization of measurement units. Two prominent examples are the International System of Units (SI) and the Unified Code for Units of Measurement (UCUM), which mainly cover measurement in the physical sciences. These systems show a number of features that are of interest for the standardization of measurement in international statistics as well. The most important feature is a unit typology with the fundamental distinction between base units and units derived from base units by mathematical formulas. Another important feature is the idea of using a prefix notation, corresponding to the idea of unit multiplier in SDMX. Besides these two major types of units, UCUM also considers in section 3.2 (§24 - §26) so-called arbitrary or procedure defined units, defined as “units whose meaning entirely depends on the measurement procedure that are not related to any UCUM or SI unit but completely depend on the measurement procedure”, and in section 4.2 (§34 - §42) customary units that correspond roughly to the idea of local and traditional usage of alternative measurement systems for quantities that can be measured by UCUM or SI (base) units. Customary units are grouped into some common families defined according to the corresponding standard unit (for example units for length like inch, foot or yard).

Looking at the statistical applications in the examples, we can conclude that in the sense of UCUM there are two important types of units: arbitrary units, which in a strict sense use a dimensionless unit like percent or count with proper prefixes or multipliers (millions, thousands etc.), and monetary units, which can be interpreted as customary units in a local setting of a fictitious or virtual universal currency unit. Moreover, for the dimensionless unit percent many different customary units are in use, for example ratio or rate. The indicator itself roughly corresponds to a combination of the UCUM name of the unit and the type of quantity measured and requires additional specification of the measurement procedure in the sense of UCUM.

The following metadata model is an attempt to put these ideas into a more formal framework as outlined in Figure 1. It is based on the semantic decomposition of a statistical observation into four basic components: Indicator, Measurement, Unit Family, and Unit. Each component has a label attribute that is a simple textual descriptor that may contain a combination of information that is included in other attributes in an unstructured way.

Figure 1. Semantic Model for Indicators and Corresponding Measurement

2.1. Indicator

Strictly speaking, the Indicator itself is not part of the measurement but, as mentioned above, a description of the quantity measured. It nevertheless seems important to include the indicator in the model in order to avoid confusion between indicator and measurement unit, as is the case in several of the analyzed examples.

The Indicator class consists of the following attributes and references to other model components: label, concept, population unit, population restrictions, reference area, reference period, cross-classification, type, family, derivation formula, link to Measurement, and links to two or more Indicators a derived Indicator was derived from.

The concept attribute specifies the quantity to be measured by the indicator. It often corresponds to the name of the attribute for which data are provided. The label typically refers to a table header. Base and derived concepts can be discerned. A base concept can be directly obtained from a survey or other form of data collection in some population. Typical examples are prices of goods, income of a household or a person, or assets and liabilities of a bank. Derived concepts are obtained from measuring different base concepts and/or from different populations, as for example indices or growth rates.

A specification of the statistical reference population is of utmost importance for a proper understanding of an indicator. According to the MCV a (statistical) population is defined as “The total membership or population or "universe" of a defined class of People”. Considering economic statistics, the statistical population may be a collection of population units other than people, hence we define it in terms of a population unit (e.g. person, commodity, institutional unit) and a subject-matter definition provided as restrictions on some characteristics of the population unit (e.g. age=20-30, sector=banks). In any case, a specification of the temporal and spatial validity (reference period and reference area) is required as well. Additional classification criteria, for example economic sectors in economic statistics, or age groups of persons in case of demographic indicators, are captured in the cross-classification attribute. If such additional specifications are given the indicator may be a vector or an array of numbers instead of a single number. Note that such an understanding of indicators as vectors also allows the formation of indicators for different countries or of time series of indicators.

The type of the indicator reflects the different types of concepts introduced above, and we distinguish between base indicators and derived indicators. Base indicators correspond to quantities which are available for direct measurement in one population. In the context of statistical applications typical examples are population counts or other distributional characteristics of a surveyed quantity (for example poverty measurement using the Gini index), prices of units of goods (for example oil prices per barrel), or currency amounts. As indicated in Figure 1, derived indicators are obtained from at least two base or derived indicators by using a mathematical formula. Such derivations may use different concepts in the indicators as well as different statistical populations for the indicators involved. For example, a growth rate uses the same indicator concept for both indicators, but different statistical populations in terms of the reference period, whereas GDP per Capita uses two different indicator concepts for different statistical population units, viz. GDP (population unit = institutional unit according to SNA) and domestic population (population unit = person).
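The distinction between base and derived indicators can be sketched in code. The following Python fragment is only an illustration under stated assumptions: the class layout, the field names, and the sample values (reference area "AT", period "2009") are hypothetical and are not part of the model definition; only the attribute list itself follows the description of the Indicator class above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Indicator:
    """Illustrative container for a subset of the Indicator attributes."""
    label: str
    concept: str
    population_unit: Optional[str] = None
    population_restrictions: Dict[str, str] = field(default_factory=dict)
    reference_area: Optional[str] = None
    reference_period: Optional[str] = None
    family: Optional[str] = None
    derivation_formula: Optional[str] = None            # only for derived indicators
    derived_from: List["Indicator"] = field(default_factory=list)

    def __post_init__(self) -> None:
        # A derived indicator links to two or more source indicators.
        if len(self.derived_from) == 1:
            raise ValueError("a derived indicator needs at least two source indicators")

    @property
    def type(self) -> str:
        # Indicators with source indicators are "derived", all others "base".
        return "derived" if self.derived_from else "base"


# Hypothetical sample values, chosen for illustration only.
gdp = Indicator("GDP", "gross domestic product", "institutional unit",
                reference_area="AT", reference_period="2009",
                family="amount of money")
population = Indicator("Population", "domestic population", "person",
                       reference_area="AT", reference_period="2009",
                       family="population count")

# Different concepts and different population units: GDP per capita.
gdp_per_capita = Indicator("GDP per capita", "GDP per capita",
                           reference_area="AT", reference_period="2009",
                           family="ratio",
                           derivation_formula="GDP / Population",
                           derived_from=[gdp, population])
```

Here `gdp.type` evaluates to "base" and `gdp_per_capita.type` to "derived". A growth rate would be built the same way from two indicators that share the concept but differ in `reference_period`.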

The family attribute groups different indicators according to the method by which the indicator is derived from other indicators. In the case of base indicators there are some natural types like population counts or other distributional characteristics, unit prices, or amounts of money. For derived indicators there are types like differences, ratios, or indices, which specify how the indicator is obtained from the base indicators.

2.2. Measurement

As usual in models for measurement units (cf. SI-system and UCUM), the actual measurement unit used for an indicator is specified by selecting a specific unit from a specific measurement unit family. This is done via Unit and Scale (multiplier) references as central components of Measurement. The scale (multiplier) is defined as a special dimensionless unit prefix as discussed in the subsection on unit families and units below. As previously discussed many units used in economic statistics fall into the categories of arbitrary or customary units in the sense of UCUM. Hence, a description of the measurement method and the conditions under which the measurement is done are needed in addition to unit and scale. Measurement conditions can be described by indicating the institution responsible for presenting the data. This is of utmost importance in case of international statistics where the standard documentation proposed by SDMX uses only the conditions of the last step but does not take into account the environment of the original measurement.

Three methodological aspects of high relevance in official statistics are currently included in the model, but there is potential for extensions and generalizations. Statistical aggregation measures are essential in case of high frequency data for which specific summary measures, such as period averages or highest and lowest values in a period (e.g. maximum daily price of a stock index) are exchanged. The same measures are used for temporal aggregation of lower frequency data, e.g. to aggregate monthly data to quarters. Besides these traditional location measures also other characteristics of a distribution can be used, for example in case of poverty indicators, the Gini coefficient or the percentage of households having an income below 60 percent of the median household income are important measures. In case of indicators available as time series, measures taking into account the time series feature such as moving averages may be used. Statistical measures referring to a period, such as period average, actually relate to the frequency or periodicity of the data as specified in the cross-classification attribute of the Indicator.

Units of particular Unit Families such as index, change over reference period, or balance indicator require reference information that is specified as part of Measurement. This reference information may be a reference period, value, or statistical measure used to define a reference value. The value domain of unit reference period includes time stamps in different granularity and predefined values such as previous period, corresponding period of previous year, or years since time stamp. Reference value specifies the value of the indicator in the reference period (often 100 or 1000 for an index) or the “norm” value for a balance indicator (often 0 for differences or 1 for ratios). The reference value may also be defined as aggregate with respect to a reference period, for example as moving average of months since reference period. In this case, reference value statistical measure needs to be specified as well. The value domain is the same as for statistical measure.
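How reference period, reference value, and reference value statistical measure interact for an index can be sketched as follows. This is a minimal illustration, assuming a series stored as a mapping from period labels to values; the function name `to_index`, its parameters, and the sample figures are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Sequence


def to_index(series: Dict[str, float],
             reference_periods: Sequence[str],
             reference_value: float = 100.0,
             measure: Callable[[Sequence[float]], float] = mean) -> Dict[str, float]:
    """Express a series as an index relative to one or more reference periods.

    With a single reference period the base is simply the value in that
    period. With several periods the base is an aggregate of their values,
    computed by the reference value statistical measure (here: the mean).
    """
    base = measure([series[p] for p in reference_periods])
    return {period: reference_value * value / base
            for period, value in series.items()}


# Made-up figures, for illustration only.
prices = {"2005": 80.0, "2006": 84.0, "2007": 92.0}

index_2005 = to_index(prices, ["2005"])             # 2005 = 100
index_avg = to_index(prices, ["2005", "2006"])      # average of 2005-2006 = 100
```

With the single reference period 2005 the index values are 100.0, 105.0, and 115.0. In the second call the reference value applies to the mean of the two reference periods rather than to a single observation.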

The adjustment attributes capture monetary adjustments such as price or exchange rate adjustments (e.g. constant) that also require the specification of a reference period as well as more general econometric adjustments such as seasonal or working day adjustments making use of methods like X-12. Other types of adjustment are conceivable; the list can be extended according to specific needs.

2.3. Unit and Unit Family

The Unit Family is mainly used for grouping a set of measurement units that measure the same kind of quantity and thus have the same meaning from a practical point of view. Examples for unit families from physical measurement are length, time, mass or temperature. In the present model a Unit Family is categorized with respect to two different typologies following the ideas of UCUM. On the one hand, base and derived type unit families (and thus, units) are distinguished. On the other hand, UCUM type discerns standard, customary, arbitrary, prefix, and dimensionless unit families and units. Derived units are calculated from base or derived units and in many cases the derivation of a unit is closely related to the derivation of an indicator, for example velocity as ratio of length of a covered distance (e.g. in meter) and time taken to cover the distance (e.g. in seconds). However, a derived indicator may be measured in a base measurement unit. Consider for example the difference of an indicator in two groups. The measurement unit of the derived indicator (the difference) is the same as the unit of the two initial indicators in most cases (for differences of ratios the unit of the derived indicator may be different).

With respect to UCUM type, UCUM and SI base and derived units as well as units derived thereof are considered standard. For some quantities not only a standard unit family, but also additional customary unit families exist. For instance, length can be measured by the metric length measurement system with units like meter and kilometer or by the US & British system with units like foot or mile. Prefix unit families can be used in combination with any other unit family that is not dimensionless to represent the unit multiplier. UCUM provides two prefix families. One of them represents positive and negative powers of 10, viz. the conventional metric prefixes such as kilo- (10^3) or centi- (10^-2). The second prefix unit family is the family typically used in information technology; instead of powers of 10 it is based on powers of 2. UCUM defines an additional unit family measuring dimensionless quantities that closely resembles the power-of-10 prefix family. It just uses different labels for the powers of ten (e.g. percent for 10^-2 or parts per million for 10^-6) and is used on its own.
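The relation between the two prefix families and the dimensionless family can be made concrete with a small lookup sketch. The dictionary layout and the names `METRIC_PREFIXES`, `BINARY_PREFIXES`, `DIMENSIONLESS_UNITS`, `factor`, and `unscale` are assumptions made for illustration; only a few members of each family are listed.

```python
from fractions import Fraction
from typing import Dict, Tuple

Power = Tuple[int, int]  # (base, exponent)

# A few members of each family; the selection is illustrative, not complete.
METRIC_PREFIXES: Dict[str, Power] = {"kilo": (10, 3), "mega": (10, 6), "centi": (10, -2)}
BINARY_PREFIXES: Dict[str, Power] = {"kibi": (2, 10), "mebi": (2, 20)}
DIMENSIONLESS_UNITS: Dict[str, Power] = {"percent": (10, -2),
                                         "parts per million": (10, -6)}


def factor(power: Power) -> Fraction:
    """Return the exact multiplier represented by a (base, exponent) pair."""
    base, exponent = power
    return Fraction(base) ** exponent


def unscale(value: Fraction, prefix: str) -> Fraction:
    """Convert a value stated with a metric prefix back to unprefixed units."""
    return value * factor(METRIC_PREFIXES[prefix])


# "centi" and "percent" carry the same factor under different labels.
same_factor = factor(METRIC_PREFIXES["centi"]) == factor(DIMENSIONLESS_UNITS["percent"])
```

Exact fractions are used instead of floats so that negative powers of 10 compare equal without rounding effects.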

In addition to label, type, and UCUM type, a Unit Family has a reference to its standard unit (e.g. meter for metric length or second for time) and, in case of a derived family, a unit derivation rule that contains links to unit families from which it is derived. The derivation rule specifies how units of the unit family are derived from units of other families, for example a unit of family “metric area” is derived by squaring a unit of family “metric length”.

The Unit itself consists of its label, a reference to the Unit Family it belongs to, a conversion formula that specifies the relation of the unit to the standard unit of its family, a conversion reference period that is relevant for time-dependent conversion (e.g. for currency units), a conversion method to apply the conversion formula to actual data, and, in case of derived units, a derivation formula that follows the derivation rule specified in the corresponding Unit Family. In many cases, the conversion formula is just a linear transformation, most often even without the shift parameter, that shows how a unit can be converted to the standard unit of the family. The conversion method combines the appropriate conversion formula of the unit and the corresponding inverse conversion formula of the target unit (method parameter) to convert the observations to the target unit. The derivation formula can be regarded as an instance of the more general unit derivation rule specified in the unit family. It contains references to specific units, for example meter in the above “metric area” example.

3. Applying the Model

In addition to the measurable quantities covered by UCUM (and SI) measurement units and unit families, population counts, amounts of money, unit prices, and various derived quantities such as ratios (e.g. index, share of total) or differences (e.g. balance indicators) play a central role in official statistics. They are accounted for in the presented model as indicator families. The units and unit families used to measure these indicators are included in the model as customary unit families. Population count (type base) is basically a dimensionless measurement unit family. The units are more precisely defined by referring to the population unit that the counting operation applies to, e.g. persons or specific commodities. Currency (base) encompasses currencies such as Euro or US Dollar with one of the currencies arbitrarily defined as standard unit. Unit prices are modeled as base indicator family, with subfamilies such as currency exchange rates. However, they use units from derived unit families such as currency per metric volume, currency per time, or currency exchange rates (=currency per currency). Ratios (e.g. percent) and differences of ratios (e.g. percentage points) are also customary derived unit families. They actually measure dimensionless indicators.
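For the currency family, the interplay of standard unit, linear conversion formula, conversion reference period, and conversion method can be sketched as below. The sketch assumes that each unit stores a linear mapping to the standard unit of its family; the class and function names are hypothetical, Euro is arbitrarily picked as standard unit, and the exchange rates are made-up numbers, not real data.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Unit:
    label: str
    family: str
    slope: float        # conversion formula: standard = slope * value + shift
    shift: float = 0.0  # most often zero

    def to_standard(self, value: float) -> float:
        return self.slope * value + self.shift

    def from_standard(self, value: float) -> float:
        return (value - self.shift) / self.slope


def convert(value: float, source: Unit, target: Unit) -> float:
    """Conversion method: source formula, then inverse formula of the target."""
    if source.family != target.family:
        raise ValueError("units belong to different unit families")
    return target.from_standard(source.to_standard(value))


def currency_unit(label: str, rates: Dict[str, float], period: str) -> Unit:
    """Build a currency unit whose conversion depends on the reference period."""
    return Unit(label, "currency", slope=rates[period])


EURO = Unit("Euro", "currency", slope=1.0)   # arbitrarily chosen standard unit
USD_RATES = {"P1": 0.75, "P2": 0.5}          # made-up Euro-per-Dollar rates

usd_p1 = currency_unit("US Dollar", USD_RATES, "P1")
usd_p2 = currency_unit("US Dollar", USD_RATES, "P2")

in_euro_p1 = convert(100.0, usd_p1, EURO)    # 75.0 under the made-up P1 rate
in_euro_p2 = convert(100.0, usd_p2, EURO)    # 50.0 under the made-up P2 rate
```

The same amount thus maps to different standard-unit values depending on the conversion reference period, which is why that attribute is kept on the unit.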

Based on the above ideas the unit families displayed in Figure 2 can be regarded as a basis for enhanced code lists for these cross-domain concepts.

Figure 2. Unit Families

For base indicator families, the correspondence to unit families is one-to-one apart from customary unit families. In many cases, derived unit families and their derivation rule resemble the construction of a derived indicator. For example, derived indicator families such as indices, general ratios or more specific ratios like shares of total or ratio balance indicators (a ratio of indicators with the same concept but different subpopulations defined by different population restrictions) correspond to the dimensionless measurement unit families index and ratio (percent). Relative changes over a reference period (or change rates) correspond to derived indicators obtained as ratios of two indicators differing only with respect to their temporal specification and use the ratio (percent) unit family. In the case of a ratio of different concepts for the same population, the unit family and unit can be obtained in a similar way as for physical units. Consider for example GDP per capita calculated as a ratio of GDP measured in some currency (e.g. Euro) and population measured in number of persons. The unit family is then derived from the currency and population count families, and the unit is Euro per person. However, a derived indicator (family) does not necessarily imply a derived unit (family). For example, indicators calculated as differences, such as absolute changes over a reference period, are measured in the same unit as the initial indicators (unless the initial indicators are ratios). These examples show that a calculus of units and indicators requires careful examination of the formulas, similar to the case of physical units. A more detailed analysis of various important economic indicators and the corresponding derivation concept may be found in Denk, Grossmann and Froeschl (2010).
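The observation that ratios change the unit while differences keep it can be illustrated with a very small unit calculus. The sketch below is an assumption-laden toy, not the unit (family) calculus envisaged by the model: units are represented as mappings from a unit label to an integer exponent, and the function names are hypothetical.

```python
from typing import Dict

UnitExpr = Dict[str, int]  # unit label -> exponent; {} means dimensionless


def ratio_unit(numerator: UnitExpr, denominator: UnitExpr) -> UnitExpr:
    """Unit of a ratio: subtract exponents and drop units that cancel."""
    result = dict(numerator)
    for label, exponent in denominator.items():
        result[label] = result.get(label, 0) - exponent
    return {label: e for label, e in result.items() if e != 0}


def difference_unit(left: UnitExpr, right: UnitExpr) -> UnitExpr:
    """Unit of a difference: both operands must share the same unit."""
    if left != right:
        raise ValueError("cannot subtract indicators measured in different units")
    return dict(left)


EURO: UnitExpr = {"Euro": 1}
PERSONS: UnitExpr = {"person": 1}

gdp_per_capita_unit = ratio_unit(EURO, PERSONS)   # Euro per person
relative_change_unit = ratio_unit(EURO, EURO)     # {} : dimensionless ratio
absolute_change_unit = difference_unit(EURO, EURO)  # still Euro
```

The toy does not capture the exception noted above: a difference of two ratios would again come out as dimensionless here, whereas in practice it is reported in a different unit such as percentage points.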

4. Customized Data Representation Formats

The generic data model presented in the previous sections is regarded as a foundation of a common language of data exchange that allows the creation of (ideally: any; realistically: the most relevant) data representation formats required by data consumers. Data that were decomposed into the structural elements of the generic data model can easily be viewed in terms of data representation formats customizable to the needs of data consumers. The generic data model needs to have the finest possible granularity to enable as many data representation formats as possible. For each data representation format to be supported a mapping rule has to be defined that aligns the model components, in particular attributes and references, to the components of the target representation format. In most cases, the target representation format simply consists of a table title and row and column headers. For time series, row headers often correspond to the reference area or the indicator and column headers to the reference period. The table header usually contains the indicator concept, population unit and restrictions, and measurement information. More sophisticated, that is more structured, representation outputs may have separate elements for unit and other measurement information such as adjustments.

The following examples may illustrate the idea of customizing data representation formats by means of mapping rules. The rules may simply concatenate values of model attributes, replace attribute values by a certain text (e.g. replace Price Adjustment = constant by constant prices) or omit them on a certain condition (e.g. omit Scale if Scale = units), and/or combine them with additional text.

- Table header = Indicator-Concept, Indicator-Population Restrictions, Measurement-Scale (omit if units), Measurement-Unit (omit if 1), Measurement-any Adjustment
    - Unemployment, female, 15-24, thousands, number of persons, seasonal adjustment
    - GDP, millions, US Dollar, constant prices
    - Commodity price, crude oil, US Dollar per Barrel

- Indicator = Indicator-Derivation Formula
  Unit = Measurement-Unit (omit if 1), Measurement-Scale (omit if units)
    - Indicator: Expenditure, government, military / GDP
      Unit: Percent

- Table header = Indicator-Concept, Indicator-Population Restrictions
  Unit = Measurement-Unit (omit if 1), Indicator-Family (if subfamily of ratio), Measurement-Unit Ref. Period, Measurement-Scale (omit if units)
  Column headers = Indicator-Cross-classification, Indicator-Reference Period
    - Table header: GDP
      Unit: Percent, relative change over reference period, same period previous year
      Column headers: Quarters, 2000Q1-2010Q4
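The first mapping rule (concatenate, omit on a condition, replace by text) can be sketched as a small function. The record layout, the field names, and the lookup tables `OMIT_IF` and `REPLACEMENTS` are hypothetical; only the rule itself and the three sample headers are taken from the examples above.

```python
from typing import Dict, List

# Omit an attribute when it carries this value (e.g. omit Scale if Scale = units).
OMIT_IF: Dict[str, str] = {"scale": "units", "unit": "1"}

# Replace an attribute value by a fixed text.
REPLACEMENTS = {("price_adjustment", "constant"): "constant prices"}

# Attribute order of the first table header rule.
HEADER_RULE: List[str] = ["concept", "population_restrictions", "scale", "unit",
                          "seasonal_adjustment", "price_adjustment"]


def table_header(record: Dict[str, str], rule: List[str] = HEADER_RULE) -> str:
    """Concatenate attribute values, applying the omit and replace conditions."""
    parts = []
    for name in rule:
        value = record.get(name)
        if value is None or OMIT_IF.get(name) == value:
            continue  # attribute missing or omitted by the rule
        parts.append(REPLACEMENTS.get((name, value), value))
    return ", ".join(parts)


unemployment = {"concept": "Unemployment", "population_restrictions": "female, 15-24",
                "scale": "thousands", "unit": "number of persons",
                "seasonal_adjustment": "seasonal adjustment"}
gdp = {"concept": "GDP", "scale": "millions", "unit": "US Dollar",
       "price_adjustment": "constant"}
oil = {"concept": "Commodity price", "population_restrictions": "crude oil",
       "scale": "units", "unit": "US Dollar per Barrel"}

headers = [table_header(r) for r in (unemployment, gdp, oil)]
```

Keeping the rule as plain data (an ordered attribute list plus two lookup tables) means that a further representation format only needs a different rule definition, not different code.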

5. Conclusion

The motivation of this paper was the decomposition of the concepts statistical indicator and measurement unit into their basic building blocks to enable the further development of a generic metadata model for these concepts. The metadata model introduced can be regarded as a foundation for the development of (enhanced) standardized value domains and code lists for the identified structural elements in the context of SDMX. As demonstrated, it may also serve as a basis for the creation of user-specific, customizable data representation formats. Future work will focus on the operational aspect and hence on the representation of derived indicators, measurement, and unit families as well as required constraints and rules for the propagation of indicator and measurement information. The definition of code lists including "shortcut" descriptors for mixed concepts (as also used in the mapping rules that specify data representation formats) is also of high priority. The long-term objective of this research is the development of a unit (family) calculus and its inclusion in the metadata model to enable metadata driven processing based on ideas developed by Froeschl (1997).

References

Denk M., Grossmann W., Froeschl K. A. (2010) Towards a best practice of modeling unit of measure and related statistical metadata, in: European Conference on Quality in Official Statistics 2010, http://q2010.stat.fi/papers/, Statistics Finland.

Denk M., Grossmann W. (2010) Semantic Decomposition of Indicators and Corresponding Measurement Units, in: KSEM 2010, LNAI 6291, Bi & Williams (Eds.), Springer, 603-608.

Froeschl K. A. (1997) Metadata Management in Statistical Information Processing, Springer, Wien-New York.

The International System of Units, http://www.bipm.org/en/si/

The Unified Code for Units of Measure, http://unitsofmeasure.org/

SDMX Content Oriented Guidelines 2009, 16pp + 5 annexes, http://www.sdmx.org/