(I) Functions of Metadata in Statistical Production
The SDMX Istat Framework

Prepared by Mauro Bianchi, Dario Camol and Laura Vignola, ISTAT, Italy

I. INTRODUCTION

Over the last two years ISTAT has been involved in several Eurostat SDMX projects: the Eurostat SODI project, the "Demography Pilot Project" and, more recently, the "Census Hub Pilot Project". The experience acquired during those projects led the development team to build the SDMX Framework. The idea was to create a new environment to manage the entire data flow that leads to the reporting of data and reference metadata in SDMX format. The SDMX Framework, released as an open source package, is composed of building blocks that can also be used separately, so that other statistical agencies can re-use them and integrate them into their own environments.

II. ARCHITECTURE

A. The data framework

The framework is composed of modules and classes that allow the user to create the environment, load data, create SDMX data files in Compact and Cross Sectional format together with the corresponding RSS feed (a pull method that needs no web service), and make data available through a web service.

1.2 Manager

The Manager is a client-server application that creates the structure of the database. This structure is "SDMX compliant": the tables created inside the database contain, as objects, the elements belonging to the SDMX information model (concepts, descriptors, code lists, key families). These tables are populated from the structure files containing the Data Structure Definitions (DSDs); all the information contained in these files is stored in specific tables created at the first loading and updated in every subsequent loading if necessary. In this way the DSD, with all its concepts, code lists, attributes and key families, is stored in the database. Loading the DSD is important not only for loading data but also for generating the SDMX data file. After the structure has been stored, the user can create the metadata table and one or more dataflow tables to hold the data.
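The idea of storing the structure file's content in relational tables can be illustrated with a minimal sketch. This is not the Manager's actual code: the XML snippet, table layout and column names are simplified assumptions, and only code lists are handled.

```python
# Minimal sketch (not the actual Manager code) of parsing an SDMX
# structure file and storing its code lists in a relational table.
# The XML snippet and the table layout are simplified assumptions.
import sqlite3
import xml.etree.ElementTree as ET

STRUCTURE_XML = """
<Structure>
  <CodeList id="CL_FREQ">
    <Code value="A"><Description>Annual</Description></Code>
    <Code value="M"><Description>Monthly</Description></Code>
  </CodeList>
</Structure>
"""

def load_structure(conn, xml_text):
    # Create the code-list table at the first loading; later loadings
    # update existing entries (INSERT OR REPLACE).
    conn.execute("""CREATE TABLE IF NOT EXISTS code_lists
                    (codelist_id TEXT, code TEXT, description TEXT,
                     PRIMARY KEY (codelist_id, code))""")
    root = ET.fromstring(xml_text)
    for cl in root.iter("CodeList"):
        for code in cl.iter("Code"):
            conn.execute(
                "INSERT OR REPLACE INTO code_lists VALUES (?, ?, ?)",
                (cl.get("id"), code.get("value"),
                 code.findtext("Description")))
    conn.commit()

conn = sqlite3.connect(":memory:")
load_structure(conn, STRUCTURE_XML)
print(conn.execute("SELECT COUNT(*) FROM code_lists").fetchone()[0])  # 2
```

Because the primary key covers (codelist_id, code), re-importing the same structure file simply refreshes the descriptions, matching the "updated in every subsequent loading" behaviour described above.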
The loading of a DSD and the creation of metadata and dataflow tables must be done every time a new data structure or a new dataflow has to be loaded. These tables are specific to each DSD or dataflow loaded. At the creation stage of the metadata table the user can decide whether to insert a group of optional fields (title, obs_prebreak, etc.). This decision depends only on the attributes the user wants to show in the SDMX data file and on the presence of these attributes in the data file to be loaded. The table called Metadati stores all the information concerning the levels hierarchically above the observation level; the dataflow tables, instead, store the information found at the observation level.

The Manager also allows the users of the system to be managed. There are three types of user with different levels of authorization, which ensures a more controlled use of the application.

1.3 Loader

At this point the data file can be loaded. Currently three formats can be loaded into the system: text format with field separators, Gesmes format for STS, and fixed-length records. A WYSIWYG wizard guides the user through the import process. After selecting the dataset and the related DSD, the user can choose the file to import. In this phase a mapping is needed between the fields present in the metadata and dataflow tables and the fields present in the text, Gesmes or fixed-length record data file. Through the visual aid of a grid and a selection box, every field of the file must be linked to the corresponding one in the dataflow table. This is the last step before the creation of the SDMX file and the actual storage of the data. If the file contains a series of values already present in the database, the data will be overwritten.
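The mapping step chosen in the wizard can be sketched as a simple column-to-field dictionary applied to each record of a separator-delimited file. The column names, field names and sample data below are invented for illustration.

```python
# Illustrative sketch of the Loader's mapping step: each column of a
# separator-delimited input file is linked to a field of the dataflow
# table. Column names, field names and data are invented examples.
import csv
import io

# Mapping chosen by the user in the wizard: file column -> table field.
MAPPING = {"freq": "FREQ", "area": "REF_AREA", "value": "OBS_VALUE"}

sample = "freq;area;value\nA;IT;59.4\nM;FR;64.1\n"

def map_rows(text):
    rows = []
    for record in csv.DictReader(io.StringIO(text), delimiter=";"):
        rows.append({field: record[col] for col, field in MAPPING.items()})
    return rows

print(map_rows(sample)[0])
# {'FREQ': 'A', 'REF_AREA': 'IT', 'OBS_VALUE': '59.4'}
```

Each mapped dictionary is then ready to be inserted into the dataflow table; rows whose series key already exists would overwrite the stored values, as described above.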
With all the data and metadata stored in its database, the web service is ready to receive the SDMX query and send the requested Compact or Cross Sectional file to the client application.

[Fig. 1: overall architecture — input files (.txt, .ges, .flr) for the ESA, DEM and STS dataflows feed the Data Loader and Structure Loader, which fill the SDMX database; the web service chain (query parser, SQL generator, SDMX generator) answers SDMX queries with SDMX data files, and an RSS feed serves Eurostat.]

B. The database

Three databases can currently be used by the application: Access, Oracle and SQL Server. The database is initially empty; its tables are created and filled at runtime. The first file needed is an XML file containing the link between dataflows and DSDs; by importing this file the system creates the table Data_Flows. The SDMX structure files are needed to create the metadata tables (fig. 2). After the creation and loading of this table, the user must create a metadata table for each DSD loaded and one or more tables for the data (the name of each such table is given by the dataflow name) (fig. 3). During the creation of these tables the user can decide to include fields for the optional attributes that are part of the DSD. The fields inserted in these tables do not necessarily need a linked field in the imported data file.

In fig. 2 we can see all the tables concerning the DSD. Fig. 3 shows, as examples, the metadata and dataflow tables created for two types of DSD (demography and Census). The table Users_link links the users of the system to the dataflows. The rss_table is used by the rss_provider application to make the RSS file available to Eurostat.

C. The web services

The web service is composed of three parts. The first part takes an SDMX query as input and, by parsing it, generates a table containing all the required variables in each row. The rows have to be considered in an "AND" relation, while the columns in the same row are considered in an "OR" clause.
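The AND/OR semantics of the parser's grid translate directly into a SQL WHERE clause: one parenthesized OR group per row, the rows joined by AND. A minimal sketch, with an invented grid of (concept, value) pairs:

```python
# Sketch of turning the parser's grid into a SQL WHERE clause,
# following the convention stated above: rows are combined with AND,
# the conditions in the columns of one row with OR.
# The grid content (concepts and values) is an invented example.
grid = [
    [("FREQ", "A"), ("FREQ", "M")],   # row 1: FREQ = 'A' OR FREQ = 'M'
    [("REF_AREA", "IT")],             # row 2: REF_AREA = 'IT'
]

def grid_to_where(grid):
    rows = []
    for row in grid:
        ors = " OR ".join(f"{concept} = '{value}'" for concept, value in row)
        rows.append(f"({ors})")
    return " AND ".join(rows)

print(grid_to_where(grid))
# (FREQ = 'A' OR FREQ = 'M') AND (REF_AREA = 'IT')
```

A real implementation would use bound parameters rather than string interpolation; the sketch only shows how the grid's rows and columns map onto the Boolean structure of the query.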
The second part is the only one that accesses the database: it creates a SQL query to retrieve the data and sends them to the last part of the web service, which generates the SDMX data file in Compact or Cross Sectional format. We now focus on the three modules and their principal characteristics.

1 The query parser

This module does not access the database. It processes the imported XML file and returns a grid whose rows contain combinations of values (the columns) representing the concepts involved in the DataWhere of the SDMX query.

An SDMX query is an XML file structured according to the SDMX rules. We are interested in the DataWhere query. In this kind of query we can find these nodes:

- Time
- And
- Or
- Dimension
- Attribute
- DataFlow

We can define the nodes Time, And and Or as "complex nodes" and the others as "simple nodes". The simple nodes contain only a value, while the complex nodes can also contain other nodes: the node "Time" contains either the two child nodes "StartTime" and "EndTime" or a single child node "Time", while the nodes "And" and "Or" can contain all the nodes listed above. These last two node types are Boolean operators and can contain very complex sub-trees with further nested Boolean operations.

Which dimensions and attributes can we find inside a Dimension node? All the dimensions and attributes referring to the required dataflow, as listed in the dataflow table (see paragraph B).

The query parser proceeds in two steps.

1.1 Rewriting of the query

In this step, working with the SQL language in mind, the file is transformed into a simpler structure. The structures listed below are translated into other structures:

- If we have an "And" node with "And" child nodes inside, we can remove all the child nodes and put their conditions into the parent "And" node. Ex.
  <DataWhere>
    <And>
      Cond 1
      Cond 2
      <And>Cond 4, Cond 5</And>
      <And>Cond 6, Cond 7</And>
      <Or>Cond 8, Cond 9</Or>
    </And>
  </DataWhere>

the new structure becomes

  <And>
    Cond 1
    Cond 2
    Cond 4
    Cond 5
    Cond 6
    Cond 7
    <Or>Cond 8, Cond 9</Or>
  </And>

We can do the same for the "Or" node:

- If we have an "Or" node with "Or" child nodes inside, we can remove all the child nodes and put their conditions into the parent "Or" node.

- The Time node may be viewed as an "And" node, because its child nodes (StartTime, EndTime) go into the WHERE clause in an "And" condition. If the Time node contains only a single Time child node (see the demography query structure), we can change this child node into two children, StartTime and EndTime, with the same date, and return to the previous situation. Ex.

  <DataWhere>
    <And>
      Cond 1
      Cond 2
      <Time>
        <StartTime>Date 1</StartTime>
        <EndTime>Date 2</EndTime>
      </Time>
    </And>
  </DataWhere>

the new structure becomes

  <DataWhere>
    <And>
      Cond 1
      Cond 2
      <And>
        <StartTime>Date 1</StartTime>
        <EndTime>Date 2</EndTime>
      </And>
    </And>
  </DataWhere>

- If the conditions in an "Or" node all involve the same concept, they are joined into a single condition with all the values separated by a character. This transformation, which preserves the nature of the grid, is necessary to avoid a hypercube problem: otherwise the transformation into a Cartesian product could produce a grid with too many rows to be processed.

- An "And" node with "Or" child nodes can be translated into an "Or" node with "And" child nodes, each of which has as children the Cartesian product of the children of the "Or" child nodes. Ex.
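This last rule distributes the "And" over its "Or" children, a step toward disjunctive normal form. A minimal sketch, with conditions represented as plain strings rather than XML nodes (all names here are illustrative):

```python
# Sketch of the last rewriting rule: an And node whose children include
# Or nodes becomes an Or of And nodes, one per element of the Cartesian
# product of the Or children. Conditions are plain strings here.
from itertools import product

def distribute(and_children):
    # and_children: each item is either a single condition (str) or an
    # Or node, represented as a list of its alternative conditions.
    choices = [c if isinstance(c, list) else [c] for c in and_children]
    return [list(combo) for combo in product(*choices)]

# And(Cond1, Or(Cond2, Cond3), Or(Cond4, Cond5))
result = distribute(["Cond1", ["Cond2", "Cond3"], ["Cond4", "Cond5"]])
print(len(result))  # 4 And nodes in the resulting Or
```

This also shows why the same-concept joining rule above matters: each Or child multiplies the number of rows, so collapsing Or conditions on one concept into a single condition keeps the resulting grid small.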