3.2 S-DWH Information Systems Architecture

The Information Systems connect the business to the infrastructures; in our context this is represented by a conceptual organization of the effective S-DWH which is able to support tactical demands. In the layered architecture, in terms of data systems, we identify:
- the staging data, which are usually of a temporary nature; their contents can be erased, or archived, after the DW has been loaded successfully;
- the operational data, which are designed to integrate data from multiple sources for additional operations on the data; the data are then passed back to operational systems for further operations and to the data warehouse for reporting;
- the Data Warehouse, the central repository of data, created by integrating data from one or more disparate sources and storing current as well as historical data;
- the data marts, which are kept in the access layer and are used to get data out to the users; data marts are derived from the primary information of a data warehouse and are usually oriented to specific business lines.


Figure 3 - Information Systems Architecture

The management of metadata used and produced in all the different layers of the warehouse is specifically defined in the Metadata Framework1 and the Micro Data Linking2 deliverables. Metadata is used for the description, identification and retrieval of information, and links the various layers of the S-DWH through the mapping of the different metadata description schemes. It contains all statistical actions, all classifiers that are in use, input and output variables, selected data sources, descriptions of output tables, questionnaires and so on. All these meta-objects are collected during the design phase into one metadata repository. This configures a metadata-driven system, well suited also to supporting the management of actions or IT modules in generic workflows.

In order to suggest a possible path towards process optimization and cost reduction, in this chapter we will introduce a possible simple description of a generic workflow, which links the business model with the information system in the S-DWH.

1 Lundell L.G. (2012) Metadata Framework for Statistical Data Warehousing, ver. 1.0. Deliverable 1.1
2 Ennok M. et al. (2013) On Micro Data Linking and Data Warehousing in Production of Business Statistics, ver. 1.1. Deliverable 1.4

3.2.1 S-DWH is a metadata-driven system

The over-arching metadata management of a S-DWH as a metadata-driven system supports data management within the statistical program of an NSI, and it is therefore vital to thoroughly manage the metadata. To address this we refer to the metadata chapter, where metadata are organized in six main categories:
- active metadata, metadata stored and organized in a way that enables operational use, manual or automated;
- passive metadata, any metadata that are not active;
- formalised metadata, metadata stored and organised according to standardised codes, lists and hierarchies;
- free-form metadata, metadata that contain descriptive information using formats ranging from completely free-form to partly formalised;
- reference metadata, metadata that describe the content and quality of the data in order to help the user understand and evaluate them (conceptually);
- structural metadata, metadata that help the user find, identify, access and utilise the data (physically).

Metadata in each of these categories belong to a specific type, or subset, of metadata. The five subsets are:
- statistical metadata, data about statistical data, e.g. variable definitions, register descriptions, code lists;
- process metadata, metadata that describe the expected or actual outcome of one or more processes using evaluable and operational metrics;
- quality metadata, any kind of metadata that contribute to the description or interpretation of the quality of data;
- technical metadata, metadata that describe or define the physical storage or location of data;
- authorization metadata, administrative data used by programmes, systems or subsystems to manage users' access to data.

In the S-DWH, one of the key factors is the consolidation of multiple sources into a single database, identifying redundant columns of data for consolidation or elimination. This requires coherence of statistical metadata, and in particular of the managed variables. Statistical actions should collect unique input variables, not just rows and columns of tables in a questionnaire. Each input variable should be collected and processed once in each period of time, so that the outcome, the input variable in the warehouse, can be used for producing various different outputs. This approach triggers changes in almost all phases of the statistical production process: samples, questionnaires, processing rules, imputation methods, data sources, etc. must be designed and built in compliance with standardized input variables, not according to the needs of one specific statistical action.
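The "collect each input variable once per period" rule can be sketched as a small registry keyed by variable and period. All names here (variables, periods, survey actions) are hypothetical and only illustrate the principle:

```python
# A minimal sketch of the "collect once, reuse many times" rule.
# The registry and the names (variable, period, action) are illustrative.
def register_collection(registry, variable, period, action):
    """Record which statistical action collects a variable in a period.

    If the variable is already collected for that period, the existing
    action is reused instead of collecting the data again."""
    key = (variable, period)
    if key not in registry:
        registry[key] = action
    return registry[key]

registry = {}
print(register_collection(registry, "turnover", "2013Q1", "SBS-Survey"))
# A second action asking for the same variable in the same period reuses it:
print(register_collection(registry, "turnover", "2013Q1", "ETrade-Survey"))
# -> both calls return "SBS-Survey": the variable is collected only once
```

A real repository would of course hold full variable definitions and process metadata, but the key design point is the same: the unit of collection is the variable, not the questionnaire cell.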

A variable-based statistical production system reduces the administrative burden, lowers the cost of data collection and processing, and makes it possible to produce richer statistical output faster. Of course, this is true only within the boundaries of a standardized design. This means that a coherent approach can be used if statisticians plan their actions following a logical hierarchy of the variables' estimation in a common frame. What IT must support, then, is an adequate environment for designing this strategy.

As an example, according to a common strategy, we consider Surveys 1 and 2, which collect data with questionnaires, and one administrative data source. This time, decisions made in the design phase (design of the questionnaire, sample selection, imputation method, etc.) are taken "globally", taking into consideration all three sources. In this way, the integration of processes gives us reusable data in the warehouse. Our warehouse now contains each variable only once, making it much easier to reuse and manage our valuable data.

Figure 4 - Integration to achieve each variable only once - Information Re-use

Another way of reusing data which is already in the warehouse is to calculate new variables. The following figure illustrates a scenario where a new variable E is calculated from variables C* and D, already loaded into the warehouse. This means that data can be moved back from the warehouse to the integration layer. Warehouse data can be used in the integration layer for multiple purposes; calculating new variables is only one example. Integrated variables based on warehouse data open the way to new subsequent statistical actions that do not have to collect and process data, and can produce statistics directly from the warehouse. By skipping the collection and processing phases, one can produce new statistics and analyses very fast and much more cheaply than in the case of a classical survey.

Figure 5 - Building a new variable - Information Re-Use
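A hedged sketch of this kind of reuse: the variable names E, C* and D follow the figure, while the arithmetic and the storage layout are purely illustrative assumptions.

```python
# Deriving a new variable E from C* and D already in the warehouse:
# no collection or processing phase is needed (illustrative formula and data).
warehouse = {
    "C_star": [10.0, 20.0, 30.0],   # variable C*, one value per unit
    "D":      [1.0,  2.0,  3.0],    # variable D, same units
}

def derive_E(dw):
    """Compute E unit by unit from variables already loaded in the warehouse."""
    return [c + d for c, d in zip(dw["C_star"], dw["D"])]

warehouse["E"] = derive_E(warehouse)
print(warehouse["E"])  # [11.0, 22.0, 33.0]
```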

Designing and building a statistical production system according to the integrated warehouse model initially takes more time and effort than building the stovepipe model. But the maintenance costs of an integrated warehouse system should be lower, and new products, which can be produced faster and more cheaply to meet changing needs, should soon compensate for the initial investment.

The challenge in a data warehouse environment is to integrate, rearrange and consolidate large volumes of data from different sources to provide a new, unified information base for business intelligence. To meet this challenge, we propose that the processes defined in the GSBPM are distributed into four groups of specialized functionalities, each represented as a layer in the S-DWH.

3.2.2 Layered approach of a full active S-DWH

The layered architecture reflects a conceptual organization in which we consider the first two layers as pure statistical operational infrastructures, functional for acquiring, storing, editing and validating data, and the last two layers as the effective data warehouse, i.e. the layers in which data are accessible for analysis.

These reflect two different IT environments: an operational one, where we support semi-automatic computer interaction systems, and an analytical one, the warehouse, where we maximize free human interaction.


Figure 6 - S-DWH Layered Architecture

3.2.3 Source layer

The Source layer is the gathering point for all data that is going to be stored in the data warehouse. Input to the Source layer is data from both internal and external sources. Internal data is mainly data from surveys carried out by the NSI, but it can also be data from maintenance programs used for manipulating data in the data warehouse. External data is administrative data, i.e. data collected by someone else, originally for some other purpose.

The structure of data in the Source layer depends on how the data is collected and on the designs of the various NSI data collection processes. The specifications of the collection processes and their output, the data stored in the Source layer, have to be thoroughly described. Vital information includes the name, meaning, definition and description of every collected variable. The collection process itself must also be described, for example the source of a collected item, and when and how it was collected.

When data enter the source layer from an external source, or administrative archive, the data and the related metadata must be checked in terms of completeness and coherence. From a data structure point of view, external data are stored with the same data structure as they arrive. The integration toward the integration layer should then be implemented by mapping each source variable to a target variable, i.e. a variable internal to the S-DWH.


The mapping is a graphic or conceptual representation of relationships within the data, i.e. the process of creating data element mappings between two distinct data models. The common and original practice of mapping is the interpretation of an administrative archive in terms of S-DWH definitions and meanings. Data mapping involves combining data residing in different sources and providing users with a unified view of these data. Such systems are formally defined as a triple (T, S, M), where T is the target schema, S is the heterogeneous set of source schemas, and M is the mapping that maps queries between the source and the target schemas. Queries over the data mapping system also assert the data linking between elements in the sources and the business register units.
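The triple (T, S, M) can be sketched as follows; the schemas, field names and conversion rules are invented for illustration and are not taken from any actual S-DWH:

```python
# Sketch of a data-mapping triple (T, S, M) for one admin-data source.
# T: target schema of the S-DWH; S: the source (admin-archive) schema;
# M: rules mapping each source field onto a target variable.
TARGET_SCHEMA = {"unit_id": str, "turnover": float, "nace_code": str}   # T
SOURCE_SCHEMA = {"vat_no": str, "sales_eur": str, "activity": str}      # S

# M: source field -> (target variable, conversion function)
MAPPING = {
    "vat_no":    ("unit_id",   str),
    "sales_eur": ("turnover",  float),
    "activity":  ("nace_code", str.upper),
}

def map_record(source_record):
    """Translate one admin-data record into the S-DWH target schema."""
    target = {}
    for src_field, (tgt_field, convert) in MAPPING.items():
        target[tgt_field] = convert(source_record[src_field])
    return target

admin_row = {"vat_no": "EE123456", "sales_eur": "1250.50", "activity": "c26"}
print(map_record(admin_row))
# {'unit_id': 'EE123456', 'turnover': 1250.5, 'nace_code': 'C26'}
```

In practice M would also carry the linking keys toward the business register units, but the shape of the triple is the same.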


Figure 2 - Data Mapping Example

None of the internal sources needs mapping, since the data collection process is defined in the S-DWH during the design phase by using internal definitions.

Figure 3 - Source Layer Overview

3.2.4 Integration layer

From the Source layer, data is loaded into the Integration layer. This represents an operational system used to process the day-to-day transactions of an organization. These systems are designed to process data efficiently and to maintain transactional integrity. The process of translating data from source systems and transforming it into useful content in the data warehouse is commonly called ETL (Extract, Transform, Load). In the Extract step, data is moved from the Source layer and made accessible in the Integration layer for further processing. The Transformation step involves all the operational activities usually associated with the typical statistical production process. Examples of activities carried out during the transformation are:
- find and, if possible, correct incorrect data;
- transform data to formats matching the standard formats of the data warehouse;
- classify and code;
- derive new values;
- combine data from multiple sources;
- clean data, that is, for example, correct misspellings, remove duplicates and handle missing values.
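A few of the listed activities (deduplication, coding against a standard code list, handling of missing values, format standardisation) might be sketched like this. The record fields and the code list are assumptions, and real imputation would of course be far more careful than the naive default shown here:

```python
# Sketch of the Transformation step over extracted records (illustrative fields).
def transform(records, code_list):
    """Clean and standardise records before loading them into the warehouse."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["unit_id"] in seen:                 # remove duplicates
            continue
        seen.add(rec["unit_id"])
        out = dict(rec)
        # classify and code: map raw activity text to a standard code
        out["nace"] = code_list.get(out["nace"].upper(), "UNKNOWN")
        if out["turnover"] is None:                # handle missing values (naively)
            out["turnover"] = 0.0
        out["turnover"] = float(out["turnover"])   # match the warehouse format
        cleaned.append(out)
    return cleaned

raw = [
    {"unit_id": "A", "nace": "c26", "turnover": "100"},
    {"unit_id": "A", "nace": "c26", "turnover": "100"},  # duplicate row
    {"unit_id": "B", "nace": "x99", "turnover": None},   # unknown code, missing value
]
print(transform(raw, {"C26": "C26"}))
# [{'unit_id': 'A', 'nace': 'C26', 'turnover': 100.0},
#  {'unit_id': 'B', 'nace': 'UNKNOWN', 'turnover': 0.0}]
```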

To accomplish the different tasks in the transformation of new data to useful output, data already in the data warehouse is used to support the work. Examples of such usage are using existing data together with new ones to derive a new value or using old data as a base for imputation.

Each variable in the data warehouse may be used for several different purposes in any number of specified outputs. As soon as a variable is processed in the Integration layer in a way that it is useful in the context of data warehouse output, it has to be loaded into the Interpretation layer and the Access layer.

Figure 4 - OLTP Online Transaction Processing

The Integration layer is an area for processing data; this is implemented by operators specialized in ETL functionalities. Since the focus of the Integration layer is on processing rather than on search and analysis, data in the Integration layer should be stored in a generalized, normalized structure optimized for OLTP (Online Transaction Processing, a class of information systems that facilitate and manage transaction-oriented applications, typically for data entry and retrieval transaction processing). In this structure all data are stored in a similar data structure, independently of the domain or topic, and each fact is stored in only one place, in order to make it easier to maintain consistent data.

It is well known that these databases are very powerful when it comes to data manipulation, such as inserting, updating and deleting, but very ineffective when we need to analyse and deal with large amounts of data. Another constraint in the use of OLTP systems is their complexity: users must have great expertise to manipulate them, and it is not easy to understand all their intricacies.

During the several ETL processes, a variable will likely appear in several versions. Every time a value is corrected or changed for some reason, the old value should not be erased; instead, a new version of that variable should be stored. This mechanism ensures that all items in the database can be followed over time.
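This versioning mechanism might look like the following sketch, where a correction appends a new version rather than overwriting the old value; the structure and names are assumptions:

```python
# Sketch of version retention: old values are never erased, only superseded.
def store_value(history, variable, value, valid_from):
    """Append a new version of a variable's value to its history."""
    versions = history.setdefault(variable, [])
    versions.append({"version": len(versions) + 1,
                     "value": value,
                     "valid_from": valid_from})

def current_value(history, variable):
    """The latest version is the current one; earlier ones remain traceable."""
    return history[variable][-1]["value"]

history = {}
store_value(history, "turnover", 100.0, "2013-01-10")
store_value(history, "turnover", 120.0, "2013-02-02")  # correction: v1 is kept
print(current_value(history, "turnover"))  # 120.0
print(len(history["turnover"]))            # 2 versions, fully traceable
```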

Figure 5 - Integration layer Overview

3.2.5 Interpretation and Data Analysis layer

This layer contains all collected data, processed and structured to be optimized for analysis, as well as the base for the output planned by the NSI. The Interpretation layer is specially designed for statistical experts and is built to support data manipulation and big, complex search operations. Typical activities in the Interpretation layer are hypothesis testing, data mining and the design of new statistical strategies, as well as the design of data cubes functional to the Access layer.

Its underlying data model is not specific to a particular reporting or analytic requirement. Instead of focusing on a process-oriented design, the repository design is modelled on the data inter-relationships that are fundamental to the organization across processes. Data warehousing has become an important strategy for integrating heterogeneous information sources in organizations and for enabling their analysis and quality assessment. Although data warehouses are built on relational database technology, the design of a data warehouse database differs substantially from the design of an online transaction processing (OLTP) database.

The Interpretation layer will contain micro data, i.e. elementary observed facts, as well as aggregations and calculated values. It will contain all data at the finest granular level in order to be able to cover all possible queries and joins. A fine granularity is also a condition for managing changes in the required output over time.

Besides the actual data warehouse content, the Interpretation layer may contain temporary data structures and databases created and used by the different ongoing analysis projects carried out by statistics specialists. The ETL process in the integration layer continuously creates metadata regarding the variables and the process itself, which is stored as a part of the data warehouse.

In a relational database, the fact tables of the Interpretation layer should be organized in a dimensional structure to support data analysis in an intuitive and efficient way. Dimensional models are generally structured with fact tables and their associated dimensions. Facts are generally numeric, and dimensions are the reference information that gives context to the facts. For example, a sales trade transaction can be broken up into facts, such as the number of products moved and the price paid for the products, and into dimensions, such as order date, customer name and product number.

Figure 6 - Star Schema

A key advantage of a dimensional approach is that the data warehouse is easy to use and operations on data are very quick. In general, dimensional structures are easy for business users to understand, because the structures are divided into measurements/facts and context/dimensions related to the organization's business processes.
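A toy star schema along these lines, with a fact table keyed to two dimension tables; all tables and values are invented for illustration:

```python
# Tiny star schema sketch: a fact table with foreign keys into two dimensions.
date_dim = {1: {"year": 2013, "month": "Jan"}, 2: {"year": 2013, "month": "Feb"}}
product_dim = {10: {"name": "Widget"}, 11: {"name": "Gadget"}}

# Facts are numeric measures plus keys into the surrounding dimensions.
fact_sales = [
    {"date_key": 1, "product_key": 10, "units": 5, "price": 9.5},
    {"date_key": 2, "product_key": 10, "units": 3, "price": 9.5},
    {"date_key": 2, "product_key": 11, "units": 1, "price": 30.0},
]

def units_by_month(facts, dates):
    """Aggregate a measure along one dimension (an 'axis for analysis')."""
    totals = {}
    for f in facts:
        month = dates[f["date_key"]]["month"]
        totals[month] = totals.get(month, 0) + f["units"]
    return totals

print(units_by_month(fact_sales, date_dim))  # {'Jan': 5, 'Feb': 4}
```

In a real warehouse the same shape is expressed as a join-and-group-by query over the fact and dimension tables.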

A dimension is sometimes referred to as an axis for analysis. Time and Location are the classic basic dimensions. A dimension is a structural attribute of a cube that consists of a list of elements, all of which are of a similar type in the user's perception of the data. For example, all months, quarters, years, etc. make up a time dimension; likewise all cities, regions, countries, etc. make up a geography dimension. A dimension table is one of a set of companion tables to a fact table and normally contains attributes (or fields) used to constrain and group data when performing data warehousing queries. Dimensions correspond to the "branches" of a star schema.

The positions of the dimensions are organised according to a series of cascading one-to-many relationships. This way of organizing data is comparable to a logical tree, where each member has only one parent but a variable number of children. For example, the positions of the Time dimension might be months, but also days, periods or years.

Figure 7 - Time Dimension

A dimension can have a hierarchy, which is classified into levels. All the positions of a level correspond to a unique classification. For example, in a Time dimension, level one stands for days, level two for months and level three for years. Hierarchies can be balanced, unbalanced or ragged. In balanced hierarchies, the branches of the hierarchy all descend to the same level, with each member's parent being at the level immediately above the member. In unbalanced hierarchies, not all the branches of the hierarchy reach the same level, but each member's parent does belong to the level immediately above it.

Figure 8 - Unbalanced Hierarchies

In ragged hierarchies, the parent member of at least one member of the dimension is not in the level immediately above the member. As in unbalanced hierarchies, the branches of the hierarchy can descend to different levels. Usually, unbalanced and ragged hierarchies must be transformed into balanced hierarchies.

Figure 9 – Ragged Dimension
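One common way to perform that transformation is to pad short branches by repeating the last member down to the leaf level. This sketch assumes hierarchies stored as root-to-leaf paths; the data is illustrative:

```python
# Sketch of transforming an unbalanced hierarchy into a balanced one
# by padding each short branch with its own last member.
def balance(paths, depth):
    """Extend every root-to-leaf path to the same depth."""
    return [path + [path[-1]] * (depth - len(path)) for path in paths]

time_paths = [
    ["2013", "Q1", "Jan"],   # already reaches level 3
    ["2013", "Q2"],          # short branch: stops at level 2
]
print(balance(time_paths, 3))
# [['2013', 'Q1', 'Jan'], ['2013', 'Q2', 'Q2']]
```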

A fact table consists of the measurements, metrics or facts of a statistical topic. The fact table in the S-DWH is organized in a dimensional model, built on a star-like schema, with dimensions surrounding it. In the S-DWH, the fact table is defined at the finest level of granularity, with information organized in columns distinguished into dimensions, classifications and measures. Dimensions are the descriptors of the fact table. Typically dimensions are nouns like date, class of employment, territory, NACE, etc., and can have hierarchies on them; for example, the date dimension could contain data such as year, month and weekday.

The definition of a star schema can be implemented by dynamic ad hoc queries from the integration layer using the proper metadata, in order to implement generic data transposition queries. With a dynamic approach, any expert user can define their own analysis context, starting from the already existing data marts and from virtual or temporary environments derived from the data structure of the integration layer. This method allows users to automatically build permanent or temporary data marts according to their needs, leaving them free to test any possible new strategy.
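A metadata-driven mart definition could be as simple as generating a grouping query from the user's chosen dimensions and measure. The table and column names here are hypothetical, and a real implementation would draw them from the metadata repository rather than from literals:

```python
# Sketch of a metadata-driven, ad hoc mart definition: the user picks the
# dimensions and a measure, and a transposition query is generated.
def build_mart_query(table, dimensions, measure):
    """Generate a SQL grouping query from the chosen analysis context."""
    dims = ", ".join(dimensions)
    return (f"SELECT {dims}, SUM({measure}) AS {measure} "
            f"FROM {table} GROUP BY {dims}")

q = build_mart_query("fact_trade", ["year", "nace"], "turnover")
print(q)
# SELECT year, nace, SUM(turnover) AS turnover FROM fact_trade GROUP BY year, nace
```

Here the generated query is only assembled, not executed; in production one would bind it to the integration-layer database and materialise the result as a permanent or temporary mart.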

Figure 10 - Interpretation and Data Analysis Layer Overview

3.2.6 Access layer

The Access layer is the layer for the final presentation, dissemination and delivery of information. This layer is used by a wide range of users and computer instruments. The data is optimized to effectively present and compile data, and may be delivered in data cubes and in different formats specialized to support different tools and software. Generally the data structure is optimized either for MOLAP (Multidimensional Online Analytical Processing), which uses specific analytical tools on a multidimensional data model, or for ROLAP (Relational Online Analytical Processing), which uses specific analytical tools on a relational dimensional data model that is easy to understand and does not require pre-computation and storage of the information.

Figure 11 - Access layer Overview

A multidimensional structure is defined as "a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data". The structure is broken into cubes, and the cubes are able to store and access data within the confines of each cube. "Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions". Even when data is manipulated it remains easy to access and continues to constitute a compact database format; the data still remains interrelated. The multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications, because of its ability to deliver answers to complex business queries swiftly. Data can be viewed from different angles, which gives a broader perspective on a problem than other models. Some data marts might need to be refreshed from the data warehouse daily, whereas some user groups might need them to be refreshed only monthly.
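The idea of cells addressed by one position per dimension, each holding aggregated data, can be sketched as follows; the dimensions (time, region) and the values are invented:

```python
# Sketch of a multidimensional structure: each cell is addressed by one
# position per dimension and holds an aggregate.
def add_fact(cube, time, region, value):
    """Accumulate a value into the cell addressed by (time, region)."""
    cell = (time, region)
    cube[cell] = cube.get(cell, 0) + value

def slice_by(cube, axis, position):
    """View the cube 'from a different angle': fix one dimension, sum the rest."""
    return sum(v for cell, v in cube.items() if cell[axis] == position)

cube = {}
add_fact(cube, "2013", "North", 10)
add_fact(cube, "2013", "South", 7)
add_fact(cube, "2014", "North", 4)
print(slice_by(cube, 0, "2013"))   # 17  (total across regions for 2013)
print(slice_by(cube, 1, "North"))  # 14  (total across years for North)
```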

Each data mart can contain different combinations of tables, columns and rows from the statistical data warehouse. For example, a statistician or user group that doesn't require a lot of historical data might only need transactions from the current calendar year in the database. Some analysts might need to see all details about the data, whereas data such as "salary" or "address" might not be appropriate for a data mart that focuses on Trade.

The three basic types of data marts are dependent, independent and hybrid. The categorization is based primarily on the data source that feeds the data mart. Dependent data marts draw data from a central data warehouse that has already been created. Independent data marts, in contrast, are standalone systems built by drawing data directly from operational or external sources of data, or both. Hybrid data marts can draw data from operational systems or data warehouses. The data marts in the ideal information system architecture of a full active S-DWH are dependent data marts: data in the data warehouse is aggregated, restructured and summarized when it passes into the dependent data mart. The architecture of a dependent data mart is as follows.

Figure 12 - Dependent versus Independent Data Marts
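A dependent Trade data mart derived from the central warehouse might be sketched as below: rows outside the current year are dropped and fields such as salary, not appropriate for a Trade-focused mart, are left out. All names and figures are illustrative:

```python
# Sketch of a dependent data mart: restructured from the central warehouse,
# never from operational sources directly.
warehouse = [
    {"year": 2012, "domain": "Trade", "turnover": 100.0, "salary": 40.0},
    {"year": 2013, "domain": "Trade", "turnover": 120.0, "salary": 42.0},
    {"year": 2013, "domain": "ICT",   "turnover": 80.0,  "salary": 55.0},
]

def build_trade_mart(dw, year):
    """Keep only Trade rows for the given year and drop sensitive fields."""
    return [{"year": r["year"], "turnover": r["turnover"]}
            for r in dw if r["domain"] == "Trade" and r["year"] == year]

print(build_trade_mart(warehouse, 2013))
# [{'year': 2013, 'turnover': 120.0}]
```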

There are benefits to building a dependent data mart:
- Performance: when the performance of a data warehouse becomes an issue, building one or two dependent data marts can solve the problem, because the data processing is performed outside the data warehouse.
- Security: by putting data outside the data warehouse in dependent data marts, each department owns its data and has complete control over it.