An ISO 9001:2000 Compliant Quality Management System for Data Integration in Data Warehouse Systems

Holger Hinrichs, OFFIS, Escherweg 2, 26121 Oldenburg, Germany, holger.hinrichs@offis.de
Thomas Aden, University of Oldenburg, 26111 Oldenburg, Germany, [email protected]

The copyright of this paper belongs to the paper's authors. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage. Proceedings of the International Workshop on Design and Management of Data Warehouses (DMDW 2001), Interlaken, Switzerland, June 4, 2001 (D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.), http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/

Abstract

In modern information systems, the topic of data quality becomes more and more important due to increasing analysis demands. This holds especially for data warehouse systems and similar systems that provide data integrated from heterogeneous sources. Although a large variety of extraction-transformation-loading tools supporting data integration is available, there is still no process model defining which integration steps should be done in which order to best fulfil the users' needs. Our research project CLIQ is intended to close this gap. In CLIQ, the integration process is viewed as a kind of production process. This view enables us to apply concepts of quality management known from the manufacturing/service domain. More precisely, we have developed a quality management system for data integration that meets the requirements of the recent ISO 9001:2000 standard. This paper presents our approach and describes its added value compared to traditional approaches to data integration.

1 Introduction

The increasing popularity of data warehouse systems [Inm92] reflects the rising need to make strategic use of data integrated from heterogeneous sources. The integration of heterogeneous data, an integral part of data warehousing, raises the problem of data deficiencies such as inconsistent, missing, or obsolete values, unintentional redundancy, etc. If data that do not satisfy given user requirements are used for decision support, the so-called "garbage in, garbage out" effect may occur, possibly with serious consequences. To prevent this, the data integration process has to ensure that subsequent analysis tasks work on approved data exclusively. Data that do not fulfil the requirements have to be either improved or singled out. The entire integration process must align with the fulfilment of users' needs. Our research project CLIQ (Data Cleansing with Intelligent ...) is dedicated to this topic.

The scenario just described can be viewed as a data quality problem. In [ISO00a], the term "quality" is defined as the "degree to which a set of inherent characteristics fulfils requirements". Applied to the context of data warehousing, the subject matter under consideration is "data", more precisely a fragment of a database, below termed "data product". So which characteristics (with regard to the above definition) does such a data product possess? In the literature, several classifications of data characteristics (often called data quality dimensions) can be found [Wan98, JJQV98, NLF99, Sha99]. Unfortunately, there is no commonly accepted classification. In CLIQ, we adopted the classification depicted in Fig. 1, which we consider to be a "best of breeds" concept (see [Hin01]).

Quality characteristics form the backbone of quality management (QM), defined in [ISO00a] as "coordinated activities to direct and control an organization with regard to quality". In accordance with [ISO00a], QM consists of the following activities:

- Quality policy: Establishing the overall quality-related intentions and goals of an organization.
- Quality planning: Setting quality objectives and specifying the processes and related resources necessary to fulfil these objectives.
- Quality control: Executing processes to fulfil quality requirements.
- Quality assurance: Providing confidence that quality requirements will be fulfilled.
- Quality improvement: Increasing the ability to fulfil quality requirements.

[Figure 1: Categories of Data Quality Characteristics. The figure groups the characteristics correctness, consistency, reliability, absence of redundancy, completeness, precision, timeliness, relevance, uniformity, unambiguousness, and understandability under the three categories credibility, usability, and interpretability.]

[Figure 2: Manufacturing and Data Warehousing. The figure juxtaposes the two domains: customers / data analysts, products / data products, production process / integration process, raw materials / heterogeneous data, suppliers / data sources, with quality management applying to both.]

The system within which QM is done is called a quality management system (QMS). As Fig. 2 shows, the data integration process in a data warehouse system can be viewed as a kind of production process, with the heterogeneous source data corresponding to raw materials provided by suppliers (in this case: data sources), the integration process corresponding to the production process, and the consolidated data corresponding to products. Like (material) products, data products are used by customers (in this case: data analysts) whose expectations and needs have to be met. Following this analogy, we can adapt the well-established QM concepts known from the manufacturing/service domain to the context of data integration, hereafter called data quality management (DQM).

One important aspect of DQM is data quality measurement. As DeMarco stated, "You cannot control what you cannot measure" [DeM82]; hence we need metrics at hand to calculate the degree of conformance with given requirements. In the manufacturing domain, we have to measure characteristics like lengths, weights, and speeds. In data warehousing, on the other hand, we have to measure characteristics like the consistency, timeliness, and completeness of data products (cf. Fig. 1). Unfortunately, metrics suitable for these characteristics are still a matter of research. Most of the data quality metrics proposed in the literature, e.g. [BT99, NLF99, Sha99], are either trivial or cannot be automated. However, since modern data warehouses store gigabytes up to terabytes of data, a manual quality control is not feasible at all. (A fully automated DQM, however, is also unrealistic: human interaction will be inevitable when conflicts cannot be resolved automatically.) In [Hin01], we propose a well-defined set of metrics along with automated measuring methods for a subset of the data quality characteristics mentioned in Fig. 1. These are tailored to relational databases as they prevail in the back-end area of current data warehouse systems.
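To make the notion of an automatable metric concrete, the following sketch computes a naive completeness score (the share of non-null attribute values) over a table fragment. It is an illustration only, not one of the metrics defined in [Hin01]; the table, attribute names, and sample values are hypothetical:

# Illustrative sketch only: a naive completeness metric (share of non-null
# values), not the metric suite defined in [Hin01]. Table data are assumed
# to be available as a list of dicts, e.g. fetched via ODBC.

def completeness(rows, attributes):
    """Return the fraction of non-null values over the given attributes."""
    total = len(rows) * len(attributes)
    if total == 0:
        return 1.0  # an empty fragment is trivially complete
    filled = sum(1 for row in rows for attr in attributes
                 if row.get(attr) not in (None, ""))
    return filled / total

# Hypothetical data product: a fragment of a customer table.
customers = [
    {"id": 1, "name": "Miller", "city": "Oldenburg", "phone": None},
    {"id": 2, "name": "Schmidt", "city": "", "phone": "0441-123"},
]
print(completeness(customers, ["name", "city", "phone"]))  # 4 of 6 values filled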

Another major question is how data quality can be improved. Two basic approaches are possible:

- Symptom-oriented: Data cleansing methods are applied to deficient data to improve their quality as far as possible. If data cannot be made to conform with given requirements, they have to be flagged and then either be sorted out or be released for restricted use. In [ISO00b], these aspects are summarized as "control of nonconforming product".
- Cause-oriented: The business processes which "produce" the data (especially data acquisition, transformation, and consolidation processes) are tuned continually in such a way that the quality of the produced data increases more and more. Causes of nonconformities should be eliminated appropriately (in proportion to their effects) in order to prevent recurrence. Furthermore, the QMS itself should be the subject of permanent efforts to improve its effectiveness, efficiency, and/or traceability. In [ISO00b], the entirety of these actions is called "(quality) improvement".

Of course, it is always better to strike at the root of a problem, which in this case means optimizing the business processes and/or the QMS. Unfortunately, cause-oriented DQM is often hindered by the fact that the processes in question are outside the optimizer's sphere of influence. Legacy systems, for example, which often deliver low-quality data (due to a lack of input controls etc.), usually cannot be extended with appropriate integrity rules after the fact. Besides, many data deficiencies cannot be detected until (or are even raised by) data integration, e.g. duplicate records from different sources. As a consequence, symptom- and cause-oriented steps should be implemented in combination, complementing each other.

The remainder of this paper is structured as follows: In Sect. 2, an overview of the ISO 9001:2000 standard is given. A QMS for data integration which meets the requirements of this standard is presented in Sect. 3. Subsequent to a discussion of related work in Sect. 4, the paper concludes with a rating of our approach and an overview of future work in Sect. 5.

2 The ISO 9001:2000 Standard

The ISO 9000 family of standards has been developed by the International Organization for Standardization (ISO) to assist organizations in implementing and operating effective quality management systems. First introduced in 1987 and revised in 1994, the current revision was published in December 2000 and is called ISO 9000:2000. It comprises the following standards:

- ISO 9000: Fundamentals and vocabulary
- ISO 9001: Requirements
- ISO 9004: Guidelines for performance improvements
- ISO 19011: Guidelines on quality and environmental management systems auditing

Some excerpts of ISO 9000:2000 have already been cited in the previous section. In this section, we give an introduction to ISO 9001:2000 (the postfix ":2000" will be omitted below), since it is the standard most relevant to CLIQ. ISO 9004 and ISO 19011 are beyond the scope of this paper.

ISO 9001 promotes the adoption of a process approach when developing, implementing, and improving a QMS in order to enhance customer satisfaction by meeting customer requirements [ISO00b]. An organization has to identify various activities, link them together, and assign resources to them, thus building up a system of communicating processes. Figure 3 illustrates such a process-based QMS. Customers play an important role in this model since their requirements are used as input to the product realization process and their satisfaction is continually analyzed.

[Figure 3: Model of a Process-Based QMS [ISO00b]. Within a cycle of continual improvement of the quality management system, management responsibility (Section 5), resource management (Section 6), product realization (Section 7), and measurement, analysis and improvement (Section 8) are linked; customer requirements enter as input to product realization, the product is the output, and customer satisfaction is monitored. The figure distinguishes value-adding activities from information flow.]

ISO 9001 is made up of eight sections, the first three of which contain general information about the scope of the standard, normative references, and terms. Sections 4 to 8 describe requirements for a QMS and its components, as indicated in Fig. 3. The structure of these sections is shown in Fig. 4. We will refer to the individual sections when we discuss the ISO 9001 compliance of our QMS in the subsequent section.

[Figure 4: Structure of ISO 9001:2000 Sections (Third Level Not Shown). 4 Quality Management System: 4.1 General Requirements, 4.2 Documentation Requirements. 5 Management Responsibility: 5.1 Management Commitment, 5.2 Customer Focus, 5.3 Quality Policy, 5.4 Planning, 5.5 Responsibility, Authority & Communication, 5.6 Management Review. 6 Resource Management: 6.1 Provision of Resources, 6.2 Human Resources, 6.3 Infrastructure, 6.4 Work Environment. 7 Product Realization: 7.1 Planning of Product Realization, 7.2 Customer-Related Processes, 7.3 Design & Development, 7.4 Purchasing, 7.5 Production & Service Provision, 7.6 Control of Monitoring & Measuring Devices. 8 Measurement, Analysis & Improvement: 8.1 General, 8.2 Monitoring & Measurement, 8.3 Control of Nonconforming Product, 8.4 Analysis of Data, 8.5 Improvement.]

3 A Quality Management System for Data Integration

In this section, we present a process model for data integration that defines which integration steps have to be executed in which order to achieve optimal data quality. This process model is enriched with DQM steps to ensure that customer requirements are fulfilled. The integration steps plus DQM steps, along with organizational DQM activities, form a QMS for data integration, called a data quality management system (DQMS). In the context of data warehousing, such a DQMS has to be located at the extraction-transformation-loading (ETL) stage (see Fig. 5), with the data warehouse as the target database.

[Figure 5: DQM within the ETL Model. The figure shows the ETL chain of a data warehouse system: data sources 1 to n, extraction, staging area, transformation, operational data store, loading, data warehouse, and analysis (OLAP etc.), together with a "garbage in, garbage out" annotation.]

The following two subsections describe the main concepts of our DQMS and its implementation as a software system.

3.1 DQMS Concepts

As Fig. 6 shows, the DQMS should be viewed as an integral part of an organization. It is tightly coupled to the organization's management, its human and technical resources, and, of course, its (data) suppliers and (data) customers. Although the latter two are modelled as external entities in Fig. 6 (to keep the ISO 9001 view), they could also be viewed as internal ones in the data warehousing context (as indicated by the dashed lines): an organization's operational database would be an internal supplier, an organization's data analyst an internal customer. Customers specify quality-related requirements and give feedback concerning their satisfaction with the delivered product. An analogous relationship exists between the organization and its suppliers.

[Figure 6: Embedding a DQMS into an Organization. The organization, with its management and quality management, contains the DQMS (data quality management); it receives raw data from data suppliers and delivers data products to data customers, exchanging requirements and feedback with both sides.]

The requirements of ISO 9001 sections 4, 5, and 6 (see Fig. 4) cannot be achieved by the data integration process itself. Instead, they must be met by the surrounding environment as follows:

- The fulfilment of the general requirements of ISO section 4.1 (identification of processes and their interaction, determination of control criteria and methods, etc.) has been a basic presupposition of building the DQMS. The same holds for ISO sections 7.1 (planning of product realization) and 8.1 (general measurement, analysis, and improvement requirements).
- We assume that the organization has established and maintained documented statements of the quality policy and quality objectives as well as documentation of the procedures used within the DQMS (plus their configurations, where applicable). As in every other ISO 9001 compliant QMS, a quality manual must exist. We further assume that all these documents are stored in a central document management system (electronically), so that, provided that access rights have been granted appropriately, authorized stakeholders may always access the most current (meta) information. Additionally, the document management system is supposed to support version control to meet the requirements of ISO section 4.2.
- Any recordings resulting from integration steps (especially measurement results) are stored in a central metadata repository to meet the requirement to keep recordings "legible, readily identifiable, and retrievable" (ISO section 4.2).

- We assume that the top management of the organization takes steps to meet the requirements of ISO section 5 (management responsibility).
- We assume that the human resources, technical infrastructure, and work environment necessary to operate the DQM processes are provided by the organization, as claimed by ISO section 6. Low-level resources like CPU time slots, memory allocation, and database access are assumed to be provided by the underlying operating system and DBMS.

In the following, we take a closer look at the integration process itself with regard to its compliance with the remaining ISO 9001 sections 7 (product realization) and 8 (measurement, analysis, and improvement). Since we assume that the organization does not design or develop new (data) products in terms of ISO 9001, we can exclude ISO section 7.3 (design and development) from our discussion.

Step I: Assessment and Review of Requirements

In this step, requirements are determined and maintained (cf. Fig. 6). These include customer requirements in terms of desired values of data quality characteristics, additional implicit requirements, as well as statutory and regulatory requirements (e.g. protection of data privacy). The organization has to ensure that it has the ability to meet the specified requirements. The communication between organization and customer is supposed to be handled electronically by means of window dialogs, emails, etc. (ISO section 7.2).

Step II: Purchasing Raw Data

According to the requirements specified in the previous step, the organization has to evaluate and select data suppliers. First, a so-called purchasing information document has to be set up which specifies the requirements the raw data have to meet (cf. Fig. 6), especially concerning data quality characteristics like relevance and timeliness, but also supplier-oriented characteristics like reputation, charges, data acquisition methods, QMS features, etc. This document must then be sent to potential suppliers. Based on an evaluation of the suppliers' responses, suppliers have to be selected, provided their products pass appropriate inspections (ISO section 7.4).

Compared to the purchasing process in manufacturing, there are some peculiarities in the data warehousing context: in many cases, there will be just one data supplier offering a certain contribution to the warehouse data (e.g. the operational databases of the organization itself), i.e. the organization cannot choose between alternative suppliers. Furthermore, in contrast to raw materials in manufacturing, raw data from one supplier may show significant deficiencies in the first place (e.g. missing values), but when combined with (complementary) raw data from another supplier, these deficiencies may disappear. The organization has to be aware of these special characteristics.

Steps I and II belong to the modelling phase rather than to the operational phase of the data warehouse system. Therefore, they are arranged orthogonally to the operational steps described below.

Assuming that steps I and II have been completed successfully, the data warehouse system may start operations. In CLIQ, we concentrate on the transformation phase of the ETL model, assuming that extraction and loading functionality is provided by third-party tools (see also Sect. 4). Each integration process executed within the data warehouse system should follow the phase model described below (see Fig. 7) in order to meet the ISO 9001 requirements.

[Figure 7: Operational Steps of Data Integration in CLIQ. Extraction from data sources 1 to n into source data buffers 1 to n and the final loading into the data warehouse lie outside of CLIQ. In between, the figure arranges the operational steps around the transformation, consolidation, and target/analysis areas: (1) unification of representation, (2) statistical process control, (3) domain-specific consistency checking, (4) inquiries and postprocessing, (5) record linkage, (6) merging, (7) quality measurement, (8) control of nonconforming data products and quality improvement, (9) data product release and target area update, and (10) analysis of customer feedback and retraction of data products. The figure distinguishes data flow from control flow.]

Step 1: Unification of Representation

We assume that previously extracted source data are stored in so-called source data buffers (SDBs, one per source). SDBs correspond to source-specific "raw material stores" from which heterogeneous data can be fetched and then processed. Each SDB implements a relational schema and thus hides the data model of the respective source (flat file, relational, object-oriented, etc.), simplifying the subsequent integration steps.

In this step, source data are moved from their SDBs into a temporary storage called the transformation area (also called staging area [Kim98]). The transformation area is assumed to have a global relational schema that is covered (in terms of content) by the local SDB schemata. The main task of this step is to map the heterogeneous source data onto the uniform data structures of the transformation area. The following mapping tasks are especially important:

- Generating global IDs and linking them to local keys
- Unifying character strings syntactically (with regard to umlauts, white spaces, etc.)
- Unifying character strings semantically (in case of synonyms)
- Decomposing complex attribute values into atomic ones (e.g. addresses)
- Aggregating atomic attribute values into complex ones (e.g. date values)
- Converting codings (e.g. gender values m/f to 1/2)
- Converting scales (e.g. inch to cm)

To specify the required transformations, conventional ETL tools (cf. Sect. 4) can be used. After executing a transformation, data are written to the transformation area and deleted from their respective SDB.

After this step, the newly extracted records can be identified via their global IDs. Traceability (according to ISO section 7.5) is ensured by a look-up table linking global IDs to local keys (plus their data source ID).
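A minimal sketch of some of these mapping tasks (coding conversion, scale conversion, and syntactic string unification), under assumed source conventions; the attribute names, code table, and record layout are hypothetical, and in CLIQ such transformations are specified with ETL tooling as described above:

# Sketch of step 1 mappings under assumed source conventions: gender codes
# m/f are recoded to 1/2, lengths in inches are converted to cm, and name
# strings are unified syntactically (umlauts, surrounding white space).

GENDER_CODES = {"m": 1, "f": 2}            # hypothetical target coding
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def unify_record(source_record, source_id, global_id):
    """Map one source record onto the transformation area schema."""
    return {
        "global_id": global_id,                      # generated global ID
        "source_id": source_id,                      # kept for traceability
        "local_key": source_record["key"],           # link back to the SDB
        "name": source_record["name"].strip().lower().translate(UMLAUTS),
        "gender": GENDER_CODES.get(source_record["gender"].lower()),
        "height_cm": round(source_record["height_inch"] * 2.54, 1),
    }

print(unify_record({"key": "A-17", "name": " Müller ", "gender": "F",
                    "height_inch": 69.0}, source_id=1, global_id=1001))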

Step 2: Statistical Process Control

After unification, statistical process control (SPC) is performed on the transformation area data, based on classical SPC theory [She31]. The idea is to compute attribute-specific statistical key figures (mean, variance, etc.) from the transformation area data and log them over time. The newly computed key figures are then compared to the formerly logged key figures (used as "expected values"). In this way, cardinal data deficiencies (e.g. transfer errors) can be detected at a very early stage. If required, i.e. if a priori defined attribute-specific limits are exceeded, an interceding action has to be executed in order to re-establish statistical control, e.g. deleting the data from the transformation area and initiating a new data transfer after the problem has been solved.
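The kind of check meant here can be sketched as follows, assuming simple three-sigma control limits derived from previously logged key figures; the concrete limits, key figures, and sample values are hypothetical, not the process parameters used in CLIQ:

# Sketch of a statistical process control check on one attribute of the
# transformation area: the newly computed mean is compared against control
# limits derived from formerly logged means (assumed three-sigma limits).

from statistics import mean, stdev

def out_of_control(new_values, logged_means, sigmas=3.0):
    """True if the current attribute mean violates the control limits."""
    center = mean(logged_means)
    spread = stdev(logged_means)
    lower, upper = center - sigmas * spread, center + sigmas * spread
    return not (lower <= mean(new_values) <= upper)

logged_order_totals = [101.2, 99.8, 100.5, 100.1, 99.6]   # from earlier loads
current_load = [310.0, 305.5, 298.0]                      # e.g. a transfer error
if out_of_control(current_load, logged_order_totals):
    # interceding action: discard the load and trigger a new data transfer
    print("SPC alarm: attribute mean outside control limits")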

Step 3: Domain-Specific Consistency Checking

In this step, the transformation area records are checked with regard to consistency by means of domain-specific knowledge. The latter should be represented in such a way that it can be processed automatically. Different types of representation are possible, especially:

- Rules, e.g. "IF DeliveryDate < PurchaseOrderDate THEN Error (SeverityCode 0.7)"
- Look-up tables, e.g. bank code registers
- Regular expressions, e.g. for phone numbers or article codes
- Arbitrary domain-specific functions

In case of a detected inconsistency, an appropriate action has to be executed, e.g. generating an error message or warning. However, the checking process, usually done as a batch run, should not be aborted.
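A sketch of rule-based consistency checking in the spirit of the example rule above; the rule set, severity codes, and record layout are hypothetical, and in the CLIQ implementation such rules are executed by a rule engine (cf. Sect. 3.2):

# Sketch of domain-specific consistency checking: each rule flags a record
# with a severity code, findings are logged, and the batch run is not aborted.

from datetime import date

RULES = [
    # (description, predicate returning True on violation, severity)
    ("DeliveryDate < PurchaseOrderDate",
     lambda r: r["delivery_date"] < r["order_date"], 0.7),
    ("Phone number malformed",
     lambda r: not r["phone"].replace("-", "").isdigit(), 0.3),
]

def check(records):
    findings = []
    for rec in records:
        for description, violated, severity in RULES:
            if violated(rec):
                findings.append((rec["global_id"], description, severity))
    return findings  # later turned into error messages / inquiries (step 4)

orders = [{"global_id": 1001, "order_date": date(2001, 3, 10),
           "delivery_date": date(2001, 3, 2), "phone": "0441-123"}]
for finding in check(orders):
    print("warning:", finding)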

Step 4: Inquiries and Postprocessing

Provided that appropriate domain knowledge is available, the major proportion of inconsistencies can be detected automatically, as described above. However, very few inconsistencies can be corrected automatically. Consequently, if an inconsistency is not tolerable, an inquiry has to be sent to the affected data source (generated automatically from an error message, if possible). Depending on the business process implemented, the source may send corrected records to its SDB (then continuing with step 1) or directly to the transformation area. In the latter case, a (preferably automated) postprocessing of the corrected records has to be done in the transformation area, followed iteratively by step 3.

Step 5: Record Linkage

The goal of this step is to detect duplicate records, i.e. records that describe the same real-world entity, both within the transformation area and between the transformation area and the so-called target area (also called operational data store [Kim98]) in which the consolidated data will be stored in the end.

Two types of record linkage processes should be distinguished: If a record in the transformation area is just an incremental update of another record already stored in the target area, the linkage can be done via their common (global) keys. Due to the heterogeneity of data sources, this does not work when the extensions of data sources overlap and, consequently, several source records (with different keys) have to be mapped onto the same target area record. The same holds for the case when there are previously undetected duplicates within one data source. In these cases, the linkage has to be based on other attributes like name, city, gender, delivery date, etc. To automate this task, several methods have been proposed in the literature. The most prominent ones are:

- Probabilistic Record Linkage [Jar89]
- The Basic resp. Duplicate Elimination Sorted-Neighborhood Method [HS95]
- Neighborhood Hash Joins [GFSS00]

All these methods result in a set of record pairs potentially belonging together. These pairs, more precisely their transitive closures, now have to be analyzed with respect to whether the linkage is correct or not. Especially in marginal cases, an interactive review will always be inevitable.
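To give an impression of how such methods work, the following is a strongly simplified sketch in the spirit of the basic Sorted-Neighborhood Method [HS95]: records are sorted by a key built from matching-relevant attributes and only records within a small sliding window are compared. The sorting key, window size, similarity measure, and sample records are hypothetical simplifications, not the original algorithm's parameters:

# Simplified sketch in the spirit of the basic Sorted-Neighborhood Method
# [HS95]: build a sorting key from matching-relevant attributes, sort, and
# compare only records that fall within a sliding window of fixed size.

from difflib import SequenceMatcher

def sorting_key(rec):
    # hypothetical key: prefixes of name and city plus the gender code
    return (rec["name"][:3].lower(), rec["city"][:3].lower(), rec["gender"])

def similar(a, b, threshold=0.85):
    text_a = f'{a["name"]} {a["city"]}'.lower()
    text_b = f'{b["name"]} {b["city"]}'.lower()
    return SequenceMatcher(None, text_a, text_b).ratio() >= threshold

def candidate_pairs(records, window=3):
    ordered = sorted(records, key=sorting_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + window]:
            if similar(rec, other):
                pairs.append((rec["global_id"], other["global_id"]))
    return pairs  # pairs (and their transitive closures) go to review

people = [
    {"global_id": 1, "name": "Meyer", "city": "Oldenburg", "gender": 1},
    {"global_id": 2, "name": "Maier", "city": "Oldenburg", "gender": 1},
    {"global_id": 3, "name": "Schulz", "city": "Bremen", "gender": 2},
]
print(candidate_pairs(people))  # the two Oldenburg records are flagged as a pair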

Step 6: Merging

Records whose correlations have been validated must now be merged into one single record in order to avoid unintentional redundancy within the data warehouse. Applying certain criteria (information content, attribute-specific priorities on data sources, timeliness, etc.), the best pieces of information have to be extracted from the involved records and written into a target record. In our process model, this target record is not written to the target area (as one might expect), but into another temporary storage, the so-called consolidation area. The target area is not updated until the very last step of the process model, thus ensuring that only data that have passed all conformance tests (some of which follow in the subsequent steps) are written to the target area and thus made available for analysis tasks.

Those records that were merged into a consolidation area record are then deleted from the transformation area. The remaining transformation area records are moved to the consolidation area without any modification. The consolidation area now serves as the starting point of the following DQM activities.
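A possible merging policy for this step can be sketched as follows, choosing each attribute value by a source priority and, within equal priority, by timeliness; the source priorities, attributes, and sample records are hypothetical examples of the criteria mentioned above:

# Sketch of step 6: merge a group of linked duplicate records into one
# consolidation area record, preferring values from higher-priority sources
# and, for equal priority, the more recent (timelier) non-missing value.

SOURCE_PRIORITY = {"erp": 2, "webshop": 1}     # hypothetical priorities

def merge(linked_records, attributes):
    merged = {}
    for attr in attributes:
        candidates = [r for r in linked_records if r.get(attr) not in (None, "")]
        if candidates:
            best = max(candidates, key=lambda r: (SOURCE_PRIORITY[r["source"]],
                                                  r["load_time"]))
            merged[attr] = best[attr]
        else:
            merged[attr] = None                # stays missing, flagged later
    return merged

duplicates = [
    {"source": "webshop", "load_time": "2001-04-02", "name": "Meyer", "phone": ""},
    {"source": "erp", "load_time": "2001-03-28", "name": "Meyer, K.", "phone": "0441-123"},
]
print(merge(duplicates, ["name", "phone"]))    # the ERP values win here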

Step 7: Quality Measurement and Analysis

In this step, it must be checked whether the data in the target area will meet the specified customer (user) requirements (in compliance with ISO section 8.2), supposing that it is updated with the current consolidation area data. To do this, the actual quality of the data must be measured, using appropriate metrics and measuring software according to ISO section 7.6 (control of monitoring and measuring devices); we propose to use the set of metrics and measuring methods developed in CLIQ (see [Hin01]). These measurements must span both the (previous) target area data and the consolidation area data (the target area has not been updated yet!). They include consistency checks as in step 3 (and, if required, postprocessing as in step 4), since merging records may introduce new combinations of attribute values and thus new inconsistencies.

In a subsequent analysis phase, the results of the quality measurements have to be compared to the a priori specified quality requirements (resulting from step I). If data do not meet a given requirement, an appropriate action has to be taken (see step 8). Conflicts due to contradictory requirements (e.g. high timeliness vs. high consistency) have to be resolved, e.g. by data replication and different treatment of the replicas.

Furthermore, ISO section 8.2 postulates the measurement of the effectiveness of the integration and (data quality) measuring processes. (Strictly speaking, these process-oriented measurements should be assigned to a dedicated phase running parallel to the integration steps.) In our context, this can be done by means of reference data, for which both the aspired integration results (characteristics of the resulting data product) and the included quality deficiencies are known.

Measurement results concerning the effectiveness of products and processes have to be recorded and analyzed (ISO section 8.4), resulting, if necessary, in process-improving activities as described below.
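The comparison of measured values with the requirements from step I can be sketched as follows; the characteristic names, measured values, and thresholds are hypothetical placeholders, not output of the CLIQ measurement components:

# Sketch of the analysis phase of step 7: measured quality values are
# compared with the a priori specified requirements from step I, and every
# shortfall is reported so that step 8 can act on it.

requirements = {          # hypothetical minimum values from step I
    "completeness": 0.98,
    "consistency": 0.95,
    "timeliness": 0.90,
}
measured = {              # hypothetical results of the measurement phase
    "completeness": 0.99,
    "consistency": 0.91,
    "timeliness": 0.93,
}

nonconformities = {name: (measured[name], minimum)
                   for name, minimum in requirements.items()
                   if measured[name] < minimum}

if nonconformities:
    print("not released, handle in step 8:", nonconformities)
else:
    print("all requirements met, proceed to step 9")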

Step 8: Control of Nonconforming Data Products and Quality Improvement

In this final step before the target area update, data products that do not conform to given requirements must be treated appropriately, in accordance with ISO section 8.3. The following options may be taken:

- Sort out and re-request the data
- Restrict the use of the data to specific analysis scenarios, e.g. by flagging
- Eliminate the detected nonconformities and then continue at step 7 (subject to ISO section 8.3). In CLIQ, we developed a method that uses data mining techniques to eliminate inconsistencies and complete previously missing values semi-automatically (see [HW00] for details).

A possible disposition scheme covering these options is sketched at the end of this step.

While all these activities merely tackle the symptoms of a problem, further (cause-oriented) measures may be taken to increase the ability to fulfil quality requirements in the future, according to ISO section 8.5:

- Improve the integration process, especially by tuning process parameters (e.g. attribute mappings, SPC parameters, consistency rules, record linkage parameters, merging criteria).
- Improve the quality planning and quality control processes, e.g. by finding better means to capture user requirements, by optimizing measurement methods, or by improving feedback methods.
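As announced above, the three disposition options for nonconforming data products can be sketched as a simple per-record decision; the thresholds are hypothetical, and the repair branch merely stands in for the semi-automatic auditing method of [HW00]:

# Sketch of step 8: assign a disposition to each nonconforming consolidation
# area record. The thresholds are hypothetical; "repair" stands in for the
# semi-automatic elimination of nonconformities (after which step 7 is rerun).

def disposition(record):
    severity = record["max_severity"]        # worst finding from steps 3/7
    if record["repairable"]:
        return "repair"                      # eliminate nonconformity, redo step 7
    if severity >= 0.7:
        return "sort out and re-request"     # send an inquiry to the data source
    return "restricted use"                  # release flagged for limited analyses

findings = [
    {"global_id": 1001, "max_severity": 0.7, "repairable": False},
    {"global_id": 1002, "max_severity": 0.3, "repairable": False},
    {"global_id": 1003, "max_severity": 0.5, "repairable": True},
]
for record in findings:
    print(record["global_id"], "->", disposition(record))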

Step 9: Data Product Release and Target Area Update

Depending on the analysis results of step 7, the approved proportion of the data is now released (ISO section 8.2), i.e. the affected consolidation area records are flagged as "passed". The passed proportion of the data is then moved from the consolidation area to the target area, replacing obsolete target area data if necessary. With this step, the newly integrated data have been made available for customer use. Given a typical data warehouse system, they may now be loaded into the analysis-oriented (e.g. multidimensional) data structures of the data warehouse and attached data marts.
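Release and target area update can be sketched as flagging the approved consolidation area records and moving them into the target area; the areas are modelled here as plain in-memory lists purely for illustration, whereas CLIQ keeps them in relational tables:

# Sketch of step 9: flag approved consolidation area records as "passed" and
# move them to the target area, replacing obsolete versions with the same
# global ID. Areas are modelled as in-memory lists for illustration only.

def release(consolidation_area, target_area, approved_ids):
    passed = [dict(rec, status="passed")
              for rec in consolidation_area if rec["global_id"] in approved_ids]
    remaining = [rec for rec in consolidation_area
                 if rec["global_id"] not in approved_ids]
    ids = {rec["global_id"] for rec in passed}
    # replace obsolete target records, then append the newly released ones
    target_area = [rec for rec in target_area if rec["global_id"] not in ids]
    target_area.extend(passed)
    return remaining, target_area

consolidation = [{"global_id": 1, "name": "Meyer"}, {"global_id": 2, "name": "Schulz"}]
target = [{"global_id": 1, "name": "Meier", "status": "passed"}]
consolidation, target = release(consolidation, target, approved_ids={1})
print(target)         # the updated record for global_id 1
print(consolidation)  # record 2 awaits treatment in step 8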

Step 10: Analysis of Customer Feedback and Retraction of Data Products

The organization has to record customer feedback and evaluate it as a measure of DQMS performance (ISO section 8.2). If a deficiency of a released data product is detected during current use (i.e. by a customer), and this deficiency impairs the usability of the data product significantly, the organization has to retract the product from the target area (and the data warehouse) and "repair" it if possible (see step 8) before re-releasing it. If necessary, cause-oriented measures should be taken into account (see step 8).

3.2 DQMS Implementation

The CLIQ DQMS has been implemented as a software workbench that offers continuously quality-controlled, extensively automated support of the data integration process. It employs both commercial and self-developed components. Figure 8 illustrates the different layers of the DQMS architecture.

[Figure 8: Software Implementation of the DQMS. Layer 1: Microsoft SQL Server (business data, metadata); layer 2: ODBC, MS Repository; layer 3: Microsoft DTS, Vality Integrity, MLC++, ILOG Rules; layer 4: Data Migrator, Data Linker, Data Auditor, Rule Editor, Measurement Engine; layer 5: SPC Engine; layer 6: Domain-Specific Measurement Engine; layer 7: DQM Controller. The key distinguishes external, implemented, and partially implemented components.]

Layer 1 contains the data storage modules (based on Microsoft SQL Server) for business and meta data. Layer 2 offers interfaces to these storage modules. Business data are accessed by way of ODBC; metadata, represented in accordance with the Open Information Model (OIM) [MDC01], are accessed by way of the COM-based Microsoft Repository [Ber97].

Layers 4 and 5 form the DQMS core, their submodules being tailored to individual steps of the data integration process (Data Migrator: step 1; Data Linker: steps 5 and 6; Data Auditor: step 8; Rule Editor/Measurement Engine: steps 3 and 7; SPC Engine: step 2). In turn, they make use of the commercial components of layer 3, including the Microsoft Data Transformation Services [AL99], the ETL tool Integrity [Val01], the data mining class library MLC++ [KSD96], and the rule engine ILOG Rules [Ilo01].

To be able to integrate domain-specific quality control methods, a so-called Domain-Specific Measurement Engine was introduced in layer 6. Data quality planning, control, and improvement are coordinated by the DQM Controller in the top layer of the software system.

Those software components that have been developed within CLIQ have been implemented using Microsoft Visual C++ 6.0.

4 Related Work

The importance of data quality for organizational success has been underestimated for a long time. For this reason, quality management in information processing is not nearly as established as in other disciplines, e.g. manufacturing. Looking at the related work concerning DQM, a clear distinction can be made between commercial approaches on the one hand and research activities on the other.

4.1 Commercial Approaches

In recent years, a large number of ETL tools (cf. Sect. 3) hit the market, claiming to simplify the process of populating a data warehouse significantly. But although some of these tools provide sophisticated graphical interfaces and comprehensive libraries of transformation functions, they are insufficient for DQM for the following reasons:

- Their functionality is usually limited to the quality characteristic of uniformity (cf. Fig. 1), provided by typical data migration functions like mapping of attributes, standardization of addresses, and value conversions.
- They do not support any DQM functions, like quality planning, control, assurance, and improvement.

Nevertheless, an ETL tool can be particularly useful as a single module of an overall DQMS (cf. step 1 in Sect. 3.1). Some prominent examples of commercial ETL tools are:

- DataStage Suite (Ardent Software)
- DQ Plus (Group 1 Software)
- Genio Suite (Hummingbird)
- i.d.Centric Data Quality Suite (Firstlogic)
- InfoRefiner/InfoTransport/InfoPump (Computer Associates, formerly Platinum)
- Integrity (Vality)
- Pure*Integrate (Oracle)
- Trillium Software System (Trillium)
- Warehouse Workbench (Systemfabrik)

A quite extensive list of ETL tools can be found in [Eng99].

4.2 Research Approaches

In 1992, DeLone and McLean [DM92] set up a model of information systems success which included the quality of data as a critical success factor, along with system quality.

In the following years, two major research projects emerged, namely MIT's Total Data Quality Management (TDQM) program [Wan98], active since 1992, and the ESPRIT project Foundations of Data Warehouse Quality (DWQ) [JJQV98], which ran from 1996 to 1999.

TDQM claims to establish a theoretical foundation of data quality. Based on the enterprise philosophy of Total Quality Management [Dem86], a so-called TDQM cycle is derived from Deming's classical PDCA cycle (plan, do, check, act), comprising phases of data quality planning, measurement, analysis, and improvement. Later, this concept was extended by [Eng99] and [Hel00]. Additionally, TDQM identifies a set of data quality characteristics (called quality dimensions) based on empirical studies and proposes some simple metrics for data quality measurement. Finally, TDQM provides an approach to enrich the relational data model with data quality information by introducing meta relations. Although TDQM offers some interesting ideas, most of the concepts remain on a quite superficial level, oriented much more towards business economics than towards information technology.

The DWQ project, on the other hand, is tailored to the data warehousing world. Not only aspects of data quality are considered, but also aspects of schema and software quality. There is a clear distinction between conceptual, logical, and physical data models, the semantics of which are explicitly described as metadata. Integration of source schemata is supported by so-called interschema knowledge networks which specify relationships between schemata. Quality assessment is done using the goal-question-metric (GQM) approach of [BW84]. With GQM, quality goals are specified and then mapped to "quality questions" (in this case: queries on a meta database). To get a response to a quality question, (simple) metrics are used. Since DWQ covers a wide range of aspects, it inevitably shows some shortcomings relating to specific questions of detail, e.g. concerning sophisticated data quality metrics.

Apart from TDQM and DWQ, several minor research activities have been launched during the last two to three years, reflecting the rising awareness of the importance of DQM: In the CARAVEL project [GFSS00], a metadata-based cleansing system called AJAX has been developed, including modules for the specification, optimization, execution, and explanation of cleansing tasks. The main focus of CARAVEL lies on the elimination of duplicates (quality characteristic "absence of redundancy", cf. Fig. 1). This approach is similar to the IntelliClean project [LLL00], the Sorted-Neighborhood Method of [HS95], and related statistical methods [Jar89]. The project HiQiQ [NLF99] deals with the inclusion of data quality information into query processing in distributed databases, based on a wrapper/mediator architecture. [RCH99] describe an interactive framework for data cleansing, called Potter's Wheel A-B-C, that tightly integrates data transformation and error detection. Potter's Wheel A-B-C allows the user to gradually build transformations by adding or undoing transforms through a spreadsheet-like interface. In the background, data deficiencies are flagged as they are found. [MM00] and [DJ99] use data mining methods to detect errors automatically, comparable with the commercial tool WizRule [Wiz01].

All in all, even though there are quite a few projects dealing with data quality aspects, there is still a serious lack of formally founded methods to measure data quality. Furthermore, there is no process model supporting quality-driven data integration.

5 Conclusion

In this paper, we have proposed a quality management system for the integration of data from heterogeneous sources. Following a rigorous analogy between the integration process and a conventional manufacturing process, we were able to map the requirements of the ISO 9001:2000 standard to our model and thus implement a standard-compliant data quality management system for data warehouse systems.

During the next months, we will evaluate the DQMS by means of the epidemiological cancer registry of Lower Saxony [HP99], which features a typical data warehouse architecture. Further future work will include the adaptation of the DQMS to the metadata standard CWM (Common Warehouse Metamodel) [OMG00].

References

[AL99] Awalt, D., Lawton, B. K.: Introduction to Data Transformation Services. In: SQL Server Magazine, 1 (1): 34-36, 1999.

[Ber97] Bernstein, P. A. et al.: The Microsoft Repository. In: Proc. of the 23rd Intl. Conf. on Very Large Databases (VLDB '97), Athens, Greece, 1997.

[BT99] Ballou, D. P., Tayi, G. K.: Enhancing Data Quality in Data Warehouse Environments. In: Communications of the ACM, 42 (1): 73-78, 1999.

[BW84] Basili, V. R., Weiss, D. M.: A Methodology for Collecting Valid Software Engineering Data. In: IEEE Transactions on Software Engineering, 10 (6): 728-738, 1984.

[DeM82] DeMarco, T.: Controlling Software Projects. Yourdon Press, New York, 1982.

[Dem86] Deming, W. E.: Out of the Crisis. MIT, Center for Advanced Engineering Study, 1986.

[DJ99] Dasu, T., Johnson, T.: Hunting of the Snark: Finding Data Glitches with Data Mining Methods. In: Proc. Information Quality IQ1999, MIT, Boston, MA, 1999.

[DM92] DeLone, W. H., McLean, E. R.: Information Systems Success: The Quest for the Dependent Variable. In: Information Systems Research, 3 (1): 60-95, 1992.

[Eng99] English, L. P.: Improving Data Warehouse and Business Information Quality. Wiley, New York, 1999.

[GFSS00] Galhardas, H., Florescu, D., Shasha, D., Simon, E.: Declaratively Cleaning your Data using AJAX. In: Journ. Bases de Données Avancées, Oct. 2000.

[Hel00] Helfert, M.: Massnahmen und Konzepte zur Sicherung der Datenqualitaet. In: Data Warehousing Strategie - Erfahrungen, Methoden, Visionen, pp. 61-77, Springer, Berlin, 2000 (in German).

[Hin01] Hinrichs, H.: Datenqualitaetsmanagement in Data Warehouse-Umgebungen. In: Datenbanksysteme in Buero, Technik und Wissenschaft, 9. GI-Fachtagung BTW 2001, Oldenburg, pp. 187-206, Springer, Berlin, 2001 (in German).

[HP99] Hinrichs, H., Panienski, K.: Experiences with Knowledge-Based Data Cleansing at the Epidemiological Cancer Registry of Lower-Saxony. In: XPS-99: Knowledge-Based Systems - Survey and Future Directions, pp. 218-226, LNAI 1570, Springer, Berlin, 1999.

[HS95] Hernandez, M. A., Stolfo, S. J.: The Merge/Purge Problem for Large Databases. In: Proc. of the 1995 ACM SIGMOD Conference, 1995.

[HW00] Hinrichs, H., Wilkens, T.: Metadata-Based Data Auditing. In: Data Mining II (Proc. of the 2nd Intl. Conf. on Data Mining, Cambridge, UK), pp. 141-150, WIT Press, Southampton, 2000.

[Ilo01] Ilog Inc.: http://www.ilog.com/products/rules, 2001.

[Inm92] Inmon, W. H.: Building the Data Warehouse. Wiley, New York, 1992.

[ISO00a] International Organization for Standardization: ISO 9000:2000: Quality Management Systems - Fundamentals and Vocabulary. Beuth, Berlin, 2000.

[ISO00b] International Organization for Standardization: ISO 9001:2000: Quality Management Systems - Requirements. Beuth, Berlin, 2000.

[Jar89] Jaro, M. A.: Advances in Record Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. In: Journal of the American Statistical Association, 84: 414-420, 1989.

[JJQV98] Jarke, M., Jeusfeld, M. A., Quix, C., Vassiliadis, P.: Architecture and Quality in Data Warehouses. In: Proc. of the 10th Intl. Conf. CAiSE*98, Pisa, Italy, pp. 93-113, Springer, Berlin, 1998.

[Kim98] Kimball, R.: The Data Warehouse Lifecycle Toolkit. Wiley, New York, 1998.

[KSD96] Kohavi, R., Sommerfield, D., Dougherty, J.: Data Mining using MLC++ - A Machine Learning Library. In: Tools with AI 1996, pp. 234-245, 1996.

[LLL00] Lee, M. L., Ling, T. W., Low, W. L.: IntelliClean - A Knowledge-Based Intelligent Data Cleaner. In: Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, Boston, MA, 2000.

[MDC01] Meta Data Coalition: http://www.MDCinfo.com, 2001.

[MM00] Maletic, J. I., Marcus, A.: Data Cleansing - Beyond Integrity Analysis. In: Proc. of the Conf. on Information Quality IQ2000, MIT, Boston, MA, pp. 200-209, 2000.

[NLF99] Naumann, F., Leser, U., Freytag, J. C.: Quality-Driven Integration of Heterogeneous Information Sources. In: Proc. of the 1999 Intl. Conf. on Very Large Databases (VLDB '99), Edinburgh, UK, 1999.

[OMG00] Object Management Group: Common Warehouse Metamodel (CWM) Specification, 2000.

[RCH99] Raman, V., Chou, A., Hellerstein, J. M.: Scalable Spreadsheets for Interactive Data Analysis. In: Proc. of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD), Philadelphia, 1999.

[Sha99] Shanks, G.: Semiotic Approach to Understanding Representation in Information Systems. In: Proc. of the IS Foundations Workshop, ICS Macquarie University, Sydney, 1999.

[She31] Shewhart, W. A.: Economic Control of Quality of Manufactured Product. D. Van Nostrand, New York, 1931.

[Val01] Vality Technology Inc.: http://www.vality.com, 2001.

[Wan98] Wang, R. Y.: A Product Perspective on Total Data Quality Management. In: Communications of the ACM, 41 (2): 58-65, 1998.

[Wiz01] WizSoft Inc.: http://www.wizsoft.com, 2001.
