Data Quality and Cleaning in Database Applications

Lin Li

A thesis submitted in partial fulfilment of the requirements of Edinburgh Napier University for the award of Doctor of Philosophy

School of Computing

September 2012

ABSTRACT

Today, data plays an important role in people's daily activities. With the help of applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many reasons for such failures, such as poor system infrastructure design or query performance, but nothing is more certain to yield failure than a lack of concern for the issue of data quality. High-quality data is a key to today's business success. The quality of any large real-world data set depends on a number of factors, among which the source of the data is often the crucial factor. It has now been recognized that an inordinate proportion of data in most data sources is dirty. Obviously, a database application with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unreliable. In order to ensure high quality of data, enterprises need to have a process, methodologies and resources to monitor and analyze the quality of their data, as well as methodologies for preventing and/or detecting and repairing dirty data. This thesis focuses on the improvement of data quality in database applications with the help of current data cleaning methods. It provides a systematic and comparative description of the research issues related to the improvement of the quality of data, and addresses a number of research issues related to data cleaning.

In the first part of the thesis, the related literature on data cleaning and data quality is reviewed and discussed. Building on this research, a rule-based taxonomy of dirty data is proposed in the second part of the thesis. The proposed taxonomy not only covers more dirty data types than existing taxonomies but is also the basis on which the proposed method for solving the Dirty Data Selection (DDS) problem during the data cleaning process was developed. This helps us to design the DDS process in the proposed data cleaning framework described in the third part of the thesis. This framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness of data cleaning as well as the degree of automation during the data cleaning process.

Finally, a set of approximate string matching algorithms is studied and experimental work has been undertaken. Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years. The experimental work in the thesis confirms the statement that there is no clear best technique. It shows that the characteristics of the data, such as the size of a dataset, the error rate in a dataset, the type of strings in a dataset and even the type of typo in a string, have a significant effect on the performance of the selected techniques. In addition, the characteristics of the data also affect the selection of suitable threshold values for the selected matching algorithms. The achievements based on these experimental results underpin the design of the 'algorithm selection mechanism' in the data cleaning framework, which enhances the performance of data cleaning systems in database applications.


ACKNOWLEDGEMENT

I would like to thank my supervisors Dr. Taoxin Peng, Professor Jessie Kennedy, and my PhD panel chairs for all their help, support, expertise and understanding throughout my period of PhD study. I would also like to thank all staff at the School of Computing at Napier University, especially the members of the Centre for Information and Software Systems group, for providing me with valuable feedback and suggestions during my PhD study.

Finally, great appreciation and thanks to my mum and my dad for their consistent spiritual support. I am proud of them and appreciate what they contribute to my life.


PUBLICATIONS FROM THE PHD WORK

[1] Li, L., Peng, T., & Kennedy, J. (2010). Improving Data Quality in Data Warehousing Applications. Proceedings of the 12th International Conference on Enterprise Information Systems, Funchal, Madeira Portugal.

[2] Li, L., Peng, T., & Kennedy, J. (2011). A Rule Based Taxonomy of Dirty Data. GSTF International Journal on Computing, 1(2), 140-148.

[3] Peng, T., Li, L., & Kennedy, J. (2011). An Evaluation of Name Matching Techniques. Proceedings of the 2nd Annual International Conference on Business Intelligence and Data Warehousing, Singapore.

[4] Peng, T., Li, L., & Kennedy, J. (2012). A Comparison of Techniques for Name Matching. International Journal on Computing, 2(1), 55-61.


TABLE OF CONTENTS

Abstract ...... II
Acknowledgement ...... IV
Publications from the PhD work ...... V
Table of Contents ...... VI
List of Figures ...... IX
List of Tables ...... X
Chapter 1 Introduction ...... 1
1.1 Data Quality ...... 2
1.2 Data Cleaning ...... 3
1.3 Objectives of the research ...... 9
1.4 Contributions to knowledge ...... 11
1.5 The structure of the thesis ...... 13
Chapter 2 Literature review and related work ...... 15
2.1 Dirty data ...... 15
2.1.1 Müller and Freytag's Data Anomalies ...... 15
2.1.2 Rahm and Do's classification of data quality problems ...... 17
2.1.3 Kim et al's taxonomy of dirty data ...... 19
2.1.4 Oliveira et al's taxonomy of data quality problems ...... 23
2.2 Methods used for Data cleaning ...... 27
2.3 Existing approaches for Data cleaning ...... 32
2.4 Data quality, data quality dimensions and other related concepts ...... 55
2.4.1 Data Quality ...... 56
2.4.2 Data quality dimensions ...... 58
2.4.3 Impacts and costs of Data quality ...... 70
2.4.3.1 The impact ...... 72
2.4.3.2 The cost ...... 73
2.4.4 Data quality assessment ...... 77
2.5 Conclusion ...... 85
Chapter 3 A rule-based taxonomy of dirty data ...... 89
3.1 Data quality rules ...... 90
3.2 Dirty data types ...... 96
3.3 The taxonomy ...... 103
3.4 Conclusion ...... 106
Chapter 4 A Data cleaning framework ...... 108
4.1 Introduction ...... 108
4.2 Data cleaning framework ...... 111
4.2.1 Basic ideas ...... 111
4.2.2 Some definitions ...... 113
4.2.3 The framework ...... 116
4.3 A case study ...... 129
4.4 Conclusion ...... 143
Chapter 5 Experiment and evaluation ...... 146


5.1 Introduction ...... 148
5.2 Related work ...... 150
5.3 Matching techniques ...... 154
5.4 Experiment and Experimental Results ...... 157
5.4.1 Datasets preparation ...... 160
5.4.2 Measures ...... 163
5.4.3 Experimental results ...... 163
5.4.3.1 Experimental results for Last name strings ...... 163
5.4.3.2 Experimental results for 2300 First name/Last name strings ...... 174
5.5 Evaluation ...... 182
5.5.1 Last name experimental results evaluation ...... 182
5.5.2 2300 first/last name experimental result evaluation ...... 187
5.6 Summary ...... 190
5.7 Conclusion ...... 191
Chapter 6 Conclusion and future work ...... 194
6.1 Novelties and contributions ...... 194
6.2 Future work ...... 198
REFERENCES ...... 202
Appendix A ...... 213
A.1: Data of the maximum F score for different techniques on the different last name datasets ...... 213
A.2: Data of threshold value selection for each technique to obtain the maximum F score in different last name datasets ...... 215
A.3: Data of average time cost for the five techniques on four different sizes of datasets (9454, 7154, 5000, and 3600) ...... 217
A.4: Data of the maximum F score for the 2300 first name datasets with the three different types of typos ...... 218
A.5: Data of the maximum F score for the 2300 last name datasets with the three different types of typos ...... 219
A.6: Accuracy relative to the value of threshold on different last name datasets with different error rates for Levenshtein algorithm ...... 220
A.7: Accuracy relative to the value of threshold on different last name datasets with different error rates for Jaro algorithm ...... 221
A.8: Accuracy relative to the value of threshold on different last name datasets with different error rates for Jaro-Winkler algorithm ...... 222
A.9: Accuracy relative to the value of threshold on different last name datasets with different error rates for Q-Gram algorithm ...... 223
A.10: Accuracy relative to the value of threshold on different last name datasets with different error rates for Smith-Waterman algorithm ...... 224
A.11: Algorithm selection for last name and first name datasets ...... 225
Appendix B ...... 227
B.1: Business entity rules ...... 227
B.2: Business attribute rules ...... 228
B.3: Data dependency rules ...... 229


B.4: Data validity rules ...... 230


LIST OF FIGURES

Fig.1.1 A data cleaning process ...... 6
Fig.2.1 Potter's Wheel Architecture ...... 33
Fig.2.2 Mapping operator from AJAX ...... 36
Fig.2.3 Matching operator from AJAX ...... 37
Fig.2.4 XADL definition of a scenario, as exported by ARKTOS ...... 41
Fig.2.5 SADL definition of a scenario, as exported by ARKTOS ...... 42
Fig.2.6 An example of the duplicate identification rule in IntelliClean ...... 45
Fig.2.7 Initial Febrl user interface ...... 48
Fig.2.8 A data quality assessment model [108] ...... 80
Fig.4.1 A data cleaning framework ...... 118
Fig.4.2 The DDS Process ...... 122
Fig.4.3 The single-source process ...... 125
Fig.4.4 The multi-source process ...... 126
Fig.5.1 Approximate string matching algorithms from Febrl ...... 146
Fig.5.2 Effectiveness results for 9454 last name dataset ...... 165
Fig.5.3 Effectiveness results for 7154 last name dataset ...... 166
Fig.5.4 Effectiveness results for 5000 last name dataset ...... 166
Fig.5.5 Effectiveness results for 3600 last name dataset ...... 167
Fig.5.6 Effectiveness results for 2300 last name dataset ...... 168
Fig.5.7 Effectiveness results for 1000 last name dataset ...... 169
Fig.5.8 Effectiveness results for 500 last name dataset ...... 170
Fig.5.9 Effectiveness results for 200 last name dataset ...... 171
Fig.5.10 Timing performance in 9454 last name dataset ...... 172
Fig.5.11 Timing performance in 7154 last name dataset ...... 173
Fig.5.12 Timing performance in 5000 last name dataset ...... 173
Fig.5.13 Timing performance in 3600 last name dataset ...... 174
Fig.5.14 Effectiveness results for 2300 first name datasets with TFP typo ...... 175
Fig.5.15 Effectiveness results for 2300 first name datasets with TLP typo ...... 176
Fig.5.16 Accuracy results for 2300 first name datasets with TR typo ...... 177
Fig.5.17 Accuracy results for 2300 last name datasets with TFP typo ...... 178
Fig.5.18 Accuracy results for 2300 last name datasets with TLP typo ...... 179
Fig.5.19 Accuracy results for 2300 last name datasets with TR typo ...... 180
Fig.5.20 Maximum F score comparison on last name datasets size 3600 ...... 184
Fig.5.21 Maximum F score comparison on last name datasets size 200 ...... 184
Fig.5.22 Performance comparisons on 2300 first name datasets with TFP typo ...... 188
Fig.5.23 Performance comparisons between first name datasets and last name datasets with TFP typos under three different error rates ...... 190
Fig.A.1 Accuracy relative to threshold value for Levenshtein algorithm ...... 220
Fig.A.2 Accuracy relative to threshold value for Jaro algorithm ...... 221
Fig.A.3 Accuracy relative to threshold value for Jaro-Winkler algorithm ...... 222
Fig.A.4 Accuracy relative to threshold value for Q-Gram algorithm ...... 223
Fig.A.5 Accuracy relative to threshold value for Smith-Waterman algorithm ...... 224


LIST OF TABLES

Table 2.1 Data anomalies from Müller and Freytag ...... 17
Table 2.2 Dirty data types from Rahm and Do ...... 19
Table 2.3 Dirty data types from Kim et al ...... 22
Table 2.4 Oliveira et al's dirty data set ...... 25
Table 2.5 Summary of the five approaches ...... 51
Table 2.6 An example of four student records of a university in the UK ...... 59
Table 2.7 Data quality dimensions ...... 60
Table 2.8 Data quality dimensions from academics' view [71] ...... 65
Table 2.9 Data quality dimensions from practitioners' view [71] ...... 67
Table 2.10 Cost from low quality data ...... 76
Table 2.11 Cost of assuring data quality ...... 76
Table 2.12 An analogy between physical products and data products ...... 79
Table 2.13 Comparison between objective and subjective assessment [4] ...... 80
Table 2.14 Root causes of poor data quality ...... 81
Table 3.1 Data quality rules ...... 93
Table 3.2 Data quality rules from Adelman et al's work ...... 94
Table 3.3 A comparison ...... 95
Table 3.4 Dirty data types ...... 103
Table 3.5 Rule-based taxonomy of dirty data ...... 104
Table 4.1 Records in city-A ...... 130
Table 4.2 Records in city-B ...... 130
Table 4.3 Data cleaning activities ...... 132
Table 4.4 Data quality dimension and data quality rules ...... 133
Table 4.5 Data quality dimensions and dirty data types ...... 134
Table 4.6 An example of data quality dimensions and dirty data types ...... 135
Table 4.7 The grouping results ...... 136
Table 4.8 An example of the two sub-groups ...... 137
Table 4.9 An example of sorting key ...... 140
Table 5.1 Recommendations by Peter Christen ...... 152
Table 5.2 Datasets for last name experiments ...... 161
Table 5.3 First name datasets with different types of typos ...... 162
Table 5.4 Last name datasets with different types of typos ...... 162
Table 5.5 Algorithms' order for 9454 last name dataset ...... 164
Table 5.6 Algorithms' order for 7154/5000 last name dataset ...... 165
Table 5.7 Algorithms' order for 3600 last name dataset ...... 167
Table 5.8 Algorithms' order for 2300 last name dataset ...... 168
Table 5.9 Algorithms' order for 1000 last name dataset ...... 168
Table 5.10 Algorithms' order for 500 last name dataset ...... 169
Table 5.11 Algorithms' order for 200 last name dataset ...... 170
Table 5.12 Algorithms' order for 2300 first name datasets with TFP typo ...... 175
Table 5.13 Algorithms' order for 2300 first name datasets with TLP typo ...... 176
Table 5.14 Algorithms' order for 2300 first name datasets with TR typo ...... 176


Table 5.15 Algorithms' order for 2300 last name datasets with TFP typo ...... 177
Table 5.16 Algorithms' order for 2300 last name datasets with TLP typo ...... 178
Table 5.17 Algorithms' order for 2300 last name datasets with TR typo ...... 179
Table 5.18 Threshold value selection for first/last name dataset with TFP typos ...... 180
Table 5.19 Threshold value selection for first/last name dataset with TLP typos ...... 181
Table 5.20 Threshold value selection for first/last name dataset with TR typos ...... 181
Table 5.21 Algorithms' order for low error rate last name datasets ...... 185
Table 5.22 Algorithms' order for medium error rate last name datasets ...... 186
Table 5.23 Algorithms' order for high error rate last name datasets ...... 186
Table A.1 Accuracy results for last name datasets ...... 214
Table A.2 Threshold value selection for last name datasets ...... 216
Table A.3 Time cost in last name datasets ...... 217
Table A.4 Accuracy results for 2300 first name datasets with different typos ...... 218
Table A.5 Accuracy results for 2300 last name datasets with different typos ...... 219
Table A.6 Algorithm selection and threshold values for last name datasets and first name datasets ...... 226
Table B.1 Business entity rules ...... 227
Table B.2 Business attribute rules ...... 228
Table B.3 Data dependency rules ...... 229
Table B.4 Data validity rules ...... 231


CHAPTER 1 INTRODUCTION

Today, data plays an important role in people's daily activities. With the help of database applications such as decision support systems and customer relationship management (CRM) systems, useful information or knowledge can be derived from large quantities of data. However, investigations show that many such applications fail to work successfully. There are many reasons for such failures, such as poor system infrastructure design or query performance, but nothing is more certain to yield failure than a lack of concern for the issue of data quality [1].

For example, a 2001 PricewaterhouseCoopers survey of 599 companies in New York found that 75% of them had suffered economic losses because of data quality problems. Although their businesses all depend on data-driven systems such as customer relationship management and supply chain management systems, only 37% of the companies were "very confident" in the quality of their own data, and only 15% were "very confident" in the quality of their trading partners' data [2].

There is a growing awareness that high quality of data is key to today's business success. The quality of any large real-world data set depends on a number of factors [3-5], among which the source of the data is often the crucial factor. It has now been recognized that an inordinate proportion of data in most data sources is dirty [6]. For example, some investigations show that errors in a large data set are common and are typically around 5% unless extreme measures have been taken [7, 8]. Due to the 'garbage in, garbage out' principle, dirty data will distort information obtained from it [9]. Obviously, a database application such as a data warehouse with a high proportion of dirty data is not reliable for the purpose of data mining or deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also not reliable.


Therefore, before such data sources are used, the dirty data in them should be cleaned. That is, to ensure high quality of data, enterprises need to have a process, methodologies and resources to monitor and analyze the quality of data, and methodologies for preventing or detecting and subsequently repairing dirty data.

This thesis provides a systematic and comparative description of the research related to the improvement of the quality of data, and has addressed a number of research issues related to data cleaning. In the following sections, we briefly introduce fundamental concepts and research issues related to data cleaning and data quality.

1.1 Data Quality

Investigations into the problems related to data quality can be traced back to as early as the late 1960s, when a mathematical theory for considering the duplicate problem in statistical data sets was proposed by Fellegi and Sunter [10]. However, it is only since the 1990s that the data quality problem has been considered in computer science, in connection with data stored in databases and data warehouse systems. More and more people have become aware that poor data quality is one of the main reasons for the failure of a database project. Though a variety of definitions for data quality have been given [3, 8, 11], studies show that still no formal definition for data quality exists [8]. From the literature, data quality can be defined as "fitness for use", i.e., the ability of data to meet the user's requirements. The nature of this definition directly implies that the concept of data quality is relative. Orr states that "the problem of data quality is fundamentally intertwined in how our system fits into the real world; in other words, with how users actually use the data in the system" [8]. This has two interpretations: one is that if a data set is available and is as good as it can be, there are no other options than to use it. The other is that what is considered quality data in one case may not be sufficient in another case. For example, an analysis of the financial position of a company may require data in units of thousands of pounds, while an auditor requires precision to the penny; i.e., in real life, it is the business policy or business rules that determine whether or not the data is of quality.

Generally speaking, data quality can be measured or assessed with a set of characteristics or quality properties called data quality dimensions [4]. Some commonly used data quality dimensions include accuracy, completeness, timeliness, and consistency, which can be refined as:

- Accuracy – conformity of the recorded value with the actual value;
- Timeliness – the recorded value is not out of date;
- Completeness – all values for a certain variable are recorded;
- Consistency – the representation of data is uniform in all cases.

Therefore, data quality can be considered as a multi-dimensional concept. These data quality dimensions measure data quality from different angles. Within each of these dimensions, a set of data quality rules generated from real business policies can be used to make an assessment of the data quality reflected by that dimension [12]. For example, a data quality rule defined as 'the value of date must follow the pattern of DD/MM/YYYY' can be used for the consistency dimension. These data quality dimensions and data quality rules will be reviewed in detail in Chapter 2 and Chapter 3 respectively.
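To make this concrete, the short Python sketch below (Python is used for illustration only) checks a batch of date values against the DD/MM/YYYY consistency rule quoted above and reports the violations; the sample values are invented for the example.

    # Illustrative check of the consistency rule
    # 'the value of date must follow the pattern of DD/MM/YYYY'.
    import re
    from datetime import datetime

    DATE_RULE = re.compile(r"^\d{2}/\d{2}/\d{4}$")

    def violates_date_rule(value):
        # The value breaks the rule if it does not match the pattern at all,
        # or if it matches the pattern but is not a real calendar date.
        if not DATE_RULE.match(value):
            return True                     # e.g. '1988-01-01' or '5/6/1988'
        try:
            datetime.strptime(value, "%d/%m/%Y")
            return False
        except ValueError:
            return True                     # e.g. '31/02/1988'

    for d in ["01/01/1988", "1988-01-01", "31/02/1988", "5/6/1988"]:
        print(d, "->", "violation" if violates_date_rule(d) else "ok")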

1.2 Data Cleaning

There is no commonly agreed formal definition of data cleaning. Depending on the particular area in which data cleaning has been applied, various definitions have been given. The major areas that include data cleaning as part of their defining processes are data warehousing, knowledge discovery in databases (KDD) and total data/information quality management (TDQM).


Within the data warehousing field, data cleaning is typically employed when several databases are merged. Records referring to the same entity are often represented in different formats in different data sets. Thus, duplicate records will appear in the merged database. The issue is to identify and eliminate these duplicates, a problem known as the merge/purge problem [13]. Other instances of this problem are also referred to as record linkage, semantic integration, instance identification or the object identity problem in the literature [14]. A variety of methods have been proposed to address this issue: knowledge bases [9], regular expression matches and user-defined constraints [15], filtering [16], and others [17-19].
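To give a flavour of the merge/purge task, the minimal sketch below normalises records from two hypothetical sources and compares them pairwise with a simple similarity measure, so that differently formatted records describing the same entity are flagged as candidate duplicates. The records, fields and threshold are invented for illustration, and the sketch does not reproduce any of the methods cited above.

    # A minimal illustration of the merge/purge idea: flag candidate duplicates
    # between two merged record sets despite formatting differences.
    from difflib import SequenceMatcher

    def normalise(record):
        # Crude normalisation: lower-case, drop punctuation, collapse spaces.
        clean = lambda s: " ".join(s.lower().replace(".", "").replace(",", " ").split())
        return tuple(clean(field) for field in record)

    def similarity(a, b):
        # Average field-by-field similarity of two normalised records.
        return sum(SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)) / len(a)

    source_a = [("John Smith", "Edinburgh"), ("Ann Lee", "Glasgow")]
    source_b = [("Smith, John", "Edinburgh"), ("Anne Lee", "Glasgow")]

    THRESHOLD = 0.75                          # illustrative cut-off
    for ra in source_a:
        for rb in source_b:
            score = similarity(normalise(ra), normalise(rb))
            if score >= THRESHOLD:
                print("candidate duplicates:", ra, "~", rb, round(score, 2))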

In the KDD process, data cleaning is regarded as a first step or a pre-processing step. However, no precise definition of, or perspective on, the data cleaning process is given, and data cleaning activities are performed in a very domain-specific fashion. For example, Simoudis et al [20] defined data cleaning as the process that implements computerized methods of examining databases, detecting missing and incorrect data, and correcting errors. In data mining, data cleaning is emphasized with respect to the garbage in, garbage out principle, and data mining's own techniques, such as outlier detection, where the goal is to find exceptions [21, 22], can be used in data cleaning.
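As a toy illustration of how an outlier detection technique can serve data cleaning, the sketch below flags values of a numeric attribute that lie unusually far from the mean; the cut-off of 2.5 standard deviations and the sample ages are arbitrary choices made for the example.

    # Toy outlier detection on a numeric attribute: values far from the mean
    # are flagged as exceptions worth inspecting during cleaning.
    from statistics import mean, stdev

    def find_outliers(values, k=2.5):
        # Flag values lying more than k standard deviations from the mean.
        m, s = mean(values), stdev(values)
        if s == 0:
            return []
        return [v for v in values if abs(v - m) / s > k]

    ages = [21, 22, 23, 22, 24, 20, 23, 22, 21, 208]   # 208 is a likely entry error
    print(find_outliers(ages))                         # -> [208]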

Total data quality management is an area of interest both within the research and business communities. From the literature, the data quality issue and its integration into the business process are tackled from various points of view [4, 7, 8, 23-26]. It is also referred to as the enterprise data quality management problem. However, none of the literature refers to the data cleaning problem explicitly. Most of this work deals with process management issues from the data quality perspective, and the rest with the definition of data quality. Of particular interest in this area, the definition of data quality can help to define the data cleaning process to some extent. For example, within the model of data life cycles proposed by Levitin and Redman [24], the data acquisition and data usage cycles contain the following series of activities: assessment, analysis, adjustment, and discarding of data. This series of activities proposed in Levitin and Redman's model defines the data cleaning process from the perspective of data quality. Fox et al [23] proposed four data quality dimensions of data, i.e., accuracy, currentness, completeness and consistency. The correctness of data is defined in terms of these dimensions. Thus, the data cleaning process within Fox et al's data quality framework can be defined as the process that assesses the correctness of data and improves its quality.

With the above in mind and following the related literature [27, 28], data cleaning must be viewed as a process which is tied directly to data acquisition and definition, or which is applied to improve data quality in an existing system. For example, in Müller and Freytag's work, comprehensive data cleaning is defined as the entirety of operations performed on existing data to remove anomalies and receive a data collection that is an accurate and unique representation of the mini-world [27]. The three major steps within such a data cleaning process are (i) define and determine error types, (ii) search and identify error instances, and (iii) correct the uncovered errors. Müller and Freytag describe four major steps within the process of data cleaning: (i) auditing the data to identify the types of anomalies reducing data quality, (ii) choosing appropriate methods to automatically detect and remove them (specification of data cleaning), (iii) applying the methods to the tuples in the data collection (execution of data cleaning), and (iv) the post-processing or control step, where the results are checked and exceptions for tuples not corrected within the actual processing are handled.

The following figure (Fig.1.1) demonstrates these four major steps in the data cleaning process.


Fig.1.1 A data cleaning process: define and determine error types → search and identify error instances → correct the uncovered errors → post-processing and controlling

Each of these phases constitutes a set of complex problems, and a wide variety of specialized methods and technologies can be associated with and applied during each phase. In this thesis, the main focus is on the first two aspects, i.e., defining and determining error types, and searching for and identifying error instances. The latter aspects are very difficult to automate outside of a strict and well-defined domain [21].
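As a schematic illustration only, the skeleton below chains the four phases of Fig.1.1 on a toy record set; the checks and repairs are deliberately simplistic placeholders rather than the methods of any particular cleaning tool.

    # Schematic skeleton of the four phases of Fig.1.1 on a toy record set.
    records = [
        {"name": "John Smith", "age": 25},
        {"name": "", "age": 24},              # missing value
        {"name": "Sally Brown", "age": 250},  # implausible value
    ]

    def define_error_types():
        # Phase 1: decide which dirty data types to look for.
        return {"missing name": lambda r: r["name"] == "",
                "implausible age": lambda r: not 0 < r["age"] < 120}

    def identify_error_instances(records, error_types):
        # Phase 2: search the data and flag suspected dirty instances.
        return [(r, label) for r in records
                for label, test in error_types.items() if test(r)]

    def correct_errors(suspects):
        # Phase 3: repair what can be repaired automatically; defer the rest.
        corrected, exceptions = [], []
        for record, label in suspects:
            if label == "implausible age":
                record["age"] = None          # crude automatic repair
                corrected.append((record, label))
            else:
                exceptions.append((record, label))
        return corrected, exceptions

    def post_process(corrected, exceptions):
        # Phase 4: check the results and route exceptions to a human expert.
        return {"auto-corrected": len(corrected), "for expert review": len(exceptions)}

    suspects = identify_error_instances(records, define_error_types())
    print(post_process(*correct_errors(suspects)))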

(i) Define and determine error types

Research shows that many enterprises do not pay adequate attention to the existence of dirty data and have not applied useful methodologies to ensure high quality data for their applications. One of the reasons is a lack of appreciation of the types and extent of dirty data [29]. Therefore, in order to improve data quality, it is necessary to understand the wide variety of dirty data that may exist within the data sources as well as how to deal with them. This step tries to discover the possible dirty data types that may exist among different data sources. From the literature, some work has been undertaken exclusively to identify problems (dirty data types) that affect data quality and has resulted in different taxonomies of dirty data [6, 27, 28, 30]. These works are reviewed in detail in Chapter 2.

Data cleaning is a labour-intensive, time-consuming and expensive process. In practice, cleaning all the dirty data types introduced by Oliveira et al or Kim et al is unrealistic and simply not cost-effective when taking into account the specific needs of a business enterprise [6, 30]. Although some research has proposed a large collection of dirty data types, such as the collection of 35 dirty data types by Oliveira et al, by only looking at these dirty data types it is difficult to tell which group of dirty data should be considered when facing a specific requirement from a business enterprise, and it would be very expensive for the system to run all algorithms for all the possible dirty data candidates. This problem is defined as the Dirty Data Selection (DDS) problem in this thesis.

In this thesis, a novel rule-based taxonomy of dirty data is proposed. Compared with existing work [6, 30], this taxonomy provides a larger collection of dirty data types than any of the existing taxonomies. With the help of the proposed taxonomy, a new classification of dirty data based on data quality dimensions is proposed. It can be used by business enterprises to solve the DDS problem by prioritizing the expensive process of data cleaning, thereby maximally benefitting their organizations. This rule-based taxonomy of dirty data will be introduced in Chapter 3.

(ii) Search and identify error instances

Before the execution of this step, information regarding the dirty data types identified within the data sources should be available, since performing data cleaning in very large databases is costly and time-consuming. For each of these dirty data types, searching for and identifying dirty data instances is performed with the help of an appropriate data cleaning method or algorithm, which can not only help with reducing the data cleaning time but also maximize the degree of automation.

Choosing a proper data cleaning method has proved to be a difficult task [31], especially when the selection has to be made from many alternatives. It depends on several factors, such as the problem domain and the nature of the errors. Additionally, organizing the multiple data cleaning methods involved during the data cleaning process is also a difficult task [32]. The challenge here is how to improve the efficiency and effectiveness of data cleaning tasks (i.e., reduce the data cleaning time and improve the accuracy of the cleaning results) and how to improve the degree of automation during the data cleaning process.

From the literature, many data cleaning approaches or frameworks have been developed to facilitate data cleaning. However, studying these approaches reveals that they are designed mainly to solve specific data cleaning activities, such as data transformations or duplicate record detection, exclusively. A general data cleaning approach that can deal with the dirty data types proposed in the existing dirty data taxonomies does not exist. Additionally, regarding the selection of a suitable data cleaning technique to deal with a specific dirty data type, in those data cleaning approaches either the user is required to specify a method or a fixed method is applied to all situations. This, as will be shown later, not only increases the data cleaning time but also affects the effectiveness of data cleaning. These data cleaning approaches will first be comparatively reviewed in Chapter 2, and a novel data cleaning framework will be proposed in Chapter 4, with two exclusive mechanisms addressed in the proposed framework to improve the efficiency and effectiveness of the data cleaning process.


1.3 Objectives of the research

The objectives of the research are as follows:

(1) To develop a taxonomy of dirty data.

Due to a lack of appreciation of the types and extent of dirty data, the existence of dirty data within database applications may not be paid adequate attention by many enterprises. This eventually becomes one of the important factors causing poor data quality in these database applications. In order to improve data quality, it is necessary to understand the wide variety of dirty data that may exist in the data sources as well as how to deal with them. Although, from the literature, some work has been done exclusively for the purpose of generating taxonomies of dirty data [6, 30], in practice cleaning all the dirty data types introduced by these taxonomies is unrealistic and not cost-effective when taking into account the needs of a business enterprise. For example, the taxonomy of data quality problems proposed by Oliveira et al [30] presents 35 dirty data types and is considered the most comprehensive taxonomy in the literature so far. In this case, by only showing these 35 dirty data types, it is difficult to tell which dirty data types should be selected for different datasets when specific business needs are involved. Thus one motivation of this research is to develop a taxonomy that not only addresses as many dirty data types as possible but can also help with solving the DDS problem.

(2) To develop a novel data cleaning framework

In order to ensure that the data of an organization is of high quality, it is necessary to clean the dirty data existing in the different data sources in a proper way. A process which can monitor, analyze and maintain the quality of data is highly recommended. From the literature, many data cleaning approaches exist to facilitate a data cleaning process, and they are crucial to making the data cleaning techniques and methodologies involved during the data cleaning process effective. However, research studies reveal that they are designed to focus exclusively on the cleaning of some specific dirty data types, such as duplicate record detection or data value transformation. To the author's knowledge, no data cleaning tool has been developed with the purpose of dealing with all the dirty data types mentioned in the existing dirty data taxonomies.

The ability to select different groups of dirty data types to deal with, according to the specific needs of a business, is thus highly desirable in a data cleaning approach. Besides, in current data cleaning approaches, organizing multiple cleaning tasks in a proper cleaning sequence and selecting a suitable technique for a specific data cleaning task depend entirely on the user's preference in most cases. Regarding the organization of the multiple cleaning tasks, ideally, the process of detecting and correcting the dirty data should be performed automatically. However, it is known that performing data cleaning fully automatically is nearly impossible in most cases, especially when an exception happens during the cleaning process and an expert is required to make a judgement. Therefore, a semi-automatic data cleaning approach with the power of automatically organizing and ordering the associated data cleaning tasks is a challenge [27].

Regarding the selection of a proper technique for a specific data cleaning task, a tool should include various appropriate data cleaning methods or algorithms to deal with a specific dirty data type, in order to cope with different problem domains. Choosing a data cleaning method or algorithm from a set of alternatives has proved to be a difficult task. It depends on several factors, such as the problem domain and the nature of the errors. Currently, in the existing data cleaning approaches, algorithm selection as well as its parameter settings depend on the user's preference in most cases. For users who do not have enough knowledge and experience, an inappropriate selection of algorithms will generate a bad cleaning result. Therefore, another challenge for a data cleaning approach is that it should not only include enough techniques for a user to choose from for different problem domains but should also be able to help the user intelligently make a choice when necessary.

(3) To evaluate a set of data cleaning algorithms

As mentioned above, the selection of a suitable data cleaning method is a difficult task, especially when many alternative methods are available to choose from. How to make a selection that maximizes the effectiveness and efficiency of data cleaning is a challenge. Many factors need to be considered during the selection of a proper data cleaning method, such as the problem domain and the nature of the errors. For example, matching names is one of the important steps in the data cleaning process for dealing with the duplicate record detection problem. There are a number of name matching techniques available. Unfortunately, there is no existing name matching technique that performs best in all situations: different techniques perform differently in different situations. Therefore, a problem that every researcher or practitioner has to face is how to select an appropriate technique for a given dataset. This problem also arises in the design of the proposed data cleaning framework in Chapter 4, as the question of how to select an appropriate algorithm for a dirty data type during the data cleaning process. An objective of this research is thus to analyze and evaluate a set of name matching algorithms and present some suggestions based on the experimental results, which can be used as guidance for researchers and practitioners to select an appropriate name matching technique for a given dataset.
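To indicate the kind of comparison carried out later in the thesis, the sketch below implements one common name matching measure (Levenshtein edit distance normalised into a similarity score), applies it to a handful of hypothetical name pairs under a chosen threshold, and computes precision, recall and the F score against hand-labelled ground truth. The names, labels and threshold are invented; the actual experiments in Chapter 5 use the Febrl-based implementations and datasets described there.

    # Levenshtein-based name matching with a threshold, scored by precision/recall/F.

    def levenshtein(a, b):
        # Classic dynamic-programming edit distance between two strings.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                # deletion
                                curr[j - 1] + 1,            # insertion
                                prev[j - 1] + (ca != cb)))  # substitution
            prev = curr
        return prev[-1]

    def similarity(a, b):
        # Normalise the edit distance into a similarity score in [0, 1].
        if not a and not b:
            return 1.0
        return 1.0 - levenshtein(a, b) / max(len(a), len(b))

    # Hypothetical labelled pairs: (name1, name2, true match or not).
    pairs = [("smith", "smyth", True), ("johnson", "jonson", True),
             ("peng", "pang", True), ("kennedy", "kenneth", False),
             ("li", "lee", False)]

    threshold = 0.8                      # illustrative choice
    tp = fp = fn = 0
    for a, b, truth in pairs:
        predicted = similarity(a, b) >= threshold
        tp += predicted and truth
        fp += predicted and not truth
        fn += (not predicted) and truth

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(precision, recall, round(f_score, 2))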

1.4 Contributions to knowledge

The contributions to knowledge presented in this thesis arise from the following achievements.


(1) A rule based taxonomy of dirty data

In this thesis, a rule-based taxonomy of dirty data is proposed. A taxonomy of dirty data not only provides a framework for understanding the origins of a complete spectrum of dirty data and the impact of dirty data on database applications, but also sheds light on techniques for dealing with dirty data, defines a metric for measuring data quality, and provides a valuable guideline for further research and the enhancement of commercial products [6].

Compared with existing work, the proposed taxonomy provides a larger collection of dirty data types (38 dirty data types) than any of the existing taxonomies. Particularly, with the help of the taxonomy, a novel data cleaning method is also proposed which can be used by business enterprises to solve the proposed DDS problem, by prioritizing the expensive process of data cleaning.

(2) A novel data cleaning framework

Data cleaning is a labour-intensive, time-consuming, and expensive process, especially when huge volumes of data are involved. In this thesis, a novel data cleaning framework has been proposed, which aims to address the following issues: (i) minimising the data cleaning time and improving the degree of automation in data cleaning, and (ii) improving the effectiveness of data cleaning. The improvement in the efficiency and effectiveness of data cleaning and the degree of automation is realized by introducing two unique mechanisms, namely the 'algorithm ordering mechanism' and the 'algorithm selection mechanism', into the data cleaning process. In addition, the DDS process exclusively addressed in the proposed framework can help a business to take its special needs into account according to different business priority policies. This framework retains the most appealing characteristics of existing data cleaning approaches, and is able to improve the efficiency of data cleaning in data warehouse applications.

(3) A set of recommendations for the selection of a suitable name matching algorithm

The research work includes a comprehensive comparison of five popular name matching techniques based on a series of carefully designed experiments on different last name datasets and first name datasets. The comparison results confirmed the statement that there is no clear best technique. The size of the datasets, the error rate in the datasets, the type of strings in a dataset and the type of typo in a string all have significant effects on the performance of the selected techniques. The timing cost of these techniques on different datasets has also been analyzed and compared. Based on the experimental results achieved, it is suggested that the selection of a technique should depend on the nature of the datasets. A set of recommendations as well as all related experimental results are presented in this thesis to help with the selection of a suitable name matching algorithm for a specific dataset.

1.5 The structure of the thesis

Chapter 1 introduces the research and the problem statement; the aim and objectives of the research are then discussed, and the contributions to knowledge are introduced.

The literature review is presented in Chapter 2. Research exclusively related to dirty data type classifications and taxonomies from the literature is first reviewed and discussed. Data cleaning methods and approaches are then studied and compared in detail. Finally, the literature regarding data quality and data quality dimensions is reviewed and compared.

Chapter 3 presents 38 dirty data types based on a set of business rules. A rule-based taxonomy of dirty data is given with these 38 dirty data types. The proposed taxonomy of dirty data is critically analysed and compared with existing research from the literature.

Chapter 4 presents a novel data cleaning framework. The components of the framework are detailed in this chapter, and it is shown that the proposed framework retains the most appealing characteristics of the existing data cleaning approaches and is able to improve the efficiency and effectiveness of data cleaning in database applications.

Chapter 5 analyzes and evaluates a set of popular name matching algorithms on a set of carefully designed personal name datasets. The experimental results confirm the statement that there is no clear best technique. Suggestions regarding the selection of an appropriate name matching algorithm are presented, which can be used as guidance for researchers and practitioners to select an appropriate name matching algorithm for a given dataset.

Chapter 6 concludes the research and discusses the future work.


CHAPTER 2 LITERATURE REVIEW AND RELATED WORK

This chapter conducts a broad survey of the many techniques that have been found useful during the data cleaning process and for the improvement of data quality in database applications. Existing taxonomies of dirty data types from the literature are reviewed to present the multiple dirty data types observed in different data sources. Data cleaning methods and approaches are also reviewed in this chapter; they provide the foundation for the development of the proposed data cleaning framework. Data quality and data quality dimensions are reviewed in the final part of this chapter.

2.1 Dirty data

In Chapter 1, it is pointed out that many enterprises do not pay adequate attention to the existence of dirty data and have not applied useful methodologies to ensure high quality data for their applications. One of the reasons is a lack of appreciation of the types and extent of dirty data [6]. Therefore, in order to improve the data quality, it is necessary to understand the wide variety of dirty data that may exist within the data source as well as how to deal with them. In this section, current classifications or taxonomies of dirty data are reviewed first.

2.1.1 Müller and Freytag’s Data Anomalies

In this work, the authors state that the definition of what is and is not dirty data is highly application specific, and they first present the following definitions:

- Data are symbolic representations of information, i.e., facts or entities from the world, depicted by symbolic values. They are collected to form a representation of a certain part of the world, called the miniworld (M).


- Anomaly is a property of data values that renders them a wrong representation of the miniworld.

With the above two definitions, the authors state that 'data containing anomalies is dirty data'. According to the constraints specified in Müller and Freytag's pre-defined data model, data from a data collection that does not conform to the constraints of the data model is considered a data anomaly. They roughly classify data anomalies into three different sets, each of which contains different dirty data types. The three sets are called syntactical anomalies, semantic anomalies and coverage anomalies respectively [27].

Syntactical anomalies consider dirty data from the angle of data representation. There are three such dirty data types, namely lexical errors, domain format errors and irregularities. Lexical errors show a difference between the structure of the data items and the specified format. Domain format errors occur when the given value for an attribute does not conform to the anticipated domain format. Irregularities deal with the problem of non-uniform use of values, units and abbreviations. Semantic anomalies mainly concern two types of dirty data: data that violates the integrity constraints and duplicate data. Integrity constraints are used to specify the rules for representing knowledge about the domain and the values allowed for representing certain facts. Duplicate data here stands for two or more tuples that represent the same entity. Finally, coverage anomalies describe the dirty data due to missing values or missing tuples. Apart from these seven data anomalies, the authors also mention another data anomaly called the invalid tuple, where data from the data collection conforms to all the constraints of the data model but still represents an invalid entity from the mini-world. For example, a student's age is 25 years, but the value of age is entered as 26. Clearly, it is practically impossible for any software, or even a person, to detect such an error [6]. Table 2.1 shows the dirty data types classified in this work.


No.   Dirty data type
MF.1  Lexical error
MF.2  Domain format error
MF.3  Irregularities error
MF.4  Integrity constraint violations error
MF.5  Duplicate records
MF.6  Missing values (null value not allowed)
MF.7  Missing tuple
MF.8  Invalid tuple

Table 2.1 Data anomalies from Müller and Freytag.

2.1.2 Rahm and Do’s classification of data quality problems

Rahm and Do replace the term 'dirty data' with 'data quality problem'. According to the authors, database systems enforce restrictions of a specific data model as well as application-specific integrity constraints. They divide the observed data quality problems into two sets, namely single-source problems and multi-source problems. Within each set, data quality problems are again classified into schema-level problems and instance-level problems respectively.

Within single-source problems, data quality problems that occur due to the lack of appropriate model-specific or application-specific integrity constraints are defined as schema-level data quality problems. Data quality problems related to errors and inconsistencies that cannot be prevented at the schema level are defined as instance-level data quality problems [28].

Within multi-source problems, as different data sources are designed and maintained independently, data quality problems become more complicated when these data sources are to be integrated. The main problems at the schema level are naming conflicts and structural conflicts. For example, one data source may use 'Cid' as the attribute name for the customer identification number while another data source uses 'Cno.' for the same purpose, i.e., different names for the same attribute. As an example of a structural conflict, in one data source the attribute 'name' may require name values to be written following a pattern in which the first name is followed by the last name. Therefore, a person whose first name is 'John' and last name is 'Smith' will be recorded as "John Smith" in this data source according to the pre-defined pattern. In another data source, however, name values may be required to be stored in separate attributes, e.g. 'First Name' and 'Last Name' respectively. At the instance level, problems may occur due to data conflicts such as different value representations or different interpretations of the same value. Furthermore, the authors also mention the existence of overlapping data that causes duplicate records across multiple data sources, as well as contradicting records among multiple data sources. The dirty data types they introduced are shown in Table 2.2.


No.    Dirty data type
RD.1   Illegal values due to invalid domain range
RD.2   Violated attribute dependencies at schema level
RD.3   Uniqueness violation
RD.4   Referential integrity violation
RD.5   Missing values (null allowed)
RD.6   Cryptic values, abbreviations
RD.7   Misspellings
RD.8   Embedded values
RD.9   Misfielded values
RD.10  Violated attribute dependencies at instance level
RD.11  Word transpositions
RD.12  Duplicated records in single data source
RD.13  Contradicting records in single data source
RD.14  Wrong references
RD.15  Naming conflicts
RD.16  Structural conflicts
RD.17  Data conflicts in multiple data sources
RD.18  Duplicate records in multiple data sources
RD.19  Contradicting records in multiple data sources

Table 2.2 Dirty data types from Rahm and Do.
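To make the schema-level conflicts in Table 2.2 (RD.15, naming conflicts, and RD.16, structural conflicts) more concrete, the sketch below maps records from two hypothetical sources, one using 'Cid' and a combined name field and the other using 'Cno.' with separate first and last name fields, onto a common target schema. The field names and mapping choices are invented for illustration and are not taken from [28].

    # Reconciling a naming conflict (Cid vs Cno.) and a structural conflict
    # (a combined name field vs separate first/last name fields).

    def from_source_a(record):
        # Source A uses 'Cid' and a single combined 'Name' field.
        first, _, last = record["Name"].partition(" ")
        return {"customer_id": record["Cid"],
                "first_name": first, "last_name": last}

    def from_source_b(record):
        # Source B uses 'Cno.' and separate 'First Name' / 'Last Name' fields.
        return {"customer_id": record["Cno."],
                "first_name": record["First Name"],
                "last_name": record["Last Name"]}

    a = {"Cid": 17, "Name": "John Smith"}
    b = {"Cno.": 17, "First Name": "John", "Last Name": "Smith"}
    print(from_source_a(a) == from_source_b(b))   # True: the records now align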

2.1.3 Kim et al’s taxonomy of dirty data

In this work, according to the authors, dirty data is defined roughly as either missing data or wrong data or non-standard representations of the same data [6]. Kim et al present a hierarchically structured taxonomy of dirty data. According to the different ways of dirty data manifestation, all dirty data that could be captured from different data sources can only be classified into the following three categories [6]:


- Missing data;
- Not missing but wrong data;
- Not missing, not wrong but unusable data.

These three categories of dirty data formed the main body of the taxonomy work. For the rest of the taxonomy work, the authors applied a hierarchical decomposition method to the three categories of dirty data and produced the taxonomy with 33 different types of dirty data.

Missing data is data that is missing in a field when it should not be missing. Two dirty data types are considered in this category: "missing data (null value allowed)" and "missing data (null value not allowed)".

Not missing but wrong data is data that is different from the "true value" of the data when it is accessed. In this category, dirty data is initially classified into two sets according to whether or not automatic enforcement of integrity constraints is available. Dirty data that can be prevented by automatic enforcement of integrity constraints is further classified based on whether these integrity constraints are supported by current relational database systems or require extensions to current systems. For dirty data that cannot be prevented by automatic enforcement of integrity constraints, the authors consider dirty data that arises in a single source and in multiple sources respectively. Together, there are 17 dirty data types within this category.

Not missing, not wrong, but unusable data is data that is in some sense not wrong, but can lead to wrong results in a query or analysis. Dirty data that falls into this category is considered according to whether it arises in a single source or in multiple sources. In a single source, data becomes dirty due to either the value of the data being ambiguous or the value of the data not conforming to standards. In multiple sources, it is due to the same entity having different values across multiple databases. Within this category, 14 dirty data types have been introduced. Table 2.3 shows the list of all 33 dirty data types from Kim et al.

Category: Missing data
K.1   Missing data (null value allowed)
K.2   Missing data (null value not allowed)

Category: Not missing, but wrong data
K.3   Use of wrong data type including value range
K.4   Dangling data
K.5   Violation of uniqueness constraint data
K.6   Mutually inconsistent data
K.7~K.10  Dirty data due to failure of transaction management facility
K.11  Wrong categorical data
K.12  Outdated temporal data
K.13  Inconsistent spatial data
K.14  Erroneous entry
K.15  Misspelling
K.16  Extraneous data
K.17  Entry into wrong fields
K.18  Wrong derived-field data from stored data
K.19  Inconsistency across multiple tables/files due to integration constraint problem

Category: Not missing, not wrong, but unusable data
K.20  Different data for the same entity across multiple databases
K.21  Ambiguous data due to use of abbreviation
K.22  Ambiguous data due to incomplete context
K.23  Different representation for non-compound data due to use of abbreviation (algorithmic transformation is not possible)
K.24  Different representation for non-compound data due to use of alias/nickname (algorithmic transformation is not possible)
K.25  Different representation for non-compound data due to encoding format (algorithmic transformation is possible)
K.26  Different representation for non-compound data due to different representations (algorithmic transformation is possible)
K.27  Different representation for non-compound data due to measurement units (algorithmic transformation is possible)
K.28  Different representation for compound data due to abbreviation
K.29  Different representation for compound data due to use of special characters
K.30  Different representation for compound data due to different ordering
K.31  Different representation for hierarchical data due to abbreviation
K.32  Different representation for hierarchical data due to use of special characters
K.33  Different representation for hierarchical data due to different ordering

Table 2.3 Dirty data types from Kim et al.


2.1.4 Oliveira et al’s taxonomy of data quality problems

Oliveira et al also use the term 'data quality problem' in place of 'dirty data'. All problems that affect data quality are defined as data quality problems, and the work aims at identifying all such data quality problems and organizing them into a taxonomy. The taxonomy by Oliveira et al was formed by collecting the different data quality problems from previous work [6, 27, 28]. These problems are structured at 6 different levels, ranging from the lowest-level problems (in a single attribute value of a single tuple) to the highest-level problems (multi-source problems) [30]. The six levels are:

- L.1: Problems related to an attribute value of a single tuple (in a single table of a single data source).
- L.2: Problems related to the values of a single attribute (in a single table of a single data source).
- L.3: Problems related to multiple attribute values (in a single table of a single data source).
- L.4: Problems related to attribute values of several tuples (in a single table of a single data source).
- L.5: Problems related to relationships among multiple tables (in multiple tables of a single data source).
- L.6: Problems related to multiple data sources.

Compared with the earlier work, Oliveira et al present a rather complete taxonomy of data quality problems, with 35 dirty data types. Table 2.4 shows the list of the 35 data quality problems identified by Oliveira et al.


Level  No.   Dirty data type
L.1    O.1   Missing value
L.1    O.2   Syntax violation for an attribute value of a single tuple in single data source
L.1    O.3   Outdated value
L.1    O.4   Interval violation
L.1    O.5   Set violation
L.1    O.6   Misspelled error
L.1    O.7   Inadequate value to the attribute context
L.1    O.8   Value items beyond the attribute context
L.1    O.9   Meaningless value
L.1    O.10  Value with imprecise or doubtful meaning
L.1    O.11  Domain constraint violation for an attribute value of a single tuple in single data source
L.2    O.12  Uniqueness value violation
L.2    O.13  Synonyms existence
L.2    O.14  Domain constraint violation for the values of a single attribute in single data source
L.3    O.15  Semi-empty tuple
L.3    O.16  Inconsistency among attribute values
L.3    O.17  Domain constraint violation for attribute values of a single tuple in single data source
L.4    O.18  Redundancy about an entity in single data source
L.4    O.19  Inconsistency about an entity in single data source
L.4    O.20  Domain constraint violation for attribute values of several tuples in single data source
L.5    O.21  Referential integrity violation
L.5    O.22  Outdated reference
L.5    O.23  Syntax inconsistency in single data source
L.5    O.24  Inconsistency among related attribute values
L.5    O.25  Circularity among tuples in a self-relationship
L.5    O.26  Domain constraint violation for relationships among multiple relations in single data source
L.6    O.27  Syntax inconsistency in multi data sources
L.6    O.28  Different measure units in multi data sources
L.6    O.29  Representation inconsistency in multi data sources
L.6    O.30  Different aggregation levels in multi data sources
L.6    O.31  Synonyms existence in multi data sources
L.6    O.32  Homonyms existence
L.6    O.33  Redundancy about an entity in multi data sources
L.6    O.34  Inconsistency about an entity in multi data sources
L.6    O.35  Domain constraint violation in multi data sources

Table 2.4 Oliveira et al’s dirty data set.

A brief comparison among these four works mentioned above is given below:

Müller and Freytag [27] identify a set of errors (anomalies) that affect data quality. The set includes lexical errors, domain format errors, irregularities, constraint violations, missing values, missing tuples, duplicates and invalid tuples. Müller and Freytag's classification of anomalies does not present as many dirty data types as the other three works, because Müller and Freytag do not consider problems from multiple data sources; their work limits the data quality problems to a single data source. Rahm and Do [28] classify data quality problems into two groups: single-source and multi-source problems. However, at the single-source level, they do not divide the problems into those that occur in a single relation and those that occur in multiple relations, as Oliveira et al have done [30]. Kim et al's work [6] presents a comprehensive, hierarchically structured taxonomy of dirty data. According to the different ways in which dirty data manifests itself, all dirty data that can be captured from different data sources is classified into three categories, which form the main body of the taxonomy. The authors then apply a hierarchical decomposition method to the three categories of dirty data and produce a taxonomy with 33 distinct dirty data types. Oliveira et al produce a very complete taxonomy [30]. They adopted a bottom-up approach, from the lowest level where data quality problems may exist (those that occur in a single attribute value of a single tuple) to the highest level (those that involve multi-source problems). At the single-source level, problems are further divided into two sub-groups: those that occur in a single relation and those that result from existing relationships among relations. At the multi-source level, the data quality problems are decomposed into 9 problems. The work also proposes some dirty data types that Kim et al have not mentioned, e.g. DT.7, DT.13, DT.16, and DT.18. Although Oliveira et al provide the most comprehensive taxonomy compared with the others, it still lacks some dirty data types covered by them. For example, some dirty data types mentioned by Kim et al (DT.1, DT.19, DT.25, DT.34) are not included by Oliveira et al.

In this thesis, a rule-based taxonomy of dirty data is proposed. Dirty data is defined as „the data flaws that break the data quality rules‟, and these rules can be used to measure the occurrence of data flaws. In Chapter 3, data quality rules will be discussed in detail and the proposed rule-based taxonomy of dirty data will be presented. A comparison between the proposed taxonomy and the research works described above will also be given. It will be shown that the proposed taxonomy of dirty data not only provides a solution to the DDS problem but also includes more dirty data types than any of the existing taxonomies.

2.2 Methods used for Data cleaning

In this section, general existing methods and techniques that can be used for data cleaning tasks are reviewed. They were developed to deal with particular common data cleaning activities and have been implemented in some of the existing commercial data cleaning tools.

(1) Parsing

Parsing in data cleaning is performed for the purpose of detecting syntax errors. Parsing decides, for a given string, whether it is an element of the language defined by the correct grammar. For example, the Potter‟s Wheel framework provides two mechanisms for the parsing task, namely the „Type-based Discrepancy Detector (TDD)‟ and the „User-specified Discrepancy Detector (UDD)‟ [33]. A TDD is an algorithm which detects discrepancies in values of a particular type. A UDD is a discrepancy detection algorithm that the user asks the system to apply to a specific set of fields [33]. As an example of using the two types of parser, suppose the schema of a table containing student records of a university in the U.K. is represented as Student={StudentName, DepartmentName, StudentID, DateofBirth}, and suppose the user has registered a Number TDD that maintains as internal state the mean and standard deviation of the values seen so far, and flags any value that is more than 10 standard deviations from the mean as dirty data. The user has also registered a String TDD that matches strings of alphabetic characters, and the user has specified a UDD on the DepartmentName column to check the validity of department names. Suppose a student record with values {John, Computing, 001, 01/01/1988} is chosen to extract structures (parsers). The result of the inferred parsers will be:

Name:String, DepartmentName:String, StudentID:Number, DateofBirth:Number/Number/Number.

In this case, the following records will be detected as the ones containing dirty data:

Record 1: {Sally, HelloWorld, 002, 05/06/1988}
Record 2: {Jack, Math, 004, March, 4, 1986}
Record 3: {Tom, Computing, 005, 09/10/19827}

In Record 1, according to the DepartmentName UDD, the value „HelloWorld‟ will be detected as an anomaly since it violates the valid department names defined in the DepartmentName UDD. In Record 2, the DateofBirth value „March, 4, 1986‟ will be detected as an anomaly since its structure is extracted as „String, Number, Number‟ and violates the registered TDD for the DateofBirth field. Finally, in Record 3, the Number TDD will be invoked on the sub-components of the DateofBirth field and will detect the year value „19827‟ as an anomaly because it is too many standard deviations away from the mean value of the year sub-component.
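To make the behaviour of such a detector concrete, the following short Python sketch implements the idea under stated assumptions: the class name NumberTDD, the running-statistics design and the sample year values are illustrative and are not taken from Potter‟s Wheel itself.

from statistics import mean, stdev

class NumberTDD:
    """A toy type-based discrepancy detector for numeric values."""
    def __init__(self, threshold=10.0):
        self.values = []          # internal state: values seen so far
        self.threshold = threshold

    def check(self, value):
        """Return True if the value looks like a discrepancy."""
        try:
            x = float(value)
        except ValueError:
            return True           # not a number at all
        flagged = False
        if len(self.values) >= 2:
            m, s = mean(self.values), stdev(self.values)
            if s > 0 and abs(x - m) > self.threshold * s:
                flagged = True
        self.values.append(x)     # the flagged value is still recorded here
        return flagged

years = [1988, 1986, 1990, 1985, 1987, 19827]
tdd = NumberTDD()
print([y for y in years if tdd.check(y)])   # only 19827 is flagged

In practice the year sub-component of every DateofBirth value would be streamed through such a detector, so the „19827‟ in Record 3 would be flagged once enough plausible years had been seen.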

(2) Data transformation

Data transformation is one of the major subtasks in data cleaning. It transforms the data on a structural level as well as an instance level, meeting the requirements of the analysis tools. Although there are many commercial tools available for transformation problems, such as Microsoft Data Transformation Services or Oracle Data Warehouse Builder, they perform transformations in a batch-like manner that does not support an explorative and interactive approach. Using a multi-database query language (FRAQL) helps to overcome this shortcoming, as it makes it possible to check various strategies for integration and cleaning with reduced effort [34].

According to Sattler and Schallehn, FRAQL provides good solutions for data transformation problems on both the schema and instance level. On the schema-level, two operations namely “TRANSPOSE TO ROWS” and “TRANSPOSE TO COLUMNS” are provided by FRAQL, which help with converting rows to columns and vice-versa. On the instance-level, FRAQL can help with simple value conversions as well as attribute value normalizations. The simple value conversions can be realized with the built-in functions such as „string manipulation functions‟ or „general purpose conversion functions‟. Attribute value normalizations can be realized with the user-defined functions (UDF), which help to normalize the attribute values to lie in a fixed interval given by the minimum and maximum values [34].
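As an illustration of the kind of instance-level normalization such user-defined functions perform, the following Python sketch rescales a column of attribute values into a fixed interval given by its minimum and maximum values; it is only an analogue of the idea, not FRAQL code, and the function name and sample values are assumed for illustration.

def normalize(values, low=0.0, high=1.0):
    """Min-max normalization of a column of numeric attribute values."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: map everything to low
        return [low for _ in values]
    scale = (high - low) / (hi - lo)
    return [low + (v - lo) * scale for v in values]

print(normalize([10, 20, 15, 40]))     # [0.0, 0.333..., 0.1666..., 1.0]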

(3) Integrity constraint enforcement techniques

Database integrity refers to the validity and consistency of stored data. Integrity is usually expressed in terms of constraints, which are consistency rules that the database is not permitted to violate. Techniques for integrity constraint enforcement can help to eliminate integrity constraint violation problems. In general, integrity constraint enforcement ensures the satisfaction of integrity constraints after transactions that modify a data collection by inserting, deleting, or updating tuples have been performed. There are two approaches, namely integrity constraint checking and integrity constraint maintenance. The former helps to prevent integrity constraint violations from occurring during a transaction. The latter helps to correct integrity constraint violations after the transactions, in order to guarantee that the resulting data collection is free of integrity constraint violations. According to Maletic and Marcus, integrity analysis can be used to locate data errors [21]. Given a dataset that adheres to the relational model, the analysis can be used as a simple data cleaning operation. Checks of relational data integrity, including entity, referential, and column integrity, can be accomplished using relational database queries such as SQL [21]. However, limitations exist in applying these integrity constraint enforcement techniques: for example, the control of the data cleaning process must remain with the user at all times, and the techniques can only uncover a limited range of possible errors in a data set, not more complex problems such as outlier detection.
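As a brief illustration of this kind of integrity analysis, the following Python sketch runs two plain SQL queries, one for referential integrity and one for duplicated key values, against a small in-memory SQLite database; the table and column names are hypothetical and the code is not taken from the cited work.

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE department(dept_id INTEGER, name TEXT);
    CREATE TABLE student(student_id INTEGER, name TEXT, dept_id INTEGER);
    INSERT INTO department VALUES (1, 'Computing'), (2, 'Maths');
    INSERT INTO student VALUES (1, 'John', 1), (2, 'Sally', 9), (2, 'Jack', 2);
""")

# Referential integrity: students pointing at a non-existent department.
orphans = con.execute("""
    SELECT s.student_id, s.name FROM student s
    WHERE NOT EXISTS (SELECT 1 FROM department d WHERE d.dept_id = s.dept_id)
""").fetchall()

# Entity integrity: student_id values that occur more than once.
dup_keys = con.execute("""
    SELECT student_id, COUNT(*) FROM student
    GROUP BY student_id HAVING COUNT(*) > 1
""").fetchall()

print(orphans)    # [(2, 'Sally')]
print(dup_keys)   # [(2, 2)]

Such queries only report the suspect tuples; deciding how to repair them remains with the user, which is exactly the limitation noted above.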

(4) Duplicate detection techniques

Duplicate detection or record matching is an important process in data cleaning. It involves identifying whether two or more tuples are duplicate representations of the same entity. Duplicate records often do not share a common key and may contain erroneous data, which makes record matching a difficult task. There are two main approaches to duplicate record detection: approaches that rely on training data, e.g., probabilistic models [10] or supervised/semi-supervised learning techniques [35-39], and approaches such as rule-based [13, 40, 41] and distance-based techniques [42-44] that rely on domain knowledge or distance metrics to match records.

With respect to the former approach, a limitation is that training data may not always be available. Although the unsupervised „Expectation Maximization‟ (EM) algorithm can be used to supply the maximum likelihood estimate, some conditions are required in order to use it: for example, the rate of typographical error should be low and there should be more than 5% duplicates within the dataset. The rule-based approach, on the other hand, does not require training data. However, an expert is needed to devise the set of matching rules in order to obtain highly accurate matching results. A limitation of this approach is therefore that an expert may not always be available and the rule set may be domain specific. In addition, distance-based algorithms need to be applied within the rule-based approach. For example, an approximate string matching algorithm such as the Jaro or Levenshtein algorithm may be applied to the name string values of two records to determine the degree of similarity between them, with the help of a pre-defined threshold value. A poor selection of the threshold value will generate a poor matching result. Therefore, the choice of a proper distance-based algorithm and the selection of a suitable threshold value play an important role in the rule-based approach. Some popular character-level string matching algorithms, as well as the threshold value selection problem, will be analyzed in chapter 5.
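To illustrate the role of the distance-based algorithm and the threshold value, the following Python sketch computes a Levenshtein-based similarity between two name strings and compares it with a pre-defined threshold; the threshold value and the example names are illustrative assumptions only.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Normalise the edit distance into [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

THRESHOLD = 0.8   # a poor choice of this value directly degrades matching quality
print(similarity("Elizabeth Fraser", "Elizbeth Fraser"))          # about 0.94, a match
print(similarity("Elizabeth Fraser", "Thomas Lee") > THRESHOLD)   # False, a non-match

The same record pair can therefore flip between „match‟ and „non-match‟ purely through the choice of threshold, which is why threshold selection is revisited in chapter 5.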

(5) Statistical methods

Some statistical methods can be used for auditing data as well as for correcting erroneous data. For example, Maletic and Marcus show that statistical methods can help to identify outlier problems using values of the mean, standard deviation and range, based on Chebyshev‟s theorem and considering the confidence intervals for each field [21]. In this work, outlier values for particular fields are identified based on automatically computed statistics. For each field, the average and the standard deviation are utilized and, based on Chebyshev‟s theorem, those records that have values in a given field outside a number of standard deviations from the mean are identified. The number of standard deviations to be considered is customizable, and confidence intervals are taken into consideration for each field.
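A minimal sketch of this kind of statistical audit is given below in Python; the field names, the sample records and the choice of k standard deviations are illustrative assumptions rather than the actual procedure of [21].

from statistics import mean, stdev

def field_outliers(records, field, k=2):
    """Report records whose value in the field lies more than k standard
    deviations from the field mean (k is customizable by the user)."""
    values = [r[field] for r in records]
    m, s = mean(values), stdev(values)
    if s == 0:
        return []
    return [r for r in records if abs(r[field] - m) > k * s]

students = [{"id": i, "birth_year": y}
            for i, y in enumerate([1985, 1986, 1987, 1988, 1989, 1990, 19827], 1)]
print(field_outliers(students, "birth_year", k=2))   # only the 19827 record is reported

Note that with very few records a single extreme value inflates the standard deviation itself, which is why k is kept small in this toy example.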


2.3 Existing approaches for Data cleaning

In chapter 1, the data cleaning process is described as a complex process with massive human resource involvement. Section 2.2 discussed general techniques and methods that can be used to automate activities in data cleaning applications as far as possible. In this section, five selected data cleaning approaches from the literature are reviewed.

(1) Potter‟s Wheel

Potter‟s Wheel is an interactive framework for data cleaning and transformation. According to the authors, data often has inconsistencies in schema, formats, and adherence to constraints. This may be due to many factors, such as data entry errors or the merging of multiple data sources. Therefore, data that do not conform to the required formats, either at the instance level or the schema level, must be detected and transformed into a uniform format before use.

Although many data cleaning tools existed when Potter‟s Wheel was developed, according to Raman and Hellerstein those tools had serious drawbacks in usability. The main drawbacks are that (1) these tools use a combination of analysis tools and transformation tools to deal with discrepancy detection and data transformation respectively, with little interactivity; (2) discrepancy detection and data transformation are typically performed within a batch process, operating on a table or the whole database, without any feedback, so users face long frustrating delays and have no idea whether a transform is effective; and (3) some „nested discrepancies‟ are hard to detect in only one pass, so further iterations are required between discrepancy detection and data transformation, since users have to wait for a transformation to finish before they can check whether it has fixed all anomalies.


Potter‟s Wheel was developed to address these drawbacks. According to the authors, any data cleaning solution must support transformation and discrepancy detection in an integrated fashion. Regarding transformation, it must be general and powerful enough to perform most tasks without explicit programming, whereas commercial extract, transform and load (ETL) tools typically only support a restricted set of transforms between a small set of formats via a graphical user interface (GUI). Regarding discrepancy detection, it must support the variety of discrepancy detection algorithms applicable in different domains, whereas the techniques applied in some data auditing tools for discrepancy detection are domain specific and unsuitable for detecting composite structures whose values come from different domains. Users have to either write a custom program for each such structure or design transforms to parse data values into atomic components for anomaly detection. Finally, transformation and discrepancy detection should be realized through simple specification interfaces and with minimal delays. Potter‟s Wheel is just such an interactive data cleaning system, integrating transformation and discrepancy detection in a single interface. The following figure shows Potter‟s Wheel‟s architecture [33]:

[Figure: the architecture comprises an Input data source, an Online Reorderer, a Transformation Engine, a Discrepancy Detector, a Display, and an Optimized Program.]

Fig.2.1 Potter‟s Wheel Architecture

The main components of Potter‟s Wheel (Fig.2.1) are the „Online Reorderer‟, the „Transformation Engine‟, and the „Discrepancy Detector‟. The „Online Reorderer‟ is a feature designed exclusively for Potter‟s Wheel. It continually fetches tuples from the „Input data source‟ and divides them into buckets, spooling them to disk if needed. The „Online Reorderer‟ picks a sample of tuples from the bucket corresponding to the scrollbar position and displays them on screen. This allows users to interactively re-sort on any column and scroll through a representative sample of the data, even on large datasets. The „Transformation Engine‟ deals with the different transforms, such as schema-level transforms which can help with splitting one data field into several fields or combining different fields into a single data field, or instance-level data conflicts associated with the discrepancy detection task. Some common transformations can traditionally be realized without explicit programming and have been used in some commercial ETL tools. However, some transforms require parsing and splitting values into atomic components, which is quite complex and requires users to write custom programs. In Potter‟s Wheel, a „structure extraction technique‟ was developed exclusively to automatically infer patterns in terms of different domains. This enables users to specify the desired results on example values, from which a suitable transform is automatically inferred. The „Discrepancy Detector‟ runs in the background while a transform is specified and the data is explored. Appropriate algorithms specified for different domains are applied to detect data anomalies. In Potter‟s Wheel, the transforms are specified graphically and their effects are shown immediately on the records visible on screen. If their effects are undesirable, they can easily be undone. At the same time, discrepancy detection is done automatically in the background based on the latest transformed view of the data, and any detected anomaly is flagged. In this way, users can gradually develop and refine transforms as discrepancies are found. After constructing a satisfactory sequence of transforms, the user can ask the system to generate an optimized program to run on the full dataset as a batch, unsupervised process.
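The following Python fragment gives a rough impression of what structure extraction does; it is a simplified sketch of the idea rather than Potter‟s Wheel‟s actual algorithm, and the regular-expression tokenization is an assumption made for illustration.

import re

def extract_structure(value):
    """Abstract an example value into a pattern over simple domains."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]", value)
    parts = []
    for t in tokens:
        if t.isdigit():
            parts.append("Number")
        elif t.isalpha():
            parts.append("String")
        else:
            parts.append(t)            # keep punctuation literally
    return "".join(parts)

print(extract_structure("01/01/1988"))      # Number/Number/Number
print(extract_structure("March, 4, 1986"))  # String, Number, Number

A value whose extracted pattern disagrees with the pattern inferred from the example records can then be routed to an appropriate transform or flagged as a discrepancy.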

Although user interactivity is improved with the help of the „Online Reorderer‟ developed exclusively for Potter‟s Wheel, the degree of automation of Potter‟s Wheel is low, as the detection of data that requires transformation depends entirely on manual perception. This is a limitation regarding the efficiency of performing data cleaning tasks. Besides, since Potter‟s Wheel mainly focuses on solving data transformation problems at either the instance level or the schema level, problems such as duplicate record detection are not well supported and users need to seek other tools to deal with the duplicate record detection problem.

(2) AJAX

The main goal of AJAX is to facilitate the specification and execution of data cleaning programs, either for a single data source, where it helps to deal with the duplicate record detection problem, or for integrating multiple data sources into a single new data source [41].

Although some existing ETL tools provide platforms to implement some data transformations, the drawback, according to the authors, is that they lack a clear separation between the logical specification of data transformations and their physical implementations. The solution offered by some tools consists only of a specific optimized algorithm that is already parameterized with some user-provided criteria, which cannot fit all situations. Besides, the user interaction facilities in these tools are poor. Sometimes an expert consultation is required during a data cleaning process; for example, when two different publication date values for the same published work are detected, deciding which one to keep requires a judgement from the user. However, in existing tools there is no specific support for user consultation except to write the data to a specific file to be analyzed by the user later.


AJAX was developed as a data cleaning framework which attempts to separate the data cleaning program into two levels namely the logical level and the physical level. The logical level supports the design of the data flow graph that specifies the data transformations needed to clean the data. These data transformations are specified with four main logical operators namely, mapping, matching, clustering, and merging. The mapping operator standardizes data formats when necessary. For example, it can convert the name string values into lower case. It can also help with producing records with a more suitable format by applying operations such as column splitting and merging. For example, values in an „address‟ field can be split into separated address components such as „city‟, „street‟, „number‟. The matching operator finds pairs of records that most probably refer to the same real object. The clustering operator groups together the matching pairs with high similarity values with the help of a given grouping criteria, e.g., transitive closure. The merging operator is applied to each individual cluster returned by the clustering operator to eliminate duplicate records or produce new records. The design of these operators is based on the semantics of SQL primitives which are extended to support a larger range of data cleaning transformations. For example, Fig.2.2 and Fig.2.3 show the use of the mapping and matching operators respectively.

CREATE MAPPING MP
SELECT s.key, lowerName, city, street, number
FROM SUBSCRIBERS s
LET IF (s.name==null) throw NullException(s.key)
    lowerName=lowerCase(s.name)
    [city, street, number]=extractAddressComponents(s.address)

Fig.2.2 Mapping operator from AJAX

In Fig.2.2, the mapping operator converts names into lower case and the address field is split into separate components. An exception is raised if the name field is null, and a human expert is consulted later. In SQL, an exception would immediately stop the execution of a query; the semantics provided in AJAX enable the entire set of tuples to be computed regardless of the exceptions.

CREATE MATCHING M1
FROM GSM-CLIENT g1, GSM-CLIENT g2
LET similarity=nameSIMF(g1.name, g2.name)
WHERE g1.gsmID < g2.gsmID AND similarity > 0.5
{SELECT g1.gsmID AS gsmID1, g2.gsmID AS gsmID2, similarity AS similarity
 KEY gsmID1, gsmID2}

Fig.2.3 Matching operator from AJAX

In Fig.2.3, the matching operator is used for finding duplicate records within a data source called GSM-CLIENT. In this example, an approximate string matching algorithm, nameSIMF(), is applied to compute the similarity between the two name values, and a threshold value of 0.5 is used to classify the matching results. There are many approximate string matching algorithms available for different types of strings in the different domains involved; they will be further discussed in chapter 5.

At the logical level, the main constituent of a data cleaning program is the specification of a data flow graph whose nodes are the logical operators. Each operator can make use of externally defined functions or algorithms that implement domain-specific treatments such as extracting substrings from a string, computing the distance between two string values, etc. A feature designed exclusively for these operators is the automatic generation of a variety of exceptions for each operator. For each exception thrown, the corresponding information about the data item is stored together with a textual description of the exception. A data lineage mechanism enables users to inspect exceptions, analyze their provenance in the data flow graph and interactively correct the data items that contributed to their generation. The corrected data can then be re-integrated into the data flow graph. In this way, user interaction is supported.

The physical level supports the implementation of the data transformations and their optimization. The focus here is the design of performance heuristics that can improve the execution speed of data transformations without sacrificing accuracy. Although the physical level can help with selecting an efficient algorithm to implement a logical operation from a set of alternatives, it is the users who control the proper usage of the optimization algorithms at the logical level. For example, suppose a matching task to deal with the duplicate record detection problem is required during the data cleaning process. Users have to specify, at the logical level, information such as the operators involved, the properties of the matching algorithms and the required parameters for optimization. At the physical level, the system then consumes the information obtained from the logical level and specific optimized algorithms can be selected to implement the transformations. This, however, is a limitation regarding the effectiveness of the data cleaning process. As will be discussed later, a poor setting of the required parameters for the selected technique will generate poor matching results. For example, when an approximate string matching technique is selected for the matching of records, how to set the threshold value is still unclear; usually, a universal value is chosen for all situations in AJAX. As will be seen in chapter 5, many factors need to be considered when setting the threshold value for the selected matching algorithm in order to achieve a better matching result. However, in AJAX, the proper usage of the optimization algorithm at the logical level depends entirely on its users, without considering any differing factors such as problem domains.


(3) ARKTOS

According to the developers of ARKTOS, in the context of a data warehouse both the schema and instance levels should be considered during the integration of data [45]. Although some tools exist to help with data integration, such as commercial ETL tools and data cleaning tools, each is responsible only for part of the task, such as the extraction of data from several sources or the cleaning of one specific dirty data type. This makes the use of these tools complex and costly. Therefore, ARKTOS was developed as a data cleaning tool with the following goals: (1) the data warehouse transformations and the data cleaning tasks can be defined with graphical and declarative facilities, (2) the quality of data can be measured with specific quality factors, and (3) the complex sequence of transformation and cleaning tasks can be optimized.

In ARKTOS, for each dirty data type, the detection of dirty data is performed by an „activity‟. An activity is an atomic unit of work and a discrete step in the chain of data processing. The work performed by each activity is specified by an SQL statement, which gives a logical, declarative description of the work. Each activity is accompanied by an error type and a policy. The error type of an activity identifies the problem the process is concerned with, such as „Primary key violation‟ or „NULL value existence‟. The policy signifies the way the data should be treated, such as „deleting the tuples‟ or „reporting the tuple to a file or table‟. When multiple activities are involved, users can tailor the set of activities to be executed with the help of a „scenario‟ (a set of processes to be executed together) defined in ARKTOS.

In ARKTOS, the error types the system can deal with include (i) primary key violation, (ii) reference violation, (iii) null value existence, (iv) uniqueness violation, (v) domain mismatch, and (vi) field format transformation. Two methods are proposed in ARKTOS to specify each activity, either graphically or declaratively. These two methods address the issues of user-friendliness and complexity of the existing ETL tools mentioned at the beginning. Regarding the graphical method, a palette with all the possible activities provided by ARKTOS is available for users to compose a scenario from these activities and link them in a serial list for execution. Regarding the declarative method, two declarative definition languages are proposed by ARKTOS, namely the „XML-based Activity Definition Language‟ (XADL) and the „Simple Activity Definition Language‟ (SADL).

XADL is an XML language for data warehouse processes based on a well-defined DTD; writing XADL is verbose, but the result is more comprehensible. SADL is a declarative definition language motivated by the SQL paradigm; it is more compact, resembles SQL and is suitable mostly for trained users. Fig.2.4 and Fig.2.5 show an example from the authors of how the two languages are used to specify a scenario in ARKTOS [45]. The scenario in this example consists of the following activities, ordered as follows: (1) push data from table LINEITEM of source database S to table LINEITEM of the DW database; (2) perform a referential integrity violation check for the foreign key of table LINEITEM in database DW, which references table ORDER, and delete violating rows; (3) perform a primary key violation check on table LINEITEM and report violating rows to a file.

Similar to Potter‟s Wheel, the dirty data types mainly addressed in ARKTOS are schema-level and instance-level data transformations, together with some integrity constraint enforcement. ARKTOS cannot deal with the duplicate record detection problem as AJAX does.

Although ARKTOS allows a set of data cleaning tasks to be specified and performed as a „scenario‟, the „scenario‟ is composed by its users, and no detailed guidance is provided as to how such a scenario should be composed when considering the multiple factors involved in the data cleaning process, such as the different problem domains, the algorithms involved, etc. The developers of ARKTOS have not further investigated the „ordering‟ problem that arises when multiple cleaning tasks are combined in the data cleaning process.

[XADL listing (abridged): the scenario declares the l_orderkey and l_partkey fields of table LINEITEM, a foreign key check referencing Informix.tpcd.tpcd.tpcd.order (o_orderkey), and the SQL statement: select l_orderkey from lineitem t1 where not exists (select o_orderkey from order t2 where t1.l_orderkey = t2.o_orderkey)]

Fig.2.4 XADL definition of a scenario, as exported by ARKTOS


CREATE SCENARIO Scenario3 WITH
  CONNECTIONS S3,DW
  ACTIVITIES Push_lnitem, Fk_lnitem, Pk_lnitem
...
CREATE CONNECTION DW WITH
  DATABASE "jdbc:informix-sqli://kythira.dbnet.ece.ntua.gr:1500/dbdw:informixserver=ol_milos_tcp" ALIAS DBDW
  DRIVER "com.informix.jdbc.IfxDriver"
...
CREATE ACTIVITY Fk_lnitem WITH
  TYPE REFERENCE VIOLATION
  POLICY DELETE
  SEMANTICS "select l_orderkey from lineitem@DBDW t1 where not exists (select o_orderkey from order@DBDW t2 where t1.l_orderkey=t2.o_orderkey)"
...
CREATE QUALITY FACTOR "# of reference violations" WITH
  ACTIVITY fk_lnitem
  REPORT TO "H:\path\scenario3.txt"
  SEMANTICS "select l_orderkey from lineitem@DBDW t1 where not exists (select o_orderkey from order@DBDW t2 where t1.l_orderkey = t2.o_orderkey)"

Fig.2.5 SADL definition of a scenario, as exported by ARKTOS

(4) IntelliClean

IntelliClean is a knowledge-based framework for intelligent data cleaning which mainly deals with duplicate record elimination [9]. According to the authors, although domain knowledge plays an important part in data cleaning, little work has been undertaken on knowledge management issues such as the representation of the domain knowledge used for data cleaning. Besides, traditional data cleaning methods used for duplicate detection rely on computing the degree of similarity between nearby records in a sorted database. In this case a recall-precision dilemma exists, in which high precision is achieved at the cost of lower recall. In order to address these problems, IntelliClean was developed as a framework which provides a systematic approach to representation standardization, duplicate elimination, and anomaly detection and removal in dirty databases. Three stages are included in this framework: (1) the pre-processing stage, (2) the processing stage, and (3) the validation and verification stage.

In the pre-processing stage, data anomalies such as domain constraint violations, misspellings and inconsistent use of abbreviations are first detected and cleaned. For example, date values such as „2/3/2011‟ and „March, 2, 2011‟ can be standardized into one format, and values like „1‟, „A‟ and „M‟ in a gender field can all be replaced by the value „Male‟. This can be realized with the help of some reference functions and look-up tables. These conditioned data records are then input to the processing stage.

In the processing stage, the conditioned records are fed into an expert system engine together with a set of rules which are designed to help with detecting the duplicate records. In particular, a new method of computing the transitive closure is proposed in IntelliClean to increase the recall. In IntelliClean, the „knowledge base‟ is formed by different rules generally written in the following form: IF <condition> THEN <action>. These rules are derived naturally from the business domain. When the condition part of a rule is satisfied, the action part of the rule is activated. A business analyst with subject matter knowledge is expected to fully understand the governing business logic and to develop the appropriate conditions and actions.

In IntelliClean, rules are fed into an expert system engine, making use of an efficient method for comparing a large collection of rules to a large collection of objects. According to the authors, simple rules may be generated automatically when supplied with necessary parameters. However, hand-coding might be required when more complex rules are needed.


All rules in IntelliClean can be categorized into four types, namely „duplicate identification rules‟, „merge/purge rules‟, „update rules‟ and „alert rules‟. More specifically, a duplicate identification rule specifies the conditions for two records to be classified as duplicates. For example, Fig.2.6 shows an example of a duplicate identification rule in IntelliClean, in which duplicate records are searched for in a restaurant relation with attributes ID, Address, and Telephone.

In order to activate the rule specified in Fig.2.6, the corresponding conditions must be satisfied: the telephone numbers must match, and one of the identifiers must be a substring of the other. In addition, the address values of the two records must be very similar, with a similarity higher than 0.7 according to the selected function (FieldSimilarity). Records classified as duplicates by this rule will have a certainty factor of 70%. A certainty factor (CF) represents expert confidence in the rule‟s effectiveness in duplicate record detection, where 0 < CF ≤ 1.

The merge/purge rules specify how duplicate records are to be handled. For example, a simple rule might be „only the tuple with the least number of empty fields is to be kept in a group for further analysis; the rest of the tuples are deleted.‟ Update rules specify the way data is to be updated in a particular situation; for example, an update rule can specify what value should be filled in when a value in a field of a tuple is missing. Finally, an alert rule helps with raising an alert when certain events occur, such as integrity constraint violations.


Define rule Restaurant_Rule
Input tuples: R1, R2
IF (R1.telephone=R2.telephone)
   AND (ANY_SUBSTRING (R1.ID, R2.ID)=TRUE)
   AND (FIELDSIMILARITY(R1.ADDRESS,R2.ADDRESS)>0.7)
THEN DUPLICATES(R1, R2) CERTAINTY=0.7

Fig.2.6 An example of the duplicate identification rule in IntelliClean
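The following Python rendering of the rule in Fig.2.6 is given only for illustration: ANY_SUBSTRING and FIELDSIMILARITY are approximated here with simple stand-in functions, and the two example restaurant records are hypothetical.

from difflib import SequenceMatcher

def any_substring(a, b):
    """Stand-in for ANY_SUBSTRING: is one identifier contained in the other?"""
    return a in b or b in a

def field_similarity(a, b):
    """Stand-in for FIELDSIMILARITY: a rough string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def restaurant_rule(r1, r2):
    """Return the rule's certainty factor if r1 and r2 are classified as duplicates."""
    if (r1["telephone"] == r2["telephone"]
            and any_substring(r1["id"], r2["id"])
            and field_similarity(r1["address"], r2["address"]) > 0.7):
        return 0.7                      # CF attached to this rule
    return None

r1 = {"id": "mario's", "address": "12 High Street", "telephone": "555-1234"}
r2 = {"id": "mario's pizza", "address": "12 High St.", "telephone": "555-1234"}
print(restaurant_rule(r1, r2))          # 0.7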

In the validation and verification stage, human intervention is required to handle the duplicate records that are not dealt with due to the lack of merge/purge rules. This stage also helps with the validation of the rule base: any rule that generates a wrong result will be taken out or have its parameters changed. According to the authors, well-developed rules are effective in identifying true duplicate records while being strict enough to keep out similar records which are not duplicates. In this way, higher recall is achieved with more rules: as concluded by the authors, the recall increases with the number of rules, and more complex rules identify more true duplicate records. This helps to resolve the recall-precision dilemma mentioned at the beginning. In IntelliClean, the sorted neighbourhood method (SNM) is used for the detection of duplicate records. After this algorithm has been run, the transitive closure is computed to group the duplicate records. This procedure can raise the false positive error, as incorrect pairs are merged and the precision of the result is lowered.

IntelliClean tries to reduce the number of wrongly merged duplicate groups by applying a certainty factor (CF) to each duplicate identification rule. Fig.2.6 shows an example in which a CF of 0.7 is attached to the pair of tuples R1 and R2. During the computation of the transitive closure, the value of the CF is compared to a user-defined threshold value, and any merge that would result in a CF value less than the threshold is not executed. In this way, the false positive error is lowered. One limitation of IntelliClean is that, according to the developers, only the SNM method is supported in this tool to detect duplicate records. SNM is a good method for duplicate record detection in large datasets; however, when small datasets are involved, a pairwise comparison is clearly the better way to improve effectiveness.
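The following Python sketch illustrates, under the assumption of a simple union-find representation, how matched pairs could be grouped by transitive closure while merges whose certainty factor falls below a user-defined threshold are skipped; it mirrors the idea described above rather than IntelliClean‟s actual implementation.

def merge_groups(pairs, cf_threshold=0.6):
    """pairs: iterable of (record_id_a, record_id_b, certainty_factor)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for a, b, cf in pairs:
        find(a); find(b)                    # register both records
        if cf >= cf_threshold:              # low-confidence merges are skipped
            parent[find(a)] = find(b)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

pairs = [("r1", "r2", 0.9), ("r2", "r3", 0.7), ("r3", "r4", 0.4)]
print(merge_groups(pairs))   # r1, r2 and r3 are grouped; r4 stays on its own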

(5) Febrl

Matching records that refer to the same entity across databases is becoming an increasingly important part of the data cleaning process. Data from multiple sources needs to be matched in order to enrich data or improve its quality. Although significant advances in record linkage techniques have been made in recent years, according to the authors the vast majority of them are implemented in „black box‟ commercial software, because the details of the technology within the linkage engine of these tools are normally not accessible. This makes it difficult for both researchers and practitioners to experiment with new record linkage techniques and to compare existing techniques with new ones. Additionally, many of these tools are developed exclusively for a certain domain, such as dealing with business data or customer mailing lists. For many applications, record linkage may involve dealing with data from heterogeneous sources in different domains, in which case the record linkage task is often limited by the functionality provided by these tools. In order to address these drawbacks, the freely extensible biomedical record linkage system (Febrl) was developed; it contains many recently developed techniques for data cleaning, de-duplication and record linkage, and encapsulates them within a GUI [46].

For users who have limited programming experience, this tool facilitates the use of record linkage techniques without the need for any programming skills.


In particular, it is suitable for the rapid development, implementation, and testing of novel data cleaning, record linkage and de-duplication techniques due to the availability of its source code, and it allows researchers to compare various existing record linkage techniques with their own, enabling the record linkage research community to conduct their work more efficiently.

According to the authors, Febrl is an open source data cleaning toolkit and the only freely available data cleaning, de-duplication and record linkage system with a graphical user interface (GUI) [47]. The Febrl system has been developed with a focus on the cleaning and linking of health-related data; however, the techniques developed and implemented in Febrl are general enough to be applicable to data from a variety of other domains. Since it was first published in early September 2002, the Febrl system has been hosted on the SourceForge.net open source software repository and is available from:

https://sourceforge.net/projects/febrl/

The latest version of the Febrl system is Febrl-0.4.2, released on 14 December 2011. The Febrl system mainly supports three types of projects, namely „Standardisation‟, „Deduplication‟ and „Linkage‟. Fig.2.7 shows a screen-shot of the main Febrl GUI after start-up.


Fig.2.7 Initial Febrl user interface

In the top middle part of the Febrl GUI, the user can select the type of project he or she wants to conduct. Running these projects carries out the record linkage process, whose steps are briefly detailed as follows:

(a) Data cleaning and standardisation

In order to obtain a successful record linkage result, pre-processing of the input data is required. Regarding the input data, Febrl currently supports three types of text file formats: comma separated values (CSV), tabulator separated values (TAB), and column oriented values with fixed-width fields (COL). Access to a database is not supported in Febrl at the moment. The linkage process is usually based on the available record fields (attributes) such as personal names, address values, date of birth, etc. However, values in such fields often contain noisy, incomplete and incorrectly formatted information. Cleaning and standardization of these data is therefore an important first step for successful record linkage.

The objective of this step is to convert the raw input data into well-defined, consistent formats and to resolve the inconsistencies in the raw input data. Running a „Standardisation‟ project performs this step. In this project, users can standardize the data from a selected file and then save the standardized data into a new file for the purpose of running a „Linkage‟ or a „Deduplication‟ project. In a „Standardisation‟ project, users can define one or more component standardisers; currently, Febrl contains standardisers for names, addresses, dates, and telephone numbers. For each standardiser, a user needs to select one or several input fields from the input dataset and is required to supply the expected formats for its output fields. Additionally, all parameters for each standardiser must be set by the user.

(b) Matching and classification

During the matching process, each record in one dataset potentially needs to be compared with all records in another dataset if a „Linkage‟ project is selected, or with the other records in the same dataset if a „Deduplication‟ project is selected. This comparison process is therefore of quadratic complexity. In order to improve the scalability of the matching process, the potentially very large number of record pairs to be compared has to be reduced. This can be realized by indexing techniques which split the databases into blocks; only records that are in the same block are compared with each other, using the selected comparison functions, such as Jaro, Levenshtein or Q-Gram, applied to the contents of the record fields (attributes). Several indexing techniques are provided in both „Deduplication‟ and „Linkage‟ projects, such as the „FullIndex‟, „BlockingIndex‟ and „SortingIndex‟ techniques. Once an indexing technique is selected, the actual index keys and their parameters have to be defined and provided by the user.
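The following Python sketch illustrates the blocking idea in a simplified form: records are grouped by a blocking key and candidate pairs are generated only within each block. The choice of key (the first three letters of the surname plus the postcode) and the field names are assumptions made for illustration, not Febrl‟s defaults.

from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Derive a simple blocking key from a record."""
    return (record["surname"][:3].lower(), record["postcode"])

def candidate_pairs(records):
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)   # compare only within a block

records = [
    {"id": 1, "surname": "Fraser", "postcode": "EH10"},
    {"id": 2, "surname": "Frasor", "postcode": "EH10"},
    {"id": 3, "surname": "Yang",   "postcode": "EH14"},
]
print([(a["id"], b["id"]) for a, b in candidate_pairs(records)])  # [(1, 2)]

Only the pair within the shared block is compared, so the number of comparisons grows with the block sizes rather than quadratically with the whole dataset.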

Regarding the comparison functions, Febrl provides 26 similarity functions for users to choose from. For each of these functions, users need to select two fields for comparison. Broadly, these functions can be categorized into two groups: functions used for approximate string comparisons and functions used to compare fields containing numerical values such as age, date, and postcode. Finally, the compared record pairs are classified into different groups, such as matches, non-matches, and possible matches. This is realized by applying different decision models to the weight vectors obtained from the matching process.
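A minimal sketch of this classification step is shown below in Python: the similarity weights produced by the comparison functions are summed and two thresholds split the record pairs into matches, possible matches and non-matches. The weights and threshold values are purely illustrative and are not Febrl‟s decision models.

def classify(weight_vector, upper=2.5, lower=1.0):
    """Classify a record pair from its vector of field similarity weights."""
    total = sum(weight_vector)
    if total >= upper:
        return "match"
    if total >= lower:
        return "possible match"
    return "non-match"

print(classify([0.95, 0.90, 1.0]))   # match
print(classify([0.80, 0.40, 0.3]))   # possible match
print(classify([0.10, 0.20, 0.1]))   # non-match

Pairs falling into the „possible match‟ band are typically passed to a human reviewer, which is where the manual effort of record linkage concentrates.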

However, unlike the function provided by IntelliClean, the merging of those linked records is not supported in Febrl and users have to merge the detected records manually.

Still, some limitations are observed in Febrl. Regarding the dirty data types addressed, both data standardization and duplicate record detection are supported in Febrl. However, unlike in other tools such as AJAX or IntelliClean, data standardization and duplicate record detection cannot be specified within the same data cleaning process; each data cleaning task has to be specified and executed separately. With respect to data standardization, Febrl currently only supports some limited instance-level data transformations. Unlike Potter‟s Wheel, further dirty data detection, such as outlier detection against the transformed data values, is not supported in Febrl. Although Febrl supports a variety of techniques to deal with duplicate record detection, choosing a suitable technique as well as setting the corresponding parameters for the selected technique depends entirely on its users; Febrl does not supply any recommendations or help during the selection. For users who do not have any knowledge of these techniques, the use of Febrl is difficult. As will be seen later, even for users who are familiar with these techniques, a poor setting of the required parameters for the selected technique will generate poor matching results. Additionally, Febrl does not support a flexible merging of the linked records after the detection of duplicates, which makes it harder to analyze the results. At the moment, Febrl only supports three types of text file formats as input data: CSV, TAB, and COL. Loading input data from a database and writing the linked output data back into a database are not supported in Febrl.


Finally, the installation of Febrl is quite complex, requiring the manual installation of various Python modules. The execution of its techniques is slow, and for large datasets Febrl requires large amounts of memory, which results in poor scalability.

(6) Summary

To sum up these five approaches, table 2.5 is provided. The names of the data cleaning approaches, the main activities addressed by these data cleaning approaches, and the special features associated with the five approaches are detailed in this table.

Potter‟s Wheel
  Activities: Schema-level data transformation; Instance-level data transformation; Domain constraint resolution
  Special features: Tightly integrates transformations and dirty data detection; Structure extraction technique

AJAX
  Activities: Schema-level data transformation; Instance-level data transformation; Duplicate record detection
  Special features: A separation of logical and physical plan for data cleaning

ARKTOS
  Activities: Schema-level data transformation; Instance-level data transformation; Integrity constraints enforcement
  Special features: A graphical method for user to specify a set of cleaning tasks

IntelliClean
  Activities: Instance-level data transformation; Domain constraint resolution; Duplicate record detection
  Special features: A recall-precision dilemma resolution

Febrl
  Activities: Instance-level data transformation; Duplicate record detection
  Special features: Open source software; A graphical method for user to deal with data standardization and duplicate record detection

Table 2.5 Summary of the five approaches


In detail: The main focus of Potter‟s Wheel is to stress user friendliness and interactivity in various data transforms and conflict resolution, resulting in tight integration of transformation and discrepancy detection.

The exclusively developed „Online Reorderer‟ helps to realize this user interactivity. The „Online Reorderer‟ continually fetches tuples from the data source and divides them into buckets. Each time, the „Online Reorderer‟ only picks a sample of tuples from the bucket corresponding to the scrollbar position and displays them on the screen. Since the number of rows that can be displayed on screen at a time is small, users can perceive instantaneously any data transformations needed at either the schema level or the instance level. In this way, the user can perform the data transforms as they explore the data with the help of the „Online Reorderer‟.

While the user is specifying transforms and exploring the data, the discrepancy detector runs in the background and applies appropriate algorithms to detect errors in the transformed data fetched directly from the „Online Reorderer‟. Regarding the function provided for data transformation, Potter‟s Wheel allows users to specify the desired results on example values and automatically infers a suitable transform using the „structure extraction technique‟ developed exclusively for Potter‟s Wheel. It also allows users to define custom domains and corresponding algorithms to enforce the domain constraints. Compared with other tools such as AJAX and IntelliClean, which only support some predefined domain-specific transformations for fields such as „date of birth‟ or „telephone number‟, this is an advance. However, since the detection of the required data transformations in Potter‟s Wheel depends entirely on manual perception, its efficiency and degree of automation are very low compared with other tools.

The design of AJAX is twofold: (1) a declarative language for expressing data cleaning tasks on tables, and (2) a separation between the logical plan, which decides the cleaning tasks, and the physical plan, which optimizes the choice of techniques. The advantage of AJAX compared with other tools is that the separation of the logical and physical levels of the data cleaning process allows a series of data cleaning tasks to be specified using a declarative language, while specific optimized algorithms can be selected to implement these tasks at the physical level. For example, considering the matching task for different fields of a database table, the matching of „personal names‟ and the matching of „company names‟ may be associated with different techniques according to different physical plans, rather than applying a single, non-exhaustive matching algorithm.

The main contribution of ARKTOS is the presentation of a uniform model covering all the aspects of a data warehouse ETL process. Regarding the types of dirty data that can be dealt with, data transformations at both the schema level and the instance level are supported in ARKTOS. Additionally, some integrity constraint enforcement is provided to prevent primary key violations, reference violations, null value existence and uniqueness violations. Similar to AJAX, these data cleaning tasks can be specified with declarative definition languages, two of which are developed in ARKTOS. In addition, ARKTOS supports a graphical method for a user to specify these cleaning tasks; compared with AJAX this is an advance, as a user can compose a scenario from these cleaning tasks and link them into an execution list graphically. Although the authors of AJAX and ARKTOS mention the organization of multiple data cleaning tasks in a program, it is the users who are required to organise the multiple tasks: in these tools, the users tailor the set of data cleaning tasks to be executed according to their individual preferences. The developers of these tools have not undertaken any further investigation of the „ordering‟ problem that arises when multiple cleaning tasks are required.

IntelliClean is a knowledge-based framework that mainly deals with the problem of object identification. The detection of duplicate records depends entirely on rules derived naturally from the business domain. The drawback is that for some complex rules hand coding is required, which decreases the degree of automation. Although the function of merging detected duplicate records is also supported in AJAX, the exclusively developed method of computing the transitive closure during the merging of records increases the recall in IntelliClean. However, compared with Febrl, the algorithms provided in IntelliClean are limited. For example, regarding the algorithms used for duplicate record detection, only „SortedIndex‟ is available in IntelliClean. SortedIndex is a good method for duplicate record detection in large datasets; however, when small datasets are involved, the „FullIndex‟ approach is clearly a good solution to improve the effectiveness of the detection. This is a drawback for IntelliClean compared to AJAX or Febrl, in which all these solutions are supported to cope with different situations.

Febrl is an open source data cleaning and record linkage system which includes a variety of techniques for data standardization and duplicate record detection. An advantage of Febrl is the provision of a graphical user interface to its users. Compared with tools such as AJAX, ARKTOS and IntelliClean, this is especially helpful for users who do not have any programming skills.

Regarding the dirty data types addressed in Febrl, only data standardization and duplicate record detection are supported. Additionally, data standardization only supports some domain-specific instance-level data transformations; compared with Potter‟s Wheel, AJAX and ARKTOS, schema-level data transformation is not supported in Febrl. Besides, Febrl does not offer any solution to detect anomalies based on the transformed data as Potter‟s Wheel does. In Potter‟s Wheel, as soon as a date value like „March 1, 20111‟ is transformed to the expected format „01/03/20111‟, the sub-component „20111‟ of this date value will also be flagged as an outlier with the help of an appropriate algorithm. In Febrl, such further detection on the transformed data values is not supported. In Febrl, the tasks of data standardization and duplicate record detection have to be specified and executed separately and cannot be performed in a single data cleaning process; this results in low efficiency compared with AJAX and IntelliClean, which can handle multiple data cleaning tasks in a single data cleaning process.

Additionally, unlike AJAX and IntelliClean, Febrl does not support a flexible merging of the linked records into a linked output dataset. In AJAX and IntelliClean, both transitive closure calculation and merging of the linked records are supported. Without a proper merging function for the detection results, it is difficult for users to analyze the quality of the detection.

Besides, currently, Febrl only supports three types of text file formats as the input data: CSV, TAB, and COL. Unlike the other tools, loading input data from a database and writing the output data back into a database are not supported in Febrl.

2.4 Data quality, data quality dimensions and other related concepts

With recent advances in technology, a large quantity of data can be created, stored and processed by companies. As data is increasingly used to support organizational activities such as data warehousing applications, poor quality data may negatively affect organizational effectiveness and efficiency. In this section, data quality, data quality dimensions, the cost and impact of poor data quality, and data quality assessment are reviewed.


2.4.1 Data Quality

Quality plays an important role as one of the powerful competitive advantages for those companies that run businesses in the information industries. Data quality is regarded as the basis of an information system [8, 25, 48-54].

From the literature, the term „data quality‟ is complex and no widely accepted definition yet exists. For example, from the standpoint of feedback-control systems, data quality is defined as the measure of the agreement between the data views presented by an information system and that same data in the real world [55]. A data quality rating of 100% would indicate, for example, that a system‟s data views are in perfect agreement with the real world, whereas a data quality rating of 0% would indicate no agreement at all. Since no serious information system has a data quality rating of 100%, the real concern with data quality is to ensure that the system‟s data is accurate enough, timely enough, and consistent enough for the organization to survive and make reasonable decisions.

Another approach to defining the term „quality‟, which is widely adopted in most of the quality literature, focuses on the consumer and the product‟s fitness for use [56]. The concept of „fitness for use‟ emphasizes the importance of taking a consumer‟s viewpoint of quality, because ultimately it is the consumer who will judge whether or not a product is fit for use. In order to understand the concept fully, researchers have traditionally identified a number of specific quality dimensions; a dimension or characteristic captures a specific facet of quality. Wang et al proposed a framework for data quality research. In this work, the authors identified dozens of related research publications with respect to data quality [3]. They found that different combinations of dimensions, as well as a variety of approaches, are applied within previous research. The most commonly used dimensions, according to their observations, are accuracy, timeliness, completeness, and consistency; some dimensions occurring less frequently are traceability and credibility. Wang et al argue that previous research has mainly focused on accuracy requirements and that, since data quality is a multi-faceted concept which includes more than accuracy, more research on other dimensions is needed. Wang et al therefore drew an analogy between the manufacture of products and the processing of data, i.e., information systems were considered analogous to manufacturing systems, with the difference being that data are used as the raw material, and processed data, sometimes referred to as information, are the output.

Adopting a customer perspective similar to the one advocated by Juran [57], Wang et al noted that the “use of the term „data product‟ emphasizes the fact that the data output has value that is transferred to customers, whether internal or external to the organization”. This has become one of the driving forces behind the work by Wang and Strong [4]. Wang and Strong focus on developing a framework that captures the aspects of data quality, which are important to data consumers. In this work, the authors argue that although firms are improving data quality with practical approaches and tools, their efforts tend to focus narrowly on accuracy. A two-stage survey is undertaken in this work. Based on the survey in the first stage, a set of nearly 200 data quality attributes are applied and finally, the authors use factor analysis to narrow the entire set to obtain a much more parsimonious set of 20 dimensions. In the survey of the second stage, the authors reduced this set even further to obtain 15 dimensions. The 15 dimensions are then grouped into four different categories: intrinsic, contextual, representational, and access. The four categories are introduced by the authors as follows: “Intrinsic quality denotes that data have quality in their own right. Contextual quality highlights the requirement that data must be considered within the context of the task at hand. Representational quality and accessibility quality emphasize the importance of the role of systems. These findings are consistent with our understanding that high-quality data should be intrinsically good, contextually appropriate for the task, clearly represented, and accessible to the data consumer.” [4].


It has been pointed out that the choice of these dimensions is primarily based on intuitive understanding [58], industrial experience [59], or literature review [60]. However, according to Wang et al‟s work, there is no general agreement on data quality dimensions [3]. Consider the „accuracy‟ dimension, a dimension which most work has included. Although the term has an intuitive appeal, there is no commonly accepted definition of what exactly „accuracy‟ means. For example, Kriebel [60] characterizes accuracy as “the correctness of the output information”, while Ballou and Pazer [58] describe accuracy as “the recorded value is in conformity with the actual value.” Thus, it appears that the term is viewed as equivalent to correctness. However, using one term to define another does not serve the purpose of clearly defining either. In short, despite the frequent use of certain terms to indicate data quality, a rigorously defined set of data quality dimensions does not exist.

Clearly, the notion of data quality depends on the actual use of data. What may be considered good data in one case (for a specific application or user) may not be sufficient in another case. For example, analysis of the financial position of a firm may require data in units of thousands of dollars, whereas auditing requires precision to the cent. This relativity of quality presents a problem. The quality of the data generated by an information system depends on the design of the system, yet the actual use of the data is outside of the designer‟s control. Thus, it is important to provide a design-oriented definition of data quality that will reflect the intended use of the information.

2.4.2 Data quality dimensions

From the literature, data quality can be defined as “fitness for use”, i.e., the ability of data to meet the user‟s requirements. This definition directly implies that the concept of data quality is relative. Some commonly used data quality dimensions include accuracy, completeness, timeliness, and consistency. A dimension captures a specific facet of quality, so data quality can be considered a multi-dimensional concept, with the dimensions measuring data quality from different angles. To illustrate the multi-dimensional nature of data quality, table 2.6 below shows four student records from a university in the UK.

No.   Name              Sex   Supervisor    R.D         G.D
001   Mark Levison      M     John Smith    2000-10-1   2003-9-1
002   Elizbeth Fraser   F     H.Winston     2001-10-5   NULL
003   Jack Daniel       F     Alex Smith    2002-3-4    2006-9-1
004   Catherine Yang    F     Thomas Lee    2005-4-2    2009-9-21

Table 2.6 An example of four student records of a university in the UK.

In table 2.6, when the “Name” column is checked, a misspelling of a student name is detected, i.e. „Elizbeth‟ rather than „Elizabeth‟. With respect to data quality, this causes an accuracy problem. Further checking the table, a null value for “G.D” (Graduation Date) is found for Elizabeth. The null value here may have two indications: one is that Elizabeth is still studying in the university and her graduation date is therefore still unknown; in this case, data quality is not affected by the null value. The other indication is that Elizabeth has already graduated from the university but her graduation date has not been stored in the database; in this case, the null value causes a completeness problem, as the value of her graduation date is supposed to be there. In the column “Supervisor”, suppose it is required that the domain format for the name of the supervisor should follow the pattern of “First Name Last Name”. Since “H.Winston” does not conform to this requirement, it causes an inconsistency problem. This example clearly shows that data quality is a multi-dimensional concept. Wang et al discussed how to construct specific data quality dimensions. Their group first gathered 179 data quality attributes from the data quality literature, from researchers and from consumers. They then used factor analysis to collapse the list of attributes into fifteen data quality dimensions, which are shown in the table below with a brief description of each [61].

Data quality dimension: Description

Access Security: Access to data must be restricted, and hence, kept secure.
Accessibility: Data must be available or easily and quickly retrievable.
Accuracy: Data must be correct, reliable, and certified free of error.
Appropriate Amount of Data: The quantity or volume of available data must be appropriate.
Believability: Data must be accepted or regarded as true, real, and credible.
Completeness: Data must be of sufficient breadth, depth, and scope for the task at hand.
Concise Representation: Data must be compactly represented without being overwhelming.
Ease of Understanding: Data must be clear, without ambiguity, and easily comprehended.
Interpretability: Data must be in appropriate language and units, and the data definitions must be clear.
Objectivity: Data must be unbiased (unprejudiced) and impartial.
Relevancy: Data must be applicable and helpful for the task at hand.
Representational Consistency: Data must always be presented in the same format and compatible with previous data.
Reputation: Data must be trusted or highly regarded in terms of their source or content.
Timeliness: The age of the data must be appropriate for the task at hand.
Value-Added: Data must be beneficial and provide advantages from their use.

Table 2.7 Data quality dimensions

From the literature, different researchers have proposed different sets of data quality dimensions. However, due to the contextual nature of quality, there is disagreement about what constitutes a set of „good‟ data quality dimensions. Research shows that a general set of data quality dimensions that can be used to measure data quality does not exist [4, 62-65].

For example, Wang et al argue that “there is no general agreement on data quality dimensions” and that three primary streams of research (data quality, information systems, and accounting and auditing) have attempted to identify appropriate DQ dimensions [66]. The six most important sets of data quality dimensions are presented by Wand and Wang [62], Wang and Strong [4], Redman [63], Jarke [67], Bovee [64], and Naumann [65]. In these six works, six data quality dimensions are considered by the majority of authors: accuracy, completeness, consistency, timeliness, interpretability, and accessibility [68].

With respect to the definitions of each dimension, there is no general agreement on what an appropriate definition is for each data quality dimension. These data quality dimensions are not defined in a measurable and formal way; they have been defined by means of descriptive sentences whose semantics are consequently disputable. For example, regarding time-related dimensions, Wand and Wang present a „timeliness‟ dimension which is defined as “the delay between a change of a real world state and the resulting modification of the information system state” [62]. In Redman‟s work, a „currentness‟ dimension is defined as “the degree to which a datum is up-to-date. A datum value is up-to-date if it is correct in spite of possible discrepancies caused by time related changes to the correct value” [63]. The meanings of these two definitions are quite similar but the names of the two dimensions are different. In Wang and Strong‟s work, a „timeliness‟ dimension is defined as “The extent to which age of the data is appropriate for the task at hand.” [4]. A similar definition can be found in Liu‟s „timeliness‟ dimension as “the extent to which data are sufficiently up-to-date for a task.” [69]. However, Naumann defines the „timeliness‟ dimension as “the average age of the data in a source”, which is quite different from the definitions of Wang and Strong and of Liu [65]. Bovee defines the
„timeliness‟ dimension with two levels: „currency‟ and „volatility‟ [64]. The currency level of timeliness is defined as “A measure of how old the information is, based on how long ago it was recorded.”, which has the same meaning as the „timeliness‟ dimension defined by Wang and Strong. The volatility level of timeliness from Bovee is defined as “a measure of information instability-the frequency of change of the value for an entity attribute.”, which corresponds to the „volatility‟ dimension defined by Jarke [70]. Jarke defines the „volatility‟ dimension as “the time period for which information is valid in the real world”. This example clearly shows that there is no agreement on the semantics of specific dimensions, i.e., different meanings may be provided by different authors. Moreover, there is not even agreement on the names used for the dimensions.

Broadly, the works related with the classification of data quality dimensions can be categorized into two groups: (i) academics‟ view of data quality dimensions, and (ii) practitioners‟ view of data quality dimensions [71]. Table 2.8 and Table 2.9 present a collection of works under the two groups respectively. In both tables, all dimensions mentioned have been grouped into the four data quality categories proposed by Wang and Strong, namely intrinsic, contextual, representational, and accessibility [4]. Intrinsic quality denotes that data have quality in their own right. Contextual quality highlights the requirement that data must be considered within the context of the task at hand. Representational and accessibility quality emphasize the importance of the role of systems that store and provide access to data.


Category  Dimension  Wang and Strong [4]  Zmud [69]  Jarke and Vassiliou [70]  DeLone and McLean [71]  Goodhue [72]  Ballou and Pazer [55]  Wand and Wang [59]

Intrinsic Accuracy x x x x x x

Believability x x

Completeness x

Consistency x x

Correctness x

Credibility x

Factual x

Freedom from Bias x

Objectivity x

Precision x

Reliability x x

Reputation x

Unambiguous x

Contextual Appropriate Amount x

Completeness x x x x

Content x

Currency x x

Importance x

Informativeness x

Level of Detail x

Non-volatility x

Quantity x

Relevance x x x


Reliable/Timely x

Source currency x

Sufficiency x

Timeliness x x x x

Usage x

Usefulness x

Value-Added x

Representational Aliases x

Appearance x

Arrangement x

Clarity x

Comparability x

Compatibility x

Conciseness x x

Consistent x

Format x

Interpretability x x

Lack of Confusion x

Meaningfulness x x

Origin x

Presentation x

Readability x x

Reasonable x

Semantics x

Syntax x

Understandability x x

Uniqueness x

Version control x

Accessibility Accessibility x x x x


Assistance x

Ease of Use x x

Locatability x

Privileges x

Quantitativeness x

Security x

System availability x

Transaction availability x

Usableness x

Table 2.8 Data quality dimensions from academics’ view [71]

Category  Dimension  DOD [73]  IRI [74]  AT&T and Redman [75]  Unitech [76]  Diamond Technology Partners [77]  HSBC Asset Management [78]  Vality [79]

Intrinsic Accuracy x x x x x

Completeness x

Consistency x x x

Correctness x

Reliability x

Validity x

Contextual Attribute granularity x

Completeness x x x

Comprehensiveness x

Currency x x

Essentialness x

Relevance x


Timeliness x x x

Representational Ability to represent null values x

Appropriate representation x

Clarity of definition x

Consistency x

Efficient use of storage x

Format flexibility x

Format precision x

Homogeneity x

Identifiability x

Interpretability x

Metadata characteristics x

Minimum unnecessary redundancy x

Naturalness x

Portability x

Precision of domains x

Representation consistency x

Semantic consistency x

Structural consistency x

Uniqueness x

Accessibility Accessibility x x

Flexibility x


Obtainability x

Privacy x

Reliability (of delivery) x

Robustness x

Security x

Table 2.9 Data quality dimensions from practitioners’ view[71]

Works from table 2.8 can be further categorized into three groups. The first group is based on an empirical, market-research approach of collecting data from information consumers to determine the dimensions of importance to them. Both Wang and Strong [4] and Zmud [72] fall into this group. The second group develops dimensions from the literature. Work by DeLone and McLean [73], Goodhue [74], and Jarke and Vassiliou [75] belongs to this group. They try to cover all possible aspects of data quality by grouping all measures from the existing literature. Finally, the third group focuses on a few dimensions that can be measured objectively, without considering the dimensions‟ importance to data consumers [58, 62]. Table 2.9 presents a collection of work from the practitioners‟ view. Unlike the academic views, a practitioner‟s view does not try to cover all possible data quality dimensions but focuses only on specific organizational problems. These practitioners include specialists from organizations, consultants and vendors of products. According to the different contexts involved, different dimensions are defined. Contexts from table 2.9 include: data warehouse development [76, 77], an environment with multiple incompatible databases [78], an environment in which timely delivery of information is critical [79], and tools for improving the quality of data input to databases [80].

A further study with respect to data definition shows that the definition of data is not only a collection of triples <e, a, v>, where e stands for an entity, a stands for an attribute of the entity and v is a value selected from the domain of the attribute a, but also includes the definition of data representation and data recording [81]. This definition divides the quality of data into three sets of quality issues: the quality of the model or view, the quality of the data values themselves, and the quality of data representation and recording [23]. According to David Loshin, “the dimensions associated with data values and data presentation in many cases lend themselves handily to system automation and are the best ones suited for defining rules used for continuous data quality monitoring” [12]. In this research, only data quality dimensions associated with data values are considered. This helps us with generating the proposed rule-based taxonomy of dirty data, which will be discussed in detail in chapter 3. Fox et al have defined and discussed four dimensions of data most pertinent to the quality of values: the accuracy, completeness, currentness and consistency dimensions [23]. These four dimensions are briefly discussed below and they will be used in the proposed dirty data taxonomy in chapter 3.

(i) Accuracy dimension

Suppose a datum is defined as a triple <e, a, v> where e stands for an entity, a stands for an attribute of the entity and v is a value selected from the domain of the attribute a. The accuracy of the datum refers to the degree of closeness of its value v to some value v‟ in the attribute domain considered correct for the entity e and attribute a. If the datum‟s value v is the same as a correct value v‟, the datum is said to be accurate or correct. As an example, the value v of the attribute “Name” of entity “Student” in table 2.6 (identified by No. 002) is “Elizbeth Fraser” rather than “Elizabeth Fraser”. The datum is therefore not correct and causes an accuracy problem. Accuracy problems can be classified as syntactic accuracy problems and semantic accuracy problems. The misspelt name value “Elizbeth Fraser” is a syntactic accuracy problem. Semantic accuracy problems describe the case in which a data value v is itself syntactically correct but presents a different meaning from v‟. As an example of a semantic accuracy problem, consider a record from table 2.6 again. Suppose in the record with No. 003 the student name “Jack Daniel” is entered in the “Supervisor” field and “Alex Smith” is entered in the field “Name”; this causes a semantic accuracy problem, even though both name values are syntactically accurate.
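Syntactic accuracy problems of this kind are commonly detected by comparing a stored value against a reference list of known-correct values with an approximate string matching measure, a topic returned to in Chapter 5. The following Python sketch illustrates the idea; the edit-distance function is a standard Levenshtein implementation, while the reference list, function names and threshold are illustrative assumptions rather than part of the proposed framework.

    def edit_distance(s, t):
        # Classic dynamic-programming Levenshtein distance.
        m, n = len(s), len(t)
        prev = list(range(n + 1))
        for i in range(1, m + 1):
            curr = [i] + [0] * n
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                curr[j] = min(prev[j] + 1,         # deletion
                              curr[j - 1] + 1,     # insertion
                              prev[j - 1] + cost)  # substitution
            prev = curr
        return prev[n]

    def suggest_correction(value, reference, max_dist=2):
        # Suggest the closest reference value when 'value' is not itself in
        # the reference list and lies within 'max_dist' edits of an entry.
        if value in reference:
            return None  # value is syntactically accurate
        best = min(reference, key=lambda r: edit_distance(value, r))
        return best if edit_distance(value, best) <= max_dist else None

    # Hypothetical reference list of first names, not data from table 2.6.
    reference_names = ["Elizabeth", "Catherine", "Jack", "Mark"]
    print(suggest_correction("Elizbeth", reference_names))  # -> Elizabeth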

(ii) Completeness dimension

Fox et al state that completeness is the degree to which a data collection has values for all attributes of all entities that are supposed to have values. The degree of completeness can be measured at three levels, namely tuple, attribute and relation. Tuple completeness measures the percentage of non-missing values of a record relative to the total number of attributes of the record. For example, in table 2.6, the records with student No. 001, 003 and 004 all have values for each attribute, so the tuple completeness for these records is 6/6 = 1. The tuple completeness of the record with student No. 002 is 5/6 = 83.33%, since its graduation date is missing. Attribute completeness measures the percentage of non-missing values in a column relative to the total number of values in that column. As an example of attribute completeness, in table 2.6 the completeness of the graduation date column is 3/4 = 75%. Relation completeness measures the percentage of non-missing values in the whole table relative to the total number of values in the table. The relation completeness of table 2.6 is 23/24 = 95.83%.
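As a concrete illustration of these three measures, the sketch below computes tuple, attribute and relation completeness for table 2.6 represented as a list of records; the dictionary representation and the use of None as the only missing-value marker are assumptions made purely for illustration.

    RECORDS = [
        {"No": "001", "Name": "Mark Levison",    "Sex": "M", "Supervisor": "John Smith", "RD": "2000-10-1", "GD": "2003-9-1"},
        {"No": "002", "Name": "Elizbeth Fraser", "Sex": "F", "Supervisor": "H.Winston",  "RD": "2001-10-5", "GD": None},
        {"No": "003", "Name": "Jack Daniel",     "Sex": "F", "Supervisor": "Alex Smith", "RD": "2002-3-4",  "GD": "2006-9-1"},
        {"No": "004", "Name": "Catherine Yang",  "Sex": "F", "Supervisor": "Thomas Lee", "RD": "2005-4-2",  "GD": "2009-9-21"},
    ]
    ATTRIBUTES = ["No", "Name", "Sex", "Supervisor", "RD", "GD"]

    def tuple_completeness(record):
        # Share of non-missing values in one record.
        return sum(record[a] is not None for a in ATTRIBUTES) / len(ATTRIBUTES)

    def attribute_completeness(attribute):
        # Share of non-missing values in one column.
        return sum(r[attribute] is not None for r in RECORDS) / len(RECORDS)

    def relation_completeness():
        # Share of non-missing values over the whole table.
        present = sum(r[a] is not None for r in RECORDS for a in ATTRIBUTES)
        return present / (len(RECORDS) * len(ATTRIBUTES))

    print(tuple_completeness(RECORDS[1]))   # 5/6  = 0.8333...
    print(attribute_completeness("GD"))     # 3/4  = 0.75
    print(relation_completeness())          # 23/24 = 0.9583...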

(iii) Currentness dimension

Some data in a database are static. For example, a person‟s birthday, country of birth and skin colour will normally not change during the whole life of this person. By contrast, some data such as the age, address or weight of a person may change as time goes by. In order to evaluate such temporal data, the currentness dimension is introduced. According to Fox et al, a datum is said to be current or up to date at time t if it is correct at time t. A datum is out of date at time t if it is incorrect at t but was correct at some moment preceding t. As an example of a currentness problem, suppose John Smith had been living in London, UK until the end of 2008. In 2009 he moved to Edinburgh, UK, so the residence address stored for John Smith should also be changed, i.e., in 2009, when he moved to Edinburgh, the value of John Smith‟s residence address in the database should be changed to his address in Edinburgh. If so, the data is said to be current. Due to late updates of data, currentness problems can be very costly. For example, a survey shows that the average annual cost of returned mail is more than $9,000 per company [82].
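One simple way to operationalize a currentness check, sketched below, is to compare the time elapsed since a value was last verified with the expected change period (volatility) of the attribute; the threshold, dates and function name are illustrative assumptions and are not part of Fox et al's definition.

    from datetime import date, timedelta

    def possibly_out_of_date(last_verified, volatility_days, today=None):
        # Treat a datum as potentially out of date if it has not been
        # verified within its expected change period.
        today = today or date.today()
        return today - last_verified > timedelta(days=volatility_days)

    # Hypothetical example: an address last confirmed at the end of 2008,
    # checked at the end of 2009 against a one-year volatility period.
    print(possibly_out_of_date(date(2008, 12, 1), 365, today=date(2009, 12, 31)))  # True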

(iv) Consistency dimension

Data is said to be consistent with respect to a set of data model constraints if it satisfies all the constraints in the set. For example, databases may be designed and maintained independently to serve specific needs, and therefore the value v of the same attribute a for the same entity e in different databases may be presented in different formats and measured in different units. When these databases come to be integrated, inconsistency problems may occur.
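Consistency constraints of the kind illustrated in table 2.6, such as the requirement that a supervisor name follows the “First Name Last Name” pattern, can be expressed as simple format checks. The sketch below uses a deliberately simplified regular expression as an assumed pattern; real name formats are of course more varied.

    import re

    # Simplified "First Name Last Name" pattern: two capitalised words
    # separated by a single space (an assumption for illustration only).
    SUPERVISOR_PATTERN = re.compile(r"^[A-Z][a-z]+ [A-Z][a-z]+$")

    def consistent_supervisor(value):
        # True if the value conforms to the assumed domain format.
        return bool(SUPERVISOR_PATTERN.match(value))

    print(consistent_supervisor("John Smith"))  # True
    print(consistent_supervisor("H.Winston"))   # False, an inconsistency problem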

2.4.3 Impacts and costs of data quality

There is strong evidence that data quality problems have become increasingly prevalent in practice with most organizations facing data quality problems [7, 62, 83]. The quality of data is critical to an organization‟s success. However, not many organizations have taken enough action to deal with data quality problems.

Low quality data brings several negative effects to business users through loss of customer satisfaction, high running costs, inefficient decision-making processes and reduced performance [50, 52, 54, 84]. For example, information research has demonstrated that inaccurate and incomplete data may adversely affect the competitive success of an organization [78]. These shortcomings of low quality data affect not only corporate competitiveness but also have negative effects on the organizational culture, such as demoralization of employees and a trend of mutual distrust within an organization. In a broad spectrum of organizations, a number of business initiatives have been delayed or even cancelled, citing poor data quality as the main reason.

Data quality problems can bring significant social and business impacts [25]. For example, because of outdated information in government databases, tax bills continue to be sent to citizens long after their death. Business and industry often have similar data quality problems, which are pervasive, costly and sometimes disastrous [85-87]. For example, a financial institution was embarrassed by a wrongly entered execution order of 500 million dollars [85]. The loss of the space shuttle Columbia, which broke apart during re-entry [88], and the shooting down of an Iranian commercial passenger jet by the U.S. Navy cruiser USS Vincennes, in which all 290 people on board were killed, have both been attributed to data quality problems [86].

Although more and more references to poor data quality and its impact have appeared in the media, general-readership publications, and technical literature, the necessary awareness of poor data quality, while growing, has not yet been achieved in many enterprises [50].

There are many reasons for the inadequate attention given by organizations to data quality, for example, a lack of appreciation of the types and extent of dirty data that permeate data warehouses. As practitioners know, creating awareness of a problem and its impact is a critical first step toward resolution of the problem [50]. In this section, the impacts of poor data quality on an organization as well as the costs associated with data quality problems are reviewed and analyzed.

2.4.3.1 The impact

Poor data quality impacts an organization in many ways, which can be categorized into three different levels [7]:

 Impacts at the operational level: There are three main impacts associated with the operational level, namely customer dissatisfaction, increased cost, and lowered employee job satisfaction. With respect to customers, for example, customers of a telephone company expect their personal information, such as names and postal addresses, to be stored correctly so that their monthly bills or promotional letters arrive on time. However, problems sometimes happen where a customer‟s mail is not correctly addressed due to a wrongly spelt name or address. Customers then receive their bill late or never receive it, and are forced to spend time straightening out the resulting billing errors. Many online shopping customers simply expect the details associated with their order to be correct, and they are especially unforgiving of data errors, for example, a wrong price tag or a wrong status of goods availability. Regarding cost at the operational level, research shows that the costs incurred by customer service organizations to correct customer addresses, orders, and bills can be quite high [89].

 Impact at the tactical level: At the tactical level, an organization‟s decision making will be compromised by poor data quality. Since any decision of consequence depends on thousands of pieces of data, defective data will lead to poor decision making. For example, poor data makes the implementation of data warehouses, whose purpose is to help an organization make better decisions, more difficult. The slightest suspicion of poor data quality often hinders managers from reaching any decision. It is clear that decisions based on the most relevant, complete, accurate and timely data have a better chance of advancing the organization‟s goals. At the tactical level, poor data quality also makes it more difficult to reengineer and increases mistrust among internal organizations.

 Impact at the strategic level: Selecting, developing and evolving a strategy is itself a decision-making process. It is clear that strategy making will be adversely affected by poor data quality. It is a hindrance to develop a good strategy without relevant, complete, accurate, and timely data about an organization‟s customers, competitors and technologies, as well as other relevant data. Since strategy has much longer-term consequences for an organization, the impact at this level will be at least as great. When a strategy is rolled out, specific plans are deployed and results are obtained. If the reported results are in some way of poor quality, execution of the strategy will be much more difficult.

2.4.3.2 The cost

From the literature, the costs due to the lack of data quality are substantial in many companies [48, 63, 83, 90]. However, few studies have been done to identify, categorize and measure the costs associated with low data quality. Most organizations do not have adequate processes and tools to maintain high quality operational data, and one of the reasons is a lack of appreciation of such costs.

The lack of insight regarding the monetary effect of low quality data, however, is not only an open research problem but also a pressing practitioner issue. For an organization, there are many reasons for the lack of attention to data quality problems, for example, a lack of knowledge of dirty data. It has been pointed out that calculating the current costs caused by low quality data is difficult because many of these costs are indirect costs which do not have an immediate link between the inadequate data quality and the negative monetary effects [87].

The term cost in the context of data quality can be defined as a resource sacrificed or forgone to achieve a specific objective, or as the monetary effects of certain actions or a lack thereof [87]. From the literature, the cost of poor data quality for an organization can be broadly categorized into two groups: the cost of low quality data [48, 64, 91-97] and the cost of assuring high quality data [48, 91, 97].

With respect to the cost of low quality data, for example, when the account information of customers or citizens is incorrect, such as the license fees or taxes citizens owe the government, organizations will lose money. Checking customer information in a database often reveals misspelled customer names, incomplete postal addresses or outdated addresses. If incorrect postal addresses are used by organizations, money is clearly wasted when organizations try to post marketing materials to their customers.

Furthermore, if such incorrect information is used by organizations for the purpose of analyzing customers‟ shopping behaviour or for customer segmentation, the results obtained will also be incorrect, which will lead to inaccurate strategic and tactical decisions and, further, to opportunity losses.

For customers, if an organization repeatedly makes mistakes due to persistent low quality data, customers will feel disappointed and frustrated. They may switch to a competitor for goods and services, and the image of the organization will be tarnished.

Incorrect or outdated control data can lead to an invasion of privacy, for example, when database administrators do not properly manage the access control list by failing to update it in a timely manner. It can happen that employees who have been made redundant can still access or log into the system and obtain private resources.

Although it rarely happens, low quality data can cause personal injury or even death. For example, wrong instructions, due to wrong or outdated data, for operating some types of machines such as hazardous equipment can cause accidents and even disasters. Finally, invasion of privacy, personal injury and death, as well as significant revenue losses, will likely expose organizations to lawsuits. Regarding the cost of assuring data quality, the process of preventing, detecting and repairing low quality data requires human resources as well as licences for software tools, which costs organizations money. In particular, manual involvement is typically costly. Table 2.10 and table 2.11 present two cost lists: the costs resulting from low quality data and the costs of assuring data quality.

Costs resulting from low quality data
Higher maintenance costs
Excess labor costs
Higher search costs
Assessment costs
Data re-input costs
Time costs of viewing irrelevant information
Loss of revenue
Cost of losing current customer
Cost of losing potential new customer
„Loss of orders‟ costs
Higher retrieval costs
Higher data administration costs
General waste of money
Costs in terms of lost opportunity
Costs due to tarnished image (or loss of goodwill)
Costs related to invasion of privacy and civil liberties
Costs in terms of personal injury and death of people
Costs because of lawsuits
Process failure costs
Information scrap and rework costs
Lost and missed opportunity costs
Costs due to increased time of delivery
Costs of acceptance testing

Table 2.10 Cost from low quality data

Costs of assuring data quality
Information quality assessment or inspection costs
Information quality process improvement and defect prevention costs
Preventing low quality data
Detecting low quality data
Repairing low quality data
Costs of improving data format
Investment costs of improving data infrastructures
Investment costs of improving data processes
Training costs of improving data quality know-how
Management and administrative costs associated with ensuring data quality

Table 2.11 Cost of assuring data quality

Knowledge of the different costs associated with poor data quality provides many benefits for an organization. For example, before investing in a data quality project or initiative, a company may want to examine the potential risks associated with low quality data in order to better position the issue within its corporate context. Instead of an undirected, heuristic search for possible past experiences or events, direct and indirect data quality costs can be examined in terms of their likelihood and effect, thus contributing to an overall risk assessment of low data quality in an organization.

2.4.4 Data quality assessment

Many research activities have been undertaken and have contributed to improving organizations‟ data quality [4, 25, 58, 83, 98-104]. From the literature, the data quality problem has been treated as an important concern in data warehousing projects [8, 59, 105, 106]. However, the ability of an organization to assess its data quality is still weak. Without the ability to assess the quality of their data, organizations cannot assess the status of their organizational data quality and monitor its improvement. For any data quality project, it is important to develop an overall model with an accompanying assessment instrument for measuring data quality. Furthermore, techniques developed to compare the assessment results against benchmarks are necessary for prioritizing an organization‟s data quality improvement efforts.

It is well accepted that the quality of a product cannot be assessed independently of the consumers who choose and use the product [107]. Similarly, data quality cannot be assessed independently of the people who use data, i.e., data consumers. Data consumers evaluate data quality relative to their tasks. Data consumers perform many different tasks and the data requirements for these tasks change. It is possible that the same data used by different tasks may require different quality characteristics. For example, it is possible for an incorrect character in a text string to be tolerable in one circumstance but not in another. Therefore, providing high quality data along the dimensions of value and usefulness relative to data consumers‟ task contexts places a premium on designing flexible systems with data that can be easily aggregated and manipulated [25]. From the literature, data quality is a multi-dimensional concept [4, 58, 62, 63, 83, 100]. In order to have data quality assessed, both subjective and objective data quality metrics need to be considered [50, 108].

Subjective data quality assessment evaluates data quality from views of data collectors, custodians, and data consumers [50] and could adopt a comprehensive set of data quality dimensions which are defined from the perspective of data consumers [4]. The assessment is focussed on the management perspective and concentrates on whether the data is fit for use. During this process, questionnaires, interviews, and surveys can be developed and used to assess these dimensions.

According to Wang et al, objective assessments can be task-independent or task-dependent [50]. Task-independent metrics reflect states of the data without the contextual knowledge of the application and can be applied to any data set, regardless of the tasks at hand. Task-dependent metrics, which include the organization‟s business rules, company and government regulations, and constraints provided by the database administrator, are developed in specific application contexts [50]. During this process, software can be applied to automatically measure data quality according to a set of data quality rules. Dimensions developed from a database perspective can be used for objective assessment [109].
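As a sketch of how such rule-driven objective measurement might be automated, the fragment below computes one task-independent metric (the null ratio of a column) and one task-dependent metric driven by a business rule supplied as a predicate; the column names, sample rows and the rule itself are assumptions made for illustration only.

    def null_ratio(rows, column):
        # Task-independent metric: fraction of missing values in a column.
        return sum(r.get(column) is None for r in rows) / len(rows)

    def rule_violation_ratio(rows, rule):
        # Task-dependent metric: fraction of rows breaking a business rule,
        # where 'rule' encodes e.g. a regulation or an administrator constraint.
        return sum(not rule(r) for r in rows) / len(rows)

    orders = [
        {"id": 1, "amount": 120.0, "currency": "GBP"},
        {"id": 2, "amount": None,  "currency": "GBP"},
        {"id": 3, "amount": -5.0,  "currency": "EUR"},
    ]

    def gbp_positive(r):
        # Hypothetical business rule: amounts must be positive and in GBP.
        return r["amount"] is not None and r["amount"] > 0 and r["currency"] == "GBP"

    print(null_ratio(orders, "amount"))                # 0.333...
    print(rule_violation_ratio(orders, gbp_positive))  # 0.666...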

From the literature, information systems have been compared to production systems and an analogy has been proposed between quality issues in a manufacturing environment and those in an information systems environment. In this analogy, data is considered as the raw materials and data products are considered as the output [110, 111]. Table 2.12 shows an analogy between physical products and data products.


Analogy                  Input           Process                 Output
Product manufacturing    Raw materials   Materials processing    Physical products
Data manufacturing       Raw data        Data processing         Data products

Table 2.12 An analogy between physical products and data products

From table 2.12, three types of data are associated with this analogy. Raw data is considered as the raw material for information manufacturing and is expected to be well structured and stored in the database. Raw data is then composed into component data and transmitted through the different business manufacturing processes. Finally, data products are delivered to data consumers for their intended use. Therefore, data quality assessment can be carried out with assessments associated with these three types of data, i.e., raw data, component data, and the information product.

According to Ge and Helfert [108], objective assessment mainly deals with raw data as well as component data. Subjective assessment deals with the final information products. Within these two types of assessments, data quality dimensions from Wang and Strong are used for the purpose of evaluation [4]. These dimensions are categorized into two different groups, each of which deals with different types of assessment. Figure 2.8 shows the model for the assessment work.


Fig.2.8 A data quality assessment model [108]. In this model, raw data and component data are evaluated by objective assessment against the accuracy, completeness, consistency and timeliness dimensions, while the information product is evaluated by subjective assessment against the accessibility, security, relevancy, value-added, interpretability, objectivity, representation, believability, reputation, appropriate amount and ease of understanding dimensions.

In table 2.13, the differences between objective and subjective assessments are listed according to six different aspects: tool, measuring object, criteria, process, assessing result, and data storage.

Feature            Objective assessment    Subjective assessment
Tool               Software                Survey
Measuring object   Data                    Information product
Criteria           Rules, patterns         Fitness for use
Process            Automated               User involved
Assessing result   Single                  Multiple
Data storage       Databases               Business context

Table 2.13 Comparison between objective and subjective assessment [4]

Since subjective criteria and expectations vary from person to person, it is possible that different data consumers will generate different subjective assessment results; depending on user requirements, these results can be positive or negative [113]. Besides, discrepancies may exist between the subjective and objective assessments. Based on the results of both assessments, we can tell whether the quality of data is high or low. For low quality data, organizations should investigate the root causes and take corrective actions. Regarding the root causes of poor data quality, for a specific context both the data and its environment should be diagnosed carefully. The data environment includes not only database systems but also the related task process mechanisms, rules, methods, actions, policies, and culture that together typify and impact an organization‟s data quality. From the literature, a group of conditions which cause poor data quality have been identified and analyzed. According to Lee et al, these are the common conditions distilled from detailed embedded case studies and content analysis of data quality projects in leading organizations [113]. Table 2.14 lists these conditions.

Condition
Multiple data sources
Subjective judgment in data production
Limited computing resources
Security/accessibility trade-off
Coded data across disciplines
Complex data representations
Volume of data
Input rules too restrictive or bypassed
Changing data needs
Distributed heterogeneous systems

Table 2.14 Root causes of poor data quality


These conditions are discussed in detail below:

 Multiple data sources: Due to the difficulties of ensuring consistent updating of multiple copies of data, inconsistent data values are obtained in multiple data sources for the same information even though they were accurate at a given point of time. Inconsistent data values may also happen due to the different use of measurements, e.g. the different level of units applied in different data sources. However, when a consistent value is required under some special context, the data quality becomes defective. In an organization, this problem happens frequently. There may be multiple systems designed for an organization for different purposes such as financial use, billing use or human resource management use. It may happen that procedures for collecting the same input information vary by different systems in multiple data sources and inconsistencies are observed from multiple sources. This may cause serious problems, e.g., consumers may stop using the information because inconsistencies lead them to question its believability.

 Subjective judgment in data production: Subjective judgment may be involved with information collection and data quality problems may arise due to the biased information produced by subjective judgment. These problems are often hidden from data consumers because the extent to which judgment is involved in creating it is unknown to them. However, it is not proposed that human judgment should be eliminated from information production as some information can only be produced subjectively. Rather, better extended training for data collectors, improvement of the data collectors‟ knowledge of the business domain, and clear statement and communication about how specific subjective judgments are to be made is encouraged as a solution.


 Limited computing resources: Information may be inaccessible due to the limited computing resources, which may lead to inaccurate and incomplete information. Tasks accomplished without the complete information will lead to poor decision making.

 Security/Accessibility trade-off: Easy access to information may conflict with requirements for security, privacy, and confidentiality. For data consumers, high-quality information must be easily accessible. However, ensuring privacy, confidentiality, and security of information requires barriers to access. Therefore, with respect to high quality data, a conflict exists between the accessibility and security dimensions. For example, patients‟ medical records contain confidential information, yet analysts need access to these records for research studies and management decision making.

 Coded data across disciplines: With technological advances, it is possible to collect and store many types of information, including text and images. Representing this information for easy entry and easy access is an important issue. However, coded data from different professional areas are difficult to decipher and understand. For example, in some hospitals, detailed patient care notes still remain in paper form due to the cost of converting them to electronic form. Deciphering the notes and typing them is time consuming, and some information has to be dictated by the doctor manually.

 Complex data representation: Although advanced algorithms are available for automatically processing numeric values, they are not available for text and image information. With respect to these non-numeric values, data consumers require more than access to them. Functions such as aggregation, manipulation and trend identification are required by consumers for analytical work. This problem is manifested as information that is technically available to information consumers but is difficult or impossible to analyze.

 Volume of data: Large volumes of stored information make it difficult to access required information in a reasonable time. When dealing with large volumes of data, problems may arise both for those responsible for storing and maintaining data and for those responsible for searching for useful data. For example, customers may expect their telephone company to have immediate access to their individual billing records in order to resolve their billing questions. However, telephone companies may find it difficult to offer their customers such an immediate service when they have to face large volumes of billing transactions every hour. As another example, when dealing with the duplicate record detection task, achieving a high degree of accuracy requires a pairwise comparison of records. This generates a quadratic cost and is not acceptable when the data volume involved is large (a blocking sketch illustrating how this cost is commonly reduced is given after this list).

 Input rules too restrictive or bypassed: Input rules are used to impose necessary controls on data input in order to achieve a high degree of accuracy. However, as has been pointed out, improving data quality requires attention to more than just accuracy; other considerations, such as usability and usefulness, also need to be included. When input rules are too restrictive, data may get lost and produce missing information because they are unable to fit the field, or erroneous data may be entered into a field because the data entry clerk arbitrarily changes a value to fit the input rules. In this case, both accuracy and completeness problems are introduced.

 Changing data needs: Data is only of high quality when it satisfies the needs of data consumers. However, with multiple consumers‟ special needs, it is difficult to provide high quality data that satisfies all consumers‟ needs. Besides, when these needs change over time, the quality of the data will also deteriorate even though it was initially good.

 Distributed heterogeneous systems: The most common problem associated with distributed systems is inconsistent data, that is, data with different values or representations across systems. Data with different values may be generated from multiple sources or created by inconsistent updating of multiple copies. Data with different representations becomes a problem when integrating across autonomously designed systems.
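As noted under the „Volume of data‟ condition above, naive duplicate detection compares every pair of records at quadratic cost. A common way to keep this tractable is blocking: records are grouped by a cheap blocking key and only records sharing a key are compared. The sketch below assumes a simple key built from the first letter of the surname and the birth year; the key choice and the sample records are illustrative only.

    from collections import defaultdict
    from itertools import combinations

    def blocking_key(record):
        # Illustrative key: first letter of the surname plus the birth year.
        surname = record["name"].split()[-1]
        return (surname[0].upper(), record["birth_year"])

    def candidate_pairs(records):
        # Group records by blocking key and compare only within each block,
        # avoiding the full quadratic comparison of all record pairs.
        blocks = defaultdict(list)
        for r in records:
            blocks[blocking_key(r)].append(r)
        for block in blocks.values():
            for a, b in combinations(block, 2):
                yield a, b

    people = [
        {"name": "Elizabeth Fraser", "birth_year": 1980},
        {"name": "Elizbeth Fraser",  "birth_year": 1980},
        {"name": "Mark Levison",     "birth_year": 1975},
    ]
    print(list(candidate_pairs(people)))  # only the two Fraser records are paired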

As stated by an old aphorism: „„an ounce of prevention is worth a pound of cure.‟‟ Organizations must not only develop tools and techniques to rectify data deficiencies but also institutionalize processes that would identify and prevent root causes of poor data quality. Awareness will require that organizations quantitatively assess both subjective and objective metrics of data quality.

2.5 Conclusion

High quality of data is a key to today‟s business success. Among the many factors causing poor data quality, dirty data existing within data sources is a main one. In this chapter, four existing research works from the literature associated with identifying dirty data that affect data quality were reviewed, which provides an appreciation of the types and extent of dirty data within data sources. In order to ensure high quality data in an organization, cleaning the dirty data existing in data sources in a proper way is necessary, and a data cleaning process which can monitor, analyze and maintain the quality of data is highly recommended. From the literature, many data cleaning techniques and approaches exist to facilitate a data cleaning process. A group of selected data cleaning techniques and approaches were reviewed and analyzed in this chapter. In particular, the critical analyses of the advantages and disadvantages of these approaches provide valuable information for the design of the proposed data cleaning framework in Chapter 4. Data cleaning tools and frameworks are crucial for making data cleaning techniques and methodologies effective. To summarise, there are still some challenges regarding the design of a data cleaning approach. To address these challenges, the following considerations are presented:

(i) An analysis of the five data cleaning approaches shows that the two most frequently addressed cleaning tasks are (1) instance-level data standardization and transformation and (2) duplicate record elimination. Some approaches focus on dealing with only one of these two tasks. Although some work has been done in the literature for the purpose of generating a taxonomy of dirty data [6, 30], to the knowledge of the author there is no data cleaning tool that can deal with all the dirty data types mentioned in these works. In practice, cleaning all dirty data types introduced by the two taxonomies mentioned above is unrealistic and simply not cost-effective when taking into account the needs of a business enterprise. In this thesis, this problem is defined as the DDS problem. Thus, a data cleaning approach is expected to provide the power to select which dirty data types to deal with in different situations.

(ii) According to Galhardas et al, the more dirty data involved, the more difficult it is to automate their cleaning within a fixed set of transformations [18]. Currently, in existing data cleaning tools, organizing multiple cleaning tasks into a proper cleaning sequence is not supported and depends entirely on a user‟s preference.

This brings two drawbacks. The first drawback is that human involvement during a data cleaning process may bring down the degree of automation when performing data cleaning tasks. Ideally, the process of detecting and correcting dirty data should be performed automatically. However, it is known that performing data cleaning fully automatically is nearly impossible in most cases, especially when exceptions happen during the cleaning process and an expert is required to make a judgement. Therefore, declarative, semi-automatic approaches are feasible and acceptable for developing a data cleaning approach.

Considering the semi-automatic approach, the idea of dividing the data cleaning process into several sub-processes, separating the sub-processes that can be executed fully automatically from the others, is a good solution [32]. But still, the execution of these sub-processes needs to be specified in an order. So the second drawback is that, for users who have no knowledge of data cleaning, ordering these sub-processes is difficult, and a poor ordering sequence will bring side effects to the final cleaning result, as is shown later. Therefore, a semi-automatic data cleaning approach with the ability to automatically order the related data cleaning tasks is a challenge.

(iii) To develop an effective data cleaning tool, it is necessary that the tool includes various appropriate methods or techniques to deal with a specific data quality problem when different domains are involved. A specific optimized algorithm which is already parameterized is not able to cope with all situations. Choosing a method or an algorithm from a set of alternative algorithms has proven to be a difficult task. It depends on several factors, such as the problem domain and the nature of the errors. Therefore, data cleaning methods and algorithms should be critically analyzed and evaluated based on carefully designed experiments.

According to the studies of the five data cleaning approaches in section 2.2, algorithm selection and algorithm parameter setting depend on the user‟s preference. This leaves the data cleaning process with two drawbacks. The first concerns the degree of automation of a data cleaning approach. For example, in Febrl, 26 different algorithms are provided to its users. In order to perform a matching task with Febrl, the user has to choose one particular algorithm out of these 26 algorithms, and the corresponding parameters must also be specified by the user. According to the author‟s experience of using Febrl, nearly 20% of the total data cleaning time is spent on algorithm selection and parameter setting.

The second drawback is associated with the effectiveness of the data cleaning task. As mentioned, several factors such as the problem domain and the nature of the errors are involved in the selection of a suitable algorithm. As will be shown later, the experimental results in Chapter 5 confirm that the effectiveness and efficiency of a data cleaning task may vary with the selection of a different algorithm. For users without enough knowledge and experience, an inappropriate selection of algorithms will generate poor cleaning results. Therefore, another challenge for a data cleaning approach is that not only should it include enough techniques for the user to choose from, but it should also be able to intelligently help its users to make a choice among the many alternatives when necessary. These considerations will be included in the design of the proposed data cleaning framework.
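To illustrate why the choice of matching algorithm matters, the short sketch below scores two candidate pairs with two common similarity measures, a normalised edit-distance similarity and a Jaccard similarity over character bigrams; the pairs and measures are illustrative only and are not the algorithms or data sets evaluated in Chapter 5.

    def levenshtein(s, t):
        # Standard dynamic-programming edit distance.
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                                prev[j - 1] + (cs != ct)))
            prev = curr
        return prev[-1]

    def edit_similarity(s, t):
        return 1 - levenshtein(s, t) / max(len(s), len(t))

    def bigram_jaccard(s, t):
        bigrams = lambda x: {x[i:i + 2] for i in range(len(x) - 1)}
        a, b = bigrams(s), bigrams(t)
        return len(a & b) / len(a | b)

    pairs = [("john smith", "jon smth"),    # typographical errors
             ("john smith", "smith john")]  # swapped name order
    for s, t in pairs:
        print(s, "|", t, round(edit_similarity(s, t), 2), round(bigram_jaccard(s, t), 2))
    # Edit similarity scores the typo pair higher, while bigram Jaccard scores
    # the swapped-order pair higher, so the preferable measure depends on the
    # kind of error expected in the data.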

Finally, the review of data quality and data quality dimensions in section 2.4 provides a solid foundation for designing the proposed rule-based taxonomy of dirty data, which is presented in the next chapter.


CHAPTER 3 A RULE-BASED TAXONOMY OF DIRTY DATA

In Chapter 2, literature concerning dirty data type classifications or taxonomies was reviewed. Regarding the dirty data type classifications, some work has been undertaken exclusively to identify problems (dirty data types) that affect data quality and has resulted in taxonomies of dirty data. For example, Kim et al [6] and Oliveira et al [30] have proposed two different taxonomies of dirty data and have presented 33 and 35 dirty data types respectively.

Some work, although not undertaken exclusively for the purpose of generating a taxonomy of dirty data, has highlighted the problems arising due to poor data quality, and groups of dirty data types have been proposed. For example, in Müller and Freytag‟s work [27], data in a data collection that do not conform to the constraints of a pre-defined data model are considered to be data anomalies. Müller and Freytag roughly classify data anomalies into three different sets, namely syntactical anomalies, semantic anomalies and coverage anomalies, and together 8 dirty data types are identified. Rahm and Do [28] distinguish the observed data quality problems into two sets, namely single-source problems and multi-source problems. Within each set, data quality problems have been classified into schema-level problems and instance-level problems respectively. These problems reflect the different dirty data types that can be captured at the different levels, and 19 problems have been introduced in their work.

Compared with Müller and Freytag‟s and Rahm and Do‟s work, the two taxonomies of dirty data provide many more types of dirty data. Data cleaning is a labour-intensive, time-consuming and expensive process. In practice, cleaning all dirty data types introduced by the two taxonomies mentioned above is unrealistic and simply not cost-effective when taking into account the needs of a business enterprise. For example, a company might only be able to afford to clean a specific group of dirty data types to satisfy some specific needs. The problem then becomes how the business can make a selection according to its different business needs. This problem, mentioned in Chapter 1, is referred to as the Dirty Data Selection (DDS) problem.

Although several taxonomies of dirty data exist in the literature, none of them is designed for this purpose. For example, Oliveira et al‟s taxonomy of data quality problems introduces 35 dirty data types and is considered the most comprehensive taxonomy in the literature so far.

In this case, by only listing these 35 dirty data types, it is difficult to tell which dirty data types should be selected for cleaning in different data sets. In this chapter, a rule-based taxonomy of dirty data is presented. As mentioned in chapter 2, dirty data is defined as the data flaws that break any of the pre-defined data quality rules. The taxonomy presents a clear mapping between data quality rules and dirty data types, which not only covers a larger range of dirty data types than any of the existing taxonomies but can also help to deal with the DDS problem when specific business needs are considered.

3.1 Data quality rules

According to Chanana and Koronios, most data quality problems are not simple violations of declared database integrity constraints; rather, a large number of real-life data problems are caused by data violating complex underlying business rules or data quality rules, often leading to poor data quality [114].

In the proposed context, data quality rules define the business logic of an enterprise and are therefore an underlying reality in an enterprise [115]. Data quality rules are used as descriptive means for encapsulating operational business flows, and they govern and guide the way in which an enterprise conducts itself and complies with legal and other regulations. They are defined and owned by business professionals, not IT professionals [116], and they do not contain any control flow statements, being independent of any implementation technique.

In the past, data quality rules have been embedded in the system code rather than formalized and articulated separately in simple natural language. With advances in the scale of business, changing business environment, operations at different locations and increased interaction with stakeholders, the business process is now more complex and it becomes difficult and unmanageable to operate the business effectively and efficiently without formalizing these quality rules. As business practices and/or policies change frequently, it becomes very difficult to reflect these changes in the applications implementing them. Rules that are buried in information systems are neither flexible nor easy to modify or change and as a result do not render the business with complete control over its environment [114].

According to David Loshin [12], by relating business impacts to data quality rules, an organization can employ the data quality rules for measuring the business expectations and the improvement of data quality can be viewed as a function of conformance to business expectations. By integrating control processes based on data quality rules, business users are able to determine how best the data can be used to meet their own business needs. Thus, data quality rules play an important role in the improvement of data quality for a business.

In this thesis, dirty data is defined as the data flaws that break any of the pre-defined data quality rules. Once the data quality rules are obtained, data can be assessed as dirty or not according to the description of these rules. This provides agility in responding to the ever changing demands of the business environment. Since the validity of a data value is defined within the context in which data values appear, one must specifically describe what defines a valid value in order to improve data quality. This is performed by measuring whether the values conform to the matching data quality rules.

The approach of cleaning dirty data according to the different data quality rules helps with the separation of business logic from implementation logic and thus provides a solution for responding to the different demands of different business environments. Additionally, the DDS problem introduced in Chapter 1 can be solved well, since it is reasonable for a business enterprise to deal with a few of the most important groups of data quality rules rather than all of the rules, according to its own business priorities. Only dealing with the dirty data reflected in the selected data quality rules helps an organization to reduce the cost associated with expensive data cleaning tasks, especially when the resources available to the organization for performing data cleaning are limited.

From the literature, Chanana and Koronios proposed a set of data quality rules and categorized them into five groups. Table 3.1 shows the data quality rules from Chanana and Koronios‟ work.


Rule class C.1: Definitions of reference data
  C.1.1 Null values rules: allow traditional null values like system null, blank and empty fields; non-null rules specify which null values are not allowed.
  C.1.2 Domain membership rules: an enumerated domain defines a list of valid values; a descriptive domain uses syntax to establish domain membership.
Rule class C.2: Mappings between domains
  C.2.1 Functional domain mapping rules: a list of functions describing the mapping.
  C.2.2 Domain mapping enumeration rules: specify those value pairs that belong to the mapping.
  C.2.3 Mapping membership rules: two attribute values must conform to the mapping rule.
Rule class C.3: Value constraints
  C.3.1 Value constraints rules: specify the set of valid values that can be assigned.
  C.3.2 Attribute value restriction rules: data type, like integer or string.
Rule class C.4: Relation rules
  C.4.1 Consistency rules: maintain a relationship between two attributes based on the actual values of the attributes.
  C.4.2 Completeness rules: specify attribute values on satisfying some condition.
  C.4.3 Exemption rules: on meeting a condition, some attributes can have null values.
Rule class C.5: Cross-table rules
  C.5.1 Primary key assertion rules: an attribute belonging to the primary key cannot have null values.
  C.5.2 Foreign key assertion rule: specifies a consistency relationship between tables.
  C.5.3 Functional dependency rules: specify inter-record constraints on records.

Table 3.1 Data quality rules

Adelman et al also propose a set of data quality rules which, according to the authors, fall into four groups: business entity rules, business attribute rules, data dependency rules and data validity rules. Business entity rules specify rules about business objects or business entities. Business attribute rules are rules about data elements or business attributes. Data dependency rules specify different types of dependencies between business entities or business attributes. Data validity rules govern the quality of data values [89]. Table 3.2 lists the data quality rules under the four categories proposed by Adelman et al. The tables in Appendix B (B.1~B.4) show the further classified sub-rules in detail; each sub-rule has been assigned a rule number.

1. Business entity rules: R1.1 Entity uniqueness rules; R1.2 Entity cardinality rules; R1.3 Entity optionality rules
2. Business attribute rules: R2.1 Data inheritance rules; R2.2 Data domain rules
3. Data dependency rules: R3.1 Entity-relationship rules; R3.2 Attribute dependency rules
4. Data validity rules: R4.1 Data completeness rules; R4.2 Data correctness rules; R4.3 Data accuracy rules; R4.4 Data precision rules; R4.5 Data uniqueness rules; R4.6 Data consistency rules

Table 3.2 Data quality rules from Adelman et al’s work

All data quality rules from Chanana and Koronios and Adelman et al are compared and the result is shown in Table 3.3.


Chanana and Koronios' work                      Adelman et al's work
C.1.1 Null values rules                         R4.1.4
C.1.2 Domain membership rules                   R2.2.1, R2.2.5
C.2.1 Functional domain mapping rules           R3.2.2
C.2.2 Domain mapping enumeration rules          R2.2.1
C.2.3 Mapping membership rules                  R3.2.3
C.3.1 Value constraints rules                   R2.2.1
C.3.2 Attribute value restriction rules         R2.2.3
C.4.1 Consistency rules                         R4.6.1
C.4.2 Completeness rules                        R4.1.4
C.4.3 Exemption rules                           R4.1.4
C.5.1 Primary key assertion rules               R1.1.2
C.5.2 Foreign key assertion rule                R4.1.2
C.5.3 Functional dependency rules               R3.1.1

Table 3.3 A comparison of the data quality rules from the two works

The comparison shows that Adelman et al provide a larger and more comprehensive collection of data quality rules: every data quality rule in Chanana and Koronios' work can also be found in Adelman et al's work. Therefore, Adelman et al's collection of data quality rules is used in the proposed research. We use their four groups of data quality rules to classify dirty data types into four different categories. According to Adelman et al, the four groups of data quality rules are further divided into lists of sub-rules, from which a tree-structured classification of data quality rules is obtained. By analyzing the data quality rules on the leaf nodes, we have identified the dirty data types in each category.


3.2 Dirty data types

Four groups of dirty data types are obtained according to the four different rule categories from table 3.2. Each group of dirty data types is detailed below.

(i) Business entity rules related dirty data types:

Business entity rules specify rules about business entities which are subject to three data quality rules namely entity uniqueness rules, entity cardinality rules and entity optionality rules. Within this group, the following dirty data types are identified:

 Cardinality relationship problem: Cardinality refers to the degree of a relationship, i.e., the number of times one business entity can be related to another. As an example of this problem, the number of employees obtained by counting the tuples in the Employee table is not the same as the number obtained by summing the employee counts of each department in the Department table.

 Recursive relationship problem: A recursive relationship corresponds to cycle situations among two or more related tuples in a self or reflexive relationship. As an example of this problem, suppose in a department of a university, one person may supervise many other persons and each supervised person may have many supervisors at the same time. Such information is recorded in the table people (ID*, name, supervise). Suppose the information „Jack is supervising Rose and Rose is supervising Jack‟ is found in the table. Clearly, this is not going to happen in the real world.

 Optionality relationship problem: the entity optionality rule identifies the minimum number of times two business entities can be related. For example, an online store requires that when a customer has purchased a product online, the customer's delivery information must be present in the delivery table. Otherwise, a missing tuple in the delivery table will cause a problem such as an undelivered item.

 Reference defined but not found: When a relationship is instantiated through a foreign key, the referenced instance of the entity must exist in the related table.

(ii) Business attribute rules related dirty data types:

Business attribute rules specify rules about business attributes or data elements, which are subject to two data quality rules, namely data inheritance rules and data domain rules. As data inheritance rules are object-oriented rules, they are not considered in this work because only database applications are considered. Therefore, in this group, the following dirty data types are identified:

 Set violation: For an enumerated data type, its value should be within the allowable value set. For example, suppose the allowable data value set for “city” attribute is (London, Edinburgh, Manchester, Birmingham), then the value of “New York” is not allowable.

 Data value out of value range: As an example of this problem, suppose the age of a person in a database is defined by the range "18<=age<30". Then an age value of "10" or "35" is not allowed in the table.

 Data value constraint violation: When constraints are used to regulate data values, the data values should conform to those constraints. A constraint may regulate a single data value or multiple data values. For example, a medical experiment requires that the age of participants be at most 30, so the constraint on the "age" attribute is "age<=30". If a record is found whose age value is "35", such data is not expected in the table.


 Use of wrong data type: When the value of an attribute such as “Name” is set to be a string data type, it is not expected that a numeric value be found for the “Name” attribute.

 Syntax violation: A syntax violation occurs when a data value does not conform to the pattern or format defined for its attribute. For example, when the format of the "Date" attribute is defined as "DD/MM/YYYY", the value "2010-03-05" is not expected; the correct value would be "05/03/2010".
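
To make these attribute-level rules concrete, the following minimal Python sketch shows how such checks might be expressed as simple validation functions. It is illustrative only: the attribute names, the allowed city set, the age range and the date pattern are hypothetical values taken from the examples above, not part of the proposed framework.

    import re

    # Hypothetical attribute-level checks based on the rule descriptions above.
    ALLOWED_CITIES = {"London", "Edinburgh", "Manchester", "Birmingham"}
    DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")   # DD/MM/YYYY

    def set_violation(value):
        """Set violation: the value must belong to the enumerated domain."""
        return value not in ALLOWED_CITIES

    def out_of_range(age):
        """Data value out of value range: assumed domain 18 <= age < 30."""
        return not (18 <= age < 30)

    def wrong_data_type(name):
        """Use of wrong data type: a 'Name' value should not be purely numeric."""
        return str(name).isdigit()

    def syntax_violation(date_string):
        """Syntax violation: the date must follow the DD/MM/YYYY pattern."""
        return DATE_PATTERN.match(date_string) is None

    # Example usage with the dirty values mentioned in the text:
    print(set_violation("New York"))        # True
    print(out_of_range(35))                 # True
    print(wrong_data_type("12345"))         # True
    print(syntax_violation("2010-03-05"))   # True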

(iii) Data dependency rules related dirty data types:

Data dependency rules apply to data relationships between two or more business entities or business attributes. The dirty data types identified in this group are:

 Data relationship constraint violation: As an example of this problem, an employee who has been assigned a project is not allowed to enroll in a training program, i.e., this employee‟s data is not supposed to be found in the training table.

 Contradiction data: The existence of an attribute value is determined or constrained by the value of another attribute. For example, suppose it is defined that when the status of a loan is "funded", the loan amount must be greater than zero; a record with status "funded" and a loan amount of zero contradicts this rule.

 Wrong derived field data: This problem occurs when a data value is derived from two or more other attribute values. For example, a miscalculation of an employee‟s income by miscomputing the tax will result in a wrong derived field data.

 Wrong data among related attributes: This problem occurs when the value of one attribute is constrained by the value of one or more attributes in the same business entity or in a different but related business entity. For example, the value of annual expenses in a department is constrained by the sum of all distinct expenses in that department.

(iv) Data validity rules related dirty data types:

Data validity rules govern the quality of data values; there are six data validity rules (R4.1~R4.6, see Table 3.2). The dirty data types identified by the six validity rules are:

 Missing tuple: Entity completeness requires that all instances exist for all business entities, i.e., all records are present in the table.

 Missing value: It is required that all attributes of a business entity contain all allowable values. Note that a null value is different from a missing value: when a "null-value allowed" constraint is enforced on the data set, a null value indicates that the value is unknown or nonexistent, whereas a missing value is a value that should exist for the attribute but is absent.

 Meaningless data value. The data value for an attribute must be correct and reflect the attribute‟s intended meaning. When the data value is beyond the context of the attribute, the data value is a meaningless data value. For example, the value for the attribute “address” is defined as a set of allowable characters which reflect a person‟s address in the real world. If “£$%S134” is entered, it does not make any sense as valid address data.

 Extraneous data entry: An example of extraneous data entry is the entry of address and name in a name field.

 Lack of data elements: An example of this problem is when part of a post code is missing from the "PostCode" attribute, i.e., "5DT" missing from "EH10 5DT".


 Erroneous entry: An example of erroneous entry is when a student‟s age is entered as “26” rather than the student‟s real age “27”.

 Entry into wrong field: This problem occurs for example when the value of a person‟s name is entered into its address field.

 Identity rule violation: As an example of this problem, suppose in the table employee (Emp_No., Name, Emp_NIN, DoB), Emp_No. is defined as the primary key. The values of Emp_No. in the employee table guarantee the uniqueness of Emp_No., but this does not mean that each employee is properly identified in the data. For example, a person may have two records with two distinct Emp_No. values but identical national insurance numbers (NIN). If it is required that each person has only one Emp_No. in the table, these are clearly duplicate records referring to the same person (a small sketch of detecting such duplicates is given after this list).

 Wrong reference: This is the case when a reference is defined but its value is wrong which breaks the attribute‟s dependency rules.

 Outdated value: It is required that the data value must be accurate in terms of its state in the real world. If not, its value is said to be an outdated value because it does not represent its real state in the real world.

 Imprecision: It is required that all data values for a business attribute are as precise as the attribute's business requirements demand. As an example of imprecise data, suppose an analysis of an auditor's financial position requires values that are precise to the pence; if the values are recorded only in whole pounds, the data is imprecise.

 Ambiguous data: The use of abbreviations, for instance, may cause an ambiguous meaning which is not as precise as required by the attribute's intended meaning. For example, when the abbreviation "MS" is used to represent a company's name, it is difficult to tell whether it stands for "Morgan Stanley" (a global financial services firm) or "Microsoft" (a global software company) when both companies are recorded in the same data source.

 Misspelling: A misspelling problem occurs, for example, when "John Smith" is entered as "Jonh Smyth".

 Duplicate record in single/multi data source: Rule 4.5 specifies that each business entity instance must be unique. Duplicate records may occur, for example, when a person's name and address are represented in different ways, so that the same entity is represented more than once in the same or different data sources.

 Inconsistent record in single/multi data source: Rule 4.6 specifies that data values should be consistent. Inconsistent data can be found in both single and multiple sources. For example, in different data sources, the same person's address may be recorded differently; if this person has only one valid address, these records are inconsistent.

 Different representations for the same data: in addition to inconsistent records, data conflicts may arise when multiple data sources are integrated. Different data sources are typically developed and maintained independently to serve specific needs. When these data sources are integrated, problems arise from different representations of the same data. Specifically, these differences may be due to the different use of abbreviations, special characters, word sequence, measurement units, encoding formats, aggregation levels and alias names.
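
As an illustration of how an identity-rule violation of the kind described above might be detected, the sketch below groups hypothetical employee records by national insurance number and flags NINs that map to more than one Emp_No. The record layout and values are invented for the example and do not come from any real data source.

    from collections import defaultdict

    # Hypothetical employee tuples: (Emp_No, Name, Emp_NIN, DoB)
    employees = [
        ("E001", "John Smith", "AB123456C", "1985-02-14"),
        ("E002", "Jane Brown", "CD654321A", "1990-07-01"),
        ("E017", "Jonh Smyth", "AB123456C", "1985-02-14"),  # same person, second Emp_No
    ]

    def find_identity_violations(records):
        """Flag groups of records sharing a NIN but holding distinct Emp_No values."""
        by_nin = defaultdict(list)
        for emp_no, name, nin, dob in records:
            by_nin[nin].append(emp_no)
        return {nin: emp_nos for nin, emp_nos in by_nin.items() if len(set(emp_nos)) > 1}

    print(find_identity_violations(employees))
    # {'AB123456C': ['E001', 'E017']}

In practice such detection would usually be combined with approximate string matching on the name and address fields, since the duplicated records rarely agree exactly.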

According to the descriptions of the data validity rules, some schema-level problems can also be identified. For example, one of the data completeness rules requires that all business attributes for each business entity exist. If an employee's address is represented by a different number of fields in different data sources and each is correct in its own data source, problems will occur when they come to be integrated. Among the data uniqueness rules, two are related to the definition of attributes (homonyms and synonyms), which are also schema-level problems. As schema-level problems are not considered in this research, dirty data related to the schema level is not included in the proposed taxonomy. Based on the above dirty data types analyzed against the data quality rules, Table 3.4 lists these dirty data types, each of which has been assigned a type number (DT.1~DT.38).

DT.1 Cardinality relationship problem
DT.2 Recursive relationship problem
DT.3 Optionality relationship problem
DT.4 Reference defined but not found
DT.5 Set violation
DT.6 Data value out of value range
DT.7 Data value constraint violation
DT.8 Use of wrong data type
DT.9 Syntax violation
DT.10 Data relationship constraint violation
DT.11 Contradiction data
DT.12 Wrong derived field data
DT.13 Wrong data among related attributes
DT.14 Missing tuple
DT.15 Missing value
DT.16 Meaningless data value
DT.17 Extraneous data entry
DT.18 Lack of data elements
DT.19 Erroneous entry
DT.20 Entry into wrong field
DT.21 Identity rule violation
DT.22 Wrong reference
DT.23 Outdated value
DT.24 Outdated reference
DT.25 Imprecision
DT.26 Ambiguous data
DT.27 Misspelling
DT.28 Duplicate record in single data source
DT.29 Duplicate record in multi data source
DT.30 Inconsistent record in single data source
DT.31 Inconsistent record in multi data source
DT.32 Different representations due to abbreviation
DT.33 Different representations due to special characters
DT.34 Different representations due to word sequence
DT.35 Different representations due to measurement unit
DT.36 Different representations due to encoding format
DT.37 Different representations due to aggregation level
DT.38 Different representations due to use of alias name

Table 3.4 Dirty data types

3.3 The taxonomy

In Table 3.2, the data quality rules are organized in a tree structure. The proposed taxonomy follows the same structure and classifies dirty data according to the four categories of data quality rules. As the 38 dirty data types were obtained by analyzing the rules on the leaf nodes, the four categories of dirty data have been further classified into distinct dirty data types according to the corresponding leaf-node rules. Table 3.5 shows the proposed taxonomy.

Business entity rules related dirty data
  R1.2 Entity cardinality rules: DT.1, DT.2
  R1.3 Entity optionality rules: DT.3, DT.4
Business attribute rules related dirty data
  R2.2 Data domain rules: DT.5~DT.9
Data dependency rules related dirty data
  R3.1 Entity relationship dependency rules: DT.10
  R3.2 Attribute dependency rules: DT.11~DT.13
Data validity rules related dirty data
  R4.1 Data completeness rules: DT.14, DT.15
  R4.2 Data correctness rules: DT.16~DT.20
  R4.3 Data accuracy rules: DT.21~DT.24
  R4.4 Data precision rules: DT.25~DT.27
  R4.5 Data uniqueness rules: DT.28, DT.29
  R4.6 Data consistency rules: DT.30~DT.38

Table 3.5 Rule-based taxonomy of dirty data
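
One convenient machine-readable view of this taxonomy, which is used informally later when the DDS process is sketched, is a simple mapping from rule category to data quality rules and from each rule to the dirty data type numbers it governs. The sketch below merely restates the content of Table 3.5 as a Python data structure; it is not part of the taxonomy itself.

    # A sketch of Table 3.5 as nested dictionaries: category -> rule -> dirty data types.
    RULE_BASED_TAXONOMY = {
        "Business entity rules related dirty data": {
            "R1.2 Entity cardinality rules": ["DT.1", "DT.2"],
            "R1.3 Entity optionality rules": ["DT.3", "DT.4"],
        },
        "Business attribute rules related dirty data": {
            "R2.2 Data domain rules": [f"DT.{i}" for i in range(5, 10)],
        },
        "Data dependency rules related dirty data": {
            "R3.1 Entity relationship dependency rules": ["DT.10"],
            "R3.2 Attribute dependency rules": [f"DT.{i}" for i in range(11, 14)],
        },
        "Data validity rules related dirty data": {
            "R4.1 Data completeness rules": ["DT.14", "DT.15"],
            "R4.2 Data correctness rules": [f"DT.{i}" for i in range(16, 21)],
            "R4.3 Data accuracy rules": [f"DT.{i}" for i in range(21, 25)],
            "R4.4 Data precision rules": [f"DT.{i}" for i in range(25, 28)],
            "R4.5 Data uniqueness rules": ["DT.28", "DT.29"],
            "R4.6 Data consistency rules": [f"DT.{i}" for i in range(30, 39)],
        },
    }

    def dirty_types_for_rules(selected_rules):
        """Collect the dirty data types covered by a set of selected data quality rules."""
        return [dt for rules in RULE_BASED_TAXONOMY.values()
                for rule, types in rules.items() if rule in selected_rules
                for dt in types]

    print(dirty_types_for_rules({"R4.5 Data uniqueness rules"}))  # ['DT.28', 'DT.29']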


In this taxonomy, 38 distinct dirty data types have been identified under the different data quality rules, which forms a larger collection of dirty data than any of the existing taxonomies or classifications [6, 27, 28, 30]. In the category of business attribute rules, 5 dirty data types are identified. There are 4 dirty data types in each of the categories of business entity rules and data dependency rules. The majority of dirty data types, 25, fall under the category of data validity rules, because data value related problems are much more common than the others. The proposed taxonomy considers dirty data types appearing within both a single data source and multiple data sources, and from the angles of both a single relation and multiple relations. Compared with the four existing works [6, 27, 28, 30], the proposed taxonomy is the most complete. For example, DT.26, DT.38, DT.12, DT.24, DT.13 and DT.10 are dirty data types that are not mentioned by Müller and Freytag [27] or Rahm and Do [28]. Compared with the two formal taxonomies by Kim et al [6] and Oliveira et al [30], the proposed taxonomy not only covers all instance-level dirty data types from these two taxonomies but also includes a new dirty data type (DT.18, lack of data elements). However, due to the research scope, schema-level problems are not considered in the proposed taxonomy. For example, naming conflicts and structure conflicts are two schema-level heterogeneities mentioned by Rahm and Do [28]. Similarly, two schema-level problems are also identified by Oliveira et al [30] (i.e., syntax inconsistency both among multiple relations in a single data source and among multiple data sources). This restriction agrees with the suggestion made by Kim et al [6]. A systematic classification of schema-related problems has been proposed by Kim and Seo [117], which covers all the schema-related problems mentioned in the two existing works [27, 28]. Although this taxonomy is believed to be very comprehensive, it cannot be guaranteed to cover every possible dirty data type; nevertheless, it is believed that most common and uncommon dirty data types are covered.


As mentioned in Chapter 1, cleaning all of the dirty data types introduced by any of the existing taxonomies is, in practice, unrealistic and not cost-effective when the needs of a business enterprise are taken into account. The proposed rule-based taxonomy provides a structure that organizes the different types of dirty data according to the different quality rules. This structure helps to provide a solution that responds to the different demands of different business environments. Dealing only with the dirty data covered by the selected data quality rules reduces the cost associated with expensive data cleaning tasks and addresses the proposed DDS problem. A method for dealing with the DDS problem is detailed further in the next chapter.

3.4 Conclusion

In this chapter, the proposed rule-based taxonomy of dirty data is presented. Compared with existing works, this taxonomy includes 38 distinct dirty data types and provides a larger collection of dirty data types than any of the existing taxonomies.

Associating dirty data with data quality rules provides several benefits. For example, it provides agility in responding to the different demands of different business environments. It enables the separation of business logic from implementation logic: people who evaluate and improve the data quality of an organization can focus only on the data quality rules without considering the actual dirty data cleaning techniques, while developers who build techniques to cope with different dirty data types are not distracted by changes in the business environment. Additionally, since it is reasonable for a business enterprise to select a few of the most important groups of data quality rules rather than all of them, according to its own business priorities, dirty data associated with the selected data quality rules can be dealt with first. In this way, the proposed DDS problem is solved by dealing only with the dirty data covered by these selected data quality rules.

Although some existing work, such as the collection of 35 dirty data types by Oliveira et al, has also proposed a large collection of dirty data types, looking only at these dirty data types makes it difficult to tell which group of dirty data should be considered first, and it would be very expensive for a system to run all algorithms for all possible dirty data candidates; this is exactly the DDS problem. With the help of the proposed rule-based taxonomy of dirty data, a method can be developed that allows business enterprises to solve the DDS problem by prioritizing the expensive process of data cleaning, therefore maximally benefiting their organizations. In the next chapter, this method is detailed within the proposed data cleaning framework.


CHAPTER 4 A DATA CLEANING FRAMEWORK

4.1 Introduction

High quality data plays an important role in the success of database applications. Data cleaning is a way to maintain high quality data and is one of the crucial tasks in building database applications such as a data warehouse (DW). Research shows that nearly half of the time spent on back-end issues, such as readying the data and transporting it to a DW, can be attributed to activities associated with data cleaning [32]. Regarding the data cleaning process, two considerations need to be addressed: (i) how to reduce the time cost of the data cleaning process, i.e., how to improve its efficiency, and (ii) how to improve the degree of automation during the data cleaning process.

Recall the data cleaning process presented in Fig.1.1; ideally, the detection and correction of error instances would be performed automatically. However, a fully automatic data cleaning tool does not exist in the literature. In most cases, it is impossible to have the data cleaning process executed fully automatically with current data cleaning approaches. Many factors need to be considered during the data cleaning process, such as the problem domain and the various dirty data types involved, and sometimes human involvement is required. For example, according to Müller and Freytag, "the process of data cleaning cannot be performed without the involvement of a domain expert, because the detection and correction of anomalies requires detailed domain knowledge" [27].

However, due to the large amount of data usually involved, data cleaning should be as automatic as possible [27]. Therefore, declarative, semi-automatic approaches are feasible and acceptable for developing data cleaning tools [32]. In Chapter 2, data cleaning approaches from the literature were reviewed. Two main data cleaning activities are addressed in these tools: (1) data standardization and transformation, and (2) duplicate record elimination. Of the 38 dirty data types in the proposed taxonomy, the existing data cleaning approaches address only a small number. For the other dirty data types that cannot be cleaned with existing approaches, users have to seek other solutions to deal with them separately, which requires considerable user effort during the data cleaning process. Therefore, an ideal data cleaning approach should provide solutions for as many types of dirty data as possible.

Regarding the degree of automation during a data cleaning process, frequent human involvement is not encouraged, though it is unavoidable. Human involvement should be reduced as much as possible, leaving most of the data cleaning activities to be handled by the tool. In the data cleaning approaches studied, human involvement is required during the data cleaning process, especially in the following three cases: (1) selecting a suitable algorithm and setting the necessary parameters of the selected algorithm, (2) organizing the sequence in which the multiple data cleaning activities involved in the process are performed, and (3) dealing with exceptions.

With respect to the first case, for each type of dirty data involved in the data cleaning process, an appropriate method must first be selected and then applied. Choosing such a method has proven to be a difficult task as it depends on several factors such as the problem domain and the nature of the dirty data types [31]. Currently, existing data cleaning approaches either adopt one fixed method to clean dirty data without considering the different problem domains or require users to select a method from a list of alternatives. For example, Febrl provides multiple solutions to the problem of duplicate record detection, and users have to choose one solution out of the many alternatives in order to have the duplicate records detected. As mentioned in Chapter 2, Febrl does not supply any recommendations or help with selecting an appropriate method for different problem domains. AJAX, on the contrary, provides an optimal solution at its physical level according to the information provided by the user at the logical level. This is an advance over Febrl regarding the degree of automation. However, AJAX does not support as many techniques as Febrl does, which is a drawback regarding the ability to cope with different problem domains.

With respect to the second case, when multiple data cleaning activities are involved, the organization of a sequence to execute those associated algorithms is usually determined by a user rather than the system. For example, in ARKTOS, the user can customize multiple cleaning tasks either graphically or declaratively. Users of ARKTOS are responsible for specifying the correct order to execute these tasks. In Febrl, a data cleaning task is performed individually. For example, data standardization and duplicate record detection need to be executed separately.

In order to develop an effective and efficient data cleaning approach, it is necessary to allow users to select an appropriate method for different problem domains, and a mechanism should be provided for users to organize an appropriate order for the multiple data cleaning activities. Regarding the organization of multiple cleaning activities during the data cleaning process, Müller and Freytag proposed the following sequence: format adaptation for tuples and values, then integrity constraint enforcement, then derivation of missing values from existing ones, then removing contradictions within or between tuples, then merging and eliminating duplicates, and finally detection of outliers [27]. However, their work does not mention whether there is any particular reason to perform these operations in this order. As will be seen later, ordering the multiple activities in a data cleaning process is a complex task in which multiple factors, such as the problem domain and the nature of the selected algorithms, must be considered. A different order may result in different data cleaning performance.

In this chapter, a data cleaning framework is proposed which aims to address the following issues: (1) minimising the data cleaning time and improving the degree of automation during the data cleaning process, and (2) improving the effectiveness of data cleaning. This framework retains the most appealing characteristics of the existing data cleaning approaches reviewed in Chapter 2, and improves the efficiency and effectiveness of data cleaning in a database application by introducing two mechanisms: the 'algorithm ordering mechanism' (AOM) and the 'algorithm selection mechanism' (ASM). In the following sections, the proposed data cleaning framework and the two mechanisms are detailed.

4.2 Data cleaning framework

In this section, the proposed data cleaning framework is introduced, starting with some basic ideas.

4.2.1 Basic ideas

Data cleaning is a labour-intensive, time-consuming, and expensive process, especially when multiple data cleaning activities are involved with huge volumes of data from different data sources during the data cleaning process. Reducing the data cleaning time becomes a motivation in the proposed data cleaning framework.

Peng proposed a data cleaning framework which both reduces the cleaning time and maximises the degree of automation during the data cleaning process [32]. In Peng's framework, the whole data cleaning process is first broken into two stages to deal with single-source and multi-source data quality problems respectively. In this way, the time required to deal with multi-source data quality problems is reduced. For example, during duplicate record detection, comparing each tuple with every other tuple is very time consuming, so reducing the number of comparisons saves significant time. This is achieved by reducing the number of tuples to be compared within each single data source before the sources are integrated.

Further, in each stage, the process is again divided into two sub-processes according to whether or not human involvement is needed to deal with the data quality problem. This helps with minimizing human involvement during the cleaning process and thus, the degree of automation is maximized.

In each sub-process, a strategy is applied to organize the multiple algorithms involved, i.e., algorithms dealing with non-computational-costly errors are placed at the front of the order [32]. An error of this type requires relatively less cleaning time than a computational-costly error. Although this strategy aims to minimize the processing time, the effectiveness associated with these algorithms is not considered. As will be seen later, when effectiveness is considered, a pre-defined order might need to be adjusted and a different algorithm selected. Based on the strategy proposed by Peng, improvements have been made in the proposed framework by introducing two mechanisms (AOM and ASM), which address both the effectiveness and the efficiency of the data cleaning process.

Apart from the organization of the multiple algorithms involved during the cleaning process, the selection of a suitable algorithm is also a difficult task [31]. For example, Peng mentions that during the data cleaning process an experienced expert with sufficient business domain knowledge might be required to guide the selection of an algorithm for a particular domain [32]. Since human involvement during the data cleaning process reduces the degree of automation, a solution that can automatically select a suitable algorithm according to some rules or strategies is highly desirable for a data cleaning approach. In the proposed framework, the selection of an appropriate algorithm is addressed by introducing an 'algorithm selection mechanism'.

Although it is impossible to test thoroughly all available algorithms associated with all the dirty data types in the proposed taxonomy, a set of approximate string matching algorithms has been analyzed and evaluated on a number of carefully designed databases in Chapter 5. The experimental results show that it is possible for the proposed 'algorithm selection mechanism' to automatically select a suitable algorithm according to domain-specific pre-defined rules during the data cleaning process. Thus, both effectiveness and the degree of automation can be improved.

Finally, the proposed framework also addresses the DDS problem introduced in Chapter 1 by introducing a 'DDS process'. The literature indicates that, in some cases, cleaning all dirty data types is unrealistic and simply not cost-effective when the needs of a business enterprise are taken into account [29]. In the proposed framework, the 'DDS process' helps enterprises select dirty data types by prioritizing the expensive process of data cleaning, therefore maximally benefiting their organizations.

4.2.2 Some definitions

Before the proposed data cleaning framework is detailed, the following definitions are needed regarding the dirty data types and the two proposed mechanisms (AOM and ASM); they are used throughout the description of the framework.


 Single-source error type: An error of this type is present within a single source dataset. For example, missing values, misspelled values, syntax violations and outdated values are all errors of this type. These errors should be cleaned within each single data source before data from multiple data sources are integrated, so that the overall cleaning time is reduced. For example, detecting duplicate records across multiple data sources takes time; if a significant number of records can be removed within the single sources first, the number of comparisons necessary for detecting duplicates is significantly reduced.

 Multi-source error type: An error of this type is present when data from more than one data source are integrated. For example, different representations of the values of the same attribute are a multi-source problem: in one data source the values 'F' and 'M' are used to represent the attribute 'Gender', while in another the values '1' and '0' are used instead. As another example, duplicate records may occur when data are integrated from multiple data sources, as the same entity may be represented by equivalent representations in more than one tuple from different data sources.

 Automatic-removable error type: An error of this type can be detected and then corrected without any human involvement; cleaning it depends entirely on the selected algorithms. For example, the problem of different representations of 'Gender' attribute values in multiple data sources can be detected and corrected automatically by suitable algorithms without any human intervention.

 Non-automatic-removable error type: A non-automatic-removable error cannot be fully detected and corrected by algorithms without human intervention. For example, during the detection of duplicate records from a single data source or multiple data sources, human involvement is sometimes needed to deal with exceptions such as missing values in the matching fields. Although algorithms are available for detecting duplicate records, an expert is sometimes still required for the final merging of the linked records. For example, an expert may need to decide which record should be kept out of many duplicates, update the values of the record that is kept, and delete the others.

 Computational-costly error type: Cleaning an error of this type with an algorithm takes a relatively large amount of time. 'Computationally costly' is a relative measure; it varies with the type of error and the algorithms involved. For example, both the 'SortingIndex' and 'FullIndex' methods provided by Febrl can be applied to detect duplicate records in a dataset, but the 'SortingIndex' algorithm requires relatively less time than the 'FullIndex' algorithm.

 Non-computational-costly error type: An error of this type is the opposite of a computational-costly error: it can be detected and cleaned in relatively little time compared with computational-costly errors.

 Algorithm ordering mechanism (AOM): The algorithm ordering mechanism organizes the dirty data types to be cleaned into a specified order so as to maximize the efficiency and effectiveness of the data cleaning process. This mechanism is applied to both the 'automatic-removable error type' group and the 'non-automatic-removable error type' group. With the help of this mechanism, all dirty data types in each group are first ordered according to the computational cost associated with the algorithm selected for each dirty data type, with non-computational-costly errors placed at the front of the order. The order is then adjusted according to a further analysis of the effectiveness of the involved algorithms: if an algorithm (A1) has to deal with the values of multiple fields, then any other algorithms that improve the quality of the values in those fields need to be executed before A1. This mechanism is further detailed in the case study presented later, and a minimal sketch of both mechanisms is given after these definitions.

 Algorithm selection mechanism (ASM): The algorithm selection mechanism helps to select a proper algorithm for a specific dirty data type according to the different considerations involved during the data cleaning process, such as the problem domain, error types and error rates. These considerations are represented as pre-defined rules in the system. With the help of the ASM, dirty data types are also grouped into the two sub-groups of 'automatic-removable errors' and 'non-automatic-removable errors', according to the algorithms selected for each dirty data type.
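
The following minimal sketch illustrates how the two mechanisms could operate. The rule table, the algorithm names, the cost and automation labels and the field sets are all hypothetical and are not the framework's actual rule base; the code only shows the idea of selecting an algorithm from pre-defined rules (ASM) and then ordering the selected algorithms by computational cost before adjusting for field dependencies (AOM).

    # Hypothetical ASM rules: (dirty data type, condition on a simple data profile, algorithm).
    ASM_RULES = [
        ("DT.27", lambda profile: profile.get("error_rate", 0.0) < 0.05, "edit_distance_matcher"),
        ("DT.27", lambda profile: profile.get("error_rate", 0.0) >= 0.05, "qgram_matcher"),
        ("DT.28", lambda profile: True, "sorted_neighbourhood_dedup"),
    ]

    # Hypothetical metadata kept for each algorithm.
    ALGORITHM_METADATA = {
        "edit_distance_matcher": {"costly": False, "automatic": True, "fields": {"Name"}},
        "qgram_matcher": {"costly": False, "automatic": True, "fields": {"Name"}},
        "sorted_neighbourhood_dedup": {"costly": True, "automatic": False,
                                       "fields": {"Name", "Address"}},
    }

    def select_algorithm(dirty_type, profile):
        """ASM: return the first algorithm whose pre-defined rule matches the data profile."""
        for dt, condition, algorithm in ASM_RULES:
            if dt == dirty_type and condition(profile):
                return algorithm
        return None

    def group_by_automation(algorithms):
        """ASM also splits the selected algorithms by whether human involvement is needed."""
        automatic = [a for a in algorithms if ALGORITHM_METADATA[a]["automatic"]]
        manual = [a for a in algorithms if not ALGORITHM_METADATA[a]["automatic"]]
        return automatic, manual

    def order_algorithms(algorithms):
        """AOM: non-computational-costly algorithms first; then make sure an algorithm that
        improves fields consumed by a multi-field algorithm runs before that algorithm."""
        ordered = sorted(algorithms, key=lambda a: ALGORITHM_METADATA[a]["costly"])
        result = []
        for algorithm in ordered:
            fields = ALGORITHM_METADATA[algorithm]["fields"]
            position = next((i for i, placed in enumerate(result)
                             if fields < ALGORITHM_METADATA[placed]["fields"]), len(result))
            result.insert(position, algorithm)
        return result

    profile = {"error_rate": 0.02}
    selected = [select_algorithm("DT.27", profile), select_algorithm("DT.28", profile)]
    print(group_by_automation(selected))
    print(order_algorithms(selected))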

4.2.3 The framework

Briefly, the framework breaks the data cleaning process into three stages. Firstly, all dirty data from the various data sources are classified into two groups: the 'single-source error type' group and the 'multi-source error type' group. The first and second stages of the data cleaning process are designed to deal with these two groups of error types respectively.

In the first stage, dirty data belonging to the 'single-source error type' group is detected and cleaned. Dirty data in this group refers to dirty data present in a single data source, e.g., misspellings and domain constraint violations. In the second stage, dirty data belonging to the 'multi-source error type' group is detected and cleaned. Dirty data in this group refers to dirty data present when data are integrated from multiple data sources, e.g., duplicate records. The two proposed mechanisms (AOM and ASM) are then applied to each group during the data cleaning process. In order to improve the degree of automation, dirty data from the two groups are again grouped into two sub-groups, 'automatic-removable error types' and 'non-automatic-removable error types', according to whether the dirty data can be fully cleaned without any human involvement.

In the group of 'automatic-removable error types', all dirty data should be detected and cleaned by the selected algorithms without any human involvement during the data cleaning process, while in the group of 'non-automatic-removable error types', human involvement is required during the cleaning process. In each sub-group, the AOM helps to organize the algorithms selected by the ASM.

In the third stage, the tasks associated with data transformations are performed. Those data cleaned from the first and second stages are ready for the tasks such as instance-level or schema-level format standardizations, data integration, and data aggregation. Finally, data are ready for loading to any database application.

Additionally, before entering the first stage, a process called the 'DDS process' can be applied to the dirty data from the various data sources. According to the different needs of an organization, this process helps the organization select only the most important dirty data to deal with, rather than running all algorithms for all possible dirty data candidates, in order to minimize the expensive cost associated with the data cleaning process. A general view of the proposed framework is given in Fig.4.1.

[Figure: the source databases SourceDB(1) to SourceDB(i) feed the DDS process, followed by the single-source process, the multi-source process and data transformation, before the data are loaded into database applications.]

Fig.4.1 A data cleaning framework

All major components and their embedded tasks from the first two stages in Fig.4.1 are detailed below:

(i) The DDS Process

The DDS process identifies a selection of possible dirty data types rather than focusing on all dirty data types from the different data sources. By focusing only on the selected dirty data types, the expensive process of data cleaning can be prioritized and will therefore maximally benefit the organization. Regarding the DDS problem introduced in Chapter 1, when the specific needs of a business enterprise have to be taken into account, it is usually neither realistic nor cost-effective to clean all the dirty data types encountered in the different data sources. Since business rules can be used as guidelines for the validation of information quality, with the help of the proposed rule-based taxonomy it is reasonable for a business enterprise to select a few of the most important groups of business rules rather than all of them, according to its own business priorities. According to David Loshin, 'integrating control processes based on data quality rules communicates knowledge about the value of the data in use, and empowers the business users with the ability to determine how best the data can be used to meet their own business needs'. He also recommends that 'organizing data quality rules within defined data quality dimensions can enable the governance of data quality management and data stewards can use data quality tools for determining minimum thresholds for meeting business expectations, monitoring whether measured levels of quality meet or exceed those business expectations' [12]. The proposed taxonomy of dirty data is a data quality rule based taxonomy which forms relationships between dirty data types and data quality rules. When these data quality rules are organized under the defined data quality dimensions, a relationship between data quality dimensions and dirty data types can also be formed, which is used to develop a method for dealing with data quality problems.

In detail, in order to generate a better DDS process result, an assessment of data quality is first required. According to the review in Chapter 2, data quality cannot be assessed independently of the people who use the data, i.e., the data consumers. The same data used in different tasks may require different quality characteristics. Therefore, both subjective and objective data quality metrics are required during the DDS process.

Firstly, an objective assessment is performed. According to the specific business priority policy, data quality dimensions are obtained together with the different business rules associated with each dimension. The following five data quality dimensions have been used to measure the quality of data values: accuracy, completeness, consistency, currentness and uniqueness. Brief definitions of these five dimensions are given below:

 Accuracy dimension: The accuracy of a datum refers to the degree of closeness of its value v to some value v' in the attribute domain considered correct for the entity e and attribute a. If the datum's value v is the same as the correct value v', the datum is said to be accurate or correct.

 Completeness dimension: Completeness is the degree to which a data collection has values for all attributes of all entities that are supposed to have values.

 Currentness dimension: A datum is said to be current or up to date at time t if it is correct at time t. A datum is out of date at time t if it is incorrect at t but was correct at some moment preceding t.

 Consistency dimension: Data is said to be consistent with respect to a set of data model constraints if it satisfies all the constraints in the set.

 Uniqueness dimension: Uniqueness of the entities within a data set implies that no entity exists more than once within the data set.
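
As an illustration of how the objective assessment of two of these dimensions might be computed over a relation, the sketch below measures completeness (the share of attribute values that are present) and uniqueness (the share of rows whose key occurs only once). The sample records and the choice of matching key are hypothetical, and a real assessment would normally rely on approximate matching rather than exact keys.

    # Hypothetical customer tuples: (customer_id, name, address)
    rows = [
        ("C01", "John Smith", "12 High Street, Edinburgh"),
        ("C02", "Jane Brown", None),                          # missing address
        ("C03", "Jonh Smyth", "12 High Street, Edinburgh"),   # likely duplicate of C01
    ]

    def completeness(rows):
        """Completeness: proportion of attribute values that are actually present."""
        values = [v for row in rows for v in row]
        return sum(v is not None for v in values) / len(values)

    def uniqueness(rows, key=lambda r: (r[1].lower(), r[2])):
        """Uniqueness: proportion of rows whose (name, address) key occurs only once.
        Exact keys miss the misspelled duplicate; approximate matching would catch it."""
        keys = [key(r) for r in rows if r[2] is not None]
        return sum(keys.count(k) == 1 for k in keys) / len(keys)

    print(round(completeness(rows), 2))   # 0.89
    print(round(uniqueness(rows), 2))     # 1.0 with exact keys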

These data quality dimensions are ordered based on the business priority policy. With the help of the proposed rule-based taxonomy of dirty data, a collection of dirty data types is selected and associated with each data quality dimension. Then, according to the individual needs of the business organization, the most wanted dimensions are selected and algorithms/methods for dealing with the dirty data types within the selected dimensions are collected.

Meanwhile, a subjective assessment is conducted by different data consumers. Subjective data quality assessment evaluates data quality from the views of data collectors, custodians and data consumers [50] and can adopt a comprehensive set of data quality dimensions defined from the perspective of data consumers [4]. The assessment is focused on the management perspective and concentrates on whether the data is fit for use. During this process, questionnaires, interviews and surveys can be developed and used to assess these dimensions. From the literature, subjective assessment results may corroborate objective assessment results; in this case, the results of the objective assessment are used for the next step. However, when discrepancies exist between the subjective and objective assessments, organizations should investigate the root causes and consider corrective actions, for example whether the dimensions from the subjective assessment should be included in the final result or whether the discrepancies can be disregarded. Finally, with the selected dirty data types and the available algorithms, a classification is made: these dirty data types, together with their algorithms, are grouped into the 'single-source error type' group and the 'multi-source error type' group. The general process of solving the DDS problem is shown in Fig.4.2.

As is shown in Fig.4.2, tasks from the DDS process include:

a) Create an order of the five dimensions according to the business priority policy.

b) Identify data quality problems with the help of the proposed taxonomy of dirty data.

c) Map the dirty data types identified in b) into the dimensions against the classification table.

d) Comparatively analyze the results of both the objective and subjective assessments.

e) Decide dimensions to be selected based on the budget.

f) Select available algorithms, which can be used to detect dirty data types associated with dimensions identified in e).

g) Group the selected dirty data types into ‘single-source error type’ group and ‘multi-source error type’ group with the help of domain and technical knowledge so that they can be dealt with in the next two stages respectively.
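
A compact sketch of steps a) to g) is given below. The priority order, the dimension-to-type mapping, the per-dimension costs and the budget are invented for illustration (the mapping is loosely based on the rule-based taxonomy); the real DDS process also incorporates the subjective assessment and domain knowledge, which are omitted here.

    # Hypothetical inputs for the DDS process described in steps a) to g).
    PRIORITY_ORDER = ["accuracy", "uniqueness", "completeness", "consistency", "currentness"]

    # Step c) mapping: dimension -> dirty data types (illustrative selection only).
    DIMENSION_TO_TYPES = {
        "completeness": ["DT.14", "DT.15"],
        "accuracy": ["DT.21", "DT.22", "DT.23", "DT.24"],
        "uniqueness": ["DT.28", "DT.29"],
        "consistency": ["DT.30", "DT.31"],
        "currentness": ["DT.23"],
    }

    # Assumed cleaning cost per dimension and an overall budget (step e).
    DIMENSION_COST = {"accuracy": 40, "uniqueness": 35, "completeness": 20,
                      "consistency": 25, "currentness": 10}
    BUDGET = 80

    # Dirty data types that only appear when sources are integrated (step g).
    MULTI_SOURCE_TYPES = {"DT.29", "DT.31"}

    def dds_select():
        """Select dimensions in priority order until the budget is exhausted, then group
        the covered dirty data types into single-source and multi-source groups."""
        selected, spent = [], 0
        for dim in PRIORITY_ORDER:                              # steps a) and e)
            if spent + DIMENSION_COST[dim] <= BUDGET:
                selected.append(dim)
                spent += DIMENSION_COST[dim]
        types = {dt for dim in selected for dt in DIMENSION_TO_TYPES[dim]}   # steps c) and f)
        single = sorted(types - MULTI_SOURCE_TYPES)             # step g)
        multi = sorted(types & MULTI_SOURCE_TYPES)
        return selected, single, multi

    print(dds_select())
    # (['accuracy', 'uniqueness'], ['DT.21', 'DT.22', 'DT.23', 'DT.24', 'DT.28'], ['DT.29'])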


[Figure: the business priority policy produces an ordered list of data quality dimensions; together with the rule-based taxonomy of dirty data, the subjective assessment results and the algorithm/method collections, a comparative analysis yields the selected dirty data types and data quality dimensions, and the DDS problems are grouped into single-source error types and multi-source error types.]

Fig.4.2 The DDS Process

(ii) The single-source process

The purpose of decoupling the data cleaning process into the two stages of the single-source process and the multi-source process is to minimize the data cleaning time. For example, duplicate records are a common dirty data type which occurs when data from multiple sources are integrated. The most reliable way to detect duplicate records is to compare every record with every other record, which has quadratic cost and becomes unacceptable for large datasets given the time needed for data cleaning. However, if a significant number of tuples can be removed within the single sources, the number of comparisons necessary for detecting duplicates among the integrated records is significantly reduced and the time required is minimized.
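
The saving can be quantified with the usual pair-wise comparison count n(n-1)/2. The short calculation below uses figures invented purely for illustration: two sources of 10,000 records each, with single-source cleaning assumed to remove 20% of the tuples in each source before integration.

    def pair_comparisons(n):
        """Number of pair-wise comparisons needed to compare every record with every other."""
        return n * (n - 1) // 2

    # Integrate first, then detect duplicates over both sources together.
    raw = pair_comparisons(10_000 + 10_000)
    print(f"{raw:,}")        # 199,990,000 comparisons

    # Remove 20% of the tuples within each source first, then integrate.
    cleaned = pair_comparisons(8_000 + 8_000)
    print(f"{cleaned:,}")    # 127,992,000 comparisons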

In detail, with all single-source error types identified from SourceDB(1) to SourceDB(i), the 'algorithm selection mechanism' is first applied to associate an appropriate algorithm with each error type, according to the rules defined by the users.

Then, the single-source error types from each single source DB are decoupled into two groups, namely automatic-removable errors and non-automatic-removable errors. The 'algorithm selection mechanism' helps with this decoupling. For each dirty data type addressed in the proposed framework, metadata about its related algorithms, such as the computational cost and whether human involvement is needed during execution, is kept in the framework. Once a data quality problem is classified in the group of single-source error types and a suitable algorithm is selected by the 'algorithm selection mechanism', the corresponding metadata of this algorithm is extracted and analyzed by the mechanism.

According to the metadata supplied for the algorithms involved in the group of 'single-source error types', these error types are decoupled into the two sub-groups of 'automatic-removable errors' and 'non-automatic-removable errors'. For each sub-group, the 'algorithm ordering mechanism' is applied to organize the execution of the algorithms.

As mentioned, the ordering generated by this mechanism addresses both the efficiency and the effectiveness of data cleaning. Firstly, the metadata on the computational cost of each algorithm is extracted and analyzed by the 'algorithm ordering mechanism', and algorithms dealing with non-computational-costly errors are placed at the front of the order. Then, the effectiveness associated with each algorithm is further analyzed and the order is adjusted by the mechanism. Once the ordering is done, the algorithms are ready to be executed.


The cleaning of the 'automatic-removable errors' group is performed first, followed by the 'non-automatic-removable errors' group. As mentioned in section 4.1, algorithm selection and algorithm ordering are two important factors that influence the performance and accuracy of the data cleaning result. Users of existing data cleaning approaches such as Febrl or ARKTOS have to specify the algorithm selection as well as its ordering themselves, which not only reduces the degree of automation during the data cleaning process but is also likely to result in poor cleaning results, as discussed before. The proposed framework addresses the algorithm selection and algorithm ordering problems by introducing the two mechanisms (ASM and AOM), which improves the degree of automation during the data cleaning process as well as the efficiency and effectiveness of data cleaning. The general process of dealing with the single-source process is presented in Fig.4.3. As shown in Fig.4.3, in the single-source process:

(1) Tasks associated with applying the algorithm selection mechanism include:
a) For all dirty data types from the group of 'single-source error types', selecting an appropriate algorithm for each dirty data type involved.
b) Grouping all dirty data types from the group of 'single-source error types', based on the selected algorithms, into either the sub-group of 'automatic-removable errors' or the sub-group of 'non-automatic-removable errors'.

(2) Tasks associated with applying the algorithm ordering mechanism include:
a) Ordering all algorithms in each sub-group ('automatic-removable errors' or 'non-automatic-removable errors') according to the computational cost associated with each selected algorithm; algorithms dealing with non-computational-costly errors are put at the front of the order.
b) Adjusting the order obtained from a) to maximize the effectiveness of each involved algorithm in the sub-group.

(3) Tasks in the single-source process include:
a) Dealing with dirty data from the sub-group of 'automatic-removable errors' with the help of the algorithms selected by the ASM, based on the order generated by the AOM.
b) Dealing with dirty data from the sub-group of 'non-automatic-removable errors' with the help of the algorithms selected by the ASM, based on the order generated by the AOM.
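
Putting these tasks together, a skeleton of the single-source process might look like the following. The algorithm catalogue, the dirty data type assignments and the cleaning functions are placeholders standing in for the ASM and AOM described above; the selection here is a plain lookup and the ordering is a simple cost sort, so this is a sketch of the control flow rather than the framework's actual implementation.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Algorithm:
        name: str
        automatic: bool
        costly: bool
        run: Callable[[List[dict]], List[dict]]

    def trim_whitespace(rows):
        # Placeholder standardization step.
        return [{k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
                for row in rows]

    def flag_possible_duplicates(rows):
        # Placeholder for a step that would normally need expert confirmation.
        return rows

    CATALOGUE = {
        "DT.9": Algorithm("whitespace_normaliser", automatic=True, costly=False,
                          run=trim_whitespace),
        "DT.28": Algorithm("dedup_candidate_flagger", automatic=False, costly=True,
                           run=flag_possible_duplicates),
    }

    def clean_single_source(rows, error_types):
        """Select an algorithm per error type (a simple stand-in for the ASM), split the
        algorithms by automation, and run each sub-group in cost order (a simple AOM)."""
        selected = [CATALOGUE[e] for e in error_types]
        automatic = sorted((a for a in selected if a.automatic), key=lambda a: a.costly)
        manual = sorted((a for a in selected if not a.automatic), key=lambda a: a.costly)
        for algorithm in automatic + manual:
            rows = algorithm.run(rows)
        return rows

    print(clean_single_source([{"Name": " John Smith "}], ["DT.9", "DT.28"]))
    # [{'Name': 'John Smith'}]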

[Figure: for each source database SourceDB(1) to SourceDB(i), the algorithm selection and algorithm ordering mechanisms drive an automatic-removable errors cleaning process followed by a non-automatic-removable errors cleaning process, producing the cleaned databases CleanDB(1) to CleanDB(i).]

Fig.4.3 The single-source process


(iii) The multi-source process

The multi-source process deals with multi-source errors. Firstly, with all the multi-source errors identified, the ASM is applied to select an appropriate algorithm for each associated dirty data type. As in the single-source process, these multi-source errors are then further grouped into two sub-groups, automatic-removable errors and non-automatic-removable errors, with the help of the ASM.

Dirty data belonging to the sub-group of 'automatic-removable errors' is dealt with first, followed by the dirty data from the sub-group of 'non-automatic-removable errors'. The AOM is applied to generate an appropriate order for executing the multiple algorithms involved in each sub-group. The general process of dealing with the multi-source process is given in Fig.4.4.

[Figure: the algorithm selection and algorithm ordering mechanisms drive an automatic-removable errors cleaning process followed by a non-automatic-removable errors cleaning process over the multi-source errors, producing the cleaned database.]

Fig.4.4 The multi-source process

As is shown in Fig.4.4, in the multi-source process:

(1) Tasks associated with applying the algorithm selection mechanism include:
a) For all dirty data types from the group of 'multi-source error types', selecting an appropriate algorithm for each dirty data type involved.

b) Grouping all dirty data types from the group of 'multi-source error types', based on the selected algorithms, into either the sub-group of 'automatic-removable errors' or the sub-group of 'non-automatic-removable errors'.

(2) Tasks associated with applying the algorithm ordering mechanism include:
a) Ordering all algorithms in each sub-group ('automatic-removable errors' or 'non-automatic-removable errors') according to the computational cost associated with each selected algorithm; algorithms dealing with non-computational-costly errors are put at the front of the order.
b) Adjusting the order obtained from a) to maximize the effectiveness of each involved algorithm in the sub-group.

(3) Tasks in the multi-source process include:
a) Dealing with dirty data from the sub-group of 'automatic-removable errors' with the help of the algorithms selected by the ASM, based on the order generated by the AOM.
b) Dealing with dirty data from the sub-group of 'non-automatic-removable errors' with the help of the algorithms selected by the ASM, based on the order generated by the AOM.

Compared with current data cleaning approaches reviewed in Chapter 2, the proposed framework has several features which those existing data cleaning approaches have not considered:

(1) The use of the proposed rule based taxonomy of dirty data during the data cleaning process.

(2) The DDS process, which helps a business take into account its special needs according to its business priority policy. This is especially helpful when the budget a business has for data cleaning is limited.

(3) The application of an algorithm selection mechanism during the cleaning process. Current data cleaning approaches do not supply any knowledge about the selection of an appropriate algorithm, nor do they provide guidelines for ordering the selected algorithms. In some approaches this task is left to the user to decide; in others, only a fixed solution for all situations is provided, without considering the problem domain or the nature of the dirty data. This not only affects the quality of the cleaning result but also increases the cleaning time. From the literature, the selection of a suitable algorithm for dealing with a specific dirty data type has proven to be a difficult task, with many aspects needing to be considered. Due to the research scope, it is impossible to test all the algorithms/methods associated with all the dirty data types in the proposed taxonomy. However, in chapter 5, a group of selected approximate string matching algorithms is used for such a test. As is presented later, no single technique can cope with all situations.

(4) The application of an 'algorithm ordering mechanism' during the cleaning process. Current data cleaning approaches do not provide guidelines for ordering the multiple data cleaning activities involved during the data cleaning process. As with the selection of an algorithm, in some approaches, such as ARKTOS, this task is left to the user to decide; in others, such as Febrl, multiple data cleaning activities have to be performed separately rather than in a single data cleaning process. This not only affects the effectiveness of the data cleaning results but also reduces the degree of automation. With the help of the proposed algorithm ordering mechanism, multiple data cleaning activities are organized into a specific order with both the degree of automation and the effectiveness considered. Thus, the efficiency and effectiveness of the whole data cleaning process can be improved.


(5) Decoupling the whole data cleaning process into two sub-processes (the single-source process and the multi-source process) and dealing with the single-source process first, followed by the multi-source process. By doing so, the time required for cleaning the dirty data from the group of 'multi-source error types' is greatly reduced and the efficiency of data cleaning should be improved [32].

4.3 A case study

To illustrate the basic framework, a data cleaning tool based on the proposed data cleaning framework has been prototyped. A case study is presented in this section to show the efficiency and effectiveness of the proposed data cleaning framework by applying it to some purposely designed databases.

In the U.K., the National Health Service (NHS) is a nationwide organization that provides health services to all residents. By gathering the information of all residents into a data warehouse (DW), the level of NHS services could be improved. Suppose every city in the U.K. has a single local database which contains the information of the residents of that city. The DW needs to bring together the information from each city's local database. The problem is that duplicate information may exist either within a single data source or when multiple data sources are integrated. Duplicate information occurs for many reasons. For example, consider university students who move to another city after graduation. Students from city-A may register with doctors in city-A, where their universities are located. These students may register with other doctors in city-B when they move there for further study or a new job after graduation. In this case, information on these students may be stored in both cities' local databases, which duplicates the information.


The following tables (table 4.1 and table 4.2) show samples of the data entered in the two cities' NHS local databases respectively, where VST stands for valid start time and VET for valid end time.

No.  Last Name      First Name  Age  City       Post       VST         VET
1    Colae          Liam        22   Edinburgh  Student    22-09-2005  Now
2    Gerrard        John        23   Student    Edinburgh  02-10-2004  Now
3    Higgins        Alan        21   Edinburgh  Student    05-10-2004  20-06-2004
4    Kent           Alex        36   Edinbugh   Engineer   18-09-2003  Now
5    Owen           Mark        18   Edinburgh  Student    06-10-2004  Now
6    Small          Helen       23   Edinburgh  Student    12-09-2002  Now
7    William,Smith              24   Edinburgh  Student    08-10-2004  Now
8    Smith          Mary        34   Edinburgh  Engineer   12-10-2005  10-09-2005
9    Snow           Jamie       22   Edi        Student    10-10-2005  Now
10   Cole           Lieam       22   Edinburgh  Student    22-09-2005  Now

Table 4.1 Records in city-A

No.  Last Name  First Name  Age  City    Post      VST         VET
1    Cole       Liam        26   London  Engineer  20-08-2009  Now
2    Gerrad     John        27   London  Engineer  18-09-2004  Now
3    Higgins    Alan        21   London  Engineer  30-08-2008  Now
4    Kent       John        34   London  Engineer  18-09-2007  Now
5    Owen       Mary        22   London  Student   10-10-2008  Now
6    Small      Helen       23   Lndon   Student   10-09-2003  Now
7    Smith      William     24   London  S         08-10-2008  Now
8    Kirsty     Smith       38   London  Engineer  10-10-2009  Now
9    Snow       John        22   London  Student   08-08-2006  Now

Table 4.2 Records in city-B

By observing the two tables, some dirty data can easily be identified. For example, in some records the value 'Edinburgh' has been misspelt as 'Edinbugh' in the 'City' field. The NHS has also noticed that some suspicious duplicate personal information exists in the tables. It decides to detect the duplicate information and eliminate it, to make sure that personal information is kept only in the local database of the city where the person is currently living. The proposed data cleaning framework is applied to help the NHS clean such dirty data. The actual cleaning process is detailed below, with the DDS process specified first, before entering the first stage to deal with the dirty data in each single data source.

(1) The DDS process

Data from the two databases has been fully analyzed. Based on the data provided by table 4.1 and table 4.2, the following 9 dirty data types are identified according to table 3.8:

• DT.15: in table 4.1, a missing value is observed in the 'First Name' field of the record where 'No.'=7.
• DT.17: in table 4.1, in the record where 'No.'=7, the value of the 'First Name' field has been entered into the 'Last Name' field, followed by the correct last name value, which causes a problem of 'extraneous data entry'.
• DT.20: in table 4.1, in the record where 'No.'=2, the problem of 'entry into wrong field' is noticed: the value of the 'City' field has been entered into the 'Post' field and vice versa. Similarly, in table 4.2, in the record where 'No.'=7, the value of the 'First Name' field has been entered into the 'Last Name' field and vice versa.
• DT.23: in table 4.2, supposing it is known that 'Alan Higgins' in table 4.2 is exactly the same person as in table 4.1, an outdated value is observed in the 'Age' field of the record where 'No.'=3.
• DT.27: in table 4.1, misspelling errors are seen in the records where 'No.'=1, 4 and 10, in the fields 'Last Name', 'City' and 'First Name' respectively. In table 4.2, misspelling errors are seen in the records where 'No.'=6 and 8, in the fields 'City' and 'Last Name' respectively.
• DT.32: in table 4.1, in the record where 'No.'=9, the value in the 'City' field is abbreviated as 'Edi' rather than 'Edinburgh'.
• DT.33: in table 4.2, in the record where 'No.'=7, the value in the 'Post' field is represented with a special character 'S' rather than 'Student'.
• DT.28: in table 4.1, suspicious duplicates are observed, for example the records where 'No.'=1 and 10.
• DT.29: when data are viewed across both tables, many suspicious duplicate records are noticed; for example, the records where 'No.'=1, 2, 3 and 7 from table 4.1 are all suspicious candidates.

In order to build a high quality DW, the data in the databases should be cleaned as much as possible. Thus, NHS plans to execute the following data cleaning activities (DCA) to improve its data quality:

No.     Data Cleaning Activity
DCA.1   Detect/fill missing values in the 'Last Name' and 'First Name' fields.
DCA.2   Standardize values in the 'Last Name' and 'First Name' fields.
DCA.3   Correct values entered into the wrong fields, based on the 'City', 'Post', 'First Name' and 'Last Name' fields.
DCA.4   Update values in the 'Age' field.
DCA.5   Correct the misspelt values in the 'Last Name', 'First Name' and 'City' fields.
DCA.6   Detect/standardize the abbreviated values in the 'City' field.
DCA.7   Detect the use of special character values in the 'Post' field and correct them to the standardized values.
DCA.8   Clean the duplicate records within each single dataset.
DCA.9   Clean the duplicate records across the integrated datasets.

Table 4.3 Data cleaning activities

Suppose that, due to the limited resources available within the NHS, performing all nine data cleaning activities is unrealistic. Therefore, before entering the first stage to deal with the dirty data from the individual data sources, the DDS process is performed to help the NHS select some of the most important dirty data to clean. The DDS process for the NHS begins with a mapping between NHS business rules and data quality dimensions.

(i) Mapping between Data Quality Rules and Data Quality Dimensions

With the five data quality dimensions introduced in section 4.2.2, a new classification of the dirty data types is introduced, beginning with a mapping of data quality rules onto data quality dimensions. Table 4.4 shows the result of the mapping:

Data Quality Dimension    Data Quality Rules
Accuracy dimension        R2.2, R3.2, R4.2, R4.4
Completeness dimension    R1.3, R4.1
Currentness dimension     R4.3
Consistency dimension     R1.2, R3.1, R4.6
Uniqueness dimension      R4.5

Table 4.4 Data quality dimensions and data quality rules

(ii) A Classification

The result of table 4.4 provides immediate help for the proposed classification of dirty data within the new taxonomy. Combining the results from table 3.9 and table 4.4, a classification of dirty data types based on data quality dimensions is obtained, as shown in table 4.5.


Data Quality Dimension    Dirty Data Type
Accuracy dimension        DT.5~DT.9, DT.11~DT.13, DT.16~DT.20, DT.25~DT.27
Completeness dimension    DT.3, DT.4, DT.14, DT.15
Currentness dimension     DT.21~DT.24
Consistency dimension     DT.1, DT.2, DT.10, DT.30~DT.38
Uniqueness dimension      DT.28, DT.29

Table 4.5 Data quality dimensions and dirty data types

Therefore, data cleaning activities for NHS can be considered as cleaning dirty data by different data quality dimensions. The DDS problem described in Chapter 1 can therefore be solved by forming a relationship between the defined data quality dimensions and dirty data types with the help of the rule-based taxonomy of dirty data.

(iii) The Method and the Result

With the help of the classification from part (ii), the method described in section 4.2.2 is applied during the DDS process as follows:

a) Create an order of the five dimensions according to the business priority policy.

According to the NHS's priority policy, the uniqueness of the information recorded in the DW is the most stringent requirement, followed by accuracy. The order of the five data quality dimensions for the NHS is therefore: Uniqueness → Accuracy → Consistency → Completeness → Currentness, in descending priority.

b) Identify data quality problems.

With the help of the rule-based taxonomy of dirty data, the following nine dirty data types are identified: DT.17, DT.20, DT.27, DT.32, DT.33, DT.15, DT.23, DT.28, DT.29.

c) Map the data types identified in b) onto the dimensions against the classification table.

With the help of table 4.5, the mapping is achieved as given in table 4.6. It is clear that all identified data quality problems have been organized under the five data quality dimensions. Suppose it is impossible for the NHS to address all data quality problems associated with the five dimensions. The problem the NHS then faces is how to select a group of dirty data types to deal with, i.e., the DDS problem.

Data Quality Dimension    Dirty Data Type
Uniqueness dimension      DT.28, DT.29
Accuracy dimension        DT.17, DT.20, DT.27
Consistency dimension     DT.32, DT.33
Completeness dimension    DT.15
Currentness dimension     DT.23

Table 4.6 An example of data quality dimensions and dirty data types

d) Decide which dimensions to select based on the budget.

According to the priority policy, the NHS chooses to deal first with the data quality problems related to the uniqueness and accuracy dimensions, since these two dimensions are much more urgent than the others. Thus, the dirty data types DT.17, DT.20, DT.27, DT.28 and DT.29 are chosen to be cleaned first (a small sketch of this budget-driven selection is given after step f) below).

e) Select the available algorithms which can be used to detect the dirty data types associated with the dimensions identified in d).

f) Group the selected dirty data types into the single-source errors group and the multi-source errors group with the help of domain and technical knowledge, so that they can be dealt with in the next two stages respectively.
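A minimal sketch of the selection step follows, under the assumption that the priority order, the dimension-to-dirty-data-type mapping of table 4.6 and some per-dimension cost figures are supplied by the business; the cost and budget numbers used here are purely illustrative.

```python
# Hypothetical priority order, mapping and costs for the DDS selection step.
priority_order = ["Uniqueness", "Accuracy", "Consistency", "Completeness", "Currentness"]

identified = {               # result of step c): the mapping of table 4.6
    "Uniqueness":   ["DT.28", "DT.29"],
    "Accuracy":     ["DT.17", "DT.20", "DT.27"],
    "Consistency":  ["DT.32", "DT.33"],
    "Completeness": ["DT.15"],
    "Currentness":  ["DT.23"],
}
estimated_cost = {"Uniqueness": 5, "Accuracy": 3, "Consistency": 2,
                  "Completeness": 1, "Currentness": 1}   # assumed cost units
budget = 8                                               # assumed budget

def select_dirty_data_types(priority, identified, cost, budget):
    """Step d): walk the dimensions in priority order and keep adding their
    dirty data types while the remaining budget allows it."""
    selected = []
    for dim in priority:
        if dim in identified and cost.get(dim, 0) <= budget:
            selected.extend(identified[dim])
            budget -= cost[dim]
    return selected

print(select_dirty_data_types(priority_order, identified, estimated_cost, budget))
# -> ['DT.28', 'DT.29', 'DT.17', 'DT.20', 'DT.27'] with the assumed figures
```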

Finally, the DDS process provides the following data cleaning activities, the related dirty data types and the available techniques for NHS to choose (see table 4.7).

Group                      DCA     DDT     Techniques
Single-source error types  DCA.3   DT.20   Look-up tables; reference functions
                           DCA.5   DT.27   Spell checker
                           DCA.2   DT.17   Pattern learning technique
                           DCA.8   DT.28   Deduplication techniques; approximate string matching algorithms
Multi-source error types   DCA.9   DT.29   Record linkage techniques; approximate string matching algorithms

Table 4.7 The grouping results

(2) Dealing with the single-source process

According to table 4.7, four data cleaning activities are involved in the single-source process and four dirty data types, DT.20, DT.27, DT.17 and DT.28, belong to the group of 'single-source error types'. For each dirty data type, the available algorithms are provided. The proposed data cleaning approach first deals with these data cleaning activities. For each dirty data type involved in this group, the 'algorithm selection mechanism' is applied to select an appropriate algorithm from the available alternatives. This mechanism then further decouples these dirty data types into two sub-groups according to the metadata provided by each algorithm, i.e., 'automatic-removable errors' and 'non-automatic-removable errors' (see table 4.8). For example, in this case study, the duplicate detection method 'FullIndex' has been associated with dirty data type DT.28 to detect duplicate records in each single data source. However, since duplicate detection based on this method sometimes requires an expert's involvement to determine the matching result, DT.28 is grouped into the 'non-automatic-removable errors' sub-group.

Group                      Sub-group  DDT     Techniques
Single-source error types  Auto       DT.20   Reference function
                                      DT.27   Spell checker
                           Non-auto   DT.17   Standardization
                                      DT.28   FullIndex, string matching algorithm

Table 4.8 An example of the two sub-groups

With the two sub-groups formed, the 'algorithm ordering mechanism' is applied to automatically specify the execution order of the multiple algorithms involved. This mechanism manages the ordering of the multiple algorithms based on two considerations.

The first consideration is related to the efficiency of data cleaning. From this point of view, an order is arranged with the aim of reducing the processing time. According to the 'algorithm ordering mechanism', dirty data types from the group of 'automatic-removable errors' are cleaned first with the supplied algorithms. In this example, DT.20 and DT.27 belong to the group of 'automatic-removable errors', since the techniques provided by the system can automatically detect these errors and correct them to the expected values. Dirty data types from the group of 'non-automatic-removable errors' are dealt with second. Dealing with these dirty data sometimes requires the involvement of an expert; for example, during a duplicate elimination process, an expert has to be involved to make a decision on the merging or purging tasks. In this example, DT.17 and DT.28 belong to the group of 'non-automatic-removable errors'. Within each sub-group, the 'algorithm ordering mechanism' is further applied to organize the selected algorithms. Algorithms dealing with errors of negligible computational cost are put at the front of the order, according to the strategy proposed in the 'algorithm ordering mechanism'. This strategy aims to minimise the processing time and also ensures that, by the time computationally costly errors such as duplicates are processed, the data volume has been significantly reduced. Hence, the processing time needed is reduced [32]. With this first consideration, the order of execution of the data cleaning for the dirty data types involved in the group of 'single-source error types' will be: DT.20 → DT.27 → DT.17 → DT.28.

The second consideration for the 'algorithm ordering mechanism' is related to the effectiveness of the data cleaning. From this point of view, the nature of the selected algorithm as well as the associated data cleaning rules are taken into account when arranging an order. In this example, four data cleaning activities are associated with the single-source process. For each of the data cleaning activities DCA.2, DCA.3 and DCA.5, executing the associated algorithm only affects the field values where the corresponding dirty data reside. With respect to the effectiveness of data cleaning for these three dirty data types, the order based on the first consideration will not cause any side-effect. However, suppose the Jaro algorithm for DCA.8 is applied to the fields 'Last Name' and 'First Name' to calculate the similarities of the name values, and the rule for duplicate detection is defined as "if the names of the two records are similar and the value of the field 'Post' is 'Student', then the two records are duplicate records". Clearly, according to this rule and the application of the Jaro algorithm, the data values of the three fields 'Last Name', 'First Name' and 'Post' are all involved.

Obviously, poor data quality in these three fields will result in low effectiveness of the duplicate detection in this case. From table 4.3 and table 4.5, it is known that the existence of the dirty data types DT.15, DT.17, DT.20, DT.27 and DT.33 will influence the data quality of the three fields. Therefore, in order to improve the effectiveness of the cleaning of DT.28, the two extra dirty data types DT.15 and DT.33 should be dealt with before DT.28.

In this case, the 'algorithm selection mechanism' is again applied to find a suitable algorithm for the two additional dirty data types and to group them into the two sub-groups. Based on the two considerations, the order generated by the 'algorithm ordering mechanism' for the single-source process is: DT.15 → DT.33 → DT.20 → DT.27 → DT.17 → DT.28.
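The effectiveness check can be sketched as follows, assuming a hypothetical mapping from dirty data types to the fields they affect; the types already scheduled (DT.17, DT.20, DT.27) are simply confirmed, while DT.15 and DT.33 are the newly added prerequisites.

```python
# Find the dirty data types that touch the fields used by the (assumed)
# duplicate-detection rule, so that they can be cleaned before DT.28.
fields_used_by_duplicate_rule = {"Last Name", "First Name", "Post"}

# Assumed mapping of dirty data types to the fields they affect, derived
# informally from tables 4.1 and 4.2; not an exhaustive or authoritative list.
affects = {
    "DT.15": {"First Name"},
    "DT.17": {"Last Name", "First Name"},
    "DT.20": {"City", "Post", "First Name", "Last Name"},
    "DT.27": {"Last Name", "First Name", "City"},
    "DT.33": {"Post"},
}

prerequisites = sorted(dt for dt, flds in affects.items()
                       if flds & fields_used_by_duplicate_rule)
print(prerequisites)   # every listed type overlaps, so all precede DT.28
```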

However, suppose that, due to the large dataset involved in the single-source process, the 'SortingIndex' method is selected by the 'algorithm selection mechanism' to deal with DT.28 rather than the 'FullIndex' method. The 'SortingIndex' algorithm reduces the cost at the expense of some accuracy. During the execution of the 'SortingIndex' algorithm, the creation of the sorting key is an important step. Table 4.9 shows an example of how the sorting key is created.


First Name  Last Name  Address                  ID        SortingKey
John        Smith      456 Merchiston Crescent  45678987  SmiJoh456Mer456
John        Smyth      456 Merchiston Crescent  45688987  SmyJoh456Mer456
Smith,      John       456 Merchiston Crescent  45688987  JohSmi456Mer456

Table 4.9 An example of sorting keys

The 'SortingIndex' algorithm, proposed by Hernandez et al. [13], sorts the records based on the sorting key values. The sorting key value for each record is computed by extracting relevant attributes, or portions of the attributes' values. Relevant attributes should have sufficient discriminating power to identify records that are likely to be duplicates. Table 4.9 shows an example of how sorting key values might look. The sorting key values are a combination of sub-string values of attributes in the order Last Name → First Name → Address → ID. Since the records are sorted on the generated sorting key values, the attribute that appears first in the sorting key has the most discriminating power, followed by the subsequent attributes with decreasing discriminating power. Therefore, it is important that the value of the first selected attribute is as clean as possible.
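A minimal sketch of the sorting-key construction, following the Last Name → First Name → Address → ID order of table 4.9; the sub-string lengths (3, 3, 6 and 3 characters) are illustrative assumptions that happen to reproduce the keys shown in the table.

```python
# Build a sorting key for a sorted-neighbourhood style index.
def sorting_key(last_name: str, first_name: str, address: str, rec_id: str) -> str:
    return (last_name[:3].title()            # first 3 letters of the last name
            + first_name[:3].title()         # first 3 letters of the first name
            + address.replace(" ", "")[:6]   # first 6 non-space address characters
            + rec_id[:3])                    # first 3 digits of the ID

print(sorting_key("Smith", "John", "456 Merchiston Crescent", "45678987"))
# -> 'SmiJoh456Mer456', matching the first row of table 4.9
```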

In table 4.9, all three records are supposed to be duplicate records. However, because the first name value and the last name value of the third person have been transposed (DT.34), the third record might not end up in the same sliding window as the first and second records. Records not in the same window will not be considered as candidate duplicate record pairs, and the final cleaning result will be defective. Therefore, when a 'SortingIndex' method is applied to detect duplicate records, it is necessary to make sure that the attribute values used for creating the sorting key values are as clean as possible.


In the case shown in table 4.9, suppose a method for dealing with the inconsistency problem for personal names (DT.34) is applied before the 'SortingIndex' algorithm; the sorting key value of the third record will then be corrected to 'SmiJoh456Mer456', which is exactly the same as the first record's sorting key value. All three records in table 4.9 will in this case fall into the same window and will be compared with each other. With the selected name matching algorithm applied later, the three records can be detected as duplicate records, as expected.

In the NHS case, suppose that, in order to deal with DT.28, the sorting key values for the 'SortingIndex' method are a combination of sub-string values of attributes in the order City → Last Name. The Jaro algorithm is still selected by the 'algorithm selection mechanism' to calculate the similarities of the values from the fields 'Last Name' and 'First Name'. The cleaning rule for DT.28 is changed to "if the names of the two records are similar and the value of the field 'City' is the same, then the two records are duplicate records".

In this case, the data quality of the values in the 'City' field is much more important than that of the 'Post' field. Therefore, the order this time is changed to: DT.15 → DT.32 → DT.20 → DT.27 → DT.17 → DT.28.

(3) Dealing with the multi-source process

In the group of 'multi-source error types', the cleaning process is similar to that for the group of 'single-source error types'. When multiple dirty data types are involved, the 'algorithm selection mechanism' is first applied to find a suitable algorithm for each dirty data type and then to group these dirty data types into the two sub-groups of 'automatic-removable errors' and 'non-automatic-removable errors'. In each sub-group, the 'algorithm ordering mechanism' is used to guide the sequence of the multiple algorithms involved. In this example, only DT.29 is involved in the multi-source process. According to the 'algorithm selection mechanism', the 'FullIndex' method is selected to detect the duplicate records across the multiple data sources, with the Jaro algorithm used to calculate the similarities of both name values. For any pair of detected duplicate records, a confirmation rule is defined as "for each pair of records detected, if the value of the field 'Post' is Student in one record, and the difference between the values in the field 'VST' is 4 or more years, then the two records are confirmed as duplicates". In this case, taking the effectiveness of the involved algorithm into account, the following order is made over the dirty data types: DT.15 → DT.33 → DT.20 → DT.17 → DT.29. When all cleaning activities are finished, the NHS data are ready for the final stage, the 'transformation process', where data are transformed into the expected formats according to the different requirements of the DW.

The case study in this section clearly shows the complexities involved in organizing multiple data cleaning activities during a data cleaning process. It shows that many factors need to be considered when specifying an order for the multiple cleaning activities involved, such as the nature of the selected algorithm, the cleaning rules and the computational cost.

A good order of data cleaning activities improves not only the efficiency of the data cleaning but also its effectiveness. Unfortunately, to the author's knowledge, none of the existing data cleaning approaches reviewed in chapter 2 has addressed this problem explicitly. Only one approach (ARKTOS) mentions organizing the multiple cleaning activities before the cleaning process. However, organizing the multiple cleaning activities in ARKTOS depends entirely on the user's individual preference, without specifying the many factors that the proposed framework does. Users who are not familiar with the different problem domains and the cleaning techniques involved may therefore choose a poor order of cleaning activities, which will result in low efficiency and low effectiveness during the data cleaning process.

Existing approaches such as AJAX and IntelliClean provide only a fixed cleaning process from 'data standardization' to 'duplicate elimination', without giving any further details for each step. Additionally, IntelliClean offers only one fixed solution for a given cleaning activity; for example, only the SNM algorithm is provided in IntelliClean to detect duplicate records. The 'algorithm selection mechanism' of the proposed framework is similar to the mechanism used at the physical level in AJAX. However, compared with AJAX, the proposed framework addresses more concerns regarding the selection of a suitable algorithm. In AJAX, in order to compare two records, users are required to supply several specifications, such as manually specifying the required algorithms (e.g., an approximate string matching algorithm) and manually setting the related parameters.

However, as is shown later in chapter 5, users who are not familiar with the selection of a suitable algorithm often make a wrong choice among the many alternatives, which results in poor timing and accuracy performance with respect to the final cleaning results. In the proposed framework, both timing and accuracy performance are considered by applying the two proposed mechanisms during the data cleaning process.

4.4 Conclusion

In this chapter, a novel data cleaning framework has been proposed, which aims to address the following issues: (i) minimising the data cleaning time and improving the degree of automation in data cleaning, and (ii) improving the effectiveness of data cleaning. Additionally, the proposed framework provides a function (the DDS process) to address the special case when individual business requirements are involved. This function helps a business take its special needs into account according to its business priority policy.

The proposed framework retains the most appealing characteristics of existing data cleaning approaches, and improves the efficiency and effectiveness during the data cleaning process. Compared with existing data cleaning approaches mentioned in chapter 2, the proposed framework provides several exclusive features which have not been addressed in those approaches.

Firstly, regarding the ability of a data cleaning approach to deal with the various dirty data types, the proposed framework tries to address as many dirty data types as possible according to the proposed taxonomy of dirty data. Existing approaches focus only on two kinds of data cleaning activities, i.e., data standardization and duplicate record elimination, and some tools, such as ARKTOS, focus on only one activity. Obviously, none of the existing tools in section 2.2 can provide an all-in-one solution to the problem posed in the case study in section 4.3.

Secondly, the proposed framework explicitly addresses the order of the various cleaning activities and provides an automatic solution to organize the sequence of these activities, i.e., the 'algorithm ordering mechanism'. None of the existing data cleaning approaches from chapter 2 has addressed this problem. In section 4.1, an order for six data cleaning activities proposed by Müller and Freytag is described. However, in the proposed case study, both orders generated by the 'algorithm ordering mechanism' differ from the one given by Müller and Freytag. For example, according to the order given by Müller and Freytag, dealing with missing values comes after format adaptation. In the proposed case study, dealing with missing values is performed before format adaptation, due to the consideration of the computational cost and effectiveness associated with the involved algorithms. To some extent, the order given by Müller and Freytag is a general order that maximizes the effectiveness of the data cleaning without considering the efficiency of the data cleaning process. The order proposed by the 'algorithm ordering mechanism' addresses both effectiveness and efficiency during the data cleaning process.

Finally, the proposed framework supplies the 'algorithm selection mechanism', which selects an appropriate algorithm with respect to the different factors involved, such as the problem domain, the error rate and the computational cost. In this way, both the effectiveness and the degree of automation are improved.

In the next chapter, experiments on the selection of a suitable approximate string matching algorithm are presented. The experimental results highlight the importance of selecting a suitable algorithm during the data cleaning process.


CHAPTER 5 EXPERIMENT AND EVALUATION

In chapter 4, the proposed data cleaning framework was presented. The special feature exclusively designed for the proposed framework is the introduction of two mechanisms into the data cleaning process. Regarding the selection of a suitable algorithm for each data cleaning activity, in existing data cleaning approaches a user is currently required to select an algorithm for a specific cleaning task manually. For example, Fig.5.1 shows the list of available approximate string matching algorithms that Febrl provides for the user to choose from during duplicate record detection.

Fig.5.1 Approximate string matching algorithms from Febrl

Compared with other existing tools, which adopt only one fixed method to deal with a cleaning task without considering the different problem domains, Febrl is an advance in that it provides multiple solutions for a cleaning task in each problem domain. Still, according to the review work in chapter 2, the disadvantage is clear: for users who do not have enough knowledge about these techniques, making a selection from multiple alternatives is a difficult task. Even when an algorithm such as an approximate string matching algorithm is selected for a specific task, how to set its related parameters is still unclear in existing approaches. For example, approximate string matching is an important step during the duplicate record elimination process, and setting a threshold value for the selected approximate string matching algorithm is an important part of the matching task.

Traditionally, in existing approaches such as AJAX, a universal threshold value is used for the selected approximate string matching algorithm, without regard for the problem domain or for characteristics of the dataset such as its error rate or size. As discussed, the threshold value is one of many factors that affect the effectiveness of the matching result, and many factors should be considered when setting its value. In order to improve the degree of automation as well as the effectiveness of data cleaning, the 'algorithm selection mechanism' is provided in the proposed data cleaning framework to help the user select a suitable algorithm for a specific cleaning task.

Although it is impossible to test thoroughly all existing algorithms associated with the various dirty data types introduced in the proposed taxonomy, a set of approximate string matching algorithms has been analyzed and evaluated on a number of carefully designed datasets. The experimental results confirm the statement that there is no clear best technique for all situations. Suggestions have been made which can be used as guidelines for researchers and practitioners when selecting an appropriate matching technique for a given dataset. Thus, both the effectiveness and the degree of automation of the data cleaning process are improved.


5.1 Introduction

String data is by nature prone to errors such as misspellings, different representations due to abbreviations, different word sequences and the use of alias name values. Therefore, expressions denoting a single entity may differ in word sequence, spelling, spacing, punctuation and use of abbreviations. For example, the same person's name can be referred to as 'John Smith', 'Smith, John' or 'J. Smith'. Records that describe the same entity may therefore differ syntactically. Such records can be found in a single dataset or when data are integrated from multiple data sources; they are termed 'Duplicate record in single data source' (DT.28) or 'Duplicate record in multi data source' (DT.29) in the proposed taxonomy of dirty data.

The problem of identifying such duplicate records in databases is an essential and challenging step in data cleaning and data integration. It has become a crucial problem as more and more data stored in database systems needs to be integrated for the purpose of supplying decision support with the help of database applications. In data cleaning, the task of dealing with duplicate records is addressed by record matching techniques, also known as merge-purge, data de-duplication and instance identification. Generally, record matching can be defined as the process of identifying records in the same or different databases that refer to the same real-world entity. From the literature, there are two types of record matching: structural heterogeneity related and lexical heterogeneity related.

Structural heterogeneity belongs to the schema-level problem, which refers to the problem of matching two databases with different domain structures. For example, the value of a customer's home address might be stored in a single attribute 'address' in one database but might be represented in another database with several attributes such as 'street', 'city' and 'postcode'. Lexical heterogeneity belongs to the instance-level problem, which refers to databases with similar structure but different representations of the data. For example, consider a personal name value: the name of a person may be represented as "John Smith" in a record from one data source and as "John Smtih", containing a misspelling, in a record from another data source. When data are integrated from the different data sources, the two records are treated as two different persons and a duplicate record problem occurs. In this research, the lexical heterogeneity problem is foremost, and it is assumed that the schema-level structural heterogeneity has been resolved a priori. As shown in this lexical heterogeneity example, if the two names are not treated as the same person, a duplicate error may be introduced when integrating the data from the two sources.

Names are important pieces of information when databases are de-duplicated. From the literature, name matching can be defined as the process of determining whether two name strings are instances of the same name. As mentioned in the beginning of this section, many reasons can cause name variations such as typos during data entry, use of different name formats, use of abbreviations. Thus, exact name comparison is not able to generate a good matching result. Rather, an approximate measure of how similar two names are is expected.

Name matching in databases has been a persistent and well-known problem for years [118]. From the literature, several techniques are available to deal with this problem [14, 119-123]. However, there is still no clear best technique for all situations [124], and how to select a technique for a given dataset remains a problem for researchers and practitioners [125]. In the past decade, several researchers have addressed this problem [124, 126-128]. However, none of them has undertaken a comprehensive analysis and comparison that considers the effect on accuracy and timing performance of the following factors: error rate, type of strings, type of typos and dataset size. An overview of this work is given below in section 5.2.


5.2 Related work

Bilenko et al evaluated and compared a few effective and widely used approximate string matching algorithms for matching strings' similarity [126]. Broadly, these algorithms can be classified into two categories namely „character-level algorithms‟ and „token-level algorithms‟.

Character-level algorithms are designed to handle typographical errors. However, it is often the case that typographical conventions lead to the rearrangement of words, e.g., "John Smith" vs. "Smith, John". In such cases, character-level algorithms fail to capture the similarity of the entities. Token-level algorithms are designed to compensate for this problem. Therefore, character-level algorithms are suitable for single-word matching, while token-level algorithms suit the matching of strings containing more than one word.

In Bilenko et al's work, five character-level algorithms and three token-level algorithms are used in the experimental work. Eleven different datasets were used, with sizes ranging from 38 to 5709 records. For each dataset, a single string formed by concatenating sub-strings from multiple fields is used to evaluate the matching effectiveness of the selected algorithm. Based on the experimental results, the authors claim that the Monge-Elkan algorithm performs best on average and SoftTF-IDF performs best overall. However, as pointed out by the authors, an individual algorithm's performance varies significantly when different datasets are considered. For example, on the 'Census dataset' the simple Levenshtein algorithm performs best, while it is the worst on average. Even methods that have been tuned and tested on many previous matching problems can perform poorly on new and different matching problems.

A further examination of the problem reveals that an estimate of the similarity between strings can vary significantly depending on the domain of each field under consideration. During the experiment, each record is treated as a single, long string field and the similarity is calculated based on the aligned strings from the different fields. Although the authors acknowledge that the different domains involved in the different fields are a cause of this variation, no further examination is carried out in their work. The authors confirm that the limitation of the selected algorithms is the absence of special knowledge of the specific problem at hand, and that the solution is to introduce some knowledge of the problem into the algorithm used; however, this observation is not expanded in detail in their work. In addition, regarding the threshold value used for the evaluation of an algorithm, only 'a suitable threshold value' is chosen for the test, as mentioned in the work. The work does not state what a suitable threshold value should be for each algorithm, nor whether the value is universal for all eight selected algorithms.

Christen [124] tested more algorithms and provided a comprehensive analysis and comparison of these algorithms specifically for personal names. Christen discussed in detail the characteristics of personal names as well as the potential sources of variation in personal names. A number of algorithms that can be used to match personal names are reviewed. The author evaluated both the accuracy and the timing performance of the selected algorithms, considering given names, surnames and full names respectively. Based on the experimental results, the author claims that no single algorithm performs better than all others, and nine useful recommendations regarding algorithm selection for the matching of personal names are proposed in this work. Table 5.1 presents these nine recommendations from Christen's work.


No.  Recommendation
1    It is important to know the type of names to be matched, and if these names have been properly parsed and standardized, or if the name data potentially contains several words with various separators.
2    If it is known that the name data at hand contains a large proportion of nicknames and similar name variations, a dictionary based name standardization should be applied before performing the matching.
3    Phonetic encoding followed by exact comparison of the phonetic codes should not be used. Pattern matching techniques result in much better matching quality.
4    For names parsed into separate fields, the Jaro and Winkler techniques seem to perform well for both given and surnames, as do uni- and bigrams.
5    The longest common sub-string technique is suitable for unparsed names that might contain swapped words.
6    Calculating a similarity measure with respect to the length of the shorter string (Overlap coefficient) seems to achieve better matching results (compared to using the Dice coefficient or Jaccard similarity).
7    The Winkler modification (increase similarity when name beginnings are the same) can be used with all techniques to improve matching quality.
8    A major issue is the selection of a threshold that results in optimal matching quality. Even small changes of the threshold can result in dramatic drops in matching quality. Without labelled training data, it is hard to find an optimal threshold value. Optimal threshold values will also vary between data sets.
9    If speed is important, it is imperative to use techniques with time complexity linear in the string length (like q-grams, Jaro, or Winkler), as otherwise name pairs made of long strings (especially unparsed full names) will slow down matching. Alternatively, filtering using bag distance followed by a more complex edit distance based approach can be used.

Table 5.1 Recommendations by Peter Christen

According to the author, there is no single best algorithm for name matching. In particular, the author points out the importance of choosing a suitable threshold value (see recommendation 8). It is argued that the selection of a proper threshold value is a difficult task; even small changes of the threshold value can result in dramatic drops in matching quality. Although the author believes that the characteristics of both the name data to be matched and the algorithms need to be considered when selecting an algorithm, a more detailed analysis of these characteristics is not undertaken but left as future work.

Hassanzadeh et al [128] presented an overview of eight approximate string matching algorithms and evaluated their effectiveness on several carefully designed datasets. Of the algorithms studied in this work, only one is a character-level algorithm, i.e., the Levenshtein algorithm; the other seven are token-level algorithms. As pointed out by the authors, the effectiveness of an algorithm depends heavily on the characteristics of the data, which is also noted by Christen.

Hassanzadeh et al designed a set of datasets for the experimental work, with the data in each dataset associated with a specific characteristic such as the error rate (the amount of errors) or the type of errors. The threshold value selection problem raised by Christen is studied in this work. According to the experimental results, the authors claim that the higher the error rate, the lower the threshold value should be set. Regarding the evaluation of the relative effectiveness of the algorithms, characteristics such as the data error rate and the error type are included in the evaluation. According to the authors, both the error rate and the error type influence the effectiveness of the selected algorithm. For example, according to the experimental results, when the error rate of a given dataset is high, the HMM algorithm is among the most effective and could be selected to perform the matching task. However, when the error rate is low, the effectiveness of the HMM algorithm becomes low and it should not be selected for the matching task.

Compared with the previous work, Hassanzadeh et al mainly focus on token-level algorithms. In addition, only two characteristics of the data are considered in their experimental work, i.e., the error rate and the error type, and all the datasets used for the experiments are of the same size. At least two questions therefore remain unclear: whether the conclusions about the effectiveness of the algorithms drawn by the authors from their experimental results also apply to character-level algorithms, and whether different dataset sizes influence the effectiveness of the selected algorithms.

In this chapter, five approximate string matching algorithms are reviewed in section 5.3 and their performance is evaluated against the following characteristics: the error rate in a dataset, the threshold value chosen, the type of strings in a dataset, the type of typo in the selected strings and the size of a dataset. The experiments are performed on a set of carefully designed datasets. In contrast to Hassanzadeh et al, the five algorithms used in this research are all character-level algorithms, in order to determine whether the conclusions made by Hassanzadeh et al also apply to character-level algorithms. The experimental results help with selecting a suitable algorithm for a name matching task, which can improve the efficiency and effectiveness of duplicate record detection.

5.3 Matching techniques

Name matching can be defined as "the process of determining whether two name strings are instances of the same name" [129]. In this research, the proposed experiments focus on single name values, and five popular character-level algorithms, namely 'Levenshtein', 'Smith-Waterman', 'Jaro', 'Jaro-Winkler' and 'Q-Gram', are selected for the experiments.

(i) Levenshtein

The Levenshtein distance [121] is defined to be the minimum number of edit operations required to transform string s1 into s2. Edit operations are:

 Delete a character from string s1.

 Insert a character in string s2 that does not appear in s1.

 Substitute one character in s2 for another character in s1.

 Copy one character from s1 to s2.

The distance (number of edits) between two strings s1 and s2 can be calculated using an efficient scheme for computing the lowest-cost edit sequence over these operations, which can be realized using dynamic programming. The Levenshtein similarity measure can then be calculated by:

    sim_Levenshtein(s1, s2) = 1 - dist(s1, s2) / max(|s1|, |s2|)

where dist(s1, s2) refers to the actual Levenshtein distance function, which returns 0 if the strings are the same or a positive number of edits if they are different. The value of the measure lies between 0.0 and 1.0, where the bigger the value, the more similar the two strings.
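A minimal sketch of the Levenshtein distance and the derived similarity measure described above, assuming unit cost for insert, delete and substitute operations and zero cost for a copy (matching character).

```python
# Dynamic-programming Levenshtein distance and its normalised similarity.
def levenshtein_distance(s1: str, s2: str) -> int:
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all remaining characters of s1
    for j in range(n + 1):
        d[0][j] = j                      # insert all remaining characters of s2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # delete
                          d[i][j - 1] + 1,         # insert
                          d[i - 1][j - 1] + cost)  # substitute / copy
    return d[m][n]

def levenshtein_similarity(s1: str, s2: str) -> float:
    if not s1 and not s2:
        return 1.0
    return 1.0 - levenshtein_distance(s1, s2) / max(len(s1), len(s2))

print(levenshtein_similarity("Smith", "Smtih"))  # 0.6 (two edits out of max length 5)
```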

(ii) Smith-Waterman

This algorithm was originally developed to find optimal local alignments between biological sequences such as DNA or proteins. It is based on a dynamic programming approach similar to the Levenshtein distance, but it allows gaps as well as character-specific match scores [119]. Let t be the final best score obtained from the dynamic programming matrix and g be the match score value. The Smith-Waterman similarity measure between two strings s1 and s2 is then calculated by:

    sim_Smith-Waterman(s1, s2) = t / (g × min(|s1|, |s2|))
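A minimal sketch of a Smith-Waterman style similarity; the scoring scheme (match score 2, mismatch penalty -1, gap penalty -1) is an illustrative assumption, and the result is normalised by the best score achievable for the shorter string, as in the measure above.

```python
# Smith-Waterman local alignment score, normalised to [0, 1].
def smith_waterman_similarity(s1: str, s2: str,
                              match: int = 2, mismatch: int = -1,
                              gap: int = -1) -> float:
    m, n = len(s1), len(s2)
    if m == 0 or n == 0:
        return 0.0
    h = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + score,  # align the two characters
                          h[i - 1][j] + gap,        # gap in s2
                          h[i][j - 1] + gap)        # gap in s1
            best = max(best, h[i][j])
    return best / (match * min(m, n))   # t / (g * min(|s1|, |s2|))

print(smith_waterman_similarity("Smith", "Smyth"))  # 0.7 with this scoring scheme
```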

(iii) Jaro

Jaro [120] introduced a string comparator that accounts for insertions, deletions and transpositions, which was mainly used for comparison of first and last names [13].

The basic Jaro algorithm for two strings s1 and s2 is:

1) compute the string lengths;
2) find the number of common characters in the two strings;
3) find the number of transpositions.

Given strings s = s1…sK and t = t1…tL, a character si in s is defined to be common with t if and only if there is a tj = si in t such that i - H ≤ j ≤ i + H, where H = min(|s|, |t|)/2. Let s' = s'1…s'K' be the characters in s which are common with t (in the same order in which they appear in s), and let t' = t'1…t'L' be the analogous characters of t. A transposition for s', t' is a position i such that s'i ≠ t'i. Let T(s', t') be half the number of transpositions for s' and t'. The Jaro similarity measure for strings s and t is then calculated by:

    Jaro(s, t) = 1/3 × ( |s'| / |s| + |t'| / |t| + (|s'| - T(s', t')) / |s'| )
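A minimal sketch of the Jaro comparator as described above, using the window definition H = min(|s|, |t|)/2 given in the text.

```python
# Jaro similarity: common characters within a window, half-transpositions.
def jaro_similarity(s: str, t: str) -> float:
    if not s or not t:
        return 0.0
    h = min(len(s), len(t)) // 2            # window size H from the text
    s_common, t_flags = [], [False] * len(t)
    for i, ch in enumerate(s):              # characters of s common with t
        for j in range(max(0, i - h), min(len(t), i + h + 1)):
            if not t_flags[j] and t[j] == ch:
                t_flags[j] = True
                s_common.append(ch)
                break
    t_common = [t[j] for j in range(len(t)) if t_flags[j]]
    if not s_common:
        return 0.0
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    c = len(s_common)
    return (c / len(s) + c / len(t) + (c - transpositions) / c) / 3

print(round(jaro_similarity("Smith", "Smtih"), 3))  # -> 0.933
```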

(iv) Jaro-Winkler

William Winkler proposed a variant of the Jaro metric based on empirical studies showing that fewer errors typically occur at the beginning of names [123]. The Jaro-Winkler similarity measure between two strings s1 and s2 is calculated by:

    Jaro-Winkler(s1, s2) = Jaro(s1, s2) + p × 0.1 × (1 - Jaro(s1, s2))

where p is the length of the longest common prefix of the two strings s1 and s2, up to a maximum of 4 characters.
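A minimal sketch of the Jaro-Winkler adjustment; it assumes the jaro_similarity function from the previous sketch is available in the same module, and uses the commonly quoted prefix scaling factor of 0.1.

```python
# Jaro-Winkler: boost the Jaro score when the name beginnings agree.
def jaro_winkler_similarity(s1: str, s2: str, scaling: float = 0.1) -> float:
    jaro = jaro_similarity(s1, s2)          # from the previous sketch
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:           # common prefix p, capped at 4 characters
            break
        prefix += 1
    return jaro + prefix * scaling * (1.0 - jaro)

print(round(jaro_winkler_similarity("Smith", "Smtih"), 3))  # -> 0.947
```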

(v) Q-Gram

The Q-Gram metric is based on the intuition that two strings are similar if they share a large number of common q-grams, where q-grams are sub-strings of length q [122]. Commonly used q-grams are unigrams (q = 1), bigrams (q = 2) and trigrams (q = 3). For example, the bigrams of 'John' are 'Jo', 'oh' and 'hn'. In this thesis, the q-gram similarity between two strings is calculated by counting the number of q-grams they have in common (i.e., q-grams contained in both strings) and dividing by the maximum number of q-grams in the two strings. Let Gq(s) denote the q-grams of a string s, obtained by sliding a window of length q over the characters of s. The Q-Gram similarity measure between strings s1 and s2 is then calculated by:

    sim_Q-Gram(s1, s2) = |Gq(s1) ∩ Gq(s2)| / max(|Gq(s1)|, |Gq(s2)|)
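A minimal sketch of the q-gram similarity described above, using bigrams (q = 2) and no padding characters.

```python
# Q-gram similarity: shared q-grams divided by the larger q-gram count.
def qgrams(s: str, q: int = 2) -> list:
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def qgram_similarity(s1: str, s2: str, q: int = 2) -> float:
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    if not g1 or not g2:
        return 0.0
    g2_pool = list(g2)
    common = 0
    for gram in g1:                 # count shared q-grams with multiplicity
        if gram in g2_pool:
            g2_pool.remove(gram)
            common += 1
    return common / max(len(g1), len(g2))

print(qgram_similarity("John", "Jonh"))  # 1 shared bigram ('Jo') out of 3 -> 0.333...
```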

5.4 Experiment and Experimental Results

Hassanzadeh et al [128] have undertaken a thorough comparison of token-level algorithms and claim that their accuracy highly depends on the characteristics of the data such as the amount and type of the errors and the length of strings. In our experiments, we also consider similar characteristics of the data but choose character-level algorithms for the test.

In this section, the experimental results on the performance of the five selected name matching techniques are presented. In particular, some of the recommendations proposed in table 5.1 are evaluated during the experiments. For example, the first recommendation from table 5.1 highlights the importance of the different types of name strings for the matching performance. Based on the experimental results, it has been noticed that the performance of a selected algorithm varies for different types of name strings (given name, surname and full name). In our experiments, two types of strings are considered, i.e., last name strings and first name strings.

Regarding the experiments on last name strings, eight groups of datasets with different sizes, ranging from 200 records to 9454 records, were carefully designed for the purpose of the experiments. In the experiments, the error rate of a dataset is defined as the ratio of erroneous records to the total number of records in the dataset. Three error rates are considered for each group of datasets, i.e., low, medium and high, with values of 20%, 50% and 70% respectively. For each size, three datasets with different error rates are used. For example, in the group of 9454-record last name datasets, three datasets were designed:

a) a low error rate last name dataset of 9454 records;
b) a medium error rate last name dataset of 9454 records;
c) a high error rate last name dataset of 9454 records.

With respect to the experiments on last name strings, the following factors are considered:

• Effects of error rates on the selection of threshold values for different techniques
• Effects of error rates on the performance of different techniques
• Effects of sizes of datasets on the performance of different techniques
• Effects of error rates on the timing performance of different techniques

The recommendations proposed in table 5.1 will be evaluated based on the experimental results. For example, in the fourth recommendation from table 5.1, the author proposes that Jaro and Winkler perform well for both given name strings and surname strings. However, since Christen's experimental work does not consider the different characteristics of a dataset as Hassanzadeh et al do, it is unclear whether this recommendation still holds when the various characteristics of a dataset are taken into account. Therefore, the fourth recommendation will be re-evaluated based on the experiments with different error rates and dataset sizes. In addition, the issue of selecting a threshold value for optimal matching quality, raised in the eighth recommendation in table 5.1, will be analyzed in detail.

Regarding the experiments on first name strings, it is expected that the experiments can help to answer the following questions:

a) Whether the type of typo will result in a different accuracy performance of the selected algorithms. Three types of typos are considered in the proposed experiments: (i) a typo occurring in the front part of a given string (marked as TFP), (ii) a typo occurring in the latter part of a given string (marked as TLP), and (iii) a typo occurring in any part of a given string at random (marked as TR).

b) Whether the type of string will result in a different performance of the selected algorithms. According to Christen, three types of name strings are used in his experimental work, and the accuracy performance of the same algorithm is noticed to vary for the different types of name string values; this is addressed exclusively in the first recommendation of table 5.1. In the proposed experiments, two types of strings are considered for the five character-level algorithms (i.e., first name strings and last name strings). Experiments are designed exclusively to compare the relative performance of each algorithm when different error rates of a dataset are combined with the two types of name strings respectively.


In addition, the effects of error rate on threshold value selection and on the accuracy performance of the five selected algorithms are also considered in the experiments on the first name datasets. Only one dataset size of 2300 records is used for the first name datasets. As in the experiments on the last name string values, the three error rates associated with the first name datasets are 20%, 50% and 70% respectively. It is expected that, based on the experimental results, a recommendation can be made as to which technique should be selected in order to achieve the best matching quality when different name strings are considered under different dataset characteristics.

5.4.1 Datasets preparation

In the absence of common datasets for data cleaning, we prepared our data for the experiments as follows. With respect to the last name strings, the datasets are based on a historical set of real Electoral Roll data. First, a one million record dataset was extracted, from which a personal last name list was created. This list contains 9454 clean, non-duplicate personal last names. Then, a last name dataset was generated which contains these 9454 last name records, with an ID number associated with each record. Erroneous records were created by manually applying the following operations to the name field of records in the dataset: inserting, deleting, substituting and replacing characters. In total, twenty-four datasets were generated, and the number of records in these last name datasets ranges from 200 to 9454. Last name datasets containing the same records differ only in the error rate associated with them. Table 5.2 shows the datasets used for the last name experiments.


Dataset        Error Rate
9454 records   Low, Medium, High
7154 records   Low, Medium, High
5000 records   Low, Medium, High
3600 records   Low, Medium, High
2300 records   Low, Medium, High
1000 records   Low, Medium, High
500 records    Low, Medium, High
200 records    Low, Medium, High

Table 5.2 Datasets for the last name experiments

With respect to the first name strings, 9 first name datasets were carefully designed, each of which contains 2300 records. These datasets are also based on a historical set of real Electoral Roll data. First, a one million record dataset was extracted, from which a personal first name list was created. This list contains 2300 clean, non-duplicate personal first names. Then, a first name dataset was generated containing these 2300 first name records, with an ID number associated with each record. Erroneous records were created by applying four operations manually to the name field of records in the dataset: inserting, deleting, substituting and replacing characters. The three groups of first name datasets contain the three different types of typos, i.e., TFP, TLP and TR. Datasets containing the same type of typo were generated with different error rates. Table 5.3 shows these first name datasets.
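A minimal sketch of how such erroneous records could be generated is shown below. The function names are hypothetical, and the interpretation of "substituting" as single-character substitution and "replacing" as an adjacent-character transposition is an assumption made only for illustration.

```python
import random
import string

def inject_typo(name, typo_class="TR", seed=None):
    """Apply one of four edit operations to `name`, restricting the typo position
    to the front part (TFP), the latter part (TLP), or any position (TR)."""
    rng = random.Random(seed)
    if len(name) < 2:
        return name                              # too short to corrupt meaningfully
    half = len(name) // 2
    if typo_class == "TFP":                      # typo in the front part of the string
        pos = rng.randrange(0, half)
    elif typo_class == "TLP":                    # typo in the latter part of the string
        pos = rng.randrange(half, len(name))
    else:                                        # TR: typo at a random position
        pos = rng.randrange(0, len(name))
    op = rng.choice(["insert", "delete", "substitute", "replace"])
    c = rng.choice(string.ascii_lowercase)
    if op == "insert":
        return name[:pos] + c + name[pos:]
    if op == "delete":
        return name[:pos] + name[pos + 1:]
    if op == "substitute":                       # one character changed to another
        return name[:pos] + c + name[pos + 1:]
    j = min(pos + 1, len(name) - 1)              # "replace": swap two adjacent characters
    return name[:pos] + name[j] + name[pos] + name[j + 1:]

def corrupt_dataset(names, error_rate, typo_class="TR", seed=0):
    """Corrupt roughly `error_rate` of the records, keeping an ID per record."""
    rng = random.Random(seed)
    records = []
    for rec_id, name in enumerate(names):
        dirty = rng.random() < error_rate
        records.append((rec_id, inject_typo(name, typo_class, rec_id) if dirty else name))
    return records
```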


Dataset                   Error Rate   Type of typo
2300 first name records   Low          TFP
2300 first name records   Low          TLP
2300 first name records   Low          TR
2300 first name records   Medium       TFP
2300 first name records   Medium       TLP
2300 first name records   Medium       TR
2300 first name records   High         TFP
2300 first name records   High         TLP
2300 first name records   High         TR

Table 5.3 First name datasets with different types of typos

For the purpose of comparison with the last name string values, 9 similar last name datasets were also designed. Datasets containing the same type of typo were generated with different error rates. Table 5.4 shows these last name datasets.

Dataset                  Error Rate   Type of typo
2300 last name records   Low          TFP
2300 last name records   Low          TLP
2300 last name records   Low          TR
2300 last name records   Medium       TFP
2300 last name records   Medium       TLP
2300 last name records   Medium       TR
2300 last name records   High         TFP
2300 last name records   High         TLP
2300 last name records   High         TR

Table 5.4 Last name datasets with different types of typos


5.4.2 Measures

A target string is a positive if it is returned by a technique; otherwise it is a negative. A positive is a true positive (TP) if the match does in fact denote the same entity; otherwise it is a false positive (FP). A negative is a false negative (FN) if the unmatched pair does in fact denote the same entity; otherwise it is a true negative (TN). The matching quality is evaluated using the F-measure (F), which is based on precision and recall:

F = 2PR / (P + R)

with P (precision) and R (recall) defined as:

P = TP / (TP + FP),  R = TP / (TP + FN)

The most desirable algorithm is one that makes recall as high as possible without sacrificing precision. The F-measure combines recall and precision into a single measure of overall performance [130]. In the proposed experiments, precision, recall and F-measure are measured against different values of the similarity threshold θ. For the comparison of different techniques, the maximum F-measure score across the different threshold values is used.
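A minimal sketch of these measures is given below, assuming each candidate pair is represented as a similarity score together with a ground-truth label; the helper names and the threshold grid are illustrative (the grid mirrors the threshold values that appear in Tables 5.18~5.20).

```python
# `candidates` is assumed to be a list of (score, is_true_match) pairs, where
# `score` is the similarity returned by a technique and `is_true_match` says
# whether the pair really denotes the same entity.

def precision_recall_f(candidates, theta):
    tp = sum(1 for score, truth in candidates if score >= theta and truth)
    fp = sum(1 for score, truth in candidates if score >= theta and not truth)
    fn = sum(1 for score, truth in candidates if score < theta and truth)
    p = tp / (tp + fp) if tp + fp else 0.0          # precision
    r = tp / (tp + fn) if tp + fn else 0.0          # recall
    f = 2 * p * r / (p + r) if p + r else 0.0       # F-measure
    return p, r, f

def max_f_score(candidates, thresholds=(0.75, 0.8, 0.85, 0.9, 0.95, 0.99)):
    """Maximum F-measure over a grid of threshold values, as used for comparison."""
    return max(precision_recall_f(candidates, t)[2] for t in thresholds)
```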

5.4.3 Experimental results

5.4.3.1 Experimental results for Last name strings

(1) Effectiveness performance results

Regarding the effectiveness performance, the values of the maximum F-scores for the different algorithms on the different last name datasets are presented in Appendix A (Table A.1). Fig.5.2~Fig.5.9 present the corresponding results. For all graphs, the horizontal axis represents the techniques involved and the vertical axis represents the maximum F-score achieved by each algorithm. Each figure is explained in detail below:

Fig.5.2 shows the maximum F-scores achieved by the five name matching algorithms on the 9454-record last name datasets under the three different error rates. It is clear from Fig.5.2 that the lower the error rate, the higher the maximum F-score achieved by all five name matching algorithms. When the error rate of the dataset is low, the Levenshtein, Jaro and Q-Gram algorithms are equally effective, followed by the Jaro-Winkler algorithm; the Smith-Waterman algorithm is the worst of the five. When the error rate changes to medium, the Jaro algorithm becomes the best, followed by the Jaro-Winkler algorithm. However, when the error rate is high, the Jaro-Winkler algorithm performs better than the Jaro algorithm: it is the best of the five, followed by the Levenshtein algorithm. Table 5.5 shows the relative order of the five algorithms by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman

Table 5.5 Algorithms’ order for 9454 last name dataset
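As a small illustration of how such order tables can be produced, the sketch below ranks algorithms by their maximum F-scores and groups ties with "="; the function name and the assumption that scores are compared after rounding to two decimal places are illustrative only.

```python
def effectiveness_order(max_f_by_algorithm, decimals=2):
    """Return a Table 5.5-style ordering string from a mapping of
    algorithm name -> maximum F-score (ties judged after rounding)."""
    ranked = sorted(max_f_by_algorithm.items(),
                    key=lambda kv: round(kv[1], decimals), reverse=True)
    out = ranked[0][0]
    for (_, prev_f), (name, f) in zip(ranked, ranked[1:]):
        out += ("=" if round(f, decimals) == round(prev_f, decimals) else ">") + name
    return out

# Example with hypothetical scores:
# effectiveness_order({"Levenshtein": 0.95, "Jaro": 0.95, "Q-Gram": 0.95,
#                      "Jaro-Winkler": 0.93, "Smith-Waterman": 0.80})
# -> 'Levenshtein=Jaro=Q-Gram>Jaro-Winkler>Smith-Waterman'
```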



Fig.5.2 Effectiveness results for 9454 last name dataset

Fig.5.3 and Fig.5.4 present the performance of the five name matching algorithms on the 7154-record and 5000-record last name datasets respectively. The algorithms behave identically on both dataset sizes. As on the 9454-record datasets, the lower the error rate, the higher the maximum F-scores achieved by all five algorithms. When the error rate is low, the relative performance of the five algorithms is exactly the same as on the 9454-record dataset. When the error rate changes to medium or high, the Jaro-Winkler algorithm becomes the best, followed by the Jaro algorithm. Table 5.6 shows the relative order of the five algorithms by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman

Table 5.6 Algorithms’ order for 7154/5000 last name dataset



Fig.5.3 Effectiveness results for 7154 last name dataset


Fig.5.4 Effectiveness results for 5000 last name dataset

Fig.5.5 shows the performance of the five name matching algorithms on the 3600-record last name datasets. Compared with their performance on the 7154-record and 5000-record datasets, the only difference is observed on the medium and high error rate datasets, where the Jaro algorithm, rather than the Jaro-Winkler algorithm, performs best. Table 5.7 shows the relative order of the five algorithms by maximum F-score for the three error rates.


Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
High         Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman

Table 5.7 Algorithms’ order for 3600 last name dataset


Fig.5.5 Effectiveness results for 3600 last name dataset

Fig.5.6 shows the performance of the five name matching algorithms on the 2300-record last name datasets. When the error rate is low, the algorithms perform exactly as they did on the 3600-record datasets. When the error rate changes to medium, the Jaro algorithm becomes the best, followed by the Levenshtein algorithm. When the error rate is high, the Jaro-Winkler algorithm becomes the most effective of the five. Table 5.8 shows the relative order of the five algorithms by maximum F-score for the three error rates.


Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro > Levenshtein > Jaro-Winkler > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman

Table 5.8 Algorithms’ order for 2300 last name dataset


Fig.5.6 Effectiveness results for 2300 last name dataset

Fig.5.7 shows the performance of the five name matching algorithms on the 1000-record last name datasets. Compared with the 2300-record datasets, the performance is exactly the same except that, on the high error rate dataset, the Levenshtein algorithm performs better than the Jaro algorithm. Table 5.9 shows the relative order of the five algorithms by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro = Q-Gram = Jaro-Winkler > Smith-Waterman
Medium       Jaro > Levenshtein > Jaro-Winkler > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman

Table 5.9 Algorithms’ order for 1000 last name dataset



Fig.5.7 Effectiveness results for 1000 last name dataset

Fig.5.8 shows the performance of the five name matching algorithms on the 500-record last name datasets. When the error rate is low, the Levenshtein and Jaro-Winkler algorithms are jointly the best of the five, followed by the Jaro and Q-Gram algorithms, which perform equally. When the error rate changes to medium, the Jaro-Winkler algorithm becomes the best, followed by the Levenshtein algorithm. On the high error rate dataset, however, the Jaro algorithm becomes the best of the five. Table 5.10 shows the relative order of the five algorithms by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro-Winkler > Jaro = Q-Gram > Smith-Waterman
Medium       Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman
High         Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman

Table 5.10 Algorithms’ order for 500 last name dataset



Fig.5.8 Effectiveness results for 500 last name dataset

Fig.5.9 shows the performance of the five name matching algorithms on the 200-record last name datasets. It is clear from Fig.5.9 that, apart from the Levenshtein algorithm, the lower the error rate, the higher the maximum F-score achieved by the other four name matching algorithms. The Levenshtein algorithm achieves its highest maximum F-score on the low error rate dataset, followed by the high error rate dataset rather than the medium error rate dataset. When the error rate is low, the Levenshtein, Jaro, Jaro-Winkler and Q-Gram algorithms perform equally well, and the Smith-Waterman algorithm performs worst. When the error rate changes to medium or high, the Levenshtein algorithm becomes the best of the five. Table 5.11 shows the relative order of the five algorithms by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro-Winkler = Jaro = Q-Gram > Smith-Waterman
Medium       Levenshtein > Q-Gram > Jaro > Jaro-Winkler > Smith-Waterman
High         Levenshtein > Q-Gram > Jaro > Jaro-Winkler > Smith-Waterman

Table 5.11 Algorithms’ order for 200 last name dataset



Fig.5.9 Effectiveness results for 200 last name dataset

Based on these experimental results regarding the effectiveness of the five name matching algorithms on the different last name datasets, it can be concluded that the algorithms generally perform better on lower error rate datasets. On all last name datasets with a low error rate, the Levenshtein algorithm remains among the most effective of the five. On the medium and high error rate datasets, except for those with 200 records, the Jaro or Jaro-Winkler algorithm remains the best choice. The Smith-Waterman algorithm performs the worst of the five throughout. Regarding the selection of a threshold value for each algorithm, the thresholds at which each algorithm obtains its maximum F-score on the different last name datasets are shown in Appendix A (Table A.2). As shown in Table A.2, the higher the error rate of the dataset, the lower the threshold value that should generally be chosen for an algorithm. For example, the three threshold values selected for the Levenshtein algorithm are 0.99, 0.85 and 0.8 for the low, medium and high error rate datasets respectively when the dataset contains more than 500 records. These experimental results are analyzed further in section 5.5.
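The thresholds above are applied to similarity scores in the range [0, 1]. As a hedged illustration only (the exact similarity functions and normalisation used in the experiments are defined earlier in the chapter), the sketch below implements a common normalised Levenshtein similarity and shows how a score compares with the quoted thresholds; the function name is illustrative.

```python
def levenshtein_similarity(a, b):
    """Edit distance normalised by the longer string length, giving a score in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))                   # classic dynamic-programming table row
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,             # deletion
                            curr[j - 1] + 1,         # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

# e.g. levenshtein_similarity("smith", "smyth") == 0.8, which would count as a
# match at a threshold of 0.8 but not at 0.85 or 0.99.
```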

(2) Timing performance results

In general, the Jaro-Winkler algorithm requires the least running time of the five algorithms while the Smith-Waterman algorithm requires the most. The time required by the Jaro algorithm is slightly more than that required by the Jaro-Winkler algorithm; in terms of timing, both are much better than the other three algorithms. The experimental results confirm that the smaller the dataset, the less running time an algorithm requires. According to the experimental results, the effect of the error rate of a dataset on the timing performance of each algorithm is not significant. Table A.3 in Appendix A shows the average timing cost of the five algorithms on the different dataset sizes (9454, 7154, 5000 and 3600). The corresponding figures are shown in Fig.5.10, Fig.5.11, Fig.5.12 and Fig.5.13. For all graphs, the horizontal axis represents the algorithms involved and the vertical axis represents the timing cost in milliseconds. For all datasets involved, the same order from least to greatest timing cost is observed (Jaro-Winkler < Jaro < Levenshtein < Q-Gram < Smith-Waterman). Individually, Fig.5.10 shows that the higher the error rate of a dataset, the higher the timing cost of an algorithm; in Fig.5.11 this is observed only for the Levenshtein and Q-Gram algorithms.
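A minimal sketch of how such timing figures could be collected is given below; the helper names are hypothetical and the similarity functions are assumed to be supplied by the caller, since the exact implementations used in the experiments are not reproduced here.

```python
import time

def time_algorithm(similarity, queries, targets):
    """Run `similarity` over every (query, target) pair and return the wall-clock
    time in milliseconds, mirroring how the timing costs above are reported."""
    start = time.perf_counter()
    for q in queries:
        for t in targets:
            similarity(q, t)
    return (time.perf_counter() - start) * 1000.0

def benchmark(algorithms, queries, targets):
    """`algorithms` maps an algorithm name to a similarity function taking two strings."""
    return {name: time_algorithm(fn, queries, targets)
            for name, fn in algorithms.items()}
```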


Fig.5.10 Timing performance in 9454 last name dataset


As shown in Fig.5.11, the least timing cost for the Jaro and Jaro-Winkler algorithms is observed on the medium error rate 7154-record dataset, followed by the low error rate dataset. For the Smith-Waterman algorithm, the least timing cost is also observed on the medium error rate 7154-record dataset, followed by the high error rate dataset.


Fig.5.11 Timing performance in 7154 last name dataset

In Fig.5.12, except for the Levenshtein algorithm, the higher the error rate of a dataset, the higher the timing cost required by the other four algorithms. The Levenshtein algorithm requires the most time on the medium error rate dataset.


Fig.5.12 Timing performance in 5000 last name dataset


In Fig.5.13, except for the Smith-Waterman algorithm, the higher the error rate of a dataset, the higher the timing cost required by the other four algorithms. The Smith-Waterman algorithm requires the most time on the medium error rate dataset.


Fig.5.13 Timing performance in 3600 last name dataset

5.4.3.2 Experimental results for 2300 First name/ Last name strings

(1) Effectiveness performance results

The values of the maximum F-scores achieved by the five algorithms on the different first name datasets with TFP, TLP and TR typos are shown in Appendix A (Table A.4). The corresponding graphs are presented in Fig.5.14~Fig.5.16. For all graphs, the horizontal axis represents the algorithms involved and the vertical axis represents the maximum F-score achieved by each algorithm.

In detail, Fig.5.14 shows the performance of the five name matching algorithms on the 2300-record first name datasets with TFP typos under the three different error rates. It is clear from Fig.5.14 that all algorithms perform better on the low error rate dataset than on the medium and high error rate datasets. The relative order of the five algorithms by maximum F-score for the three error rates is given in table 5.12, which shows that the Levenshtein algorithm performs best of the five on the datasets with TFP typos, followed by the Jaro algorithm; the Smith-Waterman algorithm is the worst of the five.


Fig.5.14 Effectiveness results for 2300 first name datasets with TFP typo

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein > Jaro > Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Levenshtein > Jaro > Jaro-Winkler > Q-Gram > Smith-Waterman
High         Levenshtein > Jaro > Jaro-Winkler > Q-Gram > Smith-Waterman

Table 5.12 Algorithm’s order for 2300 first name datasets with TFP typo

Fig.5.15 shows the performance of the five name matching algorithms on the 2300-record first name datasets with TLP typos under the three different error rates. The relative order of the five algorithms by maximum F-score for the three error rates is given in table 5.13. Unlike on the first name datasets with TFP typos, the Levenshtein algorithm is not the best of the five; the Jaro and Jaro-Winkler algorithms are the best choices across the three error rate datasets.



Fig.5.15 Effectiveness results for 2300 first name datasets with TLP typo

Error rate   Relative effectiveness order among the five algorithms
Low          Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
Medium       Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Levenshtein > Q-Gram > Jaro > Smith-Waterman

Table 5.13 Algorithm’s order for 2300 first name datasets with TLP typo

Fig.5.16 shows the performance of the five name matching algorithms on the 2300-record first name datasets with TR typos under the three different error rates. On the low error rate dataset, the Levenshtein algorithm performs best of the five. On the medium and high error rate datasets, the Jaro-Winkler and Jaro algorithms are the best choices respectively. Table 5.14 shows the relative order by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein > Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
High         Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman

Table 5.14 Algorithm’s order for 2300 first name datasets with TR typo



Fig.5.16 Accuracy results for 2300 first name datasets with TR typo

For the purpose of comparison with the 2300-record first name datasets, these algorithms have also been applied to 2300-record last name datasets with the three different types of typos. The experimental results are presented in Appendix A (Table A.5) and the corresponding graphs in Fig.5.17~Fig.5.19. For all graphs, the horizontal axis represents the algorithms involved and the vertical axis represents the maximum F-score achieved by the five algorithms.

In detail, Fig.5.17 shows the performance of the five name matching algorithms on the 2300-record last name datasets with TFP typos under the three different error rates. All algorithms perform best on the low error rate dataset. Table 5.15 shows the relative order by maximum F-score for the three error rates.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein > Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro > Levenshtein > Jaro-Winkler > Q-Gram > Smith-Waterman
High         Levenshtein > Jaro-Winkler > Jaro > Q-Gram > Smith-Waterman

Table 5.15 Algorithm’s order for 2300 last name datasets with TFP typo


Compared with the results for matching the 2300-record first name datasets, on the medium error rate dataset the Jaro algorithm, rather than the Levenshtein algorithm, is the best choice.


Fig.5.17 Accuracy results for 2300 last name datasets with TFP typo

Fig.5.18 shows the performance of the five name matching algorithms on the 2300-record last name datasets with TLP typos. Compared with matching the 2300-record first name datasets with TLP typos, the algorithms behave exactly the same on the medium and high error rate datasets (see table 5.16).

Error rate   Relative effectiveness order among the five algorithms
Low          Jaro-Winkler > Levenshtein > Jaro = Q-Gram > Smith-Waterman
Medium       Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman

Table 5.16 Algorithm’s order for 2300 last name datasets with TLP typo



Fig.5.18 Accuracy results for 2300 last name datasets with TLP typo

Fig.5.19 shows the performance of the five name matching algorithms on the 2300-record last name datasets with TR typos. According to the relative orders by maximum F-score for the three error rates (table 5.17), the best choices among the five algorithms differ markedly from those for matching the 2300-record first name datasets. For example, on the medium error rate dataset the Jaro algorithm performs best, whereas on the first name datasets the Jaro-Winkler algorithm is the best choice.

Error rate   Relative effectiveness order among the five algorithms
Low          Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
Medium       Jaro > Levenshtein > Jaro-Winkler > Q-Gram > Smith-Waterman
High         Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman

Table 5.17 Algorithm’s order for 2300 last name datasets with TR typo



Fig.5.19 Accuracy results for 2300 last name datasets with TR typo

Table 5.18~Table 5.20 show the different threshold values selected for each algorithm to obtain the maximum F-scores in the different 2300 first/last name datasets with TFP, TLP, and TR typos respectively.

String Type   Algorithm        Low    Medium   High   Data Size   Type of typo
First name    Levenshtein      0.9    0.8      0.8    2300        TFP
First name    Jaro             0.95   0.95     0.9    2300        TFP
First name    Jaro-Winkler     0.99   0.95     0.95   2300        TFP
First name    Q-Gram           0.99   0.99     0.75   2300        TFP
First name    Smith-Waterman   0.99   0.99     0.99   2300        TFP
Last name     Levenshtein      0.9    0.85     0.8    2300        TFP
Last name     Jaro             0.99   0.95     0.95   2300        TFP
Last name     Jaro-Winkler     0.99   0.95     0.95   2300        TFP
Last name     Q-Gram           0.99   0.99     0.8    2300        TFP
Last name     Smith-Waterman   0.99   0.99     0.99   2300        TFP

Table 5.18 Threshold value selection for first/last name dataset with TFP typos


String Type   Algorithm        Low    Medium   High   Data Size   Type of typo
First name    Levenshtein      0.9    0.8      0.8    2300        TLP
First name    Jaro             0.95   0.95     0.9    2300        TLP
First name    Jaro-Winkler     0.99   0.95     0.95   2300        TLP
First name    Q-Gram           0.99   0.85     0.8    2300        TLP
First name    Smith-Waterman   0.99   0.99     0.99   2300        TLP
Last name     Levenshtein      0.9    0.85     0.8    2300        TLP
Last name     Jaro             0.99   0.95     0.95   2300        TLP
Last name     Jaro-Winkler     0.99   0.95     0.95   2300        TLP
Last name     Q-Gram           0.99   0.99     0.8    2300        TLP
Last name     Smith-Waterman   0.99   0.99     0.99   2300        TLP

Table 5.19 Threshold value selection for first/last name dataset with TLP typos

String Type   Algorithm        Low    Medium   High   Data Size   Type of typo
First name    Levenshtein      0.9    0.8      0.8    2300        TR
First name    Jaro             0.99   0.9      0.9    2300        TR
First name    Jaro-Winkler     0.99   0.95     0.95   2300        TR
First name    Q-Gram           0.99   0.99     0.75   2300        TR
First name    Smith-Waterman   0.9    0.85     0.85   2300        TR
Last name     Levenshtein      0.99   0.85     0.8    2300        TR
Last name     Jaro             0.99   0.95     0.9    2300        TR
Last name     Jaro-Winkler     0.99   0.95     0.95   2300        TR
Last name     Q-Gram           0.99   0.99     0.75   2300        TR
Last name     Smith-Waterman   0.9    0.9      0.9    2300        TR

Table 5.20 Threshold value selection for first/last name dataset with TR typos

It can be deduced from the experimental data in table 5.18~table 5.20 that characteristics such as the type of string, the type of typo and the error rate may all influence the selection of a proper threshold value for the selected algorithm to achieve the best effectiveness performance. They are evaluated further in section 5.5.
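As an illustration of how the values in Tables 5.18~5.20 could be packaged for reuse, the sketch below reproduces only the Levenshtein rows as a lookup keyed by string type, typo type and error rate; the structure and names are hypothetical, and a full mapping would cover all five algorithms.

```python
# Threshold values copied from the Levenshtein rows of Tables 5.18-5.20.
LEVENSHTEIN_THRESHOLD = {
    # (string type, typo type, error rate): threshold giving the maximum F-score
    ("first", "TFP", "low"): 0.9,  ("first", "TFP", "medium"): 0.8,  ("first", "TFP", "high"): 0.8,
    ("first", "TLP", "low"): 0.9,  ("first", "TLP", "medium"): 0.8,  ("first", "TLP", "high"): 0.8,
    ("first", "TR",  "low"): 0.9,  ("first", "TR",  "medium"): 0.8,  ("first", "TR",  "high"): 0.8,
    ("last",  "TFP", "low"): 0.9,  ("last",  "TFP", "medium"): 0.85, ("last",  "TFP", "high"): 0.8,
    ("last",  "TLP", "low"): 0.9,  ("last",  "TLP", "medium"): 0.85, ("last",  "TLP", "high"): 0.8,
    ("last",  "TR",  "low"): 0.99, ("last",  "TR",  "medium"): 0.85, ("last",  "TR",  "high"): 0.8,
}

def suggested_threshold(string_type, typo_type, error_rate):
    """Look up the threshold suggested for the Levenshtein algorithm."""
    return LEVENSHTEIN_THRESHOLD[(string_type, typo_type, error_rate)]
```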

(2) Timing performance results

The timing performance of the five algorithms on these first name datasets is exactly the same as on the last name datasets: the Jaro-Winkler algorithm requires the least running time of the five, while the Smith-Waterman algorithm requires the most. The running time of the Jaro algorithm is slightly greater than that of the Jaro-Winkler algorithm, and both are faster than the other three algorithms. The experimental results show that the effect of the error rate of a dataset on the timing performance is not significant.

5.5 Evaluation

5.5.1 Last name experimental results evaluation

The test results for last name datasets will be evaluated and analyzed based on the effectiveness and timing performance of the five selected techniques.

Similar experiments have been carried out on last name datasets ranging from 200 to 9454 records. The results show that, in general, the effectiveness achieved at a given threshold value is not sensitive to the size of the dataset once it exceeds 1000 records, except for the Smith-Waterman algorithm.

When the size of a dataset is smaller than 1000 records, the threshold value giving the best F-score varies across datasets with different error rates. The corresponding experimental results are given in table A.2 in Appendix A.

In particular, the figures in Appendix A (Fig.A.1~Fig.A.5) present the effectiveness of the five algorithms relative to the threshold value on the 3600-record last name datasets with different error rates. For all graphs, the horizontal axis shows the threshold values. From the experimental results, the following observations are made:


1) Effect of Error Rates on Threshold Values:

As shown in table A.2 in Appendix A, for all techniques the higher the error rate in the dataset, the lower the threshold value generally required to achieve the best effectiveness performance. For example, the Jaro algorithm performs best on the high error rate dataset at a threshold of 0.9, while it performs best on the medium and low error rate datasets at thresholds of 0.95 and 0.99 respectively, when the dataset contains more than 1000 records. It is therefore recommended that, for algorithms such as Levenshtein, Jaro, Jaro-Winkler and Q-Gram, the higher the error rate, the lower the threshold value that should be selected.

2) Effect of the Sizes of Datasets on Threshold value:

Table A.2 presents the different threshold values selected for each algorithm on the eight groups of last name datasets. In general the selected threshold value is not sensitive to the size of the dataset, except for the Smith-Waterman algorithm. For example, the three threshold values selected for the Jaro algorithm on the 1000-record last name datasets are 0.99, 0.95 and 0.9 for the low, medium and high error rate datasets respectively, and these values remain the same as the dataset size increases up to 9454 records. For the Smith-Waterman algorithm, however, the selection of a threshold value is quite sensitive to the dataset size. For example, on the high error rate datasets, the threshold selected for the Smith-Waterman algorithm is 0.9 for 9454 records, changes to 0.99 for 7154 records, and changes back to 0.9 for 5000 records.

3) Effect of Error Rates on Effectiveness Performance:

For all eight groups of last name datasets, the experimental results show that, in general, all five algorithms perform better on low error rate datasets. For example, Fig.5.20 shows the performance of the five algorithms on the 3600-record datasets under the three different error rates; the performance of all five algorithms decreases as the error rate increases. The only exception is observed on the 200-record datasets, where the Levenshtein algorithm performs better on the high error rate dataset than on the medium error rate dataset. Fig.5.21 shows that on the 200-record last name datasets the Levenshtein algorithm performs best on the low error rate dataset, followed by the high error rate dataset.

Fig.5.20 Maximum F score comparison on last name datasets size 3600

Fig.5.21 Maximum F score comparison on last name datasets size 200


Looking at the performance of the individual algorithms, it was observed that the Jaro and Jaro-Winkler algorithms do not always perform best for last name strings, as suggested by Christen [124]. Of the 24 last name datasets prepared for the experiments, only 9 are observed on which both the Jaro and Jaro-Winkler algorithms perform well for last name strings. On the contrary, on most of the low error rate datasets the Jaro-Winkler algorithm ranked fourth among the five algorithms. It is therefore recommended that, when selecting a suitable algorithm for name matching, the characteristics of the dataset, such as its size and error rate, should be considered.

4) Effect of the Sizes of Datasets on Effectiveness Performance:

By comparing the relative performance of the five algorithms in table A.1, their effectiveness on the last name datasets is analyzed under the three different error rates. The results are listed in the following tables (table 5.21~table 5.23):

Dataset   Relative effectiveness order among the five algorithms
9454      Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
7154      Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
5000      Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
3600      Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
2300      Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
1000      Levenshtein = Jaro = Q-Gram > Jaro-Winkler > Smith-Waterman
500       Levenshtein = Jaro-Winkler > Jaro = Q-Gram > Smith-Waterman
200       Levenshtein = Jaro-Winkler = Jaro = Q-Gram > Smith-Waterman

Table 5.21 Algorithms’ order for low error rate last name datasets


Dataset   Relative effectiveness order among the five algorithms
9454      Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
7154      Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
5000      Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
3600      Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
2300      Jaro > Levenshtein > Jaro-Winkler > Q-Gram > Smith-Waterman
1000      Jaro > Levenshtein > Jaro-Winkler > Q-Gram > Smith-Waterman
500       Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman
200       Levenshtein > Q-Gram > Jaro > Jaro-Winkler > Smith-Waterman

Table 5.22 Algorithms’ order for medium error rate last name datasets

Dataset   Relative effectiveness order among the five algorithms
9454      Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman
7154      Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
5000      Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
3600      Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
2300      Jaro-Winkler > Jaro > Levenshtein > Q-Gram > Smith-Waterman
1000      Jaro-Winkler > Levenshtein > Jaro > Q-Gram > Smith-Waterman
500       Jaro > Jaro-Winkler > Levenshtein > Q-Gram > Smith-Waterman
200       Levenshtein > Q-Gram > Jaro > Jaro-Winkler > Smith-Waterman

Table 5.23 Algorithms’ order for high error rate last name datasets

Table 5.21 shows that on the low error rate datasets the best-performing algorithm among the five is not particularly sensitive to the size of the dataset: for low error rate datasets with more than 1000 records, Levenshtein, Jaro and Q-Gram perform equally well. On the medium error rate datasets, however, the best choice varies between the Jaro and Jaro-Winkler algorithms until the dataset size falls to 200 records; the same is observed for the high error rate datasets, as presented in table 5.23.

5.5.2 2300 first/last name experimental result evaluation

In section 5.4.3.2, experimental results on a group of 2300-record first name datasets and a group of 2300-record last name datasets were presented. In this section, these experimental results are evaluated and analyzed with respect to the following aspects:

1) The Effect of Error Rates on Threshold Values:

Experimental results for the 2300-record first name datasets show that, in general, the higher the error rate of the dataset, the lower the threshold value that should be selected, except for the Smith-Waterman algorithm. For example, on the high error rate first name dataset with TLP typos, the Q-Gram algorithm achieves its best F-score at a threshold of 0.8, while it achieves its best F-score at thresholds of 0.85 and 0.99 on the medium and low error rate datasets respectively. Table 5.18~Table 5.20 show the threshold values at which each of the five algorithms achieves its maximum F-score on the 2300-record first name datasets with TFP, TLP and TR typos respectively. These tables show clearly that, for all first name datasets containing TFP and TLP typos, the threshold value selected for the Smith-Waterman algorithm remains the same across the three error rates.

2) The Effect of Error Rates on Effectiveness Performance:

Experimental results from Appendix A (Table A.4) show that, in general, all five algorithms perform best on the low error rate first name datasets. Looking at the performance of each individual algorithm, it is observed that on all 2300-record first name datasets with TFP typos the Levenshtein algorithm appears to be the most effective of the five, regardless of the error rate of the dataset (see Fig.5.22). For the other first name datasets, with TLP and TR typos, the most effective algorithm is sensitive to the error rate; in general, either the Jaro or the Jaro-Winkler algorithm should be selected, depending on the error rate, in order to achieve the best matching quality.

[Figure: maximum F-scores of the Levenshtein, Jaro, Jaro-Winkler, Q-Gram and Smith-Waterman algorithms at low, medium and high error rates]

Fig.5.22 Performance comparisons on 2300 first name datasets with TFP typo

3) The Effect of Types of Typos on Threshold Values

With respect to the first name datasets, the experimental results show that the Levenshtein and Jaro-Winkler algorithms are not sensitive to the type of typo with regard to the selection of threshold values; the other three algorithms require different threshold values to achieve their best F-scores on first name datasets with different types of typos. With respect to the last name datasets, only the Jaro-Winkler algorithm is insensitive to the type of typo: the threshold values selected for the low, medium and high error rate last name datasets are 0.99, 0.95 and 0.95 respectively, regardless of the type of typo in the dataset.


4) The Effect of Types of Strings on Threshold values

For datasets containing the same type of typo, the experimental results are evaluated for the first name strings and last name strings respectively. It is found that, for some algorithms, the proper threshold value varies with the type of string involved. For example, on the datasets containing only TFP typos, the threshold selected for the Jaro algorithm on the low error rate first name dataset is 0.95, whereas it must be increased to 0.99 on the corresponding last name dataset in order to achieve the best effectiveness performance. Only the Jaro-Winkler and Smith-Waterman algorithms are insensitive to the type of string on the datasets containing only TFP typos.

5) The Effect of Types of Strings on Effectiveness Performance

According to the experimental results presented in Appendix A (Tables A.4 and A.5), the algorithms generally perform better on first name strings than on last name strings. It is conjectured that the reason is the difference in string length between the two types of strings. Fig.5.23 shows the relative performance on first name strings and last name strings under the three different error rates; the typos contained in these datasets are all TFP typos. It is noticed that the type of string influences performance. For example, on the medium error rate datasets the Levenshtein algorithm performs best on first name strings while the Jaro algorithm is best on last name strings. This difference is also attributed to the different lengths of the selected strings, although further experiments on the effect of name string length have not been undertaken.


[Figure: maximum F-scores of the five algorithms on first name and last name strings (series LE, J, JW, Q and SW for each string type) at low, medium and high error rates]

Fig.5.23 Performance comparisons between first name datasets and last name datasets with TFP typos under three different error rates

5.6 Summary

Based on the evaluation results discussed in section 5.5, the following recommendations are made:

(1) Regarding threshold value selection, the error rate of a dataset, the type of strings involved and the type of typos in the strings all influence the selection of a suitable threshold value for the selected algorithm to achieve the best effectiveness performance. However, the selected threshold values are not sensitive to changes in the size of a dataset. It is recommended that the higher the error rate, the lower the threshold value that should be chosen. Based on the experimental results achieved in this chapter, table A.6 in appendix A.11 presents a list of suggestions regarding the selection of a suitable algorithm and of the required threshold values for different dataset characteristics.

(2) Regarding the effectiveness of a selected algorithm, the error rate and the type of strings both influence its effectiveness. In general, the algorithms perform better on lower error rate datasets, and an algorithm performs better on the first name datasets than on the last name datasets.

(3) For names parsed into separate fields, the Jaro and Jaro-Winkler algorithms are not always among the best choices for matching first name or last name strings. The most effective name matching algorithm depends on the error rate, size and type of strings associated with a dataset.

(4) If speed is important, algorithms such as Jaro and Jaro-Winkler should be selected. The Smith-Waterman algorithm should not be selected for the purpose of matching name strings parsed into separate fields.
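As a hedged sketch only, the recommendations above could be expressed as a simple selection rule of the following form; the function and its inputs are illustrative and are not the framework's actual 'algorithm selection mechanism'.

```python
def recommend_algorithm(error_rate, dataset_size, speed_critical):
    """Pick an algorithm name from the recommendations in section 5.6."""
    if speed_critical:
        return "Jaro-Winkler"        # Jaro/Jaro-Winkler are significantly faster (rec. 4)
    if dataset_size <= 200:
        return "Levenshtein"         # best on the 200-record medium/high error datasets (Table 5.11)
    if error_rate == "low":
        return "Levenshtein"         # consistently among the best at low error rates (sec. 5.5.1)
    return "Jaro-Winkler"            # Jaro/Jaro-Winkler dominate at medium/high error rates
```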

5.7 Conclusion

This chapter has analyzed and evaluated five popular character-based name matching techniques. A comprehensive comparison of the five techniques has been made based on a series of experiments on different last name and first name datasets. The experimental results confirmed that there is no clear best technique: the size of a dataset, its error rate, the type of strings it contains and the type of typos in a string all affect the performance of a selected algorithm.

Regarding threshold value selection, the experimental results show that the higher the error rate in the dataset, the lower the threshold value required to achieve the best performance. The timing performance of the five algorithms on the different datasets has also been analyzed and compared; overall, Jaro-Winkler and Jaro are significantly faster than the others.

With the help of the experimental results achieved in this chapter, recommendations on the selection of a suitable algorithm for a particular name matching task are proposed in section 5.6. Compared with the recommendations in previous research, these are considerably more detailed and supply the corresponding parameters for the recommended algorithms for practical use.

In detail, four recommendations proposed by Christen are further evaluated in this chapter. Christen claimed that the type of string should be considered when selecting a suitable algorithm, without giving further suggestions based on the algorithms used in his work. In this chapter, two types of strings (last name strings and first name strings) are used for the evaluation, and a detailed selection of algorithms according to the different characteristics of a dataset is proposed (see table A.6). Christen also claimed that the Jaro and Jaro-Winkler algorithms seem to perform well for both first name strings and last name strings and recommended them for matching personal name strings. However, this recommendation does not always hold according to the experimental results in this chapter. For example, on the high error rate last name dataset with 200 records, the Levenshtein algorithm is among the most accurate measures, followed by the Q-Gram algorithm; in this case Jaro and Jaro-Winkler are not suitable for the matching task.

Regarding the threshold value selection problem, Christen highlighted the difficulty in setting an optimal threshold value and claimed that the optimal threshold varies between datasets, without evaluating the problem further. In the proposed experiments, a thorough evaluation of the selection of a suitable threshold value for the different algorithms is made, and detailed threshold values are presented for the different data characteristics concerned. Based on these values, a regular pattern similar to that observed by Hassanzadeh et al. is found, i.e., a lower threshold value is needed as the error rate in the dataset increases.

All algorithms selected in this chapter are character-level algorithms rather than the token-level algorithms used by Hassanzadeh et al. [128]. Hassanzadeh et al. focus their evaluation of the effectiveness of different token-level algorithms mainly on two characteristics: the error rate of a dataset and the type of errors. Compared with Hassanzadeh et al., the evaluation in this chapter addresses more characteristics: the error rate of a dataset, the type of string, the size of the dataset and the type of typo. All of these characteristics are used in evaluating the relative effectiveness of the five algorithms and in selecting a threshold value. However, in the proposed experiments only one error type (misspelling) is considered, while Hassanzadeh et al. consider three types of errors (misspelling, abbreviation and word swap).


CHAPTER 6 CONCLUSION AND FUTURE WORK

The main outcomes of the research undertaken for this thesis are the development of a rule-based taxonomy of dirty data, a novel data cleaning framework, and an evaluation of the performance of five popular approximate string matching algorithms. This chapter discusses two aspects of the work that merit further examination. Firstly, the conclusions and contributions are discussed and summarized. Secondly, the future directions of the research are discussed.

6.1 Novelties and contributions

 A rule-based taxonomy of dirty data

Today, data has become more and more important, with many human activities relying on it. As data has kept increasing at an explosive rate, a great number of database applications, such as decision support systems and customer relationship management (CRM) systems, have been developed to derive useful information from these large quantities of data. It has now been recognized that an inordinate proportion of data in most data sources is dirty.

Due to the 'garbage in, garbage out' principle, dirty data will distort the information derived from it. Obviously, a database application such as a data warehouse with a high proportion of dirty data is not reliable for data mining or for deriving business intelligence, and the quality of decisions made on the basis of such business intelligence is also unconvincing. Therefore, before such database applications are used, dirty data needs to be cleaned. Because the types and extent of dirty data are under-appreciated in many enterprises, inadequate attention is paid to the existence of dirty data in many database applications, and methodologies are not applied to ensure high quality data in these applications.

In this thesis, research regarding the study of dirty data types is reviewed first. 38 dirty data types are proposed with the help of a study of different data quality rules. Compared with the dirty data types mentioned by previous researchers, the proposed 38 dirty data types form the most complete collection of dirty data types. Although it cannot be guaranteed that every possible dirty data type is covered, it is believed that most usual and unusual dirty data types are included. Secondly, a rule-based taxonomy of dirty data is proposed based on these 38 dirty data types. The taxonomy is introduced by associating the proposed 38 dirty data types with different data quality rules, which forms a larger collection of dirty data than any of the existing taxonomies or classifications. With the help of the taxonomy, a method for dealing with the DDS problem is developed by prioritizing the expensive process of data cleaning. Using the proposed rule-based taxonomy during the data cleaning process maximizes the benefit to business enterprises.

 A novel data cleaning framework

In this thesis, a novel data cleaning framework has been proposed, which aims to address the following issues: (i) minimising the data cleaning time and improving the degree of automation in data cleaning, and (ii) improving the effectiveness of data cleaning. Additionally, the proposed framework offers a function (the DDS process) to address special cases when individual business requirements are involved. This function helps a business take its special needs into account according to different business priority policies.

The proposed framework retains the most appealing characteristics of existing data cleaning approaches, and improves efficiency and effectiveness during the data cleaning process. Compared with existing data cleaning approaches, the proposed framework provides several exclusive features which have not been addressed before.

Firstly, the proposed framework tries to address as many dirty data types as possible according to the proposed taxonomy of dirty data. Existing approaches only focus on specific data cleaning tasks such as data standardization or duplicate record elimination, and some tools, such as ARKTOS, focus on only one activity. To the best of the author's knowledge, none of the existing tools provides an all-in-one solution to the problems covered by the proposed dirty data taxonomy.

Secondly, the proposed framework explicitly addresses the order of the various cleaning activities and provides an automatic solution for organizing the sequence of these activities, i.e., the 'algorithm ordering mechanism'. None of the existing data cleaning approaches reviewed in chapter 2 has addressed this problem specifically. The order proposed by the 'algorithm ordering mechanism' addresses both effectiveness and efficiency during the data cleaning process.

Finally, the proposed framework supplies an 'algorithm selection mechanism' which can provide an optimized algorithm with respect to the different factors involved, such as problem domain, error rate and computational cost. This is an improvement over existing approaches such as IntelliClean, which offer only a fixed solution for all situations, or approaches that require their users to choose from multiple algorithms. For example, Febrl supports a variety of techniques for duplicate record detection, but choosing a suitable technique and setting the corresponding parameters depend entirely on the user's preference; Febrl supplies no recommendations or guidance during the selection, which is a hard job for users who do not have enough knowledge of these techniques. The proposed 'algorithm selection mechanism' aims to fill this gap by supplying an optimized algorithm for different problems with the various factors involved. In this way, both effectiveness and the degree of automation are improved.

 An evaluation of approximate string matching algorithms

Approximate string matching is an important part of many data cleaning approaches and has been well studied for many years; a variety of approximate string matching techniques have been proposed for matching string data in tuples. There is a growing awareness that high-quality string matching is a key to a variety of applications, such as data integration, text and web mining, information retrieval and search engines. In such applications, matching names is one of the most common tasks. A number of name matching techniques are available; unfortunately, no existing technique performs best in all situations, and different techniques perform differently in different situations. An estimate of the similarity between strings can vary significantly depending on the domain and the specific field under consideration, so traditional similarity measures may fail to estimate string similarity correctly. In the past decade this problem has been tackled by several researchers, but none of them has undertaken a comprehensive analysis and comparison that considers the effect on accuracy and timing of the following factors: error rate, type of strings, type of typos and the size of the dataset.

In this thesis, a comprehensive comparison of the five techniques has been carried out based on a series of experiments on different last name and first name datasets. The experimental results confirmed that there is no clear best technique: the size of a dataset, its error rate, the type of strings it contains and the type of typos in a string all affect the performance of a selected algorithm.

Regarding threshold value selection, the experimental results show that the higher the error rate in the dataset, the lower the threshold value required to achieve the best performance. The timing performance of the five algorithms on the different datasets has also been analyzed and compared; overall, Jaro-Winkler and Jaro are significantly faster than the others.

With the help of the experimental results in chapter 5, recommendations on the selection of a suitable algorithm for a particular name matching task are proposed. Compared with the recommendations in previous research, a group of much more detailed recommendations is provided, with the corresponding parameters supplied for the recommended algorithms for practical use; these provide useful help in the development of the 'algorithm selection mechanism' in the proposed data cleaning framework.

6.2 Future work

Based on the discussions in former sections, two possible extensions of the current research work are outlined in this section.

The first extension will focus on effective database design with regard to data input, for example the design of data entry interfaces in database applications. As mentioned at the beginning of this work, the quality of any large real-world dataset depends on a number of factors, among which the source of the data is often the crucial one. Dirty data can creep in at every step of the process, from initial data acquisition to archival storage. From the study of the different types of dirty data, it is apparent that some of them are introduced during data entry; for example, according to table 3.4, dirty data types such as DT.17~DT.20 and DT.26 might all be introduced during the data entry process.

On many occasions data entry is done manually by humans, who typically extract information from speech or input data from written or printed sources. During this process, errors in the data can often be mitigated through judicious design of data entry interfaces. Traditionally, the specification and enforcement of database integrity constraints, such as data type checks, bounds on numeric values and referential integrity, are used to prevent the introduction of the dirty data mentioned above. These integrity constraints are rules that ensure the completeness and consistency of data entered into the system, and they were invented precisely to keep data as clean as possible.

However, the limitation remains that integrity constraints do not prevent all bad data, and in some cases constraint enforcement leads to user frustration. For example, the requirement that a field be non-empty is not sufficient to ensure that a user provides meaningful content. An alternative approach is therefore to provide the data entry user with convenient affordances to understand, override and explain constraint violations, thus discouraging the silent injection of bad data and encouraging the annotation of surprising or incomplete source data [131]. Hellerstein proposes several guiding principles for the design of data entry interfaces [131]. The theoretical analysis shows that a good data entry interface will help prevent errors from entering the database; as the old aphorism states, an ounce of prevention is worth a pound of cure. It is therefore worthwhile to extend the research work on effective database design with respect to data input.
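As a hedged illustration of this idea (the field, bounds and helper names below are hypothetical), a data entry routine can warn on a constraint violation, let the user override it, and keep the explanation as an annotation, instead of either silently rejecting or silently accepting the value.

```python
def capture_age(raw_value, ask_user):
    """`ask_user(prompt) -> (override, note)` is assumed to be supplied by the UI."""
    age = int(raw_value)                          # data type check (may raise ValueError)
    record = {"age": age, "annotation": None}
    if not 0 <= age <= 120:                       # bounds check (integrity constraint)
        override, note = ask_user(f"Age {age} is outside the expected range. Keep it?")
        if not override:
            raise ValueError("entry rejected by user")
        record["annotation"] = note               # keep the explanation with the data
    return record
```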

The second extension of the future research will focus on the development of a comprehensive data cleaning tool for database applications based on the framework proposed in this thesis.

The challenge remains the realization of the two mechanisms (AOM and ASM) introduced in the proposed data cleaning framework. Regarding the algorithm ordering mechanism, a theoretical analysis of the ordering of multiple data cleaning tasks during the data cleaning process has been given, showing how the order in which multiple identified dirty data types are cleaned varies with the selection of an algorithm. However, experimental results have not yet been obtained and will be part of the future work.

Regarding the algorithm selection mechanism, only five selected approximate string matching algorithms are involved in the experiments in this thesis. These five algorithms are the most popular character-level algorithms, frequently referenced in the literature, and can be used for matching personal names parsed into single fields such as last name or first name. Although recommendations and numerical values regarding the selection of threshold values for each of these five algorithms are presented, benefiting from them assumes that the error rate of the given dataset is known in advance. This is difficult in practice, as users generally have no idea of the error rate of their data beforehand. It is expected that a reasonable method will become available in the future to help estimate the error rate of a given dataset.

Besides, although relative comparisons of the accuracy of different token-level algorithms exist in the literature [128], they do not address as many data characteristics as the experiments presented here. Future work will include testing these token-level algorithms with the data characteristics addressed in chapter 5. Apart from the five character-level algorithms, other character-level algorithms mentioned in the Febrl system should also be considered for further experimental work.

Finally, the only dirty data type involved in the experimental work of this thesis is 'misspelling'. Future work will involve further dirty data types, such as abbreviation and word swap, in the testing of both character-level and token-level algorithms; a simple way of generating such test data is sketched below. A successful outcome of this future work would improve the development of the algorithm selection mechanism and thus enhance the performance of data cleaning systems in database applications.
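Generating such test data requires injecting controlled errors into clean strings. The sketch below is illustrative only (the particular edit operations and example names are invented); it shows one simple generator for each of the three dirty data types mentioned, namely misspelling, abbreviation and word swap:

    import random

    def misspell(name, rng):
        """Introduce a single-character typo (a deletion) at a random position."""
        if len(name) < 2:
            return name
        i = rng.randrange(len(name))
        return name[:i] + name[i + 1:]

    def abbreviate(name):
        """Abbreviate a single word to its initial, e.g. 'William' -> 'W.'."""
        return name[0] + "." if name else name

    def word_swap(full_name):
        """Swap the first two words, e.g. 'John Smith' -> 'Smith John'."""
        parts = full_name.split()
        if len(parts) >= 2:
            parts[0], parts[1] = parts[1], parts[0]
        return " ".join(parts)

    rng = random.Random(7)
    print(misspell("Johnson", rng))   # a misspelt variant such as 'Jhnson'
    print(abbreviate("William"))      # 'W.'
    print(word_swap("John Smith"))    # 'Smith John'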


REFERENCES

[1] Chu, Y.C., Yang, S.S., & Yang, C.C. (2001). Enhancing data quality through attribute-based metadata and cost evaluation in data warehouse environments. Journal of the Chinese Institute of Engineers,24(4), 497-507.

[2] Pierce, E.M. (2003). A Progress Report from the MIT Information Quality Conference, Retrieved March 1, 2012 from http://www.tdan.com/view-articles/5143/

[3] Wang, R.Y., Storey, V.C., & Firth, C.P. (1995). A framework for analysis of data quality research. IEEE Transactions on Knowledge and Data Engineering, 7(4), 623-640.

[4] Wang, R.Y., & Strong, D.M. (1996). Beyond Accuracy: What Data Quality Means to Data Consumers. Journal of Management Information Systems, 1(12), 5-34.

[5] English, L. (2000). Plain English on Data Quality. Retrieved March 1, 2012 from http://www.information-management.com/issues/20000901/2642-1.html.

[6] Kim, W., Choi, B., Hong, E.Y., Kim, S.Y., & Lee, D. (2003). A Taxonomy of Dirty Data. Data Mining and Knowledge Discovery, 7, 81-99.

[7] Redman, T. (1998). The Impact of Poor Data Quality on the Typical Enterprise. Communications of the ACM, 41, 79-82.

[8] Orr, K. (1998). Data Quality and Systems Theory. Communications of the ACM, 41, 66-71.

[9] Lee, M.L., Ling, T.W., & Low, W.L. (2000). IntelliClean: a knowledge-based intelligent data cleaner. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, USA, 290-294.

[10] Fellegi, I.P., & Sunter, A.B. (1969). A Theory for Record Linkage. Journal of the American Statistical Association,64.

[11] Hipp, J., Güntzer, U., & Grimmer, U. (2001). Data Quality Mining - Making a Virtue of Necessity. Proceedings of the 6th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Santa Barbara, California, 52-57.


[12] Loshin, D. (2009). Monitoring Data Quality Performance Using Data Quality Metrics. Retrieved March 1, 2012 from http://www.it.ojp.gov/documents/Informatica_Whitepaper_Monitoring_DQ_ Using_Metrics.pdf

[13] Hernandez, M., & Stolfo, S. (1998). Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery, 2, 9-37.

[14] Elmagarmid, A.K., Ipeirotis, P.G., & Verykios, V.S. (2007). Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 19, 1-16.

[15] Cadot, M., & Martino, J. (2003). A data cleaning solution by Perl scripts for the KDD Cup 2003 task 2. ACM SIGKDD Explorations Newsletter, 5, 158-159.

[16] Sung, S. Y., Li, Z., & Peng, S. (2002). A fast filtering scheme for large database cleansing. Proceedings of Eleventh ACM International Conference on Information and Knowledge Management, 76-83.

[17] Feekin, A., & Chen, Z. (2000). Duplicate detection using k-way sorting method. Proceedings of ACM Symposium on Applied Computing, Como, Italy, 323-327.

[18] Galhardas, H. (2001). Data Cleaning: Model, Language and Algorithms. Ph.D, University of Versailles, Saint-Quentin-En-Yvelines.

[19] Zhao, L., Yuan, S.S., Peng, S., & Wang, L.T. (2002). A new efficient data cleansing method. Proceedings of 13th International Conference on Database and Expert Systems Applications, 484-493.

[20] Simoudis, E., Livezey, B., & Kerber, R. (1995). Using Recon for Data Cleaning. Advances in Knowledge Discovery and Data Mining.

[21] Maletic, J., & Marcus, A. (2000). Data Cleansing: Beyond Integrity Analysis. Proceedings of the Conference on Information Quality (IQ2000), Massachusetts Institute of Technology, 200-209.

[22] Maletic, J., Marcus, A., & Lin, K.L. (2001). Ordinal Association Rules for Error Identification in Data Sets. Proceedings of Tenth International Conference on Information and Knowledge Management (CIKM 2001).


[23] Fox, C., Levitin, A., & Redman, T. (1994). The Notion of Data and Its Quality Dimensions. Information Processing and Management, 30, 9-19.

[24] Levitin, A., & Redman, T. (1995). A Model of the Data (Life) Cycles with Application to Quality. Information and Software Technology, 35, 217-223.

[25] Strong, D.M., Wang, R.Y., & Lee, Y.W. (1997). Data Quality in Context. Communications of the ACM, 40(5), 103-110.

[26] Svanks, M.I. (1988). Integrity Analysis: Methods for Automating Data Quality Assurance. Information and Software Technology,30(10).

[27] Müller, H., & Freytag, J.C. (2003). Problems, Methods, and Challenges in Comprehensive Data Cleansing. Tech. Rep, HUB-1B-164.

[28] Rahm, E., & Do, H. (2000). Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering,23(4), 3-13.

[29] Kim, W. (2002). On three major holes in Data Warehousing Today. Journal of Object Technology, 1, 2002.

[30] Oliveira, P., Rodrigues, F.T., Henriques, P., & Galhardas, H. (2005). A Taxonomy of Data Quality Problems. Second International Workshop on Data and Information Quality (in conjunction with CAISE'05), Porto, Portugal.

[31] Luebbers, D., Grimmer, U., & Jarke, M. (2003). Systematic Development of Data Mining-Based Data Quality Tools. The 29th International Conference on Very Large Databases, Berlin, Germany.

[32] Peng, T. (2008). A Framework for Data Cleaning in Data Warehouses. Proceeding of ICEIS 2008, Spain, 473-478.

[33] Hellerstein, J.M., & Raman, V. (2001). Potter's Wheel: An Interactive Framework for Data Transformation and Cleaning. Proceedings of the 27th VLDB Conference, Roma, Italy.

[34] Schallehn, E., & Sattler, K.U. (2001). A Data Preparation Framework based on a Multidatabase Language. International Database Engineering Applications Symposium (IDEAS), Grenoble, France.

[35] Joachims, T. (1999). Making Large-Scale SVM Learning Practical. MIT Press.


[36] Cohen, W.W., & Richman, J. (2002). Learning to match and cluster large high-dimensional data sets for data integration. Proceeding of Eighth ACM SIGKDD Int’l Conf. Knowledge Discovery and Data Mining (KDD '02).

[37] McCallum, A., & Wellner, B. (2004). Conditional models of identity uncertainty with application to noun coreference. The Advances in Neural Information Processing Systems (NIPS'04).

[38] Domingos, P. (2004). Multi-relational record linkage. KDD-2004 Workshop Multi-Relational Data Mining.

[39] Pasula, H., Milch, B., Marthi, B., Russell, S., & Shpitser, I. (2002). Identity uncertainty and citation matching. The Advances in Neural Information Processing Systems (NIPS'02).

[40] Lim, E., Srivastava, J., Prabhakar, S., & Richardson. (1993). Entity identification in database integration. Ninth IEEE Int’l Conf. Data Eng. (ICDE ’93), 294-301.

[41] Galhardas, H., Shasha, D., & Florescu, D. (2001). Declarative data cleaning: Language, model, and algorithms. 27th Int’l Conf. Very Large Databases (VLDB ’01).

[42] Guha, S., Koudas, N., Marathe, A., & Srivastava, D. (2004). Merging the results of approximate match operations, 30th Int’l Conf. Very Large Databases (VLDB ’04).

[43] Ananthakrishna, R., Chaudhuri, S., & V.Ganti. (2002). Eliminating fuzzy duplicates in data warehouses. 28th Int’l Conf. Very Large Databases (VLDB ’02).

[44] Chaudhuri, S., Ganti, V., & Motwani, R. (2005). Robust identification of fuzzy duplicates. 21st IEEE Int’l Conf. Data Eng. (ICDE ’05).

[45] Vassiliadis, P., Vagena, Z., Skiadopoulos, S., Karayannidis, N., & T. Sellis. (2001). ARKTOS: towards the modeling, design, control and execution of ETL Processes. Information Systems,26, 537-561.

[46] Christen, P. (2008). Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.


[47] Christen, P. (2009). Development and user experiences of an open source data cleaning, deduplication and record linkage system, SIGKDD Explorations Newsletter, 11(1), 2009.

[48] English, L. (1999). Improving Data Warehouse and Business Information Quality: methods for reducing costs and increasing profits, John Wiley & Sons, Inc.

[49] Parssian, A., Sarkar, S., & Jacob, V.S. (1999). Assessing Data Quality For Information Products. Proceeding ICIS '99 Proceedings of the 20th international conference on Information Systems.

[50] Pipino, L.L., Lee, Y.W., & Wang, R.W. (2002). Data Quality Assessment. Communications of the ACM,45, 211-218.

[51] Madnick, S.E., Wang, R.Y., Lee, Y.W., & Zhu, H. (2009). Overview and Framework for Data and Information Quality Research. Journal of Data and Information Quality (JDIQ), 1(1).

[52] Blake, R. & Mangiameli, P. (2011). The Effects and Interactions of Data Quality and Problem Complexity on Classification. Journal of Data and Information Quality, 2(2), 2011.

[53] Tayi, G., & Ballou, D.P. (1998). Examining Data Quality, Communications of the ACM, 41, 54-57.

[54] Kahn, B.K., Strong, D.M., & Wang, R.W. (2002). Information Quality Benchmark: Product and service performance. Communications of the ACM, 45, 184-192.

[55] Drucker, P.F. (1985). Playing in the Information-Based 'Orchestra'. The Wall Street Journal.

[56] Juran, J.M. (1988). Juran on planning for quality. New York: The Free Press.

[57] Juran, J.M. (1999). How to think about Quality. In J.M.Juran & A.B.Godfrey (Eds) Juran’s quality handbook (pp.2.1-2.18), New York: McGraw-Hill.

[58] Ballou, D.P., & Pazer, H. (1985). Modeling data and process quality in multi-input, multi-output information systems, Manage. Sci. 31,150-162.

[59] Wang, R., & Firth, C.P. (1996). Data Quality Systems: Evaluation and Implementation. London: Cambridge Ltd.


[60] Kriebel, C.H. (1979). Evaluating the quality of information systems. In N.Szysperski and E.Grochla (eds.), Design and Implementation of Computer Based Information Systems. Germantown, PA: Sijthtoff and Noordhoff.

[61] Wang, R, Strong, D, & Guarascio, L. (1994). An empirical investigation of data quality dimensions: A data consumer's perspective. MIT TDQM Research Program E53-320 TDQM-94-01.

[62] Wand, Y., & Wang, R.Y. (1996). Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39, 86-95.

[63] Redman, T.C. (1997). Data Quality for the Information Age: Artech House.

[64] Bovee, M., Srivastava, R.P., & Mak, B. (2001). A Conceptual Framework and Belief-Function Approach to Assessing Overall Information Quality. Proceedings of the Sixth International Conference on Information Quality, Boston, MA.

[65] Naumann, F. (2002). Quality-Driven Query Answering for Integrated Information Systems. Springer-Verlag, Berlin, Heidelberg.

[66] Wang, K.Q., Tong, S.R., Roucoules, L., & Eynard, B. (2008). Analysis of Data Quality and Information Quality Problems in Digital Manufacturing. Proceedings of the 2008 IEEE ICMIT.

[67] Jarke, M., Jeusfeld, M., Quix, C., & Vassiliadis, P. (1999). Architecture and Quality in Data Warehouses: an Extended repository Approach. Information Systems, 24.

[68] Scannapieco, M., & Catarci, T. (2002). Data Quality under a Computer Science Perspective. Italian: Archivi & Computer.

[69] Liu, L. (2002). Evolutionary Data Quality. 7th International Conference on Information Quality (IQ 2002).

[70] Jarke, M., Lenzerini, M., Vassiliou, Y., & Vassiliadis, P. (2001). Fundamentals of Data Warehouses. 2nd Edition. Springer-Verlag Berlin and Heidelberg GmbH & Co.K.

[71] Lee, Y.W., Strong, D.M., Kahn, B.K., & Wang, R.Y. (2002). AIMQ: A Methodology for Information Quality Assessment. Information and Management,40, 133-46.

[72] Zmud, R. (1978). Concepts, Theories and Techniques: An Empirical Investigation of the Dimensionality of the Concept of Information. Decision Sciences, 9, 187-195.

[73] Delone, W., & McLean, E. (1992). Information Systems Success: The Quest for the Dependent Variable. Information Systems Research,3, 60-95.

[74] Goodhue, D.L. (1995). Understanding User Evaluations of Information Systems. Management Science, 41(12), 1827-1844.

[75] Jarke, M., & Vassiliou, Y. (1997). Data Warehouse Quality: A Review of the DWQ Project. Proceedings of the 1997 Conference on Information Quality, Cambridge, MA, 299-313.

[76] Matsumura, A., & Shouraboura, N. (1996). Competing with Quality Information. Proceedings of the 1996 Conference on Information Quality, Cambridge, MA, 72-86.

[77] Gardyn, E. (1997). A Data Quality Handbook for a Data Warehouse. Proceedings of the 1997 Conference on Information Quality, Cambridge, MA, 267-290.

[78] Redman, T.C. (1992). Data Quality: Management and Technology. New York, NY: Bantam Books.

[79] Kovac, R., Lee, Y., & Pipino, L. (1997). Total Data Quality Management: The Case of IRI. Proceedings of the 1997 Conference on Information Quality, Cambridge, MA, 63-79.

[80] Meyen, D & Willshire, M. (1997). A Data Quality Engineering Framework. Proceedings of the 1997 Conference on Information Quality, Cambridge, MA, 95-116.

[81] Tsichritzis, D., & Lochovsky, F. (1982). Data Models. Englewood Cliffs, NJ: Prentice-Hall.

[82] Melissadata (2012). Why Dirty Data May Cost You $180,000. Retrieved March 1, 2012 from http://www.melissadata.com/enews/articles/1206/1.htm.

[83] Huang, K., Lee, Y.W., & Wang, R.Y. (1999). Quality Information and Knowledge. Prentice Hall.

[84] Missier, P., Lalk, G., Verykios, V., Grillo, F., Lorusso, T., & Angeletti, P. (2003). Improving Data Quality in Practice: A Case Study in the Italian Public Administration. Journal of Distributed and Parallel Databases, 13(2).


[85] Wang, R.Y., Lee, Y.W. & Ziad, M. (2001). Data Quality. Heidelberg Springer.

[86] Fisher, C.W., Kingma, B.R. (2001). Criticality of Data Quality as Exemplified in Two Disasters. Information & Management,39, 109-116.

[87] Eppler, M., & Helfert, M. (2004). A Classification and Analysis of Data Quality Costs. The Ninth International Conference on Information Quality, MIT.

[88] Columbia Accident Investigation Board. (2003). Columbia Accident Investigation Board Report. Washington, DC.

[89] Adelman, S., Moss, L., & Abai, M. (2005). Data Strategy. Addison-Wesley Professional.

[90] Cappiello, C., & Francalanci, C. (2002). DL4B Considerations about Costs Deriving from a Poor Data Quality, Retrieved March 1, 2012 from http://www.dis.uniroma1.it/~dq/docs/Deliverables/DL4B.pdf.

[91] Eppler, M. (2003). Managing Information Quality. Springer.

[92] Kahn, B., Katz-Haas, R., & Strong, D.M. (2000). How to get an Information Quality Program Started: The Ingenix Approach. Proceedings of the 2000 Conference on Information Quality, 28-35.

[93] Naumann, F., & Rolker, C. (2000). Assessment Methods for Information Quality Criteria.

[94] Segev, A & Wang, R (2001). Data Quality Challenges in Enabling eBusiness Transformation. Proceedings of the Sixth International Conference on Information Quality, 83-91.

[95] Strong, D.M., Lee, Y.W., & Wang, R.Y (1997). 10 Potholes in the Road to Information Quality. Computer IEEE, 30, 38-46.

[96] Verykios, V.S., Elfeky, M.G., Elmagarmid, A.K., Cochinwala, M., & Dalal, S. (2000). On the Accuracy and Completeness of the Record Matching Process. Proceedings of the 2000 Conference on Information Quality.

[97] Kim, W., & Choi, B. (2003). Towards Quantifying Data Quality Costs. Journal of Object Technology, 2, 69-76.


[98] Ballou, D., & Pazer, H. (1995). Designing information systems to optimize the accuracy-timeliness trade off. Information Systems Research, 6, 51-72.

[99] Ballou, D.P., & Tayi, G.K. (1989). Methodology for allocating resources for data quality enhancement, Communications of the ACM, 32, 320-329.

[100] Ballou, D., Wang, R., Pazer, H., & Tayi, G. (1998). Modeling information manufacturing systems to determine information product quality. Management Science, 44, 462-484.

[101] Batini, C., Cappiello, C., Francalanci, C., & Maurino, A. (2009). Methodologies for data quality assessment and improvement. ACM Computing Surveys (CSUR).

[102] Even, A., & Shankaranarayanan, G. (2009). Dual Assessment of Data Quality in Customer Databases, Journal of Data and Information Quality, 1(3).

[103] Strong, D.M. (1997). IT process designs for improving information quality and reducing exception handling: a simulation experiment. Information and Management, 31, 251-263.

[104] Strong, D.M. (1997). Total Data Quality Management Program, Proceedings of the Conference on Information Quality, Cambridge, MA, 372.

[105] Brown, S.M. (1997). Preparing Data for the Data Warehouse. Proceedings of the 1997 Conference on Information Quality, Cambridge, MA, 291-298.

[106] Schusell, G. (1997). Data quality the top problem, DW for Data Warehousing Management, Digital Consulting Institute.

[107] Deming, E.W. (1986). Out of the Crisis. Cambridge, Mass: MIT Center for Advanced Engineering Study.

[108] Ge, M., & Helfert, M. (2008). Data and Information Quality Assessment in Information Manufacturing Systems. The 11th International Conference on Business Information Systems (BIS2008), 380-389.

[109] Cappiello, C., Francalanci, C., & Pernici, B. (2003). Time-Related Factors of Data Quality in Multichannel Information Systems. Journal of Management Information Systems, 20, 71-91.

[110] Cooper, R.B. (1983). Decision production: A step toward a theory of managerial information requirements. Proceedings of the Fourth Int'l Conf. on Information Systems, Houston, 215-268.

[111] Emery, J.C. (1969). Organizational planning and control systems: Theory and technology: New York: Macmillan.

[112] Francalanci, C. & Pernici, B. (2004). Data quality assessment from the user‟s perspective. Proceedings of the 2004 international workshop on Information quality in information systems.

[113] Lee, Y.W., Pipino, L.L., Funk, J.D., & Wang, R.W. (2009). Journey to Data Quality, The MIT Press.

[114] Chanana, V., & Koronios, A. (2007). Data Quality through Business Rules, International Conference on Information & Communication Technology ICICT 2007, 262-265.

[115] Morgan, T. (2002). Business Rules and Information Systems: Aligning IT with Business Goals. Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA.

[116] Halle, B.V. & Ross, R.G. (2001). Business Rule Applied: Building Better Systems Using Business Rules Approach, John Wiley & Sons, Inc. New York, NY, USA.

[117] Kim, W., & Seo, J. (1991). On classifying schematic and data heterogeneity in multidatabase systems. IEEE Computer, 24(12), 12-18.

[118] Hermansen, J. (1985). Automatic Name Searching in Large Databases of International Names, Georgetown University Dissertation, Washington, DC.

[119] Smith, T.F., & Waterman, M.S. (1981). Identification of Common Molecular Subsequences, J. Mol. Biol, 147, 195-197.

[120] Jaro, M. (1989). Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84, 414-420.

[121] Navarro, G. (2001). A guided tour to approximate string matching, ACM Computing Surveys, 33, 31-88.

[122] Ukkonen, E. (1992). Approximate string matching with q-grams and maximal matches. Theoretical Computer Science,92, 191-211.


[123] Winkler, W. (1990). String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage. Proceedings of the Section on Survey Research Methods, 354-359.

[124] Christen, P. (2006). A Comparison of Personal Name Matching: Techniques and Practical Issues. Proceedings of the Sixth IEEE International Conference on Data Mining Workshops (ICDMW '06), Washington, DC, USA, 290-294.

[125] Li, L., Peng, T., & Kennedy, J. (2011). A Rule Based Taxonomy of Dirty Data, GSTF International Journal on Computing, 1.

[126] Bilenko, M., Mooney, R., Cohen, W., Ravikumar, P., & Fienberg, S. (2003) Adaptive Name Matching in Information Integration. IEEE Intelligent Systems,18, 16-23.

[127] Cohen, W., & Fienberg, S. (2003) A comparison of string distance metrics for name-matching tasks. Proceedings of the IJCAI-2003 Workshop on Information Integration on the Web, 73-78.

[128] Hassanzadeh, O., Sadoghi, M., & Miller, R. (2007). Accuracy of Approximate String Joins Using Grams. Proceedings of QDB'2007, 11-18.

[129] Patman, F., & Thompson, P. (2003). Names: A New Frontier in text mining, ISI-2003, Springer LNC 2665, 27-38.

[130] Rijsbergen, C.J. van (1979). Information Retrieval (2nd ed.). London: Butterworths.

[131] Hellerstein, J. M. (2008). Quantitative data cleaning for large databases. Report for United Nations Economic Commission for Europe: EECS Computer Science Division.


APPENDIX A

A.1: Data of the maximum F score for different techniques on the different last name datasets

Algorithm        F score (Low ER)   F score (Medium ER)   F score (High ER)   Dataset
Levenshtein      0.8872             0.7209                0.6808              9454
Jaro             0.8872             0.7401                0.6773              9454
Jaro-Winkler     0.8848             0.7275                0.7053              9454
Q-gram           0.8872             0.6622                0.5813              9454
Smith-Waterman   0.2164             0.2095                0.2032              9454
Levenshtein      0.8813             0.7293                0.679               7154
Jaro             0.8813             0.7541                0.7242              7154
Jaro-Winkler     0.8806             0.7666                0.7375              7154
Q-gram           0.8813             0.6735                0.6322              7154
Smith-Waterman   0.3166             0.3078                0.2999              7154
Levenshtein      0.8868             0.7258                0.6836              5000
Jaro             0.8868             0.7461                0.7278              5000
Jaro-Winkler     0.8854             0.7548                0.7394              5000
Q-gram           0.8868             0.6644                0.6122              5000
Smith-Waterman   0.2951             0.2813                0.28                5000
Levenshtein      0.8895             0.715                 0.6516              3600
Jaro             0.8895             0.7349                0.7163              3600
Jaro-Winkler     0.8878             0.7337                0.706               3600
Q-gram           0.8895             0.665                 0.5757              3600
Smith-Waterman   0.2949             0.2822                0.2718              3600
Levenshtein      0.8862             0.7265                0.6722              2300
Jaro             0.8862             0.7383                0.7012              2300
Jaro-Winkler     0.8852             0.7252                0.706               2300
Q-gram           0.8862             0.6626                0.5425              2300
Smith-Waterman   0.3118             0.2833                0.2802              2300
Levenshtein      0.887              0.7072                0.6328              1000
Jaro             0.887              0.7161                0.6315              1000
Jaro-Winkler     0.8877             0.6857                0.6344              1000
Q-gram           0.887              0.6609                0.5181              1000
Smith-Waterman   0.312              0.258                 0.2359              1000
Levenshtein      0.8874             0.7254                0.6768              500
Jaro             0.8862             0.7199                0.7022              500
Jaro-Winkler     0.8874             0.7411                0.682               500
Q-gram           0.8862             0.6623                0.5617              500
Smith-Waterman   0.3748             0.3443                0.3266              500
Levenshtein      0.892              0.7833                0.8089              200
Jaro             0.892              0.7482                0.7368              200
Jaro-Winkler     0.892              0.7325                0.7045              200
Q-gram           0.892              0.7653                0.7406              200
Smith-Waterman   0.5949             0.5177                0.5077              200

Table A.1 Accuracy results for last name datasets
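For reference, the F score reported throughout this appendix is the harmonic mean of precision and recall [130]. The minimal sketch below shows how such a score is obtained from the outcome counts of a matching run; the counts used are invented for illustration:

    def f_score(true_pos, false_pos, false_neg):
        """Harmonic mean of precision and recall for a set of match decisions."""
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / (true_pos + false_neg)
        return 2 * precision * recall / (precision + recall)

    # Invented counts: 880 correct matches, 120 false matches, 100 missed matches.
    print(round(f_score(880, 120, 100), 4))   # 0.8889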


A.2: Data of threshold value selection for each technique to obtain the maximum F score in different last name datasets

Algorithm        Low Error Rate   Medium Error Rate   High Error Rate   Data Size
Levenshtein      0.99             0.85                0.8               9454
Levenshtein      0.99             0.85                0.8               7154
Levenshtein      0.99             0.85                0.8               5000
Levenshtein      0.99             0.85                0.8               3600
Levenshtein      0.99             0.85                0.8               2300
Levenshtein      0.99             0.85                0.8               1000
Levenshtein      0.9              0.85                0.8               500
Levenshtein      0.99             0.8                 0.75              200
Jaro             0.99             0.95                0.9               9454
Jaro             0.99             0.95                0.9               7154
Jaro             0.99             0.95                0.9               5000
Jaro             0.99             0.95                0.9               3600
Jaro             0.99             0.95                0.9               2300
Jaro             0.99             0.95                0.9               1000
Jaro             0.99             0.9                 0.9               500
Jaro             0.99             0.9                 0.9               200
Jaro-Winkler     0.99             0.95                0.95              9454
Jaro-Winkler     0.99             0.95                0.95              7154
Jaro-Winkler     0.99             0.95                0.95              5000
Jaro-Winkler     0.99             0.95                0.95              3600
Jaro-Winkler     0.99             0.95                0.95              2300
Jaro-Winkler     0.99             0.95                0.95              1000
Jaro-Winkler     0.99             0.95                0.95              500
Jaro-Winkler     0.99             0.95                0.9               200
Q-gram           0.99             0.99                0.8               9454
Q-gram           0.99             0.99                0.8               7154
Q-gram           0.99             0.99                0.8               5000
Q-gram           0.99             0.99                0.8               3600
Q-gram           0.99             0.99                0.8               2300
Q-gram           0.99             0.99                0.8               1000
Q-gram           0.99             0.99                0.8               500
Q-gram           0.99             0.8                 0.75              200
Smith-Waterman   0.9              0.9                 0.85              9454
Smith-Waterman   0.99             0.9                 0.9               7154
Smith-Waterman   0.9              0.9                 0.9               5000
Smith-Waterman   0.9              0.9                 0.85              3600
Smith-Waterman   0.9              0.9                 0.9               2300
Smith-Waterman   0.9              0.85                0.85              1000
Smith-Waterman   0.9              0.85                0.85              500
Smith-Waterman   0.99             0.85                0.85              200

Table A.2 Threshold value selection for last name datasets


A.3: Data of average time cost for the five techniques on four different sizes of datasets (9454, 7154, 5000, and 3600)

Algorithm        Time cost (Low ER)   Time cost (Medium ER)   Time cost (High ER)   Dataset
Levenshtein      128.945              130.82                  143.751               9454
Jaro             57.573               57.977                  61.123                9454
Jaro-Winkler     42.299               42.418                  46.998                9454
Q-gram           162.773              166.503                 174.301               9454
Smith-Waterman   220.535              225.116                 227.98                9454
Levenshtein      78.745               79.665                  82.609                7154
Jaro             35.264               34.584                  37.648                7154
Jaro-Winkler     26.829               26.062                  28.315                7154
Q-gram           98.102               98.906                  101.235               7154
Smith-Waterman   147.848              134.764                 145.340               7154
Levenshtein      37.023               40.255                  41.007                5000
Jaro             17.664               17.736                  16.514                5000
Jaro-Winkler     12.612               13.031                  13.166                5000
Q-gram           47.13                48.85                   50.507                5000
Smith-Waterman   65.767               69.181                  70.237                5000
Levenshtein      19.9782              21.7713                 22.63736              3600
Jaro             9.160714             9.777714                10.067                3600
Jaro-Winkler     6.887571             7.565857                7.578571              3600
Q-gram           24.538               26.62                   27.89082              3600
Smith-Waterman   34.06429             39.15488                38.58414              3600

Table A.3 Time cost in last name datasets


A.4: Data of the maximum F score for the 2300 first name datasets with the three different types of typos

Algorithm        F score (Low ER)   F score (Medium ER)   F score (High ER)   Type of typo
Levenshtein      0.8999             0.8177                0.8394              TFP
Jaro             0.8912             0.7868                0.7936              TFP
Jaro-Winkler     0.8865             0.7763                0.7705              TFP
Q-gram           0.8887             0.6667                0.6978              TFP
Smith-Waterman   0.3661             0.3437                0.359               TFP
Levenshtein      0.8961             0.7884                0.7902              TLP
Jaro             0.9032             0.8195                0.7848              TLP
Jaro-Winkler     0.9062             0.8148                0.82                TLP
Q-gram           0.8863             0.7662                0.7861              TLP
Smith-Waterman   0.3552             0.3305                0.3371              TLP
Levenshtein      0.8909             0.7259                0.6961              TR
Jaro             0.8867             0.7708                0.7947              TR
Jaro-Winkler     0.8846             0.7844                0.7722              TR
Q-gram           0.8867             0.6645                0.5849              TR
Smith-Waterman   0.3434             0.3327                0.3414              TR

Table A.4 Accuracy results for 2300 first name datasets with different typos


A.5: Data of the maximum F score for the 2300 last name datasets with the three different types of typos

Algorithm        F score (Low ER)   F score (Medium ER)   F score (High ER)   Type of typo
Levenshtein      0.89               0.7346                0.734               TFP
Jaro             0.8889             0.7524                0.6918              TFP
Jaro-Winkler     0.8876             0.7124                0.7057              TFP
Q-gram           0.8889             0.6663                0.5807              TFP
Smith-Waterman   0.3533             0.3225                0.3317              TFP
Levenshtein      0.8896             0.7365                0.7422              TLP
Jaro             0.8885             0.7674                0.7405              TLP
Jaro-Winkler     0.8989             0.7409                0.7668              TLP
Q-gram           0.8885             0.6663                0.7162              TLP
Smith-Waterman   0.3537             0.3633                0.39                TLP
Levenshtein      0.8862             0.7265                0.6722              TR
Jaro             0.8862             0.7383                0.7012              TR
Jaro-Winkler     0.8852             0.7252                0.706               TR
Q-gram           0.8862             0.6626                0.5425              TR
Smith-Waterman   0.3118             0.2833                0.2802              TR

Table A.5 Accuracy results for 2300 last name datasets with different typos


A.6: Accuracy relative to the value of threshold on different last name datasets with different error rates for Levenshtein algorithm

Fig. A.1 Accuracy relative to threshold value for Levenshtein algorithm: (a) low error dataset, (b) medium error dataset, (c) high error dataset.


A.7: Accuracy relative to the value of threshold on different last name datasets with different error rates for Jaro algorithm

Fig. A.2 Accuracy relative to threshold value for Jaro algorithm: (a) low error dataset, (b) medium error dataset, (c) high error dataset.


A.8: Accuracy relative to the value of threshold on different last name datasets with different error rates for Jaro-Winkler algorithm

Fig. A.3 Accuracy relative to threshold value for Jaro-Winkler algorithm: (a) low error dataset, (b) medium error dataset, (c) high error dataset.


A.9: Accuracy relative to the value of threshold on different last name datasets with different error rates for Q-Gram algorithm

Fig. A.4 Accuracy relative to threshold value for Q-Gram algorithm: (a) low error dataset, (b) medium error dataset, (c) high error dataset.


A.10: Accuracy relative to the value of threshold on different last name datasets with different error rates for Smith-Waterman algorithm

Fig. A.5 Accuracy relative to threshold value for Smith-Waterman algorithm: (a) low error dataset, (b) medium error dataset, (c) high error dataset.


A.11: Algorithm selection for last name and first name datasets

Size of dataset   Error rate   Algorithm      Threshold value   Type of string
9454              Low          Jaro           0.99              Last name
9454              Medium       Jaro           0.95              Last name
9454              High         Jaro-Winkler   0.95              Last name
7154              Low          Jaro           0.99              Last name
7154              Medium       Jaro-Winkler   0.95              Last name
7154              High         Jaro-Winkler   0.95              Last name
5000              Low          Jaro           0.99              Last name
5000              Medium       Jaro-Winkler   0.95              Last name
5000              High         Jaro-Winkler   0.95              Last name
3600              Low          Jaro           0.99              Last name
3600              Medium       Jaro           0.95              Last name
3600              High         Jaro           0.9               Last name
2300              Low          Jaro           0.9               Last name
2300              Medium       Jaro           0.95              Last name
2300              High         Jaro-Winkler   0.95              Last name
1000              Low          Jaro-Winkler   0.99              Last name
1000              Medium       Jaro           0.95              Last name
1000              High         Jaro-Winkler   0.95              Last name
500               Low          Jaro-Winkler   0.99              Last name
500               Medium       Jaro-Winkler   0.95              Last name
500               High         Jaro           0.9               Last name
200               Low          Jaro-Winkler   0.99              Last name
200               Medium       Levenshtein    0.8               Last name
200               High         Levenshtein    0.75              Last name
2300              Low          Jaro-Winkler   0.99              First name (TLP)
2300              Medium       Jaro           0.95              First name (TLP)
2300              High         Jaro-Winkler   0.95              First name (TLP)
2300              Low          Levenshtein    0.9               First name (TFP)
2300              Medium       Levenshtein    0.8               First name (TFP)
2300              High         Levenshtein    0.8               First name (TFP)
2300              Low          Levenshtein    0.9               First name (TR)
2300              Medium       Jaro-Winkler   0.95              First name (TR)
2300              High         Jaro           0.9               First name (TR)

Table A.6 Algorithm selection and threshold values for last name and first name datasets
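Table A.6 can be read as a lookup table for the algorithm selection mechanism. The sketch below illustrates how such a table might be consulted in code; it encodes only a few of the rows above, and the nearest-size rule it uses is an assumption made for illustration rather than part of the framework:

    # A few rows of Table A.6 encoded as a lookup table (illustrative only).
    # Key: (dataset size, error rate, type of string) -> (algorithm, threshold).
    SELECTION_TABLE = {
        (9454, "low",    "last name"):        ("Jaro",         0.99),
        (9454, "medium", "last name"):        ("Jaro",         0.95),
        (9454, "high",   "last name"):        ("Jaro-Winkler", 0.95),
        (200,  "medium", "last name"):        ("Levenshtein",  0.80),
        (2300, "low",    "first name (TLP)"): ("Jaro-Winkler", 0.99),
    }

    def select_algorithm(size, error_rate, string_type):
        """Pick the nearest tabulated dataset size and return (algorithm, threshold)."""
        sizes = sorted({s for (s, e, t) in SELECTION_TABLE if e == error_rate and t == string_type})
        if not sizes:
            raise KeyError("no entry for this error rate and string type")
        nearest = min(sizes, key=lambda s: abs(s - size))
        return SELECTION_TABLE[(nearest, error_rate, string_type)]

    print(select_algorithm(9000, "high", "last name"))   # ('Jaro-Winkler', 0.95)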


APPENDIX B

B.1: Business entity rules

R1.1 Entity uniqueness rules
  R1.1.1 Primary key rule: every instance of a business entity has its own unique identifier.
  R1.1.2 A primary key can never be NULL.
  R1.1.3 A composite key must be minimal.
  R1.1.4 A composite primary key can contain one or more foreign keys.

R1.2 Entity cardinality rules
  R1.2.1 One-to-one cardinality rule.
  R1.2.2 One-to-many (or many-to-one) cardinality rule.
  R1.2.3 Many-to-many cardinality rule.

R1.3 Entity optionality rules
  R1.3.1 One-to-one optionality rule.
  R1.3.2 One-to-zero (or zero-to-one) optionality rule.
  R1.3.3 Zero-to-zero optionality rule.
  R1.3.4 Every instance of an entity that is being referenced by another entity in the relationship must exist.
  R1.3.5 The reference attribute does not have to be known when an optional relationship is not instantiated, i.e., the foreign key can be NULL on an optional relationship.

Table B.1 Business entity rules


B.2: Business attribute rules

R2.1 Data inheritance rules
  R2.1.1 All generalized business attributes of the supertype are inherited by all subtypes.
  R2.1.2 The unique identifier of the supertype is the same unique identifier of its subtypes.
  R2.1.3 All business attributes of a subtype must be unique to that subtype only.

R2.2 Data domain rules
  R2.2.1 Data values should belong to the given list of values.
  R2.2.2 Data values should be within the given range of values.
  R2.2.3 Data values should conform to the given constraints.
  R2.2.4 Data values should contain only a set of allowable characters.
  R2.2.5 Data values should follow the given patterns.

Table B.2 Business attribute rules
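The data domain rules in R2.2 can be checked directly against individual attribute values. The sketch below illustrates how four of them (R2.2.1, R2.2.2, R2.2.4 and R2.2.5) might be expressed as simple checks; the example domain, range, character set and pattern are invented:

    import re

    def in_value_list(value, allowed):         # R2.2.1: value belongs to a given list
        return value in allowed

    def in_range(value, low, high):            # R2.2.2: value lies within a given range
        return low <= value <= high

    def allowed_characters(value, charset):    # R2.2.4: only allowable characters appear
        return all(ch in charset for ch in value)

    def matches_pattern(value, pattern):       # R2.2.5: value follows a given pattern
        return re.fullmatch(pattern, value) is not None

    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'-"
    print(in_value_list("UK", {"UK", "FR", "DE"}))                       # True
    print(in_range(42, 16, 99))                                          # True
    print(allowed_characters("O'Brien", set(letters)))                   # True
    print(matches_pattern("EH10 5DT", r"[A-Z]{1,2}\d{1,2} \d[A-Z]{2}"))  # True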


B.3: Data dependency rules

R3.1 Entity-relationship rules
  R3.1.1 The existence of a data relationship depends on the state (condition) of another entity that participates in the relationship.
  R3.1.2 The existence of one data relationship mandates that another data relationship also exists.
  R3.1.3 The existence of one data relationship prohibits the existence of another data relationship.

R3.2 Attribute dependency rules
  R3.2.1 The value of one business attribute depends on the state (condition) of the entity in which the attributes exist.
  R3.2.2 The correct value of one attribute depends on, or is derived from, the values of two or more other attributes.
  R3.2.3 The allowable value of one attribute is constrained by the value of one or more other attributes in the same business entity or in a different but related business entity.
  R3.2.4 The existence of one attribute value prohibits the existence of another attribute value in the same business entity or in a different but related business entity.

Table B.3 Data dependency rules


B.4: Data validity rules

R4.1 Data completeness rules
  R4.1.1 All instances exist for all business entities, i.e., all records or rows are present.
  R4.1.2 Referential integrity exists among all referenced business entities.
  R4.1.3 All business attributes for each business entity exist, i.e., all columns are present.
  R4.1.4 All business attributes contain allowable values, including NULL when it is allowed.

R4.2 Data correctness rules
  R4.2.1 All data values for a business attribute must be correct and representative of the attribute's definition.
  R4.2.2 All data values for a business attribute must be correct and representative of the attribute's specific individual domains.
  R4.2.3 All data values for a business attribute must be correct and representative of the attribute's applicable business rules.
  R4.2.4 All data values for a business attribute must be correct and representative of the attribute's supertype inheritance.
  R4.2.5 All data values for a business attribute must be correct and representative of the attribute's identity rule.

R4.3 Data accuracy rules
  R4.3.1 All data values for a business attribute must be accurate in terms of the attribute's dependency rules.
  R4.3.2 All data values for a business attribute must be accurate in terms of the attribute's state in the real world.

R4.4 Data precision rules
  R4.4.1 All data values for a business attribute must be as precise as required by the attribute's business requirements.
  R4.4.2 All data values for a business attribute must be as precise as required by the attribute's business rules.
  R4.4.3 All data values for a business attribute must be as precise as required by the attribute's intended meaning.
  R4.4.4 All data values for a business attribute must be as precise as required by the attribute's intended usage.
  R4.4.5 All data values for a business attribute must be as precise as required by the attribute's precision in the real world.

R4.5 Data uniqueness rules
  R4.5.1 Every business entity instance must be unique.
  R4.5.2 Every business entity must have only one unique identifier.
  R4.5.3 Every business attribute must have only one unique definition.
  R4.5.4 Every business attribute must have only one unique name.
  R4.5.5 Every business attribute must have only one unique domain.

R4.6 Data consistency rules
  R4.6.1 The data values for a business attribute must be consistent when the attribute is duplicated for performance reasons or when it is stored redundantly for any other reason.
  R4.6.2 The duplicated data values of a business attribute must be based on the same domain and on the same data quality rules.

Table B.4 Data validity rules
