International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org
Volume 4, Issue 2, March-April 2015    ISSN 2278-6856

Data Cleaning Using Clustering Based Mining Technique

Sujata Joshi1, Usha.T2

1 Assistant Professor, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India.

2 Research Scholar, M.Tech, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India.

Abstract
Data cleaning is one of the basic tasks performed during the process of knowledge discovery in databases, during modification and integration of schemas, and also in the creation of data warehouses. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. In this paper, data quality problems are summarized. An algorithm is implemented using a clustering technique for data standardization and data correction.

Keywords: Data Cleaning, Data quality problems, Attribute correction, Levenshtein distance.

1. INTRODUCTION
Data plays a fundamental role in every software system. In particular, information systems and decision support systems depend heavily on it. Data quality is the crucial factor in data warehouse creation [1] and data integration. Data cleaning or scrubbing is the process of removing errors from the data. It is an inherent activity related to database processing, updating and maintenance. Data fed from the various operational systems prevailing in the different departments/sub-departments of an organization has discrepancies in schemas, formats, semantics, etc. due to numerous factors. These representations may introduce redundancy leading to exact duplicates of records [2], inconsistency where records differ in schemas, formats, abbreviations, etc., and lastly erroneous data. All such unwanted data records are referred to as 'dirty data'.
Our approach focuses on the correction of errors in an attribute. Here, we present a brief overview of the various sources of errors that arise due to machine or human intervention and a summarization of data quality problems. Also, an algorithm based on clustering is implemented for data correction and standardization.

2. SOURCES OF ERRONEOUS DATA
The sources of erroneous data are:
1. Lexical errors [3] are discrepancies between the structure of the data items and the specified format.
2. Syntactical errors represent violations of the overall format.
3. Irregularities are concerned with the non-uniform use of values and abbreviations.
4. Duplicates [4][5] are two or more tuples representing the same entity from the real world. The values of these tuples need not be entirely identical. Inexact duplicates represent the same entity but with different values for all or some of its attributes.
5. Missing values are the result of omissions while collecting the data.
6. Data entry anomalies are errors that occur while the user is entering data into the data pool.

3. DATA QUALITY
Data quality [6] is a state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. High quality data needs to pass a set of quality criteria. The hierarchy of data quality is as shown in Figure-1.

4. DATA QUALITY PROBLEMS
Data quality problems [7] are present in single data collections, such as files and databases, e.g. due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, e.g. in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly. This is because the sources often contain redundant data in different representations.
This section classifies the major data quality problems to be solved by data cleaning and data transformation.


It is roughly distinguished as single-source and multi-source problems, and also as schema- and instance-related problems. Schema-level problems [8] are reflected in the instances; they can be addressed at the schema level by an improved schema design (schema evolution), schema translation and schema integration. Instance-level problems [9], on the other hand, refer to errors and inconsistencies in the actual data contents which are not visible at the schema level. They are the primary focus of data cleaning. Figure-2 shows the categorization of data quality problems in data sources.
The data quality problems for a single source at schema level and instance level are illustrated with examples in Table 1.

Table 1: Examples of single-source problems at schema and instance level

Problem class                                                     | Example
Detection of uniqueness violation                                 | Name="John Smith", SSN="158739"; Name="Kowalski", SSN="158739"
Detection of invalid references (referential integrity violation) | Name="John Smith", DepartmentId=14; Name="Kowalski", DepartmentId=16
Detection of misspellings                                         | Name="John Smith", City="Germany"; Name="Kowalski", City="Germaany"
Detection of duplicate values                                     | Name="John Smith", Born="1978"; Name="J. Smith", Born="1978"
Detection of invalid values                                       | Name="John Smith", Bdate="28-9-1991"; Name="Kowalski", Bdate="8-13-2015"
Detection of inconsistent values                                  | Bdate="28-9-1991", Age="23"; Bdate="8-2-1981", Age="60"
Detection of missing values                                       | Name="John Smith", Phone="9999-999999"
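To make these problem classes concrete, the short Python sketch below shows how two of the instance-level checks from Table 1 (a uniqueness check on SSN and a birth date/age consistency check) could be expressed. The record layout, field names and tolerance are illustrative assumptions, not part of the original method.

from collections import defaultdict
from datetime import date

# Hypothetical records with the fields used in Table 1.
records = [
    {"name": "John Smith", "ssn": "158739", "bdate": date(1991, 9, 28), "age": 23},
    {"name": "Kowalski",   "ssn": "158739", "bdate": date(1981, 2, 8),  "age": 60},
]

# Uniqueness check: the same SSN must not identify two different people.
by_ssn = defaultdict(list)
for r in records:
    by_ssn[r["ssn"]].append(r["name"])
uniqueness_violations = {ssn: names for ssn, names in by_ssn.items() if len(names) > 1}

# Consistency check: the stored age must agree with the birth date.
def age_inconsistent(r, today=date(2015, 4, 1), tolerance=1):
    derived = today.year - r["bdate"].year - (
        (today.month, today.day) < (r["bdate"].month, r["bdate"].day))
    return abs(derived - r["age"]) > tolerance

inconsistent = [r["name"] for r in records if age_inconsistent(r)]
print(uniqueness_violations)  # {'158739': ['John Smith', 'Kowalski']}
print(inconsistent)           # ['Kowalski']: born in 1981 but recorded as 60 years old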

Table 2 and Table 3 show an example of multi-source problems at schema and instance level. The two sources are both in relational format but exhibit schema and data conflicts. At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno, Sex/Gender) and structural conflicts (different representations for names and addresses). At the instance level, we note that there are different gender representations ("0"/"1" vs. "F"/"M") and a duplicate record (John Smith). Solving these problems requires both schema integration and data cleaning; Table 4 shows a possible solution.

Table 2: Customer (source 1)

CID | Name        | Street      | City       | Sex
214 | John Smith  | 2 Harley Pl | South Fork | 1
461 | Mary Thomas | Harley St 2 | S Fork     | 0

Table 3: Client (source 2)

Cno | LastName | FirstName | Gender | Address
153 | Smith    | Kowalski  | M      | 23 Harley Street, Chicago
186 | Smith    | John      | M      | 2 Harley Place, South Fork
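As an illustration of the kind of integrated result that Table 4 refers to, the sketch below maps both sources into one common schema, normalizes the gender codes and flags the duplicate John Smith record. The target schema, the 0/1-to-F/M mapping and the naive duplicate key are assumptions made for this example, since Table 4 itself is not reproduced here.

# Both sources expressed as Python dictionaries (data copied from Tables 2 and 3).
customers = [  # source 1: CID, Name, Street, City, Sex coded as 0/1
    {"CID": 214, "Name": "John Smith",  "Street": "2 Harley Pl", "City": "South Fork", "Sex": 1},
    {"CID": 461, "Name": "Mary Thomas", "Street": "Harley St 2", "City": "S Fork",     "Sex": 0},
]
clients = [  # source 2: Cno, LastName, FirstName, Gender coded as M/F, Address
    {"Cno": 153, "LastName": "Smith", "FirstName": "Kowalski", "Gender": "M",
     "Address": "23 Harley Street, Chicago"},
    {"Cno": 186, "LastName": "Smith", "FirstName": "John", "Gender": "M",
     "Address": "2 Harley Place, South Fork"},
]

def from_customer(r):
    # Split the single Name field and map the 0/1 code to F/M.
    first, _, last = r["Name"].partition(" ")
    return {"source_id": ("S1", r["CID"]), "first": first, "last": last,
            "gender": "M" if r["Sex"] == 1 else "F",
            "address": r["Street"] + ", " + r["City"]}

def from_client(r):
    return {"source_id": ("S2", r["Cno"]), "first": r["FirstName"], "last": r["LastName"],
            "gender": r["Gender"], "address": r["Address"]}

integrated = [from_customer(r) for r in customers] + [from_client(r) for r in clients]

# Naive duplicate flagging on (first, last): the John Smith record from both sources is linked.
seen = {}
for rec in integrated:
    key = (rec["first"].lower(), rec["last"].lower())
    rec["duplicate_of"] = seen.get(key)
    seen.setdefault(key, rec["source_id"])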

5. METHODOLOGY
In this paper we implement an attribute correction method using a clustering technique. Attribute correction [10] solutions require reference data in order to provide satisfying results. In this algorithm, all the record attributes are examined and cleaned in isolation, without regard to the values of the other attributes of a given record.



The main idea behind this algorithm is based on the observation that in most data sets there is a certain number of values having a large number of occurrences within the data set and a very large number of attribute values with a very low number of occurrences. Therefore, the most representative values may be the source of reference data. The values with a low number of occurrences are noise or misspelled instances of the reference data. Table 5 shows the attribute and its occurrence frequency. Here, since "Asymptomatic" occurs most frequently, it is taken as the reference data set. All others are discarded since they have a low frequency count.

Table 5: Example of Chest_pain_type attribute distribution

Chest_pain_type | Number of occurrences
Asymptomatic    | 2184
Asmytomatic     | 6
Asmythmatics    | 3
Assymtomatics   | 1
Asympotmatic    | 1
Asymptomac      | 1

The algorithm uses two parameters:
1. Distance threshold (distThresh): the distance below which two values are marked as similar and related.
2. Occurrence relation (occRel): used to determine whether both compared values belong to the reference data set.

To measure the distance between two values, a modified Levenshtein distance is used. The Levenshtein distance [11]-[13] for two strings is the number of text edit operations (insertion, deletion, exchange) needed to transform one string into another. For instance, the Levenshtein distance between "Asymptomatic" and "Asmyptomatic" is 2. The algorithm for attribute correction utilizes a modified Levenshtein distance, denoted Lev', defined as

Lev'(s1, s2) = (1/2) * ( Lev(s1, s2) / ||s1|| + Lev(s1, s2) / ||s2|| )

where ||s1|| and ||s2|| denote the lengths of the two strings.
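For concreteness, a minimal Python sketch of the plain Levenshtein distance and the length-normalized (modified) variant defined above follows; this is our reading of the formula rather than code taken from the paper.

def levenshtein(s1, s2):
    # Number of insertions, deletions and exchanges needed to turn s1 into s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # exchange
        prev = curr
    return prev[-1]

def modified_levenshtein(s1, s2):
    # Length-normalized variant: 0.5 * (Lev/||s1|| + Lev/||s2||).
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("Asymptomatic", "Asmyptomatic"))                     # 2
print(round(modified_levenshtein("Asymptomatic", "Asmyptomatic"), 3))  # 0.167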

The algorithm consists of the following steps:
1. Preliminary cleaning – all attributes are transformed into uppercase or lowercase.
2. The number of occurrences of each value in the cleaned data set is calculated.
3. Each value is assigned to a separate cluster. The cluster element having the higher number of occurrences is denoted as the cluster representative.
4. The cluster list is sorted in descending order according to the number of occurrences of each cluster representative.
5. Starting with the first cluster, each cluster is compared with the other clusters from the list, in the order defined by the number of occurrences of the cluster representatives. The distance between two clusters is defined as the modified Levenshtein distance between the cluster representatives.
6. If the distance is lower than the distThresh parameter and the ratio of occurrences of the cluster representatives is greater than or equal to the occRel parameter, the clusters are merged.
7. After all the clusters are compared, the clusters are examined to check whether they contain values whose distance to the cluster representative is above the threshold value. If so, these values are removed from the cluster and added to the cluster list as separate clusters.
8. Steps 4-7 are repeated until there are no changes in the cluster list, i.e. no clusters are merged and no new clusters are created.
9. The cluster representative becomes the reference data set, and the cluster defines the transformation rule: the values in a given cluster should be replaced with the value of the cluster representative.

Table 6 shows the example transformation rules discovered during the execution of the above algorithm.

Table 6: Example of corrected values

Original value | Correct value
Asmytomatic    | Asymptomatic
Asmythmatics   | Asymptomatic
Assymtomatics  | Asymptomatic
Asympotmatic   | Asymptomatic
Asymptomac     | Asymptomatic
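The following Python sketch is one possible reading of steps 1-9 above (step 7, which splits outlying members back out of a cluster, is omitted for brevity). The cluster representation, the default thresholds and the interpretation of occRel as the ratio between representative occurrences are assumptions made for illustration; modified_levenshtein is the helper from the previous sketch.

from collections import Counter

def correct_attribute(values, dist_thresh=0.3, occ_rel=10):
    # Steps 1-2: preliminary cleaning (lowercase) and occurrence counting.
    counts = Counter(v.strip().lower() for v in values)
    # Step 3: one cluster per distinct value; the value itself is the representative.
    clusters = [{"rep": v, "occ": n, "members": {v}} for v, n in counts.items()]
    changed = True
    while changed:                        # step 8: repeat until no clusters are merged
        changed = False
        # Step 4: sort by representative occurrences, descending.
        clusters.sort(key=lambda c: c["occ"], reverse=True)
        # Steps 5-6: merge a rarer, similar cluster into a dominant one.
        for big in list(clusters):
            if big not in clusters:
                continue
            for small in list(clusters):
                if small is big or small not in clusters:
                    continue
                close = modified_levenshtein(big["rep"], small["rep"]) < dist_thresh
                dominant = big["occ"] / max(small["occ"], 1) >= occ_rel
                if close and dominant:
                    big["members"] |= small["members"]
                    big["occ"] += small["occ"]
                    clusters.remove(small)
                    changed = True
    # Step 9: each value is rewritten to its cluster representative (cf. Table 6).
    rules = {member: c["rep"] for c in clusters for member in c["members"]}
    return [rules[v.strip().lower()] for v in values]

# Example: correct_attribute(["Asymptomatic"] * 2184 + ["Asmytomatic"] * 6)
# maps every "Asmytomatic" to "asymptomatic" (the lower-cased representative).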

6. CONCLUSION
Data cleaning is a key precondition for the analysis of data in decision support systems and data warehouses. High data quality is a general requirement in current information system construction. In order to provide access to accurate and consistent data, data cleaning becomes necessary. This paper presents an overview of the categorization of data quality problems in single and multiple data sources. We also implemented a clustering technique for data standardization and correction of an attribute using the Levenshtein distance. The technique was applied to the data and the results obtained are shown in Table 6.

REFERENCES
[1] Lee, M.L., Lu, H., Ling, T.W., Ko, Y.T., "Cleansing Data for Mining and Warehousing", Proc. 10th Intl. Conf. Database and Expert Systems Applications (DEXA), 1999.
[2] Mauricio Hernandez, Salvatore Stolfo, "Real World Data Is Dirty: Data Cleansing and The Merge/Purge Problem", Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


[3] Huang Yu, Zhang Xiao-yi, Yuan Zhen, Jiang Guo-quan, "A Universal Data Cleaning Framework Based on User Model", 2009 ISECS.
[4] H.H. Shahri, S.H. Shahri, "Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework", IEEE Intelligent Systems, Volume 21, Issue 5, Sept.-Oct. 2006, pp. 63-71.
[5] Monge, A.E., "Matching Algorithms within a Duplicate Detection System", IEEE Data Engineering Bulletin, 23(4), 2000.
[6] Paul Jermyn, Maurice Dixon, Brian J. Read, "Preparing Clean Views of Data for Data Mining".
[7] Erhard Rahm, Hong Hai Do, "Data Cleaning: Problems and Current Approaches", IEEE Data Engineering Bulletin, 23(4):3-13, 2000.
[8] KDnuggets Polls, "Data Preparation Part in Data Mining Projects", Sep 30-Oct 12, 2003. http://www.kdnuggets.com/polls/2003/data_preparation.htm
[9] Wang, Y.R., Madnick, S.E., "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems", Proceedings of the Fifth International Conference on Data Engineering, IEEE Computer Society, February 6-10, 1989, Los Angeles, California, USA, pp. 46-55.
[10] Lukasz Ciszak, "Application of Clustering and Association Methods in Data Cleaning", 978-83-60810-14-9/08, 2008 IEEE.
[11] M. Bilenko, R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures", ACM SIGKDD, pp. 39-48, 2003.
[12] Monge, A.E., Elkan, C.P., "The Field Matching Problem: Algorithms and Applications", Proc. 2nd Intl. Conf. Knowledge Discovery and Data Mining (KDD), 1996.
[13] W. Cohen, P. Ravi Kumar, S. Fienberg, "A Comparison of String Metrics for Name-Matching Tasks", Proceedings of IJCAI-2003.

AUTHOR

Sujata Joshi received the B.E. degree in Computer Science and Engineering from B.V.B. College of Engineering and Technology, Hubli, in 1995 and the M.Tech. degree in Computer Science and Engineering from M.S. Ramaiah Institute of Technology, Bangalore, in 2007. She is currently working as an Assistant Professor in the Department of Computer Science and Engineering at Nitte Meenakshi Institute of Technology, Bangalore, and is pursuing a Ph.D. in the area of data mining under Visvesvaraya Technological University, Belagavi.
