International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org
Volume 4, Issue 2, March-April 2015    ISSN 2278-6856

Data Cleaning Using Clustering Based Mining Technique

Sujata Joshi1, Usha.T2

1 Assistant Professor, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India.

2 Research Scholar, M.Tech, Dept. of CSE, Nitte Meenakshi Institute of Technology, Bangalore, India.

Abstract
Data cleaning is one of the basic tasks performed during the process of knowledge discovery in databases, during modification and integration of schemas, and also in the creation of data warehouses. Data cleaning, also called data cleansing or scrubbing, deals with detecting and removing errors and inconsistencies from data in order to improve the quality of data. In this paper, data quality problems are summarized. An algorithm is implemented using a clustering technique for data standardization and data correction.

Keywords: Data Cleaning, Data quality problems, Attribute correction, Levenshtein distance.

1. INTRODUCTION
Data plays a fundamental role in every software system. In particular, information systems and decision support systems depend heavily on it. Data quality is the crucial factor in data warehouse creation [1] and data integration. Data cleaning or scrubbing is the process of removing errors from the data. It is an inherent activity related to database processing, updating and maintenance. Data fed from the various operational systems prevailing in the different departments/sub-departments of an organization has discrepancies in schemas, formats, semantics, etc. due to numerous factors. These representations may introduce redundancy leading to exact duplicates of records [2], inconsistency where records differ in schemas, formats, abbreviations, etc., and lastly erroneous data. All such unwanted data records are referred to as 'dirty data'.
Our approach focuses on the correction of errors in an attribute. Here, we present a brief overview of the various sources of errors that arise due to machine or human intervention and a summarization of data quality problems. Also, an algorithm based on clustering is implemented for data correction and standardization.

2. SOURCES OF ERRONEOUS DATA
The sources of erroneous data are:
1. Lexical errors [3] are discrepancies between the structure of the data items and the specified format.
2. Syntactical errors represent violations of the overall format.
3. Irregularities are concerned with the non-uniform use of values and abbreviations.
4. Duplicates [4][5] are two or more tuples representing the same entity from the real world. The values of these tuples need not be entirely identical. Inexact duplicates represent the same entity but with different values for all or some of its attributes.
5. Missing values are the result of omissions while collecting the data.
6. Data entry anomalies are errors that occur while the user is entering data into the data pool.

3. DATA QUALITY
Data quality [6] is a state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use. High quality data needs to pass a set of quality criteria. The hierarchy of data quality is as shown in Figure-1.

4. DATA QUALITY PROBLEMS
Data quality problems [7] are present in single data collections, such as files and databases, e.g. due to misspellings during data entry, missing information or other invalid data. When multiple data sources need to be integrated, e.g. in data warehouses, federated database systems or global web-based information systems, the need for data cleaning increases significantly. This is because the sources often contain redundant data in different representations.
This section classifies the major data quality problems to be solved by data cleaning and data transformation.


It is roughly distinguished as single-source and multi-source problems, and also as schema- and instance-related problems. Schema-level problems [8] are reflected in the instances; they can be addressed at the schema level by an improved schema design (schema evolution), schema translation and schema integration. Instance-level problems [9], on the other hand, refer to errors and inconsistencies in the actual data contents which are not visible at the schema level. They are the primary focus of data cleaning. Figure-2 shows the categorization of data quality problems in data sources.
The data quality problems for a single source at schema level and instance level are illustrated with examples in Table 1.

Table 1: Examples of single-source problems at schema and instance level

Problem class                                                     | Example
Detection of uniqueness violation                                 | Name="John Smith", SSN="158739"; Name="Kowalski", SSN="158739"
Detection of invalid references (referential integrity violation) | Name="John Smith", DepartmentId=14; Name="Kowalski", DepartmentId=16
Detection of misspellings                                         | Name="John Smith", City="Germany"; Name="Kowalski", City="Germaany"
Detection of duplicate values                                     | Name="John Smith", Born="1978"; Name="J. Smith", Born="1978"
Detection of invalid values                                       | Name="John Smith", Bdate="28-9-1991"; Name="Kowalski", Bdate="8-13-2015"
Detection of inconsistent values                                  | Bdate="28-9-1991", Age="23"; Bdate="8-2-1981", Age="60"
Detection of missing values                                       | Name="John Smith", Phone="9999-999999"
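To make these problem classes concrete, the short Python sketch below shows how two of the instance-level checks from Table 1 (a uniqueness check on SSN and a birth date/age consistency check) could be expressed. The record layout, field names and tolerance are illustrative assumptions, not part of the original method.

from collections import defaultdict
from datetime import date

# Hypothetical records with the fields used in Table 1.
records = [
    {"name": "John Smith", "ssn": "158739", "bdate": date(1991, 9, 28), "age": 23},
    {"name": "Kowalski",   "ssn": "158739", "bdate": date(1981, 2, 8),  "age": 60},
]

# Uniqueness check: the same SSN must not identify two different people.
by_ssn = defaultdict(list)
for r in records:
    by_ssn[r["ssn"]].append(r["name"])
uniqueness_violations = {ssn: names for ssn, names in by_ssn.items() if len(names) > 1}

# Consistency check: the stored age must agree with the birth date.
def age_inconsistent(r, today=date(2015, 4, 1), tolerance=1):
    derived = today.year - r["bdate"].year - (
        (today.month, today.day) < (r["bdate"].month, r["bdate"].day))
    return abs(derived - r["age"]) > tolerance

inconsistent = [r["name"] for r in records if age_inconsistent(r)]
print(uniqueness_violations)  # {'158739': ['John Smith', 'Kowalski']}
print(inconsistent)           # ['Kowalski']: born in 1981 but recorded as 60 years old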

Table 2 and Table 3 show an example of multi-source problems at schema and instance level. The two sources are both in relational format but exhibit schema and data conflicts. At the schema level, there are name conflicts (synonyms Customer/Client, Cid/Cno, Sex/Gender) and structural conflicts (different representations for names and addresses). At the instance level, we note that there are different gender representations ("0"/"1" vs. "F"/"M") and a duplicate record (John Smith). Solving these problems requires both schema integration and data cleaning; Table 4 shows a possible solution.

Table 2: Customer (source 1)

CID | Name        | Street      | City       | Sex
214 | John Smith  | 2 Harley Pl | South Fork | 1
461 | Mary Thomas | Harley St 2 | S Fork     | 0

Table 3: Client (source 2)

Cno | LastName | FirstName | Gender | Address
153 | Smith    | Kowalski  | M      | 23 Harley Street, Chicago
186 | Smith    | John      | M      | 2 Harley Place, South Fork
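As an illustration of the kind of integrated result that Table 4 refers to, the sketch below maps both sources into one common schema, normalizes the gender codes and flags the duplicate John Smith record. The target schema, the 0/1-to-F/M mapping and the naive duplicate key are assumptions made for this example, since Table 4 itself is not reproduced here.

# Both sources expressed as Python dictionaries (data copied from Tables 2 and 3).
customers = [  # source 1: CID, Name, Street, City, Sex coded as 0/1
    {"CID": 214, "Name": "John Smith",  "Street": "2 Harley Pl", "City": "South Fork", "Sex": 1},
    {"CID": 461, "Name": "Mary Thomas", "Street": "Harley St 2", "City": "S Fork",     "Sex": 0},
]
clients = [  # source 2: Cno, LastName, FirstName, Gender coded as M/F, Address
    {"Cno": 153, "LastName": "Smith", "FirstName": "Kowalski", "Gender": "M",
     "Address": "23 Harley Street, Chicago"},
    {"Cno": 186, "LastName": "Smith", "FirstName": "John", "Gender": "M",
     "Address": "2 Harley Place, South Fork"},
]

def from_customer(r):
    # Split the single Name field and map the 0/1 code to F/M.
    first, _, last = r["Name"].partition(" ")
    return {"source_id": ("S1", r["CID"]), "first": first, "last": last,
            "gender": "M" if r["Sex"] == 1 else "F",
            "address": r["Street"] + ", " + r["City"]}

def from_client(r):
    return {"source_id": ("S2", r["Cno"]), "first": r["FirstName"], "last": r["LastName"],
            "gender": r["Gender"], "address": r["Address"]}

integrated = [from_customer(r) for r in customers] + [from_client(r) for r in clients]

# Naive duplicate flagging on (first, last): the John Smith record from both sources is linked.
seen = {}
for rec in integrated:
    key = (rec["first"].lower(), rec["last"].lower())
    rec["duplicate_of"] = seen.get(key)
    seen.setdefault(key, rec["source_id"])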

5. METHODOLOGY
In this paper we implement an attribute correction method using a clustering technique. Attribute correction [10] solutions require reference data in order to provide satisfying results. In this algorithm, all the record attributes are examined and cleaned in isolation, without regard to the values of the other attributes of a given record.



The main idea behind this algorithm is based on the observation that in most data sets there is a certain number of values having a large number of occurrences within the data set and a very large number of attribute values with a very low number of occurrences. Therefore, the most representative values may be the source of reference data. The values with a low number of occurrences are noise or misspelled instances of the reference data. Table 5 shows the attribute and its occurrence frequency. Here, since "Asymptomatic" occurs most frequently, it is taken as the reference data set. All others are discarded since they have a low frequency count.

Table 5: Example of Chest_pain_type attribute distribution

Chest_pain_type | Number of occurrences
Asymptomatic    | 2184
Asmytomatic     | 6
Asmythmatics    | 3
Assymtomatics   | 1
Asympotmatic    | 1
Asymptomac      | 1

The algorithm uses two parameters:
1. Distance threshold (distThresh): the distance below which two values are marked as similar and related.
2. Occurrence relation (occRel): used to determine whether both compared values belong to the reference data set.

To measure the distance between two values, a modified Levenshtein distance is used. The Levenshtein distance [11]-[13] for two strings is the number of text edit operations (insertion, deletion, exchange) needed to transform one string into another. For instance, the Levenshtein distance between "Asymptomatic" and "Asmyptomatic" is 2. The algorithm for attribute correction utilizes a modified Levenshtein distance, denoted Lev', defined as

Lev'(s1, s2) = (1/2) * ( Lev(s1, s2) / ||s1|| + Lev(s1, s2) / ||s2|| )

where ||s1|| and ||s2|| denote the lengths of the two strings.
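For concreteness, a minimal Python sketch of the plain Levenshtein distance and the length-normalized (modified) variant defined above follows; this is our reading of the formula rather than code taken from the paper.

def levenshtein(s1, s2):
    # Number of insertions, deletions and exchanges needed to turn s1 into s2.
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (c1 != c2)))   # exchange
        prev = curr
    return prev[-1]

def modified_levenshtein(s1, s2):
    # Length-normalized variant: 0.5 * (Lev/||s1|| + Lev/||s2||).
    d = levenshtein(s1, s2)
    return 0.5 * (d / len(s1) + d / len(s2))

print(levenshtein("Asymptomatic", "Asmyptomatic"))                     # 2
print(round(modified_levenshtein("Asymptomatic", "Asmyptomatic"), 3))  # 0.167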

The algorithm consists of the following steps:
1. Preliminary cleaning – all attributes are transformed into uppercase or lowercase.
2. The number of occurrences of each value in the cleaned data set is calculated.
3. Each value is assigned to a separate cluster. The cluster element having the higher number of occurrences is denoted as the cluster representative.
4. The cluster list is sorted in descending order according to the number of occurrences of each cluster representative.
5. Starting with the first cluster, each cluster is compared with the other clusters from the list, in the order defined by the number of occurrences of the cluster representatives. The distance between two clusters is defined as the modified Levenshtein distance between the cluster representatives.
6. If the distance is lower than the distThresh parameter and the ratio of occurrences of the cluster representatives is greater than or equal to the occRel parameter, the clusters are merged.
7. After all the clusters are compared, the clusters are examined to check whether they contain values whose distance to the cluster representative is above the threshold value. If so, these values are removed from the cluster and added to the cluster list as separate clusters.
8. Steps 4-7 are repeated until there are no changes in the cluster list, i.e. no clusters are merged and no new clusters are created.
9. The cluster representative becomes the reference data set, and the cluster defines the transformation rule: the values in a given cluster should be replaced with the value of the cluster representative.

Table 6 shows the example transformation rules discovered during the execution of the above algorithm.

Table 6: Example of corrected values

Original value | Correct value
Asmytomatic    | Asymptomatic
Asmythmatics   | Asymptomatic
Assymtomatics  | Asymptomatic
Asympotmatic   | Asymptomatic
Asymptomac     | Asymptomatic
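The following Python sketch is one possible reading of steps 1-9 above (step 7, which splits outlying members back out of a cluster, is omitted for brevity). The cluster representation, the default thresholds and the interpretation of occRel as the ratio between representative occurrences are assumptions made for illustration; modified_levenshtein is the helper from the previous sketch.

from collections import Counter

def correct_attribute(values, dist_thresh=0.3, occ_rel=10):
    # Steps 1-2: preliminary cleaning (lowercase) and occurrence counting.
    counts = Counter(v.strip().lower() for v in values)
    # Step 3: one cluster per distinct value; the value itself is the representative.
    clusters = [{"rep": v, "occ": n, "members": {v}} for v, n in counts.items()]
    changed = True
    while changed:                        # step 8: repeat until no clusters are merged
        changed = False
        # Step 4: sort by representative occurrences, descending.
        clusters.sort(key=lambda c: c["occ"], reverse=True)
        # Steps 5-6: merge a rarer, similar cluster into a dominant one.
        for big in list(clusters):
            if big not in clusters:
                continue
            for small in list(clusters):
                if small is big or small not in clusters:
                    continue
                close = modified_levenshtein(big["rep"], small["rep"]) < dist_thresh
                dominant = big["occ"] / max(small["occ"], 1) >= occ_rel
                if close and dominant:
                    big["members"] |= small["members"]
                    big["occ"] += small["occ"]
                    clusters.remove(small)
                    changed = True
    # Step 9: each value is rewritten to its cluster representative (cf. Table 6).
    rules = {member: c["rep"] for c in clusters for member in c["members"]}
    return [rules[v.strip().lower()] for v in values]

# Example: correct_attribute(["Asymptomatic"] * 2184 + ["Asmytomatic"] * 6)
# maps every "Asmytomatic" to "asymptomatic" (the lower-cased representative).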

6. CONCLUSION
Data cleaning is a key precondition for the analysis of data in decision support systems and data warehouses. High data quality is a general requirement in current information system construction. In order to provide access to accurate and consistent data, data cleaning becomes necessary. This paper presents an overview of the categorization of data quality problems in single and multiple data sources. We also implemented a clustering technique for data standardization and correction of an attribute using the Levenshtein distance. The technique was applied to the data and the results obtained are shown in Table 6.

REFERENCES
[1] Lee, M.L., Lu, H., Ling, T.W., Ko, Y.T., "Cleansing Data for Mining and Warehousing", Proc. 10th Intl. Conf. Database and Expert Systems Applications (DEXA), 1999.
[2] Mauricio Hernandez, Salvatore Stolfo, "Real World Data Is Dirty: Data Cleansing and The Merge/Purge Problem", Journal of Data Mining and Knowledge Discovery, 1(2), 1998.


[3] Huang Yu, Zhang Xiao-yi, Yuan Zhen, Jiang Guo-quan, "A Universal Data Cleaning Framework Based on User Model", 2009 ISECS.
[4] H.H. Shahri, S.H. Shahri, "Eliminating Duplicates in Information Integration: An Adaptive, Extensible Framework", IEEE Intelligent Systems, Volume 21, Issue 5, Sept.-Oct. 2006, pp. 63-71.
[5] Monge, A.E., "Matching Algorithms within a Duplicate Detection System", IEEE Data Engineering Bulletin, 23(4), 2000.
[6] Paul Jermyn, Maurice Dixon, Brian J. Read, "Preparing Clean Views of Data for Data Mining".
[7] Erhard Rahm, Hong Hai Do, "Data Cleaning: Problems and Current Approaches", IEEE Data Engineering Bulletin, 23(4):3-13, 2000.
[8] KDnuggets Polls, "Data Preparation Part in Data Mining Projects", Sep 30-Oct 12, 2003. http://www.kdnuggets.com/polls/2003/data_preparation.htm
[9] Wang, Y.R., Madnick, S.E., "The Inter-Database Instance Identification Problem in Integrating Autonomous Systems", Proceedings of the Fifth International Conference on Data Engineering, IEEE Computer Society, February 6-10, 1989, Los Angeles, California, USA, pp. 46-55.
[10] Lukasz Ciszak, "Application of Clustering and Association Methods in Data Cleaning", 978-83-60810-14-9/08, 2008 IEEE.
[11] M. Bilenko, R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures", ACM SIGKDD, pp. 39-48, 2003.
[12] Monge, A.E., Elkan, C.P., "The Field Matching Problem: Algorithms and Applications", Proc. 2nd Intl. Conf. Knowledge Discovery and Data Mining (KDD), 1996.
[13] W. Cohen, P. Ravi Kumar, S. Fienberg, "A Comparison of String Metrics for Name-Matching Tasks", Proceedings of IJCAI-2003.

AUTHOR

Sujata Joshi received the B.E. degree in Computer Science and Engineering from B.V.B. College of Engineering and Technology, Hubli, in 1995 and the M.Tech. degree in Computer Science and Engineering from M.S. Ramaiah Institute of Technology, Bangalore, in 2007. She is currently working as an Assistant Professor in the Department of Computer Science and Engineering at Nitte Meenakshi Institute of Technology, Bangalore, and is pursuing a Ph.D. in the area of data mining under Visvesvaraya Technological University, Belagavi.
