Data Quality Measurement Using Data Mining
Total Page:16
File Type:pdf, Size:1020Kb
International Journal of Computer Theory and Engineering, Vol. 2, No. 1 February, 2010 1793-8201 Data Quality Measurement using Data Mining Saeed Farzi, Ahmad Baraani Dastjerdi challenging issues. Our method does not need to extract all Abstract— Regarding to use of correct information in many association rules and we propose exact mathematical formula applications, data quality measurement is very important. Nowadays, many statistical methods for measuring the quality for measuring data quality. of data have been proposed. The major problem of statistical In our method, we propose an algorithm with three steps, methods is to lack of using data nature to measure the quality. which calculates the data quality of transaction. step1: Extract Data mining is another method for measuring the quality of data. association rules, which depend on input transaction (T) and Data mining algorithms extract some knowledge. The extracted are adapted by the functional dependency. Step2: Separate knowledge is used to measure the quality of data. In this paper, we introduce a new method, which uses data mining to extract compatible and incompatible association rules. Step3: some knowledge from database, and then we use it to measure Calculate the quality of input transaction. the quality of input transaction. This paper is organized as follows. Notations and Definitions are explained in Sec.II. Data quality mining is Index Terms—data quality, data mining, input transactions, described in Sec.III. Mathematical formula for measuring data statistical method. quality is introduced in Sec.IV. Experimental study is covered in Sec.V. Concluding remarks are included in Sec.VI. I. INTRODUCTION II. NOTATIONS AND DEFINITIONS The major role of most information systems is to present the D: Database real world in the computer world. People in the organization T: Transaction that is included by the set of items can create products or make decisions with information that is included in information systems. If this information does not I: set of all items in D compatible with the real world, systems will be poor and the MIN_SUPP: minimum support for set of items[4]. organization, which uses them, will be began to act Definition 1(association rule): Let I is set of all items in D irrationality [2],[14],[15], [18]. Therefore, Data quality can be and X,Y are subset of I and so XY is defined as consisting between data in the information system X Y 0,Y I, X I , and that same data in the real world [2],[13],[14]. called an association rule that Y is head and X is body of the Data mining and statistical methods have been used to measure data quality. Statistical methods introduced some association rule [4]. metrics, which they have been calculated by statistical Definition 2(support factor of X): If X is a subset of I, the functions such as average [2]. Mr. Ulrich G¨untze and et al [3] support factor of X (SUPP(X)) will be as follow [4]: have been used data mining algorithms to measure data quality. In their method, There are three steps for measuring data | {T D | X T }| sup p( X ) (1) quality.1) Extract all association rules. 2) Select compatible | D | association rules. 3) Add confidence factor of compatible rules as criteria of data quality of transaction. There are two Definition 3 (support factor of an association rule): If X important challenging issues. First, extracting all association and Y are subsets of I and XY is an association rule, The rules needs a lot of time and next, there is no exact support factor of XY (SUPP(XY)) will be as follow [4]: mathematical formula for measuring data quality. SUPP(X Y) SUPP(X Y) (2) In this paper, we introduce a method for measuring data Definition 4(confidence factor of an association rule): If X quality based on data mining algorithms, which solves above and Y are subsets of I and XY is an association rule, the Manuscript received October 9, 2008. confidence factor of XY (CF(XY)) will be as follow [4]: Saeed Farzi is with Department of Computer Engineering, Islamic Azad CF(X Y) SUPP(X Y)/SUPP(X) (3) University – branch of Kermanshah, Iran (corresponding author to provide phone: +988318247902; fax: +988317243065; e-mail: saeedfarzi@ Definition 5(The association rule is adapted by gmail.com). Ahmad Baraani Dastjerdi is with the Computer Engineering Department, functional dependency): Let R: XY be an association University of Isfahan, Iran. (e-mail: [email protected]). 115 International Journal of Computer Theory and Engineering, Vol. 2, No. 1 February, 2010 1793-8201 rule and FD:AB be a functional dependency where A and B are sets of database attributes, R will be adapted by FD if attributes in X are member of A and attributes in Y are Algorithm 2(extracting association rules which member of B and vice versa. depend on input transaction (T)) Definition 6(association rules depend on a transaction (T)): Step1: Extract power set of items in T If T is a transaction and R: XY is an association rule, R is Step2: Extract the large itemsets (condition for depended on T if X T . each itemset: one of power set member must be Definition 7(compatibility): If T is a transaction and R: XY subset of the large itemset) is an association rule and R will be X T Y T , Step 3: Extract association rules from the large compatible with T. itemsets [definitions 5, 6, 7] (condition: one of Definition 8(incompatibility): If T is a transaction and R: the functional dependencies must adapt XY is an association rule and R will be X T Y T , Association rules. incompatible with T. In Algo.2, the number of large itemsets will be reduced because of a power set member must be subset of the large III. DATA QUALITY MINING itemsets. Also In third step, there is a limitation for extracting The quality of database will be set in a desirable extent in association rules from the large itemsets. It causes to extract long term by considering input transactions in databases and fewer association rules. avoiding not to inter unqualified data. For achieving this goal, Step 2: Separate compatible and incompatible rules. we should evaluate the quality of every transaction that inters In this step, the compatible and incompatible association some data to the database and rollbacks transactions, which rules must be distinguished by Algo.3. have not good quality. In this way, we can measure data quality Algorithm3(separate the compatible and of input transaction with Algo.1 incompatible association rules) Algorithm 1 (Data Quality Measurement) Repeat for all extracted association rules. Step1: Extract association rule, which depends on - If AR: bodyhead is an association rule and the input transaction (T) and is adapted by the 1 head powerset(T) then AR will be functional dependency. Step2: Separate compatible and incompatible rules. compatible rule - If AR: bodyhead is an association rule and Step3: Calculate the quality of input transaction. head powerset(T) then AR will be Step1: Extract association rules, which depend on the input incompatible rule transaction (T) and are adapted by the functional dependency. Step3: Calculate the quality of input transaction Unlike Mr. Ulrich G¨untze and et al [3], in Algo.1 it is not In this step, we use Eq.1 to calculate the quality of input necessary to extract all association rules of database. You transaction. must obtain large itemsets from database and use them to make association rules. Extracting all association rules needs a lot IV. QUALITY MEASUREMENT FUNCTION of time and too much memory (time and space complexity). If All extracted association rule (Algo.2) have been we reduce the number of large itemsets, we can decrease a categorized into the compatible and incompatible rules number of association rules. Therefore, we decrease time and (Algo.3). space complexity. In addition, in Algo.1, we have just Axiom: if n equals the number of association rules depend extracted association rules, which depend on input transaction on input transaction (T) that is extracted from D and they are and are adapted by one of the functional dependency. adapted with functional dependency in D. Eq1 that is given as Therefore, we will reduce the number of large itemsets. follows will calculate the quality of T. nc nnc ' ' i n nc cf cf j i1 j1 Q(T) (4) n Where nc is the number of compatible rules. cf i is the confident factor of the ith compatible association rule and ' cf j is the confident factor of the jth incompatible association 116 International Journal of Computer Theory and Engineering, Vol. 2, No. 1 February, 2010 1793-8201 rules. are shown as follows. Proof: State=Kermanshah Zipcode=831 with cf=1.0 Let r: bodyhead be an association rule with cf as a City=Biseton State=Kermanshah with cf= 1.0 confident factor. Based on definition 2, 3, 4, we can calculate Job=manager Salary=1000 with cf= 1.0 cf (r) as follow. Some of these association rules are compatible and some of them are incompatible with T. Compatible association rules sup(body head) sup (body head) with T are shown as follows (Algo.3) (Definition: 7): cf( r) sup(body) sup (body) State= Kermanshah Zipcode=831 with cf=1.0 {T D|(body head) T} {T D|body T} City= Biseton State= Kermanshah with cf=1.0 ( )/ Incompatible association rules with T are shown as follow DD (Algo.3)(Definition 8): {T D | (body head) T} cf (r) (5) Job=manager Salary=1000 with cf=1.0 {T D | body T} Therefore, By using Eq.1, the quality of the transaction (T) equals 66%.