
Automated Detection of Outliers in Real-World Data

Mark Last
Department of Information Systems Engineering
Ben-Gurion University of the Negev
Beer-Sheva 84105, Israel
E-mail: [email protected]

Abraham Kandel
Department of Computer Science and Engineering
University of South Florida
4202 E. Fowler Avenue, ENB 118
Tampa, FL 33620, USA
E-mail: [email protected]

Abstract: Most real-world databases include a certain amount of exceptional values, generally termed "outliers". The isolation of outliers is important both for improving the quality of original data and for reducing the impact of outlying values in the process of knowledge discovery in databases. Most existing methods of outlier detection are based on manual inspection of graphically represented data. In this paper, we present a new approach to automating the process of detecting and isolating outliers. The process is based on modeling the human perception of exceptional values by using the fuzzy set theory. Separate procedures are developed for detecting outliers in discrete and continuous univariate data. The outlier detection procedures are demonstrated on several standard datasets of varying data quality.

Keywords: Outlier detection, data quality, knowledge discovery in databases, fuzzy set theory.

1. Introduction

The statistical definition of an "outlier" depends on the underlying distribution of the variable in question. Thus, Mendenhall et al. [9] apply the term "outliers" to values "that lie very far from the middle of the distribution in either direction". This intuitive definition is certainly limited to continuously valued variables having a smooth function of probability density. However, the numeric distance is not the only consideration in detecting continuous outliers. The importance of outlier frequency is emphasized in a slightly different definition, provided by Pyle [12]: "An outlier is a single, or very low frequency, occurrence of the value of a variable that is far away from the bulk of the values of the variable". The frequency of occurrence should be an important criterion for detecting outliers in categorical (nominal) data, which is quite common in real-world databases. A more general definition of an outlier is given in [1]: an observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data.

The real cause of outlier occurrence is usually unknown to data users and/or analysts. Sometimes it is a flawed value resulting from the poor quality of a data set, i.e., a data entry or a data conversion error. Physical measurements, especially when performed with malfunctioning equipment, may produce a certain amount of distorted values. In these cases, no useful information is conveyed by the outlier value. However, it is also possible that an outlier represents correct, though exceptional, information [9]. For example, if clusters of outliers result from fluctuations in the behavior of a controlled process, their values are important for process monitoring. According to the approach suggested by [12], recorded measurements should be considered correct unless shown to be definite errors.

1.1. Why should outliers be isolated?

The main reason for isolating outliers is associated with data quality assurance. Exceptional values are more likely to be incorrect. According to the definition given by Wand and Wang [14], unreliable data represents an unconformity between the state of the database and the state of the real world. For a variety of database applications, the amount of erroneous data may reach ten percent or even more [15]. Thus, removing or replacing outliers can improve the quality of stored data.

Isolating outliers may also have a positive impact on the results of statistical analysis and data mining. Simple statistical estimates, like the sample mean and standard deviation, can be significantly biased by individual outliers that are far away from the middle of the distribution. In


regression models, the outliers can affect the estimated correlation coefficient [10]. The presence of outliers in training and testing data can also create several difficulties for the decision-tree learning methods described by Mitchell in [11]. For example, using an outlying value of a predicting nominal attribute can unnecessarily increase the number of decision tree branches associated with that attribute. In turn, this will lead to inaccurate calculation of the attribute selection criterion (e.g., information gain). Consequently, the predictive accuracy of the resulting decision tree may be decreased. As emphasized in [12], isolating outliers is an important step in preparing a data set for any kind of data analysis.

1.2. Outlier detection and treatment

Manual inspection of scatter plots is the most common approach to outlier detection [10], [12]. Making an analogy with unsupervised and supervised methods of machine learning [11], two types of detection methods can be distinguished: univariate methods, which examine each variable individually, and multivariate methods, which take into account associations between variables in the same dataset. In [12], a univariate method of detecting outliers is described: a value is considered an outlier if it is far away from other values of the same attribute. However, some very definite outliers can be detected only by examining the values of other attributes. An example is given in [10], where one data point stands clearly apart from a bivariate relationship formed by the other points. Such an outlier can be detected only by a multivariate method, since it is based on the dependency between two variables.

Manual detection of outliers suffers from the two basic limitations of manual methods: subjectiveness and poor scalability (see [6]). Analysts have to apply their own subjective perception to determine parameters like "very far away" and "low frequency". Manual inspection of scatter plots for every variable is also an extremely time-consuming task, not suitable for most commercial databases containing hundreds of numeric and nominal attributes.

An objective, quantitative approach to unsupervised detection of numeric outliers is described in [9]. It is based on the graphical technique of constructing a box plot, which represents the median of all the observations and two hinges, or medians of each half of the data set. Most values are expected in the interquartile range (H) located between the two hinges. Values lying outside the ±1.5H range are termed "mild outliers" and values outside the boundaries of ±3H are termed "extreme outliers". While this method represents a practical alternative to manual inspection of each box plot, it can deal only with continuous variables characterized by unimodal probability distributions. The other limitation is imposed by the ternary classification of all values into "extreme outliers", "mild outliers", and "non-outliers": the classification changes abruptly when a value moves across one of the 1.5H or 3H boundaries.

An information-theoretic approach to supervised detection of erroneous data has been developed by Guyon et al. in [4]. The method requires building a prediction model by using one of the data mining techniques (e.g., neural networks or decision trees). The most "surprising" patterns (having the lowest probability of being predicted correctly by the model) are suspected to be unreliable and should be treated as outliers. However, this approach ignores the fact that data conformity may also depend on the inherent distribution of database attributes and on some subjective, user-dependent factors.

Zadeh [16] applies the fuzzy set theory to calculating the usuality of given patterns. The notion of usuality is closely related to the concept of disposition (a proposition which is preponderantly, but not necessarily, true). In [16], the fuzzy quantifier usually is represented by a fuzzy number of the same form as most. One example of a disposition is "usually it takes about one hour to drive from A to B". The fuzzy set of normal (or regular) values is considered the complement of a set of exceptions. In the above example, a five-hour trip from A to B would have a higher membership grade in the set of exceptions than in the set of normal values.

In [8], we have presented an advanced method, based on information theory and fuzzy logic, for measuring the reliability of multivariate data. The fuzzy degree of reliability depends on two factors:
1) The distance between the value predicted by a data mining model and the actual value of an attribute.
2) The user perception of "unexpected" data.
Rather than partitioning the attribute values into two "crisp" sets of "outliers" and "non-outliers", the fuzzy logic approach assigns a continuous degree of reliability (ranging between 0.0 and


1.0) to each value. In this paper, we extend the approach of [8] to the detection of outliers in univariate data.

No matter how the outliers are detected, they should be handled before applying a data mining procedure to the data set. One trivial approach is to discard an entire record if it includes at least one outlying value. This is similar to the way some data mining methods ignore records containing missing values. An alternative is to correct ("rectify") an outlier to another value. Thus, Pyle [12] suggests a procedure for remapping the variable's outlying value to the range of valid values. Outliers detected by a supervised method, based on the linear regression model, can be adjusted iteratively by assigning to each observation a weight depending on its residual from the fitted value [3]. Actually, if the outlying value is assumed completely erroneous, the correct value can be estimated by any method for estimating missing attribute values (see [11]).

1.3. Paper organization

This paper is organized as follows. In Section 2, we present novel, fuzzy-based methods for univariate detection of outliers in discrete and continuous attributes. In Section 3, the methods are applied to several machine learning datasets of varying size and quality. Section 4 concludes the paper with some directions for future research in outlier detection.

2. Fuzzy-based detection of outliers in univariate data

2.1. Detecting outliers in discrete variables

According to [9], a discrete variable is assumed to have a countable number of values. On the other hand, a continuous variable can have infinitely many values corresponding to the points on a line interval. In practice, however, the distinction between these two types of attributes may not be so clear (see [12]). The same attributes may be considered discrete or continuous, depending on the precision of measurement and other application-related factors. For the purpose of our discussion here, we assume the discrete attributes to include binary-valued (dichotomous) variables, categorical variables, and numeric variables with a limited number of values.

Actually, any single or low-frequency occurrence of a value may be considered an outlier. However, the human perception of an outlying (rare) discrete value depends on some additional factors, like the number of records where the value occurs, the number of records with a non-null value, and the total number of distinct values in an attribute. For instance, a single occurrence of a value in one record out of twenty does not seem very exceptional, but it is an obvious outlier in a database of 20,000 records. A value taken by only 1% of records is clearly exceptional if the attribute is binary-valued, but it is not necessarily an outlier in an attribute having more than 100 distinct values. Our attitude to rare values also depends on the objectives of our analysis. For example, in a marketing survey, the records of potential buyers represent a very important part of the population, even if they constitute just one percent of the entire sample.

To automate the cognitive process of detecting discrete outliers, we represent the conformity of an attribute value by the following membership function µR:

    µR(Vij) = 2 / (1 + exp(γ · Di / (TVi · Nij)))     (1)

where Vij is the discrete value No. j of the attribute Ai; γ is the shape factor, representing the user attitude towards the conformity of rare values; Di is the total number of records where Ai ≠ null; TVi is the total number of distinct values taken by the attribute Ai; and Nij is the number of occurrences of the value Vij.

It can be easily verified that Equation (1) agrees with the definition of a fuzzy measure (see [5]). The conformity becomes close to zero when the number of value occurrences is much smaller than the average number of records per value (given by Di / TVi). On the other hand, if the data set is very small (Di close to zero), µR approaches one, which means that even a single occurrence of a value is not considered an outlier.
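To make the behaviour of Equation (1) concrete, it can be sketched in Python together with the α-cut step used later in this section. This is a minimal illustration; the function and parameter names are ours, not part of the original method's code:

```python
import math
from collections import Counter

def discrete_conformity(n_ij, d_i, tv_i, gamma=0.5):
    """Equation (1): fuzzy conformity of a discrete value.

    n_ij  -- number of occurrences of the value (Nij)
    d_i   -- number of records where the attribute is not null (Di)
    tv_i  -- number of distinct values of the attribute (TVi)
    gamma -- shape factor (user attitude towards rare values)
    """
    x = gamma * d_i / (tv_i * n_ij)
    if x > 700:          # exp(x) would overflow; conformity is effectively zero
        return 0.0
    return 2.0 / (1.0 + math.exp(x))

def discrete_outliers(values, gamma=0.5, alpha=0.05):
    """Alpha-cut of one discrete attribute: values with conformity below alpha."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    return {v for v, n in counts.items()
            if discrete_conformity(n, len(non_null), len(counts), gamma) < alpha}
```

For example, a single occurrence among 1,000 binary records gets a conformity near zero and is flagged, while the same single occurrence in an eleven-record sample remains conforming.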


[Figure 1 chart: conformity µR vs. number of occurrences Nij, for γ = 0.5 and γ = 0.1, with D = 1000 and TV = 20]

Figure 1: Discrete value conformity as a function of the shape factor γ

The shape factor γ represents the subjective attitude of a particular data analyst to values of the same frequency. Low values of γ (about 0.5) make µR a sigmoidal function gradually increasing from zero to one with the number of occurrences Nij. For γ = 0, all values are considered conforming (µR = 1). On the other hand, high values of γ (more than one) make µR a step function, marking almost any value as an "outlier" (µR = 0). In Figure 1, we show µR as a function of Nij for two values of the factor γ. For our case studies in the next Section, we have used the value of γ = 0.5.

The subset of outliers in each attribute is found by defining an α-cut of all the attribute values: {Vij: µR(Vij) < α}. In our applications (see Section 3 below), we use α = 0.05. If an attribute is dichotomous (has only two distinct values) and one of the values is recognized as an outlier, we can remove that attribute from the subsequent knowledge discovery process, and even from the original database, since it does not provide us with any new information about specific records. Thus, in some cases, outlier detection can be used for dimensionality reduction of data.

2.2. Detecting outliers in continuous variables

The sets of discrete and continuous data are not completely disjoint. Numeric attributes taking a small number of values can be treated as discrete variables, and categorical values can be translated into consecutive numbers (examples of such transformations are presented in [12]). However, the criterion of value frequency (see sub-section 2.1 above) cannot be applied directly to detecting continuous outliers, since a continuous attribute can take an infinite number of values in its range. In the continuous case, we should sort the distinct values of the inspected attribute and examine the distance between each single value and its neighboring values. The greater that distance, the lower our confidence in a value (e.g., see the rightmost point in Figure 2a). Still, we may have several clusters of attribute values located at a large distance from each other (see Figure 2b). None of these clusters may contain outliers.

To automate the human perception of non-conforming (outlying) continuous values, we need a formal definition of value conformity with respect to its preceding and succeeding values. Such a definition is suggested below.


Figure 2: Examples of outliers (from [12])

Definition. Given a set of sorted distinct values, a value is considered conforming from below (conforming from above) if it is close enough to the succeeding (preceding) values, respectively.

Thus, automated detection of outliers requires sorting the continuous attribute values in ascending order (from Vi1 to Vi,TVi, where TVi is the total number of distinct values). Afterwards, the conformity of each continuous value is measured with respect to its preceding and succeeding values. The conformity from below is denoted by µRL and the conformity from above is denoted by µRH. Each conformity measure depends on the so-called "distance to density" ratio. For the conformity from below, this is the ratio between the distance from each value to the subsequent value and the average density of M subsequent values (M is termed the "look-ahead"). Accordingly, the conformity from above is determined by the ratio between the distance from each value to the preceding value and the average density of M preceding values. Other factors involved in the conformity calculation include the shape factor β, the total number of records Di, and Nij, the number of occurrences of the value Vij. The expressions for the membership functions µRL and µRH follow:

    µRL(Vij) = 2 / (1 + exp(β · M · Di · (Vi,j+1 − Vi,j) / (Nij · (Vi,j+M+1 − Vi,j+1))))     (2)

    µRH(Vij) = 2 / (1 + exp(β · M · Di · (Vi,j − Vi,j−1) / (Nij · (Vi,j−1 − Vi,j−M−1))))     (3)

The shape factor β represents the user-dependent attitude to distances between succeeding values. Lower values of β (about 10⁻⁴) assign a conformity of 0.5 even to a single value having a 10 times larger distance to the neighboring value than the average density of the succeeding (or preceding) values. Higher values of β (like 10⁻³) provide a sharper decrease of conformity from 1.0 to zero as the distance between succeeding values becomes larger. In Figure 3, we show µR* as a function of the "distance to density" ratio for two different values of β (0.001 and 0.002), given that the value occurs only in one record out of 1,000. The value of β = 0.001 has been used in our case studies (see Section 3).
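Equations (2) and (3) can be transcribed directly into code. The sketch below uses our own naming and assumes the list of distinct values is already sorted in ascending order:

```python
import math

def _sigmoid2(x):
    """2 / (1 + e^x), guarded against float overflow for large exponents."""
    return 0.0 if x > 700 else 2.0 / (1.0 + math.exp(x))

def conformity_from_below(v, j, d_i, n_ij, m=10, beta=0.001):
    """Equation (2): conformity of the sorted distinct value v[j]
    with respect to its M succeeding values (needs j + m + 1 < len(v))."""
    dist = v[j + 1] - v[j]              # distance to the next value
    span = v[j + m + 1] - v[j + 1]      # width covered by the next M values
    return _sigmoid2(beta * m * d_i * dist / (n_ij * span))

def conformity_from_above(v, j, d_i, n_ij, m=10, beta=0.001):
    """Equation (3): conformity of v[j] with respect to its M preceding
    values (needs j - m - 1 >= 0)."""
    dist = v[j] - v[j - 1]
    span = v[j - 1] - v[j - m - 1]
    return _sigmoid2(beta * m * d_i * dist / (n_ij * span))
```

With β = 10⁻⁴, D = 1000 and N = 1, a distance-to-density ratio of 10 puts the exponent at 1 and the conformity near 0.5, matching the behaviour of the shape factor described above.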


[Figure 3 chart: conformity vs. distance-to-density ratio, for β = 0.001 and β = 0.002, with D = 1000 and N = 1]

Figure 3: Conformity of a continuous value as a function of the shape factor β

The procedure of determining the conformity of each continuous value depends on the available prior knowledge about the distribution of all the attribute values. If our prior knowledge causes us to assume that the distribution is unimodal (making the situation in Fig. 2b unlikely), the outlier detection can be focused on checking only the conformity "from below" of the lowest values and the conformity "from above" of the highest values. Moreover, we can stop the search for outliers once we have found the first conforming value from each side, because a unimodal distribution is supposed to have only one cluster of values. The pseudocode of outlier detection in unimodal attributes is given in Algorithm A below.

Algorithm A. Outlier detection in Unimodal Attributes
• Select the look-ahead M, the threshold α, and the shape factor β
• Sort distinct values in ascending order
• Initialize the conformity of each value to 1.00 (∀j: µRL(Vij) = µRH(Vij) = 1)
Calculating conformity from below
• Initialize the index of the current value to zero (index of the lowest value)
• Do
  • Calculate the "low value" conformity µRL(Vij) for the current value by Equation (2)
  • Increment the index of the current value by one
• While the index of the current value < (Total_Number_of_Values − M − 2) and µRL(Vij) < α
• Set the lower bound of the attribute (Low_Bound) to the current value Vij
Calculating conformity from above
• Initialize the index of the current value to Total_Number_of_Values − 1 (index of the highest value)
• Do
  • Calculate the "high value" conformity µRH(Vij) for the current value by Equation (3)
  • Decrement the index of the current value by one
• While the index of the current value > M + 1 and µRH(Vij) < α
• Set the upper bound of the attribute (Upper_Bound) to the current value Vij
Finding outlying values
• For each value Vij,
  • If (Vij < Low_Bound) or (Vij > Upper_Bound)
    • Denote Vij as an outlier
• End

When no information about the attribute distribution is available, we can consider as an outlier only a value which is far away from both its preceding and succeeding values. That is, in this case we should calculate two conformity degrees for each value: µRL and µRH. An outlying value should have both conformity degrees below the threshold α. If only one conformity degree is below α, this means that the value is the highest or the lowest in a cluster, but not


necessarily an outlier. The algorithm for detecting outliers in continuous attributes of unknown distribution follows.

Algorithm B. Outlier detection in Attributes of Unknown Distribution
• Select the look-ahead M, the threshold α, and the shape factor β
• Sort distinct values in ascending order
• Initialize the conformity degrees of each value to zero (∀j: µRL(Vij) = µRH(Vij) = 0)
Calculating conformity from below
• For index_of_current_value = 0 to (Total_Number_of_Values − M − 3)
  • Calculate the "low value" conformity µRL(Vij) for the current value by Equation (2)
Calculating conformity from above
• For index_of_current_value = Total_Number_of_Values − 1 down to M + 2
  • Calculate the "high value" conformity µRH(Vij) for the current value by Equation (3)
Finding outlying values
• For each value Vij,
  • If max{µRL, µRH} < α
    • Denote Vij as an outlier
• End

3. Evaluating the quality of machine learning datasets

We have applied the automated methods of detecting outliers in discrete and continuous attributes (see Section 2 above) to seven datasets from the UCI Machine Learning Repository [2]. The datasets in this repository are widely used by the data mining community for the empirical evaluation of learning algorithms. There are many studies comparing the performance of different classification methods on these datasets, but we do not know of any documented attempt to evaluate their quality. Though these datasets are supposed to represent "real-world" data, their documentation sometimes mentions a certain amount of manual cleaning performed by the data providers. The datasets selected by us for the outlier detection contain a diverse mixture of discrete and continuous attributes. A brief description of each dataset follows.

The Breast Cancer Database. This is a medical data set including 699 clinical cases. There are nine discrete multi-valued attributes (ranging from 1 to 10) that represent results of medical tests and one discrete binary-valued attribute standing for the class of each case. The documentation indicates that several records have been removed from the original dataset (probably due to data quality problems).

Chess Endgames. This is an artificial data set representing a situation at the end of a chess game. Each of 3,196 instances is a board description for this chess endgame. There are 36 attributes describing the board. All the attributes are nominal (mostly binary). Since the data was artificially created, no data cleaning was performed.

Credit Approval. This is an encoded form of a proprietary database containing data on credit card applications for 690 customers. The data set includes eight discrete attributes (with different numbers of distinct values) and six continuous attributes. There were originally a few missing values, but these have all been replaced by the overall median. There is no information about any other cleaning operations applied.

Diabetes. This is a part of a larger diagnostic dataset, and it contains data on 768 patients. In the dataset, there are eight continuous attributes and one binary-valued attribute. No data cleaning is mentioned in the dataset documentation.

Glass Identification. This database deals with the classification of types of glass, based on chemical and physical tests. It includes 214 cases, with nine continuous attributes and one discrete attribute having seven distinct values. No data cleaning by the data donors is mentioned.

Heart Disease. The dataset includes results of medical tests aimed at detecting a heart disease for 297 patients. There are seven nominal (including three binary-valued) and six continuous attributes. Another 62 attributes present in the original database have been excluded from the dataset. According to the documentation, missing values were encoded with a pre-defined value.

Iris Plants. R.A. Fisher created this database in 1936 and, since then, it has been extensively used in the pattern recognition literature. The data set contains three classes of 50 instances each, where each class refers to a type of iris plant. Each instance is described by four continuous attributes.

The automated methods of outlier detection have been applied to the above datasets by using the


following parameters (tuned to the perception of the authors on small samples of graphically represented data):
• Minimum conformity of non-outliers (α): 0.05
• Shape factor for calculating the conformity of continuous values (β): 0.001
• Shape factor for calculating the conformity of discrete values (γ): 0.500
• Look-ahead for calculating the average density of consecutive values (M): 10

Algorithm A (see above) has been applied to all continuous attributes, assuming their distribution to be unimodal. This assumption was verified by comparing the outputs of Algorithm A and Algorithm B. The results of outlier detection in all datasets are presented in Table 1 below. In the third and fourth columns from the left, we can see the number and the percentage of "non-conforming records", containing at least one outlying value.

The Iris dataset is the "cleanest" one: it contains no outliers at all. This finding is not surprising, taking into account the popularity of Iris in the Machine Learning community. Most other datasets tested by us have a small amount of outliers in their records (between 0.5% in Glass and 8.3% in Breast). However, the Chess dataset suffers from a significant portion of outlying values: more than 25% of its records contain at least one outlier! The outliers are found in 12 binary-valued attributes, where one of the values has a very low frequency (e.g., in the spcop attribute, there is one record having the value of 0 vs. 3,195 records having the value of 1). These attributes can be completely discarded from the data mining process. As indicated above, our method of automated outlier detection can also be used as a dimensionality reduction tool.
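Under these parameter settings, Algorithms A and B can be sketched as follows. This is a simplified, self-contained illustration in Python: the scan logic mirrors the pseudocode of Section 2, the function names are ours, and the conformity formulas of Equations (2) and (3) are inlined through a shared helper:

```python
import math
from collections import Counter

def _mu(beta, m, d, n, dist, span):
    # Shared form of Equations (2) and (3), guarded against exp() overflow
    x = beta * m * d * dist / (n * span)
    return 0.0 if x > 700 else 2.0 / (1.0 + math.exp(x))

def detect_unimodal(values, m=10, alpha=0.05, beta=0.001):
    """Algorithm A: outlier detection assuming a unimodal distribution."""
    counts = Counter(values)
    v, d = sorted(counts), len(values)
    low = 0                       # scan up until a value conforms "from below"
    while low < len(v) - m - 2 and _mu(beta, m, d, counts[v[low]],
            v[low + 1] - v[low], v[low + m + 1] - v[low + 1]) < alpha:
        low += 1
    high = len(v) - 1             # scan down until a value conforms "from above"
    while high > m + 1 and _mu(beta, m, d, counts[v[high]],
            v[high] - v[high - 1], v[high - 1] - v[high - m - 1]) < alpha:
        high -= 1
    return [x for x in values if x < v[low] or x > v[high]]

def detect_unknown(values, m=10, alpha=0.05, beta=0.001):
    """Algorithm B: outlier detection for an unknown distribution.
    A value is an outlier only if both conformity degrees stay below alpha."""
    counts = Counter(values)
    v, d = sorted(counts), len(values)
    mu_low, mu_high = [0.0] * len(v), [0.0] * len(v)
    for j in range(0, len(v) - m - 2):
        mu_low[j] = _mu(beta, m, d, counts[v[j]],
                        v[j + 1] - v[j], v[j + m + 1] - v[j + 1])
    for j in range(len(v) - 1, m + 1, -1):
        mu_high[j] = _mu(beta, m, d, counts[v[j]],
                         v[j] - v[j - 1], v[j - 1] - v[j - m - 1])
    out = {v[j] for j in range(len(v)) if max(mu_low[j], mu_high[j]) < alpha}
    return [x for x in values if x in out]
```

On a dense uniform run of values with one far-away point, both sketches flag only the isolated point; on the uniform run alone, neither flags anything.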

Dataset    Total records   Non-conforming   Percentage   Discrete attributes     Continuous attributes
                           records                       Total   With outliers   Total   With outliers
Breast     699             58               8.3%         10      6               0       0
Chess      3196            838              26.2%        37      12              0       0
Credit     690             41               5.9%         9       4               6       5
Diabetes   768             14               1.8%         1       0               8       6
Glass      214             1                0.5%         1       0               9       1
Heart      297             5                1.7%         8       1               6       1
Iris       150             0                0.0%         1       0               4       0

Table 1. Detecting outliers in the Machine Learning Datasets

In columns 5 – 8 of Table 1, we summarize the conformity of the discrete and continuous attributes in each dataset. One example is the popular Credit Approval, comprising a mixture of discrete and continuous attributes. Since this dataset is based on real-world banking data, it is quite reasonable that four discrete attributes (out of nine) and five continuous attributes (out of six) contain outlying values. In Figure 4, we show the number of occurrences (solid bars) and the calculated conformity (a solid line) for each discrete value of the attribute "Job Status" (denoted as A6 in the encoded versions of the Credit Approval dataset). Visually, the values 2, 3, 7, and 9 seem to be clear outliers, confirming the results of the automated outlier detection (their conformity is below the threshold α).

Figure 5 demonstrates the detection of outliers in a continuous attribute ("Age", or A2). Only the lower values of this attribute are shown (18 years and younger). When looking at the chart, the 13.75-year-old person seems considerably younger than most other customers. The automated method of outlier detection has assigned to this value a conformity "from below" of 0.007, marking it as an outlier.
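As a worked illustration of the conformity "from below" computation, the sketch below applies Equation (2) to a hypothetical list of customer ages with one isolated young value. The numbers are invented for illustration and are not the actual Credit Approval data:

```python
import math

def conformity_from_below(v, j, d_i, n_ij, m=10, beta=0.001):
    """Equation (2) for a list v of sorted distinct values (see Section 2.2)."""
    x = beta * m * d_i * (v[j + 1] - v[j]) / (n_ij * (v[j + m + 1] - v[j + 1]))
    return 0.0 if x > 700 else 2.0 / (1.0 + math.exp(x))

# One isolated young age, then a dense cluster between 16 and 18 years
ages = [13.75] + [16.0 + 0.2 * k for k in range(11)]
mu = conformity_from_below(ages, 0, d_i=690, n_ij=1)
print(mu < 0.05)   # the isolated value fails the alpha-cut; prints True
```

The isolated value is far from its successor relative to the density of the following cluster, so its conformity falls well below the α = 0.05 threshold.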


[Figure 4 chart: value counts (bars) and conformity (line) for each "Job Status" value, with the α threshold marked]

Figure 4: Detecting outliers in a discrete attribute ("Job Status")

[Figure 5 chart: value counts (bars) and conformity from below (line) for ages 13 – 18, with the α threshold marked]

Figure 5: Detecting outliers in a continuous attribute ("Age")

4. Conclusions

In this paper, we have presented novel, fuzzy-based procedures for automating the human perception of outlying values in univariate data. The prior knowledge about the data can be utilized to choose the appropriate procedure for a given dataset, and further to adapt the process of outlier detection to the form of the attribute distribution (e.g., a unimodal distribution). The high dimensionality of modern databases provides a significant advantage to automated perception of outliers over the manual analysis of visualized data. Unlike the case of human decision-making, the parameters of the automated detection of outliers can be completely controlled, making it an objective tool of data analysis. The results of outlier detection can be used to improve the quality of data in a database and, in some cases, to enhance the performance of data mining algorithms.

As already shown by us in [6], the fuzzy set theory enables the development of computational models of human perception for problems where the


graphically represented data has traditionally been analyzed by human experts. In this paper, we have described one procedure for detecting outliers in discrete data and two procedures for outlier detection in continuous data. More techniques may be developed by capturing different aspects of human perception and making use of the prior knowledge available about attributes, including domain size, validity range, and form of distribution. The automated detection of outliers may be embedded in database management systems to warn the users against possibly inaccurate information. The integration of outlier detection with data mining methods has a potential benefit for extracting valid knowledge from "dirty" data.

Acknowledgment. This work was supported in part by the National Institute for Systems Test and Productivity at USF under the USA Space and Naval Warfare Systems Command Grant No. N00039-01-1-2248 and in part by the USF Center for Software Testing under grant No. 2108-004-00.

References

[1] Barnett, V., & Lewis, T. (1995). Outliers in Statistical Data. 3rd Edition. New York: Wiley.
[2] Blake, C., Keogh, E., & Merz, C.J. (1998). UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science.
[3] Fan, J., & Gijbels, I. (1996). Local Polynomial Modelling and Its Applications. London: Chapman & Hall.
[4] Guyon, I., Matic, N., & Vapnik, V. (1996). Discovering Informative Patterns and Data Cleaning. In U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, pp. 181-203. Menlo Park, CA: AAAI/MIT Press.
[5] Klir, G.J., & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic: Theory and Applications. Upper Saddle River, NJ: Prentice-Hall.
[6] Last, M., & Kandel, A. (1999). Automated Perceptions in Data Mining. Proceedings of the Eighth International Conference on Fuzzy Systems, Seoul, Korea, Part I, pp. 190-197.
[7] Maimon, O., & Last, M. (2000). Knowledge Discovery and Data Mining – The Info-Fuzzy Network (IFN) Methodology. Kluwer Academic Publishers.
[8] Maimon, O., Kandel, A., & Last, M. (2001). Information-Theoretic Fuzzy Approach to Data Reliability and Data Mining. Fuzzy Sets and Systems, Vol. 117, No. 2, pp. 183-194.
[9] Mendenhall, W., Reinmuth, J.E., & Beaver, R.J. (1993). Statistics for Management and Economics. Belmont, CA: Duxbury Press.
[10] Minium, E.W., Clarke, R.B., & Coladarci, T. (1999). Elements of Statistical Reasoning. New York: Wiley.
[11] Mitchell, T.M. (1997). Machine Learning. New York: McGraw-Hill.
[12] Pyle, D. (1999). Data Preparation for Data Mining. San Francisco, CA: Morgan Kaufmann.
[13] Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1, 1, 81-106.
[14] Wand, Y., & Wang, R.Y. (1996). Anchoring Data Quality Dimensions in Ontological Foundations. Communications of the ACM, 39, 11, 86-95.
[15] Wang, R.Y., Reddy, M.P., & Kon, H.B. (1995). Toward Quality Data: An Attribute-based Approach. Decision Support Systems, 13, 349-372.
[16] Zadeh, L.A. (1985). Syllogistic Reasoning in Fuzzy Logic and its Application to Usuality and Reasoning with Dispositions. IEEE Transactions on Systems, Man, and Cybernetics, SMC-15, 6, 754-763.
