Data Mining Algorithms
Total Page:16
File Type:pdf, Size:1020Kb
Data Management and Exploration © for the original version: Prof. Dr. Thomas Seidl Jiawei Han and Micheline Kamber http://www.cs.sfu.ca/~han/dmbook Data Mining Algorithms Lecture Course with Tutorials Wintersemester 2003/04 Chapter 4: Data Preprocessing Chapter 3: Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary WS 2003/04 Data Mining Algorithms 4 – 2 Why Data Preprocessing? Data in the real world is dirty incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data noisy: containing errors or outliers inconsistent: containing discrepancies in codes or names No quality data, no quality mining results! Quality decisions must be based on quality data Data warehouse needs consistent integration of quality data WS 2003/04 Data Mining Algorithms 4 – 3 Multi-Dimensional Measure of Data Quality A well-accepted multidimensional view: Accuracy (range of tolerance) Completeness (fraction of missing values) Consistency (plausibility, presence of contradictions) Timeliness (data is available in time; data is up-to-date) Believability (user’s trust in the data; reliability) Value added (data brings some benefit) Interpretability (there is some explanation for the data) Accessibility (data is actually available) Broad categories: intrinsic, contextual, representational, and accessibility. WS 2003/04 Data Mining Algorithms 4 – 4 Major Tasks in Data Preprocessing Data cleaning Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies Data integration Integration of multiple databases, data cubes, or files Data transformation Normalization and aggregation Data reduction Obtains reduced representation in volume but produces the same or similar analytical results Data discretization Part of data reduction but with particular importance, especially for numerical data WS 2003/04 Data Mining Algorithms 4 – 5 Chapter 3: Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary WS 2003/04 Data Mining Algorithms 4 – 6 Data Cleaning Data cleaning tasks Fill in missing values Identify outliers and smooth out noisy data Correct inconsistent data WS 2003/04 Data Mining Algorithms 4 – 7 Missing Data Data is not always available E.g., many tuples have no recorded value for several attributes, such as customer income in sales data Missing data may be due to equipment malfunction inconsistent with other recorded data and thus deleted data not entered due to misunderstanding certain data may not be considered important at the time of entry not register history or changes of the data Missing data may need to be inferred. WS 2003/04 Data Mining Algorithms 4 – 8 How to Handle Missing Data? Ignore the tuple: usually done when class label is missing (not effective when the percentage of missing values per attribute varies considerably. Fill in the missing value manually: tedious (i.e., boring & time- consuming), infeasible? Use a global constant to fill in the missing value: e.g., a default value, or “unknown”, a new class?! – not recommended! Use the attribute mean (average value) to fill in the missing value Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter Use the most probable value to fill in the missing value: inference- based such as Bayesian formula or decision tree WS 2003/04 Data Mining Algorithms 4 – 9 Noisy Data Noise: random error or variance in a measured variable Incorrect attribute values may due to faulty data collection instruments data entry problems data transmission problems technology limitation inconsistency in naming convention Other data problems which requires data cleaning duplicate records incomplete data inconsistent data WS 2003/04 Data Mining Algorithms 4 – 10 How to Handle Noisy Data? Binning method: first sort data and partition into (equi-depth) bins then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc. Clustering detect and remove outliers Combined computer and human inspection detect suspicious values and check by human Regression smooth by fitting the data into regression functions WS 2003/04 Data Mining Algorithms 4 – 11 Noisy Data—Simple Discretization (1) Equal-width (distance) partitioning: It divides the range into N intervals of equal size: uniform grid if A and B are the lowest and highest values of the attribute, the width of intervals will be: W = (B-A)/N. The most straightforward Shortcoming: outliers may dominate presentation 12 12 10 Skewed data is not handled well. 8 8 6 6 Example (data sorted, here: 10 bins): 5 5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 4 14, 15, 17, 17, 17, 18, 19, 23, 24, 2 1 25, 26, 26, 26, 27, 28, 32, 34, 36, 0 0 0 0 0 0 37, 38, 39, 97 1-10 21-30 31-40 41-50 51-60 61-70 71-80 81-90 11-20 91-100 WS 2003/04 Data Mining Algorithms 4 – 12 Noisy Data—Simple Discretization (2) Equal-depth (frequency) partitioning: It divides the range into N intervals, each containing approximately same number of samples (quantile-based approach) 8 8 8 8 8 Good data scaling 7 Managing categorical attributes 6 can be tricky. 5 4 3 2 Same Example (here: 4 bins): 1 5, 7, 8, 8, 9, 11, 13, 13, 14, 14, 0 14, 15, 17, 17, 17, 18, 19, 23, 24, 1-13 14-17 18-26 25, 26, 26, 26, 27, 28, 32, 34, 34, 27-100 36, 37, 37, 38, 39, 97 1- 14 - 18 - 27 - 13 17 26 100 WS 2003/04 Data Mining Algorithms 4 – 13 Noisy Data— Binning Methods for Data Smoothing * Sorted data for price (in dollars): 35 4, 8, 9, 15, 21, 21, 24, 30 price [US$] 25 25, 26, 28, 29, 34 20 15 * Partition into (equi-depth) bins: 10 - Bin 1: 4, 8, 9, 15 5 0 - Bin 2: 21, 21, 24, 25 35 - Bin 3: 26, 28, 29, 34 30 bin means 25 * Smoothing by bin means: 20 15 - Bin 1: 9, 9, 9, 9 10 - Bin 2: 23, 23, 23, 23 5 0 - Bin 3: 29, 29, 29, 29 35 * Smoothing by bin boundaries: 30 bin boundaries 25 - Bin 1: 4, 4, 4, 15 20 15 - Bin 2: 21, 21, 25, 25 10 5 - Bin 3: 26, 26, 26, 34 0 WS 2003/04 Data Mining Algorithms 4 – 14 Noisy Data—Cluster Analysis Detect and remove outliers WS 2003/04 Data Mining Algorithms 4 – 15 Noisy Data—Regression y Smooth data according to some regression function Y1 Y1’ y = x + 1 X1 x WS 2003/04 Data Mining Algorithms 4 – 16 Chapter 3: Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary WS 2003/04 Data Mining Algorithms 4 – 17 Data Integration Data integration: combines data from multiple sources into a coherent store Schema integration integrate metadata from different sources Entity identification problem: identify real world entities from multiple data sources, e.g., A.cust-id ≡ B.cust-# Detecting and resolving data value conflicts for the same real world entity, attribute values from different sources are different possible reasons: different representations, different scales, e.g., metric vs. British units WS 2003/04 Data Mining Algorithms 4 – 18 Handling Redundant Data in Data Integration Redundant data occur often when integrating multiple databases The same attribute may have different names in different databases One attribute may be a “derived” attribute in another table, e.g., birthday vs. age; annual revenue Redundant data may be able to be detected by correlational analysis Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality WS 2003/04 Data Mining Algorithms 4 – 19 Data Transformation Smoothing: remove noise from data Aggregation: summarization, data cube construction Generalization: concept hierarchy climbing e.g., {young, middle-aged, senior} rather than {1…100} Normalization: scaled to fall within a small, specified range min-max normalization z-score normalization (= zero-mean normalization) normalization by decimal scaling Attribute/feature construction New attributes constructed from the given ones e.g., age = years(current_date – birthday) WS 2003/04 Data Mining Algorithms 4 – 20 Data Transformation: min-max Normalization min-max normalization new_max − new_min v'= ()v − min A A + new_min A − A maxA minA transforms data linearly to a new range range outliers may be detected afterwards as well new_maxA slope is: new_max − new_min new_min A A A − maxA minA minA maxA WS 2003/04 Data Mining Algorithms 4 – 21 Data Transformation: zero-mean Normalization zero-mean (z-score) normalization v − mean v'= A std _ devA Particularly useful if min/max values are unknown Outliers dominate min/max normalization WS 2003/04 Data Mining Algorithms 4 – 22 Data Transformation: Normalization by Decimal Scaling normalization by decimal scaling v v'= Where j is the smallest integer such that max(|ν’|) < 1 10 j New data range: 0=|ν’| < 1 i.e., –1 < ν’ < 1 Normalization (in general) is important when considering several attributes in combination. Large value ranges should not dominate small ones of other attributes WS 2003/04 Data Mining Algorithms 4 – 23 Chapter 3: Data Preprocessing Why preprocess the data? Data cleaning Data integration and transformation Data reduction Discretization and concept hierarchy generation Summary WS 2003/04 Data Mining Algorithms 4 – 24 Data Reduction Strategies Warehouse may store terabytes of data: Complex data analysis/mining may take a very long time to run on the complete data set Data reduction Obtains a reduced representation of the data set that is much smaller in volume but yet produces the same (or almost the same) analytical results Data reduction strategies Data cube aggregation Dimensionality reduction Numerosity reduction Discretization and concept hierarchy generation WS 2003/04 Data Mining Algorithms 4 – 25 Data Cube Aggregation The lowest level of a data cube the aggregated data for an individual entity of interest e.g., a customer in a phone calling data warehouse.