4-Preprocess.Pdf

Data cleaning and Data preprocessing
Nguyen Hung Son

This presentation was prepared on the basis of the following public materials:
1. Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", http://www.cs.sfu.ca
2. Gregory Piatetsky-Shapiro, "KDnuggets", http://www.kdnuggets.com/data_mining_course/

Outline
- Introduction
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty:
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - Data warehouse needs consistent integration of quality data

Data Understanding: Relevance
- What data is available for the task?
- Is this data relevant?
- Is additional relevant data available?
- How much historical data is available?
- Who is the data expert?

Data Understanding: Quantity
- Number of instances (records, objects)
  - Rule of thumb: 5,000 or more desired
  - If fewer, results are less reliable; use special methods (boosting, …)
- Number of attributes (fields)
  - Rule of thumb: for each attribute, 10 or more instances
  - If more fields, use feature reduction and selection
- Number of targets
  - Rule of thumb: >100 for each class
  - If very unbalanced, use stratified sampling

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility.
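As a rough illustration of the rules of thumb on the "Data Understanding: Quantity" slide, the following Python sketch flags a data set that falls below them. The function name check_quantity and its arguments are hypothetical and not part of the original slides; only the thresholds (5,000 instances, 10 instances per attribute, more than 100 instances per class) come from the slide.

```python
from collections import Counter


def check_quantity(class_labels, n_attributes):
    """Return warnings for a data set that misses the slide's rules of thumb.

    class_labels: one (string) target label per instance.
    n_attributes: number of input fields.
    """
    warnings = []
    n_instances = len(class_labels)

    # Rule of thumb: 5,000 or more instances desired.
    if n_instances < 5000:
        warnings.append("fewer than 5,000 instances")

    # Rule of thumb: 10 or more instances for each attribute.
    if n_instances < 10 * n_attributes:
        warnings.append("fewer than 10 instances per attribute")

    # Rule of thumb: more than 100 instances for each class.
    for label, count in sorted(Counter(class_labels).items()):
        if count <= 100:
            warnings.append(f"class {label!r} has only {count} instances")

    return warnings
```

For example, a set of 6,000 instances with 20 attributes in which one class has exactly 100 members would produce a single warning about that class, suggesting stratified sampling or one of the special methods the slide mentions.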

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains reduced representation in volume but produces the same or similar analytical results
- Data discretization
  - Part of data reduction but with particular importance, especially for numerical data

[Figure: Forms of data preprocessing]

Data Cleaning
- Data cleaning tasks:
  - Data acquisition and metadata
  - Fill in missing values
  - Unified date format
  - Converting nominal to numeric
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data

Data Cleaning: Acquisition
- Data can be in a DBMS
  - ODBC, JDBC protocols
- Data in a flat file
  - Fixed-column format
  - Delimited format: tab, comma ",", other
  - E.g. C4.5 and Weka "arff" use comma-delimited data
  - Attention: convert field delimiters inside strings
- Verify the number of fields before and after

Data Cleaning: Example
Original data (fixed column format):
  000000000130.06.19971979-10- -3080145722 #000310
  111000301.01.000100000000004
  0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
  00000000000.
  000000000000000.000000000000000.0000000...…
  000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.00
  0000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.0000
  00000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000000000000.000000
  000000000.000000000000000.000000000000000.000000000000000.00
  0000000000300.00 0000000000300.00

Clean data:
  0000000001,199706,1979.833,8014,5722 , ,#000310 ….
  ,111,03,000101,0,04,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,
  0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0300,0300.00

Data Cleaning: Metadata
- Field types: binary, nominal (categorical), ordinal, numeric, …
  - For nominal fields: tables translating codes to full descriptions
- Field role:
  - input: inputs for modeling
  - target: output
  - id/auxiliary: keep, but do not use for modeling
  - ignore: do not use for modeling
  - weight: instance weight
  - …
- Field descriptions

Data Cleaning: Reformatting
- Convert data to a standard format (e.g. arff or csv)
  - Missing values
  - Unified date format
  - Binning of numeric data
  - Fix errors and outliers
  - Convert nominal fields whose values have order to numeric
    - Q: Why?
    - A: To be able to use ">" and "<" comparisons on these fields

Missing Data
- Data is not always available
  - E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to:
  - equipment malfunction
  - data that was inconsistent with other recorded data and thus deleted
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred.

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably.
- Fill in the missing value manually: tedious + infeasible?
- Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
- Imputation: use the attribute mean to fill in the missing value, or (smarter) use the attribute mean of all samples belonging to the same class
- Use the most probable value to fill in the missing value: inference-based, such as Bayesian formula or decision tree

Data Cleaning: Unified Date Format
- We want to transform all dates to the same format internally
- Some systems accept dates in many formats
  - e.g. "Sep 24, 2003", 9/24/03, 24.09.03, etc.
  - dates are transformed internally to a standard value
- Frequently, just the year (YYYY) is sufficient
- For more details, we may need the month, the day, the hour, etc.
- Representing date as YYYYMM or YYYYMMDD can be OK, but has problems
- Q: What are the problems with YYYYMMDD dates?
- A: Ignoring for now the looming Y10K (year 10,000 crisis …), YYYYMMDD does not preserve intervals:
  - 20040201 - 20040131 ≠ 20040131 - 20040130 (the differences are 70 and 1, although both pairs of dates are one day apart)
  - This can introduce bias into models

Unified Date Format Options
- To preserve intervals, we can use:
  - Unix system date: number of seconds since 1970
  - Number of days since Jan 1, 1960 (SAS)
- Problem: values are non-obvious
  - they don't help intuition and knowledge discovery
  - harder to verify, easier to make an error

KSP Date Format

  KSP Date = YYYY + (days_starting_Jan_1 - 0.5) / (365 + 1_if_leap_year)

- Preserves intervals (almost)
- The year and quarter are obvious
  - Sep 24, 2003 is 2003 + (267 - 0.5)/365 = 2003.7301 (rounded to 4 digits)
- Consistent with date starting at noon
- Can be extended to include time

Y2K issues: 2 digit Year
- 2-digit year in old data: legacy of Y2K
- E.g.:
  - Q: Year 02: is it 1902 or 2002?
  - A: Depends on context (e.g. child birthday or year of house construction)
- Typical approach: CUTOFF year, e.g. 30
  - if YY < CUTOFF, then 20YY, else 19YY

Conversion: Nominal to Numeric
- Some tools can deal with nominal values internally
- Other methods (neural nets, regression, nearest neighbor) require only numeric inputs
- To use nominal fields in such methods, we need to convert them to a numeric value
- Q: Why not ignore nominal fields altogether?
- A: They may contain valuable information
- Different strategies for binary, ordered, multi-valued nominal fields

Conversion: Binary to Numeric
- Binary fields
  - E.g. Gender = M, F
- Convert to Field_0_1 with 0, 1 values, e.g.
  - Gender = M → Gender_0_1 = 0
  - Gender = F → Gender_0_1 = 1

Conversion: Ordered to Numeric
- Ordered attributes (e.g. Grade) can be converted to numbers preserving natural order, e.g.
  - A → 4.0
  - A- → 3.7
  - B+ → 3.3
  - B → 3.0
- Q: Why is it important to preserve natural order?
- A: To allow meaningful comparisons, e.g. Grade > 3.5

Conversion: Nominal, Few Values
- Multi-valued, unordered attributes with a small number of values (rule of thumb: < 20)
  - e.g. Color = Red, Orange, Yellow, …, Violet
  - for each value v create a binary "flag" variable C_v, which is 1 if Color = v, 0 otherwise

  ID   Color   …
  371  red
  433  yellow

  ID   C_red  C_orange  C_yellow  …
  371  1      0         0
  433  0      0         1

Conversion: Nominal, Many Values
- Examples:
  - US State Code (50 values)
  - Profession Code (7,000 values, but only few frequent)
- Q: How to deal with such fields?
- A: Ignore ID-like fields whose values are unique for each record
- For other fields, group values "naturally", e.g.:
  - 50 US States → 3 or 5 regions
  - Profession: select most frequent ones, group the rest
- Create binary flag-fields for selected values

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to:
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitation
  - inconsistency in naming convention
- Other data problems which require data cleaning:
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort data and
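
The nominal-to-numeric conversions from the "Conversion" slides can be sketched in Python roughly as follows. This is a minimal illustration using only the standard library, not the method of any particular tool: the names GENDER_0_1, GRADE_POINTS, flag_columns and group_rare, and the "other" placeholder value, are hypothetical; the mappings themselves (M → 0, F → 1, A → 4.0, and so on) are taken from the slides.

```python
from collections import Counter

# Binary field: Gender = M -> 0, Gender = F -> 1 (as on the slide).
GENDER_0_1 = {"M": 0, "F": 1}

# Ordered field: grades mapped to numbers that preserve the natural order.
GRADE_POINTS = {"A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0}


def flag_columns(values, prefix):
    """Few-valued nominal field: one 0/1 flag variable per distinct value."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{c}": int(v == c) for c in categories}
        for v in values
    ]


def group_rare(values, top_n, other="other"):
    """Many-valued nominal field: keep the top_n most frequent values
    and group every remaining value under a single placeholder."""
    frequent = {v for v, _ in Counter(values).most_common(top_n)}
    return [v if v in frequent else other for v in values]


# The Color example from the slide (records 371 and 433).
colors = ["red", "yellow"]
flags = flag_columns(colors, "C")
# flags[0] == {"C_red": 1, "C_yellow": 0}
# flags[1] == {"C_red": 0, "C_yellow": 1}
```

A many-valued field such as a profession code would first go through group_rare and then through flag_columns, so that flag fields are created only for the selected frequent values plus the grouped remainder.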

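The date handling from the "KSP Date Format" and "Y2K issues" slides can be sketched in Python as below. This is an illustrative sketch under stated assumptions: the function names ksp_date and expand_two_digit_year are hypothetical, days_starting_Jan_1 is taken to be the day of the year with Jan 1 counted as day 1 (which matches the slide's Sep 24, 2003 = day 267 example), and the cutoff of 30 is the example value from the slide.

```python
import calendar
from datetime import date


def ksp_date(d):
    """KSP date: YYYY + (day_of_year - 0.5) / (365 + 1 if leap year)."""
    day_of_year = d.timetuple().tm_yday  # Jan 1 is day 1
    days_in_year = 366 if calendar.isleap(d.year) else 365
    return d.year + (day_of_year - 0.5) / days_in_year


def expand_two_digit_year(yy, cutoff=30):
    """CUTOFF rule: if YY < CUTOFF then 20YY, else 19YY."""
    return 2000 + yy if yy < cutoff else 1900 + yy


# Worked example from the slide: Sep 24, 2003 is day 267 of a 365-day year.
sep_24 = round(ksp_date(date(2003, 9, 24)), 4)

# YYYYMMDD integers distort intervals around a month boundary ...
gap_yyyymmdd = 20040201 - 20040131  # 70, although the dates are one day apart

# ... whereas KSP dates one day apart within a year differ by 1/366 in 2004.
gap_ksp = ksp_date(date(2004, 2, 1)) - ksp_date(date(2004, 1, 31))
```

The "(almost)" on the slide is visible in the denominator: a day is worth 1/365 of a unit in a common year and 1/366 in a leap year, so intervals that span different years are preserved only approximately.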