Data Mining with Rattle and R: Survival Guide
Total Page:16
File Type:pdf, Size:1020Kb
Personal Copy for Martin Schultz, 18 January 2008 | | | | Data Mining Desktop Survival Guide Togaware Watermark For Data Mining Survival Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Togaware Series of Open Source Desktop Survival Guides This innovative series presents open source and freely available software tools and techniques with a focus on the tasks they are built for, ranging from working with the GNU/Linux operating system, through common desktop productivity tools, to sophisticated data mining applications. Each volume aims to be self contained, and slim, presenting the informa- tion in an easy to follow format without overwhelming the reader with details. Togaware Watermark For Data Mining Survival Series Titles Data Mining with Rattle R for the Data Miner Text Mining with Rattle Debian GNU/Linux OpenMoko and the Neo 1973 ii Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Data Mining Desktop Survival Guide Graham Williams Togaware.com Togaware Watermark For Data Mining Survival iii Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | The procedures and applications presented in this book have been in- cluded for their instructional value. They have been tested but are not guaranteed for any particular purpose. Neither the publisher nor the author offer any warranties or representations, nor do they accept any liabilities with respect to the programs and applications. The book, as you see it presently, is a work in progress, and different sec- tions are progressed depending on feedback. Please send comments, sug- gestions, updates, and criticisms to [email protected]. I hope you find it useful! Togaware Watermark For Data Mining Survival Printed 18th January 2008 Copyright c 2006-2007 by Graham Williams ISBN 0-9757109-2-3 iv Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Togaware Watermark For Data Mining Survival Where knowledge is power, data is the fuel and data mining the engine room for delivering that knowledge. v Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Togaware Watermark For Data Mining Survival vi Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Dedication Togaware Watermark For Data Mining Survival vii Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Togaware Watermark For Data Mining Survival viii Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | Contents I Data Mining with Rattle1 1 Introduction3 1.1 Data Mining.........................4 Togaware Watermark For Data Mining Survival 1.2 Types of Analysis.......................4 1.3 Data Mining Applications..................4 1.4 A Framework for Modelling.................4 1.5 Agile Data Mining......................5 2 Rattle Data Miner7 2.1 Installing GTK, R, and Rattle ................8 2.1.1 Quick Start Install..................9 2.1.2 Installation Details.................. 10 2.2 The Initial Interface..................... 15 2.3 Interacting with Rattle .................... 16 2.4 Menus and Buttons...................... 18 2.4.1 Project Menu and Buttons............. 19 2.4.2 Edit Menu...................... 19 ix Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | 2.4.3 Tools Menu and Toolbar............... 19 2.4.4 Settings........................ 20 2.4.5 Help.......................... 20 2.5 Paradigms........................... 20 2.6 Interacting with Plots.................... 23 2.7 Summary........................... 24 3 Sourcing Data 25 3.1 Nomenclature......................... 25 3.2 Loading Data......................... 26 3.3 CSV Data........................... 27 Togaware Watermark For Data Mining Survival 3.4 ARFF Data.......................... 32 3.5 ODBC Sourced Data..................... 35 3.6 R Data............................ 37 3.7 R Dataset........................... 37 3.8 Data Entry.......................... 39 4 Selecting Data 41 4.1 Sampling Data........................ 41 4.2 Variable Roles......................... 43 4.3 Automatic Role Identification................ 44 4.4 Weights Calculator...................... 45 5 Exploring Data 47 5.1 Summarising Data...................... 48 x Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | 5.1.1 Summary....................... 49 5.1.2 Describe........................ 50 5.1.3 Basics......................... 51 5.1.4 Kurtosis........................ 52 5.1.5 Skewness....................... 54 5.1.6 Missing........................ 55 5.2 Exploring Distributions................... 57 5.2.1 Box Plot....................... 60 5.2.2 Histogram....................... 62 5.2.3 Cumulative Distribution Plot............ 64 Togaware5.2.4 Watermark Benford's Law.................... For Data Mining Survival 65 5.2.5 Bar Plot........................ 70 5.2.6 Dot Plot........................ 70 5.2.7 Mosaic Plot...................... 70 5.3 Sophisticated Exploration with GGobi........... 71 5.3.1 Scatterplot...................... 72 5.3.2 Data Viewer: Identifying Entities in Plots..... 75 5.3.3 Other Options.................... 76 5.3.4 Further GGobi Documentation........... 77 5.4 Correlation Analysis..................... 78 5.4.1 Hierarchical Correlation............... 82 5.4.2 Principal Components................ 82 5.5 Single Variable Overviews.................. 82 xi Copyright c 2006-2008 Graham Williams | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | | Personal Copy for Martin Schultz, 18 January 2008 | | | | 6 Transforming Data 83 6.1 Normalising Data....................... 85 6.1.1 Recenter........................ 85 6.1.2 Scale [0,1]....................... 88 6.1.3 Rank.......................... 88 6.1.4 Median/MAD..................... 88 6.2 Impute............................. 88 6.2.1 Zero/Missing..................... 90 6.2.2 Mean/Median/Mode................. 90 6.2.3 Constant....................... 92 6.3Togaware Remap............................. Watermark For Data Mining Survival 92 6.3.1 Binning........................ 92 6.3.2 Indicator Variables.................. 92 6.3.3 Join Categoricals................... 95 6.3.4 Math Transforms................... 95 6.4 Outliers............................ 95 6.5 Cleanup............................ 95 6.5.1 Delete Ignored.................... 97 6.5.2 Delete Selected.................... 97 6.5.3 Delete Missing.................... 97 6.5.4 Delete Entities with Missing............. 97 7 Building Classification Models 99 7.1 Building Models....................... 100 xii Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | 7.2 Risk Charts.......................... 101 7.3 Decision Trees......................... 105 7.3.1 Tutorial Example................... 107 7.3.2 Formalities...................... 107 7.3.3 Tuning Parameters.................. 107 7.4 Boosting............................ 109 7.4.1 Tutorial Example................... 110 7.4.2 Formalities...................... 111 7.4.3 Tuning Parameters.................. 111 7.5 Random Forests....................... 111 Togaware7.5.1 Tutorial Watermark Example................... For Data Mining Survival 113 7.5.2 Formalities...................... 118 7.5.3 Tuning Parameters.................. 118 7.6 Support Vector Machine................... 119 7.7 Logistic Regression...................... 121 7.8 Bibliographic Notes...................... 123 8 Unsupervised Modelling 125 8.1 Cluster Analysis....................... 125 8.1.1 KMeans........................ 125 8.1.2 Export KMeans Clusters.............. 125 8.1.3 Discriminant Coordinates Plot........... 126 8.1.4 Number of Clusters................. 126 8.2 Hierarchical Clusters..................... 128 xiii Copyright c 2006-2008 Graham Williams | | | Graham Williams | Data Mining Desktop Survival 18th January 2008 | Personal Copy for Martin Schultz, 18 January 2008 | | | | 8.3 Association Rules....................... 129 8.3.1 Basket Analysis.................... 129 8.3.2 General Rules..................... 131 9 Evaluation 135 9.1 The Evaluate Tab...................... 136 9.2 Confusion Matrix....................... 138 9.2.1 Measures....................... 138 9.2.2 Graphical Measures................. 138 9.3 Lift............................... 139 9.4 ROC Curves......................... 140 Togaware Watermark For Data Mining Survival 9.5 Precision versus Recall.................... 140 9.6 Sensitivity versus Specificity................