Handouts-For-Workshop-On-Rattle-And-R.Pdf
A Data Mining Workshop
Excavating Knowledge from Data
Introducing Data Mining using R

[email protected]
Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia
[email protected]
http://datamining.togaware.com

Visit: http://onepager.togaware.com for Workshop Notes

http://togaware.com Copyright © 2014, [email protected]

Workshop Overview
  1 R: A Language for Data Mining
  2 Data Mining, Rattle, and R
  3 Loading, Cleaning, Exploring Data in Rattle
  4 Descriptive Data Mining
  5 Predictive Data Mining: Decision Trees
  6 Predictive Data Mining: Ensembles
  7 Moving into R and Scripting our Analyses
  8 Literate Data Mining in R

R: A Language for Data Mining

What is R? - Installing R

Installing R and Rattle
  First task is to install R
  As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.
  Visit CRAN at http://cran.rstudio.com
  Visit Rattle at http://rattle.togaware.com
  Linux: Install packages (Ubuntu is recommended)
    $ wajig install r-recommended r-cran-rattle
  Windows: Download and install from CRAN
  MacOSX: Download and install from CRAN

What is R? - Why a Workshop on R?

Why do Data Science with R?
  Most widely used Data Mining and Machine Learning Package
    Machine Learning
    Statistics
    Software Engineering and Programming with Data
    But not the nicest of languages for a Computer Scientist!
  Free (Libre) Open Source Statistical Software
    ... all modern statistical approaches
    ... many/most machine learning algorithms
    ... opportunity to readily add new algorithms
    That is important for us in the research community
    Get our algorithms out there and being used: impact!!!

What is R? - Popularity of R?

How Popular is R? Discussion List Traffic
  [Figure] Monthly email traffic on software’s main discussion list.

How Popular is R? Discussion Topics
  [Figure] Number of discussions on popular QandA forums 2013.

How Popular is R? R versus SAS
  [Figure] Number of R/SAS related posts to Stack Overflow by week.

How Popular is R? Professional Forums
  [Figure] Registered for the main discussion group for each software.

How Popular is R? User Survey
  [Figure] Rexer Analytics Survey 2010 results for data mining/analytic tools.

How Popular is R? Used in Analytics Competitions
  [Figure] Software used in data analysis competitions in 2011.

Source (all figures): http://r4stats.com/articles/popularity/

R: The Video
  A 90 Second Promo from Revolution Analytics
  http://www.revolutionanalytics.com/what-is-open-source-r/

Data Mining, Rattle, and R

An Introduction to Data Mining - Big Data and Big Business

Data Mining
  Application of
    Machine Learning
    Statistics
    Software Engineering and Programming with Data
    Effective Communications and Intuition
  ... to Datasets that vary by
    Volume, Velocity, Variety, Value, Veracity
  ... to discover new knowledge
  ... to improve business outcomes
  ... to deliver better tailored services

Data Mining
  A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.

Data Mining in Research
  Health Research
    Adverse reactions using linked Pharmaceutical, General Practitioner, Hospital, Pathology datasets.
  Astronomy
    Microlensing events in the Large Magellanic Cloud of several million observed stars (out of 10 billion).
  Psychology
    Investigation of age-of-onset for Alzheimer’s disease from 75 variables for 800 people.
  Social Sciences
    Survey evaluation.
    Social network analysis - identifying key influencers.

Data Mining in Government
  Australian Taxation Office
    Lodgment ($110M)
    Tax Havens ($150M)
    Tax Fraud ($250M)
  Immigration and Border Control
    Check passengers before boarding
  Health and Human Services
    Doctor shoppers
    Over servicing

The Business of Data Mining
  SAS has annual revenues of $3B (2013)
  IBM bought SPSS for $1.2B (2009)
  Analytics is >$100B business and >$320B by 2020
  Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...
  Shortage of 180,000 data scientists in US in 2018 (McKinsey) ...

An Introduction to Data Mining - Algorithms

Basic Tools: Data Mining Algorithms
  Cluster Analysis (kmeans, wskm)
  Association Analysis (arules)
  Linear Discriminant Analysis (lda)
  Logistic Regression (glm)
  Decision Trees (rpart, wsrpart)
  Random Forests (randomForest, wsrf)
  Boosted Stumps (ada)
  Neural Networks (nnet)
  Support Vector Machines (kernlab)
  ...
  That’s a lot of tools to learn in R!
  Many with different interfaces and options.

The Rattle Package for Data Mining - A GUI for Data Mining

Why a GUI?
  Statistics can be complex and traps await
  So many tools in R to deliver insights
  Effective analyses should be scripted
  Scripting also required for repeatability
  R is a language for programming with data

Users of Rattle
  Today, Rattle is used world wide in many industries
    Health analytics
    Customer segmentation and marketing
    Fraud detection
    Government
  It is used by
    Universities to teach Data Mining
    Within research projects for basic analyses
    Consultants