A Data Mining Workshop

Workshop Overview

1 R: A Language for Data Mining (Excavating Knowledge from Data: Introducing Data Mining using R)
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R

Presenter: [email protected]
Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia
http://datamining.togaware.com

Visit http://onepager.togaware.com for the Workshop Notes.

http://togaware.com Copyright © 2014, [email protected]

R: A Language for Data Mining

Installing R and Rattle

The first task is to install R.

As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.

Visit CRAN at http://cran.rstudio.com
Visit Rattle at http://rattle.togaware.com

GNU/Linux: install the packages, for example on Debian/Ubuntu:
$ wajig install r-recommended r-cran-rattle
Windows: download and install from CRAN.
MacOSX: download and install from CRAN.

What is R? Why a Workshop on R?

Why do Data Science with R?

The most widely used Data Mining and Machine Learning package:
- Machine Learning
- Statistics
- Software Engineering and Programming with Data
- But not the nicest of languages for a Computer Scientist!

Free (Libre) Open Source Statistical Software:
- ... all modern statistical approaches
- ... many/most machine learning algorithms
- ... the opportunity to readily add new algorithms

That is important for us in the research community: get our algorithms out there and being used—impact!

How Popular is R?

Each of the following comparisons is charted at the source below:
- Discussion list traffic: monthly email traffic on each software's main discussion list.
- Discussion topics: number of discussions on popular Q&A forums in 2013.
- R versus SAS: number of R/SAS related posts to Stack Overflow by week.
- Professional forums: numbers registered for the main discussion group for each software.
- Used in analytics competitions: software used in data analysis competitions in 2011.
- User survey: Rexer Analytics Survey 2010 results for data mining/analytic tools.

Source: http://r4stats.com/articles/popularity/

Data Mining, Rattle, and R

R — The Video

A 90 second promo from Revolution Analytics:
http://www.revolutionanalytics.com/what-is-open-source-r/

An Introduction to Data Mining: Big Data and Big Business

Data Mining

The application of:
- Machine Learning
- Statistics
- Software Engineering and Programming with Data
- Effective Communications and Intuition

... to datasets that vary by Volume, Velocity, Variety, Value, Veracity

... to discover new knowledge
... to improve business outcomes
... to deliver better tailored services

A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.

Data Mining in Research

- Health: adverse reactions using linked Pharmaceutical, General Practitioner, Hospital, and Pathology datasets.
- Astronomy: microlensing events in the Large Magellanic Cloud among several million observed stars (out of 10 billion).
- Psychology: investigation of age-of-onset for Alzheimer's disease from 75 variables for 800 people.
- Social Sciences: social network analysis, identifying key influencers.

Data Mining in Government

- Australian Taxation Office: Lodgment ($110M), Tax Havens ($150M), Tax Fraud ($250M).
- Immigration and Border Control: check passengers before boarding.
- Health and Human Services: survey evaluation, doctor shoppers, over servicing.

The Business of Data Mining

- SAS has annual revenues of $3B (2013)
- IBM bought SPSS for $1.2B (2009)
- Analytics is a >$100B business, and >$320B by 2020
- Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...
- Shortage of 180,000 data scientists in the US in 2018 (McKinsey)

Basic Tools: Data Mining Algorithms

- Cluster Analysis (kmeans, wskm)
- Association Analysis (arules)
- Linear Discriminant Analysis (lda)
- Logistic Regression (glm)
- Decision Trees (rpart, wsrpart)
- Random Forests (randomForest, wsrf)
- Boosted Stumps (ada)
- Neural Networks (nnet)
- Support Vector Machines (kernlab)
- ...

That's a lot of tools to learn in R, many with different interfaces and options.
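As a taste of those differing interfaces, here is a minimal sketch using only base R (the built-in iris data stands in for the workshop's weather dataset, since rattle may not be installed yet): glm() wants a formula and a data frame, while kmeans() wants a numeric matrix and a cluster count.

```r
# Logistic regression: formula + data frame interface.
iris2 <- transform(iris, virginica = as.integer(Species == "virginica"))
fit <- glm(virginica ~ Sepal.Length + Sepal.Width,
           data = iris2, family = binomial)

# Cluster analysis: numeric matrix + number of clusters.
set.seed(42)
cl <- kmeans(iris[, 1:4], centers = 3, nstart = 10)

length(coef(fit))   # intercept plus two predictors
length(cl$cluster)  # one cluster label per observation
```

Rattle's value is that it wraps these inconsistent interfaces behind one GUI.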

The Rattle Package for Data Mining: A GUI for Data Mining

Why a GUI?

- Statistics can be complex and traps await
- So many tools in R to deliver insights
- Effective analyses should be scripted
- Scripting is also required for repeatability
- R is a language for programming with data
- How to remember how to do all of this in R?
- How to skill up 150 data analysts with Data Mining?

Users of Rattle

Today, Rattle is used world wide in many industries:
- Health analytics
- Customer segmentation and marketing
- Fraud detection
- Government

It is used by Universities to teach Data Mining, within research projects for basic analyses, and by consultants and analytics teams across business.

It is and will remain freely available: CRAN and http://rattle.togaware.com

The Rattle Package for Data Mining: Setting Things Up

Installation

Rattle is built using R:
- Download and install R from cran.r-project.org
- Recommended: also install RStudio from www.rstudio.org

Then start up RStudio and install Rattle:

install.packages("rattle")

Then we can start up Rattle:

rattle()

Required packages are loaded as needed.

A Tour Thru Rattle: Startup

A Tour Thru Rattle: Loading Data
A Tour Thru Rattle: Explore Distribution
A Tour Thru Rattle: Explore Correlations
A Tour Thru Rattle: Hierarchical Cluster
A Tour Thru Rattle: Decision Tree
A Tour Thru Rattle: Decision Tree Plot
A Tour Thru Rattle: Random Forest
A Tour Thru Rattle: Risk Chart

(Each of these tour slides shows a screenshot of the corresponding Rattle screen.)

[Risk chart: Random Forest on weather.csv (test set), target RainTomorrow. Performance (%) plotted against Caseload (%), with risk scores along the top, a Lift axis on the right, and curves for RainTomorrow (92%), Rain in MM (97%), and Precision.]

Moving Into R: Programming with Data

Data Scientists are Programmers of Data

But...
- a GUI can only do so much
- R is a powerful statistical language
- data scientists are programmers of data

Data scientists desire:
- Scripting
- Transparency
- Repeatability
- Sharing

From GUI to CLI — Rattle's Log Tab

R Tool Suite: The Power of Free/Libre and Open Source Software (Tools)

- Ubuntu GNU/Linux: feature rich toolkit, up-to-date, easy to install, FLOSS
- RStudio: easy to use integrated development environment, FLOSS (a powerful alternative is Emacs Speaks Statistics, FLOSS)
- R Statistical Software Language: extensive, powerful, thousands of contributors, FLOSS
- KnitR and LaTeX: produce beautiful documents, easily reproducible, FLOSS

R Tool Suite: Using Ubuntu

- Desktop operating system (GNU/Linux), replacing Windows and OSX.
- The GNU tool suite based on Unix: a significant heritage.
- Multiple specialised single task tools, working well together, compared to a single application trying to do it all.
- Powerful data processing from the command line: grep, awk, head, tail, wc, sed, perl, python, most, diff, make, paste, join, patch, . . .
- For interacting with R, start up RStudio from the Dash.

RStudio Interface: RStudio—The Default Three Panels

RStudio Interface: RStudio—With R Script File—Editor Panel

Introduction to R: Simple Plots

Scatterplot—R Code

Our first little bit of R code loads a couple of packages into the R library:

library(rattle)   # Provides the weather dataset
library(ggplot2)  # Provides the qplot() function

Then produce a quick plot using qplot():

ds <- weather
qplot(MinTemp, MaxTemp, data=ds)

Your turn: give it a go.

Scatterplot—Plot

[Scatterplot of MaxTemp against MinTemp for the weather dataset, as produced by the qplot() call above.]

Scatterplot—RStudio

Introduction to R: Installing Packages

Missing Packages—Tools → Install Packages...

RStudio—Installing ggplot2

Introduction to R: RStudio Shortcuts

RStudio—Keyboard Shortcuts

These will become very useful!

Editor:
- Ctrl-Enter will send the line of code to the R console
- Ctrl-2 will move the cursor to the Console

Console:
- UpArrow will cycle through previous commands
- Ctrl-UpArrow will search previous commands
- Tab will complete function names and list the arguments
- Ctrl-1 will move the cursor to the Editor

Your turn: try them out.

Introduction to R: Basic R Commands

Basic R

library(rattle)  # Load the weather dataset.
head(weather)    # First 6 observations of the dataset.

## Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...
....

str(weather)     # Structure of the variables in the dataset.

## 'data.frame': 366 obs. of 24 variables:
## $ Date : Date, format: "2007-11-01" "2007-11-...
## $ Location : Factor w/ 46 levels "Adelaide","Alba...
## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
....

Basic R

summary(weather)  # Univariate summary of the variables.

## Date Location MinTemp ...
## Min. :2007-11-01 Canberra :366 Min. :-5.30 ...
## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.30 ...
## Median :2008-05-01 Albany : 0 Median : 7.45 ...
## Mean :2008-05-01 Albury : 0 Mean : 7.27 ...
## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.50 ...
## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.90 ...
## (Other) : 0 ...
## Rainfall Evaporation Sunshine WindGust...
## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW : ...
## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW : ...
## Median : 0.00 Median : 4.20 Median : 8.60 E : ...
## Mean : 1.43 Mean : 4.52 Mean : 7.91 WNW : ...
## 3rd Qu.: 0.20 3rd Qu.: 6.40 3rd Qu.:10.50 ENE : ...
....

Introduction to R: Visualising Data

Visual Summaries—Add A Little Colour

qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)

[Scatterplot of Pressure3pm against Humidity3pm, points coloured by RainTomorrow (No/Yes).]

Visual Summaries—Careful with Categorics

qplot(WindGustDir, Pressure3pm, data=ds)

Visual Summaries—Add A Little Jitter

qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")

[Two plots of Pressure3pm against WindGustDir (compass directions N through NNW, plus NA): without jitter the points overplot in vertical columns; with jitter they spread out to show the distribution within each category.]
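What geom="jitter" is doing can be sketched with base R's jitter(), which adds a little uniform noise so overplotted categorical points separate (the data here is made up for illustration):

```r
set.seed(1)
x <- rep(1:3, each = 10)  # a categorical variable coded as 1, 2, 3
xj <- jitter(x)           # small random displacement, default amount

# The displacement is tiny relative to the gaps between categories,
# so the groups stay distinct while individual points become visible.
max(abs(xj - x))          # well under 0.5
```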

Visual Summaries—And Some Colour

qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")

[The jittered plot again, with points coloured by WindGustDir.]

Introduction to R: Help

Getting Help—Precede a Command with ?

Loading, Cleaning, Exploring Data in Rattle

Loading Data


Exploring Data

Test Data

Transform Data

Descriptive Data Mining

Cluster Analysis: What is Cluster Analysis?

A cluster is a collection of observations that are:
- similar to one another within the same cluster
- dissimilar to the observations in other clusters

Cluster analysis:
- groups a set of data observations into classes
- is unsupervised classification: no predefined classes—descriptive data mining.

Typical applications:
- as a stand-alone tool to get insight into the data distribution
- as a preprocessing step for other algorithms

Major Clustering Approaches

- Partitioning algorithms (kmeans, pam, clara, fanny): construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated, starting from an initial (perhaps random) clustering.
- Hierarchical algorithms (hclust, agnes, diana): create a hierarchical decomposition of the set of observations using some criterion.
- Density-based algorithms: based on connectivity and density functions.
- Grid-based algorithms: based on a multiple-level granularity structure.
- Model-based algorithms (mclust for mixtures of Gaussians): a model is hypothesized for each of the clusters and the idea is to find the best fit of that model.
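Two of these approaches can be sketched with base R alone (kmeans() for partitioning, hclust() for hierarchical); the built-in iris measurements stand in here for a real dataset:

```r
set.seed(42)

# Partitioning: a fixed number of clusters, k = 3.
km <- kmeans(iris[, 1:4], centers = 3, nstart = 10)
table(km$cluster)  # cluster sizes

# Hierarchical: agglomerate observations over a distance matrix,
# then cut the dendrogram into 3 groups.
hc <- hclust(dist(iris[, 1:4]), method = "ward.D2")
groups <- cutree(hc, k = 3)
table(groups)      # group sizes
```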

Descriptive Data Mining: KMeans Clustering

An unsupervised learning algorithm—descriptive data mining.

Association Rule Mining

Identify items (patterns) that occur frequently together in a given set of data.
- Patterns = associations, correlations, causal structures (rules).
- Data = sets of items in transactional databases, relational databases, or complex information repositories.

Rule: Body → Head [support, confidence]

Example Association Rules

- Friday ∩ Nappies → Beer [0.5%, 60%]
- Age ∈ [20, 30] ∩ Income ∈ [20K, 30K] → MP3Player [2%, 60%]
- Maths ∩ CS → HDinCS [1%, 75%]
- Gladiator ∩ Patriot → Sixth Sense [0.1%, 90%]
- Statins ∩ Peritonitis → Chronic Renal Failure [0.1%, 32%]
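To make the [support, confidence] notation concrete, here is a hand computation for a Body → Head rule over a tiny made-up set of transactions (in practice the arules package does this at scale):

```r
transactions <- list(
  c("nappies", "beer", "milk"),
  c("nappies", "beer"),
  c("nappies", "bread"),
  c("beer", "bread"),
  c("milk")
)
n <- length(transactions)

# How many transactions contain all the given items?
count <- function(items)
  sum(sapply(transactions, function(t) all(items %in% t)))

# Rule: nappies -> beer
support    <- count(c("nappies", "beer")) / n  # P(Body and Head)
confidence <- count(c("nappies", "beer")) /
              count("nappies")                 # P(Head | Body)

support     # 0.4
confidence  # 2/3
```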

Predictive Data Mining: Decision Trees

Decision Trees Basics: Predictive Modelling: Classification

- The goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.
- The model is then used on unseen cases to make decisions.
- Often referred to as supervised learning.
- Common approaches: decision trees; neural networks; logistic regression; support vector machines.
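A minimal supervised-learning example, assuming the rpart package (shipped with R as a recommended package) and using iris as a stand-in for the workshop's weather data: build a model from past examples, then apply it to unseen cases.

```r
library(rpart)

set.seed(42)
train <- sample(nrow(iris), 100)  # the examples of past decisions
fit <- rpart(Species ~ ., data = iris[train, ])

# Use the model on the unseen cases.
pred <- predict(fit, iris[-train, ], type = "class")
accuracy <- mean(pred == iris$Species[-train])
accuracy  # typically well above 0.9 on this easy dataset
```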

Language: Decision Trees

Knowledge representation: a flow-chart-like tree structure:
- internal nodes denote a test on a variable
- branches represent the outcomes of the test
- leaf nodes represent class labels or a class distribution

Tree Construction: Divide and Conquer

Decision tree induction is an example of a recursive partitioning algorithm: divide and conquer.
- At the start, all the training examples are at the root.
- Partition the examples recursively based on selected variables.

[Illustration: a scatter of positive and negative examples split first on Gender (Male/Female) and then on Age (<42, >42), alongside the corresponding decision tree.]

Decision Trees Algorithm

Algorithm for Decision Tree Induction

A greedy algorithm: it takes the best immediate (local) decision while building the overall model.

The tree is constructed top-down, recursive, divide-and-conquer:
- Begin with all training examples at the root.
- Partition the data recursively based on selected variables.
- Select variables on the basis of a measure.

Stop partitioning when:
- all samples for a given node belong to the same class;
- there are no remaining variables for further partitioning (majority voting is employed for classifying the leaf); or
- there are no samples left.

Basic Motivation: Entropy

We are trying to predict output Y (e.g., Yes/No) from input X.

A random data set may have high entropy:
- Y is from a uniform distribution
- a frequency distribution would be flat!
- a sample will include uniformly random values of Y

A data set with low entropy:
- Y's distribution will be very skewed
- a frequency distribution will have a single peak
- a sample will predominantly contain just Yes or just No

Work towards reducing the amount of entropy in the data!

Variable Selection Measure: Entropy

Information gain (ID3/C4.5): select the variable with the highest information gain.

Assume there are two classes: P and N. Let the data S contain p elements of class P and n elements of class N. The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as:

I(p, n) = - p/(p+n) log2(p/(p+n)) - n/(p+n) log2(n/(p+n))
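The measure above is easy to check numerically with a small R function (the zero-count case is taken as contributing zero):

```r
info <- function(p, n) {
  f <- c(p, n) / (p + n)
  f <- f[f > 0]          # treat 0 * log2(0) as 0
  -sum(f * log2(f))
}

info(5, 5)   # an even split: 1 bit, maximal entropy
info(10, 0)  # a pure node: 0, nothing left to decide
info(9, 5)   # a skewed node: about 0.94 bits
```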

Variable Selection Measure: Gini

The Gini index of impurity: a traditional statistical measure, used by CART.

It measures how often a randomly chosen observation would be incorrectly classified if it were classified at random in proportion to the actual classes: the sum over classes of the probability of an observation being chosen times the probability of incorrect classification. With p the proportion of positives, equivalently:

IG(p) = 1 - (p^2 + (1-p)^2)

As with entropy, the Gini measure is maximal when the classes are equally distributed and minimal when all observations are in one class or the other.

Variable Selection Measure

[Plot comparing the Info (entropy) and Gini measures as a function of the proportion of positives.]
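And the Gini measure as a one-liner, confirming the behaviour just described (p is the proportion of positives):

```r
gini <- function(p) 1 - (p^2 + (1 - p)^2)

gini(0.5)  # 0.5: maximal, classes equally distributed
gini(0)    # 0: all observations in one class
gini(1)    # 0: all observations in the other class
```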

http: // togaware. com Copyright © 2014, [email protected] 16/46 http: // togaware. com Copyright © 2014, [email protected] 17/46 Decision Trees Algorithm Building Decision Trees In Rattle Information Gain Startup Rattle

library(rattle) rattle() Now use variable A to partition S into v cells: S1, S2,..., Sv { } If Si contains pi examples of P and ni examples of N, the information now needed to classify objects in all subtrees Si is:

v pi + ni E(A) = I (pi , ni ) p + n Xi=1 So, the information gained by branching on A is:

Gain(A) = I (p, n) E(A) −

So choose the variable A which results in the greatest gain in information.

http: // togaware. com Copyright © 2014, [email protected] 18/46 http: // togaware. com Copyright © 2014, [email protected] 20/46
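A numeric sketch of that information gain computation, for a hypothetical variable A that splits S (with p = 9, n = 5) into three cells with counts (2,3), (4,0) and (3,2):

```r
info <- function(p, n) {
  f <- c(p, n) / (p + n)
  f <- f[f > 0]
  -sum(f * log2(f))
}

p <- 9; n <- 5
cells <- list(c(2, 3), c(4, 0), c(3, 2))

# E(A): information still needed after branching on A.
EA <- sum(sapply(cells, function(s) sum(s) / (p + n) * info(s[1], s[2])))

# Gain(A) = I(p, n) - E(A)
gain <- info(p, n) - EA
round(gain, 3)  # about 0.247
```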

Load the Example Weather Dataset

Click on the Execute button and an example dataset is offered. Click on Yes to load the weather dataset.

Summary of the Weather Dataset

A summary of the weather dataset is displayed.

Model Tab — Decision Tree

Click on the Model tab to display the modelling options.

Build Tree to Predict RainTomorrow

Decision Tree is the default model type—simply click Execute.

Click the Draw button to display a tree Click Evaluate tab—options to evaluate model performance. (Settings Advanced Graphics). →


Building Decision Trees in Rattle: Evaluate Decision Tree—Error Matrix

Click Execute to display a simple error matrix. Identify the True/False Positives/Negatives.

Building Decision Trees in Rattle: Decision Tree Risk Chart

Click the Risk type and then Execute.


Building Decision Trees in Rattle: Decision Tree ROC Curve

Click the ROC type and then Execute.

Building Decision Trees in Rattle: Score a Dataset

Click the Score type to score a new dataset using the model.

Building Decision Trees in Rattle: Log of R Commands

Click the Log tab for a history of all your interactions. Save the log contents as a script to repeat what we did.

Building Decision Trees in Rattle: Log of R Commands—rpart()

Here we see the call to rpart() to build the model. Click on the Export button to save the script to file.


Building Decision Trees in Rattle: Help → Model → Tree

Rattle provides some basic help—click Yes for R help.

Building Decision Trees in R: Weather Dataset - Inputs

ds <- weather
head(ds, 4)

## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
## 1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3
## 2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7
## 3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3
## 4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1
....

summary(ds[c(3:5, 23)])

## MinTemp MaxTemp Rainfall RISK_MM
## Min. :-5.30 Min. : 7.6 Min. : 0.00 Min. : 0.00
## 1st Qu.: 2.30 1st Qu.:15.0 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 7.45 Median :19.6 Median : 0.00 Median : 0.00
## Mean : 7.27 Mean :20.6 Mean : 1.43 Mean : 1.43
....


Building Decision Trees in R: Weather Dataset - Target

target <- "RainTomorrow"
summary(ds[target])

## RainTomorrow
## No :300
## Yes: 66

(form <- formula(paste(target, "~ .")))

## RainTomorrow ~ .

(vars <- names(ds)[-c(1, 2, 23)])

## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"
## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"
## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"
## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"
## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainToday"
## [21] "RainTomorrow"

Building Decision Trees in R: Simple Train/Test Paradigm

set.seed(1421)
train <- sample(1:nrow(ds), 0.70*nrow(ds))  # Training dataset
head(train)

## [1] 288 298 363 107 70 232

length(train)

## [1] 256

test <- setdiff(1:nrow(ds), train)  # Testing dataset
length(test)

## [1] 110

Building Decision Trees in R: Display the Model

model <- rpart(form, ds[train, vars])
model

## n= 256
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 256 44 No (0.82812 0.17188)
## 2) Humidity3pm< 59.5 214 21 No (0.90187 0.09813)
## 4) WindGustSpeed< 64 204 14 No (0.93137 0.06863)
## 8) Cloud3pm< 6.5 163 5 No (0.96933 0.03067) *
## 9) Cloud3pm>=6.5 41 9 No (0.78049 0.21951)
## 18) Temp3pm< 26.1 34 4 No (0.88235 0.11765) *
## 19) Temp3pm>=26.1 7 2 Yes (0.28571 0.71429) *
....

Building Decision Trees in R: Performance on the Test Dataset

The predict() function is used to score new data.

head(predict(model, ds[test,], type="class"))

## 2 4 6 8 11 12
## No No No No No No
## Levels: No Yes

table(predict(model, ds[test,], type="class"), ds[test, target])

##      No Yes
## No   77  14
## Yes  11   8

Notice the legend to help interpret the tree.
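As a small extra (not on the original slide), the error matrix can be summarised into an overall accuracy. This assumes the model, ds, test and target objects defined above:

```r
# Sketch (not from the slides): overall accuracy and error rate
# derived from the error matrix. Assumes model, ds, test and target
# exist as defined earlier in this section.
cm  <- table(predict(model, ds[test,], type="class"), ds[test, target])
acc <- sum(diag(cm)) / sum(cm)   # proportion of correct predictions
c(accuracy=acc, error=1 - acc)
```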

Building Decision Trees in R: Example Decision Tree Plot using Rattle

[Plot: the decision tree for RainTomorrow, splitting on Humidity3pm, WindGustSpeed, Cloud3pm, Temp3pm and Pressure3pm.]

Building Decision Trees in R: An R Scripting Hint

Notice the use of the variables ds, target, and vars. Change these variables, and the remaining script is unchanged. This simplifies script writing and the reuse of scripts.

ds <- iris
target <- "Species"
vars <- names(ds)

Then repeat the rest of the script, without change.

Building Decision Trees in R: An R Scripting Hint — Unchanged Code

This code remains the same to build the decision tree:

form <- formula(paste(target, "~ ."))
train <- sample(1:nrow(ds), 0.70*nrow(ds))
test <- setdiff(1:nrow(ds), train)
model <- rpart(form, ds[train, vars])
model

## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 69 setosa (0.34286 0.32381 0.33333)
## 2) Petal.Length< 2.6 36 0 setosa (1.00000 0.00000 0.00000) *
## 3) Petal.Length>=2.6 69 34 virginica (0.00000 0.49275 0.50725)
## 6) Petal.Length< 4.95 35 2 versicolor (0.00000 0.94286 0.05714) *
## 7) Petal.Length>=4.95 34 1 virginica (0.00000 0.02941 0.97059) *

Similarly for the predictions:

head(predict(model, ds[test,], type="class"))

## 3 8 9 10 11 12
## setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

table(predict(model, ds[test,], type="class"), ds[test, target])

##            setosa versicolor virginica
## setosa         14          0         0
## versicolor      0         15         4
## virginica       0          1        11

Decision Trees: Summary

Decision Tree Induction.
Most widely deployed machine learning algorithm.
Simple idea, powerful learner.
Available in R through the rpart package.
Related packages include party, Cubist, C50, RWeka (J48).

Workshop Overview

1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R

Predictive Data Mining: Ensembles


Building Multiple Models

The general idea was developed in the Multiple Inductive Learning algorithm (Williams 1987). The ideas were developed (ACJ 1987, PhD 1990) in the context of: observing that variable selection methods don't discriminate; so build multiple decision trees; then combine into a single model.

The basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation.

Two approaches covered: Boosting and Random Forests. Meta learners. "Best off-the-shelf model builder." (Leo Breiman)

Boosting Algorithms

Basic idea: boost observations that are "hard to model."

Algorithm: iteratively build weak models using a weak learner:
Build an initial model;
Identify mis-classified cases in the training dataset;
Boost (over-represent) training observations modelled incorrectly;
Build a new model on the boosted training dataset;
Repeat.

The result is an ensemble of weighted models.
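The boosting loop above can be sketched with the ada package, which the example slides that follow also use. The dataset choice and settings here are assumptions for illustration:

```r
# Sketch of boosted trees via ada (AdaBoost), as used in the example
# slides that follow. Dataset split and iteration count are
# illustrative assumptions, not taken from the slides.
library(rattle)  # provides the weather dataset
library(ada)     # boosted decision trees

set.seed(42)
train <- sample(nrow(weather), 0.7 * nrow(weather))

# Drop Date, Location and RISK_MM, then boost 50 weak trees.
m <- ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)], iter=50)
m  # prints the training error of the weighted ensemble
```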


Boosting Example: Error Rate

Notice the error rate decreases quickly then flattens.

plot(m)

[Plot: training error over iterations 1 to 50.]

Boosting Example: Variable Importance

Helps understand the knowledge captured.

varplot(m)

[Plot: variable importance scores for each input, led by Temp3pm, Pressure9am, MinTemp and Humidity3pm.]

Boosting Example: Sample Trees

There are 50 trees in all. Here's the first 3.

fancyRpartPlot(m$model$trees[[1]])
fancyRpartPlot(m$model$trees[[2]])
fancyRpartPlot(m$model$trees[[3]])

[Plots: the first three boosted trees, splitting on variables such as Cloud3pm, Pressure3pm, Humidity3pm, Sunshine and MaxTemp.]

Boosting Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)

[Plot: risk chart with Recall (88%) and Risk (94%) curves against caseload.]


Boosting: Example Applications

ATO Application: What life events affect compliance? First application of the technology (1995). Decision Stumps: Age > NN; Change in Marital Status.

Boosted Neural Networks: OCR using neural networks as base learners (Drucker, Schapire, Simard, 1993).

Boosting: Summary

1 Boosting is implemented in R in the ada library.
2 AdaBoost uses e^{-m}; LogitBoost uses log(1 + e^{-m}); Doom II uses 1 - tanh(m).
3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost).
4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.
5 Can be proved to converge to a perfect model if the learners are always better than chance.


Random Forests

Original idea from Leo Breiman and Adele Cutler. The name is licensed to Salford Systems, hence the R package is randomForest. Typically presented in the context of decision trees. Random Multinomial Logit uses multiple multinomial logit models.

Build many decision trees (e.g., 500). For each tree:
Select a random subset of the training set (N);
Choose different subsets of variables for each node of the decision tree (m << M);
Build the tree without pruning (i.e., overfit).

Classify a new entity using every decision tree:
Each tree "votes" for the entity. The decision with the largest number of votes wins!
The proportion of votes is the resulting score.
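As a small sketch (not on this slide), the vote proportions described above are exactly what predict(..., type="prob") returns for a random forest. The data preparation mirrors the example on the following slides; imputing the new data with na.roughfix is an assumption for illustration:

```r
# Sketch: tree votes become class scores. Mirrors the weather example
# on the following slides; na.roughfix imputation of the scoring data
# is an illustrative assumption.
library(rattle)        # weather dataset
library(randomForest)

set.seed(42)
train <- sample(nrow(weather), 0.7 * nrow(weather))
m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
                  na.action=na.roughfix)

# Each row is the proportion of the 500 trees voting No and Yes.
head(predict(m, na.roughfix(weather[-train, -c(1:2, 23)]), type="prob"))
```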

Random Forests Example: RF on Weather Data

set.seed(42)
(m <- randomForest(RainTomorrow ~ .,
                   weather[train, -c(1:2, 23)],
                   na.action=na.roughfix,
                   importance=TRUE))

## Call:
## randomForest(formula=RainTomorrow ~ ., data=weath...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 13.67%
## Confusion matrix:
##      No Yes class.error
## No  211   4      0.0186
## Yes  31  10      0.7561

Random Forests Example: Error Rate

The error rate decreases quickly then flattens over the 500 trees.

plot(m)

[Plot: error against number of trees, 0 to 500.]

Random Forests Example: Variable Importance

Helps understand the knowledge captured.

varImpPlot(m, main="Variable Importance")

[Plot: MeanDecreaseAccuracy and MeanDecreaseGini for each variable, led by Sunshine, Pressure3pm and Cloud3pm.]

Random Forests Example: Sample Trees

There are 500 trees in all. Here's some rules from the first tree.

## Random Forest Model 1
##
## Tree 1 Rule 1 Node 30 Decision No
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
## 3: Cloud3pm <= 2.5
## 4: WindDir9am IN ("NNE")
## 5: Sunshine <= 10.25
## 6: Temp3pm <= 17.55
## ------...
## Tree 1 Rule 2 Node 31 Decision Yes
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
....

Random Forests Example: Performance

predicted <- predict(m, weather[-train,], type="prob")[,2]
actual <- weather[-train,]$RainTomorrow
risks <- weather[-train,]$RISK_MM
riskchart(predicted, actual, risks)

[Plot: risk chart with Recall (92%) and Risk (97%) curves against caseload.]

Features of Random Forests (Breiman)

Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.

Ensembles: Summary

Ensemble: multiple models working together.
Often better than a single model.
Variance and bias of the model are reduced.
The best available models today - accurate and robust.
In daily use in very many areas of application.

Workshop Overview

1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R

Moving into R and Scripting our Analyses

Moving Into R: Data Scientists are Programmers of Data

Data scientists are programmers of data. R is a powerful statistical language. But... a GUI can only do so much.

Data Scientists desire: scripting, transparency, repeatability, sharing.

From GUI to CLI — Rattle's Log Tab (screenshot of the Log tab).


Moving Into R: Step 1: Load the Dataset

dsname <- "weather"
ds <- get(dsname)
dim(ds)

## [1] 366 24

names(ds)

## [1] "Date" "Location" "MinTemp" "...
## [5] "Rainfall" "Evaporation" "Sunshine" "...
## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "...
## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "......

Step 2: Observe the Data — Observations

head(ds)

## Date Location MinTemp MaxTemp Rainfall Evapora...
## 1 2007-11-01 Canberra 8.0 24.3 0.0 ...
## 2 2007-11-02 Canberra 14.0 26.9 3.6 ...
## 3 2007-11-03 Canberra 13.7 23.4 3.6 ...
....

tail(ds)

## Date Location MinTemp MaxTemp Rainfall Evapo...
## 361 2008-10-26 Canberra 7.9 26.1 0 ...
## 362 2008-10-27 Canberra 9.0 30.7 0 ...
## 363 2008-10-28 Canberra 7.1 28.4 0 ...
....

Step 2: Observe the Data — Structure

str(ds)

## 'data.frame': 366 obs. of 24 variables:
## $ Date : Date, format: "2007-11-01" "2007-11-...
## $ Location : Factor w/ 46 levels "Adelaide","Alba...
## $ MinTemp : num 8 14 13.7 13.3 7.6 6.2 6.1 8.3 ...
## $ MaxTemp : num 24.3 26.9 23.4 15.5 16.1 16.9 1...
## $ Rainfall : num 0 3.6 3.6 39.8 2.8 0 0.2 0 0 16...
## $ Evaporation : num 3.4 4.4 5.8 7.2 5.6 5.8 4.2 5.6...
## $ Sunshine : num 6.3 9.7 3.3 9.1 10.6 8.2 8.4 4....
## $ WindGustDir : Ord.factor w/ 16 levels "N"<"NNE"<"N...
## $ WindGustSpeed: num 30 39 85 54 50 44 43 41 48 31 ...
## $ WindDir9am : Ord.factor w/ 16 levels "N"<"NNE"<"N...
## $ WindDir3pm : Ord.factor w/ 16 levels "N"<"NNE"<"N...
....


Step 2: Observe the Data — Summary

summary(ds)

## Date Location MinTemp ...
## Min. :2007-11-01 Canberra :366 Min. :-5.3...
## 1st Qu.:2008-01-31 Adelaide : 0 1st Qu.: 2.3...
## Median :2008-05-01 Albany : 0 Median : 7.4...
## Mean :2008-05-01 Albury : 0 Mean : 7.2...
## 3rd Qu.:2008-07-31 AliceSprings : 0 3rd Qu.:12.5...
## Max. :2008-10-31 BadgerysCreek: 0 Max. :20.9...
## (Other) : 0 ...
## Rainfall Evaporation Sunshine Wind...
## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW ...
## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW ...
## Median : 0.00 Median : 4.20 Median : 8.60 E ...
....

Step 2: Observe the Data — Variables

id <- c("Date", "Location")
target <- "RainTomorrow"
risk <- "RISK_MM"
(ignore <- union(id, risk))

## [1] "Date" "Location" "RISK_MM"

(vars <- setdiff(names(ds), ignore))

## [1] "MinTemp" "MaxTemp" "Rainfall" "...
## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "...
## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "...
## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "...
....


Step 3: Clean the Data — Remove Missing

dim(ds)
## [1] 366 24

sum(is.na(ds[vars]))
## [1] 47

ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]

dim(ds)
## [1] 328 24

sum(is.na(ds[vars]))
## [1] 0

Step 3: Clean the Data — Target as Categoric

summary(ds[target])

## RainTomorrow
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.183
## 3rd Qu.:0.000
## Max. :1.000
....

ds[target] <- as.factor(ds[[target]])
levels(ds[[target]]) <- c("No", "Yes")

summary(ds[target])

## RainTomorrow
## No :268
## Yes: 60

[Plot: bar chart of RainTomorrow counts.]

Step 4: Prepare for Modelling

(form <- formula(paste(target, "~ .")))
## RainTomorrow ~ .

(nobs <- nrow(ds))
## [1] 328

train <- sample(nobs, 0.70*nobs)
length(train)
## [1] 229

test <- setdiff(1:nobs, train)
length(test)
## [1] 99

Step 5: Build the Model — Random Forest

library(randomForest)
model <- randomForest(form, ds[train, vars], na.action=na.omit)
model

## Call:
## randomForest(formula=form, data=ds[train, vars], ...
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
....
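As a quick look (not on the original slide) before evaluating with a risk chart, the model's class probabilities can be inspected directly; these are the scores that riskchart() consumes:

```r
# Sketch (not from the slides): probability scores from the random
# forest built above. Assumes model, ds, test and vars as defined.
pr <- predict(model, ds[test, vars], type="prob")[, 2]
summary(pr)                          # distribution of P(RainTomorrow = Yes)
head(ifelse(pr > 0.5, "Yes", "No"))  # class call at a 0.5 threshold
```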


Step 6: Evaluate the Model — Risk Chart

pr <- predict(model, ds[test,], type="prob")[,2]
riskchart(pr, ds[test, target], ds[test, risk],
          title="Random Forest - Risk Chart",
          risk=risk, recall=target,
          thresholds=c(0.35, 0.15))

[Plot: Random Forest risk chart with RainTomorrow (98%) and RISK_MM (97%) curves against caseload.]

Literate Data Mining in R: Workshop Overview

1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R

Motivation: Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

"Remember that research you did last year? I've heard there is an update on the data that you used. Can you add the new data in and repeat the same analysis?"

"Jo Bloggs did a great analysis of the company returns data just before she left. Can you get someone else to analyse the new data set using the same methods, and so produce an updated report that we can present to the Exec next week?"

"The fraud case you provided an analysis of last year has finally reached the courts. We need to ensure we have a clear trail of the data sources, the analyses performed, and the results obtained, to stand up in court. Could you document these please."

Motivation: Literate Data Mining Overview

One document to intermix the analysis, code, and results.
Authors productive with narrative and code in one document.
Sweave (Leisch 2002) and now KnitR (Yihui 2011).
Embed R code into LaTeX documents for typesetting.
KnitR also supports publishing to the web.


Motivation: Why Reproducible Data Mining?

Automatically regenerate documents when code, data, or assumptions change.
Eliminate errors that occur when transcribing results into documents.
Record the context for the analysis and the decisions made about the type of analysis to perform in the one place.
Document the processes to provide integrity for the conclusions of the analysis.
Share the approach with others for peer review and for learning from each other—engender a continuous learning environment.

Motivation: Prime Objective: Trustworthy Software

"Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive."

John Chambers (2008), Software for Data Analysis: Programming with R


Motivation: Beautiful Output by KnitR

KnitR combined with LaTeX will:
Intermix analysis and the results of analysis.
Automatically generate graphics and tables.
Support reproducible and transparent analysis.
Produce the best looking reports.

Motivation: Beautiful Output by Default

The reader wants to read the document and easily do so!
Code highlighting is done automatically.
The default theme is carefully designed.
Many other themes are available.
R code is "properly" reformatted.
Analyses (graphs and tables) are automatically included.

Using RStudio

Simplified interaction with R, LaTeX, and KnitR.
Executes R code one line at a time.
Formats LaTeX documents and provides spell checking.
A single-click compile to PDF and synchronised views.

Demonstrate: Start up and explore RStudio.

Basic LaTeX Markup: Introducing LaTeX

A text markup language rather than a WYSIWYG.
Based on TeX from 1977 — very stable and powerful.
LaTeX is an easier-to-use macro package built on TeX.
Ensures consistent style (layout, fonts, tables, maths, etc.).
Automatic indexes, footnotes and references.
Documents are well structured and are clear text.
Has a learning curve.


Basic LaTeX Markup: Basic LaTeX Usage

\documentclass{article}
\begin{document}
...
\end{document}

Demonstrate: Create a new Sweave document in RStudio.

Basic LaTeX Markup: Structures

\documentclass{article}
\begin{document}

\section{Introduction}

\subsection{Concepts}

...

\end{document}


Basic LaTeX Markup: Formats

\documentclass{article}
\begin{document}

\begin{itemize}
\item ABC
\item DEF
\end{itemize}

This is \textbf{bold} text or \textit{italic} text, ...

\end{document}

Basic LaTeX Markup: RStudio Support for LaTeX

RStudio provides excellent support for working with LaTeX documents. It helps to avoid having to know too much about LaTeX, and is best illustrated through a demonstration: Format menu, Section commands, Font commands, List commands, Verbatim/Block commands, Spell Checker, Compile PDF.

Demonstrate: Start a new document, add contents, format to PDF.


Incorporating R Code

We insert R code in a Chunk starting with <<>>=.
We terminate the Chunk with @.
Save the LaTeX file with extension .Rnw.

This Chunk:

<<>>=
x <- sum(1:10)
x
@

Produces:

x <- sum(1:10)
x

## [1] 55

Demonstrate: Do this in RStudio.

Making You Look Good

<<>>=
for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}
@

is reformatted as:

for(i in 1:5)
{
    j <- cos(sin(i)*i^2) + 3
    print(j-5)
}

## [1] -1.334
## [1] -2.88
## [1] -1.704
## [1] -1.103
....

Incorporating R Code: R Within the Text

Include information about the data within the narrative. We can do that with \Sexpr{...}:

Our dataset has \Sexpr{nrow(ds)} observations of
\Sexpr{ncol(ds)} variables.

Becomes:

Our dataset has 82169 observations of 24 variables.

Better still, \Sexpr{format(nrow(ds), big.mark=",")} gives:

Our dataset has 82,169 observations of 24 variables.

Formatting Tables and Plots: A Simple Table

library(xtable)
obs <- sample(1:nrow(weatherAUS), 8)
vars <- 2:6
xtable(weatherAUS[obs, vars])

      Location     MinTemp MaxTemp Rainfall Evaporation
50959 Cairns       17.50   27.30   0.00     4.80
26581 Canberra     13.20   31.30   0.00     6.60
3947  Cobar        18.80   34.00   0.00     7.20
73014 SalmonGums   10.60   33.50   0.00
27770 Canberra      3.10   14.00   0.00     1.60
67989 PerthAirport 13.30   24.00   0.00     5.80
80467 Darwin       25.80   31.70   0.40     9.00
33587 Ballarat      6.00   19.30   0.00

Table: Exclude Row Names

print(xtable(weatherAUS[obs, vars]),
      include.rownames=FALSE)

Location     MinTemp MaxTemp Rainfall Evaporation
Cairns       17.50   27.30   0.00     4.80
Canberra     13.20   31.30   0.00     6.60
Cobar        18.80   34.00   0.00     7.20
SalmonGums   10.60   33.50   0.00
Canberra      3.10   14.00   0.00     1.60
PerthAirport 13.30   24.00   0.00     5.80
Darwin       25.80   31.70   0.40     9.00
Ballarat      6.00   19.30   0.00

Table: Limit Number of Digits

print(xtable(weatherAUS[obs, vars], digits=1),
      include.rownames=FALSE)

Location     MinTemp MaxTemp Rainfall Evaporation
Cairns       17.5    27.3    0.0      4.8
Canberra     13.2    31.3    0.0      6.6
Cobar        18.8    34.0    0.0      7.2
SalmonGums   10.6    33.5    0.0
Canberra      3.1    14.0    0.0      1.6
PerthAirport 13.3    24.0    0.0      5.8
Darwin       25.8    31.7    0.4      9.0
Ballarat      6.0    19.3    0.0

Table: Tiny Font

vars <- 2:8
print(xtable(weatherAUS[obs, vars], digits=0),
      size="tiny", include.rownames=FALSE)

Location     MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
Cairns       18      27      0        5           10       SSE
Canberra     13      31      0        7           12       WSW
Cobar        19      34      0        7           11       ENE
SalmonGums   11      34      0                             SE
Canberra      3      14      0        2            2       S
PerthAirport 13      24      0        6            8       W
Darwin       26      32      0        9            7       WNW
Ballarat      6      19      0                             SE

Table: Column Alignment

print(xtable(weatherAUS[obs, vars],
             digits=0, align="rlrrrrrr"),
      size="tiny")

      Location     MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
50959 Cairns       18      27      0        5           10       SSE
26581 Canberra     13      31      0        7           12       WSW
3947  Cobar        19      34      0        7           11       ENE
73014 SalmonGums   11      34      0                             SE
27770 Canberra      3      14      0        2            2       S
67989 PerthAirport 13      24      0        6            8       W
80467 Darwin       26      32      0        9            7       WNW
33587 Ballarat      6      19      0                             SE


Table: Caption

print(xtable(weatherAUS[obs, vars],
             digits=1,
             caption="This is the table caption."),
      size="tiny")

      Location     MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir
50959 Cairns       17.5    27.3    0.0      4.8         10.0     SSE
26581 Canberra     13.2    31.3    0.0      6.6         11.6     WSW
3947  Cobar        18.8    34.0    0.0      7.2         11.3     ENE
73014 SalmonGums   10.6    33.5    0.0                           SE
27770 Canberra      3.1    14.0    0.0      1.6          1.9     S
67989 PerthAirport 13.3    24.0    0.0      5.8          8.1     W
80467 Darwin       25.8    31.7    0.4      9.0          7.3     WNW
33587 Ballarat      6.0    19.3    0.0                           SE

Table: This is the table caption.

Plots

library(ggplot2)
cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
ds <- subset(weatherAUS, Location %in% cities & !is.na(Temp3pm))
g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
g <- g + geom_density(alpha=0.55)
print(g)

[Plot: density of Temp3pm for Canberra, Darwin, Melbourne and Sydney.]

Create a KnitR Document: New → R Sweave

We wish to use KnitR rather than the older Sweave processor.

Setup KnitR

In RStudio we can configure the options to use knitr:
Select Tools → Options
Choose the Sweave group
Choose knitr for "Weave Rnw files using:"
The remaining defaults should be okay
Click Apply and then OK

http: // togaware. com Copyright © 2014, [email protected] 24/34 http: // togaware. com Copyright © 2014, [email protected] 25/34 Knitting Our First KnitR Document Knitting Our First KnitR Document Simple KnitR Document Simple KnitR Document

Insert the following into your new KnitR document: Insert the following into your new KnitR document:

\title{Sample KnitR Document} \title{Sample KnitR Document} \author{Graham Williams} \author{Graham Williams} \maketitle \maketitle

\section*{My First Section} \section*{My First Section}

This is some text that is automatically typeset This is some text that is automatically typeset by the LaTeX processor to produce well formatted by the LaTeX processor to produce well formatted quality output as PDF. quality output as PDF.

Your turn—Click Compile PDF to view the result.
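The slide shows only the body of the document; a complete .Rnw file also needs the usual LaTeX wrapper. A minimal sketch (the document class and its options are an assumption — RStudio's R Sweave template supplies its own):

```latex
\documentclass[a4paper]{article}
\begin{document}

\title{Sample KnitR Document}
\author{Graham Williams}
\maketitle

\section*{My First Section}

This is some text that is automatically typeset
by the LaTeX processor to produce well formatted
quality output as PDF.

\end{document}
```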


Knitting Our First KnitR Document
Simple KnitR Document—Resulting PDF

Result of Compile PDF


Knitting Including R Commands in KnitR
KnitR: Add R Commands

R code can be used to generate results into the document:

<<>>=
library(rattle)  # Provides the weather dataset
library(ggplot2) # Provides the qplot() function
ds <- weather
qplot(MinTemp, MaxTemp, data=ds)
@

Your turn—Click Compile PDF to view the result.

Knitting Including R Commands in KnitR
KnitR Document With R Code

Knitting Including R Commands in KnitR
Simple KnitR Document—PDF with Plot

Result of Compile PDF


Knitting Basics Cheat Sheet
LaTeX Basics

\subsection*{...}     % Introduce a Sub Section
\subsubsection*{...}  % Introduce a Sub Sub Section
\textbf{...}          % Bold font
\textit{...}          % Italic font
\begin{itemize}       % A bullet list
\item ...
\item ...
\end{itemize}

Plus an extensive collection of other markup and capabilities.

Knitting Basics Cheat Sheet
KnitR Basics

echo=FALSE      # Do not display the R code
eval=TRUE       # Evaluate the R code
results="hide"  # Hide the results of the R commands
fig.width=10    # Extend figure width from 7 to 10 inches
fig.height=8    # Extend figure height from 7 to 8 inches
out.width="0.8\\textwidth"    # Fit figure 80% page width
out.height="0.5\\textheight"  # Fit figure 50% page height

Plus an extensive collection of other options.
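The chunk options from the cheat sheet are combined inside the `<<...>>=` header, separated by commas. A hedged sketch, reusing the weather scatterplot from the earlier slide (the particular combination of options here is illustrative, not from the original):

```latex
<<echo=FALSE, fig.width=10, fig.height=8, out.width="0.8\\textwidth">>=
library(rattle)  # Provides the weather dataset
library(ggplot2) # Provides the qplot() function
qplot(MinTemp, MaxTemp, data=weather)
@
```

With echo=FALSE the R code itself is suppressed, so only the resulting figure, scaled to 80% of the text width, appears in the PDF.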


Moving Into R
Resources and References

OnePageR: http://onepager.togaware.com – Tutorial Notes
Rattle: http://rattle.togaware.com
Guides: http://datamining.togaware.com
Practise: http://analystfirst.com
Book: Data Mining using Rattle/R
Chapter: Rattle and Other Tales
Paper: A Data Mining GUI for R — R Journal, Volume 1(2)

Workshop Overview

1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
