Statistical Machine Learning Introduction to R
Niklas Wahlström Andreas Svensson Division of Systems and Control Department of Information Technology Uppsala University
[email protected] [email protected]
1 / 47 Introduction to R [email protected], [email protected] About R
- The programming language S was developed at Bell Laboratories in the 1970s
- R appeared as an open-source implementation of S in the 1990s
- Today, thousands of R packages are available
- R is widely used by statisticians
About R
Ranked the 6th most popular programming language in 2017 by IEEE Spectrum:
http://spectrum.ieee.org/computing/software/the-2017-top-programming-languages
The R environment
- R (download from http://cran.r-project.org/)
- Graphical interface RStudio (open source, download from http://www.rstudio.com/products/rstudio/)
- Alternatives: Emacs Speaks Statistics (ESS), Tinn-R, RKWard, R Commander, ...
- Packages
RStudio
Outline
- About R
- R basics
- Vectors and matrices
- Scripts
- Plotting
- Implementing linear regression
- Data frames
- The linear regression command
- Working with data sets: Example
- Random number generation
- Control structures: for and if
- Functions
Variables
You do not need to declare variable types in R.
Native R syntax for creating a variable and assigning a value to it:
> x <- 2
(Alternative syntax: > x = 2)
Variable types: numeric, integer, character, factor, logical
Check the type with class(), e.g., class(x)
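As a small illustration of the types listed above, the sketch below creates one value of each type and checks it with class(). The variable names are made up for this example and do not appear elsewhere in these slides.

```r
# One example value per basic type; all names are illustrative only
a <- 2                                 # numeric (double) by default
b <- 2L                                # the L suffix makes an integer
s <- "hello"                           # character
f <- factor(c("low", "high", "low"))   # factor (categorical variable)
flag <- TRUE                           # logical

class(a)     # "numeric"
class(b)     # "integer"
class(s)     # "character"
class(f)     # "factor"
class(flag)  # "logical"
```

Note that a plain 2 is stored as numeric, not integer, which is usually what you want for calculations.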
Basics
Add two numbers, x = 2 + 2,
> x <- 2 + 2
and print the result on the terminal
> print(x)
[1] 4
or
> x
[1] 4
Help resources
- ? opens the help file for a command, e.g.,
> ?predict
- ?? searches the help pages of all installed packages for a keyword, e.g.,
> ??predict
- "Labs" at the end of each chapter in the ISL book
- Internet
  - http://www.stats.ox.ac.uk/~evans/Rprog/LectureNotes.pdf
  - http://www.r-bloggers.com/
  - http://www.ats.ucla.edu/stat/r/
  - ...
Vectors
A vector $v = (1, 2, 3, 4)^T$ is written
> v <- c(1,2,3,4)
> v
[1] 1 2 3 4
Matrices
- A matrix
$$A = \begin{pmatrix} 2 & 1 & 5 \\ 3 & 6 & 1 \\ 4 & 2 & 4 \end{pmatrix}$$
is written
> A <- cbind(c(2,3,4), c(1,6,2), c(5,1,4))
(cf. > B <- rbind(c(2,3,4), c(1,6,2), c(5,1,4)))
or
> u <- c(2,3,4,1,6,2,5,1,4)
> C <- matrix(u,3,3)
- A matrix
$$D = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$$
is written
> D <- matrix(1,3,3)
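The difference between cbind(), rbind() and matrix() comes down to the fill order. The following sketch, using the same numbers as above, shows that matrix() fills column by column unless byrow = TRUE is given; the name Brow is introduced only for this illustration.

```r
u <- c(2, 3, 4, 1, 6, 2, 5, 1, 4)

# Default: fill column by column, so (2,3,4) becomes the first column
C <- matrix(u, nrow = 3, ncol = 3)

# byrow = TRUE: fill row by row, so (2,3,4) becomes the first row
Brow <- matrix(u, nrow = 3, ncol = 3, byrow = TRUE)

# Brow is the transpose of C, and equals the rbind() construction
all(Brow == t(C))                                        # TRUE
all(Brow == rbind(c(2, 3, 4), c(1, 6, 2), c(5, 1, 4)))   # TRUE
dim(C)                                                   # 3 3
```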
Matrix indexing
R uses 1-based indexing (not 0-based).
- Consider the same matrix
> A
     [,1] [,2] [,3]
[1,]    2    1    5
[2,]    3    6    1
[3,]    4    2    4
- Access second column:
> A[,2]
[1] 1 6 2
- Access all but third column:
> A[,-3]
     [,1] [,2]
[1,]    2    1
[2,]    3    6
[3,]    4    2
- Access element (1,2):
> A[1,2]
[1] 1
- Access first row:
> A[1,]
[1] 2 1 5
Vector and matrix operations
- Matrix transpose $A^T$
> t(A)
- Matrix inverse $A^{-1}$
> solve(A)
- Elementwise multiplication $E_{ij} = A_{ij} C_{ij}$
> E <- A*C
- Matrix multiplication $F = AC$
> F <- A%*%C
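A quick way to convince yourself that solve() returns the inverse is to multiply it back with the original matrix. The sketch below does that for the matrix A used earlier, and also shows the two-argument form solve(A, b), which solves the linear system A x = b without forming the inverse. The right-hand side b is an arbitrary choice for this example.

```r
A <- cbind(c(2, 3, 4), c(1, 6, 2), c(5, 1, 4))

# The product A %*% solve(A) should be the identity up to rounding error
Ainv <- solve(A)
all.equal(A %*% Ainv, diag(3))      # TRUE

# Solve A x = b directly; b is made up for the example
b <- c(1, 2, 3)
x <- solve(A, b)
all.equal(as.vector(A %*% x), b)    # TRUE
```

all.equal() is used rather than == because floating-point results are rarely exactly equal.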
R scripts
Running a script is essentially the same as typing the commands in the terminal, but everything is saved, so you can easily go back, change things, and run it all again from the beginning.
To run a single line (or selected lines) in RStudio: Ctrl + Return
To run the entire script: Ctrl + Shift + Return
Use scripts!!
Plotting
Now, we want to plot the following data describing the length of an infant at different ages:

Age (months)    0    6   12   18   24
Length (cm)    51   67   74   82   88

# Insert the data:
> age <- c(0,6,12,18,24)
> length <- c(51,67,74,82,88)
# Plot data:
> plot(x = age, y = length, col = 1, pch = 0, main="Infant length", xlab="age", ylab="length (cm)")
# Add legend:
> legend(x = "topleft", legend = "Data", col = 1, pch=0)
Linear regression example (I/II)
- We would like to fit a straight line to the data with the model
$$Y = \beta_0 + \beta_1 X + \varepsilon, \qquad X: \text{age}, \quad Y: \text{length}$$
- The normal equations give
$$\hat{\beta} = (X^T X)^{-1} X^T y \quad \text{where} \quad X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_5 \end{pmatrix}, \quad y = \begin{pmatrix} y_1 \\ \vdots \\ y_5 \end{pmatrix}$$
- # Solve normal equations
> X <- cbind(matrix(1,5,1),age)
> beta <- solve(t(X)%*%X)%*%t(X)%*%length
Linear regression example (II/II)
- Compute the prediction according to
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$$
# Do predictions
> lengthhat <- beta[1]+age*beta[2]
- Plot a line corresponding to these predictions
# Plot
> lines(x=age,y=lengthhat,col=2, lty=1)
> legend(x = "topleft", legend = c("Data","LR"), col = c(1,2), pch=c(0,NA), lty=c(NA,1))
- When using the built-in plot functions, you need to keep track of the color and style of each line yourself to get the legend right. (The popular package ggplot2 does this automatically, if you prefer to use that.)
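The normal equations can also be solved without forming the inverse explicitly. The following is a minimal sketch of that variant, not the approach used on the slides: crossprod(X) computes X^T X, crossprod(X, y) computes X^T y, and the two-argument solve() solves the resulting linear system. The variable is called len here (an assumption of this example) to avoid reusing the name of the built-in function length().

```r
age <- c(0, 6, 12, 18, 24)
len <- c(51, 67, 74, 82, 88)   # infant length in cm; renamed for this sketch

X <- cbind(1, age)             # design matrix: column of ones, then age

# Solve (X^T X) beta = X^T y as a linear system
beta <- solve(crossprod(X), crossprod(X, len))
beta                           # intercept 54.6, slope about 1.483

# Predictions at the observed ages
lenhat <- beta[1] + age * beta[2]
```

For a problem this small both versions give the same numbers; the linear-system form is generally preferred for numerical stability.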
Data frames (I/III)
- Data frames can be used to store data tables.
- # Create a data frame
> infantdata <- data.frame(age,length)
> infantdata
  age length
1   0     51
2   6     67
3  12     74
4  18     82
5  24     88
- Each column in a data frame may be of a different type, and may also have a descriptive name.
Data frames (II/III)
- A data frame can be indexed by number or by name
- # Either ...
> infantdata[1]
  age
1   0
2   6
3  12
4  18
5  24
# ... or
> infantdata["age"]
  age
1   0
2   6
3  12
4  18
5  24
# ... or
> infantdata$age
[1]  0  6 12 18 24
- # Plot the data
> plot(x = infantdata$age, y = infantdata$length, col = 1, pch = 0, main="Infant length", xlab="age", ylab="length (cm)")
Data frames (III/III)
- Data frames are used by many high-level commands in R, such as lm(), glm(), lda() and qda().
- Some commands (e.g., knn() and glmnet()), however, work with the matrix format instead, in which case you may need to convert with as.matrix(infantdata) or similar.
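To make the conversion concrete, here is a small sketch that rebuilds the infant data frame, inspects it with str(), and converts it with as.matrix(). The name M is introduced only for this example.

```r
age <- c(0, 6, 12, 18, 24)
length <- c(51, 67, 74, 82, 88)
infantdata <- data.frame(age, length)

str(infantdata)              # compact overview: 5 obs. of 2 variables
class(infantdata)            # "data.frame"

# Convert to a plain numeric matrix; the column names are kept
M <- as.matrix(infantdata)
dim(M)                       # 5 2
colnames(M)                  # "age" "length"
```

as.matrix() works cleanly here because both columns are numeric; with mixed column types the result would be a character matrix.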
The linear regression command lm (I/III)
- Consider the same regression model as before
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where X is age and Y is length.
- We now use the lm() function to do linear regression.
$$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{5} \left( y_i - (\beta_0 + \beta_1 x_i) \right)^2$$
# Learn the model
> model.fit <- lm(formula = length ~ age, data = infantdata)
- Note that the bias term (intercept) $\beta_0$ is always included unless we exclude it explicitly.
The linear regression command lm (II/III)
- model.fit is an object containing the prediction model (including $\hat{\beta}$)
- > model.fit

Call:
lm(formula = length ~ age, data = infantdata)

Coefficients:
(Intercept)          age
     54.600        1.483
The linear regression command lm (III/III)
- With the learned model model.fit we can do predictions
- # Do predictions
> model.pred <- predict(object = model.fit, newdata = infantdata)
- # Plot
> lines(x=infantdata$age,y=model.pred, col=2, lty=1)
> legend(x = "topleft", legend = c("Data","LR"), col = c(1,2), pch=c(0,NA), lty=c(NA,1))
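The newdata argument does not have to be the training data. As a sketch, the code below refits the same model and predicts the length at two ages that are not in the data set; the ages 9 and 15 months and the name newages are arbitrary choices for this example.

```r
infantdata <- data.frame(age = c(0, 6, 12, 18, 24),
                         length = c(51, 67, 74, 82, 88))
model.fit <- lm(formula = length ~ age, data = infantdata)

# Extract the estimated coefficients as a named vector
coef(model.fit)            # (Intercept) 54.6, age about 1.483

# newdata must contain a column named like the input variable in the formula
newages <- data.frame(age = c(9, 15))
newpred <- predict(object = model.fit, newdata = newages)
newpred                    # 67.95 and 76.85
```

The column name in newdata has to match the formula (here age), otherwise predict() cannot find the input variable.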
Working with data sets: Example (I/V)
- R comes with several data sets, often contained in packages.
- # Install the package containing the data set we want
> install.packages("MASS")
# Load the package, which contains many data sets
> library(MASS)
# See all available data sets
> data()
# Get information about the Boston data set
> ?Boston
Working with data sets: Example (II/V)
# Look at the column names in the data frame
> names(Boston)
[1] "crim" "zn" "indus" "chas" "nox" "rm" "age" "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
# Launch an editor for making changes (e.g., fix missing values)
> fix(Boston)
# See a short statistical summary of the data set
> summary(Boston)
# Make a boxplot
> boxplot(Boston)
# Plot the data
> plot(x=Boston$lstat, y=Boston$medv)
Working with data sets: Example (III/V)
- Fit a straight line to this data with lstat as input and medv as output
- # Do linear regression
> model1.fit <- lm(formula = medv~lstat, data = Boston)
> model1.pred <- predict(object = model1.fit, newdata = Boston)
- # Plot the result (pch=1 in the legend matches the default plot symbol used for the data)
> lines(Boston$lstat,model1.pred, col=2, lty=1)
> legend(x = "topright", legend = c("Data","LR"), col = c(1,2), pch=c(1,NA), lty=c(NA,1))
Working with data sets: Example (IV/V)
- We can also use the lm() function with multiple input variables
- # Input: lstat and age. Output: medv
> model2.fit <- lm(formula=medv~lstat+age,data = Boston)
# Input: all input variables. Output: medv
> model3.fit <- lm(formula = medv~., data = Boston)
# Input: all input variables except age. Output: medv
> model4.fit <- lm(formula = medv~.-age, data = Boston)
Working with data sets: Example (V/V)
- To evaluate, we compute the root-mean-square error (RMSE)
$$\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$
for the predictions $\hat{y}_i$ on the training data.
- # Compute RMSE to evaluate
> model3.pred <- predict(object = model3.fit, newdata = Boston)
> model3.RMSE <- sqrt(mean((model3.pred-Boston$medv)^2))
> model3.RMSE
[1] 4.679191
- > model4.pred <- predict(object = model4.fit, newdata = Boston)
> model4.RMSE <- sqrt(mean((model4.pred-Boston$medv)^2))
> model4.RMSE
[1] 4.679204
- No noteworthy loss by excluding the input variable age.
Random number generation (I/II)
- The function rnorm() generates realizations of a Gaussian random variable.
- # Draw 1000 samples from a normal distribution with mean 4 and standard deviation 1
y <- rnorm(n=1000,mean=4,sd=1)
- # Plot
x <- seq(1,1000)
plot(x,y)
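Random draws differ every time the code is run. If you want reproducible results, for example when debugging, you can fix the seed of the random number generator with set.seed(). The sketch below illustrates this; the seed value 42 is an arbitrary choice.

```r
set.seed(42)                              # any integer works as a seed
y1 <- rnorm(n = 1000, mean = 4, sd = 1)

set.seed(42)                              # resetting the seed replays the same draws
y2 <- rnorm(n = 1000, mean = 4, sd = 1)

identical(y1, y2)                         # TRUE

# The sample mean and standard deviation are close to 4 and 1, but not exact
mean(y1)
sd(y1)
```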
Random number generation (II/II)
- We want to randomly split the Boston data set into two data sets (called training and test).
- First check the number of data samples in Boston
> nrow(Boston)
[1] 506
- > train = sample(x=1:nrow(Boston), size=250, replace=FALSE)
> train
[1] 475 359 49 319 439 200 .....
- > Boston.train = Boston[train,]
> Boston.test = Boston[-train,]
> nrow(Boston.train)
[1] 250
> nrow(Boston.test)
[1] 256
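One typical use of such a split is to learn a model on the training part and evaluate it on the test part. The following is a minimal sketch of that workflow, assuming the MASS package is installed; the seed, the choice of lstat as the only input, and the names fit, pred and test.RMSE are assumptions made for this illustration.

```r
library(MASS)                       # provides the Boston data frame
set.seed(1)                         # arbitrary seed, only for reproducibility

train <- sample(x = 1:nrow(Boston), size = 250, replace = FALSE)
Boston.train <- Boston[train, ]
Boston.test  <- Boston[-train, ]

# Learn the model on the training rows only
fit <- lm(formula = medv ~ lstat, data = Boston.train)

# Predict on the held-out rows and compute the test RMSE
pred <- predict(object = fit, newdata = Boston.test)
test.RMSE <- sqrt(mean((Boston.test$medv - pred)^2))
test.RMSE
```

The exact value of test.RMSE depends on which rows end up in the training set, so it changes with the seed.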
For-loops
- If you want to repeat a calculation multiple times
- Compute $u_i^2$, $i = 1, \dots, 30$, where $u_i \sim \mathcal{N}(0, 1)$
u <- rnorm(30)
usq <- c() # initialize
for (i in 1:30) {
  usq[i] <- u[i]*u[i]
}
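Many element-by-element calculations in R do not need a loop at all, because arithmetic operators work on whole vectors. As a sketch, the code below computes the squares both with a loop over all elements (with a preallocated result vector) and with the vectorized expression u^2; the name usq.vec is introduced only for this comparison.

```r
u <- rnorm(30)

# Loop version: preallocate the result instead of growing it
usq <- numeric(length(u))
for (i in seq_along(u)) {
  usq[i] <- u[i] * u[i]
}

# Vectorized version: squares every element at once
usq.vec <- u^2

all.equal(usq, usq.vec)    # TRUE
```

The vectorized form is shorter and usually faster, but the loop is still useful when each step depends on the previous one.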
Conditional statements
- If you want to take different actions depending on a previous outcome
- d <- runif(1)
if (d < 0.5) {
  print("d was smaller than 0.5")
} else {
  print("d was bigger than or equal to 0.5")
}
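The if statement tests a single condition. When the same test should be applied to every element of a vector, the built-in function ifelse() is a convenient alternative. A small sketch, with the vector length 5 and the label strings chosen arbitrarily:

```r
d <- runif(5)              # five uniform random numbers in [0, 1]

# ifelse() evaluates the condition element by element
labels <- ifelse(d < 0.5, "smaller", "bigger or equal")
labels
```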
Functions (I/III)
To write more structured and efficient code, you sometimes want to wrap some code in a function.
Let us say we want to compute the RMSE $\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$ for the predictions $\hat{y}_i$:
RMSE <- function(y, yhat){
  r = sqrt(mean((y-yhat)^2))
  return(r)
}
RMSE(Boston$medv,model3.pred)
RMSE(Boston$medv,model4.pred)
Functions (II/III)
General structure for R functions:
myfunction <- function(arg1, arg2, ... ){
  statements
  return(object)
}
If you want to return multiple results, return a vector (or list)!
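As an illustration of returning several results, the sketch below defines a hypothetical helper errors() that returns both the RMSE and the mean absolute error in a named list. The function name, the second error measure and the toy input vectors are assumptions made for this example.

```r
# Hypothetical helper returning two error measures at once
errors <- function(y, yhat) {
  rmse <- sqrt(mean((y - yhat)^2))      # root-mean-square error
  mae  <- mean(abs(y - yhat))           # mean absolute error
  return(list(rmse = rmse, mae = mae))
}

# Toy data: only the last prediction is off, by 2
res <- errors(y = c(1, 2, 3), yhat = c(1, 2, 5))
res$rmse    # sqrt(4/3), about 1.155
res$mae     # 2/3
```

Naming the list elements lets the caller pick out each result with $, just like the columns of a data frame.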
Functions (III/III)
myfunction <- function(arg1, arg2, ... ){
  ...
  return(object)
}
- Define your functions either within your script (before you call them), or in a separate R script file and run source("myfilewithfunctions.R") at the top of your script.
Some other useful commands
Clear workspace: rm(list=ls())
Close all plots: graphics.off()