Statistical Machine Learning: Introduction to R

Niklas Wahlström Andreas Svensson Division of Systems and Control Department of Information Technology Uppsala University

[email protected] [email protected]

1 / 47 Introduction to R [email protected], [email protected] About R

- The programming language S was developed at Bell Laboratories in the 1970s
- R appeared as an open-source implementation of S in the 1990s
- Today, there are thousands of available R packages
- Widely used by statisticians

About R

R was ranked the 7th most popular programming language of 2017 by IEEE Spectrum

http://spectrum.ieee.org/computing/software/the-2017-top-programming-languages

The R environment

- R (download from http://cran.r-project.org/)
- Graphical interface RStudio (open source, download from http://www.rstudio.com/products/rstudio/)
- Alternatives: Emacs Speaks Statistics (ESS), Tinn-R, RKWard, R Commander, ...
- Packages

RStudio

Outline

- About R
- R basics
- Vectors and matrices
- Scripts
- Plotting
- Implementing linear regression
- Data frames
- The linear regression command
- Working with data sets: Example
- Random number generation
- Control structures: for and if
- Functions

Variables

You do not need to declare variable types in R.

Native R syntax for creating a variable and assigning a value to it:

> x <- 2

(Alternative syntax: > x = 2)

Variable types: numeric, integer, character, factor, logical

Check the type with class(), e.g., class(x)
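As a small illustration of the types listed above, the sketch below creates one value of each type and checks it with class(). The variable names n, s, f and b are made up for this example.

```r
# One value of each basic type (names are illustrative only)
x <- 2                                 # numbers are "numeric" by default
n <- 2L                                # the L suffix creates an integer
s <- "two"                             # text in quotes is character
f <- factor(c("low", "high", "low"))   # categorical data
b <- TRUE                              # logical

class(x)   # "numeric"
class(n)   # "integer"
class(s)   # "character"
class(f)   # "factor"
class(b)   # "logical"
```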

Basics

Add two numbers x = 2 + 2

> x <- 2 + 2

and print the result on the terminal

> print(x)
[1] 4

or

> x
[1] 4

Help resources

- ? opens the help file for a command, e.g.,
> ?predict

- ?? searches the help pages of all installed packages for a keyword, e.g.,
> ??predict

- "Labs" at the end of each chapter in the ISL book
- Internet
  - http://www.stats.ox.ac.uk/~evans/Rprog/LectureNotes.pdf
  - http://www.r-bloggers.com/
  - http://www.ats.ucla.edu/stat/r/
  - ...

Vectors

A vector $v = \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix}$ is written

> v <- c(1,2,3,4)
> v
[1] 1 2 3 4

Matrices

- A matrix $A = \begin{pmatrix} 2 & 1 & 5 \\ 3 & 6 & 1 \\ 4 & 2 & 4 \end{pmatrix}$ is written

> A <- cbind(c(2,3,4), c(1,6,2), c(5,1,4))

(cf. > B <- rbind(c(2,3,4), c(1,6,2), c(5,1,4)))

or

> u <- c(2,3,4,1,6,2,5,1,4)
> C <- matrix(u,3,3)

- A matrix $D = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}$ is written

> D <- matrix(1,3,3)
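To see how the constructions relate, the sketch below checks that cbind() and matrix() give the same matrix, while rbind() gives its transpose. The name A.byrow and the use of byrow = TRUE are additions for illustration.

```r
A <- cbind(c(2,3,4), c(1,6,2), c(5,1,4))   # each vector becomes a column
B <- rbind(c(2,3,4), c(1,6,2), c(5,1,4))   # each vector becomes a row
C <- matrix(c(2,3,4,1,6,2,5,1,4), 3, 3)    # filled column by column

# matrix() can also be filled row by row
A.byrow <- matrix(c(2,1,5,3,6,1,4,2,4), 3, 3, byrow = TRUE)

all(A == C)         # TRUE: same matrix
all(B == t(A))      # TRUE: rbind gives the transpose of A
all(A.byrow == A)   # TRUE
dim(A)              # 3 3
```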

Matrix indexing

R uses 1-based indexes (not 0)

- Consider the same matrix
> A
     [,1] [,2] [,3]
[1,]    2    1    5
[2,]    3    6    1
[3,]    4    2    4

- Access element (1,2):
> A[1,2]
[1] 1

- Access first row:
> A[1,]
[1] 2 1 5

- Access second column:
> A[,2]
[1] 1 6 2

- Access all but third column:
> A[,-3]
     [,1] [,2]
[1,]    2    1
[2,]    3    6
[3,]    4    2
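A few more indexing patterns, as a sketch on the same matrix. The drop = FALSE argument and logical indexing are not covered in the slides; they are included here because a single selected row or column otherwise comes back as a plain vector.

```r
A <- cbind(c(2,3,4), c(1,6,2), c(5,1,4))

col2     <- A[, 2]                 # plain vector: 1 6 2
col2.mat <- A[, 2, drop = FALSE]   # stays a 3 x 1 matrix
rows13   <- A[c(1, 3), ]           # rows 1 and 3 as a 2 x 3 matrix
big      <- A[A > 4]               # logical indexing: elements larger than 4

is.matrix(col2)       # FALSE
dim(col2.mat)         # 3 1
rows13
big                   # 6 5 (elements are scanned column by column)
```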

Vector and matrix operations

- Matrix transpose $A^T$
> t(A)

- Matrix inverse $A^{-1}$
> solve(A)

- Elementwise multiplication $E_{ij} = A_{ij} C_{ij}$
> E <- A*C

- Matrix multiplication $F = AC$
> F <- A%*%C
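The difference between * and %*% is easy to miss, so here is a small check on the matrix from the earlier slides. The names Ainv and P are chosen for this sketch (P is used instead of F, since F is also shorthand for FALSE in R).

```r
A <- cbind(c(2,3,4), c(1,6,2), c(5,1,4))
C <- matrix(c(2,3,4,1,6,2,5,1,4), 3, 3)   # same entries as A

E <- A * C        # elementwise: E[1,1] = 2 * 2 = 4
P <- A %*% C      # matrix product: P[1,1] = 2*2 + 1*3 + 5*4 = 27

Ainv <- solve(A)  # matrix inverse
# A times its inverse should be the identity, up to rounding error
max(abs(A %*% Ainv - diag(3)))

E[1, 1]
P[1, 1]
```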

R scripts

A script works essentially like typing in the terminal, but everything is saved, so you can easily go back, change things, and run it again from the beginning.

To run a single line (or selected lines) in RStudio: Ctrl + Return

To run the entire script: Ctrl + Shift + Return

Use scripts!!

Plotting

Now, we want to plot the following data describing the length of an infant at different ages

Age (months)    0    6   12   18   24
Length (cm)    51   67   74   82   88

# Insert the data:
> age <- c(0,6,12,18,24)
> length <- c(51,67,74,82,88)

# Plot data:
> plot(x = age, y = length, col = 1, pch = 0, main="Infant length", xlab="age", ylab="length (cm)")

# Add legend:
> legend(x = "topleft", legend = "Data", col = 1, pch=0)

Linear regression example (I/II)

- We would like to fit a straight line to the data with the model

$Y = \beta_0 + \beta_1 X + \varepsilon$, X: age, Y: length

- The normal equations

$\hat{\beta} = (X^T X)^{-1} X^T y$ where $X = \begin{pmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_5 \end{pmatrix}$, $y = \begin{pmatrix} y_1 \\ \vdots \\ y_5 \end{pmatrix}$

# Solve normal equations
> X <- cbind(matrix(1,5,1),age)
> beta <- solve(t(X)%*%X)%*%t(X)%*%length
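The normal equations can also be solved without forming the inverse explicitly. The sketch below compares the two approaches on the infant data; the use of crossprod() and the name beta2 are choices made here, not part of the slides.

```r
age    <- c(0, 6, 12, 18, 24)
length <- c(51, 67, 74, 82, 88)

X <- cbind(1, age)   # column of ones for the intercept, then the input

# Explicit inverse, as in the formula
beta <- solve(t(X) %*% X) %*% t(X) %*% length

# Same system (X^T X) beta = X^T y solved directly, usually more stable
beta2 <- solve(crossprod(X), crossprod(X, length))

beta    # intercept 54.6, slope about 1.483
max(abs(beta - beta2))
```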

Linear regression example (II/II)

- Compute the prediction according to

$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$

# Do predictions
> lengthhat <- beta[1]+age*beta[2]

- Plot a line corresponding to these predictions

# Plot
> lines(x=age,y=lengthhat,col=2, lty=1)
> legend(x = "topleft", legend = c("Data","LR"), col = c(1,2), pch=c(0,NA), lty=c(NA,1))

- You need to keep track of the color and style of the lines for making the legend correct when using the built-in plot functions. (The popular package does it automatically, if you would prefer using that.)

Data frames (I/III)

- Data frames can be used to store data tables.

# Create a data frame
> infantdata <- data.frame(age,length)
> infantdata
  age length
1   0     51
2   6     67
3  12     74
4  18     82
5  24     88

- Each column in a data frame may be of a different type, and may also have a descriptive name.

Data frames (II/III)

- A data frame can be indexed with numbers or names

# Either ...
> infantdata[1]
  age
1   0
2   6
3  12
4  18
5  24

# ... or
> infantdata["age"]
  age
1   0
2   6
3  12
4  18
5  24

# ... or
> infantdata$age
[1]  0  6 12 18 24

# Plot the data
> plot(x = infantdata$age, y = infantdata$length, col = 1, pch = 0, main="Infant length", xlab="age", ylab="length (cm)")
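Note that the three ways of indexing do not return the same kind of object: single brackets give a one-column data frame, while $ gives a plain vector. A short sketch (the [[ ]] form and the row selections are additions for illustration):

```r
infantdata <- data.frame(age    = c(0, 6, 12, 18, 24),
                         length = c(51, 67, 74, 82, 88))

class(infantdata["age"])     # "data.frame"
class(infantdata$age)        # "numeric"
class(infantdata[["age"]])   # "numeric", same as $

# Rows are selected with [row, column]
second.row <- infantdata[2, ]
older      <- infantdata[infantdata$age > 10, ]   # rows where age > 10

second.row
older
```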

Data frames (III/III)

- Data frames are used by many high-level commands in R, such as lm(), glm(), lda() and qda().

- Some commands (e.g., knn() and glmnet()), however, work with the matrix format instead, in which case you may need to do as.matrix(infantdata) etc.
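A minimal sketch of the conversion, using the infant data. The name M is made up for this example.

```r
infantdata <- data.frame(age    = c(0, 6, 12, 18, 24),
                         length = c(51, 67, 74, 82, 88))

M <- as.matrix(infantdata)   # numeric matrix, column names are kept

is.matrix(M)     # TRUE
dim(M)           # 5 2
colnames(M)      # "age" "length"
M[, "age"]       # columns can still be accessed by name
```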

The linear regression command lm (I/III)

- Consider the same regression model as before

$Y = \beta_0 + \beta_1 X + \varepsilon$

where X is age and Y is length.

- We now use the lm() function to do linear regression.

$\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{5} \left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$

# Learn the model
> model.fit <- lm(formula = length ~ age, data = infantdata)

- Note, the bias-term $\beta_0$ will always be included (if we don't exclude it explicitly)
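For completeness, a sketch of how the intercept can be excluded explicitly in the formula, by adding - 1 (or, equivalently, + 0). The names fit and fit0 are chosen for this example.

```r
infantdata <- data.frame(age    = c(0, 6, 12, 18, 24),
                         length = c(51, 67, 74, 82, 88))

fit  <- lm(length ~ age,     data = infantdata)   # intercept included by default
fit0 <- lm(length ~ age - 1, data = infantdata)   # intercept excluded

names(coef(fit))    # "(Intercept)" "age"
names(coef(fit0))   # "age"
```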

The linear regression command lm (II/III)

- model.fit is an object containing the prediction model (including $\hat{\beta}$)

> model.fit

Call:
lm(formula = length ~ age, data = infantdata)

Coefficients:
(Intercept)          age
     54.600        1.483
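The fitted object can be inspected further with standard accessor functions. A small sketch (the data frame is recreated here so that the example runs on its own):

```r
infantdata <- data.frame(age    = c(0, 6, 12, 18, 24),
                         length = c(51, 67, 74, 82, 88))
model.fit <- lm(formula = length ~ age, data = infantdata)

beta.hat <- coef(model.fit)        # named vector with the estimates
res      <- residuals(model.fit)   # y_i minus the fitted values

beta.hat
summary(model.fit)                 # standard errors, R-squared, etc.
```

The coefficients agree with the solution of the normal equations for the same data.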

The linear regression command lm (III/III)

- With the learned model model.fit we can do predictions

# Do predictions
> model.pred <- predict(object = model.fit, newdata = infantdata)

# Plot
> lines(x=infantdata$age,y=model.pred, col=2, lty=1)
> legend(x = "topleft", legend = c("Data","LR"), col = c(1,2), pch=c(0,NA), lty=c(NA,1))
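predict() can also be used for inputs that are not in the training data, as long as newdata has a column with the same name as the input in the formula. In the sketch below, the ages 3, 9 and 30 months are made-up query points.

```r
infantdata <- data.frame(age    = c(0, 6, 12, 18, 24),
                         length = c(51, 67, 74, 82, 88))
model.fit <- lm(formula = length ~ age, data = infantdata)

# New inputs; the column must be called "age", as in the formula
new.ages <- data.frame(age = c(3, 9, 30))
new.pred <- predict(object = model.fit, newdata = new.ages)

new.pred   # predicted lengths in cm
```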

Working with data sets: Example (I/V)

- R comes with several data sets, often contained in libraries.

# Install package containing the library we want
> install.packages("MASS")
# Load library containing a lot of datasets
> library(MASS)
# See all available datasets
> data()
# Get information about the Boston dataset
> ?Boston

Working with data sets: Example (II/V)

# Look at the column names in the data frame
> names(Boston)
 [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
 [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

# Launch an editor for making changes (e.g., fix missing values)
> fix(Boston)
# See a short statistical summary about the dataset
> summary(Boston)
# Make a boxplot
> boxplot(Boston)
# Plot the data
> plot(x=Boston$lstat, y=Boston$medv)

Working with data sets: Example (III/V)

- Fit a straight line to this data with lstat as input and medv as output

# Do linear regression
> model1.fit <- lm(formula = medv~lstat, data = Boston)
> model1.pred <- predict(object = model1.fit, newdata = Boston)

# Plot the result (pch=1 matches the default symbol used by plot() above)
> lines(Boston$lstat,model1.pred, col=2, lty=1)
> legend(x = "topright", legend = c("Data","LR"), col = c(1,2), pch=c(1,NA), lty=c(NA,1))

Working with data sets: Example (IV/V)

- We can also use the lm() function with multiple input variables

# Input: lstat and age. Output: medv
> model2.fit <- lm(formula=medv~lstat+age,data = Boston)

# Input: all input variables. Output: medv
> model3.fit <- lm(formula = medv~., data = Boston)

# Input: all input variables except age. Output: medv
> model4.fit <- lm(formula = medv~.-age, data = Boston)

Working with data sets: Example (V/V)

- To evaluate, we compute the root-mean-square error $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ for the predictions $\hat{y}_i$ on the training data.

# Compute RMSE to evaluate
> model3.pred <- predict(object = model3.fit, newdata = Boston)
> model3.RMSE <- sqrt(mean((model3.pred-Boston$medv)^2))
> model3.RMSE
[1] 4.679191

> model4.pred <- predict(object = model4.fit, newdata = Boston)
> model4.RMSE <- sqrt(mean((model4.pred-Boston$medv)^2))
> model4.RMSE
[1] 4.679204

- No noteworthy loss by excluding the input variable age.

Random number generation (I/II)

- The function rnorm() generates realizations of a random Gaussian variable.

# Draw 1000 samples from a normal distribution with mean 4 and standard deviation 1
y <- rnorm(n=1000,mean=4,sd=1)

# Plot
x <- seq(1,1000)
plot(x,y)
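Random draws differ from run to run. If reproducible results are wanted, the random number generator can be seeded with set.seed(). This is a sketch; the seed value 1 is arbitrary.

```r
set.seed(1)                                  # fix the random number generator
y1 <- rnorm(n = 1000, mean = 4, sd = 1)

set.seed(1)                                  # same seed again ...
y2 <- rnorm(n = 1000, mean = 4, sd = 1)

identical(y1, y2)   # TRUE: ... gives exactly the same samples
mean(y1)            # sample mean, close to 4
sd(y1)              # sample standard deviation, close to 1
```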

Random number generation (II/II)

- We want to randomly split the Boston data set into two data sets (called training and test).

- First check the number of data samples in Boston
> nrow(Boston)
[1] 506

> train = sample(x=1:nrow(Boston), size=250, replace=FALSE)
> train
[1] 475 359 49 319 439 200 .....

> Boston.train = Boston[train,]
> Boston.test = Boston[-train,]
> nrow(Boston.train)
[1] 250
> nrow(Boston.test)
[1] 256
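One typical use of such a split, sketched below, is to learn a model on the training part and compute the RMSE on the test part. The choice of lstat as the only input, the seed, and the names fit, pred and test.RMSE are assumptions made for this example; the resulting number depends on the random split.

```r
library(MASS)    # provides the Boston data set

set.seed(1)      # arbitrary seed, only to make the split reproducible
train <- sample(x = 1:nrow(Boston), size = 250, replace = FALSE)
Boston.train <- Boston[train, ]
Boston.test  <- Boston[-train, ]

# Learn on the training data only
fit <- lm(formula = medv ~ lstat, data = Boston.train)

# Evaluate on data not used for learning
pred      <- predict(object = fit, newdata = Boston.test)
test.RMSE <- sqrt(mean((pred - Boston.test$medv)^2))
test.RMSE
```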

For-loops

- If you want to repeat a calculation multiple times
- Compute $u_i^2$, $i = 1, \dots, 30$, where $u_i \sim \mathcal{N}(0, 1)$

u <- rnorm(30)

usq <- c() # initialize

for (i in 1:30) {
  usq[i] <- u[i]*u[i]
}
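Many loops of this kind can be replaced by a vectorized expression, which is usually shorter and faster in R. A sketch comparing the two (seq_along(), numeric() and the name usq.vec are choices made here):

```r
set.seed(1)                   # arbitrary seed for reproducibility
u <- rnorm(30)

# Loop version, with the result vector allocated in advance
usq <- numeric(length(u))
for (i in seq_along(u)) {     # seq_along(u) is 1, 2, ..., 30
  usq[i] <- u[i] * u[i]
}

# Vectorized version: arithmetic acts on every element at once
usq.vec <- u^2

all.equal(usq, usq.vec)       # TRUE
```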

Conditional statements

- If you want to take different actions depending on a previous outcome

d <- runif(1)

if (d < 0.5) {
  print("d was smaller than 0.5")
} else {
  print("d was bigger than or equal to 0.5")
}

Functions (I/III)

To write more structured and efficient code, you sometimes want to wrap some code in a function.

Let us say we want to compute the RMSE $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$ for the predictions $\hat{y}_i$

RMSE <- function(y, yhat){
  r = sqrt(mean((y-yhat)^2))
  return(r)
}

RMSE(Boston$medv,model3.pred)
RMSE(Boston$medv,model4.pred)
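A function like this is easy to sanity-check on numbers where the answer can be computed by hand. In the sketch below the function is defined again so the example runs on its own, and the toy vectors are made up: only the third prediction is wrong, by 2, so the RMSE should be sqrt(4/3).

```r
RMSE <- function(y, yhat) {
  # root of the mean of the squared errors
  r <- sqrt(mean((y - yhat)^2))
  return(r)
}

y    <- c(1, 2, 3)   # toy observations
yhat <- c(1, 2, 5)   # toy predictions, last one is off by 2

RMSE(y, yhat)        # equals sqrt(4/3)
RMSE(y, y)           # 0 for perfect predictions
```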

Functions (II/III)

General structure for R functions:

myfunction <- function(arg1, arg2, ... ){
  statements
  return(object)
}

If you want to return multiple results, return a vector (or list)!
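As a sketch of returning several results at once, the hypothetical function fit.stats below returns two error measures in a named list, which can then be accessed with $. The function name and the toy data are made up for this example.

```r
fit.stats <- function(y, yhat) {
  res <- y - yhat
  # a named list can hold several results, also of different types
  return(list(rmse = sqrt(mean(res^2)),
              mae  = mean(abs(res))))
}

s <- fit.stats(y = c(1, 2, 3), yhat = c(1, 2, 5))
s$rmse   # sqrt(4/3)
s$mae    # 2/3
```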

Functions (III/III)

myfunction <- function(arg1, arg2, ... ){
  ...
  return(object)
}

- Define your functions either within your script (before you call them), or in a separate R script file and run source("myfilewithfunctions.R") at the top of your script.

Some other useful commands

Clear workspace: rm(list=ls())

Close all plots: graphics.off()
