
A Statistical View of Deep Learning, Part 2

Jennifer Hoeting, Colorado State University

A Statistical View of Deep Learning in Ecology

Part 1: Introduction

I Introduction to machine learning

I Introduction to deep learning

Part 2: Going deeper

I Neural networks from 3 viewpoints

I Mathematics of deep learning

I Model fitting

I Types of deep learning models

Part 3: Deep learning in practice

I Ethics in deep learning

I Deep learning in ecology

Neural networks from three viewpoints

Neural network: an algorithm that allows a computer to learn from data

Deep learning: a multi-layer neural network

Neural networks: An Introduction from multiple viewpoints

A neural network

I as a black box algorithm

I as a statistical model

I in deep learning

Neural Network as a black box algorithm

Image source: www.learnopencv.com

Neural Network as a statistical model

Goal: extract linear combinations of the inputs as derived features, and then model the target as a nonlinear function of these features

Translated to statistics:

I fit a nonlinear model to the response and predictors

I predictors are transformed using multivariate techniques

A neural network is a nonlinear model

Neural Network as a statistical model


Statistics versus neural network terminology

Statistics                  Neural Networks
model                       network, graph
fitting                     learning, training
coefficient, parameter      weight
predictor, variable         input, feature
predicted response          output
observation                 exemplar
parameter estimation        training or learning
steepest descent            back-propagation
intercept                   bias term
derived predictor           hidden node
penalty function            weight decay

Neural networks in deep learning

Some mathematics of neural networks

Opening the black box of deep learning

Overview

Key idea: Neural networks are merely regression models with transformed predictors

Consider the following progression of models:

I Regression model

I Nonparametric regression model

I Neural network model

I Deep learning model

Regression model

[Figure: scatterplot of y versus x]

Regression model

Model

Model: $y = \beta_0 + \beta_1 x + \epsilon = f(x) + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$

Fitted model: $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x = \hat{f}(x)$

Loss function: $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
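To make the loss concrete, here is a minimal R sketch (illustrative only, not from the slides) that fits the linear model with lm() and evaluates the squared-error loss by hand:

# simulate data from y = beta0 + beta1*x + error
set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

fit <- lm(y ~ x)              # least-squares fit: yhat = beta0hat + beta1hat * x
y_hat <- fitted(fit)

loss <- sum((y - y_hat)^2)    # the squared-error loss above
loss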

Regression model

[Figure: scatterplot of y versus x]

Nonparametric regression

$\hat{y} = \hat{f}(x) = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k \hat{\sigma}_k(x)$, where each $\hat{\sigma}_k(x)$ is a function of $x$

Examples (an R sketch of both follows the bullets below)

1. Polynomial regression: $\sigma_k(x) = x^k$, so $\hat{f}(x) = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2 + \cdots + \hat{\beta}_K x^K$

2. Regression splines

I A spline is a function that is constructed piece-wise from polynomial functions.

I Apply a family of transformations to x, σ1(x), σ2(x), . . . , σK (x)
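A small R sketch (illustrative only, using the splines package that ships with R) of both examples above: a polynomial basis and a B-spline basis, each fit by ordinary least squares:

library(splines)

set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# Example 1: polynomial basis, sigma_k(x) = x^k
fit_poly <- lm(y ~ poly(x, degree = 5, raw = TRUE))

# Example 2: regression spline basis (B-splines with 5 degrees of freedom)
fit_spline <- lm(y ~ bs(x, df = 5))

# both are ordinary regressions on transformed versions of x
head(cbind(polynomial = fitted(fit_poly), spline = fitted(fit_spline)))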

Nonparametric regression

A spline of degree K is a function formed by connecting polynomial segments of degree K so that:

I the function is continuous,

I the function has K − 1 continuous derivatives, and

I the Kth derivative is constant between knots.

Nonparametric regression

Simple spline regression

Moving window: yb is the average of the y’s for nearby x values or a weighted average (kernel smoothing)

[Figure: moving-window fit of y versus x]

Nonparametric regression

Basis functions: a function can be represented as a linear combination of basis functions.

More advanced spline functions: basis splines

[Figure: B-spline basis functions]

Nonparametric regression

[Figure: spline fits of y versus x with df = 100 and df = 3]

Nonparametric regression

[Figure: scatterplot of y versus x]

Nonparametric regression

The weaknesses of this approach are:

I The basis is fixed and independent of the data

I If there are many predictors, this approach doesn’t work well

I If the basis doesn’t ‘agree’ with true f , then K will have to be large to capture the structure

I What if parts of f have substantially different structure?

Alternative approach: the data tell us what kind of basis to use

Neural network model

Key idea: think of a neural network as a multiple regression model with transformed predictors

Example: One-layer neural network model with K hidden nodes σ1(x), σ2(x), . . . , σK(x) and 3 predictors (x1, x2, x3)

$y = \beta_0 + \beta_1\,\sigma_1(\alpha_{10} + \alpha_{11}x_1 + \alpha_{12}x_2 + \alpha_{13}x_3) + \beta_2\,\sigma_2(\alpha_{20} + \alpha_{21}x_1 + \alpha_{22}x_2 + \alpha_{23}x_3) + \cdots + \beta_K\,\sigma_K(\alpha_{K0} + \alpha_{K1}x_1 + \alpha_{K2}x_2 + \alpha_{K3}x_3)$

Each $\sigma_k(\cdot)$ term is a function of $x = (x_1, x_2, x_3)$.
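To emphasize the regression-with-transformed-predictors view, here is a small R sketch (illustrative only; the weights and the logistic activation are made-up choices) that computes the one-layer model above for K = 2 hidden nodes and 3 predictors:

sigma <- function(z) 1 / (1 + exp(-z))    # one possible activation function (logistic)

x <- c(x1 = 0.5, x2 = -1.2, x3 = 2.0)     # a single observation
alpha <- rbind(c( 0.1, 0.4, -0.3,  0.2),  # (alpha_k0, alpha_k1, alpha_k2, alpha_k3), k = 1
               c(-0.2, 0.1,  0.5, -0.4))  # k = 2
beta <- c(0.3, 1.5, -0.8)                 # (beta_0, beta_1, beta_2)

# derived features: sigma_k(alpha_k0 + alpha_k1*x1 + alpha_k2*x2 + alpha_k3*x3)
hidden <- sigma(alpha %*% c(1, x))

# the response is an ordinary linear combination of the derived features
y_hat <- beta[1] + sum(beta[-1] * hidden)
y_hat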

Deep Learning Model

Neural network model: $y = \beta_0 + \sum_{k=1}^{K} \beta_k\,\sigma_k(\alpha_{k0} + \alpha_{k1}x_1 + \alpha_{k2}x_2 + \alpha_{k3}x_3)$

Two-layer “deep” learning model

$\hat{y} = \hat{f}(x) = \hat{\beta}_0 + \sum_{k=1}^{K} \hat{\beta}_k\,\sigma_k(\alpha_{k0} + \alpha_{k1}x_1 + \alpha_{k2}x_2 + \alpha_{k3}x_3)$, where each of the predictors $x_1, x_2, x_3$ is replaced with the output of another neural network model

Illustration of a deep learning model

Image: Deep Learning, Fig 1.2

A Statistical View of Deep Learning in Ecology

Part 1: Introduction

I Introduction to machine learning

I Introduction to deep learning

Part 2: Going deeper

I Neural networks from 3 viewpoints

I Mathematics of deep learning

I Model fitting

I Types of deep learning models

Part 3: Deep learning in practice

I Ethics in deep learning

I Deep learning in ecology

Deep Learning: Model fitting

Deep Learning Software

I Python

• Most popular language for deep learning
• Object-oriented, high-level programming language
• Main packages: TensorFlow, PyTorch, . . .
• Do you need to learn Python? Maybe

I Keras

• Open-source neural-network library written in Python.
• Keras package is an API.
• API (application programming interface) allows multiple software packages to interact.
• Keras can interact with: TensorFlow, Microsoft Cognitive Toolkit, R, PlaidML

Deep learning in R

Neural network models in the caret package:

1. Neural network models: nnet, mxnet, mxnetAdam, neuralnet, and more
2. Stacked Deep Neural Network
3. Many other options

R packages: keras and kerasR

I Interface to the Python deep learning package Keras

I Rstudio’s keras pages

I Can be buggy when any of the packages it accesses are updated

Deep Learning: Model fitting

Steps in model fitting
1. Define network structure
2. Select loss function
3. Select optimizer

Network structure

I A layer consists of a set of nodes

I In a fully-connected model, each node on one layer connects to all nodes in the next layer

Image source: towardsdatascience

Network structure

Defining the network model in the R interface to Keras

model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')

keras.rstudio.com

Network structure

Recall the basic neural network model (one layer, p predictors)

$y = \beta_0 + \sum_{k=1}^{K} \beta_k\,\sigma_k(\alpha_{k0} + \alpha_{k1}x_1 + \alpha_{k2}x_2 + \cdots + \alpha_{kp}x_p)$

The $\sigma_k(\cdot)$ are the activation functions.

Activation functions are

I Computationally cheap

I One key to the success of deep learning

Network structure

Image source: medium.com

Network structure

ReLU (Rectified Linear Unit) activation function: $\sigma(x) = \max(0, x) = \begin{cases} 0 & \text{for } x < 0 \\ x & \text{for } x \ge 0 \end{cases}$

Softmax activation function

I Used for classification problems on the output layer

I Input: $x_i \in \mathbb{R}$, output: $0 \le \sigma(x_i) \le 1$

$\sigma(x_i) = \dfrac{\exp(x_i)}{\sum_{j=1}^{J} \exp(x_j)} \quad \text{for } i = 1, \ldots, J$
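Both activation functions are one-liners in R; a minimal sketch (illustrative only) of the two formulas above:

relu <- function(x) pmax(0, x)                # sigma(x) = max(0, x), applied elementwise

softmax <- function(x) exp(x) / sum(exp(x))   # sigma(x_i) = exp(x_i) / sum_j exp(x_j)
                                              # (subtracting max(x) first is numerically safer)

relu(c(-2, -0.5, 0, 1.3))                     # 0.0 0.0 0.0 1.3
softmax(c(1, 2, 3))                           # probabilities that sum to 1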

Network structure

Defining the network model in the R interface to Keras

model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu') %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dense(units = 10, activation = 'softmax')

I sequential model: linear stack of neural network layers

I layer_dense: densely connected layer

I units: number of nodes in the layer

I activation: the activation function used in the layer

keras.rstudio.com

Deep Learning: Model fitting

Recall maximum likelihood estimation

Steps in model fitting
1. Define network structure
2. Select loss function
3. Select optimizer

Loss Function

A loss function

I Measures predictive performance of the network

I Measures how close the output from the last layer is to the observed value

I Must be a one-number summary

Goal in model fitting: Find the parameter values (weights) that minimize the loss

A Statistical View of Deep Learning Part 2 | Jennifer Hoeting, Colorado State University 238 / 276 Loss Function Square Error: used for continuous response data n 1 X 2 (yi − ybi ) n i=1

Cross-entropy: used for categorical response data
$-\sum_{i=1}^{n}\sum_{g=1}^{G} y_{ig}\,\log(\hat{\pi}_{ig})$

For a problem with

I g = 1,..., G categories

I $\hat{\pi}_{ig}$ = predicted probability that observation $i$ is in the $g$th category
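A small R sketch (illustrative only) evaluating the cross-entropy loss above for a toy problem with n = 3 observations and G = 3 categories, where y is one-hot coded and pi_hat holds the predicted probabilities:

# one-hot responses: row i has a 1 in the observed category g
y <- rbind(c(1, 0, 0),
           c(0, 1, 0),
           c(0, 0, 1))

# predicted probabilities pi_hat_ig from the softmax output layer (rows sum to 1)
pi_hat <- rbind(c(0.7, 0.2, 0.1),
                c(0.1, 0.8, 0.1),
                c(0.2, 0.3, 0.5))

cross_entropy <- -sum(y * log(pi_hat))   # -sum_i sum_g y_ig * log(pi_hat_ig)
cross_entropy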

Loss Function

Select a loss function that is compatible with the activation function of the final (output) layer

Loss Function

In R Keras:
1. Set up the model (see above).
2. Compile the model with appropriate loss function, optimizer, and metrics.

model %>% compile(
  loss = 'categorical_crossentropy',
  optimizer = optimizer_rmsprop(),
  metrics = c('accuracy')
)

Loss functions in R keras

Model fitting

Steps in model fitting
1. Define network structure
2. Select loss function
3. Select optimizer

Model fitting

Goal:

I find the parameter values (weights) that minimize the loss

I estimate the parameter values β, α, . . .

Model fitting is called:

I Computer science: train the model

I Statistics: estimate the parameters

I Mathematics: optimize the loss function

Model fitting

Neural networks have many unknown parameters (weights).

Counting parameters

I p + 1 parameters in one node
  • p covariates plus intercept (bias)
  • parameters: (αk0, αk1, . . . , αkp)

I k = 1, . . . , K nodes per layer plus intercept: (βg0, . . . , βgK)

I g = 1, . . . , G layers

Total of K(p + 1) + G(K + 1) parameters
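As a quick sanity check, the counting formula can be wrapped in a small (hypothetical) R helper:

# total parameters under the formula above: K(p + 1) + G(K + 1)
count_params <- function(p, K, G) K * (p + 1) + G * (K + 1)

count_params(p = 100, K = 128, G = 5)   # 13573, matching the example on the next slide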

Model fitting

Example:

I p = 100 covariates

I K = 128 nodes per layer

I G = 5 layers

I 128(100 + 1) + 5(128 + 1) = 13,573 parameters

Implications:

I Many parameters, and

I all the parameters are highly related.

I Thus you need a lot of data and a special approach to estimate parameters.

Optimization

I The general approach to minimizing the loss is to use stochastic gradient descent:
  • Assign starting values to the weights
  • Evaluate the partial derivative of the loss function with respect to each weight
  • Take a step in the downhill direction (opposite the gradient)
  • Repeat until the weights converge
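The same recipe can be written out for the simple regression loss from earlier. A minimal R sketch (illustrative only; plain rather than stochastic gradient descent, with a fixed step size):

set.seed(1)
x <- runif(100)
y <- 2 + 3 * x + rnorm(100, sd = 0.5)

w <- c(b0 = 0, b1 = 0)        # starting values for the weights
step <- 0.1                   # learning rate

for (iter in 1:2000) {
  y_hat <- w[1] + w[2] * x
  # partial derivatives of the mean squared-error loss with respect to each weight
  grad <- c(mean(-2 * (y - y_hat)),
            mean(-2 * (y - y_hat) * x))
  w <- w - step * grad        # step in the downhill direction (opposite the gradient)
}
w                             # close to the least-squares estimates from lm(y ~ x)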


Figure source: Chollet (2018)

Optimization

Optimization for deep learning

I This was a brief overview of optimization.

I The book Deep Learning by Goodfellow et al. has a good overview

I Optimizers in Keras: SGD, RMSprop, Adagrad, Adadelta, Adam, Adamax, Nadam

I Overview of the optimizers

Optimization: Putting it all together (2 layer network)

Deep Learning: Model fitting in practice

Model fitting in practice

Training deep learning models requires many decisions

You select:

I Model architecture

I Loss function

I Optimization method

I Details on how all of these work

Model fitting in practice

Some deep learning parameters to adjust

I Number of hidden layers and units

I Regularization

I Epochs and Batch Size

I General practical advice

Number of hidden layers and units

Image source: towardsdatascience.com

Number of hidden layers and units

How many layers?

I Old strategy: one is enough
  • Universal approximation theorem: one hidden layer is enough to represent an approximation of any function to an arbitrary degree of accuracy
  • Theory vs practice!

I Current advice
  • Start with one or two hidden layers with many units
  • If performance is poor and you have debugged, try deep learning

Over and underfitting

Simple Example

Image source: medium.com

Over and underfitting

Overfitting:

I Results won’t generalize beyond the training set

I Solution: get more training data or use regularization Underfitting:

I Results aren’t useful

I Solution: adjust model structure or model fitting setup

Regularization

Image source: www.analyticsvidhya.com

Regularization via Dropout

Dropout

I Randomly set to 0 a proportion of weights in the model

I Prevents overfitting via sparser networks

I Dropout rate = fraction of weights set to 0

Regularization via Dropout

model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu', input_shape = c(784)) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 128, activation = 'relu') %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 10, activation = 'softmax')

Rstudio’s keras pages

Regularization via Dropout

Source: Srivastava et al., Dropout: A Simple Way to Prevent Neural Networks from Overfitting

Regularization

You can also achieve regularization by adding a LASSO-like penalty to the loss function (Tibshirani, 1996)

Regularization:

Loss + Regularization penalty

L1 Regularization: $\text{Loss} + \lambda \sum_{i=1}^{M} |w_i|$, where

I w1,..., wM represent all model weights (parameters)

I λ is a tuning parameter
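In the R interface to Keras, a LASSO-like penalty can be attached to a layer's weights through its kernel_regularizer argument; a minimal sketch (the penalty value 0.01 is an arbitrary illustrative choice):

model <- keras_model_sequential()
model %>%
  layer_dense(units = 256, activation = 'relu',
              kernel_regularizer = regularizer_l1(l = 0.01)) %>%  # adds lambda * sum |w| to the loss
  layer_dense(units = 10, activation = 'softmax')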

Epochs and Batch Size

To deal with large data sets, deep learning optimization algorithms break up the data into subsets. Model fitting options:

I Batch size: # of training samples to work through before updating the model

I Epoch: number of complete passes through the data set

I Validation split: Fraction of the training data to be used as validation data per epoch.

Epochs and Batch Size

Small example: Use the fit() function to train the model for 30 epochs using batches of 128 images:

history <- model %>% fit(
  x_train, y_train,
  epochs = 30, batch_size = 128,
  validation_split = 0.2
)

Useful reference: machinelearningmastery.com

Epoch plot

Image source: towardsdatascience.com

Image source: xkcd.com/1838

Architecture

Network architecture: the overall structure of the network

General advice

I Fit the largest network you can afford

I Keep the # nodes per layer the same

I Try different layer sizes

I Use ReLU units for hidden layers

I Choose an appropriate loss function for your data

I Make sure the activation function for your output layer matches the loss function

Types of Deep Learning Models

Types of Deep Learning Models

Architecture families:
1. Deep feedforward networks: what we’ve covered so far
2. Convolutional networks
3. Recurrent and Recursive Nets
4. Generative adversarial network
5. Bayesian networks
6. Always new methods on the horizon

Types of Deep Learning Models

Convolutional networks

I For grid-like data like images

I Data need to be correlated in some way (I think)

I Each layer detects small, meaningful features like edges using kernels instead of the entire image

I Sparse connectivity

Types of Deep Learning Models

Network illustration (Image: Deep Learning, Fig 1.2)

Types of Deep Learning Models

Recurrent and Recursive Nets

I For processing sequential data

I Examples: financial data, text, speech

Generative adversarial network

I Two neural networks compete to generate new data to be similar to old data

I Used for image generation, video generation and voice generation

Types of Deep Learning Models

Bayesian networks

I Can be any of the above (e.g., Bayesian convolutional network)

I Pros: estimate uncertainty

I Cons: slow to compute

I Variational Bayes is a current approach for inference

I Current approaches are pretty crude

A Statistical View of Deep Learning in Ecology

Part 1: Introduction

I Introduction to machine learning

I Introduction to deep learning

Part 2: Going deeper

I Neural networks from 3 viewpoints

I Mathematics of deep learning

I Model fitting

I Types of deep learning models

Part 3: Deep learning in practice

I Ethics in deep learning

I Deep learning in ecology

Some useful references on deep learning

I Deep Learning by I. Goodfellow, Y. Bengio and A. Courville (2016), MIT Press

I Deep Learning with R by F. Chollet and J.J. Allaire (2018), Manning Publications

• Chapters 1-3 available online • Focus is on Keras

Additional references that I used to develop this presentation

Books:

I Computational Statistics by Givens and Hoeting (2nd edition, Wiley)

I The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

I Introduction to Statistical Learning by James, Witten, Hastie, Tibshirani

Course materials by: Darren Homrighausen, Ander Wilson, Asa Ben-Hur

Thank you to

I ISEC2020 organizers especially David Warton and Gordana Popovic

I Course assistants: Tess Hamzeh, Winston Hilton, Rachael Krawczyk

I Alison Ketz and Dan Walsh

Acknowledgment of funding support

This material is based upon work supported by the National Science Foundation (NSF) under Grant No. AGS-1419558, the US Geological Survey (USGS) (G17AC00409), and Colorado State University (CSU). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the NSF, USGS or CSU.
