Introduction to Deep Learning with Tensorflow

Yunhao (Robin) Tang, Department of IEOR, Columbia University

February 25, 2019


Summary

What we will cover:
Simple Example: Linear Regression with Tensorflow
Intro to Deep Learning
Deep Learning with Tensorflow
Beyond

If time permits, we will also cover:
OpenAI Gym
Implement basic DQN

Info about Tensorflow

TF is an open source project formally supported by Google.
Programming interface: Python, C++, C...; Python is the most well developed and documented.
Latest version: 1.12 (same time last year: 1.5). Commonly used version: ≈1.0 (depends on other dependencies).
Can be installed via pip. Best supported on macOS and Linux.

[Figure: Tensorflow paper]

Simple Example: Linear Regression

Linear Regression: N data points {(x_i, y_i)}_{i=1}^N, x ∈ R^k, y ∈ R.
Specify Architecture: consider a linear function for regression with parameters slope θ ∈ R^k and bias θ0 ∈ R. The prediction is
    ŷ_i = θ^T x_i + θ0
Define Loss: the loss function to minimize is J = (1/N) Σ_{i=1}^N (y_i − ŷ_i)^2.
Gradient Descent: θ ← θ − α∇_θ J, θ0 ← θ0 − α∇_{θ0} J.

Linear Regression: Data

[Figures: true parameters; generated data]

Linear Regression with Tensorflow

Let us just focus on defining the architecture...

[Figure: linear regression tf code]
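The code figure is not reproduced here; below is a minimal sketch of a TF 1.x linear regression graph in the spirit of the slide (variable names and the input dimension k are illustrative):

```python
import tensorflow as tf

k = 5  # input dimension (illustrative)

# Placeholders for a batch of data; None allows an arbitrary batch size.
x_ph = tf.placeholder(tf.float32, shape=[None, k])
y_ph = tf.placeholder(tf.float32, shape=[None, 1])

# Parameters: slope theta and bias theta0, stored as tf.Variables.
theta = tf.Variable(tf.zeros([k, 1]))
theta0 = tf.Variable(tf.zeros([1]))

# Prediction y_hat = theta^T x + theta0 and mean squared error loss J.
y_hat = tf.matmul(x_ph, theta) + theta0
loss = tf.reduce_mean(tf.square(y_ph - y_hat))
```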

Linear Regression with Tensorflow

We launch the training...

[Figures: training loss; fitted line]

Intro to DL

Consider data sets with a complex relation {(x_i, y_i)}_{i=1}^N; linear regression will not work.
MLP (Multi-Layer Perceptron): stack multiple layers of linear transformations and nonlinear functions.
Input x ∈ R^6: the first layer applies a linear transformation W1 x + b1 ∈ R^4, then a nonlinear function h1 = σ(W1 x + b1) ∈ R^4. The nonlinear function is applied elementwise.
The second layer has W2, b2 and transforms as h2 = σ(W2 h1 + b2) ∈ R^3.
Final output: ŷ = W3 σ(W2 σ(W1 x + b1) + b2) + b3 ∈ R.

[Figures: MLP; nonlinear functions]
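To make the forward pass concrete, here is a minimal NumPy sketch of the 6-4-3-1 MLP above (the weights are random placeholders, and tanh stands in for a generic σ):

```python
import numpy as np

def sigma(z):
    # Elementwise nonlinearity; tanh is one common choice.
    return np.tanh(z)

# Illustrative weight shapes for the 6 -> 4 -> 3 -> 1 MLP on the slide.
rng = np.random.RandomState(0)
W1, b1 = rng.randn(4, 6), np.zeros(4)
W2, b2 = rng.randn(3, 4), np.zeros(3)
W3, b3 = rng.randn(1, 3), np.zeros(1)

x = rng.randn(6)             # input x in R^6
h1 = sigma(W1 @ x + b1)      # first hidden layer, in R^4
h2 = sigma(W2 @ h1 + b2)     # second hidden layer, in R^3
y_hat = W3 @ h2 + b3         # final output, in R
```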

IEOR 8100: Reinforcement Learning 8/36 Architecture: define a complex input to output relation with stacked layers of linear transformations and nonlinear functions (activations). Architectures are adapted to data structure at hand. Input prior knowledge into modeling through architecture design. MLP, CNN (image), RNN (sequence data, audio, language...

Parameters: weights W1, W2, W3 and bias b1, b2, b3. Linear regression: slope θ and bias θ0

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Intro to DL

IEOR 8100: Reinforcement Learning 9/36 Architectures are adapted to data structure at hand. Input prior knowledge into modeling through architecture design. MLP, CNN (image), RNN (sequence data, audio, language...

Parameters: weights W1, W2, W3 and bias b1, b2, b3. Linear regression: slope θ and bias θ0

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Intro to DL

Architecture: define a complex input to output relation with stacked layers of linear transformations and nonlinear functions (activations).

IEOR 8100: Reinforcement Learning 9/36 Input prior knowledge into modeling through architecture design. MLP, CNN (image), RNN (sequence data, audio, language...

Parameters: weights W1, W2, W3 and bias b1, b2, b3. Linear regression: slope θ and bias θ0

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Intro to DL

Architecture: define a complex input to output relation with stacked layers of linear transformations and nonlinear functions (activations). Architectures are adapted to data structure at hand.

IEOR 8100: Reinforcement Learning 9/36 MLP, CNN (image), RNN (sequence data, audio, language...

Parameters: weights W1, W2, W3 and bias b1, b2, b3. Linear regression: slope θ and bias θ0

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Intro to DL

Architecture: define a complex input to output relation with stacked layers of linear transformations and nonlinear functions (activations). Architectures are adapted to data structure at hand. Input prior knowledge into modeling through architecture design.

IEOR 8100: Reinforcement Learning 9/36 Parameters: weights W1, W2, W3 and bias b1, b2, b3. Linear regression: slope θ and bias θ0

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Intro to DL

Architecture: define a complex input to output relation with stacked layers of linear transformations and nonlinear functions (activations). Architectures are adapted to data structure at hand. Input prior knowledge into modeling through architecture design. MLP, CNN (image), RNN (sequence data, audio, language...

IEOR 8100: Reinforcement Learning 9/36 Linear regression: slope θ and bias θ0

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Intro to DL

Architecture: define a complex input to output relation with stacked layers of linear transformations and nonlinear functions (activations). Architectures are adapted to data structure at hand. Input prior knowledge into modeling through architecture design. MLP, CNN (image), RNN (sequence data, audio, language...

Parameters: weights W1, W2, W3 and bias b1, b2, b3.

Intro to DL

Architecture: defines a complex input-to-output relation with stacked layers of linear transformations and nonlinear functions (activations).
Architectures are adapted to the data structure at hand; prior knowledge is input into modeling through architecture design.
MLP, CNN (images), RNN (sequence data: audio, language...).

Parameters: weights W1, W2, W3 and biases b1, b2, b3 (cf. linear regression: slope θ and bias θ0).

IEOR 8100: Reinforcement Learning 9/36 Specify Architecture: how many layers, how many hidden unit per layer? yˆi = fθ(xi ) with generic parameters θ (weights and biases). 1 PN 2 Define Loss: Loss function to minimize J = N i=1(yi − yˆi )

Gradient Descent: θ ← θ − α∇θJ, θ0 ← θ0 − α∇θ0 J

Algorithm 1 Generic DL regression

1: Input: model architecture, data {xi , yi }, learning rate α 2: Initialize: Model parameters θ = {Wi , bi } 3: for t=1,2,3...T do 4: Compute predictiony ˆi = fθ(xi ) 1 P 2 5: Compute loss J = N i (ˆyi − yi ) 6: Gradient update θ ← θ − α∇θJ 7: end for

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL algorithm

IEOR 8100: Reinforcement Learning 10/36 1 PN 2 Define Loss: Loss function to minimize J = N i=1(yi − yˆi )

Gradient Descent: θ ← θ − α∇θJ, θ0 ← θ0 − α∇θ0 J

Algorithm 2 Generic DL regression

1: Input: model architecture, data {xi , yi }, learning rate α 2: Initialize: Model parameters θ = {Wi , bi } 3: for t=1,2,3...T do 4: Compute predictiony ˆi = fθ(xi ) 1 P 2 5: Compute loss J = N i (ˆyi − yi ) 6: Gradient update θ ← θ − α∇θJ 7: end for

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL algorithm

Specify Architecture: how many layers, how many hidden unit per layer? yˆi = fθ(xi ) with generic parameters θ (weights and biases).

IEOR 8100: Reinforcement Learning 10/36 Gradient Descent: θ ← θ − α∇θJ, θ0 ← θ0 − α∇θ0 J

Algorithm 3 Generic DL regression

1: Input: model architecture, data {xi , yi }, learning rate α 2: Initialize: Model parameters θ = {Wi , bi } 3: for t=1,2,3...T do 4: Compute predictiony ˆi = fθ(xi ) 1 P 2 5: Compute loss J = N i (ˆyi − yi ) 6: Gradient update θ ← θ − α∇θJ 7: end for

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL algorithm

Specify Architecture: how many layers, how many hidden unit per layer? yˆi = fθ(xi ) with generic parameters θ (weights and biases). 1 PN 2 Define Loss: Loss function to minimize J = N i=1(yi − yˆi )

DL algorithm

Specify Architecture: how many layers, how many hidden units per layer? ŷ_i = f_θ(x_i) with generic parameters θ (weights and biases).
Define Loss: the loss function to minimize is J = (1/N) Σ_{i=1}^N (y_i − ŷ_i)^2.
Gradient Descent: θ ← θ − α∇_θ J.

Algorithm 1: Generic DL regression
1: Input: model architecture, data {(x_i, y_i)}, learning rate α
2: Initialize: model parameters θ = {Wi, bi}
3: for t = 1, 2, 3, ..., T do
4:   Compute prediction ŷ_i = f_θ(x_i)
5:   Compute loss J = (1/N) Σ_i (ŷ_i − y_i)^2
6:   Gradient update θ ← θ − α∇_θ J
7: end for

DL Regression: Data

Generate data using y = x^3 + x^2 + sin(10x) + noise. Linear regression will fail.

[Figures: data generation code; data plots]
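The data-generation figure is not reproduced; a minimal sketch of how such data could be generated (the input range and noise scale are assumptions):

```python
import numpy as np

N = 1000
x = np.random.uniform(-1.0, 1.0, size=(N, 1))    # assumed input range
noise = 0.1 * np.random.randn(N, 1)               # assumed noise scale
y = x ** 3 + x ** 2 + np.sin(10 * x) + noise      # y = x^3 + x^2 + sin(10x) + noise
```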

DL Regression with Tensorflow

Specify the architecture and loss using Tensorflow syntax, just like linear regression.

[Figure: tf code]

Train the model just like linear regression.
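The code figure is not reproduced; a minimal sketch of defining an MLP architecture and loss in TF 1.x, analogous to the linear regression graph (the layer sizes and the use of tf.layers.dense are illustrative choices, not necessarily what the slide's code uses):

```python
import tensorflow as tf

x_ph = tf.placeholder(tf.float32, shape=[None, 1])
y_ph = tf.placeholder(tf.float32, shape=[None, 1])

# Two hidden layers with tanh activations; sizes are an assumption.
h1 = tf.layers.dense(x_ph, 64, activation=tf.nn.tanh)
h2 = tf.layers.dense(h1, 64, activation=tf.nn.tanh)
y_hat = tf.layers.dense(h2, 1, activation=None)

# Same mean squared error loss as in linear regression.
loss = tf.reduce_mean(tf.square(y_ph - y_hat))
```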

DL Regression with Tensorflow

We launch the training...

[Figures: training loss; fitted line]

Batch training

When the data set is large, we can only compute gradients on mini-batches.
Sample a batch of data → compute the gradient on the batch → update parameters.
Batch size: too small → high variance in the data stream, unstable training; too large → too costly to compute gradients.
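A minimal sketch of mini-batch training, assuming a graph with placeholders x_ph, y_ph, a training op train_op, and an open tf.Session sess as in the other sketches (all names are illustrative):

```python
import numpy as np

# Toy data set, generated as on the earlier slide.
xs = np.random.uniform(-1.0, 1.0, size=(100000, 1))
ys = xs ** 3 + xs ** 2 + np.sin(10 * xs) + 0.1 * np.random.randn(100000, 1)

batch_size = 64  # a typical middle-ground choice
for t in range(10000):
    idx = np.random.randint(0, xs.shape[0], size=batch_size)        # sample a mini-batch
    sess.run(train_op, feed_dict={x_ph: xs[idx], y_ph: ys[idx]})    # gradient step on the batch
```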

Auto-differentiation

Taking gradients in the linear model is easy.
Question: how do we take gradients for an MLP? ∇_{W1} J is not as straightforward as in linear regression:
    J = (1/N) Σ_i (y_i − ŷ_i)^2,   ŷ_i = W3 σ(W2 σ(W1 x_i + b1) + b2) + b3

Auto-differentiation

Tensorflow entails automatic differentiation (autodiff), i.e. automatic gradient computation: users specify the forward computation y = f_θ(x), and the program internally specifies a way to compute ∇_θ y.
Not symbolic computation, not finite differences.
Example: start with a, b and compute e via c = a + b, d = b + 1, e = c · d. Consider
    ∂e/∂a = (∂e/∂d)(∂d/∂a) + (∂e/∂c)(∂c/∂a)

[Figure: computation graph]
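A minimal sketch of this example in TF 1.x; tf.gradients asks Tensorflow to build the gradient ops from the computation graph:

```python
import tensorflow as tf

a = tf.constant(2.0)
b = tf.constant(1.0)

c = a + b      # c = a + b
d = b + 1.0    # d = b + 1
e = c * d      # e = c * d

# Ask TF to build ops computing de/da and de/db by traversing the graph.
grads = tf.gradients(e, [a, b])

with tf.Session() as sess:
    print(sess.run([e, grads]))   # e = 6, de/da = d = 2, de/db = c + d = 5
```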

DL with Tensorflow

High level idea: the computation graph. The forward computation specifies local operations that connect locally related variables, so local gradients can be computed.
Forward computation → local forward ops → local gradients → long-range gradients. Errors propagate backward from the output to the inputs and parameters: back-propagation.
Useful link: http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial

IEOR 8100: Reinforcement Learning 17/36 Placeholder: tf objects that specify users’ inputs into the graph, as literally placeholders for data. Variable: tf objects that represent parameters that can be updated through autodiff. Embedded inside the graph. All expressions computed from primitive variables and placeholders are vertices in the graph.

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have covered from the code...

IEOR 8100: Reinforcement Learning 18/36 Variable: tf objects that represent parameters that can be updated through autodiff. Embedded inside the graph. All expressions computed from primitive variables and placeholders are vertices in the graph.

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have covered from the code... Placeholder: tf objects that specify users’ inputs into the graph, as literally placeholders for data.

IEOR 8100: Reinforcement Learning 18/36 All expressions computed from primitive variables and placeholders are vertices in the graph.

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have covered from the code... Placeholder: tf objects that specify users’ inputs into the graph, as literally placeholders for data. Variable: tf objects that represent parameters that can be updated through autodiff. Embedded inside the graph.

DL with Tensorflow

What we have covered from the code so far:
Placeholder: tf objects that specify users' inputs into the graph, literally placeholders for data.
Variable: tf objects that represent parameters that can be updated through autodiff; embedded inside the graph.
All expressions computed from primitive variables and placeholders are vertices in the graph.
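A minimal sketch of these three kinds of graph vertices (names are illustrative):

```python
import tensorflow as tf

# Placeholder: a slot in the graph for user-supplied data, fed in at run time.
x_ph = tf.placeholder(tf.float32, shape=[None, 1])

# Variable: a parameter embedded in the graph, updatable through autodiff.
w = tf.Variable(tf.ones([1, 1]))

# Every expression built from placeholders and variables is another vertex.
y_hat = tf.matmul(x_ph, w)
```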

IEOR 8100: Reinforcement Learning 18/36 TF expression: python objects that define how to update the graph (variables in the graph) in a single step. Optimizer: python objects that can specify gradient updates of variables. Session: python objects that launch the training. And many more...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have not yet covered from the code... how to launch the training? how to use the gradients computed by Tensorflow to update parameters?

IEOR 8100: Reinforcement Learning 19/36 Optimizer: python objects that can specify gradient updates of variables. Session: python objects that launch the training. And many more...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have not yet covered from the code... how to launch the training? how to use the gradients computed by Tensorflow to update parameters? TF expression: python objects that define how to update the graph (variables in the graph) in a single step.

IEOR 8100: Reinforcement Learning 19/36 Session: python objects that launch the training. And many more...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have not yet covered from the code... how to launch the training? how to use the gradients computed by Tensorflow to update parameters? TF expression: python objects that define how to update the graph (variables in the graph) in a single step. Optimizer: python objects that can specify gradient updates of variables.

IEOR 8100: Reinforcement Learning 19/36 And many more...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow DL with Tensorflow

What we have not yet covered from the code... how to launch the training? how to use the gradients computed by Tensorflow to update parameters? TF expression: python objects that define how to update the graph (variables in the graph) in a single step. Optimizer: python objects that can specify gradient updates of variables. Session: python objects that launch the training.

DL with Tensorflow

What we have not yet covered from the code: how to launch the training, and how to use the gradients computed by Tensorflow to update parameters.
TF expression: python objects that define how to update the graph (variables in the graph) in a single step.
Optimizer: python objects that specify gradient updates of variables.
Session: python objects that launch the training.
And many more...

IEOR 8100: Reinforcement Learning 19/36 Define loss and optimizers: which specifies how to update model parameters during training. Define session to launch the training Initialize variables Feed data into the computation graph

() launch the training

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Launch the training

IEOR 8100: Reinforcement Learning 20/36 Define session to launch the training Initialize variables Feed data into the computation graph

() launch the training

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Launch the training

Define loss and optimizers: which specifies how to update model parameters during training.

IEOR 8100: Reinforcement Learning 20/36 Initialize variables Feed data into the computation graph

() launch the training

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Launch the training

Define loss and optimizers: which specifies how to update model parameters during training. Define session to launch the training

IEOR 8100: Reinforcement Learning 20/36 Feed data into the computation graph

() launch the training

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Launch the training

Define loss and optimizers: which specifies how to update model parameters during training. Define session to launch the training Initialize variables

Launch the training

Define the loss and optimizer: this specifies how model parameters are updated during training.
Define a session to launch the training.
Initialize variables.
Feed data into the computation graph.

[Figure: launch the training]
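The code figure is not reproduced; a minimal sketch of the four steps above for the regression graph from the earlier sketches (loss, x_ph, y_ph and the NumPy arrays xs, ys are assumed to be defined there):

```python
import tensorflow as tf

# 1. Define the optimizer: how parameters are updated from gradients.
optimizer = tf.train.GradientDescentOptimizer(learning_rate=1e-2)
train_op = optimizer.minimize(loss)

# 2. Define a session to launch the training.
with tf.Session() as sess:
    # 3. Initialize variables.
    sess.run(tf.global_variables_initializer())
    # 4. Feed data into the computation graph and update parameters.
    for t in range(1000):
        _, J = sess.run([train_op, loss], feed_dict={x_ph: xs, y_ph: ys})
        if t % 100 == 0:
            print("iter", t, "loss", J)
```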

Training: whole landscape

[Figure: launch the training]

IEOR 8100: Reinforcement Learning 21/36 Classification: regression on probability (image classification) Structured Prediction: regression on high dimensional objects (autoencoders) Reinforcement Learning: DQN, policy gradients... DQN is an analogy to regression PG is an analogy to classification

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Regression

IEOR 8100: Reinforcement Learning 22/36 Structured Prediction: regression on high dimensional objects (autoencoders) Reinforcement Learning: DQN, policy gradients... DQN is an analogy to regression PG is an analogy to classification

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Regression

Classification: regression on probability (image classification)

IEOR 8100: Reinforcement Learning 22/36 Reinforcement Learning: DQN, policy gradients... DQN is an analogy to regression PG is an analogy to classification

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Regression

Classification: regression on probability (image classification) Structured Prediction: regression on high dimensional objects (autoencoders)

IEOR 8100: Reinforcement Learning 22/36 DQN is an analogy to regression PG is an analogy to classification

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Regression

Classification: regression on probability (image classification) Structured Prediction: regression on high dimensional objects (autoencoders) Reinforcement Learning: DQN, policy gradients...

IEOR 8100: Reinforcement Learning 22/36 PG is an analogy to classification

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Regression

Classification: regression on probability (image classification) Structured Prediction: regression on high dimensional objects (autoencoders) Reinforcement Learning: DQN, policy gradients... DQN is an analogy to regression

Beyond Regression

Classification: regression on probabilities (e.g. image classification).
Structured Prediction: regression on high dimensional objects (e.g. autoencoders).
Reinforcement Learning: DQN, policy gradients...
DQN is an analogue of regression; policy gradient (PG) is an analogue of classification. A classification-style loss is sketched below.
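A minimal sketch of a classification-style loss in TF 1.x, to contrast with the regression loss (shapes and layer sizes are illustrative):

```python
import tensorflow as tf

num_classes = 10
x_ph = tf.placeholder(tf.float32, shape=[None, 784])   # e.g. flattened images
labels_ph = tf.placeholder(tf.int32, shape=[None])     # integer class labels

h = tf.layers.dense(x_ph, 128, activation=tf.nn.relu)
logits = tf.layers.dense(h, num_classes)                # unnormalized class scores

# Softmax turns logits into probabilities; cross-entropy replaces squared error.
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels_ph, logits=logits))
```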

IEOR 8100: Reinforcement Learning 22/36 Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters.

IEOR 8100: Reinforcement Learning 23/36 beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients.

IEOR 8100: Reinforcement Learning 23/36 learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular.

IEOR 8100: Reinforcement Learning 23/36 Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate

IEOR 8100: Reinforcement Learning 23/36 Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers

IEOR 8100: Reinforcement Learning 23/36 Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh...

IEOR 8100: Reinforcement Learning 23/36 arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b.

IEOR 8100: Reinforcement Learning 23/36 xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail.

IEOR 8100: Reinforcement Learning 23/36 Need hand tuning or cross validation to select good hyper-parameters. Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal...

IEOR 8100: Reinforcement Learning 23/36 Other topics: batch-normalization, dropout, stochastic layers...

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Hyper-parameters

Get a DL model to work involves a lot of hyper-parameters. Optimization: optimization subroutines that update parameters with gradients. beyond simple sgd, there are rmsprop, adam, adagrad... adam is popular. learning rate Architectures: number of layers, hidden units per layers Nonlinear function: sigmoid, relu, tanh... Initialization: initialization of variables W , b. arbitrary initializations will fail. xavier initialization, truncated normal... Need hand tuning or cross validation to select good hyper-parameters.

Hyper-parameters

Getting a DL model to work involves a lot of hyper-parameters.
Optimization: the optimization subroutine that updates parameters with gradients. Beyond simple SGD there are RMSProp, Adam, Adagrad...; Adam is popular. The learning rate also matters.
Architecture: number of layers, hidden units per layer.
Nonlinear function: sigmoid, relu, tanh...
Initialization: initialization of the variables W, b. Arbitrary initializations will fail; use Xavier initialization, truncated normal...
Hand tuning or cross validation is needed to select good hyper-parameters.
Other topics: batch normalization, dropout, stochastic layers...
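A minimal sketch of where some of these hyper-parameters appear in TF 1.x code (the particular choices shown are illustrative):

```python
import tensorflow as tf

# Initialization: e.g. truncated normal instead of an arbitrary constant.
init = tf.truncated_normal_initializer(stddev=0.1)

# Architecture and nonlinearity: layer sizes and activation are hyper-parameters.
x_ph = tf.placeholder(tf.float32, shape=[None, 1])
h = tf.layers.dense(x_ph, 64, activation=tf.nn.relu, kernel_initializer=init)
y_hat = tf.layers.dense(h, 1, kernel_initializer=init)

y_ph = tf.placeholder(tf.float32, shape=[None, 1])
loss = tf.reduce_mean(tf.square(y_ph - y_hat))

# Optimization: swap SGD for Adam, and tune the learning rate.
train_op = tf.train.AdamOptimizer(learning_rate=1e-3).minimize(loss)
```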

IEOR 8100: Reinforcement Learning 23/36 Other autodiff softwares: Pytorch/Chainer//Caffe and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

IEOR 8100: Reinforcement Learning 24/36 pytorch and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe

IEOR 8100: Reinforcement Learning 24/36 others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building

IEOR 8100: Reinforcement Learning 24/36 strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building others use static graph building

IEOR 8100: Reinforcement Learning 24/36 High level interface: Keras use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug...

IEOR 8100: Reinforcement Learning 24/36 use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras

IEOR 8100: Reinforcement Learning 24/36 specify model architecture more easily High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras use tf and theano as backend

IEOR 8100: Reinforcement Learning 24/36 High level interface: stick with Tensorflow tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras use tf and theano as backend specify model architecture more easily

IEOR 8100: Reinforcement Learning 24/36 tensorflow.contrib

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Beyond Tensorflow

Other autodiff softwares: Pytorch/Chainer/Theano/Caffe pytorch and chainer allow for dynamic graph building others use static graph building strengths/weaknesses: gpu, distributed computing, model building flexibility, debug... High level interface: Keras use tf and theano as backend specify model architecture more easily High level interface: stick with Tensorflow

Beyond Tensorflow

Other autodiff software: PyTorch / Chainer / Theano / Caffe.
PyTorch and Chainer allow dynamic graph building; the others use static graph building.
Strengths/weaknesses: GPU support, distributed computing, model-building flexibility, debugging...
High level interface: Keras. It uses tf and theano as backends and lets you specify model architectures more easily.
High level interface that sticks with Tensorflow: tensorflow.contrib.

Beyond Tensorflow

[Figure: low level vs. high level code]
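The figure is not reproduced; to illustrate the high-level style, a minimal Keras sketch of a regression MLP (hyper-parameters are illustrative):

```python
from keras.models import Sequential
from keras.layers import Dense

# Architecture is specified layer by layer; the graph and gradients are handled internally.
model = Sequential()
model.add(Dense(64, activation='tanh', input_dim=1))
model.add(Dense(64, activation='tanh'))
model.add(Dense(1))

# Loss and optimizer in one line, then training with automatic mini-batching.
model.compile(optimizer='adam', loss='mse')
# model.fit(xs, ys, epochs=100, batch_size=64)   # xs, ys: NumPy arrays as before
```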

IEOR 8100: Reinforcement Learning 25/36 TF official website: https://www.tensorflow.org/ TF Basic tutorial: https://www.tensorflow.org/tutorials/ Keras doc: https://keras.io/ TF tutorials: https://github.com/aymericdamien/TensorFlow-Examples

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Additional Resources

IEOR 8100: Reinforcement Learning 26/36 TF Basic tutorial: https://www.tensorflow.org/tutorials/ Keras doc: https://keras.io/ TF tutorials: https://github.com/aymericdamien/TensorFlow-Examples

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Additional Resources

TF official website: https://www.tensorflow.org/

IEOR 8100: Reinforcement Learning 26/36 Keras doc: https://keras.io/ TF tutorials: https://github.com/aymericdamien/TensorFlow-Examples

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Additional Resources

TF official website: https://www.tensorflow.org/ TF Basic tutorial: https://www.tensorflow.org/tutorials/

IEOR 8100: Reinforcement Learning 26/36 TF tutorials: https://github.com/aymericdamien/TensorFlow-Examples

Simple Example Preliminary Intro to DL Intro to DL with Tensorflow Beyond Tensorflow Additional Resources

TF official website: https://www.tensorflow.org/ TF Basic tutorial: https://www.tensorflow.org/tutorials/ Keras doc: https://keras.io/

Additional Resources

TF official website: https://www.tensorflow.org/
TF basic tutorial: https://www.tensorflow.org/tutorials/
Keras doc: https://keras.io/
TF tutorials: https://github.com/aymericdamien/TensorFlow-Examples

OpenAI Gym - very brief intro

[Figures: classic control; humanoid; atari game - pong]

OpenAI Gym - very brief intro

Gym is a testbed for RL algorithms. Doc: https://github.com/openai/gym
- Make an environment: env = gym.make("CartPole-v0")
- Initialize: obs = env.reset(). This returns the first observation, based on which we take the first action.
- Take a step: obs, reward, done, info = env.step(action)
  - obs: the next observation, based on which we choose the next action
  - reward: the reward collected during this one-step transition
  - done: True or False, indicating whether the episode has terminated
  - info: additional info about the env, normally an empty dictionary
- Display state: env.render(). This only shows the current state, so render inside a loop to display consecutive states.
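A minimal sketch of this interaction loop, assuming the classic Gym API (env.step returning four values) and, purely for illustration, a random policy in place of a learned one:

    import gym

    env = gym.make("CartPole-v0")
    obs = env.reset()                       # first observation
    done = False
    episode_return = 0.0
    while not done:
        env.render()                        # render inside the loop to see consecutive states
        action = env.action_space.sample()  # placeholder: random action instead of one chosen from obs
        obs, reward, done, info = env.step(action)
        episode_return += reward
    env.close()
    print("episode return:", episode_return)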

Beyond OpenAI Gym

There are many other open source projects that provide RL benchmarks:
- Continuous control: DeepMind Control Suite, Roboschool
- Gym-like: rllab
- Video game: VizDoom
- Cooler video game: DeepMind StarCraft II Learning Environment
- Or build your own environments...

Basic DQN [Mnih, 2013]

- MDP with state and action spaces S, A
- Instant reward r
- Policy π : S → A
- Maximize the cumulative reward E_π[ Σ_{t=0}^∞ γ^t r_t ]
- Action-value function Q^π(s, a) for state s and action a, under policy π
- Bellman error:

    E[ ( Q^π(s_t, a_t) − max_a E[ r_t + γ Q^π(s_{t+1}, a) ] )^2 ]

- The Bellman error is zero iff Q^π is the optimal action-value function Q*(s, a), i.e. iff π is an optimal policy.
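A brief justification of the last point (standard reasoning, not spelled out on the slide): when the squared error is zero for every state-action pair the expectation weights, Q satisfies the Bellman optimality equation, whose unique solution is Q*:

    \mathbb{E}\Big[\big(Q(s_t, a_t) - \max_a \mathbb{E}[r_t + \gamma\, Q(s_{t+1}, a)]\big)^2\Big] = 0
    \;\Longleftrightarrow\;
    Q(s, a) = \mathbb{E}\big[r + \gamma \max_{a'} Q(s', a')\big] \quad \forall (s, a)
    \;\Longleftrightarrow\;
    Q = Q^{*}.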

Basic DQN: Conventional Q learning

- Start with any Q vector Q^(0)
- Contractive operator T defined as

    TQ(s_t, a_t) := max_a E[ r_t + γ Q(s_{t+1}, a) ]

- Apply the contractive operator: Q^(t+1) ← T Q^(t)
- Will converge under mild conditions, to the fixed point

    Q = TQ

  which gives zero Bellman error.
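A minimal sketch of this iteration for a small, fully known MDP; the transition structure P[s][a] (a list of (prob, next_state, reward) triples), the discount gamma, and the tolerance are illustrative assumptions, not from the slides:

    import numpy as np

    def apply_T(Q, P, gamma):
        # One application of TQ(s, a) = E[ r + gamma * max_a' Q(s', a') ]
        TQ = np.zeros_like(Q)
        n_states, n_actions = Q.shape
        for s in range(n_states):
            for a in range(n_actions):
                TQ[s, a] = sum(p * (r + gamma * Q[s2].max()) for p, s2, r in P[s][a])
        return TQ

    def q_value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-8):
        Q = np.zeros((n_states, n_actions))      # start with any Q^(0)
        while True:
            TQ = apply_T(Q, P, gamma)
            if np.max(np.abs(TQ - Q)) < tol:     # (approximate) fixed point Q = TQ
                return TQ
            Q = TQ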

Basic DQN

- Approximate Q^π(s, a) with a neural net Q_θ(s, a) with parameters θ.
- Discrete action space |A| < ∞: input state s, output |A| values; the i-th output represents the approximate action-value for the i-th action.
- Continuous action space |A| = ∞: input state s and action a, output one value representing Q^π(s, a). Not our focus here.
- Minimize the Bellman error

    E_π[ ( Q_θ(s_i, a_i) − r_i − max_a Q_θ(s'_i, a) )^2 ]

- Sample-based version, given tuples {(s_i, a_i, r_i, s'_i)}_{i=1}^N:

    min_θ (1/N) Σ_{i=1}^N ( Q_θ(s_i, a_i) − r_i − max_a Q_θ(s'_i, a) )^2
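A hedged TensorFlow 1.x sketch of this setup for a discrete action space; the layer sizes, placeholder names, and dimensions (CartPole-like obs_dim = 4, n_actions = 2) are illustrative assumptions:

    import tensorflow as tf

    obs_dim, n_actions = 4, 2

    obs_ph = tf.placeholder(tf.float32, [None, obs_dim])    # s_i
    act_ph = tf.placeholder(tf.int32, [None])                # a_i
    target_ph = tf.placeholder(tf.float32, [None])           # r_i + gamma * max_a Q(s'_i, a)

    hidden = tf.layers.dense(obs_ph, 64, activation=tf.nn.relu)
    q_values = tf.layers.dense(hidden, n_actions)             # one output per action: Q_theta(s, .)

    # Q_theta(s_i, a_i): select the output corresponding to the action actually taken
    q_taken = tf.reduce_sum(q_values * tf.one_hot(act_ph, n_actions), axis=1)

    loss = tf.reduce_mean(tf.square(q_taken - target_ph))     # (1/N) sum_i (Q - target)^2
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)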

Basic DQN

Instability in optimizing

    min_θ (1/N) Σ_{i=1}^N ( Q_θ(s_i, a_i) − r_i − max_a Q_θ(s'_i, a) )^2

Two techniques alleviate this instability: experience replay and a target network.
- Experience Replay: store each experience tuple (s_i, a_i, r_i, s'_i) in a replay buffer B; when training, sample batches of experience {(s_i, a_i, r_i, s'_i)}_{i=1}^b from B and update the parameters with SGD.
- Target Network: the training target r_i + max_a Q_θ(s'_i, a) is non-stationary. Make it stationary by introducing a slowly updated target network θ^- and computing the target as r_i + max_a Q_{θ^-}(s'_i, a).
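A minimal sketch of both tricks, assuming the online and target Q-networks were built under TF variable scopes named "q_net" and "target_net" (these names, the buffer size, and the batch size are illustrative):

    import random
    from collections import deque
    import tensorflow as tf

    # Experience replay: store transitions, sample mini-batches for the SGD updates
    replay_buffer = deque(maxlen=100000)

    def store(s, a, r, s_next, done):
        replay_buffer.append((s, a, r, s_next, done))

    def sample_batch(batch_size=32):
        return random.sample(replay_buffer, batch_size)

    # Target network: an op that copies the online parameters theta into theta^-;
    # run it every K training steps so the target changes slowly
    # (assumes both scopes define their variables in the same order)
    q_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="q_net")
    target_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="target_net")
    update_target_op = tf.group(*[t.assign(q) for q, t in zip(q_vars, target_vars)])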

Basic DQN

- Naive exploration: ε-greedy. In state s, with probability ε take an action uniformly at random; with probability 1 − ε take the greedy action

    arg max_a Q_θ(s, a)

- More advanced exploration: noisy nets, parameter space noise, Bayesian updates...
- Learning rate: a constant, e.g. α = 0.001, rather than a Robbins-Monro (RM) scheme.
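A short sketch of ε-greedy action selection on top of the Q-network above; sess, q_values, and obs_ph refer to the TF session and tensors from the earlier sketch, and epsilon = 0.1 is an illustrative value:

    import numpy as np

    def epsilon_greedy_action(sess, q_values, obs_ph, obs, n_actions, epsilon=0.1):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)                   # explore: uniform random action
        q = sess.run(q_values, feed_dict={obs_ph: obs[None]})     # exploit: greedy action
        return int(np.argmax(q[0]))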

Basic DQN

[Figure: DQN pseudocode]
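The slide shows the DQN pseudocode as a figure; below is a hedged Python-style sketch of the training loop it presumably depicts, reusing the pieces sketched earlier (environment, Q-network, replay buffer, target network, ε-greedy). The hyperparameters and the target-network tensors (target_q_values, target_obs_ph, built analogously to the online net) are illustrative assumptions, not taken from the slides.

    import numpy as np

    n_episodes, batch_size, gamma = 500, 32, 0.99
    target_update_every, epsilon, step = 500, 0.1, 0

    for episode in range(n_episodes):
        obs = env.reset()
        done = False
        while not done:
            action = epsilon_greedy_action(sess, q_values, obs_ph, obs, n_actions, epsilon)
            next_obs, reward, done, _ = env.step(action)
            store(obs, action, reward, next_obs, done)
            obs = next_obs

            if len(replay_buffer) >= batch_size:
                s, a, r, s2, d = map(np.array, zip(*sample_batch(batch_size)))
                # targets r_i + gamma * max_a Q_{theta^-}(s'_i, a), computed with the target network
                q_next = sess.run(target_q_values, feed_dict={target_obs_ph: s2})
                targets = r + gamma * (1.0 - d) * q_next.max(axis=1)
                sess.run(train_op, feed_dict={obs_ph: s, act_ph: a, target_ph: targets})

            step += 1
            if step % target_update_every == 0:
                sess.run(update_target_op)    # refresh theta^- every K steps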

End

Thanks!
