Deep Optimal Stopping

Sebastian Becker (ZENAI), Patrick Cheridito (RiskLab, ETH Zurich), Arnulf Jentzen (Universität Münster)

Vienna, November 2019

The Problem

sup_{τ∈T} E g(τ, X_τ),

where

- (X_n)_{n=0}^N is a d-dimensional Markov process on a probability space (Ω, F, P)
- g : {0, 1, ..., N} × R^d → R is a measurable function such that E|g(n, X_n)| < ∞ for all n = 0, ..., N
- T is the set of all X-stopping times τ, that is, {τ = n} ∈ σ(X_0, ..., X_n) for all n = 0, 1, ..., N
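For intuition, the stopping problem can be solved exactly by backward induction when the state space is finite. The following toy illustration is our own (the chain, reward, and horizon are invented for the example) and is not part of the slides:

```python
import numpy as np

# For a finite-state Markov chain, sup_tau E g(tau, X_tau) solves the backward
# (Snell envelope) recursion V_N = g(N, .), V_n = max(g(n, .), P V_{n+1}).

N = 3                                    # exercise dates 0, 1, 2, 3
P = np.array([[0.50, 0.50, 0.00],        # transition matrix of a 3-state chain
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
states = np.array([0.0, 1.0, 2.0])

def g(n, x):
    return np.maximum(x - 0.5, 0.0)      # a call-type reward, same at every date

V = g(N, states)                         # V_N = g(N, .)
for n in range(N - 1, -1, -1):
    V = np.maximum(g(n, states), P @ V)  # stop iff immediate reward beats continuation

# V[i] is the optimal expected reward when starting from state i at time 0
```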

About the assumptions

- Discrete time: Many problems are already in discrete time. Most relevant continuous-time problems can be approximated by time-discretized versions.
- Markov assumption: Every discrete-time process can be made Markov by including all relevant information in the current state, that is, by increasing the dimension of (X_n)_{n=0}^N.

Examples

1. Bermudan max-call options

Consider d assets with prices evolving according to a multi-dimensional Black–Scholes model

S_t^i = s_0^i exp([r − δ_i − σ_i²/2] t + σ_i W_t^i), i = 1, 2, ..., d,

for
- initial values s_0^i ∈ (0, ∞)
- a risk-free interest rate r ∈ R
- dividend yields δ_i ∈ [0, ∞)
- volatilities σ_i ∈ (0, ∞)
- and a d-dimensional Brownian motion W with constant correlation ρ_ij between increments of different components W^i and W^j

A Bermudan max-call option has time-t payoff (max_{1≤i≤d} S_t^i − K)^+ and can be exercised at one of finitely many times 0 = t_0 < t_1 = T/N < t_2 = 2T/N < ... < t_N = T

Price: sup_{τ∈{t_0, t_1, ..., T}} E[e^{−rτ} (max_{1≤i≤d} S_τ^i − K)^+] = sup_{τ∈T} E g(τ, X_τ)

This problem has been studied for d = 2, 3, 5 (among others) by
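The model above is straightforward to simulate. The sketch below is our own: it uses the parameter values of the d = 5 example with ρ_ij = 0 and evaluates the trivial rule "exercise only at maturity", whose discounted expected payoff is a lower bound for the Bermudan price (it is not the optimal stopping rule):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T = 5, 9, 3.0
s0, sigma, r, delta, K = 100.0, 0.2, 0.05, 0.1, 100.0
dt = T / N

M = 100_000                                   # number of simulated paths
# uncorrelated Brownian increments (rho_ij = 0 in the example)
dW = rng.standard_normal((M, N, d)) * np.sqrt(dt)
# S[k, m, i] = price of asset i at time t_{m+1} on path k
log_S = np.log(s0) + np.cumsum((r - delta - 0.5 * sigma**2) * dt + sigma * dW, axis=1)
S = np.exp(log_S)

# discounted payoff of exercising at t_N = T only
payoff_T = np.exp(-r * T) * np.maximum(S[:, -1, :].max(axis=1) - K, 0.0)
lower_bound = payoff_T.mean()                 # <= Bermudan price (up to MC error)
```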

Longstaff and Schwartz (2001), Rogers (2002), García (2003), Boyle, Kolkiewicz and Tan (2003), Haugh and Kogan (2004), Broadie and Glasserman (2004), Andersen and Broadie (2004), Broadie and Cao (2008), Berridge and Schumacher (2008), Belomestny (2011, 2013), Jain and Oosterlee (2015), Lelong (2016)

Our price estimates

for s_0^i = 100, σ_i = 20%, r = 5%, δ_i = 10%, ρ_ij = 0, K = 100, T = 3, N = 9:

# Assets   Point Est.   Comp. Time   95% Conf. Int.       Bin. Tree   Broadie–Cao 95% Conf. Int.
2          13.899       28.7s        [13.880, 13.910]     13.902
3          18.690       28.9s        [18.673, 18.699]     18.69
5          26.159       28.1s        [26.138, 26.174]                 [26.115, 26.164]
10         38.337       30.5s        [38.300, 38.367]
20         51.668       37.5s        [51.549, 51.803]
30         59.659       45.5s        [59.476, 59.872]
50         69.736       59.1s        [69.560, 69.945]
100        83.584       95.9s        [83.357, 83.862]
200        97.612       170.1s       [97.381, 97.889]
500        116.425      493.5s       [116.210, 116.685]

2. Optimally stopping a fractional Brownian motion

A fractional Brownian motion with Hurst parameter H ∈ (0, 1] is a continuous centered Gaussian process (W_t^H)_{t≥0} with covariance structure

Cov(W_s^H, W_t^H) = (1/2) (t^{2H} + s^{2H} − |t − s|^{2H})

- For H = 1/2, W^H is a Brownian motion
- For H > 1/2, W^H has positively correlated increments
- For H < 1/2, W^H has negatively correlated increments

[Figure: sample paths of W^H for H = 0.1, H = 0.5, and H = 0.8]

Problem: sup_{0≤τ≤1} E W_τ^H   (∗)

- denote t_n = n/100, n = 0, 1, 2, ..., 100
- introduce the 100-dimensional Markov process (X_n)_{n=0}^{100} given by

  X_0 = (0, 0, ..., 0)
  X_1 = (W_{t_1}^H, 0, ..., 0)
  X_2 = (W_{t_2}^H, W_{t_1}^H, 0, ..., 0)
  ...
  X_100 = (W_{t_100}^H, W_{t_99}^H, ..., W_{t_1}^H)

The discretized stopping problem

sup_{τ∈T} E g(X_τ) for g(x^1, ..., x^100) = x^1

approximates the continuous-time problem (∗) from below
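Fractional Brownian motion on this grid can be simulated exactly via a Cholesky factorization of its covariance matrix. The sketch below, our own illustration, builds one path and the 100-dimensional Markov state described above:

```python
import numpy as np

H = 0.8
n_steps = 100
t = np.arange(1, n_steps + 1) / n_steps

# Cov(W^H_s, W^H_t) = (t^{2H} + s^{2H} - |t - s|^{2H}) / 2
cov = 0.5 * (t[:, None]**(2 * H) + t[None, :]**(2 * H)
             - np.abs(t[:, None] - t[None, :])**(2 * H))
L = np.linalg.cholesky(cov)

rng = np.random.default_rng(1)
W = L @ rng.standard_normal(n_steps)     # (W^H_{t_1}, ..., W^H_{t_100})

def markov_state(n):
    """X_n = (W^H_{t_n}, W^H_{t_{n-1}}, ..., W^H_{t_1}, 0, ..., 0)."""
    x = np.zeros(n_steps)
    x[:n] = W[:n][::-1]
    return x

g = lambda x: x[0]                       # reward g(x^1, ..., x^100) = x^1
```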

[Figure: estimated values of sup_{0≤τ≤1} E W_τ^H plotted against the Hurst parameter H ∈ (0, 1], comparing our results with the results of Kulikov and Gusyatnikov (2016), which are based on heuristic stopping rules]

Computing a candidate optimal stopping time

Introduce the sequence of auxiliary stopping problems

V_n = sup_{τ∈T_n} E g(τ, X_τ), n = 0, 1, ..., N,

where T_n is the set of all stopping times n ≤ τ ≤ N

Stopping times and stopping decisions

Let f_n, f_{n+1}, ..., f_N : R^d → {0, 1} be measurable functions such that f_N ≡ 1. Then

τ_n = Σ_{m=n}^{N} m f_m(X_m) ∏_{j=n}^{m−1} (1 − f_j(X_j)), with ∏_{j=n}^{n−1} (1 − f_j(X_j)) := 1,

is a stopping time in T_n

Theorem

For a given n ∈ {0, 1, ..., N − 1}, let τ_{n+1} be a stopping time in T_{n+1} of the form

τ_{n+1} = Σ_{m=n+1}^{N} m f_m(X_m) ∏_{j=n+1}^{m−1} (1 − f_j(X_j))

for measurable functions f_{n+1}, ..., f_N : R^d → {0, 1} with f_N ≡ 1.

Then there exists a measurable function f_n : R^d → {0, 1} such that the stopping time

τ_n = n f_n(X_n) + τ_{n+1} (1 − f_n(X_n)) = Σ_{m=n}^{N} m f_m(X_m) ∏_{j=n}^{m−1} (1 − f_j(X_j))

satisfies

E g(τ_n, X_{τ_n}) ≥ V_n − (V_{n+1} − E g(τ_{n+1}, X_{τ_{n+1}}))

Proof: Compare g(n, X_n) to E[g(τ_{n+1}, X_{τ_{n+1}}) | X_0, X_1, ..., X_n] = E[g(τ_{n+1}, X_{τ_{n+1}}) | X_n]
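The sum-product expression for τ_n and the recursion τ_n = n f_n(X_n) + τ_{n+1}(1 − f_n(X_n)) agree pointwise. A quick numerical check, our own sketch with arbitrary threshold decision functions:

```python
import numpy as np

N = 5
rng = np.random.default_rng(2)
X = np.cumsum(rng.standard_normal(N + 1))   # a scalar random-walk path X_0, ..., X_N

def f(n, x):
    # hard decision: stop once the path exceeds 1; f_N is identically 1
    return 1 if (n == N or x > 1.0) else 0

def tau(n, X):
    """Sum-product form: tau_n = sum_m m f_m(X_m) prod_{j<m} (1 - f_j(X_j))."""
    total, alive = 0, 1                     # 'alive' carries the running product
    for m in range(n, N + 1):
        total += m * f(m, X[m]) * alive
        alive *= 1 - f(m, X[m])
    return total

def tau_rec(n, X):
    """Recursive form: tau_n = n f_n(X_n) + tau_{n+1} (1 - f_n(X_n))."""
    if n == N:
        return N
    return n * f(n, X[n]) + tau_rec(n + 1, X) * (1 - f(n, X[n]))
```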

Neural network approximation

Idea: Recursively approximate f_n by a neural network f^θ : R^d → {0, 1} of the form

f^θ = 1_{[0,∞)} ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ,

where

- q_1 and q_2 are positive integers specifying the number of nodes in the two hidden layers
- a_1^θ : R^d → R^{q_1}, a_2^θ : R^{q_1} → R^{q_2} and a_3^θ : R^{q_2} → R are affine functions given by a_i^θ(x) = A_i x + b_i, i = 1, 2, 3
- for j ∈ N, φ_j : R^j → R^j is the component-wise ReLU activation function given by φ_j(x_1, ..., x_j) = (x_1^+, ..., x_j^+)

The components of θ consist of the entries of A_i and b_i, i = 1, 2, 3, so the number of parameters is ≈ d²

More precisely,

- assume parameter values θ_{n+1}, θ_{n+2}, ..., θ_N ∈ R^q have been found such that f^{θ_N} ≡ 1 and the stopping time

  τ_{n+1} = Σ_{m=n+1}^{N} m f^{θ_m}(X_m) ∏_{j=n+1}^{m−1} (1 − f^{θ_j}(X_j))

  produces an expectation E g(τ_{n+1}, X_{τ_{n+1}}) close to the optimal value V_{n+1}
- now try to find a maximizer θ_n ∈ R^q of

  θ ↦ E[g(n, X_n) f^θ(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − f^θ(X_n))]
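The architecture can be written down directly. Below is a minimal sketch, our own, with randomly initialized (untrained) weights and placeholder sizes for d, q_1, q_2:

```python
import numpy as np

# F^theta = psi . a3 . phi . a2 . phi . a1 with component-wise ReLU phi and
# logistic psi; the hard decision f^theta replaces psi by the indicator 1_[0,inf).

d, q1, q2 = 4, 8, 8
rng = np.random.default_rng(3)
A1, b1 = rng.standard_normal((q1, d)), rng.standard_normal(q1)
A2, b2 = rng.standard_normal((q2, q1)), rng.standard_normal(q2)
A3, b3 = rng.standard_normal((1, q2)), rng.standard_normal(1)

relu = lambda z: np.maximum(z, 0.0)           # phi_j(x) = (x_1^+, ..., x_j^+)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # psi(x) = e^x / (1 + e^x)

def pre_activation(x):
    return (A3 @ relu(A2 @ relu(A1 @ x + b1) + b2) + b3)[0]

def F(x):                                     # soft decision, values in (0, 1)
    return sigmoid(pre_activation(x))

def f(x):                                     # hard decision, values in {0, 1}
    return 1 if pre_activation(x) >= 0 else 0
```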

Goal: find an (approximately) optimal θ_n ∈ R^q with a stochastic gradient ascent method

Problem: for x ∈ R^d, the θ-gradient of

f^θ(x) = (1_{[0,∞)} ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ)(x)

is 0 or does not exist

As an intermediate step, consider a neural network F^θ : R^d → (0, 1) of the form

F^θ = ψ ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ for ψ(x) = e^x / (1 + e^x)

Use stochastic gradient ascent to find an approximate optimizer θ_n ∈ R^q of

θ ↦ E[g(n, X_n) F^θ(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − F^θ(X_n))]

Approximate f_n ≈ f^{θ_n} = 1_{[0,∞)} ∘ a_3^{θ_n} ∘ φ_{q_2} ∘ a_2^{θ_n} ∘ φ_{q_1} ∘ a_1^{θ_n}

Repeat the same steps at times n − 1, n − 2, ..., 0

Proposition

Let n ∈ {0, 1, ..., N − 1} and fix a stopping time τ_{n+1} ∈ T_{n+1}. Then, for every constant ε > 0, there exist numbers of hidden nodes q_1 and q_2 such that

sup_{θ∈R^q} E[g(n, X_n) f^θ(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − f^θ(X_n))]
  ≥ sup_{f∈D} E[g(n, X_n) f(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − f(X_n))] − ε,

where D is the set of all measurable functions f : R^d → {0, 1}.

Proof

1. Every measurable set A ⊆ R^d can be approximated in measure by compact sets K ⊆ A
2. 1_K − 1_{K^c} can be approximated by continuous functions k_j
3. k_j can be approximated uniformly on compacts by functions of the form h(x) = Σ_{i=1}^{r} (v_i^T x + c_i)^+ − Σ_{i=1}^{s} (w_i^T x + d_i)^+ (Leshno–Lin–Pinkus–Schocken, 1993)
4. 1_{[0,∞)} ∘ h can be written as a neural network of the form f^θ = 1_{[0,∞)} ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ

Corollary

For a given optimal stopping problem of the form sup_{τ∈T} E g(τ, X_τ) and a constant ε > 0, there exist

- numbers of hidden nodes q_1, q_2 and
- functions f^{θ_0}, f^{θ_1}, ..., f^{θ_N} : R^d → {0, 1} of the form f^{θ_n} = 1_{[0,∞)} ∘ a_3^{θ_n} ∘ φ_{q_2} ∘ a_2^{θ_n} ∘ φ_{q_1} ∘ a_1^{θ_n}

such that f^{θ_N} ≡ 1 and the stopping time

τ^Θ = Σ_{n=0}^{N} n f^{θ_n}(X_n) ∏_{j=0}^{n−1} (1 − f^{θ_j}(X_j))

satisfies E g(τ^Θ, X_{τ^Θ}) ≥ sup_{τ∈T} E g(τ, X_τ) − ε

Training the networks

Let $(x_n^k)_{n=0}^{N}$, $k = 1, 2, \ldots$, be independent simulations of $(X_n)_{n=0}^{N}$.

Let $\theta_{n+1}, \ldots, \theta_N \in \mathbb{R}^q$ be given, and consider the corresponding stopping time
\[
\tau_{n+1} = \sum_{m=n+1}^{N} m\, f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} \bigl(1 - f^{\theta_j}(X_j)\bigr).
\]

$\tau_{n+1}$ is of the form $\tau_{n+1} = l_{n+1}(X_{n+1}, \ldots, X_{N-1})$ for a measurable function $l_{n+1} : \mathbb{R}^{d(N-n-1)} \to \{n+1, n+2, \ldots, N\}$.

Denote
\[
l_{n+1}^k =
\begin{cases}
N & \text{if } n = N - 1 \\
l_{n+1}\bigl(x_{n+1}^k, \ldots, x_{N-1}^k\bigr) & \text{if } n \le N - 2.
\end{cases}
\]

The realized reward
\[
r_n^k(\theta) = g(n, x_n^k)\, F^{\theta}(x_n^k) + g\bigl(l_{n+1}^k, x^k_{l_{n+1}^k}\bigr)\bigl(1 - F^{\theta}(x_n^k)\bigr)
\]
is continuous and almost everywhere differentiable in $\theta$.
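A sketch of the realized reward for one simulated path, with a toy soft stopping probability $F^{\theta}(x) = \sigma(\theta^T x)$ standing in for the neural network (the linear model and the payoff below are illustrative assumptions, not the architecture from the slides):

```python
import math

def realized_reward(theta, path, n, l_next, g):
    """r_n(theta) = g(n, x_n) F(x_n) + g(l_next, x_{l_next}) (1 - F(x_n)),
    with F(x) = sigmoid(theta . x) as a toy differentiable soft decision."""
    F = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, path[n]))))
    return g(n, path[n]) * F + g(l_next, path[l_next]) * (1.0 - F)

# toy payoff depending only on the first coordinate of the state
g = lambda n, x: x[0]
r = realized_reward([0.0, 0.0], [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
                    n=1, l_next=2, g=g)
# theta = 0 gives F = 1/2, so r averages the two candidate payoffs
```

Because the sigmoid is smooth, $r_n^k$ inherits differentiability in $\theta$, which is what makes gradient-based training possible.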

Stochastic Gradient Ascent

Initialize $\theta_{n,0}$, typically at random (e.g. Xavier initialization).

Standard updating: $\theta_{n,m+1} = \theta_{n,m} + \eta\, \nabla r_n^m(\theta_{n,m})$.

Variants: mini-batches, batch normalization, momentum, Adagrad, RMSProp, AdaDelta, ADAM, decoupled weight decay, warm restarts, ...
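A minimal sketch of the plain update $\theta_{m+1} = \theta_m + \eta\, \nabla r(\theta_m)$ on a toy concave reward $r(\theta) = -(\theta - 3)^2$, with a small noise term mimicking the stochasticity of the simulated gradients (all specifics here are illustrative assumptions):

```python
import random

random.seed(0)

def sga(grad, theta, eta=0.1, steps=200):
    """Stochastic gradient ascent: theta <- theta + eta * grad(theta)."""
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# noisy gradient of the toy reward r(theta) = -(theta - 3)^2
noisy_grad = lambda t: -2.0 * (t - 3.0) + 0.01 * random.gauss(0.0, 1.0)
theta_hat = sga(noisy_grad, theta=0.0)   # converges near the maximizer 3
```

The listed variants (momentum, ADAM, etc.) all modify how the gradient estimate enters this update, not the ascent principle itself.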

Lower bound

The candidate optimal stopping time
\[
\tau^{\Theta} = \sum_{n=1}^{N} n\, f^{\theta_n}(X_n) \prod_{j=0}^{n-1} \bigl(1 - f^{\theta_j}(X_j)\bigr)
\]
yields a lower bound $L = \mathbb{E}\, g(\tau^{\Theta}, X_{\tau^{\Theta}})$ for the optimal value $V_0 = \sup_{\tau} \mathbb{E}\, g(\tau, X_\tau)$.

Let $(y_n^k)_{n=0}^{N}$, $k = 1, 2, \ldots, K_L$, be a new set of independent simulations of $(X_n)_{n=0}^{N}$.

$\tau^{\Theta}$ can be written as $\tau^{\Theta} = l(X_0, \ldots, X_{N-1})$ for a measurable function $l : \mathbb{R}^{dN} \to \{0, 1, \ldots, N\}$.

Denote $l^k = l\bigl(y_0^k, \ldots, y_{N-1}^k\bigr)$.

Use the Monte Carlo approximation
\[
\hat{L} = \frac{1}{K_L} \sum_{k=1}^{K_L} g\bigl(l^k, y^k_{l^k}\bigr)
\]
as an estimate for $L$.

Lower confidence bounds

Assume $\mathbb{E}\bigl[g(n, X_n)^2\bigr] < \infty$ for all $n = 0, 1, \ldots, N$.

Consider the sample variance
\[
\hat{\sigma}_L^2 = \frac{1}{K_L - 1} \sum_{k=1}^{K_L} \Bigl( g\bigl(l^k, y^k_{l^k}\bigr) - \hat{L} \Bigr)^2.
\]

By the CLT,
\[
\Bigl[ \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}},\ \infty \Bigr)
\]
is an asymptotically valid $1 - \alpha$ confidence interval for $L$, where $z_\alpha$ is the $1 - \alpha$ quantile of the standard normal distribution. Therefore,
\[
\mathbb{P}\Bigl( V_0 \ge \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}} \Bigr)
\ge \mathbb{P}\Bigl( L \ge \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}} \Bigr) \approx 1 - \alpha.
\]
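The lower confidence bound is computed directly from the simulated rewards $g(l^k, y^k_{l^k})$; a sketch (the four sample rewards are illustrative):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def lower_confidence_bound(rewards, alpha=0.05):
    """Return (L_hat, L_hat - z_alpha * sigma_hat / sqrt(K_L)),
    where z_alpha is the 1 - alpha standard normal quantile."""
    K = len(rewards)
    L_hat = mean(rewards)
    sigma_hat = stdev(rewards)                 # sample standard deviation
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return L_hat, L_hat - z_alpha * sigma_hat / sqrt(K)

L_hat, bound = lower_confidence_bound([1.0, 2.0, 3.0, 4.0])
```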

Upper bound

Let $(H_n)$ be the Snell envelope of $G_n = g(n, X_n)$, $n = 0, 1, \ldots, N$, with Doob decomposition $H_n = H_0 + M_n^H - A_n^H$.

The following is a variant of the dual formulation of Rogers (2002), Haugh–Kogan (2004) and Andersen–Broadie (2004).

Proposition

For every $(\mathcal{F}_n^X)$-martingale $(M_n)$ with $M_0 = 0$ and estimation errors $(\varepsilon_n)$ satisfying $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n^X] = 0$, one has
\[
V_0 \le \mathbb{E}\Bigl[ \max_{0 \le n \le N} \bigl( G_n - M_n - \varepsilon_n \bigr) \Bigr].
\]
On the other hand,
\[
V_0 = \mathbb{E}\Bigl[ \max_{0 \le n \le N} \bigl( G_n - M_n^H \bigr) \Bigr].
\]

Estimating a good dual martingale

Approximate $H_n$ by $H_n^{\Theta} = \mathbb{E}\bigl[ G_{\tau_n^{\Theta}} \mid \mathcal{F}_n^X \bigr]$ and $\Delta M_n = H_n - \mathbb{E}\bigl[ H_n \mid \mathcal{F}_{n-1}^X \bigr]$ by
\[
\Delta M_n^{\Theta} = H_n^{\Theta} - \mathbb{E}\bigl[ H_n^{\Theta} \mid \mathcal{F}_{n-1}^X \bigr]
= f^{\theta_n}(X_n)\, G_n + \bigl(1 - f^{\theta_n}(X_n)\bigr) C_n^{\Theta} - C_{n-1}^{\Theta}
\]
for the continuation values $C_n^{\Theta} = \mathbb{E}\bigl[ G_{\tau_{n+1}^{\Theta}} \mid \mathcal{F}_n^X \bigr]$.

Let $(z_n^k)_{n=0}^{N}$, $k = 1, 2, \ldots, K_U$, be a third set of independent simulations of $(X_n)_{n=0}^{N}$.

For all $z_n^k$, simulate $J$ independent continuation paths $\tilde{z}_{n+1}^{k,j}, \ldots, \tilde{z}_N^{k,j}$.

\[
C_n^k = \frac{1}{J} \sum_{j=1}^{J} g\Bigl( \tau_{n+1}^{k,j}, \tilde{z}^{k,j}_{\tau_{n+1}^{k,j}} \Bigr)
\]
can be understood as realizations of $C_n^{\Theta} + \tilde{\varepsilon}_n$.

This gives realizations $M_n^k$ of $M_n^{\Theta} + \varepsilon_n$.
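Once hard decisions $f_n$, rewards $G_n$ and estimated continuation values $C_n$ are available along one simulated path, the (noisy) martingale $M^k$ is accumulated from the increments above; a sketch with made-up numbers (the arrays are illustrative placeholders for simulated quantities):

```python
def dual_martingale(f, G, C):
    """M_0 = 0 and M_n = M_{n-1} + f_n G_n + (1 - f_n) C_n - C_{n-1},
    i.e. the increment Delta M_n from the slide with estimated C."""
    M = [0.0]
    for n in range(1, len(G)):
        M.append(M[-1] + f[n] * G[n] + (1 - f[n]) * C[n] - C[n - 1])
    return M

# one toy path; f_N = 1 forces stopping at the last date
M = dual_martingale(f=[0, 1, 1], G=[1.0, 2.0, 3.0], C=[5.0, 4.0, 3.0])
```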

Estimating an upper bound

\[
U = \mathbb{E}\Bigl[ \max_{0 \le n \le N} \bigl( G_n - M_n^{\Theta} - \varepsilon_n \bigr) \Bigr]
\]
is an upper bound for $V_0$.

Use the Monte Carlo approximation
\[
\hat{U} = \frac{1}{K_U} \sum_{k=1}^{K_U} \max_{0 \le n \le N} \bigl( g(n, z_n^k) - M_n^k \bigr)
\]
as an estimate for $U$.

Our point estimate of $V_0$: $\dfrac{\hat{L} + \hat{U}}{2}$.
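The estimate $\hat{U}$ is a plain average of pathwise maxima; a sketch with two toy paths (the arrays are illustrative):

```python
def upper_bound_estimate(G_paths, M_paths):
    """U_hat = (1/K_U) sum_k max_n (G_n^k - M_n^k)."""
    maxima = [max(g - m for g, m in zip(G, M))
              for G, M in zip(G_paths, M_paths)]
    return sum(maxima) / len(maxima)

U_hat = upper_bound_estimate([[1.0, 2.0], [3.0, 1.0]],
                             [[0.0, 1.0], [1.0, 0.0]])
# per-path maxima are 1 and 2, so U_hat = 1.5
```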

Confidence intervals for V0

By the CLT,
\[
\Bigl( -\infty,\ \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr]
\]
is an asymptotically valid $1 - \alpha$ confidence interval for $U$, where $\hat{\sigma}_U$ is the corresponding sample standard deviation.

One has
\[
\mathbb{P}\Bigl( V_0 \le \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr)
\ge \mathbb{P}\Bigl( U \le \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr) \approx 1 - \alpha.
\]

So
\[
\Bigl[ \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}},\ \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr]
\]
is an asymptotically valid $1 - 2\alpha$ confidence interval for $V_0$.

Thank You!