Deep Optimal Stopping

Sebastian Becker (ZENAI), Patrick Cheridito (RiskLab, ETH Zurich), Arnulf Jentzen (Universität Münster)

Vienna, November 2019

The Problem

sup_{τ∈T} E g(τ, X_τ),

where

- (X_n)_{n=0}^N is a d-dimensional Markov process on a probability space (Ω, F, P)
- g : {0, 1, ..., N} × R^d → R is a measurable function such that E|g(n, X_n)| < ∞ for all n = 0, ..., N
- T is the set of all X-stopping times τ, that is, {τ = n} ∈ σ(X_0, ..., X_n) for all n = 0, 1, ..., N
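For intuition, the stopping problem can be solved exactly by backward induction when the state space is finite. The following toy illustration is our own (the chain, reward, and horizon are invented for the example) and is not part of the slides:

```python
import numpy as np

# For a finite-state Markov chain, sup_tau E g(tau, X_tau) solves the backward
# (Snell envelope) recursion V_N = g(N, .), V_n = max(g(n, .), P V_{n+1}).

N = 3                                    # exercise dates 0, 1, 2, 3
P = np.array([[0.50, 0.50, 0.00],        # transition matrix of a 3-state chain
              [0.25, 0.50, 0.25],
              [0.00, 0.50, 0.50]])
states = np.array([0.0, 1.0, 2.0])

def g(n, x):
    return np.maximum(x - 0.5, 0.0)      # a call-type reward, same at every date

V = g(N, states)                         # V_N = g(N, .)
for n in range(N - 1, -1, -1):
    V = np.maximum(g(n, states), P @ V)  # stop iff immediate reward beats continuation

# V[i] is the optimal expected reward when starting from state i at time 0
```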

About the assumptions

- Discrete time: Many problems are already in discrete time. Most relevant continuous-time problems can be approximated by time-discretized versions.
- Markov assumption: Every discrete-time process can be made Markov by including all relevant information in the current state, that is, by increasing the dimension of (X_n)_{n=0}^N.

Examples

1. Bermudan max-call options

Consider d assets with prices evolving according to a multi-dimensional Black–Scholes model

S_t^i = s_0^i exp([r − δ_i − σ_i²/2] t + σ_i W_t^i), i = 1, 2, ..., d,

for
- initial values s_0^i ∈ (0, ∞)
- a risk-free interest rate r ∈ R
- dividend yields δ_i ∈ [0, ∞)
- volatilities σ_i ∈ (0, ∞)
- and a d-dimensional Brownian motion W with constant correlation ρ_ij between increments of different components W^i and W^j

A Bermudan max-call option has time-t payoff (max_{1≤i≤d} S_t^i − K)^+ and can be exercised at one of finitely many times 0 = t_0 < t_1 = T/N < t_2 = 2T/N < ... < t_N = T

Price: sup_{τ∈{t_0, t_1, ..., T}} E[e^{−rτ} (max_{1≤i≤d} S_τ^i − K)^+] = sup_{τ∈T} E g(τ, X_τ)

This problem has been studied for d = 2, 3, 5 (among others) by
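The model above is straightforward to simulate. The sketch below is our own: it uses the parameter values of the d = 5 example with ρ_ij = 0 and evaluates the trivial rule "exercise only at maturity", whose discounted expected payoff is a lower bound for the Bermudan price (it is not the optimal stopping rule):

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, T = 5, 9, 3.0
s0, sigma, r, delta, K = 100.0, 0.2, 0.05, 0.1, 100.0
dt = T / N

M = 100_000                                   # number of simulated paths
# uncorrelated Brownian increments (rho_ij = 0 in the example)
dW = rng.standard_normal((M, N, d)) * np.sqrt(dt)
# S[k, m, i] = price of asset i at time t_{m+1} on path k
log_S = np.log(s0) + np.cumsum((r - delta - 0.5 * sigma**2) * dt + sigma * dW, axis=1)
S = np.exp(log_S)

# discounted payoff of exercising at t_N = T only
payoff_T = np.exp(-r * T) * np.maximum(S[:, -1, :].max(axis=1) - K, 0.0)
lower_bound = payoff_T.mean()                 # <= Bermudan price (up to MC error)
```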

Longstaff and Schwartz (2001), Rogers (2002), García (2003), Boyle, Kolkiewicz and Tan (2003), Haugh and Kogan (2004), Broadie and Glasserman (2004), Andersen and Broadie (2004), Broadie and Cao (2008), Berridge and Schumacher (2008), Belomestny (2011, 2013), Jain and Oosterlee (2015), Lelong (2016)

Our price estimates

for s_0^i = 100, σ_i = 20%, r = 5%, δ_i = 10%, ρ_ij = 0, K = 100, T = 3, N = 9:

# Assets   Point Est.   Comp. Time   95% Conf. Int.       Bin. Tree   Broadie–Cao 95% Conf. Int.
2          13.899       28.7s        [13.880, 13.910]     13.902
3          18.690       28.9s        [18.673, 18.699]     18.69
5          26.159       28.1s        [26.138, 26.174]                 [26.115, 26.164]
10         38.337       30.5s        [38.300, 38.367]
20         51.668       37.5s        [51.549, 51.803]
30         59.659       45.5s        [59.476, 59.872]
50         69.736       59.1s        [69.560, 69.945]
100        83.584       95.9s        [83.357, 83.862]
200        97.612       170.1s       [97.381, 97.889]
500        116.425      493.5s       [116.210, 116.685]

2. Optimally stopping a fractional Brownian motion

A fractional Brownian motion with Hurst parameter H ∈ (0, 1] is a continuous centered Gaussian process (W_t^H)_{t≥0} with covariance structure

Cov(W_s^H, W_t^H) = (1/2) (t^{2H} + s^{2H} − |t − s|^{2H})

- For H = 1/2, W^H is a Brownian motion
- For H > 1/2, W^H has positively correlated increments
- For H < 1/2, W^H has negatively correlated increments

[Figure: sample paths of W^H for H = 0.1, H = 0.5, and H = 0.8]

Problem: sup_{0≤τ≤1} E W_τ^H   (∗)

- denote t_n = n/100, n = 0, 1, 2, ..., 100
- introduce the 100-dimensional Markov process (X_n)_{n=0}^{100} given by

  X_0 = (0, 0, ..., 0)
  X_1 = (W_{t_1}^H, 0, ..., 0)
  X_2 = (W_{t_2}^H, W_{t_1}^H, 0, ..., 0)
  ...
  X_100 = (W_{t_100}^H, W_{t_99}^H, ..., W_{t_1}^H)

The discretized stopping problem

sup_{τ∈T} E g(X_τ) for g(x^1, ..., x^100) = x^1

approximates the continuous-time problem (∗) from below
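Fractional Brownian motion on this grid can be simulated exactly via a Cholesky factorization of its covariance matrix. The sketch below, our own illustration, builds one path and the 100-dimensional Markov state described above:

```python
import numpy as np

H = 0.8
n_steps = 100
t = np.arange(1, n_steps + 1) / n_steps

# Cov(W^H_s, W^H_t) = (t^{2H} + s^{2H} - |t - s|^{2H}) / 2
cov = 0.5 * (t[:, None]**(2 * H) + t[None, :]**(2 * H)
             - np.abs(t[:, None] - t[None, :])**(2 * H))
L = np.linalg.cholesky(cov)

rng = np.random.default_rng(1)
W = L @ rng.standard_normal(n_steps)     # (W^H_{t_1}, ..., W^H_{t_100})

def markov_state(n):
    """X_n = (W^H_{t_n}, W^H_{t_{n-1}}, ..., W^H_{t_1}, 0, ..., 0)."""
    x = np.zeros(n_steps)
    x[:n] = W[:n][::-1]
    return x

g = lambda x: x[0]                       # reward g(x^1, ..., x^100) = x^1
```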

[Figure: estimated values of sup_{0≤τ≤1} E W_τ^H plotted against the Hurst parameter H ∈ (0, 1], comparing our results with the results of Kulikov and Gusyatnikov (2016), which are based on heuristic stopping rules]

Computing a candidate optimal stopping time

Introduce the sequence of auxiliary stopping problems

V_n = sup_{τ∈T_n} E g(τ, X_τ), n = 0, 1, ..., N,

where T_n is the set of all stopping times n ≤ τ ≤ N

Stopping times and stopping decisions

Let f_n, f_{n+1}, ..., f_N : R^d → {0, 1} be measurable functions such that f_N ≡ 1. Then

τ_n = Σ_{m=n}^{N} m f_m(X_m) ∏_{j=n}^{m−1} (1 − f_j(X_j)), with ∏_{j=n}^{n−1} (1 − f_j(X_j)) := 1,

is a stopping time in T_n

Theorem

For a given n ∈ {0, 1, ..., N − 1}, let τ_{n+1} be a stopping time in T_{n+1} of the form

τ_{n+1} = Σ_{m=n+1}^{N} m f_m(X_m) ∏_{j=n+1}^{m−1} (1 − f_j(X_j))

for measurable functions f_{n+1}, ..., f_N : R^d → {0, 1} with f_N ≡ 1.

Then there exists a measurable function f_n : R^d → {0, 1} such that the stopping time

τ_n = n f_n(X_n) + τ_{n+1} (1 − f_n(X_n)) = Σ_{m=n}^{N} m f_m(X_m) ∏_{j=n}^{m−1} (1 − f_j(X_j))

satisfies

E g(τ_n, X_{τ_n}) ≥ V_n − (V_{n+1} − E g(τ_{n+1}, X_{τ_{n+1}}))

Proof: Compare g(n, X_n) to E[g(τ_{n+1}, X_{τ_{n+1}}) | X_0, X_1, ..., X_n] = E[g(τ_{n+1}, X_{τ_{n+1}}) | X_n]
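The sum-product expression for τ_n and the recursion τ_n = n f_n(X_n) + τ_{n+1}(1 − f_n(X_n)) agree pointwise. A quick numerical check, our own sketch with arbitrary threshold decision functions:

```python
import numpy as np

N = 5
rng = np.random.default_rng(2)
X = np.cumsum(rng.standard_normal(N + 1))   # a scalar random-walk path X_0, ..., X_N

def f(n, x):
    # hard decision: stop once the path exceeds 1; f_N is identically 1
    return 1 if (n == N or x > 1.0) else 0

def tau(n, X):
    """Sum-product form: tau_n = sum_m m f_m(X_m) prod_{j<m} (1 - f_j(X_j))."""
    total, alive = 0, 1                     # 'alive' carries the running product
    for m in range(n, N + 1):
        total += m * f(m, X[m]) * alive
        alive *= 1 - f(m, X[m])
    return total

def tau_rec(n, X):
    """Recursive form: tau_n = n f_n(X_n) + tau_{n+1} (1 - f_n(X_n))."""
    if n == N:
        return N
    return n * f(n, X[n]) + tau_rec(n + 1, X) * (1 - f(n, X[n]))
```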

Neural network approximation

Idea: Recursively approximate f_n by a neural network f^θ : R^d → {0, 1} of the form

f^θ = 1_{[0,∞)} ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ,

where

- q_1 and q_2 are positive integers specifying the number of nodes in the two hidden layers
- a_1^θ : R^d → R^{q_1}, a_2^θ : R^{q_1} → R^{q_2} and a_3^θ : R^{q_2} → R are affine functions given by a_i^θ(x) = A_i x + b_i, i = 1, 2, 3
- for j ∈ N, φ_j : R^j → R^j is the component-wise ReLU activation function given by φ_j(x_1, ..., x_j) = (x_1^+, ..., x_j^+)

The components of θ consist of the entries of A_i and b_i, i = 1, 2, 3, so the number of parameters is ≈ d²

More precisely,

- assume parameter values θ_{n+1}, θ_{n+2}, ..., θ_N ∈ R^q have been found such that f^{θ_N} ≡ 1 and the stopping time

  τ_{n+1} = Σ_{m=n+1}^{N} m f^{θ_m}(X_m) ∏_{j=n+1}^{m−1} (1 − f^{θ_j}(X_j))

  produces an expectation E g(τ_{n+1}, X_{τ_{n+1}}) close to the optimal value V_{n+1}
- now try to find a maximizer θ_n ∈ R^q of

  θ ↦ E[g(n, X_n) f^θ(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − f^θ(X_n))]
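The architecture can be written down directly. Below is a minimal sketch, our own, with randomly initialized (untrained) weights and placeholder sizes for d, q_1, q_2:

```python
import numpy as np

# F^theta = psi . a3 . phi . a2 . phi . a1 with component-wise ReLU phi and
# logistic psi; the hard decision f^theta replaces psi by the indicator 1_[0,inf).

d, q1, q2 = 4, 8, 8
rng = np.random.default_rng(3)
A1, b1 = rng.standard_normal((q1, d)), rng.standard_normal(q1)
A2, b2 = rng.standard_normal((q2, q1)), rng.standard_normal(q2)
A3, b3 = rng.standard_normal((1, q2)), rng.standard_normal(1)

relu = lambda z: np.maximum(z, 0.0)           # phi_j(x) = (x_1^+, ..., x_j^+)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # psi(x) = e^x / (1 + e^x)

def pre_activation(x):
    return (A3 @ relu(A2 @ relu(A1 @ x + b1) + b2) + b3)[0]

def F(x):                                     # soft decision, values in (0, 1)
    return sigmoid(pre_activation(x))

def f(x):                                     # hard decision, values in {0, 1}
    return 1 if pre_activation(x) >= 0 else 0
```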

Goal: find an (approximately) optimal θ_n ∈ R^q with a stochastic gradient ascent method

Problem: for x ∈ R^d, the θ-gradient of

f^θ(x) = (1_{[0,∞)} ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ)(x)

is 0 or does not exist

As an intermediate step, consider a neural network F^θ : R^d → (0, 1) of the form

F^θ = ψ ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ for ψ(x) = e^x / (1 + e^x)

Use stochastic gradient ascent to find an approximate optimizer θ_n ∈ R^q of

θ ↦ E[g(n, X_n) F^θ(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − F^θ(X_n))]

Approximate f_n ≈ f^{θ_n} = 1_{[0,∞)} ∘ a_3^{θ_n} ∘ φ_{q_2} ∘ a_2^{θ_n} ∘ φ_{q_1} ∘ a_1^{θ_n}

Repeat the same steps at times n − 1, n − 2, ..., 0

Proposition

Let n ∈ {0, 1, ..., N − 1} and fix a stopping time τ_{n+1} ∈ T_{n+1}. Then, for every constant ε > 0, there exist numbers of hidden nodes q_1 and q_2 such that

sup_{θ∈R^q} E[g(n, X_n) f^θ(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − f^θ(X_n))]
  ≥ sup_{f∈D} E[g(n, X_n) f(X_n) + g(τ_{n+1}, X_{τ_{n+1}})(1 − f(X_n))] − ε,

where D is the set of all measurable functions f : R^d → {0, 1}.

Proof

1. Every measurable set A ⊆ R^d can be approximated in measure by compact sets K ⊆ A
2. 1_K − 1_{K^c} can be approximated by continuous functions k_j
3. k_j can be approximated uniformly on compacts by functions of the form h(x) = Σ_{i=1}^{r} (v_i^T x + c_i)^+ − Σ_{i=1}^{s} (w_i^T x + d_i)^+ (Leshno–Lin–Pinkus–Schocken, 1993)
4. 1_{[0,∞)} ∘ h can be written as a neural network of the form f^θ = 1_{[0,∞)} ∘ a_3^θ ∘ φ_{q_2} ∘ a_2^θ ∘ φ_{q_1} ∘ a_1^θ

Corollary

For a given optimal stopping problem of the form sup_{τ∈T} E g(τ, X_τ) and a constant ε > 0, there exist

- numbers of hidden nodes q_1, q_2 and
- functions f^{θ_0}, f^{θ_1}, ..., f^{θ_N} : R^d → {0, 1} of the form f^{θ_n} = 1_{[0,∞)} ∘ a_3^{θ_n} ∘ φ_{q_2} ∘ a_2^{θ_n} ∘ φ_{q_1} ∘ a_1^{θ_n}

such that f^{θ_N} ≡ 1 and the stopping time

τ^Θ = Σ_{n=0}^{N} n f^{θ_n}(X_n) ∏_{j=0}^{n−1} (1 − f^{θ_j}(X_j))

satisfies E g(τ^Θ, X_{τ^Θ}) ≥ sup_{τ∈T} E g(τ, X_τ) − ε

Training the networks

Let $(x_n^k)_{n=0}^{N}$, $k = 1, 2, \ldots$, be independent simulations of $(X_n)_{n=0}^{N}$.

Let $\theta_{n+1}, \ldots, \theta_N \in \mathbb{R}^q$ be given, and consider the corresponding stopping time
\[
\tau_{n+1} = \sum_{m=n+1}^{N} m\, f^{\theta_m}(X_m) \prod_{j=n+1}^{m-1} \bigl(1 - f^{\theta_j}(X_j)\bigr).
\]

$\tau_{n+1}$ is of the form $\tau_{n+1} = l_{n+1}(X_{n+1}, \ldots, X_{N-1})$ for a measurable function $l_{n+1} : \mathbb{R}^{d(N-n-1)} \to \{n+1, n+2, \ldots, N\}$.

Denote
\[
l_{n+1}^k =
\begin{cases}
N & \text{if } n = N - 1 \\
l_{n+1}\bigl(x_{n+1}^k, \ldots, x_{N-1}^k\bigr) & \text{if } n \le N - 2.
\end{cases}
\]

The realized reward
\[
r_n^k(\theta) = g(n, x_n^k)\, F^{\theta}(x_n^k) + g\bigl(l_{n+1}^k, x^k_{l_{n+1}^k}\bigr)\bigl(1 - F^{\theta}(x_n^k)\bigr)
\]
is continuous and almost everywhere differentiable in $\theta$.
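A sketch of the realized reward for one simulated path, with a toy soft stopping probability $F^{\theta}(x) = \sigma(\theta^T x)$ standing in for the neural network (the linear model and the payoff below are illustrative assumptions, not the architecture from the slides):

```python
import math

def realized_reward(theta, path, n, l_next, g):
    """r_n(theta) = g(n, x_n) F(x_n) + g(l_next, x_{l_next}) (1 - F(x_n)),
    with F(x) = sigmoid(theta . x) as a toy differentiable soft decision."""
    F = 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, path[n]))))
    return g(n, path[n]) * F + g(l_next, path[l_next]) * (1.0 - F)

# toy payoff depending only on the first coordinate of the state
g = lambda n, x: x[0]
r = realized_reward([0.0, 0.0], [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]],
                    n=1, l_next=2, g=g)
# theta = 0 gives F = 1/2, so r averages the two candidate payoffs
```

Because the sigmoid is smooth, $r_n^k$ inherits differentiability in $\theta$, which is what makes gradient-based training possible.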

Stochastic Gradient Ascent

Initialize $\theta_{n,0}$, typically at random (e.g. Xavier initialization).

Standard updating: $\theta_{n,m+1} = \theta_{n,m} + \eta\, \nabla r_n^m(\theta_{n,m})$.

Variants: mini-batches, batch normalization, momentum, Adagrad, RMSProp, AdaDelta, ADAM, decoupled weight decay, warm restarts, ...
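A minimal sketch of the plain update $\theta_{m+1} = \theta_m + \eta\, \nabla r(\theta_m)$ on a toy concave reward $r(\theta) = -(\theta - 3)^2$, with a small noise term mimicking the stochasticity of the simulated gradients (all specifics here are illustrative assumptions):

```python
import random

random.seed(0)

def sga(grad, theta, eta=0.1, steps=200):
    """Stochastic gradient ascent: theta <- theta + eta * grad(theta)."""
    for _ in range(steps):
        theta += eta * grad(theta)
    return theta

# noisy gradient of the toy reward r(theta) = -(theta - 3)^2
noisy_grad = lambda t: -2.0 * (t - 3.0) + 0.01 * random.gauss(0.0, 1.0)
theta_hat = sga(noisy_grad, theta=0.0)   # converges near the maximizer 3
```

The listed variants (momentum, ADAM, etc.) all modify how the gradient estimate enters this update, not the ascent principle itself.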

Lower bound

The candidate optimal stopping time
\[
\tau^{\Theta} = \sum_{n=1}^{N} n\, f^{\theta_n}(X_n) \prod_{j=0}^{n-1} \bigl(1 - f^{\theta_j}(X_j)\bigr)
\]
yields a lower bound $L = \mathbb{E}\, g(\tau^{\Theta}, X_{\tau^{\Theta}})$ for the optimal value $V_0 = \sup_{\tau} \mathbb{E}\, g(\tau, X_\tau)$.

Let $(y_n^k)_{n=0}^{N}$, $k = 1, 2, \ldots, K_L$, be a new set of independent simulations of $(X_n)_{n=0}^{N}$.

$\tau^{\Theta}$ can be written as $\tau^{\Theta} = l(X_0, \ldots, X_{N-1})$ for a measurable function $l : \mathbb{R}^{dN} \to \{0, 1, \ldots, N\}$.

Denote $l^k = l\bigl(y_0^k, \ldots, y_{N-1}^k\bigr)$.

Use the Monte Carlo approximation
\[
\hat{L} = \frac{1}{K_L} \sum_{k=1}^{K_L} g\bigl(l^k, y^k_{l^k}\bigr)
\]
as an estimate for $L$.

Lower confidence bounds

Assume $\mathbb{E}\bigl[g(n, X_n)^2\bigr] < \infty$ for all $n = 0, 1, \ldots, N$.

Consider the sample variance
\[
\hat{\sigma}_L^2 = \frac{1}{K_L - 1} \sum_{k=1}^{K_L} \Bigl( g\bigl(l^k, y^k_{l^k}\bigr) - \hat{L} \Bigr)^2.
\]

By the CLT,
\[
\Bigl[ \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}},\ \infty \Bigr)
\]
is an asymptotically valid $1 - \alpha$ confidence interval for $L$, where $z_\alpha$ is the $1 - \alpha$ quantile of the standard normal distribution. Therefore,
\[
\mathbb{P}\Bigl( V_0 \ge \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}} \Bigr)
\ge \mathbb{P}\Bigl( L \ge \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}} \Bigr) \approx 1 - \alpha.
\]
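The lower confidence bound is computed directly from the simulated rewards $g(l^k, y^k_{l^k})$; a sketch (the four sample rewards are illustrative):

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def lower_confidence_bound(rewards, alpha=0.05):
    """Return (L_hat, L_hat - z_alpha * sigma_hat / sqrt(K_L)),
    where z_alpha is the 1 - alpha standard normal quantile."""
    K = len(rewards)
    L_hat = mean(rewards)
    sigma_hat = stdev(rewards)                 # sample standard deviation
    z_alpha = NormalDist().inv_cdf(1.0 - alpha)
    return L_hat, L_hat - z_alpha * sigma_hat / sqrt(K)

L_hat, bound = lower_confidence_bound([1.0, 2.0, 3.0, 4.0])
```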

Upper bound

Let $(H_n)$ be the Snell envelope of $G_n = g(n, X_n)$, $n = 0, 1, \ldots, N$, with Doob decomposition $H_n = H_0 + M_n^H - A_n^H$.

The following is a variant of the dual formulation of Rogers (2002), Haugh–Kogan (2004) and Andersen–Broadie (2004).

Proposition

For every $(\mathcal{F}_n^X)$-martingale $(M_n)$ with $M_0 = 0$ and estimation errors $(\varepsilon_n)$ satisfying $\mathbb{E}[\varepsilon_n \mid \mathcal{F}_n^X] = 0$, one has
\[
V_0 \le \mathbb{E}\Bigl[ \max_{0 \le n \le N} \bigl( G_n - M_n - \varepsilon_n \bigr) \Bigr].
\]
On the other hand,
\[
V_0 = \mathbb{E}\Bigl[ \max_{0 \le n \le N} \bigl( G_n - M_n^H \bigr) \Bigr].
\]

Estimating a good dual martingale

Approximate $H_n$ by $H_n^{\Theta} = \mathbb{E}\bigl[ G_{\tau_n^{\Theta}} \mid \mathcal{F}_n^X \bigr]$ and $\Delta M_n = H_n - \mathbb{E}\bigl[ H_n \mid \mathcal{F}_{n-1}^X \bigr]$ by
\[
\Delta M_n^{\Theta} = H_n^{\Theta} - \mathbb{E}\bigl[ H_n^{\Theta} \mid \mathcal{F}_{n-1}^X \bigr]
= f^{\theta_n}(X_n)\, G_n + \bigl(1 - f^{\theta_n}(X_n)\bigr) C_n^{\Theta} - C_{n-1}^{\Theta}
\]
for the continuation values $C_n^{\Theta} = \mathbb{E}\bigl[ G_{\tau_{n+1}^{\Theta}} \mid \mathcal{F}_n^X \bigr]$.

Let $(z_n^k)_{n=0}^{N}$, $k = 1, 2, \ldots, K_U$, be a third set of independent simulations of $(X_n)_{n=0}^{N}$.

For all $z_n^k$, simulate $J$ independent continuation paths $\tilde{z}_{n+1}^{k,j}, \ldots, \tilde{z}_N^{k,j}$.

\[
C_n^k = \frac{1}{J} \sum_{j=1}^{J} g\Bigl( \tau_{n+1}^{k,j}, \tilde{z}^{k,j}_{\tau_{n+1}^{k,j}} \Bigr)
\]
can be understood as realizations of $C_n^{\Theta} + \tilde{\varepsilon}_n$.

This gives realizations $M_n^k$ of $M_n^{\Theta} + \varepsilon_n$.
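Once hard decisions $f_n$, rewards $G_n$ and estimated continuation values $C_n$ are available along one simulated path, the (noisy) martingale $M^k$ is accumulated from the increments above; a sketch with made-up numbers (the arrays are illustrative placeholders for simulated quantities):

```python
def dual_martingale(f, G, C):
    """M_0 = 0 and M_n = M_{n-1} + f_n G_n + (1 - f_n) C_n - C_{n-1},
    i.e. the increment Delta M_n from the slide with estimated C."""
    M = [0.0]
    for n in range(1, len(G)):
        M.append(M[-1] + f[n] * G[n] + (1 - f[n]) * C[n] - C[n - 1])
    return M

# one toy path; f_N = 1 forces stopping at the last date
M = dual_martingale(f=[0, 1, 1], G=[1.0, 2.0, 3.0], C=[5.0, 4.0, 3.0])
```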

Estimating an upper bound

\[
U = \mathbb{E}\Bigl[ \max_{0 \le n \le N} \bigl( G_n - M_n^{\Theta} - \varepsilon_n \bigr) \Bigr]
\]
is an upper bound for $V_0$.

Use the Monte Carlo approximation
\[
\hat{U} = \frac{1}{K_U} \sum_{k=1}^{K_U} \max_{0 \le n \le N} \bigl( g(n, z_n^k) - M_n^k \bigr)
\]
as an estimate for $U$.

Our point estimate of $V_0$: $\dfrac{\hat{L} + \hat{U}}{2}$.
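The estimate $\hat{U}$ is a plain average of pathwise maxima; a sketch with two toy paths (the arrays are illustrative):

```python
def upper_bound_estimate(G_paths, M_paths):
    """U_hat = (1/K_U) sum_k max_n (G_n^k - M_n^k)."""
    maxima = [max(g - m for g, m in zip(G, M))
              for G, M in zip(G_paths, M_paths)]
    return sum(maxima) / len(maxima)

U_hat = upper_bound_estimate([[1.0, 2.0], [3.0, 1.0]],
                             [[0.0, 1.0], [1.0, 0.0]])
# per-path maxima are 1 and 2, so U_hat = 1.5
```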

Confidence intervals for V0

By the CLT,
\[
\Bigl( -\infty,\ \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr]
\]
is an asymptotically valid $1 - \alpha$ confidence interval for $U$, where $\hat{\sigma}_U$ is the corresponding sample standard deviation.

One has
\[
\mathbb{P}\Bigl( V_0 \le \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr)
\ge \mathbb{P}\Bigl( U \le \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr) \approx 1 - \alpha.
\]

So
\[
\Bigl[ \hat{L} - z_\alpha \frac{\hat{\sigma}_L}{\sqrt{K_L}},\ \hat{U} + z_\alpha \frac{\hat{\sigma}_U}{\sqrt{K_U}} \Bigr]
\]
is an asymptotically valid $1 - 2\alpha$ confidence interval for $V_0$.

Thank You!