Introduction to Simulation: 1. Ideas and Examples

Pierre L'Ecuyer, Université de Montréal, Canada

Operations Research Seminars, Zinal, June 2016

We focus on general ideas and methods that apply widely, not on specific applications.

2 Systems, models, and simulation

Model: simplified representation of a system. Modeling: building a model, based on data and other information. Simulation program: computerized (runnable) representation of the model. Some environments do not distinguish modeling and programming (no explicit programming). Simulation: running the simulation program and observing the results. Much more convenient and powerful than experimenting with the real system.

Deterministic vs stochastic model. Static vs dynamic model.


3 Analytic models vs simulation models

Analytic model: provides a simple formula. Usually hard to derive and requires unrealistic simplifying assumptions, but easy to apply and insightful. Examples: the Erlang formulas in queueing theory, the Black-Scholes formula for option pricing in finance, etc.

Example: M/M/s queue, arrival rate λ, service rate µ, s > λ/µ servers, steady-state. Let W be the waiting time of a random customer. There are exact expressions for E[W], P[W = 0], P[W > x], etc. Conditionally on W > 0, W is exponential with mean 1/(sµ − λ).

Simulation model: can be much more detailed and realistic. Requires reliable data and information.

4

[Diagram: real-life system → (modeling, data) → conceptual model → (programming) → computerized model, with verification and validation at each step, supporting decision making.]

5 Examples: A stochastic activity network

Gives precedence relations between activities. Activity k has random duration Yk (also the length of arc k) with known cumulative distribution function (cdf) Fk(y) := P[Yk ≤ y]. Project duration T = (random) length of the longest path from source to sink.

[Network diagram: nodes 0 (source) to 8 (sink); arc k has random length Yk, for k = 0, ..., 12.]

6 Simulation

Repeat n times: generate Yk with cdf Fk for each k, then compute T . From those n realizations of T , we can estimate E[T ], P[T > x] for some x, and compute confidence intervals on those values. Better: construct a histogram to estimate the distribution of T .
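As a sketch, the procedure above takes only a few lines of code. The arc structure of the slides' 9-node network lives in the figure, so the path list and mean durations below are purely illustrative assumptions (here all arcs are exponential), not the slides' data:

```python
import random

# Hypothetical example network: three source-to-sink paths over five arcs.
PATHS = [[0, 1], [0, 2, 3], [4, 3]]    # arc indices along each path
MEANS = [13.0, 5.5, 7.0, 5.2, 16.5]    # mean duration of each arc

def simulate_T(rng):
    """One realization of the project duration T = length of the longest path."""
    Y = [rng.expovariate(1.0 / mu) for mu in MEANS]   # Y_k ~ Expon with mean mu_k
    return max(sum(Y[k] for k in path) for path in PATHS)

def estimate(n, x, seed=12345):
    """Estimate E[T] and P[T > x] from n independent runs."""
    rng = random.Random(seed)
    ts = [simulate_T(rng) for _ in range(n)]
    return sum(ts) / n, sum(t > x for t in ts) / n
```

Replacing the path list and distributions with those of a real project gives the histogram-based analysis described above.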

Numerical illustration: Yk ∼ N(µk, σk²) for k = 1, 2, 4, 11, 12, and Yk ∼ Expon(1/µk) otherwise.

µ0, ..., µ12: 13.0, 5.5, 7.0, 5.2, 16.5, 14.7, 10.3, 6.0, 4.0, 20.0, 3.2, 3.2, 16.5. We may pay a penalty if T > 90, for example.


7 Naive idea: replace each Yk by its expectation. This gives T = 48.2.

Results of an experiment with n = 100 000. A histogram of the values of T gives much more information than a confidence interval on E[T] or P[T > x] or ξ̂0.99. Values range from 14.4 to 268.6; 11.57% exceed x = 90.

[Histogram of the n values of T, frequency ×10³ over 0 to 200: mean = 64.2; deterministic value T = 48.2; x = 90; ξ̂0.99 = 131.8.]

8 Collisions in a hashing system

In applied probability and statistics, Monte Carlo simulation is often used to estimate a distribution that is too hard to compute exactly.

Example: We throw M points randomly and independently into k squares. C = number of collisions (when a point falls in square already occupied). What is the distribution of C?

Example: k = 100, M = 10, C = 3.

[Figure: 10 points thrown into a 10 × 10 grid of squares; 3 collisions.]

9

One application: hashing. Large set of identifiers Φ, e.g., 64-bit numbers, of which only a few are used. Hashing function h : Φ → {1, ..., k} (the set of physical addresses), with k ≪ |Φ|.

Another application: testing random number generators.

Simplified analytic model: We consider two cases: M = m (constant) and M ∼ Poisson(m). Theorem: If k → ∞ and m²/(2k) → λ (a constant), then C ⇒ Poisson(λ). That is, in the limit, P[C = x] = e^(−λ) λ^x / x! for x = 0, 1, 2, ....

But how good is this approximation for finite k and m?

10

Simulation: Generate M indep. integers uniformly over {1,..., k} and compute C. Repeat n times independently and compute the empirical distribution of the n realizations of C. Can also estimate E[C], P[C > x], etc.
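This simulation is easy to code directly; a minimal sketch using only the standard library (fixed M = m only, since Python's `random` module has no Poisson generator):

```python
import random

def collisions(m, k, rng):
    """Throw m points uniformly into k squares; return the number of collisions C."""
    occupied = set()
    c = 0
    for _ in range(m):
        square = rng.randrange(k)
        if square in occupied:
            c += 1          # point falls in an already occupied square
        else:
            occupied.add(square)
    return c

def empirical_mean(n, m, k, seed=12345):
    """Average of n independent realizations of C (fixed M = m)."""
    rng = random.Random(seed)
    return sum(collisions(m, k, rng) for _ in range(n)) / n
```

With k = 10 000 and m = 400, the empirical mean should come out near the 7.87 reported on the next slide (slightly below the Poisson approximation λ = 8).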


Illustration: k = 10 000 and m = 400. Analytic approximation: C ∼ Poisson with mean λ = m²/(2k) = 8. We simulated n = 10⁷ replications (or runs). Empirical mean and variance: 7.87 and 7.47 for M = m = 400 (fixed); 7.89 and 8.10 for M ∼ Poisson(400).

11

c      M fixed    M ∼ Poisson   Poisson distrib.
0         3181         4168        3354.6
1        25637        32257       26837.0
2       105622       122674      107348.0
3       288155       316532      286261.4
4       587346       614404      572522.8
5       948381       957951      916036.6
6      1269427      1247447     1221382.1
7      1445871      1397980     1395865.3
8      1434562      1377268     1395865.3
9      1251462      1207289     1240769.1
10      978074       958058      992615.3
11      688806       692416      721902.0
12      442950       459198      481268.0
13      260128       282562      296164.9
14      141467       162531      169237.1
15       71443        86823       90259.7
16       33224        43602       45129.9
17       14739        20827       21237.6
18        5931         9412        9438.9
19        2378         3985        3974.3
20         791         1629        1589.7
21         310          592         605.6
22          79          264         220.2
23          26           83          76.6
24           5           33          25.5
≥25          5           15          11.7

12

What if we take smaller values? k = 100 and m = 40. We still have λ = m²/(2k) = 8. We simulated n = 10⁷ replications (or runs). Empirical mean, 95% confidence interval, variance: 6.90, (6.896, 6.899), 4.10, for M = m = 40 (fixed); 7.03, (7.032, 7.035), 8.48, for M ∼ Poisson(40).

13

c      M fixed    M ∼ Poisson   Poisson distrib.
0         1148        17631        3354.6
1        14355       100100       26837.0
2        84046       294210      107348.0
3       302620       600700      286261.4
4       744407       951184      572522.8
5      1340719      1238485      916036.6
6      1828860      1379400     1221382.1
7      1941207      1353831     1395865.3
8      1634465      1194915     1395865.3
9      1103416       956651     1240769.1
10      602186       705478      992615.3
11      267542       485353      721902.0
12       97047       309633      481268.0
13       29161       187849      296164.9
14        7195       107189      169237.1
15        1363        58727       90259.8
16         224        30450       45129.9
17          35        15115       21237.6
18           4         7271        9438.9
19           0         3311        3974.3
20           0         1468        1589.7
21           0          607         605.6
22           0          258         220.2
23           0          110          76.6
24           0           44          25.5
≥25          0           30          11.7

14

k = 25 and m = 20. Again, λ = m²/(2k) = 8. We make n = 10⁷ runs. Empirical mean, variance: 6.05, 2.16 for M = m = 20 (fixed); 6.23, 8.21 for M ∼ Poisson(20).

15

c      M fixed    M ∼ Poisson   Poisson distrib.
0          135        49506        3354.6
1         4565       220731       26837.0
2        52910       528292      107348.0
3       314067       905283      286261.4
4      1050787      1224064      572522.9
5      2130621      1400387      916036.6
6      2692369      1392858     1221382.1
7      2170849      1237244     1395865.3
8      1124606       997974     1395865.3
9       371366       743393     1240769.2
10       77190       513360      992615.3
11        9768       332413      721902.0
12         738       203692      481268.0
13          29       118015      296164.9
14           0        65363      169237.1
15           0        34266       90259.8
16           0        17384       45129.9
17           0         8600       21237.6
18           0         3984        9438.9
19           0         1861        3974.2
20           0          768        1589.7
21           0          345         605.6
22           0          128         220.2
23           0           56          76.6
24           0           15          25.5
≥25          0           18          11.7


16 A tandem queue.

[Diagram: m single-server FIFO queues in series: queue 1 → server 1 → queue 2 → server 2 → ··· → queue m → server m.]

Station j: single-server FIFO queue with total capacity cj (c1 = ∞). A customer cannot leave server j when queue j + 1 is full; it is blocked.

Ti = arrival time of customer i at the first queue, for i ≥ 1, and T0 = 0;
Ai = Ti − Ti−1 = time between arrivals i − 1 and i;
Wj,i = waiting time in queue j for customer i;
Sj,i = service time at station j for customer i;
Bj,i = blocked time at station j for customer i;
Dj,i = departure time from station j for customer i.


The first customer arrives at time T1 = A1, leaves station 1 at D1,1 = T1 + S1,1 and starts its service at station 2, leaves station 2 at time D2,1 = D1,1 + S2,1 to join queue 3, etc. Customer 2 arrives at time T2 = T1 + A2, starts service at queue 1 at max(T2, D1,1), and leaves station 1 at time max(T2, D1,1) + S1,2 if space is available at station 2, etc.


17 Recurrence equations. Suppose we can generate the successive interarrival times Ai and the service times Sj,i. Then we can compute the Ti's easily and the departure times Dj,i via:

Dj,i = max[Dj−1,i + Sj,i , Dj,i−1 + Sj,i , Dj+1,i−cj+1 ] for 1 ≤ j ≤ m and i ≥ 1, where D0,i = Ti , Dj,i = 0 for i ≤ 0, and Dm+1,i = 0 for all i. Then:

Wj,i = max[0, Dj,i−1 − Dj−1,i ],

Bj,i = Dj,i − Dj−1,i − Wj,i − Sj,i .

For infinite buffer sizes (no blocking), we always have Bj,i = 0 and

Dj,i = max[Dj−1,i , Dj,i−1] + Sj,i .

A single queue with no blocking: the GI/GI/1 model obeys the Lindley recurrence:

W1,i = max[0, D1,i−1 − Ti] = max[0, W1,i−1 + S1,i−1 − Ai].
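The Lindley recurrence translates directly into code; a minimal sketch, indexing customers from 0 so that A[i] is the interarrival time between customers i−1 and i (A[0] is unused, and the first customer waits 0):

```python
def lindley_waits(A, S):
    """Waiting times for a single FIFO queue with no blocking, via
    W[i] = max(0, W[i-1] + S[i-1] - A[i])."""
    W = [0.0] * len(S)
    for i in range(1, len(S)):
        W[i] = max(0.0, W[i - 1] + S[i - 1] - A[i])
    return W
```

For example, with interarrival times of 1 and service times of 3, each customer waits 2 time units longer than the previous one.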


18 Algorithm: Simulating Nc customers with production blocking.

Let T0 = 0 and D0,0 = D1,0 = 0;
For j = 2, ..., m, for i = −cj + 1, ..., 0, let Dj,i = 0;
For i = 1, ..., Nc:
    Generate Ai from its distribution and let D0,i = Ti = Ti−1 + Ai;
    Let Wi = Bi = 0;
    For j = 1, ..., m:
        Generate Sj,i from its distribution;
        Let Dj,i = max[Dj−1,i + Sj,i, Dj,i−1 + Sj,i, Dj+1,i−cj+1];
        Let Wj,i = max[0, Dj,i−1 − Dj−1,i] and Wi = Wi + Wj,i;
        Let Bj,i = Dj,i − Dj−1,i − Wj,i − Sj,i and Bi = Bi + Bj,i;
Compute and return the averages W̄Nc = (W1 + ··· + WNc)/Nc and B̄Nc = (B1 + ··· + BNc)/Nc.

What if we want to simulate all customers who arrive before time T? Replace "For i = 1, ..., Nc" by "For (i = 1; Ti < T; i++)".


What we just saw is production blocking. In communication blocking, service at station j starts only when queue j + 1 is not full. Then:

Dj,i = max[Dj−1,i, Dj,i−1, Dj+1,i−cj+1] + Sj,i.

When things get more complicated: discrete-event simulation.


19 Example: Pricing a financial derivative. Consider a financial contract based on a single asset (e.g., one share of a stock), whose market price at time t is S(t). We assume {S(t), t ≥ 0} is a stochastic process with known probability law (in practice, estimated from data).

Suppose the contract owner receives a net payoff g(S(t1), ..., S(td)) at time T, for some function g : R^d → R, where t1, ..., td are fixed observation times, with 0 ≤ t1 < ··· < td = T. Suppose also that money left in the bank account (or borrowed, when negative) yields interest at compounded rate r (the short rate). This means that one dollar placed in the account at time 0 is worth e^(rt) dollars at time t. Equivalently, an amount Y to be received in t units of time is worth e^(−rt) Y today. That is, it must be multiplied by the discount factor e^(−rt).


Our aim is to estimate the fair (present) value of the financial contract. To do this, it is common to assume a no-arbitrage market, which means that one cannot make more money than with the interest rate r without taking risks. That is, if there exists an investment strategy with which one dollar at time 0 can be worth more than e^(rt) dollars at time t with positive probability, then there must be a positive probability that it is worth less than e^(rt).


20 Under this no-arbitrage assumption, it can be proved (this is beyond the scope of this course) that the present value (or fair price) of the financial contract (at time 0), when S(0) = s0, can be written as

v(s0, T) = E∗[e^(−rT) g(S(t1), ..., S(td))],

where E∗ is the mathematical expectation under a certain probability measure P∗ called the risk-neutral measure. Changing the probability measure amounts to a change of variable to account for our utility function.

Under this measure P∗, the process {e^(−rt) S(t), t ≥ 0} is a martingale, which means that for any t ≥ 0 and 0 < δ ≤ T − t, we have

E∗[e^(−rδ) S(t + δ)] = S(t).

The measure P∗ generally differs from the true measure under which the process {S(t), t ≥ 0} evolves in real life. Note that a risk-neutral measure does not always exist, and is not always unique. Here, we assume that it does exist and is known (or has been estimated).


21

Except for very simple models, we do not know how to compute v(s0, T) exactly. But we can estimate it by Monte Carlo if we know how to generate a path S(t1), ..., S(td) under P∗ and how to compute g. We simulate n independent copies of X = e^(−rT) g(S(t1), ..., S(td)), say X1, ..., Xn, and our estimator is the average X̄n = (X1 + ··· + Xn)/n.

What are the appropriate models for S? In the popular model of Black and Scholes (1973), S(t) is assumed to evolve as a geometric Brownian motion (GBM), which means that its logarithm is a Brownian motion. This implies that under P∗, one has

S(t) = S(0) e^((r − σ²/2)t + σB(t)),

where r is the short rate, σ is a constant called the volatility, and B(·) is a standard Brownian motion: for any t2 > t1 ≥ 0, B(t2) − B(t1) is a normal random variable with mean 0 and variance t2 − t1, independent of the behavior of B(·) outside the interval [t1, t2].

22

A European call option gives the right to buy one unit of the asset at price K (the strike price) at time T (the expiration date). The net payoff at time T is

g(S(T )) = max[0, S(T ) − K].

In this simple case, under the GBM model, the Black-Scholes formula gives:

v(s0, T) = s0 Φ(−z0 + σ√T) − K e^(−rT) Φ(−z0), where

z0 = [ln(K/s0) − (r − σ²/2)T] / (σ√T)

and Φ is the standard normal cdf.

23 When g or S is more complicated, there is usually no analytic formula for v(s0, T). For a discretely-monitored Asian call option, for example, we have

 d  1 X g(S(t ),..., S(t )) = max 0, S(t ) − K , 1 d  d j  j=1 and no closed-form formula is available for v(s0, T ). We can estimate it by simulation, as follows (for a general g): Algorithm: Option pricing under a GBM model by Monte Carlo For i = 1,..., n, Let t0 = 0 and B(t0) = 0; For j = 1,..., d, Generate Z ∼ N(0, 1); j √ Let B(tj ) = B(tj−1) + tj − tj−1Zj ;  2  Let S(tj ) = s0 exp (r − σ /2)tj + σB(tj ) ; −rT Compute Xi = e g(S(t1),..., S(td )); Compute and return the average X¯n as an estimator of v(s0, T ), with a c.i.; Optional: Plot a histogram of X1,...,DraftXn. 24

Numerical illustration: discretely-monitored Asian option with d = 12, T = 1 (one year), tj = j/12 for j = 0, ..., 12, K = 100, s0 = 100, r = 0.05, σ = 0.5. We perform n = 10⁶ independent simulation runs. Mean: 13.1. Max = 390.8. In 53.47% of cases, the payoff is 0. Histogram of the 46.53% positive values:

[Histogram of the positive payoffs, frequency ×10³, over the range 0 to 150.]

25 Other models

Better models for S: variance-gamma, Heston, etc.

Several types of payoff functions g.

26 Improving the accuracy when estimating the mean

If we replace the arithmetic average by a geometric average, we obtain

 d  −rT Y 1/d C = e max 0, (S(tj )) − K , j=1

whose expectation ν = E[C] has a closed-form formula. One can then use C as a control variate to estimate E[X]: replace the estimator X by the "corrected" version X − β(C − ν), for some well-chosen β. This can provide a huge variance reduction, by factors of over a million in some examples.

27 American options

Here the option owner can decide when to exercise the option. We now have a dynamic stochastic (stopping-time) optimization problem. Solution methods involve variants of approximate dynamic programming combined with simulation. Interesting and challenging. In some cases (e.g., bonds with options), both the issuer and the owner can decide dynamically when to exercise the option.

28 Discrete choice with multinomial mixed logit probability, maximum likelihood estimation

We want to estimate a model of choices between alternatives, e.g., when users buy a product, choose a route or a transportation mode, choose a restaurant, etc.

Utility of alternative j for individual q is

Uq,j = Σ_{ℓ=1}^s βq,ℓ xq,j,ℓ + εq,j = βq^t xq,j + εq,j, where

βq = (βq,1, ..., βq,s)^t gives the tastes of individual q,
xq,j = (xq,j,1, ..., xq,j,s)^t gives the attributes of alternative j for individual q,
εq,j is noise: Gumbel with mean 0 and scale parameter λ = 1.

Individual q selects the alternative yq with the largest utility Uq,j. We can observe the xq,j and the choices yq, but not the rest.


29

Logit model: for βq fixed, j is chosen with probability

Lq(j | βq) = exp[βq^t xq,j] / Σ_{a∈A(q)} exp[βq^t xq,a],

where A(q) is the set of alternatives available to q.
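For a fixed βq this probability is straightforward to evaluate; a small sketch, in which x maps each available alternative in A(q) to its attribute vector (the max-shift is a standard guard against overflow in the exponentials):

```python
import math

def logit_prob(beta, x, j):
    """L_q(j | beta): softmax of the linear utilities beta^t x_a over a in A(q)."""
    utils = {a: sum(b * v for b, v in zip(beta, xa)) for a, xa in x.items()}
    mx = max(utils.values())                     # stabilize the exponentials
    ex = {a: math.exp(u - mx) for a, u in utils.items()}
    return ex[j] / sum(ex.values())
```

The probabilities sum to 1 over A(q), and alternatives with higher utility get higher probability.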


Mixed logit. For a random individual, suppose βq is random with density fθ, which depends on an (unknown) parameter vector θ. We want to estimate θ from the data (the xq,j and yq). Given θ, the unconditional probability of choosing j is

pq(j, θ) = ∫ Lq(j | β) fθ(β) dβ.

It depends on A(q), j, and θ.


30 Maximum likelihood: Maximize the log of the joint probability of the sample, w.r.t. θ:

ln L(θ) = ln ∏_{q=1}^m pq(yq, θ) = Σ_{q=1}^m ln pq(yq, θ).

There is no formula for pq(j, θ), but we can estimate it by simulation: generate n realizations of β from fθ, say βq^(1)(θ), ..., βq^(n)(θ), and estimate pq(yq, θ) by

p̂q(yq, θ) = (1/n) Σ_{i=1}^n Lq(yq | βq^(i)(θ)).

Then we can find the maximizer θ̂ of Σ_{q=1}^m ln p̂q(yq, θ) w.r.t. θ. The latter is a deterministic nonlinear nonconvex (difficult) optimization problem. It is also biased! Simulation is usually combined with bias reduction and variance reduction methods to improve the accuracy.

31 Monte Carlo to estimate an expectation

Suppose we want to estimate the mathematical expectation (the mean) of some random variable X: µ = E[X]. Monte Carlo estimator:

X̄n = (X1 + ··· + Xn)/n,

where X1, ..., Xn are independent replicates of X. We have E[X̄n] = µ and Var[X̄n] = σ²/n = Var[X]/n.


32 Convergence Theorem. Suppose σ² < ∞. When n → ∞:

(i) Strong law of large numbers: lim_{n→∞} µ̂n = µ with probability 1.

(ii) Central limit theorem (CLT): √n (µ̂n − µ)/σ ⇒ N(0, 1), i.e.,

lim_{n→∞} P[√n (µ̂n − µ)/σ ≤ x] = Φ(x) = P[Z ≤ x]

for all x ∈ R, where Z ∼ N(0, 1) and Φ(·) is its cdf. Property (ii) still holds if we replace σ² by its unbiased estimator

Sn² = (1/(n − 1)) Σ_{i=1}^n (Xi − X̄n)² = (1/(n − 1)) [Σ_{i=1}^n Xi² − n X̄n²],

where Xi = f(Ui). We have √n (µ̂n − µ)/Sn ⇒ N(0, 1).

33

For large n and confidence level 1 − α, we have

P[µ̂n − µ ≤ x Sn/√n] = P[√n (µ̂n − µ)/Sn ≤ x] ≈ Φ(x).

Confidence interval at level 1 − α (we want Φ(x) = 1 − α/2):

(µ̂n ± zα/2 Sn/√n), where zα/2 = Φ⁻¹(1 − α/2).

Example: zα/2 ≈ 1.96 for α = 0.05.

[Figure: standard normal density; central area 1 − α between −zα/2 and zα/2, area α/2 in each tail.]

The width of the confidence interval is asymptotically proportional to σ/√n, so it converges as O(n^{−1/2}). Relative error: σ/(µ√n). We want a small σ. To add one more decimal digit of accuracy, we must multiply n by 100.

Warning: If the Xi have an asymmetric law, these confidence intervals can have very bad coverage (convergence can be very slow).
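As an illustration (my own sketch, not from the slides), the estimator X̄n, the variance estimator Sn², and the CLT-based interval take a few lines of Python; the integrand with E[U²] = 1/3 and the sample size are arbitrary choices:

```python
import math
import random

def mc_confidence_interval(sample_x, n, z=1.96):
    """Crude Monte Carlo: sample mean, unbiased variance S_n^2, CLT-based CI."""
    xs = [sample_x() for _ in range(n)]
    mean = sum(xs) / n
    s2 = sum((x - mean) ** 2 for x in xs) / (n - 1)   # S_n^2
    half = z * math.sqrt(s2 / n)                      # z_{alpha/2} * S_n / sqrt(n)
    return mean, (mean - half, mean + half)

# Estimate mu = E[U^2] = 1/3 for U ~ U(0, 1).
random.seed(42)
mean, ci = mc_confidence_interval(lambda: random.random() ** 2, n=100_000)
```

The interval width shrinks as O(n^{−1/2}), exactly as stated above.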

34

Example: Stochastic activity network. For µ = P[T > x], we have

Xi = I[Ti > x],
µ̂ = X̄n = (1/n) Σ_{i=1}^n Xi = Y/n,
Sn² = (1/(n−1)) Σ_{i=1}^n (Xi − Y/n)² = Y(1 − Y/n)/(n − 1).

Confidence interval at 95%: (Y/n ± 1.96 Sn/√n). Reasonable if µ is not too close to 0 or 1, i.e., if E[Y] = nµ is far from both 0 and n. In fact, Y ∼ Binomial(n, µ).

35

Suppose n = 1000 and we observe Y = 882. Then:
X̄n = 882/1000 = 0.882;
Sn² = X̄n(1 − X̄n) n/(n − 1) ≈ 0.1042.

This gives the 95% confidence interval:

(X̄n ± 1.96 Sn/√n) ≈ (0.882 ± 0.020) = (0.862, 0.902).

Thus, our estimator of µ has two significant digits: µ ≈ 0.88. The “2” in 0.882 is not significant.

36 Simulation efficiency

We define the efficiency of the estimator by

Eff(X) = 1/(c(X) · Var[X]),

where c(X) is the (expected) computing cost of X. This measure does not depend on the computing budget:

Eff(X̄n) = 1/(n c(X) · Var[X]/n) = Eff(X).

With a 10-fold efficiency improvement, we need a computing budget 10 times smaller for the same accuracy.

In the presence of bias β = E[X] − µ, we define

Eff(X) = 1/(c(X) · MSE(X)) = 1/(c(X) · (Var[X] + β²)).

In this case, Eff(X̄n) depends on n.

37 Other sources of error

The confidence intervals based on the previous CLTs only take the simulation error (or noise) into account. They ignore the approximations made in building the model and the errors made in estimating the model parameters. Ideally, confidence intervals should take all sources of error into account.

38 CI for a function of several means

Want a CI on g(µ) where µ = (µ1, ..., µd) and g is smooth but generally nonlinear. We estimate µj by Yjn.

If Yn = (Y1n, ..., Ydn) → µ = (µ1, ..., µd) and g : R^d → R is continuous, then g(Yn) → g(µ).

Delta theorem. Suppose g : R^d → R is continuously differentiable in a neighborhood of µ, with gradient ∇g. If r(n)(Yn − µ) ⇒ Y when n → ∞, then

r(n)(g(Yn) − g(µ)) ⇒ (∇g(µ))ᵗ Y when n → ∞.

Idea of the proof: Taylor series approximation.

39

Corollary. If √n (Yn − µ) ⇒ Y ∼ N(0, Σy) when n → ∞ (e.g., if Yn is an average and obeys the CLT), then we have the CLT:

√n (g(Yn) − g(µ))/σg ⇒ N(0, 1) when n → ∞,

where

σg² = Var[(∇g(µ))ᵗ Y] = E[((∇g(µ))ᵗ Y)²] = (∇g(µ))ᵗ E[Y Yᵗ] ∇g(µ) = (∇g(µ))ᵗ Σy ∇g(µ).

Examples: (a) g(µ) = ln µ where µ = µ1 > 0; (b) g(µ1, µ2) = µ2/µ1; (c) Var[X] = g(µ1, µ2) = µ2 − µ1² where µk = E[X^k]; (d) Cov[X, Y] = g(µ1, µ2, µ3) where µ1 = E[X], µ2 = E[Y], µ3 = E[XY].

40

Example. (Ratio of expectations.) Let (X1, Y1), ..., (Xn, Yn) be i.i.d. replicates of (X, Y), and suppose we want to estimate ν = E[X]/E[Y] by

ν̂n = X̄n/Ȳn = Σ_{i=1}^n Xi / Σ_{i=1}^n Yi.

This is a biased but strongly consistent estimator of ν.

Let µ1 = E[X], µ2 = E[Y], g(µ1, µ2) = µ1/µ2, σ1² = Var[X], σ2² = Var[Y], and σ12 = Cov[X, Y]. The gradient of g is ∇g(µ1, µ2) = (1/µ2, −µ1/µ2²)ᵗ.

41

By applying the Delta theorem, we obtain

√n (ν̂n − ν) ⇒ (W1, W2) · ∇g(µ1, µ2) = W1/µ2 − W2 µ1/µ2² ∼ N(0, σg²),

where

σg² = (∇g(µ))ᵗ Σy ∇g(µ) = (σ1² + σ2² ν² − 2σ12 ν)/µ2²

can be replaced by its estimator

σ̂g,n² = (σ̂1² + σ̂2² ν̂n² − 2σ̂12 ν̂n)/(Ȳn)²,

in which

σ̂1² = (1/(n−1)) Σ_{j=1}^n (Xj − X̄n)²,
σ̂2² = (1/(n−1)) Σ_{j=1}^n (Yj − Ȳn)²,
σ̂12 = (1/(n−1)) Σ_{j=1}^n (Xj − X̄n)(Yj − Ȳn).

42 Higher order

To reduce the bias, one can also add another term of the Taylor series, if g is twice continuously differentiable:

g(Yn) − g(µ) = (∇g(µ))ᵗ(Yn − µ) + (Yn − µ)ᵗ H(µ)(Yn − µ)/2 + o(‖Yn − µ‖²),

where H is the Hessian matrix of g at µ. But this can increase the variance of the CI.

43 Quantile estimation

If X has cdf F, the q-quantile of F is

ξq = F⁻¹(q) = inf{x : F(x) ≥ q}.

[Figure: cdf x ↦ P[X ≤ x]; the quantile ξq is the point where the cdf reaches level q.]

44

Let X(1), ..., X(n) be an i.i.d. sample of X, sorted, and F̂n the empirical cdf. A simple estimator of ξq is the empirical quantile

ξ̂q,n = F̂n⁻¹(q) = inf{x : F̂n(x) ≥ q} = X(⌈nq⌉).

Theorem.
(i) For each q, ξ̂q,n → ξq w.p.1 when n → ∞ (strongly consistent).
(ii) If X has a density f > 0 continuous in a neighborhood of ξq, then (CLT):

√n (ξ̂q,n − ξq) f(ξq)/√(q(1 − q)) ⇒ N(0, 1) when n → ∞.

This implies Var[ξ̂q,n] ≈ q(1 − q)/(n f(ξq)²).

This CLT shows that this estimator has large variance if f(ξq) is small. To use this CI, we need an estimate of f(ξq): difficult.
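A minimal sketch of the empirical quantile (assuming nothing beyond the definition above; the sample is an arbitrary toy example):

```python
import math

def empirical_quantile(xs, q):
    """Empirical q-quantile: the ceil(n*q)-th order statistic X_(ceil(nq))."""
    xs = sorted(xs)
    n = len(xs)
    return xs[math.ceil(n * q) - 1]   # 1-based order statistic -> 0-based index

# Toy sample 1, 2, ..., 10: the 0.5-quantile is X_(5) = 5.
sample = list(range(1, 11))
median = empirical_quantile(sample, 0.5)
```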

45 Non-asymptotic exact method for a CI on ξq:

Suppose F is continuous at ξq. Let B be the number of observations X(i) smaller than ξq. Since P[X < ξq] = q, B is Binomial(n, q). If 1 ≤ j < k ≤ n, we have

P[X(j) < ξq ≤ X(k)] = P[j ≤ B < k] = Σ_{i=j}^{k−1} (n choose i) q^i (1 − q)^{n−i}.

We choose j and k for this sum to be ≥ 1 − α. (Can be a one-sided or two-sided interval.)

If n is large and q not too close to 0 or 1, we can approximate the binomial by a normal: (B − nq)/√(nq(1 − q)) ≈ N(0, 1). This gives j = ⌊nq + 1 − δ⌋ and k = ⌊nq + 1 + δ⌋, where δ = √(nq(1 − q)) Φ⁻¹(1 − α/2).

To improve the estimator, one can replace F̂n by a smoother function (e.g., an interpolation, a kernel density estimator, etc.). In all cases, the estimator is biased, but this can reduce the bias.

46 Example: Value at Risk (VaR).

Let L be the net loss of an investment portfolio over a given time period. The VaR is the value xq such that P[L > xq] = q. It is the (1 − q)-quantile of L.

Common values: q = 0.01, period = 2 weeks (banks), or months or years (insurance, pension plans).

Use of the VaR can be criticized, since it provides limited information. For example, if x0.01 = 10⁷ dollars, what do we know? A complementary measure is the conditional VaR (CVaR):

E[L | L > xq].

Better: find and display the distribution of L conditional on L > xq.

Models for VaR estimation: often thousands of dependent assets. Asset prices (or their logs) can be defined as linear combinations of factors, whose changes are modeled via a copula (e.g., Student copula) and normal or lognormal marginals. Except for very simple cases, the VaR cannot be computed exactly. It can be estimated by simulation. For very small q: rare-event simulation.

47 Estimating the cdf by the empirical cdf

With simulation, we can estimate not only E[X], but the entire distribution of X. It is characterized by its cdf, defined by F(x) = P[X ≤ x] for all x ∈ R. The empirical cdf

F̂n(x) = (1/n) Σ_{i=1}^n I[xi ≤ x]

jumps by 1/n at each observation xi. Easy to compute and plot after sorting the observations x1, ..., xn into x(1), ..., x(n).

[Figure: step function F̂n rising from 0 to 1, jumping by 1/n at each sorted observation.]

48

If x1, ..., xn are i.i.d. random variables with cdf F, then

Dn = sup_{−∞ < x < ∞} |F̂n(x) − F(x)| → 0 w.p.1 when n → ∞,

and Dn = O_P(n^{−1/2}).
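A minimal sketch of the empirical cdf (the sample values are arbitrary):

```python
import bisect

def empirical_cdf(xs):
    """Return a function x -> F̂_n(x) = (1/n) * #{i : x_i <= x}."""
    sorted_xs = sorted(xs)
    n = len(sorted_xs)
    return lambda x: bisect.bisect_right(sorted_xs, x) / n

Fhat = empirical_cdf([3.0, 1.0, 2.0, 4.0])
```

Sorting once makes each evaluation a binary search, matching the remark that F̂n is easy to compute after sorting.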

49 Estimating the density of X

To estimate the density f of X, a first idea could be to take the derivative (the slope) of a piecewise-linear interpolation F̃n of F̂n.

[Figure: piecewise-linear interpolation F̃n of the empirical cdf over the sorted observations.]

This gives the piecewise-constant empirical density:

f̃n(x) = 1/((n − 1)(x(i+1) − x(i))) for x(i) ≤ x ≤ x(i+1).

Between two observations very close to each other, this density is very large! It is irregular and does not converge to f when n → ∞.

50

Example. Let f be the U(0, 1) density and let

dn = min_{1 ≤ i < n} (x(i+1) − x(i)).

One can show that n(n − 1) dn ⇒ Expon(1) when n → ∞. Then

P[max_{0 ≤ x ≤ 1} f̃n(x) > y] = P[1/((n − 1) dn) > y] = P[n(n − 1) dn < n/y] ≈ 1 − e^{−n/y}.

For fixed y, this converges to 1 exponentially fast when n → ∞. Thus, the density f̃n will have very large peaks when n is large!
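A small numeric illustration of these peaks (my own sketch; sample sizes and seed are arbitrary):

```python
import random

# The maximal height of the piecewise-constant "empirical density" is
# 1/((n-1) d_n), which blows up as n grows, as stated above.
def max_empirical_density(n, seed=7):
    rng = random.Random(seed)
    xs = sorted(rng.random() for _ in range(n))
    d_n = min(b - a for a, b in zip(xs, xs[1:]))   # smallest gap between sorted points
    return 1.0 / ((n - 1) * d_n)

peak_small = max_empirical_density(100)
peak_large = max_empirical_density(10_000)
```

Since n(n − 1) dn is roughly Expon(1), the peak height grows roughly linearly in n.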

Sometimes plotting F may be sufficient, but often a good density estimator gives better visual information. How can we estimate the density?

51 Histograms

We have n observations over [a, b]. We partition the interval into m pieces of length h = (b − a)/m. The histogram density (height) f_{h,n} is constant over each interval and proportional to the number of observations in the interval.

To minimize the mean integrated square error (MISE), E[∫_a^b (f_{h,n}(x) − f(x))² dx], we must choose h such that h³ n ∫_a^b (f′(x))² dx ≈ 6. We then have m = (b − a)/h = O(n^{1/3}) and MISE = O(n^{−2/3}). We should double m when n is multiplied by 8.

Better: polygonal interpolation of the histogram. Gives MISE = O(n^{−4/5}) with m = O(n^{1/5}). In this case, we double m when n is multiplied by 32.
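A minimal histogram-density sketch (my own illustration; the data and bin count are arbitrary):

```python
def histogram_density(xs, a, b, m):
    """Histogram density estimator over [a, b] with m equal bins:
    height = (bin count) / (n * h), so the estimate integrates to 1."""
    n = len(xs)
    h = (b - a) / m
    counts = [0] * m
    for x in xs:
        if a <= x <= b:
            i = min(int((x - a) / h), m - 1)   # clamp x == b into the last bin
            counts[i] += 1
    return [c / (n * h) for c in counts]

# 4 observations, 2 bins on [0, 1]: heights integrate to 1.
heights = histogram_density([0.1, 0.2, 0.3, 0.9], 0.0, 1.0, 2)
```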

52 Kernel density estimation

fn(x) = (1/(nh)) Σ_{i=1}^n k((x − xi)/h),

where k is a base density named the kernel and h > 0 is a constant named the smoothing factor. The change of variable y = (x − xi)/h gives

(1/h) ∫_{−∞}^{∞} k((x − xi)/h) dx = ∫_{−∞}^{∞} k(y) dy = 1.

This generalizes to multivariate distributions (d dimensions). Main difficulty: choices of k and (mostly) h.

Simpler and much more common: just plot a histogram.
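A minimal kernel density estimator with a Gaussian kernel (my own sketch; the kernel choice, data, and bandwidth h are arbitrary):

```python
import math

def kde(xs, h):
    """Kernel density estimate f_n(x) = (1/(n h)) * sum_i k((x - x_i)/h),
    with a standard normal kernel k."""
    n = len(xs)
    def k(y):
        return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)
    return lambda x: sum(k((x - xi) / h) for xi in xs) / (n * h)

f = kde([0.0, 1.0, 2.0], h=0.5)
```

By the change-of-variable argument above, the estimate integrates to 1 for any kernel that is itself a density.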

53 Discrete-event Simulation

Events e0, e1, e2, ... occur at times 0 = t0 ≤ t1 ≤ t2 ≤ ···. Let Si be the state of the system just after event ei. Simulation time: current value of ti.

[Figure: piecewise-constant state trajectory as a function of time, changing at the event times t1, ..., t6.]

(ti, Si) must contain enough information so we can continue running the simulation with it and also the random variables generated afterward, at events ej for j > i.

54

A discrete-event simulation program requires:

▶ A simulation clock.

▶ A list of the planned future events, in chronological order (different possible implementations).

▶ A “procedure” or “method” to execute for each type of event.
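These ingredients can be sketched with a priority queue as the event list (my own illustration of the idea; the M/M/1 logic and the rates are arbitrary choices, not from the slides):

```python
import heapq
import random

def simulate_mm1(lam, mu, horizon, seed=0):
    """Minimal discrete-event skeleton: clock, future-event list, and one
    handler per event type, illustrated on an M/M/1 queue."""
    rng = random.Random(seed)
    events = []                                  # future event list: (time, kind)
    heapq.heappush(events, (rng.expovariate(lam), "arrival"))
    t, queue, served = 0.0, 0, 0
    while events:
        t, kind = heapq.heappop(events)          # advance the simulation clock
        if t > horizon:
            break
        if kind == "arrival":
            queue += 1
            heapq.heappush(events, (t + rng.expovariate(lam), "arrival"))
            if queue == 1:                       # server was idle: start service
                heapq.heappush(events, (t + rng.expovariate(mu), "departure"))
        else:                                    # departure
            queue -= 1
            served += 1
            if queue > 0:
                heapq.heappush(events, (t + rng.expovariate(mu), "departure"))
    return served

served = simulate_mm1(lam=1.0, mu=2.0, horizon=1000.0)
```

The heap keeps the planned events in chronological order, and popping the earliest one plays the role of advancing the simulation clock.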