

Simulation Optimization: New Approaches to Gradient-Based Search and MLE

PIs: Michael Fu & Steve Marcus Funding Period: August 1, 2020 – August 1, 2023

Robert H. Smith School of Business, Electrical & Computer Engineering, Institute for Systems Research, University of Maryland
https://scholar.rhsmith.umd.edu/mfu

AFOSR Mathematical Optimization Program Review, August 19, 2020

Project Team

PIs: Michael Fu & Steve Marcus
Current PhD students: Yunchuan Li (ECE), Peng Wan (math), Yi Zhou (math)

Previous AFOSR project ended Jan. 2018. DARPA Lagrange: Mar. 2018 – Sep. 2019. This project: Aug. 1, 2020 – Aug. 1, 2023.

Today’s talk includes work with many other collaborators.

Introduction

Setting: Maximize or minimize L(θ) = E[ϕ(Y)], where θ is the controllable parameter (decision variables),

Y is a random variable (r.v.), and ϕ is measurable (possibly discontinuous).

Simple example: single-server queue (what is ϕ?)

$$\min_\theta\, L(\theta) = \mathbb{E}[Y(\theta)] + c/\theta$$

where Y is the waiting/system time, θ the mean service time, and c the service “cost”.

MLE: (what is ϕ?)

$$\max_\theta\, L(\theta) = \ln f_Y(Y_1, Y_2, \ldots; \theta)$$

Yi observed with joint density fY
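To make the single-server example concrete, here is a minimal numerical sketch (ours, not from the talk), assuming an M/M/1 queue with arrival rate λ so that E[Y(θ)] = θ/(1 − λθ) is available in closed form (stability requires λθ < 1); the values of λ and c are illustrative:

```python
import numpy as np

lam, c = 0.5, 1.0                        # illustrative arrival rate and service "cost"

def L(theta):
    # E[Y(theta)] + c/theta for M/M/1: mean system time is theta/(1 - lam*theta)
    return theta / (1.0 - lam * theta) + c / theta

thetas = np.linspace(0.05, 1.95, 2000)   # stability region: theta < 1/lam = 2
print("approximate minimizer:", thetas[np.argmin(L(thetas))])   # ~2/3 for these values
```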

Introduction (cont.)

Our usual setting is that Y is an output performance measure represented as Y = g(X(θ); θ), where X can be viewed as the input r.v.s, e.g., interarrival and service times in queueing; activity times in stochastic activity networks (SANs).

Queueing Network Example

[Figure: four single-server queues in tandem with service-time parameters θ1, θ2, θ3, θ4.]

Think of it as a simplified/stylized MVA/DMV.

Outline

1 Motivations

2 GLR Stochastic Gradient Estimation: Distribution Sensitivities

3 MLE: Gradient-Based Simulated MLE; Non-convex MLE for Neuroimaging Data

4 STAR-SA & DiGARSM

MLE for two types of data-driven models

1 f_Y not readily available, e.g., system times from a queueing network or SAN → gradient-based simulated MLE (GSMLE), using the GLR estimator for distribution sensitivities; “Maximum Likelihood Estimation By Monte Carlo Simulation: Towards Data-Driven Stochastic Modeling” (w/ Y. Peng, B. Heidergott, H. Lam), Operations Research (accepted Dec. 2019)

2 likelihood function complex & non-convex, e.g., data from numerous neuroimaging sources → MCEM (well-known approach used for MLE) and MRAS (global optimization method); to be submitted soon

New Gradient-Based Search Methods

Combining direct gradient information with function evaluations to improve search:
– STAR-SPSA (Operations Research, under revision)
– DiGARSM: “Direct Gradient Augmented Response Surface Methodology as Stochastic Approximation” (w/ Y. Li), Operations Research (R&R, to be resubmitted this month)

Stochastic gradient estimators:
– generalized likelihood ratio (GLR) method: discontinuous sample performance, structural parameters
– various ongoing research collaborations, including a paper submitted to Operations Research in May

Quick Overview (Gradient-based Search)

Main idea: (sign depending on max/min)

$$\theta_{k+1} = \theta_k \pm a_k\, \widehat{\nabla_\theta L}(\theta_k)$$

Main challenge: the (stochastic) gradient estimate $\widehat{\nabla_\theta L}(\theta)$

When applicable, use infinitesimal perturbation analysis (IPA) and the likelihood ratio (LR) method (e.g., the later DiGARSM example): these are single-run techniques (NO additional simulation required); however, they are NOT applicable for MLE → generalized LR (GLR). A sketch of the basic search loop follows.
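As a minimal sketch (ours, not from the talk) of the iteration above, assume minimization and a generic noisy gradient oracle `grad_est` standing in for a single-run IPA/LR/GLR estimator; the quadratic test function is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_est(theta):
    # stand-in for a single-run IPA/LR/GLR estimator: true gradient plus noise
    return 2.0 * (theta - 1.0) + rng.normal(scale=0.5)

theta = 5.0
for k in range(1, 10_001):
    a_k = 1.0 / k                    # Robbins-Monro step sizes: sum a_k = inf, sum a_k^2 < inf
    theta -= a_k * grad_est(theta)   # minimization, hence the minus sign
print(theta)                         # converges to the minimizer theta* = 1
```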

A different view of a density function (estimation)

A density function (aka p.d.f.) is ... the derivative of the c.d.f. F of r.v. Y, which is the expectation of an indicator function, i.e., a 1st-order distribution sensitivity:

$$\frac{\partial F(y)}{\partial y} = \frac{\partial}{\partial y}\, \mathbb{E}\left[\mathbf{1}\{Y \le y\}\right]$$

(i) discontinuous sample performance $\mathbf{1}\{Y \le y\}$ (jumps from 0 to 1 at y); (ii) structural parameter y (as opposed to a distributional parameter, as in MLE)

KEY ISSUES: discontinuity of the indicator 1{·}, structural parameters
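A small numerical illustration (ours, assuming Y ~ N(0,1) purely for concreteness) of why these issues bite: the pathwise derivative of the indicator is 0 almost everywhere, so differentiating inside the sample average gives 0, while a finite difference of the empirical c.d.f. must trade bias against variance through a bandwidth h:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=1_000_000)      # illustrative: Y ~ N(0,1)
y = 0.0
# d/dy 1{Y <= y} = 0 a.e., so the naive pathwise estimator is useless;
# finite differences of the empirical c.d.f. work but are biased for fixed h:
for h in (0.5, 0.1, 0.01):
    fd = (np.mean(Y <= y + h) - np.mean(Y <= y - h)) / (2.0 * h)
    print(h, fd)                    # true density at 0 is 1/sqrt(2*pi) ~ 0.3989
```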


Relevant Methods (simulation-based)

– IPA-based methods (Hong 2009, Hong and Liu 2010): (i) slow convergence rate; (ii) only derivative w.r.t. parameters (not argument)
– CMC/SPA & push-out LR: problem dependent
– Kernel-based methods (Liu and Hong 2009): (i) biased; (ii) choice of bandwidth parameters; (iii) slow convergence rate
– WD (Heidergott and Volk-Makarewicz 2016): no structural parameters
– GLR: generalized likelihood ratio method (Peng et al.), “A New Unbiased Stochastic Derivative Estimator for Discontinuous Sample Performances with Structural Parameters,” Operations Research, Vol. 66, No. 2, 487–499, 2018

GLR Comparison with Existing Methods

Advantages:
– handles general discontinuous sample performances
– unbiased, w/ desirable convergence properties
– analytical form
– derivatives w.r.t. both parameters and argument
– handles any distribution sensitivity in a unified form

MLE: Review & Two Scenarios

Data Observed: Y1, Y2,...

Goal: Estimate parameter(s) θ

$$\arg\max_\theta\, L(\theta; Y_1, Y_2, \ldots, Y_n), \quad \text{where } L(\theta; y_1, y_2, \ldots, y_n) = \ln f_Y(y_1, y_2, \ldots, y_n; \theta)$$

Two (Different) Scenarios:

fY not explicitly available

fY non-convex

Big Picture: Data & Model Fitting

Illustrative (Motivating) Example: Queueing System

Data Observed: waiting (system) times Y1, Y2,...

Goal: Fit stochastic model and estimate input parameter θ, e.g., mean service time at one of the stations

MLE: $\arg\max_\theta\, L(\theta; Y_1, Y_2, \ldots, Y_n)$, where $L(\theta; y_1, y_2, \ldots) = \ln f_Y(y_1, y_2, \ldots; \theta)$

Main Assumption: fY not explicitly available

e.g., complex simulation model or real system generates Y1, Y2,...

Models: ML vs. Stochastics (e.g., regression vs. queueing)

machine learning (ML) models are black-box, fit data statistically

stochastic models are causal/explanatory

Example: FCFS G/G/1 queue, Lindley equation

$$Y_t(\theta) = \max(0,\; Y_{t-1}(\theta) - A_t) + X_t(\theta)$$

IPA:

$$\frac{dY_t}{d\theta} = \frac{dX_t}{d\theta} + \frac{dY_{t-1}}{d\theta}\, \mathbf{1}\{Y_{t-1} > A_t\}$$
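A minimal single-run sketch (ours) of the Lindley recursion together with its IPA derivative, assuming exponential service times generated in scale form X_t = θ·E_t with E_t ~ Exp(1), so that dX_t/dθ = X_t/θ; the rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, lam, n = 0.5, 1.0, 100_000
A = rng.exponential(1.0 / lam, size=n)      # interarrival times
X = theta * rng.exponential(1.0, size=n)    # service times with mean theta (scale form)

Y, dY = 0.0, 0.0
Ys, dYs = np.empty(n), np.empty(n)
for t in range(n):
    busy = Y > A[t]                         # does customer t have to wait?
    Y = max(0.0, Y - A[t]) + X[t]           # Lindley recursion for system time
    dY = X[t] / theta + (dY if busy else 0.0)   # IPA: dX_t/dtheta + dY_{t-1}/dtheta * 1{Y_{t-1} > A_t}
    Ys[t], dYs[t] = Y, dY
print(Ys.mean(), dYs.mean())                # single-run estimates of E[Y] and dE[Y]/dtheta
```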

Usual Stochastic Model Approach: Input Fitting MLE

Find θ by MLE of input data X_t, t = 1, ..., n:

$$\arg\max_\theta\, \ln f_X(X_1, X_2, \ldots, X_n; \theta),$$

where f_X is the (joint) density of the service times.
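For instance, a minimal sketch (ours) of input fitting, assuming i.i.d. lognormal service times X_t = exp(θ + σZ_t) as in the LN/LN/1 examples below; the MLE of θ is then simply the sample mean of ln X_t:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, n = 0.0, 1.0, 100
X = np.exp(theta_true + sigma * rng.normal(size=n))   # observed i.i.d. lognormal service times
theta_hat = np.log(X).mean()                          # closed-form MLE of the log-scale parameter
print(theta_hat)                                      # close to theta_true = 0
```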

What happens if service times X1, X2,... NOT observed?

Or... What happens if model is misspecified? e.g., arrival process assumed stationary Poisson, but in reality very time-varying

Gradient-based simulated MLE (GSMLE)

Main idea: Simulation (causal) model available

MLE carried out by gradient-based search:

$$\theta_{k+1} = \theta_k + a_k\, \widehat{\nabla_\theta L}(\theta_k; Y)$$

i.e., simulation used to get gradient estimate (as a function of fixed output)

Main challenge: the gradient estimate $\widehat{\nabla_\theta L}(\theta)$

What is the Likelihood Function? (Let’s add θ.)

The density function is ... the derivative of the c.d.f., which is the expectation of an indicator function, i.e., a 1st-order distribution sensitivity:

$$L(\theta) = \frac{\partial F(y;\theta)}{\partial y} = \frac{\partial}{\partial y}\, \mathbb{E}\left[\mathbf{1}\{Y(\theta) \le y\}\right]$$

AND we actually need the derivative of this w.r.t. θ, i.e.,

$$\nabla_\theta L(\theta) = \frac{\partial^2 F}{\partial y\, \partial\theta} = \frac{\partial^2}{\partial y\, \partial\theta}\, \mathbb{E}\left[\mathbf{1}\{Y(\theta) \le y\}\right]$$

GLR for Distribution Sensitivities

Under appropriate regularity conditions (Peng et al. 2018),

$$\frac{\partial F(y;\theta)}{\partial \theta} = \mathbb{E}[\varphi_1(X; y, \theta)], \qquad \frac{\partial F(y;\theta)}{\partial y} = \mathbb{E}[\varphi_2(X; y, \theta)], \qquad \frac{\partial^2 F(y;\theta)}{\partial y\, \partial\theta} = \mathbb{E}[\varphi_3(X; y, \theta)], \;\cdots$$

where

$$\varphi_j(x; y, \theta) = \mathbf{1}\{Y(\theta) \le y\}\left(\underbrace{\frac{\partial \log f_X(x;\theta)}{\partial \theta}}_{\text{usual LR estimator}} + d_j(x;\theta)\right)$$

$$d_1(x;\theta) = -\left(\frac{\partial g(x;\theta)}{\partial x_i}\right)^{-1}\left[\frac{\partial^2 g(x;\theta)}{\partial \theta\, \partial x_i} + \frac{\partial g(x;\theta)}{\partial \theta}\left(\frac{\partial \log f_X(x;\theta)}{\partial x_i} - \frac{\partial^2 g(x;\theta)}{\partial x_i^2}\left(\frac{\partial g(x;\theta)}{\partial x_i}\right)^{-1}\right)\right], \quad d_2 = \ldots$$

i.e., a “simple” rescaling of observed data by simulation.

Y (θ) = g(X ; θ), where g is the causal/stochastic model, e.g., Lindley’s equation
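As a sanity check (ours, not from the talk), consider the simple location model Y = g(X; θ) = θ + X with X ~ N(0,1): here f_X does not depend on θ (so the usual LR term vanishes) and, under the formula above, d_1(x; θ) reduces to x, giving the estimator φ_1 = 1{Y ≤ y}·X of ∂F(y;θ)/∂θ, which analytically equals minus the normal density at y − θ:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, y, n = 0.3, 1.0, 2_000_000
X = rng.normal(size=n)
glr = np.mean((theta + X <= y) * X)          # phi_1 = 1{Y <= y} * X (LR term is 0 here)
exact = -np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2.0 * np.pi)   # dF/dtheta for this model
print(glr, exact)                            # both approximately -0.312
```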

Input Fitting vs. Output Fitting (LN/LN/1 Queue)

Service times i.i.d. lognormal w/ true parameter θ = 0; interarrival times also lognormal. Both MLE using input data (service times) & GSMLE using output data (system times) recover the true value, θ̂ ≈ 0.

[Figure: “Estimation Based on True Model (100 Observations)”: boxplots of the estimates of θ for GSMLE(100000), GSMLE(1000000), and MLE-Input, all centered near the true value θ = 0.]

Model Misspecification Example (LN/LN/1 Queue)

“All models are wrong, but some are useful” — George Box

Service times i.i.d. lognormal w/ true parameter θ = 0; interarrival times also lognormal.

True arrival process: 2-state Markov-modulated process (MMP).

What happens if arrival process misspecified as stationary?

Input Fitting vs. Output Fitting (LN/LN/1 Queue)

Arrival process misspecified, and true service parameter θ = 0. MLE using input data (service times) gives the true value θ̂ = 0; GSMLE using output data (system times) gives θ̂ ≈ 0.4. Which is better?

[Figure: “Estimation Under Model Misspecification (100 Observations)”: boxplots of the estimates of θ for GSMLE(100000), GSMLE(1000000), and MLE-Input; the GSMLE estimates center near 0.4, the MLE-Input estimates near 0.]

Towards Data-Driven Modeling (LN/LN/1 Queue)

Average system time of the first 10 customers (10K reps):

True Model: 4.3 ± 0.04    GSMLE Fitting: 4.8 ± 0.04    MLE-Input Fitting: 2.5 ± 0.02

Takeaway: If the model is misspecified, it might be better to fit the stochastic model using the output data than input MLE!

[Figure: expected system time vs. customer number (1–10) for True Model, GSMLE Fitting, and MLE-Input Fitting; the GSMLE curve stays close to the true model, while MLE-Input substantially underestimates.]

Noninvasive Neural Data

[Figure: from sensory stimulus to the neural code: neural sources (neurons/dipoles, ~10²–10⁵ sources) measured noninvasively via MEG/EEG electrophysiology sensors (~10–10² sensors); hours of data, highly dynamic neural response with sparse structure → dynamic sparse-structure optimization.]

Problem Formulation and Modeling: State-Space Model

Original state-space model: [equations not recovered from the slide]. Some of the model matrices are known; the others have unknown but fixed parameters.

Goal: estimate the unknown parameters from the observations.

Method: maximize the log-likelihood of the observations using MRAS.

Given the temporal correlation, we first need to reformulate the state-space model using state augmentation.


Nonconvex Likelihood Function

[Figure: example of the actual objective function from simulated data (projected to a single-dimension parameter space).]

Global Optimization Method: MRAS

Preliminary results: EM-MRAS numerical example using Model Reference Adaptive Search (MRAS); 12-dim parameter, m = 4, n = 1, data from observations over 10000 time steps.

– Log-likelihood vs. MRAS iterations: [figure]
– MSE vs. MRAS iterations: [figure]
– Next steps: extension to generalized linear models, real-time MRAS

(A simplified sketch of the MRAS-style search appears below.)
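For intuition, here is a simplified, elite-based sketch (ours) in the spirit of MRAS; it is closer to a plain cross-entropy iteration, since the actual MRAS reference-model weighting is richer, and the 12-dimensional test log-likelihood is only a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12                                   # parameter dimension, as in the example

def loglik(v):
    # stand-in for a complex, nonconvex log-likelihood (multimodal test function)
    return -np.sum((v - 1.0) ** 2) + 0.5 * np.sum(np.cos(5.0 * v))

mu, sd = np.zeros(d), 5.0 * np.ones(d)   # Gaussian sampling distribution over parameters
for it in range(100):
    cand = mu + sd * rng.normal(size=(200, d))       # sample candidate parameter vectors
    vals = np.apply_along_axis(loglik, 1, cand)
    elite = cand[vals >= np.quantile(vals, 0.9)]     # keep the top 10% ("elite" samples)
    mu = 0.8 * elite.mean(axis=0) + 0.2 * mu         # smoothed update of the
    sd = 0.8 * elite.std(axis=0) + 0.2 * sd          # sampling distribution
print(mu.round(2))                                   # concentrates near a maximizer
```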

New Stochastic Optimization Gradient Methods

Main Assumption: Both performance function and direct gradient available, e.g., via IPA, LR, GLR, WD

Main (very simple) Idea: Use BOTH of them in the search

1 STAR-SA (Chau, Qu, & Fu 2014)
2 DiGARSM (Li & Fu 2018, 2020)

Contrast with deterministic optimization: if the gradient is directly available, it is generally assumed noiseless, so there is no point in using function values to estimate the gradient (though they would be useful in determining step size).


Secant-Tangents AveRaged (STAR) Stochastic Approximation (SA)

Classical SA methods:
– Robbins-Monro (RM): uses an unbiased direct gradient estimator, e.g., IPA, LR/SF, WD
– Kiefer-Wolfowitz (KW): estimates the gradient with finite differences, e.g., symmetric/forward differences

The Secant-Tangents AveRaged (STAR) gradient estimator uses a convex combination of BOTH estimates.

Secant-Tangents AveRaged (STAR) gradient

[Figure: f(x), with the secant through the points (x_n − c_n, f̃(x_n − c_n)) and (x_n + c_n, f̃(x_n + c_n)), and the tangents at those two points.]

Secant:

$$\frac{\tilde f(x_n + c_n) - \tilde f(x_n - c_n)}{2c_n}$$

Tangents AveRaged:

$$\frac{\tilde f'(x_n + c_n) + \tilde f'(x_n - c_n)}{2}$$

STAR (Secant-Tangents AveRaged) gradient estimator: the convex combination

$$\alpha\, \frac{\tilde f(x_n + c_n) - \tilde f(x_n - c_n)}{2c_n} + (1 - \alpha)\, \frac{\tilde f'(x_n + c_n) + \tilde f'(x_n - c_n)}{2}$$

(Apologies for the notation switch.)
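A minimal sketch (ours) of the STAR estimator, with noisy function and direct-gradient oracles f and fprime standing in for simulation output; the quadratic test function is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):        # noisy function-value oracle
    return (x - 2.0) ** 2 + rng.normal(scale=0.1)

def fprime(x):   # noisy direct-gradient oracle (e.g., from IPA/LR)
    return 2.0 * (x - 2.0) + rng.normal(scale=0.1)

def star_gradient(x, c, alpha):
    secant = (f(x + c) - f(x - c)) / (2.0 * c)          # KW-style finite difference
    tangents = (fprime(x + c) + fprime(x - c)) / 2.0    # averaged direct gradients
    return alpha * secant + (1.0 - alpha) * tangents    # convex combination

print(star_gradient(x=3.0, c=0.1, alpha=0.5))           # true gradient at 3.0 is 2.0
```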

Direct Gradient Augmented Response Surface Methodology (DiGARSM)

Setting:
1 sequential RSM with local metamodels
2 single-run direct gradient estimates available
3 relatively limited simulation budget

DiGARSM = DiGAR (Fu and Qu 2014) + RSM

Review of Direct Gradient Augmented Regression (DiGAR):

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad g_i = \beta_1 + \epsilon_i'$$

$$L = \alpha \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + (1 - \alpha) \sum_{i=1}^{n} (g_i - \beta_1)^2$$

$$\hat\beta_1 = \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) + \frac{1-\alpha}{\alpha}\, \bar g\right] \Big/ \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2 + \frac{1-\alpha}{\alpha}\right]$$
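A quick numerical check (ours) of the closed-form DiGAR slope estimate above, on synthetic data with true slope β₁ = 2; the noise scales and α are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1, n, alpha = 1.0, 2.0, 50, 0.5
x = np.linspace(-1.0, 1.0, n)
y = b0 + b1 * x + rng.normal(scale=0.3, size=n)   # function observations y_i
g = b1 + rng.normal(scale=0.3, size=n)            # direct gradient observations g_i

w = (1.0 - alpha) / alpha
beta1_hat = (np.mean((x - x.mean()) * (y - y.mean())) + w * g.mean()) \
            / (np.mean((x - x.mean()) ** 2) + w)  # closed-form DiGAR slope
print(beta1_hat)                                  # close to b1 = 2.0
```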

DiGARSM: Theoretical Results

Under a stochastic approximation (SA) framework, proved:
1 convergence almost surely and in mean square (1st such result in RSM, as far as we are aware)
2 convergence rate results, as a function of step size and design-point spacing
3 optimal weighting on the DiGAR estimate used in RSM

DiGARSM: Numerical Example

Queueing Network Example

[Figure: four single-server queues in tandem with mean service times θ1, θ2, θ3, θ4.]

Gradient estimator: IPA

$$L(\theta) = \sum_i \left( \mathbb{E}[Y_i(\theta)] + c_i/\theta_i \right)$$

Y_i: system time at the ith queue; θ_i: mean service time of the ith server.
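A minimal sketch (ours) of this example, assuming exponential service times generated in scale form X = θ_i·Exp(1) at each stage and the standard tandem departure recursion D_{i,t} = max(D_{i−1,t}, D_{i,t−1}) + X_{i,t}; it estimates Σ_i E[Y_i(θ)] and the IPA derivative with respect to θ₁ (the θ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.6, 0.7, 0.5])   # illustrative mean service times
lam, n = 1.0, 200_000
arr = np.cumsum(rng.exponential(1.0 / lam, size=n))  # arrival times to stage 1
D, dD = arr.copy(), np.zeros(n)          # departures from "stage 0" and d/dtheta_1

for i, th in enumerate(theta):
    X = th * rng.exponential(1.0, size=n)            # stage-i service times
    prev, dprev = -np.inf, 0.0                       # previous customer's departure
    for t in range(n):
        if D[t] >= prev:                             # server idle: start at upstream departure
            prev, dprev = D[t], dD[t]
        # else: start at the previous departure from this stage (prev, dprev carry over)
        prev += X[t]
        dprev += X[t] / th if i == 0 else 0.0        # dX/dtheta_1 nonzero only at stage 1
        D[t], dD[t] = prev, dprev
print(np.mean(D - arr), np.mean(dD))     # E[sum_i Y_i(theta)] and its IPA theta_1-derivative
```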

Simulation Results: DiGARSM

Four single-server FCFS M/M/1 queues in tandem (series).

[Figure: parameter error (log scale, 10⁰ down to 10⁻²) vs. number of function calls (up to 6 × 10⁴) for DiGARSM, RSM, and RM-SA.]

Summary & Ongoing/Proposed Research

1 Stochastic gradients for stochastic (simulation) optimization of stochastic models
2 New approach to data-driven MLE based on stochastic models and stochastic gradients (GSMLE)

Ongoing/proposed:
– non-convex MLE (Yunchuan Li)
– SP-DiGARSM for higher dimensions (Yunchuan Li)
– SANs: gradient estimation and optimization (Peng Wan)
– spectral index statistical & selection (Yi Zhou)
– GLR extensions (Yi Zhou)
– STAR (???)

Select References

Y. Li and M.C. Fu, “Direct Gradient Augmented Response Surface Methodology as Stochastic Approximation,” Operations Research, to be resubmitted this month.

Y. Li and M.C. Fu, “Sequential First-Order Response Surface Methodology Augmented with Direct Gradients,” Proceedings of the 2018 Winter Simulation Conference, 2143–2154, 2018.

P. Wan and M.C. Fu, “Sensitivity Analysis of Arc Criticalities in Stochastic Activity Networks,” Proceedings of the 2020 Winter Simulation Conference, forthcoming (journal version being prepared for Operations Research).

Y. Peng, M.C. Fu, B. Heidergott, H. Lam, “Maximum Likelihood Estimation By Monte Carlo Simulation: Towards Data-Driven Stochastic Modeling,” Operations Research, accepted December 2019.

Y. Peng, M.C. Fu, J.Q. Hu, B. Heidergott, “A New Unbiased Stochastic Derivative Estimator for Discontinuous Sample Performances with Structural Parameters,” Operations Research, Vol. 66, No. 2, 487–499, 2018. (INFORMS Simulation Society Outstanding Publication Award, Dec. 2019)

Y. Peng, M.C. Fu, J. Hu, P. L’Ecuyer, B. Tuffin, “Generalized Likelihood Ratio Method for Stochastic Models with Uniform Random Numbers As Inputs,” Operations Research, submitted May 2020.

M. Chau, H. Qu, M.C. Fu, “A New Hybrid Stochastic Approximation Algorithm,” Proceedings of the 12th International Workshop on Discrete Event Systems (WODES), 241–246, 2014.

M.C. Fu and H. Qu, “Augmented Regression With Direct Gradient Estimates,” INFORMS Journal on Computing, Vol. 26, No. 3, 484–499, 2014.
