

Simulation Optimization: New Approaches to Gradient-Based Search and MLE

PIs: Michael Fu & Steve Marcus Funding Period: August 1, 2020 – August 1, 2023

Robert H. Smith School of Business, Electrical & Computer Engineering, Institute for Systems Research, University of Maryland
https://scholar.rhsmith.umd.edu/mfu

AFOSR Mathematical Optimization Program Review, August 19, 2020

Project Team

PIs: Michael Fu & Steve Marcus
Current PhD students: Yunchuan Li (ECE), Peng Wan (math), Yi Zhou (math)

Previous AFOSR project ended Jan. 2018. DARPA Lagrange: Mar. 2018 – Sep. 2019. This project: Aug. 1, 2020 – Aug. 1, 2023.

Today’s talk includes work with many other collaborators.

Introduction

Setting: Maximize or minimize L(θ) = E[ϕ(Y)], where θ is the controllable parameter (decision variables),

Y is a random variable (r.v.), and ϕ is measurable (possibly discontinuous).

Simple example: single-server queue (what is ϕ?)

$$\min_\theta\, L(\theta) = \mathbb{E}[Y(\theta)] + c/\theta$$

where Y is the waiting/system time, θ the mean service time, and c the service “cost”.

MLE: (what is ϕ?)

$$\max_\theta\, L(\theta) = \ln f_Y(Y_1, Y_2, \ldots; \theta)$$

Yi observed with joint density fY
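To make the single-server example concrete, here is a minimal numerical sketch (ours, not from the talk), assuming an M/M/1 queue with arrival rate λ so that E[Y(θ)] = θ/(1 − λθ) is available in closed form (stability requires λθ < 1); the values of λ and c are illustrative:

```python
import numpy as np

lam, c = 0.5, 1.0                        # illustrative arrival rate and service "cost"

def L(theta):
    # E[Y(theta)] + c/theta for M/M/1: mean system time is theta/(1 - lam*theta)
    return theta / (1.0 - lam * theta) + c / theta

thetas = np.linspace(0.05, 1.95, 2000)   # stability region: theta < 1/lam = 2
print("approximate minimizer:", thetas[np.argmin(L(thetas))])   # ~2/3 for these values
```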

Introduction (cont.)

Our usual setting is that Y is an output performance measure represented as Y = g(X(θ); θ), where X can be viewed as the input r.v.s, e.g., interarrival and service times in queueing; activity times in stochastic activity networks (SANs).

Queueing Network Example

[Figure: four single-server queues in tandem with service-time parameters θ1, θ2, θ3, θ4.]

Think of it as a simplified/stylized MVA/DMV.

Outline

1 Motivations

2 GLR Stochastic Gradient Estimation: Distribution Sensitivities

3 MLE: Gradient-Based Simulated MLE; Non-convex MLE for Neuroimaging Data

4 STAR-SA & DiGARSM

MLE for two types of data-driven models

1 f_Y not readily available, e.g., system times from a queueing network or SAN → gradient-based simulated MLE (GSMLE), using the GLR estimator for distribution sensitivities; “Maximum Likelihood Estimation By Monte Carlo Simulation: Towards Data-Driven Stochastic Modeling” (w/ Y. Peng, B. Heidergott, H. Lam), Operations Research (accepted Dec. 2019)

2 likelihood function complex & non-convex, e.g., data from numerous neuroimaging sources → MCEM (well-known approach used for MLE) and MRAS (global optimization method); to be submitted soon

New Gradient-Based Search Methods

Combining direct gradient information with function evaluations to improve search:
– STAR-SPSA (Operations Research, under revision)
– DiGARSM: “Direct Gradient Augmented Response Surface Methodology as Stochastic Approximation” (w/ Y. Li), Operations Research (R&R, to be resubmitted this month)

Stochastic gradient estimators:
– generalized likelihood ratio (GLR) method: discontinuous sample performance, structural parameters
– various ongoing research collaborations, including a paper submitted to Operations Research in May

Quick Overview (Gradient-based Search)

Main idea: (sign depending on max/min)

$$\theta_{k+1} = \theta_k \pm a_k\, \widehat{\nabla_\theta L}(\theta_k)$$

Main challenge: the (stochastic) gradient estimate $\widehat{\nabla_\theta L}(\theta)$

When applicable, use infinitesimal perturbation analysis (IPA) and the likelihood ratio (LR) method (e.g., the later DiGARSM example): these are single-run techniques (NO additional simulation required); however, they are NOT applicable for MLE → generalized LR (GLR). A sketch of the basic search loop follows.
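As a minimal sketch (ours, not from the talk) of the iteration above, assume minimization and a generic noisy gradient oracle `grad_est` standing in for a single-run IPA/LR/GLR estimator; the quadratic test function is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_est(theta):
    # stand-in for a single-run IPA/LR/GLR estimator: true gradient plus noise
    return 2.0 * (theta - 1.0) + rng.normal(scale=0.5)

theta = 5.0
for k in range(1, 10_001):
    a_k = 1.0 / k                    # Robbins-Monro step sizes: sum a_k = inf, sum a_k^2 < inf
    theta -= a_k * grad_est(theta)   # minimization, hence the minus sign
print(theta)                         # converges to the minimizer theta* = 1
```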

A different view of a density function (estimation)

A density function (aka p.d.f.) is ... the derivative of the c.d.f. F of r.v. Y, which is the expectation of an indicator function, i.e., a 1st-order distribution sensitivity:

$$\frac{\partial F(y)}{\partial y} = \frac{\partial}{\partial y}\, \mathbb{E}\left[\mathbf{1}\{Y \le y\}\right]$$

(i) discontinuous sample performance $\mathbf{1}\{Y \le y\}$ (jumps from 0 to 1 at y); (ii) structural parameter y (as opposed to a distributional parameter, as in MLE)

KEY ISSUES: discontinuity of the indicator 1{·}, structural parameters
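A small numerical illustration (ours, assuming Y ~ N(0,1) purely for concreteness) of why these issues bite: the pathwise derivative of the indicator is 0 almost everywhere, so differentiating inside the sample average gives 0, while a finite difference of the empirical c.d.f. must trade bias against variance through a bandwidth h:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=1_000_000)      # illustrative: Y ~ N(0,1)
y = 0.0
# d/dy 1{Y <= y} = 0 a.e., so the naive pathwise estimator is useless;
# finite differences of the empirical c.d.f. work but are biased for fixed h:
for h in (0.5, 0.1, 0.01):
    fd = (np.mean(Y <= y + h) - np.mean(Y <= y - h)) / (2.0 * h)
    print(h, fd)                    # true density at 0 is 1/sqrt(2*pi) ~ 0.3989
```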


Relevant Methods (simulation-based)

– IPA-based methods (Hong 2009, Hong and Liu 2010): (i) slow convergence rate; (ii) only derivative w.r.t. parameters (not argument)
– CMC/SPA & push-out LR: problem dependent
– Kernel-based methods (Liu and Hong 2009): (i) biased; (ii) choice of bandwidth parameters; (iii) slow convergence rate
– WD (Heidergott and Volk-Makarewicz 2016): no structural parameters
– GLR: generalized likelihood ratio method (Peng et al.), “A New Unbiased Stochastic Derivative Estimator for Discontinuous Sample Performances with Structural Parameters,” Operations Research, Vol. 66, No. 2, 487–499, 2018

GLR Comparison with Existing Methods

Advantages:
– handles general discontinuous sample performances
– unbiased, w/ desirable convergence properties
– analytical form
– derivatives w.r.t. both parameters and argument
– handles any distribution sensitivity in a unified form

MLE: Review & Two Scenarios

Data Observed: Y1, Y2,...

Goal: Estimate parameter(s) θ

$$\arg\max_\theta\, L(\theta; Y_1, Y_2, \ldots, Y_n), \quad \text{where } L(\theta; y_1, y_2, \ldots, y_n) = \ln f_Y(y_1, y_2, \ldots, y_n; \theta)$$

Two (Different) Scenarios:

fY not explicitly available

fY non-convex

Big Picture: Data & Model Fitting

Illustrative (Motivating) Example: Queueing System

Data Observed: waiting (system) times Y1, Y2,...

Goal: Fit stochastic model and estimate input parameter θ, e.g., mean service time at one of the stations

MLE: $\arg\max_\theta\, L(\theta; Y_1, Y_2, \ldots, Y_n)$, where $L(\theta; y_1, y_2, \ldots) = \ln f_Y(y_1, y_2, \ldots; \theta)$

Main Assumption: fY not explicitly available

e.g., complex simulation model or real system generates Y1, Y2,...

Models: ML vs. Stochastics (e.g., regression vs. queueing)

machine learning (ML) models are black-box, fit data statistically

stochastic models are causal/explanatory

Example: FCFS G/G/1 queue, Lindley equation

$$Y_t(\theta) = \max(0,\; Y_{t-1}(\theta) - A_t) + X_t(\theta)$$

IPA:

$$\frac{dY_t}{d\theta} = \frac{dX_t}{d\theta} + \frac{dY_{t-1}}{d\theta}\, \mathbf{1}\{Y_{t-1} > A_t\}$$
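A minimal single-run sketch (ours) of the Lindley recursion together with its IPA derivative, assuming exponential service times generated in scale form X_t = θ·E_t with E_t ~ Exp(1), so that dX_t/dθ = X_t/θ; the rates are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, lam, n = 0.5, 1.0, 100_000
A = rng.exponential(1.0 / lam, size=n)      # interarrival times
X = theta * rng.exponential(1.0, size=n)    # service times with mean theta (scale form)

Y, dY = 0.0, 0.0
Ys, dYs = np.empty(n), np.empty(n)
for t in range(n):
    busy = Y > A[t]                         # does customer t have to wait?
    Y = max(0.0, Y - A[t]) + X[t]           # Lindley recursion for system time
    dY = X[t] / theta + (dY if busy else 0.0)   # IPA: dX_t/dtheta + dY_{t-1}/dtheta * 1{Y_{t-1} > A_t}
    Ys[t], dYs[t] = Y, dY
print(Ys.mean(), dYs.mean())                # single-run estimates of E[Y] and dE[Y]/dtheta
```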

Usual Stochastic Model Approach: Input Fitting MLE

Find θ by MLE of input data X_t, t = 1, ..., n:

$$\arg\max_\theta\, \ln f_X(X_1, X_2, \ldots, X_n; \theta),$$

where f_X is the (joint) density of the service times.
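For instance, a minimal sketch (ours) of input fitting, assuming i.i.d. lognormal service times X_t = exp(θ + σZ_t) as in the LN/LN/1 examples below; the MLE of θ is then simply the sample mean of ln X_t:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true, sigma, n = 0.0, 1.0, 100
X = np.exp(theta_true + sigma * rng.normal(size=n))   # observed i.i.d. lognormal service times
theta_hat = np.log(X).mean()                          # closed-form MLE of the log-scale parameter
print(theta_hat)                                      # close to theta_true = 0
```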

What happens if service times X1, X2,... NOT observed?

Or... What happens if model is misspecified? e.g., arrival process assumed stationary Poisson, but in reality very time-varying

Gradient-based simulated MLE (GSMLE)

Main idea: Simulation (causal) model available

MLE carried out by gradient-based search:

$$\theta_{k+1} = \theta_k + a_k\, \widehat{\nabla_\theta L}(\theta_k; Y)$$

i.e., simulation used to get gradient estimate (as a function of fixed output)

Main challenge: the gradient estimate $\widehat{\nabla_\theta L}(\theta)$

What is the Likelihood Function? (Let’s add θ.)

The density function is ... the derivative of the c.d.f., which is the expectation of an indicator function, i.e., a 1st-order distribution sensitivity:

$$L(\theta) = \frac{\partial F(y;\theta)}{\partial y} = \frac{\partial}{\partial y}\, \mathbb{E}\left[\mathbf{1}\{Y(\theta) \le y\}\right]$$

AND we actually need the derivative of this w.r.t. θ, i.e.,

$$\nabla_\theta L(\theta) = \frac{\partial^2 F}{\partial y\, \partial\theta} = \frac{\partial^2}{\partial y\, \partial\theta}\, \mathbb{E}\left[\mathbf{1}\{Y(\theta) \le y\}\right]$$

GLR for Distribution Sensitivities

Under appropriate regularity conditions (Peng et al. 2018),

$$\frac{\partial F(y;\theta)}{\partial \theta} = \mathbb{E}[\varphi_1(X; y, \theta)], \qquad \frac{\partial F(y;\theta)}{\partial y} = \mathbb{E}[\varphi_2(X; y, \theta)], \qquad \frac{\partial^2 F(y;\theta)}{\partial y\, \partial\theta} = \mathbb{E}[\varphi_3(X; y, \theta)], \;\cdots$$

where

$$\varphi_j(x; y, \theta) = \mathbf{1}\{Y(\theta) \le y\}\left(\underbrace{\frac{\partial \log f_X(x;\theta)}{\partial \theta}}_{\text{usual LR estimator}} + d_j(x;\theta)\right)$$

$$d_1(x;\theta) = -\left(\frac{\partial g(x;\theta)}{\partial x_i}\right)^{-1}\left[\frac{\partial^2 g(x;\theta)}{\partial \theta\, \partial x_i} + \frac{\partial g(x;\theta)}{\partial \theta}\left(\frac{\partial \log f_X(x;\theta)}{\partial x_i} - \frac{\partial^2 g(x;\theta)}{\partial x_i^2}\left(\frac{\partial g(x;\theta)}{\partial x_i}\right)^{-1}\right)\right], \quad d_2 = \ldots$$

i.e., a “simple” rescaling of observed data by simulation.

Y (θ) = g(X ; θ), where g is the causal/stochastic model, e.g., Lindley’s equation
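As a sanity check (ours, not from the talk), consider the simple location model Y = g(X; θ) = θ + X with X ~ N(0,1): here f_X does not depend on θ (so the usual LR term vanishes) and, under the formula above, d_1(x; θ) reduces to x, giving the estimator φ_1 = 1{Y ≤ y}·X of ∂F(y;θ)/∂θ, which analytically equals minus the normal density at y − θ:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, y, n = 0.3, 1.0, 2_000_000
X = rng.normal(size=n)
glr = np.mean((theta + X <= y) * X)          # phi_1 = 1{Y <= y} * X (LR term is 0 here)
exact = -np.exp(-0.5 * (y - theta) ** 2) / np.sqrt(2.0 * np.pi)   # dF/dtheta for this model
print(glr, exact)                            # both approximately -0.312
```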

Input Fitting vs. Output Fitting (LN/LN/1 Queue)

Service times i.i.d. lognormal w/ true parameter θ = 0; interarrival times also lognormal. Both MLE using input data (service times) & GSMLE using output data (system times) recover the true value, θ̂ ≈ 0.

[Figure: “Estimation Based on True Model (100 Observations)”: boxplots of the estimates of θ for GSMLE(100000), GSMLE(1000000), and MLE-Input, all centered near the true value θ = 0.]

Model Misspecification Example (LN/LN/1 Queue)

“All models are wrong, but some are useful” — George Box

Service times i.i.d. lognormal w/ true parameter θ = 0; interarrival times also lognormal.

True arrival process: 2-state Markov-modulated process (MMP).

What happens if arrival process misspecified as stationary?

Input Fitting vs. Output Fitting (LN/LN/1 Queue)

Arrival process misspecified, and true service parameter θ = 0. MLE using input data (service times) gives the true value θ̂ = 0; GSMLE using output data (system times) gives θ̂ ≈ 0.4. Which is better?

[Figure: “Estimation Under Model Misspecification (100 Observations)”: boxplots of the estimates of θ for GSMLE(100000), GSMLE(1000000), and MLE-Input; the GSMLE estimates center near 0.4, the MLE-Input estimates near 0.]

Towards Data-Driven Modeling (LN/LN/1 Queue)

Average system time of the first 10 customers (10K reps):

True Model: 4.3 ± 0.04    GSMLE Fitting: 4.8 ± 0.04    MLE-Input Fitting: 2.5 ± 0.02

Takeaway: If the model is misspecified, it might be better to fit the stochastic model using the output data than input MLE!

[Figure: expected system time vs. customer number (1–10) for True Model, GSMLE Fitting, and MLE-Input Fitting; the GSMLE curve stays close to the true model, while MLE-Input substantially underestimates.]

Noninvasive Neural Data

[Figure: from sensory stimulus to the neural code: neural sources (neurons/dipoles, ~10²–10⁵ sources) measured noninvasively via MEG/EEG electrophysiology sensors (~10–10² sensors); hours of data, highly dynamic neural response with sparse structure → dynamic sparse-structure optimization.]

Problem Formulation and Modeling: State-Space Model

Original state-space model: [equations not recovered from the slide]. Some of the model matrices are known; the others have unknown but fixed parameters.

Goal: estimate the unknown parameters from the observations.

Method: maximize the log-likelihood of the observations using MRAS.

Given the temporal correlation, we first need to reformulate the state-space model using state augmentation.


Nonconvex Likelihood Function

[Figure: example of the actual objective function from simulated data (projected to a single-dimension parameter space).]

Global Optimization Method: MRAS

Preliminary results: EM-MRAS numerical example using Model Reference Adaptive Search (MRAS); 12-dim parameter, m = 4, n = 1, data from observations over 10000 time steps.

– Log-likelihood vs. MRAS iterations: [figure]
– MSE vs. MRAS iterations: [figure]
– Next steps: extension to generalized linear models, real-time MRAS

(A simplified sketch of the MRAS-style search appears below.)
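For intuition, here is a simplified, elite-based sketch (ours) in the spirit of MRAS; it is closer to a plain cross-entropy iteration, since the actual MRAS reference-model weighting is richer, and the 12-dimensional test log-likelihood is only a stand-in:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12                                   # parameter dimension, as in the example

def loglik(v):
    # stand-in for a complex, nonconvex log-likelihood (multimodal test function)
    return -np.sum((v - 1.0) ** 2) + 0.5 * np.sum(np.cos(5.0 * v))

mu, sd = np.zeros(d), 5.0 * np.ones(d)   # Gaussian sampling distribution over parameters
for it in range(100):
    cand = mu + sd * rng.normal(size=(200, d))       # sample candidate parameter vectors
    vals = np.apply_along_axis(loglik, 1, cand)
    elite = cand[vals >= np.quantile(vals, 0.9)]     # keep the top 10% ("elite" samples)
    mu = 0.8 * elite.mean(axis=0) + 0.2 * mu         # smoothed update of the
    sd = 0.8 * elite.std(axis=0) + 0.2 * sd          # sampling distribution
print(mu.round(2))                                   # concentrates near a maximizer
```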

New Stochastic Optimization Gradient Methods

Main Assumption: Both performance function and direct gradient available, e.g., via IPA, LR, GLR, WD

Main (very simple) Idea: Use BOTH of them in the search

1 STAR-SA (Chau, Qu, & Fu 2014)
2 DiGARSM (Li & Fu 2018, 2020)

Contrast with deterministic optimization: if the gradient is directly available, it is generally assumed noiseless, so there is no point in using function values to estimate the gradient (though they would be useful in determining step size).


Secant-Tangents AveRaged (STAR) Stochastic Approximation (SA)

Classical SA methods:
– Robbins-Monro (RM): uses an unbiased direct gradient estimator, e.g., IPA, LR/SF, WD
– Kiefer-Wolfowitz (KW): estimates the gradient with finite differences, e.g., symmetric/forward differences

The Secant-Tangents AveRaged (STAR) gradient estimator uses a convex combination of BOTH estimates.

Secant-Tangents AveRaged (STAR) gradient

[Figure: f(x), with the secant through the points (x_n − c_n, f̃(x_n − c_n)) and (x_n + c_n, f̃(x_n + c_n)), and the tangents at those two points.]

Secant:

$$\frac{\tilde f(x_n + c_n) - \tilde f(x_n - c_n)}{2c_n}$$

Tangents AveRaged:

$$\frac{\tilde f'(x_n + c_n) + \tilde f'(x_n - c_n)}{2}$$

STAR (Secant-Tangents AveRaged) gradient estimator: the convex combination

$$\alpha\, \frac{\tilde f(x_n + c_n) - \tilde f(x_n - c_n)}{2c_n} + (1 - \alpha)\, \frac{\tilde f'(x_n + c_n) + \tilde f'(x_n - c_n)}{2}$$

(Apologies for the notation switch.)
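A minimal sketch (ours) of the STAR estimator, with noisy function and direct-gradient oracles f and fprime standing in for simulation output; the quadratic test function is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):        # noisy function-value oracle
    return (x - 2.0) ** 2 + rng.normal(scale=0.1)

def fprime(x):   # noisy direct-gradient oracle (e.g., from IPA/LR)
    return 2.0 * (x - 2.0) + rng.normal(scale=0.1)

def star_gradient(x, c, alpha):
    secant = (f(x + c) - f(x - c)) / (2.0 * c)          # KW-style finite difference
    tangents = (fprime(x + c) + fprime(x - c)) / 2.0    # averaged direct gradients
    return alpha * secant + (1.0 - alpha) * tangents    # convex combination

print(star_gradient(x=3.0, c=0.1, alpha=0.5))           # true gradient at 3.0 is 2.0
```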

Direct Gradient Augmented Response Surface Methodology (DiGARSM)

Setting:
1 sequential RSM with local metamodels
2 single-run direct gradient estimates available
3 relatively limited simulation budget

DiGARSM = DiGAR (Fu and Qu 2014) + RSM

Review of Direct Gradient Augmented Regression (DiGAR):

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad g_i = \beta_1 + \epsilon_i'$$

$$L = \alpha \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2 + (1 - \alpha) \sum_{i=1}^{n} (g_i - \beta_1)^2$$

$$\hat\beta_1 = \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y) + \frac{1-\alpha}{\alpha}\, \bar g\right] \Big/ \left[\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar x)^2 + \frac{1-\alpha}{\alpha}\right]$$
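A quick numerical check (ours) of the closed-form DiGAR slope estimate above, on synthetic data with true slope β₁ = 2; the noise scales and α are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
b0, b1, n, alpha = 1.0, 2.0, 50, 0.5
x = np.linspace(-1.0, 1.0, n)
y = b0 + b1 * x + rng.normal(scale=0.3, size=n)   # function observations y_i
g = b1 + rng.normal(scale=0.3, size=n)            # direct gradient observations g_i

w = (1.0 - alpha) / alpha
beta1_hat = (np.mean((x - x.mean()) * (y - y.mean())) + w * g.mean()) \
            / (np.mean((x - x.mean()) ** 2) + w)  # closed-form DiGAR slope
print(beta1_hat)                                  # close to b1 = 2.0
```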

DiGARSM: Theoretical Results

Under a stochastic approximation (SA) framework, proved:
1 convergence almost surely and in mean square (1st such result in RSM, as far as we are aware)
2 convergence rate results, as a function of step size and design-point spacing
3 optimal weighting on the DiGAR estimate used in RSM

DiGARSM: Numerical Example

Queueing Network Example

[Figure: four single-server queues in tandem with mean service times θ1, θ2, θ3, θ4.]

Gradient estimator: IPA

$$L(\theta) = \sum_i \left( \mathbb{E}[Y_i(\theta)] + c_i/\theta_i \right)$$

Y_i: system time at the ith queue; θ_i: mean service time of the ith server.
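A minimal sketch (ours) of this example, assuming exponential service times generated in scale form X = θ_i·Exp(1) at each stage and the standard tandem departure recursion D_{i,t} = max(D_{i−1,t}, D_{i,t−1}) + X_{i,t}; it estimates Σ_i E[Y_i(θ)] and the IPA derivative with respect to θ₁ (the θ values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.5, 0.6, 0.7, 0.5])   # illustrative mean service times
lam, n = 1.0, 200_000
arr = np.cumsum(rng.exponential(1.0 / lam, size=n))  # arrival times to stage 1
D, dD = arr.copy(), np.zeros(n)          # departures from "stage 0" and d/dtheta_1

for i, th in enumerate(theta):
    X = th * rng.exponential(1.0, size=n)            # stage-i service times
    prev, dprev = -np.inf, 0.0                       # previous customer's departure
    for t in range(n):
        if D[t] >= prev:                             # server idle: start at upstream departure
            prev, dprev = D[t], dD[t]
        # else: start at the previous departure from this stage (prev, dprev carry over)
        prev += X[t]
        dprev += X[t] / th if i == 0 else 0.0        # dX/dtheta_1 nonzero only at stage 1
        D[t], dD[t] = prev, dprev
print(np.mean(D - arr), np.mean(dD))     # E[sum_i Y_i(theta)] and its IPA theta_1-derivative
```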

Simulation Results: DiGARSM

Four single-server FCFS M/M/1 queues in tandem (series).

[Figure: parameter error (log scale, 10⁰ down to 10⁻²) vs. number of function calls (up to 6 × 10⁴) for DiGARSM, RSM, and RM-SA.]

Summary & Ongoing/Proposed Research

1 Stochastic gradients for stochastic (simulation) optimization of stochastic models
2 New approach to data-driven MLE based on stochastic models and stochastic gradients (GSMLE)

Ongoing/proposed:
– non-convex MLE (Yunchuan Li)
– SP-DiGARSM for higher dimensions (Yunchuan Li)
– SANs: gradient estimation and optimization (Peng Wan)
– spectral index statistical & selection (Yi Zhou)
– GLR extensions (Yi Zhou)
– STAR (???)

Select References

Y. Li and M.C. Fu, “Direct Gradient Augmented Response Surface Methodology as Stochastic Approximation,” Operations Research, to be resubmitted this month.

Y. Li and M.C. Fu, “Sequential First-Order Response Surface Methodology Augmented with Direct Gradients,” Proceedings of the 2018 Winter Simulation Conference, 2143–2154, 2018.

P. Wan and M.C. Fu, “Sensitivity Analysis of Arc Criticalities in Stochastic Activity Networks,” Proceedings of the 2020 Winter Simulation Conference, forthcoming (journal version being prepared for Operations Research).

Y. Peng, M.C. Fu, B. Heidergott, H. Lam, “Maximum Likelihood Estimation By Monte Carlo Simulation: Towards Data-Driven Stochastic Modeling,” Operations Research, accepted December 2019.

Y. Peng, M.C. Fu, J.Q. Hu, B. Heidergott, “A New Unbiased Stochastic Derivative Estimator for Discontinuous Sample Performances with Structural Parameters,” Operations Research, Vol. 66, No. 2, 487–499, 2018. (INFORMS Simulation Society Outstanding Publication Award, Dec. 2019)

Y. Peng, M.C. Fu, J. Hu, P. L’Ecuyer, B. Tuffin, “Generalized Likelihood Ratio Method for Stochastic Models with Uniform Random Numbers As Inputs,” Operations Research, submitted May 2020.

M. Chau, H. Qu, M.C. Fu, “A New Hybrid Stochastic Approximation Algorithm,” Proceedings of the 12th International Workshop on Discrete Event Systems (WODES), 241–246, 2014.

M.C. Fu and H. Qu, “Augmented Regression With Direct Gradient Estimates,” INFORMS Journal on Computing, Vol. 26, No. 3, 484–499, 2014.
