Efficient Monte Carlo Computation of Fisher Information Matrix Using Prior Information

Introduction and Motivation | Contribution/Results of the Proposed Work | Summary

Efficient Monte Carlo computation of Fisher information matrix using prior information

Sonjoy Das, UB
James C. Spall, APL/JHU
Roger Ghanem, USC

SIAM Conference on Data Mining
Anaheim, California, USA
April 26–28, 2012

Sonjoy Das, James C. Spall, Roger Ghanem: Improved resampling algorithm

Outline

1 Introduction and Motivation
    General discussion of Fisher information matrix
    Current resampling algorithm – No use of prior information
2 Contribution/Results of the Proposed Work
    Improved resampling algorithm – using prior information
    Theoretical basis
    Numerical Illustrations

Significance of Fisher Information Matrix

- The fundamental role of data analysis is to extract information from data
- Parameter estimation for models is central to the process of extracting information
- The Fisher information matrix plays a central role in parameter estimation for measuring information: the information matrix summarizes the amount of information in the data relative to the parameters being estimated

Problem Setting

- Consider the classical problem of estimating a parameter vector, $\theta$, from $n$ data vectors, $Z_n \equiv \{Z_1, \cdots, Z_n\}$
- Suppose we have a probability density or mass function (PDF or PMF) associated with the data
- The parameter, $\theta$, appears in the PDF or PMF and affects the nature of the distribution
    Example: $Z_i \sim N(\mu(\theta), \Sigma(\theta))$, for all $i$
- Let $\ell(\theta \mid Z_n)$ represent the likelihood function, i.e., $\ell(\cdot)$ is the PDF or PMF viewed as a function of $\theta$ conditioned on the data

Selected Applications

The information matrix is a measure of performance for several applications. Five uses are:

1 Confidence regions for parameter estimation
    Uses asymptotic normality and/or Cramér-Rao inequality
2 Prediction bounds for mathematical models
3 Basis for “D-optimal” criterion for experimental design
    Information matrix serves as a measure of how well $\theta$ can be estimated for a given set of inputs
4 Basis for “noninformative prior” in Bayesian analysis
    Sometimes used for “objective” Bayesian inference
5 Model selection
    Is model A “better” than model B?

Information Matrix

- Recall the likelihood function, $\ell(\theta \mid Z_n)$, and denote the log-likelihood function by $L(\theta \mid Z_n) \equiv \ln \ell(\theta \mid Z_n)$
- Information matrix defined as
    $$F_n(\theta) \equiv E\left[ \frac{\partial L}{\partial \theta} \cdot \frac{\partial L}{\partial \theta^T} \,\middle|\, \theta \right]$$
  where the expectation is w.r.t. the measure of $Z_n$
- If the Hessian matrix exists, equivalent form based on the Hessian matrix:
    $$F_n(\theta) = -E\left[ \frac{\partial^2 L}{\partial \theta \, \partial \theta^T} \,\middle|\, \theta \right]$$
- $F_n(\theta)$ is positive semidefinite of dimension $p \times p$, $p = \dim(\theta)$

Two Famous Results

The connection between $F_n(\theta)$ and the uncertainty in the estimate, $\hat\theta_n$, is rigorously specified via the following results ($\theta^*$ = true value of $\theta$):

1 Asymptotic normality:
    $$\sqrt{n}\,(\hat\theta_n - \theta^*) \xrightarrow{\text{dist}} N_p(0, \bar F^{-1}), \quad \text{where } \bar F \equiv \lim_{n \to \infty} F_n(\theta^*)/n$$
2 Cramér-Rao inequality:
    $$\operatorname{cov}(\hat\theta_n) \ge F_n(\theta^*)^{-1}, \quad \text{for all } n \text{ (unbiased } \hat\theta_n\text{)}$$

The above two results indicate: greater variability in $\hat\theta_n$ $\Longrightarrow$ “smaller” $F_n(\theta)$ (and vice versa)
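
As a concrete check of the outer-product definition of $F_n(\theta)$, the following minimal sketch averages score outer products over simulated data sets and compares the result with a closed form. The model is an assumption chosen only for illustration (i.i.d. scalar data $Z_i \sim N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)$, for which $F_n(\theta) = n \cdot \mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$), and the function names `score`, `fim_outer_product`, and `fim_exact` are hypothetical, not from the talk.

```python
import numpy as np

def score(theta, z):
    """Gradient of the log-likelihood L(theta | z) for i.i.d. N(mu, var) data."""
    mu, var = theta
    n = z.size
    d_mu = np.sum(z - mu) / var
    d_var = -n / (2.0 * var) + np.sum((z - mu) ** 2) / (2.0 * var ** 2)
    return np.array([d_mu, d_var])

def fim_outer_product(theta, n, num_draws, rng):
    """Monte Carlo average of score * score^T over data sets drawn at theta."""
    mu, var = theta
    total = np.zeros((2, 2))
    for _ in range(num_draws):
        z = rng.normal(mu, np.sqrt(var), size=n)  # one simulated data set Z_n
        g = score(theta, z)
        total += np.outer(g, g)
    return total / num_draws

def fim_exact(theta, n):
    """Closed form for this assumed model: n * diag(1/var, 1/(2 var^2))."""
    _, var = theta
    return n * np.diag([1.0 / var, 1.0 / (2.0 * var ** 2)])

rng = np.random.default_rng(0)
theta = (1.0, 2.0)   # assumed (mu, sigma^2), illustration only
n = 10
F_mc = fim_outer_product(theta, n, num_draws=20000, rng=rng)
F_true = fim_exact(theta, n)
crb_mu = np.linalg.inv(F_true)[0, 0]  # Cramér-Rao bound for mu: sigma^2 / n
print(F_mc)
print(F_true)
print(crb_mu)
```

In this toy model the Cramér-Rao bound for $\mu$ equals $\sigma^2/n$, which is exactly the variance of the sample mean, so the bound is attained. For models without a closed form, neither `score` nor `fim_exact` would be available analytically.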

Monte Carlo Computation of Information Matrix

- Analytical formula for $F_n(\theta)$ requires first or second derivative and expectation calculation
    Often impossible or very difficult to compute in practical applications
    Involves expected value of highly nonlinear (possibly unknown) functions of data
- Schematic next summarizes “easy” Monte Carlo-based method for determining $F_n(\theta)$
    Uses averages of very efficient (simultaneous perturbation) Hessian estimates
    Hessian estimates evaluated at artificial (pseudo) data
    Computational horsepower instead of analytical analysis

Schematic of Monte Carlo Method for Estimating Information Matrix

[Schematic: from the input, pseudo data $Z_{\text{pseudo}}(1), \ldots, Z_{\text{pseudo}}(N)$ are generated; each pseudo data set $Z_{\text{pseudo}}(i)$ yields Hessian estimates $\hat H_k^{(i)}$, $k = 1, \ldots, M$, whose average is $\bar H^{(i)}$; the negative average of $\bar H^{(1)}, \ldots, \bar H^{(N)}$ is the FIM estimate, $\bar F_{M,N}$] (Spall, 2005, JCGS)
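
The resampling scheme in the schematic can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the model is a toy i.i.d. $N(\mu, \sigma^2)$ sample with $\theta = (\mu, \sigma^2)$, the exact gradient of the log-likelihood is assumed available, the perturbations are Bernoulli $\pm 1$ (a common choice that is mean-zero with finite inverse moments), and the names `grad_loglik`, `sp_hessian`, and `fim_resampling` are hypothetical.

```python
import numpy as np

def grad_loglik(theta, z):
    """Exact gradient of the log-likelihood for i.i.d. N(mu, var) data (assumed model)."""
    mu, var = theta
    n = z.size
    return np.array([
        np.sum(z - mu) / var,
        -n / (2.0 * var) + np.sum((z - mu) ** 2) / (2.0 * var ** 2),
    ])

def sp_hessian(theta, z, c, rng):
    """One simultaneous perturbation Hessian estimate from two gradient evaluations."""
    delta = rng.choice([-1.0, 1.0], size=theta.size)  # Bernoulli +/-1 perturbation
    dG = grad_loglik(theta + c * delta, z) - grad_loglik(theta - c * delta, z)
    A = np.outer(dG / (2.0 * c), 1.0 / delta)         # column (dG / 2c) times row of inverses
    return 0.5 * (A + A.T)                            # symmetrize

def fim_resampling(theta, n, N, M, c, rng):
    """Negative average over N pseudo data sets of M-fold averaged SP Hessian estimates."""
    theta = np.asarray(theta, dtype=float)
    mu, var = theta
    H_sum = np.zeros((theta.size, theta.size))
    for _ in range(N):
        z = rng.normal(mu, np.sqrt(var), size=n)      # pseudo data generated at theta
        H_bar = np.mean([sp_hessian(theta, z, c, rng) for _ in range(M)], axis=0)
        H_sum += H_bar
    return -H_sum / N

rng = np.random.default_rng(1)
F_hat = fim_resampling((1.0, 2.0), n=10, N=5000, M=1, c=1e-3, rng=rng)
F_true = 10 * np.diag([1.0 / 2.0, 1.0 / (2.0 * 2.0 ** 2)])  # closed form for this toy model
print(F_hat)
print(F_true)
```

Each Hessian estimate costs two gradient evaluations regardless of $p$, which is what makes the averaging affordable. The toy model is used only because its information matrix is known in closed form, so the estimate can be compared against it.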

Supplement: Simultaneous Perturbation (SP) Hessian and Gradient Estimate

$$\hat H_k^{(i)} = \frac{1}{2}\left\{ \frac{\delta G_k^{(i)}}{2c}\left[\Delta_{k1}^{-1}, \cdots, \Delta_{kp}^{-1}\right] + \left( \frac{\delta G_k^{(i)}}{2c}\left[\Delta_{k1}^{-1}, \cdots, \Delta_{kp}^{-1}\right] \right)^T \right\}$$

where

$$\delta G_k^{(i)} \equiv G(\theta + c\Delta_k \mid Z_{\text{pseudo}}(i)) - G(\theta - c\Delta_k \mid Z_{\text{pseudo}}(i))$$

and

$$G(\theta \pm c\Delta_k \mid Z_{\text{pseudo}}(i)) = \frac{\partial L(\theta \pm c\Delta_k \mid Z_{\text{pseudo}}(i))}{\partial \theta}$$

OR

$$G(\theta \pm c\Delta_k \mid Z_{\text{pseudo}}(i)) = (1/\tilde c)\left[ L(\theta + \tilde c\tilde\Delta_k \pm c\Delta_k \mid Z_{\text{pseudo}}(i)) - L(\theta \pm c\Delta_k \mid Z_{\text{pseudo}}(i)) \right] \begin{bmatrix} \tilde\Delta_{k1}^{-1} \\ \vdots \\ \tilde\Delta_{kp}^{-1} \end{bmatrix}$$

- $\Delta_k = [\Delta_{k1}, \cdots, \Delta_{kp}]^T$ and $\Delta_{k1}, \cdots, \Delta_{kp}$ are mean-zero and statistically independent r.v.s with finite inverse moments
- $\tilde\Delta_k$ has the same statistical properties as $\Delta_k$
- $\tilde c > c > 0$ are “small” numbers

Supplement: Optimal Implementation

Several implementation questions/answers:

Q. How to compute (cheap) Hessian estimates?
A. Use simultaneous perturbation (SP) based method (Spall, 2000, IEEE Trans. Auto. Control)

Q. How to allocate per-realization (M) and across-realization (N) averaging?
A. M = 1 is the optimal solution for a fixed total number of Hessian estimates. However, M > 1 is useful when accounting for the cost of generating pseudo data

Q. Can correlation be introduced to improve overall accuracy of $\bar F_{M,N}(\theta)$?
A. Yes, antithetic random numbers can reduce the variance of elements in $\bar F_{M,N}(\theta)$ (Spall, 2005, JCGS)

Fisher information matrix with analytically known elements

[Figure: pattern of analytically known elements in the Fisher information matrix; row and column indices run from 1 to 22]

Known elements = 54 (Das, Ghanem, Spall, 2008, SISC) (Navigation application at APL)

The previous resampling approach (Spall, 2005, JCGS) yields the “full” Fisher information matrix without
