A Mixture Model Approach to Empirical Bayes Testing and Estimation a Dissertation Submitted to the Department of Statistics

A MIXTURE MODEL APPROACH TO EMPIRICAL BAYES TESTING AND ESTIMATION

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Omkar Muralidharan
May 2011

© 2011 by Omkar Muralidharan. All Rights Reserved.
Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/pp730hw1567

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Bradley Efron, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Robert Tibshirani

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Nancy Zhang

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Many modern statistical problems require making similar decisions or estimates for many different entities. For example, we may ask whether each of 10,000 genes is associated with some disease, or try to measure the degree to which each is associated with the disease. As in this example, the entities can often be divided into a vast majority of “null” objects and a small minority of interesting ones.
Empirical Bayes is a useful technique for such situations, but finding the right empirical Bayes method for each problem can be difficult. Mixture models, however, provide an easy and effective way to apply empirical Bayes. This thesis motivates mixture models by analyzing a simple high-dimensional problem, and shows their practical use by applying them to detecting single nucleotide polymorphisms.

Acknowledgements

I’d like to thank a few of the many people who made this research possible. Brad Efron was the perfect advisor. Nancy Zhang was a fantastic collaborator and mentor. The other faculty in the Statistics department were a constant source of valuable advice, especially Rob Tibshirani and Trevor Hastie. My fellow students were full of entertaining and interesting conversations, especially Ryan Tibshirani, Jacob Bien, Nelson Ray, Noah Simon, Brad Klingenberg, Yi Liu and Ya Xu. Finally, I couldn’t have done this without the support of my family: my wife Aditi, my brother Shravan, and my parents Sudha and Murali.

Contents

Abstract
Acknowledgements
1 Introduction and Outline
1.1 High-Dimensional Data and the “Too much, too little” problem
1.1.1 The Normal Means Problem
1.1.2 Sharing Information, Empirical Bayes and Mixture Models
1.2 Outline of Thesis
1.2.1 Literature Review
1.2.2 Motivation: Marginal Densities and Regret
1.2.3 Methodology: Mixture Models for Normal Means
1.2.4 Application: Calling SNPs
2 Previous Work
2.1 Introduction
2.2 Testing
2.2.1 False Discovery Rates
2.2.1.1 Estimating π0
2.2.1.2 FDR/fdr estimators
2.2.2 Empirical Nulls
2.3 Estimation
2.3.1 James-Stein, Parametric Empirical Bayes and Sparsity
2.3.2 Robbins’ Formula and Nonparametric Empirical Bayes
2.4 This Thesis’ Place
3 Marginal Densities and Regret
3.1 Introduction
3.2 A Tempered NPEB Method
3.2.1 Setup, Regret and the Proposed Estimator
3.2.2 Regret Bounds
3.3 Example: Simultaneous Chi-Squared Estimation
3.3.1 Specializing Theoretical Results
3.3.2 An Empirical Comparison
3.3.2.1 The UMVU estimator and Berger’s estimator
3.3.2.2 A Parametric EB estimator
3.3.2.3 Tempered NPEB estimators
3.3.2.4 Testing Scenarios
3.3.2.5 Results
3.4 Implied Densities and General Estimators
3.4.1 Implied Densities
3.4.2 Regret Bounds
3.5 Summary
3.5.1 Extensions
3.6 Proofs
3.6.1 Proof of Lemma 1
3.6.2 Proof of Theorem 1
3.6.3 Proof of Corollary 1
3.6.4 Proof of Corollary 2
3.6.5 Proof of Theorem 2
3.6.6 Proof of Corollary 3
4 Mixture Models for Testing and Estimation
4.1 Introduction and Setup
4.2 Mixture Model
4.2.1 A Mixture Model
4.2.2 Fitting and Parameter Choice
4.2.3 Identifiability Concerns
4.2.4 Parameter Choice
4.2.5 Example: Binomial Data
4.2.5.1 Brown’s Analysis
4.2.5.2 Mixture Model
4.2.5.3 Results
4.3 Normal Performance
4.3.1 Effect Size Estimation
4.3.1.1 An Asymptotic Comparison
4.3.2 fdr estimation
4.4 Summary
5 Finding SNPs
5.1 Introduction
5.1.1 Overview
5.2 Mixture Model
5.2.1 Model
5.2.2 Calling, Filtering and Genotyping
5.2.3 Fitting
5.2.3.1 E-Step
5.2.3.2 CM-Step
5.2.3.3 Starting Points
5.3 A Single Sample Nonparametric FDR estimator
5.4 Results
5.4.1 Yoruban SNP Calls
5.4.2 Power: Spike-In Simulation
5.4.3 Model-based Simulation
5.5 Summary
Bibliography

List of Tables

3.1 Mean squared errors $\frac{1}{N}\sum(\hat\theta - \theta)^2$ from the simulations under the priors in the text.
The methods are the UMVU estimator, Berger’s estimator, a parametric EB method based on a Gamma prior, a nonparametric EB method based on a log-spline density estimator, and two mixture methods, one with Gamma mixture groups, and another with approximate point prior mixture groups. The quantities shown are averages over 100 simulations, with standard deviations given in parentheses.

3.2 Average relative regret from the simulations under the priors in the text. The average relative regret is $\left[\frac{1}{N}\sum \mathrm{MSE}(\hat\theta)/\mathrm{MSE}(\hat\theta_{\mathrm{bayes}}) - 1\right]$; if $\mathrm{MSE}(\hat\theta_{\mathrm{bayes}})$ is near 0, this can be different from the ratio of entries in table 3.1, $\frac{\sum \mathrm{MSE}(\hat\theta)}{\sum \mathrm{MSE}(\hat\theta_{\mathrm{bayes}})} - 1$. The quantities shown are averages and standard deviations (in parentheses) over 100 simulations. The relative regret is infinite for all methods on prior 6, as the Bayes risk is 0.

4.1 Estimated estimation accuracy (equation 4.4) for the methods. The naive estimator is normalized to have error 1. Values for all methods except the binomial mixture model are from [Brown, 2008]. The first column gives the errors on the data as a whole (single model), and the next two give errors for pitchers and non-pitchers considered separately. Standard errors range from 0.05 to 0.2 on non-pitchers, are higher for pitchers, and are in between for the overall data [Brown, 2008].

4.2 Mean and median relative error for the methods over the simulation scenarios. The relative error is the average of the squared error $\sum(\hat\theta_i - \theta_i)^2$ over the 100 replications, divided by the average squared error for the Bayes estimator.

5.1 Example counts, reference base G. For the spike-in simulations later, we used A as the alternative base.

5.2 Calls on the Yoruban sample by various methods, with estimated FDPs. We used an estimated mean non-null HNRF of $\hat{b} = 0.5$ and an estimated mean null HNRF of $\hat{a} = 0.1$ (the 99th percentile of the mean HNRF over all positions for the Yoruban sample).
The overall FDP estimate was calculated by combining FDP estimates on Bentley’s calls and new calls. The ratios of FDP estimates between methods are more reliable than the individual levels.

5.3 Noisy null position, reference base T, spiked alternative base G.

5.4 Binned true fdr and estimated fdr.

List of Figures

3.1 Simulation priors as described in the text.

