Machine Learning with Dirichlet and Beta Process
Priors: Theory and Applications
by
John Paisley
Department of Electrical & Computer Engineering Duke University
Date:
Approved:
Lawrence Carin, Advisor
David Brady
Rebecca Willett Lu
David Dunson
Mauro Maggioni
Dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical & Computer Engineering in the Graduate School of Duke University
2010

Abstract (Machine Learning)
Machine Learning with Dirichlet and Beta Process Priors: Theory and Applications
by
John Paisley
Department of Electrical & Computer Engineering Duke University
Date:
Approved:
Lawrence Carin, Advisor
David Brady
Rebecca Willett Lu
David Dunson
Mauro Maggioni

An abstract of a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical & Computer Engineering in the Graduate School of Duke University
2010

Copyright © 2010 by John Paisley
All rights reserved except the rights granted by the Creative Commons Attribution-Noncommercial Licence

Abstract
Bayesian nonparametric methods are useful for modeling data without having to define the complexity of the entire model a priori, but rather allowing for this complexity to be determined by the data. Two problems considered in this dissertation are the number of components in a mixture model, and the number of factors in a latent factor model, for which the Dirichlet process and the beta process are the two respective Bayesian nonparametric priors selected for handling these issues.

The flexibility of Bayesian nonparametric priors arises from the prior’s definition over an infinite dimensional parameter space. Therefore, there are theoretically an infinite number of latent components and an infinite number of latent factors. Nevertheless, draws from each respective prior will produce only a small number of components or factors that appear in a given data set. As mentioned, the number of these components and factors, and their corresponding parameter values, are left for the data to decide.

This dissertation is split between novel practical applications and novel theoretical results for these priors. For the Dirichlet process, we investigate stick-breaking representations for the finite Dirichlet process and their application to novel sampling techniques, as well as a novel mixture modeling framework that incorporates multiple modalities within a data set. For the beta process, we present a new stick-breaking construction for the infinite-dimensional prior, and consider applications to image interpolation problems and dictionary learning for compressive sensing.
Contents
Abstract iv
List of Tables vi
List of Figures vii
Acknowledgements viii
Introduction 1
1 The Dirichlet Process for Mixture Models 7
1.1 Abstract...... 7
1.2 The Dirichlet Distribution...... 7
1.2.1 Calculating the Posterior of π ...... 10
1.3 The P´olya Urn Process...... 12
1.4 Constructing the Finite-Dimensional Dirichlet Distribution...... 15
1.4.1 Proof of the Construction...... 16
1.5 The Extension to Infinite-Dimensional Spaces...... 19
1.6 Inference for the Dirichlet Process Mixture Model...... 22
1.6.1 Dirichlet Process Mixture Models...... 22
1.6.2 Gibbs Sampling for Dirichlet Process Mixture Models..... 22
1.7 The Expectation and Variance of the Entropy of Dirichlet Processes. 26
1.7.1 Derivation of the Expectation...... 28
1.7.2 Derivation of the Variance...... 29
1.8 Appendix...... 32
2 Sethuraman’s Constructive Definition for Finite Mixture Models 33
2.1 Abstract...... 33
2.2 Comparing Dir(αg0) and DP(αG0) Priors Using Constructive Definitions...... 33
2.2.1 Statistical Properties of π...... 34
2.2.2 Connecting Dir(αg0) and DP(αG0) Priors Via π...... 35
2.3 Applications of the Construction of Dir(αg0)...... 40
2.3.1 Inference for α Using the Constructive Definition of Dir(αg0)...... 40
2.3.2 Inference for the Hierarchical Dirichlet Process...... 47
2.4 Appendix...... 52
3 Dirichlet Processes with Product Base Measures 53
3.1 Abstract...... 53
3.2 Introduction...... 54
3.3 The Dirichlet Process with Product Base Measure...... 55
3.3.1 Predicting Values for Missing Modalities...... 56
3.4 MCMC Inference for DP-PBM Mixture Models...... 57
3.5 Applications: The Gaussian-HMM Mixture Model...... 59
3.5.1 Experiment with Synthesized Data...... 59
3.5.2 Major League Baseball Data Set...... 60
3.6 Conclusions...... 62
4 The Beta Process for Latent Factor Models 63
4.1 Abstract...... 63
4.2 Introduction...... 63
4.3 The Beta Process...... 65
4.3.1 The Marginalized Beta Process and the Indian Buffet Process 66
4.3.2 Finite Approximation to the Beta Process...... 68
4.4 Beta Process Factor Analysis...... 69
4.5 Variational Bayesian Inference...... 72
4.5.1 The VB-E Step...... 73
4.5.2 The VB-M Step...... 73
4.5.3 Accelerated VB Inference...... 76
4.5.4 Prediction for New Observations...... 76
4.6 Experiments...... 76
4.6.1 A Synthetic Example...... 76
4.6.2 MNIST Handwritten Digits Dataset...... 78
4.6.3 HGDP-CEPH Cell Line Panel...... 79
4.6.4 Learning Dictionaries for Compressive Sensing Applications. 81
4.7 Conclusion...... 87
5 A Stick-Breaking Construction of the Beta Process 89
5.1 Abstract...... 89
5.2 Introduction...... 89
5.3 The Beta Process...... 91
5.3.1 A Construction of the Beta Distribution...... 92
5.3.2 Related Work...... 93
5.4 A Stick-Breaking Construction of the Beta Process...... 94
5.4.1 Derivation of the Construction...... 95
5.5 Inference for the Stick-Breaking Construction...... 98
5.5.1 Inference for dk...... 99
5.5.2 Inference for γ...... 101
5.5.3 Inference for α ...... 102
5.5.4 Inference for p(znk = 1|α, dk, Zprev)...... 102
5.6 Experiments...... 103
5.6.1 Synthetic Data...... 103
5.6.2 MNIST Handwritten Digits...... 103
5.6.3 Time-Evolving Gene Expression Data...... 105
5.7 Conclusion...... 108
5.8 Appendix...... 108
6 Image Interpolation Using Dirichlet and Beta Process Priors 109
6.1 Abstract...... 109
6.2 Introduction...... 109
6.3 The Model...... 110
6.3.1 Handling Missing Data...... 112
6.4 Model Inference...... 113
6.4.1 Maximum A Posteriori Updates and Collapsed Probabilities. 114
6.4.2 Gibbs Sampling of Latent Indicators...... 115
6.5 Related Algorithms...... 117
6.5.1 Orthogonal Matching Pursuits...... 117
6.5.2 Method of Optimal Directions...... 118
6.5.3 K-SVD...... 118
6.5.4 Iterative Minimum Mean Squared Error...... 120
6.6 Experiments...... 121
6.7 Conclusion...... 123
7 Conclusion 136
Bibliography 138
Biography 145
List of Tables
6.1 Average per-iteration run time for algorithms as function of percent missing data (castle image). Comparison is not meaningful for the iMMSE algorithm (which is significantly faster)...... 125
6.2 Average per-iteration run time for algorithms as function of percent missing data for the hyperspectral image problem using 3 × 3 × 210 patches...... 125
6.3 Average per-iteration run time for algorithms as function of percent missing data for the hyperspectral image problem using 4 × 4 × 210 patches...... 125
List of Figures
1 (a) The original data set. (b) Clustering results for a Gaussian mixture model using a sparsity-promoting Dirichlet prior on the mixing weights and learned using VB-EM inference. Of the initial 20 components, only 3 are ultimately used and shown. (c) Clustering results for a Gaussian mixture model using the maximum likelihood EM algorithm to learn the model parameters. All 20 components are used by the data, resulting in clear overfitting...... 2
2 The empirical Kullback-Leibler divergence between the true underlying HMM and the learned HMM using the ML-EM algorithm (blue) and the fully Bayesian VB-EM algorithm (red). This figure is taken from [52]...... 3
3 The RMSE of the interpolated matrix values to the true values for a matrix completion problem where the symmetric matrix is modeled as X = Φ^T Φ + E. The x-axis is a function of increasing number of measurements. This figure is taken from [55]...... 4
1.1 10,000 samples from a 3-dimensional Dirichlet distribution with g0 uniform and (a) α = 1, (b) α = 3, (c) α = 10. As can be seen, (a) when α < 3, the samples concentrate near vertices and edges of ∆3; (b) when α = 3, the density is uniform; and (c) when α > 3, the density shrinks toward g0...... 9
1.2 An illustration of the infinite stick-breaking construction of a K-dimensional Dirichlet distribution. Weights are drawn according to a Beta(1, α) stick-breaking process, with corresponding locations taking value k with probability g0k...... 16
1.3 The expectation and variance of the entropy of G ∼ DP(αG0) for α between 0.1 and 1000 with steps of 0.1. As can be seen, the expected entropy increases as α increases, while the variance of this entropy decreases for α > 1...... 28
2.1 An illustration of error to the Dirichlet process of the truncated Dirichlet process and finite-dimensional Dirichlet distribution mixture model for 9 mixture components. All priors are defined over the two-dimensional space (S, A, G0). The vertical lines indicate probability weights at a given location. All three priors share the first 9 locations and a single stick-breaking process. Shown are (a) The Dirichlet process, where gray indicates that all locations are included in the construction. (b) The K = 9 truncation of the Dirichlet process. The remaining mass is added in error to the 9th atom, indicated by red. (c) The finite Dirichlet distribution. Because the index of the location of each mass is drawn iid from g0, sticks can be placed on top of other sticks at any point following the first break. All sticks for which this occurs are colored red, and these sticks are distributed in violation (i.e., in error) of the policy for Dirichlet processes...... 39
2.2 Using Sethuraman’s construction to infer the value of α in the Dir(αg0) distribution. A total of 5000 trials are shown using synthetic data. In each trial, a true value for α was randomly generated, followed by a vector π ∼ Dir(αg0) and N = 1000 samples from π. Each point in the plot indicates the inferred value of α compared with the actual value...... 46
2.3 Learning the concentration parameters of the hierarchical Dirichlet process using Sethuraman’s construction. Values for α and β were randomly generated and the probability vector g0 = (w1, . . . , wK ) was generated from a stick-breaking construction using the generated value of α and truncated at < 10−6. A total of 5000 trials are shown for (left) the inferred values of the top-level concentration parameter, and (right) the second-level concentration parameter...... 51
2.4 A histogram of the L1 distance between the true g0 = (w1, . . . , wK ) and the sampled values of this vector for 5000 trials. The maximum value of this distance is two...... 51
3.1 An example of a mixed Gaussian-HMM data set. (a) Gaussian mix- ture model results. (b) Gaussian-HMM mixture results. Each ellipse corresponds to a cluster...... 60
3.2 Component membership results for MLB data when (a) X1 data is ignored – the HMM mixture model. (b) both X1 and X2 data is used – the Gaussian-HMM mixture model...... 61
4.1 Estimation of π from 5000 marginal beta process runs of 500 samples each, with various a, b initializations...... 68
4.2 A graphical representation of the BPFA model...... 71
4.3 Synthetic Data: Latent factor indicators, Z, for the true (top) and inferred (bottom) models...... 77
4.4 (top) Inferred π indicating sparse factor usage. (bottom) An example reconstruction...... 78
4.5 Left: Expected factor sharing between digits. Right: (left) Most fre- quently used factors for each digit (right) Most used second factor per digit given left factor...... 79
4.6 Factor sharing across geographic regions...... 80
4.7 Variance of HGDP-CEPH data along the first 150 principal compo- nents of the raw features for original and reconstructed data...... 80
4.8 HGDP-CEPH features projected onto the first 20 principal compo- nents of the raw features for the (top) original and (bottom) recon- structed data. The broad geographic breakdown is given between the images...... 81
4.9 An illustration of the constructed basis using the learned dictionary Φ. Each block diagonal matrix, Φ ∈ R^{64×81}, is responsible for reconstructing a patch in the original image and each column is normalized prior to inversion. For our application, we broke each image into non-overlapping 8 × 8 patches for reconstruction. The number of sparse coefficients, θ, to be learned therefore increases to N^+ = 1.265625N...... 84
4.10 The reconstructed MSE for different basis representations and differ- ent compressive measurement numbers...... 85
4.11 Compressive sensing reconstruction results using the RVM for the dic- tionary basis learned with BPFA, the PCA basis and the 2D DCT and wavelet bases for different numbers of compressive measurements... 86
4.12 The PCA dictionary (left) and BPFA dictionary (right) used for inver- sion. The relaxation of the orthogonality constraint for BPFA can be seen to produce dictionaries that are more natural for reconstructing the images of interest...... 87
4.13 The learned, sparse coefficients sorted by absolute value. The over- complete dictionary required inference for 15,744 coefficients, com- pared with 12,288 coefficients for the other bases. However, the in- ferred sparseness is comparable...... 88
5.1 Synthetic results for learning α and γ. For each trial of 150 iterations, 10 samples were collected and averaged over the last 50 iterations. The step size ∆α = 0.1. (a) Inferred γ vs true γ (b) Inferred α vs true α (c) A plane, shown as an image, fit using least squares that shows the ℓ1 distance of the inferred (αout, γout) to the true (αtrue, γtrue)...... 104
5.2 Results for MNIST digits 3, 5 and 8. Top left: The number of factors as a function of iteration number. Top right: A histogram of the number of factors after 1000 burn-in iterations. Middle row: Several example learned factors. Bottom row: The probability of a digit possessing the factor directly above...... 105
5.3 Results for time-evolving gene expression data. Top row: (left) Num- ber of factors per iteration (middle) Histogram of the total number of factors after 1000 burn-in iterations (right) Histogram of the number of factors used per observation. Rows 2-5: Discriminative factors and the names of the most important genes associated with each factor (as determined by weight)...... 107
6.1 Castle image: PSNR of interpolated missing data using 5 × 5 × 3 patches averaged over five trials. The proposed algorithm performs well for low-measurement percentages. We set K = 100 and D = 50.. 126
6.2 Mushroom image: PSNR of interpolated missing data using 5 × 5 × 3 patches averaged over five trials. The proposed algorithm performs well for low-measurement percentages, but never better than iMMSE for this image. We set K = 100 and D = 50...... 126
6.3 Example result (5×5×3 patch): (a) Original image, (b) 80% random missing, (c) Reconstructed image: PSNR = 28.76 (d) Clustering re- sults: Cluster index as a function of pixel location...... 127
6.4 Example result (5×5×3 patch): (a) Original image, (b) 80% random missing, (c) Reconstructed image: PSNR = 28.76 (d) Clustering re- sults: Cluster index as a function of pixel location...... 128
6.5 Reconstructions for 80% missing and 5×5×3 patches for (top-left and clockwise) the proposed algorithm, the proposed algorithm without spatial information, the proposed algorithm without the DP, K-SVD, MOD and iMMSE...... 129
6.6 Reconstructions for (from top to bottom) 80%, 85%, 90% and 95% missing and 5 × 5 × 3 patches for all algorithms considered...... 130
6.7 Reconstruction results for 75% missing for (upper-right) iMMSE, (lower-left) Proposed algorithm, (lower-right) K-SVD...... 131
6.8 Clustering results (5×5×3 patch) using no spatial information for 80% missing: (left) Castle image, (right) Mushroom image. Because no spatial continuity is enforced in the DP prior, no spatially meaningful clustering takes place, which results in a slightly worse reconstruction. 132
6.9 Hyperspectral Data: The MSE of the reconstruction using 3 × 3 × 210 patches and 95% missing data. The plot shows the MSE over spectral band number 60 to 100...... 133
6.10 Hyperspectral Data: The MSE of the reconstruction using 4 × 4 × 210 patches and 95% missing data. The plot shows the MSE over spectral band number 60 to 100...... 133
6.11 Reconstruction results for the indicated spectral bands using 3 × 3 × 210 patches. The plots according to row, starting with the top, are: 1. original data, 2. BP, DP & Spatial, 3. BP & DP, No Spatial, 4. BP Only, 5. K-SVD, 6. MOD...... 134
6.12 Reconstruction results for the indicated spectral bands using 4 × 4 × 210 patches. The plots according to row, starting with the top, are: 1. original data, 2. BP, DP & Spatial, 3. BP & DP, No Spatial, 4. BP Only, 5. K-SVD, 6. MOD...... 135
Acknowledgements
I am very grateful to my advisor Lawrence Carin for all of his help and guidance in my pursuit of the PhD. I would also like to thank William Joines, without whom I would not be in this position in the first place. I thank David Dunson and Mauro Maggioni for their constructive criticisms on my work over the past four years. I also appreciate the help of David Blei at Princeton University, whose conversations have led to additions that have strengthened this dissertation. I thank my parents, Scott and Jackie Paisley, for their support throughout my life in my academic pursuits, and my sisters Jenny and Sarah for keeping me grounded. Finally, I would like to thank all of my tóngxué for their friendships and for bringing half of the world to my attention.
Introduction
Bayesian nonparametric methods are useful for modeling data without having to define the complexity of the entire model a priori, but rather allowing for this complexity to be determined by the data. Two problems considered in this dissertation are the number of components in a mixture model, and the number of factors in a latent factor model, for which the Dirichlet process and the beta process are the two respective Bayesian nonparametric priors selected for handling these issues.

The flexibility of Bayesian nonparametric priors arises from the prior’s definition over an infinite dimensional parameter space. Therefore, there are theoretically an infinite number of latent components and an infinite number of latent factors. Nevertheless, draws from each respective prior will produce only a small number of components or factors that appear in a given data set. As mentioned, the number of these components and factors, and their corresponding parameter values, are left for the data to decide.

Below, we briefly give three examples of problems that motivate Bayesian nonparametric methods and clearly illustrate their utility. These examples concern three standard modeling problems: (i) the Gaussian mixture model [9], (ii) the hidden Markov model [61][9] and (iii) the matrix factorization problem [55] (and references therein). These last two examples are taken from research published by the author, but not included in this dissertation.
Figure 1: (a) The original data set. (b) Clustering results for a Gaussian mixture model using a sparsity-promoting Dirichlet prior on the mixing weights and learned using VB-EM inference. Of the initial 20 components, only 3 are ultimately used and shown. (c) Clustering results for a Gaussian mixture model using the maximum likelihood EM algorithm to learn the model parameters. All 20 components are used by the data, resulting in clear overfitting.
The Gaussian Mixture Model
The Gaussian mixture model (GMM) models vectors $x \in \mathbb{R}^d$ in a data set, $\{x_n\}_{n=1}^N$, as being generated according to the distribution

$$x_n \stackrel{iid}{\sim} \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x\,|\,\mu_k, \Sigma_k) \qquad (1)$$
The vector (π1, . . . , πK ) consists of probability weights that sum to one. The value
πk gives the probability that observation xn is generated from a Gaussian with mean
vector µk and covariance matrix Σk.
The classical approach to learning the parameters $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^K$ is by maximum likelihood using the EM algorithm [21]. In this algorithm, the value of K is fixed and the algorithm iterates between updating the distribution of a latent indicator variable, and updating the parameters to maximize the likelihood given the soft clustering induced by this latent indicator. In Figure 1, we show how this can lead to overfitting if K is selected poorly. Extending the GMM to the Bayesian realm, Figure 1 shows a clear advantage. The prior on π is the Dirichlet distribution and will be discussed in detail in this dissertation.
Figure 2: The empirical Kullback-Leibler divergence between the true underlying HMM and the learned HMM using the ML-EM algorithm (blue) and the fully Bayesian VB-EM algorithm (red). This figure is taken from [52].
The Hidden Markov Model
Hidden Markov models (HMMs) are useful for modeling sequential data. They model
a sequence, (X1,X2,... ), as being generated from a distribution whose parameters
are selected by a hidden sequence of indicators, (S1, S2, . . . ). This hidden sequence is a Markov chain and the number of states in this chain, K, is typically set a priori. The classical approach to learning the transition probabilities and state-dependent parameter values is by maximum likelihood using the EM algorithm [61]. As with the GMM, this iterates between learning distributions on the latent state indicators, followed by an updating of all parameters to maximize the likelihood. In Figure 2, we show a comparison of the maximum likelihood approach with the Bayesian approach as a function of the initialization of the number of states. We generated sequences from a 4-state discrete HMM and used both methods to learn the underlying HMM given these sequences. The plot contains an empirical approximation of the Kullback-Leibler divergence [33] between the true model and the inferred model. As is evident, the ML approach is sensitive to the number of states, while the Bayesian approach sets the probabilities of all superfluous states to zero.
Figure 3: The RMSE of the interpolated matrix values to the true values for a matrix completion problem where the symmetric matrix is modeled as X = Φ^T Φ + E. The x-axis is a function of increasing number of measurements. This figure is taken from [55].
Matrix Factorization Models
As a final example, we consider learning factorizations of incomplete matrices with the goal of completing them. Consider the positive semidefinite matrix, X, and a low rank factorization X = Φ^T Φ + E, where E accounts for errors due to smaller eigenvalues. When many values of X are missing, this factorization cannot be learned via an eigendecomposition, and so Φ must be learned using an algorithm. Two least squares methods for solving this problem consist of gradually increasing the rank of the factorization, and then minimizing the squared error to the measured values. For each increase, all values in Φ can change (iMMSE), or only the values in the added dimension (greedy) [51]. A Bayesian method is to model the columns of Φ with a sparsity-promoting normal-gamma prior, where the gamma-distributed precision values are shared among the columns of Φ. This learns the proper rank by squeezing out all unnecessary dimensions. Theoretically, the dimensionality of Φ is infinite, but in practice the number of nonzero dimensions will be only that required by the data. In Figure 3, we show results where the matrix X is a 2250 × 2250 Gaussian kernel that measures similarity between pieces of music.
This dissertation is organized as follows:
• Chapter 1: In Chapter 1, we review the Dirichlet process in detail. This includes derivations of the two common representations of this infinite-dimensional prior. We end the chapter by calculating the expectation and variance of the entropy of Dirichlet processes. To our knowledge, this is a new calculation and is an addition to the theoretical properties of measures drawn from Dirichlet process priors.
• Chapter 2: In Chapter 2, we continue the discussion of Chapter 1 by looking more in-depth at Sethuraman’s stick-breaking construction of a finite-dimensional Dirichlet prior [68]. This includes a novel comparison of the finite-symmetric Dirichlet distribution and the truncated Dirichlet process as mixture modeling priors, and two new applications of this construction for (i) performing conjugate inference for the concentration parameter of a Dirichlet distribution and (ii) conjugate inference for the hierarchical Dirichlet process [70].
• Chapter 3: In Chapter 3, we extend the framework of Dirichlet process priors to include data that has multiple modalities. This simple modification allows for mixture modeling to be performed jointly on multiple aspects of a data set. A novel application of this framework is discussed later in Chapter 6.
• Chapter 4: In Chapter 4, we shift attention to the beta process for Bayesian nonparametric learning of latent factor models. This can also be cast as a nonparametric matrix factorization model, or a method for nonparametric dictionary learning. We give a general review of the problem, followed by a new variational inference algorithm for model learning. We apply the model to several data sets, and also consider a compressive sensing application, where the beta process is used to nonparametrically learn an overcomplete dictionary to be used as a nonorthogonal basis for CS inversion.
• Chapter 5: In Chapter 5, we present a new stick-breaking construction of the beta process. We believe this is as significant a contribution to the theory of beta processes as the stick-breaking construction of the Dirichlet process was to the theory of Dirichlet processes. We give a proof of the construction, as well as a method for performing inference for this prior. We apply the prior to several data sets, including a time-evolving gene data set.
• Chapter 6: In Chapter 6, we incorporate the model of Chapter 4 into the framework presented in Chapter 3 by presenting new models for image interpolation. The central model uses two modalities, the second of which acts as a spatial prior. Images considered include natural (RGB) images, and hyperspectral images.
1
The Dirichlet Process for Mixture Models
1.1 Abstract
In this chapter, we review the Dirichlet process, beginning with a review of the finite-dimensional Dirichlet distribution and its representations as a Pólya urn process and an infinite stick-breaking process. We then briefly extend these ideas to infinite-dimensional spaces. In Section 1.7, we calculate the expectation and variance of the entropy of probability measures drawn from the Dirichlet process prior. To our knowledge, this calculation is a new contribution to the theory of Dirichlet processes.
1.2 The Dirichlet Distribution
Consider the finite, K-dimensional vector, π = (π1, . . . , πK), where 0 ≤ πk ≤ 1 and $\sum_{k=1}^{K} \pi_k = 1$. Vectors of this form are said to reside in the (K − 1)-dimensional simplex of $\mathbb{R}^K$, denoted π ∈ ∆K. We view this vector as the parameter for the multinomial distribution, where samples X ∼ Mult({1, . . . , K}, π) take values X ∈ {1, . . . , K} with the probability that X = k equal to πk. When the vector π is unknown, it can be inferred in the Bayesian setting by using its conjugate prior, the Dirichlet distribution. The Dirichlet distribution of dimension K is a continuous probability distribution on ∆K and has the density function
$$p(\pi\,|\,\beta_1, \dots, \beta_K) = \frac{\Gamma\!\left(\sum_k \beta_k\right)}{\prod_k \Gamma(\beta_k)} \prod_{k=1}^{K} \pi_k^{\beta_k - 1} \qquad (1.1)$$

where the parameters βk ≥ 0, ∀k. It is useful to reparameterize this distribution by defining $\alpha := \sum_k \beta_k$ and the vector g0 ∈ ∆K, with $g_{0k} := \beta_k / \sum_k \beta_k$.

$$p(\pi\,|\,\alpha g_{01}, \dots, \alpha g_{0K}) = \frac{\Gamma(\alpha)}{\prod_i \Gamma(\alpha g_{0i})} \prod_{i=1}^{K} \pi_i^{\alpha g_{0i} - 1} \qquad (1.2)$$
A vector with this distribution is denoted π ∼ Dir(αg0). The mean and variance of an element in π are

$$\mathbb{E}[\pi_k\,|\,\alpha g_0] = g_{0k}, \qquad \mathbb{V}[\pi_k\,|\,\alpha g_0] = \frac{g_{0k}(1 - g_{0k})}{\alpha(\alpha + 1)} \qquad (1.3)$$
Therefore, g0 functions as a prior guess of π and α as a strength parameter, control-
ling how tight the distribution is around g0. When g0k = 0, πk = 0 with probability one, and when g0i = 1 and g0k = 0, ∀k ≠ i, π = ei with probability one, where ei is a vector of zeros, except for a one in the ith position. Figure 1.1 shows plots, each containing 10,000 samples drawn from a 3-dimensional
Dirichlet distribution with g0 uniform and α = 1, 3, 10. This gives insight into the function of α. When α = K, or the dimensionality of the Dirichlet distribution, we see that the density is uniform on the simplex; when α > K, the density begins
to cluster around g0. Perhaps more interesting, and more relevant to the Dirichlet process, is when α < K. We see that as α becomes less than the dimensionality of the distribution, most of the density lies on the corners and faces of the simplex. In general, as the ratio of α to K shrinks, draws of π will be sparse, meaning that
Figure 1.1: 10,000 samples from a 3-dimensional Dirichlet distribution with g0 uniform and (a) α = 1, (b) α = 3, (c) α = 10. As can be seen, (a) when α < 3, the samples concentrate near vertices and edges of ∆3; (b) when α = 3, the density is uniform; and (c) when α > 3, the density shrinks toward g0.
most of the probability mass will be contained in a subset of the elements of π. This phenomenon will be discussed in greater detail in Section 1.4, and is a crucial element of the Dirichlet process. Values of π can be drawn from Dir(αg0) in a finite number of steps using the following two methods (two infinite-step methods will be discussed shortly).
A Function of Gamma-Distributed Random Variables [78] Gamma-distributed random variables can be used to sample π ∼ Dir(αg0) as follows: Let Zi ∼ Gamma(αg0i, λ)
for i = 1, . . . , K, where αg0i is the shape parameter and λ the scale parameter of the gamma distribution. Then the vector $\pi := \left(\frac{Z_1}{\sum_i Z_i}, \dots, \frac{Z_K}{\sum_i Z_i}\right)$ has a Dir(αg0) distribution. The parameter λ can be set to any positive, real value, but must remain constant.
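The gamma construction is straightforward to implement. The following is a minimal NumPy sketch (not part of the original text) that normalizes K gamma variates with λ = 1; any fixed positive scale gives the same distribution. With small α relative to K the draws are sparse, illustrating the discussion above.

```python
import numpy as np

def sample_dirichlet_gamma(alpha, g0, rng=np.random.default_rng()):
    """Draw pi ~ Dir(alpha * g0) by normalizing gamma random variables."""
    # Z_i ~ Gamma(alpha * g0_i, lambda); lambda = 1 here, any fixed positive value works
    z = rng.gamma(shape=alpha * np.asarray(g0), scale=1.0)
    return z / z.sum()

# Example: sparse draw when alpha is small relative to K, near-uniform when alpha is large
g0 = np.ones(10) / 10
print(sample_dirichlet_gamma(0.5, g0))   # most mass on a few components
print(sample_dirichlet_gamma(50.0, g0))  # close to the prior guess g0
```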
A Function of Beta-Distributed Random Variables [19] Beta-distributed random variables can also be used to draw from the Dirichlet distribution. For k = 1, . . . , K − 1,
let

$$V_k \sim \text{Beta}\!\left(\alpha g_{0k},\; \alpha \sum_{\ell=k+1}^{K} g_{0\ell}\right)$$

$$\pi_k = V_k \prod_{\ell=1}^{k-1} (1 - V_\ell), \qquad \pi_K = 1 - \sum_{k=1}^{K-1} \pi_k \qquad (1.4)$$
The resulting vector, (π1, . . . , πK ) ∼ Dir(αg0). A second method that uses an infinite number of beta-distributed random variables will be discussed in Section 1.4.
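A sketch of the finite construction in (1.4), again in NumPy and purely illustrative: K − 1 beta variables are drawn, each breaking a proportion from the remaining stick, and the last weight takes whatever mass is left.

```python
import numpy as np

def sample_dirichlet_beta(alpha, g0, rng=np.random.default_rng()):
    """Draw pi ~ Dir(alpha * g0) via the finite beta construction in (1.4)."""
    g0 = np.asarray(g0, dtype=float)
    K = len(g0)
    pi = np.zeros(K)
    remaining = 1.0  # the unbroken portion of the stick, prod_{l<k} (1 - V_l)
    for k in range(K - 1):
        v = rng.beta(alpha * g0[k], alpha * g0[k + 1:].sum())
        pi[k] = v * remaining
        remaining *= (1.0 - v)
    pi[K - 1] = 1.0 - pi[:K - 1].sum()  # pi_K takes the remaining mass
    return pi

print(sample_dirichlet_beta(3.0, np.ones(3) / 3))
```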
1.2.1 Calculating the Posterior of π
As indicated above, the Dirichlet distribution is conjugate to the multinomial distribution, meaning that the posterior distribution of π can be calculated analytically, and is also a Dirichlet distribution. Using Bayes’ theorem,
$$p(\pi\,|\,X=i, \alpha g_0) = \frac{p(X=i\,|\,\pi)\, p(\pi\,|\,\alpha g_0)}{\int_{\pi \in \Delta_K} p(X=i\,|\,\pi)\, p(\pi\,|\,\alpha g_0)\, d\pi} \;\propto\; p(X=i\,|\,\pi)\, p(\pi\,|\,\alpha g_0)$$
we can calculate the posterior distribution of π given the observation X. First, the
likelihood of an observation is p(X = i|π) = πi. Multiplying this by the prior,
p(π|αg0), given in (1.2), the posterior is proportional to
$$p(\pi\,|\,X=i) \;\propto\; \pi_i^{(\alpha g_{0i}+1)-1} \prod_{j \neq i} \pi_j^{\alpha g_{0j}-1} \qquad (1.5)$$

The normalizing constant is simply a number that makes this function integrate to one. However, looking at (1.5), it can be seen that this function is proportional to the Dir(αg0 + ei) distribution. Therefore, the ith parameter of the Dirichlet distribution has simply been incremented by one. This extends naturally to N observations,

$$p(\pi\,|\,X_1=x_1, \dots, X_N=x_N) \;\propto\; \prod_{n=1}^{N} p(X_n=x_n\,|\,\pi)\, p(\pi\,|\,\alpha g_0) = \prod_{i=1}^{K} \pi_i^{\alpha g_{0i}+n_i-1} \qquad (1.6)$$

where $n_k = \sum_{n=1}^{N} e_{X_n}(k)$, or the number of observations taking value k, and $\sum_{k=1}^{K} n_k =$
N. When normalized, the posterior is Dir(αg01 + n1, . . . , αg0K + nK ). Therefore, when used as a conjugate prior to the multinomial distribution, the posterior of π is a Dirichlet distribution, where the parameters have been updated with the “counts” from the observed data. The interaction between the prior and the data can be seen in the posterior expectation of an element, πk,
$$\mathbb{E}[\pi_k\,|\,X_1=x_1, \dots, X_N=x_N] = \frac{n_k + \alpha g_{0k}}{\alpha + N} = \frac{n_k}{\alpha + N} + \frac{\alpha g_{0k}}{\alpha + N} \qquad (1.7)$$
The last expression clearly shows the tradeoff between the prior and the data. A good example of a simple application of these ideas is found at the website IMDB.com.1 This website allows users to evaluate movies using a star rating system by assigning an integer value between one and ten to a movie, with ten being the best rating. Using these ratings, they provide a ranking of the top 250 movies according to the IMDB community.2 Clearly they do not want to sort by the empirical average star rating, since this will result in many movies with only a few ratings being highly ranked, and because the average rating of a movie that has many votes can be trusted more than one with only a few votes. They resolve this issue in the following way.
Assume that movie m has an underlying probability vector, $\pi^{(m)} \in \Delta_{10}$, with the probability that a person gives k stars to movie m equal to $\pi_k^{(m)}$. If this vector were known, then the average star rating converges to the expected star rating, which can be calculated analytically and equals $R_m = \sum_{k=1}^{10} k\, \pi_k^{(m)}$. However, the vector $\pi^{(m)}$ is not known, and so IMDB places a Dirichlet prior on this vector. They set α = 3000 and define g0 to be the empirical distribution of the star values using ratings for all
1 The Internet Movie Database, which at the time of this writing is ranked the 23rd most popular site on the internet in the USA according to Alexa.com. 2 http://www.imdb.com/chart/top. The definition of their Bayes estimator is given at the bottom of the page and is equivalent to our more detailed description.
movies. They then integrate out $\pi^{(m)}$ and approximate the rating of movie m as

$$\hat{R}_m = \sum_{k=1}^{10} k\, \frac{n_k}{\alpha + N} + \sum_{k=1}^{10} k\, \frac{\alpha g_{0k}}{\alpha + N}$$

that is, they use the posterior expectation of $\pi^{(m)}$ given in (1.7) to approximate the rating. The interpretation of α and g0 is clear in this example: α functions as a prior
number of observations, in this case ratings, and αg0k as a prior count of the number of observations in group k. These ratings are “made up” by the person compiling the list. Therefore, the number of ratings actually observed, N, is overwhelmed by this prior at first, but gradually comes to dominate it. As N → ∞, the average rating converges to the average using only the user ratings. This is an example of a Bayes estimate of the value Rm using a Dirichlet prior distribution, and the benefit of its use in this situation is clear.

Returning to the general discussion, two methods requiring an infinite number of random variables, the Pólya urn process and Sethuraman’s constructive definition, also exist for sampling π ∼ Dir(αg0). These methods are more complicated in nature, but become essential for sampling from the infinite-dimensional Dirichlet process. Because they are arguably easier to understand in the finite setting (i.e., finite K), they are given here, which will allow for a more intuitive extension to the infinite-dimensional case.
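Before turning to those two constructions, the conjugate update (1.6)–(1.7) and the IMDB-style shrinkage estimate discussed above can be summarized in a few lines of NumPy. This is a hypothetical sketch: the prior counts and the observed star counts are made-up numbers, not actual IMDB data.

```python
import numpy as np

alpha = 3000.0                     # prior "number of ratings"
g0 = np.full(10, 0.1)              # prior guess at the star distribution (assumed uniform here)
n = np.zeros(10)
n[[7, 8, 9]] = [40, 120, 60]       # hypothetical observed counts of 8-, 9-, and 10-star ratings
N = n.sum()

# Posterior is Dir(alpha*g0 + n); its mean is the shrinkage estimate in (1.7)
post_mean = (n + alpha * g0) / (alpha + N)

stars = np.arange(1, 11)
R_hat = (stars * post_mean).sum()        # Bayes estimate of the rating
R_empirical = (stars * n).sum() / N      # raw average, ignores the prior
print(R_hat, R_empirical)                # R_hat is pulled toward the prior mean rating
```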
1.3 The Pólya Urn Process
The Pólya urn process [41] is a sequential method for obtaining samples, X1, . . . , XN, from a random, discrete probability distribution that has a Dirichlet prior distribution. This process has the following illustrative story: Imagine an urn that initially contains α balls, each of which can take one of K colors; there are αg01 balls of the first color, αg02 of the second, etc., where g0 ∈ ∆K. A person randomly selects a ball, X1, from this urn, with the probability that X1 = x1 equal to g0,x1. This person then replaces the ball and adds a second ball of the same color; therefore, the number of balls in the urn increases by one with each draw. This process repeats N times, using the current state of the urn for each sample. After the first draw, the second ball is therefore drawn from the distribution
$$p(X_2\,|\,X_1=x_1) = \frac{1}{\alpha+1}\,\delta_{x_1} + \frac{\alpha}{\alpha+1}\sum_{k=1}^{K} g_{0k}\,\delta_k$$
where δk is a delta function at the color having index k. Inductively, the distribution on the N + 1st ball is,
$$p(X_{N+1}\,|\,X_1=x_1, \dots, X_N=x_N) = \sum_{k=1}^{K} \frac{n_k}{\alpha+N}\,\delta_k + \frac{\alpha}{\alpha+N}\sum_{k=1}^{K} g_{0k}\,\delta_k \qquad (1.8)$$
Comparing with (1.7), these probabilities equal the expectation of π under its pos-
terior Dirichlet distribution given X1, . . . , XN. Using the rule for integrating out random variables, $\int_{\Omega_B} p(A|B)\,p(B|C)\,dB = p(A|C)$, we show this by writing

$$p(X_{N+1}\,|\,X_1=x_1, \dots, X_N=x_N) = \int_{\pi \in \Delta_K} p(X_{N+1}\,|\,\pi)\, p(\pi\,|\,X_1=x_1, \dots, X_N=x_N)\, d\pi \qquad (1.9)$$

We leave the conditioning on αg0 as being implied. Since $p(X_{N+1}=i\,|\,\pi) = \pi_i$, it follows that for a single value, k, (1.9) is another expression for

$$p(X_{N+1}=k\,|\,X_1=x_1, \dots, X_N=x_N) = \mathbb{E}[\pi_k\,|\,X_1=x_1, \dots, X_N=x_N] \qquad (1.10)$$
With respect to the Dirichlet distribution and equation (1.9), we are integrating out, or marginalizing, the random vector, π, and are therefore said to be drawing from a marginalized Dirichlet distribution. Returning to the urn, by the law of large numbers [78], the empirical distribution of the urn converges to some random discrete distribution as the number of samples
N → ∞. The theory of exchangeability [3] shows that the distribution of this random
probability mass function is the Dir(αg0) distribution. To review, a sequence of random variables is said to be exchangeable if, for
any permutation of the integers 1,...,N, σ(·), it follows that p(X1,...,XN ) =
$p(X_{\sigma(1)}, \dots, X_{\sigma(N)})$. By sequentially selecting the appropriate values from (1.8) using the chain rule, $p(X_1, \dots, X_N) = \prod_{i=1}^{N} p(X_i\,|\,X_{j<i})$, this joint probability can be written as

$$p(X_1=x_1, \dots, X_N=x_N) = \frac{\prod_{k=1}^{K} \prod_{i=1}^{n_k} (\alpha g_{0k}+i-1)}{\prod_{n=1}^{N} (\alpha+n-1)} \qquad (1.11)$$

where $n_k = \sum_{n=1}^{N} \mathbb{I}(x_n=k)$. This probability does not change under any permutation of the sequence (X1, . . . , XN), and therefore this sequence is exchangeable. As a result of this exchangeability, de Finetti’s theorem [28] states that there exists a discrete pmf, π, having the (yet-to-be-determined) distribution, p(π), conditioned on which
the observations, (X1,...,XN ), are independent. This is expressed in the following sequence of equalities.
$$\begin{aligned}
p(X_1=x_1, \dots, X_N=x_N) &= \int_{\pi \in \Delta_K} p(X_1=x_1, \dots, X_N=x_N\,|\,\pi)\, p(\pi)\, d\pi \\
&= \int_{\pi \in \Delta_K} \prod_{n=1}^{N} p(X_n=x_n\,|\,\pi)\, p(\pi)\, d\pi \\
&= \int_{\pi \in \Delta_K} \prod_{k=1}^{K} \pi_k^{n_k}\, p(\pi)\, d\pi \\
&= \mathbb{E}_p\!\left[\prod_{k=1}^{K} \pi_k^{n_k}\right] \qquad (1.12)
\end{aligned}$$
A distribution is uniquely defined by its moments, and the only distribution having the moments
" K # QK Qnk Y (αg0k + i − 1) πnk = k=1 i=1 (1.13) Ep k QN k=1 n=1(α + n − 1) 14 is the Dir(αg0) distribution. The conclusion is that, in the limit as N → ∞, the
sequence (X1,X2,... ) drawn according to the urn process defined above can be viewed as independent and identically distributed samples from the pmf π, where
π ∼ Dir(αg0) and is equal to the empirical distribution of the observations.
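A minimal simulation of the urn scheme in (1.8) is sketched below (NumPy, illustrative only). Each draw reuses a past value with probability proportional to its count, or picks a fresh color from g0 with probability proportional to α; by the argument above, the empirical distribution of a long run approximates a single draw from Dir(αg0).

```python
import numpy as np

def polya_urn(alpha, g0, N, rng=np.random.default_rng()):
    """Generate X_1,...,X_N from the Polya urn scheme in (1.8)."""
    K = len(g0)
    counts = np.zeros(K)
    draws = []
    for n in range(N):
        # probability of each color: (count + alpha*g0_k) / (alpha + n)
        probs = (counts + alpha * np.asarray(g0)) / (alpha + n)
        x = rng.choice(K, p=probs)
        counts[x] += 1
        draws.append(x)
    return np.array(draws), counts / N   # samples and their empirical distribution

samples, empirical = polya_urn(alpha=2.0, g0=np.ones(5) / 5, N=10_000)
print(empirical)   # approximately one draw from Dir(2 * g0): typically uneven
```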
1.4 Constructing the Finite-Dimensional Dirichlet Distribution
In this section, we review the constructive definition of a finite-dimensional Dirichlet distribution [68]. Like the Pólya urn process, this method also requires an infinite
number of random variables to obtain the vector π ∼ Dir(αg0). Unlike this process,
the values of the observations X1, . . . , XN are not required to do this, but rather random variables are drawn iid from their respective distributions, and π is constructed according to a function of these random variables. The details of the proof are given in this section because this will be the central process from which the novel stick-breaking construction of the beta process is derived in Chapter 5. The constructive definition of a Dirichlet prior states that, if π is constructed according to the following function of random variables, then π ∼ Dir(αg0).
$$\pi = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1-V_j)\, e_{Y_i}, \qquad V_i \stackrel{iid}{\sim} \text{Beta}(1, \alpha), \qquad Y_i \stackrel{iid}{\sim} \text{Mult}(\{1, \dots, K\}, g_0) \qquad (1.14)$$
Using this multinomial parametrization, Y ∈ {1, . . . , K} with P(Y = k|g0) = g0k. The vector $e_Y$ is a K-dimensional vector of zeros, except for a one in position Y. The values $V_i \prod_{j=1}^{i-1}(1-V_j)$ are often called “stick-breaking” weights, because at step i, the proportion $V_i$ is “broken” from the remainder, $\prod_{j=1}^{i-1}(1-V_j)$, of a unit-length stick. Since V ∈ [0, 1], the product $V_i \prod_{j=1}^{i-1}(1-V_j) \in [0, 1]$ for all i, and it can be shown that $\sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1}(1-V_j) = 1$.

In Figure 1.2, we illustrate this process for i = 1, . . . , 4. The weights are broken as mentioned, and the random variables $\{Y_i\}_{i=1}^{4}$ indicate the elements of the vector π to which each weight is added. In the limit, the value of the kth element is
$$\pi_k = \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1-V_j)\, \mathbb{I}(Y_i = k)$$
where I(·) is the indicator function and equals one when the argument is true, and zero otherwise.
Figure 1.2: An illustration of the infinite stick-breaking construction of a K-dimensional Dirichlet distribution. Weights are drawn according to a Beta(1, α) stick-breaking process, with corresponding locations taking value k with probability g0k.
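On a computer the infinite sum in (1.14) can only be approximated by truncating once the remaining stick length is negligible. The sketch below (NumPy, not part of the original text) does exactly that, and renormalizes away the tiny truncated remainder.

```python
import numpy as np

def stick_break_dirichlet(alpha, g0, tol=1e-10, rng=np.random.default_rng()):
    """Approximate a draw pi ~ Dir(alpha * g0) via the construction in (1.14)."""
    g0 = np.asarray(g0, dtype=float)
    pi = np.zeros(len(g0))
    remaining = 1.0                      # prod_j (1 - V_j), the unbroken stick
    while remaining > tol:               # truncate once the leftover mass is negligible
        v = rng.beta(1.0, alpha)         # V_i ~ Beta(1, alpha)
        y = rng.choice(len(g0), p=g0)    # Y_i ~ Mult({1,...,K}, g0)
        pi[y] += v * remaining           # add the broken-off piece to coordinate Y_i
        remaining *= (1.0 - v)
    return pi / pi.sum()                 # renormalize the truncated remainder away

print(stick_break_dirichlet(2.0, np.ones(4) / 4))
```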
1.4.1 Proof of the Construction
We begin with the random vector
$$\pi \sim \text{Dir}(\alpha g_0) \qquad (1.15)$$
The proof that this random vector has the same distribution as the random vector in (1.14) requires two lemmas concerning general properties of the Dirichlet distribution, the first of which follows.
Lemma 1: Let $Z \sim \sum_{k=1}^{K} g_{0k}\, \text{Dir}(\alpha g_0 + e_k)$. Values of Z can be sampled from this distribution by first sampling $Y \sim \text{Mult}(\{1, \dots, K\}, g_0)$, and then sampling $Z \sim \text{Dir}(\alpha g_0 + e_Y)$. It then follows that $Z \sim \text{Dir}(\alpha g_0)$.
The proof of this lemma is in the appendix at the end of this chapter. Therefore, the process
$$\pi \sim \text{Dir}(\alpha g_0 + e_Y), \qquad Y \sim \text{Mult}(\{1, \dots, K\}, g_0) \qquad (1.16)$$

produces a random vector π ∼ Dir(αg0). The second lemma, which will be applied to the result in (1.16), is
Lemma 2: Let the random vectors $W_1 \sim \text{Dir}(w_1, \dots, w_K)$, $W_2 \sim \text{Dir}(v_1, \dots, v_K)$ and $V \sim \text{Beta}\!\left(\sum_{k=1}^{K} w_k,\, \sum_{k=1}^{K} v_k\right)$. Define the linear combination,
Z := VW1 + (1 − V )W2 (1.17)
then Z ∼ Dir(w1 + v1, . . . , wK + vK ).
The proof of this lemma is in the appendix at the end of this chapter. In words,
this lemma states that, if one wished to construct the vector Z ∈ ∆K according to
the function of random variables (W1,W2,V ) given in (1.17), one could equivalently
bypass this construction and directly sample Z ∼ Dir(w1 + v1, . . . , wK + vK).

This lemma is applied to the random vector π ∼ Dir(αg0 + eY) in (1.16), with the result that this vector can be represented by the following process,

$$\begin{aligned}
\pi &= V W + (1-V)\,\pi' \\
W &\sim \text{Dir}(e_Y) \\
\pi' &\sim \text{Dir}(\alpha g_0) \\
V &\sim \text{Beta}\!\left(\sum_{k=1}^{K} e_Y(k),\; \sum_{k=1}^{K} \alpha g_{0k}\right) \\
Y &\sim \text{Mult}(\{1, \dots, K\}, g_0) \qquad (1.18)
\end{aligned}$$

where we’ve also included the random variable Y. The result is still a random vector π ∼ Dir(αg0). Note that $\sum_{k=1}^{K} e_Y(k) = 1$ and $\sum_{k=1}^{K} \alpha g_{0k} = \alpha$. Also, we observe that the distribution of W is degenerate, with only one of the K parameters in the
Dirichlet distribution being nonzero. Therefore, since P(πk = 0 | g0k = 0) = 1, we can say that $\mathbb{P}(W = e_Y \,|\, g_0 = e_Y) = 1$. Modifying (1.18), the following generative process for π produces a random vector π ∼ Dir(αg0).
$$\begin{aligned}
\pi &= V e_Y + (1-V)\,\pi' \\
\pi' &\sim \text{Dir}(\alpha g_0) \\
V &\sim \text{Beta}(1, \alpha) \\
Y &\sim \text{Mult}(\{1, \dots, K\}, g_0) \qquad (1.19)
\end{aligned}$$
We observe that the random vectors π and π′ have the same distribution, which is a desired result. That is, returning to (1.14), we expand the right-hand term as follows
$$\begin{aligned}
\pi &= V_1 e_{Y_1} + (1-V_1) \sum_{i=2}^{\infty} V_i \prod_{j=2}^{i-1} (1-V_j)\, e_{Y_i} \\
&= V_1 e_{Y_1} + (1-V_1)\,\pi' \qquad (1.20)
\end{aligned}$$

Due to the iid nature of the random variables $\{V_i\}_{i=1}^{\infty}$ and $\{Y_i\}_{i=1}^{\infty}$, the vector π has the same distribution as the vector π′ in (1.20), which we have shown is the Dir(αg0) distribution.

Since π′ ∼ Dir(αg0) in (1.19), this vector can be decomposed according to the same process by which π is decomposed in (1.19). Thus, for i = 1, 2 we have
$$\begin{aligned}
\pi &= V_1 e_{Y_1} + V_2 (1-V_1)\, e_{Y_2} + (1-V_1)(1-V_2)\,\pi'' \\
V_i &\stackrel{iid}{\sim} \text{Beta}(1, \alpha) \\
Y_i &\stackrel{iid}{\sim} \text{Mult}(\{1, \dots, K\}, g_0) \\
\pi'' &\sim \text{Dir}(\alpha g_0) \qquad (1.21)
\end{aligned}$$
which can proceed for i → ∞, with each decomposition leaving the vector π ∼ Dir(αg0). In the limit as i → ∞, (1.21) → (1.14), since $\lim_{i\to\infty} \prod_{j=1}^{i} (1-V_j) = 0$, concluding the proof. Therefore, (1.14) arises as an infinite number of decompositions of the Dirichlet distribution, each taking the form of (1.19).
1.5 The Extension to Infinite-Dimensional Spaces
In this section we briefly make the connection between the stick-breaking representation of the finite Dirichlet distribution and the infinite Dirichlet process. The extension of the Pólya urn process to infinite-dimensional spaces [11], also called the Chinese restaurant process [3], follows a similar argument. For finite-dimensional vectors, π ∈ ∆K, the constructive definition of (1.14) may seem unnecessary, since the infinite sum cannot be carried out in practice, and π can be constructed exactly using only K gamma-distributed random variables. The primary use of the stick-breaking representation of the Dirichlet distribution is the case where K → ∞.

For example, consider a K-component mixture model [37], where observations in a data set, $\{X_n\}_{n=1}^{N}$, are generated according to $X_n \sim p(\theta_n^*)$ and $\theta_n^* \stackrel{iid}{\sim} G_K$, where

$$\begin{aligned}
G_K &= \sum_{k=1}^{K} \pi_k\, \delta_{\theta_k} \\
\pi &\sim \text{Dir}(\alpha g_0) \\
\theta_k &\stackrel{iid}{\sim} G_0, \quad k = 1, \dots, K \qquad (1.22)
\end{aligned}$$

The atom $\theta_n^*$ associated with observation $X_n$ contains parameters for some distribution, p(·), with $\mathbb{P}(\theta_n^* = \theta_k \,|\, \pi) = \pi_k$. The Dir(αg0) prior is often placed on π as
shown, and G0 is a (typically non-atomic) base distribution. For the Gaussian mix-
ture model [26], θk = {µk, Σk} and G0 is often a conjugate Normal-Wishart prior distribution on mean vector µ and covariance matrix Σ. Following the proof of Section 1.4.1, we can let (1.14) construct π in (1.22), producing
$$\begin{aligned}
G_K &= \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1-V_j)\, \delta_{\theta_{Y_i}} \\
V_i &\stackrel{iid}{\sim} \text{Beta}(1, \alpha) \\
Y_i &\stackrel{iid}{\sim} \text{Mult}(\{1, \dots, K\}, g_0) \\
\theta_k &\stackrel{iid}{\sim} G_0, \quad k = 1, \dots, K \qquad (1.23)
\end{aligned}$$
Sampling $Y_i$ from the integers {1, . . . , K} according to g0 provides an index of the atom with which to associate mass $V_i \prod_{j=1}^{i-1}(1-V_j)$. Ishwaran and Zarepour [37] have shown that, when $g_0 = (\frac{1}{K}, \dots, \frac{1}{K})$ and K → ∞, $G_K \to G$, where G is a Dirichlet process with continuous base measure G0 on the infinite space (S, A), as defined in [27]; this definition is given at the end of this section. Since in the limit as K → ∞, $\mathbb{P}(Y_i = Y_j \,|\, i \neq j) = 0$ and $\mathbb{P}(\theta_{Y_i} = \theta_{Y_j} \,|\, i \neq j) = 0$, there is a one-to-one correspondence between $\{Y_i\}_{i=1}^{\infty}$ and $\{\theta_i\}_{i=1}^{\infty}$. Let the function $\sigma(Y_i) = i$ reindex the subscripts on $\theta_{Y_i}$. Then
$$\begin{aligned}
G &= \sum_{i=1}^{\infty} V_i \prod_{j=1}^{i-1} (1-V_j)\, \delta_{\theta_i} \\
V_i &\stackrel{iid}{\sim} \text{Beta}(1, \alpha) \\
\theta_i &\stackrel{iid}{\sim} G_0 \qquad (1.24)
\end{aligned}$$
Finally, we note that this representation is not as different from (1.14) as first appears.
For example, let G0 be a discrete measure on the first K positive integers. In this case θ ∈ {1, . . . , K} and (1.24) and (1.14) are essentially the same (the difference being that one is presented as a measure on the first K integers, while the other is a vector in ∆K).

As mentioned, we provide a more formal definition of a Dirichlet process, first given in [27], here. Let S be an abstract space and let µ(·) be a measure on that space with the measure of the entire space µ(S) = α. For any subset A of S, let G0(A) = µ(A)/µ(S). Then G is a Dirichlet process if, for all partitions of S,
A1,...,Ak,
(G(A1),...,G(Ak)) ∼ Dir(αG0(A1), . . . , αG0(Ak))
In the machine learning community, G0 is almost always non-atomic and can be thought of as corresponding to a continuous probability distribution. The value
of G0(A) is simply the integration of the corresponding density over the region A. Therefore, a Dirichlet process, G, produces a probability measure on the subsets
A1,...,Ak of the space S by drawing from a Dirichlet distribution with concentration
parameter α and g0 := (G0(A1),...,G0(Ak)).
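In practice a draw G ∼ DP(αG0) is approximated by truncating the stick-breaking sum (1.24) at a finite number of atoms. The sketch below (NumPy; the standard-normal base measure is an assumption made only for illustration) returns the weights and atoms of such a truncated draw.

```python
import numpy as np

def sample_truncated_dp(alpha, base_sampler, T=100, rng=np.random.default_rng()):
    """Approximate G ~ DP(alpha * G0) by truncating the stick-breaking sum (1.24) at T atoms."""
    v = rng.beta(1.0, alpha, size=T)
    v[-1] = 1.0                                  # put all remaining mass on the last stick
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    weights = v * stick_left                     # p_i = V_i * prod_{j<i}(1 - V_j)
    atoms = base_sampler(T, rng)                 # theta_i drawn iid from G0
    return weights, atoms

# Example with a standard normal base measure G0 (an assumption, not from the text)
weights, atoms = sample_truncated_dp(alpha=2.0,
                                     base_sampler=lambda T, rng: rng.normal(size=T))
print(weights[:10], weights.sum())               # weights decay quickly and sum to one
```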
1.6 Inference for the Dirichlet Process Mixture Model
The discussion thus far has led to methods for generating G ∼ DP(αG0). In this section, we discuss the use of G, as well as Markov chain Monte Carlo (MCMC) inference [29] for mixture models using the stick-breaking representation of G.
1.6.1 Dirichlet Process Mixture Models
The primary use of the Dirichlet process in the machine learning community is in nonparametric mixture modeling [5][26], where values θ ∼ G are not the observed data, as X was in the previous sections, but rather a parameter or set of parameters
for some density function, fX (θ), from which X is drawn. As discussed in the previous section, the generative process is,
$$\begin{aligned}
X_n &\sim f_X(\theta_n^*) \\
\theta_n^* &\stackrel{iid}{\sim} G \\
G &\sim \text{DP}(\alpha G_0) \qquad (1.25)
\end{aligned}$$
and the parameter $\theta_n^*$ is a specific value selected from $G = \sum_{i=1}^{\infty} p_i\, \delta_{\theta_i}$ and is associated with observation $X_n$. While samples drawn from G will contain duplicate values, samples from fX(θ) are again from a continuous density. Therefore, values that share parameters are clustered together in that, though not exactly the same, they exhibit similar statistical characteristics according to the density function, fX(θ).
1.6.2 Gibbs Sampling for Dirichlet Process Mixture Models
We next outline a general method for performing MCMC [29] inference for Dirichlet process mixture models. The MCMC engine presented here is the Gibbs sampler [4], which, to review, samples values for each parameter in a sequential fashion using the posterior distribution of a parameter of interest conditioned on the current values of
all other parameters. We focus on inference using the stick-breaking construction of the DP, which is fully conjugate and therefore amenable to Gibbs sampling inference.
Let fX (x|θ) be the likelihood function of an observation, x, given parameters, θ, and let p(θ) be the prior density of θ. Inference for Dirichlet process mixture models also includes an additional latent variable [26], c ∈ {1,...,K}, which indicates the
parameter value associated with observation Xn. We also define a K-truncated Dirichlet process (to be discussed in greater detail in the next chapter) to be the process in (1.24) where i = 1,...,K and the remaining probability mass is added to
the last atom. We define the K-dimensional vector p = φK (V ) to be the resulting probability weights. Incorporating c and using the truncated DP, the generative process is,
$$\begin{aligned}
X_n &\sim f_X(\theta_{c_n}) \\
c_n &\stackrel{iid}{\sim} \text{Mult}(\{1, \dots, K\}, \phi_K(V)) \\
V_k &\stackrel{iid}{\sim} \text{Beta}(1, \alpha) \\
\theta_k &\stackrel{iid}{\sim} G_0 \qquad (1.26)
\end{aligned}$$
The MCMC sampling procedure is given below.
Initialization: Select a truncation level, K, and initialize the model, G, by sampling
θk ∼ G0 for k = 1,...,K and Vk ∼ Beta(1, α) for k = 1,...,K − 1 and construct
p = φK (V ). We discuss a method for learning α at the end of the section, but it is assumed to be initialized here.
Step 1: Sample the indicators, c1, . . . , cN , independently from their respective con-
ditional posteriors, $p(c_n \,|\, x_n, G) \propto f_X(x_n \,|\, \theta_{c_n})\, p(\theta_{c_n} \,|\, G)$,

$$c_n \sim \sum_{k=1}^{K} \frac{p_k\, f_X(x_n \,|\, \theta_k)}{\sum_{l} p_l\, f_X(x_n \,|\, \theta_l)}\, \delta_k \qquad (1.27)$$
Relabel the index values to remove any unused integers between one and $\max_n c_n$. Let the total number of unique values be denoted by K′. Set K to be some integer value larger than K′. These additional components act as proposals that allow for the data to use more components if required.
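A sketch of this indicator-sampling step for a one-dimensional Gaussian likelihood is given below (NumPy; the Gaussian form of f_X and the fixed variance are assumptions made only to keep the example concrete). Each c_n is drawn from the discrete posterior in (1.27).

```python
import numpy as np

def gibbs_sample_indicators(x, p, theta, sigma=1.0, rng=np.random.default_rng()):
    """Sample c_1,...,c_N from (1.27) with a Gaussian likelihood f_X(x|theta_k) = N(x|theta_k, sigma^2)."""
    # N x K matrix of likelihoods f_X(x_n | theta_k)
    lik = np.exp(-0.5 * ((x[:, None] - theta[None, :]) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    post = p[None, :] * lik                      # unnormalized p(c_n = k | x_n, G), as in (1.27)
    post /= post.sum(axis=1, keepdims=True)      # normalize over components k
    return np.array([rng.choice(len(p), p=row) for row in post])

# Hypothetical usage: 5 observations, K = 3 truncated components
x = np.array([-2.1, -1.9, 0.1, 2.0, 2.2])
p = np.array([0.3, 0.4, 0.3])                    # phi_K(V), the truncated stick-breaking weights
theta = np.array([-2.0, 0.0, 2.0])
print(gibbs_sample_indicators(x, p, theta))
```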
Step 2: Sample $\theta_1, \dots, \theta_{K'}$ from their respective posteriors conditioned on $c_1, \dots, c_N$ and $x_1, \dots, x_N$,