Bayesian Analysis with Python
Unleash the power and flexibility of the Bayesian framework

Osvaldo Martin

BIRMINGHAM - MUMBAI

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing and its dealers and distributors, will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2016
Production reference: 1211116

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK.

ISBN 978-1-78588-380-4

www.packtpub.com

Credits

Author: Osvaldo Martin
Reviewer: Austin Rochford
Commissioning Editor: Veena Pagare
Acquisition Editor: Tushar Gupta
Content Development Editor: Aishwarya Pandere
Technical Editor: Suwarna Patil
Copy Editors: Safis Editing, Vikrant Phadke
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Mariammal Chettiyar
Graphics: Disha Haria
Production Coordinator: Nilesh Mohite
Cover Work: Nilesh Mohite

About the Author

Osvaldo Martin is a researcher at The National Scientific and Technical Research Council (CONICET), the main organization in charge of the promotion of science and technology in Argentina.
He has worked on structural bioinformatics and computational biology problems, especially on how to validate structural protein models. He has experience in using Markov Chain Monte Carlo methods to simulate molecules and loves to use Python to solve data analysis problems. He has taught courses about structural bioinformatics, Python programming, and, more recently, Bayesian data analysis. Python and Bayesian statistics have transformed the way he looks at science and thinks about problems in general. Osvaldo was really motivated to write this book to help others develop probabilistic models with Python, regardless of their mathematical background. He is an active member of the PyMOL community (a C/Python-based molecular viewer), and recently he has been making small contributions to the probabilistic programming library PyMC3.

I would like to thank my wife, Romina, for her support while I was writing this book, and in general for her support in all my projects, especially the unreasonable ones. I also want to thank Walter Lapadula, Juan Manuel Alonso, and Romina Torres-Astorga for providing invaluable feedback and suggestions on my drafts.

A special thanks goes to the core developers of PyMC3. This book was possible only because of the dedication, love, and hard work they have put into PyMC3. I hope this book contributes to the spread and adoption of this great library.

About the Reviewer

Austin Rochford is a principal data scientist at Monetate Labs, where he develops products that allow retailers to personalize their marketing across billions of events a year. He is a mathematician by training and a passionate advocate of Bayesian methods.

www.PacktPub.com

eBooks, discount offers, and more

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?

• Fully searchable across every book published by Packt
• Copy and paste, print, and bookmark content
• On demand and accessible via a web browser

Table of Contents

Preface vii

Chapter 1: Thinking Probabilistically - A Bayesian Inference Primer 1
Statistics as a form of modeling 2
Exploratory data analysis 2
Inferential statistics 3
Probabilities and uncertainty 5
Probability distributions 7
Bayes' theorem and statistical inference 10
Single parameter inference 13
The coin-flipping problem 13
The general model 14
Choosing the likelihood 14
Choosing the prior 16
Getting the posterior 18
Computing and plotting the posterior 18
Influence of the prior and how to choose one 21
Communicating a Bayesian analysis 23
Model notation and visualization 23
Summarizing the posterior 24
Highest posterior density 24
Posterior predictive checks 27
Installing the necessary Python packages 28
Summary 29
Exercises 29

Chapter 2: Programming Probabilistically – A PyMC3 Primer 31
Probabilistic programming 32
Inference engines 33
Non-Markovian methods 33
Markovian methods 36
PyMC3 introduction 46
Coin-flipping, the computational approach 46
Model specification 47
Pushing the inference button 48
Diagnosing the sampling process 48
Summarizing the posterior 55
Posterior-based decisions 55
ROPE 56
Loss functions 57
Summary 58
Keep reading 58
Exercises 59

Chapter 3: Juggling with Multi-Parametric and Hierarchical Models 61
Nuisance parameters and marginalized distributions 62
Gaussians, Gaussians, Gaussians everywhere 64
Gaussian inferences 64
Robust inferences 69
Student's t-distribution 69
Comparing groups 75
The tips dataset 76
Cohen's d 80
Probability of superiority 81
Hierarchical models 81
Shrinkage 84
Summary 88
Keep reading 88
Exercises 89

Chapter 4: Understanding and Predicting Data with Linear Regression Models 91
Simple linear regression 92
The machine learning connection 92
The core of linear regression models 93
Linear models and high autocorrelation 100
Modifying the data before running 101
Changing the sampling method 103
Interpreting and visualizing the posterior 103
Pearson correlation coefficient 107
Pearson coefficient from a multivariate Gaussian 110
Robust linear regression 113
Hierarchical linear regression 117
Correlation, causation, and the messiness of life 124
Polynomial regression 126
Interpreting the parameters of a polynomial regression 129
Polynomial regression – the ultimate model? 130
Multiple linear regression 131
Confounding variables and redundant variables 135
Multicollinearity or when the correlation is too high 138
Masking effect variables 142
Adding interactions 144
The GLM module 145
Summary 146
Keep reading 146
Exercises 147

Chapter 5: Classifying Outcomes with Logistic Regression 149
Logistic regression 150
The logistic model 151
The iris dataset 152
The logistic model applied to the iris dataset 155
Making predictions 158
Multiple logistic regression 159
The boundary decision 159
Implementing the model 160
Dealing with correlated variables 162
Dealing with unbalanced classes 163
How do we solve this problem? 165
Interpreting the coefficients of a logistic regression 165
Generalized linear models 166
Softmax regression or multinomial logistic regression 167
Discriminative and generative models 171
Summary 174
Keep reading 174
Exercises 175

Chapter 6: Model Comparison 177
Occam's razor – simplicity and accuracy 178
Too many parameters leads to overfitting 179
Too few parameters leads to underfitting 181
The balance between simplicity and accuracy 182
Regularizing priors 183
Regularizing priors and hierarchical models 184
Predictive accuracy measures 185
Cross-validation 185
Information criteria 186
The log-likelihood and the deviance 186
Akaike information criterion 187
Deviance information criterion 188
Widely available information criterion 189
Pareto smoothed importance sampling leave-one-out cross-validation 190
Bayesian information criterion 190
Computing information criteria with PyMC3 190
A note on the reliability of WAIC and LOO computations 194
Interpreting and using information criteria measures 194
Posterior predictive checks 196
Bayes factors 197
Analogy with information criteria 199
Computing Bayes factors 199
Common problems computing Bayes factors 202
Bayes factors and information criteria 202
Summary 205
Keep reading 205
Exercises 205

Chapter 7: Mixture Models 207
Mixture models 207
How to build mixture models 209
Marginalized Gaussian mixture model 215
Mixture models and count data 216
The Poisson distribution 216
The Zero-Inflated Poisson model 218
Poisson regression and ZIP regression 220
Robust logistic regression 223
Model-based clustering 225
Fixed component clustering 227
Non-fixed component clustering 227
Continuous mixtures 228
Beta-binomial and negative binomial 228
The Student's t-distribution 229
Summary 230
Keep reading 230
Exercises 230

Chapter 8: Gaussian Processes 233
Non-parametric statistics 234
Kernel-based models 234
The Gaussian kernel 235
Kernelized linear regression 235
Overfitting and priors 241
Gaussian processes 242
Building the covariance matrix 243
Sampling from a GP prior