Multiple Optimality Guarantees in Statistical Learning

by

John C Duchi

A dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science and the Designated Emphasis in Communication, Computation, and Statistics in the Graduate Division of the University of California, Berkeley

Committee in charge:
Professor Michael I. Jordan, Co-chair
Professor Martin J. Wainwright, Co-chair
Professor Peter Bickel
Professor Laurent El Ghaoui

Spring 2014

Multiple Optimality Guarantees in Statistical Learning
Copyright 2014 by John C Duchi

Abstract

Multiple Optimality Guarantees in Statistical Learning

by

John C Duchi

Doctor of Philosophy in Computer Science and the Designated Emphasis in Communication, Computation, and Statistics

University of California, Berkeley

Professor Michael I. Jordan, Co-chair
Professor Martin J. Wainwright, Co-chair

Classically, the performance of estimators in statistical learning problems is measured in terms of their predictive ability or estimation error as the sample size n grows. In modern statistical and machine learning applications, however, computer scientists, statisticians, and analysts have a variety of additional criteria they must balance: estimators must be efficiently computable, data providers may wish to maintain anonymity, and large datasets must be stored and accessed. In this thesis, we consider the fundamental questions that arise when trading off between multiple such criteria—computation, communication, privacy—while maintaining statistical performance. Can we develop lower bounds that show there must be tradeoffs? Can we develop new procedures that are both theoretically optimal and practically useful?

To answer these questions, we explore examples from optimization, confidentiality-preserving statistical inference, and distributed estimation under communication constraints. Viewing our examples through a general lens of constrained minimax theory, we prove fundamental lower bounds on the statistical performance of any algorithm subject to the specified constraints—computational, confidentiality, or communication. These lower bounds allow us to guarantee the optimality of the new algorithms we develop to address the additional criteria we consider, and we also show some of the practical benefits that a focus on multiple optimality criteria brings.

In somewhat more detail, the central contributions of this thesis include the following:

• we develop several new stochastic optimization algorithms, applicable to general classes of stochastic convex optimization problems, including methods that are automatically adaptive to the structure of the underlying problem, parallelize naturally to attain linear speedup in the number of processors available, and may be used asynchronously (a minimal illustrative sketch of such an adaptive update follows this list),

• prove lower bounds demonstrating the optimality of these methods,

• provide a variety of information-theoretic tools—strong data processing inequalities—useful for proving lower bounds in privacy-preserving statistical inference, communication-constrained estimation, and optimization,

• develop new algorithms for private learning and estimation, guaranteeing their optimality, and

• give simple distributed estimation algorithms and prove fundamental limits showing that they (nearly) optimally trade off between communication (in terms of the number of bits distributed processors may send) and statistical risk.
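To make the first item concrete, one familiar instance of a method that adapts to problem structure is a diagonal adaptive-gradient (AdaGrad-style) update. The following is a minimal sketch in notation chosen for this summary (iterate x_t in R^d, stochastic subgradient g_t, stepsize eta > 0); it is illustrative only and not necessarily the exact form or notation analyzed in Chapters 3 and 4:

\[
  x_{t+1,j} \;=\; x_{t,j} \;-\; \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,j}^{2}}}\; g_{t,j},
  \qquad j = 1, \ldots, d.
\]

Coordinates whose gradients have historically been large receive small effective stepsizes, while rarely updated (sparse) coordinates retain large ones, which is one sense in which such methods adapt automatically to problem structure. In practice a small constant is added inside the square root to avoid division by zero, and constrained or composite problems replace the plain step with a projection or proximal operation.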
To Emily

Contents

Contents  ii
List of Figures  v

I Introduction and background  1

1 Introduction  2
  1.1 Evaluating statistical learning procedures  2
  1.2 Thesis goals and contributions  5
  1.3 Organization of the thesis and previously published work  7
  1.4 Notation  8

2 Minimax rates of convergence  11
  2.1 Basic framework and minimax risk  11
  2.2 Methods for lower bounds: Le Cam, Assouad, and Fano  13
  2.3 Summary  22
  2.4 Proofs of results  22

II Optimization  25

3 Stochastic optimization and adaptive gradient methods  26
  3.1 Stochastic optimization algorithms  27
  3.2 Adaptive optimization  32
  3.3 A few optimality guarantees  35
  3.4 Summary  38
  3.5 Proofs of convergence and minimax bounds  39

4 Data sparsity, asynchrony, and faster stochastic optimization  47
  4.1 Problem setting  47
  4.2 Parallel and asynchronous optimization with sparsity  48
  4.3 Experiments  53
  4.4 Proofs of convergence  56

5 Randomized smoothing for stochastic optimization  66
  5.1 Introduction  66
  5.2 Main results and some consequences  68
  5.3 Applications and experimental results  74
  5.4 Summary  80
  5.5 Proofs of convergence  81
  5.6 Properties of randomized smoothing  88

6 Zero-order optimization: the power of two function evaluations  97
  6.1 Introduction  97
  6.2 Algorithms  99
  6.3 Lower bounds on zero-order optimization  108
  6.4 Summary  109
  6.5 Convergence proofs  110
  6.6 Proofs of lower bounds  116
  6.7 Technical results for convergence arguments  121
  6.8 Technical proofs associated with lower bounds  128

III Privacy  130

7 Privacy, minimax rates of convergence, and data processing inequalities  131
  7.1 Introduction  131
  7.2 Background and problem formulation  135
  7.3 Pairwise bounds under privacy: Le Cam and local Fano methods  137
  7.4 Mutual information under local privacy: Fano's method  142
  7.5 Bounds on multiple pairwise divergences: Assouad's method  149
  7.6 Comparison to related work  157
  7.7 Summary  161

8 Technical arguments for private estimation  162
  8.1 Proof of Theorem 7.1 and related results  162
  8.2 Proof of Theorem 7.2 and related results  169
  8.3 Proof of Theorem 7.3  172
  8.4 Proofs of multi-dimensional mean-estimation results  174
  8.5 Proofs of multinomial estimation results  180
  8.6 Proofs of density estimation results  182
  8.7 Information bounds  187

IV Communication  192

9 Communication efficient algorithms  193
  9.1 Introduction  194
  9.2 Background and Problem Set-up  195
  9.3 Theoretical Results  197
  9.4 Summary  204
  9.5 Proof of Theorem 9.1  204

10 Optimality guarantees for distributed estimation  207
  10.1 Introduction  207
  10.2 Problem setting  208
  10.3 Related Work  210
  10.4 Main results  211
  10.5 Consequences for regression  218
  10.6 Summary  220
  10.7 Proof outline of major results  221
  10.8 Techniques, tools, and setup for proofs  223
  10.9 Proofs of lower bounds for independent protocols  228
  10.10 Proofs of interactive lower bounds for Gaussian observations  236

Bibliography  242

List of Figures

2.1 Example of a 2δ-packing  15
4.1 Experiments with URL data  54
4.2 Stepsize sensitivity of AdaGrad  54
4.3 Click-through prediction performance of asynchronous methods  55
5.1 Iterations to optimality versus gradient samples  77
5.2 Metric learning optimization error  79
5.3 Necessity of smoothing  80
7.1 Graphical structure of private channels  133
7.2 Private sampling strategies  146
8.1 Density constructions for lower bounds  182
10.1 Graphical model for Lemma 10.1  225

Acknowledgments

There are so many people to whom I owe credit for this thesis that I must begin with an apology: I will probably forget to mention several of you in the coming paragraphs. If I do, please forgive me, let me know, and I will be happy to buy you a beer.

My acknowledgments must begin with my advisors: my two official advisors at Berkeley, Michael Jordan and Martin Wainwright, and my one surrogate advisor down at Google, Yoram Singer. It has become clear to me that having three advisors was a necessary thing, if only for their sakes, because it kept me out of the hair of the other two while I bothered the third. More seriously, Mike and Martin have pushed me into contact with a multitude of disciplines, encouraging and exemplifying a fearlessness and healthy disrespect for academic boundaries. Without them, I could not have fallen in love with as many subjects—statistics, optimization, computing—as I have, and their guidance on how to approach research, write, give talks, and cool down my neuroses has been invaluable. They have also provided phenomenal environments for doing research, and (because of them) I have been fortunate to be surrounded constantly by other wonderful students and colleagues. Yoram has been a great mentor and friend, going on runs and bike rides with me, making sure I do not lose touch with practicality, and (hopefully) helping me develop a taste for problems at the border of theory and practice, where one informs the other and vice versa. I hope I can maintain the