Gen: a High-Level Programming Platform for Probabilistic Inference

by Marco Francis Cusumano-Towner

B.S., University of California, Berkeley (2011)
M.S., Stanford University (2013)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

September 2020

© Massachusetts Institute of Technology 2020. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, August 28, 2020
Certified by: Vikash K. Mansinghka, Principal Research Scientist, Thesis Supervisor
Accepted by: Leslie A. Kolodziejski, Professor of Electrical Engineering and Computer Science, Chair, Department Committee on Graduate Students

Abstract

Submitted to the Department of Electrical Engineering and Computer Science on August 28, 2020, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Engineering.

Probabilistic inference provides a powerful theoretical framework for engineering intelligent systems. However, diverse modeling approaches and inference algorithms are needed to navigate engineering tradeoffs between robustness, adaptability, accuracy, safety, interpretability, data efficiency, and computational efficiency. Structured generative models represented as symbolic programs provide interpretability. Structure learning of these models provides data-efficient adaptability. Uncertainty quantification is needed for safety. Bottom-up, discriminative inference provides computational efficiency.
Iterative “model-in-the-loop” algorithms can improve accuracy by fine-tuning inferences and improve robustness to out-of-distribution data. Recent probabilistic programming systems fully or partially automate inference, but are too restrictive for many applications. Differentiable programming systems are also inadequate: they do not support structure learning of generative models or hybrids of “model-in-the-loop” and discriminative inference. Therefore, probabilistic inference is still often implemented by translating tedious mathematical derivations into low-level numerical programs, which are error-prone and difficult to modify and maintain.

This thesis presents the design and implementation of the Gen programming platform for probabilistic inference. Gen automates the low-level implementation of probabilistic inference algorithms while remaining flexible enough to support heterogeneous algorithmic approaches and extensible enough for practical inference engineering. Gen users define their models explicitly using probabilistic programs, but instead of compiling the model directly into an inference algorithm implementation, Gen compiles the model into data types that encapsulate low-level inference operations whose semantics are derived from the model, like sampling, density evaluation, and gradients. Users write their inference application in a general-purpose programming language using Gen’s abstract data types as primitives. This thesis defines Gen’s data types and shows that they can be used to compose a variety of inference techniques including sophisticated Monte Carlo algorithms and hybrids of Monte Carlo, variational, and discriminative techniques. The same data types can be generated from multiple probabilistic programming languages that strike different expressiveness and performance tradeoffs.
By decoupling probabilistic programming language implementations from inference algorithm design, Gen enables more flexible specialization of both, leading to performance improvements over existing probabilistic programming systems.

Thesis Supervisor: Vikash K. Mansinghka
Title: Principal Research Scientist

Acknowledgments

My primary thesis advisor, Vikash Mansinghka, planted the seeds for the research described in this thesis and created an environment in which this work was possible. In the beginning of my PhD, he directed me to pull at research threads that seemed to never end, resulting in the most intellectually rewarding phase of my life. The research problem and the approach of this thesis build directly on his earlier work, and are a product of a new probabilistic programming research paradigm that he has persistently worked to shape. I am also grateful for his example of independent thinking and entrepreneurial attitude, his conscientiousness as an advisor, and his efforts to support his students in non-technical challenges. The past five years would have been much harder without knowing that I could count on him.

Several other advisors and mentors have played roles in my graduate school journey. Josh Tenenbaum has inspired me for years with his insightful work and led me into probabilistic programming. I am grateful for his approachable demeanor, encouragement, and continuing inspiration over the past five years. I am also grateful to Martin Rinard and Michael Carbin for being on my thesis committee and helping me to communicate this work more effectively. Martin Rinard gave valuable advice on the core terminology, framing, and exposition in this document. Martin Vechev gave helpful mentorship and advice during our collaboration, which built the foundation for the ‘trace translator’ construct in this thesis.
I also thank Sam Gershman for his advice, mentorship, and support as I started my graduate career at MIT, Sivan Bercovici for supporting my transition back into graduate school, and Pieter Abbeel for providing my first exposure to computer science research.

The first few years of graduate school would not have been the same without the friendship and support of my cohort of fellow PhD students and researchers, including Feras Saad, Alex Lew, and Ulrich Schaechtle. Feras has been a good friend and ally during the ups and downs of the past five years. Alex has been a constant source of encouragement, and late-night whiteboard discussions with him helped to clarify many of the ideas in this thesis. Feras and Alex were crucial in helping to push the Gen PLDI paper over the finish line, and in various other efforts. The research in this thesis was also aided by discussions with many other current and former affiliates of the MIT Probabilistic Computing Project, including Alexey Radul and Anthony Lu, and with support from Amanda Brower, Rachel Paiste, and many others.

Several people contributed to the applications of Gen described in this thesis. Ben Zinberg, Austin Garrett, and Javier Felip Leon contributed to the scene graph inference application; Ulrich Schaechtle and Feras Saad devised the algorithm used in the Gaussian process structure learning application. Since Gen was released, many people including Alex Lew, Ben Zinberg, Tan Zhi-Xuan, George Matheos, and Sam Witty have helped to improve it, and their contributions and use of Gen have been incredibly encouraging. Alex contributed syntax improvements that are reflected in this thesis.

This thesis would not have been possible without the support of my parents, Maria Cusumano and Mark Towner, who consistently put me before themselves and did everything in their power to help me succeed.
Finally, I would not have embarked on this PhD journey without Lisa Bashkirova, who has been a constant source of good advice for the past decade.

Contents

1 Introduction
  1.1 A new approach to implementing probabilistic inference
  1.2 Overview of programming languages concepts in Gen
    1.2.1 Generative probabilistic models and probabilistic inference
    1.2.2 Using probabilistic programming languages to express generative probabilistic models
    1.2.3 Abstract data types for generative functions and traces
    1.2.4 Generating implementations of the abstract data types from the source code of probabilistic programs
    1.2.5 Approximate probabilistic inference algorithms
    1.2.6 Implementing inference algorithms with abstract data types
2 Abstract Data Types for Inference: Generative Functions and Traces
  2.1 An abstract formal representation for generative models
    2.1.1 Random choices, addresses, and choice dictionaries
    2.1.2 Probability distributions on choice dictionaries
    2.1.3 Marginal likelihood, conditioning, and expectation
    2.1.4 Generalizing beyond discrete random choices
    2.1.5 Generative functions
  2.2 Languages for defining generative functions
    2.2.1 Gen Dynamic Modeling Language
    2.2.2 Formal semantics of a toy modeling language
  2.3 Abstract data types for probabilistic inference
    2.3.1 Generative function and trace ADTs
    2.3.2 Implementing the ADT operations compositionally
  2.4 Related work
3 Implementing Inference Using Generative Functions and Traces
  3.1 Simple Monte Carlo with traces
  3.2 Importance sampling with traces
    3.2.1 Regular importance sampling
    3.2.2 Self-normalized importance sampling
  3.3 Training proposal distributions on simulated data
  3.4 Markov chain Monte Carlo with traces
    3.4.1 MCMC with the trace abstract data type
    3.4.2 Metropolis-Hastings using generative functions as proposals
    3.4.3 Hamiltonian Monte Carlo with traces
    3.4.4 A language for composing MCMC kernels
  3.5 Resample-move particle filtering with traces
    3.5.1 Trace-based
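
As a rough illustration of the approach the abstract describes, in which a model is turned into data types that expose sampling and density evaluation and the inference code is written against those types, the following is a minimal Python sketch. It is not Gen's actual interface: the names Trace, NormalModel, generate, simulate, and posterior_mean are hypothetical, the two-variable Gaussian model is invented for illustration, and the convention that the weight covers only the constrained choices is an assumption of this sketch. The inference routine is self-normalized importance sampling, one of the techniques listed in the contents above.

```python
import math
import random
from dataclasses import dataclass


def normal_logpdf(x, mean, std):
    """Log density of Normal(mean, std) evaluated at x."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


@dataclass
class Trace:
    """Hypothetical trace type: one execution's random choices and their log joint density."""
    choices: dict      # address -> sampled or constrained value
    log_joint: float   # log density of all choices under the model


class NormalModel:
    """Hypothetical generative function for: mu ~ Normal(0, 1), y ~ Normal(mu, 1)."""

    def generate(self, rng, constraints):
        """Run the model, using constrained values where given and sampling the rest.

        Returns (trace, log_weight). In this sketch (an assumption), log_weight
        is the sum of the log densities of the constrained choices only.
        """
        choices = {}
        log_joint = 0.0
        log_weight = 0.0
        for address in ("mu", "y"):
            # The mean of y depends on the value already chosen for mu.
            mean = 0.0 if address == "mu" else choices["mu"]
            if address in constraints:
                value = constraints[address]
                log_weight += normal_logpdf(value, mean, 1.0)
            else:
                value = rng.gauss(mean, 1.0)
            choices[address] = value
            log_joint += normal_logpdf(value, mean, 1.0)
        return Trace(choices, log_joint), log_weight

    def simulate(self, rng):
        """Sample every choice from the model (no constraints)."""
        trace, _ = self.generate(rng, {})
        return trace


def posterior_mean(model, rng, constraints, address, num_samples):
    """Self-normalized importance sampling estimate of E[address | constraints]."""
    traces, log_weights = [], []
    for _ in range(num_samples):
        trace, log_weight = model.generate(rng, constraints)
        traces.append(trace)
        log_weights.append(log_weight)
    # Subtract the maximum log weight before exponentiating, for numerical stability.
    max_lw = max(log_weights)
    weights = [math.exp(lw - max_lw) for lw in log_weights]
    total = sum(weights)
    return sum(w * t.choices[address] for w, t in zip(weights, traces)) / total


if __name__ == "__main__":
    estimate = posterior_mean(NormalModel(), random.Random(0), {"y": 1.0}, "mu", 20000)
    print(estimate)
```

The point of the design is that posterior_mean never inspects the model's definition; it only calls generate and reads the resulting traces, so in this sketch the same routine would work for any object offering the same two methods. For this particular model the exact posterior mean of mu given y = 1 is 0.5, which gives a reference value for the estimate.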