Robust Cluster Analysis and Variable Selection
Total Page:16
File Type:pdf, Size:1020Kb
Load more
Recommended publications
-
Machine Learning Or Econometrics for Credit Scoring: Let’S Get the Best of Both Worlds Elena Dumitrescu, Sullivan Hué, Christophe Hurlin, Sessi Tokpavi
Machine Learning or Econometrics for Credit Scoring: Let’s Get the Best of Both Worlds Elena Dumitrescu, Sullivan Hué, Christophe Hurlin, Sessi Tokpavi To cite this version: Elena Dumitrescu, Sullivan Hué, Christophe Hurlin, Sessi Tokpavi. Machine Learning or Econometrics for Credit Scoring: Let’s Get the Best of Both Worlds. 2020. hal-02507499v2 HAL Id: hal-02507499 https://hal.archives-ouvertes.fr/hal-02507499v2 Preprint submitted on 2 Nov 2020 (v2), last revised 15 Jan 2021 (v3) HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Machine Learning or Econometrics for Credit Scoring: Let's Get the Best of Both Worlds∗ Elena, Dumitrescu,y Sullivan, Hu´e,z Christophe, Hurlin,x Sessi, Tokpavi{ October 18, 2020 Abstract In the context of credit scoring, ensemble methods based on decision trees, such as the random forest method, provide better classification performance than standard logistic regression models. However, logistic regression remains the benchmark in the credit risk industry mainly because the lack of interpretability of ensemble methods is incompatible with the requirements of financial regulators. In this paper, we pro- pose to obtain the best of both worlds by introducing a high-performance and inter- pretable credit scoring method called penalised logistic tree regression (PLTR), which uses information from decision trees to improve the performance of logistic regression. -
BERNOULLI NEWS, Vol 24 No 1 (2017)
Vol. 24 (1), May 2017 Published twice per year by the Bernoulli Society ISSN 1360–6727 CONTENTS News from the Bernoulli Society p. 1 Awards and Prizes p. 2 New Executive Members A VIEW FROM THE PRESIDENT in the Bernoulli Society Dear Members of the Bernoulli Society, p. 3 As we all seem to agree, the role and image of statistics has changed dramatically. Still, it takes ones breath when realizing the huge challenges ahead. Articles and Letters Statistics has not always been considered as being very necessary. In 1848 the Dutch On Bayesian Measures of Ministry of Home Affairs established an ofŮice of statistics. And then, thirty years later Uncertainty in Large or InŮinite minister Kappeyne van de Coppelo abolishes the “superŮluous” ofŮice. The ofŮice was Dimensional Models p. 4 quite rightly put back in place in 1899, as “Centraal Bureau voor de Statistiek” (CBS). Statistics at the CBS has evolved from “simple” counting to an art requiring a broad range On the Probability of Co-primality of competences. Of course counting remains important. For example the CBS reports in of two Natural Numbers Chosen February 2017 that almost 1 out of 4 people entitled to vote in the Netherlands is over the at Random p. 7 age of 65. But clearly, knowing this generates questions. What is the inŮluence of this on the outcome of the elections? This calls for more data. Demographic data are combined with survey data and nowadays also with data from other sources, in part to release the “survey pressure” that Ůirms and individuals are facing. -
18.650 (F16) Lecture 10: Generalized Linear Models (Glms)
Statistics for Applications Chapter 10: Generalized Linear Models (GLMs) 1/52 Linear model A linear model assumes Y X (µ(X), σ2I), | ∼ N And ⊤ IE(Y X) = µ(X) = X β, | 2/52 Components of a linear model The two components (that we are going to relax) are 1. Random component: the response variable Y X is continuous | and normally distributed with mean µ = µ(X) = IE(Y X). | 2. Link: between the random and covariates X = (X(1),X(2), ,X(p))⊤: µ(X) = X⊤β. · · · 3/52 Generalization A generalized linear model (GLM) generalizes normal linear regression models in the following directions. 1. Random component: Y some exponential family distribution ∼ 2. Link: between the random and covariates: ⊤ g µ(X) = X β where g called link function� and� µ = IE(Y X). | 4/52 Example 1: Disease Occuring Rate In the early stages of a disease epidemic, the rate at which new cases occur can often increase exponentially through time. Hence, if µi is the expected number of new cases on day ti, a model of the form µi = γ exp(δti) seems appropriate. ◮ Such a model can be turned into GLM form, by using a log link so that log(µi) = log(γ) + δti = β0 + β1ti. ◮ Since this is a count, the Poisson distribution (with expected value µi) is probably a reasonable distribution to try. 5/52 Example 2: Prey Capture Rate(1) The rate of capture of preys, yi, by a hunting animal, tends to increase with increasing density of prey, xi, but to eventually level off, when the predator is catching as much as it can cope with. -
Maximum Likelihood from Incomplete Data Via the EM Algorithm A. P. Dempster; N. M. Laird; D. B. Rubin Journal of the Royal Stati
Maximum Likelihood from Incomplete Data via the EM Algorithm A. P. Dempster; N. M. Laird; D. B. Rubin Journal of the Royal Statistical Society. Series B (Methodological), Vol. 39, No. 1. (1977), pp. 1-38. Stable URL: http://links.jstor.org/sici?sici=0035-9246%281977%2939%3A1%3C1%3AMLFIDV%3E2.0.CO%3B2-Z Journal of the Royal Statistical Society. Series B (Methodological) is currently published by Royal Statistical Society. Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at http://www.jstor.org/about/terms.html. JSTOR's Terms and Conditions of Use provides, in part, that unless you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you may use content in the JSTOR archive only for your personal, non-commercial use. Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at http://www.jstor.org/journals/rss.html. Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. JSTOR is an independent not-for-profit organization dedicated to and preserving a digital archive of scholarly journals. For more information regarding JSTOR, please contact [email protected]. http://www.jstor.org Fri Apr 6 01:07:17 2007 Maximum Likelihood from Incomplete Data via the EM Algorithm By A. P. DEMPSTER,N. M. LAIRDand D. B. RDIN Harvard University and Educational Testing Service [Read before the ROYALSTATISTICAL SOCIETY at a meeting organized by the RESEARCH SECTIONon Wednesday, December 8th, 1976, Professor S. -
BERNOULLI NEWS, Vol 22 No 2
Vol. 22 (2), 2015 Published twice per year by the Bernoulli Society ISSN 1360-6727 Contents News from the Bernoulli A VIEW FROM THE PRESIDENT Society p. 2 Prizes, Awards and Special Lectures p. 3 New Executive Members of the Bernoulli Society p. 4 Articles and Letters The Development of Modern Mathematics in Mongolia p. 6 Past Conferences, Sara van de Geer receives the Bernoulli Book from Wilfrid Kendall during the General MeetinGs and Workshops Assembly of the Bernoulli Society ISI World Congress in Rio de Janeiro, Brazil. p. 11 Dear Bernoulli Society Members, ForthcominG Conferences, MeetinGs and Workshops, It is an immense honour for me to write here as new president of the Bernoulli Society. The baton was handed over to me by Wilfrid Kendall, now our past-president, at the ISI World and Calendar of Events Statistics Congress in Rio de Janeiro. I am extremely grateful to Wilfrid for his perfect p. 17 handling of Bernoulli matters in the past two years. My thanks also goes to Ed Waymire, now past past-president. These two wise men helped me through president-electancy and I hope to be able to approximate their standards. Let me welcome Susan Murphy as our new president- elect. I am very much looking forward to work with Susan, with Wilfrid and with the Editor executive committee, council, standing committees and all of you as active Bernoulli Miguel de Carvalho supporters. The new members of council are Arup Bose, Valerie Isham, Victor Rivero, Akira Faculty oF Mathematics Sakai, Lorenzo Zambotti, and Johanna Ziegel. PUC, Chile To meet them, see elsewhere in this issue! Contact Looking at the famous Bernoulli book where pages are reserved for presidents to put their [email protected] signature, one sees there is one name standing much more to the right of the page (see __________________________________________ www.bernoulli-society.org/index.php/history). -
Elect New Council Members
Volume 43 • Issue 3 IMS Bulletin April/May 2014 Elect new Council members CONTENTS The annual IMS elections are announced, with one candidate for President-Elect— 1 IMS Elections 2014 Richard Davis—and 12 candidates standing for six places on Council. The Council nominees, in alphabetical order, are: Marek Biskup, Peter Bühlmann, Florentina Bunea, Members’ News: Ying Hung; 2–3 Sourav Chatterjee, Frank Den Hollander, Holger Dette, Geoffrey Grimmett, Davy Philip Protter, Raymond Paindaveine, Kavita Ramanan, Jonathan Taylor, Aad van der Vaart and Naisyin Wang. J. Carroll, Keith Crank, You can read their statements starting on page 8, or online at http://www.imstat.org/ Bani K. Mallick, Robert T. elections/candidates.htm. Smythe and Michael Stein; Electronic voting for the 2014 IMS Elections has opened. You can vote online using Stephen Fienberg; Alexandre the personalized link in the email sent by Aurore Delaigle, IMS Executive Secretary, Tsybakov; Gang Zheng which also contains your member ID. 3 Statistics in Action: A If you would prefer a paper ballot please contact IMS Canadian Outlook Executive Director, Elyse Gustafson (for contact details see the 4 Stéphane Boucheron panel on page 2). on Big Data Elections close on May 30, 2014. If you have any questions or concerns please feel free to 5 NSF funding opportunity e [email protected] Richard Davis contact Elyse Gustafson . 6 Hand Writing: Solving the Right Problem 7 Student Puzzle Corner 8 Meet the Candidates 13 Recent Papers: Probability Surveys; Stochastic Systems 15 COPSS publishes 50th Marek Biskup Peter Bühlmann Florentina Bunea Sourav Chatterjee anniversary volume 16 Rao Prize Conference 17 Calls for nominations 19 XL-Files: My Valentine’s Escape 20 IMS meetings Frank Den Hollander Holger Dette Geoffrey Grimmett Davy Paindaveine 25 Other meetings 30 Employment Opportunities 31 International Calendar 35 Information for Advertisers Read it online at Kavita Ramanan Jonathan Taylor Aad van der Vaart Naisyin Wang http://bulletin.imstat.org IMSBulletin 2 . -
Analysis of Computer Experiments Using Penalized Likelihood in Gaussian Kriging Models
Editor’s Note: This article will be presented at the Technometrics invited session at the 2005 Joint Statistical Meetings in Minneapolis, Minnesota, August 2005. Analysis of Computer Experiments Using Penalized Likelihood in Gaussian Kriging Models Runze LI Agus SUDJIANTO Department of Statistics Risk Management Quality & Productivity Pennsylvania State University Bank of America University Park, PA 16802 Charlotte, NC 28255 ([email protected]) ([email protected]) Kriging is a popular analysis approach for computer experiments for the purpose of creating a cheap- to-compute “meta-model” as a surrogate to a computationally expensive engineering simulation model. The maximum likelihood approach is used to estimate the parameters in the kriging model. However, the likelihood function near the optimum may be flat in some situations, which leads to maximum likelihood estimates for the parameters in the covariance matrix that have very large variance. To overcome this difficulty, a penalized likelihood approach is proposed for the kriging model. Both theoretical analysis and empirical experience using real world data suggest that the proposed method is particularly important in the context of a computationally intensive simulation model where the number of simulation runs must be kept small because collection of a large sample set is prohibitive. The proposed approach is applied to the reduction of piston slap, an unwanted engine noise due to piston secondary motion. Issues related to practical implementation of the proposed approach are discussed. KEY WORDS: Computer experiment; Fisher scoring algorithm; Kriging; Meta-model; Penalized like- lihood; Smoothly clipped absolute deviation. 1. INTRODUCTION e.g., Ye, Li, and Sudjianto 2000) and by the meta-modeling ap- proach itself (Jin, Chen, and Simpson 2000). -
Statistics Making an Impact
John Pullinger J. R. Statist. Soc. A (2013) 176, Part 4, pp. 819–839 Statistics making an impact John Pullinger House of Commons Library, London, UK [The address of the President, delivered to The Royal Statistical Society on Wednesday, June 26th, 2013] Summary. Statistics provides a special kind of understanding that enables well-informed deci- sions. As citizens and consumers we are faced with an array of choices. Statistics can help us to choose well. Our statistical brains need to be nurtured: we can all learn and practise some simple rules of statistical thinking. To understand how statistics can play a bigger part in our lives today we can draw inspiration from the founders of the Royal Statistical Society. Although in today’s world the information landscape is confused, there is an opportunity for statistics that is there to be seized.This calls for us to celebrate the discipline of statistics, to show confidence in our profession, to use statistics in the public interest and to champion statistical education. The Royal Statistical Society has a vital role to play. Keywords: Chartered Statistician; Citizenship; Economic growth; Evidence; ‘getstats’; Justice; Open data; Public good; The state; Wise choices 1. Introduction Dictionaries trace the source of the word statistics from the Latin ‘status’, the state, to the Italian ‘statista’, one skilled in statecraft, and on to the German ‘Statistik’, the science dealing with data about the condition of a state or community. The Oxford English Dictionary brings ‘statistics’ into English in 1787. Florence Nightingale held that ‘the thoughts and purpose of the Deity are only to be discovered by the statistical study of natural phenomena:::the application of the results of such study [is] the religious duty of man’ (Pearson, 1924). -
A4 -FINAL Report of the ISI Executive Committee to The
INTERNATIONAL STATISTICAL INSTITUTE Report of the ISI Executive Committee 2001-2003 to the General Assembly ISI Executive Committee 2001-2003: President: Dennis Trewin President-Elect: Stephen Stigler Vice Presidents: Denise Lievesley Jef Teugels Jae Chang Lee Director of the ISI Permanent Office: Marcel P.R. Van den Broecke Contents: Activities..................................................................................................................................................................................187 Meetings and Conferences ..................................................................................................................................................187 Publications ............................................................................................................................................................................189 Strengthening the ties between statistical disciplines .....................................................................................................190 Improving the quality of statistics ........................................................................................................................................190 Membership matters .............................................................................................................................................................191 The Future of the ISI .............................................................................................................................................................192 -
Properties of the Zero-And-One Inflated Poisson Distribution and Likelihood
Statistics and Its Interface Volume 9 (2016) 11–32 Properties of the zero-and-one inflated Poisson distribution and likelihood-based inference methods Chi Zhang, Guo-Liang Tian∗, and Kai-Wang Ng The over-dispersion may result from other phenomena such as a high proportion of both zeros and ones in the To model count data with excess zeros and excess ones, count data. For example, Eriksson and Aberg˚ [8] reported a in their unpublished manuscript, Melkersson and Olsson two-year panel data from Swedish Level of Living Surveys (1999) extended the zero-inflated Poisson distribution to in 1974 and 1991, where the visits to a dentist have higher a zero-and-one-inflated Poisson (ZOIP) distribution. How- proportions of zeros and ones and one-visit observations are ever, the distributional theory and corresponding properties even much more frequent than zero-visits. It is common to of the ZOIP have not yet been explored, and likelihood- visit a dentist for a routine control, e.g., school-children go based inference methods for parameters of interest were not to a dentist once a year almost as a rule. As the second ex- well developed. In this paper, we extensively study the ZOIP ample, Carrivick et al.[3] reported that much of the data distribution by first constructing five equivalent stochastic collected on occupational safety involves accident or injury representations for the ZOIP random variable and then de- counts with many zeros and ones. In general, the safety reg- riving other important distributional properties. Maximum ulations protect most workers from injury so that many ze- likelihood estimates of parameters are obtained by both the ros are observed. -
A Brief Survey of Modern Optimization for Statisticians1
International Statistical Review (2014), 82, 1, 46–70 doi:10.1111/insr.12022 A Brief Survey of Modern Optimization for Statisticians1 Kenneth Lange1,2,EricC.Chi2 and Hua Zhou3 1Departments of Biomathematics and Statistics, University of California, Los Angeles, CA 90095-1766, USA E-mail: [email protected] 2Department of Human Genetics, University of California, Los Angeles, CA 90095-1766, USA E-mail: [email protected] 3Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, USA E-mail: [email protected] Summary Modern computational statistics is turning more and more to high-dimensional optimization to handle the deluge of big data. Once a model is formulated, its parameters can be estimated by optimization. Because model parsimony is important, models routinely include non-differentiable penalty terms such as the lasso. This sober reality complicates minimization and maximization. Our broad survey stresses a few important principles in algorithm design. Rather than view these principles in isolation, it is more productive to mix and match them. A few well-chosen examples illustrate this point. Algorithm derivation is also emphasized, and theory is downplayed, particularly the abstractions of the convex calculus. Thus, our survey should be useful and accessible to a broad audience. Key words: Block relaxation; Newton’s method; MM algorithm; penalization; augmented Lagrangian; acceleration. 1 Introduction Modern statistics represents a confluence of data, algorithms, practical inference, and subject area knowledge. As data mining expands, computational statistics is assuming greater promi- nence. Surprisingly, the confident prediction of the previous generation that Bayesian methods would ultimately supplant frequentist methods has given way to a realization that Markov chain Monte Carlo may be too slow to handle modern data sets. -
On the Maximization of Likelihoods Belonging to the Exponential Family
arXiv.org manuscript No. (will be inserted by the editor) On the maximization of likelihoods belonging to the exponential family using ideas related to the Levenberg-Marquardt approach Marco Giordan · Federico Vaggi · Ron Wehrens Received: date / Revised: date Abstract The Levenberg-Marquardt algorithm is a flexible iterative procedure used to solve non-linear least squares problems. In this work we study how a class of possible adaptations of this procedure can be used to solve maximum likelihood problems when the underlying distributions are in the exponential family. We formally demonstrate a local convergence property and we discuss a possible implementation of the penalization involved in this class of algorithms. Applications to real and simulated compositional data show the stability and efficiency of this approach. Keywords Aitchison distribution · compositional data · Dirichlet distribution · generalized linear models · natural link · optimization. arXiv:1410.0793v1 [stat.CO] 3 Oct 2014 M. Giordan IASMA Research and Innovation Centre, Biostatistics and Data Managment E-mail: [email protected] F. Vaggi IASMA Research and Innovation Centre, Integrative Genomic Research E-mail: [email protected] Ron Wehrens IASMA Research and Innovation Centre, Biostatistics and Data Managment E-mail: [email protected] 2 Marco Giordan et al. 1 Introduction The exponential family is a large class of probability distributions widely used in statistics. Distributions in this class possess many useful properties that make them useful for inferential and algorithmic purposes, especially with reference to maximum likelihood estimation. Despite this, when closed form solutions are not available, computational problems can arise in the fitting of a particular distribution in this family to real data.