Algorithmic Inference of Two-Parameter Gamma Distribution Bruno Apolloni, Simone Bassis

Total Page:16

File Type:pdf, Size:1020Kb

Algorithmic Inference of Two-Parameter Gamma Distribution Bruno Apolloni, Simone Bassis Algorithmic Inference of Two-Parameter Gamma Distribution Bruno Apolloni, Simone Bassis To cite this version: Bruno Apolloni, Simone Bassis. Algorithmic Inference of Two-Parameter Gamma Distribution. Com- munications in Statistics - Simulation and Computation, Taylor & Francis, 2009, 38 (09), pp.1950-1968. 10.1080/03610910903171821. hal-00520670 HAL Id: hal-00520670 https://hal.archives-ouvertes.fr/hal-00520670 Submitted on 24 Sep 2010 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Communications in Statistics - Simulation and Computation For Peer Review Only Algorithmic Inference of Two-Parameter Gamma Distribution Journal: Communications in Statistics - Simulation and Computation Manuscript ID: LSSP-2009-0057.R1 Manuscript Type: Original Paper Date Submitted by the 01-Jul-2009 Author: Complete List of Authors: Apolloni, Bruno; University of Milano, Department of Computer Science Bassis, Simone; University of Milano, Department of Computer Science Two-parameters Gamma distribution, Algorithmic inference, Keywords: Population bootstrap, Gibbs sampling, Adaptive rejection sampling, Maximum likelihood estimator We provide an estimation procedure of the two-parameter Gamma distribution based on the Algorithmic Inference approach. As a key feature of this approach, we compute the joint probability distribution of these parameters without assuming any prior. To this end we Abstract: propose a numerical algorithm which is often beneficial of a highly efficient speed up based on an approximate analytical expression of the probability distribution. We contrast our interval and point estimates with those recently obtained in Son and Oh (2006). We realize that our estimates are both unbiased and more accurate, albeit more dispersed than Bayesian methods. Note: The following files were submitted by the author for peer review, but cannot be converted to PDF. You must view these files (e.g. movies) online. URL: http://mc.manuscriptcentral.com/lssp E-mail: [email protected] Page 1 of 49 Communications in Statistics - Simulation and Computation 1 2 3 4 infgamma.zip 5 6 7 8 9 10 11 12 13 14 For Peer Review Only 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 URL: http://mc.manuscriptcentral.com/lssp E-mail: [email protected] Communications in Statistics - Simulation and Computation Page 2 of 49 1 2 INFERENCE 3 4 ALGORITHMIC INFERENCE OF TWO-PARAMETER GAMMA DISTRIBUTION 5 6 7 8 B. Apolloni and S. Bassis 9 10 Dipartimento di Scienze dell’Informazione 11 12 University of Milano 13 Via Comelico 39/41, 20135 14 15 Milano, Italy 16 For Peer Review Only 17 {apolloni,bassis}@dsi.unimi.it 18 19 20 Key Words: Two-parameters Gamma distribution; Algorithmic inference; Population boot- 21 22 strap; Gibbs sampling; Adaptive rejection sampling; Maximum likelihood estimator. 23 24 ABSTRACT 25 26 We provide an estimation procedure of the two-parameter Gamma distribution based 27 28 on the Algorithmic Inference approach. As a key feature of this approach, we compute the 29 30 joint probability distribution of these parameters without assuming any prior. To this end we 31 32 propose a numerical algorithm which is often beneficial of a highly efficient speed up based on 33 an approximate analytical expression of the probability distribution. We contrast our interval 34 35 and point estimates with those recently obtained in Son and Oh (2006) for the same problem. 36 37 From this benchmark we realize that our estimates are both unbiased and more accurate, 38 39 albeit more dispersed, in some cases, than the competitor methods’, where the dispersion 40 41 drawback is notably mitigated w.r.t. Bayesian methods by a greater estimate decorrelation. 42 43 We also briefly discuss the theoretical novelty of the adopted inference paradigm which 44 45 actually represents a brush up on a Fisher perspective dating to almost a century, made 46 feasible today by the available computational tools. 47 48 49 1. INTRODUCTION 50 51 The Gamma distribution is a formidable touchstone for any inference framework. On 52 53 54 55 56 1 57 58 59 60 URL: http://mc.manuscriptcentral.com/lssp E-mail: [email protected] Page 3 of 49 Communications in Statistics - Simulation and Computation 1 2 the one hand it is widely employed in many application fields, ranging from reliability (Ju- 3 ran and Godfrey, 1999; Corp., 2009) to telecommunications (Titus and Wheatley, 1996), 4 5 insurance (Ammeter, 1970) and so on. On the other, the estimation of its parameters still 6 7 poses a challenging problem. Adopting an analogous notation as in Son and Oh (2006), we 8 9 introduce the Gamma distribution through its probability density function (PDF) applied 10 1 11 to the random variable X as follows : 12 13 xα−1e−x/β f (x)= , x> 0, α,β> 0, (1) 14 X Γ(α)βα 15 16 where α and Forβ are respectively Peer a shape Review and scale parameter. Only Of the conventional point 17 18 estimators, the one based on the method of moments (MM) is very simple but inefficient 19 20 as well – with some exceptions in asymptotic conditions (Wiens et al., 2003). Maximum 21 22 likelihood (ML) estimators may rely on a pair of joint sufficient statistics, but they are 23 24 penalized by the absence of a closed analytical expression. To overcome this drawback, we 25 26 find in the literature either approximate expressions (Thom, 1958; Greenwood and D., 1960) 27 28 or recursive procedures to get a numerical solution (Bowman and Shenton, 1988). With the 29 latter we work with a sequence of parameter values converging with some dynamics to the 30 31 planned estimators. With a dual perspective, other researchers adopt the Bayes approach to 32 33 collect a population of parameter estimators – possibly generated through a Gibbs sampler 34 35 (Ripley, 1988) or more sophisticated techniques (Gilks and P., 1992) – and deduce point 36 37 estimators as central values of this population. 38 39 Our work goes in the same direction, but without requiring to set any a priori distri- 40 41 bution of the parameters. We have implemented this kind of procedure in many inference 42 problems, mainly within the machine learning framework (Apolloni et al., 2008a) and in 43 44 specific operational fields such as survival analysis (Apolloni et al., 2007b), human mobility 45 46 modeling (Apolloni et al., 2007a) and statistical tests of hypotheses (Apolloni and Bassis, 47 48 2007). The goal of this paper is twofold: on the one hand we want to efficiently solve the 49 1 50 By default, capital letters (such as U, X) will denote random variables and small letters (u, x) their 51 corresponding realizations; vectorial versions of the above variables will be denoted through bold symbols 52 53 (X, U, x, u). 54 55 56 2 57 58 59 60 URL: http://mc.manuscriptcentral.com/lssp E-mail: [email protected] Communications in Statistics - Simulation and Computation Page 4 of 49 1 2 questioned estimation problem as a distinguished inference task. To this aim, we show our 3 procedures to quickly compute accurate and unbiased estimators based on independent ap- 4 5 proximations of the unknown parameters. On the other hand, we would use this solution as 6 7 a business card to introduce our inference framework, which we call Algorithmic Inference, 8 9 to a more probability methodologically oriented audience and invite the reader to read more 10 11 widely about it in a couple of books we have published on the subject (Apolloni et al., 2008c, 12 13 2006). Therefore the paper is organized as follows. In Section 2 we introduce our method 14 15 from a strictly operational perspective, through definitions and a few logical links, having 16 the questionedFor inference Peer problem as Review lead example. In SectionOnly3 we apply the proposed 17 18 procedure to the same benchmark as Son and Oh (2006), contrasting our results with the 19 20 competitor methods considered therein. We devote Section 4 to a discussion of the benefits 21 22 of our approach and its connections to some inference strategies animating the debate since 23 24 the early 20th century, to conclude the paper. 25 26 2. COMPATIBLE PARAMETERS 27 28 Given a distribution law, think of its unknown parameters as random parameters. But, 29 30 in place of a specific prior distribution, they inherit probabilities from a standard source 31 32 of random seeds that are also at the basis of the random variables they are deputed to 33 34 characterize. Like with a barrel with two interlocking taps, through the former you get, 35 36 from the source, samples of the random variable for given parameters; while from the latter 37 38 you get samples of the parameters for given observed variables. You measure what happens 39 with the former and compute what would happen with the latter. The basic steps of this 40 41 twisting task are the following: 42 43 M 44 1. Sampling mechanism X . It consists of a pair hZ,gθi, where the seed Z is a random 45 θ 46 variable without unknown parameters, while the explaining function g is a function 47 mapping from samples {z ,...,z } of Z to samples {x ,...,x } of the random variable 48 1 m 1 m 49 X we are interested in.
Recommended publications
  • Bayesian Inference for Two-Parameter Gamma Distribution Assuming Different Noninformative Priors
    Revista Colombiana de Estadística Diciembre 2013, volumen 36, no. 2, pp. 319 a 336 Bayesian Inference for Two-Parameter Gamma Distribution Assuming Different Noninformative Priors Inferencia Bayesiana para la distribución Gamma de dos parámetros asumiendo diferentes a prioris no informativas Fernando Antonio Moala1;a, Pedro Luiz Ramos1;b, Jorge Alberto Achcar2;c 1Departamento de Estadística, Facultad de Ciencia y Tecnología, Universidade Estadual Paulista, Presidente Prudente, Brasil 2Departamento de Medicina Social, Facultad de Medicina de Ribeirão Preto, Universidade de São Paulo, Ribeirão Preto, Brasil Abstract In this paper distinct prior distributions are derived in a Bayesian in- ference of the two-parameters Gamma distribution. Noniformative priors, such as Jeffreys, reference, MDIP, Tibshirani and an innovative prior based on the copula approach are investigated. We show that the maximal data information prior provides in an improper posterior density and that the different choices of the parameter of interest lead to different reference pri- ors in this case. Based on the simulated data sets, the Bayesian estimates and credible intervals for the unknown parameters are computed and the performance of the prior distributions are evaluated. The Bayesian analysis is conducted using the Markov Chain Monte Carlo (MCMC) methods to generate samples from the posterior distributions under the above priors. Key words: Gamma distribution, noninformative prior, copula, conjugate, Jeffreys prior, reference, MDIP, orthogonal, MCMC. Resumen En este artículo diferentes distribuciones a priori son derivadas en una in- ferencia Bayesiana de la distribución Gamma de dos parámetros. A prioris no informativas tales como las de Jeffrey, de referencia, MDIP, Tibshirani y una priori innovativa basada en la alternativa por cópulas son investigadas.
    [Show full text]
  • Bayes Estimation and Prediction of the Two-Parameter Gamma Distribution
    Bayes Estimation and Prediction of the Two-Parameter Gamma Distribution Biswabrata Pradhan ¤& Debasis Kundu y Abstract In this article the Bayes estimates of two-parameter gamma distribution is considered. It is well known that the Bayes estimators of the two-parameter gamma distribution do not have compact form. In this paper, it is assumed that the scale parameter has a gamma prior and the shape parameter has any log-concave prior, and they are independently distributed. Under the above priors, we use Gibbs sampling technique to generate samples from the posterior density function. Based on the generated samples, we can compute the Bayes estimates of the unknown parameters and also can construct highest posterior density credible intervals. We also compute the approximate Bayes estimates using Lindley's approximation under the assumption of gamma priors of the shape parameter. Monte Carlo simulations are performed to compare the performances of the Bayes estimators with the classical estimators. One data analysis is performed for illustrative purposes. We further discuss about the Bayesian prediction of future observation based on the observed sample and it is observed that the Gibbs sampling technique can be used quite e®ectively, for estimating the posterior predictive density and also for constructing predictive interval of the order statistics from the future sample. Keywords: Maximum likelihood estimators; Conjugate priors; Lindley's approximation; Gibbs sampling; Predictive density, Predictive distribution. Subject Classifications: 62F15; 65C05 Corresponding Address: Debasis Kundu, Phone: 91-512-2597141; Fax: 91-512-2597500; e-mail: [email protected] ¤SQC & OR Unit, Indian Statistical Institute, 203 B.T. Road, Kolkata, Pin 700108, India yDepartment of Mathematics and Statistics, Indian Institute of Technology Kanpur, Pin 208016, India 1 1 Introduction The two-parameter gamma distribution has been used quite extensively in reliability and survival analysis particularly when the data are not censored.
    [Show full text]
  • Big Data Effective Processing and Analysis of Very Large and Unstructured Data for Official Statistics
    Big Data Effective Processing and Analysis of Very Large and Unstructured data for Official Statistics. Big Data and Interference in Official Statistics Part 2 Diego Zardetto Istat ([email protected]) Eurostat Big Data: Methodological Pitfalls Outline • Representativeness (w.r.t. the desired target population) Selection Bias Actual target population unknown Often sample units’ identity unclear/fuzzy Pre processing errors (acting like measurement errors in surveys) Social media & sentiment analysis: pointless babble & social bots • Ubiquitous correlations Causation fallacy Spurious correlations • Structural break in Nowcasting Algorithms Eurostat 1 Social Networks: representativeness (1/3) The “Pew Research Center” carries out periodic sample surveys on the way US people use the Internet According to its Social Media Update 2013 research: • Among US Internet users aged 18+ (as of Sept 2013) 73% of online adults use some Social Networks 71% of online adults use Facebook 18% of online adults use Twitter 17% of online adults use Instagram 21% of online adults use Pinterest 22% of online adults use LinkedIn • Subpopulations using these Social platforms turn out to be quite different in terms of: Gender, Ethnicity, Age, Education, Income, Urbanity Eurostat Social Networks: representativeness (2/3) According to Social Media Update 2013 research • Social platform users turn out to belong to typical social & demographic profiles Pinterest is skewed towards females LinkedIn over-represents higher education and higher income
    [Show full text]
  • A Philosophical Framework for Explainable Artificial
    Reasons, Values, Stakeholders: A Philosophical Framework for Explainable Artificial Intelligence Atoosa Kasirzadeh [email protected] University of Toronto (Toronto, Canada) & Australian National University (Canberra, Australia) Abstract The societal and ethical implications of the use of opaque artificial intelligence systems for consequential decisions, such as welfare allocation and criminal justice, have generated a lively debate among multiple stakeholder groups, in- cluding computer scientists, ethicists, social scientists, policy makers, and end users. However, the lack of a common language or a multi-dimensional frame- work to appropriately bridge the technical, epistemic, and normative aspects of this debate prevents the discussion from being as productive as it could be. Drawing on the philosophical literature on the nature and value of explana- tions, this paper offers a multi-faceted framework that brings more conceptual precision to the present debate by (1) identifying the types of explanations that are most pertinent to artificial intelligence predictions, (2) recognizing the rel- evance and importance of social and ethical values for the evaluation of these explanations, and (3) demonstrating the importance of these explanations for incorporating a diversified approach to improving the design of truthful al- arXiv:2103.00752v1 [cs.CY] 1 Mar 2021 gorithmic ecosystems. The proposed philosophical framework thus lays the groundwork for establishing a pertinent connection between the technical and ethical aspects of artificial intelligence systems. Keywords: Ethical Algorithm, Explainable AI, Philosophy of Explanation, Ethics of Artificial Intelligence Preprint submitted to Elsevier 1. Preliminaries Governments and private actors are seeking to leverage recent develop- ments in artificial intelligence (AI) and machine learning (ML) algorithms for use in high-stakes decision making.
    [Show full text]
  • An Algorithmic Inference Approach to Learn Copulas Arxiv:1910.02678V1
    An Algorithmic Inference Approach to Learn Copulas Bruno Apolloni Department of Computer Science University of Milano Via Celoria 18, 20133, Milano, Italy [email protected] October 8, 2019 Abstract We introduce a new method for estimating the parameter α of the bivariate Clayton copulas within the framework of Algorithmic Inference. The method consists of a variant of the standard boot- strapping procedure for inferring random parameters, which we ex- arXiv:1910.02678v1 [stat.ML] 7 Oct 2019 pressly devise to bypass the two pitfalls of this specific instance: the non independence of the Kendall statistics, customarily at the basis of this inference task, and the absence of a sufficient statistic w.r.t. α. The variant is rooted on a numerical procedure in order to find the α estimate at a fixed point of an iterative routine. Although paired 1 with the customary complexity of the program which computes them, numerical results show an outperforming accuracy of the estimates. Keywords | Copulas' Inference, Clayton copulas, Algorithmic Infer- ence, Bootstrap Methods. 1 Introduction Copulas are the atoms of the stochastic dependency between variables. For short, the Sklar Theorem (Sklar 1973) states that we may split the joint cumulative distribution function (CDF) FX1;:::;Xn of variables X1;:::;Xn into the composition of their marginal CDF FXi ; i = 1; : : : ; n through their copula 1 CX1;:::;Xn . Namely : FX1;:::;Xn (x1; : : : ; xn) = CX1;:::;Xn (FX1 (x1);:::;FXn (xn)) (1) While the random variable FXi (Xi) is a uniform variable U in [0; 1] for whatever continuous FXi thanks to the Probability Integral Transform Theo- rem (Rohatgi 1976) (with obvious extension to the discrete case), CX1;:::;Xn (FX1 (X1); :::;FXn (Xn)), hence CX1;:::;Xn (U1;:::;Un); has a specific distribution law which characterizes the dependence between the variables.
    [Show full text]
  • Study On: Tools for Causal Inference from Cross-Sectional Innovation Surveys with Continuous Or Discrete Variables
    Study on: Tools for causal inference from cross-sectional innovation surveys with continuous or discrete variables Dominik Janzing June 28, 2016 Abstract The design of effective public policy, in innovation support, and in other areas, has been constrained by the difficulties involved in under- standing the effects of interventions. The limited ability of researchers to unpick causal relationships means that our understanding of the likely ef- fects of public policy remains relatively poor. If improved methods could be found to unpick causal connections our understanding of important phenomena and the impact of policy interventions could be significantly improved, with important implications for future social prosperity of the EU. Recently, a series of new techniques have been developed in the natu- ral and computing sciences that are enabling researchers to unpick causal links that were largely impossible to explore using traditional methods. Such techniques are currently at the cutting edge of research, and remain technical for the non-specialist, but are slowly diffusing to the broader academic community. Economic analysis of innovation data, in particu- lar, stands to gain a lot from applying techniques developed by the ma- chine learning community. Writing recently in the Journal of Economic Perspectives, Varian (2014, p3) explains \my standard advice to graduate students these days is go to the computer science department and take a class in machine learning. There have been very fruitful collaborations between computer scientists and statisticians in the last decade or so, and I expect collaborations between computer scientists and econometricians will also be productive in the future." However, machine learning tech- niques have not yet been applied to innvovation policy.
    [Show full text]
  • Predictive Inference for Non Probability Samples
    Discussion Paper Predictive inference for non-probability samples: a simulation study The views expressed in this paper are those of the author(s) and do not necessarily reflect the policies of Statistics Netherlands 2015 | 13 Bart Buelens Joep Burger Jan van den Brakel 3 CBS | Discussion Paper, 2015 | 13 Non-probability samples provide a challenging source of information for official statistics, because the data generating mechanism is unknown. Making inference from such samples therefore requires a novel approach compared with the classic approach of survey sampling. Design-based inference is a powerful technique for random samples obtained via a known survey design, but cannot legitimately be applied to non-probability samples such as big data and voluntary opt-in panels. We propose a framework for such non-probability samples based on predictive inference. Three classes of methods are discussed. Pseudo-design-based methods are the simplest and apply traditional design-based estimation despite the absence of a survey design; model-based methods specify an explicit model and use that for prediction; algorithmic methods from the field of machine learning produce predictions in a non-linear fashion through computational techniques. We conduct a simulation study with a real-world data set containing annual mileages driven by cars for which a number of auxiliary characteristics are known. A number of data generating mechanisms are simulated, and—in absence of a survey design—a range of methods for inference are applied and compared to the known population values. The first main conclusion from the simulation study is that unbiased inference from a selective non- probability sample is possible, but access to the variables explaining the selection mechanism underlying the data generating process is crucial.
    [Show full text]
  • Engines of Order: a Mechanology of Algorithmic Techniques
    A Mechanology Engines of Algorithmic of Order Techniques BERNHARD RIEDER Amsterdam University Press Engines of Order The book series RECURSIONS: THEORIES OF MEDIA, MATERIALITY, AND CULTURAL TECHNIQUES provides a platform for cuttingedge research in the field of media culture studies with a particular focus on the cultural impact of media technology and the materialities of communication. The series aims to be an internationally significant and exciting opening into emerging ideas in media theory ranging from media materialism and hardware-oriented studies to ecology, the post-human, the study of cultural techniques, and recent contributions to media archaeology. The series revolves around key themes: – The material underpinning of media theory – New advances in media archaeology and media philosophy – Studies in cultural techniques These themes resonate with some of the most interesting debates in international media studies, where non-representational thought, the technicity of knowledge formations and new materialities expressed through biological and technological developments are changing the vocabularies of cultural theory. The series is also interested in the mediatic conditions of such theoretical ideas and developing them as media theory. Editorial Board – Jussi Parikka (University of Southampton) – Anna Tuschling (Ruhr-Universität Bochum) – Geoffrey Winthrop-Young (University of British Columbia) Engines of Order A Mechanology of Algorithmic Techniques Bernhard Rieder Amsterdam University Press This publication is funded by the Dutch Research Council (NWO). Chapter 1 contains passages from Rieder, B. (2016). Big Data and the Paradox of Diversity. Digital Culture & Society 2(2), 1-16 and Rieder, B. (2017). Beyond Surveillance: How Do Markets and Algorithms ‘Think’? Le Foucaldien 3(1), n.p.
    [Show full text]
  • 9<HTMIPI=Aadajf>
    58 Engineering Springer News 6/2008 springer.com/booksellers S. Akella, N. Amato, W. Huang, B. Mishra (Eds.) B. Apolloni, University of Milano, Italy; W. Pedrycz, V. Botti, A. Giret, Technical University of Valencia, University of Alberta, Edmonton, Canada; S. Bassis, Spain Algorithmic Foundation of D. Malchiodi, University of Milano, Italy Robotics VII ANEMONA The Puzzle of Granular A Multi-agent Methodology for Holonic Selected Contributions of the Seventh Computing International Workshop on the Algorithmic Manufacturing Systems Foundations of Robotics “Rem tene, verba sequentur“ Gaius J. Victor, Rome ANEMONA is a multi-agent system (MAS) meth- VI century b.c. odology for holonic manufacturing system (HMS) Algorithms are a fundamental component of The ultimate goal of this book is to bring the analysis and design, based on HMS requirements. robotic systems: they control or reason about fundamental issues of information granularity, ANEMONA defines a mixed top-down and bottom- motion and perception in the physical world. inference tools and problem solving procedures up development process, and provides HMS-specific They receive input from noisy sensors, consider into a coherent, unified, and fully operational guidelines to help the designer in identifying and geometric and physical constraints, and operate on framework. The objective is to offer the reader a implementing holons. In ANEMONA, the specified the world through imprecise actuators. The design comprehensive, self-contained, and uniform expo- HMS is divided into concrete aspects that form and analysis of robot algorithms therefore raises a sure to the subject.The strategy is to isolate some different “views” of the system. unique combination of questions in control theory, fundamental bricks of Computational Intelligence The development process of ANEMONA provides computational and differential geometry, and in terms of key problems and methods, and discuss clear and HMS-specific modelling guidelines computer science.
    [Show full text]
  • An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems
    An Algorithmic Information Calculus for Causal Discovery and Reprogramming Systems Hector Zenil^*a,b,c,d,e, Narsis A. Kiani*a,b,d,e, Francesco Marabitab,d, Yue Dengb, Szabolcs Eliasb,d, Angelika Schmidtb,d, Gordon Ballb,d, & Jesper Tegnér^b,d,f a) Algorithmic Dynamics Lab, Center for Molecular Medicine, Karolinska Institutet, Stockholm, 171 76, Sweden b) Unit of Computational Medicine, Center for Molecular Medicine, Department of Medicine, Solna, Karolinska Institutet, Stockholm, 171 76, Sweden c) Department of Computer Science, University of Oxford, Oxford, OX1 3QD, UK. d) Science for Life Laboratory, Solna, 171 65, Sweden e) Algorithmic Nature Group, LABORES for the Natural and Digital Sciences, Paris, 75006, France. f) Biological and Environmental Sciences and Engineering Division, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955–6900, Kingdom of Saudi Arabia * Shared-first authors ^ Corresponding authors Author Contributions: HZ, NK and JT are responsible for the general design and conception. HZ and NK are responsible for data acquisition. HZ, NK, YD, FM, SE and AS contributed to data analysis. HZ developed the methodology, with key contributions from NK and JT. HZ undertook most of the numerical experiments, with YD and FM contributing. HZ, AS, GB and SE contributed the literature-based Th17 enrichment analysis. HZ and JT wrote the article, with key contributions from NK. Correspondence should be addressed to HZ: [email protected]
    [Show full text]
  • Large Scale Multinomial Inferences and Its Applications in Genome Wide Association Studies
    Large Scale Multinomial Inferences and Its Applications in Genome Wide Association Studies Chuanhai Liu and Jun Xie Abstract Statistical analysis of multinomial counts with a large number K of cat- egories and a small number n of sample size is challenging to both frequentist and Bayesian methods and requires thinking about statistical inference at a very funda- mental level. Following the framework of Dempster-Shafer theory of belief func- tions, a probabilistic inferential model is proposed for this “large K and small n” problem. The inferential model produces a probability triplet (p,q,r) for an asser- tion conditional on observed data. The probabilities p and q are for and against the truth of the assertion, whereas r = 1 p q is the remaining probability called the probability of “don’t know”. The new− inference− method is applied in a genome-wide association study with very high dimensional count data, to identify association be- tween genetic variants to the disease Rheumatoid Arthritis. 1 Introduction We consider statistical inference for the comparison of two large-scale multino- mial distributions. This problem is motivated by genome-wide association studies (GWAS) with very high dimensional count data, i.e., single nucleotide polymor- phism (SNP) data. SNPs are major genetic variants that may associate with common diseases such as cancer and heart disease. A SNP has three possible genotypes, wild type homozygous, heterozygous, and mutation (rare) homozygous. In genome-wide association studies, genotypes of over 500,000 SNPs are typically measured for dis- ease (case) and normal (control) subjects, resulting in a large amount of count data.
    [Show full text]
  • Tight Bounds for SVM Classification Error
    Tight bounds for SVM classification error B. Apolloni, S. Bassis, S. Gaito, D. Malchiodi Dipartimento di Scienze dell'Informazione, Universita degli Studi di Milano Via Comelico 39/41, 20135 Milano, Italy E-mail: {apolloni, bassis, gaito, malchiodi}@dsi.unimi.it Abstract- We find very tight bounds on the accuracy of a Sup- lies in finding a separating hyperplane, i.e. an h E H such port Vector Machine classification error within the Algorithmic that all the points with a given label belong to one of the two Inference framework. The framework is specially suitable for this half-spaces determined by h. kind of classifier since (i) we know the number of support vectors really employed, as an ancillary output of the learning procedure, In order to obtain such a h, an SVM computes first the and (ii) we can appreciate confidence intervals of misclassifying solution {°ia, . , a* } of a dual constrained optimization probability exactly in function of the cardinality of these vectors. problem As a result we obtain confidence intervals that are up to an order m m narrower than those in the a 1 supplied literature, having slight max Z different meaning due to the different approach they come from, °Zai-2 aiicmjyiYjxi-x (1) but the same operational function. We numerically check the t=1 i,i-=i covering of these intervals. m aiyi = 0 (2) I. INTRODUCTION i-l Support Vector Machines (SVM for short) [1] represent ajt > O i = 1 . .. ,m, (3) an operational tool widely used by the Machine Learning where - denotes the standard dot product in RI, and then community.
    [Show full text]