Heidi Mara Do Rosário Sousa Estudo De Modelos De Classificação

UNIVERSIDADE ESTADUAL DE CAMPINAS
Instituto de Matemática, Estatística e Computação Científica

HEIDI MARA DO ROSÁRIO SOUSA

ESTUDO DE MODELOS DE CLASSIFICAÇÃO COM APLICAÇÃO A DADOS GENÔMICOS
(Study of Classification Models with Application to Genomic Data)

Campinas, 2019

Dissertation presented to the Instituto de Matemática, Estatística e Computação Científica of the Universidade Estadual de Campinas in partial fulfillment of the requirements for the degree of Master in Statistics (Mestra em Estatística).

Advisor: Benilton de Sá Carvalho

This copy corresponds to the final version of the dissertation defended by the student Heidi Mara do Rosário Sousa and supervised by Prof. Dr. Benilton de Sá Carvalho.

Catalog record (Ficha catalográfica)
Universidade Estadual de Campinas
Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Sousa, Heidi Mara do Rosário, 1991-
So85e  Estudo de modelos de classificação com aplicação a dados genômicos / Heidi Mara do Rosário Sousa. - Campinas, SP : [s.n.], 2019.
Advisor: Benilton de Sá Carvalho.
Dissertation (master's) - Universidade Estadual de Campinas, Instituto de Matemática, Estatística e Computação Científica.
1. DNA microarrays. 2. Genotyping. 3. Genetics - Statistical methods. 4. Algorithms. 5. Neural networks (Computer science). I. Carvalho, Benilton de Sá, 1979-. II. Universidade Estadual de Campinas. Instituto de Matemática, Estatística e Computação Científica. III. Title.
Information for the Digital Library
Title in another language: Study of classification models with application to genomic data
Keywords in English: DNA microarrays; Genotyping; Genetics - Statistical methods; Algorithms; Neural networks (Computer science)
Area of concentration: Statistics
Degree: Master in Statistics (Mestra em Estatística)
Examining committee: Benilton de Sá Carvalho [Advisor]; Júlia Maria Pavan Soler; Samara Flamini Kiihl
Defense date: 31-05-2019
Graduate program: Statistics
Author identification and academic information
- Author ORCID: https://orcid.org/0000-0002-8630-780
- Author Currículo Lattes: http://lattes.cnpq.br/3075263212674826

Master's dissertation defended on 31 May 2019 and approved by the examining committee composed of:
Prof. Dr. Benilton de Sá Carvalho
Prof. Dr. Júlia Maria Pavan Soler
Prof. Dr. Samara Flamini Kiihl

The defense record (Ata da Defesa), signed by the members of the examining committee, is on file in SIGA/Sistema de Fluxo de Dissertação/Tese and at the Graduate Office of the Instituto de Matemática, Estatística e Computação Científica.

I dedicate my master's dissertation to Nossa Senhora Aparecida, my family, and my friends, especially my parents Elsa and Américo and my love Luís, with all my love and admiration.

Acknowledgments

To Nossa Senhora Aparecida, who welcomed me and is always by my side, protecting me, listening to me, and caring for me. She gives me strength and does not let me give up in difficult moments.

To my family, for their love, care, support, and understanding.

To my parents, Elsa do Rosário and Américo Sousa, for being my pillars, my safe harbor, and my greatest inspiration.

To Luís Rocha, for all his affection, companionship, encouragement, patience, and support.

To Val, Janice, Cátia, and the whole P5 family, for their precious friendship, affection, and support in moments of weakness.

To my advisor, Professor Benilton de Sá Carvalho, for his trust, patience, willingness to help, and above all for his invaluable teaching.

To the professors of the Department of Statistics of the Universidade Estadual de Campinas, for their contribution to my professional training.

To Elainy and Joubert, for the incredible partnership in our studies.

To everyone who in some way contributed to this work.

This work was carried out with the support of the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001.

Abstract

Microarray technology, or the DNA chip, is widely used in biomedical science. It screens millions of single nucleotide polymorphisms (SNPs) across the genome, enabling the identification of variants in the DNA sequence that are associated with phenotypes of interest. This technology revolutionized genome-wide association studies (GWAS) precisely because it allows the simultaneous analysis of many markers [14]. The starting point for determining the association between phenotypes and diseases is to make genotype calls (AA, AB or BB) for each SNP. This requires several sophisticated statistical procedures, culminating in the application of a classification method.

The objective of this dissertation is to study preprocessing techniques for microarray data; to understand the methodology of the Corrected Robust Linear Model with Mahalanobis distance (CRLMM); and to propose a new genotyping method based on Artificial Neural Network (ANN) classification models, using quantitative measurements obtained from microarrays. The classification methods were evaluated with metrics that combine accuracy and clustering quality.

The greatest gain from applying neural networks was observed in their ability to identify heterozygous observations more accurately than CRLMM, while the accuracy of homozygous calls remains practically stable. In addition, neural networks yield a classification that is more consistent with the biological processes in the tails of the distribution of the log-ratio M.

Keywords (Resumo): Genotyping, SNP, supervised learning algorithms.
Keywords (Abstract): Microarray, Artificial Neural Network (ANN), Corrected Robust Linear Model with Mahalanobis distance (CRLMM).

List of Figures

1.1 Distribution of DNA between the nucleus and mitochondria in a human cell
1.2 Representation of DNA and RNA molecules
1.3 Central Dogma of Molecular Biology
1.4 The oligonucleotide microarray
2.1 Effect of background correction on SNP microarray data
2.2 Effect of quantile normalization on SNP microarrays
2.3 Use of regression models combined with the EM algorithm
3.1 Use of the M and S statistics
3.2 Representation of a human neuron
3.3 A single-output feed-forward neural network
3.4 A feed-forward neural network with multiple output levels
3.5 Use of Empirical Bayes to predict class locations
5.1 Neural network topology selected for genotyping
5.2 M and S statistics with genotypes given by the HapMap project
5.3 M and S statistics with genotypes given by the CRLMM algorithm
5.4 M and S statistics with genotypes given by neural network prediction
5.5 Performance of the CRLMM and NN algorithms for a SNP with good separation

List of Tables

2.1 Example for median polish
2.2 Example of a probe quartet
2.3 Numerical example of median polish
3.1 Correspondence between biological and artificial neural network terminology
4.1 Classification with imbalanced data
4.2 Binary confusion matrix
4.3 Adaptation of the binary confusion matrix to non-binary data
5.1 Mean accuracy for different network topologies
5.2 Confusion matrix - ANN
5.3 Confusion matrix - CRLMM
5.4 Confusion matrix for NN, genotype AA
5.5 Confusion matrix for CRLMM, genotype AA
5.6 Confusion matrix for NN, genotype AG
5.7 Confusion matrix for CRLMM, genotype AG
5.8 Confusion matrix for NN, genotype GG
5.9 Confusion matrix for CRLMM, genotype GG
5.10 Performance of the NN and CRLMM algorithms for SNP_A-1807747
5.11 Mean silhouette by combination of method and SNP
5.12 CSM by combination of algorithm and SNP

Contents

1 Introduction
  1.1 Molecular Biology
  1.2 HapMap Consortium
  1.3 SNP Microarrays
  1.4 Machine Learning Techniques
  1.5 Objective
2 Preprocessing of SNP Microarrays
  2.1 Correction of Raw Sequence Intensity for Sequence Content and Fragment Length
  2.2 Background Noise Correction
    2.2.1 MAS 5.0 Method
    2.2.2 RMA Method
  2.3 Normalization
    2.3.1 Quantile Normalization
    2.3.2 Cyclic Loess
    2.3.3 Contrast
    2.3.4 Variance Stabilization Normalization (VSN)
  2.4 Summarization
    2.4.1 Median Polish
  2.5 Log-Ratio vs Log-Intensity Adjustment
    2.5.1 Definition of the Finite Mixture Model
    2.5.2 Parameter Estimation via the EM Algorithm