ACTA UNIVERSITATIS OULUENSIS C Technica 759
TUOMO ALASALMI
UNCERTAINTY OF CLASSIFICATION ON LIMITED DATA
Academic dissertation to be presented with the assent of the Doctoral Training Committee of Information Technology and Electrical Engineering of the University of Oulu for public defence in the OP auditorium (L10), Linnanmaa, on 18 September 2020, at 12 noon
UNIVERSITY OF OULU, OULU 2020

Copyright © 2020
Acta Univ. Oul. C 759, 2020
Supervised by
Professor Juha Röning
Professor Jaakko Suutala
Docent Heli Koskimäki

Reviewed by
Professor James T. Kwok
Docent Charles Elkan

Opponent
Professor Henrik Boström
ISBN 978-952-62-2710-8 (Paperback) ISBN 978-952-62-2711-5 (PDF)
ISSN 0355-3213 (Printed) ISSN 1796-2226 (Online)
Cover Design Raimo Ahonen
PUNAMUSTA TAMPERE 2020

Alasalmi, Tuomo, Uncertainty of classification on limited data.
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering
Acta Univ. Oul. C 759, 2020
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland
Abstract

It is common knowledge that even simple machine learning algorithms can improve in performance with large, good-quality data sets. However, limited data sets, be it because of limited size or incomplete instances, are surprisingly common in many real-world modeling problems. In addition to the overall classification accuracy of a model, it is often of interest to know the uncertainty of each individual prediction made by the model. Quantifying this uncertainty of classification models is discussed in this thesis from the perspective of limited data.

When some feature values are missing, uncertainty regarding the classification result is increased, but this is not captured in the metrics that quantify uncertainty using traditional methods. To tackle this shortcoming, a method is presented that, in addition to making incomplete data sets usable for any classifier, makes it possible to quantify the uncertainty stemming from missing feature values. In addition, in the case of complete but limited-sized data sets, the ability of several commonly used classifiers to produce reliable uncertainty estimates, i.e. probability estimates, is studied. Two algorithms are presented that can potentially improve probability estimate calibration when the data set size is limited. It is shown that the traditional approach to calibration often fails on these limited-sized data sets, but using these algorithms still allows classifier probability estimates to be improved with calibration.

To support the usefulness of the proposed methods and to answer the proposed research questions, the main results from the original publications are presented in this compiling part of the thesis. Implications of the findings are discussed and conclusions drawn.
Keywords: classification, missing data, probability, small data, uncertainty
Alasalmi, Tuomo, Luokittelun epävarmuus vaillinaisilla aineistoilla [Uncertainty of classification on limited data].
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering
Acta Univ. Oul. C 759, 2020
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Tiivistelmä

It is widely known that the results of even simple machine learning methods can be improved if plenty of good-quality data is available. Limited data sets, whose shortcomings stem from a small amount of data or from missing values, are nevertheless quite common. In addition to mere classification accuracy, the uncertainty of a model's individual predictions is often useful information. This thesis examines the quantification of classifier uncertainty when the available data is limited. When values of some features are missing from the data, the uncertainty of the classification results increases, but this increased uncertainty goes unnoticed with traditional missing value handling methods. To remedy this, a method is presented with which the increase in uncertainty caused by missing values can be taken into account. In addition, this method makes it possible to use any classifier, even if the classifier does not otherwise support data sets containing missing values. Furthermore, the thesis examines the ability of several commonly used classifiers to produce good estimates of prediction reliability, i.e. probability estimates, when the available data set is small. Two algorithms are presented that can make it possible to improve the calibration of these probability estimates even when the available data set is small. The presented results show that the traditional approach to calibration does not succeed on small data sets, but with the presented algorithms calibration becomes possible.

Results from the original articles are presented in this compiling thesis to support the presented claims and to answer the stated research questions. Finally, the significance of the results is discussed and conclusions are drawn.

Asiasanat: uncertainty, classification, small data, missing data, probability
To Tanja and Iida, my sunshines ☼

Acknowledgements
A doctoral dissertation is a long, lonely journey, but it would not be possible without the support of the people around you. First of all, I would like to thank my principal supervisor, professor Juha Röning, for believing in me and giving me the opportunity to conduct my doctoral research and studies in his research group. I was also fortunate to get guidance from my two other awesome supervisors, docent Heli Koskimäki and associate professor Jaakko Suutala. Thank you, Heli, for pushing me kindly towards the end goal and for occasionally reminding me that there are only two kinds of theses, perfect and ready. And thank you, Jaakko, for sharing your wisdom and for your invaluable tips along the way.

I was lucky to have two world-class experts to pre-examine my thesis. Thank you for your comments and suggestions for improvement, professor James T. Kwok and docent and former professor Charles Elkan. I would also like to thank professor Henrik Boström for kindly accepting to be the opponent in my thesis defense.

I have also received financial support along the way, and I would like to give my thanks to my funders for believing in the value of my work. Thank you to the Infotech doctoral school, the Jenny and Antti Wihuri Foundation, the Tauno Tönning Foundation, and the Walter Ahlström Foundation. Thank you also to the University of Oulu graduate school for supporting my conference attendance costs.

A fair amount of humor is needed to balance the sometimes tiresome research efforts. The Data Analysis and Inference group members did an excellent job in this regard during all the coffee breaks, nights out, and WhatsApp conversations. Without you I would not have heard (or tried to come up with myself) so many Bad jokes (with a capital B)!

I would also like to thank my family and friends. A special thanks goes to you, Mum and Dad, for emphasizing the importance of education and being supportive despite my journey having many twists and turns while trying to find out who I want to become when I grow up. I still don't think I know, only time will tell, so please bear with me! The biggest thanks, however, go to my lovely wife Tanja, who has been extremely supportive even during the darkest of times, and to our adorable daughter Iida. Without you two I would not be where I am today. I love you both ♥
Oulu, June 2020 Tuomo Alasalmi
List of abbreviations
ABB Averaging over Bayesian binnings
ACP Adaptive calibration of predictions
AI Artificial intelligence
BBQ Bayesian binning into quantiles
BIC Bayesian information criterion
BLR Bayesian logistic regression
CR Classification rate
DG Data Generation model
DGG Data Generation and Grouping model
ECOC Error correcting output coding
ENIR Ensemble of near-isotonic regression
EP Expectation propagation
GP Gaussian process
GPC Gaussian process classifier
Logloss Logarithmic loss
MAR Missing at random
MCAR Missing completely at random
MCCV Monte Carlo cross validation
MCMC Markov chain Monte Carlo
MI Multiple imputation
ML Machine learning
MNAR Missing not at random
MSE Mean squared error
NB Naive Bayes classifier
NN Neural network
RF Random forest
RPR Shape restricted polynomial regression
SBB Selection over Bayesian binnings
SVM Support vector machine
XAI Explainable artificial intelligence
Mathematical notation
B Between-imputation variance
Ci Predicted class
K Number of classes
l Number of features
m Number of imputed data sets
N Number of observations
P(Cj) Prior probability of class Cj
P(Cj|x) Posterior probability of class Cj
p(x) Probability density
p(x|Cj) Class-conditional probability density
Q Statistical estimand
Q̄ Estimator of Q calculated from an incompletely observed sample
Q̂ Estimator of Q calculated from a hypothetically complete sample
T Total variance
Ū Estimator of within-imputation variance
x, xi Feature vector, feature vector of instance i
y Target label
List of original publications
This thesis is based on the following articles, which are referred to in the text by their Roman numerals (I–V):
I Alasalmi T., Koskimäki H., Suutala J., & Röning J. (2015). Classification uncertainty of multiple imputed data. IEEE Symposium Series on Computational Intelligence: IEEE Symposium on Computational Intelligence and Data Mining (2015 IEEE CIDM), 151-158. doi: 10.1109/SSCI.2015.32
II Alasalmi T., Koskimäki H., Suutala J., & Röning J. (2016). Instance level classification confidence estimation. Advances in Intelligent Systems and Computing. The 13th International Conference on Distributed Computing and Artificial Intelligence 2016, Springer, 474:275-282. doi: 10.1007/978-3-319-40162-1_30
III Alasalmi T., Koskimäki H., Suutala J., & Röning J. (2018). Getting more out of small data sets. Improving the calibration performance of isotonic regression by generating more data. Proceedings of the 10th International Conference on Agents and Artificial Intelligence, SCITEPRESS, 379-386. doi: 10.5220/0006576003790386
IV Alasalmi T., Suutala J., Koskimäki H., & Röning J. (2020). Better calibration for small data sets. ACM Transactions on Knowledge Discovery from Data, 14(3), Article 34 (May 2020). doi: 10.1145/3385656
V Alasalmi T., Suutala J., Koskimäki H., & Röning J. (2020). Improving multi-class calibration for small data sets. Manuscript.
Contents
Abstract
Tiivistelmä
Acknowledgements 9
List of abbreviations 11
List of original publications 13
Contents 15
1 Introduction 17
   1.1 Objectives and scope 18
   1.2 Contribution of the thesis 19
   1.3 Outline of the thesis 19
2 Theoretical foundation 21
   2.1 Classification 22
      2.1.1 Bayesian classification 22
      2.1.2 Support vector machine 24
      2.1.3 Random forest 25
      2.1.4 Feed-forward neural network 26
   2.2 Missing data 27
      2.2.1 Handling missing data 29
      2.2.2 Missing data imputation 30
      2.2.3 Classification with multiple imputed data 32
   2.3 Quantifying classifier performance 33
   2.4 Calibration of classifiers and uncertainty of classification 33
      2.4.1 Evaluating calibration performance 36
   2.5 Comparing algorithm performance 38
3 Quantifying uncertainty of classification on limited data 39
   3.1 Data sets 39
      3.1.1 Synthetic data 42
   3.2 Uncertainty of classification when some feature values are missing 42
   3.3 Classifier calibration when data set size is limited 46
      3.3.1 Results in binary calibration 48
      3.3.2 Results in multi-class calibration 55
   3.4 Summary 56
4 Discussion 59
5 Conclusions 65
References 67
Original publications 73
1 Introduction
"The starting point for any data preparation project is to locate the data. This is sometimes easier said than done!"
– Dorian Pyle, Data Preparation for Data Mining
At the time, I could not anticipate how meaningful that quote from a book I studied as part of my doctoral studies would eventually become for this thesis. Even simple algorithms improve in performance with more good-quality data, but limited data, be it small in size or containing incomplete instances, is very common in practice, calling for improvements in the algorithms if better performance is desired (Géron, 2019). The core topic of this thesis, uncertainty of classification, stemming either from missing data or from poor calibration of the learning algorithm on limited-sized data sets, touches the very problem raised by the author of that quote. It is indeed often easier said than done to locate all, or even a reasonable sample, of good-quality data (Pyle & Cerra, 1999).

Artificial intelligence (AI) applications are becoming widespread and have an ever-increasing impact on our everyday lives. In many of its applications, such as product recommendations, the reasons that have led to the system's decision are not very interesting from the user's point of view as long as the user finds the recommendations useful. But AI systems are also developed for more critical applications, such as disease diagnosis in healthcare and self-driving cars. In these applications, it is crucial that the decision process is transparent. These systems are often accurate in their results, but it is difficult to get a clear view into their inner workings. This is especially true for machine learning (ML) algorithms. There are obvious dangers in giving decision-making power to a system that is not transparent. To address this issue, a research field called explainable artificial intelligence (XAI) aims to create techniques that help explain the models without affecting their performance (Adadi & Berrada, 2018).

In addition to having an understanding of the inner workings of a model, it is often desirable to also have a reliable estimate of the uncertainty of the predictions produced by the ML model. This becomes apparent when there are costs associated with the decisions that are based on the model predictions (Zadrozny & Elkan, 2001b). As an example, let us consider a clinical decision support system assisting in routine disease screening that gives a recommendation regarding the need for more thorough examination. Giving a recommendation to do more tests will cause unnecessary concern and will also incur monetary costs, but giving a false impression of health when this is unwarranted is not without cost either. Therefore, an accurate estimate of the uncertainty of our prediction is needed to balance the costs of false positive and false negative decisions.

However, when a ML model is presented, typically only global performance measures, such as classification accuracy on a test data set, are reported. These measures do not provide any information on the reliability of each individual prediction. The uncertainty of individual predictions can be estimated with the probability estimates that many ML algorithms produce. However, for many ML algorithms, these estimates reflect true probabilities quite poorly. The estimates can be improved by calibration, but this requires quite a substantial amount of data, which is not available when the data set size is limited. Missing values, which are quite common in many data sets, pose another type of problem for uncertainty estimation. In this case, the uncertainty regarding the true nature of the missing values needs to be integrated into the uncertainty estimate. This aspect is completely ignored in the traditional treatment of missing values.
1.1 Objectives and scope
In this thesis, ways to quantify uncertainty of classification on limited data, both incomplete and small, will be discussed. Quantification of uncertainty that takes into account the uncertainty about missing feature values will be studied, and a possible model-independent solution will be discussed. In addition, classifier calibration on limited-sized data sets will be studied in the context of binary classification, with a selection of the best-performing commonly used classifiers. Problems that emerge with current methods, and possible solutions, will be discussed. Generalization of the solution to multi-class problems will also be studied. In summary, the following research questions will be considered in this thesis:
1. How can uncertainty of classification be quantified when some feature values are missing?
2. How well are different classifiers calibrated when the amount of available training data is limited?
3. Is it possible to improve the calibration of classifiers when the data set size is limited?
The research questions will be answered with peer-reviewed scientific articles, each article providing a partial solution to the above-defined research problem. The contributions of these articles are combined in this compiling part of the thesis.
1.2 Contribution of the thesis
This thesis addresses the research problem that was defined above based on the original publications. Publications I and II deal with uncertainty of classification in the presence of missing feature values. Publication I introduces a new way of quantifying the uncertainty of classifier predictions when some feature values are missing by taking advantage of multiple imputation to capture the inherent uncertainty of the unobserved values. In Publication II, this information is then transformed to estimate the confidence of classifier predictions.

Publications III–V deal with classifier calibration on limited-sized data sets. Publication III introduces two algorithms that can be used to generate calibration data and thereby avoid both biasing calibration toward the training data set and overfitting the calibration model on a limited-sized calibration data set. Publication IV tests the proposed data generation algorithms more thoroughly on a state-of-the-art calibration algorithm and on a sample of classifiers that are commonly used and have proven their effectiveness on a wide array of problems. Publication V then generalizes this calibration scheme to multi-class classification problems.

The author's contribution to each of the articles was in narrowing down the research problem, developing the proposed algorithms, planning and implementing the experiments, interpreting the results, and writing the manuscripts. The other authors gave valuable guidance along the way and helped in proofreading the manuscripts.
1.3 Outline of the thesis
The rest of this thesis is organized as follows. Literature on the theoretical background of this work is reviewed in Chapter 2. The contributions of this thesis are presented in Chapter 3 and their implications are discussed in Chapter 4. Chapter 5 then concludes the thesis.
2 Theoretical foundation
Machine learning can be described as learning possibly hidden structure and patterns from data. The hidden structure and patterns are associated with the mechanism that created the data, and this information can be used to aid the analysis and understanding of the nature of the data (Theodoridis, 2015). In predictive modeling, this understanding is used to make accurate predictions about the chances of something happening, whereas in descriptive modeling the focus is on describing the found patterns and hidden structures in the data, i.e. why something might happen (Kuhn & Johnson, 2013). The chance of something happening is considered the estimated probability of that event according to the model and is often called the prediction score. Predictive modeling can be divided into classification, where an object is assigned to one of a finite number of discrete categories called classes, and regression, where the desired output is a continuous variable (Bishop, 2006). In this thesis, the focus will be on classification.

In predictive modeling, the most interesting property of a model is its predictive performance on new, previously unseen data. Using the same data both for building the model and for estimating prediction performance tends to overestimate the performance. To overcome this problem, a separate data set, called a test or validation data set, can be used. However, when the amount of data available is limited, the test data set might not be large enough to draw strong conclusions, while making the limited training data set smaller still. Therefore, the use of resampling methods such as cross-validation is often recommended to estimate model performance, especially on limited-sized data sets (Kuhn & Johnson, 2013).

In addition to overall model accuracy, the uncertainty of individual predictions is of interest in many applications. This uncertainty arises from the underlying phenomenon and from the model that tries to capture the behavior of that phenomenon. If the amount of available data is limited, estimating the uncertainty becomes more challenging. Missing values add to the uncertainty, as some information is missing. Limited data set size, on the other hand, makes it hard to calibrate the model uncertainty estimates.

This chapter describes the theoretical foundation and the relevant methods of this work. Section 2.1 formalizes the problem of classification and describes how the classifier algorithms related to this work solve this problem. The missing data problem is described in Section 2.2. Classifier performance measures relevant to this work are introduced in Section 2.3. Section 2.4 addresses quantification of classification uncertainty and calibration of classifier probability estimates. Algorithm performance comparison considerations are discussed in Section 2.5.
2.1 Classification
In classification, the task is to predict into which class an object, also known as a pattern, belongs. It is assumed that the pattern belongs to exactly one of K classes, which are known a priori. A unique set of measurements, called features, describes a pattern and is represented by a feature vector x (Theodoridis, 2015). A classifier, trained on a data set consisting of feature vectors and the corresponding class labels, can be used to classify an unknown pattern into one of the K classes. Some classifiers can also estimate the probability that a given pattern belongs to a certain class. This probability is a measure of the uncertainty of the prediction, and in many applications this probability might be as interesting as the predicted class itself.

There are literally hundreds of classifiers available. In this thesis, a sample of the best-performing (Fernández-Delgado, Cernadas, Barro, & Amorim, 2014) and most commonly used classifiers is studied. In addition, Bayesian classifiers are used as a reference, because in theory Bayesian inference yields correct probabilities. In practice, exact Bayesian inference is intractable, as will be discussed later, but Bayesian classifiers might still be our best probe for the correct probabilities.
2.1.1 Bayesian classification
In Bayesian reasoning, the probability of an event is determined by combining prior information and observations into the posterior probability. The prior probability P(Cj) describes how probable the event Cj is a priori, while the posterior probability P(Cj|x) is the probability of the event Cj conditioned on having observed the evidence x. Using the rules of probability theory, Bayes' rule
$$P(C_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_j)\,P(C_j)}{p(\mathbf{x})} \qquad (1)$$

can be derived (Barber, 2016). In a classification task with K classes, we can calculate the probability that an instance represented by feature vector x, i.e. the observation, belongs to a particular class Cj, j = 1, 2, ..., K, with Bayes' rule, assigning x to the class Ci with the highest posterior probability. As the probability density p(x) does not affect the classification decision, it can be left out, which leads to the Bayesian classification rule
$$C_i = \underset{C_j}{\arg\max}\; p(\mathbf{x} \mid C_j)\,P(C_j). \qquad (2)$$
By estimating the conditional probability p(x|Cj) and the prior probability P(Cj) from the training data, a classifier can be built that uses Bayes' rule to estimate the class to which an observation xi most likely belongs. It turns out that the Bayesian classifier minimizes the probability of classification error (Theodoridis, 2015). The prior probabilities P(Cj) can be estimated by the proportion of instances in each class in the training data. To estimate p(x|Cj), however, the number of unknown parameters is of the order of O(l^2), where l is the number of features, i.e. the dimension of the feature space, which makes the estimation computationally very hard and also requires a large training data set (Theodoridis, 2015). Therefore, to get a good estimate of p(x|Cj), approximation methods are needed in practice.
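As a hypothetical numeric illustration of rule (2) (the numbers below are invented for this sketch, not taken from the thesis), consider a two-class problem where the prior favors C1 but the likelihood favors C2:

$$P(C_1) = 0.7,\quad P(C_2) = 0.3,\quad p(x \mid C_1) = 0.2,\quad p(x \mid C_2) = 0.5$$

$$P(C_2 \mid x) = \frac{0.3 \times 0.5}{0.7 \times 0.2 + 0.3 \times 0.5} = \frac{0.15}{0.29} \approx 0.52$$

Rule (2) assigns x to C2, yet the posterior of roughly 0.52 shows that the decision is nearly a coin flip. This is exactly the kind of instance-level uncertainty information this thesis is concerned with.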
Naive Bayes classifier
One way to ease the computational burden in Bayesian classification is to make an assumption that the predictor variables are conditionally independent (Han, Kamber, & Pei, 2012). This leads to the classification rule
$$C_i = \underset{C_j}{\arg\max}\; \left( P(C_j) \prod_{k=1}^{l} p(x_k \mid C_j) \right), \qquad (3)$$

where xk is the value of the k-th feature. The conditional independence assumption of features is rarely true in the real world, but the classifier is still often quite effective, albeit producing poor probability estimates. The independence assumption is especially helpful when the dimensionality of the input space is high, and it also helps if the data set contains both continuous and discrete variables, as they can both use appropriate models separately from each other (Bishop, 2006). Despite its drawbacks, naive Bayes has been selected as one of the most influential data mining methods because of its simplicity, elegance, and robustness (X. Wu et al., 2008). The models produced by naive Bayes are easy to interpret (Kononenko, 1990) compared to many other commonly used learning algorithms. It can also handle missing values, which are common in many real-world data sets, by simply ignoring them in the calculations, i.e. by assigning all possible feature values equal probability.
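A minimal sketch of rule (3) with Gaussian class-conditional densities, written with scikit-learn (an illustration only; the thesis does not prescribe this library, and the synthetic data is invented):

```python
# Minimal Gaussian naive Bayes sketch (illustrative, synthetic data).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(1.5, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X, y)
# predict_proba applies rule (3) with Gaussian class-conditional densities;
# as noted above, these scores are often overconfident, i.e. poorly calibrated.
print(nb.predict(X[:2]), nb.predict_proba(X[:2]))
```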
Gaussian process classification
Instead of trying to find a single solution that maximizes the class separation on the training data, Gaussian processes (GP) try to find a distribution of all possible functions consistent with the observed data. GP starts with a prior distribution of functions and as
data are observed, the posterior distribution of the functions is updated accordingly (Rasmussen & Williams, 2006). When used in classification, a Gaussian process classifier (GPC) is more flexible than fitting a fixed hyperplane that separates the classes (Williams & Barber, 1998). GPC is non-parametric and nonlinear when a nonlinear covariance function, such as the squared exponential function, is used. Markov chain Monte Carlo (MCMC) sampling can be considered the gold standard for approximate GPC inference, but it is computationally very complex and suffers from long running times. The expectation propagation (EP) approximation (Minka, 2001), on the other hand, has proved to be in very good agreement with MCMC for both predictive probabilities and marginal likelihood estimates, at a fraction of the computational cost of MCMC (Kuss & Rasmussen, 2005). Good probability estimates make GPC an interesting point of comparison for quantifying uncertainty of classification.
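A short GPC sketch with scikit-learn (illustrative only, with invented data; note that scikit-learn approximates the posterior with the Laplace method rather than the EP approximation discussed above):

```python
# Gaussian process classifier sketch with a squared exponential (RBF) kernel.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = (np.sin(X[:, 0]) > 0).astype(int)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)
print(gpc.predict_proba(np.array([[0.5], [-0.5]])))  # posterior class probabilities
```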
Bayesian logistic regression
Bayesian logistic regression (BLR) uses a Bayesian approach to build a parametric linear classifier, placing Cauchy distributions on the priors. BLR uses Bayesian inference to obtain stable and regularized logistic regression coefficients. With a Cauchy prior, estimation does not fail in the presence of collinearity or when the classes are completely separable, which are known problems with ordinary logistic regression. Moreover, more shrinkage is automatically applied to higher-order interactions. BLR improves on the probability estimates of bare logistic regression and of other suggested prior distributions (Gelman, Jakulin, Pittau, & Su, 2008). Therefore, Bayesian logistic regression is a good candidate for our uncertainty quantification experiments and serves as a linear Bayesian model.
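A sketch of BLR with the Cauchy priors recommended by Gelman et al. (2008) (scale 2.5 for standardized coefficients, 10 for the intercept), written here with PyMC purely as an assumption; the thesis does not name an implementation, and the data is invented:

```python
# Bayesian logistic regression sketch with Cauchy priors (illustrative).
# X is assumed standardized, y binary in {0, 1}.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(int)

with pm.Model():
    intercept = pm.Cauchy("intercept", alpha=0.0, beta=10.0)
    coefs = pm.Cauchy("coefs", alpha=0.0, beta=2.5, shape=X.shape[1])
    pm.Bernoulli("y", logit_p=intercept + pm.math.dot(X, coefs), observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(az.summary(idata, var_names=["intercept", "coefs"]))
```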
2.1.2 Support vector machine
A support vector machine (SVM) is a so-called maximum margin classifier. This means that SVM determines the decision boundary so that it maximizes the distance of the closest training samples from the decision boundary, which leads to good generalization. Only the training data set samples that lie on the margin, called support vectors, are considered when predicting the class of a test sample, hence the name support vector machine (Kuhn & Johnson, 2013). The fact that only a subset of the training data, i.e. the support vectors, is needed to construct the decision boundary means that SVMs are sparse, which is a major advantage with large data sets (Cristianini & Shawe-Taylor, 2000). A linear SVM on a completely separable two-class problem is depicted in Figure 1.

Fig. 1. A support vector machine. Retrieved from https://en.wikipedia.org/wiki/Computer-aided_diagnosis. CC BY-SA 4.0.

This is an easy concept to grasp when the classes are completely separable. When the classes overlap, a cost is associated with the training samples that lie on the margin or on the wrong side of the decision boundary. The magnitude of this penalty is set with a model hyperparameter, and this is the main mechanism that controls the complexity of the decision boundary. Nonlinearity can be added to SVMs using nonlinear kernels. This is called the kernel trick and allows for very flexible decision boundaries (Kuhn & Johnson, 2013). SVM was originally developed with a hard decision boundary in mind, as opposed to estimating accurate class probabilities (Kuhn & Johnson, 2013), which makes it interesting in the context of this thesis. Like naive Bayes, SVM also made its way to the list of the most influential data mining algorithms (X. Wu et al., 2008) and was also a top performer (with Gaussian and polynomial kernels) in the extensive classifier testing of Fernández-Delgado et al. (2014).
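The gap between an SVM's hard decision values and probability estimates can be made concrete with a small scikit-learn sketch (an illustration with invented data; probability=True adds an internal Platt-scaling step, fitted with cross-validation, to map margin scores to probabilities):

```python
# SVM sketch: raw decision values vs. Platt-scaled probability estimates.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

svm = SVC(kernel="rbf", C=1.0, probability=True).fit(X, y)
print(svm.decision_function(X[:2]))  # signed margin distances (no probabilistic meaning)
print(svm.predict_proba(X[:2]))      # Platt-scaled probability estimates
```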
2.1.3 Random forest
Random forest (RF) is an ensemble of trees that can be used for both classification and regression. Figure 2 illustrates a random forest for classification. There are several ways to construct a random forest, such as using random split selection or using a random subset of the training set or of the features for each tree in the forest.

Fig. 2. A random forest. Retrieved from https://en.wikipedia.org/wiki/Random_forest. CC BY-SA 4.0.

Breiman (2001) uses bagging and random feature selection to build the trees. Using bagging improves accuracy when combined with random feature selection and enables the use of out-of-bag samples, i.e. samples in the training set that are not part of the current bootstrap sample, to estimate the generalization error, strength, and correlation of the trees. A random forest estimates class probabilities as the proportions of votes the trees in the forest cast for each class; these vote proportions are not properly calibrated probabilities. Random forests can be proven not to overfit as the number of trees grows, and their generalization error depends on the strength of the individual trees and their mutual correlation. Random forests are competitive with boosting and adaptive bagging, at least in classification tasks (Breiman, 2001). Excellent classification performance has also been verified in more thorough testing, where random forest was the best-performing classifier among the 179 classifiers tested on 121 data sets (Fernández-Delgado et al., 2014). Their widespread use makes random forests an interesting candidate regarding the research questions of this thesis.
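The vote-proportion estimate described above can be computed explicitly (a sketch with invented data; note that scikit-learn's own predict_proba averages per-tree class probabilities rather than counting hard votes, so the two can differ slightly):

```python
# Random forest sketch: class "probabilities" as proportions of tree votes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
votes = np.mean([tree.predict(X[:3]) for tree in rf.estimators_], axis=0)
print(votes)                    # fraction of trees voting for class 1
print(rf.predict_proba(X[:3]))  # scikit-learn's averaged per-tree probabilities
```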
2.1.4 Feed-forward neural network
Neural networks (NN) are machine learning algorithms that are inspired by biological neural networks, i.e. the brain, whose basic building element is the neuron. In the human brain, these neurons are connected to each other via synapses, forming a network comprising approximately 50 to 100 trillion synapses (Theodoridis, 2015).
Fig. 3. A feed-forward neural network with an input layer, one hidden layer, and an output layer.
In artificial neural networks, individual neurons can be modeled using perceptrons (Rosenblatt, 1958), and a network is formed by connecting individual perceptrons in a layered fashion. In a commonly used network structure, information flows only forward through the network, hence the name feed-forward neural network. Figure 3 illustrates such a network with three layers: an input layer, a hidden layer, and an output layer. Each connection has an associated weight, and learning is achieved by adjusting the weights (Theodoridis, 2015). A breakthrough in training neural networks was achieved with the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986), which is still in use today. A two-layer neural network can approximate any continuous function, so it can be used to construct any nonlinear decision surface in a classification task. In practice, the needed size of the layers might become infeasible, and adding more layers instead can be advantageous (Theodoridis, 2015). Neural networks tolerate noisy data quite well, generalize well, and have proven successful in a wide variety of real-world applications, but they are not intuitive to interpret (Han et al., 2012). To obtain probability estimates, a softmax transformation can be applied to the network output.
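The softmax transformation mentioned above is simple enough to sketch directly (illustrative; the example logits are invented):

```python
# Numerically stable softmax: maps a network's raw output vector (logits)
# to class probabilities that are non-negative and sum to one.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Return class probabilities for one vector of network outputs."""
    shifted = logits - np.max(logits)  # guard against overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66 0.24 0.10]
```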
2.2 Missing data
In addition to size, data can be limited by having some feature values missing. Missing data are a common problem in almost all disciplines. In practice, it means that some values are not present in the data set, and the data set is said to be incomplete. This makes statistical analysis of the data more complicated. Missing data can be the result of, for example, dropouts in longitudinal studies (Molenberghs, Fitzmaurice, Kenward, Tsiatis, & Verbeke, 2014).

When some data are missing, some information is necessarily lost, which leads to a reduction in the precision of any analysis. The magnitude of the reduction in precision depends on the amount of missing data and the method used in the analysis. Precision is, however, not the only or even the most significant problem related to missing data. Missing data make the data set unbalanced and can add bias, which can lead to misleading inferences (Molenberghs et al., 2014).

An important aspect of dealing with missing data is understanding the mechanism that caused the data to be missing. Three types of mechanisms are described in the literature: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In MCAR, the cause of the missing data is unrelated to the data itself, and the probability that a certain value is missing is the same for all data samples. This is often not a realistic assumption. If the probability of a value being missing is not uniform but is correlated with other observed data, the mechanism is said to be MAR (a small simulation of this mechanism is sketched at the end of this section). In MNAR, the probability that a value is missing depends on the missing value itself or on some other unobserved data. Assuming MAR is reasonable in most cases, which means that observed values can be used in estimating the missing values (van Buuren, 2012). However, the mechanism behind missing values should always be investigated and taken into account when choosing missing data handling methods (Aste, Boninsegna, Freno, & Trentin, 2015).

Most classification algorithms have been developed assuming a complete data set, so they cannot deal with missing values. Naive Bayes is one exception: it handles missing values by ignoring them in the computations and therefore spreads the probability uniformly. As stated earlier, the best-performing classifiers in a wide array of tests have been random forest, support vector machine, and neural network variants (Fernández-Delgado et al., 2014), none of which can handle missing values. Missing values, on the other hand, are very common in real-world data sets. For example, it is not realistic to expect every patient in a medical data set to have been measured for every possible blood test. In addition, there can be errors in data collection or equipment, mistakes in data entry, dropouts in longitudinal studies, etc. It is wasteful to throw away samples that are incomplete in one or more feature values, especially if the data set size is also limited.
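The MAR mechanism described above can be made concrete with a small invented simulation: the probability that feature x2 is missing depends only on the observed x1, not on the value of x2 itself.

```python
# Sketch of inducing MAR missingness in a complete data set (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(size=500)

p_missing = 1 / (1 + np.exp(-x1))          # larger observed x1 -> more likely missing
x2[rng.uniform(size=500) < p_missing] = np.nan
print(f"{np.isnan(x2).mean():.0%} of x2 missing")  # roughly 50%
```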
28 2.2.1 Handling missing data
A review of approaches to handling missing data in classification tasks can be found in, for example, García-Laencina, Sancho-Gómez, and Figueiras-Vidal (2009) and Aste et al. (2015). Missing data handling can be roughly categorized into three main approaches. The most common practice is complete case analysis, also called listwise deletion, i.e. simply ignoring all incomplete data samples. In some cases, it might be appropriate to use available case analysis, also called pairwise deletion, where all available data for a particular analysis are used even if other features are incomplete. Analogously, in classification, an ensemble of classifiers could be constructed with a different subset of features used in each, but this approach becomes infeasible very quickly as the number of features grows. The third approach is to fill in the missing feature values with some sensible value so that any desired method can be used for the analysis. This process is called missing data imputation (Aste et al., 2015; van Buuren, 2012).

In this work, interest is focused on the uncertainty of classifier predictions. Deletion gives no information whatsoever about the target class. Imputing a single value in place of a missing value, on the other hand, gives too rosy a picture of the uncertainty of the prediction, because the inherent uncertainty of the imputed value is not taken into account. This is where multiple imputation (MI), where multiple possible values are used to fill in each missing value, can help (Rubin, 1987; van Buuren, 2012), but classification algorithms cannot directly handle multiply imputed data sets.

As most classifiers cannot use data with missing values, using imputation to handle the missing values allows the use of any classifier that is suitable for the task at hand. With multiple imputation, it is quite straightforward to estimate the standard errors of statistical estimands that describe the uncertainty of the analysis. This is done with the so-called Rubin's rules (Rubin, 1987; van Buuren, 2012), which factor in both the within-imputation variance and the between-imputation variance. The latter stems from the uncertainty of the imputations, which would be impossible to recognize if only a single imputation were used. For classification, which is the focus of this thesis, the uncertainty can be estimated with the posterior probability estimates (later just probability estimates) of the predictions. The overconfidence problem when using single imputation is the same for classifier probability estimates as it is for standard errors of statistical estimands. Harnessing multiple imputation to capture the uncertainty of classifier predictions stemming from the uncertainty of the missing values will be studied in this thesis.
29 2.2.2 Missing data imputation
Probably the most common way to choose the imputed values is to use the mean of the available values for that variable. This is unfortunate, as using the mean suppresses the variability in the data and also affects the relationships between variables. All statistical estimands will be biased if mean imputation is used, even the mean itself, if the data are not missing completely at random, which, as stated earlier, is often not a valid assumption (van Buuren, 2012). Another option is to use the information in the observed variables and fill in the missing values based on regression modeling. Variation is suppressed with regression imputation too, as all of the imputed values fall on the regression line, and it is unlikely that this represents the true distribution of the values had they been observed. Correlation of variables is also inflated due to the lowered variation, but mean estimates will be unbiased with regression imputation (van Buuren, 2012).

Regression imputation can be modified by adding noise to the imputed values so that the lost variability is regained. This method is called stochastic regression imputation. Modeling is done by fitting the regression model and estimating the residual variance; the imputed values can then be drawn from a distribution with these parameters. The variability in the imputed values is indeed restored, but even though the values look like they could be real observations, they are not, yet they are treated with the same certainty as real observed values. Stochastic regression helps to preserve the correlations and relationships of variables, but like regular regression imputation, it can produce impossible values and gives a false impression of certainty in the imputed values (van Buuren, 2012).

In multiple imputation, each missing value is replaced with m > 1 values, using a method that draws values from a distribution of plausible values specifically modeled for that particular feature, e.g. stochastic regression, thus creating several data sets that are identical in the observed values but differ in the imputed values. The variation of the imputed values between the imputations indicates how uncertain the imputations are. The imputation model that is appropriate for each feature depends on the data. The MI data sets are analyzed individually, and the results are pooled to form the final estimands using the pooling rules developed by Rubin (1987). The pooled estimands are unbiased and have correct statistical properties if the imputation models were chosen correctly (van Buuren, 2012).
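One way to generate the m imputed data sets is sketched below with scikit-learn's IterativeImputer (an illustration with invented data; the thesis does not prescribe this tool). With sample_posterior=True, imputations are drawn from a predictive distribution, in the spirit of the stochastic regression imputation described above.

```python
# Sketch: create m multiply imputed data sets from one incomplete data set.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.uniform(size=X.shape) < 0.1] = np.nan  # make ~10% of values missing

m = 5
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]  # identical observed values, different imputed values across the m sets
```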
30 Rubin’s rules
Standard error is the uncertainty measure that is used in statistical analysis. It can be defined as "the standard deviation of the sampling distribution of some statistic" (Urdan, 2010). Rubin’s rules
$$\bar{Q} = \frac{1}{m}\sum_{l=1}^{m}\hat{Q}_l, \qquad (4)$$

$$\bar{U} = \frac{1}{m}\sum_{l=1}^{m}\bar{U}_l, \qquad (5)$$

$$B = \frac{1}{m-1}\sum_{l=1}^{m}\left(\hat{Q}_l-\bar{Q}\right)\left(\hat{Q}_l-\bar{Q}\right)', \qquad (6)$$

$$T = \bar{U} + B + \frac{B}{m} = \bar{U} + \left(1+\frac{1}{m}\right)B \qquad (7)$$

are a way to estimate the standard error when using multiple imputation, and they thus take into account the uncertainty regarding the missing values. The same principle will be used later when determining the uncertainty of classifier predictions on multiply imputed data. When estimating some statistic Q̄ from data with missing values, multiple imputation is first used to create the m complete data sets as described above. Each of the data sets is then used separately to compute the complete-data estimate Q̂, and the final estimate Q̄ is the arithmetic mean of those estimates. To estimate the standard error of the estimate Q̄, three sources of variance need to be taken into account. The first source of variance comes from the fact that only a sample of the whole population is observed; this is called within variance. The second source of variance is caused by the missing values and can be seen as variation between the m data sets; this is called between variance. The third source of variance comes from the fact that Q̄ is estimated with finite m, and this is called simulation variance (Rubin, 1987). The within variance Ū is the average of the sample variances in each of the imputed data sets. The between variance B captures the uncertainty in the statistic caused by the missing values. Rubin (1987) has shown that the magnitude of the simulation variance is B/m, which is added to the other sources of variance when calculating the total variance T. The square root of T gives the standard error of the estimated statistic.
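Equations (4)-(7) translate directly into code for a scalar estimand (a sketch; the example estimates and variances are invented):

```python
# Rubin's rules (4)-(7) for a scalar estimand. q_hat[l] and u_hat[l] are the
# estimate and its sampling variance from the l-th imputed data set.
import numpy as np

def pool_rubin(q_hat: np.ndarray, u_hat: np.ndarray):
    """Pool m complete-data estimates; return (Q_bar, total variance T)."""
    m = len(q_hat)
    q_bar = q_hat.mean()                        # (4) pooled estimate
    u_bar = u_hat.mean()                        # (5) within-imputation variance
    b = np.sum((q_hat - q_bar) ** 2) / (m - 1)  # (6) between-imputation variance
    t = u_bar + (1 + 1 / m) * b                 # (7) total variance
    return q_bar, t

q_bar, t = pool_rubin(np.array([0.52, 0.48, 0.55, 0.50, 0.51]),
                      np.array([0.010, 0.012, 0.011, 0.010, 0.013]))
print(q_bar, np.sqrt(t))  # pooled estimate and its standard error
```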
Fig. 4. Classification of multiple imputed data with the 1MI algorithm: (1) the training data are imputed m times and stacked, (2) the model is trained on the stacked training data, the test data are imputed m times with the stacked training data, predicted m times, and the predictions are combined with a majority vote. Reprinted with permission from Publication I © 2015 IEEE.
2.2.3 Classification with multiple imputed data
To use classifiers with MI data sets, a way is needed to combine the results. One possible way to do this was introduced by Belanche, Kobayashi, and Aluja (2014), who stacked the m imputed training data sets together and used the stacked data to train the classifier. The imputed test data sets were then predicted separately, and a majority vote was used for the final prediction. This process is depicted in Figure 4. Another possible variation of the method is to use the m imputed training data sets to train m classifiers and pool the results. In this approach, the test data can be imputed once with each of the m variations of the imputed training data, predicted with the corresponding classifier variation, and pooled with a majority vote. The first approach was found to be more effective in the experiments of Belanche et al. (2014).

Classification uncertainty can be expressed with classifier probability estimates. When using multiple imputation, it is also possible to quantify the uncertainty caused by the fact that the true values of the missing values cannot be known for certain. How this is done is presented in Section 3.2.
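A compact sketch of the stacking-and-voting scheme described above (illustrative; the function names fit_stacked and predict_vote, the base classifier, and binary 0/1 labels are assumptions of this sketch, not the thesis's exact implementation):

```python
# Train one classifier on the m stacked imputed training sets, predict each
# imputed test set separately, and combine the m predictions by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_stacked(imputed_train: list, y: np.ndarray):
    X_stacked = np.vstack(imputed_train)  # m imputed copies stacked row-wise
    y_stacked = np.tile(y, len(imputed_train))
    return RandomForestClassifier(random_state=0).fit(X_stacked, y_stacked)

def predict_vote(model, imputed_test: list) -> np.ndarray:
    preds = np.array([model.predict(X_t) for X_t in imputed_test])
    return (preds.mean(axis=0) > 0.5).astype(int)  # majority vote over m sets

# Usage: model = fit_stacked(imputed_train_sets, y_train)
#        y_pred = predict_vote(model, imputed_test_sets)
```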
32 2.3 Quantifying classifier performance
When starting a modeling process, one important decision is which part of the data to use to evaluate model performance. Ideally, the same data samples should not be used both for training or fine-tuning the model and for evaluating performance, so that the performance evaluation is not biased and does not give a false impression of better performance than is justified. Thus, the use of a separate test data set is generally recommended. However, with limited-sized data sets it might still be wise to avoid setting aside a separate data set for testing, as this further limits the amount of available training data. Besides, strong conclusions cannot be drawn from a small test data set anyway. Instead, resampling methods such as cross-validation can be used to estimate model performance. By evaluating multiple versions of the data, resampling methods can estimate model performance better than a single test set can (Kuhn & Johnson, 2013). In addition, using resampling allows reliable statistical testing of the performance differences between classifiers and calibration scenarios. This aspect will be covered more thoroughly in Section 2.5.

In this thesis, the classification rate (CR), i.e. the percentage of correctly classified test data samples, was used as the measure of overall classifier performance. The classification rate does not take into account the costs of different kinds of errors and can give a false impression of good performance if the classes are very imbalanced. As a rule of thumb, if the classification rate on an imbalanced classification problem is higher than the percentage of the largest class in the data set, the model performance can be considered reasonable (Kuhn & Johnson, 2013). With these caveats in mind, the classification rate is probably not the best measure of classifier performance in every situation, at least if used alone. But in this work, instance-level uncertainty was the focus, and the classification rate was used to give a hint of the degree of difficulty of each classification problem and to allow comparison of different classifiers and calibration scenarios and their effect on overall classification performance.
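A sketch of the resampling-based evaluation described above (illustrative; the data, the base classifier, and the fold counts are invented, not taken from the thesis):

```python
# Estimate the classification rate (CR) with repeated cross-validation
# instead of a single held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring="accuracy", cv=cv)
print(f"CR = {scores.mean():.3f} +/- {scores.std():.3f}")
```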
2.4 Calibration of classifiers and uncertainty of classification
Uncertainty of classifier predictions can be expressed as the posterior probability of the prediction, i.e. how likely it is that the instance really belongs to the predicted class (Theodoridis, 2015). Gonçalves, Fonte, Júlio, and Caetano (2009) also introduced a derivative measure