ACTA UNIVERSITATIS OULUENSIS C Technica 759
TUOMO ALASALMI
UNCERTAINTY OF CLASSIFICATION ON LIMITED DATA
Academic dissertation to be presented with the assent of the Doctoral Training Committee of Information Technology and Electrical Engineering of the University of Oulu for public defence in the OP auditorium (L10), Linnanmaa, on 18 September 2020, at 12 noon
UNIVERSITY OF OULU, OULU 2020

Copyright © 2020
Acta Univ. Oul. C 759, 2020
Supervised by
Professor Juha Röning
Professor Jaakko Suutala
Docent Heli Koskimäki

Reviewed by
Professor James T. Kwok
Docent Charles Elkan

Opponent
Professor Henrik Boström
ISBN 978-952-62-2710-8 (Paperback) ISBN 978-952-62-2711-5 (PDF)
ISSN 0355-3213 (Printed) ISSN 1796-2226 (Online)
Cover Design Raimo Ahonen
PUNAMUSTA TAMPERE 2020

Alasalmi, Tuomo, Uncertainty of classification on limited data.
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering
Acta Univ. Oul. C 759, 2020
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland
Abstract

It is common knowledge that even simple machine learning algorithms can improve in performance with large, good-quality data sets. However, limited data sets, be it because of limited size or incomplete instances, are surprisingly common in many real-world modeling problems. In addition to the overall classification accuracy of a model, it is often of interest to know the uncertainty of each individual prediction made by the model. Quantifying this uncertainty of classification models is discussed in this thesis from the perspective of limited data.

When some feature values are missing, uncertainty regarding the classification result is increased, but this is not captured in the metrics that quantify uncertainty using traditional methods. To tackle this shortcoming, a method is presented that, in addition to making incomplete data sets usable for any classifier, makes it possible to quantify the uncertainty stemming from missing feature values. In addition, in the case of complete but limited-sized data sets, the ability of several commonly used classifiers to produce reliable uncertainty estimates, i.e. probability estimates, is studied. Two algorithms are presented that can potentially improve probability estimate calibration when the data set size is limited. It is shown that the traditional approach to calibration often fails on these limited-sized data sets, but using these algorithms still allows classifier probability estimates to be improved with calibration.

To support the usefulness of the proposed methods and to answer the proposed research questions, the main results from the original publications are presented in this compiling part of the thesis. Implications of the findings are discussed and conclusions drawn.
Keywords: classification, missing data, probability, small data, uncertainty
Alasalmi, Tuomo, Luokittelun epävarmuus vaillinaisilla aineistoilla [Uncertainty of classification on limited data].
University of Oulu Graduate School; University of Oulu, Faculty of Information Technology and Electrical Engineering
Acta Univ. Oul. C 759, 2020
University of Oulu, P.O. Box 8000, FI-90014 University of Oulu, Finland

Tiivistelmä

It is widely known that the results of even simple machine learning methods can be improved if plenty of good-quality data is available. Limited data sets, whose shortcomings stem from a small amount of data or from missing values, are nevertheless quite common. In addition to mere classification accuracy, the uncertainty of a model's individual predictions is often useful information. This thesis examines the quantification of classifier uncertainty when the available data is limited. When values of some features are missing from the data, the uncertainty of the classification results increases, but this increased uncertainty goes unnoticed with traditional missing value handling methods. To remedy this, a method is presented with which the increase in uncertainty caused by missing values can be taken into account. In addition, this method makes it possible to use any classifier, even if the classifier does not otherwise support data sets containing missing values. Furthermore, the thesis examines the ability of several commonly used classifiers to produce good estimates of prediction reliability, i.e. probability estimates, when the available data set is small. Two algorithms are presented that can make it possible to improve the calibration of these probability estimates even when the available data set is small. The presented results show that the traditional approach to calibration does not succeed on small data sets, but with the presented algorithms calibration becomes possible.

Results from the original articles are presented in this compiling thesis to support the presented claims and to answer the stated research questions. Finally, the significance of the results is discussed and conclusions are drawn.

Asiasanat: uncertainty, classification, small data, missing data, probability
To Tanja and Iida, my sunshines ☼

Acknowledgements
A doctoral dissertation is a long, lonely journey, but it would not be possible without the support of the people around you. First of all, I would like to thank my principal supervisor, professor Juha Röning, for believing in me and giving me the opportunity to conduct my doctoral research and studies in his research group. I was also fortunate to get guidance from my two other awesome supervisors, docent Heli Koskimäki and associate professor Jaakko Suutala. Thank you, Heli, for pushing me kindly towards the end goal and for occasionally reminding me that there are only two kinds of theses, perfect and ready. And thank you, Jaakko, for sharing your wisdom and for your invaluable tips along the way.

I was lucky to have two world-class experts to pre-examine my thesis. Thank you for your comments and suggestions for improvement, professor James T. Kwok and docent and former professor Charles Elkan. I would also like to thank professor Henrik Boström for kindly accepting to be the opponent in my thesis defense.

I have also received financial support along the way, and I would like to give my thanks to my funders for believing in the value of my work. Thank you to the Infotech doctoral school, the Jenny and Antti Wihuri Foundation, the Tauno Tönning Foundation, and the Walter Ahlström Foundation. Thank you also to the University of Oulu graduate school for supporting my conference attendance costs.

A fair amount of humor is needed to balance the sometimes tiresome research efforts. The Data Analysis and Inference group members did an excellent job in this regard during all the coffee breaks, nights out, and WhatsApp conversations. Without you I would not have heard (or tried to come up with myself) so many Bad jokes (with a capital B)!

I would also like to thank my family and friends. A special thanks goes to you, Mum and Dad, for emphasizing the importance of education and being supportive despite my journey having many twists and turns while trying to find out who I want to become when I grow up. I still don't think I know, only time will tell, so please bear with me! The biggest thanks, however, go to my lovely wife Tanja, who has been extremely supportive even during the darkest of times, and to our adorable daughter Iida. Without you two I would not be where I am today. I love you both ♥
Oulu, June 2020 Tuomo Alasalmi
List of abbreviations
ABB Averaging over Bayesian binnings
ACP Adaptive calibration of predictions
AI Artificial intelligence
BBQ Bayesian binning into quantiles
BIC Bayesian information criterion
BLR Bayesian logistic regression
CR Classification rate
DG Data Generation model
DGG Data Generation and Grouping model
ECOC Error correcting output coding
ENIR Ensemble of near-isotonic regression
EP Expectation propagation
GP Gaussian process
GPC Gaussian process classifier
Logloss Logarithmic loss
MAR Missing at random
MCAR Missing completely at random
MCCV Monte Carlo cross validation
MCMC Markov chain Monte Carlo
MI Multiple imputation
ML Machine learning
MNAR Missing not at random
MSE Mean squared error
NB Naive Bayes classifier
NN Neural network
RF Random forest
RPR Shape restricted polynomial regression
SBB Selection over Bayesian binnings
SVM Support vector machine
XAI Explainable artificial intelligence
Mathematical notation
B Between-imputation variance
Ci Predicted class
K Number of classes
l Number of features
m Number of imputed data sets
N Number of observations
P(Cj) Prior probability of class Cj
P(Cj|x) Posterior probability of class Cj
p(x) Probability density
p(x|Cj) Class-conditional probability density
Q Statistical estimand
Q̄ Estimator of Q calculated from an incompletely observed sample
Q̂ Estimator of Q calculated from a hypothetically complete sample
T Total variance
Ū Estimator of within-imputation variance
x, xi Feature vector, feature vector of instance i
y Target label
List of original publications
This thesis is based on the following articles, which are referred to in the text by their Roman numerals (I–V):
I Alasalmi T., Koskimäki H., Suutala J., & Röning J. (2015). Classification uncertainty of multiple imputed data. IEEE Symposium Series on Computational Intelligence: IEEE Symposium on Computational Intelligence and Data Mining (2015 IEEE CIDM), 151-158. doi: 10.1109/SSCI.2015.32
II Alasalmi T., Koskimäki H., Suutala J., & Röning J. (2016). Instance level classification confidence estimation. Advances in Intelligent Systems and Computing. The 13th International Conference on Distributed Computing and Artificial Intelligence 2016, Springer, 474:275-282. doi: 10.1007/978-3-319-40162-1_30
III Alasalmi T., Koskimäki H., Suutala J., & Röning J. (2018). Getting more out of small data sets. Improving the calibration performance of isotonic regression by generating more data. Proceedings of the 10th International Conference on Agents and Artificial Intelligence, SCITEPRESS, 379-386. doi: 10.5220/0006576003790386
IV Alasalmi T., Suutala J., Koskimäki H., & Röning J. (2020). Better calibration for small data sets. ACM Transactions on Knowledge Discovery from Data, 14(3), Article 34 (May 2020). doi: 10.1145/3385656
V Alasalmi T., Suutala J., Koskimäki H., & Röning J. (2020). Improving multi-class calibration for small data sets. Manuscript.
Contents
Abstract
Tiivistelmä
Acknowledgements 9
List of abbreviations 11
List of original publications 13
Contents 15
1 Introduction 17
   1.1 Objectives and scope 18
   1.2 Contribution of the thesis 19
   1.3 Outline of the thesis 19
2 Theoretical foundation 21
   2.1 Classification 22
      2.1.1 Bayesian classification 22
      2.1.2 Support vector machine 24
      2.1.3 Random forest 25
      2.1.4 Feed-forward neural network 26
   2.2 Missing data 27
      2.2.1 Handling missing data 29
      2.2.2 Missing data imputation 30
      2.2.3 Classification with multiple imputed data 32
   2.3 Quantifying classifier performance 33
   2.4 Calibration of classifiers and uncertainty of classification 33
      2.4.1 Evaluating calibration performance 36
   2.5 Comparing algorithm performance 38
3 Quantifying uncertainty of classification on limited data 39
   3.1 Data sets 39
      3.1.1 Synthetic data 42
   3.2 Uncertainty of classification when some feature values are missing 42
   3.3 Classifier calibration when data set size is limited 46
      3.3.1 Results in binary calibration 48
      3.3.2 Results in multi-class calibration 55
   3.4 Summary 56
4 Discussion 59
5 Conclusions 65
References 67
Original publications 73
1 Introduction
"The starting point for any data preparation project is to locate the data. This is sometimes easier said than done!"
– Dorian Pyle, Data Preparation for Data Mining
At the time, I could not anticipate how meaningful that quote from a book I studied as part of my doctoral studies would eventually become for this thesis. Even simple algorithms improve in performance with more good-quality data, but limited data, be it small in size or containing incomplete instances, is very common in practice, calling for improvements in the algorithms if better performance is desired (Géron, 2019). The core topic of this thesis, uncertainty of classification, stemming either from missing data or from poor calibration of the learning algorithm on limited-sized data sets, touches the very problem raised by the author of that quote. It is indeed often easier said than done to locate all, or even a reasonable sample, of good-quality data (Pyle & Cerra, 1999).

Artificial intelligence (AI) applications are becoming widespread and have an ever-increasing impact on our everyday lives. In many of its applications, such as product recommendations, the reasons that have led to the system's decision are not very interesting from the user's point of view as long as the user finds the recommendations useful. But AI systems are also developed for more critical applications, such as disease diagnosis in healthcare and self-driving cars. In these applications, it is crucial that the decision process is transparent. These systems are often accurate in their results, but it is difficult to get a clear view into their inner workings. This is especially true for machine learning (ML) algorithms. There are obvious dangers in giving decision-making power to a system that is not transparent. To address this issue, a research field called explainable artificial intelligence (XAI) aims to create techniques that help explain the models without affecting their performance (Adadi & Berrada, 2018).

In addition to having an understanding of the inner workings of a model, it is often desirable to also have a reliable estimate of the uncertainty of the predictions produced by the ML model. This becomes apparent when there are costs associated with the decisions that are based on the model predictions (Zadrozny & Elkan, 2001b). As an example, let us consider a clinical decision support system assisting in routine disease screening that gives a recommendation regarding the need for more thorough examination. Giving a recommendation to do more tests will cause unnecessary concern and will also incur monetary costs, but giving a false impression of health when this is unwarranted is not without cost either. Therefore, an accurate estimate of the uncertainty of our prediction is needed to balance the costs of false positive and false negative decisions.

However, when a ML model is presented, typically only global performance measures, such as classification accuracy on a test data set, are reported. These measures do not provide any information on the reliability of each individual prediction. The uncertainty of individual predictions can be estimated with the probability estimates that many ML algorithms produce. However, for many ML algorithms, these estimates reflect true probabilities quite poorly. The estimates can be improved by calibration, but this requires quite a substantial amount of data, which is not available when the data set size is limited. Missing values, which are quite common in many data sets, pose another type of problem for uncertainty estimation. In this case, the uncertainty regarding the true nature of the missing values needs to be integrated into the uncertainty estimate. This aspect is completely ignored in the traditional treatment of missing values.
1.1 Objectives and scope
In this thesis, ways to quantify uncertainty of classification on limited data, both incomplete and small, will be discussed. Quantification of uncertainty that takes into account the uncertainty about missing feature values will be studied, and a possible model-independent solution will be discussed. In addition, classifier calibration on limited-sized data sets will be studied in the context of binary classification, with a selection of the best-performing commonly used classifiers. Problems that emerge with current methods, and possible solutions, will be discussed. Generalization of the solution to multi-class problems will also be studied. In summary, the following research questions will be considered in this thesis:
1. How can uncertainty of classification be quantified when some feature values are missing?
2. How well are different classifiers calibrated when the amount of available training data is limited?
3. Is it possible to improve the calibration of classifiers when the data set size is limited?
The research questions will be answered with peer-reviewed scientific articles, each article providing a partial solution to the above-defined research problem. The contributions of these articles are combined in this compiling part of the thesis.
1.2 Contribution of the thesis
This thesis addresses the research problem that was defined above based on the original publications. Publications I and II deal with uncertainty of classification in the presence of missing feature values. Publication I introduces a new way of quantifying the uncertainty of classifier predictions when some feature values are missing by taking advantage of multiple imputation to capture the inherent uncertainty of the unobserved values. In Publication II, this information is then transformed to estimate the confidence of classifier predictions.

Publications III–V deal with classifier calibration on limited-sized data sets. Publication III introduces two algorithms that can be used to generate calibration data and thereby avoid both biasing calibration toward the training data set and overfitting the calibration model on a limited-sized calibration data set. Publication IV tests the proposed data generation algorithms more thoroughly on a state-of-the-art calibration algorithm and on a sample of classifiers that are commonly used and have proven their effectiveness on a wide array of problems. Publication V then generalizes this calibration scheme to multi-class classification problems.

The author's contribution to each of the articles was in narrowing down the research problem, developing the proposed algorithms, planning and implementing the experiments, interpreting the results, and writing the manuscripts. The other authors gave valuable guidance along the way and helped in proofreading the manuscripts.
1.3 Outline of the thesis
The rest of this thesis is organized as follows. Literature on the theoretical background of this work is reviewed in Chapter 2. The contributions of this thesis are presented in Chapter 3 and their implications are discussed in Chapter 4. Chapter 5 then concludes the thesis.
2 Theoretical foundation
Machine learning can be described as learning possibly hidden structure and patterns from data. The hidden structure and patterns are associated with the mechanism that created the data, and this information can be used to aid the analysis and understanding of the nature of the data (Theodoridis, 2015). In predictive modeling, this understanding is used to make accurate predictions about the chances of something happening, whereas in descriptive modeling the focus is on describing the found patterns and hidden structures in the data, i.e. why something might happen (Kuhn & Johnson, 2013). The chance of something happening is considered the estimated probability of that event according to the model and is often called the prediction score. Predictive modeling can be divided into classification, where an object is assigned to one of a finite number of discrete categories called classes, and regression, where the desired output is a continuous variable (Bishop, 2006). In this thesis, the focus will be on classification.

In predictive modeling, the most interesting property of a model is its predictive performance on new, previously unseen data. Using the same data both for building the model and for estimating prediction performance tends to overestimate the performance. To overcome this problem, a separate data set, called a test or validation data set, can be used. However, when the amount of data available is limited, the test data set might not be large enough to draw strong conclusions, while making the limited training data set smaller still. Therefore, the use of resampling methods such as cross-validation is often recommended to estimate model performance, especially on limited-sized data sets (Kuhn & Johnson, 2013).

In addition to overall model accuracy, the uncertainty of individual predictions is of interest in many applications. This uncertainty arises from the underlying phenomenon and from the model that tries to capture the behavior of that phenomenon. If the amount of available data is limited, estimating the uncertainty becomes more challenging. Missing values add to the uncertainty, as some information is missing. Limited data set size, on the other hand, makes it hard to calibrate the model uncertainty estimates.

This chapter describes the theoretical foundation and the relevant methods of this work. Section 2.1 formalizes the problem of classification and describes how the classifier algorithms related to this work solve this problem. The missing data problem is described in Section 2.2. Classifier performance measures relevant to this work are introduced in Section 2.3. Section 2.4 addresses quantification of classification uncertainty and calibration of classifier probability estimates. Algorithm performance comparison considerations are discussed in Section 2.5.
2.1 Classification
In classification, the task is to predict into which class an object, also known as a pattern, belongs. It is assumed that the pattern belongs to exactly one of K classes, which are known a priori. A unique set of measurements, called features, describes a pattern and is represented by a feature vector x (Theodoridis, 2015). A classifier, trained on a data set consisting of feature vectors and the corresponding class labels, can be used to classify an unknown pattern into one of the K classes. Some classifiers can also estimate the probability that a given pattern belongs to a certain class. This probability is a measure of the uncertainty of the prediction, and in many applications this probability might be as interesting as the predicted class itself.

There are literally hundreds of classifiers available. In this thesis, a sample of the best-performing (Fernández-Delgado, Cernadas, Barro, & Amorim, 2014) and most commonly used classifiers is studied. In addition, Bayesian classifiers are used as a reference, because in theory Bayesian inference yields correct probabilities. In practice, exact Bayesian inference is intractable, as will be discussed later, but Bayesian classifiers might still be our best probe for the correct probabilities.
2.1.1 Bayesian classification
In Bayesian reasoning, the probability of an event is determined by combining prior information and observations into the posterior probability. The prior probability P(Cj) describes how probable the event Cj is a priori, while the posterior probability P(Cj|x) is the probability of the event Cj conditioned on having observed the evidence x. Using the rules of probability theory, Bayes' rule
$$P(C_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_j)\,P(C_j)}{p(\mathbf{x})} \qquad (1)$$

can be derived (Barber, 2016). In a classification task with K classes, we can calculate the probability that an instance represented by feature vector x, i.e. the observation, belongs to a particular class Cj, j = 1, 2, ..., K, with Bayes' rule, assigning x to the class Ci with the highest posterior probability. As the probability density p(x) does not affect the classification decision, it can be left out, which leads to the Bayesian classification rule
$$C_i = \underset{C_j}{\arg\max}\; p(\mathbf{x} \mid C_j)\,P(C_j). \qquad (2)$$
By estimating the conditional probability p(x|Cj) and the prior probability P(Cj) from the training data, a classifier can be built that uses Bayes' rule to estimate the class to which an observation xi most likely belongs. It turns out that the Bayesian classifier minimizes the probability of classification error (Theodoridis, 2015). The prior probabilities P(Cj) can be estimated by the proportion of instances in each class in the training data. To estimate p(x|Cj), however, the number of unknown parameters is of the order of O(l^2), where l is the number of features, i.e. the dimension of the feature space, which makes the estimation computationally very hard and also requires a large training data set (Theodoridis, 2015). Therefore, to get a good estimate of p(x|Cj), approximation methods are needed in practice.
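As a hypothetical numeric illustration of rule (2) (the numbers below are invented for this sketch, not taken from the thesis), consider a two-class problem where the prior favors C1 but the likelihood favors C2:

$$P(C_1) = 0.7,\quad P(C_2) = 0.3,\quad p(x \mid C_1) = 0.2,\quad p(x \mid C_2) = 0.5$$

$$P(C_2 \mid x) = \frac{0.3 \times 0.5}{0.7 \times 0.2 + 0.3 \times 0.5} = \frac{0.15}{0.29} \approx 0.52$$

Rule (2) assigns x to C2, yet the posterior of roughly 0.52 shows that the decision is nearly a coin flip. This is exactly the kind of instance-level uncertainty information this thesis is concerned with.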
Naive Bayes classifier
One way to ease the computational burden in Bayesian classification is to make an assumption that the predictor variables are conditionally independent (Han, Kamber, & Pei, 2012). This leads to the classification rule
$$C_i = \underset{C_j}{\arg\max}\; \left( P(C_j) \prod_{k=1}^{l} p(x_k \mid C_j) \right), \qquad (3)$$

where xk is the value of the k-th feature. The conditional independence assumption of features is rarely true in the real world, but the classifier is still often quite effective, albeit producing poor probability estimates. The independence assumption is especially helpful when the dimensionality of the input space is high, and it also helps if the data set contains both continuous and discrete variables, as they can both use appropriate models separately from each other (Bishop, 2006). Despite its drawbacks, naive Bayes has been selected as one of the most influential data mining methods because of its simplicity, elegance, and robustness (X. Wu et al., 2008). The models produced by naive Bayes are easy to interpret (Kononenko, 1990) compared to many other commonly used learning algorithms. It can also handle missing values, which are common in many real-world data sets, by simply ignoring them in the calculations, i.e. by assigning all possible feature values equal probability.
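A minimal sketch of rule (3) with Gaussian class-conditional densities, written with scikit-learn (an illustration only; the thesis does not prescribe this library, and the synthetic data is invented):

```python
# Minimal Gaussian naive Bayes sketch (illustrative, synthetic data).
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 3)), rng.normal(1.5, 1.0, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X, y)
# predict_proba applies rule (3) with Gaussian class-conditional densities;
# as noted above, these scores are often overconfident, i.e. poorly calibrated.
print(nb.predict(X[:2]), nb.predict_proba(X[:2]))
```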
Gaussian process classification
Instead of trying to find a single solution that maximizes the class separation on the training data, Gaussian processes (GP) try to find a distribution of all possible functions consistent with the observed data. GP starts with a prior distribution of functions and as
data are observed, the posterior distribution of the functions is updated accordingly (Rasmussen & Williams, 2006). When used in classification, a Gaussian process classifier (GPC) is more flexible than fitting a fixed hyperplane that separates the classes (Williams & Barber, 1998). GPC is non-parametric and nonlinear when a nonlinear covariance function, such as the squared exponential function, is used. Markov chain Monte Carlo (MCMC) sampling can be considered the gold standard for approximate GPC inference, but it is computationally very complex and suffers from long running times. The expectation propagation (EP) approximation (Minka, 2001), on the other hand, has proved to be in very good agreement with MCMC for both predictive probabilities and marginal likelihood estimates, at a fraction of the computational cost of MCMC (Kuss & Rasmussen, 2005). Good probability estimates make GPC an interesting point of comparison for quantifying uncertainty of classification.
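A short GPC sketch with scikit-learn (illustrative only, with invented data; note that scikit-learn approximates the posterior with the Laplace method rather than the EP approximation discussed above):

```python
# Gaussian process classifier sketch with a squared exponential (RBF) kernel.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = (np.sin(X[:, 0]) > 0).astype(int)

gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)
print(gpc.predict_proba(np.array([[0.5], [-0.5]])))  # posterior class probabilities
```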
Bayesian logistic regression
Bayesian logistic regression (BLR) uses a Bayesian approach to build a parametric linear classifier, placing Cauchy distributions on the priors. BLR uses Bayesian inference to obtain stable and regularized logistic regression coefficients. With a Cauchy prior, estimation does not fail in the presence of collinearity or when the classes are completely separable, which are known problems with ordinary logistic regression. Moreover, more shrinkage is automatically applied to higher-order interactions. BLR improves on the probability estimates of bare logistic regression and of other suggested prior distributions (Gelman, Jakulin, Pittau, & Su, 2008). Therefore, Bayesian logistic regression is a good candidate for our uncertainty quantification experiments and serves as a linear Bayesian model.
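A sketch of BLR with the Cauchy priors recommended by Gelman et al. (2008) (scale 2.5 for standardized coefficients, 10 for the intercept), written here with PyMC purely as an assumption; the thesis does not name an implementation, and the data is invented:

```python
# Bayesian logistic regression sketch with Cauchy priors (illustrative).
# X is assumed standardized, y binary in {0, 1}.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(int)

with pm.Model():
    intercept = pm.Cauchy("intercept", alpha=0.0, beta=10.0)
    coefs = pm.Cauchy("coefs", alpha=0.0, beta=2.5, shape=X.shape[1])
    pm.Bernoulli("y", logit_p=intercept + pm.math.dot(X, coefs), observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(az.summary(idata, var_names=["intercept", "coefs"]))
```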
2.1.2 Support vector machine
A support vector machine (SVM) is a so-called maximum margin classifier. This means that SVM determines the decision boundary so that it maximizes the distance of the closest training samples from the decision boundary, which leads to good generalization. Only the training data set samples that lie on the margin, called support vectors, are considered when predicting the class of a test sample, hence the name support vector machine (Kuhn & Johnson, 2013). The fact that only a subset of the training data, i.e. the support vectors, is needed to construct the decision boundary means that SVMs are sparse, which is a major advantage with large data sets (Cristianini & Shawe-Taylor, 2000). A linear SVM on a completely separable two-class problem is depicted in Figure 1.

Fig. 1. A support vector machine. Retrieved from https://en.wikipedia.org/wiki/Computer-aided_diagnosis. CC BY-SA 4.0.

This is an easy concept to grasp when the classes are completely separable. When the classes overlap, a cost is associated with the training samples that lie on the margin or on the wrong side of the decision boundary. The magnitude of this penalty is set with a model hyperparameter, and this is the main mechanism that controls the complexity of the decision boundary. Nonlinearity can be added to SVMs using nonlinear kernels. This is called the kernel trick and allows for very flexible decision boundaries (Kuhn & Johnson, 2013). SVM was originally developed with a hard decision boundary in mind, as opposed to estimating accurate class probabilities (Kuhn & Johnson, 2013), which makes it interesting in the context of this thesis. Like naive Bayes, SVM also made its way to the list of the most influential data mining algorithms (X. Wu et al., 2008) and was also a top performer (with Gaussian and polynomial kernels) in the extensive classifier testing of Fernández-Delgado et al. (2014).
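The gap between an SVM's hard decision values and probability estimates can be made concrete with a small scikit-learn sketch (an illustration with invented data; probability=True adds an internal Platt-scaling step, fitted with cross-validation, to map margin scores to probabilities):

```python
# SVM sketch: raw decision values vs. Platt-scaled probability estimates.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(2, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

svm = SVC(kernel="rbf", C=1.0, probability=True).fit(X, y)
print(svm.decision_function(X[:2]))  # signed margin distances (no probabilistic meaning)
print(svm.predict_proba(X[:2]))      # Platt-scaled probability estimates
```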
2.1.3 Random forest
Random forest (RF) is an ensemble of trees that can be used for both classification and regression. Figure 2 illustrates a random forest for classification. There are several ways to construct a random forest, such as using random split selection or using a random subset of the training set or of the features for each tree in the forest.

Fig. 2. A random forest. Retrieved from https://en.wikipedia.org/wiki/Random_forest. CC BY-SA 4.0.

Breiman (2001) uses bagging and random feature selection to build the trees. Using bagging improves accuracy when combined with random feature selection and enables the use of out-of-bag samples, i.e. samples in the training set that are not part of the current bootstrap sample, to estimate the generalization error, strength, and correlation of the trees. A random forest estimates class probabilities as the proportions of votes the trees in the forest cast for each class; these vote proportions are not properly calibrated probabilities. Random forests can be proven not to overfit as the number of trees grows, and their generalization error depends on the strength of the individual trees and their mutual correlation. Random forests are competitive with boosting and adaptive bagging, at least in classification tasks (Breiman, 2001). Excellent classification performance has also been verified in more thorough testing, where random forest was the best-performing classifier among the 179 classifiers tested on 121 data sets (Fernández-Delgado et al., 2014). Their widespread use makes random forests an interesting candidate regarding the research questions of this thesis.
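The vote-proportion estimate described above can be computed explicitly (a sketch with invented data; note that scikit-learn's own predict_proba averages per-tree class probabilities rather than counting hard votes, so the two can differ slightly):

```python
# Random forest sketch: class "probabilities" as proportions of tree votes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
votes = np.mean([tree.predict(X[:3]) for tree in rf.estimators_], axis=0)
print(votes)                    # fraction of trees voting for class 1
print(rf.predict_proba(X[:3]))  # scikit-learn's averaged per-tree probabilities
```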
2.1.4 Feed-forward neural network
Neural networks (NN) are machine learning algorithms that are inspired by biological neural networks, i.e. the brain, whose basic building element is the neuron. In the human brain, these neurons are connected to each other via synapses, forming a network comprising approximately 50 to 100 trillion synapses (Theodoridis, 2015).
Fig. 3. A feed-forward neural network with an input layer, one hidden layer, and an output layer.
In artificial neural networks, individual neurons can be modeled using perceptrons (Rosenblatt, 1958), and a network is formed by connecting individual perceptrons in a layered fashion. In a commonly used network structure, information flows only forward through the network, hence the name feed-forward neural network. Figure 3 illustrates such a network with three layers: an input layer, a hidden layer, and an output layer. Each connection has an associated weight, and learning is achieved by adjusting the weights (Theodoridis, 2015). A breakthrough in training neural networks was achieved with the backpropagation algorithm (Rumelhart, Hinton, & Williams, 1986), which is still in use today. A two-layer neural network can approximate any continuous function, so it can be used to construct any nonlinear decision surface in a classification task. In practice, the needed size of the layers might become infeasible, and adding more layers instead can be advantageous (Theodoridis, 2015). Neural networks tolerate noisy data quite well, generalize well, and have proven successful in a wide variety of real-world applications, but they are not intuitive to interpret (Han et al., 2012). To obtain probability estimates, a softmax transformation can be applied to the network output.
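The softmax transformation mentioned above is simple enough to sketch directly (illustrative; the example logits are invented):

```python
# Numerically stable softmax: maps a network's raw output vector (logits)
# to class probabilities that are non-negative and sum to one.
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Return class probabilities for one vector of network outputs."""
    shifted = logits - np.max(logits)  # guard against overflow in exp
    exps = np.exp(shifted)
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.66 0.24 0.10]
```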
2.2 Missing data
In addition to size, data can be limited by having some feature values missing. Missing data are a common problem in almost all disciplines. In practice, it means that some values are not present in the data set, and the data set is said to be incomplete. This makes statistical analysis of the data more complicated. Missing data can be the result of, for example, dropouts in longitudinal studies (Molenberghs, Fitzmaurice, Kenward, Tsiatis, & Verbeke, 2014).

When some data are missing, some information is necessarily lost, which leads to a reduction in the precision of any analysis. The magnitude of the reduction in precision depends on the amount of missing data and the method used in the analysis. Precision is, however, not the only or even the most significant problem related to missing data. Missing data make the data set unbalanced and can add bias, which can lead to misleading inferences (Molenberghs et al., 2014).

An important aspect of dealing with missing data is understanding the mechanism that caused the data to be missing. Three types of mechanisms are described in the literature: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In MCAR, the cause of the missing data is unrelated to the data itself, and the probability that a certain value is missing is the same for all data samples. This is often not a realistic assumption. If the probability of a value being missing is not uniform but is correlated with other observed data, the mechanism is said to be MAR (a small simulation of this mechanism is sketched at the end of this section). In MNAR, the probability that a value is missing depends on the missing value itself or on some other unobserved data. Assuming MAR is reasonable in most cases, which means that observed values can be used in estimating the missing values (van Buuren, 2012). However, the mechanism behind missing values should always be investigated and taken into account when choosing missing data handling methods (Aste, Boninsegna, Freno, & Trentin, 2015).

Most classification algorithms have been developed assuming a complete data set, so they cannot deal with missing values. Naive Bayes is one exception: it handles missing values by ignoring them in the computations and therefore spreads the probability uniformly. As stated earlier, the best-performing classifiers in a wide array of tests have been random forest, support vector machine, and neural network variants (Fernández-Delgado et al., 2014), none of which can handle missing values. Missing values, on the other hand, are very common in real-world data sets. For example, it is not realistic to expect every patient in a medical data set to have been measured for every possible blood test. In addition, there can be errors in data collection or equipment, mistakes in data entry, dropouts in longitudinal studies, etc. It is wasteful to throw away samples that are incomplete in one or more feature values, especially if the data set size is also limited.
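The MAR mechanism described above can be made concrete with a small invented simulation: the probability that feature x2 is missing depends only on the observed x1, not on the value of x2 itself.

```python
# Sketch of inducing MAR missingness in a complete data set (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + rng.normal(size=500)

p_missing = 1 / (1 + np.exp(-x1))          # larger observed x1 -> more likely missing
x2[rng.uniform(size=500) < p_missing] = np.nan
print(f"{np.isnan(x2).mean():.0%} of x2 missing")  # roughly 50%
```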
28 2.2.1 Handling missing data
A review of approaches to handling missing data in classification tasks can be found in, for example, García-Laencina, Sancho-Gómez, and Figueiras-Vidal (2009) and Aste et al. (2015). Missing data handling can be roughly categorized into three main approaches. The most common practice is complete case analysis, also called listwise deletion, i.e. simply ignoring all incomplete data samples. In some cases, it might be appropriate to use available case analysis, also called pairwise deletion, where all available data for a particular analysis are used even if other features are incomplete. Analogously, in classification, an ensemble of classifiers could be constructed with a different subset of features used in each, but this approach becomes infeasible very quickly as the number of features grows. The third approach is to fill in the missing feature values with some sensible value so that any desired method can be used for the analysis. This process is called missing data imputation (Aste et al., 2015; van Buuren, 2012).

In this work, interest is focused on the uncertainty of classifier predictions. Deletion gives no information whatsoever about the target class. Imputing a single value in place of a missing value, on the other hand, gives too rosy a picture of the uncertainty of the prediction, because the inherent uncertainty of the imputed value is not taken into account. This is where multiple imputation (MI), where multiple possible values are used to fill in each missing value, can help (Rubin, 1987; van Buuren, 2012), but classification algorithms cannot directly handle multiply imputed data sets.

As most classifiers cannot use data with missing values, using imputation to handle the missing values allows the use of any classifier that is suitable for the task at hand. With multiple imputation, it is quite straightforward to estimate the standard errors of statistical estimands that describe the uncertainty of the analysis. This is done with the so-called Rubin's rules (Rubin, 1987; van Buuren, 2012), which factor in both the within-imputation variance and the between-imputation variance. The latter stems from the uncertainty of the imputations, which would be impossible to recognize if only a single imputation were used. For classification, which is the focus of this thesis, the uncertainty can be estimated with the posterior probability estimates (later just probability estimates) of the predictions. The overconfidence problem when using single imputation is the same for classifier probability estimates as it is for standard errors of statistical estimands. Harnessing multiple imputation to capture the uncertainty of classifier predictions stemming from the uncertainty of the missing values will be studied in this thesis.
29 2.2.2 Missing data imputation
Probably the most common way to choose the imputed values is to use the mean of the available values for that variable. This is unfortunate, as using the mean suppresses the variability in the data and also affects the relationships between variables. All statistical estimands will be biased if mean imputation is used, even the mean itself, if the data are not missing completely at random, which, as stated earlier, is often not a valid assumption (van Buuren, 2012). Another option is to use the information in the observed variables and fill in the missing values based on regression modeling. Variation is suppressed with regression imputation too, as all of the imputed values fall on the regression line, and it is unlikely that this represents the true distribution of the values had they been observed. Correlation of variables is also inflated due to the lowered variation, but mean estimates will be unbiased with regression imputation (van Buuren, 2012).

Regression imputation can be modified by adding noise to the imputed values so that the lost variability is regained. This method is called stochastic regression imputation. Modeling is done by fitting the regression model and estimating the residual variance; the imputed values can then be drawn from a distribution with these parameters. The variability in the imputed values is indeed restored, but even though the values look like they could be real observations, they are not, yet they are treated with the same certainty as real observed values. Stochastic regression helps to preserve the correlations and relationships of variables, but like regular regression imputation, it can produce impossible values and gives a false impression of certainty in the imputed values (van Buuren, 2012).

In multiple imputation, each missing value is replaced with m > 1 values, using a method that draws values from a distribution of plausible values specifically modeled for that particular feature, e.g. stochastic regression, thus creating several data sets that are identical in the observed values but differ in the imputed values. The variation of the imputed values between the imputations indicates how uncertain the imputations are. The imputation model that is appropriate for each feature depends on the data. The MI data sets are analyzed individually, and the results are pooled to form the final estimands using the pooling rules developed by Rubin (1987). The pooled estimands are unbiased and have correct statistical properties if the imputation models were chosen correctly (van Buuren, 2012).
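One way to generate the m imputed data sets is sketched below with scikit-learn's IterativeImputer (an illustration with invented data; the thesis does not prescribe this tool). With sample_posterior=True, imputations are drawn from a predictive distribution, in the spirit of the stochastic regression imputation described above.

```python
# Sketch: create m multiply imputed data sets from one incomplete data set.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[rng.uniform(size=X.shape) < 0.1] = np.nan  # make ~10% of values missing

m = 5
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=i).fit_transform(X)
    for i in range(m)
]  # identical observed values, different imputed values across the m sets
```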
30 Rubin’s rules
Standard error is the uncertainty measure that is used in statistical analysis. It can be defined as "the standard deviation of the sampling distribution of some statistic" (Urdan, 2010). Rubin’s rules
$$\bar{Q} = \frac{1}{m}\sum_{l=1}^{m}\hat{Q}_l, \qquad (4)$$

$$\bar{U} = \frac{1}{m}\sum_{l=1}^{m}\bar{U}_l, \qquad (5)$$

$$B = \frac{1}{m-1}\sum_{l=1}^{m}\left(\hat{Q}_l-\bar{Q}\right)\left(\hat{Q}_l-\bar{Q}\right)', \qquad (6)$$

$$T = \bar{U} + B + \frac{B}{m} = \bar{U} + \left(1+\frac{1}{m}\right)B \qquad (7)$$

are a way to estimate the standard error when using multiple imputation, and they thus take into account the uncertainty regarding the missing values. The same principle will be used later when determining the uncertainty of classifier predictions on multiply imputed data. When estimating some statistic Q̄ from data with missing values, multiple imputation is first used to create the m complete data sets as described above. Each of the data sets is then used separately to compute the complete-data estimate Q̂, and the final estimate Q̄ is the arithmetic mean of those estimates. To estimate the standard error of the estimate Q̄, three sources of variance need to be taken into account. The first source of variance comes from the fact that only a sample of the whole population is observed; this is called within variance. The second source of variance is caused by the missing values and can be seen as variation between the m data sets; this is called between variance. The third source of variance comes from the fact that Q̄ is estimated with finite m, and this is called simulation variance (Rubin, 1987). The within variance Ū is the average of the sample variances in each of the imputed data sets. The between variance B captures the uncertainty in the statistic caused by the missing values. Rubin (1987) has shown that the magnitude of the simulation variance is B/m, which is added to the other sources of variance when calculating the total variance T. The square root of T gives the standard error of the estimated statistic.
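Equations (4)-(7) translate directly into code for a scalar estimand (a sketch; the example estimates and variances are invented):

```python
# Rubin's rules (4)-(7) for a scalar estimand. q_hat[l] and u_hat[l] are the
# estimate and its sampling variance from the l-th imputed data set.
import numpy as np

def pool_rubin(q_hat: np.ndarray, u_hat: np.ndarray):
    """Pool m complete-data estimates; return (Q_bar, total variance T)."""
    m = len(q_hat)
    q_bar = q_hat.mean()                        # (4) pooled estimate
    u_bar = u_hat.mean()                        # (5) within-imputation variance
    b = np.sum((q_hat - q_bar) ** 2) / (m - 1)  # (6) between-imputation variance
    t = u_bar + (1 + 1 / m) * b                 # (7) total variance
    return q_bar, t

q_bar, t = pool_rubin(np.array([0.52, 0.48, 0.55, 0.50, 0.51]),
                      np.array([0.010, 0.012, 0.011, 0.010, 0.013]))
print(q_bar, np.sqrt(t))  # pooled estimate and its standard error
```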
Fig. 4. Classification of multiple imputed data with the 1MI algorithm: (1) the training data are imputed m times and stacked, (2) the model is trained on the stacked training data, the test data are imputed m times with the stacked training data, predicted m times, and the predictions are combined with a majority vote. Reprinted with permission from Publication I © 2015 IEEE.
2.2.3 Classification with multiple imputed data
To use classifiers with MI data sets, a way is needed to combine the results. One possible way to do this was introduced by Belanche, Kobayashi, and Aluja (2014), who stacked the m imputed training data sets together and used the stacked data to train the classifier. The imputed test data sets were then predicted separately, and a majority vote was used for the final prediction. This process is depicted in Figure 4. Another possible variation of the method is to use the m imputed training data sets to train m classifiers and pool the results. In this approach, the test data can be imputed once with each of the m variations of the imputed training data, predicted with the corresponding classifier variation, and pooled with a majority vote. The first approach was found to be more effective in the experiments of Belanche et al. (2014).

Classification uncertainty can be expressed with classifier probability estimates. When using multiple imputation, it is also possible to quantify the uncertainty caused by the fact that the true values of the missing values cannot be known for certain. How this is done is presented in Section 3.2.
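A compact sketch of the stacking-and-voting scheme described above (illustrative; the function names fit_stacked and predict_vote, the base classifier, and binary 0/1 labels are assumptions of this sketch, not the thesis's exact implementation):

```python
# Train one classifier on the m stacked imputed training sets, predict each
# imputed test set separately, and combine the m predictions by majority vote.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def fit_stacked(imputed_train: list, y: np.ndarray):
    X_stacked = np.vstack(imputed_train)  # m imputed copies stacked row-wise
    y_stacked = np.tile(y, len(imputed_train))
    return RandomForestClassifier(random_state=0).fit(X_stacked, y_stacked)

def predict_vote(model, imputed_test: list) -> np.ndarray:
    preds = np.array([model.predict(X_t) for X_t in imputed_test])
    return (preds.mean(axis=0) > 0.5).astype(int)  # majority vote over m sets

# Usage: model = fit_stacked(imputed_train_sets, y_train)
#        y_pred = predict_vote(model, imputed_test_sets)
```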
32 2.3 Quantifying classifier performance
When starting a modeling process, one important decision is which part of the data to use to evaluate model performance. Ideally, the same data samples should not be used both for training or fine-tuning the model and for evaluating performance, so that the performance evaluation is not biased and does not give a false impression of better performance than is justified. Thus, the use of a separate test data set is generally recommended. However, with limited-sized data sets it might still be wise to avoid setting aside a separate data set for testing, as this further limits the amount of available training data. Besides, strong conclusions cannot be drawn from a small test data set anyway. Instead, resampling methods such as cross-validation can be used to estimate model performance. By evaluating multiple versions of the data, resampling methods can estimate model performance better than a single test set can (Kuhn & Johnson, 2013). In addition, using resampling allows reliable statistical testing of the performance differences between classifiers and calibration scenarios. This aspect will be covered more thoroughly in Section 2.5.

In this thesis, the classification rate (CR), i.e. the percentage of correctly classified test data samples, was used as the measure of overall classifier performance. The classification rate does not take into account the costs of different kinds of errors and can give a false impression of good performance if the classes are very imbalanced. As a rule of thumb, if the classification rate on an imbalanced classification problem is higher than the percentage of the largest class in the data set, the model performance can be considered reasonable (Kuhn & Johnson, 2013). With these caveats in mind, the classification rate is probably not the best measure of classifier performance in every situation, at least if used alone. But in this work, instance-level uncertainty was the focus, and the classification rate was used to give a hint of the degree of difficulty of each classification problem and to allow comparison of different classifiers and calibration scenarios and their effect on overall classification performance.
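A sketch of the resampling-based evaluation described above (illustrative; the data, the base classifier, and the fold counts are invented, not taken from the thesis):

```python
# Estimate the classification rate (CR) with repeated cross-validation
# instead of a single held-out test set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         scoring="accuracy", cv=cv)
print(f"CR = {scores.mean():.3f} +/- {scores.std():.3f}")
```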
2.4 Calibration of classifiers and uncertainty of classification
Uncertainty of classifier predictions can be expressed as the posterior probability of the prediction, i.e. how likely it is that the instance really belongs to the predicted class (Theodoridis, 2015). Gonçalves, Fonte, Júlio, and Caetano (2009) also introduced a derivative measure