An Introduction to Statistical Classification and Its Application To

An introduction to statistical classification and its application to Field Asymmetric Ion Mobility Spectrometry Brian Azizia, Georgios Pilikosb aDepartment of Economics, University of Cambridge, Sidgwick Avenue, Cambridge, CB3 9DD, United Kingdom bLaboratory for Scientific Computing, University of Cambridge, JJ Thomson Avenue, Cambridge, CB3 0HE, United Kingdom Abstract This paper serves as an introduction to a particular area of Machine Learning, statistical classification, applied on medical data sets for automatic clinical diagnosis. An application of these models is illustrated in order to provide non-invasive diagnostics with the discovery of new gas volatile compounds (VOCs) that could provide the detection of fermentation profiles of patients with Inflammatory Bowel Disease (IBD). To achieve the above, an investigation of ten statistical classification algorithms from the supervised learning literature is undertaken. From this investigation, selected algorithms are applied on medical Field Asymmetric Ion Mobility Spectrometry (FAIMS) data sets to train classification models. From our results on the FAIMS data sets, we show that it is possible to classify unseen samples with a very high certainty and automatically perform medical diagnosis for Crohn’s disease and Ulcerative Colitis. In addition, we propose potential future research on other data sets by utilizing the results from the identification of informative regions in feature space. Keywords: Supervised Learning, Statistical Classification, Automatic Clinical Diagnosis, Feature Identification, Ion Mobility 1. Introduction data available in hospitals around the globe. The application of these models on FAIMS data sets By utilizing the emerging technologies and avail- is inspired by a long tradition of clinicians that have able data around us, AI is increasingly being used in been using their own sense of smell as a diagnostic tool. Medicine [1]. Development of algorithms that are able This traces even back to Hippocrates who suggested to process large amounts of data and produce valuable that a patient’s odour could lead to their clinical diag- information is necessary. In the medical world, where nosis. Thus, the motives of using these data sets is due decisions are of vital importance, utilization of medical to the recent research on non-invasive diagnostics and history data can greatly enhance diagnosis. That is, by the discovery of new gas volatile compounds (VOCs) collecting samples from positively diagnosed and neg- biomarkers [2] [3] [4] [5] [6] that could provide a de- atively diagnosed patients, it is possible to identify pat- tection of fermentation profiles of patients with IBD. terns or specific features that distinguish them for reli- The pathogenesis of IBD involves the role of bacteria able future decision-making. [4]. These bacteria ferment non-starch polysaccharides The scientific field that deals with this problem in the colon that produces a fermentation profile that can is called Machine Learning and a more relevant sub be traced in urine smell [4]. Using FAIMS instruments, field called statistical classification from the supervised it is possible to track the resultant VOCs that emanate learning literature. A model is trained by giving it a from urine and identify patterns in their chemical finger- number of examples, each belonging to a certain class. prints to automatically perform medical diagnosis for The aim is to use this model to accurately predict new, Crohn’s disease and Ulcerative Colitis. previously unseen examples. Doctors and practitioners In this paper, we provide a review of classification can benefit from this technology since models can find techniques and test some on medical FAIMS data sets. patterns and structure in the data that was previously From our results on the FAIMS data sets, we show that not possible. This can be achieved by the parallel in- it is possible to classify unseen samples with a very high telligent processing of huge amounts of medical history certainty on certain data sets and propose potential future research on other data sets that would potentially Its true class will be labelled by y∗ and our prediction of allow training for more accurate classification models. the class will bey ˆ∗. Specific informative regions that play a vital role for the creation of the models’ decision boundaries on the data 2.1. Linear Discriminant Analysis sets are also identified and illustrated which are worth Linear Discriminant Analysis (LDA) [7] aims to sep- investigating further. arate the classes in feature space using linear decision The paper is organized as follows. Firstly, in sec- surfaces (hyperplanes). tion 2, an in-depth overview of the theory behind the We need to make two slight modifications to our classification methods is given describing potential ad- notation. Firstly, we need to attach a dummy ‘input’ vantages and disadvantages during training and testing feature xi;0 = 1 to our input vectors xi so that xi = T phases along with a description of each algorithm’s im- [1 xi;1 xi;2 ::: xi;D] . Secondly, we will use an alterna- plementation. In section 3, testing of a subset of these tive representation of our classes: Instead of labels yi, th algorithms is described on numerous data sets to inves- we shall use target vectors ti of length K, where the c tigate their practical performance. In section 4, an in- component of ti is equal to 1 if training example i is in troduction to FAIMS technology is given along with the class c and 0 otherwise (i.e. ti;c = 1 if yi = c). application of selected algorithms on medical FAIMS data sets by testing various scenarios. Finally, a dis- Classification: For LDA, the discriminant function cussion about future research and final remarks can be takes the form found in sections 5 and 6 respectively. ∗ T ∗ yˆ = arg max(wc x ): (1) c 2. Methods That is, we predict x∗ to be in class c which maximizes T ∗ Fundamental definitions and notation: the expression wc x . wc is a (D + 1)-dimensional vector containing the Classification is a form of supervised machine learn- weight parameters of the model for class c. The bound- ing. We train a model by using a large number of exam- T T ary between class c and class d is given by wc x = wd x, ples, each belonging to a certain class. Our aim is to use so that (wd − wc) denotes the normal vector of the deci- the model to accurately predict new, previously unseen sion plane. examples. T ∗ There is a nice interpretation to the quantity wc x . We have K discrete classes, that we will index by We can treat it as an estimate of the probability that x∗ the letter c, i.e. c 2 f1; 2;:::; Kg. We have a train- belongs to class c, that is ing set containing N training examples. Each example ∗ T ∗ consists of two elements, namely the input vector (or p(y = c j wc; Data) = wc x : (2) feature vector), denoted by xi, and the corresponding label, denoted by yi. We will use the letter i to index the Training: The goal of the training phase is to learn training examples, so that i 2 f1; 2;:::; Ng. the weight parameters wc for each class c. We achieve Each label yi is an integer between 1 and K, indicating this by minimising an error function. The optimization the class of training example i. Each input vector xi is objectives are given by: a column vector containing the values of the features N of example i in its components. We let D be the total 1 X E(w ) = (wT x − t )2; c 2 f1;:::; Kg: (3) number of features and use j 2 f1; 2 :::; Dg to index c 2 c i i;c i=1 the features of our input vectors, so that xi; j denotes the jth feature of the ith training example. This is called the least squares error function, and we T We let X = [x1 x2 ::: xN] stand for the N × D ma- minimize one per class. It is the sum of squares of the trix containing the training examples in its rows and prediction errors resulting from a particular choice of the input features in its columns. We will also use weight vector wc. We aim to find wc for which this is T y = [y1 y2 ::: yN ] to denote the N-dimensional column the smallest. th vector containing the class of the i training example in Differentiating with respect to wc we find that the op- row i. timal weight vector satisfies Finally, when discussing how to make new predic- N ∗ @E(w ) X tions based on the trained model, we will use x to de- c = (w T x − t )x = 0: (4) @w c i i;c i note the feature vector of a previously unseen example. c i=1 2 T This problem has in fact a closed form solution, and Computing the quantities wc xi can be interpreted as we can state it concisely using matrices. To do that, a form of dimensionality reduction. We take high di- let W = [w1 w2 ::: wK] be the matrix containing the mensional feature vectors xi and project them onto one weight vectors for all the classes in its columns and let dimension (i.e. onto a line). T T = [t1 t2 ::: tN ] be the matrix containing all the target Generally, dimensionality reduction leads to a con- vectors in its rows. siderable loss of information. However, we can adjust The solution to LDA can then be written as w to find the line that minimizes the overlap between T −1 T the classes when projecting them onto it. W = (X X) X T: (5) The goal of FDA is to do just that by maximising The most expensive part in the computation is invert- the Fisher Criterion, which is defined below.

An Introduction to Statistical Classification and Its Application To

On Discriminative Bayesian Network Classifiers and Logistic Regression

Lecture 5: Naive Bayes Classifier 1 Introduction

Hybrid of Naive Bayes and Gaussian Naive Bayes for Classification: a Map Reduce Approach

Locally Differentially Private Naive Bayes Classification

Improving Feature Selection Algorithms Using Normalised Feature Histograms

Naïve Bayes Lecture 17

Last Time Today Bayesian Learning Naïve Bayes the Distributions We

Generalized Naive Bayes Classifiers

A Comparison of Three Different Methods for Classification of Breast

Foundations of Bayesian Learning

Bayesian Learning

Improving the Naive Bayes Classifier Via a Quick Variable Selection