An introduction to statistical classification and its application to Field Asymmetric Ion Mobility Spectrometry

Brian Azizi (a), Georgios Pilikos (b)

(a) Department of Economics, University of Cambridge, Sidgwick Avenue, Cambridge, CB3 9DD, United Kingdom
(b) Laboratory for Scientific Computing, University of Cambridge, JJ Thomson Avenue, Cambridge, CB3 0HE, United Kingdom

Abstract

This paper serves as an introduction to a particular area of machine learning, statistical classification, applied to medical data sets for automatic clinical diagnosis. An application of these models is illustrated in order to provide non-invasive diagnostics based on the discovery of new gas volatile compounds (VOCs) that could enable the detection of fermentation profiles of patients with Inflammatory Bowel Disease (IBD). To achieve the above, an investigation of ten statistical classification algorithms from the literature is undertaken. From this investigation, selected algorithms are applied to medical Field Asymmetric Ion Mobility Spectrometry (FAIMS) data sets to train classification models. From our results on the FAIMS data sets, we show that it is possible to classify unseen samples with very high certainty and automatically perform medical diagnosis for Crohn's disease and Ulcerative Colitis. In addition, we propose potential future research on other data sets by utilizing the results from the identification of informative regions in feature space.

Keywords: Supervised Learning, Statistical Classification, Automatic Clinical Diagnosis, Feature Identification, Ion Mobility

1. Introduction

By utilizing the emerging technologies and available data around us, AI is increasingly being used in Medicine [1]. Development of algorithms that are able to process large amounts of data and produce valuable information is necessary. In the medical world, where decisions are of vital importance, utilization of medical history data can greatly enhance diagnosis. That is, by collecting samples from positively diagnosed and negatively diagnosed patients, it is possible to identify patterns or specific features that distinguish them for reliable future decision-making.

The scientific field that deals with this problem is called Machine Learning, and the most relevant sub-field here is statistical classification from the supervised learning literature. A model is trained by giving it a number of examples, each belonging to a certain class. The aim is to use this model to accurately predict new, previously unseen examples. Doctors and practitioners can benefit from this technology since models can find patterns and structure in the data that was previously not possible. This can be achieved by the parallel intelligent processing of huge amounts of medical history data available in hospitals around the globe.

The application of these models on FAIMS data sets is inspired by a long tradition of clinicians that have been using their own sense of smell as a diagnostic tool. This traces back even to Hippocrates, who suggested that a patient's odour could lead to their clinical diagnosis. Thus, the motivation for using these data sets comes from recent research on non-invasive diagnostics and the discovery of new gas volatile compound (VOC) biomarkers [2] [3] [4] [5] [6] that could provide a detection of fermentation profiles of patients with IBD. The pathogenesis of IBD involves the role of bacteria [4]. These bacteria ferment non-starch polysaccharides in the colon, which produces a fermentation profile that can be traced in urine smell [4]. Using FAIMS instruments, it is possible to track the resultant VOCs that emanate from urine and identify patterns in their chemical fingerprints to automatically perform medical diagnosis for Crohn's disease and Ulcerative Colitis.

In this paper, we provide a review of classification techniques and test some on medical FAIMS data sets. From our results on the FAIMS data sets, we show that it is possible to classify unseen samples with very high certainty on certain data sets, and we propose potential future research on other data sets that would potentially allow training of more accurate classification models. Specific informative regions that play a vital role in the creation of the models' decision boundaries on the data sets are also identified and illustrated; these are worth investigating further.

The paper is organized as follows. Firstly, in section 2, an in-depth overview of the theory behind the classification methods is given, describing potential advantages and disadvantages during training and testing phases along with a description of each algorithm's implementation. In section 3, testing of a subset of these algorithms is described on numerous data sets to investigate their practical performance. In section 4, an introduction to FAIMS technology is given along with the application of selected algorithms on medical FAIMS data sets by testing various scenarios. Finally, a discussion about future research and final remarks can be found in sections 5 and 6 respectively.

2. Methods

Fundamental definitions and notation: Classification is a form of supervised machine learning. We train a model by using a large number of examples, each belonging to a certain class. Our aim is to use the model to accurately predict new, previously unseen examples.

We have K discrete classes, which we will index by the letter c, i.e. c ∈ {1, 2, ..., K}. We have a training set containing N training examples. Each example consists of two elements, namely the input vector (or feature vector), denoted by x_i, and the corresponding label, denoted by y_i. We will use the letter i to index the training examples, so that i ∈ {1, 2, ..., N}. Each label y_i is an integer between 1 and K, indicating the class of training example i. Each input vector x_i is a column vector containing the values of the features of example i in its components. We let D be the total number of features and use j ∈ {1, 2, ..., D} to index the features of our input vectors, so that x_{i,j} denotes the j-th feature of the i-th training example.

We let X = [x_1 x_2 ... x_N]^T stand for the N × D matrix containing the training examples in its rows and the input features in its columns. We will also use y = [y_1 y_2 ... y_N]^T to denote the N-dimensional column vector containing the class of the i-th training example in row i. Finally, when discussing how to make new predictions based on the trained model, we will use x* to denote the feature vector of a previously unseen example. Its true class will be labelled by y* and our prediction of the class will be ŷ*.

2.1. Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) [7] aims to separate the classes in feature space using linear decision surfaces (hyperplanes). We need to make two slight modifications to our notation. Firstly, we attach a dummy 'input' feature x_{i,0} = 1 to our input vectors x_i so that x_i = [1 x_{i,1} x_{i,2} ... x_{i,D}]^T. Secondly, we will use an alternative representation of our classes: instead of labels y_i, we shall use target vectors t_i of length K, where the c-th component of t_i is equal to 1 if training example i is in class c and 0 otherwise (i.e. t_{i,c} = 1 if y_i = c).

Classification: For LDA, the discriminant function takes the form

  ŷ* = arg max_c (w_c^T x*).   (1)

That is, we predict x* to be in the class c which maximizes the expression w_c^T x*. Here w_c is a (D + 1)-dimensional vector containing the weight parameters of the model for class c. The boundary between class c and class d is given by w_c^T x = w_d^T x, so that (w_d − w_c) denotes the normal vector of the decision plane.

There is a nice interpretation of the quantity w_c^T x*. We can treat it as an estimate of the probability that x* belongs to class c, that is

  p(y* = c | w_c, Data) = w_c^T x*.   (2)

Training: The goal of the training phase is to learn the weight parameters w_c for each class c. We achieve this by minimising an error function. The optimization objectives are given by

  E(w_c) = (1/2) Σ_{i=1}^{N} (w_c^T x_i − t_{i,c})^2,   c ∈ {1, ..., K}.   (3)

This is called the least squares error function, and we minimize one per class. It is the sum of squares of the prediction errors resulting from a particular choice of weight vector w_c. We aim to find the w_c for which this is smallest. Differentiating with respect to w_c, we find that the optimal weight vector satisfies

  ∂E(w_c)/∂w_c = Σ_{i=1}^{N} (w_c^T x_i − t_{i,c}) x_i = 0.   (4)

This problem has in fact a closed-form solution, and we can state it concisely using matrices. To do that, let W = [w_1 w_2 ... w_K] be the matrix containing the weight vectors for all the classes in its columns, and let T = [t_1 t_2 ... t_N]^T be the matrix containing all the target vectors in its rows. The solution to LDA can then be written as

  W = (X^T X)^{-1} X^T T.   (5)

The most expensive part of the computation is inverting X^T X.

Algorithm 2.1: LDA(X, y)
  Create the N × K matrix T, where T_{i,c} = 1 if y_i = c and 0 otherwise.
  comment: Training Phase:
    Compute W = (X^T X)^{-1} X^T T.
  comment: Classification:
    Compute the vector W^T x* and find the largest row k.
    Predict ŷ* = k.

Discussion: LDA has the advantage of giving a closed-form solution for the weight vectors w_c. Furthermore, the model is computationally inexpensive and easy to interpret. We can also interpret the quantity w_c^T x* as the probability that x* belongs to class c. However, this interpretation may break down on some occasions, as there are no constraints to ensure that w_c^T x* ∈ [0, 1] for all possible vectors x*.

As it is a linear method, LDA cannot handle non-linear class boundaries very well. Furthermore, even in the case of linearly separable classes, it may not find the optimal decision boundary. Outliers that are "too correct", in that they lie a long way on the correct side of the decision surface, have a disproportionate effect on LDA. The error function penalizes these outliers too heavily, and the result is that LDA shifts the boundary in their direction, at the expense of accuracy. The algorithm's steps can be seen in Algorithm 2.1. For a more detailed derivation and thorough discussion, see [7] [8].

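Below is a minimal NumPy sketch of Algorithm 2.1 under the conventions above (rows of X are examples, labels run from 1 to K). The function names `lda_train` and `lda_predict` are ours, not part of any toolbox used in the paper.

```python
import numpy as np

def lda_train(X, y, K):
    """Least-squares LDA (Algorithm 2.1): returns the (D+1) x K weight matrix W."""
    N = X.shape[0]
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend the dummy feature x_{i,0} = 1
    T = np.zeros((N, K))
    T[np.arange(N), y - 1] = 1.0              # one-hot targets for classes 1..K
    # W = (X^T X)^{-1} X^T T, computed via a least-squares solve for numerical stability
    W, *_ = np.linalg.lstsq(Xb, T, rcond=None)
    return W

def lda_predict(W, x_new):
    """Predict the class (1..K) of a single unseen example x*."""
    xb = np.concatenate([[1.0], x_new])
    return int(np.argmax(W.T @ xb)) + 1
```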
2.2. Fisher Discriminant Analysis

Related to LDA is the idea of Fisher Discriminant Analysis (FDA) [7] [9]. Computing the quantities w_c^T x_i can be interpreted as a form of dimensionality reduction: we take high-dimensional feature vectors x_i and project them onto one dimension (i.e. onto a line). Generally, dimensionality reduction leads to a considerable loss of information. However, we can adjust w to find the line that minimizes the overlap between the classes when projecting them onto it.

The goal of FDA is to do just that by maximising the Fisher Criterion, which is defined below. The intuition behind this is to get the largest separation between the classes by maximizing the between-class variance, while simultaneously minimizing the within-class variance, thereby reducing the spread of the individual classes. This should give us the smallest possible class overlap when projected onto one dimension.

As in the case of LDA, we need to introduce a dummy feature x_{i,0} = 1 to each example x_i. However, we will adopt a 2-class setting to explain FDA. For that, we take y_i ∈ {0, 1}.

Training: FDA trains the weight vector w by maximising the Fisher Criterion, J(w). To define the criterion, let m_0 and m_1 be the class mean vectors, given by

  m_0 = (1/N_0) Σ_{i : y_i=0} x_i,   m_1 = (1/N_1) Σ_{i : y_i=1} x_i,   (6)

where N_1 = Σ_{i=1}^N y_i is the total number of training examples in class 1 and N_0 = N − N_1 is the total number of training examples in class 0. Let S_B be the between-class covariance matrix,

  S_B = (m_1 − m_0)(m_1 − m_0)^T,   (7)

and let S_W be the within-class covariance matrix, given by

  S_W = Σ_{i : y_i=1} (x_i − m_1)(x_i − m_1)^T + Σ_{i : y_i=0} (x_i − m_0)(x_i − m_0)^T.   (8)

Then the Fisher Criterion is given by

  J(w) = (w^T S_B w) / (w^T S_W w).   (9)

We see that the between-class covariance is in the numerator while the total within-class covariance is in the denominator, so that maximisation of this criterion should give us the desired optimal separation. The result of the optimization is

  w ∝ S_W^{-1} (m_1 − m_0).   (10)

This is known as Fisher's Linear Discriminant. It is, however, not quite yet a discriminant, but only a direction for projection. The projected data can then be used to construct a discriminant by choosing a threshold τ. There are various methods for choosing the threshold. Often, we assume that the features are normally distributed and find the value of τ that maximizes the posterior class probabilities, or equivalently, minimizes the misclassification rate.

Classification: To classify a new data point x*, we compute the quantity w^T x*. Then, given the threshold τ, we predict

  ŷ* = 1 if w^T x* ≥ τ,   ŷ* = 0 if w^T x* < τ.   (11)

Algorithm 2.2: FDA(X, y)
  comment: Training Phase:
    Compute w ∝ S_W^{-1}(m_1 − m_0), where m_1, m_0 and S_W are given by (6) and (8).
    Find the optimal threshold τ by minimising the misclassification rate.
  comment: Classification:
    if w^T x* ≥ τ then ŷ* = 1, else ŷ* = 0.

Discussion: FDA is a useful method for reducing the dimensionality of a machine learning problem by projecting the features onto a lower-dimensional domain. It handles classes with linear boundaries very well. In practice, it lacks flexibility. The method of Kernel Fisher Discriminants [10] builds on FDA and extends it to non-linear class boundaries using the kernel trick [11]. The algorithm's steps can be seen in Algorithm 2.2. For a more detailed description of FDA, including a multi-class treatment, see [8] [9].

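The following is a small illustrative sketch of the two-class Fisher direction (10) together with a simple threshold choice. As a simplification, τ is placed at the midpoint of the projected class means rather than explicitly minimising the misclassification rate; all function names are hypothetical.

```python
import numpy as np

def fda_train(X, y):
    """Fisher direction for a 2-class problem with labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # within-class covariance S_W (eq. 8), summed over both classes
    S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(S_W, m1 - m0)          # w proportional to S_W^{-1}(m1 - m0), eq. (10)
    tau = 0.5 * (w @ m0 + w @ m1)              # simple threshold: midpoint of projected means
    return w, tau

def fda_predict(w, tau, x_new):
    return 1 if w @ x_new >= tau else 0
```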
2.3. Adaptive Boosting (AdaBoost)

Boosting belongs to a family of methods called ensembles. It combines a number of base classifiers to form a committee. We will describe a particular method of boosting, named AdaBoost [12].

AdaBoost can produce substantial improvements in performance compared to using just a single classifier. It can give good results even if we only use base classifiers whose performance is only slightly better than random (known as "Weak Learners") [12]. To classify a new example, we combine all the trained base classifiers to form a majority vote. A weight is also associated with each base classifier, so the votes of classifiers that performed relatively better during training count more. We will explain the algorithm in a 2-class setting. For this algorithm, it is useful to use the labels y_i ∈ {−1, 1} and let +1 stand for the positive and −1 for the negative class.

Training: We will train a total of M base classifiers, denoted by f_m(x) with m ∈ {1, ..., M}. It is common to use tree stumps as base classifiers for each iteration. These are weak learners that attempt to separate the classes using axis-aligned hyperplanes, i.e. hyperplanes that use one of the axes of the feature space as the direction of their normal. We let w_i^(m) stand for the weight of the i-th training example just before it is passed on to classifier f_m(x), i.e. during the m-th iteration. Initially, the examples receive equal weights, so as to have an unbiased initial state and be able to adapt according to the misclassification error only:

  w_i^(1) = 1/N for all i ∈ {1, ..., N}.   (12)

In each iteration m, we train the base classifier f_m by minimizing the error function

  J_m = Σ_{i=1}^N w_i^(m) I(f_m(x_i) ≠ y_i)   (13)

with respect to the parameters of the classifier f_m(x), where I(·) is the indicator function, which outputs 1 if the condition inside the brackets holds true and 0 otherwise. Next, we need to compute the quantities

  ε_m = ( Σ_{i=1}^N w_i^(m) I(f_m(x_i) ≠ y_i) ) / ( Σ_{i=1}^N w_i^(m) )   (14)

and

  α_m = ln( (1 − ε_m) / ε_m ).   (15)

We use these to update the weights for the next iteration:

  w_i^(m+1) = w_i^(m) exp( α_m I(f_m(x_i) ≠ y_i) ).   (16)

Classification: Once we have trained all base classifiers, we can use the ensemble to predict the class of a new data point x*:

  ŷ* = sign( Σ_{m=1}^M α_m f_m(x*) ).   (17)

Algorithm 2.3: AdaBoost(X, y, f_1, ..., f_M)
  comment: Initialize weights
    w_i^(1) ← 1/N for all i ∈ {1, ..., N}
  for m ← 1 to M
    comment: Train f_m(x) by minimising J_m = Σ_{i=1}^N w_i^(m) I(f_m(x_i) ≠ y_i)
    comment: Evaluate quantities
      ε_m ← Σ_{i=1}^N w_i^(m) I(f_m(x_i) ≠ y_i) / Σ_{i=1}^N w_i^(m)
      α_m ← ln((1 − ε_m)/ε_m)
    comment: Update data weights
      w_i^(m+1) ← w_i^(m) exp(α_m I(f_m(x_i) ≠ y_i))
  comment: Predict class of new pattern x*
    ŷ* = sign( Σ_{m=1}^M α_m f_m(x*) )

Discussion: AdaBoost is not a classifier per se, but rather a method of combining a number of classification methods. Decision tree stumps are the traditional choice for the base classifiers, but AdaBoost can use any classification method as a base classifier, and for some applications there are more appropriate base classifiers than decision trees (such as the Haar classifiers in the problem of face detection).

As mentioned above, we can expect the ensemble to converge to a strong classifier, even if the individual base classifiers are only slightly better than random guessing. However, the choice of base classifiers has an effect on the speed of convergence. In practice, we can make use of prior knowledge about the structure of a data set to help us choose the weak learners.

Boosting generally shows resistance to over-fitting. However, it is affected by misclassified data points that are far away from the decision boundary (i.e. outliers). These outliers tend to create "spikes" in the decision boundary. The algorithm's steps can be seen in Algorithm 2.3. For a more detailed derivation and discussion, along with a multi-class extension, see [12] [13].

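A compact sketch of AdaBoost with decision stumps as weak learners, following (12)-(17). It is illustrative only (brute-force stump search, no multi-class extension), and the helper names are ours.

```python
import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: threshold on one feature (labels y in {-1, +1})."""
    best = (np.inf, 0, 0.0, 1)                      # (weighted error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, X):
    _, j, thr, polarity = stump
    return np.where(polarity * (X[:, j] - thr) >= 0, 1, -1)

def adaboost_train(X, y, M=50):
    """AdaBoost (eqs. 12-16) with decision stumps as the weak learners."""
    N = X.shape[0]
    w = np.full(N, 1.0 / N)                         # eq. (12): uniform initial weights
    ensemble = []
    for _ in range(M):
        stump = train_stump(X, y, w)
        pred = stump_predict(stump, X)
        eps = np.clip(np.sum(w[pred != y]) / np.sum(w), 1e-10, 1 - 1e-10)   # eq. (14)
        alpha = np.log((1 - eps) / eps)             # eq. (15)
        w = w * np.exp(alpha * (pred != y))         # eq. (16)
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(alpha * stump_predict(s, X) for alpha, s in ensemble)
    return np.sign(score)                           # eq. (17)
```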
2.4. Random Forests

Random Forests [14] is another ensemble method. The idea is to train a multitude of decision trees, each with some degree of randomization. To classify new data points, we let each tree make a prediction and then average their output to form an ensemble prediction. We randomize by choosing a random subset of our training data for the training of each tree. Furthermore, when performing the optimization in the training phase, we only optimize over a random subset of the available parameters. The reason for this randomization is to ensure that the trees grow differently from one another, increasing the confidence of our final prediction.

A tree is a hierarchical structure consisting of nodes connected by edges. Nodes are divided into internal (or split) nodes and terminal nodes (or leafs). Each node has exactly one incoming edge, except the uppermost node, called the root, which has none. We will focus on binary trees, which have exactly two outgoing edges.

Classification: Let T be our forest size, i.e. the total number of trees grown, and index the trees using t ∈ {1, 2, ..., T}. We can use a tree t to classify a new data point x*. We start by placing x* at the root. Each internal node S_m (including the root) of the tree has a predefined test function φ_m(x). In our model, the test function will simply be one of the features of x:

  φ_m(x*) = x*_{j_m},   (18)

where j_m is optimal in some sense and determined during the training phase. Depending on whether this feature is above or below a certain threshold τ_m, we send x* through either the left or right edge, to the next node. We push x* through the tree in this manner until we reach a leaf S_L. The leafs contain the conditional distribution of the classes, p_t(y* = c | x* ∈ S_L). If we used a single tree for classification, we would simply predict x* to be in the class for which this probability is the greatest. The Random Forests method pushes x* through all the trees simultaneously and predicts the class that maximizes the average of the posterior probabilities:

  ŷ* = arg max_c ( (1/T) Σ_{t=1}^T p_t(y* = c | x* ∈ S_L) ).   (19)

Training: The training phase is in charge of finding the best feature j_m to split on at each node m of each tree t, along with the optimal threshold τ_m for it. The Random Forests algorithm does this by maximizing the expected information gain (or entropy reduction), denoted by IG_m(j_m, τ_m) as defined in (22). To formalize the information gain, we need to define further variables. Given a particular choice of parameters, let S_m^(R) and S_m^(L) be the subsets of data points in node m that are sent through the right edge and left edge, respectively. We define the Shannon entropy of a node S to be

  H(S) = − Σ_{c=1}^K p(y = c | x ∈ S) log( p(y = c | x ∈ S) ),   (20)

where we use the empirical distribution of the classes in the node to approximate the probability:

  p(y = c | x ∈ S) = (1/|S|) Σ_{i=1}^{|S|} I(y_i = c).   (21)

We can then define the expected information gain associated with a particular choice (j_m, τ_m) of parameters for node m to be

  IG_m(j_m, τ_m) = H(S_m) − Σ_{r ∈ {R,L}} (|S_m^(r)| / |S_m|) H(S_m^(r)).   (22)

Each tree is grown independently of all the other trees, and we inject randomness into the training process of each tree. The randomization of each tree is two-fold. Instead of using all N available training examples to train the tree, we randomly select αN data points from the training set with replacement (this is called bootstrapping). α is one of the parameters of the model; it is a positive number usually set to be less than 1. When maximizing the information gain criterion to choose the optimal parameters for a node, we do not consider all possible choices of features. Instead, we maximize IG(j_m, τ_m) only over a random subset of features j_m, of size M.

Finally, we have to decide when to stop growing a tree. If a tree is grown too deeply, it will overfit the data and not generalize well. If it is not deep enough, it will not give very strong predictions. Standard stopping criteria allow a tree to grow until a node contains data points of only one class (a pure node) or until the number of data points in a node is below a certain threshold.

Discussion: Random Forests are very powerful models. Each individual tree may not be a very strong classifier by itself. However, by randomizing the training of the trees and combining their output, we can obtain a very accurate and flexible classifier. One advantage of this technique is that the trees can be grown independently from one another. This allows us to improve the computational time of the forest by growing the trees in parallel over multiple cores. Furthermore, Random Forests give us a measure of the relative importance of features: if a feature is in the first few layers in many decision trees, it is likely to separate the classes well. We also store the distribution of the classes in all the leafs, giving us a measure for the confidence of our predictions. A potential disadvantage is that trees are prone to overfit the training data if we grow them too deeply. The algorithm's steps can be seen in Algorithm 2.4. For a more detailed discussion and explanation, see [14] [15].

Algorithm 2.4: Random Forest(X, y, T, α, M)
  comment: Training Phase
  for t ← 1 to T
    Draw a bootstrap sample of size αN
    comment: Grow tree starting at root
    m ← 0
    while (stopping criterion is not met)
      Select M features randomly
      Optimize over these features: max_{j_m, τ_m} [ IG_m(j_m, τ_m) ]
      Grow two daughter nodes.
      m ← m + 1
  comment: Classification
    Send x* through all trees.
    Find p_t(y* = c | x* ∈ S_L) for all t and c, where S_L are the leafs.
    ŷ* = arg max_c (1/T) Σ_{t=1}^T p_t(y* = c | x* ∈ S_L)

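As a small illustration of the split criterion, the following sketch computes the node entropy (20)-(21) and the expected information gain (22) for one candidate split; it is not a full Random Forest implementation, and the function names are ours.

```python
import numpy as np

def entropy(labels, K):
    """Shannon entropy of a node (eq. 20), using empirical class proportions (eq. 21)."""
    p = np.bincount(labels, minlength=K + 1)[1:] / max(len(labels), 1)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def information_gain(X, y, j, tau, K):
    """Expected information gain (eq. 22) of splitting node data (X, y) on feature j at threshold tau."""
    left, right = y[X[:, j] < tau], y[X[:, j] >= tau]
    H_parent = entropy(y, K)
    H_children = sum(len(part) / len(y) * entropy(part, K) for part in (left, right))
    return H_parent - H_children
```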
2.5. Naive Bayes

Naive Bayes [16] is a simple probabilistic classification method. The algorithm estimates the class-conditional distributions of each input feature. If different classes show different distributions of the features, Naive Bayes will be able to use that information to separate the classes.

To explain the intuition behind the Naive Bayes algorithm we adopt a different framework than before, namely we assume that our input features are binary, x_{i,j} ∈ {0, 1}. For example, x_{i,j} could summarize the presence of certain chemicals in sample i, i.e. x_{i,j} = 1 if chemical j is present in sample i and 0 otherwise. There exist modifications to handle continuous input data, for example by assuming the features are normally distributed.

The model has two sets of parameters:
1. θ_{j,c} = p(x_{i,j} = 1 | y_i = c) is the probability of feature j being 1 given that the sample is in class c;
2. π_c = p(y_i = c) are the prior class probabilities.

The key assumption for this model is that the features are conditionally independent given the class labels (hence "Naive"):

  p(x_i | θ, y_i = c) = Π_{j=1}^D p(x_{i,j} | θ, y_i = c).   (23)

Ideally, different classes show different feature distributions, allowing us to find a separation between them using the Naive Bayes method. Using Bayes' Theorem, we get

  p(y | x, θ, π) ∝ p(y | π) × p(x | y, θ),   (24)

where

  p(y | π) ∝ Π_{i=1}^N Π_{c=1}^K π_c^{I(y_i=c)}   (25)

and

  p(x_{i,j} | y_i = c, θ) ∝ θ_{j,c}^{I(y_i=c) I(x_{i,j}=1)} × (1 − θ_{j,c})^{I(y_i=c) I(x_{i,j}=0)}.   (26)

The full model is then

  p(y | x, θ, π) ∝ Π_{i=1}^N Π_{c=1}^K ( π_c^{I(y_i=c)} Π_{j=1}^D θ_{j,c}^{I(y_i=c) I(x_{i,j}=1)} (1 − θ_{j,c})^{I(y_i=c) I(x_{i,j}=0)} ).   (27)

Training: Training of the Naive Bayes classifier involves estimation of the parameters {θ_{j,c}, π_c} of the model. In order to do that, we define N_c to be the total number of training data points in class c and N_{j,c} to be the number of training data points in class c for which feature j was equal to 1. That is,

  N_c = Σ_{i=1}^N I(y_i = c),   N_{j,c} = Σ_{i=1}^N I(y_i = c) I(x_{i,j} = 1).   (28)

Intuitively, we would then expect

  π̂_c = N_c / N,   θ̂_{j,c} = N_{j,c} / N_c   (29)

to be the best estimates of the parameters, and it can indeed be shown that these give the maximum likelihood.

Algorithm 2.5: Naive Bayes(X, y)
  comment: Training Phase
    comment: Initialize count parameters
    N_c ← 0, N_{j,c} ← 0
    for i ← 1 to N
      c ← y_i
      N_c ← N_c + 1
      for j ← 1 to D
        if x_{i,j} = 1 then N_{j,c} ← N_{j,c} + 1
    π̂_c ← N_c / N, θ̂_{j,c} ← N_{j,c} / N_c
  comment: Predict class of new pattern x*
    for c ← 1 to K evaluate
      p(y* = c | x*) ∝ π̂_c Π_{j=1}^D θ̂_{j,c}^{I(x*_j=1)} (1 − θ̂_{j,c})^{I(x*_j=0)}
    Normalize probabilities.
    ŷ* = arg max_c p(y* = c | x*)

Classification: Once we have these estimates, we can make predictions on the posterior class probabilities of previously unseen data points. Given a new data point x*, the probability that x* is in class c is given by

  p(y* = c | x*) ∝ π̂_c Π_{j=1}^D θ̂_{j,c}^{I(x*_j=1)} (1 − θ̂_{j,c})^{I(x*_j=0)}.

We evaluate this expression (up to proportionality) for all classes and then normalize the probabilities so that

  Σ_{c ∈ {1,...,K}} p(y* = c | x*) = 1.   (30)

Finally, we choose the class y* = c for which the probability p(y* = c | x*) is largest.

Discussion: Even though conditionally independent features is a stark assumption for most settings, the Naive Bayes classifier tends to give relatively accurate predictions in practice. The algorithm's steps can be seen in Algorithm 2.5. The training phase is computationally very efficient as it essentially boils down to bookkeeping. Alternative classifiers have been proposed that allow some degree of interdependence between features [17].

In practice, a major danger when working with this classification method is that prediction involves the multiplication of many small numbers. This means that rounding errors have a big effect on predictions, and it is not uncommon for underflow to occur. One way to deal with this problem is to make use of the log-sum-exp trick. Algorithm 2.6 explains how to implement Naive Bayes predictions using the log-sum-exp trick.

Algorithm 2.6: NB prediction(x*, π̂, θ̂)
  for c ← 1 to K
    L_c = log π̂_c
    for j ← 1 to D
      if x*_j = 1 then L_c ← L_c + log θ̂_{j,c}
      else L_c ← L_c + log(1 − θ̂_{j,c})
  p_c = exp( L_c − log( Σ_{c'=1}^K exp L_{c'} ) )
  ŷ* = arg max_c (p_c)

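The sketch below mirrors Algorithms 2.5 and 2.6 for binary features. The Laplace smoothing term `alpha` is our addition (not part of the text above) to avoid taking the logarithm of zero.

```python
import numpy as np

def nb_train(X, y, K, alpha=1.0):
    """Bernoulli Naive Bayes counts and estimates (eqs. 28-29); X is binary, labels 1..K.
    alpha is a small Laplace smoothing constant (our addition)."""
    N, D = X.shape
    Nc = np.array([np.sum(y == c) for c in range(1, K + 1)])
    Njc = np.array([X[y == c].sum(axis=0) for c in range(1, K + 1)])    # K x D counts
    pi = Nc / N
    theta = (Njc + alpha) / (Nc[:, None] + 2 * alpha)
    return pi, theta

def nb_predict(pi, theta, x_new):
    """Posterior class probabilities via the log-sum-exp trick (Algorithm 2.6)."""
    L = np.log(pi) + (x_new * np.log(theta) + (1 - x_new) * np.log(1 - theta)).sum(axis=1)
    L -= L.max()                                    # stabilise before exponentiating
    p = np.exp(L) / np.exp(L).sum()
    return np.argmax(p) + 1, p
```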
2.6. Logistic Regression

Logistic regression is another probabilistic approach to classification. We will describe the method in a 2-class framework with each class label y_i ∈ {0, 1} (see Section 2.2). Furthermore, we need to include a bias in our input data points, i.e. we introduce a dummy feature x_{i,0} = 1 for each data point x_i.

Classification: To classify a new data point x*, the logistic regression technique attaches a weight w_j to each feature x*_j (where j ∈ {0, 1, ..., D}). It then passes the weighted sum Σ_{j=0}^D w_j x*_j = w^T x* to the sigmoid function, given by

  g(z) = 1 / (1 + exp(−z)).   (31)

We interpret the output of the function as the probability that x* is in class 1:

  p(y* = 1 | w) = g(w^T x*).   (32)

Finally, we can classify x* by setting a threshold τ on the probability and predicting

  ŷ* = 1 if g(w^T x*) ≥ τ,   ŷ* = 0 if g(w^T x*) < τ.   (33)

Typically, we choose τ = 0.5 in order to make predictions symmetric. The sigmoid has the property that it is naturally constrained to lie in the interval (0, 1). This ensures that our interpretation of its output as a probability does not break down, as was the case with LDA.

We can easily compute the decision boundary of our model: g(w^T x) = 0.5 is equivalent to w^T x = 0, so the decision boundary is given by w^T x = 0. For a general threshold τ, the boundary is given by w^T x* = g^{-1}(τ).

We can extend the method to K > 2 classes by using a one-versus-the-rest classifier. We train a total of K − 1 two-class classifiers, each of which solves the problem of separating points belonging to one particular class from points not in that class. The final prediction, ŷ*, is made by comparing the output of all the classifiers and choosing the class corresponding to the highest probability.

Training: The cost function can be derived using maximum likelihood estimation of the weights w. It can be shown that this is convex and local-optima free. Intuitively, the cost function heavily penalizes high-confidence predictions of the wrong class. The training phase is in charge of finding the weights w of the model. This is done by minimizing

  J(w) = −(1/N) Σ_{i=1}^N [ y_i log( g(w^T x_i) ) + (1 − y_i) log( 1 − g(w^T x_i) ) ].   (34)

Noting that for the sigmoid function g(z) we have

  dg/dz = g(z)(1 − g(z)),   (35)

we can differentiate J(w) with respect to w to get

  ∂J/∂w = −(1/N) Σ_{i=1}^N [ y_i (1 − g(w^T x_i)) − (1 − y_i) g(w^T x_i) ] x_i = (1/N) Σ_{i=1}^N ( g(w^T x_i) − y_i ) x_i.   (36)

There is no closed-form solution to this minimization problem, thus gradient descent or Newton's method for finding minima is required. The steps of the algorithm are summarized in Algorithm 2.7.

Algorithm 2.7: Logistic Regression(X, y, τ)
  comment: Training Phase:
    w = arg min_w J(w)
  comment: Classify new data point x*
    if (1 + exp(−w^T x*))^{-1} ≥ τ then ŷ* = 1, else ŷ* = 0
  comment: Confidence of estimate
    p(y* = 1 | w) = (1 + exp(−w^T x*))^{-1}

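A minimal gradient-descent sketch of the training and prediction steps in Algorithm 2.7, using the gradient (36); the learning rate and iteration count are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logreg_train(X, y, lr=0.1, n_iters=5000):
    """Minimise the cross-entropy cost (eq. 34) by batch gradient descent (labels y in {0, 1})."""
    N, D = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])            # bias feature x_{i,0} = 1
    w = np.zeros(D + 1)
    for _ in range(n_iters):
        p = sigmoid(Xb @ w)
        grad = Xb.T @ (p - y) / N                   # gradient from eq. (36)
        w -= lr * grad
    return w

def logreg_predict(w, x_new, tau=0.5):
    p = sigmoid(w @ np.concatenate([[1.0], x_new]))
    return (1 if p >= tau else 0), p
```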
2.7. K-Nearest Neighbours

K-Nearest-Neighbours (K-NN) [18] is a non-parametric approach to density estimation. This means that we do not make any assumptions about the functional form of the distribution to be estimated. The method can be extended to the problem of classification. To classify a new data point x*, K-NN identifies the K points from the training set that are closest to x* and then assigns it to the class which has the most points in this set.

Training: The training phase of the K-NN algorithm only involves storing the training data in feature space along with the corresponding class labels.

Classification: Before we make any classifications, we have to decide on the parameter K for the model. Once we have chosen a specific value for K, we can feed the algorithm a new data point x*. It identifies the K training examples in feature space that are closest to it and then looks for the class c that occurs most frequently among those points. Finally, it predicts x* to be in class c. If there is a tie, it can be broken at random or by assigning the class of the nearest point among the tied groups to x*. When dealing with continuous input data, we generally use the Euclidean metric. However, the method works with any valid metric, and the optimal choice will depend on the specific data set.

Discussion: The K-NN approach requires the entire data set to be stored in working memory. Computing the distances to the training examples will become computationally expensive for large data sets. One way to deal with this problem is to use an appropriate nearest neighbour search algorithm, which finds the nearest neighbours by only evaluating the distance metric over a subset of all training examples.

Another disadvantage of K-NN is that in the case of skewed class distributions, the algorithm tends to bias predictions in favour of the classes that are overrepresented. We can mitigate this drawback by giving each training example a weight that is inversely proportional to its distance from the new data point.

The constant K determines the degree of smoothing in the classification process. A small value of K will lead to many small regions of each class, while a large value of K produces fewer, larger regions. This reduces the effect of noise in the data but may also lead to overseeing certain structures in the data. The algorithm's steps can be seen in Algorithm 2.8.

Algorithm 2.8: K-NN(X, y, K)
  Store the labelled training examples (X, y).
  comment: Classify a new pattern x*:
    for i ← 1 to N
      Evaluate d_i = ||x* − x_i||
    Find i_1, ..., i_K which minimize Σ_{j=1}^K d_{i_j}.
    ŷ* = mode{ y_{i_1}, ..., y_{i_K} }.

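A short sketch of the classification step of Algorithm 2.8 with the Euclidean metric and equal vote weights; the distance-weighted voting mentioned in the discussion is omitted for brevity.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, K=5):
    """K-NN: majority vote among the K training points closest to x* (Euclidean metric)."""
    d = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(d)[:K]
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]
```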
2.8. Support Vector Machine

The Support Vector Machine (SVM) [19] is a sparse kernel classifier. That is, it transforms the training data points into another domain (using a specific kernel with basis functions) and from that domain chooses only a few basis functions to create a classifier. In mathematical terms, the model created is defined by the following function:

  f(x; w) = Σ_{i=1}^N w_i φ_i(x) = w^T φ(x) + b,   (37)

where the output is a linear combination of the selected basis functions and b is a bias variable.

Training: The SVM utilizes a sparse vector of weights which effectively chooses the basis functions to be included in the model. This is done by attempting to minimize a measure of the error on the training set and at the same time maximize the margin between the classes. The margin is defined as the perpendicular distance between the hyperplane of the classifier and the closest of the data points of any class. The maximum margin solution is found by solving

  arg max_{w,b} { (1/||w||) min_n [ y_n (w^T φ(x_n) + b) ] },   (38)

where y_n is the corresponding class of the data point. This is rearranged into the following Lagrangian function with the introduction of N Lagrange multipliers [19]:

  L(w, b, a) = (1/2) ||w||^2 − Σ_{n=1}^N a_n { y_n (w^T φ(x_n) + b) − 1 }.   (39)

Setting the derivatives of L(w, b, a) with respect to w and b to zero and substituting back into the Lagrangian function, the dual representation of the maximum margin problem is formed, which is solved only with respect to a. This takes the form of a quadratic programming problem and can be solved using various solvers. From their solution, a is obtained, where the majority of the elements are zero. The non-zero elements of a are called the support vectors and correspond to the data points that lie on the maximum margin of the hyperplane.

Classification: In order to classify a new data point x*, the sign of the following function is evaluated:

  f(x*) = Σ_{n=1}^N a_n y_n φ_n(x*) + b.   (40)

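Since solving the quadratic programme (38)-(39) by hand is beyond the scope of a short sketch, the example below simply delegates to an off-the-shelf solver. It assumes scikit-learn is available, whereas the experiments in this paper used MATLAB toolboxes; the kernel and regularization constant shown are illustrative defaults.

```python
# Illustrative only: assumes scikit-learn; the paper's experiments used MATLAB.
from sklearn.svm import SVC

def svm_train(X_train, y_train, C=1.0, kernel="rbf"):
    """Maximum-margin classifier (eq. 38) fitted by a QP-based library implementation."""
    model = SVC(C=C, kernel=kernel)
    model.fit(X_train, y_train)
    return model

def svm_predict(model, X_new):
    return model.predict(X_new)
```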
2.9. Relevance Vector Machine

The Relevance Vector Machine (RVM) [20] [21] was proposed as a model of identical functional form to the SVM but with superior characteristics. It exploits a probabilistic Bayesian learning framework that provides the level of trustworthiness of each classified point. That is, given a data set {x_n, y_n}_{n=1}^N, we have the identical model to the SVM, without the bias:

  f(x; w) = Σ_{i=1}^N w_i φ_i(x) = w^T φ(x).   (41)

Training: For two-class classification, the goal is to predict the posterior probability of a new data point x* belonging to a particular class. This is done by formulating a likelihood for our model given the training data and a prior distribution given some knowledge about our data. The likelihood is given by the following expression:

  P(y | w) = Π_{n=1}^N σ{f(x_n; w)}^{y_n} [1 − σ{f(x_n; w)}]^{1−y_n}.   (42)

The likelihood can be used to find the parameters that best fit the data. In order to do so, the expression is maximized given the appropriate values for the parameters, which has the same effect as finding the least-squares fit for the data. However, with as many parameters in the model as training points, it could lead to severe over-fitting and would probably not provide accurate estimation for new data points.

In this context, imposing a constraint on the parameters would be advantageous, so as to force our model to be described by only a portion of the available basis functions, resulting in a sparse vector of weights w. Thus, a sparse prior distribution over w is necessary. Using the likelihood and the prior, the posterior distribution (i.e. after observing the training data) is obtained.

Here, the choice of the prior over w could vary, but for the RVM [20] it was chosen as a zero-mean Gaussian prior distribution given by

  p(w | α) = Π_{i=0}^{N−1} N(w_i | 0, α_i^{-1}),   (43)

where α are the hyperparameters, which in turn are described by the following distribution:

  p(α | a, b) = Π_{i=0}^{N−1} Gamma(α_i | a, b),   (44)

where

  Gamma(α | a, b) = Γ(a)^{-1} b^a α^{a−1} e^{−bα}   (45)

is the Gamma distribution, with Γ(a) = ∫_0^∞ t^{a−1} e^{−t} dt the gamma function. To make the hyper-priors flat, the parameters a and b are set to very small values.

Using an approximation procedure [20], the posterior probability distribution p(w | y, α) is given by a multivariate Gaussian with covariance and mean

  Σ = (Φ^T B Φ + A)^{-1},   (46)
  µ = Σ Φ^T B y,   (47)

where A = diag(α_1, α_2, ..., α_N) and B = diag(β_1, β_2, ..., β_N) with β_n = σ{f(x_n)} [1 − σ{f(x_n)}].

Classification: These can then be used to classify unseen data points using

  y* = w^T φ(x*),   (48)

where w = µ and φ(x*) contains the value of all included basis functions at the required location. Furthermore, the confidence of the estimate is given by

  σ*^2 = φ(x*)^T Σ φ(x*).   (49)

For large data sets, inversion of large matrices is required. In order to overcome this, a sequential algorithm was proposed [22] [23].
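The sketch below evaluates the Gaussian posterior approximation (46)-(47) for fixed hyperparameters α and current weights w, exactly as stated in the text; the surrounding iterative re-estimation of α (and the sequential algorithm of [22] [23]) is omitted, and the function name is ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rvm_posterior(Phi, y, w, alpha):
    """One evaluation of the posterior covariance and mean (eqs. 46-47)
    for a fixed hyperparameter vector alpha and current weight vector w."""
    f = Phi @ w
    beta = sigmoid(f) * (1.0 - sigmoid(f))          # B = diag(beta_n)
    A = np.diag(alpha)
    Sigma = np.linalg.inv(Phi.T @ (beta[:, None] * Phi) + A)   # eq. (46)
    mu = Sigma @ Phi.T @ (beta * y)                 # eq. (47)
    return mu, Sigma
```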
2.10. Neural Networks

An alternative to sparse kernel machines is to fix the number of basis functions to be used in the model but allow them to be adaptive in the training phase. The idea behind neural networks [24] is to utilize multi-layered logistic regression classifiers with many decision units on each layer to provide a more sophisticated decision boundary. Recall the model used for classification:

  y(x, w) = f( Σ_{j=1}^M w_j φ_j(x) ),   (50)

where f(·) is a nonlinear function that is zero or one depending on the class of the given data point.

Training: In the framework of neural networks, the basis functions φ_j(x) are parametric and their parameters are chosen during training, as well as the weights w_j. Each basis function is itself a linear combination of the input data, where the coefficients in that linear combination are adaptive parameters. Thus, we construct M linear combinations of the input variables x_1, ..., x_D, where M is the number of basis functions to be used and D is the dimensionality of the input data. These linear combinations are given by

  α_j = Σ_{i=1}^D w_{ji}^(1) x_i + w_{j0}^(1),   (51)

where j = 1, ..., M and the superscript indicates the corresponding layer of the network. The α_j are called the activations of each neuron and are transformed using an activation function h(·) such that

  z_j = h(α_j).   (52)

The choice of h(·) depends on the data, but very frequently the logistic sigmoid function σ(·) is used, as in the logistic regression discussed before. These z_j are the outputs of the hidden units of the first layer and can subsequently be used as inputs to the second layer:

  α_k = Σ_{j=1}^M w_{kj}^(2) z_j + w_{k0}^(2),   (53)

where k = 1, ..., K. We can combine all the layers together to acquire the overall network function:

  y_k(x, w) = σ( Σ_{j=1}^M w_{kj}^(2) h( Σ_{i=1}^D w_{ji}^(1) x_i + w_{j0}^(1) ) + w_{k0}^(2) ).   (54)

Therefore, the model is a nonlinear function from a set of inputs to a set of outputs controlled by adjustable parameters that are acquired during training. In order to train this network, given a training set {(x_1, t_1), ..., (x_N, t_N)}, we minimize an error function:

  E(w) = (1/2) Σ_{n=1}^N || y(x_n, w) − t_n ||^2.   (55)

Once the weights are obtained, prediction can be done using the overall network function defined above.

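A self-contained sketch of the two-layer network (54) trained by gradient descent on the sum-of-squares error (55). The architecture size, learning rate, initialisation scale and the use of a sigmoid output layer are illustrative assumptions, not choices made in the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_train(X, T, M=10, lr=0.1, n_iters=10000, seed=0):
    """Two-layer network y_k(x, w) = sigma(W2 h(W1 x)) (eq. 54), with h = sigma,
    trained by gradient descent on the sum-of-squares error (eq. 55).
    X is N x D, T is N x K (one-hot targets)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    K = T.shape[1]
    W1 = rng.normal(scale=0.1, size=(D + 1, M))      # first-layer weights incl. bias
    W2 = rng.normal(scale=0.1, size=(M + 1, K))      # second-layer weights incl. bias
    Xb = np.hstack([np.ones((N, 1)), X])
    for _ in range(n_iters):
        A = Xb @ W1                                   # activations alpha_j (eq. 51)
        Z = np.hstack([np.ones((N, 1)), sigmoid(A)])  # hidden outputs z_j (eq. 52) plus bias
        Y = sigmoid(Z @ W2)                           # network outputs (eq. 54)
        delta2 = (Y - T) * Y * (1 - Y)                # output-layer error signal
        delta1 = (delta2 @ W2[1:].T) * Z[:, 1:] * (1 - Z[:, 1:])
        W2 -= lr * Z.T @ delta2 / N
        W1 -= lr * Xb.T @ delta1 / N
    return W1, W2

def nn_predict(W1, W2, x_new):
    z = sigmoid(np.concatenate([[1.0], x_new]) @ W1)
    y = sigmoid(np.concatenate([[1.0], z]) @ W2)
    return int(np.argmax(y)) + 1
```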
M 3.1. Methodology X (2) (2) αk = wk j z j + wk0 (53) First, we split the data sets into training and test sets. j=1 The training set is used for the training of the parame- where k = 1, ..., K. ters of each classification model. The test set consists We can combine all the layers together to acquire the of similar but unseen samples and is used for the vali- overall network function: dation of the performance of the model. We randomly selected 80% of the data for training and used the rest

 M  D   for testing. For three of the algorithms, we had to tweak X X   y (x, w) = σ  w(2)h  w(1) x + w(1) + w(2) (54) additional meta-parameters which are not part of the au- k  k j  ji i j0  k0  j=1 i=1 tomatic training of the models. These are: the optimal 11 Figure 1: Collection of 25 training data point of the digits data set. The size of each input image is 20×20 pixels and each pixel is a component of our feature vector. The value of a feature measures the intensity of the corresponding pixel. number of iterations in AdaBoost, the optimal number of trees in Random Forest and the optimal number of Figure 2: K-NN decision boundary on Linear Data Set neighbours in KNN. For all these meta-parameters, we trained each model with different values and accordingly recorded the test classification error. The meta-parameter that gave the least classification error was chosen. With different val- ues of the meta-parameters, not only the classification performance but also the training time changes. There- fore, for different application, different meta-parameters might be suitable for different system requirements. In our case, classification performance is priority over training time and thus the meta-parameters were chosen with this in mind. Once we have determined the optimal meta- parameters, we trained the model on each data set and computed the resubstitution errors, the test errors and the amount of time spent in the training phase and the testing phase. Errors and speeds of the classifiers on Figure 3: K-NN decision boundary on Non Linear Data Set each test data sets can be seen in Table 1. Table 2 sum- marizes the properties of the classifiers that we inferred from our tests.

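The following sketch illustrates the meta-parameter selection procedure of Section 3.1 (random 80%/20% split, then pick the value with the lowest test classification error), shown here for the number of neighbours in KNN. It assumes scikit-learn, whereas the experiments reported here were run with the MATLAB statistical toolbox.

```python
# Illustrative only: the paper used MATLAB; scikit-learn is assumed here.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def choose_k(X, y, candidate_ks=(1, 3, 5, 7, 9, 15), seed=0):
    """80%/20% random split, then pick the K with the lowest test classification error."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.8, random_state=seed)
    errors = {}
    for k in candidate_ks:
        model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        errors[k] = 1.0 - model.score(X_te, y_te)    # test classification error
    best_k = min(errors, key=errors.get)
    return best_k, errors
```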
3.2. Discussion

From these results, the Random Forest and the K-Nearest Neighbours algorithms showed the greatest potential for further investigation. Both have great classification performance on the simple data sets, they are robust to noise, and they can be trained fast when utilizing parallel computation. In addition, Random Forest is able to provide importance measures for specific features, which makes it ideal for the understanding of key indicators in medical data sets. This is vital for the further investigation of particular regions in the data sets that would allow for even greater automatic clinical diagnosis.

Figure 5: FAIMS setup for ion movement through plates

4. Field Asymmetric Ion Mobility Spectrometry (FAIMS)

Figure 6: Example of output of chemical fingerprint for a sample from As discussed, the detection of airborne gas phase the IBD data using the positive ions. biomarkers that emanate from biological samples like urine and breath can be used for non-invasive diagnos- tics of IBD diseases such as Crohn’s disease and Ul- cerative Colitis. An emerging technology that is able to sense and capture these biomarkers is FAIMS [26] [27]. FAIMS is a recent technology that separates gas molecules to be analysed at atmospheric pressure and room temperature [26]. The sample sensed is ionised and thus decomposed of ions of different types and sizes. These different ions are passed through metal plates that are being applied an asynchronous high voltage waveform. The ionized molecules are subjected to these high electric field and the difference in their movement is used to detect biomarkers [26] [27]. An example of the setup can be seen in Figure 5 from the Owlstone’s whitepaper 1. Further information can be found in [26] Figure 7: Example of output of chemical fingerprint for a sample from the IBD data using the negative ions. [27]. We analysed three separate medical FAIMS data sets supplied by Owlstone. The first data set consisted of Our training data consists of these chemical finger- biological samples from cancer patients who took part prints. For each sample, we get two such fingerprints - in a study at University Hospital Coventry & Warwick- one for the positive ions and one for the negative ions in shire. The second data set consisted of biological sam- the gas. See Figures 6, 7 for an example from the IBD ples from Inflammatory Bowel Disease (IBD) patients Data. from a separate study, also at the University Hospital Coventry & Warwickshire. Lastly, we analysed a data We analysed two versions of this data set. First we set with Methanol in a complex background. analysed the raw data, which consisted of the chemical fingerprints. Each such fingerprint was created by vary- The samples were analysed using Owlstone’s Lon- ing the compensation voltage (CV) in 512 steps from estar chemical detector which uses Field Asymmetric -6V to +6V while varying the dispersion voltage ratio Ion Mobility Spectroscopy (FAIMS) to create a chemi- (DV) from 0% to 100% in steps of 2% and measuring cal ‘fingerprint’ of a gas or odour that is passed through the resulting ion current. In total we get 52224 raw input it. features, each giving the intensity of the measured ion current at a particular location (CV & DV value pair). 1http://www.owlstonenanotech.com/faims Secondly, we used software provided by Owlstone 13 Table 1: Errors and speeds of classification methods on test data

Table 2: Summary of classifier performance

14 to extract peaks in the measured ion current from the We ran the algorithm in a 3-class setting as well as chemical fingerprints. For each peak, we were given in a 2-class setting for both a random 80%-20% split of its location in the fingerprint (CV and DV values) and the data and a balanced split of the data into training and a measure for its height, width and area. We chose to test set. treat each extracted peak as a separate example, giving One advantage of Random Forests is that it allows all the peaks from one sample the same label. measuring the importance of the features. For all fea- We focused on two classification methods as they tures, we measured the increase in prediction error if gave the most promising results on the test data sets. the values of that feature are permuted across the out- We analysed the data using a K-Nearest-Neighbour ap- of-bag observations. This measure is computed for ev- proach to get some fast preliminary results. Then we ery tree, then averaged over the entire ensemble and di- moved on to the more sophisticated Random Forests vided by the standard deviation over the entire ensem- method. In the usual setting, we selected 80% of the ble. This measure allows for the identification of key samples at random to train the model and used the re- regions in the data set that can be used for further inves- maining 20% to test the model. Note that in peak anal- tigation. It allows a more focused investigation on the ysis, we extracted the peaks after splitting our data into key indicators. An example of a heat map of the vari- training and test sets. For all the tests, we aimed to get a able importances will be illustrated later that will show reliable measure of the resubstitution error, the test error its significance in our statistical research and in general and the amount of time it took to respectively train and for the better training of classification models for auto- test our model. matic clinical diagnostics. Tables 5, 6 show the results of our random forests analysis of the cancer data peaks. 4.1. Cancer Data The cancer data set contained a total of 122 samples, each assigned to one of 3 classes: 4.1.2. Discussion • Volunteer → y = 1 (39 samples) From the peak analysis, it can be seen that there is some separation between the classes of interest. Further • Cancer → y = 2 (70 samples) investigation is required to allow for lower test classifi- • Polyps → y = 3 (13 samples) cation errors. Peak extraction illustrated that we could potentially identify interesting key indicators but a more 4.1.1. Peak Analysis detailed and focused study is required. Therefore, for We tested a number of different scenarios. First, we the other data sets, we followed a different approach, ran a number of test on the peaks of the positive ion treating the entire training matrices as one large feature matrix only. Then we focused on the negative ion peaks. vector. We investigated for the existence of a bias in our results due to the fact that one class is overrepresented in our 4.2. Inflammatory Bowel Disease Data data. For that, we made sure that our training set was bal- We splitted the IBD data set into two different cate- anced in that there was an equal number of samples gories, namely the breath samples and the urine sam- from both classes. It is important to note that we calcu- ples. This was done so as to investigate whether differ- lated test and resubstitution errors on a per peak basis. 
ent types of data sets are able to distinguish better the As the test errors varied with the choice of our partic- classes and allow for a more focused future sample ac- ular test set, we trained and tested our model 50 times, quisition. each time using a separate random training and test set, and calculated the mean of the errors. We used the same value for K in each iteration. The results of our KNN 4.2.1. Breath Samples analysis of the peaks of the cancer data can be found in There were a total 119 breath samples belonging to 3 Tables 3, 4. different classes: We analysed the peaks from the positive matrix, the negative matrix and the combined data set with the ran- • Volunteer → y = 1 (42 samples) dom forest technique. To find a good value for the num- ber of random trees to use, we plotted the test and re- • Crohn’s Disease → y = 2 (37 samples) substitution errors as a function of forest size and aimed to choose the point at which the curves began flattening. • Ulcerative Colitis → y = 3 (40 samples) 15 Table 3: KNN results on the positive peaks

Table 4: KNN results on the negative peaks

KNN: As before, we trained our model with differ- liminary classification due to the fact that the training ent choices of training and test data sets. As the test er- speed is very low. rors varied with the choice of our particular test set, we trained and tested our model 50 times, each time using Random Forests: Due to the large training time of a separate random training and test set, and calculated the random forests algorithm and time constraints on the mean of the errors. We used the same value for K in our part, we were not able to estimate a mean test er- each iteration. ror and a measure for the spread of the errors as we did KNN results for the breath samples can be found in in the KNN analysis for this data set. Thus, one setup Table 7. The method had trouble in the 3-class setting was chosen and the results are documented in Table 8. in general. Using weights inversely proportional to the Using 10 trees gave the best results in terms of ac- distance between the neighbours and the test points gave curacy on the test data (62.4%). This is due to the the best performance with a mean test error of 71.3%. fact that the complexity of the model was the lowest We got much better results in the 2-class setting. of the ones chosen and allowed for greater generaliza- Here, the version with weights proportional to the tion. However, we generated a new random split for squared inverse distance gave the best results with a each test. Therefore, the low test error may be simply an mean accuracy of 64.5%. In our tests, the accuracy var- artifact of our choice of training and test data set. The ied between 50.0% and 79.2%. Using equal weights for large difference between the test errors and resubstitu- all neighbours gave us an accuracy as high as 83.3% in tion errors may indicate that there is a high degree of some cases. overfitting and appropriate regularizers could help ob- As it can be seen, the resubstitution error is zero in tain much better results. most cases whereas the test error is much larger. This illustrates that there is a great degree of overfitting on the training data. However, the results show promising 4.2.2. Urine Samples separation in some cases. If combined with appropriate There were a total of 111 urine samples belonging to regularisers, K-NN could allow for a great tool for pre- 3 different classes: 16 Table 5: Random Forest results on the positive peaks

Table 6: Random Forest results on the negative peaks

Table 7: KNN IBD Breath Results

Table 8: Random Forest IBD Breath Results

17 • Volunteer → y = 1 (39 samples)

• Crohn’s Disease → y = 2 (36 samples)

• Ulcerative Colitis → y = 3 (36 samples)

KNN: The KNN results for the urine data set are in Table 9. In the 3-class setting, the algorithm performs much better on the urine data set than on the breath data. This may indicate that there is a smaller degree of noise in the data and the distinction between the Crohn’s Dis- Figure 9: Heatmap for negative ion fingerprint of a sample from the ease samples and the Ulcerative Colitis may be clearer IBD data. compared to the breath samples. The performance in the 2-class setting is very similar to that in the breath data. The negative ion matrix appears to carry more infor- We were able to achieve a test error as low as 18.2%. mation with regards to feature importance as there more areas with high density of red points. Random Forests: The Random Forests results can be found in Table 10. In our tests in the 2-class setting, the algorithm appears to perform slightly worse than on 4.3. Methanol in Complex Background the breath samples. However, this difference may again The methanol samples were taken under strict labora- be because of the particular test set. We see the same tory conditions. We received 42 training examples and large difference between test errors and resubstitution 12 test examples. We analysed the performance of the errors as in the breath samples, indicating that overfit- KNN and Random Forests algorithms in a 3-class set- ting may be problem in this application as well. ting and in a 2-class setting. The samples were classified as 4.2.3. Heat Map In order to get some visual indicator of regions in the • Diseased → y = 0 fingerprint with high discriminative power, we produced a heat map of variable importance. We calculated the • Exposed → y = 1 measure of feature importance provided by the Random Forest algorithm for each CV-DV value pair. We plotted • Normal → y = 2 these values to find regions of interests, with red indi- cating the most important and blue indicating the least For the 2-class analysis, we merged the “Diseased” class important regions. and the “Exposed” class. We had a total of 18 samples from each class. The settings of the Lonestar chemical detector were changed for the methanol data: The num- ber of lines in the fingerprint was reduced from 51 to 15. Unlike before, we used a particular choice for our training and test data sets.

KNN: The results of the KNN analysis of the methanol data can be found in Table 11. The algorithm performed very well in a 3-class setting, achieving a test error of 25.0% on the provided test data. Setting the weights of the K neighbours proportional to the inverse Figure 8: Heatmap for positive ion fingerprint of a sample from the IBD data. of the distance to the test points lead to very low errors when reclassifying the training data (2.4%). Figures 8 and 9 show two heatmaps for a sample from In a 2-class setting, the algorithm achieves a perfect the IBD data set, one for the positive ion matrix and one prediction on the test data, finding the correct class of for the negative ion matrix. These were created using each test example. Using inverse weights also leads to a 100 trees for a random forest. very low resubstitution error. 18 Table 9: KNN IBD urine results


Table 11: KNN Methanol in Complex Background results

Table 12: Random Forest Methanol in Complex Background results

Random Forests: The results of the Random Forests analysis can be found in Table 12. In our tests, the random forests algorithm showed a strong performance. It managed to correctly classify 75.0% of the test data in the 3-class setting, with a resubstitution error of 4.8%. In the 2-class setting, it correctly predicted 83.7% of the test examples. We achieved the best performance with 50 random trees in our forests.
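A small helper like the one below can be used to compare forests of different sizes, mirroring the setups with 10, 50 and 100 trees used in our tests. It is a sketch only: the data arrays and the random seed are placeholders, so the numbers it produces are not those reported in Tables 8 and 12.

```python
from sklearn.ensemble import RandomForestClassifier


def forest_error_table(X_train, y_train, X_test, y_test, tree_counts=(10, 50, 100)):
    """Compare resubstitution and test errors for forests of different sizes."""
    rows = []
    for n_trees in tree_counts:
        forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
        forest.fit(X_train, y_train)
        rows.append((n_trees,
                     1.0 - forest.score(X_train, y_train),  # resubstitution error
                     1.0 - forest.score(X_test, y_test)))   # test error
    return rows
```

A large gap between the resubstitution and test errors in such a table is the overfitting signal discussed above.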

5. Future Work

The Random Forest experiments generally had very low resubstitution errors while the test errors were higher than expected. There is a large amount of overfitting, and we suggest addressing this issue by using a pruning algorithm or regularization.

We kept many of the hyperparameters of the random forest algorithm constant. Optimization of all the hyperparameters could lead to substantial performance improvements. In most tests, we only utilized 80% of the available data to train a classification model. Given more time, we would try a leave-one-out analysis and cross-validation. More training data should in general give better results.

The KNN algorithm gave promising results, given that the underlying method is so simple. Furthermore, the algorithm is computationally inexpensive. Therefore, an AdaBoost model with KNN classifiers as weak learners may be a promising future approach.

Due to time restrictions, we were not able to test and apply a number of the classification methods described in Section 2. In particular, Neural Networks are a very powerful method for the classification of complex patterns.

In the peak analysis, we treated each peak as an independent sample. In order to classify a new sample, one would have to look at the distribution of class predictions over all the peaks in the sample. We ran some rough preliminary tests and noticed that the distribution was highly skewed towards predicting cancer. This may be due to the greater presence of cancer samples in the data set. It may be worth investigating this further.

We also did not attempt to extract the peaks from the IBD and methanol data sets. While we have no particular reason to believe so, it may be possible to find good separation of the classes in the peaks domain. The heat maps from the IBD data showed that the negative ion matrix may be more informative for classification purposes than the positive one. It would be interesting to run tests with the negative ion matrix for the methanol data.

In previous experiments, wavelets were used to preprocess data before classification [2]. This led to some good results and, combined with an algorithm such as the random forest, may improve the method's accuracy.

Another promising approach may be to treat the Lonestar fingerprints as images and use an edge detection algorithm. We would scan each sample to detect edges in the fingerprint, plot a histogram of the orientation of the edges and, finally, attempt classification of the samples using these histograms as feature vectors, which might point to discriminative areas.
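As a rough sketch of this edge-based idea, the function below turns a two-dimensional fingerprint into a histogram of edge orientations using simple finite-difference gradients. The number of bins, the gradient operator and the magnitude threshold are arbitrary illustrative choices that we have not evaluated.

```python
import numpy as np


def edge_orientation_histogram(fingerprint, n_bins=16, threshold=0.1):
    """Summarise a 2-D fingerprint (DV lines x CV points) by its edge orientations."""
    gy, gx = np.gradient(fingerprint.astype(float))   # finite-difference edge responses
    magnitude = np.hypot(gx, gy)
    orientation = np.arctan2(gy, gx)                  # edge direction in radians
    strong = magnitude > threshold * magnitude.max()  # keep only pronounced edges
    hist, _ = np.histogram(orientation[strong], bins=n_bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)                  # normalised feature vector
```

The resulting vectors could then be fed to any of the classifiers discussed above, for example a random forest.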
6. Conclusion

In this paper, we provided an overview of ten different classification algorithms along with their potential advantages and disadvantages. We described the intuition behind their theory and illustrated how they can be used for different data sets. Next, we selected five of these methods and performed tests on a number of artificially generated data sets and one real-world data set. Our aim was to investigate how these algorithms perform in practice.

From these five algorithms, we chose two (K Nearest Neighbours and Random Forests) that performed best on the simple data sets and used them to analyse various data sets from Field Asymmetric Ion Mobility Spectrometry. In particular, we investigated two different studies (Colorectal Cancer and Inflammatory Bowel Disease). Our aim was to find a reliable way to separate the classes of patients within a study using FAIMS. We attempted to separate all three classes as well as just patients from volunteers. Using K Nearest Neighbours and Random Forests, we achieved high prediction accuracy in a number of cases.

We attempted to find regions that provide key indicators between the classes on the chemical fingerprint produced by FAIMS. We did this by creating heat maps using a measure of feature importance from the Random Forests model.

In addition, we studied samples taken from methanol in oil that were produced under laboratory conditions. These samples contained less contamination and noise and were able to give us some further indication of the performance of the classification methods. We achieved perfect predictions using K Nearest Neighbours in some cases.

Random forests also gave very promising results. However, we repeatedly encountered a large difference between resubstitution errors and test errors with this algorithm. This indicates that further optimization of the random forest parameters may lead to substantial performance improvements.

Acknowledgements

The study was funded by Owlstone. Data for the Methanol in Complex Background study was collected by Rebecca O'Donnell and Dr Max Allsworth under Owlstone's LuCID (Lung Cancer Indicator Detection) project, funded by an SBRI Healthcare development contract. Data for the IBD study was collected in a joint project with Owlstone, Warwick University, and University Hospital Coventry and Warwick, utilizing funding from the Technology Strategy Board (TSB).

References

[1] V. L. Patel, E. H. Shortliffe, M. Stefanelli, P. Szolovits, M. R. Berthold, R. Bellazzi, A. Abu-Hanna, The coming of age of artificial intelligence in medicine, Artificial Intelligence in Medicine 46 (1) (2009) 5–17.
[2] R. Arasaradnam, N. O. Ouaret, M. Thomas, E. Hetherington, M. N. Quraishi, C. Nwokolo, K. D. Bardhan, J. Covington, Identification of inflammatory bowel disease (IBD) using field asymmetric ion mobility spectrometry (FAIMS), Gut 61 (Suppl 2) (2012) A70.
[3] J. A. Covington, L. Wedlake, J. Andreyev, N. Ouaret, M. G. Thomas, C. U. Nwokolo, K. D. Bardhan, R. P. Arasaradnam, The detection of patients at risk of gastrointestinal toxicity during pelvic radiotherapy by electronic nose and FAIMS: a pilot study, Sensors 12 (10) (2012) 13002–13018.
[4] J. A. Covington, E. W. Westenbrink, N. Ouaret, R. Harbord, C. Bailey, N. Connell, J. Cullis, N. Williams, C. U. Nwokolo, K. D. Bardhan, R. P. Arasaradnam, Application of a novel tool for diagnosing bile acid diarrhoea, Sensors 13 (9) (2013) 11899–11912.
[5] R. P. Arasaradnam, E. Westenbrink, M. J. McFarlane, R. Harbord, S. Chambers, N. O'Connell, C. Bailey, C. U. Nwokolo, K. D. Bardhan, R. Savage, J. A. Covington, Differentiating coeliac disease from irritable bowel syndrome by urinary volatile organic compound analysis: a pilot study, PLoS ONE 9 (2014) e107312.
[6] R. P. Arasaradnam, M. J. McFarlane, C. Ryan-Fisher, E. Westenbrink, P. Hodges, M. G. Thomas, S. Chambers, N. O'Connell, C. Bailey, C. Harmston, C. U. Nwokolo, K. D. Bardhan, J. A. Covington, Detection of colorectal cancer (CRC) by urinary volatile organic compound analysis, PLoS ONE 9 (2014) e108750.
[7] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (2) (1936) 179–188.
[8] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[9] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press, 1990.
[10] B. Schölkopf, K.-R. Müller, Fisher discriminant analysis with kernels, Neural Networks for Signal Processing IX.
[11] T. Hofmann, B. Schölkopf, A. J. Smola, Kernel methods in machine learning, Annals of Statistics 36.
[12] Y. Freund, R. E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, in: Proceedings of the Second European Conference on Computational Learning Theory, Springer-Verlag, London, UK, 1995, pp. 23–37.
[13] Y. Freund, R. E. Schapire, Experiments with a new boosting algorithm (1996).
[14] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[15] A. Criminisi, J. Shotton, E. Konukoglu, Decision forests for classification, regression, density estimation, manifold learning and semi-supervised learning, Tech. Rep. MSR-TR-2011-114, Microsoft Research (Oct 2011).
[16] R. O. Duda, P. E. Hart, Pattern Classification and Scene Analysis, John Wiley & Sons, 1973.
[17] N. Friedman, D. Geiger, M. Goldszmidt, Bayesian network classifiers (1997).
[18] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theor. 13 (1) (2006) 21–27.
[19] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, ACM Press, 1992, pp. 144–152.
[20] M. E. Tipping, Sparse Bayesian learning and the relevance vector machine, J. Mach. Learn. Res. 1 (2001) 211–244.
[21] M. E. Tipping, The relevance vector machine, in: Advances in Neural Information Processing Systems 12, [NIPS Conference, Denver, Colorado, USA, November 29 - December 4, 1999], 1999, pp. 652–658.
[22] A. C. Faul, M. E. Tipping, Analysis of sparse Bayesian learning, in: Advances in Neural Information Processing Systems 14, MIT Press, 2001, pp. 383–389.
[23] M. E. Tipping, A. Faul, Fast marginal likelihood maximisation for sparse Bayesian models, in: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003, pp. 3–6.
[24] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Inc., New York, NY, USA, 1995.
[25] http://yann.lecun.com/exdb/mnist/.
[26] A. Wilks, M. Hart, A. Koehl, J. Somerville, B. Boyle, D. Ruiz-Alonso, Characterization of a miniature, ultra-high-field, ion mobility spectrometer, International Journal for Ion Mobility Spectrometry 15 (3) (2012) 199–222.
[27] A. A. Shvartsburg, R. D. Smith, A. Wilks, A. Koehl, D. Ruiz-Alonso, B. Boyle, Ultrafast differential ion mobility spectrometry at extreme electric fields in multichannel microchips, Analytical Chemistry 81 (2009) 6489–6495.