Adjusted Probability Naive Bayesian Induction

Geoffrey I. Webb¹ and Michael J. Pazzani²

¹ School of Computing and Mathematics, Deakin University, Geelong, Vic, 3217, Australia.
² Department of Information and Computer Science, University of California, Irvine, Irvine, Ca, 92717, USA.

Abstract. Naive Bayesian classifiers utilise a simple mathematical model for induction. While it is known that the assumptions on which this model is based are frequently violated, the predictive accuracy obtained in discriminate classification tasks is surprisingly competitive in comparison to more complex induction techniques. Adjusted probability naive Bayesian induction adds a simple extension to the naive Bayesian classifier. A numeric weight is inferred for each class. During discriminate classification, the naive Bayesian probability of a class is multiplied by its weight to obtain an adjusted value. The use of this adjusted value in place of the naive Bayesian probability is shown to significantly improve predictive accuracy.

1 Introduction

The naive Bayesian classifier (Duda & Hart, 1973) provides a simple approach to discriminate classification learning that has demonstrated competitive predictive accuracy on a range of learning tasks (Clark & Niblett, 1989; Langley, Iba, & Thompson, 1992). The naive Bayesian classifier is also attractive as it has an explicit and sound theoretical basis which guarantees optimal induction given a set of explicit assumptions. There is a drawback, however, in that it is known that some of these assumptions will be violated in many induction scenarios. In particular, one key assumption that is frequently violated is that the attributes are independent with respect to the class variable. The naive Bayesian classifier has been shown to be remarkably robust in the face of many such violations of its underlying assumptions (Domingos & Pazzani, 1996).
However, further improvements in performance have been demonstrated by a number of approaches, collectively called semi-naive Bayesian classifiers, that seek to adjust the naive Bayesian classifier to remedy violations of its assumptions. Previous semi-naive Bayesian techniques can be broadly classified into two groups: those that manipulate the attributes to be employed prior to application of naive Bayesian induction (Kononenko, 1991; Langley & Sage, 1994; Pazzani, 1996) and those that select subsets of the training examples prior to the application of naive Bayesian classification of an individual case (Kohavi, 1996; Langley, 1993). This paper presents an alternative approach that seeks instead to adjust the probabilities produced by a standard naive Bayesian classifier in order to accommodate violations of the assumptions on which it is founded.

2 Adjusted Probability Semi-Naive Bayesian Induction

The naive Bayesian classifier is used to infer the probability that an object $j$, described by attribute values $A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j}$, belongs to a class $C_i$. It uses Bayes' theorem

$$P(C_i \mid A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j}) = \frac{P(C_i)\, P(A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j} \mid C_i)}{P(A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j})} \tag{1}$$

where $P(C_i \mid A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j})$ is the conditional probability of the class $C_i$ given the object description; $P(C_i)$ is the prior probability of class $C_i$; $P(A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j} \mid C_i)$ is the conditional probability of the object description given the class $C_i$; and $P(A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j})$ is the prior probability of the object description. Based on an assumption of attribute conditional independence, this is estimated using

$$\frac{P(C_i) \prod_k P(A_k = V_{k_j} \mid C_i)}{P(A_1 = V_{1_j} \wedge \ldots \wedge A_n = V_{n_j})}. \tag{2}$$

Each of the probabilities within the numerator of (2) is in turn inferred from the relative frequencies of the corresponding elements in the training data.
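The estimate in (2) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the eight-row dataset and the function name `naive_bayes_estimate` are hypothetical, every probability (including the denominator) is taken as a plain relative frequency with no smoothing, and the queried object is assumed to occur in the data.

```python
from collections import Counter

# Hypothetical toy data (not from the paper): each row is (A, B, C).
# B is a copy of A, so the attributes are not independent given the class.
DATA = [
    (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 1),
    (1, 1, 0), (1, 1, 1), (1, 1, 1), (1, 1, 1),
]


def naive_bayes_estimate(data, obj):
    """Return {class: estimate} following (2) for the attribute tuple `obj`.

    Assumes `obj` occurs at least once in `data`; otherwise the
    denominator is zero and this sketch raises ZeroDivisionError.
    """
    n = len(data)
    class_counts = Counter(row[-1] for row in data)
    # Denominator of (2): relative frequency of the full object description.
    p_obj = sum(1 for row in data if row[:-1] == obj) / n
    estimates = {}
    for c, n_c in class_counts.items():
        numerator = n_c / n  # P(C_i)
        for k, v in enumerate(obj):
            # P(A_k = V_k | C_i) as a relative frequency within class c.
            n_match = sum(1 for row in data if row[-1] == c and row[k] == v)
            numerator *= n_match / n_c
        estimates[c] = numerator / p_obj
    return estimates


estimates = naive_bayes_estimate(DATA, (0, 0))
print(estimates)
```

Because the second attribute duplicates the first, the two estimates need not sum to one, which is one visible symptom of a violated independence assumption.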
Where discriminate prediction of a single class is required, rather than assigning explicit probabilities to each class, the class is chosen with the highest probability, or with the lowest misclassification risk if classes are further differentiated by having associated misclassification costs. In this context, the denominator can be omitted from (2) as it does not affect the relative ordering of the classes.

Many violations of the assumptions that underlie naive Bayesian classifiers will result in systematic distortion of the probabilities that the classifier outputs. For example, take a simple two attribute learning task where the attributes $A$ and $B$ and class $C$ all have domains $\{0, 1\}$, for all objects $A = B$, the probability of each value of each attribute is 0.5, $P(C=0 \mid A=0) = 0.75$, and $P(C=0 \mid A=1) = 0.25$. Given an object $A=0, B=0$, and perfect estimates of all values within (2)³, the inferred probability of class $C=0$ will be 0.5625 and of class $C=1$ will be 0.0625, whereas the true conditional probabilities are 0.75 and 0.25. The reason that the class probability estimates are incorrect is that the two attributes violate the independence assumption. In this simple example, the systematic distortion in estimated class probabilities could be corrected by taking the square root of all naive Bayesian class probability estimates. It is clear that in many cases there will exist functions from the naive Bayesian estimates to the true conditional class probabilities. However, the nature of these functions will vary depending upon the type and complexity of the violations of the assumptions of the naive Bayesian approach.

³ $P(C=0) = 0.5$, $P(A=0 \mid C=0) = 0.75$, $P(B=0 \mid C=0) = 0.75$, $P(C=1) = 0.5$, $P(A=0 \mid C=1) = 0.25$, $P(B=0 \mid C=1) = 0.25$, and $P(A=0 \wedge B=0) = 0.5$.

Where a single discrete class prediction is required rather than probabilistic class prediction, it is not even necessary to derive correct class probabilities.
Rather, all that is required is to derive values for each class probability such that the most probable class or class with the lowest misclassification risk has the highest value. In the two class case, if it is assumed that the inferred values are monotonic with respect to the correct probabilities, all that is required is identification of the inferred value at which the true probability or misclassification risk of one class exceeds that of the other.

For example, Domingos & Pazzani (1996) show that the naive Bayesian classifier makes systematic errors on some m-of-n concepts. To illustrate their analysis of this problem, assume that the naive Bayesian classifier is trained with all $2^6$ examples of an at-least-2-of-6 concept. This is a classification task for which the class $C$ equals 1 when any two or more of the six binary attributes equal 1. In this case, $P(C=1) = 57/64$, $P(C=0) = 7/64$, $P(A_k=1 \mid C=1) = 31/57$, $P(A_k=0 \mid C=1) = 26/57$, $P(A_k=1 \mid C=0) = 1/7$ and $P(A_k=0 \mid C=0) = 6/7$. Therefore, the naive Bayesian classifier will classify as positive an example for which $i$ attributes equal 1 if

$$\frac{57}{64}\left(\frac{31}{57}\right)^{i}\left(\frac{26}{57}\right)^{6-i} > \frac{7}{64}\left(\frac{1}{7}\right)^{i}\left(\frac{6}{7}\right)^{6-i}. \tag{3}$$

However, this condition is false only for $i = 0$, while the at-least-2-of-6 concept is false for $i < 2$. Note however that both the terms in (3) are monotonic with respect to $i$, the left-hand side increasing while the right-hand side decreases as $i$ increases. Therefore, by multiplying the left-hand side of (3) by a constant adjustment factor $a$ with $0.106 < a < 0.758$, we have a function of $i$ that perfectly discriminates positive from negative examples⁴. Care must be taken to avoid using this as a probability estimate, but this additional degree of freedom will allow the naive Bayesian classifier to discriminate well on a broader class of problems. For multi-class problems, and problems where the inferred probabilities are not monotonic with respect to the true probabilities, more complex adjustments are required.
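The at-least-2-of-6 example can be checked by enumerating all 64 examples and evaluating both sides of (3) with exact fractions. The sketch below is illustrative only: the helper names (`prior`, `cond`, `score`, `predict`) and the trial factor a = 1/2 are assumptions chosen for this check, not part of the paper. The bounds it prints can be compared with the interval quoted above.

```python
from fractions import Fraction
from itertools import product

N = 6  # number of binary attributes
M = 2  # concept: class 1 when at least M attributes equal 1

examples = list(product((0, 1), repeat=N))
labels = [int(sum(x) >= M) for x in examples]


def prior(c):
    """P(C = c) as a relative frequency over all 2^N examples."""
    return Fraction(labels.count(c), len(labels))


def cond(v, c):
    """P(A_k = v | C = c); identical for every k by symmetry, so k = 0 is used."""
    in_class = [x for x, y in zip(examples, labels) if y == c]
    return Fraction(sum(1 for x in in_class if x[0] == v), len(in_class))


def score(c, i):
    """One side of (3): class-c value when i attributes equal 1."""
    return prior(c) * cond(1, c) ** i * cond(0, c) ** (N - i)


def predict(i, a=Fraction(1)):
    """Predict class 1 when a times the class-1 side of (3) exceeds the class-0 side."""
    return int(a * score(1, i) > score(0, i))


# Predictions for i = 0..6 without adjustment and with the trial factor 1/2.
unadjusted = [predict(i) for i in range(N + 1)]
adjusted = [predict(i, Fraction(1, 2)) for i in range(N + 1)]
target = [int(i >= M) for i in range(N + 1)]

# Interval of factors that classify both i = 1 and i = 2 correctly.
lower = score(0, 2) / score(1, 2)
upper = score(0, 1) / score(1, 1)

print("unadjusted:", unadjusted)
print("adjusted:  ", adjusted)
print("target:    ", target)
print("bounds on a:", float(lower), float(upper))
```

Exact rational arithmetic is used so that the comparisons in (3) are not affected by floating-point rounding; the bounds are converted to floats only for display.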
This paper presents an approach that attempts to identify and apply linear adjustments to the class probabilities. To this end, an adjustment factor is associated with each class, and the inferred probability for a class is multiplied by the corresponding factor. While it is acknowledged that such simple linear adjustments will not capture the finer detail of the distortions in inferred probabilities in all domains, it is expected that they will frequently assist in assigning more useful probabilities in contexts where discrete single class prediction is required. They enable the probability for a class to be boosted above that of the other classes, enabling correct class selection irrespective of accurate probability assignment. The general approach of inferring a function to adjust the class probabilities obtained through naive Bayesian induction will be referred to as adjusted probability naive Bayesian classification (APNBC). This paper restricts itself to considering simple linear adjustments to the inferred probabilities, although it is noted that any other class of functions could be considered in place of simple linear adjustments.

⁴ The lower limit on $a$ is the lowest value at which (3) is true for $i = 2$. The upper limit is the highest value at which (3) is false for $i = 1$.
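The classification step of this scheme, multiplying each class's naive Bayesian value by its adjustment factor and choosing the largest result, might look as follows. This is only a sketch: the function name `adjusted_classify`, the class labels, and the numeric scores and weights are made up for illustration, and the weights are taken as given because this section does not describe how they are inferred.

```python
def adjusted_classify(nb_scores, weights):
    """Pick the class with the highest weighted naive Bayesian value.

    nb_scores: {class: naive Bayesian value for that class}
    weights:   {class: adjustment factor}; classes not listed default to 1.0
    Returns (chosen_class, {class: adjusted value}).
    """
    adjusted = {c: weights.get(c, 1.0) * s for c, s in nb_scores.items()}
    return max(adjusted, key=adjusted.get), adjusted


# Hypothetical values: "pos" narrowly wins without adjustment.
scores = {"pos": 0.0096, "neg": 0.0072}

plain_label, _ = adjusted_classify(scores, {})
weighted_label, weighted_values = adjusted_classify(scores, {"pos": 0.5})

print(plain_label, weighted_label, weighted_values)
```

The adjusted values are used only to rank classes; as noted earlier in the section, they should not be read as probability estimates.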
