Generalized Naive Bayes Classifiers

Kim Larsen
Concord, California
[email protected]

ABSTRACT

This paper presents a generalization of the Naive Bayes Classifier. The method is called the Generalized Naive Bayes Classifier (GNBC) and extends the NBC by relaxing the assumption of conditional independence between the predictors. This reduces the bias in the estimated class probabilities and improves the fit, while greatly enhancing the descriptive value of the model. The generalization is done in a fully non-parametric fashion without putting any restrictions on the relationship between the independent variables and the dependent variable. This makes the GNBC much more flexible than traditional linear and non-linear regression, and allows it to uncover hidden patterns in the data. Moreover, the GNBC retains the additive structure of the NBC, which means that it does not suffer from the problems of dimensionality that are common with other non-parametric techniques.

The method is specifically designed for binary classification problems commonly found in credit scoring and marketing applications. The Generalized Naive Bayes Classifier turns out to be a powerful tool for both exploratory and predictive analysis: it can generate accurate predictions through a flexible, non-parametric fitting procedure, while being able to uncover hidden patterns in the data. In this paper, both the Generalized Naive Bayes Classifier and the original Naive Bayes Classifier will be demonstrated. Also, important ties to the Generalized Additive Model (GAM) and Weight of Evidence will be discussed.

1. INTRODUCTION

The problem considered is that of predicting the class probability of a binary event (or target) given a set of independent variables (predictors). This problem arises frequently in various fields such as credit scoring, marketing and medical studies. One of the oldest and simplest techniques for binary classification is the Naive Bayes Classifier. Although simple in structure and based on unrealistic assumptions, Naive Bayes Classifiers (NBC) often outperform far more sophisticated techniques. In fact, due to its simple structure, the Naive Bayes Classifier is an especially appealing choice when the set of independent variables is large. This is related to the classic bias-variance tradeoff: the NBC tends to produce biased estimates of the class probabilities, but it is able to save considerable variance around these estimates.

Despite these appealing features, the NBC is rarely used today in credit scoring or marketing applications. In fact, most popular statistical software packages do not have an NBC module. There are at least two reasons for this: First, the biased class probabilities can be a genuine problem for modeling applications where the sole focus is not classification or ranking. Second, the NBC is estimated under the assumption that the predictors are conditionally independent given the target variable. As a result, relationships between the dependent variable and the predictors are estimated in isolation, without paying attention to covariance between the predictors. The NBC is therefore not able to approximate the joint multivariate regression function, and as a data exploration tool it adds no more information than a univariate analysis.

Previous research has successfully tackled the severe assumptions of the Naive Bayes Classifier [4] [3]. However, most of these efforts deal with improving the Naive Bayes Classifier for text classification, i.e. the classification of some document into one of a set of pre-defined categories given a vector of words and characters. The generalization presented in this paper is specifically designed for the type of binary classification problems that arise in credit scoring and marketing applications. In this type of setting, the vector of attributes is usually comprised of multiple continuous and discrete variables, and hence most text mining techniques do not apply. In fact, the method presented here is closely related to the Generalized Additive Model and various non-parametric regression techniques.

2. NAIVE BAYES CLASSIFIERS

Let Y be a binary target variable where

\[ Y_i = \begin{cases} 1 & \text{if an event occurred} \\ 0 & \text{otherwise,} \end{cases} \]

and let X_1,...,X_p be a set of predictor variables. If the predictors are conditionally independent given Y, the conditional probabilities can be written as

\[ P(X_1,\dots,X_p \mid Y) = \prod_{j=1}^{p} P(X_j \mid Y). \]
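To make the factorization concrete, the short sketch below (Python, with hypothetical function and argument names; the paper's own implementations are SAS macros) scores a single observation by combining separately estimated univariate conditional densities with the class prior. The log-odds form it relies on is derived formally in the next paragraph.

import math

def nbc_probability(x, cond_dens_1, cond_dens_0, prior_1):
    # Illustrative sketch only, not the paper's implementation.
    # x           : list of predictor values x_1, ..., x_p
    # cond_dens_1 : list of callables, cond_dens_1[j](x_j) ~ f(x_j | Y = 1)
    # cond_dens_0 : list of callables, cond_dens_0[j](x_j) ~ f(x_j | Y = 0)
    # prior_1     : P(Y = 1) estimated from the training data
    # Densities are assumed strictly positive at the observed values.
    log_odds = math.log(prior_1 / (1.0 - prior_1))
    for xj, f1, f0 in zip(x, cond_dens_1, cond_dens_0):
        log_odds += math.log(f1(xj)) - math.log(f0(xj))
    return 1.0 / (1.0 + math.exp(-log_odds))

Under conditional independence each f(x_j | Y) can be estimated on its own, which is what keeps the NBC cheap even when p is large.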

Combining this with Bayes' Theorem leads to the Naive Bayes Classifier

\[ \log \frac{P(Y=1 \mid X_1,\dots,X_p)}{P(Y=0 \mid X_1,\dots,X_p)} = \log \frac{P(Y=1)}{P(Y=0)} + \sum_{j=1}^{p} \log \frac{f(x_j \mid Y=1)}{f(x_j \mid Y=0)}, \]

where f(x_j | Y) is the conditional density of X_j.

The conditional densities can be estimated separately using non-parametric univariate density estimation. Joint density estimation is therefore avoided, which is especially desirable when the model contains a large number of predictors. More advanced non-parametric techniques of the form y = f(x_1,...,x_p), such as thin-plate splines and loess, perform poorly under such conditions: rapidly increasing variance around the estimates, caused by sparseness of data, is a typical problem when estimating the multidimensional function f(x_1,...,x_p), and CPU time becomes an issue with large data sets. Furthermore, given that the densities can be estimated non-parametrically, the NBC can model the relationship between X_j and Y in a flexible and unrestricted fashion. In this respect, the NBC is much more flexible than traditional linear and non-linear regression.

Since the conditional independence assumption is rarely true, the savings in variance and CPU time are not free. The estimated class probabilities tend to be biased, and therefore the probabilities are typically not used directly but rather as rank-scores. This is a problem for certain modeling applications where ranking or classification is not the only purpose of the analysis. Furthermore, the descriptive value of the model is limited: consider the ratio

\[ g_j(x_j) = \log \frac{f(x_j \mid Y=1)}{f(x_j \mid Y=0)}, \]

which shows how the log-odds change with X_j. This function is called the Naive Effect, since covariance between X_j and the other predictors is not taken into consideration. Obviously, as a data exploration tool, each Naive Effect adds nothing more than a univariate analysis.

3. GENERALIZED NAIVE BAYES CLASSIFIERS

The idea behind the Generalized Naive Bayes Classifier is to relax the conditional independence assumption by adjusting the Naive Effects. This is done by adding p functions, b_1(x_1),...,b_p(x_p), to the Naive Effects, where b_j(x_j) accounts for the marginal bias attributed to g_j(x_j). Hence the GNBC can be written as

\[ \log \frac{P(Y=1 \mid X_1,\dots,X_p)}{P(Y=0 \mid X_1,\dots,X_p)} = \alpha + \sum_{j=1}^{p} \bigl( g_j(x_j) + b_j(x_j) \bigr), \]

where

\[ \alpha = \log \frac{P(Y=1)}{P(Y=0)}, \]

and b_1(x_1),...,b_p(x_p) are unspecified smooth functions. Obviously, if the predictors are conditionally independent, b_j(x_j) = 0 for all j, and the model reduces to an NBC.

In short, the GNBC decomposes the total bias into p terms, where the j'th term is a function of X_j. Treating the bias like this not only allows us to minimize the bias in the estimated probabilities, but also enhances the descriptive value of the model. The expression

\[ \sum_{j=1}^{p} \bigl( g_j(x_j) + b_j(x_j) \bigr) \]

is actually an additive approximation to the multivariate version

\[ \log \frac{f(x_1,\dots,x_p \mid Y=1)}{f(x_1,\dots,x_p \mid Y=0)}, \]

and thus g_j(x_j) + b_j(x_j) can be interpreted as the marginal effect of X_j, fully adjusted for the effect of the other variables. The function b_j(x_j) reflects the marginal bias, i.e. how the effect of X_j changes in the presence of the other predictors.

The decomposition of the bias also allows the GNBC to preserve the additive property of the NBC and hence avoid joint density estimation. Sparseness of data and inflated variance are therefore not an issue, which makes the GNBC a practical choice for most binary modeling applications.

4. ESTIMATING THE GENERALIZED NAIVE BAYES CLASSIFIER

Fitting a Generalized Naive Bayes Classifier is a two-stage process. In the first stage, the Naive Effects are estimated using univariate density estimation. In the second stage, the adjustment functions are estimated by iteratively smoothing the partial residuals against the predictors.

4.1 Estimation of Naive Effects

The Naive Effects can be estimated separately using univariate kernel density estimates. Let N_k(x_ij) denote the symmetric nearest neighborhood around x_ij, which contains the k observations to the left of x_ij and the k observations to the right of x_ij. The kernel density estimate for X_j at x_ij has the form

\[ \hat f(x_{ij}) = \frac{1}{nh} \sum_{k=1}^{n} I(x_{ij}, x_{kj}) \, K_\lambda\!\left( \frac{|x_{ij} - x_{kj}|}{h} \right), \]

where h is the width of the neighborhood and I(x_ij, x_kj) is an indicator function

\[ I(x_{ij}, x_{kj}) = \begin{cases} 1 & \text{if } x_{kj} \in N_k(x_{ij}) \\ 0 & \text{otherwise.} \end{cases} \]

The purpose of the weight function K_λ is to reduce the weight given to observations within N_k(x_ij) that are far away from x_ij. This reduces jaggedness in the curves and makes the estimates more robust when dealing with large numbers of ties in the predictor variable. Popular choices for K_λ include the minimum variance kernel

\[ K_\lambda(t) = \begin{cases} \frac{3}{8}(3 - 5t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise,} \end{cases} \]

and the Epanechnikov kernel

\[ K_\lambda(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise.} \end{cases} \]
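For concreteness, here is a minimal Python sketch of the two kernels and the nearest-neighborhood indicator defined above. It is an illustration only (the paper's implementation is the %GNBC SAS macro), the function names are hypothetical, and ties in the predictor are handled naively.

def minimum_variance_kernel(t):
    # K(t) = (3/8)(3 - 5 t^2) for |t| <= 1, and 0 otherwise
    return 0.375 * (3.0 - 5.0 * t * t) if abs(t) <= 1.0 else 0.0

def epanechnikov_kernel(t):
    # K(t) = (3/4)(1 - t^2) for |t| <= 1, and 0 otherwise
    return 0.75 * (1.0 - t * t) if abs(t) <= 1.0 else 0.0

def neighborhood_indicator(x_target, x_other, sorted_xj, k):
    # 1 if x_other lies within the symmetric neighborhood formed by the k
    # sorted training values on either side of x_target, and 0 otherwise.
    idx = sorted_xj.index(x_target)   # assumes x_target is a training value
    lo = max(0, idx - k)
    hi = min(len(sorted_xj) - 1, idx + k)
    return 1 if sorted_xj[lo] <= x_other <= sorted_xj[hi] else 0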

The estimated conditional densities can then be computed as

\[ \hat f(x_{ij} \mid Y=1) = \frac{\sum_{k=1}^{n} I(x_{ij}, x_{kj}) \, K_\lambda\!\left( \frac{|x_{ij}-x_{kj}|}{h} \right) y_k}{h \sum_{k=1}^{n} y_k} \]

and

\[ \hat f(x_{ij} \mid Y=0) = \frac{\sum_{k=1}^{n} I(x_{ij}, x_{kj}) \, K_\lambda\!\left( \frac{|x_{ij}-x_{kj}|}{h} \right) (1-y_k)}{h \sum_{k=1}^{n} (1-y_k)}, \]

which leads to the estimator for the Naive Effects

\[ \hat g_j(x_{ij}) = \log \frac{\hat f(x_{ij} \mid Y=1)}{\hat f(x_{ij} \mid Y=0)}. \]

If X_j is discrete, simple estimators can be used,

\[ \hat g_j(x_{ij}) = \log \frac{P(X_{ij} = x_{ij} \mid Y=1)}{P(X_{ij} = x_{ij} \mid Y=0)}, \]

which provides a seamless way to mix discrete and continuous variables. In the discrete case, the Naive Effects are popularly known as the weight-of-evidence.

The smoothness of the estimated densities is controlled by the smoothing parameter. As with any smoother, there is a bias-variance tradeoff when selecting λ. A large λ will produce a smooth curve with low variance and potentially high bias. Conversely, a small λ will produce a flexible curve with potentially high variance and little bias. The former curve might be too smooth and miss important patterns in the data, while the latter curve might be too jagged and not generalize well to testing data. Although there are iterative methods to find the optimal smoothing parameter, a good choice can usually be found by combining some simple rules with cross-validation results. If an estimated curve fits the training data well but fits the testing data poorly, increasing λ will most likely improve the validation results. Likewise, if n or the number of events is small, a large λ may be necessary to avoid sparseness of data.

4.2 Estimation of Marginal Bias

The marginal biases are estimated through an iterative backfitting algorithm similar to the procedure used with a logistic Generalized Additive Model [1]. It uses the NBC as a starting point and then, for each predictor in turn, estimates the marginal bias using the most current estimates for the other predictors. This step is repeated until the estimates \hat b_1(x_1),...,\hat b_p(x_p) stabilize. Note that the estimates \hat g_1(x_1),...,\hat g_p(x_p) never change throughout this process. This reiterates that the goal of the GNBC is simply to extend the NBC by correcting for the bias caused by the conditional independence assumption.

Let \hat\theta_i denote the estimated log odds for observation i, i.e.

\[ \hat\theta_i = \alpha + \sum_{j=1}^{p} \bigl( \hat g_j(x_j) + \hat b_j(x_j) \bigr), \]

and let \hat\pi_i denote the estimated target probability

\[ \hat\pi_i = \frac{e^{\hat\theta_i}}{1 + e^{\hat\theta_i}}. \]

The log odds at y_i around the current estimate \hat\pi_i can be approximated by the first order Taylor series

\[ \log \frac{P(Y_i=1)}{P(Y_i=0)} \approx \hat\theta_i + \frac{y_i - \hat\pi_i}{\hat\pi_i(1-\hat\pi_i)}. \]

This leads to the partial residual for variable X_j,

\[ e_{ij} = \log \frac{P(Y_i=1)}{P(Y_i=0)} - \bigl( \hat\theta_i - \hat b_j(x_{ij}) \bigr) = \hat b_j(x_{ij}) + \frac{y_i - \hat\pi_i}{\hat\pi_i(1-\hat\pi_i)}, \]

which isolates the bias caused by X_j by removing the effect of all other variables. Hence an estimate for b_j(x_j) can be obtained by smoothing e_j against X_j with weights

\[ w_{ij} = \left[ \operatorname{var}\!\left( \hat\theta_i + \frac{y_i - \hat\pi_i}{\hat\pi_i(1-\hat\pi_i)} \right) \right]^{-1} = \hat\pi_i(1-\hat\pi_i), \]

using the most current estimate of π_i. Because π_i and w_i are functions of \hat b_j(x_j), the dependent variable changes when \hat b_j(x_j) is updated, and the smoothing itself must be done iteratively. Hence the marginal bias for x_j, given the most current estimate of π_i, is estimated by repeatedly smoothing e_j against x_j, updating the weights at each iteration.

In order to estimate b_1(x_1),...,b_p(x_p) in a simultaneous fashion, a backfitting algorithm is used in conjunction with the iterative smoothing described above. The backfitting algorithm works by cycling through the predictors and fitting smoothing functions to the partial residuals, using the most current estimates for the other smoothing functions. Hence we can fit the GNBC through a nested do-loop: The outer loop controls the cycles of the backfitting algorithm. Each time the backfitting algorithm has cycled through all the predictors, the outer loop evaluates a goodness of fit criterion as well as the convergence status of the smoothing functions. If the smoothing functions cease to change, or the goodness of fit criterion ceases to improve, the outer loop terminates. The inner loop is simply the backfitting algorithm itself, and within the backfitting algorithm is a do-until loop that controls the iterative smoothing described above.

This is the same type of maximizer used with GAM, and it is also referred to as the Local Scoring Algorithm [1]. Furthermore, it can be shown that this routine is a special case of Iteratively Re-weighted Least Squares (IRLS) where the log-likelihood

\[ L\bigl( b_1(x_1),\dots,b_p(x_p) \bigr) = \sum_{i=1}^{n} \bigl( y_i \log \hat\pi_i + (1-y_i) \log (1-\hat\pi_i) \bigr) \]

is being maximized. The difference between this procedure and the typical IRLS is that the weighted linear regression has been replaced by the backfitting algorithm in the outer loop. The procedure has several attractive properties: First, assuming that y_1,...,y_n are i.i.d., the log-likelihood function is an obvious fitting criterion for the GNBC. Second, the log-likelihood for this problem is concave, which means that convergence is almost always guaranteed.

The algorithm is described in more detail in the following. Let S_j(h(x_j)) denote the weighted smooth of some function h(x_j) against X_j, using the weights given above. The choice of smoother is not critical, but for consistency the smoother should be based on the same neighborhoods used in the estimation of the Naive Effects. This is because the purpose of this stage is to remove bias caused by the conditional independence assumption, not bias caused by smoothing.
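To make the adjusted residuals and weights concrete, the sketch below shows one way the inner, do-until smoothing pass for a single predictor could look in Python. It is a hedged illustration, not the paper's %GNBC SAS code: `smooth` stands in for whatever weighted smoother is chosen, and a fixed iteration count replaces a proper convergence test. The full nested-loop algorithm is written out next.

import math

def logistic(theta):
    return 1.0 / (1.0 + math.exp(-theta))

def update_bj(xj, y, theta_hat, bj_hat, smooth, n_iter=10):
    # Re-estimate the marginal bias b_j by repeatedly smoothing the adjusted
    # partial residuals against x_j, refreshing the weights at each pass.
    for _ in range(n_iter):
        pi_hat = [logistic(t) for t in theta_hat]
        # e_ij = b_j(x_ij) + (y_i - pi_i) / (pi_i (1 - pi_i))
        resid = [b + (yi - p) / (p * (1.0 - p))
                 for b, yi, p in zip(bj_hat, y, pi_hat)]
        weights = [p * (1.0 - p) for p in pi_hat]
        new_bj = smooth(xj, resid, weights)   # weighted smooth of e_j on x_j
        # Keep the total log odds current before the next pass.
        theta_hat = [t - old + new
                     for t, old, new in zip(theta_hat, bj_hat, new_bj)]
        bj_hat = new_bj
    return bj_hat, theta_hat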

Given a smoother, we can write the algorithm as:

Initialize: \hat b_j^{0}(x_j) = 0, for all j
Outer Loop: k = 1, 2, ...
  Inner Loop: j = 1,...,p, j = 1,...,p, ...

\[ \hat b_j^{k}(x_j) = S_j\!\left( \hat b_j^{k-1}(x_j) + \frac{y_i - \hat\pi_i}{\hat\pi_i(1-\hat\pi_i)} \right), \]

where S_j has to be applied iteratively until \hat b_j^{k}(x_j) stabilizes. If X_j is a categorical variable, a simple weighted average of the partial residuals for each class is used instead of S_j. The outer loop is repeated until the ratio

\[ \frac{\sqrt{ \sum_{j=1}^{p} \sum_{i=1}^{n} \bigl( \hat b_j^{k}(x_{ij}) - \hat b_j^{k-1}(x_{ij}) \bigr)^2 }}{\sqrt{ \sum_{j=1}^{p} \sum_{i=1}^{n} \hat b_j^{k}(x_{ij})^2 }} \]

is less than some specified threshold, or the likelihood function ceases to improve. The initial estimate of the intercept is

\[ \hat\alpha^{0} = \log \frac{\sum_{i=1}^{n} y_i}{n - \sum_{i=1}^{n} y_i}. \]

At the end of each iteration, the intercept is adjusted to ensure that the expected number of events equals the actual number, i.e.

\[ \sum_{i=1}^{n} \hat\pi_i = \sum_{i=1}^{n} y_i, \]

which is identical to the way the intercept is estimated in a logistic regression model.

4.3 Building a Classifier with the GNBC

4.3.1 Variable Selection Strategy for the GNBC

The GNBC is not a practical algorithm for variable selection when the number of variables is large. Although the GNBC is computationally efficient compared to other non-parametric techniques, it still requires much more CPU time than LDA or logistic regression. Therefore, another technique is recommended for large variable selection tasks. One technique that has proven to be very effective for such problems is a combination of the standard Naive Bayes Classifier and logistic regression. First, Naive Effect functions are estimated for all the candidate variables. Second, a logistic regression containing all the Naive Effects is fitted, and a stepwise selection is performed to pick the most significant Naive Effects. Once a manageable set of variables has been determined, the GNBC is used to fit the final model. This technique should be preferred over variable selection from linear effects, as the Naive Effects ensure that variables with strong non-linear effects will be picked up during the selection process. Once the marginal bias functions have been estimated by the GNBC, we are often able to reduce the set of variables even further by identifying variables that are highly affected by multicollinearity. In such cases, the marginal bias functions can neutralize the Naive Effects, which is usually a sign that the variables are redundant.

The %GNBC SAS macro allows for estimation of both Naive and Generalized Naive Bayes Classifiers, and has a function to output the Naive Effects and marginal bias functions. Hence it can be used in conjunction with PROC LOGISTIC for large variable selection problems.

4.3.2 Implementation

In most cases, the ultimate goal is to build a classification system that can score all prospects or existing customers, in order to evaluate their credit risk or propensity to purchase a certain product. This requires the ability to score records that are not in the training data. Implementing the GNBC as a classification system is easy, although more complicated than implementing a simple LDA model or logistic regression model. For every variable, a look-up table linking X_j to b_j(x_j) + g_j(x_j) must be created. Furthermore, we expect to observe values that are not present in the training data, which means that the system should be able to extrapolate from the look-up tables. If we let x_{(1)j},...,x_{(n)j} denote the ordered values of X_j in the training data, the system should map values in the range (x_{(i)j}, x_{(i+1)j}] to b_j(x_{(i+1)j}) + g_j(x_{(i+1)j}). Values that exceed the boundaries of the training data are mapped to b_j(x_{(1)j}) + g_j(x_{(1)j}) or b_j(x_{(n)j}) + g_j(x_{(n)j}). This implementation algorithm can be performed with the %SCOREGNBC SAS macro. (A small illustrative sketch of this look-up logic is given after the variable descriptions below.)

4.4 Spam data example

The data set used in this example is the same data used to demonstrate GAM in the book "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman [2]. The data was donated by George Forman of Hewlett-Packard laboratories, Palo Alto, California, and is publicly available at the site ftp.ics.uci.edu. It contains various information on 4601 email messages, of which 1813 are categorized as spam and 2788 are legitimate emails. The goal of this analysis is to build a filter that prevents spam from entering the inbox. One strategy is to estimate the probability of spam for every inbound email using a binary regression model. If the estimated probability exceeds a certain level, the email is classified as spam and sent to a junk mail folder.

The data set contains 57 continuous ordinal predictors, as described below:

- wordFreq[WORD]: 48 continuous attributes that contain the percentage of words in the e-mail that match a given word. These attributes indicate whether a specific word occurs frequently in the email. Examples include business, address, and free.

- charFreq[CHARACTER]: 6 continuous attributes that contain the percentage of characters in the e-mail that match a given character. These attributes indicate whether a specific character occurs frequently in the email. Examples include !, # and $.

- capLengthAVE: A continuous attribute that contains the average length of uninterrupted sequences of capital letters.

- capLengthMAX: An integer attribute that contains the length of the longest uninterrupted sequence of capital letters.

- capLengthTOT: An integer attribute that contains the sum of the lengths of uninterrupted sequences of capital letters.

1536 records were randomly selected for testing, and 3065 were allocated to the training data. The model was then built in two steps: In the first step, a simple NBC containing all 57 predictors was fitted, followed by a stepwise selection of Naive Effects. The following variables were selected:

wordFreqGEORGE
wordFreqOUR
wordFreqOVER
wordFreqREMOVE
wordFreqINTERNET
wordFreqREPORT
wordFreqFREE
wordFreqBUSINESS
wordFreqCREDIT
wordFreqMONEY
wordFreq1999
wordFreqEDU
wordFreqEXCL
wordFreqDOLLAR
wordFreqHP
wordFreqPROJECT
capLengthMAX
capLengthAVE
charFreqDOLLAR
charFreqEXCLAMATION

In the second step, the final model was obtained by fitting a GNBC containing the selected predictors. A smoothing parameter of λ = 0.6 was chosen for all the variables, and the minimum variance kernel was chosen for density estimation. With the convergence criterion set at 0.001, the %GNBC macro converged in 6 iterations, which took roughly 1 minute of real computing time. The GNBC requires roughly O(n) operations, so even with a large dataset the computation time is manageable. The model fit statistics are summarized below:

Stage   -2LogL    MSE       C
NBC     1303.78   573.137   0.976
GNBC    1082.36   5.10      0.979

It appears that the GNBC does remove a substantial amount of bias. Note that the GNBC is primarily a bias reduction technique, which explains why the improvement is mostly reflected in the log-likelihood and MSE, and to a lesser extent in the C-statistic.

The spam cut-off can be derived using Bayes rule. Let L_1 denote the loss associated with classifying legitimate emails as spam, and L_2 denote the loss associated with classifying spam as legitimate email. With an equal loss assumption, the cut-off is then given by

\[ \frac{L_1}{L_1 + L_2} = 0.5. \]

In reality, it is more severe to classify legitimate email as spam, and hence L_1 should be greater than L_2, but for this example we assume equal loss. Using the %SCOREGNBC macro to score the testing data, the following confusion matrix was generated:

                   actual email   actual spam
Predicted email        59.58%         3.62%
Predicted spam          1.65%        35.15%

which shows an overall misclassification rate of 5.3%.

The spam data illustrates how much information and predictive strength can be lost when assuming linear effects. The spam data is typical in the sense that the correlation between the dependent variable and the independent variables is not linear. Two patterns that are common in the spam data, and in most other real-life modeling problems, are the V-shaped and the "hockeystick-shaped" correlation. The hockeystick occurs when the log odds increase or decrease rapidly as the variable increases from its minimum, but then do not change much after that. Both patterns are hard to model with parametric functions, and are poorly captured with a linear effect. Moreover, in order to attempt approximating these patterns with parametric functions, one would have to know that these patterns exist prior to fitting the model. With the GNBC, no knowledge about the correlation between X_j and Y is required prior to fitting the model, and because of the non-parametric density estimation, it can adapt to correlation of any shape. The table below demonstrates how much fit would have been lost if we had used a standard linear technique.

Technique             C       Testing Error
NBC                   0.976   6.1%
GNBC                  0.979   5.3%
Logistic regression   0.967   8.1%
Fisher LDA            0.961   9.3%

Examples of hockeystick-shaped and V-shaped correlation are given in figures 1 and 2. The graphs show the bias adjustment function, the Naive Effect, as well as the final estimate b_j(x_j) + g_j(x_j). Figure 1 reveals that capLengthMax has the hockeystick effect, whereas the variable wordFreqOur, shown in figure 2, has more of a U-shaped effect. Both graphs indicate that most of the bias adjustment is needed where the naive function peaks. This suggests that, had we just fitted an NBC, we would have overestimated the target probability in certain ranges.

5. CONCLUSION

In this paper, we have discussed the Generalized Naive Bayes Classifier. The GNBC is a flexible algorithm for predicting the probability of a binary target given a set of independent variables. It can fit a large model without any prior knowledge about the independent variables, which enables us to uncover hidden patterns in the data and spend more time selecting variables and designing the model. This makes the GNBC a powerful tool for prediction as well as data exploration.

Finally, it is clear that there is a strong link between the logistic GAM and the GNBC. In fact, the GNBC can be viewed as a two-stage method that combines GAM and NBC; the first stage fits an NBC to the data, and the second stage uses GAM to minimize the bias. This turns out to be a powerful combination. The NBC is fast regardless of the size of the model, and we typically only need a couple of additional GAM iterations to adjust the bias and find the optimum. In addition, the two-stage process allows us to measure the severity of the conditional independence assumption, as well as the predictive strength gained from removing the bias.
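To recap the two-stage structure in code form, here is a deliberately simplified Python sketch. It is not the paper's %GNBC SAS macro: `naive_effects` (stage one) and `smooth` (the weighted smoother) are hypothetical helpers, and the inner do-until loop, intercept adjustment and convergence tests of Section 4.2 are omitted. Its only purpose is to show how the NBC output feeds the GAM-style bias correction.

import math

def fit_gnbc(X, y, naive_effects, smooth, n_outer=10):
    # X: list of records (each a list of p predictor values); y: list of 0/1.
    n, p = len(y), len(X[0])
    # Stage 1: Naive Bayes -- univariate Naive Effects and the prior log odds.
    g = naive_effects(X, y)                 # assumed shape: g[j][i] = g_j(x_ij)
    alpha = math.log(sum(y) / (n - sum(y)))
    b = [[0.0] * n for _ in range(p)]       # marginal bias functions, start at 0
    theta = [alpha + sum(g[j][i] + b[j][i] for j in range(p)) for i in range(n)]
    # Stage 2: GAM-style backfitting to correct the conditional-independence bias.
    for _ in range(n_outer):
        for j in range(p):
            pi = [1.0 / (1.0 + math.exp(-t)) for t in theta]
            resid = [b[j][i] + (y[i] - pi[i]) / (pi[i] * (1.0 - pi[i]))
                     for i in range(n)]
            w = [pi[i] * (1.0 - pi[i]) for i in range(n)]
            new_bj = smooth([row[j] for row in X], resid, w)
            theta = [theta[i] - b[j][i] + new_bj[i] for i in range(n)]
            b[j] = new_bj
    return alpha, g, b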

6. REFERENCES

[1] T. Hastie and R. Tibshirani. Generalized Additive Models. Monographs on Statistics and Applied Probability 43. Chapman and Hall, 1990.

[2] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2001.

[3] F. Peng, D. Schuurmans, and S. Wang. Augmenting naive Bayes classifiers with statistical language models. Information Retrieval, 7:317-345, 2004.

[4] J. D. M. Rennie, L. Shih, J. Teevan, and D. R. Karger. Tackling the poor assumptions of naive Bayes text classifiers. Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.

Figure 1: Estimated effects for capLengthMax

Figure 2: Estimated effects for wordFreqOur
