Generalized Naive Bayes Classifiers
Kim Larsen
Concord, California

ABSTRACT
This paper presents a generalization of the Naive Bayes Classifier. The method is specifically designed for binary classification problems commonly found in credit scoring and marketing applications. The Generalized Naive Bayes Classifier turns out to be a powerful tool for both exploratory and predictive analysis. It can generate accurate predictions through a flexible, non-parametric fitting procedure, while being able to uncover hidden patterns in the data. In this paper, the Generalized Naive Bayes Classifier and the original Bayes Classifier will be demonstrated. Also, important ties to logistic regression, the Generalized Additive Model (GAM), and Weight Of Evidence will be discussed.

1. INTRODUCTION
The problem considered is that of predicting the class probability of a binary event (or target) given a set of independent variables (predictors). This problem arises frequently in various fields such as credit scoring, marketing, and medical studies. One of the oldest and simplest techniques for binary classification is the Naive Bayes Classifier. Although simple in structure and based on unrealistic assumptions, Naive Bayes Classifiers (NBC) often outperform far more sophisticated techniques. In fact, due to its simple structure the Naive Bayes Classifier is an especially appealing choice when the set of independent variables is large. This is related to the classic bias-variance tradeoff: the NBC tends to produce biased estimated class probabilities, but it is able to save considerable variance around these estimates.

Despite these appealing features, the NBC is rarely used today in credit scoring or marketing applications. In fact, most popular statistical software packages do not have an NBC module. There are at least two reasons for this. First, the biased class probabilities can be a genuine problem for modeling applications where the sole focus is not classification or ranking. Second, the NBC is estimated under the assumption that predictors are conditionally independent given the target variable. As a result, relationships between the dependent variable and the predictors are estimated in isolation, without paying attention to covariance between the predictors. The NBC is therefore not able to approximate the joint multivariate regression function, and as a data exploration tool it adds no more information than a univariate analysis.

This paper presents a generalization of the Naive Bayes Classifier. The method is called the Generalized Naive Bayes Classifier (GNBC) and extends the NBC by relaxing the assumption of conditional independence between the predictors. This reduces the bias in the estimated class probabilities and improves the fit, while greatly enhancing the descriptive value of the model. The generalization is done in a fully non-parametric fashion without putting any restrictions on the relationship between the independent variables and the dependent variable. This makes the GNBC much more flexible than traditional linear and non-linear regression, and allows it to uncover hidden patterns in the data. Moreover, the GNBC retains the additive structure of the NBC, which means that it does not suffer from the problems of dimensionality that are common with other non-parametric techniques.

Previous research has successfully tackled the severe assumptions of the Naive Bayes Classifier [4][3]. However, most of these efforts deal with improving the Naive Bayes Classifier for text classification, i.e. the classification of some document into one of a set of pre-defined categories given a vector of words and characters. The generalization presented in this paper is specifically designed for the type of binary classification problems that arise in credit scoring and marketing applications. In this type of setting, the vector of attributes is usually comprised of multiple continuous and discrete variables, and hence most text mining techniques do not apply. In fact, the method presented here is closely related to the Generalized Additive Model and various non-parametric regression techniques.

2. NAIVE BAYES CLASSIFIERS
Let $Y$ be a binary random variable where
\[
Y_i = \begin{cases} 1 & \text{if an event occurred} \\ 0 & \text{otherwise,} \end{cases}
\]
and $X_1, \ldots, X_p$ be a set of predictor variables. If the predictors are conditionally independent given $Y$, the conditional probabilities can be written as
\[
P(X_1, \ldots, X_p \mid Y) = \prod_{j=1}^{p} P(X_j \mid Y).
\]
Combining this with Bayes' Theorem leads to the Naive Bayes Classifier
\[
\log \frac{P(Y=1 \mid X_1, \ldots, X_p)}{P(Y=0 \mid X_1, \ldots, X_p)} = \log \frac{P(Y=1)}{P(Y=0)} + \sum_{j=1}^{p} \log \frac{f(x_j \mid Y=1)}{f(x_j \mid Y=0)},
\]
where $f(x_j \mid Y)$ is the conditional density of $X_j$.

The conditional densities can be estimated separately using non-parametric univariate density estimation. Joint density estimation is therefore avoided, which is especially desirable when the model contains a large number of predictors. More advanced non-parametric techniques of the form $y = f(x_1, \ldots, x_p)$, such as thin-plate splines and loess, perform poorly under such conditions. Rapidly increasing variance around the estimates as a result of sparseness of data is a typical problem when estimating the multidimensional function $f(x_1, \ldots, x_p)$, and CPU time is a problem with large data sets. Furthermore, given that densities can be estimated non-parametrically, the NBC can model the relationship between $X_j$ and $Y$ in a flexible and unrestricted fashion. In this aspect, the NBC is much more flexible than traditional linear and non-linear regression.

Since the conditional independence assumption is rarely true, the savings in variance and CPU time are not free. The estimated class probabilities tend to be biased, and therefore the probabilities are typically not used directly but rather as rank-scores. This is a problem for certain modeling applications where ranking or classification is not the only purpose of the analysis. Furthermore, the descriptive value of the model is limited: consider the ratio
\[
g_j(x_j) = \log \frac{f(x_j \mid Y=1)}{f(x_j \mid Y=0)},
\]
which shows how the log-odds changes with $X_j$. This function is called the Naive Effect, since covariance between $X_j$ and other predictors is not taken into consideration. Obviously, as a data exploration tool, each Naive function adds nothing more than a univariate analysis.
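To make the structure of the NBC concrete, here is a minimal sketch (not from the paper) that computes the naive log-odds score $\log P(Y=1)/P(Y=0) + \sum_j g_j(x_j)$ from class-conditional univariate density estimates. For simplicity it uses a plain Gaussian kernel density estimate; the function names (`kde`, `naive_bayes_log_odds`), the bandwidth `h`, and the toy data are illustrative assumptions, and the paper's own nearest-neighborhood density estimator is described in Section 4.1 below.

```python
import numpy as np

def kde(x_train, x_eval, h):
    """Univariate Gaussian kernel density estimate of f(x), evaluated at x_eval."""
    u = (x_eval[:, None] - x_train[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (len(x_train) * h * np.sqrt(2 * np.pi))

def naive_bayes_log_odds(X_train, y_train, X_new, h=0.5, eps=1e-12):
    """Naive Bayes score: log P(Y=1)/P(Y=0) + sum_j log f(x_j|Y=1)/f(x_j|Y=0)."""
    prior = np.log(y_train.mean() / (1.0 - y_train.mean()))
    score = np.full(len(X_new), prior)
    for j in range(X_train.shape[1]):
        f1 = kde(X_train[y_train == 1, j], X_new[:, j], h)
        f0 = kde(X_train[y_train == 0, j], X_new[:, j], h)
        score += np.log((f1 + eps) / (f0 + eps))   # the Naive Effect g_j(x_j)
    return score

# toy usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))).astype(int)
print(naive_bayes_log_odds(X, y, X[:5]))
```

Because each $g_j$ is estimated from a one-dimensional density ratio, adding predictors only adds univariate estimation problems, which is the dimensionality argument made above.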
3. GENERALIZED NAIVE BAYES CLASSIFIERS
The idea behind the Generalized Naive Bayes Classifier is to relax the conditional independence assumption by adjusting the Naive Effects. This is done by adding $p$ functions, $b_1(x_1), \ldots, b_p(x_p)$, to the Naive Effects, where $b_j(x_j)$ accounts for the marginal bias attributed to $g_j(x_j)$. Hence the GNBC can be written as
\[
\log \frac{P(Y=1 \mid X_1, \ldots, X_p)}{P(Y=0 \mid X_1, \ldots, X_p)} = \alpha + \sum_{j=1}^{p} \bigl( g_j(x_j) + b_j(x_j) \bigr),
\]
where
\[
\alpha = \log \frac{P(Y=1)}{P(Y=0)},
\]
and $b_1(x_1), \ldots, b_p(x_p)$ are unspecified smooth functions. Obviously, if the predictors are conditionally independent, $b_j(x_j) = 0$ for all $j$, and the model reduces to an NBC.

In short, the GNBC decomposes the total bias into $p$ terms, where the $j$'th term is a function of $X_j$. Treating the bias like this not only allows us to minimize the bias in the estimated probabilities, but also enhances the descriptive value of the model. The expression
\[
\sum_{j=1}^{p} \bigl( g_j(x_j) + b_j(x_j) \bigr)
\]
is actually an additive approximation to the multivariate version
\[
\log \frac{f(x_1, \ldots, x_p \mid Y=1)}{f(x_1, \ldots, x_p \mid Y=0)},
\]
and thus $g_j(x_j) + b_j(x_j)$ can be interpreted as the marginal effect of $X_j$, fully adjusted for the effect of the other variables. The function $b_j(x_j)$ reflects the marginal bias, i.e. how the effect of $X_j$ changes in the presence of the other predictors.

The decomposition of the bias also allows the GNBC to preserve the additive property of the NBC and hence avoid joint density estimation. Sparseness of data and inflated variance are therefore not an issue, which makes the GNBC a practical choice for most binary modeling applications.

4. ESTIMATING THE GENERALIZED NAIVE BAYES CLASSIFIER
Fitting a Generalized Naive Bayes Classifier is a two-stage process. In the first stage, the Naive Effects are estimated using univariate density estimation. In the second stage, the adjustment functions are estimated by iteratively smoothing the partial residuals against the predictors.
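The excerpt above does not spell out the second stage in detail, so the sketch below is only one plausible reading of "iteratively smoothing the partial residuals against the predictors": the Naive Effects are held fixed as an offset, a working response is formed from a first-order expansion of the logit, and each adjustment function $b_j$ is updated by smoothing the corresponding partial residual against $X_j$. The function names (`knn_smooth`, `fit_adjustments`), the running-mean smoother, the window size `k`, and the dummy naive effects in the toy usage are all assumptions for illustration, not the author's algorithm.

```python
import numpy as np

def knn_smooth(x, r, k=50):
    """Running-mean smoother: average r over a symmetric window of
    neighboring observations, ordered by the predictor x."""
    order = np.argsort(x)
    r_ord = r[order]
    half = k // 2
    sm = np.array([r_ord[max(0, i - half): i + half + 1].mean() for i in range(len(x))])
    out = np.empty_like(sm)
    out[order] = sm
    return out

def fit_adjustments(X, G, y, alpha, n_iter=10):
    """Second stage (simplified): hold the naive effects G[:, j] = g_j(x_ij) fixed
    and backfit adjustment functions b_j by smoothing working residuals
    against each predictor X_j."""
    n, p = X.shape
    B = np.zeros((n, p))                                  # fitted values b_j(x_ij)
    for _ in range(n_iter):
        eta = alpha + G.sum(axis=1) + B.sum(axis=1)       # current additive fit
        pi = 1.0 / (1.0 + np.exp(-eta))                   # current probabilities
        # working response from a first-order expansion of the logit
        z = eta + (y - pi) / np.clip(pi * (1.0 - pi), 1e-6, None)
        for j in range(p):
            # partial residual: remove everything except the j-th adjustment
            partial = z - alpha - G.sum(axis=1) - (B.sum(axis=1) - B[:, j])
            B[:, j] = knn_smooth(X[:, j], partial)
            B[:, j] -= B[:, j].mean()                     # keep each b_j centered
    return B

# toy usage with dummy naive effects (in practice G comes from the stage-one estimates)
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
y = (rng.random(400) < 1.0 / (1.0 + np.exp(-(X[:, 0] + X[:, 0] * X[:, 1])))).astype(int)
G = X.copy()                                              # stand-in for g_j(x_ij)
B = fit_adjustments(X, G, y, alpha=np.log(y.mean() / (1 - y.mean())))
print(B.mean(axis=0), B.std(axis=0))
```

A more careful implementation would weight the smoothing step by $\hat{\pi}_i(1-\hat{\pi}_i)$, as in local scoring for Generalized Additive Models.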
4.1 Estimation of Naive Effects
The Naive Effects can be estimated separately using univariate kernel density estimates. Let $N_k(x_{ij})$ denote the symmetric nearest neighborhood around $x_{ij}$, which contains the $k$ observations to the left of $x_{ij}$ and the $k$ observations to the right of $x_{ij}$. The kernel density estimate for $X_j$ at the point $x_{ij}$ has the form
\[
\hat{f}(x_{ij}) = \frac{1}{nh} \sum_{k=1}^{n} I(x_{ij}, x_{kj}) \, K_\lambda\!\left( \frac{|x_{ij} - x_{kj}|}{h} \right),
\]
where $h$ is the width of the neighborhood and $I(x_{ij}, x_{kj})$ is an indicator function
\[
I(x_{ij}, x_{kj}) = \begin{cases} 1 & \text{if } x_{kj} \in N_k(x_{ij}) \\ 0 & \text{otherwise.} \end{cases}
\]
The purpose of the weight function $K_\lambda$ is to reduce the weight given to observations within $N_k(x_{ij})$ that are far away from $x_{ij}$. This reduces jaggedness in the curves and makes estimates more robust when dealing with large numbers of ties in the predictor variable. Popular choices for $K_\lambda$ include the minimum variance kernel
\[
K_\lambda(t) = \begin{cases} \frac{3}{8}(3 - 5t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise,} \end{cases}
\]
and the Epanechnikov kernel
\[
K_\lambda(t) = \begin{cases} \frac{3}{4}(1 - t^2) & \text{if } |t| \le 1 \\ 0 & \text{otherwise.} \end{cases}
\]
The estimated conditional densities can then be computed as ...

... approximated by the first-order Taylor series
\[
\log \frac{P(Y_i = 1)}{P(Y_i = 0)} \approx \hat{\theta}_i + \frac{y_i - \hat{\pi}_i}{\hat{\pi}_i (1 - \hat{\pi}_i)}.
\]
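As a rough illustration of the neighborhood-based estimate in Section 4.1, the sketch below (not from the paper) evaluates a univariate kernel density at a point using only the sample values nearest to it, with the neighborhood width playing the role of $h$. The function names (`nn_kernel_density`, `epanechnikov`, `min_variance`), the use of the $2k$ closest points rather than exactly $k$ on each side, and the choice of $h$ as the largest distance within the neighborhood are simplifying assumptions.

```python
import numpy as np

def epanechnikov(t):
    """Epanechnikov kernel: (3/4)(1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def min_variance(t):
    """Minimum variance kernel: (3/8)(3 - 5 t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.375 * (3 - 5 * t ** 2), 0.0)

def nn_kernel_density(x, x0, k=50, kernel=epanechnikov):
    """Kernel density estimate of f at x0 from the sample x, restricted to a
    nearest-neighborhood of x0 and weighted by the kernel K_lambda."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    dist = np.abs(x - x0)
    idx = np.argsort(dist)[: min(2 * k, n)]   # neighborhood around x0
    h = max(dist[idx].max(), 1e-12)           # width of the neighborhood
    t = dist[idx] / h
    return kernel(t).sum() / (n * h)

# toy usage: density of a standard normal sample at a few points
rng = np.random.default_rng(2)
sample = rng.normal(size=1000)
for x0 in (-1.0, 0.0, 1.0):
    print(x0, round(nn_kernel_density(sample, x0, k=100), 3))
```

The Naive Effect $g_j(x_j)$ is then the log of the ratio of two such estimates, one computed from the observations with $Y = 1$ and one from the observations with $Y = 0$.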