Markov Blanket Feature Selection for Support Vector Machines
Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence (2008)

Jianqiang Shen, Lida Li, and Weng-Keen Wong
1148 Kelley Engineering Center, School of EECS
Oregon State University, Corvallis, OR 97331, U.S.A.
{shenj,lili,wong}@eecs.oregonstate.edu

Abstract

Based on information theory, optimal feature selection should be carried out by searching Markov blankets. In this paper, we formally analyze the current Markov blanket discovery approach for support vector machines and propose to discover Markov blankets by performing a fast heuristic Bayesian network structure learning step. We give a sufficient condition under which our approach improves performance. Two major factors that make it prohibitive to learn Bayesian networks from high-dimensional data sets are the large search space and the expensive cycle detection operations. We propose to restrict the search space by considering only promising candidates and to detect cycles with an online topological sorting method. Experimental results show that we can efficiently reduce the feature dimensionality while preserving a high degree of classification accuracy.

Introduction

Many machine learning problems require learning from high-dimensional data sets in which the number of instances is less than the number of features. A common strategy is to first run a feature selection step before applying a traditional learning algorithm. The main reasons for applying feature selection are to avoid overfitting by removing irrelevant features and to improve computational efficiency by decreasing the number of features the learning algorithm needs to consider. Many commonly used feature selection methods, such as mutual information and information gain (Yang & Pedersen 1997), are based on pairwise statistics. Although pairwise statistics can be computed quickly, they are unable to capture higher-order interactions and can be quite unreliable when the training data is sparse. Also, the user usually has to tune a threshold to determine the optimal feature set. This paper presents an efficient and reliable feature selection approach that does not require such a threshold.

Using Markov blankets for feature selection has been shown to be effective (Koller & Sahami 1996). The Markov blanket of a variable Xi is defined as the set of variables M that renders Xi conditionally independent of all other variables given the variables in M. Feature selection algorithms based on Markov blankets have attracted much attention recently (Aliferis, Tsamardinos, & Statnikov 2003; Bai et al. 2004). They all depend on independence tests. In this paper, we formally analyze the performance of these test-based algorithms.

We propose to discover Markov blankets by first performing a fast heuristic Bayesian network structure learning step. Bayesian networks exploit conditional independence to produce a succinct representation of the joint probability distribution over a set of variables. Heuristic structure learning first defines a score that describes the fitness of each possible structure to the observed data, and then searches for a structure that maximizes the score. We analyze a sufficient condition under which our approach improves performance.

Rather than learning a full Bayesian network, we could try to learn only the Markov blanket of the class variable. However, as will be shown in our results, we observe much better accuracy when we first learn a Bayesian network, because the network captures a more accurate global view of how all the random variables interact with each other. Secondly, learning a network gives us more flexibility: we can treat any node as the "class variable" once we have the Bayesian network. Finally, learning a network can still be quite efficient compared with learning only Markov blankets; our method can learn a reasonably accurate Bayesian network from large text datasets within minutes.

Throughout this paper, we use linear support vector machines (SVMs) to perform classification after feature selection, since the SVM is one of the state-of-the-art learning methods and has met with success in numerous domains. This can be considered a hybrid generative-discriminative learning approach (Raina et al. 2004). Although discriminative classifiers usually achieve higher prediction accuracy than generative classifiers, it has been shown that a generative classifier outperforms its discriminative counterpart when the amount of training data is limited (Ng & Jordan 2002). In this paper we essentially apply the generative learner to identify which information is necessary for training the discriminative learner. Our hybrid approach significantly outperforms the pure SVM when the training size is small.
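To make the shape of this hybrid pipeline concrete, a minimal Python sketch follows. It is illustrative only: the univariate chi-square selector is a simple stand-in for the Markov blanket selection developed in this paper, and scikit-learn together with the synthetic data are assumptions made for the example.

    # Sketch of the hybrid scheme: a (generative) feature-selection step
    # followed by a (discriminative) linear SVM on the reduced feature set.
    # SelectKBest(chi2) is a placeholder, not the Markov blanket procedure.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Synthetic high-dimensional data standing in for a sparse text corpus.
    X, y = make_classification(n_samples=200, n_features=2000,
                               n_informative=20, random_state=0)
    X = abs(X)  # the chi-square selector requires non-negative features
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    clf = make_pipeline(SelectKBest(chi2, k=100), LinearSVC())
    clf.fit(X_tr, y_tr)
    print("held-out accuracy:", clf.score(X_te, y_te))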
Searching Markov Blanket Features

Most commonly used feature selection methods (such as mutual information, information gain, and the chi-square statistic (Yang & Pedersen 1997)) are based on pairwise statistics, which are unreliable when the training size is small. It is also difficult to determine the optimal feature set size, and the user has to tune some threshold. Furthermore, pairwise statistics have difficulty removing redundant features and capturing higher-order interactions. Consider variables x and y. It is possible that both will be retained if they are highly correlated (and thus have similar statistical measurements). It is also possible that both will be removed if the class variable's interaction with x or y alone is not significant, even though the co-occurrence of x and y is very predictive of the class.

The Markov blanket criterion can address the above problems. The Markov blanket subsumes the information that a variable has about all of the other features. Koller and Sahami (1996) suggested that we should remove features for which we find a Markov blanket within the set of remaining features. Since finding Markov blankets is hard, they proposed an approximate algorithm, which is suboptimal and very expensive due to its simple approximation (Baker & McCallum 1998). There are several other algorithms based on Markov blankets (Aliferis, Tsamardinos, & Statnikov 2003; Bai et al. 2004). They all depend on independence tests and are sensitive to test failures. Furthermore, the user still needs to tune the threshold that determines the optimal feature set size. We analyze their error bounds and propose to select features by heuristically searching Bayesian networks.

Analysis of Test-based Algorithms

For simplicity, let us concentrate on binary, complete data. We assume that the chi-square test is used as the independence test and a linear SVM is used for classification. The results can be easily generalized to other situations.

Regarding variables A and B, the independence test T is carried out by testing the null hypothesis H0: P_AB = P_A · P_B. Given the nominal level α, it rejects H0 if the chi-square statistic χ̃² > χ²_{1,α}, where χ²_{1,α} is the upper α percentage point of the cumulative chi-square distribution with 1 degree of freedom. Let S^n denote a training set with n samples generated from probability distribution D, and let F(S^n) denote the feature set returned by applying feature selection algorithm F to S^n. We denote the expected error rate of F given n samples as Err^n(F). There could be many different definitions of Err^n(F); one possibility is Err^n(F) = ∫_{S^n∈D} P(S^n) · |Φ ⊗ F(S^n)| / |Φ ∪ F(S^n)| dS^n, where Φ is the optimal feature set and U ⊗ V is the difference between sets U and V.

Thus, T will mistakenly treat two independent variables as dependent with probability no less than α. In practice, usually P > 0.5, and α is set to 5% to control the type I error. With small α, it is very likely that T treats dependent variables as independent, as shown in the following:

Lemma 2 Given nominal level α, if we want the probability that T identifies two dependent variables as independent to be no more than ε, then we need n > (1/∆)(√(χ²_{1,α} + U_ε²/4) + U_ε/2)² samples, where ∆ is a constant depending on the correlation between the variables, and U_ε is the upper ε percentage point of the cumulative normal distribution.

Proof (Sketch) We want to ensure that the type II error P(χ̃² ≤ χ²_{1,α}) ≤ ε. Under the alternative hypothesis, χ̃² follows the non-central chi-square distribution with 1 degree of freedom and non-centrality parameter λ = n · ∆ (Meng & Chapman 1966). Applying the Pearson transformation and the Fisher approximation (Johnson & Kotz 1970), we get n > (1/∆)(√(χ²_{1,α} + U_ε²/4) + U_ε/2)² after some manipulation.

Note that χ²_{1,α} and U_ε grow extremely fast as α → 0 and ε → 0. A test-based feature selection algorithm usually involves Ω(m) independence tests given m variables. Because ∆ is usually very small, the probability that T treats dependent variables as independent is high with limited samples. Thus, test-based feature selection algorithms are too aggressive in removing features, as is also shown in the experiments.
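To get a feel for how quickly the required sample size grows, the bound in Lemma 2 can be evaluated numerically. The sketch below is only an illustration: the value ∆ = 0.01 is an arbitrary assumption, and SciPy is assumed for the chi-square and normal quantiles.

    # Numeric illustration of the Lemma 2 sample-size bound
    #   n > (1/Delta) * (sqrt(chi2_{1,alpha} + U_eps^2 / 4) + U_eps / 2)^2
    from scipy.stats import chi2, norm

    def required_samples(delta, alpha, eps):
        chi2_alpha = chi2.ppf(1.0 - alpha, df=1)  # upper alpha point, 1 d.o.f.
        u_eps = norm.ppf(1.0 - eps)               # upper eps point of the normal
        return ((chi2_alpha + 0.25 * u_eps ** 2) ** 0.5 + 0.5 * u_eps) ** 2 / delta

    for alpha in (0.05, 0.01):
        for eps in (0.10, 0.01):
            n = required_samples(delta=0.01, alpha=alpha, eps=eps)  # Delta assumed
            print(f"alpha={alpha}, eps={eps}: n > {n:.0f}")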
Method Based on Bayesian Network Learning

We showed that test-based algorithms are very sensitive to α and that even with infinite samples their error rates are still proportional to α. We propose to search for Markov blankets by learning a Bayesian network with a good heuristic strategy. Bayesian networks provide a descriptive summary of the relationships between variables and an efficient representation of the joint probability distribution. In this paper, we use the Bayesian Information Criterion (BIC) as the search score. We choose BIC because it penalizes model complexity and leads to smaller feature sets. If the goal is to preserve prediction accuracy as much as possible, then a Bayesian score such as BDeu (Heckerman, Geiger, & Chickering 1995) might be preferred.

Given the class variable C, let M̂ be its minimal Markov blanket
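Once a network structure is available, the Markov blanket of a node can be read directly off the graph: its parents, its children, and the other parents of its children. The following is a minimal sketch of that extraction step; the dictionary-of-parents representation and the toy graph are assumptions made for illustration, not the paper's implementation.

    # Markov blanket of a node in a DAG: parents, children, and the
    # children's other parents ("spouses").
    def markov_blanket(parents, node):
        """`parents` maps every node to the set of its parents in the DAG."""
        children = {v for v, ps in parents.items() if node in ps}
        spouses = {p for c in children for p in parents[c]} - {node}
        return parents[node] | children | spouses

    # Toy structure: A -> C <- B,  C -> D <- E.
    dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C", "E"}, "E": set()}
    print(markov_blanket(dag, "C"))  # {'A', 'B', 'D', 'E'}

Under this view, the features passed to the SVM are exactly the blanket of the class node in the learned network.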