Bayesian Model Averaging for Improving Performance of the Naive Bayes Classifier
Ga Wu [email protected]
Supervised by: Dr. Scott Sanner
COMP8740, AI Project, Semester 1, 2014, Australian National University

OUTLINE
Naive Bayes Classifier
Bayesian Model Averaging for Naive Bayes
Automatic Feature Selection Mechanism
Cross-Validation with Hyper Parameter Tuning
Experimental Results

Naive Bayes Classifier
Bayes' Theorem:
P(y | x) = P(x | y) P(y) / P(x)
Conditional Independence Assumption
P(x | y) = P(x_1 | y, x_2, ..., x_k) P(x_2 | y, x_3, ..., x_k) ... P(x_k | y)   (chain rule)
         = P(x_1 | y) P(x_2 | y) ... P(x_k | y)   (by conditional independence)
Naive Bayes classifier:
argmax_y P(y | x) = argmax_y P(y) P(x_1 | y) P(x_2 | y) ... P(x_k | y)
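For concreteness, a minimal Bernoulli naive Bayes sketch of this decision rule in Python (not from the slides; the function names and toy data are illustrative, and Laplace smoothing is an added assumption):

    import numpy as np

    def train_nb(X, y, alpha=1.0):
        """Fit a Bernoulli naive Bayes model; alpha is Laplace smoothing."""
        classes = np.unique(y)
        log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
        # theta[c, k] estimates P(x_k = 1 | y = c), with smoothing
        theta = np.array([(X[y == c].sum(axis=0) + alpha) /
                          ((y == c).sum() + 2 * alpha) for c in classes])
        return classes, log_prior, np.log(theta), np.log(1.0 - theta)

    def predict_nb(model, X):
        """argmax_y [ log P(y) + sum_k log P(x_k | y) ] for each row of X."""
        classes, log_prior, log_t, log_not_t = model
        scores = log_prior + X @ log_t.T + (1 - X) @ log_not_t.T
        return classes[np.argmax(scores, axis=1)]

    # Toy data: 3 examples, 2 binary features
    X = np.array([[1, 0], [1, 1], [0, 1]])
    y = np.array([0, 0, 1])
    print(predict_nb(train_nb(X, y), np.array([[1, 0]])))  # -> [0]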
Why Naive Bayes?
Prediction accuracy can be improved substantially when feature selection is applied
(Feature Selection Strategy in Text Classification, PAKDD 2011)
Efficient to train, evaluate and interpret
Interpretable in terms of prior probabilities and likelihoods
Easy to update (suited to online classification tasks)

Bayesian Model Averaging for Naive Bayes
Bayesian Model Averaging:
P(y | x, D) ∝ \sum_{m=1}^{M} P(y | x, f_m) P(f_m) \prod_{i=1}^{N} P(d_i | f_m)
The prediction is averaged over all candidate models f_m, each weighted by its prior and its likelihood on the data D
Each binary feature selection vector represents a model:
f = [f_1, f_2, ..., f_K]^T,  f_k ∈ {0, 1}
Independent and identically distributed:
P(f) = \prod_{k=1}^{K} P(f_k)

Bayesian Model Averaging for Naive Bayes
BMA for the Naive Bayes classifier:
argmax_y log P(y | x, D) = argmax_y [ log P(y) + \sum_{k=1}^{K} log( \prod_{i=1}^{N+1} P(x_{ik}) + (1/C) \prod_{i=1}^{N+1} P(x_{ik} | y_i) ) ]
Hyper-parameter: C = H^{N+1}
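A sketch of how this per-class score could be computed (hypothetical names; the per-feature log-products are assumed precomputed). Because the sum over all 2^K feature subsets factorizes into K independent terms, each class is scored in O(K) time, which is why the overall complexity stays linear:

    import numpy as np

    def bma_nb_score(log_p_y, log_prod_marg, log_prod_cond, H, N):
        """Score one class y, where log_prod_marg[k] = sum_i log P(x_ik) and
        log_prod_cond[k] = sum_i log P(x_ik | y_i), with i = 1..N+1."""
        log_C = (N + 1) * np.log(H)  # hyper-parameter C = H**(N+1), in log space
        score = log_p_y
        for k in range(len(log_prod_marg)):
            # log( prod_i P(x_ik) + (1/C) * prod_i P(x_ik | y_i) ),
            # evaluated stably with logaddexp; the conditional term gains
            # weight as the feature-weight ratio of the next slide grows,
            # and is damped by the 1/C complexity penalty
            score += np.logaddexp(log_prod_marg[k], log_prod_cond[k] - log_C)
        return score

    # predicted class = argmax over classes y of bma_nb_score(...)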
Automatic Feature Selection Mechanism
Feature Weighting Mechanism:
\prod_{i=1}^{N+1} P(x_{ik} | y_i)  /  \prod_{i=1}^{N+1} P(x_{ik})  ≥  1
A feature is worth keeping when its class-conditional likelihood beats its marginal likelihood
Complex Model Punishment: the model prior penalizes each selected feature
P(f_k) = 1/C if f_k = 1;  P(f_k) = 1 if f_k = 0

Cross-Validation with Hyper Parameter Tuning (Nested Cross-Validation)
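The slides do not spell out the tuning loop itself; below is a minimal nested cross-validation sketch, assuming a candidate grid for H and hypothetical fit(X, y, H) and accuracy(model, X, y) helpers. Inner folds choose H, outer folds estimate generalization accuracy:

    import numpy as np

    def nested_cv(X, y, H_grid, fit, accuracy, outer=5, inner=3):
        """Inner folds select the hyper-parameter H; outer folds give an
        unbiased accuracy estimate of the whole tuning procedure."""
        n = len(y)
        outer_folds = np.array_split(np.random.permutation(n), outer)
        outer_scores = []
        for test_idx in outer_folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            inner_folds = np.array_split(np.random.permutation(train_idx), inner)

            def inner_score(H):
                # mean validation accuracy of H over the inner folds
                scores = []
                for val_idx in inner_folds:
                    tr = np.setdiff1d(train_idx, val_idx)
                    model = fit(X[tr], y[tr], H)
                    scores.append(accuracy(model, X[val_idx], y[val_idx]))
                return np.mean(scores)

            best_H = max(H_grid, key=inner_score)
            # refit on the full outer training split with the chosen H
            model = fit(X[train_idx], y[train_idx], best_H)
            outer_scores.append(accuracy(model, X[test_idx], y[test_idx]))
        return np.mean(outer_scores)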
Information of Experiment Dataset
UCI classification problem datasets (general classification)
20 classification problems with varying feature size, data size, and feature type; some contain missing values
Goal: find the conditions under which the classifier does better than Naive Bayes
20 Newsgroups dataset (text classification)
Contains noisy features
Relatively large number of features

Performance
Accuracy Comparison
Time Consumption Comparison
Why do some problems achieve higher accuracy?
Noisy features are discarded
Complex models are penalized
The classification problem has trivial (uninformative) features
Hyper-parameter tuning is effective

Impact of Hyper-Parameter Tuning
There are noisy features that our automatic feature selection mechanism can detect
The accuracy "peak" is captured when tuning the hyper-parameter

Impact of Hyper-Parameter Tuning
There are no noisy features for our automatic feature selection mechanism to detect

Potential Usage Area: Text Classification

Conclusion
BMA can do better than the Naive Bayes classifier on some classification problems, but not all. Even when it cannot do better, it maintains the same prediction accuracy as Naive Bayes
Hyper-Parameter Tuning is critical for this algorithm
The time complexity of this classifier is linear in both the number of features and the number of data points
Potential usage area: text classification

Further Work
The current hyper-parameter tuning method is only a compromise between time consumption and prediction accuracy.
We want a more efficient hyper-parameter tuning method.

THANKS!