Bayesian Model Averaging for Improving Performance of the Naive Bayes Classifier

Ga Wu [email protected]

Supervised by: Dr. Scott Sanner
COMP8740, AI Project, Semester 1, 2014
Australian National University

OUTLINE

 Naive Bayes Classifier

 Bayesian Model Averaging for Naive Bayes

 Automatic Mechanism

 Cross-Validation with Hyper Parameter Tuning

 Experimental Results

Naive Bayes Classifier

 Bayes’ Theorem

P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}

Assumption

P(x \mid y) = P(x_1 \mid y, x_2, x_3, \ldots, x_k)\, P(x_2 \mid y, x_3, x_4, \ldots, x_k) \cdots P(x_k \mid y)

\propto P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_k \mid y)

 Naive Bayes classifier

\arg\max_y P(y \mid x) = \arg\max_y P(y)\, P(x_1 \mid y)\, P(x_2 \mid y) \cdots P(x_k \mid y)
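A minimal sketch of this decision rule in Python (a hypothetical dictionary-based implementation evaluated in log space; the names nb_predict, class_priors and cond_probs are illustrative, not taken from the project code):

```python
import numpy as np

def nb_predict(x, class_priors, cond_probs):
    """Naive Bayes rule: argmax_y P(y) * prod_k P(x_k | y), evaluated in log space.

    x            : list of observed feature values x_1 .. x_k
    class_priors : dict mapping class y -> P(y)
    cond_probs   : dict mapping (k, value, y) -> P(x_k = value | y)
    """
    best_y, best_score = None, -np.inf
    for y, prior in class_priors.items():
        # Summing logs avoids underflow from multiplying many small probabilities.
        score = np.log(prior) + sum(np.log(cond_probs[(k, v, y)])
                                    for k, v in enumerate(x))
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```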

Why Naive Bayes?

 Prediction accuracy can be improved considerably when feature selection is applied

 Feature Selection Strategy in Text Classification (PAKDD 2011)

 Efficient to train, evaluate and interpret

 Prior probability, likelihood

 Easy to update (online classification tasks)

Bayesian Model Averaging for Naive Bayes

 Bayesian Model Averaging

P(y \mid x, D) \propto \sum_{m=1}^{M} P(y \mid x, f_m)\, P(f_m) \prod_{i=1}^{N} P(d_i \mid f_m)

 Each feature vector represents a Model

f = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_k \end{bmatrix}, \quad f_k \in \{0, 1\}

 Independent and identically distributed:

P(f) = \prod_{k=1}^{K} P(f_k)

Bayesian Model Averaging for Naive Bayes

 BMA for Naive Bayes classifier

\arg\max_y \log P(y \mid x, D) = \arg\max_y \Big( \log P(y) + \sum_{k=1}^{K} \log \Big( \prod_{i=1}^{N+1} P(x_{ik}) + \frac{1}{C} \prod_{i=1}^{N+1} P(x_{ik} \mid y_i) \Big) \Big)

 Hyper-Parameter

C = H^{N+1}
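A minimal sketch of this BMA-NB scoring rule (assuming the per-feature products over the N+1 data points have already been computed; the names bma_nb_predict, cond_prod and marg_prod are illustrative, not the project's actual code):

```python
import numpy as np

def bma_nb_predict(class_priors, cond_prod, marg_prod, C):
    """BMA for Naive Bayes:
    argmax_y  log P(y) + sum_k log( prod_i P(x_ik) + (1/C) * prod_i P(x_ik | y_i) )

    class_priors : dict mapping class y -> P(y)
    cond_prod    : dict mapping y -> numpy array, entry k = prod_{i=1..N+1} P(x_ik | y_i)
    marg_prod    : numpy array, entry k = prod_{i=1..N+1} P(x_ik)
    C            : complexity-punishing hyper-parameter, C = H**(N+1)
    """
    def score(y):
        return np.log(class_priors[y]) + np.sum(np.log(marg_prod + cond_prod[y] / C))
    return max(class_priors, key=score)
```

For an uninformative feature the conditional product is close to the marginal product, so dividing it by a large C lets the marginal term dominate and the feature effectively drops out of the class comparison; this is the behaviour the feature weighting ratio on the next slide describes.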

Automatic Feature Selection Mechanism

 Feature Weighting Mechanism

\frac{\prod_{i=1}^{N+1} P(x_{ik} \mid y_i)}{\prod_{i=1}^{N+1} P(x_{ik})} \geq 1

 Complex Model Punishment

P(f_k) = \begin{cases} \frac{1}{C}, & f_k = 1 \\ 1, & f_k = 0 \end{cases}

Automatic Feature Selection Mechanism

Cross-Validation with Hyper Parameter Tuning (Nested Cross-Validation)
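A minimal sketch of the nested cross-validation loop used to tune the hyper-parameter (the candidate grid for H, the use of scikit-learn's KFold, and the fit_predict helper are illustrative assumptions, not the project's actual setup):

```python
import numpy as np
from sklearn.model_selection import KFold

def nested_cv_accuracy(X, y, H_grid, fit_predict, n_outer=5, n_inner=3):
    """Nested CV: the inner loop picks H on the training folds only,
    the outer loop reports accuracy for the tuned classifier.

    X, y        : numpy arrays of features and labels
    fit_predict : hypothetical helper, (X_train, y_train, X_test, H) -> predicted labels
    """
    outer_scores = []
    for tr_idx, te_idx in KFold(n_splits=n_outer, shuffle=True).split(X):
        X_tr, y_tr = X[tr_idx], y[tr_idx]
        # Inner loop: score every candidate H using only the training folds.
        inner_scores = {H: [] for H in H_grid}
        for in_tr, in_val in KFold(n_splits=n_inner, shuffle=True).split(X_tr):
            for H in H_grid:
                pred = fit_predict(X_tr[in_tr], y_tr[in_tr], X_tr[in_val], H)
                inner_scores[H].append(np.mean(pred == y_tr[in_val]))
        best_H = max(H_grid, key=lambda H: np.mean(inner_scores[H]))
        # Outer evaluation with the hyper-parameter chosen by the inner loop.
        pred = fit_predict(X_tr, y_tr, X[te_idx], best_H)
        outer_scores.append(np.mean(pred == y[te_idx]))
    return np.mean(outer_scores)
```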

Information of Experiment Dataset

 UCI classification problem datasets (general classification)

 20 classification problems with different feature sizes, data sizes, and feature types; some have missing values

 To find under which conditions the classifier can do better than Naive Bayes

 20 Newsgroups dataset (text classification)

 Has noise features

 Relatively large number of features

Performance Accuracy Comparison

Time Consumption Comparison

Why do some of them get higher accuracy?

 Discards the noise features

 Punishes complex models

 The classification problem has trivial features

 Hyper-parameter tuning is effective

Impact of Hyper-Parameter Tuning

 There are noise features that can be detected by our automatic feature selection mechanism

 The “peak” is captured when tuning the hyper-parameter

Impact of Hyper-Parameter Tuning

 There are no noise features that can be detected by our automatic feature selection mechanism

Potential Usage Area: Text Classification

Conclusion

 BMA can do better than the Naive Bayes classifier for some classification problems, but not all. Even when it cannot do better, it still maintains the same prediction accuracy as Naive Bayes

 Hyper-Parameter Tuning is critical for this

 The time complexity of this classifier is linear in both the number of features and the number of data points

 Potential usage area: text classification

Further Work

 The current hyper-parameter tuning method is only a compromise between time consumption and prediction accuracy.

 We want a more efficient hyper-parameter tuning method.

THANKS!