The Florida State University College of Arts and Sciences
Tree-Augmented Naïve Bayes Methods for Real-Time Training and Classification of Streaming Data
By Kunal SinghaRoy
March 8th, 2018
A master's project submitted to the Department of Computer Science in partial fulfillment of the requirements for the degree of Master of Science
Major Professor: Dr. Robert van Engelen
Dr. Robert A. van Engelen
Major Professor
Dr. Michael Mascagni
Committee Member
Dr. Shayok Chakraborty
Committee Member
Acknowledgements
First and foremost, I would like to thank my project advisor Dr. Robert van Engelen for his constant guidance and encouragement, without which this project would not have been possible. His steady stream of ideas for improving the project motivated me to get more out of it. I would also like to thank my family and friends, who always stood beside me and inspired me to work on the things I believe in. Finally, I would like to thank my committee members Dr. Michael Mascagni and Dr. Shayok Chakraborty for their time and support.
Table of Contents

Abstract
Chapter 1
1.1 Bayes' Theorem
1.2 Conditional Probability
1.3 Marginal Probability and Joint Probability
Chapter 2
2.1 Bayesian Networks
2.2 Complexity of Bayesian Networks
Chapter 3
3.1 Naive Bayes Model
3.2 Naive Bayes Model Example
3.3 Laplace Correction
3.4 Naïve Bayes Training and Classification
3.5 Flowchart for Naïve Bayes Model
3.6 Complexity of Naïve Bayes Training and Classification
Chapter 4
4.1 Machine Learning Methods for Streaming Data
Chapter 5
5.1 Applying Naïve Bayes Models to Streaming Data
5.2 Flowchart for Naïve Bayes Model Applied to Streaming Data
5.3 Complexity of the Naïve Bayes Model Applied to Streaming Data
Chapter 6
6.1 Tree-Augmented Naïve Bayes
6.2 The TAN Model Using Max Abs Error
Chapter 7
7.1 Batch Algorithm Approach
7.2 Incremental TAN Learning
Chapter 8
8.1 Experimental Results
8.2 Data Preparation
8.3 Brief Descriptions of Datasets Used for Testing
8.4 Performance Evaluation of NB and TAN Classification
8.5 Performance Results of Different Models
Chapter 9
References
Abstract
Naïve Bayes (NB) and Tree-Augmented Naïve Bayes (TAN) are simple probabilistic classifiers based on the application of Bayes' Theorem. Popular applications of NB classifiers include email spam detection, face recognition, medical diagnosis, and weather prediction. In classification problems in general, we attempt to determine the category to which a new instance belongs by using knowledge acquired from analyzing previous instances (i.e., a training set) for which the category was already known. This form of learning is generally referred to as supervised learning. The NB classifier assumes that the value of an attribute is independent of the value of every other attribute given the class variable. This is a very simplistic assumption for building a probabilistic model, and it does not always hold true. If we can incorporate information about the correlation between the attributes into the Naïve Bayes model, the classification accuracy can be improved. Tree-Augmented Naïve Bayes (TAN) is one such model. In TAN the attributes are not independent as in NB, but the level of interaction between the attributes is still limited to keep the computational cost down: during TAN classification, each attribute conditionally depends on the class and on at most one other attribute from the feature set. Therefore, TAN is usually a more realistic model that can increase classification performance without significantly increasing the cost of training. With the advent of the Internet there is a continuous flow of data, and we want our classification model to learn from this flow of non-stationary data. The main component of the TAN model is the tree structure. If we have to restructure the tree whenever a new training instance is fed into the model, it consumes significant processing power; on the other hand, if the tree structure does not correctly reflect the correlation among the attributes, classification accuracy suffers. In this project I implemented the TAN model on streaming data in such a way that whenever a new training instance is fed into the model, metrics are used to assess whether the current tree is outdated and may negatively affect accuracy. I reviewed several different approaches for such metrics to decide when to restructure the tree.
Chapter 1
1.1 Bayes’ Theorem:
In probability theory, Bayes' Theorem calculates the probability of an event based on prior knowledge of conditions that might be related to the event. Bayes' Theorem is mathematically stated by the following equation:

P(A|B) = P(B|A) * P(A) / P(B)   [1]

where A and B are events and P(B) ≠ 0. P(A|B) is a conditional probability: the likelihood of event A given that event B occurs. P(B|A) is a conditional probability: the likelihood of event B given that event A occurs. P(A) and P(B) are marginal probabilities: the probabilities of observing A and B independently of each other. Take for instance a population in which 10% of patients have obesity, i.e., the prior probability of having obesity is P(Obesity) = 0.10. Among obese patients, 7% are sedentary, i.e., P(Sedentary | Obesity) = 0.07. Also given, 5% of all patients are sedentary, i.e., P(Sedentary) = 0.05. Using Bayes' Theorem, we can calculate the probability of obesity given sedentary:

P(Obesity | Sedentary) = P(Sedentary | Obesity) * P(Obesity) / P(Sedentary) = (0.07 * 0.10) / 0.05 = 0.14
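As a quick illustration, the posterior above can be computed directly from the three given probabilities. The following minimal Python sketch applies the same formula; the function and variable names are my own, not from the text:

```python
def bayes_posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Compute P(A|B) = P(B|A) * P(A) / P(B) via Bayes' Theorem."""
    if p_b == 0:
        raise ValueError("P(B) must be nonzero")
    return p_b_given_a * p_a / p_b

# The obesity example from the text:
p_obesity = 0.10        # P(Obesity)
p_sedentary = 0.05      # P(Sedentary)
p_sed_given_ob = 0.07   # P(Sedentary | Obesity)

print(bayes_posterior(p_sed_given_ob, p_obesity, p_sedentary))  # 0.14
```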
1.2 Conditional Probability:
The conditional probability of an event is the probability that the event occurs given that another event has occurred. For example, if we are interested in the probability of event A, and event B has occurred, then the "conditional probability of A given B" is mathematically expressed as P(A|B) [2]. The probability of a person having a fever may be considerably low, but if we have already observed that the person has malaria, then the probability of the person having a fever is considerably higher. The conditional probability formula can be expressed in mathematical form as follows:

P(A|B) = P(A∩B) / P(B)

where P(A∩B) is the joint probability of events A and B and P(B) is the probability of event B.
P(A|B) may or may not be equal to P(A), the unconditional probability of event A. When the conditional probability is equal to the unconditional probability, the two events are called independent of each other: the occurrence of event A does not affect the probability of event B, and vice versa. Bayes' Theorem can be derived from the definition of conditional probability. The probability of event A given event B is expressed as:

P(A|B) = P(A∩B) / P(B), given P(B) ≠ 0   (i)

The probability of event B given event A is expressed as:

P(B|A) = P(B∩A) / P(A), given P(A) ≠ 0

or, P(B∩A) = P(B|A) * P(A)   (ii)

Since P(A∩B) = P(B∩A), substituting equation (ii) into equation (i) gives:

P(A|B) = P(B|A) * P(A) / P(B)

which is the statement of Bayes' Theorem.
In probability theory, the chain rule allows us to calculate the joint probability distribution of any number of random variables using only conditional probabilities [3]. Let n events be denoted by A1, A2, A3, …, An. Then the joint probability distribution of the events A1, A2, A3, …, An can be expressed with the chain rule as follows:

P(A1, A2, A3, …, An) = P(A1) * P(A2|A1) * P(A3|A1,A2) * … * P(An|A1,A2,A3,…,An−1)
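As a numerical sanity check, the sketch below builds a joint distribution over three binary variables from its chain-rule factors and confirms the product sums to 1; the probabilities are my own illustrative values, not from the text:

```python
from itertools import product

# Hypothetical chain-rule factors for three binary variables A1, A2, A3.
p_a1 = {0: 0.6, 1: 0.4}                                # P(A1)
p_a2 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}      # p_a2[a1][a2] = P(A2=a2 | A1=a1)
p_a3 = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
        (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.3, 1: 0.7}}  # P(A3 | A1, A2)

# Chain rule: P(a1, a2, a3) = P(a1) * P(a2|a1) * P(a3|a1, a2)
joint = {(a1, a2, a3): p_a1[a1] * p_a2[a1][a2] * p_a3[(a1, a2)][a3]
         for a1, a2, a3 in product((0, 1), repeat=3)}

# A valid joint distribution must sum to 1 over all eight assignments.
assert abs(sum(joint.values()) - 1.0) < 1e-12
print(joint[(1, 0, 1)])  # 0.4 * 0.2 * 0.6 = 0.048
```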
1.3 Marginal Probability and Joint Probability:
In probability theory and statistics, the marginal probability is the probability of an event occurring without taking the other event(s) into consideration; that is, it marginalizes the other event(s). For example, when flipping a coin, the probability of getting heads is a marginal probability because we are not considering any other events that may affect the outcome [4]. In probability theory and statistics, the joint probability is the probability of two or more different events being observed together. For example, suppose a die is rolled. Let A and B be two discrete random variables associated with the outcome. A takes the value 1 when the number rolled is even (2, 4, 6) and 0 otherwise. B takes the value 1 when the number rolled is prime (2, 3, 5) and 0 otherwise.
Roll    1   2   3   4   5   6
A       0   1   0   1   0   1
B       0   1   1   0   1   0
In this case, the joint distribution of A and B expressed in terms of the probability mass function is as follows:

P(A=0, B=0) = P({1}) = 1/6
P(A=1, B=0) = P({4,6}) = 1/6 + 1/6 = 2/6
P(A=1, B=1) = P({2}) = 1/6
P(A=0, B=1) = P({3,5}) = 1/6 + 1/6 = 2/6

Given a set of random variables X1, X2, X3, …, Xn, the joint probability distribution can be expressed as P(X1, X2, …, Xn). The distribution assigns a probability to every combination of states of the random variables. One important aspect of the joint probability distribution of random variables is that it must be consistent with the marginal distributions of those variables. The marginalization rule is expressed as follows:

P(x) = Σy P(x, y)

Let's take for example the probability of a person being hit by a car while crossing the road at a pedestrian crossing without paying attention to the traffic light. There are two random variables: H, which takes the values {Hit, Not Hit}, and L, the color of the traffic signal, which takes the values {Red, Yellow, Green} [4]. In a practical scenario the probability of getting hit by a car depends on the color of the traffic light; that is, H is dependent on L. A person is more likely to get hit by a car when trying to cross while the light is green (for traffic) than when the light is red. For any given pair of values of H and L, the joint probability distribution of H and L must be considered to find the probability of both events happening together. In calculating the marginal probability of being hit, we ignore the color of the traffic light: P(H=Hit) can occur when the light is red, green, or yellow. So the marginal probability can be calculated by summing P(H=Hit | L) over all possible values of L, where each value of L is weighted by its probability of occurrence. The conditional probability table P(H|L) is as follows:

           Green   Yellow   Red
Hit        0.90    0.10     0.02
Not Hit    0.10    0.90     0.98
The sum of the probabilities in each column is 1, because the probability of being hit plus the probability of not being hit is 1 regardless of the state of the light. Now, let the probabilities of a red, yellow, and green light be P(L=Red) = 0.2, P(L=Yellow) = 0.1, and P(L=Green) = 0.7, respectively. Then the joint probability distribution P(H, L) can be calculated as:
           Green   Yellow   Red     Marginal P(H)
Hit        0.63    0.01     0.004   0.644
Not Hit    0.07    0.09     0.196   0.356
Total      0.70    0.10     0.200
As we can see, the marginal probability is the sum of the joint probability distribution over all possible values of the other random variable, in this case the color of the light being red, green, or yellow.
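The table above can be reproduced mechanically: multiply each column of P(H|L) by the prior P(L) to get the joint P(H, L), then sum across each row for the marginal. A minimal sketch, with variable names of my own:

```python
# Conditional probability table P(H|L) and prior P(L) from the text.
p_h_given_l = {
    ("Hit", "Green"): 0.90, ("Hit", "Yellow"): 0.10, ("Hit", "Red"): 0.02,
    ("Not Hit", "Green"): 0.10, ("Not Hit", "Yellow"): 0.90, ("Not Hit", "Red"): 0.98,
}
p_l = {"Green": 0.7, "Yellow": 0.1, "Red": 0.2}

# Joint distribution: P(H, L) = P(H|L) * P(L).
p_joint = {(h, l): p_h_given_l[(h, l)] * p_l[l] for (h, l) in p_h_given_l}

# Marginalization rule: P(H) = sum over L of P(H, L).
p_hit = sum(p_joint[("Hit", l)] for l in p_l)          # 0.644
p_not_hit = sum(p_joint[("Not Hit", l)] for l in p_l)  # 0.356
print(p_hit, p_not_hit)
```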
Chapter 2
2.1 Bayesian Networks:
Bayesian networks (BNs) are also known as belief networks. BNs are probabilistic graphical models used to represent information in an uncertain domain. A BN is based on the random variables of the domain and the interactions between them. Every node in the graph represents a random variable, and the edges between the nodes represent the dependencies between the corresponding variables. The conditional dependencies in the graph are often estimated using graphical and statistical principles. Hence, BNs combine principles from graph theory, probability theory, and statistics [6].
Let us take an example that models an athlete in a national competition. The random variables used in this example are fitness, difficulty, rank, PFT score, and certificate:

i. Fitness denotes the fitness of the athlete achieved through practice. It can take two values: High (f1) and Low (f0).
ii. Difficulty denotes the difficulty the athlete faces in the competition. It can take two values: Hard (d1) and Easy (d0).
iii. Rank refers to the position of the athlete in the competition. It can take three values: 1st (r1), 2nd (r2), and 3rd (r3).
iv. PFT (Physical Fitness Test) refers to the score achieved by the athlete in the standardized physical fitness test. It can take two values: High (p1) and Low (p0).
v. Certificate refers to the certificate that the athlete receives from the government based on his performance. It takes the values good and bad.
The marginal distributions of Difficulty and Fitness are illustrated in Table 1. Table 1: Marginal probability distributions of Difficulty and Fitness.
D     P(D)
d0    0.6
d1    0.4

F     P(F)
f0    0.7
f1    0.3
The joint distribution over the rank and fitness of the athlete is illustrated in Table 2. Table 2: Joint probability distribution over rank and fitness of the athlete.
           Fitness
Rank       High    Low     Total
1st        0.19    0.09    0.28
2nd        0.07    0.25    0.32
3rd        0.04    0.36    0.40
Total      0.30    0.70    1.00
In Table 2, summing the probabilities down each Fitness column (over the ranks 1st, 2nd, and 3rd) yields the marginal distribution of Fitness (0.30 for High and 0.70 for Low), and summing across each row (over High and Low) yields the marginal distribution of Rank. In a joint probability distribution, the sum of all the entries must add up to 1, which holds in Table 2.
[Figure omitted in this text version: a DAG with nodes Difficulty, Fitness, Rank, PFT, and Certificate. Difficulty and Fitness are root nodes with the priors from Table 1; Rank depends on Difficulty and Fitness; PFT depends on Fitness; Certificate depends on Rank. Each node is annotated with its (conditional) probability table.]
Figure 1: A Bayesian network for the Athlete model
The joint probability distribution gives the probability of occurrence of every possible combination of values of all the random variables in the system. Given a joint probability distribution, we are often interested in the prior probability of a variable, or in the conditional probability of the variable given the states of one or more other variables in the system. From a computational point of view, this cannot be achieved easily. Take for instance N random variables associated with a domain, each of which can take binary values (0, 1). The joint distribution in that case would contain 2^N entries. In the practical world the value of N can be very large, and each random variable may take many possible values. So it would be very inefficient to build such a huge joint distribution table: it would take considerable computing power, and memory utilization would also be inefficient, as only a specific part of the table would be used frequently. To solve this problem more efficiently, the dependencies between the random variables should be taken into account and the independent random variables should be identified. If two random variables do not depend on each
other, no additional information can be derived from their joint probability distribution. So we can drop those joint probability entries for the independent random variables, which reduces the overall complexity. One simple and efficient way of representing dependencies between random variables is with graphs: each random variable is represented by a node, and the edges between the nodes represent the dependencies among them. There are two types of graphs used with probabilistic graphical models: directed and undirected. A Markov network is a graphical model with undirected edges. This network represents independence between two distinct nodes using the concept of Markov blankets [6]. Undirected graphical models are used extensively in the fields of computer vision and statistical physics. BNs are probabilistic graphical models based on directed acyclic graphs (DAGs). They are mathematically rigorous and intuitively understandable [6], and they offer an easy-to-understand representation and an efficient way of calculating the joint probability distribution over a set of random variables. BNs are used extensively in machine learning, statistics, and artificial intelligence. In the following paragraphs, I will focus on BNs.

The directed acyclic graph of a BN consists of two sets: a set of nodes and a set of directed edges. The random variables of the system form the set of nodes in the graph, and the directed edges between nodes constitute the dependencies between variables. A directed edge from node Xi to node Xj means that the variables are dependent, or in other words that Xi influences Xj. The node with variable Xi is called the parent of Xj, and similarly Xj is called the child of Xi. The set of descendants of a node is the set of nodes that can be reached on a directed path from that node; the set of ancestors is the set of nodes from which there is a directed path to the node. A BN is a DAG: no cycles are allowed in the graph. The factorization of the joint probability of a collection of nodes is based on the DAG structure, which represents the conditional dependencies between variables through the directed edges. Even though the arrows represent direct causal relationships between nodes, information can flow in either direction during the reasoning process [7]. A BN encodes conditional independence between random variables: each variable is independent of its non-descendants in the graph given the state of its parents [6]. This property helps to significantly reduce the number of parameters required to characterize the joint probability distribution of the variables, which makes the calculation of posterior probabilities more efficient. The Bayes net assumption says: "Each variable is conditionally independent of its non-descendants, given its parents" [22].
It is possible to explain independence using this statement, but d-separation is a more formal process to determine independence [22]. Let's explain this with the help of three variables X, Y, and Z. Using d-separation we check, in different scenarios, whether X and Y are conditionally independent given Z. The test moralizes the graph (marries any unmarried parents of a common child) and then deletes the given node Z; X and Y are d-separated if they are no longer connected.

Causal trail: X → Z → Y. After moralizing and then deleting Z, X and Y are not connected, so X and Y are conditionally independent given Z.
Evidential trail: X ← Z ← Y. After moralizing and then deleting Z, X and Y are not connected, so they are conditionally independent given Z.
Common cause: X ← Z → Y. After moralizing and then deleting Z, X and Y are no longer connected, so they are conditionally independent given Z.
Common effect: X → Z ← Y. Moralizing marries X and Y (adding the edge X – Y), so after deleting Z they remain connected: X and Y are not conditionally independent given Z.

So we can conclude that if two variables are d-separated, then they are conditionally independent. But if they are not d-separated, then they are not necessarily dependent.

Along with the directed acyclic graph, the quantitative parameters of the model must be specified. Following the Markov property, the conditional probability distribution at each node depends only on its parents. In the case of discrete random variables, each node stores a conditional probability table that lists the probability of each possible value of the node's variable for every combination of values of its parents. These local conditional probability tables can be used to determine the joint probability distribution of the set of variables.

Using the above observations, a formal definition of a BN can be formulated. A BN B is a directed acyclic graph that represents a reduced joint probability distribution over a set of random variables V [6]. The network is defined by a pair B = <G, Θ>, where G is the DAG over V and Θ is the set of parameters θxi|Πi = PB(xi | Πi), with Πi denoting the set of parents of Xi in graph G [6]. So B defines a unique joint probability distribution over V, given as follows:

PB(X1, X2, …, Xn) = ∏i PB(Xi | Πi) = ∏i θXi|Πi
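To make the factorization concrete, the sketch below computes the probability of one full assignment in a tiny two-node network D → R (a cut-down, hypothetical version of the athlete model; the CPT numbers are illustrative only, not taken from Figure 1):

```python
# A tiny hypothetical BN: D -> R (Difficulty influences Rank).
p_d = {"d0": 0.6, "d1": 0.4}                       # P(D), as in Table 1
p_r_given_d = {                                     # P(R | D), hypothetical
    ("r1", "d0"): 0.5, ("r2", "d0"): 0.3, ("r3", "d0"): 0.2,
    ("r1", "d1"): 0.1, ("r2", "d1"): 0.3, ("r3", "d1"): 0.6,
}

def joint(d: str, r: str) -> float:
    """P(D=d, R=r) = P(d) * P(r|d): each factor conditions only on parents."""
    return p_d[d] * p_r_given_d[(r, d)]

# Probability of the full assignment D=d1, R=r1:
print(joint("d1", "r1"))  # 0.4 * 0.1 = 0.04
```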
If node Xi has parents, it is called a conditional node; otherwise it is called an unconditional node. If the variable is represented by an observable node, it is called an evidence node; otherwise it is called a hidden node. Nodes can serve as inputs (evidence), and any node can also serve as output, based on the way BN inference algorithms exploit Bayes' Theorem to compute prior and posterior probabilities of variables given some evidence.
2.2 Complexity of Bayesian Networks:
Determining the interdependence between the random variables in a domain takes exponential time if the structure of the BN is unknown. In fact, inference with BNs is NP-hard [https://www.sciencedirect.com/science/article/pii/000437029390036B], and even approximate inference to within a fixed degree of accuracy is NP-hard [https://www.sciencedirect.com/science/article/pii/000437029390036B]. As there is no knowledge about the correlations between the random variables, the joint probability distribution cannot be reduced. Hence, in a BN classification model there is a need to determine the joint probability distribution conditioned on the class. The classification model is defined by the following formula:
P(C | X1, …, XN) = P(X1, …, XN | C) * P(C) / P(X1, …, XN)
During classification we need to find the maximum-likelihood class, that is, the class value with the maximum probability. Since the denominator is the same in the probability calculation for each class value, we can ignore it and take the class with the highest unnormalized score as the most likely class (or normalize the scores if actual probabilities are required).
The values of P(X1, …, XN | C) must be calculated and stored for each possible combination of attribute values and class labels. These values can then be used for the calculation of P(C | X1, …, XN).
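The "ignore the denominator" step amounts to an argmax over unnormalized scores P(X1, …, XN | c) * P(c). A minimal sketch, assuming a lookup table of class-conditional probabilities; the table and names here are hypothetical:

```python
# Hypothetical lookup tables for a two-class problem over one observed
# attribute vector x; in a real model these come from training counts.
prior = {"c0": 0.7, "c1": 0.3}          # P(C)
likelihood = {"c0": 0.02, "c1": 0.10}   # P(X1..XN = x | C)

# Unnormalized score per class; the shared denominator P(x) is dropped.
scores = {c: likelihood[c] * prior[c] for c in prior}
best = max(scores, key=scores.get)      # maximum-likelihood class
print(best, scores)                     # normalize scores if probabilities
                                        # are actually required
```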
Let N be the number of attributes in the data set, and let C be the number of class labels. Assume each attribute takes only discrete values.
Let A1, …, AN be the numbers of possible values of each of the attributes, and let Amax be the maximum of {A1, …, AN}.
The joint probability distribution of X1, …, XN needs to be calculated conditioned on the class. Take for instance N = 2, where one attribute can take 5 values and the other attribute can take 6 values; then we have 30 combinations, so the total number of values that needs to be calculated is 30 * C.
Assuming the counts for every possible combination can be gathered in one scan through the entire training data set, the total number of values to compute is:

Total computations = (A1 * A2 * … * AN) * C ≤ Amax^N * C = O(Amax^N * C)
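The sketch below (function and variable names are my own) illustrates this growth by counting the entries of the full class-conditional joint table for a few attribute counts:

```python
from math import prod

def joint_table_size(cardinalities: list[int], num_classes: int) -> int:
    """Entries in the full joint table P(X1..XN | C): (A1 * ... * AN) * C."""
    return prod(cardinalities) * num_classes

print(joint_table_size([5, 6], 2))    # the text's example: 30 * 2 = 60
print(joint_table_size([5] * 10, 2))  # 10 attributes: ~19.5 million entries
print(joint_table_size([5] * 20, 2))  # 20 attributes: ~190 trillion entries
```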
Therefore, the complexity of calculating the joint probability distribution is exponential in the number of attributes. In practice we need to impose some restrictions on the interactions between the variables in order to reduce the complexity of the calculations.
The following sections focus on reducing the complexity of the BN by imposing restrictions on the interactions between the random variables, and study how this affects the performance of the classification model.
Chapter 3
3.1 Naive Bayes Model
The NB model is a popular classification technique with certain limitations. It is a very narrow and specialized form of Bayesian network. The NB model assumes that the attributes are conditionally independent of each other given the class. The structure of the NB model is shown below:
[Figure omitted in this text version: a class node C with a directed edge to each attribute node X1, X2, …, XN.]
Figure 2: General structure of the Naïve Bayes model
Here, C represents the class variable, which can take any of the possible class values, and X1, X2, …, XN represent the attributes, the random variables associated with the system. Given the class label as evidence, all the attributes in an NB model are independent of each other. This significantly reduces the complexity of inferring probabilities with NB models. However, it is simplistic to assume that attributes are unrelated to each other. Take for instance a person who decides to play soccer based on the weather conditions that day. Let the random variables involved in the system be outlook, temperature, humidity, and wind. The class variable in this case takes two values, yes and no, which predict whether the person will play soccer on that day. As the random variables are conditionally independent of each other given the class, the joint distribution reduces to:

P(C, X1, …, XN) = P(C) * ∏i P(Xi | C)
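A minimal NB classification sketch under this assumption follows; the training rows and attribute values are hypothetical, and the Laplace correction of Section 3.3 is omitted for brevity:

```python
from collections import Counter, defaultdict

# Tiny hypothetical training set for the soccer example:
# (outlook, temperature, humidity, wind) -> play?
data = [
    (("sunny", "hot", "high", "weak"), "no"),
    (("overcast", "hot", "high", "weak"), "yes"),
    (("rain", "mild", "high", "weak"), "yes"),
    (("rain", "cool", "normal", "strong"), "no"),
    (("sunny", "mild", "normal", "strong"), "yes"),
]

class_counts = Counter(label for _, label in data)
# attr_counts[(attribute index, value, class)] = count
attr_counts = defaultdict(int)
for features, label in data:
    for i, value in enumerate(features):
        attr_counts[(i, value, label)] += 1

def classify(features):
    """Score each class by P(c) * prod_i P(x_i | c) and take the argmax."""
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                        # P(c)
        for i, value in enumerate(features):
            score *= attr_counts[(i, value, c)] / n_c  # P(x_i | c)
        scores[c] = score
    return max(scores, key=scores.get)

print(classify(("rain", "mild", "high", "weak")))  # "yes" on this toy data
```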