Bayesian Methods: Naïve Bayes

Nicholas Ruozzi, University of Texas at Dallas (based on the slides of Vibhav Gogate)

Last Time
• Parameter learning: learning the parameter of a simple coin-flipping model
• Prior distributions
• Posterior distributions
• Today: more parameter learning and naïve Bayes

Maximum Likelihood Estimation (MLE)
• Data: an observed set of $\alpha_H$ heads and $\alpha_T$ tails
• Hypothesis: coin flips follow a binomial distribution
• Learning: find the "best" $\theta$
• MLE: choose $\theta$ to maximize the likelihood (the probability of $D$ given $\theta$):
$$\theta_{MLE} = \arg\max_\theta \, p(D \mid \theta)$$

MAP Estimation
• Choosing $\theta$ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation:
$$\theta_{MAP} = \arg\max_\theta \, p(\theta \mid D)$$
• The only difference between $\theta_{MLE}$ and $\theta_{MAP}$ is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior

Sample Complexity
• How many coin flips do we need in order to guarantee that our learned parameter does not differ too much from the true parameter (with high probability)?
• Can use the Chernoff bound (again!)
– Suppose $Y_1, \ldots, Y_N$ are i.i.d. random variables taking values in $\{0, 1\}$ such that $E_p[Y] = y$. For $\epsilon > 0$,
$$p\left(\left|y - \frac{1}{N}\sum_{i=1}^{N} Y_i\right| \ge \epsilon\right) \le 2e^{-2N\epsilon^2}$$
– For the coin-flipping problem with $X_1, \ldots, X_N$ i.i.d. coin flips and $\epsilon > 0$, the sample average of the flips is exactly $\theta_{MLE}$, so
$$p\left(\left|\theta_{true} - \theta_{MLE}\right| \ge \epsilon\right) \le 2e^{-2N\epsilon^2}$$
– To guarantee error at most $\epsilon$ with probability at least $1 - \delta$, set the bound to $\delta$:
$$\delta \ge 2e^{-2N\epsilon^2} \;\Rightarrow\; N \ge \frac{1}{2\epsilon^2}\ln\frac{2}{\delta}$$

MLE for Gaussian Distributions
• A two-parameter distribution characterized by a mean and a variance

Some Properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
– $X \sim N(\mu, \sigma^2)$ and $Y = aX + b \;\Rightarrow\; Y \sim N(a\mu + b, a^2\sigma^2)$
• The sum of independent Gaussians is Gaussian
– $X \sim N(\mu_X, \sigma_X^2)$, $Y \sim N(\mu_Y, \sigma_Y^2)$, and $Z = X + Y \;\Rightarrow\; Z \sim N(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$
• Easy to differentiate, as we will see soon!

Learning a Gaussian
• Collect data
– Hopefully i.i.d. samples
– e.g., exam scores:

Exam  Score
0     85
1     95
2     100
3     12
…     …
99    89

• Learn the parameters
– Mean: $\mu$
– Variance: $\sigma^2$

MLE for a Gaussian
• Probability of $N$ i.i.d. samples $D = (x^{(1)}, \ldots, x^{(N)})$:
$$p(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right)$$
• Log-likelihood of the data:
$$\ln p(D \mid \mu, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}$$

MLE for the Mean of a Gaussian
$$\frac{\partial}{\partial \mu} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial \mu}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right] = \sum_{i=1}^{N} \frac{x^{(i)} - \mu}{\sigma^2} = -\frac{N\mu - \sum_{i=1}^{N} x^{(i)}}{\sigma^2} = 0$$
$$\Rightarrow\; \mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}$$

MLE for the Variance of a Gaussian
$$\frac{\partial}{\partial \sigma} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial \sigma}\left[-\frac{N}{2}\ln(2\pi\sigma^2)\right] - \frac{\partial}{\partial \sigma}\sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2} = -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{\sigma^3} = 0$$
$$\Rightarrow\; \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$$

Learning Gaussian Parameters
$$\mu_{MLE} = \frac{1}{N}\sum_{i=1}^{N} x^{(i)}, \qquad \sigma^2_{MLE} = \frac{1}{N}\sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$$
• The MLE for the variance of a Gaussian is biased
– The expected value of the estimate is not the true parameter!
– Unbiased variance estimator:
$$\sigma^2_{unbiased} = \frac{1}{N-1}\sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$$
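As a quick sanity check on the Gaussian estimators above, here is a minimal sketch (not part of the original slides) that computes the MLE mean, the biased MLE variance, and the unbiased variance from a handful of made-up exam scores.

```python
import numpy as np

# Made-up exam scores, treated as i.i.d. samples from a Gaussian
x = np.array([85.0, 95.0, 100.0, 12.0, 89.0])
N = len(x)

mu_mle = x.sum() / N                                 # (1/N) * sum_i x^(i)
var_mle = ((x - mu_mle) ** 2).sum() / N              # biased MLE: divide by N
var_unbiased = ((x - mu_mle) ** 2).sum() / (N - 1)   # unbiased: divide by N - 1

print(mu_mle, var_mle, var_unbiased)
# np.var(x) reproduces var_mle (ddof=0); np.var(x, ddof=1) reproduces var_unbiased.
```

The only difference between the last two estimates is the divisor, which is exactly where the bias comes from.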
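The same few lines of arithmetic cover the coin-flipping model and the Chernoff bound from earlier in the lecture. In the sketch below, the observed counts and the Beta(a, b) prior are hypothetical choices for illustration, and the MAP formula used is the standard Beta posterior mode rather than anything specific to these slides.

```python
import math

alpha_H, alpha_T = 72, 28   # hypothetical observed counts of heads and tails
a, b = 5, 5                 # hypothetical Beta(a, b) prior hyperparameters

# MLE: maximize p(D | theta)
theta_mle = alpha_H / (alpha_H + alpha_T)

# MAP with a Beta(a, b) prior: the posterior is Beta(alpha_H + a, alpha_T + b),
# and its mode gives the usual closed-form MAP estimate (valid for a, b > 1)
theta_map = (alpha_H + a - 1) / (alpha_H + alpha_T + a + b - 2)

# Chernoff-bound sample complexity: N >= ln(2 / delta) / (2 * eps^2)
eps, delta = 0.1, 0.05
N_required = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(theta_mle, theta_map, N_required)   # 0.72, ~0.704, 185
```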
Bayesian Categorization/Classification
• Given features $x = (x_1, \ldots, x_m)$, predict a label $y$
• If we had a joint distribution over $x$ and $y$, then given $x$ we could find the label using MAP inference:
$$\arg\max_y \, p(y \mid x_1, \ldots, x_m)$$
• We can compute this in exactly the same way that we did before, using Bayes' rule:
$$p(y \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_m \mid y)\, p(y)}{p(x_1, \ldots, x_m)}$$

Article Classification
• Given a collection of news articles labeled by topic, the goal is to predict the topic of a new article
• One possible feature vector: one feature for each word in the document, in order
– $x_i$ corresponds to the $i$-th word in the document
– $x_i$ can take a different value for each word in the dictionary

Text Classification
• $x_1, x_2, \ldots$ is the sequence of words in the document
• The set of all possible feature vectors (and hence $p(y \mid x)$) is huge
– An article has at least 1000 words: $x = (x_1, \ldots, x_{1000})$
– $x_i$ represents the $i$-th word in the document and can be any word in the dictionary – at least 10,000 words
– $10{,}000^{1000} = 10^{4000}$ possible values
– Atoms in the universe: $\sim 10^{80}$

Bag of Words Model
• Typically assume that position in the document doesn't matter:
$$p(X_i = x \mid Y = y) = p(X_k = x \mid Y = y)$$
– All positions have the same distribution
– Ignores the order of words
– Sounds like a bad assumption, but it often works well!
• Features: the set of all possible words and their corresponding frequencies (the number of times each word occurs in the document), e.g.,

aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
…
gas 1
…
oil 1
…
Zaire 0

Need to Simplify Somehow
• Even with the bag-of-words assumption, there are too many possible outcomes
– Too many probabilities $p(x_1, \ldots, x_m \mid y)$
• Can we assume that some of them are the same?
$$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$$
– This is a conditional independence assumption

Conditional Independence
• $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution of $X$ is independent of the value of $Y$, given the value of $Z$:
$$p(X \mid Y, Z) = p(X \mid Z)$$
• Equivalent to
$$p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$$

Naïve Bayes
• Naïve Bayes assumption: features are independent given the class label
$$p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$$
– More generally,
$$p(x_1, \ldots, x_m \mid y) = \prod_{i=1}^{m} p(x_i \mid y)$$
• How many parameters now?
– Suppose $x$ is composed of $m$ binary features

The Naïve Bayes Classifier
• Given (the graphical model has $Y$ as the parent of $X_1, X_2, \ldots, X_m$):
– a prior $p(y)$
– $m$ features $X_1, \ldots, X_m$ that are conditionally independent given the class $Y$
– for each $X_i$, the likelihood $p(X_i \mid Y)$
• Classify via
$$y^* = h_{NB}(x) = \arg\max_y \, p(y)\, p(x_1, \ldots, x_m \mid y) = \arg\max_y \, p(y) \prod_{i=1}^{m} p(x_i \mid y)$$

MLE for the Parameters of Naïve Bayes
• Given a dataset, count the occurrences of all pairs
– $Count(X_i = x, Y = y)$ is the number of samples in which $X_i = x$ and $Y = y$
• MLE for discrete naïve Bayes:
$$p(Y = y) = \frac{Count(Y = y)}{\sum_{y'} Count(Y = y')}$$
$$p(X_i = x \mid Y = y) = \frac{Count(X_i = x, Y = y)}{\sum_{x'} Count(X_i = x', Y = y)}$$

Subtleties of the NB Classifier: #1
• Usually, the features are not conditionally independent:
$$p(x_1, \ldots, x_m \mid y) \ne \prod_{i=1}^{m} p(x_i \mid y)$$
• The naïve Bayes assumption is often violated, yet the classifier performs surprisingly well in many cases
• A plausible reason: we only need the probability of the correct class to be the largest!
– Example: in binary classification, we just need to land on the correct side of 0.5, not recover the actual probability (0.51 works as well as 0.99)

Subtleties of the NB Classifier: #2
• What if you never see a training instance with $X_1 = a$ and $Y = b$?
– Example: you did not see the word "Enlargement" in spam!
– Then $p(X_1 = a \mid Y = b) = 0$
– Thus, no matter what values $X_2, \ldots, X_m$ take,
$$p(X_1 = a, X_2 = x_2, \ldots, X_m = x_m \mid Y = b) = 0$$
– Why?
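To see the zero-count problem concretely, here is a small bag-of-words naïve Bayes sketch that uses the plain MLE counts above; the two-class toy corpus is made up for illustration. A single word that never appears with a class drives that class's entire product of likelihoods to zero, which motivates the fix discussed next.

```python
from collections import Counter

# Toy labeled corpus (made-up documents): label -> list of documents
train = {
    "spam": ["cheap oil offer", "cheap gas offer now"],
    "news": ["oil prices rise", "gas prices fall in Zaire"],
}

# MLE parameters from counts
class_counts = {y: len(docs) for y, docs in train.items()}
word_counts = {y: Counter(w for d in docs for w in d.split()) for y, docs in train.items()}
total_words = {y: sum(c.values()) for y, c in word_counts.items()}
prior = {y: class_counts[y] / sum(class_counts.values()) for y in train}

def likelihood(word, y):
    # p(X = word | Y = y) = Count(word, y) / sum_x' Count(x', y), with no smoothing
    return word_counts[y][word] / total_words[y]

def classify(doc):
    scores = {}
    for y in train:
        p = prior[y]
        for w in doc.split():
            p *= likelihood(w, y)   # a single unseen (word, class) pair zeroes this out
        scores[y] = p
    return scores

print(classify("oil gas"))           # both classes get a nonzero score
print(classify("oil enlargement"))   # "enlargement" was never seen: every score is 0
```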
Subtleties of the NB Classifier: #2 (continued)
• To fix this, use a prior!
– We already saw how to do this in the coin-flipping example using the Beta distribution
– For naïve Bayes over discrete spaces, we can use the Dirichlet prior
– The Dirichlet distribution is a distribution over $z_1, \ldots, z_k \in (0, 1)$ with $z_1 + \cdots + z_k = 1$, characterized by $k$ parameters $\alpha_1, \ldots, \alpha_k$:
$$f(z_1, \ldots, z_k; \alpha_1, \ldots, \alpha_k) \propto \prod_{i=1}^{k} z_i^{\alpha_i - 1}$$
• This is called smoothing: what are the MAP estimates under these kinds of priors?
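For illustration, the sketch below swaps the unsmoothed likelihood from the earlier toy example for an add-α estimate, which is the kind of estimate a symmetric Dirichlet prior produces (α = 1 gives classic Laplace, or add-one, smoothing); the vocabulary and the α value are illustrative choices, not from the slides.

```python
from collections import Counter

def smoothed_likelihoods(class_docs, vocab, alpha=1.0):
    """Add-alpha estimate of p(word | class) over a fixed vocabulary.

    alpha = 1.0 is Laplace (add-one) smoothing; larger alpha pulls the
    estimates harder toward the uniform distribution over the vocabulary.
    """
    counts = Counter(w for doc in class_docs for w in doc.split())
    denom = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / denom for w in vocab}

# Toy usage: a word never seen in this class now gets alpha / denom > 0
# instead of 0, so one unseen word no longer vetoes the whole class.
vocab = {"cheap", "oil", "gas", "offer", "now", "prices", "enlargement"}
probs = smoothed_likelihoods(["cheap oil offer", "cheap gas offer now"], vocab)
print(probs["enlargement"])   # small but nonzero
```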