Structured Features in Naive Bayes Classification
Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16)

Arthur Choi (Computer Science Department, University of California, Los Angeles)
Nazgol Tavabi (Department of Computer Engineering, Sharif University of Technology)
Adnan Darwiche (Computer Science Department, University of California, Los Angeles)

Copyright © 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We propose the structured naive Bayes (SNB) classifier, which augments the ubiquitous naive Bayes classifier with structured features. SNB classifiers facilitate the use of complex features, such as combinatorial objects (e.g., graphs, paths and orders), in a general but systematic way. Underlying the SNB classifier is the recently proposed Probabilistic Sentential Decision Diagram (PSDD), which is a tractable representation of probability distributions over structured spaces. We illustrate the utility and generality of the SNB classifier via case studies. First, we show how we can distinguish players of simple games in terms of play style and skill level, based purely on observing the games they play. Second, we show how we can detect anomalous paths taken on graphs, based purely on observing the paths themselves.

Introduction

Naive Bayes (NB) classifiers and their extensions are ubiquitous in domains such as data mining, artificial intelligence, and machine learning (Domingos and Pazzani 1997; Friedman, Geiger, and Goldszmidt 1997; Ng and Jordan 2001). In the classical version of these classifiers, attributes (aka features) are represented as discrete or continuous random variables. For discrete attributes, each variable typically has a small or manageable number of values. For continuous attributes, it is similarly assumed that variables have manageable distributions, such as a Gaussian. As a result, these classifiers are typically used to classify instances that have simple features, such as colors or intensities, words or frequencies, scores, etc. In contrast, our interest in this paper is in naive Bayes classifiers with structured features that can have an exponential number of values, representing complex entities such as combinatorial objects (e.g., permutations, graphs, paths and game traces).

To realize this objective, we propose the structured naive Bayes (SNB) classifier, which augments the NB classifier by allowing it to efficiently accommodate structured attributes. This extension is enabled by two recent advances in knowledge compilation. The first advance is the Sentential Decision Diagram (SDD), which we use to compactly represent attributes that may have an exponential number of values (Darwiche 2011; Xue, Choi, and Darwiche 2012; Choi and Darwiche 2013; Van den Broeck and Darwiche 2015). The second advance is the Probabilistic Sentential Decision Diagram (PSDD), which we use to represent and learn distributions over such attributes (Kisa et al. 2014a; Choi, Van den Broeck, and Darwiche 2015). Our goal is to realize the above objective while maintaining two attractive properties of NB classifiers: (1) the ability to learn and then classify in time that is linear in the size of the classifier, and (2) the ability to use the classifier in a systematic manner across multiple domains.

NB classifiers are defined by a class variable C and a set of simple attributes X_i, each represented by a discrete or continuous variable. In addition, an SNB classifier includes a set of SDDs S_i, each representing a structured attribute. The size of an NB classifier is characterized by the number of classes, the number of attributes, and the number of values that the (discrete) variables X_i can attain. The size of the SNB classifier is further characterized by the number and size of the SDDs S_i. As with NB classifiers, we show that one can learn and classify with SNB classifiers in time that is linear in their size.

We also show that using structured attributes for a particular domain amounts to simply defining the SDD corresponding to each attribute. Once the SDDs are defined, learning from data, and then classifying instances, can all proceed in a systematic and domain-independent manner. We illustrate this systematic use of structured attributes through case studies, which delineate the domain-specific investment needed. In particular, we consider two different classification tasks:

1. determining the play style and skill level of players in games such as tic-tac-toe and hex, given examples of games that they have played;

2. detecting anomalies in paths taken on graphs, based on observing the paths taken in a normal operating mode.

There are two alternative approaches to the classification tasks targeted by the SNB classifier. The first is to simply use a classical NB classifier with attributes that have an exponential number of values (using sparse, tabular representations of the corresponding distributions). The second is to use a dedicated approach that is specific to the domain at hand. We comment on both approaches in the context of our case studies, arguing for the advantage of using SNB classifiers more broadly.

This paper is organized as follows. We first review the classical NB classifier and then show how structured attributes can be represented by SDDs and their distributions by PSDDs. We follow by proposing the SNB classifier. The final part of the paper illustrates the utility and generality of SNB classifiers via case studies, showing how one can define and represent SNB classifiers, learn them from data, and finally use them for classification.

Naive Bayes Classifiers

[Figure 1: A naive Bayes classifier. The class node C has children X_1, X_2, ..., X_n.]

A naive Bayes classifier corresponds to a Bayesian network, as in Figure 1. Here, we have a single class variable C and m attribute variables X_i (for simplicity of exposition, we assume that attributes are discrete). Let c denote a class label and x_i denote a value of an attribute X_i. A naive Bayes classifier thus induces the distribution

\[
\Pr(c, x_1, \ldots, x_m) = \Pr(c) \cdot \prod_{i=1}^{m} \Pr(x_i \mid c) \tag{1}
\]

where we have a class prior Pr(C) and conditional distributions Pr(X_i | C). We can estimate these parameters from (labeled) data, using maximum likelihood or MAP estimation. Once we have learned a naive Bayes classifier from data, we can label new instances by selecting the class label c* that has maximum posterior probability given observations x_1, ..., x_m. That is, we select

\[
c^\ast = \operatorname*{argmax}_{c} \Pr(c \mid x_1, \ldots, x_m).
\]

The SNB classifier has a structure similar to the NB classifier of Figure 1. The one exception is that the attributes X_i need not be discrete (or continuous) variables, as they can also be SDDs. The only domain-specific investment when using an SNB classifier is to define these SDDs. We thus illustrate this process next, before formally defining the SNB classifier.
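To make Equation 1 concrete, here is a minimal sketch of a discrete naive Bayes classifier in Python; the toy dataset and helper names are our own invention, not code from the paper. It estimates Pr(C) and Pr(X_i | C) with Laplace smoothing (a simple form of MAP estimation) and classifies by maximizing the joint in Equation 1, which is equivalent to maximizing the posterior since the evidence term is the same for every class label.

```python
# A minimal discrete naive Bayes sketch (our illustration, not the paper's
# code). Learning estimates Pr(C) and Pr(X_i | C) from labeled data with
# Laplace smoothing; classification maximizes the joint of Equation 1.
from collections import Counter, defaultdict

def learn_nb(data, labels, alpha=1.0):
    classes = sorted(set(labels))
    m = len(data[0])
    label_counts = Counter(labels)
    prior = {c: (label_counts[c] + alpha) / (len(labels) + alpha * len(classes))
             for c in classes}
    counts = [defaultdict(Counter) for _ in range(m)]  # counts[i][c][x]
    domains = [set() for _ in range(m)]                # observed values of X_i
    for x, c in zip(data, labels):
        for i, xi in enumerate(x):
            counts[i][c][xi] += 1
            domains[i].add(xi)

    def cond(i, xi, c):
        # Smoothed estimate of Pr(X_i = xi | C = c)
        total = sum(counts[i][c].values())
        return (counts[i][c][xi] + alpha) / (total + alpha * len(domains[i]))

    return classes, prior, cond

def classify(instance, classes, prior, cond):
    def joint(c):  # Equation 1; the argmax over c also maximizes the posterior
        p = prior[c]
        for i, xi in enumerate(instance):
            p *= cond(i, xi, c)
        return p
    return max(classes, key=joint)

# Hypothetical toy data: two binary attributes, two class labels.
data = [(0, 1), (0, 0), (1, 1), (1, 0)]
labels = ["a", "a", "b", "b"]
model = learn_nb(data, labels)
print(classify((1, 1), *model))  # prints: b
```

Note that learning takes one pass over the data and classification takes one product per class, so both run in time linear in the size of the classifier, as claimed above.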
Representing Structured Attributes

Suppose that we have an attribute whose values are the set of n! permutations (total rankings) over n items. For example, we may want to build a classifier for some demographic, based on the rankings that its members provide. In the standard NB classifier, we would represent this attribute by a discrete variable with n! values. In our proposal, this attribute is instead defined by an SDD, which is obtained by a two-step process. The first step is based purely on Boolean reasoning, in which we define a Boolean formula whose models correspond to the n! permutations. The second step corresponds to compiling this formula into an SDD. The first step can be thought of as a modeling step, and is the one requiring the main domain-specific investment. The second step is purely computational and is automated to a large extent. The key observation here is that while the attribute may have an exponential number of values, the corresponding SDD may not be exponentially sized.

To illustrate the first step, consider the Boolean variables A_ij for i, j ∈ {1, ..., n}. Here, the index i represents an item and the index j represents its position in a permutation of n items (Choi, Van den Broeck, and Darwiche 2015). There are 2^(n^2) instantiations of our Boolean variables, many of which do not correspond to valid permutations. In particular, some of these instantiations will place two items in the same position, or place the same item in multiple positions. However, using Boolean constraints, we can rule out all of these invalid instantiations, keeping only the ones that correspond to valid permutations.
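As a quick sanity check on this encoding (our own illustration, not code from the paper), one can enumerate all 2^(n^2) instantiations for a small n and keep only those in which every item occupies exactly one position and every position holds exactly one item; exactly n! instantiations should survive.

```python
# Brute-force check: of the 2^(n^2) instantiations of the variables A_ij,
# only n! encode valid permutations, namely those in which every item sits
# in exactly one position and every position holds exactly one item.
from itertools import product
from math import factorial

n = 3
valid = 0
for bits in product([False, True], repeat=n * n):
    # A[i][j] is True iff item i+1 is placed at position j+1
    A = [bits[i * n:(i + 1) * n] for i in range(n)]
    each_item_once = all(sum(row) == 1 for row in A)
    each_position_once = all(sum(A[i][j] for i in range(n)) == 1
                             for j in range(n))
    valid += each_item_once and each_position_once

print(valid, "of", 2 ** (n * n))  # prints: 6 of 512
assert valid == factorial(n)
```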
Assuming n = 3 items, the Boolean constraints are as follows:

– Each item i is assigned to exactly one position, leading to three constraints, one for each i ∈ {1, 2, 3}:

\[
(A_{i1} \wedge \neg A_{i2} \wedge \neg A_{i3}) \vee (\neg A_{i1} \wedge A_{i2} \wedge \neg A_{i3}) \vee (\neg A_{i1} \wedge \neg A_{i2} \wedge A_{i3}). \tag{2}
\]

– Each position j is assigned exactly one item, leading to three constraints, one for each j ∈ {1, 2, 3}:

\[
(A_{1j} \wedge \neg A_{2j} \wedge \neg A_{3j}) \vee (\neg A_{1j} \wedge A_{2j} \wedge \neg A_{3j}) \vee (\neg A_{1j} \wedge \neg A_{2j} \wedge A_{3j}). \tag{3}
\]

Our Boolean formula then corresponds to the conjunction of these six "exactly-one" constraints (in general, when we encode permutations, we have 2n "exactly-one" constraints given n items). Moreover, the resulting formula has 6 = 3! models. For example, the permutation (2, 1, 3), which places item 1 at position 2, item 2 at position 1, and item 3 at position 3, is represented by the following model:

\[
\neg A_{11}, A_{12}, \neg A_{13}, \quad A_{21}, \neg A_{22}, \neg A_{23}, \quad \neg A_{31}, \neg A_{32}, A_{33}.
\]

The second step is to compile our Boolean formula into a Sentential Decision Diagram (SDD), which is handled by an SDD compiler (in our experiments, we use the publicly available SDD compiler at http://reasoning.cs.ucla.edu/sdd). Figure 2 (left) depicts an example SDD, which is a Boolean circuit with very specific properties. Here, circles represent disjunctions and paired boxes represent conjunctions.
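To mechanize the first (modeling) step for arbitrary n, the exactly-one constraints can be written in CNF, a common input format for SDD compilers, including, to the best of our knowledge, the one referenced above. The sketch below is our own illustration (the file name and encoding choices are ours); it uses the standard clausal encoding of exactly-one, namely one at-least-one clause plus pairwise at-most-one clauses.

```python
# Generate the 2n "exactly-one" permutation constraints in DIMACS CNF
# (our illustration). Variable var(i, j) stands for A_ij, "item i is at
# position j". Check the SDD compiler's documentation for its exact
# input conventions before use.

def permutation_cnf(n):
    def var(i, j):
        # 1-based DIMACS index for A_ij
        return (i - 1) * n + j

    def exactly_one(lits):
        clauses = [list(lits)]  # at-least-one clause
        clauses += [[-a, -b]    # pairwise at-most-one clauses
                    for k, a in enumerate(lits) for b in lits[k + 1:]]
        return clauses

    clauses = []
    for i in range(1, n + 1):  # each item is assigned exactly one position
        clauses += exactly_one([var(i, j) for j in range(1, n + 1)])
    for j in range(1, n + 1):  # each position is assigned exactly one item
        clauses += exactly_one([var(i, j) for i in range(1, n + 1)])

    lines = ["p cnf %d %d" % (n * n, len(clauses))]
    lines += [" ".join(map(str, clause)) + " 0" for clause in clauses]
    return "\n".join(lines)

# Write the n = 3 instance to a file (hypothetical name) for compilation.
with open("perm3.cnf", "w") as f:
    f.write(permutation_cnf(3) + "\n")
```

For n = 3 this produces 9 variables and 24 clauses whose models are exactly the 3! = 6 permutations; the resulting file can then be handed to an SDD compiler.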