Learning Graphical Models for Hypothesis Testing and Classification

Vincent Y. F. Tan, Student Member, IEEE, Sujay Sanghavi, Member, IEEE, John W. Fisher, III, Member, IEEE, and Alan S. Willsky, Fellow, IEEE

IEEE Transactions on Signal Processing, vol. 58, no. 11, November 2010. Digital Object Identifier 10.1109/TSP.2010.2059019.

Abstract—Sparse graphical models have proven to be a flexible class of multivariate probability models for approximating high-dimensional distributions. In this paper, we propose techniques to exploit this modeling ability for binary classification by discriminatively learning such models from labeled training data, i.e., using both positive and negative samples to optimize for the structures of the two models. We motivate why it is difficult to adapt existing generative methods, and propose an alternative method consisting of two parts. First, we develop a novel method to learn tree-structured graphical models which optimizes an approximation of the log-likelihood ratio. We also formulate a joint objective to learn a nested sequence of optimal forest-structured models. Second, we construct a classifier by using ideas from boosting to learn a set of discriminative trees. The final classifier can be interpreted as a likelihood ratio test between two models with a larger set of pairwise features. We use cross-validation to determine the optimal number of edges in the final model. The algorithm presented in this paper also provides a method to identify a subset of the edges that are most salient for discrimination. Experiments show that the proposed procedure outperforms generative methods such as Tree Augmented Naïve Bayes and Chow-Liu as well as their boosted counterparts.

Index Terms—Boosting, classification, graphical models, structure learning, tree distributions.

Manuscript received February 09, 2010; accepted July 08, 2010. Date of publication July 19, 2010; date of current version October 13, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Cedric Richard. This work was supported in part by the AFOSR through Grant FA9550-08-1-1080, by the MURI funded through an ARO Grant W911NF-06-1-0076, and by MURI through AFOSR Grant FA9550-06-1-0324. The work of V. Tan was supported by A*STAR, Singapore. The work of J. Fisher was partially supported by the Air Force Research Laboratory under Award No. FA8650-07-D-1220. The material in this paper was presented at the SSP Workshop, Madison, WI, August 2007, and at ICASSP, Las Vegas, NV, March 2008. V. Y. F. Tan and A. S. Willsky are with the Stochastic Systems Group, Laboratory for Information and Decision Systems (LIDS), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139 USA. S. Sanghavi is with the Electrical and Computer Engineering Department, University of Texas, Austin, TX 78712 USA. J. W. Fisher III is with the Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139 USA.

I. INTRODUCTION

THE formalism of graphical models [3] (also called Markov random fields) involves representing the conditional independence relations of a set of random variables by a graph. This enables the use of efficient graph-based algorithms to perform large-scale statistical inference and learning. Sparse, but loopy, graphical models have proven to be a robust yet flexible class of probabilistic models in signal and image processing [4]. Learning such models from data is an important generic task. However, this task is complicated by the classic tradeoff between consistency and generalization. That is, graphs with too few edges have limited modeling capacity, while those with too many edges overfit the data.

A classic method developed by Chow and Liu [5] shows how to efficiently learn the optimal tree approximation of a multivariate probabilistic model. It was shown that only pairwise probabilistic relationships amongst the set of variables suffice to learn the model. Such relationships may be deduced by using standard estimation techniques given a set of samples. Consistency and convergence rates have also been studied [6], [7].
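Concretely, Chow and Liu showed that the maximum-likelihood tree is a maximum-weight spanning tree in which the weight of edge $(i,j)$ is the empirical mutual information between $X_i$ and $X_j$. The sketch below is our own illustration of this procedure, not the authors' code (all function names are ours): it estimates the edge weights from discrete samples and builds the tree with Kruskal's algorithm.

```python
import numpy as np
from itertools import combinations

def empirical_mutual_information(x, y):
    """Estimate I(X; Y) in nats from paired samples of two discrete variables."""
    n = len(x)
    joint, px, py = {}, {}, {}
    for a, b in zip(x, y):
        joint[(a, b)] = joint.get((a, b), 0) + 1
        px[a] = px.get(a, 0) + 1
        py[b] = py.get(b, 0) + 1
    mi = 0.0
    for (a, b), c in joint.items():
        # c/n is the empirical joint; px[a]/n and py[b]/n are the marginals.
        mi += (c / n) * np.log(c * n / (px[a] * py[b]))
    return mi

def chow_liu_tree(samples):
    """Return the edge set of the Chow-Liu tree for an (n, d) array of
    discrete samples: a maximum-weight spanning tree under empirical
    mutual information weights, built by Kruskal's algorithm."""
    n, d = samples.shape
    weights = sorted(
        ((empirical_mutual_information(samples[:, i], samples[:, j]), i, j)
         for i, j in combinations(range(d), 2)),
        reverse=True)
    parent = list(range(d))
    def find(u):                      # union-find with path halving
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u
    edges = []
    for _, i, j in weights:
        ri, rj = find(i), find(j)
        if ri != rj:                  # adding (i, j) keeps the graph acyclic
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

Learning one such tree per class, one from the positive samples and one from the negative, gives the generative baseline that the discriminative procedure of Section III modifies.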
Several promising techniques have been proposed for learning thicker loopy models [8]–[11] (i.e., models containing more edges) for the purpose of approximating a distribution given independent and identically distributed (iid) samples drawn from that distribution. However, they are not straightforward to adapt for the purpose of learning models for binary classification (or binary hypothesis testing). As an example, for two distributions that are "close" to each other, separately modeling each by a sparse graphical model would likely "blur" the differences between the two. This is because the primary goal of modeling is to faithfully capture the entire behavior of a single distribution, and not to emphasize its most salient differences from another probability distribution. Our motivation is to retain the generalization power of sparse graphical models, while also developing a procedure that automatically identifies and emphasizes features that help to best discriminate between two distributions.

In this paper, we leverage the modeling flexibility of sparse graphical models for the task of classification: given labeled training data from two unknown distributions, we first describe how to build a pair of tree-structured graphical models to better discriminate between the two distributions. In addition, we also utilize boosting [12] to learn a richer (or larger) set of features¹ using the previously mentioned tree-learning algorithm as the weak classifier. This allows us to learn thicker graphical models, which, to the best of our knowledge, has not been done before. Learning graphical models for classification has previously been proposed for tree-structured models such as Tree Augmented Naïve Bayes (TAN) [13], [14], and for more complex models using greedy heuristics [15].

¹We use the generic term features to denote the marginal and pairwise class-conditional distributions, i.e., $\hat{p}(x_i), \hat{q}(x_i)$ and $\hat{p}(x_i, x_j), \hat{q}(x_i, x_j)$.

We outline the main contributions of this paper in Section I-A and discuss related work in Section I-B. In Section II, we present some mathematical preliminaries. In Section III, we describe discriminative tree learning algorithms specifically tailored for the purpose of classification. This is followed by the presentation of a novel adaptation of boosting [16], [17] to learn a larger set of features in Section IV. In Section V, we present numerical experiments to validate the learning method presented in the paper and also demonstrate how the method can be naturally extended to multiclass classification problems. We conclude in Section VI by discussing the merits of the techniques presented.

A. Summary of Main Contributions

There are three main contributions in this paper. Firstly, it is known that decreasing functions of the $J$-divergence [a symmetric form of the Kullback-Leibler (KL) divergence] provide upper and lower bounds to the error probability [18]–[20]. Motivated by these bounds, we develop efficient algorithms to maximize a tree-based approximation to the $J$-divergence. We show that it is straightforward to adapt the generative tree-learning procedure of Chow and Liu [5] to a discriminative objective related to the $J$-divergence over tree models. Secondly, we propose a boosting procedure [12] to learn a richer set of features, thus improving the modeling ability of the distributions $\hat{p}$ and $\hat{q}$. Finally, we demonstrate empirically that this family of algorithms leads to accurate classification on a wide range of datasets.
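For reference, the $J$-divergence mentioned above is the symmetrized KL divergence; in our notation (the paper's formal definitions appear in Section II):

```latex
J(p, q) \;=\; D(p \,\|\, q) + D(q \,\|\, p)
       \;=\; \sum_{x} \bigl( p(x) - q(x) \bigr) \log \frac{p(x)}{q(x)} \;\geq\; 0 .
```

Larger $J$ means the two hypotheses are easier to tell apart, which is why maximizing a tree-based surrogate of $J(p, q)$ over pairs of tree distributions is a natural discriminative objective.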
It is generally difficult to adapt existing techniques for learning loopy graphical models directly to the task of classification. This is because direct approaches typically involve first estimating the structure before estimating the parameters. The parameter estimation stage is usually intractable if the estimated structure is loopy. Our main contribution is thus the development of efficient learning algorithms for estimating tree-structured models geared specifically towards classification. The procedure consists of three steps: (i) adapting the Chow-Liu procedure to learn a pair of trees under a discriminative objective; (ii) using the modification above as a weak classifier to learn multiple pairs of trees; and (iii) combining the resulting trees to learn a larger set of pairwise features (steps (ii) and (iii) are sketched in code after Section I-B).

The optimal number of boosting iterations, and hence the number of trees in the final ensemble models, is found by cross-validation (CV) [21]. We note that even though the resulting models are high-dimensional, CV is effective because of the lower-dimensional modeling requirements of classification as compared to, for example, structure modeling. We show, via experiments, that the method of boosted learning outperforms [5], [13], [14]. In fact, any graphical model learning procedure for classification, such as TAN, can be augmented by the boosting procedure presented here to learn more salient pairwise features and thus to increase modeling capability and subsequent classification accuracy.

B. Related Work

There has been much work on learning sparse, but loopy, graphs purely for modeling purposes (e.g., in the papers [8]–[11] and references therein). A simple form of learning of graphical models for classification is the Naïve Bayes model, which corresponds to graphs having no edges, a restrictive assumption. A comprehensive study of discriminative versus generative Naïve Bayes was done by Ng et al. [22]. Friedman et al. [14] and Wang and Wong [13] suggested an improvement to Naïve Bayes using a generative model known as TAN, a specific form of a graphical model geared towards classification. However, the models learned in these papers share the same structure and hence are more restrictive than the proposed discriminative algorithm, which learns trees with possibly distinct structures for each hypothesis.
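The sketch promised above illustrates steps (ii) and (iii) with an AdaBoost-style loop. This is our own schematic, not the paper's Section IV algorithm; `learn_tree_pair` is a hypothetical stand-in for the discriminative tree learner of Section III, assumed to return a weak classifier $h(x) = \mathrm{sign}\bigl[\log \hat{p}(x) - \log \hat{q}(x)\bigr]$.

```python
import numpy as np

def boosted_tree_pair_classifier(X, y, T, learn_tree_pair):
    """AdaBoost-style ensemble of tree-pair likelihood-ratio weak classifiers.

    X: (n, d) array of training samples; y: array of labels in {-1, +1}.
    learn_tree_pair(X, y, w) is a hypothetical interface: it fits a pair of
    tree models (p_hat, q_hat) on w-weighted data and returns a function
    h with h(x) = sign(log p_hat(x) - log q_hat(x)).
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                  # uniform initial sample weights
    hypotheses, alphas = [], []
    for _ in range(T):                       # T boosting rounds (chosen by CV)
        h = learn_tree_pair(X, y, w)
        pred = np.array([h(x) for x in X])
        err = w[pred != y].sum()             # weighted training error
        if err >= 0.5:                       # weak learner no better than chance
            break
        alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w = w * np.exp(-alpha * y * pred)    # up-weight misclassified samples
        w = w / w.sum()
        hypotheses.append(h)
        alphas.append(alpha)

    def classify(x):
        # Weighted vote over all learned tree pairs.
        return np.sign(sum(a * h(x) for a, h in zip(alphas, hypotheses)))
    return classify
```

The returned decision rule is a weighted vote of per-pair likelihood-ratio tests, matching in spirit the abstract's description of the final classifier as a likelihood ratio test over a larger set of pairwise features.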

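Since the number of boosting rounds (and hence the number of trees in the ensemble) is chosen by cross-validation, a minimal selection loop might look as follows (again our own sketch; `boosted_tree_pair_classifier` is the function from the previous snippet and `learn_tree_pair` the same hypothetical weak learner).

```python
import numpy as np

def select_num_rounds(X, y, max_T, learn_tree_pair, n_folds=5, seed=0):
    """Choose the number of boosting rounds T by K-fold cross-validation.

    Naive for clarity: retrains from scratch for each T. In practice one
    would train once with max_T rounds and score prefixes of the ensemble.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_folds)
    scores = np.zeros(max_T)
    for T in range(1, max_T + 1):
        for fold in folds:
            train = np.ones(n, dtype=bool)
            train[fold] = False              # hold out this fold
            clf = boosted_tree_pair_classifier(X[train], y[train], T,
                                               learn_tree_pair)
            preds = np.array([clf(x) for x in X[fold]])
            scores[T - 1] += np.mean(preds == y[fold])
    return int(np.argmax(scores)) + 1        # T with best held-out accuracy
```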