Discriminative, Generative and Imitative Learning by Tony Jebara
B.Eng., Electrical Engineering, McGill University, 1996
M.Sc., Media Arts and Sciences, MIT, 1998
Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY IN MEDIA ARTS AND SCIENCES at the Massachusetts Institute of Technology, February 2002.
© Massachusetts Institute of Technology, 2002. All Rights Reserved.
Signature of Author Program in Media Arts and Sciences December 18, 2001
Certified by Alex P. Pentland Toshiba Professor of Media Arts and Sciences Program in Media Arts and Sciences Thesis Supervisor
Accepted by Andrew B. Lippman, Chair, Departmental Committee on Graduate Students, Program in Media Arts and Sciences
Submitted to the Program in Media Arts and Sciences, School of Architecture and Planning, in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Media Arts and Sciences
Abstract
I propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. A generative probabilistic distribution is a principled way to model many machine learning and machine perception problems. Therein, one provides domain-specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian networks and Bayesian statistics provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a distribution that is a good generator of novel exemplars.
Conversely, discriminative algorithms adjust a possibly non-distributional model to data, optimizing for a specific task such as classification or prediction. This typically leads to superior performance yet compromises the flexibility of generative modeling. I present Maximum Entropy Discrimination (MED) as a framework to combine both discriminative estimation and generative probability densities. Calculations involve distributions over parameters, margins, and priors and are provably and uniquely solvable for the exponential family. Extensions include regression, feature selection, and transduction. SVMs are also naturally subsumed and can be augmented with, for example, feature selection, to obtain substantial improvements.
To extend to mixtures of exponential families, I derive a discriminative variant of the Expectation-Maximization (EM) algorithm for latent discriminative learning (or latent MED). While EM and Jensen lower bound log-likelihood, a dual upper bound is made possible via a novel reverse-Jensen inequality. The variational upper bound on latent log-likelihood has the same form as EM bounds, is efficiently computable and is globally guaranteed.
It permits powerful discriminative learning with the wide range of contemporary probabilistic mixture models (mixtures of Gaussians, mixtures of multinomials and hidden Markov models). We provide empirical results on standardized data sets that demonstrate the viability of the hybrid discriminative-generative approaches of MED and reverse-Jensen bounds over state-of-the-art discriminative techniques or generative approaches.
Subsequently, imitative learning is presented as another variation on generative modeling which also learns from exemplars drawn from an observed data source. However, the distinction is that the generative model is an agent that is interacting in a much more complex surrounding external world. It is not efficient to model the aggregate space in a generative setting. I demonstrate that imitative learning (under appropriate conditions) can be adequately addressed as a discriminative prediction task which outperforms the usual generative approach. This discriminative-imitative learning approach is applied with a generative perceptual system to synthesize a real-time agent that learns to engage in social interactive behavior.
Thesis Supervisor: Alex Pentland
Title: Toshiba Professor of Media Arts and Sciences, MIT Media Lab
Thesis committee:
Advisor: Alex P. Pentland Toshiba Professor of Media Arts and Sciences MIT Media Laboratory
Co-Advisor: Tommi S. Jaakkola Assistant Professor of Electrical Engineering and Computer Science MIT Artificial Intelligence Laboratory
Reader: David C. Hogg Professor of Computing and Pro-Vice-Chancellor School of Computer Studies, University of Leeds
Reader: Tomaso A. Poggio Uncas and Helen Whitaker Professor MIT Brain Sciences Department and MIT Artificial Intelligence Lab

Acknowledgments
I extend warm thanks to Alex Pentland for sharing with me his wealth of brilliant creative ideas, his ground-breaking visions for computer-human collaboration, and his ability to combine the strengths of so many different fields and applications. I extend warm thanks to Tommi Jaakkola for sharing with me his masterful knowledge of machine learning, his pioneering ideas on discriminative-generative estimation and his excellent statistical and mathematical abilities. I extend warm thanks to David Hogg for sharing with me his will to tackle great challenging problems, his visionary ideas on behavior learning, and his ability to span the panorama of perception, learning and behavior. I extend warm thanks to Tomaso Poggio for sharing with me his extensive understanding of so many aspects of intelligence: biological, psychological, statistical, mathematical and computational, and his enthusiasm towards science in general. As members of my committee, they have all profoundly shaped the ideas in this thesis as well as helped me formalize them into this document. I would like to thank the Pentlandians group, which has been a great team to work with: Karen Navarro, Elizabeth Farley, Tanzeem Choudhury, Brian Clarkson, Sumit Basu, Yuri Ivanov, Nitin Sawhney, Vikram Kumar, Ali Rahimi, Steve Schwartz, Rich DeVaul, Dave Berger, Josh Weaver and Nathan Eagle. I would also like to thank the TRG, who have been a great source of readings and brainstorming: Jayney Yu, Jason Rennie, Adrian Corduneanu, Neel Master, Martin Szummer, Romer Rosales, Chen-Hsiang, Nati Srebro, and so many others. My thanks to all the other Media Lab folks who are still around like Deb Roy, Joe Paradiso, Bruce Blumberg, Roz Picard, Irene Pepperberg, Claudia Urrea, Yuan Qi, Raul Fernandez, Push Singh, Bill Butera, Mike Johnson and Bill Tomlinson for sharing bold ideas and deep thoughts.
Thanks also to great Media Lab friends who have moved to other places but had a profound influence on me: Baback Moghaddam, Bernt Schiele, Nuria Oliver, Ali Azarbayejani, Thad Starner, Kris Popat, Chris Wren, Jim Davis, Tom Minka, Francois Berard, Andy Wilson, Nuno Vasconcelos, Janet Cahn, Lee Campbell, Marina Bers, and my UROPs Martin Wagner, Cyrus Eyster and Ken Russell. Thanks to so many folks outside the lab like Sayan Mukherjee, Marina Meila, Yann LeCun, Michael Jordan, Andrew McCallum, Andrew Ng, Thomas Hoffman, John Weng, and many others for great conversations and valuable insight. And thanks to my family, my father, my mother and my sister, Carine. They made every possible sacrifice and effort so that I could do this PhD and supported me cheerfully throughout the whole endeavor.

Contents
1 Introduction 14 1.1 Learning and Generative Modeling ...... 15 1.1.1 Learning and Generative Models in AI ...... 16 1.1.2 Learning and Generative Models in Perception ...... 16 1.1.3 Learning and Generative Models in Temporal Behavior ...... 17 1.2 Why a Probability of Everything? ...... 18 1.3 Generative versus Discriminative Learning ...... 18 1.4 Imitative Learning ...... 20 1.5 Objective ...... 22 1.6 Scope ...... 24 1.7 Organization ...... 24
2 Generative vs. Discriminative Learning 26 2.1 Two Schools of Thought ...... 27 2.1.1 Generative Probabilistic Models ...... 27 2.1.2 Discriminative Classifiers and Regressors ...... 29 2.2 Generative Learning ...... 30 2.2.1 Bayesian Inference ...... 30 2.2.2 Maximum Likelihood ...... 31 2.3 Conditional Learning ...... 32 2.3.1 Conditional Bayesian Inference ...... 32 2.3.2 Maximum Conditional Likelihood ...... 35 2.3.3 Logistic Regression ...... 36 2.4 Discriminative Learning ...... 36 2.4.1 Empirical Risk Minimization ...... 36 2.4.2 Structural Risk Minimization and Large Margin Estimation ...... 37
2.4.3 Bayes Point Machines ...... 38 2.5 Joint Generative-Discriminative Learning ...... 38
3 Maximum Entropy Discrimination 40 3.1 Regularization Theory and Support Vector Machines ...... 41 3.1.1 Solvability ...... 42 3.1.2 Support Vector Machines and Kernels ...... 43 3.2 MED - Distribution over Solutions ...... 44 3.3 MED - Augmented Distributions ...... 46 3.4 Information Theoretic and Geometric Interpretation ...... 48 3.5 Computing the Partition Function ...... 49 3.6 Margin Priors ...... 50 3.7 Bias Priors ...... 52 3.7.1 Gaussian Bias Priors ...... 52 3.7.2 Non-Informative Bias Priors ...... 53 3.8 Support Vector Machines ...... 53 3.8.1 Single Axis SVM Optimization ...... 54 3.8.2 Kernels ...... 55 3.9 Generative Models ...... 55 3.9.1 Exponential Family Models ...... 56 3.9.2 Empirical Bayes Priors ...... 57 3.9.3 Full Covariance Gaussians ...... 59 3.9.4 Multinomials ...... 62 3.10 Generalization Guarantees ...... 64 3.10.1 VC Dimension ...... 64 3.10.2 Sparsity ...... 65 3.10.3 PAC-Bayes Bounds ...... 65 3.11 Summary and Extensions ...... 67
4 Extensions to Maximum Entropy Discrimination 68 4.1 MED Regression ...... 69 4.1.1 SVM Regression ...... 71 4.1.2 Generative Model Regression ...... 71 4.2 Feature Selection and Structure Learning ...... 72 4.2.1 Feature Selection in Classification ...... 73
4.2.2 Feature Selection in Regression ...... 77 4.2.3 Feature Selection in Generative Models ...... 78 4.3 Transduction ...... 79 4.3.1 Transductive Classification ...... 79 4.3.2 Transductive Regression ...... 82 4.4 Other Extensions ...... 85 4.5 Mixture Models and Latent Variables ...... 86
5 Latent Discrimination and CEM 88 5.1 The Exponential Family and Mixtures ...... 89 5.1.1 Mixtures of the Exponential Family ...... 90 5.2 Mixtures in Product and Logarithmic Space ...... 91 5.3 Expectation Maximization: Divide and Conquer ...... 93 5.4 Latency in Conditional and Discriminative Criteria ...... 94 5.5 Bounding Mixture Models ...... 96 5.5.1 Jensen Bounds ...... 97 5.5.2 Reverse-Jensen Bounds ...... 99 5.6 Mixing Proportions Bounds ...... 101 5.7 The CEM Algorithm ...... 103 5.7.1 Deterministic Annealing ...... 107 5.8 Latent Maximum Entropy Discrimination ...... 107 5.8.1 Experiments ...... 108 5.9 Beyond Simple Mixtures ...... 110
6 Structured Mixture Models 111 6.1 Hidden Markov Models ...... 112 6.2 Mixture of Mixtures ...... 114 6.2.1 Jensen Bounds ...... 114 6.2.2 Reverse Jensen Bounds ...... 115 6.3 Reverse Jensen Inequality for Hidden Markov Models ...... 118 6.3.1 Bounding the HMM Gaussians ...... 119 6.3.2 Bounding the HMM Multinomials ...... 122 6.4 Conditional Hidden Markov Models ...... 124 6.4.1 Experiments ...... 125 6.5 Discriminative Hidden Markov Models ...... 126
6.6 Latent Bayesian Networks ...... 129 6.7 Data Set Bounds ...... 130 6.8 Applications ...... 134
7 The Reverse Jensen Inequality 135 7.1 Background in Inequalities and Reversals ...... 135 7.2 Derivation of the Reverse Jensen Inequality ...... 136 7.3 Mapping to Quadratics ...... 137 7.4 An Important Condition on the E-Family ...... 138 7.5 Applying the Mapping ...... 140 7.6 Invoking Constraints on Virtual Data ...... 142 7.7 A Simple yet Loose Bound ...... 144 7.8 A Tighter Bound ...... 146 7.8.1 Bounding a One Dimensional Ray ...... 147 7.8.2 Scaling Up to the Multidimensional Formulation ...... 151 7.9 Analytically Avoiding Lookup Tables ...... 154 7.10 A Final Appeal to Convex Duality ...... 159
8 Imitative Learning 162 8.1 An Imitation Framework ...... 163 8.1.1 A Simple Example ...... 164 8.1.2 Limitations ...... 165 8.2 Long Term Behavior Data Collection ...... 168 8.2.1 Data and Processing ...... 168 8.2.2 The Task ...... 170 8.3 Generative Modeling for Perception ...... 170 8.3.1 A Generative Model on Images ...... 173 8.3.2 A Generative Model on Spectrograms ...... 174 8.4 Hidden Markov Models for Temporal Signals ...... 175 8.5 Conditional Hidden Markov Models for Imitation ...... 175 8.6 Experiments ...... 177 8.6.1 Training and Testing Likelihoods ...... 177 8.7 Resynthesis Results ...... 180 8.7.1 Resynthesis on Training Data ...... 180 8.7.2 Resynthesis on Test Data ...... 184
8.8 Discussion ...... 186
9 Conclusion 188 9.1 Contributions ...... 188 9.2 Future Theoretical Work ...... 191 9.3 Future Applied Work ...... 192
10 Appendix 194 10.1 Optimization in the MED Framework ...... 194 10.1.1 Constrained Gradient Ascent ...... 194 10.1.2 Axis-Parallel Optimization ...... 195 10.1.3 Learning Axis Transitions ...... 197 10.2 A Note on Convex Duality ...... 198 10.3 Numerical Procedures in the Reverse-Jensen Inequality ...... 199

List of Figures
1.1 Examples of Generative Models...... 15 1.2 Probabilistic Perception Systems...... 17 1.3 Example of a Discriminative Classifier...... 19 1.4 Imitative Learning through Discrimination and Probabilistic Perception...... 20 1.5 The AIM mapping from action to perception...... 22
2.1 Scales of Discrimination and Integration in Learning...... 27 2.2 Directed Graphical Models...... 28 2.3 The Graphical Models ...... 33 2.4 Conditioned Bayesian inference vs. conditional Bayesian inference...... 34
3.1 Convex cost functions and convex constraints...... 42 3.2 MED Convex Program Problem Formulation...... 45 3.3 MED as an Information Projection Operation...... 48 3.4 Margin prior distributions and associated potential functions...... 52 3.5 Discriminative Generative Models...... 55 3.6 Information Projection Operation of the Bayesian Generative Estimate...... 58 3.7 Classification visualization for Gaussian discrimination...... 61
4.1 Formulating extensions to MED...... 68 4.2 Various extensions to binary classification...... 69 4.3 Margin prior distributions and associated potential functions...... 70 4.4 MED approximation to the sinc function...... 71 4.5 ROC curves on the splice site problem with feature selection...... 74 4.6 Cumulative distribution functions for the resulting effective linear coefficients. . . . . 75 4.7 ROC curves corresponding to a quadratic expansion of the features with feature selection 75 4.8 Varying Regularization and Feature Selection Levels for Protein Classification. . . . 76
4.9 Sparsification of the Linear Model...... 76 4.10 Cumulative distribution functions for the linear regression coefficients...... 78 4.11 Transductive Regression vs. Labeled Regression Illustration...... 83 4.12 Transductive Regression vs. Labeled Regression for F16 Flight Control...... 85 4.13 Tree Structure Estimation...... 86
5.1 Graph of a Standard Mixture Model ...... 90 5.2 Expectation-Maximization as Iterated Bound Maximization ...... 94 5.3 Maximum Likelihood versus Maximum Conditional Likelihood...... 94 5.4 Dual-Sided Bounding of Latent Likelihood ...... 97 5.5 Jensen and Reverse-Jensen Bounds on the Log-Sum...... 100 5.6 CEM Data Weighting and Translation...... 104 5.7 CEM vs. EM Performance on Gaussian Mixture Model...... 105 5.8 Gradient Ascent on Conditional and Joint Likelihood...... 106 5.9 CEM vs. EM Performance on the Yeast UCI Dataset...... 106 5.10 Iterated Latent MED Projection with Jensen and Reverse-Jensen Bounds...... 109
6.1 Graph of a Structured Mixture Model ...... 111 6.2 Jensen and Reverse-Jensen Bounds on the Mixture of Mixtures (Gaussian case). . . 117 6.3 Jensen and Reverse-Jensen Bounds on the Mixture of Mixtures (Poisson case). . . . 118 6.4 Jensen and Reverse-Jensen Bounds on the Mixture of Mixtures (Exponential case). . 118 6.5 Conditional HMM Estimation ...... 125 6.6 Training data from the San Francisco Data Set...... 126 6.7 Test predictions for the San Francisco Data Set...... 127
7.1 Convexity-Preserving Map ...... 139 7.2 Lemma Bound for Gaussian Mean...... 140 7.3 Lemma Bound for Gaussian Covariance...... 140 7.4 Lemma Bound for the Multinomial Distribution...... 140 7.5 Lemma Bound for the Gamma Distribution...... 141 7.6 Lemma Bound for the Poisson Distribution...... 141 7.7 Lemma Bound for the Exponential Distribution...... 141 7.8 One Dimensional Log Gibbs Partition Functions...... 148 7.9 Intermediate Bound on Log Gibbs Partition Functions...... 149 7.10 f(γ, ω)...... 150
7.11 maxω f(γ, ω)...... 150
7.12 Linear Upper Bounds on maxω f(γ, ω)...... 150 7.13 Quadratic Bound on Log Gibbs Partition Functions...... 151 7.14 Bounding Separate Components ...... 156 7.15 Bounding the Numerical g(γ) with Analytic G(γ)...... 159
8.1 Dialog Interaction and Analysis Window ...... 163 8.2 A Gestural Imitation Learning Framework...... 165 8.3 Online Interaction with the Learned Synthetic Agent...... 166 8.4 Two channel audio-visual imitation learning platform...... 168 8.5 Wearable Interaction Data...... 169 8.6 The Audio-Visual Prediction Task...... 170 8.7 A Bayesian Network Description of PCA...... 171 8.8 Bayesian Network Representing a PCA Variant for Collections of Vectors...... 172 8.9 Reconstruction of Facial Images with PCA and Variant...... 173 8.10 Reconstruction error for images of self and world...... 174 8.11 Reconstruction error for spectrograms of self and world...... 175 8.12 Audio and Video Prediction HMMs ...... 176 8.13 Conditional HMM Estimation ...... 177 8.14 Training Log-Likelihoods for Audio Prediction HMM...... 178 8.15 Training Log-Likelihoods for Video Prediction HMM...... 179 8.16 HMM Resynthesis on Training Data...... 182 8.17 HMM Resynthesis on Training Data Continued...... 183 8.18 HMM Resynthesis on Testing Data...... 185
10.1 Constrained Gradient Ascent Optimization in the MED framework...... 195 10.2 Axis Parallel Optimization in the MED framework...... 196 10.3 Approximating the decay rate in the change of the objective function...... 198 10.4 Axis-Parallel MED Maximization with Learned Axis Transitions ...... 198 10.5 Parameters of linear upper bounds on the g(γ) function...... 199

List of Tables
3.1 Margin prior distributions and associated potential functions...... 51 3.2 Leptograpsus Crabs ...... 62 3.3 Breast Cancer Classification ...... 62
4.1 Prediction Test Results on Boston Housing Data...... 78 4.2 Prediction Test Results on Gene Expression Level Data...... 78
5.1 Sample exponential family distributions...... 90 5.2 Jensen and Reverse-Jensen Bound Parameters...... 98 5.3 CEM & EM Performance for Yeast Data Set...... 106 5.4 EM and latent MED Performance on the Pima Indians Data Set...... 109 5.5 SVM with Polynomial Kernel Performance on the Pima Indians Data Set...... 109
6.1 CEM & EM Performance for HMM Precipitation Prediction...... 126
8.1 Testing Log-Likelihoods for Audio Prediction HMM...... 179 8.2 Testing Log-Likelihoods for Video Prediction HMM...... 179
10.1 Parameters for linear upper bounds on the g(γ) function...... 200
Chapter 1
Introduction
It is not knowledge, but the act of learning,... which grants the greatest enjoyment. Karl Friedrich Gauss, 1808.
The objective of this thesis is to propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. The resulting mathematically principled framework 1 suggests that combined or hybrid approaches provide superior performance and flexibility over the individual learning schemes in isolation. Generative learning is at the heart of many approaches to pattern recognition, artificial intelligence, and perception and provides a rich framework for imposing structure and prior knowledge on a given problem. Yet recent progress in discriminative learning has demonstrated that superior performance can be obtained by avoiding generative modeling and focusing on the given task. A powerful connection will be proposed between generative and discriminative learning to combine the complementary strengths of the two schools of thought. Subsequently, we propose a connection between discriminative learning and imitative learning. Imitative learning is an automatic method for learning behavior in an interactive autonomous agent. This process will be cast and augmented with a discriminative learning formalism. In this chapter, we begin by discussing motivation for machine learning in general. There we draw upon applied examples from pattern recognition, various AI domains and machine perception. These communities have identified various generative models specifically designed and honed to reflect the prior knowledge in their respective domains. Yet these generative models must often be discarded when one considers a discriminative approach, which ironically provides superior performance despite its naive models. This motivates the need to find common formalisms that synergistically combine the different schools of thought in the community. We then describe an ambitious instance of machine learning, namely imitative learning, which attempts to learn autonomous interactive agents directly from data. This form of agent learning can also benefit from being cast into a discriminative/generative paradigm. We end the chapter with a summary of the objectives and scope of this work and provide a brief overview of the rest of the chapters in this document.
1 In the process of constructing such a framework we will expose important mathematical tools which are valuable in their own right. These include a powerful reversal of the celebrated Jensen inequality, projection approaches to learning and more.
1.1 Learning and Generative Modeling
While science can sometimes provide exact deterministic models of phenomena (i.e. in domains such as Newtonian physics), the mathematical relationships governing more complex systems are often only (if at all) partially specifiable. Furthermore, aspects of the model may have uncertainty and incomplete information. Machine learning and statistics provide a formal approach for manipulating nondeterministic models by describing or estimating a probability density over the variables in question. Within this generative density, one can specify a priori partial knowledge and refine the partially specified model using empirical observations and data. Thus, a system with variables x1, . . . , xT can be specified through a joint probability distribution over all of its significant variables, p(x1, . . . , xT ). This is known as a generative model since, given this probability distribution, we can generate samples of various configurations of the system. Furthermore, given a full generative model, it is straightforward to condition and marginalize over the joint density to make inferences and predictions. In many domains, greater sophistication and more ambitious tasks have made problems so intricate that complete models, theories and quantitative approaches are difficult to construct manually. This, combined with the greater availability of data and computational power, has encouraged many of these domains to migrate away from rule-based and manually specified models to probabilistic data-driven models. However, whatever partial amounts of domain knowledge are available are still used to seed a generative model. Developments in machine learning and Bayesian statistics have provided a more rigorous formalism for representing prior knowledge and combining it with observed data.
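These basic operations on a joint density can be made concrete with a toy sketch. In the following Python snippet (the table of probabilities is purely illustrative, not from the thesis), a joint distribution p(x1, x2) over two binary variables supports all three operations described above: generating samples, marginalizing, and conditioning.

```python
import numpy as np

# A toy generative model: a joint distribution p(x1, x2) over two
# binary variables, stored as a 2x2 table (illustrative numbers).
p_joint = np.array([[0.30, 0.10],
                    [0.20, 0.40]])  # rows: x1 in {0,1}, cols: x2 in {0,1}
assert np.isclose(p_joint.sum(), 1.0)

# Marginalization: p(x1) = sum over x2 of p(x1, x2)
p_x1 = p_joint.sum(axis=1)            # [0.4, 0.6]

# Conditioning: p(x2 | x1 = 1) = p(x1 = 1, x2) / p(x1 = 1)
p_x2_given_x1 = p_joint[1] / p_x1[1]  # [1/3, 2/3]

# Generation: draw novel configurations (x1, x2) from the joint
rng = np.random.default_rng(0)
flat = rng.choice(4, size=5, p=p_joint.ravel())
samples = [divmod(i, 2) for i in flat]  # list of (x1, x2) pairs
```

The same recipe (a table for the joint, sums for marginals, renormalized slices for conditionals) scales in principle to any discrete joint density, though the table grows exponentially in the number of variables, which is exactly the inefficiency discussed in Section 1.2.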
Recent progress in graphical generative models or Bayesian networks [149] [118] [98] [104] has permitted prior knowledge to be specified structurally by identifying conditional independencies between the variables as well as parametrically by providing prior distributions over them. This partial domain knowledge 2 is then combined with observed data resulting in a more precise posterior generative model.
Figure 1.1: Examples of Generative Models.
In Figure 1.1, we can see three different examples of generative models. A hidden Markov model [104] [156] [13] is depicted in Figure 1.1(a) as a directed graph which identifies high level conditional independence properties. These specify a Markov structure where states only depend on their predecessors and outputs only depend on the current state. A hierarchical mixture of experts [105] is portrayed in Figure 1.1(b). Similarly, a mixture model [20] is shown in Figure 1.1(c), which can also be seen as a graphical model where a parent node selects between two possible Gaussian emission distributions. The details and formalism underlying generative models will be presented in the next chapter. For now, we provide background motivation through examples from multiple applied fields where these generative models have become increasingly popular.3
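The generative process of the simple mixture in Figure 1.1(c) can be sketched in a few lines. In the snippet below (with illustrative parameters, not values from the thesis), ancestral sampling follows the arrows of the graphical model: the parent node is drawn first and selects which Gaussian emits the observation, and the marginal density is the component-weighted sum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a two-component Gaussian mixture:
mix = np.array([0.3, 0.7])     # p(parent selects component k)
means = np.array([-2.0, 3.0])  # emission means
stds = np.array([0.5, 1.0])    # emission standard deviations

def sample(n):
    # Ancestral sampling: draw the parent node, then the chosen emission.
    k = rng.choice(2, size=n, p=mix)
    return k, rng.normal(means[k], stds[k])

def density(x):
    # Marginal p(x) = sum_k p(k) * N(x; mu_k, sigma_k^2).
    x = np.atleast_1d(x)
    comps = np.exp(-0.5 * ((x[:, None] - means) / stds) ** 2) \
            / (stds * np.sqrt(2.0 * np.pi))
    return comps @ mix

k, x = sample(5000)
```

Evaluating `density` on held-out points is what likelihood-based (generative) estimation optimizes; Chapter 5 revisits exactly this log-sum structure when bounding mixture likelihoods.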
2 A further caveat will be addressed in the following sections, which warns that even the partially specified aspects of a model will often be inaccurate and suspect.
3 This is just a small collection of examples of generative models in the various fields and is by no means a complete survey.
In the area of natural language processing, for instance, traditional rule-based or boolean logic systems (such as Dialog and Lexis-Nexis) are giving way to statistical approaches [37] [33] [122] such as Markov models which establish independencies in a chain of successive events. In medical diagnostics, the Quick Medical Reference knowledge base, initially a heuristic expert system for reasoning about diseases and symptoms, has been augmented with a statistical, decision-theoretic formulation [175] [80]. This new formulation structures a diagnostics problem with a two layer bipartite graph where diseases are parents of symptoms. Another recent success of generative probabilistic models lies in genomics and bioinformatics. Once again, traditional approaches for modeling genetic regulatory networks used boolean approaches or differential equation-based dynamic models, which are now being challenged by statistical graphical models [67]. Here, a model selection criterion identifies the best graph structure that matches training data. In addition, for visualizing interdependencies between variables, graphical models have provided a principled formalism which has proven superior to their heuristic counterparts [70] [69].
1.1.1 Learning and Generative Models in AI
In the artificial intelligence (AI) area in general, we see a similar migration from rule-based expert systems to probabilistic generative models. For example, in robotics, traditional dynamics and control systems, path planning, potential functions and navigation models are now complemented with probabilistic models for localization, mapping and control [106] [186]. Multi-robot control has also been demonstrated using a probabilistic reinforcement learning approach [193]. Autonomous agents or virtual interactive characters are another example of AI systems. From the early days of interaction and gaming, simple rule-based schemes were used, such as in Weizenbaum's Eliza [200] program, where natural language rules were used to emulate a therapy session. Similarly, graphical virtual worlds and characters have been generated by rules, cognitive models, physical simulation, kinematics and dynamics [205] [176] [60] [11] [7] [184] [51] [36]. These traditional approaches are currently being combined with statistical machine learning techniques [23] [210] [27].
1.1.2 Learning and Generative Models in Perception
In machine perception, generative models and machine learning have become prominent tools, in particular because of the complexity of the domain and sensors. In speech recognition, hidden Markov models [156] are the method of choice due to their probabilistic treatment of acoustic coefficients and the Markov assumptions necessary for time varying signals. Even auditory scene analysis and sound texture modeling have been cast into a probabilistic learning framework with independent component analysis [14] [15]. Word distributions have also been modeled using bigrams, trigrams or Markov models and topic modeling often uses multinomial distributions. A topic spotting system is shown in Figure 1.2(c) which tracks the conversation of multiple speakers and displays related material for the users to read [89]. A similar emergence of generative models can also be found in the computer vision domain. Techniques such as physics-based modeling [42] and structure-from-motion and epipolar geometry approaches [54] have been complemented with probabilistic models such as Kalman filters [4] [87] to prevent instability and provide robustness to sensor noise. Figure 1.2(a) depicts a system that probabilistically fuses 2D motion estimates into an extended Kalman filter to obtain a rigid 3D face structure and pose estimate [91]. Multiple hypothesis filtering and tracking in vision have also used a generative model and Markov chain Monte Carlo via the Condensation algorithm [79]. More sophisticated probabilistic formulations in computer vision include the use of Markov random fields and loopy belief networks to perform super-resolution [56]. Generative models and structured latent mixtures are used to compute transformations and invariants in face tracking applications [59]. Other distributions, such as eigenspaces [137] and mixture models, have also been used for skin color modeling [172] and image manifold modeling [30].
For instance, in Figure 1.2(b) an eigenspace over 2D photos and 3D face scans is formed and then a generative model between the two spaces can regress 3D face shapes from 2D images in real-time [95]. Color modeling can be done with a mixture of Gaussian models to permit billiards tracking for augmented reality applications as in Figure 1.2(c) [88]. Object recognition and feature extraction have also benefited greatly from a probabilistic interpretation [171]. For example, in Figure 1.2(e), histograms of convolution operations on images provide reliable recognition for, e.g., real-time augmented reality [96].
Figure 1.2: Probabilistic Perception Systems.
1.1.3 Learning and Generative Models in Temporal Behavior
Simultaneously, an evolution has been proceeding in the field as vision techniques transition from low level static image analysis to dynamic and high level video interpretation. These temporal aspects of vision (and other domains) have relied extensively on generative models, and dynamic Bayesian networks in particular. Temporal tracking has also benefited from generative models such as extended Kalman filters [5]. In tracking applications, hidden Markov models are frequently used to recognize gesture [204] [179] as well as spatiotemporal activity [61]. The richness of graphical models permits straightforward combinations of hidden Markov models with Kalman filters for switching between linear dynamical systems in modeling gaits [29] or driving maneuvers [151]. Further variations in the graphical models include coupled hidden Markov models which are appropriate for modeling interacting processes such as vehicles or pedestrians in traffic [145] [139] [141]. Bayesian networks have also been used in multi-person interaction modeling, e.g. in classifying football plays [78]. Forecasting temporal activity has also been reviewed in the Santa Fe competition [62] where various approaches including hidden Markov models are compared.
1.2 Why a Probability of Everything?
Is it efficient to create a probability distribution over all variables in this 'generative' way? The previous systems make no distinction between the roles of different variables and merely try to model the whole phenomenon. This can be inefficient if we are simply trying to learn one (or a few) particular tasks that need to be solved and are not interested in characterizing the behavior of the complete system. An additional caveat is that generative density estimation is formally an ill-posed problem. Density estimation, under many circumstances, can be a cumbersome intermediate step to forming a mapping between variables (i.e. from input to output). This dilemma will be further explicated in Chapter 2. Moreover, density estimation can be demanding in terms of sample complexity: a large amount of data may be necessary to obtain a good generative model of a system as a whole, while only a small sample may be needed to learn the required input-output sub-task discriminatively. Furthermore, all the above AI and perceptual systems work well because of the structures, priors, representations, invariants and background knowledge designed into the system by a domain expert. This is not just due to the learning power of the estimation algorithms but also because they are seeded with the right framework and a priori structures for learning. Can we alleviate the amount of manual effort in this process and take some of the human knowledge engineering out of the loop? One way is to not require a very accurate generative model to be designed and not require as much domain expertise up-front. If we had more powerful discriminative learning algorithms, the learning would be robust to more errors in the design process and remain effective despite incorrect modeling assumptions.
1.3 Generative versus Discriminative Learning
The previous applications we described present compelling evidence and strong arguments for using generative models in which a joint distribution is estimated over all variables. Ironically, though, these flexible models have recently been outperformed in many cases by relatively simpler models estimated with discriminative algorithms. Unlike a generative modeling approach, where modeling tools are available for combining structure, priors, invariants, latent variables and data to form a good joint density tailored to the domain at hand, discriminative algorithms directly optimize a relatively less domain-specific model for the classification or regression task at hand. For example, support vector machines [196] [35] directly maximize the margin of a linear separator between two sets of points in a Euclidean space. While the model is simple (linear), the maximum margin criterion is more appropriate than maximum likelihood or other generative model criteria. In the domain of image-based digit recognition, support vector machines (SVMs) have produced state of the art classification performance [196] [197]. In regression [178] and time series prediction [140], SVMs improved upon generative approaches, maximum likelihood and logistic regression. In text classification and information retrieval, support vector machines [48] [161] and transductive support vector machines [100] surpassed the popular naive Bayes and generative text models. In computer vision, person detection/recognition [148] [52] [142] [71] and gender classification have been dominated by SVM frameworks which surpass maximum likelihood generative models and approach human performance [138]. In genomics and bioinformatics, discriminative systems play a crucial role [202] [208] [81]. Furthermore, in speech recognition, discriminative variants of hidden Markov models have recently demonstrated superior large-corpus classification performance [158] [159] [207].
Despite the more ambiguous models used in these systems, the discriminative estimation process yields improvements over the sophisticated 4 models that have been tailored for the domain in generative frameworks.
Figure 1.3: Examples of a Discriminative Classifier.
There are deeply complementary advantages in both the generative and discriminative approaches yet, algorithmically, they are not directly compatible. Within the community, one could go so far as to say that there exist two somewhat disconnected camps: the ’generative modelers’ and the ’discriminative estimators’ [167] [83]. We now quickly discuss the general aspects of these two schools of thought (Chapter 2 further details the two approaches and previous efforts to bridge them in the machine learning literature). Generative models provide the user with the ability to seed the learning algorithm with knowledge about the problem at hand. This is given in terms of structured models, independence graphs, Markov assumptions, prior distributions, latent variables, and probabilistic reasoning [25] [149]. The focus of generative models is to describe a phenomenon and to try to resynthesize or generate configurations from it. In the context of building classifiers, predictors, regressors and other task-driven systems, density estimation over all variables or a full generative description of the system can often be an inefficient intermediate goal. Clearly, therefore, the estimation frameworks in probabilistic generative models do not optimize parameters for a given specific task. These models are marred by generic optimization criteria such as maximum likelihood which are oblivious to the particular classification, prediction, or regression task at hand. Meanwhile, discriminative techniques such as support vector machines have little to offer in terms of structure and modeling power yet achieve superb performance on many test cases. This is due to their inherent and direct optimization of a task-related criterion. For example, Figure 1.3 shows an appropriate criterion for binary classification: the largest margin separation boundary (for a 3rd order polynomial model). 
The focus here is on classification as opposed to generation, thus properly allocating computational resources directly to the required task. Nevertheless, as previously mentioned, there are some fundamental differences between the two approaches that make it awkward to combine their strengths in a principled way. It would be of considerable value to propose an elegant framework that subsumes and unites both schools of thought; this is one challenge undertaken in this thesis.
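As a minimal sketch of this direct, task-driven style of estimation (deliberately simpler than the large-margin machinery of Figure 1.3, and using entirely synthetic data), one can fit only the input-to-label mapping by gradient ascent on a conditional criterion, never modeling how the inputs themselves are distributed:

```python
import numpy as np

# Illustrative only: a discriminative learner (logistic regression) trained
# by gradient ascent on conditional log-likelihood. The data, step size and
# iteration count are made-up choices, not taken from the thesis.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)),   # class 0 inputs
               rng.normal(+1.0, 1.0, (50, 2))])  # class 1 inputs
y = np.array([0] * 50 + [1] * 50)

w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # p(y=1 | x) under current model
    w += 0.1 * X.T @ (y - p) / len(y)       # gradient of the log-likelihood
    b += 0.1 * np.mean(y - p)

acc = np.mean(((X @ w + b) > 0) == y)
print(acc)  # well above chance on this mostly separable toy problem
```

The point is only that the training objective (conditional likelihood of the labels) matches the evaluation task (classification), with no intermediate generative goal of modeling the input density.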
4 Here, we are using the term 'sophisticated' to refer to the extra tailoring that the generative model traditionally obtains from the user to incorporate domain-specific knowledge about the problem at hand (in terms of priors, structures, etc.). This is therefore not a claim about the relative mathematical sophistication of generative versus discriminative models.
1.4 Imitative Learning
So far, we have provided motivation using multiple instantiations of machine learning and generative models in the applied domains of perception, temporal modeling and autonomous agents. Due to the common probabilistic platform threading across these domains, it is natural to consider combinations and couplings between these systems. One of the platforms we will investigate for combining these synergistic approaches to learning is imitative learning, which will be used to train an autonomous agent that exhibits interactive behavior. Imitative learning provides an easy approach 5 to learning agent behavior: real examples of agents interacting in a world are provided, and these can be learned from and generalized. The two components of this process, passively perceiving real-world behavior and learning from it, are portrayed in Figure 1.4. The basic notion is to use a generative model at the perceptual level, so that virtual characters can be regenerated or resynthesized, while keeping a discriminative model for the temporal learning, to focus resources on the prediction task necessary for action selection. This conceptual loop and its implementation will be elaborated in Chapter 8. For now, we will briefly situate imitative learning in the context of other agent learning approaches and motivate it with background and related work.
Figure 1.4: Imitative Learning through Discrimination and Probabilistic Perception.
Various approaches have been proposed for learning autonomous agents in domains such as robotics and interactive graphics. While some utilize rule-based, discriminative, or generative models, we can also distinguish among them by the manner in which the learning process is cast within the overall agent behavior model: how will data be acquired, how will it be organized, how will it be labeled, what will be the task(s) of the agent, how will it generalize, and so forth. Traditional rule-based systems for agent modeling require a cumbersome enumeration of decisions [36]. A simpler alternative is supervised learning, where a teacher provides a few exemplars of the optimal behavior for a given context and a learning algorithm generalizes from the samples [160] [181] [20] [187]. Even less supervision is required in reinforcement learning [107]. This remains one of the most popular approaches to agent learning, in part due to its strong ethological and cognitive science roots. Therein, an agent explores its action space and is rewarded for performing appropriate actions in the appropriate context. Thus, supervision is minimized due to the simplicity of providing only a reward signal, and the supervision process can be done in an active, online setting. Imitative learning [169] is in a sense a combination of supervised and unsupervised (i.e. no teacher) learning. As we collect data of real people interacting with the real world, we are shown many exemplars of appropriate reactionary behavior in response to the current context. Thus, the data is already labeled and needs no teaching effort for the supervision. This, of course, assumes that
5 As Confucius says, there are three types of learning: “by reflection, which is noblest; second, by imitation, which is easiest; and third, by experience, which is the bitterest”.

perceptual techniques can record and represent natural real-world activity automatically. In such an incarnation, imitative learning only involves data collection and avoids manual supervision. It is for this reason that we will investigate it further and implement it with the discriminative and generative models we will propose. While, ideally, an agent would utilize a mixture of all learning methodologies and use multiple learning scenarios to introspect and bootstrap each other in a meta-learning approach [187], such an undertaking would be too ambitious. We leave the learning-to-learn aspect of agent behavior as an interesting topic for future research. At this point, we quickly motivate some background in imitation learning, which spans multiple fields (cognitive sciences, philosophy, psychology, neuroscience, robotics, etc.); a full survey would be beyond the scope of this thesis. Early research in the behavioral and cognitive sciences exhibited strong interest in the role of imitative learning. However, the groundbreaking works of Thorndike [185] and Piaget [153] were followed by a lull in the area of movement imitation. This was in part due to the presumption that imitation or mimicry in an entity was not necessarily the sign of higher intelligence and therefore not critical to development. This prejudice slowly faded with the arrival of several studies by Meltzoff and Moore indicating infants’ ability to perform facial/manual gesture imitation at ages of 12 to 21 days, and in some cases at an hour old [130] [131] [132] [133]. Imitative learning began to be seen as an almost innate mechanism helping the development of humans and certain species [192].
Furthermore, it was demonstrated to be absent in some lower-order animals [191] and very limited in others [190]. In addition, through recent discoveries of mirror neurons, action-perception pathways and functional magnetic resonance imaging results [2] [157] [77] [165], a neural basis for imitative learning has recently been hypothesized 6. Certain experiments indicated consistent firings in a mirror neuron either when an action was performed by a subject or when another individual was perceived performing the same action. In addition, imitation has also been suggested as a possible basis for language learning [164]. These results have spurred applied efforts in imitative approaches to robotics by Mataric [123], Brooks [31], and others, where imitation has gained visibility and complemented reinforcement learning [107]. Further arguments for imitation-based learning include improved acquisition of complex visual gestures in human subjects [39]. However, these domains have predominantly focused on uncovering direct mappings between action and perception [169] [123]. It is through such a mapping that the imitation learning problem can be translated into a direct supervised learning one. This complex mapping is, to a certain extent, the Achilles’ heel of imitation learning. Much of the effort in humanoid robot imitation rests in resolving Meltzoff and Moore’s ’Active Intermodal Mapping’ (AIM) problem: the creation of a mapping from the visual perception of a teacher’s movement to high-level representations that can then be matched to other high-level representations of the learner’s action space and proprioceptive senses (see Figure 1.5). Effectively, the AIM problem is a change-of-coordinates task where intermediate representations allow a mapping of the learner’s action space to various perceived situations and various teachers. This key challenge has driven a substantial effort in the area of humanoid robotics.
Various simplifications to the AIM problem can be made to permit faster learning of robot behavior; however, many of these over-simplify the action and perception spaces and consequently generate uninteresting robot behavior. For example, Billard [19] over-simplifies action and perception spaces into a low-dimensional discrete representation and then has a simple learning mechanism (not much more than a rote learner) resolve the one-to-one mapping. An alternative approach is to do away with the AIM problem altogether, either by providing the teacher's perceptual data in terms of the action-space of the learner [201] [124] or by only considering virtual characters [74] [61] [101] [93] whose action space is in the perceptual space. For example,
6 The discovery of imitation neurons has recently generated extreme enthusiasm in psychology and has been predicted to provide that field with a leap forward equivalent to the progress in biology obtained from work on DNA [157]. It is also suggested that dysfunctional mirror neurons may explain certain phenomena such as autism.
Figure 1.5: The AIM mapping from action to perception.
Weng [201] describes a human pushing a robot down a hallway while the robot collects images of its context. The actuators in the robot (not its cameras) measure the human's displacement, and therefore, to imitate the human, the displacement values need only be regurgitated (under the appropriate visual context). Hogg [101] alternatively describes a vision system which obtains perceptual measurements and need only resynthesize behavior in the visual space to generate an action virtually. Both methods cleverly avoid a direct mapping of the perception of a teacher's activity into the learner's action-space. Clearly, the lack of a higher-order representation of perception and action requires some extra care. Changes in the coordinate system are not abstracted away and must be dealt with up front. Thus, the actuators in Weng's robot cannot be replaced by a totally different motor system, and the vision system in Hogg's virtual characters cannot be pointed at a radically new scene. One way to side-step the lack of an abstraction layer is to lock the perceptual (or action) coordinate system. This becomes difficult if the learner has to acquire lessons from multiple teachers in multiple contexts, and it therefore precludes many types of long-term training (or developmental [126]) data scenarios. Furthermore, in a long-term training scenario, it is important to weed out irrelevant data and outliers in the behavioral data and to focus resources discriminatively on the defining exemplars in the training data. Thus, the second challenge of this thesis is to implement imitative learning and show that it can be cast into the discriminative and generative formalisms described previously. This extends the theoretical combination of discriminative and generative learning to a practical, applied task of imitation-based learning.
1.5 Objective
Therefore, we pose the following two main challenges. We will seek a combined discriminative and generative framework which extends the powerful generative models that are popular in the machine learning community into discriminative frameworks such as those present in support vector machines. We will also seek an imitative learning approach that casts interactive behavior acquisition into the discriminative and generative framework we propose. There are many features and subgoals within these two large contributions which we will also strive for. The following list enumerates some of these in further detail.
• Combined Generative and Discriminative Learning
Ideally, the combination of generative and discriminative learning should be done at a formal level so that it is easily extensible and can be related to other domains and approaches. We will provide a discriminative classification framework that retains much of the formalism of Bayesian generative modeling via an appeal to maximum entropy, which already has many connections to Bayesian approaches. The formalism also has powerful connections to regularization theory and support vector machines, two important and principled approaches in the discriminative school of thought.
• Formal Generalization Guarantees
While empirical validation can support the combined generative-discriminative framework, we will also refer the reader to formal generalization guarantees from different perspectives. Various arguments from the literature, such as sparsity, VC-dimension and PAC-Bayes generalization bounds, will be compatible with the framework.
• Applicability to a Spectrum of Bayesian Generative Models
To span a wide variety of generative models, we will focus on the exponential family, which is central to much of statistics and maximum likelihood estimation. The discriminative methods will be consistently applicable to this large family of distributions.
• Ability to Handle Latent Variables
While the strength of most generative models lies in their ability to handle latent variables and mixture models, we will ensure that the discriminative method can also span these higher-order multimodal distributions. Through novel bounds, we will extend beyond the classical Jensen inequality, which permits much of generative modeling to apply to mixtures.
• Analytic Reversal of Jensen's Inequality
We will present an analytic reversal of Jensen's inequality which is useful in statistics and provides a new mathematical tool. This reversal will permit various important manipulations, in particular the use of discriminative estimation on latent generative models. The mathematical tool also permits two-sided bounds on many probabilistic quantities which otherwise only had a single bound.
• Computational Efficiency
Throughout the development of the discriminative, generative and imitative learning procedures, we will consistently discuss issues of computational efficiency and implementation. These frameworks will be shown to be viable in large data scenarios and computationally as tractable as their traditional counterparts.
• Extensibility
Many extensions will be demonstrated in the hybrid generative-discriminative approach which justify its usefulness. These include the ability to handle regression, multiclass classification, transduction, feature selection, structure learning, exponential family models and mixtures of the exponential family.
• Casting Imitative Learning into a Generative/Discriminative Framework
We will bring imitative learning into a generative/discriminative setting by describing generative models over the perceptual domain and over the temporal domain. These will be augmented by discriminatively learning a prediction model to synthesize interactive behavior in an autonomous agent.
• Combining Perception, Learning and Behavior Acquisition
We will demonstrate a real-time system that performs perception and learning and that acquires and synthesizes real-time behavior. This closes the loop we proposed and shows how a common discriminative and probabilistic framework can be extended throughout the multiple facets of a large system.
1.6 Scope
This thesis focuses on computational and statistical aspects of machine learning and machine perception. Therein, discussions of different types of learning (discriminative, generative, conditional, imitative, reinforcement, supervised, unsupervised, etc.) refer primarily to the computational and mathematical aspects of these terms. Connections to the large bodies of work in the cognitive sciences, psychology, neuroscience, philosophy, ethology, and so forth will invariably arise; however, these are brought in for motivation and implementation purposes and should not necessarily be taken as a direct challenge to the conventional wisdom in those fields.
1.7 Organization
The thesis is organized as follows:
• Chapter 2
The complementary advantages of discriminative and generative learning are discussed. We formalize the many models and methods of inference in generative, conditional and discriminative learning. The various advantages and disadvantages of each are enumerated, and motivation is given for methods of fusing them.
• Chapter 3
The Maximum Entropy Discrimination formalism is introduced as the method of choice for combining generative models in a discriminative estimation setting. The formalism is presented as an extension to regularization theory and shown to subsume support vector machines. A discussion of margins, bias and model priors is presented. The MED framework is then extended to handle generative models in the exponential family. Comparisons are made with state of the art support vector machines and other learning algorithms. Generalization guarantees on MED are then provided by appealing to recent results in the literature.
• Chapter 4
Various extensions to the Maximum Entropy Discrimination formalism are proposed and elaborated. These include multiclass classification, regression and feature selection. Furthermore, transduction is discussed as well as optimization issues. The chapter then motivates the need for latent models in MED and for mixtures of the exponential family. Comparisons are made with state of the art support vector machines and other learning algorithms.
• Chapter 5
Latent learning is motivated in a discriminative setting via reverse-Jensen bounds. The so-called Conditional Expectation Maximization framework is then proposed for latent conditional and discriminative (MED) problems. Bounds are given for exponential family mixture models and mixing coefficients. Comparisons are made with the state of the art Expectation-Maximization approaches.
• Chapter 6
We consider the case of discriminative learning of structured mixture models, where the mixture is not flat but has additional complications that generate an intractable number of latent configurations. This is the case for many Bayesian networks and generally prevents a tractable computation of the reverse-Jensen bounds. We show how the reverse-Jensen bounds can be computed efficiently in some of these circumstances and therefore extend the applicability of latent discrimination and CEM to structured mixture models such as hidden Markov models and mixture models over an aggregated data set.
• Chapter 7
This chapter begins with a brief discussion of work in the mathematical inequalities community, including prior reversals and converses of Jensen's inequality. Then, a derivation of the reverse-Jensen inequality for use in discriminative learning is performed to justify its form and give global guarantees on the bounds.
• Chapter 8
Imitative learning is cast as a discriminative temporal prediction problem where an agent's next action is predicted from its previous state and the world state. The implementation details for the hardware and the perceptual systems are discussed. A generative model for the audio, images and temporal structure is then presented and provides the representations for the discriminative prediction problem. Synthesized behavior is then demonstrated, as well as some quantitative measurement of performance.
• Chapter 9
The advantage of a joint framework for generative, discriminative and imitative learning is reiterated. The various contributions of the thesis are summarized. Future extensions and elaborations are proposed.
• Chapter 10
This appendix gives a few standard derivations that are called upon in the main body of the thesis. This includes a brief discussion of convex duality as well as details of numerical procedures mentioned in Chapter 7.

Chapter 2

Generative vs. Discriminative Learning
All models are wrong, but some are useful 1. George Box, 1979
In this chapter, we will situate discriminative and generative learning more formally in the context of their estimation algorithms and the criteria they optimize. A natural intermediate between the two is conditional learning, which helps to visualize a coarse continuum between these extremes. Figure 2.1 describes a panorama of approaches as we go horizontally from the extreme of generative criteria to discriminative criteria. Similarly, another (vertical) scale of variation can be seen in the estimation procedures as we go from direct optimizations of the criteria on training data, to regularized ones, and finally to fully averaged ones which attempt to better approximate the behavior of the criteria on future data without overfitting. In this chapter we begin with a sample of generative and discriminative techniques and then explore the entries in Figure 2.1 in more detail. Generative models at one extreme attempt to estimate a distribution over all variables (inputs and outputs) in a system. This is inefficient since we only need conditional distributions of output given input to perform classification or prediction, which motivates a more minimalist approach: conditional modeling. However, in many practical systems we are even more minimalist than that, since we only need a single estimate from a conditional distribution. So even conditional modeling may be inefficient, which motivates discriminative learning, since it considers only the input-output mapping. We conclude with some hybrid frameworks for combining generative and discriminative models and point out their limitations.
1At the risk of misquoting what Box truly intended to say about robust statistics, we shall use this quote to motivate combining the usefulness of generative models with the robustness, practicality and performance of discriminative estimation.
                  GENERATIVE             CONDITIONAL             DISCRIMINATIVE

LOCAL             Maximum Likelihood     Maximum Conditional     Empirical Risk Minimization,
                                         Likelihood              Maximum Mutual Information

LOCAL + PRIOR     Maximum A Posteriori   Maximum Conditional     Support Vector Machines,
                                         A Posteriori            Regularization Theory

MODEL AVERAGING   Bayesian Inference     Conditional Bayesian    Maximum Entropy Discrimination,
                                         Inference               Bayes Point Machines

Figure 2.1: Scales of Discrimination and Integration in Learning.
2.1 Two Schools of Thought
We next present what could be called two schools of thought: discriminative and generative approaches. Alternative descriptions of the two formalisms include “discriminative versus informative” approaches [167]. Generative approaches produce a probability density model over all variables in a system and manipulate it to compute classification and regression functions. Discriminative approaches attempt to compute the input-to-output mappings for classification and regression directly and eschew the modeling of the underlying distributions. While the holistic picture of generative models is appealing for its completeness, it can be wasteful and non-robust. Furthermore, as Box says, all models are wrong (but some are useful). Therefore, the graphical models and the prior structures that we enforce on our generative models may have some useful elements, yet should always be treated cautiously, since in real-world problems the true distribution almost never coincides with the one we have constructed. In fact, Bayesian inference does not guarantee that we will obtain the correct posterior estimate if the class of distributions we consider does not contain the true generator of the data we are observing. Here we show examples of the two schools of thought and then elaborate on the learning criteria they use in the next section.
2.1.1 Generative Probabilistic Models
In generative or Bayesian probabilistic models, a system's input (covariate) features and output (response) variables (as well as unobserved variables) are represented homogeneously by a joint probability distribution. These variables can be discrete or continuous and may also be multidimensional. Since generative models define a distribution over all variables, they can also be used for classification and regression [167] by standard marginalization and conditioning operations. Generative models or probability densities in current use typically span the class of exponential family distributions and mixtures of the exponential family. More specifically, popular models in various domains include Gaussians, naive Bayes, mixtures of multinomials, mixtures of Gaussians [20], mixtures of experts [105], hidden Markov models [156], sigmoidal belief networks, Bayesian networks [98] [118] [149], Markov random fields [209], and so forth.
For n variables (x_1, \ldots, x_n), we therefore have a full joint distribution of the form p(x_1, \ldots, x_n). Given a good joint distribution that accurately captures the (possibly nondeterministic) relationships between the variables, it is straightforward to use it for inference and to answer queries. This is done by straightforward manipulations of the basic axioms of probability theory, namely marginalizing, conditioning and using Bayes' rule:

p(x_j) = \sum_{x_i, \, i \neq j} p(x_1, \ldots, x_n)

p(x_j | x_k) = \frac{p(x_j, x_k)}{p(x_k)}

p(x_j | x_k) = \frac{p(x_k | x_j) \, p(x_j)}{p(x_k)}

Thus, by conditioning a joint distribution, we can easily form classifiers, regressors and predictors which map input variables to output variables. For instance, we may want to obtain an estimate of the output \hat{x}_j (which may be a discrete class label or a continuous regression output) from the input x_k using the conditional distribution p(x_j | x_k). While a purist Bayesian would argue that the only appropriate answer is the conditional distribution p(x_j | x_k) itself, in practice we must settle for an approximation to obtain an \hat{x}_j. For example, we may randomly sample from p(x_j | x_k), compute its expectation, or find the mode(s) of the distribution, i.e. \arg\max_{x_j} p(x_j | x_k).
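To make these manipulations concrete, here is a small sketch with a made-up joint distribution over two binary variables; the numbers are purely illustrative:

```python
import numpy as np

# Hypothetical joint distribution p(x_j, x_k) over two binary variables,
# used only to illustrate marginalization, conditioning and a MAP query.
p_joint = np.array([[0.3, 0.1],   # rows: x_j in {0, 1}
                    [0.2, 0.4]])  # cols: x_k in {0, 1}

# Marginalization: p(x_k) = sum over x_j of p(x_j, x_k)
p_xk = p_joint.sum(axis=0)

# Conditioning: p(x_j | x_k) = p(x_j, x_k) / p(x_k)  (column-normalize)
p_xj_given_xk = p_joint / p_xk

# A mode-based "classifier": pick the mode of p(x_j | x_k = 1)
xj_hat = int(np.argmax(p_xj_given_xk[:, 1]))

print(p_xk)    # [0.5 0.5]
print(xj_hat)  # 1, since p(x_j=1 | x_k=1) = 0.8
```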
Figure 2.2: Directed Graphical Models.
There are many ways to constrain this joint distribution such that it has fewer degrees of freedom before we directly estimate it from data. One way is to structurally identify conditional independencies between variables. This is depicted, for example, with the directed graph (Bayesian network) in Figure 2.2. Here, the graph identifies that the joint distribution factorizes into a product of conditional distributions over the variables given their parents (here \pi_i denotes the parents of the variable x_i, i.e. of node i):

p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i | x_{\pi_i})
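As a sketch of this factorization, the following assumes a hypothetical parametrization of the tree-structured network of Figure 2.2 with binary nodes and made-up conditional probability tables; it verifies that the product of local conditionals is itself a normalized joint distribution:

```python
# Hypothetical binary Bayesian network shaped like the tree of Figure 2.2:
# x1 is the root; x2, x3 are its children; x4..x7 are leaves. The parent
# assignments and all probabilities below are illustrative, not from the text.
parents = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}

# cpt[i][v] = p(x_i = 1 | parent of x_i takes value v); the root ignores v.
cpt = {1: [0.6, 0.6],
       2: [0.2, 0.7], 3: [0.5, 0.9],
       4: [0.1, 0.8], 5: [0.3, 0.6],
       6: [0.4, 0.5], 7: [0.2, 0.9]}

def joint(x):
    """p(x1,...,x7) as the product of p(x_i | x_{pi_i})."""
    p = 1.0
    for i in range(1, 8):
        pa = parents[i]
        pa_val = 0 if pa is None else x[pa]
        p1 = cpt[i][pa_val]
        p *= p1 if x[i] == 1 else (1.0 - p1)
    return p

# The factorized joint still normalizes to 1 over all 2^7 configurations.
total = sum(joint({i + 1: (bits >> i) & 1 for i in range(7)})
            for bits in range(2 ** 7))
print(round(total, 10))  # 1.0
```

The factorization needs only 7 small tables instead of the 2^7 - 1 free parameters of an unconstrained joint, which is precisely the reduction in degrees of freedom the graph buys us.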
Alternatively, we can parametrically constrain the distribution by giving prior distributions over the variables and hyper-variables that affect them. For example, we may restrict two variables (x_i, x_k) to be jointly a mixture of Gaussians with unknown means and a covariance equal to identity:

p(x_i, x_k) = \alpha \, \mathcal{N}\!\left( \begin{bmatrix} x_i \\ x_k \end{bmatrix} ; \mu_1, I \right) + (1 - \alpha) \, \mathcal{N}\!\left( \begin{bmatrix} x_i \\ x_k \end{bmatrix} ; \mu_2, I \right)
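A quick sketch of this restricted model, with illustrative means and mixing weight (the true parameters would be unknown a priori and estimated from data):

```python
import numpy as np

# Two-component mixture of 2-D Gaussians with identity covariance.
# alpha, mu1 and mu2 are made-up values chosen only for illustration.
alpha = 0.3
mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 3.0])

def gauss_pdf(x, mu):
    """N(x; mu, I) for 2-dimensional x: exp(-||x-mu||^2 / 2) / (2*pi)."""
    d = x - mu
    return np.exp(-0.5 * d @ d) / (2 * np.pi)

def p(x):
    return alpha * gauss_pdf(x, mu1) + (1 - alpha) * gauss_pdf(x, mu2)

x = np.array([0.0, 0.0])
print(p(x))  # dominated by the first component: roughly 0.3 / (2*pi)
```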
Other types of restrictions exist, for example those related to sufficiency and separability [152], where a conditional distribution might simplify into a mixture of simpler conditionals, as in:

p(x_i | x_j, x_k) = \alpha \, p(x_i | x_j) + (1 - \alpha) \, p(x_i | x_k)
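A small numeric check (with made-up numbers) that such a separable restriction still yields a valid conditional distribution:

```python
# For discrete x_i, a convex combination of two normalized conditionals,
# alpha*p(x_i|x_j) + (1-alpha)*p(x_i|x_k), still sums to one over x_i.
# All values below are illustrative.
alpha = 0.4
p_given_xj = [0.2, 0.8]   # p(x_i | x_j = some fixed value)
p_given_xk = [0.7, 0.3]   # p(x_i | x_k = some fixed value)

mixture = [alpha * a + (1 - alpha) * b
           for a, b in zip(p_given_xj, p_given_xk)]
print(sum(mixture))  # 1.0
```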
Thus, a versatility is inherent in working in the joint distribution space, since we can insert knowledge about the relationships between variables, invariants, independencies, prior distributions and so forth. This includes all variables in the system: unobserved, observed, input or output variables. This makes generative probability distributions a very flexible modeling tool. Unfortunately, the learning algorithms used to combine such models with the observed data and produce a final posterior distribution can sometimes be inefficient. Finding the ideal generator of the data (combined with the prior knowledge) is only an intermediate goal in many settings. In practical applications, we wish to use these generators for the ultimate tasks of classification, prediction and regression. Thus, in optimizing for an intermediate generative goal, we sacrifice resources and performance on these final discriminative tasks. In Section 2.2 we discuss techniques for learning from data in generative approaches.
2.1.2 Discriminative Classifiers and Regressors
Discriminative approaches make no explicit attempt to model the underlying distributions of the variables and features in a system and are only interested in optimizing a mapping from the inputs to the desired outputs (say a discrete class or a scalar prediction) [167]. Thus, only the resulting classification boundary (or function approximation accuracy for regression) is adjusted, without the intermediate goal of forming a generator that can model the variables in the system. This focuses model and computational resources on the given task and provides better performance. Popular and successful examples include logistic regression [76] [68], Gaussian processes [65], regularization networks [66], support vector machines [196], and traditional neural networks [20]. Robust (discriminative) classification and regression methods have been successful in many areas, ranging from image and document classification [100] [161] to problems in biosequence analysis [82] [202] and time series prediction [140]. Techniques such as support vector machines [197], Gaussian process models [203], boosting algorithms [57, 58], and more standard but related statistical methods such as logistic regression, are all robust against errors in structural assumptions. This property arises from a precise match between the training objective and the criterion by which the methods are subsequently evaluated. There is no surrogate intermediate goal to obtain a good generative model. However, discriminative algorithms do not extend well to classifiers and regressors arising from generative models, where the resulting parameter estimation is hard [167]. The models discriminative techniques use (parametric or otherwise) often lack the elegant probabilistic concepts of priors, structure, and uncertainty that are so beneficial in generative settings. Instead, alternative notions of penalty functions, regularization, kernels and so forth are used.
Furthermore, learning (not modeling) is the focus of discriminative approaches, which often lack flexible modeling tools and methods for inserting prior knowledge. Thus, discriminative techniques can feel like black boxes, where the relationships between variables are not as explicit or visualizable as in generative models. Furthermore, discriminative approaches may be inefficient to train since they require simultaneous consideration of all data from all classes. Another inefficiency arises because each task a discriminative inference engine needs to solve requires a different model and a new training session. Various methods exist to alleviate the extra work arising in discriminative learning. These include online learning, which can be easily applied to, e.g., boosting procedures [147] [55] [58]. Moreover, it is not always necessary to construct all possible discriminative mappings in a system of variables, which would require an exponential number of models [64]. Frequent tasks, i.e. canonical classification and regression objectives, can be targeted with a handful of discriminative models while a generative model can be kept around for handling occasional missing labels, unusual types of inference and so forth. Section 2.4 will discuss techniques for learning from data in discriminative approaches.
2.2 Generative Learning
There are many variations for learning generative models from data. The many approaches, priors and model selection criteria include minimum description length [163], the Bayesian information criterion, the Akaike information criterion, entropic priors, and so on; a survey is beyond the scope of this thesis. We will instead quickly discuss the popular classical approaches: Bayesian inference, maximum a posteriori and maximum likelihood estimation. These can be seen as lying on a scale that ranges from fully weighted averaging over all generative model hypotheses (Bayesian inference), to more local computation with simple regularization/priors (maximum a posteriori), to the simplest maximum likelihood estimator, which only considers performance on training data.
2.2.1 Bayesian Inference
In Bayesian inference [20] [120] [25] [135] [22] [167], the probability density function of a vector z is typically estimated from a training set of such vectors Z. More generally, z need not be a vector but could correspond to multiple observable variables that are either continuous or discrete. We will assume they are vectors here without loss of generality. The (joint) Bayesian inference estimation process is shown in Equation 2.1.
p(z|Z) = ∫ p(z, Θ|Z) dΘ = ∫ p(z|Θ, Z) p(Θ|Z) dΘ        (2.1)
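The integral in Equation 2.1 can be sketched by Monte Carlo averaging of the likelihood over posterior samples of Θ. The toy model below (a scalar Gaussian with known unit variance and a wide Gaussian prior on its mean, so the posterior is available in closed form) is our own illustration:

```python
import numpy as np

# Toy setup (our own illustration): z ~ N(theta, 1) with prior theta ~ N(0, 10^2).
# Gaussian likelihood and prior give a closed-form Gaussian posterior p(theta|Z).
rng = np.random.default_rng(0)
Z = rng.normal(2.0, 1.0, size=50)
prior_var, lik_var = 100.0, 1.0
post_var = 1.0 / (1.0 / prior_var + len(Z) / lik_var)
post_mean = post_var * (Z.sum() / lik_var)

# Equation 2.1: p(z|Z) = integral of p(z|theta) p(theta|Z) dtheta, approximated
# here by averaging the likelihood over posterior samples of theta.
thetas = rng.normal(post_mean, np.sqrt(post_var), size=5000)
z = 2.0
p_z = np.mean(np.exp(-0.5 * (z - thetas) ** 2) / np.sqrt(2 * np.pi))
print(p_z)
```

For this conjugate model the answer can be checked against the closed-form predictive density N(z; post_mean, 1 + post_var); the sampling route is what one falls back on when no closed form exists.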
By integrating over Θ, we are essentially integrating over all the pdf (probability density function) models possible. This involves varying the families of pdfs and all their parameters. However, this is often impossible, and instead a sub-family is selected and only its parameterization Θ is varied. Each Θ is a parameterization of a pdf over z and is weighted by its likelihood given the training set. Having obtained a p(z|Z) or, more compactly, a p(z), we can compute the probability of any point z in the (continuous or discrete) probability space². However, evaluating the pdf in such a manner is not necessarily the ultimate objective. Often, some components of the vector are given as input (x) and the learning system is required to estimate the missing components as output³ (y). In other words, z can be broken up into two sub-vectors x and y, and a conditional pdf is computed from the original joint pdf over the whole vector as in Equation 2.2. This conditional pdf is p(y|x)^j, with the j superscript indicating that it is obtained from the previous estimate of the joint density. When an input x0 is specified, this conditional density becomes a density over y, the desired output of the system. The y element may be a continuous vector, a discrete value or some other sample from the probability space p(y). If this density is the required function of the learning system and a final output estimate ŷ is needed, the expectation or arg max of p(y|x0) is used.
p(y|x)^j = p(z) / ∫ p(z) dy = p(x, y) / ∫ p(x, y) dy = p(x, y) / p(x) = [∫ p(x, y|Θ) p(Θ|X, Y) dΘ] / [∫ p(x|Θ) p(Θ|X, Y) dΘ]        (2.2)
In the above derivation, we have deliberately expanded the Bayesian integral to emphasize the underlying joint density estimation.
²This is the typical task of unsupervised learning.
³This is the typical task of supervised learning.
This is to permit us to differentiate the above joint Bayesian inference technique from its conditional counterpart, conditional Bayesian inference.
2.2.2 Maximum Likelihood
Traditionally, computing the integral in Equation 2.1 is not always straightforward, and Bayesian inference is often approximated via maximum a posteriori (MAP) or maximum likelihood (ML) estimation [47] [129], as in Equation 2.3.
p(z|Z) ≈ p(z|Θ*, Z)  where  Θ* = arg max_Θ p(Θ|Z) = arg max_Θ p(Z|Θ) p(Θ)   (MAP)        (2.3)
                            Θ* = arg max_Θ p(Z|Θ)                           (ML)
Under iid (independent identically distributed data) conditions, it is easier to instead compute the maximum of the logarithm of the above quantities (which results in the same arg max). Thus we expand the above for the maximum likelihood case as follows:

log p(Z|Θ) = Σ_t log p(Z_t|Θ)
The above optimization of joint likelihood is thus additive in the sense that each data point contributes to it in an additive way (after the logarithm), which facilitates optimization for, e.g., exponential family distributions. Thus, maximum likelihood and maximum a posteriori can be seen as approximations to Bayesian inference where the integral over a distribution of models is replaced with the mode. The a posteriori solution allows the use of a prior to regularize the estimate, while the maximum likelihood approach merely optimizes the model on the training data alone, which may cause overfitting. Thus, MAP also permits the user to insert prior knowledge about the parameters of the model and bias it towards solutions that are more likely to generalize well. For example, one may consider priors that favor simpler models and therefore avoid overfitting [188], or entropic priors that sparsify the model [26]. Meanwhile, the maximum likelihood criterion only considers the training examples and optimizes the model specifically for them. This is guaranteed to converge to the model for the true distribution as we obtain more and more samples, yet it can also overfit and show poor generalization when limited samples are available. Thus, ML is a more local solution than MAP since it is tuned only to the training data, while MAP is tuned to the data as well as the prior (and approximates the weighted averaging of all models in the Bayesian inference solution more closely). Furthermore, an important duality between ML/MAP approximations and maximum entropy also exists [102]. Standard maximum entropy approaches solve for a distribution that is closest to uniform (in a Kullback-Leibler divergence sense) while having the same moments as the empirical distribution (i.e. entropy projection with moment constraints).
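A small sketch of this additivity at work (illustrative unit-variance Gaussian data; the grid search is chosen for transparency, not efficiency): the maximizer of the summed log-likelihood coincides with the sample mean.

```python
import numpy as np

# Illustrative data: unit-variance Gaussian with unknown mean theta, so
# log p(Z|theta) = sum_t log p(Z_t|theta) is additive over the data points.
rng = np.random.default_rng(1)
Z = rng.normal(3.0, 1.0, size=200)

def log_likelihood(theta, Z):
    return np.sum(-0.5 * (Z - theta) ** 2 - 0.5 * np.log(2 * np.pi))

# Transparent (if inefficient) grid search for the maximum likelihood estimate.
grid = np.linspace(0, 6, 601)
theta_ml = grid[np.argmax([log_likelihood(th, Z) for th in grid])]
print(abs(theta_ml - Z.mean()) < 0.01)  # True: the ML estimate is the sample mean
```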
Moreover, maximum likelihood and MAP have extensive asymptotic convergence properties and are efficient and straightforward to compute for exponential family distributions without latent variables, i.e. the complete data case. For the case of incomplete data, mixture models and more sophisticated generative models, however, maximum likelihood and MAP estimates are not directly computable and there exist many local maxima in the objective function. In those situations, it is standard to use iterative algorithms such as Expectation Maximization (EM) or variants. The EM algorithm is frequently utilized to perform this maximization of likelihood for mixture models due to its monotonic convergence properties and ease of implementation [13] [12] [44] [134] [117]. The Expectation (E) step consists of computing a bound on the log likelihood using a straightforward application of Jensen's inequality [99] [150]. The Maximization (M) step is then the usual maximum likelihood step that would be used for a complete data model. Furthermore, various generalizations of EM have also been proposed [41] [1] [144] which bring deep geometric concepts and simplifying tools to the problem. For mixture models in a Bayesian inference setting, maximization is not appropriate. However, the Jensen bounds that EM uses can also be used to bound the integration. This process is known as variational Bayesian inference [3] [63] [84], which provides a more precise approximation to Bayesian inference while making computation tractable for latent models.
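The E and M steps can be sketched minimally for a two-component mixture, assuming (for illustration only) unit variances and equal, fixed mixing weights so that only the means are updated:

```python
import numpy as np

# Minimal EM sketch (illustration only): two-component 1D Gaussian mixture with
# unit variances and equal, fixed mixing weights; only the means are learned.
rng = np.random.default_rng(2)
Z = np.concatenate([rng.normal(-3, 1, 100), rng.normal(3, 1, 100)])
mu = np.array([-1.0, 1.0])  # initial guess for the two means

def normal_pdf(z, m):
    return np.exp(-0.5 * (z - m) ** 2) / np.sqrt(2 * np.pi)

for _ in range(20):
    # E-step: responsibilities, i.e. the Jensen-bound posterior over assignments.
    r0, r1 = normal_pdf(Z, mu[0]), normal_pdf(Z, mu[1])
    q = r0 / (r0 + r1)
    # M-step: complete-data maximum likelihood updates of the means.
    mu = np.array([np.sum(q * Z) / np.sum(q), np.sum((1 - q) * Z) / np.sum(1 - q)])

print(np.round(np.sort(mu), 1))  # means recovered near -3 and 3
```

Each iteration is guaranteed not to decrease the log-likelihood, which is the monotonic convergence property mentioned above.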
2.3 Conditional Learning
While generative learning seeks to estimate a probability distribution over all the variables in the system, including both inputs and outputs, it is possible to be more efficient if the task we are trying to solve is made explicit. If we know precisely what conditional distributions will be used, it is more appropriate to directly optimize the conditional distributions instead of the generative model as a whole. Since we will be using our probability model almost exclusively to compute the conditional over the output (response) variables given the inputs (covariates), we can directly optimize parameters and fit the model to data such that this task is done optimally. This is not quite discriminative learning since we are still fitting a probability density over the outputs and we have a generative model of the outputs given the inputs, i.e. p(y|x). In a purist discriminative setting, we would only consider the final estimate ŷ and extract that from the distribution in a winner-take-all type of scenario. Thus, we can view conditional learning as an intermediate between discriminative and generative learning. We are still optimizing a probability distribution, but only the one that we will ultimately use for classification or regression purposes. Thus, in the spirit of minimalism, we do away with the need to learn the joint generative model, say p(x, y), and focus only on the conditional distribution p(y|x).
2.3.1 Conditional Bayesian Inference
Obtaining a conditional density from the unconditional (i.e. joint) probability density function in Equation 2.1 and Equation 2.2 is roundabout and can be shown to be suboptimal. However, it has remained popular and is convenient partly because of the availability of powerful techniques for joint density estimation (such as EM). If we know a priori that we will need the conditional density, it is evident that it should be estimated directly from the training data. Direct Bayesian conditional density estimation is defined in Equation 2.4. The vector x (the input or covariate) is always given and the y (the output or response) is to be estimated. The training data is of course also explicitly split into the corresponding X and Y vector sets. Note here that the conditional density is referred to as p(y|x)^c to distinguish it from the expression in Equation 2.2.
p(y|x)^c = p(y|x, X, Y) = ∫ p(y, Θ^c|x, X, Y) dΘ^c
         = ∫ p(y|x, Θ^c, X, Y) p(Θ^c|x, X, Y) dΘ^c        (2.4)
         = ∫ p(y|x, Θ^c) p(Θ^c|X, Y) dΘ^c
Here, Θ^c parameterizes a conditional density p(y|x). Suppose Θ^c is exactly the parameterization of the conditional density p(y|x) that results from the joint density p(x, y) parameterized by Θ. Initially, it seems intuitive that the above expression should yield exactly the same conditional density as before. It seems natural that p(y|x)^c should equal p(y|x)^j since Θ^c is just the conditioned version of Θ. In other words, if the expression in Equation 2.1 is conditioned as in Equation 2.2, then the result in Equation 2.4 should be identical. This conjecture is wrong. Upon closer examination, we note an important difference. The Θ^c we are integrating over in Equation 2.4 is not the same Θ as in Equation 2.1. In the direct conditional density estimate (Equation 2.4), Θ^c only parameterizes a conditional density p(y|x) and therefore provides no information about the density of x or X. In fact, we can assume that the conditional density parameterized by Θ^c is just a function over x with some parameters. Therefore, we can essentially ignore any relationship it could have to some underlying joint density parameterized by Θ. Since this is only a conditional model, the term p(Θ^c|X, Y) in Equation 2.4 behaves differently than the similar term p(Θ|Z) = p(Θ|X, Y) in Equation 2.1. This is illustrated in the manipulation involving Bayes rule shown in Equation 2.5.
p(Θ^c|X, Y) = p(Y|Θ^c, X) p(Θ^c, X) / p(X, Y)
            = p(Y|Θ^c, X) p(X|Θ^c) p(Θ^c) / p(X, Y)        (2.5)
            = p(Y|Θ^c, X) p(X) p(Θ^c) / p(X, Y)
In the final line of Equation 2.5, an important manipulation is noted: p(X|Θ^c) is replaced with p(X). This implies that observing Θ^c does not affect the probability of X. This operation is invalid in the joint density estimation case since Θ has parameters that determine a density in the X domain. However, in conditional density estimation, if Y is not also observed, Θ^c is independent of X. It in no way constrains or provides information about the density of X since it is merely a conditional density over p(y|x). This independence property does not always hold; however, here we are strictly assuming that the parameterization Θ^c is such that there is only a conditional functional dependence between the parameters and the input variables (i.e. no marginal distribution over X should be induced from Θ^c). The graphical models in Figure 2.3 depict the difference between joint density models and conditional density models using a directed acyclic graph [118] [98]. Note that the Θ^c model and the X are independent if Y is not observed in the conditional density estimation scenario. In graphical terms, the joint parameterization Θ is a parent of the children nodes X and Y. Meanwhile, the conditional parameterization Θ^c and the X data are co-parents of the child Y (they are marginally independent). Equation 2.6 then finally illustrates the directly estimated conditional density solution p(y|x)^c.
Figure 2.3: The graphical models for (a) joint density estimation and (b) conditional density estimation.
p(y|x)^c = ∫ p(y|x, Θ^c) p(Θ^c|X, Y) dΘ^c
         = ∫ p(y|x, Θ^c) [p(Y|Θ^c, X) p(X) p(Θ^c) / p(X, Y)] dΘ^c        (2.6)
         = ∫ p(y|x, Θ^c) p(Y|Θ^c, X) p(Θ^c) dΘ^c / p(Y|X)
If a conditional density is required, it appears superior to perform conditional Bayesian inference rather than to perform joint Bayesian inference and subsequently condition the answer. This is illustrated with an example below.
Example: Joint versus Conditional Bayesian Inference
In the following, we present a specific example to demonstrate this difference and to argue in favor of the conditional estimate p(y|x)^c versus the conditioned joint estimate p(y|x)^j (more details are in the Appendix of [97]). We demonstrate this with a simple 2-component 2D Gaussian mixture model with identity covariance and equal mixing proportions, as shown in Figure 2.4(a). The likelihood for a data point z = (x, y) is:
Figure 2.4: Conditioned joint Bayesian inference vs. conditional Bayesian inference: (a) the mixture model, (b) the 4-point data set, (c) the joint estimate p(x, y), (d) the conditioned joint density estimate p(y|x)^j, (e) the conditional density estimate p(y|x)^c.
p(z|Θ) = (1/2) N(z; µ) + (1/2) N(z; ν). The prior p(Θ) over the parameters (i.e. the two means Θ = {µ, ν}) is a wide zero-mean, spherical Gaussian distribution (with very large covariance σ²). Thus we can infer the standard joint Bayesian distribution from a total of T training data points by Equation 2.9.

p(x, y) ∝ ∫ p(x, y|Θ) p(X, Y|Θ) p(Θ) dΘ        (2.7)
        ∝ ∫ p(x, y|Θ) Π_{i=1}^{T} p(x_i, y_i|Θ) p(Θ) dΘ        (2.8)
        ∝ ∫ p(x, y|Θ) Π_{i=1}^{T} [(1/2) N(z_i; µ) + (1/2) N(z_i; ν)] p(Θ) dΘ        (2.9)
Equation 2.9 can be solved exactly if we expand the products over the two terms in the mixture model. Unfortunately, the number of terms grows exponentially fast as 2^T (from all possible assignments of the data points to the two Gaussians) but can be computed for small data sets. For the 4-point data set in Figure 2.4(b), we compute the joint Bayesian inference and plot p(x, y) as shown in Figure 2.4(c). Conditioning this p(x, y) on x gives us the conditional p(y|x)^j (the superscript j shows that this conditional came from the joint Bayesian inference). The function p(y|x)^j is plotted in Figure 2.4(d) for the value x = −5. We then proceed to compute p(y|x)^c directly using another integration, the conditional Bayesian inference as in Equation 2.12. The resulting function p(y|x)^c is different from p(y|x)^j and is plotted in Figure 2.4(e). Note how conditional Bayesian inference captures the bimodality of the data, which is lost with the regular Bayesian inference.

p(y|x) ∝ ∫ p(y|x, Θ) p(Y|Θ, X) p(Θ) dΘ        (2.10)
       ∝ ∫ p(y|x, Θ) Π_{i=1}^{T} p(y_i|x_i, Θ) p(Θ) dΘ        (2.11)
       ∝ ∫ [p(x, y|Θ) / ∫_y p(x, y|Θ) dy] Π_{i=1}^{T} [p(x_i, y_i|Θ) / ∫_y p(x_i, y|Θ) dy] p(Θ) dΘ        (2.12)
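The exponential expansion underlying Equation 2.9 can be checked numerically: for fixed means, a product of T two-component mixture likelihoods equals a sum over all 2^T component assignments. The points and means below are illustrative, not the thesis' 4-point data set:

```python
import numpy as np
from itertools import product

# For fixed means, a product of T two-component mixture likelihoods expands into
# a sum over all 2^T assignments of points to Gaussians (cf. Equation 2.9).
def normal_pdf(z, m):
    return np.exp(-0.5 * (z - m) ** 2) / np.sqrt(2 * np.pi)

Z = np.array([-2.0, -1.5, 1.0, 2.5])  # T = 4 one-dimensional points for simplicity
mu, nu = -2.0, 2.0

prod_form = np.prod(0.5 * normal_pdf(Z, mu) + 0.5 * normal_pdf(Z, nu))
expanded = sum(
    np.prod([0.5 * normal_pdf(z, mu if a == 0 else nu) for z, a in zip(Z, assign)])
    for assign in product([0, 1], repeat=len(Z))
)
print(np.isclose(prod_form, expanded))  # True: 2^4 = 16 assignment terms
```

In the full Bayesian computation each of these assignment terms must additionally be integrated against the prior over the means, which is why the exact computation is limited to small T.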
2.3.2 Maximum Conditional Likelihood
As in Bayesian inference, the integration in conditional Bayesian inference (Equation 2.6) is typically intractable to evaluate in closed form. To approximate the average over many models, we will often simply pick one model at the mode of the integral. This results in the corresponding maximum conditional a posteriori (MAP^c) and maximum conditional likelihood (ML^c) solutions as shown in Equation 2.13. The a posteriori solution allows the use of a prior to regularize the estimate, while the conditional likelihood approach merely optimizes the model on the training data alone, which may cause overfitting.
p(y|x)^c ≈ p(y|x, Θ*)  where  Θ* = arg max_{Θ^c} p(Y|Θ^c, X) p(Θ^c)   (MAP^c)        (2.13)
                              Θ* = arg max_{Θ^c} p(Y|Θ^c, X)          (ML^c)
We typically find the maximum of the logarithm of the above quantities (which results in the same arg max). Thus we expand the above for the maximum conditional likelihood case as follows:
log p(Y|Θ^c, X) = Σ_t log p(y_t|x_t, Θ^c)
                = Σ_t log [p(y_t, x_t|Θ^c) / p(x_t|Θ^c)]
                = Σ_t log p(y_t, x_t|Θ^c) − Σ_t log ∫_y p(y, x_t|Θ^c) dy
The above optimization of conditional likelihood is very similar to the one for maximum likelihood except for the extra negative term, which is often referred to as the background probability since it is a marginal over the input distribution. Thus, we are trying to maximize the joint likelihood of input and output while minimizing the marginal likelihood over the input data. This sets up an interesting metaphor where a class-conditional model is attracted to data it should fit through the joint likelihood but repelled by the background, i.e. data that does not belong to the model's class. Many other criteria are actually conditional likelihood in disguise, which sometimes causes confusion. For example, in the speech recognition literature, conditional maximum likelihood is referred to as maximum mutual information [158]. Currently, hidden Markov models in the speech community are being trained with these conditional criteria [158] [207] to obtain state of the art performance on large corpus data sets. Unlike maximum likelihood, which has been able to handle incomplete and latent models for years with the EM algorithm, conditional likelihood has been traditionally difficult to maximize, especially in a mixture model scenario. In fact, maximizing conditional likelihood even in non-latent models will still give rise to computational difficulties since the background probability involves a log of a sum or a log of an integral over the outputs (classes or scalars), which might break concavity. Most approaches have to resort to gradient descent [20] [105]. Other variants of gradient descent and line search are also emerging in statistics [50]. Recently, a conditional version of the EM algorithm has been proposed in the CEM algorithm [92] [94] (and will be discussed further in Chapter 5). As in EM, Conditional Expectation Maximization (CEM) iterates between bounding the conditional likelihood and solving the resulting simpler complete data maximization.
This converges iteratively and monotonically to a maximum conditional likelihood solution. As in variational Bayes, a similar use of the CEM bounds on conditional posteriors and conditional likelihoods prior to integration can result in a better and tractable approximation to the conditional Bayesian inference. This would provide a generative model that is more optimized for the task at hand while still relying on a fully Bayesian formalism. We leave this variational conditional Bayesian inference as an interesting future direction to apply the novel bounds.
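Returning to the decomposition of the conditional likelihood above, it can be verified numerically on a toy class-conditional Gaussian model (our own illustrative setup) that the conditional log-likelihood equals the joint log-likelihood minus the background term:

```python
import numpy as np

# Numeric check on a toy class-conditional Gaussian model (our own setup):
# conditional log-likelihood = joint log-likelihood - background marginal term.
def log_joint(x, y, means):
    # log p(x, y): equal class priors, unit-variance Gaussian class conditionals.
    return np.log(0.5) - 0.5 * (x - means[y]) ** 2 - 0.5 * np.log(2 * np.pi)

means = [-1.0, 2.0]
X = np.array([-1.2, 0.3, 1.8])
Y = np.array([0, 0, 1])

cond = sum(
    log_joint(x, y, means) - np.logaddexp(log_joint(x, 0, means), log_joint(x, 1, means))
    for x, y in zip(X, Y)
)
joint = sum(log_joint(x, y, means) for x, y in zip(X, Y))
background = sum(np.logaddexp(log_joint(x, 0, means), log_joint(x, 1, means)) for x in X)
print(np.isclose(cond, joint - background))  # True
```

The background term is the log of a sum over classes, which is exactly the piece that breaks the concavity enjoyed by plain maximum likelihood.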
2.3.3 Logistic Regression
Maximum conditional likelihood (and in a sense maximum likelihood) is also very closely related to logistic regression, a popular technique in the statistics community [76] [127] [68]. Logistic regression is a conditional distribution of a binary output variable y given an input vector x. Typically, the conditional model is given by the following formula (where θ is a parameter vector): p(y = 1|x) = 1/(1 + exp(−θ^T x)). This generates a linear classifier which is varied by the parameter vector θ and is also referred to as a generalized linear model. There are various ways to augment the framework by computing higher order features from a given x vector. These include handling discrete x values by considering indicator features as in [43] [115].
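A minimal sketch of fitting this model by gradient ascent on the conditional log-likelihood (the synthetic data, generating parameter, step size and iteration count are our own choices):

```python
import numpy as np

# Minimal logistic regression fit by gradient ascent on the conditional
# log-likelihood; data, step size and iteration count are illustrative.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
true_theta = np.array([2.0, -1.0])           # hypothetical generating parameter
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

theta = np.zeros(2)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-X @ theta))     # p(y = 1|x) = 1/(1 + exp(-theta^T x))
    theta += 0.1 * X.T @ (y - p) / len(y)    # ascend the conditional log-likelihood

print(theta[0] > 0, theta[1] < 0)            # signs should match the generating parameter
```

The gradient X^T(y − p) is exactly the derivative of the conditional log-likelihood, so this is maximum conditional likelihood estimation for the logistic model.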
2.4 Discriminative Learning
Discriminative learning goes beyond the conditional learning perspective and is even more minimalist. Here, only the final mapping from an input (x) to output (y) is important, and the final estimate ŷ that will be produced is considered [167]. Thus, the estimation of a conditional distribution p(y|x) is also viewed as an unnecessary intermediate step, just as we previously argued that the estimation of a joint distribution p(x, y) was deemed inefficient⁴. Alternatively, we may consider other quantities resulting from the classifier, for example margin distances from the decision boundary to the nearest exemplars. Thus, discriminative techniques only consider the decision boundary or the regression function approximation in evaluating the parameters for a model. Since our learning algorithm is so closely matched with the final task of the system, discriminative learning techniques will not squander resources on an intermediate goal like generative modeling. The resulting performance of the classifier and regressor will therefore be improved. Since we can no longer consider a distribution-based criterion, Bayesian and likelihood-based learning techniques are not immediately applicable.
2.4.1 Empirical Risk Minimization
As opposed to the previous sections, where we started with averaging-based solutions (Bayesian integration) and moved to more empirical or local approximations (maximum likelihood), we begin here with an empirical approach to optimizing a discriminative classifier or regressor (and we will show averaging and regularization subsequently). Empirical risk minimization (ERM) is a discriminative estimation criterion which does not make assumptions about the distribution of the input or the output [129] [35] [196] [195] [197]. In ERM, we are typically given a loss function of the form l(x_t, y_t, Θ) which measures the penalty incurred for a data point (where x_t is the input and y_t is the desired output) when assigning the parameter Θ to our model. If we only concern ourselves with an empirical local solution, we will minimize this loss function on the training data set (which has a total of T training data points). This average loss is also called the empirical risk:
R_emp(Θ) = (1/T) Σ_{t=1}^{T} l(x_t, y_t, Θ)

⁴This distinction between conditional learning and discriminative learning is not currently a well established convention in the field.
This is meant to be a coarse approximation of the true loss of a classifier on the unknown distribution of the samples, also known as the expected risk:

R(Θ) = ∫_{x×y} P(x, y) l(x, y, Θ) dx dy
In the limit of infinite data, the above loss functions become almost equal for a given Θ value. Here, Θ specifies a mapping which will produce an estimate ŷ_t from the input x_t. The loss function measures the level of disagreement between y_t and ŷ_t. Possible choices are the quadratic loss, i.e. ‖y_t − ŷ_t‖, or even a binary, winner-take-all loss for classification. Although one can often reinterpret a loss function as corresponding to the likelihood under a specific choice of conditional output distribution, this may sometimes be awkward. The important aspect is ERM's emphasis on the actual output and the resulting deterministic classification boundary that will be formed. For example, we may choose to compute the classification result in a winner-take-all sense, in which case the loss will be 0 if we select the class appropriately and 1 if we do not. This type of hard classification is fundamental to discriminative estimation. It would be awkward to represent as a conditional distribution (or logistic regressor), but one possibility might be a very sharp version of the logistic function, i.e.:

p(y = 1|x) = 1 if x > 0        (2.14)
             0 if x ≤ 0
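The empirical risk under the winner-take-all 0/1 loss can be sketched for a linear classifier sign(θ^T x); the four points and the parameter vector below are illustrative:

```python
import numpy as np

# Empirical risk under the winner-take-all 0/1 loss for a linear classifier
# sign(theta^T x); the data set and parameter vector are illustrative.
def zero_one_loss(x, y, theta):
    y_hat = 1 if x @ theta > 0 else -1       # hard, winner-take-all decision
    return 0 if y_hat == y else 1

X = np.array([[1.0, 2.0], [-1.0, 1.0], [2.0, 3.0], [-2.0, -2.0]])
Y = np.array([1, 1, -1, -1])
theta = np.array([-0.5, 1.0])

# R_emp(theta) = (1/T) * sum_t l(x_t, y_t, theta)
R_emp = np.mean([zero_one_loss(x, y, theta) for x, y in zip(X, Y)])
print(R_emp)  # 0.25: one of the four points is misclassified
```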
2.4.2 Structural Risk Minimization and Large Margin Estimation
Since ERM only locally attempts to optimize the model to the training data, it does not necessarily coincide well with the true expected risk and thus may not exhibit good generalization behavior on future data. An alternative is to consider augmenting the local solution with a prior or regularizer that favors estimates that are more likely to agree with future data and is based on a measure of the model capacity. This form of regularized ERM has been called structural risk minimization (SRM), where the capacity is measured in terms of the so-called Vapnik-Chervonenkis dimension [196] [195] [197] [35] [109]. SRM will not perform as well on the training data as ERM but should generalize better to future data. The risk or 'expected loss' (for samples outside of the training set) is then bounded above by the empirical risk plus a term that depends only on the size of the training set, T, and the VC-dimension of the classifier, h. This non-negative integer quantity measures the capacity of a classifier and is independent of the distribution of the data. The following bound on the expected loss holds with probability 1 − δ:

R(Θ) ≤ R_emp(Θ) + sqrt([h (log(2T/h) + 1) − log(δ/4)] / T)
The SRM principle suggests that we minimize the upper bound on R(Θ) to minimize the expected risk itself. Thus we seek to minimize a combination of the empirical risk and the VC-dimension h. For linear classifiers, this motivates the use of a classifier that fits the data and also has large margin. Large margin means that we try to maximize the minimum distance from the points to the decision boundary that separates them. These principles give rise to the support vector machine (SVM). SVMs are particularly important in contemporary machine learning since they have recently provided state of the art classification and regression performance. In many senses, these are the current workhorses of discriminative estimation. However, due to their fundamentally discriminative formalism, they do not enjoy the flexibility of generative modeling (priors, invariants, structure, latent variables, etc.), which limits their applicability.
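The capacity penalty in the bound above is easy to tabulate; the training-set size, VC dimensions and confidence level below are illustrative values:

```python
import math

# Tabulating the SRM capacity term sqrt((h (log(2T/h) + 1) - log(delta/4)) / T)
# for an illustrative training-set size T and confidence level delta.
def vc_confidence_term(T, h, delta):
    return math.sqrt((h * (math.log(2 * T / h) + 1) - math.log(delta / 4)) / T)

T, delta = 1000, 0.05
for h in (10, 100, 500):
    # The guaranteed gap between expected and empirical risk grows with capacity h.
    print(h, round(vc_confidence_term(T, h, delta), 3))
```

For fixed T, the penalty grows with h, which is why SRM trades a slightly worse training fit for a lower-capacity (e.g. larger-margin) classifier.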
2.4.3 Bayes Point Machines
An alternative to SRM or SVMs is to consider not a single solution that fits the data in conjunction with a helpful regularizer or prior, but a weighted combination of all possible models. Thus, as in Bayesian inference, for better generalization properties, we will consider a discriminative averaging classifier. For instance, we may attempt to average all linear classifiers that perfectly classify the training data (such hard classification is mainly a discriminative concept). This problem can be cast in the framework of Bayesian inference (or, more specifically, a conditional Bayesian inference problem). One popular approximation to the true Bayesian inference is called the Bayes point machine (BPM) [72] [199] [135] [168]. For tractability reasons, the BPM is not the true result of Bayesian inference but rather a single point approximation. Instead of summing the effect of all linear classifier models, the BPM uses a single model that is the closest to the mean over the continuous space of valid models (i.e. the linear classifiers with perfect classification accuracy on the training data). Thus, in a Bayesian way, we would like to average over all linear models. However, this averaging is not a soft probabilistic weighting but is done according to a discriminative criterion which makes a binary decision: only count the classifiers that perfectly separate the training data. This corresponds to the conditional distribution in Equation 2.14. The averaging over models does bring forth slightly better generalization properties for the Bayes point machine. Unfortunately, in practice, the performance does not exceed that of SVMs in a consistent manner. Furthermore, the BPM does not easily handle non-separable data sets, where averaging multiple models according to perfect classification would yield no feasible solution whatsoever.
Also, a practical consideration is that the BPM is very difficult to compute, requiring computational effort that far surpasses generative modeling approaches as well as SVMs (if the latter are implemented efficiently as in [154]). Our main concern is that the BPM and its counterparts were really designed to handle linear models or kernel-based nonlinearities. Therefore, they are not easily computable for classifiers arising from the large spectrum of generative models. For instance, exponential family and mixture of exponential family models cannot be easily estimated in a BPM framework. Thus, they do not enjoy the flexibility of generative modeling (priors, non-separability, invariants, structured models, latent variables, etc.), which limits their applicability. Another discriminative averaging framework that addresses these limitations is Maximum Entropy Discrimination (MED), which will be introduced in the following chapter.
2.5 Joint Generative-Discriminative Learning
After having explored the spectrum of discriminative and generative modeling, we see a strong argument for a hybrid approach that combines these deeply complementary schools of thought. Fusing the versatility and flexibility of generative models with the power of a discriminative framework that focuses resources on the given task would be extremely valuable. Furthermore, as argued throughout, an averaging-based approach (as opposed to a local or regularized local fit to training data) promises better generalization and a more principled Bayesian treatment. Several approaches have recently been proposed for combining generative and discriminative methods. These include Bayesian estimation with special priors such as automatic relevance detection [188]. However, these have only explored discriminative learning in the context of simple linear or kernel-based models and have yet to show applicability to the large spectrum of generative models. An alternative technique involves modular combination of generative modeling with subsequent
SVM classification using Fisher kernels [83]. This technique is readily applicable to a large spectrum of generative models, which are first easily estimated with maximum likelihood and then mapped into features/kernels for training in an SVM. However, the piecemeal cascaded approach of maximum likelihood followed by large-margin estimation does not fully take advantage of the power of both techniques. For example, since the generative models are first estimated by maximum likelihood, this non-discriminative criterion might collapse important aspects of the model and sacrifice modeling power (particularly under latent situations). For example, due to a model mismatch, the pre-specified class of generative model may not have enough flexibility to capture all the information in the training data. It then becomes possible that the model's resources will be misused to encode aspects of the data that are irrelevant for discrimination (instead of task-related information). This ultimately results in a loss of valuable modeling power before we feed into the subsequent SVM layer, making it too late to exploit some aspects of the generative model for discrimination. For instance, a maximum likelihood HMM trained on speech data may focus all modeling power on the vowels (which are sustained longer than consonants), preventing a meaningful set of features for discrimination in the final SVM stage. In other words, there is no iteration between the generative modeling and the discriminative learning, since the maximum likelihood estimate is not adjusted in response to the SVM's criteria. Thus, a simultaneous computation of the generative model with a discriminative criterion would improve on this technique5. In the next chapter, we will present the Maximum Entropy Discrimination formalism as a hybrid generative-discriminative model with many of the desirable qualities we have so far motivated.
The proposed MED framework is a principled averaging technique that is able to span the large spectrum of generative models and simultaneously perform estimation with a discriminative criterion.
5This method in [83] was also formally encompassed by the Maximum Entropy Discrimination technique in [85]. Chapter 3
Maximum Entropy Discrimination
It is futile to do with more what can be done with fewer.1
William of Ockham, 1280-1349
Is it possible to combine the strongly complementary properties of discriminative estimation with generative modeling? Can, for example, support vector machines and the performance gains they provide be combined elegantly with flexible Bayesian statistics and graphical models? This chapter introduces a novel technique called Maximum Entropy Discrimination (MED), which provides a general formalism for marrying both methods [85]. The duality between maximum entropy theory [86] and maximum likelihood is well known in the literature [113]. Therefore, the connection between generative estimation and classical maximum entropy already exists. MED brings a novel discriminative aspect to the theory and forms a bridge to contemporary discriminative methods. MED also involves another twist on the usual maximum entropy paradigm in that it considers distributions over model parameters instead of only distributions over data. Although other possible approaches for combining the generative and discriminative schools of thought exist [83, 92, 188, 72], the MED formalism has distinct advantages. For instance, MED naturally spans both ends of the discriminative-generative spectrum: it subsumes support vector machines and extends their driving principles to a large majority of the generative models that populate the machine learning community. This chapter is organized as follows. We begin by motivating the discriminative maximum entropy framework from the point of view of regularization theory. Powerful convexity, duality and geometric properties are elaborated. We then explicate how to solve classification problems in the context of the maximum entropy formalism. The support vector machine is then derived as a special case. Subsequently, we extend the framework to discrimination with generative models and prove that the whole exponential family of generative distributions is immediately estimable within the MED framework. Generalization guarantees are then presented.
Further MED extensions such as transductive inference, feature selection, etc. are elaborated in the following chapter (Chapter 4). To make the MED framework applicable to the wide range
1At the risk of misquoting what Ockham truly intended to say, we shall use this quote to motivate the sparsity which arises from a constraint-based discriminative learner such as the Maximum Entropy Discrimination formalism.
of Bayesian models, latent models are also considered. These mixtures of the exponential family deserve special attention and their development in the context of MED is deferred to Chapter 5.
3.1 Regularization Theory and Support Vector Machines
We begin by developing the maximum entropy framework from a regularization theory and support vector machine perspective (this derivation was first described in [90]). For simplicity, we will only address binary classification in this chapter and defer other extensions to Chapter 4. Regularization theory is a field in its own right with many formalisms (the approach we present is only one of many possible developments). A good contact point with regularization theory for the machine learning reader can be found in [66] [155]. We begin with a parametric family of decision boundaries, L(X; Θ), which we shall call discriminant functions. Each discriminant function (given a specific parameter Θ) takes an input X and produces a scalar output. The sign (±1) of the scalar value indicates which class the input X is assigned to. For example, a simple type of decision boundary is the linear classifier. The parameters of this classifier, Θ = {θ, b}, are the concatenation of a linear parameter vector θ and the scalar bias b. This generates the following linear classifier:

L(X; Θ) = θᵀX + b    (3.1)
To estimate the optimal Θ̂, we are given a set of training examples {X_1, ..., X_T} and the corresponding binary (±1) labels {y_1, ..., y_T}. We would like to find a parameter setting for Θ that will minimize some form of classification error. Once we have found the best possible Θ̂, we can use our classifier to predict the labels of future input examples via:

ŷ = sign L(X; Θ̂)    (3.2)
We will form a measure of classification error based on loss functions L(·) for each data point, which depend on the parameter Θ only through the classification margin. The margin2 is defined as y_t L(X_t; Θ) and is large and positive whenever the label y_t agrees with the scalar-valued prediction L(X_t; Θ), and negative when they disagree. We shall further assume that the loss function, L : ℝ → ℝ, is a non-increasing and convex function of the margin. Thus a larger margin results in a smaller loss. We also introduce a regularization penalty R(Θ) on the models, which favors certain parameters over others (like a prior). The optimal parameter setting Θ̂ is computed by minimizing the empirical loss and regularization penalty:

min_Θ { R(Θ) + Σ_t L( y_t L(X_t; Θ) ) }
A more straightforward solution for Θ̂ is achieved by recasting the above as a constrained optimization:

min_{Θ, γ_t ∀t}  R(Θ) + Σ_t L(γ_t)    (3.3)
subject to  y_t L(X_t; Θ) − γ_t ≥ 0, ∀t
2It should be noted that regularization theory is not limited to margin-based concepts. In general the penalty function or stabilizer terms may depend on many other regularization criteria through a wide array of possible norms and semi-norms. One interpretation of regularization theory is as an approach to solving inverse problems. It spans applications from spline-fitting to pattern recognition and employs many sophisticated mathematical constructs such as reproducing kernel Hilbert spaces [53].
Here, we have also introduced the margin quantities γ_t as slack variables in the optimization; they represent the minimum margin that y_t L(X_t; Θ) must satisfy. The minimization is now over both the parameters Θ and the margins γ_t. A (linear) support vector machine can be seen as a particular example of the above formulation. There, the discriminant function is a linear hyper-plane as in Equation 3.1. Furthermore, the regularization penalty is R(Θ) = ½ θᵀθ, i.e. the norm of the parameters, to encourage large-margin solutions. The slack variables provide the SVM with a straightforward way to handle non-separable classification problems. Thus, for the (primal) SVM optimization problem, we have:

min_{θ, γ_t ∀t}  ½ θᵀθ + Σ_t L(γ_t)
subject to  y_t (θᵀX_t + b) − γ_t ≥ 0, ∀t
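For concreteness, the primal above can be approached numerically. The sketch below (illustrative only, not the solver used in this thesis) takes the common hinge choice of loss, minimizing ½θᵀθ + C Σ_t max(0, 1 − y_t(θᵀX_t + b)) by subgradient descent on hypothetical toy data:

```python
import numpy as np

def svm_primal_subgradient(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5*theta^T theta + C * sum_t max(0, 1 - y_t*(theta^T x_t + b))
    by subgradient descent -- an illustrative sketch, not the thesis's solver."""
    n, d = X.shape
    theta, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ theta + b)            # y_t * L(X_t; Theta)
        viol = margins < 1.0                     # points violating the margin gamma_t = 1
        # subgradient of the regularized hinge objective
        g_theta = theta - C * (y[viol, None] * X[viol]).sum(axis=0)
        g_b = -C * y[viol].sum()
        theta -= lr * g_theta
        b -= lr * g_b
    return theta, b

# Toy, well-separated data: class given by the sign of the first coordinate
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([3.0, 0.0], 1.0, (20, 2)),
               rng.normal([-3.0, 0.0], 1.0, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)
theta, b = svm_primal_subgradient(X, y)
preds = np.sign(X @ theta + b)
print((preds == y).mean())   # fraction of the toy set classified correctly
```

The subgradient of the hinge loss is zero for points already beating the margin, which is what restricts the updates to the margin violators.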
At this point, we will focus on the optimization over Θ alone and ignore the optimization over the slack variables γ_t. The effect of this restriction is that the resulting classifier (or support vector machine) will require linearly separable data. In practice, we will assume the slack variables are held constant and set them manually to, e.g., unity: γ_t = 1 ∀t. This restrictive assumption is made to simplify the following derivations and does not result in a loss of generality. The restriction will be loosened subsequently, permitting us to consider non-separable cases as well.
3.1.1 Solvability
At this point, it is crucial to investigate under what conditions the constrained minimization problem in Equation 3.3 is solvable. For instance, can it be cast as a convex program, and can Θ̂ be computed uniquely? A convex program typically involves the minimization of a convex cost function under a convex hull of constraints. Under mild assumptions, the solution is unique and a variety of strategies will converge to it (i.e. axis-parallel optimization, linear-quadratic-convex programming, etc.). In Figure 3.1, various constrained optimization scenarios are presented. Figure 3.1(a) depicts a convex cost function with a convex hull of constraints arising from the conjunction of multiple linear constraints. This leads to a valid convex program.
Figure 3.1: Convex cost functions and convex constraints. (a) Valid convex program; (b) non-convex constraints; (c) non-convex cost function.
In Figure 3.1(b) the situation is not as promising. Here, several nonlinear constraints are combined and the searchable space therefore forms a non-convex hull. This prevents guaranteed convergence and yields a non-convex program. Similarly, in Figure 3.1(c), we do not have a convex program. However, here the culprit is a non-convex cost function (i.e. R(Θ) is not convex). Therefore, for a solution to Equation 3.3, we must require that the penalty function R(Θ) is convex and that the conjunction of the classification constraints over all t forms a convex hull. The intersection of linear constraints (under mild conditions) will always form a convex hull, whereas the intersection of multiple nonlinear constraints is unlikely to do so. Therefore, the classification constraints in the regularization framework need to be linear, or at least consistently mappable to a space where they become linear.
3.1.2 Support Vector Machines and Kernels
Inspecting a support vector machine, we can immediately see that the penalty function R(Θ) = ½ θᵀθ is convex and that a linear hyper-plane discriminant gives rise to linear constraints and a convex hull. Thus, as is well known, the SVM is solvable via a convex program (in fact, more simply as a quadratic program [35]) or by sequential minimal optimization [154].
But, what do we do when L(Xt; Θ) is nonlinear? For example, we may wish to deal with decision boundaries that arise from generative models. These can be computed via the log-likelihood ratio of two generative models P (X|θ+) and P (X|θ−) (one for each class). Here the parameter space includes the concatenation of the positive generative model, the negative one and a scalar bias Θ = {θ+, θ−, b}. This gives rise to the following nonlinear discriminant functions:
L(X; Θ) = log [ P(X|θ₊) / P(X|θ₋) ] + b    (3.4)
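The log-likelihood-ratio discriminant of Equation 3.4 can be instantiated with, say, two isotropic Gaussian class-conditional models. The following is a minimal sketch; the means, variance and query points are hypothetical:

```python
import numpy as np

def gaussian_logpdf(X, mu, var):
    """Log-density of an isotropic Gaussian N(mu, var*I) at the rows of X."""
    d = X.shape[1]
    return -0.5 * (d * np.log(2 * np.pi * var) + ((X - mu) ** 2).sum(axis=1) / var)

def discriminant(X, mu_pos, mu_neg, var=1.0, b=0.0):
    """L(X; Theta) = log P(X|theta+) - log P(X|theta-) + b, as in Equation 3.4."""
    return gaussian_logpdf(X, mu_pos, var) - gaussian_logpdf(X, mu_neg, var) + b

# Hypothetical class-conditional models and query points
mu_pos, mu_neg = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
X = np.array([[1.5, 0.3], [-1.8, -0.2]])
print(np.sign(discriminant(X, mu_pos, mu_neg)))  # assigns each point to its nearer model
```

With equal isotropic covariances this particular discriminant happens to be linear in X; richer generative models make it nonlinear, which is exactly the difficulty discussed next.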
Unfortunately, these nonlinear decision boundaries generate a search space for Θ that is no longer a convex hull (compromising the uniqueness and solvability of the problem). In some cases, nonlinear decision boundaries (i.e. nonlinear SVMs) can be handled via the so-called 'kernel trick'. If a decision boundary is nonlinear, one can consider a mapping of the data through Φ(X_t) into a higher-dimensional 'feature' space. Therein, the Θ parameter vector parameterizes a higher-dimensional hyper-plane, effectively mimicking the nonlinearity in the original low-dimensional space. Furthermore, the constraints too become linear and the search space forms a convex hull. One subtlety here, however, is that the regularization penalty is now different in the feature space than in the original space. Therefore, if we had a quadratic R(Θ) penalty function in the original space, we would obtain some possibly complicated expression for it in the feature space. This is reasonable in the case of SVMs since the VC-dimension generalization guarantees hold at the level of the feature space. This permits us to artificially preserve a quadratic penalty function in the feature space (which would map to a quite complicated one in the original space). The term 'kernel' simply arises since optimizing a quadratic penalty function in the feature space only requires inner products between the high-dimensional vectors Φ(X_t), and these are implicitly computable using kernels k(X_t, X_{t'}) without the explicit mapping Φ(X_t). However, more generally, we may have a specific regularization penalty in mind at the level of the original space and/or nonlinearities in the classifier that prevent us from considering the high-dimensional mapping 'trick'. This problematic situation is often the case for generative models and motivates an important extension (MED) to the regularization theory discussed so far.
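The implicit inner-product property can be checked numerically in a case where the feature map is known explicitly: for 2-d inputs, the degree-2 polynomial kernel k(x, z) = (xᵀz)² equals Φ(x)ᵀΦ(z) with the choice Φ(x) = (x₁², x₂², √2 x₁x₂). A small sketch (the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """One explicit degree-2 feature map for 2-d inputs."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k(x, z):
    """The same inner product computed WITHOUT the explicit map Phi."""
    return float(x @ z) ** 2

x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(k(x, z), float(phi(x) @ phi(z)))  # the two agree up to floating-point rounding
```

The kernel evaluation costs an inner product in the original 2-d space, while the explicit route requires constructing 3-d feature vectors; for higher degrees and dimensions this gap grows rapidly, which is the point of the trick.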
3.2 MED - Distribution over Solutions
We will now generalize this regularization formulation by presenting the maximum entropy discrimination framework [85] [90]. First, we note that it is not necessarily ideal to solve for a single optimal setting of the parameter Θ when we could instead solve for a full distribution over multiple Θ values (i.e. a distribution of solutions). The intuition is that many different settings of Θ might generate relatively similar classification performance, so it would be better to estimate a distribution P(Θ) that preserves this flexibility instead of a single optimal Θ̂. Clearly, with a full distribution P(Θ) we can subsume the original formulation if we choose P(Θ) = δ(Θ, Θ̂), where the delta function can be seen as point-wise probability mass concentrated at Θ = Θ̂. So, this type of probabilistic solution is a superset of the direct optimization3. Here, we would like our P(Θ) to be large at Θ-values that yield good classifiers and close to zero at Θ-values that yield poor classifiers. This probabilistic generalization will facilitate a number of extensions to the basic regularization/SVM approach. We modify the regularization approach as follows. Given a distribution P(Θ), we can easily modify the rule for predicting a new label from a new input sample X that was shown in Equation 3.2. Instead of merely using one discriminant function at the optimal parameter setting Θ̂, we integrate over all discriminant functions weighted by P(Θ):

ŷ = sign ∫ P(Θ) L(X; Θ) dΘ    (3.5)
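This averaged decision rule can be approximated by Monte Carlo: sample Θ from P(Θ), average the discriminant outputs, and take the sign. A minimal sketch, with a hypothetical Gaussian P(Θ) over linear-classifier weights chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical solution distribution P(Theta): a Gaussian over linear weights
mean_theta = np.array([1.0, -0.5])
theta_samples = rng.normal(loc=mean_theta, scale=0.3, size=(5000, 2))

def averaged_predict(X, theta_samples):
    """Monte Carlo estimate of sign( integral P(Theta) L(X; Theta) dTheta )
    for the linear discriminant L(X; Theta) = theta^T X."""
    scores = X @ theta_samples.T           # L(X; Theta) for every sampled Theta
    return np.sign(scores.mean(axis=1))    # average over Theta, then take the sign

X = np.array([[2.0, 1.0], [-1.0, 2.0]])
print(averaged_predict(X, theta_samples))
```

Note that averaging the scalar discriminant values before taking the sign is not the same as majority-voting the individual signs; Equation 3.5 prescribes the former.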
How do we estimate P(Θ)? Again we consider an expectation form of the previous approach and cast Equation 3.3 as an integration. The classification constraints will also be applied in an expected sense. It is inappropriate to directly apply the arbitrary penalty function R(Θ) to infinite-dimensional probability density functions such as P(Θ). Instead of considering an expectation of penalty functions, we apply a canonical penalty function for distributions: the negative entropy. Minimizing the negative entropy is equivalent to maximizing the entropy. 'Maximum Entropy' theory was pioneered by Jaynes and others [119] to compute distributions with moment constraints. In the absence of any further information, Jaynes argues that one should satisfy the constraints in a way that is least committal or prejudiced. This gives rise to the need for a maximum entropy distribution, one that is as close to uniform as possible. Here, we assume the Shannon entropy, defined as H(P(Θ)) = − ∫ P(Θ) log P(Θ) dΘ. Traditionally in the maximum entropy community, distributions are computed subject to moment constraints (i.e. not discrimination or classification constraints). Here, the 'Discrimination' term is added to specify that our framework borrows from the concepts of regularization/SVM theory and satisfies discriminative classification constraints (based on margin). This gives us the following novel MED formulation4 for finding a distribution P(Θ) over the parameters Θ:
min_{P(Θ)}  −H(P(Θ))
subject to  ∫ P(Θ) [y_t L(X_t; Θ) − γ_t] dΘ ≥ 0  ∀t
At present, negative entropy is not very flexible as a surrogate penalty function. To generalize, we cast negative entropy as a Kullback-Leibler divergence from P(Θ) to a target uniform distribution:
3In practice, though, other (possibly parametric) restrictions may arise on P(Θ) that prevent us from generating arbitrary delta functions in this manner.
4At this point we have assumed that the margins γ_t and their loss functions are held fixed (these are typically set to γ_t = 1 ∀t). This assumption will be relaxed subsequently.
−H(P(Θ)) = KL(P(Θ) ‖ P_uniform(Θ)). The Kullback-Leibler divergence is defined as follows5:

KL(P(Θ) ‖ Q(Θ)) = ∫ P(Θ) log [ P(Θ) / Q(Θ) ] dΘ
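As a minimal numerical illustration of this definition in the discrete case (the two distributions below are arbitrary examples):

```python
import numpy as np

def kl(p, q):
    """KL(P||Q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, p))        # 0.0: zero divergence from a distribution to itself
print(kl(p, q) >= 0)   # True: KL-divergence is non-negative (Gibbs' inequality)
```

Note that KL(P‖Q) is not symmetric in general, which is why the direction of the divergence (from the solution P to the target) matters in the formulations that follow.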
If we have prior knowledge about the desired shape of P(Θ), we may not necessarily want to favor high-entropy uniform solutions. Instead we can customize the target distribution and use a non-uniform one. Thus we replace our penalty function with the Kullback-Leibler divergence to any prior, KL(P(Θ) ‖ P_0(Θ)). This gives us our more general Minimum Relative Entropy Discrimination (MRE or MRED) formulation:
Definition 1  Find a distribution P(Θ) over the parameters Θ that minimizes KL(P(Θ) ‖ P_0(Θ)) subject to ∫ P(Θ)[y_t L(X_t; Θ) − γ_t] dΘ ≥ 0 ∀t, where P_0(Θ) is the prior distribution over the parameters. The resulting decision rule is given by ŷ = sign( ∫ P(Θ) L(X; Θ) dΘ ).
It is traditional to continue to refer to minimum relative entropy approaches as maximum entropy. Therefore, at the risk of confusion, we shall adopt this convention in the nomenclature and refer to Definition 1 as 'Maximum Entropy Discrimination'. At this point, we evaluate the solvability of our formulation. Figure 3.2 depicts the problem formulation. We note that we are now dealing with a possibly infinite-dimensional space since, instead of solving for a parameter vector Θ, we are solving for P(Θ), a probability distribution. In the figure, the axes represent variation of two coordinates of the possibly continuous distribution, P(Θ) and P(Θ′). Instead of a penalty function R(Θ), we have the KL-divergence, which is a convex function of P(Θ). Furthermore, the constraints are expectations with respect to P(Θ), which means they are linear in P(Θ). These linear constraints are guaranteed to combine into a convex hull for the search space of P(Θ) regardless of the nonlinearities in the discriminant function!
Figure 3.2: MED Convex Program Problem Formulation.
Therefore, the solution to Definition 1 is given by a valid convex program. In fact, the MED classification problem in Definition 1 is directly solvable using a classical result from maximum entropy:
5Often, the KL-divergence KL(P‖Q) will also be written as D(P‖Q).
Theorem 1  The solution to the MED problem for estimating a distribution over parameters has the following form (cf. Cover and Thomas [40]):

P(Θ) = (1/Z(λ)) P_0(Θ) exp( Σ_t λ_t [y_t L(X_t; Θ) − γ_t] )

where Z(λ) is the normalization constant (partition function) and λ = {λ_1, ..., λ_T} defines a set of non-negative Lagrange multipliers, one per classification constraint. The λ are set by finding the unique maximum of the jointly concave objective function
J(λ) = − log Z(λ) (3.6)
The above solution arises as the dual problem to the constrained optimization in Definition 1 (the primal problem) via the Legendre transform. Under mild conditions, a solution always exists and, when it does, it is unique. Occasionally, the objective function J(λ) may grow without bound and prevent the existence of a unique solution; however, this situation is rare in practice. Furthermore, it is typically far easier to solve the dual problem since the complexity of the constraints is alleviated. Clearly, the constraints on the Lagrange multipliers (i.e. non-negativity) are more straightforward to realize than constraints on the possibly infinite-dimensional distribution P(Θ) in the primal problem. The non-negativity of the Lagrange multipliers arises in maximum entropy problems when inequality constraints are present in the primal problem (such as those representing our classification constraints in Definition 1)6. At this point, we shall loosen the constraint that the margins are fixed and allow, e.g., classification scenarios which are non-separable.
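As a concrete special case (a sketch, not code from the text): with a zero-mean identity-covariance Gaussian prior P_0(θ) = N(0, I), the linear discriminant L(X; θ) = θᵀX (no bias) and margins fixed at γ_t = 1, the partition function integrates in closed form and the dual objective reduces to J(λ) = Σ_t λ_t − ½ ‖Σ_t λ_t y_t X_t‖², i.e. the separable SVM dual. The sketch below maximizes this J(λ) by projected gradient ascent; the toy data are hypothetical:

```python
import numpy as np

def med_dual_linear(X, y, lr=0.01, steps=2000):
    """Maximize J(lambda) = sum_t lambda_t - 0.5 * ||sum_t lambda_t y_t X_t||^2
    (the MED dual under a N(0, I) prior, linear L, fixed margins gamma_t = 1)
    by projected gradient ascent with lambda_t >= 0."""
    Z = y[:, None] * X                           # rows y_t X_t
    K = Z @ Z.T                                  # entries y_t y_t' X_t^T X_t'
    lam = np.zeros(len(y))
    for _ in range(steps):
        grad = 1.0 - K @ lam                     # dJ/dlambda_t
        lam = np.maximum(lam + lr * grad, 0.0)   # project onto lambda_t >= 0
    theta_mean = (lam * y) @ X                   # mean of the solution P(theta)
    return lam, theta_mean

# Hypothetical separable toy problem
X = np.array([[2.0, 0.0], [1.5, 1.0], [-2.0, 0.0], [-1.5, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
lam, theta_mean = med_dual_linear(X, y)
print(np.sign(X @ theta_mean))  # recovers the training labels
```

The non-negativity projection is exactly the dual constraint from the theorem; the mean of the resulting Gaussian P(θ) plays the role of the SVM weight vector.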
3.3 MED - Augmented Distributions
The MED formulation so far has made the assumption that the margin values γ_t are pre-specified and held fixed. Therefore, the discriminant function must be able to perfectly separate the training examples with some pre-specified margin value. This may not always be possible in practice (i.e. for non-separable data sets) and will generate an empty convex hull for the solution space. Thus, we need to revisit the setting of the margin values and the loss function upon them. First, recall that we had so far ignored the loss function in the regularization framework as we derived the MED technique, since we held the margins fixed. However, the choice of the loss function (penalties for violating the margin constraints) also admits a more principled solution in the MED framework. As we did earlier for the parameters, let us now also consider a distribution over margins, e.g. P(γ), in the MED framework [85]. Typically, for good classification performance (VC-dimension generalization guarantees encourage large-margin solutions), we will choose margin distributions that favor larger margins. Furthermore, by varying our choice of distribution we can effectively mimic or consider various loss functions associated with γ. Also, by choosing priors that allow a non-zero probability mass for negative margins, we can permit non-separable classification (without ad hoc slack variables as in SVMs). This will ensure that the classification constraints never give rise to an empty admissible set. The MED formulation will then give a solution over the joint distribution, namely P(Θ, γ). This gives a weighted continuum of solutions instead of specifying a single optimal value for each, as in the regularization approach. There is a caveat, however, since the MED constraints apply only through expectations over the margin values.
We are now satisfying a looser problem than when the margin values were set and thus this transition from margin values to margin distributions is less natural than the previous transition from parameter
extrema to parameter distributions. Since there are multiple margin values (one for each training data instance t), P(γ) is an aggregate distribution over all margins and will typically be factorized as P(Θ, γ) = P(Θ) Π_t P(γ_t). This leads to the following, more general MED formulation:

6Equality constraints in the primal problem would generate Lagrange multipliers that are arbitrary scalars in (−∞, ∞).
Definition 2  The MED solution is the distribution P(Θ, γ) over the parameters Θ and the margin variables γ = [γ_1, ..., γ_T] that minimizes KL(P(Θ) ‖ P_0(Θ)) + Σ_t KL(P(γ_t) ‖ P_0(γ_t)) subject to ∫ P(Θ, γ)[y_t L(X_t; Θ) − γ_t] dΘ dγ ≥ 0 ∀t. Here P_0(Θ) is the prior distribution over the parameters and P_0(γ) is the prior over margin variables. The resulting decision rule is given by ŷ = sign( ∫ P(Θ) L(X; Θ) dΘ ).
Once again, the above solution exists under mild assumptions and is unique. Here, though, the constraints are not just expectation constraints over the parameter distribution but also over the margin distribution. This relaxes the convex hull since the constraints do not need to hold for a specific margin; they need only hold over a distribution over margins that can include negative margins, thus permitting us to consider non-separable classification problems. Furthermore, in applying MED to a problem, we no longer specify ad hoc regularization penalties (the R(Θ)) and margin penalty functions (the L(γ_t) loss functions) but instead specify probability distributions. These distributions can sometimes be more convenient to specify and then automatically give rise to penalty functions for the model and the margins via KL-divergences. More specifically, the model distribution gives rise to the divergence term KL(P(Θ) ‖ P_0(Θ)) and the margin distribution gives rise to the divergence terms KL(P(γ_t) ‖ P_0(γ_t)), which correspond to the regularization penalty and the loss functions respectively. Since both terms are based on probability distributions and KL-divergence, the trade-off between classification loss and regularization is now on a common probabilistic scale. The solution to the non-separable MED classification problem in Definition 2 is as follows:
Theorem 2  The solution to the MED problem for estimating a distribution over parameters and margins (as well as further augmentations) has the following general form (cf. Cover and Thomas [40]):

P(Θ, γ) = (1/Z(λ)) P_0(Θ, γ) exp( Σ_t λ_t [y_t L(X_t; Θ) − γ_t] )

where Z(λ) is the normalization constant (partition function) and λ = {λ_1, ..., λ_T} defines a set of non-negative Lagrange multipliers, one per classification constraint. The λ are set by finding the unique maximum of the jointly concave objective function
J(λ) = − log Z(λ) (3.7)
Further details on the choices of the priors for the parameters and margins, as well as other distributions, will be elaborated in the following sections. It is always possible to recast the optimization problem the maximum entropy formulation has generated back into the regularization form, in terms of loss functions and regularization penalties [90]. However, MED's probabilistic formulation is intuitive and provides more flexibility. For instance, we can continue to augment our solution space with distributions over other entities and maintain the convex cost function with convex constraints. For example, one could include a distribution over unobserved labels y_t or unobserved inputs X_t in the training set. Or, we could introduce further continuous or discrete variables into the discriminant function that are unknown and integrate over them. Thus, the distribution P(Θ) could effectively become P(Θ, γ, y, X, ...) and, in principle, we will still maintain a similar convex program structure and the dual solution posed in Theorem 2. These types of extensions will be elaborated further in Chapter 4. One important caveat remains, however, when we augment distributions: we should maintain a balance between the various priors we are trying to minimize KL-divergence to. If a prior over models P_0(Θ) is too strict, it may overwhelm a prior over other quantities such as margins, P_0(γ), and vice-versa. The minimization of KL-divergence would then be skewed more towards one prior than the other.
3.4 Information Theoretic and Geometric Interpretation
There is an interesting geometric interpretation for the MED solution which can be described as a type of information projection. This projection is depicted in Figure 3.3 and is often referred to as a relative entropy projection or e-projection as in [1]. The multiple linear constraints form a convex hull that generates an admissible set called P. This convex hull is also referred to as an ’m-flat constraint set’. The MED solution is the point in the admissible set that is closest in terms of divergence from the prior distribution P0(Θ). This analogy extends to cases where the distributions are also over margins, unlabeled exemplars, missing values, structures, or other probabilistic entities that are introduced when designing the discriminant function.
Figure 3.3: MED as an Information Projection Operation.
The MED probabilistic formalism also has interesting conceptual connections to other recent information-theoretic and boosting works. One point of contact is the entropy projection and boosting (AdaBoost) framework developed in [170] and [110]. Boosting uses a distribution that weights each data point in a training set and forms a weak learner based upon it. This process is iterated, updating the distribution over data and the weak learner for t = 1 ... T iterations. All hypotheses are then combined in a weighted mixture of weak learners, called the final master algorithm. Effectively, each boosting step estimates a new distribution P^{t+1} over the training data that both minimizes the relative entropy to a prior distribution P^t and is orthogonal to a 'performance vector' denoted U^t. The performance vector U^t is of the same cardinality as P^t and has values ranging between [−1, 1]. If the previous 'weak learner' given by the prior probability distribution correctly classifies a data point, then the U^t vector at that training datum's index has a value close to 1 (i.e. U^t_i = 1). If the datum is poorly classified, then U^t_i is −1 at the corresponding index. Therefore, we update the distribution using the following exponential update rule (which follows directly from classical maximum entropy results):
P^{t+1}_i ∝ P^t_i exp(−α U^t_i)
Instead of considering an iterative approach where individual corrective updates are made, we may enforce all the orthogonality constraints we have accumulated so far and generate a full convex hull to constrain the entropy projection [110]:
P^{t+1}_i ∝ P^t_i exp( − Σ_{q=1}^{t} α_{t,q} U^q_i )
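This multiplicative update can be sketched directly; the initial distribution, performance vector and α value below are hypothetical:

```python
import numpy as np

def entropy_projection_update(P, U_history, alphas):
    """Multiplicative update P_i^{t+1} proportional to P_i^t * exp(-sum_q alpha_q * U_i^q),
    i.e. an entropy projection enforcing all constraints accumulated so far."""
    log_w = np.log(P) - np.asarray(alphas) @ np.asarray(U_history)
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()                   # renormalize to a distribution

# Four training points, uniform initial distribution
P = np.full(4, 0.25)
# One performance vector so far: +1 = correctly classified, -1 = misclassified
U_history = [np.array([1.0, 1.0, -1.0, 1.0])]
alphas = [0.5]
P_next = entropy_projection_update(P, U_history, alphas)
print(P_next)  # the misclassified point (index 2) gains weight
```

Working in log-space and subtracting the maximum before exponentiating is a standard safeguard once many α-weighted constraints accumulate.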
Kivinen and Warmuth [110] argue that each new distribution should provide information not present in the current weak hypothesis, given the individual orthogonality constraint. When we simultaneously consider all orthogonality constraints up until time t, the new hypothesis should provide new information that is uncorrelated with all previous hypotheses. The convex hull of constraints results in the exponentiated Σ_{q=1}^{t} α_{t,q} U^q_i terms in the above equation, which are strongly reminiscent of the MED formulation's exponentiated classification constraints (and their Lagrange multipliers). We can therefore interpret the MED formulation as minimizing a divergence to a prior while extracting as much information as possible from the training data. Another information-theoretic point of contact can be found in the work of Tishby and others [189] [177]. Here, the authors propose minimizing the lossy coding of input data X via a compact representation X̃ while maintaining a constraint on the mutual information I(X̃; Y) between the coding and some desired output variable Y. This information-theoretic setting gives rise to the Lagrangian optimization I(X; X̃) − βI(X̃; Y). The result is an efficient representation of the input data which extracts as much information as possible (in terms of bits to encode) about the relevance output variable. A loose analogy can be made to the MED framework, which solves for a solution distribution P(Θ) that minimally encodes the prior distribution P_0(Θ) (analogous to X̃ and X respectively) such that the classification constraints due to the training data (analogous to the relevance variables) are satisfied and provide as much information as possible. An important connection also lies between MED and Kullback's early work on the so-called Minimum Discrimination Information method [113]. The definition Kullback adopts for 'discrimination' is slightly different from the one we are discussing here.
It mainly involves discrimination between two competing hypotheses based on an information metric, where one hypothesis has to satisfy some additional constraints while remaining as close to the prior hypothesis as possible. The mechanism Kullback proposes is therefore very similar to the maximum entropy formalism of Jaynes [86], and he even describes connections to Shannon's theory of communication [174]. Kullback points out various important elaborations on both of these, yet ultimately the Minimum Discrimination Information method once again finds a distribution that is as close as possible to a prior in terms of KL-divergence subject to various moment constraints. The information between hypotheses involves distributions over the measurable space, as opposed to distributions over parameters as in MED. Furthermore, the constraints used are not margin-based (or even classification-based) as in MED and thus do not give rise to a discriminative classifier (or regressor). Nevertheless, MED seems to be a natural continuation of Kullback's approach and can be seen as a contemporary effort to combine it with the current impetus toward discriminative estimation in the SVM literature (as well as the corresponding generalization arguments).
3.5 Computing the Partition Function
Ultimately, implementing the MED solution given by Theorem 2 hinges on our ability to perform the required calculations. For instance, we need to maximize the concave objective function to obtain the optimal setting of the Lagrange multipliers λ:
$$J(\lambda) = -\log Z(\lambda)$$

CHAPTER 3. MAXIMUM ENTROPY DISCRIMINATION 50
Ideally, therefore, we would like to be able to evaluate the partition function Z(λ) either analytically or at least efficiently. More precisely, the partition function is given by:

$$Z(\lambda) = \int P_0(\Theta,\gamma)\, e^{\sum_t \lambda_t\left[y_t L(X_t;\Theta) - \gamma_t\right]}\, d\Theta\, d\gamma \qquad (3.8)$$
Having a closed-form partition function gives us a convenient concave objective function that can then be optimized by standard techniques. Possible choices include convex programming, first- and second-order methods, as well as axis-parallel methods. Implementation details as well as some novel speed improvements (such as learning which Lagrange multipliers are critical to the maximization) for optimizing J(λ) are provided in Appendix Section 10.1. An additional use of the partition function comes from a powerful property of maximum entropy (and exponential family distributions) which is inherited by MED. The property states that gradients (of arbitrary order) of the log-partition log Z(λ) with respect to a given variable λt are equal to the expectations (of corresponding order) of the moment constraints with respect to the maximum entropy distribution. This permits us to easily compute the expectations and variances over P(Θ, γ) of the MED constraints by taking first and second derivatives of log Z. Therefore, given a closed-form MED partition function, we can conveniently obtain these expectations as follows [10] [113]:

$$\frac{\partial \log Z(\lambda)}{\partial \lambda_t} = E_{P(\Theta,\gamma)}\left\{ y_t L(X_t;\Theta) - \gamma_t \right\}$$

$$\frac{\partial^2 \log Z(\lambda)}{\partial \lambda_t^2} = \mathrm{Var}_{P(\Theta,\gamma)}\left\{ y_t L(X_t;\Theta) - \gamma_t \right\}$$

Unfortunately, integrals are required to compute this critical log-partition function, and these may not always be analytically solvable. If the partition function is indeed solvable, various strategies can then be used to optimize J(λ). For instance, axis-parallel techniques will iteratively converge to the global maximum. In certain situations, J(λ) may even be maximized using, e.g., quadratic programming. Furthermore, online evaluation of the decision rule after training also requires an integral followed by a sign operation, which may not be feasible for arbitrary choices of the priors and discriminant functions. However, this is usually less cumbersome than actually computing the partition function to obtain the optimal Lagrange multipliers.
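As an illustrative numerical sketch (a toy discrete parameter space and a made-up constraint function, not from the text), the first-derivative property can be verified directly:

```python
import numpy as np

# Toy check of d log Z / d lambda = E_{P(Theta)}[constraint]:
# discretize a 1-D parameter space, pick a uniform prior P0 and a
# hypothetical constraint function f(Theta) standing in for y*L - gamma.
theta = np.linspace(-2.0, 2.0, 401)
p0 = np.ones_like(theta) / len(theta)      # uniform prior P0(Theta)
f = theta**2 - 1.0                         # hypothetical constraint values

def log_Z(lam):
    return np.log(np.sum(p0 * np.exp(lam * f)))

lam, eps = 0.7, 1e-5
grad_numeric = (log_Z(lam + eps) - log_Z(lam - eps)) / (2 * eps)

# Maximum entropy solution: P(Theta) proportional to P0(Theta) exp(lam * f)
p = p0 * np.exp(lam * f)
p /= p.sum()
expectation = np.sum(p * f)
assert abs(grad_numeric - expectation) < 1e-8
```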
In the following sections we shall specify under what conditions the computations will remain tractable. These will depend on the specific configuration of the discriminant function L(X; Θ) as well as the choice of the prior P0(Θ, γ). In the following section, we discuss various choices of margin priors, bias priors, model priors and discriminant functions.
3.6 Margin Priors
We can mathematically expand the partition function in Equation 3.8 by noting that the distribution factorizes as follows:

$$Z(\lambda) = \int P_0(\Theta,\gamma)\, e^{\sum_t \lambda_t\left[y_t L(X_t;\Theta)-\gamma_t\right]}\, d\Theta\, d\gamma$$
$$= \int P_0(\Theta)\, \prod_t P_0(\gamma_t)\, e^{\sum_t \lambda_t\left[y_t L(X_t;\Theta)-\gamma_t\right]}\, d\Theta\, d\gamma$$
$$= \int P_0(\Theta)\, e^{\sum_t \lambda_t y_t L(X_t;\Theta)}\, d\Theta \;\times\; \prod_t \int P_0(\gamma_t)\, e^{-\lambda_t\gamma_t}\, d\gamma_t$$
$$= Z_\Theta(\lambda) \times \prod_t Z_{\gamma_t}(\lambda_t)$$
Recall that our optimization function, J(λ) was expressed as the negated logarithm of the partition function:
$$J(\lambda) = -\log Z(\lambda) = -\log Z_\Theta(\lambda) - \sum_t \log Z_{\gamma_t}(\lambda_t) = J_\Theta(\lambda) + \sum_t J_{\gamma_t}(\lambda_t)$$
These Jγt (λt) behave very similarly to the loss functions L(γt) in the original regularization theory approach (actually, they are negated versions of the loss functions). We now have a direct way of
finding penalty terms −Jγt(λt) from margin priors P0(γt) and vice versa. Thus, there is a dual relationship between defining an objective function with penalty terms and defining prior distributions over parameters and margins. For instance, consider the following margin prior distribution:
$$P_0(\gamma_t) = c\, e^{-c(1-\gamma_t)}, \qquad \gamma_t \le 1 \qquad (3.9)$$
Integrating, we get the penalty function (Figure 3.4):
$$\log Z_{\gamma_t}(\lambda_t) = \log \int_{-\infty}^{1} c\, e^{-c(1-\gamma_t)}\, e^{-\lambda_t \gamma_t}\, d\gamma_t = -\lambda_t - \log(1 - \lambda_t/c)$$
In this case, a penalty is incurred for margins smaller than the prior mean of γt which is 1 − 1/c. Margins larger than this quantity are not penalized and the associated classification constraint becomes irrelevant (i.e. the corresponding Lagrange multiplier could possibly vanish to 0). Increasing the parameter c will encourage separable solutions and when c → ∞, the margin distribution becomes peaked at the setting γt = 1 which is equivalent to having fixed margins as in the initial MED Definition 1. The choice of the margin distribution will correspond closely to the use of slack variables in the SVM formulation as well as the choice of different loss functions in the regularization theory approach. In fact, the parameter c will play an almost identical role here as the regularization parameter c which upper bounds the Lagrange multipliers in the slack variable SVM solution. Figure 3.4(a) shows the above prior and its associated potential term (the negated penalty term above). Various other classification margin priors and penalty terms that are analytically computable are given in Table 3.1 and Figure 3.4. Furthermore, in the figure, the dotted green line indicates the potential function that arises when the margins are fixed at unity (which assumes separability). For all plots, the value c = 3 was used.
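As a quick numerical sanity check (illustrative values of c and λ, not from the text) that the integral above indeed gives log Zγ(λ) = −λ − log(1 − λ/c):

```python
import numpy as np

# Integrate the margin prior P0(gamma) = c * exp(-c(1 - gamma)), gamma <= 1,
# against exp(-lambda * gamma) and compare with the closed form.
c, lam = 3.0, 1.2                       # any 0 <= lam < c works
gamma = np.linspace(-30.0, 1.0, 2_000_001)
dg = gamma[1] - gamma[0]
integrand = c * np.exp(-c * (1.0 - gamma)) * np.exp(-lam * gamma)
Z_numeric = np.sum(integrand) * dg      # simple Riemann sum
Z_closed = np.exp(-lam) / (1.0 - lam / c)
assert abs(np.log(Z_numeric) - np.log(Z_closed)) < 1e-4
```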
Margin prior P0(γt)                    Dual potential term Jγt(λt)
a) P0(γ) ∝ e^{−c(1−γ)},  γ ≤ 1         λ + log(1 − λ/c)
b) P0(γ) ∝ e^{−c|1−γ|}                 λ + log(1 − λ²/c²)
c) P0(γ) ∝ e^{−c²(1−γ)²/2}             λ − λ²/(2c²)
Table 3.1: Margin prior distributions and associated potential functions.
Note that all these priors form concave J(·) potential functions (i.e., convex penalty functions), as desired for a unique optimum in the Lagrange multiplier space.

Figure 3.4: Margin prior distributions (top) and associated potential functions (bottom).

It should be noted that some potential functions will force an upper bound (via a barrier function) on the λt while others will allow them to vary freely (subject only to non-negativity). Other priors and penalty functions are also possible, in particular for the regression case, which will be discussed later and which will require quite different margin configurations. We now move on to priors for the model and, in particular, priors for the bias.
3.7 Bias Priors
The bias is merely a subcomponent of the model; however, due to its particular interaction with the discriminant function, it will be treated separately here. More specifically, the bias b appears as an additive scalar in the discriminant. Recall that Θ can be seen as a concatenation of all parameters, so we may consider the breakdown Θ = {Θ/b, b}. Recall the following form of the discriminant function from Equation 3.1 (or Equation 3.4):
L(X; Θ) = θT X + b
Such a bias term arises not only in linear models but in many other classification models, including generative classification, multi-class classification, and even regression models. Evidently, one can always set b to zero to remove its effect, or simply set b to a fixed constant; yet the MED approach easily permits us to consider a distribution over b, namely P(b), and to tailor the solution by specifying a prior P0(b). Here, we consider two possible choices for the prior P0(b) (although many others are possible): the Gaussian prior and the non-informative prior.
3.7.1 Gaussian Bias Priors
Consider the zero-mean Gaussian prior for P0(b) given by:
$$P_0(b) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{b^2}{2\sigma^2}} \qquad (3.10)$$
This prior favors bias values that are close to zero and therefore a priori assumes an even balance between the two binary classes in the decision problem. If we have a prior belief that the class frequencies are slightly skewed, we may introduce a mean into the above prior, which would then favor one class over another. The resulting potential term Jb(λ) is:
$$J_b(\lambda) = -\log Z_b(\lambda) = -\log \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{b^2}{2\sigma^2}}\, e^{\sum_t y_t \lambda_t b}\, db = -\frac{\sigma^2}{2}\Big(\sum_t y_t \lambda_t\Big)^2$$
The variance (or standard deviation) σ further specifies how certain we are that the classes are evenly balanced. In terms of the potential function, it constrains with a quadratic penalty the balance between Lagrange multipliers for the negative class and the positive class.
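A quick numerical check (with arbitrary illustrative labels and multipliers) that the Gaussian bias prior produces Jb(λ) = −(σ²/2)(Σt yt λt)²:

```python
import numpy as np

# Integrate the N(0, sigma^2) bias prior against exp(sum_t y_t lambda_t b)
# and compare -log Z_b with the closed-form quadratic potential.
sigma = 2.0
y = np.array([+1, +1, -1, +1, -1], dtype=float)    # hypothetical labels
lam = np.array([0.3, 0.1, 0.5, 0.2, 0.4])          # hypothetical multipliers
a = np.dot(y, lam)                                 # sum_t y_t lambda_t

b = np.linspace(-60.0, 60.0, 1_200_001)
db = b[1] - b[0]
prior = np.exp(-b**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
Z_b = np.sum(prior * np.exp(a * b)) * db
J_b_numeric = -np.log(Z_b)
J_b_closed = -(sigma**2 / 2) * a**2
assert abs(J_b_numeric - J_b_closed) < 1e-6
```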
3.7.2 Non-Informative Bias Priors
Evidently, a Gaussian prior will favor values of b that are close to zero. However, in the absence of any knowledge about the bias, it would be reasonable to permit any scalar value for b with equal preference. This will give rise to a non-informative prior. This form of prior can be parameterized as a Gaussian as in Equation 3.10 but with the variance approaching infinity, i.e. σ → ∞. This stretches out the Gaussian until it starts to behave like a uniform distribution on the axis b ∈ (−∞, ∞). The resulting potential term will naturally be:
$$J_b(\lambda) = \lim_{\sigma\to\infty} -\frac{\sigma^2}{2}\Big(\sum_t y_t\lambda_t\Big)^2$$
Since we are maximizing the potential terms, if σ grows to infinity the above objective function will go to negative infinity unless the term in the parentheses, Σt yt λt, is exactly zero. Therefore, the non-informative prior generates an extra constraint (in addition to non-negativity) on the Lagrange multipliers, requiring that Σt yt λt = 0:
Lemma 1 If the bias prior P0(b) is set to a non-informative (infinite-variance) Gaussian, the (non-negative) Lagrange multipliers in the MED solution must also satisfy the equality constraint Σt yt λt = 0.
At this point, we have defined the priors and the computational machinery necessary for the MED formulation to give rise to support vector machines.
3.8 Support Vector Machines
As previously discussed, a support vector machine can be cast in the regularization theory framework and is solvable as a convex program due to the fundamentally linear discriminant function it employs:
$$L(X;\Theta) = \theta^T X + b$$
One can also interpret the linear decision boundary generatively by considering, for example, the log-likelihood ratio of two Gaussian distributions (one per class) with equal covariance matrices:

$$L(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} + b$$
We shall adopt the first linear discriminant boundary since it has a more efficient parameterization and with the choice of a simple prior will exactly synthesize a support vector machine. In particular if we choose a Gaussian prior on the weights θ of our linear discriminant function, the MED formulation will produce support vector machines:
Theorem 3 Assuming L(X;Θ) = θ^T X + b and P0(Θ, γ) = P0(θ)P0(b)P0(γ), where P0(θ) is N(0, I), P0(b) approaches a non-informative prior, and P0(γ) is given by P0(γt) as in Equation 3.9, the Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λt ≤ c and Σt λt yt = 0, where
$$J(\lambda) = \sum_t \left[\lambda_t + \log(1-\lambda_t/c)\right] - \frac{1}{2}\sum_{t,t'} \lambda_t\lambda_{t'}\, y_t y_{t'}\, (X_t^T X_{t'})$$
The above J(λ) objective function is strikingly similar to the SVM dual optimization problem. The only difference between the two is that the above formulation has an extra potential term, log(1 − λt/c), which acts as a barrier function preventing the λ values from growing beyond c. In an SVM, the Lagrange multiplier values are clamped to be no greater than c explicitly, as an extra constraint in the convex program. In both formalisms, c plays an almost identical role by varying the degree of regularization and upper bounding the Lagrange multipliers. Typically, low c values increase regularization, reduce the solution's sensitivity to classification errors, improve robustness to outliers, and permit non-separable classification problems. However, in an SVM, the c-regularization parameter arises from an ad hoc introduction of slack variables to permit the SVM to handle non-separable data. If we let c → ∞, the potential term log(1 − λt/c) vanishes and MED gives rise to exactly an SVM (for separable data). In practice, even for finite c, the MED and SVM solutions are almost identical.
3.8.1 Single Axis SVM Optimization
We can greatly simplify the support vector machine by avoiding the non-informative prior on the bias. If we instead assume a Gaussian bias prior with finite covariance, the equality constraint Σt λt yt = 0 can be omitted. The resulting convex program only requires non-negativity of the Lagrange multipliers, and the updated objective function becomes:
$$J(\lambda) = \sum_t\left[\lambda_t + \log(1-\lambda_t/c)\right] - \frac{\sigma^2}{2}\Big(\sum_t y_t\lambda_t\Big)^2 - \frac{1}{2}\sum_{t,t'}\lambda_t\lambda_{t'}\, y_t y_{t'}\, (X_t^T X_{t'})$$
Therefore, it is now possible to update a single Lagrange multiplier at a time in an axis-parallel manner. In fact, the update for each axis is analytic (even with the MED logarithmic barrier function in the non-separable case). The minimal working set in this case is 1, while in the SVM, updates that increase the objective must be done simultaneously on at least 2 Lagrange multipliers, as in the Sequential Minimal Optimization (SMO) technique proposed by Platt [154]. This gives the MED implementation a simpler optimization problem, which leads to gains in computational efficiency without any significant change from the solution produced under non-informative priors.
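The single-axis scheme can be sketched as follows (a minimal sketch on synthetic 2-D data with a linear kernel and illustrative constants c and σ²; the per-axis update solves the quadratic obtained by setting dJ/dλt = 1 − ht − Q λt − 1/(c − λt) = 0 and clamping at zero):

```python
import numpy as np

# Axis-parallel coordinate ascent for the finite-bias-variance MED-SVM:
# the Gaussian bias prior removes the equality constraint, so each
# multiplier has an analytic update even with the log-barrier term.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.5, 1.0, (20, 2)),
               rng.normal(-1.5, 1.0, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
c, sigma2 = 3.0, 1.0
K = sigma2 + X @ X.T            # effective kernel absorbs the bias prior

lam = np.zeros(len(y))
for _ in range(200):            # axis-parallel sweeps
    for t in range(len(y)):
        # h = contribution of all *other* multipliers to dJ/dlambda_t
        h = y[t] * (K[t] @ (lam * y)) - K[t, t] * lam[t]
        Q = K[t, t]
        B = (1.0 - h) + Q * c
        C = (1.0 - h) * c - 1.0
        root = (B - np.sqrt(B * B - 4 * Q * C)) / (2 * Q)  # root below c
        lam[t] = max(0.0, root)

# Decision rule: expected parameters under the MED solution distribution
w = (lam * y) @ X               # E[theta] for an N(0, I) prior on theta
b = sigma2 * np.sum(lam * y)    # E[b] under the Gaussian bias prior
pred = np.sign(X @ w + b)
assert np.mean(pred == y) >= 0.9
```

Note that the barrier keeps every λt strictly below c without any explicit box constraint, which is exactly the behavior described above.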
3.8.2 Kernels
The MED formulation for SVMs also readily extends to the kernel case, where nonlinearities (of the kernel type) can be immediately folded in by an implicit mapping to a higher dimensional space. The updated MED objective function merely becomes:

$$J(\lambda) = \sum_t\left[\lambda_t + \log(1-\lambda_t/c)\right] - \frac{\sigma^2}{2}\Big(\sum_t y_t\lambda_t\Big)^2 - \frac{1}{2}\sum_{t,t'}\lambda_t\lambda_{t'}\, y_t y_{t'}\, K(X_t, X_{t'}) \qquad (3.11)$$
In the above, the standard inner products of the input vectors, Xt^T Xt', are replaced with a kernel function of the vectors, K(Xt, Xt'), as is done in the SVM literature. The MED computations remain relatively unchanged since (in the linear discriminant case) all calculations only involve inner products of the input vectors.
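For instance, a minimal sketch of one common kernel choice (RBF; the bandwidth value is arbitrary here) that can replace the inner products in the Gram matrix:

```python
import numpy as np

# Build an RBF kernel Gram matrix K(X_t, X_t') = exp(-gamma * ||X_t - X_t'||^2)
# to substitute for the linear inner products X_t^T X_t'.
def rbf_gram(X, gamma=0.5):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * d2)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_gram(X)
assert np.allclose(np.diag(K), 1.0)        # k(x, x) = 1 for the RBF kernel
assert np.isclose(K[0, 1], np.exp(-0.5))   # squared distance 1, gamma = 0.5
assert np.allclose(K, K.T)                 # Gram matrix is symmetric
```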
3.9 Generative Models
At this point we consider the use of generative models in the MED framework. This fundamentally extends the regularization and SVM discriminative frameworks to the powerful modeling of Bayesian generative models. Herein lies the strength of the MED technique as a bridge between two communities with mutually beneficial tools. Consider a two-class problem where we have a generative model for each class, namely P(X|θ+) and P(X|θ−). These two generative models can be directly combined to form a classifier by considering their log-likelihood ratio as follows:

$$L(X;\Theta) = \log\frac{P(X|\theta_+)}{P(X|\theta_-)} + b \qquad (3.12)$$
Here, the aggregate parameter set is Θ = {θ+, θ−, b} which includes both generative models as well as a scalar bias. Thus, by merely changing the discriminant function, the MED framework can be used to estimate generative models and guarantee that the decision boundary they give rise to will be optimal in a classification setting. Naturally, the above discriminant function is generally nonlinear and will give rise to a non-convex hull of constraints in a standard regularization setting. However, in the MED framework, due to the probabilistic solution P (Θ), the above discriminant functions still behave as a convex program.
Figure 3.5: Discriminative Generative Models. In (a) we show the standard maximum likelihood estimation of two generative models from the data and the resulting poor classifier decision boundary they generate. In (b), MED moves the generators slightly such that they combine to form an accurate classification boundary.
Estimating P(Θ) using MED will ultimately yield P(θ+, θ−, b), which can be used to specify the generative models for the data, P(X|θ+) and P(X|θ−). These will be full generative models that can be sampled from, integrated, conditioned, etc.; yet, unlike a direct Bayesian framework, these generative models will also combine to form a high-performance discriminative classifier when plugged into the L(X;Θ) discriminant function. Figure 3.5(a) depicts the estimation of a maximum likelihood generative model, while MED moves the generators for each class (ellipses) such that the decision boundary creates good classification separation in Figure 3.5(b). Whether or not MED estimation is feasible once again hinges upon our ability to compute the log-partition function Z(λ). We will show that it is possible to obtain the partition function analytically whenever the generative models P(X|θ+) and P(X|θ−) are in the exponential family.
3.9.1 Exponential Family Models
We have argued that functions that can be efficiently solved within the MED approach include log-likelihood ratios of the exponential family of distributions. Can we compute the partition function efficiently to actually implement this estimation? First we give details on the exponential family form [9] [32]. It is well known that such distributions have important properties in maximum likelihood estimation [9] [198]. This family subsumes a wide set of distributions and its members are characterized by the following general structure (commonly referred to as the natural parameterization):
$$p(X|\theta) = \exp\big(A(X) + X^T\theta - K(\theta)\big)$$
where K(θ) is convex and the distribution is normalized over the space of X. Further details on the exponential family and its many interesting properties can be found in Chapter 5 and in [9] [32]. In addition, each exponential family member has a conjugate prior distribution given by:
$$p(\theta|\chi) = \exp\big(\tilde{A}(\theta) + \theta^T\chi - \tilde{K}(\chi)\big)$$
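As a concrete instance of both forms (illustrative, not from the text): the unit-variance Gaussian N(θ, 1) has A(X) = −X²/2 − log√(2π) and K(θ) = θ²/2, and its conjugate prior in the form above is N(χ, 1), with Ã = A and K̃(χ) = χ²/2. A quick numerical check of the normalization:

```python
import numpy as np

# Unit-variance Gaussian in natural form:
# p(X|theta) = exp(A(X) + X*theta - K(theta)),
# A(X) = -X^2/2 - log(sqrt(2*pi)), K(theta) = theta^2/2, i.e. N(theta, 1).
def log_p(x, theta):
    A = -x**2 / 2 - 0.5 * np.log(2 * np.pi)
    return A + x * theta - theta**2 / 2

X = np.linspace(-10, 10, 200_001)
dX = X[1] - X[0]
for theta in (-1.0, 0.0, 2.5):
    Z = np.sum(np.exp(log_p(X, theta))) * dX
    assert abs(Z - 1.0) < 1e-6     # normalized for every theta, as required
```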
The conjugate distribution is itself in the exponential family and, therefore, its K̃ is also convex. Whether or not a specific combination of a discriminant function and an associated prior is estimable within the MED framework depends on the computability of the partition function (i.e. the objective function used for optimizing the Lagrange multipliers associated with the constraints). In general, these operations will require integrals over the associated parameter distributions. In particular, recall the partition function corresponding to the binary classification case and consider the integral over Θ in:

$$Z_\Theta(\lambda) = \int P_0(\Theta)\, e^{\sum_t \lambda_t y_t L(X_t;\Theta)}\, d\Theta$$
If we now separate out the parameters associated with the class-conditional densities as well as the bias term (i.e. θ+, θ−, b) and expand the discriminant function as a log-likelihood ratio, we obtain the following:
$$Z_\Theta = \int P_0(\theta_+)\, P_0(\theta_-)\, P_0(b)\; e^{\sum_t \lambda_t y_t\left[\log\frac{P(X_t|\theta_+)}{P(X_t|\theta_-)} + b\right]}\, d\Theta$$
The above factorizes as ZΘ = Zθ+ Zθ− Zb. We can now substitute the exponential family forms for the class-conditional distributions and associated conjugate distributions for the priors. We assume that the prior is defined by specifying a value for χ. It suffices here to show that we can obtain
Zθ+ in closed form. The derivation for Zθ− is identical. We will drop the “+” symbol from Zθ+ for clarity. The problem is now reduced to evaluating:
$$Z_\theta(\lambda) = \int e^{\tilde{A}(\theta) + \theta^T\chi - \tilde{K}(\chi)}\; e^{\sum_t \lambda_t y_t\left(A(X_t) + X_t^T\theta - K(\theta)\right)}\, d\theta$$
We have shown earlier (see Lemma 1) that a non-informative prior over the bias term b leads to the constraint Σt λt yt = 0, under which the K(θ) terms in the exponent cancel. Making this assumption, we get:
$$Z_\theta(\lambda) = e^{-\tilde{K}(\chi) + \sum_t \lambda_t y_t A(X_t)} \times \int e^{\tilde{A}(\theta) + \theta^T\left(\chi + \sum_t \lambda_t y_t X_t\right)}\, d\theta = e^{-\tilde{K}(\chi) + \sum_t \lambda_t y_t A(X_t)} \times e^{\tilde{K}\left(\chi + \sum_t \lambda_t y_t X_t\right)}$$
The last evaluation above results from a natural normalization property of the exponential family. The expressions for A, Ã, K, K̃ are known for specific distributions in the exponential family and can easily be used to complete the above evaluation, yielding the objective function (which holds for any exponential-family distribution):

$$\log Z_\theta(\lambda) = \tilde{K}\Big(\chi + \sum_t \lambda_t y_t X_t\Big) + \sum_t \lambda_t y_t A(X_t) - \tilde{K}(\chi)$$
Therefore, it is clear that we can compute the objective function J(λ) for any discriminant function arising from exponential family generative models. In fact, integration doesn’t even need to be performed since we have an analytic expression of our objective function in terms of the natural parameterization for all exponential family distributions. It should also be noted that the above objective function often bears a strong resemblance to the evidence term in Bayesian inference (i.e. Bayesian integration), where the Lagrange multipliers seem to act as weights on the Bayesian inference. It is straightforward at this point to perform the required optimization and find the optimal setting of the Lagrange multipliers that maximize the concave J(λ).
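To make this concrete, here is a numerical sketch (illustrative data, not from the text) using a unit-variance Gaussian generator N(θ, 1) with its conjugate Gaussian prior, so that Ã(θ) = A(θ) = −θ²/2 − log√(2π) and K̃(χ) = χ²/2; the multipliers are chosen so that Σt λt yt = 0, as the derivation requires:

```python
import numpy as np

# Compare the closed form log Z_theta = K~(chi + s) + sum(lam*y*A(X)) - K~(chi)
# against direct numerical integration over theta.
X_data = np.array([0.5, -1.0, 2.0, 0.2])
y = np.array([+1.0, -1.0, +1.0, -1.0])
lam = np.array([0.3, 0.1, 0.2, 0.4])          # chosen so sum(lam * y) = 0
assert abs(np.sum(lam * y)) < 1e-12
chi = 0.7

def A(x):
    return -x**2 / 2 - 0.5 * np.log(2 * np.pi)

K_tilde = lambda u: u**2 / 2
s = np.sum(lam * y * X_data)

log_Z_closed = K_tilde(chi + s) + np.sum(lam * y * A(X_data)) - K_tilde(chi)

theta = np.linspace(-12, 12, 200_001)
dth = theta[1] - theta[0]
log_lik = np.sum(lam[:, None] * y[:, None] *
                 (A(X_data)[:, None] + X_data[:, None] * theta[None, :]
                  - theta[None, :]**2 / 2), axis=0)
prior = np.exp(A(theta) + theta * chi - K_tilde(chi))
log_Z_numeric = np.log(np.sum(prior * np.exp(log_lik)) * dth)
assert abs(log_Z_closed - log_Z_numeric) < 1e-6
```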
3.9.2 Empirical Bayes Priors
At this point, we have proven that it is feasible to estimate generative models in the exponential family form under MED if we assume the priors are given by the conjugate distribution. However, the parameters of the conjugate priors are still not specified and we still have quite some flexibility in designing prior knowledge into the MED formulation. In the absence of any prior knowledge, and whenever possible we recommend the default prior to be either a conjugate non-informative prior or an Empirical Bayes prior.
Loosely put, the prior P0(Θ) (more specifically, P0(θ+) and P0(θ−)) will be the posterior distribution over the parameters that Bayesian inference generates from the data. Consider the data set {X1, ..., XT} with binary (±1) labels {y1, ..., yT}. The inputs can thus be split into the positive inputs {X1+, ..., XT+} and the negative inputs {X1−, ..., XT−}. We now explicate the Bayesian inference procedure. To distinguish the resulting densities from those that will be used in the MED formulation, here we put a P̂ symbol on the Bayesian distributions. In Bayesian inference, the positive class's posterior distribution is estimated only from the positive input exemplars {X1+, ..., XT+} as follows:
$$\hat{P}(\theta_+) = \hat{P}(\theta_+|\{X_{1+},\ldots,X_{T+}\}) \propto \hat{P}(\{X_{1+},\ldots,X_{T+}\}|\theta_+)\,\hat{P}_0(\theta_+) = \prod_{t+}\hat{P}(X_{t+}|\theta_+)\,\hat{P}_0(\theta_+)$$
Similarly, the negative class’s generative Bayesian posterior model is estimated only from the negative input exemplars {X1−,...,XT −} :
Pˆ(θ−) = Πt−Pˆ(Xt−|θ−)Pˆ0(θ−)
For this Bayesian estimate of the generative model of the data, a minimally informative prior Pˆ0(θ±) should be used. The result is a distribution that is as good a generator as possible for the data set. However, we don’t want just a good generator of the data, we also want a good discriminator. Thus, we can use MED to satisfy the large-margin classification constraints. But, simultaneously, the solution should be as close as possible to the generative model in terms of KL-divergence. Therefore, we shall use the Bayesian posteriors as the MED priors!
P0(θ+) := Pˆ(θ+)
P0(θ−) := Pˆ(θ−)
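A tiny sketch of this prior construction for one class (illustrative assumptions: a unit-variance Gaussian class model with a broad N(0, τ²) prior on its mean, and hypothetical data):

```python
import numpy as np

# Empirical Bayes prior: the Bayesian posterior of the class mean given only
# the positive-class points becomes the MED prior P0(theta_+).
X_pos = np.array([1.2, 0.8, 1.5, 1.1])    # positive-class inputs (hypothetical)
tau2 = 100.0                               # broad, near non-informative prior
n = len(X_pos)
post_var = 1.0 / (n + 1.0 / tau2)          # posterior variance of the mean
post_mean = post_var * np.sum(X_pos)       # posterior mean of the mean

# With a broad prior the posterior mean approaches the sample mean:
assert abs(post_mean - X_pos.mean()) < 0.01
```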
Figure 3.6 depicts the information projection solution MED will generate from the Bayesian estimate. Effectively, we will try solving for the distribution over parameters that is as close as possible to the Bayesian estimate (which is often actually quite similar to the maximum likelihood estimate in the case of the exponential family) but that also satisfies the classification constraints.
Figure 3.6: Information Projection Operation of the Bayesian Generative Estimate.
The motivation here is that in the absence of any further discriminative information, we should have as good a generator as possible. We now note a number of advantages of this type of empirical Bayes prior, which include theoretical, conceptual and practical arguments. First, an empirical Bayes prior is a good precautionary measure to take because it allows more flexible use of MED's discriminative model as a generator whenever necessary. This may be the case when the discriminator has to cope with missing data or noise. If, therefore, we are in a prediction setting where some input variables are missing, we could reconstruct them (or integrate over them) by simply using the MED discriminative model as a surrogate for a generator distribution. Under sparse data situations a model may easily satisfy some given discrimination constraints while many aspects of the model remain ambiguous. In these cases, the empirical Bayesian prior provides a backup generative criterion, which further constrains the problem (albeit in ways not helpful to the task) and therefore can help consistent estimation. We also obtain an invariance in using an empirical Bayes prior that we would not get if we assume a fixed prior. For example, a
fixed zero-mean Gaussian prior would produce different MED solutions if we translate the training data, while an empirical Bayes prior would follow the translation of the data (with the Bayesian generative model) and consistently set up the same relative decision boundary. Furthermore, consistency is important under the (arguably over-optimistic) situation that the generative model we are using is exactly correct and perfectly matches the training data. In that case, the Bayesian solution is optimal and MED may stray from it unless we have an empirical Bayes prior, even if we obtain infinite training data. An interesting side note is that if we use the standard margin prior distribution given by Equation 3.9 and obtain an upper bound on the Lagrange multipliers (i.e. they are < c), then as c → 0, the MED solution uses the Bayesian posterior, while as c increases, we reduce regularization (and outlier rejection) in favor of perfect classification. Finally, on a purely practical note, an empirical Bayes prior may provide better numerical stability. A discriminative MED model could put little probability mass on the training data and return a very poor generative configuration while still perfectly separating the data. This would be undesirable numerically since we would get very small values for, e.g., P(X|θ+) and P(X|θ−). During prediction, a new test point may cause numerical accuracy problems if it is far from the probability mass in the MED discriminative solution. Therefore, whenever it does not result in a loss of discriminative power, one should maintain the generative aspects of the model.
3.9.3 Full Covariance Gaussians
We now consider the case where the discriminant function L(X;Θ) corresponds to the log-likelihood ratio of two Gaussians with different (and adjustable) covariance matrices. The parameters Θ in this case are both the means and the covariances. These generative models are within the exponential family, so the previous results hold. Thus, the prior of choice P0(Θ) must be the conjugate to the full-covariance Gaussian, which is the Normal-Wishart. We shall use N as shorthand for the normal distribution and IW as shorthand for the inverse-Wishart. This choice of distributions permits us to obtain closed form integrals for the partition function Z(λ). Here, we shall once again break down the parameters into the two generative models and the bias as before. Thus, we have P(Θ) = P(θ+)P(θ−)P(b). More specifically, the θ± will also be broken down into the mean and covariance components of the Gaussian. Therefore, we have P(Θ) = P(µ+, Σ+)P(µ−, Σ−)P(b), which gives us a density over means and covariances (this notation closely follows that of [136]). The prior distribution has the form
P0(θ+) = N (µ+|m+, Σ+/k) IW(Σ+|kV+, k)
where the parameters that specify the prior, namely the scalar k, the vector m+, and the matrix V+, can be set manually. Also, one may let k → 0 to obtain a non-informative prior.
We used the MAP values for k, m0 and V0 from the class-specific data, which corresponds to the posterior distribution over the parameters given the data under a Bayesian inference procedure (i.e. an empirical Bayes procedure as described in the previous section). Integrating over the parameters, we get the partition function, which factorizes as Z(λ) = Zγ(λ) Z+(λ) Z−(λ). For Z+(λ) we obtain the following:

$$Z_+(\lambda) \propto N_+^{-d/2}\, |\pi S_+|^{-N_+/2}\, \prod_{j=1}^{d} \Gamma\Big(\frac{N_+ + 1 - j}{2}\Big)$$
In the above we have defined the following intermediate variables (the scalar N+, the vector X̄+ and the matrix S+) for brevity:

$$N_+ \stackrel{\Delta}{=} \sum_t w_t, \qquad \bar{X}_+ \stackrel{\Delta}{=} \sum_t \frac{w_t}{N_+}\, X_t, \qquad S_+ \stackrel{\Delta}{=} \sum_t w_t X_t X_t^T - N_+ \bar{X}_+ \bar{X}_+^T$$
Here, wt is a scalar weight given by wt = u(yt) + yt λt for Z+(λ). To solve for Z−(λ) we proceed in exactly the same manner, except that the weights are set to wt = u(−yt) − yt λt. Here u(·) is merely the step function. Given Z, updating λ is done by maximizing the corresponding negative entropy J(λ) subject to 0 ≤ λt ≤ c and Σt λt yt = 0, where:

$$J(\lambda) = \sum_t \left[l_\alpha \lambda_t + \log(1 - \lambda_t/c)\right] - \log Z_+(\lambda) - \log Z_-(\lambda)$$
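The weighted statistics above can be sketched directly (synthetic data and arbitrary multipliers; with wt = u(yt) + yt λt for the positive-class term, negative examples enter with weight −λt):

```python
import numpy as np

# Compute N_+, Xbar_+ and S_+ from the lambda-dependent weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = np.array([+1, +1, +1, -1, -1, -1], dtype=float)
lam = np.array([0.2, 0.0, 0.5, 0.1, 0.3, 0.0])     # arbitrary multipliers

w = (y > 0).astype(float) + y * lam                 # u(y_t) + y_t * lambda_t
N_pos = w.sum()                                     # scalar N_+
X_bar = (w[:, None] * X).sum(axis=0) / N_pos        # vector Xbar_+
S_pos = (w[:, None] * X).T @ X - N_pos * np.outer(X_bar, X_bar)  # matrix S_+

assert np.isclose(N_pos, 3.3)       # 1.2 + 1.0 + 1.5 - 0.1 - 0.3 - 0.0
assert np.allclose(S_pos, S_pos.T)  # weighted scatter matrix is symmetric
```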
The margin potential term in J(λ) corresponds to integrating over the margin with a margin prior P0(γ) ∝ e^{−c(lα−γ)}, γ ≤ lα. We pick lα to be some α-percentile of the margins obtained under the standard MAP solution. Optimal Lagrange multiplier values are then found via a simple constrained gradient descent procedure. The resulting MRE (normalized by the partition function Z(λ)) is itself a Normal-Wishart distribution for each generative model, with the final λ values set by the maximization of J(λ):
P (θ+) = N (µ+; X¯+, Σ+/N+) IW(Σ+; S+,N+)
Predicting the labels for a data point X under the final P(Θ) involves taking expectations of the discriminant function under a Normal-Wishart. For the positive generative class, this expectation is:

$$E_{P(\theta_+)}\left[\log P(X|\theta_+)\right] = \mathrm{constant} - \frac{N_+}{2}\,(X - \bar{X}_+)^T S_+^{-1} (X - \bar{X}_+)$$
The expectation over the negative class is similar. This gives us the predicted label quite simply as:

$$\hat{y} = \mathrm{sign}\int P(\Theta)\, L(X;\Theta)\, d\Theta = \mathrm{sign}\left( E_{P(\Theta)}\left[\log\frac{P(X|\theta_+)}{P(X|\theta_-)}\right] + E_{P(b)}[b] \right)$$
$$\phantom{\hat{y}} = \mathrm{sign}\left( E_{P(\theta_+)}\left[\log P(X|\theta_+)\right] - E_{P(\theta_-)}\left[\log P(X|\theta_-)\right] + E_{P(b)}[b] \right)$$
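A minimal sketch of this decision rule (hypothetical per-class statistics; the class-specific constants are assumed equal and the bias is omitted for illustration):

```python
import numpy as np

# Quadratic decision rule: compare the two expected class log-likelihoods,
# each a Mahalanobis-style score -(N/2) (X - Xbar)^T S^{-1} (X - Xbar).
def expected_loglik(X, X_bar, S, N):
    d = X - X_bar
    return -0.5 * N * d @ np.linalg.solve(S, d)

# Hypothetical per-class statistics (Xbar, S, N) for each generative model
Xbar_pos, S_pos, N_pos = np.array([1.0, 1.0]), np.eye(2), 10.0
Xbar_neg, S_neg, N_neg = np.array([-1.0, -1.0]), np.eye(2), 10.0

X = np.array([0.8, 1.3])   # query point, closer to the positive class
y_hat = np.sign(expected_loglik(X, Xbar_pos, S_pos, N_pos)
                - expected_loglik(X, Xbar_neg, S_neg, N_neg))
assert y_hat == 1.0
```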
Computing the expectation over the bias is avoided in the non-informative case; its additive effect is instead estimated, as in an SVM, via the Karush-Kuhn-Tucker conditions. We thus obtain discriminative quadratic decision boundaries. These extend the linear boundaries without (explicitly) resorting to kernels. Of course, kernels may still be used in this formalism, effectively mapping the feature space into a higher dimensional representation. However, unlike linear discrimination, the covariance estimation in this framework allows the model to adaptively modify the kernel.

CHAPTER 3. MAXIMUM ENTROPY DISCRIMINATION 61

For visualization, we present the technique on a 2D set of training data in Figure 3.7. In Figure 3.7(a), maximum likelihood is used to estimate a two-Gaussian discrimination boundary (bias is estimated separately), which has the flexibility to achieve perfect classification yet produces a classifier whose performance is equal to random guessing. Meanwhile, the maximum entropy discrimination technique places the Gaussians in the most discriminative configuration, as shown in Figure 3.7(b), without requiring kernels or feature space manipulations.

Figure 3.7: Classification visualization for Gaussian discrimination: (a) ML & MED initialization, (b) MED intermediate, (c) MED intermediate, (d) MED converged.
Experiments
In the following, we show results using the minimum relative entropy approach where the discriminant function L(X; Θ) is the log-ratio of Gaussians with variable covariance matrices, on standard 2-class classification problems (Leptograpsus Crabs and Breast Cancer Wisconsin). Performance is compared to regular support vector machines, maximum likelihood estimation and other methods. The Leptograpsus crabs data set was originally provided by Ripley [162] and further tested by Barber and Williams [8]. The objective is to classify the sex of the crabs from 5 scalar anatomical observations. The training set contains 80 examples (40 of each sex) and the test set includes 120 examples. The Gaussian-based decision boundaries are compared in Table 3.2 against other models from [8]. The table shows that the maximum entropy (or minimum relative entropy) criterion improves the Gaussian discrimination performance to levels similar to the best alternative models. The bias was estimated separately from training data for both the maximum likelihood Gaussian models and the maximum entropy discrimination case. In addition, we show the performance of a support vector machine (SVM) with linear, radial basis and polynomial decision boundaries (using the Matlab SVM Toolbox provided by Steve Gunn). In this case, the linear SVM is limited in flexibility while the kernels exhibit some over-fitting.
Method                               Training Errors   Testing Errors
Neural Network (1)                          -                3
Neural Network (2)                          -                3
Linear Discriminant                         -                8
Logistic Regression                         -                4
MARS (degree = 1)                           -                4
PP (4 ridge functions)                      -                6
Gaussian Process (HMC)                      -                3
Gaussian Process (MAP)                      -                3
SVM - Linear                                5                3
SVM - RBF sigma = 0.3                       1               18
SVM - 3rd Order Polynomial                  3                6
Maximum Likelihood Gaussians                4                7
MaxEnt Discrimination Gaussians             2                3

Table 3.2: Leptograpsus Crabs
Another data set tested was the Breast Cancer Wisconsin data, where the two classes (malignant or benign) have to be computed from 9 numerical attributes of the patients' tumors (200 training cases and 169 test cases). The data was first presented by Wolberg [206]. We compare our results to those produced by Zhang [211], who used a nearest neighbor algorithm to achieve 93.7% accuracy. As can be seen from Table 3.3, over-fitting prevents good performance from the kernel-based SVMs, and the top performer here is the maximum entropy discriminator with an accuracy of 95.3%.
Method                               Training Errors   Testing Errors
Nearest Neighbor                            -               11
SVM - Linear                                8               10
SVM - RBF sigma = 0.3                       0               11
SVM - 3rd Order Polynomial                  1               13
Maximum Likelihood Gaussians               10               16
MaxEnt Discrimination Gaussians             3                8

Table 3.3: Breast Cancer Classification
3.9.4 Multinomials
Another popular exponential family model is the multinomial distribution. We next consider the case where the discriminant function L(X; Θ) corresponds to the log-likelihood ratio of two multinomials:
\[
L(X;\Theta) = \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b
\]
where the generative models are given by (if the X vector is considered as a set of counts):
\[
P(X|\theta_+) = \binom{\sum_k X^k}{X^1 \ldots X^K} \prod_{k=1}^K \rho_k^{X^k} \tag{3.13}
\]
In the above, we are using the superscript on \(X^k\) to index into the dimensionality of the vector (the subscript will be used to index into the training set). The scalar term in the large parentheses is the multinomial coefficient (the natural extension of the binomial coefficient from coin tossing to die tossing). This scalar term is unity if the X vector is zero everywhere except for one unit entry. Otherwise, it simply scales the probability distribution by a constant factor, which can be rewritten as follows for clarity (the use of gamma functions permits us to also consider continuous X vectors):
\[
\binom{\sum_k X^k}{X^1 \ldots X^K} = \frac{\left(\sum_{k=1}^K X^k\right)!}{\prod_{k=1}^K X^k!} = \frac{\Gamma\!\left(1 + \sum_{k=1}^K X^k\right)}{\prod_{k=1}^K \Gamma(1 + X^k)}
\]
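This Gamma-function form lends itself to a numerically stable implementation via log-Gamma functions; a small sketch (the helper name is ours):

```python
import numpy as np
from scipy.special import gammaln

def log_multinomial(X, rho):
    """Log of Eq. 3.13: multinomial log-likelihood using the Gamma-function
    form of the coefficient, valid even for continuous count vectors X."""
    X = np.asarray(X, dtype=float)
    rho = np.asarray(rho, dtype=float)
    log_coef = gammaln(1.0 + X.sum()) - gammaln(1.0 + X).sum()
    return log_coef + (X * np.log(rho)).sum()
```

For integer counts this reproduces the exact multinomial log-probability.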
The generative distribution in Equation 3.13 parameterizes the multinomial with the ρ vector of non-negative scalars that sum to unity, i.e. \(\sum_k \rho_k = 1\). The parameters Θ in this case are the ρ for the positive class and for the negative class, as well as the bias scalar b. These generative models are within the exponential family and so the previous results hold. Thus, the prior of choice \(P_0(\Theta)\) must be the conjugate to the multinomial, which is the Dirichlet distribution. This choice of distributions permits us to obtain closed form integrals for the partition function Z(λ). Here, we shall once again break down the parameters into the two generative models and the bias as before. Thus we have \(P(\Theta) = P(\theta_+)P(\theta_-)P(b)\); to distinguish the ρ for the positive class, we will denote the parameters for the negative class as \(\bar{\rho}\). The prior Dirichlet distribution has the form
\[
P_0(\theta_+) = \frac{\Gamma\!\left(\sum_k \alpha_k\right)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k - 1}
\]
We typically assume that \(\alpha_k\) will be pre-specified manually (or given by an empirical Bayes procedure) and will satisfy \(\alpha_k > 1\). The core computation involves computing the component of the log-partition function that corresponds to the model (the computation for the bias and the margins remains the same as in all the previous cases). Thus, we need:
\[
Z_{\theta_+}(\lambda)\, Z_{\theta_-}(\lambda) = \int P_0(\theta_+)\, P_0(\theta_-)\; e^{\sum_t \lambda_t y_t \log \frac{P(X_t|\theta_+)}{P(X_t|\theta_-)}}\; d\theta_+\, d\theta_-
\]
It suffices to show how to compute \(Z_{\theta_+}\):
\[
Z_{\theta_+} = \int P_0(\theta_+)\; e^{\sum_t \lambda_t y_t \log P(X_t|\theta_+)}\; d\theta_+
\]
\begin{align*}
Z_{\theta_+} &= \int \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k-1} \; e^{\sum_t \lambda_t y_t \log \left[ \binom{\sum_k X_t^k}{X_t^1 \ldots X_t^K} \prod_k \rho_k^{X_t^k} \right]} d\rho \\
&= \int \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k-1} \; e^{\sum_t \lambda_t y_t \log \prod_k \rho_k^{X_t^k}} \, d\rho \;\times\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \ldots X_t^K}} \\
&= \int \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k-1} \prod_k \rho_k^{\sum_t \lambda_t y_t X_t^k} \, d\rho \;\times\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \ldots X_t^K}} \\
&= \int \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k \rho_k^{\alpha_k - 1 + \sum_t \lambda_t y_t X_t^k} \, d\rho \;\times\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \ldots X_t^K}} \\
&= \frac{\Gamma(\sum_k \alpha_k)}{\Gamma\!\left(\sum_k \alpha_k + \sum_{tk} \lambda_t y_t X_t^k\right)} \prod_k \frac{\Gamma\!\left(\alpha_k + \sum_t \lambda_t y_t X_t^k\right)}{\Gamma(\alpha_k)} \;\times\; e^{\sum_t \lambda_t y_t \log \binom{\sum_k X_t^k}{X_t^1 \ldots X_t^K}}
\end{align*}
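The closed form in the last line can be sanity-checked numerically. For K = 2, the Dirichlet integral reduces to a one-dimensional Beta-type integral over a single ρ (the multinomial-coefficient factor, being constant in ρ, is omitted; the values below are illustrative):

```python
import numpy as np
from scipy.special import gammaln
from scipy.integrate import quad

def log_Z_model(alpha, s):
    """Model part of log Z_{theta+}: alpha is the Dirichlet prior,
    s_k = sum_t lambda_t y_t X_t^k are the lambda-weighted counts."""
    a = alpha + s
    return (gammaln(alpha.sum()) - gammaln(a.sum())
            + gammaln(a).sum() - gammaln(alpha).sum())

alpha = np.array([2.0, 3.0])       # illustrative prior
s = np.array([1.25, 0.0])          # illustrative weighted counts

# For K = 2, integrate the Dirichlet prior times rho^{s_1}(1-rho)^{s_2} directly.
prior_const = np.exp(gammaln(alpha.sum()) - gammaln(alpha).sum())
integrand = lambda r: prior_const * r**(alpha[0] - 1 + s[0]) * (1 - r)**(alpha[1] - 1 + s[1])
numeric, _ = quad(integrand, 0.0, 1.0)
```

np.log(numeric) agrees with log_Z_model(alpha, s), confirming the Gamma-function expression.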
We can thus form our objective function and maximize it to obtain the setting for the Lagrange multipliers λ (subject to the constraint \(\sum_t \lambda_t y_t = 0\)):
\[
J(\lambda) = -\log Z_{\theta_+}(\lambda) - \log Z_{\theta_-}(\lambda) - \log Z_\gamma(\lambda)
\]
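A toy sketch of this maximization (two training points, SLSQP as the constrained optimizer, and the soft-margin potential \(\sum_t [\lambda_t + \log(1 - \lambda_t/c)]\) standing in for \(-\log Z_\gamma\) — illustrative choices, not the thesis's exact procedure):

```python
import numpy as np
from scipy.special import gammaln
from scipy.optimize import minimize

def log_Z_model(alpha, s):
    """Dirichlet closed form for the model part of log Z_{theta}."""
    a = alpha + s
    return (gammaln(alpha.sum()) - gammaln(a.sum())
            + gammaln(a).sum() - gammaln(alpha).sum())

def solve_lambda(X, y, alpha, c=1.0):
    """Maximize J(lam) = sum_t [lam_t + log(1 - lam_t/c)]
    - log Z_{theta+} - log Z_{theta-}
    subject to 0 <= lam_t <= c and sum_t lam_t y_t = 0."""
    def neg_J(lam):
        s = (lam * y) @ X                  # weighted counts for theta+
        J = (lam + np.log(1.0 - lam / c)).sum()
        J -= log_Z_model(alpha, s)         # positive class
        J -= log_Z_model(alpha, -s)        # negative class gets weights -lam_t y_t
        return -J
    cons = {"type": "eq", "fun": lambda lam: lam @ y}
    res = minimize(neg_J, x0=np.full(len(y), 0.1 * c),
                   bounds=[(1e-9, c * (1 - 1e-9))] * len(y),
                   constraints=cons, method="SLSQP")
    return res.x
```

The concavity of J (a sum of concave barrier terms and negated log-partition functions) makes this a well-behaved convex program.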
The setting for the Lagrange multipliers permits us to exactly specify the final MED solution distribution P(Θ), which is used to compute the predictions for future classification:
\[
\hat{y} = \mathrm{sign} \int P(\Theta)\, L(X;\Theta)\, d\Theta
\]
3.10 Generalization Guarantees
We now present several arguments for MED in terms of generalization guarantees. While generative frameworks have guarantees (asymptotic and otherwise) on the goodness of fit of a distribution (e.g. Bayesian evidence score, Bayesian Information Criterion, Akaike Information Criterion, etc.), they seldom have guarantees on the generalization performance of the models in a classification or regression setting. Furthermore, the guarantees may be distribution dependent, which might be inappropriate if the true generative distribution of a data source is not perfectly known. Conversely, discriminative approaches that specifically target classification or regression performance can have strong generalization arguments as we move from training data to testing data, and these may also be distribution independent. The MED framework, in its discriminative estimation approach, brings classification performance guarantees to generative models. There are a number of arguments we will make, including sparsity-based generalization, references to VC-dimension based generalization, and PAC-Bayesian generalization. Although generalization bounds can be quite loose for small amounts of training data, they are better than no guarantees whatsoever. Furthermore, the 'shape' of the generalization bounds has been demonstrated empirically to be useful in a discriminative setting. Finally, with large amounts of data, these bounds can become reasonably tight.
3.10.1 VC Dimension
Due to the ability of the MED framework to subsume SVMs (exactly generating the same equations in the separable case), it too benefits from the generalization guarantees that accompany them. These are of course the VC-dimension (Vapnik-Chervonenkis) bounds on the expected risk, R(Θ), of a classifier. Assuming we have a [0, 1] loss function l(Xt, yt, Θ) and T training exemplars, the empirical risk can be readily computed [35] [196] [197]:
\[
R_{\mathrm{emp}}(\Theta) = \frac{1}{T} \sum_{t=1}^{T} l(X_t, y_t, \Theta)
\]
The true risk (for samples outside of the training set) is then bounded above by the empirical risk plus a term that depends only on the size of the training set, T, and the VC-dimension of the classifier, h. This non-negative integer quantity measures the capacity of a classifier and is independent of the distribution of the data. The following bound holds with probability \(1 - \delta\):
\[
R(\Theta) \;\le\; R_{\mathrm{emp}}(\Theta) + \sqrt{\frac{h\left(\log(2T/h) + 1\right) - \log(\delta/4)}{T}}
\]
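As a quick illustration of the bound's behavior (a hedged sketch; the confidence term is only meaningful for h < 2T):

```python
import math

def vc_bound(r_emp, h, T, delta=0.05):
    """Empirical risk plus the VC confidence term from the bound above."""
    conf = math.sqrt((h * (math.log(2 * T / h) + 1) - math.log(delta / 4)) / T)
    return r_emp + conf
```

The gap between true and empirical risk shrinks as T grows and widens with the capacity h, which is the qualitative behavior the text relies on.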
The VC-dimension of a set of large-margin hyper-planes is bounded by a quantity that shrinks as the margin grows: for gap-tolerant classifiers, h is at most of order \(R^2/\gamma^2\), where R is the radius of a sphere enclosing the data and γ the margin [35]. Thus, we have a 'plausible' argument for maximizing margin with a linear classifier. Although this does not translate immediately to nonlinear classifiers (if there is no direct kernel mapping back to linear hyper-planes), the motivation for large margins in SVMs can still be used to justify large margins in the MED formulation (namely, with priors that put large probability mass on larger margin values). We now move to other formal arguments for MED generalization.

3.10.2 Sparsity

The MED solution involves a constraint-based optimization where a classification constraint is present for each training data point to be classified. Each constraint is represented by the Lagrange multiplier associated with the given data point. In many cases, these constraints are likely to be redundant. This is apparent since classifying one data point correctly might automatically result in correct classification of several others. Therefore, the constraints involving some data points will be obviated by others and their corresponding Lagrange multipliers will go to zero. As in an SVM, points close to the decision boundary (which have small margin values) play a critical role in shaping it and generate non-zero Lagrange multipliers. These are the support vectors in standard SVM terminology. Meanwhile, other points that are easily correctly classified with a large margin will have zero Lagrange multipliers. Thus, the MED solution depends only on a small subset of the training data and will not change if the other data points are deleted. This gives rise to a notion of sparsity, with which we can make some generalization arguments. One argument is that the generalization error (denoted g) is less than the expected fraction of non-zero Lagrange multipliers among all Lagrange multipliers.
" # PT δ(λ > 0) ≤ E t=1 t g T Thus, for T data points, we simply count the number of non-zero Lagrange multipliers (using the δ function which is zero for Lagrange multipliers of value zero and unity for non-vanishing values). However, the expectation is taken over arbitrary choices of the training set which means that the upper bound on generalization error can only be approximated (using cross-validation or other techniques as in [196] [83]). Alternatively, a coarse and riskier approximation to the expectation can be done by simply counting the number of remaining non-zero Lagrange multipliers after maximizing J(λ) on the training set in the MED solution. 3.10.3 PAC-Bayes Bounds An alternative to VC dimension arguments for generalization includes PAC bounds (probably ap- proximately correct, Valiant 1984). Recent contributions in terms of a PAC-Bayesian model selection CHAPTER 3. MAXIMUM ENTROPY DISCRIMINATION 66 criteria by McAllester [125] and Langford [116] have given theoretical generalization arguments that directly motivate the MED approach (MED was actually developed prior to the generalization re- sults). Essentially PAC-Bayesian approaches allow the combination of a Bayesian integration of prior domain knowledge with PAC generalization guarantees without forcing the PAC framework to assume the truthfulness of the prior. We loosely adapt and state the main results of [116] here but further details are available from the original work as well as [125]. Effectively, the gener- alization guarantees are for model averaging where a stochastic model selection criterion is given in favor of a deterministic one. MED is a model averaging framework in that a distribution over models is computed (unlike, eg. an SVM). Therefore, these new generalization results apply almost immediately. First, (as in MED) we assume a prior probability distribution P0(Θ) over a possibly uncount- able (continuous) model class. 
We also assume our discriminant functions L(X;Θ) are bounded real-valued hypotheses⁷. Given a set of T training exemplars of the form \((X_t, y_t)\) sampled from a distribution D, we would like to compute the expected loss (i.e. the fraction of misclassifications) over the true distribution D. Recall that in MED a correct classification of the data is given by:
\[
y_t \int P(\Theta)\, L(X_t;\Theta)\, d\Theta \;\ge\; 0
\]
Meanwhile, incorrect classifications are of the form:
\[
y_t \int P(\Theta)\, L(X_t;\Theta)\, d\Theta \;\le\; 0
\]
A more conservative empirical misclassification rate (i.e. one which over-counts the number of errors) can be obtained by also counting those errors below some positive margin threshold γ:
\[
y_t \int P(\Theta)\, L(X_t;\Theta)\, d\Theta \;\le\; \gamma
\]
If we compute the empirical number of misclassifications with this more conservative, margin-based technique, we can upper bound the expected (standard) misclassification rate. The expected misclassification rate has the following upper bound, which holds with probability \(1 - \delta\):
\[
E_D\!\left[\, y \int P(\Theta) L(X;\Theta)\, d\Theta \le 0 \,\right] \;\le\; \frac{1}{T} \sum_t \left[\, y_t \int P(\Theta) L(X_t;\Theta)\, d\Theta \le \gamma \,\right] + O\!\left( \sqrt{ \frac{ \gamma^{-2}\, D(P(\Theta)\,\|\,P_0(\Theta))\, \ln T + \ln T + \ln \delta^{-1} }{T} } \right)
\]
Ideally, we would like to minimize the expected risk of the classifier on future data (the left hand side). Clearly, the bound above motivates forming a classifier that satisfies the empirical classification constraints (encapsulated by the first term on the right hand side) while minimizing divergence to the prior distribution (the second term on the right hand side). We also note that increasing the margin

⁷The generalization guarantees were actually originally for averaging binary discriminant functions, not real ones, but can be extended to real ones in a straightforward manner. One may construct an MED classifier where the discriminant function is, e.g., sigmoidal or binary, and then satisfy the requirements for these bounds to hold.
Alternatively, a trivial extension is to find a bound by considering a maximal sphere around all the data, which implicitly provides limits on the range of the discriminant function. This then permits a scaled version of the generalization bound.

threshold is also useful for minimizing the expected risk. These criteria are directly addressed by the MED framework, which strongly agrees with this theoretical motivation. Furthermore, increasing the cardinality of the training data set makes the bound tighter, independently of the distribution of the data. Another point of contact is that [125] argues that the optimal posterior according to these types of bounds is, as in the Maximum Entropy Discrimination solution, the Gibbs distribution.

3.11 Summary and Extensions

We have presented the Maximum Entropy Discrimination framework from a regularization theory perspective and shown how it subsumes support vector machines. The solvability of the MED solution has been demonstrated. We have also shown how MED can readily be used in a generative framework where decision boundaries arise from exponential family distributions over the input space. Finally, generalization guarantees provide the framework with a theoretical grounding that reinforces its flexibility and generality. We next discuss the many extensions that can be cascaded into the MED formalism, which further motivate the usefulness of the approach.

Chapter 4

Extensions to Maximum Entropy Discrimination

Up to this point, the MED formulation has bridged generative modeling with the discriminative performance of SVMs. However, the MED method can be further elaborated and spans a wide range of machine learning scenarios. In this chapter, we discuss various extensions to the framework to demonstrate its flexibility and intuitive nature.
One recurring theme in exploring extensions is to introduce further (possibly intermediate) variables in the discriminant function L(X;Θ) and to solve for an augmented distribution P(Θ, ...) that includes them. The resulting partition function typically involves more integration, yet if it remains analytic, the number of Lagrange multipliers and the optimization complexity remain basically unchanged. Figure 4.1 depicts the common metaphor of augmenting the probability with further variables, which will be utilized. This follows the same principle used to augment the distribution with soft margin constraints as in Section 3.3. Once again, we note the caveat that as we add more distributions to the prior, we should be careful to balance their competing goals (i.e. their variances) evenly so that we still derive meaningful information from each component of the aggregate prior (i.e. the model prior, the margin prior, and the many further priors we will introduce shortly).

Figure 4.1: Formulating extensions to MED: the prior P0(Θ, γ, s, y, X, ...) is projected to the solution P(Θ, γ, s, y, X, ...).

Figure 4.2 depicts the many different scenarios MED can handle. Some extensions, such as multi-class classification, can be treated as several binary classification constraints [85] or through error-correcting codes [45]. In this chapter we explicate the case where the labels are no longer discrete but continuous, i.e. regression. Once again, for the case of regression (just as in binary classification), SVM regression is subsumed. Subsequently, we discuss structure learning (as opposed to parameter estimation), in particular for feature selection applications. Furthermore, we discuss the use of partially labeled examples and transduction (for both classification and regression). Finally, we lead into a very important generalization which requires special treatment on its own: the extension to mixture models (i.e.
mixtures of the exponential family) and latent modeling (discussed in Chapter 5).

Figure 4.2: Various extensions to binary classification: (a) binary classification, (b) multiclass classification, (c) regression, (d) feature selection, (e) transduction, (f) anomaly detection.

4.1 MED Regression

The MED formalism is not restricted to classification. Here, we present its extension to the regression (or function approximation) case using the approach and nomenclature in [178]. Dual-sided constraints are imposed on the output such that an interval called an ε-tube around the function is described¹. Suppose training input examples \(\{X_1, \ldots, X_T\}\) are given with their corresponding output values as continuous scalars \(\{y_1, \ldots, y_T\}\). We wish to solve for a distribution over the parameters of a discriminative regression function as well as margin variables:

Theorem 4 The maximum entropy discrimination regression problem can be cast as follows: Find P(Θ, γ) that minimizes KL(P‖P0) subject to the constraints:
\[
\int P(\Theta, \gamma)\, \left[\, y_t - L(X_t;\Theta) + \gamma_t \,\right] d\Theta\, d\gamma \;\ge\; 0, \quad t = 1..T
\]
\[
\int P(\Theta, \gamma)\, \left[\, \gamma_t' - y_t + L(X_t;\Theta) \,\right] d\Theta\, d\gamma \;\ge\; 0, \quad t = 1..T
\]

¹An ε-tube (as in the SVM literature) is a region of insensitivity in the loss function, which only penalizes approximation errors that deviate by more than ε from the data.

Figure 4.3: Margin prior distributions (top) and associated potential functions (bottom).

where L(X_t; Θ) is a discriminant function and P0 is a prior distribution over models and margins.
The decision rule is given by \(\hat{y} = \int P(\Theta)\, L(X;\Theta)\, d\Theta\). The solution is given by:
\[
P(\Theta, \gamma) = \frac{1}{Z(\lambda)}\, P_0(\Theta, \gamma)\; e^{\sum_t \lambda_t \left[ y_t - L(X_t|\Theta) + \gamma_t \right] \;+\; \sum_t \lambda_t' \left[ \gamma_t' - y_t + L(X_t|\Theta) \right]}
\]
where the objective function is again \(J(\lambda) = -\log Z(\lambda)\). Typically, we have the following prior for γ, which differs from the classification case due to the additive role of the output \(y_t\) (versus multiplicative) and the two-sided constraints:
\[
P_0(\gamma_t) \propto \begin{cases} 1 & \text{if } 0 \le \gamma_t \le \epsilon \\ e^{-c(\gamma_t - \epsilon)} & \text{if } \gamma_t > \epsilon \end{cases} \tag{4.1}
\]
Integrating, we obtain:
\begin{align*}
\log Z_{\gamma_t}(\lambda_t) &= \log \left( \int_0^\epsilon e^{\lambda_t \gamma_t}\, d\gamma_t + \int_\epsilon^\infty e^{-c(\gamma_t - \epsilon)}\, e^{\lambda_t \gamma_t}\, d\gamma_t \right) \\
&= \log \left( \frac{e^{\lambda_t \epsilon} - 1}{\lambda_t} + \frac{e^{\lambda_t \epsilon}}{c - \lambda_t} \right) \\
&= \lambda_t \epsilon - \log(\lambda_t) + \log \left( 1 - e^{-\lambda_t \epsilon} + \frac{\lambda_t}{c - \lambda_t} \right)
\end{align*}
Figure 4.3 shows the above prior and its associated penalty terms under different settings of c and ε. Varying ε effectively modifies the thickness of the ε-tube around the function. Furthermore, c varies the robustness to outliers by tolerating violations of the ε-tube. The above margin prior tends to produce a regressor which is insensitive to errors smaller than ε and then penalizes errors by an almost linear loss thereafter (where c controls the steepness of the linear loss).

Figure 4.4: MED approximation to the sinc function: noise-free case (left) and with Gaussian noise (right).
4.1.1 SVM Regression

If we assume a linear discriminant function for L (or a linear decision after a kernel), the MED formulation generates the same objective function that arises in SVM regression [178]:

Theorem 5 Assuming \(L(X;\Theta) = \theta^T X + b\) and \(P_0(\Theta, \gamma) = P_0(\theta) P_0(b) P_0(\gamma)\), where \(P_0(\theta)\) is N(0, I), \(P_0(b)\) approaches a non-informative prior, and \(P_0(\gamma)\) is given by Equation 4.1, then the Lagrange multipliers λ are obtained by maximizing J(λ) subject to \(0 \le \lambda_t \le c\), \(0 \le \lambda_t' \le c\) and \(\sum_t \lambda_t = \sum_t \lambda_t'\), where
\begin{align*}
J(\lambda) = & \sum_t y_t (\lambda_t' - \lambda_t) - \epsilon \sum_t (\lambda_t + \lambda_t') \\
& + \sum_t \left[ \log(\lambda_t) - \log\left( 1 - e^{-\lambda_t \epsilon} + \frac{\lambda_t}{c - \lambda_t} \right) \right] \\
& + \sum_t \left[ \log(\lambda_t') - \log\left( 1 - e^{-\lambda_t' \epsilon} + \frac{\lambda_t'}{c - \lambda_t'} \right) \right] \\
& - \frac{1}{2} \sum_{t,t'} (\lambda_t - \lambda_t')(\lambda_{t'} - \lambda_{t'}')\, (X_t^T X_{t'})
\end{align*}

As can be seen (and more so as c → ∞), the objective becomes very similar to the one in SVM regression. There are some additional penalty functions (all the logarithmic terms) which can be considered barrier functions in the optimization to maintain the constraints. To illustrate the regression, we approximate the sinc function, a popular example in the SVM literature. Here, we sampled 100 points from \(\mathrm{sinc}(x) = |x|^{-1} \sin|x|\) within the interval [−10, 10]. We also considered a noisy version of the sinc function where Gaussian additive noise of standard deviation 0.2 was added to the output. Figure 4.4 shows the resulting function approximation, which is very similar to the SVM case. The kernel applied was an 8th order polynomial².

4.1.2 Generative Model Regression

As was the case for MED classification, we can also consider an MED regression scenario where the regression model is not linear (or linear after some kernel manipulation) but is instead given by a probability distribution. Thus, the regularization and epsilon-tube properties of the SVM approach can be readily applied to the estimation of generative models in a regression setting.
Furthermore, these can be in the exponential family and also mixtures of the exponential family, as will be elaborated in the subsequent chapter. We begin by modifying the discriminant function L(X;Θ) from its usual linear form. Consider a two class problem where we have a generative model for each class, namely \(P(X|\theta_+)\) and \(P(X|\theta_-)\). For example, these could each be a Gaussian distribution, a mixture of Gaussians, or even a complex structured model such as a hidden Markov model. To form a regressor, we directly combine the two generative models by considering their log-likelihood ratio as a discriminant function:
\[
L(X;\Theta) = \log \frac{P(X|\theta_+)}{P(X|\theta_-)} + b
\]
Here, the aggregate parameter set is \(\Theta = \{\theta_+, \theta_-, b\}\), which includes both generative models as well as a scalar bias. Thus, by merely changing the discriminant function, the MED framework can be used to estimate generative models that form a regression function. If the two generators are Gaussians with equal covariance, the regression function will effectively produce a linear regression. In general, however, the above discriminant function is nonlinear and would give rise to a non-convex hull of constraints in a standard regularization setting. In the MED framework, due to the probabilistic solution P(Θ), the above discriminant functions still behave as a convex program.

²A kernel implicitly transforms the input data by modifying the dot-product between data vectors \(k(X_t, X_{t'}) = \langle \Phi(X_t), \Phi(X_{t'}) \rangle\). This can also be done by explicitly remapping the data via the transformation \(\Phi(X_t)\) and using the conventional dot-product, which permits non-linear classification and regression using the basic linear SVM machinery. For example, an m-th order polynomial expansion replaces a vector \(X_t\) by \(\Phi(X_t) = [X_t; X_t^2; \ldots; X_t^m]\).
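A minimal sketch of such a generative regression function with two spherical unit-covariance Gaussians (toy parameters of our own); as noted above, equal covariances make the log-ratio exactly linear in X:

```python
import numpy as np

def gaussian_log_ratio(X, mu_pos, mu_neg, b=0.0):
    """L(X) = log N(X; mu_pos, I) - log N(X; mu_neg, I) + b.
    The Gaussian normalizers cancel for equal (identity) covariances."""
    return 0.5 * (np.sum((X - mu_neg)**2) - np.sum((X - mu_pos)**2)) + b
```

Expanding the squares gives L(X) = (mu_pos − mu_neg)·X + const, i.e. an ordinary linear regression function, consistent with the equal-covariance remark above.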
Furthermore, it is possible (through the iterative bounds and machinery in Chapter 5) to deal with latent discriminant regression functions of the form:
\[
L(X;\Theta) = \log \frac{\sum_m P(m, X|\theta_+)}{\sum_m P(m, X|\theta_-)} + b
\]

4.2 Feature Selection and Structure Learning

The MED framework is not limited to estimating distributions over continuous parameters such as Θ. We can also use it to solve for a distribution over discrete parameters and thus use it for structure learning. One form of structure learning is feature selection. The feature selection problem can be cast as finding the structure of a graphical model (as in [43]) or identifying a set of components of the input examples that are relevant for a classification task. More generally, feature selection can be viewed as a problem of setting discrete structural parameters associated with a specific classification or regression method. We will use feature selection in the MED framework to ignore components of the input space (i.e. the \(X_t\) vectors) that are not relevant to the given classification or regression task. This naturally provides computational advantages since the algorithm can ignore these inputs at run-time. However, feature selection does not only reduce the input dimensionality; we will also show that it helps improve generalization accuracy in both classification and regression (cf. [111]). The omission of certain input dimensions permits better generalization and leads to a further notion of sparsity in the input dimensionality (in addition to the sparsity from the support vectors and Lagrange multipliers discussed in Section 3.10.2) [90] [202]. This is often critical if the input space has high dimensionality, many irrelevant features, and the data set is small. We will initially derive feature selection in the MED framework as a feature weighting to permit a probabilistic solution. Each feature or structural parameter is given a probability value.
The feature selection process then estimates the most discriminative probability distribution over the structural parameters while it also estimates the most discriminative parameter model. Irrelevant features will eventually receive extremely low probabilities of being selected. Since the feature selection process is performed jointly and discriminatively together with model estimation, and both specifically optimize a classification or regression criterion, feature selection will usually improve results over, for example, an SVM (up to a point where we start removing too many features).

4.2.1 Feature Selection in Classification

The MED formulation can be extended to feature selection if we consider augmenting the distribution over models (and margins, bias, etc.) with a distribution over feature selection switches. This 'augmentation' paradigm was initially discussed in Section 3.3 and under many conditions will preserve the solvability of the MED projection. We will now consider augmenting a linear classifier (such as an SVM) with feature selection. We first introduce extra parameters into our linear discriminant function:
\[
L(X;\Theta) = \sum_{i=1}^n \theta_i s_i X_i + \theta_0
\]
Here, the familiar \(\theta_1, \ldots, \theta_n\) correspond to the linear parameter vector while \(\theta_0\) is the bias parameter (usually denoted b). In addition to the scalar \(\theta_i\) parameters, we have also introduced binary switches \(s_1, \ldots, s_n\) which can only be 0 or 1. These are structural parameters that will either completely turn off a feature \(X_i\) (if \(s_i = 0\)) or leave it on (if \(s_i = 1\)). If we were to solve for the optimal feature selection by brute force, we would have to try all \(2^n\) configurations of the discrete switch variables. In the MED formulation, however, we can instead consider a distribution over switches, which will lead to tractable computation.
The fact that the switches are now discrete (instead of continuous like the \(\theta_i\) parameters) does not violate the MED formulation [85]. The partition function and the expectations over discriminant functions now also involve summations over the \(s_i\) as well as integration over the continuous parameters. Therefore, the MED solution distribution P(Θ) now includes the linear model, the switches and the bias, i.e. \(\Theta = \{\theta_0, \theta_1, \ldots, \theta_n, s_1, \ldots, s_n\}\). We will now define a prior over the desired MED solution and then discuss how to solve for the optimal projection. The prior will reflect some regularization on the linear SVM parameters as well as the degree of feature selection we would like to enforce overall. In other words, we would like to specify (in coarse terms) how many feature switches will be set to zero or remain active. One possible prior for the solution is:
\[
P_0(\Theta) = P_{0,\theta_0}(\theta_0)\; P_{0,\theta}(\theta) \prod_{i=1}^n P_{0,s}(s_i)
\]
where \(P_{0,\theta_0}\) is an uninformative prior (a zero mean Gaussian prior with infinite variance³), \(P_{0,\theta}(\theta) = \mathcal{N}(0, I)\) is the usual white Gaussian prior, and
\[
P_{0,s}(s_i) = \rho^{s_i} (1 - \rho)^{1 - s_i}
\]
where ρ controls the overall prior probability of including a feature. Thus the prior over each feature is merely a Bernoulli distribution. The user selects ρ; a setting of ρ = 1 reproduces the original linear classifier problem without feature selection. By decreasing ρ, more features are removed. Given a prior distribution over the parameters in the MED formalism and a discriminant function, we can now readily compute the partition function (cf. Equation 3.8). Solving the integrals and summations, we obtain the following objective function:
\[
J(\lambda) = \sum_t \left[ \lambda_t + \log(1 - \lambda_t/c) \right] - \sum_{i=1}^n \log \left[ 1 - \rho + \rho\, e^{\frac{1}{2} \left( \sum_t \lambda_t y_t X_{t,i} \right)^2} \right]
\]

³Alternatively, we can use a finite-variance Gaussian, which gives a quadratic penalty term of \(-0.5\,\sigma (\sum_t \lambda_t y_t)^2\) on the final objective function instead of the hard equality constraint.
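The feature term of this objective induces a soft gating of each dimension's weight; a sketch of the resulting effective linear coefficients (helper and variable names are ours):

```python
import numpy as np

def effective_coefficients(lam, y, X, rho):
    """Per-dimension weights w_i = sum_t lam_t y_t X_{t,i}, each shrunk by
    its posterior inclusion gate; rho = 1 recovers the plain linear
    (SVM-like) solution with no feature selection."""
    w = (lam * y) @ X
    gate = rho / (rho + (1.0 - rho) * np.exp(-0.5 * w**2))
    return gate * w
```

Smaller ρ shrinks dimensions with weak evidence toward zero while leaving strongly discriminative dimensions nearly intact, which is the pruning behavior described below.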
Figure 4.5: ROC curves on the splice site problem with feature selection ρ = 0.00001 (solid line) and without ρ = 0.99999 (dashed line).

This is maximized subject to \(\sum_t \lambda_t y_t = 0\) to obtain the optimal setting of our Lagrange multipliers. Given that setting, our linear classifier merely becomes:
\[
L(X) = \sum_i \left( \frac{\rho \sum_t \lambda_t y_t X_{t,i}}{\rho + (1 - \rho) \exp\left( -\frac{1}{2} \left[ \sum_t \lambda_t y_t X_{t,i} \right]^2 \right)} \right) X_i + b
\]
In the above, \(X_{t,i}\) indicates the i'th dimension of the t'th training set vector. The bias b is estimated separately, either from the Kuhn-Tucker conditions or (under a finite variance bias prior) set to \(b = \sigma \sum_t \lambda_t y_t\). The terms in the large parentheses above are the linear coefficients of the new model and can be denoted \(C_i\).

Experiments

To test the linear feature selection method, we used a DNA splice site classification problem which must distinguish between true and spurious splice sites based on a DNA sequence. The examples were fixed length DNA sequences (length 25) which were binary encoded (using a 4-bit translation of {A, C, T, G}) into a 100-element vector. The training set consisted of 500 examples and the test set contained 4724 examples. Results are depicted in Figure 4.5, which shows superior classification accuracy when feature selection is used (as opposed to no feature selection, which is roughly equivalent to an SVM). The feature selection process drives many of the linear model's coefficients to zero in an aggressive pruning manner. This provides better generalization as well as more efficiency at run-time. To picture the sparsity of the resulting model, we plot the cumulative distribution function of the magnitudes of the resulting coefficients \(|C_i| < x\) as a function of x for all 100 components of the linear classification vector.
Figure 4.6 indicates that most of the weights resulting from the feature selection algorithm are indeed small enough to be neglected. While our derivation of the above feature selection was so far only performed for linear models, we can mimic a kernel-based nonlinear classifier by mapping the feature vectors explicitly into a higher-order representation (i.e. through polynomial expansions). This does not retain the efficiency of implicit kernel mappings (and infinite kernel mappings are infeasible); however, we gain the ability to do fine-scale feature selection since individual components of the kernel mapping can also be extinguished. The complexity of the feature selection algorithm is linear in the number of features, so we can easily consider small expansions (such as quadratic or cubic polynomials) by explicit mapping. The above problem was attempted with a quadratic expansion of the 100-dimensional feature vectors

Figure 4.6: Cumulative distribution functions for the resulting effective linear coefficients with feature selection (solid line) and without (dashed line).

Figure 4.7: ROC curves (true positives vs. false positives) corresponding to a quadratic expansion of the features with feature selection ρ = 0.00001 (solid line) and without ρ = 0.99999 (dashed line).

by concatenating the outer products of the original features to form a resulting ≈ 5000-dimensional feature space. Figure 4.7 shows that feature selection is still helpful (compared to a plain linear classifier), improving performance even when we have a larger, expanded feature space. In another experiment, we used the feature selection classification to label protein chains (from the UCI repository) which were valid splice sites into one of two classes: intron-exon (ie) or exon-intron (ei). These are often also called donor and acceptor sites respectively.
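One way to realize the explicit quadratic mapping is to concatenate the original features with the upper triangle of their outer product (keeping squares and one copy of each cross term; the exact bookkeeping is an implementation choice, but this variant reproduces the dimensionalities quoted in the text: 100 → 5150 ≈ 5000, and 13 → 104 for the Boston data below).

```python
import numpy as np

def quadratic_expansion(x):
    """Explicit 2nd-order feature map: original features followed by all
    upper-triangular entries of the outer product x x^T (squares included)."""
    i, j = np.triu_indices(x.size)
    return np.concatenate([x, np.outer(x, x)[i, j]])
```

Feature selection can then extinguish individual cross terms, which an implicit kernel cannot do.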
The chains consist of 60 base-pairs which were represented in binary code: A=(1000), C=(0100), G=(0010) and T=(0001). Uncertain base-pairs were represented as a mixed version, i.e. "A or C" would be represented as (0.5 0.5 0 0). Thus, we have 240 scalar input dimensions and a binary output class. We trained on 200 exemplars and tested on the remaining 1335 exemplars in the testing set. Figure 4.8 depicts the performance. In training, linear classifiers can easily separate both classes at 100% accuracy; however, using all the features causes over-fitting. The regularization brought about by varying c does not prune away features but rather ignores outliers. However, not the whole length of the protein chain is useful in determining acceptor/donor status, so we should ignore dimensions instead of data exemplars. Thus, the best performance possible with a regular SVM (no feature selection, i.e. ρ = 1) remains around 92% on testing as we vary the regularization c. Meanwhile, any small amount of feature selection quickly improves performance much more significantly. The experiments indicate that a level of ρ = 1e−2 or ρ = 1e−3 obtains the greatest generalization accuracy of about 96%. Error is halved from the SVM's count of 100+ errors to an error count of 50 with feature selection. Figure 4.9 depicts the linear model for the SVM as well as the pruned model for the feature selection technique.

Figure 4.8: Varying Regularization and Feature Selection Levels for Acceptor/Donor Protein Classification. Testing performance (classification accuracy vs. regularization constant c) on 1335 unseen protein sequences, for ρ = 1 (SVM), 1e−1, 1e−2, 1e−3, 1e−4 and 1e−5. The dashed line indicates SVM performance while the solid lines indicate varying performance improvements due to feature selection.
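The encoding described above is easy to make concrete. A minimal sketch (the thesis's actual preprocessing code is not given, so the function name and input convention, one string of possible bases per position, are assumptions):

```python
def encode_sequence(seq):
    """Encode a DNA sequence with A=(1000), C=(0100), G=(0010), T=(0001);
    an ambiguous position such as 'AC' becomes the mixed code (0.5,0.5,0,0)."""
    index = {'A': 0, 'C': 1, 'G': 2, 'T': 3}
    out = []
    for symbols in seq:            # each position: string of possible bases
        v = [0.0, 0.0, 0.0, 0.0]
        for s in symbols:
            v[index[s]] += 1.0 / len(symbols)
        out.extend(v)
    return out
```

A 60-base-pair chain thus maps to the 240-dimensional input vector used in the experiment.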
Optimal feature selection levels for this problem appear to be between ρ = 1e−2 and ρ = 1e−3.

Figure 4.9: Sparsification of the Linear Model (coefficient magnitude vs. linear parameter #). On the left are the parameters for the SVM's linear model while on the right are the parameters for the feature selection technique's linear model. Note the sparsification in the parameters as many are set to 0 on the right. This pruning encourages better generalization.

4.2.2 Feature Selection in Regression

Feature selection can also be advantageous in the regression case, where a map is learned from inputs to scalar outputs. Since some input features might be irrelevant (especially after a kernel expansion), we again employ an aggressive pruning approach by adding a "switch" (s_i) on the parameters as before. The prior is given by

\[ P_0(s_i) = \rho^{s_i}(1-\rho)^{1-s_i} \]

where lower values of ρ encourage further sparsification. This prior is in addition to the Gaussian prior on the parameters (Θ_i), which does not have quite the same sparsification properties. The previous derivation for feature selection can also be applied in a regression context. The same priors are used except that the prior over margins is swapped with the one in Equation 4.1. Also, we shall include the estimation of the bias in this case, where we have a Gaussian prior: P_0(b) = N(0, σ). This replaces the hard constraint that \(\sum_t \lambda_t = \sum_t \lambda'_t\) with a soft quadratic penalty, making computations simpler.
After some straightforward algebraic manipulations, we obtain the following form for the objective function:

\[
\begin{aligned}
J(\lambda) =\ & \sum_t y_t(\lambda'_t - \lambda_t) - \epsilon \sum_t (\lambda_t + \lambda'_t) - \frac{1}{2}\sigma\Big(\sum_t \lambda_t - \lambda'_t\Big)^2 \\
& + \sum_t \left[ \log(\lambda_t) - \log\!\Big(1 - e^{-\lambda_t} + \frac{\lambda_t}{c - \lambda_t}\Big) \right] + \sum_t \left[ \log(\lambda'_t) - \log\!\Big(1 - e^{-\lambda'_t} + \frac{\lambda'_t}{c - \lambda'_t}\Big) \right] \\
& - \sum_i \log\!\left[ 1 - \rho + \rho\, e^{\frac{1}{2}\left(\sum_t (\lambda_t - \lambda'_t) X_{t,i}\right)^2} \right]
\end{aligned}
\]

This objective function is optimized over (λ_t, λ'_t) and by concavity has a unique maximum. The optimization over Lagrange multipliers controls the optimization of the densities over the model parameter settings P(Θ) as well as the switch settings P(s). Thus, there is a joint discriminative optimization over feature selection and parameter settings. At the optimal setting of the Lagrange multipliers, our resulting MED regression function is then:

\[ L(X_{new}) = \sum_i \left[ \frac{\rho \sum_t (\lambda'_t - \lambda_t) X_{t,i}}{\rho + (1-\rho)\exp\!\left(-\frac{1}{2}\left[\sum_t (\lambda'_t - \lambda_t) X_{t,i}\right]^2\right)} \right] X_{new,i} + b \]

where the bias b is given by b = σ Σ_t (λ'_t − λ_t).

Experiments

Below, we evaluate the feature selection based regression (or Support Feature Machine) on a popular benchmark dataset, the 'Boston housing' problem from the UCI repository. A total of 13 features (all treated as continuous) are given to predict a scalar output (the median value of owner-occupied homes in thousands of dollars). To evaluate the dataset, we utilized both a linear regression and a 2nd-order polynomial regression by applying a kernel expansion to the input. The dataset is split into 481 training samples and 25 testing samples (as in [188]). Table 4.1 indicates that feature selection (decreasing ρ) generally improves the discriminative power of the regression. Here, the ε-sensitive linear loss (typical in the SVM literature) shows improvements with further feature selection. Just as sparseness in the number of vectors helps generalization, sparseness in the number of features is advantageous as well. Here, there is a total of 104 input features after the 2nd-order polynomial kernel expansion.
However, not all have the same discriminative power, and pruning is beneficial. For the 3 trial settings of the sparsification level prior (ρ = 0.99999, ρ = 0.001, ρ = 0.00001), we again analyze the cumulative density function of the resulting linear coefficients |C_i| < x as a function of x based on the features from an explicit kernel expansion.

Linear Model Estimator    ε-sensitive linear loss
Least-Squares Fit         1.7584
MED ρ = 0.99999           1.7529
MED ρ = 0.1               1.6894
MED ρ = 0.001             1.5377
MED ρ = 0.00001           1.4808

Table 4.1: Prediction Test Results on Boston Housing Data. Note, due to data rescaling, only the relative quantities here are meaningful.

Figure 4.10: Cumulative distribution functions for the linear regression coefficients under various levels of sparsification. Dashed line: ρ = 0.99999, dotted line: ρ = 0.001 and solid line: ρ = 0.00001.

Figure 4.10 clearly indicates that the magnitudes of the coefficients are reduced as the sparsification prior is increased. The MED regression was also used to predict gene expression levels using data from "Systematic variation in gene expression in human cancer cell lines" by D. Ross et al. Here, log-ratios (log(RAT2n)) of gene expression levels were to be predicted for a renal cancer cell-line from measurements of each gene's expression levels across different cell-lines and cancer-types. The input data forms a 67-dimensional vector while the output is a 1-dimensional scalar gene expression level. The training set size was limited to 50 examples and testing was performed over 3951 examples. Table 4.2 summarizes the results. Here, ε = 0.2 was used along with c = 10 for the MED approach. This indicates that feature selection is particularly helpful in sparse training situations.

4.2.3 Feature Selection in Generative Models

As mentioned earlier, the MED framework is not restricted to discriminant functions that are linear or non-probabilistic.
For instance, we can consider the use of feature selection in a generative model-based classifier. One simple case is the discriminant formed from the ratio of two identity-covariance Gaussians. The parameters Θ = (µ, ν) are the means of the y = +1 and y = −1 classes respectively and the discriminant is L(X; Θ) = log N(µ, I) − log N(ν, I) + b. As before, we insert switches (s_i and r_i) to turn off certain components of each of the Gaussians, giving us:

\[ L(X;\Theta) = \sum_i s_i (X_i - \mu_i)^2 - \sum_i r_i (X_i - \nu_i)^2 + b \]

This discriminant then uses similar priors to the ones previously introduced for feature selection in a linear classifier. It is straightforward to integrate (and sum over the discrete s_i and r_i) with these priors (shown below and in Equation 3.9) to get an analytic concave objective function J(λ):

\[ P_0(\mu) = N(0, I) \qquad P_0(\nu) = N(0, I) \qquad P_0(s_i) = \rho^{s_i}(1-\rho)^{1-s_i} \qquad P_0(r_i) = \rho^{r_i}(1-\rho)^{1-r_i} \]

In short, jointly optimizing the feature selection and the means of these generative models will produce degenerate Gaussians which are of smaller dimensionality than the original feature space. Such a feature selection process could in principle be applied to many density models, but the computations may require mean-field or other approximations to become tractable.

Linear Model Estimator    ε-sensitive linear loss
Least-Squares Fit         3.609e+03
MED ρ = 0.00001           1.6734e+03

Table 4.2: Prediction Test Results on Gene Expression Level Data.

4.3 Transduction

In this section, we provide a Maximum Entropy Discrimination framework for solving the missing labels or transduction problem [197] [100]. In many classification problems, labeled data is scarce yet unlabeled data may be easily available in large quantities. The MED framework can be easily extended to utilize the unlabeled data in forming a discriminative classifier by integrating over the unobserved labels.
In previous work, we initially presented the MED approach for transductive classification using primarily mean-field approximations [85]. Szummer [182] also presents an alternative transduction approach in terms of kernel expansions which may also be cast in an MED formalism. In the classification setting, the exact solution of the resulting MED projection becomes intractable (just as in SVM-based transduction). We first review our mean-field approximation as a possible local solution [85]. A global information-projection solution is also possible if the prior over unobserved labels is described by a distribution that is conjugate (and continuous) to the original distribution over models. We thus also provide a transduction algorithm which computes a global large-margin solution over both labeled and unlabeled data by forcing the prior to be conjugate. We subsequently discuss the use of unlabeled data in the regression scenario, which does yield a tractable global MED solution.

4.3.1 Transductive Classification

SVM transduction requires a search over the binary labels of the unlabeled exemplars, and the complexity of this approach grows exponentially with the number of unlabeled examples. Joachims proposes efficient heuristics which approximate this process yet are not guaranteed to converge to the true SVM transduction solution. Unlike SVMs, the MED approach permits a probabilistic treatment of the search over labels which is somewhat similar in spirit to relaxation methods: the discrete search problem is embedded in a continuous probabilistic setting. First, recall that MED solves for distributions over parameters as well as other unknown quantities by augmenting the solution space. For example, when margins were unknown in a non-separable problem, we introduced them into the solution as posteriors P(Θ, γ) (and into the prior as well). When the feature selection structure was unknown, it too was cascaded into the final MED posterior
solution as P(Θ, γ, s). In the transduction case, unlabeled examples are given where y_t is unknown. Thus, we can hypothesize a prior/posterior distribution over the labels as well. This distribution would ideally take on the form of two delta functions at −1 and +1. Thus, instead of solving for only a distribution P(Θ, γ), we generalize to the non-separable transductive case via P(Θ, γ, y), where the projection is now over a larger space. We have the following general solution:

\[ P(\Theta, \gamma, y) = \frac{1}{Z(\lambda)}\, P_0(\Theta, \gamma, y)\, e^{\sum_t \lambda_t [y_t L(X_t|\Theta) - \gamma_t]} \]

where Z(λ) is the normalization constant (partition function) and λ = {λ_1, . . . , λ_T} is again our set of non-negative Lagrange multipliers, one per classification constraint. The λ are set by finding the unique maximum of the objective function J(λ) = − log Z(λ). A distribution for y is required such that the computation of the partition function Z(λ) remains analytic. If we assume that the prior for the (continuous) unlabeled y is given by the natural choice of a point-wise delta function, i.e. P_0(y_t) = ½δ(y_t, 1) + ½δ(y_t, −1), the integrals above become intractable. To proceed, a mean-field approximation is performed which effectively computes the integral over P(Θ) with the unlabeled P(y) locked at a current estimate and then computes an update on P(y) while P(Θ) is held fixed. This is equivalent to forcing the distribution P(Θ, γ, y) to factorize according to P(Θ, γ)P(y). Details are provided in [85]; this produces good generalization results when unlabeled data is useful for a classification problem. However, the mean-field approximation forces us to accept a local solution which is no longer unique, which may be a problem if our two-stage iterative algorithm is poorly initialized. If, however, we could guarantee an analytic computation of Z(λ) for the non-transductive case, this would yield a log-convex Z(λ).
Thus, one can consider Z(λ) as an exponential family distribution: J(λ) = − log Z(λ) is a concave function of λ for any setting of the y-variables. The y-variables can then be considered as the data of the distribution while the λ are the parameters of the exponential family. Therefore, it is always possible to find a conjugate distribution (in the y variables) such that the integral \(\int_y P_0(y) Z(\lambda, y)\) is analytic. For instance, if Z(λ) is Gaussian (or equivalently J(λ) is quadratic, as in an SVM) we can choose a conjugate distribution P_0(y) which is Gaussian and end up with a final Z(λ) which is still concave and analytic.

Assume that the data set is partitioned into two sets, the labeled and the unlabeled data, with T_l labeled components and T_u unlabeled components. Thus, we can consider our y vector of labels and our λ vector of Lagrange multipliers as being split as follows:

\[ y = \begin{bmatrix} \bar{y} \\ \tilde{y} \end{bmatrix} \qquad \lambda = \begin{bmatrix} \bar{\lambda} \\ \tilde{\lambda} \end{bmatrix} \]

The ȳ labels are known; however, we do not know the ỹ labels and only have a prior distribution over them. This prior is a scaled zero-mean spherical Gaussian, P_0(ỹ) = N(0, κ^{−2} I). This can also be interpreted as a prior over each individual unlabeled data point, P_0(ỹ) = Π_t P_0(ỹ_t) = Π_t N(0, κ^{−2}). We can also consider the y and λ vectors in diagonal matrix form, as Y = diag(y) and Λ = diag(λ) respectively:

\[ Y = \begin{bmatrix} \bar{Y} & 0 \\ 0 & \tilde{Y} \end{bmatrix} \qquad \Lambda = \begin{bmatrix} \bar{\Lambda} & 0 \\ 0 & \tilde{\Lambda} \end{bmatrix} \]

Similarly, we can consider the data matrix X as a matrix with the X_t data vectors arranged as columns. It can be further divided into labeled and unlabeled vectors as follows:

\[ X = \begin{bmatrix} \bar{X} & \tilde{X} \end{bmatrix} \]

For simplicity, we will derive the transduction assuming a linear classifier and drop the bias term (i.e. Θ = θ). This limits the linear decision boundaries that can be generated to those that intersect the origin. The restriction can be circumvented by concatenating a constant scalar to our input features X.
However, the formulation we show here readily admits kernels and can also be augmented with the bias term with a little extra derivation. Thus, our classifier's discriminant function is:

\[ L(X;\Theta) = \theta^T X \]

Let us derive the corresponding partition function (up to a constant scalar factor):

\[
\begin{aligned}
Z(\lambda) &= \int_{\tilde{y}} \int_{\theta} \int_{\gamma} P_0(\theta, \gamma, \tilde{y})\, e^{\sum_t \lambda_t [y_t L(X_t|\theta) - \gamma_t]} \\
&= \int_{\tilde{y}} \int_{\theta} P_0(\theta) P_0(\tilde{y})\, e^{\sum_t \lambda_t y_t \theta^T X_t} \times \int_{\gamma} P_0(\gamma)\, e^{-\sum_t \lambda_t \gamma_t} \\
&= Z_\theta(\lambda) \times Z_\gamma(\lambda)
\end{aligned}
\]

We may also deal directly with our objective function J(λ) = − log Z(λ), in which case we have the following decomposition of the objective:

\[ J(\lambda) = J_\gamma(\lambda) + J_\theta(\lambda) \]

Solving, we ultimately obtain the standard J_γ(λ):

\[ J_\gamma(\lambda) = \sum_{\forall t} \lambda_t + \log(1 - \lambda_t/c) \]

The remaining component of the partition function, J_θ(λ), is given by:

\[ J_\theta(\lambda) = \frac{1}{2} \log\left|\kappa^2 I - (\tilde{\Lambda}\tilde{X})^T(\tilde{\Lambda}\tilde{X})\right| - \frac{1}{2} \bar{\lambda}^T \left[ (\bar{X}\bar{Y})^T(\bar{X}\bar{Y}) + (\bar{X}\bar{Y})^T(\tilde{X}\tilde{\Lambda}) \left( \kappa^2 I - (\tilde{X}\tilde{\Lambda})^T(\tilde{X}\tilde{\Lambda}) \right)^{-1} (\tilde{X}\tilde{\Lambda})^T(\bar{X}\bar{Y}) \right] \bar{\lambda} \]

This provides our overall solution for the objective function. It is an analytic concave function with a single maximum that can be uniquely determined to obtain the answer P(Θ, γ, y).

Theorem 6 Assuming a discriminant function of the form L(X; Θ) = θ^T X and given a prior over parameters and margins P_0(Θ, γ) = P_0(θ)P_0(γ) where P_0(θ) ∼ N(0, I), P_0(ỹ) ∼ N(0, κ^{−2} I) and P_0(γ) is given by P_0(γ_t) as in Equation 3.9, then the MED Lagrange multipliers λ are obtained by maximizing J(λ) subject to 0 ≤ λ_t ≤ c:

\[ J(\lambda) = \sum_{\forall t} \lambda_t + \log(1 - \lambda_t/c) + \frac{1}{2} \log\left|\kappa^2 I - (\tilde{\Lambda}\tilde{X})^T(\tilde{\Lambda}\tilde{X})\right| - \frac{1}{2} \bar{\lambda}^T \left[ (\bar{X}\bar{Y})^T(\bar{X}\bar{Y}) + (\bar{X}\bar{Y})^T(\tilde{X}\tilde{\Lambda}) \left( \kappa^2 I - (\tilde{X}\tilde{\Lambda})^T(\tilde{X}\tilde{\Lambda}) \right)^{-1} (\tilde{X}\tilde{\Lambda})^T(\bar{X}\bar{Y}) \right] \bar{\lambda} \]

It is interesting to note that in the case of no missing data, the above objective function simplifies back to the regular fully-labeled SVM case. The above objective function can be maximized via axis-parallel techniques. It is also important to use various matrix identities (some by Kailath [108]
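As a sanity check on Theorem 6, the objective can be evaluated directly with dense linear algebra. The sketch below (hypothetical function and variable names; rows of Xl/Xu hold examples rather than columns) drops the λ-independent constant and, when the unlabeled multipliers are all zero, reduces to the standard SVM-style objective plus a constant log-determinant:

```python
import numpy as np

def transductive_objective(lam_l, lam_u, Xl, yl, Xu, c, kappa):
    """J(lambda) from Theorem 6, up to an additive constant."""
    A = yl[:, None] * Xl                  # rows: y_t X_t^T, i.e. (Xbar Ybar)^T
    B = lam_u[:, None] * Xu               # rows: lam_t X_t^T, i.e. (Xtilde Lamtilde)^T
    G = kappa ** 2 * np.eye(len(lam_u)) - B @ B.T
    sign, logdet = np.linalg.slogdet(G)
    assert sign > 0, "kappa too small: G must remain positive definite"
    lam = np.concatenate([lam_l, lam_u])
    margin = np.sum(lam + np.log(1.0 - lam / c))
    M = A @ A.T + A @ B.T @ np.linalg.solve(G, B @ A.T)
    return margin + 0.5 * logdet - 0.5 * lam_l @ M @ lam_l
```

The positive-definiteness requirement on G makes the barrier role of κ explicit: a small κ forces the unlabeled multipliers toward zero.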
and some matrix partitioning techniques [121]) to make the optimization efficient. This optimization gives us the desired λ values to specify the distribution P(Θ, γ, y). The constrained optimization problem can be solved in an axis-parallel manner (similar to Platt's SMO algorithm): we modify a single λ_t value at a time. Since there are no joint constraints on the λ vector, this step-wise optimization is feasible without exiting the convex hull of constraints. We derived an efficient update rule for any chosen λ_t that guarantees improvement of the objective function. The update differs depending on whether the chosen λ_t corresponds to a labeled or an unlabeled example. It is also particularly efficient to iterate within the set of labeled and then the set of unlabeled Lagrange multipliers individually. This is because we store running versions of the large unlabeled data matrix G and its inverse, where:

\[ G = \kappa^2 I - (\tilde{X}\tilde{\Lambda})^T(\tilde{X}\tilde{\Lambda}) \]

There are a number of ways that we can now use the current setting of the Lagrange multipliers to compute the labels of the unlabeled data. One way is to find the distribution over Θ, i.e. P(Θ), for the current setting of λ and integrate over it to obtain the classification. We now derive the computation of P(θ):

\[ P(\theta) = \int_{\tilde{y}} \int_{\gamma} \frac{1}{Z(\lambda)}\, P_0(\Theta, \gamma, \tilde{y})\, e^{\sum_t \lambda_t [y_t L(X_t|\Theta) - \gamma_t]} \]

\[ P(\theta) \propto \exp\left\{ -\frac{1}{2}\, \theta^T \left[ I - \frac{1}{\kappa^2} (\tilde{X}\tilde{\Lambda})(\tilde{X}\tilde{\Lambda})^T \right] \theta + \sum_t \bar{\lambda}_t \bar{y}_t \bar{X}_t^T \theta \right\} \]

Thus, we have P(θ) ∼ N(µ_θ, Σ_θ). To obtain the classifier, we merely need the mean of the resulting Gaussian distribution over θ.
Since P(θ) ∼ N(µ_θ, Σ_θ), we have the following for our classifier:

\[ y_{new} = \mathrm{sign} \int_\theta P(\theta)\, L(X_{new}|\theta) = \mathrm{sign}\left( \mu_\theta^T X_{new} \right) \]

More specifically, the mean µ_θ is given by the formula below (and the simplifications that follow):

\[ \mu_\theta = \left[ I - \frac{1}{\kappa^2} (\tilde{X}\tilde{\Lambda})(\tilde{X}\tilde{\Lambda})^T \right]^{-1} \sum_t \bar{\lambda}_t \bar{y}_t \bar{X}_t \]

\[ \mu_\theta^T X_{new} = \sum_t \bar{\lambda}_t \bar{y}_t\, \bar{X}_t^T X_{new} + \sum_{t'} m_{t'}\, \tilde{X}_{t'}^T X_{new} \]

where we have the definition m = κ² ȳᵀ(Λ̄ X̄ᵀ X̃ Λ̃ G^{−1} Λ̃). This vector effectively defines the linear decision boundary. It is important to choose κ large enough that the unlabeled data influences the estimation strongly (small values of κ cause vanishing unlabeled Lagrange multipliers, lock the unlabeled label estimates at 0 and effectively reduce the method to a standard SVM). Since all input vectors appear only within inner product computations, the formulation readily accommodates kernels as well.

4.3.2 Transductive Regression

The previous assumption of a Gaussian over unlabeled data is actually much more reasonable in the regression case. In the previous section, we showed how unlabeled exemplars in a binary (±1) classification problem can be dealt with by integrating over them with a Gaussian prior. However, the Gaussian is a continuous distribution and does not match the discrete nature of the classification labels. In regression, the outputs are scalars and are therefore much better suited to a continuous Gaussian prior assumption. If the scalar outputs do not obey a Gaussian distribution, we may consider transforming them (via their respective cumulative density functions) such that they are Gaussian. We may also consider using other continuous distributions as priors. However, the Gaussian has advantages since it has the conjugate form necessary to integrate over an MED linear regression problem (which results in a quadratic log-partition function in the non-transductive case).
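The posterior mean can also be evaluated directly in the primal, without the kernelized m-vector simplification. A minimal sketch (illustrative names; rows of Xl/Xu are examples):

```python
import numpy as np

def transductive_mean(lam_l, lam_u, Xl, yl, Xu, kappa):
    """Mean of P(theta) ~ N(mu, Sigma); predictions are sign(mu @ x_new)."""
    d = Xl.shape[1]
    C = Xu.T * lam_u                      # d x Tu matrix  (Xtilde Lamtilde)
    Sigma_inv = np.eye(d) - (C @ C.T) / kappa ** 2
    a = Xl.T @ (lam_l * yl)               # sum_t lam_t y_t X_t
    return np.linalg.solve(Sigma_inv, a)
```

With all unlabeled multipliers at zero, the mean reduces to the usual SVM-style weight vector Σ_t λ_t y_t X_t, matching the observation that small κ collapses the method to a standard SVM.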
This guarantees that we will maintain a closed-form partition function in the transductive case. Why would we wish to use unlabeled data in a regression scenario and when is it advantageous? The basic motivation is that transductive regression should focus the model such that its predictions on unlabeled data are similarly distributed to its predictions on the labeled data. In other words, when we extrapolate to new test regions in the unlabeled data, the regression function should not diverge and exhibit unusual behavior; it should produce outputs that are similar to those it generates over the labeled data. This is illustrated in the following example where we fit a noisy sinusoidal data set with a high-order polynomial function. Consider Figure 4.11. In the standard regression scenario of Figure 4.11(a), fitting a polynomial to the sin(x) function without transduction generates a good regression on the labeled data (blue dots) yet it diverges sharply on the unlabeled data (green circles) and produces predictions that are far from the typical range of [−1, 1]. If we instead require that the outputs on the unlabeled data obey a similar distribution, these will probably stay within [−1, 1], generate similarly distributed output, and produce a better regression fit. This is illustrated in Figure 4.11(b), where the polynomial fit must obey the output distribution even when we extrapolate to the unlabeled data (at x > 10). It is important, however, not to go too far and have the regression function follow the prior on the unlabeled data too closely, compromising the labeled data fit as well as the natural regularization properties on the parameters. Therefore, as usual in MED with a multi-variable prior distribution, it is important to balance between the different priors.

(a) Polynomial Regression (b) Transductive Polynomial Regression

Figure 4.11: Transductive Regression vs.
Labeled Regression Illustration. Here, we are attempting to fit the sin(x) function with a high-order polynomial. Labeled data is shown as blue dots while unlabeled data are shown as green circles on the x-axis. In (a) the unlabeled data are not used and we merely fit a regression function (red line) which unfortunately diverges sharply away from the desired function when it is over the unlabeled data. In (b), the polynomial must maintain a similar distribution of outputs (roughly within [−1, 1]) over the unlabeled exemplars and therefore produces a more reasonable regression function.

We begin the development from a regular (non-transductive) regression setting. A support vector machine is typically cast as a large-margin regression problem using a linear discrimination function and an epsilon-tube of insensitivity with linear loss. Given input data as high-dimensional vectors X_1, . . . , X_T and corresponding scalar labels y_1, . . . , y_T, we wish to find a linear regressor that lies within ε of each output. The regressor is again the same discriminant function, L(X; Θ) = θ^T X + b. Recall the objective function for the regression case in Theorem 5. If we assume, instead of a non-informative prior on P_0(b), a zero-mean Gaussian prior with covariance σ, we obtain the following slightly modified objective function (which must be optimized subject to 0 ≤ λ_t ≤ c and 0 ≤ λ'_t ≤ c):

\[
\begin{aligned}
J(\lambda) =\ & \sum_t y_t(\lambda'_t - \lambda_t) - \epsilon \sum_t (\lambda_t + \lambda'_t) - \frac{1}{2}\sigma\Big(\sum_t \lambda'_t - \lambda_t\Big)^2 \\
& + \sum_t \left[ \log(\lambda_t) - \log\!\Big(1 - e^{-\lambda_t} + \frac{\lambda_t}{c - \lambda_t}\Big) \right] + \sum_t \left[ \log(\lambda'_t) - \log\!\Big(1 - e^{-\lambda'_t} + \frac{\lambda'_t}{c - \lambda'_t}\Big) \right] \\
& - \frac{1}{2} \sum_{t,t'} (\lambda'_t - \lambda_t)(\lambda'_{t'} - \lambda_{t'})(X_t^T X_{t'})
\end{aligned}
\]

In the case of unlabeled data, we do not know some particular y_t values and must introduce a prior over these and integrate it out to obtain the partition function. The prior we shall use over the unobserved y_t is a white Gaussian prior. This modifies the above optimization function as follows.
Observe the component of J(λ) that depends on a given y_t:

\[ J(\lambda) = \ldots + y_t(\lambda'_t - \lambda_t) + \ldots \]

Going back to the partition-function representation of that component, we have:

\[ Z(\lambda) = \ldots \times \exp\left( -y_t(\lambda'_t - \lambda_t) \right) \times \ldots \]

If the y_t value above is unknown, we can integrate over it with a Gaussian distribution as a prior, i.e. P_0(y_t) ∼ N(0, 1)⁴. The Gaussian prior gives rise to the following computation:

\[ Z(\lambda) = \ldots \times \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} y_t^2 \right) \exp\left( -y_t(\lambda'_t - \lambda_t) \right) dy_t \times \ldots = \ldots \times \exp\left( \frac{1}{2}(\lambda'_t - \lambda_t)^2 \right) \times \ldots \]

Ultimately, our updated transductive J(λ) is modified as follows for the unlabeled data exemplars:

\[ J(\lambda) = \ldots - \frac{1}{2}(\lambda'_t - \lambda_t)^2 + \ldots \]

Therefore, for the transductive regression case, we obtain the following overall objective function:

\[
\begin{aligned}
J(\lambda) =\ & \sum_{t \in labeled} y_t(\lambda'_t - \lambda_t) - \frac{1}{2} \sum_{t \in unlabeled} (\lambda'_t - \lambda_t)^2 - \epsilon \sum_t (\lambda_t + \lambda'_t) - \frac{1}{2}\sigma\Big(\sum_t \lambda'_t - \lambda_t\Big)^2 \\
& + \sum_t \left[ \log(\lambda_t) - \log\!\Big(1 - e^{-\lambda_t} + \frac{\lambda_t}{c - \lambda_t}\Big) \right] + \sum_t \left[ \log(\lambda'_t) - \log\!\Big(1 - e^{-\lambda'_t} + \frac{\lambda'_t}{c - \lambda'_t}\Big) \right] \\
& - \frac{1}{2} \sum_{t,t'} (\lambda'_t - \lambda_t)(\lambda'_{t'} - \lambda_{t'})(X_t^T X_{t'})
\end{aligned}
\]

The final P(Θ) computation is straightforward once we have the optimal setting of λ from maximizing J(λ). This effectively generates a simple linear regression model which takes into account the unlabeled data. In practice, the y_t values do not have a white Gaussian distribution, so we transform them into white Gaussian variables (via standard histogram fitting techniques or just a whitening affine correction) and then solve the MED regression. The transformation is then inverted to obtain y_t values appropriate for the original problem.

⁴A uniform prior for P_0(y_t) is also feasible.

Figure 4.12 depicts results on the Ailerons data set (by R. Camacho), which addresses a control problem for flying an F16 aircraft. The inputs are 40 continuous attributes that describe the status of the airplane (i.e. pitch, roll, climb-rate) while the output is the control action for the ailerons of the F16.
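The integration step above rests on the standard Gaussian moment-generating identity E_{y∼N(0,1)}[exp(−ay)] = exp(a²/2), applied with a = λ'_t − λ_t. A quick numerical verification (the quadrature grid parameters are arbitrary choices, not from the thesis):

```python
import numpy as np

def expected_exp_linear(a, n=200001, lim=12.0):
    """Quadrature check of E_{y~N(0,1)}[exp(-a*y)] = exp(a^2/2)."""
    y = np.linspace(-lim, lim, n)
    f = np.exp(-0.5 * y ** 2) / np.sqrt(2.0 * np.pi) * np.exp(-a * y)
    h = y[1] - y[0]
    return h * (f.sum() - 0.5 * (f[0] + f[-1]))   # composite trapezoid rule
```

Taking −log of the resulting exp(½(λ'_t − λ_t)²) factor is what turns the labeled term y_t(λ'_t − λ_t) into the quadratic unlabeled term in the objective.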
An implicit second-order polynomial (quadratic) kernel was used as the regression model. For the labeled case, we trained on 96 labeled data points (using standard SVM regression). The MED transductive regression used 96 labeled and 904 unlabeled examples for training. Figure 4.12 depicts better regression accuracy for the transductive technique at appropriate levels of regularization (while the non-transductive regression remains somewhat fixed despite varying regularization levels).

Figure 4.12: Transductive Regression vs. Labeled Regression for F16 Flight Control. The plot shows the inverse RMS error for the labeled regression case (dashed red line) and the transductive regression case (solid blue line) at varying c-regularization levels.

It appears that transduction is mostly useful when the labeled data is ambiguous and can cause large errors when extrapolating out of a well-sampled region to new unlabeled test data. The Gaussian prior on unobserved variables effectively constrains the extrapolation caused by over-fitting by preventing unlabeled examples from having extreme outputs. If the unlabeled examples are, however, in the convex hull of the labeled ones, transductive regression is unlikely to be beneficial.

4.4 Other Extensions

In this section we motivate some extensions at a cursory level for completeness. More thorough derivations and results concerning anomaly detection, latent anomaly detection, tree structure learning, invariants, and theoretical concepts can be found in [85] [90] and [128]. The MED framework is not limited to learning continuous model parameters, e.g. Gaussian means and covariances. It can also be used to learn discrete structures. For instance, one may consider using MED to learn both the parameters and the structure of a graphical model.
For instance, Θ may be partitioned into a component that learns the discrete independency structure of the graphical model and a component that learns the continuous parameters of the probability tables. Since MED solves for a continuous distribution over the discrete and continuous model components, its estimation remains straightforward.

Figure 4.13: Tree Structure Estimation.

For example, consider solving for tree structures where a classifier results from the likelihood ratio of two tree distributions. We have a space of dimensionality D and therefore D nodes to connect in each tree. Figure 4.13 shows an example where 5 nodes are to be connected in two different tree structures, one configuration on the left and the other on the right. The resulting discriminant function has the abstract form:

\[ L(X;\Theta) = \log \frac{P(X|\theta_+, E_+)}{P(X|\theta_-, E_-)} + b \]

Here, the model description Θ is composed of the continuous parameters θ± for each tree as well as structure components E± which specify the configuration of edges present between the various nodes. The classification constraints then involve not only an integration but also a summation over discrete structures:

\[ \sum_{E_+} \sum_{E_-} \int_{\theta_+} \int_{\theta_-} \int_{\gamma_t} \int_{b} P(\theta_+, \theta_-, E_+, E_-, \gamma_t, b)\, \left[ y_t L(X_t;\Theta) - \gamma_t \right]\, d\theta_+\, d\theta_-\, d\gamma_t\, db \;\geq\; 0 \quad \forall t \]

Similarly, computation of the partition function Z(λ) requires integration of the exponentiated constraints against the prior P_0(θ_+, θ_-, E_+, E_-, γ, b). Since there is an exponential number of tree structures that could connect D nodes, summing over all E_+ and E_- would appear intractable. However, due to a classical result in graph theory (namely the matrix tree theorem), summing over all possible tree structures of a graph can be done efficiently. This is reminiscent of Section 4.2 where we discussed an alternative form of structure learning.
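The matrix tree theorem mentioned above reduces the exponential sum over tree structures to a single determinant. A minimal sketch of the weighted, undirected version (illustrative code, not from the thesis):

```python
import numpy as np

def spanning_tree_sum(W):
    """Matrix-Tree theorem: the sum over all spanning trees of the product of
    their edge weights equals any cofactor of the weighted graph Laplacian.
    W must be symmetric, nonnegative, with zero diagonal."""
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.det(L[1:, 1:])       # delete row/column 0 (a cofactor)
```

For the unweighted complete graph on 5 nodes this yields 5³ = 125 spanning trees (Cayley's formula) without enumerating any of them, which is exactly what makes the structure summation in Z(λ) tractable.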
There, we also solved for a discrete component of the model, namely feature selection. We similarly had to sum over an exponential number of feature selection configurations. However, in both this problem and the earlier one, embedding the computation into a probabilistic MED setting makes it solvable in an efficient way. Further details on tree structure estimation are omitted here yet are provided in [85] and [128].

4.5 Mixture Models and Latent Variables

We have so far seen many variations of the MED formalism and its flexible application in different scenarios and with different models. One key benefit MED enjoys is the ability to combine generative modeling virtues, such as prior distributions, with discriminative requirements that are typically manifested as constraints. However, we have so far restricted the generative models in MED to simple and typically unimodal distributions. For example, we have shown exponential family classification using, e.g., Gaussian and multinomial models. However, to thoroughly harness the power of generative modeling, we must go beyond such simple models and consider mixture models or latent variable models. These include sigmoid belief networks, latent Bayesian networks, and hidden Markov models, which play a critical role in many applied domains (speech recognition, vision, etc.). These could also be potential clients for the MED formalism and could benefit strongly from a discriminative estimation technique (as opposed to their traditional maximum-likelihood incarnations). Therefore, ideally, we would also like to handle latent mixture models in an MED discriminative setting. However, this type of problem is decidedly more difficult than the cases we have seen so far. Latent models and mixtures give rise to logarithms of sums and negated logarithms of sums.
Therefore, computational difficulties arise and the various MED integrals and optimizations become intractable. In maximum likelihood and Bayesian estimation, these intractabilities are readily addressed by the use of Jensen inequalities to simplify expressions and generate iterative variants of the closed-form computations in the non-latent case. The Expectation-Maximization (EM) algorithm is essentially such a tool, which iteratively solves a simpler version of an intractable latent-variable problem using Jensen lower bounds on the logarithm of a sum. However, the EM algorithm is designed for maximum likelihood estimation and is insufficient for MED discrimination since the latter involves negated logarithms of sums (the negation flips Jensen lower bounds and creates undesirable upper bounds). Therefore, latent discrimination will require novel bounds and a discriminative variant of the EM algorithm. The next chapter will propose such a discriminative counterpart and will develop novel bounding tools which will permit discrimination on latent models. Although the treatment in the next chapter will often focus on maximizing conditional likelihood, the tools that will be developed for CML will also be useful for latent discrimination in MED.

Chapter 5

Latent Discrimination and CEM

Entities should not be multiplied unnecessarily.¹ William of Ockham, 1280-1349

We have discussed several frameworks for optimizing the discriminative power of generative models. These include maximum conditional likelihood, conditional Bayesian inference, and the aforementioned Maximum Entropy Discrimination. All emphasize accuracy on the given task either through margins or a 'background' probability. However, computational problems quickly arise when these are applied to latent models, i.e. mixture models and models with hidden variables. It is these very mixture models which are the workhorses of generative machine learning.
Structures such as mixtures of Gaussians, many Bayesian networks, hidden Markov models, and so forth are actually latent generative models and not just simple exponential family distributions. The latent aspects of such models can prevent them from being mathematically tractable in a discriminative setting. Statistical model estimation and inference often require the maximization, evaluation, and integration of complicated mathematical expressions. One approach for simplifying the computations is to find and manipulate variational upper and lower bounds instead of the expressions themselves. A prominent tool for computing such bounds is Jensen's inequality, which subsumes many information-theoretic bounds (cf. Cover and Thomas 1996 [40]). In maximum likelihood (ML) estimation under incomplete data, Jensen is used to derive an iterative Expectation-Maximization (EM) algorithm [13] [12] [44]. For graphical models, intractable inference and estimation is performed via variational bounds [103]. Bayesian integration also uses Jensen and EM-like bounds to compute integrals that are otherwise intractable [3] [63]. For discriminative frameworks to handle latent models, we need a discriminative version of the EM algorithm and the bounds it uses. It is well known that in generative frameworks (e.g. likelihood) the EM algorithm uses Jensen's inequality to lower bound latent log-likelihoods. The resulting tractable variational bounds can then be integrated or maximized in a straightforward manner. However, unlike their generative counterparts, discriminative frameworks involve both latent log-likelihoods and negated log-likelihoods. The latter are not lower boundable by Jensen's inequality and instead

¹At the risk of misquoting what Ockham truly intended to say, we shall use this quote to motivate the use of bounds on latent likelihoods which, if treated exactly and multiplied exactly, would cause an exponential explosion in the number of terms.
require a type of reverse-Jensen inequality [94]. We derive this inequality for arbitrary mixtures of the exponential family. This includes complex mixtures such as those arising in hidden Markov models. The resulting bounds on the parameters and the mixing proportions then permit straightforward maximization-steps or integration. More specifically, we will show how a mixture model (or incomplete distribution) can be lower bounded (through Jensen's inequality) and also upper bounded (through a reverse-Jensen inequality) by a single complete e-family distribution over parameters. For discriminative learning, the reverse bounds are tight enough to permit efficient estimation that approaches EM computationally yet yields superior classification and regression accuracy. We will call the resulting discriminative EM variant the Conditional Expectation Maximization (CEM) algorithm. This chapter is organized as follows. We first describe the set of probabilistic models we will restrict ourselves to: mixtures of the exponential family. We then argue why latent models cannot be handled tractably in the same manner as exponential families. This is because they typically form intractable expressions when multiplied. A formal treatment of these models exposes why generative (and discriminative) criteria therein become intractable. We then describe the EM algorithm as a way to address this intractability in latent models. EM addresses the intractable estimation problem by iteratively maximizing a lower bound on log-likelihood via Jensen's inequality. Discriminative criteria, however, also require a reverse upper bound. We propose such a bound, the reverse-Jensen inequality, and develop it for a variety of cases. This includes mixtures of Gaussians, mixtures of multinomials, and mixing coefficients. The next chapter considers more sophisticated mixtures and structured models such as hidden Markov models as well as aggregated data set bounds.
A rigorous derivation and proof of the reverse-Jensen bound is provided in Chapter 7, which justifies its use in latent discrimination and CEM.

5.1 The Exponential Family and Mixtures

We will restrict the treatment of latent variables (via Jensen and our reverse-Jensen bounds) to mixtures of the exponential family (e-family) [9] [32]. In practice this class of densities covers a very large portion of contemporary statistical models. Mixtures of the e-family include Gaussian mixture models, multinomials, Poisson, hidden Markov models, sigmoidal belief networks, discrete Bayesian networks, etc. [34]. The e-family (which is closely related to generalized linear models) has the following form:

P(X|Θ) = exp( A(X) + X^T Θ − K(Θ) )

Here, the e-family is shown in its natural parameterization. Many alternative parameterizations exist; however, the natural one will be easiest to manipulate for our purposes (i.e. for computing the reverse-Jensen inequality and performing discriminative estimation). The K(Θ) function is the cumulant generating function [32] and is convex in Θ, the multi-dimensional parameter vector. Typically, the data vector X is constrained to live in the gradient space of K, i.e. X ∈ ∂K(Θ)/∂Θ, or X ∈ K′(Θ) for short. In fact, a duality exists in that the domain of the function A(X) is the gradient space of K(Θ) and vice versa. A more specific property of the exponential family is that the cumulant generating function is not just an arbitrary convex function but is also given by the following Laplace transform. This is directly due to the normalization property of the distribution (which directly generates the convexity of K(Θ)²):

K(Θ) = log ∫_X exp( A(X) + X^T Θ ) dX

The e-family has special properties (i.e. conjugates, convexity, linearity, etc.) [9] [32] [34] [6]. Furthermore, one very important property of this class of distributions is that products of exponential family distributions remain in the family.
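The natural parameterization above can be made concrete with the unit-variance Gaussian. The sketch below (illustrative, not from the thesis) evaluates exp(A(X) + X^T Θ − K(Θ)) with the A and K of Table 5.1 and checks that it coincides with the usual Gaussian density, whose mean equals the natural parameter Θ in this case.

```python
import numpy as np

def efam_gaussian(X, Theta):
    """Unit-variance Gaussian in natural e-family form P = exp(A + X.Theta - K)."""
    D = len(X)
    A = -0.5 * X @ X - 0.5 * D * np.log(2 * np.pi)   # A(X) from Table 5.1
    K = 0.5 * Theta @ Theta                          # K(Theta) from Table 5.1
    return np.exp(A + X @ Theta - K)

def gaussian_pdf(X, mu):
    """Standard identity-covariance Gaussian density for comparison."""
    D = len(X)
    return np.exp(-0.5 * (X - mu) @ (X - mu)) / (2 * np.pi) ** (D / 2)

X = np.array([0.3, -1.2])
mu = np.array([1.0, 0.5])       # here the natural parameter equals the mean
p_efam = efam_gaussian(X, mu)
p_pdf = gaussian_pdf(X, mu)
```

Note also that K′(Θ) = Θ here, so the data indeed lives in the gradient space of K, as stated above.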
These intrinsic properties will be exploited for the derivation of the reverse-Jensen bound in this chapter and the next. The table below lists example A and K functions for the Gaussian, multinomial³ and other distributions.

E-Distribution          A(X)                              K(Θ)                                Constraints
Gaussian (mean)         −(1/2) X^T X − (D/2) log(2π)      (1/2) Θ^T Θ
Gaussian (covariance)   −(1/2) log(2π)                    −(1/2) log(Θ)                       ∀Θ > 0
Multinomial             log(Γ(η + 1)) − log(ν)            η log(1 + Σ_{d=1}^D exp(Θ_d))
Exponential             0                                 − log(−Θ)                           ∀Θ < 0
Gamma                   − exp(X) − X                      log Γ(Θ)                            ∀Θ > 0
Poisson                 − log(X!)                         exp(Θ)

Table 5.1: Sample exponential family distributions.

This crucial property (along with others) makes maximum likelihood estimation of the parameters of an e-family distribution with respect to an iid data set fully tractable, unique and straightforward. This is because log-likelihood remains concave in the parameters for the e-family (and products thereof). It is also straightforward to integrate over an exponential family distribution (with conjugate priors) to obtain a fully Bayesian estimate of its parameters. Thus, it is widely acknowledged that this family enjoys tractable and straightforward estimation properties.

5.1.1 Mixtures of the Exponential Family

Figure 5.1: Graph of a Standard Mixture Model.

A generalization beyond the class of exponential family distributions is to consider mixtures of the e-family. We begin by considering a simple (flat) mixture model (Chapter 6 considers more complex mixture cases). Here, a convex combination of different e-family distributions with different parameters is formed. The mixture is constructed by introducing a latent variable, which is denoted as m here. Thus, we have an incomplete data representation and, since m is unobserved, multiple e-family models are mixed⁴. Thus, we can consider Figure 5.1 as a Bayes net describing the relationship

²We will actually be mildly restricting our derivations to K functions satisfying Lemma 7.4. However, all standard e-family distributions are still spanned.
³The multinomial shown is traditionally cast as in Equation 3.13 subject to the constraint Σ_{d=1}^{D+1} ρ_d = 1. In the table, we map this into exponential family form where η = Σ_{d=1}^{D+1} X_d and ν = Π_{d=1}^{D+1} Γ(X_d + 1). In the standard case where X is a choice vector with all zeros and a single unit entry, the expressions simplify to η = 1 and ν = 1 and then we only have A(X) = 0.
⁴Note we use Θ to denote an aggregate model encompassing all individual Θ_m ∀m.

between the latent variable m, which acts as a parent to the emission variable X. This generates the following distribution:

p(X|Θ) = Σ_m p(m) p(X|m, Θ) = Σ_m α_m exp( A_m(X_m) + X_m^T Θ_m − K_m(Θ_m) )     (5.1)

In the above, we are allowing the X_m to vary with the latent variable, although in a regular mixture model we typically have X_m = X ∀m. This mathematical generalization or extra flexibility is introduced here for convenience later on, when we will need to consider different types of mixtures where X_m will vary with m, i.e. the X_m will be a vector of the same cardinality as X but explicitly computed as a result of some function of both X and m. Furthermore, the α_m are scalars that sum to unity, representing the prior on each model in the mixture. We will assume the α_m are fixed for now, yet this assumption will be loosened in Section 5.6. For now, merely think of the above as a standard mixture model such as a mixture of Gaussians [20]. For example, we may consider the distribution of the weight and height of the males and females of a species, which would form two distinct clusters. Thus, the latent variable m would be binary, acting as the hidden male/female variable which selects between two different Gaussian distributions on, e.g., weight and height. This type of mixture can be called a flat mixture since there are no additional relationships between the clusters or a hierarchy between the hidden variables m.
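The male/female example above can be sketched numerically. The following evaluates Equation 5.1 for a two-component, unit-variance 1D Gaussian mixture over "height"; the means and mixing weights are made up for illustration.

```python
import numpy as np

def mixture_density(x, alphas, mus):
    """p(x|Theta) = sum_m alpha_m N(x; mu_m, 1): a sum of e-family members."""
    comps = np.exp(-0.5 * (x - mus) ** 2) / np.sqrt(2 * np.pi)
    return float(alphas @ comps)

alphas = np.array([0.5, 0.5])     # fixed mixing proportions p(m) = alpha_m
mus = np.array([165.0, 178.0])    # hypothetical cluster means (cm)

p_mode = mixture_density(165.0, alphas, mus)   # at one cluster center
p_mid = mixture_density(171.5, alphas, mus)    # in the valley between clusters
```

The density is bimodal: large at either cluster center and small in between, which is exactly the multi-modality that makes the log of this sum awkward to optimize, as the next section discusses.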
These types of latent distributions arise frequently in many machine learning tasks and include mixtures of Gaussians, mixtures of experts, mixtures of multinomials, and so forth. These latent probability distributions need to be maximized, integrated, marginalized, conditioned, etc. to solve various inference, prediction, and parameter estimation tasks. However, such manipulations can sometimes be quite complex or intractable.

5.2 Mixtures in Product and Logarithmic Space

Why is it that non-exponential family models, such as mixture models or latent models, cannot be treated exactly or in closed form? This is the case for both generative and discriminative frameworks. In MED, just as in maximum likelihood, we can show that any exponential family member can be estimated in a straightforward manner. However, as we consider mixtures of the exponential family, there are many interpretations for how intractabilities arise. We shall begin by illustrating these problems using two different metaphors: 'product space' and 'logarithmic space'. First, consider the likelihood of an independent identically distributed (iid) data set in 'product space':

p({X}|Θ) = Π_{t=1}^T p(X_t|Θ)

If the generative model P(X_t|Θ) for each data point is in the exponential family, the above aggregate likelihood for the whole data set remains computationally tractable. In fact, products of the exponential family are in the exponential family. In other words, the e-family forms a closed set under multiplication (but not under addition, unfortunately). For example, we would obtain the following if we had an e-family distribution:

p({X}|Θ) = Π_{t=1}^T exp( A(X_t) + X_t^T Θ − K(Θ) )
         = exp( Σ_t A(X_t) + (Σ_t X_t)^T Θ − T K(Θ) )

Thus, the aggregate P({X}|Θ) is in the exponential family itself, and it would be just as easy to integrate over it or maximize the likelihood as it would be to work with a single data point P(X_t|Θ).
For example, maximizing the likelihood over the parameter Θ would simply be given by the unique closed-form solution of:

∂K(Θ)/∂Θ = (1/T) Σ_t X_t

However, if the generative model P(X_t|Θ) is not in the exponential family but rather a mixture model or a latent distribution, it must be represented as a sum or marginalization over a (possibly continuous) set of simpler distributions. Unfortunately, summations (unlike products) of exponential family distributions are not in the exponential family. For example, consider the case where we are summing M exponential family distributions to get P(X_t|Θ), in other words Σ_{m=1}^M P(m, X_t|Θ). In this case, there are M possible configurations of the simpler distributions and expanding we get:

p({X}|Θ) = Π_{t=1}^T ( Σ_{m=1}^M p(m, X_t|Θ) )

Here we see quite clearly that if we were to expand the above product over sums we would effectively have a summation over many terms, in fact an exponential number equal to M^T. At this point, integration or other calculations would be solvable as we bring them into the sum and apply them to the simple non-latent distributions in isolation. However, having to deal with a total of M^T terms prevents this approach from being tractable except in toy problems. Thus, exact Bayesian integration (as depicted in Example 2.3.1) would become very cumbersome. It is often the case that we will manipulate the log-likelihood instead of the likelihood itself (as in maximum likelihood estimation). Taking the logarithm will put us in 'logarithmic space' which might deceptively appear to simplify the expressions:

log p({X}|Θ) = log Π_{t=1}^T p(X_t|Θ)                       (5.2)
             = Σ_{t=1}^T log p(X_t|Θ)                       (5.3)
             = Σ_{t=1}^T log ( Σ_{m=1}^M p(m, X_t|Θ) )      (5.4)

Now, we only have a summation over a tractable number of terms, namely T terms in total. However, the presence of the log-sum is equally intractable and prevents direct integration or maximization steps. Thus, non-exponential family distributions (i.e.
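The equivalence between the two views, and the M^T blow-up of the product-space expansion, can be checked on a toy problem. This sketch (with made-up component means) evaluates Equation 5.4 as T log-sum terms and then again by brute-force expansion over all M^T latent configurations; both agree, but only the first scales.

```python
import numpy as np
from itertools import product

def log_joint(m, x, mus):
    """Toy log p(m, x|Theta): equal-weight unit-variance Gaussian components."""
    return -0.5 * (x - mus[m]) ** 2 + np.log(0.5)

mus = np.array([0.0, 3.0])        # hypothetical component means
X = np.array([0.2, 2.7, -0.5])    # tiny data set so expansion stays feasible
M, T = len(mus), len(X)

# Logarithmic space (Equation 5.4): T log-sums, only T*M evaluations.
ll = sum(np.log(sum(np.exp(log_joint(m, x, mus)) for m in range(M))) for x in X)

# Product space expanded: one term per latent configuration, M^T in total.
terms = [np.exp(sum(log_joint(ms[t], X[t], mus) for t in range(T)))
         for ms in product(range(M), repeat=T)]
ll_expanded = np.log(sum(terms))
```

With M = 2 and T = 3 the expansion already has 8 terms; at a realistic T of a few thousand it is astronomically large, which is exactly why bounds are needed.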
mixtures) prevent easy maximization calculations for ML (i.e. maximizing by taking gradients and setting to zero). Alternatively, in Bayesian inference, we will attempt to perform integrals to compute, for example, the evidence:

p({X}) = ∫ p({X}, Θ) dΘ                                        (5.5)
       = ∫ Π_{t=1}^T ( Σ_{m=1}^M p(m, X_t|Θ) ) P(Θ) dΘ         (5.6)

Once again, integrating over a product of sums is a very difficult task. In fact, the intractability resulting from products of sums is equivalent to the intractability resulting from the addition of logarithms of sums. Therefore, it is highly desirable to find surrogate distributions which are easier to manipulate than these latent distributions or log-sums. These can be either upper or lower bounds on the log-sum so that we can give guarantees on the integrals or iteratively maximize the objective functions (such as likelihood). Conceptually, we would like to pull the logarithm into the sum so that we have a sum of logs instead of a log-sum. This intractability is directly avoided when we introduce bounds (upper and lower) which permit tractable multiplication, maximization and integration. In this chapter we will derive these upper and lower bounds and provide an analytic formula for obtaining them. Basically, the multi-modal, complex expressions we have shown above will be replaced with simple (exponential family) distributions that upper and lower bound them to avoid intractabilities. We will focus on the log-sum interpretation in the next sections, but all results are directly equivalent in the product space as well. Product space is more convenient for integration while logarithmic space is more convenient for maximization. We now discuss the celebrated EM algorithm [44] which does just that: it lower bounds the latent log-sums to alleviate the computational intractability.
5.3 Expectation Maximization: Divide and Conquer

The Expectation-Maximization (EM) algorithm⁵ finds its roots as the successor of old heuristic approaches to fitting mixtures of models and clustering algorithms such as k-means. In the case of mixtures of models or latent maximum likelihood estimation, notions of divide and conquer [102] were used to motivate EM-type procedures. One can view maximum likelihood with a mixture model as several independent models available to collectively describe a training data set. Therefore, we divide the data among these models in some sort of competition where each model gravitates to the data that it accounts for most strongly. Iterating this division of data and conquering (i.e. re-estimating individual models) breaks down what would be a difficult fit to a complex training set of points into several small fitting operations on partitions of the training data. In k-means, models gravitate to clusters of points by competing in a winner-take-all scenario. EM is also such a divide and conquer operation, but models do not greedily take over a given point (winner-take-all); instead they share a soft responsibility assignment to each data point. In other words, models that describe a point better are given a higher responsibility or weight. Computationally, then, this division permits each model to be estimated from data as if it were alone, simplifying the intractabilities of the log-sum in Equation 5.4, for instance. Each model is estimated on its own over a weighted configuration of the data according to its previous responsibility levels. Effectively, the summation inside the logarithm is pulled outside, avoiding the difficulties in the M-steps (or the intractable integration in Bayesian inference over Equation 5.6).
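The soft divide-and-conquer loop described above can be sketched in a few lines for a two-component, unit-variance 1D Gaussian mixture with fixed equal weights (data and initialization are made up; this is an illustration, not the thesis code). The E-step computes responsibilities, and the M-step refits each mean on its weighted share of the data.

```python
import numpy as np

def loglik(X, mus):
    """Latent log-likelihood: sum of log-sums as in Equation 5.4."""
    comps = np.exp(-0.5 * (X[:, None] - mus[None, :]) ** 2) / np.sqrt(2 * np.pi)
    return np.log(0.5 * comps.sum(axis=1)).sum()

def em_step(X, mus):
    # E-step: responsibilities softly divide the data among the models.
    comps = np.exp(-0.5 * (X[:, None] - mus[None, :]) ** 2)
    h = comps / comps.sum(axis=1, keepdims=True)
    # M-step: each model is re-estimated ('conquered') on its weighted data.
    return (h * X[:, None]).sum(axis=0) / h.sum(axis=0)

X = np.array([-2.1, -1.9, -2.3, 1.8, 2.2, 2.0])   # two obvious clusters
mus = np.array([-0.5, 0.5])                        # deliberately poor start
before = loglik(X, mus)
for _ in range(20):
    mus = em_step(X, mus)
after = loglik(X, mus)
```

Each iteration maximizes a Jensen lower bound on the log-likelihood, so `loglik` increases monotonically and the means drift to the two cluster centers.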
EM's strategy of conquer and divide works not because it is intuitive but because of the mathematical properties of maximum log-likelihood, namely the concavity of the log function and the direct applicability of Jensen's inequality in forming a guaranteed lower bound on log-likelihood [44] [13] [12]. In fact, if we change the objective function, this same conquer and divide strategy will not work. Figure 5.2 depicts what EM is effectively doing. In an E-step, we compute a lower bound on the log-likelihood using Jensen's inequality, and in an M-step we maximize that lower bound. By iterating these two procedures, we are bounding and maximizing the objective function, which is therefore guaranteed to increase monotonically.

Figure 5.2: Expectation-Maximization as Iterated Bound Maximization.

Unfortunately, discriminative criteria cannot use EM's simple conquer and divide strategy because Jensen's inequality does not generate a lower bound on their objective functions. This will be elaborated in the following sections, but we provide a high-level discussion here. Computationally, what differentiates the discriminative criteria from ML is that they not only require Jensen-type lower bounds but also need the corresponding upper bounds. The Jensen bounds only partially simplify their expressions and some intractabilities remain. For instance, latent distributions need to be bounded above and below in a discriminative setting [85]. Metaphorically, discriminative learning requires lower bounds to cluster positive examples and upper bounds to repel away from negative ones. Thus, conquer and divide doesn't work and we need a discriminative variant of the conquer-and-divide strategy.

⁵Here we are deferring a formal treatment in favor of an intuitive explanation of EM.

5.4 Latency in Conditional and Discriminative Criteria

Why can't we apply EM in a discriminative setting?
The combination of ML estimates with EM and Jensen has indeed produced straightforward and monotonically convergent estimation procedures for mixtures of the e-family [44] [34] [103]. However, EM, like ML, does not explicitly address the task of the learning system. It is rather a non-discriminative modeling technique for estimating a generative model. Consequently, EM and ML suffer when assumptions in the model are inaccurate.

Figure 5.3: Maximum Likelihood versus Maximum Conditional Likelihood. Left: ML Classifier, l = −8.0, lc = −1.7. Right: CML Classifier, l = −54.7, lc = 0.4. Thick Gaussians represent o's, thin ones represent x's.

For visualization, observe the binary classification⁶ problem in Figure 5.3. Here, we have a training data set which consists of o's (positive class) and x's (negative class). These have been sampled from 8 identity-covariance Gaussians (4 of each class). Each Gaussian has equal probability. We will fit this data with a two-class generative model which incorrectly has 2 Gaussians per class (again with equal probability each and identity covariance). Two solutions are shown, ML in Figure 5.3(a) and CML in Figure 5.3(b). Each of the 4 Gaussians in the model is depicted by its iso-probability contour. The 2 thick circles represent the positive (o's) Gaussian models while the 2 thin circles represent the negative (x's) Gaussian models. We also see the values of the joint log-likelihood l and conditional log-likelihood lc for each solution. Note how ML has the larger l value of the two while CML has the larger lc value of the two configurations. These distributions induce a classification boundary, i.e. points where the positive 2-Gaussian model is greater than the negative one will be assigned to the positive class and vice-versa.

⁶The derivations herein extend to multi-class classification and regression as well.
In Figure 5.3(a) this results in a decision boundary that splits the figure in half with a horizontal line across the middle because the positive Gaussians overtake the top and the negative ones overtake the bottom half of the figure. Counting the number of correct classifications, we see that the ML solution performs as well as random chance, getting roughly 50% accuracy. This is because the ML model is trying to cluster the data and place the Gaussian models at our disposal such that their probability mass is on top of the data samples (that belong to their class). In fact, fitting the positive data with the positive model is done independently of the fit of the negative data with the negative model. This is precisely how EM iterates, and the objective it maximizes is likelihood. Its objective is to be as good a generator of the data as possible, so classification performance is sacrificed. Meanwhile, in Figure 5.3(b), the decision boundary that is generated by the model creates 4 horizontal classification strips (as opposed to splitting the figure in half). These classify the data as positive, negative, positive and negative respectively as we go from top to bottom, because the 4 Gaussians are arranged in this vertical interleaving order. The accuracy for this fit is roughly 100%. This is because CML attempts to form a good output (class label) distribution given the input, and this is more appropriately suited to classification. CML, in estimating a conditional density, propagates the classification task into the estimation criterion. It is clear, however, that the model is not a good generator of the data since the Gaussians don't put their probability mass on the samples but are simply arranged down the middle of the figure, which provides a very low value of likelihood l. Clearly, therefore, optimizing lc using CML is more appropriate for finding a good classifier model. What is needed is a discriminative or conditional counterpart of EM that seeks to optimize lc instead.
Let us now describe the above classification scenario more mathematically. In such examples, we are given training examples X_i and corresponding binary labels c_i. The goal is to classify the data with a latent variable e-family model (a mixture of Gaussians here). We use m to represent the latent missing variables. Let us now consider the objective functions, which will immediately exhibit the aforementioned intractable log-sum structures. For the generative log-likelihood objective function l we have the following formula:

l = Σ_i log Σ_m p(m, c_i, X_i|Θ)

For the more discriminative conditional likelihood approach we have the conditional log-likelihood lc, which contains log-sums as well as negated log-sums:

lc = Σ_i log Σ_m p(m, c_i|X_i, Θ)
   = Σ_i log Σ_m p(m, c_i, X_i|Θ) − Σ_i log Σ_m Σ_c p(m, c, X_i|Θ)

Alternatively, we could have considered an even more discriminative MED approach whose discriminant function would again result in complications due to the log-sum:

L(X|Θ) = log [ p(X|Θ₊) / p(X|Θ₋) ]
       = log Σ_m p(m, X|Θ₊) − log Σ_m p(m, X|Θ₋)

In the above latent log-likelihoods and discriminant functions we recognize the presence of logarithms of sums (and negated log-sums). As before, these cause intractabilities (which remain in the product space interpretation as well). EM can handle log-sums through Jensen's inequality and manipulate lower bounds on log-likelihood. Each log-sum in the objective function l will be lower bounded and all the lower bounds will add to form an aggregate lower bound. However, in conditional and discriminative criteria we observe negated log-sums. Negation will flip the Jensen inequality lower bounds. Thus, applying Jensen on the negated components of these objective functions or discriminant functions will actually produce upper bounds. Thus, the log-sum terms will be lower bounded while the negated log-sum terms will be upper bounded.
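The decomposition of lc into a log-sum minus a negated log-sum can be made concrete on a toy model. The sketch below (all numbers made up) uses one unit-variance, two-component Gaussian mixture per class with equal class priors, and computes l and lc exactly as in the formulas above.

```python
import numpy as np

def log_p_joint(x, c, mus, priors):
    """log p(c, x) = log sum_m p(m, c, x|Theta): the log-sum of the text."""
    comps = np.exp(-0.5 * (x - mus[c]) ** 2) / np.sqrt(2 * np.pi)
    return np.log(priors[c] * 0.5 * comps.sum())   # p(m|c) = 1/2 each

def objectives(X, C, mus, priors=(0.5, 0.5)):
    l = sum(log_p_joint(x, c, mus, priors) for x, c in zip(X, C))
    # lc = l - sum_i log sum_c p(c, x_i): joint minus the label-marginal term.
    lc = l - sum(
        np.log(sum(np.exp(log_p_joint(x, c, mus, priors)) for c in (0, 1)))
        for x in X)
    return l, lc

X = np.array([-2.0, -1.8, 1.9, 2.1])
C = [0, 0, 1, 1]
mus = {0: np.array([-2.0, -1.9]), 1: np.array([1.9, 2.1])}   # hypothetical fit
l, lc = objectives(X, C, mus)
```

For this well-separated fit, lc sits just below zero (each p(c_i|x_i) is nearly 1) while l is considerably lower, mirroring the l versus lc trade-off of Figure 5.3.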
Adding lower bounds and upper bounds is useless since it will not generate the desired aggregate lower bound on the discriminative quantities (i.e. conditional likelihood or MED discriminant functions). We next show the Jensen inequality which will lower bound log-sums to produce an EM algorithm. For discrimination, we also derive the complementary upper bounds⁷ through a reverse-Jensen inequality. These reverse bounds are structurally similar to Jensen bounds, allowing easy migration of ML techniques to discriminative settings. The bounds could also be useful as mathematical tools for non-statistical problems. We will focus the development of these bounds on mixtures of the exponential family, which will have desirable properties permitting us to establish guaranteed bounds on discriminative quantities like conditional likelihood and Maximum Entropy Discrimination.

5.5 Bounding Mixture Models

As was mentioned earlier, optimizing likelihood or conditional likelihood becomes intractable when we have latent models or (equivalently) mixture models. The EM algorithm was shown to be well suited to maximize likelihood since it replaces the complicated log-sum expressions that arise with simple lower bounds that can be maximized in a straightforward way. Yet, for discrimination and conditional likelihood maximization, we have negated log-sums and cannot use lower bounds alone. Therefore, it is clear that we need to bound latent quantities or log-sums on both sides with lower and upper bounds. Once again, we assume that we have a generative model that is given by a mixture of the exponential family as described earlier: p(X) = Σ_m p(m) p(X|m). In a log-likelihood based optimization, each data point therefore gives rise to a log-sum term log( Σ_m p(m) p(X|m) ) which causes intractabilities.

⁷A similar yet weaker bound was shown for Gaussian mixture regression in [92].
We propose upper and lower bounds on the log-sum, which appear as follows (in logarithmic space):

Σ_{m=1}^M w̃_m log p(Ỹ_m|m, Θ) + k̃  ≤  log Σ_m p(m, X|Θ)  ≤  Σ_m (−w_m) log p(Y_m|m, Θ) + k     (5.7)

Taking exp() of both sides allows us to consider these bounds in product space:

exp(k̃) Π_m p(Ỹ_m|m, Θ)^{w̃_m}  ≤  Σ_{m=1}^M p(m, X|Θ)  ≤  exp(k) Π_m p(Y_m|m, Θ)^{−w_m}     (5.8)

Upon inspection, it is clear that the left hand side and right hand side of the inequalities share a very similar structure. This homogeneous structure is critical if we are to use lower bounds and upper bounds in a combined way to simplify log-sums. Furthermore, through these bounds it should be evident that we are no longer dealing with a log-sum but rather a sum of logs (we have 'pulled' the log into the summation). This structure is far easier to handle computationally. In product space, we are no longer dealing with a sum of e-family distributions but rather products of e-families (which remain in the e-family) and therefore immediately inherit the straightforward computational and estimation properties therein. We have yet to specify the parameters of the upper and lower bounds, namely the k̃, w̃_m, Ỹ_m and the k, w_m, Y_m respectively (where the subscript m ranges from 1..M, the number of latent models). This will be done shortly after we give a conceptual description of the parameters and their roles.

Figure 5.4: Dual-Sided Bounding of Latent Likelihood.

The left hand side of the above inequality is a direct consequence of Jensen's inequality (which computes guaranteed k̃, Ỹ_m, w̃_m). The right hand side is a direct consequence of the reverse-Jensen inequality (which computes guaranteed k, Y_m, w_m). The bounds basically give a weight to the data points (i.e. w_m) and translate their coordinates (i.e.
from $X_m$ to $Y_m$) to avoid representing them as a sum of distributions. Therefore, we have replaced the sum of exponential family members by bounds which are only products of the exponential family. As we said earlier, products of the e-family remain in the e-family, and the intractable latent log-likelihood quantities are bounded above and below by simple e-family distributions. The bounding is illustrated in Figure 5.4 on a 1d example (where the e-family being used here is a Gaussian mean). The middle curve (which is neither concave nor convex) is the original latent log-likelihood, while the upper bound is a convex complete likelihood (as a consequence of the negated weight $-w$) and the lower bound is a concave complete likelihood (as a consequence of the positive weight $\tilde{w}$). The upper and lower bounds both make contact with the original log-sum quantity at the point $\tilde{\Theta}$, which is also referred to as the contact point. If we were performing iterative maximization (as in EM), this would be the current model estimate, which would get iteratively updated after a bounding and maximization step. Table 5.2 summarizes the meaning of the parameters. In the next two sections, we provide the equations to compute them for a given log-sum of the form shown in Equation 5.1 at a current operating or contact point $\tilde{\Theta}$.

5.5.1 Jensen Bounds

Recall the definition of Jensen's inequality: $f(E\{x\}) \ge E\{f(x)\}$ for concave $f$. The log-summations in $l$, $l^c$, and $L(X|\Theta)$ all involve a concave $f = \log$ around an expectation, i.e. a log-sum or probabilistic mixture over latent variables. We apply Jensen as follows:

$\log \sum_m p(m,X|\Theta) \;\ge\; \sum_m \frac{p(m,X|\tilde{\Theta})}{\sum_n p(n,X|\tilde{\Theta})} \log \frac{p(m,X|\Theta)}{p(m,X|\tilde{\Theta})} \;+\; \log \sum_m p(m,X|\tilde{\Theta})$   (5.9)
Parameter      Role
(Jensen)
$\tilde{w}_m$        Scalar weight on the virtual data for the m'th model $\Theta_m$
$\tilde{Y}_m$        Virtual data vector computed from the datum $X_m = X$ for the m'th model $\Theta_m$
$\tilde{k}$          An additive constant ensuring the lower bound equals the log-sum at $\Theta = \tilde{\Theta}$
(Reverse-Jensen)
$w_m$          Scalar weight on the virtual data for the m'th model $\Theta_m$
$Y_m$          Virtual data vector computed from the datum $X_m = X$ for the m'th model $\Theta_m$
$k$            An additive constant ensuring the upper bound equals the log-sum at $\Theta = \tilde{\Theta}$

Table 5.2: Jensen and Reverse-Jensen Bound Parameters.

It is traditional to denote the terms in the parentheses above by $h_m$ and refer to them as the responsibilities [102]. These can be thought of as the weights each model has for the data point, which is shared among the models according to how likely each was to have generated it:

$h_m := \frac{p(m,X|\tilde{\Theta})}{\sum_n p(n,X|\tilde{\Theta})}$

The Jensen inequality application above can be recast into the form shown in Equation 5.7. This manipulation readily shows how the log-sum intractability is removed. Recall the lower bound we wished to form on the log-sum:

$\log \sum_m p(m,X|\Theta) \;\ge\; \sum_m \tilde{w}_m \log p(\tilde{Y}_m|m,\Theta) + \tilde{k}$

We expand this form in the e-family notation:

$\log \sum_m \alpha_m \exp\left(A_m(X_m) + X_m^T \Theta_m - K_m(\Theta_m)\right) \;\ge\; \sum_m \tilde{w}_m \left(A_m(\tilde{Y}_m) + \tilde{Y}_m^T \Theta_m - K_m(\Theta_m)\right) + \tilde{k}$

Here, $\tilde{k}$ is a scalar additive constant, the $\tilde{w}_m$ are positive scalar weights on the data and the $\tilde{Y}_m$ are the new virtual data vectors (translated from the original $X_m$). This forms a variational lower bound on the log-sum which makes tangential contact with it at $\tilde{\Theta}$ and is much easier to manipulate. Basically, the log-sum becomes a sum of log-exponential family members. Plugging the results from Equation 5.9 into the above expanded e-family notation gives the parameters of the lower bound in exponential family form as:

$\tilde{k} = \log p(X|\tilde{\Theta}) - \sum_m \tilde{w}_m \left(A(\tilde{Y}_m) + \tilde{Y}_m^T \tilde{\Theta}_m - K_m(\tilde{\Theta}_m)\right)$
$\tilde{Y}_m = \frac{1}{\tilde{w}_m}\, h_m \left( X_m - \frac{\partial K_m(\Theta_m)}{\partial \Theta_m}\Big|_{\tilde{\Theta}_m} \right) + \frac{\partial K_m(\Theta_m)}{\partial \Theta_m}\Big|_{\tilde{\Theta}_m} = X_m$

$\tilde{w}_m = h_m$

Note the positive scalar $h_m$ terms (the responsibilities) which arise from Jensen's inequality. These quantities are relatively straightforward to compute: we only require local evaluations of log-sum values at the current $\tilde{\Theta}$ to compute a global lower bound. If we bound all log-sums in the log-likelihood, we have a lower bound on the objective $l$ which we can maximize easily. Iterating maximization and lower bound computation at the new $\Theta$ produces a local maximum of log-likelihood, as in EM.^8 However, applying Jensen to log-sums in $l^c$ and $L(X|\Theta)$ is not as straightforward. Some terms in these expressions involve negated log-sums, so there Jensen actually yields an upper bound on those terms. If we want overall lower and upper bounds on $l^c$ and $L(X|\Theta)$, we need to compute reverse-Jensen bounds.

5.5.2 Reverse-Jensen Bounds

Reversals and converses of Jensen's inequality have been explored in the mathematics and statistics community and are summarized in [150] and [46]. However, these reversals do not have the correct form for direct use in discriminative and conditional latent learning, so we have derived our own reversal (which is detailed in Chapter 7). It seems strange that we can reverse Jensen (i.e. $f(E\{x\}) \le E\{f(x)\}$) but it is possible. Among other things, we exploit the convexity of the $K$ functions in the e-family instead of exploiting the concavity of $f = \log$. However, not only does the reverse bound have to upper-bound the log-sum, it should also have the same form as the Jensen bound above, i.e. a sum of log-exponential family terms. That way, upper and lower bounds can be combined and used together homogeneously, and ML tools can be quickly adapted to the new bounds. The derivation of the reverse-Jensen inequality is long and is deferred to Chapter 7.
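The Jensen bound is easy to check numerically. The following sketch (an assumed setup, not thesis code) uses the 1-D Gaussian-mean e-family with $A(x) = -x^2/2 - \frac{1}{2}\log 2\pi$ and $K(\theta) = \theta^2/2$, computes the responsibilities $h_m$, sets $\tilde{w}_m = h_m$ and $\tilde{Y}_m = X$, and verifies tangential contact:

```python
# Sketch: Jensen's lower bound on the log-sum for a mixture of
# unit-variance Gaussian means; w~_m = h_m (responsibilities), Y~_m = X,
# and k~ is chosen so the bound touches the log-sum at Theta~.
import math

A = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
K = lambda th: 0.5 * th * th

def log_sum(x, alphas, thetas):
    terms = [math.log(a) + A(x) + x * th - K(th) for a, th in zip(alphas, thetas)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def jensen_lower_bound(x, alphas, thetas_tilde):
    # responsibilities h_m evaluated at the contact point Theta~
    terms = [math.log(a) + A(x) + x * th - K(th) for a, th in zip(alphas, thetas_tilde)]
    top = max(terms)
    z = sum(math.exp(t - top) for t in terms)
    h = [math.exp(t - top) / z for t in terms]
    # k~ enforces equality with the log-sum at Theta~
    k = log_sum(x, alphas, thetas_tilde) - sum(
        hm * (A(x) + x * th - K(th)) for hm, th in zip(h, thetas_tilde))
    return lambda thetas: sum(
        hm * (A(x) + x * th - K(th)) for hm, th in zip(h, thetas)) + k

x, alphas, contact = 1.0, [0.5, 0.5], [1.0, -1.0]
bound = jensen_lower_bound(x, alphas, contact)
# equality at the contact point; below the log-sum everywhere else
```

The datum, mixing proportions and contact point here are arbitrary illustrative values.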
For now, we simply show how to calculate the inequality's required parameters $(w_m, Y_m, k)$ that will guarantee a global upper bound on the log-sum. Recalling Equation 5.7, we note that we had:

$\log \sum_m p(m,X|\Theta) \;\le\; \sum_m (-w_m) \log p(Y_m|m,\Theta) + k$

Expanding out in exponential family form we obtain:

$\log \sum_m \alpha_m \exp\left(A_m(X_m) + X_m^T \Theta_m - K_m(\Theta_m)\right) \;\le\; \sum_m -w_m \left(A(Y_m) + Y_m^T \Theta_m - K_m(\Theta_m)\right) + k$

Here, we give the parameters of the bound directly; refer to Chapter 7 for their algebraic derivation. This bound again makes tangential contact at $\tilde{\Theta}$ yet is an upper bound on the log-sum.^9

$k = \log p(X|\tilde{\Theta}) + \sum_m w_m \left(A(Y_m) + Y_m^T \tilde{\Theta}_m - K_m(\tilde{\Theta}_m)\right)$

$Y_m = -\frac{h_m}{w_m} \left( X_m - \frac{\partial K(\Theta_m)}{\partial \Theta_m}\Big|_{\tilde{\Theta}_m} \right) + \frac{\partial K(\Theta_m)}{\partial \Theta_m}\Big|_{\tilde{\Theta}_m}$

$w_m' = \min w_m'$ such that $\;-\frac{h_m}{w_m'} \left( X_m - \frac{\partial K(\Theta_m)}{\partial \Theta_m}\Big|_{\tilde{\Theta}_m} \right) + \frac{\partial K(\Theta_m)}{\partial \Theta_m}\Big|_{\tilde{\Theta}_m} \in \left\{ \frac{\partial K(\Theta_m)}{\partial \Theta_m} \right\}$

$w_m = 4\, G(h_m/2)\, \left( X_m - K'(\tilde{\Theta}_m) \right)^T K''(\tilde{\Theta}_m)^{-1} \left( X_m - K'(\tilde{\Theta}_m) \right) + w_m'$

In the above, we have introduced the function $G(\gamma)$, which is simply:

$G(\gamma) = \frac{25/36\,(\gamma + 5/6)}{4\log(6) + \log(6)^2}$ for $\gamma \ge 1/6$; $\quad G(\gamma) = \frac{(\gamma-1)^2}{4\log(1/\gamma) + \log(\gamma)^2}$ for $\gamma \le 1/6$   (5.10)

^8 Other justifications for the convergence of EM include appeals to Kullback-Leibler divergence (1951) [114] and Bregman distances. However, these concepts are newer than Jensen's inequality (1906) [98], which pre-dates them and is sufficient to prove EM's convergence.
^9 We can also find multinomial bounds on $\alpha$-priors instead of $\Theta$ parameters, as in Section 5.6.

This bound effectively re-weights ($w_m$) and translates ($Y_m$) incomplete data to obtain complete data. The $w_m$ are positive weights, and we pick the smallest $w_m$ and $w_m'$ which still satisfy the two simple conditions above. We can always set $w_m$ larger than the conditions require, but smaller $w_m$ values mean a tighter bound. The first condition requires that $w_m'$ generate a valid $Y_m$ that lives in the gradient space of the $K$ functions (a typical e-family constraint).^10
Thus, from local computations of the log-sum's values, gradients and Hessians at the current $\tilde{\Theta}$, we can compute global upper bounds. The Appendix contains a tighter formulation than the one above. The $G(\gamma)$ function is actually an upper bound on a tighter reverse-Jensen solution for $w_m$ which involves numerical table lookups. This is detailed in the derivation of the reverse-Jensen inequality in the next chapter and in the Appendix. Using these lookup tables provides slightly tighter bounds. Furthermore, in practice, we often compute a tighter version of the $w_m$ by omitting the multiplicative 4 in front of the $G(\gamma)$ function (i.e. making it 1 or less). This still seems to empirically generate reasonable bounds, yet these have no analytic guarantees. In earlier work [94], we derived and used a simpler (yet looser) version of the above bounds, given by:

$w_m = \left( X_m - K'(\tilde{\Theta}_m) \right)^T K''(\tilde{\Theta}_m)^{-1} \left( X_m - K'(\tilde{\Theta}_m) \right) + w_m'$

Visualization

In Figure 5.5 we plot the bounds for a two-component unidimensional Gaussian mixture model and a two-component binomial (unidimensional multinomial) mixture model. The Jensen-type bounds as well as the reverse-Jensen bounds are shown at various configurations of $\tilde{\Theta}$ and $X$. Jensen bounds are usually tighter, but this is inevitable due to the intrinsic shape of the log-sum. In addition to viewing many such 2D visualizations, we computed higher dimensional bounds and sampled them extensively, empirically verifying that the reverse-Jensen bound remained above the log-sum.

[Figure 5.5: Jensen (black) and reverse-Jensen (white) bounds on the log-sum (gray). (a) Gaussian case; (b) multinomial case.]

^10 In the Gaussian case, the bounds can be made tighter by letting $w_m$ be a full matrix and ignoring the contribution of $w_m'$.
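A minimal sketch of the empirical verification described above, for the 1-D Gaussian-mean case. The weights below are a deliberately generous choice (the text notes that larger $w_m$ always remains admissible, just looser), not the tight Equation 5.10 values:

```python
# Sketch: sample the parameter space and verify that a reverse-Jensen
# style upper bound stays above the log-sum.  Gaussian-mean e-family:
# K(theta) = theta^2/2, so K'(theta) = theta and K''(theta) = 1.
import math

A = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)
K = lambda th: 0.5 * th * th

def log_sum(x, alphas, thetas):
    terms = [math.log(a) + A(x) + x * th - K(th) for a, th in zip(alphas, thetas)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def reverse_jensen_upper_bound(x, alphas, tilde):
    terms = [math.log(a) + A(x) + x * th - K(th) for a, th in zip(alphas, tilde)]
    top = max(terms)
    z = sum(math.exp(t - top) for t in terms)
    h = [math.exp(t - top) / z for t in terms]
    # generous weights for illustration (larger w_m is always admissible)
    w = [4.0 * ((x - th) ** 2 + 1.0) for th in tilde]
    # translated virtual data: reflect X through the mean K'(theta~_m)
    Y = [th - (hm / wm) * (x - th) for hm, wm, th in zip(h, w, tilde)]
    # k enforces tangential contact at the current Theta~
    k = log_sum(x, alphas, tilde) + sum(
        wm * (A(ym) + ym * th - K(th)) for wm, ym, th in zip(w, Y, tilde))
    return lambda thetas: sum(
        -wm * (A(ym) + ym * th - K(th)) for wm, ym, th in zip(w, Y, thetas)) + k

x, alphas, tilde = 1.0, [0.5, 0.5], [1.0, -1.0]
upper = reverse_jensen_upper_bound(x, alphas, tilde)
```

The datum and contact point are arbitrary illustrative values; a grid sweep over the parameter space then confirms the bound stays above the log-sum.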
5.6 Mixing Proportions Bounds

Recall that we had Equation 5.1 to contend with as our mixture of e-family distributions. The $\alpha_m$ were assumed to be constants. The variational bound we solved for is on the $\Theta$ model parameters, which are allowed to change. However, it may be necessary to vary the $\alpha_m$ as parameters (i.e. if these are free and must be optimized). Thus, we could conceive of bounds on the mixing proportions and use these to estimate optimal mixing coefficients. It is possible to find a reverse-Jensen bound over both $\alpha$ and $\Theta$ simultaneously, just as EM and Jensen handle both jointly. This is done by simply rewriting the mixing proportions in their natural exponential family form and seeing that they can be concatenated into the $\Theta$ parameters. This development mirrors that proposed in [105] (in the appendix thereof). Let us return to our mixture model and note that $\alpha$ will be allowed to vary:

$p(X|\alpha,\Theta) = \sum_{m=1}^{M} \alpha_m \exp\left(A_m(X_m) + X_m^T \Theta_m - K_m(\Theta_m)\right)$

For simplicity, let us define the following:

$T_m := \exp\left(A_m(X_m) + X_m^T \Theta_m - K_m(\Theta_m)\right)$

This allows us to simplify the expression for the mixture model:

$p(X|\alpha,\Theta) = \sum_{m=1}^{M} \alpha_m T_m$

We will now show that $p(X|\alpha,\Theta)$ can be written as a sum of exponential family members whose parameters are in $\alpha$ and are therefore boundable via Jensen and reverse-Jensen. First note that we are dealing with mixing proportions, and thus the $\alpha$ sum to unity, i.e. $\sum_m \alpha_m = 1$. Furthermore, consider the change of variables (from the $M$-dimensional $\alpha$-space to a natural, compact $(M-1)$-dimensional $\eta$-space):

$e^{\eta_m} = \frac{\alpha_m}{\alpha_M} \quad \forall m \in [1 \ldots (M-1)]$

We can thus rewrite $p(X|\alpha,\Theta)$ as follows:

$p(X|\alpha,\Theta) = \sum_{m=1}^{M} T_m \alpha_m = \sum_{m=1}^{M} T_m \alpha_M \frac{\alpha_m}{\alpha_M} = \sum_{m=1}^{M} T_m \alpha_M \exp\left( \log \frac{\alpha_m}{\alpha_M} \right)$

$= \sum_{m=1}^{M} T_m \exp\left( \log \frac{\alpha_m}{\alpha_M} \right) \frac{\alpha_M}{\sum_{m'=1}^{M} \alpha_{m'}}$

$= \sum_{m=1}^{M} T_m \exp\left( \log \frac{\alpha_m}{\alpha_M} \right) \frac{1}{1 + \sum_{m'=1}^{M-1} \frac{\alpha_{m'}}{\alpha_M}}$

$= \sum_{m=1}^{M} T_m \exp\left( \log \frac{\alpha_m}{\alpha_M} - \log\left[ 1 + \sum_{m'=1}^{M-1} \frac{\alpha_{m'}}{\alpha_M} \right] \right)$
$= \sum_{m=1}^{M-1} T_m \exp\left( \log \frac{\alpha_m}{\alpha_M} - \log\left[ 1 + \sum_{m'=1}^{M-1} \frac{\alpha_{m'}}{\alpha_M} \right] \right) + T_M \exp\left( \log \frac{\alpha_M}{\alpha_M} - \log\left[ 1 + \sum_{m'=1}^{M-1} \frac{\alpha_{m'}}{\alpha_M} \right] \right)$

$= \sum_{m=1}^{M-1} T_m \exp\left( \eta_m - \log\left[ 1 + \sum_{m'=1}^{M-1} e^{\eta_{m'}} \right] \right) + T_M \exp\left( 0 - \log\left[ 1 + \sum_{m'=1}^{M-1} e^{\eta_{m'}} \right] \right)$

Now consider the vector $\eta$ (a reparameterization of $\alpha$) as an $(M-1)$-tuple. We can rewrite the above as a multinomial e-family member. Recall that for a multinomial we have the following $K$ and $A$ functions:^11

$A(x_m) = 0 \quad \forall m$   (5.11)

$K(\eta) = \log\left( 1 + \sum_{i=1}^{M-1} \exp(\eta_i) \right)$   (5.12)

This allows us to rewrite the above as:

$p(X|\alpha,\Theta) = \sum_{m=1}^{M-1} T_m \exp\left(\eta_m - K(\eta)\right) + T_M \exp\left(0 - K(\eta)\right)$

In addition, we shall introduce virtual data vectors $x$ as $(M-1)$-tuples and define them as follows. For each $m \in [1, M-1]$, the virtual data vector $x_m$ is all-zero except for a single 1 at dimension $m$. The vector $x_M$ is all zeros. Also note that the function $A(x) = 0$ for a typical multinomial model. We can therefore rewrite the latent likelihood parameterized by $\alpha$ in the following familiar form:

$p(X|\alpha,\Theta) = \sum_{m=1}^{M} T_m \exp\left( \eta^T x_m - K(\eta) \right) = \sum_{m=1}^{M} T_m \exp\left( A(x_m) + \eta^T x_m - K(\eta) \right)$

Thus we realize that we can rewrite the above as an exponential family form and the same bounding techniques can be applied for both Jensen and reverse-Jensen. The derivations for the $\Theta$-case can be adapted to $\alpha$, which is effectively a multinomial e-family member. Let us now plug the definition of $T_m$ back into the above:

$p(X|\alpha,\Theta) = \sum_m \exp\left( A(x_m) + \eta^T x_m - K(\eta) \right) T_m = \sum_m \exp\left( A(x_m) + \eta^T x_m - K(\eta) \right) \exp\left( A_m(X_m) + X_m^T \Theta_m - K_m(\Theta_m) \right)$

Note that the above product of e-family terms over $\Theta$ with the multinomial over $\eta$ remains in the e-family. We can thus now consider an agglomerated model $\bar{\Theta}$ (where the bar indicates some aggregated model) that contains the parameters $\Theta$ and $\eta$, is also in the exponential family, and has the general form of Equation 5.1. However, now the $\alpha$ parameters have been folded into the exponential family distributions over $\Theta$.

^11 Here we are omitting the calligraphic font for $K$ and $A$ to differentiate them from the ones in the definition of $T_m$.
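The change of variables above is the familiar softmax/logit reparameterization; a minimal sketch of the round trip (illustrative helper names, not thesis code):

```python
# Sketch: mixing proportions alpha (summing to one) mapped to (M-1)
# natural parameters eta, with K(eta) = log(1 + sum_i exp(eta_i))
# recovering alpha as a multinomial e-family member.
import math

def alpha_to_eta(alpha):
    # eta_m = log(alpha_m / alpha_M) for m = 1 .. M-1
    return [math.log(a / alpha[-1]) for a in alpha[:-1]]

def K(eta):
    # multinomial cumulant-generating function, Equation 5.12
    return math.log(1.0 + sum(math.exp(e) for e in eta))

def eta_to_alpha(eta):
    # alpha_m = exp(eta_m - K(eta)) for m < M; alpha_M = exp(-K(eta))
    k = K(eta)
    return [math.exp(e - k) for e in eta] + [math.exp(-k)]

alpha = [0.2, 0.3, 0.5]
eta = alpha_to_eta(alpha)
recovered = eta_to_alpha(eta)   # round-trips back to alpha
```

The compact $\eta$-space is unconstrained, which is what makes the Jensen and reverse-Jensen machinery for $\Theta$ directly applicable to the mixing proportions.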
This makes it possible to jointly upper bound the logarithm of $p(X|\alpha,\Theta)$ over both $\alpha$ and $\Theta$ using the reverse-Jensen approach. Recall that we typically have $X_m = X$ for a mixture model, as we argued earlier in the definition of a mixture of exponential families. In the above, we encounter a situation where the data $x_m$ varies with each value of $m$, unlike the regular mixture model scenario. This justifies our initial precautionary measure of indexing the data with the latent variable $m$ in the definition of the exponential family.

5.7 The CEM Algorithm

Equipped with Jensen and reverse-Jensen bounds, it is now straightforward to implement a maximum conditional likelihood algorithm, i.e. a discriminative learning approach with latent variables. The CEM (Conditional Expectation Maximization) algorithm mirrors the EM algorithm's approach to maximizing joint likelihood. EM iterates by lower bounding and then maximizing the joint log-likelihood; CEM iterates by lower bounding and then maximizing the conditional log-likelihood. Due to the guarantees behind both the Jensen and reverse-Jensen bounds, CEM converges monotonically to a local maximum of conditional likelihood. It should be noted, though, that CEM's convergence is slower than EM's since the reverse-Jensen bounds are looser. However, this is an almost inevitable byproduct of the shape of the negated log-sum. To maximize conditional likelihood (or joint likelihood minus marginal likelihood), we begin with the following type of expression, which needs to be maximized over the parameters $\Theta$:

$l^c = l^j - l^m = \sum_i \log \sum_m p(m, c_i, X_i|\Theta) \;-\; \sum_i \log \sum_m \sum_c p(m, c, X_i|\Theta)$

The above is the conditional log-likelihood of a data set under a (simple) mixture model.
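For concreteness, the objective $l^c$ can be sketched as follows for a hypothetical 1-D Gaussian mixture per class with unit variances and equal mixing proportions (which cancel between the joint and marginal terms):

```python
# Sketch: conditional log-likelihood l^c = l^j - l^m of labelled data
# under a per-class mixture of unit-variance 1-D Gaussians.
import math

def log_gauss(x, mean):
    # log N(x; mean, 1)
    return -0.5 * (x - mean) ** 2 - 0.5 * math.log(2 * math.pi)

def logsumexp(terms):
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def conditional_ll(data, labels, means_per_class):
    # equal mixing proportions cancel between the joint and marginal
    # log-sums, so they are omitted here
    lc = 0.0
    for x, c in zip(data, labels):
        joint = logsumexp([log_gauss(x, mu) for mu in means_per_class[c]])
        marginal = logsumexp([log_gauss(x, mu)
                              for ms in means_per_class for mu in ms])
        lc += joint - marginal
    return lc
```

Both the joint and the marginal terms are log-sums, which is why lower and upper bounds are needed simultaneously.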
We lower bound the joint likelihood term with Jensen and upper bound the marginal likelihood term with the reverse-Jensen inequality. This gives the following overall lower bound on conditional log-likelihood (this step can be called the CE-step of the CEM algorithm):

$l^c \;\ge\; \sum_{mi} h_{mi} \left( \Theta_{mc_i}^T X_i - K(\Theta_{mc_i}) \right) \;-\; \sum_{mci} (-w_{mci}) \left( \Theta_{mc}^T Y_{mci} - K(\Theta_{mc}) \right) \;+\; \text{constant terms}$

It is then straightforward to maximize the right hand side by taking derivatives and setting them to zero (this is the M-step):

$\frac{\partial K(\Theta_{mc})}{\partial \Theta_{mc}} = \frac{1}{\sum_i \delta(c_i,c)\, h_{mi} + w_{mci}} \sum_i \left( h_{mi}\, \delta(c_i,c)\, X_i + w_{mci}\, Y_{mci} \right) \quad \forall m, c$

The above M-step has a unique solution due to the convexity of the $K$ cumulant generating functions. In fact, the above step corresponds to merely maximizing a non-latent exponential family distribution where the data has been weighted (through the $h_{mi}$ and $w_{mci}$ scalar terms) and also translated, since the $Y_{mci}$ are translated versions of the original data.

Visualization

As we mentioned earlier, CEM involves weighted and translated virtual data in its M-steps. EM only involves weighted terms in the M-step since it only employs the Jensen inequality. The reverse-Jensen inequality applies to the negated marginal likelihood, which provides a repulsion term through what is often called the background probability. Thus, CEM bounds the negated marginal likelihood and its repulsion forces by translating the data to another position and then treating it as an attractive force (as in a standard maximum likelihood setting). An interesting analogy can be made with the translated and weighted data, depicted in Figure 5.6. Here, we show a mixture model where 2 (light) Gaussians are assigned to the 'x' class and 1 (dark) Gaussian is assigned to the '+' class. Basically, CEM re-weights the data if it is
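For the Gaussian-mean case, $K'(\theta) = \theta$ and the M-step above reduces to an explicit weighted average. A sketch in which the $h$, $w$ and $Y$ values are assumed to have already been produced by the bounding CE-step (the helper name is ours, not the thesis'):

```python
# Sketch: M-step update for the mean of one latent component assigned
# to class c, following the weighted/translated-data form above.
def cem_m_step_gaussian(data, labels, c, h, w, Y):
    """data[i], labels[i]: observations and their class labels.
    h[i]: Jensen responsibility of this component for datum i.
    w[i], Y[i]: reverse-Jensen weight and translated virtual datum i.
    Returns the updated mean (K'(theta) = theta for a Gaussian mean)."""
    num = den = 0.0
    for x, ci, hi, wi, yi in zip(data, labels, h, w, Y):
        delta = 1.0 if ci == c else 0.0
        num += hi * delta * x + wi * yi   # attraction + translated repulsion
        den += delta * hi + wi            # total effective weight
    return num / den
```

Same-class data enters with its responsibility $h$; other-class data enters only through its translated virtual coordinates $Y$ with weight $w$.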
in the model's class, but if it is in another class, the data gets translated and re-weighted. The translation involves adding a scaled gradient vector $K'(\tilde{\Theta}_{mc})$ to the data. In the Gaussian case, the translation effectively moves the data point through the mean of the model and puts it on the other side. Therefore, instead of repelling from the negative data (the incorrect class), a model gravitates towards it after it has been translated to the other side.

[Figure 5.6: CEM Data Weighting and Translation. Figure (a) depicts the incomplete data and the models that describe it. Figures (b) and (c) depict the complete data as seen by the 'x' models, where same-class data is weighted and the other class's data is translated. Figure (d) depicts what the model for the '+' class sees.]

The data and models are seen in Figure 5.6(a). Figure 5.6(b) shows how one of the 'x' Gaussians sees nearby 'x' data with high weight and distant 'x' data with less weight (as in EM). However, the other class's data (the '+' points) are seen translated to the other side of the model with a weight of their own; thus, the model will effectively repel away from them. Similarly, Figure 5.6(c) depicts the weighting for the other 'x' model, which has different weights on the correct data and different virtual data from the repulsion of the other class. Finally, Figure 5.6(d) depicts the lonely '+' model, which gets all the '+' data with equal weight as well as a repulsion term from the 'x' data. Since the Gaussian is an exponential family model that is self-dual, its data vectors and parameter vectors lie in the same space (i.e. the gradient space of $\frac{1}{2}\Theta^T\Theta$ is $\Theta$). Therefore, the analogy of translating through the means of the Gaussian can be made here.
In other distributions, the analogy does not maintain the same geometric interpretation but can be made more loosely, at a higher level, through the gradients of the cumulant-generating function.

Experiments

We now show some experiments that compare CEM with EM (and therefore pit conditional likelihood against joint likelihood). In these experiments, we used the straightforward reverse-Jensen bounds described in the previous section. It is quite possible that alternative schemes (described later) such as annealing (see Section 5.7.1) and the so-called data-set bound (see Section 6.7) could help obtain faster convergence or quasi-global optima; this remains to be explored. In Figure 5.7 we depict the toy problem we initially posed. The model being used is a mixture of 2 Gaussians per class where the mixing proportions are equal and the Gaussians have identity covariance.

[Figure 5.7: CEM vs. EM Performance on a Gaussian Mixture Model. The clusterings CEM and EM compute are shown on the left. The plots on the right depict conditional likelihood, likelihood and classification accuracy. CEM's performance is shown as a solid blue line; EM's is shown as a dashed red line.]

Here, both EM and CEM are initialized with the same configuration, and both converge monotonically for their respective objective functions. We note that EM quickly converges to a maximum likelihood solution while CEM takes a few more iterations to converge to the maximum conditional likelihood solution.
However, in this problem, maximum likelihood is a very poor criterion: the resulting classifier that EM generates has a classification accuracy of 50% (random chance), while CEM produces a classifier with 100% accuracy. We compare the above performance with that of (batch) gradient ascent in Figure 5.8. Here, we applied gradient ascent to both the maximum likelihood problem and the maximum conditional likelihood problem; thus, we are not using bounds. While gradient ascent seems to converge nicely for the maximum likelihood problem, it gets stuck in local optima when applied to the maximum conditional likelihood problem. Thus, the reverse-Jensen bounds can have advantages over a purely local optimization technique like gradient ascent. In another evaluation of CEM and EM, we used a standardized UCI data set, the Yeast data set. Here, there are 9 classes of yeast that are to be classified based on continuous features. The inputs were scaled and translated to have zero mean and identity covariance. We assumed a rather poor model: an equal mixture of 2 Gaussians per class. Each Gaussian has identity covariance, which restricts the power of the model considerably. Figure 5.9 depicts the resulting performance of EM and CEM. Although CEM takes a long time to converge, it provides a better conditional likelihood score as well as better classification accuracy. The training data included 600 exemplars while the test data included 884. Table 5.3 summarizes the results, which suggest that CEM is better suited for classification tasks (here, random guessing would produce an accuracy of about 10%).

[Figure 5.8: Gradient Ascent on Conditional and Joint Likelihood.]
Here, we are optimizing the same Gaussian mixture model. Gradient ascent seems to work on joint likelihood but gets stuck on the conditional. The plots on the right depict conditional likelihood, likelihood and classification accuracy. Gradient ascent on conditional likelihood is shown as a solid blue line; ascent on joint likelihood is shown as a dashed red line.

Training   Log-Likelihood   Conditional Log-Likelihood   Accuracy
EM         -5946            444.1                        58.3%
CEM        -7404            859.0                        67.2%

Testing    Log-Likelihood   Conditional Log-Likelihood   Accuracy
EM         -9210            424.4                        51.2%
CEM        -11121           835.4                        54.0%

Table 5.3: CEM & EM Performance for the Yeast Data Set.

[Figure 5.9: CEM vs. EM Performance on the Yeast UCI Dataset. The plots depict conditional likelihood, likelihood and classification accuracy. CEM's performance is shown as a solid blue line; EM's is shown as a dashed red line.]

5.7.1 Deterministic Annealing

In the EM algorithm, it is known that local optima problems can be avoided by using a technique called Deterministic Annealing [194], which effectively replaces the current models for each exponential family distribution with a temperature-scaled version as follows:

$\exp(\ldots) \;\to\; \exp\left( \frac{1}{Temp} \times (\ldots) \right)$

Effectively, Jensen's inequality is softened by bounding a different pdf with smoother properties. A similar technique is also feasible with the reverse-Jensen inequality. This gives a deterministically annealed maximum conditional likelihood algorithm, i.e. annealed CEM. Annealing is the simple operation of replacing every exponentiation with a softened exponentiation in which we divide by a scalar temperature value $Temp$.
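A minimal sketch of the tempering operation (illustrative, not the thesis implementation): dividing the log-domain terms by $Temp$ flattens the responsibilities at high temperature:

```python
# Sketch: annealed responsibilities h_m proportional to
# exp(log_term_m / Temp); high Temp pushes them toward uniform.
import math

def annealed_responsibilities(log_terms, temp):
    scaled = [t / temp for t in log_terms]
    top = max(scaled)
    z = sum(math.exp(s - top) for s in scaled)
    return [math.exp(s - top) / z for s in scaled]

schedule = [8.0, 4.0, 2.0, 1.0]   # a typical decreasing schedule ending at 1
hot = annealed_responsibilities([-1.0, -5.0], schedule[0])
cold = annealed_responsibilities([-1.0, -5.0], schedule[-1])
# hot is closer to uniform than cold
```

At $Temp = 1$ the ordinary (un-annealed) responsibilities are recovered, so the algorithm ends as standard CEM.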
We follow a standard annealing schedule in which this temperature value starts off large and is slowly decremented until it reaches unity. This typically softens the bounds and should improve convergence toward quasi-global optima.

5.8 Latent Maximum Entropy Discrimination

The MED framework can also directly benefit from the reverse-Jensen bounds, and it is straightforward to see how these permit it to handle mixture models. Furthermore, discrimination is much better suited for classification and regression than conditional likelihood. We begin by placing mixtures of the e-family in the MED discriminant function:

$L(X;\Theta) = \log \frac{\sum_m P(m, X|\Theta_+)}{\sum_m P(m, X|\Theta_-)} + b$

Recall that the constraints to be satisfied involve expectations over the discriminant function as follows:

$\int P(\Theta, \gamma_t) \left[ y_t L(X_t;\Theta) - \gamma_t \right] \ge 0$

If we expand the above constraints, it becomes clear that the integrals they produce are intractable due to the presence of the summation over the hidden variable. Therefore, we will not have an analytic partition function and objective to optimize. The solution is to invoke Jensen and reverse-Jensen bounds as follows (here, without loss of generality, we assume that $y_t = 1$; if $y_t = -1$, we swap the application of the Jensen and reverse-Jensen bounds):

$\int P(\Theta, \gamma_t) \left[ y_t \log\left( \sum_m P(m, X_t|\Theta_+) \right) - y_t \log\left( \sum_m P(m, X_t|\Theta_-) \right) + b - \gamma_t \right] \ge 0$

$\int P(\Theta, \gamma_t) \left[ \log\left( \exp(\tilde{k}) \prod_m p(\tilde{Y}_m|m,\Theta_+)^{\tilde{w}_m} \right) - \log\left( \exp(k) \prod_m p(Y_m|m,\Theta_-)^{-w_m} \right) + b - \gamma_t \right] \ge 0$

Above, we have lower bounded the left hand side. If we satisfy the 'greater than zero' constraint with the lower bound we have just created (using Jensen and reverse-Jensen), then we automatically satisfy it for the original quantity (the true expectation constraint). Therefore, we have introduced bounds that give stricter constraints in the MED framework to avoid the intractabilities. The
logarithm no longer acts on a summation, and the integrals can be computed analytically (provided that $P(\Theta,\gamma)$ is in a conjugate form to the discriminant function's probability model). Effectively, we have replaced the discriminant function $L(X_t;\Theta)$ with a lower bound on it whenever $y_t = 1$, and an upper bound whenever $y_t = -1$. Therefore we can now propose an iterative MED algorithm that is locally optimal (but not unique). We assume that we start with an estimated mode of the model $P(\Theta,\gamma)$, which we denote $\Theta_t$:

• Step 1: Each data point's discriminant function is bounded individually using the Jensen and reverse-Jensen bounds. This is done at the current estimated mode of the model $P(\Theta,\gamma)$, i.e. a given $\Theta_t$.

• Step 2: We solve the MED optimization using the bounds and obtain the current Lagrange multipliers. These new Lagrange multipliers $\lambda_{t+1}$ are then used to compute $P(\Theta,\gamma)$ and again generate a new estimated mode $\Theta_{t+1}$.

The above steps are iterated until convergence. In practice, the variational bound becomes more accurate as the MED solution distribution $P(\Theta)$ becomes more peaked. This is typically the case as we converge to a final solution and the Lagrange multipliers in the objective function $J(\lambda)$ settle to their locally optimal configuration. The latent MED technique discussed above has an interesting geometric interpretation. Figure 5.10 depicts the process, in which we interleave the Jensen and reverse-Jensen bounds with an iterated MED projection calculation. Since it is impossible to do the projection tractably with the full latent model, we cannot find the closest distribution from the MED prior $P_0(\Theta)$ to the true admissible set $\mathcal{P}$ and its large (and complicated) convex hull. Instead, we pick an arbitrary point $P_t(\Theta)$ (typically the EM algorithm's maximum likelihood estimate is used to initialize this posterior distribution).
We then find a more constrained convex hull formed by invoking the Jensen and reverse-Jensen bounds at the mode of the current $P_t(\Theta)$, namely $\hat{\Theta}_t$. This smaller convex hull admits a closed-form solution since it only involves e-family discriminant functions. Thus, we can find the closest point to the prior by a direct MED projection, which yields $P_{t+1}(\Theta)$. This process is iterated and the small convex hull (which lies within the large original hull, i.e. the admissible set) is slowly updated until we reach a local minimum where the MED projection no longer modifies the solution $P_{t+1}(\Theta)$.

5.8.1 Experiments

To compare latent MED against the standard EM framework, we employed a simple mixture of Gaussians model. The means are permitted to vary while the mixing proportions are held fixed and the covariances are locked at identity. The data set used was the Pima Indians Diabetes data set (available from the UCI repository). The input forms a 7-dimensional feature space while the output is a binary class. Training was performed on 200 input points while testing was performed on the remaining 332 points. Table 5.4 depicts the performance of EM and the latent MED technique (using the reverse-Jensen bounds in a CEM-type iterative loop). In Table 5.5 we also present the results for a standard support vector machine and note that latent MED achieves performance comparable to the polynomial kernel SVM, or equivalently a non-latent MED with a polynomial kernel. While the latent MED solution used the same generative model as the EM algorithm (a mixture of Gaussians), the discriminative aspect of the estimation in MED provides better classification performance. In all the SVM and MED experiments, the regularization constant c was set to 10.
Furthermore, in the SVM experiments, the input space was scaled to remain within the unit cube for normalization reasons (while the EM and latent MED implementations operated on the raw original data).

[Figure 5.10: Iterated Latent MED Projection with Jensen and Reverse-Jensen Bounds. Direct projection to the admissible set $\mathcal{P}$ would give rise to an intractable MED solution due to the mixture of e-family discriminant function. Instead, a stricter convex hull within the admissible set is found using the Jensen and reverse-Jensen inequalities, which give rise to a simple e-family discriminant function and therefore permit closed-form projection. The process is iterated until we converge to a locally optimal point that is as close to the prior as possible while remaining in the admissible set.]

                            Training Accuracy   Testing Accuracy
EM - 2 Gaussian Mixture     73%                 70%
EM - 3 Gaussian Mixture     71%                 72%
EM - 4 Gaussian Mixture     68%                 67%
MED - 2 Gaussian Mixture    74%                 79%
MED - 3 Gaussian Mixture    78%                 78%
MED - 4 Gaussian Mixture    77%                 77%

Table 5.4: EM and latent MED Performance on the Pima Indians Data Set.

                            Training Accuracy   Testing Accuracy
SVM - 1st Order             76%                 79%
SVM - 2nd Order             77%                 80%
SVM - 3rd Order             79%                 78%
SVM - 4th Order             84%                 76%

Table 5.5: SVM with Polynomial Kernel Performance on the Pima Indians Data Set.

5.9 Beyond Simple Mixtures

It is clear that the mixture of exponential family distributions we have specified in this chapter does not capture all latent model situations that are typical in machine learning. For example, it is difficult to represent the latencies that would arise in a structured graphical model with this flat mixture. Chapter 6 expands on the mixture model we have seen such that it can encompass latent Bayesian networks and structured mixture model situations.
This will permit us to consider structures such as hidden Markov models. The Jensen and reverse-Jensen inequalities will be reiterated for such models and we shall show efficient algorithms for computing the bounds' parameters in polynomial time. Subsequently, Chapter 7 goes into the derivation details of the reverse-Jensen inequality, which was put forward without proof in this chapter. Therein we show the proof for the case of structured mixtures as in Chapter 6, since it subsumes the case of flat mixtures in this chapter.

Chapter 6

Structured Mixture Models

In the previous chapter, we addressed the problem of discrimination with a standard flat mixture model. The intractabilities that resulted from this model were avoided by utilizing upper and lower bounds that mapped the mixture of exponential family model into a standard exponential family form, permitting monotonically convergent maximum conditional likelihood and iterative latent MED applications. However, the mixture models we described earlier were limited and cannot span the full spectrum of latent models such as latent Bayesian networks, hidden Markov models and so forth. Additional flexibility is required in the mixture model such that it can accommodate these so-called structured models. A structured model, as depicted in Figure 6.1, differs from a flat mixture model in that the latent variables are not a simple parent of the observable variables. Instead, the latent variables and the observables may have other dependencies, such as a Markov structure. From a mathematical point of view, the flat mixture model we previously considered only had parameters that were independent across each element in the summation (within the log-sum). In many latent model situations, particularly when we are dealing with a structured graphical model, the elements in the mixture will have tied parameters.
Figure 6.1: Graph of a Structured Mixture Model (HMM). [The graph shows hidden states S1, ..., S5 forming a Markov chain over time steps T=1, ..., T=5, each emitting an observation X1, ..., X5.]

In this chapter, we will begin by motivating a more complicated class of mixtures of the exponential family that goes beyond Equation 5.1. This is done by noting that a hidden Markov model cannot be described in the mixture model framework we previously encountered, and subsequently proposing a more appropriate structured mixture or, alternatively, what we call a mixture of mixtures. The Jensen and reverse-Jensen inequalities are then explicated for this case and effectively map the latent model back to a standard exponential family form. We then go into details on the computation of the reverse-Jensen inequality for a hidden Markov model and resolve important issues of efficiency (i.e. via dynamic programming and other efficient algorithms). Then, we show some illustrations of the CEM algorithm applied to an HMM. Further results on CEM with HMMs are elaborated in Chapter 8. We then discuss the case of latent Bayesian networks in general. Finally, as an interesting academic exercise (which may be skipped by the reader), we show how a summation or data set of log mixture models can be mapped into a single structured mixture model.

6.1 Hidden Markov Models

Consider the case above where we are dealing with a hidden Markov model as in Figure 6.1. This model, and many latent Bayesian networks like it, can be seen as a mixture of exponential families. An important result [34] is that tree-structured dependencies between variables that are themselves in the exponential family form an aggregate exponential-family distribution. Therefore, we can extend the Jensen and reverse-Jensen bounds to Bayesian networks or directed acyclic graphs (DAGs) which have a general tree structure.
However, it remains crucial to map the probability distribution of the HMM (or another latent Bayesian network) into a mixture of naturally parameterized exponential family distributions to fully take advantage of this property and to apply the reverse-Jensen bounds (just as it was crucial to work with natural parameterizations to clearly see the bounds on flat mixtures in the previous chapter).

A hidden Markov model¹ or doubly-stochastic automaton is an interesting mixture of exponential family distributions and is another important 'client model' for the reverse-Jensen bounds we have proposed for discriminative learning. An HMM typically involves a sequence X of T output vectors (o_1, ..., o_T) living in a D-dimensional space. An M-state Markov model has state vectors (s_1, ..., s_T) which identify, for each t ∈ [1..T], which state m ∈ [1..M] the model is in. It can generally be described by the following pdf (if we assume we know S, i.e. the states are not hidden):

p(X,S) = p(s_1) p(o_1|s_1) \prod_{t=2}^{T} p(s_t|s_{t-1}) p(o_t|s_t)

The choices for the component distributions (i.e. the transition distributions p(s_t|s_{t-1}) and emission distributions p(o_t|s_t)) are either multinomials, Gaussians or another member of the exponential family. Selecting a multinomial model for p(s_t|s_{t-1}) gives us a stochastic finite state automaton (an HMM). Selecting a Gaussian model for p(s_t|s_{t-1}) generates a linear dynamic system (an LDS) or a Kalman filter. If the output emission distribution p(o_t|s_t) is chosen to be Gaussian, we expect continuous vector outputs from the automaton. If the output distribution is multinomial, the automaton generates discrete symbols. Either way, as long as the component distributions are in the exponential family, any product of the exponential distributions remains in the exponential family as well. Since we deal with hidden Markov models where S is not observed (it is a latent or hidden variable), a marginalization is involved.
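For a fixed, fully observed state path, the joint density above factors into initial-state, transition and emission terms and is trivial to evaluate in log-space. A small sketch (not the thesis code), assuming scalar unit-variance Gaussian emissions and illustrative parameter values:

```python
import math

def hmm_joint_log_prob(obs, states, pi, trans, means):
    """log p(X,S) = log p(s1) + log p(o1|s1) + sum_t [log p(st|st-1) + log p(ot|st)]
    for an HMM with multinomial transitions and unit-variance Gaussian emissions,
    with the state path `states` assumed observed."""
    def log_gauss(o, mu):
        return -0.5 * (o - mu) ** 2 - 0.5 * math.log(2 * math.pi)
    lp = math.log(pi[states[0]]) + log_gauss(obs[0], means[states[0]])
    for t in range(1, len(obs)):
        lp += math.log(trans[states[t - 1]][states[t]])   # transition term
        lp += log_gauss(obs[t], means[states[t]])         # emission term
    return lp
```

The marginalization over hidden S discussed next sums this quantity (in probability space) over every possible state path.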
Thus, we sum over all possible settings of S (i.e. a total of about M^T configurations). This summation need not be inefficient, though, and can be computed via recursion and tree-based algorithms. This is represented as follows:

p(X) = \sum_S p(X,S) = \sum_{s_1=1}^{M} \cdots \sum_{s_T=1}^{M} p(X,S)

This is the likelihood of data under a hidden Markov model. It is well known that EM and Jensen can generate a lower bound on the probability distribution above which has a simple (non-latent) e-family form. To perform maximum conditional likelihood or discrimination, though, we need to upper bound the log-likelihood. First, we shall begin by expanding the above:

p(X) = \sum_S p(X,S) = \sum_{s_1=1}^{M} \cdots \sum_{s_T=1}^{M} p(s_1) p(X_1|s_1) \prod_{t=2}^{T} p(s_t|s_{t-1}) p(X_t|s_t)

¹The HMM can be taken more generally as a hidden state machine (which we can take noisy measurements of) evolving with Markovian dynamics. Therefore, a linear dynamical system (LDS) can be treated in a similar way.

At this point, we will be more specific for the sake of clarity and define the actual distributions of the HMM. The following development does not cause a loss of generality. We shall assume that we are dealing with an HMM with Gaussian emissions (where the means of each Gaussian are to be estimated while the covariances are locked at identity) and we have multinomial models to describe the transition matrix from hidden state to hidden state (unlike a Kalman filter, which would require a Gaussian on the state transition probabilities). In the above, the state transition probability is typically given as a multinomial (or transition matrix). This state transition can be expressed in the following standard form or in its natural exponential family form:

p(s_t|s_{t-1}=m) = \prod_{i=1}^{M} ((α_m)_i)^{(s_t)_i} = \exp( A(x_t) + η_m^T x_t − K(η_m) )

As before, we will use the multinomial in its exponential family form to compute the reverse-Jensen bound.
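The recursion alluded to above is the standard forward algorithm, which computes the path sum in O(TM²) rather than O(M^T). A minimal sketch (not the thesis code) for a small unit-variance Gaussian-emission HMM, checked against explicit path enumeration, which is only feasible for tiny T:

```python
import math
from itertools import product

def gauss(o, mu):
    return math.exp(-0.5 * (o - mu) ** 2) / math.sqrt(2 * math.pi)

def hmm_log_lik_forward(obs, pi, trans, means):
    """log p(X) via the forward recursion: alpha_t(j) accumulates the
    probability of the prefix o_1..o_t ending in state j."""
    M = len(pi)
    alpha = [pi[m] * gauss(obs[0], means[m]) for m in range(M)]
    for t in range(1, len(obs)):
        alpha = [gauss(obs[t], means[j]) *
                 sum(alpha[i] * trans[i][j] for i in range(M))
                 for j in range(M)]
    return math.log(sum(alpha))

def hmm_log_lik_brute(obs, pi, trans, means):
    """Reference: explicit sum over every one of the M^T state paths."""
    M, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(M), repeat=T):
        p = pi[path[0]] * gauss(obs[0], means[path[0]])
        for t in range(1, T):
            p *= trans[path[t - 1]][path[t]] * gauss(obs[t], means[path[t]])
        total += p
    return math.log(total)
```

For longer sequences the forward recursion would be rescaled at every step to avoid underflow, exactly as in the standard Baum-Welch implementation.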
Therein, x_t is a vector of length M − 1 which is all zeros and contains a '1' at the k'th index if the current state is s_t = k. If the current state is s_t = M, then x_t is all zeros. Once again, η_m is related to the α_m multinomial model through a simple transformation. The A() function and the K() function correspond to the standard ones for the multinomial as given in Table 25. To be more specific, we write the transition probabilities as follows, where the η_m and the x values are indexed by the appropriate s_{t-1} and s_t state labels respectively:

p(s_t|s_{t-1}=m) = \exp( A(x(s_t)) + η(s_{t-1})^T x(s_t) − K(η(s_{t-1})) )

The Gaussian in exponential family form is straightforward and gives the emission probability:

p(X_t|s_t=m) = \exp( A(X_t) + θ_m^T X_t − K(θ_m) ) = \exp( A(X_t) + θ(s_t)^T X_t − K(θ(s_t)) )

For simplicity, we can assume the prior over initial states p(s_1) is fixed and equal for all states, p(s_1) = 1/M. Thus, at this point, we can write the HMM's likelihood as a mixture of exponential family distributions as follows:

p(X) = \sum_{s_1=1}^{M} \cdots \sum_{s_T=1}^{M} \frac{1}{M} \exp\left( \sum_{t=1}^{T} \left[ A(X_t) + θ(s_t)^T X_t − K(θ(s_t)) \right] + \sum_{t=2}^{T} \left[ A(x(s_t)) + η(s_{t-1})^T x(s_t) − K(η(s_{t-1})) \right] \right)

The above mixture cannot be cast in the form introduced in Chapter 5 (in Equation 5.1) since each of the M elements in the mixture there had its own exponential family model, Θ_m. The above mixture has approximately M^T components yet only 2M models (M Gaussian mean parameters, θ_m, and M multinomial transition parameters, η_m). Therefore, we need a mixture model form which involves summing many e-family distributions where the models can be replicated while the data varies. In other words, consider the form below, where we have indexed data vectors X with both m and n and sum over m = 1..M as well as n = 1..N:

p(X|Θ) = \sum_{m=1}^{M} \sum_{n=1}^{N} α_{mn} \exp( A_m(X_{mn}) + X_{mn}^T Θ_m − K_m(Θ_m) )        (6.1)

Equation 6.1 does appear to be a strange mixture model indeed. The latent variables (i.e.
the indices m and n) no longer have the simple intuitive statistical meaning that the latencies in Equation 5.1 carried. However, the mathematical form is the important generalization, and we now have the flexibility to describe the probability distribution of an HMM (or other latent Bayesian networks) using the notation in Equation 6.1. Here, we can reuse a given exponential family model Θ_m in different components of the mixture while varying the data that it interacts with (namely the X_mn). The X_mn are various vectors of the same cardinality as Θ_m which can be explicitly computed from the original observations X and the index variables m and n. For example, we may think of them as the result of an arbitrary function that operates on X, i.e. X_mn = f_mn(X). The form above will be called a mixture of mixtures. It is clear that the setting N = 1 makes the above mixture model identical to the original one in Equation 5.1 of Chapter 5. We will now go into the details of the above form and derive the parameters for the corresponding Jensen and reverse-Jensen inequalities. Subsequently, we will explicitly put the HMM in the form of Equation 6.1 and show how these parameters can be obtained tractably and efficiently by taking advantage of the independency structure of the HMM's directed graphical model.

6.2 Mixture of Mixtures

We have motivated the use of the double mixture or mixture of mixtures. We will see that this more elaborate mixture can be bounded with the same form of upper and lower bounds as the flat mixture model in Chapter 5. We will compute the same (or analogous) parameters for the bounds as derived previously and obtain similar terms, such as w̃_m, Ỹ_m, k̃ for the Jensen bounds as well as w_m, Y_m, k for the reverse-Jensen bounds.
The actual computation of the bounds is slightly different from that shown in Chapter 5 and is a natural generalization of the previous formulas (which are recovered exactly from those for the mixture of mixtures under the setting N = 1). Once again, we will work in log-space and consider bounds on this (more elaborate) log-sum:

\log p(X|Θ) = \log \left( \sum_{m=1}^{M} \sum_{n=1}^{N} α_{mn} \exp( A_m(X_{mn}) + X_{mn}^T Θ_m − K_m(Θ_m) ) \right)        (6.2)

6.2.1 Jensen Bounds

Applying Jensen to the log mixture of mixtures \log p(X|Θ) lower bounds it with the same functional form as in Chapter 5, except with different parameter definitions:

\log \sum_m \sum_n α_{mn} \exp( A_m(X_{mn}) + X_{mn}^T Θ_m − K_m(Θ_m) ) ≥ \sum_m w̃_m ( A_m(Ỹ_m) + Ỹ_m^T Θ_m − K_m(Θ_m) ) + k̃

The right-hand term is the familiar Q(Θ, Θ̃) function that is derived in the EM algorithm. This bound is not static; it is a variational lower bound which is always below the original function yet makes tangential contact with it at a specified current configuration Θ̃. For a given Θ̃, we again have to find settings for the parameters of the bound (w̃_m, Ỹ_m, k̃). Here, k̃ is a scalar, the w̃_m are positive scalar weights on the data, and the Ỹ_m are virtual data vectors (i.e. translated mixtures of the original data points):

k̃ = \log p(X|Θ̃) − \sum_m w̃_m ( A_m(Ỹ_m) + Ỹ_m^T Θ̃_m − K_m(Θ̃_m) )

Ỹ_m = −\frac{1}{w̃_m} \sum_n h_{mn} \left( \left.\frac{∂K_m(Θ_m)}{∂Θ_m}\right|_{Θ̃_m} − X_{mn} \right) + \left.\frac{∂K_m(Θ_m)}{∂Θ_m}\right|_{Θ̃_m} = \frac{\sum_n h_{mn} X_{mn}}{\sum_n h_{mn}}

w̃_m = \sum_n h_{mn}

For convenience, the above uses the h_{mn}, which are referred to as responsibilities. These are positive scalars defined as follows:

h_{mn} = \frac{ α_{mn} \exp( A_m(X_{mn}) + X_{mn}^T Θ̃_m − K_m(Θ̃_m) ) }{ \sum_{m'} \sum_{n'} α_{m'n'} \exp( A_{m'}(X_{m'n'}) + X_{m'n'}^T Θ̃_{m'} − K_{m'}(Θ̃_{m'}) ) }        (6.3)

We obtain the parameters k̃ and Ỹ_m by making sure that the bound is equal to the original function \log p(X|Θ) at Θ̃ and that their gradients are equal there as well. The formula for w̃_m naturally results from subsequently using Jensen's inequality. It is clear that from only local calculations, we obtain a global lower bound on \log p(X|Θ).
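The Jensen bound above can be verified numerically in a minimal setting: a uniform two-component Gaussian-mean mixture with a single observation x (so N = 1, X_m1 = x, and consequently Ỹ_m = x and w̃_m = h_m). This is an illustrative sketch, not the thesis code; for the Gaussian mean, A(x) = −x²/2 − ½log 2π and K(θ) = θ²/2.

```python
import math

A = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)  # Gaussian e-family A(x)
K = lambda th: 0.5 * th * th                               # Gaussian cumulant K(theta)

def log_mix(x, thetas):
    """The log-sum: log of a uniform mixture of unit-variance Gaussians."""
    M = len(thetas)
    return math.log(sum(math.exp(A(x) + x * th - K(th)) / M for th in thetas))

def jensen_params(x, tt):
    """Responsibilities h_m, weights w~_m = h_m, virtual data Y~_m = x, and the
    constant k~ that makes the bound touch log p(X|Theta) at the contact point tt."""
    M = len(tt)
    g = [math.exp(A(x) + x * th - K(th)) / M for th in tt]
    h = [gi / sum(g) for gi in g]
    w, Y = h, [x] * M
    k = log_mix(x, tt) - sum(w[m] * (A(Y[m]) + Y[m] * tt[m] - K(tt[m]))
                             for m in range(M))
    return w, Y, k

def jensen_bound(thetas, w, Y, k):
    """The EM-style lower bound: sum_m w_m (A(Y_m) + Y_m theta_m - K(theta_m)) + k."""
    return sum(w[m] * (A(Y[m]) + Y[m] * thetas[m] - K(thetas[m]))
               for m in range(len(thetas))) + k
```

A grid check confirms the two defining properties: tangential contact at Θ̃ and a global lower bound elsewhere.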
6.2.2 Reverse Jensen Bounds

Now, if we wish to upper bound the log-sum instead, we use reverse-Jensen:

\log \sum_m \sum_n α_{mn} \exp( A_m(X_{mn}) + X_{mn}^T Θ_m − K_m(Θ_m) ) ≤ \sum_m (−w_m) ( A_m(Y_m) + Y_m^T Θ_m − K_m(Θ_m) ) + k

In this case the parameters are:

k = \log p(X|Θ̃) + \sum_m w_m ( A_m(Y_m) + Y_m^T Θ̃_m − K_m(Θ̃_m) )

Y_m = \frac{1}{w_m} \sum_n h_{mn} \left( \left.\frac{∂K_m(Θ_m)}{∂Θ_m}\right|_{Θ̃_m} − X_{mn} \right) + \left.\frac{∂K_m(Θ_m)}{∂Θ_m}\right|_{Θ̃_m}

w'_m = \min w'_m \ \text{such that} \ \frac{1}{w'_m} \sum_n h_{mn} \left( K'_m(Θ̃_m) − X_{mn} \right) + K'_m(Θ̃_m) \ \text{lies in the gradient space of} \ K_m

w_m = 4 \, G\!\left( \frac{ \sum_n h_{mn} Z_{mn}^T Z_{mn} }{ 2 \max_n Z_{mn}^T Z_{mn} } \right) \max_n Z_{mn}^T Z_{mn} + w'_m

\text{where} \quad Z_{mn} = K''(Θ̃_m)^{−1/2} ( X_{mn} − K'(Θ̃_m) )

Once again, the function G(γ) is given by Equation 5.10. This bound effectively re-weights (w_m) and translates (Y_m) incomplete data to obtain complete data. The w_m are positive weights and we pick the smallest w_m and w'_m which still satisfy the last two simple conditions. We can always set w_m larger than the conditions require, but smaller w_m values mean a tighter bound. Furthermore, increasing the values of the terms \max_n Z_{mn}^T Z_{mn} or \sum_n h_{mn} Z_{mn}^T Z_{mn} will also yield guaranteed yet more conservative (looser) bounds. This may be necessary if these terms are too complicated to compute exactly while upper bounds on them may be efficient to estimate.

The first condition requires that w'_m generate a valid Y_m that lives in the gradient space of the K functions (a typical e-family constraint)². Thus, from local computations of the log-sum's values, gradients and Hessians at the current Θ̃, we can compute global upper bounds. We recognize that the recipes for Y_m and k here are quite similar to those of the Jensen bound, except that we have solved for different w_m values to generate the dual inequality. The similarity is due to the fact that both bounds are variational and make tangential contact at Θ̃. Thus, their values and gradients are equal to those of the original function \log p(X|Θ) at Θ̃, hence the natural coupling of the linear parameters Y_m and k.
It is clear that from only local calculations, we obtain a global upper bound on \log p(X|Θ). In fact, the reverse-Jensen inequality's computational effort (and tightness) hinges on our ability to compute w_m (through summations and maximizations over the transformed data). This is because Y_m is given by a closed-form formula thereafter (the scalar k parameter is trivial and usually irrelevant). The simple formula for Y_m requires no summations over the data and is given by a straightforward manipulation of the definition of Y_m in the reverse-Jensen inequality and the definition of Ỹ_m in the traditional Jensen inequality:

Y_m = \left( \frac{w̃_m}{w_m} + 1 \right) K'(Θ̃_m) − \frac{w̃_m}{w_m} Ỹ_m

We can also use this definition to quickly isolate the requirements on the positive scalar w'_m which guarantees that the virtual data Y_m remains in the gradient space of the cumulant-generating function. For example, in the case of the Gaussian mean, w'_m is 0 since we have no constraints on the gradient space: K(Θ) = \frac{1}{2} θ^T θ, which spans all gradients. In the case of the multinomial, the cumulant-generating function is K(Θ) = η \log( 1 + \sum_i \exp(θ_i) ). Therefore, each of the Y_m vector's elements is restricted to [0, η) and the sum of the elements of Y_m is also restricted to the range [0, η).

The Appendix contains a tighter formulation than the one above. The G(γ) function is actually an upper bound on a tighter reverse-Jensen solution for w_m which involves numerical table lookups (this is detailed in the derivation of the reverse-Jensen inequality in the next chapter and in the Appendix). Furthermore, in practice, we omit the multiplicative 4 (i.e. make it 1 or less) in the definition of w_m which scales the G(γ) function and still obtain very reasonable bounds (which we cannot analytically guarantee, however).
We can also employ a simpler (yet looser) version of the above bounds, as described in [94]:

w_m = \max_n \left( X_{mn} − K'(Θ̃_m) \right)^T K''(Θ̃_m)^{−1} \left( X_{mn} − K'(Θ̃_m) \right) + w'_m        (6.4)

The reverse-Jensen inequality as well as the above looser bound are both derived thoroughly in Chapter 7. The above simpler bound is based on a sloppy curvature check. Although curvature constraints are easy ways to construct bounds, they may be too conservative and generate loose bounds. It is best to avoid curvature checking in bound derivations or to defer it to only when absolutely necessary. In the following, we outline some heuristics for avoiding the full computation of w_m, which may be crucial when we are dealing with latent models that are, for example, structured and do not permit easy computation of the sum and max of the inner products Z_{mn}^T Z_{mn}.

²In the Gaussian case, the bounds can be made tighter by letting w_m be a full matrix and ignoring the contribution of w'_m.

Simplifying Heuristics

There are many heuristics to avoid the extra computations involved in obtaining the w_m parameters for the reverse-Jensen inequality. One way is to avoid the maximization over the data, which may be cumbersome (i.e. in hidden Markov models where the max is over an exponential number of data configurations). For example, we can use the following simpler (yet tighter) bound, which is no longer globally guaranteed but only locally guaranteed:

w_m = \sum_n h_{mn} \left( X_{mn} − K'(Θ̃_m) \right)^T K''(Θ̃_m)^{−1} \left( X_{mn} − K'(Θ̃_m) \right) + w'_m        (6.5)

Here, only an average over the data is needed (i.e. no longer a maximization). Thus, we can implement an HMM with a forward-backward type of algorithm instead of also having to solve the maximization problem \max_n Z_{mn}^T Z_{mn}. At the end of this chapter we derive both the max and the summation for the HMM case.
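To make the recipe concrete, the following sketch instantiates the reverse-Jensen upper bound in the same minimal setting used earlier for the Jensen bound: a uniform two-component Gaussian-mean mixture with one observation x, using the looser width of Equation 6.4 (for the Gaussian mean w'_m = 0 and K'' = 1, so w_m = (x − θ̃_m)²). This is an illustrative sketch under those assumptions, not the thesis code; it further assumes x ≠ θ̃_m so that w_m > 0.

```python
import math

A = lambda x: -0.5 * x * x - 0.5 * math.log(2 * math.pi)  # Gaussian e-family A(x)
K = lambda th: 0.5 * th * th           # Gaussian cumulant, K'(th) = th, K'' = 1

def log_mix(x, thetas):
    M = len(thetas)
    return math.log(sum(math.exp(A(x) + x * th - K(th)) / M for th in thetas))

def reverse_jensen_params(x, tt):
    """Bound parameters using the looser width of Equation 6.4:
    Z_m = x - theta~_m, w'_m = 0, hence w_m = (x - theta~_m)^2 (assumed nonzero).
    Y_m then follows the reverse-Jensen recipe, and k enforces tangency at tt."""
    M = len(tt)
    g = [math.exp(A(x) + x * th - K(th)) / M for th in tt]
    h = [gi / sum(g) for gi in g]
    w = [(x - tt[m]) ** 2 for m in range(M)]                    # Equation 6.4
    Y = [tt[m] + (h[m] / w[m]) * (tt[m] - x) for m in range(M)]
    k = log_mix(x, tt) + sum(w[m] * (A(Y[m]) + Y[m] * tt[m] - K(tt[m]))
                             for m in range(M))
    return w, Y, k

def reverse_jensen_bound(thetas, w, Y, k):
    """The upper bound: sum_m (-w_m)(A(Y_m) + Y_m theta_m - K(theta_m)) + k."""
    return sum(-w[m] * (A(Y[m]) + Y[m] * thetas[m] - K(thetas[m]))
               for m in range(len(thetas))) + k
```

Note how the virtual data Y_m sits on the opposite side of K'(θ̃_m) from the observation, i.e. the bound "flips" the data through the contact point, as the coupling formula for Y_m above indicates.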
An additional possible simplification is to avoid computing the w_m parameter altogether by merely setting it to the w̃_m from the traditional Jensen inequality, i.e.:

w_m = w̃_m + w'_m

This assumes that the upper bound has the same width and overall shape (modulo a flip and a translation) as the corresponding lower bound, which may sometimes be a reasonable assumption. Evidently, this lazily requires no extra computation beyond the usual Jensen inequality. A few such iterations are acceptable at the beginning of an iterative algorithm and help accelerate convergence in the early stages. It is wise, however, to eventually switch to the guaranteed bounds (involving the G() function) to avoid divergence.

Visualization

In this visualization, we merely show the log-sum with several random scalar values of x_t data points under a single Gaussian model, a single Poisson distribution and finally a single exponential model. Figures 6.2, 6.3, and 6.4 depict the Jensen lower bound, the reverse-Jensen upper bound as well as the original distribution. Here, for visualization purposes, we have chosen to represent the bounds in product space for the Gaussian (and logarithmic space for the others).

Figure 6.2: Jensen (dashed cyan) and reverse-Jensen (solid blue) bounds on the original mixture of mixtures distribution (thick red dots). The Gaussian model is shown with a mixture of data points in a product scale (as opposed to the logarithmic scale). [Five panels plot P(X|Θ) against Θ.]
Figure 6.3: Jensen (dashed cyan) and reverse-Jensen (solid blue) bounds on the original mixture of mixtures distribution (thick red dots). The Poisson model is shown with a mixture of data points in a logarithmic scale. [Five panels plot log P(X|Θ) against Θ.]

Figure 6.4: Jensen (dashed cyan) and reverse-Jensen (solid blue) bounds on the original mixture of mixtures distribution (thick red dots). The exponential distribution is shown with a mixture of data points in a logarithmic scale. [Five panels plot log P(X|Θ) against Θ.]

6.3 Reverse Jensen Inequality for Hidden Markov Models

At this point, we resume our development of hidden Markov models and more explicitly cast them in the desired mixture of mixtures form (Equation 6.1). Since applying Jensen's inequality to HMMs is straightforward from the standard EM and Baum-Welch literature [13] [12] [156] [102], we will only go into the details of the reverse-Jensen inequality case. Recall that we had the following probability density function for the HMM:

p(X) = \sum_{s_1=1}^{M} \cdots \sum_{s_T=1}^{M} \frac{1}{M} \exp\left( \sum_{t=1}^{T} \left[ A(X_t) + θ(s_t)^T X_t − K(θ(s_t)) \right] + \sum_{t=2}^{T} \left[ A(x(s_t)) + η(s_{t-1})^T x(s_t) − K(η(s_{t-1})) \right] \right)

We shall now introduce indicator functions that will clarify the above notation. Consider the case when s_t = m; this can be represented by a delta function that is unity only when s_t = m and is zero otherwise, i.e. δ(s_t = m).
This allows us to rewrite the above as:

p(X) = \sum_{s_1=1}^{M} \cdots \sum_{s_T=1}^{M} \frac{1}{M} \exp\left( \sum_{m=1}^{M} \sum_{t=1}^{T} \left[ δ(s_t=m) A(X_t) + δ(s_t=m) θ_m^T X_t − δ(s_t=m) K(θ_m) \right] + \sum_{m=1}^{M} \sum_{t=2}^{T} \left[ δ(s_{t-1}=m) A(x(s_t)) + δ(s_{t-1}=m) η_m^T x(s_t) − δ(s_{t-1}=m) K(η_m) \right] \right)

Therefore, in the above, we are summing many exp() functions, each of which is an exponential family member. Each e-family member consists of an inner product with an aggregate data vector and an aggregate K-type partition function. Our parameters for this hidden Markov model are M multinomial parameters and M Gaussian emission parameters. The aggregate parameter vector Θ can be seen as all these parameter vectors spliced together. The above can thus also be expressed as:

p(X) = \sum_{s_1=1}^{M} \cdots \sum_{s_T=1}^{M} \frac{1}{M} \exp\left( A(X_s) + Θ^T X_s − K(Θ) \right)

To compute the reverse-Jensen bound over this structure we must be efficient, since explicitly enumerating all the terms in the sum will cause intractabilities. To compute the reverse bounds, it suffices to show how to compute the w_m parameters. The Y_m and k parameters of the reverse bound will not result in intractable computation once we have the w_m. We assume that obtaining the usual Jensen bounds is also feasible, giving us w̃_m, Ỹ_m and k̃. If we have solved for w_m in the reverse-Jensen case, obtaining the equivalent Y_m is given by the following efficient formula³:

Y_m = \left( \frac{w̃_m}{w_m} + 1 \right) K'(Θ̃_m) − \frac{w̃_m}{w_m} Ỹ_m

Obtaining the corresponding scalar parameter k for the reverse-Jensen case is then trivial. The crucial parameters to compute are the w_m, which determine the width of the reverse-Jensen bounds for the M Gaussian and the M multinomial models. Thus, we need to solve for 2M scalars. To distinguish the multinomial width parameters from those of the Gaussians, we will use w̄_m̄ with m̄ = 1..M to index the multinomial models and w_m with m = 1..M to index the Gaussian models.
We now show how to estimate these computationally critical w_m for the Gaussian parameters of the HMM as well as the corresponding w̄_m̄ for the multinomial parameters of the HMM.

6.3.1 Bounding the HMM Gaussians

Now, let us consider each component of the Θ vector individually. First, consider the m'th Gaussian emission component being dot-producted in the exponential:

\exp\left( \ldots + θ_m^T \sum_{t=1}^{T} δ(s_t=m) X_t − \ldots \right)

Therefore, the Z_{ms} vector (from the previous reverse-Jensen bound definition and its nomenclature) that corresponds to it is:

Z_{ms} = K''(θ̃_m)^{−1/2} \left( \sum_{t=1}^{T} δ(s_t=m) X_t − \sum_{t=1}^{T} δ(s_t=m) K'(θ̃_m) \right)

However, in the Gaussian case, we have K(θ_m) = \frac{1}{2} θ_m^T θ_m, so the above quickly simplifies into:

Z_{ms} = \sum_{t=1}^{T} δ(s_t=m) ( X_t − θ̃_m )

³The formula for Y_m here is straightforward to derive, simply by starting from the definitions of the Jensen and reverse-Jensen bounds.

Now, to compute the w_m (and later the w̄_m̄ for the multinomials) that correspond to the necessary reverse-Jensen bounds, we need to efficiently evaluate the following expressions or upper bounds on them:

\sum_s h_s Z_{ms}^T Z_{ms} \qquad \text{and} \qquad \max_s Z_{ms}^T Z_{ms}

Above, we recognize the responsibility terms h_s, which are the probability posteriors over the latent state paths at the previous model setting Θ̃ given the observations {X} or X_1, ..., X_T. Thus, they can be written as h_s = p(s_1, ..., s_T | {X}, Θ̃). Let us now expand out the required inner products in the above computations:

Z_{ms}^T Z_{ms} = \left( \sum_{t=1}^{T} δ(s_t=m) (X_t − θ̃_m) \right)^T \left( \sum_{τ=1}^{T} δ(s_τ=m) (X_τ − θ̃_m) \right)

For one of the terms of the reverse-Jensen bound, we need the max over all such possible inner products.
Thus we need to compute the maximum over any path of the above quantity:

\max_s Z_{ms}^T Z_{ms} = \max_s \left( \sum_{t=1}^{T} δ(s_t=m)(X_t − θ̃_m) \right)^T \left( \sum_{τ=1}^{T} δ(s_τ=m)(X_τ − θ̃_m) \right)

Expanding the above, we obtain:

\max_s Z_{ms}^T Z_{ms} = \max_s \sum_{t=1}^{T} \sum_{τ=1}^{T} δ(s_t=m) δ(s_τ=m) (X_t − θ̃_m)^T (X_τ − θ̃_m)

The discrete optimization required here would probably require integer programming to solve exactly (probably some variant of the knapsack problem). However, we can find an upper bound on the desired quantity with a simple quadratic program. To do so, we simply introduce the following λ-scalar variables:

λ_t := δ(s_t = m)

We do this for all time steps t ∈ [1, ..., T]. These λ_t are bounded over λ_t ∈ [0, 1] and thus range over a superset of the binary variables. Therefore, maximizing over this larger space of variables will yield the true \max_s Z_{ms}^T Z_{ms} or an upper bound on it. Thus, we can replace the above optimization with:

\max_s Z_{ms}^T Z_{ms} ≤ \max_{λ_1,...,λ_T} \sum_{t=1}^{T} \sum_{τ=1}^{T} λ_t λ_τ (X_t − θ̃_m)^T (X_τ − θ̃_m) \qquad ∀ λ_t ∈ [0, 1]

The above is a quadratic program over T variables with 2T inequality constraints which can be solved in a straightforward manner (subject to the box constraints λ_t ∈ [0, 1]). It is also evident that the maximum usually lies outside the convex hull of constraints and will typically cause the λ_t to rail to their extremes at the end of the optimization (except in degenerate situations). Therefore, we get a more conservative value for the maximum data magnitude which is still a guaranteed upper bound.

Now, we detail the computation of the other component of the w_m bound parameter: the 'expected magnitude' (instead of the max):

\sum_s h_s Z_{ms}^T Z_{ms} = \sum_s h_s \left( \sum_{t=1}^{T} δ(s_t=m)(X_t − θ̃_m) \right)^T \left( \sum_{τ=1}^{T} δ(s_τ=m)(X_τ − θ̃_m) \right)

Expanding the above, we obtain:
\sum_s h_s Z_{ms}^T Z_{ms} = \sum_{t=1}^{T} \sum_{τ=1}^{T} \left( \sum_s h_s δ(s_t=m) δ(s_τ=m) \right) (X_t − θ̃_m)^T (X_τ − θ̃_m) = \sum_{t=1}^{T} \sum_{τ=1}^{T} p(s_t=m, s_τ=m | \{X\}, Θ̃) (X_t − θ̃_m)^T (X_τ − θ̃_m)

In the above, we have introduced the marginal distribution p(s_t=m, s_τ=m | {X}, Θ̃) over the state at time t being m and the state at time τ being m after the HMM observations have been accounted for. This marginal distribution over a state-sequence sub-string is straightforward to compute efficiently since the HMM is a tree-structured graphical model which permits efficient computation of marginal distributions (based on the Baum-Welch forward-backward algorithm). In the case where τ = t, the probability is merely p(s_t=m|{X}, Θ̃), which is equal to the traditional normalized α̂_t(m) β̂_t(m) product from the familiar forward-backward algorithm [156].

For completeness, we will show how to compute the marginal distribution p(s_t, s_τ|{X}), where we will assume, without loss of generality, that t < τ. Several quantities are easy to compute after the forward-backward algorithm. One of them is the marginal over a single state: p(s_t|{X}) = α̂_t β̂_t. Similarly, there is a direct formula for the marginal over a pair of subsequent states (where c_t is Rabiner's so-called scaling factor): p(s_{t+1}, s_t|{X}) = c_{t+1} α̂(s_t) p(s_{t+1}|s_t) p(X_{t+1}|s_{t+1}) β̂(s_{t+1}). Using Bayes' rule gives conditionals on pairs of subsequent states: p(s_{t+2}|s_{t+1}, {X}) = p(s_{t+2}, s_{t+1}|{X}) / p(s_{t+1}|{X}). Thus, we can obtain the following three-way marginal by multiplying a pairwise marginal with a conditional: p(s_{t+2}, s_{t+1}, s_t|{X}) = p(s_{t+2}|s_{t+1}, {X}) p(s_{t+1}, s_t|{X}). To obtain p(s_{t+2}, s_t|{X}) we merely sum the latter marginal over s_{t+1}. This reasonably efficient process is iterated until we reach p(s_t, s_τ|{X}) and does not require intractable computation.
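The chaining procedure just described can be sketched directly. The code below computes p(s_t, s_τ | X) by multiplying conditionals of adjacent pairs and summing out intermediate states, and checks the result against brute-force path enumeration. This is a pure-Python illustration with scalar unit-variance Gaussian emissions, not the thesis implementation; local normalizations are used in place of tracking Rabiner's scaling constants explicitly.

```python
import math
from itertools import product

def gauss(o, mu):
    return math.exp(-0.5 * (o - mu) ** 2) / math.sqrt(2 * math.pi)

def pairwise_marginal(obs, pi, trans, means, t, tau):
    """p(s_t = i, s_tau = j | X) for t < tau: forward-backward gives adjacent pair
    marginals, Bayes' rule gives p(s_{u+1} | s_u, X), intermediate states summed out."""
    M, T = len(pi), len(obs)
    # scaled forward pass
    alpha, c = [[0.0] * M for _ in range(T)], [0.0] * T
    for m in range(M):
        alpha[0][m] = pi[m] * gauss(obs[0], means[m])
    c[0] = 1.0 / sum(alpha[0]); alpha[0] = [a * c[0] for a in alpha[0]]
    for u in range(1, T):
        for j in range(M):
            alpha[u][j] = gauss(obs[u], means[j]) * sum(
                alpha[u - 1][i] * trans[i][j] for i in range(M))
        c[u] = 1.0 / sum(alpha[u]); alpha[u] = [a * c[u] for a in alpha[u]]
    # scaled backward pass
    beta = [[1.0] * M for _ in range(T)]
    for u in range(T - 2, -1, -1):
        for i in range(M):
            beta[u][i] = c[u + 1] * sum(trans[i][j] * gauss(obs[u + 1], means[j])
                                        * beta[u + 1][j] for j in range(M))
    def single(u):     # p(s_u | X), normalized alpha-beta product
        p = [alpha[u][m] * beta[u][m] for m in range(M)]
        s = sum(p); return [v / s for v in p]
    def adjacent(u):   # p(s_u = i, s_{u+1} = j | X)
        p = [[alpha[u][i] * trans[i][j] * gauss(obs[u + 1], means[j]) * beta[u + 1][j]
              for j in range(M)] for i in range(M)]
        s = sum(map(sum, p)); return [[v / s for v in row] for row in p]
    # chain conditionals from t up to tau, summing out the middle states
    P = adjacent(t)
    for u in range(t + 1, tau):
        pu, adj = single(u), adjacent(u)
        cond = [[adj[i][j] / pu[i] for j in range(M)] for i in range(M)]
        P = [[sum(P[i][m] * cond[m][j] for m in range(M)) for j in range(M)]
             for i in range(M)]
    return P

def pairwise_brute(obs, pi, trans, means, t, tau, i, j):
    """Reference: posterior over paths by explicit enumeration."""
    M, T = len(pi), len(obs)
    num = tot = 0.0
    for path in product(range(M), repeat=T):
        p = pi[path[0]] * gauss(obs[0], means[path[0]])
        for u in range(1, T):
            p *= trans[path[u - 1]][path[u]] * gauss(obs[u], means[path[u]])
        tot += p
        if path[t] == i and path[tau] == j:
            num += p
    return num / tot
```

The exactness of the chain rests on the Markov property: given s_u and the full observation sequence, s_{u+1} is independent of all earlier states.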
The above computation can be implemented efficiently with approximately O(T²M²) operations since we need to consider bivariate probability distributions p(s_t, s_τ), which are of order M × M, for every pair of time points in the trellis of length T. Although this is a factor of T slower than the forward-backward computation in regular hidden Markov models, which is typically O(TM²), the above computation is tractable (particularly for short trellis lengths). The computations outlined thus readily give us both components (or legitimate upper bounds on them) of the w_m bound parameter efficiently without doing intractable calculations. These only require a simple quadratic program as well as the results from EM's Baum-Welch computations. We now move to bounding the multinomial parameters.

6.3.2 Bounding the HMM Multinomials

Next, consider the m̄'th multinomial component which is being dot-producted inside the exponential function as follows:

\exp\left( \ldots + η_m̄^T \sum_{t=2}^{T} δ(s_{t-1}=m̄) x(s_t) − \ldots \right)

Therefore, the Z_{m̄s} vector (from the previous reverse-Jensen bound nomenclature) that corresponds to it is:

Z_{m̄s} = K''(η̃_m̄)^{−1/2} \left( \sum_{t=2}^{T} δ(s_{t-1}=m̄) x(s_t) − \sum_{t=2}^{T} δ(s_{t-1}=m̄) K'(η̃_m̄) \right) = \sum_{t=2}^{T} δ(s_{t-1}=m̄) K''(η̃_m̄)^{−1/2} \left( x(s_t) − K'(η̃_m̄) \right)

Now, to compute the w̄_m̄ that correspond to the necessary reverse-Jensen bounds, we need to efficiently evaluate the following:

\sum_s h_s Z_{m̄s}^T Z_{m̄s} \qquad \text{and} \qquad \max_s Z_{m̄s}^T Z_{m̄s}

As before, we can expand the inner products as follows:

Z_{m̄s}^T Z_{m̄s} = \sum_{t=2}^{T} \sum_{τ=2}^{T} δ(s_{t-1}=m̄) δ(s_{τ-1}=m̄) \left( x(s_t) − K'(η̃_m̄) \right)^T K''(η̃_m̄)^{−1} \left( x(s_τ) − K'(η̃_m̄) \right)

Since the x data here correspond to only a choice of one of the M multinomial models, we can rewrite them as follows:

x(s_t) = \sum_{q=1}^{M} δ(s_t=q) x_q

For notational convenience, we also define the following:

z_q = K''(η̃_m̄)^{−1/2} \left( x_q − K'(η̃_m̄) \right)

Thus, the desired inner product term becomes:
Z_{m̄s}^T Z_{m̄s} = \sum_{t=2}^{T} \sum_{τ=2}^{T} δ(s_{t-1}=m̄) δ(s_{τ-1}=m̄) \left( \sum_{q=1}^{M} δ(s_t=q) z_q \right)^T \left( \sum_{r=1}^{M} δ(s_τ=r) z_r \right) = \sum_{t=2}^{T} \sum_{τ=2}^{T} \sum_{q=1}^{M} \sum_{r=1}^{M} δ(s_{t-1}=m̄) δ(s_{τ-1}=m̄) δ(s_t=q) δ(s_τ=r) \, z_q^T z_r

To obtain the desired w̄_m̄ reverse-Jensen bound parameter for the multinomials, we need to compute two terms using the above inner products. These are the expected term (the sum over all the inner products weighted by the h_s probabilities) and the max-type term. The max_s term is elucidated first:

\max_s Z_{m̄s}^T Z_{m̄s} = \max_s \sum_{q=1}^{M} \sum_{r=1}^{M} \sum_{t=2}^{T} \sum_{τ=2}^{T} δ(s_{t-1}=m̄) δ(s_{τ-1}=m̄) δ(s_t=q) δ(s_τ=r) \, z_q^T z_r

Although an integer programming solution may be possible for the above maximization, we can instead solve for an upper bound on it (which will still produce a guaranteed bound) by reformulating it as follows. We first define scalars:

λ_{rq} := \sum_{t=2}^{T} \sum_{τ=2}^{T} δ(s_{t-1}=m̄) δ(s_{τ-1}=m̄) δ(s_t=q) δ(s_τ=r)

It is clear that these scalars are all positive, i.e. λ_{rq} ≥ 0, which gives us M² constraints. Furthermore, we have the following constraint on these variables:

\sum_r \sum_q λ_{rq} ≤ (T−1)²

Therefore, we can solve for an upper bound on the desired quantity via a simple linear program over the M² surrogate variables λ_{rq}. By allowing these surrogate λ variables to vary as continuous parameters over the constrained range, we are solving a more flexible maximization than the original integer programming problem over state paths. Therefore, optimizing over the λ will yield a more conservative upper bound on the quantity of interest (which still produces a legitimate reverse-Jensen bound overall).
The solution can thus be found via the simple linear form below (with the corresponding M² + 1 inequality constraints imposed on the λ-variables):

$$\max_s Z_{\bar m s}^T Z_{\bar m s} \;\leq\; \max_\lambda \sum_{q=1}^M \sum_{r=1}^M \lambda_{rq}\; z_q^T z_r$$

Now, we turn our attention to the computation of the expected inner product (versus the maximum):

$$\sum_s h_s Z_{\bar m s}^T Z_{\bar m s} = \sum_s h_s \sum_{q=1}^M \sum_{r=1}^M \sum_{t=2}^T \sum_{\tau=2}^T \delta(s_{t-1}=\bar m)\,\delta(s_{\tau-1}=\bar m)\,\delta(s_t=q)\,\delta(s_\tau=r)\; z_q^T z_r = \sum_{q=1}^M \sum_{r=1}^M \sum_{t=2}^T \sum_{\tau=2}^T p(s_{t-1}=\bar m,\, s_{\tau-1}=\bar m,\, s_t=q,\, s_\tau=r \mid \{X\}, \tilde\Theta)\; z_q^T z_r$$

Once again we simply use the marginal distribution over a state sub-string, p(s_{t-1}=m̄, s_{τ-1}=m̄, s_t=q, s_τ=r | {X}, Θ̃), which is readily obtainable for a tree-structured or chain-structured HMM without intractable computation. The clique size being considered here is over 4 variables, which leads to more work than in a standard EM setting; it appears that computing the expected data magnitude (as opposed to the expected data vectors, as EM does) requires squared clique sizes. More specifically, the implementation of the above requires O(T²M⁴) operations since we need to consider the 4-variable probability distributions p(s_t, s_{t-1}, s_τ, s_{τ-1}), which are of order M⁴, for every pair of time points in the trellis of length T. Although this is slower than the forward-backward algorithm in a regular hidden Markov model (which is typically O(TM²)), the computation is still tractable. The above computations thus give us both the w_m and the $\bar w_m$ parameters for the Gaussian and the multinomial components of the reverse-Jensen bound without resorting to intractable computation. All computations remain efficient via appeals to linear programming, quadratic programming and forward-backward types of dynamic programming.
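Both quantities can be sketched in a few lines of numpy. This is a hypothetical illustration under simplifying assumptions: since the relaxed linear program has only nonnegativity constraints and a single budget constraint, its optimum places all of the (T−1)² mass on the largest Gram entry; the expected term assumes the 4-variable marginals have already been collected into a T×T×M×M array (e.g. by a forward-backward style pass).

```python
import numpy as np

def max_term_upper_bound(z, T):
    """Relaxed LP upper bound on max_s Z^T Z: maximize sum_qr lambda_qr z_q^T z_r
    over lambda_qr >= 0 with sum lambda_qr <= (T-1)^2.  The objective is linear,
    so the optimum puts the whole budget on the largest Gram entry (or uses none
    of it if every entry is negative)."""
    gram = z @ z.T                      # entries z_q^T z_r
    return (T - 1) ** 2 * max(gram.max(), 0.0)

def expected_term(marginals, z):
    """Expected term sum_s h_s Z^T Z, assuming marginals[t, tau, q, r] holds
    p(s_{t-1}=mbar, s_{tau-1}=mbar, s_t=q, s_tau=r | {X}, Theta~)."""
    gram = z @ z.T
    # O(T^2 M^2) contraction of the marginals against the Gram matrix
    return np.einsum('tuqr,qr->', marginals, gram)
```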
6.4 Conditional Hidden Markov Models

Given both Jensen and reverse-Jensen bounds for hidden Markov models, it is now feasible to use the CEM algorithm and perform conditional likelihood maximization in a straightforward way. In this section, we will describe one scenario where it becomes necessary to optimize a conditional likelihood quantity. There are many other situations where a conditional expression will arise from the problem formulation and result in a negated log-sum type of expression. Negated log-sums and conditional likelihood expressions occur, for example, when learning a classifier based on multiple competing hidden Markov models, or when learning a mixture-of-experts regression function whose gates are hidden Markov models. In this section we will focus on and develop what is traditionally called an input-output hidden Markov model [16], where the objective is to regress certain components of a time series from others which are observable. Both inputs and outputs are coupled through a hidden state which evolves with Markov dynamics. We begin with a standard hidden Markov model as portrayed in the previous sections and assume that we are given a time sequence of vectors. For an input-output hidden Markov model, we split the emission vectors into two components, x and y, which are treated as input and output respectively. Given a training sequence of such x_1, …, x_T and y_1, …, y_T vector-pairs over time, we would like to estimate a hidden Markov model such that, on future test data, we can reliably predict a $\hat y_1, \ldots, \hat y_\tau$ sequence from a given $\hat x_1, \ldots, \hat x_\tau$ sequence alone. An example application would be to learn the mapping from various passive biological measurements such as heart rate, temperature, motion energy, etc. to a more complicated output measurement such as blood glucose, which would not always be easily measurable.
We shall assume that we are dealing with an HMM with a total of M underlying states, which gives rise to an M × M transition matrix. This transition matrix is equivalently described by M multinomial distributions of dimensionality M. We will further assume that the emission model for x and y is jointly Gaussian with a diagonal covariance matrix. Therefore, it is necessary to estimate the M Gaussian covariance terms, the M Gaussian mean terms and the M × M state transition matrix. Since we wish to regress y from x, it is natural to form a conditional distribution p(y|x), and a natural objective function is the conditional log-likelihood on the training data:

$$l^c = \log p(x, y) - \log p(x)$$

We can view the first term in the above expression as the log-likelihood of the whole hidden Markov model on the training data. The second (negated) term is the likelihood of the hidden Markov model after it has been marginalized so that it only applies to the input data x. Figuratively, we have the graphical model scenario of Figure 6.5.

Figure 6.5: Conditional HMM Estimation. Effectively, we maximize the joint p(x, y) log-likelihood of the HMM over both x and y (inputs and outputs) minus the marginal p(x) log-likelihood, where the HMM has been summed over to obtain a marginal only over the x input.

Recall that p(x, y) and p(x) in the above expressions are latent quantities which involve a summation over all possible paths in the state space of the hidden Markov model. In other words, we can expand the conditional likelihood quantity as follows:

$$l^c = \log \sum_S p(S, x, y) - \log \sum_S p(S, x)$$

We lower bound the left-hand term with Jensen's inequality while the reverse-Jensen inequality is applied to the right-hand term, log p(x). This gives us an overall lower bound on the conditional log-likelihood l^c which can be maximized in closed form.
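The objective l^c = log p(x, y) − log p(x) can be evaluated with two forward passes: one with the full joint emissions, and one with emissions in which the y-dimensions have been marginalized out (coordinate-wise for diagonal Gaussians). A minimal sketch, taking hypothetical log-emission tables as inputs:

```python
import numpy as np

def logsumexp(v):
    m = v.max()
    return m + np.log(np.exp(v - m).sum())

def forward_loglik(log_emission, log_A, log_pi):
    """Forward recursion in log space; log_emission[t, m] = log p(obs_t | s_t = m)."""
    alpha = log_pi + log_emission[0]
    for t in range(1, log_emission.shape[0]):
        # alpha_t(j) = log_em[t, j] + logsumexp_i(alpha_{t-1}(i) + log_A[i, j])
        trans = alpha[:, None] + log_A
        alpha = log_emission[t] + np.apply_along_axis(logsumexp, 0, trans)
    return logsumexp(alpha)

def conditional_loglik(log_em_xy, log_em_x, log_A, log_pi):
    """l^c = log p(x, y) - log p(x): joint HMM likelihood minus the likelihood
    of the input-marginalized HMM (same transitions, reduced emissions)."""
    return (forward_loglik(log_em_xy, log_A, log_pi)
            - forward_loglik(log_em_x, log_A, log_pi))
```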
We then iterate bounding and maximization steps until we converge to a local maximum of conditional likelihood. To perform regression, we merely use novel $\hat x_1, \ldots, \hat x_\tau$ data to estimate a probability distribution over the hidden states S. This can then be used to compute an estimated output sequence $\hat y_1, \ldots, \hat y_\tau$ simply by using the alpha-beta values in the forward-backward algorithm to weight each Gaussian in the emission model appropriately for each time step.

6.4.1 Experiments

To compare the estimates of the conditional likelihood criterion (CEM) against the usual maximum likelihood (EM) framework, we used a standard meteorological time series prediction task. This small toy data set consists of 420 monthly measurements of the precipitation, temperature and water flow in the San Francisco area from 1932 to 1966. We therefore have a 4-dimensional time series when we concatenate the month as well (which is represented as a sinusoid whose values range between -1.0 and 1.0). The input-output HMM we wish to build will regress the precipitation value from the remaining three inputs. Training was performed on the first 360 samples of the time series while testing was performed on the remaining 60 samples. Figure 6.6 depicts a snapshot of the training data, i.e. the 4-dimensional time series of weather measurements. We trained both an EM and a CEM based input-output hidden Markov model with 4 states assuming diagonal-covariance Gaussian emission models. The Gaussians are over 4 dimensions (3 inputs and the precipitation level as output). Table 6.1 summarizes the resulting log-likelihoods and the RMS error for the EM and CEM algorithms. In this particular example, the CEM algorithm was much slower, requiring over 2 orders of magnitude more time to attain convergence than the EM algorithm. This is due to extra computations on top of the usual forward-backward algorithm and the extra looseness of the reverse-Jensen bounds, which requires more iterations.
However, the resulting testing performance, which is notable in the conditional likelihood on testing and the RMS error on testing, shows that CEM performed favorably when compared to EM based prediction.

Figure 6.6: Training data from the San Francisco Data Set. Here, the 4-dimensional time series depicts the monthly levels of precipitation, water flow, temperature as well as the month of year as a sinusoid.

For an appropriate and quantitative measure of performance, the conditional likelihood on testing data is the best choice since the given task here is to predict precipitation values from the other inputs. It is therefore inappropriate to penalize or reward good estimates of the input components, and joint likelihood on test data is an inappropriate measurement of the quality of the estimate. Only the desired outputs need to be properly predicted, and hence the conditional likelihood score (as well as the RMS error) shows that a better regression estimate is obtained by the CEM algorithm.

Training:
          Log-Likelihood   Conditional Log-Likelihood
  EM      -5.985           -3.473
  CEM     -9.683           -3.066

Testing:
          Log-Likelihood   Conditional Log-Likelihood   RMS Error
  EM      -6.466           -3.463                       195.2
  CEM     -9.360           -3.110                       184.6

Table 6.1: CEM & EM Performance for HMM Precipitation Prediction.

Figure 6.7 shows a qualitative performance difference between the EM-based hidden Markov model and the CEM-based hidden Markov model. Note how the CEM version tracks the precipitation values (during testing) more closely than the EM version and also has a more accurate range of variation than the conservative, near-constant precipitation estimates EM generates. In Chapter 8, a more ambitious application of input-output HMM regression will be attempted.
However, due to the extremely large size of the data and its long trellis length, simplifying approximations will be needed to reduce the computation time of the CEM algorithm such that it operates in roughly the same time as the EM algorithm.

6.5 Discriminative Hidden Markov Models

In this section we give details of an implementation of MED for hidden Markov models in a discriminative regression setting. Unlike the standard regression or conditional regression in the previous input-output HMM, we will here use a discriminative epsilon-tube insensitivity with a linear loss function to form the regression function. The implementation for a classification setting is straightforward given the regression development here.

Figure 6.7: Test predictions for the San Francisco Data Set. The true precipitation values are depicted with the dashed cyan line and the filled dots. The EM hidden Markov model's predictions are outlined with the dashed red line connecting the 'x' points. The CEM hidden Markov model's predictions are outlined by the solid blue line connecting the hollow 'o' circles. Note how the CEM predictions track the true precipitation values more closely while the EM predictions do not span the desired range of variation and do not align well with the desired output.

While classification has many applications (e.g. speech recognition, bioinformatics, etc.), we will portray the discriminative regression HMM as a novel way to incorporate time invariance in regression settings. Recall section 4.1, where we developed MED for regression applications and discussed the use of generative and latent generative models therein.
We now consider the following regression discriminant function:

$$\mathcal{L}(X; \Theta) = \log \frac{\sum_S p(X, S \mid \theta_+)}{\sum_S p(X, S \mid \theta_-)} + b$$

where the numerator and denominator are hidden Markov models which define a probability density over each X datum, which is itself a sequence of observations instead of a single vector. Recall the regression constraints that arise in the MED framework:

$$\int P(\Theta, \gamma) \left[ y_t - \mathcal{L}(X_t; \Theta) + \gamma_t \right] d\Theta\, d\gamma \;\geq\; 0, \quad t = 1..T$$

$$\int P(\Theta, \gamma) \left[ \gamma_t' - y_t + \mathcal{L}(X_t; \Theta) \right] d\Theta\, d\gamma \;\geq\; 0, \quad t = 1..T$$

The expectations over the discriminant functions above are intractable and therefore we employ variational bounds instead (Jensen and reverse-Jensen). In the first constraint we need to upper bound L(X_t; Θ), while in the second constraint we need to lower bound L(X_t; Θ). This results in more restrictive constraints that generate a convex hull contained in the convex hull of the original MED problem. In the first set of constraints, we use reverse-Jensen on the numerator and the Jensen bound on the denominator. The situation is reversed for the other set of constraints. Once the variational bounds are used, the MED solution distribution P(Θ) will factorize across all the (positive and negative) HMM parameters (i.e. the emissions and transitions) if our prior over these, P_0(Θ), is itself conjugate and factorized. Assume we have invoked the inequalities at a given $\tilde\Theta$ operating point and now have the following constraints instead:

$$\int P(\Theta, \gamma) \left[ y_t + \sum_m w_{tm} \log p(Y_{tm} \mid m, \theta_+) - k_t + \sum_m \tilde w_{tm} \log p(\tilde Y_{tm} \mid m, \theta_-) + \tilde k_t - b + \gamma_t \right] d\Theta\, d\gamma \;\geq\; 0$$

$$\int P(\Theta, \gamma) \left[ \gamma_t' - y_t + \sum_m \tilde w_{tm} \log p(\tilde Y_{tm} \mid m, \theta_+) + \tilde k_t + \sum_m w_{tm} \log p(Y_{tm} \mid m, \theta_-) - k_t + b \right] d\Theta\, d\gamma \;\geq\; 0$$

In the above, we have used Jensen and reverse-Jensen on each training sequence X_t in the data, t = 1..T, to obtain the parameters of the bounds: $\tilde w_{tm}, \tilde Y_{tm}, \tilde k_t$ and $w_{tm}, Y_{tm}, k_t$ for each HMM model (both the positive and the negative one).
Here, m indexes over the M Gaussian emission models and the M multinomial (transition matrix) models for each HMM respectively. To compute the bounds, we convert the multinomials to their natural parameterization for the Jensen/reverse-Jensen computations and back to the form in section 3.9.4 for the MED computations. The Gaussian is only over means and therefore is always in its natural parameterization. The above constraints give rise to the following partition function, where the margin (with exponential prior) and bias (with Gaussian prior) components are the same as in any simple (linear) MED regression:

$$Z(\lambda) = Z_\gamma(\lambda)\, Z_b(\lambda)\, Z_{\theta_+}(\lambda)\, Z_{\theta_-}(\lambda)$$

The only novel computation involves solving for the Z_{θ+} and Z_{θ−} components, since the other components were already explored in previous regression problems. A closed-form partition function is needed so that we can estimate the optimal setting of the Lagrange multipliers. Each of the models θ+ and θ− contains the M Gaussian emission parameters and M multinomial parameters of its HMM. We will assume white Gaussian priors over the emission parameters and uniform Dirichlet priors (i.e. α_k = α = 1/M in the notation of section 3.9.4) over the multinomials.
It suffices to show how to compute the partition function component for one of the HMMs, say Z_{θ+}, which is given by:

$$Z_{\theta_+}(\lambda) = \int_{\theta_+} P_0(\theta_+) \exp\left( \sum_t \lambda_t \sum_m w_{tm} \log p(Y_{tm} \mid m, \theta_+) - \lambda_t k_t + \sum_t \lambda_t' \sum_m \tilde w_{tm} \log p(\tilde Y_{tm} \mid m, \theta_+) + \lambda_t' \tilde k_t \right) d\theta_+$$

This partition function component can be broken down further as follows:

$$Z_{\theta_+}(\lambda) = Z_{linear+}(\lambda) \times \prod_{n=1}^M Z_{gauss+,n}(\lambda) \times \prod_{q=1}^M Z_{multi+,q}(\lambda)$$

There is a simple log-linear component which is given by:

$$Z_{linear+}(\lambda) = \exp\left( -\sum_t \lambda_t k_t + \sum_t \lambda_t' \tilde k_t \right)$$

There are n = 1..M Gaussian partition function components, given here in logarithmic form (where $A(Y) = -\frac{1}{2} Y^T Y - \frac{D}{2} \log(2\pi)$, as usual for a Gaussian e-family model):

$$\log Z_{gauss+,n}(\lambda) = \log \int P_0(\mu_n) \exp\left( \sum_t \lambda_t w_{tn} \left( A(Y_{tn}) + Y_{tn}^T \mu_n - \tfrac{1}{2} \mu_n^T \mu_n \right) + \sum_t \lambda_t' \tilde w_{tn} \left( A(\tilde Y_{tn}) + \tilde Y_{tn}^T \mu_n - \tfrac{1}{2} \mu_n^T \mu_n \right) \right) d\mu_n$$

$$\log Z_{gauss+,n}(\lambda) = \sum_t \lambda_t w_{tn} A(Y_{tn}) + \sum_t \lambda_t' \tilde w_{tn} A(\tilde Y_{tn}) - \frac{1}{2} \log\left( 1 + \sum_t \lambda_t w_{tn} + \sum_t \lambda_t' \tilde w_{tn} \right) + \frac{1}{2} \frac{ \left\| \sum_t \lambda_t w_{tn} Y_{tn} + \sum_t \lambda_t' \tilde w_{tn} \tilde Y_{tn} \right\|^2 }{ 1 + \sum_t \lambda_t w_{tn} + \sum_t \lambda_t' \tilde w_{tn} }$$

We also have to consider the contribution of the q = 1..M multinomial components. We first convert the natural parameterization of the data, i.e. the (M−1)-dimensional vectors Y_{tq}, into the standard multinomial form with M-dimensional vectors U_{tq}. This is done by concatenating an extra element to each vector such that it sums to unity. We then integrate to obtain the following partition function, which is reminiscent of the derivation in Section 3.9.4 (and we use the superscript k to index into the dimensionality of the U_{tq} vectors):

$$Z_{multi+,q} \propto \exp\left( \sum_t \lambda_t w_{tq} \log \frac{\Gamma(1 + \sum_{k=1}^M U_{tq}^k)}{\prod_{k=1}^M \Gamma(1 + U_{tq}^k)} + \sum_t \lambda_t' \tilde w_{tq} \log \frac{\Gamma(1 + \sum_{k=1}^M \tilde U_{tq}^k)}{\prod_{k=1}^M \Gamma(1 + \tilde U_{tq}^k)} \right) \frac{ \prod_{k=1}^M \Gamma\left( \alpha_k + \sum_t \lambda_t w_{tq} U_{tq}^k + \sum_t \lambda_t' \tilde w_{tq} \tilde U_{tq}^k \right) }{ \Gamma\left( \sum_{k=1}^M \alpha_k + \sum_{k=1}^M \left( \sum_t \lambda_t w_{tq} U_{tq}^k + \sum_t \lambda_t' \tilde w_{tq} \tilde U_{tq}^k \right) \right) }$$

For computing the partition function of the negative HMM, we merely use its own Jensen and reverse-Jensen parameters and permute the roles of λ and λ'.
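The closed-form Gaussian component can be sketched directly from the formula above. This is a hedged illustration with my own variable names; the −½ log(·) term follows the display above as written (a fully D-dimensional treatment might scale that term by the dimensionality).

```python
import numpy as np

def A(Y):
    """Gaussian e-family term A(Y) = -1/2 Y^T Y - D/2 log(2 pi)."""
    return -0.5 * Y @ Y - 0.5 * len(Y) * np.log(2 * np.pi)

def log_Z_gauss(lam, w, Y, lam_p, w_tilde, Y_tilde):
    """log Z_gauss+,n(lambda) under a white Gaussian prior on mu_n.
    (lam, w, Y) are the multipliers, Jensen weights and bound data of one
    constraint set; (lam_p, w_tilde, Y_tilde) are the tilde quantities.
    lam, w are length-T arrays; Y is a T x D array of bound data."""
    S = (lam * w).sum() + (lam_p * w_tilde).sum()
    v = (lam * w) @ Y + (lam_p * w_tilde) @ Y_tilde   # weighted data sum
    a = sum(l * wi * A(y) for l, wi, y in zip(lam, w, Y))
    a += sum(l * wi * A(y) for l, wi, y in zip(lam_p, w_tilde, Y_tilde))
    return a - 0.5 * np.log(1.0 + S) + 0.5 * (v @ v) / (1.0 + S)
```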
The negative of the logarithm of the aggregate partition function is then maximized. There are redundancies in the above computations, and they are all computable efficiently if only a single Lagrange multiplier is modified at a time. This permits a fast axis-parallel implementation. Once we have sufficiently optimized over the Lagrange multipliers, we can use their current setting to compute P(Θ). To obtain the next $\tilde\Theta$ setting, we merely find the maximum of P(Θ). For the Gaussian models, the maximum is at the mean, which involves the following simple update rule (the negative model's parameters are updated similarly with their own bounds and the roles of λ_t and λ_t' swapped):

$$\mu_n^+ = \frac{ \sum_t \lambda_t w_{tn} Y_{tn} + \sum_t \lambda_t' \tilde w_{tn} \tilde Y_{tn} }{ 1 + \sum_t \lambda_t w_{tn} + \sum_t \lambda_t' \tilde w_{tn} }$$

For the q'th multinomial, the natural parameters are θ_{+q}, which permit a simple update rule involving the gradients of the cumulant-generating function $K = \log(1 + \sum_i \exp(\theta_i))$. The following rule merely finds the max of the MED solution distribution P(θ_{+q}) in e-family form (the $\vec 1$ denotes a vector of ones of the same size as Y_{tq}):

$$\frac{\partial K(\theta_{+q})}{\partial \theta_{+q}} = \frac{ \alpha \vec 1 + \sum_t \lambda_t w_{tq} Y_{tq} + \sum_t \lambda_t' \tilde w_{tq} \tilde Y_{tq} }{ 1 + \sum_t \lambda_t w_{tq} + \sum_t \lambda_t' \tilde w_{tq} }$$

Once the max has been found, we can re-convert the natural representation of θ_{+q} back into the usual multinomial parameterization (using ρ as in Section 3.9.4). Thus, we iterate, updating the $\tilde\Theta$ contact point, recomputing the bounds, and estimating the Lagrange multipliers. This is continued until convergence (a heuristic stopping criterion is used). Utilizing the MED solution P(Θ) for prediction is still intractable even after training is performed, since we must compute expectations over the HMM. Instead, during run-time, we merely use the maximum $\hat\Theta = \arg\max P(\Theta)$, which gives us two fixed HMM models.
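The Gaussian mean update above is just a regularized weighted average of the bound data. A small sketch with assumed array shapes (length-T multiplier vectors, T×D bound data):

```python
import numpy as np

def update_gaussian_mean(lam, w, Y, lam_p, w_tilde, Y_tilde):
    """mu_n update: multiplier-weighted average of the bound data for state n,
    shrunk toward the (zero) prior mean by the 1 in the denominator."""
    num = (lam * w) @ Y + (lam_p * w_tilde) @ Y_tilde
    den = 1.0 + (lam * w).sum() + (lam_p * w_tilde).sum()
    return num / den
```

With all multipliers at zero the update returns the prior mean (zero), as expected from the shrinkage form.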
The log-ratio of their likelihoods (with the scalar bias parameter) is then used to compute the regression and obtain an approximate output:

$$\hat y = \int P(\Theta)\, \mathcal{L}(X; \Theta)\, d\Theta \;\approx\; \mathcal{L}(X; \hat\Theta)$$

We can thus form a discriminative regression model that inherits the dynamic-time-warping properties of an HMM while optimizing an epsilon-tube insensitive scalar prediction. The above expressions can be trivially re-applied to create discriminative HMMs for classification, where the parameters of each HMM are estimated such that we obtain a large-margin decision boundary between the two generative models (unlike in the standard maximum-likelihood setting).

6.6 Latent Bayesian Networks

We shall now discuss the case of Bayesian networks in general, of which hidden Markov models are a specific example. The hidden Markov model has a chain dependency structure which maintains some important computational properties that were useful for computing Jensen and reverse-Jensen bounds efficiently. It is well known that Bayesian networks which have tree structures can also take advantage of efficient algorithms to avoid intractabilities, and therefore may also be possible to estimate discriminatively (with the Jensen and reverse-Jensen bounds). Recall that tree-structured dependencies between variables which are themselves in the exponential family form an aggregate exponential-family distribution [34]. Now, if we have latent or hidden variables in the Bayes net graph, these tree structures need to be summed over. Therefore, the likelihood of the latent Bayes net can be represented as a mixture of exponential family distributions. Again, this can be written in the mixture-of-mixtures form of Equation 6.1. Subsequently, its logarithm (the log-likelihood of the latent Bayes net) can be upper and lower bounded via Jensen and reverse-Jensen.
Although we could compute these bounds merely by unfolding the latencies into the mixture of mixtures, this may be highly inefficient. The computations for the w_m would become intractable since they would require enumerating all latent configurations in the Bayesian network. Just as we demonstrated for the hidden Markov model, where efficient algorithms avoid the exponentially large explicit mixture model, we need an efficient algorithm to compute the reverse-Jensen bounds for general tree-structured Bayes nets by taking advantage of the conditional independency properties of the graphical model. Otherwise, naive computation of Jensen and reverse-Jensen bounds requires intractable amounts of work for latent Bayesian networks. It is possible to compute lower bounds efficiently by taking advantage of the independency structure in the networks, and that is done for the regular Jensen inequality parameters by using EM and the so-called junction tree algorithm [102]. In fact, the forward-backward algorithm that we modified for the reverse-Jensen bound is a special case of the more general junction tree algorithm. Both are methods that take advantage of the structure of the graph to efficiently compute expectations. For solving the reverse-Jensen upper bound's parameters, we also need such efficient procedures that take advantage of the graph structure. Just as was shown for the HMM, the general case of latent tree-structured Bayesian networks may also admit similar efficient reverse-Jensen algorithms, and this direction definitely merits further investigation.

6.7 Data Set Bounds

So far, we have considered only bounds (lower and upper) on the log-likelihood of a single data point or single sequence (in the HMM case). However, we will often want to bound the likelihood of a data set of many points. If this is the case, the reverse-Jensen inequality can be applied in a more straightforward way.
The main benefit is that the reverse-Jensen inequality will be applied once to form the bound, instead of being applied separately for each data point, and this may generate a more efficient and more appropriately shaped bound. Consider the following augmentation of the bounding of the log-sum. To obtain the aggregate data set's conditional likelihood, we sum the likelihood over each data point as follows:

$$l^c = \sum_i \log \sum_m p(m, c_i, X_i \mid \Theta) - \sum_i \log \sum_m \sum_c p(m, c, X_i \mid \Theta)$$

The conditional log-likelihood (l^c) above can be seen as the joint log-likelihood (l^j) minus the marginal log-likelihood (l^m):

$$l^c = l^j - l^m$$

where the joint log-likelihood is:

$$l^j = \sum_i \log \sum_m p(m, c_i, X_i \mid \Theta)$$

and the (negated) marginal log-likelihood is:

$$l^m = \sum_i \log \sum_m \sum_c p(m, c, X_i \mid \Theta)$$

Since we want an overall lower bound on the conditional likelihood, we need to lower bound the joint log-likelihood above and upper bound the marginal log-likelihood (since it gets negated). We can also write the marginal log-likelihood more compactly as follows (which obscures the fact that it is latent):

$$l^m = \sum_i \log p(X_i \mid \Theta)$$

We begin by adding a constant term to the marginal likelihood, which has no effect on the shape of the bound. This is in the same spirit as the incremental likelihood derivation for EM in Bishop [20]. This additive constant does not vary since it is based on the fixed previous $\tilde\Theta$ parameter values (i.e. the point of tangential contact):

$$\Delta l^m = \sum_i \log \frac{p(X_i \mid \Theta)}{p(X_i \mid \tilde\Theta)}$$

Now, we shall do the reverse of EM and pull the summation over data into the logarithm operator. This will generate an upper bound on the incremental marginal log-likelihood:

$$\sum_i \log \frac{p(X_i \mid \Theta)}{p(X_i \mid \tilde\Theta)} \;\leq\; \log \frac{ \sum_i \gamma_i\, p(X_i \mid \Theta)^{\beta_i} }{ \sum_i \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} } \qquad (6.6)$$

One thing to note in the above bound is that the γ_i terms can be scaled by an arbitrary amount without changing the shape of the bound.
However, we still need to pick the γ_i and the β_i to ensure that we have a guaranteed upper bound. These will become obvious shortly. First consider the right-hand side above and manipulate it as follows:

$$\log \frac{ \sum_i \gamma_i\, p(X_i \mid \Theta)^{\beta_i} }{ \sum_i \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} } = \log \sum_i \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} } \left( \frac{p(X_i \mid \Theta)}{p(X_i \mid \tilde\Theta)} \right)^{\beta_i}$$

It is clear that if the γ_i are all positive, then the terms multiplying the parenthesized ratios are always positive and sum to unity (over the index i). Thus, due to the concavity of the log() function, we can apply Jensen's inequality:

$$\log \frac{ \sum_i \gamma_i\, p(X_i \mid \Theta)^{\beta_i} }{ \sum_i \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} } \;\geq\; \sum_i \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} } \log \frac{ p(X_i \mid \Theta)^{\beta_i} }{ p(X_i \mid \tilde\Theta)^{\beta_i} } \qquad (6.7)$$

$$\log \frac{ \sum_i \gamma_i\, p(X_i \mid \Theta)^{\beta_i} }{ \sum_i \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} } \;\geq\; \sum_i \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} }\, \beta_i \log \frac{ p(X_i \mid \Theta) }{ p(X_i \mid \tilde\Theta) } \qquad (6.8)$$

By inspection, we can see that Equation 6.8 is very similar to Equation 6.6 (when the left- and right-hand sides are flipped). However, for them to be identical, we need to guarantee that every term multiplying the logarithm on the right-hand side of Equation 6.8 is unity. Thus we need:

$$\left( \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} } \right) \beta_i = 1$$

This simplifies to (recall that the γ_i are invariant up to a global scale factor):

$$\gamma_i \propto \frac{1}{\beta_i\, p(X_i \mid \tilde\Theta)^{\beta_i}}$$

We can pick the scale factor arbitrarily, and it should be chosen for reasons of numerical precision (to avoid computational underflow or overflow). One possible choice is to normalize the γ_i as follows:

$$\gamma_i = \frac{ \dfrac{1}{\beta_i\, p(X_i \mid \tilde\Theta)^{\beta_i}} }{ \sum_j \dfrac{1}{\beta_j\, p(X_j \mid \tilde\Theta)^{\beta_j}} }$$

We should also note that the β_i are always greater than or equal to unity. This is because we had:

$$\left( \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} } \right) \beta_i = 1$$

Clearly, the terms in parentheses that multiply β_i are less than or equal to unity, so we can conclude that β_i ≥ 1.
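As a numerical sanity check of this construction, the sketch below evaluates the simplified form of the data-set bound, Δl^m ≤ log Σ_i (1/β_i)(p(X_i|Θ)/p(X_i|Θ̃))^{β_i}, for the easiest admissible choice β_i = N (so that the inverses sum to unity). The likelihood values are arbitrary stand-ins, not drawn from any actual model:

```python
import numpy as np

def delta_lm(p, p_tilde):
    """Incremental marginal log-likelihood sum_i log(p_i / p~_i)."""
    return np.sum(np.log(p / p_tilde))

def dataset_upper_bound(p, p_tilde, beta):
    """Right-hand side of the data-set bound, valid for beta_i >= 1
    with sum_i 1/beta_i = 1."""
    r = p / p_tilde
    return np.log(np.sum((1.0 / beta) * r ** beta))
```

At Θ = Θ̃ every ratio is one and the right-hand side reduces to log Σ_i 1/β_i = log 1 = 0, so the bound makes tangential contact as the derivation requires.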
It is evident that we have guaranteed the bound originally proposed in Equation 6.6, and we still have some freedom in selecting the β_i (as long as they are greater than unity). We shall exploit this freedom to find the tightest bound possible. At this point we have the following bound:

$$\Delta l^m \leq \log \frac{ \sum_i \gamma_i\, p(X_i \mid \Theta)^{\beta_i} }{ \sum_i \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }$$

$$\Delta l^m \leq \log \frac{ \sum_i \frac{1}{\beta_i} \left( \frac{p(X_i \mid \Theta)}{p(X_i \mid \tilde\Theta)} \right)^{\beta_i} }{ \sum_i \frac{1}{\beta_i} }$$

The above bound holds for any distribution and for any β_i that are greater than unity. However, it should be noted that the β_i also sum inversely to unity, since:

$$\frac{1}{\beta_i} = \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} } \qquad \Rightarrow \qquad \sum_i \frac{1}{\beta_i} = \sum_i \frac{ \gamma_i\, p(X_i \mid \tilde\Theta)^{\beta_i} }{ \sum_j \gamma_j\, p(X_j \mid \tilde\Theta)^{\beta_j} } = 1$$

Thus, we can write the bound as follows:

$$\Delta l^m \leq \log \sum_i \frac{1}{\beta_i} \left( \frac{p(X_i \mid \Theta)}{p(X_i \mid \tilde\Theta)} \right)^{\beta_i}$$

Expanding the above to include the latent values, we obtain:

$$\Delta l^m \leq \log \sum_i \frac{1}{\beta_i} \left( \frac{ \sum_{mc} p(m, c, X_i \mid \Theta) }{ \sum_{mc} p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i}$$

However, the above expression is still undesirable since the power operation (i.e. $(\cdot)^{\beta_i}$) applies to a mixture (the summation over c and m). Let us bound these terms to simplify them further. We invoke Jensen again, since the power operation is a convex function for β_i ≥ 1. This permits us to upper bound the expressions involving the power operation and thus generates an overall upper bound on Δl^m. Manipulating the mixture of terms being taken to a power and applying Jensen, we obtain:

$$\left( \frac{ \sum_{mc} p(m, c, X_i \mid \Theta) }{ \sum_{mc} p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i} = \left( \sum_{mc} \frac{ p(m, c, X_i \mid \tilde\Theta) }{ \sum_{nd} p(n, d, X_i \mid \tilde\Theta) } \frac{ p(m, c, X_i \mid \Theta) }{ p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i}$$
$$\left( \frac{ \sum_{mc} p(m, c, X_i \mid \Theta) }{ \sum_{mc} p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i} \;\leq\; \sum_{mc} \frac{ p(m, c, X_i \mid \tilde\Theta) }{ \sum_{nd} p(n, d, X_i \mid \tilde\Theta) } \left( \frac{ p(m, c, X_i \mid \Theta) }{ p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i}$$

It is clear that the left-hand side and the right-hand side are equal at the old value of $\tilde\Theta$, ensuring that we have generated a variational bound that makes tangential contact appropriately. We can now plug these upper bounds on the individual terms indexed by i into the overall expression and obtain a stricter upper bound on Δl^m:

$$\Delta l^m \leq \log \sum_i \frac{1}{\beta_i} \left( \frac{ \sum_{mc} p(m, c, X_i \mid \Theta) }{ \sum_{mc} p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i}$$

$$\Delta l^m \leq \log \sum_i \frac{1}{\beta_i} \sum_{mc} \frac{ p(m, c, X_i \mid \tilde\Theta) }{ \sum_{nd} p(n, d, X_i \mid \tilde\Theta) } \left( \frac{ p(m, c, X_i \mid \Theta) }{ p(m, c, X_i \mid \tilde\Theta) } \right)^{\beta_i}$$

The above again holds for any arbitrary distribution, as long as the β_i are greater than one and their inverses sum to unity. If we more specifically assume that p(m, c, X_i | Θ) is in the exponential family, we can write it compactly as follows: