Probabilistic Graphical Models

Total Pages: 16

File Type: PDF, Size: 1020 KB

Probabilistic Graphical Models
David Sontag, New York University
Lecture 3, February 14, 2013

Undirected graphical models (reminder of lecture 2)

An alternative representation for joint distributions is an undirected graphical model (also known as a Markov random field). As in Bayesian networks, we have one node for each random variable. Rather than CPDs, we specify (non-negative) potential functions over sets of variables associated with cliques C of the graph,

    p(x_1, ..., x_n) = (1/Z) \prod_{c \in C} \phi_c(x_c),

where Z is the partition function, which normalizes the distribution:

    Z = \sum_{\hat{x}_1, ..., \hat{x}_n} \prod_{c \in C} \phi_c(\hat{x}_c).

Simple example (a potential function on each edge encourages the variables to take the same value): three binary variables A, B, C connected in a triangle, with the same potential on every edge:

    \phi_{A,B}(a, b):    b = 0    b = 1
        a = 0              10       1
        a = 1               1      10

and \phi_{B,C}, \phi_{A,C} defined identically. Then

    p(a, b, c) = (1/Z) \phi_{A,B}(a, b) \cdot \phi_{B,C}(b, c) \cdot \phi_{A,C}(a, c),

where

    Z = \sum_{\hat{a}, \hat{b}, \hat{c} \in \{0,1\}^3} \phi_{A,B}(\hat{a}, \hat{b}) \cdot \phi_{B,C}(\hat{b}, \hat{c}) \cdot \phi_{A,C}(\hat{a}, \hat{c}) = 2 \cdot 1000 + 6 \cdot 10 = 2060.

Example: Ising model

Invented by the physicist Wilhelm Lenz (1920), who gave it as a problem to his student Ernst Ising. It is a mathematical model of ferromagnetism in statistical mechanics: the spin of an atom is biased by the spins of atoms nearby on the material. Each atom has a spin X_i \in \{-1, +1\}, whose value is the direction of the atom's spin. Questions of interest: if a spin at position i is +1, what is the probability that the spin at position j is also +1? Are there phase transitions where spins go from "disorder" to "order"?

The joint distribution is

    p(x_1, ..., x_n) = (1/Z) \exp\Big( \sum_{i<j} w_{i,j} x_i x_j - \sum_i u_i x_i \Big).

When w_{i,j} > 0, nearby atoms are encouraged to have the same spin (called ferromagnetic), whereas w_{i,j} < 0 encourages X_i \neq X_j. The node potentials \exp(-u_i x_i) encode the bias of the individual atoms. Scaling the parameters makes the distribution more or less spiky.

Today's lecture

Markov random fields:
1. Factor graphs
2. Bayesian networks => Markov random fields (moralization)
3. Hammersley-Clifford theorem (conditional independence => joint distribution factorization)

Conditional models:
1. Discriminative versus generative classifiers
2. Conditional random fields

Higher-order potentials

The examples so far have all been pairwise MRFs, involving only node potentials \phi_i(X_i) and pairwise potentials \phi_{i,j}(X_i, X_j). Often we need higher-order potentials, e.g.

    \phi(x, y, z) = 1[x + y + z \geq 1],

where X, Y, Z are binary, enforcing that at least one of the variables takes the value 1. Although Markov networks are useful for understanding independencies, they hide much of the distribution's structure: from the Markov network over A, B, C, D alone, we cannot tell whether the distribution has pairwise potentials or one potential over all four variables.
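To make the normalization concrete, here is a minimal brute-force computation of the three-variable example above (our own sketch in Python, not from the lecture; the function and variable names are ours). It enumerates all 2^3 assignments, reproduces Z = 2060, and evaluates p(a, b, c) for particular assignments.

```python
import itertools

def phi(u, v):
    """Pairwise potential from the example: 10 if the two values agree, 1 otherwise."""
    return 10.0 if u == v else 1.0

def unnormalized(a, b, c):
    """Product of the three edge potentials phi_AB * phi_BC * phi_AC."""
    return phi(a, b) * phi(b, c) * phi(a, c)

# Partition function: sum over all 2^3 joint assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in itertools.product([0, 1], repeat=3))
print(Z)  # 2060.0, matching 2*1000 + 6*10 from the lecture

# Probabilities of individual assignments.
print(unnormalized(1, 1, 1) / Z)  # 1000/2060, roughly 0.485 (all variables agree)
print(unnormalized(0, 1, 1) / Z)  # 10/2060, roughly 0.005 (one variable disagrees)
```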
Factor graphs

The Markov network G does not reveal the structure of the distribution: maximum cliques vs. subsets of them. A factor graph is a bipartite undirected graph with variable nodes and factor nodes; edges are only between variable nodes and factor nodes. Each factor node is associated with a single potential, whose scope is the set of variables that are its neighbors in the factor graph. The same Markov network over A, B, C, D corresponds to several different factor graphs (for example, one factor per edge, or a single factor over all four variables). The distribution is the same as the MRF's; the factor graph is just a different data structure.

Example: low-density parity-check (LDPC) codes

LDPC codes are error-correcting codes for transmitting a message over a noisy channel (invented by Gallager in the 1960s, then re-discovered in 1996). The factor graph has parity-check factors f_A, f_B, f_C over the codeword bits Y_1, ..., Y_6, and observation factors f_1, ..., f_6 connecting each Y_i to the received bit X_i. Each of the top-row factors enforces that its variables have even parity, e.g.

    f_A(Y_1, Y_2, Y_3, Y_4) = 1 if Y_1 \oplus Y_2 \oplus Y_3 \oplus Y_4 = 0, and 0 otherwise.

Thus, the only assignments Y with non-zero probability are the following (called codewords) -- 3 bits encoded using 6 bits:

    000000, 011001, 110010, 101011, 111100, 100101, 001110, 010111.

The observation factors are f_i(Y_i, X_i) = p(X_i | Y_i), the likelihood of a bit flip according to the noise model.

The decoding problem for LDPCs is to find

    argmax_y p(y | x).

This is called the maximum a posteriori (MAP) assignment. Since Z and p(x) are constants with respect to the choice of y, we can equivalently solve (taking the log of p(y, x))

    argmax_y \sum_{c \in C} \theta_c(x_c),   where \theta_c(x_c) = \log \phi_c(x_c).

This is a discrete optimization problem!

Converting BNs to Markov networks

What is the equivalent Markov network for a hidden Markov model, with hidden states Y_1, ..., Y_6 and observations X_1, ..., X_6? Many inference algorithms are more conveniently given for undirected models; this conversion shows how they can be applied to Bayesian networks.

Moralization of Bayesian networks

Moralization is a procedure for converting a Bayesian network into a Markov network. The moral graph M[G] of a BN G = (V, E) is an undirected graph over V that contains an undirected edge between X_i and X_j if
1. there is a directed edge between them (in either direction), or
2. X_i and X_j are both parents of the same node.
(The term historically arose from the idea of "marrying the parents" of each node.) The addition of the moralizing edges leads to the loss of some independence information; e.g., for the v-structure A -> C <- B, the independence A \perp B is lost.

To convert a BN to a Markov network:
1. Moralize the directed graph to obtain the undirected graphical model.
2. Introduce one potential function for each CPD:

    \phi_i(x_i, x_{pa(i)}) = p(x_i | x_{pa(i)}).

So converting a hidden Markov model to a Markov network is simple. For variables having more than one parent, factor graph notation is useful.
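The moralization procedure above can be sketched in a few lines of Python (our illustration, not code from the lecture; the example BN, the A -> C <- B v-structure plus an edge C -> D, is a hypothetical choice).

```python
from itertools import combinations

def moralize(parents):
    """Return the edge set of the moral graph of a BN given as {node: [parents]}.

    Rule 1: connect every node to each of its parents (dropping edge directions).
    Rule 2: connect ("marry") every pair of parents that share a child.
    """
    edges = set()
    for child, pas in parents.items():
        for p in pas:
            edges.add(frozenset((p, child)))   # directed edge, made undirected
        for p1, p2 in combinations(pas, 2):
            edges.add(frozenset((p1, p2)))     # marry co-parents
    return edges

# Hypothetical BN: A -> C <- B, plus C -> D.
bn = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
print(sorted(tuple(sorted(e)) for e in moralize(bn)))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('C', 'D')] -- note the added A-B moralizing edge
```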
Factorization implies conditional independencies

p(x) is a Gibbs distribution over G if it can be written as

    p(x_1, ..., x_n) = (1/Z) \prod_{c \in C} \phi_c(x_c),

where the variables in each potential c \in C form a clique in G. Recall that conditional independence is given by graph separation: X_A \perp X_C | X_B whenever X_B separates X_A from X_C in the graph.

Theorem (soundness of separation): If p(x) is a Gibbs distribution for G, then G is an I-map for p(x), i.e. I(G) \subseteq I(p).

Proof: Suppose B separates A from C. Then we can write

    p(X_A, X_B, X_C) = (1/Z) f(X_A, X_B) g(X_B, X_C).

Conditional independencies imply factorization

What about the converse? We need one more assumption: a distribution is positive if p(x) > 0 for all x.

Theorem (Hammersley-Clifford, 1971): If p(x) is a positive distribution and G is an I-map for p(x), then p(x) is a Gibbs distribution that factorizes over G.

The proof is in the book (as is a counter-example for when p(x) is not positive). This is important for learning: prior knowledge is often in the form of conditional independencies (i.e., a graph structure G), and Hammersley-Clifford tells us that it suffices to search over Gibbs distributions for G -- this allows us to parameterize the distribution.

Discriminative versus generative classifiers

There is often significant flexibility in choosing the structure and parameterization of a graphical model. A generative model has the class variable generating the features (Y -> X), whereas a discriminative model conditions on the features (X -> Y). It is important to understand the trade-offs. In the next few slides, we study this question in the context of e-mail classification.

From lecture 1 ...
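The e-mail classification discussion continues beyond the extracted pages (the last slide, "From lecture 1 ...", is cut off). As a self-contained illustration of the contrast just introduced (our own sketch, assuming scikit-learn is available; the toy data is synthetic): a generative naive Bayes classifier models p(y) and p(x | y) and applies Bayes' rule, while a discriminative logistic regression models p(y | x) directly.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB          # generative: models p(y) and p(x | y)
from sklearn.linear_model import LogisticRegression  # discriminative: models p(y | x) directly

rng = np.random.default_rng(0)

# Toy "e-mail" data: 200 messages, 6 binary word-presence features; words 0 and 1
# are made more likely in spam (y = 1). A synthetic stand-in for a real corpus.
y = rng.integers(0, 2, size=200)
word_probs = np.where(y[:, None] == 1, [0.8, 0.7, 0.5, 0.5, 0.5, 0.5],
                                       [0.2, 0.3, 0.5, 0.5, 0.5, 0.5])
X = (rng.random((200, 6)) < word_probs).astype(int)

generative = BernoulliNB().fit(X, y)
discriminative = LogisticRegression().fit(X, y)

x_new = np.array([[1, 1, 0, 1, 0, 0]])
print("naive Bayes        p(spam | x) =", generative.predict_proba(x_new)[0, 1])
print("logistic regression p(spam | x) =", discriminative.predict_proba(x_new)[0, 1])
```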
Recommended publications
  • Graphical Models for Discrete and Continuous Data (arXiv:1609.05551v3)
    Graphical Models for Discrete and Continuous Data. Rui Zhuang (Department of Biostatistics, University of Washington), Noah Simon (Department of Biostatistics, University of Washington), and Johannes Lederer (Department of Mathematics, Ruhr-University Bochum). June 18, 2019. Abstract: We introduce a general framework for undirected graphical models. It generalizes Gaussian graphical models to a wide range of continuous, discrete, and combinations of different types of data. The models in the framework, called exponential trace models, are amenable to estimation based on maximum likelihood. We introduce a sampling-based approximation algorithm for computing the maximum likelihood estimator, and we apply this pipeline to learn simultaneous neural activities from spike data. Keywords: Non-Gaussian Data, Graphical Models, Maximum Likelihood Estimation. arXiv:1609.05551v3 [math.ST], 15 Jun 2019. 1 Introduction. Gaussian graphical models (Drton & Maathuis 2016, Lauritzen 1996, Wainwright & Jordan 2008) describe the dependence structures in normally distributed random vectors. These models have become increasingly popular in the sciences, because their representation of the dependencies is lucid and can be readily estimated. For a brief overview, consider a random vector X ∈ R^p that follows a centered normal distribution with density

        f_Σ(x) = (2π)^{-p/2} |Σ|^{-1/2} exp(-x^T Σ^{-1} x / 2)    (1)

    with respect to Lebesgue measure, where the population covariance Σ ∈ R^{p×p} is a symmetric and positive definite matrix. Gaussian graphical models associate these densities with a graph (V, E) that has vertex set V := {1, ..., p} and edge set E := {(i, j) : i, j ∈ {1, ..., p}, i ≠ j, (Σ^{-1})_{ij} ≠ 0}. The graph encodes the dependence structure of X in the sense that any two entries X_i, X_j, i ≠ j, are conditionally independent given all other entries if and only if (i, j) ∉ E.
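    A small illustration of the edge-set definition above (our own sketch, not from the paper): the graph is read off the non-zero pattern of the precision matrix Σ^{-1}. The example covariance is hypothetical.

```python
import numpy as np

def ggm_edges(Sigma, tol=1e-10):
    """Edges of the Gaussian graphical model: pairs (i, j), i < j, with nonzero precision."""
    K = np.linalg.inv(Sigma)  # precision matrix Sigma^{-1}
    p = K.shape[0]
    return [(i, j) for i in range(p) for j in range(i + 1, p) if abs(K[i, j]) > tol]

# Hypothetical 3-variable model in which X_0 and X_2 are conditionally
# independent given X_1 (a chain X_0 -- X_1 -- X_2), specified via its precision.
K_true = np.array([[1.0, 0.4, 0.0],
                   [0.4, 1.0, 0.4],
                   [0.0, 0.4, 1.0]])
Sigma = np.linalg.inv(K_true)
print(ggm_edges(Sigma))  # [(0, 1), (1, 2)] -- no edge (0, 2)
```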
  • One-Network Adversarial Fairness
    One-network Adversarial Fairness. Tameem Adel (University of Cambridge, UK), Isabel Valera (MPI-IS, Germany), Zoubin Ghahramani (University of Cambridge, UK; Uber AI Labs, USA), Adrian Weller (University of Cambridge, UK; The Alan Turing Institute, UK). Abstract: There is currently a great expansion of the impact of machine learning algorithms on our lives, prompting the need for objectives other than pure performance, including fairness. Fairness here means that the outcome of an automated decision-making system should not discriminate between subgroups characterized by sensitive attributes such as gender or race. Given any existing differentiable classifier, we make only slight adjustments to the architecture, including adding a new hidden layer, in order to enable the concurrent adversarial optimization for fairness and accuracy. Our framework provides one way to quantify the tradeoff between fairness and accuracy. ... is general, in that it may be applied to any differentiable discriminative model. We establish a fairness paradigm where the architecture of a deep discriminative model, optimized for accuracy, is modified such that fairness is imposed (the same paradigm could be applied to a deep generative model in future work). Beginning with an ordinary neural network optimized for prediction accuracy of the class labels in a classification task, we propose an adversarial fairness framework performing a change to the network architecture, leading to a neural network that is maximally uninformative about the sensitive attributes of the data as well as predictive of the class labels.
  • Remote Sensing
    Remote Sensing (Article). A Method for the Estimation of Finely-Grained Temporal Spatial Human Population Density Distributions Based on Cell Phone Call Detail Records. Guangyuan Zhang (1), Xiaoping Rui (2,*), Stefan Poslad (1), Xianfeng Song (3), Yonglei Fan (3) and Bang Wu (1). (1) School of Electronic Engineering and Computer Science, Queen Mary University of London, London E1 4NS, UK; (2) School of Earth Sciences and Engineering, Hohai University, Nanjing 211000, China; (3) College of Resources and Environment, University of Chinese Academy of Sciences, Beijing 100049, China. Received: 10 June 2020; Accepted: 8 August 2020; Published: 10 August 2020. Abstract: Estimating and mapping population distributions dynamically at a city-wide spatial scale, including those covering suburban areas, has profound practical applications such as urban and transportation planning, public safety warning, disaster impact assessment and epidemiological modelling, which benefits governments, merchants and citizens. More recently, call detail record (CDR) mobile phone data has been used to estimate human population distributions. However, a key challenge is that the accuracy of such a method is difficult to validate, because there is no ground-truth data for the dynamic population density distribution at time scales such as hourly. In this study, we present a simple and accurate method to generate more finely grained temporal-spatial population density distributions based upon CDR data. We designed an experiment to test our method based upon the use of a deep convolutional generative adversarial network (DCGAN).
  • Equivalence Between Minimal Generative Model Graphs and Directed Information Graphs
    Equivalence Between Minimal Generative Model Graphs and Directed Information Graphs. Christopher J. Quinn (Department of Electrical and Computer Engineering, University of Illinois, Urbana, Illinois 61801), Negar Kiyavash (Department of Industrial and Enterprise Systems Engineering, University of Illinois), Todd P. Coleman (Department of Electrical and Computer Engineering, University of Illinois). Abstract: We propose a new type of probabilistic graphical model, based on directed information, to represent the causal dynamics between processes in a stochastic system. We show the practical significance of such graphs by proving their equivalence to generative model graphs which succinctly summarize interdependencies for causal dynamical systems under mild assumptions. This equivalence means that directed information graphs may be used for causal inference and learning tasks in the same manner Bayesian networks are used for correlative statistical inference and learning. I. Introduction. Graphical models facilitate design and analysis of interacting multivariate probabilistic complex systems that arise in numerous fields such as statistics, information theory, machine learning. ... own past [4], [5]. This improvement is measured in terms of average reduction in the encoding length of the information. Clearly such a definition of causality has the notion of timing ingrained in it. Therefore, we expect a graphical model defined based on directed information to deal with processes as its nodes. In this work, we formally define directed information graphs, an alternative type of graphical model. In these graphs, nodes represent processes, and directed edges correspond to whether there is directed information from one process to another. To demonstrate the practical significance of directed information graphs, we validate their ability to capture causal interdependencies for a class of interacting processes corresponding to a causal dynamical system.
  • On Construction of Hybrid Logistic Regression-Naïve Bayes Model for Classification
    JMLR: Workshop and Conference Proceedings vol 52, 523-534, 2016 (PGM 2016). On Construction of Hybrid Logistic Regression-Naïve Bayes Model for Classification. Yi Tan & Prakash P. Shenoy (University of Kansas School of Business, 1654 Naismith Drive, Lawrence, KS 66045), Moses W. Chan & Paul M. Romberg (Advanced Technology Center, Lockheed Martin Space Systems, Sunnyvale, CA 94089). Abstract: In recent years, several authors have described a hybrid discriminative-generative model for classification. In this paper we examine construction of such hybrid models from data where we use logistic regression (LR) as a discriminative component, and naïve Bayes (NB) as a generative component. First, we estimate a Markov blanket of the class variable to reduce the set of features. Next, we use a heuristic to partition the set of features in the Markov blanket into those that are assigned to the LR part, and those that are assigned to the NB part of the hybrid model. The heuristic is based on reducing the conditional dependence of the features in the NB part of the hybrid model given the class variable. We implement our method on 21 different classification datasets, and we compare the prediction accuracy of hybrid models with those of pure LR and pure NB models. Keywords: logistic regression, naïve Bayes, hybrid logistic regression-naïve Bayes. 1. Introduction. Data classification is a very common task in machine learning and statistics. It is considered an instance of supervised learning. For a standard supervised learning problem, we implement a learning algorithm on a set of training examples, which are pairs consisting of an input object, a vector of explanatory variables F_1, ..., F_n, called features, and a desired output object, a response variable C, called class variable.
  • Graphical Models
    Graphical Models. Michael I. Jordan (Computer Science Division and Department of Statistics, University of California, Berkeley 94720). Abstract: Statistical applications in fields such as bioinformatics, information retrieval, speech processing, image processing and communications often involve large-scale models in which thousands or millions of random variables are linked in complex ways. Graphical models provide a general methodology for approaching these problems, and indeed many of the models developed by researchers in these applied fields are instances of the general graphical model formalism. We review some of the basic ideas underlying graphical models, including the algorithmic ideas that allow graphical models to be deployed in large-scale data analysis problems. We also present examples of graphical models in bioinformatics, error-control coding and language processing. Keywords: Probabilistic graphical models; junction tree algorithm; sum-product algorithm; Markov chain Monte Carlo; variational inference; bioinformatics; error-control coding. 1. Introduction. The fields of Statistics and Computer Science have generally followed separate paths for the past several decades, with each field providing useful services to the other, but with the core concerns of the two fields rarely appearing to intersect. In recent years, however, it has become increasingly evident that the long-term goals of the two fields are closely aligned. Statisticians are increasingly concerned with the computational aspects, both theoretical and practical, of models and inference procedures. Computer scientists are increasingly concerned with systems that interact with the external world and interpret uncertain data in terms of underlying probabilistic models. One area in which these trends are most evident is that of probabilistic graphical models.
  • Data-Driven Topo-Climatic Mapping with Machine Learning Methods
    Nat Hazards (2009) 50:497-518, DOI 10.1007/s11069-008-9339-y. Original paper: Data-driven topo-climatic mapping with machine learning methods. A. Pozdnoukhov, L. Foresti, M. Kanevski. Received: 31 October 2007 / Accepted: 19 December 2008 / Published online: 16 January 2009. © Springer Science+Business Media B.V. 2009. Abstract: Automatic environmental monitoring networks enforced by wireless communication technologies provide large and ever increasing volumes of data nowadays. The use of this information in natural hazard research is an important issue. Particularly useful for risk assessment and decision making are the spatial maps of hazard-related parameters produced from point observations and available auxiliary information. The purpose of this article is to present and explore the appropriate tools to process large amounts of available data and produce predictions at fine spatial scales. These are the algorithms of machine learning, which are aimed at non-parametric robust modelling of non-linear dependencies from empirical data. The computational efficiency of the data-driven methods allows producing the prediction maps in real time, which makes them superior to physical models for operational use in risk assessment and mitigation. This situation is encountered particularly in the spatial prediction of climatic variables (topo-climatic mapping). In the complex topographies of mountainous regions, the meteorological processes are highly influenced by the relief. The article shows how these relations, possibly regionalized and non-linear, can be modelled from data using the information from digital elevation models.
  • A Hierarchical Generative Model of Electrocardiogram Records
    A Hierarchical Generative Model of Electrocardiogram Records. Andrew C. Miller (School of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138), Sendhil Mullainathan (Department of Economics, Harvard University, Cambridge, MA 02138), Ziad Obermeyer (Brigham and Women's Hospital, Harvard Medical School, Boston, MA 02115). Abstract: We develop a probabilistic generative model of electrocardiogram (EKG) tracings. Our model describes multiple sources of variation in EKGs, including patient-specific cardiac cycle morphology and between-cycle variation that leads to quasi-periodicity. We use a deep generative network as a flexible model component to describe variation in beat-specific morphology. We apply our model to a set of 549 EKG records, including over 4,600 unique beats, and show that it is able to discover interpretable information, such as patient similarity and meaningful physiological features (e.g., T wave inversion). 1 Introduction. An electrocardiogram (EKG) is a common non-invasive medical test that measures the electrical activity of a patient's heart by recording the time-varying potential difference between electrodes placed on the surface of the skin. The resulting data is a multivariate time-series that reflects the depolarization and repolarization of the heart muscle that occurs during each heartbeat. These raw waveforms are then inspected by a physician to detect irregular patterns that are evidence of an underlying physiological problem. In this paper, we build a hierarchical generative model of electrocardiogram signals that disentangles sources of variation. We are motivated to build a generative model of EKGs for multiple reasons.
  • A Discriminative Model to Predict Comorbidities in ASD
    A Discriminative Model to Predict Comorbidities in ASD. The Harvard community has made this article openly available. Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:38811457. Terms of use: this article was downloaded from Harvard University's DASH repository and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA. Contents: 1 Introduction (Overview; Motivation; Related Work; Contribution); 2 Basic Theory (Discriminative Classifiers; Topic Modeling); 3 Experimental Design and Methods (Data Collection; Exploratory Analysis; Experimental Setup); 4 Results and Conclusion (Evaluation; Limitations; Future Work; Conclusion); Bibliography. List of figures: forum posts from AutismWeb (user Zardoz, February and March 2006), number of CUIs extracted per post, distribution of associated ages extracted from posts, number of posts written per author, experimental setup, distributions of AUCs for single-concept word predictions and persistence, and AUC of single-concept word comparisons.
  • Bias Correction of Learned Generative Models Using Likelihood-Free Importance Weighting
    Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting. Aditya Grover (1), Jiaming Song (1), Alekh Agarwal (2), Kenneth Tran (2), Ashish Kapoor (2), Eric Horvitz (2), Stefano Ermon (1). (1) Stanford University, (2) Microsoft Research, Redmond. Abstract: A learned generative model often produces biased statistics relative to the underlying data distribution. A standard technique to correct this bias is importance sampling, where samples from the model are weighted by the likelihood ratio under model and true distributions. When the likelihood ratio is unknown, it can be estimated by training a probabilistic classifier to distinguish samples from the two distributions. We show that this likelihood-free importance weighting method induces a new energy-based model and employ it to correct for the bias in existing models. We find that this technique consistently improves standard goodness-of-fit metrics for evaluating the sample quality of state-of-the-art deep generative models, suggesting reduced bias. Finally, we demonstrate its utility on representative applications in a) data augmentation for classification using generative adversarial networks, and b) model-based policy evaluation using off-policy data. 1 Introduction. Learning generative models of complex environments from high-dimensional observations is a long-standing challenge in machine learning. Once learned, these models are used to draw inferences and to plan future actions. For example, in data augmentation, samples from a learned model are used to enrich a dataset for supervised learning [1]. In model-based off-policy policy evaluation (henceforth MBOPE), a learned dynamics model is used to simulate and evaluate a target policy without real-world deployment [2], which is especially valuable for risk-sensitive applications [3].
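    The core idea described in this abstract, estimating the likelihood ratio with a classifier that separates data samples from model samples, can be sketched as follows (our illustration with synthetic one-dimensional data; the paper's actual pipeline differs in scale and details).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# "True" data and samples from a (biased) learned model -- synthetic stand-ins.
x_data = rng.normal(loc=0.0, scale=1.0, size=(5000, 1))
x_model = rng.normal(loc=0.5, scale=1.0, size=(5000, 1))

# Probabilistic classifier distinguishing data (label 1) from model samples (label 0).
X = np.vstack([x_data, x_model])
y = np.concatenate([np.ones(len(x_data)), np.zeros(len(x_model))])
clf = LogisticRegression().fit(X, y)

# Likelihood-free importance weights: w(x) ~ p_data(x) / p_model(x) = D(x) / (1 - D(x)).
d = clf.predict_proba(x_model)[:, 1]
w = d / (1.0 - d)

print("model mean (biased):  ", x_model.mean())
print("importance-weighted:  ", np.average(x_model[:, 0], weights=w))  # closer to the data mean 0
```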
  • Probabilistic PCA and Factor Analysis
    Generative Models for Dimensionality Reduction: Probabilistic PCA and Factor Analysis. Piyush Rai, Machine Learning (CS771A), Oct 5, 2016. Generative model for dimensionality reduction: assume the following generative story for each x_n ∈ R^D. Generate latent variables z_n ∈ R^K (K << D) as z_n ~ N(0, I_K). Generate data x_n conditioned on z_n as x_n ~ N(W z_n, σ² I_D), where W is the D × K "factor loading matrix" or "dictionary", and z_n is the K-dimensional latent features, latent factors, or "coding" of x_n w.r.t. W. Note: we can also write x_n as a linear transformation of z_n plus Gaussian noise: x_n = W z_n + ε_n, where ε_n ~ N(0, σ² I_D). This is "Probabilistic PCA" (PPCA) with a Gaussian observation model. We want to learn the model parameters W, σ² and the latent factors {z_n}, n = 1, ..., N. When ε_n ~ N(0, Ψ) with Ψ diagonal, the model is called "Factor Analysis" (FA).
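    A minimal sketch of the PPCA generative story above (our code; the dimensions and parameter values are arbitrary choices): sample z_n ~ N(0, I_K), then x_n = W z_n + ε_n with ε_n ~ N(0, σ² I_D).

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, K = 500, 10, 2           # number of points, data dim, latent dim (K << D)
sigma = 0.1                    # observation noise standard deviation
W = rng.normal(size=(D, K))    # D x K factor loading matrix ("dictionary")

Z = rng.normal(size=(N, K))                     # z_n ~ N(0, I_K)
X = Z @ W.T + sigma * rng.normal(size=(N, D))   # x_n = W z_n + eps_n,  eps_n ~ N(0, sigma^2 I_D)

# Sanity check: the marginal covariance of x is W W^T + sigma^2 I_D.
emp_cov = np.cov(X, rowvar=False)
model_cov = W @ W.T + (sigma ** 2) * np.eye(D)
print(np.abs(emp_cov - model_cov).max())  # small deviation for large N
```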
  • Learning a Generative Model of Images by Factoring Appearance and Shape
    Learning a generative model of images by factoring appearance and shape. Nicolas Le Roux (1), Nicolas Heess (2), Jamie Shotton (1), John Winn (1). (1) Microsoft Research Cambridge, (2) University of Edinburgh. Keywords: generative model of images, occlusion, segmentation, deep networks. Abstract: Computer vision has grown tremendously in the last two decades. Despite all efforts, existing attempts at matching parts of the human visual system's extraordinary ability to understand visual scenes lack either scope or power. By combining the advantages of general low-level generative models and powerful layer-based and hierarchical models, this work aims at being a first step towards richer, more flexible models of images. After comparing various types of RBMs able to model continuous-valued data, we introduce our basic model, the masked RBM, which explicitly models occlusion boundaries in image patches by factoring the appearance of any patch region from its shape. We then propose a generative model of larger images using a field of such RBMs. Finally, we discuss how masked RBMs could be stacked to form a deep model able to generate more complicated structures and suitable for various tasks such as segmentation or object recognition. 1 Introduction. Despite much progress in the field of computer vision in recent years, interpreting and modeling the bewildering structure of natural images remains a challenging problem. The limitations of even the most advanced systems become strikingly obvious when contrasted with the ease, flexibility, and robustness with which the human visual system analyzes and interprets an image. Computer vision is a problem domain where the structure that needs to be represented is complex and strongly task dependent, and the input data is often highly ambiguous.