Machine Learning: What Is a Graphical Model?

Machine Learning 10-701/15-781, Spring 2008
Graphical Models
Eric Xing
Lecture 18, March 26, 2008
Reading: Chap. 8, C. Bishop book

[Running example: the eight-node "Asia" network — Visit to Asia (X1), Smoking (X2), Tuberculosis (X3), Lung Cancer (X4), Bronchitis (X5), Tuberculosis or Cancer (X6), X-Ray Result (X7), Dyspnea (X8).]

What is a graphical model? --- example from medical diagnostics

• A possible world for a patient with lung problems: [the "Asia" network above]

Recap of Basic Prob. Concepts

• Representation: what is the joint probability distribution on multiple variables, P(X1, X2, X3, X4, X5, X6, X7, X8)?
  • How many state configurations in total? --- 2^8 = 256
  • Do all of them need to be represented explicitly?
  • Do we get any scientific/medical insight?
• Learning: where do we get all these probabilities?
  • Maximum-likelihood estimation? But how much data do we need?
  • Where do we put domain knowledge, in terms of plausible relationships between variables and plausible values of the probabilities?
• Inference: if not all variables are observable, how do we compute the conditional distribution of latent variables given evidence?

Dependencies among variables

[The "Asia" network again, with the nodes grouped into patient information (X1, X2), medical difficulties (X3–X6), and diagnostic tests (X7, X8).]

Probabilistic Graphical Models

• Represent the dependency structure with a graph
  • Node <-> random variable
  • Edges encode dependencies
  • Absence of an edge -> conditional independence
  • Directed and undirected versions
• Why is this useful?
  • A language for communication
  • A language for computation
  • A language for development
• Origins:
  • Wright, 1920s
  • Independently developed by Spiegelhalter and Lauritzen in statistics and Pearl in computer science in the late 1980s

Probabilistic Graphical Models, con'd

If the Xi's are conditionally independent (as described by a PGM), the joint can be factored into a product of simpler terms, e.g.,

  P(X1, X2, X3, X4, X5, X6, X7, X8)
  = P(X1) P(X2) P(X3 | X1) P(X4 | X2) P(X5 | X2) P(X6 | X3, X4) P(X7 | X6) P(X8 | X5, X6)

Why might we favor a PGM?
• Representation cost: how many probability statements are needed? 2+2+4+4+4+8+4+8 = 36, an 8-fold reduction from 2^8 = 256!
• Algorithms for systematic and efficient inference/learning computation
  • Exploring the graph structure and the probabilistic (e.g., Bayesian, Markovian) semantics
• Incorporation of domain knowledge and causal (logical) structures

Two types of GMs

• Directed edges give causality relationships (Bayesian Network or Directed Graphical Model)
• Undirected edges simply give (physical or symmetric) correlations between variables (Markov Random Field or Undirected Graphical Model)

Bayesian Network: Factorization Theorem

[The "Asia" network and its factorization P(X1, ..., X8) = P(X1) P(X2) P(X3 | X1) P(X4 | X2) P(X5 | X2) P(X6 | X3, X4) P(X7 | X6) P(X8 | X5, X6), as above.]

• Theorem: Given a DAG, the most general form of the probability distribution that is consistent with the graph factors according to "node given its parents":

  P(X) = ∏_{i=1..d} P(X_i | X_{π_i})

  where π_i is the set of parents of X_i and d is the number of nodes (variables) in the graph.
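To make the factorization and the representation-cost claim concrete, here is a minimal Python sketch (not from the lecture; the dictionary layout and the helper joint_prob are illustrative choices) that encodes the parent structure of the eight-node network, reproduces the 36-entry count, and evaluates the joint as a product of node-given-parents terms.

```python
# Parent sets read off the DAG; all variables are binary.
parents = {
    "X1": [],            # Visit to Asia
    "X2": [],            # Smoking
    "X3": ["X1"],        # Tuberculosis
    "X4": ["X2"],        # Lung cancer
    "X5": ["X2"],        # Bronchitis
    "X6": ["X3", "X4"],  # Tuberculosis or cancer
    "X7": ["X6"],        # X-ray result
    "X8": ["X5", "X6"],  # Dyspnea
}

# Each node stores a table with 2 * 2^{#parents} entries.
table_sizes = {v: 2 ** (1 + len(pa)) for v, pa in parents.items()}
print(sum(table_sizes.values()))   # 2+2+4+4+4+8+4+8 = 36, versus 2**8 = 256

def joint_prob(assignment, cpts):
    """P(x) = prod_i P(x_i | x_pa(i)), with CPTs stored as
    cpts[node][(parent values)][node value]."""
    p = 1.0
    for v, pa in parents.items():
        key = tuple(assignment[u] for u in pa)
        p *= cpts[v][key][assignment[v]]
    return p
```
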
Bayesian Network: Conditional Independence Semantics

• Structure: DAG
• Meaning: a node is conditionally independent of every other node in the network outside its Markov blanket (its parents, children, and children's co-parents)
• Local conditional distributions (CPDs) and the DAG completely determine the joint distribution
• Gives causality relationships, and facilitates a generative process

[Figure: a node X with its ancestors, parents Y1 and Y2, children, children's co-parents, and descendants.]

Local Structures & Independencies

• Common parent (A <- B -> C): fixing B decouples A and C — "given the level of gene B, the levels of A and C are independent"
• Cascade (A -> B -> C): knowing B decouples A and C — "given the level of gene B, the level of gene A provides no extra prediction value for the level of gene C"
• V-structure (A -> C <- B): knowing C couples A and B, because A can "explain away" B w.r.t. C — "if A correlates to C, then the chance for B to also correlate to C will decrease"
• The language is compact, the concepts are rich!

A simple justification

[Figure: the common-parent structure over A, B, C.]

Graph separation criterion

• D-separation criterion for Bayesian networks (D for directed edges):
  Definition: variables x and y are D-separated (conditionally independent) given z if they are separated in the moralized ancestral graph
• Example: [figure]

Global Markov properties of DAGs

• X is d-separated (directed-separated) from Z given Y if we can't send a ball from any node in X to any node in Z using the "Bayes-ball" algorithm (plus some boundary conditions)
• Defn: I(G) = all independence properties that correspond to d-separation:

  I(G) = {X ⊥ Z | Y : dsep_G(X; Z | Y)}

• D-separation is sound and complete

Example

• Complete the I(G) of this graph: [figure: a four-node graph over x1, x2, x3, x4]

Towards quantitative specification of probability distribution

• Separation properties in the graph imply independence properties about the associated variables
• For the graph to be useful, any conditional independence properties we can derive from the graph should hold for the probability distribution that the graph represents
• The Equivalence Theorem: For a graph G, let D1 denote the family of all distributions that satisfy I(G), and let D2 denote the family of all distributions that factor according to G. Then D1 ≡ D2.

Example

[Figure: a three-node graph over A, B, C; write out p(A, B, C) = ...]

Example, con'd

• Evolution: [figure — a phylogenetic tree relating an ancestor, after T years, to observed nucleotides (A, C, G, ...)] Tree Model

Example, con'd

• Speech recognition: [figure — hidden states Y1, Y2, Y3, ..., YT emitting observations X1, X2, X3, ..., XT] Hidden Markov Model

Example, con'd

• Genetic pedigree: [figure — a pedigree model with variables A0, B0, Ag, Bg, A1, B1, F0, Fg, M0, F1, Sg, M1, C0, Cg, C1]

Conditional probability tables (CPTs)

Graph: A and B are parents of C, and C is the parent of D.

  P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)

  P(a):   a0 = 0.75, a1 = 0.25
  P(b):   b0 = 0.33, b1 = 0.67

  P(c | a, b):    a0b0   a0b1   a1b0   a1b1
          c0      0.45   1      0.9    0.7
          c1      0.55   0      0.1    0.3

  P(d | c):       c0     c1
          d0      0.3    0.5
          d1      0.7    0.5
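The CPTs above fully specify a generative model. Here is a short Python sketch of how they can be used (the helper names are illustrative; the numbers are the table entries above): evaluating the joint as P(a)P(b)P(c|a,b)P(d|c), and drawing samples by ancestral sampling, i.e., sampling parents before children.

```python
import random

# CPTs from the four-node example above (A, B, C, D binary).
P_a = {0: 0.75, 1: 0.25}
P_b = {0: 0.33, 1: 0.67}
P_c_given_ab = {            # P(c=0 | a, b); P(c=1 | a, b) = 1 - value
    (0, 0): 0.45, (0, 1): 1.0, (1, 0): 0.9, (1, 1): 0.7,
}
P_d_given_c = {0: 0.3, 1: 0.5}   # P(d=0 | c)

def joint(a, b, c, d):
    """P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)."""
    pc0 = P_c_given_ab[(a, b)]
    pc = pc0 if c == 0 else 1.0 - pc0
    pd0 = P_d_given_c[c]
    pd = pd0 if d == 0 else 1.0 - pd0
    return P_a[a] * P_b[b] * pc * pd

def sample():
    """Ancestral sampling: draw each node after its parents."""
    a = int(random.random() < P_a[1])
    b = int(random.random() < P_b[1])
    c = int(random.random() >= P_c_given_ab[(a, b)])   # 1 with prob P(c=1 | a, b)
    d = int(random.random() >= P_d_given_c[c])          # 1 with prob P(d=1 | c)
    return a, b, c, d

print(joint(0, 0, 0, 0))   # 0.75 * 0.33 * 0.45 * 0.3
print(sample())
```
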
Conditional probability density functions (CPDs)

  P(a, b, c, d) = P(a) P(b) P(c | a, b) P(d | c)

  A ~ N(µa, Σa)
  B ~ N(µb, Σb)
  C ~ N(A + B, Σc)
  D ~ N(µa + C, Σa)

[Figure: the density P(D | C) plotted as a function of C and D.]

Conditionally Independent Observations

[Figure: model parameters θ with arrows to the data y1, y2, ..., yn-1, yn.]

"Plate" Notation

• θ: model parameters; yi, i = 1:n: data = {y1, ..., yn}
• Plate = rectangle in a graphical model; variables within a plate are replicated in a conditionally independent manner

Example: Gaussian Model

• Generative model:

  p(y1, ..., yn | µ, σ) = ∏_i p(yi | µ, σ) = p(data | parameters) = p(D | θ), where θ = {µ, σ}

• Likelihood = p(data | parameters) = p(D | θ) = L(θ)
• The likelihood tells us how likely the observed data are, conditioned on a particular setting of the parameters
• Often easier to work with log L(θ)

Example: Bayesian Gaussian Model

[Figure: hyperparameters α and β over µ and σ, which in turn generate yi, i = 1:n.]

• Note: priors and parameters are assumed independent here

Markov Random Fields

• Structure: an undirected graph
• Meaning: a node is conditionally independent of every other node in the network given its direct neighbors
• Local contingency functions (potentials) and the cliques in the graph completely determine the joint distribution
• Gives correlations between variables, but no explicit way to generate samples

Semantics of Undirected Graphs

• Let H be an undirected graph:
• B separates A and C if every path from a node in A to a node in C passes through a node in B: sep_H(A; C | B)
• A probability distribution satisfies the global Markov property if for any disjoint A, B, C, such that B separates A and C, A is independent of C given B:

  I(H) = {A ⊥ C | B : sep_H(A; C | B)}

Representation

• Defn: an undirected graphical model represents a distribution P(X1, ..., Xn) defined by an undirected graph H and a set of positive potential functions ψc associated with the cliques of H, s.t.

  P(x1, ..., xn) = (1/Z) ∏_{c∈C} ψc(xc)

  where Z is known as the partition function:

  Z = Σ_{x1,...,xn} ∏_{c∈C} ψc(xc)

• Also known as Markov Random Fields, Markov networks, ...
• The potential function can be understood as a contingency function of its arguments, assigning a "pre-probabilistic" score to their joint configuration.
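As a complement to the definition above, here is a small Python sketch (the three-node chain and its potential values are made up purely for illustration) that computes the unnormalized product of clique potentials, the partition function Z by brute-force summation over all assignments, and the resulting normalized probability.

```python
from itertools import product

# A small undirected chain X1 - X2 - X3 with binary variables and
# illustrative pairwise potentials psi(x_i, x_j) > 0 on its edge cliques.
psi = {
    ("X1", "X2"): {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0},
    ("X2", "X3"): {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.5},
}
variables = ["X1", "X2", "X3"]

def unnormalized(x):
    """Product of clique potentials for an assignment x = {var: value}."""
    score = 1.0
    for (u, v), table in psi.items():
        score *= table[(x[u], x[v])]
    return score

# Partition function Z: sum of unnormalized scores over all 2^3 assignments.
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in product([0, 1], repeat=len(variables)))

def prob(x):
    """P(x) = (1/Z) * prod_c psi_c(x_c)."""
    return unnormalized(x) / Z

print(Z)
print(prob({"X1": 0, "X2": 0, "X3": 0}))
```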