An Introduction to Graphical Models

AN INTRODUCTION TO GRAPHICAL MODELS

Michael I. Jordan
Center for Biological and Computational Learning
Massachusetts Institute of Technology
http://www.ai.mit.edu/projects/jordan.html

Acknowledgments: Zoubin Ghahramani, Tommi Jaakkola, Marina Meila, Lawrence Saul

December, 1997

GRAPHICAL MODELS

- Graphical models are a marriage between graph theory and probability theory.
- They clarify the relationship between neural networks and related network-based models such as HMMs, MRFs, and Kalman filters.
- Indeed, they can be used to give a fully probabilistic interpretation to many neural network architectures.
- Some advantages of the graphical model point of view:
  - inference and learning are treated together
  - supervised and unsupervised learning are merged seamlessly
  - missing data are handled nicely
  - a focus on conditional independence and computational issues
  - interpretability (if desired)

Graphical models (cont.)

- There are two kinds of graphical models: those based on undirected graphs and those based on directed graphs. Our main focus will be directed graphs.
- Alternative names for graphical models: belief networks, Bayesian networks, probabilistic independence networks, Markov random fields, loglinear models, influence diagrams.
- A few myths about graphical models:
  - they require a localist semantics for the nodes
  - they require a causal semantics for the edges
  - they are necessarily Bayesian
  - they are intractable

Learning and inference

- A key insight from the graphical model point of view: it is not necessary to learn that which can be inferred.
- The weights in a network make local assertions about the relationships between neighboring nodes.
- Inference algorithms turn these local assertions into global assertions about the relationships between nodes:
  - e.g., correlations between hidden units conditional on an input-output pair
  - e.g., the probability of an input vector given an output vector
- This is achieved by associating a joint probability distribution with the network.

Directed graphical models -- basics

- Consider an arbitrary directed acyclic graph, where each node in the graph corresponds to a random variable (scalar or vector):

  [Figure: a six-node DAG with edges A -> C, B -> C, C -> D, C -> E, D -> F, E -> F]

- There is no a priori need to designate units as "inputs," "outputs," or "hidden."
- We want to associate a probability distribution P(A, B, C, D, E, F) with this graph, and we want all of our calculations to respect this distribution, e.g.,

  P(F \mid A, B) = \frac{\sum_{C,D,E} P(A,B,C,D,E,F)}{\sum_{C,D,E,F} P(A,B,C,D,E,F)}

Some ways to use a graphical model

- Prediction
- Diagnosis, control, optimization
- Supervised learning

  [Figures: the example graph with different subsets of nodes shaded as evidence in each case]

- We want to marginalize over the unshaded nodes in each case (i.e., integrate them out from the joint probability); "unsupervised learning" is the general case.

Specification of a graphical model

- There are two components to any graphical model:
  - the qualitative specification
  - the quantitative specification
- Where does the qualitative specification come from?
  - prior knowledge of causal relationships
  - prior knowledge of modular relationships
  - assessment from experts
  - learning from data
  - or we simply like a certain architecture (e.g., a layered graph)

Qualitative specification of graphical models

  [Figure: the chain A -> C -> B, with C unobserved and then observed]

- In the chain, A and B are marginally dependent, but A and B are conditionally independent given C.

  [Figure: the common cause C -> A, C -> B, with C unobserved and then observed]

- With a common cause, A and B are again marginally dependent, and conditionally independent given C.

Semantics of graphical models (cont.)

  [Figure: the common effect A -> C <- B, with C unobserved and then observed]

- With a common effect, the pattern reverses: A and B are marginally independent, but A and B are conditionally dependent given C.
- This is the interesting case...
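To see this last case numerically, here is a minimal sketch in Python, assuming binary nodes and made-up conditional probability tables: A and B are independent causes of a common effect C, and conditioning on C couples them.

```python
import itertools

# Illustrative numbers only: A and B are independent binary causes of C
# (the converging connection A -> C <- B from the slide above).
P_A = {0: 0.9, 1: 0.1}
P_B = {0: 0.8, 1: 0.2}
P_C1 = {(0, 0): 0.01, (0, 1): 0.7, (1, 0): 0.8, (1, 1): 0.95}  # P(C=1 | A, B)

def joint(a, b, c):
    """P(A=a, B=b, C=c) as the product of the local conditionals."""
    pc1 = P_C1[(a, b)]
    return P_A[a] * P_B[b] * (pc1 if c == 1 else 1.0 - pc1)

def prob(**fixed):
    """Marginal probability of a partial assignment, by brute-force summation."""
    total = 0.0
    for a, b, c in itertools.product([0, 1], repeat=3):
        values = {"A": a, "B": b, "C": c}
        if all(values[k] == v for k, v in fixed.items()):
            total += joint(a, b, c)
    return total

# Marginally, A and B are independent: P(A=1, B=1) = P(A=1) * P(B=1).
print(prob(A=1, B=1), prob(A=1) * prob(B=1))   # 0.02  0.02

# Given C=1 they become dependent: learning B=1 lowers P(A=1).
print(prob(A=1, C=1) / prob(C=1))              # ~0.38 = P(A=1 | C=1)
print(prob(A=1, B=1, C=1) / prob(B=1, C=1))    # ~0.13 = P(A=1 | C=1, B=1)
```

The drop from roughly 0.38 to 0.13 is exactly the "inhibition" between two causes that the next slide asks about.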
"Explaining away"

  [Figure: Burglar -> Alarm <- Earthquake, with Earthquake -> Radio]

- All connections, in both directions, are "excitatory."
- But an increase in the "activation" of Earthquake leads to a decrease in the "activation" of Burglar.
- Where does the "inhibition" come from?

Quantitative specification of directed models

- Question: how do we specify a joint distribution over the nodes in the graph?
- Answer: associate a conditional probability with each node:

  [Figure: the six-node DAG annotated with P(A), P(B), P(C|A,B), P(D|C), P(E|C), P(F|D,E)]

- and take the product of the local probabilities to yield the global probabilities.

Justification

- In general, let \{S\} represent the set of random variables corresponding to the N nodes of the graph.
- For any node S_i, let pa(S_i) represent the set of parents of node S_i. Then

  P(S) = P(S_1) P(S_2 \mid S_1) \cdots P(S_N \mid S_{N-1}, \ldots, S_1)
       = \prod_i P(S_i \mid S_{i-1}, \ldots, S_1)
       = \prod_i P(S_i \mid pa(S_i))

  where the last line is by assumption.
- It is possible to prove a theorem that states that if arbitrary probability distributions are utilized for P(S_i | pa(S_i)) in the formula above, then the family of probability distributions obtained is exactly the set that respects the qualitative specification (the conditional independence relations) described earlier.

Semantics of undirected graphs

  [Figure: the undirected chain A -- C -- B, with C unobserved and then observed]

- A and B are marginally dependent; A and B are conditionally independent given C.

Comparative semantics

  [Figure: left, an undirected four-node cycle on A, B, C, D; right, the directed graph A -> C <- B]

- The graph on the left yields conditional independencies that a directed graph can't represent.
- The graph on the right yields marginal independencies that an undirected graph can't represent.

Quantitative specification of undirected models

  [Figure: the undirected six-node graph on A through F]

- Identify the cliques in the graph: {A, C}, {C, D, E}, {D, E, F}, {B, C}.
- Define a configuration of a clique as a specification of values for each node in the clique.
- Define a potential of a clique as a function that associates a real number with each configuration of the clique: \Phi_{A,C}, \Phi_{B,C}, \Phi_{C,D,E}, \Phi_{D,E,F}.

Quantitative specification (cont.)

- Consider the example of a graph with binary nodes. A "potential" is a table with entries for each combination of nodes in a clique:

  \Phi_{A,B}     B=0    B=1
  A=0            1.5    0.4
  A=1            0.7    1.2

- "Marginalizing" over a potential table simply means collapsing (summing) the table along one or more dimensions:

  marginalizing over B:  A=0 -> 1.9,  A=1 -> 1.9
  marginalizing over A:  B=0 -> 2.2,  B=1 -> 1.6
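In code, a clique potential over binary nodes is just a multidimensional array, and marginalizing is a sum along an axis. A minimal sketch reproducing the table above (using numpy):

```python
import numpy as np

# The potential table from the slide: axis 0 indexes A, axis 1 indexes B.
phi_AB = np.array([[1.5, 0.4],
                   [0.7, 1.2]])

# Marginalizing over B collapses (sums) the table along the B axis,
# leaving a table over A alone.
print(phi_AB.sum(axis=1))   # [1.9 1.9]

# Marginalizing over A leaves a table over B alone.
print(phi_AB.sum(axis=0))   # [2.2 1.6]
```

Note that nothing requires the rows or columns to sum to one; potentials are arbitrary nonnegative tables, which is why a normalization constant enters the product formula on the next slide.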
Quantitative specification (cont.)

- Finally, define the probability of a global configuration of the nodes as the product of the local potentials on the cliques:

  P(A,B,C,D,E,F) = \Phi_{A,C} \, \Phi_{B,C} \, \Phi_{C,D,E} \, \Phi_{D,E,F}

  where, without loss of generality, we assume that the normalization constant (if any) has been absorbed into one of the potentials.
- It is then possible to prove a theorem that states that if arbitrary potentials are utilized in the product formula for probabilities, then the family of probability distributions obtained is exactly the set that respects the qualitative specification (the conditional independence relations described earlier).
- This theorem is known as the Hammersley-Clifford theorem.

Boltzmann machine

- The Boltzmann machine is a special case of an undirected graphical model.
- For a Boltzmann machine, all of the potentials are formed by taking products of factors of the form \exp\{J_{ij} S_i S_j\}:

  [Figure: nodes S_i and S_j joined by an edge with weight J_{ij}]

- Setting J_{ij} equal to zero for non-neighboring nodes guarantees that we respect the clique boundaries.
- But we don't get the full conditional probability semantics with the Boltzmann machine parameterization; i.e., the family of distributions parameterized by a Boltzmann machine on a graph is a proper subset of the family characterized by the conditional independencies.

Evidence and Inference

- "Absorbing evidence" means observing the values of certain of the nodes. Absorbing evidence divides the units of the network into two groups:
  - visible units {V}: those for which we have instantiated values ("evidence nodes")
  - hidden units {H}: those for which we do not have instantiated values
- "Inference" means calculating the conditional distribution

  P(H \mid V) = \frac{P(H, V)}{\sum_{\{H\}} P(H, V)}

- Prediction and diagnosis are special cases.

Inference algorithms for directed graphs

- There are several inference algorithms, some of which operate directly on the directed graph.
- The most popular inference algorithm, known as the junction tree algorithm (which we'll discuss here), operates on an undirected graph. It also has the advantage of clarifying some of the relationships between the various algorithms.
- To understand the junction tree algorithm, we need to understand how to "compile" a directed graph into an undirected graph.

Moral graphs

- Note that for both directed graphs and undirected graphs, the joint probability is in a product form. So let's convert the local conditional probabilities into potentials; then the products of potentials will give the right answer.
- Indeed, we can think of a conditional probability, e.g., P(C | A, B), as a function of the three variables A, B, and C: we get a real number for each configuration.
- Problem: a node and its parents are not generally in the same clique.
- Solution: marry the parents to obtain the "moral graph":

  [Figure: the triangle on A, B, C, with \Phi_{A,B,C} = P(C | A, B)]

Moral graphs (cont.)

- Define the potential on a clique as the product over all conditional probabilities contained within the clique. Now the product of potentials gives the right answer:

  P(A,B,C,D,E,F) = P(A) P(B) P(C|A,B) P(D|C) P(E|C) P(F|D,E)
                 = \Phi_{A,B,C} \, \Phi_{C,D,E} \, \Phi_{D,E,F}

  where

  \Phi_{A,B,C} = P(A) P(B) P(C|A,B)
  \Phi_{C,D,E} = P(D|C) P(E|C)
  \Phi_{D,E,F} = P(F|D,E)

  [Figure: the directed six-node graph and its moral graph, in which A and B have been married]

Propagation of probabilities

- Now suppose that some evidence has been absorbed.
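To make the compilation step concrete, here is a minimal sketch of moralization, assuming the DAG is given as a mapping from each node to its set of parents (the function name is illustrative):

```python
from itertools import combinations

def moralize(parents):
    """Marry the parents of each node, then drop edge directions.

    `parents` maps each node to the set of its parents; returns an
    undirected adjacency structure (node -> set of neighbors)."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    adj = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                     # undirected versions of the edges
            adj[p].add(child)
            adj[child].add(p)
        for p, q in combinations(sorted(ps), 2):   # marry co-parents
            adj[p].add(q)
            adj[q].add(p)
    return adj

# The running six-node example: A -> C <- B, C -> D, C -> E, {D, E} -> F.
dag = {"C": {"A", "B"}, "D": {"C"}, "E": {"C"}, "F": {"D", "E"}}
moral = moralize(dag)
print(sorted(moral["A"]))   # ['B', 'C'] -- A and B have been married
print(sorted(moral["D"]))   # ['C', 'E', 'F'] -- D and E married on account of F
```

On this example the maximal cliques of the moral graph are {A, B, C}, {C, D, E}, and {D, E, F}: exactly the cliques that carry the potentials \Phi_{A,B,C}, \Phi_{C,D,E}, and \Phi_{D,E,F} above.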