Deep Variational Canonical Correlation Analysis


Weiran Wang¹ Xinchen Yan² Honglak Lee² Karen Livescu¹

¹Toyota Technological Institute at Chicago, Chicago, IL 60637, USA. ²University of Michigan, Ann Arbor, MI 48109, USA. Correspondence to: Weiran Wang <weiran- [email protected]>.

Abstract

We present deep variational canonical correlation analysis (VCCA), a deep multi-view learning model that extends the latent variable model interpretation of linear CCA to nonlinear observation models parameterized by deep neural networks. We derive variational lower bounds of the data likelihood by parameterizing the posterior probability of the latent variables from the view that is available at test time. We also propose a variant of VCCA called VCCA-private that can, in addition to the "common variables" underlying both views, extract the "private variables" within each view, and disentangle the shared and private information for multi-view data without hard supervision. Experimental results on real-world datasets show that our methods are competitive across domains.

1. Introduction

In the multi-view representation learning setting, we have multiple views (types of measurements) of the same underlying signal, and the goal is to learn useful features of each view using complementary information contained in both views. The learned features should uncover the common sources of variation in the views, which can be helpful for exploratory analysis or for downstream tasks.

A classical approach is canonical correlation analysis (CCA, Hotelling, 1936) and its nonlinear extensions, including the kernel extension (Lai & Fyfe, 2000; Akaho, 2001; Melzer et al., 2001; Bach & Jordan, 2002) and the deep neural network (DNN) extension (Andrew et al., 2013; Wang et al., 2015b). CCA projects two random vectors $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ into a lower-dimensional subspace so that the projections are maximally correlated. There is a probabilistic latent variable model interpretation of linear CCA, as shown in Figure 1 (left). Assuming that $x$ and $y$ are linear functions of some random variable $z \in \mathbb{R}^{d_z}$ where $d_z \le \min(d_x, d_y)$, and that the prior distribution $p(z)$ and conditional distributions $p(x|z)$ and $p(y|z)$ are Gaussian, Bach & Jordan (2005) showed that $\mathbb{E}[z|x]$ (resp. $\mathbb{E}[z|y]$) lives in the same space as the linear CCA projection for $x$ (resp. $y$).

This generative interpretation of CCA is often lost in its nonlinear extensions. For example, in deep CCA (DCCA, Andrew et al., 2013), one extracts nonlinear features from the original inputs of each view using two DNNs, $f$ for $x$ and $g$ for $y$, so that the canonical correlation of the DNN outputs (measured by a linear CCA with projection matrices $U$ and $V$) is maximized. Formally, given a dataset of $N$ pairs of observations $(x_1, y_1), \dots, (x_N, y_N)$ of the random vectors $(x, y)$, DCCA optimizes

$$\max_{W_f, W_g, U, V} \ \operatorname{tr}\left( U^\top f(X)\, g(Y)^\top V \right) \tag{1}$$
$$\text{s.t.} \quad U^\top \left( f(X) f(X)^\top \right) U = V^\top \left( g(Y) g(Y)^\top \right) V = N I,$$

where $W_f$ (resp. $W_g$) denotes the weight parameters of $f$ (resp. $g$), and $f(X) = [f(x_1), \dots, f(x_N)]$, $g(Y) = [g(y_1), \dots, g(y_N)]$.

DCCA has achieved good performance across several domains (Wang et al., 2015b;a; Lu et al., 2015; Yan & Mikolajczyk, 2015). However, a disadvantage of DCCA is that it does not provide a model for generating samples from the latent space. Although Wang et al. (2015b)'s deep canonically correlated autoencoders (DCCAE) variant optimizes the combination of an autoencoder objective (reconstruction errors) and the canonical correlation objective, the authors found that in practice, the canonical correlation term often dominates the reconstruction terms in the objective, and therefore the inputs are not reconstructed well. At the same time, optimization of the DCCA/DCCAE objectives is challenging due to the constraints that couple all training samples.

The main contribution of this paper is a new deep multi-view learning model, deep variational CCA (VCCA), which extends the latent variable interpretation of linear CCA to nonlinear observation models parameterized by DNNs. Computation of the marginal data likelihood and inference of the latent variables are both intractable under this model. Inspired by variational autoencoders (VAE, Kingma & Welling, 2014), we parameterize the posterior distribution of the latent variables given an input view, and derive variational lower bounds of the data likelihood, which are further approximated by Monte Carlo sampling. With the reparameterization trick, sampling for the Monte Carlo approximation is trivial and all DNN weights in VCCA can be optimized jointly via stochastic gradient descent, using unbiased gradient estimates from small minibatches. Interestingly, VCCA is related to multi-view autoencoders (Ngiam et al., 2011), with additional regularization on the posterior distribution.

We also propose a variant of VCCA called VCCA-private that can, in addition to the "common variables" underlying both views, extract the "private variables" within each view. We demonstrate that VCCA-private is able to disentangle the shared and private information for multi-view data without hard supervision. Last but not least, as generative models, VCCA and VCCA-private enable us to obtain high-quality samples for the input of each view.

2. Variational CCA

The probabilistic latent variable model of CCA (Bach & Jordan, 2005) defines the following joint distribution over the random variables $(x, y)$:

$$p(x, y, z) = p(z)\, p(x|z)\, p(y|z), \tag{2}$$
$$p(x, y) = \int p(x, y, z)\, dz.$$

The assumption here is that, conditioned on the latent variables $z \in \mathbb{R}^{d_z}$, the two views $x$ and $y$ are independent. Classical CCA is obtained by assuming that the observation models $p(x|z)$ and $p(y|z)$ are linear, as shown in Figure 1 (left). However, linear observation models have limited representation power. In this paper, we consider nonlinear observation models $p_\theta(x|z; \theta_x)$ and $p_\theta(y|z; \theta_y)$, parameterized by $\theta_x$ and $\theta_y$ respectively, which can be the collections of weights of DNNs. In this case, the marginal likelihood $p_\theta(x, y)$ does not have a closed form, and the inference problem $p_\theta(z|x)$ (the problem of inferring the latent variables given one of the views) is also intractable. Inspired by Kingma & Welling (2014)'s work on variational autoencoders (VAE), we approximate $p_\theta(z|x)$ with the conditional density $q_\phi(z|x; \phi_z)$, where $\phi_z$ is the collection of parameters of another DNN.¹ We can derive a lower bound on the marginal data log-likelihood using $q_\phi(z|x)$ (see the full derivation in Appendix A):

$$\log p_\theta(x, y) \ \ge\ \mathcal{L}(x, y; \theta, \phi) := -D_{KL}\left( q_\phi(z|x) \,\|\, p(z) \right) + \mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) + \log p_\theta(y|z) \right] \tag{3}$$

where $D_{KL}(q_\phi(z|x) \,\|\, p(z))$ denotes the KL divergence between the approximate posterior $q_\phi(z|x)$ and the prior $p(z)$ for the latent variables. VCCA maximizes this variational lower bound on the data log-likelihood on the training set:

$$\max_{\theta, \phi} \ \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}(x_i, y_i; \theta, \phi). \tag{4}$$

¹ For notational simplicity, we denote by $\theta$ the parameters associated with the model probabilities $p_\theta(\cdot)$, and $\phi$ the parameters associated with the variational approximate probabilities $q_\phi(\cdot)$, and often omit specific parameters inside the probabilities.

The KL divergence term. When the parameterization $q_\phi(z|x)$ is chosen properly, this term can be computed exactly in closed form. Let the variational approximate posterior be a multivariate Gaussian with diagonal covariance. That is, for a sample pair $(x_i, y_i)$, we have

$$\log q_\phi(z_i|x_i) = \log \mathcal{N}(z_i; \mu_i, \Sigma_i), \qquad \Sigma_i = \operatorname{diag}\left( \sigma_{i1}^2, \dots, \sigma_{i d_z}^2 \right),$$

where the mean $\mu_i$ and covariance $\Sigma_i$ are outputs of an encoding DNN $f$ (and thus $[\mu_i, \Sigma_i] = f(x_i; \phi_z)$ are deterministic nonlinear functions of $x_i$). In this case, we have

$$D_{KL}\left( q_\phi(z_i|x_i) \,\|\, p(z_i) \right) = -\frac{1}{2} \sum_{j=1}^{d_z} \left( 1 + \log \sigma_{ij}^2 - \sigma_{ij}^2 - \mu_{ij}^2 \right).$$

The expected log-likelihood term. The second term of (3) corresponds to the expected data log-likelihood under the approximate posterior distribution. Though still intractable, this term can be approximated by Monte Carlo sampling: we draw $L$ samples $z_i^{(l)} \sim q_\phi(z_i|x_i)$, where

$$z_i^{(l)} = \mu_i + \Sigma_i^{1/2} \epsilon^{(l)}, \qquad \epsilon^{(l)} \sim \mathcal{N}(0, I), \qquad l = 1, \dots, L,$$

and have

$$\mathbb{E}_{q_\phi(z_i|x_i)}\left[ \log p_\theta(x_i|z_i) + \log p_\theta(y_i|z_i) \right] \approx \frac{1}{L} \sum_{l=1}^{L} \left[ \log p_\theta\left( x_i \,|\, z_i^{(l)} \right) + \log p_\theta\left( y_i \,|\, z_i^{(l)} \right) \right]. \tag{5}$$

We provide a sketch of VCCA in Figure 1 (right).

[Figure 1. Left: Probabilistic latent variable interpretation of CCA (Bach & Jordan, 2005), with $z \sim \mathcal{N}(0, I)$, $x|z \sim \mathcal{N}(W_x z, \Phi_x)$, $y|z \sim \mathcal{N}(W_y z, \Phi_y)$. Right: Deep variational CCA, with prior $p(z)$, approximate posterior $q_\phi(z|x)$, and observation models $p_\theta(x|z)$ and $p_\theta(y|z)$.]

[Figure: graphical model with latent variables $h_x$, $z$, $h_y$; priors $p(h_x)$, $p(z)$, $p(h_y)$; approximate posteriors $q_\phi(h_x|x)$, $q_\phi(z|x)$, $q_\phi(h_y|y)$; observation models $p_\theta(x|z, h_x)$ and $p_\theta(y|z, h_y)$.]

Connection to multi-view autoencoder (MVAE). If we use the Gaussian observation models

$$\log p_\theta(x|z) = \log \mathcal{N}(g_x(z; \theta_x), I), \qquad \log p_\theta(y|z) = \log \mathcal{N}(g_y(z; \theta_y), I),$$

we observe that $\log p_\theta(x_i|z_i^{(l)})$ and $\log p_\theta(y_i|z_i^{(l)})$ measure the $\ell_2$ reconstruction errors of each view's inputs from samples $z_i^{(l)}$ using the two DNNs $g_x$ and $g_y$ respectively.
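
To make the objective concrete, the following is a minimal NumPy sketch of a Monte Carlo estimate of the lower bound in (3), averaged over a batch as in (4). It is an illustration under stated assumptions, not the authors' implementation: the encoder is assumed to return the mean and the log-variance of the diagonal Gaussian posterior, the observation models are the unit-covariance Gaussians from the MVAE paragraph, and the linear maps `W_mu`, `W_lv`, `W_x`, `W_y` are hypothetical stand-ins for the DNNs. All function names are invented for this example.

```python
import numpy as np


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I)), one value per row."""
    return -0.5 * np.sum(1.0 + log_var - np.exp(log_var) - mu**2, axis=1)


def sample_latent(mu, log_var, rng):
    """Reparameterized draw: z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def gaussian_log_lik(target, mean):
    """log N(target; mean, I) per row; the data-dependent part is -0.5 * squared error."""
    d = target.shape[1]
    sq_err = np.sum((target - mean) ** 2, axis=1)
    return -0.5 * sq_err - 0.5 * d * np.log(2.0 * np.pi)


def vcca_lower_bound(x, y, encode, decode_x, decode_y, rng, num_samples=1):
    """Monte Carlo estimate of the bound in (3), averaged over the batch as in (4)."""
    mu, log_var = encode(x)  # the posterior is conditioned on view x only
    expected_ll = np.zeros(x.shape[0])
    for _ in range(num_samples):
        z = sample_latent(mu, log_var, rng)
        expected_ll += gaussian_log_lik(x, decode_x(z)) + gaussian_log_lik(y, decode_y(z))
    expected_ll /= num_samples
    return np.mean(expected_ll - kl_to_standard_normal(mu, log_var))


# Hypothetical linear stand-ins for the encoder f and the decoders g_x, g_y.
rng = np.random.default_rng(0)
dx, dy, dz, n = 5, 4, 2, 8
W_mu = 0.1 * rng.standard_normal((dx, dz))
W_lv = 0.1 * rng.standard_normal((dx, dz))
W_x = 0.1 * rng.standard_normal((dz, dx))
W_y = 0.1 * rng.standard_normal((dz, dy))

x = rng.standard_normal((n, dx))
y = rng.standard_normal((n, dy))
bound = vcca_lower_bound(
    x, y,
    encode=lambda v: (v @ W_mu, v @ W_lv),
    decode_x=lambda z: z @ W_x,
    decode_y=lambda z: z @ W_y,
    rng=rng,
    num_samples=5,
)
print(float(bound))
```

The sketch only evaluates the bound; in the setting described above, the same quantity would be maximized over the encoder and decoder weights by stochastic gradient methods, which requires an automatic differentiation framework rather than plain NumPy. Having the encoder emit log-variances is a common choice for keeping the variances positive, and is an assumption of this example.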