Noise Regularization for Conditional Density Estimation

Jonas Rothfuss 1, Fabio Ferreira 2, Simon Boehm 2, Simon Walther 2, Maxim Ulrich 2, Tamim Asfour 2, Andreas Krause 1

1 ETH Zurich, Switzerland. 2 Karlsruhe Institute of Technology (KIT), Germany. Correspondence to: Jonas Rothfuss <[email protected]>.

arXiv:1907.08982v2 [stat.ML] 14 Feb 2020

Abstract

Capturing statistical relationships beyond the conditional mean is crucial in many applications. To this end, conditional density estimation (CDE) aims to learn the full conditional probability density from data. Though expressive, neural network based CDE models can suffer from severe over-fitting when trained with the maximum likelihood objective. Their particular structure renders classical regularization in the parameter space ineffective. To address this challenge, we propose a model-agnostic noise regularization method for CDE that adds carefully controlled random perturbations to the data during training. We prove that the proposed approach corresponds to a smoothness regularization and establish its asymptotic consistency. Our extensive experiments show that noise regularization consistently outperforms other regularization methods across a range of neural CDE models. Furthermore, we demonstrate the effectiveness of noise regularized neural CDE over classical non- and semi-parametric methods, even when training data is scarce.

1. Introduction

While regression analysis aims to describe the conditional mean E[y|x] of a response y given inputs x, many problems such as risk management and planning under uncertainty require gaining insight about deviations from the mean and their associated likelihood. The stochastic dependency of y on x can be captured by modeling the conditional probability density p(y|x). Inferring such a density function from a set of observations is typically referred to as conditional density estimation (CDE) and is the focus of this paper.

In the recent machine learning literature, there has been a resurgence of interest in flexible density models based on neural networks (Dinh et al., 2017; Ambrogioni et al., 2017; Kingma & Dhariwal, 2018). Since this line of work mainly focuses on the modelling of images based on large-scale data sets, over-fitting and noisy observations are of minor concern in this context. In contrast, we are interested in CDE in settings where data may be scarce and noisy. When combined with maximum likelihood estimation, the flexibility of such high-capacity models results in over-fitting and poor generalization. While regression typically assumes a Gaussian noise model, CDE uses expressive distribution families to model deviations from the conditional mean. Hence, the over-fitting problem tends to be even more severe in CDE than in regression. Standard regularization of the neural network weights such as weight decay (Pratt & Hanson, 1989) has been shown effective for regression and classification. However, in the context of CDE, the output of the neural network merely controls the parameters of a density model such as a Gaussian Mixture or Normalizing Flow. This makes the standard regularization methods in the parameter space less effective and hard to analyze.

The lack of an effective regularization scheme renders neural network based CDE impractical in most scenarios where data is scarce. As a result, classical non- and semi-parametric CDE tends to be the primary method in application areas such as econometrics (Zambom & Dias, 2013). To address this issue, we propose and analyze the use of noise regularization, an approach well-studied in the context of regression and classification, for the purpose of CDE. By adding small, carefully controlled, random perturbations to the data during training, the conditional density estimate is smoothed and tends to generalize better. In fact, we show that adding noise during maximum likelihood estimation is equivalent to a penalty on large second derivatives at the training point locations, which results in an inductive bias towards smoother density estimates. Moreover, under mild regularity conditions, we show that the proposed regularization scheme is consistent, converging to the unbiased maximum likelihood estimator. This not only supports the soundness of the proposed method but also provides insight into how to set the regularization intensity relative to the data dimensionality and training set size.

Overall, the proposed noise regularization scheme is easy to implement and agnostic to the parameterization of the CDE model. We empirically demonstrate its effectiveness on three different neural network based models. The experimental results show that noise regularization outperforms other regularization methods consistently across various data sets. In a comprehensive benchmark study, we demonstrate that, with noise regularization, neural network based CDE is able to significantly improve upon state-of-the-art non-parametric estimators, even when only 400 training observations are available. This is a relevant, and perhaps surprising, finding since non-parametric CDE is considered one of the primary approaches for settings with scarce and noisy data (Zambom & Dias, 2013). By using non-parametric regularization for training parametric high-capacity models, we are able to combine inductive biases from both worlds, making neural networks the preferable CDE method, even for low-dimensional and small-scale tasks.
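To make the training procedure concrete, the following is a minimal sketch of noise-regularized maximum likelihood training as described above. It is not the authors' implementation: the function name, the plain gradient step, and the generic nll_and_grad interface are illustrative assumptions. The essential element is that every pass over the data perturbs the training tuples (x_i, y_i) with fresh zero-mean Gaussian noise (standard deviations h_x, h_y) before the negative log-likelihood is evaluated.

```python
import numpy as np

def noise_regularized_mle(X, Y, nll_and_grad, theta_init,
                          h_x=0.1, h_y=0.1, lr=1e-3, n_epochs=500, seed=0):
    """Sketch of noise-regularized conditional MLE (hypothetical interface).

    X: (n, d_x) array of conditioning variables x_i
    Y: (n, d_y) array of targets y_i
    nll_and_grad: user-supplied function returning the negative
        log-likelihood sum_i -log f_theta(y_i | x_i) and its gradient
        w.r.t. theta, for an arbitrary CDE model parameterization.
    """
    rng = np.random.default_rng(seed)
    theta = theta_init
    for _ in range(n_epochs):
        # Fresh perturbations in every epoch; h_x and h_y control the
        # regularization intensity (amount of smoothing).
        X_tilde = X + h_x * rng.standard_normal(X.shape)
        Y_tilde = Y + h_y * rng.standard_normal(Y.shape)
        # Ordinary maximum likelihood gradient step, but on perturbed data.
        _, grad = nll_and_grad(theta, X_tilde, Y_tilde)
        theta = theta - lr * grad
    return theta
```

Because the perturbations are resampled in every pass, the estimator effectively averages the likelihood over a small neighborhood of each observation, which is the mechanism behind the smoothness bias discussed above.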
2. Background

Density Estimation. Let X be a random variable with probability density function (PDF) p(x) defined over the domain X ⊆ R^{d_x}. Given a collection D = {x_1, ..., x_n} of observations sampled from p(x), the goal is to find a good estimate f̂(x) of the true density function p. In parametric estimation, the PDF f̂ is assumed to belong to a parametric family F = {f̂_θ(·) | θ ∈ Θ} where the density function is described by a finite-dimensional parameter θ ∈ Θ. The standard method for estimating θ is maximum likelihood estimation (MLE), wherein θ is chosen so that the likelihood of the data D is maximized. This is equivalent to minimizing the Kullback-Leibler divergence between the empirical data distribution p_D(x) = (1/n) Σ_{i=1}^{n} δ(||x − x_i||) (i.e., a mixture of point masses at the observations x_i) and the parametric distribution f̂_θ:

    \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log \hat{f}_\theta(x_i) \;=\; \arg\min_{\theta \in \Theta} D_{KL}\big(p_D \,\|\, \hat{f}_\theta\big)    (1)

From a geometric perspective, (1) can be viewed as an orthogonal projection of p_D(x) onto F w.r.t. the KL-divergence. Hence, (1) is also commonly referred to as an M-projection (Murphy, 2012; Nielsen, 2018). In contrast, non-parametric density estimators make implicit smoothness assumptions through a kernel function. The most popular non-parametric method, kernel density estimation (KDE), places a symmetric density function K(z), the so-called kernel, on each training data point x_n (Rosenblatt, 1956; Parzen, 1962). The resulting density estimate reads as

    \hat{q}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right).

One popular choice of K(·) is the Gaussian kernel K(z) = (2\pi)^{-d/2} \exp(-\tfrac{1}{2} z^2). Beyond the appropriate choice of K(·), a central challenge is the selection of the bandwidth parameter h, which controls the smoothness of the estimated PDF (Li & Racine, 2007).
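As a quick illustration of the kernel density estimate above, the short snippet below evaluates q̂(x) with a Gaussian kernel at a set of query points. The function name, the isotropic single-bandwidth choice, and the toy data are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def gaussian_kde(x_query, x_train, h):
    """Evaluate the kernel density estimate q_hat at the query points.

    x_query: (m, d) points at which to evaluate the density
    x_train: (n, d) observed data points x_i
    h: scalar bandwidth controlling the smoothness of the estimate
    """
    n, d = x_train.shape
    # Pairwise differences scaled by the bandwidth: shape (m, n, d)
    z = (x_query[:, None, :] - x_train[None, :, :]) / h
    # Gaussian kernel K(z) = (2*pi)^(-d/2) * exp(-||z||^2 / 2)
    kernel_vals = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(z ** 2, axis=-1))
    # q_hat(x) = 1 / (n * h^d) * sum_i K((x - x_i) / h)
    return kernel_vals.sum(axis=1) / (n * h ** d)

# Example: estimate a 1-d density from 500 samples of a standard normal.
rng = np.random.default_rng(0)
samples = rng.standard_normal((500, 1))
grid = np.linspace(-3, 3, 7).reshape(-1, 1)
print(gaussian_kde(grid, samples, h=0.3))
```

A larger bandwidth h yields a smoother but more biased estimate, which is exactly the trade-off the bandwidth selection literature cited above addresses.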
Conditional Density Estimation (CDE). Let (X, Y) be a pair of random variables with respective domains X ⊆ R^{d_x} and Y ⊆ R^{d_y} and realizations x and y. Let p(y|x) = p(x, y)/p(x) denote the conditional probability density of y given x. Typically, Y is referred to as the dependent (explained) variable and X as the conditional (explanatory) variable. Given a dataset of observations D = {(x_n, y_n)}_{n=1}^{N} drawn from the joint distribution (x_n, y_n) ∼ p(x, y), the aim of conditional density estimation (CDE) is to find an estimate f̂(y|x) of the true conditional density p(y|x).

In the context of CDE, the KL-divergence objective is expressed as an expectation over p(x):

    \mathbb{E}_{p(x)}\Big[ D_{KL}\big(p(y|x) \,\|\, \hat{f}(y|x)\big) \Big] \;=\; \mathbb{E}_{p(x,y)}\left[ \log \frac{p(y|x)}{\hat{f}(y|x)} \right]    (2)

Corresponding to (1), we refer to the minimization of (2) w.r.t. θ as the conditional M-projection. Given a dataset D drawn i.i.d. from p(x, y), the conditional MLE following from (2) can be stated as

    \theta^{*} = \arg\min_{\theta} \; - \sum_{i=1}^{n} \log \hat{f}_\theta(y_i \,|\, x_i)    (3)

3. Related work

The first part of this section discusses work in the field of CDE, focusing on high-capacity models that make few prior assumptions. The second part relates our approach to previous regularization and data augmentation methods.

Non-parametric CDE. A vast body of literature in statistics studies non-parametric kernel density estimators (KDEs) (Rosenblatt, 1956; Parzen, 1962) and the associated bandwidth selection problem, which concerns choosing the appropriate amount of smoothing (Silverman, 1982; Hall et al., 1992; Cao et al., 1994). To estimate conditional probabilities, previous work proposes to estimate both the joint and marginal probability separately with KDE and then compute the conditional probability as their ratio (Hyndman et al., 1996; Li & Racine, 2007). Other approaches combine non-parametric with parametric elements (Tresp, 2001; Sugiyama & Takeuchi, 2010; Dutordoir et al., 2018). Despite their theoretical appeal, non-parametric density estimators suffer from poor generalization in regions where data is sparse (e.g., tail regions) (Scott & Wand, 1991).

CDE based on neural networks. Most work in machine learning focuses on flexible parametric function approximators for CDE. In our experiments, we use the work of Bishop (1994) and Ambrogioni et al. (2017), who propose to use a neural network to control the parameters of a mixture density model. A recent trend in machine learning is latent density models such as cGANs (Mirza & Osindero, 2014) and cVAEs (Sohn et al., 2015). Although such methods have been shown successful for estimating distributions of images, the PDF of such models is intractable. More promising in this sense are normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2017; Trippe & Turner, 2018), [...] leads to severe over-fitting.
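To connect the conditional MLE objective in Eq. (3) with the neural network based models mentioned above, here is a compact sketch of a mixture density network in the spirit of Bishop (1994): a small network maps x to the parameters of a Gaussian mixture over y, and training minimizes the negative conditional log-likelihood. The architecture, layer sizes, and use of PyTorch are illustrative assumptions, not the paper's exact experimental setup.

```python
import torch
import torch.nn as nn

class MixtureDensityNetwork(nn.Module):
    """Neural network that outputs the parameters of a Gaussian mixture over y."""

    def __init__(self, d_x, d_y, n_components=5, hidden=32):
        super().__init__()
        self.d_y, self.K = d_y, n_components
        self.backbone = nn.Sequential(nn.Linear(d_x, hidden), nn.Tanh())
        self.weight_head = nn.Linear(hidden, n_components)         # mixture logits
        self.mean_head = nn.Linear(hidden, n_components * d_y)     # component means
        self.log_std_head = nn.Linear(hidden, n_components * d_y)  # component log-scales

    def log_prob(self, y, x):
        """log f_theta(y | x) for a diagonal Gaussian mixture."""
        h = self.backbone(x)
        log_w = torch.log_softmax(self.weight_head(h), dim=-1)               # (B, K)
        mu = self.mean_head(h).view(-1, self.K, self.d_y)                    # (B, K, d_y)
        std = torch.exp(self.log_std_head(h)).view(-1, self.K, self.d_y)     # (B, K, d_y)
        comp = torch.distributions.Normal(mu, std)
        log_p_comp = comp.log_prob(y.unsqueeze(1)).sum(dim=-1)               # (B, K)
        return torch.logsumexp(log_w + log_p_comp, dim=-1)                   # (B,)

def train_step(model, optimizer, x_batch, y_batch):
    """One conditional MLE step (Eq. 3): minimize the negative log-likelihood."""
    loss = -model.log_prob(y_batch, x_batch).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Applying the noise regularization described in Section 1 then simply amounts to perturbing x_batch and y_batch with small zero-mean Gaussian noise before each call to train_step, leaving the model and objective unchanged.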
