
Noise Regularization for Conditional Density Estimation

Jonas Rothfuss 1 Fabio Ferreira 2 Simon Boehm 2 Simon Walther 2 Maxim Ulrich 2 Tamim Asfour 2 Andreas Krause 1

¹ETH Zurich, Switzerland. ²Karlsruhe Institute of Technology (KIT), Germany. Correspondence to: Jonas Rothfuss.

Abstract

Capturing statistical relationships beyond the conditional mean is crucial in many applications. To this end, conditional density estimation (CDE) aims to learn the full conditional probability density from data. Though expressive, neural network based CDE models can suffer from severe over-fitting when trained with the maximum likelihood objective. Their particular structure renders classical regularization in the parameter space ineffective. To address this challenge, we propose a model-agnostic noise regularization method for CDE that adds carefully controlled random perturbations to the data during training. We prove that the proposed approach corresponds to a smoothness regularization and establish its asymptotic consistency. Our extensive experiments show that noise regularization consistently outperforms other regularization methods across a variety of neural CDE models. Furthermore, we demonstrate the effectiveness of noise regularized neural CDE over classical non- and semi-parametric methods, even when training data is scarce.

1. Introduction

While regression aims to describe the conditional mean E[y|x] of a response y given inputs x, many problems such as risk management and planning under uncertainty require gaining insight about deviations from the mean and their associated likelihood. The stochastic dependency of y on x can be captured by modeling the conditional density p(y|x). Inferring such a density function from a set of observations is typically referred to as conditional density estimation (CDE) and is the focus of this paper.

In the recent machine learning literature, there has been a resurgence of interest in flexible density models based on neural networks (Dinh et al., 2017; Ambrogioni et al., 2017; Kingma & Dhariwal, 2018). Since this line of work mainly focuses on the modelling of images based on large scale data sets, over-fitting and noisy observations are of minor concern in this context. In contrast, we are interested in CDE in settings where data may be scarce and noisy. When combined with maximum likelihood estimation, the flexibility of such high-capacity models results in over-fitting and poor generalization. While regression typically assumes a Gaussian noise model, CDE uses expressive distribution families to model deviations from the conditional mean. Hence, the over-fitting problem tends to be even more severe in CDE than in regression. Standard regularization of the neural network weights such as weight decay (Pratt & Hanson, 1989) has been shown effective for regression and classification. However, in the context of CDE, the output of the neural network merely controls the parameters of a density model such as a Gaussian Mixture or Normalizing Flow. This makes the standard regularization methods in the parameter space less effective and hard to analyze.

The lack of an effective regularization scheme renders neural network based CDE impractical in most scenarios where data is scarce. As a result, classical non- and semi-parametric CDE tends to be the primary method in application areas such as econometrics (Zambom & Dias, 2013).

To address this issue, we propose and analyze the use of noise regularization, an approach well-studied in the context of regression and classification, for the purpose of CDE. By adding small, carefully controlled, random perturbations to the data during training, the conditional density estimate is smoothed and tends to generalize better. In fact, we show that adding noise during maximum likelihood estimation is equivalent to a penalty on large second derivatives in the training point locations, which results in an inductive bias towards smoother density estimates. Moreover, under mild regularity conditions, we show that the proposed regularization scheme is consistent, converging to the unbiased maximum likelihood solution. This does not only support the soundness of the proposed method but also provides insight in how to set the regularization intensity relative to the data dimensionality and training set size.

Overall, the proposed noise regularization scheme is easy to implement and agnostic to the parameterization of the CDE model. We empirically demonstrate its effectiveness on three different neural network based models. The experimental results show that noise regularization outperforms other regularization methods consistently across various data sets. In a comprehensive benchmark study, we demonstrate that, with noise regularization, neural network based CDE is able to significantly improve upon state-of-the-art non-parametric estimators, even when only 400 training observations are available. This is a relevant, and perhaps surprising, finding since non-parametric CDE is considered one of the primary approaches for settings with scarce and noisy data (Zambom & Dias, 2013). By using non-parametric regularization for training parametric high-capacity models, we are able to combine inductive biases from both worlds, making neural networks the preferable CDE method, even for low-dimensional and small-scale tasks.

2. Background

Density Estimation. Let X be a random variable with probability density function (PDF) p(x) defined over the domain $\mathcal{X} \subseteq \mathbb{R}^{d_x}$. Given a collection D = {x_1, ..., x_n} of observations sampled from p(x), the goal is to find a good estimate f̂(x) of the true density function p. In parametric estimation, the PDF f̂ is assumed to belong to a parametric family $\mathcal{F} = \{\hat f_\theta(\cdot) \mid \theta \in \Theta\}$ where the density function is described by a finite dimensional parameter θ ∈ Θ. The standard method for estimating θ is maximum likelihood estimation (MLE), wherein θ is chosen so that the likelihood of the data D is maximized. This is equivalent to minimizing the Kullback-Leibler divergence between the empirical data distribution $p_{\mathcal{D}}(x) = \frac{1}{n}\sum_{i=1}^{n}\delta(\|x - x_i\|)$ (i.e., a mixture of point masses at the observations x_i) and the parametric distribution f̂_θ:

$$\arg\max_{\theta\in\Theta} \sum_{i=1}^{n} \log \hat f_\theta(x_i) = \arg\min_{\theta\in\Theta} D_{KL}\big(p_{\mathcal{D}}\,\|\,\hat f_\theta\big) \tag{1}$$

From a geometric perspective, (1) can be viewed as an orthogonal projection of p_D(x) onto F w.r.t. the KL-divergence. Hence, (1) is also commonly referred to as an M-projection (Murphy, 2012; Nielsen, 2018). In contrast, non-parametric density estimators make implicit smoothness assumptions through a kernel function. The most popular non-parametric method, kernel density estimation (KDE), places a symmetric density function K(z), the so-called kernel, on each training data point x_n (Rosenblatt, 1956; Parzen, 1962). The resulting density estimate reads as $\hat q(x) = \frac{1}{n h^d}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$. One popular choice of K(·) is the Gaussian kernel $K(z) = (2\pi)^{-d/2}\exp(-\frac{1}{2}\|z\|^2)$. Beyond the appropriate choice of K(·), a central challenge is the selection of the bandwidth parameter h which controls the smoothness of the estimated PDF (Li & Racine, 2007).

Conditional Density Estimation (CDE). Let (X, Y) be a pair of random variables with respective domains $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ and realizations x and y. Let p(y|x) = p(x, y)/p(x) denote the conditional probability density of y given x. Typically, Y is referred to as a dependent variable (explained variable) and X as conditional (explanatory) variable. Given a dataset of observations $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$ drawn from the joint distribution (x_n, y_n) ∼ p(x, y), the aim of conditional density estimation (CDE) is to find an estimate f̂(y|x) of the true conditional density p(y|x).

In the context of CDE, the KL-divergence objective is expressed as an expectation over p(x):

$$\mathbb{E}_{p(x)}\Big[ D_{KL}\big(p(y|x)\,\|\,\hat f(y|x)\big) \Big] = \mathbb{E}_{p(x,y)}\left[ \log \frac{p(y|x)}{\hat f(y|x)} \right] \tag{2}$$

Corresponding to (1), we refer to the minimization of (2) w.r.t. θ as conditional M-projection. Given a dataset D drawn i.i.d. from p(x, y), the conditional MLE following from (2) can be stated as

$$\theta^* = \arg\min_{\theta} \; -\sum_{i=1}^{n} \log \hat f_\theta(y_i|x_i) \tag{3}$$

3. Related work

The first part of this section discusses work in the field of CDE, focusing on high-capacity models that make little prior assumptions. The second part relates our approach to previous regularization and data augmentation methods.

Non-parametric CDE. A vast body of literature in statistics studies nonparametric kernel density estimators (KDEs) (Rosenblatt, 1956; Parzen, 1962) and the associated bandwidth selection problem, which concerns choosing the appropriate amount of smoothing (Silverman, 1982; Hall et al., 1992; Cao et al., 1994). To estimate conditional probabilities, previous work proposes to estimate both the joint and marginal probability separately with KDE and then computing the conditional probability as their ratio (Hyndman et al., 1996; Li & Racine, 2007). Other approaches combine non-parametric elements with parametric elements (Tresp, 2001; Sugiyama & Takeuchi, 2010; Dutordoir et al., 2018). Despite their theoretical appeal, non-parametric density estimators suffer from poor generalization in regions where data is sparse (e.g., tail regions) (Scott & Wand, 1991).

CDE based on neural networks. Most work in machine learning focuses on flexible parametric function approximators for CDE. In our experiments, we use the work of Bishop (1994) and Ambrogioni et al. (2017), who propose to use a neural network to control the parameters of a mixture density model. A recent trend in machine learning are latent density models such as cGANs (Mirza & Osindero, 2014) and cVAEs (Sohn et al., 2015). Although such methods have been shown successful for estimating distributions of images, the PDF of such models is intractable. More promising in this sense are normalizing flows (Rezende & Mohamed, 2015; Dinh et al., 2017; Trippe & Turner, 2018), since they provide the PDF in tractable form. We employ a neural network controlling the parameters of a normalizing flow as our third CDE model to showcase the empirical efficacy of our regularization approach.
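As a concrete point of reference for the kernel density estimator $\hat q(x) = \frac{1}{n h^d}\sum_i K((x - x_i)/h)$ recalled in Section 2, the following minimal numpy sketch evaluates it with a Gaussian kernel. The function name and the toy data are illustrative only and are not part of the paper's code.

```python
import numpy as np

def gaussian_kde(x_query, X_train, h):
    """Kernel density estimate q(x) = 1/(n h^d) * sum_i K((x - x_i)/h)
    with the Gaussian kernel K(z) = (2*pi)^(-d/2) * exp(-||z||^2 / 2)."""
    n, d = X_train.shape
    diffs = (x_query[None, :] - X_train) / h                       # shape (n, d)
    kernel_vals = (2 * np.pi) ** (-d / 2) * np.exp(-0.5 * np.sum(diffs ** 2, axis=1))
    return kernel_vals.sum() / (n * h ** d)

# Toy check: the density of a standard normal at 0 is about 0.399.
X = np.random.default_rng(0).standard_normal((500, 1))
print(gaussian_kde(np.array([0.0]), X, h=0.3))
```

The bandwidth h plays the same role here as the noise intensity of the regularization scheme introduced in Section 4, where the data-augmentation view is made precise.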

Regularization. Since neural network based CDE models suffer from severe over-fitting when trained with the MLE objective, they require proper regularization. Classical regularization of the parameters such as weight decay (Pratt & Hanson, 1989; Krogh & Hertz, 1992; Nowlan & Hinton, 1992), l1/l2-penalties (Mackay, 1992; Ng, 2004) and Bayesian priors (Murray & Edwards, 1993; Hinton & Van Camp, 1993) have been shown to work well in the regression and classification setting. However, in the context of CDE, it is less clear what kind of inductive bias such a regularization imposes on the density estimate. In contrast, our regularization approach is agnostic w.r.t. parametrization and is shown to penalize strong variations of the log-density function. Regularization methods such as dropout are closely related to ensemble methods (Srivastava et al., 2014). Thus, they are orthogonal to our work and can be freely combined with noise regularization.

Adding noise during training. Adding noise during training is a common scheme that has been proposed in various forms. This includes noise on the neural network weights or activations (Wan et al., 2013; Srivastava et al., 2014; Gal & Ghahramani, 2016) and additive noise on the gradients for scalable approximate inference (Welling & Teh, 2011; Chen et al., 2014). While this line of work corresponds to noise in the parameter space, other research suggests to augment the training data through random and/or adversarial transformations (Sietsma & Dow, 1991; Burges & Schölkopf, 1996; Goodfellow et al., 2015; Yuan et al., 2017). Our approach transforms the observations by adding small random perturbations. While this form of regularization has been studied in the context of regression and classification (Holmstrom & Koistinen, 1992a; Webb, 1994; Bishop, 1995; Natarajan et al., 2013; Maaten et al., 2013), this paper focuses on the regularization of CDE. In particular, we build on top of the results of Webb (1994), showing that training with noise corresponds to a penalty on strong variations of the log-density, and extend previous consistency results for regression of Holmstrom & Koistinen (1992a) to the more general setting of CDE. To our best knowledge, this is also the first paper to evaluate the empirical efficacy of noise regularization for density estimation.

4. Noise Regularization

When considering expressive families of conditional densities, standard maximum likelihood estimation of the model parameters θ is ill suited. As can be observed in Figure 1, simply minimizing the negative log-likelihood of the data leads to severe over-fitting. Hence, it is necessary to impose additional inductive bias, for instance, in the form of regularization. Unlike in regression or classification, the form of inductive bias imposed by popular regularization techniques such as weight decay (Krogh & Hertz, 1991) is less clear in the CDE setting, where the neural network weights often only indirectly control the probability density through an unconditional density model, e.g., a Gaussian Mixture.

4.1. The algorithm

We propose to add noise perturbations to the data points during the optimization of the log-likelihood objective. This can be understood as replacing the original data points (x_i, y_i) by random variables x̃_i = x_i + ξ_x and ỹ_i = y_i + ξ_y where the perturbation vectors are sampled from noise distributions K_x(ξ_x) and K_y(ξ_y). Further, we choose the noise to be zero centered as well as i.i.d. among the data dimensions, with standard deviation h:

$$\mathbb{E}_{\xi\sim K(\xi)}[\xi] = 0 \quad\text{and}\quad \mathbb{E}_{\xi\sim K(\xi)}\big[\xi\xi^\top\big] = h^2 I \tag{4}$$

This can be seen as a form of data augmentation, where "synthetic" data is generated by randomly perturbing the original data. Since the supply of noise vectors is technically unlimited, an arbitrarily large augmented data set can be generated by repetitively sampling data points from D and adding a random perturbation vector to the respective data point. This procedure is formalized in Algorithm 1.

For notational brevity, we set Z := X × Y, z := (x^⊤, y^⊤)^⊤ and denote f̂_θ(z) := f̂_θ(y|x). The presented noise regularization approach is agnostic to whether we are concerned with unconditional or conditional MLE. Thus, the generic notation also allows us to generalize the results to both settings (derived in the remainder of the paper).

Algorithm 1 (Conditional) MLE with Noise Regularization - Generic Procedure
Require: D = {z_1, ..., z_n}, noise intensity h
Require: number of perturbed samples r
1: for j = 1 to r do
2:   Select i ∈ {1, ..., n} with equal prob.
3:   Draw perturbation ξ ∼ K
4:   Set z̃_j = z_i + hξ
5: end for
6: return arg min_{θ∈Θ} − Σ_{j=1}^{r} log f̂_θ(z̃_j)

When considering highly flexible parametric families such as Mixture Density Networks (MDNs) (Bishop, 1994), the maximum likelihood solution in line 6 of Algorithm 1 is no longer tractable. In such a case, one typically resorts to stochastic optimization techniques such as mini-batch gradient descent and variations thereof. The generic procedure in Algorithm 1 can be transformed into a simple extension of mini-batch gradient descent on the MLE objective (see Algorithm 2).
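Before turning to the mini-batch variant, a minimal numpy sketch of the augmentation loop in Algorithm 1 (lines 1–5) is given below. The closed-form Gaussian fit merely stands in for the generic minimization in line 6, which for neural CDE models is replaced by stochastic optimization as in Algorithm 2; function names are illustrative.

```python
import numpy as np

def perturbed_resample(Z, h, r, rng):
    """Algorithm 1, lines 1-5: draw r augmented samples z_tilde = z_i + h * xi,
    where i is picked uniformly from {1, ..., n} and xi ~ K = N(0, I)."""
    n, d = Z.shape
    idx = rng.integers(0, n, size=r)
    xi = rng.standard_normal((r, d))
    return Z[idx] + h * xi

rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 2))                    # toy training data
Z_tilde = perturbed_resample(Z, h=0.2, r=10_000)

# Stand-in for line 6: for an unconditional Gaussian model, the maximum
# likelihood fit on the augmented sample is simply its mean and covariance.
mu_hat, Sigma_hat = Z_tilde.mean(axis=0), np.cov(Z_tilde, rowvar=False)
```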

[Figure 1: three density plots, one panel each for noise_std = 0.00, 0.05 and 0.20; x-axis: y, y-axis: probability density; legend: true vs. estimated density.]

Figure 1: Conditional MDN density estimate (red) and true conditional density (green) for different noise regularization intensities h_z ∈ {0.0, 0.05, 0.2}. The MDN has been fitted with 3000 samples drawn from a conditional Gaussian.

Algorithm 2 (Conditional) MLE with Noise Regularization - Mini-Batch Gradient Descent
Require: D = {z_1, ..., z_n}, noise intensity h
Require: learning rate α, mini-batch size m
1: Initialize θ
2: while θ not converged do
3:   Sample minibatch {z_1, ..., z_m} ⊂ D
4:   for j = 1 to m do
5:     Draw perturbation ξ ∼ K
6:     Set z̃_j = z_j + hξ
7:   end for
8:   θ ← θ + α ∇_θ Σ_{j=1}^{m} log f̂_θ(z̃_j)
9: end while
10: return optimized parameter θ

Specifically, each mini-batch is perturbed with i.i.d. noise before computing the MLE objective function (forward pass) and the respective gradients (backward pass).

4.2. Variable Noise as Smoothness Regularization

Intuitively, the previously presented variable noise can be interpreted as "smearing" the data points during the maximum likelihood estimation. This alleviates the jaggedness of the density estimate arising from an un-regularized maximum likelihood objective in flexible density classes. We will now give this intuition a formal foundation, by mathematically analyzing the effect of the noise perturbations.

Before discussing the particular effects of randomly perturbing the data during conditional maximum likelihood estimation, we first analyze noise regularization in a more general case. Let l(D) be a loss function over a set of data points D = {z_1, ..., z_n}, which can be partitioned into a sum of losses $l(\mathcal{D}) = \sum_{i=1}^{n} l(z_i)$, corresponding to each data point z_i. The expected loss l(z_i + ξ), resulting from adding random perturbations, can be approximated by a second order Taylor expansion around z_i. Using the assumption about ξ in (4), the expected loss can be written as

$$\mathbb{E}_{K(\xi)}\big[l(z_i + \xi)\big] = l(z_i) + \frac{1}{2}\,\mathbb{E}_{K(\xi)}\big[\xi^\top H^{(i)} \xi\big] + \mathcal{O}(\xi^3) \;\approx\; l(z_i) + \frac{h^2}{2}\,\mathrm{tr}\big(H^{(i)}\big)$$

where l(z_i) is the loss without noise and $H^{(i)} = \frac{\partial^2 l}{\partial z^2}(z)\big|_{z_i}$ is the Hessian of l w.r.t. z, evaluated at z_i. Assuming that the noise ξ is small in its magnitude, O(ξ³) is negligible. This effect has been observed earlier by Webb (1994) and Bishop (1994). See Appendix A for derivations.

When concerned with maximum likelihood estimation of a conditional density f̂_θ(y|x), the loss coincides with the negative conditional log-likelihood l(y_i, x_i) = − log f̂_θ(y_i|x_i). Let the standard deviation of the additive data noise ξ_x, ξ_y be h_x and h_y respectively. Maximum likelihood estimation (MLE) with data noise is equivalent to minimizing the loss

$$l(\mathcal{D}) \approx -\sum_{i=1}^{n} \log \hat f_\theta(y_i|x_i) \;-\; \frac{h_y^2}{2}\sum_{i=1}^{n}\sum_{j=1}^{d_y} \left.\frac{\partial^2 \log \hat f_\theta(y|x)}{\partial y^{(j)}\,\partial y^{(j)}}\right|_{x_i, y_i} \;-\; \frac{h_x^2}{2}\sum_{i=1}^{n}\sum_{j=1}^{d_x} \left.\frac{\partial^2 \log \hat f_\theta(y|x)}{\partial x^{(j)}\,\partial x^{(j)}}\right|_{x_i, y_i}$$

Hereby, the first term corresponds to the standard MLE objective, while the other two terms constitute a form of smoothness regularization. The second term of l(D) penalizes concavity of the conditional log density estimate log f̂_θ(y|x) w.r.t. y. As the MLE objective pushes the density estimate towards high densities and strong concavity in the data points y_i, the regularization term counteracts this tendency to over-fit, thus smoothing the fitted distribution. The third term penalizes large negative second derivatives w.r.t. the conditional variable x, thereby regularizing the sensitivity of the density estimate to changes in the conditional variable. The intensity of the noise regularization can be controlled through the variances (h_x² and h_y²) of the random perturbations.

Figure 1 illustrates the effect of the introduced noise regularization scheme on MDN estimates. Plain maximum likelihood estimation (left) leads to strong over-fitting, resulting in a spiky distribution that generalizes poorly beyond the training data. In contrast, training with noise regularization (center and right) results in smoother density estimates that are closer to the true conditional density.
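The second-order argument above can be checked numerically. For a Gaussian negative log-likelihood the Hessian w.r.t. z is constant, so the approximation $\mathbb{E}[l(z+\xi)] \approx l(z) + \frac{h^2}{2}\mathrm{tr}(H)$ is exact and a Monte Carlo average of the perturbed loss should match it closely. The snippet is a standalone illustration with made-up numbers, not part of the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
Sigma_inv = np.linalg.inv(Sigma)
logdet = np.log(np.linalg.det(Sigma))
z, h, d = np.array([0.7, -0.2]), 0.1, 2

def nll(zs):
    """Negative log-density of N(0, Sigma) for a batch of points;
    its Hessian w.r.t. z is Sigma^{-1}."""
    quad = np.einsum('ij,jk,ik->i', zs, Sigma_inv, zs)
    return 0.5 * (d * np.log(2 * np.pi) + logdet + quad)

xi = rng.standard_normal((200_000, d))            # xi ~ K = N(0, I)
mc_loss = nll(z + h * xi).mean()                  # Monte Carlo estimate of E[l(z + h*xi)]
taylor = nll(z[None, :])[0] + 0.5 * h**2 * np.trace(Sigma_inv)
print(mc_loss, taylor)                            # the two values agree closely
```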
4.3. Consistency of Noise Regularization

We now establish asymptotic consistency results for the proposed noise regularization. In particular, we show that, under some regularity conditions concerning integrability and decay of the noise regularization, the solution of Algorithm 1 converges to the asymptotic MLE solution.

Let $\hat f_\theta(z): \mathbb{R}^{d_z} \times \Theta \to (0, \infty)$ be a continuous function of z and θ. Moreover, we assume that the parameter space Θ is compact. In the classical MLE setting, the idealized loss, corresponding to a (conditional) M-projection of the true data distribution onto the parametric family, reads as

$$l(\theta) = -\mathbb{E}_{p(z)}\big[\log \hat f_\theta(z)\big] \tag{5}$$

As we typically just have a finite number of samples from p(z), the respective empirical estimate $\hat l_n(\theta) = -\frac{1}{n}\sum_{i=1}^{n}\log \hat f_\theta(z_i)$, $z_i \overset{i.i.d.}{\sim} p(z)$, is used as training objective. Note that we now define the loss as a function of θ and, for fixed θ, treat l_n(θ) as a random variable. Under some regularity conditions, one can invoke the uniform law of large numbers to show consistency of the empirical ML objective in the sense that $\sup_{\theta\in\Theta}|\hat l_n(\theta) - l(\theta)| \xrightarrow{a.s.} 0$ (see Appendix B for details).

In case of the presented noise regularization scheme, the maximum likelihood estimation is performed on the augmented data {z̃_j} rather than the original data {z_i}. For our analysis, we view Algorithm 1 from a slightly different angle. In fact, the data augmentation procedure of uniformly selecting a data point from {z_1, ..., z_n} and perturbing it with a noise vector drawn from K can be viewed as drawing i.i.d. samples from a kernel density estimate $\hat q_n^{(h)}(z) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{d_z}}K\!\left(\frac{z - z_i}{h}\right)$. Hence, MLE with variable noise can be understood as

1. forming a kernel density estimate $\hat q_n^{(h)}$ of the data,
2. followed by a (conditional) M-projection of $\hat q_n^{(h)}$ onto the parametric family.

Hereby, step 2 aims to find the θ* that minimizes the following objective:

$$l_n^{(h)}(\theta) = -\mathbb{E}_{\hat q_n^{(h)}(z)}\big[\log \hat f_\theta(z)\big] \tag{6}$$

Since (6) is generally intractable, r samples are drawn from the kernel density estimate, forming the following Monte Carlo approximation of (6) which corresponds to the loss in line 6 of Algorithm 1:

$$\hat l_{n,r}^{(h)}(\theta) = -\frac{1}{r}\sum_{j=1}^{r} \log \hat f_\theta(\tilde z_j)\,, \qquad \tilde z_j \sim \hat q_n^{(h)}(z) \tag{7}$$

We are concerned with the consistency of the training procedure in Algorithm 1, similar to the classical MLE consistency result discussed above. Hence, we need to show that $\sup_{\theta\in\Theta}\big|\hat l_{n,r}^{(h)}(\theta) - l(\theta)\big| \xrightarrow{a.s.} 0$ as n, r → ∞. We begin our argument by decomposing the problem into easier sub-problems. In particular, the triangle inequality is used to obtain the following upper bound:

$$\sup_{\theta\in\Theta}\big|\hat l_{n,r}^{(h)}(\theta) - l(\theta)\big| \;\le\; \sup_{\theta\in\Theta}\big|\hat l_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta)\big| + \sup_{\theta\in\Theta}\big|l_n^{(h)}(\theta) - l(\theta)\big| \tag{8}$$

Note that $\hat l_{n,r}^{(h)}(\theta)$ is based on samples from the KDE, which are obtained by adding random noise vectors ξ ∼ K(·) to our original training data. Since we can sample an unlimited amount of such random noise vectors, r can be chosen arbitrarily high. This allows us to make $\sup_{\theta\in\Theta}|\hat l_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta)|$ arbitrarily small by the uniform law of large numbers. In order to make $\sup_{\theta\in\Theta}|l_n^{(h)}(\theta) - l(\theta)|$ small in the limit n → ∞, the sequence of bandwidth parameters h_n needs to be chosen appropriately. Such results can then be combined using a union bound argument. In the following we outline the steps leading us to the desired results. In that, the proof methodology is similar to Holmstrom & Koistinen (1992b). While they show consistency results for regression with a quadratic loss function, our proof deals with generic and inherently unbounded log-likelihood objectives and thus holds for a much more general class of learning problems. The full proofs can be found in the Appendix.

Initially, we have to make asymptotic integrability assumptions that ensure that the expectations in $l_n^{(h)}(\theta)$ and l(θ) are well-behaved in the limit (see Appendix C for details). Given respective integrability, we are able to obtain the following proposition.

Proposition 1 Suppose the regularity conditions (26) and (27) in Appendix C are satisfied, and that

$$\lim_{n\to\infty} h_n = 0\,, \qquad \lim_{n\to\infty} n\,(h_n)^d = \infty \tag{9}$$

Then,

$$\lim_{n\to\infty} \sup_{\theta\in\Theta} \big|l_n^{(h)}(\theta) - l(\theta)\big| = 0 \tag{10}$$

almost surely.

In (9) we find conditions on the asymptotic behavior of the smoothing sequence (h_n). These conditions also give us valuable guidance on how to properly choose the noise intensity in line 4 of Algorithm 1 (see Section 4.4 for discussion). The result in (10) demonstrates that, under the discussed conditions, replacing the empirical data distribution with a kernel density estimate still results in an asymptotically consistent maximum likelihood objective. However, as previously discussed, $l_n^{(h)}(\theta)$ is intractable and, thus, replaced by its sample estimate $\hat l_{n,r}^{(h)}$. Since we can draw an arbitrary amount of samples from $\hat q_n^{(h)}$, we can approximate $l_n^{(h)}(\theta)$ with arbitrary precision. Given a fixed data set D of size n > n_0, it holds that $\lim_{r\to\infty}\sup_{\theta\in\Theta}\big|\hat l_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta)\big| = 0$ almost surely, by (27) and the uniform law of large numbers. Since our original goal was to also show consistency for n → ∞, this result is combined with Proposition 1, obtaining the following consistency theorem.

Theorem 1 Suppose the regularity conditions (26) and (27) are satisfied, h_n fulfills (9) and Θ is compact. Then,

$$\lim_{n\to\infty}\ \overline{\lim_{r\to\infty}}\ \sup_{\theta\in\Theta} \big|\hat l_{n,r}^{(h)}(\theta) - l(\theta)\big| = 0 \tag{11}$$

almost surely. In that, $\overline{\lim}$ is used to denote the limit superior ("lim sup") of a sequence.

Training a (conditional) density model with noise regularization means minimizing $\hat l_{n,r}^{(h)}(\theta)$ w.r.t. θ. As a result of this optimization, one obtains a parameter vector $\hat\theta_{n,r}^{(h)}$, which we hope is close to the minimizing parameter $\bar\theta$ of the ideal objective function l(θ). In the following, we establish consistency results, similar to Theorem 1, in the parameter space. For that, we first have to formalize the concept of closeness and optimality in the parameter space. Since a minimizing parameter $\bar\theta$ of l(θ) may not be unique, we define $\Theta^* = \{\theta^* \mid l(\theta^*) \le l(\theta)\ \forall\theta\in\Theta\}$ as the set of global minimizers of l(θ), and $d(\theta, \Theta^*) = \min_{\theta^*\in\Theta^*}\{\|\theta - \theta^*\|_2\}$ as the distance of an arbitrary parameter θ to Θ*. Based on these definitions, it can be shown that Algorithm 1 is consistent in the sense that the minimizer $\hat\theta_{n,r}^{(h)}$ converges almost surely to the set of optimal parameters Θ*.

Theorem 2 Suppose the regularity conditions (26) and (27) are satisfied, h_n fulfills (9) and Θ is compact. For r > 0 and n > n_0, let $\hat\theta_{n,r}^{(h)} \in \Theta$ be a global minimizer of the empirical objective $\hat l_{n,r}^{(h)}$. Then

$$\lim_{n\to\infty}\ \overline{\lim_{r\to\infty}}\ d\big(\hat\theta_{n,r}^{(h)}, \Theta^*\big) = 0 \tag{12}$$

almost surely.

Note that Theorem 2 considers global optimizers, but equivalently holds for compact neighborhoods of a local minimum θ* (see discussion in Appendix C).

4.4. Choosing the noise intensity

After discussing the properties of noise regularization, we are interested in how to properly choose the noise intensity h for different training data sets. Ideally, we would like to choose h so that $|l_n^{(h)}(\theta) - l(\theta)|$ is minimized, which is practically not feasible since l(θ) is intractable. Inequality (28) gives us an upper bound on this quantity, suggesting to minimize the l1 distance between the kernel density estimate $q_n^{(h)}$ and the data distribution p(z). This is in turn a well-studied problem in the kernel density estimation literature (see e.g., Devroye (1987)). Unfortunately, general solutions of this problem require knowing p(z), which is not the case in practice. Under the assumption that p(z) and the kernel function K are Gaussian, the optimal bandwidth can be derived as $h = 1.06\,\hat\sigma\, n^{-1/(4+d)}$ (Silverman, 1986). In that, σ̂ denotes the estimated standard deviation of the data, n the number of data points and d the dimensionality of Z. This formula is widely known as the rule of thumb and often used as a heuristic for choosing h.

In addition, the conditions in (9) give us further intuition. The first condition tells us that h_n needs to decay towards zero as n becomes large. This reflects the general theme in machine learning that the more data is available, the less inductive bias / regularization should be imposed. The second condition suggests that the bandwidth decay must happen at a rate slower than $n^{-1/d}$. For instance, the rule of thumb fulfills these two criteria and thus constitutes a useful guideline for selecting h. However, for highly non-Gaussian data distributions, the respective h_n may decay too slowly and a faster decay rate such as $n^{-1/(1+d)}$ may be appropriate.
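Both schedules just discussed are straightforward to compute; the sketch below is one possible implementation (how σ̂ is pooled across dimensions and the constant of the faster schedule are choices not fixed by the text).

```python
import numpy as np

def rule_of_thumb_bandwidth(Z):
    """h = 1.06 * sigma_hat * n^(-1/(4+d))  (Silverman, 1986)."""
    n, d = Z.shape
    sigma_hat = Z.std(axis=0).mean()     # pooled scale estimate (one possible choice)
    return 1.06 * sigma_hat * n ** (-1.0 / (4 + d))

def sqrt_rate_bandwidth(Z, c=1.0):
    """Faster decay h = c * n^(-1/(1+d)) for strongly non-Gaussian data."""
    n, d = Z.shape
    return c * n ** (-1.0 / (1 + d))

Z = np.random.default_rng(0).standard_normal((1000, 3))
print(rule_of_thumb_bandwidth(Z), sqrt_rate_bandwidth(Z))
```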
5. Experiments

We now perform a detailed experimental analysis of the proposed method, aiming to empirically validate the theoretical arguments outlined previously and investigating the practical efficacy of our regularization approach. In all experiments we use Gaussian perturbations, i.e., K(ξ) = N(0, I). Since one of the key features of our noise regularization scheme is that it is agnostic to the choice of model, we evaluate its performance on three different neural network based CDE models: Mixture Density Networks (MDN) (Bishop, 1994), Kernel Mixture Networks (KMN) (Ambrogioni et al., 2017) and Normalizing Flow Networks (NFN) (Rezende & Mohamed, 2015; Trippe & Turner, 2018).

In our experiments, we consider both simulated as well as real-world data sets. In particular, we simulate data from a 4-dimensional Gaussian Mixture (d_x = 2, d_y = 2) and a Skew-Normal distribution whose parameters are functionally dependent on x (d_x = 1, d_y = 1). In terms of real-world data, we use the following three data sources. EuroStoxx: Daily returns of the Euro Stoxx 50 index conditioned on various stock return factors. NYC Taxi: Drop-off locations of Manhattan taxi trips conditioned on the pickup location, weekday and time. UCI datasets: Boston Housing, Concrete and Energy datasets from the UCI machine learning repository (Dua & Graff, 2017). The reported scores are test log-likelihoods, averaged over at least 5 random seeds alongside the respective standard deviation. For further details regarding the data sets and simulated data, we refer to Appendix E. The data and code is available on our supplementary website¹.

5.1. Noise intensity schedules

We complement the discussion in Section 4.4 with an empirical investigation of different schedules of h_n. In particular, we compare a) the rule of thumb $h_n \propto n^{-1/(4+d)}$, b) a square root decay schedule $h_n \propto n^{-1/(1+d)}$, c) a constant bandwidth $h_n = \text{const.} \in (0, \infty)$, and d) no noise regularization, i.e., h_n = 0. Figure 2 plots the respective test log-likelihoods against an increasing training set size n for the two simulated densities Gaussian Mixture and Skew Normal.

¹https://sites.google.com/view/noisereg/



[Figures 2 and 3: test log-likelihood (y-axis) against the number of train observations (x-axis, 10² to 10⁵) on the Skew Normal and Gaussian Mixture densities, with one row of panels per model (Mixture Density Network, Kernel Mixture Network, Normalizing Flow Network). Figure 2 legend: rule of thumb, sqrt rate, fixed rate, no noise. Figure 3 legend: noise reg. (ours), no reg., l1 reg., l2 reg., weight decay, Variational Bayes.]

Figure 2: Comparison of different noise intensity schedules h_n and their implications on the performance of various CDE models across different training set sizes.

Figure 3: Comparison of various regularization methods for three neural network based CDE models. The models are trained with simulated data sets of different sizes.

First, we observe that bandwidth rates that conform with the decay conditions seem to converge in performance to the non-regularized maximum likelihood estimator (red) as n becomes large. This validates the theoretical result of Theorem 1. Second, a fixed bandwidth across n (green), violating (9), imposes an asymptotic bias and thus saturates in performance vastly before its counterparts. Third, as hypothesized, the relatively slow decay of h_n through the rule of thumb works better for data distributions that have larger similarities to a Gaussian, i.e., in our case the Skew Normal distribution. In contrast, the highly non-Gaussian data from the Gaussian Mixture requires faster decay rates like the square root decay schedule. Most importantly, noise regularization substantially improves the estimator's performance when only little training data is available.

5.2. Regularization Comparison

We now investigate how the proposed noise regularization scheme compares to classical regularization techniques. In particular, we consider an l1- and l2-penalty on the neural network weights as regularization term, the weight decay technique of Loshchilov & Hutter (2019)², as well as a Bayesian neural network (Neal, 2012) trained with variational inference using a Gaussian prior and posterior (Blei et al., 2017).

First, we study the performance of the regularization techniques on our two simulation benchmarks. Figure 3 depicts the respective test log-likelihood across different training set sizes. For each regularization method, the regularization hyper-parameter has been optimized via grid search. As one would expect, the importance of regularization, i.e., the performance difference to the un-regularized model, decreases as the amount of training data becomes larger. The noise regularization scheme yields similar performance across the CDE models while the other regularizers vary greatly in their performance depending on the different models. This reflects the fact that noise regularization is agnostic to the parameterization of the CDE model while regularizers in the parameter space are dependent on the internal structure of the model. Most importantly, noise regularization performs well across all models and sample sizes. In the great majority of configurations it outperforms the other methods. Especially when little training data is available, noise regularization ensures a moderate test error while the other methods mostly fail to do so.

Next, we consider real-world data. We use 5-fold cross-validation on the training set to select the parameters for each regularization method. The test log-likelihoods, reported in Table 1, are averages over 3 different train/test splits and 5 seeds each for initializing the neural networks. The held out test set amounts to 20% of the respective data set. Consistent with the results of the simulation study, noise regularization outperforms the other methods across the great majority of data sets and CDE models.

²Note that an l2 regularizer and weight decay are not equivalent since we use the adaptive learning rate technique Adam. See Loshchilov & Hutter (2019) for details.

Table 1: Comparison of various regularization methods for three neural network based CDE models across 5 data sets. We report the test log-likelihood and its respective standard deviation (higher log-likelihood values are better).

                     Euro Stoxx   NYC Taxi    Boston        Concrete      Energy
MDN   noise (ours)   3.94±0.03    5.25±0.04   -2.49±0.11    -2.92±0.08    -1.04±0.09
      weight decay   3.78±0.06    5.07±0.04   -3.29±0.32    -3.33±0.14    -1.21±0.10
      l1 reg.        3.19±0.19    5.00±0.05   -4.01±0.36    -3.87±0.29    -1.44±0.22
      l2 reg.        3.16±0.21    4.99±0.04   -4.64±0.52    -3.84±0.26    -1.55±0.26
      Bayes          3.26±0.43    5.08±0.03   -3.46±0.47    -3.19±0.21    -1.25±0.23
KMN   noise (ours)   3.92±0.01    5.39±0.02   -2.52±0.08    -3.09±0.06    -1.62±0.06
      weight decay   3.85±0.03    5.31±0.02   -2.69±0.15    -3.15±0.06    -1.79±0.12
      l1 reg.        3.76±0.04    5.39±0.02   -2.75±0.13    -3.25±0.07    -1.82±0.10
      l2 reg.        3.71±0.05    5.37±0.02   -2.66±0.13    -3.18±0.07    -1.79±0.13
      Bayes          3.33±0.02    4.47±0.02   -3.40±0.11    -4.08±0.05    -3.65±0.07
NFN   noise (ours)   3.90±0.01    5.20±0.03   -2.48±0.11    -3.03±0.13    -1.21±0.08
      weight decay   3.82±0.06    5.19±0.03   -3.12±0.39    -3.12±0.14    -1.22±0.16
      l1 reg.        3.50±0.10    5.12±0.05   -12.6±12.8    -3.91±0.52    -1.29±0.16
      l2 reg.        3.50±0.09    5.13±0.05   -14.2±9.60    -3.99±0.66    -1.34±0.19
      Bayes          3.34±0.33    5.10±0.03   -5.99±2.45    -3.55±0.46    -1.11±0.22

5.3. Conditional Density Estimator Benchmark Study

We benchmark neural network based density estimators against state-of-the-art CDE approaches. While neural networks are the obvious choice when a large amount of training data is available, we pose the question how such estimators compete against popular non-parametric methods in small data regimes. In particular, we compare to Conditional Kernel Density Estimation (CKDE) (Li & Racine, 2007), ε-Neighborhood Kernel Density Estimation (NKDE), and Least-Squares Conditional Density Estimation (LSCDE) (Sugiyama & Takeuchi, 2010). For the kernel density estimation based methods CKDE and NKDE, we perform bandwidth selection via the rule of thumb (R.O.T.) (Silverman, 1982; Sheather & Jones, 1991) and via maximum likelihood leave-one-out cross-validation (CV-ML) (Rudemo, 1982; Hall et al., 1992). In case of LSCDE, MDN, KMN and NFN, the respective hyper-parameters are selected via 5-fold cross-validation grid search on the training set. Note that, in contrast to Section 5.2 which focuses on regularization parameters, the grid search here extends to more hyper-parameters.

The respective test log-likelihood scores are listed in Table 2. For the majority of data sets, the three neural network based methods outperform all of the non- and semi-parametric methods. In summary, neural network based CDE models, trained with noise regularization, work surprisingly well, even when training data is scarce and fairly low-dimensional, such as in case of the Boston Housing data set (d_x = 13, d_y = 1). We hypothesize this may be due to the fact that, when using non-parametric regularization for training parametric high-capacity models, we combine favorable inductive biases of both KDE and neural networks.

Table 2: Comparison of conditional density estimators across 5 data sets. The neural network based models (MDN, KMN, NFN) are trained with noise regularization. Reported is the test log-likelihood and its standard deviation across seeds.

                    Euro Stoxx   NYC Taxi    Boston        Concrete      Energy
num. train obs.     2536         8000        405           824           615
MDN (ours)          4.00±0.03    5.41±0.02   -2.39±0.02    -2.89±0.03    -1.04±0.05
KMN (ours)          3.98±0.03    5.42±0.02   -2.44±0.02    -3.06±0.03    -1.59±0.09
NFN (ours)          4.00±0.03    5.12±0.03   -2.40±0.04    -2.93±0.02    -1.23±0.06
LSCDE               3.44±0.10    4.85±0.02   -2.78±0.00    -3.63±0.00    -2.16±0.02
CKDE R.O.T.         3.36±0.01    4.87±0.02   -3.12±0.03    -3.78±0.02    -2.90±0.01
CKDE CV-ML          3.87±0.01    5.27±0.06   -2.76±0.26    -3.35±0.13    -1.14±0.02
NKDE R.O.T.         3.16±0.02    4.34±0.04   -3.52±0.05    -4.08±0.02    -3.35±0.03
NKDE CV-ML          3.41±0.02    4.93±0.08   -3.34±0.13    -3.93±0.05    -2.21±0.12
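The hyper-parameter selection protocol used throughout Section 5 (grid search scored by 5-fold cross-validated log-likelihood on the training set) can be sketched as follows. `fit_cde` and `log_likelihood` are placeholders for whatever CDE implementation is used; they are not part of the paper's code.

```python
import numpy as np
from sklearn.model_selection import KFold

def select_hyperparameter(X, Y, candidates, fit_cde, log_likelihood, seed=0):
    """Return the candidate value (e.g., a noise intensity h) with the highest
    mean validation log-likelihood under 5-fold cross-validation."""
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    mean_scores = []
    for value in candidates:
        fold_scores = []
        for train_idx, val_idx in kf.split(X):
            model = fit_cde(X[train_idx], Y[train_idx], value)
            fold_scores.append(log_likelihood(model, X[val_idx], Y[val_idx]))
        mean_scores.append(np.mean(fold_scores))
    return candidates[int(np.argmax(mean_scores))]
```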

6. Conclusion

The paper proposes a regularization technique for neural CDE that adds random perturbations to the data during training. It can be seamlessly integrated into standard stochastic optimization and is model-agnostic. We show that the proposed noise regularization inherits an inductive bias from non-parametric KDE, effectively smoothing the estimated conditional density. Our experiments demonstrate that it consistently outperforms other regularization methods across models and datasets. Moreover, empirical results suggest that, when trained with noise regularization, neural network based models are able to significantly improve upon state-of-the-art non-parametric estimators. This makes neural CDE the preferable method, even in settings where training data is low-dimensional and scarce.

ACKNOWLEDGMENTS

This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program, grant agreement No 815943. Simon Walther has been supported by the Konrad-Adenauer-Stiftung. Finally, we thank Sara van der Geer for her advice regarding the consistency proofs.

References

Ambrogioni, L., Güçlü, U., van Gerven, M. A. J., and Maris, E. The Kernel Mixture Network: A Nonparametric Method for Conditional Density Estimation of Continuous Random Variables. arXiv:1705.07111, 2017.

Anděl, J., Netuka, I., and Zvára, K. On Threshold Autoregressive Processes. Kybernetika, 20(2):89–106, 1984.

Bishop, C. M. Mixture Density Networks. Technical report, 1994.

Bishop, C. M. Training with Noise is Equivalent to Tikhonov Regularization. Neural Computation, 7(1):108–116, 1995.

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational Inference: A Review for Statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Burges, C. J. C. and Schölkopf, B. Improving the accuracy and speed of support vector machines. In NIPS, pp. 375–381. MIT Press, 1996.

Cao, R., Cuevas, A., and González Manteiga, W. A comparative study of several smoothing methods in density estimation. Computational Statistics & Data Analysis, 17(2):153–176, 1994.

Chen, T., Fox, E. B., and Guestrin, C. Stochastic Gradient Hamiltonian Monte Carlo. In ICML, 2014.

Devroye, L. The equivalence of weak, strong and complete convergence in L1 for kernel density estimates. Annals of Statistics, 11(3):896–904, 1983.

Devroye, L. A Course in Density Estimation. Birkhäuser, 1987.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. In ICLR, 2017.

Dua, D. and Graff, C. UCI Machine Learning Repository, 2017. URL http://archive.ics.uci.edu/ml.

Dutordoir, V., Salimbeni, H., Deisenroth, M. P., and Hensman, J. Gaussian Process Conditional Density Estimation. In NeurIPS, 2018.

Gal, Y. and Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. In ICML, 2016.

Gilardi, N., Bengio, S., and Kanevski, M. Conditional Gaussian Mixture Models for Environmental Risk Mapping. In NNSP, 2002.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and Harnessing Adversarial Examples. In ICLR, 2015.

Guang Sung, H. Gaussian Mixture Regression and Classification. PhD thesis, Rice University, 2004.

Hall, P., Marron, J., and Park, B. U. Smoothed cross-validation. Probability Theory, 92:1–20, 1992.

Hinton, G. E. and Van Camp, D. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. In COLT, 1993.

Holmstrom, L. and Koistinen, P. Using additive noise in back-propagation training. IEEE Transactions on Neural Networks, 3(1):24–38, 1992a.

Holmstrom, L. and Koistinen, P. Using Additive Noise in Back Propagation Training. IEEE Transactions on Neural Networks, 3(1):24–38, 1992b.

Hyndman, R. J., Bashtannyk, D. M., and Grunwald, G. K. Estimating and Visualizing Conditional Densities. Journal of Computational and Graphical Statistics, 5(4):315, 1996.

Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. Technical report, 2018.

Krogh, A. and Hertz, J. A. A Simple Weight Decay Can Improve Generalization. In NIPS, 1991.

Krogh, A. and Hertz, J. A. A Simple Weight Decay Can Improve Generalization. Technical report, 1992.

Li, Q. and Racine, J. S. Nonparametric Econometrics: Theory and Practice. Princeton University Press, 2007.

Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. In ICLR, 2019.

Maaten, L., Chen, M., Tyree, S., and Weinberger, K. Learning with marginalized corrupted features. In ICML, pp. 410–418, 2013.

Mackay, D. J. C. A Practical Bayesian Framework for Backprop Networks. Neural Computation, 1992.

Mirza, M. and Osindero, S. Conditional Generative Adversarial Nets. Technical report, 2014.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Murray, A. F. and Edwards, P. J. Synaptic Weight Noise During MLP Learning Enhances Fault-Tolerance, Generalisation and Learning Trajectory. In NIPS, 1993.

Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In NIPS, pp. 1196–1204, 2013.

Neal, R. M. Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media, 2012.

Ng, A. Y. Feature selection, L1 vs. L2 regularization, and rotational invariance. In ICML, 2004.

Nielsen, F. What is an information projection? Notices of the AMS, 2018.

Nowlan, S. J. and Hinton, G. E. Simplifying Neural Networks by Soft Weight Sharing. Neural Computation, 1992.

Parzen, E. On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

Pratt, L. Y. and Hanson, S. J. Comparing Biases for Minimal Network Construction with Back-Propagation. In NIPS, 1989.

Rezende, D. J. and Mohamed, S. Variational Inference with Normalizing Flows. In ICML, 2015.

Rosenblatt, M. Remarks on Some Nonparametric Estimates of a Density Function. The Annals of Mathematical Statistics, 27(3):832–837, 1956.

Rothfuss, J., Ferreira, F., Walther, S., and Ulrich, M. Conditional Density Estimation with Neural Networks: Best Practices and Benchmarks. Technical report, 2019.

Rudemo, M. Empirical Choice of Histograms and Kernel Density Estimators. Scandinavian Journal of Statistics, 1982.

Scott, D. W. and Wand, M. P. Feasibility of Multivariate Density Estimates. Biometrika, 78(1):197, 1991.

Sheather, S. J. and Jones, M. C. A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation. Journal of the Royal Statistical Society, 53:683–690, 1991.

Sietsma, J. and Dow, R. J. Creating artificial neural networks that generalize. Neural Networks, 4(1):67–79, 1991.

Silverman, B. On the estimation of a probability density function by the maximum penalized likelihood method. The Annals of Statistics, 10(3):795–810, 1982.

Silverman, B. Density Estimation for Statistics and Data Analysis. Monographs on Statistics and Applied Probability, 1986.

Sohn, K., Lee, H., and Yan, X. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems, pp. 3483–3491, 2015.

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

Sugiyama, M. and Takeuchi, I. Conditional density estimation via Least-Squares Density Ratio Estimation. In AISTATS, pp. 781–788, 2010.

Tresp, V. Mixtures of Gaussian Processes. In NIPS, 2001.

Trippe, B. L. and Turner, R. E. Conditional Density Estimation with Bayesian Normalising Flows. Technical report, 2018.

Wan, L., Zeiler, M., Zhang, S., Cun, Y. L., and Fergus, R. Regularization of Neural Networks using DropConnect. In ICML, pp. 1058–1066, 2013.

Webb, A. Functional approximation by feed-forward networks: a least-squares approach to generalization. IEEE Transactions on Neural Networks, 5(3):363–371, 1994.

Welling, M. and Teh, Y. W. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In ICML, 2011.

White, H. Learning in Artificial Neural Networks: A Statistical Perspective. Neural Computation, 1(4):425–464, 1989.

Yuan, X., He, P., Zhu, Q., and Li, X. Adversarial Examples: Attacks and Defenses for Deep Learning. Technical report, 2017.

Zambom, A. Z. and Dias, R. A Review of Kernel Density Estimation with Applications to Econometrics. International Econometric Review (IER), 5(1):20–42, 2013.

A. Derivation Smoothness Regularization

Let l(D) be a loss function over a set of data points D = {z_1, ..., z_n}, which can be partitioned into a sum of losses corresponding to each data point z_i:

$$l(\mathcal{D}) = \sum_{i=1}^{n} l(z_i) \tag{13}$$

Also, let each z_i be perturbed by a random noise vector ξ ∼ K(ξ) with zero mean and i.i.d. elements, i.e.

$$\mathbb{E}_{\xi\sim K(\xi)}[\xi] = 0 \quad\text{and}\quad \mathbb{E}_{\xi\sim K(\xi)}\big[\xi\xi^\top\big] = h^2 I \tag{14}$$

The resulting loss l(z_i + ξ) can be approximated by a second order Taylor expansion around z_i:

$$l(z_i + \xi) = l(z_i) + \xi^\top \nabla_z l(z)\big|_{z_i} + \frac{1}{2}\,\xi^\top \nabla_z^2 l(z)\big|_{z_i}\,\xi + \mathcal{O}(\xi^3) \tag{15}$$

Assuming that the noise ξ is small in its magnitude, O(ξ³) may be neglected. The expected loss under K(ξ) follows directly from (15):

$$\mathbb{E}_{\xi\sim K(\xi)}\big[l(z_i + \xi)\big] = l(z_i) + \mathbb{E}_{\xi\sim K(\xi)}\big[\xi^\top \nabla_z l(z)\big|_{z_i}\big] + \frac{1}{2}\,\mathbb{E}_{\xi\sim K(\xi)}\big[\xi^\top \nabla_z^2 l(z)\big|_{z_i}\,\xi\big] \tag{16}$$

Using the assumption about ξ in (14) we can simplify (16) as follows:

$$\begin{aligned}
\mathbb{E}_{\xi\sim K(\xi)}\big[l(z_i + \xi)\big] &= l(z_i) + \mathbb{E}_{\xi\sim K(\xi)}[\xi]^\top \nabla_z l(z)\big|_{z_i} + \frac{1}{2}\,\mathbb{E}_{\xi\sim K(\xi)}\big[\xi^\top \nabla_z^2 l(z)\big|_{z_i}\,\xi\big] && (17)\\
&= l(z_i) + \frac{1}{2}\,\mathbb{E}_{\xi\sim K(\xi)}\big[\xi^\top H^{(i)} \xi\big] && (18)\\
&= l(z_i) + \frac{1}{2}\,\mathbb{E}_{\xi\sim K(\xi)}\Big[\sum_j \sum_k \xi_j \xi_k \frac{\partial^2 l(z)}{\partial z^{(j)}\partial z^{(k)}}\Big|_{z_i}\Big] && (19)\\
&= l(z_i) + \frac{1}{2}\sum_j \mathbb{E}_{\xi}\big[\xi_j^2\big] \frac{\partial^2 l(z)}{\partial z^{(j)}\partial z^{(j)}}\Big|_{z_i} + \frac{1}{2}\sum_j \sum_{k\neq j} \mathbb{E}_{\xi}\big[\xi_j \xi_k\big] \frac{\partial^2 l(z)}{\partial z^{(j)}\partial z^{(k)}}\Big|_{z_i} && (20)\\
&= l(z_i) + \frac{h^2}{2}\sum_j \frac{\partial^2 l(z)}{\partial z^{(j)}\partial z^{(j)}}\Big|_{z_i} && (21)\\
&= l(z_i) + \frac{h^2}{2}\,\mathrm{tr}\big(H^{(i)}\big) && (22)
\end{aligned}$$

In that, l(z_i) is the loss without noise and $H^{(i)} = \nabla_z^2 l(z)\big|_{z_i}$ is the Hessian of l at z_i. With z^{(j)} we denote the elements of the column vector z.

B. Vanilla conditional MLE objective is uniformly consistent

The objective function corresponding to a conditional M-projection:

$$l(\theta) = -\mathbb{E}_{p(x,y)}\big[\log \hat f_\theta(y|x)\big] \tag{23}$$

The sample equivalent:

$$\hat l_n(\theta) = -\frac{1}{n}\sum_{i=1}^{n} \log \hat f_\theta(y_i|x_i)\,, \qquad (x_i, y_i) \overset{i.i.d.}{\sim} P(X, Y) \tag{24}$$

Corollary 1 Let Θ be a compact set and $\hat f_\theta : \mathbb{R}^{l} \times \mathbb{R}^{m} \times \Theta \to (0, \infty)$ continuous in θ for all $(x, y) \in \mathbb{R}^{l} \times \mathbb{R}^{m}$ such that $\mathbb{E}_{p(x,y)}\big[\sup_{\theta\in\Theta} |\log \hat f_\theta(y|x)|\big] < \infty$. Then, as n → ∞, we have

$$\sup_{\theta\in\Theta} \big|\hat l_n(\theta) - l(\theta)\big| \xrightarrow{\;a.s.\;} 0 \tag{25}$$

Proof. The corollary follows directly from the uniform law of large numbers. □

C. Consistency Proofs

Lemma 1 Suppose for some ε > 0 there exists a constant $B_p^{(\varepsilon)}$ such that

$$\int \big|\log \hat f_\theta(z)\big|^{1+\varepsilon}\, p(z)\, dz \;\le\; B_p^{(\varepsilon)} < \infty \qquad \forall\theta\in\Theta \tag{26}$$

and there exists an n_0 such that for all n > n_0 there exists a constant $B_{\hat q}^{(\varepsilon)}$ such that

$$\int \big|\log \hat f_\theta(z)\big|^{1+\varepsilon}\, \hat q_n^{(h_n)}(z)\, dz \;\le\; B_{\hat q}^{(\varepsilon)} < \infty \qquad \forall\theta\in\Theta \tag{27}$$

almost surely. Then, the inequality

$$\sup_{\theta\in\Theta} \big|l_n^{(h)}(\theta) - l(\theta)\big| \;\le\; C^{(\varepsilon)} \left( \int \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \right)^{\frac{\varepsilon}{1+\varepsilon}} \tag{28}$$

where $C^{(\varepsilon)}$ is a constant, holds with probability 1 for all n > n_0.

Proof of Lemma 1. Using Hölder's inequality and the nonnegativity of p and $\hat q_n^{(h)}$, we obtain

$$\begin{aligned}
\big|l_n^{(h)}(\theta) - l(\theta)\big| &= \left| \int \log \hat f_\theta(z)\,\big(\hat q_n^{(h)}(z) - p(z)\big)\, dz \right| \\
&\le \int \big|\log \hat f_\theta(z)\big|\, \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \\
&= \int \big|\log \hat f_\theta(z)\big|\, \big|\hat q_n^{(h)}(z) - p(z)\big|^{\frac{1}{1+\varepsilon}}\, \big|\hat q_n^{(h)}(z) - p(z)\big|^{\frac{\varepsilon}{1+\varepsilon}}\, dz \\
&\le \left( \int \big|\log \hat f_\theta(z)\big|^{1+\varepsilon}\, \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \right)^{\frac{1}{1+\varepsilon}} \left( \int \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \right)^{\frac{\varepsilon}{1+\varepsilon}} \\
&\le \left( \int \big|\log \hat f_\theta(z)\big|^{1+\varepsilon}\, \hat q_n^{(h)}(z)\, dz + \int \big|\log \hat f_\theta(z)\big|^{1+\varepsilon}\, p(z)\, dz \right)^{\frac{1}{1+\varepsilon}} \left( \int \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \right)^{\frac{\varepsilon}{1+\varepsilon}}
\end{aligned}$$

Employing the regularity conditions (26) and (27) and writing $C^{(\varepsilon)} = \big(B_p^{(\varepsilon)} + B_{\hat q}^{(\varepsilon)}\big)^{\frac{1}{1+\varepsilon}}$, it follows that there exists n_0 such that for all n > n_0

$$\sup_{\theta\in\Theta} \big|l_n^{(h)}(\theta) - l(\theta)\big| \;\le\; \big(B_p^{(\varepsilon)} + B_{\hat q}^{(\varepsilon)}\big)^{\frac{1}{1+\varepsilon}} \left( \int \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \right)^{\frac{\varepsilon}{1+\varepsilon}} = C^{(\varepsilon)} \left( \int \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \right)^{\frac{\varepsilon}{1+\varepsilon}}$$

with probability 1. □

Lemma 1 states regularity conditions ensuring that the expectations in $l_n^{(h)}(\theta)$ and l(θ) are well-behaved in the limit. In particular, (26) and (27) imply uniform and absolute integrability of the log-likelihoods under the respective probability measures induced by p and $\hat q_n^{(h)}$. Since we are interested in the asymptotic behavior, it is sufficient for (27) to hold for n large enough with probability 1.

Inequality (28) shows that we can make $|l_n^{(h)}(\theta) - l(\theta)|$ small by reducing the l1-distance between the true density p and the kernel density estimate $\hat q_n^{(h)}$. There exists already a vast body of literature discussing how to properly choose the kernel K and the bandwidth sequence (h_n) so that $\int |\hat q_n^{(h_n)}(z) - p(z)|\, dz \to 0$. We employ the results in Devroye (1983) for our purposes, leading us to Proposition 1.

Proof of Proposition 1. Let A denote the event that there exists n_0 such that for all n > n_0 inequality (28) holds for some constant $C^{(\varepsilon)}$. From our regularity assumptions it follows that $P(A^c) = 0$. Given that A holds, we just have to show that $\int |\hat q_n^{(h)}(z) - p(z)|\, dz \xrightarrow{a.s.} 0$. Then, the upper bound in (28) tends to zero and we can conclude our proposition. For any δ > 0 let $B_n$ denote the event

$$\int \big|\hat q_n^{(h)}(z) - p(z)\big|\, dz \le \delta \tag{29}$$

wherein $\hat q_n^{(h)}(z)$ is a kernel density estimate obtained based on n samples from p(z). Under the conditions in (9) we can apply Theorem 1 of Devroye (1983), obtaining an upper bound on the probability that (29) does not hold, i.e., there exist u, m_0 such that $P(B_n^c) \le e^{-un}$ for all n > m_0.

Since we need both A and B_n for n → ∞ to hold, we consider the intersection of the events (A ∩ B_n). Using a union bound argument it follows that there exists k_0 such that for all n > k_0: $P((A \cap B_n)^c) \le P(A^c) + P(B_n^c) = 0 + e^{-un} = e^{-un}$. Note that we can simply choose $k_0 = \max\{n_0, m_0\}$ for this to hold. Hence, $\sum_{n=k_0+1}^{\infty} P\big((A \cap B_n)^c\big) < \sum_{n=1}^{\infty} e^{-un} = \frac{1}{e^u - 1} < \infty$ and by the Borel-Cantelli lemma we can conclude that

$$\lim_{n\to\infty} \sup_{\theta\in\Theta} \big|l_n^{(h)}(\theta) - l(\theta)\big| = 0 \tag{30}$$

holds with probability 1. □

Proof of Theorem 1. The inequality in (8) implies that for any n > n0,

ˆ(h) ˆ(h) (h) (h) lim sup ln,r(θ) − l(θ) ≤ lim sup ln,r(θ) − ln (θ) + sup ln (θ) − l(θ) (31) r→∞ θ∈Θ r→∞ θ∈Θ θ∈Θ

Let n > n0 be fixed but arbitrary and denote

ˆ(h) (h) Jn,r = sup ln,r(θ) − ln (θ) r ∈ N, n > n0 (32) θ∈Θ

(n) It is important to note that Jn,r is a random variable that depends on the samples Z = (Z1, ..., Zn) as well as on (r) the inherent in Algorithm 1. We define I = (I1, ...Ir) as the indices sampled uniformly from {1, ..., n} (r) (n) (r) (r) and Ξ = (ξ1, ...xir) as the sequence of perturbation vectors sampled from K. Let P (Z ), P (I ) and P (Ξ ) be probability measures of the respective random sequences.

(n) (n) (h) If we fix Z to be equal to an arbitrary sequence Z , then qˆn is fixed and we can treat Jn,r as the regular difference (h) between a sample estimate and expectation under qˆn . By the regularity condition (27), the compactness of Θ and the continuity of fθ in θ, we can invoke the uniform law of large numbers to show that

$$
\lim_{r \to \infty} J_{n,r} = \lim_{r \to \infty} \sup_{\theta \in \Theta} \left| \hat{l}_{n,r}^{(h)}(\theta) - l_n^{(h)}(\theta) \right| = 0 \tag{33}
$$

with probability 1.
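To illustrate why $J_{n,r}$ vanishes for a fixed training sample, the following sketch (a hypothetical setup, not the authors' code) uses a one-dimensional Gaussian model $f_\theta = \mathcal{N}(\mu, \sigma^2)$ and assumes a Gaussian kernel $K$ with bandwidth $h$. In this special case the expectation under the kernel density estimate is available in closed form, so the sample average over resampled indices $I_j$ and perturbations $\xi_j$ can be compared against it directly; the gap shrinks as $r$ grows, as suggested by the uniform law of large numbers.

```python
# Illustrative sketch (hypothetical setup, not the paper's code): for a fixed
# training sample and a Gaussian model f_theta = N(mu, sigma^2), compare the
# Monte Carlo average over resampled indices and Gaussian perturbations with
# its exact expectation under the Gaussian KDE q_hat_n^(h).
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(20)                # fixed training sample, n = 20
h, mu, sigma = 0.3, 0.1, 1.2               # assumed kernel bandwidth and model parameters

def log_f(z, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (z - mu) ** 2 / (2 * sigma**2)

# Expectation of log f_theta under the KDE (mixture of N(z_i, h^2)), in closed form:
# E[(Z - mu)^2] = (z_i - mu)^2 + h^2 for each mixture component.
l_n_h = -0.5 * np.log(2 * np.pi * sigma**2) - np.mean((z - mu) ** 2 + h**2) / (2 * sigma**2)

for r in [10, 1_000, 100_000]:
    idx = rng.integers(0, z.size, size=r)  # indices I_1, ..., I_r, uniform on {1, ..., n}
    xi = rng.normal(0.0, h, size=r)        # perturbations xi_1, ..., xi_r from the kernel
    l_hat = log_f(z[idx] + xi, mu, sigma).mean()
    print(f"r={r:7d}  |sample average - expectation| = {abs(l_hat - l_n_h):.4f}")
```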

Now we want to show that (33) also holds with probability 1 for random training samples $Z^{(n)}$. First, we write $J_{n,r}$ as a deterministic function of random variables:

$$
J_{n,r} = J(Z^{(n)}, I^{(r)}, \Xi^{(r)}) \tag{34}
$$

This allows us to restate the result in (33) as follows:

$$
\begin{aligned}
& P_{I^{(r)}, \Xi^{(r)}}\left( \forall \delta > 0 \; \exists r_0 \; \forall r > r_0 : J(Z^{(n)} = \bar{Z}^{(n)}, I^{(r)}, \Xi^{(r)}) < \delta \right) \\
&= \iint \mathbb{1}\left( \forall \delta > 0 \; \exists r_0 \; \forall r > r_0 : J(Z^{(n)} = \bar{Z}^{(n)}, I^{(r)}, \Xi^{(r)}) < \delta \right) dP(\Xi^{(r)}) \, dP(I^{(r)}) \\
&= 1
\end{aligned} \tag{35}
$$

Here, $\mathbb{1}(A)$ denotes the indicator function, which returns 1 if $A$ is true and 0 otherwise. Next, we consider the probability that the convergence in (33) holds for random $Z^{(n)}$:

$$
\begin{aligned}
& P_{Z^{(n)}, I^{(r)}, \Xi^{(r)}}\left( \forall \delta > 0 \; \exists r_0 \; \forall r > r_0 : J(Z^{(n)}, I^{(r)}, \Xi^{(r)}) < \delta \right) \\
&= \iiint \mathbb{1}\left( \forall \delta > 0 \; \exists r_0 \; \forall r > r_0 : J(Z^{(n)}, I^{(r)}, \Xi^{(r)}) < \delta \right) dP(\Xi^{(r)}) \, dP(I^{(r)}) \, dP(Z^{(n)}) \\
&= \int dP(Z^{(n)}) \underbrace{\iint \mathbb{1}\left( \forall \delta > 0 \; \exists r_0 \; \forall r > r_0 : J(Z^{(n)}, I^{(r)}, \Xi^{(r)}) < \delta \right) dP(\Xi^{(r)}) \, dP(I^{(r)})}_{=1} \\
&= 1
\end{aligned}
$$

Note that we can move $dP(Z^{(n)})$ outside of the inner integrals, since $Z^{(n)}$ is independent of $I^{(r)}$ and $\Xi^{(r)}$. Hence, we can conclude that (33) also holds with probability 1 for random training data; we denote this event as $A$. From Proposition 1 we know that

$$
\lim_{n \to \infty} \sup_{\theta \in \Theta} \left| l_n^{(h)}(\theta) - l(\theta) \right| = 0 \tag{36}
$$

with probability 1. We denote the event that (36) holds as $B$. Since $P(A^c) = P(B^c) = 0$, we can use a union bound argument to show that $P(A \cap B) = 1$. From (33) and (31) it follows that for any $n > n_0$,

$$
\lim_{r \to \infty} \sup_{\theta \in \Theta} \left| \hat{l}_{n,r}^{(h)}(\theta) - l(\theta) \right| \leq \sup_{\theta \in \Theta} \left| l_n^{(h)}(\theta) - l(\theta) \right| \tag{37}
$$

with probability 1. Finally, we combine this result with (36), obtaining that

$$
\lim_{n \to \infty} \lim_{r \to \infty} \sup_{\theta \in \Theta} \left| \hat{l}_{n,r}^{(h)}(\theta) - l(\theta) \right| = 0 \tag{38}
$$

almost surely, which concludes the proof. □

Proof of Theorem 2. The proof follows the argument used in Theorem 1 of White (1989). In the following, we assume that (11) holds. From Theorem 1 we know that this is the case with probability 1. Accordingly, we only consider realizations of our training data $Z^{(n)}$ and noise samples $I^{(r)}, \Xi^{(r)}$ for which the convergence in (11) holds (see the proof of Theorem 1 for details on this notation).

For such realizations, let $(\hat{\theta}_{n,r}^{(h)})$ be minimizers of $\hat{l}_{n,r}^{(h)}$. Also let $(n_i)_i$ and, for any $i$, $(r_{i,j})_j$ be increasing sequences of positive integers. Define $v_{i,j} := \hat{\theta}_{n_i, r_{i,j}}^{(h)}$ and $\mu_{i,j}(\theta) := \hat{l}_{n_i, r_{i,j}}^{(h)}(\theta)$. Due to the compactness of $\Theta$ and the Bolzano-Weierstrass property thereof, there exists a limit point $\theta' \in \Theta$ and increasing subsequences $(i_k)_k$, $(j_k)_k$ so that $v_{i_k, j_k} \to \theta'$ as $k \to \infty$.

From the triangle inequality, it follows that for any $\epsilon > 0$ there exists $k_0$ so that $\forall k > k_0$

$$
|\mu_{i_k,j_k}(v_{i_k,j_k}) - l(\theta')| \leq |\mu_{i_k,j_k}(v_{i_k,j_k}) - l(v_{i_k,j_k})| + |l(v_{i_k,j_k}) - l(\theta')| < 2\epsilon \tag{39}
$$

given the convergence established in Theorem 1 and the continuity of $l$ in $\theta$. Next, the result above is extended to

$$
l(\theta') - l(\theta) = [l(\theta') - \mu_{i_k,j_k}(v_{i_k,j_k})] + [\mu_{i_k,j_k}(v_{i_k,j_k}) - \mu_{i_k,j_k}(\theta)] + [\mu_{i_k,j_k}(\theta) - l(\theta)] \leq 3\epsilon \tag{40}
$$

which again holds for $k$ large enough. This is due to (39), $\mu_{i_k,j_k}(v_{i_k,j_k}) - \mu_{i_k,j_k}(\theta) \leq 0$ since $v_{i_k,j_k}$ is the minimizer of $\mu_{i_k,j_k}$, and $\mu_{i_k,j_k}(\theta) - l(\theta) < \epsilon$ by Theorem 1. Because $\epsilon$ can be made arbitrarily small, $l(\theta') \leq l(\theta)$ as $k \to \infty$. Because $\theta \in \Theta$ is arbitrary, $\theta'$ must be in $\Theta^*$. In turn, since $(n_i)_i$, $(r_{i,j})_j$ and $(i_k)_k$, $(j_k)_k$ were chosen arbitrarily, every limit point of a sequence $(v_{i_k,j_k})_k$ must be in $\Theta^*$.

In the final step, we prove the theorem by contradiction. Suppose that (12) does not hold. In this case, there must exist an $\epsilon > 0$ and sequences $(n_i)_i$, $(r_{i,j})_j$ and $(i_k)_k$, $(j_k)_k$ such that $\|v_{i_k,j_k} - \bar{\theta}\|_2 > \epsilon$ for all $k$ and $\bar{\theta} \in \Theta^*$. However, by the previous argument every limit point of such a sequence $(v_{i_k,j_k})_k$ must be in $\Theta^*$. This is a contradiction to

$\|v_{i_k,j_k} - \bar{\theta}\|_2 > \epsilon$ $\forall k, \bar{\theta} \in \Theta^*$. Since the random sequences $Z^{(n)}, I^{(r)}, \Xi^{(r)}$ were chosen from a set with probability mass of 1, we can conclude our proposition that

$$
\lim_{n \to \infty} \lim_{r \to \infty} d(\hat{\theta}_{n,r}^{(h)}, \Theta^*) = 0
$$

almost surely. □

Discussion of Theorem 2. Note that, similar to $\theta^*$, $\hat{\theta}_{n,r}^{(h)}$ does not have to be unique. In case there are multiple minimizers of $\hat{l}_{n,r}^{(h)}$, we can choose one of them arbitrarily and the proof of the theorem still holds. Theorem 2 considers global optimizers over a set of parameters $\Theta$, which may not be attainable in practical settings. However, the application of the theorem to the context of local optimization is straightforward when $\Theta$ is chosen as a compact neighborhood of a local minimum $\theta^*$ of $l$ (Holmstrom & Koistinen, 1992b). If we set $\Theta^* = \{\theta^*\}$ and restrict the minimization of $\hat{l}_{n,r}^{(h)}$ to this local region, then $\hat{\theta}_{n,r}^{(h)}$ converges to $\Theta^*$ as $n, r \to \infty$ in the sense of Theorem 2.

D. Conditional density estimation models

D.1. Mixture Density Network

Mixture Density Networks (MDNs) combine conventional neural networks with a mixture density model for the purpose of estimating conditional distributions $p(y|x)$ (Bishop, 1994). In particular, the parameters of the unconditional mixture distribution $p(y)$ are outputted by the neural network, which takes the conditional variable $x$ as its input. For our purposes, we employ a Gaussian Mixture Model (GMM) with diagonal covariance matrices as the density model. The conditional density estimate $\hat{p}(y|x)$ follows as a weighted sum of $K$ Gaussians

$$
\hat{p}(y|x) = \sum_{k=1}^{K} w_k(x; \theta) \, \mathcal{N}\left(y \,|\, \mu_k(x; \theta), \sigma_k^2(x; \theta)\right) \tag{41}
$$

wherein $w_k(x; \theta)$ denotes the weight, $\mu_k(x; \theta)$ the mean and $\sigma_k^2(x; \theta)$ the variance of the $k$-th Gaussian component. All the GMM parameters are governed by the neural network with parameters $\theta$ and input $x$. The mixing weights $w_k(x; \theta)$ must resemble a categorical distribution, i.e. it must hold that $\sum_{k=1}^{K} w_k(x; \theta) = 1$ and $w_k(x; \theta) \geq 0 \; \forall k$. To satisfy these conditions, the softmax non-linearity is used for the output neurons corresponding to $w_k(x; \theta)$. Similarly, the standard deviations $\sigma_k(x)$ must be positive, which is ensured by a softplus non-linearity. Since the component means $\mu_k(x; \theta)$ are not subject to such restrictions, we use a linear output layer without non-linearity for the respective output neurons. For the experiments in Sections 5.1 and 5.2, we set $K = 10$ and use a neural network with two hidden layers of size 32.
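As a concrete illustration of this output-head structure, a minimal PyTorch-style sketch for a scalar target $y$ is given below; the layer sizes, names, and the use of PyTorch are illustrative assumptions rather than the authors' implementation.

```python
# Illustrative MDN sketch (PyTorch, not the authors' implementation): a network
# maps x to the weights (softmax), means (linear) and standard deviations
# (softplus) of a K-component Gaussian mixture over a scalar y.
import torch
import torch.nn as nn

class MDN(nn.Module):
    def __init__(self, dim_x, K=10, hidden=32):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.w_head = nn.Linear(hidden, K)       # mixture weight logits
        self.mu_head = nn.Linear(hidden, K)      # component means (no non-linearity)
        self.sigma_head = nn.Linear(hidden, K)   # component scales (pre-softplus)

    def log_prob(self, x, y):
        h = self.body(x)
        log_w = torch.log_softmax(self.w_head(h), dim=-1)           # log w_k(x)
        mu = self.mu_head(h)                                        # mu_k(x)
        sigma = nn.functional.softplus(self.sigma_head(h)) + 1e-6   # sigma_k(x) > 0
        comp = torch.distributions.Normal(mu, sigma)
        # log p(y|x) = logsumexp_k [ log w_k(x) + log N(y | mu_k(x), sigma_k(x)^2) ]
        return torch.logsumexp(log_w + comp.log_prob(y.unsqueeze(-1)), dim=-1)

# Training would minimize -model.log_prob(x_batch, y_batch).mean() over mini-batches.
```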

D.1.1. KERNEL MIXTURE NETWORK

While MDNs resemble a purely parametric conditional density model, a closely related approach, the Kernel Mixture Network (KMN), combines both non-parametric and parametric elements (Ambrogioni et al., 2017). Similar to MDNs, a mixture density model of $\hat{p}(y)$ is combined with a neural network which takes the conditional variable $x$ as its input. However, the neural network only controls the weights of the mixture components, while the component centers and scales are fixed w.r.t. $x$. For each of the kernel centers, $M$ different scale/bandwidth parameters $\sigma_m$ are chosen. As for MDNs, we employ Gaussians as mixture components, wherein the bandwidth parameter directly coincides with the standard deviation.

Let $K$ be the number of kernel centers $\mu_k$ and $M$ the number of different kernel scales $\sigma_m$. The KMN conditional density estimate reads as follows:

$$
\hat{p}(y|x) = \sum_{k=1}^{K} \sum_{m=1}^{M} w_{k,m}(x; \theta) \, \mathcal{N}(y \,|\, \mu_k, \sigma_m^2) \tag{42}
$$

As previously, the weights $w_{k,m}$ correspond to a softmax function. The $M$ scale parameters $\sigma_m$ are learned jointly with the neural network parameters $\theta$. The centers $\mu_k$ are initially chosen by k-means clustering on the $\{y_i\}_{i=1}^{n}$ in the training data set. Overall, the KMN model is more restrictive than the MDN, as the locations and scales of the mixture components are fixed during inference and cannot be controlled by the neural network. However, due to their reduced flexibility, KMNs are less prone to over-fitting than MDNs.

For the experiments in Sections 5.1 and 5.2, we set $K = 50$ and $M = 2$. The respective neural network has two hidden layers of size 32.
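A corresponding sketch of the KMN structure is given below, again as an illustration rather than the authors' code: the kernel centers are obtained via k-means on the training targets (assumed here to be a one-dimensional numpy array), the $M$ scales are free parameters learned jointly with the network, and the network output only determines the $K \cdot M$ mixture weights.

```python
# Illustrative KMN sketch (PyTorch + scikit-learn, not the authors' implementation):
# fixed centers from k-means on the training targets, M learned scales, and a
# network that only outputs the K*M mixture weights.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

class KMN(nn.Module):
    def __init__(self, dim_x, y_train, K=50, M=2, hidden=32):
        super().__init__()
        centers = KMeans(n_clusters=K, n_init=10).fit(y_train.reshape(-1, 1)).cluster_centers_
        self.register_buffer("mu", torch.tensor(centers, dtype=torch.float32).view(1, K, 1))
        self.log_sigma = nn.Parameter(torch.zeros(M))         # M scales, learned jointly
        self.body = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh(),
                                  nn.Linear(hidden, K * M))   # only the weights depend on x

    def log_prob(self, x, y):
        log_w = torch.log_softmax(self.body(x), dim=-1)       # (batch, K*M) mixture weights
        sigma = self.log_sigma.exp().view(1, 1, -1)           # (1, 1, M) shared scales
        comp = torch.distributions.Normal(self.mu, sigma)     # broadcasts to (1, K, M)
        log_n = comp.log_prob(y.view(-1, 1, 1)).flatten(1)    # (batch, K*M) component log-densities
        return torch.logsumexp(log_w + log_n, dim=-1)         # log p(y|x)
```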

D.2. Normalizing Flow Network

The Normalizing Flow Network (NFN) is similar to the MDN and KMN in that a neural network takes the conditional variable $x$ as its input and outputs the parameters of the distribution over $y$. For the NFN, this distribution is given by a Normalizing Flow (Rezende & Mohamed, 2015). It works by transforming a simple base distribution and an accordingly distributed random variable $Z_0$ through a series of invertible, parametrized mappings $f = f_N \circ \cdots \circ f_1$ into a successively more complex distribution $p(f(Z_0))$. The PDF of samples $z_N \sim p(f(Z_0))$ can be evaluated using the change-of-variable formula:

$$
\log p(z_N) = \log p(z_0) - \sum_{n=1}^{N} \log \left| \det \frac{\partial f_n}{\partial z_{n-1}} \right| \tag{43}
$$

The Normalizing Flows from Rezende & Mohamed (2015) were introduced in the context of posterior estimation in variational inference. They are optimized for fast sampling, while the likelihood evaluation for externally provided data is comparatively slow. To make them useful for CDE, we invert the direction of the flows, defining a mapping from the transformed distribution $p(Z_N)$ to the base distribution $p(Z_0)$ by setting $\hat{f}_i^{-1}(z_i) = f_i(z_i)$. We experimented with three types of flows: planar flows, radial flows as parametrized by Trippe & Turner (2018), and affine flows $f^{-1}(z) = \exp(a) z + b$. We have found that one affine flow combined with multiple radial flows performs favourably in most settings. For the experiments in Sections 5.1 and 5.2, we used a standard Gaussian as the base distribution, which is transformed through one affine flow and ten radial flows. The respective neural network has two hidden layers of size 32.
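The sketch below illustrates how the change-of-variable formula (43) is evaluated in the inverted direction for a single conditional affine flow $f^{-1}(y) = \exp(a(x)) \, y + b(x)$; the radial flows used in the actual NFN are omitted for brevity, and all class and parameter names are illustrative.

```python
# Illustrative sketch (not the authors' implementation): conditional density
# evaluation with a single affine flow in the inverted direction. The network
# outputs (a, b) from x, the inverse map z0 = exp(a) * y + b is applied to y,
# and formula (43) contributes the log-determinant log|dz0/dy| = a.
import torch
import torch.nn as nn

class AffineFlowCDE(nn.Module):
    def __init__(self, dim_x, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim_x, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 2))    # outputs the flow parameters (a, b)

    def log_prob(self, x, y):
        a, b = self.net(x).unbind(dim=-1)
        z0 = torch.exp(a) * y + b                         # inverse affine flow
        base = torch.distributions.Normal(0.0, 1.0)       # standard Gaussian base density
        return base.log_prob(z0) + a                      # log p(y|x) = log N(z0) + log|dz0/dy|
```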

E. Simulated densities and datasets

E.1. SkewNormal

The data generating process $(x, y) \sim p(x, y)$ resembles a bivariate joint distribution, wherein $x \in \mathbb{R}$ follows a normal distribution and $y \in \mathbb{R}$ a conditional skew-normal distribution (Anděl et al., 1984). The parameters $(\xi, \omega, \alpha)$ of the skew-normal distribution are functionally dependent on $x$. Specifically, the functional dependencies are the following:

$$
\begin{aligned}
x &\sim \mathcal{N}\left(\mu = 0, \, \sigma = \tfrac{1}{2}\right) & (44) \\
\xi(x) &= a \cdot x + b, \quad a, b \in \mathbb{R} & (45) \\
\omega(x) &= c \cdot x^2 + d, \quad c, d \in \mathbb{R} & (46) \\
\alpha(x) &= \alpha_{low} + \frac{1}{1 + e^{-x}} \left( \alpha_{high} - \alpha_{low} \right) & (47) \\
y &\sim \text{SkewNormal}\left( \xi(x), \omega(x), \alpha(x) \right) & (48)
\end{aligned}
$$

Accordingly, the conditional probability density p(y|x) corresponds to the skew normal density function:

$$
p(y|x) = \frac{2}{\omega(x)} \, \mathcal{N}\!\left( \frac{y - \xi(x)}{\omega(x)} \right) \Phi\!\left( \alpha(x) \, \frac{y - \xi(x)}{\omega(x)} \right) \tag{49}
$$

Here, $\mathcal{N}(\cdot)$ denotes the density and $\Phi(\cdot)$ the cumulative distribution function of the standard normal distribution. The parameter $\alpha(x)$ controls the skewness of the distribution. We set $\alpha_{low} = -4$ and $\alpha_{high} = 0$, giving $p(y|x)$ a negative skewness that decreases as $x$ increases. This distribution allows us to evaluate the performance of the density estimators in the presence of skewness, a phenomenon that we often observe in financial market variables. Figure 4a illustrates the conditional skew normal distribution.
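A possible sampler for this simulation is sketched below using scipy.stats.skewnorm, whose shape, location and scale arguments correspond to $\alpha$, $\xi$ and $\omega$ in (49). The constants a, b, c, d are placeholder values chosen for illustration, since the text only specifies that they are real-valued.

```python
# Illustrative sampler for the SkewNormal simulation; a, b, c, d are placeholder
# constants (the text only requires them to be real-valued, with omega(x) > 0).
import numpy as np
from scipy.stats import skewnorm

rng = np.random.default_rng(0)
a, b, c, d = 0.5, 0.0, 1.0, 0.5
alpha_low, alpha_high = -4.0, 0.0

def sample(n):
    x = rng.normal(0.0, 0.5, size=n)                                   # x ~ N(0, sigma = 1/2)
    xi = a * x + b                                                     # location xi(x)
    omega = c * x**2 + d                                               # scale omega(x)
    alpha = alpha_low + (alpha_high - alpha_low) / (1.0 + np.exp(-x))  # skewness alpha(x)
    y = skewnorm.rvs(alpha, loc=xi, scale=omega, random_state=rng)
    return x, y

x, y = sample(5000)
# For a given x, the conditional density (49) equals skewnorm.pdf(y, alpha(x), loc=xi(x), scale=omega(x)).
```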


(a) SkewNormal (b) GaussianMixture

Figure 4: Conditional density simulation models. Conditional probability densities corresponding to the different simulation models. The coloured graphs represent the probability densities p(y|x), conditioned on different values of x.

E.2. Gaussian Mixture

The joint distribution $p(x, y)$ follows a Gaussian Mixture Model in $\mathbb{R}^4$ with 5 Gaussian components, i.e. $K = 5$. We assume that $x \in \mathbb{R}^2$ and $y \in \mathbb{R}^2$ can be factorized, i.e.

$$
p(x, y) = \sum_{k=1}^{K} w_k \, \mathcal{N}(y \,|\, \mu_{y,k}, \Sigma_{y,k}) \, \mathcal{N}(x \,|\, \mu_{x,k}, \Sigma_{x,k}) \tag{50}
$$

When $x$ and $y$ can be factorized as in (50), the conditional density $p(y|x)$ can be derived in closed form:

$$
p(y|x) = \sum_{k=1}^{K} W_k(x) \, \mathcal{N}(y \,|\, \mu_{y,k}, \Sigma_{y,k}) \tag{51}
$$

wherein the mixture weights are a function of $x$:

$$
W_k(x) = \frac{w_k \, \mathcal{N}(x \,|\, \mu_{x,k}, \Sigma_{x,k})}{\sum_{j=1}^{K} w_j \, \mathcal{N}(x \,|\, \mu_{x,j}, \Sigma_{x,j})} \tag{52}
$$

For details and derivations we refer the interested reader to Guang Sung (2004) and Gilardi et al. (2002). The weights $w_k$ are sampled from a uniform distribution $\mathcal{U}(0, 1)$ and then normalized to sum to one. The component means are sampled from a spherical Gaussian with zero mean and a standard deviation of $\sigma = 1.5$. The covariance matrices $\Sigma_{x,k}$ and $\Sigma_{y,k}$ are sampled from a Gaussian with mean 1 and standard deviation 0.5, and then projected onto the cone of positive definite matrices. Since we can hardly visualize a 4-dimensional GMM, Figure 4b depicts a 2-dimensional equivalent, generated with the procedure explained above.
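For illustration, the conditional density in (51)-(52) can be evaluated directly from the mixture parameters. The sketch below uses simplified isotropic covariance matrices instead of the sampling-and-projection step described above, and all concrete numbers are placeholders.

```python
# Illustrative evaluation of the conditional density (51)-(52) for a factorized
# Gaussian mixture; the mixture parameters below are simplified placeholders
# (isotropic covariances instead of the projection step described above).
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
K, dim_x, dim_y = 5, 2, 2
w = rng.uniform(0.0, 1.0, K); w /= w.sum()                  # weights, normalized to sum to one
mu_x = rng.normal(0.0, 1.5, (K, dim_x))                     # component means
mu_y = rng.normal(0.0, 1.5, (K, dim_y))
cov_x = [np.eye(dim_x) * rng.uniform(0.5, 1.5) for _ in range(K)]
cov_y = [np.eye(dim_y) * rng.uniform(0.5, 1.5) for _ in range(K)]

def conditional_density(y, x):
    # W_k(x) = w_k N(x | mu_{x,k}, Sigma_{x,k}) / sum_j w_j N(x | mu_{x,j}, Sigma_{x,j})
    px = np.array([multivariate_normal.pdf(x, mu_x[k], cov_x[k]) for k in range(K)])
    W = w * px / np.sum(w * px)
    py = np.array([multivariate_normal.pdf(y, mu_y[k], cov_y[k]) for k in range(K)])
    return float(np.dot(W, py))                             # p(y|x) = sum_k W_k(x) N(y | ...)

print(conditional_density(np.zeros(dim_y), np.zeros(dim_x)))
```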

E.3. Euro Stoxx 50 data ($d_x = 14$, $d_y = 1$)

The Euro Stoxx 50 data comprises 3169 trading days, dated from January 2003 until June 2015. The goal is to predict the conditional probability density of 1-day log-returns, conditioned on 14 explanatory variables. These conditional variables comprise classical return factors from finance as well as option-implied moments. For details, we refer to Rothfuss et al. (2019). Overall, the target variable is one-dimensional, i.e. $y \in \mathcal{Y} \subseteq \mathbb{R}$, whereas the conditional variable $x$ constitutes a 14-dimensional vector, i.e. $x \in \mathcal{X} \subseteq \mathbb{R}^{14}$.

E.4. NYC Taxi data ($d_x = 6$, $d_y = 2$)

We follow the setup in Dutordoir et al. (2018). The dataset contains records of taxi trips in the Manhattan area operated in January 2016. The objective is to predict the spatial distribution of the drop-off location, based on the pick-up location, the day of the week, and the time of day. Here, the two temporal features are represented as sine and cosine with their natural periods. Accordingly, the target variable $y$ is 2-dimensional (longitude and latitude of the drop-off location), whereas the conditional variable is 6-dimensional. From the ca. 1 million trips, we randomly sample 10,000 trips to serve as training data.
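A minimal sketch of this cyclic encoding is given below; the argument names and units (day index and seconds of the day) are assumptions for illustration.

```python
# Illustrative cyclic encoding of the temporal features with their natural periods.
import numpy as np

def encode_time_features(day_of_week, seconds_of_day):
    """day_of_week in [0, 7), seconds_of_day in [0, 86400)."""
    dow = 2 * np.pi * np.asarray(day_of_week) / 7.0
    tod = 2 * np.pi * np.asarray(seconds_of_day) / 86400.0
    return np.stack([np.sin(dow), np.cos(dow), np.sin(tod), np.cos(tod)], axis=-1)

# Concatenated with the pick-up longitude and latitude, this yields the 6-dimensional x.
```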

E.5. UCI

Boston Housing ($d_x = 13$, $d_y = 1$) Concerns the value of houses in the suburban area of Boston. Conditional variables are mostly socio-economic as well as geographical factors. For more details see https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

Concrete ($d_x = 8$, $d_y = 1$) The task is to predict the compressive strength of concrete given variables describing the concrete composition. For more details see https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/

Energy ($d_x = 9$, $d_y = 1$) Concerns the energy efficiency of homes. The task is to predict the cooling load based on features describing the build of the respective house. For more details see https://archive.ics.uci.edu/ml/datasets/energy+efficiency