Kernel Exponential Family Estimation Via Doubly Dual Embedding

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by UCL Discovery Kernel Exponential Family Estimation via Doubly Dual Embedding 1 2 3 2 1 4 ⇤Bo Dai ⇤Hanjun Dai Arthur Gretton Le Song Dale Schuurmans Niao He 1Google Brain 2Georgia Tech 3UCL 4UIUC Abstract flexibility of the exponential family to an infinite- dimensional parameterization via reproducing kernel We investigate penalized maximum log- Hilbert spaces (RKHS) (Canu and Smola, 2006). likelihood estimation for exponential fam- Maximum log-likelihood estimation (MLE) has al- ily distributions whose natural parameter re- ready been well-studied in the case of finite- sides in a reproducing kernel Hilbert space. dimensional exponential families, where desirable sta- Key to our approach is a novel technique, tistical properties such as asymptotic unbiasedness, doubly dual embedding, that avoids compu- consistency and asymptotic normality have been es- tation of the partition function. This tech- tablished. However, it is difficult to extend MLE to nique also allows the development of a flex- the infinite-dimensional case. Beyond the intractabil- ible sampling strategy that amortizes the ity of evaluating the partition function for a general cost of Monte-Carlo sampling in the infer- exponential family, the necessary conditions for max- ence stage. The resulting estimator can be imizing log-likelihood might not be feasible; that is, easily generalized to kernel conditional ex- there may exist no solution to the KKT conditions in ponential families. We establish a connec- the infinite-dimensional case (Pistone and Rogantin, tion between kernel exponential family es- 1999; Fukumizu, 2009). To address this issue, Barron timation and MMD-GANs, revealing a new and Sheu (1991); Gu and Qiu (1993); Fukumizu (2009) perspective for understanding GANs. Com- considered several ways to regularize the function pared to the score matching based estima- space by constructing (a series of) finite-dimensional tors, the proposed method improves both spaces that approximate the original RKHS, yielding a memory and time efficiency while enjoying tractable estimator in the restricted finite-dimensional stronger statistical properties, such as fully space. However, as Sriperumbudur et al. (2017) note, capturing smoothness in its statistical con- even with the finite-dimension approximation, these vergence rate while the score matching es- algorithms are still expensive as every update requires timator appears to saturate. Finally, we Monte-Carlo sampling from a current model to com- show that the proposed estimator empirically pute the partition function. outperforms state-of-the-art methods in both kernel exponential family estimation and its An alternative score matching based estimator has re- conditional extension. cently been introduced by Sriperumbudur et al. (2017). This approach replaces the Kullback-Leibler (KL) with the Fisher divergence, defined by the expected 1 Introduction squared distance between the score of the model (i.e., The exponential family is one of the most impor- the derivative of the log-density) and the target dis- tant classes of distributions in statistics and machine tribution score (Hyvärinen, 2005). By minimizing learning. The exponential family possesses a number the Tikhonov-regularized Fisher divergence, Sriperum- of useful properties (Brown, 1986)and includes many budur et al. (2017) develop a computable estimator commonly used distributions with finite-dimensional for the infinite-dimensional exponential family that natural parameters, such as the Gaussian, Poisson also obtains a consistency guarantee. Recently, this and multinomial distributions, to name a few. It method has been generalized to conditional infinite- is natural to consider generalizing the richness and dimensional exponential family estimation (Arbel and Gretton, 2017). Although score matching avoids Proceedings of the 22nd International Conference on Ar- computing the generally intractable integral, it re- tificial Intelligence and Statistics (AISTATS) 2019, Naha, quires computing and saving the first- and second- Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by order derivatives of the reproducing kernel for each the author(s). ⇤ indicates equal contribution. Kernel Exponential Family Estimation via Doubly Dual Embedding dimension on each sample. For n samples with d fea- where A (λf) := log exp (λf (x)) p (x) dx, x ⌦ ⌦ 0 2 ⇢ tures, this results in n2d2 memory and n3d3 d,λ , and := f :exp(A (λf)) < .In O O R R R time cost respectively, which becomes prohibitive for this paper,2 we mainlyF { focus2H on the case where1} is a datasets with moderate sizes of n and d. To alleviate reproducing Hilbert kernel space (RKHS) withH kernel this cost, Sutherland et al. (2017) utilize the m-rank k (x, x0), such that f (x)= f,k (x, ) . However, we Nystömapproximation to the kernel matrix, reduc- emphasize that the proposedh algorithm· i in Section 4 ing memory and time complexity to nmd2 and can be easily applied to arbitrary di↵erentiable func- O m3d3 respectively. Although this reduces the cost tion parametrizations, such as deep neural networks. O dependence on sample size, the dependence on d is N Given samples =[xi] ,amodelwithafinite- una↵ected, hence the estimator remains unsuitable for i=1 dimensional parameterizationD can be learned via max- high-dimensional data. Estimating a general exponen- imum log-likelihood estimation (MLE), tial family model by either MLE or score matching also N generally requires some form of Monte-Carlo sampling, 1 max log pf (xi)=E [λf(x) + log p0 (x)] A(λf), such as MCMC or HMC, to perform inference, which f N D − 2F i=1 significantly adds to the computation cost. X where E [ ] denotes theb empirical expectation over . In summary, both the MLE and score matching The MLED · (2) is well-studied and has nice properties.D based estimators for the infinite-dimensional exponen- However,b the MLE is known to be “ill-posed” in the tial family incur significant computational overhead, infinite-dimensional case, since the optimal solution particularly in high-dimensional applications. might not be exactly achievable in the representable In this paper, we revisit penalized MLE for the kernel space. Therefore, penalized MLE has been introduced exponential family and propose a new estimation strat- for the finite (Dud´ık et al., 2007) and infinite dimen- egy. Instead of solving the log-likelihood equation di- sional Gu and Qiu (1993); Altun and Smola (2006) rectly, as in existing MLE methods, we exploit a doubly exponential families respectively. Such regularization dual embedding technique that leads to a novel saddle- essentially relaxes the moment matching constraints point reformulation for the MLE (along with its con- in terms of some norm, as shown later, guaranteeing ditional distribution generalization) in Section 3. We the existence of a solution. In this paper, we will also then propose a stochastic algorithm for the new view focus on MLE with RKHS norm regularization. of penalized MLE in Section 4. Since the proposed es- One useful theoretical property of MLE for the expo- timator is based on penalized MLE, it does not require nential family is convexity w.r.t. f. With convex regu- the first- and second-order derivatives of the kernel, as larization, e.g., f 2 , one can use stochastic gradient in score matching, greatly reducing the memory and descent to recoverk k aH unique global optimum. Let L (f) time cost. Moreover, since the saddle point reformula- denote RKHS norm penalized log-likelihood, i.e., tion of penalized MLE avoids the intractable integral, N 1 ⌘ 2 the need for a Monte-Carlo step is also bypassed, thus L (f):= log pf (xi) f . (2) N − 2 k kH accelerating the learning procedure. This approach i=1 X also learns a flexibly parameterized sampler simulta- where ⌘>0 denotes the regularization parameter. neously, therefore, further reducing inference cost. We The gradient of L (f)w.r.t. f can be computed as present the consistency and rate of the new estimation strategy in the well-specified case, i.e.,thetrue f L = E [λ f f(x)] f A(λf) ⌘f. (3) density belongs to the kernel exponential family, and r D r r − To calculate the f A(λf) in (3), we denote Z (λf)= an algorithm convergence guarantee in Section 5. We b r ⌦ exp (λf (x)) p0 (x) dx and expand the partition demonstrate the empirical advantages of the proposed function by definition, algorithm in Section 6, comparing to state-of-the-art R 1 estimators for both the kernel exponential family and f A(λf)= f exp (λf(x)) p0 (x) dx r Z(λf)r Z⌦ its conditional extension. p (x)exp(λf(x)) = 0 λf(x)dx Z(λf) rf 2 Preliminaries Z⌦ = Epf (x) [ f λf(x)] . (4) We first provide a preliminary introduction to the ex- r One can approximate the Epf (x) [ f λf(x)] by ponential family and Fenchel duality, which will play r vital roles in the derivation of the new estimator. AIS/MCMC samples (Vembu et al., 2009), which leads to the Contrastive Divergence (CD) algorithm (Hin- 2.1 Exponential family ton, 2002). The natural form of the exponential family over ⌦with To avoid costly MCMC sampling in estimating the gra- the sufficient statistics f is defined as 2F dient, Sriperumbudur et al. (2017) construct an esti- p (x)=p (x)exp(λf(x) A (λf)) , (1) f 0 − 1 2 3 2 1 4 ⇤Bo Dai , ⇤Hanjun Dai ,ArthurGretton,LeSong,DaleSchuurmans,NiaoHe mator based on score matching instead of MLE, which The supremum achieves if v @h⇤(u), or equivalently minimizes the penalized Fisher divergence. Plugging u @h(v). 2 the kernel exponential family into the empirical Fisher 2 divergence, the optimization reduces to 3 A Saddle-Point Formulation of λ2 ⌘ Penalized MLE J (f):= f,Cf + λ f,δˆ + f 2 , (5) 2 h iH h iH 2 k kH As discussed in Section 2, the penalized MLE for where the exponential family involves computing the log- n d b 1 partition functions, A (λf) and A (λf), which are in- C := @ k (x , ) @ k (x , ) , x n j i · ⌦ j i · tractable in general.

Kernel Exponential Family Estimation Via Doubly Dual Embedding

Deep Kernel Learning

Kernel Methods As Before We Assume a Space X of Objects and a Feature Map Φ : X → RD

3.2 the Periodogram

Lecture 2: Density Estimation 2.1 Histogram

Density Estimation 36-708

Consistent Kernel Mean Estimation for Functions of Random Variables

Density and Distribution Estimation

Recurrent Kernel Networks

Nonparametric Density Estimation Histogram

APPLIED SMOOTHING TECHNIQUES Part 1: Kernel Density Estimation

Regression Discontinuity Design

6. Density Estimation 1. Cross Validation 2. Histogram 3. Kernel