
Kernel Conditional Exponential Family
Michael Arbel and Arthur Gretton, Gatsby Computational Neuroscience Unit

Learning Conditional Distributions

Goal: Learn conditional densities in a non-parametric fashion.
Motivation:
- The number of modes of p(y|x) can vary with x.
- The densities can be heteroscedastic.
- Density ratios are often simpler to learn than the full joint.

Expected Conditional Score Matching

Define a loss J(p, q) between two unnormalized conditional densities.
Idea: Adapt the score objective of Hyvärinen (2005) to conditional densities:
\[ J(p, q) = \frac{1}{2}\, \mathbb{E}_{X,Y}\!\left[\left\| \nabla_y \log \frac{p(Y|X)}{q(Y|X)} \right\|^2\right] \]
The expectation is under the true joint distribution. Using integration by parts and some regularity conditions:
\[ J(p, q) = \mathbb{E}_{X,Y}\!\left[ \Delta_y \log q(Y|X) + \frac{1}{2}\left\| \nabla_y \log q(Y|X) \right\|^2 \right] + \text{const} \]
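To make the integrated-by-parts objective concrete, here is a minimal numerical sketch (not the paper's code): it evaluates the empirical conditional score-matching loss for a hypothetical one-dimensional model q_w(y|x) ∝ exp(-(y - wx)^2 / (2σ^2)); the data-generating slope, the grid search and all variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch: empirical conditional score-matching objective for a toy model
# q_w(y|x) ∝ exp(-(y - w*x)^2 / (2*sigma^2)), for which
#   ∇_y log q_w(y|x) = -(y - w*x) / sigma^2   and   Δ_y log q_w(y|x) = -1 / sigma^2.
rng = np.random.default_rng(0)
n, sigma = 500, 1.0
x = rng.uniform(-2.0, 2.0, n)
y = 2.0 * x + 0.5 * rng.standard_normal(n)       # samples from a true p(y|x) = N(2x, 0.5^2)

def J_hat(w):
    grad = -(y - w * x) / sigma**2               # ∇_y log q_w(Y|X)
    lap = -1.0 / sigma**2                        # Δ_y log q_w(Y|X)
    return np.mean(lap + 0.5 * grad**2)          # empirical J(p, q_w), up to a constant

ws = np.linspace(0.0, 4.0, 401)
w_hat = ws[np.argmin([J_hat(w) for w in ws])]
print(f"estimated slope: {w_hat:.2f}")           # close to the true slope 2.0
```

Note that the normalizer of q_w never appears: only the score of the model is needed, which is the point of the objective.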

Contribution

- A particular form of Conditional Exponential Family (the KCEF, defined below) based on vector-valued RKHSs.
- A method for approximating conditional densities using the KCEF, with statistical guarantees.

Score for the KCEF

For q_θ in the KCEF, the score is convex and quadratic in θ:
\[ J(p, q_\theta) = \frac{1}{2}\, \langle \theta, C\theta \rangle_{\mathcal H} + \langle \xi, \theta \rangle_{\mathcal H} + \text{const} \]
where C is a symmetric, positive, trace-class operator and ξ is a vector in H:
\[ C = \mathbb{E}_{X,Y}\!\left[ \sum_{i=1}^d \Gamma_{X,\cdot}\,\partial_i k(Y,\cdot) \otimes \Gamma_{X,\cdot}\,\partial_i k(Y,\cdot) \right], \qquad \xi = \mathbb{E}_{X,Y}\!\left[ \sum_{i=1}^d \Gamma_{X,\cdot}\,\partial_i^2 k(Y,\cdot) + \partial_i \log q_0(Y)\, \Gamma_{X,\cdot}\,\partial_i k(Y,\cdot) \right] \]
✓ No need to compute the intractable normalizer.
✓ Convex quadratic loss: guarantees existence and uniqueness of an optimal solution.
✓ A provably convergent algorithm can be used to estimate the optimal θ.
✗ The score can become degenerate if p(y|x) is not supported on the whole space.
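To see the quadratic structure concretely, the sketch below builds finite-dimensional analogues of C and ξ for an assumed model log q_θ(y|x) = ⟨θ, f(x, y)⟩ + log q_0(y) with d = 1; the feature choice, the data and all variable names are illustrative assumptions, not the paper's RKHS construction.

```python
import numpy as np

# Finite-dimensional stand-in for the RKHS model, exposing the quadratic structure:
#   log q_theta(y|x) = <theta, f(x, y)> + log q_0(y),   q_0 = N(0, 1),   d = 1.
# Then C = E[∂_y f ∂_y f^T], xi = E[∂_y^2 f + ∂_y log q_0(Y) ∂_y f], and
#   J(theta) = 0.5 <theta, C theta> + <xi, theta> + const,
# which is minimized by solving a single regularized linear system.
rng = np.random.default_rng(1)
n, m = 1000, 20
X = rng.uniform(-2.0, 2.0, n)
Y = np.sin(X) + 0.3 * rng.standard_normal(n)

a = rng.uniform(-2.0, 2.0, m)                    # hypothetical feature centres in x
b = rng.uniform(0.5, 3.0, m)                     # hypothetical frequencies in y

gx = np.exp(-(X[:, None] - a[None, :])**2 / 2.0)
dF = gx * b * np.cos(np.outer(Y, b))             # ∂_y f(X, Y), shape (n, m)
d2F = -gx * b**2 * np.sin(np.outer(Y, b))        # ∂_y^2 f(X, Y)

C = dF.T @ dF / n                                # empirical C
xi = np.mean(d2F + (-Y)[:, None] * dF, axis=0)   # empirical xi, with ∂_y log q_0(y) = -y
theta = np.linalg.solve(C + 1e-6 * np.eye(m), -xi)   # unique minimizer of the quadratic score
print(theta[:5])
```

Only the y-derivatives of the features enter the loss, so neither f itself nor the normalizer is ever evaluated during fitting.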

Kernel Exponential Family (Sriperumbudur et al. (2017))

Idea: Parametrize densities with functions in an RKHS G with kernel k:
\[ p_\theta(y) = q_0(y)\, e^{\langle \theta, k(y,\cdot)\rangle_{\mathcal G} - A(\theta)}, \qquad A(\theta) = \log \int q_0(y)\, e^{\langle \theta, k(y,\cdot)\rangle_{\mathcal G}}\, dy \]
θ is the natural parameter and k(y,·) the sufficient statistic; both are "infinite-dimensional" vectors.
✓ Richer than the finite-dimensional exponential family.
✗ Intractable log-partition function A(θ): the MLE is hard to compute.
✓ Learning via score matching (Hyvärinen (2005)).
✓ Good statistical properties (Sriperumbudur et al. (2017)).

Truth in Advertising

Failure case: J(p, q) is degenerate if p(y|x) is supported on disjoint subsets. For example, if p(y|-1) and p(y|1) have disjoint supports, the x-independent mixture p(y) = ½ (p(y|-1) + p(y|1)) (the model) achieves J(p(y|x), p(y)) = 0 against the target conditional p(y|x).
✓ Easy fix: add a small Gaussian noise to the data!
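The failure case can be checked numerically. The sketch below (means, scales and sample size are assumptions) draws data from two well-separated Gaussian conditionals and evaluates the score objective of the x-independent mixture against the true conditional.

```python
import numpy as np

# Numerical illustration of the failure case: when p(y|x=-1) and p(y|x=+1) have
# (essentially) disjoint supports, the x-independent mixture p(y) has the same
# conditional score as p(y|x) wherever data lands, so J(p(y|x), p(y)) ≈ 0.
rng = np.random.default_rng(4)
n, mu, s = 100_000, 5.0, 0.5
x = rng.choice([-1.0, 1.0], n)
y = mu * x + s * rng.standard_normal(n)          # y | x  ~  N(5x, 0.5^2)

def phi(y, m):                                   # Gaussian density N(m, s^2)
    return np.exp(-(y - m)**2 / (2 * s**2)) / np.sqrt(2 * np.pi * s**2)

score_cond = -(y - mu * x) / s**2                # ∇_y log p(y|x)
p1, p2 = phi(y, -mu), phi(y, mu)
score_marg = (-(y + mu) / s**2 * p1 - (y - mu) / s**2 * p2) / (p1 + p2)  # ∇_y log p(y)

J = 0.5 * np.mean((score_cond - score_marg)**2)
print(J)                                         # ≈ 0: the score cannot tell them apart
```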

Kernel Conditional Exponential Family

Idea: Extend the KEF to conditional densities:
\[ p_\theta(y|x) = q_0(y)\, e^{\langle \theta_x, k(y,\cdot)\rangle_{\mathcal G} - A(\theta_x)}, \qquad A(\theta_x) = \log \int q_0(y)\, e^{\langle \theta_x, k(y,\cdot)\rangle_{\mathcal G}}\, dy \]
The map x ↦ θ_x is constrained to lie in a vector-valued RKHS H with vector-valued kernel Γ_{x,x'}. H contains functions θ : X → G that satisfy the vector-valued reproducing property (Micchelli and Pontil (2005)):
\[ \langle \theta_x, f \rangle_{\mathcal G} = \langle \theta, \Gamma_{x,\cdot} f \rangle_{\mathcal H} \quad \forall f \in \mathcal G \]
By this property, p_θ can also be written as
\[ p_\theta(y|x) = q_0(y)\, e^{\langle \theta, \Gamma_{x,\cdot}\, k(y,\cdot)\rangle_{\mathcal H} - A(\theta_x)}. \]

Finite-Sample Estimate

Given n samples (X_i, Y_i)_{1≤i≤n}, the regularized empirical version of the score is:
\[ \hat J(p, q_\theta) = \frac{1}{2}\, \langle \theta, \hat C \theta \rangle_{\mathcal H} + \langle \hat\xi, \theta \rangle_{\mathcal H} + \frac{\lambda}{2}\, \|\theta\|_{\mathcal H}^2 \]
Kernel trick: the generalized representer theorem ensures that θ̂ is of the form
\[ \hat\theta = -\frac{1}{\lambda}\Big( \hat\xi + \sum_{b\in[n],\, i\in[d]} \beta_{(b,i)}\, \Gamma_{X_b,\cdot}\, \partial_i k(Y_b,\cdot) \Big), \]
where β is obtained by solving a linear system of size nd:
\[ (G + n\lambda I)\,\beta = h_\lambda, \qquad G = \big( \langle \partial_i k(Y_a,\cdot),\, \Gamma(X_a, X_b)\, \partial_j k(Y_b,\cdot) \rangle_{\mathcal G} \big)_{(a,i),(b,j)\in[n]\times[d]}, \qquad h_\lambda = \big( \partial_i \hat\xi(X_a, Y_a) \big)_{(a,i)\in[n]\times[d]}. \]
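Below is a minimal sketch of the linear-algebra step above for d = 1, assuming Gaussian kernels in both x and y, a scalar operator-valued kernel Γ(x, x') = κ(x, x')·Id and base density q_0 = N(0, 1); the data, bandwidths and scaling conventions are illustrative assumptions that follow the poster's notation rather than any reference implementation.

```python
import numpy as np

# Sketch of the finite-sample KCEF system for d = 1 (assumptions: Gaussian k and κ,
# Γ(x, x') = κ(x, x') * Id, q_0 = N(0, 1); scaling follows the poster's notation).
rng = np.random.default_rng(2)
n, lam, sy, sx = 200, 1e-3, 1.0, 1.0
X = rng.uniform(-2.0, 2.0, n)
Y = np.sin(X) + 0.3 * rng.standard_normal(n)

kappa = np.exp(-(X[:, None] - X[None, :])**2 / (2 * sx**2))     # κ(X_a, X_b)
R = Y[:, None] - Y[None, :]                                     # Y_a - Y_b
kyy = np.exp(-R**2 / (2 * sy**2))

# Cross-derivatives of the Gaussian kernel k needed below:
dudv_k = (1.0 / sy**2 - R**2 / sy**4) * kyy                     # ∂_u ∂_v k(Y_a, Y_b)
dvdu2_k = (3.0 * R / sy**4 - R**3 / sy**6) * kyy                # ∂_v ∂_u^2 k(Y_b, Y_a)

# Gram matrix G_{ab} = <∂k(Y_a,·), Γ(X_a, X_b) ∂k(Y_b,·)> and right-hand side
# h_a = ∂ ξ̂(X_a, Y_a), using ∂ log q_0(y) = -y for the standard normal base.
G = kappa * dudv_k
h = (kappa * (dvdu2_k - Y[None, :] * dudv_k)).mean(axis=1)

beta = np.linalg.solve(G + n * lam * np.eye(n), h)              # (G + nλI) β = h_λ
print(beta[:5])
# θ̂, and hence the unnormalized conditional density, is then recovered from β and ξ̂
# via the representer expression above.
```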

Theory

The paper provides asymptotic rates of convergence of θ̂ in the well-specified case. If θ_0 is the true natural parameter, then
\[ \|\hat\theta - \theta_0\|_{\mathcal H} = O_{p_0}\!\big(n^{\alpha - \frac{1}{2}}\big) \]
with λ = n^{-α} and 1/4 < α < 1/2, where α depends on the kernels and on p_0.

Experiments: Sampling from the KCEF

Motivation: Sampling from a high-dimensional distribution p(x_1, ..., x_d) can suffer from slow mixing times.
Idea:
- Approximate p by a product of conditional densities: p ≈ p̂(x_1) p̂(x_2|x_1) ... p̂(x_d|x_π(d)).
- Use Ancestral Hamiltonian Monte Carlo to sample from p̂(X_i|X_π(i)) given a sample of X_π(i) (a toy sketch of this scheme appears at the end of this section).

[Figure: scatter plots comparing samples from the data, the KCEF and NADE on pairs of coordinates of the Red Wine and Parkinsons datasets.]

Experiments: Comparison with Other Methods (NADE and LSCDE)

- Estimating p(y|x).
- Estimating p(x_1, ..., x_d) as a product of conditional densities.

[Figure: held-out negative log-likelihood (lower is better) of KCEF-F, KCEF-M, LSCDE-F, LSCDE-M and NADE on the benchmark datasets: topo, engel, sniffer, mcycle, heights, caution, GAGurine, snowgeese, ftcollinssnow, red_wine, white_wine, parkinsons.]

R datasets: N_train = N_test, ranging from 10 to 300; d_y = 1 and d_x from 1 to 13.
UCI datasets: N_train = 10 · N_test, ranging from 1500 to 5000; d from 11 to 15.

Model comparison: Samples from different models can be compared using the test for relative similarity of Bounliphone et al. (2015).
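Here is a toy sketch of the ancestral scheme. The poster samples each fitted KCEF conditional with Hamiltonian Monte Carlo; the sketch substitutes hand-written unnormalized log-conditionals and a plain random-walk Metropolis step, so every name and number in it is an assumption for illustration.

```python
import numpy as np

# Sketch of ancestral sampling from a product of conditionals
# p̂(x_1) p̂(x_2|x_1) ... p̂(x_d|x_π(d)), with π(i) = i - 1 (a chain ordering).
rng = np.random.default_rng(3)
d = 4

def log_cond(i, xi, parent):
    """Toy unnormalized log p̂(x_i | x_π(i)); x_1 has no parent."""
    mean = 0.0 if parent is None else np.sin(parent)
    return -0.5 * (xi - mean)**2

def metropolis_1d(logp, n_steps=200, step=0.8):
    # Stand-in for the per-conditional HMC sampler used on the poster.
    x = 0.0
    for _ in range(n_steps):
        prop = x + step * rng.standard_normal()
        if np.log(rng.uniform()) < logp(prop) - logp(x):
            x = prop
    return x

def ancestral_sample():
    x = np.empty(d)
    for i in range(d):
        parent = None if i == 0 else x[i - 1]
        x[i] = metropolis_1d(lambda v: log_cond(i, v, parent))
    return x

samples = np.array([ancestral_sample() for _ in range(100)])
print(samples.mean(axis=0))
```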

Bibliography

Bounliphone, W., Belilovsky, E., Blaschko, M. B., Antonoglou, I., and Gretton, A. (2015). A test of relative similarity for model selection in generative models. CoRR.
Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research.
Micchelli, C. A. and Pontil, M. A. (2005). On learning vector-valued functions. Neural Computation.
Sriperumbudur, B., Fukumizu, K., Kumar, R., Gretton, A., and Hyvärinen, A. (2017). Density estimation in infinite dimensional exponential families. Journal of Machine Learning Research.

This work was funded by the Gatsby Charitable Foundation. Contact: [email protected]