Localized Structured Prediction
Carlo Ciliberto (1), Francis Bach (2,3), Alessandro Rudi (2,3)
(1) Department of Electrical and Electronic Engineering, Imperial College London, London
(2) Département d'informatique, Ecole normale supérieure, PSL Research University
(3) INRIA, Paris, France

Supervised Learning 101

• $\mathcal{X}$ input space, $\mathcal{Y}$ output space,
• $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ loss function,
• $\rho$ probability distribution on $\mathcal{X} \times \mathcal{Y}$.

$$f^\star = \argmin_{f : \mathcal{X} \to \mathcal{Y}} \mathbb{E}[\ell(f(x), y)],$$

given only the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.

Structured Prediction

Prototypical Approach: Empirical Risk Minimization

Solve the problem

$$\hat{f} = \argmin_{f \in \mathcal{G}} \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda R(f),$$

where $\mathcal{G} \subseteq \{f : \mathcal{X} \to \mathcal{Y}\}$ (usually a convex function space).

If $\mathcal{Y}$ is a vector space:
• $\mathcal{G}$ is easy to choose/optimize: (generalized) linear models, kernel methods, neural networks, etc.

If $\mathcal{Y}$ is a "structured" space:
• How to choose $\mathcal{G}$? How to optimize over it?

State of the Art: Structured Case

$\mathcal{Y}$ arbitrary: how do we parametrize $\mathcal{G}$ and learn $\hat{f}$?

Surrogate approaches
+ Clear theory (e.g. convergence and learning rates)
− Only for special cases (classification, ranking, multi-labeling, etc.) [Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]

Score learning techniques
+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al., 2005])
− Limited theory (no consistency, see e.g. [Bakir et al., 2007])

Is it possible to have the best of both worlds? A general algorithmic framework + clear theory.

Table of Contents
1. A General Framework for Structured Prediction [Ciliberto et al., 2016]
2. Leveraging Local Structure [This Work]

A General Framework for Structured Prediction

Characterizing the target function:

$$f^\star = \argmin_{f : \mathcal{X} \to \mathcal{Y}} \mathbb{E}_{xy}[\ell(f(x), y)].$$

Pointwise characterization in terms of the conditional expectation:

$$f^\star(x) = \argmin_{z \in \mathcal{Y}} \mathbb{E}_y[\ell(z, y) \mid x].$$

Deriving an Estimator

Idea: approximate

$$f^\star(x) = \argmin_{z \in \mathcal{Y}} E(z, x), \qquad E(z, x) = \mathbb{E}_y[\ell(z, y) \mid x],$$

by means of an estimator $\hat{E}(z, x)$ of the ideal $E(z, x)$:

$$\hat{f}(x) = \argmin_{z \in \mathcal{Y}} \hat{E}(z, x), \qquad \hat{E}(z, x) \approx E(z, x).$$

Question: how to choose $\hat{E}(z, x)$ given the dataset $(x_i, y_i)_{i=1}^n$?

Estimating the Conditional Expectation

Idea: for every $z$, perform "regression" over the values $\ell(z, \cdot)$:

$$\hat{g}_z = \argmin_{g : \mathcal{X} \to \mathbb{R}} \frac{1}{n} \sum_{i=1}^n L(g(x_i), \ell(z, y_i)) + \lambda R(g).$$

Then we take $\hat{E}(z, x) = \hat{g}_z(x)$.

Questions:
• Models: how to choose $L$?
• Computations: do we need to compute $\hat{g}_z$ for every $z \in \mathcal{Y}$?
• Theory: does $\hat{E}(z, x) \to E(z, x)$? More generally, does $\hat{f} \to f^\star$?
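To make the per-z regression idea concrete, here is a minimal numpy sketch (not from the talk): it fits one regressor per candidate output $z$ over the targets $\ell(z, y_i)$, using ridge regression as the base regressor for concreteness. The names `X_feat`, `loss`, and `candidates` are illustrative, and a finite, enumerable candidate set is assumed.

```python
import numpy as np

def naive_estimate_E(X_feat, Y, loss, candidates, lam=0.1):
    """One regression per candidate z: fit g_z to the targets loss(z, y_i).

    X_feat:     (n, d) features phi(x_i) of the training inputs  [assumed given]
    Y:          list of n training outputs y_i
    loss:       function loss(z, y) -> float (the structured loss ell)
    candidates: finite list of candidate outputs z  [assumption for this sketch]

    Returns a dict z -> weight vector w_z, so that g_z(x) is approximated
    by phi(x) @ w_z.  Ridge regression is used as the base regressor.
    """
    n, d = X_feat.shape
    G = X_feat.T @ X_feat + lam * n * np.eye(d)    # shared across all z
    weights = {}
    for z in candidates:                           # naive: one fit per candidate z
        b = np.array([loss(z, y) for y in Y])      # regression targets ell(z, y_i)
        weights[z] = np.linalg.solve(G, X_feat.T @ b)
    return weights

def naive_predict(x_feat, weights):
    # f_hat(x) = argmin_z g_z(x), brute-forced over the candidate set
    return min(weights, key=lambda z: float(x_feat @ weights[z]))
```

The loop over candidates is the weak point of this naive version; the next slides show that, with the square loss, the dependence on $z$ can be factored out entirely at training time.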
Square Loss!

Let $L$ be the square loss. Then:

$$\hat{g}_z = \argmin_{g} \frac{1}{n} \sum_{i=1}^n \big(g(x_i) - \ell(z, y_i)\big)^2 + \lambda \|g\|^2.$$

In particular, for linear models $g(x) = \phi(x)^\top w$,

$$\hat{g}_z(x) = \phi(x)^\top \hat{w}_z, \qquad \hat{w}_z = \argmin_{w} \|Aw - b\|^2 + \lambda \|w\|^2,$$

with $A = [\phi(x_1), \dots, \phi(x_n)]^\top$ and $b = [\ell(z, y_1), \dots, \ell(z, y_n)]^\top$.

Computing the $\hat{g}_z$ All at Once

Closed-form solution:

$$\hat{g}_z(x) = \phi(x)^\top \hat{w}_z = \underbrace{\phi(x)^\top (A^\top A + \lambda n I)^{-1} A^\top}_{\alpha(x)^\top} \, b = \alpha(x)^\top b.$$

In particular, we can compute

$$\alpha_i(x) = \phi(x)^\top (A^\top A + \lambda n I)^{-1} \phi(x_i)$$

only once (independently of $z$). Then, for any $z$,

$$\hat{g}_z(x) = \sum_{i=1}^n \alpha_i(x) \, b_i = \sum_{i=1}^n \alpha_i(x) \, \ell(z, y_i).$$

Structured Prediction Algorithm

Input: dataset $(x_i, y_i)_{i=1}^n$.

Training: for $i = 1, \dots, n$, compute

$$v_i = (A^\top A + \lambda n I)^{-1} \phi(x_i).$$

Prediction: given a new test point $x$, compute

$$\alpha_i(x) = \phi(x)^\top v_i.$$

Then,

$$\hat{f}(x) = \argmin_{z \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \, \ell(z, y_i).$$

The Proposed Structured Prediction Algorithm

Questions:
• Models: how to choose $L$? Square loss!
• Computations: do we need to compute $\hat{g}_z$ for every $z \in \mathcal{Y}$? No need, compute them all at once!
• Theory: does $\hat{f} \to f^\star$? Yes!

Theorem (Rates, [Ciliberto et al., 2016]). Under mild assumptions on $\ell$, let $\lambda = n^{-1/2}$. Then

$$\mathbb{E}[\ell(\hat{f}(x), y) - \ell(f^\star(x), y)] \leq O(n^{-1/4}), \quad \text{with high probability.}$$

A General Framework for Structured Prediction (General Algorithm + Theory)

Is it possible to have the best of both worlds? Yes! We introduced an algorithmic framework for structured prediction:
• Directly applicable to a wide family of problems $(\mathcal{Y}, \ell)$.
• With strong theoretical guarantees.
• Recovering many existing algorithms (not seen here).

What Am I Hiding?

• Theory. The key assumption to achieve consistency and rates is that $\ell$ is a Structure Encoding Loss Function (SELF):

$$\ell(z, y) = \langle \psi(z), \varphi(y) \rangle_{\mathcal{H}} \quad \forall z, y \in \mathcal{Y},$$

with $\psi, \varphi : \mathcal{Y} \to \mathcal{H}$ continuous maps into a Hilbert space $\mathcal{H}$.
• Similar to the characterization of reproducing kernels.
• In principle hard to verify. However, lots of ML losses satisfy it!
• Computations. We need to solve an optimization problem at prediction time!

Prediction: The Inference Problem

Solving an optimization problem at prediction time is standard practice in structured prediction, known as the inference problem:

$$\hat{f}(x) = \argmin_{z \in \mathcal{Y}} \hat{E}(x, z).$$

In our case it is reminiscent of a weighted barycenter:

$$\hat{f}(x) = \argmin_{z \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \, \ell(z, y_i).$$

It is *very* problem dependent.

Example: Learning to Rank

Goal: given a query $x$, order a set of documents $d_1, \dots, d_k$ according to their relevance scores $y_1, \dots, y_k$ w.r.t. $x$.

Pairwise loss:

$$\ell_{\mathrm{rank}}(f(x), y) = \sum_{i,j=1}^k (y_i - y_j) \, \mathrm{sign}\big(f(x)_i - f(x)_j\big).$$

It can be shown that $\hat{f}(x) = \argmin_{z \in \mathcal{Y}} \sum_{i=1}^n \alpha_i(x) \ell(z, y_i)$ is a Minimum Feedback Arc Set problem on DAGs (NP-hard!).

Still, approximate solutions can improve upon non-consistent approaches.
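As a concrete illustration of the training/prediction scheme and the inference problem above (a sketch, not code from the talk), the snippet below uses a Gaussian kernel in place of an explicit feature map, so that $\alpha(x) = (K + \lambda n I)^{-1} k_x$ with $K$ the kernel matrix and $k_x$ the vector of kernel evaluations at the test point. The kernel choice, the class name, and the brute-force argmin over a user-supplied finite candidate set are all assumptions for the sake of the example; as the ranking slide shows, real inference problems are problem dependent and can be NP-hard.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

class StructuredEstimator:
    """Kernelized sketch of the structured prediction algorithm above."""

    def __init__(self, loss, candidates, lam=1e-2, sigma=1.0):
        self.loss, self.candidates = loss, candidates
        self.lam, self.sigma = lam, sigma

    def fit(self, X, Y):
        self.X, self.Y = np.asarray(X, dtype=float), list(Y)
        n = len(self.Y)
        K = gaussian_kernel(self.X, self.X, self.sigma)
        # Factor (K + lam * n * I) once; the loss never appears at training time.
        self.chol = np.linalg.cholesky(K + self.lam * n * np.eye(n))
        return self

    def alpha(self, x):
        # alpha(x) = (K + lam * n * I)^{-1} k_x via the Cholesky factor.
        k_x = gaussian_kernel(self.X, np.atleast_2d(x), self.sigma)[:, 0]
        return np.linalg.solve(self.chol.T, np.linalg.solve(self.chol, k_x))

    def predict(self, x):
        a = self.alpha(x)
        # Inference: weighted "barycenter" argmin over the candidate set.
        scores = [sum(ai * self.loss(z, yi) for ai, yi in zip(a, self.Y))
                  for z in self.candidates]
        return self.candidates[int(np.argmin(scores))]
```

For instance, with the 0-1 loss and the class labels as candidates, `predict` reduces to a weighted vote: it returns the label whose training points carry the largest total weight $\sum_{i : y_i = z} \alpha_i(x)$.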
Additional Work

Case studies:
• Learning to rank [Korba et al., 2018]
• Output Fisher embeddings [Djerrab et al., 2018]
• $\mathcal{Y}$ = manifolds, $\ell$ = geodesic distance [Rudi et al., 2018]
• $\mathcal{Y}$ = probability space, $\ell$ = Wasserstein distance [Luise et al., 2018]

Refinements of the analysis:
• Alternative derivations [Osokin et al., 2017]
• Discrete losses [Nowak-Vila et al., 2018, Struminsky et al., 2018]

Extensions:
• Application to multitask learning [Ciliberto et al., 2017]
• Beyond the least-squares surrogate [Nowak-Vila et al., 2019]
• Regularizing with the trace norm [Luise et al., 2019]

Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto '18]

Setting: $\mathcal{Y} = \mathcal{P}(\mathbb{R}^d)$, probability distributions on $\mathbb{R}^d$.
Loss: Wasserstein distance

$$\ell(\mu, \nu) = \min_{\tau \in \Pi(\mu, \nu)} \int \|z - y\|^2 \, d\tau(z, y).$$

Application: digit reconstruction.

Manifold Regression [Rudi, Ciliberto, Marconi, Rosasco '18]

Setting: $\mathcal{Y}$ a Riemannian manifold.
Loss: (squared) geodesic distance.
Optimization: Riemannian gradient descent.
Applications: fingerprint reconstruction ($\mathcal{Y} = S^1$ sphere) and multi-labeling ($\mathcal{Y}$ a statistical manifold).

Nonlinear Multi-task Learning [Ciliberto, Rudi, Rosasco, Pontil '17; Luise, Stamos, Pontil, Ciliberto '19]

Idea: instead of solving multiple learning problems (tasks) separately, leverage the potential relations among them.

Previous methods: only impose/learn linear task relations; unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).

MTL + structured prediction:
• Interpret multiple tasks as separate outputs.
• Impose constraints as structure on the joint output.

Leveraging Local Structure

Motivating Example (Between-Locality)

Super-resolution: learn $f : \text{Lowres} \to \text{Highres}$. However...
• Very large output sets (high sample complexity).
• Local information might be sufficient to predict the output.

Idea: learn local input-output maps under structural constraints (i.e. overlapping output patches should line up).

Between-locality. Let $[x]_p, [y]_p$ denote input/output "parts", $p \in P$:
• $\mathbb{P}([y]_p \mid x) = \mathbb{P}([y]_p \mid [x]_p)$
• $\mathbb{P}([y]_p \mid [x]_p) = \mathbb{P}([y]_q \mid [x]_q)$

Structured Prediction + Parts

Assumption. The loss is "aware" of the parts:

$$\ell(y', y) = \sum_{p \in P} \ell_0([y']_p, [y]_p),$$

where
• the set $P$ indexes the parts of $\mathcal{X}$ and $\mathcal{Y}$,
• $\ell_0$ is a loss on parts,
• $[y]_p$ is the $p$-th part of $y$.

Localized Structured Prediction: Inference

[Figure: Input / Output]
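The inference slide relies on a figure, but the computational payoff of the parts assumption can be sketched in code: when the loss decomposes over parts, the weighted-barycenter inference problem splits into one small argmin per part. The sketch below is only an illustration under the stated assumptions, not the localized estimator from this work; in particular, the localized method computes weights from the input parts $[x]_p$ (which is what exploits between-locality) and handles overlapping parts, neither of which is modelled here. The argument names and the dict representation of parts are hypothetical.

```python
import numpy as np

def localized_inference(alpha, Y_train, loss0, part_candidates):
    """Inference with a part-decomposed loss ell(z, y) = sum_p ell0([z]_p, [y]_p).

    alpha:            (n,) weights alpha_i(x) for the test point x  [assumed given]
    Y_train:          list of training outputs, each a dict {p: [y_i]_p}
    loss0:            loss on parts, loss0(z_p, y_p) -> float
    part_candidates:  dict {p: finite list of candidate values for part p}  [assumption]

    Because the loss separates over parts, the global argmin over z splits into
    independent argmins, one per part p.  (The "overlapping patches line up"
    constraints from the slides are not modelled in this sketch.)
    """
    z_hat = {}
    for p, candidates in part_candidates.items():
        scores = [
            sum(a * loss0(c, y[p]) for a, y in zip(alpha, Y_train))
            for c in candidates
        ]
        z_hat[p] = candidates[int(np.argmin(scores))]
    return z_hat
```

The point of the decomposition is that each per-part argmin ranges over a far smaller set than the full structured output space, which is what makes inference tractable for outputs such as high-resolution images.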