Localized Structured Prediction

Carlo Ciliberto¹, Francis Bach²,³, Alessandro Rudi²,³
¹ Department of Electrical and Electronic Engineering, Imperial College London, London
² Département d'informatique, École normale supérieure, PSL Research University
³ INRIA, Paris, France

Supervised Learning 101

• $X$ input space, $Y$ output space,
• $\ell : Y \times Y \to \mathbb{R}$ loss function,
• $\rho$ probability distribution on $X \times Y$.

The goal is to estimate

$$f^\star = \operatorname*{argmin}_{f : X \to Y} \mathbb{E}[\ell(f(x), y)],$$

given only the dataset $(x_i, y_i)_{i=1}^n$ sampled independently from $\rho$.

Structured Prediction

Prototypical Approach: Empirical Risk Minimization

Solve the problem

$$\hat{f} = \operatorname*{argmin}_{f \in G} \ \frac{1}{n} \sum_{i=1}^n \ell(f(x_i), y_i) + \lambda R(f),$$

where $G \subseteq \{f : X \to Y\}$ (usually a convex function space).

If $Y$ is a vector space:
• $G$ is easy to choose and optimize over: (generalized) linear models, kernel methods, neural networks, etc.

If $Y$ is a "structured" space:
• How to choose $G$? How to optimize over it?

State of the Art: Structured Case

$Y$ arbitrary: how do we parametrize $G$ and learn $\hat{f}$?

Surrogate approaches
+ Clear theory (e.g. convergence and learning rates)
− Only for special cases (classification, ranking, multi-labeling, etc.) [Bartlett et al., 2006, Duchi et al., 2010, Mroueh et al., 2012]

Score learning techniques
+ General algorithmic framework (e.g. StructSVM [Tsochantaridis et al., 2005])
− Limited theory (no consistency, see e.g. [Bakir et al., 2007])

Is it possible to have the best of both worlds? A general algorithmic framework + a clear theory.

Table of Contents

1. A General Framework for Structured Prediction [Ciliberto et al., 2016]
2. Leveraging Local Structure [This Work]

A General Framework for Structured Prediction

Characterizing the target function:

$$f^\star = \operatorname*{argmin}_{f : X \to Y} \mathbb{E}_{xy}[\ell(f(x), y)].$$

Pointwise characterization in terms of the conditional expectation:

$$f^\star(x) = \operatorname*{argmin}_{z \in Y} \mathbb{E}_y[\ell(z, y) \mid x].$$

Deriving an Estimator

Idea: approximate

$$f^\star(x) = \operatorname*{argmin}_{z \in Y} E(z, x), \qquad E(z, x) = \mathbb{E}_y[\ell(z, y) \mid x],$$

by means of an estimator $\hat{E}(z, x)$ of the ideal $E(z, x)$:

$$\hat{f}(x) = \operatorname*{argmin}_{z \in Y} \hat{E}(z, x), \qquad \hat{E}(z, x) \approx E(z, x).$$

Question: how do we choose $\hat{E}(z, x)$ given the dataset $(x_i, y_i)_{i=1}^n$?

Estimating the Conditional Expectation

Idea: for every $z$, perform "regression" over the values $\ell(z, \cdot)$:

$$\hat{g}_z = \operatorname*{argmin}_{g : X \to \mathbb{R}} \ \frac{1}{n} \sum_{i=1}^n L(g(x_i), \ell(z, y_i)) + \lambda R(g).$$

Then we take $\hat{E}(z, x) = \hat{g}_z(x)$.

Questions:
• Models: how to choose $L$?
• Computations: do we need to compute $\hat{g}_z$ for every $z \in Y$?
• Theory: does $\hat{E}(z, x) \to E(z, x)$? More generally, does $\hat{f} \to f^\star$?
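To make the computational question concrete, here is a minimal sketch of the estimator above taken literally: one scalar regression per candidate output $z$. It assumes kernel ridge regression as the model for $g$; the helper names (`fit_ridge`, `naive_conditional_expectation`) are ours, not from the slides.

```python
import numpy as np

def fit_ridge(K, targets, lam):
    # Kernel ridge regression in dual form: g(x) = sum_i c_i k(x, x_i).
    n = K.shape[0]
    return np.linalg.solve(K + lam * n * np.eye(n), targets)

def naive_conditional_expectation(K, K_test, ys, candidates, loss, lam=0.1):
    # For every candidate z, regress the targets ell(z, y_i) on the inputs.
    # Returns a (num_test, num_candidates) matrix estimating E[ell(z, y) | x].
    E_hat = []
    for z in candidates:                # one regression per candidate z: expensive
        targets = np.array([loss(z, y) for y in ys])
        c = fit_ridge(K, targets, lam)  # the matrix solved does not depend on z
        E_hat.append(K_test @ c)        # g_z evaluated at the test points
    return np.stack(E_hat, axis=1)
```

Note that the linear system being solved is identical for every $z$; only the regression targets change. The square-loss derivation that follows exploits exactly this.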
Square Loss!

Let $L$ be the square loss. Then:

$$\hat{g}_z = \operatorname*{argmin}_{g} \ \frac{1}{n} \sum_{i=1}^n \big(g(x_i) - \ell(z, y_i)\big)^2 + \lambda \|g\|^2.$$

In particular, for linear models $g(x) = \phi(x)^\top w$:

$$\hat{g}_z(x) = \phi(x)^\top \hat{w}_z, \qquad \hat{w}_z = \operatorname*{argmin}_{w} \|Aw - b\|^2 + \lambda \|w\|^2,$$

with $A = [\phi(x_1), \ldots, \phi(x_n)]^\top$ and $b = [\ell(z, y_1), \ldots, \ell(z, y_n)]^\top$.

Computing the $\hat{g}_z$ All at Once

Closed-form solution:

$$\hat{g}_z(x) = \phi(x)^\top \hat{w}_z = \underbrace{\phi(x)^\top (A^\top A + \lambda n I)^{-1} A^\top}_{\alpha(x)^\top} \, b.$$

In particular, we can compute

$$\alpha_i(x) = \phi(x)^\top (A^\top A + \lambda n I)^{-1} \phi(x_i)$$

only once (independently of $z$). Then, for any $z$,

$$\hat{g}_z(x) = \sum_{i=1}^n \alpha_i(x) \, b_i = \sum_{i=1}^n \alpha_i(x) \, \ell(z, y_i).$$

Structured Prediction Algorithm

Input: dataset $(x_i, y_i)_{i=1}^n$.
Training: for $i = 1, \ldots, n$, compute

$$v_i = (A^\top A + \lambda n I)^{-1} \phi(x_i).$$

Prediction: given a new test point $x$, compute

$$\alpha_i(x) = \phi(x)^\top v_i.$$

Then,

$$\hat{f}(x) = \operatorname*{argmin}_{z \in Y} \sum_{i=1}^n \alpha_i(x) \, \ell(z, y_i).$$

The Proposed Structured Prediction Algorithm

Questions:
• Models: how to choose $L$? Square loss!
• Computations: do we need to compute $\hat{g}_z$ for every $z \in Y$? No need, compute them all at once!
• Theory: does $\hat{f} \to f^\star$? Yes!

Theorem (Rates — [Ciliberto et al., 2016])
Under mild assumptions on $\ell$, let $\lambda = n^{-1/2}$. Then, with high probability,

$$\mathbb{E}[\ell(\hat{f}(x), y) - \ell(f^\star(x), y)] \leq O(n^{-1/4}).$$

A General Framework for Structured Prediction (General Algorithm + Theory)

Is it possible to have the best of both worlds? Yes! We introduced an algorithmic framework for structured prediction:
• Directly applicable to a wide family of problems $(Y, \ell)$.
• With strong theoretical guarantees.
• Recovering many existing algorithms (not seen here).

What Am I Hiding?

• Theory. The key assumption needed to achieve consistency and rates is that $\ell$ is a Structure Encoding Loss Function (SELF):

$$\ell(z, y) = \langle \psi(z), \varphi(y) \rangle_{\mathcal{H}} \quad \forall \, z, y \in Y,$$

with $\psi, \varphi : Y \to \mathcal{H}$ continuous maps into a Hilbert space $\mathcal{H}$.
  • Similar to the characterization of reproducing kernels.
  • In principle hard to verify. However, lots of ML losses satisfy it!
• Computations. We need to solve an optimization problem at prediction time!

Prediction: The Inference Problem

Solving an optimization problem at prediction time is standard practice in structured prediction, known as the inference problem:

$$\hat{f}(x) = \operatorname*{argmin}_{z \in Y} \hat{E}(x, z).$$

In our case it is reminiscent of a weighted barycenter:

$$\hat{f}(x) = \operatorname*{argmin}_{z \in Y} \sum_{i=1}^n \alpha_i(x) \, \ell(z, y_i).$$

It is *very* problem dependent.

Example: Learning to Rank

Goal: given a query $x$, order a set of documents $d_1, \ldots, d_k$ according to their relevance scores $y_1, \ldots, y_k$ w.r.t. $x$.

Pair-wise loss:

$$\ell_{\mathrm{rank}}(f(x), y) = \sum_{i,j=1}^k (y_i - y_j) \, \operatorname{sign}\big(f(x)_i - f(x)_j\big).$$

It can be shown that computing $\hat{f}(x) = \operatorname*{argmin}_{z \in Y} \sum_{i=1}^n \alpha_i(x) \, \ell_{\mathrm{rank}}(z, y_i)$ is a Minimum Feedback Arc Set problem on DAGs (NP-hard!). Still, approximate solutions can improve upon non-consistent approaches.
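Collecting the training and prediction steps, here is a compact sketch of the whole pipeline. It assumes a Gaussian kernel and replaces the problem-specific inference solver with a search over a finite candidate set (by default, the training outputs); the class and helper names are ours.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Pairwise Gaussian kernel between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

class StructuredEstimator:
    def fit(self, X, ys, lam=0.1, sigma=1.0):
        self.X, self.ys, self.sigma = X, list(ys), sigma
        n = len(X)
        K = gaussian_kernel(X, X, sigma)
        # Training is a single loss-independent linear problem:
        # alpha(x) = (K + lam*n*I)^{-1} k(x) is shared by every candidate z.
        self.V = np.linalg.inv(K + lam * n * np.eye(n))
        return self

    def predict(self, x, loss, candidates=None):
        alpha = self.V @ gaussian_kernel(self.X, x[None, :], self.sigma)[:, 0]
        if candidates is None:
            candidates = self.ys  # simplest generic fallback for the argmin
        # Inference: f(x) = argmin_z sum_i alpha_i(x) * loss(z, y_i).
        scores = [sum(a * loss(z, y) for a, y in zip(alpha, self.ys))
                  for z in candidates]
        return candidates[int(np.argmin(scores))]
```

Swapping in a problem-specific solver (e.g. an approximate feedback-arc-set routine for the ranking loss above) changes only the argmin inside `predict`; the training step is untouched.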
Additional Work

Case studies:
• Learning to rank [Korba et al., 2018]
• Output Fisher embeddings [Djerrab et al., 2018]
• $Y$ = manifolds, $\ell$ = geodesic distance [Rudi et al., 2018]
• $Y$ = probability space, $\ell$ = Wasserstein distance [Luise et al., 2018]

Refinements of the analysis:
• Alternative derivations [Osokin et al., 2017]
• Discrete losses [Nowak-Vila et al., 2018, Struminsky et al., 2018]

Extensions:
• Application to multitask learning [Ciliberto et al., 2017]
• Beyond the least-squares surrogate [Nowak-Vila et al., 2019]
• Regularizing with the trace norm [Luise et al., 2019]

Predicting Probability Distributions [Luise, Rudi, Pontil, Ciliberto '18]

Setting: $Y = \mathcal{P}(\mathbb{R}^d)$, probability distributions on $\mathbb{R}^d$.
Loss: Wasserstein distance

$$\ell(\mu, \nu) = \min_{\tau \in \Pi(\mu, \nu)} \int \|x - y\|^2 \, d\tau(x, y).$$

Application: digit reconstruction.

Manifold Regression [Rudi, Ciliberto, Marconi, Rosasco '18]

Setting: $Y$ Riemannian manifold. Loss: (squared) geodesic distance. Optimization: Riemannian gradient descent.
Applications: fingerprint reconstruction ($Y = S^1$ sphere), multi-labeling ($Y$ statistical manifold).

Nonlinear Multi-task Learning [Ciliberto, Rudi, Rosasco, Pontil '17; Luise, Stamos, Pontil, Ciliberto '19]

Idea: instead of solving multiple learning problems (tasks) separately, leverage the potential relations among them.
Previous methods: only imposing/learning linear task relations; unable to cope with non-linear constraints (e.g. ranking, robotics, etc.).
MTL + structured prediction:
− Interpret multiple tasks as separate outputs.
− Impose constraints as structure on the joint output.

Leveraging Local Structure

Motivating Example (Between-Locality)

Super-resolution: learn $f : \text{Lowres} \to \text{Highres}$. However...
• Very large output sets (high sample complexity).
• Local information might be sufficient to predict the output.

Idea: learn local input-output maps under structural constraints (i.e. overlapping output patches should line up).

Between-locality. Let $[x]_p, [y]_p$ denote input/output "parts", $p \in P$:
• $P([y]_p \mid x) = P([y]_p \mid [x]_p)$
• $P([y]_p \mid [x]_p) = P([y]_q \mid [x]_q)$

Structured Prediction + Parts

Assumption: the loss is "aware" of the parts,

$$\ell(y', y) = \sum_{p \in P} \ell_0([y']_p, [y]_p),$$

where
• the set $P$ indexes the parts of $X$ and $Y$,
• $\ell_0$ is a loss on parts,
• $[y]_p$ is the $p$-th part of $y$.

Localized Structured Prediction: Inference

[Figure: a test input $x$ and a candidate output $z$, each decomposed into corresponding local parts.]
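The deck ends before spelling out the localized estimator, so the following is only an illustrative sketch of the idea in the last few slides, under the between-locality assumption and a part-decomposable loss: pool the (input-part, output-part) pairs across all positions, learn a single local map, and decode each output part separately. The stitching constraints between overlapping patches (which the actual method enforces) are omitted, and all names are hypothetical.

```python
import numpy as np

def localized_predict(x_parts, train_part_pairs, loss0, alpha_fn):
    # x_parts:          input parts [x]_p of the test point
    # train_part_pairs: pooled pairs ([x_i]_q, [y_i]_q) over training set and parts
    #                   (between-locality makes pooling across parts legitimate)
    # loss0:            loss on parts (ell_0 in the slides)
    # alpha_fn:         local weighting scheme, e.g. kernel ridge weights as above
    y_parts = []
    for xp in x_parts:
        alpha = alpha_fn(xp, [xq for xq, _ in train_part_pairs])
        candidates = [yq for _, yq in train_part_pairs]
        # Localized inference: score each candidate part with the weighted loss.
        scores = [sum(a * loss0(z, yq)
                      for a, (_, yq) in zip(alpha, train_part_pairs))
                  for z in candidates]
        y_parts.append(candidates[int(np.argmin(scores))])
    return y_parts  # predicted parts, still to be stitched into a full output
```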
