Order parameters and model selection in machine learning: model characterization and feature selection

Romaric Gaudel

Advisor: Michèle Sebag; Co-advisor: Antoine Cornuéjols

PhD defense, December 14, 2010

Supervised Machine Learning

Background: an unknown distribution $P(x, y)$ on $\mathcal{X} \times \mathcal{Y}$.

Objective: find $h^*$ minimizing the generalization error
$$\mathrm{Err}(h) = \mathbb{E}_{P(x,y)}\left[\ell(h(x), y)\right]$$
where $\ell(h(x), y)$ is the cost of the error made on example $(x, y)$.
(Figure: decision boundary $h^*(x) = 0$ separating the regions $h^*(x) > 0$ and $h^*(x) < 0$.)

Given: training examples
$$\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad (x_i, y_i) \sim P(x, y), \quad i \in \{1, \ldots, n\}$$

Supervised Machine Learning: three sources of error (Vapnik-Chervonenkis; Bottou & Bousquet, 08)

Approximation error (a.k.a. bias): the learned hypothesis belongs to $\mathcal{H}$
$$h^*_{\mathcal{H}} = \operatorname*{argmin}_{h \in \mathcal{H}} \mathrm{Err}(h)$$

Estimation error (a.k.a. variance): $\mathrm{Err}$ is estimated by the empirical error
$$\mathrm{Err}_n(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(x_i), y_i), \qquad h_n = \operatorname*{argmin}_{h \in \mathcal{H}} \mathrm{Err}_n(h)$$

Optimization error: the learned hypothesis is returned by an optimization algorithm $\mathcal{A}$
$$\hat{h}_n = \mathcal{A}(\mathcal{L})$$

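For reference, the three terms above telescope into the excess error of the hypothesis actually returned; this restatement of the Bottou & Bousquet decomposition is added here for readability and is not printed on the slide:

```latex
\mathrm{Err}(\hat{h}_n) - \mathrm{Err}(h^*)
  = \underbrace{\mathrm{Err}(h^*_{\mathcal{H}}) - \mathrm{Err}(h^*)}_{\text{approximation}}
  + \underbrace{\mathrm{Err}(h_n) - \mathrm{Err}(h^*_{\mathcal{H}})}_{\text{estimation}}
  + \underbrace{\mathrm{Err}(\hat{h}_n) - \mathrm{Err}(h_n)}_{\text{optimization}}
```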
Focus of the thesis: combinatorial optimization problems hidden in Machine Learning

Relational representation =⇒ a combinatorial optimization problem (example: the Mutagenesis database)

Feature Selection =⇒ a combinatorial optimization problem (example: microarray data)

Outline

1 Relational Kernels
2 Feature Selection

Relational Learning / Inductive Logic Programming

Position
  $\mathcal{X}$: keys in the relational database, plus background knowledge
  $\mathcal{H}$: a set of logical formulas (an expressive language)
  Actual covering test: a Constraint Satisfaction Problem (CSP)

CSP consequences within Inductive Logic Programming

Consequences of the Phase Transition on complexity
  Worst case: NP-hard
  Average case: "easy", except in the Phase Transition region (Cheeseman et al., 91)

Phase Transition in Inductive Logic Programming
  Existence (Giordana & Saitta, 00)
  Impact: learning fails in the Phase Transition region (Botta et al., 03)

Multiple Instance Problems: the missing link between Relational and Propositional Learning

Multiple Instance Problems (MIP) (Dietterich et al., 89)
  An example: a set of instances
  An instance: a vector of features
  Target concept: there exists an instance satisfying a predicate P
$$\mathrm{pos}(x) \iff \exists I \in x,\; P(I)$$

Example of MIP: key rings and a locked door. A positive key ring contains a key which can unlock the door; a negative key ring does not.

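The existential target concept above is simple enough to state as code; the following minimal sketch (toy predicate and data, all hypothetical) only makes the there-exists semantics explicit.

```python
from typing import Callable, Iterable, Sequence

def mip_label(bag: Iterable[Sequence[float]],
              predicate: Callable[[Sequence[float]], bool]) -> bool:
    """Multiple-instance label: a bag is positive iff at least one of its
    instances satisfies the predicate P."""
    return any(predicate(instance) for instance in bag)

# Toy key-ring example: an "instance" is a 1-D key, the door opens for keys close to 0.42.
door = lambda key: abs(key[0] - 0.42) < 0.05
print(mip_label([[0.10], [0.43], [0.90]], door))  # True  (positive key ring)
print(mip_label([[0.10], [0.70]], door))          # False (negative key ring)
```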
Support Vector Machine

A convex optimization problem
$$\hat{\alpha} = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n} \; \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_{i=1}^{n} \alpha_i
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \ldots, n$$
(Figure: SVM margin with slack variables $\xi_i$ and the level sets $\hat{h}_n(x) = -1, 0, +1$.)

Kernel trick: replace the inner product $\langle x_i, x_j \rangle$ by a kernel $K(x_i, x_j)$.

Kernel-based propositionalization (differs from the RKHS framework)
$$\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad \Phi_K : x \mapsto (K(x_1, x), \ldots, K(x_n, x))$$

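A minimal sketch of the kernel-based propositionalization map $\Phi_K$; the Gaussian instance kernel and the toy data below are only illustrative assumptions.

```python
import numpy as np

def propositionalize(K, train_X, X):
    """Map each example x to the vector (K(x_1, x), ..., K(x_n, x)) of kernel
    values against the n training examples."""
    return np.array([[K(xi, x) for xi in train_X] for x in X])

rbf = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))
train_X = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(propositionalize(rbf, train_X, [[0.5, 0.5]]))  # one row of 3 kernel values
```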
SVM and MIP

Averaging kernel for MIP (Gärtner et al., 02): given a kernel $k$ on instances,
$$K(x, x') = \frac{\sum_{x_i \in x} \sum_{x_j \in x'} k(x_i, x_j)}{\mathrm{norm}(x)\,\mathrm{norm}(x')}$$

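A sketch of the averaging kernel on bags; the normalization by the bag size is one common choice and is an assumption here (the slide only writes norm(x)).

```python
import numpy as np

def averaging_kernel(k, bag_a, bag_b, norm=len):
    """Averaging kernel on bags: sum the instance kernel k over all instance
    pairs and divide by a normalization of each bag."""
    total = sum(k(a, b) for a in bag_a for b in bag_b)
    return total / (norm(bag_a) * norm(bag_b))

rbf = lambda a, b: float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))
bag1 = [[0.1, 0.2], [0.9, 0.8]]
bag2 = [[0.15, 0.25]]
print(averaging_kernel(rbf, bag1, bag2))
```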
Question
  MIP target concept: an existential property
  Averaging kernel: an average property
  Do averaging kernels sidestep the limitations of Relational Learning?

Methodology, inspired from Phase Transition studies

Usual Phase Transition framework
  Generate data according to control parameters
  Observe the results
  Draw the phase diagram: results w.r.t. the order parameters

This study
  Generalized Multiple Instance Problems
  Experimental results of averaging-kernel-based propositionalization

Outline

1 Relational Kernels
    Theoretical failure region
    Lower bound on the generalization error
    Empirical failure region
2 Feature Selection

Generalized Multiple Instance Problems

Generalized MIP (Weidmann et al., 03)
  An example: a set of instances
  An instance: a vector of features
  Target concept: a conjunction of predicates $P_1, \ldots, P_m$
$$\mathrm{pos}(x) \iff \exists I_1, \ldots, I_m \in x,\; \bigwedge_{i=1}^{m} P_i(I_i)$$

Example of Generalized MIP (figure: two molecular graphs)
  A molecule: a set of sub-graphs
  Bioactivity: implies the presence of several sub-graphs

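A sketch of the generalized-MIP labeling rule; whether the witness instances $I_1, \ldots, I_m$ must be distinct is left open on the slide, and the sketch below lets them coincide.

```python
def generalized_mip_label(bag, predicates):
    """Generalized MIP label: positive iff every predicate P_i is satisfied by
    at least one instance of the bag."""
    return all(any(P(instance) for instance in bag) for P in predicates)

# Toy epsilon-ball predicates around two centers (hypothetical concept).
def ball(center, eps):
    return lambda z: sum((zi - ci) ** 2 for zi, ci in zip(z, center)) <= eps ** 2

predicates = [ball([0.2, 0.2], 0.1), ball([0.8, 0.8], 0.1)]
print(generalized_mip_label([[0.21, 0.19], [0.79, 0.82]], predicates))  # True
print(generalized_mip_label([[0.21, 0.19], [0.50, 0.50]], predicates))  # False
```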
Control Parameters

Instances
  |Σ|   size of the alphabet Σ (categorical part a ∈ Σ of an instance)
  d     number of numerical features; an instance is I = (a, z) with z ∈ [0, 1]^d

Examples
  M+    number of instances per positive example
  M-    number of instances per negative example
  m+    number of instances in a predicate, for a positive example
  m-    number of instances in a predicate, for a negative example
        number of predicates "missed" by each negative example

Concept
  P     number of predicates
  ε     radius of each predicate (ε-ball)

Limitation of averaging kernels

Theoretical analysis: failure when $\frac{m^+}{M^+} = \frac{m^-}{M^-}$, since then
$$\mathbb{E}_{x \sim D^+}\left[K(x_i, x)\right] = \mathbb{E}_{x \sim D^-}\left[K(x_i, x)\right]$$
(Figure: scatter plot of $K(x^+, x)$ versus $K(x^-, x)$ for positive and negative examples.)

Empirical approach
  Generate, test and average empirical results
  Establish a lower bound on the generalization error

Efficiency of kernel-based propositionalization

Kernel-based propositionalization into $\mathcal{H}'$ (differs from the RKHS framework)
$$\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}, \qquad \Phi_K : x \mapsto (K(x_1, x), \ldots, K(x_n, x))$$

Question (Q): separability of the test examples $\mathcal{T}$ in $\mathcal{H}'$
$$\exists\, \alpha \;?\quad
\begin{cases}
\sum_{i=1}^{n} \alpha_i y_i = 0 & \text{(SVM constraint)} \\
0 \le \alpha_i \le C, \; i = 1, \ldots, n & \text{(SVM constraint)} \\
y' \left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x') + b \right) \ge 1, \;\; \forall (x', y') \in \mathcal{T} & \text{(test constraint)}
\end{cases}$$

An optimistic criterion: the test examples themselves are used to define the $\alpha_i$.

Lower bound on the generalization error

Theorem. For each setting,
  Generate T training datasets L (and matching test datasets T)
  Record τ, the proportion of couples (L, T) such that (Q) is satisfiable
Let $\mathrm{Err}_{\mathcal{L}}$ be the generalization error when learning from $\mathcal{L}$. Then, for all $\eta > 0$, with probability at least $1 - \exp(-2\eta^2 T)$,
$$\mathbb{E}_{|\mathcal{L}| = n}\left[\mathrm{Err}_{\mathcal{L}}\right] \;\ge\; \frac{1}{|\mathcal{T}|}\,\bigl(1 - (\tau + \eta)\bigr)$$

Remark: (Q) is solved using Linear Programming.

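A minimal sketch of how the feasibility question (Q) can be checked with an off-the-shelf LP solver, as suggested by the remark above; the use of scipy and the handling of the bias b as an extra free variable are assumptions, not details taken from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def q_is_satisfiable(K_test_train, y_train, y_test, C=1.0):
    """Feasibility of (Q): does some (alpha, b) satisfy the SVM equality and box
    constraints while classifying every test example with margin at least 1?"""
    K_test_train = np.asarray(K_test_train, float)    # shape (n_test, n_train)
    y_train = np.asarray(y_train, float)
    y_test = np.asarray(y_test, float)
    n = len(y_train)
    c = np.zeros(n + 1)                               # variables: alpha_1..alpha_n, b
    A_eq = np.concatenate([y_train, [0.0]])[None, :]  # sum_i alpha_i y_i = 0
    # y'(sum_i alpha_i y_i K(x_i, x') + b) >= 1  rewritten as  -y'(...) <= -1
    A_ub = -(y_test[:, None] *
             np.hstack([K_test_train * y_train[None, :], np.ones((len(y_test), 1))]))
    res = linprog(c, A_ub=A_ub, b_ub=-np.ones(len(y_test)),
                  A_eq=A_eq, b_eq=[0.0],
                  bounds=[(0.0, C)] * n + [(None, None)])
    return res.status == 0                            # 0: feasible, 2: infeasible
```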
Empirical failure region

(Figure: phase diagrams over (r+, r-); left panel: (Q) satisfiability; right panel: SVM test error.)

Control parameters
  Instance space: Σ × [0, 1]^30
  100 instances per example
  30 predicates
  40 couples (L, T) per setting
  Small training dataset (|L| ≤ 100)

The averaging kernel fails when $\frac{m^+}{M^+} \approx \frac{m^-}{M^-}$.

Partial conclusion on Relational Kernels

Contributions
  Theoretical and empirical identification of the limitations of averaging kernels
  A lower bound on the generalization error

Perspectives
  Failure regions for other kernels
  Claim: any kernel computable in polynomial time leads to a failure region
  When is the failure region small enough?


Feature Selection

Optimization problem
$$\operatorname*{argmin}_{F \subseteq \mathcal{F}} \; \mathrm{Err}\left(\mathcal{A}(F, \mathcal{L})\right)$$
where $\mathcal{F}$ is the set of features, $F$ a feature subset, $\mathcal{L}$ the training data set, $\mathcal{A}$ the Machine Learning algorithm, and $\mathrm{Err}$ the generalization error.

Goals of Feature Selection (FS)
  Minimize the generalization error
  Decrease the learning/use cost of models
  Lead to more understandable models

Bottlenecks
  Combinatorial optimization problem: find $F \subseteq \mathcal{F}$
  Unknown objective function: the generalization error

Filter approaches for Feature Selection

Principle: score features, then select the best ones.

Pro: cheap.
Con: cannot handle all inter-dependencies between features.

Examples of filter approaches: ANOVA (Analysis of Variance); RELIEFF (Kira & Rendell, 92).

Embedded approaches for Feature Selection

Principle: exploit the learned hypothesis, and/or modify the learning criterion to induce sparsity.

Pro: based on the relevance of features in the learned model.
Cons: limited to linear models or linear combinations of kernels; possibly misled by feature interdependencies.

Examples of embedded approaches: Lasso (Tibshirani, 94); Multiple Kernel Learning (Bach, 08); Gini score on Random Forests (Rogers & Gunn, 05).

Wrapper approaches for Feature Selection

Principle: test feature subsets, actually addressing the combinatorial problem.

Pro: looks for the (approximately) best solution.
Con: computationally expensive.

Examples of wrapper approaches: look-ahead search (Margaritis, 09); mixed forward/backward search (Zhang, 08); mixed global/local search (Boullé, 07).

Proposed Feature Selection framework

(Figure: lattice of feature subsets over {f1, f2, f3}.)

Goal: optimality
  Find $\operatorname*{argmin}_{F \subseteq \mathcal{F}} \mathrm{Err}(\mathcal{A}(F, \mathcal{L}))$
  Virtually explore the whole lattice

Goal: tractability
  Frugal, unbiased assessment of $F$ ($\mathrm{Err}(\mathcal{A}(F, \mathcal{L}))$ cannot be computed)
  Gradually focus the search on the most promising subtrees
  Exploration vs. Exploitation trade-off

Outline

1 Relational Kernels
2 Feature Selection
    Feature Selection through Reinforcement Learning
    A one-player game with Monte-Carlo Tree Search
    The FUSE algorithm
    Experimental validation

Feature Selection as a Markov Decision Process

From the lattice of subsets ...
  (Figure: lattice of subsets of {f1, f2, f3}.)
  Set of features: $\mathcal{F}$
  Set of candidate features: $\mathcal{F}$

... to a Markov Decision Process
  Set of states: $\mathcal{S} = 2^{\mathcal{F}}$
  Initial state: $\emptyset$
  Set of actions: $\mathcal{A} = \{\text{add } f,\; f \in \mathcal{F}\}$
  Reward function: $V : \mathcal{S} \to [0, 1]$
    Ideally: $V(F) = \mathrm{Err}(\mathcal{A}(F, \mathcal{L}))$
    In practice: a fast, unbiased estimate

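A minimal data-structure sketch of this MDP view of feature selection (names and typing are illustrative, not taken from the FUSE implementation):

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set

@dataclass
class FeatureSelectionMDP:
    """States are feature subsets, actions add one feature, and the reward of a
    final state is a fast estimate of the generalization error of A(F, L)."""
    all_features: Set[str]
    reward: Callable[[FrozenSet[str]], float]

    def initial_state(self) -> FrozenSet[str]:
        return frozenset()

    def actions(self, state: FrozenSet[str]) -> Set[str]:
        return self.all_features - state

    def step(self, state: FrozenSet[str], feature: str) -> FrozenSet[str]:
        return state | {feature}
```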
Optimal Policy

Policy: $\pi : \mathcal{S} \to \mathcal{A}$
Final state reached by following a policy: $F_\pi$
Optimal policy: $\pi^* = \operatorname*{argmin}_{\pi} \mathrm{Err}\left(\mathcal{A}(F_\pi, \mathcal{L})\right)$

Bellman's optimality principle
$$\pi^*(F) = \operatorname*{argmin}_{f \in \mathcal{F}} V^*(F \cup \{f\})$$
with
$$V^*(F) = \begin{cases} \mathrm{Err}\left(\mathcal{A}(F, \mathcal{L})\right) & \text{if } \mathrm{final}(F) \\ \min_{f \in \mathcal{F}} V^*(F \cup \{f\}) & \text{otherwise} \end{cases}$$

$\pi^*$ is intractable =⇒ approximation using a one-player game approach.


The UCT Monte-Carlo Tree Search (Kocsis & Szepesvári, 06)

Gradually grow a search tree.

Building blocks
  Select the next action (bandit-based phase)
  Add a node (leaf of the search tree)
  Monte-Carlo exploration (random phase)
  Compute the instant reward
  Update the visited nodes

Returned solution: the path visited most often.

(Figure: a tree-walk inside the explored tree, showing the bandit-based phase, the new node and the random phase.)

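The following skeleton sketches the iteration structure described above (bandit-based descent, node creation, random rollout, reward, update). It uses plain UCB1 rather than UCB1-tuned for brevity, so it illustrates the control flow only and is not the exact FUSE/UCT implementation.

```python
import math, random
from collections import defaultdict

def uct_search(initial_state, actions, step, is_final, reward, n_iter=1000, c_e=2.0):
    """Minimal UCT skeleton: grow one node per iteration, finish each tree-walk
    with a random rollout, and back the reward up along the followed path."""
    visits, value, children = defaultdict(int), defaultdict(float), {}

    def ucb(state, a):
        s, sa = visits[state], visits[(state, a)]
        if sa == 0:
            return float("inf")
        return value[(state, a)] / sa + math.sqrt(c_e * math.log(s) / sa)

    for _ in range(n_iter):
        state, path = initial_state, []
        while state in children and not is_final(state):          # bandit-based phase
            a = max(children[state], key=lambda a: ucb(state, a))
            path.append((state, a))
            state = step(state, a)
        if not is_final(state):                                    # add one node
            children[state] = list(actions(state))
        while not is_final(state):                                 # random phase
            state = step(state, random.choice(list(actions(state))))
        r = reward(state)                                          # instant reward
        for s, a in path:                                          # update visited nodes
            visits[s] += 1
            visits[(s, a)] += 1
            value[(s, a)] += r
    return visits, value
```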
Multi-Armed Bandit-based phase: Upper Confidence Bound (UCB1-tuned) (Auer et al., 02)

Exploration vs. Exploitation trade-off: select
$$\operatorname*{argmax}_{a \in \mathcal{A}} \; \hat{\mu}_a + \sqrt{\frac{c_e \log(T)}{t_a} \, \min\left(\frac{1}{4}, \; \hat{\sigma}_a^2 + \sqrt{\frac{2\, c_e \log(T)}{t_a}}\right)}$$

  $\hat{\mu}_a$: empirical average reward of action $a$
  $\hat{\sigma}_a^2$: empirical variance of the reward of action $a$
  $T$: total number of trials in the current node
  $t_a$: number of trials of action $a$
  $c_e$: exploration parameter

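A direct transcription of the UCB1-tuned selection rule above; the statistics dictionary and the example numbers are hypothetical.

```python
import math

def ucb1_tuned_select(stats, T, c_e=1.0):
    """Pick the action maximizing the UCB1-tuned score of the bandit-based
    phase.  `stats` maps each action to (t_a, mean_a, var_a)."""
    def score(item):
        a, (t_a, mean_a, var_a) = item
        if t_a == 0:
            return float("inf")                      # try every action at least once
        bonus = math.sqrt((c_e * math.log(T) / t_a) *
                          min(0.25, var_a + math.sqrt(2 * c_e * math.log(T) / t_a)))
        return mean_a + bonus
    return max(stats.items(), key=score)[0]

# Hypothetical statistics: action -> (trials, empirical mean, empirical variance)
stats = {"add_f1": (10, 0.62, 0.04), "add_f2": (3, 0.55, 0.10), "add_f3": (0, 0.0, 0.0)}
print(ucb1_tuned_select(stats, T=13))   # "add_f3" (untried actions get priority)
```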
Multi-Armed Bandit-based phase: external information

Mixing UCB with
  Priors on actions (Rolet et al., 09)
  Information learned during the iterations (Gelly & Silver, 07; Auer, 02; Filippi et al., 10)


FUSE: bandit-based phase (a many-armed bandit problem)

Bottleneck: UCT degenerates to pure exploration as the number of arms increases (several hundred features)
=⇒ Control the number of arms, and select which arms to consider.

How to control the number of arms?
  Continuous heuristics (Gelly & Silver, 07): use a small exploration constant $c_e$ ($10^{-2}$, $10^{-4}$)
  Discrete heuristics (Coulom, 06; Rolet et al., 09): Progressive Widening, i.e. consider a new action whenever $\lfloor T^b \rfloor$ increases ($b = 1/2$ in the experiments)

(Figure: number of considered actions vs. number of iterations.)

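A sketch of the discrete (Progressive Widening) heuristic: the number of arms considered in a node is capped by $\lfloor T^b \rfloor$, and new arms are introduced following some ranking of the remaining candidates (RAVE-based in FUSE; the ranking argument below is a placeholder).

```python
import math

def allowed_arm_count(total_visits, b=0.5):
    """Progressive widening: the number of arms considered in a node grows as
    floor(T^b) with the number T of visits of that node."""
    return max(1, math.floor(total_visits ** b))

def maybe_add_arm(node_arms, candidate_ranking, total_visits, b=0.5):
    """Add the best not-yet-considered candidate whenever floor(T^b)
    exceeds the number of arms already considered."""
    if len(node_arms) < allowed_arm_count(total_visits, b):
        for f in candidate_ranking:
            if f not in node_arms:
                node_arms.append(f)
                break
    return node_arms

arms = ["f3"]
print(maybe_add_arm(arms, ["f3", "f7", "f1"], total_visits=9))  # ['f3', 'f7'] since floor(9**0.5) = 3 > 1
```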
FUSE: bandit-based phase (sharing information among nodes)

How to share information among nodes?
  Rapid Action Value Estimation (RAVE) (Gelly & Silver, 07)
  RAVE(f) = average reward of the explored feature subsets containing f

(Figure: search tree; the empirical reward $\hat{\mu}$, the local $\ell$-RAVE and the global $g$-RAVE scores of a feature $f$ are averaged over the subtrees and iterations involving $f$.)

Guiding the search with RAVE

Continuous heuristics: generalize the empirical reward into
$$(1 - \alpha) \cdot \hat{\mu}_{F,f} + \alpha \left( (1 - \beta) \cdot \ell\text{-RAVE}(F, f) + \beta \cdot g\text{-RAVE}(f) \right) + \text{exploration term}$$
with
$$\alpha = \frac{c}{c + t_{F,f}}, \qquad \beta = \frac{c'}{c' + t'_{F,f}}$$
where $t_{F,f}$ is the number of trials of feature $f$ from state $F$, $t'_{F,f}$ the number of trials of feature $f$ after visiting state $F$, and $c, c'$ are parameters.

Discrete heuristics: new arm $= \operatorname*{argmax}_{f \in \mathcal{F}} \; (1 - \beta) \cdot \ell\text{-RAVE}(F, f) + \beta \cdot g\text{-RAVE}(f)$.

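A sketch of the continuous heuristic above; the default parameter values c and c' are placeholders, and the exploration term is left as an argument.

```python
def blended_score(mu, t, l_rave, t_prime, g_rave, c=100.0, c_prime=100.0, exploration=0.0):
    """Continuous RAVE heuristic of the bandit-based phase: interpolate the
    node-local empirical reward with the local and global RAVE scores."""
    alpha = c / (c + t)                    # weight of the RAVE part, decreasing with t
    beta = c_prime / (c_prime + t_prime)   # weight of g-RAVE inside the RAVE part
    rave = (1 - beta) * l_rave + beta * g_rave
    return (1 - alpha) * mu + alpha * rave + exploration

# Hypothetical values for one (state, feature) pair.
print(blended_score(mu=0.70, t=12, l_rave=0.64, t_prime=40, g_rave=0.58))
```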

FUSE: random phase (dealing with an unknown horizon)

Bottleneck: the horizon is finite but unknown (= the number of relevant features).

Random phase policy
  With probability $1 - q^{|F|}$: stop
  Else: add a uniformly selected feature, set $|F| = |F| + 1$, and iterate

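A sketch of the random-phase rollout with the geometric stopping rule above; the value of q and the toy feature names are illustrative.

```python
import random

def random_phase(current_features, all_features, q=0.9, rng=random):
    """FUSE random phase: starting from the current subset, repeatedly stop with
    probability 1 - q**|F|, otherwise add one uniformly drawn unused feature."""
    F = set(current_features)
    while True:
        if rng.random() > q ** len(F):          # stop with probability 1 - q^|F|
            return F
        remaining = list(set(all_features) - F)
        if not remaining:                       # nothing left to add
            return F
        F.add(rng.choice(remaining))

print(random_phase({"f1"}, {"f1", "f2", "f3", "f4", "f5"}, q=0.8))
```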
FUSE: reward(F)

A generalization error estimate. Requisites:
  fast (it is computed about $10^4$ times)
  unbiased

Proposed reward
  k-nearest neighbours: strong consistency results (Cover & Hart, 67)
  + AUC criterion, computed on a sub-sample: the fraction of (positive, negative) pairs that the k-NN score orders correctly

Complexity: $\tilde{O}(mnd)$, with $d$ the number of selected features, $n$ the size of the training set, and $m$ the size of the sub-sample ($m \ll n$).

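A sketch of a k-NN + AUC reward of this kind, computed on a random sub-sample so that the cost stays in O(mnd); the sub-sampling scheme, the positive-label convention and the tie handling are assumptions, since the exact formula is truncated on the slide.

```python
import numpy as np

def knn_auc_reward(X, y, features, k=5, m=50, rng=np.random.default_rng(0)):
    """Score each example of a random sub-sample by the fraction of positives
    among its k nearest neighbours (restricted to the selected features), then
    return the AUC of that score (fraction of correctly ordered pos/neg pairs)."""
    X, y = np.asarray(X, float)[:, list(features)], np.asarray(y)
    idx = rng.choice(len(y), size=min(m, len(y)), replace=False)
    scores = []
    for i in idx:
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                                   # exclude the point itself
        neighbours = np.argsort(d)[:k]
        scores.append(np.mean(y[neighbours] == 1))      # k-NN positive rate
    scores, labels = np.array(scores), y[idx]
    pos, neg = scores[labels == 1], scores[labels != 1]
    if len(pos) == 0 or len(neg) == 0:
        return 0.5
    return float((pos[:, None] > neg[None, :]).mean()
                 + 0.5 * (pos[:, None] == neg[None, :]).mean())
```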
FUSE: update

UCT actually explores a graph: several paths can lead to the same node. Only the followed path is updated.

(Figure: search tree with bandit-based phase, new node and random phase; only the nodes on the followed path are updated.)

From UCT to Feature Selection to Learning

Algorithm: N iterations; each iteration (1) follows a path and (2) evaluates a final node.

Output
  Search tree =⇒ FUSE (wrapper approach): return the most visited path
  RAVE score =⇒ FUSE_R (filter approach): use the RAVE score to rank features

End learner: any Machine Learning algorithm (a Support Vector Machine with Gaussian kernel in the experiments).


Experimental setting

Questions
  FUSE vs. FUSE_R
  Continuous vs. discrete exploration heuristics
  FS performance w.r.t. the complexity of the target concept
  Convergence speed

Datasets from the NIPS'03 Feature Selection challenge

  Dataset   Samples   Features   Properties
  MADELON     2,600        500   XOR-like
  ARCENE        200     10,000   redundant features
  COLON          62      2,000   "easy"

Baselines
  CFS (Correlation-based Feature Selection) (Hall, 00)
  Random Forest (Rogers & Gunn, 05)
  Lasso (Tibshirani, 94)
  RAND_R: RAVE obtained by selecting 20 random features at each iteration

Protocol
  Results averaged over 50 splits (10 x 5-fold cross-validation)
  Gaussian SVM; hyper-parameters optimized by 5-fold cross-validation

Results on Madelon after 200,000 iterations

(Figure: test error vs. number of used top-ranked features, for D-FUSE_R, C-FUSE_R, CFS, Random Forest, Lasso and RAND_R.)

Comment: FUSE_R gets the best of both worlds
  Removes redundancy (like CFS)
  Keeps conditionally relevant features (like Random Forest)

Results on Arcene after 200,000 iterations

(Figure: test error vs. number of used top-ranked features, for D-FUSE_R, C-FUSE_R, CFS, Random Forest, Lasso and RAND_R.)

Comment: FUSE_R gets the best of both worlds
  Removes redundancy (like CFS)
  Keeps conditionally relevant features (like Random Forest)

T-test "CFS vs. FUSE_R" with 100 features: p-value = 0.036

Results on Colon after 200,000 iterations

(Figure: test error vs. number of used top-ranked features, for D-FUSE_R, C-FUSE_R, CFS, Random Forest, Lasso and RAND_R.)

Comment: all methods are equivalent on this dataset.

NIPS 2003 Feature Selection challenge

Test error on the NIPS 2003 Feature Selection challenge (on a disjoint test set)

  Database   Algorithm            Challenge error   Submitted features   Irrelevant features
  MADELON    FSPP2 [1]            6.22%  (1st)                   12                     0
             D-FUSE_R             6.50%  (24th)                  18                     0
  ARCENE     BAYES-NN-RED [2]     7.20%  (1st)                  100                     0
             D-FUSE_R (on all)    8.42%  (3rd)                  500                    34
             D-FUSE_R             9.42%  (8th)                  500                     0

Comment: accurate w.r.t. Feature Selection.

[1] K. Q. Shen, C. J. Ong, X. P. Li, E. P. V. Wilder-Smith. Feature selection via sensitivity analysis of SVM probabilistic outputs. Machine Learning, 2008.
[2] R. M. Neal and J. Zhang. High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees. In Feature Extraction, Foundations and Applications, Springer, 2006.

Partial conclusion on Feature Selection

Contributions
  Formalization of Feature Selection as a Markov Decision Process
  Efficient approximation of the optimal policy (based on UCT) =⇒ an any-time algorithm
  Experimental results: state of the art, but high computational cost (45 minutes on Madelon)

Perspectives
  Proof of convergence (including the heuristics) (Berthier et al., 10)
  Include other improvements from the Reinforcement Learning community
    Function approximation of the Q-value (Melo et al., 08; Auer, 02)
    Biased random phase (Rimmel & Teytaud, 10)

Conclusion

Focus on combinatorial optimization problems hidden in Machine Learning
  Relational Learning: theoretical and empirical limitations of averaging kernels
  Feature Selection: exploration of the feature lattice using a Monte-Carlo tree search approach
    =⇒ refining wrapper approaches using a frugal assessment of candidate subsets

Perspective 1: Constructive Induction

Context: Relational Learning / Inductive Logic Programming
Goal: find a relevant set of primitives / queries =⇒ a combinatorial optimization problem

Proposed approach: extending FUSE to grammar-structured search spaces (de Mesmay et al., 09)

Motivating application: Customer Relationship Management

Perspective 2: Feature/Example Selection

(Figure: lattice of feature subsets over {f1, f2, f3}.)

Context
  FUSE: Feature Selection based on UCT
  BAAL: Active Learning based on UCT (Rolet et al., 09)
  =⇒ Can we mix both approaches?

Goal
  Local Feature Selection
  Local Distance Metric Learning (Weinberger & Saul, 09)

Bibliography (1/3)

P. Auer. Using confidence bounds for exploitation-exploration trade-offs. JMLR, 2002.
P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 2002.
F. Bach. Exploring large feature spaces with hierarchical Multiple Kernel Learning. NIPS, 2008.
V. Berthier, H. Doghmen, and O. Teytaud. Consistency modifications for automatically tuned Monte-Carlo Tree Search. CAP, 2010.
M. Botta, A. Giordana, L. Saitta, and M. Sebag. Relational learning as search in a critical region. JMLR, 2003.
L. Bottou and O. Bousquet. The tradeoffs of large scale learning. NIPS, 2008.
M. Boullé. Compression-based averaging of selective naive Bayes classifiers. JMLR, 2007.
L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and Regression Trees. Taylor & Francis, 1984.
P. Cheeseman, B. Kanefsky, and W. M. Taylor. Where the really hard problems are. IJCAI, 1991.
R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. Computers and Games, 2006.
T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 1967.
R. Féraud, M. Boullé, F. Clérot, F. Fessant, and V. Lemaire. The Orange Customer Analysis Platform. ICDM, 2010.

Bibliography (2/3)

S. Filippi, O. Cappé, A. Garivier, and C. Szepesvári. Parametric bandits: the generalized linear case. NIPS, 2010.
S. Gelly and D. Silver. Combining online and offline knowledge in UCT. ICML, 2007.
A. Giordana and L. Saitta. Phase transitions in relational learning. Machine Learning, 2000.
M. A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. ICML, 2000.
K. Kira and L. A. Rendell. A practical approach to feature selection. ML, 1992.
L. Kocsis and C. Szepesvári. Bandit based Monte-Carlo planning. ECML, 2006.
D. Margaritis. Toward provably correct feature selection in arbitrary domains. NIPS, 2009.
F. Melo, S. Meyn, and I. Ribeiro. An analysis of reinforcement learning with function approximation. ICML, 2008.
F. de Mesmay, A. Rimmel, Y. Voronenko, and M. Püschel. Bandit-based optimization on graphs with application to library performance tuning. ICML, 2009.
R. M. Neal and J. Zhang. High dimensional classification with Bayesian neural networks and Dirichlet diffusion trees. In Feature Extraction, Foundations and Applications, Springer, 2006.
J. Rogers and S. R. Gunn. Identifying feature relevance using a Random Forest. SLSFS, 2005.
P. Rolet, M. Sebag, and O. Teytaud. Boosting active learning to optimality: a tractable Monte-Carlo, billiard-based algorithm. ECML, 2009.
K. Q. Shen, C. J. Ong, X. P. Li, and E. P. V. Wilder-Smith. Feature selection via sensitivity analysis of SVM probabilistic outputs. Machine Learning, 2008.

Bibliography (3/3)

R. Tibshirani. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, 1994.
K. Q. Weinberger and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. JMLR, 2009.
T. Zhang. Adaptive forward-backward greedy algorithm for sparse learning with linear models. NIPS, 2008.

Appendix (In a slide): Relational Kernels

Position
  Relational data: $\mathcal{X}$ = keys in a database
  Bottleneck: $\mathcal{H}$ is a set of logical formulas, a hypothesis $h$ is a logical formula

Support Vector Machine: the solution?
  A propositionalization using only the relations between examples =⇒ the argmin is easy to solve

Contribution
  On Multiple-Instance datasets, averaging kernels miss some concepts
  Theoretical/empirical identification of the failure region + a lower bound on the generalization error

Appendix (In a slide): Feature Selection

Position
  Thousands of features (and only an estimate of $\mathrm{Err}(\mathcal{A}(h, \mathcal{L}))$ is available)
  =⇒ Overfitting: small error on the training data, large generalization error

Solution: Feature Selection
$$\operatorname*{argmin}_{F \subseteq \mathcal{F}} \; \mathrm{Err}\left(\mathcal{A}(F, \mathcal{L})\right)$$
with $\mathcal{F}$ the set of features, $F$ a feature subset, $\mathcal{L}$ the training data set, and $\mathcal{A}$ the ML algorithm.

Bottlenecks
  Combinatorial optimization problem: find $F \subseteq \mathcal{F}$
  Unknown objective function: the generalization error

Contribution
  Actually handle the combinatorial optimization
  Use a Monte-Carlo Tree Search algorithm: UCT (Kocsis & Szepesvári, 06)

Appendix (TP on MIP): Relational Kernels, position

Position
  Relational data / Multiple Instance data: $\mathcal{X}$ = keys in a database
  $\mathcal{H}$: a set of logical formulas =⇒ an expressive language
  Value of a hypothesis on an example: NP-hard
  Number of hypotheses to test: exponential

Support Vector Machine (SVM)
  Only based on relations between examples (averaging kernel)
  =⇒ Value of a hypothesis on one example: linear in the number of examples
  Best hypothesis search ≡ a convex problem

Question: does the SVM have the same expressiveness as logical formulas?

Appendix (TP on MIP): Relational Kernels, contribution

A Phase Transition-based study (figure: failure region in the (r+, r-) plane)
  Identify the order parameters
  Generate artificial problems
  Identify the difficult region

Contribution
  On Multiple-Instance data, averaging kernels miss some concepts
  Theoretical demonstration of the relational-kernel failure region
  A new criterion leading to a lower bound on the generalization error
  Empirical visualization of the failure region

Discussion: what about other kernels?

Appendix (FUSE): Stopping feature, dealing with an unknown horizon

Any state can be final or not: $\mathrm{Final}(F)$ = "$f_s \in F$", where $f_s$ is a virtual stopping feature.

(Figure: lattice over {f1, f2, f3} extended with the stopping feature $f_s$.)

RAVE score of the stopping feature
$$g\text{-RAVE}^{(d)}(f_s) = \mathrm{average}\left\{\, V(F_t) \;:\; |F_t| = d + 1 \,\right\}$$
where $V(F_t)$ is the reward of the feature subset $F_t$ selected at iteration $t$, and $d$ is set to the number of features in the current state whenever RAVE($f_s$) is used.

Appendix (FUSE): Sensitivity of FUSE to the computational effort (Madelon)

(Figure, left: test error vs. number of iterations for D-FUSE, C-FUSE, D-FUSE_R, C-FUSE_R and RAND_R. Figure, right: number of features chosen by FUSE vs. number of iterations.)

Comments
  FUSE alone: not enough features selected
  FUSE_R: 10 times faster than RAND_R

Appendix (FUSE): FUSE hyperparameters

Explored values (how to restrict exploration: discrete vs. continuous heuristics)
  k-NN reward: 5-NN
  q (random-phase stopping): 1 - 10^i, i in {-1, -3, -5}
  b (progressive widening): 1/2
  c_e (exploration constant): 10^i, i in {-4, -2, 0, 2}
  (c, c') (RAVE weights): 10^i, i in {-infinity, 2, 4}

Best values (discrete heuristic / continuous heuristic)
  ARCENE: q = 1 - 10^-1, b = 1/2 / q = 1 - 10^-1, c_e = 10^-2, any (c, c'); also q = 1 - 10^-3, c_e = 10^-4, almost any (c, c')
  MADELON: q = 1 - 10^-3, b = 1/2 / q = 1 - 10^-1, c_e = 10^-2, (c, c') in {(10^2, 0), (10^4, 0)}
  COLON: q = 1 - 10^-5, b = 1/2 / q = 1 - 10^-5, any c_e, almost any (c, c')
