Inductive and Statistical Learning of Formal Grammars
Pierre Dupont
[email protected]
2002

Outline

• Grammar induction definition
• Learning paradigms
• DFA learning from positive and negative examples
• RPNI algorithm
• Probabilistic DFA learning
• Application to a natural language task
• Links with Markov models
• Smoothing issues
• Related problems and future work

Machine Learning

Goal: to give the learning ability to a machine
Design programs whose performance improves over time

Inductive learning is a particular instance of machine learning
• Goal: to find a general law from examples
• Subproblem of theoretical computer science, artificial intelligence or pattern recognition

Grammar Induction or Grammatical Inference

Grammar induction is a particular case of inductive learning
The general law is represented by a formal grammar or an equivalent machine
The set of examples, known as the positive sample, is usually made of strings or sequences over a specific alphabet
A negative sample, i.e. a set of strings not belonging to the target language, can sometimes help the induction process

[Figure: Data → Induction → Grammar; e.g. from the strings aaabbb and ab the induced grammar is S → aSb, S → λ]

Examples

• Natural language sentence
• Speech
• Chronological series
• Successive actions of a WEB user
• Successive moves during a chess game
• A musical piece
• A program
• A form characterized by a chain code
• A biological sequence (DNA, proteins, ...)

Pattern Recognition

[Figure: two contours ('3.4cont', '3.8cont') and the 8 chain-code directions; a contour is encoded as the string of primitives 8dC: 000077766676666555545444443211000710112344543311001234454311]

Chromosome classification

[Figure: grey density profile of chromosome 2a and its derivative along the median axis, with the centromere marked; the profile is encoded as the string of primitives "=====CDFDCBBBBBBBA==bcdc==DGFB=bccb== ...... ==cffc=CCC==cdb==BCB==dfdcb====="]

A modeling hypothesis

[Figure: G0 → Data Generation → Data → Grammar Induction → G]

• Find G as close as possible to G0
• The induction process does not prove the existence of G0
  It is a modeling hypothesis

Learning paradigms

How to characterize learning?
• which concept classes can or cannot be learned?
• what is a good example?
• is it possible to learn in polynomial time?

Identification in the limit

[Figure: G0 → Data Generation → d1, d2, ..., dn → Grammar Induction → G1, G2, ..., G*]

• convergence in finite time to G*
• G* is a representation of L(G0) (exact learning)
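To make the identification-in-the-limit protocol concrete, here is a small illustrative sketch in Python (not part of the original slides): a Gold-style learner that works by enumeration over a made-up finite class of candidate languages and, after each new labeled example, returns the first candidate consistent with everything seen so far. Convergence means the guess eventually stops changing; the target below is the language a^n b^n generated by the grammar S → aSb, S → λ of the earlier example.

# Illustrative sketch only: identification in the limit by enumeration.
# The candidate class and the presentation of examples are invented for this demo.
CANDIDATES = [
    ("a*",      lambda w: set(w) <= {"a"}),
    ("b*",      lambda w: set(w) <= {"b"}),
    ("a^n b^n", lambda w: w == "a" * (len(w) // 2) + "b" * (len(w) // 2)),
    ("(a|b)*",  lambda w: set(w) <= {"a", "b"}),
]

def learner(sample):
    # Return the first candidate consistent with all labeled strings seen so far
    for name, member in CANDIDATES:
        if all(member(w) == label for w, label in sample):
            return name
    return None  # the target language lies outside the candidate class

presentation = [("", True), ("ab", True), ("a", False), ("aabb", True), ("abab", False)]
seen = []
for example in presentation:
    seen.append(example)
    print(example, "->", learner(seen))
# The guess changes from "a*" to "a^n b^n" after the second example and then never
# changes again: stabilizing on a correct hypothesis is identification in the limit.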
PAC Learning

[Figure: G0 → Data Generation → d1, d2, ..., dn → Grammar Induction → G1, G2, ..., G*]

• convergence to G*
• G* is close enough to G0 with high probability
  ⇒ Probably Approximately Correct learning
• polynomial time complexity

Define a probability distribution D on a set of strings Σ≤n

[Figure: L(G0) and L(G*) as subsets of Σ*; the error is the weight of their symmetric difference]

P[ P_D(L(G*) ⊕ L(G0)) < ε ] > 1 − δ

where P_D is the probability under D and ⊕ denotes the symmetric difference

The same unknown distribution D is used to generate the sample and to measure the error
The result must hold for any distribution D (distribution free requirement)
The algorithm must return an hypothesis in polynomial time with respect to 1/ε, 1/δ, n, |R(L)|

Identification in the limit: good and bad news

The bad one...
Theorem 1. No superfinite class of languages is identifiable in the limit from positive data only

The good one...
Theorem 2. Any admissible class of languages is identifiable in the limit from positive and negative data

Other learnability results

• Identification in the limit in polynomial time
  – DFAs cannot be efficiently identified in the limit
  – unless we can ask equivalence and membership queries to an oracle
• PAC learning
  – DFAs are not PAC learnable (under some cryptographic limitation assumption)
  – unless we can ask membership queries to an oracle
• PAC learning with simple examples, i.e. examples drawn according to the conditional Solomonoff-Levin distribution
  Pc(x) = λc 2^(−K(x|c))
  where K(x|c) denotes the Kolmogorov complexity of x given a representation c of the concept to be learned
  – regular languages are PACS learnable with positive examples only
  – but Kolmogorov complexity is not computable!

Cognitive relevance of learning paradigms

A largely unsolved question

Learning paradigms seem irrelevant to model human learning:
• Gold's identification in the limit framework has been criticized as children seem to learn natural language without negative examples
• All learning models assume a known representation class
• Some learnability results are based on enumeration

However learning models show that:
• an oracle can help
• some examples are useless, others are good: characteristic samples ⇔ typical examples
• learning well is learning efficiently
• example frequency matters
• good examples are simple examples ⇔ cognitive economy

Regular Inference from Positive and Negative Data

Additional hypothesis: the underlying theory is a regular grammar or, equivalently, a finite state automaton

Property 1. Any regular language has a canonical automaton A(L) which is deterministic and minimal (minimal DFA)

Example: L = (ba*a)*
[Figure: the canonical automaton A(L), a minimal DFA with states 0, 1, 2]

A few definitions

Definition 1. A positive sample S+ is structurally complete with respect to an automaton A if, when generating S+ from A:
• every transition of A is used at least once
• every final state is used as accepting state of at least one string

Example: the sample {ba, baa, baba, λ} is structurally complete with respect to the canonical automaton of (ba*a)* above
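As a small illustration of Definition 1 (a sketch, not part of the original slides), the Python fragment below checks structural completeness of a positive sample with respect to a DFA given as a transition table. The table is a reconstruction of the (ba*a)* automaton sketched on the slide: states 0, 1, 2, state 0 initial, states 0 and 2 accepting; treat the exact transitions as an assumption, since the figure did not survive extraction.

# Sketch: check structural completeness of a positive sample w.r.t. a DFA.
# Assumed reconstruction of the canonical automaton of (ba*a)*:
delta = {(0, "b"): 1, (1, "a"): 2, (2, "a"): 2, (2, "b"): 1}
initial, accepting = 0, {0, 2}

def structurally_complete(sample, delta, initial, accepting):
    used_transitions, used_finals = set(), set()
    for word in sample:                       # assumes every string of the sample is generated by the DFA
        state = initial
        for symbol in word:
            used_transitions.add((state, symbol))
            state = delta[(state, symbol)]
        used_finals.add(state)                # state in which the string is accepted
    # every transition used at least once, and every accepting state accepts some string
    return used_transitions == set(delta) and accepting <= used_finals

print(structurally_complete(["ba", "baa", "baba", ""], delta, initial, accepting))  # True
print(structurally_complete(["ba", "baa"], delta, initial, accepting))  # False: (2,'b') unused, state 0 never used as accepting state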
Merging is fun

[Figure: an automaton A1 with states 0, 1, 2, and the quotient automata A1/π obtained by merging states, e.g. for the partitions {{0,1},{2}}, {{0},{1,2}}, {{0,2},{1}} and {{0,1,2}}]

• Merging ⇔ definition of a partition π on the set of states
  Example: {{0,1}, {2}}
• If A2 = A1/π then L(A1) ⊆ L(A2): merging states ⇔ generalizing the language

A theorem

The positive data can be represented by a prefix tree acceptor (PTA)

Example: {aa, abba, baa}
[Figure: the PTA of {aa, abba, baa}, with states numbered 0 to 8]

Theorem 3. If the positive sample is structurally complete with respect to a canonical automaton A(L0) then there exists a partition π of the state set of the PTA such that PTA/π = A(L0)

How are we going to find the right partition? Use negative data!

Summary

[Figure: A(L0) → Data Generation → PTA → which partition π?]

We observe some positive and negative data
The positive sample S+ comes from a regular language L0
The positive sample is assumed to be structurally complete with respect to the canonical automaton A(L0) of the target language L0
(Not an additional hypothesis but a way to restrict the search to reasonable generalizations!)

We build the Prefix Tree Acceptor of S+. By construction L(PTA) = S+
Merging states ⇔ generalizing S+
The negative sample S− helps to control over-generalization

Note: finding the minimal DFA consistent with S+, S− is NP-complete!

An automaton induction algorithm

RPNI is a particular instance of the "generalization as search" paradigm

Algorithm Automaton Induction
input
  S+  // positive sample
  S−  // negative sample
  A ← PTA(S+)                         // PTA
  while (i, j) ← choose_states() do   // Choose a state pair
    if compatible(i, j, S−) then      // Check for compatibility of merging i and j
      A ← A/πij
    end if
  end while
  return A

RPNI algorithm

RPNI follows the prefix order in the PTA

[Figure: the PTA of {aa, abba, baa} again, with states numbered 0 to 8 in prefix order]

Polynomial time complexity with respect to the sample size (S+, S−)
RPNI identifies in the limit the class of regular languages
A characteristic sample, i.e. a sample such that RPNI is guaranteed to produce the correct solution, has a quadratic size with respect to |A(L0)|
Additional heuristics exist to improve performance when such a sample is not provided

Search space characterization

Conditions on the learning sample to guarantee the existence of a solution
DFA and NFA in the lattice

RPNI algorithm: pseudo-code

input S+, S−
output A DFA consistent with S+, S−
begin
  A ← PTA(S+)   // N denotes the number of states of PTA(S+)
  π ← {{0}, {1},..., {N