Inductive and statistical learning of formal grammars

Pierre Dupont
[email protected]

Goal: to give the learning ability to a machine

Design programs the performance of which improves over time

Inductive learning is a particular instance of machine learning

• Goal: to find a general law from examples

• Subproblem of theoretical computer science, artificial intelligence or pattern recognition


Outline

• Grammar induction definition
• Learning paradigms
• DFA learning from positive and negative examples
• RPNI algorithm
• Probabilistic DFA learning
• Application to a natural language task
• Links with Markov models
• Smoothing issues
• Related problems and future work

Grammar Induction or Grammatical Inference

Grammar induction is a particular case of inductive learning

The general law is represented by a grammar or an equivalent machine

The set of examples, known as the positive sample, is usually made of strings or sequences over a specific alphabet

A negative sample, i.e. a set of strings not belonging to the target language, can sometimes help the induction process

Example: Data {aaabbb, ab} → Induction → Grammar: S → aSb, S → λ

Examples

• Natural language sentence
• Speech
• Chronological series
• Successive actions of a WEB user
• Successive moves during a chess game
• A musical piece
• A program
• A form characterized by a chain code
• A biological sequence (DNA, proteins, ...)

Chromosome classification

[Figure: grey density and grey density derivative profiles of chromosome 2a along its median axis, encoded as a string of primitives]

"=====CDFDCBBBBBBBA==bcdc==DGFB=bccb== ...... ==cffc=CCC==cdb==BCB==dfdcb====="

Pattern Recognition

[Figure: a form traced on a grid and its 8-direction chain code (8dC)]

8dC: 000077766676666555545444443211000710112344543311001234454311

A modeling hypothesis

G0 → Data Generation → Data → Grammar Induction → G

• Find G as close as possible to G0
• The induction process does not prove the existence of G0: it is a modeling hypothesis

Learning paradigms

How to characterize learning?

• which concept classes can or cannot be learned?
• what is a good example?
• is it possible to learn in polynomial time?

Identification in the limit

G0 → Data Generation → d1, d2, ..., dn → Grammar Induction → G1, G2, ..., G*

• convergence in finite time to G*
• G* is a representation of L(G0) (exact learning)

PAC Learning

G0 → Data Generation → d1, d2, ..., dn → Grammar Induction → G1, G2, ..., G*

• convergence to G*
• G* is close enough to G0 with high probability
  ⇒ Probably Approximately Correct learning
• polynomial time complexity

Define a probability distribution D on a set of strings Σ≤n

[Figure: L(G0) and L(G*) as subsets of Σ*; the error is measured on their symmetric difference]

P[ PD( L(G*) ⊕ L(G0) ) < ε ] > 1 − δ

The same unknown distribution D is used to generate the sample and to measure the error

The result must hold for any distribution D (distribution free requirement)

The algorithm must return a hypothesis in polynomial time with respect to 1/ε, 1/δ, n, |R(L)|

Other learnability results

• Identification in the limit in polynomial time
  – DFAs cannot be efficiently identified in the limit
  – unless we can ask equivalence and membership queries to an oracle

• PAC learning
  – DFAs are not PAC learnable (under some cryptographic limitation assumption)
  – unless we can ask membership queries to an oracle


Identification in the limit: good and bad news

The bad one...
Theorem 1. No superfinite class of languages is identifiable in the limit from positive data only

The good one...
Theorem 2. Any admissible class of languages is identifiable in the limit from positive and negative data

• PAC learning with simple examples, i.e. examples drawn according to the conditional Solomonoff-Levin distribution

  Pc(x) = λc 2^(−K(x|c))

  K(x|c) denotes the Kolmogorov complexity of x given a representation c of the concept to be learned
  – regular languages are PACS learnable with positive examples only
  – but Kolmogorov complexity is not computable!

Cognitive relevance of learning paradigms

A largely unsolved question

Learning paradigms seem irrelevant to model human learning:

• Gold's identification in the limit framework has been criticized as children seem to learn natural language without negative examples
• All learning models assume a known representation class
• Some learnability results are based on enumeration

However learning models show that:

• an oracle can help
• some examples are useless, others are good: characteristic samples ⇔ typical examples
• learning well is learning efficiently
• example frequency matters
• good examples are simple examples ⇔ cognitive economy

Regular Inference from Positive and Negative Data

Additional hypothesis: the underlying theory is a regular grammar or, equivalently, a finite state automaton

Property 1. Any regular language has a canonical automaton A(L) which is deterministic and minimal (minimal DFA)

Example: L = (ba*a)*
[Figure: the canonical DFA, with states 0 (initial, accepting), 1, 2 (accepting) and transitions 0→1 on b, 1→2 on a, 2→2 on a, 2→1 on b]
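Below is a small illustration (a sketch, not taken from the slides) of this canonical automaton written as a transition table, together with a membership test; the state numbering follows the figure above.

# Minimal sketch of the canonical DFA for L = (ba*a)*: state 0 is initial,
# states 0 and 2 are accepting, a missing transition means rejection.
DELTA = {
    (0, 'b'): 1,
    (1, 'a'): 2,
    (2, 'a'): 2,
    (2, 'b'): 1,
}
FINALS = {0, 2}

def accepts(word, delta=DELTA, finals=FINALS, start=0):
    """Deterministic parse: reject as soon as a transition is missing."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

if __name__ == "__main__":
    for w in ["", "ba", "baa", "baba", "ab", "b"]:
        print(repr(w), accepts(w))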

A few definitions

Definition 1. A positive sample S+ is structurally complete with respect to an automaton A if, when generating S+ from A:

• every transition of A is used at least once
• every final state is used as accepting state of at least one string

Example: {ba, baa, baba, λ} is structurally complete with respect to the canonical automaton of L = (ba*a)* shown above

A theorem

The positive data can be represented by a prefix tree acceptor (PTA)

Example: {aa, abba, baa}
[Figure: the PTA of {aa, abba, baa}, with states numbered 0 to 8]

Theorem 3. If the positive sample is structurally complete with respect to a canonical automaton A(L0), then there exists a partition π of the state set of the PTA such that PTA/π = A(L0)
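A small sketch (an assumption, not code from the slides) of PTA construction: states are the prefixes of S+, numbered in standard order (by length, then lexicographically), which reproduces the 0–8 numbering of the figure for {aa, abba, baa}.

# Minimal PTA construction sketch: states = prefixes of the positive sample,
# numbered in standard order; transitions follow the prefix tree.
def build_pta(positive_sample):
    prefixes = {w[:i] for w in positive_sample for i in range(len(w) + 1)}
    ordered = sorted(prefixes, key=lambda p: (len(p), p))   # standard order <
    index = {p: i for i, p in enumerate(ordered)}
    delta = {}                                              # (state, symbol) -> state
    for p in ordered:
        if p:                                               # parent prefix is p[:-1]
            delta[(index[p[:-1]], p[-1])] = index[p]
    finals = {index[w] for w in positive_sample}
    return delta, finals, len(ordered)

if __name__ == "__main__":
    delta, finals, n = build_pta({"aa", "abba", "baa"})
    print(n, "states, finals =", sorted(finals))
    for (q, a), r in sorted(delta.items()):
        print(f"{q} --{a}--> {r}")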


Merging is fun

[Figure: the automaton A1 with states 0, 1, 2 and the quotient automata A2 = A1/π obtained for the partitions {{0,1},{2}}, {{0},{1,2}}, {{0,2},{1}} and {{0,1,2}}]

• Merging ⇔ definition of a partition π on the set of states
  Example: {{0,1}, {2}}
• If A2 = A1/π then L(A1) ⊆ L(A2): merging states ⇔ generalize the language

[Figure: the PTA of the positive sample]

How are we going to find the right partition? Use negative data!
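A sketch (an assumption, not the author's code) of the quotient construction A/π used above: merging the states of a block may make the automaton non-deterministic, and every string accepted by A1 is still accepted, so the language can only grow.

# Minimal quotient automaton sketch: an automaton is a set of
# (source, symbol, target) transitions plus final states and an initial state.
def quotient(transitions, finals, initial, partition):
    """Merge states according to a partition (a list of blocks)."""
    block_of = {q: i for i, block in enumerate(partition) for q in block}
    new_transitions = {(block_of[p], a, block_of[q]) for (p, a, q) in transitions}
    new_finals = {block_of[q] for q in finals}
    return new_transitions, new_finals, block_of[initial]

if __name__ == "__main__":
    # Automaton A1 of the slide: states 0, 1, 2 accepting L = (ba*a)*
    A1 = {(0, 'b', 1), (1, 'a', 2), (2, 'a', 2), (2, 'b', 1)}
    finals, initial = {0, 2}, 0
    for pi in ([{0, 1}, {2}], [{0}, {1, 2}], [{0, 2}, {1}], [{0, 1, 2}]):
        print(pi, "->", quotient(A1, finals, initial, pi))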

Summary

[Figure: A(L0) → Data Generation → Data → Grammar Induction; the induced automaton is PTA/π for some partition π of the PTA states]

We observe some positive and negative data

The positive sample S+ comes from a regular language L0

The positive sample is assumed to be structurally complete with respect to the canonical automaton A(L0) of the target language L0
(Not an additional hypothesis but a way to restrict the search to reasonable generalizations!)

We build the Prefix Tree Acceptor of S+. By construction L(PTA) = S+

Merging states ⇔ generalize S+

The negative sample S− helps to control over-generalization

Note: finding the minimal DFA consistent with S+,S− is NP-complete!


An automaton induction algorithm

RPNI is a particular instance of the "generalization as search" paradigm

Algorithm Automaton Induction
  input
    S+                               // positive sample
    S−                               // negative sample
  A ← PTA(S+)                        // PTA
  while (i, j) ← choose_states() do  // Choose a state pair
    if compatible(i, j, S−) then     // Check for compatibility of merging i and j
      A ← A/πij
    end if
  end while
  return A

RPNI algorithm

RPNI follows the prefix order in the PTA
[Figure: the PTA of {aa, abba, baa}, states 0 to 8]

Polynomial time complexity with respect to sample size (S+, S−)

RPNI identifies in the limit the class of regular languages

A characteristic sample, i.e. a sample such that RPNI is guaranteed to produce the correct solution, has a quadratic size with respect to |A(L0)|

Additional heuristics exist to improve performance when such a sample is not provided

RPNI algorithm: pseudo-code

  input  S+, S−
  output A DFA consistent with S+, S−
  begin
    A ← PTA(S+)                        // N denotes the number of states of PTA(S+)
    π ← {{0}, {1}, ..., {N − 1}}        // One state for each prefix, according to standard order <
    for i = 1 to |π| − 1                // Loop over partition subsets
      for j = 0 to i − 1                // Loop over subsets of lower rank
        π′ ← π \ {Bj, Bi} ∪ {Bi ∪ Bj}   // Merging Bi and Bj
        A/π′ ← derive(A, π′)
        π″ ← determ_merging(A/π′)
        if compatible(A/π″, S−) then    // Deterministic parsing of S−
          π ← π″
          break                         // Break j loop
        end if
      end for                           // End j loop
    end for                             // End i loop
    return A/π

Search space characterization

Conditions on the learning sample to guarantee the existence of a solution

DFA and NFA in the lattice

Characterization of the set of maximal generalizations ⇒ similar to the G set from Version Space

Efficient incremental lattice construction is possible ⇒ RPNI2 algorithm

Possible search by genetic optimization
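Putting the pieces together, here is a compact, self-contained Python sketch of the RPNI loop from the pseudo-code above (an illustrative reimplementation under stated assumptions, not the author's code): PTA states are numbered in standard order, merging uses a union-find, and the non-determinism introduced by a merge is folded away before parsing the negative sample.

# Sketch of RPNI-style state merging (illustration only).
def rpni(S_plus, S_minus):
    # --- prefix tree acceptor, states numbered in standard order ---
    prefixes = sorted({w[:i] for w in S_plus for i in range(len(w) + 1)},
                      key=lambda p: (len(p), p))
    index = {p: i for i, p in enumerate(prefixes)}
    delta = {}                                      # (state, symbol) -> state
    for p in prefixes:
        if p:
            delta[(index[p[:-1]], p[-1])] = index[p]
    finals = {index[w] for w in S_plus}
    n = len(prefixes)

    def find(q, parent):                            # union-find with path halving
        while parent[q] != q:
            parent[q] = parent[parent[q]]
            q = parent[q]
        return q

    def quotient(parent):
        """Quotient transition map plus pairs of blocks that break determinism."""
        qdelta, conflicts = {}, []
        for (q, a), r in delta.items():
            bq, br = find(q, parent), find(r, parent)
            old = qdelta.get((bq, a))
            if old is None:
                qdelta[(bq, a)] = br
            elif old != br:
                conflicts.append((old, br))
        return qdelta, conflicts

    def merge(q1, q2, parent):
        """Merge two blocks, then recursively merge blocks that create non-determinism."""
        stack = [(q1, q2)]
        while stack:
            a, b = stack.pop()
            ra, rb = find(a, parent), find(b, parent)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
                stack.extend(quotient(parent)[1])

    def rejects_all_negatives(parent):
        qdelta, _ = quotient(parent)
        block_finals = {find(f, parent) for f in finals}
        for w in S_minus:
            q, dead = find(0, parent), False
            for a in w:
                if (q, a) not in qdelta:
                    dead = True
                    break
                q = qdelta[(q, a)]
            if not dead and q in block_finals:
                return False
        return True

    # --- main RPNI loop: try to merge each state with an earlier block ---
    parent = list(range(n))
    for i in range(1, n):
        if find(i, parent) != i:
            continue                                # state i was already merged away
        for j in range(i):
            if find(j, parent) != j:
                continue
            trial = parent[:]
            merge(j, i, trial)
            if rejects_all_negatives(trial):
                parent = trial
                break
    qdelta, _ = quotient(parent)
    return qdelta, {find(f, parent) for f in finals}, find(0, parent)

if __name__ == "__main__":
    qdelta, finals, start = rpni({"", "ba", "baa", "baba"}, {"b", "ab", "bab"})
    print("start:", start, "finals:", sorted(finals))
    for (q, a), r in sorted(qdelta.items()):
        print(q, "--" + a + "-->", r)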


An execution step of RPNI

[Figure: a sequence of automata produced by RPNI on a small sample, merging states 5 and 2, then 8 and 4, then 7 and 4, then 9 and 4, and finally 10 and 6]

Probabilistic DFA

[Figure: a two-state probabilistic DFA over {a, b}, with transition probabilities and an end-of-string probability attached to each state]

P(ab) = 0.6 ∗ 0.7 ∗ 0.3 (see the sketch below)

A structural and probabilistic model ⇒ an explicit and noise tolerant theory

A combined inductive learning and statistical estimation problem

Learning from positive examples only and frequency information

Outside of the scope of the previous learning paradigms

A probabilistic automaton induction algorithm

Algorithm Probabilistic Automaton Induction
  input
    S+                               // positive sample
    α                                // precision parameter
  A ← PPTA(S+)                       // Probabilistic PTA
  while (i, j) ← choose_states() do  // Choose a pair of states
    if compatible(i, j, α) then      // Check for compatibility of merging i and j
      A ← A/πij
    end if
  end while
  return A
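A minimal sketch of how a probabilistic DFA assigns a probability to a string; the two-state automaton below is an assumption, chosen only so that it reproduces P(ab) = 0.6 ∗ 0.7 ∗ 0.3 from the slide above.

# String probability under a probabilistic DFA: multiply transition probabilities
# and the end-of-string probability of the final state reached.
TRANS = {                     # (state, symbol) -> (next state, probability)
    (0, 'a'): (1, 0.6),
    (0, 'b'): (0, 0.4),
    (1, 'b'): (1, 0.7),
}
END = {0: 0.0, 1: 0.3}        # end-of-string probability per state

def string_probability(word, trans=TRANS, end=END, start=0):
    prob, state = 1.0, start
    for symbol in word:
        if (state, symbol) not in trans:
            return 0.0
        state, p = trans[(state, symbol)]
        prob *= p
    return prob * end[state]

if __name__ == "__main__":
    print(string_probability("ab"))   # 0.6 * 0.7 * 0.3 = 0.126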


Probabilistic prefix tree acceptor (PPTA)

[Figure: the PPTA of {aa, abba, baa}, with transition and end-of-string probabilities estimated from relative frequencies (e.g. 2/3 and 1/3 out of the initial state), and the quotient automaton obtained by merging states 1 and 3]

Compatibility criterion

ALERGIA, RLIPS
Two states are compatible (can be merged) if their suffix distributions are close enough

MDI
Two states are compatible if the prior probability gain of the merged model compensates for the likelihood loss of the data:
⇒ Bayesian learning (not strictly in this case)
⇒ based on Kullback-Leibler divergence
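Returning to the PPTA above: a minimal sketch (an assumption, not code from the slides) of the maximum likelihood estimation, where each PTA transition and each end-of-string event gets the relative frequency observed in the sample.

# PPTA sketch: PTA transitions carry counts, probabilities are relative frequencies.
from collections import Counter

def build_ppta(positive_sample):
    through = Counter()                  # strings passing through each prefix/state
    trans = Counter()                    # (prefix, symbol) counts
    ends = Counter()                     # strings ending at each prefix/state
    for w in positive_sample:            # a list, so repeated strings add counts
        for i in range(len(w) + 1):
            through[w[:i]] += 1
            if i < len(w):
                trans[(w[:i], w[i])] += 1
        ends[w] += 1
    probs = {(p, a): c / through[p] for (p, a), c in trans.items()}
    end_probs = {p: ends[p] / through[p] for p in through}
    return probs, end_probs

if __name__ == "__main__":
    probs, end_probs = build_ppta(["aa", "abba", "baa"])
    for (p, a), pr in sorted(probs.items()):
        print(f"{p or 'λ'} --{a}--> {p + a}: {pr:.3f}")
    print({(p or 'λ'): round(e, 3) for p, e in end_probs.items() if e > 0})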

ALERGIA

RPNI state merging order

Compatibility measure: two states q1 and q2 are αA-compatible if

• | C(q1,a)/C(q1) − C(q2,a)/C(q2) | < √( ½ ln(2/αA) ) · ( 1/√C(q1) + 1/√C(q2) ),  ∀a ∈ Σ ∪ {#}
• δ(q1, a) and δ(q2, a) are αA-compatible, ∀a ∈ Σ

where C(q) is the number of strings reaching state q and C(q, a) the number of them following transition a (# denotes the end of string)

Remarks:
It is a recursive measure of suffix proximity
This measure does not depend on the prefixes of q1 and q2 ⇒ local criterion

Kullback-Leibler divergence

Notation:

D(PA0 ‖ PA1) = D(A0 ‖ A1)
             = Σ_{x∈Σ*} PA0(x) log( PA0(x) / PA1(x) )
             = − Σ_{x∈Σ*} PA0(x) log PA1(x) − H(A0)

Likelihood of x given model A1:
PA1(x) = P(x|A1)

Cross entropy between A0 and A1:
− Σ_{x∈Σ*} PA0(x) log PA1(x)

When A0 is a maximum likelihood estimate, e.g. the PPTA, the cross entropy measures the likelihood loss while going from A0 to A1
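A sketch of the ALERGIA-style frequency test from the slide above (an assumption: the constants follow the Hoeffding bound as reconstructed there, and the recursion on successor states is left out).

# ALERGIA-style compatibility sketch: compare two observed frequencies with a
# Hoeffding-type bound, then compare the full outgoing distributions of two states.
from math import sqrt, log

def frequencies_compatible(f1, n1, f2, n2, alpha):
    """True if the two observed frequencies could come from the same distribution."""
    bound = sqrt(0.5 * log(2.0 / alpha)) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound

def states_compatible(counts1, n1, counts2, n2, alphabet, alpha):
    """Compare the outgoing distributions of two states, symbol by symbol
    ('#' plays the role of the end-of-string count). The recursive test on
    successor states required by the full criterion is omitted here."""
    return all(frequencies_compatible(counts1.get(a, 0), n1,
                                      counts2.get(a, 0), n2, alpha)
               for a in list(alphabet) + ['#'])

if __name__ == "__main__":
    q1 = {'a': 30, 'b': 10, '#': 10}      # C(q1, .) with C(q1) = 50
    q2 = {'a': 14, 'b': 4,  '#': 7}       # C(q2, .) with C(q2) = 25
    print(states_compatible(q1, 50, q2, 25, "ab", alpha=0.05))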

Bayesian learning

Find a model M̂ which maximizes the likelihood of the data P(X|M) and the prior probability of the model P(M):

M̂ = argmax_M P(X|M) · P(M)

The PPTA maximizes the data likelihood

A smaller model (number of states) is a priori assumed more likely

MDI algorithm

RPNI state merging order

Compatibility measure: small divergence increase (= small likelihood loss) with respect to size reduction (= prior probability increase) ⇒ a global criterion

Δ(A1, A2) / ( |A1| − |A2| ) < αM

Efficient computation of the divergence increase:

D(A0 ‖ A2) = D(A0 ‖ A1) + Δ(A1, A2)

Δ(A1, A2) = Σ_{qi ∈ Q012} Σ_{a ∈ Σ∪{#}} ci γ0(qi, a) log( γ1(qi, a) / γ2(qi, a) )

where Q012 = {qi ∈ Q0 | Bπ01(qi) ≠ Bπ02(qi)} denotes the set of states of A0 which have been merged to get A2 from A1
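As a rough illustration of the MDI decision rule above (an approximation under stated assumptions, not the author's method): the divergence increase is estimated directly as an average log-likelihood loss on the training strings, instead of with the state-level formula, and string probabilities come from a generic probabilistic-DFA scorer.

# Rough MDI-style sketch: accept a merge when the likelihood loss per removed
# state stays below a threshold alpha.
from math import log

def string_probability(word, trans, end, start=0):
    prob, state = 1.0, start
    for symbol in word:
        if (state, symbol) not in trans:
            return 0.0
        state, p = trans[(state, symbol)]
        prob *= p
    return prob * end.get(state, 0.0)

def empirical_divergence_increase(sample, A1, A2):
    """Average log-likelihood loss when replacing model A1 by the smaller model A2.
    Assumes both models give every sample string a non-zero probability."""
    total = 0.0
    for w in sample:
        total += log(string_probability(w, *A1) / string_probability(w, *A2))
    return total / len(sample)

def mdi_accepts(sample, A1, size1, A2, size2, alpha):
    delta = empirical_divergence_increase(sample, A1, A2)
    return delta / (size1 - size2) < alpha

if __name__ == "__main__":
    # Toy models: A1 has two states, A2 is a one-state "merged" model.
    A1 = ({(0, 'a'): (1, 0.6), (0, 'b'): (0, 0.4), (1, 'b'): (1, 0.7)}, {0: 0.0, 1: 0.3})
    A2 = ({(0, 'a'): (0, 0.3), (0, 'b'): (0, 0.5)}, {0: 0.2})
    sample = ["ab", "abb", "bab"]
    print(mdi_accepts(sample, A1, 2, A2, 1, alpha=2.0))
    print(mdi_accepts(sample, A1, 2, A2, 1, alpha=1.0))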

Natural language application: the ATIS task

Air travel information system, "spontaneous" American English

"Uh, I'd like to go from, uh, Pittsburgh to Boston next Tuesday, no wait, Wednesday".

Lexicon (alphabet): 1294 words
Learning sample: 13044 sentences, 130773 words
Validation set: 974 sentences, 10636 words
Test set: 1001 sentences, 11703 words

Perplexity

P(x_i^j | q_i): probability of generating x_i^j, the i-th symbol of the j-th string, from state q_i

LL = − (1/‖S‖) Σ_{j=1}^{|S|} Σ_{i=1}^{|x^j|} log P(x_i^j | q_i)

PP = 2^LL

PP = 1 ⇒ a perfectly predictive model

PP = |Σ| ⇒ uniform random guessing over Σ

Comparative results

[Figure: test set perplexity of ALERGIA and MDI on the ATIS task as a function of the training sample size, up to about 12000 sentences]

Perplexity measures the prediction power of the model: the smaller, the better
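A minimal sketch of the perplexity computation defined above (the per-symbol probabilities in the demo are made-up numbers, not ATIS results).

# Perplexity sketch: PP = 2^LL with LL the average negative log2 probability per symbol.
from math import log2

def perplexity(per_symbol_probs):
    """per_symbol_probs: one list of per-symbol probabilities per test string
    (in practice including the end-of-string event)."""
    n_symbols = sum(len(probs) for probs in per_symbol_probs)
    ll = -sum(log2(p) for probs in per_symbol_probs for p in probs) / n_symbols
    return 2.0 ** ll

if __name__ == "__main__":
    # Two strings with made-up symbol probabilities.
    print(perplexity([[0.5, 0.25, 0.125], [0.5, 0.5]]))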

Equivalence between PNFA and HMM

Probabilistic non-deterministic automata (PNFA), with no end-of-string probabilities, are equivalent to Hidden Markov Models (HMMs)

[Figure: a two-state PNFA and the equivalent HMM with emission on transitions]


[Figure: a two-state HMM with emission on states and the equivalent PNFA]

Links with Markov chains

A subclass of regular languages: the k-testable languages in the strict sense

A k-TSS language is generated by an automaton such that all subsequences sharing the same last k − 1 symbols lead to the same state

[Figure: a 3-TSS automaton whose states correspond to the suffixes λ, a, b, aa, ab, ba, bb]

p̂(a|bb) = C(bba) / C(bb)

A probabilistic k-TSS language is equivalent to a Markov chain of order k − 1

There exist probabilistic regular languages not reducible to Markov chains of any finite order
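A small sketch of the count-based estimate p̂(a|bb) = C(bba)/C(bb) above, i.e. a Markov chain of order k − 1 (here k = 3) estimated from a toy sample (an illustration, not code from the slides).

# k-TSS / order (k-1) Markov chain estimate: p(a | context) = C(context + a) / C(context).
from collections import Counter

def estimate(sample, k=3):
    context_counts, extension_counts = Counter(), Counter()
    for w in sample:
        for i in range(len(w) - (k - 1)):          # no boundary padding in this sketch
            context = w[i:i + k - 1]
            context_counts[context] += 1
            extension_counts[context + w[i + k - 1]] += 1
    def p(symbol, context):
        if context_counts[context] == 0:
            return 0.0
        return extension_counts[context + symbol] / context_counts[context]
    return p

if __name__ == "__main__":
    p = estimate(["abba", "bba", "bbba", "abb"], k=3)
    print("p(a|bb) =", p("a", "bb"))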


The smoothing problem

A probabilistic DFA defines a probability distribution over a set of strings

Some strings are not observed in the training sample but they could be observed ⇒ their probability should be strictly positive

The smoothing problem: how to assign a reasonable probability to (yet) unseen events?

Highly optimized smoothing techniques exist for Markov chains

• How to adapt these techniques to more general probabilistic automata? (a simple smoothing sketch is given below)

Related problems and approaches

I did not talk about

• other induction problems (NFA, CFG, tree grammars, ...)
• heuristic approaches such as neural nets or genetic optimization
• how to use prior knowledge
• smoothing techniques
• how to parse natural language without a grammar (decision trees)
• how to learn transducers
• benchmarks, applications
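For the smoothing problem above, a minimal sketch of the simplest possible fix, additive smoothing of the symbol distribution attached to one state (an illustration only; the highly optimized Markov-chain techniques mentioned are more involved).

# Additive (Laplace) smoothing for the outgoing distribution of one state:
# unseen symbols get a small, strictly positive probability.
def smoothed_distribution(counts, alphabet, delta=0.5):
    """counts: observed symbol counts at a state ('#' = end of string)."""
    events = list(alphabet) + ['#']
    total = sum(counts.get(a, 0) for a in events) + delta * len(events)
    return {a: (counts.get(a, 0) + delta) / total for a in events}

if __name__ == "__main__":
    counts = {'a': 7, '#': 3}                 # 'b' was never observed at this state
    print(smoothed_distribution(counts, "ab"))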

Ongoing and future work

• Definition of a theoretical framework for inductive and statistical learning
• Links with HMMs: parameter estimation, structural induction
• Smoothing techniques improvement ⇒ a key issue for practical applications
• Applications to probabilistic modeling of proteins
• Automatic translation
• Applications to text categorization or text mining
