
AYJ

FORGOTTEN ISLANDS OF REGULARITY
REGULAR UNIVERSE OF LANGUAGE MODELS AND ITS CONTINUING EXPANSION
ANSSI YLI-JYRÄ
LOCALITY & REGULARITY, 18/12/2017
[Title figure labels: Markov chains; Kleene closure, union, concatenation]

[Figure: Moore machine, labelled "RNN" and "finite-state machine". Blocks: input, transition logic, state memory, output logic, output, clock, reset. Symbols: Σ, S0, Sn, Sn+1, T, G, Λ, R. Source: https://upload.wikimedia.org/wikipedia/commons/1/1f/Moore-Automat-en.svg]

TRIVIAL FINITE-STATE PHONOLOGY
GENERATIVE PHONOLOGY (Chomsky & Halle 1968)
Universal computing (Chomsky 1963, Johnson 1972, Ristad 1990)
Computes PARTIAL functions
Problematic as a theory (Popper 1959, Johnson 1972)
NAIVE FINITE-STATE PHONOLOGY
Right-linear derivation α → βγ → ββ'γ' → …
Based on a limited view of regularity; not linguistically intriguing

TRUE FINITE-STATE PHONOLOGY
NON-ITERATED FUNCTIONAL RULES (Johnson 1972)
Generative phonological rules have context conditions: α → β / γ _ γ'
Practical grammars with simultaneous and linear application modes
Test contexts with a bi(directional) machine (Schützenberger 1961)
Surprisingly, reduced linguistically interesting rules into FS machines
LIMITATION
No cyclic rules, but composition (Schützenberger 1961)

FINITE-STATE UNIVERSAL MODELS
• BRAIN COMPATIBLE
• PRACTICAL
DECIDABLE
• EFFICIENT
• GOOD THEORY
• ALGORITHMIC
• ADEQUATE
REGULAR
• SAFE

... (4) just backs up until it has tested the precondition. In our example, the precondition is just the suffix [C][y][T]:

[Figure: tape snapshots (4.1) to (4.3); the head-movement arrows, the blank cells and the contents of (4.2) did not survive extraction]
(4.1)  # g l o s s y T
(4.3)  s i e s t #

With this change, long words are produced in a zigzag style (5) where every rule application may back up some letters.
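The backing-up step of example (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SuffixRule` record, its field names, and the rules `RULE_T`, `RULE_NESS` and `RULE_PLURAL` are hypothetical and merely modelled on the strip/add/condition shape of hunspell-style suffix rules; they are not taken from any actual affix file or from the author's compiler.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SuffixRule:
    """Hypothetical suffix rule: back up over `strip`, then append `add`."""
    flag: str        # continuation flag that selects the rule, e.g. "T"
    strip: str       # letters deleted at the right end (the backing up)
    add: str         # letters appended after backing up
    condition: str   # regular expression the right end of the stem must match


def apply_rule(stem: str, rule: SuffixRule) -> Optional[str]:
    """Apply one rule at the right end of `stem`; None if the rule does not fit."""
    # Test the precondition on the suffix of the stem only.
    if re.search("(?:" + rule.condition + ")$", stem) is None:
        return None
    if not stem.endswith(rule.strip):
        return None
    # Back up over the stripped letters, then append the new material.
    backed_up = stem[: len(stem) - len(rule.strip)]
    return backed_up + rule.add


def derive(stem: str, rules: List[SuffixRule]) -> List[str]:
    """Iterate rules on their own output; every step touches only the right end."""
    forms = [stem]
    for rule in rules:
        next_form = apply_rule(forms[-1], rule)
        if next_form is None:
            break
        forms.append(next_form)
    return forms


# Assumed rules for illustration only: consonant + y, with flag T, becomes "iest".
RULE_T = SuffixRule(flag="T", strip="y", add="iest", condition="[^aeiou]y")
RULE_NESS = SuffixRule(flag="P", strip="y", add="iness", condition="[^aeiou]y")
RULE_PLURAL = SuffixRule(flag="S", strip="", add="es", condition="[sxz]")
```

With these assumed rules, `apply_rule("glossy", RULE_T)` gives `"glossiest"`, and `derive("glossy", [RULE_NESS, RULE_PLURAL])` walks through `glossy`, `glossiness`, `glossinesses`. Each step deletes and appends only at the right end, which is the property the zigzag derivation relies on.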
BUT ITERATED DERIVATION IN HUNSPELL GOES BEYOND KAPLAN & KAY (1994)
László Németh, Viktor Trón, Péter Halácsy, András Kornai, András Rung, and István Szakadát. Leveraging the open source ispell codebase for minority language analysis. Proceedings of the SALTMIL Workshop at LREC 2004.

Since the union of the affix rules is applied repeatedly to its own output, the standard two-part regularity condition of phonological grammars does not apply. However, as long as the derivation deletes and appends new material only at the right end of the string, the resulting process is linear and, intuitively, a regular grammar. In addition, the moves taken by the TM can now be deterministic because the machine does not completely rewind the tape at any point but always makes relative moves that allow it to remember its previous position.

2.3 Linear Encoding

Although the grammar represented by a hunspell lexicon does not satisfy the classical two-part condition of finite-state phonology, it is equivalent to a finite-state transducer when restricted to the suffix rules.

There are now some methods to compile hunspell lexicons to finite-state transducers. Early experiments on compilation are due to Gyorgy Gyepesi (p.c., 2007) and others in Budapest. The author developed his solution (Yli-Jyrä, 2009) using a variant of Two-Level Morphology (Koskenniemi, 1983). This method viewed the lexicon as a collection of constraints that described linearly encoded backing up and suffixation in derivations. The method included an efficient one-shot compilation algorithm to compile and intersect several hundreds of thousands of lexical context restriction rules in parallel as if the lexical continuations (morphotaxis) were phonological constraints. A similar method, finally implemented by his colleagues, Pirinen and Lindén (2010), separated the lexical continuations from the phonological changes at morpheme boundaries and used a three-step approach where the final step composed the lexicon with the ...

GOAL: CHARACTERIZE ALL REGULAR GRAMMARS AND LANGUAGE MODELS

BOUNDS OF REGULARITY
REGULARITY MEANS FINITE PARALLELISM
REGULARITY MEANS LINEAR BOUNDED SPACE
REGULARITY MEANS FINITE COMPOSITION
VISIBLE TRACES AND … WRITING HEAD
BOUNDED NUMBER OF SPIDER WEBS OUTER LINKS
Kornai & Tuza 1992: Narrowness, Path-width and their application in NLP
[Image credits: People's Daily Online © CEN; CC-BY-SA Xvazquez]

REGULARITY À LA HENNIE (1965)
BOUNDED LTIME ONE-TAPE TURING MACHINE
O(k) CONTROL STATES (BOUNDED PARALLELISM)
O(n) TAPE CELLS (BOUNDED SPACE)
O(n) TIME STEPS (BOUNDED TIME)
CAN BE NONDETERMINISTIC (TADAKI ET AL. 2010)

[Page excerpt: MSO Definable String Transductions, p. 247]

Fig. 8. Track for ⊢a³b²aba⊣, Example 9.

Finally, the first visiting sequence of a computation should start with a visit (∗, q_in, ϵ, α), and exactly one visiting sequence should end with a visit (−ϵ, q_f, ∗, λ). Since the number of visits to each position is bounded, the visiting sequences come from a finite set, and we can interpret these sequences as symbols from a finite alphabet.
Each k-visiting computation is specified by a string over this alphabet, and we will call these strings k-tracks, e.g., the 3-track in Figure 8 specifies the computation of the Hennie machine of Example 9 on input a³b²aba.

It should be obvious from the above remarks that the language of such specifications is regular (e.g., see Lemma 2.2 of Greibach [1978c], or Lemma 1 of Chytil and Jákl [1977]). For instance, it is the heart of the proof in Hopcroft and Ullman [1979, Theorem 2.5] of the result that two-way finite-state automata are equivalent to their one-way counterparts [Rabin and Scott 1959; Shepherdson 1959].

PROPOSITION 23. Let M be a Hennie machine, and let k be a constant. The k-tracks for successful k-visiting computations of M form a regular language.

7.2 Characterizations Using Hennie Machines

From Proposition 23, using standard techniques (e.g., see Chytil and Jákl [1977, Lemma 1]) we obtain the following decomposition of nondeterministic Hennie transductions. Note that this decomposition already features in Theorem 20 as characterization of NMSOS.

LEMMA 24. NHM ⊆ MREL ◦ 2DGSM = NMSOS.

PROOF. Let M be a Hennie machine, finite-visit for constant k; each pair (w, z) in the transduction realized by M can be computed by a k-visiting computation.

We may decompose the behavior of M on input w as follows. First, a relabeling of ⊢w⊣ guesses a string of k-visiting sequences, one for each position of the input tape, such that the first symbol of each visiting sequence matches the input symbol of the corresponding tape position. Then, a 2DGSM verifies in a left-to-right scan whether the string specifies a valid computation, a track, of M for w, cf. Proposition 23. If this is the case, the 2DGSM returns to the left end marker ⊢ and simulates M on this input, following the k-visiting computation previously guessed.

ACM Transactions on Computational Logic, Vol. 2, No. 2, April 2001.

REGULARITY KEEPS SURPRISING
A. HENNIE (1965) GOES BEYOND THE CLASSICAL FS. PHONOLOGY
1. Restricted application: JOHNSON (1972); KK (1994)
2. Iterated application IN HUNSPELL (NÉMETH ET AL 2004)
B. HENNIE (1965) IS RELEVANT TO REPRESENTATION OF
3. Syntax (Nederhof & YJ 2017; Y-J 2017a, 2017b)
4. Semantics and Pragmatics (Gordon & Hobbs 2017; Kornai 2017 manus.)
5. RNN, including backpropagation

Table 1: The coverage of UD v.2 data with depth bounded weak edge bracketing

lang      N       depth 0  depth 1  depth 2  depth 3  depth 4  depth 5  depth 6   depth 7
Arabic    26722   4.42%    20.93%   65.09%   94.39%   99.64%   99.99%   +0.011%   (3)
Catalan   14832   1.27%    19.01%   70.39%   96.07%   99.62%   99.99%   +0.007%   (1)
Czech     102660  10.60%   43.11%   86.77%   98.47%   99.89%   99.99%   +0.010%   (10)
German    14917   2.06%    43.11%   87.52%   98.56%   99.91%   99.97%   +0.027%   (4)
English   19785   16.61%   53.59%   91.10%   99.21%   99.96%   100.00%
Spanish   31546   1.59%    24.61%   77.27%   97.24%   99.79%
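Why a depth bound matters for regularity can be illustrated with a small sketch. The exact "weak edge bracketing" encoding behind Table 1 is not spelled out in these slides, so the code below assumes plain "(" and ")" strings as a stand-in; the function names `accepts_bounded` and `coverage` are hypothetical. It shows only the general point that balanced brackets of depth at most k can be checked with a finite number of states (0 to k), and how a coverage figure per depth could be computed over a set of encoded sentences.

```python
from typing import Iterable


def accepts_bounded(brackets: str, k: int) -> bool:
    """Finite-state check: balanced brackets whose nesting depth never exceeds k.

    The only memory is `state`, which ranges over 0..k, so for a fixed k this
    is a finite automaton (with an implicit rejecting sink state).
    """
    state = 0
    for symbol in brackets:
        if symbol == "(":
            if state == k:      # opening here would exceed the depth bound
                return False
            state += 1
        elif symbol == ")":
            if state == 0:      # closing bracket without a matching opener
                return False
            state -= 1
        else:                   # symbol outside the assumed bracket alphabet
            return False
    return state == 0


def coverage(encoded_sentences: Iterable[str], k: int) -> float:
    """Fraction of encoded sentences accepted under depth bound k."""
    sentences = list(encoded_sentences)
    if not sentences:
        return 0.0
    accepted = sum(1 for s in sentences if accepts_bounded(s, k))
    return accepted / len(sentences)
```

On the toy input `["", "()", "(())", "((()))"]`, `coverage(..., 1)` is 0.5 and `coverage(..., 3)` is 1.0; the percentages in Table 1 come from the UD v.2 treebanks and are not reproduced by this sketch.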