New Probabilistic Model of Stochastic Context-Free Grammars in Greibach Normal Form


New Probabilistic Model of Stochastic Context-Free Grammars in Greibach Normal Form

Yevhen Hrubiian
Student, group FI-33, Institute of Physics and Technology
NTUU “Igor Sikorsky Kyiv Polytechnic Institute”
Kyiv, Ukraine
[email protected]

Abstract—In this paper we propose a new probabilistic model of stochastic context-free grammars in Greibach normal form. This model may be promising for the future development and study of stochastic context-free grammars.

Keywords—stochastic context-free grammar; Greibach normal form; hidden Markov models

I. INTRODUCTION

Today there is great interest in developing new methods of natural language processing, driven by problems in cryptanalysis, biology, and artificial intelligence. Several approaches have been proposed since the works of Noam Chomsky in the mid-1950s, which originated the modern formal theory of syntax. His conception of generative grammars and context-free grammars became one of the most discussed and controversial ideas in linguistics, and these ideas later underlay the development of the formal theory of programming languages and compilers. Context-free grammars were first proposed to describe the grammar of English or any other language, but this approach does not cover all features of a language, such as semantics. Later, head-driven grammars were introduced to describe valuable properties of languages such as semantic dependencies between words. Probabilistic models of language were proposed by Claude Shannon to describe informational aspects of languages such as entropy, and they have become highly relevant in modern linguistics. The N-gram model is one of the most accurate statistical models of language, but its main drawback is that accurate training for large values of N requires a huge amount of text on diverse topics; unfortunately, even a large collection of texts turns out to be insufficient for accurate and precise modeling of a whole language. For this reason stochastic context-free grammars were introduced as a more precise description of language: context-freeness guarantees the grammatical pattern of the language, while probability distributions over the grammatical rules approximate the semantic structure of phrases.

Stochastic context-free grammars are also useful in modeling DNA sequences, image recognition, and modeling plain texts in cryptography, so the investigation of stochastic context-free grammars is very important and topical in today’s science.

In this paper we consider stochastic context-free grammars in Greibach normal form. Significant contributions in natural language processing have been associated with algorithms for learning stochastic context-free grammars in Chomsky normal form, but the complexity of those methods is quite high. We propose a new approach to the investigation of stochastic context-free grammars based on our conception of hidden Markov models with a stack.

II. STOCHASTIC CONTEXT-FREE GRAMMARS

The conception of stochastic context-free grammars (SCFG) arose from several problems in natural language processing, such as finding the most probable sequence of words and learning an unknown grammar from corpora of texts.
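To make the SCFG conception concrete, a grammar of this kind can be viewed as a rule table in which each nonterminal carries a probability distribution over its productions. The following minimal sketch is illustrative only (the toy grammar and the function names are assumptions, not taken from the paper): it validates that each nonterminal’s rule probabilities sum to 1 and samples a random string from the grammar.

```python
import random

# Illustrative SCFG: each nonterminal maps to a list of
# (right-hand side, probability) pairs. Symbols absent from the
# mapping are treated as terminals.
scfg = {
    "S": [(("a", "S"), 0.4), (("b",), 0.6)],
}

def check_distributions(grammar):
    """Each nonterminal's rule probabilities must sum to 1."""
    return all(abs(sum(p for _, p in rules) - 1.0) < 1e-9
               for rules in grammar.values())

def sample(grammar, symbol="S", rng=random):
    """Generate one string by recursively expanding nonterminals."""
    if symbol not in grammar:          # terminal symbol: emit it
        return symbol
    rhss, probs = zip(*grammar[symbol])
    rhs = rng.choices(rhss, weights=probs)[0]
    return "".join(sample(grammar, s, rng) for s in rhs)

assert check_distributions(scfg)
print(sample(scfg))  # e.g. "aab" or "b": some run of a's ending in b
```

Because the recursive rule S → aS has probability strictly below 1, sampling terminates with probability one, which mirrors the consistency requirement on SCFGs.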
But NLP is not the only sphere of application of SCFGs: biologists and geneticists have proposed using SCFGs for modeling DNA sequences, and modern pattern recognition, speech processing, and image processing also use this conception.

We say that G = <N, Σ, R, S, P> is a stochastic context-free grammar if and only if Q = <N, Σ, R, S> is a context-free grammar (N is the alphabet of nonterminal symbols, or variables; Σ is the alphabet of terminal symbols; R is the set of production rules; S is the start symbol) and P is a family of probability distributions on R, so that every production in R carries the probability of applying that rule when a sequence of symbols is generated, and the probabilities of productions with the same left-hand symbol must sum to 1.

There are several normal forms of grammars; the most popular are Chomsky normal form and Greibach normal form.

We say that a context-free grammar G = <N, Σ, R, S> is in Chomsky normal form (CNF) if and only if all production rules have the form:

A → BC
A → a

Our major interest is Greibach normal form.

We say that a context-free grammar G = <N, Σ, R, S> is in Greibach normal form (GNF) if and only if all production rules have the form:

A → aA1A2…An

where A is a nonterminal, a is a terminal symbol, and A1A2…An is a possibly empty sequence of nonterminal symbols not including S.

The definition of SCFG in GNF can be strengthened by requiring the number of nonterminal symbols on the right-hand side of any production to be at most 2. The productions of such a grammar have one of the following forms:

A → aA1A2
A → aA1
A → a

It can be shown that every SCFG in GNF can be transformed into an equivalent grammar in this stronger GNF. For this reason, in this paper we assume that every Greibach normal form grammar has the form described above.

Every context-free grammar can be transformed in polynomial time into an equivalent grammar in Greibach normal form generating the same language. The same idea is behind transforming an SCFG into an SCFG in Greibach normal form. We obtained an algorithm for this transformation based on the existing algorithm for transforming an ordinary CFG in Chomsky normal form into a context-free grammar in Greibach normal form, adding several steps that rebalance the probability distributions P. Here is the algorithm:

1. Eliminate null productions, unit productions, and useless symbols from the grammar G, and then construct a grammar G0 = (V0, T, R0, S) in Chomsky normal form (CNF) generating the language L(G0) = L(G) − {ε}.
2. Rename the variables A1, A2, …, An, starting with S = A1.
3. Modify the rules in R0 so that if Ai → Ajγ ∈ R0 then j > i.
4. Starting with A1 and proceeding to An, this is done as follows:
   a) Assume that the productions have been modified so that, for 1 ≤ i ≤ k, Ai → Ajγ ∈ R0 only if j > i.
   b) If Ak → Ajγ is a production with j < k, generate a new set of productions by substituting for Aj the body of each Aj-production. The transferred probabilities should be split uniformly from the base probability.
   c) Repeating (b) at most k − 1 times, we obtain rules of the form Ak → Apγ, p ≥ k.
   d) Replace the left-recursive rules Ak → Akγ by removing the left recursion as stated above. The transferred probabilities should also be split.
   e) Modify the rules Ai → Ajγ for i = n − 1, n − 2, …, 1 into the desired form, changing the Z production rules at the same time.

III. SCORING PROBLEM FOR REGULAR GRAMMARS

Suppose we have to find the probability Pr(w1:n | G) of a sequence of symbols generated by a grammar in GNF, i.e., to solve the scoring problem.

A trivial situation comes up when our grammar in GNF is regular, i.e., all rules have the form:

A → aB

Regular grammars are equivalent to finite state machines (FSM). This equivalence can be established by taking the set of states of the FSM to be the set of nonterminal symbols, with every transition described by the corresponding rule of the grammar together with the emission of a terminal symbol.

We say that a bivariate stochastic process {Xi, Oi, i = 1, 2, …} is a hidden Markov model of order n by emissions (HMMn) if

Pr(Xk | X1:k−1) = Pr(Xk | Xk−1)
Pr(Ok | X1:k) = Pr(Ok | Xk−n+1:k)

The Xk are called latent variables, or hidden states, of the HMM, and the Ok are called observations.

Stochastic regular grammars are equivalent to hidden Markov models of order 2 by emissions. This follows from the fact that the set of observed values corresponds to the set of terminal symbols and the set of latent variables corresponds to the set of nonterminal symbols of the grammar. The second order of the model by emissions follows from the fact that an observed word depends on the previous two states; the order by transitions, however, is still one. This equivalence also requires that the applications of the grammatical rules be independent. Because of this we can apply the forward algorithm to solve the scoring problem for a stochastic regular grammar. The complexity of this algorithm is Θ(nm²), where n is the length of the observed sequence and m is the number of nonterminal symbols.

IV. MARKOV CHAINS WITH STACK

The case of a general SCFG in GNF is much more complicated. If we try to build an HMM out of an SCFG in GNF, we face uncertainty in the description of the set of latent variables. A naïve approach is to set up an equivalence between the set of latent states and the set of all possible sequences of nonterminal symbols that the grammar could produce while generating all possible sentences. But in that case the number of latent states could be very large and even unbounded because of the recursions in the grammatical rules.

Let us try to overcome these difficulties. It is a known fact that every context-free grammar is equivalent to some finite state machine with a stack (pushdown automaton). Consider a deterministic example of a Markov chain with a stack over the states A, B, and C. In the example, popping the head of the stack is denoted by a special symbol, and the chain jumps to the popped state afterwards. Transition probabilities are omitted because the chain in the example is deterministic. If the start state is A, the chain will produce the sequence ABCABCABC…

One could notice that the stack size in this example grows without bound, but this is not the common case for natural-language grammars. The model terminates when the stack is empty and the model reaches the # symbol.

Our goal then will be to set up an equivalence between stochastic context-free grammars in Greibach normal form and hidden Markov models of order 2 with a stack (HMM2S).

V. SCFG IN GNF AND HMM2S EQUIVALENCY

In the previous section we saw that Markov chains with a stack work similarly to the way stochastic context-free
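The deterministic chain-with-stack example from Section IV can be sketched in code. Since the paper’s figure of the chain is not reproduced here, the transition rules below are assumptions chosen to reproduce the behaviour the text describes: started in state A, the chain emits ABCABC… while its stack keeps growing, and a pop jumps to the popped state (the marker # signals an empty stack and termination).

```python
# Toy reconstruction of a deterministic Markov chain with a stack.
# The concrete transition rules are assumptions, not from the paper.
def run_chain(steps):
    transitions = {
        "A": ("push", "A", "B"),   # in A: push "A", jump to B
        "B": ("push", "A", "C"),   # in B: push "A", jump to C
        "C": ("pop",),             # in C: pop the head and jump to it
    }
    state, stack, output = "A", ["#"], []
    for _ in range(steps):
        output.append(state)
        rule = transitions[state]
        if rule[0] == "push":
            stack.append(rule[1])
            state = rule[2]
        else:
            state = stack.pop()    # jump to the popped symbol
            if state == "#":       # empty-stack marker: model terminates
                break
    return "".join(output), len(stack)

emitted, stack_size = run_chain(9)
print(emitted)     # ABCABCABC
print(stack_size)  # 4: grown from the initial 1, one net push per cycle
```

Each A–B–C cycle performs two pushes and one pop, so the stack grows by one symbol per cycle, illustrating the unbounded growth noted in the text.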