New Probabilistic Model of Stochastic Context-Free Grammars in Greibach Normal Form
New Probabilistic Model of Stochastic Context-Free Grammars in Greibach Normal Form

Yevhen Hrubiian
Student, group FI-33, Institute of Physics and Technology
NTUU “Igor Sikorsky Kyiv Polytechnic Institute”
Kyiv, Ukraine
[email protected]

Abstract—In this paper we propose a new probabilistic model of stochastic context-free grammars in Greibach normal form. This model may be promising for further development and research on stochastic context-free grammars.

Keywords—stochastic context-free grammar; Greibach normal form; hidden Markov models

I. INTRODUCTION

Today there is great interest in developing new methods of natural language processing, driven by problems from cryptanalysis, biology, and artificial intelligence. Several lines of investigation have been pursued since the works of Noam Chomsky in the mid-1950s, which originated the modern formal theory of syntax. His conception of generative grammars and context-free grammars became one of the most discussed and controversial in linguistics, but these ideas later underlay the development of the formal theory of programming languages and compilers. Context-free grammars were first proposed for describing the grammar of English or any other language, but this approach does not cover all features of language, such as semantics. Later, head-driven grammars were introduced to describe valuable properties of languages such as semantic dependencies between words. Probabilistic models of languages were proposed by Claude Shannon to describe informational aspects of languages such as entropy, and they remain highly relevant in modern linguistics. The N-gram model is one of the most accurate statistical models of language, but its main drawback is that training it precisely for high values of N requires a huge amount of text on different topics; unfortunately, even if we have got a lot of texts, it turns out to be hard to model a whole language accurately and precisely. For this reason stochastic context-free grammars were introduced as a more precise description of languages: context-freeness guarantees the grammatical patterns of a language, while probability distributions over grammatical rules approximate the semantic structure of phrases. Stochastic context-free grammars are also useful for modeling DNA sequences, in image recognition, and for modeling plaintexts in cryptography, so the investigation of stochastic context-free grammars is very important in today's science.

In this paper we consider stochastic context-free grammars in Greibach normal form. Significant contributions in natural language processing have been associated with algorithms for learning stochastic context-free grammars in Chomsky normal form, but the complexity of those methods is quite high. We propose a new approach to the investigation of stochastic context-free grammars based on our conception of hidden Markov models with a stack.

II. STOCHASTIC CONTEXT-FREE GRAMMARS

The conception of stochastic context-free grammars (SCFG) came from several problems in natural language processing, such as finding the most probable sequence of words, learning an unknown grammar from corpora of texts, etc. But NLP is not the only sphere of application of SCFGs: biologists and geneticists have proposed using SCFGs for modeling DNA sequences, and modern pattern recognition, speech processing, and image processing also use this conception.

We say that G = <N, Σ, R, S, P> is a stochastic context-free grammar if and only if Q = <N, Σ, R, S> is a context-free grammar (N is the alphabet of nonterminal symbols, or variables; Σ is the alphabet of terminal symbols; R is the set of production rules; S is the start symbol) and P is a family of probability distributions on R that assigns to every production in R the probability of applying that production rule when deriving a sequence of symbols, such that the probabilities of all productions starting with the same symbol sum to 1.
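As a concrete illustration of this definition, here is a minimal Python sketch of one possible SCFG representation together with the normalization condition above; the rule encoding and the name is_proper are our own, not the paper's.

```python
from collections import defaultdict

# Hypothetical encoding: rules[lhs] is a list of (rhs, probability) pairs,
# where rhs is a tuple of terminal and nonterminal symbols.
def is_proper(rules, tol=1e-9):
    """Check the condition from the definition above: for every nonterminal,
    the probabilities of its productions sum to 1."""
    return all(abs(sum(p for _, p in prods) - 1.0) <= tol
               for prods in rules.values())

# Toy SCFG in GNF: S -> a S S with probability 0.4, S -> b with probability 0.6.
rules = defaultdict(list)
rules["S"] = [(("a", "S", "S"), 0.4), (("b",), 0.6)]
print(is_proper(rules))   # True
```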
There are several normal forms of grammars. The most popular are Chomsky normal form and Greibach normal form.

We say that a context-free grammar G = <N, Σ, R, S> is in Chomsky normal form (CNF) if and only if all production rules have the form

A → BC
A → a

Our major interest is Greibach normal form.

We say that a context-free grammar G = <N, Σ, R, S> is in Greibach normal form (GNF) if and only if all production rules have the form

A → aA1A2…An

where A is a nonterminal, a is a terminal symbol, and A1A2…An is a possibly empty sequence of nonterminal symbols not including S.

The definition of an SCFG in GNF can be strengthened by requiring the number of nonterminal symbols on the right-hand side of any production to be at most 2. The productions of such a grammar have one of the following forms:

A → aA1A2
A → aA
A → a

It can be shown that every SCFG in GNF can be transformed into an equivalent grammar in the stronger GNF. For this reason, in this paper we assume that every Greibach normal form grammar is in the form described above.

Every context-free grammar can be transformed in polynomial time into an equivalent grammar in Greibach normal form that generates the same language. The same idea lies behind transforming an SCFG into an SCFG in Greibach normal form. We obtained an algorithm for this transformation based on the existing algorithm for transforming an ordinary CFG in Chomsky normal form into a context-free grammar in Greibach normal form, adding several steps that rebalance the probability distributions P. Here is the algorithm (a code sketch of step 4(b) follows the list):

1. Eliminate null productions, unit productions, and useless symbols from the grammar G, and then construct a grammar G0 = (V0, T, R0, S) in Chomsky normal form (CNF) generating the language L(G0) = L(G) − {ε}.
2. Rename the variables A1, A2, …, An, starting with S = A1.
3. Modify the rules in R0 so that if Ai → Ajγ ∈ R0 then j > i.
4. Starting with A1 and proceeding to An, this is done as follows:
   a) Assume that the productions have been modified so that for 1 ≤ i ≤ k, Ai → Ajγ ∈ R0 only if j > i.
   b) If Ak → Ajγ is a production with j < k, generate a new set of productions by substituting for Aj the body of each Aj-production. The transferred probabilities should be split uniformly from the base probability.
   c) Repeating (b) at most k − 1 times, we obtain rules of the form Ak → Apγ, p ≥ k.
   d) Replace the left-recursive rules Ak → Akγ by removing left recursion in the standard way. The transferred probabilities should also be split.
5. Modify the rules Ai → Ajγ for i = n − 1, n − 2, …, 1 into the desired form, at the same time changing the Z production rules (the new variables introduced while removing left recursion).
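The following minimal Python sketch illustrates step 4(b) under our reading of "split uniformly from base probability": the probability of the replaced rule is divided equally among the substituted variants. The rule encoding and the function name are hypothetical, for illustration only.

```python
from collections import defaultdict

def substitute_leading_nonterminal(rules, a_k, a_j):
    """Step 4(b): replace each rule  A_k -> A_j gamma  (probability p) by the
    rules  A_k -> beta gamma, one for every production A_j -> beta, splitting
    the base probability p uniformly among the new rules."""
    j_bodies = rules[a_j]
    new_rules = []
    for rhs, p in rules[a_k]:
        if rhs and rhs[0] == a_j:
            gamma = rhs[1:]
            share = p / len(j_bodies)      # uniform split of the base probability
            for beta, _ in j_bodies:
                new_rules.append((beta + gamma, share))
        else:
            new_rules.append((rhs, p))
    rules[a_k] = new_rules

# Toy usage: A1 -> A2 A1 violates the ordering j > i, so substitute A2's bodies.
rules = defaultdict(list)
rules["A1"] = [(("A2", "A1"), 1.0)]
rules["A2"] = [(("a",), 0.6), (("b", "A1"), 0.4)]
substitute_leading_nonterminal(rules, "A1", "A2")
print(rules["A1"])   # [(('a', 'A1'), 0.5), (('b', 'A1', 'A1'), 0.5)]
```

An alternative design, not used here, would weight each new rule by the probability of the Aj-production it came from instead of splitting uniformly.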
III. SCORING PROBLEM FOR REGULAR GRAMMARS

Suppose we have to find the probability of a sequence of symbols generated by a grammar in GNF, Pr(w1:n | G), i.e., solve the scoring problem.

A trivial situation comes up when our grammar in GNF is regular, i.e., all rules have the form

A → aB

Regular grammars are equivalent to finite state machines (FSM). This equivalence is established if we take the set of states of the FSM to be the set of nonterminal symbols, with every transition described by the corresponding rule of the grammar together with the emission of a terminal symbol.

We say that a bivariate stochastic process {Xi, Oi, i = 1, 2, …} is a hidden Markov model of order n by emissions (HMMn) if

Pr(Xk | X1:k−1) = Pr(Xk | Xk−1)
Pr(Ok | X1:k) = Pr(Ok | Xk−n+1:k)

The Xk are called latent variables or hidden states of the HMM, and the Ok are called observations.

Stochastic regular grammars are equivalent to hidden Markov models of order 2 by emissions. This follows from the fact that the set of observed values corresponds to the set of terminal symbols and the set of latent variables corresponds to the set of nonterminal symbols of the grammar. The second order of the models follows from the fact that an observed word depends on the previous 2 states; the order by transitions, however, is still first. This equivalence also requires independence of the application of every grammatical rule. Because of this we can apply the forward algorithm to solve the scoring problem for a stochastic regular grammar. The complexity of this algorithm is Θ(nm²), where n is the length of the observed sequence and m is the number of nonterminal symbols.

IV. MARKOV CHAINS WITH STACK

The case of a general SCFG in GNF is much more complicated. If we try to build an HMM out of an SCFG in GNF, we face some uncertainty in the description of the set of latent variables. A naïve approach is to equate the set of latent states with the set of all possible sequences of nonterminal symbols that the grammar could produce while deriving all possible sentences. But in that case the number of latent states can be very large and even unbounded because of recursion in the grammatical rules.

Let us try to overcome these difficulties. It is a known fact that every context-free grammar is equivalent to some finite state machine with a stack (pushdown automaton).

[Figure: an example of a deterministic Markov chain with stack.]

In the figure, one of the transitions denotes popping out the head of the stack, to which the chain jumps afterwards. Transition probabilities are omitted because the chain in the example is deterministic. If the start state is A, the chain will produce the sequence ABCABCABC…

One may notice that the stack size in this example grows unrestrictedly, but that is not the common case for the grammars of natural languages. The model terminates when the stack is empty and the model reaches the # symbol.

Our goal then will be to set up an equivalence between stochastic context-free grammars in Greibach normal form and hidden Markov models of order 2 with stack (HMM2S).

V. SCFG IN GNF AND HMM2S EQUIVALENCY

In the previous paragraph we could see that Markov chains with a stack work similarly to the way stochastic context-free
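Since the example's transition diagram did not survive extraction, the following Python sketch is only one possible reconstruction consistent with the description: started in A, the chain emits ABCABCABC…, states A and B push onto the stack, state C pops the head of the stack and jumps to it, so the stack grows without bound. The termination rule (empty stack and the # symbol) is included as described in the text; all other transition details are our assumption.

```python
def run_chain(steps):
    """Hypothetical reconstruction of the deterministic Markov chain with
    stack from the example: A and B push onto the stack, C pops the head
    of the stack and jumps to the popped state. Started in A, it emits
    ABCABCABC... while the stack grows without bound."""
    state, stack, produced = "A", [], []
    for _ in range(steps):
        produced.append(state)
        if state == "A":
            stack.append("A")            # push; deterministic move to B
            state = "B"
        elif state == "B":
            stack.append("A")            # push again: two pushes per cycle
            state = "C"
        else:                            # state C: pop the head and jump to it
            state = stack.pop()
        if not stack and state == "#":   # termination rule from the paper:
            break                        # empty stack and the # symbol
    return "".join(produced), len(stack)

seq, depth = run_chain(9)
print(seq, depth)   # ABCABCABC 3 (one net push per A-B-C cycle)
```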