Grammar Introduction Into Markov Chain Text Generation
ICS 661: Advanced AI Final Project
Grammar Introduction into Markov Chain Text Generation
Curran Meek
December 12, 2019

Contents
1 Introduction
2 Technical Discussion
  2.1 Markov Chain Implementation
  2.2 Grammar Rule Implementation
  2.3 Data
3 Analysis
  3.1 Results
  3.2 Performance
4 Conclusion
  4.1 Potential Improvements
Appendices
  A Code
  B Data - Star Wars Dialog
  C Test Runs
    C.1 10 Sentence Data Runs - No Rules
    C.2 10 Sentence Data Runs - Some Rules
    C.3 10 Sentence Data Runs - All Rules
    C.4 Full Dialog Runs - No Rules
    C.5 Full Dialog Runs - Some Rules
    C.6 Full Dialog Runs - All Rules

1 Introduction

The computational power in today's world is seemingly limitless. With cloud computing, fast processors, and graphics cards, it is relatively cheap and easy to train neural networks or process large amounts of data in parallel, all in a timely manner. This is not always the case in the robotic domain. Oftentimes robots are sent to places with limited communication availability. Moreover, a robot can only have so much computational power without sacrificing other on-board systems. These limitations require simple, low-complexity solutions for the algorithms robots employ. These constraints have motivated me to find a text generation approach that could be easily employed on a robot. A potential application could be a robot sending generated text describing what it has observed instead of entire image files. With these concepts in mind, the goal of this project is to improve upon a low-complexity text generation algorithm by using simple English grammar rules. Utilizing grammar in the generation of text will potentially make a small text dataset more robust, providing more coherent generated sentences.

Many current robust text generation methods rely on neural networks. For instance, a recurrent neural network (RNN) can be used to generate coherent cooking instructions using a checklist to model global coherence [1]. RNNs and generative adversarial networks (GANs) have been shown to have comparable performance in text generation [2]. Despite their success, neural networks often require large amounts of training data to produce a generalized model. Large datasets are not always readily available for specific domain applications and can be very time-consuming to create. Furthermore, neural networks require large computational power both to use and to train.

The straightforward algorithm explored in this project is Markov Chain text generation. The Markov Chain model gives structure to how random variables can change from one state to the next [3]. Each variable has an associated probability of what state will occur next; this can be seen graphically in Figure 1. The Markov Chain makes a powerful assumption that, to predict the next state, only the current state is relevant [3]. This assumption simplifies the model but loses past information that can be useful. The assumption can be expressed mathematically as $P(q_i = a \mid q_1 \dots q_{i-1}) = P(q_i = a \mid q_{i-1})$, where $q_i$ is the state variable at position $i$ in the sequence and $a$ is the value taken in that state [3]. Despite its simplicity, Markov Chain text generation can produce believable, comprehensible text. Internet users believed that text generated using a Markov Chain was written by a human approximately 20-40% of the time [4]. The results varied based on the background of the individuals, but they demonstrate that this method of text generation is capable of producing understandable text [4].
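To make the Markov assumption concrete, the following is a minimal Python sketch (illustrative only, not taken from the report's code). The states and transition probabilities are invented, in the spirit of the weather example in Figure 1(a); the point is that the next state is sampled using only the current state.

import random

# Hypothetical transition probabilities, in the spirit of Figure 1(a):
# tomorrow's weather depends only on today's weather.
transitions = {
    "sunny": {"sunny": 0.6, "rainy": 0.4},
    "rainy": {"sunny": 0.3, "rainy": 0.7},
}

def next_state(current):
    """Sample the next state using only the current state (the Markov assumption)."""
    states = list(transitions[current].keys())
    weights = list(transitions[current].values())
    return random.choices(states, weights=weights, k=1)[0]

# Generate a short sequence starting from "sunny".
state = "sunny"
sequence = [state]
for _ in range(5):
    state = next_state(state)
    sequence.append(state)
print(sequence)  # e.g. ['sunny', 'sunny', 'rainy', 'rainy', 'sunny', 'rainy']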
Markov Chain text generation is expanded upon by introducing simple grammatical rules into the training data. The rules add additional words to the training data based on the grammatical rule that applies. To accurately implement these rules, the Hunpos tagger is used to tag the text [5]. Using this methodology, the original and modified text generators are tested and evaluated, as discussed in the following sections.

Figure 1: The Markov Chain model is a framework for predicting a sequence of random variables based on associated probabilities. These variables can be words or symbols representing a number of phenomena [3]. The weather for each day could be predicted as shown in (a), or the next word in a sentence could be predicted as shown in (b). The circles represent the current value of the variable, while the lines show the probability that the variable will be in another state next in the sequence. Figure from [3].

2 Technical Discussion

2.1 Markov Chain Implementation

Generating text using the Markov Chain model requires a set of probabilities of one word appearing after another. To capture this, a dictionary model was used in Python. The dictionary contains start and end words that signify the beginning and end of a sentence, while the words that follow are accumulated in the dictionary by analyzing the text data. Figure 2 shows a simple two-sentence training set. In this case two start words and two end words are generated. The probability that a word follows another is represented by the number of times a word appears in the dictionary as the following word [6]. For example, as shown in Figure 2, 'eat' is followed by 'apples' or 'oranges', giving a 50% chance to each. In the dictionary this would be represented as {..., 'eat': ['apples', 'oranges'], ...}. Using this representation, text data can be broken down into the dictionary model during the training portion.

Figure 2: A transition diagram depicting how Markov Chain probabilistic predictions work on a sequence of words, based on two different sentences [6].

To perform the training in the code, a standard logic flow was used. As described in [6], each line, or sentence, is iterated through, saving each word and the word that follows it. Once the end of the sentence is reached, the final word is saved as an end word, and then the next line is analyzed. Any word that has been previously seen has the word following it added to its portion of the dictionary, representing the probability that a given word follows it [6]. As discussed, if 'best' follows 'the' three times, the dictionary entry for 'the' would be {..., 'the': ['best', 'best', 'best'], ...}. Figure 3 shows the code used to train the model [6].

Figure 3: Python code used to train the Markov Chain model. The code extracts start words, iterates through the sentence capturing each word, then extracts a stop word. These are all saved in a dictionary format that is used throughout the program. This code uses the same logic as [6].

Generating text is straightforward with the dictionary created. Once again following the methodology from [6], the program begins by choosing a random start word. Then the next word is generated by selecting a random word from the dictionary entry for the current word. Once an end word is generated the sentence is ended [6]. Figure 4 shows the code used when generating a sentence [6].

Figure 4: Python code used in generating text from the dictionary created during training. The code begins with a random start word and chooses words randomly until an end word is encountered. This code is from [6].
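The report's training and generation code is shown in Figures 3 and 4 and listed in Appendix A, which are not reproduced here. The following is a minimal sketch of the described logic; the function names (build_dict, generate_sentence) and the example sentences are illustrative assumptions, chosen to be consistent with the 'eat' → 'apples'/'oranges' example above.

import random

def build_dict(lines):
    """Build the Markov dictionary from training text, following the logic of [6].
    Returns start words, end words, and a word -> list-of-followers dictionary."""
    start_words, end_words, followers = [], [], {}
    for line in lines:
        words = line.strip().split()
        if len(words) < 2:
            continue
        start_words.append(words[0])   # first word of the sentence
        end_words.append(words[-1])    # last word of the sentence
        for current, nxt in zip(words, words[1:]):
            # Repeated followers encode the transition probability.
            followers.setdefault(current, []).append(nxt)
    return start_words, end_words, followers

def generate_sentence(start_words, end_words, followers):
    """Generate a sentence: random start word, then random followers until an end word."""
    word = random.choice(start_words)
    sentence = [word]
    while word in followers:
        word = random.choice(followers[word])
        sentence.append(word)
        if word in end_words:
            break
    return " ".join(sentence)

# Example usage on a two-sentence training set (illustrative, in the spirit of Figure 2).
training = ["I like to eat apples", "I like to eat oranges"]
start_words, end_words, followers = build_dict(training)
print(generate_sentence(start_words, end_words, followers))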
2.2 Grammar Rule Implementation

The grammatical rules added to the model are based on the simple English grammar rules shown in Figure 5 [3].

Figure 5: Simple English grammar rules from [3].

The rules are generated by expanding out the rules shown in Figure 5 and using the most common occurrences, with the addition of the common use of adjectives. Since the model predicts based only on the current word, the grammar rules were simplified to describe what type of word follows another. It was found that the following rules were the most common occurrences: Det → Noun, JJ → Noun, JJ → JJ, NNP/PRP → VB, where JJ represents an adjective, NNP is a proper noun, and PRP is a pronoun.

Using these rules, the dictionary created during training is modified to include additional words based on the rules. In this case, the probability of words that fit a grammar rule is increased by adding a repeat word to the dictionary. Utilizing the earlier example, if 'the' is followed by 'best' and 'dog', the resulting dictionary entry with the rules applied would be {..., 'the': ['best', 'dog', 'dog'], ...}. The additional occurrence of 'dog' results from 'the dog' fitting the rule Det → Noun (a sketch of this rule application is given after Section 3.1 below).

In order to correctly determine each word's grammatical role, the Hunpos PoS tagger was used [5]. Hunpos provides a simple, accurate method for PoS tagging. In an analysis of various PoS tagging methods, Hunpos was found to be very accurate while requiring orders of magnitude less time for use and training [7]. This can be clearly seen in Figure 6. Having prior experience in using Hunpos made its integration into the code simple.

Figure 6: Comparative graph of accuracy and computation time of various part-of-speech taggers. Hunpos can be seen to have high accuracy with low computational time required. Figure from [7].

2.3 Data

The data used for testing was lines of dialog from Star Wars Episode VI. This data was obtained from [8] and was reformatted so that each line of dialog appears on its own line. The text was then modified by removing the names of characters and removing one- and two-word sentences. The text file used can be found in Appendix B.

3 Analysis

3.1 Results

To evaluate the performance of adding grammar rules, human evaluation was used. The program was run with varying data set sizes, as well as with and without the grammar rules.
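As referenced in Section 2.2, the following is a minimal sketch of how the trained dictionary could be modified by the grammar rules. The Penn Treebank tag pairs in RULES, the function name apply_grammar_rules, and the assumption that a word-to-tag mapping is already available (produced, in the report, by the Hunpos tagger [5]) are illustrative choices rather than the report's actual implementation, which is listed in Appendix A.

# Simplified tag-pair rules from Section 2.2: determiner -> noun, adjective -> noun,
# adjective -> adjective, proper noun / pronoun -> verb.
RULES = {
    ("DT", "NN"), ("DT", "NNS"),
    ("JJ", "NN"), ("JJ", "NNS"),
    ("JJ", "JJ"),
    ("NNP", "VB"), ("PRP", "VB"),
}

def apply_grammar_rules(followers, tags):
    """Duplicate followers whose (current, next) PoS tags match a rule,
    increasing their probability of being chosen during generation.
    `tags` maps each word to its PoS tag (assumed here to come from Hunpos [5])."""
    boosted = {}
    for word, nexts in followers.items():
        new_nexts = list(nexts)
        for nxt in nexts:
            if (tags.get(word), tags.get(nxt)) in RULES:
                new_nexts.append(nxt)  # repeat the word -> higher probability
        boosted[word] = new_nexts
    return boosted

# Example from Section 2.2: 'the' followed by 'best' and 'dog'.
followers = {"the": ["best", "dog"]}
tags = {"the": "DT", "best": "JJS", "dog": "NN"}
print(apply_grammar_rules(followers, tags))
# {'the': ['best', 'dog', 'dog']} -- 'the dog' fits the Det -> Noun rule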