Multiple Choice Question Answering using a Large Corpus of Information

A dissertation submitted to the faculty of the Graduate School of the University of Minnesota
by
Mitchell Kinney
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Xiaotong Shen, Adviser

July 2020

© Mitchell Kinney 2020

Acknowledgements

I am very thankful for my adviser, Xiaotong Shen. He has consistently provided valuable feedback and direction in my graduate career. His input when I was struggling and feeling dismayed at results was always encouraging and pushed me to continue onward. I also enjoyed our talks about topics unrelated to statistics that helped remind me not to let my studies completely consume my life. I would like to thank my committee members Jie Ding, Wei Pan, Maury Bramson, and Charlie Geyer for participating in my preliminary and final exams and for their feedback on improving my research. I am grateful to Galin Jones for his open door and help on side projects throughout the years. Yuhong Yang was an exceptional mentor and instructor, and I am thankful for his constant encouragement. Thank you also to Hui Zou, Charles Doss, Birgit Grund, Dennis Cook, Glen Meeden, and Adam Rothman for their excellent instruction and their helpful office-hour guidance during my first few years. The office staff at the statistics department (a special thank you to Taryn Verley, who has been a constant) has been amazingly helpful and allowed me to stress only about my studies and research.

My peers in the statistics department and at the university have made my time in Minnesota immensely enjoyable. Thank you to Aaron Molstad, Adam Maidman, Dootika Vats, Dan Eck, Karl Oskar, Haema Nilakanta, Sakshi Arya, Matt Galloway, Wenjun Lang, Yiyi Yin, Yunan Wu, Trevor Knuth, James Burrell, Sarah Sernaker, Ziyue Zhu, Riddhiman Bhattacharya, Emily Kurtz, Luke Jacobsen, Austin Brown, Yu Yang, Marten Thompson, and Tate Jacobson.

Finally, I am infinitely appreciative of my family. Thank you to my dogs Callie and Freddie and my cat Mango for their companionship. Thank you to my sister Mackenzie for her positivity and encouragement. Thank you to my mom Marcie and dad Tim for their unwavering support and dedication to my success. Thank you to my wife Hannah, who has brought my life so much happiness. I love you all.

Dedication

To my teachers, for nurturing my love of learning.

Abstract

The amount of natural language data is massive, and the potential to harness the information it contains has led to many recent discoveries. In this dissertation I explore one aspect of learning: answering multiple choice questions using information from a large corpus. I chose this topic because of an internship at NASA's Jet Propulsion Laboratory, where there is a growing interest in making rovers more autonomous in their field research. Being able to process information and act on it correctly is a key stepping stone toward that goal, and it is an aspect my dissertation covers. The chapters comprise a review of early embedding methods and two novel approaches to building multiple choice question answering mechanisms.

In Chapter 2 I review popular algorithms that create word and sentence embeddings from surrounding context. These embeddings are numerical representations of the language data that can be used in downstream models such as logistic regression.
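As a hypothetical illustration of this use of embeddings as features (a minimal sketch, not code from the dissertation), the snippet below averages pre-trained word vectors into fixed sentence embeddings and feeds them to a scikit-learn logistic regression; the tiny word_vectors dictionary and the topic labels are invented for the example.

# Minimal sketch: fixed embeddings as features for a downstream classifier.
# The word vectors below are toy placeholders, not trained embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

word_vectors = {
    "planets": np.array([0.9, 0.1, 0.0]),
    "orbit":   np.array([0.8, 0.2, 0.1]),
    "cells":   np.array([0.1, 0.9, 0.2]),
    "divide":  np.array([0.0, 0.8, 0.3]),
}

def embed(sentence):
    # Average the vectors of in-vocabulary tokens into one fixed-length embedding.
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

X = np.stack([embed("planets orbit"), embed("cells divide")])  # document embeddings
y = np.array([0, 1])                                           # toy topic labels
clf = LogisticRegression().fit(X, y)
print(clf.predict([embed("planets")]))                         # predicts topic 0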
In Chapter 3 I present a novel method for creating a domain-specific knowledge base that can be queried to answer multiple choice questions from a database of elementary school science questions. The knowledge base has a graph structure and is trained using deep learning techniques. The classifier creates an embedding to represent the question and its answer choices. This embedding is then passed through a feed-forward network to determine the probability that an answer is correct. We train on questions and general information from a large corpus in a semi-supervised setting.

In Chapter 4 I propose a strategy to train a network that simultaneously classifies multiple choice questions and learns to generate words relevant to the surrounding context of the question. Using the Transformer architecture in a Generative Adversarial Network, together with an additional classifier, is a novel approach to training a network that is robust against data not seen in the training set. This semi-supervised training regimen also uses sentences from a large corpus of information and Reinforcement Learning to better inform the generator of relevant words.

Contents

List of Tables
List of Figures
1 Introduction
2 Creating Embeddings of Phrases, Sentences and Words
  2.1 Introduction
  2.2 Background
  2.3 Word2vec and Doc2vec
    2.3.1 Skip Thought and Universal Sentence Representation
  2.4 My Study
    2.4.1 Bag-of-Words and Doc2vec
    2.4.2 Doc2vec and Word2vec
    2.4.3 Skip Thought and Universal Sentence Encoder
  2.5 Question Answering
  2.6 Conclusion
3 Domain Specific Knowledge Base with a Graph Neural Network
  3.1 Introduction
  3.2 Previous Work
  3.3 Computation
    3.3.1 The Data
    3.3.2 Knowledge Base
    3.3.3 Initialization
    3.3.4 Graph Network Block
  3.4 Results
    3.4.1 Training Regime
    3.4.2 Performance
  3.5 Ablation Study
  3.6 Conclusion
4 Classification and a Generative Adversarial Network
  4.1 Introduction
  4.2 Previous Work
  4.3 Model Architecture
    4.3.1 Set Up
    4.3.2 Models
    4.3.3 Generator
    4.3.4 Discriminator and Classifier
    4.3.5 Training notes
  4.4 Simulation Study
  4.5 Results
    4.5.1 Classifier Accuracy
    4.5.2 Generated Language Quality
    4.5.3 Ablation Study
  4.6 Proofs
    4.6.1 Proof of Equation (4.22)
5 Conclusion
References

List of Tables

2.1 News20 Organization
2.2 News20 Comparing Bag of Words and Doc2vec
2.3 Log word2vec example A
2.4 Log word2vec example B
2.5 Max word2vec example
2.6 News20 Comparing Word2vec Classifiers and Doc2vec
2.7 Correct and Incorrect Prediction Counts
3.1 An estimate of the proportion of classes within the ARC dataset, from [Xu et al., 2019]
3.2 Accuracy and root mean square error of word importance
3.3 Number of questions in the subsets for different domains
3.4 Testing accuracy on domains
3.5 A selection of methods' performances on all questions
4.1 Estimates and standard errors of unknown parameters
4.2 Estimates and standard errors of unknown parameters from the second simulation
4.3 Classifier Results Table
4.4 Number of questions
4.5 Data Reduction Table
4.6 FED Table
4.7 Perplexity Table
4.8 Percent Same Table

List of Figures

1.1 Image of the Mars rover Curiosity accompanied by an article about finding methane. Credit: NASA/JPL-Caltech/MSSS
2.1 Bag-of-Words Example
2.2 Word2vec diagram for the two strategies of building word embeddings
2.3 Doc2vec diagram for the two strategies of building sentence embeddings
2.4 Example of an 'easy' question from the ARC dataset
2.5 Challenging question example
3.1 A Graph Network as defined by [Battaglia et al., 2018]
3.2 Example multiple choice question from ARC
3.3 Plot of training errors over five runs
3.4 Example of replacing text in a cloze style
3.5 General version of a message passing block from [Battaglia et al., 2018]
3.6 One block of my network
3.7 Training error plots for different domains
3.8 Question answered incorrectly, with softmaxed scores in parentheses. The correct answer is A. Heat Energy
3.9 Question answered incorrectly, with softmaxed scores in parentheses. The correct answer is B. Conduction... hot chocolate to the spoon
3.10 Question answered incorrectly, with softmaxed scores in parentheses. The correct answer is B. Fossil fuels
3.11 Question answered correctly, with softmaxed scores in parentheses. The correct answer is C. Catalyze...
3.12 Training error plots for different domains
4.1 Example of how context can change the correctness of an answer and how to identify important words to answer the question
4.2 Transformer architecture from [Vaswani et al., 2017]
4.3 Diagram of a cell in each layer of the Transformer from [Vaswani et al., 2017]
4.4 BERT model example from [Devlin et al., 2018]
4.5 Architecture of the model from [de Masson d'Autume et al., 2019]
4.6 Example multiple choice question from ARC
4.7 Model architecture used to determine the importance of individual words utilizing sentence context
4.8 Generator structure. Words are pre-assigned an importance probability in the Attention Network. During training the Generator uses these probabilities to randomly mask words and then produces new words to earn the highest possible reward from the Discriminator and Classifier
4.9 Training values of β (left) and the σi's (right) with respect to batch from the first simulation
4.10 Deviations overlaid with the true distribution from the first simulation
4.11 Training values of β (left) and the σi's (right) with respect to batch from the second simulation