Answering Reading Comprehension Tests: N-Gram and Multi-View Regression
Dept. of CIS - Senior Design 2009-2010

Andrew Pak, Daniel Seunghyun Kim, Lyle H. Ungar
[email protected], [email protected], [email protected]
Univ. of Pennsylvania, Philadelphia, PA

ABSTRACT
Computer "reading" of natural languages is a growing research topic with numerous applications, ranging from artificial intelligence to linguistics. A successful "reading" of natural language can be interpreted in two ways: 1) extracting facts out of a passage and 2) capturing the "meaning" of a text. We focus on the latter interpretation, defining "meaning" as we do for humans: can the system answer the same reading comprehension questions that are given to early learners of a language?

We present two approaches to answering cloze questions, which are fill-in-the-blank, multiple-choice reading comprehension questions. The first approach uses n-grams, a basic statistical methodology that captures the local context of a passage to predict an answer. The second approach uses Multi-View Regression ("MVR"), based on canonical correlation analysis ("CCA"), which captures both the local and the broader context. In this statistical approach, the "meaning" of each word is represented as a real-valued state vector associated with each word occurrence. MVR is a dynamical belief net which, like a Kalman filter or Hidden Markov Model ("HMM"), estimates the state based on the preceding and following sequence of words and predicts the word to be "emitted" based on that state.

While there has been much research on these statistical methodologies in the field of machine learning, and on the generation and evaluation of cloze questions in the field of natural language processing, the connection between the two has not been explored much. Prior research on cloze questions suggests that proper context capturing leads to correct answers. We create a system that reads in a set of cloze questions and predicts two sets of answers with the two statistical methodologies, which have different context-capturing schemes, and then analyze the results. Because MVR captures both the local and the broader context, it answers cloze questions better than n-grams, which capture only the local context.

1. INTRODUCTION
If computers can "read" natural languages, the benefit will be enormous to various fields and industries. We will be one step closer to creating self-learning robots, which has been a goal of artificial intelligence researchers for many years. There could also be numerous applications in linguistics and natural language processing. For example, word sense disambiguation, the process of identifying which "meaning" of a word is used in a given sentence when the word has a number of distinct "meanings", is a long-standing open problem that could be addressed.

How can one test whether a machine is correctly "reading" natural language? Successfully "reading" a passage can be interpreted in two ways: as extracting facts from the text, or as comprehending the "meaning" of the text. Obtaining facts from a passage can be thought of as building a knowledge database out of tuples that contain information. For example, (Bob, birthday, 6/28/1965) could be a tuple extracted from a passage about Bob. The task of extracting facts from passages is heavily researched by fact engines, which will be introduced in the "Related Work" section. Constructing and maintaining such a knowledge database is valuable, but it is costly and provides no context for the passage. We focus instead on comprehending the "meaning" of passages by capturing context.

Several approaches can be used to capture context. HMMs and n-grams capture only the local context, while Latent Semantic Indexing ("LSI") and Latent Dirichlet Allocation ("LDA") capture only the document-level context, or topic. MVR captures both the local and the broader context. We build a system that uses the n-gram and MVR approaches to answer a set of cloze questions and compare the results of the two.

To see how well each approach captures the "meaning" of passages, both are tested against a set of cloze questions. Cloze questions are a type of reading comprehension question that can normally be answered with "yes" or "no", a specific piece of information, or a selection from multiple choices. Our cloze question answering system focuses on fill-in-the-blank questions with multiple choices. An example of such a question is:

Anna and her mother boiled water in a big pot and put the berries into it. The water turned into a deep red. Anna's mother dipped the pale yarn into it. Soon red yarn was hanging up to dry on a clothesline strung across the kitchen. Anna and her mother changed the ___ of the yarn.
1. shape
2. color
3. size
4. taste

In the above example, the last sentence is the cloze question, and the sentences prior to the question are considered the story related to that question.
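For concreteness, a question of this form can be encoded as a small record holding the story, the cloze sentence, and the choices. The sketch below is our own illustration of such a representation, not the paper's actual data format; the class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClozeQuestion:
    """A fill-in-the-blank, multiple-choice reading comprehension item."""
    story: str           # sentences giving context for the question
    question: str        # cloze sentence, with "___" marking the blank
    choices: List[str]   # candidate words for the blank
    answer: int          # index of the correct choice, if known

# The example from the text, encoded as data.
example = ClozeQuestion(
    story=("Anna and her mother boiled water in a big pot and put the "
           "berries into it. The water turned into a deep red. Anna's "
           "mother dipped the pale yarn into it. Soon red yarn was hanging "
           "up to dry on a clothesline strung across the kitchen."),
    question="Anna and her mother changed the ___ of the yarn.",
    choices=["shape", "color", "size", "taste"],
    answer=1,  # "color"
)
```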
Stories and questions similar to the example are obtained from Lexile, a company that sells language-learning products. Eight different levels of reading comprehension tests, each composed of about 50 questions, are used. With these cloze questions from Lexile, for which the student response data set and correct answers are given, the system analyzes the predictions from the two approaches.

The MVR approach requires a training corpus before a question can be fed in. Project Gutenberg is an online digital library that contains around 30,000 novels and documents. From this library, novels and documents with a reading level and vocabulary similar to the Lexile stories and questions are filtered to be used as a training corpus.

2. RELATED WORK
As our project incorporates statistical algorithms and cloze question answering, related work spans two fields: statistical methodologies and question answering. For statistical methodologies, we look at two approaches, n-grams and MVR. For question answering, related work includes fact-based question answering and cloze question answering.

2.1 Statistical Methodologies

2.1.1 N-gram
The task of predicting the next word can be stated as attempting to estimate the probability of the next word given the previous words. In such a stochastic problem, a classification of the previous words, the history, can be used to predict the next word. Having looked at a lot of text, one can identify which words tend to follow other words. One cannot possibly consider each textual history separately: most of the time one will be listening to a sentence one has never heard before, so there is no previous identical textual history on which to base predictions, and even if one had heard the beginning of the sentence before, it might end differently this time. One therefore needs a method of grouping histories that are similar in some way, so as to give reasonable predictions about which words to expect next. By placing all histories that have the same last n-1 words in the same equivalence class and grouping them as a phrase, one can predict the next word by looking at the last word of such n-grams [6].

In the n-gram approach, each cloze question is fed directly into the system. The system searches the Google 3-gram data and finds the words that can go into the blank, along with their occurrence counts. Among the possible multiple choices, the word with the highest occurrence count in the Google 3-gram data is output as the prediction.
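The paper describes this lookup-and-count scheme only in prose, so here is a minimal sketch of the idea. It assumes the 3-gram counts have already been loaded into an in-memory dictionary, which is our simplification; the real system queries the full Google 3-gram corpus, and the function name, tokenization, and toy counts below are ours.

```python
from typing import Dict, List, Tuple

Trigram = Tuple[str, str, str]

def answer_cloze(question: str, choices: List[str],
                 counts: Dict[Trigram, int]) -> str:
    """Pick the choice whose trigrams around the blank are most frequent."""
    tokens = question.lower().rstrip(".").split()
    blank = tokens.index("___")
    best_choice, best_score = choices[0], -1
    for choice in choices:
        filled = tokens[:blank] + [choice] + tokens[blank + 1:]
        # Sum the counts of every 3-gram that covers the blank position.
        score = sum(counts.get(tuple(filled[i:i + 3]), 0)
                    for i in range(max(0, blank - 2), blank + 1)
                    if i + 3 <= len(filled))
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice

# Toy counts standing in for Google 3-gram lookups; the numbers are
# invented purely for illustration.
toy_counts: Dict[Trigram, int] = {
    ("changed", "the", "color"): 950,
    ("the", "color", "of"): 12000,
    ("changed", "the", "size"): 620,
    ("changed", "the", "shape"): 410,
    ("changed", "the", "taste"): 85,
}

print(answer_cloze("Anna and her mother changed the ___ of the yarn.",
                   ["shape", "color", "size", "taste"], toy_counts))
# -> color
```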
2.1.2 Multi-View Regression ("MVR")
MVR is a variation on the traditional Hidden Markov Model ("HMM") that incorporates Canonical Correlation Analysis ("CCA") and logistic regression. A traditional HMM is a statistical model, fit with the Expectation-Maximization algorithm, in which the system being modeled is assumed to be a Markov process with unobserved states: the states are not directly visible, but the outputs are, and each state has a probability distribution over the possible output tokens. Thus, given some output tokens, an HMM gathers information about the hidden states that produced that output.

MVR is similar to HMM in that it also assumes there exists some set of hidden states that produced the given output tokens, and that these states have a probability distribution over the possible output tokens. MVR is based on the assumption that each entry in the data series can be completely explained by two views: the entries preceding and the entries following the target entry. Under this assumption, MVR uses CCA to learn an optimal mapping from the previous or following entries to a latent state space. The hidden states found by CCA are then used as features for supervised learning, to predict which entry will be seen as a function of the state.

MVR differs from HMM in that HMM assumes each state already contains all past information that led to that state, so when estimating the next state, HMM looks only at the state immediately prior to the target state. MVR, on the other hand, looks at the sequence of states prior to the target state by running a regression on the hidden states. Thus, for cloze question answering, MVR is more effective than HMM, since MVR uses all past information to make predictions while HMM does not.

Further, MVR has a significant advantage over HMM in that MVR is a linear method, which makes it faster, more robust, and more scalable to large data sets than HMM. Also, the Expectation-Maximization algorithm used by HMM suffers from convergence problems, which MVR does not.
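The paper gives no implementation details for MVR, but the two-view construction it describes can be sketched with off-the-shelf tools: build one feature vector from the words preceding each position and one from the words following it, let CCA find a shared latent state, and train a classifier on that state to predict the word. Everything below (the toy corpus, the bag-of-words featurization, scikit-learn's CCA and LogisticRegression) is an illustrative stand-in for the real pipeline trained on Project Gutenberg text, not the authors' implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import LogisticRegression

def bag_of_words(tokens, vocab, positions):
    """Count vector over the words at the given positions (toy featurization)."""
    vec = np.zeros(len(vocab))
    for p in positions:
        vec[vocab[tokens[p]]] += 1
    return vec

# Toy corpus; the real system trains on filtered Project Gutenberg text.
corpus = ("the cat sat on the mat the dog sat on the rug "
          "the cat lay on the rug the dog lay on the mat").split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}

# Two views of each target word: the k words before it and the k words after.
k = 2
targets, left_view, right_view = [], [], []
for t in range(k, len(corpus) - k):
    targets.append(vocab[corpus[t]])
    left_view.append(bag_of_words(corpus, vocab, range(t - k, t)))
    right_view.append(bag_of_words(corpus, vocab, range(t + 1, t + k + 1)))
X_left, X_right = np.array(left_view), np.array(right_view)

# CCA finds projections of the two context views that are maximally
# correlated; the projected coordinates act as the latent "state".
cca = CCA(n_components=2).fit(X_left, X_right)
state_left, state_right = cca.transform(X_left, X_right)
states = np.hstack([state_left, state_right])

# The latent states then serve as features for a supervised word
# predictor, standing in for the regression step of MVR.
clf = LogisticRegression(max_iter=1000).fit(states, targets)
print("training accuracy:", clf.score(states, targets))
```

Because CCA and the regression are both linear, every step is a matrix factorization or least-squares-style fit, which is the scalability and convergence advantage over EM-trained HMMs noted above.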