Automatic Question Generation using Discourse Cues and Distractor Selection for Cloze Questions

Thesis submitted in partial fulfillment of the requirements for the degree of

MS by Research in Computer Science with specialization in NLP

by

Rakshit Shah 200702041 [email protected]

Language Technology and Research Center (LTRC)
International Institute of Information Technology
Hyderabad - 500032, INDIA
July 2012

Copyright © Rakshit Shah, 2012
All Rights Reserved

International Institute of Information Technology Hyderabad, India

CERTIFICATE

This is to certify that the thesis entitled “Automatic Question Generation using Discourse Cues and Distractor Selection for Cloze Questions” submitted by Rakshit Shah to International Institute of Information Technology, Hyderabad, for the award of the Degree of Master of Science (by Research) is a record of bona fide research work carried out by him under my supervision and guidance. The contents of this thesis have not been submitted to any other university or institute for the award of any degree or diploma.

Date                                        Adviser: Prof. Rajeev Sangal

To my Parents

Acknowledgments

First and foremost I offer my sincerest gratitude to my supervisor, Dr Rajeev Sangal, Head of the Language Technologies Research Centre (LTRC), International Institute of Information Technology - Hyderabad, who has supported me throughout my thesis with his patience and knowledge whilst allowing me the room to work in my own way. I am thankful to my mentor and guide, Prashanth Mannem, for his wide knowledge and guidance throughout this work. I attribute the level of my Masters degree to his encouragement and effort; without him this thesis, too, would not have been completed or written. One simply could not wish for a better or friendlier mentor. I am deeply grateful to Professor Dipti Misra Sharma for her detailed and constructive comments. Her logical way of thinking has been of great value for me. In my daily work I have been blessed with a friendly and cheerful group of fellow students. Interesting discussions about life with Rahul Agarwal, Manish Agarwal, Abhinav Goel, Rohit Nigam and many more at the cafeteria or in the lab have kept me sane throughout my studies. Shubhangi Sharma kept me on track by giving much needed, motivating but boring, lectures time and again and telling me to hang in there. Abhinav Goel kept us entertained with his huge repertoire of anecdotes and stories. He can sense your achievement and was there to celebrate at every stepping stone of my project. Shashank Sahni has fascinated me with his interest in Linux and his ability to get software installed on any computer system. If it had not been for him, I wouldn’t have started my work on QG when I started it. Harshit Sureka always made sure that I wasn’t very stressed out with work and often planned unplanned trips to various places. Late night basketball games after a long day in the lab with Romit, Shubhangi and Yasir cleared my mind and helped me get a good night’s sleep. I am very thankful to Manish Agarwal, my partner in this work, who kept me motivated and supported me throughout the project right from the beginning. The LTRC has provided the support and equipment I needed to produce and complete my thesis. Finally, I thank my parents for supporting me throughout all my studies at the University.

Abstract

A question may be either a linguistic expression used to make a request for information, or else the request itself made by such an expression. This information may be provided in the form of an answer. Asking questions is a fundamental cognitive process that underlies higher-level cognitive abilities such as comprehension and reasoning. The ability to ask questions is the central cognitive element that distinguishes human and animal cognitive abilities. Questions are used from the most elementary stage of learning to original research. Question Generation (QG) is the task of automatically generating questions from various inputs such as raw text, a database, or a semantic representation. Ultimately, QG allows humans, and in many cases artificial intelligence systems, to understand their environment and each other. Research on QG has a long history in artificial intelligence, psychology, education, and natural language processing.

The present work describes automatic Question Generation Systems that take natural language text as input and generate questions of various types and scope for the user. Our aim is to generate questions that assess the content knowledge that a student has acquired upon reading a text rather than vocabulary or grammar assessment or language learning. In this work, we have described two automatic question generation systems. Both these systems factor the QG process into several stages, enabling more or less independent development of particular stages.

The QG system, described in Chapter 2, generates questions automatically using discourse connectives for different question types. We describe an end-to-end system that takes a document as input and outputs all the questions for selected discourse connectives. The selected discourse connectives include four subordinating conjunctions, since, when, because and although, and three adverbials, for example, for instance and as a result. Our system factors the QG process into two stages: content selection (the text selected for question generation) and question formation (transformations on the content to get the question). The question formation module further has the modules of (i) finding a suitable question type (wh-word), (ii) auxiliary and main verb transformations and (iii) rearranging the phrases to get the final question. The system has been evaluated for syntactic and semantic soundness of the questions by two evaluators. The overall system has been rated 6.3 out of 8 on the QGSTEC development dataset and 5.8 out of 8 on the Wikipedia dataset. We have shown that some specific discourse relations, such as causal, temporal and result, are more important than others from the QG point of view. This work also shows that discourse connectives are good enough for QG and that there is no need for full-fledged discourse parsing. We have generated questions using discourse connectives, paving the way for medium and specific scope questions.

The cloze question generation (CQG) system, described in Chapter 3, takes a document as input and outputs the important cloze questions. Our system factors the CQG process into three stages: (i) sentence selection, (ii) keyword selection and (iii) distractor selection. A domain-dependent approach is described for the distractor selection module of the CQG system. The system is implemented for and tested on examples from the sports domain. The system is evaluated using the guidelines described in this work. The accuracy of the distractors is 3.05 (Eval-1), 3.14 (Eval-2) and 3.5 (Eval-3) out of 4. With the main focus being on distractor selection, we have shown the influence of domain on the quality of distractors.

Contents


1 Introduction
  1.1 Introduction
    1.1.1 What is a question?
    1.1.2 What is a good question?
    1.1.3 Importance of questions in Learning
  1.2 What is Question Generation?
  1.3 Question Generation and its applications
  1.4 Classification of Questions
    1.4.1 Based on Question-type
    1.4.2 Based on Scope
  1.5 Problem Statement
  1.6 Contribution of the Thesis
  1.7 Thesis organization

2 Automatic Question Generation using Discourse Cues
  2.1 Overview
  2.2 Introduction
  2.3 Related Work
  2.4 Selection of Discourse Connectives
  2.5 Discourse connectives for QG
    2.5.1 Question type identification
    2.5.2 Target arguments for discourse connectives
  2.6 Target Argument Identification
    2.6.1 Locate syntactic head
    2.6.2 Target Argument Extraction
  2.7 Syntactic Transformations and Question Generation
  2.8 Evaluation and Results
  2.9 Error Analysis
    2.9.1 Co-reference resolution
    2.9.2 Parsing Errors
    2.9.3 Errors due to the inter-sentential connectives
    2.9.4 Fluency issues
  2.10 Conclusions


3 Distractor Selection for Cloze Questions
  3.1 Overview
  3.2 Introduction
  3.3 Related Work
  3.4 Approach
    3.4.1 Sentence Selection
    3.4.2 Keywords Selection
  3.5 Distractor Selection
    3.5.1 Select distractors from a single team
    3.5.2 Select distractors from both the teams
    3.5.3 Select distractors from any team
  3.6 Evaluation and Results
  3.7 Conclusions

4 Conclusion and Future Work
  4.1 Conclusions
  4.2 Future Work

Bibliography

List of Figures


1.1 QG Process in general

2.1 Three-stage framework for automatic question generation
2.2 Questions for discourse connective when
2.3 Head selection of the target argument for intra-sentential connectives (V1, V2: finite verbs; X, Z: subtrees of V1; A: subtree of V2; P, Q: not verbs; DC: discourse connective (child of V2))
2.4 Question Generation process

3.1 Distractor Selection Method

List of Tables


1.1 Sample paragraph and questions with varying scope

2.1 Coverage of the selected discourse connectives in the data
2.2 Question type for discourse connectives
2.3 Target argument for discourse connectives
2.4 Evaluation guidelines for syntactic correctness measure
2.5 Results on QGSTEC-2010 development dataset
2.6 Results on the Wikipedia data (cricket, football, basketball, badminton, tennis)
2.7 Examples

3.1 Knowledge Base
3.2 Evaluation Guidelines
3.3 Cloze Question Evaluation Results

Chapter 1

Introduction

1.1 Introduction

1.1.1 What is a question?

A question may be either a linguistic expression used to make a request for information, or else the request itself made by such an expression. This information may be provided in the form of an answer. Asking questions is a fundamental cognitive process that underlies higher-level cognitive abilities such as comprehension and reasoning. The ability to ask questions is the central cognitive element that distinguishes human and animal cognitive abilities.

1.1.2 What is a good question?

A good question is relatively short, clear, and unambiguous. A few qualities of a good question are as follows:

1. Evokes the truth.

2. Asks for an answer on only one dimension. A question that asks for a response on more than one dimension will not provide the information you are seeking. For example, a researcher investigating a new food snack asks “Do you like the texture and flavor of the snack?” If a respondent answers no, then the researcher will not know if the respondent dislikes the texture or the flavor, or both.

3. Can accommodate all possible answers. Multiple choice items are the most popular type of survey questions because they are generally the easiest for a respondent to answer and the easiest to analyze.

4. Has mutually exclusive options. A good question leaves no ambiguity in the mind of the respondent. There should be only one correct or appropriate choice for the respondent to make.

5. Does not presuppose a certain state of affairs.

6. Does not use emotionally loaded or vaguely defined words. Quantifying adjectives (e.g., most, least, majority) are frequently used in questions. It is important to understand that these adjectives mean different things to different people.

1.1.3 Importance of questions in Learning

One of the most important uses of questions is reflection, improving our understanding of things we have found out. People often spend hours by themselves contemplating ideas and working through issues raised by what they have read. These ideas and issues are often articulated in the form of questions. Questions are used from the most elementary stage of learning to original research. In the scientific method, a question often forms the basis of the investigation and can be considered a transition between the observation and hypothesis stages. Students of all ages use questions in their learning of topics, and the skill of having learners create “investigatable” questions is a central part of inquiry education. The method of questioning student responses may be used by a teacher to lead the student towards the truth without direct instruction, and also helps students to form logical conclusions. Questions have also been used to develop students’ interest in a topic. Another use of questions is to give students a map for self-recognition of reaching milestones of understanding as they study a unit, by asking, when a unit is begun, some difficult questions that require understanding and insight into the material to be able to answer. A widespread and accepted use of questions in an educational context is the assessment of students’ knowledge through exams.

1.2 What is Question Generation?

Question Generation (QG) is the task of automatically generating questions from various inputs such as raw text, a database, or a semantic representation. Although automatic QG can be approached with various techniques, QG is basically regarded as a discourse task involving the following four steps: (1) when to ask the question, (2) what the question is about, i.e., content selection, (3) question type identification, and (4) question construction. The general Question Generation process is described in Figure 1.1 below.

Figure 1.1 QG Process in general (text-to-question generation: target selection, question type selection, and question construction map a source text to an answer, a question type, and a question)
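To make this four-step decomposition concrete, the following minimal Python sketch (not part of the original system; all function names and bodies are illustrative placeholders) shows how the stages compose for a toy input:

# A minimal, hypothetical sketch of the four-step QG view described above.
# The real systems in Chapters 2 and 3 implement each step very differently;
# this only illustrates how the stages fit together.

def select_content(text):
    """Steps 1-2: decide when to ask and what to ask about (naive split)."""
    return [s.strip() for s in text.split(".") if s.strip()]

def identify_question_type(content):
    """Step 3: pick a question type for the selected content."""
    return "why" if "because" in content else "yes/no"

def construct_question(content, qtype):
    """Step 4: build the surface form of the question (placeholder)."""
    return "[%s] question over: %s" % (qtype, content)

def generate_questions(text):
    return [construct_question(c, identify_question_type(c))
            for c in select_content(text)]

print(generate_questions("Competitive badminton is played indoors because "
                         "shuttlecock flight is affected by wind."))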

1.3 Question Generation and its applications

Question generation - the purposeful posing and answering of questions about what is read - serves the goal of reading comprehension instruction not only of its own accord, but also in conjunction with multiple reading comprehension strategies. QG, in addition to being a natural precursor to Question Answering (QA), is an excellent complement to another proven strategy - comprehension monitoring. Often referred to as self-regulation, comprehension monitoring translates into meta-cognitive awareness and students’ abilities to self-select and employ questioning strategies on a situational basis. Students learn to independently and actively select and use strategies that help them better comprehend text material. Notably, some of the most favorable gains in students’ abilities to critique and improve the quality of their own questions and those of other students have been found to occur in conjunction with comprehension monitoring instruction.

QG has played a large role in reciprocal teaching [37, 48]. The National Reading Panel identified self-questioning as the single most effective reading comprehension strategy to teach - that is, teaching children to ask themselves questions about text as they read it, as opposed to teachers asking questions, except as examples to demonstrate the self-questioning strategy [48]. Of the four principal strategies (summarization, question generation, clarification, and prediction) used in combination during reciprocal teaching, question generation is the strategy most frequently incorporated.

QG and QA are key challenges facing systems that interact with natural languages. The potential benefits of using automated systems to generate questions include reducing the dependency on humans to generate questions and meeting other needs associated with systems interacting with natural languages.

Ultimately, QG allows humans, and in many cases artificial intelligence systems, to understand their environment and each other. Research on QG has a long history in artificial intelligence, psychology, education, and natural language processing. One thread of research has been theoretical, with attempts to understand and specify the triggers (e.g., knowledge discrepancies) and mechanisms (e.g., association between type of knowledge discrepancy and question type) underlying QG. The other thread of research has focused on automated QG. Applications of automated QG facilities are endless and far-reaching. A few examples are listed below:

1. Suggested good questions that learners might ask while reading documents and other media.

2. Questions that human and computer tutors might ask to promote and assess deeper learning.

3. Suggested questions for patients and caretakers in medicine.

4. Suggested questions that might be asked in legal contexts by litigants or in security contexts by interrogators.

5. Questions automatically generated from information repositories as candidates for Frequently Asked Question (FAQ) facilities.

1.4 Classification of Questions

It is important to classify questions because different classes of questions require different strategies for their automatic generation.

1.4.1 Based on Question-type

• Yes/No Questions: In linguistics, a yes-no question, formally known as a polar question, is a question whose expected answer is either yes or no. Formally, they present an exclusive disjunction, a pair of alternatives of which only one is acceptable. In English, such questions can be formed in both positive and negative forms (For example, Will you be here tomorrow? and Won’t you be here tomorrow?).

• Wh-Questions: Wh-questions use interrogative words1, such as why, when, who, where, which, etc., to request information. They cannot be answered with a yes or no. Non-polar wh-questions are in contrast with polar yes-no questions, which do not necessarily present a range of alternative answers, or necessarily restrict that range to two alternatives. (For example, What time did you come home last night?)

• Fill-in-the-blank Questions: A Fill-in-the-blank question, also known as a Cloze question, is a sentence with one or more blanks in it, with four alternatives to fill those blanks. Example: ________ carried the burden of our nation for 21 years, said the youngster Kohli. (a) Sachin (b) Sehwag (c) Raina (d) Dhoni

1.4.2 Based on Scope

The question generation shared task evaluation challenge has been defined such that it is application-independent. Application-independent means questions will be judged based on content analysis of the input paragraph; questions whose answers span more input text are ranked higher. Described below are the categories into which the questions are classified.

• Specific: A question whose answer span is a word, a phrase, a clause or a sentence is said to be of specific scope.

• Medium: A question whose answer span is multiple clauses or sentences is considered to be of medium scope.

• General: A question whose answer spans a paragraph or almost an entire paragraph is considered to be of general scope.

1In linguistics, an interrogative word or question word is a function word used to ask a question, such as where or what. They are sometimes called wh-words, because in English most of them start with wh-.

Consider the example in Table 1.1:

Input Text: There is much debate on how to make a traditional ratatouille. One method is to simply sautee all of the vegetables together. Some cooks, including Julia Child, insist on a layering approach, where the eggplant and the courgettes are sauteed separately, while the tomatoes, onion, garlic and bell peppers are made into a sauce. The ratatouille is then layered in a casserole - eggplant, courgettes, tomato/pepper mixture - then baked in an oven. When ratatouille is used as a filling for savory crepes or to fill an omelette, the pieces are sometimes cut smaller. Also, unnecessary moisture is reduced by straining the liquid with a colander into a bowl, reducing it in a hot pan, then adding one or two tablespoons of reduced liquid back into the vegetables.

Questions:
1) How is ratatouille cooked? (or equivalents, e.g. How is ratatouille made?, How can someone cook ratatouille?, etc.)
2) How is ratatouille for crepes cooked? OR How is ratatouille for an omelette made? OR What is the layering approach to cook ratatouille? OR What is the sautee-all method to cook ratatouille?
3) How is moisture reduced when making ratatouille? OR What vegetables are sauteed in the layering approach? OR What ingredients are needed for making ratatouille?
4) Is Julia Child a proponent of the sautee-all approach to cooking ratatouille?
5) Is ratatouille usually served as a side dish?

Table 1.1 Sample paragraph and questions with varying scope

In Table 1.1, question 1 is a good overall question for the given paragraph. The other questions are of lesser importance with respect to the span of their corresponding answers. It should be noted that the answer to question 4 is implicit (the first three sentences of the paragraph suggest it) but not explicit, whereas the answer to question 5 cannot be inferred from the paragraph and thus the question should not be generated.

1.5 Problem Statement

The aim of this thesis is to build automatic Question Generation Systems that take a natural language text as input and generate questions of various types and scope for the user. Question types include WH-questions, yes/no questions and Fill-in-the-blank or Cloze questions. The scope of the generated questions is either specific or medium.

1.6 Contribution of the Thesis

1. An end-to-end automatic question generation system, generating questions of the types YES/NO and WH, is presented in Chapter 2 that takes a document as input and outputs all the questions for selected discourse connectives.

(a) We have shown that some specific discourse relations, such as causal, temporal and result, are more important than others from the QG point of view.

(b) This work also shows that discourse connectives are good enough for QG and that there is no need for full-fledged discourse parsing.

(c) The system has been evaluated for syntactic and semantic soundness of the question by two evaluators.

2. A cloze question generation (CQG) system that takes a document as input and outputs the important cloze questions is described in Chapter 3.

(a) A domain-dependent approach is described for the distractor selection module of the CQG system.

(b) Our main focus being distractor selection, we have shown the influence of domain on the quality of distractors.

1.7 Thesis organization

This thesis is organized in a number of chapters. In Chapter 2, we explain an approach towards automatic question generation using discourse connectives: we describe an end-to-end automatic question generation system that takes a document as input and outputs all the questions for selected discourse connectives, and give details on how discourse connectives act as important cues for the generation of WH questions and YES/NO questions. Chapter 3 explains the process of automatic Cloze question generation. It explains in detail how the domain influences the quality of distractors. The final chapter presents the conclusions of the thesis and the future work.

Chapter 2

Automatic Question Generation using Discourse Cues

2.1 Overview

This chapter discusses the usability of discourse connectives towards automatic question generation. We describe an end-to-end automatic question generation system that takes a document as input and outputs all the questions for selected discourse connectives, and give details on how discourse connectives act as important cues for the generation of Wh-questions and Yes/No questions.

Discourse connectives1 play a vital role in making the text coherent. They connect two clauses or sentences exhibiting discourse relations such as temporal, causal, elaboration, contrast, result, etc. Discourse relations have been shown to be useful to generate questions [34] but identifying these relations in the text is a difficult task [31]. So in this work, instead of identifying discourse relations and generating questions using them, we explore the usefulness of discourse connectives for QG. We do this by analyzing the senses of the connectives that help in QG and propose a system that makes use of this analysis to generate questions of the type why, when, give an example and yes/no.

The two main problems in QG are identifying the content to ask a question on and finding the corresponding question type for that content. We analyze the connectives in terms of the content useful for question generation based on the senses they exhibit. We show that the senses of the connectives further help in choosing the relevant question type for the content.

In this chapter, we present an end-to-end QG system that takes a document as input and outputs all the questions generated using the selected discourse connectives. The system has been evaluated manually by two evaluators for syntactic and semantic correctness of the generated questions. The overall system has been rated 6.3 out of 8 on the QGSTEC development dataset and 5.8 out of 8 on the Wikipedia dataset.

1In grammar, a conjunction is a part of speech that connects two words, sentences, phrases or clauses together. A discourse connective is a conjunction joining sentences.

2.2 Introduction

Several taxonomies of questions have been developed which may help to guide the study of QG ([22]; [40]; [13]; [3]), though these focus on logical (e.g., whether inference is necessary) or psychological characteristics (e.g., is world knowledge activated) rather than linguistic ones (e.g., lexical overlap, similar constructions or transformations). The generation of interrogative sentences by humans has long been a major topic in linguistics, motivating various theoretical works (e.g., [38]), in particular those that view a question as a transformation of a canonical declarative sentence [6]. In computational linguistics, questions have also been a major topic of study, but primarily with the goal of answering questions [9]. Automatic QG from sentences and paragraphs has caught the attention of the Natural Language Processing (NLP) community in the last few years through the question generation workshops and the shared task in 2010 [39], which aims primarily at establishing a community consensus with respect to a shared task evaluation campaign on Question Generation. As a secondary goal, the workshop aims at boosting research on computational models of Question Generation, an area less studied by the natural language generation community. [44], [41], [12] and [47] are among the very first attempts at building an automatic question generation system. Previous work in this area has concentrated on generating questions from individual sentences [49, 29, 2]. [44] used question templates and [14] used general-purpose rules to transform sentences into questions. A notable exception is [26], who generated questions of various scopes (general, medium and specific)2 from paragraphs. They boil down the QG from paragraphs task into first identifying the sentences in the paragraph with general, medium and specific scopes and then generating the corresponding questions from these sentences using semantic roles of predicates. Discourse connectives play a vital role in making the text coherent. They connect two clauses or sentences exhibiting discourse relations such as temporal, causal, elaboration, contrast, result, etc. Discourse relations have been shown to be useful to generate questions [34] but identifying these relations in the text is a difficult task [31]. So in this work, instead of identifying discourse relations and generating questions using them, we explore the usefulness of discourse connectives for QG. We do this by analyzing the senses of the connectives that help in QG and propose a system that makes use of this analysis to generate questions of the type why, when, give an example and yes/no. The two main problems in QG are identifying the content to ask a question on and finding the corresponding question type for that content. We analyze the connectives in terms of the content useful for question generation based on the senses they exhibit. We show that the senses of the connectives further help in choosing the relevant question type for the content. In this chapter, we present an end-to-end QG system that takes a document as input and outputs all the questions generated using the selected discourse connectives. The system has been evaluated manually

2General scope - entire or almost entire paragraph, Medium scope - multiple clauses or sentences, and Specific scope - sentence or less

by two evaluators for syntactic and semantic correctness of the generated questions. The overall system has been rated 6.3 out of 8 on the QGSTEC development dataset and 5.8 out of 8 on the Wikipedia dataset. The QG system has two modules, content selection (the text selected for question generation) and question formation (transformations on the content to get the question). The question formation module further has the modules of (i) finding a suitable question type (wh-word), (ii) auxiliary and main verb transformations and (iii) rearranging the phrases to get the final question.

2.3 Related Work

This section surveys various approaches to automatic question generation of the wh and yes/no types listed in the previous section. [47] describes a computer system, Ruminator, which learns by reflecting on the information it has acquired and posing questions in order to derive new information. Ruminator takes as input simplified sentences in order to focus on question generation rather than handling syntactic complexity; even so, it is reported that even a single sentence generated 2052 questions. The authors note that it is important to weed out the easy questions as quickly as possible, and use this process to learn more refined question-posing strategies to avoid producing silly or obvious questions in the first place [47]. A key component that appears to be missing from the system design is an estimation of the utility, or informativeness, of an automatically generated question. [41] describes a system for generating questions, in the context of learning aids, which also comprises the NLP components of lexical processing, syntactic processing, logical form, and generation. This system uses summarization as a pre-processing step as a proxy for identifying information that is worth asking a question about. Nevertheless, the authors note that limiting/selecting questions created by the Content QA Generator is difficult [41]. [12] describes a system to generate factoid questions automatically from large text corpora. User questions were then matched against these pre-processed factoid questions in order to identify relevant answer passages in a Question-Answering system. While no examples of automatically generated questions are provided, this study does report a comparison of the retrieval performance using only automatically generated questions and manually-generated questions: 15.7% of the system responses were relevant given automatically generated questions, while 84% of the system responses were deemed relevant with manually-generated questions. The discrepancy in performance indicates that significant difficulties remain. The solution adapted by [49] to solve the QG from sentences task relies on the methodology employed by the multiple-choice question generation system developed by [27, 28]. The system is built on separate components, which perform the following tasks: (i) term extraction and (ii) question generation. After key concepts/terms in a text have been identified, sentences containing these terms are considered by the stem generation component, which checks whether a sentence is suitable for generating questions. The criteria used include whether the terms occur in main clauses or subordinate clauses,

whether the sentence has coordination structures, whether the sentence contains negative statements, etc. The major drawback of this system is that it considers only single sentences for question generation and that, despite the criteria used, some of the important sentences in the text can be missed. [29] deals with the methodology to generate questions focusing on person Named Entities (NE), temporal or location information and agent-based semantic roles associated with the words in the input sentences. The question recognizer module checks whether a specific question type can be generated from a clause in each input sentence and also identifies the possible cue phrase in the clause for the specified question type. The generator module replaces the cue phrase by the question word and reorders the chunks in the clause to ensure grammatical correctness. [2] considered Question Generation from a single input sentence with a question type target (e.g., Who? Where? When? etc.). They divide the task into the following modules:

• Elementary Sentence Construction: This module extracts the elementary sentences from the complex input sentences by syntactically parsing each complex sentence.

• Sentence Classification: The subject, object, preposition and verb are identified from each elementary sentence based on the associated POS and NE tagged information. This information is used to classify the sentences into either fine or coarse classes.

• Question Generation: This module takes the elements of the sentences with their coarse classes, the verbs (with their stems) and the tense information. Based on a set of 90 predefined interaction rules, they check the coarse classes according to the word-to-word interaction and form the question using the appropriate word.

The question-answering system developed by [44] matches one-sentence-long user questions to a number of question templates that cover the conceptual model of the database and describe the concepts, their attributes, and the relationships in the form of natural language questions. A question template, a dynamic and parameterized FAQ as opposed to the traditional static FAQ, is a question with entity slots - free space for data instances that represent the main concepts of the question. For example, When does <performer> perform in <place>? is a question template where <performer> and <place> are the entity slots. If we fill these slots with data instances that belong to the concepts, we get an ordinary question, e.g., When does Depeche Mode perform in Globen? The main advantage of such a pattern matching system is its simplicity: no sophisticated processing of user questions is needed. The framework of [14] for question generation composes general-purpose rules to transform declarative sentences into questions, and includes a statistical component for scoring questions based on features of the input, output, and transformations performed. The generation of a single question can be usefully decomposed into the three modular stages depicted in Figure 2.1. In stage 1, a selected sentence or a set of sentences from the text is transformed into one declarative sentence by optionally altering or transforming lexical items, syntactic structure, and semantics. In stage 2, the declarative sentence is turned into a question by executing a set of well-defined syntactic transformations (WH-movement, subject-auxiliary inversion, etc.). In stage 3 the questions are

Figure 2.1 Three-stage framework for automatic question generation ((1) NLP transformations turn source sentences into derived sentences, (2) the question transducer produces questions, and (3) the question ranker outputs a ranked list of questions)

scored and ranked according to features of the source sentences, input sentences, the question, and the transformations used in generation. [26] uses the semantic role labels (SRL parse) provided by ASSERT [33] extensively. For each sentence, ASSERT provides multiple predicate-argument structures for all the predicates in the sentence. Arguments include complements of the predicates as well as adjuncts. These arguments can be used to ascertain whether a particular constituent is appropriate for generating a question or not. Their QG system has three stages: content selection, question formation and ranking. In the content selection stage, all the pieces of text in the paragraph over which the question has to be asked are identified. These pieces of text could either be a phrase or a clause or an entire sentence. In the second stage, a question is formed over each of the texts picked in the previous stage. The questions are ranked in the third stage using a heuristic function to pick the six best questions over the entire paragraph. In this work, we explore the usefulness of discourse connectives for QG. We do this by analyzing the senses of the connectives that help in QG and propose a system that makes use of this analysis to generate wh-questions (why, when, give an example) and yes/no questions. Our aim is to generate questions that assess the content knowledge that a student has acquired upon reading a text. The QG system has two modules, content selection (the text selected for question generation) and question formation (transformations on the content to get the question). The question formation module further has the modules of (i) finding a suitable question type (wh-word), (ii) auxiliary and main verb transformations and (iii) rearranging the phrases to get the final question.

2.4 Selection of Discourse Connectives

There are 100 distinct types of discourse connectives (DCs) listed in the PDTB manual [36]. The most frequent connectives in PDTB are and, or, but, when, because, since, also, although, for example, however and as a result. In this work, we provide an analysis for four subordinating conjunctions, since, when, because and although, and three adverbials, for example, for instance and as a result. Connectives such as and, or and also, which show a conjunction relation, have not been found to be good candidates for generating wh-type questions and hence are not discussed further. Leaving aside and, or and also, the selected connectives cover 52.05 per cent of the total number of connectives in the QGSTEC-2010 dataset3 and 41.97 per cent in Wikipedia articles. Connective-wise coverage in both the datasets is shown in Table 2.1. Though but and however, denoting a contrast relation, occur frequently in the data, it has not been feasible to generate wh-questions using them. A yes/no question could have been asked but it was not chosen, to preserve the question-type variety in the final output of the QG system.

3QGSTEC 2010 data set involves Wikipedia, Yahoo Answers and OpenLearn articles.

Connective        QGSTEC-2010 Dev. Data        Wikipedia Dataset
                  count      %                 count      %
because           20         16.53             36         10.28
since             9          7.44              18         5.14
when              23         19.00             35         10.00
although          4          3.30              22         6.28
as a result       5          4.13              6          1.71
for example       2          1.65              30         8.28
for instance      0          0.00              1          0.28
Total             121        52.05             350        41.97

Table 2.1 Coverage of the selected discourse connectives in the data

The system goes through the entire document and identifies the sentences containing at least one of the seven discourse connectives. In our approach, the suitable content for each discourse connective, referred to as the target argument, is decided based on the properties of the discourse connective. The system finds the question type on the basis of the discourse relation shown by the discourse connective.
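A minimal sketch of this scan is given below; the sentence splitter and matching are simplifications, not the thesis's implementation:

import re

# Hypothetical sketch: find sentences that contain one of the seven selected
# discourse connectives. Sentence splitting here is deliberately naive.

CONNECTIVES = ["because", "since", "when", "although",
               "as a result", "for example", "for instance"]

def candidate_sentences(document):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    for sentence in sentences:
        lowered = sentence.lower()
        for connective in CONNECTIVES:
            # word boundaries keep e.g. "since" from matching inside a word
            if re.search(r"\b" + re.escape(connective) + r"\b", lowered):
                yield sentence, connective
                break

doc = ("Single wicket has rarely been played since limited overs cricket began. "
       "The players then changed ends.")
for sent, conn in candidate_sentences(doc):
    print(conn, "->", sent)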

2.5 Discourse connectives for QG

In this section, we provide an analysis of discourse connectives with respect to their target arguments and the question types they take.

2.5.1 Question type identification

Different discourse connectives exhibit different discourse relations or senses in a text. Also, the same DC can exhibit different senses depending on the context in which it is used, i.e., a single DC can exhibit multiple senses when used in different contexts. The sense of the discourse connective influences the question type (Q-type). Since a few of the selected discourse connectives, such as when, since and although, can show multiple senses, the task of sense disambiguation of the connectives is essential for finding the question type.

Since: The connective can show a temporal, causal or temporal + causal relation in a sentence. A sentence exhibits a temporal relation in the presence of keywords like a time (7 am), a year (1989 or the 1980s), a date (9/11), a month (January), or words such as start, begin and end. If the relation is temporal then the question type is when, whereas in the case of a causal relation it is why.

1. Single wicket has rarely been played since limited overs cricket began. Q-type: when

2. Half-court games require less cardiovascular stamina, since players need not run back and forth a full court. Q-type: why

In examples 1 and 2, example 1 is identified as showing a temporal relation because it contains the keyword began, whereas there is no keyword in the context of example 2 that hints at a temporal relation, so the relation there is identified as causal.
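A sketch of this keyword heuristic is given below; the word list and patterns are illustrative examples rather than the system's exact resources:

import re

# Hypothetical sketch of the temporal/causal disambiguation of "since":
# temporal cues (clock times, years, dates, months, start/begin/end) map to
# question type "when"; otherwise the relation is treated as causal ("why").

TEMPORAL_WORDS = {"start", "started", "begin", "began", "end", "ended",
                  "january", "september", "december"}          # abridged list
TEMPORAL_PATTERNS = [r"\b\d{4}s?\b",            # years such as 1989 or 1980s
                     r"\b\d{1,2}\s?(am|pm)\b",  # clock times such as 7 am
                     r"\b\d{1,2}/\d{1,2}\b"]    # dates such as 9/11

def question_type_for_since(sentence):
    lowered = sentence.lower()
    words = set(re.findall(r"[a-z]+", lowered))
    if words & TEMPORAL_WORDS or any(re.search(p, lowered) for p in TEMPORAL_PATTERNS):
        return "when"   # temporal sense
    return "why"        # causal sense

print(question_type_for_since(
    "Single wicket has rarely been played since limited overs cricket began."))   # when
print(question_type_for_since(
    "Half-court games require less cardiovascular stamina, since players "
    "need not run back and forth a full court."))                                  # why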

Sentence: The San Francisco earthquake hit when resources in the field already were stretched. (Temporal) Question: When did San Francisco earthquake hit ?

Sentence: Venice's long decline started in the 15th century, when it first made an unsuccessful attempt to hold Thessalonica against the Ottomans (1423-1430). (Temporal + Causal) Question: When did Venice's long decline start in the 15th century ?

Sentence: Earthquake mainly occurs when the different blocks or plates that make up the Earth's surface move relative to each other, causing distortion in the rock. (Conditional) Question: When do earthquake mainly occur ?

Figure 2.2 Questions for discourse connective when

When: Consider the sentences with the connective when in Figure 2.2. Although when shows multiple senses (temporal, temporal+causal and conditional), we can frame questions using a single question type, when. Given a new instance of the connective, finding the correct sense of when is therefore unnecessary.

Although: The connective can show concession or contrast discourse relations. It is difficult to frame a wh-question on contrast or concession relations. So, the system generates a yes/no type question for although. Moreover, the yes/no question type adds to the variety of questions generated by the system.

3. Greek colonies were not politically controlled by their founding cities, although they often retained religious and commercial links with them. Q-type: Yes/No

A yes/no question could have been asked for connectives but and however denoting a contrast relation but it was not done to preserve the question-type variety in the final output of the QG system. Yes/no questions have been asked for occurrences of although since they occur less frequently than but and however.

Discourse connective    Sense                  Q-type
because                 causal                 why
since                   temporal               when
                        causal                 why
when                    causal + temporal      when
                        temporal               when
                        conditional            when
although                contrast               yes/no
                        concession             yes/no
as a result             result                 why
for example             instantiation          give an example where
for instance            instantiation          give an instance where

Table 2.2 Question type for discourse connectives

Identifying the question types for the other selected discourse connectives is straightforward because they broadly show only one discourse relation [32]. Based on the relations exhibited by these connectives, Table 2.2 shows the question type for each discourse connective.
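The mapping in Table 2.2 can be written down directly as a lookup table; the sketch below assumes the sense has already been determined (e.g., by the keyword heuristic above) and simply mirrors the table:

# Hypothetical sketch mirroring Table 2.2: (connective, sense) -> question type.

QUESTION_TYPE = {
    ("because", "causal"): "why",
    ("since", "temporal"): "when",
    ("since", "causal"): "why",
    ("when", "causal + temporal"): "when",
    ("when", "temporal"): "when",
    ("when", "conditional"): "when",
    ("although", "contrast"): "yes/no",
    ("although", "concession"): "yes/no",
    ("as a result", "result"): "why",
    ("for example", "instantiation"): "give an example where",
    ("for instance", "instantiation"): "give an instance where",
}

def question_type(connective, sense):
    return QUESTION_TYPE[(connective.lower(), sense.lower())]

print(question_type("because", "causal"))      # why
print(question_type("although", "contrast"))   # yes/no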

2.5.2 Target arguments for discourse connectives

A discourse connective can realize its two arguments, Arg1 and Arg2, structurally or anaphorically. Arg2 is always realized structurally whereas Arg1 can be either structural or anaphoric [36, 35].

4. [Arg1 Organisms inherit the characteristics of their parents] because [Arg2 the cells of the offspring contain copies of the genes in their parents’ cells.] (Intra-sentential connective because)

5. [Arg1 The scorers are directed by the hand signals of an umpire.] For example, [Arg2 the umpire raises a forefinger to signal that the batsman is out (has been dismissed); he raises both arms above his head if the batsman has hit the ball for six runs.] (Inter-sentential connective for example)

Consider examples 4 and 5. In 4, Arg1 and Arg2 are the structural arguments of the connective because whereas in 5, Arg2 is the structural argument and Arg1 is realized anaphorically.

Discourse connective    Target argument
because                 Arg1
since                   Arg1
when                    Arg1
although                Arg1
as a result             Arg2
for example             Arg1
for instance            Arg1

Table 2.3 Target argument for discourse connectives

The task of content selection involves finding the target argument (either Arg1 or Arg2) of the discourse connective. Since both the arguments are potential candidates for QG, we analyzed the data to identify which argument makes better content for each of the connectives. Our system selects one of the two arguments based on the properties of the discourse connective. Table 2.3 shows the target argument, i.e., either Arg1 or Arg2, which is used as the content for QG.
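Table 2.3 likewise reduces to a small lookup; the sketch below simply mirrors it:

# Hypothetical sketch mirroring Table 2.3: which argument is the QG content.

TARGET_ARGUMENT = {
    "because": "Arg1", "since": "Arg1", "when": "Arg1", "although": "Arg1",
    "as a result": "Arg2", "for example": "Arg1", "for instance": "Arg1",
}

print(TARGET_ARGUMENT["as a result"])   # Arg2
print(TARGET_ARGUMENT["because"])       # Arg1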

2.6 Target Argument Identification

The target argument for a discourse connective can be one or more clauses or sentences: one or more sentences in the case of inter-sentential4 discourse connectives, and one or more clauses in the case of intra-sentential5 connectives. There are no restrictions on how many or what types of clauses can be included in these complex selections, except for the Minimality Principle, according to which only as many clauses and/or sentences should be included in an argument selection as are minimally required and sufficient for the interpretation of the relation. The discourse connectives for example and for instance can realize their Arg1 anywhere in the prior discourse [10]. So the system considers only those sentences in which the connectives occur at the beginning of the sentence, and the immediately previous sentence is assumed to be the Arg1 of the connective (which is the target argument for QG). In the case of intra-sentential connectives (because, since, although and when) and as a result (whose target argument is Arg2, which would be a clause), the identification of the target argument is done in two steps. The system first locates the syntactic head, or head verb, of the target argument and then extracts it from the dependency tree of the sentence. The remainder of this section explains these two steps in detail.

4Connectives that realize their Arg1 anaphorically and Arg2 structurally
5Connectives that realize both of their arguments structurally

2.6.1 Locate syntactic head

The approach for locating the syntactic head of the target argument is explained with the help of Figure 2.3 (generic dependency trees) and an example shown in Figure 2.4. The syntactic head of Arg2 is the first finite verb encountered while percolating up the dependency tree starting from the discourse connective. In the case of intra-sentential connectives where Arg1 is the target argument, the system percolates up until it reaches the second finite verb, which is assumed to be the head of Arg1. The number of percolations depends entirely on the structure and complexity of the sentence. Figure 2.3 shows two dependency trees, (a) and (b). Starting from the discourse connective DC and percolating up, the system identifies that the head of Arg2 is V2 and that of Arg1 is V1.

Figure 2.3 Head selection of the target argument for intra-sentential connectives (V1, V2: finite verbs; X, Z: subtrees of V1; A: subtree of V2; P, Q: not verbs; DC: discourse connective (child of V2))

Since the discourse connective in the example of Figure 2.4 is because, the target argument is Arg1 (from Table 2.3). By percolating up the tree starting from because, the head of Arg2 is affected and that of Arg1 is played. Once we locate the head of the target argument, we find the auxiliary as [26] does. For the example in Figure 2.4, the auxiliary for question generation is is.
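A simplified sketch of this percolation is shown below; the Node class and the finite-verb flag are stand-ins for the output of a real dependency parser:

from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical sketch of head location: starting from the discourse
# connective's node, percolate up the dependency tree; the first finite verb
# reached heads Arg2 and the second heads Arg1 (cf. Figure 2.3).

@dataclass
class Node:
    word: str
    is_finite_verb: bool = False
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

    def attach(self, child):
        child.parent = self
        self.children.append(child)
        return child

def locate_heads(dc_node):
    """Return (head_of_arg2, head_of_arg1) by percolating up from the DC."""
    heads, node = [], dc_node.parent
    while node is not None and len(heads) < 2:
        if node.is_finite_verb:
            heads.append(node)
        node = node.parent
    return (heads[0] if heads else None,
            heads[1] if len(heads) > 1 else None)

# Abridged tree for "Competitive badminton is played indoors because
# shuttlecock flight is affected by wind." (Figure 2.4)
played = Node("played", is_finite_verb=True)
affected = played.attach(Node("affected", is_finite_verb=True))
because = affected.attach(Node("because"))
arg2_head, arg1_head = locate_heads(because)
print(arg2_head.word, arg1_head.word)   # affected played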

2.6.2 Target Argument Extraction

The extraction of the target argument is done after identifying its syntactic head. For as a result, the target argument, Arg2, is the subtree headed by the head of the connective. For intra-sentential connectives, the target argument, Arg1, is the tree remaining after removing the subtree that contains Arg2.

Figure 2.4 Question Generation process. Input: Because [Arg2 shuttlecock flight is affected by wind], [Arg1 competitive badminton is played indoors] (content). Question type: Why; target argument head: played; auxiliary: is. Generated question: Why is competitive badminton played indoors ?

In both Figures 2.3 (a) and (b), a tree with head V1 and its children, X and Z, is left after removing Arg2 from the dependency tree, which is the content required for generating the question. Note that in the tree of Figure 2.3(b), the child P of the head verb V1 is removed along with its entire subtree, which contains Arg2. Thus, the subtree with head V2 is the unwanted part of the tree in Figure 2.3(a) whereas the subtree with head P is the unwanted part of the tree in Figure 2.3(b) when the target argument is Arg1. In Figure 2.4, after removing the unwanted argument Arg2 (the subtree with head affected), the system gets competitive badminton is played indoors, which is the required clause (content) for question generation. The next section describes how the content is transformed into a question.
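The extraction can be sketched as removing the Arg2 subtree and linearizing the remaining nodes by word order; the hand-built tree below is an abridged, hypothetical version of the Figure 2.4 parse:

from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of target-argument extraction: drop the subtree rooted
# at the Arg2 head and read off the remaining words in sentence order.

@dataclass
class Node:
    word: str
    index: int                      # position in the original sentence
    children: List["Node"] = field(default_factory=list)

def collect(node, skip, out):
    """Gather every node except the subtree rooted at `skip`."""
    if node is skip:
        return
    out.append(node)
    for child in node.children:
        collect(child, skip, out)

def extract_content(root, arg2_head):
    kept = []
    collect(root, arg2_head, kept)
    return " ".join(n.word for n in sorted(kept, key=lambda n: n.index))

# "Competitive badminton is played indoors because shuttlecock flight is
# affected by wind."  (subtrees below "affected" are abridged)
played = Node("played", 3)
affected = Node("affected", 8)
played.children = [Node("competitive", 0), Node("badminton", 1),
                   Node("is", 2), Node("indoors", 4), affected]
affected.children = [Node("because", 5), Node("flight", 7), Node("is", 9)]

print(extract_content(played, affected))   # competitive badminton is played indoors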

2.7 Syntactic Transformations and Question Generation

At this stage, the system has the question type, the auxiliary and the content. The following set of transformations is applied on the content to get the final question. (1) If the auxiliary is present in the sentence itself then it is moved to the beginning of the sentence; otherwise the auxiliary is added at the beginning of the sentence. (2) If a wh-question is to be formed, the question word is added just before the auxiliary. In the case of Yes/No questions, the question starts with the auxiliary itself as no question word is needed. (3) A question mark (?) is added at the end to complete the question. Consider the example in Figure 2.4. Here the content is competitive badminton is played indoors. Applying the transformations, the auxiliary is first moved to the start of the sentence to get is competitive badminton played indoors. Then the question type Why is added just before the auxiliary is, and a question mark is added at the end to get the final question, Why is competitive badminton played indoors ?
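These three transformations can be sketched on a plain token list as follows; the sketch covers only the case illustrated above, where the auxiliary is already present in the content (the full system also inserts auxiliaries and handles verb transformations):

# Hypothetical sketch of the three transformations described above.

def form_question(content, auxiliary, qtype=None):
    tokens = content.split()
    if auxiliary in tokens:
        tokens.remove(auxiliary)        # (1) front the auxiliary
    tokens = [auxiliary] + tokens
    if qtype is not None:
        tokens = [qtype] + tokens       # (2) wh-word goes before the auxiliary
    return " ".join(tokens) + " ?"      # (3) append the question mark

print(form_question("competitive badminton is played indoors", "is", "Why"))
# Why is competitive badminton played indoors ?
print(form_question("a bowler can bowl unchanged at the same end for several overs", "can"))
# can a bowler bowl unchanged at the same end for several overs ?
# (capitalization of yes/no questions is not handled in this sketch)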

Scope: In QGSTEC 2010 the question had to be assigned a scope: specific, medium or general. The scope is defined as: general - entire input paragraph, medium - one or more clauses or sentences, and specific - phrase or less. Questions generated using discourse connectives are usually of specific or medium scope. [26] assigned medium scope to the questions generated using semantic roles such as ARGM-DIS (result), ARGM-CAU (causal) and ARGM-PNC (purpose) given by the SRL. However, most of the time, the scope of the answer to these questions is just a clause or a sentence and should have been assigned specific scope instead of medium.

2.8 Evaluation and Results

Automatic evaluation of any natural language generated text is difficult, so our system is evaluated manually. The evaluation was performed by two graduate students with good English proficiency. Evaluators were asked to rate the questions on a scale of 1 to 4 (4 being the best score) for both syntactic and semantic correctness of the question, and an overall rating out of 8 (4+4) is assigned to each question. The syntactic correctness is rated to ensure that the system can generate grammatical output. In addition, questions which read fluently are given a greater score. The syntactic correctness and fluency are evaluated using the following scores: 4 - grammatically correct and idiomatic/natural, 3 - grammatically correct, 2 - some grammar problems, 1 - grammatically unacceptable. Table 2.4 shows the syntactic correctness measure with examples.

Score 4 - The question is grammatically correct and idiomatic/natural. Example: In which type of animals are phagocytes highly developed?
Score 3 - The question is grammatically correct but does not read as fluently as we would like. Example: In which type of animals are phagocytes, which are important throughout the animal kingdom, highly developed?
Score 2 - There are some grammatical errors in the question. Example: In which type of animals is phagocytes, which are important throughout the animal kingdom, highly developed?
Score 1 - The question is grammatically unacceptable. Example: On which type of animals is phagocytes, which are important throughout the animal kingdom, developed?

Table 2.4 Evaluation guidelines for syntactic correctness measure

The semantic correctness is evaluated using the following scores: 4 - semantically correct and idiomatic/natural, 3 - semantically correct and close to the text or other questions, 2 - some semantic issues, 1 - semantically unacceptable. The results of our system on the QGSTEC-2010 development dataset are shown in Table 2.5. The overall system is rated 6.3 out of 8 on this dataset, and the total number of questions generated for this dataset is 61.

Discourse connective    No. of questions    Syntactic Correctness (4)    Semantic Correctness (4)    Overall Rating (8)
because                 20                  3.6                          3.6                         7.2
since                   9                   3.8                          3.2                         7
when                    23                  2.3                          2.2                         4.5
although                4                   4                            3.8                         7.8
as a result             5                   4                            4                           8
Overall                 61                  3.2                          3.1                         6.3

Table 2.5 Results on QGSTEC-2010 development dataset

The instances of the connectives were fewer in the QGSTEC-2010 development dataset. So, the system is further tested on five Wikipedia articles (football, cricket, basketball, badminton and tennis) for a more effective evaluation. Results on this dataset are presented in Table 2.6. The overall rating of the system is 5.8 out of 8 for this dataset, and the total number of questions generated for this dataset is 150. The ratings presented in Tables 2.5 and 2.6 are the average of the ratings given by both the evaluators. The inter-evaluator agreement (Cohen's kappa coefficient) for the QGSTEC-2010 development dataset is 0.6 for the syntactic correctness measure and 0.5 for the semantic correctness measure; for the Wikipedia articles, the agreement is 0.7 and 0.6 for the syntactic and semantic correctness measures respectively.

Discourse connective    No. of questions    Syntactic Correctness (4)    Semantic Correctness (4)    Overall Rating (8)
because                 36                  3.3                          3.2                         6.5
since                   18                  3.1                          3                           6.1
when                    35                  2.4                          2.0                         4.4
although                22                  3.1                          2.8                         5.9
as a result             6                   3.6                          3.2                         6.8
for example             16                  3.1                          2.9                         6.0
for instance            2                   4                            3                           7
Overall                 135                 3.0                          2.8                         5.8

Table 2.6 Results on the Wikipedia data (cricket, football, basketball, badminton, tennis)

On analyzing the data, we found that the Wikipedia articles have more complex sentences (with unusual structures as well as a larger number of clauses) than the QGSTEC-2010 development dataset. As a result, the system's performance consistently drops for all the connectives on the Wikipedia dataset. No comparative evaluation was done as none of the earlier works in QG exploited the discourse connectives in text to generate questions. Table 2.7 shows the questions generated by our system for each connective.

2.9 Error Analysis

An error analysis was carried out on the system’s output and the four most frequent types of errors are discussed in this section.

2.9.1 Co-reference resolution

The system doesn't handle co-reference resolution and, as a result of this, many questions have been rated low for semantic correctness by the evaluators. The greater the number of pronouns in the question, the lower the semantic rating of the question.

6. They grow in height when they reach shallower water, in a wave shoaling process. Question: When do they grow in height?

Although the above example 6 is syntactically correct, such questions are rated semantically low because the context is not sufficient to answer the question due to the pronouns in it. 13.54% of the generated questions on the Wikipedia dataset have pronouns without their antecedents, making the questions semantically insufficient.

2.9.2 Parsing Errors

Sometimes the parser fails to give a correct parse for sentences with complex structure. In such cases, the system generates a question that is unacceptable. Consider the example below.

7. In a family who know that both parents are carriers of CF , either because they already have a CF child or as a result of carrier testing , PND allows the conversion of a probable risk of the disease affecting an unborn child to nearer a certainty that it will or will not be affected. Question: Why do in a family who know that both parents are carriers of CF , either or will not be affected ?

In example 7 above, the sentence has a complex structure containing the paired connective either-or, where the argument of either contains because and that of or contains as a result. Here the question is formed using because, and it is correct neither syntactically nor semantically due to the complex nature of the sentence. 9.38% of the sentences in the datasets are complex, containing three or more discourse connectives.

2.9.3 Errors due to the inter-sentential connectives

For inter-sentential connectives, the system considers only those sentences in which the connective occurs at the beginning of the sentence, and the immediately preceding sentence is assumed to be the Arg1 of the connective (which is the target argument for QG). But this assumption is not always true. Of the total number of instances of these connectives, 52.94% (on the Wikipedia dataset) occur at the beginning of a sentence. Consider the paragraph below.

8. A game point occurs in tennis whenever the player who is in the lead in the game needs only one more point to win the game. The terminology is extended to sets (set point), matches (match point), and even championships (championship point). For example, if the player who is serving has a score of 40-love, the player has a triple game point (triple set point, etc.) as the player has three consecutive chances to win the game.

Here in example 8, the third sentence, in which the example is given, is related to the first sentence rather than to the immediately preceding sentence. For these connectives, the assumption that the immediately preceding sentence is Arg1 is false 14.29% of the time.
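The heuristic itself is straightforward; the sketch below shows it under the assumption that sentence splitting has already been done (the connective list and the function name are illustrative, not the system's actual code).

    # A minimal sketch of the Arg1 heuristic for inter-sentential connectives:
    # keep only sentence-initial occurrences and take the previous sentence as Arg1.
    INTER_SENTENTIAL = ("as a result", "for example", "for instance")

    def arg1_candidates(sentences):
        """Yield (arg1, connective, connective_sentence) triples under the heuristic."""
        for i, sent in enumerate(sentences):
            lowered = sent.lower()
            for conn in INTER_SENTENTIAL:
                # skip the first sentence, which has no preceding sentence to act as Arg1
                if i > 0 and lowered.startswith(conn):
                    yield sentences[i - 1], conn, sent

As example 8 shows, the immediately preceding sentence is not always the true Arg1, which is exactly where this heuristic fails.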

2.9.4 Fluency issues

The system does not handle the removal of predicative adjuncts, so questions with optional phrases in them are rated low on the syntactic correctness measure.

2.10 Conclusions

Our QG system generates questions of different types using discourse connectives. In this work, we present an end-to-end system that takes a document as input and outputs all the questions for the selected discourse connectives. The system has been evaluated for syntactic and semantic soundness of the questions by two evaluators. We have shown that, from the QG point of view, some specific discourse relations, such as causal, temporal and result, are more important than others. This work also shows that discourse connectives are good enough for QG and that there is no need for full-fledged discourse parsing. In the next chapter, we introduce and discuss cloze question generation. Cloze questions (CQs) are fill-in-the-blank questions, where a sentence is given with one or more blanks in it, with four alternatives to fill those blanks.

because
  One-handed backhand players move to the net with greater ease than two-handed players because the shot permits greater forward momentum and has greater similarities in muscle memory to the preferred type of backhand volley (one-handed, for greater reach).
  Q: Why do one-handed backhand players move to the net with greater ease than two-handed players? (Causal)

since
  Half-court games require less cardiovascular stamina, since players need not run back and forth a full court.
  Q: Why do half-court games require less cardiovascular stamina? (Causal)
  Single wicket has rarely been played since limited overs cricket began.
  Q: Since when has single wicket rarely been played? (Temporal)

when
  A one-point shot can be earned when shooting from the foul line after a foul is made.
  Q: When can a one-point shot be earned? (Conditional)

although
  A bowler cannot bowl two successive overs, although a bowler can bowl unchanged at the same end for several overs.
  Q: Can a bowler bowl unchanged at the same end for several overs? (Contrast, concession)

as a result
  In the United States sleep deprivation is common with students because almost all schools begin early in the morning and many of these students either choose to stay up awake late into the night or cannot do otherwise due to delayed sleep phase syndrome. As a result, students that should be getting between 8.5 and 9.25 hours of sleep are getting only 7 hours.
  Q: Why are students that should be getting between 8.5 and 9.25 hours of sleep getting only 7 hours? (Result)
  As a result of studies showing the effects of sleep-deprivation on grades, and the different sleep patterns for teenagers, a school in New Zealand changed its start time to 10:30, in 2006, to allow students to keep to a schedule that allowed more sleep.
  Q: Why did a school in New Zealand change its start time? (Result)

for example
  Slicing also causes the shuttlecock to travel much slower than the arm movement suggests. For example, a good cross court sliced drop shot will use a hitting action that suggests a straight clear or smash, deceiving the opponent about both the power and direction of the shuttlecock.
  Q: Give an example where slicing also causes the shuttlecock to travel much slower than the arm movement suggests. (Instantiation)

for instance
  If the team that bats last scores enough runs to win, it is said to have "won by n wickets", where n is the number of wickets left to fall. For instance a team that passes its opponents' score having only lost six wickets would have won "by four wickets".
  Q: Give an instance where if the team that bats last scores enough runs to win, it is said to have "won by n wickets", where n is the number of wickets left to fall. (Instantiation)

Table 2.7 Examples

Chapter 3

Distractor Selection for Cloze Questions

3.1 Overview

This chapter describes a method for automatically generating multiple-choice cloze questions (CQs). The presented cloze question generation (CQG) system comprises three steps: sentence selection, keyword selection, and distractor selection. The system is implemented for and tested on examples from the cricket sports domain. A cloze test is an exercise, test, or assessment consisting of a portion of text with certain words removed or blanked out, where the participant is asked to fill in the missing words, with four alternatives to choose from. Cloze tests require the ability to understand context and vocabulary in order to identify the correct words or type of words that belong in the deleted passages of a text. In this work, we present an end-to-end automatic cloze question generation system which adopts a semi-structured approach to generate CQs by making use of a knowledge base extracted from a Cricket portal. This chapter examines the influence of the domain on the quality of the distractors in cloze question generation, and a domain-dependent approach is described for the distractor selection module of the CQG system. We also add context to the question sentence in the process of creating a CQ; this disambiguates the question and avoids cases where a question has multiple answers. In this chapter, we describe the sentence selection and keyword selection modules only briefly; the main focus is on distractor selection and the influence that domain knowledge has on it.

3.2 Introduction

English has spread so widely that about a third of the world's population speak it [8]. Thus, English education for non-native speakers, both now and in the near future, is of great importance. Multiple-choice questions (MCQs) have proved effective for judging students' knowledge. Due to their ability to provide automatic and objective feedback, multiple-choice questions are commonly used in educational applications. One type that is especially popular for the assessment of native and second language learning and instruction is fill-in-the-blank questions, or cloze tests.

A cloze test is an exercise, test, or assessment consisting of a portion of text with certain words removed or blanked out, where the participant is asked to fill in the missing words, with four alternatives to choose from. Cloze tests require the ability to understand context and vocabulary in order to identify the correct words or type of words that belong in the deleted passages of a text. Words may be deleted from the text in question either mechanically (every Nth word) or selectively, depending on exactly what aspect it is intended to test for. A language teacher may give the following passage (taken from Wikipedia) to students: Example: Today, I went to the ____ and bought some milk and eggs. I knew it was going to rain, but I forgot to take my ____, and ended up getting wet on the way. Students would then be required to fill in the blanks with words that would best complete the passage. Context, in both language and content terms, is essential in most, if not all, cloze tests. The first blank is preceded by "the"; therefore, a noun, an adjective or an adverb must follow. However, a conjunction follows the blank; the sentence would not be grammatically correct if anything other than a noun were in the blank. The words "milk and eggs" are important for deciding which noun to put in the blank; "supermarket" is a possible answer; depending on the student and the given options, however, the first blank could be store, supermarket, shop or market, while umbrella or raincoat fit the second. As a language learning tool, cloze tests can be enhanced by using up-to-date, authentic text on topics in which the student takes an interest. Such personalization can "provide motivation, generate enthusiasm in learning, encourage learner autonomy, foster learner strategy and help develop students' reading skills as well as enhance their cultural understanding" [42]. Manual construction of such questions, however, is a time-consuming and labour-intensive task. It is clearly not practical to manually design tailor-made cloze tests for every student. This bottleneck has motivated research on automatic generation of cloze items. Also, a test comprising cloze questions has merits in that (1) it is easy for test-takers to input answers, (2) evaluation is invariable and objective, so computers can evaluate them automatically, and (3) they are suitable for modern mental test theory, i.e. Item Response Theory (IRT)1. As a result, automatic cloze question generation (CQG) has received a lot of research attention recently. An example of a cloze question (CQ) is shown below.

1. Zaheer Khan opened his account with three consecutive maidens in the 2011 world-cup final.
   Q: ____ opened his account with three consecutive maidens in the 2011 world-cup final.
   A: (a) Zaheer Khan (b) Lasith Malinga (c) Praveen Kumar (d) Munaf Patel

In the above CQ, Zaheer Khan, which is blanked out, is referred to as the keyword of the sentence, and a set of four alternatives (A) is given.

1Item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is based on the application of related mathematical models to testing data. It is the preferred method for the development of high-stakes tests such as the Graduate Record Examination (GRE) and Graduate Management Admission Test (GMAT).

CQs are widely used, from the classroom level to far larger scales, to measure people's proficiency in English as a second language. Examples of such tests include TOEFL (Test Of English as a Foreign Language) and TOEIC (Test Of English for International Communication). For assessment purposes, the ability of the cloze test to discriminate between more advanced students and less advanced ones is important. This is expressed in two dimensions [7, 28]: first, item difficulty (or facility index), i.e., the distractor should be neither too obviously wrong nor too tricky; second, effectiveness (or discrimination index), i.e., it should attract only the less proficient students. For language learning applications, the discriminative power of a cloze test is not as important as its ability to cause users to make mistakes. An easy cloze test, on which the user scores perfectly, would not be very educational; arguably, the user learns most when his/her mistake is corrected. This chapter therefore emphasizes the generation of difficult cloze questions. In the area of cloze questions, most of the previous research is in the domain of English language learning. Existing work on cloze question generation such as that of [25] has focused on lexical items regardless of their part-of-speech. [21] focused on generating cloze questions for prepositions with a technique based on collocations. Cloze questions have been generated to test students' knowledge of English in using the correct verbs [45] and adjectives [24] in sentences. Others have generated cloze questions for common expressions. For instance, [15] generated cloze questions for structures such as "not only the" that assess grammatical skills rather than vocabulary skills. [16] generated questions for both vocabulary and grammatical patterns. [27] generated cloze questions about concepts found in a document by converting a declarative sentence into an interrogative sentence. [30] and [43] have generated questions to teach and evaluate students' vocabulary. [1] have generated factual cloze questions from a biology textbook through heuristically weighted features. The standard method for producing CQs is for an item writer to compose or locate a convincing carrier sentence, which incorporates the desired KEY/GAP/BLANK (the correct answer, which has been deleted to make the gap). They then have to generate DISTRACTORS (wrong answers intended to 'distract' the student from selecting the correct answer). This is a non-trivial task because a good distractor must satisfy two requirements. Firstly, the distractor must be incorrect, in that inserting it into the blank generates a 'bad' sentence. Secondly, the distractors must in some way be similar enough to the key to be viable alternatives, or else the cloze question will be too easy. [17] defines a distractor as a concept semantically close to the key which, however, cannot serve as the right answer itself. To secure the first requirement, the distractor must yield a sentence with zero hits on the web in [45]; in [25], it must produce a rare collocation with other important words in the sentence.
As for the second, various criteria have been proposed: matching patterns hand-crafted by experts [5]; [4] and [43] used WordNet; similarity in meaning to the key, with respect to a thesaurus [45] or to an ontology in a narrow domain [27]; [19] used their in-house thesauri to retrieve similar or related words (synonyms, hypernyms, hyponyms, antonyms, etc.); the most widely used criterion, again, is similarity in word frequency to the key [4, 42].

However, their approaches cannot be used for domains which do not have ontologies. Moreover, [43] do not select distractors based on the context of the keys.

2. I read a book.

3. Book the flight.

For example, in sentences 2 and 3, the key book occurs in two different senses, but the same set of distractors would be generated for both. So a distractor should come from the same context and domain, and should be relevant. It is also clear from the above discussion that term frequency alone will not work for the selection of distractors. In this work, we present an end-to-end automatic cloze question generation system which adopts a semi-structured approach to generate CQs by making use of a knowledge base extracted from a Cricket2 portal. Specifically, we focus on educational reading to acquire content knowledge (rather than improving reading ability), unlike most of the previous approaches. Our aim is to generate questions that assess the content knowledge that a student has acquired upon reading a text. Also, unlike previous approaches, we add context to the question sentence in the process of creating a CQ. This is done to disambiguate the question and avoid cases where there are multiple answers for a question. In Example 1, we have disambiguated the question by adding the context "in the 2011 world-cup final". Such a CQG system can be used in a variety of applications such as quizzing systems, trivia games, assigning fan ratings on social networks by posing game-related questions, etc. Our system factors the CQG process into three stages: (i) sentence selection, (ii) keyword selection and (iii) distractor selection. In this chapter, we describe the sentence selection and the keyword selection modules very briefly. The main focus of this chapter is on distractor selection and the influence the domain knowledge has on it. Automatic evaluation of a CQG system is a very difficult task; so, the automatically generated distractors were evaluated manually.

3.3 Related Work

This section surveys various approaches to automatic cloze question generation with regard to the method employed for distractor selection, as discussed in the previous section. [28] proposed a computer-aided procedure for generating multiple-choice questions from textbooks. Questions are only asked in reference to domain-specific terms, to ensure that the questions are relevant, and sentences must have either a subject-verb-object structure or a simple subject-verb structure. They tested this method on a linguistics textbook and found that 57% of the items were judged worthy of keeping as test items, of which 94% required some level of post-editing. [45] proposes the automatic generation of Fill-in-the-Blank Questions (FBQs) together with testing based on Item Response Theory (IRT) to measure English proficiency. First, the proposal generates an FBQ from a given sentence in English. The position of a blank in the sentence is determined, and the word at that position is considered as the correct choice. The candidates for incorrect choices for the

2A popular game played in commonwealth countries such as Australia, England, India, Pakistan etc.

blank are hypothesized through a thesaurus and verified by using the Web. Finally, the blanked sentence, the correct choice and the incorrect choices surviving the verification are together laid out to form the FBQ. Second, the proficiency of non-native speakers who took the test consisting of such FBQs is estimated through IRT. [21] is concerned with generating cloze items for prepositions, whose usage often poses problems for non-native speakers of English. The quality of a cloze item depends on the choice of distractors. They propose two methods, based on collocations and on non-native English corpora, to generate distractors for prepositions. Both their methods are found to be more successful in attracting users than a baseline that relies only on word frequency, a common criterion in past research. [24] proposes a multiple-choice question generation methodology for evaluating the understanding of adjectives in a text. Based on the sense association among adjectives, an adjective being examined can usually be substituted by some other adjectives. In order to discourage learners from answering questions by recalling memorized answers, and to expose learners to more vocabulary, the answer of the generated questions is a substitute of the adjective being examined. The candidates for a substitute are gathered from WordNet and filtered by web corpus searching. Based on the proposed methodology, the choice candidates that are not selected as the answer are still useful: they are more distracting than dissimilar adjectives and can be used as distractors to improve the distinguishability of a question. [30] presents a strategy using linguistically motivated features to improve the quality of automatically generated cloze and open cloze questions which are used by the REAP tutoring system for assessment in the ill-defined domain of English as a Second Language vocabulary learning. The baseline technique produced high-quality cloze questions 40% of the time, while the new strategy produced high-quality cloze questions 66% of the time. [25] reports experience in applying natural language processing techniques to algorithmically generating test items for both reading and listening cloze items. They propose a word sense disambiguation-based method for locating sentences in which designated words carry specific senses, and apply a collocation-based method for selecting distractors that are necessary for multiple-choice cloze items. [43] present a system, TEDDCLOG, which automatically generates draft cloze questions from a corpus. TEDDCLOG takes the key as input. It finds distractors from a distributional thesaurus, and identifies a collocate of the key that does not occur with the distractors. Next it finds a simple corpus sentence containing the key and collocate. The system then presents the sentences and distractors to the user for approval, modification or rejection. In this work, we present an end-to-end automatic cloze question generation system which adopts a semi-structured approach to generate CQs by making use of a knowledge base extracted from a Cricket portal. Specifically, we focus on educational reading to acquire content knowledge, unlike most of the previous approaches. Our aim is to generate questions that assess the content knowledge that a student has acquired upon reading a text. Also, unlike previous approaches, we add context to the question sentence in the process of creating a CQ.
This is done to disambiguate the question and avoid cases where there are multiple answers for a question. Our system factors the CQG system into three stages:

(i) sentence selection, (ii) keyword selection and (iii) distractor selection. In this chapter, we describe the sentence selection and the keyword selection modules very briefly. The main focus of this chapter is on distractor selection and the influence the domain knowledge has on it. A domain-dependent approach is described for the distractor selection module of the CQG system.

3.4 Approach

Our system takes news reports on Cricket matches as input and gives factual CQs as output, using a knowledge base on Cricket players and officials collected from the web. Given a document, the system goes through three stages to generate the cloze questions. In the first stage, informative and relevant sentences are selected, and in the second stage, keywords (words/phrases to be questioned on) are identified in the selected sentences. Distractors (answer alternatives) for the keyword in the question sentence are chosen in the final stage. The Stanford CoreNLP toolkit is used for tokenization, POS tagging [46], NER [11], parsing [18] and co-reference resolution [20] of sentences in the input documents.

3.4.1 Sentence Selection

In sentence selection, relevant and informative sentences from a given input article are picked to be the question sentences in cloze questions. [1] uses many summarization features for sentence selection based on heuristic weights, but for this task it is difficult to decide the correct relative weights for each feature without any training data. So our system directly uses a summarizer to select important sentences. There are a few abstractive summarizers, but they perform very poorly ([50], for example). So our system uses an extractive summarizer, MEAD3, to select important sentences. The top 10 percent of the ranked sentences from the summarizer's output are chosen to generate cloze questions.
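The selection itself reduces to keeping the top fraction of the summarizer's ranking. A minimal sketch is shown below; it assumes the sentences have already been ranked by the summarizer, and the function name and interface are illustrative rather than MEAD's actual API.

    # A minimal sketch of the sentence selection step, assuming the summarizer has
    # already returned sentences ordered from most to least important.
    def select_question_sentences(ranked_sentences, fraction=0.10):
        """Keep the top `fraction` of the ranked sentences (at least one)."""
        k = max(1, int(len(ranked_sentences) * fraction))
        return ranked_sentences[:k]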

3.4.2 Keywords Selection

This step selects the words in the selected sentence that can be blanked out. These words are referred to as the keywords in the sentence. For a good factual CQ, a keyword should be the word/phrase/clause that tests the user's knowledge of the content of the article. A keyword should be neither too trivial nor too obscure. For example, in an article on Obama, Obama would make a bad keyword. The system first collects all the potential keywords from a sentence in a list and then prunes this list on the basis of the observations described later in this section.

3MEAD is a publicly available toolkit for multi-lingual summarization and evaluation. The toolkit implements multiple summarization algorithms (at arbitrary compression rates) such as position-based, Centroid[RJB00], TF*IDF, and query-based methods (http://www.summarization.com/mead)

Unlike the previous works in this area, our system is not bound to select only single-token keywords or to select only nouns and adjectives as keywords. In our work, a keyword can be a Named Entity (person, number, location, organization or date) (NE), a pronoun (that comes at the beginning of a sentence, so that its referent is not present in that sentence) or a constituent (selected using the parse tree). In Example 4, the selected keyword is a noun phrase, carrom ball.

4. R Ashwin used his carrom ball to remove the potentially explosive Kirk Edwards in Cricket World Cup 2011.
   Q: R Ashwin used his ____ to remove the potentially explosive Kirk Edwards in Cricket World Cup 2011.

Observations: Based on our data analysis, we prune the candidate list using the observations described below.

• Relevant tokens should be present in the keyword: a keyword must contain a few tokens other than stop words4, common words5 and topic words6.

• Prepositions: a preposition at the beginning of the keyword is an important clue to what the author is looking to check, so we keep it as part of the question sentence rather than blanking it out with the keyword. We also prune keywords containing one or more prepositions, as they more often than not make the question unanswerable and sometimes introduce the possibility of multiple answers.

We also use the observations presented by [1] in their keyword selection step. The above criteria reduce the list of potential keywords by a significant amount. Among the remaining keywords, our system gives preference to NEs (persons, locations, organizations, numbers and dates, in that order), then noun phrases, then verb phrases. To preserve the overall quality of a set of generated questions, the system checks that the answer of one question is not present in the other questions. In case of a tie, term frequency is used.
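The pruning and ranking just described can be summarized as in the sketch below. Candidate keywords are assumed to come with their tokens, NE tag and term frequency from the earlier processing; the field names, preposition list and data layout are illustrative rather than the system's actual code.

    # A minimal sketch of keyword pruning and preference ordering.
    NE_PREFERENCE = ["PERSON", "LOCATION", "ORGANIZATION", "NUMBER", "DATE"]
    PREPOSITIONS = {"in", "on", "at", "of", "to", "for", "by", "with", "from"}

    def prune_and_rank_keywords(candidates, stopwords, common_words, topic_words, other_questions):
        """candidates: dicts with "tokens", "ne_tag" and "freq" fields (illustrative layout)."""
        kept = []
        for cand in candidates:
            tokens = [t.lower() for t in cand["tokens"]]
            # a keyword must contain tokens other than stop, common and topic words
            relevant = [t for t in tokens
                        if t not in stopwords and t not in common_words and t not in topic_words]
            if not relevant:
                continue
            # keywords containing prepositions tend to be unanswerable or ambiguous
            if any(t in PREPOSITIONS for t in tokens):
                continue
            # the answer of one question should not appear in other questions
            if any(" ".join(tokens) in q for q in other_questions):
                continue
            kept.append(cand)
        # prefer named entities in the order above; break ties by term frequency
        kept.sort(key=lambda c: (NE_PREFERENCE.index(c["ne_tag"])
                                 if c["ne_tag"] in NE_PREFERENCE else len(NE_PREFERENCE),
                                 -c["freq"]))
        return kept

Noun phrases and verb phrases, which come after named entities in the preference order, fall into the last bucket of this ranking.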

3.5 Distractor Selection

The previous two stages (sentence selection and keyword selection) are not domain-specific in nature, i.e. they work fine irrespective of the dataset and domain chosen. But the same is not true for distractor selection, because the quality of the distractors largely depends on the domain. We have performed experiments and present results on the Cricket domain. Consider Example 5.

4In computing, stop words are words which are filtered out prior to, or after, processing of natural language data (text) http://armandbrahaj.blog.al/2009/04/14/list-of-english-stop-words/ 5Most common words in English taken from http://en.wikipedia.org/wiki/Most_common_words_in_English 6Topics (words) which the article talks about. We used the TopicS tool [23]

5. Sehwag had hit a boundary from the first ball of six of India's previous eight innings in Cricket World Cup 2011.
   Q: ____ had hit a boundary from the first ball of six of India's previous eight innings in Cricket World Cup 2011.
   A: (a) Ponting (b) Sehwag (c) Zaheer (d) Marsh

In Example 5, although all the distractors are from the domain of Cricket, they are not good enough to create confusion. We have some clues in the given sentence that can be exploited to provide distractors that pose a greater challenge to the students: (i) someone hitting a boundary on the first ball must be a top-order batsman, and (ii) India in the sentence implies that the batsman is from the Indian team. But of the three distractors, one is an Indian bowler (Zaheer) and the other two are Australian top-order batsmen (Ponting and Marsh). Hence the answer to the question, Sehwag, can easily be chosen. To present more meaningful and useful distractors, this stage is domain-dependent and also uses a knowledge base. The system extracts clues from the sentences to present meaningful distractors. The knowledge base is collected by crawling the players' pages available at http://www.espncricinfo.com. Each page has a variety of information about the player such as name, playing style, birth date, playing role, major teams etc. This information is widely used to make better choices throughout the system. Sample rows and columns from the database of players are shown in Table 3.1.

Player's name        Team       Playing Role           Batting Style  Bowling Style
Sachin Tendulkar     India      Top-order batsman      Right hand     Right-arm offbreak
Virendra Sehwag      India      Top-order batsman      Right hand     Right-arm offbreak
Gautam Gambhir       India      Top-order batsman      Left hand      Legbreak
Virat Kohli          India      Middle-order batsman   Right hand     Right-arm medium
Yuvraj Singh         India      Middle-order batsman   Left hand      Slow left-arm orthodox
Zaheer Khan          India      Fast bowler            Right hand     Left-arm fast
Ricky Ponting        Australia  Top-order batsman      Right hand     -
Shawn Marsh          Australia  Top-order batsman      Left hand      Slow left-arm orthodox
Kumar Sangakkara     Sri Lanka  Wicketkeeper batsman   Left hand      Right-arm offbreak
Mahela Jayawardene   Sri Lanka  Batsman                Right hand     Right-arm medium
Upul Tharanga        Sri Lanka  Batsman                Left hand      -
Chamara Silva        Sri Lanka  Batsman                Right hand     Legbreak

Table 3.1 Knowledge Base
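In code, each row of this knowledge base can be thought of as a record keyed by the player's name. The sketch below shows one such record with illustrative field names; the actual crawled pages contain further fields such as birth date and major teams.

    # A minimal sketch of the knowledge base built from crawled player pages.
    knowledge_base = {
        "Zaheer Khan": {
            "team": "India",
            "playing_role": "Fast bowler",
            "batting_style": "Right hand",
            "bowling_style": "Left-arm fast",
        },
        # ... one entry per player, as in Table 3.1
    }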

For the Cricket domain, the system takes only NEs as keywords. So if a keyword's NE tag is location/number/date/organization, the system selects three distractors from the database at random. But when the NE tag is a person's name, three distractors are selected based on (i) the properties of the keyword and (ii) the clues in the question sentence. The distractor selection method is shown in Figure 3.1.

In the case of a person's name, the team name, playing role, batting style and bowling style are the features of the keyword (Table 3.1). The system looks for clues in the sentence such as team names and other player names. According to the features and clues extracted by the system, three distractors are chosen either from the same team as the keyword, or from both playing teams, or from any team playing in the tournament. Distractors are selected such that none of them already occur in the question sentence. The remainder of this section describes the different strategies incorporated in order to handle the different cases.

Figure 3.1 Distractor Selection Method

3.5.1 Select distractors from a single team

The presence of a team name, or of a player of either of the two playing teams, is a direct clue for selecting the distractors from the keyword's team. It does not matter whether the team name is that of the player who is our keyword or of the team he is playing against, as long as it is one of these two. Consider Example 6 and Example 7.

6. Sehwag had hit a boundary from the first ball of six of India's previous eight innings in Cricket World Cup 2011.
   Q: ____ had hit a boundary from the first ball of six of India's previous eight innings in Cricket World Cup 2011.
   A: (a) Gambhir (b) Sehwag (c) Tendulkar (d) Kohli

7. MS Dhoni trumped a poetic century from Mahela Jayawardene to pull off the highest run-chase ever achieved in a World Cup final.
   Q: MS Dhoni trumped a poetic century from ____ to pull off the highest run-chase ever achieved in a World Cup final.
   A: (a) Kumar Sangakkara (b) Upul Tharanga (c) Mahela Jayawardene (d) Chamara Silva

In Example 6, the system explicitly finds India, a team name, whereas in Example 7, the system finds a player of the opposing team, MS Dhoni. In both these cases, the distractors are selected from the team that the keyword belongs to. (Note that Examples 5 and 6 are different because the sets of alternatives for the question are different.)

3.5.2 Select distractors from both the teams

We observed that we could choose distractors from either of the teams if there are no features indicating a particular playing team and the keyword is from one of the two teams. So the system can select three distractors from either of the two playing teams, which gives a larger pool from which to select the distractors.

8. Gambhir gave away the chance for an unforgettable century with a tired charge and slash in the 2011 world-cup final.
   Q: ____ gave away the chance for an unforgettable century with a tired charge and slash in the 2011 world-cup final.
   A: (a) Jayawardene (b) Gambhir (c) Dhoni (d) Sangakkara

In Example 8, there are no features indicating that the distractors should all belong to either team India or team Sri Lanka (the world-cup final was played between India and Sri Lanka). So, in such cases, we can select distractors from both teams.

3.5.3 Select distractors from any team

If the keyword in a question does not belong to either of the teams, then it could be the name of an umpire or of a player from another team. In the case of an umpire, we randomly select three umpires from the list of umpires for that tournament. In the case of a player who belongs to neither of the teams playing the match, we randomly pick three players with the same playing role as the keyword from any team, whether playing in the match or not. An instance of such a case is shown in Example 9 below.

9. Yuvraj has a leg-before appeal turned down by umpire Taufel, and straightaway implores his captain to go for the review.
   Q: Yuvraj has a leg-before appeal turned down by umpire ____, and straightaway implores his captain to go for the review.
   A: (a) (b) (c) Ian Gould (d) Billy Bowden
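The three strategies of Sections 3.5.1-3.5.3 can be sketched together as below. The sketch assumes a knowledge base keyed by player name with fields as in Table 3.1; the function name, the exact clue test and the data layout are illustrative, not the system's actual code, and the simpler location/number/date/organization case (random sampling from the database) is omitted.

    # A minimal sketch of distractor selection for a person-name keyword (Figure 3.1).
    import random

    def select_distractors(keyword, sentence, knowledge_base, playing_teams, umpires, num=3):
        key = knowledge_base.get(keyword)
        # clues: an explicit team name, or the name of a player of either playing team
        team_clue = any(team in sentence for team in playing_teams) or any(
            name in sentence and props["team"] in playing_teams
            for name, props in knowledge_base.items() if name != keyword)

        if keyword in umpires:                       # 3.5.3: umpire keyword
            pool = [u for u in umpires if u != keyword]
        elif key is not None and key["team"] in playing_teams:
            if team_clue:                            # 3.5.1: a clue fixes the keyword's team
                pool = [name for name, props in knowledge_base.items()
                        if props["team"] == key["team"] and name != keyword]
            else:                                    # 3.5.2: either playing team will do
                pool = [name for name, props in knowledge_base.items()
                        if props["team"] in playing_teams and name != keyword]
        else:                                        # 3.5.3: outside player, matched on role
            role = key["playing_role"] if key else None
            pool = [name for name, props in knowledge_base.items()
                    if name != keyword and (role is None or props["playing_role"] == role)]

        pool = [name for name in pool if name not in sentence]  # never reuse names in the sentence
        return random.sample(pool, min(num, len(pool)))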

3.6 Evaluation and Results

Automatic evaluation of any CQG system is difficult for two reasons: (i) agreeing on standard evaluation data is difficult, and (ii) there is no one particular set of CQs that is correct. Most question generation systems hence rely on manual evaluation.

Score   Distractors
4       Three are useful
3       Two are useful
2       One is useful
1       None is useful

Table 3.2 Evaluation Guidelines

The distractors are evaluated for their usability (i.e. the score is the number of distractors that are useful). A distractor is useful if it can’t be discounted easily through simple elimination techniques. A “usable” distractor has been defined in different ways, ranging from the simple requirement that only one choice is correct [45], to expert judgments [5]. Others take into account the time needed for manual post-editing [17], in relation to designing the item from scratch. We adopt the simple requirement as in [45] as shown in Table 3.2 above.

Evaluator   Score 4   Score 3   Score 2   Score 1
Eval-1      11        4         4         3
Eval-2      6         14        1         1
Eval-3      14        5         3         0

Table 3.3 Cloze Question Evaluation Results

Cloze questions were generated from news reports on two Cricket World Cup 2011 matches and were evaluated manually. Distractors for 22 questions (10+12) were evaluated by three different evaluators using the guidelines mentioned in the previous section. The results are listed in Table 3.3. The accuracy of the distractors is 3.05 (Eval-1), 3.14 (Eval-2) and 3.5 (Eval-3) out of 4. It is clear from the results that the distractors generated using the domain knowledge are very useful.
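Assuming the reported accuracy is the mean usability score over the 22 evaluated questions in Table 3.3, the Eval-1 figure, for example, works out as (11 × 4 + 4 × 3 + 4 × 2 + 3 × 1) / 22 = 67 / 22 ≈ 3.05; the Eval-2 and Eval-3 figures follow in the same way.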

3.7 Conclusions

In this work, we present an end-to-end system that takes a document as input and outputs the important cloze questions. Our system factors the CQG process into three stages: (i) sentence selection, (ii) keyword selection and (iii) distractor selection. A domain-dependent approach is described for the distractor selection module of the CQG system. The system is implemented for and tested on examples from the cricket sports domain. With the main focus being on distractor selection, we have shown the influence of the domain on the quality of the distractors. The next chapter summarizes the thesis, states the conclusions and discusses future work.

Chapter 4

Conclusion and Future Work

In this thesis, we describe automatic Question Generation Systems that take natural language text as input and generate questions of various types and scope for the user. Our aim is to generate questions that assess the content knowledge that a student has acquired upon reading a text rather than vocabulary or grammar assessment or language learning. We restrict our investigation to fact-based questions about literal information present in the text, but we believe our techniques can be extended to generate questions involving inference and deeper levels of meaning. In this work, we have described two automatic Question Generation Systems. Both these systems factor the QG process into several stages, enabling more or less independent development of particular stages.

4.1 Conclusions

The QG system, described in chapter 2, generates questions automatically using discourse connectives for different question types. We described an end-to-end system that takes a document as input and outputs all the questions for the selected discourse connectives. The selected discourse connectives include four subordinating conjunctions, since, when, because and although, and three adverbials, for example, for instance and as a result. Our system factors the QG process into two stages: content selection (the text selected for question generation) and question formation (transformations on the content to get the question). The question formation module further comprises (i) finding a suitable question type (wh-word), (ii) auxiliary and main verb transformations and (iii) rearranging the phrases to get the final question. The system has been evaluated for the syntactic and semantic soundness of the questions by two evaluators. Our evaluations showed that the challenging problem of QG using discourse connectives is far from solved; however, by extending and improving upon our current system, we can progress towards that goal. We have shown that, from the QG point of view, some specific discourse relations such as causal, temporal, result, etc. are more important than others. It is clear from the work that discourse connectives are good enough for QG and that there is no need for full-fledged discourse parsing. In this work we have laid the foundation for generating medium and specific scope questions using

discourse connectives, medium scope questions being in general more challenging to generate than specific ones. The cloze question generation (CQG) system, described in chapter 3, takes a document as input and outputs the important cloze questions. Our system factors the CQG process into three stages: (i) sentence selection, (ii) keyword selection and (iii) distractor selection. A domain-dependent approach is described for the distractor selection module of the CQG system. The system is implemented for and tested on examples from the cricket sports domain. With the main focus being on distractor selection, we have shown the influence of the domain on the quality of the distractors. We also present evaluation guidelines for CQG systems.

4.2 Future Work

Future work in the field of QG, in general, includes generating questions of general scope. This task, though included in QGSTEC 2010, has seldom been solved. Also, it would be interesting to see the techniques employed in both systems extended to other languages that have similar structures. The next steps for QG using discourse connectives include the following tasks: (a) implementing co-reference resolution, (b) question generation for sentences with two or more connectives, (c) improving the system with respect to sentence complexity, and (d) deeper research in order to incorporate other discourse connectives. The future work for cloze question generation includes: (a) improving the sentence selection module by using a weighted function of various features, instead of an extractive summarizer, to select better sentences for QG, (b) finding more rules for even better distractors, (c) generalizing the approach to work automatically with almost any text, and (d) measuring the overall performance of the system taking the entire document into account.

Related Publications

Rakshit Shah, Manish Agarwal and Prashanth Mannem. Automatic Question Generation using Discourse Cues. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 1-9. Association for Computational Linguistics, 2011.

Rakshit Shah and Prashanth Mannem. Distractor Selection for Cloze Questions. (Due for submission)

Bibliography

[1] M. Agarwal and P. Mannem. Automatic gap-fill question generation from text books. In Proceedings of the 6th Workshop on Innovative Use of NLP for Building Educational Applications, pages 56–64. Association for Computational Linguistics, 2011.
[2] H. Ali, Y. Chali, and S. A. Hasan. Automation of Question Generation From Sentences, pages 58–67. Number January. questiongeneration.org, 2010.
[3] I. Beck et al. Questioning the Author: An Approach for Enhancing Student Engagement with Text. International Reading Association, Newark, DE, 1997.
[4] J. Brown, G. Frishkoff, and M. Eskenazi. Automatic question generation for vocabulary assessment. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 819–826. Association for Computational Linguistics, 2005.
[5] C.-Y. Chen, H.-C. Liou, and J. S. Chang. Fast - an automatic generation system for grammar tests. Computational Linguistics, (July):1–4, 2006.
[6] N. Chomsky. Conditions on transformations. Avd. för linguistik, Göteborgs Universitet, 1972.
[7] D. Coniam. From text to test, automatically – an evaluation of a computer cloze-test generator. Hong Kong Journal of Applied Linguistics, 3(1):41–60, 1998.
[8] D. Crystal. English as a Global Language, volume 35. Cambridge University Press, 2003.
[9] H. Dang, J. Lin, and D. Kelly. Overview of the TREC 2006 question answering track. In Proceedings of the Fifteenth Text REtrieval Conference, volume 35, page 36, 2006.
[10] R. Elwell and J. Baldridge. Discourse connective argument identification with connective specific rankers. 2008 IEEE International Conference on Semantic Computing, pages 198–205, 2008.
[11] J. Finkel, T. Grenager, and C. Manning. Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 363–370. Association for Computational Linguistics, 2005.
[12] S. Harabagiu, A. Hickl, J. Lehmann, and D. Moldovan. Experiments with interactive question-answering. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 205–214. Association for Computational Linguistics, 2005.

[13] S. Harabagiu, S. Maiorano, and M. Paşca. Open-domain textual question answering techniques. Natural Language Engineering, 9(3):231–267, 2003.
[14] M. Heilman and N. A. Smith. Question generation via overgenerating transformations and ranking. Framework, 2009.
[15] D. Higgins. Item distiller: Text retrieval for computer-assisted test item creation. Development, 2007.
[16] A. Hoshino and H. Nakagawa. Assisting cloze test making with a web application. Technology and Teacher Education Annual, 18(5):2807, 2007.
[17] N. Karamanis, L. Ha, and R. Mitkov. Generating multiple-choice test items from medical text: a pilot study. In Proceedings of the Fourth International Natural Language Generation Conference, pages 111–113. Association for Computational Linguistics, 2006.
[18] D. Klein and C. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[19] H. Kunichika, M. Urushima, T. Hirashima, and A. Takeuchi. A computational method of complexity of questions on contents of English sentences and its evaluation. In Computers in Education, 2002. Proceedings. International Conference on, pages 97–101. IEEE, 2002.
[20] H. Lee, Y. Peirsman, A. Chang, N. Chambers, M. Surdeanu, and D. Jurafsky. Stanford's multi-pass sieve coreference resolution system at the CoNLL-2011 shared task. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task, pages 28–34. Association for Computational Linguistics, 2011.
[21] J. Lee and S. Seneff. Automatic Generation of Cloze Items for Prepositions, pages 2173–2176. Citeseer, 2007.
[22] W. Lehnert. The process of question answering: A computer simulation of cognition. L. Erlbaum Associates, 1978.
[23] C. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1, pages 495–501. Association for Computational Linguistics, 2000.
[24] Y. Lin, L. Sung, and M. Chen. An automatic multiple-choice question generation scheme for English adjective understanding. In Workshop on Modeling, Management and Generation of Problems/Questions in eLearning, the 15th International Conference on Computers in Education (ICCE 2007), pages 137–142, 2007.
[25] C.-L. Liu, C.-H. Wang, Z.-M. Gao, and S.-M. Huang. Applications of Lexical Information for Algorithmically Composing Multiple-Choice Cloze Items, volume 6, pages 1–8. Association for Computational Linguistics, 2005.
[26] P. Mannem, R. Prasad, and A. Joshi. Question generation from paragraphs at UPenn: QGSTEC system description. Proceedings of QG2010: The Third Workshop on Question Generation, pages 84–91, 2010.

[27] R. Mitkov, L. An Ha, and N. Karamanis. A computer-aided environment for generating multiple-choice test items. Natural Language Engineering, 12(2):177–194, 2006.
[28] R. Mitkov and L. Ha. Computer-aided generation of multiple-choice tests. In Proceedings of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing - Volume 2, pages 17–22. Association for Computational Linguistics, 2003.
[29] S. Pal, T. Mondal, P. Pakray, D. Das, and S. Bandyopadhyay. QGSTEC System Description JUQGG: A Rule-based Approach, pages 76–79. questiongeneration.org, 2010.
[30] J. Pino, M. J. Heilman, and M. Eskenazi. A selection strategy to improve cloze question quality. Proceedings of the Workshop on Intelligent Tutoring Systems for Ill-Defined Domains, 2008.
[31] E. Pitler, A. Louis, and A. Nenkova. Automatic sense prediction for implicit discourse relations in text. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP - Volume 2 (ACL-IJCNLP 09), 2(August):683–691, 2009.
[32] E. Pitler and A. Nenkova. Using Syntax to Disambiguate Explicit Discourse Connectives in Text, pages 13–16. Number August. Association for Computational Linguistics, 2009.
[33] S. Pradhan, W. Ward, K. Hacioglu, J. Martin, and D. Jurafsky. Shallow semantic parsing using support vector machines. In Proceedings of HLT/NAACL, pages 2–7, 2004.
[34] R. Prasad and A. Joshi. A discourse-based approach to generating why-questions from texts. In Proceedings of the Workshop on the Question Generation Shared Task and Evaluation Challenge, Arlington, VA, 2008.
[35] R. Prasad, A. Joshi, and B. Webber. Exploiting scope for shallow discourse parsing. Discourse, 142(1-2):2076–2083, 2010.
[36] R. Prasad, E. Miltsakaki, N. Dinesh, A. Lee, A. Joshi, and B. Webber. The Penn Discourse Treebank 2.0 annotation manual. Group, 17(July):2007, 2007.
[37] B. Rosenshine. The case for explicit, teacher-led, cognitive strategy instruction. MF Graves (Chair), What sort of comprehension strategy instruction should schools provide, 1997.
[38] J. Ross. Constraints on variables in syntax. 1967.
[39] V. Rus, B. Wyse, P. Piwek, M. Lintean, S. Stoyanchev, and C. Moldovan. Overview of the first question generation shared task evaluation challenge. In Proceedings of QG2010: The Third Workshop on Question Generation, page 45, 2010.
[40] R. Schank. Explanation patterns: Understanding mechanically and creatively. Lawrence Erlbaum, 1986.
[41] L. Schwartz, T. Aikawa, and M. Pahud. Dynamic language learning tools. In InSTIL/ICALL Symposium 2004, 2004.
[42] C. Shei. FollowYou!: an automatic language lesson generation system. Computer Assisted Language Learning, 14(2):129–144, 2001.
[43] S. Smith, P. Avinesh, and A. Kilgarriff. Gap-fill tests for language learners: Corpus-driven item generation.

[44] E. Sneiders. Automated question answering using question templates that cover the conceptual model of the database. Data Base, pages 235–239, 2002.
[45] E. Sumita, F. Sugaya, and S. Yamamoto. Measuring Non-native Speakers' Proficiency of English by Using a Test with Automatically-Generated Fill-in-the-Blank Questions, pages 61–68. Number June. Association for Computational Linguistics, 2005.
[46] K. Toutanova, D. Klein, C. Manning, and Y. Singer. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics, 2003.
[47] I. Ureel, K. Forbus, C. Riesbeck, and L. Birnbaum. Question generation for learning by reading. Defense Technical Information Center, 2005.
[48] National Reading Panel (US) and National Institute of Child Health and Human Development (US). Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. National Institute of Child Health and Human Development, National Institutes of Health, 2000.
[49] A. Varga and L. A. Ha. WLV: A question generation system for the QGSTEC 2010 Task B. Proceedings of QG2010: The Third Workshop on Question Generation, pages 80–83, 2010.
[50] M. Witbrock and V. Mittal. Ultra-summarization (poster abstract): a statistical approach to generating highly condensed non-extractive summaries. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 315–316. ACM, 1999.
