
Good Question! Statistical Ranking for Question Generation

Michael Heilman and Noah A. Smith
Language Technologies Institute
Carnegie Mellon University
Pittsburgh, PA 15213, USA
{mheilman,nasmith}@cs.cmu.edu

Abstract

We address the challenge of automatically generating questions from reading materials for educational practice and assessment. Our approach is to overgenerate questions, then rank them. We use manually written rules to perform a sequence of general-purpose syntactic transformations (e.g., subject-auxiliary inversion) to turn declarative sentences into questions. These questions are then ranked by a logistic regression model trained on a small, tailored dataset consisting of labeled output from our system. Experimental results show that ranking nearly doubles the percentage of questions rated as acceptable by annotators, from 27% of all questions to 52% of the top-ranked 20% of questions.

1 Introduction

In this paper, we focus on question generation (QG) for the creation of educational materials for reading practice and assessment. Our goal is to generate fact-based questions about the content of a given article. The top-ranked questions could be filtered and revised by educators, or given directly to students for practice. Here we restrict our investigation to questions about factual information in texts.

We begin with a motivating example. Consider the following sentence from the Wikipedia article on the history of Los Angeles:[1] During the Gold Rush years in northern California, Los Angeles became known as the "Queen of the Cow Counties" for its role in supplying beef and other foodstuffs to hungry miners in the north.

[1] "History of Los Angeles." Wikipedia. 2009. Wikimedia Foundation, Inc. Retrieved Nov. 17, 2009 from: http://en.wikipedia.org/wiki/History_of_Los_Angeles.

Consider generating the following question from that sentence: What did Los Angeles become known as the "Queen of the Cow Counties" for?

We observe that the QG process can be viewed as a two-step process that essentially "factors" the problem into simpler components.[2] Rather than simultaneously trying to remove extraneous information and transform a declarative sentence into an interrogative one, we first transform the input sentence into a simpler sentence such as Los Angeles became known as the "Queen of the Cow Counties" for its role in supplying beef and other foodstuffs to hungry miners in the north, which we can then transform into a more succinct question.

[2] The motivating example does not exhibit lexical semantic variations such as synonymy. In this work, we do not model complex paraphrasing, but believe that paraphrase generation techniques could be incorporated into our approach.

Question transformation involves complex long distance dependencies. For example, in the question about Los Angeles, the word what at the beginning of the sentence is a semantic argument of the verb phrase known as ... at the end of the question. The characteristics of such phenomena are (arguably) difficult to learn from corpora, but they have been studied extensively in linguistics (Ross, 1967; Chomsky, 1973). We take a rule-based approach in order to leverage this linguistic knowledge.

However, since many phenomena pertaining to question generation are not so easily encoded with rules, we include statistical ranking as an integral component. Thus, we employ an overgenerate-and-rank approach, which has been applied successfully in areas such as generation (Walker et al., 2001; Langkilde and Knight, 1998) and syntactic parsing (Collins, 2000). Since large datasets of the appropriate domain, style, and form of questions are not available to train our ranking model, we learn to rank from a relatively small, tailored dataset of human-labeled output from our rule-based system.

The remainder of the paper is organized as follows. §2 clarifies connections to prior work and enumerates our contributions. §3 discusses particular terms and conventions we will employ. §4 discusses rule-based question transformation. §5 describes the data used to learn and to evaluate our question ranking model, and §6 then follows with details on the ranking approach itself. We then present and discuss results from an evaluation of ranked question output in §7 and conclude in §8.

2 Connections with Prior Work

The generation of questions by humans has long motivated theoretical work in linguistics (e.g., Ross, 1967), particularly work that portrays questions as transformations of canonical declarative sentences (Chomsky, 1973).

Questions have also been a major topic of study in computational linguistics, but primarily with the goal of answering questions (Dang et al., 2008). While much of the question answering research has focused on retrieval or extraction (e.g., Ravichandran and Hovy, 2001; Hovy et al., 2001), models of the transformation from answers to questions have also been developed (Echihabi and Marcu, 2003) with the goal of finding correct answers given a question (e.g., in a source-channel framework). Also, Harabagiu et al. (2005) present a system that automatically generates questions from texts to predict which user-generated questions the text might answer. In such work on question answering, question generation models are typically not evaluated for their intrinsic quality, but rather with respect to their utility as an intermediate step in the question answering process.

QG is very different from many natural language generation problems because the input is natural language rather than a formal representation (cf. Reiter and Dale, 1997). It is also different from some other tasks related to generation: unlike machine translation (e.g., Brown et al., 1990), the input and output for QG are in the same language, and their length ratio is often far from one to one; and unlike sentence compression (e.g., Knight and Marcu, 2000), QG may involve substantial changes to words and their ordering, beyond simple removal of words.

Some previous research has directly approached the topic of generating questions for educational purposes (Mitkov and Ha, 2003; Kunichika et al., 2004; Gates, 2008; Rus and Graessar, 2009; Rus and Lester, 2009), but to our knowledge, none has involved statistical models for choosing among output candidates. Mitkov et al. (2006) demonstrated that automatic generation and manual correction of questions can be more time-efficient than manual authoring alone. Much of the prior QG research has evaluated systems in specific domains (e.g., introductory linguistics, English as a Second Language), and thus we do not attempt empirical comparisons. Existing QG systems model their transformations from source text to questions with many complex rules for specific question types (e.g., a rule for creating a question Who did the Subject Verb? from a sentence with SVO word order and an object referring to a person), rather than with sets of general rules.

This paper's contributions are as follows:

• We apply statistical ranking to the task of generating natural language questions. In doing so, we show that question rankings are improved by considering features beyond surface characteristics such as sentence lengths.

• We model QG as a two-step process of first simplifying declarative input sentences and then transforming them into questions, the latter step being achieved by a sequence of general rules.

• We incorporate linguistic knowledge to explicitly model well-studied phenomena related to long distance dependencies in WH questions, such as noun phrase island constraints.

• We develop a QG evaluation methodology, including the use of broad-domain corpora.
3 Definitions and Conventions

The term "source sentence" refers to a sentence taken directly from the input document, from which a question will be generated (e.g., Kenya is located in Africa.). The term "answer phrase" refers to phrases in declarative sentences which may serve as targets for WH-movement, and therefore as possible answers to generated questions (e.g., in Africa). The term "question phrase" refers to the phrase containing the WH word that replaces an answer phrase (e.g., Where in Where is Kenya located?).

To represent the syntactic structure of sentences, we use simplified Penn Treebank-style phrase structure trees, including POS and category labels, as produced by the Stanford Parser (Klein and Manning, 2003). Noun phrase heads are selected using Collins' rules (Collins, 1999).

To implement the rules for transforming source sentences into questions, we use Tregex, a tree query language, and Tsurgeon, a tree manipulation language built on top of Tregex (Levy and Andrew, 2006). The Tregex language includes various relational operators based on the primitive relations of immediate dominance (denoted "<") and immediate precedence (denoted "."). Tsurgeon adds the ability to modify trees by relabeling, deleting, moving, and inserting nodes.
4 Rule-based Overgeneration

Many useful questions can be viewed as lexical, syntactic, or semantic transformations of the declarative sentences in a text. We describe how to model this process in two steps, as proposed in §1.[3]

[3] See Heilman and Smith (2009) for details on the rule-based component.

4.1 Sentence Simplification

In the first step for transforming sentences into questions, each of the sentences from the source text is expanded into a set of derived declarative sentences (which also includes the original sentence) by altering lexical items, syntactic structure, and semantics. Many existing NLP transformations could potentially be exploited in this step, including sentence compression, paraphrase generation, or lexical semantics for word substitution.

In our implementation, a set of transformations derives a simpler form of the source sentence by removing phrase types such as leading conjunctions, sentence-level modifying phrases, and appositives. Tregex expressions identify the constituents to move, alter, or delete. Similar transformations have been utilized in previous work on headline generation (Dorr and Zajic, 2003) and summarization (Toutanova et al., 2007).

To enable questions about syntactically embedded content, our implementation also extracts a set of declarative sentences from any finite clauses, relative clauses, appositives, and participial phrases that appear in the source sentence. For example, it transforms the sentence Selling snowballed because of waves of automatic stop-loss orders, which are triggered by computer when prices fall to certain levels into Automatic stop-loss orders are triggered by computer when prices fall to certain levels, from which the next step will produce What are triggered by computer when prices fall to certain levels?.
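To make the simplification step concrete, here is a minimal sketch of this kind of tree-based pruning. It uses NLTK's Tree class rather than the authors' Tregex/Tsurgeon rules, implements only one of the transformations (removing a sentence-initial modifying phrase), and the function name and abbreviated parse string are illustrative, not part of the original system.

```python
from nltk.tree import Tree

def remove_leading_modifier(sent_tree):
    """Drop a sentence-initial PP/SBAR/ADVP modifier (and a following comma),
    approximating one of the simplification rules described above."""
    children = list(sent_tree)
    if children and children[0].label() in {"PP", "SBAR", "ADVP"}:
        children = children[1:]
        if children and children[0].label() == ",":
            children = children[1:]
    return Tree(sent_tree.label(), children)

parse = Tree.fromstring(
    "(S (PP (IN During) (NP (DT the) (NNP Gold) (NNP Rush) (NNS years))) (, ,) "
    "(NP (NNP Los) (NNP Angeles)) (VP (VBD became) (VP (VBN known) "
    "(PP (IN as) (NP (DT the) (NN Queen) (PP (IN of) (NP (DT the) (NNP Cow) (NNPS Counties))))))) (. .))")

simplified = remove_leading_modifier(parse)
print(" ".join(simplified.leaves()))
# -> Los Angeles became known as the Queen of the Cow Counties .
```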

4.2 Question Transformation

In the second step, the declarative sentences derived in step 1 are transformed into sets of questions by a sequence of well-defined syntactic and lexical transformations (subject-auxiliary inversion, WH-movement, etc.). This step identifies the answer phrases which may be targets for WH-movement and converts them into question phrases.[4]

[4] We leave the generation of correct answers and distractors to future work.

In the current implementation, answer phrases can be noun phrases or prepositional phrases, which enables who, what, where, when, and how much questions. The system could be extended to transform other types of phrases into other types of questions (e.g., how, why, and what kind of). It should be noted that the transformation from answer to question is achieved by applying a series of general-purpose rules. This would allow, for example, the addition of a rule to generate why questions that builds off of the existing rules for subject-auxiliary inversion, verb decomposition, etc. In contrast, previous QG approaches have employed separate rules for specific sentence types (e.g., Mitkov and Ha, 2003; Gates, 2008).

For each sentence, many questions may be produced: there are often multiple possible answer phrases, and multiple question phrases for each answer phrase. Hence many candidates may result from the transformations.

These rules encode a substantial amount of linguistic knowledge about the long distance dependencies prevalent in questions, which would be challenging to learn from existing corpora of questions and answers consisting typically of only thousands of examples (e.g., Voorhees, 2003).

Specifically, the following sequence of transformations is performed, as illustrated in Figure 1: mark phrases that cannot be answer phrases due to constraints on WH-movement (§4.2.1, not in figure); select an answer phrase, remove it, and generate possible question phrases for it (§4.2.2); decompose the main verb; invert the subject and auxiliary verb; and insert one of the possible question phrases. Some of these steps do not apply in all cases. For example, no answer phrases are removed when generating yes-no questions.

[Figure 1: An illustration of the sequence of steps for generating questions, applied to the running example: source sentence, sentence simplification, answer phrase selection, main verb decomposition, subject-auxiliary inversion, and movement and insertion of the question phrase, followed by statistical ranking of candidates such as What became known as ...?, What did Los Angeles become known ... for?, and What did Los Angeles become known ... as?, yielding What did Los Angeles become known as the "Queen of the Cow Counties" for?. For clarity, trees are not shown for all steps. Also, while many questions may be generated from a single source sentence, only one path is shown; other possibilities exist at several steps.]
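As a rough illustration of the step sequence in Figure 1, the toy sketch below walks the running example through answer phrase removal, main verb decomposition, subject-auxiliary inversion, and question phrase insertion. It operates on plain strings for this one example only; the real system performs these operations on parse trees with Tregex/Tsurgeon rules, so the function and its string manipulation are illustrative assumptions, not the authors' code.

```python
def generate_question(subject, verb_past, rest, answer_phrase, question_phrase):
    """Toy version of the Figure 1 pipeline for one simple declarative sentence."""
    # 1. Remove the answer phrase from the predicate.
    predicate = rest.replace(answer_phrase, "").strip()
    # 2. Decompose the main verb: "became" -> "did ... become".
    aux, verb_base = "did", {"became": "become"}[verb_past]
    # 3. Subject-auxiliary inversion: the auxiliary precedes the subject.
    inverted = f"{aux} {subject} {verb_base} {predicate}"
    # 4. Insert the question phrase at the front.
    return f"{question_phrase} {inverted.strip()}?"

print(generate_question(
    subject="Los Angeles",
    verb_past="became",
    rest='known as the "Queen of the Cow Counties" for its role in supplying beef',
    answer_phrase="its role in supplying beef",
    question_phrase="What"))
# -> What did Los Angeles become known as the "Queen of the Cow Counties" for?
```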

4.2.1 Marking Unmovable Phrases

In English, various constraints determine whether phrases can be involved in WH-movement and other phenomena involving long distance dependencies. In a seminal dissertation, Ross (1967) described many of these phenomena. Goldberg (2006) provides a concise summary of them.

For example, noun phrases are "islands" to movement, meaning that constituents dominated by a noun phrase typically cannot undergo WH-movement. Thus, from John liked the book that I gave him, we generate What did John like? but not *Who did John like the book that gave him?.

We operationalize this linguistic knowledge to appropriately restrict the set of questions produced. Eight Tregex expressions mark phrases that cannot be answer phrases due to WH-movement constraints. For example, the following expression encodes the noun phrase island constraint described above, where unmv indicates unmovable noun phrases: NP << NP=unmv.
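A minimal sketch of what the NP << NP=unmv constraint does, reimplemented over an NLTK tree instead of Tregex (the mark_unmovable helper and the NP-UNMV tag are illustrative names, not the authors' code): every noun phrase dominated by another noun phrase is marked so that it will not later be selected as an answer phrase.

```python
from nltk.tree import Tree

def mark_unmovable(tree):
    """Relabel any NP dominated by another NP, mirroring the Tregex
    pattern 'NP << NP=unmv' (the noun phrase island constraint)."""
    for np in tree.subtrees(lambda t: t.label() == "NP"):
        for inner in np.subtrees(lambda t: t.label() == "NP"):
            if inner is not np:
                inner.set_label("NP-UNMV")
    return tree

parse = Tree.fromstring(
    "(S (NP (NNP John)) (VP (VBD liked) (NP (NP (DT the) (NN book)) "
    "(SBAR (WHNP (WDT that)) (S (NP (PRP I)) (VP (VBD gave) (NP (PRP him))))))) (. .))")
marked = mark_unmovable(parse)
print([" ".join(t.leaves()) for t in marked.subtrees(lambda t: t.label() == "NP-UNMV")])
# -> ['the book', 'I', 'him']  (these NPs cannot become answer phrases)
```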
4.2.2 Generating Possible Question Phrases

After marking unmovable phrases, we iteratively remove each possible answer phrase and generate possible question phrases from it. The system annotates the source sentence with a set of entity types taken from the BBN Identifinder Text Suite (Bikel et al., 1999) and then uses these entity labels along with the syntactic structure of a given answer phrase to generate zero or more question phrases, each of which is used to generate a final question. (This step is skipped for yes-no questions.)

5 Rating Questions for Evaluation and Learning to Rank

Since different sentences from the input text, as well as different transformations of those sentences, may be more or less likely to lead to high-quality questions, each question is scored according to features of the source sentence, the input sentence, the question, and the transformations used in its generation. The scores are used to rank the questions. This is an example of an "overgenerate-and-rank" strategy (Walker et al., 2001; Langkilde and Knight, 1998).

This section describes the acquisition of a set of rated questions produced by the steps described above. Separate portions of these labeled data will be used to develop a discriminative question ranker (§6), and to evaluate ranked lists of questions (§7).

Fifteen native English-speaking university students rated a set of questions produced from steps 1 and 2, indicating whether each question exhibited any of the deficiencies listed in Table 1.[5] If a question exhibited no deficiencies, raters were asked to label it "acceptable." Annotators were asked to read the text of a newswire or encyclopedia article (§5.1 describes the corpora used), and then rate approximately 100 questions generated from that text. They were asked to consider each question independently, such that similar questions about the same information would receive similar ratings.

[5] The ratings from one person were excluded due to an extremely high rate of accepting questions as error-free and other irregularities.

For a predefined training set, each question was rated by a single annotator (not the same for each question), leading to a large number of diverse examples. For the test set, each question was rated by three people (again, not the same for each question) to provide a more reliable evaluation. To assign final labels to the test data, a question was labeled as acceptable only if a majority of the three raters rated it as acceptable (i.e., without deficiencies).[6]

[6] The percentages in Table 1 do not add up to 100% for two reasons: first, questions are labeled acceptable in the test set only if the majority of raters labeled them as having no deficiencies, rather than the less strict criterion of requiring no deficiencies to be identified by a majority of raters; second, the categories are not mutually exclusive.

An inter-rater agreement of Fleiss's κ = 0.42 was computed from the test set's acceptability ratings. This value corresponds to "moderate agreement" (Landis and Koch, 1977) and is somewhat lower than for other rating schemes.[7]

[7] E.g., Dolan and Brockett (2005) and Glickman et al. (2005) report κ values around 0.6 for paraphrase identification and textual entailment, respectively.
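The two labeling criteria contrasted in footnote [6] can be stated precisely in a few lines. This sketch uses made-up rating data and our own function names, not the authors' code; the test question below is unacceptable under the stricter criterion actually used, but acceptable under the lenient alternative.

```python
def acceptable_strict(ratings):
    """Criterion used for the test set: a majority of raters
    marked the question as having no deficiencies at all."""
    clean_votes = sum(1 for deficiencies in ratings if not deficiencies)
    return clean_votes > len(ratings) / 2

def acceptable_lenient(ratings):
    """Less strict alternative mentioned in footnote [6]: no single
    deficiency category was flagged by a majority of raters."""
    categories = set().union(*ratings)
    return all(sum(1 for d in ratings if c in d) <= len(ratings) / 2
               for c in categories)

# Hypothetical ratings from three annotators for one question:
ratings = [{"Vague"}, {"Ungrammatical"}, set()]
print(acceptable_strict(ratings), acceptable_lenient(ratings))  # False True
```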
5.1 Corpora

The training and test datasets consisted of 2,807 and 428 questions, respectively. The questions were generated from three corpora.

The first corpus was a random sample from the featured articles in the English Wikipedia[8] with between 250 and 2,000 word tokens. This English Wikipedia corpus provides expository texts written at an adult reading level from a variety of domains, which roughly approximates the prose that a secondary or post-secondary student would encounter. By choosing from the featured articles, we intended to select well-edited articles on topics of general interest. The training set included 1,328 questions about 12 articles, and the test set included 120 questions about 2 articles from this corpus.

[8] The English and Simple English Wikipedia data were downloaded on December 16, 2008 from http://en.wikipedia.org and http://simple.wikipedia.org, respectively.

The second corpus was a random sample from the articles in the Simple English Wikipedia of similar length. This corpus provides similar text but at a reading level corresponding to elementary education or intermediate second language learning.[9] The training set included 1,195 questions about 16 articles, and the test set included 118 questions about 2 articles from this corpus.

[9] The subject matter of the articles in the two Wikipedia corpora was not matched.

The third corpus was Section 23 of the Wall Street Journal data in the Penn Treebank (Marcus et al., 1993).[10] The training set included 284 questions about 8 articles, and the test set included 190 questions about 2 articles from this corpus.

[10] In separate experiments with the Penn Treebank, gold-standard parses led to an absolute increase of 15% in the percentage of acceptable questions (Heilman and Smith, 2009).

6 Ranking

We use a discriminative ranker to rank questions, similar to the approach described by Collins (2000) for ranking syntactic parses. Questions are ranked by the predictions of a logistic regression model of question acceptability. Given the question q and source text t, the model defines a binomial distribution p(R | q, t), with the binary random variable R ranging over a ("acceptable") and u ("unacceptable"). We estimate the parameters by optimizing the regularized log-likelihood of the training data (cf. §5.1) with a variant of Newton's method (le Cessie and van Houwelingen, 1997). In our experiments, the regularization constant was selected through cross-validation on the training data.
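A minimal sketch of this ranking setup using scikit-learn's regularized logistic regression in place of the authors' implementation (which used a variant of Newton's method with a cross-validated regularization constant); the feature matrices and labels below are random placeholders standing in for the feature set described next.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: rows are questions, columns are the Section 6 features;
# y is 1 for "acceptable", 0 for "unacceptable".
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(2807, 187)), rng.integers(0, 2, 2807)
X_test, y_test = rng.normal(size=(428, 187)), rng.integers(0, 2, 428)

# L2-regularized logistic regression modeling p(R = acceptable | q, t).
model = LogisticRegression(C=1.0, max_iter=1000).fit(X_train, y_train)

# Rank test questions by predicted probability of acceptability.
scores = model.predict_proba(X_test)[:, 1]
order = np.argsort(-scores)

# Evaluation metric from Section 7: percent acceptable among the top 20%.
top = order[: int(0.2 * len(order))]
print("Acceptable in top 20%:", 100.0 * y_test[top].mean())
```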

The features used by the ranker can be organized into several groups, described in this section. This feature set was developed by an analysis of questions generated from the training set. The numbers of distinct features for each type are denoted in parentheses, with the second number, after the addition symbol, indicating the number of histogram features (explained below) for that type.

Length Features (3 + 24): The set includes integer features for the numbers of tokens in the question, the source sentence, and the answer phrase from which the WH phrase was generated. These numbers of tokens will also be used for computing the histogram features discussed below.

WH Words (9 + 0): The set includes boolean features for the presence of each possible WH word in the question.

Negation (1 + 0): This is a boolean feature for the presence of not, never, or no in the question.

N-Gram Language Model Features (6 + 0): The set includes real-valued features for the log likelihoods and length-normalized log likelihoods of the question, the source sentence, and the answer phrase. Separate likelihood features are included for unigram and trigram language models. These language models were estimated from the written portion of the American National Corpus Second Release (Ide and Suderman, 2004), which consists of approximately 20 million tokens, using Kneser and Ney (1995) smoothing.

Grammatical Features (23 + 95): The set includes integer features for the numbers of proper nouns, pronouns, adjectives, adverbs, conjunctions, numbers, noun phrases, prepositional phrases, and subordinate clauses in the phrase structure parse trees for the question and answer phrase. It also includes one integer feature for the number of modifying phrases at the start of the question (e.g., as in At the end of the Civil War, who led the Union Army?); three boolean features for whether the main verb is in past, present, or future tense; and one boolean feature for whether the main verb is a form of be.

Transformations (8 + 0): The set includes binary features for the possible syntactic transformations (e.g., removal of appositives and parentheticals, choosing the subject of the source sentence as the answer phrase).

Vagueness (3 + 15): The set includes integer features for the numbers of noun phrases in the question, source sentence, and answer phrase that are potentially vague. We define this set to include pronouns as well as common nouns that are not specified by a subordinate clause, prepositional phrase, or possessive. In the training data, we observed many vague questions resulting from such noun phrases (e.g., What is the bridge named for?).

In addition to the integer features for lengths, counts of grammatical types, and counts of vague noun phrases, the set includes binary "histogram" features for each length or count. These features indicate whether a count or length exceeds various thresholds: 0, 1, 2, 3, and 4 for counts; 0, 4, 8, 12, 16, 20, 24, and 28 for lengths. We aim to account for potentially non-linear relationships between question quality and these values (e.g., most good questions are neither very long nor very short).
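For example, the binary "histogram" features can be computed as in the following sketch; the threshold lists come from the description above, while the function and variable names are ours.

```python
COUNT_THRESHOLDS = [0, 1, 2, 3, 4]
LENGTH_THRESHOLDS = [0, 4, 8, 12, 16, 20, 24, 28]

def histogram_features(value, thresholds):
    """Binary indicators for whether a count or length exceeds each threshold,
    letting the linear ranker capture non-linear effects of these values."""
    return [1 if value > t else 0 for t in thresholds]

# e.g., a 14-token question contributes these length-histogram features:
print(histogram_features(14, LENGTH_THRESHOLDS))
# -> [1, 1, 1, 1, 0, 0, 0, 0]
```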
7 Evaluation and Discussion

This section describes the results of experiments to evaluate the quality of generated questions before and after ranking. Results are aggregated across the 3 corpora (§5.1). The evaluation metric we employ is the percentage of test set questions labeled as acceptable. For rankings, our metric is the percentage of the top N% labeled as acceptable, for various N.

7.1 Results for Unranked Questions

First, we present results for the unranked questions produced by the rule-based overgenerator. As shown in Table 1, 27.3% of test set questions were labeled acceptable (i.e., having no deficiencies) by a majority of raters.[11]

[11] 12.1% of test set questions were unanimously acceptable.

The most frequent deficiencies were ungrammaticality (14.0%), vagueness (19.6%), and semantic errors labeled with the "Does not make sense" category (20.6%). Formatting errors (8.9%) were due to both straightforward issues with pre-processing and more challenging issues such as failing to identify named entities (e.g., Who was nixon's second vice president?).

While Table 1 provides data on how often bad questions were generated, a measure of how often good questions were not generated would require knowing the number of possible valid questions. Instead, we provide a measure of productivity: the system produced an average of 6.0 acceptable questions per 250 words (i.e., the approximate average number of words on a single page in a printed book).

7.2 Configurations and Baselines

For ranking experiments, we present results for the following configurations of features:

All: This configuration includes the entire set of features described in §6.

Surface Features: This configuration includes only features that can be computed from the surface form of the question, source sentence, and answer phrase, that is, without hidden linguistic structures such as parts of speech or syntactic structures. Specifically, it includes features for lengths, length histograms, WH words, negation, and language model likelihoods.

Question Only: This configuration includes all features of questions, but no features involving the source sentence or answer phrase (e.g., it does not include source sentence part of speech counts). It does not include transformation features.

We also present two baselines for comparison:

Random: The expectation of the performance if questions were ranked randomly.

Oracle: The expected performance if all questions that were labeled acceptable were ranked higher than all questions that were labeled unacceptable.

7.3 Ranking Results

Figure 2 shows that the percentage of questions rated as acceptable generally increases as the set of questions is restricted from the full 428 questions in the test set to only the top ranked questions. While 27.3% of all test set questions were acceptable, 52.3% of the top 20% of ranked questions were acceptable. Thus, the quality of the top fifth was nearly doubled by ranking with all the features.

Ranking with surface features also improved question quality, but to a lesser extent. Thus, unobserved linguistic features such as parts of speech and syntax appear to add value for ranking questions.[12]

[12] Ranking with all features was statistically significantly better (p < .05) in terms of the percentage of acceptable questions in the top ranked 20% than ranking with the "question only" or "surface" configurations, or the random baseline, as verified by computing 95% confidence intervals with the BCa Bootstrap (Efron and Tibshirani, 1993).

The ranker seems to have focused on the "Does not make sense" and "Vague" categories. The percentage of nonsensical questions dropped from 20.6% to 4.7%, and vagueness dropped from 19.6% to 7.0%, while ungrammaticality dropped from 14.0% to 10.5%, and the other, less prevalent, categories changed very little.[13]

[13] We speculate that improvements in syntactic parsing and entity recognition would reduce the proportion of ungrammatical questions and incorrect WH words, respectively.

Features            #     Top 20%   Top 40%
All                 187   52.3      40.8
All – Length        160   52.3      42.1
All – WH            178   50.6      39.8
All – Negation      186   51.7      39.3
All – Lang. Model   181   51.2      39.9
All – Grammatical    69   43.2      38.7
All – Transforms    179   46.5      39.0
All – Vagueness     169   48.3      41.5
All – Histograms     53   49.4      39.8
Surface              43   39.5      37.6
Question Only        91   41.9      39.5
Random               -    27.3      27.3
Oracle               -    100.0     87.3

Table 2: The total numbers of features (#) and the percentages of the top 20% and 40% of ranked test set questions labeled acceptable, for rankers built from variations of the complete set of features ("All"). E.g., "All – WH" is the set of all features except WH word features.

[Figure 2: A graph of the percentage of acceptable questions in the top-N questions in the test set, using various rankings (Oracle, All Features, Question Only, Surface Features, Random), for N varying from 0 to the size of the test set. The vertical axis shows the percentage rated acceptable (20% to 70%); the horizontal axis shows the number of top-ranked questions (0 to 400). The percentages become increasingly unstable when restricted to very few questions (e.g., < 50).]

7.4 Ablation Study

Ablation experiments were also conducted to study the effects of removing each of the different types of features. Table 2 presents the percentages of acceptable test set questions in the top 20% and top 40% when they are scored by rankers trained with various feature sets that are defined by removing various feature types from the set of all possible features.

Grammatical features appear to be the most important: removing them from the feature set resulted in a 9.0% absolute drop in acceptability in the top 20% of questions, from 52.3% to 43.3%.

Some of the features did not appear to be particularly helpful, notably the N-gram language model features. We speculate that they might improve results when used with a larger, less noisy training set.

Performance did not drop precipitously upon the removal of any particular feature type, indicating a high amount of information shared among the features. However, removing several types of features at once led to somewhat larger drops in performance. For example, using only surface features led to a 12.8% drop in acceptability in the top 20%, and using only features of questions led to a 10.4% drop.

8 Conclusion

By ranking the output of a rule-based natural language generation system, existing knowledge about WH-movement from linguistics can be leveraged to model the complex transformations and long distance dependencies present in questions. Also, in this overgenerate-and-rank framework, a statistical ranker trained from a small set of annotated questions can capture trends related to question quality that are not easily encoded with rules. In our experiments, we found that ranking approximately doubled the acceptability of the top-ranked questions generated by our approach.

Acknowledgments

We acknowledge partial support from the Institute of Education Sciences, U.S. Department of Education, through Grant R305B040063 to Carnegie Mellon University; and from the National Science Foundation through a Graduate Research Fellowship for the first author and grant IIS-0915187 to the second author. We thank the anonymous reviewers for their comments.

References
D. M. Bikel, R. Schwartz, and R. M. Weischedel. 1999. An algorithm that learns what's in a name. Machine Learning, 34(1-3).
P. F. Brown, J. Cocke, S. A. Della Pietra, V. J. Della Pietra, F. Jelinek, J. D. Lafferty, R. L. Mercer, and P. S. Roossin. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2).
N. Chomsky. 1973. Conditions on transformations. A Festschrift for Morris Halle.
M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.
M. Collins. 2000. Discriminative reranking for natural language parsing. In Proc. of ICML.
H. T. Dang, D. Kelly, and J. Lin. 2008. Overview of the TREC 2007 question answering track. In Proc. of TREC.
W. B. Dolan and C. Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proc. of IWP.
B. Dorr and D. Zajic. 2003. Hedge Trimmer: A parse-and-trim approach to headline generation. In Proc. of the Workshop on Automatic Summarization.
A. Echihabi and D. Marcu. 2003. A noisy-channel approach to question answering. In Proc. of ACL.
B. Efron and R. Tibshirani. 1993. An Introduction to the Bootstrap. Chapman & Hall/CRC.
D. M. Gates. 2008. Generating reading comprehension look-back strategy questions from expository texts. Master's thesis, Carnegie Mellon University.
O. Glickman, I. Dagan, and M. Koppel. 2005. A probabilistic classification approach for lexical textual entailment. In Proc. of AAAI.
A. Goldberg. 2006. Constructions at Work: The Nature of Generalization in Language. Oxford University Press, New York.
S. Harabagiu, A. Hickl, J. Lehmann, and D. Moldovan. 2005. Experiments with interactive question-answering. In Proc. of ACL.
M. Heilman and N. A. Smith. 2009. Question generation via overgenerating transformations and ranking. Technical Report CMU-LTI-09-013, Language Technologies Institute, Carnegie Mellon University.
E. Hovy, U. Hermjakob, and C. Lin. 2001. The use of external knowledge in factoid QA. In Proc. of TREC.
N. Ide and K. Suderman. 2004. The American National Corpus first release. In Proc. of LREC.
D. Klein and C. D. Manning. 2003. Fast exact inference with a factored model for natural language parsing. In Advances in NIPS 15.
R. Kneser and H. Ney. 1995. Improved backing-off for m-gram language modeling. In Proc. of IEEE Int. Conf. Acoustics, Speech and Signal Processing.
K. Knight and D. Marcu. 2000. Statistics-based summarization - step one: Sentence compression. In Proc. of the Seventeenth National Conference on Artificial Intelligence and Twelfth Conference on Innovative Applications of Artificial Intelligence.
H. Kunichika, T. Katayama, T. Hirashima, and A. Takeuchi. 2004. Automated question generation methods for intelligent English learning systems and its evaluation. In Proc. of ICCE.
J. R. Landis and G. G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics, 33.
I. Langkilde and K. Knight. 1998. Generation that exploits corpus-based statistical knowledge. In Proc. of ACL.
S. le Cessie and J. C. van Houwelingen. 1997. Ridge estimators in logistic regression. Applied Statistics, 41.
R. Levy and G. Andrew. 2006. Tregex and Tsurgeon: tools for querying and manipulating tree data structures. In Proc. of LREC.
M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19.
R. Mitkov and L. A. Ha. 2003. Computer-aided generation of multiple-choice tests. In Proc. of the HLT-NAACL 03 Workshop on Building Educational Applications Using Natural Language Processing.
R. Mitkov, L. A. Ha, and N. Karamanis. 2006. A computer-aided environment for generating multiple-choice test items. Natural Language Engineering, 12(2).
D. Ravichandran and E. Hovy. 2001. Learning surface text patterns for a question answering system. In Proc. of ACL.
E. Reiter and R. Dale. 1997. Building applied natural language generation systems. Nat. Lang. Eng., 3(1).
J. R. Ross. 1967. Constraints on Variables in Syntax. Ph.D. dissertation, MIT, Cambridge, MA.
V. Rus and A. Graessar, editors. 2009. The Question Generation Shared Task and Evaluation Challenge. http://www.questiongeneration.org.
V. Rus and J. Lester, editors. 2009. Proc. of the 2nd Workshop on Question Generation. IOS Press.
K. Toutanova, C. Brockett, M. Gamon, J. Jagarlamudi, H. Suzuki, and L. Vanderwende. 2007. The PYTHY summarization system: Microsoft Research at DUC 2007. In Proc. of DUC.
E. M. Voorhees. 2004. Overview of the TREC 2003 question answering track. In Proc. of TREC 2003.
M. A. Walker, O. Rambow, and M. Rogati. 2001. SPoT: a trainable sentence planner. In Proc. of NAACL.