Simple English : A New Text Simplification Task

William Coster and David Kauchak
Computer Science Department, Pomona College
Claremont, CA 91711
[email protected] [email protected]

Abstract

In this paper we examine the task of sentence simplification, which aims to reduce the reading complexity of a sentence by incorporating more accessible vocabulary and sentence structure. We introduce a new data set that pairs English Wikipedia with Simple English Wikipedia and is orders of magnitude larger than any previously examined for sentence simplification. The data contains the full range of simplification operations including rewording, reordering, insertion and deletion. We provide an analysis of this corpus as well as preliminary results using a phrase-based translation approach for simplification.

1 Introduction

The task of text simplification aims to reduce the complexity of text while maintaining the content (Chandrasekar and Srinivas, 1997; Carroll et al., 1998; Feng, 2008). In this paper, we explore the sentence simplification problem: given a sentence, the goal is to produce an equivalent sentence where the vocabulary and sentence structure are simpler.

Text simplification has a number of important applications. Simplification techniques can be used to make text resources available to a broader range of readers, including children, language learners, the elderly, the hearing impaired and people with aphasia or cognitive disabilities (Carroll et al., 1998; Feng, 2008). As a preprocessing step, simplification can improve the performance of NLP tasks, including parsing, semantic role labeling, machine translation and summarization (Miwa et al., 2010; Jonnalagadda et al., 2009; Vickrey and Koller, 2008; Chandrasekar and Srinivas, 1997). Finally, models for text simplification are similar to models for sentence compression; advances in simplification can benefit compression, which has applications in mobile devices, summarization and captioning (Knight and Marcu, 2002; McDonald, 2006; Galley and McKeown, 2007; Nomoto, 2009; Cohn and Lapata, 2009).

One of the key challenges for text simplification is data availability. The small amount of simplification data currently available has prevented the application of data-driven techniques like those used in other text-to-text translation areas (Och and Ney, 2004; Chiang, 2010). Most prior techniques for text simplification have involved either hand-crafted rules (Vickrey and Koller, 2008; Feng, 2008) or rules learned within a very restricted rule space (Chandrasekar and Srinivas, 1997).

We have generated a data set consisting of 137K aligned simplified/unsimplified sentence pairs by pairing documents, then sentences, from English Wikipedia [1] with corresponding documents and sentences from Simple English Wikipedia [2]. Simple English Wikipedia contains articles aimed at children and English language learners and contains similar content to English Wikipedia but with simpler vocabulary and grammar.

Figure 1 shows example sentence simplifications from the data set. Like machine translation and other text-to-text domains, text simplification involves the full range of transformation operations including deletion, rewording, reordering and insertion.

[1] http://en.wikipedia.org/
[2] http://simple.wikipedia.org

Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Short Papers, pages 665-669, Portland, Oregon, June 19-24, 2011. © 2011 Association for Computational Linguistics

a. Normal: As Isolde arrives at his side, Tristan dies with her name on his lips.
   Simple: As Isolde arrives at his side, Tristan dies while speaking her name.

b. Normal: Alfonso Perez Munoz, usually referred to as Alfonso, is a former Spanish footballer, in the striker position.
   Simple: Alfonso Perez is a former Spanish football player.

c. Normal: Endemic types or species are especially likely to develop on islands because of their geographical isolation.
   Simple: Endemic types are most likely to develop on islands because they are isolated.

d. Normal: The reverse process, producing electrical energy from mechanical energy, is accomplished by a generator or dynamo.
   Simple: A dynamo or an electric generator does the reverse: it changes mechanical movement into electric energy.

Figure 1: Example sentence simplifications extracted from Wikipedia. Normal refers to a sentence in an English Wikipedia article and Simple to a corresponding sentence in Simple English Wikipedia.

2 Previous Data

Wikipedia and Simple English Wikipedia have both received some recent attention as a useful resource for text simplification and the related task of text compression. Yamangil and Nelken (2008) examine the history logs of English Wikipedia to learn sentence compression rules. Yatskar et al. (2010) learn a set of candidate phrase simplification rules based on edits identified in the revision histories of both Simple English Wikipedia and English Wikipedia. However, they only provide a list of the top phrasal simplifications and do not utilize them in an end-to-end simplification system. Finally, Napoles and Dredze (2010) provide an analysis of the differences between documents in English Wikipedia and Simple English Wikipedia, though they do not view the data set as a parallel corpus.

Although the simplification problem shares some characteristics with the text compression problem, existing text compression data sets are small and contain a restricted set of possible transformations (often only deletion). Knight and Marcu (2002) introduced the Zipf-Davis corpus, which contains 1K sentence pairs. Cohn and Lapata (2009) manually generated two parallel corpora from news stories totaling 3K sentence pairs. Finally, Nomoto (2009) generated a data set based on RSS feeds containing 2K sentence pairs.

3 Simplification Corpus Generation

We generated a parallel simplification corpus by aligning sentences between English Wikipedia and Simple English Wikipedia. We obtained complete copies of English Wikipedia and Simple English Wikipedia in May 2010. We first paired the articles by title, then removed all article pairs where either article contained only a single line, was flagged as a stub or a disambiguation page, or was a meta-page about Wikipedia. After pairing and filtering, 10,588 aligned content article pairs remained (a 90% reduction from the original 110K Simple English Wikipedia articles). Throughout the rest of this paper we will refer to unsimplified text from English Wikipedia as normal and to the simplified version from Simple English Wikipedia as simple.
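The pairing and filtering step above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the article representation (a dictionary with a `text` field, a set of `flags`, and a `meta` boolean) is hypothetical.

```python
def pair_and_filter(normal_articles, simple_articles):
    """Pair English and Simple English Wikipedia articles by title,
    then drop pairs where either article is unusable.

    Both arguments map article titles to hypothetical article records:
    {"text": str, "flags": set of str, "meta": bool}.
    """
    def usable(article):
        return (len(article["text"].strip().splitlines()) > 1   # more than a single line
                and "stub" not in article["flags"]              # not flagged as a stub
                and "disambiguation" not in article["flags"]    # not a disambiguation page
                and not article["meta"])                        # not a meta-page about Wikipedia

    # Join on title, keeping only pairs where both sides pass the filter.
    return {title: (normal_articles[title], simple_articles[title])
            for title in normal_articles.keys() & simple_articles.keys()
            if usable(normal_articles[title]) and usable(simple_articles[title])}
```

Titles appearing on only one side are dropped by the key intersection, which is what makes this a join by title rather than a full cross-product.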
To generate aligned sentence pairs from the aligned document pairs, we followed an approach similar to those utilized in previous monolingual alignment problems (Barzilay and Elhadad, 2003; Nelken and Shieber, 2006). Paragraphs were identified based on formatting information available in the articles. Each simple paragraph was then aligned to every normal paragraph where the TF-IDF cosine similarity was over a threshold of 0.5. We initially investigated the paragraph clustering preprocessing step of Barzilay and Elhadad (2003), but did not find a qualitative difference and opted for the simpler similarity-based alignment approach, which does not require manual annotation.

For each aligned paragraph pair (i.e., a simple paragraph and one or more normal paragraphs), we then used a dynamic programming approach to find the best global sentence alignment, following Barzilay and Elhadad (2003). Specifically, given n normal sentences to align to m simple sentences, we find a(n, m) using the following recurrence:

  a(i, j) = max of:
      a(i, j-1) - skip_penalty
      a(i-1, j) - skip_penalty
      a(i-1, j-1) + sim(i, j)
      a(i-1, j-2) + sim(i, j) + sim(i, j-1)
      a(i-2, j-1) + sim(i, j) + sim(i-1, j)
      a(i-2, j-2) + sim(i, j-1) + sim(i-1, j)

where each line above corresponds to a sentence alignment operation: skip the simple sentence, skip the normal sentence, align one normal to one simple, align one normal to two simple, align two normal to one simple, and align two normal to two simple. sim(i, j) is the similarity between the ith normal sentence and the jth simple sentence and was calculated using TF-IDF cosine similarity. We set skip_penalty = 0.0001 manually.
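The recurrence can be sketched in code as follows. This is a minimal Python illustration, not the authors' implementation: it assumes pre-tokenized sentences, computes IDF over just the two paragraphs being aligned with add-one smoothing (the paper does not specify the IDF collection or smoothing), recovers the aligned pairs by backtracking, and filters aligned pairs below a similarity threshold (0.5 in the paper).

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """TF-IDF vectors for tokenized sentences. IDF is computed over these
    sentences only, with add-one smoothing (an assumption)."""
    n = len(sentences)
    df = Counter(word for sent in sentences for word in set(sent))
    return [{w: tf * (math.log((1 + n) / (1 + df[w])) + 1)
             for w, tf in Counter(sent).items()} for sent in sentences]

def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def align(normal, simple, skip_penalty=0.0001, threshold=0.5):
    """Global sentence alignment between a normal and a simple paragraph
    using the recurrence from Section 3. Returns 1-indexed
    (normal, simple) sentence pairs with similarity >= threshold."""
    vecs = tfidf_vectors(normal + simple)
    nvec, svec = vecs[:len(normal)], vecs[len(normal):]
    def sim(i, j):                      # 1-indexed, as in the recurrence
        return cosine(nvec[i - 1], svec[j - 1])
    n, m = len(normal), len(simple)
    a = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    a[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []                  # (score, previous state, linked pairs)
            if j >= 1:                  # skip the simple sentence
                cands.append((a[i][j - 1] - skip_penalty, (i, j - 1), []))
            if i >= 1:                  # skip the normal sentence
                cands.append((a[i - 1][j] - skip_penalty, (i - 1, j), []))
            if i >= 1 and j >= 1:       # one normal to one simple
                cands.append((a[i - 1][j - 1] + sim(i, j),
                              (i - 1, j - 1), [(i, j)]))
            if i >= 1 and j >= 2:       # one normal to two simple
                cands.append((a[i - 1][j - 2] + sim(i, j) + sim(i, j - 1),
                              (i - 1, j - 2), [(i, j), (i, j - 1)]))
            if i >= 2 and j >= 1:       # two normal to one simple
                cands.append((a[i - 2][j - 1] + sim(i, j) + sim(i - 1, j),
                              (i - 2, j - 1), [(i, j), (i - 1, j)]))
            if i >= 2 and j >= 2:       # two normal to two simple
                cands.append((a[i - 2][j - 2] + sim(i, j - 1) + sim(i - 1, j),
                              (i - 2, j - 2), [(i, j - 1), (i - 1, j)]))
            score, prev, linked = max(cands, key=lambda c: c[0])
            a[i][j], back[i][j] = score, (prev, linked)
    # Backtrack from (n, m), keeping pairs above the similarity threshold.
    pairs, state = [], (n, m)
    while state != (0, 0):
        prev, linked = back[state[0]][state[1]]
        pairs.extend(linked)
        state = prev
    return sorted(p for p in pairs if sim(*p) >= threshold)
```

On a toy pair of paragraphs this recovers the expected one-to-one alignments; in practice sim would be computed with IDF statistics from a larger collection.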
Barzilay and Elhadad (2003) further discourage aligning dissimilar sentences by including a "mismatch penalty" in the similarity measure. Instead, we included a filtering step removing all sentence pairs with a normalized similarity below a threshold of 0.5. We found this approach more intuitive, and it allowed us to compare the effects of differing levels of similarity in the training set. Our choice of threshold is high enough to ensure that most alignments are correct, but low enough to allow for variation in the paired sentences. In the future, we hope to explore other similarity techniques that will pair sentences with even larger variation.

4 Corpus Analysis

From the 10K article pairs, we extracted 75K aligned paragraphs. From these, we extracted the final set of 137K aligned sentence pairs. To evaluate the quality of the aligned sentences, we asked two human evaluators to independently judge whether or not the aligned sentences were correctly aligned on a random sample of 100 sentence pairs. They were then asked to reach a consensus about correctness. 91/100 were identified as correct, though many of the remaining 9 also had some partial content overlap. We also repeated the experiment using only those sentences with a similarity above 0.75 (rather than 0.50 in the original data). This reduced the number of pairs from 137K to 90K, but the evaluators identified 98/100 as correct. The analysis throughout the rest of the section is for the threshold of 0.5, though similar results were also seen for the threshold of 0.75.

Although the average simple article contained approximately 40 sentences, we extracted an average of 14 aligned sentence pairs per article. Qualitatively, it is rare to find a simple article that is a direct translation of the normal article, that is, a simple article that was generated by only making sentence-level changes to the normal document. However, there is a strong relationship between the two data sets: 27% of our aligned sentences were identical between simple and normal. We left these identical sentence pairs in our data set since not all sentences need to be simplified and it is important for any simplification algorithm to be able to handle this case.

Much of the content without direct correspondence is removed during paragraph alignment: 65% of the simple paragraphs do not align to a normal paragraph and are ignored. On top of this, within aligned paragraphs there are a large number of sentences that do not align. Table 1 shows the proportion of the different sentence-level alignment operations in our data set. On both the simple and normal sides there are many sentences that do not align.

  Operation                   %
  skip simple                 27%
  skip normal                 23%
  one normal to one simple    37%
  one normal to two simple    8%
  two normal to one simple    5%

Table 1: Frequency of sentence-level alignment operations based on our learned sentence alignment. No 2-to-2 alignments were found in the data.

To better understand how sentences are transformed from normal to simple sentences, we learned a word alignment using GIZA++ (Och and Ney, 2003). Based on this word alignment, we calculated the percentage of sentences that included: rewordings (a normal word is changed to a different simple word), deletions (a normal word is deleted), reorderings (non-monotonic alignment), splits (a normal word is split into multiple simple words), and merges (multiple normal words are condensed to a single simple word).

  Transformation   %
  rewordings       65%
  deletions        47%
  reorders         34%
  merges           31%
  splits           27%

Table 2: Percentage of sentence pairs that contained word-level operations based on the induced word alignment. Splits and merges are from the perspective of words in the normal sentence. These are not mutually exclusive events.

Table 2 shows the percentage of each of these phenomena occurring in the sentence pairs. All of the different operations occur frequently in the data set, with rewordings being particularly prevalent.

5 Sentence-level Text Simplification

To understand the usefulness of this data, we ran preliminary experiments to learn a sentence-level simplification system. We view the problem of text simplification as an English-to-English translation problem. Motivated by the importance of lexical changes, we used Moses, a phrase-based machine translation system (Och and Ney, 2004). [3] We trained Moses on 124K pairs from the data set and the n-gram language model on the simple side of this data. We trained the hyper-parameters of the log-linear model on a 500 sentence pair development set.

We compared the trained system to a baseline of not doing any simplification (NONE). We evaluated the two approaches on a test set of 1300 sentence pairs. Since there is currently no standard for automatically evaluating sentence simplification, we used three different automatic measures that have been used in related domains: BLEU, which has been used extensively in machine translation (Papineni et al., 2002), and word-level F1 and simple string accuracy (SSA), which have been suggested for text compression (Clarke and Lapata, 2006). All three of these measures have been shown to correlate with human judgements in their respective domains.

  System         BLEU     word-F1   SSA
  NONE           0.5937   0.5967    0.6179
  Moses          0.5987   0.6076    0.6224
  Moses-Oracle   0.6317   0.6661    0.6550

Table 3: Test scores for the baseline (NONE), Moses and Moses-Oracle.

Table 3 shows the results of our initial test. All differences are statistically significant at p = 0.01, measured using bootstrap resampling with 100 samples (Koehn, 2004). Although the baseline does well (recall that over a quarter of the sentence pairs in the data set are identical), the phrase-based approach does obtain a statistically significant improvement.

To understand the limits of the phrase-based model for text simplification, we generated an n-best list of the 1000 most-likely simplifications for each test sentence. We then greedily picked the simplification from this n-best list that had the highest sentence-level BLEU score based on the test examples, labeled Moses-Oracle in Table 3. The large difference between Moses and Moses-Oracle indicates possible room for improvement utilizing better parameter estimation or n-best list reranking techniques (Och et al., 2004; Ge and Mooney, 2006).

6 Conclusion

We have described a new text simplification data set generated by aligning sentences in Simple English Wikipedia with sentences in English Wikipedia. The data set is orders of magnitude larger than any currently available for text simplification or for the related field of text compression, and it is publicly available. [4] We provided preliminary text simplification results using Moses, a phrase-based translation system, and saw a statistically significant improvement of 0.005 BLEU over the baseline of no simplification, and we showed that further improvement of up to 0.034 BLEU may be possible based on the oracle results. In the future, we hope to explore alignment techniques more tailored to simplification as well as applications of this data to text simplification.

[3] We also experimented with T3 (Cohn and Lapata, 2009), but the results were poor and are not presented here.
[4] http://www.cs.pomona.edu/∼dkauchak/simplification/

References

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP.
John Carroll, Guido Minnen, Yvonne Canning, Siobhan Devlin, and John Tait. 1998. Practical simplification of English newspaper text to assist aphasic readers. In Proceedings of AAAI Workshop on Integrating AI and Assistive Technology.
Raman Chandrasekar and Bangalore Srinivas. 1997. Automatic induction of rules for text simplification. In Knowledge Based Systems.
David Chiang. 2010. Learning to translate with source and target syntax. In Proceedings of ACL.
James Clarke and Mirella Lapata. 2006. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In Proceedings of ACL.
Trevor Cohn and Mirella Lapata. 2009. Sentence compression as tree transduction. Journal of Artificial Intelligence Research.
Lijun Feng. 2008. Text simplification: A survey. CUNY Technical Report.
Michel Galley and Kathleen McKeown. 2007. Lexicalized Markov grammars for sentence compression. In Proceedings of HLT/NAACL.
Ruifang Ge and Raymond Mooney. 2006. Discriminative reranking for semantic parsing. In Proceedings of COLING.
Siddhartha Jonnalagadda, Luis Tari, Jorg Hakenberg, Chitta Baral, and Graciela Gonzalez. 2009. Towards effective sentence simplification for automatic processing of biomedical text. In Proceedings of HLT/NAACL.
Dan Klein and Christopher Manning. 2003. Accurate unlexicalized parsing. In Proceedings of ACL.
Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL.
Philipp Koehn. 2004. Statistical significance tests for machine translation evaluation. In Proceedings of EMNLP.
Ryan McDonald. 2006. Discriminative sentence compression with soft syntactic evidence. In Proceedings of EACL.
Makoto Miwa, Rune Saetre, Yusuke Miyao, and Jun'ichi Tsujii. 2010. Entity-focused sentence simplification for relation extraction. In Proceedings of COLING.
Courtney Napoles and Mark Dredze. 2010. Learning simple Wikipedia: A cogitation in ascertaining abecedarian language. In Proceedings of HLT/NAACL Workshop on Computational Linguistics and Writing.
Rani Nelken and Stuart Shieber. 2006. Towards robust context-sensitive sentence alignment for monolingual corpora. In Proceedings of AMTA.
Tadashi Nomoto. 2007. Discriminative sentence compression with conditional random fields. In Information Processing and Management.
Tadashi Nomoto. 2008. A generic sentence trimmer with CRFs. In Proceedings of HLT/NAACL.
Tadashi Nomoto. 2009. A comparison of model free versus model intensive approaches to sentence compression. In Proceedings of EMNLP.
Franz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19-51.
Franz Josef Och and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics.
Franz Josef Och, Kenji Yamada, Alex Fraser, Daniel Gildea, and Viren Jain. 2004. A smorgasbord of features for statistical machine translation. In Proceedings of HLT/NAACL.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of ACL.
Emily Pitler. 2010. Methods for sentence compression. Technical Report MS-CIS-10-20, University of Pennsylvania.
Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of ACL.
David Vickrey and Daphne Koller. 2008. Sentence simplification for semantic role labeling. In Proceedings of ACL.
Elif Yamangil and Rani Nelken. 2008. Mining Wikipedia revision histories for improving sentence compression. In Proceedings of ACL.
Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In Proceedings of HLT/NAACL Short Papers.