Text Simplification for Language Learners: a Corpus Analysis
Total Page:16
File Type:pdf, Size:1020Kb
Speech and Language Technology ISCA Archive in Education (SLaTE2007) http://www.isca-speech.org/archive Farmington, PA, USA October 1-3, 2007 Text Simplification for Language Learners: A Corpus Analysis Sarah E. Petersen, Mari Ostendorf Dept. of Computer Science, Dept. of Electrical Engineering University of Washington, Seattle, WA 98195, USA [email protected], [email protected] Abstract In the following sections, we summarize prior work on text Simplified texts are commonly used by teachers and students in simplification, describe a sentence-aligned corpus of original and bilingual education and other language-learning contexts. These abridged articles, and provide an analysis of the article differences. texts are usually manually adapted, and teachers say this is a time- consuming and sometimes challenging task. Our goal is the devel- 2. Related Work opment of tools to aid teachers by automatically proposing ways to simplify texts. As a first step, this paper presents a detailed Researchers have developed some text simplification systems for analysis of a corpus of news articles and abridged versions written educational purposes. These systems rely on handwritten transfor- by a literacy organization in order to learn what kinds of changes mation rules. Inui et al. address the needs of deaf learners of writ- people make when simplifying texts for language learners. ten English and Japanese by paraphrasing texts to remove syntactic structures known to be difficult for this group of learners [7]. The 1. Introduction Practical Simplification of English Text (PSET) project’s goal is to paraphrase newspaper texts for people with aphasia [3]. Max and Text simplification for second language learners is an important colleagues at LIMSI-CNRS in France also target authors of texts but somewhat controversial issue in the education community. for language-impaired readers with an interactive text simplifica- Some experts prefer the use of “authentic” texts, i.e., texts which tion system that is built into a word processor to suggest simplifica- are written for native speakers for a purpose other than instruction, tions while allowing the writer to maintain control over content and while simplified texts are also very common in educational use. meaning [10]. The Educational Testing Service (ETS) has devel- Proponents of authentic texts tout positive effects on student inter- oped the Automated Text Adaptation (ATA) Tool [2], which does est and motivation and advantages of exposing students to “real” not directly simplify the original text but instead provides support language and culture. However, authentic materials are often too for reading via text adaptations in English and/or Spanish that are hard for students who read at lower levels, as they may contain displayed together with the original text. The adaptations include more complex language structures and vocabulary than texts in- vocabulary support, marginal notes, and text-to-speech. tended for learners [8, 13]. Studies of the lexical and syntactic dif- Most simplified texts are shorter than the original text, so ferences between authentic and simplified texts, and their effects extractive summarization, which selects a subset of sentences to on student comprehension, show mixed results [14, 6], suggest- form a summary, is a potential step in the simplification process. ing that both types of texts can have a place in education. Fur- However, such techniques alone are insufficient for simplification, ther, authentic texts are not always available for students whose since a summarizer could choose complex sentences, resulting in reading level does not match their intellectual level and interests. a shorter text but one that is still too challenging. As our data will Since teachers report spending substantial amounts of time adapt- show, in simplification, the compression rate in words is greater ing texts by hand, automatic simplification could be a useful tool than what one would obtain by simply dropping sentences. to help teachers adapt texts for these students. Other research efforts are aimed at simplifying text to im- This paper presents an analysis of a corpus of original and prove automatic language processing (vs. for the benefit of human manually simplified news articles with the goal of gaining insight readers). Chandrasekar and Srinivas simplify sentences with the into what people most often do to simplify text in order to develop goal of improving the performance of parsing and machine trans- better automatic tools. When creating simplified or abridged texts, lation [4]. Their system learns transformation rules from an anno- authors may drop sentences or phrases, split long sentences into tated and aligned corpus of complex sentences with manual sim- multiple sentences, modify vocabulary, shorten long descriptive plifications. Siddharthan’s work on syntactic simplification and phrases, etc. In this paper we do not address changes to vocabu- text cohesion has the goal of making individual sentences simpler lary; Burstein et al.’s approach to choosing synonyms for challeng- without shortening the entire text. This approach uses transfor- ing words could be used to simplify vocabulary items [2]. Instead, mations based on handwritten rules, paying particular attention to we focus on the following research questions: discourse-level issues in the regeneration stage (i.e., interaction be- ² What differences in part-of-speech usage and phrase types tween sentences in a text, not just individual sentences) [12]. are found in original and simplified sentences? We are interested in a data-driven approach to simplification ² What are the characteristics of sentences which are split like that of Chandrasekar and Srinivas. However, unlike their when an article is simplified? work, which is based on a corpus of original and manually sim- ² What are the characteristics of sentences which are dropped plified sentences, we study a corpus of paired articles in which when an article is simplified? each original sentence does not necessarily have a corresponding Speech and Language Technology in Education (SLaTE 2007) 69 Table 1: Total sentences and words and average sentence length sentences we used an automatic parser [5] to get parses and part- for corpus of 104 pairs of original and abridged articles. of-speech (POS) tags for all sentences. Table 2 shows, for se- lected POS tags, the average number of that tag in the original and Original Abridged Reduction abridged sentences, and the percent difference between the two.2 Total sentences 2539 2459 3% Since abridged sentences are on average 27% shorter, we expect Total words 41982 29584 30% fewer words and therefore fewer POS tags per sentence. How- Avg. sentence length (words) 16.5 12.0 27% ever, we note that the percentage decrease in average frequency is greater for adjectives, adverbs and coordinating conjunctions, i.e., abridged sentences have fewer of these words. The percent de- Table 2: Average frequency of selected part-of-speech tags in orig- crease in nouns is only 22%, compared with 33% for pronouns, inal and abridged sentences. On average, abridged sentences con- indicating that nouns are deleted less often than the average and tain 27% fewer words than original sentences. it is unlikely that nouns are often replaced with pronouns. The frequency difference for determiners, IN (prepositions and sub- Tag Original Abridged Difference ordinating conjunctions), proper nouns, and verbs is near 27%, Adjective 1.2 0.8 33% indicating no difference other than that expected for the shorter Adverb 1.0 0.6 40% abridged sentences. Table 3 shows average frequencies and aver- CC 0.5 0.3 40% age lengths of selected types of phrases. There are fewer phrases Noun 3.6 2.8 22% per sentence in the abridged sentences, and the phrases are shorter. Pronoun 1.2 0.8 33% 3.2. Alignment Methodology In order to understand the techniques the authors used when edit- simplified sentence. Our corpus makes it possible to learn where ing each original article to create the abridged article, we hand- rewriters have dropped as well as simplified sentences. These find- align the sentences in each pair of articles. This alignment was ings should help with the development of simplification tools that done by a native English speaker using the instructions used by can be trained from text in a variety of domains. Barzilay and Elhadad in their work on alignment of comparable corpora [1]. These instructions direct the annotator to mark sen- 3. Aligned Corpus of News Articles tences in the original and abridged versions which convey the same information in at least one clause. Since nearly all abridged sen- This work is based on a corpus of 104 original news articles with tences have a corresponding original sentence but some original corresponding abridged versions developed by Literacyworks as 1 sentences are dropped, we asked the annotator to align all the part of an literacy website for learners and instructors. The tar- sentences in each abridged article to a corresponding sentence or get audience for these articles is adult literacy learners (i.e., native sentences in the original file and automatically reversed the hand speakers with poor reading skills), but the site creators suggest alignments to get the alignments of original to abridged sentences. that the abridged articles can be used by instructors and learners of all ages. In this paper, we will refer to “original” and “abridged” 3.3. Original and Aligned Sentences texts, the terms used by the developers of this corpus.