A Tagging Approach to Identify Complex Constituents for Text Simplification

Iustin Dornescu, Richard Evans, Constantin Orăsan
Research Institute in Information and Language Processing
University of Wolverhampton, United Kingdom
{I.Dornescu2, R.J.Evans, C.Orasan}@wlv.ac.uk

Abstract

The occurrence of syntactic phenomena such as coordination and subordination is characteristic of long, complex sentences. Text simplification systems need to detect and categorise constituents in order to generate simpler sentences. These constituents are typically bounded or linked by signs of syntactic complexity, which include conjunctions, complementisers, wh-words, and punctuation marks. This paper proposes a supervised tagging approach to classify these signs in accordance with their linking and bounding functions. The performance of the approach is evaluated both intrinsically, using an annotated corpus covering three different genres, and extrinsically, by evaluating the impact of classification errors on an automatic text simplification system. The results are encouraging.

1 Introduction

This paper presents an automatic method to determine the specific coordinating and bounding functions of several reliable signs of syntactic complexity in natural language. This method can be useful for automatic text simplification. The syntactic complexity of input text can be reduced by the application of rules triggered by patterns expressed in terms of the parts of speech of words and the syntactic linking and bounding functions of signs of syntactic complexity occurring within it (Evans, 2011). Previous work indicates that syntactic simplification can improve text accessibility (Just et al., 1996) and the reliability of NLP applications such as information extraction (Agarwal and Boggess, 1992; Rindflesch et al., 2000), machine translation (Gerber and Hovy, 1998), and syntactic parsing (Tomita, 1985; McDonald and Nivre, 2011). The research described in the current paper is part of the FIRST project [1] which aims to automatically convert documents into a more accessible form for people with autistic spectrum disorders (ASD). Many of the decisions taken in the research presented in this paper were informed by the psycholinguistic experiments carried out in the FIRST project and summarised in Martos et al. (2013).

[1] http://www.first-asd.eu

The remainder of this paper is structured as follows. Section 2 provides background information about the context of this work, Section 3 presents the annotation scheme, and Section 4 describes the approach and the main objectives of this study. The results and the main findings are presented in Section 5. Section 6 provides an overview of previous related work. In Section 7, conclusions are drawn.

2 Syntactic Simplification in the FIRST Project

Research carried out in the FIRST project and investigation of related work revealed that certain types of syntactic complexity adversely affect the reading comprehension of people with ASD (Martos et al., 2013). This section presents a brief overview of the context in which this research is carried out. It builds on the approach proposed by Evans (2011) who presented a rule-based method to simplify sentences containing coordinated constituents to facilitate information extraction. In that work, punctuation marks and conjunctions were considered to be reliable signs of syntactic complexity in English. These signs were automatically classified in accordance with a scheme indicating their specific syntactic linking function. They then serve as triggers for the application of distinct sets of simplification rules. Their accurate labelling is thus a prerequisite for the simplification process.

In that work, signs of syntactic complexity were considered to belong to one of two broad classes, denoted as coordinators and subordinators. These groups were subcategorised according to class labels specifying the syntactic projection level of conjoins [2] and of subordinated constituents, and the grammatical category of those phrases. Manual annotation of a limited set of signs was exploited to develop a memory-based learning classifier that was used in combination with a part-of-speech tagger and a set of rules to rewrite complex sentences as sequences of simpler sentences. Extrinsic evaluation showed that the simplification process evoked improvements in information extraction from clinical documents.

[2] Conjoins are the elements linked in coordination.

One weakness of the approach presented by Evans (2011) is that the set of functions of signs of syntactic complexity was derived by empirical analysis of rather homogeneous documents from a specialised source (a collection of clinical assessment items). The restricted range of linguistic phenomena encountered in the texts makes the annotation applicable only to that particular genre/category. The scheme is incapable of encoding the full range of syntactic complexity encountered in texts of different genres.

In more recent work, Evans and Orăsan (2013) addressed these weaknesses by considering three broad classes of signs: left subordination boundaries, right subordination boundaries and coordinators. The classification scheme was also extended to enable the encoding of links and boundaries between a wider range of syntactic constituents to cover more syntactic phenomena. The current paper presents a method to classify signs of syntactic complexity using the annotated dataset they developed.

3 Annotation Scheme

The annotated signs comprise three conjunctions ([and], [but], [or]), one complementiser ([that]), six wh-words ([what], [when], [where], [which], [while], [who]), three punctuation marks ([,], [;], [:]), and 30 signs consisting of one of these lexical items immediately preceded by a punctuation mark (e.g. [, and]). In this paper, signs of coordination are referred to as coordinators whereas signs of subordination are referred to as subordination boundaries. In the annotation scheme, the class labels, also called sign tags, are acronyms expressing four types of information (a short sketch decoding such tags is given below):

1. {C|SS|ES}, the generic function as a coordinator (C), the left boundary of a subordinate constituent (SS), or the right boundary of a subordinate constituent (ES).

2. {P|L|I|M|E}, the syntactic projection level of the constituent(s): prefix (P), lexical (L), intermediate (I), maximal (M), or extended/clausal (E).

3. {A|Adv|N|P|Q|V}, the grammatical category of the constituent(s): adjectival (A), adverbial (Adv), nominal (N), prepositional (P), quantificational (Q), and verbal (V).

4. {1|2}, used to further differentiate sub-classes on the basis of some other label-specific criterion.

The annotation scheme also includes classes which bound interjections, tag questions, and reported speech and a class denoting false signs of syntactic complexity, such as use of the word that as a specifier or anaphor.

Signs of syntactic complexity occurring in texts belonging to three categories/genres were annotated in accordance with this scheme [3]. Their characteristics are summarised in Table 1. Absolute and cumulative frequencies of signs and tags reveal a skewed distribution in each genre, e.g. in the news corpus 15 of 40 tags and 11 of 29 signs account for more than 90% of total occurrences.

Collection            Genre       Signs
1. METER corpus       News        12718
2. www.patient.co.uk  Healthcare  10796
3. Gutenberg          Literature  11204

Table 1: Characteristics of the annotated dataset.

[3] The annotated dataset and a description of each sign is available at http://clg.wlv.ac.uk/resources/SignsOfSyntacticComplexity/
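To make the four-field tag labels above concrete, the sketch below decodes a tag such as SSEV or CMN1 into its components. This is an illustrative reconstruction based only on the scheme as described, not code released with the paper; tags belonging to the special classes mentioned above (interjections, tag questions, reported speech, false signs) do not follow the four-field pattern and are left undecoded.

```python
import re

# Illustrative decoder for the four-field sign tags described above
# (an assumption based on the scheme, not code from the paper).
#   field 1: generic function     C | SS | ES
#   field 2: projection level     P | L | I | M | E
#   field 3: grammatical category A | Adv | N | P | Q | V
#   field 4: optional sub-class   1 | 2
TAG_PATTERN = re.compile(r"^(C|SS|ES)(P|L|I|M|E)(Adv|A|N|P|Q|V)(1|2)?$")

def decode_tag(tag):
    """Return (function, level, category, subclass), or None for tags that
    fall outside the productive four-field pattern (e.g. special classes)."""
    match = TAG_PATTERN.match(tag)
    return match.groups() if match else None

print(decode_tag("SSEV"))  # ('SS', 'E', 'V', None): left boundary of a clausal verbal constituent
print(decode_tag("CMN1"))  # ('C', 'M', 'N', '1'): coordinator of maximal nominal conjoins, sub-class 1
```

The first field is also the coarse label (C, SS or ES) that the reduced tag set evaluated in Section 5.2 collapses most tags into.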

In the context of information extraction, Evans (2011) showed that automatic syntactic simplification can be performed by annotating input sentences with information on the parts of speech of words and the syntactic functions of coordinators. These annotated sentences can then be simplified according to an iterative algorithm which aggregates several methods to identify specific syntactic patterns and then transform the input sentence into several simpler sentences. Each pattern is recognised on the basis of the class assigned to the sign which triggers it and the words surrounding the sign, and is rewritten according to manually created rules.

When a particular syntactic pattern is recognised, a rewriting rule is activated which identifies coordinated structures, the conjoins linked in coordination, and subordinated constituents. Each sign triggers the activation of a simplification rule. The rule applied varies according to the specific class to which the sign belongs.

One advantage of this general approach to syntactic simplification is that it does not depend on syntactic parsing, a process whose reliability depends both on the characteristics of the data exploited in training and on the length and complexity of the sentences being processed (McDonald and Nivre, 2011). Another advantage is its flexibility: subsets of rewriting operations can be activated in accordance with user requirements.

4 Tagging Signs of Syntactic Complexity

4.1 Approach

The automatic classification of signs of syntactic complexity is challenging because of the skewed nature of the dataset. As mentioned in Section 2, Evans (2011) proposed a supervised approach to distinguish different types of coordinators in order to improve relation extraction from biomedical texts. For each occurrence of a coordinator, a separate training instance was created to describe the surrounding context and then a statistical classifier was built for each coordinator. In that work, experiments were carried out with different classification models such as decision trees, SVM, and naïve Bayes. The best results were obtained by a memory-based learning (MBL) classifier.

In addition to the approach proposed by Evans (2011), we also built and evaluated CRF tagging models (Lafferty et al., 2001; Sutton and McCallum, 2010). These models perform joint inference which can better exploit interactions between different signs present in one sentence, and leads to better performance than is possible when each sign is classified independently. CRF models also achieve state of the art performance in many sequence tagging tasks such as named entity recognition (Tjong Kim Sang and De Meulder, 2003; McCallum and Li, 2003; Settles, 2004), biomedical information extraction (Settles, 2005) or shallow parsing (Sha and Pereira, 2003).

In the annotated dataset, signs of syntactic complexity typically delimit syntactic constituents. Each sign has a tag which reflects the types of constituent it links or bounds. For coordinators, the tag reflects the syntactic category of its conjoins. For subordination boundaries, the tag reflects the syntactic category or type of the bound constituent. This annotation is sign-centric, meaning that the actual extent and type of constituents is not explicitly annotated. To employ a tagging approach, the dataset needs to be converted to a suitable format.

4.2 Tagging Modes

A straightforward way to convert the annotated corpus into a sequence tagging dataset is to consider each sign as a single token chunk whose tag encodes specific information about its syntactic linking or bounding function (Section 2). All the other words are considered as being external to these chunks (tagged as NA). The weakness of this approach is that a baseline predicting the tag NA for every token, providing no useful information, achieves an overall token accuracy greater than 90% because less than 10% of tokens are signs. This can have negative implications for the convergence of the model.

Another mode, inspired by the BIO model adopted in NLP tasks such as named entity recognition (Nadeau and Sekine, 2007) or shallow parsing (Sha and Pereira, 2003), assigns each token the tag of the nearest preceding sign. This amounts to considering the sentence to be split into a set of non-overlapping chunks, each starting with a sign of syntactic complexity. A baseline applying the most common tag (SSEV [4]) to every token achieves an accuracy of 26%, much lower than in the previous setting. The two modes use equivalent information, but in the second mode both signs and words influence the overall tagging of the sentence, which can sometimes lead to different predictions than those made by the first tagging mode. The accuracy of the two modes is compared in Section 5. To have a more informative estimation of performance, only tags assigned to signs are considered for evaluation, while the tags predicted for other tokens are ignored.

[4] Denoting the left boundary of a subordinate clause.
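As a minimal sketch of the two conversions described in Section 4.2, the functions below assume the sign-centric annotation is available as a mapping from token positions to sign tags; the data structures, names and the example tag assignment are hypothetical. In the simple mode every non-sign token receives NA; in the BIO-inspired mode every token inherits the tag of the nearest preceding sign.

```python
def simple_mode(tokens, sign_tags):
    """Simple tagging mode: signs keep their tag, every other token is NA.
    `sign_tags` maps a token index to the tag of the sign at that position."""
    return [sign_tags.get(i, "NA") for i in range(len(tokens))]

def bio_style_mode(tokens, sign_tags, default="NA"):
    """BIO-inspired mode: each token receives the tag of the nearest
    preceding sign, i.e. the sentence is split into chunks starting at signs.
    Tokens before the first sign keep the default label."""
    labels, current = [], default
    for i in range(len(tokens)):
        if i in sign_tags:
            current = sign_tags[i]
        labels.append(current)
    return labels

# Hypothetical example: the sign at index 2 is assumed to be tagged SSEV.
tokens = ["The", "man", ",", "who", "arrived", "late", "left"]
sign_tags = {2: "SSEV"}
print(simple_mode(tokens, sign_tags))     # ['NA', 'NA', 'SSEV', 'NA', 'NA', 'NA', 'NA']
print(bio_style_mode(tokens, sign_tags))  # ['NA', 'NA', 'SSEV', 'SSEV', 'SSEV', 'SSEV', 'SSEV']
```

As noted above, only the labels predicted at sign positions would be scored, regardless of which mode is used to train the model.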

4.3 Tag Sets

As noted in Section 2, the simplification algorithm processes syntactic complexity by iterative application of simplification rules that are specific to signs with particular tags. Given that, when simplifying a specific phenomenon not all tags are necessarily relevant, one research question is whether it is better to use a single CRF model, trained using the complete tagset, or to train a more specialised CRF model instead, using a reduced tagset in which tags irrelevant for the simplification process are combined into a few generic tags. This issue is also investigated in Section 5.

4.4 Feature Sets

The features proposed by Evans (2011) included information about each potential coordinator and its surrounding context (a window of 10 tokens and their POS tags), together with information on the distance of the potential coordinator to other instances in the same sentence and the types of these potential coordinators. This is called the extended feature set.

A statistical significance analysis of the extended features showed that most features have very low χ² score and that supervised classifiers achieve similar performance when only the features of surrounding tokens are used, i.e. word form and POS tag. This is called the core feature set. We investigate whether this finding is observed for the CRF models in Section 5.

5 Evaluation and Discussion

5.1 Setting of the Experiment

Table 1 gives an overview of the size of the annotated corpus described in (Evans and Orăsan, 2013). Sentences from this dataset which contain annotated signs of syntactic complexity were extracted, tokenised and POS-tagged using GATE (Cunningham, 2002). For each genre, sentences were shuffled and split into 10 folds to carry out the experiments using cross validation.

Both signs and tags have a skewed distribution. More than 90% of occurrences consist of less than half of the set of tags. A similar observation can be made for the different signs. This makes it difficult to build accurate models for infrequent tags which together comprise less than 10% of occurrences.

An objective of this study is to determine the set of features that are most effective for tagging signs of syntactic complexity. The core feature set is based on word forms and POS tags which are generic features which can be easily and reliably extracted. Evans (2011) uses a more comprehensive set of features. We have employed that system to extract additional features for the annotated signs, the extended set. This also affords an indirect comparison between the classification approach and the sequence labelling approach. Since that system creates a classification instance for each sign independently, in order to use the additional features in a sequence labelling model, an additional unigram CRF template was created for each feature to condition the tag of a sign. As these features are only computed for signs, no templates were used to link the feature values to those of neighbouring tokens. The approach of Evans (2011) was also employed as a baseline (i.e. training supervised classification models which predict a label for each sign independently using the extended set of features) to compare the performance of the CRF model on this dataset. Table 2 shows that the extended feature set (CRF-extended) improves results of the simple tagging on the news genre by 2 points compared to the model using just words and their POS (CRF-core). The table also shows the performance of the baseline approach, when training standard classifiers from Weka (Hall et al., 2009). Regardless of the classifier model used, the baseline approach performs substantially worse than the sequence tagging models. In the following sections all experiments are carried out using CRFs.

              Correct  Accuracy
CRF-extended  10248    80.58%
CRF-core       9979    78.46%
SMO            7213    56.71%
NB             6712    52.78%
J48            6742    53.01%
IB7            6662    52.38%

Table 2: Performance on news corpus using the extended features proposed by Evans (2011).
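To illustrate the core feature set, the sketch below writes sentences in the tab-separated column layout typically consumed by CRF++ (word form, POS tag, label), so that feature templates can refer to the word-form and POS columns. The exact file layout and helper names are assumptions for illustration; they are not specified in the paper.

```python
def write_crf_columns(path, sentences):
    """Write CRF++-style training data: one token per line with tab-separated
    columns (word form, POS tag, label) and a blank line between sentences.
    This layout is assumed, not taken from the paper."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            for token, pos, label in sentence:
                out.write(f"{token}\t{pos}\t{label}\n")
            out.write("\n")

# Hypothetical sentence labelled in the simple tagging mode (the comma is
# assumed to be a sign tagged SSEV; all other tokens are NA).
example = [[("The", "DT", "NA"), ("man", "NN", "NA"), (",", ",", "SSEV"),
            ("who", "WP", "NA"), ("arrived", "VBD", "NA"), (".", ".", "NA")]]
write_crf_columns("news.train", example)
```

Extended features extracted per sign could be appended as additional columns and referenced by extra unigram templates, as described above.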

Genre       tagging  tagset    P       R       F1      Signs  Correct  Incorrect
news        simple   complete  0.7971  0.7846  0.7894  12718   9979    2739
news        BIO      complete  0.8157  0.7991  0.8053  12718  10163    2555
literature  simple   complete  0.8414  0.8267  0.8326  11204   9262    1942
literature  BIO      complete  0.8597  0.8383  0.8468  11204   9392    1812
healthcare  simple   complete  0.8422  0.8323  0.8358  10796   8985    1811
healthcare  BIO      complete  0.8406  0.8244  0.8300  10796   8900    1896
news        simple   reduced   0.8206  0.8161  0.8176  12718  10379    2339
news        BIO      reduced   0.8382  0.8328  0.8348  12718  10592    2126
literature  simple   reduced   0.8698  0.8595  0.8639  11204   9630    1574
literature  BIO      reduced   0.8840  0.8680  0.8746  11204   9725    1479
healthcare  simple   reduced   0.8636  0.8567  0.8593  10796   9249    1547
healthcare  BIO      reduced   0.8602  0.8510  0.8544  10796   9187    1609

Table 3: Overall performance using 10-fold cross validation on the three genres, using two tagging modes (simple and BIO) and two tagsets (complete and reduced).

5.2 Results on the Whole Corpus

Table 3 shows the results achieved for each of the three genres when using two tagging modes, simple and BIO, and two different tag sets, complete and reduced. Results were computed using 10-fold cross-validation. For both news and literature corpora, using the BIO tagging mode leads to better performance than using simple tagging, while the opposite is true for the health corpus.

One of the objectives of these experiments is to establish whether using a reduced tag set offers performance benefits. When tackling a specific syntactic phenomenon, only a subset of signs and tags may be involved. For example, a set of 11 tags were identified which are relevant for detecting appositions and other noun post-modifiers. The remainder were combined into three coarse grained tags indicating the generic function of the sign as the start (SS) or end (ES) of a subordinated constituent or as coordinator (C) of two constituents. These correspond to the first level used for the class labels in the annotated dataset. Performance achieved with the full and the reduced tag set is listed in Table 3. For all genres and irrespective of tagging mode, using the reduced tag set leads to a performance increase of 2-3 percentiles. A more detailed analysis however reveals that this performance increase is not linked to the relevant 11 original tags, but to the 3 coarse tags.

For example, in the news dataset, the three coarse tags account for 35.84% of all signs. Although the reduced tag set demonstrates a 50% error reduction for two signs (and, or), the performance for the other signs is largely unchanged. The performance on the 11 tags of interest is also unchanged. This result suggests that using a reduced tag set yields a more informative performance estimation for some specific task because irrelevant tagging errors are not taken into account, but it does not necessarily lead to increased performance for the relevant original tags. Therefore there is no real benefit in training multiple tagging models with reduced tag sets.

A relevant issue in the context of text simplification is robustness. To gain insights into the strengths and weaknesses of CRF models we measure the impact on performance when models trained on each genre are applied to the other two genres. In this experiment the complete tag set was used. Table 4a) shows the results using simple tagging, while Table 4b) shows the results when using BIO tagging. In both cases, the main diagonal shows within-genre F1 performance measured using 10-fold cross-validation; the other entries show cross-genre performance. For all models a considerable performance drop can be observed. The news models are the ones that have the best cross-domain performance, while the healthcare models perform worst. This impact on performance is not unexpected, but rather a proof that the three genres differ from a syntactic perspective. In addition, although all genres have a skewed distribution of signs and tags, the actual rankings differ.

(a) Simple tagging mode
                       Test genre
Train genre   news     healthcare  literature
news          78.46%   63.96%      69.98%
healthcare    44.95%   83.23%      48.74%
literature    62.59%   58.53%      82.67%

(b) BIO tagging mode
                       Test genre
Train genre   news     healthcare  literature
news          79.91%   61.29%      71.48%
healthcare    48.75%   82.44%      51.95%
literature    64.03%   56.44%      83.83%

Table 4: Cross-genre F1 performance of the tagging models; main diagonal represents performance using 10-fold cross-validation.
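The cross-genre comparison in Table 4 amounts to a simple train/test grid over the three genres. The sketch below shows only the structure of that experiment; train() and evaluate_f1() are hypothetical helpers wrapping the tagger, and the within-genre (diagonal) cells would in practice be filled with 10-fold cross-validation scores, as described above.

```python
GENRES = ["news", "healthcare", "literature"]

def cross_genre_grid(datasets, train, evaluate_f1):
    """Build a Table 4-style matrix: rows are training genres, columns are
    test genres. `datasets` maps genre names to annotated data; `train` and
    `evaluate_f1` are hypothetical helpers around the CRF tagger."""
    grid = {}
    for train_genre in GENRES:
        model = train(datasets[train_genre])
        for test_genre in GENRES:
            grid[train_genre, test_genre] = evaluate_f1(model, datasets[test_genre])
    return grid
```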

To tackle this issue, a supervised genre classifier can be used to detect the genre of a text to select the best model for the genre, however this approach is limited to genres for which annotated data is available. An alternative approach to minimise the effect of over-fitting is to train models using data from all genres. Table 5 compares these two approaches. In the first two runs (merged), 10-fold cross-validation was performed using stratified sampling; each fold in the merged dataset consists of 3 folds, one from each genre. The last two runs (combined) demonstrate the performance achievable when an oracle selects the correct cross-validated model for each prediction. This represents a performance upper-bound since in practice an actual classifier will be used which would introduce additional errors. The experiment indicates that training a single model on the entire dataset (all three genres) yields better results than using the best models for each genre. Although the differences are not large, merging all available data produces a model with superior tagging performance which should also generalise better to new genres.

Genre         Signs  Correct  Accuracy
merged        34718  28297    81.51%
merged-bio    34718  28642    82.50%
combined      34718  28226    81.30%
combined-bio  34718  28455    81.96%

Table 5: Joint training performance using 10-fold cross validation: merging data from the three genres leads to better performance than that achieved by the best individual models.

5.3 Results on the News Corpus

To better understand the nature of the dataset and the performance of the approach, this section presents more in-depth results for the news genre. Although some differences exist, the other two genres are similar (analysis of them is omitted due to space constraints). Tables 6 and 7 show the CRF model's performance on the news genre using 10-fold cross-validation for the most frequent tags and signs, respectively. In terms of micro-averaged statistics the predictions have a good balance between precision and recall. There is more variance when looking at the performance of specific tags or signs. For example, some tags such as SSEV, SSCM, SSMA and ESCM have very good performance (F1 > 90%); most of these tags mark the start of a constituent (the left boundary). Other tags, despite having comparable frequencies, are more difficult to identify and only reach substantially lower levels of performance (F1 < 70%), e.g. CMN1, ESEV, ESMP, ESMN, ESMA. Most of these signs mark the right boundary of a constituent, which suggests that identifying the end of a constituent is more difficult than identifying the start. This could be caused by multiple embedded constituents, in which the same sign marks the right boundary of several constituents. In such cases, several tags could be considered correct, but in the annotated dataset only the type of the longest constituent was considered: a sign can only have one tag.

A similar situation occurs when looking at the performance achieved per sign in Table 7. Excellent performance (F1 > 95%) is noted for the complementiser that and wh-signs such as who, when or which. Due to the skewed distribution, more than 83% of all errors are linked to the two most frequent signs [,] and [and], which only reach an F1 of 75%.

Table 8 shows the feature templates used to train CRF models in these experiments. To evaluate the impact of each feature template, a simple feature selection methodology was employed: a CRF++ model (Kudo, 2005) was trained on the news corpus using a single template and its performance was ranked and compared with a baseline. For this dataset the baseline was considered using the word form of the current token as the single feature, which achieves 40% accuracy. The best templates, reaching 58% accuracy, used part-of-speech n-grams. When used together, the templates in Table 8 achieve 79.91% accuracy when using the simple tagging mode on the news corpus.
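The single-template ranking described above can be scripted around the CRF++ command-line tools. In the sketch below, only crf_learn and crf_test are standard CRF++ commands; the file names, the candidate template list and the accuracy computation are assumptions for illustration, and scoring is restricted to sign positions by skipping tokens whose gold label is NA (matching the simple tagging mode).

```python
import subprocess

# Candidate single-feature templates in CRF++ macro syntax; u00 is the
# word-form baseline discussed above, u47 a POS bigram as listed in Table 8.
CANDIDATES = {
    "u00": "U00:%x[0,0]",
    "u47": "U47:%x[0,1]/%x[1,1]",
}

def sign_accuracy(output_path):
    """Accuracy over sign positions in crf_test output, where the last column
    is the prediction and the one before it the gold label."""
    correct = total = 0
    with open(output_path, encoding="utf-8") as out:
        for line in out:
            cols = line.split()
            if len(cols) < 2 or cols[-2] == "NA":
                continue
            total += 1
            correct += cols[-2] == cols[-1]
    return correct / total if total else 0.0

def rank_templates(train_file="news.train", test_file="news.test"):
    scores = {}
    for name, template in CANDIDATES.items():
        with open(f"template.{name}", "w") as tpl:
            tpl.write(template + "\n")
        subprocess.run(["crf_learn", f"template.{name}", train_file, f"model.{name}"], check=True)
        with open(f"output.{name}", "w") as out:
            subprocess.run(["crf_test", "-m", f"model.{name}", test_file], stdout=out, check=True)
        scores[name] = sign_accuracy(f"output.{name}")
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```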

    Tag   P       R       F1      Support  Cumulative  True-pos  False-pos  False-neg
 1  SSEV  0.9642  0.9298  0.9467  3275     26%         3045      113        230
 2  CMV1  0.8618  0.8083  0.8342  1111     34%          898      144        213
 3  CMN1  0.7381  0.6601  0.6969  1059     43%          699      248        360
 4  CEV   0.8071  0.7795  0.7931   907     50%          707      169        200
 5  SSMN  0.8865  0.8384  0.8618   885     57%          742       95        143
 6  ESEV  0.6383  0.5631  0.5984   586     62%          330      187        256
 7  SSCM  0.9659  0.9759  0.9708   580     66%          566       20         14
 8  SSMA  0.9303  0.9574  0.9437   516     70%          494       37         22
 9  ESMP  0.5858  0.5611  0.5732   499     74%          280      198        219
10  CLN   0.7535  0.6918  0.7214   464     78%          321      105        143
11  SSMP  0.8469  0.8167  0.8315   420     81%          343       62         77
12  ESMN  0.5972  0.6101  0.6036   418     84%          255      172        163
13  SSMV  0.8418  0.8103  0.8258   348     87%          282       53         66
14  ESCM  0.9207  0.9379  0.9292   322     90%          302       26         20
15  ESMA  0.6457  0.7049  0.6740   305     92%          215      118         90
    avg/total  0.8157  0.7991  0.8053  12718  100%     10163

Table 6: Per tag performance on the 15 most frequent types of complexity signs in the news corpus using BIO style CRF mode (covering > 90% of occurrences); the last row shows the weighted average performance (for P, R and F1) and counts (total signs and correct predictions)
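The per-tag scores in Table 6 follow directly from its true-positive, false-positive and false-negative counts. The short check below recomputes precision, recall and F1 for the SSEV row (P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)) and reproduces the reported 0.9642, 0.9298 and 0.9467.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from confusion counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# SSEV row of Table 6: TP = 3045, FP = 113, FN = 230.
p, r, f1 = precision_recall_f1(3045, 113, 230)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 0.9642 0.9298 0.9467
```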

    Sign     P       R       F1      Support  Cumulative  Correct  Incorrect
 1  ,        0.7488  0.7312  0.7377  5443     43%         3980     1463
 2  and      0.7778  0.7430  0.7562  2564     63%         1905      659
 3  that     0.9608  0.9589  0.9594  1313     73%         1259       54
 4  who      0.9952  0.9928  0.9940   418     77%          415        3
 5  , and    0.8089  0.7253  0.7585   324     79%          235       89
 6  but      0.8921  0.8658  0.8761   313     82%          271       42
 7  when     0.9872  0.9840  0.9856   312     84%          307        5
 8  or       0.6597  0.5961  0.6146   255     86%          152      103
 9  , who    1.0000  0.9715  0.9856   246     88%          239        7
10  which    1.0000  0.9888  0.9944   178     89%          176        2
11  what     0.9867  0.9605  0.9734   152     91%          146        6
    Overall  0.8157  0.7991  0.8053  12718    100%       10163     2555

Table 7: Per tag performance on the most frequent signs in the news corpus using BIO style CRF mode (covering > 90% of occurrences); for each sign micro-averaged P, R and F1, as well as total number of signs and of correct predictions

Template  Accuracy  CRF++ feature form          Description
b94       58.13%    %x[0,1]/%x[1,1]             POS-bigram(0,1)
u51       58.29%    %x[-1,1]/%x[0,1]/%x[1,1]    POS-(-1,0,1)
u52       55.20%    %x[0,1]/%x[1,1]/%x[2,1]     POS-trigram(0,1,2)
u47       55.90%    %x[0,1]/%x[1,1]             POS-bigram(0,1)
u32       47.11%    %x[0,0]/%x[1,0]             sign(token and POS)
u00       40.40%    %x[0,0]                     sign(token)

Table 8: CRF feature templates which outperform the baseline feature template u00
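For reference, the template forms listed in Table 8 can be written out as a CRF++ template file; each %x[row,col] macro is expanded relative to the current token, with column 0 holding the word form and column 1 the POS tag in the data layout assumed earlier. Combining them into a single file in this way is an assumption based on the table, not a file distributed with the paper.

```python
# CRF++ template lines corresponding to the forms in Table 8 (column 0 is
# the word form, column 1 the POS tag in the assumed data layout). The U/B
# prefixes mark unigram and bigram (label-transition) templates in CRF++.
TEMPLATE_LINES = [
    "U00:%x[0,0]",                   # sign(token)
    "U32:%x[0,0]/%x[1,0]",           # sign(token and POS), as described in Table 8
    "U47:%x[0,1]/%x[1,1]",           # POS-bigram(0,1)
    "U51:%x[-1,1]/%x[0,1]/%x[1,1]",  # POS-(-1,0,1)
    "U52:%x[0,1]/%x[1,1]/%x[2,1]",   # POS-trigram(0,1,2)
    "B94:%x[0,1]/%x[1,1]",           # POS-bigram(0,1) as a bigram template
]

with open("template", "w", encoding="utf-8") as f:
    f.write("\n".join(TEMPLATE_LINES) + "\n")
```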

5.4 Extrinsic Evaluation

To determine the extrinsic impact of the errors made by the sign classifier, two rule-based syntactic simplification methods were employed which rely on annotated signs. Each method uses a set of rules to identify certain syntactic structures which are then simplified and was developed using the gold standard annotations. The first method addresses noun post-modifiers, such as appositions, adjectival phrases and relative clauses. When the method is run on the gold standard dataset, 1910 sentences containing noun post-modifiers were identified and simplified. When sign annotations produced using 10-fold cross-validation are used instead, due to classification errors 6.91% fewer sentences are automatically simplified, while the remaining 1778 (93.09%) sentences are still simplified accurately, suggesting that the tagging errors have less impact on this particular method.

The second text simplification method addresses a wider range of syntactic phenomena including coordination. It identifies conjoins and subordinate constituents in complex sentences and re-writes them as sequences of shorter, simpler sentences. When this method is applied on automatic annotations, 22.42% of sentences are no longer simplified by the method, suggesting that the method is more sensitive to tagging errors. These results demonstrate that the automatic sign classifier can usefully be exploited in text simplification applications, especially when addressing specific syntactic phenomena.

6 Related Work

There are two major areas of previous work of relevance to the research described in the current paper. They comprise methods for the automatic classification of signs of syntactic complexity and annotated resources that may be exploited for the development of such approaches.

In closely-related work, van Delden and Gomez (2002) present a system to assign syntactic roles to commas. The classification scheme uses 30 class labels to denote coordinating functions (series commas), boundaries of subordinate constituents (enclosing commas), functions linking and bounding clauses and verb phrases (clausal commas), and bounding direct and indirect speech. There is considerable overlap between their scheme and the dataset used in this paper.

Adopting a two phase approach, van Delden and Gomez (2002) apply 38 finite state automata to part of speech tagged data to derive an initial tagging of commas. After this, information from a tag co-occurrence matrix derived from hand annotated training data is used to improve the initial tagging. The system achieved accuracy of 91-95% in identifying the syntactic function of commas in a collection of encyclopaedia and news articles. This is more accurate than the results reported in the current paper (79-87%), which predicts class labels from a wider selection of classes (44 vs. 30) of a wider variety of signs of syntactic complexity (29 vs. one) in documents from three genres: news, patient healthcare, and literature.

In related work, Maier et al. (2012) proposed the addition of a new annotation layer to disambiguate the role of punctuation in the Penn Treebank. They present a detailed scheme to ensure consistent and reliable manual annotation of commas and semicolons with information to indicate their coordinating function. Compared to the dataset used in this paper, their scheme only encodes coarse-grained information with no discrimination between subclasses of coordinating and non-coordinating functions. The task addressed in the current paper is to tag coordinators and subordination boundaries with more detailed syntactic information about the constituents that they link or bound, the first step in a text simplification application.

7 Conclusions

The decision to tag signs of syntactic complexity with information about pairs of single conjoins or single bound constituents means that in many cases, subordination boundaries and coordinators lack information on the full set of constituents bounded or linked by them. As a result, signs bounding subordinate constituents are often not matched pairs. A second limitation of the scheme is the fact that syntactic complexity not signalled by the signs specified in Section 2 of the current paper cannot be identified. These characteristics of the training data (embedded constituents and missing boundaries) exert a negative influence on tagging right subordination boundaries.

Acknowledgments

The research described in this paper was partially funded by the European Commission under the Seventh (FP7-2007-2013) Framework Programme for Research and Technological Development (FP7-ICT-2011.5.5 FIRST 287607).

References

Rajeev Agarwal and Lois Boggess. 1992. A simple but useful approach to conjunct identification. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 15–21, Newark, Delaware. Association for Computational Linguistics.

Hamish Cunningham. 2002. GATE, a general architecture for text engineering. Computers and the Humanities, 36(2):223–254.

Richard Evans and Constantin Orăsan. 2013. Annotating signs of syntactic complexity to support sentence simplification. In Ivan Habernal and Vaclav Matousek, editors, Text, Speech and Dialogue. Proceedings of the 16th International Conference TSD 2013, Lecture Notes in Computer Science, Plzen, Czech Republic, September. Springer.

Richard Evans. 2011. Comparing methods for the syntactic simplification of sentences in information extraction. Literary and Linguistic Computing, 26(4):371–388.

Laurie Gerber and Eduard H. Hovy. 1998. Improving translation quality by manipulating sentence length. In David Farwell, Laurie Gerber, and Eduard H. Hovy, editors, AMTA, volume 1529 of Lecture Notes in Computer Science, pages 448–460. Springer.

Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, and Ian H. Witten. 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter, 11(1):10–18.

Marcel Adam Just, Patricia A. Carpenter, Timothy A. Keller, William F. Eddy, and Keith R. Thulborn. 1996. Brain activation modulated by sentence comprehension. Science, 274:114–116.

Taku Kudo. 2005. CRF++: Yet another CRF toolkit. Software available at http://crfpp.sourceforge.net.

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, pages 282–289. Morgan Kaufmann.

Wolfgang Maier, Sandra Kübler, Erhard Hinrichs, and Julia Kriwanek. 2012. Annotating coordination in the Penn Treebank. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 166–174, Jeju, Republic of Korea, July. Association for Computational Linguistics.

Juan Martos, Sandra Freire, Ana Gonzalez, and David Gil. 2013. D2.2 user preferences: updated report. Technical report. Also available as http://www.first-asd.eu/D2.2.

Andrew McCallum and Wei Li. 2003. Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 188–191. Association for Computational Linguistics.

Ryan T. McDonald and Joakim Nivre. 2011. Analyzing and integrating dependency parsers. Computational Linguistics, 37(1):197–230.

David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.

Thomas C. Rindflesch, Jayant V. Rajan, and Lawrence Hunter. 2000. Extracting molecular binding relationships from biomedical text. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 188–195, Seattle, Washington. Association for Computational Linguistics.

Burr Settles. 2004. Biomedical named entity recognition using conditional random fields and rich feature sets. In Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications, pages 104–107. Association for Computational Linguistics.

Burr Settles. 2005. ABNER: an open source tool for automatically tagging genes, proteins and other entity names in text. Bioinformatics, 21(14):3191–3192.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 134–141. Association for Computational Linguistics.

Charles Sutton and Andrew McCallum. 2010. An introduction to conditional random fields. arXiv preprint arXiv:1011.4088.

Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 - Volume 4, pages 142–147. Association for Computational Linguistics.

Masaru Tomita. 1985. Efficient Parsing for Natural Language: A Fast Algorithm for Practical Systems. Kluwer Academic Publishers, Norwell, MA, USA.

Sebastian van Delden and Fernando Gomez. 2002. Combining finite state automata and a greedy learning algorithm to determine the syntactic roles of commas. In Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence, ICTAI '02, pages 293–, Washington, DC, USA. IEEE Computer Society.
