Simple and Effective Text Simplification Using Semantic and Neural Methods

Elior Sulem, Omri Abend, Ari Rappoport
Department of Computer Science, The Hebrew University of Jerusalem
{eliors|oabend|arir}@cs.huji.ac.il

Abstract

Sentence splitting is a major simplification operator. Here we present a simple and efficient splitting algorithm based on an automatic semantic parser. After splitting, the text is amenable for further fine-tuned simplification operations. In particular, we show that neural Machine Translation can be effectively used in this situation. Previous applications of Machine Translation for simplification suffer from a considerable disadvantage in that they are over-conservative, often failing to modify the source in any way. Splitting based on semantic parsing, as proposed here, alleviates this issue. Extensive automatic and human evaluation shows that the proposed method compares favorably to the state-of-the-art in combined lexical and structural simplification.

1 Introduction

Text Simplification (TS) is generally defined as the conversion of a sentence into one or more simpler sentences. It has been shown useful both as a preprocessing step for tasks such as Machine Translation (MT; Mishra et al., 2014; Štajner and Popović, 2016) and relation extraction (Niklaus et al., 2016), as well as for developing reading aids, e.g., for people with dyslexia (Rello et al., 2013) or non-native speakers (Siddharthan, 2002).

TS includes both structural and lexical operations. The main structural simplification operation is sentence splitting, namely rewriting a single sentence into multiple sentences while preserving its meaning. While recent improvement in TS has been achieved by the use of neural MT (NMT) approaches (Nisioi et al., 2017; Zhang et al., 2017; Zhang and Lapata, 2017), where TS is considered a case of monolingual translation, the sentence splitting operation has not been addressed by these systems, potentially due to the rareness of this operation in the training corpora (Narayan and Gardent, 2014; Xu et al., 2015).

We show that the explicit integration of sentence splitting in the simplification system could also reduce conservatism, which is a grave limitation of NMT-based TS systems (Alva-Manchego et al., 2017). Indeed, experimenting with a state-of-the-art neural system (Nisioi et al., 2017), we find that 66% of the input sentences remain unchanged, while none of the corresponding references is identical to the source. Human and automatic evaluation of the references (against the other references) confirms that the references are indeed simpler than the source, indicating that the observed conservatism is excessive. Our method for performing sentence splitting as pre-processing allows the TS system to perform other structural (e.g., deletions) and lexical (e.g., substitutions) operations, thus increasing both structural and lexical simplicity.

For combining linguistically informed sentence splitting with data-driven TS, two main methods have been proposed. The first involves hand-crafted syntactic rules, whose compilation and validation are laborious (Shardlow, 2014). For example, Siddharthan and Angrosh (2014) used 111 rules for relative clauses, appositions, subordination and coordination. Moreover, syntactic splitting rules, which form a substantial part of such rules, are usually language specific, requiring the development of new rules when ported to other languages (Aluísio and Gasperin, 2010; Seretan, 2012; Hung et al., 2012; Barlacchi and Tonelli, 2013, for Portuguese, French, Vietnamese, and Italian respectively).
The second method uses linguistic information for detecting potential splitting points, while splitting probabilities are learned using a parallel corpus. For example, in the system of Narayan and Gardent (2014) (henceforth, HYBRID), the state of the art for joint structural and lexical TS, potential splitting points are determined by event boundaries.

In this work, which is the first to combine structural semantics and neural methods for TS, we propose an intermediate way of performing sentence splitting, presenting Direct Semantic Splitting (DSS), a simple and efficient algorithm based on a semantic parser which supports the direct decomposition of the sentence into its main semantic constituents. After splitting, NMT-based simplification is performed, using the NTS system. We show that the resulting system outperforms HYBRID in both automatic and human evaluation.

We use the UCCA scheme for semantic representation (Abend and Rappoport, 2013), where the semantic units are anchored in the text, which simplifies the splitting operation. We further leverage the explicit distinction in UCCA between types of Scenes (events), applying a specific rule for each of the cases. Nevertheless, the DSS approach can be adapted to other semantic schemes, like AMR (Banarescu et al., 2013).

We collect human judgments for multiple variants of our system, its sub-components, HYBRID, and similar systems that use phrase-based MT. This results in a sizable human evaluation benchmark, which includes 28 systems, totaling 1960 complex-simple sentence pairs, each annotated by three annotators using four criteria.[1] This benchmark will support the future analysis of TS systems and evaluation practices.

Previous work is discussed in §2, and the semantic and NMT components we use in §3 and §4 respectively. The experimental setup is detailed in §5. Our main results are presented in §6, while §7 presents a more detailed analysis of the system's sub-components and related settings.

[1] The benchmark can be found in https://github.com/eliorsulem/simplification-acl2018.

2 Related Work

MT-based sentence simplification. Phrase-based Machine Translation (PBMT; Koehn et al., 2003) was first used for TS by Specia (2010), who showed good performance on lexical simplification and simple rewriting, but under-prediction of other operations. Štajner et al. (2015) took a similar approach, finding that it is beneficial to use training data where the source side is highly similar to the target. Other PBMT systems for TS include the work of Coster and Kauchak (2011b), which uses Moses (Koehn et al., 2007), the work of Coster and Kauchak (2011a), where the model is extended to include deletion, and PBMT-R (Wubben et al., 2012), where Levenshtein distance to the source is used for re-ranking to overcome conservatism.

The NTS NMT-based system (Nisioi et al., 2017) (henceforth, N17) reported superior performance over PBMT in terms of BLEU and human evaluation scores, and serves as a component in our system (see Section 4). Zhang et al. (2017) took a similar approach, adding lexical constraints to an NMT model. Zhang and Lapata (2017) combined NMT with reinforcement learning, using SARI (Xu et al., 2016), BLEU, and cosine similarity to the source as the reward. None of these models explicitly addresses sentence splitting.

Alva-Manchego et al. (2017) proposed to reduce conservatism, observed in PBMT and NMT systems, by first identifying simplification operations in a parallel corpus and then using sequence labeling to perform the simplification. However, they did not address common structural operations, such as sentence splitting, and claimed that their method is not applicable to them.

Xu et al. (2016) used Syntax-based Machine Translation (SBMT) for sentence simplification, using a large-scale paraphrase dataset (Ganitkevitch et al., 2013) for training. While it does not target structural simplification, we include it in our evaluation for completeness.
Structural sentence simplification. Syntactic hand-crafted sentence splitting rules were proposed by Chandrasekar et al. (1996), Siddharthan (2002) and Siddharthan (2011) in the context of rule-based TS. The rules separate relative clauses and coordinated clauses and un-embed appositives. In our method, the use of semantic distinctions instead of syntactic ones reduces the number of rules. For example, relative clauses and appositives can correspond to the same semantic category. In syntax-based splitting, a generation module is sometimes added after the split (Siddharthan, 2004), addressing issues such as reordering and determiner selection. In our model, no explicit regeneration is applied to the split sentences, which are fed directly to an NMT system.

Glavaš and Štajner (2013) used a rule-based system conditioned on event extraction and syntax for defining two simplification models. The event-wise simplification one, which separates events into separate output sentences, is similar to our semantic component. The differences are that we use a single semantic representation for defining the rules (rather than a combination of semantic and syntactic criteria), and avoid the need for complex rules for retaining grammaticality by using a subsequent neural component.

Combined structural and lexical TS. Earlier TS models used syntactic information for splitting. Zhu et al. (2010) used syntactic information on the source side, based on the SBMT model of Yamada and Knight (2001). Syntactic structures were used on both sides in the model of Woodsend and Lapata (2011), based on a quasi-synchronous grammar (Smith and Eisner, 2006), which resulted in 438 learned splitting rules.

The model of Siddharthan and Angrosh (2014) is similar to ours in that it combines linguistic rules for structural simplification and statistical methods for lexical simplification.
However, we use 2 semantic splitting rules instead of their 26 syntactic rules for relative clauses and appositions, and 85 syntactic rules for subordination and coordination.

Narayan and Gardent (2014) argued that syntactic structures do not always capture the semantic arguments of a frame, which may result in wrong splitting boundaries. Consequently, they proposed a supervised system (HYBRID) that uses semantic structures (Discourse Semantic Representations; Kamp, 1981) for sentence splitting and deletion. Splitting candidates are pairs of event variables associated with at least one core thematic role (e.g., agent or patient). Semantic annotation is used on the source side in both training and test. Lexical simplification is performed using the Moses system. HYBRID is the system most similar to ours architecturally, in that it uses a combination of a semantic structural component and an MT component. Narayan and Gardent (2016) proposed instead an unsupervised pipeline, where sentences are split based on a probabilistic model trained on the semantic structures of Simple Wikipedia, as well as a language model trained on the same corpus. Lexical simplification is there performed using the unsupervised model of Biran et al. (2011). As their BLEU and adequacy scores are lower than HYBRID's, we use the latter for comparison.

Štajner and Glavaš (2017) combined rule-based simplification conditioned on event extraction together with an unsupervised lexical simplifier. They tackle a different setting, and aim to simplify texts (rather than sentences), by allowing the deletion of entire input sentences.

Split and Rephrase. Narayan et al. (2017) recently proposed the Split and Rephrase task, focusing on sentence splitting. For this purpose they presented a specialized parallel corpus, derived from the WebNLG dataset (Gardent et al., 2017). The latter is obtained from the DBPedia knowledge base (Mendes et al., 2012) using content selection and crowdsourcing, and is annotated with semantic triplets of subject-relation-object, obtained semi-automatically. They experimented with five systems, including one similar to HYBRID, as well as sequence-to-sequence methods for generating sentences from the source text and its semantic forms.

The present paper tackles both structural and lexical simplification, and examines the effect of sentence splitting on the subsequent application of a neural system, in terms of its tendency to perform other simplification operations. For this purpose, we adopt a semantic, corpus-independent approach for sentence splitting that can be easily integrated in any simplification system. Another difference is that the semantic forms in Split and Rephrase are derived semi-automatically (during corpus compilation), while we automatically extract the semantic form, using a UCCA parser.

3 Direct Semantic Splitting

3.1 Semantic Representation

UCCA (Universal Conceptual Cognitive Annotation; Abend and Rappoport, 2013) is a semantic annotation scheme rooted in typological and cognitive linguistic theory (Dixon, 2010a,b, 2012; Langacker, 2008). It aims to represent the main semantic phenomena in the text, abstracting away from syntactic forms. UCCA has been shown to be preserved remarkably well across translations (Sulem et al., 2015) and has also been successfully used for the evaluation of machine translation (Birch et al., 2016) and, recently, for the evaluation of TS (Sulem et al., 2018) and grammatical error correction (Choshen and Abend, 2018).

Formally, UCCA structures are directed acyclic graphs whose nodes (or units) correspond either to the leaves of the graph or to several elements viewed as a single entity according to some semantic or cognitive consideration.

Figure 1: Example applications of rules 1 (Figure 1a) and 2 (Figure 1b). In both cases, the original sentence, the semantic parse, the extracted Scenes with the required modifications, and the output of the rules are presented top to bottom. The UCCA categories used are: Parallel Scene (H), Linker (L), Participant (A), Process/State (P/S), Center (C), Elaborator (E), Relator (R).

A Scene is UCCA's notion of an event or a frame, and is a unit that corresponds to a movement, an action or a state which persists in time. Every Scene contains one main relation, which can be either a Process or a State. Scenes contain one or more Participants, interpreted in a broad sense to include locations and destinations. For example, the sentence "He went to school" has a single Scene whose Process is "went". The two Participants are "He" and "to school".

Scenes can have several roles in the text. First, they can provide additional information about an established entity (Elaborator Scenes), commonly participles or relative clauses. For example, "(child) who went to school" is an Elaborator Scene in "The child who went to school is John" ("child" serves both as an argument in the Elaborator Scene and as the Center). A Scene may also be a Participant in another Scene, for example, "John went to school" in the sentence "He said John went to school". In other cases, Scenes are annotated as Parallel Scenes (H), which are flat structures and may include a Linker (L), as in: "When_L [he arrives]_H, [he will call them]_H".

With respect to units which are not Scenes, the category Center denotes the semantic head. For example, "dogs" is the Center of the expression "big brown dogs", and "box" is the Center of "in the box". There can be more than one Center in a unit, for example in the case of coordination, where all conjuncts are Centers. We define the minimal center of a UCCA unit u to be the UCCA graph's leaf reached by starting from u and iteratively selecting the child tagged as Center.

For generating UCCA's structures we use TUPA, a transition-based parser (Hershcovich et al., 2017) (specifically, the TUPA_BiLSTM model). TUPA uses an expressive set of transitions, able to support all structural properties required by the UCCA scheme. Its transition classifier is based on an MLP that receives a BiLSTM encoding of elements in the parser state (buffer, stack and intermediate graph), given word embeddings and other features.

3.2 The Semantic Rules

For performing DSS, we define two simple splitting rules, conditioned on UCCA's categories. We currently only consider Parallel Scenes and Elaborator Scenes, not separating Participant Scenes, in order to avoid splitting in cases of nominalizations or indirect speech. For example, the sentence "His arrival surprised everyone", which has, in addition to the Scene evoked by "surprised", a Participant Scene evoked by "arrival", is not split here.

Rule #1. Parallel Scenes of a given sentence are extracted, separated into different sentences, and concatenated according to the order of appearance. More formally, given a decomposition of a sentence S into Parallel Scenes Sc_1, Sc_2, ..., Sc_n (indexed by the order of the first token), we obtain the following rule, where "|" is the sentence delimiter:

$S \rightarrow Sc_1 \mid Sc_2 \mid \cdots \mid Sc_n$

As UCCA allows argument sharing between Scenes, the rule may duplicate the same sub-span of S across sentences. For example, the rule will convert "He came back home and played piano" into "He came back home"|"He played piano.".

Rule #2. Given a sentence S, the second rule extracts Elaborator Scenes and corresponding minimal centers. Elaborator Scenes are then concatenated to the original sentence, where the Elaborator Scenes, except for the minimal center they elaborate, are removed. Pronouns such as "who", "which" and "that" are also removed. Formally, if {(Sc_1, C_1), ..., (Sc_n, C_n)} are the Elaborator Scenes of S and their corresponding minimal centers, the rewrite is:

$S \rightarrow S - \bigcup_{i=1}^{n} (Sc_i - C_i) \mid Sc_1 \mid \cdots \mid Sc_n$

where S − A is S without the unit A. For example, this rule converts the sentence "He observed the planet which has 14 known satellites" to "He observed the planet| Planet has 14 known satellites.". Article regeneration is not covered by the rule, as its output is directly fed into the NMT component.

After the extraction of Parallel Scenes and Elaborator Scenes, the resulting simplified Parallel Scenes are placed before the Elaborator Scenes. See Figure 1.
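To make the two rules concrete, the following is a minimal sketch of how they can operate over a toy, tree-shaped UCCA-style parse. The `Unit` class and its helpers are illustrative stand-ins, not the authors' released implementation: real UCCA structures are DAGs with remote edges (which is how argument sharing is represented), the paper's system runs over TUPA output rather than hand-built trees, and token spans are compared by string here for brevity.

```python
RELATIVE_PRONOUNS = {"who", "which", "that"}

class Unit:
    """Toy, tree-shaped stand-in for a UCCA unit."""
    def __init__(self, category, children=None, text=None):
        self.category = category          # e.g. "H", "E", "C", "P", "A"
        self.children = children or []
        self.text = text                  # set on leaves only

    def tokens(self):
        return ([self.text] if self.text
                else [t for c in self.children for t in c.tokens()])

def is_scene(unit):
    # A Scene contains a main relation: a Process (P) or State (S) child.
    return any(c.category in ("P", "S") for c in unit.children)

def minimal_center(unit):
    """Iteratively follow the child tagged Center (C) down to a leaf."""
    while unit.text is None:
        centers = [c for c in unit.children if c.category == "C"]
        if not centers:
            return unit
        unit = centers[0]
    return unit

def elaborator_pairs(unit):
    """Yield (minimal center of the elaborated unit, Elaborator Scene)."""
    for child in unit.children:
        if child.category == "E" and is_scene(child):
            yield minimal_center(unit), child
        yield from elaborator_pairs(child)

def dss(root):
    """Rule #1 on Parallel Scenes (H), then Rule #2 on Elaborator
    Scenes. '|' is the sentence delimiter; simplified Parallel Scenes
    are placed before the Elaborator Scenes."""
    parallel = [c for c in root.children if c.category == "H"] or [root]
    mains, elaborations = [], []
    for scene in parallel:
        drop = set()
        for center, elab in elaborator_pairs(scene):
            drop |= set(elab.tokens()) - {center.text}   # Sc_i - C_i
            elaborations.append(" ".join(
                [center.text] + [t for t in elab.tokens()
                                 if t.lower() not in RELATIVE_PRONOUNS]))
        mains.append(" ".join(t for t in scene.tokens() if t not in drop))
    return " | ".join(mains + elaborations)

if __name__ == "__main__":
    # "He observed the planet which has 14 known satellites"
    sats = Unit("A", [Unit("E", text="14"), Unit("E", text="known"),
                      Unit("C", text="satellites")])
    elab = Unit("E", [Unit("R", text="which"), Unit("P", text="has"), sats])
    obj = Unit("A", [Unit("E", text="the"), Unit("C", text="planet"), elab])
    scene = Unit("H", [Unit("A", text="He"), Unit("P", text="observed"), obj])
    print(dss(Unit("ROOT", [scene])))
    # -> He observed the planet | planet has 14 known satellites
```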
4 Neural Component

The split sentences are run through the NTS state-of-the-art neural TS system (Nisioi et al., 2017), built using the OpenNMT neural machine translation framework (Klein et al., 2017). The architecture includes two LSTM layers, with hidden states of 500 units in each, as well as global attention combined with input feeding (Luong et al., 2015). Training is done with a 0.3 dropout probability (Srivastava et al., 2014). This model uses alignment probabilities between the predictions and the original sentences, rather than character-based models, to retrieve the original words.

We here consider the w2v initialization for NTS (N17), where word2vec embeddings of size 300 are trained on Google News (Mikolov et al., 2013a) and local embeddings of size 200 are trained on the training simplification corpus (Řehůřek and Sojka, 2010; Mikolov et al., 2013b). Local embeddings for the encoder are trained on the source side of the training corpus, while those for the decoder are trained on the simplified side.

For sampling multiple outputs from the system, beam search is performed during decoding by generating the first 5 hypotheses at each step, ordered by the log-likelihood of the target sentence given the input sentence. We here explore both the highest-ranked (h1) and fourth-ranked (h4) hypotheses, which we show to increase the SARI score and to be much less conservative.[2] We thus experiment with two variants of the neural component, denoted by NTS-h1 and NTS-h4. The pipeline application of the rules and the neural system results in two corresponding models: SENTS-h1 and SENTS-h4.

[2] Similarly, N17 considered the first two hypotheses and showed that h2 has a higher SARI score and is less conservative than h1.
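Concretely, selecting h1 or h4 amounts to decoding with a beam of at least five and keeping a fixed-rank hypothesis per input sentence. A minimal sketch, assuming the n-best list is available as (sentence_id, rank, log_prob, text) tuples; the tuple layout and the example hypotheses are illustrative, not OpenNMT's actual serialization format:

```python
from collections import defaultdict

def select_hypothesis(nbest, rank):
    """Pick the rank-th hypothesis (1-based, ordered by descending
    log-likelihood) for each input sentence in an n-best list."""
    by_sentence = defaultdict(list)
    for sent_id, hyp_rank, log_prob, text in nbest:
        by_sentence[sent_id].append((hyp_rank, log_prob, text))
    selected = {}
    for sent_id, hyps in by_sentence.items():
        hyps.sort()                            # rank 1 first
        # fall back to the last hypothesis if the beam returned fewer
        selected[sent_id] = hyps[min(rank, len(hyps)) - 1][2]
    return selected

# Illustrative n-best entries for a single input sentence:
nbest = [(0, 1, -0.8, "He came back home ."),  # h1: near-copy
         (0, 2, -1.3, "He came home ."),
         (0, 3, -1.9, "He returned home ."),
         (0, 4, -2.2, "He went home .")]       # h4: less conservative
print(select_hypothesis(nbest, rank=4)[0])     # -> "He went home ."
```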
5 Experimental Setup

Corpus. All systems are tested on the test corpus of Xu et al. (2016),[3] comprising 359 sentences from the PWKP corpus (Zhu et al., 2010), with 8 references collected by crowdsourcing for each of the sentences.

[3] https://github.com/cocoxu/simplification (This also includes SARI tools and the SBMT-SARI system.)

Semantic component. The TUPA parser[4] is trained on the UCCA-annotated Wiki corpus.[5]

[4] https://github.com/danielhers/tupa
[5] http://www.cs.huji.ac.il/~oabend/ucca.html

Neural component. We use the NTS-w2v model[6] provided by N17, obtained by training on the corpus of Hwang et al. (2015) and tuning on the corpus of Xu et al. (2016). The training set is based on manual and automatic alignments between standard English Wikipedia and Simple English Wikipedia, including both good matches and partial matches whose similarity score is above the 0.45 scale threshold (Hwang et al., 2015). The total size of the training set is about 280K aligned sentences, of which 150K sentences are full matches and 130K are partial matches.[7]

[6] https://github.com/senisioi/NeuralTextSimplification
[7] We also considered the default initialization for the neural component, using the NTS model without word embeddings. Experimenting on the tuning set, the w2v approach got higher BLEU and SARI scores (for h1 and h4 respectively) than the default approach.
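A small sketch of the kind of threshold-based filtering described above; the tuple layout and the `is_full_match` flag are illustrative assumptions rather than the released file format of Hwang et al. (2015):

```python
def filter_aligned_pairs(pairs, threshold=0.45):
    """Keep full matches, and partial matches whose similarity score
    exceeds the threshold. Each pair is assumed to be a
    (similarity, is_full_match, complex_sent, simple_sent) tuple."""
    return [(src, tgt) for sim, is_full, src, tgt in pairs
            if is_full or sim > threshold]
```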
et al., 2002) (2) SARI (System output Against Ref- Following N17, a -2 to +2 scale is used for mea- erences and against the Input sentence; Xu et al., suring simplicity, where a 0 score indicates that 2016), which compares the n-grams of the system the input and the output are equally complex. This output with those of the input and the human ref- scale, compared to the standard 1 to 5 scale, per- erences, separately evaluating the quality of words mits a better differentiation between cases where that are added, deleted and kept by the systems. simplicity is hurt (the output is more complex than (3) Fadd: the addition component of the SARI the original) and between cases where the output score (F-score); (4) Fkeep: the keeping component is as simple as the original, for example in the case of the SARI score (F-score); (5) Pdel: the dele- of the identity transformation. Structural simplic- 11 tion component of the SARI score (precision). ity is also evaluated with a -2 to +2 scale. The Each metric is computed against the 8 available question for eliciting StS is accompanied with a references. We also assess system conservatism, negative example, showing a case of lexical sim- reporting the percentage of sentences copied from plification, where a complex word is replaced by a the input (%Same), the averaged Levenshtein dis- simple one (the other questions appear without ex- tance from the source (LDSC, which considers ad- amples). A positive example is not included so as ditions, deletions, and substitutions), and the num- not to bias the annotators by revealing the nature of 12 ber of source sentences that are split (#Split). the operations we focus on (splitting and deletion). 8https://github.com/XingxingZhang/ We follow N17 in applying human evaluation on dress the first 70 sentences of the test corpus.13 9 http://www.statmt.org/moses/ The resulting corpus, totaling 1960 sentence 10https://github.com/moses-smt/mgiza 11Uniform tokenization and truecasing styles for all sys- pairs, each annotated by 3 annotators, also include tems are obtained using the Moses toolkit. 12We used the NLTK package (Loper and Bird, 2002) for 13We do not exclude system outputs identical to the source, these computations. as done by N17. the additional experiments described in Section7 An example of high structural simplicity scores as well as the outputs of the NTS and SENTS sys- for SENTS resulting from deletions is presented tems used with the default initialization. in Table5, together with the outputs of the other The inter-annotator agreement, using Cohen’s systems and the corresponding human evaluation quadratic weighted κ (Cohen, 1968), is computed scores. NTS here performs lexical simplification, as the average agreement of the 3 annotator pairs. replacing the word “incursions” by “raids” or “at- The obtained rates are 0.56, 0.75, 0.47 and 0.48 tacks”’. On the other hand, the high StS scores for G, M, S and StS respectively. obtained by DSS and SEMoses are due to sentence System scores are computed by averaging over splittings. the 3 annotators and the 70 sentences. Automatic evaluation. Results are presented in G Is the output fluent and grammatical? Table3. Identity obtains much higher BLEU Does the output preserve the meaning of the M scores than any other system, suggesting that input? S Is the output simpler than the input? BLEU may not be informative in this setting. Is the output simpler than the input, ignoring StS SARI seems more informative, and assigns the the complexity of the words? 
lowest score to Identity and the second highest to Table 1: Questions for the human evaluation. the reference. Both SENTS systems outperform HYBRID in G M S StS Identity 4.80 5.00 0.00 0.00 terms of SARI and all its 3 sub-components. The Simple Wikipedia 4.60 4.21 0.83 0.38 h4 setting (hypothesis #4 in the beam) is generally Only MT-Based Simplification SBMT-SARI 3.71 3.96 0.14 -0.15 best, both with and without the splitting rules. NTS-h1 4.56 4.48 0.22 0.15 Comparing SENTS to using NTS alone (with- NTS-h4 4.29 3.90 0.31 0.19 Only Structural Simplification out splitting), we see that SENTS obtains higher DSS 3.42 4.15 0.16 0.16 SARI scores when hypothesis #1 is used and that Structural+MT-based Simplification Hybrid 2.96 2.46 0.43 0.43 NTS obtains higher scores when hypothesis #4 is SEMoses 3.27 3.98 0.16 0.13 used. This may result from NTS being more con- SENTS-h1 3.98 3.33 0.68 0.63 SENTS-h4 3.54 2.98 0.50 0.36 servative than SENTS (and HYBRID), which is re- warded by SARI (conservatism is indicated by the Table 2: Human evaluation of the different NMT-based sys- tems. Grammaticality (G) and Meaning preservation (M) are %Same column). Indeed for h1, %Same is re- measured using a 1 to 5 scale. A -2 to +2 scale is used for duced from around 66% for NTS, to around 7% measuring simplicity (S) and structural simplicity (StS) of for SENTS. Conservatism further decreases when the output relative to the input sentence. The highest score in each column appears in bold. Structural simplification sys- h4 is used (for both NTS and SENTS). Examining tems are those that explicitly model structural operations. SARI’s components, we find that SENTS outper- forms NTS on Fadd, and is comparable (or even 6 Results superior for h1 setting) to NTS on Pdel. The su- Human evaluation. Results are presented in Ta- perior SARI score of NTS over SENTS is thus en- ble2. First, we can see that the two SENTS sys- tirely a result of a superior Fkeep, which is easier tems outperform HYBRID in terms of G, M, and for a conservative system to maximize. S. SENTS-h1 is the best scoring system, under all Comparing HYBRID with SEMoses, both of human measures. which use Moses, we find that SEMoses obtains In comparison to NTS, SENTS scores markedly higher BLEU and SARI scores, as well as G and higher on the simplicity judgments. Meaning M human scores, and splits many more sentences. preservation and grammaticality are lower for HYBRID scores higher on the human simplicity SENTS, which is likely due to the more conserva- measures. We note, however, that applying NTS tive nature of NTS. Interestingly, the application alone is inferior to HYBRID in terms of simplicity, of the splitting rules by themselves does not yield and that both components are required to obtain a considerably simpler sentence. This likely stems high simplicity scores (with SENTS). from the rules not necessarily yielding grammati- We also compare the sentence splitting compo- cal sentences (NTS often serves as a grammatical nent used in our systems (namely DSS) to that error corrector over it), and from the incorporation used in HYBRID, abstracting away from deletion- of deletions, which are also structural operations, based and lexical simplification. We therefore ap- and are performed by the neural system. 
Table 1: Questions for the human evaluation.
G: Is the output fluent and grammatical?
M: Does the output preserve the meaning of the input?
S: Is the output simpler than the input?
StS: Is the output simpler than the input, ignoring the complexity of the words?

Table 2: Human evaluation of the different NMT-based systems. Grammaticality (G) and Meaning preservation (M) are measured using a 1 to 5 scale. A -2 to +2 scale is used for measuring simplicity (S) and structural simplicity (StS) of the output relative to the input sentence. The highest score in each column appears in bold. Structural simplification systems are those that explicitly model structural operations.

System | G | M | S | StS
Identity | 4.80 | 5.00 | 0.00 | 0.00
Simple Wikipedia | 4.60 | 4.21 | 0.83 | 0.38
Only MT-based simplification:
SBMT-SARI | 3.71 | 3.96 | 0.14 | -0.15
NTS-h1 | 4.56 | 4.48 | 0.22 | 0.15
NTS-h4 | 4.29 | 3.90 | 0.31 | 0.19
Only structural simplification:
DSS | 3.42 | 4.15 | 0.16 | 0.16
Structural+MT-based simplification:
HYBRID | 2.96 | 2.46 | 0.43 | 0.43
SEMoses | 3.27 | 3.98 | 0.16 | 0.13
SENTS-h1 | 3.98 | 3.33 | 0.68 | 0.63
SENTS-h4 | 3.54 | 2.98 | 0.50 | 0.36

6 Results

Human evaluation. Results are presented in Table 2. First, we can see that the two SENTS systems outperform HYBRID in terms of G, M, and S. SENTS-h1 is the best scoring system under all human measures.

In comparison to NTS, SENTS scores markedly higher on the simplicity judgments. Meaning preservation and grammaticality are lower for SENTS, which is likely due to the more conservative nature of NTS. Interestingly, the application of the splitting rules by themselves does not yield considerably simpler sentences. This likely stems from the rules not necessarily yielding grammatical sentences (NTS often serves as a grammatical error corrector over them), and from the incorporation of deletions, which are also structural operations, and are performed by the neural system.

An example of high structural simplicity scores for SENTS resulting from deletions is presented in Table 5, together with the outputs of the other systems and the corresponding human evaluation scores. NTS here performs lexical simplification, replacing the word "incursions" by "raids" or "attacks". On the other hand, the high StS scores obtained by DSS and SEMoses are due to sentence splittings.

Automatic evaluation. Results are presented in Table 3. Identity obtains much higher BLEU scores than any other system, suggesting that BLEU may not be informative in this setting. SARI seems more informative, and assigns the lowest score to Identity and the second highest to the reference.

Both SENTS systems outperform HYBRID in terms of SARI and all its 3 sub-components. The h4 setting (hypothesis #4 in the beam) is generally best, both with and without the splitting rules.

Comparing SENTS to using NTS alone (without splitting), we see that SENTS obtains higher SARI scores when hypothesis #1 is used and that NTS obtains higher scores when hypothesis #4 is used. This may result from NTS being more conservative than SENTS (and HYBRID), which is rewarded by SARI (conservatism is indicated by the %Same column). Indeed, for h1, %Same is reduced from around 66% for NTS to around 7% for SENTS. Conservatism further decreases when h4 is used (for both NTS and SENTS). Examining SARI's components, we find that SENTS outperforms NTS on F_add, and is comparable (or even superior in the h1 setting) to NTS on P_del. The superior SARI score of NTS over SENTS is thus entirely a result of a superior F_keep, which is easier for a conservative system to maximize.

Comparing HYBRID with SEMoses, both of which use Moses, we find that SEMoses obtains higher BLEU and SARI scores, as well as G and M human scores, and splits many more sentences. HYBRID scores higher on the human simplicity measures. We note, however, that applying NTS alone is inferior to HYBRID in terms of simplicity, and that both components are required to obtain high simplicity scores (with SENTS).

We also compare the sentence splitting component used in our systems (namely DSS) to that used in HYBRID, abstracting away from deletion-based and lexical simplification. We therefore apply DSS to the test set (554 sentences) of the WEB-SPLIT corpus (Narayan et al., 2017) (see Section 2), which focuses on sentence splitting. We compare our results to those reported for a variant of HYBRID used without the deletion module, and trained on WEB-SPLIT (Narayan et al., 2017). DSS gets a higher BLEU score (46.45 vs. 39.97) and performs more splittings (number of output sentences per input sentence of 1.73 vs. 1.26).

Table 3: The left-hand side of the table presents BLEU and SARI scores for the combinations of NTS and DSS, as well as for the baselines. The highest score in each column appears in bold. The right-hand side presents lexical and structural properties of the outputs. %Same: proportion of sentences copied from the input; LD_SC: averaged Levenshtein distance from the source; #Split: number of split sentences. Structural simplification systems are those that explicitly model structural operations.

System | BLEU | SARI | F_add | F_keep | P_del | %Same | LD_SC | #Split
Identity | 94.93 | 25.44 | 0.00 | 76.31 | 0.00 | 100 | 0.00 | 0
Simple Wikipedia | 69.58 | 39.50 | 8.46 | 61.71 | 48.32 | 0.00 | 33.34 | 0
Only MT-based simplification:
SBMT-SARI | 74.44 | 41.46 | 6.77 | 69.92 | 47.68 | 4.18 | 23.31 | 0
NTS-h1 | 88.67 | 28.73 | 0.80 | 70.95 | 14.45 | 66.02 | 17.13 | 0
NTS-h4 | 79.88 | 36.55 | 2.59 | 65.93 | 41.13 | 2.79 | 24.18 | 1
Only structural simplification:
DSS | 76.57 | 36.76 | 3.82 | 68.45 | 38.01 | 8.64 | 25.03 | 208
Structural+MT-based simplification:
HYBRID | 52.82 | 27.40 | 2.41 | 43.09 | 36.69 | 1.39 | 61.53 | 3
SEMoses | 74.45 | 36.68 | 3.77 | 67.66 | 38.62 | 7.52 | 27.44 | 208
SENTS-h1 | 58.94 | 30.27 | 3.01 | 51.52 | 36.28 | 6.69 | 59.18 | 0
SENTS-h4 | 57.71 | 31.90 | 3.95 | 51.86 | 39.90 | 0.28 | 54.47 | 17

Table 4: Automatic and human evaluation for the different combinations of Moses and DSS. The automatic metrics, as well as the lexical and structural properties reported (%Same: proportion of sentences copied from the input; LD_SC: averaged Levenshtein distance from the source; #Split: number of split sentences), concern the 359 sentences of the test corpus. Human evaluation, with the G, M, S, and StS parameters, is applied to the first 70 sentences of the corpus. The highest score in each column appears in bold.

System | BLEU | SARI | F_add | F_keep | P_del | %Same | LD_SC | #Split | G | M | S | StS
Moses | 92.58 | 28.19 | 0.16 | 75.73 | 8.70 | 79.67 | 3.22 | 0 | 4.25 | 4.78 | 0.00 | 0.04
SEMoses | 74.45 | 36.68 | 3.77 | 67.66 | 38.62 | 7.52 | 27.44 | 208 | 3.27 | 3.98 | 0.16 | 0.13
SETrain1-Moses | 91.24 | 33.06 | 0.41 | 76.07 | 22.69 | 60.72 | 4.47 | 1 | 4.23 | 4.54 | -0.12 | -0.13
SETrain2-Moses | 94.31 | 26.71 | 0.07 | 76.20 | 3.85 | 92.76 | 1.45 | 0 | 4.73 | 4.99 | 0.01 | -0.005
Moses_LM | 92.66 | 28.19 | 0.18 | 75.68 | 8.71 | 79.39 | 3.43 | 0 | 4.55 | 4.82 | -0.01 | -0.04
SEMoses_LM | 74.49 | 36.70 | 3.79 | 67.67 | 38.65 | 7.52 | 27.45 | 208 | 3.32 | 4.08 | 0.15 | 0.14
SETrain1-Moses_LM | 85.68 | 36.52 | 2.34 | 72.85 | 34.37 | 27.30 | 6.71 | 33 | 4.03 | 4.63 | -0.11 | -0.12
SETrain2-Moses_LM | 94.22 | 26.66 | 0.10 | 76.19 | 3.69 | 92.20 | 1.43 | 0 | 4.75 | 4.99 | 0.01 | -0.01

7 Additional Experiments

Replacing the parser by manual annotation. In order to isolate the influence of the parser on the results, we implement a semi-automatic version of the semantic component, which uses manual UCCA annotation instead of the parser, focusing on the first 70 sentences of the test corpus. We employ a single expert UCCA annotator and use the UCCAApp annotation tool (Abend et al., 2017).

Results are presented in Table 6, for both SENTS and SEMoses. In the case of SEMoses, meaning preservation is improved when manual UCCA annotation is used. On the other hand, simplicity degrades, possibly due to the larger number of Scenes marked by the human annotator (TUPA tends to under-predict Scenes). This effect does not show with SENTS, where trends are similar to the automatic parses case, and high simplicity scores are obtained. This demonstrates that UCCA parsing technology is sufficiently mature to be used to carry out structural simplification.

We also directly evaluate the performance of the parser by computing F1, Recall and Precision DAG scores (Hershcovich et al., 2017) against the manual UCCA annotation.[14] We obtain, for primary edges (i.e., edges that form a tree structure), scores of 68.9%, 70.5% and 67.4% for F1, Recall and Precision respectively. For remote edges (i.e., additional edges, forming a DAG), the scores are 45.3%, 40.5% and 51.5%. These results are comparable with the out-of-domain results reported by Hershcovich et al. (2017).

[14] We use the evaluation tools provided in https://github.com/danielhers/ucca, ignoring 9 sentences for which different tokenizations of proper nouns are used in the automatic and manual parsing.

Experiments on Moses. We test other variants of SEMoses, where phrase-based MT is used instead of NMT. Specifically, we incorporate semantic information in a different manner by implementing two additional models: (1) SETrain1-Moses, where a new training corpus is obtained by applying the splitting rules to the target side of the training corpus; (2) SETrain2-Moses, where the rules are applied to the source side. The resulting parallel corpus is concatenated to the original training corpus. We also examine whether training a language model (LM) on split sentences has a positive effect, and train the LM on the split target side. For each system X, the version with the LM trained on split sentences is denoted by X_LM.

We repeat the same human and automatic evaluation protocol as in §6, presenting results in Table 4. Simplicity scores are much higher in the case of SENTS (which uses NMT) than with Moses. The two best systems according to SARI are SEMoses and SEMoses_LM, which use DSS. In fact, they resemble the performance of DSS applied alone (Tables 2 and 3), which confirms the high degree of conservatism observed by Moses in simplification (Alva-Manchego et al., 2017). Indeed, all Moses-based systems that do not apply DSS as pre-processing are conservative, obtaining high scores for BLEU, grammaticality and meaning preservation, but low scores for simplicity. Training the LM on split sentences shows little improvement.
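A sketch of how the SETrain corpus variants can be derived from a parallel corpus; `split_fn` stands for the DSS pipeline (UCCA parsing followed by the two rules), and the function name and corpus handling are illustrative:

```python
def make_setrain_corpus(pairs, split_fn, side="target"):
    """SETrain1 applies the splitting rules to the target side of the
    training corpus, SETrain2 to the source side; the new pairs are
    then concatenated to the original corpus."""
    if side == "target":          # SETrain1
        new_pairs = [(src, split_fn(tgt)) for src, tgt in pairs]
    else:                         # SETrain2
        new_pairs = [(split_fn(src), tgt) for src, tgt in pairs]
    return pairs + new_pairs
```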
Table 5: System outputs for one of the test sentences with the corresponding human evaluation scores (averaged over the 3 annotators). Grammaticality (G) and Meaning preservation (M) are measured using a 1 to 5 scale. A -2 to +2 scale is used for measuring simplicity (S) and structural simplicity (StS) of the output relative to the input sentence.

System | Output | G | M | S | StS
Identity | In return, Rollo swore fealty to Charles, converted to Christianity, and undertook to defend the northern region of France against the incursions of other Viking groups. | 5.00 | 5.00 | 0.00 | 0.00
Simple Wikipedia | In return, Rollo swore fealty to Charles, converted to Christianity, and swore to defend the northern region of France against raids by other Viking groups. | 4.67 | 5.00 | 1.00 | 0.00
SBMT-SARI | In return, Rollo swore fealty to Charles, converted to Christianity, and set out to defend the north of France from the raids of other viking groups. | 4.67 | 4.67 | 0.67 | 0.00
NTS-h1 | In return, Rollo swore fealty to Charles, converted to Christianity, and undertook to defend the northern region of France against the raids of other Viking groups. | 5.00 | 5.00 | 1.00 | 0.00
NTS-h4 | In return, Rollo swore fealty to Charles, converted to Christianity, and undertook to defend the northern region of France against the attacks of other Viking groups. | 4.67 | 5.00 | 1.00 | 0.00
DSS | Rollo swore fealty to Charles. Rollo converted to Christianity. Rollo undertook to defend the northern region of France against the incursions of other viking groups. | 4.00 | 4.33 | 1.33 | 1.33
HYBRID | In return Rollo swore, and undertook to defend the region of France., Charles, converted | 2.33 | 2.00 | 0.33 | 0.33
SEMoses | Rollo swore put his seal to Charles. Rollo converted to Christianity. Rollo undertook to defend the northern region of France against the incursions of other viking groups. | 3.33 | 4.00 | 1.33 | 1.33
SENTS-h1 | Rollo swore fealty to Charles. | 5.00 | 2.00 | 2.00 | 2.00
SENTS-h4 | Rollo swore fealty to Charles and converted to Christianity. | 5.00 | 2.67 | 1.33 | 1.33

Table 6: Human evaluation using manual UCCA annotation. Grammaticality (G) and Meaning preservation (M) are measured using a 1 to 5 scale. A -2 to +2 scale is used for measuring simplicity (S) and structural simplicity (StS) of the output relative to the input sentence. X_m refers to the semi-automatic version of system X.

System | G | M | S | StS
DSS_m | 3.38 | 3.91 | -0.16 | -0.16
SENTS_m-h1 | 4.12 | 3.34 | 0.61 | 0.58
SENTS_m-h4 | 3.60 | 3.24 | 0.26 | 0.12
SEMoses_m | 3.32 | 4.27 | -0.25 | -0.25
SEMoses_LM,m | 3.43 | 4.28 | -0.18 | -0.19

8 Conclusion

We presented the first simplification system combining semantic structures and neural machine translation, showing that it outperforms existing lexical and structural systems. The proposed approach addresses the over-conservatism of MT-based systems for TS, which often fail to modify the source in any way. The semantic component performs sentence splitting without relying on a specialized corpus, but only on an off-the-shelf semantic parser. The consideration of sentence splitting as a decomposition of a sentence into its Scenes is further supported by recent work on structural TS evaluation (Sulem et al., 2018), which proposes the SAMSA metric. The two works, which apply this assumption to different ends (TS system construction and TS evaluation), confirm its validity. Future work will leverage UCCA's cross-linguistic applicability to support multi-lingual TS and TS pre-processing for MT.

Acknowledgments

We would like to thank Shashi Narayan for sharing his data and the annotators for participating in our evaluation and UCCA annotation experiments. We also thank Daniel Hershcovich and the anonymous reviewers for their helpful advice. This work was partially supported by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI) and by the Israel Science Foundation (grant No. 929/17), as well as by the HUJI Cyber Security Research Center in conjunction with the Israel National Cyber Bureau in the Prime Minister's Office.

References

Omri Abend and Ari Rappoport. 2013. Universal Conceptual Cognitive Annotation (UCCA). In Proc. of ACL'13, pages 228–238.

Omri Abend, Shai Yerushalmi, and Ari Rappoport. 2017. UCCAApp: Web-application for syntactic and semantic phrase-based annotation. In Proc. of ACL'17, System Demonstrations, pages 109–114.
Sandra Maria Aluísio and Caroline Gasperin. 2010. Fostering digital inclusion and accessibility: The PorSimples project for simplification of Portuguese texts. In Proc. of the NAACL HLT 2010 Young Investigators Workshop on Computational Approaches to Languages of the Americas, pages 46–53.

Fernando Alva-Manchego, Joachim Bingel, Gustavo H. Paetzold, Carolina Scarton, and Lucia Specia. 2017. Learning how to simplify from explicit labeling of complex-simplified text pairs. In Proc. of IJCNLP'17, pages 295–305.

Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. 2013. Abstract Meaning Representation for sembanking. In Proc. of the Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186.

Gianni Barlacchi and Sara Tonelli. 2013. ERNESTA: A sentence simplification tool for children's stories in Italian. In Proc. of CICLing'13, pages 476–487.

Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: a context-aware approach to lexical simplification. In Proc. of ACL'11, pages 496–501.

Alexandra Birch, Omri Abend, Ondřej Bojar, and Barry Haddow. 2016. HUME: Human UCCA-based evaluation of machine translation. In Proc. of EMNLP'16, pages 1264–1274.

Raman Chandrasekar, Christine Doran, and Bangalore Srinivas. 1996. Motivations and methods for sentence simplification. In Proc. of COLING'96, pages 1041–1044.

Leshem Choshen and Omri Abend. 2018. Reference-less measure of faithfulness for grammatical error correction. In Proc. of NAACL'18 (Short papers). To appear.

Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin, 70(4):213.
William Coster and David Kauchak. 2011a. Learning to simplify sentences using Wikipedia. In Proc. of ACL'11, Short Papers, pages 1–9.

William Coster and David Kauchak. 2011b. Simple English Wikipedia: A new text simplification task. In Proc. of ACL'11, pages 665–669.

Robert M.W. Dixon. 2010a. Basic Linguistic Theory: Grammatical Topics, volume 2. Oxford University Press.

Robert M.W. Dixon. 2010b. Basic Linguistic Theory: Methodology, volume 1. Oxford University Press.

Robert M.W. Dixon. 2012. Basic Linguistic Theory: Further Grammatical Topics, volume 3. Oxford University Press.

Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. 2013. PPDB: The paraphrase database. In Proc. of NAACL-HLT'13, pages 758–764.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planning. In Proc. of ACL'17, pages 179–188.

Goran Glavaš and Sanja Štajner. 2013. Event-centered simplification of news stories. In Proc. of the Student Research Workshop associated with RANLP 2013, pages 71–78.

Daniel Hershcovich, Omri Abend, and Ari Rappoport. 2017. A transition-based directed acyclic graph parser for UCCA. In Proc. of ACL'17, pages 1127–1138.

Bui Thanh Hung, Nguyen Le Minh, and Akira Shimazu. 2012. Sentence splitting for Vietnamese-English machine translation. In Knowledge and Systems Engineering, 2012 Fourth International Conference, pages 156–160.

William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. 2015. Aligning sentences from Standard Wikipedia to Simple Wikipedia. In Proc. of NAACL'15, pages 211–217.

Hans Kamp. 1981. A theory of truth and semantic representation. In Formal Methods in the Study of Language. Mathematisch Centrum, Mathematical Centre Tracts, pt. 1.

David Kauchak. 2013. Improving text simplification language modeling using unsimplified text data. In Proc. of ACL'13, pages 1537–1546.

Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. ArXiv:1701.02810 [cs.CL].

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: open source toolkit for statistical machine translation. In Proc. of ACL'07, interactive poster and demonstration sessions, pages 177–180.

Philipp Koehn, Franz Josef Och, and Daniel Marcu. 2003. Statistical phrase-based translation. In Proc. of NAACL'03, pages 48–54.

Ronald W. Langacker. 2008. Cognitive Grammar: A Basic Introduction. Oxford University Press, USA.

Edward Loper and Steven Bird. 2002. NLTK: the Natural Language Toolkit. In Proc. of EMNLP'02, pages 63–70.

Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP'15, pages 1412–1421.

Pablo N. Mendes, Max Jakob, and Christian Bizer. 2012. DBpedia: A multilingual cross-domain knowledge base. In Proc. of LREC'12, pages 1813–1817.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space. In Proc. of the Workshop at the International Conference on Learning Representations.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Kshitij Mishra, Ankush Soni, Rahul Sharma, and Dipti Misra Sharma. 2014. Exploring the effects of sentence simplification on Hindi to English Machine Translation systems. In Proc. of the Workshop on Automatic Text Simplification: Methods and Applications in the Multilingual Society, pages 21–29.

Shashi Narayan and Claire Gardent. 2014. Hybrid simplification using deep semantics and machine translation. In Proc. of ACL'14, pages 435–445.

Shashi Narayan and Claire Gardent. 2016. Unsupervised sentence simplification using deep semantics. In Proc. of INLG'16, pages 111–120.

Shashi Narayan, Claire Gardent, Shay B. Cohen, and Anastasia Shimorina. 2017. Split and rephrase. In Proc. of EMNLP'17, pages 617–627.

Christina Niklaus, Bernhard Bermeitinger, Siegfried Handschuh, and André Freitas. 2016. A sentence simplification system for improving relation extraction. In Proc. of COLING'16.

Sergiu Nisioi, Sanja Štajner, Simone Paolo Ponzetto, and Liviu P. Dinu. 2017. Exploring neural text simplification models. In Proc. of ACL'17 (Short papers), pages 85–91.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proc. of ACL'02, pages 311–318.

Radim Řehůřek and Petr Sojka. 2010. Software framework for topic modelling with large corpora. In Proc. of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta. ELRA.

Luz Rello, Ricardo Baeza-Yates, Stefan Bott, and Horacio Saggion. 2013. Simplify or help?: text simplification strategies for people with dyslexia. In Proc. of the 10th International Cross-Disciplinary Conference on Web Accessibility, pages 15:1–15:10.

Violeta Seretan. 2012. Acquisition of syntactic simplification rules for French. In Proc. of LREC'12, pages 4019–4026.

Matthew Shardlow. 2014. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications.

Advaith Siddharthan. 2002. An architecture for a text simplification system. In Proc. of LEC'02, pages 64–71.

Advaith Siddharthan. 2004. Syntactic simplification and text cohesion. Technical Report 597, University of Cambridge.

Advaith Siddharthan. 2011. Text simplification using typed dependencies: A comparison of the robustness of different generation strategies. In Proc. of the 13th European Workshop on Natural Language Generation, pages 2–11.

Advaith Siddharthan and M. A. Angrosh. 2014. Hybrid text simplification using synchronous dependency grammars with hand-written and automatically harvested rules. In Proc. of EACL'14, pages 722–731.

David A. Smith and Jason Eisner. 2006. Quasi-synchronous grammars: Alignment by soft projection of syntactic dependencies. In Proc. of the 1st Workshop on Statistical Machine Translation, pages 23–30.

Lucia Specia. 2010. Translating from complex to simplified sentences. In Proc. of the 9th International Conference on Computational Processing of the Portuguese Language, pages 30–39.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Sanja Štajner, Hannah Bechara, and Horacio Saggion. 2015. A deeper exploration of the standard PB-SMT approach to text simplification and its evaluation. In Proc. of ACL'15 (Short papers), pages 823–828.

Sanja Štajner and Goran Glavaš. 2017. Leveraging event-based semantics for automated text simplification. Expert Systems with Applications, 82:383–395.

Sanja Štajner and Maja Popović. 2016. Can text simplification help machine translation? Baltic J. Modern Computing, 4:230–242.

Elior Sulem, Omri Abend, and Ari Rappoport. 2015. Conceptual annotations preserve structure across translations. In Proc. of the 1st Workshop on Semantics-Driven Statistical Machine Translation (S2MT 2015), pages 11–22.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018. Semantic structural evaluation for text simplification. In Proc. of NAACL'18. To appear.

Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proc. of EMNLP'11, pages 409–420.

Sander Wubben, Antal van den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proc. of ACL'12, pages 1015–1024.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: new data can help. TACL, 3:283–297.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. TACL, 4:401–415.

Kenji Yamada and Kevin Knight. 2001. A syntax-based statistical translation model. In Proc. of ACL'01, pages 523–530.

Xingxing Zhang and Mirella Lapata. 2017. Sentence simplification with deep reinforcement learning. In Proc. of EMNLP'17, pages 595–605.

Yaoyuan Zhang, Zhenxu Ye, Dongyan Zhao, and Rui Yan. 2017. A constrained sequence-to-sequence neural model for sentence simplification. ArXiv:1704.02312 [cs.CL].

Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proc. of COLING'10, pages 1353–1361.