Arxiv:2005.00352V2 [Cs.CL] 16 Apr 2021
Total Page:16
File Type:pdf, Size:1020Kb
MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases Louis Martin1;2 Angela Fan1;3 Éric de la Clergerie2 Antoine Bordes1 Benoît Sagot2 1Facebook AI Research, Paris, France 2Inria, Paris, France 3LORIA, Nancy, France {louismartin, angelafan, abordes}@fb.com {eric.de_la_clergerie, benoit.sagot}@inria.fr Abstract in multiple ways, and depend on the target audi- Progress in sentence simplification has been ence. Simplification guidelines are not uniquely hindered by a lack of labeled parallel simpli- defined, outlined by the stark differences in English fication data, particularly in languages other simplification benchmarks (Alva-Manchego et al., than English. We introduce MUSS, a Multi- 2020a). This highlights the need for more general lingual Unsupervised Sentence Simplification models that can adjust to different simplification system that does not require labeled simplifica- contexts and scenarios. tion data. MUSS uses a novel approach to sen- In this paper, we propose to train controllable tence simplification that trains strong models using sentence-level paraphrase data instead models using sentence-level paraphrase data only, of proper simplification data. These models i.e. parallel sentences that have the same meaning leverage unsupervised pretraining and control- but phrased differently. In order to generate sim- lable generation mechanisms to flexibly adjust plifications and not paraphrases, we use ACCESS attributes such as length and lexical complex- (Martin et al., 2020) to control attributes such as ity at inference time. We further present a length, lexical and syntactic complexity. Para- method to mine such paraphrase data in any phrase data is more readily available, and opens the language from Common Crawl using semantic sentence embeddings, thus removing the need door to training flexible models that can adjust to for labeled data. We evaluate our approach more varied simplification scenarios. We propose on English, French, and Spanish simplification to gather such paraphrase data in any language by benchmarks and closely match or outperform mining sentences from Common Crawl using se- the previous best supervised results, despite mantic sentence embeddings. We show that simpli- not using any labeled simplification data. We fication models trained on mined paraphrase data push the state of the art further by incorporat- perform as well as models trained on existing large ing labeled simplification data. paraphrase corpora (cf. AppendixD). Moreover, 1 Introduction paraphrases are more straightforward to mine than Sentence simplification is the task of making a sen- simplifications, and we show that they lead to mod- tence easier to read and understand by reducing its els with better performance than equivalent models lexical and syntactic complexity, while retaining trained on mined simplifications (cf. Section 5.4). most of its original meaning. Simplification has a Our resulting Multilingual Unsupervised Sen- tence Simplification method, MUSS, is unsuper- arXiv:2005.00352v2 [cs.CL] 16 Apr 2021 variety of important societal applications, for exam- vised ple increasing accessibility for those with cognitive because it can be trained without relying on labeled simplification data,1 even though we mine disabilities such as aphasia (Carroll et al., 1998), 2 dyslexia (Rello et al., 2013), and autism (Evans using supervised sentence embeddings. We addi- et al., 2014), or for non-native speakers (Paetzold tionally incorporate unsupervised pretraining (Liu and Specia, 2016). Research has mostly focused et al., 2019, 2020) and apply MUSS on English, on English simplification, where source texts and 1We use the term labeled simplifications to refer to parallel associated simplified texts exist and can be automat- datasets where texts were manually simplified by humans. 2Previous works have also used the term unsupervised ically aligned, such as English Wikipedia and Sim- simplification to describe works that do not use any labeled ple English Wikipedia (Zhang and Lapata, 2017). parallel simplification data while leveraging supervised com- However, such data is limited in terms of size and ponents such as constituency parsers and knowledge bases (Kumar et al., 2020), external synonymy lexicons (Surya et al., domain, and difficult to find in other languages. Ad- 2019), and databases of simplified synonyms (Zhao et al., ditionally, simplifying a sentence can be achieved 2020). We shall come back to these works in Section2. French, and Spanish to closely match or outper- When labeled parallel simplification data is un- form the supervised state of the art in all languages. available, systems rely on unsupervised simplifi- MUSS further improves the state of the art on cation techniques, often inspired from machine all English datasets by incorporating additional la- translation. The prevailing approach is to split beled simplification data. a monolingual corpora into disjoint sets of com- To sum up, our contributions are as follows: plex and simple sentences using readability met- rics. Then simplification models can be trained • We introduce a novel approach to training by using automatic sentence alignments (Kajiwara simplification models with paraphrase data and Komachi, 2016, 2018), auto-encoders (Surya only and propose a mining procedure to cre- et al., 2019; Zhao et al., 2020), unsupervised statis- ate large paraphrase corpora for any language. tical machine translation (Katsuta and Yamamoto, 2019), or back-translation (Aprosio et al., 2019). • Our approach obtains strong performance. Other unsupervised simplification approaches iter- Without any labeled simplification data, we atively edit the sentence until a certain simplicity match or outperform the supervised state of criterion is reached (Kumar et al., 2020). The per- the art in English, French and Spanish. We formance of unsupervised methods are often below further improve the English state of the art by their supervised counterparts. MUSS bridges the incorporating labeled simplification data. gap with supervised method and removes the need • We release MUSS pretrained models, para- for deciding in advance how complex and simple phrase data, and code for mining and train- sentences should be separated, but instead trains di- ing3. rectly on paraphrases mined from the raw corpora. 2 Related work Previous work on parallel dataset mining have Data-driven methods have been predominant in been used mostly in machine translation using doc- English sentence simplification in recent years ument retrieval (Munteanu and Marcu, 2005), lan- (Alva-Manchego et al., 2020b), requiring large guage models (Koehn et al., 2018, 2019), and em- supervised training corpora of complex-simple bedding space alignment (Artetxe and Schwenk, aligned sentences (Wubben et al., 2012; Xu et al., 2019b) to create large corpora (Tiedemann, 2012; 2016; Zhang and Lapata, 2017; Zhao et al., 2018; Schwenk et al., 2019). We focus on paraphras- Martin et al., 2020). Methods have relied on En- ing for sentence simplifications, which presents glish and Simple English Wikipedia with automatic new challenges. Unlike machine translation, where sentence alignment from similar articles (Zhu et al., the same sentence should be identified in two lan- 2010; Coster and Kauchak, 2011; Woodsend and guages, we develop a method to identify varied Lapata, 2011; Kauchak, 2013; Zhang and Lapata, paraphrases of sentences, that have a wider array of 2017). Higher quality datasets have been proposed surface forms, including different lengths, multiple such as the Newsela corpus (Xu et al., 2015), but sentences, different vocabulary usage, and removal they are rare and come with restrictive licenses that of content from more complex sentences. hinder reproducibility and widespread usage. Simplification in other languages has been Previous unsupervised paraphrasing research explored in Brazilian Portuguese (Aluísio et al., has aligned sentences from various parallel corpora 2008), Spanish (Saggion et al., 2015; Štajner et al., (Barzilay and Lee, 2003) with multiple objective 2015), Italian (Brunato et al., 2015; Tonelli et al., functions (Liu et al., 2019). Bilingual pivoting re- 2016), Japanese (Goto et al., 2015; Kajiwara and lied on MT datasets to create large databases of Komachi, 2018; Katsuta and Yamamoto, 2019), word-level paraphrases (Pavlick et al., 2015), lex- and French (Gala et al., 2020), but the lack of a ical simplifications (Pavlick and Callison-Burch, large labeled parallel corpora has slowed research 2016; Kriz et al., 2018), or sentence-level para- down. In this work, we show that a method trained phrase corpora (Wieting and Gimpel, 2018). This on automatically mined corpora can reach state-of- has not been applied to multiple languages or to the-art results in each language. sentence-level simplification. Additionally, we use 3https://github.com/facebookresearch/ raw monolingual data to create our paraphrase cor- muss pora instead of relying on parallel MT datasets. Type # Sequence # Avg. Tokens out noisy text. Pairs per Sequence WikiLarge Labeled Parallel 296,402 original: 21.7 Creating a Sequence Index Using Embeddings (English) Simplifications simple: 16.0 Newsela Labeled Parallel 94,206 original: 23.4 To automatically mine our paraphrase corpora, we (English) Simplifications simple: 14.2 first compute n-dimensional embeddings for each English Mined 1,194,945 22.3 extracted sequence using LASER (Artetxe and French Mined 1,360,422 18.7 Spanish Mined 996,609 22.8 Schwenk, 2019b). LASER provides joint multi- lingual sentence