MUSS: Multilingual Unsupervised Sentence Simplification by Mining Paraphrases Louis Martin1,2 Angela Fan1,3 Éric de la Clergerie2 Antoine Bordes1 Benoît Sagot2 1Facebook AI Research, Paris, France 2Inria, Paris, France 3LORIA, Nancy, France {louismartin, angelafan, abordes}@fb.com {eric.de_la_clergerie, benoit.sagot}@inria.fr Abstract in multiple ways, and depend on the target audi- Progress in sentence simplification has been ence. Simplification guidelines are not uniquely hindered by a lack of labeled parallel simpli- defined, outlined by the stark differences in English fication data, particularly in languages other simplification benchmarks (Alva-Manchego et al., than English. We introduce MUSS, a Multi- 2020a). This highlights the need for more general lingual Unsupervised Sentence Simplification models that can adjust to different simplification system that does not require labeled simplifica- contexts and scenarios. tion data. MUSS uses a novel approach to sen- In this paper, we propose to train controllable tence simplification that trains strong models using sentence-level paraphrase data instead models using sentence-level paraphrase data only, of proper simplification data. These models i.e. parallel sentences that have the same meaning leverage unsupervised pretraining and control- but phrased differently. In order to generate sim- lable generation mechanisms to flexibly adjust plifications and not paraphrases, we use ACCESS attributes such as length and lexical complex- (Martin et al., 2020) to control attributes such as ity at inference time. We further present a length, lexical and syntactic complexity. Para- method to mine such paraphrase data in any phrase data is more readily available, and opens the language from Common Crawl using semantic sentence embeddings, thus removing the need door to training flexible models that can adjust to for labeled data. We evaluate our approach more varied simplification scenarios. We propose on English, French, and Spanish simplification to gather such paraphrase data in any language by benchmarks and closely match or outperform mining sentences from Common Crawl using se- the previous best supervised results, despite mantic sentence embeddings. We show that simpli- not using any labeled simplification data. We fication models trained on mined paraphrase data push the state of the art further by incorporat- perform as well as models trained on existing large ing labeled simplification data. paraphrase corpora (cf. AppendixD). Moreover, 1 Introduction paraphrases are more straightforward to mine than Sentence simplification is the task of making a sen- simplifications, and we show that they lead to mod- tence easier to read and understand by reducing its els with better performance than equivalent models lexical and syntactic complexity, while retaining trained on mined simplifications (cf. Section 5.4). most of its original meaning. Simplification has a Our resulting Multilingual Unsupervised Sen- tence Simplification method, MUSS, is unsuper-

arXiv:2005.00352v2 [cs.CL] 16 Apr 2021 variety of important societal applications, for exam- vised ple increasing accessibility for those with cognitive because it can be trained without relying on labeled simplification data,1 even though we mine disabilities such as aphasia (Carroll et al., 1998), 2 dyslexia (Rello et al., 2013), and autism (Evans using supervised sentence embeddings. We addi- et al., 2014), or for non-native speakers (Paetzold tionally incorporate unsupervised pretraining (Liu and Specia, 2016). Research has mostly focused et al., 2019, 2020) and apply MUSS on English, on English simplification, where source texts and 1We use the term labeled simplifications to refer to parallel associated simplified texts exist and can be automat- datasets where texts were manually simplified by humans. 2Previous works have also used the term unsupervised ically aligned, such as English Wikipedia and Sim- simplification to describe works that do not use any labeled ple English Wikipedia (Zhang and Lapata, 2017). parallel simplification data while leveraging supervised com- However, such data is limited in terms of size and ponents such as constituency parsers and knowledge bases (Kumar et al., 2020), external synonymy lexicons (Surya et al., domain, and difficult to find in other languages. Ad- 2019), and databases of simplified synonyms (Zhao et al., ditionally, simplifying a sentence can be achieved 2020). We shall come back to these works in Section2. French, and Spanish to closely match or outper- When labeled parallel simplification data is un- form the supervised state of the art in all languages. available, systems rely on unsupervised simplifi- MUSS further improves the state of the art on cation techniques, often inspired from machine all English datasets by incorporating additional la- translation. The prevailing approach is to split beled simplification data. a monolingual corpora into disjoint sets of com- To sum up, our contributions are as follows: plex and simple sentences using readability met- rics. Then simplification models can be trained • We introduce a novel approach to training by using automatic sentence alignments (Kajiwara simplification models with paraphrase data and Komachi, 2016, 2018), auto-encoders (Surya only and propose a mining procedure to cre- et al., 2019; Zhao et al., 2020), unsupervised statis- ate large paraphrase corpora for any language. tical (Katsuta and Yamamoto, 2019), or back-translation (Aprosio et al., 2019). • Our approach obtains strong performance. Other unsupervised simplification approaches iter- Without any labeled simplification data, we atively edit the sentence until a certain simplicity match or outperform the supervised state of criterion is reached (Kumar et al., 2020). The per- the art in English, French and Spanish. We formance of unsupervised methods are often below further improve the English state of the art by their supervised counterparts. MUSS bridges the incorporating labeled simplification data. gap with supervised method and removes the need • We release MUSS pretrained models, para- for deciding in advance how complex and simple phrase data, and code for mining and train- sentences should be separated, but instead trains di- ing3. rectly on paraphrases mined from the raw corpora.

2 Related work Previous work on parallel dataset mining have Data-driven methods have been predominant in been used mostly in machine translation using doc- English sentence simplification in recent years ument retrieval (Munteanu and Marcu, 2005), lan- (Alva-Manchego et al., 2020b), requiring large guage models (Koehn et al., 2018, 2019), and em- supervised training corpora of complex-simple bedding space alignment (Artetxe and Schwenk, aligned sentences (Wubben et al., 2012; Xu et al., 2019b) to create large corpora (Tiedemann, 2012; 2016; Zhang and Lapata, 2017; Zhao et al., 2018; Schwenk et al., 2019). We focus on paraphras- Martin et al., 2020). Methods have relied on En- ing for sentence simplifications, which presents glish and Simple English Wikipedia with automatic new challenges. Unlike machine translation, where sentence alignment from similar articles (Zhu et al., the same sentence should be identified in two lan- 2010; Coster and Kauchak, 2011; Woodsend and guages, we develop a method to identify varied Lapata, 2011; Kauchak, 2013; Zhang and Lapata, paraphrases of sentences, that have a wider array of 2017). Higher quality datasets have been proposed surface forms, including different lengths, multiple such as the Newsela corpus (Xu et al., 2015), but sentences, different vocabulary usage, and removal they are rare and come with restrictive licenses that of content from more complex sentences. hinder reproducibility and widespread usage. Simplification in other languages has been Previous unsupervised paraphrasing research explored in Brazilian Portuguese (Aluísio et al., has aligned sentences from various parallel corpora 2008), Spanish (Saggion et al., 2015; Štajner et al., (Barzilay and Lee, 2003) with multiple objective 2015), Italian (Brunato et al., 2015; Tonelli et al., functions (Liu et al., 2019). Bilingual pivoting re- 2016), Japanese (Goto et al., 2015; Kajiwara and lied on MT datasets to create large databases of Komachi, 2018; Katsuta and Yamamoto, 2019), word-level paraphrases (Pavlick et al., 2015), lex- and French (Gala et al., 2020), but the lack of a ical simplifications (Pavlick and Callison-Burch, large labeled parallel corpora has slowed research 2016; Kriz et al., 2018), or sentence-level para- down. In this work, we show that a method trained phrase corpora (Wieting and Gimpel, 2018). This on automatically mined corpora can reach state-of- has not been applied to multiple languages or to the-art results in each language. sentence-level simplification. Additionally, we use 3https://github.com/facebookresearch/ raw monolingual data to create our paraphrase cor- muss pora instead of relying on parallel MT datasets. Type # Sequence # Avg. Tokens out noisy text. Pairs per Sequence WikiLarge Labeled Parallel 296,402 original: 21.7 Creating a Sequence Index Using Embeddings (English) Simplifications simple: 16.0 Newsela Labeled Parallel 94,206 original: 23.4 To automatically mine our paraphrase corpora, we (English) Simplifications simple: 14.2 first compute n-dimensional embeddings for each English Mined 1,194,945 22.3 extracted sequence using LASER (Artetxe and French Mined 1,360,422 18.7 Spanish Mined 996,609 22.8 Schwenk, 2019b). LASER provides joint multi- lingual sentence embeddings in 93 languages that Table 1: Statistics on our mined paraphrase training have been successfully applied to the task of bilin- corpora compared to standard simplification datasets gual bitext mining (Schwenk et al., 2019). In this (see section 4.3 for more details). work, we show that LASER can also be used to mine monolingual paraphrase datasets. 3 Method Mining Paraphrases After computing the em- In this section we describe MUSS, our approach beddings for each language, we index them for fast to mining paraphrase data and training controllable nearest neighbor search using faiss. simplification models on paraphrases. Each of these sequences is then used as a query qi against the billion-scale index that returns a set 3.1 Mining Paraphrases in Many Languages of top-8 nearest neighbor sequences according to Sequence Extraction Simplification consists of the semantic LASER embedding space using L2 multiple rewriting operations, some of which span distance, resulting in a set of candidate paraphrases multiple sentences (e.g. sentence splitting or fu- are {ci,1, . . . , ci,8}. sion). To allow such operations to be represented We then use an upper bound on L2 distance and in our paraphrase data, we extract chunks of texts a margin criterion following (Artetxe and Schwenk, composed of multiple sentences, we refer to these 2019a) to filter out sequences with low similarity. small pieces of text by sequences. We refer the reader to Appendix Section A.1 for We extract such sequences by first tok- technical details. enizing a document into individual sentences The resulting nearest neighbors constitute a {s1, s2, . . . , sn} using the NLTK sentence set of aligned paraphrases of the query sequence: tokenizer (Bird and Loper, 2004). We then {(qi, ci,1),..., (qi, ci,j)}. We finally apply poor extract sequences of adjacent sentences with alignment filters. We remove sequences that are maximum length of 300 characters: e.g. almost identical with character-level Levenshtein ≤ {[s1], [s1, s2], [s1, . . . , sk], [s2], [s2, s3], ...}. distance 20%, when they are contained in one We can thus align two sequences that contain another, or when they were extracted from two a different number of sentences, and represent overlapping sliding windows of the same original sentence splitting or sentence fusion operations. document. These sequences are further filtered to remove We report statistics of the mined corpora in En- noisy sequences with more than 10% punctuation glish, French and Spanish in Table1, and qualita- characters and sequences with low language model tive examples of the resulting mined paraphrases in probability according to a 3-gram Kneser-Ney lan- Appendix Table6. Models trained on the resulting guage model trained with kenlm (Heafield, 2011) mined paraphrases obtain similar performance than on Wikipedia. models trained on existing paraphrase datasets (cf. We extract these sequences from CCNet (Wen- Appendix SectionD). zek et al., 2019), an extraction of Common Crawl 3.2 Simplifying with ACCESS (an open source snapshot of the web) that has been split into different languages using lan- In this section we describe how we adapt ACCESS guage identification (Joulin et al., 2017) and vari- (Martin et al., 2020) to train controllable models ous language modeling filtering techniques to iden- on mined paraphrases, instead of labeled parallel tify high quality, clean sentences. For English and simplifications. French, we extract 1 billion sequences from CCNet. ACCESS is a method to make any seq2seq For Spanish we extract 650 millions sequences, the model controllable by conditioning on maximum for this language in CCNet after filtering simplification-specific control tokens. We apply it on our seq2seq pretrained transformer still lead to solid performance (cf. AppendixC). models based on the BART (Lewis et al., 2019) architecture (see next subsection). 3.3 Leveraging Unsupervised Pretraining

Training with Control Tokens At training time, We combine our controllable models with unsuper- the model is provided with control tokens that give vised pretraining to further extend our approach oracle information on the target sequence, such as to text simplification. For English, we finetune the amount of compression of the target sequence the pretrained generative model BART (Lewis relative to the source sequence (length control). et al., 2019) on our newly created training corpora. For example, when the target sequence is 80% of BART is a pretrained sequence-to-sequence model the length of the original sequence, we provide that can be seen as a generalization of other re- the control token. At infer- cent pretrained models such as BERT (Devlin et al., ence time we can then control the generation by 2018). For non-English, we use its multilingual selecting a given target control value. We adapt the generalization MBART (Liu et al., 2020), which original Levenshtein similarity control to only con- was pretrained on 25 languages. sider replace operations but otherwise use the same controls as Martin et al.(2020). The controls used 4 Experimental Setting are therefore character length ratio, replace-only We assess the performance of our approach on three Levenshtein similarity, aggregated word frequency languages: English, French, and Spanish. We detail ratio, and dependency tree depth ratio. For instance our experimental procedure for mining and train- we will prepend to every source in the training ing in Appendix SectionA. In all our experiments, set the following 4 control tokens with sample- we report scores on the test sets averaged over 5 specific values, so that the model learns to rely random seeds with 95% confidence intervals. on them: . We 4.1 Baselines refer the reader to the original paper (Martin et al., 2020) and Appendix A.2 for details on ACCESS In addition to comparisons with previous works, and how those control tokens are computed. we implement and hereafter describe multiple base- lines to assess the performance of our models, es- Selecting Control Values at Inference Once pecially for French and Spanish where no previous the model has been trained with oracle controls, we simplification systems are available. can adjust the control tokens to obtain the desired type of simplifications. Indeed, sentence simpli- Identity The entire original sequence is kept un- fication often depends on the context and target changed and used as the simplification. audience (Martin et al., 2020). For instance shorter sentences are more adapted to people with cogni- Truncation The original sequence is truncated tive disabilities, while using more frequent words to the first 80% words. It is a strong baseline ac- are useful to second language learners. It is there- cording to standard simplification metrics. fore important that supervised and unsupervised Pivot We use machine translation to provide a simplification systems can be adapted to different baseline for languages for which no simplification conditions: (Kumar et al., 2020) do so by choosing corpus is available. The source non-English sen- a set of operation-specific weights of their unsu- tence is translated to English, simplified with our pervised simplification model for each evaluation best supervised English simplification system, and dataset, (Surya et al., 2019) select different mod- then translated back into the source language. For els using SARI on each validation set. Similarly, French and Spanish translation, we use CCMATRIX we set the 4 control hyper-parameters of ACCESS (Schwenk et al., 2019) to train Transformer mod- using SARI on each validation set and keep them els with LayerDrop (Fan et al., 2019). We use fixed for all samples in the test set.4. These 4 con- the BART+ACCESS supervised model trained on trol hyper-parameters are intuitive and easy to inter- MINED + WikiLarge as the English simplification pret: when no validation set is available, they can model. While pivoting creates potential errors, re- also be set using prior knowledge on the task and cent improvements of translation systems on high 4Details in Appendix A.2 resource languages make this a strong baseline. English ASSET TurkCorpus Newsela SARI ↑ FKGL ↓ SARI ↑ FKGL ↓ SARI ↑ FKGL ↓

Baselines and Gold Reference Identity Baseline 20.73 10.02 26.29 10.02 12.24 8.82 Truncate Baseline 29.85 7.91 33.10 7.91 25.49 6.68 Gold Reference 44.87±0.36 6.49±0.15 40.04±0.30 8.77±0.08 ——

Unsupervised Systems

BTRLTS (Zhao et al., 2020) 33.95 7.59 33.09 8.39 37.22 3.80 UNTS (Surya et al., 2019) 35.19 7.60 36.29 7.60 —— RM+EX+LS+RO (Kumar et al., 2020) 36.67 7.33 37.27 7.33 38.33 2.98

MUSS 42.65±0.23 8.23±0.62 40.85±0.15 8.79±0.30 38.09±0.59 5.12±0.47

Supervised Systems

EditNTS (Dong et al., 2019) 34.95 8.38 37.66 8.38 39.30 3.90 Dress-LS (Zhang and Lapata, 2017) 36.59 7.66 36.97 7.66 38.00 4.90 DMASS-DCSS (Zhao et al., 2018) 38.67 7.73 39.92 7.73 —— ACCESS(Martin et al., 2020) 40.13 7.29 41.38 7.29 ——

MUSS 43.63±0.71 6.25±0.42 42.62±0.27 6.98±0.95 42.59±1.00 2.74±0.98 MUSS (+ mined data) 44.15±0.56 6.05±0.51 42.53±0.36 7.60±1.06 41.17±0.95 2.70±1.00

Table 2: Unsupervised and Supervised Sentence Simplification for English. We display SARI and FKGL on ASSET, TurkCorpus and Newsela test sets for English. Supervised models are trained on WikiLarge for the first two test sets, and Newsela for the last. Best SARI scores within confidence intervals are in bold.

Gold Reference We report gold reference scores French Spanish for ASSET and TurkCorpus as multiple references Baselines SARI ↑ SARI ↑ are available. We compute scores in a leave-one- Identity 26.16 16.99 out scenario where each reference is evaluated Truncate 33.44 27.34 Pivot 33.48±0.37 36.19±0.34 against all others. The scores are then averaged MUSS† 41.73±0.67 35.67±0.46 over all references.

4.2 Evaluation Metrics Table 3: Unsupervised Sentence Simplification in French and Spanish. We display SARI scores in We evaluate with the standard metrics SARI and French (ALECTOR) and Spanish (Newsela). Best FKGL. We report BLEU (Papineni et al., 2002) SARI scores within confidence intervals are in bold. only in Appendix Table 10 due its dubious suitabil- †MBART+ACCESS model. ity for sentence simplification (Sulem et al., 2018).

SARI Sentence simplification is commonly eval- lengths and word lengths. FKGL was designed uated with SARI (Xu et al., 2016), which compares to be used on English texts only, we do not report model-generated simplifications with the source se- it on French and Spanish. quence and gold references. It averages F1 scores for addition, keep, and deletion operations. We 4.3 Training Data compute SARI with the EASSE simplification eval- uation suite (Alva-Manchego et al., 2019).5 For all languages we use the mined data described in Table1 as training data. We show that training FKGL We report readability scores using the with additional labeled simplification data leads to Flesch-Kincaid Grade Level (FKGL) (Kincaid even better performance for English. We use the et al., 1975), a linear combination of sentence labeled datasets WikiLarge (Zhang and Lapata, 5We use the latest version of SARI implemented in EASSE 2017) and Newsela (Xu et al., 2015). WikiLarge (Alva-Manchego et al., 2019) which fixes bugs and inconsis- is composed of 296k simplification pairs automati- tencies from the traditional implementation. We thus recom- cally aligned from English Wikipedia and Simple pute scores from previous systems that we compare to, by using the system predictions provided by the respective au- English Wikipedia. Newsela is a collection of news thors available in EASSE. articles with professional simplifications, aligned into 94k simplification pairs by Zhang and Lapata MUSS Unsupervised Results On the ASSET (2017).6 benchmark, with no labeled simplification data, MUSS obtains a +5.98 SARI improvement with 4.4 Evaluation Data respect to previous unsupervised methods, and a English We evaluate our English models on AS- +2.52 SARI improvement over the state-of-the- SET (Alva-Manchego et al., 2020a), TurkCorpus art supervised methods. For the TurkCorpus and (Xu et al., 2016) and Newsela (Xu et al., 2015). Newsela datasets, the unsupervised MUSS ap- TurkCorpus and ASSET were created using the proach achieves strong results, either outperform- same 2000 valid and 359 test source sentences. ing or closely matching both unsupervised and su- TurkCorpus contains 8 reference simplifications pervised previous works. per source sentence and ASSET contains 10 ref- When incorporating labeled data from Wiki- erences per source. ASSET is a generalization of Large and Newsela, MUSS obtains state-of-the-art TurkCorpus with a more varied set of rewriting op- results on all datasets. Using labeled data along erations, and considered simpler by human judges with mined data does not always help compared (Alva-Manchego et al., 2020a). For Newsela, we to training only with labeled data, especially with evaluate on the split from (Zhang and Lapata, the Newsela training set. Newsela is already a high 2017), which includes 1129 validation and 1077 quality dataset focused on the specific domain of test sentence pairs. news articles. It might not benefit from additional lesser quality mined data. French For French, we use the ALECTOR dataset (Gala et al., 2020) for evaluation. ALEC- Examples of Simplifications Various examples TOR is a collection of literary (tales, stories) and from our unsupervised system are shown in Table4. scientific (documentary) texts along with their man- Examining the simplifications, we see reduced sen- ual document-level simplified versions. These doc- tence length, sentence splitting, and simpler vocab- uments were extracted from material available to ulary usage. For example, the words in the town’s French primary school pupils. We split the dataset western outskirts is changed into near the town and in 450 validation and 416 test sentence pairs (see aerial nests is simplified into nests in the air. We Appendix A.3 for details). also witnessed errors related factual consistency and especially with respect with named entity hal- Spanish For Spanish we use the Spanish part lucination or disappearance which would be an of Newsela (Xu et al., 2015). We use the align- interesting area of improvement for future work. ments from (Aprosio et al., 2019), composed of 2794 validation and 2795 test sentence pairs. Even 5.2 French and Spanish Simplification though sentences were aligned using the CATS Our unsupervised approach to simplification can simplification alignment tool (Štajner et al., 2018), be applied to any language. Similar to English, some alignment errors remain and automatic scores we first create a corpus of paraphrases composed should be taken with a pinch of salt. of 1.4 million sequence pairs in French and 1.0 million sequence pairs in Spanish (cf. Table1). 5 Results To incorporate multilingual pretraining, we replace 5.1 English Simplification the monolingual BART with MBART, which was We compare models trained on our mined corpus trained on 25 languages. (see Table1) with models trained on labeled simpli- We report the performance of models trained on fication datasets (WikiLarge and Newsela). We also the mined corpus in Table3. Unlike English, where compare to other state-of-the-art supervised mod- labeled parallel training data has been created using els: Dress-LS (Zhang and Lapata, 2017), DMASS- Simple English Wikipedia, no such datasets exist DCSS (Zhao et al., 2018), EditNTS (Dong et al., for French or Spanish. Similarly, no other simpli- 2019), ACCESS (Martin et al., 2020); and the fication systems are available in these languages. unsupervised models: UNTS (Surya et al., 2019), We thus compare to several baselines, namely the BTRLTS (Zhao et al., 2020), and RM+EX+LS+RO identity, truncation and the strong pivot baseline. (Kumar et al., 2020). Results are shown in Table2. Results MUSS outperforms our strongest base- 6We experimented with other alignments (wiki-auto and line by +8.25 SARI for French, while matching the newsela-auto (Jiang et al., 2020)) but with lower performance. pivot baseline performance for Spanish. Original History Landsberg prison, which is in the town’s western outskirts, was completed in 1910. Simplified The Landsberg prison, which is near the town, was built in 1910. Original The name "hornet" is used for this and related species primarily because of their habit of making aerial nests (similar to the true hornets) rather than subterranean nests. Simplified The name "hornet" is used for this and related species because they make nests in the air (like the true hornets) rather than in the ground. Original Nocturnes is an orchestral composition in three movements by the French composer Claude Debussy. Simplified Nocturnes is a piece of music for orchestra by the French composer Claude Debussy.

Table 4: Examples of Generated Simplifications. We show simplifications generated by our best unsupervised model: MUSS trained on mined data only. Bold highlights differences between original and simplified.

45 45 45 Paraphrases 40 Simplifications 40 40

35 35 35 SARI SARI 30 30 SARI 30

25 25 25

20 20 20 Seq2Seq BART+ACCESS 1K 10K 100K 1.2M (full) Seq2Seq ACCESS BART BART Number of training samples +ACCESS (a) Simplifications vs. Paraphrases (b) Large-Scale Mining (c) BART and ACCESS

Figure 1: Ablations We display averaged SARI scores on the English ASSET test set with 95% confidence inter- vals (5 runs). (a) Models trained on mined simplifications or mined paraphrases, (b) MUSS trained on varying amounts of mined data, (c) Models trained with or without BART and/or ACCESS.

Besides using state-of-the-art machine transla- point Likert scale (0-4): adequacy (is the meaning tion models, the pivot baseline relies on a strong preserved?), fluency (is the simplification fluent?) backbone simplification model that has two advan- and simplicity (is the simplification actually sim- tages compared to the French and Spanish simplifi- pler?). For each system and each language, 50 cation model. First the simplification model of the simplifications are annotated and each simplifica- pivot baseline was trained on labeled simplification tion is rated once only by a single annotator. The data from WikiLarge, which obtains +1.5 SARI in simplifications are taken from ASSET (English), English compared to training only on mined data. ALECTOR (French), and Newsela (Spanish). Second it uses the stronger monolingual BART model instead of MBART. In Appendix Table 10, Discussion Table5 displays the average ratings we can see that MBART has a small loss in perfor- along with 95% confidence intervals. Human judg- mance of 1.54 SARI compared to its monolingual ments confirm that our unsupervised and super- counterpart BART, due to the fact that it handles 25 vised MUSS models are more fluent and produce languages instead of one. Further improvements simpler outputs than previous state-of-the-art (Mar- could be achieved by using monolingual BART tin et al., 2019). They are deemed as fluent and models trained for French or Spanish, possibly out- simpler than the human simplifications from AS- performing the pivot baseline. SET test set, which indicates our model is able to reach a high level of simplicity thanks to the 5.3 Human Evaluation control mechanism. In French and Spanish, our To further validate the quality of our models, we unsupervised model performs better or similar in conduct a human evaluation in all languages ac- all aspects than the strong supervised pivot baseline cording to adequacy, fluency, and simplicity and which has been trained on labeled English simplifi- report the results in Table5. cations. In Spanish the gold reference surprisingly obtains poor human ratings, which we found to be Human Ratings Collection For human evalua- caused by errors in the automatic alignments of tion, we recruit volunteer native speakers for each source sentences with simplified sentences of the language (5 in English, 2 in French, and 2 in Span- same article, as previously highlighted in Subsec- ish). We evaluate three linguistic aspects on a 5 tion 4.4. English French Spanish Adequacy Fluency Simplicity Adequacy Fluency Simplicity Adequacy Fluency Simplicity

ACCESS(Martin et al., 2020) 3.10±0.32 3.46±0.28 1.40±0.29 ——— ——— Pivot baseline ——— 1.78±0.40 2.10±0.47 1.16±0.31 2.02±0.28 3.48±0.22 2.20±0.29 Gold Reference 3.44±0.28 3.78±0.17 1.80±0.28 3.46±0.25 3.92±0.10 1.66±0.31 2.18†±0.43 3.38±0.24 1.26†±0.36 MUSS (unsup.) 3.20±0.28 3.84±0.14 1.88±0.33 2.88±0.34 3.50±0.32 1.22±0.25 2.26±0.29 3.48±0.25 2.56±0.29 MUSS (sup.) 3.12±0.34 3.90±0.14 2.22±0.36 ——— ———

Table 5: Human Evaluation We display human ratings of adequacy, fluency and simplicity for previous work ACCESS, pivot baseline, reference human simplifications, our best unsupervised systems (BART+ACCESS for English, MBART+ACCESS for other languages), and our best supervised model for English. Scores are averaged over 50 ratings per system with 95% confidence intervals. †Low ratings of the gold reference in Spanish Newsela is due to automatic alignment errors.

5.4 Ablations Improvements from Pretraining and Control Mining Simplifications vs. Paraphrases In this We compare the respective influence of pretraining work, we mined paraphrases to train simplification BART and controllable generation ACCESS in models. This has the advantage of making fewer Figure 1c. While both BART and ACCESS bring assumptions earlier on, by keeping the mining and improvement over standard sequence-to-sequence, models as general as possible, so that they are able they work best in combination. Unlike previous to adapt to more simplification scenarios. approaches to text simplification, we use pretrain- ing to train our simplification systems. We find We also compared to directly mining simpli- that the main qualitative improvement from pre- fications using simplification heuristics to make training is increased fluency and meaning preser- sure that the target side is simpler than the source, vation. For example, in Appendix Table9, the following previous work (Kajiwara and Komachi, model trained only with ACCESS substituted cul- 2016; Surya et al., 2019). To mine a simplification turally akin with culturally much like, but when us- dataset, we followed the same paraphrase mining ing BART, it is simplified to the more fluent closely procedure of querying 1 billion sequences on an related. While models trained on mined data see index of 1 billion sequences. Out of the resulting several million sentences, pretraining methods are paraphrases, we kept only pairs that either con- typically trained on billions. Combining pretrain- tained sentence splits, reduced sequence length, or ing with controllable simplification enhances sim- simpler vocabulary (similar to how previous work plification performance by flexibly adjusting the enforce an FKGL difference). We removed the type of simplification. paraphrase constraint that enforced sentences to be different enough. We tuned these heuristics to optimize SARI on the validation set. The result- 6 Conclusion ing dataset has 2.7 million simplification pairs. In Figure 1a, we show that seq2seq models trained on mined paraphrases achieve better performance. We propose a sentence simplification approach that A similar trend exists with BART and ACCESS, does not rely on labeled parallel simplification data thus confirming that mining paraphrases can obtain thanks to controllable generation, pretraining and better performance than mining simplifications. large-scale mining of paraphrases from the web. This approach is language-agnostic and matches or How Much Mined Data Do You Need? We in- outperforms previous state-of-the-art results, even vestigate the importance of a scalable mining ap- from supervised systems that use labeled simplifi- proach that can create million-sized training cor- cation data, on three languages: English, French, pora for sentence simplification. In Figure 1b, we and Spanish. In future work, we plan to investigate analyze the performance of training our best model how to scale this approach to more languages and on English on different amounts of mined data. By types of simplification, and to apply this method to increasing the number of mined pairs, SARI dras- paraphrase generation. Another interesting direc- tically improves, indicating that efficient mining tion for future work would to examine and improve at scale is critical to performance. Unlike human- factual consistency, especially related to named created training sets, unsupervised mining allows entity hallucination or disappearance. for large datasets in multiple languages. References on Integrating Artificial Intelligence and Assistive Technology, pages 7–10. Sandra M Aluísio, Lucia Specia, Thiago AS Pardo, Er- ick G Maziero, and Renata PM Fortes. 2008. To- William Coster and David Kauchak. 2011. Learning to wards Brazilian Portuguese automatic text simplifi- simplify sentences using wikipedia. In Proceedings cation systems. In Proceedings of the eighth ACM of the workshop on monolingual text-to-text genera- symposium on Document engineering, pages 240– tion, pages 1–9. 248. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Fernando Alva-Manchego, Louis Martin, Antoine Bor- Kristina Toutanova. 2018. BERT: Pre-training of des, Carolina Scarton, Benoît Sagot, and Lucia Spe- deep bidirectional transformers for language under- cia. 2020a. ASSET: A dataset for tuning and eval- standing. arXiv preprint arXiv:1810.04805. uation of sentence simplification models with multi- ple rewriting transformations. In Procedings of ACL Yue Dong, Zichao Li, Mehdi Rezagholizadeh, and 2020, pages 4668–4679. Jackie Chi Kit Cheung. 2019. Editnts: An neural programmer-interpreter model for sentence simplifi- Fernando Alva-Manchego, Louis Martin, Carolina cation through explicit editing. In Proceedings of Scarton, and Lucia Specia. 2019. EASSE: Eas- ACL2019, pages 3393–3402. ier automatic sentence simplification evaluation. In EMNLP-IJCNLP: System Demonstrations. Richard Evans, Constantin Orasan, and Iustin Dor- nescu. 2014. An evaluation of syntactic simplifi- Fernando Alva-Manchego, Carolina Scarton, and Lu- cation rules for people with autism. In Proceed- cia Specia. 2020b. Data-Driven Sentence Simplifi- ings of the 3rd Workshop on Predicting and Improv- cation: Survey and Benchmark. Computational Lin- ing Text Readability for Target Reader Populations, guistics, pages 1–87. pages 131–140. Alessio Palmero Aprosio, Sara Tonelli, Marco Turchi, Matteo Negri, and Mattia A Di Gangi. 2019. Neural Angela Fan, Edouard Grave, and Armand Joulin. 2019. text simplification in low-resource conditions using Reducing transformer depth on demand with struc- weak supervision. In Proceedings of the Workshop tured dropout. arXiv preprint arXiv:1909.11556. on Methods for Optimizing and Evaluating Neural Núria Gala, Anaïs Tack, Ludivine Javourey-Drevet, Language Generation, pages 37–44. Thomas François, and Johannes C Ziegler. 2020. Mikel Artetxe and Holger Schwenk. 2019a. Margin- Alector: A Parallel Corpus of Simplified French based parallel corpus mining with multilingual sen- Texts with Alignments of Misreadings by Poor and tence embeddings. In Proceedings of the 57th An- Dyslexic Readers. In Proceedings of LREC 2020. nual Meeting of the Association for Computational Linguistics, pages 3197–3203, Florence, Italy. Asso- Isao Goto, Hideki Tanaka, and Tadashi Kumano. 2015. ciation for Computational Linguistics. Japanese news simplification: Task design, data set construction, and analysis of simplified text. Pro- Mikel Artetxe and Holger Schwenk. 2019b. Mas- ceedings of MT Summit XV. sively multilingual sentence embeddings for zero- shot cross-lingual transfer and beyond. Transac- Kenneth Heafield. 2011. KenLM: Faster and smaller tions of the Association for Computational Linguis- language model queries. In Proceedings of the sixth tics, 7:597–610. workshop on statistical machine translation, pages 187–197. Regina Barzilay and Lillian Lee. 2003. Learning to paraphrase: an unsupervised approach using Chao Jiang, Mounica Maddela, Wuwei Lan, Yang multiple-sequence alignment. In Proceedings of Zhong, and Wei Xu. 2020. Neural crf model for sen- NAACL-HLT 2003, pages 16–23. tence alignment in text simplification. In Proceed- ings of the 58th Annual Meeting of the Association Steven Bird and Edward Loper. 2004. NLTK: The nat- for Computational Linguistics. ural language toolkit. In Proceedings of the ACL In- teractive Poster and Demonstration Sessions, pages Armand Joulin, Edouard Grave, Piotr Bojanowski, and 214–217, Barcelona, Spain. Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of EACL 2017, pages Dominique Brunato, Felice Dell’Orletta, Giulia Ven- 427–431. turi, and Simonetta Montemagni. 2015. Design and annotation of the first Italian corpus for text simpli- Tomoyuki Kajiwara and M Komachi. 2018. Text sim- fication. In Proceedings of The 9th Linguistic Anno- plification without simplified corpora. The Journal tation Workshop, pages 31–41. of Natural Language Processing, 25:223–249.

John Carroll, Guido Minnen, Yvonne Canning, Siob- Tomoyuki Kajiwara and Mamoru Komachi. 2016. han Devlin, and John Tait. 1998. Practical simpli- Building a monolingual parallel corpus for text sim- fication of English newspaper text to assist aphasic plification using sentence similarity based on align- readers. In Proceedings of the AAAI-98 Workshop ment between word embeddings. In Proceedings of COLING 2016, the 26th International Confer- Louis Martin, Samuel Humeau, Pierre-Emmanuel ence on Computational Linguistics: Technical Pa- Mazaré, Éric de La Clergerie, Antoine Bordes, and pers, pages 1147–1158, Osaka, Japan. The COLING Benoît Sagot. 2018. Reference-less quality estima- 2016 Organizing Committee. tion of text simplification systems. In Proceedings of the 1st Workshop on Automatic Text Adaptation, Akihiro Katsuta and Kazuhide Yamamoto. 2019. Im- pages 29–38. proving text simplification by corpus expansion with unsupervised learning. In 2019 International Louis Martin, Benjamin Muller, Pedro Javier Ortiz Conference on Asian Language Processing (IALP), Suárez, Yoann Dupont, Laurent Romary, Éric Ville- pages 216–221. IEEE. monte de la Clergerie, Djamé Seddah, and Benoît Sagot. 2019. CamemBERT: a Tasty French Lan- David Kauchak. 2013. Improving text simplification guage Model. In Proceedings of ACL 2020. language modeling using unsimplified text data. In Proceedings of ACL 2013, pages 1537–1546. Louis Martin, Benoît Sagot, Éric de la Clergerie, and Antoine Bordes. 2020. Controllable sentence sim- J. Peter Kincaid, Robert P Fishburne Jr., Richard L. plification. In Proceedings of LREC 2020. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability in- Dragos Stefan Munteanu and Daniel Marcu. 2005. Im- dex, fog count and flesch reading ease formula) for proving machine translation performance by exploit- navy enlisted personnel. ing non-parallel corpora. Computational Linguis- tics, 31(4):477–504. Philipp Koehn, Francisco Guzmán, Vishrav Chaud- hary, and Juan Pino. 2019. Findings of the WMT Myle Ott, Sergey Edunov, Alexei Baevski, Angela 2019 shared task on parallel corpus filtering for low- Fan, Sam Gross, Nathan Ng, David Grangier, and resource conditions. In Proceedings of the Fourth Michael Auli. 2019. fairseq: A fast, extensible Conference on Machine Translation, pages 54–72. toolkit for sequence modeling. In Proceedings of NAACL 2019 (Demonstrations), pages 48–53, Min- Philipp Koehn, Huda Khayrallah, Kenneth Heafield, neapolis, Minnesota. and Mikel L. Forcada. 2018. Findings of the WMT 2018 shared task on parallel corpus filtering. In Pro- Gustavo H Paetzold and Lucia Specia. 2016. Unsuper- ceedings of the Third Conference on Machine Trans- vised lexical simplification for non-native speakers. lation: Shared Task Papers, pages 726–739. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. Reno Kriz, Eleni Miltsakaki, Marianna Apidianaki, Kishore Papineni, Salim Roukos, Todd Ward, and Wei- and Chris Callison-Burch. 2018. Simplification us- Jing Zhu. 2002. BLEU: a method for automatic ing paraphrases and context-based lexical substitu- evaluation of machine translation. In Proceedings tion. In Proceedings of NAACL 2018, pages 207– of ACL 2002, pages 311–318. 217. Association for Computational Linguistics. Ellie Pavlick and Chris Callison-Burch. 2016. Simple Dhruv Kumar, Lili Mou, Lukasz Golab, and Olga Vech- ppdb: A paraphrase database for simplification. In tomova. 2020. Iterative edit-based unsupervised sen- Proceedings of ACL 2016, pages 143–148. tence simplification. In Proceedings of ACL 2020, pages 7918–7928, Online. Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. Vladimir I Levenshtein. 1966. Binary codes capable 2015. Ppdb 2.0: Better paraphrase ranking, fine- of correcting deletions, insertions, and reversals. In grained entailment relations, word embeddings, and Soviet physics doklady, volume 10, pages 707–710. style classification. In Proceedings of ACL-CoLing 2015. Mike Lewis, Yinhan Liu, Naman Goyal, Mar- jan Ghazvininejad, Abdelrahman Mohamed, Omer J. Rapin and O. Teytaud. 2018. Nevergrad - A gradient- Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. free optimization platform. https://GitHub. Bart: Denoising sequence-to-sequence pre-training com/FacebookResearch/Nevergrad. for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. Luz Rello, Ricardo Baeza-Yates, Stefan Bott, and Horacio Saggion. 2013. Simplify or help?: text Xianggen Liu, Lili Mou, Fandong Meng, Hao Zhou, simplification strategies for people with dyslexia. Jie Zhou, and Sen Song. 2019. Unsupervised para- In Proceedings of the 10th International Cross- phrasing by simulated annealing. arXiv preprint Disciplinary Conference on Web Accessibility, arXiv:1909.03588. page 15. ACM.

Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Horacio Saggion, Sanja Štajner, Stefan Bott, Simon Edunov, Marjan Ghazvininejad, Mike Lewis, and Mille, Luz Rello, and Biljana Drndarevic. 2015. Luke Zettlemoyer. 2020. Multilingual denoising Making it simplext: Implementation and evaluation pre-training for neural machine translation. arXiv of a text simplification system for spanish. ACM preprint arXiv:2001.08210. Transactions on Accessible Computing, 6(4):1–36. Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Wei Xu, Chris Callison-Burch, and Courtney Napoles. Edouard Grave, and Armand Joulin. 2019. Ccma- 2015. Problems in current text simplification re- trix: Mining billions of high-quality parallel sen- search: New data can help. Transactions of the As- tences on the web. arXiv:1911.04944. sociation for Computational Linguistics, pages 283– 297. Sanja Štajner, Iacer Calixto, and Horacio Saggion. 2015. Automatic text simplification for Spanish: Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Comparative evaluation of various simplification Chen, and Chris Callison-Burch. 2016. Optimizing strategies. In Proceedings of the international con- statistical machine translation for text simplification. ference recent advances in natural language pro- Transactions of the Association for Computational cessing. Linguistics, 4:401–415.

Sanja Štajner, Marc Franco-Salvador, Paolo Rosso, and Xingxing Zhang and Mirella Lapata. 2017. Sen- Simone Paolo Ponzetto. 2018. CATS: A tool for tence simplification with deep reinforcement learn- customized alignment of text simplification corpora. ing. In Proceedings of EMNLP 2017, pages 584– In Proceedings of the Eleventh International Confer- 594, Copenhagen, Denmark. ence on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Re- Sanqiang Zhao, Rui Meng, Daqing He, Andi Saptono, sources Association (ELRA). and Bambang Parmanto. 2018. Integrating trans- former and paraphrase rules for sentence simplifica- Elior Sulem, Omri Abend, and Ari Rappoport. 2018. tion. In Proceedings of EMNLP 2018, pages 3164– Semantic structural evaluation for text simplification. 3173, Brussels, Belgium. In Proceedings of the NAACL-HLT 2018, pages 685– 696. Yanbin Zhao, Lu Chen, Zhi Chen, and Kai Yu. 2020. Semi-supervised text simplification with Sai Surya, Abhijit Mishra, Anirban Laha, Parag Jain, back-translation and asymmetric denoising autoen- and Karthik Sankaranarayanan. 2019. Unsupervised coders. In Proceedings of the Thirty-Fourth AAAI neural text simplification. In Proceedings of ACL Conference on Artificial Intelligence. 2018, pages 2058–2068. Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. Jörg Tiedemann. 2012. Parallel Data, Tools and Inter- 2010. A monolingual tree-based translation model faces in OPUS. In Proceedings of LREC 2012. for sentence simplification. In Proceedings of the 23rd International Conference on Computational Sara Tonelli, Alessio Palmero Aprosio, and Francesca Linguistics (Coling 2010), pages 1353–1361. Saltori. 2016. SIMPITIKI: a Simplification corpus for Italian. Proceedings of the Third Italian Confer- ence on Computational Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Pro- cessing Systems 30, pages 5998–6008. Guillaume Wenzek, Marie-Anne Lachaux, Alexis Con- neau, Vishrav Chaudhary, Francisco Guzman, Ar- mand Joulin, and Edouard Grave. 2019. CCnet: Ex- tracting high quality monolingual datasets from web crawl data. arXiv preprint arXiv:1911.00359. John Wieting and Kevin Gimpel. 2018. ParaNMT- 50M: Pushing the limits of paraphrastic sentence embeddings with millions of machine translations. In Proceedings of ACL 2018, pages 451–462, Mel- bourne, Australia. Kristian Woodsend and Mirella Lapata. 2011. Learn- ing to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of EMNLP 2011, pages 409–420, Edinburgh, Scot- land, UK. Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by mono- lingual machine translation. In Proceedings of ACL 2012, pages 1015–1024. Appendices A.2 Training Details A Experimental details Seq2Seq training We implement our models with fairseq (Ott et al., 2019). All our mod- In this section we describe specific details of our els are Transformers (Vaswani et al., 2017) based experimental procedure. Figure2 is a overall re- on the BART Large architecture (388M param- minder of our method presented in the main paper. eters), keeping the optimization procedure and hyper-parameters fixed to those used in the orig- inal implementation (Lewis et al., 2019)7. We either randomly initialize weights for the stan- dard sequence-to-sequence experiments or initial- ize with pretrained BART for the BART exper- iments. When initializing the weights randomly, we use a learning rate of 3.10−4 versus the orig- inal 3.10−5 when finetuning BART. For a given Figure 2: Sentence Simplification Models for Any seed, the model is trained on 8 Nvidia V100 GPUs Language without Simplification Data. Sentences during approximately 10 hours. from the web are used to create a large scale index that allows mining millions of paraphrases. Subsequently, Controllable Generation For controllable gen- we finetune pretrained models augmented with control- eration, we use the open-source ACCESS imple- lable mechanisms on the paraphrase corpora to achieve mentation (Martin et al., 2018). We use the same sentence simplification models in any language. control parameters as the original paper, namely length, Levenshtein similarity, lexical complexity, and syntactic complexity.8 A.1 Mining Details As mentioned in Section ”Simplifying with Sequence Extraction We only consider docu- ACCESS”, we select the 4 ACCESS hyper- ments from the HEAD split in CCNet— this repre- parameters using SARI on the validation set. We sents the third of the data with the best perplexity use zero-order optimization with the NEVERGRAD using a language model. library (Rapin and Teytaud, 2018). We use the One- PlusOne optimizer with a budget of 64 evaluations Paraphrase Mining We compute LASER em- (approximately 1 hour of optimization on a single beddings of dimension 1024 and reduce dimension- GPU). The hyper-parameters are contained in the ality with a 512 PCA followed by random rota- [0.2, 1.5] interval. tion. We further compress them using 8 bit scalar The 4 hyper-parameter values are then kept fixed quantization. The compressed embeddings are then for all sentences in the associated test set. stored in a faiss inverted file index with 32,768 cells (nprobe=16). These embeddings are used to Translation Model for Pivot Baseline For the ccMatrix mine pairs of paraphrases. We return the top-8 pivot baseline we train models on nearest neighbors, and keep those with L2 distance (Schwenk et al., 2019). Our models use the Trans- lower than 0.05 and relative distance compared to former architecture with 240 million parameters other top-8 nearest neighbors lower than 0.6. with LayerDrop (Fan et al., 2019). We train for 36 hours on 8 GPUs following the suggested parame- Paraphrases Filtering The resulting para- ters in Ott et al.(2019). phrases are filtered to remove almost identical Gold Reference Baseline To avoid creating a paraphrases by enforcing a case-insensitive discrepancy in terms of number of references be- character-level (Levenshtein, 1966) greater or equal to 20%. We remove 7All hyper-parameters and training commands for paraphrases that come from the same document fairseq can be found here: https://github.com/ pytorch/fairseq/blob/master/examples/ to avoid aligning sequences that overlapped each bart/README.summarization.md other in the text. We also remove paraphrases 8We modify the Levenshtein similarity parameter to only where one of the sequence is contained in the other. consider replace operations, by assigning a 0 weight to in- sertions and deletions. This change helps decorrelate the We further filter out any sequence that is present in Levenshtein similarity control token from the length control our evaluation datasets. token and produced better results in preliminary experiments. tween the gold reference scores, where we leave • ALECTOR: This dataset has to be requested one reference out, and when we evaluate the mod- from the authors (Gala et al., 2020). els with all references, we compensate by duplicat- ing one of the other references at random so that B Characteristics of the mined data the total number of references is unchanged. We show in Figure3 the distribution of different A.3 Evaluation Details surface features of our mined data versus those of SARI score computation We use the latest WikiLarge. Some examples of mined paraphrases version of SARI implemented in EASSE (Alva- are shown in Table6. Manchego et al., 2019) which fixes bugs and in- consistencies from the traditional implementation C Set ACCESS Control Parameters of SARI. As a consequence, we also recompute Without Parallel Data scores from previous systems that we compare to. In our experiments we adjusted our model to We do so by using the system predictions provided the different dataset conditions by selecting our by the respective authors, and available in EASSE. ACCESS control tokens with SARI on each val- ALECTOR Sentence-level Alignment The idation set. When no such parallel validation set ALECTOR corpus comes as source documents and exists, we show that strong performance can still their manual simplifications but not sentence-level be obtained by using prior knowledge for the given alignment is provided. Luckily, most of these downstream application. This can be done by set- documents were simplified line by line, each ting all 4 ACCESS control hyper-parameters to an line consisting of a few sentences. For each intuitive guess of the desired compression ratio. source document, we therefore align each line, To illustrate this for the considered evaluation provided it is not too long (less than 6 sentences), datasets, we first independently sample 50 source with the most appropriate line in the simplified sentences and 50 random unaligned simple sen- document, using the LASER embedding space. tences from each validation set. These two groups The resulting alignments are split into validation of non-parallel sentences are used to approximate and test by randomly sampling the documents for the character-level compression ratio between com- the validation (450 sentence pairs) and rest for test plex and simplified sentences. We do so by dividing (416 sentence pairs). the average length of the simplified sentences by the average length of the 50 source sentences. We A.4 Links to Datasets finally use this approximated compression ratio as The datasets we used are available at the following the value of all 4 ACCESS hyper-parameters. In addresses: practice, we obtain the following approximations: ASSET = 0.8, TurkCorpus = 0.95, and Newsela • CCNet: = 0.4 0.05 https://github.com/facebookresearch/ (rounded to ). Results in Table7 show that the resulting model performs very close to cc_net. when we adjust the ACCESS hyper-parameters • WikiLarge: using SARI on the complete validation set. https://github.com/XingxingZhang/ dress. D Comparing to Existing Paraphrase Datasets • ASSET: https://github.com/facebookresearch/ We compare using our mined paraphrase data asset or https://github.com/feralvam/ with existing large-scale paraphrase datasets in Ta- easse. ble8. We use PARANMT (Wieting and Gimpel, 2018), a large paraphrase dataset created using • TurkCorpus: back-translation on an existing labeled parallel ma- https://github.com/cocoxu/ chine translation dataset. We use the same 5 million simplification/ or https://github. top-scoring sentences that the authors used to train com/feralvam/easse. their sentence embeddings. Training MUSS on the • Newsela: This dataset has to be requested at mined data or on PARANMT obtains similar re- https://newsela.com/data. sults for text simplification, confirming that mining Compression levels WordRank ratio Target length Replace-only Levenshtein similarity 8 4 0.008 10 Wikilarge 6 Mined 3 8 0.006

6 4 2 0.004 density 4 1 0.002 2 2

0 0 0.000 0 0 1 2 0.8 1.0 1.2 0 100 200 300 0.4 0.6 0.8 1.0

Figure 3: Density of several text features in WikiLarge and our mined data. The WordRank ratio is a measure of lexical complexity reduction (Martin et al., 2020). Replace-only Levenshtein similarity only considers replace operations in the traditional Levenshtein similarity and assigns 0 weights to insertions and deletions.

Query For insulation, it uses foam-injected polyurethane which helps ensure the quality of the ice produced by the machine. It comes with an easy to clean air filter. Mined It has polyurethane for insulation which is foam-injected. This helps to maintain the quality of the ice it produces. The unit has an easy to clean air filter. Query Here are some useful tips and tricks to identify and manage your stress. Mined Here are some tips and remedies you can follow to manage and control your anxiety. Query As cancer cells break apart, their contents are released into the blood. Mined When brain cells die, their contents are partially spilled back into the blood in the form of debris. Query The trail is ideal for taking a short hike with small children or a longer, more rugged overnight trip. Mined It is the ideal location for a short stroll, a nature walk or a longer walk. Query Thank you for joining us, and please check out the site. Mined Thank you for calling us. Please check the website.

Table 6: Examples of Mined Paraphrases. Paraphrases, although sometimes not preserving the entire meaning, display various rewriting operations, such as lexical substitution, compression or sentence splitting.

ASSET TurkCor. Newsela E Influence of BART on Fluency Method SARI ↑ SARI ↑ SARI ↑

SARI on valid 42.65±0.23 40.85±0.15 38.09±0.59 In Table9, we present some selected samples that Approx. value 42.49±0.34 39.57±0.40 36.16±0.35 highlight the improved fluency of simplifications when using BART. Table 7: Set ACCESS Controls Wo. Parallel Data Setting ACCESS parameters of MUSS +MINED F Additional Scores model either using SARI on the validation set or using only 50 unaligned sentence pairs from the validation BLEU We report additional BLEU scores for set. All ACCESS parameters are set to the same ap- completeness. These results are displayed along proximated value: ASSET = 0.8, TurkCorpus = 0.95, with SARI and FKGL for English. These BLEU and Newsela = 0.4). scores should be carefully interpreted. They have been found to correlate poorly with human judg- paraphrase data is a viable alternative to using exist- ments of simplicity (Sulem et al., 2018). Fur- ing paraphrase datasets relying on labeled parallel thermore, the identity baseline achieves very high machine translation corpora. BLEU scores on some datasets (e.g. 92.81 on AS- SET or 99.36 on TurkCorpus), which underlines ASSET TurkCor. Newsela the weaknesses of this metric. Data SARI ↑ SARI ↑ SARI ↑ Validation Scores We report English validation MINED 42.65±0.23 40.85±0.15 38.09±0.59 PARANMT 42.50±0.33 40.50±0.16 39.11±0.88 scores to foster reproducibility in Table 11.

Table 8: Mined Data vs. ParaNMT Seq2Seq Models on Mined Data When train- We compare SARI scores of MUSS trained either on ing a Transformer sequence-to-sequence model our mined data or on PARANMT(Wieting and Gim- (Seq2Seq) on WikiLarge compared to the mined pel, 2018) on the test sets of ASSET, TurkCorpus and corpus, models trained on the mined data perform Newsela. better. It is surprising that a model trained solely on Original They are culturally akin to the coastal peoples of Papua New Guinea. ACCESS They’re culturally much like the Papua New Guinea coastal peoples. BART+ACCESS They are closely related to coastal people of Papua New Guinea Original Orton and his wife welcomed Alanna Marie Orton on July 12, 2008. ACCESS Orton and his wife had been called Alanna Marie Orton on July 12. BART+ACCESS Orton and his wife gave birth to Alanna Marie Orton on July 12, 2008. Original He settled in London, devoting himself chiefly to practical teaching. ACCESS He set up in London and made himself mainly for teaching. BART+ACCESS He settled in London and devoted himself to teaching.

Table 9: Influence of BART on Simplifications. We display some examples of generations that illustrate how BART improves the fluency and meaning preservation of generated simplifications. paraphrases achieves such good results on simpli- fication benchmarks. Previous works have shown that simplification models suffer from not mak- ing enough modifications to the source sentence and found that forcing models to rewrite the input was beneficial (Wubben et al., 2012; Martin et al., 2020). This is confirmed when investigating the F1 deletion component of SARI which is 20 points higher for the model trained on paraphrases. Data ASSET TurkCorpus Newsela Baselines and Gold Reference SARI ↑ BLEU ↑ FKGL ↓ SARI ↑ BLEU ↑ FKGL ↓ SARI ↑ BLEU ↑ FKGL ↓ Identity Baseline — 20.73 92.81 10.02 26.29 99.36 10.02 ——— Truncate Baseline — 29.85 84.94 7.91 33.10 88.82 7.91 ——— Reference — 44.87±0.36 68.95±1.33 6.49±0.15 40.04±0.30 73.56±1.18 8.77±0.08 ———

Supervised Systems (This Work)

Seq2Seq WikiLarge 32.71±1.55 88.56±1.06 8.62±0.34 35.79±0.89 90.24±2.52 8.63±0.34 22.23±1.99 21.75±0.45 8.00±0.26 MUSS WikiLarge 43.63±0.71 76.28±4.30 6.25±0.42 42.62±0.27 78.28±3.95 6.98±0.95 40.00±0.63 14.42±6.85 3.51±0.53 MUSS WikiLarge + MINED 44.15±0.56 72.98±4.27 6.05±0.51 42.53±0.36 78.17±2.20 7.60±1.06 39.50±0.42 15.52±0.99 3.19±0.49 MUSS Newsela 42.91±0.58 71.40±6.38 6.91±0.42 41.53±0.36 74.29±4.67 7.39±0.42 42.59±1.00 18.61±4.49 2.74±0.98 MUSS Newsela + MINED 41.36±0.48 78.35±2.83 6.96±0.26 40.01±0.51 83.77±1.00 8.26±0.36 41.17±0.95 16.87±4.55 2.70±1.00

Unsupervised Systems (This Work)

Seq2Seq MINED 38.03±0.63 61.76±2.19 9.41±0.07 38.06±0.47 63.70±2.43 9.43±0.07 30.36±0.71 12.98±0.32 8.85±0.13 MUSS(MBART) MINED 41.11±0.70 77.22±2.12 7.18±0.21 39.40±0.54 77.05±3.02 8.65±0.40 34.76±0.96 19.06±1.15 5.44±0.25 MUSS(BART) MINED 42.65±0.23 66.23±4.31 8.23±0.62 40.85±0.15 63.76±4.26 8.79±0.30 38.09±0.59 14.91±1.39 5.12±0.47

Table 10: Detailed English Results. We display SARI, BLEU, and FKGL on ASSET, TurkCorpus and Newsela English evaluation datasets (test sets).

Data ASSET TurkCorpus Newsela Baselines and Gold Reference SARI ↑ BLEU ↑ FKGL ↓ SARI ↑ BLEU ↑ FKGL ↓ SARI ↑ BLEU ↑ FKGL ↓ Identity Baseline — 22.53 94.44 9.49 26.96 99.27 9.49 12.00 20.69 8.77 Truncate Baseline — 29.95 86.67 7.39 32.90 89.10 7.40 24.64 18.97 6.90 Reference — 45.22±0.94 72.67±2.83 6.13±0.56 40.66±0.11 77.21±0.45 8.31±0.04 ———

Supervised Systems (This Work)

Seq2Seq WikiLarge 33.87±1.90 90.21±1.14 8.31±0.34 35.87±1.09 91.06±2.24 8.31±0.34 20.89±4.08 20.97±0.53 8.27±0.46 MUSS WikiLarge 45.58±0.28 78.85±4.44 5.61±0.31 43.26±0.42 78.39±3.08 6.73±0.38 39.66±1.80 14.82±7.17 4.64±1.85 MUSS WikiLarge + MINED 45.50±0.69 73.16±4.41 5.83±0.51 43.17±0.19 77.52±3.01 7.19±1.02 40.50±0.56 16.30±0.97 3.57±0.60 MUSS Newsela 43.91±0.10 70.06±10.05 6.47±0.29 41.94±0.21 74.03±7.51 6.99±0.53 42.36±1.32 19.18±6.03 3.20±1.01 MUSS Newsela + MINED 42.48±0.41 77.86±3.13 6.41±0.13 40.77±0.52 83.04±1.16 7.68±0.30 41.68±1.60 17.23±5.28 2.97±0.91

Unsupervised Systems (This Work)

Seq2Seq MINED 38.88±0.22 61.80±0.94 8.63±0.13 37.51±0.10 62.04±0.91 8.64±0.13 30.35±0.23 13.04±0.45 8.87±0.12 MUSS(MBART) MINED 41.68±0.72 77.11±2.02 6.56±0.21 39.60±0.44 75.64±2.85 8.04±0.40 34.59±0.59 18.19±1.26 5.76±0.22 MUSS(BART) MINED 43.01±0.23 67.65±4.32 7.75±0.53 40.61±0.18 63.56±4.30 8.28±0.18 38.07±0.22 14.43±0.97 5.40±0.41

Table 11: English Results on Validation Sets. We display SARI, BLEU, and FKGL on ASSET, TurkCorpus and Newsela English datasets (validation sets).