
Text Simplification with Reinforcement Learning using Supervised Rewards on Grammaticality, Meaning Preservation, and Simplicity

Akifumi Nakamachi†, Tomoyuki Kajiwara‡, Yuki Arase†
†Graduate School of Information Science and Technology, Osaka University
‡Institute for Datability Science, Osaka University
†{nakamachi.akifumi, arase}@ist.osaka-u.ac.jp  ‡[email protected]

Abstract

We optimize rewards of reinforcement learning in text simplification using metrics that are highly correlated with human-perspectives. To address the problems of exposure bias and loss-evaluation mismatch, text-to-text generation tasks employ reinforcement learning that rewards task-specific metrics. Previous studies in text simplification employ the weighted sum of sub-rewards from three perspectives: grammaticality, meaning preservation, and simplicity. However, the previous rewards do not align with human-perspectives for these perspectives. In this study, we propose to use BERT regressors fine-tuned for grammaticality, meaning preservation, and simplicity as reward estimators to achieve text simplification conforming to human-perspectives. Experimental results show that reinforcement learning with our rewards balances meaning preservation and simplicity. Additionally, human evaluation confirmed that simplified texts by our method are preferred by humans compared to previous studies.

[Figure 1: Overview of the reinforcement learning for text simplification. An encoder-decoder model rewrites a complex sentence into a simple sentence, and a reward calculator scores the output for grammaticality, meaning preservation, and simplicity.]

1 Introduction

Text simplification is one of the text-to-text generation tasks that rewrites complex sentences into simpler ones. Text simplification is useful for pre-processing in NLP tasks such as semantic role labeling (Vickrey and Koller, 2008; Woodsend and Lapata, 2014) and machine translation (Štajner and Popović, 2016, 2018). It also has valuable applications such as assisting language learning (Inui et al., 2003; Petersen and Ostendorf, 2007) and helping language-impaired readers (Carroll et al., 1999).

There are two problems in text-to-text generation with an encoder-decoder model: exposure bias and loss-evaluation mismatch (Ranzato et al., 2016; Wiseman and Rush, 2016). The former is that the model is not exposed to its own errors during training. The latter is that while the generated sentence is evaluated as a whole sentence during inference, it is evaluated at the token level during training. To address these problems, reinforcement learning has been employed in text-to-text generation tasks such as machine translation (Ranzato et al., 2016) and abstractive summarization (Paulus et al., 2018). These studies use metrics suitable for each task, such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), as rewards. Although reinforcement learning based text simplification models (Zhang and Lapata, 2017; Zhao et al., 2020) have used reward metrics such as SARI (Xu et al., 2016) and FKGL (Kincaid et al., 1975), these metrics do not align with human-perspectives, i.e., human evaluation results (Xu et al., 2016; Sulem et al., 2018; Alva-Manchego et al., 2020).

In this study, we train a text simplification model based on reinforcement learning with rewards that highly agree with human-perspectives. Specifically, we apply a BERT regressor (Devlin et al., 2019) to grammaticality, meaning preservation, and simplicity, respectively, as shown in Figure 1. Experiments on the Newsela dataset (Xu et al., 2015) have shown that reinforcement learning with our rewards balances meaning preservation and simplicity. Further, manual evaluation has shown that our outputs were preferred by humans compared to previous models.

2 Background: Reinforcement Learning for Text Simplification

Reinforcement learning in text-to-text generation tasks is performed as additional training for pre-trained text-to-text generation models. It is a common technique to linearly interpolate a reward of reinforcement learning and the cross-entropy loss to avoid misleading training because of a large action space (Ranzato et al., 2016; Zhang and Lapata, 2017). We first explain the attention-based encoder-decoder model (EncDecA) (Luong et al., 2015) in Section 2.1 and then reinforcement learning for text simplification in Section 2.2.

2.1 Encoder-Decoder Model with Attention

Let X = (x_1, ..., x_|X|) be a source sentence and Y = (y_1, ..., y_|Y|) be its reference sentence. In text simplification, the source and reference are complex and simple sentences, respectively. An encoder takes a source sentence as input and outputs hidden states. The decoder generates a word distribution at time step t + 1 from all the encoder hidden states and the series of decoder hidden states (h_1, ..., h_t). We generate a sentence Ŷ by sampling words from the distribution at each time step. The objective function for training is the averaged cross-entropy loss of sentence pairs:

    L_C = − Σ_{t=1}^{|Y|} log P(y_{t+1} | y_{1···t}, X).    (1)

As Equation (1) suggests, y_{1···t} is given at training but not at inference (the exposure bias situation). In addition, the cross-entropy loss cannot be evaluated at the sentence level (loss-evaluation mismatch).
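As a concrete reading of Equation (1), the following is a minimal PyTorch-style sketch (ours, not the authors' code; tensor names and toy shapes are illustrative) of the teacher-forced cross-entropy objective, where the decoder is always conditioned on the gold prefix y_{1···t} — exactly the setting that gives rise to exposure bias at inference time.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(decoder_logits, reference_ids, pad_id=0):
    """Eq. (1): sum over t of -log P(y_{t+1} | y_{1..t}, X) under teacher forcing.

    decoder_logits: (T, V) scores for each target position, computed while
                    feeding the gold prefix y_{1..t} to the decoder.
    reference_ids:  (T,) gold next-token ids y_{t+1}.
    """
    return F.cross_entropy(decoder_logits, reference_ids,
                           ignore_index=pad_id, reduction="sum")

# Toy example: T = 6 target tokens, vocabulary of V = 20000 subwords.
T, V = 6, 20000
logits = torch.randn(T, V)             # would come from the attention decoder
reference = torch.randint(1, V, (T,))  # gold target tokens
loss_c = cross_entropy_loss(logits, reference)
```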

2.2 Reinforcement Learning

Similar to other text-to-text generation tasks (Ranzato et al., 2016; Paulus et al., 2018), reinforcement learning is applied to text simplification (Zhang and Lapata, 2017; Zhao et al., 2020) to address the problems of exposure bias and loss-evaluation mismatch. In the reinforcement learning step, the pre-trained text-to-text generation model is trained to increase the reward R(·). By employing a reward function that takes the entire sentence Ŷ into account, the exposure bias and loss-evaluation mismatch problems are mitigated.

As automatic evaluation metrics for text simplification, BLEU, SARI, and FKGL have been used; however, there has not been a consensus on standard metrics because of their low correlation with human perspectives (Xu et al., 2016; Sulem et al., 2018; Alva-Manchego et al., 2020). Therefore, the previous studies designed rewards from the following three perspectives, based on the standards in manual evaluation for text simplification.

• Grammaticality: This reward assesses the grammatical acceptability of the generated sentence Ŷ. Previous studies used a neural language model implemented with long short-term memory (Mikolov et al., 2010; Hochreiter and Schmidhuber, 1997).

• Meaning Preservation: This reward assesses the semantic similarity between the source sentence X and the generated sentence Ŷ. Zhang and Lapata (2017) used the cosine similarity of sentence representations from a sequence auto-encoder (Dai and Le, 2015). Zhao et al. (2020) used the cosine similarity of sentence representations built from a weighted average of word embeddings (Arora et al., 2017).

• Simplicity: This reward assesses the simplicity of the generated sentence Ŷ. Zhang and Lapata (2017) used the SARI(X, Y, Ŷ) score, while Zhao et al. (2020) used the FKGL(Ŷ) score.

Among the different ways to conduct reinforcement learning, one of the standard approaches used in text simplification is directly maximizing the rewards with the REINFORCE algorithm (Williams, 1992; Ranzato et al., 2016). This approach optimizes the log-likelihood weighted by the expected future reward as the objective function:

    L_R = − Σ_{t=1}^{|Y|} r(h_t) log P(y_{t+1} | y_{1···t}, X),    (2)

where the expected future reward r(h_t) is estimated using a reward estimator R(·) and a baseline estimator b(h_t) calculated from the hidden state at time step t:

    r(h_t) = R(·) − b(h_t).    (3)

Following Ranzato et al. (2016), the baseline estimator is optimized by minimizing ‖b_t − R(·)‖².
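A minimal PyTorch-style sketch (ours, not the authors' implementation) of Equations (2) and (3): the sampled sentence's per-step log-probabilities are weighted by the advantage R(·) − b(h_t), while the baseline is regressed toward the sentence-level reward. The baseline network, tensor names, and toy shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

def reinforce_loss(log_probs, hidden_states, sentence_reward, baseline_net):
    """Sketch of Eqs. (2)-(3): log-likelihood weighted by (reward - baseline).

    log_probs:       (T,) log P(y_{t+1} | y_{1..t}, X) for the sampled sentence
    hidden_states:   (T, H) decoder hidden states h_t
    sentence_reward: scalar R(.) for the whole sampled sentence Y_hat
    baseline_net:    small regressor predicting the expected reward from h_t
    """
    baselines = baseline_net(hidden_states).squeeze(-1)          # b(h_t), shape (T,)
    # Detached so the baseline is trained only through its own loss below.
    advantage = sentence_reward - baselines.detach()             # r(h_t) = R(.) - b(h_t)
    loss_rl = -(advantage * log_probs).sum()                     # Eq. (2)
    loss_baseline = ((baselines - sentence_reward) ** 2).mean()  # ||b_t - R(.)||^2
    return loss_rl, loss_baseline

# Toy usage: T = 5 decoding steps, H = 256 hidden units.
T, H = 5, 256
log_probs = torch.log(torch.rand(T))
hidden_states = torch.randn(T, H)
baseline_net = nn.Linear(H, 1)
loss_rl, loss_baseline = reinforce_loss(
    log_probs, hidden_states, torch.tensor(0.7), baseline_net)
```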

Hashimoto and Tsuruoka (2019) discussed problems in text-to-text generation by reinforcement learning; the expected future reward estimation is unstable due to the huge action space, which hinders convergence. This is because the action space of text-to-text generation corresponds to the entire target vocabulary, where many words are rarely used for prediction. Therefore, previous studies (Wu et al., 2018; Paulus et al., 2018; Hashimoto and Tsuruoka, 2019) proposed to stabilize the training in reinforcement learning by first pre-training a model with the cross-entropy loss and then adding the weighted REINFORCE loss:

    L = λ L_R + (1 − λ) L_C.    (4)

3 BERT-based Supervised Reward

We propose a reward estimator R(X, Ŷ) consisting of sub-rewards for grammaticality R_G, meaning preservation R_M, and simplicity R_S. These sub-rewards are combined by a weighted sum with hyper-parameters δ and ε:

    R(X, Ŷ) = δ R_G(Ŷ) + ε R_M(X, Ŷ) + (1 − δ − ε) R_S(Ŷ).    (5)

To achieve a better correlation between each sub-reward and human perspectives, we employ BERT regressors and fine-tune them using manually annotated datasets.
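The following sketch (ours, not the authors' code) shows how a combined reward along the lines of Equation (5) could be computed with three BERT regression heads from Hugging Face Transformers. Here all three models are simply initialized from bert-base-uncased with a single output unit, whereas in the paper each would first be fine-tuned as described in Section 3.1; how the differently scaled regressor outputs are normalized before being combined is not spelled out in the paper, so this sketch returns the raw weighted sum.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def load_regressor():
    # num_labels=1 gives a single-output regression head on top of BERT.
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=1)

# In practice each model would be fine-tuned on GUG / STS-B / Newsela first.
grammar_model, meaning_model, simplicity_model = (
    load_regressor(), load_regressor(), load_regressor())

@torch.no_grad()
def combined_reward(source, generated, delta=1/3, epsilon=1/3):
    """Eq. (5): delta*R_G(Y_hat) + epsilon*R_M(X, Y_hat) + (1-delta-epsilon)*R_S(Y_hat)."""
    single = tokenizer(generated, return_tensors="pt", truncation=True)
    pair = tokenizer(source, generated, return_tensors="pt", truncation=True)
    r_g = grammar_model(**single).logits.item()      # grammaticality of Y_hat
    r_m = meaning_model(**pair).logits.item()        # meaning preservation of (X, Y_hat)
    r_s = simplicity_model(**single).logits.item()   # simplicity of Y_hat
    return delta * r_g + epsilon * r_m + (1 - delta - epsilon) * r_s
```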
3.1 Implementation Details

For each sub-reward model, we fine-tuned a pre-trained bert-base-uncased model¹ from Hugging Face Transformers (Wolf et al., 2019). Dropout of 0.2 was applied to all embedding and hidden layers. All the models were optimized using the Adam optimizer (Kingma and Ba, 2015). We linearly decreased the learning rate with a warm-up in the first 1,000 training steps. The batch size was 32 sentences. We created a checkpoint for the model at every 100 steps. The training stopped after 10 epochs without improvement in the Pearson correlation measured on the validation set.

Each sub-reward estimator was fine-tuned using the following datasets. Table 1 shows the number of sentences in each dataset.

           Train    Validation   Test
GUG        1,518    747          754
STS-B      5,749    1,500        1,379
Newsela    94,208   1,129        1,077

Table 1: The numbers of sentences in the datasets for each sub-reward estimator.

Grammaticality  We use the GUG dataset² (Heilman et al., 2014) for estimating the grammaticality of a sentence. The GUG dataset consists of sentences written by learners of English as a second language. Each sentence is rated for grammatical acceptability by four native English speakers on a scale of 1 to 4. We estimate the average of these ratings.

Meaning Preservation  We use the STS-B dataset³ (Cer et al., 2017) for estimating the meaning preservation of sentence pairs. The STS-B dataset consists of sentence pairs from multiple sources such as news headlines and image captions. Each sentence pair is evaluated for semantic similarity by five crowd workers on a scale of 0 to 5. We estimate the average of these ratings.

Simplicity  We use the Newsela dataset⁴ (Xu et al., 2015) for estimating the simplicity of a sentence. The Newsela dataset is a parallel dataset of complex and simple sentences. Each sentence is assigned a U.S. elementary school reading level on a scale of 2 to 12. We follow the data split by Zhang and Lapata (2017). We estimate the grade level of a single sentence using the BERT regressor.

¹https://huggingface.co/bert-base-uncased
²https://github.com/EducationalTestingService/gug-data
³http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark
⁴https://newsela.com/data/
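A rough sketch (ours) of the fine-tuning recipe in Section 3.1 with Hugging Face Transformers: a single-output BERT regressor trained with a mean-squared-error loss, Adam, a 1,000-step linear warm-up, dropout of 0.2, and batches of 32. The tiny inline dataset, the learning-rate value, and the total number of training steps are our assumptions rather than values reported in the paper.

```python
import torch
from torch.utils.data import DataLoader
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          get_linear_schedule_with_warmup)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1,              # regression head
    hidden_dropout_prob=0.2, attention_probs_dropout_prob=0.2)

# Hypothetical training data: (sentence, average human score) pairs,
# e.g. GUG sentences with their 1-4 grammaticality ratings.
train_pairs = [("He go to school yesterday .", 2.5),
               ("She went to school yesterday .", 4.0)]

def collate(batch):
    texts, scores = zip(*batch)
    enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
    enc["labels"] = torch.tensor(scores, dtype=torch.float).unsqueeze(-1)
    return enc

loader = DataLoader(train_pairs, batch_size=32, shuffle=True, collate_fn=collate)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)          # lr is assumed
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=10000)    # total steps assumed

model.train()
for epoch in range(3):   # the paper early-stops on validation Pearson correlation
    for batch in loader:
        out = model(**batch)        # with num_labels=1, the loss is MSE
        out.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```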

3.2 Intrinsic Evaluation of Rewards

We evaluated how well our sub-reward estimators correlate with human perspectives, compared to the previous studies (Zhang and Lapata, 2017; Zhao et al., 2020).

Compared Models  We reimplemented the sub-reward estimators introduced in Section 2.2. For the R_G estimator, we used a 2-layer LSTM language model with 256 hidden dimensions and word embeddings of 300 dimensions. For the R_M estimator of Zhang and Lapata (2017), we implemented a sequence auto-encoder with bidirectional LSTMs as an encoder. For the R_M estimator of Zhao et al. (2020), we used 300-dimensional word2vec embeddings⁵ (Mikolov et al., 2013).

⁵https://code.google.com/archive/p/word2vec/

Datasets  For grammaticality and meaning preservation, we used the test sets of GUG and STS-B to evaluate the Pearson correlation with human evaluations. For simplicity, a reward estimator should be sensitive to the grade levels of sentences with the same meaning, because text simplification intends to preserve the original meaning of the input sentence. Therefore, we extracted pairs (Y1, Y2) of simplifications of the same source sentence from the Newsela dataset and evaluated the Pearson correlation between the difference of estimated simplicity and the difference of gold-standard grade levels. Note that R_S in Zhang and Lapata (2017), i.e., SARI, requires source and reference sentences. Hence, we regarded the simplest sentence of the same source as the reference. We extracted 323 sentence pairs of simplified versions of the same sentence from the Newsela test set.

Results  Table 2 shows the evaluation results of each sub-reward estimator. In all perspectives, the existing unsupervised sub-reward estimators have little or no correlation with human annotations. As expected, fine-tuning BERT for each task significantly improved the Pearson correlations.

                         G        M       S
Zhang's sub-rewards     −0.135   0.041   0.034
Zhao's sub-rewards      −0.135   0.379   0.175
Our sub-rewards          0.726   0.846   0.473

Table 2: Pearson correlation of each sub-reward estimator. Note that G, M, and S correspond to grammaticality, meaning preservation, and simplicity, respectively.
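To make the simplicity protocol above concrete, the following sketch (ours; the toy estimator and example pairs are placeholders, not data from the paper) computes the Pearson correlation between differences of predicted simplicity scores and differences of gold Newsela grade levels over pairs of simplifications of the same source sentence.

```python
from scipy.stats import pearsonr

def simplicity_correlation(pairs, estimate):
    """pairs: list of (sent1, grade1, sent2, grade2) simplifications of the same source.
    estimate: function mapping a sentence to a predicted grade level / simplicity score.
    Returns the Pearson correlation between predicted and gold grade-level differences."""
    pred_diffs = [estimate(s1) - estimate(s2) for s1, _, s2, _ in pairs]
    gold_diffs = [g1 - g2 for _, g1, _, g2 in pairs]
    r, _ = pearsonr(pred_diffs, gold_diffs)
    return r

# Toy usage with a dummy estimator (sentence length stands in for the BERT regressor).
toy_pairs = [
    ("The cat sat on the mat .", 3.0, "The feline rested upon the mat in silence .", 7.0),
    ("Dogs bark .", 2.0, "Canines vocalize loudly at night .", 8.0),
]
print(simplicity_correlation(toy_pairs, estimate=lambda s: len(s.split())))
```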
4 End-to-End Evaluation on Text Simplification

In this section, we evaluate our rewards on an end-to-end text simplification task using the Newsela dataset⁶ shown in Table 1.

4.1 Baseline Encoder-Decoder Model

We implemented and pre-trained the EncDecA model as a common base to which we add reinforcement learning with our rewards and with those of the previous studies (Zhang and Lapata, 2017; Zhao et al., 2020). The EncDecA model has a 2-layer LSTM with 256 hidden dimensions for both the encoder and decoder, and an attention mechanism implemented by a multi-layer perceptron with a layer size of 256. It has word embedding layers of 300 dimensions tying the source, target, and output layer weight matrices. Dropout of 0.2 was applied to all embeddings and hidden layers. We used byte-pair encoding⁷ (Sennrich et al., 2016) to limit the vocabulary size to 20,000 in addition to the pre-processing by Zhang and Lapata (2017).

The EncDecA model was pre-trained with the cross-entropy loss and the Adam optimizer ahead of reinforcement learning. The batch size was 32 sentences. We created a checkpoint for the model at every 100 steps. In the pre-training, training was stopped after 10 epochs without improvement of the SARI score measured on the validation set. However, as SARI is not stable at the beginning of training, we ignored checkpoints whose BLEU scores measured on the validation set were less than 21, as suggested by Vu et al. (2018).

⁶We did not experiment with the Simple English Wikipedia because it does not have a detailed, difficulty-by-difficulty rewrite.
⁷https://github.com/google/sentencepiece

4.2 Hyper-Parameter Settings

In reinforcement learning, the hyper-parameter λ in Equation (4) was initialized to 0.1 and linearly increased at each iteration up to 0.9 during the first 10 epochs to stabilize the training process. Following Zhang and Lapata (2017), we trained the reinforcement learning models with a stochastic gradient descent optimizer with a learning rate of 0.001 and a momentum term of 0.9. Additionally, we trained the baseline estimator with the Adam optimizer with a learning rate of 0.001. In the reinforcement learning, training was stopped after 10 epochs without improvement on the rewards measured on the validation set.

We set equal weights to our sub-rewards, i.e., assigned 1/3 to δ and ε in Equation (5), respectively. Tuning of these weights is our future work.
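A small sketch (ours) of the interpolation in Equation (4) with the λ schedule described above; the paper increases λ per iteration, while this illustration updates it once per epoch for brevity.

```python
def lambda_schedule(epoch, start=0.1, end=0.9, ramp_epochs=10):
    """Linearly increase lambda from 0.1 to 0.9 during the first 10 epochs (Section 4.2)."""
    if epoch >= ramp_epochs:
        return end
    return start + (end - start) * epoch / ramp_epochs

def mixed_loss(loss_rl, loss_ce, lam):
    """Eq. (4): L = lambda * L_R + (1 - lambda) * L_C."""
    return lam * loss_rl + (1 - lam) * loss_ce

# e.g., epoch 0 -> 0.1, epoch 5 -> 0.5, epoch 10 and later -> 0.9
for epoch in range(12):
    lam = lambda_schedule(epoch)
    # loss = mixed_loss(loss_rl, loss_ce, lam); loss.backward(); ...
```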

4.3 Results of Automatic Evaluation

The performance of each method is automatically evaluated using the EASSE toolkit by BLEU and SARI. Furthermore, we perform detailed automatic evaluations of grammaticality, meaning preservation, simplicity, and the overall quality defined in Equation (5) using our sub-reward estimators.

                        Metrics           Rewards                       Human
                        BLEU     SARI     G      M      S      O        Avg. Rank
Reference               100.0    100.0    0.909  0.585  0.708  0.734    –
EncDecA                 21.57    37.64    0.862  0.681  0.648  0.730    n/a
RL w/ Zhang's Reward    23.30    39.24    0.878  0.659  0.663  0.734    1.91
RL w/ Zhao's Reward     23.42    39.20    0.878  0.662  0.662  0.734    1.69
RL w/ Our Reward        23.14    38.70    0.878  0.678  0.653  0.736    1.45**

Table 3: Experimental results of text simplification. Note that G, M, S, and O correspond to the grammaticality, meaning preservation, simplicity, and overall rewards, respectively. ** indicates a statistically significant difference from both of the other methods (the p-values of the unpaired t-tests between our method and each of the other methods were p < 0.01).

Source                They are tired and it shows in their voices , but they 're still on the freedom highway .
Reference             Their voices sound tired .
EncDecA               They are tired and it shows in their voices , but they 're still on the freedom .
RL w/ Zhang's Reward  They are tired .
RL w/ Zhao's Reward   They are tired .
RL w/ Our Reward      They are tired and it shows in their voices .

Source                Historic architecture , crafts and music are being overwhelmed by China 's growth and its inability to effectively preserve traditions of the past .
Reference             Kite making is only part of a bigger story in China .
EncDecA               Historic architecture , crafts and music are being overwhelmed by China 's growth and its inability to effectively preserve traditions of the past .
RL w/ Zhang's Reward  Historic architecture , crafts and music are being overwhelmed by China 's growth .
RL w/ Zhao's Reward   The music of the city 's growth and music are being overwhelmed by China 's growth .
RL w/ Our Reward      Historic architecture , crafts and music are being overwhelmed by China 's growth and its actions .

Table 4: Examples of generated sentences by each simplification model.

Table 3 shows the experimental results. With both our reward and the existing rewards, reinforcement learning improved over the EncDecA baseline in both the BLEU and SARI metrics. Reinforcement learning also improved the rewards, but the EncDecA baseline was the best for meaning preservation. A trade-off relationship was observed between the rewards of meaning preservation and simplicity. This is expected because the meanings of the input and generated sentences deviate as the model replaces and deletes tokens for simplicity. Reinforcement learning based on our rewards does not improve simplicity as much as the previous methods, but it also does not worsen meaning preservation. This balance made our model achieve the highest overall reward.

Table 4 shows the sentences generated by each model. While the previous methods generate extremely simple sentences at the expense of meaning preservation, our model generates sentences with a reasonable balance between meaning preservation and simplicity.

4.4 Results of Human Evaluation

We also conducted a human evaluation using Amazon Mechanical Turk.⁸ Human evaluators rank the three sentences generated by the models based on reinforcement learning with different rewards, taking the source sentence into account. The three sentences were ranked on the basis of whether they were rewritten in a simple manner while preserving as much of the meaning of the source sentence as possible.

⁸https://www.mturk.com/

We randomly selected 100 sets of generated sentences, excluding examples for which all models generated the same sentence.⁹ To ensure the quality of the human evaluation, we employed five master workers for each example and used the 85 examples for which at least three of them gave the same ranking order.

The average ranking of each model is shown in the last column of Table 3. Our model was ranked significantly higher than the previous models, as confirmed by bootstrap testing. These results confirm that our rewards allow the model to generate simplified sentences preferred by humans.

5 Conclusion

We trained a text simplification model based on reinforcement learning with rewards that are highly correlated with human-perspectives. Experimental results showed that existing rewards employing evaluation metrics tend to generate extremely simple sentences at the expense of meaning preservation. Nevertheless, our BERT-based rewards succeeded in balancing meaning preservation and simplicity. In addition, we confirmed that human evaluators prefer our simplified sentences to those generated by the previous rewards.

In this study, we set equal weights to our sub-rewards. We plan to investigate a better weight balance of the sub-rewards in the future.

Acknowledgments

This work was supported by JSPS KAKENHI Grant Number JP20K19861.

⁹We included examples in which two models output the same sentence.

References

Fernando Alva-Manchego, Louis Martin, Antoine Bordes, Carolina Scarton, Benoît Sagot, and Lucia Specia. 2020. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4668–4679.

Sanjeev Arora, Yingyu Liang, and Tengyu Ma. 2017. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In Proceedings of the 5th International Conference on Learning Representations.

John Carroll, Guido Minnen, Darren Pearce, Yvonne Canning, Siobhan Devlin, and John Tait. 1999. Simplifying Text for Language-Impaired Readers. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics, pages 269–270.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 1–14.

Andrew M. Dai and Quoc V. Le. 2015. Semi-supervised Sequence Learning. In Proceedings of the 28th Conference on Neural Information Processing Systems, pages 3079–3087.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186.

Kazuma Hashimoto and Yoshimasa Tsuruoka. 2019. Accelerated Reinforcement Learning for Sentence Generation by Vocabulary Prediction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3115–3125.

Michael Heilman, Aoife Cahill, Nitin Madnani, Melissa Lopez, Matthew Mulholland, and Joel Tetreault. 2014. Predicting Grammaticality on an Ordinal Scale. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 174–180.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation, 9(8):1735–1780.

Kentaro Inui, Atsushi Fujita, Tetsuro Takahashi, Ryu Iida, and Tomoya Iwakura. 2003. Text Simplification for Reading Assistance: A Project Note. In Proceedings of the 2nd International Workshop on Paraphrasing, pages 9–16.

J. Peter Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. Technical report, Defence Technical Information Center (DTIC) Document.

Diederik P. Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations.

Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Workshop on Text Summarization Branches Out, pages 74–81.

Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations.

Tomas Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent Neural Network Based Language Model. In Proceedings of the 11th Annual Conference of the International Speech Communication Association, pages 1045–1048.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Romain Paulus, Caiming Xiong, and Richard Socher. 2018. A Deep Reinforced Model for Abstractive Summarization. In Proceedings of the 6th International Conference on Learning Representations.

Sarah E. Petersen and Mari Ostendorf. 2007. Text Simplification for Language Learners: A Corpus Analysis. In Proceedings of the SLaTE Workshop on Speech and Language Technology in Education, pages 69–72.

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2016. Sequence Level Training with Recurrent Neural Networks. In Proceedings of the 4th International Conference on Learning Representations.

Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725.

Sanja Štajner and Maja Popović. 2016. Can Text Simplification Help Machine Translation? In Proceedings of the 19th Annual Conference of the European Association for Machine Translation, pages 230–242.

Sanja Štajner and Maja Popović. 2018. Improving Machine Translation of English Relative Clauses with Automatic Text Simplification. In Proceedings of the 1st Workshop on Automatic Text Adaptation, pages 39–48.

Elior Sulem, Omri Abend, and Ari Rappoport. 2018. BLEU is Not Suitable for the Evaluation of Text Simplification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 738–744.

David Vickrey and Daphne Koller. 2008. Sentence Simplification for Semantic Role Labeling. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 344–352.

Tu Vu, Baotian Hu, Tsendsuren Munkhdalai, and Hong Yu. 2018. Sentence Simplification with Memory-Augmented Neural Networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 79–85.

Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Machine Learning, 8:229–256.

Sam Wiseman and Alexander M. Rush. 2016. Sequence-to-Sequence Learning as Beam-Search Optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1296–1306.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art Natural Language Processing. arXiv:1910.03771.

Kristian Woodsend and Mirella Lapata. 2014. Text Rewriting Improves Semantic Role Labeling. Journal of Artificial Intelligence Research, 51:133–164.

Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. A Study of Reinforcement Learning for Neural Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3612–3621.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in Current Text Simplification Research: New Data Can Help. Transactions of the Association for Computational Linguistics, 3:283–297.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing Statistical Machine Translation for Text Simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Xingxing Zhang and Mirella Lapata. 2017. Sentence Simplification with Deep Reinforcement Learning. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 584–594.

Yanbin Zhao, Lu Chen, Zhi Chen, and Kai Yu. 2020. Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, pages 9668–9675.