Text Simplification with Reinforcement Learning Using Supervised

Text Simplification with Reinforcement Learning using Supervised Rewards on Grammaticality, Meaning Preservation, and Simplicity Akifumi Nakamachiy, Tomoyuki Kajiwaraz, Yuki Arasey yGraduate School of Information Science and Technology, Osaka University zInstitute for Datability Science, Osaka University yfnakamachi.akifumi, [email protected] [email protected] Abstract Complex Simple Sentence Encoder Decoder Sentence We optimize rewards of reinforcement learning in text simplification using metrics that are highly correlated with human-perspectives. To address problems of exposure bias and loss- Grammaticality evaluation mismatch, text-to-text generation Meaning tasks employ reinforcement learning that re- Preservation wards task-specific metrics. Previous studies in text simplification employ the weighted sum Simplicity of sub-rewards from three perspectives: gram- Reward Calculator maticality, meaning preservation, and simplicity. However, the previous rewards do not Figure 1: Overview of the reinforcement learning for align with human-perspectives for these per- text simplification. spectives. In this study, we propose to use BERT regressors fine-tuned for grammaticality, meaning preservation, and simplicity as re- is evaluated as a whole sentence during inference, ward estimators to achieve text simplification it is evaluated at the token-level during training. conforming to human-perspectives. Experi- To address these problems, reinforcement learning mental results show that reinforcement learn- has been employed in text-to-text generation tasks, ing with our rewards balances meaning preservation and simplicity. Additionally, human such as machine translation (Ranzato et al., 2016) evaluation confirmed that simplified texts by and abstractive summarization (Paulus et al., 2018). our method are preferred by humans compared These studies use metrics suitable for each task, to previous studies. such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), as rewards. Although reinforcement 1 Introduction learning based text simplification models (Zhang Text simplification is one of the text-to-text gen- and Lapata, 2017; Zhao et al., 2020) have used re- eration tasks that rewrites complex sentences into wards metrics such as SARI (Xu et al., 2016) and simpler ones. Text simplification is useful for pre- FKGL (Kincaid et al., 1975), these metrics do not processing of NLP tasks such as semantic role la- align with human-perspectives, i.e., human evalu- beling (Vickrey and Koller, 2008; Woodsend and ation results (Xu et al., 2016; Sulem et al., 2018; Lapata, 2014) and machine translation (Stajnerˇ and Alva-Manchego et al., 2020). Popovic´, 2016, 2018). It also has valuable applica- In this study, we train a text simplification model tions such as assisting language learning (Inui et al., based on reinforcement learning with rewards that 2003; Petersen and Ostendorf, 2007) and helping highly agree with human-perspectives. Specifically, language-impaired readers (Carroll et al., 1999). we apply a BERT regressor (Devlin et al., 2019) There are two problems in text-to-text genera- on grammaticality, meaning preservation, and sim- tion with an encoder-decoder model: exposure bias plicity, respectively, as shown in Figure1. Exper- and loss-evaluation mismatch (Ranzato et al., 2016; iments on the Newsela dataset (Xu et al., 2015) Wiseman and Rush, 2016). The former is that the have shown that reinforcement learning with our model is not exposed to its own errors during train- rewards balances meaning preservation and sim- ing. The latter is that while the generated sentence plicity. Further, manual evaluation has shown that 153 Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 153–159 December 4 - 7, 2020. c 2020 Association for Computational Linguistics our outputs were preferred by humans compared to As automatic evaluation metrics for text simpli- previous models. fication, BLEU, SARI, and FKGL have been used; however, there has not been a consensus of stan- 2 Background: Reinforcement Learning dard metrics because of their low correlation with for Text Simplification human perspectives (Xu et al., 2016; Sulem et al., 2018; Alva-Manchego et al., 2020). Therefore, the Reinforcement learning in text-to-text generation previous studies designed rewards from the follow- tasks is performed as additional training for pre- ing three perspectives, based on the standards in trained text-to-text generation models. It is a com- manual evaluation for text simplification. mon technique to linearly interpolate a reward of reinforcement learning and the cross-entropy loss • Grammaticality: This reward assesses the to avoid misleading training because of a large ac- grammatical acceptability of the generated tion space (Ranzato et al., 2016; Zhang and Lapata, sentence Y^ . Previous studies used an neu- 2017). We first explain an attention based encoder- ral language model implemented using Long decoder model (EncDecA) (Luong et al., 2015) in short-term memory (Mikolov et al., 2010; Section 2.1 and then reinforcement learning for text Hochreiter and Schmidhuber, 1997). simplification in Section 2.2. • Meaning Preservation: This reward assesses 2.1 Encoder-Decoder Model with Attention the semantic similarity between the source sentence X and the generated sentence Y^ . Let X = (x1; ··· ; xjXj) be a source sentence and Zhang and Lapata(2017) used cosine simi- Y = (y1; ··· ; yjY j) be its reference sentence. In text simplification, the source and reference are larity of the sentence representations from a complex and simple sentences, respectively. An sequence auto-encoder (Dai and Le, 2015). encoder takes a source sentence as input, and out- Zhao et al.(2020) used cosine similarity of puts hidden states. Decoder generates a word dis- sentence representations which consists of tribution at time step t + 1 from all the encoder weighted average of word embeddings (Arora hidden states and the series of decoder hidden state et al., 2017). ^ (h1; ··· ; ht). We generate a sentence Y by sam- • Simplicity: This reward assesses the sim- pling words from the distribution at each time step. plicity of the generated sentence Y^ . Zhang The objective function for training is averaged and Lapata(2017) used SARI (X; Y; Y^ ) score, cross entropy loss of sentence pairs: while Zhao et al.(2020) used FKGL (Y^ ) score. jY j X Among different ways to conduct reinforcement LC = − log P (yt+1jy1···t;X): (1) learning, one of the standard approaches used in t=1 text simplification is directly maximizing the rewards by the REINFORCE algorithm (Williams, As the Equation (1) suggests, y1···t is given at training but not at an inference (exposure bias situation). 1992; Ranzato et al., 2016). This approach opti- In addition, cross entropy loss cannot be evaluated mizes the log probability weighted by the expected at a sentence-level (loss-evaluation mismatch). future reward as the objective function: jY j 2.2 Reinforcement Learning X LR = − r(ht) log P (yt+1jy1···t;X); (2) Similar to other text-to-text generation tasks (Ran- t=1 zato et al., 2016; Paulus et al., 2018), reinforcement where the expected future reward r(h ) is estimated learning is applied for text simplification (Zhang t using a reward estimator R(·) and a baseline esti- and Lapata, 2017; Zhao et al., 2020) to address mator b(h ) calculated from the hidden state at time the problems of exposure bias and loss-evaluation t step t. mismatch. In the reinforcement learning step, the r(h ) = R(·) − b(h ): (3) pre-trained text-to-text generation model is trained t t to increase the reward R(·). By employing a re- Following (Ranzato et al., 2016), the baseline esti- 2 ward function that takes the entire sentence Y^ into mator is optimised by minimizing kbt − R(·)k . account, the exposure bias and loss-evaluation mis- Hashimoto and Tsuruoka(2019) discussed prob- match problems are mitigated. lems in text-to-text generation by reinforcement 154 learning; the expected future reward estimation Train Validation Test is unstable due to the huge action space, which GUG 1; 518 747 754 hinders convergence. This is because the action STS-B 5; 749 1; 500 1; 379 space of text-to-text generation corresponds to the Newsela 94; 208 1; 129 1; 077 entire target vocabulary, where many words are rarely used for prediction. Therefore, previous stud- Table 1: The numbers of sentences in datasets for each ies (Wu et al., 2018; Paulus et al., 2018; Hashimoto sub-reward estimator and Tsuruoka, 2019) proposed to stabilize the training in reinforcement learning by first pre-training a model with cross-entropy loss, and then adding of a sentence. The GUG dataset consists of sen- weighted REINFORCE loss: tences written by English as the second language learners. Each sentence has four native English L = λLR + (1 − λ)LC : (4) speakers assessing grammatical acceptability on a scale of 1 to 4. We estimate the average of these 3 BERT-based Supervised Reward ratings. We propose a reward estimator R(X; Y^ ) consisting Meaning Preservation We use the STS-B 3 of sub-rewards for grammaticality RG, meaning dataset (Cer et al., 2017) for estimating the mean- preservation RM, and simplicity RS. These sub- ing preservation of sentence pairs. The STS-B rewards are combined by weighted sum with hyper dataset consists of sentence pairs from multiple parameters of δ and : sources such as news headlines and image captions. Each sentence pair is evaluated for semantic sim- ^ ^ ^ ilarity by five cloud workers on a scale of 0 to 5. R(X; Y ) = δRG(Y ) + RM(X; Y ) (5) We estimate the average of these ratings. + (1 − δ − )RS(Y^ ): Simplicity We use the Newsela dataset4 (Xu To achieve a better correlation between each sub- et al., 2015) for estimating the simplicity of a sen- reward and human perspectives, we employ BERT tence. The Newsela dataset is a parallel dataset of regressors and fine-tune them using manually an- complex and simple sentences.

Text Simplification with Reinforcement Learning Using Supervised

Survey on Reinforcement Learning for Language Processing

A Deep Reinforcement Learning Neural Network Folding Proteins

An Introduction to Deep Reinforcement Learning

Modelling Working Memory Using Deep Recurrent Reinforcement Learning

Reinforcement Learning: a Survey

An Energy Efficient Edgeai Autoencoder Accelerator For

Sparse Cooperative Q-Learning

4 Perceptron Learning

Optimal Path Routing Using Reinforcement Learning

18 Reinforcement Learning

Deep Learning and Reinforcement Learning

Deep Reinforcement Learning for Imbalanced Classification