Lexical Simplification with Pretrained Encoders

The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20) Lexical Simplification with Pretrained Encoders Jipeng Qiang,1 Yun Li,1 Yi Zhu,1 Yunhao Yuan,1 Xindong Wu2,3 1Department of Computer Science, Yangzhou University, Jiangsu, China 2Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Anhui, China 3 Mininglamp Academy of Sciences, Minininglamp Technology, Beijing, China {jpqiang, liyun, zhuyi, yhyuan}@yzu.edu.cn, [email protected] Abstract Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equiv- alent meaning. Recently unsupervised lexical simplification approaches only rely on the complex word itself regardless of the given sentence to generate candidate substitutions, which will inevitably produce a large number of spurious candidates. We present a simple LS approach that makes use of the Bidirectional Encoder Representations from Transform- ers (BERT) which can consider both the given sentence and the complex word during generating candidate substitutions for the complex word. Specifically, we mask the complex word of the original sentence for feeding into the BERT to predict the masked token. The predicted results will be used as candidate substitutions. Despite being entirely unsupervised, experimental results show that our approach obtains Figure 1: Comparison of simplification candidates of com- obvious improvement compared with these baselines lever- plex words. Given one sentence ”John composed these aging linguistic databases and parallel corpus, outperforming verses.” and complex words ’composed’ and ’verses’, the the state-of-the-art by more than 12 Accuracy points on three top three simplification candidates for each complex word well-known benchmarks. are generated by our method BERT-LS and the state-of-the- art two baselines based word embeddings (Glavas(Glavaˇ sˇ ˇ 1 Introduction and Stajner 2015) and Paetzold-NE (Paetzold and Specia 2017a)). Lexical Simplification (LS) aims at replacing complex words with simpler alternatives, which can help various groups of people, including children (De Belder and Moens Wikipedia, and it is impossible to give all possible simplifi- 2010), non-native speakers (Paetzold and Specia 2016), peo- cation rules from them. ple with cognitive disabilities (Feng 2009; Saggion 2017), For avoiding the need for resources such as databases to understand text better. The popular LS systems still pre- or parallel corpora, recent work utilizes word embedding dominantly use a set of rules for substituting complex words models to extract simplification candidates for complex with their frequent synonyms from carefully handcrafted words (Glavasˇ and Stajnerˇ 2015; Paetzold and Specia 2016; databases (e.g., WordNet) or automatically induced from 2017a). Given a complex word, they extracted from the word comparable corpora (Devlin and Tait 1998; De Belder and embedding model the simplification candidates whose vec- Moens 2010). Linguistic databases like WordNet are used tors are closer in terms of cosine similarity with the complex to produce simple synonyms of a complex word. (Lesk word. This strategy achieves better results compared with 1986; Sinha 2012; Leroy et al. 2013). Parallel corpora rule-based LS systems. However, the above methods gener- like Wikipedia-Simple Wikipedia corpus were also used ated simplification candidates only considering the complex to extract complex-to-simple word correspondences (Biran, word regardless of the context of the complex word, which Brody, and Elhadad 2011; Yatskar et al. 2010; Horn, Mand- will inevitably produce a large number of spurious candi- uca, and Kauchak 2014). However, linguistic resources are dates that can confuse the systems employed in the subse- scarce or expensive to produce, such as WordNet and Simple quent steps. Copyright c 2020, Association for the Advancement of Artificial Therefore, we present an intuitive and innovative idea Intelligence (www.aaai.org). All rights reserved. completely different from existing LS systems in this paper. 8649 We exploit recent advances in the pre-trained transformer ing its meaning as much as possible, which is a very chal- language model BERT (Devlin et al. 2018) to find suitable lenging task. The popular lexical simplification (LS) ap- simplifications for complex words. The masked language proaches are rule-based, which each rule contain a com- model (MLM) used in BERT randomly masks some percent- plex word and its simple synonyms (Lesk 1986; Pavlick age of the input tokens, and predicts the masked word based and Callison-Burch 2016; Maddela and Xu 2018). In or- on its context. If masking the complex word in a sentence, der to construct rules, rule-based systems usually identified the idea in MLM is in accordance with generating the can- synonyms from WordNet for a predefined set of complex didates of the complex word in LS. Therefore, we introduce words, and selected the ”simplest” from these synonyms a novel LS approach BERT-LS that uses MLM of BERT for based on the frequency of word (Devlin and Tait 1998; simplification candidate generation. More specifically, we De Belder and Moens 2010) or length of word (Bautista et mask the complex word w of the original sentence S as a al. 2011). However, rule-based systems need a lot of human new sentence S, and we concatenate the original sequence involvement to manually define these rules, and it is impos- S and S for feeding into the BERT to obtain the probability sible to give all possible simplification rules. distribution of the vocabulary corresponding to the masked As complex and simplified parallel corpora are available, word. The advantage of our method is that it generates sim- especially, the ’ordinary’ English Wikipedia (EW) in com- plification candidates by considering the whole sentence, not bination with the ’simple’ English Wikipedia (SEW), the just the complex word. paradigm shift of LS systems is from knowledge-based to Here, we give an example shown in Figure 1 to illus- data-driven simplification (Biran, Brody, and Elhadad 2011; trate the advantage of our method BERT-LS. For com- Yatskar et al. 2010; Horn, Manduca, and Kauchak 2014). plex words ’composed’ and ’verses’ in the sentence ”John Yatskar et al. (2010) identified lexical simplifications from composed these verses.”, the top three substitution candi- the edit history of SEW. They utilized a probabilistic method dates of the two complex words generated by the LS sys- to recognize simplification edits distinguishing from other tems based on word embeddings (Glavasˇ and Stajnerˇ 2015; types of content changes. Biran et al. (2011) considered ev- Paetzold and Specia 2017a) are only related with the com- ery pair of the distinct word in the EW and SEW to be a pos- plex words itself without without paying attention to the sible simplification pair, and filtered part of them based on original sentence. The top three substitution candidates gen- morphological variants and WordNet. Horn et al. (2014) also erated by BERT-LS are not only related with the complex generated the candidate rules from the EW and SEW, and words, but also can fit for the original sentence very well. adopted a context-aware binary classifier to decide whether Then, by considering the frequency or order of each can- a candidate rule should be adopted or not in a certain context. didate, we can easily choose ’wrote’ as the replacement of The main limitation of the type of methods relies heavily on ’composed and ’poems’ as the replacement of ’verses’. In simplified corpora. this case, the simplification sentence ’John wrote these po- In order to entirely avoid the requirement of lexical re- ems.’ is more easily understood than the original sentence. sources or parallel corpora, LS systems based on word em- The contributions of our paper are as follows: beddings were proposed (Glavasˇ and Stajnerˇ 2015). They (1) BERT-LS is a novel BERT-based method for LS, extracted the top 10 words as candidate substitutions whose which can take full advantages of BERT to generate and vectors are closer in terms of cosine similarity with the com- rank substitution candidates. Compared with existing meth- plex word. Instead of a traditional word embedding model, ods, BERT-LS is easier to hold cohesion and coherence of a Paetzold and Specia (2016) adopted context-aware word em- sentence, since BERT-LS considers the whole sentence not beddings trained on a large dataset where each word is an- the complex word itself during generating candidates. notated with the POS tag. Afterward, they further extracted (2) BERT-LS is a simple, effective and unsupervised LS candidates for complex words by combining word embed- method. 1)Simple: many steps used in existing LS systems dings with WordNet and parallel corpora (Paetzold and Spe- have been eliminated from our method, e.g., morphologi- cia 2017a). cal transformation and substitution selection. 2) Effective: it After examining existing LS methods ranging from rules- obtains new state-of-the-art results on three benchmarks. 3) based to embedding-based, the major challenge is that they Unsupervised: our method cannot rely on any parallel cor- generated simplification candidates for the complex word re- pus or linguistic databases. gardless of the context of the complex word, which will in- (3) To our best knowledge, this is the first attempt to apply evitably produce a large number of spurious candidates that Pre-Trained Transformer Language Models on lexical sim- can confuse the systems employed in the subsequent steps. plification tasks. The code to reproduce our results is avail- In this paper, we will first present a BERT-based LS able at https://github.com/anonymous. approach that requires only a sufficiently large corpus of regular text without any manual efforts. Pre-training language models (Devlin et al. 2018; Lee et al. 2019; Lam- 2 Related Work ple and Conneau 2019) have attracted wide attention and Lexical simplification (LS) contains identifying complex has shown to be effective for improving many downstream words and finding the best candidate substitution for natural language processing tasks.

Load more