Unsupervised Neural Text Simplification

Sai Surya†  Abhijit Mishra‡  Anirban Laha‡  Parag Jain‡  Karthik Sankaranarayanan‡
†IIT Kharagpur, India   ‡IBM Research
[email protected]   {abhijimi,anirlaha,pajain34,kartsank}@in.ibm.com

Abstract

The paper presents a first attempt towards unsupervised neural text simplification that relies only on unlabeled text corpora. The core framework is composed of a shared encoder and a pair of attentional decoders, crucially assisted by discrimination-based losses and denoising. The framework is trained using unlabeled text collected from the en-Wikipedia dump. Our analysis (both quantitative and qualitative, involving human evaluators) on public test data shows that the proposed model can perform text simplification at both lexical and syntactic levels, competitive to existing supervised methods. It also outperforms viable unsupervised baselines. Adding a few labeled pairs helps improve the performance further.

1 Introduction

Text Simplification (TS) deals with transforming the original text into simplified variants to increase its readability and understandability. TS is an important task in computational linguistics and has numerous use-cases in the fields of education technology, targeted content creation and language learning, where producing variants of the text with varying degrees of simplicity is desired. TS systems are typically designed to simplify from two different linguistic aspects: (a) the lexical aspect, by replacing complex words in the input with simpler synonyms (Devlin, 1998; Candido Jr et al., 2009; Yatskar et al., 2010; Biran et al., 2011; Glavaš and Štajner, 2015), and (b) the syntactic aspect, by altering the inherent hierarchical structure of the sentences (Chandrasekar and Srinivas, 1997; Canning and Tait, 1999; Siddharthan, 2006; Filippova and Strube, 2008; Brouwers et al., 2014). From the perspective of sentence construction, sentence simplification can be thought of as a form of text transformation that involves three major types of operations: (a) splitting (Siddharthan, 2006; Petersen and Ostendorf, 2007; Narayan and Gardent, 2014), (b) deletion/compression (Knight and Marcu, 2002; Clarke and Lapata, 2006; Filippova and Strube, 2008; Rush et al., 2015; Filippova et al., 2015), and (c) paraphrasing (Specia, 2010; Coster and Kauchak, 2011; Wubben et al., 2012; Wang et al., 2016; Nisioi et al., 2017).

Most of the current TS systems require large-scale parallel corpora for training (except for systems like Glavaš and Štajner (2015) that perform only lexical simplification), which is a major impediment in scaling to newer languages, use-cases, domains and output styles for which such large-scale parallel data do not exist. In fact, one of the popular corpora for TS in English, the Wikipedia-SimpleWikipedia aligned dataset, has been prone to noise (mis-aligned instances) and inadequacy (i.e., instances having non-simplified targets) (Xu et al., 2015; Štajner et al., 2015), leading to noisy supervised models (Wubben et al., 2012). While creation of better datasets (such as Newsela by Xu et al. (2015)) can always help, we explore the unsupervised learning paradigm, which can potentially work with unlabeled datasets that are cheaper and easier to obtain.

At the heart of the TS problem is the need for preservation of language semantics with the goal of improving readability. From a neural-learning perspective, this entails a specially designed auto-encoder, which not only is capable of reconstructing the original input but can also introduce variations so that the auto-encoded output is a simplified version of the input. Intuitively, both of these can be learned by looking at the structure and language patterns of a large amount of non-aligned complex and simple sentences (which are much cheaper to obtain compared to aligned parallel data). These motivations form the basis of our work.

Our approach relies only on two unlabeled text corpora - one representing relatively simpler sentences than the other (which we call complex). The crux of the (unsupervised) auto-encoding framework is a shared encoder and a pair of attention-based decoders (one for each type of corpus). The encoder attempts to produce semantics-preserving representations which can be acted upon by the respective decoders (simple or complex) to generate the appropriate text output they are designed for. The framework is crucially supported by two kinds of losses: (1) an adversarial loss - to distinguish between real and fake attention context vectors for the simple decoder, and (2) a diversification loss - to distinguish between the attention context vectors of the simple decoder and the complex decoder. The first loss ensures that only the aspects of semantics that are necessary for simplification are passed to the simple decoder in the form of the attention context vectors. The second loss, on the other hand, facilitates passing different semantic aspects to the different decoders through their respective context vectors. We also employ denoising in the auto-encoding setup to enable syntactic transformations.

The framework is trained using unlabeled text collected from Wikipedia (complex) and Simple Wikipedia (simple). It attempts to perform simplification both lexically and syntactically, unlike prevalent systems which mostly target them separately. We demonstrate the competitiveness of our unsupervised framework alongside supervised skylines through both automatic evaluation metrics and human evaluation studies. We also outperform another unsupervised baseline (Artetxe et al., 2018b), first proposed for neural machine translation. Further, we demonstrate that by leveraging a small amount of labeled parallel data, performance can be improved further. Our code and a new dataset containing partitioned unlabeled sets of simple and complex sentences are publicly available1.

1 https://github.com/subramanyamdvss/UnsupNTS

2 Related Work

Text simplification has often been discussed from psychological and linguistic standpoints (L'Allier, 1980; McNamara et al., 1996; Linderholm et al., 2000). A heuristic-based system was first introduced by Chandrasekar and Srinivas (1997), which induces rules for simplification automatically extracted from annotated corpora. Canning and Tait (1999) proposed a modular system that uses NLP tools such as a morphological analyzer and a POS tagger plus heuristics to simplify the text both lexically and syntactically. Most of these systems (Siddharthan, 2014) are separately targeted towards lexical and syntactic simplification and are limited to splitting and/or truncating sentences. For paraphrasing-based simplification, data-driven approaches were proposed, such as phrase-based SMT (Specia, 2010; Štajner et al., 2015) or its variants (Coster and Kauchak, 2011; Xu et al., 2016), that combine heuristic and optimization strategies for better TS. Recently proposed TS systems are based on the neural seq2seq architecture (Bahdanau et al., 2014), modified for TS-specific operations (Wang et al., 2016; Nisioi et al., 2017). While these systems produce state-of-the-art results on the popular Wikipedia dataset (Coster and Kauchak, 2011), they may not be generalizable because of the noise and bias in the dataset (Xu et al., 2015) and overfitting. Towards this, Štajner and Nisioi (2018) showed that improved datasets and minor model changes (such as using a reduced vocabulary and enabling a copy mechanism) help obtain reasonable performance for both in-domain and cross-domain TS.

In the unsupervised paradigm, Paetzold and Specia (2016) proposed an unsupervised lexical simplification technique that replaces complex words in the input with simpler synonyms, which are extracted and disambiguated using word embeddings. However, this work, unlike ours, only addresses lexical simplification and cannot be trivially extended to other forms of simplification such as splitting and rephrasing. Other works related to style transfer (Zhang et al., 2018; Shen et al., 2017; Xu et al., 2018) typically look into the problem of sentiment transformation, are not motivated by the linguistic aspects of TS, and are hence not comparable to our work. As far as we know, ours is a first-of-its-kind end-to-end solution for unsupervised TS. At this point, though supervised solutions perform better than unsupervised ones, we believe unsupervised techniques should be further explored since they hold greater potential with regard to scalability across tasks.

Figure 1: System Architecture. Input sentences of any domain are encoded by E and decoded by Gs, Gd. Discriminator D and classifier C tune the attention vectors for simplification. L represents loss functions. The figure only reveals one layer in E, Gs and Gd for simplicity; however, the model uses two layers of GRUs (Section 3).

3 Model Description

Our system is built on the encode-attend-decode style architecture (Bahdanau et al., 2014), with both algorithmic and architectural changes applied to the standard model. An input sequence of word embeddings X = {x1, x2, . . . , xn} (obtained after a standard look-up operation on the embedding matrix) is passed through a shared encoder (E), the output representation from which is fed to two decoders (Gs, Gd) with an attention mechanism. Gs is meant to generate a simple sentence from the encoded representation, whereas Gd generates a complex sentence. A discriminator (D) and a classifier (C) are also employed adversarially to distinguish between the attention context vectors computed with respect to the two decoders. Figure 1 illustrates our system. We describe the components below.

3.1 Encode-Attend-Decode Model

Encoder E uses two layers of bi-directional GRUs (Cho et al., 2014b), and the decoders Gs, Gd have two layers of GRUs each. E extracts the hidden representations from an input sentence. The decoders output sentences sequentially, one word at a time. Each decoder step involves using global attention to create a context vector (hidden representations weighted by attention weights) as an input for the next decoder step. The attention mechanism enables the decoders to focus on different parts of the input sentence. For an input sentence X with n words, the encoder produces n hidden representations, H = {h1, h2, . . . , hn}. The context vector extracted from X by a decoder G for time-step t is represented as

A_t(X) = \sum_{i=1}^{n} a_{it} h_i    (1)

where a_{it} denotes the attention weight for the hidden representation at the i-th input position with respect to decoder step t. As there are two decoders, A_{st}(X) and A_{dt}(X) denote the context vectors computed from decoders Gs and Gd respectively for time-steps t ∈ {1, . . . , m}, with m denoting the total number of decoding steps performed2. The matrices As(X) and Ad(X) represent the sequences of respective context vectors from all time-steps.

2 For a particular X, m can differ for the two decoders.

3.2 Discriminator and Classifier

A discriminator D is employed to influence the way the decoder Gs attends to the hidden representations, which has to be different for different types of inputs to the shared encoder E (simple vs. complex). The input to D is the context vector sequence matrix As pertaining to Gs, and it produces a binary output, {1, 0}, with 1 indicating that the context vector sequence is close to a typical context vector sequence extracted from simple sentences seen in the dataset. Gs and D engage in an adversarial interplay through an adversarial loss function (see Section 4.2), analogous to GANs (Goodfellow et al., 2014), where generator and discriminator converge to a point where the distribution of the generations eventually resembles the distribution of the genuine samples. In our case, the adversarial loss tunes the context vector sequence produced by Gs from a complex sentence to ultimately resemble the context vector sequences of simple sentences in the corpora. This ensures that the resultant context vectors for Gs capture only the language signals necessary to decode a simple sentence.

A classifier (C) is introduced for diversification, to ensure that the way decoder Gs attends to the hidden representations remains different from Gd. It helps distinguish between simple and complex context vector sequences with respect to Gs and Gd respectively. The classifier thus diversifies the context vectors given as input to the different decoders. Intuitively, different linguistic signals are needed to decode a complex sentence vis-à-vis a simple one. Refer to Section 4.3 for more details. Both D and C use a CNN-based classifier analogous to Kim (2014). All layers are shared between D and C except the fully-connected layer preceding the softmax function.

3.3 Special Purpose Word-Embeddings

Pre-trained word embeddings are often seen to have a positive impact on sequence-to-sequence frameworks (Cho et al., 2014a; Qi et al., 2018). However, traditional embeddings are not good at capturing relations like synonymy (Tissier et al., 2017), which are essential for simplification. For this, our word embeddings are trained using the Dict2Vec framework3. Dict2Vec fine-tunes the embeddings with the help of an external lexicon containing weak and strong synonymy relations. The system is trained on our whole unlabeled datasets and with the seed synonymy dictionaries provided by Tissier et al. (2017). Our encoder and decoders share the same word embeddings. Moreover, the embeddings on the input side are kept static, while the decoder embeddings are updated as training progresses. Details about hyperparameters are given in Section 5.2.

3 https://github.com/tca19/dict2vec
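The embedding arrangement above (Dict2Vec-initialized vectors shared by encoder and decoders, static on the input side, trainable on the decoder side) could be wired up as follows; loading the pretrained matrix from the Dict2Vec output is assumed to have happened elsewhere.

```python
import torch
import torch.nn as nn

def build_embeddings(pretrained: torch.Tensor):
    """pretrained: (vocab_size, 300) matrix of Dict2Vec vectors.

    Returns a frozen embedding table for the encoder input side and a trainable
    copy shared by the two decoders, mirroring the setup in Section 3.3."""
    enc_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)            # kept static
    dec_emb = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)   # updated during training
    return enc_emb, dec_emb
```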

4 Training Procedure

Let S and D be sets of simple and complex sentences, respectively, drawn from large-scale unlabeled repositories. Let Xs denote a sentence sampled from the set of simple sentences S, and Xd a sentence sampled from the set of complex sentences D. Let θE denote the parameters of E, and θGs, θGd denote the parameters of Gs and Gd respectively. Also, θC and θD are the parameters of the classifier and discriminator modules, respectively. Training the model involves optimizing the above parameters with respect to the following losses and denoising, which are explained below.

4.1 Reconstruction Loss

Reconstruction loss is imposed on both the E − Gs and E − Gd paths. E − Gs is trained to reconstruct sentences from S and E − Gd is trained to reconstruct sentences from D. Let P_{E-G_s}(X) and P_{E-G_d}(X) denote the reconstruction probabilities of an input sentence X estimated by the E − Gs and E − Gd models respectively. The reconstruction loss for E − Gs and E − Gd, denoted by Lrec, is computed as follows:

L_{rec}(\theta_E, \theta_{G_s}, \theta_{G_d}) = \mathbb{E}_{X_s \sim S}[\log P_{E-G_s}(X_s)] + \mathbb{E}_{X_d \sim D}[\log P_{E-G_d}(X_d)]    (2)
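If each decoder is teacher-forced on its own input sentence, the two expectations in Eq. (2) become per-token negative log-likelihoods. A sketch under that assumption (the logits are taken to come from running E and the corresponding decoder over the batch; minimizing the returned value maximizes the log-probabilities of Eq. 2):

```python
import torch.nn.functional as F

def reconstruction_loss(logits_s, simple_tokens, logits_d, complex_tokens, pad_id=0):
    """L_rec of Eq. (2): log-likelihood of reconstructing X_s via E-G_s and X_d via E-G_d.
    logits_*: (batch, seq_len, vocab) teacher-forced decoder outputs; *_tokens: (batch, seq_len)."""
    nll_s = F.cross_entropy(logits_s.transpose(1, 2), simple_tokens, ignore_index=pad_id)
    nll_d = F.cross_entropy(logits_d.transpose(1, 2), complex_tokens, ignore_index=pad_id)
    return nll_s + nll_d
```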

4.2 Adversarial Loss

Adversarial loss is imposed upon the context vectors for Gs. The idea is that the context vectors extracted by Gs, even for a complex input sentence, should resemble the context vectors from a simple input sentence. The discriminator D is trained to distinguish the fake (complex) context vectors from the real (simple) context vectors. E − Gs is trained to perplex the discriminator D and eventually, at convergence, learns to produce real-like (simple) context vectors from complex input sentences. In practice, we observe that the adversarial loss indeed assists E − Gs in simplification by encouraging sentence shortening. Let As(·) be a sequence of context vectors as defined in Section 3.1. The adversarial losses for E − Gs, denoted by Ladv,Gs, and for the discriminator D, denoted by Ladv,D, are as follows:

L_{adv,D}(\theta_D) = \mathbb{E}_{X_s \sim S}[\log D(A_s(X_s))] + \mathbb{E}_{X_d \sim D}[\log(1 - D(A_s(X_d)))]    (3)

L_{adv,G_s}(\theta_E, \theta_{G_s}) = \mathbb{E}_{X_d \sim D}[\log D(A_s(X_d))]    (4)

4.3 Diversification Loss

Diversification loss is imposed by the classifier C on context vectors extracted by Gd from complex input sentences, in contrast with context vectors extracted by Gs from simple input sentences. This helps E − Gs learn to generate simple context vectors that are distinguishable from complex context vectors. Let As(·) and Ad(·) be sequences of context vectors as defined in Section 3.1. The losses for the classifier C, denoted by Ldiv,C, and for the model E − Gs, denoted by Ldiv,Gs, are computed as follows:

L_{div,C}(\theta_C) = \mathbb{E}_{X_s \sim S}[\log C(A_s(X_s))] + \mathbb{E}_{X_d \sim D}[\log(1 - C(A_d(X_d)))]    (5)

L_{div,G_s}(\theta_E, \theta_{G_s}) = \mathbb{E}_{X_d \sim D}[\log C(A_d(X_d))]    (6)
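Since D and C both score a sequence of context vectors, they can be sketched as Kim (2014)-style CNN classifiers. The filter widths (1-5) and 128 filters per width follow Section 5.2; everything else is an illustrative assumption — the shared-layer detail is omitted, the losses are written in negated, minimizable form, and the smoothing constant is arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextCNN(nn.Module):
    """Binary CNN classifier over a context-vector sequence A(X) of shape (batch, m, dim).
    In the actual model, D and C share all layers except the final fully-connected one."""
    def __init__(self, dim=600, n_filters=128, widths=(1, 2, 3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv1d(dim, n_filters, w) for w in widths)
        self.fc = nn.Linear(n_filters * len(widths), 1)

    def forward(self, A):                                   # A: (batch, m, dim)
        x = A.transpose(1, 2)                               # (batch, dim, m)
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return torch.sigmoid(self.fc(torch.cat(feats, dim=1))).squeeze(-1)

def adversarial_losses(D, As_simple, As_complex, eps=1e-8):
    # Eq. (3): D labels G_s context vectors from simple inputs as real (1), from complex inputs as fake (0).
    l_adv_D = -(torch.log(D(As_simple) + eps).mean() + torch.log(1 - D(As_complex) + eps).mean())
    # Eq. (4): E-G_s is rewarded when its complex-input context vectors fool D.
    l_adv_Gs = -torch.log(D(As_complex) + eps).mean()
    return l_adv_D, l_adv_Gs

def diversification_losses(C, As_simple, Ad_complex, eps=1e-8):
    # Eq. (5): C separates G_s context vectors (simple inputs) from G_d context vectors (complex inputs).
    l_div_C = -(torch.log(C(As_simple) + eps).mean() + torch.log(1 - C(Ad_complex) + eps).mean())
    # Eq. (6): the model side is pushed to make those G_d context vectors score as class 1 under C.
    l_div_Gs = -torch.log(C(Ad_complex) + eps).mean()
    return l_div_C, l_div_Gs
```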

4.4 Denoising

Denoising has proven to be helpful for learning syntactic/structural transformations from the source side to the target side (Artetxe et al., 2018b). Syntactic transformation often requires reordering the input, which the denoising procedure aims to capture. Denoising involves arbitrarily reordering the inputs and reconstructing the original (unperturbed) input from such reordered inputs. In our implementation, the source sentence is reordered by swapping words in the input sentences. The following loss function is used in denoising. Let P_{E-G_s}(X | noise(X)) and P_{E-G_d}(X | noise(X)) denote the probabilities that a perturbed input X can be reconstructed by E − Gs and E − Gd respectively. The denoising loss for the models E − Gs and E − Gd, denoted by Ldenoi(θE, θGs, θGd), is computed as follows:

L_{denoi} = \mathbb{E}_{X_s \sim S}[\log P_{E-G_s}(X_s \mid \mathrm{noise}(X_s))] + \mathbb{E}_{X_d \sim D}[\log P_{E-G_d}(X_d \mid \mathrm{noise}(X_d))]    (7)
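A minimal sketch of the noise(·) perturbation used in Eq. (7). The paper only states that the source sentence is reordered by swapping words, so the number of swaps and the adjacent-swap choice below are assumptions:

```python
import random

def noise(tokens, n_swaps=2, rng=random):
    """Perturb a tokenized sentence by swapping adjacent word pairs; the denoising
    objective of Eq. (7) trains E-G_s / E-G_d to restore the original order."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

# e.g. noise("the cat sat on the mat".split()) -> ['cat', 'the', 'sat', 'on', 'mat', 'the'] (one possibility)
```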

Figure 1 depicts the overall architecture and the losses described above; the training procedure is described in Algorithm 1. The initialization phase involves training E − Gs and E − Gd using the reconstruction and denoising losses only. Next, training of D and C happens using the respective adversarial and diversification losses. These losses are not used to update the decoders at this point. This gives the discriminator, classifier and decoders time to learn independently of each other. In the adversarial phase, the adversarial and diversification losses are introduced alongside the denoising and reconstruction losses for fine-tuning the encoder and decoders. Algorithm 1 is intended to produce the following results: (i) E − Gs should simplify its input (irrespective of whether it is simple or complex), and (ii) E − Gd should act as an auto-encoder in the complex sentence domain. The discriminator and classifier enable preserving the aspects of semantics appropriate for each of these pathways through proper modulation of the attention context vectors.

Algorithm 1 Unsupervised simplification algorithm using denoising, reconstruction, adversarial and diversification losses.
Input: simple dataset S, complex dataset D.
Initialization phase:
repeat
    Update θE, θGs, θGd using Ldenoi
    Update θE, θGs, θGd using Lrec
    Update θD, θC using Ladv,D, Ldiv,C
until the specified number of steps is completed
Adversarial phase:
repeat
    Update θE, θGs, θGd using Ldenoi
    Update θE, θGs, θGd using Ladv,Gs, Ldiv,Gs, Lrec
    Update θD, θC using Ladv,D, Ldiv,C
until the specified number of steps is completed
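The two phases of Algorithm 1 can be organized as the loop below. This is a schematic sketch only: the loss callables and optimizers are placeholders for whatever the actual implementation provides, while the step counts are the 6000/8000 figures reported in Section 5.2.

```python
def update(optimizer, loss):
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def train_algorithm1(batches, model_opt, dc_opt, L, n_init=6000, n_adv=8000):
    """L is assumed to expose callables returning scalar losses:
    L.denoi, L.rec, L.adv_D, L.div_C (for D/C) and L.adv_Gs, L.div_Gs (for E-G_s)."""
    # Initialization phase: denoising + reconstruction for E, G_s, G_d; D and C trained on the side.
    for _, (Xs, Xd) in zip(range(n_init), batches):
        update(model_opt, L.denoi(Xs, Xd))
        update(model_opt, L.rec(Xs, Xd))
        update(dc_opt, L.adv_D(Xs, Xd) + L.div_C(Xs, Xd))
    # Adversarial phase: adversarial and diversification terms join the fine-tuning of E, G_s, G_d.
    for _, (Xs, Xd) in zip(range(n_adv), batches):
        update(model_opt, L.denoi(Xs, Xd))
        update(model_opt, L.adv_Gs(Xd) + L.div_Gs(Xd) + L.rec(Xs, Xd))
        update(dc_opt, L.adv_D(Xs, Xd) + L.div_C(Xs, Xd))
```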

A key requirement for a model like ours is that the dataset used has to be partitioned into two sets, containing relatively simple and complex sentences. The rationale behind having two decoders is that while Gs will try to introduce simplified constructs (maybe at the expense of some loss of semantics), Gd will help preserve the semantics. The idea behind using the discriminator and classifier is to retain signals related to language simplicity from which Gs can construct simplified sentences. Finally, denoising helps tackle nuances of syntactic transfer in the complex-to-simple direction. We remind the readers that TS, unlike machine translation, needs complex syntactic operations such as sentence splitting, rephrasing and paraphrasing, which cannot be tackled by the losses and denoising alone. Employing additional explicit mechanisms to handle these in the pipeline is out of the scope of this paper, since we seek a prima-facie judgement of our architecture based on how much simplification knowledge can be gained just from the data.

4.5 Training with Minimal Supervision

Our system, by design, is highly data-driven and, like any other sequence-to-sequence learning based system, can also leverage labeled data. We propose a semi-supervised variant of our system that can gain additional knowledge of simplification with the help of a small amount of labeled data (in the order of a few thousand pairs). The system undergoes training following steps similar to Algorithm 1, except that it adds another step of optimizing the cross-entropy loss for both the E − Gs and E − Gd pathways by using the reference texts available in the labeled dataset. This step is carried out in the adversarial phase along with the other steps (see Algorithm 2). The cross-entropy loss is imposed on both the E − Gs and E − Gd paths using a parallel dataset (details mentioned in Section 5.1) denoted by ∆ = (Sp, Dp).
For a given parallel simplification sentence pair (Xs, Xd), let P_{E-G_s}(Xs|Xd) and P_{E-G_d}(Xd|Xs) denote the probabilities that Xs is produced from Xd by E − Gs, and that the reverse is produced by E − Gd, respectively. The cross-entropy loss for E − Gs and E − Gd, denoted by Lcross(θE, θGs, θGd), is computed as follows:

L_{cross} = \mathbb{E}_{(X_s, X_d) \sim \Delta}[\log P_{E-G_s}(X_s \mid X_d)] + \mathbb{E}_{(X_s, X_d) \sim \Delta}[\log P_{E-G_d}(X_d \mid X_s)]    (8)

Algorithm 2 Semi-supervised simplification algorithm using denoising, reconstruction, adversarial and diversification losses, followed by a cross-entropy loss using parallel data.
Input: simple dataset S, complex dataset D, parallel dataset ∆ = (Sp, Dp).
Initialization phase:
repeat
    Update θE, θGs, θGd using Ldenoi
    Update θE, θGs, θGd using Lrec
    Update θD, θC using Ladv,D, Ldiv,C
until the specified number of steps is completed
Adversarial phase:
repeat
    Update θE, θGs, θGd using Ldenoi
    Update θE, θGs, θGd using Ladv,Gs, Ldiv,Gs, Lrec
    Update θD, θC using Ladv,D, Ldiv,C
    Update θE, θGs using Lcross
    Update θE, θGd using Lcross
until the specified number of steps is completed

5 Experiment Setup

In this section we describe the dataset, architectural choices, and model hyperparameters. The implementation of the experimental setup is publicly available4.

4 https://github.com/subramanyamdvss/UnsupNTS

5.1 Dataset

For training our system, we created an unlabeled dataset of simple and complex sentences by partitioning the standard en-Wikipedia dump. Since partitioning requires a metric for measuring text simpleness, we categorize sentences based on their readability scores. For this we use the Flesch Readability Ease score (henceforth abbreviated as FE) (Flesch, 1948). Sentences with lower FE values (up to 10) are categorized as complex and sentences with FE values greater than 70 are categorized as simple5. The FE bounds are decided through trial and error via manual inspection of the categorized sentences. Table 1 shows dataset statistics. Even though the dataset was created with some level of human mediation, the manual effort is insignificant compared to that needed to create a parallel corpus.

5 FE has its shortcomings in fully judging simpleness, but we nevertheless employ it in the absence of stronger metrics.

Table 1: Statistics showing the number of sentences, average words per sentence, average FE score, and FE score limits for the complex and simple datasets used for training.

Category | #Sents | Avg. Words | Avg. FE | FE Range
Simple   | 720k   | 18.23      | 76.67   | 74.9-79.16
Complex  | 720k   | 35.03      | 7.26    | 5.66-9.93

To train the system with minimal supervision (Section 4.5), we extract 10,000 pairs of sentences from various datasets such as the Wikipedia-SimpleWikipedia dataset introduced in Hwang et al. (2015) and the Split-Rephrase dataset by Narayan et al. (2017)6. The Wikipedia-SimpleWikipedia data was filtered following Nisioi et al. (2017) and 4000 examples were randomly picked from the filtered set. From the Split-Rephrase dataset, examples containing one compound/complex sentence on the source side and two simple sentences on the target side were selected, and 6000 examples were randomly picked from the selected set. The Split-Rephrase dataset is used to promote sentence splitting in the proposed system.

6 https://github.com/shashiongithub/Split-and-Rephrase

To select and evaluate our models, we use the test and development sets7 released by Xu et al. (2016). The test set (359 sentences) and development set (2000 sentences) have 8 simplified reference sentences for each source sentence.

7 We acknowledge that other recent datasets such as Newsela could have been used for development and evaluation. Unfortunately, we could not get access to that dataset.
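The FE-based partition of Section 5.1 is straightforward to reproduce. The sketch below uses the third-party textstat package as one convenient FE implementation, which is an assumption rather than the authors' tooling; the 10/70 thresholds are the ones stated above.

```python
import textstat   # pip install textstat; any Flesch Reading Ease implementation would do

def partition_sentences(sentences, complex_max_fe=10.0, simple_min_fe=70.0):
    """Split raw Wikipedia sentences into a 'complex' pool (FE <= 10) and a
    'simple' pool (FE > 70); sentences in between are discarded."""
    simple, complex_ = [], []
    for s in sentences:
        fe = textstat.flesch_reading_ease(s)
        if fe <= complex_max_fe:
            complex_.append(s)
        elif fe > simple_min_fe:
            simple.append(s)
    return simple, complex_
```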
5.2 Hyperparameter Settings

For all the variants, we use a hidden state of size 600 and a word-embedding size of 300. Classifier C and discriminator D use convolutional layers with filter sizes from 1 to 5; 128 filters of each size are used in the CNN layers. Other training-related hyperparameters include a learning rate of 0.00012 for θE, θGs, θGd, a learning rate of 0.0005 for θD, θC, and a batch size of 36. For learning the word embeddings with Dict2Vec training, the window size is set to 5. Our experiments used at most 13 GB of GPU memory. The initialization phase and the adversarial phase took 6000 and 8000 steps in batches respectively for both the UNTS and UNTS+10K systems.

5.3 Evaluation Metrics

For automatic evaluation of our system on the test data, we used four metrics: (a) SARI, (b) BLEU, (c) FE difference, and (d) word difference, which are briefly explained below.

SARI (Xu et al., 2016) is an automatic evaluation metric designed to measure the simpleness of the generated sentences. SARI requires access to the source, predictions and references for evaluation. Computing SARI involves penalizing n-gram additions to the source which are inconsistent with the references; similarly, deletion and keep operations are penalized. The overall score is a balanced sum of all the penalties. BLEU (Papineni et al., 2002), a popular metric for evaluating generations and translations, is used to measure the correctness of the generations by measuring overlaps between the generated sentences and (multiple) references.

We also compute the average FE score difference between predictions and source in our evaluations. FE-difference measures whether the changes made by the model increase the readability ease of the generated sentence. Word difference is the average difference between the number of words in the source sentence and in the generation. It is a simple and approximate metric proposed to detect whether sentence shortening is occurring or not. Generations with fewer changes can still have high SARI and BLEU; models with such generations can be ruled out by imposing a threshold on the word-diff metric. Models with high word-diff, SARI and BLEU are picked during model selection (with validation data). Model selection also involved manually examining the quality and relevance of generations.

We carry out a qualitative analysis of our system through human evaluation. For this, the first 50 test samples were selected from the test data. Outputs of the seven systems reported in Table 2, along with the sources, are presented to two native English speakers, who provide ratings for each output: (a) Simpleness, a binary score [0-1] indicating whether the output is a simplified version of the input or not, (b) Grammaticality of the output in the range [1-5], in increasing order of fluency, and (c) a Relatedness score in the range [1-5] indicating whether the overall semantics of the input is preserved in the output or not.

5.4 Model Variants

Using our design, we propose two different variants for evaluation: (i) Unsupervised Neural TS (UNTS) with SARI as the criterion for model selection, and (ii) UNTS with minimal supervision using 10000 labeled examples (UNTS+10K). Models selected using other selection criteria such as BLEU resulted in similar and/or reduced performance (details skipped for brevity).

We carried out the following basic post-processing steps on the generated outputs. The OOV (out-of-vocabulary) words in the generations are replaced by the source words with the highest attention weights. Words repeated consecutively in the generated sentences are merged.
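The two post-processing steps of Section 5.4 amount to a single pass over each generated sentence. A sketch under assumptions: the attention matrix is taken to hold the decoder's per-step weights over source positions, and the OOV symbol is whatever unknown-word token the vocabulary uses.

```python
def postprocess(output_tokens, source_tokens, attention, oov_token="<unk>"):
    """(1) Replace each OOV token with the source word that received the highest
    attention weight at that decoding step; (2) merge consecutively repeated words.
    attention[t][i] is the weight a_{it} of source position i at decoder step t."""
    resolved = []
    for t, tok in enumerate(output_tokens):
        if tok == oov_token and t < len(attention):
            best = max(range(len(source_tokens)), key=lambda i: attention[t][i])
            tok = source_tokens[best]
        resolved.append(tok)
    return [w for k, w in enumerate(resolved) if k == 0 or w != resolved[k - 1]]
```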
5.5 Systems for Comparison

In the absence of any other direct baseline for end-to-end TS, we consider the following unsupervised baselines. We consider the unsupervised NMT framework proposed by Artetxe et al. (2018b) as a baseline. It uses techniques such as backtranslation and denoising to synthesize more training examples. To use this framework, we treated the sets of simple and complex sentences as two different languages. The same model configuration as reported by Artetxe et al. (2018b) is used. We use the term UNMT for this system.

Similar to the UNMT system, we also consider unsupervised statistical machine translation (termed USMT) proposed by Artetxe et al. (2018a), with the default parameter setting. Another system, based on the cross-alignment technique proposed by Shen et al. (2017), is also used for comparison; it was originally proposed for the task of sentiment translation. We term this system ST.

We also compare our approach with the existing unsupervised lexical simplification system LIGHTLS (Glavaš and Štajner, 2015) and with supervised systems: Neural Text Simplification or NTS (Nisioi et al., 2017), Syntax-based Machine Translation or SBMT (Xu et al., 2016), and Phrase-based SMT simplification or PBSMT (Wubben et al., 2012). All these systems are trained using the Wikipedia-SimpleWikipedia dataset (Hwang et al., 2015). The test set is the same for all of these and our models.

6 Results

Table 2 shows evaluation results of our proposed approaches along with existing supervised and unsupervised alternatives. We observe that unsupervised baselines such as UNMT and USMT often, after attaining convergence, recreate sentences similar to the inputs. This explains why they achieve higher BLEU and reduced word-difference scores. The ST system did not converge on our dataset after a significant number of epochs, which affected its performance metrics; it often produces short sentences which are simple but do not retain important phrases.

Other supervised systems such as SBMT and NTS achieve better content reduction, as shown by the SARI, BLEU and FE-diff scores; this is expected. However, it is still a good sign that the scores for the unsupervised system UNTS are not far from the supervised skylines. The higher word-diff scores for the unsupervised system also indicate that it is able to perform content reduction (a form of syntactic simplification), which is crucial to TS. This is unlike the existing unsupervised LIGHTLS system, which often replaces nouns with related non-synonymous nouns, sometimes increasing the complexity and affecting the meaning. Finally, it is worth noting that aiding the system with a very small amount of labeled data can also benefit our unsupervised pipeline, as suggested by the scores for the UNTS+10K system.

Table 2: Comparison of evaluation metrics for the proposed systems (UNTS), the unsupervised baselines (UNMT, USMT, and ST), existing supervised systems, and the unsupervised lexical simplification system LIGHTLS.

System | FE-diff | SARI | BLEU | Word-diff
UNTS+10K | 10.45 | 35.29 | 76.13 | 2.38
UNTS | 11.15 | 33.8 | 74.24 | 3.55
UNMT | 6.60 | 33.72 | 70.84 | 0.74
USMT | 13.84 | 32.11 | 87.36 | -0.01
ST | 54.38 | 14.97 | 0.73 | 5.61
NTS | 5.37 | 36.1 | 79.38 | 2.73
SBMT | 17.68 | 38.59 | 73.62 | -0.84
PBSMT | 9.14 | 34.07 | 67.79 | 2.26
LIGHTLS | 3.01 | 34.96 | 83.54 | -0.02

Table 3: Average human evaluation scores for simpleness and grammatical correctness (fluency), and semantic relatedness between the output and the input.

System | Simpleness | Fluency | Relatedness
UNTS+10K | 57% | 4.13 | 3.93
UNTS | 47% | 3.86 | 3.73
UNMT | 40% | 3.8 | 4.06
NTS | 49% | 4.13 | 3.26
SBMT | 53% | 4.26 | 4.06
PBSMT | 53% | 3.8 | 3.93
LIGHTLS | 6% | 4.2 | 3.33

In Table 3, the first column reports what percentage of the outputs are simplified versions of the input. The second and third columns present the average fluency (grammaticality) scores given by human evaluators and the semantic relatedness with the input scored through automatic means. Almost all systems are able to produce sentences that are somewhat grammatically correct and retain phrases from the input. Supervised systems like PBSMT, as expected, simplify the sentences to the maximum extent. However, our unsupervised variants have scores competitive with the supervised skylines, which is a positive sign.

Table 4 shows an anecdotal example, containing outputs from the seven systems. As can be seen, the quality of output from our unsupervised variants is far from that of the reference output. However, the attempts at performing lexical simplification (replacing the word "Nevertheless" with "However") and at simplifying multi-word phrases ("Tagore emulated numerous styles" getting translated to "Tagore replaced many styles") are quite visible and encouraging. Table 5 presents a few examples demonstrating the capabilities of our system in performing simplifications at the lexical and syntactic levels. We do observe that such operations are carried out only for a few instances in our test data. Also, our analysis in Appendix B indicates that the system can improve with the addition of more data. Results for ablations on the adversarial and diversification losses are included in Appendix A.

Table 4: Example predictions from different systems.

Input: Nevertheless , Tagore emulated numerous styles , including craftwork from northern New Ireland , Haida carvings from the west coast of Canada ( British Columbia ) , and woodcuts by Max Pechstein .
Reference: Nevertheless , Tagore copied many styles , such as crafts from northern New Ireland , Haida carvings from the west coast of Canada and wood carvings by Max Pechstein .
UNTS+10K: Nevertheless , Tagore replaced many styles , including craftwork from northern New Ireland , Haida carved from the west coast of Canada ( British Columbia ) .
UNTS: However , Tagore notably numerous styles , including craftwork from northern New Ireland , Haida carved from the west coast of Canada ( British ) .
UNMT: However , Tagore featured numerous styles including craftwork from northern New Ireland , Haida from the west coast of Canada ( British Columbia ) max by Max Pechstein .
USMT: Nevertheless , Mgr emulated numerous styles , including craftwork from northern New Ireland , Haida carvings from the west coast of Canada ( British Columbia ) , and etchings by Max Pechstein .
NTS: However , Tagore wrote many styles , including craftwork from northern New Ireland , Haida carvings from the west coast of Canada ( British Columbia ) .
SBMT: However , Tagore emulated many styles , such as craftwork in north New Ireland , Haida prints from the west coast of Canada ( British Columbia ) , and woodcuts by Max Pechstein .
PBSMT: Nevertheless , he copied many styles , from new craftwork , Haida carvings from the west coast of Canada in British Columbia and woodcuts by Max Pechstein .
LIGHTLS: However , Tagore imitated numerous styles , including craftwork from northern New Ireland , Haida sculptures from the west coast of Canada ( British Columbia ) , and engravings by Max Pechstein .

Table 5: Examples showing different types of simplification performed by the best model, UNTS+10K.

Type of Simplification: Splitting
Source: Calvin Baker is an American novelist .
Prediction: Calvin Baker is an American . American Baker is a birthplace .

Type of Simplification: Sentence Shortening
Source: During an interview , Edward Gorey mentioned that Bawden was one of his favorite artists , lamenting the fact that not many people remembered or knew about this fine artist .
Prediction: During an interview , Edward Gorey mentioned that Bawden was one of his favorite artists .

Type of Simplification: Lexical Replacement
Source: In architectural decoration Small pieces of colored and iridescent shell have been used to create mosaics and inlays , which have been used to decorate walls , furniture and boxes .
Prediction: In impressive decoration Small pieces of colored and reddish shell have been used to create statues and inlays , which have been used to decorate walls , furniture and boxes .

Table 5: Examples showing different types of simplifications performed by the best model UNTS+10K. based on a shared encoder and two decoders. A References novel training scheme is proposed which allows Mikel Artetxe, Gorka Labaka, and Eneko Agirre. the model to perform content reduction and lexi- 2018a. Unsupervised statistical machine translation. cal simplification simultaneously through our pro- In Proceedings of the 2018 Conference on Empiri- posed losses and denoising. Experiments were cal Methods in Natural Language Processing, pages 3632–3642. conducted for multiple variants of our system as well as known unsupervised baselines and super- Mikel Artetxe, Gorka Labaka, Eneko Agirre, and vised systems. Qualitative and quantitative anal- Kyunghyun Cho. 2018b. Unsupervised neural ma- ysis of the outputs for a publicly available test chine translation. In Proceedings of the Sixth Inter- national Conference on Learning Representations. data demonstrate that our models, though unsu- pervised, can perform better than or competitive Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Ben- to these baselines. In future, we would like to gio. 2014. Neural machine translation by jointly improve the system further by incorporating bet- learning to align and translate. arXiv:1409.0473. ter architectural designs and training schemes to Or Biran, Samuel Brody, and Noemie´ Elhadad. 2011. tackle complex simplification operations. Putting it simply: a context-aware approach to lexi- cal simplification. In ACL, pages 496–501. Associ- ation for Computational Linguistics.

Laetitia Brouwers, Delphine Bernhard, Anne-Laure Ligozat, and Thomas François. 2014. Syntactic sentence simplification for French. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) @ EACL 2014, pages 47–56.

Arnaldo Candido Jr, Erick Maziero, Caroline Gasperin, Thiago A. S. Pardo, Lucia Specia, and Sandra M. Aluisio. 2009. Supporting the adaptation of texts for poor literacy readers: a text simplification editor for Brazilian Portuguese. In Innovative Use of NLP for Building Educational Applications, pages 34–42. Association for Computational Linguistics.

Yvonne Canning and John Tait. 1999. Syntactic simplification of newspaper text for aphasic readers. In Customised Information Delivery, pages 6–11.

Raman Chandrasekar and Bangalore Srinivas. 1997. Automatic induction of rules for text simplification. Knowledge-Based Systems, 10(3):183–190.

Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014a. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, pages 103–111.

Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014b. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734.

James Clarke and Mirella Lapata. 2006. Models for sentence compression: A comparison across domains, training requirements and evaluation measures. In COLING, pages 377–384. Association for Computational Linguistics.

William Coster and David Kauchak. 2011. Simple English Wikipedia: a new text simplification task. In ACL, pages 665–669. Association for Computational Linguistics.

Siobhan Devlin. 1998. The use of a psycholinguistic database in the simplification of text for aphasic readers. Linguistic databases.

Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with LSTMs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368.

Katja Filippova and Michael Strube. 2008. Dependency tree based sentence compression. In INLG, pages 25–32. Association for Computational Linguistics.

Rudolph Flesch. 1948. A new readability yardstick. Journal of Applied Psychology, 32(3):221.

Goran Glavaš and Sanja Štajner. 2015. Simplifying lexical simplification: do we need simplified corpora? In ACL, volume 2, pages 63–68.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. 2015. Aligning sentences from standard Wikipedia to Simple Wikipedia. In NAACL-HLT, pages 211–217.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv:1408.5882.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91–107.

J. L'Allier. 1980. An evaluation study of a computer-based lesson that adjusts reading level by monitoring on task reader characteristics. Ph.D. Thesis.

Tracy Linderholm, Michelle Gaddy Everson, Paul Van Den Broek, Maureen Mischinski, Alex Crittenden, and Jay Samuels. 2000. Effects of causal text revisions on more- and less-skilled readers' comprehension of easy and difficult texts. Cognition and Instruction, 18(4):525–556.

Danielle S. McNamara, Eileen Kintsch, Nancy Butler Songer, and Walter Kintsch. 1996. Are good texts always better? Interactions of text coherence, background knowledge, and levels of understanding in learning from text. Cognition and Instruction, 14(1):1–43.

Shashi Narayan and Claire Gardent. 2014. Hybrid simplification using deep semantics and machine translation. In ACL, volume 1, pages 435–445.

Shashi Narayan, Claire Gardent, Shay Cohen, and Anastasia Shimorina. 2017. Split and rephrase. In EMNLP 2017: Conference on Empirical Methods in Natural Language Processing, pages 617–627.

Sergiu Nisioi, Sanja Štajner, Simone Paolo Ponzetto, and Liviu P. Dinu. 2017. Exploring neural text simplification models. In ACL, volume 2, pages 85–91.

Gustavo H. Paetzold and Lucia Specia. 2016. Unsupervised lexical simplification for non-native speakers. In AAAI, pages 3761–3767.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311–318. Association for Computational Linguistics.

Sarah E. Petersen and Mari Ostendorf. 2007. Text simplification for language learners: a corpus analysis. In Workshop on Speech and Language Technology in Education.

Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535.

Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6830–6841.

Advaith Siddharthan. 2006. Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109.

Advaith Siddharthan. 2014. A survey of research on text simplification. ITL-International Journal of Applied Linguistics, 165(2):259–298.

Lucia Specia. 2010. Translating from complex to simplified sentences. In Computational Processing of the Portuguese Language, pages 30–39. Springer.

Sanja Štajner, Hannah Bechara, and Horacio Saggion. 2015. A deeper exploration of the standard PB-SMT approach to text simplification and its evaluation. In ACL-IJCNLP, volume 2, pages 823–828.

Sanja Štajner and Sergiu Nisioi. 2018. A detailed evaluation of neural sequence-to-sequence models for in-domain and cross-domain text simplification. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).

Julien Tissier, Christophe Gravier, and Amaury Habrard. 2017. Dict2vec: Learning word embeddings using lexical dictionaries. In EMNLP, pages 254–263.

Tong Wang, Ping Chen, John Rochford, and Jipeng Qiang. 2016. Text simplification using neural machine translation. In AAAI.

Sander Wubben, Antal Van Den Bosch, and Emiel Krahmer. 2012. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, pages 1015–1024. Association for Computational Linguistics.

Jingjing Xu, Xu Sun, Qi Zeng, Xiaodong Zhang, Xuancheng Ren, Houfeng Wang, and Wenjie Li. 2018. Unpaired sentiment-to-sentiment translation: A cycled reinforcement learning approach. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 979–988.

Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3(1):283–297.

Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. TACL, 4:401–415.

Mark Yatskar, Bo Pang, Cristian Danescu-Niculescu-Mizil, and Lillian Lee. 2010. For the sake of simplicity: Unsupervised extraction of lexical simplifications from Wikipedia. In NAACL-HLT, pages 365–368. Association for Computational Linguistics.

Zhirui Zhang, Shuo Ren, Shujie Liu, Jianyong Wang, Peng Chen, Mu Li, Ming Zhou, and Enhong Chen. 2018. Style transfer as unsupervised machine translation. arXiv preprint arXiv:1808.07894.

A Ablation Studies

The following table shows the results of the proposed system with ablations on the adversarial loss (UNTS-ADV) and the diversification loss (UNTS-DIV).

System | FE-diff | SARI | BLEU | Word-diff
UNTS+10K | 10.45 | 35.29 | 76.13 | 2.38
UNTS-DIV+10K | 11.32 | 35.24 | 75.59 | 2.61
UNTS-ADV+10K | 10.32 | 35.08 | 76.19 | 2.64
UNTS | 11.15 | 33.8 | 74.24 | 3.55
UNTS-DIV | 14.15 | 34.38 | 68.65 | 3.46
UNTS-ADV | 12.13 | 34.74 | 73.21 | 2.72

Table 6: Ablation results. UNTS-ADV does not use the adversarial loss; UNTS-DIV does not use the diversification loss.

B Effects of Variation in Labeled Data Size

The following table shows the effect of labeled data size on the performance of the system. We supplied the system with 2K, 5K, and 10K pairs of complex and simple sentences. From the trained models, models with similar word-diff values are chosen for a fair comparison. Our observation is that, with increasing data, both BLEU and SARI increase.

System | FE-diff | SARI | BLEU | Word-diff
UNTS+10K | 11.65 | 35.14 | 75.71 | 3.05
UNTS+5K | 11.69 | 34.39 | 70.96 | 3.01
UNTS+2K | 11.64 | 34.17 | 72.63 | 3.26
UNTS | 11.15 | 33.8 | 74.24 | 3.55

Table 7: Effect of varying the amount of labeled data used as additional supervision when training the unsupervised system.