Unsupervised Sentence Compression using Denoising Auto-Encoders

Thibault Fevry∗   Jason Phang∗
Center for Data Science, New York University
[email protected]   [email protected]

∗ Denotes equal contribution

Abstract

In sentence compression, the task of shortening sentences while retaining the original meaning, models tend to be trained on large corpora containing pairs of verbose and compressed sentences. To remove the need for paired corpora, we emulate a summarization task and add noise to extend sentences and train a denoising auto-encoder to recover the original, constructing an end-to-end training regime without the need for any examples of compressed sentences. We conduct a human evaluation of our model on a standard text summarization dataset and show that it performs comparably to a supervised baseline based on grammatical correctness and retention of meaning. Despite being exposed to no target data, our unsupervised models learn to generate imperfect but reasonably readable sentence summaries. Although we underperform supervised models based on ROUGE scores, our models are competitive with a supervised baseline based on human evaluation for grammatical correctness and retention of meaning.

1 Introduction

Sentence compression is the task of condensing a longer sentence into a shorter one that still retains the meaning of the original. Past models for sentence compression have tended to rely heavily on strong linguistic priors such as syntactic rules or heuristics (Dorr et al., 2003; Cohn and Lapata, 2008). More recent work using deep learning involves models trained without strong linguistic priors, instead requiring large corpora consisting of pairs of longer and shorter sentences (Miao and Blunsom, 2016).

Sentence compression can also be seen as a "scaled down version of the text summarization problem" (Knight and Marcu, 2002). Within text summarization, two broad approaches exist: extractive approaches extract explicit tokens or phrases from the reference text, whereas abstractive approaches involve a compressed paraphrasing of the reference text, similar to the approach humans might take (Jing, 2000, 2002).

In the related domain of machine translation, a task that also involves learning a mapping from one string of tokens to another, state-of-the-art models using deep learning techniques are trained on large parallel corpora. Recent promising work on unsupervised neural machine translation (Artetxe et al., 2017; Lample et al., 2017) has shown that with the right training regime, it is possible to train models for machine translation between two languages given only two unpaired monolingual corpora.

In this paper, we apply neural text summarization techniques to the task of sentence compression, focusing on extractive summarization. However, we depart significantly from prior work by taking a fully unsupervised training approach. Beyond not using parallel corpora, we train our model using a single corpus. In contrast to unsupervised neural machine translation, which still uses two corpora, we do not have separate corpora of longer and shorter sentences.

We show that a simple denoising auto-encoder model, trained on removing and reordering words from a noised input sequence, can learn effective sentence compression, generating shorter sequences of reasonably grammatical text that retain the original meaning. While the models are still prone to both errors in grammar and meaning, we believe that this is a strong step toward reducing reliance on paired corpora.

We evaluate our model using both a standard text-summarization benchmark as well as human evaluation of compressed sentences based on grammatical correctness and retention of meaning. Although our models do not capture the written style of the target summaries (headlines), they still produce reasonably readable and accurate compressed sentence summaries, without ever being exposed to any target sentence summaries. We find that our model underperforms based on ROUGE metrics, especially compared to supervised models, but performs competitively with supervised baselines in human evaluation. We further show that providing the model with a sentence embedding of the original sentence leads to better ROUGE scores but worse human evaluation scores. However, both unsupervised and supervised methods still fall short based on human evaluation, and effective sentence compression and summarization remains an open problem.

2 Related work

Early sentence compression approaches were extractive, focusing on deletion of uninformative words from sentences through learned rules (Knight and Marcu, 2002) or linguistically-motivated heuristics (Dorr et al., 2003). The first abstractive approaches also relied on learned syntactic transformations (Cohn and Lapata, 2008).

Recent work in automated text summarization has seen the application of sequence-to-sequence models to automatic summarization, including both extractive (Nallapati et al., 2017) and abstractive (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; Paulus et al., 2017; Fan et al., 2017) approaches, as well as hybrids of both (See et al., 2017). Although these methods have achieved state-of-the-art results, they are constrained by their need for large amounts of paired document-summary data.

Miao and Blunsom (2016) seek to overcome this shortcoming by training separate compressor and reconstruction models, allowing for training based on both paired (supervised) and unlabeled (unsupervised) data. For their compressor, they train a discrete variational auto-encoder for sentence compression and use the REINFORCE algorithm to allow end-to-end training. They further use a pre-trained language model as a prior for their compression model to induce their compressed output to be grammatical. However, their reported results are still based on models trained on at least 500k instances of paired data.

In machine translation, unsupervised methods for aligning word embeddings using only unmatched bilingual corpora, trained with only small seed dictionaries (Mikolov et al., 2013; Lazaridou et al., 2015), adversarial training on similar corpora (Zhang et al., 2017; Conneau et al., 2017b), or even on distant corpora and languages (Artetxe et al., 2018), have enabled the development of unsupervised machine translation (Artetxe et al., 2017; Lample et al., 2017). However, it is not clear how to adapt these methods for summarization, where the task is to shorten the reference rather than translate it. Wang and Lee (2018) train a generative adversarial network to encode references into a latent space and decode them in summaries using only unmatched document-summary pairs. However, in contrast with machine translation, where monolingual data is plentiful and paired data scarce, summaries are paired with their respective documents when they exist, thus limiting the usefulness of such approaches. In contrast, our method requires no summary corpora.

Denoising auto-encoders (Vincent et al., 2008) have been successfully used in natural language processing for building sentence embeddings (Hill et al., 2016), training unsupervised translation models (Artetxe et al., 2017), or for natural language generation in narrow domains (Freitag and Roy, 2018). In all those instances, the added noise takes the form of random deletion of words and word swapping or shuffling. Although our noising mechanism relies on adding rather than removing words, we take some inspiration from these works.

Work in sentence simplification (see Shardlow (2014) for a survey) has some similarities with sentence compression, but it differs in that the key focus is on making sentences more easily understandable rather than shorter. Though word deletion is used, sentence simplification methods feature sentence splitting and word simplification, which are not usually present in sentence compression. Furthermore, these methods often rely heavily on learned rules (e.g. lexical simplification as in Biran et al. (2011)), integer linear programming, and sentence parse trees, which makes them starkly different from our deep learning-based approach. The exceptions that adopt end-to-end approaches, such as Filippova et al. (2015), are usually supervised and focus on word deletion.

3 Methods

3.1 Model

Our core model is based on a standard attentional encoder-decoder (Bahdanau et al., 2014), consisting of multiple layers of bi-directional long short-term memory networks in both the encoder and decoder, with negative log-likelihood as our loss function. We detail below the training regime and model modifications to apply the denoising auto-encoding paradigm to sentence compression.

3.2 Additive Noising

Since we do not use paired sentence compression data with which to train our model in a supervised way, we simulate a supervised training regime by modifying a denoising auto-encoder (DAE) training regime to more closely resemble supervised sentence compression. Given a reference sentence, we extend and shuffle the input sentence, and then train our model to recover the original reference sentence. In doing so, the model has to exclude and reorder words, and hence learns to output shorter but grammatically correct sentences.

Additive Sampling. We randomly sample additional sentences from our data set, and then subsample a number of words from each without replacement. We then append the newly sampled words to our reference sentence. In our experiments, we sample two additional sentences for each reference sentence, and the number of words sampled from each is dependent on the length of the original reference sentence. In practice, we aim to generate a noised sentence that extends the original sentence by 40% to 60%. To fit the fully unsupervised paradigm, we do not introduce any biases into our sampling of words in training our model. In particular, we excluded approaches that overweighted adjectives or speaker identification (e.g. "said X on Tuesday") in noising.

Shuffling. Next, we shuffle the resultant string of words. We experiment with two forms of shuffling: (i) a complete word (unigram) shuffle and (ii) bigram shuffling, where we only shuffle among the word bigrams, keeping pairs of adjacent words together. This process is illustrated in Figure 1.

3.3 Length Countdown

To induce our model to output sequences of a desired length, we augment the RNN decoder in our model to take an additional length countdown input. In the context of text generation, RNN decoders can be formulated as follows:

    h_t = RNN(h_{t-1}, x_t)    (1)

where h_{t-1} is the hidden state at the previous step and x_t is an external input (often an embedding of the previously decoded token). Let T_dec be the desired length of our output sequence. We modify (1) with an additional input:

    h_t = RNN(h_{t-1}, x_t, T_dec - t)    (2)

The length countdown T_dec - t is a single scalar input that ticks down to 0 when the decoder reaches the desired length T_dec, and goes negative after. In practice, (x_t, T_dec - t) are concatenated into a single vector. We also experimented with adding a length penalty to our objective function to amplify the loss from predicting the end-of-sequence token at the desired time step, but did not find that our models required this additional loss term to output sequences of the desired length.

Explicit length control has been used in previous summarization work. Fan et al. (2017) introduced a length marker token that induces the model to target an output of a desired length, coarsely divided into discrete bins. Kikuchi et al. (2016) examined several methods of introducing target output length information, and found that they were effective without negatively impacting summarization quality. We found more success with our models with a per time-step input compared to a token at the start of the sequence as in Fan et al. (2017).
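To make the additive noising and shuffling of Section 3.2 concrete, the sketch below shows one way the procedure could be implemented. The 40%-60% extension target, the two donor sentences, and the unigram/bigram shuffle variants come from the description above; the function name and the exact subsampling of the pooled donor words are illustrative assumptions rather than the authors' released code.

import random

def additive_noise(reference, corpus, n_extra=2, low=0.4, high=0.6, bigram=True):
    # reference: list of tokens; corpus: list of tokenized sentences.
    # Append roughly 40-60% extra words, subsampled from two other sentences.
    n_add = max(1, round(len(reference) * random.uniform(low, high)))
    pool = [tok for sent in random.sample(corpus, n_extra) for tok in sent]
    extra = random.sample(pool, min(n_add, len(pool)))  # without replacement
    noised = reference + extra
    if bigram:
        # Bigram shuffle: keep adjacent word pairs together, shuffle the pairs.
        pairs = [noised[i:i + 2] for i in range(0, len(noised), 2)]
        random.shuffle(pairs)
        return [tok for pair in pairs for tok in pair]
    random.shuffle(noised)  # complete unigram shuffle
    return noised

The denoising auto-encoder is then trained to map additive_noise(reference, corpus) back to the original reference.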

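A minimal sketch of the decoder modification in Equations (1)-(2) is given below, under the assumption of a single nn.LSTMCell; the paper's model uses multi-layer LSTMs with attention, so this only illustrates how the scalar countdown T_dec - t is concatenated to the step input.

import torch
import torch.nn as nn

class CountdownDecoderCell(nn.Module):
    """One decoder step: h_t = RNN(h_{t-1}, x_t, T_dec - t)  (Eq. 2)."""
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        # Step input is the token embedding plus one countdown scalar.
        self.cell = nn.LSTMCell(embed_dim + 1, hidden_dim)

    def forward(self, x_t, state, countdown):
        # x_t: (batch, embed_dim); countdown: (batch,) holding T_dec - t,
        # which reaches 0 at the desired length and goes negative afterwards.
        step_input = torch.cat([x_t, countdown.float().unsqueeze(1)], dim=1)
        h_t, c_t = self.cell(step_input, state)
        return h_t, (h_t, c_t)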
Figure 1: Illustration of Additive Noising. A reference sentence is noised with subsampled words from another sentence, and then shuffled. The denoising auto-encoder is trained to recover the original reference sentence. This simulates a text summarization training regime without the need for parallel corpora.

3.4 Input Sentence Embedding

The model specified above is supplied only with an unordered set of words with which to construct a shorter sentence. However, there are typically many ways of ordering a given set of words into a grammatical sentence. In order to help our model better recover the original sentence, we also provide the model with an InferSent sentence embedding (Conneau et al., 2017a) of the original sentence, generated using a pre-trained InferSent model. The InferSent model is trained on NLI tasks, where, given a longer premise text and a shorter hypothesis text, the model is required to determine if the premise (i) entails, (ii) contradicts or (iii) is neutral to the hypothesis. The InferSent sentence embeddings are an intermediate output of the model, reflecting information captured from each text string. Conneau et al. show that InferSent sentence embeddings capture various aspects of the semantics of a string of text (Conneau et al., 2018), and should provide additional information to the model as to which ordering of words best matches the meaning of the original sentence.

We incorporate the InferSent embeddings by modifying the hidden state passed between the encoder and the decoder. In typical RNN encoder-decoder architectures, the final hidden state of the encoder is used as the initial hidden state of the decoder. In other words, h_0^dec = h_{T_enc}^enc. We learn a fully connected layer f to be used as follows:

    h_0^dec = f(h_{T_enc}^enc, s)    (3)

where s is the InferSent embedding of the input sentence. This transformation is only applied once on the hidden state shared from the encoder to the decoder. In the case of LSTMs, where there are both hidden states and cell states, we learn a fully connected mapping for each.

3.5 Numbered Out-of-Vocabulary (OOV) Embeddings

Many text summarization data sets are based on news articles and headlines, which often include names, proper nouns, and other rare words or tokens that may not appear in dictionaries. In addition, the output layers of most models are based on a softmax over all potential output tokens. This means that expanding the vocabulary to potentially include more rare words increases computation and memory costs in the final layer linearly. There are many approaches to tackle out-of-vocabulary (OOV) tokens (See et al., 2017; Nallapati et al., 2016), and we detail below our approach.

To address the frequent occurrences of OOV tokens, we learn a fixed number of embeddings for numbered OOV tokens.[2] Given an input sequence, we first parse the sentence to identify OOV tokens and number them in order, while storing the map from numbered OOV tokens to words.[3] When embedding the respective tokens to be inputs to the RNN, we assign the corresponding embeddings for each numbered OOV token. We apply the same numbering system to the target, so the same word in the input and output will always be assigned the same numbered OOV token, and hence the same embedding. At inference, we replace any output numbered OOV tokens with their respective words. This allows us to output sentences using words not in our vocabulary.

[2] We use a fixed number of 10 OOV tokens in our experiments.
[3] In the case of shuffling and noising, we number the OOV tokens before shuffling, and number any additional OOV tokens from the noised input sentence in a second pass.

This approach is similar to the pointer-generator model (See et al., 2017), but whereas See et al. compute attention weights over all tokens in the input to learn where to copy and have an explicit switch between copying (pointer) and output (generator), we learn embeddings for a fixed number of OOV tokens, and the embeddings are in the same latent space as our pre-trained word embeddings.
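The sketch below spells out one plausible reading of Equation (3) in Section 3.4: the encoder's final hidden state is concatenated with the InferSent embedding s and passed through a learned fully connected layer, one each for the LSTM hidden and cell states. The 4096-dimensional InferSent size and the tanh non-linearity are assumptions, not details given in the paper.

import torch
import torch.nn as nn

class EncoderDecoderBridge(nn.Module):
    """h_0^dec = f(h_{T_enc}^enc, s): initialize the decoder from the
    encoder's final state and the InferSent embedding of the input."""
    def __init__(self, hidden_dim, sent_dim=4096):
        super().__init__()
        self.fc_h = nn.Linear(hidden_dim + sent_dim, hidden_dim)
        self.fc_c = nn.Linear(hidden_dim + sent_dim, hidden_dim)

    def forward(self, h_enc, c_enc, s):
        # s is the frozen InferSent embedding of the original sentence.
        h0 = torch.tanh(self.fc_h(torch.cat([h_enc, s], dim=-1)))
        c0 = torch.tanh(self.fc_c(torch.cat([c_enc, s], dim=-1)))
        return h0, c0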

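A sketch of the numbered-OOV bookkeeping from Section 3.5: out-of-vocabulary words are assigned numbered placeholder tokens (10 in the paper's experiments) in order of first occurrence, the same mapping is applied to the target during training, and the map is inverted at inference. The placeholder naming is illustrative.

def number_oov(tokens, vocab, n_slots=10):
    """Map OOV words to numbered placeholder tokens, in order of appearance.
    Returns the mapped tokens and the placeholder -> original-word map."""
    word_to_slot, mapped = {}, []
    for tok in tokens:
        if tok in vocab:
            mapped.append(tok)
        else:
            if tok not in word_to_slot and len(word_to_slot) < n_slots:
                word_to_slot[tok] = "<oov_%d>" % len(word_to_slot)
            mapped.append(word_to_slot.get(tok, "<unk>"))
    return mapped, {slot: word for word, slot in word_to_slot.items()}

def restore_oov(output_tokens, slot_to_word):
    """At inference, replace numbered placeholders with their stored words."""
    return [slot_to_word.get(tok, tok) for tok in output_tokens]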
4 Experimental Setup

4.1 Data

For our text summarization task, we use the Annotated Gigaword (Napoles et al., 2012) in line with Rush et al. (2015). This data set is derived from news articles, and consists of pairs of the main sentences in the article (longer) and the headline (shorter). The former and latter are used as references and summaries respectively in the context of summarization tasks. We preprocess the data using the scripts made available by the authors, which produces about 3.8M training examples and 400K validation examples. We sample randomly 10K examples for validation and 10K for testing from the validation set, similar to the procedure in Nallapati et al. (2016). Like Rush et al. (2015), we only extract the tokenized words of the first sentence, in contrast with Nallapati et al. (2016), who extract the first two sentences as well as part-of-speech and named-entity tags.

4.2 Training

In training, we only use the reference sentences from the Gigaword dataset. For all our models, we used GloVe word embeddings (Pennington et al., 2014). We freeze these embeddings during training. Our vocabulary is comprised of the 20,000 most frequent words in the references, and we use the aforementioned numbered OOV embeddings for other unseen words. We similarly freeze the InferSent model for sentence embeddings. The encoder and decoder are both 3-layer LSTMs with 512 hidden units. We use a batch size of 128, and optimize our models using Adam (Kingma and Ba, 2014) with an initial learning rate of 0.0005, annealing it by 0.9 at every 10K mini-batches. We do not use dropout but use gradient clipping at 2. We train our models for 4 full epochs.

4.3 Inference

At inference, we supply our model with the unmodified reference sentences; hence no noising is applied. We use the length countdown to target outputs of half the length of the reference sentences. The application of sentence embeddings is unchanged from training.

4.4 Implementation

We implemented our models using PyTorch (Paszke et al., 2017), and will make our code publicly available at https://github.com/zphang/usc_dae.

5 Results

5.1 ROUGE Evaluation

In Table 1, we evaluate our models on ROUGE (Lin, 2004) F1 scores, where a higher score is better. We provide a comparison with a simple but strong baseline, F8W, which is simply the first 8 words of the input, as is done in Wang and Lee (2018) and similarly to the Prefix baseline (first 75 bytes) of Rush et al. (2015), as well as the ROUGE of the whole text with the target. We provide scores of two supervised text-summarization methods on Gigaword. One is our own baseline, consisting of a sequence-to-sequence attentional encoder-decoder trained on pairs of reference and target summary text, but incorporating the same length countdown mechanism as in our unsupervised models. The other is the words-lvt2k-1sent model of Nallapati et al. (2016). Although not their best model, it is the most comparable to ours since it only uses the first sentence and does not extract tf-idf vectors nor named-entity tags.

F8W and All text are strong baselines due to the tendency of news articles to contain specific terms that are rarely rephrased. We find that our models perform competitively with these baselines, although they pale in comparison to supervised methods, likely because they do not learn any style transfer and use only the reference's vocabulary and writing style. While our ROUGE-1 scores are in line with the baselines, our ROUGE-2 scores fall somewhat behind. Including InferSent sentence embeddings improves our ROUGE scores across the board. Our supervised baseline's performance is close to that of Nallapati et al. (2016), with results lower in ROUGE-2 likely due to their use of beam search. Nevertheless, the supervised baseline is representative of the performance of a standard sequence-to-sequence attentional model on this task.

A direct comparison of ROUGE scores is not completely adequate for evaluating our model. Because of our training regime, our model primarily learned to generate shortened sentences that often still retain the style of the input sentences. Unlike other model setups, our model has never been exposed to any examples of summaries, and hence never adapts its output to match the style of the target summaries. In the case of Gigaword, the summaries are headlines from news articles, which are written in a particular linguistic style (e.g. dropping articles, having clauses rather than full sentences). ROUGE will thus penalize our model, which tends to output longer, full sentences. In addition, ROUGE is an imperfect metric for summarization, as word/n-gram overlap does not fully capture summary relevancy and retention of meaning.[4] For this reason, we also conduct a separate human evaluation of our different models against a supervised baseline (Section 5.4).

[4] See discussion in Nallapati et al. (2016), or in Paulus et al. (2017), where a model trained on a ROUGE-L objective alone achieves the best scores but "produces the least readable summaries among [their] experiments".
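Since the comparison above leans on ROUGE F1, a toy unigram-overlap F1 in the spirit of ROUGE-1 is sketched below. The scores in Table 1 come from the standard ROUGE implementation (including ROUGE-2 and ROUGE-L); this simplified version is only meant to show why word-overlap metrics penalize outputs that keep the input's full-sentence style rather than the headline style of the targets.

from collections import Counter

def rouge1_f1(candidate, reference):
    """Toy ROUGE-1 F1: harmonic mean of clipped unigram precision and recall.
    Both arguments are lists of tokens."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)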

Example 1:
I: nearly ### of the released hostages remain in hospital , and more than ### of them are in very serious condition , russian medical authorities said sunday .
G: nearly ### people still hospitalized more than ### in critical condition
2-g shuf: more than ### hostages are in serious condition , russian medical authorities said .
2-g shuf + InferSent: nearly ### hostages of the nearly released in serious medical condition , said .

Example 2:
I: french president jacques chirac arrived here friday at the start of a during which he is expected to hold talks with romanian leaders on bucharest 's application to join nato .
G: chirac arrives in romania
2-g shuf: french president jacques chirac arrived here friday to hold talks with romanian leaders on nato .
2-g shuf + InferSent: french president jacques chirac arrived here friday at the start of talks to join nato .

Example 3:
I: swedish truck maker ab volvo on tuesday reported its third consecutive quarterly loss as sales plunged by one-third amid weak demand in the april-june period .
G: volvo posts $ ### million loss on falling sales
2-g shuf: swedish truck maker volvo ab on tuesday reported its third consecutive quarterly .
2-g shuf + InferSent: swedish truck maker ab volvo on tuesday reported its consecutive quarterly loss .

Example 4:
I: wall street stocks rallied friday as a weak report on us economic growth boosted hopes for an easier interest rate policy from the federal reserve and investors reacted to upbeat earnings news .
G: wall street shrugs off weak gdp pushes higher
2-g shuf: wall street stocks rallied friday as investors reacted to upbeat economic news and interest rate .
2-g shuf + InferSent: wall street stocks rallied friday as investors reacted to an economic growth report on hopes .

Figure 2: Examples of inputs, ground-truth summaries, and outputs from two of our models. I is the input, G (gold) is the true summary. Examples 1 and 2 show our models summarizing pertinent information from the input. Example 3 demonstrates the ability to recover long ordered strings of tokens, even though the models are trained on shuffled data. Example 4 shows cases where the models output grammatical but semantically incorrect sentences.

5.2 ROUGE Ablation Study

In Table 2, we report the results of an ablation study. We observe that all three components we vary, namely the use of attention, bigram shuffling, and incorporation of sentence embeddings, contribute positively to the performance of our model as measured by ROUGE. The model that incorporates all three obtains the highest ROUGE scores.

5.3 Impact of Length

To assess our models' ability to deal with sequences of text of different length, we measure the ROUGE scores on two bins of length of the input text, from 16 to 30 tokens and from 31 to 45. As expected, longer sentences pose a harder challenge to the model, with our model performing better on shorter than on longer sentences. Across most sequence-based problems, models tend to perform better on shorter sequences. However, in the context of text summarization or sentence compression, longer sentences not only contain more information that the model would need to selectively remove, but also more information from which to identify the central theme of the sentence.

5.4 Human Evaluation

To qualitatively evaluate our model, we take inspiration from the methodology of Turner and Charniak (2005) to design our human evaluation. We asked 6 native English speakers to evaluate randomly chosen summaries from five models: our best models with and without InferSent sentence embeddings, a summary generated from a trained supervised model, and the ground truth summary. The sentences are evaluated based on two separate criteria: the grammaticality of the summary and how well it retained the information of the original sentence. In the former, only the summary is provided, whereas in the latter, the evaluator is shown both the original sentence as well as the summary. Each of these criteria was graded on a scale from 1 to 5. The examples are from the test set, with 50 examples randomly sampled for each evaluator and criterion.[5]

[5] The sampling is constrained to ensure each evaluator sees an equal number of summaries from each model, although evaluators are informed neither about the sampling process, nor how many or what models are involved.

We report the average evaluation given by our 6 evaluators in Table 4. That the Meaning score for the ground truth is somewhat low (3.87) is not surprising. Within the Gigaword dataset, summaries (headlines) sometimes include information not within the reference (main line of the article). We observe that quantitative evaluation does not correlate well with human evaluation. Methods using InferSent embeddings improved our ROUGE scores but perform worse in human evaluation, which is in line with the summaries presented in Figure 2. Notably, the model trained on shuffled bigrams and InferSent embeddings performed best within our ablation study, but the worst among the three models in human evaluation. Encouragingly, the model without InferSent embeddings performs competitively with the supervised baseline in both grammar and meaning scores, indicating that although it does not capture the style of headlines, it succeeds in generating grammatical sentences that roughly match the meaning in the reference. Some evaluators highlighted that it was problematic to rate meaning for ungrammatical sentences.
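Table 4 reports mean ratings with one-standard-error bands. A minimal sketch of the assumed computation is given below (sample standard deviation over all 1-5 ratings a model received, divided by the square root of the number of ratings); the paper does not spell out the exact procedure, so this is an illustration rather than the authors' script.

import statistics
from math import sqrt

def mean_and_stderr(ratings):
    """Mean and one standard error of a list of 1-5 human ratings."""
    mean = statistics.mean(ratings)
    stderr = statistics.stdev(ratings) / sqrt(len(ratings))
    return mean, stderr

# e.g. mean_and_stderr(ratings_for_one_model) yields values like the
# 3.53 (+/-0.18) entries reported in Table 4.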

Model                                        R-1     R-2     R-L     Avg. Length
Baselines:
  All text                                   28.91   10.22   25.08   31.3
  F8W                                        26.90    9.65   25.19    8
Unsupervised (Ours):
  2-g shuf                                   27.72    7.55   23.43   15.4
  2-g shuf + InferSent                       28.42    7.82   24.95   15.6
Supervised abstractive:
  Seq2seq                                    35.50   15.54   32.45   15.4
  words-lvt2k-1sent (Nallapati et al., 2016) 34.97   17.17   32.70    -

Table 1: Performance of Baseline, Unsupervised and Supervised Models. Our unsupervised models pale in comparison to supervised models, and perform in line with baselines. Simple baselines in text summarization benchmarks tend to be unusually strong. The unsupervised model incorporating sentence embeddings performs slightly better on ROUGE.

Model                   Grammar        Meaning
2-g shuf                3.53 (±0.18)   2.53 (±0.16)
1-g shuf + InferSent    2.82 (±0.17)   2.50 (±0.15)
2-g shuf + InferSent    2.87 (±0.16)   2.13 (±0.13)
Seq2seq (Supervised)    3.43 (±0.18)   2.60 (±0.17)
Ground Truth            4.07 (±0.13)   3.87 (±0.16)

Table 4: Human Evaluation. Mean scores, with 1 standard error confidence bands in parentheses. Our best model performs competitively with a supervised baseline in both grammatical correctness and retention of meaning. Models with sentence embeddings perform worse in human evaluation, despite obtaining better ROUGE scores.

Model                   R-1     R-2     R-L
1-g shuf (w/o attn)     23.01   5.51    20.07
2-g shuf (w/o attn)     22.36   5.18    19.60
1-g shuf                27.22   7.63    23.55
2-g shuf                27.72   7.55    23.43
1-g shuf + InferSent    28.12   7.75    24.81
2-g shuf + InferSent    28.42   7.82    24.95

Table 2: Ablation study. We find that using attention, shuffling bigrams, and incorporating sentence embeddings all improve our ROUGE scores. All length countdown settings are the same as in the main model.

5.5 Output Analysis

We show in Figure 2 several examples of the inputs, ground-truth target summaries, and outputs from two of our models. We observe that the output sentences are generally well-conditioned though occasionally imperfectly grammatical. We also observe certain artifacts from training only on reference texts that are not reflected in ground-truth summaries. For example, every output sentence ends with a period, and several examples end with speaker identification clauses. In all instances, we observe that the model without InferSent outputs sentences that are more readable and relevant, confirming the human evaluation results in Table 4.

Input Length    R-1     R-2     R-L     Avg. Length
16-30           30.79   9.20    27.73   12.6
31-45           26.89   6.76    23.04   17.7

Table 3: Effect of input sentence length on performance, using the 2-g shuf + InferSent model. Performance tends to be worse on longer input texts.

Example 1 shows that our model can extract the most pertinent information to generate a grammatical summary that captures the original meaning.

Example 2 shows an instance where our output accurately summarizes the input text despite low ROUGE scores with respect to the target (R-1 of 21.1 and R-2 of 11.8). In this case, both models capture the core meaning of the input.

Example 3 shows that although the models are provided completely shuffled words in training, at inference they are able to recover complex terms such as "swedish truck maker ab volvo". We note that this may be a bias in the data set (sentences in news often start with proper nouns preceded by qualifiers) and hence a simple strategy for the model to discover. This example also shows a common mistake of our models: in the output of the model without InferSent, it drops an important word ("loss") right before the end of the sentence, causing it to fail to capture the original meaning.

Example 4 shows that on longer sentences, our models may sometimes fail to accurately capture meaning. In this case, for the model without InferSent, although the output is grammatical and meaningful, it captures a meaning different than that of the original input. Indeed, our model suggests that upbeat news caused the rally, whereas the original sentence indicates that, given poor economic news, investors anticipate easier monetary policy, and this caused a stock rally.

5.6 Length Variation

Because the desired length of the output sequence is a user-defined input in the model, we can take an arbitrary sentence and use the model to output the corresponding compressed (or even expanded) sentence of any desired length. We show two examples in Figure 3, where we vary the desired length from 7 to the input length, using our best model based on human evaluation. We observe that for very short desired lengths, the model struggles to produce meaningful sentences, whereas for desired lengths close to the input length, the model nearly reconstructs the input sentence. Nevertheless, we observe that for many of the intermediate lengths, the model outputs sentences that are close in meaning to the input sentence, with different ways of rephrasing or shortening the input sentence in the interim. This suggests that when the ratio of the desired output sentence length to the input sentence length is close to that of the training regime, the model is able to perform better than when it has to generate sentences with other ratios.

Example 1:
I: three convicted serial killers have been hanged in tehran 's evin prison , the khorasan newspaper reported sunday .
L=9: three convicted serial killers have been hanged in .
L=11: three convicted serial killers have been hanged in prison sunday .
L=13: three convicted serial killers have been hanged , a newspaper reported sunday .
L=15: three convicted serial killers have been hanged in tehran , a newspaper reported sunday .
L=17: three convicted serial killers have been hanged in tehran , the tehran 's newspaper reported sunday .
L=19: three convicted serial killers have been hanged in tehran 's prison , the newspaper tehran newspaper reported sunday .

Example 2:
I: a home-made bomb was found near a shopping center on indonesia 's ambon island , where ## people were wounded by an explosion at the weekend , state media said on monday .
L=9: a home-made bomb explosion wounded ## people monday .
L=11: a home-made bomb explosion wounded ## people on indonesia monday .
L=13: a home-made bomb explosion wounded ## people on indonesia 's ambon island .
L=15: a home-made bomb explosion wounded ## people at a shopping center on ambon monday .
L=17: a home-made bomb explosion wounded ## people at a shopping center on ambon island on monday .
L=19: a home-made bomb explosion wounded ## people at a shopping center near ambon on indonesia 's island state .
L=21: a home-made bomb explosion wounded ## people at a shopping center near ambon on indonesia 's ambon island on monday .
L=23: a home-made bomb was found on a shopping center near ambon , indonesia 's state on monday , state media said monday .
L=25: a home-made bomb was found on a shopping center near ambon , indonesia 's state media center where ## people were wounded by bomb .
L=27: a home-made bomb was found on a shopping center near ambon , indonesia 's state media center where ## people were wounded , media said monday .
L=29: a home-made bomb was found on a shopping center near ambon , indonesia 's state media on monday , where ## people were wounded by an explosion nearby .
L=31: a home-made bomb was found on a shopping center near ambon , indonesia 's state media on monday , where ## people were wounded by an explosion at the weekend .
L=33: a home-made bomb was found on a shopping center near ambon , indonesia 's state media on monday , where ## people were wounded by an explosion at the weekend on monday .

Figure 3: Summaries of varied desired lengths, using the 2-g shuf model. L is the desired output length provided to the model. Because the desired output length is a human-provided input, we can produce summaries of varying lengths, ranging from highly contracted to verbose.

6 Discussion

In our experiments, we found that denoising auto-encoders quickly learn to generate well-conditioned text, even from badly conditioned inputs. We were surprised by the ability of denoising auto-encoders to recover readable sentences even from completely shuffled and noised sets of words. We observed some cases where the denoising auto-encoder outputs sequences that are grammatically correct but nonsensical or semantically different from the input.

However, the ability of denoising auto-encoders to subsample words to form grammatical sentences would significantly reduce the search space for candidate sentences, and we believe this could be useful for tasks involving sentence construction and reformulation.

Our attempts to better condition the denoising auto-encoder's outputs on the original sentence using sentence embeddings had mixed results. Although the incorporation of InferSent embeddings improved our quantitative ROUGE scores, human evaluators scored outputs conditioned on InferSent embeddings markedly worse on both grammar and meaning retention. It is unclear whether this is due to InferSent embeddings failing to capture the most significant semantic information, or if our mechanism for incorporating the sentence embedding is suboptimal.

Lastly, we echo sentiments from previous authors that ROUGE remains an imperfect proxy for measuring the adequacy of summaries. We found that ROUGE scores can be fairly uncorrelated with human evaluation, and in general can be distorted by quirks of the data set or model outputs, particularly pertaining to length, formatting, and handling of special tokens. On the other hand, human evaluation can be more sensitive to comprehensibility and relevancy while being more robust to rewording and reasonable ambiguity. Based on our human evaluation, we find that both unsupervised and supervised methods still fall short of effective sentence compression and summarization.

7 Conclusion

We present a fully unsupervised approach to the task of sentence compression in the form of a denoising auto-encoder with additive noising and word shuffling. Our model achieves comparable scores in human evaluation to a supervised sequence-to-sequence attentional baseline in grammatical correctness and retention of meaning, but underperforms on ROUGE. Output analysis indicates that our model does not capture the particular style of the summaries in the Gigaword dataset, but nevertheless produces reasonably valid sentences that capture the meaning of the input. Although our models are still prone to making mistakes, they provide a strong baseline for future sentence compression and summarization work.

Acknowledgments

We would like to express our deepest gratitude to Sam Bowman for his thoughtful advice and feedback in the writing of this paper. We thank the NVIDIA Corporation for their support.

References

Mikel Artetxe, Gorka Labaka, and Eneko Agirre. 2018. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. arXiv preprint 1805.06297.

Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho. 2017. Unsupervised neural machine translation. arXiv preprint 1710.11041.

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.

Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of ACL.

Sumit Chopra, Michael Auli, and Alexander M Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of NAACL.

Trevor Cohn and Mirella Lapata. 2008. Sentence compression beyond word deletion. In Proceedings of ACL, pages 137-144.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017a. Supervised learning of universal sentence representations from natural language inference data. CoRR, abs/1705.02364.

Alexis Conneau, German Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single vector: Probing sentence embeddings for linguistic properties. arXiv e-prints.

Alexis Conneau, Guillaume Lample, Marc'Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. 2017b. Word translation without parallel data. arXiv preprint 1710.04087.

Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of NAACL, pages 1-8.

Angela Fan, David Grangier, and Michael Auli. 2017. Controllable abstractive summarization. arXiv preprint 1711.05217.

Katja Filippova, Enrique Alfonseca, Carlos A Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with lstms. In EMNLP.

Markus Freitag and Scott Roy. 2018. Unsupervised natural language generation with denoising autoencoders. arXiv preprint 1804.07899.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint 1602.03483.

Hongyan Jing. 2000. Sentence reduction for automatic text summarization. In Proceedings of ANLP, pages 310-315.

Hongyan Jing. 2002. Using hidden markov modeling to decompose human-written summaries. Computational Linguistics, 28(4):527-543.

Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura. 2016. Controlling output length in neural encoder-decoders. ArXiv e-prints.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint 1412.6980.

Kevin Knight and Daniel Marcu. 2002. Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1):91-107.

Guillaume Lample, Ludovic Denoyer, and Marc'Aurelio Ranzato. 2017. Unsupervised machine translation using monolingual corpora only. arXiv preprint 1711.00043.

Angeliki Lazaridou, Georgiana Dinu, and Marco Baroni. 2015. Hubness and pollution: Delving into cross-space mapping for zero-shot learning. In Proceedings of ACL.

Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Proceedings of ACL, page 10.

Yishu Miao and Phil Blunsom. 2016. Language as a latent variable: Discrete generative models for sentence compression. In EMNLP.

Tomas Mikolov, Quoc V Le, and Ilya Sutskever. 2013. Exploiting similarities among languages for machine translation. arXiv preprint 1309.4168.

Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In AAAI.

Ramesh Nallapati, Bowen Zhou, Cicero Nogueira dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint 1602.06023.

Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of AKBC-WEKEX, pages 95-100.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in pytorch. In NIPS.

Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint 1705.04304.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In EMNLP.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization.

Abigail See, Peter Liu, and Christopher Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint 1704.04368.

Matthew Shardlow. 2014. A survey of automated text simplification. International Journal of Advanced Computer Science and Applications, 4(1):58-70.

Jenine Turner and Eugene Charniak. 2005. Supervised and unsupervised learning for sentence compression. In Proceedings of ACL, pages 290-297.

Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In ICML, pages 1096-1103.

Yau-Shian Wang and Hung-Yi Lee. 2018. Learning to encode text as human-readable summaries using generative adversarial networks.

Meng Zhang, Yang Liu, Huanbo Luan, and Maosong Sun. 2017. Adversarial training for unsupervised bilingual lexicon induction. In Proceedings of ACL.
