Monolingual Machine Translation for Paraphrase Generation
Chris QUIRK, Chris BROCKETT and William DOLAN
Natural Language Processing Group, Microsoft Research
One Microsoft Way, Redmond, WA 98052 USA
{chrisq,chrisbkt,billdol}@microsoft.com

Abstract

We apply statistical machine translation (SMT) tools to generate novel paraphrases of input sentences in the same language. The system is trained on large volumes of sentence pairs automatically extracted from clustered news articles available on the World Wide Web. Alignment Error Rate (AER) is measured to gauge the quality of the resulting corpus. A monotone phrasal decoder generates contextual replacements. Human evaluation shows that this system outperforms baseline paraphrase generation techniques and, in a departure from previous work, offers better coverage and scalability than the current best-of-breed paraphrasing approaches.

1 Introduction

The ability to categorize distinct word sequences as "meaning the same thing" is vital to applications as diverse as search, summarization, dialog, and question answering. Recent research has treated paraphrase acquisition and generation as a machine learning problem (Barzilay & McKeown, 2001; Lin & Pantel, 2002; Shinyama et al., 2002; Barzilay & Lee, 2003; Pang et al., 2003). We approach this problem as one of statistical machine translation (SMT), within the noisy channel model of Brown et al. (1993). That is, we seek to identify the optimal paraphrase T* of a sentence S by finding:

    T^* = \arg\max_T P(T \mid S) = \arg\max_T P(S \mid T)\, P(T)

T and S being sentences in the same language.
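To make the decomposition concrete, here is a minimal sketch, not the system described in this paper, of how a decoder selects T* under the noisy channel model. The scoring functions channel_logprob and lm_logprob are hypothetical stand-ins for a trained channel (paraphrase) model and language model:

    def best_paraphrase(source, candidates, channel_logprob, lm_logprob):
        """Select T* = argmax_T P(S|T) * P(T), computed in log space.

        channel_logprob(s, t) -- log P(S = s | T = t), the channel model
        lm_logprob(t)         -- log P(T = t), the language model
        Both are assumed to be supplied by previously trained models.
        """
        return max(candidates,
                   key=lambda t: channel_logprob(source, t) + lm_logprob(t))

Working in log space turns the product P(S|T)P(T) into a sum and avoids floating-point underflow on long sentences.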
We describe and evaluate an SMT-based paraphrase generation system that utilizes a monotone phrasal decoder to generate meaning-preserving paraphrases across multiple domains. By adopting at the outset a paradigm geared toward generating sentences, this approach overcomes many problems encountered by task-specific approaches. In particular, we show that SMT techniques can be extended to paraphrase given sufficient monolingual parallel data.¹ We show that a huge corpus of comparable and alignable sentence pairs can be culled from ready-made topical/temporal clusters of news articles gathered on a daily basis from thousands of sources on the World Wide Web, thereby permitting the system to operate outside the narrow domains typical of existing systems.

¹ Barzilay & McKeown (2001), for example, reject the idea owing to the noisy, comparable nature of their data.

2 Related work

Until recently, efforts in paraphrase were not strongly focused on generation and relied primarily on narrow data sources. One data source has been multiple translations of classic literary works (Barzilay & McKeown, 2001; Ibrahim, 2002; Ibrahim et al., 2003). Pang et al. (2003) obtain parallel monolingual texts from a set of 100 multiply-translated news articles. While translation-based approaches to obtaining data do address the problem of how to identify two strings as meaning the same thing, they are limited in scalability owing to the difficulty (and expense) of obtaining large quantities of multiply-translated source documents.

Other researchers have sought to identify patterns in large unannotated monolingual corpora. Lin & Pantel (2002) derive inference rules by parsing text fragments and extracting semantically similar paths. Shinyama et al. (2002) identify dependency paths in two collections of newspaper articles. In each case, however, the information extracted is limited to a small set of patterns.

Barzilay & Lee (2003) exploit the meta-information implicit in dual collections of newswire articles, but focus on learning sentence-level patterns that provide a basis for generation. Multi-sequence alignment (MSA) is used to identify sentences that share formal (and presumably semantic) properties. This yields a set of clusters, each characterized by a word lattice that captures n-gram-based structural similarities between sentences. Lattices are in turn mapped to templates that can be used to produce novel transforms of input sentences. Their methodology provides striking results within a limited domain characterized by a high frequency of stereotypical sentence types. However, as we show below, the approach may be of limited generality, even within the training domain.

3 Data collection

Our training corpus, like those of Shinyama et al. and Barzilay & Lee, consists of different news stories reporting the same event. While previous work with comparable news corpora has been limited to just two news sources, we set out to harness the ongoing explosion in internet news coverage. Thousands of news sources worldwide are competing to cover the same stories, in real time. Despite different authorship, these stories cover the same events and therefore have significant content overlap, especially in reports of the basic facts. In other cases, news agencies introduce minor edits into a single original AP or Reuters story. We believe that our work constitutes the first attempt to exploit these massively multiple data sources for paraphrase learning and generation.

3.1 Gathering aligned sentence pairs

We began by identifying sets of pre-clustered URLs that point to news articles on the Web, gathered from publicly available sites such as http://news.yahoo.com/, http://news.google.com and http://uk.newsbot.msn.com. Their clustering algorithms appear to consider the full text of each news article, in addition to temporal cues, to produce sets of topically/temporally related articles. Story content is captured by downloading the HTML and isolating the textual content. A supervised HMM was trained to distinguish story content from surrounding advertisements, etc.²

² We hand-tagged 1,150 articles to indicate which portions of the text were story content and which were advertisements, image captions, or other unwanted material. We evaluated several classifiers on a 70/30 train/test split and found that an HMM trained on a handful of features was most effective in identifying content lines (95% F-measure).

Over the course of about 8 months, we collected 11,162 clusters, comprising 177,095 articles and averaging 15.8 articles per cluster. The quality of these clusters is generally good. Impressionistically, discrete events like sudden disasters, business announcements, and deaths tend to yield tightly focused clusters, while ongoing stories like the SARS crisis tend to produce very large and unfocused clusters.

To extract likely paraphrase sentence pairs from these clusters, we used edit distance (Levenshtein, 1966) over words, comparing all sentences pairwise within a cluster to find the minimal number of word insertions and deletions transforming the first sentence into the second. Each sentence was normalized to lower case, and the pairs were filtered (as sketched in the code below) to reject:

• Sentence pairs where the sentences were identical or differed only in punctuation;
• Duplicate sentence pairs;
• Sentence pairs with significantly different lengths (the shorter is less than two-thirds the length of the longer);
• Sentence pairs where the Levenshtein distance was greater than 12.0.³

³ Chosen on the basis of ablation experiments and optimal AER (discussed in 3.2).

A total of 139K non-identical sentence pairs were obtained. Mean Levenshtein distance was 5.17; mean sentence length was 18.6 words.
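The following sketch illustrates the filtering procedure just described. It is a simplified reconstruction, not the original pipeline: tokenization is naive whitespace splitting, and punctuation normalization is omitted. Because only word insertions and deletions are counted, the distance is computed via longest common subsequence rather than full Levenshtein with substitutions:

    from itertools import combinations

    def word_edit_distance(a, b):
        """Insertion/deletion-only edit distance over word lists,
        equal to len(a) + len(b) - 2 * LCS(a, b)."""
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, wa in enumerate(a):
            for j, wb in enumerate(b):
                dp[i + 1][j + 1] = (dp[i][j] + 1 if wa == wb
                                    else max(dp[i][j + 1], dp[i + 1][j]))
        return len(a) + len(b) - 2 * dp[len(a)][len(b)]

    def extract_pairs(cluster_sentences, max_dist=12.0):
        """Compare all sentences pairwise within one cluster and apply
        the four filters listed above."""
        seen, pairs = set(), []
        for s1, s2 in combinations(cluster_sentences, 2):
            w1, w2 = s1.lower().split(), s2.lower().split()
            if w1 == w2:                          # identical sentences
                continue
            shorter, longer = sorted((len(w1), len(w2)))
            if shorter < (2.0 / 3.0) * longer:    # length-ratio filter
                continue
            if word_edit_distance(w1, w2) > max_dist:
                continue
            key = (tuple(w1), tuple(w2))
            if key in seen:                       # duplicate pair
                continue
            seen.add(key)
            pairs.append((s1, s2))
        return pairs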
3.2 Word alignment

To this corpus we applied the word alignment algorithms available in Giza++ (Och & Ney, 2000), a freely available implementation of IBM Models 1-5 (Brown et al., 1993) and the HMM alignment (Vogel et al., 1996), along with various improvements and modifications motivated by experimentation by Och & Ney (2000). In order to capture the many-to-many alignments that identify correspondences between idioms and other phrasal chunks, we align in the forward direction and again in the backward direction, heuristically recombining each unidirectional word alignment into a single bidirectional alignment (Och & Ney, 2000).
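The recombination heuristic itself is not spelled out here; the sketch below shows one simple way to merge two unidirectional alignments (each a set of (source index, target index) links), loosely modeled on the Och & Ney (2000) refined method: start from the high-precision intersection, then grow with union links that attach a previously unaligned word. The published heuristic differs in its details:

    def symmetrize(forward, backward):
        """Merge two unidirectional word alignments into one
        bidirectional alignment. forward/backward are sets of
        (i, j) links between source position i and target position j."""
        intersection = forward & backward          # high-precision core
        union = forward | backward                 # high-recall superset
        merged = set(intersection)
        for (i, j) in sorted(union - intersection):
            src_covered = any(i == i2 for (i2, _) in merged)
            tgt_covered = any(j == j2 for (_, j2) in merged)
            if not (src_covered and tgt_covered):  # attaches a new word
                merged.add((i, j))
        return merged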
[Figure 1. An example Giza++ alignment]

Figure 1 shows an example of a monolingual alignment produced by Giza++. Each line represents a uni-directional link; directionality is indicated by a tick mark on the target side of the link.

We held out a set of news clusters from our training data and extracted a set of 250 sentence pairs for blind evaluation. Randomly extracted on the basis of an edit distance of 5 ≤ n ≤ 20 (to allow a range of reasonably divergent candidate pairs while eliminating the most trivial substitutions), the gold-standard sentence pairs were checked by an independent human evaluator to ensure that they contained paraphrases before they were hand word-aligned.

To evaluate the alignments, we adhered to the standards established in Melamed (2001) and Och & Ney (2000, 2003). Following Och & Ney's methodology, two annotators each created an initial annotation for each dataset, subcategorizing alignments as either SURE (necessary) or POSSIBLE (allowed, but not required). Differences were highlighted and the annotators were asked to review their choices on these differences. Finally we combined the two annotations into a single gold standard: if both annotators agreed that an alignment

    Training Data Type: L12
    Precision                       87.46%
    Recall                          89.52%
    AER                             11.58%
    Identical word precision        89.36%
    Identical word recall           89.50%
    Identical word AER              10.57%
    Non-identical word precision    76.99%
    Non-identical word recall       90.22%
    Non-identical word AER          20.88%

    Table 1. AER on the Lev12 corpus

Table 1 shows the results of evaluating alignment after training the Giza++ model. Although the overall AER of 11.58% is higher than that of the best bilingual MT systems (Och & Ney, 2003), the training data is inherently noisy, having more in common with analogous corpora than conventional MT parallel corpora in that the paraphrases are not constrained by the source text structure. The identical word AER of 10.57% is unsurprising given that the domain is unrestricted and the alignment algorithm does not employ direct string matching to leverage word identity.⁵ The non-identical word AER of 20.88% may appear problematic in a system that aims to generate paraphrases; as we shall see, however, this turns out not to be the case.
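For reference, precision, recall, and AER in Table 1 follow the standard definitions of Och & Ney (2000, 2003): with A the hypothesized alignment links, S the gold SURE links, and P the gold POSSIBLE links (S ⊆ P),

    \text{Precision} = \frac{|A \cap P|}{|A|}, \qquad
    \text{Recall} = \frac{|A \cap S|}{|S|}, \qquad
    \mathrm{AER} = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}

so that an AER of 0 corresponds to an alignment that contains every SURE link and proposes nothing outside the POSSIBLE set.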