
Mining Paraphrasal Typed Templates from a Plain Text Corpus

Or Biran, Terra Blevins, Kathleen McKeown
Columbia University
[email protected] [email protected] [email protected]

Abstract

Finding paraphrases in text is an important task with implications for generation, summarization and question answering, among other applications. Of particular interest to those applications is the specific formulation of the task where the paraphrases are templated, which provides an easy way to lexicalize one message in multiple ways by simply plugging in the relevant entities. Previous work has focused on mining paraphrases from parallel and comparable corpora, or mining very short sub-sentence synonyms and paraphrases. In this paper we present an approach which combines distributional and KB-driven methods to allow robust mining of sentence-level paraphrasal templates, utilizing a rich type system for the slots, from a plain text corpus.

1 Introduction

One of the main difficulties in Natural Language Generation (NLG) is the surface realization of messages: transforming a message from its internal representation to a natural language phrase, sentence or larger structure expressing it. Often the simplest way to realize messages is through the use of templates. For example, any message about the birth year and place of any person can be expressed with the template "[Person] was born in [Place] in [Year]".

Templates have the advantage that the generation system does not have to deal with the internal syntax and coherence of each template, and can instead focus on document-level discourse coherence and on local coreference issues. On the other hand, templates have two major disadvantages. First, having a human manually compose a template for each possible message is costly, especially when a generation system is relatively open-ended or is expected to deal with many domains. In addition, a text generated using templates often lacks variation, which means the system's output will be repetitive, unlike natural text produced by a human.

In this paper, we are concerned with a task aimed at solving both problems: automatically mining paraphrasal templates, i.e. groups of templates which share the same slot types and which, if their slots are filled with the same entities, result in paraphrases. We introduce an unsupervised approach to paraphrasal template mining from the text of Wikipedia articles.

Most previous work on paraphrase detection focuses either on a corpus of aligned paraphrase candidates or on such candidates extracted from a parallel or comparable corpus. In contrast, we are concerned with a very large dataset of templates extracted from a single corpus, where any two templates are potential paraphrases. Specifically, paraphrasal templates can be extracted from sentences which are not in fact paraphrases; for example, the sentences "The population of Missouri includes more than 1 million African Americans" and "Roughly 185,000 Japanese Americans reside in Hawaii" can produce the templated paraphrases "The population of [american state] includes more than [number] [ethnic group]" and "Roughly [number] [ethnic group] reside in [american state]". Looking for paraphrases among templates, instead of among sentences, allows us to avoid using an aligned corpus.

Our approach consists of three stages. First, we process the entire corpus and determine slot locations, transforming the sentences to templates (Section 4). Next, we find the most appropriate type for each slot using a large taxonomy, and group together templates which share the same set of types

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1913–1923, Berlin, Germany, August 7-12, 2016. © 2016 Association for Computational Linguistics

as potential paraphrases (Section 5). Finally, we cluster the templates in each group into sets of paraphrasal templates (Section 6).

We apply our approach to six corpora representing diverse subject domains, and show through a crowd-sourced evaluation that we can achieve a high precision of over 80% with a reasonable similarity threshold setting. We also show that our threshold parameter directly controls the trade-off between the number of paraphrases found and the precision, which makes it easy to adjust our approach to the needs of various applications.

2 Related Work

To our knowledge, although several works exist which utilize paraphrasal templates in some way, the task of extracting them has not been defined as such in the literature. The reason seems to be a difference in priorities. In the context of NLG, Angeli et al. (2010) as well as Kondadadi et al. (2013) used paraphrasal templates extracted from aligned corpora of text and data representations in specific domains, which were grouped by the data types they relate to. Duma and Klein (2013) extract templates from Wikipedia pages aligned with RDF information from DBPedia, and although they do not explicitly mention aligning multiple templates to the same set of RDF templates, the possibility seems to exist in their framework. In contrast, we are interested in extracting paraphrasal templates from non-aligned text for general NLG, as aligned corpora are difficult to obtain for most domains.

While template extraction has been a relatively small part of NLG research, it is very prominent in the field of Information Extraction (IE), beginning with Hearst (1992). There, however, the goal is to extract good data and not to extract templates that are good for generation. Many pattern extraction (as it is more commonly referred to in IE) approaches focus on semantic patterns that are not coherent lexically or syntactically, and the idea of paraphrasal templates is not important (Chambers and Jurafsky, 2011). One exception which explicitly contains a paraphrase detection component is Sekine (2006).

Meanwhile, independently of templates, detecting paraphrases is an important, difficult and well-researched problem of Natural Language Processing. It has implications for the general study of semantics as well as many specific applications such as Question Answering and Summarization. Research that focuses on mining paraphrases from large text corpora is especially relevant for our work. Typically, these approaches utilize a parallel corpus (Barzilay and McKeown, 2001; Ibrahim et al., 2003; Pang et al., 2003; Quirk et al., 2004; Fujita et al., 2012; Regneri and Wang, 2012) or a comparable corpus (Shinyama et al., 2002; Barzilay and Lee, 2003; Sekine, 2005; Shen et al., 2006; Zhao et al., 2009; Wang and Callison-Burch, 2011), and there have been approaches that leverage bilingual aligned corpora as well (Bannard and Callison-Burch, 2005; Madnani et al., 2008).

Of the above, two are particularly relevant. Barzilay and Lee (2003) produce slotted lattices that are in some ways similar to templates, and their work can be seen as the most closely related to ours. However, as they rely on a comparable corpus and produce untyped slots, it is not directly comparable. In our approach, it is precisely the fact that we use a rich type system that allows us to extract paraphrasal templates from sentences that are not, by themselves, paraphrases and to avoid using a comparable corpus. Sekine (2005) produces typed phrase templates, but the approach does not allow learning non-trivial paraphrases (that is, paraphrases that do not share the exact same keywords) from sentences that do not share the same entities (thus remaining dependent on a comparable corpus), and the type system is not very rich. In addition, that approach is limited to learning short paraphrases of relations between two entities.

Another line of research is based on contextual similarity (Lin and Pantel, 2001; Paşca and Dienes, 2005; Bhagat and Ravichandran, 2008). Here, shorter (phrase-level) paraphrases are extracted from a single corpus when they appear in a similar lexical (and, in later approaches, also syntactic) context. The main drawbacks of these methods are their inability to handle longer paraphrases and their tendency to find phrase pairs that are semantically related but not real paraphrases (e.g. antonyms or taxonomic siblings).

More recent work on paraphrase detection has, for the most part, focused on classifying provided sentence pairs as paraphrases or not, using the Microsoft Paraphrase Corpus (Dolan et al., 2004). Mihalcea et al. (2006) evaluated a wide range of lexical and semantic measures of similarity and introduced a combined metric that outperformed all previous measures. Madnani et al. (2012) showed that metrics from Machine Translation can be used

to find paraphrases with high accuracy. Another line of research uses the similarity of texts in a latent space created through matrix factorization (Guo and Diab, 2012; Ji and Eisenstein, 2013). Other approaches that have been explored are explicit alignment models (Das and Smith, 2009), distributional memory tensors (Baroni and Lenci, 2010) and syntax-aware representations of multi-word phrases using word embeddings (Socher et al., 2011). Word embeddings were also used by Milajevs et al. (2014). These approaches are not comparable to ours because they focus on classification, as opposed to mining, of paraphrases.

Detecting paraphrases is closely related to research on the mathematical representation of sentences and other short texts, which draws on a vast literature on semantics, including but not limited to lexical, distributional and knowledge-based semantics. Of particular interest to us is the work of Blacoe and Lapata (2012), which shows that simple combination methods (e.g., vector multiplication) in classic vector space representations outperform more sophisticated alternatives which take into account syntax and which use deep representations (e.g. word embeddings, or the distributional memory approach). This finding is appealing since classic vector space representations (distributional vectors) are easy to obtain and are interpretable, making it possible to drill into errors.

3 Taxonomy

Our method relies on a type system which links entities to one another in a taxonomy. We use a combination of WordNet (Fellbaum, 1998) and DBPedia (Auer et al., 2007), which provides both a rich top-level type system with lexicalizations of multiple senses and a large database of entities linked through the type system (the top-level DBPedia categories all have cognates in WordNet, which makes the two easy to combine). Leveraging the fact that DBPedia entities have corresponding Wikipedia pages, we also use the redirect terms for those pages as alternative lexicalizations of the entity (e.g., the Wikipedia article "United States" has "USA" as a redirect term, among others).

4 Creating Templates

The first step in creating the templates is to find entities, which are candidates to become slots in the templates. Since we are trying to find sentence-level paraphrasal templates, each sentence in the corpus is a potential template.

Entities are found in multiple ways. First, we use regular expressions to find dates, percentages, currencies, counters (e.g., "9th") and general numbers. These special cases are immediately given their known type (e.g., "date" or "percentage"). Next, after POS-tagging the entire corpus, we look for candidate entities of the following kinds: terms that contain only NNP (including NNPS) tags; terms that begin and end with an NNP and contain only NNP, TO, IN and DT tags; and terms that contain only capitalized words, regardless of the POS tags. Of these candidates, we only keep ones that appear in the taxonomy. Unlike the special cases above, the type of the slots created from these general entities is not yet known and will be decided in the next step.

At the end of this step, we have a set of partially-typed templates: one made from each sentence in the corpus, with its slots (but not, in most cases, their types) defined by the locations of entities. We remove from this set all templates which have fewer than two slots, as these are not likely to be interesting, and all templates which have more than five slots, to avoid excessively complicated templates.

We originally experimented with simply accepting any term that appears in the taxonomy as an entity. That method, however, resulted in a large number of both errors and entities that were too general to be useful (e.g., "table", "world" and similar terms are in the taxonomy). Note that NER approaches, even relatively fine-grained ones, would not give us the same richness of types that directly comparing to the taxonomy allows (and the next step requires that each entity we handle exist in the taxonomy, anyway).

5 Template Typing and Grouping

Determining the type of a slot in the template presents two difficulties. First, there is a sense disambiguation problem, as many lexical terms have more than one sense (that is, they can correspond to more than one entry in the taxonomy). Second, even if the sense is known, it is not clear which level of the taxonomy the type should be chosen from. For example, consider the sentence "[JFK] is [New York]'s largest airport" (the terms in square brackets will become slots once their types are determined). "JFK" is ambiguous: it can be an airport, a president, a school, etc. The first

step in this process is, then, to determine which of the possible senses of the term best fits the sentence. But once we determine that the sense of "JFK" here is of an airport, there are different types we can choose. JFK is a New York Airport, which is a type of Airport, which is a type of Air Field, which is a type of Facility and so on. The specificity of the type we choose will determine the correctness of the template, and also which other templates we can consider as potential paraphrases.

Our solution is a two-stage distributional approach: choosing the sense, and then choosing the type level that best fits the context of the slot. In each stage, we construct a pseudo-sentence (a collection of words in arbitrary, non-grammatical order) from words used in the taxonomy to describe each option (a sense in the first stage, and a type level in the second stage), and then use their vector representations to find the option that best matches the context.

Following the observation of Blacoe and Lapata (2012) that simple similarity metrics in traditional vector representations match and even outperform more sophisticated representations in finding relations among short texts, as long as multiplication is used in forming vector representations for the texts, we use traditional context vectors as the basis of our comparisons in both stages. We collect context vectors from the entire English Wikipedia corpus, with a token window of 5. To avoid noise from rarely occurring words and reduce the size of the vectors, we remove any feature with a count below a threshold of log10(Σ), where Σ is the sum of all feature counts in the vector. Finally, the vector features are weighted with (normalized) TF*IDF.[1]

For a multi-word collection (e.g. a pseudo-sentence) ψ, we define the features of the combined vector V_ψ using the vectors of member words V_w as:

    V_j^ψ = ( ∏_{w ∈ ψ} V_j^w )^(1/|ψ|)        (1)

where V_j^w is the value of the jth feature of V_w.

To choose the sense of the slot (the first stage), we start with S, the set of all possible senses (in the taxonomy) for the entity in the slot. We create a pseudo-sentence ψ_s from the primary lexicalizations of all types in the hierarchy above each sense s; e.g., for the airport sense of JFK we create a single pseudo-sentence ψ_{JFK-airport-sense} consisting of the terms "New York airport", "airport", "air field", "facility" and so on.[2] We create a vector representation V_{ψ_s} for each ψ_s using Equation 1. Then, we create a pseudo-sentence ψ_context for the context of the slot, composed of the words in a 5-word window to the left and right of the slot in the original sentence, and create the vector V_{ψ_context}. We choose the sense ŝ with the highest cosine similarity to the context:

    ŝ = argmax_{s ∈ S} cos(V_{ψ_s}, V_{ψ_context})

Note that this is a deep similarity: the similarity of the (corpus) context of the sense and the (corpus) context of the slot context; the words in the sentence themselves are not used directly.

We use the lexicalizations of all types in the hierarchy to achieve a more robust vector representation that has higher values for features that co-occur with many levels in the sense's hierarchy. For example, we can imagine that "airplane" will co-occur with many of the types for the JFK airport sense, but "La Guardia" will not (helping to lower the score of the first, too-specific sense of "New York airport"), and neither will features that co-occur with other senses of a particular type, e.g., "Apple" for the "airport" type.[3]

Once the sense is chosen, we choose the proper type level to use (the second stage). Here we create a pseudo-sentence for each type level separately, composed of all possible lexicalizations for the type. For example, the "air field" type contains the lexicalizations "air field", "landing field", and "flying field". These pseudo-sentences are then compared to the context in the same way as above, and the one with highest similarity is chosen. The reason for using all lexicalizations is similar to the one for using all types when determining the sense: to create a more robust representation that down-scores arbitrary co-occurrences.

[1] A "term" being a single feature count, and a "document" being a vector.
[2] But we exclude a fixed, small set of the most abstract types from the first few levels of the WordNet hierarchy, as these turn out to never be useful.
[3] AirPort is the name of an Apple product.

At the end of this step, the templates are fully typed. Before continuing to the next step of finding paraphrases, we group all potential paraphrases together. Potential paraphrases are simply

groups of templates which share exactly the same set of slot types (regardless of ordering).

6 Finding Paraphrases within Groups

Each group of potential paraphrases may contain multiple sub-groups such that each of the members of the sub-group is a paraphrase of all the others. In this last stage, we use a clustering algorithm to find these sub-groups.

We define the distance between any two templates in a group as the Euclidean distance between the vectors (created using Equation 1) of the two templates with the entity slots removed (that is, the pseudo-sentences created with all words in the template outside of the slots). We tried other distance metrics as well (for example, averaging the distances between the contexts surrounding each pair of corresponding slots in both templates) but the Euclidean distance seemed to work best.

Using this metric, we apply single-linkage agglomerative clustering, with the stopping criterion defined as a threshold τ for the maximum sum of squared errors (SSE) within any cluster. Specifically, the algorithm stops linking if the cluster C that would be created by the next link satisfies:

    log( Σ_{v ∈ C} d(v, μ_C)² ) ≥ τ

where μ_C is the centroid of C and d is the Euclidean distance. The logarithm is added for convenience, since the SSE can get quite large and we want to keep τ on a smaller scale.

The intuition behind this algorithm is that some paraphrases will be very similar (lexically or on a deeper level) and easy to find, while some will be more difficult to distinguish from template pairs that are related but not paraphrasal. The single-linkage approach is essentially transductive, allowing the most obvious clusters to emerge first and avoiding the creation of a central model that will become less precise over time. The threshold is a direct mechanism for controlling the trade-off between precision and recall.

At the end of this step, any pair of templates within the same cluster is considered a paraphrase. Clusters that contain only a single template are discarded (in groups that have high distances among their member templates, often the entire group is discarded since even a single link violates the threshold).

7 Evaluation

To evaluate our method, we applied it to the six domains described in Table 1. We tried to choose a set of domains that are diverse in topic, size and degree of repeated structure across documents. For each domain, we collected a corpus composed of relevant Wikipedia articles (as described in the table) and used the method described in Sections 4-6 to extract paraphrasal templates. We used Wikipedia for convenience, since it allows us to easily select domain corpora, but there is nothing in our approach that is specific to Wikipedia; it can be applied to any text corpus.

We sampled 400 pairs of paraphrases extracted from each domain and used this set of 2400 pairs to conduct a crowd-sourced human evaluation on CrowdFlower. For each template pair, we randomly selected one and used its original entities in both templates to create two sentences about the same set of entities. The annotators were presented with this pair and asked to score the extent to which they are paraphrases on a scale from 1 to 5. Table 2 shows the labels and a brief version of the explanations provided for each. To ensure the quality of annotations, we used a set of hidden test questions throughout the evaluation and rejected the contributions of annotators which did not get at least 70% of the test questions correct. Of those that did perform well on the test questions, we had three annotators score each pair and used the average as the final score for the pair. In 39.4% of the cases, all three annotators agreed; two annotators agreed in another 47% of the cases, and in the remaining 13.6% there was complete disagreement. The inter-annotator agreement for the two annotators that had the highest overlap (27 annotated pairs), using Cohen's Kappa, was κ = 0.35.

The overall results are shown in Figure 1. Note that because of our clustering approach, we have a choice of similarity threshold. The results are shown across a range of thresholds from 8 to 11; it is clear from the figure that the threshold provides a way to control the trade-off between the number of paraphrases generated and their precision. Table 3 shows the results with our preferred threshold of 9.5.

The number of paraphrase clusters found changes with the threshold. For the 9.5 threshold we find 512 clusters over all domains, a little over 60% of the number of paraphrases. The distribution of their sizes is Zipfian: a few very large clus-

Domain  Description                                        Size  Source article link
NBA     NBA teams                                          30    National_Basketball_Association
States  US states                                          50    N/A
AuMa    Automobile manufacturers                           241   List_of_automobile_manufacturers
Metal   Heavy Metal bands (original movement, 1967-1981)   291   List_of_heavy_metal_bands
CWB     Battles of the American Civil War                  446   List_of_American_Civil_War_battles
Marvel  Superheroes from the Marvel Comics universe        932   Category:Marvel_Comics_superheroes

Table 1: Evaluation domains. Source article links are preceded by https://en.wikipedia.org/wiki/
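The single-linkage agglomerative clustering of Section 6, with its log-SSE stopping criterion, can be sketched as follows. This is a simplified illustration over plain dense vectors using the natural logarithm; the actual system operates on the sparse template vectors of Equation 1, and the helper names are ours:

```python
import math

def dist(a, b):
    """Euclidean distance between two template vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sse(cluster):
    """Sum of squared distances from each vector to the cluster centroid."""
    dim = len(cluster[0])
    centroid = [sum(v[i] for v in cluster) / len(cluster) for i in range(dim)]
    return sum(sum((v[i] - centroid[i]) ** 2 for i in range(dim)) for v in cluster)

def cluster_templates(vectors, tau):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters, stopping when the next merge would create a cluster C
    with log(SSE(C)) >= tau."""
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        # Closest pair of clusters under single linkage (minimum pairwise distance).
        _, i, j = min(((min(dist(a, b) for a in ci for b in cj), i, j)
                       for i, ci in enumerate(clusters)
                       for j, cj in enumerate(clusters) if i < j),
                      key=lambda t: t[0])
        merged = clusters[i] + clusters[j]
        if math.log(sse(merged) + 1e-12) >= tau:  # stopping criterion
            break
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters
```

The small epsilon guards against log(0) when merging identical vectors; singleton clusters returned by this procedure would then be discarded, as described in Section 6.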

Score  Label                 Explanation
5      Perfect Paraphrase    The two sentences are equivalent in meaning (but allow differences in e.g. tense, wordiness or sentiment)
4      Almost Paraphrase     The two sentences are equivalent in meaning with one minor difference (e.g., change or remove one word)
3      Somewhat Paraphrase   The two sentences are equivalent in meaning with a few minor differences, or are complex sentences with a part that is a paraphrase and a part that is not
2      Related               The sentences are related in meaning, but are not paraphrases
1      Unrelated             The meanings of the sentences are unrelated

Table 2: Annotation score labels and explanations
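The sentence pairs shown to annotators are produced by filling both templates of a pair with a single set of entities (Section 7). A minimal sketch of this slot-filling step; the helper function and the example data are illustrative, not our actual implementation:

```python
def fill(template, entities):
    """Replace each [slot type] placeholder with the entity assigned to it."""
    out = template
    for slot_type, entity in entities.items():
        out = out.replace("[" + slot_type + "]", entity)
    return out

# Two templates sharing the slot types {american state, number, ethnic group}.
t1 = "The population of [american state] includes more than [number] [ethnic group]"
t2 = "Roughly [number] [ethnic group] reside in [american state]"

# The original entities of one of the two source sentences, applied to both.
entities = {"american state": "Hawaii",
            "number": "185,000",
            "ethnic group": "Japanese Americans"}

s1 = fill(t1, entities)  # "The population of Hawaii includes more than 185,000 Japanese Americans"
s2 = fill(t2, entities)  # "Roughly 185,000 Japanese Americans reside in Hawaii"
```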

Domain  # paraphrases  Avg.  %3+  %4+
NBA     30             4.1   88%  70%
States  171            4.1   86%  76%
AuMa    58             3.5   80%  50%
Metal   98             3.7   82%  63%
CWB     81             3.6   75%  56%
Marvel  428            3.7   83%  63%

Table 3: Size, average score, % of pairs with a score above 3 (paraphrases), and % of pairs with a score above 4 (high quality paraphrases) for the different domains with a 9.5 threshold
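The figures in Table 3 follow directly from the annotation scheme: each pair's final score is the average of its three annotator scores, and the %3+ / %4+ columns are the fractions of pairs at or above those scores. A sketch (function names and scores are ours, and we read "above 3" as a score of at least 3):

```python
def pair_score(annotator_scores):
    """Final score for a pair: the average of its annotator scores."""
    return sum(annotator_scores) / len(annotator_scores)

def domain_stats(pair_scores):
    """Average score and fraction of pairs scoring >= 3 and >= 4."""
    n = len(pair_scores)
    return {
        "avg": sum(pair_scores) / n,
        "%3+": sum(1 for s in pair_scores if s >= 3) / n,
        "%4+": sum(1 for s in pair_scores if s >= 4) / n,
    }

# Illustrative annotations for four pairs, three annotators each.
scores = [pair_score(s) for s in [(5, 4, 4), (3, 3, 2), (5, 5, 5), (2, 1, 2)]]
stats = domain_stats(scores)
```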

ters, dozens of increasingly smaller medium-sized ones and a long tail of clusters that contain only two templates.

The vast majority of paraphrase pairs come from sentences that were not originally paraphrases (i.e., sentences that originally had different entities). With a 9.5 threshold, 86% of paraphrases meet that criterion. While that number varies somewhat across thresholds, it is always above 80% and does not consistently increase or decrease as the threshold increases.

Figure 1: The average scores for each domain, for a range of threshold choices. The number in parentheses for each threshold is the number of paraphrases generated.
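Both the typing stage and the clustering stage rest on the combined pseudo-sentence vectors of Equation 1, whose jth feature is the geometric mean of the member words' jth features; a sense is then chosen by cosine similarity against the slot-context vector. A sketch with toy dense vectors; the real system uses sparse TF*IDF-weighted context vectors, and the data here is invented for illustration:

```python
import math

def combine(word_vectors):
    """Equation 1: the jth feature of the pseudo-sentence vector is the
    geometric mean of the jth features of the member-word vectors."""
    n = len(word_vectors)
    return [math.prod(v[j] for v in word_vectors) ** (1.0 / n)
            for j in range(len(word_vectors[0]))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def choose_sense(sense_pseudo_sentences, context_vector):
    """Pick the sense whose combined pseudo-sentence vector is most similar
    to the slot-context vector (the argmax of Section 5)."""
    return max(sense_pseudo_sentences,
               key=lambda s: cosine(combine(sense_pseudo_sentences[s]),
                                    context_vector))
```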

Method                              Corpus type                      Prec.   PPS
This paper, τ = 8                   Unaligned                        94%     0.005
This paper, τ = 9.5                 Unaligned                        82%     0.013
This paper, τ = 11                  Unaligned                        65%     0.1
Barzilay and McKeown (2001)         Parallel                         86.5%   0.1 *
Ibrahim et al. (2003)               Parallel                         41.2%   0.11 *
Pang et al. (2003)                  Parallel                         81.5%   0.33
Barzilay and Lee (2003)             Comparable                       78.5%   0.07
Bannard and Callison-Burch (2005)   Parallel bilingual               61.9%   n/a **
Zhao et al. (2009)                  Parallel or Comparable           70.6%   n/a **
Wang and Callison-Burch (2011)      Comparable                       67%     0.01
Fujita et al. (2012)                Parallel bilingual + unaligned   58%     0.34
Regneri and Wang (2012)             Parallel                         79%     0.17

* These papers do not report the number of sentences in the corpus, but do report enough for us to estimate it (e.g. the number of documents or the size in MB)
** These papers do not report the number of paraphrases extracted, or such a number does not exist in their approach

Table 4: Comparison with the precision and paraphrases generated per input sentence (PPS) of relevant prior work
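The PPS column is simply the ratio of extracted paraphrase pairs to input corpus sentences; as a sketch (the counts here are made up for illustration):

```python
def paraphrases_per_sentence(num_paraphrase_pairs, num_input_sentences):
    """PPS: paraphrase pairs extracted per input sentence of the corpus."""
    return num_paraphrase_pairs / num_input_sentences

# e.g., a corpus of 50,000 sentences yielding 650 paraphrase pairs.
pps = paraphrases_per_sentence(650, 50000)  # 0.013
```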

While we wanted to show a meaningful com- for mining paraphrases. However, keep in mind parison with another method from previous work, that it is not a comparable figure across the meth- none of them do what we are doing here - extrac- ods, since different corpora are used. In partic- tion of sentence-size paraphrasal templates from ular, it is expected to be significantly higher for a non-aligned corpus - and so a comparison us- parallel corpora, where the entire corpus consists ing the same data would not be fair (and in most of potential paraphrases (and that fact is reflected cases, not possible). While it seems that provid- in Table 4, where some methods that use parallel ing the results of human evaluation without com- corpora have a PPS that is an order of magnitude parison to prior methods is the norm in most rel- higher than other methods). evant prior work (Ibrahim et al., 2003; Pas¸ca and Dienes, 2005; Bannard and Callison-Burch, 2005; 8 Discussion and Examples Fujita et al., 2012), we wanted to at least get some sense of where we stand in comparison to other The first thing to note about the results shown methods, and so we provide a list of (not directly in Figure 1 is that even for the highest threshold comparable) results reported by other authors in considered, which gives us a 21 improvement × Table 4.4 While it is impossible to meaningfully in size over the smallest threshold considered, all compare and rate such different methods, these domains except CWB achieve an average score numbers support the conclusion that our single- higher than 3, meaning most of the pairs extracted corpus, domain-agnostic approach achieves a pre- are paraphrases (CWB is close - a little over 2.9 cision that is similar to or better than other meth- on average). For the lowest threshold considered, ods. 
We also include the paraphrase per sentence all domains are at a precision above 88%, and for (PPS) value - the ratio of paraphrases extracted to three of them it is 100%. In general, across all do- the number of input sentences of the corpus - for mains, there seems to be a significant drop in pre- each method in the table. We intend this figure cision (and a significant boost in size) for thresh- as the closest thing to recall that we can conceive olds between 9 and 10, while the precisions and sizes are fairly stable for thresholds between 8 and 4We always show the results of the best system described. Where needed, if results were reported in a different way than 9 and between 10 and 11. This result is encourag- simple percentages, we use averages and other appropriate ing: since the method seems to behave fairly simi- measures. Some previous work defines related sentences (as larly for different domains with regard to changes opposed to paraphrases) as positives and some does not; we do not change their numbers to fit a single definition, but we in the threshold, we should be able to expect sim- use the harsher measure for our own results ilar behavior for new domains as the threshold is

adjusted.

The magnitude of precision across domains is another matter. It is clear from the results that some domains are more difficult than others. The Metal domain seems to be the hardest: it never achieves an average score higher than 3.8. For the highest threshold, however, Metal is not different from most of the others, while CWB is significantly lower in precision. The reason seems to be the styles of the domain articles: some domains tend to have a more structured form. For example, each article in the States domain will discuss the economy, demographics, formation, etc. of the state, and we are more likely to find paraphrases there (simply by virtue of there being 50 × 49 candidates). Articles in the Metal domain are much less structured, and there are fewer obvious paraphrase candidates. In CWB articles, there are a few repetitive themes: the outcome of the battle, the casualties, the generals involved, etc., but beyond that it is fairly unstructured. This "structurality" of the domain also affects the number of paraphrases that can be found, as evident from the number of paraphrases found in the States domain in Table 3 as compared with the (much larger) Metal and CWB domains.

Table 5 shows a number of examples from each domain, along with the score given to each by the annotators. In an informal error analysis, we saw a few scenarios recurring in low-scored pairs. The Metal example at the bottom of Table 5 is a double case of bad sense disambiguation: the album in the second sentence ("Pyromania" in the original) happened to have a name that is also a pathological state. In addition, the number in the second sentence really was a date ("1980"). If we had correctly assigned the senses, these two templates would not be paraphrase candidates. The process of grouping by type is an important part of improving precision: two sentences can be misleadingly similar in the vector space, but it is far less likely that two sentences with the exact same entity types and a high vector similarity are not close in meaning.

Another scenario is the one seen in the NBA example that was scored as 1. Here the senses were chosen correctly, but the level of the hierarchy chosen for the person slot was too high. If we had instead chosen basketball coach and basketball player for the two sentences respectively, they would not be considered as paraphrase candidates (and note that both meanings are implied by the templates). This sort of error does not create a problem (in our evaluation, at least) if the more accurate sense is the same in both sentences: for example, in the other NBA example (which scored 4), the place slot could be more accurately replaced with sports arena in both templates.

Cases where the types are chosen correctly do not always result in perfect paraphrases, but they are typically at least related (e.g., the examples that scored 2, and to a lesser extent those that scored 3). That scenario can be controlled using a lower threshold, with the downside that the number of paraphrases found decreases.

9 Conclusion and Future Work

We presented a method for extracting paraphrasal templates from a plain text corpus in three steps: templatizing the sentences of the corpus; finding the most appropriate type for each slot; and clustering groups of templates that share the same set of types into paraphrasal sub-groups. We conducted a crowd-sourced human evaluation and showed that our method performs similarly to or better than prior work on mining paraphrases, with three major improvements. First, we do not rely on a parallel or comparable corpus, which are not as easily obtained; second, we produce typed templates that utilize a rich, fine-grained type system, which can make them more suitable for generation; and third, by using such a type system we are able to find paraphrases from sentence pairs that are not, before templatization, really paraphrases.

Many, if not most, of the worst misidentifications seem to be the result of errors in the second stage of the approach: disambiguating the sense and specificity of the slot types. In this paper we focused on a traditional distributional approach that has the advantage of being explainable, but it would be interesting and useful to explore other options such as word embeddings, matrix factorization and semantic similarity metrics. We leave these to future work.

Another task for future work is semantic alignment. Our approach discovers paraphrasal templates without aligning them to a semantic meaning representation; while these are perfectly usable by summarization, question answering, and other text-to-text generation applications, it would be useful for concept-to-text generation and other applications to have each cluster of templates aligned
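The grouping-by-type precision filter described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the template records, type signatures, similarity vectors, and the threshold value of 0.8 are all invented for the example.

```python
from collections import defaultdict
from math import sqrt


def type_signature(template):
    """Canonical multiset of slot types, e.g. ('date', 'number')."""
    return tuple(sorted(template["types"]))


def cosine(u, v):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def paraphrase_candidates(templates, threshold=0.8):
    """Pair templates only within a shared type-signature group,
    then keep pairs whose vector similarity clears the threshold."""
    groups = defaultdict(list)
    for t in templates:
        groups[type_signature(t)].append(t)
    pairs = []
    for group in groups.values():
        for i in range(len(group)):
            for j in range(i + 1, len(group)):
                if cosine(group[i]["vec"], group[j]["vec"]) >= threshold:
                    pairs.append((group[i]["text"], group[j]["text"]))
    return pairs
```

Lowering `threshold` admits more (and looser) candidate pairs, mirroring the precision/recall trade-off discussed above, while the type-signature grouping prevents vector-similar sentences with different entity types from ever being compared.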

Score Domain Templates

5 | States
    Per dollar of federal tax collected in [date 1], [american state 1] citizens received approximately [money 1] in the way of federal spending.
    In [date 1] the federal government spent [money 1] on [american state 1] for every dollar of tax revenue collected from the state.
5 | AuMa
    Designed as a competitor to the [car 1], [car 2] and [car 3].
    It is expected to rival the [car 1], [car 2], and [car 3].
4 | CWB
    Federal casualties were heavy with at least [number 1] killed or mortally wounded, [number 2] wounded, and [number 3] made prisoner.
    Federal losses were [number 1] killed, [number 2] wounded, and [number 3] unaccounted for – primarily prisoners.
4 | NBA
    For the [date 1] season, the [basketball team 1] moved into their new arena, the [place 1], with a seating capacity of [number 1].
    As a result of their success on the court, the [basketball team 1] moved into the [place 1] in [date 1], which seats over [number 1] fans.
3 | Marvel
    [imaginary being 1] approached [imaginary being 2], hunting for leads about the whereabouts of the X-Men.
    [imaginary being 1] and [imaginary being 2] eventually found the X-Men and became full time members.
3 | Metal
    In [date 1], [band 1] recorded their third studio album, "[album 1]", which was produced by Kornelije Kovač.
    [band 1] released their next full-length studio album, "[album 1]" in [date 1].
2 | AuMa
    [company 1] and its subsidiaries created a variety of initiatives in the social sphere, initially in [country 1] and then internationally as the company expanded.
    [company 1] participated in [country 1]'s unprecedented economic growth of the 1950s and 1960s.
2 | Marvel
    Using her powers of psychological deduction, she picked up on [first name 1]'s attraction towards her, and then [first name 2] admits she is attracted to him as well.
    While [first name 1] became shy, reserved and bookish, [first name 2] became athletically inclined, aggressive, and arrogant.
1 | NBA
    Though the [date 1] 76ers exceeded many on-court expectations, there was a great deal of behind-the-scenes tension between [person 1], his players, and the front office.
    After an [date 1] start, with [person 1] already hurt, these critics seemed to have been proven right.
1 | Metal
    Within [number 1] hours of the statement, he died of bronchial pneumonia, which was brought on as a complication of [pathological state 1].
    With the album's massive success, "[pathological state 1]" was the catalyst for the [number 1] pop-metal movement.

Table 5: Examples of template pairs and their scores

to a semantic representation of the meaning expressed. Since we already discover all the entity types involved, all that is missing is the proposition (or frame, or set of propositions); this seems to be a straightforward, though not necessarily easy, task to tackle in the near future.
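The templatization step that underlies the clusters in Table 5 can be sketched minimally as below. Here the entity spans and type labels are supplied by hand for illustration, whereas the paper derives them via entity recognition and taxonomy-based sense disambiguation; the function name is our own.

```python
def templatize(sentence, entities):
    """Replace each annotated entity span with its bracketed type label.

    `entities` maps surface strings to type names; these annotations
    are hand-supplied here, a stand-in for the paper's automatic
    slot-detection and typing stages.
    """
    template = sentence
    for surface, etype in entities.items():
        template = template.replace(surface, f"[{etype}]")
    return template


s1 = "Roughly 185,000 Japanese Americans reside in Hawaii"
t1 = templatize(s1, {"185,000": "number",
                     "Japanese Americans": "ethnic group",
                     "Hawaii": "american state"})
# t1 == "Roughly [number] [ethnic group] reside in [american state]"
```

Two sentences that are not paraphrases of each other can, after this step, yield templates with identical type signatures, which is exactly how the method finds paraphrasal templates without an aligned corpus.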
