Measuring Popularity of Machine-Generated Sentences Using Term Count, Document Frequency, and Dependency Language Model
Jong Myoung Kim (1), Hancheol Park (2), Young-Seob Jeong (1), Ho-Jin Choi (1), Gahgene Gweon (2), and Jeong Hur (3)
(1) School of Computing, KAIST; (2) Department of Knowledge Service Engineering, KAIST; (3) Knowledge Mining Research Team, ETRI

Abstract

We investigated the notion of "popularity" for machine-generated sentences. We defined a popular sentence as one that contains words that are frequently used, appear in many documents, and contain frequent dependencies. We measured the popularity of sentences based on three components: content morpheme count, document frequency, and dependency relationships. To account for the characteristics of agglutinative languages, we used content morpheme frequency instead of term frequency. The key feature of our method is that we use the product of content morpheme count and document frequency to measure word popularity, and apply language models based on dependency relationships to capture the popularity that comes from the context of words. We verify that our method accurately reflects popularity by using Pearson correlations. Human evaluation shows that our method has a high correlation with human judgments.

1 Introduction

Natural language generation is widely used in a variety of Natural Language Processing (NLP) applications, including paraphrasing, question answering systems, and Machine Translation (MT). To improve the quality of generated sentences, establishing effective evaluation criteria is critical (Callison-Burch et al., 2007).

Numerous previous studies have aimed to evaluate the quality of sentences. The most frequently used evaluation technique is asking judges to score those sentences. Unlike computer algorithms, humans can notice very delicate differences and perceive various characteristics in natural language sentences. Conventional wisdom holds that human judgments represent the gold standard; however, they are prohibitively expensive and time-consuming to obtain.

Because of the high cost of manual evaluation, automatic evaluation techniques are increasingly used. These include very popular techniques that measure meaning adequacy and lexical similarity, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), and TER-Plus (Snover et al., 2009). Additionally, a distinctive characteristic of automatic evaluation techniques is that they can be applied not only to performance verification, but also to the generation stage of NLP applications. Although these techniques make experiments easier and accelerate progress in a research area, they employ fewer evaluation criteria than humans do.

In general, previous research efforts have focused on "technical quality" such as meaning and grammar. However, customer satisfaction is sometimes determined more by "functional quality" (how the service work was delivered) than by "technical quality" (the quality of the work performed) (Mittal and Lassar, 1998). In particular, Casaló et al. (2008) showed that customer loyalty and satisfaction are affected by frequent past experiences. We focus on this aspect and propose a new criterion, popularity, to capture the functional quality of sentences. We define a popular sentence as one that contains words that are frequently used, appear in many documents, and contain frequent dependencies. Using this definition, we aim to measure the popularity of sentences.
In this paper, we investigate the notion of "popularity" for machine-generated sentences. We measure the popularity of sentences with an automatic method that can be applied to the generation stage of MT or paraphrasing. Because popularity is a subjective quality, measuring it is a difficult task. We define a popular sentence as one that contains words that are frequently used, appear in many documents, and contain frequent dependencies. Accordingly, we begin our analysis by calculating Term Frequency (TF). To reflect the characteristics of agglutinative languages, we apply morpheme analysis during language resource generation and thereby obtain a Content Morpheme Count (CMC). To complement the cases that CMC cannot cover (words with abnormally high CMC), we apply a morpheme-based Document Frequency (DF). Lastly, to capture the popularity that comes from contextual information, we apply a dependency relationship language model. We verify our method by analyzing its Pearson correlation with human judgments; this human evaluation shows that our method correlates highly with human judgments. Our method also shows the potential for measuring popularity by incorporating contextual information.
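The verification step can be illustrated with a short sketch. Assuming we had a list of popularity scores produced by an automatic method and the corresponding averaged human judgments (the numbers below are invented placeholders, not results from the paper), the Pearson correlation could be computed as follows:

```python
# Illustrative only: the scores below are invented placeholders, not data from the paper.
from scipy.stats import pearsonr

method_scores = [0.82, 0.41, 0.67, 0.15, 0.90, 0.33]  # popularity scores from the automatic method
human_scores  = [4.5, 2.8, 3.9, 1.7, 4.8, 2.5]        # averaged human judgments (e.g., 5-point scale)

r, p_value = pearsonr(method_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
```

A coefficient close to 1 would indicate that the automatic scores rank sentences similarly to human judges.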
The remainder of this paper is organized as follows. Section 2 presents related works in the field of sentence evaluation. Section 3 explains our approach to measuring the popularity of words and sentences. In Section 4, we evaluate the usefulness of our method. In Section 5, we analyze the results of the experiment. Lastly, Section 6 concludes the paper.

2 Related Works

Manual evaluation, the most frequently used technique, asks judges to score the quality of sentences. It exhibits effective performance despite its inherent simplicity. Callison-Burch et al. asked judges to score fluency and adequacy on a 5-point Likert scale (Callison-Burch et al., 2007), and asked judges to score meaning and grammar in a subsequent paper (Callison-Burch, 2008). Similarly, Barzilay et al. asked judges to read hand-crafted and application-crafted paraphrases with corresponding meanings, and to identify which version was most readable and best represented the original meaning (Barzilay and Lee, 2002). McCarthy et al. studied overall quality using four criteria (McCarthy et al., 2009). Using these evaluation techniques, humans can identify characteristics that machines cannot recognize, such as nuance and sarcasm. Humans are overwhelmingly more sensitive than computers in the area of linguistics. As a result, manual evaluation provides the gold standard. However, manual evaluation presents a significant problem: it is prohibitively expensive and time-consuming to obtain.

To address these limitations, there have been studies involving automatic evaluation methods. Papineni et al. (2002) and Callison-Burch et al. (2008) proposed methods that measure meaning adequacy against an established standard. Several methods based on Levenshtein distance (Levenshtein, 1966) calculate superficial similarity by counting the number of edits required to make two sentences identical (Wagner and Fischer, 1974; Snover et al., 2009). These methods can be used to calculate dissimilarity in paraphrasing. Chen et al. measured paraphrase changes with n-grams (Chen and Dolan, 2011). These automatic evaluations also present a problem: a lack of diversity. There are many senses humans can detect in sentences, even if they are not primary factors such as meaning adequacy or grammar. We identify a novel criterion, popularity, as one of those senses, based on the fact that customer satisfaction is sometimes derived from functional quality (Mittal and Lassar, 1998).

We define the popularity of a sentence using TF, DF, and dependency relations. TF, defined as the number of times a term appears, is primarily used to measure a term's significance, especially in information retrieval and text summarization. Since Luhn used total TF as a popularity metric (Luhn, 1957), TF has been frequently used to measure term weight and has been employed in various forms to suit specific purposes. Term Frequency-Inverse Document Frequency (TF-IDF), the most well-known variation of TF, is used to identify the most representative term in a document (Salton and Buckley, 1988). Most previous research using these variations has focused on the most significant and distinctive terms; there has been minimal research concerned with commonly used terms. We measure the popularity of sentences with these commonly used terms, which have high TF and DF.

3 Method

In this section, we explain the process of language resource generation and propose a method to measure the popularity of sentences. First, we apply morpheme analysis to the corpus of sentences, because our target language is Korean, which is an agglutinative language. Next, we statistically analyze each content morpheme occurrence, and then calculate sentence popularity using these resources.

3.1 Korean Morpheme Analysis

We built our language resources (Content Morpheme Count-Document Frequency (CMC-DF) and ...

... defined a popular word as one with a frequently used content morpheme. Empty morphemes are not considered, because they are stop words in Korean. We adopt the Content Morpheme Count (CMC), a variation of TF, to measure the usage of a word's content morpheme. The CMC is the frequency of a word's content morpheme in a set of documents. The CMC of a word w is derived from the following equations:

    CMC_w = max(0, log b(w))          (1)

    b(w) = Σ_{d ∈ D} f_{m,d}          (2)

In Eq. (2), b(w) is the qualified popularity of word w, defined as the number of occurrences of the content morpheme m of word w over the entire document collection D, and f_{m,d} is the frequency of the content morpheme m in document d.
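As a concrete reading of Eqs. (1) and (2), the following Python sketch computes b(w) and CMC_w over a toy corpus that is assumed to have already been morpheme-analyzed. The toy documents, the choice of natural logarithm, and the guard for b(w) = 0 are assumptions made for illustration; they are not details specified in this excerpt.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of content morphemes, assumed to be the
# output of a Korean morpheme analyzer with empty (functional) morphemes removed.
documents = [
    ["학교", "가다", "친구", "만나다"],
    ["친구", "학교", "공부", "하다"],
    ["영화", "보다", "친구", "좋다"],
]

def qualified_popularity(morpheme, docs):
    """b(w), Eq. (2): total count of the word's content morpheme over all documents."""
    return sum(Counter(doc)[morpheme] for doc in docs)

def content_morpheme_count(morpheme, docs):
    """CMC_w = max(0, log b(w)), Eq. (1). Natural log and the b(w) = 0 guard are assumptions."""
    b = qualified_popularity(morpheme, docs)
    return max(0.0, math.log(b)) if b > 0 else 0.0

# Example: the content morpheme "친구" ("friend") occurs three times in total,
# so its CMC is log(3) ≈ 1.10.
print(content_morpheme_count("친구", documents))
```

Here b(w) grows with raw usage, while the logarithm in Eq. (1) dampens the effect of extremely frequent morphemes.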