
Semantic Pleonasm Detection

Omid Kashefi, Andrew T. Lucas, Rebecca Hwa
School of Computing and Information
University of Pittsburgh
Pittsburgh, PA 15260
{kashefi, andrew.lucas, hwa}@cs.pitt.edu

Abstract

Semantic pleonasms are words that are redundant. To aid the development of systems that detect pleonasms in text, we introduce an annotated corpus of semantic pleonasms. We validate the integrity of the corpus with inter-annotator analyses. We also compare it against alternative resources in terms of their effects on several automatic redundancy detection methods.

1 Introduction

Pleonasm is the use of extraneous words in an expression such that removing them would not significantly alter the meaning of the expression (Merriam-Webster, 1983; Quinn, 1993; Lehmann, 2005). Although pleonastic phrases may serve literary functions (e.g., to add emphasis) (Miller, 1951; Chernov, 1979), most modern writing style guides caution against them in favor of concise writing (Hart et al., 1905; Williams, 2003; Turabian, 2013; Gowers, 2014; Strunk, 1920).

An automatic pleonasm detector would be beneficial for natural language processing (NLP) applications that support student writing, such as grammar error correction (GEC) (Han et al., 2006; Rozovskaya and Roth, 2010; Tetreault et al., 2010; Dahlmeier and Ng, 2011), automatic essay grading (Larkey, 1998; Landauer, 2003; Ong et al., 2014), and intelligent writing tutors (Merrill et al., 1992; Aleven et al., 2009; Atkinson, 2016). Pleonastic phrases may also negatively impact NLP applications in general because they introduce unnecessary complexity to the language; their removal might facilitate NLP tasks such as parsing, summarization, and machine translation. However, automatic pleonasm detection is a challenging problem, in part because there are no appropriate resources to support the development of such systems. While some GEC corpora do annotate some words or phrases as "redundant" or "unnecessary," these annotations are typically a manifestation of grammatical errors (e.g., "we still have room to improve our current for welfare system") rather than a stylistic choice (e.g., "we aim to better improve our welfare system").

This paper presents a new Semantic Pleonasm Corpus (SPC), a collection of three thousand sentences. Each sentence features a pair of potentially semantically related words (chosen by a heuristic); human annotators determine whether either (or both) of the words is redundant. The corpus offers two improvements over current resources. First, the corpus filters for grammatical sentences so that the question of redundancy is separated from grammaticality. Second, the corpus is filtered for a balanced set of positive and negative examples (i.e., no redundancy). The negative examples may make useful benchmark data: because they all contain a pair of words that are deemed to be semantically related, a successful system cannot rely on simple heuristics, such as semantic distances, for discrimination. We evaluate the corpus in terms of inter-annotator agreement, and in terms of its usefulness for developing automatic pleonasm detectors.

2 Semantic Pleonasm

Although pleonasm is generally a semantic and rhetorical concept, it can have different aspects and be formed in different layers of language, including the morphemic layer (e.g., "irregardless" (Berube, 1985)) and the syntactic layer (e.g., "the most unkindest cut of all"). Detecting and correcting morphemic and syntactic pleonasms falls more within the scope of GEC research, especially when they cause errors. Semantic pleonasm, on the other hand, is "a question of style or taste, not grammar" (Evans and Evans, 1957). It occurs when the meaning of a word (or phrase) is already implied by other words in the sentence. For example, the following is a grammatical sentence that has a redundant word: "I received a free gift." While writers might intentionally include the redundant word for emphasis, the overuse of pleonasm may weaken the expression, making it "boring rather than striking the hearer" (Fowler, 1994).

3 A Semantic Pleonasm Corpus

Semantic pleonasm is a complex linguistic phenomenon; to develop a useful corpus for it, we need to make some design decisions that trade off the breadth and depth of our coverage.

3.1 Data Source

We want to start from a source that is likely to contain semantic redundancies. Because good writers are trained to guard against redundant phrasings, professionally written text from Project Gutenberg or the Wall Street Journal would not be appropriate. Because we want to separate the issues of grammaticality from redundancy, learner corpora would also not be appropriate. A data source that seems promising is amateur product reviews. The writers tend to produce more emotional prose that is at times exasperated or gushing; the writing is more off-the-cuff and casual, and may contain more redundancy. Ultimately, we chose to work with restaurant reviews from Round Seven of the Yelp Dataset Challenge [1] because it is widely distributed.

3.2 Filtering

Although redundant words and phrases occur frequently enough that exhortations to excise them are a constant refrain in writing guides, most sentences still skew toward not containing pleonasms. Annotating all sentences would dilute the impact of the positive examples, further complicate the annotation scheme, and increase the cost of the corpus creation. Thus, we opt to construct a balanced corpus of positive and negative examples for a specific kind of redundancy in a specific configuration. In particular, we extract all sentences that contain a pair of adjacent words that are likely to be semantically similar. We restrict our attention to adjacent word pairs to increase the chance of finding redundancy, since semantically related words that are farther apart are more likely to have different syntactic and semantic roles. To determine semantic similarity, we use the TextBlob Python interface [2], which, for a given word, provides access to the WordNet synsets (Miller, 1995) corresponding to each of the word's senses. We compare each pair of adjacent words in the dataset to see whether they share any synsets. Since WordNet serves as a coarse filter, we need to further improve recall. We also select any sentence that contains a pair of adjacent words such that one of the words has a synset that is similar to a synset of the other word. TextBlob provides this "similar to" functionality, which finds synsets that are close to a given synset in WordNet's taxonomy tree (note, however, that these words may not be used in those senses in the sentence). Applying these filtering rules, we are able to eliminate a large percentage of sentences that do not contain semantic redundancy; the method also helps us identify a pair of words in each sentence that is likely to contain a redundancy. In the second step of filtering, we manually removed sentences that contained obvious grammatical mistakes.

3.3 Annotation

We set up an Amazon Mechanical Turk service to determine whether the potentially redundant word pairs are actually redundant. Because we want to build a balanced corpus, we first perform a quick internal first pass, marking each sentence as either "possibly containing redundancy" or "probably not containing redundancy" so that we can distribute the instances to the Turkers with equal probability (they do not see our internal annotations). The Turkers are given six sentences at a time, each containing a highlighted pair of words. The workers have to decide whether to delete the first word, the second word, both, or neither. Then, they indicate their confidence: "Certain," "Somewhat certain," or "Uncertain." Lastly, they are given the opportunity to provide additional explanations. Each sentence has been reviewed by three different workers. For about ninety percent of the sentences, three annotations proved sufficient to achieve a consensus. We collect a fourth annotation for the remaining sentences, and are then able to declare a consensus.
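To make the filtering heuristic of Section 3.2 concrete, the following is a minimal sketch in Python. It is not our exact implementation: it approximates the TextBlob-based filter by querying WordNet directly through NLTK, and it treats WordNet's similar-to relation as a stand-in for TextBlob's "similar to" functionality; the function names and the example sentence are illustrative.

    # Sketch of the adjacent-pair synset-overlap filter (Section 3.2).
    # Assumes NLTK and its WordNet data are installed; this approximates
    # the TextBlob-based filter described in the paper.
    from nltk.corpus import wordnet as wn

    def related_synsets(word):
        """Synsets of the word plus synsets 'similar to' them (illustrative)."""
        synsets = set(wn.synsets(word))
        for s in list(synsets):
            synsets.update(s.similar_tos())  # WordNet's similar-to relation
        return synsets

    def candidate_pair(tokens):
        """Return the first adjacent pair whose synset sets overlap, else None."""
        for w1, w2 in zip(tokens, tokens[1:]):
            if related_synsets(w1.lower()) & related_synsets(w2.lower()):
                return (w1, w2)
        return None

    if __name__ == "__main__":
        sentence = "I received a free gift yesterday".split()
        # Prints a word pair if some adjacent pair passes the filter, else None.
        print(candidate_pair(sentence))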

[1] https://www.yelp.com/dataset/challenge
[2] http://textblob.readthedocs.io/en/dev/

3.4 Inter-Annotator Agreement

Because our corpus is annotated by many Turkers, with some labeling only a handful of sentences while others contributed hundreds, the typical pair-wise inter-annotator agreement is not appropriate. Instead, we compute Fleiss's Kappa (Fleiss, 1971), which measures the degree of agreement in classification over what would be expected by chance for more than two annotators. We analyze agreement at two levels of granularity: word level indicates the consensus on whether the first, the second, both, or neither of the candidate words is pleonastic; sentence level indicates the consensus on whether a sentence has a pleonastic construction.

Consensus Level    Fleiss's Kappa
Word Level                  0.384
Sentence Level              0.482

Table 1: Inter-Annotator Agreement

Table 1 shows that annotators are more likely to agree on whether a sentence contains a pleonasm than on exactly which words should be considered redundant. In many cases, a majority consensus is achieved with one annotator disagreeing with the others. The result suggests that when there is a single word redundancy, removing either of the synonyms could be appropriate.
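As a reference for how the agreement statistic in Table 1 can be computed, here is a small generic sketch of Fleiss's Kappa. It is not our evaluation script, and it assumes every item received the same number of ratings (in our corpus, a minority of sentences received a fourth annotation); the toy matrix at the end is purely illustrative.

    # Generic Fleiss's Kappa for an N x k matrix of category counts, where
    # counts[i][j] is how many annotators assigned item i to category j and
    # every item has the same total number of ratings n.
    def fleiss_kappa(counts):
        N = len(counts)                      # number of items (sentences)
        k = len(counts[0])                   # number of categories
        n = sum(counts[0])                   # ratings per item (assumed constant)

        # Per-category proportion of all assignments.
        p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]

        # Per-item observed agreement.
        P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]

        P_bar = sum(P_i) / N                 # mean observed agreement
        P_e = sum(pj * pj for pj in p)       # chance agreement
        return (P_bar - P_e) / (1 - P_e)

    # Toy usage: 4 items, 3 annotators, categories = (first, second, both, neither).
    example = [[3, 0, 0, 0], [2, 1, 0, 0], [0, 0, 0, 3], [1, 0, 0, 2]]
    print(round(fleiss_kappa(example), 3))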

3.5 Properties

The final dataset consists of 3,019 sentences. Their final labels are based on a majority consensus: 1,283 sentences are marked as not having a redundant word; 1,720 sentences are marked as containing a single word redundancy; and for 16 sentences, both words are marked as redundant. Table 2 shows the statistics of the annotators' consensus. The corpus, including all annotations and the final consensus, is available in JSON format from http://pleonasm.cs.pitt.edu

Consensus              Count    Percent
One (first word)         955        32%
One (second word)        765        25%
One (either word)      1,720        57%
Both                      16         1%
Neither                1,283        42%
Total                  3,019       100%

Table 2: Statistics of the Semantic Pleonasm Corpus

A Few Examples

Sentence: Freshly squeezed and no additives, just plain pure fruit pulp.
Consensus: "plain" is redundant.

Sentence: It is clear that I will never have another prime first experience like the one I had at Chompies.
Consensus: neither word is redundant.

Sentence: The dressing is absolutely incredibly fabulously flavorful!
Consensus: both words are redundant.

4 Automatic Pleonasm Detection

Given our design choices, the current SPC is not a large corpus; we posit that it can nonetheless serve as a valuable resource for developing systems to detect semantic pleonasm. For example, the earlier work of Xue and Hwa (2014) might have benefited from this resource. They wanted to detect the word in a sentence that contributes the least to the meaning of the sentence; however, their experiments were hampered by a mismatch between their intended domain and the corpus they evaluated on: while their model estimated a word's semantic redundancy, their experiments were performed on NUCLE (Dahlmeier et al., 2013), a learner corpus that focuses more on grammatical errors. Moreover, since their detector always returned the word with the lowest meaning-contribution score, they only evaluated their model on sentences known to contain an unnecessary word; without appropriate negative examples, it is not clear how to apply their system to sentences with no redundancy. These are two use-case scenarios that the SPC may address. To verify our claim, we will first compare the performance of several word redundancy metrics, including a replication of the metric of Xue and Hwa, on our corpus against their performance on NUCLE. We will then show that the SPC can train a classifier that predicts whether a sentence contains semantic pleonasm.

4.1 Pleonastic Word Detection

This experiment focuses on the positive examples: the methods under evaluation are all metrics for detecting the most redundant word in sentences known to contain one. We compare the performance of different word detectors on the SPC and on NUCLE. Note that our experimental goal is not to obtain a method that reports a high accuracy on the SPC (we do not want a corpus that overfits to some particular method). Rather, it is to demonstrate that the human-annotated SPC captures aspects of semantic redundancy that are not available in other resources.

In order to shed light on the differences between the SPC and NUCLE, we compare them using detectors that are formulated from different strategies. First, we have replicated the metric proposed by Xue and Hwa, which consists of two main components: a language model and a word meaning-contribution model that is derived from word alignments from machine translation [3]. This method is the most focused on lexical semantics, so we expect it to be better at detecting redundant words on the SPC. Next, we have implemented three simple metrics: SIM computes the semantic similarity between a full sentence and that sentence with the target word removed [4]; GEN estimates the degree to which a word is general (and therefore more likely to be redundant) by its number of synonyms; and SMP estimates the simplicity of a word based on an implementation of the Flesch-Kincaid readability score (Kincaid et al., 1975). Of these, only SIM directly models semantics; we expect it to be better at detecting redundant words on the SPC than the two other, more general, metrics. Finally, as a point of contrast, we consider a GEC system using LanguageTool [5] (Naber, 2003); we expect the GEC system to be better at detecting the grammar-error-related redundancy found in NUCLE than the cases of semantic redundancy found in the SPC.

To conduct the experiment, we selected 1,140 NUCLE sentences that contain one local redundancy (RLOC) error; for the SPC, the 1,720 sentences with one semantic pleonasm are used. Table 3 shows the accuracy of each method on both corpora. Our implementation of Xue and Hwa's model replicates their reported outcome on NUCLE, and, as expected, their method is more successful on the SPC. All three simple metrics are more successful at picking out redundant words on the SPC than on NUCLE, with SIM showing a bigger difference than the other two. Comparing the four methods' between-corpora differences, we see that the method of Xue and Hwa has the most to gain, perhaps because it has the strongest domain mismatch. Yet, a combination of all four metrics results in an improved accuracy of 39.4%, suggesting that the four strategies capture different aspects of semantic redundancy. That this highest achieved accuracy is still quite low suggests that there is ample room for improvement in terms of word detector development. In contrast, the GEC method performed much better on NUCLE (11.9%) than on the SPC (4.7%). Taken as a whole, these results suggest that the SPC, while small, is a better fit for the task of detecting semantic redundancy than NUCLE.

Method               NUCLE      SPC
Xue&Hwa              22.8%    31.7%
SIM                  11.1%    16.6%
GEN                   9.6%    13.3%
SMP                  16.1%    20.6%
SIM + SMP + GEN      18.2%    27.6%
ALL                  31.1%    39.4%
GEC                  11.9%     4.7%

Table 3: The accuracy of detecting the redundant word in sentences with different methods under two corpora: NUCLE and SPC. ALL is a composite metric from the other four: Xue&Hwa + SIM + SMP + GEN.

[3] In our re-implementation, the language model is trained on a portion of English Gigaword (Graff et al., 2003) using KenLM (Heafield, 2011); the word alignments are derived from Bing's English-French Translator.
[4] Using sense2vec word embeddings (Trask et al., 2015).
[5] https://languagetool.org/
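For illustration, here is a minimal sketch of the three simple word-level metrics under stated assumptions: GEN counts WordNet lemmas, SMP uses a crude syllable-count proxy in place of a full Flesch-Kincaid implementation, and SIM substitutes generic spaCy vectors (en_core_web_md) for the sense2vec embeddings used in our experiments. None of this is the exact implementation behind Table 3.

    # Illustrative word-level redundancy metrics (not the exact implementation).
    # Assumes: nltk with WordNet data, and spacy with a vector model installed.
    from nltk.corpus import wordnet as wn
    import spacy

    nlp = spacy.load("en_core_web_md")  # assumed to be installed

    def gen_score(word):
        """GEN: generality of a word, estimated by its number of synonyms."""
        return len({lemma.name() for s in wn.synsets(word) for lemma in s.lemmas()})

    def smp_score(word):
        """SMP: simplicity of a word; here a crude inverse syllable-count proxy."""
        vowels, syllables, prev = "aeiouy", 0, False
        for ch in word.lower():
            is_vowel = ch in vowels
            syllables += int(is_vowel and not prev)
            prev = is_vowel
        return 1.0 / max(syllables, 1)

    def sim_score(tokens, index):
        """SIM: similarity between the sentence and the sentence minus one word.
        A higher similarity means the removed word contributes less meaning."""
        full = nlp(" ".join(tokens))
        reduced = nlp(" ".join(t for i, t in enumerate(tokens) if i != index))
        return full.similarity(reduced)

    tokens = "I received a free gift".split()
    for i, tok in enumerate(tokens):
        print(tok, gen_score(tok), round(smp_score(tok), 2), round(sim_score(tokens, i), 3))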

4.2 Sentential Pleonasm Detection

All the methods shown in the previous experiment are metrics that assign a redundancy score to each word within a sentence; they still have to be incorporated into an outer classifier to determine whether the sentence indeed contains a pleonasm. A corpus of naturally occurring text is unsuitable for training this classifier because the distribution is heavily skewed toward the no-redundancy case. Random down-sampling is also not ideal because some examples might be too obvious (e.g., very short sentences). The SPC addresses this problem by filtering for challenging negative cases: sentences that contain a pair of words that are heuristically deemed to be semantically related, but are not judged to be redundant by human annotators. In this experiment, we use the SPC to train a binary classifier; our feature set is summarized in Table 4.

Feature    Description
UG         the one-hot representation (Harris and Harris, 2010) of unigrams of the sentence
TG         the one-hot representation of trigrams of the sentence
TFIDF      the one-hot representation of smoothed TF-IDF tuples of trigrams of the sentence
WSTAT      [max(ALL), avg(ALL), min(ALL), Len(s), LM(s)]

Table 4: Features for sentential-level pleonasm detection. ALL represents the collection of word-level metrics: Xue&Hwa, SIM, GEN, and SMP; Len(s) is the number of words in sentence s; LM(s) is the trigram probability for sentence s.

To train the classifiers, we performed 5-fold cross-validation on the full SPC corpus. We experimented with both a Maximum Entropy and a Binomial Naive Bayes binary classifier. We considered the number of features from a chi-squared test, the regularization coefficient, and the choice of penalty function and solver as hyperparameters, and optimized them using the Particle Swarm algorithm (Clerc and Kennedy, 2002) in the Optunity optimizer package [6].

Table 5 presents the results. We observe that the three features that directly encode the words of the sentence (UG, TG, TFIDF) are more relevant than the group of statistics over the word redundancy metrics (WSTAT). For our corpus size, Naive Bayes seems to converge faster to the minimum error rate than MaxEnt (Ng and Jordan, 2002). In combination, WSTAT + TFIDF gave the highest accuracy, at around 92%. This result also reinforces our inter-annotator agreement rate, suggesting that determining whether a sentence contains a semantic pleonasm is easier than deciding which word is pleonastic.

Baseline          MaxEnt    Naive Bayes
UG                  79.2           88.4
TG                  79.9           88.8
TFIDF               83.0           90.5
WSTAT               63.1           53.2
WSTAT+UG            82.3           89.2
WSTAT+TG            83.7           89.3
WSTAT+TFIDF         84.5           92.2

Table 5: The accuracy (on the SPC) of a binary classifier using different feature sets to predict whether a sentence contains a pleonastic construction.

[6] https://github.com/claesenm/optunity
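As a rough illustration of the sentential classifier in Section 4.2, the sketch below trains a Naive Bayes and a MaxEnt-style logistic regression classifier with 5-fold cross-validation over simple n-gram features using scikit-learn. It is a simplified stand-in, not our experimental pipeline: the texts and labels variables are placeholders for the SPC data, and the WSTAT features and the Particle Swarm hyperparameter search are omitted.

    # Simplified sentence-level pleonasm classifier (illustrative only).
    # Assumes scikit-learn is installed; `texts` and `labels` are placeholders
    # for SPC sentences and their binary "contains a pleonasm" labels.
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression   # MaxEnt-style classifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    # Placeholder data (replace with SPC sentences and consensus labels).
    texts = ["I received a free gift today"] * 5 + ["The soup was delicious"] * 5
    labels = [1] * 5 + [0] * 5                             # 1 = pleonastic

    feature_sets = {
        "UG": CountVectorizer(ngram_range=(1, 1), binary=True),    # unigram presence
        "TG": CountVectorizer(ngram_range=(3, 3), binary=True),    # trigram presence
        "TFIDF": TfidfVectorizer(ngram_range=(3, 3), smooth_idf=True),
    }

    for name, vectorizer in feature_sets.items():
        for clf in (LogisticRegression(max_iter=1000), MultinomialNB()):
            model = make_pipeline(vectorizer, clf)
            scores = cross_val_score(model, texts, labels, cv=5)   # 5-fold CV accuracy
            print(name, type(clf).__name__, scores.mean())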

5 Conclusion

We have introduced a semantic pleonasm corpus in which each sentence contains a word pair that is potentially semantically related. These sentences are reviewed by human annotators, who determine whether any of the words are redundant. Our corpus offers two main contributions. First, as a corpus that focuses on semantic similarity, it provides a more appropriate resource for systems that aim to detect stylistic redundancy rather than grammatical errors. Second, as a balanced corpus of positive and near-miss negative examples, it allows systems to evaluate their ability to detect "no redundancy."

Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments. This material is based upon work supported by the National Science Foundation under Grant Number #1735752.

References

Vincent Aleven, Bruce M McLaren, Jonathan Sewall, and Kenneth R Koedinger. 2009. Example-Tracing Tutors: A New Paradigm for Intelligent Tutoring Systems. International Journal of Artificial Intelligence in Education 19(2):105-154.

Robert Kenneth Atkinson. 2016. Intelligent Tutoring Systems: Structure, Applications and Challenges. Nova Science Publishers, Inc.

Margery Berube. 1985. The American Heritage Dictionary: Second College Edition. Houghton Mifflin.

Ghelly V Chernov. 1979. Semantic Aspects of Psycholinguistic Research in Simultaneous Interpretation. Language and Speech 22(3):277-295.

Maurice Clerc and James Kennedy. 2002. The Particle Swarm: Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Transactions on Evolutionary Computation 6(1):58-73.

Daniel Dahlmeier and Hwee Tou Ng. 2011. Grammatical Error Correction with Alternating Structure Optimization. In Proceedings of the 49th Annual Meeting of the ACL-HLT. pages 915-923.

Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a Large Annotated Corpus of Learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. pages 22-31.

Bergen Evans and Cornelia Evans. 1957. A Dictionary of Contemporary American Usage. Random House, Inc.

Joseph L Fleiss. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin 76(5):378.

Henry Watson Fowler. 1994. A Dictionary of Modern English Usage. Wordsworth Editions.

Ernest Gowers. 2014. Plain Words. Penguin UK.

David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda. 2003. English Gigaword. Linguistic Data Consortium, Philadelphia 4:1.

Na-Rae Han, Martin Chodorow, and Claudia Leacock. 2006. Detecting Errors in English Article Usage by Non-Native Speakers. Natural Language Engineering 12(02):115-129.

David Harris and Sarah Harris. 2010. Digital Design and Computer Architecture. Morgan Kaufmann.

Horace Hart, James Augustus Henry Murray, and Henry Bradley. 1905. Rules for Compositors and Readers at the University Press, Oxford.

Kenneth Heafield. 2011. KenLM: Faster and Smaller Language Model Queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation. pages 187-197.

J Peter Kincaid, Robert P Fishburne Jr, Richard L Rogers, and Brad S Chissom. 1975. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel.

Thomas K Landauer. 2003. Automatic Essay Assessment. Assessment in Education: Principles, Policy & Practice 10(3):295-308.

Leah S Larkey. 1998. Automatic Essay Grading Using Text Categorization Techniques. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, pages 90-95.

Christian Lehmann. 2005. Pleonasm and Hypercharacterisation. Yearbook of Morphology 2005, pages 119-154.

Inc Merriam-Webster. 1983. Webster's Ninth New Collegiate Dictionary. Merriam-Webster.

Douglas C Merrill, Brian J Reiser, Michael Ranney, and J Gregory Trafton. 1992. Effective Tutoring Techniques: A Comparison of Human Tutors and Intelligent Tutoring Systems. The Journal of the Learning Sciences 2(3):277-305.

George A Miller. 1995. WordNet: A Lexical Database for English. Communications of the ACM 38(11):39-41.

George Armitage Miller. 1951. Language and Communication. McGraw-Hill.

Daniel Naber. 2003. A Rule-Based Style and Grammar Checker. B.S. Thesis, Bielefeld University.

Andrew Y Ng and Michael I Jordan. 2002. On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes. In Advances in Neural Information Processing Systems. pages 841-848.

Nathan Ong, Diane Litman, and Alexandra Brusilovsky. 2014. Ontology-Based Argument Mining and Automatic Essay Scoring. In Proceedings of the First Workshop on Argumentation Mining. pages 24-28.

Arthur Quinn. 1993. Figures of Speech: 60 Ways to Turn a Phrase. Psychology Press.

Alla Rozovskaya and Dan Roth. 2010. Generating Confusion Sets for Context-Sensitive Error Correction. In Proceedings of EMNLP. pages 961-970.

William Strunk. 1920. The Elements of Style. New York: Harcourt, Brace and Howe.

Joel Tetreault, Jennifer Foster, and Martin Chodorow. 2010. Using Parse Features for Preposition Selection and Error Detection. In Proceedings of the ACL Short Papers. pages 353-358.

Andrew Trask, Phil Michalak, and John Liu. 2015. sense2vec - A Fast and Accurate Method for Word Sense Disambiguation in Neural Word Embeddings. arXiv preprint arXiv:1511.06388.

Kate L Turabian. 2013. A Manual for Writers of Research Papers, Theses, and Dissertations: Chicago Style for Students and Researchers. University of Chicago Press.

Joseph M Williams. 2003. Style. Longman.

Huichao Xue and Rebecca Hwa. 2014. Redundancy Detection in ESL Writings. In EACL. pages 683-691.
