Semantic Pleonasm Detection

Omid Kashefi, Andrew Lucas, Rebecca Hwa

Intelligent Systems Program, University of Pittsburgh
December 2018

What is Pleonasm?

• Pleonasm: the use of extraneous words in an expression such that removing them would not significantly alter its meaning

• Pleonasm has different aspects and can form at different layers of language:

• Morphemic (e.g., “irregardless”) and syntactic (e.g., “the most kindest”) pleonasms are in the scope of GEC research, especially when they cause errors

• Semantic (e.g., “I received a free gift”)

What is Semantic Pleonasm?

• Semantic Pleonasm: when the meaning of a word (or phrase) is already implied by other words in the sentence
• “A question of style or taste, not grammar” (Evans et al., 1957)

• Might have some literary functions

• Most modern style guides caution against them in favor of concise writing

Challenges of Detecting Semantic Pleonasm

• Semantic pleonasm is a complex linguistic phenomenon

• There are no appropriate resources to support the development of such systems

• Lack of good strategies to build such resources

• Some GEC corpora (e.g., NUCLE) have a “redundant” annotation, but it marks
  • manifestations of grammar errors (e.g., “we still have room to improve for our current welfare system”)
  • rather than stylistic choices (e.g., “we aim to better improve our welfare system”)

• Using NUCLE and other GEC corpora does not allow us to separate the question of redundancy from grammaticality

Semantic Pleonasm Corpus (SPC)

• Raw Data

• Round Seven of the Yelp Dataset Challenge

• The writing is more casual

• The writing is often more emotional

• The writing is more likely to contain semantic pleonasms

Semantic Pleonasm Corpus (SPC)

• Annotation Principles
1. Do not annotate directly over raw text, because most sentences do not contain pleonasms

2. Negative examples should be challenging

3. Avoid sentences with obvious grammar errors

• Filter for sentences containing a pair of adjacent semantically similar words (via WordNet; see the sketch after this list)

• Filter out ungrammatical sentences
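A minimal sketch of the candidate filter, assuming NLTK’s WordNet interface; the Wu-Palmer measure and the 0.8 threshold are illustrative assumptions, not necessarily the exact settings used to build SPC:

```python
# Minimal sketch of the SPC candidate filter: keep a sentence only if it
# contains two adjacent, semantically similar words (per WordNet).
from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize


def max_wup_similarity(w1, w2):
    """Highest Wu-Palmer similarity over any synset pairing of w1 and w2."""
    scores = [s1.wup_similarity(s2) or 0.0
              for s1 in wn.synsets(w1)
              for s2 in wn.synsets(w2)]
    return max(scores, default=0.0)


def has_adjacent_similar_pair(sentence, threshold=0.8):
    """True if the sentence contains a pair of adjacent, semantically
    similar words -- the condition for entering the candidate pool."""
    tokens = [t.lower() for t in word_tokenize(sentence) if t.isalpha()]
    return any(max_wup_similarity(a, b) >= threshold
               for a, b in zip(tokens, tokens[1:]))
```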

Semantic Pleonasm Corpus (SPC)

• Annotation Procedures
• Annotators are recruited from Amazon Mechanical Turk

• Turkers are given six sentences at a time, each with a pair of semantically similar adjacent words, and decide whether to delete the first word, the second word, both, or neither

• Each sentence is reviewed by three different Turkers

• Final annotation is based on majority consensus
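A minimal sketch of that consensus rule, assuming each sentence carries exactly three judgments from {first, second, both, neither}:

```python
# Majority-consensus step: each sentence has three Turker judgments;
# keep a label only when at least two annotators agree.
from collections import Counter


def consensus(labels):
    """Return the majority label among three judgments, or None when
    all three annotators disagree."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None


print(consensus(["first", "first", "neither"]))  # first
print(consensus(["first", "second", "both"]))    # None
```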

Semantic Pleonasm Corpus (SPC)

• Examples
• “Freshly squeezed and no additives, just plain pure fruit pulp”
  • Consensus: “plain” is redundant

• “It is clear that I will never have another prime first experience like the one I had at Chompies.”
  • Consensus: neither word is redundant

• “The dressing is absolutely incredibly fabulously flavorful!”
  • Consensus: both words are redundant

Semantic Pleonasm Corpus (SPC)

• Statistics

            One (First)   One (Second)   Both   Neither   Total
Count               955            765     16     1,283   3,019
Percent             32%            25%     1%       42%    100%

• “One” combined (first or second): 1,720 sentences (57%)

Semantic Pleonasm Corpus (SPC)

• Inter-Annotator Agreement
• Word Level: whether the first, second, both, or neither of the candidate words is pleonastic

• Sentence Level: whether a sentence has a pleonastic construction

Consensus Level    Fleiss’s Kappa
Word Level                  0.384
Sentence Level              0.482
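Agreement values like these can be computed from raw judgments with statsmodels; the ratings below are toy data, not the actual SPC annotations:

```python
# Computing Fleiss's kappa from per-sentence Turker labels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per sentence, one column per annotator;
# labels: 0 = first, 1 = second, 2 = both, 3 = neither.
ratings = np.array([
    [0, 0, 3],
    [3, 3, 3],
    [1, 1, 2],
    [0, 1, 0],
])

table, _ = aggregate_raters(ratings)  # subjects x categories count table
print(fleiss_kappa(table))
```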

Automatic Pleonasm Detection

• SPC can serve as a valuable resource for developing systems to detect semantic pleonasm
• Claim 1: the performance of word redundancy metrics is hampered by the mismatch between their intended domain (semantic pleonasm) and the corpora they are evaluated on (GEC corpora such as NUCLE)
  • SPC is focused on the desired target domain

• Claim 2: without appropriate negative examples, it is not clear how to apply word redundancy metrics to sentences with no redundancy
  • SPC contains negative examples, so it is suitable for training sentence classifiers

Automatic Pleonasm Detection

• Detecting the Most Redundant Word
• Validating claim 1: compare the performance of word redundancy metrics on SPC against NUCLE, given sentences known to contain one redundant word
  • 1,140 NUCLE and 1,720 SPC sentences

• Baselines
  • Xue & Hwa: a combination of fluency and word-meaning-contribution models
  • SIM: semantic similarity between the full sentence and the sentence with the target word removed
  • GEN: how general a word is, measured by its number of synonyms
  • SMP: the simplicity of a word, based on the Flesch-Kincaid readability score
  • GEC: a grammatical error correction system built on LanguageTool; we expect it to do better on NUCLE than on SPC
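To make the simplest of these concrete, here is a hedged sketch of the GEN metric using NLTK’s WordNet; the paper’s exact formulation may differ:

```python
# Hedged sketch of the GEN metric: a word's generality, approximated by
# the number of distinct synonyms across its WordNet synsets.
from nltk.corpus import wordnet as wn


def gen_score(word):
    """More general words (more synonyms) are more dispensable."""
    lemmas = {lemma.name() for synset in wn.synsets(word)
              for lemma in synset.lemmas()}
    return len(lemmas)


# The more general word of an adjacent pair is the redundancy suspect.
for word in ("plain", "pure"):
    print(word, gen_score(word))
```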

Automatic Pleonasm Detection

• Detecting the Most Redundant Word
• Validating claim 1: SPC, while small, is a better fit for the task than NUCLE

Method         NUCLE     SPC
Xue & Hwa      22.8%   31.7%
SIM            11.1%   16.6%
GEN             9.6%   13.3%
SMP            16.1%   20.6%
SIM+SMP+GEN    18.2%   27.6%
ALL            31.1%   39.4%
GEC            11.9%    4.7%

Automatic Pleonasm Detection

• Detecting Sentences with Pleonasm
• Validating claim 2: use the whole SPC to train binary classifiers

• Baselines
  • UG: one-hot unigram representation of the sentence
  • TG: one-hot representation of the trigrams of the sentence
  • TFIDF: representation of the smoothed TF-IDF tuples of the sentence
  • WSTAT: [max(ALL), avg(ALL), min(ALL), len(s), LM(s)], i.e., statistics over the word redundancy scores plus sentence length and a language-model score
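A sketch of the WSTAT feature vector; here `redundancy_scores` (the per-word ALL metric) and `lm_logprob` (a language-model scorer) are hypothetical stand-ins for the paper’s components:

```python
# Sketch of the WSTAT feature vector for a sentence s. Both
# `redundancy_scores` and `lm_logprob` are hypothetical stand-ins.
from statistics import mean


def wstat_features(sentence, redundancy_scores, lm_logprob):
    """Return [max(ALL), avg(ALL), min(ALL), len(s), LM(s)]."""
    scores = redundancy_scores(sentence)  # one ALL score per word
    return [max(scores), mean(scores), min(scores),
            len(sentence.split()), lm_logprob(sentence)]
```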

Automatic Pleonasm Detection

• Detecting Sentences with Pleonasm

• Encoding the words is more informative than statistics over the word redundancy metrics

Baseline (SPC)    MaxEnt      NB
UG                 79.2%   88.4%
TG                 79.9%   88.8%
TFIDF              83.0%   90.5%
WSTAT              63.1%   53.2%
WSTAT+UG           82.3%   89.2%
WSTAT+TG           83.7%   89.3%
WSTAT+TFIDF        84.5%   92.2%
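A minimal scikit-learn rendering of the two classifiers, with LogisticRegression standing in for MaxEnt and MultinomialNB for NB; the two sentences are placeholders for the SPC training data:

```python
# Sketch of the TFIDF sentence classifiers. For the WSTAT+TFIDF variant,
# the WSTAT features would be concatenated to the TF-IDF matrix.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

sentences = ["just plain pure fruit pulp",
             "another prime first experience"]
labels = [1, 0]  # 1 = contains a pleonasm, 0 = does not

features = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(sentences)

for model in (LogisticRegression(max_iter=1000), MultinomialNB()):
    model.fit(features, labels)
    print(type(model).__name__, model.predict(features))
```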

Conclusion

• We have introduced SPC, in which
  • each sentence contains a word pair that is potentially semantically related
  • these sentences have been reviewed by human annotators, who determine whether any of the words are redundant

• SPC offers two main contributions
  • By focusing on semantic similarity, it provides a more appropriate resource for systems that aim to detect stylistic redundancy rather than grammatical errors
  • By balancing between positive and near-miss negative examples, it allows systems to evaluate their ability to detect “no redundancy”

Acknowledgment

• This material is based upon work supported by the National Science Foundation under Grant No. 1735752

• This work was published at NAACL 2018

Thank You