Predicting Word Clipping with Latent Semantic Analysis

Predicting Word Clipping with Latent Semantic Analysis Julian Brooke Tong Wang Graeme Hirst Department of Computer Science University of Toronto jbrooke,tong,gh @cs.toronto.edu { } Abstract Compared to many near-synonyms, clipped forms have the important property that the dif- In this paper, we compare a resource- ferences between full and abbreviated forms are driven approach with a task-specific clas- almost entirely connotational or stylistic, closely sification model for a new near-synonym tied to the formality of the discourse.1 This fact word choice sub-task, predicting whether allows us to pursue two distinct though related a full or a clipped form of a word will be approaches to this task, comparing a supervised used (e.g. doctor or doc) in a given con- model of word choice (Wang and Hirst, 2010) with text. Our results indicate that the resource- a mostly unsupervised system that leverages an driven approach, the use of a formality automatically-built lexicon of formality (Brooke lexicon, can provide competitive perfor- et al., 2010). Our findings indicate that the mance, with the parameters of the task- lexicon-based method is highly competitive with specific model mirroring the parameters the supervised, task-specific method. Both mod- under which the lexicon was built. els approach the human performance evidenced in an independent crowdsourced annotation. 1 Introduction Lexical resources, though the focus of much work 2 Methods in computational linguistics, often compare poorly to direct statistical methods when applied to prob- 2.1 Latent Semantic Analysis lems such as sentiment classification (Kennedy and Inkpen, 2006). Nevertheless, such resources Both approaches that we are investigating make offer advantages in terms of human interpretability use of Latent Semantic Analysis (LSA) as and portability to many different tasks (Argamon a dimensionality-reduction technique (Landauer et al., 2007). In this paper, we introduce a new and Dumais, 1997).2 In LSA, the first step is to sub-task of near-synonym word choice (Edmonds create a matrix representing the association be- and Hirst, 2002), prediction of word clipping, in tween words as determined by their co-occurrence order to provide some new evidence that resource- in a corpus, and then apply singular value decom- driven approaches have general potential. position (SVD) to identify the first k most signifi- Clipping is a type of word formation where cant dimensions of variation. After this step, each the beginning and/or the end of a longer word word can be represented as a vector of length k, is omitted (Kreidler, 1979). This phenomenon is which can be compared or combined with the vec- attested in various languages; well-known exam- tors of other words. The best k is usually deter- ples in English include words such as hippo (hip- mined empirically. For a more detailed introduc- popotamus) and blog (weblog). Clipping and re- tion to this method, see also the discussion by Tur- lated kinds of word formation have received at- ney and Littman (2003). tention in computational linguistics with respect to the task of identifying source words from ab- 1Shortened forms might also be preferred in cases where breviated forms, which has been studied, for in- space is at a premium, e.g. newspaper headlines or tweets. stance, in the biomedical and text messaging do- 2Note that neither technique is feasible using the full co- occurrence vectors, which have several hundred thousand di- mains (Okazaki and Ananiadou, 2006; Cook and mensions in both cases; in addition, previous work has shown Stevenson, 2009). that performance drops off with increased dimensionality. 1392 Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1392–1396, Chiang Mai, Thailand, November 8 – 13, 2011. c 2011 AFNLP 2.2 Classifying Context Vectors tion of the final formality score involves several normalization steps, and therefore a full discus- Our first method is the lexical choice model pro- sion is precluded here for space reasons; for the posed by Wang and Hirst (2010). This approach details, please see Brooke et al. (2010). Our eval- performs SVD on a term–term co-occurrence ma- uation suggests that, given a large-enough blog trix, which has been shown to outperform tradi- corpus, this method almost perfectly distinguishes tional LSA models that use term–document co- words of extreme formality, and is able to identify occurrence information. Specifically, a given word the more formal of two near-synonyms over 80% w is initially represented by a vector v of all its of the time, better than a word-length baseline. co-occurring words in a small collocation context Given a lexicon of formality scores, the pre- (a 5-word window), i.e., v = (v1,...,vn), where n ferred form for a context is identified by averaging is the size of the vocabulary, and vi = 1 if w co- the formality scores of the words in the context occurs with the i-th word in lexicon, or vi = 0 oth- erwise. The dimensionality of the original vector and comparing the average score to a cutoff value. is then reduced by SVD. Here, the context is generally understood to be the entire text, though we also test smaller contexts. A context, typically comprising a set of words We take the cutoff to be midpoints of the aver- within a small collocation context around the tar- age scores for the contexts of known instances; al- get word for prediction (though we test larger though technically supervised, we have found that contexts here), is represented by a weighted cen- in practice just a few instances is enough to find a troid of the word vectors. Together with the can- stable, high-performing cutoff. Note that the cut- didate words for prediction, this context vector off is analogous to the decision hyperplane of an can then be used as a feature vector for super- SVM. In our case, building a lexical resource cor- vised learning; we follow Wang and Hirst in us- responds to additional task-independent reduction ing support vector machines (SVMs) as imple- in the dimensionality of the space, greatly simpli- mented in WEKA (Witten and Frank, 2005), train- fying the decision. ing a separate classifier for each full/clipped word form pair. The prediction performance varies by k, 3 Resources which can be tested efficiently by simply truncat- ing a single high-k vector to smaller dimensions. Blog data is an ideal resource for this task, since The optimal k value reported by Wang and Hirst it clearly contains a wide variety of language reg- testing on a standard set of seven near-synonyms isters. For our exploration here, we used a col- was 415; they achieved an accuracy of 74.5%, an lection of over 900,000 blogs (216 million tokens) improvement over previous statistical approaches, originally crawled from the web in May 2008. We e.g. Inkpen (2007). segmented the texts, filtered out short documents (less than 100 words), and then split the corpus 2.3 Using Formality Lexicons into two halves, training and testing. For each of The competing method involves building lexicons the two methods described in the previous section, of formality, using our method from Brooke et al. we derived the corresponding LSA-reduced vec- (2010), which is itself an adaption of an approach tors for all lower-case words using the collocation 3 used for sentiment lexicon building (Turney and information contained within the training portion. Littman, 2003). Though it relies on LSA, there are The testing portion was used only as a source for several key differences as compared to the context test contexts. vector approach. First, the pre-LSA matrix is a We independently collected a set of common binary word–document matrix, rather than word– full/clipped word pairs from web resources such as word. For the LSA step, we showed that a very Wikipedia, limiting ourselves to phonologically- low k value (20) was appropriate choice for iden- realized clippings. This excludes orthographic tifying variation in formality. After dimensional- shortenings like thx or ppl which cannot be pro- ity reduction, each word vector is compared, using 3We used the same dataset for each method so that the cosine similarity, to words from two sets of seed difference in raw co-occurence information available to each terms, each representing prototypical formal and method was not a confounding factor. However, we also tested the lexicon method using the full formality lexicon informal words, which provides a formality score from Brooke et al. (2010), built on the larger ICWSM blog for each word in the range of 1 to 1. The deriva- corpus; the difference in performance was negligible. − 1393 nounced. We also removed pairs where one of tween “Probably” and “Definitely”. We queried the words was quite rare (fewer than 150 tokens for five different judgments per test case in our in the entire corpus) or where, based on examples test corpus, and took the majority judgment as the pulled from the corpus, there was a common con- standard, or “I’m not sure” if there was no major- founding homonym–for instance the word prob, ity judgment. which is a common clipped form of both prob- lem and probably. However, we did keep words 4 Evaluation like doc, where the doctor sense was much more First, we compare our crowdsourced annotation common than the document sense. After this fil- to our writer’s choice gold standard, which pro- tering, 38 full/clipped word pairs remained in our vides a useful baseline for the difficulty of the set. For each pair, we automatically extracted a task. The agreement is surprisingly low; even if sample of usage contexts from texts in the cor- “I’m not sure” responses are discounted, agree- pus where only one of the two forms appears.

Predicting Word Clipping with Latent Semantic Analysis

Lexical Resource Reconciliation in the Xerox Linguistic Environment

Wordnet As an Ontology for Generation Valerio Basile

Modeling and Encoding Traditional Wordlists for Machine Applications

TEI and the Documentation of Mixtepec-Mixtec Jack Bowers

Unification of Multiple Treebanks and Testing Them with Statistical Parser with Support of Large Corpus As a Lexical Resource

Instructions for ACL-2010 Proceedings

The Interplay Between Lexical Resources and Natural Language Processing

Exploring Complex Vowels As Phrase Break Correlates in a Corpus of English Speech with Proposel, a Prosody and POS English Lexicon

Methods for Efficient Ontology Lexicalization for Non-Indo

Maurice Gross' Grammar Lexicon and Natural Language Processing

Using Relevant Domains Resource for Word Sense Disambiguation

Using a Lexical Semantic Network for the Ontology Building Nadia Bebeshina-Clairet, Sylvie Despres, Mathieu Lafourcade