Learning to Explain Non-Standard English Words and Phrases

Ke Ni and William Yang Wang
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, CA 93106 USA
{ke00@umail, william@cs}.ucsb.edu

Abstract

We describe a data-driven approach for automatically explaining new, non-standard English expressions in a given sentence, building on a large dataset that includes 15 years of crowdsourced examples from UrbanDictionary.com. Unlike prior studies that focus on matching keywords from a slang dictionary, we investigate the possibility of learning a neural sequence-to-sequence model that generates explanations of unseen non-standard English expressions given context. We propose a dual encoder approach—a word-level encoder learns the representation of context, and a second character-level encoder learns the hidden representation of the target non-standard expression. Our model can produce reasonable definitions of new non-standard English expressions given their context with certain confidence.

Figure 1: An example Tweet with a non-standard English expression. Our model aims at automatically explaining any newly emerged, non-standard expression (generating the blue box).

1 Introduction

In the past two decades, the majority of NLP research focused on developing tools for Standard English on newswire data. However, the non-standard part of the language is not well studied in the community, even though it has become more and more important in the era of social media. While we agree that one must take a cautious approach to the automatic generation of non-standard language (Hickman, 2013), for many practical purposes it is also of crucial importance for machines to be able to understand and explain this important subversion of the language.

In the NLP community, using dictionaries of non-standard language as an external knowledge source has proved useful for many tasks. For example, Burfoot and Baldwin (2009) consult slang definitions from Urban Dictionary to detect satirical news articles. Wang and McKeown (2010) show that using a slang dictionary of 5K terms can help detect vandalism of Wikipedia edits. Han and Baldwin (2011) make use of the same slang dictionary, and achieve the best performance by combining the dictionary lookup approach with word similarity and context for Twitter and SMS text normalization. However, a 5K-term slang dictionary may suffer from a coverage issue, since slang evolves rapidly in the social media age: more than 2K entries are submitted daily to Urban Dictionary (Kolt and Lazier, 2009), the largest online slang resource. Recently, Noraset et al. (2016) show that it is possible to use word embeddings to generate plausible definitions. Nonetheless, one weakness is that the definition of a word may change with context.

In contrast, we take a more radical approach: we aim at building a general-purpose sequence-to-sequence neural network model (Sutskever et al., 2014) to explain any non-standard English expression, which can be used in many NLP and social media applications. More specifically, given a sentence that includes a non-standard English expression, we aim at automatically generating the translation of the target expression. Previously, this was not possible because resources of labeled non-standard expressions were not available. In this paper, we collect a large corpus of 15 years of crowdsourced examples, and formulate the task as a sequence-to-sequence generation problem. Using a word-level model, we show that it is possible to build a general-purpose explainer of non-standard English words and phrases using neural sequence learning techniques. To summarize our contributions:

• We present a large, publicly available corpus of non-standard English words and phrases, including 15 years of definitions and examples for each entry from UrbanDictionary.com;

• We present a hybrid word-character sequence-to-sequence model that directly explains unseen non-standard expressions from social media;

• Our novel dual encoder LSTM model outperforms a standard attentive LSTM baseline, and is capable of generating plausible explanations for unseen non-standard words and phrases.

In the next section, we outline related work on non-standard expressions and social media text processing. We then introduce our dual encoder based attentive sequence-to-sequence model for explaining non-standard expressions in Section 3. Experimental results are shown in Section 4. Finally, we conclude in Section 5.

2 Related Work

The study of non-standard language is of interest to many researchers in the social media and NLP communities. For example, Eisenstein et al. (2010) propose a latent variable model to study the lexical variation of the language on Twitter, where many regional slang words are discovered. Zappavigna (2012) identifies Internet slang as an important component in the discourse of Twitter and social media. Gouws et al. (2011) provide a contextual analysis of how social media users shorten their messages. Notably, a study on Tweet normalization (Han and Baldwin, 2011) finds that, even when using a small slang dictionary of 5K words, slang makes up 12% of the ill-formed words in a Twitter corpus of 449 posts. In the NLP community, slang dictionaries are widely used in many tasks and applications (Burfoot and Baldwin, 2009; Wang and McKeown, 2010; Rosenthal and McKeown, 2011). However, we argue that a small, fixed-size dictionary approach may be suboptimal: it suffers from the low-coverage problem, and keeping such a dictionary up to date is expensive and time-consuming. To the best of our knowledge, our work is the first to build a general-purpose machine learning model for explaining non-standard English terms, using a large crowdsourced dataset.

3 Our Approach

3.1 Sequence-to-Sequence Model

Since our goal is to automatically generate explanations for any non-standard English expression, we select sequence-to-sequence models with attention mechanisms as our fundamental framework (Bahdanau et al., 2014), which can produce abstractive explanations and assign different weights to different parts of a sentence. To model both the context words and the non-standard expression, we propose a hybrid word-character dual encoder. An overview of our model is shown in Figure 2.

3.2 Context Encoder

Our context encoder is a recurrent neural network with long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). An LSTM consists of input gates, forget gates, and output gates, together with hidden units as its internal memory. Here, i controls the impact of new input, f is a forget gate, o is an output gate, \tilde{C}_t is the candidate new value, h is the hidden state, and m is the cell memory state. "\odot" denotes the element-wise vector product. The gates, memory, and hidden state are defined as:

i_t = \sigma(W_i [x_t, h_{t-1}])
f_t = \sigma(W_f [x_t, h_{t-1}])
o_t = \sigma(W_o [x_t, h_{t-1}])
\tilde{C}_t = \tanh(W_c [x_t, h_{t-1}])
m_t = m_{t-1} \odot f_t + i_t \odot \tilde{C}_t
h_t = m_t \odot o_t

At each step, the RNN is given a vector as input, updates its hidden state, and produces an output from its last layer.
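As a concrete illustration, one LSTM step following the equations above can be sketched in a few lines of NumPy. This is a minimal, bias-free sketch mirroring the formulas only; the function and weight names are ours, not from the paper's released code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, m_prev, W_i, W_f, W_o, W_c):
    """One LSTM step; each gate reads the concatenation [x_t, h_{t-1}]."""
    z = np.concatenate([x_t, h_prev])
    i_t = sigmoid(W_i @ z)               # input gate
    f_t = sigmoid(W_f @ z)               # forget gate
    o_t = sigmoid(W_o @ z)               # output gate
    c_tilde = np.tanh(W_c @ z)           # candidate new value
    m_t = m_prev * f_t + i_t * c_tilde   # cell memory (element-wise products)
    h_t = m_t * o_t                      # hidden state
    return h_t, m_t

# Run a toy context encoder over 5 random input vectors.
rng = np.random.default_rng(0)
dim_x, dim_h = 4, 3
W_i, W_f, W_o, W_c = (rng.normal(size=(dim_h, dim_x + dim_h)) for _ in range(4))
h, m = np.zeros(dim_h), np.zeros(dim_h)
for x_t in rng.normal(size=(5, dim_x)):
    h, m = lstm_step(x_t, h, m, W_i, W_f, W_o, W_c)
print(h.shape)  # (3,)
```

In practice the encoder also carries learnable bias terms and is unrolled by the framework; the sketch only shows the recurrence itself.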
Hidden states and outputs are stored and later passed to the decoder, which produces final outputs based on these hidden states and outputs, together with the decoder's own parameters. The context encoder thus learns and encodes sentence-level information for the decoder to generate explanations.

Figure 2: Dual Encoder Structure. Top left: a word-level LSTM encoder for modeling context. Bottom left: a character-level LSTM encoder for modeling the target non-standard expression. Right: an LSTM decoder for generating explanations.

3.3 Attention Mechanism

The idea of an attention mechanism is to select the key focuses in the sentence. More specifically, we would like to pay attention to specific parts of the encoder outputs and states. We follow a recent study (Vinyals et al., 2015) to set up the attention mechanism, with a separate LSTM for decoding. Here we briefly explain the method. When the decoder starts its decoding process, it has all hidden units from the encoder, denoted (h_1 .. h_{T_1}). We denote the hidden states of the decoder as (d_1 .. d_{T_2}), where T_1 and T_2 are the input and output lengths. At each time step, the model computes new weighted hidden states based on the encoder states and three learnable components v, W'_1, and W'_2:

u_i^t = v^\top \tanh(W'_1 h_i + W'_2 d_t)
a_i^t = \mathrm{softmax}(u_i^t)
d'_t = \sum_{i=1}^{T_1} a_i^t h_i

Here a^t denotes the attention weights, and u^t has length T_1. After the model computes d'_t, it concatenates d'_t and d_t to form the new hidden state used for prediction and the next update.
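The attention computation described in Section 3.3 can be sketched as follows. This is an illustrative NumPy version of the three equations; names such as `attend`, `W1p`, and `W2p` are ours:

```python
import numpy as np

def attend(encoder_h, d_t, v, W1p, W2p):
    """Attention weights a^t over the T1 encoder states h_i, plus the
    weighted state d'_t = sum_i a^t_i h_i."""
    # u^t_i = v^T tanh(W'_1 h_i + W'_2 d_t), one score per encoder position
    u_t = np.array([v @ np.tanh(W1p @ h_i + W2p @ d_t) for h_i in encoder_h])
    a_t = np.exp(u_t - u_t.max())
    a_t /= a_t.sum()                 # softmax over the T1 positions
    d_prime = a_t @ encoder_h        # weighted sum of encoder states
    return a_t, d_prime

rng = np.random.default_rng(1)
T1, dim = 6, 4
encoder_h = rng.normal(size=(T1, dim))   # (h_1 .. h_T1)
d_t = rng.normal(size=dim)               # current decoder state
v = rng.normal(size=dim)
W1p, W2p = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
a_t, d_prime = attend(encoder_h, d_t, v, W1p, W2p)
print(round(a_t.sum(), 6))  # 1.0
```

The resulting d'_t would then be concatenated with d_t before the output projection, as described above.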

3.4 Dual Encoder Structure

Given a single context encoder, it is challenging for any decoder to generate the explanation for an instance: there may be multiple non-standard expressions in the sentence, which confuses the decoder as to which one it should explain. In practice, the user can often pinpoint the exact expression he or she does not know, so in this work we design a second encoder to learn the representation of the target non-standard expression. Since our goal is to explain any non-standard English words and phrases, we use a character-level encoder for this purpose. This second encoder reads the embedding vector of each character at each time step, and produces an output vector and hidden states. Our dual encoder model linearly combines the hidden states of the character-level target expression encoder with the hidden states of the word-level context encoder:

h_{new} = h_1 W_1 + h_2 W_2 + B

Here, h_1 is the context representation and h_2 is the target expression representation. The bias B and the combination weights W_1 and W_2 are learnable.
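Treating the two encoders' hidden states h_1 and h_2 as row vectors, the linear combination above amounts to the following sketch (dimensions are illustrative assumptions; the two encoders may use different sizes):

```python
import numpy as np

def combine_states(h1, h2, W1, W2, B):
    """h_new = h1 W1 + h2 W2 + B: project the word-level context state h1
    and the character-level target state h2 into a shared space."""
    return h1 @ W1 + h2 @ W2 + B

rng = np.random.default_rng(2)
d_word, d_char, d_new = 8, 5, 8
h1 = rng.normal(size=d_word)             # word-level context representation
h2 = rng.normal(size=d_char)             # char-level target representation
W1 = rng.normal(size=(d_word, d_new))    # learnable combination weights
W2 = rng.normal(size=(d_char, d_new))
B = rng.normal(size=d_new)               # learnable bias
h_new = combine_states(h1, h2, W1, W2, B)
print(h_new.shape)  # (8,)
```

Because W_1 and W_2 are learned, the model can weight the context and the target expression differently when forming the state passed to the decoder.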
4 Experiment

4.1 Dataset

We collect a non-standard English corpus from Urban Dictionary (UD, www.urbandictionary.com)—the largest online slang dictionary, where terms, definitions, and examples are submitted by the crowd. We have released the dataset for research usage at http://www.cs.ucsb.edu/~william/data/slang_ijcnlp.zip. UD is made a reliable resource by the quality control of its publishing procedure. To prevent vandalism, a user must have a Facebook or Gmail account, and when a user submits an entry, the UD editors vote "Publish" or "Don't Publish" (Lloyd, 2011). Each editor is also distinguished by IP address, and HTTP cookies are used to prevent editors from cheating. In recent years, the United States federal government has consulted UD for the definition of "murk" in a threat case (Jackson, 2011), UD has been referred to in a financial restitution case in Wisconsin (Kaufman, 2013), and it has been used to determine appropriate license plates in Las Vegas (Davis, 2011).

A total of 421K entries (words and phrases) from the period 1999-2014 are collected. Each entry includes a list of candidate definitions and examples, as well as the surface form of the target term. Using the UD API, we can pinpoint the token positions of the words/phrases, and obtain the ground-truth labels for tagging.

Our training and testing data use all the examples in an entry (a non-standard expression). We randomly select 371,028 entries for training, resulting in 907,624 sequence pairs of instance and reference explanation. The test set includes 50,000 entries and 61,330 sentences. Note that no testing target non-standard expressions, instances, or examples overlap with those in the training dataset.

Model (w. attention)    hidden units    BLEU-1    BLEU-2
single encoder          1024            21.06     2.1
small dual encoder      512             21.84     2.2
large dual encoder      1024            24.58     2.37
full char-level model   256             21.13     1.8

Table 1: BLEU scores for explaining non-standard English words and phrases on the test dataset.

Instance: "dude taht roflcopter was pretty loltastic!!1!"
Target: loltastic
Reference: something funnyishly fantastic.
Generated Explanation (Single): a really cool , amazing , and good looking .
Generated Explanation (Dual): a word that is extremely awesome .

Instance: "danny is so jelouse of my work!"
Target: jelouse
Reference: how unintelligent people who think they are better than someone spells "jealous".
Generated Explanation (Single): your friend ' s way of saying "
Generated Explanation (Dual): a word used to describe a situation , or a person who is a complete idiot .

Instance: "that sir right there, is being quite adoucheous"
Target: adoucheous
Reference: a person acting in a conformative manner that causes social upset and violence.
Generated Explanation (Single): when a male is being a male and a male .
Generated Explanation (Dual): the act of being a douchebag .

Figure 3: Some generated explanations from our system.
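The pairing of examples with reference definitions described in Section 4.1 can be illustrated with a small sketch. The entry schema below is a hypothetical simplification for illustration, not the released dataset's actual format:

```python
def build_pairs(entries):
    """For each UD-style entry, pair every example sentence that contains
    the target expression with the entry's definition."""
    pairs = []
    for entry in entries:
        target = entry["word"].lower()
        for example in entry["examples"]:
            if target in example.lower():   # pinpoint the target in context
                pairs.append({"instance": example,
                              "target": entry["word"],
                              "reference": entry["definition"]})
    return pairs

# Toy entry modeled on the first example in Figure 3.
entries = [{"word": "loltastic",
            "definition": "something funnyishly fantastic.",
            "examples": ["dude taht roflcopter was pretty loltastic!!1!"]}]
pairs = build_pairs(entries)
print(len(pairs))  # 1
```

In the real pipeline the token positions come from the UD API rather than a substring match, and one entry can contribute many (instance, explanation) sequence pairs.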
4.2 Experimental Settings

Our implementation is based on TensorFlow (www.tensorflow.org). For input embeddings, we randomly initialize the word vectors. We use stochastic gradient descent with an adaptive learning rate to optimize the cross-entropy loss, and we use BLEU scores (Papineni et al., 2002) for evaluation.

4.3 Quantitative and Qualitative Results

Quantitative experimental results are shown in Table 1. Here we compare the performance of our proposed large dual encoder model to a single context encoder word-level attentive sequence-to-sequence model, as well as a full character-level context encoder model. We use 256 hidden units for the full character-level model, because that is the largest setting that fits our Titan X Pascal GPU; character-level models have longer input sequences, which is one of their drawbacks compared with word-level models.

We see that the single encoder and full character-level context models do not offer the best empirical performance on this dataset. Our novel dual encoder method, which combines the strengths of word-level and character-level sequence-to-sequence models, obtains the best performance.

For qualitative analysis, we provide some generated explanations in Figure 3. For example, the first target non-standard expression is the word "loltastic", which is a combination of the words "lol" and "fantastic". To explain this composite non-standard expression, a character-level encoder is needed, and the generated explanation from our dual encoder approach clearly makes more sense than the single encoder result.

Overall, the dual encoder can explain words with more confidence partly because it knows which words it needs to explain, especially for sentences containing multiple non-standard English words and phrases. Our model can also accurately explain many known acronyms. We also note that LSTM cells outperform gated recurrent units (GRUs) (Cho et al., 2014), and that the attention mechanism improves performance.

5 Conclusion

In this paper, we introduce the new task of learning to explain newly emerged, non-standard English words and phrases. To do this, we collected 15 years of UrbanDictionary data, and designed a dual encoder attentive sequence-to-sequence model to learn the hidden context representation and the hidden non-standard expression embedding. We showed that combining word-level and character-level models improves performance on this task.

References

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Clint Burfoot and Timothy Baldwin. 2009. Automatic satire detection: Are you having a laugh? In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers. Association for Computational Linguistics, pages 161–164.

Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. Syntax, Semantics and Structure in Statistical Translation, page 103.

Johnny Davis. 2011. In praise of urban dictionaries. The Guardian.

Jacob Eisenstein, Brendan O'Connor, Noah A. Smith, and Eric P. Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1277–1287.

Stephan Gouws, Donald Metzler, Congxing Cai, and Eduard Hovy. 2011. Contextual bearing on linguistic variation in social media. In Proceedings of the Workshop on Languages in Social Media. Association for Computational Linguistics, pages 20–29.

Bo Han and Timothy Baldwin. 2011. Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, pages 368–378.

Leo Hickman. 2013. Why IBM's Watson supercomputer can't speak slang. The Guardian.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Joe Jackson. 2011. Feds check Urban Dictionary to crack gun-store death threat. TIME.

Leslie Kaufman. 2013. For the word on the street, courts call up an online witness. The New York Times.

Leah Kolt and Matt Lazier. 2009. Alumni in the news: Summer & fall 2009. Cal Poly Magazine.

JoAnn Lloyd. 2011. Alum Aaron Peckham's Urban Dictionary redefines language. Cal Poly Magazine.

Thanapon Noraset, Liang Chen, Larry Birnbaum, and Doug Downey. 2016. Definition modeling: Learning to define word embeddings in natural language. In AAAI, pages 3259–3266.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 311–318.

Sara Rosenthal and Kathleen McKeown. 2011. Age prediction in blogs: A study of style, content, and online behavior in pre- and post-social media generations. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, pages 763–772.

Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.

Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. 2015. Grammar as a foreign language. In Advances in Neural Information Processing Systems, pages 2773–2781.

William Yang Wang and Kathleen McKeown. 2010. "Got you!": Automatic vandalism detection in Wikipedia with web-based shallow syntactic-semantic modeling. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010).

Michele Zappavigna. 2012. Discourse of Twitter and Social Media: How We Use Language to Create Affiliation on the Web. A&C Black.