
TFW, DamnGina, Juvie, and Hotsie-Totsie: On the Linguistic and Social Aspects of Internet Slang

Vivek Kulkarni
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, California
[email protected]

William Yang Wang
Department of Computer Science
University of California, Santa Barbara
Santa Barbara, California
[email protected]

arXiv:1712.08291v1 [cs.CL] 22 Dec 2017

This is the author's draft of the work. It is available here for personal use. Not for redistribution. © 2018 Copyright held by the owner/author(s).

ABSTRACT
Slang is ubiquitous on the Internet. The emergence of new social contexts like micro-blogs, question-answering forums, and social networks has enabled slang and non-standard expressions to abound on the web. Despite this, slang has been traditionally viewed as a form of non-standard language – a form of language that is not the focus of linguistic analysis and has largely been neglected. In this work, we use UrbanDictionary to conduct the first large-scale linguistic analysis of slang and its social aspects on the Internet, to yield insights into this variety of language that is increasingly used all over the world online. First, we begin by computationally analyzing the phonological, morphological and syntactic properties of slang in general. We then study linguistic patterns in four specific categories of slang, namely alphabetisms, blends, clippings, and reduplicatives. Our analysis reveals that slang demonstrates extra-grammatical rules of phonological and morphological formation that markedly distinguish it from the standard form, shedding insight into its generative patterns. Second, we follow up by analyzing the social aspects of slang, where we study subject restriction and stereotyping in slang usage. Analyzing tens of thousands of such slang words reveals that the majority of slang on the Internet belongs to two major categories: sex and drugs. Further analysis reveals that not only does slang demonstrate prevalent gender and religious stereotypes, but also an increased bias where even names of persons are associated with higher sexual prejudice than one might encounter in the standard form.

In summary, our work suggests that slang exhibits linguistic properties and lexical innovation that strikingly distinguish it from the standard variety. Moreover, we note that slang usage is not only not immune to prevalent social biases and prejudices but also reflects such biases and stereotypes more intensely than the standard variety.

KEYWORDS
Natural Language Processing, Social Media Analysis, Computational Social Science

1 INTRODUCTION
The Internet is global, diverse, and dynamic, all of which are reflected in its language. One aspect of this diversity and dynamism is the abundance of slang and non-standard varieties – an aspect which has traditionally received little linguistic attention. In fact, Labov [27] argues that all articles focusing on slang should be assigned to an "extra-linguistic darkness". According to Eble [16], while the development of socio-linguistics has legitimized the study of slang, slang does not naturally fit into the controlled framework of socio-linguistics, where correlations between social factors like age, gender and ethnicity and language use are studied, and is better explained through the lens of social connections (like personal kinship). In other words, slang is firmly grounded in social connections and contexts enabling "group identity". However, the evolution of the Internet and social media has radically transformed these social contexts [11, 16]. First, slang is no longer restricted to oral communication but is now widely prevalent on the Internet in written form. Second, the notion of a group associated with slang now extends to a network that is increasingly global, where members are not necessarily associated by friendship or face-to-face interactions but are participants in new and evolving social contexts like forums and micro-blogs [11, 16]. Consequently, Eble [16] claims that "slang is now world-wide the vocabulary of choice of young people (who compose the majority of inhabitants of the earth) and reflects their tastes in music, art, clothing and leisure time pursuits", and further argues that its study has fallen behind since it is not an integral part of socio-linguistics.

Here, we address this gap and conduct the first large-scale computational analysis of slang on the Internet using UrbanDictionary, addressing both linguistic as well as social aspects.

First, we characterize linguistic patterns dominant in the formation of slang and provide supporting evidence of its distinctiveness from the standard variety, not only supporting claims made by Mattiello [30] but also revealing new insights into the phonological, morphological and syntactic patterns evident in slang through a large-scale quantitative analysis. Figure 1 illustrates some of these patterns. hmu is an Alphabetism, more specifically an Initialism for hit me up. Blends are formed by mixing parts of existing words.1 For example, loltastic is a blend of lol and fantastic. Other examples include words like netizen, infotainment, frenemy, and bootylicious. While Blends are formed by the mixture of two or more words, Clippings are formed by shortening the original word. stache is an example, since it is a shortened form of mustache. Finally, Reduplicatives, also called echo words, consist of word pairs where the first word is phonetically similar to the second word. Examples are teenie-weenie, artsy-fartsy, boo-boo, and chick-flick. While these patterns are not exhaustive, they suggest that slang exhibits rich and varied linguistic formation patterns. We therefore conduct an in-depth investigation into the phonological, morphological and syntactic properties of slang words and contrast them with the standard form, yielding insights into the linguistic mechanisms at play in slang formation (see Section 3).

Figure 1: Sample depicting some of the linguistic features in slang on the Internet. Note the presence of simple alphabetisms as well as more complex phenomena like blending (loltastic).

1 These words are also called portmanteaus.

Second, we analyze the social aspects associated with slang. In line with Eble [16], who claims "One of the greatest challenges to scholarship in slang is to fit slang into the current conversations going on in about such topics as identity, power, community, stereotyping, discrimination and the like", we investigate two social aspects of slang: (a) Subject Restriction – slang is strongly associated with certain subjects like Sex, Drugs or Food. We study this aspect by proposing a model to classify slang words into a set of 10 pre-defined categories to reveal dominant subjects, and (b) Stereotyping – we analyze and quantify biases and stereotypes evident in slang usage and show evidence of gender stereotypes, sexual and religious prejudices – prejudices that are much more extreme in slang than those observed in the standard form (see Section 4).

In a nutshell, therefore, our contributions are:
(1) Linguistic Aspects of Slang: We analyze the phonological, morphological and syntactic properties of slang and contrast them with those found in the standard form, revealing insights into patterns governing slang formation.
(2) Social Aspects of Slang: We analyze two social aspects associated with slang: (a) Subject Restriction and (b) Stereotypes and prejudices reflected in slang usage online.
Altogether, our results shed light on both the linguistic and social aspects governing slang formation and usage on the Internet.

2 DATASETS
Slang Data Set. In discussions of slang, one controversial issue has always been its definition. Mattiello [30] notes that there are multiple views to characterize slang. One dominant view (a sociological view) adopted by [17, 36] is that it primarily enables one to identify with a specific group/cohort. Others like [38] adopt a more stylistic view and are of the opinion that slang should be categorized according to attitudes that vary from "the casual to the vulgar". Still others emphasize the novelty of slang and characterize slang as a variety of language that is inclined towards lexical innovation [15, 34, 43]. In our work, we adopt a broad definition of slang which includes non-standard expressions (words) and emphasizes both the sociological viewpoint as well as the viewpoint of lexical innovation and novelty.

We constructed a large data set of slang by scraping the popular online slang dictionary UrbanDictionary as of July 2017. For each word, we obtain the top 10 definitions (when available) as well as their example usage and vote counts. We remove very rare words and short-lived trends by considering only instances with at least 100 votes (weeding out rare/noisy slang like jeff cohoon, which has no votes, or short-lived trends such as Naimbia, which was added after President Trump mistook it for a country). This yields a dataset of 128,807 words and 328,170 definitions spanning the time period 1999-2017, leaning towards relatively long-lived slang words rather than short-lived fads/phenomena. We note that even though Urban Dictionary was created in 1999, the majority of the word entries were introduced after 2005, which interestingly coincides with the evolution of social media like Facebook and Twitter. Finally, we observed that the fraction of single-word slang is slightly higher than 50%, suggesting that a significant fraction of slang consists of multiple words or phrases, in contrast with standard English, where almost all dictionary entries are exclusively single words or two-word phrases.

Standard English Data Set (SE). To contrast slang with the more standard form of English, we consider a list of 67,713 words released by the Dictionary Challenge [24], which consists of words and their definitions obtained from electronic resources like Webster's, WordNet, and The Collaborative International Dictionary of English, and pre-dominantly contains words in standard usage (including technical jargon, scientific terms etc).

3 LINGUISTIC ASPECTS OF SLANG
Having described our datasets, we now proceed to analyze several linguistic aspects of slang words and contrast them with those in Standard English (SE).

3.1 Phonology
Mattiello [29] suggests that slang incorporates broad phonological phenomena like echoisms, mis-pronunciations and assimilation. However, little is known about specific phonological patterns and properties of slang, where lexical innovation is so rampant. What are the phonological properties of slang and how do they differ from those of words in Standard English?

To analyze phonological properties, we obtain the phoneme representation of each word using G2PSeq2Seq [46], a neural pre-trained model for the task of grapheme (letter) to phoneme conversion trained on the CMU Pronouncing Dictionary. To illustrate, the phoneme representation for woody is W UH D IY. Each phoneme is also associated with one of the 8 articulation manners shown in Table 1.

Are certain phonemes over-represented in slang? To answer this question, we estimate the distribution over phonemes for words in our data-sets and rank the phonemes in descending order of their odds ratio (between slang and Standard English). Figure 2 shows the top 5 phonemes in slang that are over-represented. In particular, note the presence of phonemes like W, G and Z, which are used less frequently in Standard English. Examples of slang that use these phonemes are zazzed, zucker, pizzle, fonzie, woodie, and faggoth, suggesting evidence of phonological variation between slang and Standard English.

Figure 2: Top 5 most over-represented phonemes in slang. Note the over-representation of the Z phoneme, as several slang words like zucker and jizz use phonemes like Z.
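The odds-ratio ranking described above can be sketched in a few lines. This is a minimal illustration: the toy pronunciation dictionaries below are hand-written stand-ins for the output of the G2P-Seq2Seq model, and the add-one smoothing constant is our own assumption.

```python
from collections import Counter

def phoneme_odds_ratios(slang_prons, standard_prons, smoothing=1.0):
    """Rank phonemes by their odds ratio between slang and Standard English.

    Each argument maps a word to its phoneme sequence (as produced by a
    grapheme-to-phoneme model). Add-one smoothing keeps the ratio finite
    for phonemes unseen in one of the two corpora.
    """
    slang_counts = Counter(p for pron in slang_prons.values() for p in pron)
    std_counts = Counter(p for pron in standard_prons.values() for p in pron)
    vocab = set(slang_counts) | set(std_counts)
    slang_total = sum(slang_counts.values()) + smoothing * len(vocab)
    std_total = sum(std_counts.values()) + smoothing * len(vocab)
    ratios = {}
    for ph in vocab:
        p_slang = (slang_counts[ph] + smoothing) / slang_total
        p_std = (std_counts[ph] + smoothing) / std_total
        ratios[ph] = p_slang / p_std
    # Descending order: most over-represented phonemes in slang first.
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Toy pronunciations (hypothetical; real ones come from the G2P model).
slang = {"woody": ["W", "UH", "D", "IY"], "zazzed": ["Z", "AE", "Z", "D"]}
standard = {"cat": ["K", "AE", "T"], "dog": ["D", "AO", "G"]}
ranking = phoneme_odds_ratios(slang, standard)
```

In this toy example the phoneme Z, which appears only in the slang list, surfaces at the top of the ranking.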
Bonferroni with differences proportion for z-test using W ons nsagta tnadEgihadas oea note also and English standard than slang in points) hmad hly hclt hra chadzing, charva, chicklet, chelly, chemtard, S Stop a Slang (a) sin as etc. G osmaie u nlssrvasta phonologi- that reveals analysis our summarize, To and Phoneme ,Y W, HH JH CH, R L, NG N, M, UW IY,OW, UH, IH, OY, EY, ER, EH, AW, AY, AO, ZH AH, AE, Z, V, AA, TH, SH, S, F, DH, T P, K, G, D, B, Phonemes diinly entdn ttsial significant statistically no noted we Additionally, . hnmsby phonemes ,G ,U n CH and UW Z, G, W, selfy Z 7 Affricates . 2% ,

Table 1: Manners of Articulation for phonemes as per CMUDict.
    Manner      Phonemes
    Stop        B, D, G, K, P, T
    Fricative   DH, F, S, SH, TH, V, Z, ZH
    Vowel       AA, AE, AH, AO, AW, AY, EH, ER, EY, IH, IY, OW, OY, UH, UW
    Nasal       M, N, NG
    Liquid      L, R
    Affricate   CH, JH
    Aspirate    HH
    Semivowel   W, Y

Does slang differ in manners of articulation? We now turn our attention to the distributions of articulation manners obtained for the first and final phonemes in both slang and Standard English (SE); Figure 3 shows these distributions. First, consider the distribution of the first phoneme (Figures 3a and 3b). We observe the following: (a) In slang, Fricatives are more common as the first phoneme than in Standard English. We explain this by noting that several slang words begin their pronunciation using fricative phonemes like SH and Z (as in schmammered). (b) Second, the proportion of words in slang whose first phoneme is a vowel is lower than in Standard English. (c) Finally, words which begin with the affricate phonemes CH and JH are more common in slang than in Standard English; examples of such slang are chemtard, chelly, chicklet, charva, chadzing, juvy, and jerkweed. We now turn our attention to the manner distribution of the final phoneme (Figures 3c and 3d). Observe the difference in proportions between slang and Standard English: the final phoneme is more often a stop in slang, with a corresponding decrease in the final phoneme being a vowel.

Conclusion: To summarize, our analysis reveals that the phonological properties of slang are markedly different from Standard English: certain phonemes like W, G, Z, UW and CH are over-represented in slang, and the distributions of articulation manners for the first and final phonemes differ from the standard form.

Figure 3: Articulation Manners in slang and Standard English (SE). (a) First phoneme in slang. (b) First phoneme in SE. (c) Final phoneme in slang. (d) Final phoneme in SE. Observe significant differences in slang for the first and final phoneme when contrasted with the standard form; for example, Fricatives as the first phoneme are more common in slang than in Standard English.

3.2 Morphology
While morphological patterns and derivations are widely studied for standard forms of English [2, 41], Mattiello [30] observes that many formations of slang are far displaced from the regular word formation patterns and thus have largely been ignored, since they are extra-grammatical [17]. Consequently, [30] claims that studying the expressive morphological characteristics of slang can shed light not only on the creative process of language formation but also provide insight into its sociological impact. How does slang differ in its morphological forms from words in Standard English?

3.2.1 Analysis of Morphological Patterns. To answer this, we decompose words into their corresponding morphemes by learning a model to segment words into morphemes using Morfessor [45].3 Figure 4 shows the top 25 most common prefixes and suffixes in both slang and standard English. We immediately make the following observations. First, slang demonstrates a heavier tail than Standard English: the top 25 prefixes account for a markedly smaller share of the total mass in slang than the top 25 prefixes do in Standard English, suggesting a rich and more varied word formation in slang. Second, note the much higher mass assigned to non-standard prefixes in slang like nigger, fuck, irish, man, black, ass, white, shit, cock and sex; similar observations are noted in the case of morphological suffixes, noting the presence of non-standard suffixes like school, man, out, girl, sex, fuck, head and ass.

3 We use the default set of parameters for Morfessor.

Figure 4: Morphological differences between slang and Standard English (SE). (a) Top 25 prefixes in slang. (b) Top 25 prefixes in SE. (c) Top 25 suffixes in slang. (d) Top 25 suffixes in SE. Slang exhibits quite distinctive morphological derivations from Standard English, using non-standard prefixes and suffixes like nigger, fuck, ass and head; note that these are not dominant affixes in SE.

Conclusion: Our large scale computational analysis is consistent with the observations made by [31] that slang exhibits extra-grammatical rules which are not observed in word formation in standard English. Moreover, we quantitatively estimate the likelihoods of these patterns, characterizing the heavy tail nature of such patterns (both prefixes and suffixes) in slang.

3.2.2 Slang Classes. Having analyzed morphological patterns of slang broadly, we now turn our attention to specific categories of slang identified by Mattiello [31] that "exhibit underlying preferences for some morphological patterns". We briefly describe these classes and then thoroughly analyze the morphological patterns for each of these classes, revealing insights into their generative process.

• Alphabetisms are shortenings of a multi-word sequence. Alphabetisms can be sub-categorized into two types based on their pronunciation, although the distinction may not always be clear: (a) Acronyms are pronounced by using the regular reading rules (for eg. dink) and (b) Initialisms are pronounced letter by letter (for eg. BLT). While the construction of alphabetisms may appear very predictable, it is not necessarily the case (for example, UAL for University of the Arts London).
• Blends are formed by merging parts of existing words. For example, sextini is formed by merging parts of sex and martini. Mattiello [31] observes that even though blends are ubiquitous in slang, they do not exhibit strict rules of formation but instead show affinities for some patterns, for example for certain suffixes, which we characterize quantitatively.
• Clippings are obtained by shortening a lexeme into a small number of syllables. For eg. gym is a clipping of gymnasium. Clippings can be further classified into 3 major classes depending on the portion that is being clipped: (a) Back clipping, where the beginning of the word (lexeme) is retained (like fave from favorite), (b) Fore clipping, where the end of the word is retained (like roach from cockroach) and (c) Compound clipping, the clipping of a compound word (like slowmo from slow motion).
• Reduplicatives (also called echo-words or flip-flop words) are word pairs obtained either by repeating a word (like boo boo) or by alternating certain vowels or consonants so that the two words are phonologically similar (like teenie weenie).

Finally, while Mattiello [31] provides a manually compiled list of 1580 words belonging to each of the different classes mentioned, it is important to note that these categories are not exhaustive: slang words do not necessarily fall into any of these categories (for eg. edging), and some words fall into several of them. Here, we analyze common word formation patterns in (a) Blends, (b) Clippings and (c) Reduplicatives using the above gold standard dataset of 1580 words.

Morphological patterns of Slang Classes.
• Blends: Figure 5a shows the top suffixes in blends compiled by [31]. Note the dominant presence of suffixes like lish, licious, and tainment, which yield a large number of blends like Hinglish, bootylicious, fergilicious, sextainment, and infotainment, supporting [31] quantitatively that blends have affinities to certain suffixes.
• Clippings: Figure 5b shows the fraction of each clipping type. Most clippings are Back clippings, while Fore clippings are relatively rare.
• Reduplicatives: Figure 6 shows preferred patterns of reduplicative formation. First, observe that reduplicatives are pre-dominantly formed by one of the following processes: (a) Duplication (DUP) like boo-boo, (b) Exchanging a vowel (EX_VOW) like flip-flop and (c) Exchanging a consonant (EX_CONS) like bitsy-witsy, while a small fraction of reduplicatives are formed by (d) Prefixing schm/shm (SHM) like moodle-schmoodle and (e) other patterns (UNK). Second, vowel and consonant substitutions reveal dominant preferences.

Figure 5: Linguistic patterns of Blends and Clippings. (a) Top suffixes in blends (eg. licious, lish, tainment); note that these are not dominant suffixes in SE. (b) Most clippings are Back clippings (nigg); compound clippings (slowmo) are rarer.

Figure 6: Patterns in reduplicative formation. (a) Top reduplicative patterns; Duplication (DUP) is most common. (b) Top letters which replace 'i' (bing-bang). (c) Top letters which replace 'h' (hurly-burly). (d) Top letters which replace 't' (teenie-weenie).

Table 2: Performance of different models on the Gold Standard Test set. Note that using character n-grams demonstrates the best performance.
    Model               Weighted F1
    Random              26.57
    Morpheme N-grams    69.99
    Char N-grams        86.07
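These reduplicative formation processes and substitution preferences can be sketched as a simple rule-based pass over word pairs. The sketch below is an illustrative simplification of ours (the function names and the vowel-set rule are assumptions), not the paper's exact annotation procedure:

```python
from collections import Counter

VOWELS = set("aeiou")

def reduplicative_pattern(first, second):
    """Classify a reduplicative pair into one of the formation processes:
    DUP (duplication), SHM (schm/shm prefixing), EX_VOW (vowel exchange),
    EX_CONS (consonant exchange), or UNK (anything else)."""
    if first == second:
        return "DUP"
    if second.startswith(("schm", "shm")):
        return "SHM"
    if len(first) == len(second):
        diffs = [(a, b) for a, b in zip(first, second) if a != b]
        if diffs and all(a in VOWELS and b in VOWELS for a, b in diffs):
            return "EX_VOW"
        if diffs and all(a not in VOWELS and b not in VOWELS for a, b in diffs):
            return "EX_CONS"
    return "UNK"

def substitution_counts(pairs):
    """Count which letters replace which across equal-length pairs,
    surfacing the dominant substitution preferences."""
    counts = Counter()
    for first, second in pairs:
        if len(first) == len(second):
            counts.update((a, b) for a, b in zip(first, second) if a != b)
    return counts

pairs = [("boo", "boo"), ("flip", "flop"), ("bing", "bang"),
         ("bitsy", "witsy"), ("moodle", "schmoodle"), ("hurly", "burly")]
patterns = [reduplicative_pattern(f, s) for f, s in pairs]
subs = substitution_counts(pairs)
```

On these examples the detector recovers the patterns discussed above (boo-boo as DUP, flip-flop and bing-bang as vowel exchanges, bitsy-witsy and hurly-burly as consonant exchanges, moodle-schmoodle as SHM), and the substitution counter tallies replacements such as i→o and h→b.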
i is more likely to be substituted with a and o, among other letters (Figure 6b); h is much more likely to be substituted by b and p, whereas t is much more likely to be substituted by w (Figures 6c and 6d). Examples are bling-blang, shick-shack, flip-flop, hurly-burly, and teenie-weenie.

Conclusion: In summary, our analysis reveals varied patterns of formation of blends, clippings and reduplicatives.

3.2.3 Detection of Slang Classes. Having obtained insights into the linguistic patterns governing four different slang categories, we demonstrate the utility of these insights by developing a predictive model to classify slang into one of these 4 proposed categories. We evaluate our model quantitatively on a small gold-standard test set and apply our learned model to infer labels for a large list of words from UrbanDictionary, constructing a much larger data-set to aid future qualitative and quantitative analysis.

Gold standard Dataset. We once again use the data-set of 1580 words by [31] as the gold standard data set for learning a classification model to classify a word into one of these four categories. We create gold standard training and test data sets by a random split, using 10% of the data as the test set.

Learning the model. We consider a simple Logistic Regression classifier and experiment with two sets of features:4
• Character N-gram features (lengths 1-5), where the feature size is restricted to 200.
• Morphemes: We consider morpheme-grams (1-5) as features and restrict our feature size to 200. Given the small amount of training data, we expect the morpheme feature set to be much sparser than the character n-gram features, and therefore expect a model learned using these features to perform worse than one using character n-grams.
Finally, as a baseline we consider a random model which randomly draws predictions from the training data label distribution.

Quantitative Evaluation on Gold Standard Test Set. Table 2 shows the performance of the various models on the gold standard test set5. Observe that both the morpheme and the character n-gram models substantially outperform the baseline random classifier. Furthermore, the model using character n-gram features significantly outperforms the morpheme-based model. This is primarily because the morpheme-based feature representation is too sparse for robust estimation of the decision boundary given the small amount of annotated training data. We believe that the morpheme features will be most effective when the amount of training data is much larger, reducing the feature sparsity of general morpheme features. Finally, Figure 7a shows a confusion matrix for the character-n-gram-based model when evaluated on the gold standard test set. It is evident that the classifier does well on Alphabetisms, Clippings and Reduplicatives but is relatively worse at identifying Blends. We hypothesize this is because Blends are much more linguistically complex than other classes like Alphabetisms, where distinctive features are much easier to detect.

In order to gain insight into the model that we learned using character n-grams, we examine the feature weights learned by the classifier. For Alphabetisms, we find distinctive features which involve periods, upper-case letters and other symbols like & and /. For Blends we observe strings that make up blends like ny, sh etc. Similarly, for Clippings we observe distinctive suffixes like ly-, ity, tie, while for Reduplicatives we see patterns like -, ween, win, um (used in words like teenie-weenie and bum bum), suggesting that our model effectively picks up on linguistic cues to distinguish between the classes.

Inducing labels on the Urban Dictionary Data. As mentioned in Section 3.2.2, slang does not exhaustively fall into the four categories we considered. Consequently, naively applying our learned model to Urban Dictionary data, where slang can additionally belong to multiple "unknown" classes, would result in poor performance (since several words which belong to none of the 4 categories would be incorrectly assigned to one of these known categories). Observe that this is essentially the problem of open set recognition, studied quite rigorously in computer vision [25, 39, 40, 42], where several methods to address this problem are available. In our work, we consider one such approach that augments the trained model with a posterior probability estimator and a decision threshold to optionally reject an instance.6 Specifically, the approach consists of the following steps (see Algorithm 1). Given C, the set of known classes, and a closed set model H that outputs Pr(y|X), a probability distribution over C given a feature vector X, we want to make predictions using H on a data set D where instances could potentially belong to a set of "unknown" classes in addition to C. To do this, for each instance we compute a score signifying the confidence of H in its prediction and reject the instance if this confidence is below a manually chosen threshold δ. This score could be (a) the maximum probability over the known classes or (b) the negative entropy of the output probability distribution.
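The rejection step can be sketched as follows. The character n-gram extractor mirrors the features described above; the toy closed-set model `H` and its probabilities are hand-written stand-ins for the trained logistic regression classifier, so only the scoring-and-threshold logic is the point of the illustration:

```python
import math

def char_ngrams(word, n_min=1, n_max=5):
    """Character n-gram features (lengths 1-5), as used for the classifier."""
    padded = f"#{word}#"
    return {padded[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)}

def neg_entropy(probs):
    # Higher (closer to 0) means a more peaked, confident distribution.
    return sum(p * math.log(p) for p in probs.values() if p > 0)

def max_prob(probs):
    return max(probs.values())

def predict_with_reject(H, instances, delta, score_func=max_prob):
    """Label each instance with the argmax class from H, but reject it
    when the confidence score does not exceed the threshold delta."""
    labels = {}
    for w in instances:
        probs = H(w)                      # Pr(y | x) over the known classes
        label = max(probs, key=probs.get)
        labels[w] = label if score_func(probs) > delta else "Rejected"
    return labels

# Toy closed-set model H (hypothetical probabilities, for illustration only).
def H(word):
    if "." in word:                       # periods are a cue for alphabetisms
        return {"ALP": 0.9, "BLE": 0.04, "CLI": 0.03, "RED": 0.03}
    return {"ALP": 0.25, "BLE": 0.26, "CLI": 0.25, "RED": 0.24}

labels = predict_with_reject(H, ["s.w.i.m", "darwin"], delta=0.5)
```

With the max-probability score and δ = 0.5, the confident prediction for "s.w.i.m" is kept while the near-uniform distribution for "darwin" is rejected; swapping in `neg_entropy` as `score_func` (with a correspondingly chosen δ) implements the alternative score.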

4Other models like SVM’s also yielded similar performance. 6We leave other complicated approaches which might boost performance on this task 5Hyper-parameter tuning was done using cross validation. using W-SVM as future work. 5 Algorithm 1 PredictWithReject (C, H, D, δ , F) Category Word Alphabetisms A.D.E.D, E.V.I.L, S.P.E.W, C.H.U.D, S.E.A.L, Input: C: Set of known classes, H: Classifier for closed set C, D: S.W.I.M Dataset of instances from an open set. Each instance needs to be assigned a label in C or Rejected if it belongs to none of Blends Iretalian (Ireland+Italian), Obamerica (Obama the classes in C. δ: Reject threshold. T : Score type: Negative + America), Metapedia (Meta + Encyclopedia), Entropy or MaxProb. Obroxious (Obnoxious+Rock), Cumbrella (Cum + Umbrella), Rainmelon (probably c.f No Rain + Output: Predicted Label Assignments for each instance in D where each label l ∈ C ∪ Rejected Blind Melon) Clippings cuttie (cutback), stevie (Steven), Bishie 1: SCOREFUNC ← F {Initialize the scoring function to either compute the negative entropy or the maximum value of the (bishounen), hattie, cottie output probability distribution.} Reduplicatives e-d-b-t-z, yu-gay-ho, yu-gay-oh, bug-a-boo, Roody-poo, hooty-hoo 2: for e ∈ D do 3: Compute p the output probability distribution over C. Rejected Darwin, edging, Pingo, Oil-Can, Wet-Seal, 4: Score(w) ← SCOREFUNC(w) Flank, Baking Table 3: Examples of top words detected by model for each class 5: Labels(w) ← arg maxc ∈C p 6: if SCORE(w) ≤ δ then on the UrbanDictionary dataset along with words our algorithm correctly rejected. Observe that we can correctly identify labels for 7: Labels(w) ← Rejected {Failed threshold so reject} interesting words like Ireitalian, obroxious and cuttie. 8: end if 9: end for 10: return Labels 3.3 How do the syntactic roles in which slang words are used Normalized Confusion Matrix Normalized Confusion Matrix differ from those in Standard English? 
We investigate this by analyzing the part of speech (POS) roles of slang. We obtain the POS tags for slang words by using a pre-trained part of speech tagger7 on the example usages of slang. We also obtained the POS tags of words in Standard English by querying Dictionary.com. We observed that slang contains a much higher proportion of Nouns (~72%). In contrast, the fraction of Nouns in Standard English was ~50%. Further analysis reveals that about 28% of slang words are used as Proper Nouns. We explain this by noting that even names of people can be used as slang (with a creatively assigned connotation, as we will see in Section 4), which is quite rare in the standard form. Examples of such words include Trumpence, Angry Bill Cosby, Erik Erikson, Annabelle, Ria, Debby etc.

7 We use TextBlob for inferring POS tags.

This score could be (a) the maximum probability over the known classes or (b) the negative entropy of the output probability distribution.

Table 3 shows some of the top words detected by our model for each category on the UrbanDictionary data. Our method effectively identifies instances of each class while also rejecting instances not belonging to the four classes. We identify slang like E.V.I.L and S.P.E.W as Alphabetisms and detect Blends like Iretalian (Irish + Italian) or Obamerica (Obama + America). Similarly, our model is able to detect Clippings like Stevie (Steven) and Bishie (bishounen), and Reduplicatives like hooty-hoo.

We also evaluate our model quantitatively in a closed class setting on a balanced, manually created test sample of the UrbanDictionary data-set of 600 words (600BUD), over which we obtained a weighted F1-score of 86% (see Figure 7b for the confusion matrix on 600BUD). Finally, we evaluated our model in the open class setting using cross-class validation [42], which yields a mean weighted F1-score of 66.43, implying that our model generalizes reasonably well to this open set recognition setting as well.

Figure 7: Performance of our character n-gram model on the gold and 600BUD test sets in the closed class setting: (a) confusion matrix on gold, (b) confusion matrix on 600BUD. Note good performance on all classes (diagonal entries: gold — ALP 0.97, BLE 0.72, CLI 0.88, RED 0.86; 600BUD — ALP 0.96, BLE 0.96, CLI 0.89, RED 0.63). (ALP: Alphabetisms, BLE: Blends, CLI: Clippings, RED: Reduplicatives.)

4 SOCIAL ASPECTS OF SLANG USAGE

According to the sociological viewpoint of defining slang [30], slang is associated with several sociological properties, with perhaps the most widely accepted one being group restriction. In addition to this, slang also exhibits properties like debasement, humor, obscenity, and subject-restriction, to name a few. While several scholars have studied the social aspects of slang, these studies have been largely qualitative [14] or restricted to a very specific group [7]; a quantitative large scale analysis of the social aspects of slang has not been addressed, to the best of our knowledge. Here, we consider two such aspects: (a) Subject restriction: Slang can be associated with a particular subject. Examples of such slang are crack, junkie, acid, crystal, all related to the subject of drugs. Similarly, slang words abound in obscenity, with a plethora of terms related to sex. (b) Stereotypes and Prejudices: We investigate the question of whether slang, like the standard form, reflects prevalent gender and religious biases/prejudices.
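The two candidate rejection scores described above (maximum class probability, or negative entropy of the predicted distribution) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class probabilities and the threshold value are made up for the example, and in practice the threshold would be tuned on held-out data.

```python
import math

def max_prob_score(probs):
    # Rejection score (a): the maximum probability over the known classes.
    return max(probs)

def neg_entropy_score(probs):
    # Rejection score (b): the negative entropy of the output distribution.
    # Peaked (confident) distributions score higher than near-uniform ones.
    return sum(p * math.log(p) for p in probs if p > 0)

def classify_open_set(probs, labels, threshold, score=neg_entropy_score):
    # Predict the arg-max label, or reject as "unknown" when the
    # confidence score falls below the threshold.
    if score(probs) < threshold:
        return "unknown"
    return labels[max(range(len(probs)), key=probs.__getitem__)]

labels = ["ALP", "BLE", "CLI", "RED"]
print(classify_open_set([0.90, 0.05, 0.03, 0.02], labels, threshold=-1.0))  # ALP
print(classify_open_set([0.30, 0.25, 0.25, 0.20], labels, threshold=-1.0))  # unknown
```

With either score, raising the threshold trades recall on the four known classes for a lower rate of false accepts on out-of-class words.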
4.1 Subject Restriction in Internet Slang

Given a small amount of gold standard annotated data from UrbanDictionary, we seek to infer labels for the much larger un-annotated UrbanDictionary data-set. We categorize slang words, when possible, into one of 10 known categories, namely: sex (eg. bating, spear), drugs (eg. blower, weeded), music (eg. bridestep, tweenwave), college (eg. vandy), religion (eg. kyke, jooz), internet (eg. typeractive, eracism), sports (eg. huck), name (eg. hannah-montana), food (eg. grubbin, scram) and work (eg. architorture). It is to be noted that these categories are not exhaustive, and once again form an open-set: there are many slang words that belong to multiple classes, in addition to "unknown" classes.

4.1.1 Learning Slang Embeddings. To capture the semantics of slang words, we learn embeddings of slang words based on the distributional cues in their example usages. To illustrate, one example usage of thizz is "thizz is NOT pure extacy.....thizz is a bunch of shitty drugs (heroin, meth, coke, extacy, acid and thizz) all mixed up into a pill and if your lucky...;) press", which suggests that thizz is a slang that is related to drugs. We learn Skipgram [35] word embeddings on the example usages using negative sampling.

4.1.2 Learning the Categorization Model. As before, we learn a classifier on the supervised labeled data and extend it to handle open-set recognition using Algorithm 1, by augmenting the learned classifier with a probability threshold. Using the slang embeddings as features, we learn a K-nearest neighbor classifier8 (we set k = 5). To quantitatively evaluate the performance of our learned model, we manually created a labeled test set from the gold standard annotated dataset, on which our model obtained a weighted F1-score of 63, whereas a baseline that predicts a label according to the training data label distribution performs markedly worse. We further observed that our classifier does best on the categories which are more prominent, and performs quite poorly on words which occur infrequently in slang.

8 Other classifiers like SVM yield comparable results.

Figure 8a shows the fraction of each category estimated using the gold standard labeled data. Observe that Sex and Drugs are the dominant topics, reinforcing the obscene and vulgar aspects often associated with slang.

Finally, we extend our model to the "open-set" recognition setting using Algorithm 1 and apply it to an additional 11000 (11K) unlabeled slang words. Table 4 shows the top words identified by our learned model for each category, suggesting that our model is effectively able to pick up on contextual usage cues for its categorization. Note how words like pipe, shots and herb are correctly categorized as belonging to Drugs, words like entrails are correctly identified as belonging to Sex, and words like billion and drone are correctly rejected. Using the labels inferred by our model on these words, together with the gold standard dataset, we created a larger dataset and estimated the mean proportion of each subject (at different rejection thresholds) on it. The estimates (see Figure 8b) are consistent with our previous observation, with the estimated proportion of Sex even higher when we include the additional 11K examples.

Figure 8: Subject restriction in slang on (a) Gold Standard Data and (b) Gold Standard Data + 11K. Note the dominant prevalence associated with sex and drugs.

Table 4: Examples of top words in 4 of the 10 categories detected by our model with Algorithm 1, along with examples of rejected words.
Sex: fine, ass, tooth, hole, bum, finger, hard, penis, dick, facial, foreskin, hair, entrails
Drugs: pipe, shots, herb, nuggs, bomb, dwayne, roll, cubba, cigs
College: school, quiz, dropout, derbie, league, preppy, herd, math, presidents, med
Food: pepperoni, chocolate chip, ice cream, cake, chili, tortilla, hot dog, gluten
Rejected: billion, copy, dragons, drone, terrorism

4.2 Stereotypes in Internet Slang

Recent research has shown that word embeddings learned on more formal data reflect existing gender stereotypes [5, 6, 8, 47]. In line with these observations, we investigate this in slang through the lens of slang embeddings.

Does slang reflect gender stereotypes? To answer this question, we follow the method outlined by [6] to quantify gender bias and stereotypes using their pre-defined and humanly validated list of occupations, computing the direct bias (DirectBias) of the slang embeddings with respect to gender. Ideally, if there were no gender bias, the DirectBias would be 0. However, we notice a significant DirectBias for the slang embeddings, comparable to the values of 0.09 and 0.08 reported for the GoogleNews and Wikipedia embeddings [6]. While traditional gender stereotypes with respect to occupations are prevalent even in Wikipedia and News (where language is standard), one might posit that, since the user demographics of UrbanDictionary are skewed towards younger age groups,9 prevalent occupational stereotypes with respect to gender might be weaker than what one might observe in general language. However, Figure 9, which shows occupations based on their projections onto the gender axis of the slang embeddings, suggests that slang also reflects traditional gender stereotypes at a level comparable to the standard language used in News.

9 Data from Amazon Alexa.
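The DirectBias measure of Bolukbasi et al. [6] used above can be sketched as follows; the gender direction and occupation vectors here are toy 2-d examples rather than learned embeddings.

```python
def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def direct_bias(occupation_vecs, gender_direction, c=1.0):
    # DirectBias: the mean of |cos(w, g)|^c over the (nominally
    # gender-neutral) occupation vectors w, where g is the gender
    # direction (e.g. the difference vector he - she).
    return sum(abs(cosine(w, gender_direction)) ** c
               for w in occupation_vecs) / len(occupation_vecs)

g = [1.0, 0.0]                        # toy gender direction
neutral = [[0.0, 1.0], [0.0, -2.0]]   # occupations orthogonal to g
skewed = [[1.0, 1.0], [-3.0, 0.0]]    # occupations aligned with g

print(direct_bias(neutral, g))  # 0.0: no bias
print(direct_bias(skewed, g))   # ~0.85: substantial bias
```

A DirectBias of 0 thus means every occupation vector is orthogonal to the gender direction; the larger the value, the more the occupations align with it.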

Figure 9: Gender stereotypes for occupations in Slang. Note the high "maleness" associated with occupations like gangster, officer and the high "femaleness" associated with occupations like socialite, counselor, missionary.

Note that the occupations gangster, officer, warrior, commander, soldier are considered much more male, whereas occupations like stylist, sailor, socialite, counselor and missionary are considered much more female, reflecting and reinforcing prevalent gender stereotypes.

Conclusion: Similar to standard language and even formal news, slang is not immune to gender biases and stereotypes.

Does slang exhibit sexual prejudice? Noting the significant proportion of sexual terms in slang, it is natural to ask: does slang exhibit sexual prejudices when referring to males/females? To answer this question, we first observe that several person names are also added as slang entries (for example, female names like neelam, sangeeta, ganga, ria, annabelle and male names like vance, reuben). First, we obtain a pre-compiled list of words (not an exhaustive one) primarily associated with sexual prejudice, namely slut, whore, shrew, bitch, faggot, sexy, fuck, fucked, nude, porn, cocksucker.10 Given a word w, we define its sexual prejudice score as the mean cosine similarity of the word embedding of w with each of the words c in the above set L. Specifically, we define: SEXPREJ(w) = (1/|L|) * sum_{c in L} Cosine(w, c). We then obtained a list of ~5000 slang words which are names of persons, inferred their gender (male or female) using a pre-trained model,11 and computed their sexual prejudice scores using the slang embeddings. As a baseline, we also computed the corresponding sexual prejudice scores for the same set of names using the pre-trained GoogleNews word embeddings. Figure 10 shows the mean sexual prejudice score for males and females in both Slang and Google News.

Figure 10: Sexual prejudice of person names. Note that both male and female names demonstrate higher sexual prejudice in slang than in a more formal variety (News). Furthermore, female names are more sexually prejudiced than male names in both varieties.

Observe that in both Google News and Slang, both male and female names are associated with sexual prejudices. Furthermore, the sexual prejudice associated with female names is (statistically significantly) higher than that associated with male names in both sources. More interestingly, we observe that the sexual prejudice associated with person names is significantly higher in Slang than in a more standard and formal domain like News. Some examples of female names which reflect such sexual prejudices in slang from UrbanDictionary, and which would be very unlikely in formal domains like News, are (a) ria: "hey tht chic is such a ria", "hey look...its ria!" and (b) debby: "That girl is such a debby, the guys are always checking her out".

Conclusion: Our analysis suggests that slang is much more uninhibited, with stronger sexual prejudices.

Does slang show religious prejudices? We now investigate the prevalence of religious prejudices in slang. First, we define two lists: (a) Religious terms: a list of religious terms spanning multiple religions, including words like sikh, muslim, islam, jews, christian, agnostic, atheist, buddhist. (b) Prejudice terms, which reflect prejudices; examples include terrorist, evil, sexy, suave and good. These lists are in no way exhaustive, but representative.12 Given a religious term r and a prejudice term p, we compute a prejudice score for (r, p) defined as the cosine similarity of r with p. We standardize these scores over all religious terms for a given prejudice p. Figure 11 shows the standardized scores obtained for each of the religious terms for several prejudice terms. Note how for the prejudices terrorist and illegal, the religious terms islam and muslim have the highest scores, whereas christian, buddhist and jewish have the most negative scores (see Figures 11a, 11b). Furthermore, observe that for the positive traits good and sexy (Figures 11c, 11d), buddhist and jewish are associated with very high positive scores while islam, agnostic and muslim are associated with extreme negative scores, thus reinforcing existing stereotypes. Finally, we observed that the mean prejudice score over all (r, p) pairs in slang was significantly greater than in Google News (0.34 vs 0.16, p < 0.0001), suggesting that such prejudices manifest more extremely in slang.

Conclusion: We show that slang usage reveals prevalent religious stereotypes, suggesting that slang is subject to the same biases and stereotypes as standard language.

5 RELATED WORK

Socio-variational Linguistics: A large body of work studies linguistic aspects of language and their correlation with social factors like age, gender, ethnicity, and geography [3, 18-20, 28, 33, 44]. Most of these works study the standard form of English (in writing or in online social media) and do not focus primarily on slang.
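The SEXPREJ score defined above is simply a mean of cosine similarities against the prejudice lexicon. A minimal sketch, with toy 2-d vectors standing in for the learned embeddings:

```python
def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(t * t for t in x) ** 0.5
    return dot / (norm(u) * norm(v))

def sexprej(w_vec, lexicon_vecs):
    # SEXPREJ(w) = (1/|L|) * sum over c in L of Cosine(w, c):
    # the mean cosine similarity of w with the prejudice lexicon L.
    return sum(cosine(w_vec, c) for c in lexicon_vecs) / len(lexicon_vecs)

# Toy lexicon and name embeddings (illustrative only).
lexicon = [[1.0, 0.0], [1.0, 1.0]]
name_close = [2.0, 1.0]    # a name embedded near the lexicon
name_far = [-1.0, 1.0]     # a name embedded away from it

print(round(sexprej(name_close, lexicon), 3))  # 0.922
print(round(sexprej(name_far, lexicon), 3))    # -0.354
```

The same standardized-cosine recipe carries over to the religious prejudice scores, with the lexicon replaced by a single prejudice term and z-scoring across the religious terms.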

10 Obtained from: https://github.com/commonsense/conceptnet5/
11 We use the sexmachine toolkit to infer gender.
12 We obtained these lists from: https://github.com/commonsense/conceptnet5

Figure 11: Religious prejudices in Slang: standardized prejudice scores of religious terms for the prejudice terms (a) Terrorist, (b) Illegal, (c) Good, (d) Sexy. Note that even in slang, prevalent religious prejudices manifest. For example, Islam and Muslim are closer to terrorist (see Figure 11a). Note that atheist has a near neutral prejudice score while words like Buddhist have negative prejudice scores. Similarly, observe the reversal of religions for positive trait words like good and sexy.

Mencken [33] outlines variation between American English and British English, and Labov [28] studied language variation in time and geography and outlined principles of linguistic change. In the age of social media, Eisenstein et al. [18-20] study lexical variation in social media, propose models to detect geographic lexical variation in social media, and study its diffusion across regions. Bamman et al. [3] then follow up by studying gender identity and lexical variation in social media.

There has been little work on the linguistic and social aspects of slang, with the exception of Mencken [34], who studies the origin and nature of American slang. Consequently, few dictionaries documenting slang were compiled before the evolution of the Internet and social media [9, 21]. Recently, Mattiello [30] notes and provides qualitative evidence for the extra-grammatical morphological properties of slang, while some works attempt to explicitly incorporate slang to improve tasks in natural language processing (like sentiment detection), especially for social media [1, 13, 26, 37].

The works that are closest to ours are those of Dhuliawala et al. [13] and Mattiello [29, 30]. Dhuliawala et al. [13] use Reddit and UrbanDictionary to build a lexical resource called SlangNet that captures slang semantics. Mattiello [29, 30] notes the pervasiveness of slang on the Internet, outlines the extra-grammatical nature of slang morphology and compiles a small dataset of 1580 slang words. Differing from all of these works, we distinguish ourselves by explicitly focusing on and analyzing slang on both aspects, linguistic and social, at a large scale. We not only provide supporting quantitative evidence for observations made by Mattiello [30] but additionally conduct the first phonological, morphological and syntactical analysis of tens of thousands of slang words, revealing new linguistic insights into diverse patterns of slang generation.

Fairness in Machine Learning: There has been a surge of research into quantifying bias and analyzing fairness of machine learning models, including word embeddings [5, 6, 10, 22, 23, 48]. Among these, the most relevant works are by [5, 6], who analyze and quantify the gender bias prevalent in pre-trained word embedding models like the Google News word embeddings, and also propose methods to debias such embeddings. Our work builds on their approach: we not only quantify such bias in slang, but also reveal the presence of additional sexual and religious prejudices.

6 CONCLUSION

In this work, we conducted the first large scale analysis of slang on the Internet on both aspects: linguistic and social. Our linguistic analysis of slang, which included phonological, morphological and syntactical analysis, reveals that slang exhibits linguistic properties markedly different from the standard variety. Furthermore, our analysis reveals insights into the generative mechanisms of four different classes of slang: Alphabetisms, Blends, Clippings and Reduplicatives. We also propose a model that classifies slang words into these categories yet effectively rejects words which do not belong to any of these four known classes.

We also analyzed social aspects of slang pertaining to subject restriction and stereotyping. Our analysis revealed two dominant subjects of slang: Sex and Drugs. Additionally, we showed that slang, like its standard variety, exhibits a non-trivial gender bias. More interestingly, our analysis reveals that both male and female names are disproportionately sexualized in slang. In general, we noted that slang is not immune to prevalent sexual and religious prejudices and in fact manifests such prejudices to a greater degree.

Our work also suggests several directions for future research. First, our analysis is restricted to slang in UrbanDictionary, which mostly reflects slang usage on the Internet as used by youth. It would be interesting to analyze slang in alternate settings like forums, micro-blogs, printed media and audio/video content, as well as in specific demographic groups (like senior citizens). Second, our insights can enable the development of generative models for slang and its diffusion in social media. Thirdly, the methods outlined in our work can enable a study of slang in the multi-lingual and cross-cultural setting, where slang usage and its social aspects can be quite culture specific.

Finally, we conclude by noting that our work has implications for the larger fields of socio-linguistics and natural language processing, which are increasingly being applied to the non-standard language varieties so prevalent in social media.

REFERENCES
[1] Hadi Amiri and Tat-Seng Chua. 2012. Mining slang and urban opinion words and phrases from cqa services: An optimization approach. In Proceedings of the fifth ACM international conference on Web search and data mining. ACM, 193-202.
[2] Mark Aronoff. 1976. Word formation in generative grammar. Linguistic Inquiry Monographs Cambridge, Mass. 1 (1976), 1-134.
[3] David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics 18, 2 (2014), 135-160.
[4] Laurie Bauer. 1983. English word-formation. Cambridge University Press.

[5] Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Quantifying and reducing stereotypes in word embeddings. arXiv preprint arXiv:1606.06121 (2016).
[6] Tolga Bolukbasi, Kai-Wei Chang, James Y Zou, Venkatesh Saligrama, and Adam T Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems. 4349-4357.
[7] Mary Bucholtz. [n. d.]. Word up: Social meanings of slang in California youth culture. A cultural approach to interpersonal communication: Essential readings ([n. d.]), 243-267.
[8] Aylin Caliskan-Islam, Joanna J Bryson, and Arvind Narayanan. 2016. Semantics derived automatically from language corpora necessarily contain human biases. arXiv preprint arXiv:1608.07187 (2016).
[9] Robert L Chapman. 1986. New dictionary of American slang. New York: Harper & Row.
[10] Kate Crawford. 2016. Artificial intelligence's white guy problem. The New York Times (2016).
[11] David Crystal. 2011. Internet Linguistics: A Student Guide (1st ed.). Routledge, New York, NY.
[12] Aliya Deri and Kevin Knight. [n. d.]. How to Make a Frenemy: Multitape FSTs for Portmanteau Generation.
[13] Shehzaad Dhuliawala, Diptesh Kanojia, and Pushpak Bhattacharyya. 2016. SlangNet: A WordNet like resource for English Slang. In LREC.
[14] GF Drake. [n. d.]. The social role of slang. In Language: social psychological perspectives: selected papers from the first International Conference on Social Psychology and Language held at the University of Bristol, England, July 1979. Pergamon, 63.
[15] Bethany K Dumas and Jonathan Lighter. 1978. Is slang a word for linguists? American Speech 53, 1 (1978), 5-17.
[16] Connie Eble. [n. d.]. Slang and the Internet.
[17] Connie Eble. 2012. Slang and sociability: In-group language among college students. UNC Press Books.
[18] Jacob Eisenstein, Brendan O'Connor, Noah A Smith, and Eric P Xing. 2010. A latent variable model for geographic lexical variation. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1277-1287.
[19] Jacob Eisenstein, Noah A Smith, and Eric P Xing. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1. Association for Computational Linguistics, 1365-1374.
[20] Jacob Eisenstein, Eric P Xing, Noah A Smith, and Brendan O'Connor. 2012. Mapping the geographical diffusion of new words. Technical Report.
[21] Stuart Berg Flexner. 1967. Dictionary of American slang. Crowell.
[22] Sara Hajian, Francesco Bonchi, and Carlos Castillo. 2016. Algorithmic bias: from discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2125-2126.
[23] Moritz Hardt, Eric Price, Nati Srebro, et al. 2016. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems. 3315-3323.
[24] Felix Hill, Kyunghyun Cho, Anna Korhonen, and Yoshua Bengio. 2015. Learning to understand phrases by embedding the dictionary. arXiv preprint arXiv:1504.00548 (2015).
[25] Lalit P. Jain, Walter J. Scheirer, and Terrance E. Boult. 2014. Multi-Class Open Set Recognition Using Probability of Inclusion. In The European Conference on Computer Vision (ECCV).
[26] Fazal Masud Kundi, Shakeel Ahmad, Aurangzeb Khan, and Muhammad Zubair Asghar. 2014. Detection and scoring of internet slangs for sentiment analysis using SentiWordNet. Life Science Journal 11, 9 (2014), 66-72.
[27] William Labov. 1972. Some principles of linguistic methodology. Language in Society 1, 1 (1972), 97-120.
[28] William Labov. 1983. Locating language in time and space: William Labov (ed.), Quantitative Analyses of Linguistic Structure. Academic Press, New York. Lingua 60, 1 (1983), 87-96.
[29] Elisa Mattiello. [n. d.]. The pervasiveness of slang in standard and non-standard English.
[30] Elisa Mattiello. 2008. An introduction to English slang: A description of its morphology, semantics and sociology. Vol. 2. Polimetrica.
[31] Elisa Mattiello. 2013. Extra-grammatical morphology in English: Abbreviations, blends, reduplicatives, and related phenomena. Vol. 82. Walter de Gruyter.
[32] Willi Mayerthaler. 1981. Morphologische Natuerlichkeit. Vol. 28. Akademische Verlagsgesellschaft Athenaion.
[33] Henry Louis Mencken. 1945. The American language: An inquiry into the development of English in the United States. Vol. 2. Alfred A. Knopf.
[34] Henry Louis Mencken. 1967. American slang. The American Language (1967).
[35] Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of NAACL-HLT. 746-751.
[36] Pamela Munro et al. 1989. Slang U: The Official Dictionary of College Slang. (1989).
[37] Finn Arup Nielsen. 2011. A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. arXiv preprint arXiv:1103.2903 (2011).
[38] Randolph Quirk. 2010. A comprehensive grammar of the English language. Pearson Education India.
[39] Ajita Rattani, Walter J. Scheirer, and Arun Ross. 2015. Open Set Fingerprint Spoof Detection Across Novel Fabrication Materials. IEEE Transactions on Information Forensics and Security (T-IFS) 10, 11 (November 2015).
[40] Ethan Rudd, Lalit P. Jain, Walter J. Scheirer, and Terrance Boult. 2017. The Extreme Value Machine. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) (2017). To appear.
[41] Sergio Scalise. 1986. Generative morphology. Vol. 18. Walter de Gruyter.
[42] Walter J. Scheirer, Lalit P. Jain, and Terrance E. Boult. 2014. Probability Models for Open Set Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI) 36, 11 (November 2014).
[43] Karl Sornig. 1981. Lexical innovation: A study of slang, colloquialisms and casual speech. John Benjamins Publishing.
[44] Sali A. Tagliamonte and Derek Denis. 2008. Linguistic Ruin? LOL! Instant messaging and teen language. American Speech 83, 1 (2008), 3-34.
[45] Sami Virpioja, Peter Smit, Stig-Arne Gronroos, Mikko Kurimo, et al. 2013. Morfessor 2.0: Python implementation and extensions for Morfessor Baseline. (2013).
[46] Kaisheng Yao and Geoffrey Zweig. 2015. Sequence-to-sequence neural net models for grapheme-to-phoneme conversion. arXiv preprint arXiv:1506.00196 (2015).
[47] Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457 (2017).
[48] Indre Zliobaite. 2015. A survey on measuring indirect discrimination in machine learning. arXiv preprint arXiv:1511.00148 (2015).