Be Appropriate and Funny: Automatic Entity Morph Encoding
Total Page:16
File Type:pdf, Size:1020Kb
Be Appropriate and Funny: Automatic Entity Morph Encoding Boliang Zhang1, Hongzhao Huang1, Xiaoman Pan1, Heng Ji1, Kevin Knight2 Zhen Wen3, Yizhou Sun4, Jiawei Han5, Bulent Yener1 1Computer Science Department, Rensselaer Polytechnic Institute 2Information Sciences Institute, University of Southern California 3IBM T. J. Watson Research Center 4College of Computer and Information Science, Northeastern University 5Computer Science Department, Univerisity of Illinois at Urbana-Champaign 1fzhangb8,huangh9,panx2,jih,[email protected], [email protected] [email protected], [email protected], [email protected] Abstract • “)¿ (Antenna)” refers to “)¶宝 (Wen Ji- abao)” because it shares one character “宝 Internet users are keen on creating differ- (baby)” with the famous children’s television ent kinds of morphs to avoid censorship, series “)¿宝宝 (Teletubbies)”; express strong sentiment or humor. For example, in Chinese social media, users In contrast with covert or subliminal chan- often use the entity morph “¹¿b (In- nels studied extensively in cryptography and se- stant Noodles)” to refer to “h8· (Zhou curity, Morphing provides confidentiality against Yongkang)” because it shares one char- a weaker adversary which has to make a real time acter “· (Kang)” with the well-known or near real time decision whether or not to block brand of instant noodles “·师傅 (Master a morph within a time interval t. It will take longer Kang)”. We developed a wide variety of than the duration t for a morph decoder to decide novel approaches to automatically encode which encoding method is used and exactly how it proper and interesting morphs, which can is used; otherwise adversary can create a codebook effectively pass decoding tests 1. and decode the morphs with a simple look up. We note that there are other distinct characteristics 1 Introduction of morphs that make them different from crypto- graphic constructs: (1) Morphing can be consid- One of the most innovative linguistic forms in so- ered as a way of using natural language to com- cial media is Information Morph (Huang et al., municate confidential information without encryp- 2013). Morph is a special case of alias to hide the tion. Most morphs are encoded based on seman- original objects (e.g., sensitive entities and events) tic meaning and background knowledge instead for different purposes, including avoiding censor- of lexical changes, so they are closer to Jargon. ship (Bamman et al., 2012; Chen et al., 2013), (2) There can be multiple morphs for an entity. expressing strong sentiment, emotion or sarcasm, (3) The Shannon’s Maxim “the enemy knows the and making descriptions more vivid. Morphs are system” does not always hold. There is no com- widely used in Chinese social media. Here is an mon code-book or secret key between the sender example morphs: “1于瓜9的事Å,¹¿b与 and the receiver of a morph. (4) Social networks )¿JL. (Because of Gua Dad’s issue, Instant play an important role in creating morphs. One Noodles faces down with Antenna.)”, where main purpose of encoding morphs is to dissemi- • “瓜9 (Gua Dad)” refers to “薄熙e (Bo Xilai)” nate them widely so they can become part of the because it shares one character “瓜 (Gua)” with new Internet language. Therefore morphs should “薄瓜瓜 (Bo Guagua)” who is the son of “薄熙 be interesting, fun, intuitive and easy to remem- e (Bo Xilai)”; ber. (5) Morphs rapidly evolve over time, as some • “¹¿b (Instant Noodles)” refers to “h8· morphs are discovered and blocked by censorship (Zhou Yongkang)” because it shares one char- and newly created morphs emerge. acter “· (kang)” with the well-known instant We propose a brand new and challenging re- noodles brand “康师傅 (Master Kang)”; search problem - can we automatically encode 1The morphing data set is available for research purposes: morphs for any given entity to help users commu- http://nlp.cs.rpi.edu/data/morphencoding.tar.gz nicate in an appropriate and fun way? 2 Approaches converted into“I” (grass) ), we create a morph by 1 2 replacing ck with ckck in e. Here we use a char- 2.1 Motivation from Human Approaches acter to radical mapping table that includes 191 Let’s start from taking a close look at human’s radicals (59 of them are characters) and 1328 com- intentions and general methods to create morphs mon characters. For example, we create a morph from a social cognitive perspective. In Table 1 “ºFW (Person Dumb Luo)” for “保W (Paul)” and Table 2, we summarize 548 randomly selected by decomposing “保 (Pau-)” into “º (Person)” morphs into different categories. In this paper we and “F (Dull)”. A natural alternative is to com- automate the first seven human approaches, with- posing two chracter radicals in an entity name to out investigating the most challenging Method 8, form a morph. However, very few Chinese names which requires deep mining of rich background include two characters with single radicals. and tracking all events involving the entities. 2.4 M3: Nickname Generation 2.2 M1: Phonetic Substitution We propose a simple method to create morphs by Given an entity name e, we obtain its pho- duplicating the last character of an entity’s first netic transcription pinyin(e). Similarly, for each name. For example, we create a morph “BB unique term t extracted from Tsinghua Weibo (Mimi)” to refer to “h B (Yang Mi)”. dataset (Zhang et al., 2013) with one billion 2.5 M4: Translation and Transliteration tweets from 1:8 million users from 8/28/2012 to 9/29/2012, we obtain pinyin(t). According to the Given an entity e, we search its English translation Chinese phonetic transcription articulation man- EN(e) based on 94,015 name translation pairs (Ji ner 2, the pairs (b; p), (d; t), (g; k), (z; c), (zh; ch), et al., 2009). Then, if any name component in (j; q), (sh; r), (x; h), (l; n), (c; ch), (s; sh) and EN(e) is a common English word, we search for (z; zh) are mutually transformable. its Chinese translation based on a 94,966 word If a part of pinyin(e) and pinyin(t) are identi- translation pairs (Zens and Ney, 2004), and use the cal or their initials are transformable, we substi- Chinese translation to replace the corresponding tute the part of e with t to form a new morph. characters in e. For example, we create a morph 拉 拉 · For example, we can substitute the characters of “ 里 鸟? (Larry bird)” for “ 里 / (Larry · “比尔 盖( (Bill Gates) [Bi Er Gai Ci]” with Bird)” by replacing the last name “/ (Bird)” “;3 (Nose and ear) [Bi Er]” and “盖P (Lid) with its Chinese translation “鸟? (bird)”. [Gai Zi]” to form new morph “;3 盖P (Nose 2.6 M5: Semantic Interpretation and ear Lid) [Bi Er Gai Zi]”. We rank the candi- For each character ck in the first name of a given dates based on the following two criteria: (1) If the entity name e, we search its semantic interpreta- morph includes more negative words (based on a tion sentence from the Xinhua Chinese character gazetteer including 11,729 negative words derived dictionary including 20,894 entries 3. If a word from HowNet (Dong and Dong, 1999), it’s more in the sentence contains ck, we append the word humorous (Valitutti et al., 2013). (2) If the morph with the last name of e to form a new morph. Sim- includes rarer terms with low frequency, it is more ilarly to M1, we prefer positive, negative or rare interesting (Petrovic and Matthews, 2013). words. For example, we create a morph “薄 áe 2.3 M2: Spelling Decomposition (Bo Mess)” for “薄熙e (Bo Xi Lai)” because the semantic interpretation sentence for “e (Lai)” in- Chinese characters are ideograms, hieroglyphs cludes a negative word “áe (Mess)”. and mostly picture-based. It allows us to natu- rally construct a virtually infinite range of combi- 2.7 M6: Historical Figure Mapping nations from a finite set of basic units - radicals (Li We collect a set of 38 famous historical figures and Zhou, 2007). Some of these radicals them- including politicians, emperors, poets, generals, selves are also characters. For a given entity name ministers and scholars from a website. For a given e = c1:::cn, if any character ck can be decomposed entity name e, we rank these candidates by ap- 1 2 into two radicals ck and ck which are both char- plying the resolution approach as described in our acters or can be converted into characters based previous work (Huang et al., 2013) to measure the on their pictograms (e.g., the radical “y” can be similarity between an entity and a historic figure 2http://en.wikipedia.org/wiki/Pinyin#Initials and finals 3http://xh.5156edu.com/ Frequency Examples Category Distribution Entity Morph Comment (1) Avoid censorship 6.56% 薄熙e (Bo Xi- Bf 记 (B Secre- “B” is the first letter of “Bo” and “Secretary” is lai) tary) the entity’s title. (2) Express strong 15.77% 王Çs (Wang G ù 哥 (Miracle Sarcasm on the entity’s public speech: “It’s a mir- sentiment, sarcasm, Yongping) Brother) acle that the girl survived (from the 2011 train col- emotion lision)”. (3) Be humorous or 25.91% hB (Yang Mi) é[五¹ (Tender The entity’s face shape looks like the shape of fa- make descriptions Beef Pentagon) mous KFC food “Tender Beef Pentagon”. more vivid Mixture 25.32% a N 菲 ¯-上! (Crazy Sarcasm on Colonel Gaddafi’s violence. (Gaddafi) Duck Colonel) Others 23.44% 蒋 Ë 石 (Chi- ±生s (Peanut) Joseph Stilwell, a US general in China during ang Kai-shek) World War II, called Chiang Kai-shek “±生s (Peanut)” in his diary because of his stubbornness.