Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with Common Sense and World Knowledge
Canwen Xu1∗, Wangchunshu Zhou2∗, Tao Ge3, Ke Xu2, Julian McAuley1, Furu Wei3 1 University of California, San Diego 2 Beihang University 3 Microsoft Research Asia 1{cxu,jmcauley}@ucsd.edu 3{tage,fuwei}@microsoft.com [email protected],[email protected]
Abstract decrypted by people outside the group (i.e., the out- siders). These elements make the creation and un- Cant is important for understanding advertis- ing, comedies and dog-whistle politics. How- derstanding of cant subtle and hard to observe (Tay- ever, computational research on cant is hin- lor, 1974). To the best of our knowledge, currently dered by a lack of available datasets. In this there are very few resources available for the re- paper, we propose a large and diverse Chinese search of cant. dataset for creating and understanding cant In this paper, we create a dataset for studying from a computational linguistics perspective. cant, DogWhistle, centered around the afore- We formulate a task for cant understanding and mentioned key elements (examples shown in Fig- provide both quantitative and qualitative anal- ure1). We collect the data with a well-designed ysis for tested word embedding similarity and pretrained language models. Experiments sug- online game under a player-versus-player setting gest that such a task requires deep language un- (see Section 3.1). The dataset includes abundant derstanding, common sense, and world knowl- and diverse cant for a wide spectrum of hidden edge and thus can be a good testbed for pre- words. We find that cant understanding requires trained language models and help models per- a deep understanding of language, common sense 1 form better on other tasks. and world knowledge, making it a good testbed for 1 Introduction next-generation pretrained language models. Our dataset also serves as a timely and complex lan- A cant2 (also known as doublespeak, cryptolect, ar- guage resource that can help models perform better got, anti-language or secret language) is the jargon on other tasks through Intermediate Task Trans- or language of a group, often employed to exclude fer (Pruksachatkun et al., 2020). or mislead people outside the group (McArthur et al., 2018). Cant is crucial for understanding ad- 2 Related Work vertising (Dieterich, 1974) and both ancient and The use of cant has long been studied in linguistics modern comedy (Sommerstein, 1999; Prasetyo, research (Pei, 1973; Pulley, 1994; Albertson, 2006; 2019). Also, it is the cornerstone for infamous dog- Squires, 2010; Henderson and McCready, 2017, whistle politics (López, 2015; Albertson, 2015). 2019b,a; Bhat and Klein, 2020). However, due to Here, we summarize the key elements for cant: a lack of language resources, there are few studies (1) Both a cant and its reference (i.e., hidden word) arXiv:2104.02704v2 [cs.CL] 7 Jun 2021 in computational linguistics research. Henderson should be in the form of common natural text (not and McCready(2020) attempted to model the dog- another symbol system, e.g., Morse code). (2) whistle communications with a functional, agent- There is some shared information between the cant based method. users (i.e., the insiders) that is not provided to the As a related topic in computational linguistics, people outside the group. (3) A cant should be some previous studies investigate coded names in deceptive and remain undetected to avoid being human language. Zhang et al.(2014) analyzed and ∗ Equal Contribution. Work done at Microsoft Research generated coded names of public figures. Zhang Asia. 1The code is available at https://github.com/ et al.(2015) designed an automatic system to de- JetRunner/dogwhistle. The data and leaderboard are code the coded names. Huang et al.(2017) ex- available at https://competitions.codalab.org/ ploited a knowledge graph to identify coded names. competitions/30451. 2https://en.wikipedia.org/wiki/Cant_ Huang et al.(2019) leveraged multi-modal infor- (language) mation to align coded names with their references. Word indices 0 1 2 3 Word indices 0 1 2 3 Hidden words Cant history