
Detect Camouflaged Spam Content via StoneSkipping: Graph and Text Joint Embedding for Chinese Character Variation Representation

Zhuoren Jiang1∗, Zhe Gao2∗, Guoxiu He3, Yangyang Kang2, Changlong Sun2, Qiong Zhang2, Luo Si2, Xiaozhong Liu4†

1 School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
2 Alibaba Group, Sunnyvale & Seattle, China & USA
3 School of Information Management, Wuhan University, Wuhan, China
4 School of Informatics, Computing and Engineering, IUB, Bloomington, USA

[email protected], {gaozhe.gz,yangyang.kangyy,qz.zhang,luo.si}@alibaba-inc.com, [email protected], [email protected], [email protected]

Abstract

The task of Chinese text spam detection is very challenging due to both glyph and phonetic variations of Chinese characters. This paper proposes a novel framework to jointly model Chinese variational, semantic, and contextualized representations for the Chinese text spam detection task. In particular, a Variation Family-enhanced Graph Embedding (VFGE) algorithm is designed based on a Chinese character variation graph. The VFGE can learn both the graph embeddings of the Chinese characters (local) and the latent variation families (global). Furthermore, an enhanced bidirectional language model, with a combination gate function and an aggregation learning function, is proposed to integrate the graph and text information while capturing the sequential information. Extensive experiments have been conducted on both SMS and review datasets, showing that the proposed method outperforms a series of state-of-the-art models for Chinese spam detection.

1 Introduction

Chinese orchestrates over tens of thousands of characters by utilizing their morphological information, e.g., pictograms, simple/compound ideograms, and phono-semantic compounds (Norman, 1988). Different characters, however, may share a similar glyph or phonetic "root". For instance, from the glyph perspective, the character "裸 (naked)" looks like "课 (course)" (homographs), while from the phonetic viewpoint, it shares a similar pronunciation with "锣 (gong)" (homophones). The forms of variation can also be compounded; for instance, "账 (account)" and "帐 (curtain)" have a similar structure and pronunciation (homonyms). Unfortunately, in the context of spam detection, as shown in Figure 1, spammers are able to take advantage of these variations to escape from detection algorithms (Jindal and Liu, 2007). For instance, in the e-commerce ecosystem, variation-based Chinese spam mutations thrive to spread illegal, misleading, and harmful information¹.

[Figure 1: Character Variations in Chinese Spam Texts (the pinyin codes provide phonetic information and the Zhengma codes provide glyph information)]

In this study, we propose a novel problem, Chinese Spam Variation Detection (CSVD), a.k.a. investigating an effective Chinese character embedding model that assists classification models in detecting the variations of Chinese spam text. The problem needs to address the following key challenges.

Diversity: the variation patterns of Chinese characters can be complex and subtle, which makes them difficult to generalize and detect. For instance, in the experimental dataset, one Chinese character can have 297 (glyph and phonetic) variants on average and 2,332 at most. The existing keyword based spam detection approaches, e.g., (Ntoulas et al., 2006), can hardly address this problem.

∗ These two authors contributed equally to this research.
† Corresponding author.
¹ More detailed information can be found in the experiment section.

Sparseness, Zero-shot, and Dynamics: when competing with classification models, spammers are constantly creating new Chinese character combinations for spam texts, which can be a "zero/few shot learning" problem (Socher et al., 2013). The labelling cost can be inevitably high in such a dynamic circumstance, and data driven approaches, e.g., (Zhang et al., 2015), will perform poorly on unseen data.

Camouflage: with common cognitive knowledge of Chinese and the contextual information, users are able to consume the spam information even when some characters in the content are intentionally mutated into their similar variations (Spinks et al., 2000; Shu and Anderson, 1997). However, variation-based spam texts are highly camouflaged for machines. It is therefore important to propose a novel Chinese character representation learning model that can synthesize character variation knowledge, semantics, and contextualized information.

To address these challenges, we propose a novel solution, the StoneSkipping (SS) model, to enable Chinese variation representation learning via graph and text joint embedding. SS is able to learn the Chinese character variation knowledge and predict new variations not appearing in the training set by utilizing a sophisticated heterogeneous graph mining method. For a piece of text (a character sequence), with the proposed model, each candidate character can probe the character variation graph (like a stone bouncing across the water surface) and explore its glyph and phonetic variation information (like the ripples caused by the stone hitting the water). Algorithmically, a Variation Family-enhanced Graph Embedding (VFGE) algorithm is proposed to extract the heterogeneous Chinese variation knowledge while learning the (local) graph representation of a Chinese character along with the (global) representation of the latent variation families. Finally, an enhanced bidirectional language model, with a combination gate function and an aggregation learning function, is proposed to comprehensively learn the variation, semantic, and sequential information of Chinese characters. To the best of our knowledge, this is the first work to use graph embedding to learn the heterogeneous variation knowledge of Chinese characters for spam detection.

The major contributions of this paper can be summarized as follows:

1. We propose an innovative CSVD problem, in the context of text spam detection, to address the diversity, sparseness, and text camouflage problems.
2. A novel joint embedding SS model is proposed to learn the variational, semantic, and contextual representations of Chinese characters. SS is able to predict unseen variations.
3. A Chinese character variation graph is constructed to encapsulate the glyph and phonetic relationships among Chinese characters. Since the graph can be potentially useful for other NLP tasks, we share the graph/embeddings to motivate further investigation.
4. Through extensive experiments on both SMS and review datasets², we demonstrate the efficacy of the proposed method for Chinese spam detection. The proposed method outperforms the state-of-the-art models.
² In order to help other scholars reproduce the experiment outcome, we will release the datasets via GitHub (https://github.com/Giruvegan/stoneskipping).

2 Related Work

Neural Word Embeddings. Unlike traditional word representations, low-dimensional distributed word representations (Mikolov et al., 2013; Pennington et al., 2014) are able to capture in-depth semantics of text content. More recently, ELMo (Peters et al., 2018) employed learned functions of the internal states of a deep bidirectional language model to generate character embeddings. BERT (Devlin et al., 2018) utilized bidirectional encoder representations from transformers (Vaswani et al., 2017) and achieved improvements on multiple NLP tasks. However, all the prior models only focused on learning the context, whereas text variation was ignored. Moreover, the CSVD problem can be different from other NLP tasks: the intentional character mutations and unseen variations (zero-shot learning (Socher et al., 2013)) can threaten prior models' performance.

Chinese Word and Sub-word Embeddings. A number of studies explored Chinese representation learning methodologies. CWE (Chen et al., 2015) learned character and word embeddings jointly to improve representation performance. GWE (Su and Lee, 2017) introduced features extracted from images of traditional Chinese characters. JWE (Yu et al., 2017) used deep learning to generate character embeddings based on an extended radical collection. Cw2vec (Cao et al., 2018) investigated a Chinese character as a sequence of stroke n-grams to generate its embedding.

Although these models considered the nature of Chinese characters, they only utilized glyph features while the phonetic information was ignored. In the CSVD problem, the forms of variation can be heterogeneous, and a single kind of feature cannot cover all mutation patterns. More importantly, all these models are not designed for spam detection, and a task-oriented model should be able to highlight the most important variations in spam text.

Graph Embedding. A graph (a.k.a. information network) is a natural data structure for characterizing multiple relationships between objects. Recently, multiple graph embedding algorithms have been proposed to learn low dimensional feature representations of vertexes in graphs. DeepWalk (Perozzi et al., 2014) and Node2vec (Grover and Leskovec, 2016) are random walk based models. LINE (Tang et al., 2015) modeled the 1st and 2nd order graph neighbourhood. Meanwhile, metapath2vec++ (Dong et al., 2017) was designed for heterogeneous graph embedding with human defined metapath rules. HEER (Shi et al., 2018) is a recent state-of-the-art heterogeneous graph embedding model. Though the techniques utilized in these models differ, most existing graph embedding models focus on local graph structure representation, e.g., modelling a fixed-size graph neighbourhood. The CSVD problem requires graph embedding conducted from a more global perspective, to characterize comprehensive variation patterns.

Spelling Correction. Spelling correction may serve as an alternative for addressing the CSVD problem, e.g., using a dictionary-based (Yeh et al., 2014) or language model-based method (Yu and Li, 2014) to restore the content variations to their regular form.

3 StoneSkipping Model

Figure 2 depicts the proposed SS model. There are three core modules in SS: a Chinese character variation graph to host the heterogeneous variation information; a variation family-enhanced graph embedding for Chinese character variation knowledge extraction and graph representation learning; and an enhanced bidirectional language model for joint representation learning. In the remainder of this section, we introduce them in detail.

3.1 Chinese Character Variation Graph

A Chinese character variation graph³ can be denoted as G = (C, R). C denotes the Chinese character set, and each character is represented as a vertex in G. R denotes the variation relation (edge) set, and an edge weight is the similarity of two characters given the target relation (variation) type. To accurately characterize both the phonetic and glyph information of a Chinese character, we utilize three different encoding methods:
However, because spammers intentionally mutate the spam text to escape from the detection model, training data sparseness and dynamics may challenge the spelling correction approach.

Pinyin provides phonetic-based information and is widely used for representing the pronunciations of Chinese characters (Chen and Lee, 2000). In this system, each Chinese character has one syllable which consists of three components: an initial (consonant), a final (vowel), and a tone. There are four types of tones in Modern Standard Mandarin Chinese, and different tones with the same syllable can have different meanings. For instance, the pinyin code of "裸 (naked)" is "luo3" and that of "锣 (gong)" is "luo2". The pinyin-based variation similarity is calculated based on the pinyin syllables with tones⁴.

Stroke is a basic glyph pattern for writing Chinese characters (Cao et al., 2018). All Chinese characters are written in a certain stroke order and can be represented as a stroke code, e.g., the stroke code of "裸 (naked)" is "4523425111234" and that of "课 (course)" is "4525111234". The stroke-based variational similarity is calculated based on longest common substring and longest common subsequence metrics⁴.

Zhengma is another important means of glyph character encoding, which encodes a character at the radical level (Yu et al., 2017). For instance, the Zhengma code of "裸 (naked)" is "WTKF" and that of "课 (course)" is "SKF". The Zhengma-based variational similarity is calculated based on the Jaccard index metric⁴.

Unlike previous works (Cao et al., 2018; Yu et al., 2017) that employ only one kind of glyph-based information, we utilize two different glyph patterns (stroke and Zhengma) to encode Chinese characters, because these two patterns characterize Chinese characters at different internal structural levels and complement each other to enable enhanced glyph representation learning. Furthermore, the pinyin encoder provides phonetic information. The constructed character variation graph integrates these three kinds of variation relations, which can be significant for camouflaged spam detection.

³ We will release the Chinese character variation graph via GitHub (https://github.com/Giruvegan/stoneskipping).
⁴ Because of the space limitation, the detailed operations of relation generation will be provided at https://github.com/Giruvegan/stoneskipping.
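To make the edge-weight construction concrete, the following sketch implements one plausible version of the three similarity measures named above (syllable-and-tone matching for pinyin, longest-common-subsequence overlap for strokes, and the Jaccard index for Zhengma). The detailed relation-generation operations are only released in the repository, so the normalizations and the tone penalty below are illustrative assumptions rather than the paper's exact formulas.

```python
def pinyin_similarity(p1: str, p2: str) -> float:
    """Phonetic similarity from pinyin syllables with tones, e.g. 'luo3' vs 'luo2'."""
    syllable1, tone1 = p1[:-1], p1[-1]
    syllable2, tone2 = p2[:-1], p2[-1]
    if syllable1 != syllable2:
        return 0.0
    return 1.0 if tone1 == tone2 else 0.5  # assumed penalty for a tone mismatch

def _lcs_length(a: str, b: str) -> int:
    """Longest common subsequence length via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def stroke_similarity(s1: str, s2: str) -> float:
    """Glyph similarity from stroke codes, e.g. '4523425111234' vs '4525111234'.
    The paper combines longest common substring and subsequence metrics;
    only the subsequence part is sketched here."""
    return _lcs_length(s1, s2) / max(len(s1), len(s2))

def zhengma_similarity(z1: str, z2: str) -> float:
    """Glyph similarity from Zhengma codes via the Jaccard index."""
    a, b = set(z1), set(z2)
    return len(a & b) / len(a | b)

print(zhengma_similarity("WTKF", "SKF"))  # {K, F} / {W, T, K, F, S} = 0.4
```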
[Figure 2: An Illustration of the "StoneSkipping" Framework]

3.2 Variation Family-enhanced Graph Embedding

While the variation graph can provide comprehensive knowledge of Chinese character variations, two problems need to be addressed. (1) The variation patterns can be very flexible, and compounded (long-range) variation information transfer may exist. Therefore, short-range (local) graph information, e.g., a character vertex's neighbors, may be insufficient for spam detection. Meanwhile, it is impractical to exhaust all possible variation patterns. (2) To allow users to still consume the text content, spammers cannot make the variation patterns too complex or confusing, so they usually focus on the most sensitive words in a spam message. Hence, some random infrequent variation patterns could be "noisy" for CSVD and pollute the detection outcomes.

Latent Character Variation Family. In this study, we propose the VFGE model to address these problems. As depicted in Figure 2, in the VFGE model, we introduce a set of latent variables, "character variation families" F = {F_1, ..., F_{|F|}}, at the graph schema (global) level to capture the critical information for spam detection. Each F_i is defined as a distribution over characters, which aims to estimate the globally frequent variation dependencies in G. By learning F, VFGE is able to highlight the useful variations, eliminate the noisy patterns, and predict unseen variation forms w.r.t. the spam detection task.

Random Walk based Character-Family Representation Co-Learning. VFGE is a random walk based graph embedding model, and we employ a hierarchical random walk strategy (Jiang et al., 2018b) on G to generate optimized walking paths (character vertex sequences) for each character. The model can sample the most probable variation context vertexes for each character. Based on the generated walking paths, VFGE executes the following two processes iteratively:
(1) Family Assignment. By leveraging both local context and global family distributions, we assign a discrete family to each character vertex in a particular walking path to form a character-family pair ⟨C, F⟩. As shown in Figure 2, we assume different walking paths tend to surface various character variation patterns, which can be represented as mixtures over latent variation families. Given a character C_i in a path, C_i has a higher chance of being assigned to a dominant family F_i. The assignment probability can be calculated as:

    Pr(F_i | C_i, path) ∝ Pr(C_i, F_i, path) = Pr(path) Pr(F_i | path) Pr(C_i | F_i)    (1)

As depicted in Figure 2, α is the parameter of the Dirichlet prior on the per-path family distributions (Pr(path)); β is the family assignment distribution (Pr(C | F)); and θ is the family mixture distribution for a walking path (Pr(F | path)). The distribution learning can be considered a Bayesian inference problem, and we use Gibbs sampling (Porteous et al., 2008) to address it.

(2) Character-Family Representation Co-Learning. Given the assigned character-family pairs, the proposed method aims to obtain the representations of characters C and latent variation families F by mapping them into a low-dimensional space R^d (d is a parameter specifying the number of dimensions). Motivated by (Liu et al., 2015), we propose a novel representation learning method to optimize characters and families separately and simultaneously. The objective is defined to maximize the following log probability:

    L = max_f Σ_{C_i ∈ C} Σ_{C_j ∈ N(C_i)} log Pr(⟨C_j, F_j⟩ | C_i^{F_i})    (2)

We use f(·) as the embedding function: C_i = f(C_i) represents the character graph embedding and F_i = f(F_i) represents the family graph embedding. C_i^{F_i} denotes the concatenation of C_i and F_i, whereas N(C_i) is C_i's neighborhood (context). As Figure 2 shows, the feature representation learning method is an upgraded version of the skip-gram architecture. Compared with merely using the target vertex C_i to predict context vertexes, as in the original skip-gram model (Mikolov et al., 2013), the proposed approach employs the character-family pair ⟨C_i, F_i⟩ to predict context character-family pairs. From the variation viewpoint, a character vertex's context will encapsulate both local (vertex) and global (variation family) information. Hence, the learned representations are able to comprehensively preserve the variation information in G.

Pr(⟨C_j, F_j⟩ | C_i^{F_i}) is modeled as a softmax function:

    Pr(⟨C_j, F_j⟩ | C_i^{F_i}) = exp(C_j^{F_j} · C_i^{F_i}) / Σ_{C_k ∈ C} exp(C_k^{F_k} · C_i^{F_i})    (3)

Stochastic gradient ascent is used for optimizing the model parameters of f, and negative sampling (Mikolov et al., 2013) is applied for optimization efficiency. Note that the parameters of each character embedding and family embedding are shared over all the character-family pairs, which, as suggested in (Liu et al., 2015), can address the training data sparseness problem and improve representation quality.

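As a concrete illustration of the co-learning step, the sketch below scores character-family pairs by the dot product of their concatenated embeddings (Eq. 3) and optimizes a standard negative-sampling surrogate of the softmax. The batch tensors, the number of negatives, and the reduced dimensions are assumptions for illustration; the family assignments would come from the Gibbs sampling step described above.

```python
import torch
import torch.nn.functional as F

n_chars, n_families, d = 25949, 500, 64      # pair representation lives in R^{2d}
char_emb = torch.nn.Embedding(n_chars, d)    # character embeddings, shared across pairs
fam_emb = torch.nn.Embedding(n_families, d)  # family embeddings, shared across pairs

def pair_vec(c_ids, f_ids):
    """Character-family pair representation C_i^{F_i} = [C_i ; F_i]."""
    return torch.cat([char_emb(c_ids), fam_emb(f_ids)], dim=-1)

def nce_loss(center_c, center_f, ctx_c, ctx_f, neg_c, neg_f):
    """Negative-sampling surrogate of the softmax in Eq. (3).
    center_*/ctx_*: (B,) index tensors; neg_*: (B, K) sampled negatives."""
    center = pair_vec(center_c, center_f)                              # (B, 2d)
    pos = (center * pair_vec(ctx_c, ctx_f)).sum(-1)                    # (B,)
    neg = torch.einsum("bd,bkd->bk", center, pair_vec(neg_c, neg_f))   # (B, K)
    return -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()
```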
Family-enhanced Embedding Integration. As shown in Figure 2, the family-enhanced character graph embedding can be calculated as:

    G_i = [C_i, Σ_{F_j ∈ F} Pr(F_j | C_i) F_j]    (4)

where G_i is the family-enhanced graph embedding for C_i, and [·] is the concatenation operation. Pr(F_j | C_i) can be inferred from the family assignment distribution β.

3.3 Enhanced Bidirectional Language Model

As shown in Figure 2, the SS model utilizes an enhanced bidirectional language model to jointly learn the variation, semantic and contextualized representation of Chinese characters.

Combination Gate Function. This gate function is utilized for combining the variation and semantic representations, and its output is the input to the bidirectional language model. The formulations of the gate function are as follows:

    P = σ(W_P · [G, T] + b_P),    N = (P ⊙ T) + ((1 − P) ⊙ G)    (5)

P ∈ R^d contains the preference weights controlling the contributions from G ∈ R^d (the variation graph embedding) and T ∈ R^d (the skip-gram textual embedding), with W_P ∈ R^{2d×d}. N ∈ R^d is the combined representation. ⊙ is the elementwise product, and + is the elementwise sum.
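The following is a minimal sketch of the combination gate (Eq. 5), assuming the family-enhanced graph embedding G and the skip-gram textual embedding T for each character are precomputed tensors of the same dimension d; the module structure and tensor shapes here are illustrative.

```python
import torch

class CombinationGate(torch.nn.Module):
    """Gated mixture N = (P ⊙ T) + ((1 − P) ⊙ G), with P = σ(W_P · [G, T] + b_P)."""

    def __init__(self, d: int):
        super().__init__()
        self.proj = torch.nn.Linear(2 * d, d)  # plays the role of W_P and b_P

    def forward(self, g: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        p = torch.sigmoid(self.proj(torch.cat([g, t], dim=-1)))  # preference weights
        return p * t + (1.0 - p) * g

gate = CombinationGate(d=128)
n = gate(torch.randn(4, 20, 128), torch.randn(4, 20, 128))  # (batch, seq, d) inputs
```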
| Group | Model | SMS Accuracy | SMS F1 Score | Review Accuracy | Review F1 Score |
|---|---|---|---|---|---|
| Text | Skipgram (Mikolov et al., 2013) | 0.807 | 0.765 | 0.693 | 0.560 |
| Text | GloVe (Pennington et al., 2014) | 0.732 | 0.637 | 0.707 | 0.600 |
| Text | ELMo (Peters et al., 2018) | 0.786 | 0.747 | 0.755 | 0.647 |
| Chinese | CWE (Chen et al., 2015) | 0.751 | 0.674 | 0.780 | 0.726 |
| Chinese | GWE (Su and Lee, 2017) | 0.505 | 0.426 | 0.778 | 0.718 |
| Chinese | JWE (Yu et al., 2017) | 0.770 | 0.707 | 0.738 | 0.646 |
| Chinese | Cw2vec (Cao et al., 2018) | 0.800 | 0.753 | 0.724 | 0.618 |
| Graph | DeepWalk (Perozzi et al., 2014) | 0.836 | 0.804 | 0.738 | 0.638 |
| Graph | LINE (Tang et al., 2015) | 0.821 | 0.783 | 0.764 | 0.695 |
| Graph | Node2vec (Grover and Leskovec, 2016) | 0.835 | 0.802 | 0.792 | 0.736 |
| Graph | M2VMax (Dong et al., 2017) | 0.838 | 0.807 | 0.790 | 0.740 |
| Graph | HEER (Shi et al., 2018) | 0.723 | 0.617 | 0.771 | 0.708 |
| Correction | Pycorrector (Yu and Li, 2014) | 0.782 | 0.727 | 0.688 | 0.549 |
| Comparison | SSGraph | 0.839 | 0.827 | 0.812 | 0.756 |
| Comparison | SSNaive | 0.849 | 0.825 | 0.811 | 0.757 |
| Comparison | SSOriginal | 0.851 | 0.832 | 0.854 | 0.822 |

Table 1: Chinese Text Spam Detection Performance Comparison of Different Models

Aggregation Learning Function. With the combined representation N as input, we train a bidirectional language model to capture the sequential information. There can be multiple layers of forward and backward LSTMs in the bidirectional language model. For the k-th character, →H_l^k is the forward LSTM unit's output for layer l, where l = 1, 2, ..., L, and ←H_l^k is the output of the backward LSTM unit.

The output SS embedding is learned from an aggregation function, which aggregates the intermediate layer representations of the bidirectional language model and the input embedding N. For the k-th character, if we denote H_0^k = [N^k, N^k] (self concatenation, carrying the variational and semantic information) and H_l^k = [←H_l^k, →H_l^k] (carrying the contextualized information), the output is:

    SS^k = ω (s_0 H_0^k + Σ_{l=1}^{L} s_l H_l^k)    (6)

where ω is a scale parameter, and s_l is a weight parameter for the combination of each layer, which can be learned through the training process. A similar aggregation operation has been proven useful for modeling contextualized word representations (Peters et al., 2018).
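The sketch below shows one way to realize Eq. (6) as a learned scalar mixture over H_0 and the biLSTM layer outputs, analogous to the ELMo-style weighted sum (Peters et al., 2018). Normalizing the layer weights with a softmax is an assumption borrowed from ELMo, not something the paper states.

```python
import torch

class LayerAggregation(torch.nn.Module):
    """SS^k = ω (s_0 H_0^k + Σ_l s_l H_l^k) over precomputed layer tensors."""

    def __init__(self, n_layers: int):
        super().__init__()
        self.s = torch.nn.Parameter(torch.zeros(n_layers + 1))  # s_0 .. s_L
        self.omega = torch.nn.Parameter(torch.ones(()))         # scale ω

    def forward(self, h0: torch.Tensor, layer_outputs: list) -> torch.Tensor:
        # h0: (batch, seq, 2d) self-concatenated gate output [N; N];
        # layer_outputs: L tensors of shape (batch, seq, 2d) from the biLSTM.
        weights = torch.softmax(self.s, dim=0)  # assumed normalization of s_l
        stacked = torch.stack([h0] + list(layer_outputs), dim=0)  # (L+1, B, S, 2d)
        return self.omega * torch.einsum("l,lbsd->bsd", weights, stacked)
```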

4 Experiment

4.1 Dataset and Experiment Setting

Dataset⁵. In Table 2, we summarize the statistics of the two real-world spam datasets (in Chinese). One is an SMS dataset; the other is a product review dataset. Both datasets were manually labeled (spam or regular) by professionals. False advertising and scam information are the most common forms of spam in the SMS dataset, while abusive information dominates the review spam dataset.

| Dataset | Part | All | Spam | Normal |
|---|---|---|---|---|
| SMS | Train | 48,884 | 23,891 | 24,993 |
| SMS | Test | 48,896 | 23,891 | 25,005 |
| Review | Train | 37,299 | 17,299 | 20,000 |
| Review | Test | 37,299 | 17,299 | 20,000 |

Table 2: Statistics of Two Chinese Spam Text Datasets

In the constructed variation graph, there are in total 25,949 Chinese characters (vertexes) and 7,705,051 variation relations. Among the variation relations, there are 1,508,768 pinyin relations (phonetic), 373,803 stroke relations (glyph), and 5,822,480 Zhengma relations (glyph).

Experimental Set-up. We validated the proposed model on the Chinese text spam detection task. In order to simulate the "diversity", "sparseness" and "zero-shot" problems under real business scenarios, we imposed a challenging restriction on the training and testing sets: the character variations were only included in the testing set, and all samples in the training set used the original characters.

For the proposed SS model, we utilized the following settings: number of LSTM layers: 2; dimension of the hidden (output) state in each LSTM: 128; dimension of the pre-trained character text embedding: 128; dimension of the VFGE embedding: 128; batch size: 64; dropout: 0.1. For training the VFGE embedding⁶, the walk length was 80 and the number of walks per vertex was 10. These parameters were adopted from (Peters et al., 2018; Jiang et al., 2018a; Perozzi et al., 2014; Grover and Leskovec, 2016). The number of variation families⁷ was 500. The SS model was pre-trained for parameter initialization as suggested in (Peters et al., 2018).
Table 3: Case Study: given the target character, we list the top 3 similar characters from each algorithm. The characters are selected from a frequently used candidate character set whose size is 8238. training and testing sets, i.e., the character vari- baselines on constructed Chinese character vari- ations were only included in testing set, and all ation graph to get graph based character embed- samples in training set were using the original dings. Specifically, Metapath2vec++ required a characters. human-defined metapath scheme to guide the ran- For the proposed SS model, we utilized the fol- dom walks. We tried 4 different metapaths for this lowing setting: layers of LSTMs: 2; dimension of experiment:(1) M2VP (only walking on pinyin hidden (output) state in LSTM: 128; dimension of (phonetic) relations); (2) M2VS (only walking on pre-trained character text embedding: 128; dimen- stroke (glyph) relations); (3) M2VZ (only walking sion of VFGE embedding: 128; batch size: 64; on Zhengma (glyph) relations); (4) M2VC (alter- Dropout: 0.1. For training VFGE embedding6, the nately walking on glyph and phonetic relations). walk length was 80, the number of walks per ver- We reported the best results from these four meta- tex was 10. These parameters were adopted in (Pe- paths, denoted as M2VMax. ters et al., 2018; Jiang et al., 2018a; Perozzi et al., Spelling Correction Baseline: Pycorrector8 2014; Grover and Leskovec, 2016). The varia- based on n-gram language model (Yu and Li, tion family number7 was 500. SS model was pre- 2014). trained for parameter initialization as suggested in Comparison Group: we compared the per- (Peters et al., 2018). formances of several variants of the proposed Baselines and Comparison Groups. We chose method in order to highlight our technical con- 13 strong baseline algorithms, from text or graph tributions. There were 3 comparison groups con- viewpoints, to comprehensively evaluate the per- ducted. SSGraph: we only used VFGE graph em- formance of the proposed method. bedding. SSNaive: we simply concatenated VFGE General Textual Based Baselines: Skip- graph embedding and skip-gram textual embed- gram (Mikolov et al., 2013), GloVe (Pennington ding (a naive version). SSOriginal: the proposed et al., 2014), and ELMo (Peters et al., 2018). SS model. Chinese Specific Textual Based Baselines: For a fair comparison, the dimension9 of all em- CWE (Chen et al., 2015), GWE (Su and Lee, bedding models was 128. A single layer of CNN 2017), JWE (Yu et al., 2017), and Cw2vec (Cao classification model10 was used for spam detection et al., 2018). task. Graph Embedding Based Baselines: Deep- Walk (Perozzi et al., 2014), LINE (Tang et al., 4.2 Experiment Result and Analysis 2015), Node2vec (Grover and Leskovec, 2016), The text spam detection task performances of dif- Metapath2vec++ (Dong et al., 2017), and ferent models were reported in Table1. Based on HEER (Shi et al., 2018). We applied this group of the experiment results, we had the following ob-

⁵ https://github.com/Giruvegan/stoneskipping
⁶ For experimental fairness, all the random walk based graph embedding baselines shared the same parameters with VFGE.
⁷ Based on a parameter sensitivity analysis, the proposed method was not very sensitive to the number of variation families.
⁸ https://github.com/shibing624/pycorrector
⁹ The initial dimension of SSNaive and SSOriginal is 256, so we used a fully connected layer to reduce it to 128.
¹⁰ The filter sizes of the CNN are 3, 4, and 5, the filter number is 128, and the dropout ratio is 0.1.
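For concreteness, the following is a minimal sketch of a single-layer CNN text classifier matching the configuration in footnote 10 (filter sizes 3/4/5, 128 filters each, dropout 0.1), consuming the 128-dimensional SS character embeddings; the max-pooling readout and the two-class output head are assumptions about details the paper does not spell out.

```python
import torch

class SpamCNN(torch.nn.Module):
    """Single-layer CNN over a sequence of character embeddings (batch, seq, d)."""

    def __init__(self, d: int = 128, n_filters: int = 128, sizes=(3, 4, 5)):
        super().__init__()
        self.convs = torch.nn.ModuleList(
            [torch.nn.Conv1d(d, n_filters, kernel_size=k) for k in sizes])
        self.dropout = torch.nn.Dropout(0.1)
        self.out = torch.nn.Linear(n_filters * len(sizes), 2)  # spam vs. regular

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.transpose(1, 2)  # -> (batch, d, seq) for Conv1d
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.out(self.dropout(torch.cat(pooled, dim=1)))

logits = SpamCNN()(torch.randn(8, 50, 128))  # e.g. 8 texts of 50 characters each
```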
SS vs. Baselines. (1) SSOriginal outperformed the baseline models on all evaluation metrics on both datasets, which indicates the proposed SS model can effectively address the CSVD problem. (2) On the review dataset, the leading gap between SSOriginal and the other baselines was greater. A possible explanation is that review spam text usually has richer content and more complex variation patterns than SMS spam text, so a good variation representation model has a greater advantage.

Chinese vs. General. (1) Compared to the classical textual embedding models (Skipgram and GloVe), the Chinese embedding models showed their advantages, especially on the review dataset. This result indicates that characteristic knowledge of Chinese can help to detect spam text. (2) ELMo was able to learn both the semantic and contextualized information, and it achieved a good performance in the text baseline group.

Graph vs. Text. Generally, the graph based baselines outperformed the textual based baselines (both general and Chinese). This observation indicates: (1) the variation knowledge of Chinese characters can be critical for the CSVD problem; (2) the proposed character variation graph can provide critical information for Chinese character representation learning; (3) compared to the other graph based baselines, SSGraph was superior, which demonstrates the effectiveness of the VFGE algorithm: the proposed variation family can characterize and predict useful variation patterns for the CSVD problem.

Chinese Character Encodings. (1) In the Chinese textual embedding baseline group, JWE (radical based) and Cw2vec (stroke based) did not perform well, which indicates that a single kind of glyph-based information can be insufficient for Chinese variation representation learning. Similarly, in the graph based baseline group, the performances of M2VP, M2VS and M2VZ (each employing only one encoding relation on the constructed graph) were still unsatisfactory. These results reveal that an individual encoding method cannot comprehensively encode a character; various kinds of variation information should be considered simultaneously. (2) The performance of M2VC (which integrated all relations based on a pre-defined metapath pattern) was still inferior. This result indicates that a human-defined rule cannot effectively integrate all relationships in a complex graph.

Representation vs. Spelling Correction. Pycorrector performed poorly in the experiment, and the other baselines outperformed this approach, which shows that the spelling correction method is not capable of handling the CSVD problem.

Variants of the SS model. For the variants of the proposed method, the results show that (1) combining the semantic and sequential information improves task performance; (2) simply concatenating graph and text embeddings cannot generate a satisfactory joint representation; and (3) the proposed SS model can successfully capture the variation, semantic, and sequential information for character representation learning.

4.3 Case Study

To gain an insightful understanding of the variation representation of the proposed method, we conduct a qualitative analysis by performing case studies of character similarities (Table 3).

| Character | Skipgram (Text) | Cw2vec (Chinese) | VFGE (Graph) | SS (Proposed model) |
|---|---|---|---|---|
| 运 (move) | 捷 (prompt) [C]; 站 (stop) [C]; 客 (guest) [C] | 捷 (prompt) [C]; 站 (stop) [C]; 输 (transport) [S, C] | 云 (cloud) [G, P]; 纭 (numerous) [G, P]; 坛 (altar) [G] | 转 (transmit) [S, C]; 芸 (weed) [G, P]; 云 (cloud) [G, P] |
| 惊 (shock) | 讶 (surprised) [S, C]; 愕 (startled) [S, C]; 吓 (scare) [S, C] | 讶 (surprised) [S, C]; 撼 (shake) [S, C]; 愕 (startled) [S, C] | 景 (view) [G, P]; 晾 (dry) [G]; 谅 (forgive) [G] | 慌 (flurried) [S, C]; 琼 (jade) [G]; 悚 (afraid) [G, S, C] |

Table 3: Case Study: given the target character, we list the top 3 similar characters from each algorithm (G: glyph; P: phonetic; S: semantic; C: context). The characters are selected from a frequently used candidate character set whose size is 8,238.

As shown in Table 3, for the exemplary characters, the most similar characters based on the skipgram embedding (a general textual baseline) are all semantically similar and/or context-related. Likewise, based on the Cw2vec embedding (the most recent Chinese embedding baseline), all similar characters for the target characters are also semantically similar and/or context-related. Unsurprisingly, for each target character, all similar characters based on the VFGE model (the best performing graph embedding model) are glyph- and phonetic-similar characters. The proposed SS model achieves comprehensive coverage from the variation, semantic and context viewpoints. For instance, among its top 3 similar characters for "运 (move)", "转 (transmit)" is a semantically and contextually similar character, and "云 (cloud)" is a glyph- and phonetic-similar character. Furthermore, the SS model can capture complicated compound similarity between Chinese characters; for instance, "悚 (afraid)" is a glyph-, semantically, and contextually similar character for "惊 (shock)". This also explains why the SS model performs well on the CSVD problem.

[Figure 3: Two typical examples for the CSVD task]

Figure 3 depicts two typical examples from the experimental datasets. For the spam text with variations, spammers used character variations to create camouflaged expressions, for instance, using the glyph variation "江 (river)" to replace "红 (red)", and the glyph-phonetic compound variation "薇 (osmund)" to replace "微 (micro)". The classical text embedding models may fail to identify this kind of spam text; by mining the character variation graph, the graph based approaches can successfully capture these changes. For spam text without variations, classification models need more semantic and contextual information, and the text based methods are suitable for this kind of spam text. The proposed SS model is able to detect both kinds of spam texts effectively, and the experiment results prove that SS can successfully model Chinese variational, semantic and contextualized representations for the CSVD task.

5 Conclusion

In this paper, we propose the StoneSkipping model for Chinese spam detection. The performance of the proposed method is comprehensively evaluated on two real world datasets with a challenging experimental setting. The results of the experiments show that the proposed model significantly outperforms a number of state-of-the-art methods. Meanwhile, the case study empirically proves that the proposed model can successfully capture the Chinese variation, semantic, and contextualized information, which can be essential for the CSVD problem. In the future, we will investigate more sophisticated methods to improve SS's performance, e.g., enabling a self-attention mechanism for contextualized information modelling.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (61876003, 81971691), the China Department of Science and Technology Key Grant (2018YFC1704206), and the Fundamental Research Funds for the Central Universities (18lgpy62).

References

Shaosheng Cao, Wei Lu, Jun Zhou, and Xiaolong Li. 2018. cw2vec: Learning Chinese word embeddings with stroke n-gram information. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pages 5053–5061.

Xinxiong Chen, Lei Xu, Zhiyuan Liu, Maosong Sun, and Huan-Bo Luan. 2015. Joint learning of character and word embeddings. In IJCAI, pages 1236–1242.

Zheng Chen and Kai-Fu Lee. 2000. A new statistical approach to Chinese pinyin input. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Yuxiao Dong, Nitesh V Chawla, and Ananthram Swami. 2017. metapath2vec: Scalable representation learning for heterogeneous networks. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 135–144. ACM.

Aditya Grover and Jure Leskovec. 2016. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 855–864. ACM.

Zhuoren Jiang, Liangcai Gao, Ke Yuan, Zheng Gao, Zhi Tang, and Xiaozhong Liu. 2018a. Mathematics content understanding for cyberlearning via formula evolution map. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 37–46. ACM.
Zhuoren Jiang, Yue Yin, Liangcai Gao, Yao Lu, and Xiaozhong Liu. 2018b. Cross-language citation recommendation via hierarchical representation learning on heterogeneous graph. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 635–644. ACM.

Nitin Jindal and Bing Liu. 2007. Review spam detection. In Proceedings of the 16th International Conference on World Wide Web, pages 1189–1190. ACM.

Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In AAAI, pages 2418–2424.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Jerry Norman. 1988. Chinese. Cambridge University Press.

Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In Proceedings of the 15th International Conference on World Wide Web, pages 83–92. ACM.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. 2014. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710. ACM.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pages 2227–2237.

Ian Porteous, David Newman, Alexander Ihler, Arthur Asuncion, Padhraic Smyth, and Max Welling. 2008. Fast collapsed Gibbs sampling for latent Dirichlet allocation. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 569–577. ACM.

Yu Shi, Qi Zhu, Fang Guo, Chao Zhang, and Jiawei Han. 2018. Easing embedding learning by comprehensive transcription of heterogeneous information networks. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2190–2199. ACM.

Hua Shu and Richard C Anderson. 1997. Role of radical awareness in the character and word acquisition of Chinese children. Reading Research Quarterly, 32(1):78–89.

Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943.

John A Spinks, Ying Liu, Charles A Perfetti, and Li Hai Tan. 2000. Reading Chinese characters for meaning: The role of phonological information. Cognition, 76(1):B1–B11.

Tzu-ray Su and Hung-yi Lee. 2017. Learning Chinese word representations from glyphs of characters. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 264–273.

Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, pages 1067–1077. International World Wide Web Conferences Steering Committee.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Jui-Feng Yeh, Yun-Yun Lu, Chen-Hsien Lee, Yu-Hsiang Yu, and Yong-Ting Chen. 2014. Chinese word spelling correction based on rule induction. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 139–145.

Jinxing Yu, Xun Jian, Hao Xin, and Yangqiu Song. 2017. Joint embeddings of Chinese words, characters, and fine-grained subcharacter components. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 286–291.

Junjie Yu and Zhenghua Li. 2014. Chinese spelling error detection and correction based on language model, pronunciation, and shape. In Proceedings of The Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 220–223.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.
