<<

Understanding Transformers for Bot Detection in Twitter

Andres Garcia-Silva Cristian Berrio Jose Manuel Gomez-Perez Expert.ai Research Lab Expert.ai Research Lab Expert.ai Research Lab Prof. Waksman 10 Prof. Waksman 10 Prof. Waksman 10 28036 Madrid, Spain 28036 Madrid, Spain 28036 Madrid, Spain [email protected] [email protected] [email protected]

Abstract sentence prediction learning objective. However, In this paper we shed light on the impact of none of these corpora cover text from social media fine-tuning over social media data in the in- where users, human or automated agents, are con- ternal representations of neural language mod- tinuously sharing pieces of information as part of els. We focus on bot detection in Twitter, a the social interaction. This brings up the question key task to mitigate and counteract the auto- whether such language models also capture the nu- matic spreading of disinformation and bias in ances of the short, informal and evolving language social media. We investigate the use of pre- often found in social media channels. trained language models to tackle the detection of tweets generated by a bot or a human ac- In this paper we shed light on the impact of fine- count based exclusively on its content. Unlike tuning language models over social media data. the general trend in benchmarks like GLUE, Particularly, we focus on bot detection in Twitter, where BERT generally outperforms generative which is a key task to mitigate and counteract the transformers like GPT and GPT-2 for most automatic spreading of disinformation and bias in classification tasks on regular text, we observe social media (Shao et al., 2018). We model this that fine-tuning generative transformers on a bot detection task produces higher accuracies. task as a binary classification problem where the We analyze the architectural components of goal is to detect whether the content of a tweet each transformer and study the effect of fine- has been produced by a bot or a human. Note tuning on their hidden states and output repre- that different features, e.g. number of followees sentations. Among our findings, we show that and followers, number of retweets, presence of part of the syntactical information and distri- hashtags, URLs, etc., can be valuable sources of butional properties captured by BERT during relevant information to solve this task. However, it pre-training is lost upon fine-tuning while the is not our objective to beat the SotA in bot detection, generative pretraining approach manage to pre- serve these properties. but to understand how pre-trained language models adapt their internal representations in the presence 1 Introduction of such data and make an informed interpretation Nowadays fine-tuning neural networks trained on a of the resulting predictions. Therefore, we focus language modeling objective (Bengio et al., 2003) solely on the analysis of the actual text contained has become the usual way to address most NLP in the tweet, which has been regarded as highly relevant to detect automatic accounts (Gilani et al., arXiv:2104.06182v1 [cs.CL] 13 Apr 2021 tasks. While early language models like ELMo (Pe- ters et al., 2018a) and ULMFit (Howard and Ruder, 2017). 2018) used LSTM architectures, as of today trans- Looking at multi-task benchmarks like formers (Vaswani et al., 2017) are the foundation GLUE (Wang et al., 2018), most tasks are of successful language models like GPT (Radford dominated by transformer-based models fine-tuned et al., 2018) and BERT (Devlin et al., 2018). on task-specific data. This is also the case for Most language models are trained on high qual- classification tasks, where a deeper look reveals ity, grammatically correct, curated text corpora that BERT-based models systematically outper- such as Wikipedia (ULMFiT), BookCorpus (GPT), form models based on generative pre-training a collection of Web Pages (GPT-2), a combination like GPT. However, our experiments in a bot of Wikipedia and BookCorpus (BERT) or News detection dataset (section3) show that this trend (ELMo). In fact, BERT requires sequences of sen- actually reverses (section4) for the bot detection tences that makes sense due to the additional next task. We investigate the reasons of this apparent anomaly by carrying out a bottom-up analysis on stand how they work (Rogers et al., 2020). Peters the transformers used in each language model, et al.(2018b) found that language models encode a from vocabulary and word embeddings (section5) rich hierarchy of contextual information throughout through hidden states to output representations the layers, from morphological at the embeddings, (section6). Our results show that fine-tuning for local-syntax at lower layers and semantic informa- bot detection impacts BERT and GPT-2 differently tion at upper layers. Tenney et al.(2019) show that with respect to the contextualized outputs and BERT encodes phrase-level information in its lower linguistic properties learned in pre-training. layers and a hierarchy of linguistic information in We observe that BERT’s hidden states cap- its intermediate layers. Ethayarajh(2019) shows ture less grammatical information after fine-tuning, that BERT and GPT-2 hidden states and outputs while GPT-2 manages to generate representations occupy a narrow cone in the vector space. Clark that preserve such information as in the pre-trained et al.(2019) investigated self attention and found model. We show evidence that after fine-tuning some heads that capture syntactical information. GPT-2 tends to generate more distributed outputs Kovaleva et al.(2019) shows that in fine-tuning the and some clusters can be roughly mapped to twitter attention mechanism in BERT focus mainly on the entities while BERT outputs do not express this CLS and SEP tokens. Our research work can be differentiation and are all concentrated in the same framed with the above mentioned since our goal is region of the vector space. For reproducibility pur- to understand the effect of fine-tuning on the hid- poses, the paper includes a GitHub repository1 with den states and outputs of the language models by experimental data and code, as well as detailed dia- put them into test on tasks related to bot detection, grams and figures. grammatical analysis and contextual similarity, and provide reference data from classification tasks in 2 Related work standard benchmarks.

Different neural architectures have been proposed Automatic accounts known as bots publish a to pre-train neural language models on large text large amount of information in social media, and corpus (Howard and Ruder, 2018; Radford et al., some of them take active part on spreading misin- 2018; Devlin et al., 2018). Transformer architec- formation and fake news (Shao et al., 2018). In tures (Vaswani et al., 2017) are used in GPT (Rad- fact, twitter points in its last report about platform 2 ford et al., 2018), and BERT (Devlin et al., 2018), manipulation at malicious bots as one of the main two language models that have pushed the state actors, along with spam, and fake accounts. of the art in different NLP tasks, inspiring sub- In general, detecting bots can be addressed as sequent work like GPT-2 (Radford et al., 2019), a binary classification problem that can leverage RoBERTa (Liu et al., 2019) and DistilBERT (Sanh different features from the social network including et al., 2019). user metadata, user relations, and posting activity, Recently, language models are being used to pro- as well as the content of the posts. Approaches cess text from social media. BERT is used for the like Botomoter (Varol et al., 2017) work at the ac- classification of tweets in the disaster management count level and use all the available information to field (Ma, 2019), hate speech identification (Mishra, derive features that are processed with an ensem- 2019; Mozafari et al., 2019), offensive language ble of algorithms. However this classification (Paraschiv and Cercel, 2019), subjec- approach shows a large disagreement with human tivity, polarity and irony detection (Polignano et al., annotators (Gilani et al., 2017), and leads to false 2019), and to evaluate tweet predictability (Przy- positives and negatives (Rauchfleisch and Kaiser, była, 2019). The generative ability of GPT-2 is 2020). In this paper we work at the tweet-level used to replicate politician’s twitter accounts (Ress- focusing only on the tweet content to determine meyer, 2019), and to generate fake news and enable whether it was posted by a bot or a human. In (Gi- its detection (Zellers et al., 2019). lani et al., 2017) human annotators cited the style Since pre-trained language models work very and the patterns of the tweets as strong indicators well in practice, researchers are trying to under- of automatic accounts.

1https://github.com/botonobot/Underst anding-Transformers-for-Bot-Detection-Tw 2https://transparency.twitter.com/en/ itter reports/platform-manipulation.html 3 Dataset Model Library Precision Recall F-score GPT-2 OpenAI 0.8657 0.8640 0.8630 We tap into a ground truth dataset of bot and hu- GPT OpenAI 0.8567 0.8546 0.8533 man accounts released by Gilani et al.(2017). This GPT-2 Transformers 0.8549 0.8524 0.8509 dataset was manually annotated by 4 annotators, BERT Transformers 0.8443 0.8437 0.8439 BERT - 0.8572 0.8213 0.8388 achieving 89% inter-annotator agreement, 77 Co- SVM LinearSVC 0.8033 0.8031 0.8032 hen’s kappa (κ) coefficient indicating substantial agreement. We generate a balanced dataset con- Table 1: Fine-tuned language models on the bot de- taining 600,000 tweets labelled as bot or human tection task. We trained a linear SVM baseline (Fan according to the account label. Our training dataset et al., 2008) with a tuned regularization parameter, lem- comprises a total of 500,000 tweets where 279,495 matized words using , and stopwords removed. tweets were created by 1,208 human accounts, and 220,505 tweets were tweeted from 722 bot ac- To evaluate the resulting models we designed counts. The test set contains the remaining 100,000 a preliminary experiment to spot twitter entities tweets where 55,712 are tweets from human ac- like hashtags, user mentions, and URLs. Spotting counts and 44,288 are tweets from bot accounts. twitter entities is relevant for bot detection since Different usage patterns emerge in this dataset bots and humans seem to use them in different for bots and humans regarding twitter entities such ways, e.g. bots tend to focus on hashtags rather as user mentions, hashtags and URL. For exam- than users, while the opposite applies for humans. ple, bots are more active averaging 305 tweets per We fine-tune the language models following the account compared to 231 tweets of humans. Bots same procedure used by Devlin et al.(2018) to also seek to redirect more traffic and increase visi- fine-tune BERT on SQuAD v1.1 (Rajpurkar et al., bility using 0.8313 URL and 0.4745 hashtags per 2016). We use an entity type as a question and tweet, while humans use 0.5781 URL and 0.2887 the tweet content as paragraph so that the task is hashtags. Finally, humans interact more with other to identify the entity text span in the tweet. For accounts by mentioning other users 0.5781 times instance, to supervise BERT on this task we use per tweet, while bots display a less social behaviour sequences such as "[CLS] hashtag? [SEP] Get mentioning other users only 0.4371 times per tweet. ready for the next #YouthOlympics [SEP]", and the expected answer is "#YouthOlympics". 4 Fine-tuning transformer language The evaluation results of the twitter entity spot- models for bot detection ting task are shown in table2. Note that we dis- We follow Devlin et al.(2018) and Radford et al. criminate between entities anywhere and the ones (2018) to tackle single sequence classification with appearing only at the end of the tweet in an attempt BERT and GPT respectively. That is, we add a to find out whether removing the context on the linear on top of the last hidden state of the right of the entities could have a positive effect. classification token CLS (BERT) or the last token Regardless of the entity placement, GPT-2 outper- (GPT) to train a softmax classifier. The only new forms BERT. For entities anywhere, GPT-2 learns parameters to train from scratch are the weights in to detect the start of all entities types, while for the the new layer W ∈ RKxH , where k is the number end position it only learns to detect URLs. BERT of classification labels and H is the size of the hid- predictions are less accurate for both start and end den state representation for CLS or the last token. tokens in all entity types, and in fact it does not In our experiments we use BERT base and GPT, learn to detect the end token of any entity type. that are comparable models in size (110M parame- At this stage it is hard to distinguish what makes ters), and the GPT-2 version with 117M parameters. one language model more accurate spotting entities These models are available in the Transformers li- than the other beyond the fitness of each model’s brary (Wolf et al., 2019). We use batch size 8 and output for the task. We expected that removing the train during 2 epochs. As the f-measure values indi- context on the right would largely improve BERT’s cate, the evaluation results reported in table1 show results, considering that the right-to-left in the bidi- that GPT and GPT-2 consistently achieved better rectional and simultaneous text processing was part performance in this task (henceforth we focus on of the problem to generate good representations for GPT-2 rather than GPT since the former generally this task, or in general to process tweets. Although outperforms the latter in bot detection). we showed that removing context on the right ben- Entity anywhere Entity at the end Twitter Entity Model start end span start end span GPT-2 0.6673 0.2351 0.2582 0.8618 0.9946 0.9150 Hashtag BERT 0.3625 0.1596 0.1011 0.4649 0.5513 0.2452 GPT-2 0.8649 0.8427 0.8112 0.9882 0.9984 0.9914 URL BERT 0.7975 0.4927 0.5004 0.9828 0.6412 0.6923 GPT-2 0.8242 0.5120 0.5047 0.7349 0.9954 0.8123 User mention BERT 0.7008 0.4066 0.3731 0.1583 0.7538 0.1359

Table 2: F-measure of the fine-tuned language models to spot twitter entities. efited BERT in some cases, GPT-2 seems to benefit URLs (see figure 1b) the two distributions become the most of this setting. very similar. In general, word length distribution in our dataset is very different compared to CoLA and 5 Vocabulary and word embeddings SST-2, where most of the words are represented us- ing 1 to 3 or 4 tokens. Beyond the greater number To deal with rare words language models use fixed- of sub-words in our dataset and the difference be- size vocabularies of sub-word units. Sub-word tween BERT and GTP-2 when representing URLs units are a flexible representation between char- we did not find further significant differences be- acters and word level models that allow repre- tween both approaches at this level. Thus, we move senting infrequent words by putting together sub- on with our analysis at the embedding layer. word units. The vocabulary size of BERT base We evaluate the embeddings learnt by the lan- is 30,000 and is based on the WordPiece model guage models after fine-tuning on the bot detection (Wu et al., 2016) while Open AI GPT-2 uses task in the same classification task. For every tweet, Byte Pair Encoding (Sennrich et al., 2016), with we concatenate the embeddings of all their tokens a vocabulary size of 50,257. For example, the and feed them to a linear fully connected layer to tweet "Congratulations @LeoDiCaprio #Oscars train a softmax classifier. Due to memory restric- https://t.co/5WLesgfnbe"3 is converted to the se- tions we limit the number of token in each tweet to quences of word and sub-word units in table3. 128 maximum. We carry out similar experiments Unlike the well-formed text where language with CoLA and SST-2 datasets, training classifiers models are usually pre-trained, tweets can contain on their sentence classification tasks. We also in- twitter-specific entities, abbreviations, and emoti- clude the Quora Question Pairs4 (QQP) and the cons among other distinctive features. Such entities Research Paraphrase Corpus (MRPC) are candidates to be split into sub-word units since (Dolan and Brockett, 2005) datasets from GLUE, they are hardly covered in the corpus where lan- which require to classify pairs of sentences from guage models are trained. Figure1 shows the word social question answering and news, respectively. length distribution in our dataset measured in num- The evaluation of the embedding-based classi- ber of tokens (words and word units) and compares fiers are reported in table4. In the bot detection it with two datasets used in sentence classification task, BERT generates embeddings that help to learn tasks that are part of the GLUE benchmark (Wang better classifiers than GPT-2, which contrasts with et al., 2018): the Corpus of Linguistic Acceptabil- our early findings, where GPT-2 outputs contribute ity (CoLA) (Warstadt et al., 2019), with sentences to learn more accurate classifiers. Nevertheless, from the linguistic literature domain, and Stanford the embeddings alone do not seem to be a reli- Sentiment Tree Bank (SST-2) (Socher et al., 2013), able indicator of the performance of the models. which includes movie reviews. Fine-tuning BERT on CoLA, SST-2, QQP, and In general, words in tweets are split in more to- MRPC systematically produces more accurate re- kens than words in the other datasets. As shown sults than GPT, according to (Devlin et al., 2018) in figure 1a, BERT and GPT-2 have similar dis- and (Radford et al., 2018), and GPT-2, according tributions for words split in less than 10 tokens, to our own experimentation. However, BERT em- however for words with more tokens the patterns beddings only generate better classifiers in SST-2 are different. Manual inspection shows that URLs and MRPC, while GPT-2 embeddings are better in are often split in many tokens. If we remove the 4https://data.quora.com/First-Quora-D 3tweet from OMGtrolls : https://twitter.com/OMGtrolls ataset-Release-Question-Pairs Model Words and sub-words ’congratulations’ ’@’ ’leo’ ’##dic’ ’##ap’ ’##rio’ ’#’ ’oscar’ ’##s’ ’https’ ’:’ ’/’ BERT /’ ’t’ ’.’ ’co’ ’/’ ’5’ ’##wl’ ’##es’ ’##gf’ ’##nb’ ’##e’ ’cong’ ’ratulations’ ’G@’˙ ’le’ ’odic’ ’ap’ ’rio’ ’G#’˙ ’osc’ ’ars’ ’Ghttps’˙ ’://’ GPT-2 ’t’ ’.’ ’co’ ’/’ ’5’ ’w’ ’les’ ’g’ ’fn’ ’be’

Table 3: Word and sub-word units produced by BERT and GPT-2 tokenizations for the tweet "Congratulations @LeoDiCaprio #Oscars https://t.co/5WLesgfnbe"

(a) Bot detection (b) Bot detection no URLs (c) CoLA (d) SST-2

Figure 1: Word length distribution on the bot detection, CoLA, and SST-2 datasets. X-axis is number of tokens in a word, and Y-axis is number of words in thousands

Dataset Model Precision Recall F-score except for layer 5. While for SST-2 the models Bot detection GPT-2 0.7098 0.7120 0.7061 exhibit a similar pattern, for CoLA, MRPC and BERT 0.7298 0.7274 0.7283 CoLA GPT-2 0.4779 0.6913 0.5651 QQP tasks, the pattern is different and recurrent, BERT 0.6394 0.5139 0.5254 showing that in the first layers GPT-2 hidden states SST-2 GPT-2 0.5047 0.5092 0.3496 BERT 0.6908 0.6319 0.6048 help to learn classifiers with greater or similar f- MRPC GPT-2 0.7873 0.6912 0.5722 measure than BERT hidden states, and from the BERT 0.6144 0.6765 0.5957 QQP GPT-2 0.7446 0.7361 0.7389 intermediate to the final layers BERT hidden states BERT 0.7212 0.7198 0.7204 contribute to learn classifiers with similar or greater f-measure. The slight but constant margin in bot Table 4: Embeddings performance in different datasets detection shows evidence that the internal represen- (fine-tuned models in the corresponding task). tations of GPT-2 do a better job at encoding tweet information for this task. CoLa and QQP. Thus, we disregard the embeddings 6.1 Grammatical Analysis and focus our analysis on the transformer hidden states. We investigate the ability of language model hidden states to capture grammatical information upon fine 6 Exploring hidden states tuning for bot detection. As grammatical tasks we use part-of-speech tagging and chunking. Our We aim at quantify the potential contribution of goal is to determine whether the hidden states of each layer in the transfomer to the bot detection different words with the same part-of-speech or task. We connect the hidden state of the classifica- chunk tag are positioned in the same region of the tion token (CLS in case of BERT or the end token distributional vector space. in the case of GPT-2) at each layer of the fine-tuned For the chunking task we use regular text from model to a linear layer and train a softmax classi- CONLL-2000 dataset (Tjong Kim Sang and Buch- fier for the bot detection task. In this experiment, holz, 2000). For the part-of-speech tagging task we train each classifier for 2 epochs with a learn- we annotate our dataset using TweetNLP tag- ing rate of 2e-5 and batch size 8. In addition, we ger (Owoputi et al., 2013). This tool reported carry out the same experiment for language models 93.2% accuracy, and in addition to common word fine-tuned on CoLA, SST-2, QQP and MRPC. categories like verbs and nouns, includes twitter- The results depicted in figure2 show that, for the specific labels like emoticons or hashtags. Then, bot detection task, GPT-2 hidden states generally we cluster the hidden states and outputs for each contribute to learn slightly more accurate classi- tagged word using k-means and evaluate the clus- fiers layer by layer than using BERT hidden states, ters using normalized mutual information (NMI). (a) Bot detection (b) CoLA (c) MRPC (d) QQP (e) SST-2

Figure 2: Layer analysis of BERT and GPT-2 fine-tuned on bot detection, SST-2, CoLA, MRPC and QQP. X-axis corresponds with the hidden state layer used while Y-axis is the f-score using that hidden state layer.

(a) BERT (b) GPT-2

Figure 3: NMI of the hidden state clusters and part-of-speech tags and chunk labels. We compare the results obtained with pre-trained language models vs. their versions fine-tuned for the bot detection task.

This procedure was used in (Peters et al., 2018b; gests that, upon fine-tuning for bot detection, GPT- Jawahar et al., 2019) to assess whether hidden 2 manages to preserve the grammatical informa- states capture syntactic information. tion contained in its internal representations, while We tested pre-trained and fine-tuned models to BERT on the other hand seems to lose part of it. figure out how the adjustment to the bot detection task affects the language model hidden states. We 6.2 Twitter Entitites generate clusters out of the hidden states and out- We also look into hidden states and outputs for puts for 3000 labeled tokens randomly selected twitter entities (user mention, hashtag, and url) from the POS and chunking datasets. We repeat since GPT-2 learns to spot these entities with more this experiment 10 times and report in figure3 the accuracy than BERT. We use k-means to cluster average NMI scores for these clusters. the language model hidden states and outputs for The hidden states in pre-trained BERT produce, 3000 entities randomly sampled from our dataset. in general, greater NMI scores than GPT-2’s for We repeat this experiment 10 times and report in the POS-tagging task and also in the chunking task. figure4 the average NMI of the clusters. The curves, high at the beginning and decreasing Note the negative effect of fine-tuning BERT towards the end, are in line with previous findings for bot detection when representing twitter entities. (Jawahar et al., 2019; Tenney et al., 2019), which While pre-trained BERT encodes twitter entities reported that BERT encodes mainly syntactic infor- both in hidden states and outputs, the fine-tuned mation in its first layers, while the last layers are model generates consistently lesser NMI values for more task-specific. these contextual representations. Also, the NMI However, when the models are fine-tuned for drop is particularly acute from layers 8 to 12. Af- the bot detection task, BERT’s hidden states and ter fine-tuning, hidden states and outputs of GPT-2 outputs are significantly worse at encoding POS follow a similar pattern to the pre-trained model, and chunking information from layer 8 to 12, while encoding twitter entities mainly between layers 2 GPT-2 holds a similar pattern as the pre-trained to 10. However, in the last layers GPT-2 repre- model for POS tagging and chunking. This sug- sentations for these entities are worse than in the pre-trained model. 6.4 Discussion Fine-tuning on the bot detection task modifies the hidden states and outputs of both BERT and GPT- 2, although in different ways. pre-trained BERT encodes grammatical information through all lay- ers regardless whether it was tested on regular text as in the CONLL-2000 dataset or on tweets as in the part-of-speech experiments (figure 3a). Never- theless, fine-tuned BERT encoding of grammatical information is poorer in the final layers. After fine- Figure 4: NMI evaluation of clusters out of hidden states and outputs for twitter entities. tuning, the final layers change the most (Kovaleva et al., 2019) to adjust the representation to the corre- sponding task, and hence grammatical information vanishes in these layers. 6.3 Contextual Similarity BERT encoding of twitter entities (figure4) ex- To gain additional insight, we investigate the con- hibits a similar pattern to the grammatical infor- textual embeddings generated by each model and mation (figure 3a). However, in this case there how these outputs change from pre-trained to is larger gap between the encoding of entities in fine-tuned models. Let us focus e.g., on the the pre-trained and the fine-tuned model across tweet: "Congratulations @LeoDiCaprio #Oscars all layers and not limited to the last layers as in https://t.co/5WLesgfnbe". Cosine similarity values the grammatical information analysis. We think of the outputs for tokens in this tweet are shown that such a change across all layers indicates that in figure5, where lighter colors indicate greater BERT modifies to a great extent the representation similarity and darker means the contrary. of twitter entities since they are relevant for the bot Note that pre-trained GPT-2 outputs are very sim- detection task. ilar, except some of the URL tokens (https, t, ·, and According to Kovaleva et al.(2019), CLS takes co) which are more similar among them. However, most of the focus of BERT’s self-attention during in the fine-tuned model the outputs become less fine-tuning. Therefore, we also hypothesize that similar and some groups of similar outputs emerge the changes in the BERT contextual embeddings that are roughly associated with entities, such as and representations associated with twitter entities user mentions, hashtags, and URLs. On the other are related to the increased attention on CLS. This hand, BERT follows the opposite pattern. That is, would contribute to explain the loss of linguistic the pre-trained model generates similar outputs for information and the high similarity of the outputs entities (leo #dic #ap #rio, and oscar #s) and URL upon fine-tuning (figure6). substrings. However, when the model is fine-tuned On the other hand, GPT-2 shows similar patterns all the outputs are very similar among them and no in the pre-trained and fine-tuned models when en- segmentation is noticeable in the vector space. coding grammatical information (figure 3b) and To investigate whether this observation general- twitter entities (figure4). Similarly the contextual izes we randomly sample 100,000 tweets from our representations are very stable upon fine-tuning dataset and calculate the average cosine similarity (figure6). We think this is a consequence of a) the between the outputs of the tokens in each tweet. left-to-right text processing strategy followed in The results in figure6 confirm that pre-trained GPT-2 and b) the use of the last token’s language BERT outputs are less similar, and hence more dis- model output for classification during fine-tuning. tributed in the vector space than pre-trained GPT-2 The model needs to process the whole tweet token outputs. These results are aligned with Ethayarajh by token in order to be able to adjust the last to- (2019) findings for text extracted from the SemEval ken representation for the bot detection task. An Semantic Textual Similarity task, where GPT-2 out- analysis of the self-attention mechanism in GPT-2 puts for samples of word pairs were more similar could identify on which tokens the model is fo- than BERT’s. Fine-tuned for bot detection, GPT-2 cused during each fine-tuning step. Nevertheless outputs become less similar, while BERT outputs such analysis is out of the scope of this research, turn into very similar. where we focused on hidden states and outputs. (a) GPT-2 pre-trained (b) GPT-2 fine-tuned (c) BERT pre-trained (d) BERT fine-tuned

Figure 5: Contextual similarity of all words and sub-word units pairs in a tweet computed on the language model output representations of the pre-trained and fine-tuned models.

states encode less grammatical information than the pre-trained model particularly in the final lay- ers, and the encoding of twitter entities is affected across all layers. In addition, while pre-trained BERT contextualized embeddings are more dis- tributed, after fine-tuning they become more simi- lar. On the contrary, pre-trained and fine-tuned GPT- Figure 6: Average intra-sentence contextual similarity. 2 do not show acute changes about the encoding of grammatical information in the hidden states and outputs. In the twitter entities spotting probe 7 Conclusions GPT-2 hidden states change but only on the last lay- In the bot detection task the generative transformers ers, and the contextual representations were more GPT and GPT-2 were more accurate than the bidi- distributed after fine-tuning which indicates that rectional masked language model learnt by BERT. the model was discriminating more the representa- We show that this behaviour is also true when GTP- tions. We think that the stability of the model after 2 and BERT are fine-tuned to detect entities that fine-tuning is due to the use of the last token output are twitter-specific such as user mentions, hashtags in fine-tuning and the left-to-right text processing and URL. In our dataset bots and humans show used in GPT-2 which enforce to process the whole different usages patterns for these entities. tweet before adjusting the last token representation for the bot detection task. Since BERT generally reports higher accuracy for the text classification tasks in GLUE, we inves- Future work includes the extension of this study tigate both language models in order to understand to other tasks based on Twitter data to assess why in the bot detection task the generative trans- whether the language model learnt by GPT-2 is formers were more accurate. At the vocabulary more suitable to process twitter data in general be- level we found that words in our dataset are split yond the bot detection task. In addition, we would into more subwords than words found on regular like to investigate the role of the next sentence text datasets, and when removing URL the word prediction learning objective used in BERT since lenght distribution turn into very similar. Next, we twitter messages are short and typically consist of discarded the word embeddings as a reliable indica- only one long sentence. tor of the model performance in text classification tasks. Acknowledgements By probing the hidden states of both models in This research was funded by the EU Horizon 2020 the bot detection task we found that the hidden grant European Language Grid (ELG-825627). states of GPT-2 contributed to learn more accurate classifiers than BERT’s, across all layers. This was a distinct pattern compared to the models fine- References tuned for some text classification tasks in GLUE , Réjean Ducharme, Pascal Vincent, and (CoLA, QQP, and MRPC). Christian Janvin. 2003. A neural probabilistic lan- We show that after fine-tuning BERT hidden guage model. J. Mach. Learn. Res., 3:1137–1155. Kevin Clark, Urvashi Khandelwal, Omer Levy, and Shubhanshu Mishra. 2019. 3idiots at hasoc 2019: Fine- Christopher D. Manning. 2019. What does BERT tuning transformer neural networks for hate speech look at? an analysis of BERT’s attention. In Pro- identification in indo-european languages. In FIRE. ceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for Marzieh Mozafari, Reza Farahbakhsh, and Noel Crespi. NLP, pages 276–286, Florence, Italy. Association 2019. A BERT-based approach for for Computational Linguistics. hate speech detection in online social media. In COMPLEX NETWORKS. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Olutobi Owoputi, Brendan T. O’Connor, Chris Dyer, deep bidirectional transformers for language under- Kevin Gimpel, Nathan Schneider, and Noah A. standing. arXiv preprint arXiv:1810.04805. Smith. 2013. Improved part-of-speech tagging for online conversational text with word clusters. In William B. Dolan and Chris Brockett. 2005. Automati- HLT-NAACL. cally constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop Andrei Paraschiv and Dumitru-Clementin Cercel. 2019. on Paraphrasing (IWP2005). Upb at GermEval-2019 task 2: BERT-based offen- sive language classification of german tweets. In Kawin Ethayarajh. 2019. How contextual are contextu- KONVENS. alized word representations? comparing the geome- Matthew Peters, Mark Neumann, Mohit Iyyer, Matt try of BERT, ELMo, and GPT-2 embeddings. arXiv Gardner, Christopher Clark, Kenton Lee, and Luke preprint arXiv:1909.00512. Zettlemoyer. 2018a. Deep contextualized word rep- Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang- resentations. In Proceedings of the 2018 Conference Rui Wang, and Chih-Jen Lin. 2008. Liblinear: A of the North American Chapter of the Association library for large linear classification. J. Mach. Learn. for Computational Linguistics: Human Language Res., 9:1871–1874. Technologies, Volume 1 (Long Papers), pages 2227– 2237. Association for Computational Linguistics. Zafar Gilani, Ekaterina Kochmar, and Jon Crowcroft. Matthew Peters, Mark Neumann, Luke Zettlemoyer, 2017. Classification of twitter accounts into auto- and Wen-tau Yih. 2018b. Dissecting contextual mated agents and human users. In Proceedings of word embeddings: Architecture and representation. the 2017 IEEE/ACM International Conference on In Proceedings of the 2018 Conference on Em- Advances in Social Networks Analysis and Mining pirical Methods in Natural Language Processing, 2017, ASONAM ’17, pages 489–496, New York, pages 1499–1509, Brussels, Belgium. Association NY, USA. ACM. for Computational Linguistics. Jeremy Howard and Sebastian Ruder. 2018. Fine- Marco Polignano, Pierpaolo Basile, Marco de Gem- tuned language models for text classification. CoRR, mis, Giovanni Semeraro, and Valerio Basile. 2019. abs/1801.06146. AlBERTo: Italian BERT Language Understanding Model for NLP Challenging Tasks Based on Tweets. Ganesh Jawahar, Benoît Sagot, and Djamé Seddah. In Proceedings of the Sixth Italian Conference on 2019. What does BERT learn about the structure Computational Linguistics (CLiC-it 2019), volume of language? In Proceedings of the 57th Annual 2481. CEUR. Meeting of the Association for Computational Lin- guistics, pages 3651–3657, Florence, Italy. Associa- Piotr Przybyła. 2019. Detecting bot accounts on twitter tion for Computational Linguistics. by measuring message predictability. In CLEF. Olga Kovaleva, Alexey Romanov, Anna Rogers, and Alec Radford, Karthik Narasimhan, Tim Salimans, and Anna Rumshisky. 2019. Revealing the dark secrets . 2018. Improving language under- of BERT. In Proceedings of the 2019 Conference on standing by generative pre-training. URL https://s3- Empirical Methods in Natural Language Processing us-west-2. amazonaws. com/-assets/research- and the 9th International Joint Conference on Natu- covers/languageunsupervised/language understand- ral Language Processing (EMNLP-IJCNLP), pages ing paper. pdf. 4365–4374, , China. Association for Computational Linguistics. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- models are unsupervised multitask learners. OpenAI dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Blog, 1(8):9. Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and approach. Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of Guoqin Ma. 2019. Tweets classification with BERT in the 2016 Conference on Empirical Methods in Natu- the field of disaster management. Technical report, ral Language Processing, pages 2383–2392, Austin, Department, . Texas. Association for Computational Linguistics. Adrian Rauchfleisch and Jonas Kaiser. 2020. The false GLUE: A multi-task benchmark and analysis plat- positive problem of automatic bot detection in social form for natural language understanding. In Pro- science research. Berkman Klein Center Research ceedings of the 2018 EMNLP Workshop Black- Publication, (2020-3). boxNLP: Analyzing and Interpreting Neural Net- works for NLP, pages 353–355, Brussels, Belgium. Ryan Ressmeyer. 2019. "deep faking" political twitter Association for Computational Linguistics. using transfer learning and GPT-2. Technical report, Computer Science Department, Stanford University. Alex Warstadt, Amanpreet Singh, and Samuel R. Bow- man. 2019. Neural network acceptability judgments. Anna Rogers, Olga Kovaleva, and Anna Rumshisky. Transactions of the Association for Computational 2020. A primer in BERTology: What we know Linguistics, 7:625–641. about how BERT works. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Victor Sanh, Lysandre Debut, Julien Chaumond, and Chaumond, Clement Delangue, Anthony Moi, Pier- Thomas Wolf. 2019. DistilBERT, a distilled version ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow- of BERT: smaller, faster, cheaper and lighter. icz, and Jamie Brew. 2019. Huggingface’s trans- formers: State-of-the-art natural language process- Rico Sennrich, Barry Haddow, and Alexandra Birch. ing. ArXiv, abs/1910.03771. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th An- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. nual Meeting of the Association for Computational Le, Mohammad Norouzi, Wolfgang Macherey, Linguistics (Volume 1: Long Papers), pages 1715– Maxim Krikun, Yuan Cao, Qin Gao, Klaus 1725, Berlin, Germany. Association for Computa- Macherey, Jeff Klingner, Apurva Shah, Melvin John- tional Linguistics. son, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Chengcheng Shao, Giovanni Luca Ciampaglia, Onur Stevens, George Kurian, Nishant Patil, Wei Wang, Varol, Kai-Cheng Yang, Alessandro Flammini, and Cliff Young, Jason Smith, Jason Riesa, Alex Rud- Filippo Menczer. 2018. The spread of low- nick, Oriol Vinyals, Gregory S. Corrado, Macduff credibility content by social bots. Nature commu- Hughes, and Jeffrey Dean. 2016. Google’s neu- nications, 9(1):1–9. ral machine translation system: Bridging the gap between human and machine translation. ArXiv, Richard Socher, Alex Perelygin, Jean Wu, Jason abs/1609.08144. Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models Rowan Zellers, Ari Holtzman, Hannah Rashkin, for semantic compositionality over a sentiment tree- Yonatan Bisk, Ali Farhadi, Franziska Roesner, and bank. In Proceedings of the 2013 Conference on Yejin Choi. 2019. Defending against neural fake Empirical Methods in Natural Language Processing, news. In Advances in Neural Information Process- pages 1631–1642, Seattle, Washington, USA. Asso- ing Systems, pages 9051–9062. ciation for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 4593– 4601, Florence, Italy. Association for Computational Linguistics.

Erik F. Tjong Kim Sang and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 shared task chunk- ing. In Fourth Conference on Computational Nat- ural Language Learning and the Second Learning Language in Logic Workshop.

Onur Varol, Emilio Ferrara, Clayton A. Davis, Filippo Menczer, and Alessandro Flammini. 2017. Online human-bot interactions: Detection, estimation, and characterization. In ICWSM.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Alex Wang, Amanpreet Singh, Julian Michael, Fe- lix Hill, Omer Levy, and Samuel Bowman. 2018.