
Autonomic Author Identification in Internet Relay Chat (IRC)

Sicong Shao, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona, [email protected]
Cihan Tunc, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona, [email protected]
Amany Al-Shawi, National Center for Cybersecurity Technology, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia, [email protected]
Salim Hariri, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona, [email protected]

Abstract: With the advances in Internet technologies and services, social media has been gaining widespread popularity, especially because these technologies provide anonymity: users post their messages under nicknames. Unfortunately, the anonymity feature has been exploited by cyber-criminals to hide their identities and their operations. Hence, there is a growing interest in the cybersecurity research domain in identifying the authors of malicious messages and activities. Internet Relay Chat (IRC) channels are widely used to exchange messages and information among malicious users involved in cybercrimes. In this paper, we present an autonomic author identification technique based on personality profiles and analysis of IRC messages. We first monitor the IRC channels using our autonomic bots and then create a personality profile for each targeted author. We demonstrate that personality analysis for author detection/identification is an efficient approach and has high detection rates.

Keywords: Author identification; cybersecurity; machine learning; Internet Relay Chat (IRC); Watson AI platform; personality insights

I. INTRODUCTION

Advances in Internet services and mobile services have led to rapid growth in the use of social networks, where users can post their opinions and share information across the globe immediately and anonymously. However, such anonymity has also been exploited by cyber-criminals to create fake and illegal accounts and pretend to be somebody else [36]. People with illegitimate purposes can exploit the powerful dissemination features of social networking platforms to spread malicious information and influence people's opinions, which can cause serious threats in cyberspace. For example, in the “Myspace mom” incident, a fake account caused the suicide of a teenage girl through cruel messages and cyber-bullying [29]. Furthermore, the perpetrators can avoid being detected by using anonymous servers, spoofing, VPNs, and fake accounts. Hence, homeland security and law enforcement agencies have launched projects to prevent deceptive attacks and track the identities of senders in order to improve our protection capabilities against terrorism, child predators, etc. [30].

The current solutions can only detect and find cyber-criminals if they make a mistake by providing their real-identity information; e.g., the Andromeda botnet mastermind (a.k.a. Ar3s) was arrested through his public ICQ number [31]. Hence, it is critically important to develop innovative tools that can efficiently identify authors, their origins, language, locations, etc., regardless of the approach used to hide their identity. Thus, developing effective methods for tracing illegitimate messages is an important cybersecurity research problem. Author identification for analyzing online messages is one of the suggested solutions, and it can be formulated as assigning a piece of writing to one of a given set of authors [3]. The goal of the research presented in this paper is to develop methods for author identification in cyberspace and the ability to group intercepted anonymous messages that belong to the same authors as well as those generated by known terrorists or cyber-criminals [30].

To effectively achieve these capabilities, we present the design and implementation of the Automatic Author Identification and Characterization (AAIC) framework, which provides an innovative and effective solution to identify authorship using personality profile analysis, based on the fact that the personality characteristics of a person are relatively stable [37]. For monitoring and identification of the authors, we use Internet Relay Chat (IRC) channels, which have been actively used by security groups (both malicious and non-malicious) to share their knowledge and get help, because IRC provides many professional channels (chatrooms) for real-time communication [1][2], even for cybercrime such as hacking, cracking, and carding [9][25]. IRC is used in cybercrime not only because it is a common method of communication in the cybercrime community but also because it ensures anonymity: users can hide themselves in the public channels (chatrooms) and change their user names regularly.

Compared to most author identification techniques, which focus on newspapers, emails, forums, blogs, etc. [3][4][5][6][7][8], performing author identification in IRC is a more challenging task for several reasons. First, most of the previous works studied asynchronous computer-mediated communication (CMC) including emails, web forums, blogs, comments, etc., while few works address author identification in synchronous media such as chatrooms. This leaves us with fewer clues on how to find effective features to distinguish a user in a chatroom such as IRC. Second, IRC channel administrators generally dislike bots (used to monitor/log the channels) and block them and even their IPs. This necessitates the development of intelligent monitoring systems. Third, in most author identification studies, researchers have focused on identifying a very limited number of the most active users (e.g., up to 20) [3][4][5][6][7]. For example, Zheng et al. [3]

15th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2018) ©2018 IEEE

978-1-5386-9120-5/18/$31.00 ©2018 IEEE

identified up to 20 of the most active users who frequently posted messages in newsgroups, with the best accuracy being around 83% when given 20 authors. However, more candidates need to be considered because an IRC channel usually has more potential suspects. Fourth, the average length of IRC messages is very short compared to emails or blog entries. For example, the average length of a message in #anonops (one of the IRC channels), operated by the Anonymous organization, is 5.7 words based on more than 800,000 messages collected by our monitoring bot. Last but not least, hacker, anonymity, and terrorist groups use sophisticated techniques to hide and fake their stylometric features, which are commonly used for author identification. However, even though their stylometric features change, their personality characteristics remain the same, and that gives us the means to identify them using personality profile analysis.

The main idea in our author identification approach in IRC channels is that suspects in cyberspace unconsciously leave a personality trace in their online messages. These messages can be used to model the personality characteristics and hence to distinguish each individual uniquely. Moreover, most author identification techniques rely on stylometric features, which can be manipulated and controlled [3][4][5][6][7][8]; however, hiding and faking personality features is arduous.

The remainder of this paper is structured as follows. In Section II we provide background information about IRC, author identification, and the IBM Watson Artificial Intelligence (AI) platform. Section III explains our Automatic Author Identification and Characterization (AAIC) framework. The experimental environment and evaluation results are presented in Section IV. Finally, in Section V, the paper is concluded.

II. BACKGROUND AND RELATED WORK

A. Internet Relay Chat (IRC)

Internet Relay Chat (IRC) is a popular communication method, especially in the cyber-domain. IRC requires a server that provides networking for the connected users through a protocol that facilitates real-time text communications [1]. IRC has traditionally been utilized for legitimate functions, but it has also been extensively used by hacker, anonymity, and terrorist groups over the years [22]. IRC provides two methods of communication: (i) private chat, i.e., one-to-one messages, and (ii) broadcast public messages. In the channels, public messages sent by the users are broadcast to all other users in the same channel in real time. Hence, this differs from website behavior, because on websites (e.g., blogs) the users can read previously posted messages anytime by browsing them [9]. In contrast to website blogs, where offline collection and batch processing work efficiently, in IRC-based communication real-time collection and threat detection are critical research issues [1][10].

In this research, in order to monitor the IRC channels and chats, we have developed autonomic IRC bots for comprehensive real-time recording of the IRC data using several strategies, as will be discussed in Section III.

B. Author Identification

Author identification has been widely used for various purposes such as computer forensics, plagiarism checking, and detecting social media misuse, using e-mail, website forums, and blogs. For example, Zheng et al. [3] and Narayanan et al. [11] conducted a series of studies to identify the authors of online messages. Author identification on online messages is a particularly important issue in cybercrime because one of the obvious features of cybercrime is anonymity. Anonymous users fake their personal information and hide their identity to escape security investigations.

The authors in [32] apply n-gram analysis using 3-15 word-based n-grams and apply k-NN, outlier classification, and collective classification. In [33], the authors focus on the frequency of the most frequently occurring character n-grams and apply SVM for the classification. In [11] the authors extract features from each post and pass them to the classifier. While doing that, they use a fixed set of function words, such as “the” and “in”, because function words bear little relation to the topic of conversation. Furthermore, they also exclude bigrams and trigrams, which may be significantly influenced by specific words. Hill et al. [34] showed that stylometry enables identifying reviewers of research papers with reasonably high accuracy, given that the adversary, assumed to be a member of the community, has access to a large number of unblinded reviews of potential reviewers by serving on conference and grant selection committees. The authors in [35] apply unsupervised machine learning algorithms based on word frequencies. However, that study demonstrates success on only a very small number of authors.

IRC provides anonymity through its protocol [12]. The IRC server automatically masks the user's IP when a user connects to it, which is the most important layer of anonymity. Moreover, unlike most social media platforms, which require a sign-up step, users join hacker channels through an easy registration-free process and can regularly change their user names as they wish. With the help of the many hacker groups, e.g., #anonops, an infamous international anonymous organization that provides a network in IRC, users can hide in the IRC channels to plan and commit cybercrime [13]. Therefore, it would be very valuable if security experts could identify the author of IRC messages given a list of suspects.

There are a few previous studies related to author identification in IRC. Layton et al. [38] used IRC messages of 50 users from 8 years of channel logs. An accuracy of up to 55% was achieved by using the inverse-author-frequency (iaf) weighting and Recentred Local Profile (RLP) methods. Inches et al. [39] used the dataset “irc logs”, which merges heterogeneous chat messages from 40 different IRC channels. By applying chi-squared distance and Kullback-Leibler divergence to determine the similarity between author profiles, they achieved best accuracies of up to 92%, 52%, and 61% with 20, 132, and 148 users, respectively. However, this approach does not consider the degree of class imbalance, and it also does not address author identification within a single channel, which is a more difficult real-world problem due to lack of data. Furthermore, both of these approaches only focus on normal channels without regard to hacker channels, where the

logs are harder to collect and users are more sophisticated in hiding their identities.

C. Personality Analysis through IBM Watson AI Platform

IBM Watson is the AI platform service provided by IBM that allows users to integrate AI into their applications and to train, manage, and analyze data in a secure cloud environment (guaranteeing privacy of the data with respect to IBM and third parties) [14][15][16]. We leverage the IBM Watson Assistant and Personality Insights capabilities to build the conversation module and the personality feature extraction module of our automatic author identification and characterization framework.

Watson Assistant is an AI assistant service for social media that answers given questions through pre-configured content intents, such as banking [17]. Furthermore, the service can be improved over time by using the conversation history to better understand the input [17].

Another service we leverage in our approach is IBM Personality Insights, which integrates psychology and data analytics algorithms to analyze the given content and create a personality profile [18]. The IBM Personality Insights service uses three models: Big Five, Needs, and Values [18].

The Big Five personality characteristics represent the most widely used model for generally describing how a person engages with the world. The model includes five primary dimensions: (1) Agreeableness: a person's tendency to be compassionate and cooperative toward others; (2) Conscientiousness: a person's tendency to act in an organized or thoughtful way; (3) Extraversion: a person's tendency to seek stimulation in the company of others; (4) Emotional range, also referred to as Neuroticism or Natural reactions: the extent to which a person's emotions are sensitive to the person's environment; and (5) Openness: the extent to which a person is open to experiencing a variety of activities. Each of these top-level dimensions has six facets that further characterize an individual according to the dimension.

The Needs model describes which aspects of a product will resonate with a person and includes twelve characteristic needs: Excitement, Harmony, Curiosity, Ideal, Closeness, Self-expression, Liberty, Love, Practicality, Stability, Challenge, and Structure.

The Values model describes motivating factors that influence a person's decision making. The model includes five values: Self-transcendence, Conservation, Hedonism, Self-enhancement, and Open to change.

Watson infers personality features from textual information using an open-vocabulary approach [18]. By using GloVe, an open-source word embedding technique, the service obtains a vector representation for the words in the input text [40]. It then feeds this representation to a machine learning model that infers a personality profile. To train the model, IBM uses scores from surveys that were conducted among thousands of users along with their Twitter data [18].

III. SYSTEM DESIGN

The Automatic Author Identification and Characterization (AAIC) framework architecture for the IRC environment is shown in Figure 1. The AAIC components can be outlined as follows: (a) Data collection and pre-processing; (b) Feature extraction; (c) Learning Unit; (d) Author Identification.

Figure 1. The architecture of automatic author identification and characterization framework

For IRC channel monitoring and conversation logging, an autonomic IRC bot has been created with the features of robust continuous monitoring, comprehensive information collection, and pre-processing in real time. Using these capabilities, the IRC bot monitors the channels and extracts structured data to be used for analysis, in the following format: Username + Chat message content + Time. The architecture of the autonomic IRC monitoring bot is shown in Figure 2.

We have observed that IRC channel monitoring and logging pose multiple challenges. First of all, in some critical channels, if a user is identified as a non-contributing user or as a bot, the channel operators (i.e., administrators) block the user and even the IP. Hence, to ensure that the bot can avoid being identified by IRC channel administrators, we integrated a conversation module, which can provide basic answers to questions. For this purpose, we have leveraged IBM Watson Assistant [17] to add a natural language interface and automate the interactions with users in the monitored channels. The procedure for building the conversation capability is as follows:

1) We create a workspace, which is a container in IBM Cloud for the artifacts that define the conversation flow.

2) We transform the IRC messages collected from the monitored IRC channel into the Intent, Entity, and Dialog content of the workspace to train the conversation capability. An Intent is a purpose expressed in an IRC user's input messages, such as a topic about cybercrime or anonymous activity. An Entity represents a term, an object, or a data type that is relevant to the IRC user's intents. By identifying the entity mentioned in the IRC user's input, Watson Assistant can select a specific context for an intent. For example, an entity may represent a hacking tool that the IRC user intends to use to launch a cyber-attack, or a math calculation question or time inquiry question used for bot detection. To train Watson Assistant to identify IRC users' entities, we list the possible values for entities that IRC users may mention. The Dialog is a branching conversation flow that defines the response of the conversation module when it identifies the defined intents and entities. We provide responses by analyzing our collected IRC chat messages based on the intents and entities that we recognize in the users' input chat messages. As we provide this information, Watson Assistant uses the IRC chat messages to create a machine learning model to understand the IRC messages. Through retraining, we ensure our chat module

keeps the latest models for handling conversation in the monitored IRC channel when new chat data is introduced.

3) After creating the model for IRC conversation, we connect Watson Assistant to the conversation module through the Watson Assistant API. The conversation module also includes a response trigger that prompts Watson Assistant to respond to group chat in the channel, one-to-one chat in the channel, and private chat via the Direct Client-to-Client (DCC) protocol.

It is also possible that cyber-criminals create a temporary channel where they can discuss or share their plans and invite other users to work together. For example, on December 6, 2010, the users on the anonops server suddenly started using a temporary channel called #operationpayback, which had been quiet for months [19]. In this channel, the cyber attackers discussed their motivation and plans, and then started launching DDoS attacks against the websites of the Swedish Prosecution Authority, EveryDNS, Senator Joseph Lieberman, and others. This event caused these websites to experience downtime [19][26]. Therefore, to track such activities, a self-replication module has been developed that allows a parent bot to generate a new (child) bot that inherits all of its capabilities for continuous autonomic monitoring. By trusting all self-signed certificates, the bot is able to join hacker servers and channels that enforce TLS/SSL access to their network.

Figure 2. The architecture of autonomic IRC monitoring bot

A. Feature Extraction

In the monitored IRC channel, users send messages that represent social communication. These messages can be measured and used to constitute the personality. The characteristics of personality differ uniquely from individual to individual. Personality characteristics influence most of a user's activities and behaviors in the IRC channel, including something as natural as the way the user converses and interacts with others. Moreover, personality also influences the way IRC users make decisions, including the selection of cyberattack types and hacking products, attack and crime motivation, the organization of hacking activities, malicious tool development, and so on. Using IBM's Personality Insights service as explained in Section II, we have been successful in analyzing individual authors' IRC messages and inferring individuals' intrinsic personality characteristics to create their personality profiles. Our author identification approach can also operate in different languages, as the IBM Personality Insights service supports multiple languages (e.g., English, Japanese, Korean, Arabic), which is important for international cybersecurity investigations [23][24].

It has been shown that a successful personality characterization can be created using 3,000 words [18]. If the given text contains fewer than 600 words, the service still analyzes it, but the result is not guaranteed to provide sufficient personality information; 100 words is the minimum threshold.

Feature extraction starts with the collection of the IRC dataset and the pre-processing of the data by our autonomic IRC bot (which separates the individual authors' messages). Next, we obtain individual user characteristics through the feature extraction unit, which passes an individual suspect's IRC chat messages to the personality analysis module. By calling the Personality Insights service from IBM Cloud, the personality analysis module obtains an individual user's personality in JSON format (containing the normalized personality analysis results based on three models: Big Five, Needs, and Values). The Big Five model contains five primary dimensions: Agreeableness, Conscientiousness, Extraversion, Emotional range, and Openness. Each of these primary dimensions includes six facet features that further distinguish a user. The Needs model contains twelve need features, and the Values model includes five value features. To represent the personality of a suspect candidate user, we select all the facet features in each primary dimension of the Big Five model (but not the five top-level dimension scores themselves, i.e., openness, agreeableness, emotional range, conscientiousness, and extraversion, which are higher-level dimension features), all the features of the Needs model, and all the features of the Values model, which gives 47 features in total. A sunburst chart visualization of a user's personality profile is shown in Figure 3.

Figure 3. Visualization of an IRC user’s personality features
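The feature selection described above (5 x 6 Big Five facets, 12 needs, and 5 values) can be illustrated with a minimal sketch. The dictionary layout used here (the keys `personality`, `children`, `needs`, `values`, `trait_id`, and `percentile`) is an assumption about the shape of the returned JSON profile, and `flatten_profile` and `make_dummy_profile` are hypothetical helper names; the profile is synthetic and no service call is made.

```python
# Hypothetical sketch: flatten a Personality Insights-style profile into a
# fixed-order feature vector. The key names below are ASSUMED, not a
# verified schema of the service's JSON output.

def flatten_profile(profile):
    """Return (names, values) built from Big Five facets, needs, and values."""
    names, values = [], []
    # Keep the six facets under each Big Five dimension and skip the five
    # top-level dimension scores themselves.
    for dimension in profile["personality"]:
        for facet in dimension["children"]:
            names.append(facet["trait_id"])
            values.append(facet["percentile"])
    # Needs and Values contribute every trait they contain.
    for group in ("needs", "values"):
        for trait in profile[group]:
            names.append(trait["trait_id"])
            values.append(trait["percentile"])
    return names, values


def make_dummy_profile():
    """Build a synthetic profile: 5 dimensions x 6 facets, 12 needs, 5 values."""
    def trait(trait_id, score=0.5):
        return {"trait_id": trait_id, "percentile": score}

    personality = []
    for d in range(5):
        dimension = trait(f"big5_{d}")
        dimension["children"] = [trait(f"facet_{d}_{f}") for f in range(6)]
        personality.append(dimension)
    return {
        "personality": personality,
        "needs": [trait(f"need_{i}") for i in range(12)],
        "values": [trait(f"value_{i}") for i in range(5)],
    }


names, vector = flatten_profile(make_dummy_profile())
print(len(vector))  # 30 facets + 12 needs + 5 values = 47
```

Keeping the feature order fixed across users matters because the classifiers in the learning unit compare vectors position by position.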

B. Learning Unit

After creating the features that define each user (i.e., feature extraction), we apply classification methods to create the author identification model. To successfully identify each author, we have adopted the k-Nearest Neighbor (k-NN) algorithm with different numbers of nearest neighbors and the Support Vector Machine (SVM) algorithm.

1) k-Nearest Neighbor

The k-Nearest Neighbor (k-NN) classifier identifies the author of a given text from a given set of candidate users. This can be viewed as a multi-class classification task. In this problem, the k-NN method assigns an unknown personality sample obtained from the Personality Insights service to the author to whom the majority of its $k$ nearest neighbors belong. In the k-NN analysis, we have used the Euclidean distance to measure personality similarity. The distance between two data samples $(x_1, x_2, \ldots, x_n)$ and $(y_1, y_2, \ldots, y_n)$ is calculated as

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $i$ is the index of the features, which are normalized personality characteristic values derived from the IRC chat content.

2) Support Vector Machine

The Support Vector Machine (SVM) algorithm is a machine learning approach that we leverage for authorship analysis. We can define SVM as maximizing the margin between two classes given a dataset $(x_i, y_i)$, $i = 1, \ldots, l$, $x_i \in \mathbb{R}^n$, $y_i \in \{+1, -1\}$, where $y_i$ is the class label and $x_i$ represents the feature vector. The label of an unknown data sample can be determined by $y = \operatorname{sgn}(w \cdot \phi(x) + b)$, where $\phi(x)$ is a mapping to a higher dimensional space to obtain a nonlinear SVM and $w$ is the vector that the SVM needs to optimize. The optimal hyperplane can be obtained by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{l} \xi_i \quad \text{s.t.} \quad y_i (w \cdot \phi(x_i) + b) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, l \qquad (1)$$

where $\xi_i$ is a slack variable and $C$ is a penalty factor. Its dual form is:

$$\arg\max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0, \; 0 \le \alpha_i \le C, \; i = 1, \ldots, l \qquad (2)$$

where $\alpha_i$ are the Lagrange multipliers and $K(x_i, x_j)$ is a kernel function. We use the Radial Basis Function (RBF) kernel in our problem.

The classification function is:

$$f(x) = \operatorname{sgn}\left( \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \right) \qquad (3)$$

This optimization problem is a quadratic problem, which can be solved by a sequential minimal optimization type decomposition method [20]. The binary SVM classifier can be extended to multi-class classification by combining several two-category SVM classifiers in a certain manner, thus forming a multi-class classifier.

In our approach to author identification, we used LIBSVM [20] to implement multi-class SVM classification. LIBSVM uses the one-against-one method for multi-class classification, which needs $N(N-1)/2$ classifiers for $N$-class classification [20]. Each classifier is trained on samples from the two corresponding classes. After all the classifiers are trained, a voting mechanism is used at test time: the unknown personality sample is assigned to the suspect with the largest number of votes.

IV. EXPERIMENTS AND RESULTS

To evaluate the effectiveness of our approach, we designed several author identification tasks in various IRC channels monitored by our autonomic IRC bot technique (shown in Table I). We selected six active IRC channels for the demonstration.

• The #anonops channel is an international communication platform controlled by an anonymous hacking organization.
• The #2600 channel is a highly active community with hacker magazines and monthly hacker meetings [1].
• The #computer channel is an active computer discussion channel located on the Underworld server, useful for understanding cybercrime and immoral deeds on the Internet.
• The #politics channel is another important channel on Underworld, whose topics are related to political warfare.
• The #security and #networking channels are two popular channels on the irc.freenode.net server covering computer and network security topics.

Table I. TOTAL # OF MESSAGES OF THE MONITORED CHANNELS

| Server Name | Channel Name | Total # of Messages | Collection Date Range |
| irc.anonops.com | #anonops | 817,435 | 8/15/17 – 4/13/18 |
| irc.2600.net | #2600 | 549,400 | 4/01/17 – 4/13/18 |
| irc.underworld.no | #computer | 186,458 | 9/13/17 – 4/13/18 |
| irc.underworld.no | #politics | 109,416 | 1/04/18 – 4/13/18 |
| irc.freenode.net | ##security | 243,273 | 4/13/17 – 4/13/18 |
| irc.freenode.net | #networking | 220,060 | 9/06/17 – 4/13/18 |

After the unstructured IRC messages are extracted in real time and transformed into structured data files in CSV format, the feature extraction unit automatically analyzes each user's IRC messages and returns the personality profile in JSON format by communicating with the IBM Personality Insights service. For a successful personality analysis, the IBM Personality Insights service requires 3,000 words [18]. Hence, we obtain each personality sample in JSON format using 3,000 consecutive words from the same user without any overlap (each sample is derived from non-overlapping text) and

discard the remaining content that is shorter than 3,000 words. The sample files of each author are stored in that author's own personality profile document.

As the number of authors increases, we face a class-imbalance problem in which the classes are not equally represented in the data. To address this issue, we apply undersampling to our personality datasets, which is an efficient method for class-imbalance learning [21]. The undersampling method uses a subset of the samples from the majority class to train the classifier [21]. We undersample the majority class by randomly removing samples from that population until the minority class becomes some specified percentage of the majority class [28]. We have observed that the ratio of the biggest class to the smallest class is 4:1, and the ratio of the prevalent class to the small class is smaller than 4:1. The number of samples of each author in the different IRC channels is shown in Figure 4.

Figure 4. The statistic results of the numbers of samples of the most 30 different active users in six different channels

Most authorship identification studies perform the identification using static websites (through offline data) for a limited number of authors, between two and twenty. As the number of authors increases, the accuracy of the author identification decreases significantly. For example, Zheng et al. [3] identified up to 20 of the most active users posting messages frequently in newsgroups, with a best accuracy of 83%. In contrast, we use streaming messages and up to the 30 most active authors. In Figure 5 (a-f), we demonstrate author identification using the k-NN algorithm and SVM and compare the effect of the number of authors included in the comparison. In these experiments, we used the top 5, 10, 20, and 30 authors from six different IRC channels. The activity degree is measured by the total number of messages sent by a user, not the total number of words or sentences. For each test, we trained our classifier model using k-NN (1-NN, 3-NN, and 5-NN) and SVM, respectively. To evaluate the classifier performance, we used the accuracy measure that has normally been adopted in author identification [3]. We calculate the accuracy for all classifiers using Leave-One-Out Cross-Validation (LOOCV) in order to train on as many samples as possible. LOOCV is k-fold cross-validation taken to its logical extreme, with $k$ equal to $N$ (i.e., the number of data points in the set). LOOCV performs $N$ experiments on a dataset with $N$ examples. For each experiment, LOOCV uses $N-1$ examples for training and the remaining example for testing [27].

In our experiments (as shown in Figure 5), we observe that the k-NN algorithm provides acceptable accuracy in the detection of the authors, comparable to the study in [3]. We have observed that as the value of $k$ decreases, the accuracy increases because the author groups are classified better. Furthermore, we observe that SVM achieves significantly higher accuracy than k-NN in the monitored channels. We also observe that the accuracy shows a downward trend as the number of authors increases. For example, SVM achieved 95.89% to 100% accuracy when the number of authors is 5. Given 10 authors, SVM achieved 99.34%, 99.18%, 94.17%, 93.73%, 93.33%, and 90.43% accuracy. When extending to 20 authors, SVM achieves accuracy varying from 92.12% to 96.13% in #anonops, #2600, and #politics. The fact that #politics can still maintain good accuracy with very few samples suggests that topics related to politics, such as political warfare, make it easier to distinguish personality from individual to individual than the traditional cybersecurity and cybercrime topics in the other channels. The SVM accuracy results for #computer, #security, and #networking are 86.45%, 81.68%, and 77.17% in the 20-author test, respectively. When extending to 30 authors, the accuracy results of #anonops and #politics are still higher than 90%, and the accuracy of #2600 decreases to 87.35%. For the #computer, #security, and #networking channels, the results decrease to 83.60%, 75.11%, and 75.22%, respectively. The decrease in accuracy with the increasing number of authors is a predictable result: when the number of top users increases, the authors with fewer messages (i.e., the authors who do not frequently participate in the channel conversation) cannot provide sufficient information to effectively discriminate authorship. Especially for the #networking and #security channels, authorship detection suffered seriously from the lack of data for infrequent authors. However, the results for the more active channels suggest that collecting further data will increase the accuracy.

It appears that the personality features based on Personality Insights are very effective for author identification. While k-NN provides sufficient accuracy (compared to other studies), SVM outperforms it, with average improvements of 9.88%, 10.60%, and 10.73% over 1-NN, 3-NN, and 5-NN, respectively.
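The evaluation protocol described above (k-NN with Euclidean distance, scored by leave-one-out cross-validation) can be sketched in a few lines. This is a minimal pure-Python illustration on toy two-dimensional vectors, not the implementation used in the experiments; the function names and the sample data are hypothetical, and real samples would be 47-dimensional personality vectors.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training samples.

    train is a list of (vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def loocv_accuracy(samples, k):
    """Leave-one-out cross-validation: hold out each sample once."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]  # the other N-1 samples
        if knn_predict(train, vector, k) == label:
            correct += 1
    return correct / len(samples)


# Toy data (made up for illustration): two well-separated "authors".
samples = [
    ([0.10, 0.20], "author_a"), ([0.12, 0.18], "author_a"),
    ([0.11, 0.22], "author_a"), ([0.90, 0.80], "author_b"),
    ([0.88, 0.83], "author_b"), ([0.91, 0.79], "author_b"),
]
for k in (1, 3):
    print(k, loocv_accuracy(samples, k))
```

With only three samples per author, a held-out sample leaves two same-author neighbors, so larger values of k start to pull in the other author's samples; this is one way small per-author sample counts can hurt k-NN.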

Fig. 5. The author identification accuracy results of six monitored channels. (a) accuracy of #anonops. (b) accuracy of #2600. (c) accuracy of #politics. (d) accuracy of #computer. (e) accuracy of #networking. (f) accuracy of ##security.
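The evaluation protocol described above (ranking authors by message count, then scoring a k-NN classifier with LOOCV) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in the paper: the function names `top_authors`, `knn_predict`, and `loocv_accuracy`, the Euclidean distance, and the two-dimensional toy vectors are all assumptions made for the example. The actual features come from Personality Insights, and the SVM classifier is not shown here.

```python
import math
from collections import Counter


def top_authors(message_log, n):
    """Return the n nicknames that sent the most messages.

    message_log is a list of (nickname, text) pairs; activity is measured
    by message count, not by the number of words or sentences.
    """
    counts = Counter(nick for nick, _text in message_log)
    return [nick for nick, _count in counts.most_common(n)]


def knn_predict(train_x, train_y, query, k):
    """Predict the majority label among the k nearest training vectors."""
    ranked = sorted(
        range(len(train_x)), key=lambda i: math.dist(train_x[i], query)
    )
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(features, labels, k):
    """Leave-One-Out CV: N experiments, each testing one held-out sample."""
    n = len(features)
    correct = 0
    for held_out in range(n):
        # Train on the other N-1 samples, test on the held-out one.
        train_x = features[:held_out] + features[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        predicted = knn_predict(train_x, train_y, features[held_out], k)
        if predicted == labels[held_out]:
            correct += 1
    return correct / n


if __name__ == "__main__":
    # Hypothetical 2-D feature vectors for two made-up authors; real
    # personality profiles would have many more dimensions.
    features = [[0.10, 0.20], [0.15, 0.25], [0.12, 0.18],
                [0.80, 0.90], [0.85, 0.95], [0.82, 0.88]]
    labels = ["alice", "alice", "alice", "bob", "bob", "bob"]
    for k in (1, 3, 5):
        acc = loocv_accuracy(features, labels, k)
        print(f"{k}-NN LOOCV accuracy: {acc:.2%}")
```

With only three samples per author in this toy set, 5-NN misclassifies every held-out sample, because removing one sample leaves the other author's vectors in the majority. This loosely mirrors the observation that infrequent authors provide too little data for reliable identification.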

V. CONCLUSION

The anonymity of Internet services, especially in social media, provides freedom to users. On the other hand, it can be exploited for underground cyber-criminal activities. It is highly desirable to be able to identify anonymous individuals who spread malicious software tools, as well as cybercriminals. To address this cybersecurity challenge, we presented in this paper an autonomic, personality-analysis-based author identification approach for the Internet Relay Chat (IRC) environment. Compared to previously applied techniques that focus on stylometric measures and deep learning, our approach builds on the fact that each author leaves footprints of their personality characteristics in the text. By using IBM Watson Personality Insights, we were able to extract this information and apply classification techniques to identify individual authors. Using IRC chat logs collected by our autonomic IRC bots in various cybersecurity and underground channels, as well as general channels (computer and politics), we have demonstrated that the personality-based solution can work effectively in identifying authors. We observed identification accuracy between 92% and 96% for 20 authors when the chat messages are sufficient.

ACKNOWLEDGMENT

This work is partly supported by the Air Force Office of Scientific Research (AFOSR) Dynamic Data-Driven Application Systems (DDDAS) award number FA9550-18-1-0427, National Science Foundation (NSF) research projects NSF-1624668 and SES-1314631, and Thomson Reuters in the framework of the Partner University Fund (PUF) project (PUF is a program of the French Embassy in the United States and the FACE Foundation and is supported by American donors and the French government).

REFERENCES

[1] S. Shao, C. Tunc, P. Satam, and S. Hariri, “Real-Time IRC Threat Detection Framework,” In Foundations and Applications of Self* Systems (FAS*W), 2017 IEEE 2nd International Workshops on, pp. 318-323.
[2] J. Yu, C. Tunc, and S. Hariri, “Automated Framework for Scalable Collection and Intelligent Analytics of Hacker IRC Information,” In Cloud and Autonomic Computing (ICCAC), 2016 International Conference on, pp. 33-39.
[3] R. Zheng, J. Li, H. Chen, and Z. Huang, “A framework for authorship identification of online messages: Writing-style features and classification techniques,” Journal of the Association for Information Science and Technology 57, no. 3 (2006): 378-393.
[4] J. Savoy, “Authorship attribution based on specific vocabulary,” ACM Transactions on Information Systems (TOIS) 30, no. 2 (2012): 12.
[5] S. Segarra, M. Eisen, and A. Ribeiro, “Authorship attribution through function word adjacency networks,” IEEE Transactions on Signal Processing 63, no. 20 (2015): 5464-5478.
[6] S. Phani, S. Lahiri, and A. Biswas, “A machine learning approach for authorship attribution for Bengali blogs,” In Asian Language Processing (IALP), 2016 International Conference on, pp. 271-274.
[7] J. Ma, B. Xue, and M. Zhang, “A Profile-Based Authorship Attribution Approach to Forensic Identification in Chinese Online Messages,” In Intelligence and Security Informatics, 2016 Springer Pacific-Asia Workshop on, pp. 33-52.
[8] S. R. Pillay, and T. Solorio, “Authorship attribution of web forum posts,” In eCrime Researchers Summit (eCrime), 2010, pp. 1-7.
[9] V. Benjamin, B. Zhang, J. F. Nunamaker Jr, and H. Chen, “Examining hacker participation length in cybercriminal Internet-relay-chat communities,” Journal of Management Information Systems 33, no. 2 (2016): 482-510.
[10] V. Benjamin, and H. Chen, “Time-to-event modeling for predicting hacker IRC community participant trajectory,” In Intelligence and Security Informatics Conference (JISIC), 2014 IEEE Joint, pp. 25-32.
[11] A. Narayanan, H. Paskov, Z. Gong, J. Bethencourt, E. Stefanov, E. R. Shin, and D. Song, “On the feasibility of internet-scale author identification,” In Security and Privacy (SP), 2012 IEEE Symposium on, pp. 300-314.
[12] E. Cooke, F. Jahanian, and D. McPherson, “The Zombie Roundup: Understanding, Detecting, and Disrupting Botnets,” SRUTI 5 (2005): 6-6.
[13] V. Benjamin, W. Li, T. Holt, and H. Chen, “Exploring threats and vulnerabilities in hacker web: Forums, IRC and carding shops,” In Intelligence and Security Informatics (ISI), 2015 IEEE International Conference on, pp. 85-90.
[14] “IBM Watson Platform service,” [Online] URL: https://console.bluemix.net/developer/watson/dashboard, Accessed: December 2017.
[15] “IBM Watson SDKs,” [Online] URL: https://console.bluemix.net/developer/watson/sdks-and-tools, Accessed: December 2017.
[16] “AI Everywhere with IBM Watson and Apple Core ML,” [Online] URL: https://www.ibm.com/blogs/watson/2018/03/ai-everywhere-ibm-watson-apple-core-ml/, Accessed: March 2018.
[17] “IBM Watson Assistant service,” [Online] URL: https://www.ibm.com/watson/services/conversation/, Accessed: December 2017.
[18] “IBM Watson Personality Insights service,” [Online] URL: https://console.bluemix.net/docs/services/personality-insights, Accessed: December 2017.
[19] P. Olson. We Are Anonymous: Inside the Hacker World of LulzSec, Anonymous, and the Global Cyber Insurgency. Back Bay Books, 2013.
[20] C. Chang, and C. Lin, “LIBSVM: a library for support vector machines,” ACM Transactions on Intelligent Systems and Technology (TIST) 2, no. 3 (2011): 27.
[21] X. Liu, J. Wu, and Z. Zhou, “Exploratory undersampling for class-imbalance learning,” IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, no. 2 (2009): 539-550.
[22] V. Benjamin, and H. Chen, “Securing cyberspace: Identifying key actors in hacker communities,” In Intelligence and Security Informatics (ISI), 2012 IEEE International Conference on, pp. 24-29.
[23] V. Benjamin, and H. Chen, “Identifying language groups within multilingual cybercriminal forums,” In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pp. 205-207.
[24] Z. Fang, X. Zhao, Q. Wei, G. Chen, Y. Zhang, C. Xing, W. Li, and H. Chen, “Exploring key hackers and cybersecurity threats in Chinese hacker communities,” In Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, pp. 13-18.
[25] J. Radianti, “A study of a social behavior inside the online black markets,” In Emerging Security Information Systems and Technologies (SECURWARE), 2010 Fourth International Conference on, pp. 189-194.
[26] M. Sauter. The Coming Swarm: DDOS Actions, Hacktivism, and Civil Disobedience on the Internet. Bloomsbury Publishing USA, 2014.
[27] P. Refaeilzadeh, L. Tang, and H. Liu, “Cross-validation,” In Encyclopedia of Database Systems, M. T. Özsu and L. Liu (Eds.), Springer, 2009.
[28] N. V. Chawla, “C4.5 and imbalanced data sets: investigating the effect of sampling method, probabilistic estimate, and decision tree structure,” In Machine Learning, 2003 International Conference on, vol. 3, p. 66.
[29] “Myspace Mom,” [Online] URL: http://www.foxnews.com/story/2007/12/06/myspace-mom-linked-to-missouri-teen-suicide-being-cyber-bullied-herself.html, Accessed: Feb 2018.
[30] N. Cheng, R. Chandramouli, and K. P. Subbalakshmi, “Author gender identification from text,” Digital Investigation 8, no. 1 (2011): 78-88.
[31] “Mastermind behind sophisticated, massive botnet outs himself,” [Online] URL: https://arstechnica.com/tech-policy/2017/12/mastermind-behind-massive-botnet-tracked-down-by-sloppy-opsec/, Accessed: Feb 2018.
[32] J. Peng, K. R. Choo, and H. Ashman, “Bit-level n-gram based forensic authorship analysis on social media: Identifying individuals from linguistic profiles,” Journal of Network and Computer Applications 70 (2016): 171-182.
[33] E. Stamatatos, “Author identification: Using text sampling to handle the class imbalance problem,” Information Processing & Management 44, no. 2 (2008): 790-799.
[34] S. Hill, and F. Provost, “The myth of the double-blind review?: author identification using only citations,” ACM SIGKDD Explorations Newsletter 5, no. 2 (2003): 179-184.
[35] S. Nirkhi, R. V. Dharaskar, and V. M. Thakare, “Authorship Verification of Online Messages for Forensic Investigation,” Procedia Computer Science 78 (2016): 640-645.
[36] W. Wu, J. Zhou, Y. Xiang, and L. Xu, “How to achieve non-repudiation of origin with privacy protection in cloud computing,” Journal of Computer and System Sciences 79, no. 8 (2013): 1200-1213.
[37] D. A. Cobb-Clark, and S. Schurer, “The stability of big-five personality traits,” Economics Letters 115, no. 1 (2012): 11-15.
[38] R. Layton, S. McCombie, and P. Watters, “Authorship attribution of IRC messages using inverse author frequency,” In Cybercrime and Trustworthy Computing Workshop (CTC), 2012 Third, pp. 7-13.
[39] G. Inches, M. Harvey, and F. Crestani, “Finding participants in a chat: Authorship attribution for conversational documents,” In Social Computing (SocialCom), 2013 International Conference on, pp. 272-279.
[40] J. Pennington, R. Socher, and C. Manning, “GloVe: Global vectors for word representation,” In Empirical Methods in Natural Language Processing (EMNLP), 2014 Conference on, pp. 1532-1543.