Autonomic Author Identification in Internet Relay Chat (IRC)
Sicong Shao, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona
Cihan Tunc, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona
Amany Al-Shawi, National Center for Cybersecurity Technology, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia
Salim Hariri, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona
Abstract: With the advances in Internet technologies and services, social media has gained enormous popularity, especially because these technologies provide anonymity: users post their messages under nicknames. Unfortunately, this anonymity has been exploited by cyber-criminals to hide their identities and their operations. Hence, there is a growing interest in the cybersecurity research domain in identifying the authors of malicious messages and activities. Internet Relay Chat (IRC) channels are widely used to exchange messages and information among malicious users involved in cybercrimes. In this paper, we present an autonomic author identification technique based on personality profiles and the analysis of IRC messages. We first monitor the IRC channels using our autonomic bots and then create a personality profile for each targeted author. We demonstrate that personality analysis for author detection/identification is an efficient approach and has high detection rates.

Keywords: Author identification; cybersecurity; machine learning; Internet Relay Chat (IRC); Watson AI platform; personality insights

I. INTRODUCTION

Advances in Internet services and mobile services have led to rapid growth in the use of social networks, where users can post their opinions and share information across the globe immediately and anonymously. However, this anonymity has also been exploited by cyber-criminals to create fake and illegal accounts and to pretend to be somebody else [36]. People with illegitimate purposes can exploit the powerful dissemination feature of social networking platforms to spread malicious information and influence people's opinions, which can cause serious threats in cyberspace. For example, in the “Myspace mom” incident, a fake account caused the suicide of a teenage girl through cruel messages and cyber-bullying [29]. Furthermore, perpetrators can avoid detection by using anonymous servers, spoofing, VPNs, and fake accounts. Hence, homeland security and law enforcement agencies have launched projects to prevent deceptive attacks and track the identities of senders in order to improve our protection capabilities against terrorism, child predators, etc. [30].

The current solutions can only detect and find cyber-criminals if they make a mistake by providing their real-identity information; e.g., the Andromeda botnet mastermind (a.k.a. Ar3s) was arrested through his public ICQ number [31]. Hence, it is critically important to develop innovative tools that can efficiently identify authors, their origins, language, locations, etc., regardless of the approach used to hide their identity. Thus, developing effective methods for tracing illegitimate messages is an important cybersecurity research problem. Author identification for analyzing online messages is one of the suggested solutions, which can be formulated as assigning a piece of writing to one of a given set of authors [3]. The goal of the research presented in this paper is to develop methods for author identification in cyberspace and the ability to group intercepted anonymous messages that belong to the same authors as well as those generated by known terrorists or cyber-criminals [30].

To effectively achieve these capabilities, we present the design and implementation of the Automatic Author Identification and Characterization (AAIC) framework, which provides an innovative and effective solution for identifying authorship using personality profile analysis, based on the fact that the personality characteristics of a person are relatively stable [37]. For monitoring and identification of the authors, we use Internet Relay Chat (IRC) channels, which have been actively used by security groups (both malicious and non-malicious) to share their knowledge and get help, because IRC provides many professional channels (chatrooms) for real-time communication [1][2], even for cybercrime such as hacking, cracking, and carding [9][25]. IRC is used in cybercrime not only because it is a commonly used method of communication in the cybercrime community but also because it ensures anonymity: users can hide themselves in the public channels (chatrooms) and change their user names regularly.

Compared to most author identification techniques, which focus on newspapers, emails, website forums, blogs, etc. [3][4][5][6][7][8], performing author identification in IRC is a more challenging task for several reasons. First, most previous works studied asynchronous computer-mediated communication (CMC), including emails, web forums, blogs, comments, etc., while few works address author identification in synchronous media such as chatrooms. This gives us fewer clues on how to find effective features to distinguish a user in a chatroom such as IRC. Second, IRC channel administrators generally dislike bots (used to monitor/log the channels) and block them and even their IPs. This necessitates the development of intelligent monitoring systems. Third, in most author identification studies, researchers have focused on identifying a very limited number of the most active users (e.g., up to 20) [3][4][5][6][7]. For example, Zheng et al. [3]
15th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2018) ©2018 IEEE
978-1-5386-9120-5/18/$31.00 ©2018 IEEE
identified up to 20 of the most active users who frequently posted messages in newsgroups, with the best accuracy being around 83% when given 20 authors. However, more candidates need to be considered because an IRC channel usually has more potential suspects. Fourth, the average length of IRC messages is very short compared to emails or blog entries. For example, the average length of a message in #anonops (one of the IRC channels), operated by the anonymous organization, is 5.7 words, based on more than 800,000 messages collected by our monitoring bot. Last but not least, hacker, anonymous, and terrorist groups use sophisticated techniques to hide and fake their stylometric features, which are commonly used for author identification. However, even though their stylometric features change, their personality characteristics remain the same, and that gives the means to identify them using personality profile analysis.

The main idea in our author identification approach in IRC channels is that suspects in cyberspace unconsciously leave a personality trace in their online messages. These messages can be used to model personality characteristics and hence to distinguish each individual uniquely. Moreover, most author identification techniques rely on stylometric features, which can be manipulated and controlled [3][4][5][6][7][8]; hiding and faking personality features, however, is arduous.

The remainder of this paper is structured as follows. In Section II we provide background information about IRC, author identification, and the IBM Watson Artificial Intelligence (AI) platform. Section III explains our Automatic Author Identification and Characterization (AAIC) framework. The experimental environment and evaluation results are presented in Section IV. Finally, Section V concludes the paper.

II. BACKGROUND AND RELATED WORK

A. Internet Relay Chat (IRC)

Internet Relay Chat (IRC) is a popular communication method, especially in the cyber-domain. IRC requires a server that provides networking for the connected users through a protocol that facilitates real-time text communication [1]. IRC has traditionally been utilized for legitimate functions, but it has also been extensively used by hacker, anonymous, and terrorist groups over the years [22]. IRC provides two methods of communication: (i) private chat, i.e., one-to-one messages, and (ii) broadcast public messages. In the channels, public messages sent by the users are broadcast to all other users in the same channel in real time. This differs from website behavior, because on websites (e.g., blogs) users can read previously posted messages at any time by browsing them [9]. In contrast to blogs, where offline collection and batch processing work efficiently, in IRC-based communication real-time collection and threat detection are critical research issues [1][10].

In this research, in order to monitor the IRC channels and chats, we have developed autonomic IRC bots for the comprehensive real-time recording of IRC data using several strategies, as will be discussed in Section III.

B. Author Identification

Author identification has been widely used for purposes such as computer forensics, plagiarism checking, and detecting social media misuse, using e-mail, website forums, and blogs. For example, Zheng et al. [3] and Narayanan et al. [11] conducted a series of studies to identify the authors of online messages. Author identification of online messages is a particularly important issue in cybercrime because one of the obvious features of cybercrime is anonymity. Anonymous users fake their personal information and hide their identity to escape security investigation.

The authors in [32] apply n-gram analysis using 3-15 word-based n-grams and apply k-NN, outlier classification, and collective classification. In [33], the authors focus on the frequency of the most frequently occurring character n-grams and apply SVM for the classification. In [11] the authors extract features from each post and pass them to the classifier. In doing so, they use a fixed set of function words, such as “the” and “in”, because function words bear little relation to the topic of conversation. Furthermore, they also exclude bigrams and trigrams, which may be significantly influenced by specific words. Hill et al. [34] showed that stylometry enables identifying reviewers of research papers with reasonably high accuracy, given that the adversary, assumed to be a member of the community, has access to a large number of unblinded reviews of potential reviewers by serving on conference and grant selection committees. The authors in [35] apply unsupervised machine learning algorithms based on word frequencies. However, that study demonstrates success on only a very small number of authors.

IRC provides anonymity through its protocol [12]. An IRC server automatically masks the user's IP when a user connects to it, which is the most important layer of anonymity. Moreover, unlike most social media platforms, which require a sign-up step, users join hacker channels through an easy registration-free process and can regularly change their user names as they wish. With the help of many hacker groups, e.g., #anonops, an infamous international anonymous organization providing a network in IRC, users can hide in the IRC channels to plan and commit cybercrime [13]. Therefore, it would be very valuable if security experts could identify the author of IRC messages given a list of suspects.

There are a few previous studies related to author identification in IRC. Layton et al. [38] used IRC messages of 50 users from 8 years of Ubuntu channel logs. An accuracy of up to 55% was achieved by using inverse-author-frequency (iaf) weighting and Recentred Local Profile (RLP) methods. Inches et al. [39] used the dataset “irc logs”, which merges heterogeneous chat messages from 40 different IRC channels. By applying chi-squared distance and Kullback-Leibler divergence to determine the similarity between author profiles, the best accuracies achieved were up to 92%, 52%, and 61% with 20, 132, and 148 users, respectively. However, this approach does not consider the degree of class imbalance, and it also does not address author identification within a single channel, which is a more difficult real-world problem due to the lack of data. Furthermore, both of these approaches focus only on normal channels without regard to hacker channels, where the
logs are harder to collect and users are more sophisticated at hiding their identities.

C. Personality Analysis through IBM Watson AI Platform

IBM Watson is the AI platform service provided by IBM that allows users to integrate AI into their applications and to train, manage, and analyze data in a secure cloud environment (guaranteeing the privacy of the data against both IBM and third parties) [14][15][16]. We leverage the IBM Watson Assistant and Personality Insights capabilities to build the conversation module and the personality feature extraction module of our automatic author identification and characterization framework.

Watson Assistant is an AI assistant service for social media that answers given questions through pre-configured content intents, such as banking [17]. Furthermore, the service can be improved using the conversation history to better understand the input [17].

Another service we leverage in our approach is IBM Personality Insights, which integrates psychology and data analytics algorithms to analyze the given content and create a personality profile [18]. The IBM Personality Insights service uses three models: Big Five, Needs, and Values [18]. The Big Five personality characteristics represent the most widely used model for generally describing how a person engages with the world. The model includes five primary dimensions as follows. (1) Agreeableness: a person's tendency to be compassionate and cooperative toward others; (2) Conscientiousness: a person's tendency to act in an organized or thoughtful way; (3) Extraversion: a person's tendency to seek stimulation in the company of others; (4) Emotional range, also referred to as Neuroticism or Natural reactions: the extent to which a person's emotions are sensitive to the person's environment; and (5) Openness: the extent to which a person is open to experiencing a variety of activities. Each of these top-level dimensions has six facets that further characterize an individual according to the dimension. The Needs model describes which aspects of a product will resonate with a person and includes twelve characteristic needs: Excitement, Harmony, Curiosity, Ideal, Closeness, Self-expression, Liberty, Love, Practicality, Stability, Challenge, and Structure. The Values model describes motivating factors that influence a person's decision making. The model includes five values: Self-transcendence, Conservation, Hedonism, Self-enhancement, and Open to change. Watson infers personality features from textual information using an open-vocabulary approach [18]. By using GloVe, an open-source word embedding technique, the service obtains a vector representation for the words in the input text [40]. It then feeds this representation to a machine learning model that infers a personality profile. To train the model, IBM uses scores from surveys that were conducted among thousands of users along with their Twitter data [18].

III. SYSTEM DESIGN

The Automatic Author Identification and Characterization (AAIC) framework architecture for the IRC environment is shown in Figure 1. The AAIC components can be outlined as follows: (a) Data collection and pre-processing; (b) Feature extraction; (c) Learning Unit; (d) Author Identification.

Figure 1. The architecture of automatic author identification and characterization framework

For IRC channel monitoring and conversation logging, an autonomic IRC bot has been created with the features of robust continuous monitoring, comprehensive information collection, and pre-processing in real time. Using these capabilities, the IRC bot monitors the channels and extracts structured data to be used for analysis, in the following format: Username + Chat message content + Time. The architecture of the autonomic IRC monitoring bot is shown in Figure 2.

We have observed that IRC channel monitoring and logging pose multiple challenges. First of all, in some critical channels, if a user is identified as a non-contributing user or as a bot, the channel operators (i.e., administrators) block the user and even the IP. Hence, to ensure that the bot can avoid being identified by IRC channel administrators, we integrated a conversation module, which can provide basic answers to questions. For this purpose, we have leveraged IBM Watson Assistant [17] to add a natural language interface and automate the interactions with users in the monitored channels. The procedure for building the conversation capability is as follows: 1) We create a workspace, which is a container in IBM Cloud for the artifacts that define the conversation flow. 2) We transform the IRC messages collected from the monitored IRC channel into the Intent, Entity, and Dialog content in the workspace to train the conversation capability. An Intent is a purpose expressed in an IRC user's input messages, such as a topic about cybercrime or anonymous activity. An Entity represents a term, an object, or a data type that is relevant to the IRC user's intents. By identifying the entity mentioned in an IRC user's input, Watson Assistant can select a specific context for an intent. For example, an entity may represent a hacking tool with which the IRC user intends to launch a cyber-attack, or a math calculation question or a time inquiry question aimed at detecting bots. To train Watson Assistant to identify IRC users' entities, we list the possible values for entities that IRC users may mention. The Dialog is a branching conversation flow that defines the response of the conversation module when it identifies the defined intents and entities. We provide responses by analyzing our collected IRC chat messages based on the intents and entities that we recognize in the input chat messages. As we provide this information, Watson Assistant uses the IRC chat messages to create a machine learning model to understand the IRC messages. Through retraining, we ensure that our chat module
keeps the latest models for handling conversation in the monitored IRC channel when new chat data is introduced. 3) After the model for IRC conversation is created, we connect Watson Assistant to the conversation module through the Watson Assistant API. The conversation module also includes a response trigger that prompts Watson Assistant to respond to group chat in the channel, one-to-one chat in the channel, and private chat via the Direct Client-to-Client (DCC) protocol.

It is also possible for cyber-criminals to create a temporary channel where they can discuss or share their plans and invite other users to work together. For example, on December 6, 2010, users on the anonops server suddenly started using a temporary channel called #operationpayback, which had been quiet for months [19]. In this channel, the cyber attackers discussed their motivation and plans, and then started launching DDoS attacks against the websites of the Swedish Prosecution Authority, EveryDNS, Senator Joseph Lieberman, and others. As a result, these websites experienced downtime [19][26]. Therefore, to track such activities, a self-replication module was developed that allows a parent bot to generate a new (child) bot that inherits all the capabilities for continuous autonomic monitoring. By trusting all self-signed certificates, the bot is able to join hacker servers and channels that enforce TLS/SSL access to their network.

Figure 2. The architecture of autonomic IRC monitoring bot

A. Feature Extraction

In the monitored IRC channel, users send messages that constitute social communication. These messages can be measured and used to characterize personality. Personality characteristics differ from individual to individual. Personality characteristics influence most of a user's activities and behaviors in the IRC channel, down to something as natural as the way the user converses and interacts with others. Moreover, personality also influences the way IRC users make decisions, including the selection of cyberattack types and hacking products, the motivation for attacks and crimes, the organization of hacking activities, the development of malicious tools, and so on. Using IBM's Personality Insights service, as explained in Section II, we have been successful in analyzing individual authors' IRC messages and inferring individuals' intrinsic personality characteristics to create their personality profiles. Our author identification approach can also operate in different languages, as the IBM Personality Insights service supports multiple languages (e.g., English, Japanese, Korean, Arabic), which is important for international cybersecurity investigations [23][24].

It has been shown that a successful personality characterization can be created using 3000 words [18]. If the given text is shorter than 600 words, the service still analyzes it, but the result is not guaranteed to provide sufficient personality information; 100 words is the minimum threshold.

Feature extraction starts with the collection of the IRC dataset and the pre-processing of the data by our autonomic IRC bot (which separates the individual authors' messages). Next, we obtain individual user characteristics through the feature extraction unit, which filters an individual suspect's IRC chat messages and passes them to the personality analysis module. By calling the Personality Insights service from IBM Cloud, the personality analysis module obtains an individual user's personality in JSON format (containing the normalized personality analysis results based on three models: Big Five, Needs, and Values). The Big Five model contains five primary dimensions: Agreeableness, Conscientiousness, Extraversion, Emotional range, and Openness. Each of these primary dimensions includes six facet features that further distinguish a user. The Needs model contains twelve need features, and the Values model includes five value features. To represent the personality of a suspect candidate user, we select all the facet features in each primary dimension of the Big Five model (excluding the five top-level dimension scores for openness, agreeableness, emotional range, conscientiousness, and extraversion themselves, since they are higher-level features), all the features of the Needs model, and all the features of the Values model, which gives 47 features in total (5 × 6 facets + 12 needs + 5 values). A sunburst chart visualization of a user's personality profile is shown in Figure 3.

Figure 3. Visualization of an IRC user's personality features
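The feature selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the profile JSON exposes top-level `personality`, `needs`, and `values` lists whose entries carry a normalized `percentile` score, with Big Five facets nested under `children`. The helper `make_dummy_profile` and its trait names are hypothetical and exist only so the sketch runs without calling the service.

```python
from typing import Dict, List, Optional


def profile_to_features(profile: Dict) -> List[float]:
    """Flatten a personality profile into a 47-dimensional feature vector.

    Assumed layout (an assumption, not confirmed by the paper):
    profile["personality"] holds 5 dimensions, each with 6 facets in
    "children"; profile["needs"] holds 12 entries; profile["values"] holds 5.
    """
    features: List[float] = []
    # Big Five: keep only the facet scores, skip the top-level dimension score.
    for dimension in profile["personality"]:
        for facet in dimension["children"]:
            features.append(facet["percentile"])
    # Needs and Values: keep every feature.
    for need in profile["needs"]:
        features.append(need["percentile"])
    for value in profile["values"]:
        features.append(value["percentile"])
    return features


def make_dummy_profile(score: float = 0.5) -> Dict:
    """Build a synthetic profile with the assumed shape (hypothetical data)."""

    def trait(name: str, children: Optional[List[Dict]] = None) -> Dict:
        node: Dict = {"name": name, "percentile": score}
        if children is not None:
            node["children"] = children
        return node

    big_five = [
        trait(f"dimension_{d}", [trait(f"facet_{d}_{f}") for f in range(6)])
        for d in range(5)
    ]
    needs = [trait(f"need_{n}") for n in range(12)]
    values = [trait(f"value_{v}") for v in range(5)]
    return {"personality": big_five, "needs": needs, "values": values}


if __name__ == "__main__":
    vector = profile_to_features(make_dummy_profile(0.25))
    print(len(vector))  # 5 * 6 facets + 12 needs + 5 values
```

Keeping the vector order fixed (facets, then needs, then values) matters because the distance-based classifiers in the Learning Unit compare feature vectors position by position.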
B. Learning Unit

After creating the features that define each user (i.e., the feature extraction), we apply classification methods to create the author identification model. To successfully identify each author, we have adopted machine learning algorithms, namely k-Nearest Neighbor (k-NN) with different numbers of nearest neighbors and Support Vector Machine (SVM).

1) k-Nearest Neighbor

The k-Nearest Neighbor (k-NN) classifier identifies the author of a given text from a given set of candidate users. This can be viewed as a multi-class classification task. In this problem, the k-NN method classifies an unknown personality sample obtained from the Personality Insights service by assigning it to the author to whom the majority of its nearest neighbors belong. In the k-NN analysis, we use the Euclidean distance as the personality similarity measure. The distance between two data samples $p = (p_1, p_2, \ldots, p_d)$ and $q = (q_1, q_2, \ldots, q_d)$ is calculated as

$$\sqrt{\sum_{i=1}^{d} (p_i - q_i)^2}$$

where $i$ is the index of the features, which are the normalized personality characteristic values derived from the IRC chat content.

2) Support Vector Machine

Support Vector Machine (SVM) is a machine learning approach that we leverage for authorship analysis. SVM can be defined as maximizing the margin between two classes given a dataset $(x_i, y_i)$, $i = 1, \ldots, n$, $x_i \in \mathbb{R}^{d}$, $y_i \in \{+1, -1\}$, where $y_i$ is the class label and $x_i$ represents the feature vector. The label of an unknown data sample $x$ is determined by $y = \operatorname{sgn}(w \cdot \phi(x) + b)$, where $\phi(x)$ is a mapping to a higher-dimensional space that yields a nonlinear SVM and $w$ is the vector that the SVM needs to optimize. The optimal hyperplane is obtained by solving the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2}(w \cdot w) + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i\,(w \cdot \phi(x_i) + b) \ge 1 - \xi_i, \quad i = 1, \ldots, n \tag{1}$$

where $\xi_i$ is a slack variable and $C$ is a penalty factor. Its dual form is:

$$\arg\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{2}$$

where $\alpha_i$ are the Lagrange multipliers and $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$ is the kernel function. This optimization problem is a quadratic problem that can be solved by a sequential minimal optimization type decomposition method [20]. The binary SVM can be extended to multi-class classification by combining several two-class SVM classifiers into a multi-class classifier.

In our approach to author identification, we used LIBSVM [20] to implement the multi-class SVM. LIBSVM uses the one-against-one method for multi-class classification, which needs $N(N-1)/2$ classifiers for $N$-class classification [20]. Each classifier is trained on samples from the two corresponding classes. After all the classifiers are trained, a voting mechanism is used for testing: the unknown personality sample is assigned to the suspect with the most votes.

IV. EXPERIMENTS AND RESULTS

To evaluate the effectiveness of our approach, we designed several author identification tasks in various IRC channels monitored by our autonomic IRC bot technique (shown in Table I). We selected six active IRC channels for the demonstration.

• The #anonops channel is an international communication platform controlled by the anonymous hacking organization.
• The #2600 channel is a highly active community with hacker magazines and monthly hacker meetings [1].
• The #computer channel is an active computer discussion channel located on the Underworld server for understanding cybercrime and immoral deeds on the Internet.
• The #politics channel is another important Underworld channel whose topics are related to political warfare.
• The #security and #networking channels are two popular channels on the freenode server covering computer and network security topics.

Table I. TOTAL # OF MESSAGES OF THE MONITORED CHANNELS

Server Name          Channel Name   Total # of Messages   Collection Date Range
irc.anonops.com      #anonops       817,435               8/15/17 – 4/13/18
irc.2600.net         #2600          549,400               4/01/17 – 4/13/18
irc.underworld.no    #computer      186,458               9/13/17 – 4/13/18
irc.underworld.no    #politics      109,416               1/04/18 – 4/13/18
irc.freenode.net     ##security     243,273               4/13/17 – 4/13/18
irc.freenode.net     #networking    220,060               9/06/17 – 4/13/18
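The k-NN rule with Euclidean distance and the one-against-one classifier count from the Learning Unit can be illustrated with a short standard-library sketch. It is an assumption-laden toy, not the paper's code: the nicknames and two-feature vectors are hypothetical, and a real profile would use the 47 personality features.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_predict(
    train: Sequence[Tuple[Sequence[float], str]],
    sample: Sequence[float],
    k: int = 3,
) -> str:
    """Return the author label held by the majority of the k nearest samples."""
    neighbors = sorted(train, key=lambda item: euclidean(item[0], sample))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def one_against_one_classifiers(n_classes: int) -> int:
    """Number of binary classifiers the one-against-one scheme trains."""
    return n_classes * (n_classes - 1) // 2


if __name__ == "__main__":
    # Hypothetical nicknames with toy two-dimensional personality vectors.
    training_set: List[Tuple[List[float], str]] = [
        ([0.10, 0.20], "alice"),
        ([0.15, 0.25], "alice"),
        ([0.80, 0.90], "bob"),
        ([0.85, 0.95], "bob"),
    ]
    print(knn_predict(training_set, [0.12, 0.22], k=3))
    print(one_against_one_classifiers(50))
```

The pair count grows quadratically with the number of candidate authors, so the one-against-one scheme trains many small classifiers, each on samples from only two suspects.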