Autonomic Author Identification in Internet Relay Chat (IRC)

Sicong Shao, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona, [email protected]
Cihan Tunc, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona, [email protected]
Amany Al-Shawi, National Center for Cybersecurity Technology, King Abdulaziz City for Science and Technology, Riyadh, Saudi Arabia, [email protected]
Salim Hariri, NSF Center for Cloud and Autonomic Computing, The University of Arizona, Tucson, Arizona, [email protected]

15th ACS/IEEE International Conference on Computer Systems and Applications (AICCSA 2018). 978-1-5386-9120-5/18/$31.00 ©2018 IEEE

Abstract— With the advances in Internet technologies and services, social media has been gaining widespread popularity, especially because these technologies provide anonymity by letting users post their messages under nicknames. Unfortunately, the anonymity feature has been exploited by cyber-criminals to hide their identities and their operations. Hence, there is a growing interest in the cybersecurity research domain in identifying the authors of malicious messages and activities. Internet Relay Chat (IRC) channels are widely used to exchange messages and information among malicious users involved in cybercrimes. In this paper, we present an autonomic author identification technique based on personality profiles and analysis of IRC messages. We first monitor the IRC channels using our autonomic bots and then create a personality profile for each targeted author. We demonstrate that personality analysis for author detection/identification is an efficient approach and has high detection rates.

Keywords— Author identification; cybersecurity; machine learning; Internet Relay Chat (IRC); Watson AI platform; personality insights

I. INTRODUCTION

Advances in Internet and mobile services have led to rapid growth in the use of social networks, where users can post their opinions and share information across the globe immediately and anonymously. However, such anonymity has also been exploited by cyber-criminals to create fake and illegal accounts and pretend to be somebody else [36]. People with illegitimate purposes can exploit the powerful dissemination features of social networking platforms to spread malicious information and influence people's opinions, which can cause serious threats in cyberspace. For example, in the “Myspace mom” incident, a fake account caused the suicide of a teenage girl with cruel messages and cyber-bullying [29]. Furthermore, the perpetrators can avoid being detected by using anonymous servers, spoofing, VPNs, and fake accounts. Hence, homeland security and law enforcement agencies have launched projects to prevent deceptive attacks and track the identities of senders in order to improve our protection capabilities against terrorism, child predators, etc. [30].

The current solutions can only detect and find cyber-criminals if they make a mistake by providing their real-identity information; e.g., the Andromeda botnet mastermind (a.k.a. Ar3s) was arrested through his public ICQ number [31]. Hence, it is critically important to develop innovative tools that can efficiently identify authors, their origins, language, locations, etc., regardless of the approach used to hide their identity. Thus, developing effective methods for tracing illegitimate messages is an important cybersecurity research problem. Author identification for analyzing online messages is one of the suggested solutions; it can be formulated as assigning a piece of writing to one of a given set of authors [3]. The goal of the research presented in this paper is to develop methods for author identification in cyberspace and the ability to group intercepted anonymous messages that belong to the same authors, as well as those generated by known terrorists or cyber-criminals [30].

To effectively achieve these capabilities, we present the design and implementation of the Automatic Author Identification and Characterization (AAIC) framework, which provides an innovative and effective solution for identifying authorship using personality profile analysis, based on the fact that the personality characteristics of a person are relatively stable [37]. For monitoring and identification of the authors, we use Internet Relay Chat (IRC) channels, which have been actively used by security groups (both malicious and non-malicious) to share their knowledge and get help, because IRC provides many professional channels (chatrooms) for real-time communication [1][2], even for cybercrime such as hacking, cracking, and carding [9][25]. IRC is used in cybercrime not only because it is a common method of communication in the cybercrime community but also because it ensures anonymity: users can hide themselves in the public channels (chatrooms) and change their user names regularly.

Compared to most author identification techniques, which focus on newspapers, emails, website forums, blogs, etc. [3][4][5][6][7][8], performing author identification in IRC is a more challenging task for several reasons. First, most previous works studied asynchronous computer-mediated communication (CMC), including emails, web forums, blogs, comments, etc., while few works address author identification in synchronous media such as chatrooms. This leaves us with fewer clues on how to find effective features to distinguish a user in a chatroom like IRC. Second, IRC channel administrators generally dislike bots (used to monitor/log the channels) and block them and even their IPs. This necessitates developing intelligent monitoring systems. Third, in most author identification studies, researchers have focused on identifying a very limited number of the most active users (e.g., up to 20) [3][4][5][6][7]. For example, Zheng et al. [3] identified up to 20 of the most active users who frequently posted messages in newsgroups, with the best accuracy being around 83% when given 20 authors. However, more candidates need to be considered because an IRC channel usually has more potential suspects. Fourth, the average length of IRC messages is very short compared to emails or blog entries. For example, the average length of a message in #anonops (one of the IRC channels) operated by the anonymous organization is 5.7 words, based on more than 800,000 messages collected by our monitoring bot. Last but not least, hacker, anonymity, and terrorist groups use sophisticated techniques to hide and fake their stylometric features, which are commonly used for author identification. However, even though their stylometric features change, their personality characteristics remain the same, and that gives the means to identify them using personality profile analysis.

The main idea in our author identification approach in IRC channels is that suspects in cyberspace unconsciously leave a personality trace that can be derived from their online messages. These messages can be used to model the personality characteristics and hence to distinguish each individual uniquely. Moreover, most author identification techniques rely on stylometric features, which can be manipulated and controlled [3][4][5][6][7][8]; hiding and faking personality features, however, is arduous.

The remainder of this paper is structured as follows. In Section II we provide background information about the IRC client, author identification, and the IBM Watson Artificial Intelligence (AI) platform. Section III explains our Automatic Author Identification and Characterization (AAIC) framework. The experimental environment and evaluation results are presented in Section IV. Finally, in Section V, the paper is concluded.

B. Author Identification

Author identification has been widely used for various purposes such as computer forensics, plagiarism checking, and detecting social media misuse, using e-mail, website forums, and blogs. For example, Zheng et al. [3] and Narayanan et al. [11] conducted a series of studies to identify the authors of online messages. Author identification on online messages is a particularly important issue in cybercrime because one of the obvious features of cybercrime is anonymity. Anonymous users always fake their personal information and hide their identity to escape security investigation.

The authors in [32] apply n-gram analysis using 3-15 word-based n-grams and apply k-NN, outlier classification, and collective classification. In [33], the authors focus on the frequency of the most frequently occurring character n-grams and apply SVM for the classification. In [11], the authors extract features from each post and pass them to the classifier. While doing that, they use a fixed set of function words, such as “the” and “in”, as function words bear little relation to the topic of conversation. Furthermore, they also exclude bigrams and trigrams, which may be significantly influenced by specific words. Hill et al. [34] showed that stylometry enables identifying reviewers of research papers with reasonably high accuracy, given that the adversary, assumed to be a member of the community, has access to a large number of unblinded reviews of potential reviewers by serving on conference and grant selection committees. The authors in [35] apply unsupervised machine learning algorithms based on word frequencies. However, the study uses a very small number of authors for its demonstration.

IRC provides anonymity by the protocol [12]. IRC server will automatically
