Arabic Twitter User Profiling: Application to Cyber-Security
Total Page:16
File Type:pdf, Size:1020Kb
Arabic Twitter User Profiling: Application to Cyber-security Rahma Basti1, Salma Jamoussi2, Anis Charfi3 and Abdelmajid Ben Hamadou4 1Multimedia InfoRmation Systems and Advanced Computing Laboratory (MIRACL), University of Sfax, Tunis, Tunisia 2Higher Institute of Computer Sceience and Multimedia of Sfax, 1173 Sfax 3038, Tunisia 3Carnegie Mellon University Qatar, Qatar 4Digital Research Center of Sfax (DRCS), Tunisia Keywords: Author Profiling, Arabic Text Processing, Age and Gender Prediction, Dangerous Profiles, Stylometric Features. Abstract: In recent years, we witnessed a rapid growth of social media networking and micro-blogging sites such as Twitter. In these sites, users provide a variety of data such as their personal data, interests, and opinions. However, this data shared is not always true. Often, social media users hide behind a fake profile and may use it to spread rumors or threaten others. To address that, different methods and techniques were proposed for user profiling. In this article, we use machine learning for user profiling in order to predict the age and gender of a user’s profile and we assess whether it is a dangerous profile using the users’ tweets and features. Our approach uses several stylistic features such as characters based, words based and syntax based. Moreover, the topics of interest of a user are included in the profiling task. We obtained the best accuracy levels with SVM and these were respectively 73.49% for age, 83.7% for gender, and 88.7% for the dangerous profile detection. 1 INTRODUCTION researchers to use it including those who focus on user profiling (Feldman and Sanger, 2006). Social media networks allow users to share Research in psychology (Frank and Witten, 1998) information, opinions and communicate with each has revealed that the words used by an individual can other. Often, social media users choose not reveal project his or her mental and, physical health. With their real identity such as name, age, and gender in the advances in technology and computing, order to express their ideas freely, without risking any stylometry (Georgios, 2014) has been used to retaliation. Some other users hide their real identity determine traits of the user’s profile and personality for dishonest and dangerous purposes such as based on what they write. Several stylometric features threatening other social media users or spreading have been proposed to date, including features based rumors and lies. Therefore, it has become very on words, characters and punctuation. important to provide effective means for identity In this article, we aim at profiling Twitter users, tracing in the cyberspace (Argamon et al., 2003). i.e., determining characteristics such as age and Twitter is one of the most popular social media gender based on their tweets. This research is networks in the world and it has a large number of applicable in several fields, such as forensics and users who post a huge amounts of data in different marketing. languages. Posts cover a wide variety of topics such The remainder of this paper is organized as as politic, sport, and technology. follows. Section 2 reports on existing work on user The volume and variety of Twitter data as well as profiling. Section 3 presents our approach and the availability of APIs has attracted several describes the features that could be considered as significant indicators of age and gender. Section 4 1 https://developer.twitter.com/ 2 http://www.cs.cmu.edu/~ark/TweetNLP/ 3 https://emojipedia.org/ 4 https://dev.twitter.com/streaming/overview 110 Basti, R., Jamoussi, S., Charfi, A. and Ben Hamadou, A. Arabic Twitter User Profiling: Application to Cyber-security. DOI: 10.5220/0008167401100117 In Proceedings of the 15th International Conference on Web Information Systems and Technologies (WEBIST 2019), pages 110-117 ISBN: 978-989-758-386-5 Copyright c 2019 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved Arabic Twitter User Profiling: Application to Cyber-security presents the corpus we used in this research. Section Although in (Mikros, 2012), the researchers 5 presents the results we obtained and Section 6 worked on the automatic classification of blogs and concludes the paper with directions for future work. emails, they obtained a precision of 81.5% of documents well classified for the dimension of gender and 72% for the dimension of age. 2 RELATED WORK The work in (Juola, 2012) investigated the attribution and detection of the author's genre using Greek blogs. He chose this model of social networks Analyzing the user's’ posts and their behavior on because people can express their opinions on blogs. social networks has been the subject of numerous Juola focused on two types of features of text content. research works. Interest in this field of research has The first type includes classic stylometric features, increased in recent years as several users misused the anonymity they could have on social media networks which depend on vocabulary richness, word length, and word frequency. The second type of features to spread threats, hate messages or false news. depends on the bi-gram characters, and the n-gram of By analyzing the contents produced by users, or words. The results of their experiments showed an the activities they perform, author-profiling accuracy of gender identification of 82.6% with SVM researchers were able to determine several users’ characteristics such as age, gender, mother tongue (Maharjan et al., 2014). and level of education. The work in (Koppel et al., 2003) presented an application that detects various demographic The shared task on Author Profiling at PAN 2013 characteristics such as name, age, gender, level of focused on digital text forensics. Specifically, the education. The authors used two corpora of e-mail for purpose was to determine the age and gender of the the Arabic and English languages. They used a authors of a large number of unidentified texts. In this context, (Argamon et al., 2005) determined questionnaire to check and examine the user's’ profile experimentally that content features performed well including age, gender, and level of education. The authors used many machine-learning classifiers in for age and gender profiling. their experiments such as SVM, KNN and decision In addition to that, (Peersman et al., 2011) worked trees. For gender detection, the best accuracy was on segments of blogs of the British National Corpus. achieved by SVM (Argamon et al., 2009). They used features such as punctuation, average words, part of speech, sentence length, and word Current author identification techniques go factor analysis to predict gender at an accuracy of beyond stylometric analysis, which opens the way to profiling, attribution, and identification of authors. In 80%. addition, they explore data and use digital documents The detection of the author's profile consists of like graphics, emoticons, colors, layouts, etc. In this analyzing the way in which the linguistic context, we cite a very recent work of 2015 in which characteristics vary according to the profile of the author (Koppel et al., 2014) used the SVM model a play "Double Falsehood" was identified as the work of William Shakespeare where the researchers were trained on English, Spanish and Dutch Twitter data based on colors and graphics information for from unknown Twitter text to achieve 80% accuracy identification because each author or artist has his for gender prediction. The work focuses instead on own style (Mechti et al., 2010). punctuation, n-gram counts, sentence and word length, vocabulary richness, function words, out-of- The Arabic language is one of the most widely vocabulary words, emoticons and part-of-speech. adopted languages with hundreds of millions of native talkers. Furthermore, it is used by more than In another study (Alwajeeh et al., 2014) the 1.5 billion Muslims to practice their religion and authors worked on blog segments using features such spiritual ceremonies. as speech analysis, punctuation, average word length, Authorship attribution is another field of related sentences, and word factors. They achieved a gender prediction rate of 72.2% (Tang et al., 2010). work that is concerned with the description and In addition to that, (Estival et al., 2007) worked on identification of the true author of an anonymous text. In the literature, authorship description is defined as Arabic emails and reported being able to predict the a text categorization or text analysis and classification gender with a precision of 72.10%. They calculated problem. The authorship has various potential 64 features describing psycholinguistic word applications in fields such as literature, program code categories (e.g. family, anger, death, wealth, family, etc.). Any feature describes the number of words authorship attribution, digital content forensics, law enforcement, crime prevention, etc. In the context of detected in the similar category divided by all words authorship attribution, stylometry has been used to in the text. 111 WEBIST 2019 - 15th International Conference on Web Information Systems and Technologies determine the authenticity of the document. It is followers, number of friends, number of favorite considered as the study of how people can judge accounts, number of retweets and preferable time of others according to their writing style. Therefore, publications. Then, we represented the user’s profile stylometry cannot only be used to identify a writing as a normalized number vector with numbers style but can also help identify the author's gender and corresponding to the profile features. age. For each profile collected in our corpus, an expert in sociology helps us to identify and annotate the age, gender and whether the user has a dangerous profile. 3 PROPOSED METHOD 3.2 Tweets Specific Features The goal in this work is to analyze the profiles of anonymous authors and predict the author’s age, What features allow to predict age, gender, and gender, and whether the profile is a dangerous dangerous profiles is an open research question that profiles.