Author Profiling Using Stylistic and N-Gram Features

International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 – 8958, Volume-9 Issue-1, October 2019 Author Profiling using Stylistic and N-Gram Features Radha D, Chandra Sekhar P Authorship Profiling [1]. The Plagiarism Detection detects Abstract: The World Wide Web is increasing tremendously with the percentage of authors contribution is copied from other massive amount of textual content primarily through social media author‟s contributions [2]. Authorship Identification sites. Most of the users are not interested to upload their genuine classified into two classes namely Authorship Verification details along with textual content to these sites. To identify the correct information of the authors the researchers started a new and Authorship Attribution. Authorship Verification verifies research area named as Authorship Analysis. The authorship whether the anonymous document was written by the Analysis is used to find the details of the authors by examining suspected author or not by investigating the suspected their text. Authorship Profiling is one type of Authorship author‟s documents [3]. Authorship Attribution detects the Analysis, which is used to detect the demographic characteristics author of an unknown document by investigating the like Age, Gender, Location, Educational Background, Nativity Language and Personality Traits of the authors by examining documents of given set of authors [4]. Authorship Profiling writing skills in their written text. Stylometry is one research area discover the demographic characteristics of an author by defines a set of stylometric features namely word based, character investigating the writing style in their texts [5]. In Authorship based, syntactic, structural and content based features for Identification, the training data need suspected authors differentiating the author’s writing styles. In this work, the documents to recognize the document‟s author. But in experimentation conducted with various stylistic features, N-grams and content based features for gender prediction. These Authorship Profiling, the suspected author‟s documents need features are used for representing the vectors of documents. The not required in training data to detect the characteristics of the classification algorithms produce the model by processing these suspected author. This is the major difference among vectors. Two classification algorithms namely Random Forest, Authorship Identification and Authorship Profiling [7]. Naïve Bayes Multinomial were used for classification. We Authorship Profiling is used in information processing concentrated on prediction of Gender from 2019 Pan Competition applications such as harassing messages, forensic analysis, Twitter dataset. Our approach obtained best accuracies when compared with many Authorship Profiling approaches. security, educational domain, literary research and marketing [6]. In social websites, people are involved in different crimes Keywords : Authorship Analysis, Authorship Profiling, like public embarrassment by sending harassing messages, Accuracy, Content based Features, Gender Prediction, N-grams, blackmailing, defamation, stalking and creation of profiles Stylistic Features. with fake details. All these crimes are in the form of messages. The authorship profiling is used here to analyze the harassing I. INTRODUCTION messages and detect the basic characteristics like gender, age In the last 20 years, Internet has evolved from a network of group, location of messages of authors. In forensic analysis, connected computers used to share data among researchers. the forensic experts analyze the property wills and suicide As a result of this growth and the birth of social networks, notes to detect the details of the suspected author. In this blogs and many other websites where users are given the context Authorship Profiling is one such technique helpful for opportunity of easily creating or uploading content and the this purpose. The terrorist organizations send letters and mails amount of data generated every day has also grown to threaten the government bodies. The Authorship Profiling immensely. Most of the generated data in the net is thus approaches were used in security to analyze these mails and unstructured. One of the characteristics of the Internet whether the messages came from suspected sources or not. In nowadays is that a user can post anonymously in forums, marketing point of view, the market people are analyzing their comment sections of articles, social networks, chat systems, products based on the reviews of their product. Based on the etc. The Authorship Analysis is one research area analysis of reviews they will take strategic decisions about concentrated by the many researchers to find the details of the their products. Authorship Profiling is used to analyze the authors by analyzing their written textual content. reviews of products and find the details of reviewers like Authorship Analysis is categorized into three techniques gender, age, location etc. Authorship Profiling is used in such as Plagiarism Detection, Authorship Identification and educational domain also. In educational domain, the researchers are able to find the exceptional talented students, the knowledge level of student by analyzing the written texts Revised Manuscript Received on October 05, 2019 of the students. In the case of literary and historic studies, Radha D*, department of CSE, Malla Reddy College of Engineering and Technology, Hyderabad, India. Authorship Profiling can be applied to confirm/refute the Email: [email protected] author characteristics of a Chandra Sekhar P, department of CSE, GITAM, Visakhapatnam, India. text. Email: [email protected] Published By: Retrieval Number: A1621109119/2019©BEIESP Blue Eyes Intelligence Engineering DOI: 10.35940/ijeat.A1621.109119 3044 & Sciences Publication Author Profiling using Stylistic and N-Gram Features Different researchers identified several variances in writing section 6. The results discussed in section 7. The section 8 style of female and male by analyzing their datasets. Koppel describes conclusions of this work and possible directions et al. analyzed [6] large corpus of documents and concluded also. that woman used more number of pronouns and men used more determiners and quantifiers in their writings. In general, II. RELATED WORK the author style of writing depends on the grammar rules they One of the first studies on performing author profiling with followed, topics they selected and words they used. They machine learning techniques was done by Argamon et al. observed that, the female discussed the topics like kitty [10]. The study observed that the correct linguistic features parties, shopping and beauty in their writings whereas the combination were useful to distinguish the characteristics of male concentrated on topics like technology, politics and an author‟s text. The scientific event PAN competition sports in their writings. In another observation [3], the males arranged several author profiling tasks during the last few more written about politics and technology and female years, welcoming anyone with interest to participate. The discussed about marriage styles. They also observed that, the competition considered different aspects like age, gender, adverbs and adjectives usage is more in female writings than personality traits and native language in author profiling task. male writings. SVMs was used by the majority of the participants [11, 12], The words selected for writing a review or a blog is but also random forest [13], logistic regression [11] and many different based on the concept discussed in those sites. There more algorithms was used. Vollenbroek et al. used [14] linear is a possibility of classify a document whether the document SVM and achieved the highest average accuracy of 52.58 % was written by female or male based on the words occurred in by extracting various Stylometry features. However, a document. The presence of words like boyfriend, my Modaresi et al. also extracted part-of-speech taggings, various husband and pink in a document improves the chances of Stylometry features and lexical features, but used those document written by female authors and the presence of features to train a logistic regression classifier. They got words like cricket and world cup in a text improves the average accuracy of 52.47 % and concluded that chances of text written by male authors. They also observed part-of-speech taggings were not suitable features when that when compared with male authors the female authors testing on domain independence [15]. used less number of prepositions in their writings [1]. The most often explored demographic traits in the literature In [8], they found that the experimentation with only are clearly gender and age. In [1], the authors describe two content based features was more effective to distinguish the experiments on the Blog Corpus to identify gender and age. writing style of female and male authors and also identified For both experiments, stylistic features such as function that the accuracy is reduced when experiment with content words, part-of-speech frequencies, blog words and hyperlinks based features along with other features. Some other as well as the 1,000 most relevant unigrams according to the researchers [9] identified that male used more prepositions, Information Gain metric are extracted. The chosen classifier articles and longer words in their writings whereas female is the Multi-class Real

Author Profiling Using Stylistic and N-Gram Features

PAN 2017: Author Profiling

Author Profiling: Predicting Age and Gender from Blogs

Automatic Author Profiling Based on Linguistic and Stylistic Features Notebook for PAN at CLEF 2013

Arabic Twitter User Profiling: Application to Cyber-Security

Author Profiling for English and Spanish Text

Author Profiling from Facebook Corpora

A Study of Arabic Social Media Users—Posting Behavior and Author's

An Evaluation Study of Authorship Attribution Approaches

Automatic Author Profiling of Online Chat Logs

On the Impact of Emotions on Author Profiling

Authorship Attribution and Author Profiling of Lithuanian Literary Texts

Writer Profiling Without the Writer's Text