Author Profiling Using Texts in Social Networks

Chapter 11

Iqra Ameer
Instituto Politécnico Nacional, Mexico

Grigori Sidorov
Instituto Politécnico Nacional, Mexico

DOI: 10.4018/978-1-7998-4730-4.ch011
Copyright © 2021, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

ABSTRACT

The automatic identification of an author’s demographic traits (e.g., gender, age group) from their written text is termed author profiling. It has become an essential problem in fields like linguistic forensics, marketing, and security. In recent years, online social platforms (e.g., Twitter, Facebook, blogs, hotel reviews) have grown remarkably; however, it is easy to create fake profiles on them. This research aims to predict author traits on existing benchmark corpora built from Twitter, hotel reviews, social media, and blog profiles. In this chapter, the authors explore four sets of features: syntactic n-grams of part-of-speech tags, traditional n-grams of part-of-speech tags, combinations of word n-grams, and combinations of character n-grams. They use word unigrams and character three-grams as the baseline approach. After analyzing the results, they conclude that performance improves when the combination of word n-grams is used.

INTRODUCTION

Author Profiling Task

Author profiling (AP) is the process of identifying a person’s gender, age, native language, personality traits, and other demographic information from their written text (Iqbal, Ashraf & Nawab, 2015). We are living in an era in which technology is growing rapidly, and the availability of large amounts of written text raises many challenging problems for researchers; one of these is author profiling. Nowadays, most text is available online. People sometimes write and share their opinions and ideas behind the curtain of anonymity.
The problem of AP has become essential in fields like linguistic forensics, marketing, and security. Authorship analysis can be of two types:

• Author verification: the style of individual authors is examined to check whether a text belongs to a specific author or not.
• Author profiling: discriminates between classes of authors by examining their sociolect aspects, that is, how language is shared by people. This helps in identifying profiling characteristics such as gender, age, native language, education, profession, or personality type.

Therefore, AP can be stated as follows: given a set of texts, identify the age, gender, occupation, education, native language, and similar personality-related traits of the author.

Importance and Applications of Author Profiling

Recently, online social platforms such as Twitter, Facebook, blogs, and hotel review sites have grown remarkably and have allowed many users of all age groups to build and maintain personal and professional relationships. However, a shared characteristic of these platforms is that it is simple to use a fake name, age, gender, and location to conceal one’s actual identity, which gives criminals such as pedophiles new ways to look for their victims. When trying to identify these internet predators, law enforcement agencies and social network moderators are faced with two main difficulties: (i) the large number of profiles on social platforms makes manual evaluation unmanageable, and (ii) internet predators frequently create fake identities, posing as youths to interact with their victims. So, efficient automatic systems for identity detection and inspection are essential.

Chapter Focus

The main aim of this study is to discover which feature set is appropriate for author profiling. We conduct same-genre experiments using word n-grams and character n-grams, as well as part-of-speech tag n-grams (traditional and syntactic).
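The word and character n-gram features mentioned above can be illustrated with a minimal sketch. Whitespace tokenization, lowercasing, and reading "combination" as the union of several n-gram sizes are assumptions made here for illustration; the chapter preview does not specify these details, and the function names are hypothetical.

```python
from collections import Counter


def word_ngrams(text, n):
    """Contiguous word n-grams (tuples) from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Contiguous character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def combined_word_ngrams(text, sizes):
    """Count word n-grams for several sizes at once.

    Assumption: a 'combination' of n-grams is the union of the
    n-grams of each requested size.
    """
    counts = Counter()
    for n in sizes:
        counts.update(word_ngrams(text, n))
    return counts


sample = "the hotel was great"
baseline_words = word_ngrams(sample, 1)          # word unigrams
baseline_chars = char_ngrams("hotel", 3)         # character three-grams
combo = combined_word_ngrams(sample, (1, 2))     # unigrams plus bigrams
```

Each resulting count dictionary can then serve as one document's feature vector for a supervised classifier.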
We also try the cross-genre setting, but only for part-of-speech n-grams. The concept of syntactic n-grams (sn-grams) was introduced by Sidorov (2013a; 2013b; 2014; 2019). Sn-grams of POS tags differ from traditional n-grams in which elements are considered neighbors. For syntactic n-grams, the neighbors are taken from the dependency parse tree, not from the surface structure of the text. This technique builds a picture of an author’s style from the information contained in dependency trees. This information is represented as syntactic n-grams of POS tags and is used to build a vector space model. We also examine how traditional n-grams of POS tags can help in the AP task.

A supervised machine learning approach is used in this research. We explain the features that are used and the supervised machine learning algorithms that are employed. Moreover, we apply different machine learning techniques and methods for training the model and compare the results to identify the most suitable techniques for this research work. We then use those machine learning methods to predict the age group and gender of an author. Our focus is to analyze as much textual data as possible and to find methods that identify the writer’s age group and gender.

As mentioned above, author profiling is a vast field covering different aspects related to the personality, behavior, and emotions of the author. Still, in this chapter, our primary focus is on two traits, the age group and gender of the author, using the PAN 2014 and PAN 2016 corpora.
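The difference between traditional and syntactic n-grams of POS tags can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the sentence, its POS tags, and its dependency heads are hand-annotated assumptions (no parser is run), the function names are hypothetical, and only continuous (non-branching) head-to-dependent paths are generated.

```python
def pos_ngrams(tags, n):
    """Traditional n-grams: neighbors come from the surface word order."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def syntactic_ngrams(tags, heads, n):
    """Syntactic n-grams: neighbors come from the dependency tree.

    heads[i] is the index of the head of token i, or None for the root.
    Each n-gram is a path of n nodes descending from head to dependent.
    """
    children = {i: [] for i in range(len(tags))}
    for child, head in enumerate(heads):
        if head is not None:
            children[head].append(child)

    def paths_from(node, length):
        # All downward paths with `length` nodes that start at `node`.
        if length == 1:
            return [[node]]
        result = []
        for child in children[node]:
            for tail in paths_from(child, length - 1):
                result.append([node] + tail)
        return result

    ngrams = []
    for start in range(len(tags)):
        for path in paths_from(start, n):
            ngrams.append(tuple(tags[i] for i in path))
    return ngrams


# "She reads long books" with an assumed, hand-annotated parse:
# reads is the root; She and books depend on reads; long depends on books.
tags = ["PRON", "VERB", "ADJ", "NOUN"]
heads = [1, None, 3, 1]

traditional_bigrams = pos_ngrams(tags, 2)
syntactic_bigrams = syntactic_ngrams(tags, heads, 2)
syntactic_trigrams = syntactic_ngrams(tags, heads, 3)
```

In this example the traditional bigrams include (VERB, ADJ), which pairs two words that are merely adjacent, while the syntactic bigrams include (VERB, NOUN), which links the verb to its object even though an adjective stands between them in the text. Counting such n-grams per document gives the entries of a vector space representation.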
Author Profiling Using Texts in Social Networks. Iqra Ameer and Grigori Sidorov (2021). Handbook of Research on Natural Language Processing and Smart Service Systems (pp. 245-265). www.irma-international.org/chapter/author-profiling-using-texts-in-social-networks/263105
