
Chapter 11 Author Profiling Using Texts in Social Networks

Iqra Ameer Instituto Politécnico Nacional, Mexico

Grigori Sidorov Instituto Politécnico Nacional, Mexico

ABSTRACT The automatic identification of an author’s demographic traits (e.g., gender, age group) from their written text is termed author profiling. It has become an essential problem in fields such as linguistic forensics, marketing, and security. In recent years, online social platforms (e.g., Twitter, Facebook, blogs, hotel reviews) have grown remarkably; however, it is easy to create fake profiles on them. This research aims to predict the traits of authors in an existing benchmark corpus based on Twitter, hotel reviews, and blog profiles. In this chapter, the authors explore four sets of features: syntactic n-grams of part-of-speech tags, traditional n-grams of part-of-speech tags, combinations of word n-grams, and combinations of character n-grams. Word unigrams and character three-grams serve as the baseline approach. After analyzing the results, the authors conclude that performance improves when a combination of word n-grams is used.

INTRODUCTION

Author Profiling Task

Author profiling (AP) is the process of identifying a person’s gender, age, native language, personality traits, and other demographic information from his/her written text (Iqbal, Ashraf & Nawab, 2015). We live in an era in which technology is growing rapidly, and the availability of large amounts of written text raises many challenging problems for researchers; one such problem is author profiling. Nowadays, most text is available online, and people sometimes write and share their opinions and ideas behind the curtain of anonymity.

DOI: 10.4018/978-1-7998-4730-4.ch011

Copyright © 2021, IGI Global. Copying or distributing in print or electronic forms without written permission of IGI Global is prohibited.

AP has become an essential problem in fields such as linguistic forensics, marketing, and security. Authorship analysis can be of two types:

• Author verification: The style of an individual author is examined to check whether a text belongs to that specific author or not.
• Author profiling: Classes of authors are distinguished by examining their sociolect aspects, that is, how language is shared by people. This helps in identifying profiling characteristics such as gender, age, native language, profession, or personality type.

Therefore, AP can be stated as follows: given a set of texts, identify the age, gender, occupation, education, native language, and similar personality-related traits of the author.

Importance and Applications of Author Profiling

Recently, online social platforms such as Twitter, Facebook, blogs, and hotel review sites have grown remarkably and have allowed many users of all age groups to build and maintain personal and professional relationships. However, a shared characteristic of these platforms is that it is simple to use a fake name, age, gender, and location to conceal one’s actual identity, which gives criminals such as pedophiles new ways to look for victims. When trying to identify these predators, law enforcement agencies and moderators are faced with two main difficulties: (i) the sheer number of profiles on social platforms makes manual evaluation unmanageable, and (ii) internet predators frequently create a fake identity, posing as youths to interact with their victims. Efficient automatic systems for uncovering and inspecting identities are therefore essential.

Chapter Focus

The main aim of this study is to discover which feature set is appropriate for author profiling. We conduct experiments within the same genre using word n-grams and character n-grams, as well as part-of-speech (POS) tag n-grams (traditional and syntactic). We also try the cross-genre setting, but only for part-of-speech n-grams. The concept of syntactic n-grams (sn-grams) was introduced by Sidorov (2013a; 2013b; 2014; 2019). Sn-grams of POS tags differ from traditional n-grams in which elements are considered neighbors. For syntactic n-grams, the neighbors are taken from the dependency parse tree, and not from the surface structure of the text. This technique builds a picture of an author’s style from the information contained in dependency trees. This information is represented as syntactic n-grams of POS tags and is used to fit a vector space model. We also examine how traditional n-grams of POS tags can be helpful in the AP task.

A supervised machine learning approach is used in this research. We explain the features that are utilized and the supervised machine learning algorithms that are employed. Moreover, we train models with several machine learning techniques and compare the results to identify the most suitable techniques for this research work. We then use those machine learning methods to predict the age group and gender of an author.

Our focus is to analyze as much textual data as possible and to identify methods for determining the writer’s age group and gender. As mentioned above, author profiling is a vast field covering different aspects related to the personality, behavior, and emotions of the author. In this chapter, however, our primary focus is on two traits, namely the age group and gender of the author, using the PAN 2014 and PAN 2016 corpora.
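The difference between the feature families described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the toy sentence, its POS tags, and its dependency heads are hand-written assumptions (in practice they would come from a tagger and a dependency parser), and the function names are hypothetical. The sketch covers only continuous sn-grams, that is, paths that follow head-to-dependent arcs without branching.

```python
def ngrams(items, n):
    """Traditional n-grams: neighbors are adjacent in surface order."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def char_ngrams(text, n):
    """Character n-grams over the raw string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def syntactic_ngrams(tags, heads, n):
    """Continuous sn-grams of POS tags.

    Neighbors are taken from the dependency tree: each n-gram is a path
    of n tokens that follows head -> dependent arcs.
    heads[i] is the index of the head of token i, or -1 for the root.
    """
    # Build a head -> list of dependents map from the head indices.
    children = {}
    for child, head in enumerate(heads):
        if head >= 0:
            children.setdefault(head, []).append(child)

    results = []

    def walk(path):
        # A path of length n yields one sn-gram of POS tags.
        if len(path) == n:
            results.append(tuple(tags[i] for i in path))
            return
        for child in children.get(path[-1], []):
            walk(path + [child])

    # Every token may start a path down the tree.
    for start in range(len(tags)):
        walk([start])
    return results


# Toy example (assumed, hand-annotated): "She reads old books"
words = ["She", "reads", "old", "books"]
tags = ["PRON", "VERB", "ADJ", "NOUN"]
heads = [1, -1, 3, 1]  # She <- reads (root), old <- books, books <- reads

print(ngrams(words, 1))                # word unigrams (baseline feature)
print(char_ngrams("books", 3))         # character three-grams (baseline feature)
print(ngrams(tags, 2))                 # traditional POS bigrams
print(syntactic_ngrams(tags, heads, 2))  # syntactic POS bigrams
```

In this toy sentence the traditional POS bigrams include (VERB, ADJ), which links two words that are merely adjacent, whereas the syntactic bigrams include (VERB, NOUN), which links the verb to its object across the intervening adjective. Counting such tuples per document would give the frequency vectors of a vector space model.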


Author Profiling Using Texts in Social Networks. Iqra Ameer and Grigori Sidorov (2021). Handbook of Research on Natural Language Processing and Smart Service Systems (pp. 245-265). www.irma-international.org/chapter/author-profiling-using-texts-in-social-networks/263105