A More Qualitative Approach to Personality Profiling on Twitter
[Master Thesis]

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Frank Houweling
10199969

MASTER INFORMATION STUDIES
HUMAN-CENTERED MULTIMEDIA
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

June 20, 2015

1st Supervisor: dr. Maarten Marx, ILPS, UvA
2nd Supervisor: MSc. Christophe van Gysel, ILPS, UvA

ABSTRACT
In different fields, the interest in author profiling (determining demographic features of an author) is growing. Complex features without a ground truth, like (perceived) personality, require another approach than traditional author profiling. In this research, a gold standard that is constructed using the personality test by Rammstedt and Oliver (2007) is compared with a new author profiling method. This method consists of an analysis of all information available about a Twitter profile, using measures that are based on personality characteristics found in existing research. While the personalities that result from this method differ greatly from the gold standard, the method performs almost equally well on a secondary human evaluation, suggesting that future work is necessary to determine whether there is such a thing as an objective or mean personality.

General Terms
Author Profiling, Text Analysis, Personality, Psychometrics, Twitter

1. INTRODUCTION
In different fields the interest in author profiling is growing. Author profiling is an application of text analysis whose goal is to predict, given a piece of text written by a person, different descriptive features of this person (the author). These features can then be used to build a demographic description of the author (an author profile), which promises many possibilities for business intelligence, criminal law, computer forensics and more.

Previous author profiling research focused on predicting straightforward features like age and gender. In recent developments, however, there is also a growing body of research on more complex features like author personality.

Personality is an interesting descriptive feature, which opens new possibilities to apply author profiling technology. An example is politics, where party leaders often try to portray a certain type of personality to the voters in order to become likable candidates [5]. With author profiling systems, the process of determining personality can potentially turn out to be faster, easier and more reliable than existing (manual) methods.

Existing research on author profiling is characterized by a rather uniform methodology. A supervised classifier is trained using a ground truth data set of authors' personalities and a broad range of mostly textual features. What sets our approach apart from previous research is a strongly reduced number of features, which are founded on existing research. Because the features are based on observations in previous research, they have the potential to perform better than normal textual features. In addition, they have the potential of a stronger external validity.

During this research the political case described earlier is used as the research setting, and the politicians' Twitter accounts as the available data source. During the design of the author profiling system, the following main question is posed: (1) "Can we develop a system that effectively determines perceived personality based on all information available about a Twitter profile via the API, by using personality characteristics found in existing research?"

In this paper, firstly, a literature review is conducted to answer the following subquestions: How can personality be represented? What are proven methods to determine personality? Which characteristics can be used to describe personality? How can we evaluate such a method of personality determination?

Based on this theoretical background, an experiment is designed in which an author profiling system is evaluated. This author profiling system is based on characteristics found in the literature review and the data available via the Twitter API. Finally, using the results of this experiment we will try to answer the main research question.

2. THEORETICAL BACKGROUND
The main focus of our author profiling system is to determine personality. But personality is not as precisely delimited a concept as might be expected. The meaning of the term personality is very much dependent on the context in which it is used. In general, there are three main types of personality that are related but not necessarily always equal.

2.1 Different types of personality
The first type of personality is the personality a subject assigns to him or herself, also called self-rated personality [16]. This is the type of personality that is most used in research, and it is often gathered with so-called personality markers. These are statements about a person's personality with which the person can agree or disagree on a certain scale.

The second type of personality is the type of personality an expert assigns to a person via observation, also called peer-reported personality. During an observation, the expert looks at the person's behavior and tries to find specific actions that are linked to specific personality traits in existing literature.

The third and last type of personality is the type of personality people around a person assign to him or her, also called peer-rated personality [16]. Just like self-rated personality, peer-rated personality is often determined using personality markers. In the case of peer-rated personality, these statements are not rated by the person him or herself, but by people who know the target person.

With author profiling, the goal is to determine one of these types of personality by analyzing the media created by the person. In the political case introduced in this research, peer-rated personality is the most interesting type of personality, because it represents how voters see the politician.

2.2 Representing personality with the big five
There are two main methods to represent personality. The first method is the description of a person's personality in free text, without any limitations on what terms can be used. This is a very extensive method that enables a researcher to give in-depth insights into the person's personality. But this method is not desirable for many cases like the one used in this research, as it is not well suited for comparison between users or for calculations. Because of that, multiple quantitative methods have been developed that are usable for calculations and comparison.

The big five is an example of such a quantitative representation, which is very often applied in both psychometric and author profiling settings [19, 28]. With the big five method, a person receives a score (often between 0 and 5) on five major personality traits, which together should give a good overall view of his or her personality. The five personality traits are openness to experience, conscientiousness, extraversion, agreeableness and neuroticism.

The big five representation is the standard method to describe personality, as it was found to work consistently across multiple ages, cultures and determination methods [8]. [...]senting two different personalities can be seen in figure 2.

2.3 Determining personality
To find the five scores that make up a big five representation, different methods can be used.

2.3.1 Computer science: Author Profiling
The methodology used in previous author profiling research was briefly discussed in the introduction. As described there, determining personality is seen as a normal author profiling task, just like determining age and gender.

Most existing author profiling research is performed to be submitted to the PAN congress author profiling task [28]. For this task, a training data set is given with authors, demographic features of these authors, and their tweets. There is only a limited set of metadata available, because the real Twitter account is obfuscated for privacy reasons.

This author profiling task focuses on determining several demographic features at the same time. To limit the complexity of the system, the same approach is used to predict all of these demographic features.

The common approach in author profiling consists of calculating a broad range of features from the author's set of tweets. Common types of features are: textual / content features (N-grams, TF-IDF, slang words, swear words, LIWC features [14]), stylistic features (punctuation marks), time-based features (the number of tweets in a specific timeframe) and social network features (network size, density [28, 13]). In most research, textual features are by far the most important.

After the calculation of these features, one classifier, often an SVM, is trained per demographic feature to be determined [28]. Because the big five representation consists of five values, five separate SVM classifiers are trained.

One successful extension to this method is the second order attributes (SOA) method described by López-Monroy et al. [22]. This method does use an SVM, but does not stick to the original set of possible classes for classification. Instead, subclasses are generated for each class using the training data and K-Means clustering. This results in a larger number of classes, which are then used in the classification.

2.3.2 Psychometrics: Personality tests
The field of psychometrics already had extensive experience with measuring personalities long before anyone in the field of computer science tried to do the same. In psychometrics, a broad range of personality tests has been developed. Such tests often work with a set of descriptive sentences [11] or single adjectives [15] (also called markers), which are displayed to the rater. The rater can then agree or disagree with these descriptive sentences or terms using a Likert scale. When the rater agrees, he thinks the description suits the person well.
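
Marker ratings given on a Likert scale can be aggregated into the five big five scores in several ways. The following is a minimal Python sketch of one such aggregation, not the scoring procedure of any specific personality test. The item identifiers, the item-to-trait key, the 1 to 5 Likert range and the choice of reverse-keyed items are all assumptions made for illustration.

```python
TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")

# Hypothetical scoring key: item id -> (trait, reverse_keyed).
# A reverse-keyed item describes the opposite pole of the trait.
ITEM_KEY = {
    "O1": ("openness", False),          "O2": ("openness", True),
    "C1": ("conscientiousness", False), "C2": ("conscientiousness", True),
    "E1": ("extraversion", False),      "E2": ("extraversion", True),
    "A1": ("agreeableness", False),     "A2": ("agreeableness", True),
    "N1": ("neuroticism", False),       "N2": ("neuroticism", True),
}


def score_rater(ratings, item_key=ITEM_KEY, scale_min=1, scale_max=5):
    """Turn one rater's Likert ratings into a mean score per trait."""
    collected = {trait: [] for trait in TRAITS}
    for item, rating in ratings.items():
        if not scale_min <= rating <= scale_max:
            raise ValueError(f"rating for {item} is outside the scale")
        trait, reverse_keyed = item_key[item]
        # Flip reverse-keyed items so a high value always means "more of the trait".
        value = scale_min + scale_max - rating if reverse_keyed else rating
        collected[trait].append(value)
    return {t: sum(v) / len(v) for t, v in collected.items() if v}


def peer_rated_profile(all_ratings):
    """Average the per-rater profiles (assumes every rater rated every trait)."""
    profiles = [score_rater(r) for r in all_ratings]
    return {t: sum(p[t] for p in profiles) / len(profiles) for t in TRAITS}


rater_a = {"O1": 4, "O2": 2, "C1": 5, "C2": 1, "E1": 3,
           "E2": 3, "A1": 2, "A2": 4, "N1": 1, "N2": 5}
rater_b = {item: 3 for item in ITEM_KEY}
profile = peer_rated_profile([rater_a, rater_b])
```

Averaging over several raters mirrors the idea of a peer-rated personality; whether a plain mean is an appropriate aggregate is a design choice of this sketch and not something the text prescribes.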