A more qualitative approach to personality profiling on Twitter

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Frank Houweling 10199969

MASTER INFORMATION STUDIES HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE UNIVERSITY OF AMSTERDAM

June 20, 2015

1st Supervisor: dr. Maarten Marx, ILPS, UvA
2nd Supervisor: MSc. Christophe van Gysel, ILPS, UvA

[Master Thesis]

Frank Houweling University of Amsterdam Master Information Studies Human Centered Multimedia [email protected]

ABSTRACT

Interest in author profiling, the task of determining demographic features of an author, is growing in different fields. Complex features without a ground truth, such as (perceived) personality, require a different approach than traditional author profiling. In this research, a gold standard constructed with the personality test by Rammstedt and John (2007) is compared with a new author profiling method. This method analyzes all information available about a Twitter profile, using measures that are based on personality characteristics found in existing research. While the personalities that result from this method differ greatly from the gold standard, the method performs almost equally well in a secondary human evaluation, suggesting that future work is necessary to determine whether there is such a thing as an objective or mean personality.

General Terms

Author Profiling, Text Analysis, Personality, Psychometrics, Twitter

1. INTRODUCTION

In different fields the interest in author profiling is growing. Author profiling is an application of text analysis whose goal is to predict, given a piece of text written by a person, different descriptive features of this person (the author). These features can then be used to build a demographic description of the author (an author profile), which promises many possibilities for business intelligence, criminal [...], computer forensics and more.

Previous author profiling research focused on predicting straightforward features like age and gender. In recent developments, however, there is also a growing amount of research on more complex features like author personality.

Personality is an interesting descriptive feature, which opens new possibilities to apply author profiling technology. An example is politics, where party leaders often try to portray a certain type of personality to the voters in order to become likable candidates [5]. With author profiling systems, the process of determining personality can potentially become faster, easier and more reliable than with existing (manual) methods.

Existing research on author profiling is characterized by a rather uniform methodology. A supervised classifier is trained using a ground truth data set of authors' personalities and a broad range of mostly textual features. What sets our approach apart from previous research is a strongly reduced number of features, which are founded in existing research. Because the features are based on observations in previous research, they have the potential to perform better than normal textual features. In addition, they have the potential of a stronger external validity.

In this research the political case described earlier is used as the research setting, and the politicians' Twitter accounts as the available data source. During the design of the author profiling system, the following main question is posed: (1) "Can we develop a system that effectively determines perceived personality based on all information available about a Twitter profile via the API, by using personality characteristics found in existing research?"

In this paper, a literature review is first conducted to answer the following subquestions: How can personality be represented? What are proven methods to determine personality? Which characteristics can be used to describe personality? How can we evaluate such a method of personality determination?

Based on this theoretical background, an experiment is designed in which an author profiling system is evaluated. This author profiling system is based on the characteristics found in the literature review and the data available via the Twitter API. Finally, using the results of this experiment, we will try to answer the main research question.

2. THEORETICAL BACKGROUND

The main focus of our author profiling system is to determine personality. Personality, however, is not as precisely delimited a concept as might be expected. The meaning of the term is very much dependent on the context in which it is used. In general, there are three main types of personality that are related but not necessarily equal.

2.1 Different types of personality

The first type of personality is the personality a subject assigns to him or herself, also called self-rated personality [16]. This is the type of personality that is most used in research, and it is often gathered with so-called personality markers. These are statements about a person's personality with which the person can agree or disagree on a certain scale.

The second type of personality is the type of personality an expert assigns to a person via observation, also called peer-reported personality. During an observation, the expert looks at the person's behavior and tries to find specific actions that are linked to specific personality traits in existing literature.

The third and last type of personality is the type of personality people around a person assign to him or her, also called peer-rated personality [16]. Just like self-rated personality, peer-rated personality is often determined using personality markers. In the case of peer-rated personality, these statements are not rated by the person him or herself, but by people who know the target person.

With author profiling, the goal is to determine one of these types of personality by analyzing the media created by the person. In the political case introduced in this research, peer-rated personality is the most interesting type, because it represents how voters see the politician.

2.2 Representing personality with the big five

There are two main methods to represent personality. The first method is the description of a person's personality in free text, without any limitations on what terms can be used. This is a very extensive method that enables a researcher to give in-depth insights into the person's personality. However, this method is not desirable for many cases like the one used in this research, as it is not well suited for calculations or for comparison between users. Because of that, multiple quantitative methods have been developed that are usable for calculations and comparison.

The big five is an example of such a quantitative representation, which is very often applied in both psychometric and author profiling settings [19, 28]. With the big five method, a person receives a score (often between 0 and 5) on five major personality traits, which together should give a good overall view of his or her personality. The five personality traits are openness to experience, conscientiousness, extraversion, agreeableness and neuroticism.

The big five representation is the standard method to describe personality, because it was found to work consistently across multiple ages, cultures and determination methods [8].

Radar plots are an effective way of visualizing a big five representation [25]. An example of such a radar plot representing two different personalities can be seen in figure 2.

2.3 Determining personality

To find the five scores that make up a big five representation, different methods can be used.

2.3.1 Computer science: Author Profiling

The methodology used in previous author profiling research was briefly discussed in the introduction. As described there, determining personality is treated as a normal author profiling task, just like age and gender.

Most existing author profiling research is performed to be submitted to the PAN congress author profiling task [28]. For this task, a training data set is given with authors, demographic features of these authors, and their tweets. Only a limited set of metadata is available, and the real Twitter account is obfuscated for privacy reasons.

This author profiling task focuses on determining several demographic features at the same time. To limit the complexity of the system, the same approach is used to predict all of these demographic features.

The common approach in author profiling consists of calculating a broad range of features from the author's set of tweets. Common types of features are: textual / content features (N-grams, TF-IDF, slang words, swear words, LIWC features [14]), stylistic features (punctuation marks), time-based features (the number of tweets in a specific timeframe) and social network features (network size, density [28, 13]). In most research, textual features are by far the most important.

After the calculation of these features, one classifier per to-be-determined demographic feature, often an SVM, is trained [28]. Because the big five representation consists of five values, five separate SVM classifiers are trained.

One successful extension to this method is the second order attributes (SOA) method described by López-Monroy et al. [22]. This method does use an SVM, but does not stick to the original set of possible classes for classification. Instead, subclasses are generated for each class using the training data and K-Means clustering. This results in more classes, which are then used in the classification.

2.3.2 Psychometrics: Personality tests

The field of psychometrics had extensive experience with measuring personalities long before anyone in the field of computer science tried to do the same. In psychometrics, a broad range of personality tests has been developed. Such tests often work with a set of descriptive sentences [11] or single adjectives [15] (also called markers), which are displayed to the rater. The rater can then agree or disagree with these descriptive sentences or terms using a Likert scale. When the rater agrees, he thinks the description suits the person well.

These tests differ greatly in the time that is needed to complete them. Some tests based on descriptive sentences can take up to 40 minutes to complete [11], whereas the short set of markers by Rammstedt (2007) takes only a few minutes [27].

2.3.3 Psychology: observation and characteristics of personality traits

The questions that make up such a personality test are often founded in psychological research. In such research, observations are used to find new traits and characteristics that make up the big five.

There is a significant amount of research about the different personality characteristics that describe personality traits well. When a person is highly associated with these characteristics, the person also scores high on the corresponding personality trait, and vice versa.

Because Twitter author profiling provides new ways of determining a personality, it is a good idea to look at all characteristics that are applied to describe a personality trait, instead of focusing only on the ones often found in personality tests. It might be the case that some of these characteristics are hard to measure in a test, but can be effectively applied in a Twitter-based author profiling system.

The following characteristics are found to describe specific personality traits well, and are applicable to Twitter data. These characteristics form the foundation of the Twitter analysis system that is introduced in this research.

Openness to experience
Openness to experience can be summarized into six dimensions, including imagination, aesthetic sensitivity, attentiveness to inner feelings, preference for variety and intellectual curiosity [9].

Butler (2000) notes that people who are more open to experiences are in general more open to different cultures and lifestyles [4]. McCrae (1996) describes that it is likely that this openness to different cultures and preference for variety can be seen in the friend network of the Twitter account of a person, but does not elaborate on how it can be seen [24]. It is however a reasonable hypothesis that a user who is open to different lifestyles and opinions is more likely to follow a diverse group of people.

Extraversion
Thompson (2008) described extraversion with terms like talkative, outgoing and energetic, and introversion (the inverse) with quiet, reserved and shy [30]. In short, extraverts enjoy social interactions. This can be seen on social media, where people who are extravert have a significantly larger group of friends and contacts [1].

Agreeableness
Agreeableness is a personality trait that is most often described with characteristics such as sympathetic, kind, warm and considerate [30]. People are often found to be sympathetic or kind because of the way they communicate, and in particular because of non-verbal clues. Which clues are considered polite and nice is very culturally dependent [20].

On Twitter, there are of course no real non-verbal clues. Instead of non-verbal clues, people on Twitter often use emoticons to support the transfer of emotions [21], or make emotions explicit by writing them down [17].

Neuroticism
Neuroticism is often described in terms of emotional stability, as it describes how frequently a person has to cope with negative emotions like depression, anxiety and guilt [18]. Not only does a neurotic person experience more anger, he is also more likely to express this anger [23]. Someone who scores low on neuroticism is more emotionally stable and calm, and will experience less stress.

Conscientiousness
Conscientiousness is often described as the personality trait of being thorough and careful [30]. People who score high on conscientiousness have a desire to do a task well. They spend more time on perfecting a task, are more careful and have more regret about the things they have done wrong. On social media, they often put more time into what they post [29].

2.4 Evaluation

2.4.1 Evaluation in Author Profiling

Author profiling systems that determine personality are evaluated with the same approach as systems that determine age and gender.

The results from the author profiling system are compared with a ground truth, which is a set of results that were independently checked. The accuracy of the author profiling system is then used as the measure to evaluate a system, and to perform between-system comparisons [28]. An author profiling system that has a higher accuracy on the test set than another author profiling system is thus considered better.

There are two reasons why this approach does not fit personality very well. Firstly, when using a boolean measure for correct and incorrect predictions by the author profiling system, we ignore the difference between completely wrong and only slightly off. Take the case of an author who scores 5 out of 5 on a specific big five dimension (ground truth), and two author profiling systems that give 1 and 4 as their result respectively. Both are equally wrong according to the accuracy measure, whereas it would be better to use a measure that takes into account which one is more wrong. This can be fixed relatively easily by using the mean squared error.

Secondly, creating a ground truth for personality is not as straightforward as for age or gender. In the case of Verhoeven (2014), a normal big five personality test is used to determine this ground truth [32]. When evaluating a new method by comparing it to such a ground truth, we should keep in mind that there is no objective ground truth personality for any author. Every newly defined personality measure can be successfully argued to be better than the old one, simply because the definitions of the big five dimensions are not watertight [3]. Because of that, in the case of personality the ground truth is better called a gold standard.

The result of this change for the evaluation is that it cannot simply be said that when a system gives results that are not equal to the gold standard, these results are wrong.
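The accuracy versus mean squared error argument made earlier in this section can be illustrated with a minimal sketch. It uses the example from the text (a ground truth score of 5, and two systems that answer 1 and 4). The function names `accuracy` and `mean_squared_error` are chosen here for illustration only and are not taken from any cited system.

```python
def accuracy(truth, predicted):
    """Fraction of predictions that match the ground truth exactly."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


def mean_squared_error(truth, predicted):
    """Average squared distance between prediction and ground truth."""
    return sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth)


# Example from the text: the author scores 5 out of 5 on one dimension.
truth = [5]
system_a = [1]  # far off
system_b = [4]  # only slightly off

# Accuracy treats both systems as equally wrong.
acc_a, acc_b = accuracy(truth, system_a), accuracy(truth, system_b)

# Mean squared error shows that system_a is much further off than system_b.
mse_a = mean_squared_error(truth, system_a)
mse_b = mean_squared_error(truth, system_b)

print(acc_a, acc_b)
print(mse_a, mse_b)
```

Under accuracy the two systems are indistinguishable, while the squared error separates them clearly, which is the property the text asks of an evaluation measure for continuous personality scores.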
This comparison can only be used to see if the results are comparable to those of the gold standard personality test; a different result can be either worse or better. To make the distinction between these two, some kind of secondary evaluation should be performed.

2.4.2 Evaluation in psychometrics

The difficulties of evaluating and validating personality of course also have implications in psychometrics. In psychometrics, a few personality tests are generally seen as well-validated methods of determining personality. These methods are then used as a 'gold standard', the most accurate test currently available. One of the personality tests that is often argued to be the best available is the Revised NEO Personality Inventory (NEO PI-R) [7].

NEO PI-R is seen as very reliable because of the extensive amount of research that went into validating it. Because there is no ground truth, NEO PI-R and other personality tests are often validated on how useful they are in research. NEO PI-R was found to be useful in research, where it scored well on internal consistency and retest reliability [7]. It also scored high on criterion validity [6], a measure specific to psychometrics that expresses to what extent the outcomes of the personality test relate to a behavioral criterion that is generally agreed upon by research [26].

Newer personality tests are evaluated by their agreement with a personality test that is already thoroughly validated. For example, the markers described by Rammstedt (2007), which are used in this research, are validated by the correlation between the results of the new test and the results of the proven NEO PI-R test [27].

Author profiling research that is focused on measuring personality should base its evaluation on this method: it should take into account that there is no objective personality measurement, and focus on how well a measurement fits its purpose. In this research, the purpose is to measure how one's personality is perceived by the public. This perception should thus, next to the gold standard comparison, in some way be evaluated by the public.

3. METHOD

As described in the introduction, the goal of this research is to determine the personality of specific politicians by analyzing their Twitter account using an author profiling system. To be able to evaluate the personality scores that result from that analysis, a gold standard has to be constructed with which the values can be compared.

Peer rating is currently the most frequently used method of retrieving personality, and because of that it is used to set the gold standard of personality scores.

3.1 Politician selection

Before these peer ratings were gathered, politicians had to be selected who were well-known public figures. Otherwise, it would have been significantly harder to find enough raters who would be able to judge them.

The first selected politicians were politicians in leading and public roles. They are often in the media and as a result are very well known by the general public. This resulted in a slight over-representation of the ruling parties and ex-ruling parties, which provided the current and previous ministers. After these politicians, the party leaders of all major political parties were also added.

Parties that were very active on social media, like GroenLinks, were somewhat over-represented, because it was relatively easy to find politicians with an active Twitter account. Less often represented in the politician selection were parties like the SP and 50PLUS, which were generally more focused on older generations and less active on social media.

PVV was also underrepresented compared to the party's size, because of the lack of well-known politicians that represent the party.

To compensate for some of the over- or under-representation, political figures who no longer carry out a public role, but are still well known, were added. Examples of these politicians are Boris van der Ham and Erica Terpstra. The final division of selected politicians over parties can be seen in figure 1.

Figure 1: Distribution of selected politicians over Dutch political parties (footnote 1). [Bar chart; y-axis: # of politicians; x-axis: SP, D66, VVD, CDA, PvdA, PVV, GroenLinks, Other parties.]

Footnote 1: 50PLUS, SGP, PvdD, Denk and ChristenUnie (all one politician) are combined in other parties.

3.2 Gold standard peer rating

As a gold standard to compare our system with, peer raters were asked to rate the personalities of 35 politicians.

These ratings consisted of peer-rated markers, for which the raters had to indicate to what extent they agreed with them. To make sure that many ratings could be gathered in a relatively short time, the short markers by Rammstedt (2007) were used [27]. The markers were translated from German and English to Dutch, for the ease of the raters. The translations are displayed in table 1.

The raters consisted of a random group of people who were asked to fill in an online questionnaire.

Not every rater had to rate all politicians. For each rater, a random selection of three politicians was made. Whenever a person indicated not to know one of the politicians, another random one was given for the participant to rate. At the end of the survey, general demographics (age, political involvement, political preference, gender and educational level) of the rater were gathered.

3.3 Data selection and measures

The Twitter analysis consists of accessing the data available on the politicians' Twitter accounts, and using the characteristics described in the theoretical background to find measures that transform this data into meaningful personality scores.

Because Twitter is such an open system, a wide range of data is available using the public API. The data is divided between two main objects: users and tweets, which contain a broad range of attributes. Other objects (such as places and entities) were not used in this research. The full list of available attributes for both the user and the tweet objects can be found in the Twitter API Documentation [31].

It should be kept in mind that Twitter limits the number of API requests per system. These limits differ per type of request, and retrieving some types of data requires a lot of time. For this reason, measures that depended on Twitter following graphs were skipped. There was simply not enough time to retrieve all required data.

The characteristics found in the literature review were applied to the data available on Twitter, resulting in the following measures:

3.3.1 Openness to experience

As described in the theoretical background, people who score high on openness to experience are generally more open to different opinions, cultures and lifestyles, and also often have a preference for variety in the people they talk to and the media they consume.

As a consequence, a person who has a preference for media and intellectual variety might also have a preference for more variety in the people they follow on Twitter.

One way we can measure this variety in Twitter followers is by using metadata publicly available on Twitter. On Twitter, users are asked to write a short summary about their interests and occupation. It is a Twitter custom to fill this with either a short description, or a comma-separated list of keywords that describe these topics. This information, in combination with simple text analysis, can be used to describe the biography interest similarity of two users. The average biography interest similarity is calculated as follows:

$$\frac{\sum_{f \in F_U} \mathrm{COSSIM}(B_f, B_U)}{|F_U|} \tag{1}$$

where

• U is the target Twitter account
• F_U is the list of following Twitter accounts for a given Twitter account
• and B_U is the biography feature vector for a given user account

The biography feature vector simply consists of all words that occur in the biography, together with their relative frequencies. Words were split on spaces and punctuation characters.

3.3.2 Extraversion

Extraversion is the personality trait of being outgoing and energetic, and is often associated with a higher number of friends and contacts.

On Twitter, 'becoming a contact' is a one-way interaction. Contrary to other social networking sites, following a person does not require the permission of the other person. The other person is, however, also not obliged to follow back. For this reason, a politician's contacts are more likely to be the people he follows than the ones he is followed by.

Also, the simple count of these friends is not a fair measure of his outgoingness. It is a given that the friend network of a user on any social networking site grows during the time that he or she is a member. In the measure of extraversion, the number of friends should thus be compensated for the age of the Twitter account.

With this in mind, the normalized relationship quantity is defined as:

$$\log\left(\frac{U_f}{\mathrm{EXP}(U_a)}\right) \tag{2}$$

where

• U is the target Twitter account
• U_a is the account age in days
• U_f is the amount of users that account is following
• and EXP(x) is the expected amount of Twitter followers for a given age in days

The expected amount of Twitter followers is found with a short empirical study, in which a sample of 904 random Dutch Twitter profiles was analyzed to find a function that represents what amount of followers would be expected from a certain account age. From these Twitter accounts, the ones with more than 8000 following users were removed, as they are obviously spam accounts. A normal person would not very likely have more than 8000 contacts.

On this dataset, a logarithmic function was fitted using regression to represent the amount of followers based on a Twitter account's age in days. The resulting function for the expected amount of Twitter followers, given the account age, is thus: EXP(x) = 452.3626678331251 + 146.9562760301834 * log(x), with R^2 = .032.

Table 1: Dutch translations of the 10 mini markers of Rammstedt and John (2007) used for gold standard peer rating.

English                                   Dutch
... is reserved                           ... terughoudend en gereserveerd is
... is generally trusting                 ... mensen veel vertrouwen geeft, het goede ziet in mensen
... tends to be lazy                      ... comfortabel is, geneigd tot luiheid
... is relaxed, handles stress well       ... relaxed is, goed met stress om kan gaan
... has few artistic interests            ... weinig artistieke interesses heeft
... is outgoing, sociable                 ... uitgaand en gezellig is
... tends to find fault with others       ... de neiging heeft om anderen te bekritiseren
... does a thorough job                   ... grondig te werk gaat
... gets nervous easily                   ... gemakkelijk nerveus en onzeker is
... has an active imagination             ... een actieve verbeelding heeft en fantasierijk is

3.3.3 Agreeableness

Agreeableness is the trait of being kind. As described in the literature review, being considered kind or warm is mostly dependent on non-verbal communication. Because non-verbal communication is often replaced with emoticons on the internet, the assumption is made that agreeableness can be measured by the use of positive emoticons in communication with other persons. Another clue of kindness might be the use of words that indicate kindness, like "thank you".

Following this assumption, the direct tweet kindness index is the percentage of directed tweets (that is, tweets that are sent to a specific user) that contain at least one of the terms that indicate kindness. These terms are defined as the list of positive smileys (:), ;) etc.) in combination with "thanks", "thank you" etc.

3.3.4 Neuroticism

On social media, people who score high on neuroticism will more often express negative emotions. Because of that, measuring how often negative emotions show in posts is assumed to be a valid way to determine neuroticism.

To do that, we first need a list of words that describe such negative emotions. Because no extensive list in Dutch was found, an existing English list was translated (footnote 2).

Footnote 2: Negative Emotions list on negativeemotionslist.com. Translated version can be found on https://gist.github.com/FrankHouweling/7fce4b89da4357744054

Using this list of negative emotion words, the following formula is used to describe neuroticism:

$$\sum_{t \in T_U} \frac{\sum_{w \in W_t} \begin{cases} 1 & \text{if } w \in N \\ 0 & \text{otherwise} \end{cases}}{|W_t|} \tag{3}$$

where

• U is the target Twitter account
• T_U is the list of tweets for a user
• W_t is the list of words in a tweet
• N is the list of negative words

3.3.5 Conscientiousness

People who score high on conscientiousness tend to spend more time on writing posts on social media. By spending more time on perfecting these posts, they will most likely make fewer spelling errors.

Measuring the amount of spelling errors in a text can therefore be a way to find the conscientiousness. The degree of spelling errors is defined as:

$$\sum_{t \in T_U} \frac{\sum_{w \in W_t} \begin{cases} 1 & \text{if spelling error} \\ 0 & \text{otherwise} \end{cases}}{|W_t|} \tag{4}$$

where

• U is the target Twitter account
• T_U is the list of tweets for a user
• W_t is the list of words in a tweet

To check the spelling of a specific word, the complete tweet was used as the input for ASpell, an often-used open source spelling checker, using the default Dutch dictionary. To improve the results on tweets, @-mentions and hashtags were removed, because usernames and hashtagged terms often do not represent any real-life words. Because a lot of false positives occurred even with these adjustments, the 250 most frequently flagged words were analyzed manually and turned into an addition to the default dictionary. This additional dictionary consists mostly of words that are very specific to politics, and are therefore not part of any normal dictionary.

3.4 Method of evaluation

In the theoretical background, two shortcomings in the evaluation of previous author profiling systems were discussed. In short, previous research used accuracy to evaluate the success of its method. Accuracy, however, does not take into account that personality is measured on a continuous scale. It also does not take into account that the method is compared with a gold standard, not a ground truth, and that it is possible that the new method is better than the gold standard to which it is compared. To overcome both shortcomings, the system is evaluated in two ways.

Firstly, the author profiling results are compared to the gold standard method for measuring personality. Because both methods contain a certain amount of measurement error, and both give results on a continuous scale, a Bland-Altman approach is argued to be the best evaluation method [10]. This method can show if two techniques are drastically different, even with only a small sample.

Secondly, when they do differ significantly, it is not directly concluded that the new system does not function. For this, we must further evaluate the results. A group of raters is introduced to the concept of the big five personality traits, and it is explained to them what these traits stand for. After that, they are shown all politicians with their personalities according to the two methods. For this, a spider plot as shown in figure 2 is used, because it enables easy comparison between the two personalities. Their assignment is to choose the best fitting personality for the politician.

Figure 2: Spider representing the big five personality scores for Twitter (blue) and the gold standard survey (red) for [...].

4. RESULTS

4.1 Gold Standard data

For the gold standard, a survey was conducted in which raters used a short personality test to determine the perceived personality of different politicians. In total, 144 ratings were gathered. Each rating resulted in a score between -4 and 4 for all five personality traits. As can be seen in table 2, the results strongly follow a normal distribution. No further transformations were applied to the data.

The mean values of the personality traits differ a lot from the mean value of the complete range. Politicians in the survey score much higher than 0 (the mean of the full possible data range) for extraversion and conscientiousness, and much lower for neuroticism.

Table 2: Descriptives of the gold standard survey data.

Personality Trait    Mean    SD      Skewness  Kurtosis
Agreeableness        -0.3    2.079   -0.247    -0.690
Extraversion         0.76    1.900   -0.73     -0.409
Openness to Exp.     0.18    1.581   0.118     0.572
Conscientiousness    1.06    1.906   -0.283    -0.633
Neuroticism          -1.08   1.794   0.241     -0.198

5. DATA CLEANING

For a fair comparison between the gold standard method (survey) and the author profiling method introduced in this research, politicians have to be selected for whom enough data is available for both methods to base their results on. To make the selection process as fair as possible, a set of quality requirements was defined.

For a politician to be considered in the evaluation, he has to comply with the following:

• In the survey, at least three participants rated the politician.
• The survey result data for the politician approaches a normal distribution, with the mean of the standard deviations for the big five personality dimensions lower than 1.8. A low SD indicates that most values are near the mean, and thus that there is general agreement between the raters.
• The Twitter account associated with the politician is active, which means at least one real tweet (no retweet) in the past week.
• The Twitter account follows at least 45 people.

These requirements make sure that the survey is based on enough opinions. In addition, they make sure that the Twitter account contains enough tweets and befriended users to perform meaningful calculations. Table 4 in appendix B displays an overview of all politicians and their scores on these items. Politicians who are printed in bold passed the requirements, and are used in further analysis.

These politicians are: Jeanine Hennis, Lodewijk Asscher, Arie Slob, Boris van der Ham, Emile Roemer, Marianne Thieme, Tofik Dibi, Fleur Agema and Alexander Pechtold. These nine politicians form the subset of data on which the following evaluation is based.

6. EVALUATION

As described in the method section of this paper, the author profiling method introduced in this thesis is evaluated in two ways.

6.1 Testing for equality

Firstly, the new method is compared to the gold standard method. The gold standard method of personality determination, at this moment, consists of a survey with personality markers. From this survey, the mean of all raters was used as the resulting personality.

Before these two big five representations were compared, they were both transformed to a scale from 0 to 1. For most Twitter measures, this was already the case. The gold standard data were all originally on a scale from -4 to 4, and had to be transformed. To do this, the following simple transformation was used:

$$\frac{v - \min}{\max - \min} \tag{5}$$

where min of course is -4 and max is 4.

After this, the difference between the two methods was calculated for each pair of politician and personality trait. In addition, the mean of the values that come from the two methods was calculated.

This information was then used to construct, for each personality trait, two plots in the way described by [10]. These plots are shown in figures 3 - 7 (appendix A).

From analyzing these plots, we can say that the two methods do not return the same results for the selected politicians. In addition, the noise does not seem to follow a pattern, and further transformations will most likely not significantly improve the fit. The only possibility for a minor improvement would be to exclude outliers, as in the case of conscientiousness, where one politician clearly scores differently (higher on Twitter, lower on survey) than the others.

Table 3: Binomial test results of survey evaluation of the gold standard (survey) method vs. the Twitter analysis method.

Politician          # survey  # Twitter  Sig.
Jeanine Hennis      3         8          .227
Lodewijk Asscher    6         4          .754
Arie Slob           10        0          .002
Boris van der Ham   4         3          1.000
Emile Roemer        7         4          .549
Marianne Thieme     2         7          .180
Tofik Dibi          7         0          .016
Fleur Agema         5         1          .219
Alexander Pechtold  5         7          .774

[...] agreement in the evaluation survey, even though this was a relatively simple task (two quite different personalities: which one fits the politician better?). For the inter-rater agreement calculation, Krippendorff's alpha was used, because it performs well with more than two raters and lots of missing data, as in our case [12]. Krippendorff's alpha for the evaluation survey was 0.3722, where .67 is generally seen as a minimum for meaningful results [12].

7. CONCLUSION

In this paper we investigate the possibilities of developing an author profiling system that determines personality by looking at characteristics of personality traits found in existing research.

During the research, such a system was constructed and evaluated. In the evaluation, the personality scores from the proposed system are found to be very different from the ones found by the personality test that serves as a gold standard. It would thus have been expected that the human evaluation, in which participants are asked to compare the proposed personalities of the gold standard personality test and the system, would result in significantly more votes for one of the two methods. This is, however, not the case.
The strong di↵erence in resulting personalities between the 6.2 Human evaluation two methods can not be clearly seen in the human evalua- But, that the Twitter based measure is di↵erent then the tions, where the personality scores from the author profiling gold standard does not necessarily mean that the method system were only significantly worse then the personality performs worse. This hypothesis is partly supported by the scores from the personality test for two politicians. In the survey evaluation results. other cases, both methods performed equally well or the au- thor profiling system performed slightly better. The survey results (N = 12) were analyzed using a binomial test to find if one of the two methods was chosen signif- Because of this, it is not possible to say if the system was icantly more often then we would expect from change (= successful, and no definitive answer can be given to the main .5). The results are shown in table 3. Two politicians were research question. found to be represented significantly better using the sur- vey. These were Arie Slob (p = 0.002) and tofik dibi (p 7.1 Discussion = 0.016). The other politicians were not represented sig- That the results are rather surprising, is clear. When the nificantly better by one of the two methods, but this can results of two methods of personality determination are com- also be a matter of a lack of participants. Marianne Thieme pared, and are found to being very di↵erent, one would ex- and Alexander Pechtold for example have more votes for the pect one to be superior then the other. But, the human Twitter analysis method’s result then for the gold standard evaluation found that only two politicians were significantly method. better represented by the gold standard.

What catches the attention is the relative low inter-rater Before a potential reason for this is given, it is important to mention that making a system that returns the same results rent implementation of the big five, there is no room for as the gold standard personality test is not necessarily the description of how polarized the views are on a certain per- final goal of this research. The goal is to make a system that son’s personality. While this can be extremely interesting shows the general view of a politician’s personality by the information. Because personality is not always objective, public. a polarization dimension will be able to predict how many people will agree. This can also be the reason of why the human evaluation gives these results. Raters were found to not agree even in Author profiling of personality is an interesting problem with the simple task of choosing between two completely di↵erent many applications. But, before we can accurately mine per- personalities. Maybe, there is no such thing as one average sonalities, there needs to be more research on how person- personality, but is the view of politicians highly polarized. ality can be identified on the internet. That is the only way in which a significantly better performing system then the Next to the evaluation, we have found more interesting re- normal text-feature based classifiers can be designed. This sults. For example in the determination of the gold stan- research was a first step in this, with an introduction in this dard, were the traits did not always have a mean roughly new approach. But follow-up research is necessary to im- equal to 0, but scored di↵erently. Politicians were found to prove our knowledge and experiences with the combination score relatively high on extraversion and conscientiousness, of personality mining and the internet. and much lower then expected for neuroticism. 8. REFERENCES This can of course be explained by the fact that politicians [1] Y. 
Amichai-Hamburger and G. Vinitzky. Social with certain personality characteristics have a higher likeli- network use and personality. Computers in human hood to become well-known politicians [5]. These personal- behavior,26(6):1289–1295,2010. ity characteristics could very well result in a high score for [2] M. R. Barrick and M. K. Mount. The big five extraversion (necessary in propagating the beliefs of the po- personality dimensions and job performance: A litical party) and conscientiousness. Conscientiousness was meta-analysis. 1991. found to positively correlate with overall job performance [3] J. Block. The five-factor framing of personality and across multiple professions [2], so it could very well corre- beyond: Some ruminations. Psychological Inquiry, late positively with being successful as a politician as well. 21(1):2–25, 2010. [4] J. C. Butler. Personality and emotional correlates of 7.2 Future work right-wing authoritarianism. Social Behavior and That no clearly e↵ective author profiling system was found, Personality: an international journal,28(1):1–14, does not necessarily mean it is impossible to do so while 2000. using personality characteristics found in research. [5] G. V. Caprara and P. G. Zimbardo. Personalizing politics: a congruency model of political preference. One way to make the purposed author profiling system more American Psychologist,59(7):581,2004. e↵ective might be to use a di↵erent case. The image of [6] M. A. Conard. Aptitude is not enough: How politicians is highly modeled by the media, and maybe not personality and behavior predict academic really representative to the real politician. Next to that, performance. Journal of Research in Personality, some politicians have spin doctors influencing their behavior 40(3):339–346, 2006. on twitter. This influences both the results of the survey as [7] P. T. Costa and R. R. McCrae. Neo Personality the results of the twitter account, making the error larger. Inventory-Revised (NEO PI-R). 
Psychological Analyzing normal people with a personal twitter account, Assessment Resources, 1992. and a survey with people who have a normal relationship [8] P. T. Costa and R. R. McCrae. Solid ground in the with the target person in real life requires more e↵ort, but wetlands of personality: A reply to block. 1995. might give better results because of a more representative [9] P. T. Costa and R. R. McCrae. The revised neo image on both the (personal) twitter account and the survey. personality inventory (neo-pi-r). The SAGE handbook of personality theory and assessment,2:179–198,2008. Another way would be to use a larger set of measures. Be- cause of the limited time available for this research, only one [10] G. E. Dallal. Comparing two measurement devices, personality characteristic-based measure was used per per- 2000. sonality trait. But such a measure can never give a complete [11] F. De Fruyt, R. R. McCrae, Z. Szirm´ak, and J. Nagy. view of the trait, where a trait is often identified using mul- The five-factor personality inventory as a measure of tiple characteristics. Expanding the set of measurements in the five-factor model belgian, american, and hungarian a way that multiple characteristics are combined to find all comparisons with the neo-pi-r. Assessment, personality traits might improve the results significantly. 11(3):207–215, 2004. [12] K. De Swert. Calculating inter-coder reliability in A final method to improve results would be to use a larger media content analysis using krippendor↵ˆa A˘Zs´ alpha. set of people to analyze. A larger sample would make it Center for Politics and Communication,2012. easier to find correlations between the gold standard and [13] D. Ediger, K. Jiang, J. Riedy, D. Bader, C. Corley, author profiling personalities, and reduce noise. R. Farber, W. N. Reynolds, et al. Massive social network analysis: Mining twitter for social good. 
In Next to improving the method in which personality is de- Parallel Processing (ICPP), 2010 39th International termined, the big five could also be improved. In the cur- Conference on, pages 583–593. IEEE, 2010. [14] G. Farnadi, S. Zoghbi, M.-F. Moens, and M. De Cock. [31] Twitter. Twitter api documentation. Recognising personality traits using facebook status [32] B. Verhoeven and W. Daelemans. Clips stylometry updates. Proc. of WCPR, pages 14–18, 2013. investigation (csi) corpus: A dutch corpus for the [15] L. R. Goldberg. The development of markers for the detection of age, gender, personality, sentiment and big-five factor structure. Psychological assessment, deception in text. In Proc. of the 9th Int. Conf. on 4(1):26, 1992. Language Resources and Evaluation,2014. [16] S. D. Gosling, P. J. Rentfrow, and W. B. Swann. A very brief measure of the big-five personality domains. Journal of Research in personality,37(6):504–528, 2003. [17] J. T. Hancock, C. Landrigan, and C. Silver. Expressing emotion in text-based communication. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 929–932. ACM, 2007. [18] B. F. Jeronimus, H. Riese, R. Sanderman, and J. Ormel. Mutual reinforcement between neuroticism and life experiences: A five-wave, 16-year study to test reciprocal causation. Journal of personality and social psychology,107(4):751,2014. [19] O. P. John and S. Srivastava. The big five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of personality: Theory and research,2(1999):102–138,1999. [20] E. Leach. The influence of cultural context on non-verbal communication in man. Non-verbal communication,pages315–349,1972. [21] S.-K. Lo. The nonverbal communication functions of emoticons in computer-mediated communication. CyberPsychology & Behavior,11(5):595–597,2008. [22] A. P. L´opez-Monroy, M. Montes-y G´omez, H. J. Escalante, and L. Villase˜nor-Pineda. Using intra-profile information for author profiling. 
[23] R. Martin, D. Watson, and C. K. Wan. A three-factor model of trait anger: Dimensions of a↵ect, behavior, and cognition. Journal of personality,68(5):869–897, 2000. [24] R. R. McCrae. Social consequences of experiential openness. Psychological bulletin,120(3):323,1996. [25] M. Ogot and G. E. Okudan. The five-factor model personality assessment for improved student design team performance. European Journal of Engineering Education,31(5):517–529,2006. [26] D. C. Pennington. Essential personality. Oxford University Press, 2003. [27] B. Rammstedt and O. P. John. Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german. Journal of research in Personality,41(1):203–212,2007. [28] F. Rangel, P. Rosso, M. Moshe Koppel, E. Stamatatos, and G. Inches. Overview of the author profiling task at pan 2013. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pages 352–365. CELCT, 2013. [29] G. Seidman. Self-presentation and belonging on facebook: How personality influences social media use and motivations. Personality and Individual Di↵erences,54(3):402–407,2013. [30] E. R. Thompson. Development and validation of an international english big-five mini-markers. Personality and Individual Di↵erences,45(6):542–548,2008. APPENDIX

A. BLAND-ALTMAN PLOTS FOR EVALUATION
For each personality trait, the plot on the left shows the resulting scores of the gold standard method (y-axis) against those of the Twitter analysis method (x-axis). The plot on the right shows the relation between the mean of the two methods and the difference between their results.

The plot on the left shows to what extent the results of the two methods agree with each other. When the dots follow the line in the center, the methods largely agree. The plot on the right can then be used to analyze the differences. When all dots appear on one side of the line, there is most likely a systematic error that can be resolved by a transformation of the data. When the dots are scattered all over the plot, the error is random and no such improvement is possible.
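The quantities behind the two plots can be computed in a few lines. The following is a minimal sketch, not the implementation used in this research: the function names and the example scores are hypothetical, and the 1.96 SD limits of agreement are the common Bland-Altman convention rather than something stated in the text. Survey scores are first rescaled from the -4 to 4 range to 0 to 1, as described in section 6.1.

```python
from statistics import mean, stdev


def rescale(value, lo=-4.0, hi=4.0):
    """Map a survey score from the range [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)


def bland_altman(gold, twitter):
    """Per-pair means and differences, plus bias and limits of agreement."""
    if len(gold) != len(twitter):
        raise ValueError("both methods need one score per politician")
    means = [(g + t) / 2 for g, t in zip(gold, twitter)]
    diffs = [g - t for g, t in zip(gold, twitter)]
    bias = mean(diffs)              # systematic offset between the methods
    spread = 1.96 * stdev(diffs)    # assumed convention, not from the text
    return {
        "means": means,             # x-axis of the right-hand plot
        "diffs": diffs,             # y-axis of the right-hand plot
        "bias": bias,
        "limits": (bias - spread, bias + spread),
    }


# Hypothetical scores for one trait and four politicians (not thesis data).
survey_raw = [1.0, -2.0, 0.0, 3.0]          # survey scale, -4 to 4
gold = [rescale(v) for v in survey_raw]     # now on a 0 to 1 scale
twitter = [0.5, 0.5, 0.25, 0.75]            # already on a 0 to 1 scale
result = bland_altman(gold, twitter)
print(result["bias"], result["limits"])
```

A bias far from zero with a narrow spread corresponds to the case where all dots lie on one side of the line. A bias near zero with a wide spread corresponds to dots scattered all over the plot.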

Figure 3: Agreeableness

Figure 4: Conscientiousness

Figure 5: Extraversion

Figure 6: Neuroticism

Figure 7: Openness to experience

B. DESCRIPTIVES FOR DATA CLEANING
A table with all relevant data for the data cleaning is shown below.

Table 4: Politician selection based on survey data spread and Twitter data availability. Bold items were used for further evaluation.

                         Standard deviation (SD) per trait                                               Twitter
Politician       Raters  Agreeableness  Extraversion  Openness  Conscientiousness  Neuroticism  Mean SD  # Following  Active
marijnissenl     1       0.000          0.000         0.000     0.000              0.000        0.000    139          1
SophieintVeld    1       0.000          0.000         0.000     0.000              0.000        0.000    1157         1
WassilaHachchi   1       0.000          0.000         0.000     0.000              0.000        0.000    2164         1
anouchkavm       2       0.707          0.707         0.707     2.121              0.000        0.848    51
bramvanojikgl    2       0.707          1.414         0.707     1.414              0.000        0.848    350          0
jcdejager        2       0.000          2.121         2.828     0.707              0.707        1.273    24           0
jolandesap       2       2.823          2.121         0.707     2.828              0.000        1.696    196          0
tunahankuzu      2       2.121          0.000         0.707     2.121              0.707        1.131    137          1
halbezijlstra    3       1.528          1.732         1.000     1.732              0.000        1.198    108          0
henkkrol         3       1.528          2.646         1.528     2.517              1.155        1.875    300          1
jaspervandijksp  3       1.528          1.000         0.577     1.000              1.000        1.021    325          1
keesvdstaaij     3       0.577          2.646         2.000     1.528              0.000        1.350    1387         1
MonaKeijzer      3       0.000          1.000         0.577     0.577              2.082        0.847    641          1
piadijkstra      3       4.041          2.517         2.309     2.517              1.000        2.477    996          1
hansspekman      4       3.416          1.500         1.258     1.915              0.957        1.809    418          1
jeaninehennis    4       1.258          1.258         1.500     2.062              2.082        1.632    3489         1
jesseklaver      4       1.258          1.708         1.500     1.732              1.000        1.440    497          0
lodewijka        4       1.258          0.816         0.000     1.732              1.000        0.961    1045         1
sybrandbuma      5       1.140          2.490         1.949     1.789              1.949        1.863    241          0
arieslob         6       1.633          1.211         1.673     1.049              1.602        1.434    613          1
borisham         6       0.753          1.033         1.378     1.835              1.366        1.273    4218         1
camieleurlings   6       1.506          2.168         1.975     1.329              1.751        1.746    76           0
diederiksamsom   6       0.816          1.722         1.366     2.422              2.041        1.673    476          0
emileroemer      6       1.835          1.366         1.033     1.966              1.049        1.450    540          1
erica terpstra   6       2.000          0.837         2.563     2.160              1.633        1.839    30
femkehalsema     6       0.753          1.169         1.549     1.366              1.366        1.241    660          0
jpbalkenende     6       2.041          2.608         1.835     2.530              2.229        2.249    21
mariannethieme   6       2.258          2.000         1.265     0.983              1.862        1.674    3088         1
MinPres          6       1.643          1.673         0.516     1.095              1.633        1.312    01
tofikdibi        6       1.366          0.753         1.033     1.789              1.789        1.346    445          1
fleuragemapvv    7       2.000          1.574         0.787     1.272              1.890        1.505    45           1
geertwilderspvv  8       0.707          1.408         2.200     2.200              1.669        1.637    01
j dijsselbloem   8       2.066          1.996         1.309     2.134              1.512        1.803    353          1
apechtold        9       2.128          1.658         1.716     1.323              2.028        1.771    450          1

Averages                 1.394          1.437         1.237     1.581              1.149        1.359
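The requirements from the data cleaning section that can be read directly from Table 4 (number of raters, mean SD, number of followed accounts, and activity) can be checked mechanically. The sketch below is illustrative only: the `Candidate` class and the `passes_table_checks` function are hypothetical names, and the requirement that the survey data approaches a normal distribution is not encoded in the table and is therefore left out. Passing these checks is thus necessary but not sufficient for selection.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    handle: str
    raters: int
    mean_sd: float
    following: int
    active: bool


def passes_table_checks(c, min_raters=3, max_mean_sd=1.8, min_following=45):
    """Apply only the requirements that are visible in Table 4.

    The normality requirement is deliberately omitted (assumption: it
    cannot be derived from the columns of the table).
    """
    return (
        c.raters >= min_raters
        and c.mean_sd < max_mean_sd
        and c.active
        and c.following >= min_following
    )


# A few rows copied from Table 4.
rows = [
    Candidate("jeaninehennis", 4, 1.632, 3489, True),
    Candidate("hansspekman", 4, 1.809, 418, True),    # mean SD too high
    Candidate("jesseklaver", 4, 1.440, 497, False),   # account not active
    Candidate("marijnissenl", 1, 0.000, 139, True),   # too few raters
    Candidate("fleuragemapvv", 7, 1.505, 45, True),   # exactly 45 followed
]
kept = [c.handle for c in rows if passes_table_checks(c)]
print(kept)
```

The thresholds are keyword arguments so that a stricter or looser selection can be tried without changing the function body.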