Eindhoven University of Technology

MASTER

A method to perform an automated background check on professional football players

Hendrickx, T.

Award date: 2016

School of Industrial Engineering
Information Systems Group

A Method to Perform an Automated Background Check on Professional Football Players

Master Thesis

T. Hendrickx 0920308

Supervisors: dr. R.J. de Almeida e Santos Nogueira (TU/e) prof.dr.ir. U. Kaymak (TU/e) ir. B.J. Aalbers (SciSports)

Final Version

Enschede, July 2016

Abstract

Recently, much attention has been paid to modeling the performance of players on the pitch: to see whether a specific player fits in a team, to detect promising players early in their career, and to predict match outcomes in order to profit from betting. However, information about football players off the pitch has not been investigated yet, even though many clubs show interest in a player background check. Performing such a background check manually is a time-consuming process, and it often considers text documents individually. Throughout this thesis, text mining techniques are used both to reduce the time it takes to perform a background check and to detect patterns in large amounts of textual data about football players. A first attempt is made to construct a personality profile in terms of the Big Five Personality Factor model, based on news articles about a football player. Three different methods, two using the bag-of-words approach and one using part-of-speech tagging, are tested and the results are evaluated. Furthermore, other text mining techniques, such as regular expressions and sentiment analysis, are applied to obtain background information about football players from Twitter.

The results of the Twitter analysis are directly applicable in practice. A list of the topics a player often talks about and of the people he talks to can save time. A word cloud and a visualization of the sentiment analysis provide insight into what the fans think about this player. The construction of a personality profile, on the other hand, requires further research. While one of the experiments showed promising results on the Openness to Experience personality factor, and another on the Neuroticism factor, more work is needed to improve the construction of a person’s personality profile from news articles.

Keywords: bag-of-words, part-of-speech tagging, personality, sentiment analysis, text mining


Executive Summary

This master thesis project is carried out at SciSports, a company that wants to rationalize the decision-making process in the football transfer market (in this thesis, the term football refers to soccer or association football). The main priority of SciSports is the performance of football players on the pitch. Data on player actions on the pitch is analyzed, transformed into information, and included in the player reports that are sold to professional football organizations. However, those football organizations also seem to be interested in background information about players, since SciSports receives many requests for player background checks. Currently, these player background checks are created manually, which is considered a very time-consuming process. Reducing the time required for this process would give the personnel of SciSports more time to focus on other activities within the company. Furthermore, the background checks currently contain only certain facts about a player, found directly in a news article or a social media post. Different articles are not yet combined to find patterns that reveal new information.

Therefore, this master thesis project should contribute to SciSports in two ways. The first is to make the process of creating background checks less time consuming by automatically filtering interesting text data. The second is to combine different text data sources in order to reveal new information that cannot be extracted from the text data by just reading it. To meet these challenges, an answer to the following research question is to be found:

How can text mining techniques help in performing a player background check on professional football players?

Selecting Sources and Gathering the Data

To determine which news sites and social media platforms are usable for this project, and how the data should be collected so that it can be analyzed, the first sub-question reads as follows:

Which data sources need to be selected in order to obtain the most meaningful information for a player background check and how should these data be gathered?

Since building one HTML parser to parse the content of different news websites is not feasible, eight news websites (four in Dutch and four in English) were selected and specific HTML parsers were built for those websites, so that a total of 58,140 news articles about 203 different players could be downloaded into a database. The websites were selected with the help of the person currently performing player background checks at SciSports. Furthermore, Twitter data on 10 different football players in both English and Dutch was downloaded. Using the Twitter API, more than 20,000 tweets were downloaded: both tweets posted by the players themselves and tweets posted by football fans about the players. The latter was accomplished by downloading the mentions of a football player’s Twitter account.
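The idea of a website-specific parser can be illustrated with a minimal sketch. It is not the code used in the thesis (the list of algorithms refers to the Newspaper and BeautifulSoup packages); it uses only Python’s standard `html.parser` module, and the site names, tag names, and class names in `SITE_RULES` are hypothetical placeholders.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules: which element holds the article body.
SITE_RULES = {
    "news-site-a.example": ("div", "article-body"),
    "news-site-b.example": ("section", "story"),
}


class ArticleBodyParser(HTMLParser):
    """Collect the text inside elements matching (tag, css_class)."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # > 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth > 0:
            # Track nested elements of the same tag so the matching
            # closing tag is the one that ends the article body.
            if tag == self.tag:
                self.depth += 1
        elif tag == self.tag:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_article(site, html):
    """Apply the rule registered for `site` and return the article text."""
    tag, css_class = SITE_RULES[site]
    parser = ArticleBodyParser(tag, css_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page, for illustration only.
sample_html = """
<html><body>
<div class="menu">Home | Sport</div>
<div class="article-body main"><p>The striker signed a new contract.</p>
<div class="quote">He is happy.</div></div>
<div class="footer">Cookies</div>
</body></html>
"""
article_text = extract_article("news-site-a.example", sample_html)
print(article_text)
```

Keeping the rules in a small table means that supporting another website only requires a new entry rather than a new parser, which is one plausible way to keep eight site-specific parsers manageable.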

Identifying Interesting Player Characteristics

After the data has been collected, it can be analyzed to discover information. Before this is done, it is required to specify which kinds of information are interesting to know from both an academic and a practical point of view. This is done by answering the following sub-question:

Which player characteristics are interesting to know and how can these characteristics be obtained from the data?

The first aspect included in the automated background check is the player’s personality. Of the frameworks that currently exist to map personality, the most widely used one, the Big Five Personality Factor model, is used throughout this master thesis. The five factors of this personality model are:

• Agreeableness versus Antagonism.

• Conscientiousness versus Lack of Direction.

• Extraversion versus Introversion.

• Neuroticism versus Emotional Stability.

• Openness to Experience versus Closedness to Experience.

SciSports indicated that they find the personality of a player very interesting, and from a theoretical point of view, creating an overview of a player’s personality based on news articles is interesting as well. Existing literature shows a clear relationship between written text and a person’s personality in terms of the Big Five, but these conclusions are based on people performing specific writing tasks or writing about themselves, not on text written about them by others. No theory is available on how to predict a person’s personality from text that is written about this person rather than by this person.

Furthermore, a relationship seems to exist between a person’s personality and his social media profiles. However, according to the existing literature, user-created content such as tweets and blog posts is irrelevant. Other parts of the social media profile, such as the number of friends and the photos posted, are more important in determining a person’s personality. Since user-created tweets are the data used in this project, it is, according to the literature, not possible to base a person’s personality on this data alone. Therefore, it is necessary to create a model that can predict a football player’s personality by analyzing news articles about him.

A list of 435 personality adjectives was found in the literature, which can be used to analyze the articles. It is assumed that when many news articles about a player are gathered, the adjectives used in those articles say something about this player’s personality. For example, if the word sympathetic appears often in news articles about a certain player, one might expect this player to score high on the Agreeableness personality factor.

Besides a player’s personality, three other types of information that can be obtained from the Twitter data are proposed. The first is what players generally talk about on Twitter, and to whom. Hashtags are commonly used on Twitter to indicate the topic of a tweet. Furthermore, mentions (an at sign followed by a username) are placed in tweets to notify other users. These mentions and hashtags are extracted to see to whom a player is talking on Twitter and what he is talking about. Regular expressions can be used to extract hashtags and mentions from a tweet.

Secondly, it is interesting to see which words people use to describe a player on Twitter. These words could give insight not only into the opinion about a player, but also into transfer rumors. A word cloud of the words used in tweets about a player can be generated to see which words people often use when they talk about a player.

Finally, the popularity of a football player is important. A text mining technique called sentiment analysis is widely used by companies to evaluate customer feedback. The technique automatically analyzes a piece of text and indicates whether the content is positive, negative, or neutral. In this project, the technique can be used to visualize the popularity of a football player.
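The extraction of hashtags and mentions with regular expressions might look roughly like the sketch below. The two patterns are simplifying assumptions (a marker followed by word characters), the helper name `top_tags` is hypothetical, and the tweets are made up for illustration.

```python
import re
from collections import Counter

# Assumed patterns: the marker (# or @) followed by one or more word characters.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def top_tags(tweets, pattern, n=20):
    """Count case-insensitive matches of `pattern` over all tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts.most_common(n)


# Made-up tweets, for illustration only.
sample_tweets = [
    "Great win today! #matchday #football @teammate_one",
    "Thanks for the support @teammate_one #matchday",
    "Recovery session done. #football",
]

top_hashtags = top_tags(sample_tweets, HASHTAG_RE)
top_mentions = top_tags(sample_tweets, MENTION_RE)
print(top_hashtags)
print(top_mentions)
```

With `n=20` the helper returns at most the twenty most frequent items, in line with the top 20 lists mentioned in this summary. Patterns this simple would also match an at sign inside an e-mail address, so real tweets may need stricter expressions.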

Transforming Text Data into Valuable Information

After the data has been gathered and it is determined which player characteristics should be included in the player background check, the knowledge extraction from the data can start. While doing so, the third sub-question is answered:

Which text and data mining techniques are the most suitable to predict the determined player characteristics?

In order to answer this question, the personality extraction from the news articles is done first. A generally accepted text mining process is applied to do so. The first step is to establish the corpus, the second step is to create a term-document matrix, and the final step is to extract knowledge from this matrix. Establishing the corpus means getting all the text data into the same format. This was largely done when the textual data was saved in the database. Constructing a term-document matrix means putting all the words that occur in any of the documents on one axis, and all documents, in this case news articles, on the other. This matrix is then filled with word counts, so that each cell shows how often a word (term) occurs in a document (article).

Before the actual analysis can start, the data has to be cleaned. Word stemming is an important step in the data cleaning process in this project. It is a widely accepted form of data cleaning within the text mining domain that reduces words to their stems, so that words with the same meaning are represented by the same word. For example, the words cat and cats are different, but they refer to the same thing, except that the latter is plural. Stemming reduces the word cats to cat, so that the singular and the plural are treated the same. After stemming all the words in news articles about a player, a word count on the 435 personality adjectives was performed to construct a personality profile of this player. However, applying this cleaning technique is not without problems. For example, one of the personality adjectives, playful, became the word play after stemming. The words playing and plays, which are often used in articles about football players playing matches, are also reduced to the stem play.
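The term-document matrix and the stemming problem described here can be made concrete with a small sketch. It is written in Python for illustration and is not the R code used in the thesis; `toy_stem` is a deliberately crude stand-in for a real stemmer, and the two example articles are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def toy_stem(word):
    """Crude suffix stripper, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ful", "ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_document_matrix(documents, stem=False):
    """Return the sorted terms and one row of counts per term (one column per document)."""
    counts = []
    for doc in documents:
        tokens = tokenize(doc)
        if stem:
            tokens = [toy_stem(t) for t in tokens]
        counts.append(Counter(tokens))
    terms = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in terms]
    return terms, matrix


def total_count(terms, matrix, term):
    """Sum a term's counts over all documents (0 if the term is absent)."""
    if term not in terms:
        return 0
    return sum(matrix[terms.index(term)])


# Made-up articles, for illustration only.
articles = [
    "The striker plays every week and is playing well.",
    "A playful striker who plays with joy.",
]

exact = total_count(*term_document_matrix(articles), "playful")
stemmed = total_count(*term_document_matrix(articles, stem=True), toy_stem("playful"))
print(exact, stemmed)
```

In this toy corpus the exact word playful occurs once, while its stem play is counted four times because plays and playing collapse onto the same stem. This is the kind of inflated count that motivates counting exact adjectives instead.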
Stemming therefore resulted in an extremely high word count for the personality adjective playful, because its stem appeared so often, while most of these occurrences were actually words other than playful. For this reason, a second experiment was carried out, using the bag-of-words approach without word stemming. In this second experiment, a personality profile of a player was constructed by counting the exact occurrences of personality adjectives, not just their stems.

Both methods have their advantages and disadvantages, but they have one disadvantage in common. Since both approaches treat the text as a bag of words, they are essentially word counts that ignore the function of a word within a sentence. However, the personality words that are counted are personality adjectives. Therefore, a third experiment was carried out that first performs part-of-speech tagging on all words, after which only the adjectives are counted. A personality profile of a player is then determined based only on the adjectives in news articles about him. The results of all three experiments for the football player Lionel Messi, together with the results of the questionnaire that is discussed later, are visualized in Figure 1.

For the Twitter data, three analyses were performed. Using regular expressions, the 20 most frequently used hashtags and mentions were extracted for each of the players. Furthermore, a word cloud was created from all tweets that mentioned one particular player. In one of the word clouds, it was very clear which transfer was rumored on Twitter about that player: the football player Zlatan Ibrahimović was strongly connected to the football club Manchester United. These rumors were confirmed a few weeks later. Finally, a sentiment analysis on each player was performed to visualize their popularity. An example of both a word cloud and a visualization of the sentiment analysis is provided in Figure 2. Note that in the sentiment analysis a score of 1 implies an extremely positive sentiment, whereas -1 is very negative. The closer the subjectivity is to 1, the more subjective the tweet was.

Figure 1: Determined personality profiles of Lionel Messi.

Figure 2: Twitter analysis on Zlatan Ibrahimović, containing a sentiment analysis on the left and a word cloud on the right.

Validating the Personality Prediction

After analyzing the retrieved text data, it is necessary to check whether the results provide an overview of players that is good enough to add to a player background check. Therefore, the final research question that is answered reads as follows:

How can the final model be validated?

For this part of the project, it was decided to focus completely on the model predicting a player’s personality, because off-the-shelf techniques were used to perform the Twitter analysis. Two steps were taken to analyze the performance of the created models. First, one of the articles was analyzed manually to show some weaknesses of the models. Second, a questionnaire was held to check which of the three models is the most accurate.

In the manual analysis of one of the articles, it was found that negations in front of adjectives were neglected, and that adjectives sometimes refer to things other than the player of interest. Since this is the analysis of only one article, these weaknesses might balance each other out when hundreds of articles about one player are analyzed. To check whether this is indeed the case, a questionnaire containing the Big Five Inventory, a widely accepted questionnaire to test for the Big Five personality traits, was administered to football fans for four different players. The results of this questionnaire were compared to the results of the three experiments that were conducted. One example of such a comparison is visible in Figure 1.

Although the results of the models are evaluated for only a very small part of the players in the data set, the performance of the models seems to be quite low. In the comparison between the model outcomes and the results from the questionnaire, there is often one model that comes close to the actual personality perception of the player, but no model consistently predicts one personality trait well. The only two models that are worth further investigation in future research are the bag-of-words approach with word stemming to predict the personality trait openness to experience, and the part-of-speech tagging approach to predict the personality trait emotional stability, since their error rates were quite low for all four players.

Explanations for the lack of performance of the models could be the ones that arose during the manual analysis of one article. First, negations in front of adjectives are neglected, which results in assigning personality traits to players who should be assigned the opposite. Furthermore, adjectives often refer to things other than the football player that is investigated, since not all words in an article are about one player. Therefore, future research could include named entity recognition to check whether an adjective actually refers to the player of interest. Finally, the word list of adjectives might be inaccurate, especially the Dutch adjectives, which were translated manually. Therefore, the models should also be tested with other word lists, such as the Linguistic Inquiry and Word Count, which is often used in personality research.
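The negation weakness can be illustrated with a small sketch. The word lists below are tiny hypothetical stand-ins for the 435-adjective list, and checking only the word directly in front of an adjective is an assumed simplification, not a method evaluated in the thesis.

```python
import re

# Tiny illustrative word lists (assumptions, not the 435-adjective list).
PERSONALITY_ADJECTIVES = {"friendly", "calm", "aggressive"}
NEGATIONS = {"not", "never", "no"}


def count_adjectives(text, handle_negation=False):
    """Count listed adjectives; optionally track directly negated ones separately."""
    tokens = re.findall(r"[a-z']+", text.lower())
    plain, negated = {}, {}
    for i, token in enumerate(tokens):
        if token not in PERSONALITY_ADJECTIVES:
            continue
        # Only the word immediately before the adjective is inspected.
        is_negated = handle_negation and i > 0 and tokens[i - 1] in NEGATIONS
        target = negated if is_negated else plain
        target[token] = target.get(token, 0) + 1
    return plain, negated


# Made-up sentence, for illustration only.
sentence = "He is not friendly on the pitch, but he is calm and never aggressive at home."

naive_counts, _ = count_adjectives(sentence)
plain_counts, negated_counts = count_adjectives(sentence, handle_negation=True)
print(naive_counts)
print(plain_counts, negated_counts)
```

A plain word count credits the player with friendly and aggressive even though both are negated in the sentence. The one-word look-back shown here would still miss constructions such as "not very friendly", and it does not address adjectives that describe someone other than the player of interest.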

Conclusions

The personality of a player in terms of the Big Five appeared to be a very interesting aspect that could be added to a player background check. An attempt was made to construct a personality overview by combining the text data in different articles about a player. While the validation for four players showed promising results for the bag-of-words approach with word stemming in predicting Openness to Experience, and for part-of-speech tagging in predicting the Neuroticism trait, the results of the three experiments mainly show that the relationship between personality and text written in news articles requires further investigation. Named entity recognition and rhetorical structure theory should be considered to further improve this personality check.

In the Twitter part of the analysis, regular expressions were used to extract hashtags and mentions from tweets posted by football players. This is relevant to add to a player background check, because SciSports currently searches manually for which players communicate a lot with each other and what they are talking about. A list of the most used mentions and hashtags can save a lot of time here.

Two other parts of the automated player background check are the word cloud and the sentiment analysis, which give an overview of a player’s popularity; both are based on tweets in which a player is mentioned. The word cloud can be added to a player background check, for example to visualize transfer rumors, whereas the visualization of the sentiment analysis gives an overview of the number of retweets and likes of positive and negative tweets about a player.

While most results seem promising, a note is to be made on the practical usefulness for SciSports. The automated analysis only yields proper results when a lot of text data is available for one player. For the personality part of the research, it is problematic when only a few articles about a player are available. In that case the importance scores of personality adjectives become extremely high, and the player deviates by more than two standard deviations on almost every personality trait. The Twitter analysis of less famous players likewise gives weaker results and less interesting visualizations to add to a background check. This makes the models less useful in practice for SciSports, since the company mainly focuses on less well-known football players.


Preface

This master thesis is the result of a six-month internship at SciSports, carried out to complete the Operations Management & Logistics program at the Eindhoven University of Technology. Conducting my master thesis project at SciSports has given me the opportunity to combine the knowledge I gained during my studies with one of my greatest hobbies: football. While I faced many challenges during the project, such as learning to program and getting familiar with text mining, the results of the thesis provide useful insights from both a theoretical and a practical point of view. Of course, more research is required in order to construct a better personality profile based on news articles, but the first step has been taken.

Many people helped me in completing this project. First of all, I would like to thank my first supervisor, Rui de Almeida, for his support and valuable feedback. You have given me the opportunity to pick my own assignment and have always been very supportive and helpful. I would also like to thank my second supervisor, Uzay Kaymak, who gave me a lot of feedback to further improve my thesis. Thirdly, many thanks go out to Bart Aalbers, who has been my supervisor at SciSports, for his valuable input. You helped me a lot, both during and outside our weekly meetings.

Furthermore, I want to thank all the other people at SciSports, not only for their help, but also for their hospitality. I was provided with many useful insights regarding my master thesis, and the developers helped me a lot in learning to program. Outside office hours, I was given the opportunity to play football in a real stadium and to watch the Champions League final in the cinema. So I want to thank SciSports not only for an instructive half-year, but also for a very entertaining one.

Teun Hendrickx July 2016

Contents

Contents
List of Figures
List of Tables
List of Algorithms

1 Introduction
  1.1 Problem Description
  1.2 Research Questions
  1.3 Scope and Methodology
  1.4 Thesis Contributions
  1.5 Thesis Outline

2 Preliminaries
  2.1 Text Mining
    2.1.1 Natural Language Processing
    2.1.2 Text Mining Process
  2.2 Personality

3 Selecting Sources and Gathering the Data
  3.1 News Articles
    3.1.1 Universal HTML Parsing
    3.1.2 Website Specific HTML Parsing
  3.2 Twitter Data

4 Identifying Interesting Player Characteristics
  4.1 Personality and Text
    4.1.1 Personality in Self-Narratives
    4.1.2 Personality in Social Media
    4.1.3 Case Specific Personality Prediction
  4.2 Information from Twitter Data
    4.2.1 Regular Expressions
    4.2.2 Word Cloud
    4.2.3 Sentiment Analysis

5 Transforming Text Data into Valuable Information
  5.1 Personality Extraction Using a Bag-of-Words
    5.1.1 Data Preprocessing
    5.1.2 Personality Extraction
    5.1.3 Personality Extraction without Word Stemming
  5.2 Personality Extraction Using Part-of-Speech Tagging
  5.3 Twitter Analysis
    5.3.1 Regular Expressions
    5.3.2 Word Cloud
    5.3.3 Sentiment Analysis

6 Validating the Personality Prediction
  6.1 Manual Analysis of a News Article
  6.2 Validation via Questionnaire

7 Conclusions
  7.1 Limitations and Future Research

Bibliography

Appendices

A Universal HTML Parsing
  A.1 Example 1
  A.2 Example 2

B Sample of Twitter Data
  B.1 English Tweets
  B.2 Dutch Tweets

C Removed Stop Words

D Factor Loadings of Personality Adjectives

E Translation of Adjectives

F Removed Adjectives
  F.1 Unused Adjectives
    F.1.1 Bag-of-Words Approach with Word Stemming
    F.1.2 Bag-of-Words Approach without Word Stemming
    F.1.3 Part-of-Speech Tagging
  F.2 Too Frequently Used Adjectives
    F.2.1 Bag-of-Words Approach with Word Stemming
    F.2.2 Bag-of-Words Approach without Word Stemming
    F.2.3 Part-of-Speech Tagging

G Personality Profiles before Scaling
  G.1 Bag-of-Words Approach with Word Stemming
  G.2 Bag-of-Words Approach without Word Stemming
  G.3 Part-of-Speech Tagging

H Personality Profiles after Scaling
  H.1 Bag-of-Words Approach with Word Stemming
  H.2 Bag-of-Words Approach without Word Stemming
  H.3 Part-of-Speech Tagging

I Twitter Analysis Results
  I.1 Regular Expressions
  I.2 Word Clouds
  I.3 Sentiment Analyses

J News Article for Manual Analysis

K Big Five Inventory in Dutch

L Questionnaire Results

List of Figures

1 Determined personality profiles of Lionel Messi
2 Twitter analysis on Zlatan Ibrahimović

3.1 Example of a web page from .com containing a news article
3.2 Example tweet placed by footballer Victor Wanyama

5.1 Personalities determined by the bag-of-words approach with word stemming
5.2 Overview of the interpretation of the radar charts
5.3 Personalities determined by the bag-of-words approach without word stemming
5.4 Personalities determined using part-of-speech tagging
5.5 Example of a word cloud, constructed from Zlatan Ibrahimović mentions
5.6 Comparison of sentiment visualization before and after log transformation
5.7 Sentiment analysis visualization of 32 Mark-Jan Fledderus mentions

6.1 Results of experiments and questionnaire for two players in the Netherlands
6.2 Results of experiments and questionnaire for two players outside the Netherlands

List of Tables

2.1 The Big Five personality factors

3.1 A sample of the database in which the news articles are stored
3.2 Overview of the news articles that are retrieved from the different websites
3.3 An overview of the players about whom Twitter data was obtained

4.1 Linguistic correlates of each Big Five trait

5.1 An example of a term-document matrix
5.2 The data cleaning process
5.3 A term-document matrix after data preprocessing
5.4 The frequency of personality adjectives in the example corpus
5.5 Example of a matrix containing relative frequencies of personality adjectives
5.6 The ten most frequently used personality adjective stems
5.7 Example of a matrix containing the personality scores of one player
5.8 Twenty most used hashtags and mentions of player Zlatan Ibrahimović

6.1 Absolute differences between the results of the experiments and the questionnaire

List of Algorithms

3.1 Retrieving links from Google and NewsJS
3.2 Downloading articles using the Newspaper package in Python
3.3 Downloading articles using the BeautifulSoup package in Python
3.4 Website specific HTML parsing
3.5 Retrieving basic Twitter information using Tweepy
3.6 Retrieving the content of tweets a player has placed using Tweepy
3.7 Retrieving the content of tweets about a player using Tweepy

5.1 Loading the articles properly encoded from the database into R

Chapter 1

Introduction

All over the world, people like to play and watch the game of football (throughout this thesis, the term football refers to association football or soccer). Because the sport has so many fans all over the world, a lot of money is involved in it. Big clubs pay millions of euros to attract the best players. Although large amounts of money are spent in the football business, the decision-making process is not always very rational. There are several examples of irrational behavior by football clubs in the scouting process.

In their book Why England Lose & Other Curious Football Phenomena Explained, Kuper & Szymanski (2009) give many examples of inefficiencies in the football transfer market. One of them is that it is easier to sell a bad footballer from Brazil than a brilliant player from Mexico, because Brazil is always associated with great football. Another example is that football scouts unknowingly tend to scout blond players (except in Scandinavia), because there are so few of them. When there are 22 players on the pitch and one of them has blond hair, this player probably attracts the attention of the scout because he looks different.

This thesis is conducted at SciSports, a company that wants to improve the decision-making process in football. Its core business is to rationalize the way in which clubs buy new players by making scientific progress accessible and available to any professional football organization (SciSports, 2016). This is done by using the large amount of data that third parties gather during football matches. SciSports processes this data to gain insight into the performance of teams and players. This information is then sold to interested football organizations.

The main priority of SciSports is the performance of football players on the pitch. Data on player actions on the pitch, such as passes, shots, and headers, is analyzed and included in the player reports. However, how a player behaves off the pitch is also very interesting for clubs that are potentially interested in buying the player. A player who is known for his professional attitude is probably more interesting to a club than a player who is spotted in night clubs all the time. Clubs seem to be interested in this kind of information, since SciSports receives many requests for player background checks.

Currently, these player background checks are carried out manually, which is considered a very time-consuming process. People performing the background checks have to search through many different sources, such as news websites and social media profiles, to gather information about the player of interest. Reducing the time required for this process would give the personnel of SciSports more time to focus on other activities within the company.

One of the goals of this research is to reduce the time required to perform player background checks by using text mining techniques. Since all the information required to compose a player background check is gathered on the internet, this information could also be downloaded and analyzed automatically. Text mining is a rapidly growing field in which information is extracted from unstructured data sets (Nagarkar & Kumbhar, 2015). So far, SciSports has been very successful in analyzing structured data, but it has not yet processed unstructured data automatically. This thesis is a first attempt to change this by including text mining techniques in constructing a player background check.

The remainder of this chapter is organized as follows: First the problem is described. After this problem description, the research question and its sub-questions are formulated, such that a structured approach can be used in applying text mining in this specific case. After that, the scope and methodology of the thesis are briefly discussed, followed by the thesis contributions and finally the thesis outline.

1.1 Problem Description

As mentioned in the general introduction, SciSports currently performs background checks on football players manually. Most player reports contain information about the performance of the player on the pitch, but sometimes a club specifically asks for a background check on its player of interest. This background check contains remarkable facts about the player, both positive and negative. For example, it mentions when a player was involved in a car accident or a fight, or when he has misbehaved on social media. On the other hand, it also mentions when a player appears to have a good relationship with the fans, or when he has set up a foundation that helps other people. Furthermore, transfer rumors and some history of the player’s youth are included in the background check.

At the moment of writing, one employee of SciSports is responsible for completing the background checks. This person not only creates background checks, but is also the face of SciSports to its customers. The time-consuming background checks he has to compose on the side do not make his job easier. While performing background checks forms only a small part of his job description, it can take up to four hours to create one. So one problem is that performing a player background check is too time-consuming. Reducing the time required for this process would allow this employee to focus on his primary tasks again. One aim of this master thesis project is to achieve just that.

Secondly, the background checks currently only contain facts about a player that are found directly in a news article or a social media post. Examples are the clubs the player played for when he was young, and positive and negative facts about the player in news articles. However, further analyzing the text data and trying to discover patterns within it are not yet part of the background check.
In other words, SciSports currently looks at the content of individual news articles and social media posts, while all these data sources together might also reveal new information about a player, such as his personality. This is another aspect in which this project could add value to the player background checks of SciSports.

Therefore, this master thesis project should contribute to SciSports in two different ways. The first is to make the process of creating background checks less time-consuming by automatically filtering interesting text data. The second is to combine different text data sources in order to reveal new information that cannot be extracted from the text data by just reading individual articles or social media posts.

1.2 Research Questions

This section contains the definition and discussion of the research questions that need to be answered in order to solve the problem described above. First of all, a main research question is defined to capture the overall goal of the research. This goal is to develop a model for SciSports that can perform a background check on professional football players automatically. While doing so, scientific rigor and practical relevance need to be balanced. This implies that the deliverables are useful in practice for SciSports, while a scientific approach is used to develop these deliverables. After the definition of the main research question, four sub-questions are defined. The answers to those four sub-questions together should provide the answer to the main research question.

As was already mentioned in the problem description, the goal is to solve the problem by applying text mining techniques. These techniques should reduce the time that is currently required to perform a player background check and combine different text data to reveal new information. Therefore, the main research question reads as follows:

Research Question How can text mining techniques help in performing a player background check on professional football players?

The first step in answering this question is to gather the data. Since text mining techniques are going to be used, enough text data on professional football players should be obtained. Because there are many sources available, a decision needs to be made about which sources will be used in this research and how the data is going to be obtained from those sources. Therefore, the first sub-question reads as follows:

Sub-question 1 Which data sources need to be selected in order to obtain the most meaningful information for a player background check and how should this data be gathered?

In order to answer this first sub-question, information retrieval is required. Information retrieval only implies finding the documents that contain answers to a question, not finding the answers themselves (Hearst, 1997). In the context of this research, it implies finding the documents that could contain relevant information about football players. This information can then be used later to generate a player background check. To answer the first sub-question, some of the sources that SciSports already uses can be included.

After the data has been retrieved, it has to be transformed into information. The ultimate goal is to perform this automatically. In order to do so, it first needs to be determined which information should be extracted. This implies knowing which player characteristics are important to obtain, before they can actually be obtained. Therefore, the second sub-question is:

Sub-question 2 Which player characteristics are interesting to know and how can these characteristics be obtained from the data?

This second sub-question can partly be answered by looking at the player background checks SciSports has performed in the past, and by talking to the people who are responsible for these background checks. Furthermore, there is going to be some focus on the personality of a player in terms of the Big Five (Costa & McCrae, 1992). Since the goal of the research is to develop a model that can be used in practice by SciSports, it is important to include player characteristics that are interesting for this company.

Once it is known which information is interesting to obtain from the text data, it is necessary to see how those characteristics can be extracted using text mining techniques. Thus, the appropriate techniques need to be selected and experiments are to be carried out. Therefore, the third sub-question is:

Sub-question 3 Which text and data mining techniques are the most suitable to predict the determined player characteristics?

Finally, it is important to validate the model. Validation could be difficult in this research, since the information is extracted from text, which can sometimes be interpreted in different ways. Another aspect that could complicate the validation is that some information is difficult to check. For example, when a player has been in a car accident, this can be mentioned in a news article about him, but there is no structured document available that indicates, with a dummy variable of 0 or 1, whether this player has indeed been in a car accident. So the final sub-question of this master thesis project is:

Sub-question 4 How can the final model be validated?

Answering these four sub-questions results in an answer to the main research question. Every sub-question is going to be discussed in a separate chapter, and a conclusion is provided to combine the answers and answer the main research question.

1.3 Scope and Methodology

While SciSports serves customers in several European countries, most of them are located in the Netherlands. Because it is not feasible to include all European countries in the project, the thesis mainly focuses on football players playing in the Netherlands. If the project provides useful results, the model could be extended to other countries and competitions. However, not only the location of the competition in which the players are playing is important. Another key factor is the language in which the data sources are available. News articles about players who play in the Netherlands are obviously written in Dutch. More research on text mining techniques has been conducted for the English language. Therefore, English articles about well-known players who play in other competitions are also gathered, so that the applied methodology can be compared for Dutch and English.

In this project, two data sources are going to be used. As was mentioned before, SciSports currently uses both news articles and social media posts to construct a player background check. The same sources are to be used in an automated background check.

First, news articles about different football players are gathered from eight different news websites. The collection of news articles and the complications that arose during this process are described, as are the steps that were taken to collect the news articles using a Python script. After the data collection, a first attempt is made to determine a player’s personality, in terms of the Big Five personality factors, based on news articles that are written about him. A list of 435 adjectives that are related to the Big Five personality traits is used to do so. Three experiments are conducted to count the number of personality adjectives that occur in news articles about a player, and the results of those experiments are compared to each other and validated via a questionnaire.
Secondly, Twitter data is used to obtain more background information about football players. Both tweets posted by eight different football players and tweets by fans about those eight players are gathered using the Twitter API. Regular expressions are applied to the tweets posted by the players themselves to extract each player’s most frequently used hashtags and mentions. This provides insight into the topics a player is talking about, and the people to whom he is talking. The tweets about a player are used to map what football fans think about this player. A word cloud of the tweets about a player is created to provide insight into interesting words that are used by the fans. Such a word cloud can, for example, provide insight into transfer rumors concerning a player. To further investigate what the fans think about a player, the player’s popularity is measured by performing a sentiment analysis on the tweets about him.
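
To illustrate how regular expressions can extract hashtags and mentions, a minimal Python sketch is given here. It is not the script used in this thesis: the example tweets, the account names, the two patterns and the helper most_common_tokens are all assumptions made for illustration, and real input would come from the Twitter API.

```python
import re
from collections import Counter

# Hypothetical example tweets; real input would come from the Twitter API.
tweets = [
    "Great win today! Thanks to the fans #matchday #teamwork @club_official",
    "Recovery session done #teamwork @teammate",
    "Proud to play for this club @club_official #matchday",
]

# Assumption: a hashtag or mention is '#' or '@' followed by word characters.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def most_common_tokens(pattern, texts, n=2):
    """Count every match of `pattern` over all texts, ignoring case."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts.most_common(n)


top_hashtags = most_common_tokens(HASHTAG, tweets)
top_mentions = most_common_tokens(MENTION, tweets)
print(top_hashtags)
print(top_mentions)
```

These simple patterns would also match an e-mail address such as name@example.com as a mention, so a stricter pattern may be needed on real data.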

1.4 Thesis Contributions

This thesis contributes to the existing literature in two different ways. First, it contains a first attempt to construct a personality profile of a person based on text that is not written by this person himself. Literature shows a relationship between text and personality, but only when the text is written by the person whose personality is to be predicted. Constructing a personality profile based on news articles about a person is something that has not been done before. In this thesis, the focus is on football players, but the applied methodology can also be used on other persons who often appear in news articles.

Second, existing text mining techniques are applied to a new domain. The techniques that were used on the Twitter data have proven to be useful in several environments, but have not yet been used to obtain background information on football players. Sentiment analysis in particular has been used a lot by companies to analyze customers’ opinions about a product, but not to map a football player’s popularity.

1.5 Thesis Outline

The remainder of this thesis is organized as follows. Chapter 2 contains some preliminaries on text mining. Several basics, techniques, and challenges of text mining are discussed. Chapter 2 also contains some preliminaries on personality, since predicting a player’s personality is one of the aspects of this project.

In Chapter 3, the first sub-question is answered. It is described how an attempt was made to automatically retrieve news articles from any website, and why it is impossible to do so. After that, the way in which eight website-specific HTML parsers were built is described. Furthermore, this chapter describes how Twitter data was obtained using the Twitter API.

Chapter 4 provides an answer to the second sub-question. The relationship between the use of text and a person’s personality is examined, and it is explained which information from the Twitter data could be added to a player background check.

After that, the determined player characteristics are extracted from the gathered data in Chapter 5. Three experiments are carried out to see which text mining techniques could predict a player’s personality from news articles. The methodologies and results of these experiments are presented. Besides that, three different techniques to analyze and visualize the Twitter data are proposed, which could be added to a player background check.

The performance of the three experiments to predict a player’s personality is analyzed in Chapter 6. The results of the three experiments are compared to each other and to the results of a questionnaire to see which model provides the best prediction of a player’s personality.

Finally, Chapter 7 contains the conclusions of the master thesis project. The results are discussed, answers to the research questions are provided, and suggestions for future research are made.

Chapter 2

Preliminaries

This chapter introduces the topics that are used throughout this thesis. Existing literature is analyzed and important definitions are stated. The most important concept that is explained in this chapter is text mining. In Section 2.1, some basics of text mining are provided, such as the definition of text mining, some important concepts, and the general text mining process. After that, in Section 2.2, some important aspects of personality theory are discussed. Since personality is one of the aspects that is investigated during the project, the most important foundations and assumptions concerning this topic are introduced here.

2.1 Text Mining

Several studies have indicated the large amount of information stored in unstructured data sources, such as text documents. In total, 80 to 90 percent of all corporate data is stored in an unstructured form (Tan, 1999; Hotho, Nürnberger & Paaß, 2005). Since such a large amount of information is stored in an unstructured way, companies need to effectively tap into their text data sources using text mining techniques in order to make better decisions (Turban, Sharda & Delen, 2010). Text mining, also referred to as text data mining or knowledge discovery in textual databases (Hearst, 1997; Feldman & Dagan, 1995; Turban et al., 2010), generally refers to the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents and can be considered as an extension of data mining or knowledge discovery from structured databases (Tan, 1999). Data mining and text mining can be defined as follows:

“Data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases, where the data are organized in records structured by categorical, ordinal, or continuous variables. Text mining is the same as data mining in that it has the same purpose and uses the same processes, but with text mining the input to the process is a collection of unstructured (or less structured) data files such as Word documents, PDF files, text excerpts, XML files, and so on.” (Turban et al., 2010)

The latter immediately indicates why text mining is more complex than data mining techniques that make use of structured data. While text is very commonly used to store information, it requires interpretation to correctly retrieve this stored information. While this interpretation is very easy for human beings, the usual logic-based programming paradigm has great difficulties with it, since the information stored in text is often fuzzy and ambiguous (Hotho et al., 2005). Both Turban et al. (2010) and Hotho et al. (2005) consider text mining similar to data mining, but with the necessity to pre-process the unstructured data accordingly. Text mining techniques attempt to overcome the difficulties in processing unstructured data from text by using methods that are able to cope with a large number of words and structures in natural language on the one hand, while handling vagueness, uncertainty and fuzziness on the other hand (Hotho et al., 2005).

Text mining can be used for several purposes. A good example is when a company wants to process customer feedback. Of course, companies can ask their customers to fill in a structured questionnaire, which is easy to process, but one will receive better feedback if the customer is not bound to answer options that have been generated beforehand. If customers are free to provide feedback in their own words, they can really say what they want to say. However, this approach can result in large amounts of unstructured data. Text mining can then be used to process all the customer feedback. This is just one example in which text mining techniques could be very helpful. There are more general applications where text mining is often used. In their book, Turban et al. (2010) provide a list of the most popular application areas of text mining:

Information extraction: “Information Extraction is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts” (Cowie & Lehnert, 1996). Part-of-speech tagging, named entity recognition, named entity disambiguation and relation extraction, which are discussed in Section 2.1.1, are the challenges faced when one wants to properly extract information from text.

Topic tracking: Topic tracking is about recommending new documents to a user, based on the user’s profile and interest in previous documents (Gupta & Lehal, 2009; Turban et al., 2010).

Summarization: This is a term that speaks for itself. Summarization is the application of text mining to summarize long documents. These summaries can inform potential readers whether the whole document might be interesting or not.

Categorization: Gupta & Lehal (2009) provide the following description of categorization: “Categorization involves identifying the main themes of a document by placing the document into a pre-defined set of topics. When categorizing a document, a computer program will often treat the document as a bag of words. It does not attempt to process the actual information as information extraction does. Rather, categorization only counts words that appear and, from the counts, identifies the main topics that the document covers. Categorization often relies on a thesaurus for which topics are predefined, and relationships are identified by looking for broad terms, narrower terms, synonyms, and related terms. Categorization tools normally have a method for ranking the documents in order of which documents have the most content on a particular topic.”

Clustering: In order to categorize documents, the categories need to be determined first. When one wants to group a number of documents without the use of predefined classes, clustering can be used (Larsen & Aone, 1999). In clustering, the categories are not known yet. Therefore, unsupervised learning is used to cluster documents.

Concept linking: Concept linking is about connecting related documents by identifying their shared concepts, to help users find information they probably would not have found using traditional searching methods (Fan, Wallace, Rich & Zhang, 2006; Gupta & Lehal, 2009; Turban et al., 2010).

Question answering: As the term already suggests, question answering is about giving answers to a question asked by the user. A question answering system accepts everyday language as input, and provides a concise answer (Lin & Katz, 2003).

The most basic text mining approach, in which simpler representations that describe limited aspects of the textual information are used, is called the bag-of-words approach (Turban et al., 2010; Collobert, Weston, Bottou, Karlen, Kavukcuoglu & Kuksa, 2011). Bag-of-words refers to putting all the words in a document into one bag without looking at grammar or the order in which the words are placed. So this approach basically counts how many times words appear in a document, disregarding the context of the words.
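
As an illustration, the following minimal Python sketch builds a bag of words for two short sentences. The function bag_of_words, its simple tokenizer and the example sentences are assumptions made for this illustration only. Both sentences contain the same words in a different order, so their bags are identical even though their meanings differ.

```python
import re
from collections import Counter


def bag_of_words(document):
    """Return word counts for a document, ignoring word order and grammar."""
    # Simplification: lowercase the text and keep runs of the letters a-z only.
    words = re.findall(r"[a-z]+", document.lower())
    return Counter(words)


doc_a = "The striker praised the coach."
doc_b = "The coach praised the striker."

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a)
print(bag_a == bag_b)  # word order is lost, so both bags are equal
```

The tokenizer only keeps the letters a to z, so it would have to be adapted for text that contains accented characters, as Dutch text sometimes does.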

However, it is not only the choice of words that allows human beings to understand each other. The grammar and the order of words are also important factors. Interpreting grammar and structure is fairly easy for human beings, but much more difficult for computers. This is where natural language processing comes into play. Natural language processing (NLP) studies how computers can process human natural language (Turban et al., 2010). This is often required to perform the text mining applications mentioned above.

2.1.1 Natural Language Processing

While trying to understand human natural language, NLP faces some big challenges, or benchmark tasks (Collobert et al., 2011; Turban et al., 2010). A first challenge in NLP is part-of-speech tagging. Part-of-speech tagging implies assigning a part of speech (e.g. noun, pronoun, adjective, verb, adverb, preposition, etc.) to each word in a sentence. This can be very straightforward, for example in the sentence I drink water, but it can also be very complicated. In natural language, the part of speech of a word not only depends on the word itself, but also on the context of that word, since words can often be ambiguous. Schmid (1994) uses the word store as an example to illustrate this. A person can go to a store, but a PC can also store information. So the meaning, and the part-of-speech tag, of the word store depends on the context in which it is placed. Where humans can easily interpret the word correctly by looking at the context, a computer has more difficulty doing so. Several studies within NLP have been conducted to tag words in sentences with their part of speech. A few examples are Brill (1992), who developed a simple rule-based tagger, Schmid (1994), who used a decision tree approach, and Toutanova, Klein, Manning & Singer (2003), who used multiple features with a cyclic dependency network and achieved an accuracy of 97.24%. Shen, Satta & Joshi (2007) achieved an accuracy of 97.33% using guided learning.

A second challenge in NLP is that words are sometimes ambiguous, so word sense disambiguation is required in order to correctly interpret the data. Word sense disambiguation takes the context of a word into account in order to determine its correct meaning (Turban et al., 2010). In discussing part-of-speech tagging, the example of the word store was given. There, the ambiguity of the word was in its part-of-speech tag (it could be a noun or a verb). However, a word can still be ambiguous when its part of speech is known. For example, if the word order is considered to be a noun, it could be a command or instruction, but it could also be an arrangement of things in relation to each other. Word sense disambiguation tries to distinguish those latter two meanings.

Not only the words we use can be ambiguous; the same holds for grammar. There is not one correct sentence structure, which implies that more than one structure has to be considered (Turban et al., 2010). Two people might use a different word order in their sentences while saying the same thing.

Named entity recognition is another NLP challenge that was already mentioned in Section 2.1. This challenge is about identifying certain words or phrases and categorizing them, for example as persons, locations or processes (Nadkarni, Ohno-Machado & Chapman, 2011). For example, in the sentence John is visiting Amsterdam, the words John and Amsterdam refer to named entities, the name of a person and the name of a location, respectively. Where this is easy to understand for humans, it is a challenge in NLP. The Stanford Named Entity Tagger (http://nlp.stanford.edu:8080/ner/) is a tool that can perform this task quite well.

Word sense disambiguation was mentioned as a second challenge in NLP. Named entities can also be ambiguous, which makes named entity recognition even more difficult. Therefore, named entity disambiguation can also be considered a difficult task for NLP. An example of an ambiguous name is Paris. When the word Paris is put in a sentence, the context of this sentence and the context in which this sentence is placed determine whether Paris refers to the capital of France or, for example, to the celebrity Paris Hilton.

Not only the meaning of the words themselves needs to be extracted from text, but also the relationship between those words. Relation extraction implies recognizing a relationship between two entities by processing text. Consider the following sentence as an example: Cristiano Ronaldo signs for Real Madrid. The challenge for NLP is to detect the relation that is easy to recognize for human beings, namely that the football player Cristiano Ronaldo signed for the football club Real Madrid.

In Section 2.1, it was already briefly discussed how text mining can be very helpful in processing customer feedback. A technique that is often used to get a customer’s opinion out of text is called sentiment analysis. A good description of sentiment analysis is provided by Pang & Lee (2008):

“A sizeable number of papers mentioning ‘sentiment analysis’ focus on the specific application of classifying reviews as to their polarity (either positive or negative), a fact that appears to have caused some authors to suggest that the phrase refers specifically to this narrowly defined task. However, nowadays many construe the term more broadly to mean the computational treatment of opinion, sentiment, and subjectivity in text.” (Pang & Lee, 2008, p. 10)

Sentiment analysis is mainly about processing opinions in text. In most cases, it is used to detect favorable or unfavorable opinions in pieces of text. This can be done by assigning positive or negative values to words, and then counting the total number of positive and negative words in a feedback document. If the document contains more positive than negative words, the feedback can be seen as positive.

Recently, it was pointed out that sentiment analysis can benefit from Rhetorical Structure Theory (RST) (Hogenboom, Frasincar, De Jong & Kaymak, 2015). RST splits text into rhetorically related segments, and identifies the hierarchical structure of these segments (Mann & Thompson, 1988; Hogenboom et al., 2015). Segments are classified as nuclei or satellites, with the nuclei forming the most important parts of the text, supported by the secondary satellites. RST defines the relationship between the nuclei and satellites.

Hogenboom et al. (2015) use an example sentence to show the relevance of RST in sentiment analysis. The positive sentence “While always complaining that he hates this type of movies, John bitterly confessed that he enjoyed this movie.” mostly consists of negative words. RST-based sentiment analysis would distinguish the top-level nucleus “John bitterly confessed that he enjoyed this movie.” from the satellite “While always complaining that he hates this type of movies”. The top-level nucleus can again be split into the satellite “John bitterly confessed that” and the nucleus “he enjoyed this movie.”, with the latter containing the actual sentiment of the whole sentence.

Sentiment analysis could be used in a background check of professional football players to determine the popularity of a certain player. When buying a new player, the performance of this player on the pitch is very important to the buying club. However, taking a look at the popularity of a player might also be interesting. If a popular player is bought, it is more likely that a club is going to sell a lot of kits with this player’s name on them. Analyzing the sentiment in tweets about a player can be helpful in determining this player’s popularity. If many of the tweets about the player are positive, one could conclude that this is a popular player. On the other hand, when the majority of tweets about a player is negative, the player is probably not that popular.
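
The word-counting form of sentiment analysis described in this section can be sketched in a few lines of Python. This is an illustration only: the two word lists are tiny and handpicked, the tweets are invented, and the function names polarity and share_positive are assumptions. A real analysis would use an established sentiment lexicon.

```python
import re

# Hypothetical, handpicked lexicons for illustration only.
POSITIVE = {"great", "brilliant", "love", "amazing", "enjoyed"}
NEGATIVE = {"terrible", "lazy", "hates", "complaining", "bitterly"}


def polarity(text):
    """Label a text by comparing its counts of positive and negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    if positive > negative:
        return "positive"
    if negative > positive:
        return "negative"
    return "neutral"


def share_positive(texts):
    """Fraction of the non-neutral texts that are labelled positive."""
    labels = [polarity(text) for text in texts]
    decided = [label for label in labels if label != "neutral"]
    return labels.count("positive") / len(decided) if decided else 0.0


# Invented tweets about a player.
tweets = [
    "What a brilliant goal, great player",
    "Terrible pass again, so lazy",
    "Love this guy, amazing work rate",
    "Kick-off at eight tonight",
]

# Example sentence of Hogenboom et al. (2015) quoted in this section.
rst_example = ("While always complaining that he hates this type of movies, "
               "John bitterly confessed that he enjoyed this movie.")

print([polarity(tweet) for tweet in tweets])
print(share_positive(tweets))
print(polarity(rst_example))
```

With these word lists, the example sentence of Hogenboom et al. (2015) is labelled negative although its actual sentiment is positive, which is the limitation of plain word counting that RST aims to address.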

2.1.2 Text Mining Process

Where the CRISP-DM model is a widely accepted framework in the data mining industry, such a generalized framework is also required for text mining (Turban et al., 2010). In general, one could distinguish three main steps that need to be performed in text mining. Both Turban et al. (2010) and Hu & Liu (2012) provided a brief description of each of the steps. The main steps are identified below (the names Turban et al. (2010) gave to the steps are mentioned first, and the names of Hu & Liu (2012) are mentioned second):

1. Establish the Corpus or Preprocessing.

2. Create the Term-Document Matrix or Representation.

3. Extract Knowledge or Knowledge Discovery.

While different words are used in both cases, the essence is the same. The first step implies collecting all the documents that are related to the domain of interest and organizing them in the same format, so that the text mining application is able to read them. These input documents could be e-mails, text documents, XML files, web pages, etc. In fact, it could be any type of document, as long as it contains text. This process is often referred to as information retrieval. Note that information retrieval is only about finding the documents that contain answers to questions, not about finding the answers themselves (Hearst, 1997). In this particular case, information retrieval implies gathering the text from social media and news websites about a certain player, so that this text can be used for text mining. In other words, information retrieval is required to answer sub-question 1. It is also necessary to structure each document in the same way (preprocessing).

In the second step the collected data is represented in a term-document matrix. This is a matrix with the terms of interest on one axis and the documents in the analysis on the other. For each document, the count of each term can then be presented in the matrix. However, a normalization method is often required to get a better understanding of the data. A commonly used normalization method is term frequency-inverse document frequency weighting (Turban et al., 2010; Hu & Liu, 2012). With this method a word gets a weight based on both the specificity of the word (document frequency) and the overall frequency of its occurrences (term frequency).

The final step is to extract knowledge from the data using machine learning or data mining techniques. While computers have difficulties interpreting text data, they are very good with numbers. Creating a term-document matrix basically implies transforming textual data into numbers, which are easier for the computer to work with. Data mining or machine learning techniques can then be used to extract information by finding patterns in the numeric data.
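The transformation from text to numbers can be illustrated with a minimal Python sketch. It builds a term-document matrix for three invented one-line documents and applies one common variant of the weighting (raw count multiplied by log(N / document frequency)). The whitespace tokenizer, the function names, and the example sentences are assumptions made for illustration and are not taken from this project.

```python
import math
from collections import Counter


def term_document_matrix(documents):
    """Return (terms, counts); counts[i][j] is how often terms[j] occurs in document i."""
    tokenized = [document.lower().split() for document in documents]
    terms = sorted({token for tokens in tokenized for token in tokens})
    counts = []
    for tokens in tokenized:
        frequency = Counter(tokens)
        counts.append([frequency[term] for term in terms])
    return terms, counts


def tf_idf(counts):
    """Weight every raw count by log(N / document frequency) of its term."""
    n_documents = len(counts)
    n_terms = len(counts[0]) if counts else 0
    # Number of documents in which each term occurs at least once.
    document_frequency = [
        sum(1 for row in counts if row[j] > 0) for j in range(n_terms)
    ]
    return [
        [row[j] * math.log(n_documents / document_frequency[j]) for j in range(n_terms)]
        for row in counts
    ]


# Invented example documents.
documents = [
    "messi scores twice",
    "messi misses penalty",
    "neymar scores penalty",
]
terms, counts = term_document_matrix(documents)
weights = tf_idf(counts)
```

In this toy corpus the word twice occurs in only one document and therefore receives a higher weight in the first document than messi, which occurs in two documents.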

2.2 Personality

Although empirical evidence for a relation between personality and the level of sport one practices is lacking, looking at the personality of a football player might still be very interesting. Knowing the key factors of a new player’s personality could help football clubs to improve their team composition. Besides that, personality does influence several sport-relevant factors. To map the personality of a football player, it is important to know what is meant by the word personality. A widely adopted definition reads as follows:

“Personality is the unique pattern of traits which distinguishes one individual from another.” (Guilford, 1959, pp. 5-6)

The two main aspects in this definition of personality are that it differs from person to person (unique) and that, for one person, it is relatively stable over time (pattern). The personality of a person influences the behavior of this person. Since every person has a unique personality, every person will respond in a different manner to the same situation. This of course also holds for situations that occur during football training or matches.

Since psychology is a qualitative field, there is no single framework in which the personality of an individual can be captured. There are several well-known frameworks that attempt to map the personality of an individual. However, one model has received the most attention lately, and the research field is reaching a consensus about it: the Big Five Personality Factor Model (John & Srivastava, 1999). Factor analysis has shown that the five factors in the Big Five Model cover almost all personality traits, and a lot of research has been done on this model. Therefore, this model is used in this thesis to map the personality profile of football players. The five factors are as follows:

• Agreeableness versus Antagonism.

• Conscientiousness versus Lack of Direction.

• Extraversion versus Introversion.

• Neuroticism versus Emotional Stability.

• Openness to Experience versus Closedness to Experience.

Table 2.1: The Big Five personality factors, adopted from Costa & McCrae (1992).

Factor                    Facets
Agreeableness             Trust, Straightforwardness, Altruism, Compliance, Modesty, Tender-mindedness
Conscientiousness         Competence, Order, Dutifulness, Achievement striving, Self-discipline, Deliberation
Extraversion              Gregariousness, Assertiveness, Activity, Excitement-seeking, Positive emotions, Warmth
Neuroticism               Anxiety, Angry hostility, Depression, Self-consciousness, Impulsiveness, Vulnerability
Openness to Experience    Ideas, Fantasy, Aesthetics, Actions, Feelings, Values

The different facets of each factor are provided in Table 2.1. This table is adopted from the well-known research of Costa & McCrae (1992). Their research includes six specific facets per factor to make measurement of the personality traits possible.

There are several ways in which personality in terms of the Big Five can be measured. A widely used method is the use of self-descriptive sentence questionnaires (De Fruyt, McCrae, Szirmák & Nagy, 2004). Other examples are the Revised NEO Personality Inventory (Costa & McCrae, 1992), which consists of a number of questions that need to be answered in order to determine one’s personality, and the International Personality Item Pool (IPIP), which can be found on http://ipip.ori.org/ (Goldberg, Johnson, Eber, Hogan, Ashton, Cloninger & Gough, 2006). Filling out a personality questionnaire is often a time-consuming process, because a lot of questions need to be answered. A relatively short and still reliable personality test is the Big Five Inventory of John & Srivastava (1999).

While it is not possible to provide a general mapping of a personality for athletes, research has shown that there are several personality facets that relate to the degree of physical activity. Extraversion and conscientiousness are often associated with higher physical activity, while lower activity is related to neuroticism (Rhodes & Smith, 2006; De Moor, Beem, Stubbe, Boomsma & De Geus, 2006; Bruijn, De Groot, Putte & Rhodes, 2009). However, these relationships are weak.

The personality of a potential new player could be very interesting to know for a football club. The club can take the personality of the player into account to balance the team composition. Maybe the club prefers one particular personality factor over the others, or maybe it wants a player that scores low on one of the five factors. Therefore, this project makes an attempt to predict a player’s personality based on news articles about him. The first step is to gather these news articles, which is discussed in the next chapter.

Chapter 3

Selecting Sources and Gathering the Data

To get started with the research, appropriate data is required. It was already made clear that both news articles and social media can contain valuable information about a football player. These are also the main sources that the SciSports employee uses when performing background checks manually. In this chapter, the first research question is answered and the data collection is described.

3.1 News Articles

Since football is such a popular sport, a lot of fans like to stay updated with the latest news about their favorite teams and players. Countless news websites provide fans with the latest updates. Three examples of news websites that focus only on football-related news are Football 365 (http://www.football365.com/), ESPN FC (http://www.espnfc.com/), and Goal.com (http://www.goal.com/), but a single search for a football player in an online search engine returns many other websites, often in multiple languages. Thus, the required information is available on the internet, and the challenge is to download the content of the different web pages such that a structured text corpus about a player can be created. As was mentioned in Section 2.1, creating the corpus is the first step in the general text mining process.

3.1.1 Universal HTML Parsing

Ideally, one would want to download all news articles that are available about a player from all sources. In theory, this should be possible, since all the news articles are freely accessible on the internet. It is possible to type a player name into a search engine, visit each web page manually, copy the content of the article, and paste it into a text file or database. This method would provide perfectly clean data, because each web page is evaluated by a human who can easily distinguish the article content from other elements on the web page. However, this method is very time consuming, especially when articles about many different players need to be downloaded, which is the case in this project. Therefore, a script is required that automatically visits the web pages on which an article about a player is placed and downloads the content of these articles from all sources. Creating such a script is known to be very hard.

To illustrate the task of downloading the article content and the difficulties that come with it, Figure 3.1 is added. The parts of the page in which the interesting article content can be found are marked with an orange dotted line. Besides the article content, the page contains many more elements. At the top of the page a navigation bar is visible, together with some links to the social media accounts of Goal.com. Furthermore, related articles, an advertisement, and some sponsored links are displayed on the right-hand side of the page. When one scrolls down the page, a picture, a video, and a section in which users can leave their comments are also found. These elements are very easy to distinguish with the human eye, so humans have to tell the computer which parts of the web page are interesting and which are not.

Figure 3.1: Example of a web page from Goal.com containing a news article (Screenshot of web page http://www.goal.com/en/news/1862/premier-league/2016/03/22/21593552/ozil-reveals-the-secret-behind-his-silky-control, retrieved on March 23, 2016).

Telling a computer which parts of a web page are relevant can be done by specifying the interesting HTML tags. HTML stands for HyperText Markup Language and is used to describe the structure of web pages. A web page consists of different HTML tags. Each tag contains content of the web page, such as text or a figure, but it can also contain a new tag. For example, in Figure 3.1, the date has its own tag and each paragraph is placed in a <p> tag. For this specific page, it is now possible to parse the HTML so that only the article content remains, by specifying those particular tags. However, websites other than Goal.com have a different structure in the HTML tree, and might even use different tags for the same elements. Therefore, it is often required to identify the interesting HTML tags for each website individually.

This is a well-known problem, and many researchers have tried to find a solution to it. The most recent one is Wu (2016), who has described the problem very well and proposed a solution based on a text detection framework that is also used in video analysis. While this method outperforms the methods that were already available, it is only 46.38% accurate in terms of the perfect rate (Wu, 2016). This shows that the best universal parser available is still far from perfect and that more research is required in this field to increase the performance of universal HTML parsing.

To give an indication of the performance of a universal HTML parser, two different methods, each requiring a different Python package, were executed and compared to each other. Two search engines were used to find news articles about football players. The first search engine is Google (https://www.google.com/), because it is the most used one (NetMarketShare, 2016). The second is NewsJS (http://www.newsjs.com/), because the employees of SciSports currently use this website to find articles about a player on smaller, more regional news websites. To create a proper text corpus on a football player using those two search engines, a script is required that performs the actions described in Algorithm 3.1.

Algorithm 3.1 Retrieving links from Google and NewsJS.
search for news articles about a player on Google
store all links to a web page containing an article in a list
search for news articles about a player on NewsJS
for all links found on NewsJS do
    if link is not in list then
        add link to list
    end if
end for
for link in list do
    visit the web page
    download article content
end for
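The de-duplication step of Algorithm 3.1 could look as follows in Python. This is only a sketch: the function name and the example URLs are invented for illustration, and the two input lists are assumed to have been obtained from the search engines already. Unlike the pseudocode, the sketch also removes duplicates within the first list.

```python
def merge_links(google_links, newsjs_links):
    """Combine two result lists, keeping each URL once and preserving order."""
    links = []
    seen = set()
    for link in list(google_links) + list(newsjs_links):
        if link not in seen:
            seen.add(link)
            links.append(link)
    return links


# Invented example URLs.
google_links = ["http://example.com/a", "http://example.com/b"]
newsjs_links = ["http://example.com/b", "http://example.com/c"]
all_links = merge_links(google_links, newsjs_links)
```

The set makes the membership test cheap, while the list keeps the order in which the links were found.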

First of all, a player name has to be entered on Google and NewsJS in order to perform a search. If both search engines return the same link, only one of the two needs to be stored. Therefore, only the unique links from NewsJS are added to the list of links. When all the links are retrieved, the web pages need to be visited and the title and body text of the articles need to be stored.

An attempt was made to implement Algorithm 3.1 in Python. The final step, downloading the article content, was performed using two different methods. The first method uses the Newspaper package in Python (http://newspaper.readthedocs.org/). This package claims to be able to extract the article title and the body text from a web page with the URL of the web page as the only input. The player used for this test was Lionel Messi. The fact that he is known worldwide is the main reason that he was chosen; a lot of different news websites publish articles about Messi, so it is easy to compare the performance of the Newspaper package across websites. The pseudocode for this method is provided in Algorithm 3.2.

Algorithm 3.2 Downloading articles using the Newspaper package in Python.
for link in list do
    build the article
    download the article
    parse the article
    write the title and the body text of the article to a text file
end for
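A minimal Python sketch of Algorithm 3.2 is given below. It relies on the Article class of the Newspaper package with its download() and parse() methods and its title and text attributes. The function name, the output format, the error handling, and the article_factory parameter (which allows the loop to be tried without network access) are assumptions added for illustration and are not part of the original script.

```python
def download_articles(links, output_path, article_factory=None):
    """Write the title and body text of every article in `links` to one text file."""
    if article_factory is None:
        # Imported lazily so that the loop can be exercised without the package.
        from newspaper import Article
        article_factory = Article

    with open(output_path, "w", encoding="utf-8") as handle:
        for link in links:
            article = article_factory(link)      # build the article
            try:
                article.download()               # fetch the HTML
                article.parse()                  # extract title and body text
            except Exception as error:
                # Some websites return nothing usable; note the failure and continue.
                handle.write(f"{link}\nFAILED: {error}\n\n")
                continue
            handle.write(f"{link}\n{article.title}\n{article.text}\n\n")
```

Recording failed links instead of stopping makes it easier to see afterwards for which websites the package returned no text.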

The results of this test were better than expected. For many web pages, the algorithm returned the main body text of the article. The biggest complication of this method was that for some websites it did not return any text at all. If text was returned, manual checking showed that the package worked quite well, but that sometimes extra information that was not in the body text was also returned (for example, the title of a related article). The processing of the article titles was less accurate. Since this method does not work well for several websites, it cannot be used to parse HTML universally. It can help to download content from a specific website, but then it is necessary to check the performance of the method on this website beforehand.

The second method extracts all <p> tags on a web page and returns the text that is placed within these tags. Normally, the <p> tag is used to create paragraphs. So if a web page is built correctly, parsing all the paragraph tags should return all the text in the paragraphs of the article. In order to also retrieve the article titles, all the heading tags were parsed too. To parse the HTML in this method, the BeautifulSoup package in Python was used (http://www.crummy.com/software/BeautifulSoup/). The same links as in the previous method were used, in order to make both methods comparable. A brief overview of how the parsing with BeautifulSoup was done is provided in Algorithm 3.3.

Algorithm 3.3 Downloading articles using the BeautifulSoup package in Python.
for link in list do
    for all heading tags do
        get the text out of the tag
        write the title to a text file
    end for
    for all <p> tags do
        get the text out of the tag
        write the text to a text file
    end for
end for
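The idea behind Algorithm 3.3 can be tried out with the sketch below. To keep it self-contained, it uses the html.parser module from the Python standard library as a stand-in for BeautifulSoup, where the same selection would be made with find_all. The class name, the example page, and the choice of h1 as the heading tag are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of the requested tags."""

    def __init__(self, wanted_tags):
        super().__init__()
        self.wanted_tags = set(wanted_tags)
        self.collected = []        # (tag, text) pairs in document order
        self._current = None       # wanted tag that is currently open
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.wanted_tags:
            self._flush()          # an unclosed tag ends where the next one starts
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._flush()

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def _flush(self):
        if self._current is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.collected.append((self._current, text))
        self._current = None
        self._buffer = []


# Invented page: the article plus two pieces of noise that use the same tags.
page = """
<html><body>
  <nav><p>Log in to comment</p></nav>
  <h1>Messi scores twice</h1>
  <p>Barcelona won <b>3-1</b> on Sunday.</p>
  <aside><h1>Related: transfer news</h1></aside>
</body></html>
"""
collector = TagTextCollector(["h1", "p"])
collector.feed(page)
titles = [text for tag, text in collector.collected if tag == "h1"]
paragraphs = [text for tag, text in collector.collected if tag == "p"]
```

Both lists contain one genuine item and one piece of noise, because the navigation bar and the related-article box use the same tags as the article itself.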

Note that a loop is used to run over multiple heading tags, while there is usually only one title. This is done so that HTML that is not clean can still return the actual title, together with some noise that has also been assigned to a heading tag.

The results show that HTML tags are used in many different ways by web developers. This is exactly the problem that makes universal HTML parsing impossible. While parsing all heading and <p> tags works quite well for some websites, it results in a total mess for others. The main problem is that the tags are not only used for the title and the paragraphs of the news article, but also for buttons in the navigation bar, titles and introductions of related articles, a paragraph that tells the user to log in in order to leave a comment, etc. Therefore, this method does not provide the clean title and body text of an article.

In this project, the performance of both methods is checked manually. Normally, the performance of the task needs to be measured precisely using evaluation metrics, such as recall, precision, and F-measure (Wu, 2016). However, since looking at only a few text files already shows that they contain a lot of errors, it was decided to make a selection of websites and build a site-specific HTML parser for each of them, and not to evaluate the two universal parsing methods further. A few examples of the parsed articles are provided in Appendix A to show why it is not reliable to parse HTML universally.
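For reference, the evaluation metrics mentioned above could be computed as in the following sketch. It compares the words of an extracted text with those of a manually cleaned reference text. This word-level comparison, the function name, and the example texts are assumptions made for illustration and are not necessarily the definitions used by Wu (2016).

```python
from collections import Counter


def extraction_scores(extracted_text, reference_text):
    """Return word-level (precision, recall, F-measure) of an extracted text."""
    extracted = Counter(extracted_text.split())
    reference = Counter(reference_text.split())
    # Words that occur in both texts, counted with multiplicity.
    overlap = sum((extracted & reference).values())
    precision = overlap / sum(extracted.values()) if extracted else 0.0
    recall = overlap / sum(reference.values()) if reference else 0.0
    if overlap == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented example: the extraction contains noise and misses part of the text.
reference_text = "Messi scored twice on Sunday"
extracted_text = "Log in Messi scored twice"
precision, recall, f_measure = extraction_scores(extracted_text, reference_text)
```

Precision drops when noise such as a log-in prompt is extracted, and recall drops when part of the article is missed.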

3.1.2 Website Specific HTML Parsing

Since perfect universal HTML parsing appeared to be impossible, the decision was made to focus on a few websites and to build a specific HTML parser for each of them. Therefore, the search engines of NewsJS and Google cannot be used anymore, and the search functions of the specific websites are used instead. The parsers are built in Python (version 3.5), specifically with the BeautifulSoup package that was also used in the second method described in Subsection 3.1.1.

The first decision that needs to be made before building a specific parser is for which websites it is going to be built. This is strongly related to the scope of the project. If big international football news websites are chosen, multiple countries will be covered. However, smaller football countries and competitions often receive less attention on these websites, while these are particularly important for SciSports and its customers. Therefore, the decision was made to focus on news sources from one particular country. Since SciSports is located in the Netherlands, as are most of its customers, the scope of this research was set to include only football players who are playing in the Netherlands. However, some English news sources are included as well, since state-of-the-art text mining techniques might be further developed for English than for Dutch.

So far, this decision only narrows the options for news sources, but it does not provide a final selection of news websites yet. Within the Netherlands there are again several websites that provide news on football. Many of them are specifically designed to provide the latest news on football, but there are also general news websites that have their own football section. To decide which news sources should be included in the project, the expertise of the person currently performing the background checks manually was used. Asking this person for the most important news sources within the Netherlands resulted in a long list from which the following selection was made:

• Nederlandse Omroep Stichting (NOS) (http://nos.nl/).

• Voetbal International (http://www.vi.nl/).

• Voetbalprimeur (http://www.voetbalprimeur.nl/).

• Voetbalzone (http://www.voetbalzone.nl/).

The first source, NOS (Dutch Broadcast Foundation) daily provides the most popular news broadcast on television in the Netherlands. On its website, a separate section for football can be found. The other three sources are websites that are focusing on football news only. All four sources are completely in Dutch. As was already mentioned, some English sources are included in the data gathering as well. State-of-the-art text mining techniques, packages, and pre-defined word dictionaries are available in English first. English articles are included in the project as well to make sure that the quality of the results does not depend on the performance of text mining techniques in a particular language. Therefore, the same person at SciSports was asked for his favorite English news sources as well, which resulted in the following selection:

• ESPN FC (http://www.espnfc.us/).

• Goal.com (http://www.goal.com/en).

• Mirror (http://www.mirror.co.uk/).

• The Guardian (http://www.theguardian.com/international).

Now that the sources are known, the eight different HTML parsers are built in Python using the BeautifulSoup package to parse the HTML. The version used for this purpose is Python 3.5, because the standard encoding in this version is UTF-8. This is very beneficial in this case, since many player names or words in news articles contain special characters. Using Python 3.5 instead of version 2.7, these special characters (such as é or ø) can be processed fairly easily.

The exact approach to parse an article differs per website, but each approach requires the same general steps. An overview of these steps is provided in Algorithm 3.4.

Algorithm 3.4 Website specific HTML parsing.
 1: for all websites do
 2:     type in the player name in the search URL
 3:     get all links to news articles from the response page
 4:     create an empty list
 5:     for all link in links do
 6:         if link is not in list then
 7:             add link to list
 8:         end if
 9:     end for
10:     for link in list do
11:         visit the web page
12:         get the HTML
13:         get the body in which all interesting tags are nested
14:         remove major uninteresting tags
15:         parse the date, title, and body text
16:         store article content in database
17:     end for
18: end for

First of all, it is necessary to search for news articles about a player on the website. Usually this can be done by manually performing a search request on the web page and looking at the address bar to see which URL is required to perform a search request. This URL often contains the search request itself. While a new search request could be executed by again using the search bar on the website itself, one could also adjust the search terms in the URL. So when the search URL is known, the names of different players can easily be pasted into this URL to perform a search request. However, on some websites, such as NOS and Voetbal International, the search request is not visible in the URL. For these websites it is necessary to take a closer look at the page requests and responses when a search request is made. In the Google Chrome browser, this can be done by right-clicking on the web page and selecting the inspect button. In the window that appears, all the interactions between the client and the server can be found. For the above-mentioned examples, NOS and Voetbal International, a JSON file containing the results of the search request was found in this inspection window. Since the URLs of these JSON files did contain the search request, these URLs were used to execute search requests on those websites.

When the correct URL to perform a search request is found, the links to news articles are parsed from the response HTML of this URL. This is done by looking for all <a> tags in the HTML, which contain links to web pages. Of course, not all links are links to news articles. Therefore, attributes need to be passed to the parser to specify which tags are links to news articles. For example, on the page containing the results of a search request on the website of Voetbalprimeur, all links to news articles have the attribute class with value title. This information can be passed to the parser. As is shown in Line 6 and Line 7 of Algorithm 3.4, the links to the articles are then stored in a list.

The previous step results in a list of URLs to articles about a certain football player. The next step is to use these URLs to visit each web page containing an article. After that, certain parts of the HTML of that page need to be extracted. When navigating the HTML tree, the smallest tag that contains all interesting article content (date, title, and article text) is selected as the body.
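The search and link-collection steps described above (lines 2 to 9 of Algorithm 3.4) can be sketched as follows. The search URL template, the example response, and all names are hypothetical; every real website needs its own template and attribute values. The standard-library html.parser module is used here as a stand-in for BeautifulSoup so that the sketch runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import quote_plus, urljoin

# Hypothetical search URL; each real website needs its own template.
SEARCH_TEMPLATE = "https://www.example.com/search?q={query}"


def build_search_url(player_name, template=SEARCH_TEMPLATE):
    """Paste a URL-encoded player name into the search URL of a website."""
    return template.format(query=quote_plus(player_name))


class ArticleLinkParser(HTMLParser):
    """Collect the href of every <a> tag that carries the given class value."""

    def __init__(self, base_url, link_class):
        super().__init__()
        self.base_url = base_url
        self.link_class = link_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        href = attributes.get("href")
        if href and self.link_class in classes:
            link = urljoin(self.base_url, href)   # resolve relative links
            if link not in self.links:            # keep each article once
                self.links.append(link)


# Invented response page of a search request.
response_html = """
<a class="menu" href="/home">Home</a>
<a class="title" href="/nieuws/1">Article one</a>
<a class="title" href="/nieuws/1">Article one</a>
<a class="title" href="https://www.example.com/nieuws/2">Article two</a>
"""
search_url = build_search_url("Memphis Depay")
link_parser = ArticleLinkParser("https://www.example.com/", "title")
link_parser.feed(response_html)
article_links = link_parser.links
```

Encoding the name with quote_plus keeps player names with spaces or special characters valid inside the URL.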
As is described by Line 14, the major uninteresting tags are removed from the body. These parts are considered to be noise. Examples of noise are tags in which advertisements, poll questions, figures, and tweets are placed. Note that it is only required to remove the noise if the tags used in it are also used in the interesting parts of the body. So if the main text of an article is placed within <p> tags, only noise containing <p> tags should be removed.

When all unnecessary parts are removed from the body, the interesting parts can be extracted from it. In this project, the interesting parts of the web page are the URL of the web page, the date on which the article was published, the title of the article, and the body text of the article. Since the URL is already known, there are three elements that still need to be parsed from the body of the HTML. These elements can be extracted using the tags in which they are placed together with the attributes of those tags. The date is often placed within a dedicated tag. Because the body text consists of multiple paragraphs that are placed in multiple tags, it is necessary to loop over these tags in order to retrieve the complete text. The most used tag for the paragraphs of the body text is the <p> tag. But as was already mentioned in Subsection 3.1.1, this tag is not always used properly. For example, the first paragraph of articles on the website of Voetbal International is placed in a different tag than the rest of the text.

When all the parts of the article content are parsed, they need to be stored. In this project, a database is used to do this. In the database of SciSports, a new table was created in which all news articles were stored. For each iteration, the player name, article URL, publication date, source, article title, and body text were stored in the database. Furthermore, a unique ID number was assigned to each database entry. As an example, a few rows from the database are provided in Table 3.1.

Table 3.1: A sample of the database in which the news articles are stored.

id     player name  url               date        source  title       text
39982  Neymar       http://espnfc...  2016-05-25  ESPN    Barcelo...  Neymar ad...
39983  Neymar       http://espnfc...  2016-05-24  ESPN    Bayern ...  Former Ba...
39984  Neymar       http://espnfc...  2016-05-22  ESPN    Luis En...  MADRID...
39985  Neymar       http://espnfc...  2016-05-25  ESPN    Pique a...  And so it...
39986  Neymar       http://espnfc...  2016-05-24  ESPN    Man U...    The summ...

The execution of these steps was slightly different for the website of The Guardian, since this source provides an application programming interface (API), which makes the article retrieval much easier. An API is a set of protocols, routines, and tools that allows software applications to communicate with each other. If one wants to use the functionalities of an application within one’s own application, this can often be achieved by using the API of the application of interest. For example, if an application builder wants to include a map in an application, the API of Google Maps could be used to include the pre-created map of Google. In this case, the API of The Guardian can be used to include news articles from this source in an application. One search in the API returns a JSON file in which all articles containing the search request are provided. So a script was written to search for a player using the API and to store all the article content in the database.

After the web parsers were built, they were used to download articles from the four Dutch and four English news websites.
In total, 58,140 articles were downloaded on 203 professional football players, 101 of whom are playing in the Dutch Eredivisie and 102 of whom are playing in other European competitions. An overview of how many articles were downloaded from each source is provided in Table 3.2. This table shows that the groups of Dutch and English articles are about the same size.
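The parsing steps of Algorithm 3.4 (lines 13 to 15) can be sketched for a single, invented website as follows. The configuration values (a div with class article as the body, the noise classes, and the time, h1, and p tags) are hypothetical and have to be determined per website. The sketch uses the standard-library html.parser module as a stand-in for BeautifulSoup, so it shows the logic rather than the code that was actually written.

```python
from html.parser import HTMLParser

# Hypothetical configuration for one website; every site needs its own values.
SITE_CONFIG = {
    "body": ("div", "article"),            # smallest tag holding date, title, text
    "noise_classes": {"advert", "poll"},   # subtrees that must be ignored
    "fields": {"time": "date", "h1": "title", "p": "paragraph"},
}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}   # tags without end tag


class ArticleParser(HTMLParser):
    """Extract date, title, and paragraphs from the body while skipping noise."""

    def __init__(self, config):
        super().__init__()
        self.config = config
        self.stack = []          # (tag, role) for every open tag
        self.date = ""
        self.title = ""
        self.paragraphs = []
        self._buffer = []

    def _inside(self, role):
        return any(open_role == role for _, open_role in self.stack)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        body_tag, body_class = self.config["body"]
        if tag == body_tag and body_class in classes:
            role = "body"
        elif classes & self.config["noise_classes"]:
            role = "noise"
        elif tag in self.config["fields"]:
            role = "field"
            self._buffer = []
        else:
            role = None
        self.stack.append((tag, role))

    def handle_endtag(self, tag):
        if all(open_tag != tag for open_tag, _ in self.stack):
            return               # stray closing tag
        while True:
            open_tag, role = self.stack.pop()
            if open_tag == tag:
                break
        if role == "field" and self._inside("body") and not self._inside("noise"):
            self._store(self.config["fields"][tag])

    def handle_data(self, data):
        if self._inside("field"):
            self._buffer.append(data)

    def _store(self, kind):
        text = " ".join("".join(self._buffer).split())
        if not text:
            return
        if kind == "paragraph":
            self.paragraphs.append(text)
        elif kind == "date" and not self.date:
            self.date = text
        elif kind == "title" and not self.title:
            self.title = text


# Invented article page.
article_html = """
<div class="header"><p>Log in to comment</p></div>
<div class="article">
  <time>2016-05-25</time>
  <h1>Neymar signs</h1>
  <p>First paragraph.</p>
  <div class="advert"><p>Buy now</p></div>
  <p>Second <b>paragraph</b>.</p>
</div>
"""
article_parser = ArticleParser(SITE_CONFIG)
article_parser.feed(article_html)
body_text = "\n".join(article_parser.paragraphs)
```

Only the paragraphs inside the body and outside the noise are kept, which is exactly what the universal approach of Subsection 3.1.1 could not guarantee.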


Table 3.2: Overview of the news articles that are retrieved from the different websites.

Source                     Articles  Players
Dutch
  NOS                         5,075      101
  Voetbal International       6,916       99
  Voetbalprimeur              9,184      101
  Voetbalzone                 7,445      101
  Total                      28,620      101
English
  ESPN                        8,935      100
  Goal.com                    2,476      100
  Mirror                      9,262      101
  The Guardian                8,847      100
  Total                      29,520      102
Total                        58,140      203
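Storing the parsed articles and producing an overview per source could be sketched as below. The sketch assumes an SQLite database with the columns of Table 3.1; the database system actually used by SciSports is not specified here, and the table name, function names, and example rows are invented for illustration.

```python
import sqlite3

connection = sqlite3.connect(":memory:")   # in-memory database for illustration
connection.execute(
    """
    CREATE TABLE articles (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        player_name TEXT NOT NULL,
        url TEXT NOT NULL,
        date TEXT,
        source TEXT,
        title TEXT,
        text TEXT,
        UNIQUE (player_name, url)
    )
    """
)


def store_article(connection, player_name, url, date, source, title, text):
    """Insert one parsed article; an article already stored for a player is skipped."""
    connection.execute(
        "INSERT OR IGNORE INTO articles (player_name, url, date, source, title, text) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (player_name, url, date, source, title, text),
    )
    connection.commit()


def articles_per_source(connection):
    """Count stored articles and distinct players per source."""
    cursor = connection.execute(
        "SELECT source, COUNT(*), COUNT(DISTINCT player_name) "
        "FROM articles GROUP BY source ORDER BY source"
    )
    return cursor.fetchall()


# Invented example rows; the second call repeats the first and is ignored.
store_article(connection, "Neymar", "http://example.com/a", "2016-05-25", "ESPN", "Title A", "Text A")
store_article(connection, "Neymar", "http://example.com/a", "2016-05-25", "ESPN", "Title A", "Text A")
store_article(connection, "Neymar", "http://example.com/b", "2016-05-24", "Goal.com", "Title B", "Text B")
overview = articles_per_source(connection)
```

A grouped query of this kind yields the same type of overview as Table 3.2, with the number of articles and players per source.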

3.2 Twitter Data

Besides news articles, social media form another important source of information about football players, and they are currently used by SciSports to perform background checks as well. Footballers express themselves on different types of social media, such as Instagram, Snapchat, Facebook, and Twitter. On the first two, the main content consists of pictures and videos. Since this project is about extracting information from text data, the latter two sources are more interesting. Although Facebook is a very popular social medium that allows users to interact with each other via text, it was chosen to include only Twitter in this research.

First of all, considering the time constraint of this project, together with the fact that news articles already form a big part of the analysis, only one social medium is included. Because Twitter mainly contains short text messages (there is a maximum message length of 140 characters), these messages often contain a strong opinion. Since sentiment analysis is going to be performed during this project, Twitter is the most suitable platform in this respect. This was recently also pointed out by Kaya & Conley (2016), who stated that the reason for using Twitter in their research is that tweets (user-generated Twitter posts) lend themselves better to sentiment analysis than Facebook posts, because tweets are so short.

Secondly, Twitter contains many other interesting functionalities. Users have the ability to retweet posts that are generated by others. This implies re-posting a tweet of another user on one’s own profile. Furthermore, users can reply to tweets, mark tweets they like, and mention other users in their own tweets. To illustrate all of this, an example tweet is shown in Figure 3.2. This is a very popular tweet of Southampton player Victor Wanyama, who gives his opinion about a meal he just had.
In the figure, his name, Victor Wanyama, is clearly visible, together with a blue check mark to show that this is his verified account. Below his name, his Twitter username @VictorWanyama is displayed. The Follow button can be clicked to follow Wanyama on Twitter. This ensures that tweets created by Wanyama automatically appear on one’s Twitter homepage. Apparently, people find the content of this tweet, “I had spaghetti and it was very nice i enjoyed it”, very nice or funny, because it was retweeted almost 30,000 times and liked over 18,000 times. Furthermore, the date and time of the tweet are displayed, and on the bottom row are, from left to right, the buttons to reply to, retweet, and like the tweet.

Figure 3.2: Example tweet placed by footballer Victor Wanyama (Screenshot of web page https://twitter.com/victorwanyama/status/199525393475190784?lang=en, retrieved on May 10, 2016).

Something that is not displayed in the figure, but can be found when clicking on the username, is the number of tweets placed by the user, the number of profiles the user follows (friends), the number of followers of the user, and the number of likes he has placed. All this information could tell something about a player. For example, it could be the case that a player with many followers is more popular than a player with fewer followers. The same could hold for the number of likes a player gets on his tweets. There are numerous interesting aspects of a Twitter profile, even when the content of the tweets is excluded. This was pointed out by Golbeck, Robles, Edmondson & Turner (2011), who predicted a person’s personality based on his Twitter profile. Golbeck et al. (2011) state that the value of the content of tweets is arguable when predicting personality, since the messages are so short. So while this is an advantage for sentiment analysis, the content of the tweets is less valuable for other fields of interest. In Section 4.1.2 the work of Golbeck et al. (2011) is discussed in more detail.

Just like The Guardian, Twitter also offers an API, which can be used to retrieve Twitter data in JSON format. All the information described above (and more) can be obtained using this API. The Twitter API has many different options, with the main disadvantage that there is a rate limit on the number of requests one can perform per time frame. So a lot of data can be obtained using the API, but definitely not all of it. Since there was no other source available that could provide more data, the free Twitter API was used. The Twitter API comes with comprehensive documentation, which could help to better understand the data gathering process described below (https://dev.twitter.com/overview/documentation).

As with the news articles, Python was used to gather the Twitter data. However, it is noteworthy that for this purpose Python 2.7 was used, while Python 3.5 was used for the news articles. The main reason for this is that the package used to perform sentiment analysis on the tweets is only available for Python 2.7. As was already mentioned, this is a disadvantage with respect to the encoding of text: in Python 2.7 the encoding that needs to be applied has to be stated explicitly.

In order to use the Twitter API, the first step is getting API keys. To get these keys, it is required to create an account on Twitter and to create a new application. After this has been done, Twitter provides four different keys to access the API:

• Consumer Key (API Key).

• Consumer Secret (API Secret).

• Access Token.

• Access Token Secret.
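As a minimal sketch, the four keys listed above can be kept out of the source code and handed to Tweepy (the library used in this project, introduced below) in one place. The environment variable names and the helper names `load_keys` and `build_api` are assumptions made for illustration only; `OAuthHandler`, `set_access_token` and `API` are the Tweepy calls covered in its authentication tutorial.

```python
import os

# Hypothetical environment variable names; any naming scheme would work.
KEY_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_keys(env=None):
    """Return the four API keys as a tuple, failing early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise KeyError("missing Twitter API keys: " + ", ".join(missing))
    return tuple(env[name] for name in KEY_NAMES)


def build_api(keys):
    """Create an authenticated Tweepy API object from the four keys."""
    import tweepy  # third-party; imported here so load_keys works without it

    consumer_key, consumer_secret, access_token, access_token_secret = keys
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    return tweepy.API(auth)
```

A call such as `api = build_api(load_keys())` would then yield the object through which all further requests are made.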

Now the API can be accessed via Python using the Tweepy library (http://www.tweepy.org/). To do so, all four keys are required. Tweets are gathered in two ways during this project. First of all, the basic information about a player’s profile (the number of followers his profile has, the number of users his profile follows, the number of tweets he has liked, and the total number of tweets he has placed) and the content of the tweets he placed are retrieved. This data is later used to get an overview of the Twitter behavior of the player himself. The process required in Python to retrieve the basic information about a player’s profile is briefly described in pseudocode in Algorithm 3.5.

Algorithm 3.5 Retrieving basic Twitter information using Tweepy.
1: get the user object of a player’s profile by filling in his screen name
2: get the name of the player by the name attribute of the user object
3: get the number of followers of the player by the followers_count attribute of the user object
4: get the number of users a player follows by the friends_count attribute of the user object
5: get the number of likes a player has placed by the favourites_count attribute of the user object
6: get the number of tweets a player has placed by the statuses_count attribute of the user object
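The steps of Algorithm 3.5 can be sketched in Python as follows. This is an illustration rather than the script used in the project: the function name `profile_summary` and the dictionary keys are assumptions, and `api` is assumed to be an already authenticated `tweepy.API` object.

```python
def profile_summary(api, screen_name):
    """Collect the basic profile counts of Algorithm 3.5 for one player."""
    # Step 1: look up the user object by screen name.
    user = api.get_user(screen_name=screen_name)
    # Steps 2-6: read the relevant attributes of the user object.
    return {
        "name": user.name,
        "followers": user.followers_count,
        "following": user.friends_count,
        "likes_given": user.favourites_count,
        "tweets": user.statuses_count,
    }
```

For example, `profile_summary(api, "VictorWanyama")` would return the five values for the account shown in Figure 3.2.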

The result of this little piece of code is the same as the information that is shown when the profile of the player is visited in a web browser, and therefore not that interesting. It is retrieved such that this information can easily be added to the background check. Algorithm 3.6 shows how more interesting Twitter content can be retrieved, namely the tweets placed by the account of the player.

Algorithm 3.6 Retrieving the content of tweets a player has placed using Tweepy.
1: create an empty list in which the tweets are going to be stored
2: request the 200 most recent tweets using the user_timeline function
3: add these tweets to the earlier created list
4: subtract 1 from the id of the oldest tweet and store this value
5: while the last request returned at least one tweet do
6:     request the next 200 tweets, filling in the max_id parameter of the user_timeline function
7:     add these tweets to the earlier created list
8:     again subtract 1 from the id of the oldest tweet and overwrite this value
9: end while
10: for all tweets do
11:     if tweet has a retweeted_status then
12:         store value 1 in tweet-specific variable
13:     else
14:         store value 0 in tweet-specific variable
15:     end if
16: end for
17: store all tweets in a file
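A compact Python sketch of the paging loop and the retweet flag in Algorithm 3.6 is given below. The function name `collect_timeline` is an assumption, `api` is assumed to be an authenticated `tweepy.API` object, and storing the result in a file (line 17) is left out here.

```python
def collect_timeline(api, screen_name, page_size=200):
    """Page backwards through a user's timeline, as in Algorithm 3.6."""
    tweets = []
    batch = api.user_timeline(screen_name=screen_name, count=page_size)
    while batch:
        tweets.extend(batch)
        # The timeline is returned newest first, so the last item is the
        # oldest; ask only for tweets that are older than that one.
        oldest_id = batch[-1].id - 1
        batch = api.user_timeline(
            screen_name=screen_name, count=page_size, max_id=oldest_id
        )
    # Flag retweets: the retweeted_status attribute is only set on retweets.
    return [
        (tweet, 1 if hasattr(tweet, "retweeted_status") else 0)
        for tweet in tweets
    ]
```

The loop stops as soon as a request comes back empty. With 200 tweets per call and a ceiling of 3,200 tweets per user, this amounts to at most 16 non-empty requests per player.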

In Algorithm 3.6, only 200 tweets are gathered initially, because this is the maximum amount that can be retrieved in one call. In total, 3,200 tweets of one user can be obtained using this function. Furthermore, the final step states that the collected data is stored in a file. In this project, the following attributes are written to a comma delimited file (csv):

• Tweet id of the tweet.

• User id of the user that tweeted or retweeted the tweet.

• The date and time on which the tweet was placed or retweeted.

• The textual content of the tweet (with possibly a link to a picture, movie, article, etc.).

• The number of retweets the original tweet has received.

• The number of likes the tweet of the player has received. This value is equal to 0 when the player retweeted a tweet.

• Whether or not the tweet is retweeted by this user (this is the value determined in lines 12 to 14 of Algorithm 3.6).

• The date and time on which the tweet was retrieved.
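The attributes listed above could be written to a csv file roughly as follows. This is a sketch under assumptions: the column names and the helpers `tweet_to_row` and `write_tweets` are invented for illustration, and each tweet is assumed to be a Tweepy status object paired with the 0/1 retweet flag from Algorithm 3.6. The sketch is written for Python 3; under Python 2.7, as used in the project, the text would additionally have to be encoded explicitly.

```python
import csv
from datetime import datetime

# Hypothetical column names, one per attribute in the list above.
COLUMNS = [
    "tweet_id", "user_id", "created_at", "text",
    "retweet_count", "favorite_count", "is_retweet", "retrieved_at",
]


def tweet_to_row(tweet, is_retweet, retrieved_at):
    """Turn one status object into a list of csv fields."""
    return [
        tweet.id,
        tweet.user.id,
        tweet.created_at.isoformat(),
        tweet.text,
        tweet.retweet_count,
        tweet.favorite_count,
        is_retweet,
        retrieved_at.isoformat(),
    ]


def write_tweets(handle, flagged_tweets, retrieved_at=None):
    """Write a header line and one line per (tweet, is_retweet) pair."""
    retrieved_at = retrieved_at or datetime.now()
    writer = csv.writer(handle)
    writer.writerow(COLUMNS)
    for tweet, is_retweet in flagged_tweets:
        writer.writerow(tweet_to_row(tweet, is_retweet, retrieved_at))
```

The `csv` module quotes fields that contain commas or line breaks, which matters here because tweet text frequently contains both.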

Secondly, tweets are gathered that appear to be about the player of interest. This is done by searching for tweets in which the player was mentioned. On Twitter, a mention is a tweet that contains the screen name of another user, preceded by an @. So if a user wants to react to the tweet of Wanyama that was displayed in Figure 3.2, this tweet would contain @VictorWanyama. For example, if someone wants to let his or her followers know that he or she finds the tweet very funny, this person could place the following tweet: “I think the tweet of @VictorWanyama is very funny.” In this example, Victor Wanyama is mentioned by the user who places the tweet. In this project, all the mentions of a player are gathered to gain some insight into the way people think about him. So in this example, a search query containing @VictorWanyama is sent to retrieve all mentions of Wanyama.

Tweets about a player could also be gathered using other queries. For example, the name could form a query to search Twitter. People do not necessarily use a mention when they are talking about a player, but can also use his name (last name or both first and last name). However, the mention is the only search query used in this project, because searching for names alone leads to ambiguity. For example, when one tries to find tweets about the Dutch football player Michiel Kramer by only typing in his last name Kramer, one will mainly find tweets about Sven Kramer, who is a famous Dutch speed skater. Therefore, it is decided to only use the mention of the official Twitter account of the player as a search query.

Of course, Twitter is known for the creative use of hashtags to show what people are talking about, and which topics are trending. Users might also use hashtags to talk about a football player (for example #Wanyama). However, there is not one generally accepted hashtag for one specific player (one user uses #Wanyama, another one #VictorWanyama, and a third one #WanyamaVictor). 
A possibility to include these tweets as well would be to add some likely hashtags for a player, but this is not done within this project. The assumption is made that the average content or sentiment of tweets with a hashtag does not differ from the content or sentiment of tweets that contain a mention. Therefore, also looking for hashtags would only result in more data, not necessarily in better data. Because of this assumption and the fact that only a limited number of requests can be sent to the Twitter API, the decision was made not to include those hashtags when searching for tweets about a player. A Python script was developed for SciSports in which the search queries for players can be set manually, but as data input for this project only the player mentions are used.

Algorithm 3.7 Retrieving the content of tweets about a player using Tweepy.
1: specify the search query
2: specify the tweet language
3: load the id of the most recent tweet that is already obtained about this player
4: create an empty list
5: send search request for 1000 tweets using the input parameters
6: for all tweets do
7:     if tweet does not have a retweeted_status then
8:         add tweet to earlier created list
9:     else
10:         do nothing
11:     end if
12: end for
13: store all tweets in a file
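One run of Algorithm 3.7 could look roughly like the sketch below. The helper names and the idea of passing the text file as `since_id_path` are assumptions; `api` is assumed to be an authenticated `tweepy.API` object. The call `api.search` is the name used in the Tweepy 3.x series (the documentation cited in this chapter is for v3.5.0); later Tweepy releases renamed this method, so the name may need adjusting.

```python
import os


def load_since_id(path, default=1):
    """Read the id of the newest tweet already stored (1 on the first run)."""
    if not os.path.exists(path):
        return default
    with open(path) as handle:
        return int(handle.read().strip() or default)


def save_since_id(path, tweet_id):
    """Overwrite the text file with the new most recent tweet id."""
    with open(path, "w") as handle:
        handle.write(str(tweet_id))


def keep_originals(statuses):
    """Drop retweets; retweeted_status is only set on retweets."""
    return [s for s in statuses if not hasattr(s, "retweeted_status")]


def fetch_mentions(api, query, lang, since_id_path, limit=1000):
    """Request up to `limit` new tweets for one query and filter retweets."""
    import tweepy  # third-party; only needed for the actual request

    since_id = load_since_id(since_id_path)
    cursor = tweepy.Cursor(api.search, q=query, lang=lang, since_id=since_id)
    statuses = list(cursor.items(limit))
    if statuses:
        save_since_id(since_id_path, max(s.id for s in statuses))
    return keep_originals(statuses)
```

A call such as `fetch_mentions(api, "@VictorWanyama", "en", "wanyama_since_id.txt")` (file name hypothetical) would return only the non-retweet mentions that are newer than the stored id.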

Table 3.3: An overview of the players about whom Twitter data was obtained.

Player               Username           Tweets   Mentions
Dutch
Mark-Jan Fledderus   @Fleddie              787         32
Jetro Willems        @JetroWillems15        75        107
Michiel Kramer       @kramerinho           618         95
Dirk Kuijt           @Kuyt                 736        164
Ron Vlaar            @RonVlaar4            475         24
Other
Christian Eriksen    @ChrisEriksen8        569        363
Cristiano Ronaldo    @Cristiano          2,532      5,101
Mario Balotelli      @FinallyMario       1,112        950
Zlatan Ibrahimović   @Ibra_official        194      5,413
Riyad Mahrez         @Mahrez22           2,824        525
Total                                    9,922     12,774

Again, the Tweepy library in Python was used to gather mentions of a football player. As input parameters for the search request, the search query, the language, and the id of the most recent tweet are provided. The search query is the @Username of a player, or the player’s first and last name, as was described above. The language parameter specifies the language in which the gathered tweets must be written. For the players playing in the Netherlands, the language was set to Dutch. For the players playing in other countries, the language was set to English. Specifying the language ensures that only useful tweets are downloaded. Cristiano Ronaldo, for example, receives a lot of mentions in English, but also in Portuguese. Since this project only investigates text mining techniques for the Dutch and English languages, the tweets in Portuguese are redundant. The initial id of the most recent tweet is set to 1. After that, the actual id of the most recent tweet is used. This id is needed because tweets have to be gathered multiple times over time: the search function of the API only returns tweets from the last two weeks. So running the script multiple times increases the time frame of the data set.

Algorithm 3.7 shows how mentions and tweets containing a player name are retrieved in this project. Line 3 states that the most recent tweet id is loaded. Each time the script ran, the id of the most recent tweet was loaded from a text file. At the end of the script, the content of this text file was overwritten by the new most recent tweet id. Line 5 specifies that 1000 tweets are requested each time the script runs. This again has to do with the rate limit of the API. That is why the most recent tweet id needs to be stored each time, such that more data can be retrieved. Line 8 specifies that only tweets that are not retweets are stored. 
This is done because downloading all retweets of one tweet separately would distort the sentiment analysis later. Instead, only the original tweet is downloaded, together with the total number of times this tweet has been retweeted. Furthermore, the final line of Algorithm 3.7 again states that the gathered tweets are stored in a file. This is again a csv file with exactly the same attributes as the ones listed for the tweets created by the player. Note that each tweet has a value of zero for the attribute that indicates whether the tweet is retweeted by this user, since retweets are not downloaded.

It is noteworthy that the authentication steps are skipped in Algorithms 3.5, 3.6 and 3.7. Information about authentication using Tweepy can be found in the documentation of the Tweepy library (http://tweepy.readthedocs.io/en/v3.5.0/auth_tutorial.html).

The Twitter data is gathered for 10 different players. Five of those players are playing in the Netherlands, while the other five are playing in other countries. This distinction is again made because the text mining techniques that are going to be applied to the data might perform better for the English language. The number of players is much smaller than the number of players for whom news articles were downloaded, since this second part of the data gathering was restricted by the rate limits of the API.

Tweets were gathered over a period of 13 days. During these 13 days, the script did not run constantly, since only one laptop was used for this project. Sometimes the laptop was needed for something else, or it was not connected to the internet. Therefore, not all mentions placed during this period were gathered. Some popular players received so much attention that only constantly requesting their mentions would have resulted in a complete data set. Considering both the rate limit of the API and the fact that the script could not run constantly, this was impossible to achieve. However, the number of tweets in the data set still seems sufficient.

An overview of the players on whom Twitter data was gathered is provided in Table 3.3. From left to right, the table lists the name of the player, his username on Twitter, the number of tweets in the data set that were posted or retweeted by this player, and the number of mentions of this player in the data set. A sample of the tweet data set can be found in Appendix B.

The number of mentions seems rather low, considering the fact that 1000 tweets are requested every time the script runs. This is because retweets are filtered out after those 1000 tweets are requested. Since a retweet also includes the @Username of a player, those tweets also appear when @Username is set as the search query. Unfortunately, there was no parameter in the search function to specify that retweets should not be included.

Chapter 4

Identifying Interesting Player Characteristics

After the data has been collected, it can be analyzed to discover information. Before this is done, it is required to specify which kinds of information are interesting from both an academic and a practical perspective. In Section 2.2 of the preliminaries, the topic of personality was introduced. This chapter discusses why personality is an example of a piece of information that is both theoretically and practically relevant. Of course, a personality check alone does not result in a complete player background check. In general, the goal of this chapter is to find multiple types of information that could be obtained from the data that was retrieved via the methodology described in Chapter 3. To do so, the second sub-question is answered.

This chapter starts with a literature review on the relationship between personality and text. It is discussed how written text can give an indication of a person’s personality, both from self-narratives and social media profiles. After that, a list of personality adjectives is introduced, which is used later for the case specific personality prediction. The second section of this chapter discusses which information can be obtained from the Twitter data that was retrieved. Regular expressions, word clouds, and sentiment analysis are the topics that are going to be discussed.

4.1 Personality and Text

A first interesting topic that comes to mind when thinking of a player background check is the personality of a player. SciSports indicated that personality would be a very valuable complement to their current background check. As was discussed in Section 1.1, SciSports currently only includes facts about a player derived from individual news articles in their background check. However, they do not look at the whole corpus of articles and draw conclusions from that. In this project, an attempt is made to analyze the text in all news articles about a player and derive a personality profile from that. Since the topic of personality was already discussed in Section 2.2, the focus here is put on the connection between personality and natural language.

A lot of research has been conducted on the relationship between a person’s personality and language use. In this part, an overview of this research is provided. First of all, the relationship between the Big Five personality traits and language use is discussed. After that, it is discussed how someone’s personality can be derived from social media. Finally, it is mentioned how the personality of football players can be derived from the data that is available.

4.1.1 Personality in Self-Narratives

Pennebaker & King (1999) found that the way people express themselves in words is remarkably stable over time and that reliable linguistic styles can be identified. Recall the definition of personality from Guilford (1959), stated in Section 2.2, which indicates that a personality is unique for each person, and that it is relatively stable over time. Combining those two findings suggests that an individual’s personality might be derived from text. However, Pennebaker & King (1999) also found that, although the Big Five model is the most widely accepted personality model, the correlation between language dimensions and the Big Five model is weak.

Pennebaker & King (1999) discuss three remarkable findings with respect to the Big Five personality model. Firstly, a negative correlation was found between the factor Immediacy and the Openness to Experience dimension, which means that people who have a more immediate and simple writing style score lower on this personality dimension. Secondly, the factor Making Distinctions was negatively correlated with the personality dimension Extraversion, which implies that individuals who use more exclusive words (e.g. ‘without’ and ‘except’), tentative words (e.g. ‘perhaps’ and ‘maybe’), negations (e.g. ‘not’ and ‘never’) and fewer inclusion words (e.g. ‘and’ and ‘with’) are more introverted. Finally, the factor Making Distinctions was also negatively correlated with the personality dimension Conscientiousness.

In order to draw conclusions about the other personality traits of the Big Five as well, Hirsh & Peterson (2009) investigated the relation between language use in self-narratives and personality in terms of the Big Five. The assumption made by Hirsh & Peterson (2009) was that someone’s personality can be predicted better from text when this person writes about himself or herself than when this person completes a fixed writing assignment, as was done in the study of Pennebaker & King (1999). 
Undergraduate students had to perform a writing assignment, in which they had to write about both their past experiences and their envisioned future. The Linguistic Inquiry and Word Count (LIWC) of Pennebaker, Francis & Booth (2007) (http://liwc.wpengine.com/) was used to analyze the word frequencies in the assignments. For all five personality traits, Hirsh & Peterson (2009) found significant linguistic correlates. Table 4.1 shows these significant correlations (p < .05) together with an example phrase. These results show that personality differences (considering the Big Five Personality Factor model) of undergraduate students are manifested in the word choice when writing self-narratives (Hirsh & Peterson, 2009). The effect size in this research was much larger than in the research of Pennebaker & King (1999). Since the findings of Hirsh & Peterson (2009) contain many linguistic categories that were also addressed in a similar study of Fast & Funder (2008), it is concluded that word choice while writing self-narratives does indeed say something about an individual’s personality.

In their meta-analysis, Tskhay & Rule (2014) showed that besides Pennebaker & King (1999) and Hirsh & Peterson (2009) there are several other researchers who have shown that people use a persistent, consistent, and stable pattern of written expression (Chung & Pennebaker, 2008; Gill & Oberlander, 2002; Holtgraves, 2011; Lee, Kim, Seo & Chung, 2007; Pennebaker, Mehl & Niederhoffer, 2003), which shows that stability in writing and the word use of an individual could be indicators for personality constructs.

This theory could be very relevant for an automated background check. SciSports indicated that they find the personality aspect of a player very interesting, and considering this theory, creating an overview of the personality of a player should be possible. However, when directly connecting the above described theory to this project, a problem arises. 
All conclusions of the research discussed above are based on text that people wrote themselves, and not on pieces of text that were written by others. Although the news articles in the data set are about football players, they are written by journalists. No evidence exists that using the LIWC to analyze text written by someone else about a person also says something about that person’s personality.

Table 4.1: Linguistic correlates of each Big Five trait, adapted from Hirsh & Peterson (2009, p. 526).

Trait / word category    r      Example sentence
Extraversion
  Humans                 .25    “I feel that it facilitated my trust in people”
  Social Processes       .22    “This experience contributed to my current love for public speaking”
  Family                 .21    “This goal will become increasingly important when I begin a family”
Agreeableness
  Certainty              .22    “I felt total security”
  Inclusive              .22    “Now, I suddenly felt included”
  Family                 .21    “It was a hard decision that I had to make for the sake of my family”
  Body                  -.20    “It has caused me to have a relatively frail body nowadays”
  Anger                 -.26    “I hated my teacher”
Conscientiousness
  Achievement            .22    “We are high achievers and encourage one another to do our best”
  Work                   .21    “All the hard work was completely worthwhile”
  Body                  -.20    “I have very weak upper body strength”
  Death                 -.21    “I do not want to confront poverty, sickness and death”
  Anger                 -.23    “It made me angry and helpless”
  Exclusive             -.24    “In my ideal future I would stop just reacting and start acting”
Neuroticism
  Sad                    .29    “I walked around with a monstrous sadness”
  Negative Emotion       .26    “It requires breaking a vicious circle of guilt”
  Body                   .22    “I felt this awkwardness in my body”
  Anger                  .20    “I will also be less angry with myself”
  Home                   .19    “I will stay home Saturday nights, when I will feel like it”
  Anxiety                .19    “I was just chronically scared of the unknown”
  Work                  -.25    “I’ve been so busy with work”
Openness
  Perceptual Processes   .28    “I will start trying to listen”
  Hear                   .27    “I want to be able to talk to them and hear their voices”
  Exclusive              .20    “I do not want to live without music in my future”

4.1.2 Personality in Social Media

Besides news articles about football players, tweets that may contain valuable information about a player are also collected. Most of those tweets are created by the player himself (except retweets), so one might expect that those user generated tweets could tell something about his personality. A social media profile consists not only of text that is used to add information or to communicate with other users; pictures and personal information are also part of an individual’s online profile. With this additional information, people can form a perception of someone’s personality by looking at his or her online profile.

It was shown that an individual’s Facebook page does not provide an idealized image of that person, but that it actually contains information about that individual’s personality (Back, Stopfer, Vazire, Gaddis, Schmukle, Egloff & Gosling, 2010). Furthermore, Kluemper, Rosen & Mossholder (2012) found that the social networking website Facebook provides reliable information about all Big Five personality traits. Tskhay & Rule (2014) studied the perception of personality in text-based media and online social networks via a meta-analysis. It was concluded that the perceivers of open social network profiles not only generally agree with each other about the Big Five personality traits of the profile’s owner, but that this perception is also highly accurate for extraversion, conscientiousness, agreeableness, and openness to experience.

It is noteworthy that in both the studies of Kluemper et al. (2012) and Tskhay & Rule (2014) the topic was perception of personality. This implies that they were mainly about how viewers of social media profiles judge the personality of the individual behind that profile. To do so, viewers did not only look at the user created text, but also at profile pictures, demographic information and pages the profile owner has shown interest in. 
Golbeck et al. (2011) carried out a study in which a user’s personality was predicted from his or her Twitter profile. Again, not only user created text was considered, but also other information, like the number of followers (people following the user), the number of followings (people the user follows) and the number of replies, was taken into account. In fact, it was not certain whether the text content of tweets was a useful source to predict personality:

“Previous research has shown that linguistic features can be used to predict personality traits (Mairesse, Walker, Mehl & Moore, 2007; Pennebaker & King, 1999). Data collected in Pennebaker & King (1999) was used in both studies. They had three separate sources of text, ranging from an average of 1,770 words to over 5,000 words per person. There is potential to apply these linguistic analysis methods to help predict personality by analyzing a persons tweets. However, the text samples used in earlier studies are much larger than are available to us through any twitter posting. Aggregating many tweets from a user gives more information, but as a series of disconnected statements rather than a coherent document as was used in other studies. Thus, it is unclear if Twitter text will be as connected to personality as was the case in other work. Tweets are much different sources of text. Each one is limited to 140 characters, and a compilation of tweets from a given user is more a stream of disjointed thoughts than a coherent narrative as is found in the text used in previous personality studies. Thus, it was not entirely clear whether tweets would be a useful source of data for this type of analysis.” (Golbeck et al., 2011, p. 151)

Golbeck, Robles & Turner (2011) also showed that personality was predictable using the information on someone’s Facebook profile, but via machine learning instead of the perception of other people. The researchers were able to predict each of the Big Five personality traits within 11% of its actual value. Again it is noteworthy that the researchers did not only take the user created text into consideration, but also the social network, personal information and a list of favorite activities or things of a user. As in the previously mentioned study on Twitter profiles, the authors mentioned that Facebook posts are generally much shorter than the text documents used in the study of Pennebaker & King (1999), and that the value of user created text (or posts) is doubtful.

4.1.3 Case Specific Personality Prediction

This chapter is not only about the question of which player characteristics are interesting to know, but also about the question of how these player characteristics can be obtained from the data. With respect to the personality of a footballer, literature shows that predicting personality is possible by looking at self-narratives, or by looking at someone’s social media profile (where the user generated content is not very interesting for this purpose). However, there seems to be a gap in the literature when it comes to predicting a person’s personality from pieces of text that are written by people other than this person himself. Therefore, one of the contributions of this project is a first attempt to predict someone’s personality without requiring this person to write about himself, using articles about him instead.

A starting point in bridging this gap could be the earlier mentioned LIWC of Pennebaker et al. (2007). As was shown in Table 4.1, there are significant correlations between the use of certain word groups and the personality traits of the Big Five model. However, since the research of Hirsh & Peterson (2009) uses self-narratives as an input to predict personality, the LIWC is probably less relevant when it is used on news articles about a football player. It is likely that news articles about a football player do not include as many words from the different word groups of the LIWC as self-narratives do. For example, when a journalist writes an article about the performance of a player during a match, he is probably not going to mention the player’s family, since this is not relevant in this context. However, when the player is asked to write about his personal past and his envisioned future, it is much more likely that words of the family word group are used in this text. This is one reason not to include the LIWC in this project. 
A second reason to leave out the LIWC is the simple fact that it is not free to use. Since this is a master thesis project, there is no budget available. Buying the LIWC could be considered if it were expected to add value to this project. However, as described above, this is not believed to be the case.

So if the LIWC is not going to be used to derive personality from news articles, another starting point has to be chosen. Saucier & Goldberg (1996) pointed out that studies of natural language are a prime source of the Big Five model, but that a large set of English personality adjectives did not exist yet. Therefore, they created a set of 435 adjectives and determined the factor loading of each adjective on each of the five personality factors. This whole list of adjectives, together with their factor loadings, is included in Appendix D.

While Saucier & Goldberg (1996) created the list of adjectives to find evidence for the Big Five personality model, this extensive word list could also be used to analyze players and say something about their personality. For example, if the word sympathetic appears a lot in news articles about a certain player, one might expect this player to score high on the Agreeableness factor. Since a large text corpus can be analyzed automatically, it is possible to check all news articles that are found about one player for all 435 personality adjectives, and count how often each adjective is used. This process is described in Section 5.1.
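To make the counting idea concrete, a minimal sketch is given below. It is not the procedure of Section 5.1: the three adjectives and their loadings are placeholders invented for illustration (they are not values from Saucier & Goldberg, 1996), and multiplying counts by loadings is only one simple way in which the counts could be combined.

```python
import re
from collections import Counter

# Placeholder loadings for illustration only; NOT the published values.
EXAMPLE_LOADINGS = {
    "sympathetic": {"Agreeableness": 0.5},
    "talkative": {"Extraversion": 0.5},
    "organized": {"Conscientiousness": 0.5},
}


def count_adjectives(articles, adjectives):
    """Count how often each listed adjective occurs in a set of articles."""
    counts = Counter()
    for article in articles:
        for word in re.findall(r"[a-z]+", article.lower()):
            if word in adjectives:
                counts[word] += 1
    return counts


def weighted_scores(counts, loadings):
    """Sum count times loading per factor (an assumed, simple weighting)."""
    scores = Counter()
    for word, n in counts.items():
        for factor, loading in loadings[word].items():
            scores[factor] += n * loading
    return dict(scores)
```

Here `count_adjectives(articles, EXAMPLE_LOADINGS)` returns the raw frequencies per adjective, which `weighted_scores` then turns into one number per personality factor.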

4.2 Information from Twitter Data

Next to the news articles, Twitter data was also retrieved, which needs to be translated into useful information about a player. It was already pointed out that the content of the tweets is not relevant in predicting a player’s personality, but the tweets from and about a player can contain other valuable information.

Several studies have been conducted in the past in which Twitter data was used in combination with football. An example is the study of van Oorschot, van Erp & Dijkshoorn (2012), in which the researchers collected tweets about Dutch football matches, and automatically extracted game events from these tweets. They extracted the minute in which the event occurred, the event type, and whether the event concerned the home or the away team. The studies of Esmin, Júnior, Santos, Botaro & Nobre (2014) and Jai-Andaloussi, El Mourabit, Madrane, Chaouni & Sekkaki (2015) did something comparable, but had as their main goal to create a summary of a football match using Twitter data as an input. Since people tweet a lot about football matches, the researchers proposed an approach to automatically process those tweets and use them to create a summary of the game. Both studies resulted in a successful approach to automatically summarize football matches using Twitter.

While the above mentioned studies all use a combination of football and Twitter data, they use it to describe the game of football itself. However, this project focuses on a background check, and therefore the interest is more in the players and their behavior than in the matches. Therefore, the Twitter data that was obtained in Chapter 3 is used to draw conclusions about specific players.

4.2.1 Regular Expressions

Currently, SciSports is looking at a player’s Twitter profile to find pieces of information. Whether the player placed a tweet a few years ago that could link him to a club, or he sends a lot of tweets to one other player, it could bring SciSports further in performing their background check. Therefore it is interesting to automatically create an overview of the things a player is mostly talking about on Twitter and to whom he is talking or reacting. The process to do so is not complicated, but would save SciSports a lot of time.

It is very common to add a hashtag to a tweet to show what the tweet is about. For example, when a football match between PSV and Ajax is played in the Dutch Eredivisie, people often use the hashtag #psvaja, which contains the three letter abbreviations of both teams, to show that they are talking about this match. It was exactly these hashtags that were analyzed by van Oorschot et al. (2012) to extract game events from Twitter data. Players also use a lot of hashtags to show what they are talking about. Extracting these hashtags could provide SciSports with an overview of what the player finds important.

Secondly, SciSports wants to know who a player is communicating with via social media, for instance to find out who the friends of the player of interest are. This can easily be extracted by checking the mentions a player uses. The frequency of these mentions can then be ordered to see to whom the player is tweeting the most.

Extracting both hashtags and mentions from tweets is fairly easy, due to the distinctive characters these two phenomena start with. As was already explained, hashtags and mentions start with the # and @ characters, respectively. These starting characters make them very easy to detect, not only for human beings, but also for computers. A technique called regular expressions can be used to extract mentions and hashtags from the text data.
A regular expression is a text string that describes a pattern to search for in text data. So rather than specifying the exact string one wants to look for, one can specify the character pattern of the string one wants to have. For example, if one wants to find all words in a tweet that only consist of non-capital letters, one could search for the pattern [a-z]+. Other examples are:

• Specific character: [a], matches the letter a.
• Range: [A-Za-z0-9], matches all letters (both cases) and numerals.
• Or: a|b|c, matches either a, or b, or c.
• Optional: behaviou?r, matches both behaviour and behavior.
• Negation: [^0-9], matches all characters except numerals.
• One or more: go+, matches go, goo, goooooo, etc.

To match a hashtag or a mention, such a regular expression can easily be defined, because they always have the same pattern. The specific characters # and @ are followed by letters and numerals. So defining a pattern starting with either a # or a @ is sufficient to extract those objects from the text data. The following regular expressions will match all hashtags and mentions:

Hashtag: #[A-Za-z0-9]+, matches all hashtags consisting of letters (both cases) and numerals.

Mention: @[A-Za-z0-9_]+, matches all mentions consisting of letters (both cases), numerals, and underscores.

So, regular expressions can be used to extract hashtags and mentions from the obtained Twitter data, which could contain meaningful information for SciSports to add to their player background check, or at least provide more insight into the Twitter behavior of a player without going through his tweets manually.
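The extraction and ranking of hashtags and mentions described in this subsection can be sketched in a few lines. The sketch below is written in Python rather than in the R code used in this project, the example tweets and the helper name top_tags are hypothetical, and the underscore in the mention pattern is an assumption based on the characters Twitter allows in user names.

```python
import re
from collections import Counter

# Patterns as described in the text. The underscore in the mention
# pattern is an assumption (Twitter user names may contain underscores).
HASHTAG = re.compile(r"#[A-Za-z0-9]+")
MENTION = re.compile(r"@[A-Za-z0-9_]+")


def top_tags(tweets, pattern, n=20):
    """Return the n most frequent matches of pattern, ignoring case."""
    counts = Counter(
        match.lower() for tweet in tweets for match in pattern.findall(tweet)
    )
    return counts.most_common(n)


# Hypothetical tweets, for illustration only.
tweets = [
    "Great win today! #psvaja @teammate_one",
    "Thanks for the support @teammate_one #PSVAJA",
    "Training done #eredivisie",
]

top_hashtags = top_tags(tweets, HASHTAG)
top_mentions = top_tags(tweets, MENTION)
```

Lowercasing the matches before counting is a design choice of this sketch, so that #psvaja and #PSVAJA are counted as the same hashtag.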

4.2.2 Word Cloud

A second aspect that is interesting to add to a player background check is the words people use when talking about this player. Frequently used words could contain useful information about what people generally think about this player. While there are words that do not contain any information individually, such as the word the, there are many that do. For example, frequently used words can be nicknames, abusive words, names of other players, names of football clubs, etc. If a certain name of a football club is mentioned a lot in tweets about a player, this might indicate that the fans are expecting this player to move to that club. Of course, the whole context is not covered by looking at word counts only, but it is a relatively easy way to get a good overview of the text data.

A word count would suffice to see which words are most frequently used. However, as was mentioned in the previous section, SciSports often uses visualizations in their player reports. A popular way to visualize the word count is the word cloud. A word cloud is a number of words put together in one image, in which the size of each word is often representative of the count of that word in the text corpus. To make it even clearer which words are used more often, colors can also be used to represent the frequency of word use. This is a convenient way to visualize the word counts in tweets about a player.

Word clouds are especially interesting to use on tweets, since the language used on Twitter differs a lot from the natural language people normally use. On Twitter, people use a lot of abbreviations and pay less attention to the correctness of punctuation and grammar, because the number of characters in a tweet is limited (Kumar, Morstatter & Liu, 2014). Creating a word cloud does not require interpretation of the sentences, so punctuation and grammar are neglected, which makes it an interesting approach to represent Twitter data. Furthermore, abbreviations do not have to be interpreted, only counted and visualized, which is beneficial in this case, since a lot of abbreviations of football club names are used in tweets.
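The word counts underlying a word cloud, and the mapping from count to font size, can be sketched as follows. This is a minimal Python illustration, not the implementation used in this project: the tweets, the short stop word list, and the font size range are all hypothetical, and drawing the actual image is left to a plotting library.

```python
import re
from collections import Counter

# Tiny hypothetical stop word list; a real analysis would use a full list.
STOP_WORDS = {"the", "a", "is", "to", "and", "of", "in", "for", "he"}


def word_counts(tweets):
    """Lowercase the tweets, split them into words, and count non-stop words."""
    words = re.findall(r"[a-z0-9]+", " ".join(tweets).lower())
    return Counter(word for word in words if word not in STOP_WORDS)


def font_sizes(counts, smallest=10, largest=40):
    """Scale font size linearly with the count; the top word gets the largest size."""
    top = max(counts.values())
    return {
        word: smallest + (largest - smallest) * count / top
        for word, count in counts.items()
    }


# Hypothetical tweets about a player, for illustration only.
tweets = ["Legend! Hope he signs for PSV", "PSV is the club for a legend"]

counts = word_counts(tweets)
sizes = font_sizes(counts)
```

Because the words are only counted and never interpreted, club abbreviations such as psv need no special treatment in this sketch.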

4.2.3 Sentiment Analysis

A topic that was already introduced in Subsection 2.1.1 of the preliminaries is sentiment analysis. This is a text mining technique that could be very useful in this particular case. Recall the definition of sentiment analysis, stated by Pang & Lee (2008):

“A sizeable number of papers mentioning ‘sentiment analysis’ focus on the specific application of classifying reviews as to their polarity (either positive or negative), a fact that appears to have caused some authors to suggest that the phrase refers specifically to this narrowly defined task. However, nowadays many construe the term more broadly to mean the computational treatment of opinion, sentiment, and subjectivity in text.” (Pang & Lee, 2008, p. 10)

Besides the introduction and definition of the topic, it was mentioned that sentiment analysis might come in very handy in determining a player’s popularity. If a club is buying a new player, his performance on the pitch is of course of major importance. However, whether or not the player is popular with the fans is also an interesting aspect. Of course, it is important for a football club to keep the fans happy, but having popular players in the squad also has financial benefits. An example is the number of shirts sold with a player’s name on them.

In this project, an attempt is made to create an overview of a player’s popularity by performing sentiment analysis on the tweets about him. To illustrate how this works, an example is added. The following text is a tweet placed by Zlatan Ibrahimović to announce his final home game for Paris Saint-Germain: My last game tomorrow at Parc des Princes. I came like a king, left like a legend. This tweet was retweeted and liked a lot (137,273 and 134,469 times respectively, at the moment of writing). One could imagine that the number of reactions Zlatan Ibrahimović received to this tweet was very large too. To illustrate the interest of sentiment analysis, two of these reactions are briefly discussed. One fan of Ibrahimović reacted to the tweet with the words “Yes @Ibra_official you are a legend”, which clearly shows that this person is positive about Ibrahimović. A second reaction to the tweet reads as follows: “@Ibra_official ur shit and overrated”. This reaction is the opposite of the first one, and shows that the person who placed this tweet thinks rather negatively about Ibrahimović.

These tweets are easy to analyze manually, since there are only two of them. However, when the number of reactions gets high (think in terms of thousands), this process becomes very time consuming. Therefore, sentiment analysis can be performed on the tweets to get an overview of all the different opinions about a player, and to say something about his popularity. Not only is it much quicker than reading through all tweets manually, but it is also less error prone and more consistent, since every tweet is analyzed in the same way. When a large number of tweets needs to be rated with a sentiment score between -1 and 1, with -1 being very negative and 1 being very positive, it is difficult for human beings to maintain the exact same criteria for each tweet, whereas a computer is very good at applying the same criteria to each tweet.
Since a lot of research has been conducted in the field of sentiment analysis, and there are very good off-the-shelf applications to perform sentiment analysis on text in both Dutch and English, sentiment analysis on a player is another aspect that is chosen to be included in the automated background check. An important aspect is the visualization of the sentiment, to provide a useful overview of a player’s popularity. SciSports uses a lot of visual representations in their player reports, so the aspects of the player background check should be clearly visualized as well.
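To make the idea of a sentiment score between -1 and 1 concrete, the sketch below scores the two example reactions with a toy word list. This is not the off-the-shelf application referred to above: the lexicon, the scoring rule, and the function name sentiment_score are hypothetical and serve only as a Python illustration of applying the same criteria to every tweet.

```python
import re

# Hypothetical mini-lexicon; real tools ship far larger, weighted word lists.
POSITIVE = {"legend", "great", "king", "best"}
NEGATIVE = {"shit", "overrated", "bad", "worst"}


def sentiment_score(tweet):
    """Score a tweet from -1 (very negative) to 1 (very positive).

    The score is (positive - negative) / (positive + negative) word counts,
    and 0.0 when the tweet contains no lexicon words at all.
    """
    words = re.findall(r"[a-z]+", tweet.lower())
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    if positive + negative == 0:
        return 0.0
    return (positive - negative) / (positive + negative)


# The two reactions quoted in the text.
reactions = [
    "Yes @Ibra_official you are a legend",
    "@Ibra_official ur shit and overrated",
]
scores = [sentiment_score(reaction) for reaction in reactions]
average = sum(scores) / len(scores)
```

Averaging the scores over all reactions is one simple way to summarize the sentiment about a player; other summaries, such as the share of positive tweets, would work equally well for a visualization.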

Chapter 5

Transforming Text Data into Valuable Information

With the data collected in Chapter 3 and the interesting player characteristics explained in Chapter 4, this chapter focuses on combining the findings of those two chapters to create an automated player background check. This background check should derive the interesting player characteristics from the gathered data, using text mining techniques. In this chapter of the thesis, the third sub-question is answered.

In the previous chapter, several interesting player characteristics were identified, and it was briefly discussed how those characteristics are to be obtained from the gathered data. The actual transformation from the text data into this player information is provided in this chapter. Therefore, the result of this chapter is an automated background check that consists of the following elements:

• An overview of a player’s personality in terms of the Big Five personality factors, based on news articles about this player.

• A list with the twenty most used hashtags and mentions a player uses on Twitter, to get insight in the topics he is talking about and the persons to whom he is talking.

• A word cloud of the most frequently used words in tweets about a player.

• A sentiment analysis on the tweets about a player, to provide an overview of the player’s popularity.

The chapter starts with the news article analysis. Since the bag-of-words approach is the most straightforward way to perform text mining, this approach is used first to investigate whether it suffices for determining personality from text. After a first analysis has been done, the methodology is adjusted to search only for exact matches with the adjectives, instead of matches with their word stems. Finally, a more advanced text mining technique, namely part-of-speech tagging, is used to consider only the actual adjectives in the news articles. The results of all three experiments are evaluated in Chapter 6.

After the analysis of news articles, three ways to extract information from the Twitter data are provided. First, regular expressions are used to construct the list with the most frequently used hashtags and mentions. Second, a word cloud of player mentions is created and it is discussed which valuable information is visible in the result. The chapter ends with a sentiment analysis on the player mentions, together with a discussion on the visualization of the results.


5.1 Personality Extraction Using a Bag-of-Words

As was described in Subsection 2.1.2, there are three main steps required for text mining. Both Turban et al. (2010) and Hu & Liu (2012) described those three steps, which are used as the main guidance in this section. Recall that the three steps to properly perform a text mining analysis are (Turban et al., 2010):

1. Establish the Corpus.
2. Create the Term-Document Matrix.
3. Extract Knowledge.

Note that a big part of the first step was already taken in Chapter 3, since the data was gathered there. In establishing the corpus, it is important that all the unstructured data is put in the same format. This was done by storing the news articles in a database after they were downloaded. From each article, the player name, the url to the web page, the date, the source, the title, and of course the body text were stored in the database.

Where Python was used to download the articles to the database, the R software (https://www.r-project.org/) is used to analyze the content. Therefore, it is required to connect R to the database and load the articles into R. This step is easily performed using the RMySQL library. This package is used to send SQL queries to the database. Note that in this step it is very important to properly set the encoding of the data. The data in the database is stored in UTF-8 format. Therefore it is required to make sure that the data is loaded into R in this same format. If this is not the case, some characters do not appear properly, such as the letter ö. If these characters are not properly represented, this is going to result in problems when the personality adjective coöperatief is to be found in the text corpus, or when articles about Zlatan Ibrahimović (note the acute accent on the letter c) need to be loaded from the database.

A disadvantage of the use of special characters is that not all people use them in the same way. Some people use the letter combination oe instead of ö, or write the name of Ibrahimović without the accent on the c. The latter example can be fixed by stripping the accents from the letters, but cases in which oe is used instead of ö are not taken into account in this thesis. In R, the encoding of a character vector can easily be specified using the Encoding function. A brief overview of how to properly connect to the database and set the encoding of the characters is provided in Algorithm 5.1.

Algorithm 5.1 Loading the articles properly encoded from the database into R.
1: connect to the database
2: send a SQL query to set the names in the database to UTF-8
3: send a SQL query to obtain articles from the database
4: store the obtained articles in a data frame
5: specify the UTF-8 encoding for the variables player name, title, and text
6: close the connection with the database
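The effect of a mismatched encoding, and the accent stripping mentioned above, can be illustrated with a short sketch. It is written in Python using only the standard library and is not the R code used in this project; the function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Decompose characters (NFD) and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Bytes as they would be stored in a UTF-8 encoded database field.
raw = "coöperatief".encode("utf-8")

garbled = raw.decode("latin-1")  # read with the wrong encoding
correct = raw.decode("utf-8")    # read with the matching encoding

plain_name = strip_accents("Zlatan Ibrahimović")
plain_word = strip_accents(correct)
```

Note that stripping the accent turns ö into o, not into oe, so spellings with oe still do not match after this step, in line with the limitation stated above.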

While it seems to be a simple one, step 5 is crucial to get the text properly encoded in the data frame. It is highlighted here, because finding out how the encoding of the articles could be changed in R was found to be a time consuming process.

With the data correctly loaded in R, a text corpus can be created. To perform text mining in R, the freely available text mining package tm was used (http://tm.r-forge.r-project.org/). The tm package provides very useful functionalities to establish the text corpus and to create the term-document matrix, steps 1 and 2 of the general text mining process respectively. The main reasons to choose the tm package are that it is freely available, that it is available for both the Dutch and the English language, and that its functionalities suffice for the steps that need to be taken in this project.

The tm package is used to transform the text data in the data frame into a text corpus. The data in the text column is specified to be the content of each document, while the rest of the data is set to be meta data. Note that it is important to set the language of the text data, Dutch or English in this particular case. This makes it easier to remove stop words later on.

Now that the corpus is established, the next step, creating the term-document matrix, can be taken. The goal of this step is to create a sparse matrix, in which each row represents a document, which is one news article in this case, and each term that occurs in the corpus is represented in a column. The values in the cells of the term-document matrix then represent how often each term is found in each document. Consider the following text corpus, consisting of two very short documents:

• Document 1: Messi is a fantastic football player
• Document 2: Ronaldo is a great player too

The corresponding term-document matrix of this example corpus is provided in Table 5.1.

Table 5.1: An example of a term-document matrix.

            a  fantastic  football  great  is  Messi  player  Ronaldo  too
Document 1  1  1          1         0      1   1      1       0        0
Document 2  1  0          0         1      1   0      1       1        1

Note that this matrix contains a lot of ones, while it was mentioned that the term-document matrix is normally sparse. This is due to the fact that only two documents are analyzed here. One can imagine that the number of unique, or less frequently used, terms quickly becomes larger when more documents are analyzed. In this project, a term-document matrix of one player can already contain almost 400 documents. Additionally, the two documents in this example only have a few words. Naturally, the news articles have much more content.

Note that the two documents in the previous example contain very clean text. There is no punctuation in the sentences and there are no numbers or strange characters in the text. These circumstances are ideal for creating the term-document matrix, but of course not realistic. The actual text in this case does contain punctuation, numbers, and special characters. Therefore, some preprocessing of the corpus is required to clean the data and create a proper term-document matrix.
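As a minimal sketch of how such a matrix can be built, the Python snippet below constructs the matrix of Table 5.1 from the two example documents. It is an illustration only and not the tm function used in this project; a dictionary of lists is used here instead of a sparse matrix for readability.

```python
docs = {
    "Document 1": "Messi is a fantastic football player",
    "Document 2": "Ronaldo is a great player too",
}

# Vocabulary: all distinct terms, sorted case-insensitively as in Table 5.1.
terms = sorted(
    {word for text in docs.values() for word in text.split()}, key=str.lower
)

# One row per document, one column per term; each cell holds a term count.
matrix = {
    name: [text.split().count(term) for term in terms]
    for name, text in docs.items()
}
```

With realistic corpora most cells are zero, which is why text mining packages typically store this matrix in a sparse format.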

5.1.1 Data Preprocessing

The process of data cleaning is visualized in Table 5.2. This is a generally accepted process to clean the data in text mining, and can for example be found in the research of Yu & Wang (2015) and Iglesias, Tiemblo, Ledezma & Sanchis (2016). The data cleaning process transforms each document into a bag-of-words, as described in Subsection 2.1.1. Please note that the double quotation marks in the example text only show the distinction between words, and are not part of the text themselves.

While the tm package is enough to perform all the transformations that are described in Table 5.2, some steps require additional explanation. To start, each of the steps can be accomplished by a function in the tm package, except for the tokenization. However, there is no need to explicitly tokenize the documents, since the tm package tokenizes the documents within the other functions. Also, when a term-document matrix is constructed from the documents after the data cleaning, tm knows that the documents need to be tokenized, since the term axis of the term-document matrix should display only single terms.

Furthermore, it is important to make a note about the removal of punctuation. It was already mentioned earlier that the data used in this project is UTF-8 encoded. However, the function of the tm package to remove punctuation does not recognize some UTF-8 encoded characters as being punctuation. Therefore, it is necessary to define another function that also removes these UTF-8 encoded characters, next to the function to remove punctuation provided by the tm package. The stop words that are deleted from the text corpus can be found in Appendix C.


Table 5.2: The data cleaning process.

Step | Description | Example text
(original text) | | "'I think Lionel Messi is one of the best players in the world,' he said after Barcelona had defeated Real Madrid 2-1."
Remove punctuation | Remove all punctuation from the text, such as dots, commas, etc. | "I think Lionel Messi is one of the best players in the world he said after Barcelona had defeated Real Madrid 21"
Remove numbers | Remove all the numbers from the text, such that only textual data remains. | "I think Lionel Messi is one of the best players in the world he said after Barcelona had defeated Real Madrid"
Tokenization | Distinguish between words inside the document. | "I" "think" "Lionel" "Messi" "is" "one" "of" "the" "best" "players" "in" "the" "world" "he" "said" "after" "Barcelona" "had" "defeated" "Real" "Madrid"
Capitalization conversion | Transform all uppercase letters to lowercase. | "i" "think" "lionel" "messi" "is" "one" "of" "the" "best" "players" "in" "the" "world" "he" "said" "after" "barcelona" "had" "defeated" "real" "madrid"
Remove stop words | Remove frequently used words which do not contain information from the text corpus. | "think" "lionel" "messi" "one" "best" "players" "world" "said" "barcelona" "defeated" "real" "madrid"
Stem the words | Bring words back to their stems such that words with the same meaning can be matched. | "think" "lionel" "messi" "one" "best" "player" "world" "said" "barcelona" "defeat" "real" "madrid"
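The cleaning steps of Table 5.2 can be sketched as a small pipeline. The Python code below is an illustration under stated assumptions, not the tm-based implementation of this project: the stop word list is a hypothetical, heavily shortened one, and naive_stem is a toy stand-in for a real stemming algorithm that only strips a trailing "ed" or "s".

```python
import re
import string

# Hypothetical, heavily shortened stop word list.
STOP_WORDS = {"i", "is", "of", "the", "in", "he", "after", "had"}


def naive_stem(word):
    """Toy stand-in for a real stemmer: strips a trailing 'ed' or 's'."""
    for suffix in ("ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Apply the cleaning steps in order and return a list of stemmed terms."""
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\d+", "", text)                                    # numbers
    tokens = text.split()                                              # tokenization
    tokens = [token.lower() for token in tokens]                       # lowercase
    tokens = [token for token in tokens if token not in STOP_WORDS]    # stop words
    return [naive_stem(token) for token in tokens]                     # stemming


sentence = (
    "'I think Lionel Messi is one of the best players in the world,' "
    "he said after Barcelona had defeated Real Madrid 2-1."
)
bag = clean(sentence)
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotation marks would survive this step. This mirrors the remark above that an additional function is needed to remove UTF-8 encoded punctuation.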


Thirdly, the step in which the words are stemmed requires some further explanation. Word stemming is a widely accepted form of data cleaning within the text mining domain. It reduces words to their stems, such that words with the same meaning are represented by the same word. For example, the words cat and cats are different, but they cover the same thing, except for the fact that the latter one is plural. Stemming would reduce the word cats to cat, such that the singular and the plural words are treated the same. In Table 5.2, word stemming reduces players to player, and defeated to defeat. It is important to consider that word stemming does not have to result in a word that actually has a meaning. For example, both the words conspicuous and conspicuously are reduced to conspicu, which does not have a meaning itself. However, the goal of word stemming is that words are treated in the same way, for example when one is going to perform a word count on them, and not to produce actual words.

A final remark is to be made on the word real, which in this case refers to the first word in the name of the football club Real Madrid. However, real is also a word in the English language itself. Since the bag-of-words approach is used in this case, there is no distinction between the meanings of the word. It would be possible to use named entity recognition (see Subsection 2.1.1) to disambiguate the meanings of the word, but this technique is not used within this experiment. The experiment discussed in Section 5.2, which uses part-of-speech tagging, will distinguish between the word real as an adjective and as a non-adjective.

The result of this preprocessing approach is a list of stemmed terms. When the preprocessing approach is carried out for multiple documents, those terms and documents can be placed in a term-document matrix. So after completing the preprocessing, the first two steps of the general text mining process are completed, and knowledge can be extracted from the data.

In this project, an attempt is made to predict the personality of professional football players in terms of the Big Five, based on news articles that are written about them. The most important input for this process, besides the news articles themselves, is the table created by Saucier & Goldberg (1996), which was introduced in Subsection 4.1.3. The decision to use this specific, but relatively old, table was made because no other study was found that included such a large word list. However, before the table, which can be found in Appendix D, can be used, some preprocessing of this data is required as well:

1. Since the word list is in English, and the text corpus also includes Dutch words, translation is required. This translation is performed using Google Translate (https://translate.google.com/). After that, each translation was checked manually to see if it made sense. Some translations were adjusted. The result of the translating process can be found in Appendix E.

2. While all 435 adjectives are unique in English, the translation process resulted in duplicates in Dutch. To get rid of these duplicates, the mean of the factor loadings of the English adjectives that are translated to one Dutch adjective is taken as the factor loading of the Dutch word. So if there are two English adjectives with a factor loading of 0.5 and 0.1 respectively, and the translation of these adjectives results in the same Dutch adjective, the factor loading of this Dutch adjective is equal to (0.5 + 0.1)/2 = 0.3. After combining the adjectives with the same Dutch translation, 388 Dutch adjectives remain.

3. The list with adjectives provided by Saucier & Goldberg (1996) includes factor loadings. Each of the adjectives is connected to each of the personality factors. However, each adjective is strongly connected to only one (or two) of the personality factors. To simplify the process, it is assumed that an adjective only describes a personality trait when the factor loading is larger than 0.3 or smaller than -0.3. A new table is created, in which only the values -1, 0, and 1 appear: a loading larger than 0.3 becomes 1, a loading smaller than -0.3 becomes -1, and all other loadings become 0.

The result of this pre-processing is a large table that provides an overview of the relationships between adjectives and the Big Five personality traits. These relationships are equal to either 0, -1, or 1. With both the unstructured text data on football players and the table with personality adjectives being prepared, the data can be transformed into information.
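Steps 2 and 3 of this preprocessing can be sketched as follows. The Python code is a minimal illustration, not the implementation used in this project: the Dutch words and their loadings are hypothetical and not taken from Saucier & Goldberg (1996), and the order of the five loadings per word is arbitrary here.

```python
from collections import defaultdict


def merge_translations(rows):
    """Average the loadings of English adjectives sharing one Dutch translation.

    rows: iterable of (dutch_word, loadings) pairs, where loadings is a list
    of five factor loadings.
    """
    grouped = defaultdict(list)
    for dutch, loadings in rows:
        grouped[dutch].append(loadings)
    return {
        dutch: [sum(column) / len(column) for column in zip(*group)]
        for dutch, group in grouped.items()
    }


def to_sign(loading, threshold=0.3):
    """Map a loading to 1, -1, or 0 using the +/- 0.3 threshold from the text."""
    if loading > threshold:
        return 1
    if loading < -threshold:
        return -1
    return 0


# Hypothetical loadings, for illustration only.
rows = [
    ("vriendelijk", [0.6, 0.1, 0.0, -0.5, 0.1]),
    ("vriendelijk", [0.2, 0.1, 0.2, -0.3, 0.1]),
    ("stil", [-0.6, 0.0, 0.1, 0.2, 0.0]),
]

merged = merge_translations(rows)
signs = {
    word: [to_sign(loading) for loading in loadings]
    for word, loadings in merged.items()
}
```

In this sketch the duplicates are merged before thresholding, following the order of the steps above; thresholding first and merging afterwards could give different results for adjectives close to the threshold.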


Table 5.3: A term-document matrix after data preprocessing.

            a  and  is  lionel  man  messi  passion  player  precis  sympathet
Document 1  1  0    1   1       1    1      1        0       0       0
Document 2  1  0    1   1       0    1      0        1       0       1
Document 3  0  1    1   1       0    1      0        0       1       1

Table 5.4: The frequency of personality adjectives in the example corpus.

              passion  precis  sympathet
Lionel Messi  1        1       2

5.1.2 Personality Extraction

Next, the methodology that was applied to transform the obtained data into personality profiles using the bag-of-words approach is described. All the steps are performed in the R programming language, using the tm package to perform text mining techniques on the data.

Before the process is described, it is necessary to make a note with respect to the language in which the news articles are written. For both the Dutch and the English language, the described process is the same, unless specified otherwise. However, the process was executed separately for each language. So the algorithm ran once for the Dutch and once for the English language. Recall that in Subsection 5.1.1 a step was described to translate the adjectives to Dutch. Naturally, the original adjective list was used for the articles that are written in English. Only step 3 of the adjectives preprocessing was carried out for the English adjectives.

The first step that is taken is stemming all the personality adjectives (435 in English, and 388 in Dutch), since the bag-of-words of each player consists of stemmed words only. The personality adjectives can only be found in this bag-of-words if they are stemmed too. Stemming the words requires specifying the language of the words, since the stemming algorithm is different for each language.

The second step is to retrieve all the articles from the earlier created database for each player, put all these articles together in one corpus, and preprocess this corpus in the way that was described in Subsection 5.1.1. The resulting terms are placed in a player-specific term-document matrix. Consider the following text corpus, consisting of three fictitious example articles about Lionel Messi:

• Document 1: Lionel Messi is a passionate man.
• Document 2: Lionel Messi is a sympathetic player.
• Document 3: Lionel Messi is sympathetic and precise.

The resulting term-document matrix of this text corpus, after preprocessing the data, is visible in Table 5.3.

Thirdly, the number of columns in the term-document matrix is reduced, such that only the stemmed personality adjectives remain. Therefore, the maximum number of columns of the term-document matrix after this step is equal to 435 for the English language, and equal to 388 for the Dutch language. This matrix can be useful to see which articles contain many personality adjectives and are therefore expected to say something about a player’s personality. Since the goal is to generate one personality profile of the player, and not a personality profile per article, all term frequencies are summed and stored in a vector. The (named) vector for the above described text corpus of Lionel Messi is visible in Table 5.4.

The next step is to translate these personality adjectives to an actual personality profile. To do so, it is determined first how important each personality adjective is for one player, by transforming the absolute adjective frequencies to relative frequencies. This is done by normalizing the data.
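The reduction to personality adjectives, the summation over documents, and the conversion to relative frequencies can be sketched for the Lionel Messi example as follows. This Python snippet is an illustration only, not the R code of this project; it assumes that the relative frequency of an adjective is its count divided by the total number of personality adjectives found for the player.

```python
# Table 5.3 as term counts per document (zero counts are left out).
tdm = {
    "Document 1": {"a": 1, "is": 1, "lionel": 1, "man": 1, "messi": 1, "passion": 1},
    "Document 2": {"a": 1, "is": 1, "lionel": 1, "messi": 1, "player": 1, "sympathet": 1},
    "Document 3": {"and": 1, "is": 1, "lionel": 1, "messi": 1, "precis": 1, "sympathet": 1},
}

# Stemmed personality adjectives that occur in this toy corpus.
adjectives = ["passion", "precis", "sympathet"]

# Keep only the adjective columns and sum them over all documents (Table 5.4).
totals = {
    adjective: sum(document.get(adjective, 0) for document in tdm.values())
    for adjective in adjectives
}

# Relative frequency: share of each adjective in the player's adjective total.
grand_total = sum(totals.values())
relative = {
    adjective: (count / grand_total if grand_total else 0.0)
    for adjective, count in totals.items()
}
```

The guard on grand_total is a choice of this sketch, so that a player without any personality adjectives in his articles yields zeros instead of a division error.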


Table 5.5: Example of a matrix containing relative frequencies of personality adjectives.

          passion  precis  sympathet
Player A  0.25     0.25    0.5
Player B  0.5      0       0.5
Player C  0.33     0.33    0.33

In Subsection 2.1.2 of the preliminaries, the term frequency-inverse document frequency (TF-IDF) normalization method was mentioned. This normalization method counts how often a term occurs in a document, but multiplies this frequency with the inverse document frequency, which is a factor containing information about how often the term occurs in the whole corpus. Words that are used in a lot of documents in the corpus, such as the word the, have a very low inverse document frequency, which results in a low TF-IDF for this term. On the other hand, words that are very rare in the text corpus are assigned a high inverse document frequency, such that when these words appear in a document, their high TF-IDF stands out.

Although the TF-IDF is the most important and most used normalization method in text mining, it is not used in this particular case. The most important reason for this decision is that words that appear only once are assigned a very high TF-IDF value, while more common words are considered to be not important. This would imply that a word such as flirtatious, which appeared once in the 29,520 English news articles that were used for this project, would receive a very high TF-IDF rating, while the factor loadings of this adjective on the five personality factors are relatively low (0.36 on extraversion is the highest). For comparison, the adjective social appeared 1024 times in the whole corpus of English articles, and the factor loading of the word social on extraversion is much higher than the one of flirtatious, namely 0.58. From this, the conclusion is drawn that the word social tells us more about a player’s personality than the word flirtatious, and therefore the decision is made to not use the TF-IDF. If the TF-IDF had been used, the only occurrence of flirtatious would have received a rating that is too high, and the rating of social would be too low.
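The size of this effect can be made concrete with a small calculation. The Python sketch below uses one common variant of the inverse document frequency, the natural logarithm of the number of documents divided by the document frequency; the thesis does not state which variant is meant, so this choice is an assumption. It is also assumed that every reported occurrence of a word lies in a different article, so that the occurrence counts can serve as document frequencies.

```python
import math

N_DOCS = 29520  # number of English news articles, as reported in the text


def idf(doc_freq, n_docs=N_DOCS):
    """Inverse document frequency, ln(N / df); one common variant (assumption)."""
    return math.log(n_docs / doc_freq)


# Assumption: each occurrence is in a different article, so the reported
# occurrence counts are used as document frequencies.
idf_flirtatious = idf(1)
idf_social = idf(1024)

# For a single occurrence in a document (term frequency 1), the TF-IDF
# equals the IDF, so the rare word is weighted more heavily.
ratio = idf_flirtatious / idf_social
```

Under these assumptions, a single occurrence of flirtatious is weighted roughly three times as heavily as a single occurrence of social, even though social has the higher factor loading on extraversion.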
Instead of using the TF-IDF, a value between 0 and 1 for each adjective was determined based on the total adjective count of that player. If a specific personality adjective is used once for a player, and in total 100 personality adjectives occur in the articles about this player, the relative frequency of that adjective is equal to 1/100 = 0.01. In the Lionel Messi example, these relative frequencies would be 0.25, 0.25, and 0.5 for passion, precis, and sympathet respectively. Calculating these percentages ensures that the data is normalized in the same way for all the players. Whether the text corpus about a player contains 50 personality adjectives, or 100, each adjective is assigned a value between 0 and 1 to indicate how important that adjective is for that player, relative to the other personality adjectives. The result of the above described steps is a vector including 435 (388 for the Dutch players) values from 0 to 1 per player. A zero indicates that this adjective is not mentioned in articles about this player, and a 1 implies that all the personality adjectives that occur in articles about that player are the same one. This vector can be constructed for each one of the players, and combining these vectors results in a matrix that provides an overview of the importance of each adjective for each player. If Lionel Messi is called Player A from now on, and news articles about other players than him (Player B and Player C ) are analyzed and added to the previous example, a part of this matrix could look like the one constructed in Table 5.5. Recall that this table contains fictitious values, and that those values are not the actual result of the project. Before this large matrix can be transformed into a personality profile, it is necessary to inspect the data. First, all the empty columns are deleted. These adjectives did not appear in the entire text corpus and do therefore not contain any information. 
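The normalization and the removal of empty columns can be sketched as follows. This is a minimal illustration and not the code used in the project: the counts are hypothetical values chosen to reproduce the fictitious numbers of Table 5.5, and the stem name unusedstem is invented to show an empty column.

```python
def relative_frequencies(counts):
    """Divide each adjective count by the player's total adjective count."""
    total = sum(counts.values())
    if total == 0:
        # No personality adjectives found for this player.
        return {stem: 0.0 for stem in counts}
    return {stem: n / total for stem, n in counts.items()}

# Hypothetical counts; 'unusedstem' never occurs for any player.
counts_per_player = {
    "Player A": {"passion": 1, "precis": 1, "sympathet": 2, "unusedstem": 0},
    "Player B": {"passion": 2, "precis": 0, "sympathet": 2, "unusedstem": 0},
    "Player C": {"passion": 1, "precis": 1, "sympathet": 1, "unusedstem": 0},
}

# One row (vector) of relative frequencies per player.
matrix = {p: relative_frequencies(c) for p, c in counts_per_player.items()}

# Delete empty columns: stems that do not occur for any player.
stems = list(next(iter(matrix.values())))
used = [s for s in stems if any(row[s] > 0 for row in matrix.values())]
matrix = {p: {s: row[s] for s in used} for p, row in matrix.items()}

for player, row in matrix.items():
    print(player, row)
```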
This resulted in the removal of 27 English and 111 Dutch adjectives. The big difference in these numbers can be explained by the translation of the adjectives. Apparently, literally translating the adjectives results in Dutch words that are rarely used. The adjectives that did not appear in the text corpus are provided in


Table 5.6: The ten most frequently used personality adjectives stems.

Personality adjective   Adjective stem   Average importance
playful                 play             19.8%
worldly                 world            7.0%
competitive             competit         2.8%
helpful                 help             2.7%
decisive                decis            2.5%
defensive               defens           2.2%
possessive              possess          1.9%
forceful                forc             1.9%
confident               confid           1.8%
suggestible             suggest          1.8%

Appendix F.1.1. Second, it is important to manually check whether the frequency of use of the word stems is realistic. Since different words can share the same stem, a stem may occur very frequently while it is unlikely that it represents the personality adjective in all cases. For this inspection, the mean of each column is taken, to see whether an adjective is structurally used more frequently. This mean indicates how important an adjective is, on average, for describing a player. In the example case, the word stem passion accounts on average for 36 percent ((0.25 + 0.5 + 0.33)/3) of a player’s personality. In this example, the percentages of all three adjective stems are rather large, since only three players are analyzed using three different adjectives. In the actual data set, those percentages are more unequally distributed. The ten highest average percentages for the English articles are provided in Table 5.6. On average, almost 20 percent of all occurrences of the 435 personality adjectives for a player is the stemmed word of playful. So a single adjective, about 0.25 percent of the list, determines on average 20 percent of a player’s personality if this word is not removed. The explanation for this is that the word stem of playful is the word play. Those 20 percent therefore do not consist only of the word playful: play is a word itself, and it is also the stem of words such as playing and plays. Since all news articles are about football players, who play a lot of football matches, it makes sense that the word play and its inflections are used a lot. Therefore, it seems fair to assume that these 20 percent consist mostly of words other than playful, and it is decided to remove this adjective and its stem from the word list. The same line of reasoning applies to the other nine adjective stems in Table 5.6.
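The column-mean inspection and the removal of over-represented stems can be sketched as below. The function names and the second toy matrix are assumptions made for illustration; the 1 percent threshold is the one used in the text. Whether the remaining values were renormalized after removal is not stated, so this sketch leaves them unchanged.

```python
def average_importance(matrix):
    """Column means: average relative frequency of each stem over all players."""
    players = list(matrix)
    stems = list(matrix[players[0]])
    return {s: sum(matrix[p][s] for p in players) / len(players) for s in stems}

def drop_frequent(matrix, threshold=0.01):
    """Remove stems whose average importance is higher than the threshold."""
    means = average_importance(matrix)
    keep = [s for s, m in means.items() if m <= threshold]
    return {p: {s: row[s] for s in keep} for p, row in matrix.items()}

# Fictitious values of Table 5.5.
TABLE_5_5 = {
    "Player A": {"passion": 0.25, "precis": 0.25, "sympathet": 0.5},
    "Player B": {"passion": 0.5, "precis": 0.0, "sympathet": 0.5},
    "Player C": {"passion": 0.33, "precis": 0.33, "sympathet": 0.33},
}
means = average_importance(TABLE_5_5)

# Hypothetical matrix in which the stem 'play' dominates, as in Table 5.6.
toy = {
    "Player X": {"play": 0.20, "sympathet": 0.006},
    "Player Y": {"play": 0.19, "sympathet": 0.004},
}
filtered = drop_frequent(toy)
print(means, filtered)
```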
Some of the adjectives are more likely to describe something other than a football player, such as the word defensive, or their stems are very common and are also the stems of other, frequently used, words, such as the stems of worldly and suggestible. In fact, not only these ten, but many more adjectives with a high average importance seemed to have this problem. Therefore, it was decided to remove all the words with an average importance higher than 1 percent. This process was repeated until the word with the highest average importance was a word that is most likely to describe a football player, instead of being a stem of a frequently used word or a word that is very common in football. After one iteration, a total of 50 English and 51 Dutch adjectives were removed from the lists. These removed adjective stems are provided in Appendix F.2.1. The result of the manual check is the same matrix as was discussed earlier, with the players on one axis and the personality adjectives on the other, but with only 358 of the 435 original English adjectives, and 226 of the 388 Dutch adjectives. The values in these two matrices are then multiplied with the preprocessed version of the matrix constructed by Saucier & Goldberg (1996), containing the values -1, 0, and 1, to generate a personality profile for each one of the players. So if the adjective sympathetic accounts for 0.5 percent of all personality adjectives used to describe a player, this results in a score of 0.005 for the personality trait Agreeableness. After repeating this for every


Table 5.7: Example of a matrix containing the personality scores of one player.

              Agreeableness   Extraversion   Conscientiousness   Neuroticism   Openness
passionate    0.25            0              0                   0             0
precise       0               0              0.25                0             0
sympathetic   0.5             0              0                   0             0

personality adjective, a table like the one in Table 5.7 can be constructed. This table represents the personality of Player A from the previous example. This table can now be generated for each player. The values in each column can be combined to create an overview of the personality of each player. This is done by taking the mean of all non-zero values in a column. According to the example of Table 5.7, Player A would be assigned a value of 0.375 for Agreeableness and a value of 0.25 for Conscientiousness, but would receive a not-a-number (NaN) value for the other personality traits. When this occurs, the player is removed from the data set, since not enough data was gathered to create a personality profile of him. The actual results of this step are provided in Appendix G.1. For most of the players, a primary personality profile has been constructed. However, the small numbers in Appendix G.1 do not mean much by themselves. To form a proper view of a player’s personality, a comparison is to be made between the different players. When this is done, the information and calculated numbers become much more meaningful and easier to interpret. The most obvious way to make a comparison between the players is to determine the mean and standard deviation, and see whether a player scores higher or lower than the average football player on each personality trait. The following process is carried out separately for the players playing in the Netherlands and the players not playing in the Netherlands, but only the latter will be described here. This implies that a player playing in the Netherlands is compared to all other players playing in the Netherlands, and not to the other players. As a starting point, the values in the columns of all personality traits are scaled to provide more insight into the data. Scaling does not change the information in the data, but only puts the data on another scale.
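The aggregation into trait scores described earlier in this paragraph can be sketched as follows. The sign matrix below is a hypothetical three-row excerpt, chosen only so that the output matches the fictitious Table 5.7; it is not the actual preprocessed matrix of Saucier & Goldberg (1996).

```python
import math

TRAITS = ["Agreeableness", "Extraversion", "Conscientiousness",
          "Neuroticism", "Openness"]

# Hypothetical excerpt of the sign matrix (-1, 0, 1), one row per adjective.
SIGNS = {
    "passionate":  [1, 0, 0, 0, 0],
    "precise":     [0, 0, 1, 0, 0],
    "sympathetic": [1, 0, 0, 0, 0],
}

def trait_scores(rel_freq, signs=SIGNS):
    """Mean of the non-zero (relative frequency * sign) products per trait."""
    scores = {}
    for j, trait in enumerate(TRAITS):
        products = [f * signs[adj][j] for adj, f in rel_freq.items()]
        nonzero = [v for v in products if v != 0]
        # No non-zero value means no information: not a number.
        scores[trait] = sum(nonzero) / len(nonzero) if nonzero else math.nan
    return scores

player_a = {"passionate": 0.25, "precise": 0.25, "sympathetic": 0.5}
scores = trait_scores(player_a)

# A player with any NaN trait is removed from the data set.
has_full_profile = not any(math.isnan(v) for v in scores.values())
print(scores, has_full_profile)
```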
For all five personality traits, the mean of the scaled values is very close to zero, and the standard deviation is equal to one. Scaling the data in this way ensures that the personality profiles of players can be visualized more easily. It also ensures that the scale and the mean value are the same for each of the five personality traits. The scaled personalities are provided in Appendix H.1. The scaled data is very easy to interpret. A value of 0 means that a player scores average on this personality trait, compared to the other football players in the data set. A value of -x or x implies that this player scores x times the standard deviation below or above the average, respectively. The personality profiles provided in Appendix H can then be put in a visualization to see the personality profile of a player at a glance. Such a visualization, rather than the raw numbers, could be added to a player background check by SciSports. Since there are five dimensions, all on the same scale, a radar chart is chosen to create this visualization. Two of those radar charts, one for a player playing in the Netherlands and one for a player playing outside the Netherlands, are shown in Figure 5.1. Recall the five personality factors and their opposites that were discussed in Section 2.2 of the preliminaries to properly read the figure:

• Agreeableness versus Antagonism.

• Conscientiousness versus Lack of Direction.

• Extraversion versus Introversion.

• Neuroticism versus Emotional Stability.

• Openness to Experience versus Closedness to Experience.


Figure 5.1: Personalities determined by the bag-of-words approach with word stemming.


Figure 5.2: Overview of the interpretation of the radar charts.

Below the two radar charts some additional numbers can be found to see on what information the plot is based. First, the number of articles about this player that is in the database is provided. Second, the total number of occurrences of personality adjectives is provided. This number is equal to the sum of the numbers in the player vector created in Table 5.4. Since some personality adjectives appear more than once, the number of unique personality terms is provided as well, which is equal to the length of the vector created in Table 5.4. The first mentioned extreme of each personality trait is represented on the outer line of the radar, while the second mentioned extreme is on the inner line. Note that the radars have four segments. The middle dotted line represents the mean, in this case the value 0. The other dotted lines represent the number of standard deviations from the mean. If a player scores more than two standard deviations above or below the mean on one of the personality traits, the point on the radar is on the maximum or the minimum line, but the actual value is provided below the name of the personality trait. For most players, a value between -2 and 2 is found, which makes sense considering that the majority of the players should lie within two standard deviations of the mean, a range of four standard deviations in total. In most cases where a player scores extremely high or extremely low, only a few articles were retrieved about this player. This makes sense, since fewer articles imply fewer personality adjectives that can be found in those articles. If only a few adjectives are found for one player, their relative importance becomes much higher. To get a better understanding of the interpretation of the radar charts, Figure 5.2 is added. Note that the blue line is the mean (µ), and therefore the 0 line. The other lines represent the number of standard deviations (σ).
So when a player scores the value 2 on Openness to Experience, this is plotted on the red line and implies that this player is very open to new experiences compared to other players. On the other hand, when a player scores a value of -2 on this same factor, this is plotted on the pink line and implies that this player scores high on Closedness to Experience. Therefore, a negative value does not indicate that something is bad, but only shows that the player scores high on the other extreme of a personality trait. Figure 5.1 is a first result of the personality construction of a player. Other experiments and the validation in Chapter 6 must give insight into how accurate those first results are.
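The scaling step and the clipping of extreme values on the radar chart can be sketched as follows. The raw scores are hypothetical, and the use of the sample standard deviation is an assumption, since the thesis does not state which estimator was used.

```python
import statistics

def scale(values):
    """Standardize values to mean 0 and standard deviation 1 (z-scores)."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mu) / sigma for v in values]

def radar_position(z, limit=2.0):
    """Clip a z-score to the outer or inner line of the radar chart."""
    return max(-limit, min(limit, z))

# Hypothetical raw Agreeableness scores of five players.
raw = [0.010, 0.012, 0.008, 0.030, 0.011]
z = scale(raw)

# The plotted position is clipped; the actual value is shown as a label.
plotted = [radar_position(v) for v in z]
print([round(v, 2) for v in z], [round(v, 2) for v in plotted])
```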


5.1.3 Personality Extraction without Word Stemming

As was described in the previous experiment, some difficulties arose in the methodology. While stemming the words in a text corpus is a very common way to preprocess the data in text mining, this resulted in unrealistic relative occurrences of personality adjectives. The most extreme example was the stem of the word playful, which is the word play. It was described why it is not strange that this stem was found in the text corpus a lot, and that not all occurrences were the actual stem of the word playful. To overcome this issue, the personality stems with a very high relative frequency were analyzed and removed. There are two downsides to removing the word stems with a high frequency from the corpus. First, the approach is not consistent. The stem play is removed from the corpus because its relative frequency is considered to be very high, but stems with a lower frequency are left untouched, while it is not certain that all their occurrences are the actual stems of the personality adjectives. For example, the stem of the personality adjective emotional is emot. However, the stem of the word emotion is emot too. This stem is kept in the corpus while other ones are removed, although it is unlikely that each occurrence of emot refers to the actual word emotional. Therefore, it is inconsistent to remove only the most frequently used terms, just because their stem is also the stem of frequently used words other than the personality adjective. Second, removing the stems means removing possible information. While the word playful does tell something about a player’s personality, it is not used anymore, just because other words with the same stem occur too often. Considering the number of words that were removed from the corpus (Appendix F.2), it is likely that removing all those terms that could tell something about a football player results in a distorted picture of his personality.
To get rid of those two downsides of the earlier proposed methodology, an adjustment to the experiment was made. Instead of looking for the stems of personality adjectives in the text corpus, only exact matches are taken into account. Considering the word playful, the text corpus is searched for the actual word playful instead of its stem play. All steps of data preprocessing described in Subsection 5.1.1 and the methodology described in Subsection 5.1.2 are applied again, except the steps in which the personality adjectives or the words in the text corpus are stemmed. Again, some words did not appear in the corpus and were removed from the term-document matrix (41 and 150 for the English and the Dutch articles, respectively). These words are provided in Appendix F.1.2. Also in this experiment, the words that accounted for more than 1 percent of the total adjective occurrences were removed, and one iteration of this process was performed. This time, the words are not removed because their stem is also the stem of other words which might occur often, but because those words are believed to describe other football-related phenomena rather than a player. For example, it is assumed that the word defensive is used to describe a team’s style of playing, rather than a player’s personality. A total of 51 English and 46 Dutch adjectives were removed from the text corpus because they appeared too frequently. A list of those adjectives can be found in Appendix F.2.2. All results of the experiment without word stemming are provided in Appendix G.2 (before scaling) and Appendix H.2 (after scaling). Two radar charts of the results are provided in Figure 5.3. Just as in Figure 5.1, the plots of the players Dirk Kuijt and Lionel Messi are provided, such that a comparison can be made. A first remarkable difference between the results of the two experiments is the reduction in the number of personality terms.
This can be explained by the fact that a search is performed on exact word matches, instead of their stems. These exact matches appear less frequently. Furthermore, the results of both experiments in terms of personality profiles differ a lot. At this point, it is unknown which of the two experiments provides the most accurate results, but this is tested and discussed in Chapter 6.
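The difference between the two bag-of-words variants can be illustrated with a toy sentence. The sketch below does not use a real stemmer: matching every token that starts with play is only a crude stand-in for stem matching, and the sentence is invented.

```python
import re

# Invented example sentence.
text = "He plays with a playful style and was playing well; play resumed."

# Lowercase tokens consisting of letters only.
tokens = re.findall(r"[a-z]+", text.lower())

# Crude stand-in for stem matching (NOT a real stemmer): it only mimics
# that 'plays', 'playing' and 'play' share the stem 'play' with 'playful'.
stem_matches = [t for t in tokens if t.startswith("play")]

# Exact matching, as in the experiment without word stemming.
exact_matches = [t for t in tokens if t == "playful"]

print(len(stem_matches), len(exact_matches))
```

In this toy sentence the stem-based count is inflated by football-related uses of play, while the exact match counts only the personality adjective itself, at the cost of missing inflected forms.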


Figure 5.3: Personalities determined by the bag-of-words approach without word stemming.


5.2 Personality Extraction Using Part-of-Speech Tagging

Before the results are evaluated, one final experiment is conducted. Both experiments described in Section 5.1 use the bag-of-words approach, which is considered to be a fairly basic approach, as was described in Section 2.1 of the preliminaries of this thesis. Since this project is a first attempt to construct a personality profile based on news articles about a person, such a basic approach seems a good starting point. However, it is likely that a more advanced approach is needed to get better results. Therefore, such a more advanced approach is applied to the data in this section, to compare the resulting personality profile to the two earlier created profiles based on the bag-of-words approach. Since the table of Saucier & Goldberg (1996) consists of personality adjectives, it would make sense to only look for adjectives in the text corpus, instead of all the words. To do so, a technique is required to assign a part-of-speech tag to each word in the text corpus. Several challenges of Natural Language Processing were discussed in Subsection 2.1.1, one of which was referred to as part-of-speech tagging. Part-of-speech tagging could be used to filter the adjectives out of the text corpus and construct a term-document matrix with adjectives only. From a theoretical point of view, part-of-speech tagging could improve the results of the earlier conducted experiments. Compared to the bag-of-words approach with word stemming, it does not stem all words in the corpus, just the adjectives. Therefore, personality adjectives having a stem that could also be the stem of a non-adjective do not appear unrealistically more often than the stems of the other adjectives, as happened in the example of playful, whose stem is play. So word stemming can still be applied, and less information is lost due to the deletion of adjectives from the text corpus.
Compared to the bag-of-words approach without word stemming, an advantage of part-of-speech tagging could be that inflections of the same adjective are taken into account, because word stemming can be applied to all adjectives. Including inflections of an adjective was not possible in the second experiment, since a search was performed on exact word matches. The methodology applied in this third and final experiment largely corresponds to that of the previous two experiments. The major difference is that each article is analyzed first, and a part-of-speech tag is assigned to each word in the article, before the term-document matrix is constructed. This term-document matrix then consists of adjectives only. After that, the same steps can be applied to arrive at a personality profile of a player. An off-the-shelf part-of-speech tagger was used in this project. As was already indicated, part-of-speech tagging is known as a challenge in Natural Language Processing, and is therefore not always accurate yet. Even the state-of-the-art part-of-speech taggers, three of which were mentioned in Subsection 2.1.1, have not achieved an accuracy of 100 percent. In this project, the openNLP package in R was used (https://cran.r-project.org/web/packages/openNLP/openNLP.pdf), which is an interface to the Apache OpenNLP tools. Apache OpenNLP is a widely used toolkit for text mining tasks such as part-of-speech tagging. It does not contain the best part-of-speech tagger currently available (Horsmann, Erbs & Zesch, 2015), but it is easy to access and its performance is sufficient for the purposes of this project. Also in the applied methodology of this experiment, some adjectives did not occur in the entire text corpus (39 English and 174 Dutch). Those adjectives are listed in Appendix F.1.3.
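The adjective filter can be sketched on already tagged text. The tagged tokens below are written by hand for illustration and are not output of the openNLP tagger; the sketch assumes Penn Treebank style tags, in which JJ, JJR, and JJS mark adjectives.

```python
# Hand-written (token, tag) pairs; NOT actual tagger output.
tagged = [
    ("The", "DT"), ("playful", "JJ"), ("striker", "NN"),
    ("plays", "VBZ"), ("a", "DT"), ("more", "RBR"),
    ("defensive", "JJ"), ("role", "NN"), (".", "."),
]

# Penn Treebank adjective tags: base, comparative, superlative.
ADJECTIVE_TAGS = {"JJ", "JJR", "JJS"}

# Only adjectives enter the term-document matrix; the verb 'plays'
# is dropped before any stemming takes place.
adjectives = [token.lower() for token, tag in tagged if tag in ADJECTIVE_TAGS]
print(adjectives)
```

Because the verb plays is discarded before stemming, the stem play can no longer be inflated by non-adjective uses.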
Despite the fact that only adjectives are considered, and no ambiguous stems of other words should be included here, it was decided to remove the adjectives that accounted for more than 1 percent of all personality adjective occurrences in the text corpus, because those adjectives, such as competitive, defensive, and friendly, are assumed to describe other football-related phenomena rather than the personality of a player. However, whereas the process of deleting every personality adjective that accounts for more than 1 percent of the total was repeated once in the previous experiments, this is not the case in this one. The personality adjectives that were deleted in this step, 25 English and 21 Dutch, are presented in Appendix F.2.3. The results of the part-of-speech tagging experiment are to be found in Appendix G.3 (before scaling) and Appendix H.3 (after scaling). Again, the radar charts of the same two players are provided in Figure 5.4.

Figure 5.4: Personalities determined using part-of-speech tagging.

The number of personality adjectives that is used for each player is higher than in the bag-of-words approach looking only at exact matches, but lower than the number of personality adjectives that was found using the bag-of-words approach including word stemming. This could imply that this advanced approach works better than the bag-of-words approach, since the first experiment had an adjective count that was too high, and the second experiment had an adjective count that was too low. One thing that is remarkable, not only for this experiment, but for all three, is the difference between the analyses on Dutch and English news articles. First, the number of Dutch adjectives that is not used is much higher than the number of English ones in all experiments. Second, the number of personality terms used to describe a player is much lower for the players analyzed in Dutch articles. This is probably due to the fact that the translation of the personality adjectives is not accurate enough, or that the literally translated adjectives are rarely used in common Dutch. Unfortunately, there is no word list available in Dutch that is comparable to the word list provided by Saucier & Goldberg (1996), but it seems that the literal translation of the adjectives is not good enough. Again, it is not possible at this point to say whether the more advanced part-of-speech tagging approach is more accurate than the two earlier conducted experiments. An evaluation of all experiments is provided in the next chapter.

5.3 Twitter Analysis

This section contains the analysis of the Twitter data. The three analyses on the Twitter data that were discussed in Section 4.2 are executed here and the results are discussed. First, the proposed regular expressions are applied to the Twitter data to discover the most frequently used hashtags and mentions of players. Second, word clouds of the Twitter data are established and evaluated. Third, a more advanced technique is applied to the Twitter data. Sentiment analysis is performed on the tweets to get an overview of the popularity of a player.

5.3.1 Regular Expressions

An introduction to regular expressions and why they could be of interest for the analysis of tweets was provided in Subsection 4.2.1. Here, regular expressions are applied to the Twitter data that was obtained in this project. The goal is to extract the most frequently used hashtags and mentions of other Twitter users for the players on which Twitter data was obtained. If this methodology can be applied to these players, it can be applied to any football player with a Twitter account. This part of the analysis of Twitter data is carried out in the R programming language. First, the tweets of a player are loaded into R in a data frame. Since all hashtags and mentions of other users are in the text variable of the data, this is the only variable that needs to be considered here. A disadvantage of the Twitter data in this particular case is the encoding. As was discussed in Section 3.2 of this thesis, the tweets are gathered using Python 2.7, which does not use UTF-8 as its default encoding, instead of Python 3.5, which does. This has the disadvantage that tweets are not always encoded properly, and may contain strange characters. Python 2.7 was used because this simplifies the sentiment analysis of tweets, as is discussed in Subsection 5.3.3. After the tweets are loaded into R, the analysis can start. The applied methodology is the same for both mentions and hashtags; the only difference is the first character of the regular expression that is used, which is a @ for a mention and a # for a hashtag. In this description, only the process to extract hashtags is described. The regular expression #[A-Za-z0-9]+ (an underscore is added to the regular expression of a mention, since some players use an underscore in their user name) is used to search for hashtags in the text of each tweet.
If there are one or more hashtags found in the text, those are stored in a new variable called hashtags. If there are no hashtags in the text, the value of this variable is set to 0. After that, a text corpus can be created from the hashtags variable, in which the number of documents is equal to the number of tweets a player has placed and the content of those documents


Table 5.8: Twenty most used hashtags and mentions of player Zlatan Ibrahimović.

      Hashtags            Frequency   Mentions            Frequency
 1    #daretozlatan       58          @psg_inside         22
 2    #madebysweden       12          @nikefootball       13
 3    #805millionnames    7           @volvosverige       12
 4    #xboxone            5           @alexthemauler      3
 5    #behindthescenes    4           @ibra_official      3
 6    #riskeverything     4           @psg_sverige        3
 7    #vitaminwellplus    4           @wfp                3
 8    #bethedifference    2           @101greatgoals      1
 9    #maxmartin          2           @aalasfoor10        1
10    #theone             2           @alotibeh           1
11    #titanfall          2           @amirunsalleh       1
12    #vemibra            2           @areeb23            1
13    #wfp                2           @armouredguy        1
14    #zlatan             2           @at_sunshine        1
15    #4thegamers         1           @ayyeitsarv         1
16    #chepsg             1           @azsportswear       1
17    #daretomeetzlatan   1           @bailofwz           1
18    #elleman2           1           @christiaaannzz     1
19    #fcnpsg             1           @citralistyarini    1
20    #fierdeparis        1           @coletteparis       1

being the extracted hashtags. Similar to the text corpus of news articles, a term-document matrix can be constructed after the data is cleaned. The data cleaning is performed by converting the content to lowercase, which is one of the steps described in Table 5.2. The other steps are not necessary, since they are already covered by the formulation of the regular expression. After the completion of the first step of the text mining process, establishing the corpus, a term-document matrix can be constructed. This term-document matrix is a sparse matrix containing information about how often each hashtag is used in each tweet. Summing the columns results in a total frequency of use of each hashtag. Finally, these frequencies can be sorted, such that the most frequently used hashtags stand out immediately. The results of this approach for the tweets of Zlatan Ibrahimović are presented in Table 5.8. The results of the other players can be found in Appendix I.1. This table contains quite some information about the Twitter behavior of this player. A lot of hashtags and mentions refer to names of sponsors. For example, the second most used hashtag madebysweden and the third most used mention volvosverige show that quite a few tweets of Ibrahimović refer to car manufacturer Volvo, which is not strange considering the fact that Ibrahimović often appears in television commercials of this manufacturer. A lot of other brands (most likely sponsors) are represented in the list. Of course, this list differs for each player, and whether it is useful or not is case specific.
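The extraction and counting steps can be sketched as follows. The analysis in the thesis was carried out in R with a term-document matrix; this Python version is an assumed equivalent that counts matches directly. The hashtag pattern is the one given in the text, the mention pattern adds the underscore as described, and the tweets are invented.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#[A-Za-z0-9]+")   # pattern given in the text
MENTION = re.compile(r"@[A-Za-z0-9_]+")  # same pattern plus underscore

# Invented example tweets.
tweets = [
    "Ready for tonight #DareToZlatan @PSG_inside",
    "New boots from @nikefootball #daretozlatan #RiskEverything",
    "No tags in this one",
]

def count_matches(pattern, texts):
    """Count lowercased matches of the pattern over all tweet texts."""
    counts = Counter()
    for text in texts:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts

hashtags = count_matches(HASHTAG, tweets)
mentions = count_matches(MENTION, tweets)

# Sort by frequency and keep the twenty most used, as in Table 5.8.
top_hashtags = hashtags.most_common(20)
top_mentions = mentions.most_common(20)
print(top_hashtags, top_mentions)
```

Tweets without a match simply contribute nothing to the counter, which replaces the placeholder value 0 used in the R data frame.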

5.3.2 Word Cloud

In the previous chapter, it was discussed that the words people use when talking about a certain player could also contain valuable information to add to a player background check. An easy visualization technique, namely creating a word cloud, is used to realize this. Just like in the regular expressions part, it is mostly a matter of structuring and cleaning the data well and creating a clear overview out of it, but it could be very useful to add to a background check in some cases.

Figure 5.5: Example of a word cloud, constructed from Zlatan Ibrahimović mentions.

Creating a word cloud requires the same steps as were performed during the regular expressions approach, with the exception that the corpus of documents consists of all tweet text, instead of the hashtags only. The process is again executed in the R programming language. After the establishment of the corpus, the data cleaning process presented in Table 5.2 is executed, with the exception of the step in which the words are stemmed, and one extra step is added at the beginning. Before anything in the text corpus is removed or transformed, the Twitter username of the player preceded by a @ is removed, since this string appears in every tweet, and would therefore strongly affect the look of the word cloud. Recall that all tweets about a player are mentions, and mentions always contain @username to show which user is mentioned. So in all tweets about Zlatan Ibrahimović, the string @Ibra_official appears. It is necessary to delete these strings before starting the rest of the data cleaning process. If this is done later, punctuation and numbers have already been removed, including the @ and the underscore in the username, so the command to remove all @Ibra_official strings no longer has any effect. The cleaned data set now contains only lowercase letters, separated by a space. It is easy to construct a word cloud when the data is in this format, using the wordcloud package in R. The results for mentions of Zlatan Ibrahimović are presented in Figure 5.5. The word clouds of the other players are provided in Appendix I.2. The one of Ibrahimović is provided here, because those results are the most interesting considering a player background check. The size of the words represents the number of times they are mentioned in the entire text corpus. The goal of the colors of the words is to strengthen this effect.
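Why the username has to be removed first can be shown with a small sketch. The thesis performed the cleaning in R; the Python functions below are an assumed equivalent that covers only the steps named in this subsection (username removal, lowercasing, removal of punctuation and numbers), and the tweet is invented.

```python
import re
from collections import Counter

USERNAME = "@Ibra_official"

def clean(text, username=USERNAME):
    """Remove the username first, then lowercase and strip non-letters."""
    text = re.sub(re.escape(username), " ", text, flags=re.IGNORECASE)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # drop punctuation and digits
    return text.split()

def clean_wrong_order(text, username=USERNAME):
    """Strip punctuation first: '@' and '_' vanish, so the removal fails."""
    text = re.sub(r"[^A-Za-z\s]", "", text)
    text = re.sub(re.escape(username), " ", text, flags=re.IGNORECASE)
    return text.lower().split()

# Invented example tweet.
tweet = "@Ibra_official welcome to #MUFC, legend!"

# Word frequencies are the input for the word cloud.
word_counts = Counter(clean(tweet))
print(clean(tweet), clean_wrong_order(tweet))
```

In the wrong order the username survives as the token ibraofficial and would dominate the word cloud, since it occurs in every tweet.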
The word cloud provides some meaningful insights into how people are talking about Zlatan Ibrahimović. The most frequently used word in tweets about this player is manutd, which is an abbreviation of the football club Manchester United. This could be explained by the fact that the tweets were gathered at the moment Zlatan Ibrahimović announced that he was leaving his team, Paris Saint-Germain (he did this via the tweet that was given as an example in Subsection 4.2.3). Apparently, people link him to Manchester United. Not only manutd is to be found in the word cloud, but words such as mufc (another abbreviation for Manchester United), man, manchester, and united are also used very often. Even the names of the new Manchester United manager, José Mourinho, who is known to get along very well with Ibrahimović, are visible in the word cloud. Besides Manchester United, football club Arsenal is also mentioned, but this word is much smaller, and therefore less used. The rumors on Twitter appeared to be correct: the most recent tweet used to create this word cloud was placed on May 25, 2016, and over a month later, on June 30, Ibrahimović confirmed on Twitter that he is indeed going to play for Manchester United. A final remark on the result is that a lot of people seem to talk about the tweet Ibrahimović used to announce his departure, since the words legend and king are used very often.

This could be very relevant information for SciSports to add to a player background check. In a situation where a club inquires about Ibrahimović, they can give the club more information than just his statistics on the pitch. They could tell that, according to the people on Twitter, it is likely that Ibrahimović is talking to Manchester United to possibly discuss the terms of his contract.

5.3.3 Sentiment Analysis

Finally, the topic that is probably the most interesting for SciSports is the sentiment analysis, because it actually analyzes the content of the tweets, rather than only representing this content. An introduction to sentiment analysis was given in Subsection 2.1.1 of this thesis, and it was discussed in Subsection 4.2.3 why sentiment analysis could add value to the background check of SciSports. This part provides some more explanation on sentiment analysis, together with the methodology used and the results obtained.

Sentiment analysis is a way to assign a positive or negative sentiment to a piece of text, with -1 being very negative, 0 being neutral, and +1 being very positive. This value between -1 and +1 is referred to as polarity. Another number that is often determined in sentiment analysis is the subjectivity of a text. The subjectivity of a text varies from 0 to 1, with 0 meaning that the text is very objective, and 1 implying that the text is very subjective. The goal of this sentiment analysis is to create a visualization of the overall sentiment about a player in the retrieved tweets, which contains more information than just the average polarity and the average subjectivity.

As was already mentioned earlier, an off-the-shelf tool for Python 2.7 was used to calculate the polarity and subjectivity of each tweet. This tool is called the pattern module (http://www.clips.ua.ac.be/pages/pattern-en#sentiment). The next example shows the performance of this module. Consider the example tweet of football player Wanyama, provided in Figure 3.2. From the text “I had spaghetti and it was very nice i enjoyed it”, it can be concluded that Wanyama liked the spaghetti he just had. Manually, one would assign a positive sentiment to it, and the tweet is also quite subjective, since it is the opinion of Wanyama only.
If this sentence is fed to the sentiment analyzer of the pattern module, it returns a polarity of 0.64 and a subjectivity of 0.85, which seem to be acceptable values.

There are two main reasons for picking the pattern module. The first one is that it is available for both English and Dutch. The second one is that, after some testing, the performance of the module appeared to be good. The module also reacted well when negations were added to the test strings. If the tweet of Wanyama is changed to “I had spaghetti and it was not nice i did not enjoy it”, the values for polarity and subjectivity become -0.22 and 0.63 respectively. On the website of the module, it is stated that the accuracy of the English sentiment analyzer is 75 percent for movie reviews, whereas the accuracy of the Dutch analyzer is 82 percent for book reviews. Since it is outside the scope of this thesis to create a very accurate sentiment analyzer, these values are considered to be acceptable for the purposes of a player background check, and the pattern module is used to perform the sentiment analysis.

The sentiment analysis is performed on the mentions of the football players, to get an overview of the opinion of people about them. The tweets were gathered in Python 2.7, because the pattern module is only available for this version of Python. So with the tweets still stored in the Python environment, the polarity and subjectivity of each tweet text are added in two new columns. It is noteworthy that this step is performed for the Dutch and English tweets separately, since it is necessary to define the language the pattern module has to operate in. After adding those two variables, all the Twitter data, including the values determined in the sentiment analysis, is stored in a comma-separated values file. This file is then loaded into R to create a visualization from it.

Recall that not only the text of the tweets was retrieved, but also the number of likes and the number of retweets a tweet received. This data can be included in the visualization. One person tweeting something bad about a player might be unimportant, but as the number of likes and retweets of this tweet grows, so does the relevance of the polarity of the text. Therefore, a visualization technique is selected in which this information can be included: the bubble chart.

First, the player mentions are loaded into R, as in all previous analyses. A plot is made using the raw number of retweets and likes of a tweet. Since there is no interest in tweets containing no sentiment and no subjectivity, that is, tweets for which the sentiment analysis resulted in a value of 0 for both aspects, those tweets are removed. Keeping these tweets results in a bubble chart with a lot of values in the origin of the plot. When a plot was made of the tweets that did contain information, some bubbles in the resulting chart overshadowed the other results, especially in the cases where not many mentions of the player were available. The scale of the chart was therefore adjusted to solve this issue. Instead of using the absolute number of likes and retweets, the log(1 + x) transformation of these values was taken to improve the visualization. It is required to take log(1 + x) instead of log(x), since a lot of tweets do not have retweets or likes, and the log of zero is undefined.
Adding one to the number of likes and retweets before taking the log is fair, because tweets with zero likes or retweets keep a value of zero, since log(1 + 0) is equal to zero. A comparison between the charts before and after scaling is provided in Figure 5.6. The size of each bubble represents the number of likes a tweet had, the color represents the number of retweets, and the location of the center shows the polarity and the subjectivity of the tweet. The upper chart represents the data using the raw numbers of likes and retweets, whereas in the lower chart the log(1 + x) transformation was used for both the number of likes and the number of retweets. The lower chart provides a much clearer representation of the sentiment about the player, and is therefore more useful to add to a player background check.

A big disadvantage of the sentiment analysis is that it does not work well for less known players. The example in Figure 5.6 is based on 525 tweets, and provides a fairly clear overview. If more tweets are available, such as in the case of Zlatan Ibrahimović or Cristiano Ronaldo, the overview of the polarity of their mentions only becomes better. However, if there are only a few mentions available, such as in the case of Mark-Jan Fledderus, the sentiment analysis loses its power, which is visualized in Figure 5.7. This is to be expected, but it makes the sentiment analysis less interesting for SciSports, since they mainly focus on less known players. All results of the performed sentiment analyses that are not presented here can be found in Appendix I.3.
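The filtering and scaling steps described above can be sketched as follows. This is a Python illustration under stated assumptions, not the R code used in the thesis: the field names and the function prepare_bubbles are hypothetical, the natural logarithm is assumed, and the like and retweet counts in the sample are invented for the example.

```python
import math


def prepare_bubbles(tweets):
    """Drop tweets scored 0 on both polarity and subjectivity, then
    scale likes and retweets with log(1 + x)."""
    bubbles = []
    for tweet in tweets:
        if tweet["polarity"] == 0 and tweet["subjectivity"] == 0:
            continue  # no sentiment information: would only clutter the origin
        bubbles.append({
            "polarity": tweet["polarity"],
            "subjectivity": tweet["subjectivity"],
            "size": math.log1p(tweet["likes"]),      # bubble size from likes
            "color": math.log1p(tweet["retweets"]),  # bubble color from retweets
        })
    return bubbles


# Hypothetical sample: like and retweet counts are invented.
tweets = [
    {"polarity": 0.64, "subjectivity": 0.85, "likes": 0, "retweets": 0},
    {"polarity": 0.0, "subjectivity": 0.0, "likes": 120, "retweets": 40},
    {"polarity": -0.22, "subjectivity": 0.63, "likes": 9, "retweets": 99},
]
bubbles = prepare_bubbles(tweets)
```

A tweet without likes keeps a size of zero, while a tweet with many likes is compressed towards the others, so a single popular tweet no longer overshadows the rest of the chart.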


Figure 5.6: Comparison of sentiment visualization before and after log transformation on 525 Riyad Mahrez mentions.


Figure 5.7: Sentiment analysis visualization of 32 Mark-Jan Fledderus mentions.

Chapter 6

Validating the Personality Prediction

After performing an analysis on the retrieved text data, it is necessary to check whether the results provide a good overview of players to add to a player background check. Especially for the part in which a personality profile of a player is constructed based on text data from news articles, it has to be considered which of the three experiments provided the most useful results. Therefore, the fourth and final sub-question is answered in this chapter of the thesis.

Before answering the question, it is important to make a note about which model is to be validated. Although the only part of the Twitter analysis that could be validated is the sentiment analysis, it is decided to focus completely on the model predicting a player’s personality. As was mentioned in Subsection 5.3.3, the accuracy of the sentiment analyzer used is 75 percent for the English language (for movie reviews), and 82 percent for the Dutch language (for book reviews). It was decided not to check manually whether these accuracy rates are the same for tweets about football players, since this would be a very time-consuming process. Since this is a master thesis project, in which time is limited, it is more interesting to spend this time on validating a newly developed model: the personality profile construction based on news articles. Therefore, this chapter only considers the personality construction part of the project.

6.1 Manual Analysis of a News Article

To get a first insight into the performance of the personality experiments, a manual check on one news article is executed. For one of the Dutch players for whom a personality profile was constructed, one article is picked out to see how adequate counting personality adjectives in an entire article actually is. For football player Dirk Kuijt, whose results for the three experiments are presented in Figure 5.1, Figure 5.3, and Figure 5.4, one article from the NOS source is picked out that contains some of the personality adjectives. The whole article is provided in Appendix J. When the bag-of-words approach with word stemming is applied to this article, and only the stems of personality adjectives from the article are kept, five word stems remain:

Defensief Stem of the word defensief (translation of defensive), in the sentence: “De persconferentie was, hoewel soms defensief, zeer vermakelijk.”

Inzicht Stem of the word inzichtelijke (translation of insightful), in the sentence: “Toen het over de gouden hand van wisselen van de bondscoach ging, kwam tot een paar mooie woorden over het inzicht van de bondscoach.”

Logisch Stem of the word logisch (translation of logical), in the sentence: “Leek logisch.”


Ontspann Stem of the word ontspannen (translation of relaxed), in the sentence: “Dirk Kuijt keek met steeds meer ontspanning en met een grote glimlach naar het optreden van de jonge PSV’er.”

Verleg Stem of the word verlegen (translation of shy), in the sentence: “De Katwijker zit nooit om een woordje verlegen en voelt zich in persgesprekken als een vis in het water.”

This analysis immediately shows the weakness of the first experiment, where word stemming was applied to a bag-of-words, without looking at the context the words are placed in. In this article, the word defensief refers to a press conference, the word inzicht indicates that another player than Dirk Kuijt thinks the coach has good insight, logisch is used to indicate that the coach had made a logical decision to bring Dirk Kuijt to the world cup, ontspann is used to express the way Dirk Kuijt was watching the performance of another player, and verleg is used in a sentence that states that Dirk Kuijt is not shy. So, while the methodology of the first experiment looks for word stems of personality adjectives, it assigns certain words to a player that are not related to him. Only the stems ontspann and verleg actually refer to Dirk Kuijt, with the important note that a negation is used in the same sentence as the latter one to indicate that Dirk Kuijt is not shy. A final note is that the two stems defensief and logisch are removed from the text corpus, because they appeared too frequently (see Appendix F.2.1), which actually strengthens the performance of the model in this specific case.

It is not a surprise that the first approach is not very accurate, since it was known from the beginning that the bag-of-words approach is rather primitive. That is also the reason why two extra experiments were carried out. If the methodology of the second experiment, the one in which the bag-of-words approach was used without word stemming, is manually applied to the same article, it is again demonstrated that the bag-of-words approach seems to lack performance for this particular purpose. The three words that are counted if this second approach is applied to the article are verlegen, logisch, and defensief. Again, the words logisch and defensief are removed from the corpus because they appear too often (Appendix F.2.2). This implies that the methodology applied in the second experiment would, for this article, only consider Dirk Kuijt to be shy, while human beings can read that the text states that Dirk Kuijt is not shy.

Because of the inaccuracy of the bag-of-words approach, part-of-speech tagging was used in the final experiment to only look for adjectives. It was assumed that with this approach, mostly the words that describe players are extracted. Of course, the approach also extracts adjectives that provide information about something else than a football player, but this is unavoidable when no other techniques are combined with it. Applying the part-of-speech tagger that was used in the experiment in Section 5.2 results in the extraction of the personality adjectives defensief and logisch, of which only the latter is deleted from the text corpus because of its frequency of use (Appendix F.2.3). This implies that the part-of-speech tagging methodology would consider Dirk Kuijt to be defensive, while the word defensief actually refers to a press conference, and not to Dirk Kuijt.

By manually checking all three methodologies on one example article, it is already clear that searching for personality adjectives in whole articles does not always provide a good view of that player. Words could refer to things other than the player, and negations are neglected. However, it is too early to state that the model to construct a player’s personality performs badly. This is a manual analysis of only one article. The negative effects of all three methodologies might balance each other out if hundreds of articles on one player are analyzed. Therefore, it is necessary to consider the resulting personality profiles rather than the individual articles, and compare them to the personality profiles people actually perceive about these players. This analysis and final validation is performed in the next section.
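The weakness found in this manual analysis can be reproduced in a few lines. The sketch below mimics the second experiment (bag-of-words without stemming) on three of the sentences quoted above. The three-entry word list and the function name are hypothetical stand-ins for the full adjective list and the R implementation used in the thesis.

```python
import re

# Hypothetical mini word list: Dutch personality adjective -> English translation.
PERSONALITY_ADJECTIVES = {
    "verlegen": "shy",
    "defensief": "defensive",
    "logisch": "logical",
}
# Words removed from the corpus because they occur too often (Appendix F.2.2).
TOO_FREQUENT = {"defensief", "logisch"}


def count_adjectives(article, word_list, excluded=frozenset()):
    """Count word-list adjectives in an article, ignoring all context."""
    counts = {}
    for token in re.findall(r"\w+", article.lower()):
        if token in word_list and token not in excluded:
            counts[token] = counts.get(token, 0) + 1
    return counts


article = (
    "De persconferentie was, hoewel soms defensief, zeer vermakelijk. "
    "Leek logisch. "
    "De Katwijker zit nooit om een woordje verlegen en voelt zich in "
    "persgesprekken als een vis in het water."
)
counts = count_adjectives(article, PERSONALITY_ADJECTIVES, TOO_FREQUENT)
all_matches = count_adjectives(article, PERSONALITY_ADJECTIVES)
```

The only remaining match is verlegen, so the player is counted as shy even though the word nooit in the same sentence negates it; a bag-of-words count has no way to see that.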


6.2 Validation via Questionnaire

In contrast to the analysis of how the three applied approaches perform on one individual news article, it is probably more interesting to consider how the techniques perform on the whole text corpus. The weaknesses of the techniques mentioned in the previous section might balance each other out when a large number of news articles is analyzed. To check this, the results of the three experiments are to be compared to each other, and, above all, to a benchmark.

The best way to set this benchmark and check which approach is the most accurate in predicting a player’s actual personality is very difficult to determine. One possibility is to use a player’s self-rated personality. This implies that a player fills in one of the personality tests that were briefly discussed in Section 2.2, to see what the personality profile of the player looks like, based on the player’s own opinion. One downside is that the result of this self-rated personality could be a self-idealization of the player, rather than his actual personality. Furthermore, it is only one observation, which would be very subjective. Besides these two disadvantages, it is also very difficult to make a famous football player fill in such a questionnaire. Of course, SciSports has connections with players, but only with a few players that are in the data set. Especially for the players not playing in the Netherlands, it would be difficult to get in contact with them and make them fill out a personality questionnaire.

A possible way to solve the issue that a player’s self-rated personality is subjective and could be an idealization is to add the perception of this player’s personality by people who are closely related to the player, such as his family and friends. In many studies, it is very common to combine self-ratings with peer-ratings to construct a personality profile of a person. However, since it is difficult to get in contact with the players themselves, it is even more difficult to get in contact with people who are closely related to them. Therefore, peer-ratings of a player’s personality provided by people who are related to the player are not an option.

However, this project has been focusing on determining a player’s personality based on the content of news articles, so a perception of a player’s personality is constructed, rather than the actual personality. Therefore, it seems reasonable to compare the findings of this project with the perception of the player’s personality of people who also base this perception on news articles and other media: the football fans. Comparing the outcomes of the experiments to the perception of personality of football fans can give an indication of how well a model can predict the perception of personality.

People from within SciSports and football fans from outside the company were asked to fill in the Big Five Inventory (John & Srivastava, 1999) about a certain player. The Big Five Inventory is a relatively short questionnaire to test a personality, consisting of 44 statements, each one related to one of the Big Five personality traits. More extensive personality tests are available, but considering the fact that people need to invest time to fill in the questionnaire, without getting anything in return, the decision was made to use the Big Five Inventory (BFI). Since this master thesis is carried out in the Netherlands, and with Dutch being the native language of all the respondents of the questionnaire, the Dutch translated version of the original BFI, created and validated by Denissen, Geenen, Van Aken, Gosling & Potter (2008), was used. Normally, the BFI is formulated for self-rated personality tests. Therefore, the introduction of the questionnaire was slightly changed, such that respondents knew that they had to evaluate each statement about one player compared to other football players.
Each statement starts with the words: “I see this player, in comparison to other players, as someone who . . . ”, followed by the actual characteristic. Respondents of the BFI need to indicate the degree to which they agree with each of the statements on a five-point Likert scale. The questionnaire as it was presented to the respondents and the BFI scoring are added in Appendix K.

People were invited to fill out an online version of this questionnaire about one or more of the following players: Luuk de Jong (PSV), Dirk Kuijt (), Mario Balotelli (AC Milan), and Lionel Messi (FC Barcelona). The first two are playing in the Netherlands, and the latter two are playing in Italy and Spain. Those four players were chosen because many news articles were obtained about them, and they are all well known players for football fans in the Netherlands. A total of 105 responses were received, divided as follows:

• Luuk de Jong: 38 responses.
• Dirk Kuijt: 25 responses.
• Mario Balotelli: 25 responses.
• Lionel Messi: 17 responses.

The raw results for each player are provided in Appendix L. While these numbers of responses are not very high, it is assumed that they properly reflect how the personalities of the four players are perceived. The scoring of the responses is provided in Appendix K, but an important note is to be made about the interpretation. In all three experiments executed in Chapter 5, the player personality traits were compared to the mean of these personality traits of the other football players in the data set. Therefore, the results of the validation questionnaire need to be on the same scale as the results of the experiments to make a proper comparison. Since it was not feasible to take questionnaires for all players in the data set, it was chosen to adjust the statements of the BFI, such that the respondents already compare each player to other players. In the scoring, the results, a number from 1 to 5 for each statement, were then interpreted as follows (statements that are scored reversed are interpreted in reverse order):

1. Two times the standard deviation below the mean of all football players.
2. One time the standard deviation below the mean of all football players.
3. The mean of all football players.
4. One time the standard deviation above the mean of all football players.
5. Two times the standard deviation above the mean of all football players.
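The interpretation listed above amounts to mapping an answer a on the Likert scale to a - 3 standard deviations, after reversing reverse-scored statements to 6 - a. A minimal sketch, assuming the trait score is the mean of the mapped answers; the function names and the example answers are hypothetical, and which statements are reverse-scored is defined by the BFI scoring in Appendix K, not by this sketch.

```python
def to_sd_units(answer, reverse_scored=False):
    """Map a Likert answer (1..5) to standard deviations from the mean."""
    if answer not in (1, 2, 3, 4, 5):
        raise ValueError("answer must be an integer from 1 to 5")
    if reverse_scored:
        answer = 6 - answer  # 1 <-> 5 and 2 <-> 4, 3 stays 3
    return answer - 3


def trait_score(answers):
    """Mean of the mapped answers for one trait.

    `answers` is a list of (answer, reverse_scored) pairs.
    """
    values = [to_sd_units(answer, reverse) for answer, reverse in answers]
    return sum(values) / len(values)


# Hypothetical answers to three statements belonging to the same trait.
example = [(4, False), (2, True), (3, False)]
score = trait_score(example)
```

In this example the answers map to +1, +1 and 0, so the player is placed two thirds of a standard deviation above the mean on that trait.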

Although there is no evidence available that this is a correct interpretation, it is used in this project, because a comparison between the players has to be made. To find the actual means and standard deviations of each personality trait for football players, the questionnaire would have to be filled in for all players in the data set, which is not feasible in this master thesis project. Using this interpretation, the results of the questionnaire can be compared to the results of the three experiments. The mean of the answers to the statements about a specific personality trait is used as the score of the player on this personality trait.

The results of all three experiments and the validation are plotted in Figure 6.1 and Figure 6.2. The green lines represent the validation that was obtained via the questionnaire, and the other three lines represent the results of the three experiments from Chapter 5: the bag-of-words approach with word stemming (BoW), the bag-of-words approach without word stemming (BoW No Stem), and the part-of-speech tagging approach (PoS). It is clearly visible that the performance of the three experiments varies a lot for all four players. Not only do they deviate from the validation questionnaire, they also differ a lot from each other. One remarkable thing is that the Openness to Experience aspect is predicted quite well by the bag-of-words approach with word stemming. However, for all other aspects and experiments, no single best model is clearly visible. To further investigate the performance of the three models, the error rates need to be analyzed. Table 6.1 provides an overview of the difference of each method compared to the results of the questionnaire.
This error rate is determined by taking the absolute value of the difference between the model outcome and the outcome of the questionnaire for all five personality traits: Openness to Experience (OE), Extraversion (E), Conscientiousness (C), Neuroticism (N), and Agreeableness (A). The error rate is calculated for each player, personality trait, and experiment. For each player, all three experiments are conducted, and the results of each experiment are printed on a separate row with five absolute error values, one for each personality trait. So, the upper left 0.01 implies that the bag-of-words experiment with word stemming predicts the Openness to Experience personality trait of player Luuk de Jong with an absolute error of 0.01. For each combination of player and personality trait, the lowest absolute error is printed in bold, to indicate which model predicts this personality trait most accurately. The rows with average values contain the error rate of each experiment averaged over the four players, to get an overview of how accurately the models perform on average.

Figure 6.1: Results of experiments and questionnaire for two players in the Netherlands.

Figure 6.2: Results of experiments and questionnaire for two players outside the Netherlands.

Table 6.1: Absolute differences between the results of the experiments and the questionnaire. The lowest absolute errors are printed in bold.

Player            Experiment     OE     E      C      N      A
Luuk de Jong      BoW            0.01   0.74   0.59   0.51   0.41
Luuk de Jong      BoW No Stem    0.41   0.03   0.68   0.70   0.27
Luuk de Jong      PoS            0.06   0.20   0.63   0.46   0.02
Dirk Kuijt        BoW            0.33   1.58   1.45   0.22   1.20
Dirk Kuijt        BoW No Stem    0.00   1.29   1.17   0.24   0.85
Dirk Kuijt        PoS            0.28   1.61   1.42   0.04   0.64
Mario Balotelli   BoW            0.31   0.01   0.54   0.22   0.56
Mario Balotelli   BoW No Stem    1.25   0.49   0.80   0.11   0.59
Mario Balotelli   PoS            0.23   0.79   0.91   0.24   1.11
Lionel Messi      BoW            0.06   0.15   1.27   0.61   1.10
Lionel Messi      BoW No Stem    0.76   0.46   0.82   0.54   0.60
Lionel Messi      PoS            0.78   0.33   0.72   0.20   0.69
Average           BoW            0.18   0.62   0.96   0.39   0.82
Average           BoW No Stem    0.60   0.57   0.87   0.40   0.58
Average           PoS            0.34   0.73   0.92   0.23   0.61

As is clearly visible in the table, none of the methodologies clearly outperforms the other two. Even when the upper two and lower two players are considered separately, to account for the difference in language of the articles, no clear pattern in the performance of the models is visible. Unfortunately, based on this small subset of four players, it cannot be stated which model has the best performance.

The averages provided in the lowest three rows in particular indicate the bad performance of the models in most cases. However, two of the averages are hopeful. As was already indicated, the bag-of-words approach with word stemming seems to predict the Openness to Experience trait quite well in those four cases. For all four players the Openness factor is predicted with an absolute error of less than 0.34, and the average absolute error is equal to 0.18. Furthermore, the part-of-speech tagger predicts the Neuroticism factor quite well for those four players. With an average absolute error of 0.23, and the highest error being 0.46 for Luuk de Jong, the performance of this model on this specific personality trait could be interesting to develop further. The average absolute errors on the three other personality traits are all higher than 0.5 and are therefore not considered to be interesting.
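The error computation can be sketched as follows. The numbers in ERRORS are copied from Table 6.1; the function names are hypothetical, and the two short profiles passed to absolute_errors in the usage lines are invented to show the calculation, since the underlying profile values are not listed in this chapter.

```python
TRAITS = ["OE", "E", "C", "N", "A"]

# Absolute errors per player and experiment, in the order of TRAITS (Table 6.1).
ERRORS = {
    "Luuk de Jong": {
        "BoW": [0.01, 0.74, 0.59, 0.51, 0.41],
        "BoW No Stem": [0.41, 0.03, 0.68, 0.70, 0.27],
        "PoS": [0.06, 0.20, 0.63, 0.46, 0.02],
    },
    "Dirk Kuijt": {
        "BoW": [0.33, 1.58, 1.45, 0.22, 1.20],
        "BoW No Stem": [0.00, 1.29, 1.17, 0.24, 0.85],
        "PoS": [0.28, 1.61, 1.42, 0.04, 0.64],
    },
    "Mario Balotelli": {
        "BoW": [0.31, 0.01, 0.54, 0.22, 0.56],
        "BoW No Stem": [1.25, 0.49, 0.80, 0.11, 0.59],
        "PoS": [0.23, 0.79, 0.91, 0.24, 1.11],
    },
    "Lionel Messi": {
        "BoW": [0.06, 0.15, 1.27, 0.61, 1.10],
        "BoW No Stem": [0.76, 0.46, 0.82, 0.54, 0.60],
        "PoS": [0.78, 0.33, 0.72, 0.20, 0.69],
    },
}


def absolute_errors(predicted, perceived):
    """Absolute difference per trait between a model profile and the questionnaire."""
    return [abs(p - q) for p, q in zip(predicted, perceived)]


def average_errors(errors, experiment):
    """Average the absolute errors of one experiment over all players."""
    players = list(errors)
    return [
        sum(errors[player][experiment][i] for player in players) / len(players)
        for i in range(len(TRAITS))
    ]


def best_experiment(errors, player, trait):
    """Experiment with the lowest absolute error for one player and trait."""
    i = TRAITS.index(trait)
    return min(errors[player], key=lambda experiment: errors[player][experiment][i])


# Hypothetical profile values, for illustration only.
demo = absolute_errors([0.5, -1.0], [0.25, 0.5])
bow_average = average_errors(ERRORS, "BoW")
```

Averaging the BoW errors on Openness to Experience gives 0.1775, which is the 0.18 reported in the table, and best_experiment(ERRORS, "Dirk Kuijt", "N") returns "PoS".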
While the results of the models are evaluated for only a very small part of the players in the data set, it seems clear that the performance of the models is low. In Figure 6.1 and Figure 6.2, there is often one model that comes close to the actual personality perception of the player, but as is clearly visible in Table 6.1, there is no model that consistently predicts all personality traits well. The only two models that provide promising results are the bag-of-words approach with word stemming to predict the personality trait Openness to Experience and the part-of-speech tagging approach to predict the personality trait Neuroticism. Possible reasons for the lack of performance of the three models are provided in Section 6.1, in which a few weaknesses of the approaches were discussed. The discussed weaknesses eventually did not balance each other out when multiple articles were analyzed.


It is expected that the most important reason for the low performance of the models is that not all words in an article are about one player. In Chapter 3, web page parsers were built to automatically retrieve news articles about football players. The name of a player was searched for on one of the eight different websites, and all articles that were found were downloaded and stored in the database. However, as became clear in Section 6.1, these articles are not only about the one specific player in the search request. Other players are mentioned in the articles as well. In the three applied methodologies, it was expected that the words that were used for other players would balance each other out when a lot of articles were analyzed, but this did not appear to be the case. Therefore, the same methodologies should be applied to other text data, containing only text that is about one specific player. However, since this kind of text data seems unlikely to be found, it might be better to complement the applied methodologies with other text mining techniques, to check whether an adjective is likely to refer to the player of interest. This could for example be achieved via named entity recognition and rhetorical structure theory, which were both briefly discussed in the preliminaries of this thesis. Named entity recognition could for example be used to keep only the sentences that contain a specific player name. Rhetorical structure theory could consider the whole structure of the text, and distinguish between the interesting parts, the parts that are actually about a player, and the non-interesting parts.

Furthermore, a note needs to be made on the word list that was used, and especially on the self-created word list of adjective translations. In all three experiments, the number of Dutch adjectives that did not occur in the entire text corpus was much higher than the number of English adjectives. It would have been better to use a Dutch version of the word list of Saucier & Goldberg (1996), but such a word list was not available at the moment of writing. The poor quality of the translation of words could also be one explanation for the performance of the models.

While the results of these three experiments cannot directly be used by SciSports in their player background checks, they provide some interesting insights. It is pointed out that the bag-of-words approach with word stemming might be useful in predicting the Openness to Experience personality factor, and that part-of-speech tagging might be useful in predicting a football player’s Neuroticism personality factor. Furthermore, it is found that applying the bag-of-words approach to news articles does not suffice to create an accurate personality profile of a person, not even when the adjectives are filtered out first via part-of-speech tagging. While the expectations of the results were higher, they hopefully contribute to finding a way to accurately construct a personality profile of a person based on text data that is written by another person. This thesis could be considered a first step in closing this gap in the literature, which deserves more attention from researchers.

Chapter 7

Conclusions

In this master thesis, an attempt was made to find different aspects that could contribute to an automated background check on professional football players through the use of different text mining techniques. Articles from news websites, most of them focused on football, and tweets placed by both players and fans were used to achieve this.

The personality of a player in terms of the Big Five personality traits appeared to be a very interesting aspect to add to a player background check, and theory showed that a relationship exists between self-narratives and personality. An attempt was made to construct a personality overview by combining the text data in different articles about a player. The validation for four players showed promising results for the bag-of-words approach with word stemming in predicting the Openness to Experience trait, and for part-of-speech tagging in predicting the Neuroticism trait. However, the results of the three experiments mainly show that the relationship between personality and text written in news articles requires further investigation.

In the Twitter part of the analysis, regular expressions were used to extract hashtags and mentions from tweets placed by football players. This is relevant to a player background check, because SciSports currently searches manually for which players communicate a lot with each other and what they talk about. A list of the most used mentions and hashtags can save a lot of time here. Two other parts of the automated player background check are the word cloud and the sentiment analysis, which give an overview of a player’s popularity and are both based on tweets in which the player is mentioned. The word cloud can be added to a player background check, for example to visualize transfer rumors, whereas the visualization of the sentiment analysis provides an overview of the number of retweets and likes of positive and negative tweets about a player.
While most results seem promising, a note is to be made on the practical usefulness for SciSports. Proper results follow from the automated analysis only when a lot of text data is available for one player. For the personality part of the research, it is problematic when only a few articles about a player are available. In that case, the importance scores of the personality adjectives become extremely high, and they deviate by more than two standard deviations on almost every personality trait. The Twitter analysis of less famous players also gives weaker results and less interesting visualizations to add to a background check. This makes the models less useful in practice for SciSports, since the company mainly focuses on lesser-known football players.
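The extraction of hashtags and mentions with regular expressions, as summarised above, can be sketched as follows. The two patterns and the helper `top_tags` are assumptions made for illustration; the exact expressions used in the thesis are not shown in this chapter. The example tweets are taken from the sample in Appendix B.

```python
import re
from collections import Counter

# Assumed patterns: a '#' or '@' that does not follow a word character,
# followed by letters, digits or underscores.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")
MENTION = re.compile(r"(?<!\w)@(\w+)")


def top_tags(tweets, pattern, n=3):
    """Return the n most frequent matches of `pattern`, lower-cased."""
    counts = Counter()
    for tweet in tweets:
        counts.update(match.lower() for match in pattern.findall(tweet))
    return counts.most_common(n)


tweets = [
    "Today is #WorldFoodDay, please support my friends @WFP for a world "
    "free from hunger by 2030. One Future, #ZeroHunger http://t.co/WnDVxhf9Uh",
    "805 million people suffer from hunger. Make sure the world knows. "
    "#805millionnames w/@WFP http://t.co/H7VPUL6spJ",
    "Overwhelmed #805millionnames http://t.co/s3QXn1V8Ec",
]

print(top_tags(tweets, HASHTAG, n=1))
print(top_tags(tweets, MENTION, n=1))
```

Lower-casing before counting treats `#VitaminWellplus` and `#vitaminwellplus` as the same topic, which seems reasonable for a frequency list but is a design choice rather than something the thesis prescribes.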

7.1 Limitations and Future Research

The part in which a first attempt was made to close a gap in the existing literature is the part in which three experiments were carried out to predict a football player’s personality. However, the results of these experiments differed a lot, not only from each other, but also from the questionnaire that was held to get an overview of the actual perception of a player’s personality.


Several assumptions were made in applying the different methodologies. First, only a few articles are about one player only. In many articles multiple players are discussed, or one player says something about somebody else. Therefore, an adjective that is found in the bag-of-words of a player is not necessarily about that player. However, the applied methodology counts each word that is in the bag-of-words, without considering the context in which it was placed. Since multiple articles about one player are analyzed, it was assumed that personality adjectives that are found for a player but are not really about that player balance each other out, and that most adjectives in the bag-of-words do refer to the player of interest. However, the results of the three models show that this is not the case. Therefore, future research should combine the applied methodologies with other text mining techniques, such as named entity recognition, to only consider adjectives that are really about a player.

In addition to named entity recognition, future research should investigate whether rhetorical structure theory (RST) can further improve the results of this project. As was discussed in the preliminaries, RST has proven beneficial in sentiment analysis. Since it considers the whole structure of a text, rather than looking at it as a bag-of-words, it might be able to distinguish the personality adjectives that refer to the player of interest from the ones that do not.

Furthermore, the applied approaches only consider one word at a time. This has the disadvantage that negations in front of an adjective are neglected. For example, someone who is not kind is the opposite of someone who is kind. However, the applied approaches only see the word kind in this example, and would consider this person to have the agreeable personality trait in both cases. Again, it was assumed that the negated words balance each other out.
In future research, detecting negations and assigning the opposite personality score to the negated adjectives might provide better results. This is not always easy, since the negation is not necessarily directly followed by the adjective.

Finally, a remark is to be made on the word list that was used. In particular, the manual translations of the personality adjectives were not accurate. Research should focus on constructing such a word list for languages other than English. Also, the results show that the applied methodologies to predict personality are not accurate with this specific word list. Using another word list, such as the LIWC, will result in different findings. Future research should show how the applied methodologies perform using different personality word lists.
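The suggested negation handling can be sketched as follows. This is a simple window-based heuristic under stated assumptions, not a method evaluated in this thesis: the negation list, the window size of three tokens and the function name are all hypothetical, and only ASCII apostrophes are recognised.

```python
import re

# Illustrative negation words; a real list would be longer and language-specific.
NEGATIONS = {"not", "no", "never"}


def signed_adjective_counts(text, adjectives, window=3):
    """Count adjectives, subtracting one when a negation precedes them.

    An adjective counts as -1 if a negation word (or a token ending in
    "n't") occurs among the `window` tokens directly before it, else +1.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = {}
    for i, token in enumerate(tokens):
        if token not in adjectives:
            continue
        preceding = tokens[max(0, i - window):i]
        negated = any(p in NEGATIONS or p.endswith("n't") for p in preceding)
        scores[token] = scores.get(token, 0) + (-1 if negated else 1)
    return scores


print(signed_adjective_counts("He is kind.", {"kind"}))
print(signed_adjective_counts("He is not always kind.", {"kind"}))
```

The window lets the heuristic catch a negation that is separated from the adjective by an adverb, as in "not always kind". Because the window ignores sentence boundaries, a negation at the end of one sentence can wrongly flip an adjective at the start of the next, which illustrates why this step is not always easy.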

Bibliography

Back, M. D., Stopfer, J. M., Vazire, S., Gaddis, S., Schmukle, S. C., Egloff, B., & Gosling, S. D. (2010). Facebook profiles reflect actual personality, not self-idealization. Psychological Science, 21 (3), 372–374.

Brill, E. (1992). A simple rule-based part of speech tagger. In Proceedings of the workshop on Speech and Natural Language, (pp. 112–116). Association for Computational Linguistics.

Bruijn, G.-J., De Groot, R., Putte, B., & Rhodes, R. E. (2009). Conscientiousness, extroversion, and action control: comparing moderate and vigorous physical activity. Journal of sport & exercise psychology, 31 (6), 724–742.

Chung, C. K. & Pennebaker, J. W. (2008). Revealing dimensions of thinking in open-ended self-descriptions: An automated meaning extraction method for natural language. Journal of Research in Personality, 42 (1), 96–132.

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. The Journal of Machine Learning Research, 12, 2493–2537.

Costa, P. T. & McCrae, R. R. (1992). Revised NEO personality inventory (NEO PI-R) and NEO five-factor inventory (NEO FFI): Professional manual. Psychological Assessment Resources.

Cowie, J. & Lehnert, W. (1996). Information extraction. Communications of the ACM, 39 (1), 80–91.

De Fruyt, F., McCrae, R. R., Szirmák, Z., & Nagy, J. (2004). The five-factor personality inventory as a measure of the five-factor model: Belgian, American, and Hungarian comparisons with the NEO-PI-R. Assessment, 11 (3), 207–215.

De Moor, M., Beem, A., Stubbe, J., Boomsma, D., & De Geus, E. (2006). Regular exercise, anxiety, depression and personality: a population-based study. Preventive medicine, 42 (4), 273–279.

Denissen, J. J., Geenen, R., Van Aken, M. A., Gosling, S. D., & Potter, J. (2008). Development and validation of a dutch translation of the big five inventory (bfi). Journal of personality assessment, 90 (2), 152–157.

Esmin, A. A., Júnior, R. S., Santos, W. S., Botaro, C. O., & Nobre, T. P. (2014). Real-time summarization of scheduled soccer games from twitter stream. In Natural Language Processing and Information Systems (pp. 220–223). Springer.

Fan, W., Wallace, L., Rich, S., & Zhang, Z. (2006). Tapping the power of text mining. Communications of the ACM, 49 (9), 76–82.

Fast, L. A. & Funder, D. C. (2008). Personality as manifest in word use: correlations with self-report, acquaintance report, and behavior. Journal of personality and social psychology, 94 (2), 334.


Feldman, R. & Dagan, I. (1995). Knowledge discovery in textual databases (kdt). In KDD, volume 95, (pp. 112–117).

Gill, A. & Oberlander, J. (2002). Taking care of the linguistic features of extraversion. In Proceedings of the 24th annual conference of the cognitive science society, (pp. 363–368).

Golbeck, J., Robles, C., Edmondson, M., & Turner, K. (2011). Predicting personality from Twitter. In Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on, (pp. 149–156). IEEE.

Golbeck, J., Robles, C., & Turner, K. (2011). Predicting personality with social media. In CHI’11 Extended Abstracts on Human Factors in Computing Systems, (pp. 253–262). ACM.

Goldberg, L. R., Johnson, J. A., Eber, H. W., Hogan, R., Ashton, M. C., Cloninger, C. R., & Gough, H. G. (2006). The international personality item pool and the future of public-domain personality measures. Journal of Research in personality, 40 (1), 84–96.

Guilford, J. P. (1959). Personality. McGraw-Hill.

Gupta, V. & Lehal, G. S. (2009). A survey of text mining techniques and applications. Journal of emerging technologies in web intelligence, 1 (1), 60–76.

Hearst, M. A. (1997). Text data mining: Issues, techniques, and the relationship to information access. In Presentation notes for UW/MS workshop on data mining, (pp. 112–117).

Hirsh, J. B. & Peterson, J. B. (2009). Personality and language use in self-narratives. Journal of research in personality, 43 (3), 524–527.

Hogenboom, A., Frasincar, F., De Jong, F., & Kaymak, U. (2015). Using rhetorical structure in sentiment analysis. Communications of the ACM, 58 (7), 69–77.

Holtgraves, T. (2011). Text messaging, personality, and the social context. Journal of research in personality, 45 (1), 92–99.

Horsmann, T., Erbs, N., & Zesch, T. (2015). Fast or accurate?–a comparative evaluation of pos tagging models. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, (pp. 22–30).

Hotho, A., Nürnberger, A., & Paaß, G. (2005). A brief survey of text mining. Ldv Forum, 20 (1), 19–62.

Hu, X. & Liu, H. (2012). Text analytics in social media. In Mining text data (pp. 385–414). Springer.

Iglesias, J. A., Tiemblo, A., Ledezma, A., & Sanchis, A. (2016). Web news mining in an evolving framework. Information Fusion, 28, 90–98.

Jai-Andaloussi, S., El Mourabit, I., Madrane, N., Chaouni, S. B., & Sekkaki, A. (2015). Soccer events summarization by using sentiment analysis. In 2015 International Conference on Computational Science and Computational Intelligence (CSCI), (pp. 398–403). IEEE.

John, O. P. & Srivastava, S. (1999). The big five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of personality: Theory and research, 2 (1999), 102–138.

Kaya, M. & Conley, S. (2016). Comparison of sentiment lexicon development techniques for event prediction. Social Network Analysis and Mining, 6 (1), 1–13.


Kluemper, D. H., Rosen, P. A., & Mossholder, K. W. (2012). Social networking websites, personality ratings, and the organizational context: More than meets the eye? Journal of Applied Social Psychology, 42 (5), 1143–1172.

Kumar, S., Morstatter, F., & Liu, H. (2014). Visualizing twitter data. In Twitter Data Analytics (pp. 49–69). Springer.

Kuper, S. & Szymanski, S. (2009). Why England Lose & Other Curious Football Phenomena Explained. Harper Collins Publishers.

Larsen, B. & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. In Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining, (pp. 16–22). ACM.

Lee, C. H., Kim, K., Seo, Y. S., & Chung, C. K. (2007). The relations between personality and language use. The Journal of general psychology, 134 (4), 405–413.

Lin, J. & Katz, B. (2003). Question answering from the web using knowledge annotation and knowledge mining techniques. In Proceedings of the twelfth international conference on Information and knowledge management, (pp. 116–123). ACM.

Mairesse, F., Walker, M. A., Mehl, M. R., & Moore, R. K. (2007). Using linguistic cues for the automatic recognition of personality in conversation and text. Journal of Artificial Intelligence Research, 30, 457–500.

Mann, W. C. & Thompson, S. A. (1988). Rhetorical structure theory: Toward a functional theory of text organization. Text-Interdisciplinary Journal for the Study of Discourse, 8 (3), 243–281.

Nadkarni, P. M., Ohno-Machado, L., & Chapman, W. W. (2011). Natural language processing: an introduction. Journal of the American Medical Informatics Association, 18 (5), 544–551.

Nagarkar, S. P. & Kumbhar, R. (2015). Text mining. Library Review, 64 (3), 248–262.

NetMarketShare (2016). Desktop search engine market share. http://www.netmarketshare.com/search-engine-market-share.aspx?qprid=4&qpcustomd=0. Accessed: March 22, 2016.

Pang, B. & Lee, L. (2008). Opinion mining and sentiment analysis. Foundations and trends in information retrieval, 2 (1-2), 1–135.

Pennebaker, J. W., Francis, M. E., & Booth, R. J. (2007). Linguistic inquiry and word count: LIWC 2007. Mahwah: Lawrence Erlbaum Associates.

Pennebaker, J. W. & King, L. A. (1999). Linguistic styles: language use as an individual difference. Journal of personality and social psychology, 77 (6), 1296–1312.

Pennebaker, J. W., Mehl, M. R., & Niederhoffer, K. G. (2003). Psychological aspects of natural language use: Our words, our selves. Annual review of psychology, 54 (1), 547–577.

Rhodes, R. & Smith, N. (2006). Personality correlates of physical activity: a review and meta-analysis. British journal of sports medicine, 40 (12), 958–965.

Saucier, G. & Goldberg, L. R. (1996). Evidence for the big five in analyses of familiar english personality adjectives. European Journal of Personality, 10, 61–77.

Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing, volume 12, (pp. 44–49). Citeseer.


SciSports (2016). Science serving football. http://www.scisports.com/. Accessed: April 14, 2016.

Shen, L., Satta, G., & Joshi, A. (2007). Guided learning for bidirectional sequence classification. In ACL, volume 7, (pp. 760–767). Citeseer.

Tan, A.-H. (1999). Text mining: The state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, volume 8, (pp. 65–70).

Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, (pp. 173–180). Association for Computational Linguistics.

Tskhay, K. O. & Rule, N. O. (2014). Perceptions of personality in text-based media and osn: A meta-analysis. Journal of Research in Personality, 49, 25–30.

Turban, E., Sharda, R., & Delen, D. (2010). Decision support and business intelligence systems (9th ed.). Pearson Education.

van Oorschot, G., van Erp, M., & Dijkshoorn, C. (2012). Automatic extraction of soccer game events from twitter. In Proceedings of the Workshop on Detection, Representation, and Exploitation of Events in the Semantic Web (DeRiVE 2012), volume 902, (pp. 21–30).

Wu, Y. C. (2016). Language independent web news extraction system based on text detection framework. Information Sciences, 14, 132–149.

Yu, Y. & Wang, X. (2015). World cup 2014 in the twitter world: A big data analysis of sentiments in us sports fans tweets. Computers in Human Behavior, 48, 392–400.

Appendices

Appendix A

Universal HTML Parsing

This appendix provides some results of universal HTML parsing using the Newspaper and BeautifulSoup packages in Python, as discussed in Section 3.1. For each example, the manually selected title and body text are provided first, followed by the text generated by each of the two packages. Not all 99 articles are evaluated here, only a few that are believed to give a good overview of the variation in performance of both methods. All 99 parsed articles can be requested from the author.

A.1 Example 1

In this example, parsing the HTML via the use of the BeautifulSoup package adds some data prior to and after the actual body text. Also, a title of a related article is printed somewhere in the middle of the text. All the body text is there, but there is also some noise around it. Using the Newspaper package results in an almost perfectly parsed article in this case. Except for a title of a related article somewhere in the middle, the body text is the same as the manually copied text.
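The kind of noise described here can be reproduced with a small sketch. It does not use the Newspaper or BeautifulSoup calls from the thesis, which are not shown in this appendix; instead, Python's standard-library `html.parser` stands in for a generic parser that collects every paragraph element on a page. The class name and the HTML snippet are hypothetical.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._current = None  # tag currently being collected, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and normalise whitespace.
            text = " ".join("".join(self._buffer).split())
            if tag == "title":
                self.title = text
            elif text:
                self.paragraphs.append(text)
            self._current = None


# Hypothetical page: a byline, one body paragraph and a comment widget.
page = """
<html><head><title>Match report</title></head><body>
<p class="byline">Press Association</p>
<div class="article"><p>Goals from <a href="#">Messi</a> decided the game.</p></div>
<p class="comments">Sign in to join the discussion.</p>
</body></html>
"""

extractor = ParagraphExtractor()
extractor.feed(page)
print(extractor.title)
print(extractor.paragraphs)
```

Because the extractor has no notion of which paragraphs belong to the article, the byline and the comment prompt end up in the output next to the body text, which resembles the extra text before and after the article in this example.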

Article URL: http://www.theguardian.com/football/2016/feb/28/barcelona-sevilla-match-report-lionel-messi


Article A.1: Manually selected title and body text. TITLE: Barcelona’s Lionel Messi and Gerard Piqué seal victory over Sevilla

TEXT: Goals from Lionel Messi and Gerard Piqué enabled Barcelona to come from behind to beat Sevilla 2-1 at the Camp Nou and equal a Spanish record of 34 games without defeat in all competitions.

The visitors took the lead through Vitolo’s volley but a superb free-kick from Messi levelled the scores on the half-hour mark, and the Argentinian then laid on a pass for Piqué to put the Catalans in charge early in the second half.

Although Sevilla continued to cause Barça problems, they held on to stay eight points ahead of Atlético Madrid at the top of the Primera División as well as drawing level with the unbeaten record set by Leo Beenhakker’s Real Madrid side in 1989.

Gary Neville’s four-game winning streak with Valencia came to an abrupt end with a 3-0 defeat to Athletic Bilbao following seven frantic minutes late in the second half.

Sabin Merino broke the deadlock in a tense and even game on 73 minutes and Iker Muniain and the former Valencia striker Aritz Aduriz added to the visitors’ lead in quick succession.

Neville, who called the referee a “joke” after the match for waving away two penalty appeals from Valencia, had seen his side beat Espanyol and Granada in the league and enjoyed back-to-back victories over Rapid Vienna in the Europa League but the late onslaught was too much to overcome.

José González made an immediate impact as Granada coach as his side lifted themselves off the bottom of the table with a 1-0 win over Deportivo La Coruña.


Article A.2: Title and body text parsed using the BeautifulSoup package in Python. TITLE:

Barcelona’s Lionel Messi and Gerard Piqué seal victory over Sevilla

TEXT: Press Association

Sunday 28 February 2016 22.22GMT

Last modified on Sunday 28 February 2016 23.25GMT

Goals from Lionel Messi and Gerard Piqué enabled Barcelona to come from behind to beat Sevilla 2-1 at the Camp Nou and equal a Spanish record of 34 games without defeat in all competitions. The visitors took the lead through Vitolo’s volley but a superb free-kick from Messi levelled the scores on the half-hour mark, and the Argentinian then laid on a pass for Piqué to put the Catalans in charge early in the second half. Related: Barcelona’s miracle worker Lionel Messi leaves Arsenal praying for one | Barney Ronay Although Sevilla continued to cause Barça problems, they held on to stay eight points ahead of Atlético Madrid at the top of the Primera División as well as drawing level with the unbeaten record set by Leo Beenhakker’s Real Madrid side in 1989. Gary Neville’s four-game winning streak with Valencia came to an abrupt end with a 3-0 defeat to Athletic Bilbao following seven frantic minutes late in the second half. Sabin Merino broke the deadlock in a tense and even game on 73 minutes and Iker Muniain and the former Valencia striker Aritz Aduriz added to the visitors’ lead in quick succession. Neville, who called the referee a “joke” after the match for waving away two penalty appeals from Valencia, had seen his side beat Espanyol and Granada in the league and enjoyed back-to-back victories over Rapid Vienna in the Europa League but the late onslaught was too much to overcome. José González made an immediate impact as Granada coach as his side lifted themselves off the bottom of the table with a 1-0 win over Deportivo La Coruña. Sign in or create your Guardian account to join the discussion.

This discussion is closed for comments.

We’re doing some maintenance right now. You can still read comments, but please come back later to add your own.

Commenting has been disabled for this account (why?)


Article A.3: Title and body text parsed using the Newspaper package in Python. TITLE: Barcelona’s Lionel Messi and Gerard Piqué seal victory over Sevilla

TEXT: Goals from Lionel Messi and Gerard Piqué enabled Barcelona to come from behind to beat Sevilla 2-1 at the Camp Nou and equal a Spanish record of 34 games without defeat in all competitions.

The visitors took the lead through Vitolo’s volley but a superb free-kick from Messi levelled the scores on the half-hour mark, and the Argentinian then laid on a pass for Piqué to put the Catalans in charge early in the second half.

Related: Barcelona’s miracle worker Lionel Messi leaves Arsenal praying for one | Barney Ronay

Although Sevilla continued to cause Barça problems, they held on to stay eight points ahead of Atlético Madrid at the top of the Primera División as well as drawing level with the unbeaten record set by Leo Beenhakker’s Real Madrid side in 1989.

Gary Neville’s four-game winning streak with Valencia came to an abrupt end with a 3-0 defeat to Athletic Bilbao following seven frantic minutes late in the second half.

Sabin Merino broke the deadlock in a tense and even game on 73 minutes and Iker Muniain and the former Valencia striker Aritz Aduriz added to the visitors’ lead in quick succession.

Neville, who called the referee a “joke” after the match for waving away two penalty appeals from Valencia, had seen his side beat Espanyol and Granada in the league and enjoyed back-to-back victories over Rapid Vienna in the Europa League but the late onslaught was too much to overcome.

José González made an immediate impact as Granada coach as his side lifted themselves off the bottom of the table with a 1-0 win over Deportivo La Coruña.


A.2 Example 2

This second example concerns a Dutch article about Lionel Messi. Neither of the two applied methods works in this case. Both methods return a message that asks the user to disable their adblocker. In fact, this message is the only thing that is returned when using the Newspaper package. Parsing with BeautifulSoup also returns some content of the actual article, but leaves out the first sentence.

Article URL: http://www.nu.nl/voetbal/4237564/messi-noemt-rijkaard-belangrijkste-trainer-in-loopbaan.html

Article A.4: Manually selected title and body text. TITLE: Messi noemt Rijkaard belangrijkste trainer in loopbaan

TEXT: Lionel Messi meent dat Frank Rijkaard de trainer is geweest die de grootste invloed heeft gehad op zijn carrière. "Van elke coach heb ik wel iets geleerd, maar ik denk dat Rijkaard voor mij de belangrijkste trainer is geweest", aldus de sterspeler van FC Barcelona in gesprek met de Egyptische televisiezender MBC.

"Als Rijkaard me niet mee had laten trainen met het eerste en daarna had laten debuteren, had ik misschien nooit het eerste elftal gehaald. Hij gaf mij het vertrouwen."

Rijkaard fungeerde tussen 2003 en 2008 als coach van FC Barcelona. In oktober 2004 gunde hij Messi zijn competitiedebuut.

Na het vertrek van de Nederlander werkte de Argentijn in Catalonië samen met de trainers Josep Guardiola, Tito Vilanova, Gerardo Martino en, sinds medio 2014, Luis Enrique.


Article A.5: Title and body text parsed using the BeautifulSoup package in Python. TITLE:

Messi noemt Rijkaard belangrijkste trainer in loopbaan

TEXT: Beste bezoeker,Wij zien dat u een adblocker gebruikt die ervoor zorgt dat u geen advertenties ziet op NU.nl. Dit vinden wij jammer, want NU.nl is mede dankzij deze advertenties gratis toegankelijk. Wilt u een uitzondering maken voor NU.nl, of meer lezen over hoe wij met advertenties omgaan? Klik dan hier.

"Van elke coach heb ik wel iets geleerd, maar ik denk dat Rijkaard voor mij de belangrijkste trainer is geweest", aldus de sterspeler van FC Barcelona in gesprek met de Egyptische televisiezender MBC. "Als Rijkaard me niet mee had laten trainen met het eerste en daarna had laten debuteren, had ik misschien nooit het eerste elftal gehaald. Hij gaf mij het vertrouwen." Rijkaard fungeerde tussen 2003 en 2008 als coach van FC Barcelona. In oktober 2004 gunde hij Messi zijn competitiedebuut. Na het vertrek van de Nederlander werkte de Argentijn in Catalonië samen met de trainers Josep Guardiola, Tito Vilanova, Gerardo Martino en, sinds medio 2014, Luis Enrique.

Article A.6: Title and body text parsed using the Newspaper package in Python. TITLE: Messi noemt Rijkaard belangrijkste trainer in loopbaan

TEXT: Beste bezoeker,

Wij zien dat u een adblocker gebruikt die ervoor zorgt dat u geen advertenties ziet op NU.nl. Dit vinden wij jammer, want NU.nl is mede dankzij deze advertenties gratis toegankelijk. Wilt u een uitzondering maken voor NU.nl, of meer lezen over hoe wij met advertenties omgaan? Klik dan hier.


Appendix B

Sample of Twitter Data

This appendix shows some of the tweets that were used during this project. A sample of both the Dutch and the English tweets is provided. Only the tweet text and the date and time at which the tweet was placed are shown, because those two attributes are the most relevant in this project. Tweet attributes such as the tweet id and the user id are omitted. Please note that emoticons are not displayed in these tables: the tweets were downloaded using Python 2.7, and the UTF-8 encoded emoticons do not appear properly. Since this project only considers text data, this is not a problem. Research has been conducted on performing sentiment analysis using emoticons, but this is not used in this project.

B.1 English Tweets

Note that not all tweets are in English. Since Ibrahimović was playing in France at the time of writing, some tweets are in French. This is because no language filter was set while downloading all tweets placed by a player.

Fifty most recent tweets that were retrieved from the account of Zlatan Ibrahimović (@Ibra_official).

Tweet text Tweet date and time 1 https://t.co/GaQ5i1KXl1 2016-05-23 09:26:07 2 La grande mdaille de la ville de Paris. I am proud and honored. 2016-05-22 10:33:06 Thank you Paris! https://t.co/OapvuiOjQB 3 Nr 30 https://t.co/p9IBpfFDzc 2016-05-22 05:53:17 4 2015/16 Hard work pays off, and still one to go! ht- 2016-05-21 15:32:43 tps://t.co/1DACsHkaEt 5 Its not about the gear. Stay tuned at https://t.co/Rzu90ZMyEo 2016-05-19 07:30:05 and @AZsportswear https://t.co/EstfPU0Pv8 6 June 7th 2016 marks the start of my new project. Stay 2016-05-13 14:00:03 tuned at https://t.co/Rzu90ZMyEo #itsnotaboutthegear ht- tps://t.co/LULEWEmJa0 7 My last game tomorrow at Parc des Princes. I came like a king, 2016-05-13 07:35:59 left like a legend https://t.co/OpLL3wzKh0 8 Me & Olivier the Nose Pescheux with our latest cre- 2016-04-29 12:13:17 ation. As you can see, were both very happy about it..! ht- tps://t.co/1kFx3bJ5Lk 9 No.29 https://t.co/wjkgeU9Wol 2016-04-24 07:51:03 10 VW+ now in stores. And also The Swedish Football Asso- 2016-04-11 07:15:13 ciation’s official sports drink. https://t.co/8jKR1rNSMQ ht- tps://t.co/V0SjckXTte

A Method to Perform an Automated Background Check on Professional Football Players 79 APPENDIX B. SAMPLE OF TWITTER DATA

11 | World’s best vitamin team. We are just warming up. #vitaminwell #vitaminwellplus https://t.co/4HFG5oEaSb | 2016-04-03 19:52:12
12 | If you cant find what you need, you have to make it yourself. #VitaminWellplus the next generation of sportdrinks https://t.co/lxfeGW4msJ | 2016-03-29 19:31:32
13 | Stronger than ever and just warming up. #VitaminWellplus to be continued March 29th 9:30pm CET. https://t.co/Vo7TdEIr9X | 2016-03-26 10:06:50
14 | You think Im done. But Im just warming up. #VitaminWellplus https://t.co/FRWPGstbJ1 | 2016-03-24 09:41:35
15 | Thanks for the response @LaTourEiffel and I do agree with you - Paris looks beautiful from up here where we stand - and I love you too. | 2016-03-16 12:23:51
16 | Fasten seatbelts https://t.co/8YbgYHmCPV | 2016-03-13 16:21:17
17 | PSG https://t.co/mfRDoKbu91 | 2016-03-11 16:55:54
18 | Almost there... https://t.co/IFui93FpfH | 2016-02-21 19:44:32
19 | Legend! I thought I was the only beast in France... https://t.co/d24hlW5YLR | 2016-01-25 21:36:18
20 | Malm, I’m here... Are you ready?! See you tonight! https://t.co/KpwFHKNIvs | 2015-11-25 12:31:00
21 | Hey Malm just one more day to go! https://t.co/uSOty5Xf9p | 2015-11-24 13:07:27
22 | The new Swedish away jersey. Together we will create special moments wearing this kit. #BeTheDifference https://t.co/HfWB3Nguub | 2015-11-12 13:02:46
23 | My 10th Golden Ball, the award for Sweden’s Football Player of the Year. https://t.co/9t3gXsVNFL | 2015-11-10 14:14:51
24 | The new home jersey for Sweden. Proud to wear this when we battle for our place at the Euro 2016. #BeTheDifference https://t.co/nqKTLsRoj4 | 2015-11-05 09:14:13
25 | Today is #WorldFoodDay, please support my friends @WFP for a world free from hunger by 2030. One Future, #ZeroHunger http://t.co/WnDVxhf9Uh | 2015-10-16 10:03:21
26 | An hour before kickoff... http://t.co/QSx4acMOPd | 2015-09-15 17:55:00
27 | MFF in Champions League. One day I hope to experience the Champions League in Malm on the pitch. Congratulations! http://t.co/ZJ5CAMPBNH | 2015-08-26 08:22:11
28 | Check out my friend Svartzonker in the new Volvo commercial. Awesome, real. Watch it: https://t.co/Glu1G7380H | 2015-08-23 09:58:40
29 | Get the story behind my perfume. On Instagram http://t.co/L1JdqSkAZE Video https://t.co/wbo9pb9yab http://t.co/vcBqWg3L4R | 2015-08-12 14:09:37
30 | Join me on http://t.co/E9Aae5Fc3h and reveal my new video. http://t.co/nITDIyhzLd | 2015-08-10 13:16:20
31 | Its getting close. In stores Aug. 12 http://t.co/0kiRRFb56i | 2015-08-08 14:04:07
32 | On my way home with nr 27. http://t.co/MFAkgKCMWZ | 2015-08-02 07:06:14
33 | Mvp thanks to this guy Maxwell http://t.co/s7s6vbiJKY | 2015-07-30 06:38:49
34 | Pre-season in the US. http://t.co/d6EF9JJWzg | 2015-07-25 17:30:45
35 | What a champion, my friend Djokovic! Another Grand Slam. Idemoooo. @DjokerNole http://t.co/vMGo61sVyE | 2015-07-12 16:49:07
36 | #805millionnames #WFP@WFP@TimesSquareNYC http://t.co/H1DyS7dEhX http://t.co/ENgs81Xo2C | 2015-06-25 20:58:17
37 | Pre-release event @coletteparis, Wednesday June 24 Ill be there at 3 pm. Will you? http://t.co/AvDMucr7wv | 2015-06-22 15:54:51


38 | Sneak preview the result of 2 yrs hard work to create my first fragrance. This is what it looks like! Avail. Aug 12 http://t.co/0SgmSBMLu8 | 2015-06-22 08:02:50
39 | A sneak preview of something Ive been working on for the past 2 years. Something Im very proud of. So stay tuned.. http://t.co/IM2LI2LUng | 2015-06-16 08:00:49
40 | Good news today! But there are still 795 million hungry people. Make sure the world knows. #805millionnames http://t.co/Y8yI0buvOH #WFP | 2015-05-27 14:27:38
41 | Overwhelmed #805millionnames http://t.co/s3QXn1V8Ec | 2015-03-06 12:04:05
42 | This is Carmen. Read her story. http://t.co/yQCUmeIAvk #805millionnames http://t.co/w4Va9b72Vc | 2015-02-24 14:03:21
43 | The names are no longer on my body but the people are still out there. http://t.co/H7VPUKORyb #805millionnames http://t.co/PwnNqT6kSZ | 2015-02-20 10:59:37
44 | 805 million people are suffering from hunger. Make sure the world knows. #805millionnames http://t.co/H7VPUKORyb http://t.co/9LiIgqfkoP | 2015-02-17 18:45:23
45 | 805 million people suffer from hunger. Make sure the world knows. #805millionnames w/@WFP http://t.co/H7VPUL6spJ | 2015-02-15 14:21:56
46 | A very important name for me. http://t.co/8Pp8FGg1Kn | 2015-02-13 23:00:11
47 | Introducing Vitamin Well my new friends from Sweden making great vitamin drinks. http://t.co/wOduBA8M3v http://t.co/jiy6Zc9DYr | 2014-12-08 22:16:42
48 | Mixed exercises at the gym today. http://t.co/iaOt5npg51 #PreSeason #Training http://t.co/T7Z9SY1BNK | 2014-07-19 11:10:28
49 | Finally pre-season training is in full swing.Cardio in combination with strength and stability http://t.co/iaOt5npg51 http://t.co/TfJ8sm6uWW | 2014-07-18 14:55:32
50 | Follow me in Brazil! My app Zlatan Unplugged is now available in Brazil on http://t.co/iaOt5npg51 #vemibra http://t.co/HVA3dvozbD | 2014-06-19 08:33:10
......

B.2 Dutch Tweets

All Mark-Jan Fledderus mentions (@Fleddie) that were retrieved.

No. | Tweet text | Tweet date and time
1 | Wat een avond en wat een supporters! Trots op @HeraclesAlmelo en natuurlijk @Fleddie en @brahimdarri #europesedroom https://t.co/ZP0wmKG860 | 2016-05-23 23:00:08
2 | Geweldig seizoen, 6e plaats en #europesedroom. Geniet ervan! @HeraclesAlmelo @Bultje @ThomasBruns @Fleddie https://t.co/lpW7w6s8ep | 2016-05-23 20:14:20
3 | @Fleddie Mooi man, toch nog Europa in! gefelicibrandbiertunited! https://t.co/Xttx0eJsKZ | 2016-05-23 19:23:09
4 | Aanvoerder @fleddie laat het publiek Gonzalo Garca toezingen, hij kan er helaas niet bij zijn. | 2016-05-23 18:09:33
5 | @HeraclesAlmelo @Fleddie @JoeyPelupessy @marijndekler jammer vd regie v rtvoost interviewen tijdens presenteren vd spelers joey gemist | 2016-05-23 18:07:16


6 | Delen jullie ook jullie mooie foto’s, @Fleddie , @JoeyPelupessy , @marijndekler , en alle anderen!! #wijgaaneuropain #europesedroom | 2016-05-23 18:04:44
7 | Korte impressie: @Fleddie: ’waar is @JoeyPelupessy ’: https://t.co/g0EqzwFIFa | 2016-05-23 18:01:56
8 | Aanvoerder @fleddie spreekt! https://t.co/VadtJwgBGr | 2016-05-22 17:31:27
9 | @Fleddie moeten jullie wel allemaal fijn blijven @HeraclesAlmelo | 2016-05-22 16:48:37
10 | Fantastisch resultaat van mijn oude club!! Geweldig gedaan boys !! @ThomasBruns @HeraclesAlmelo @Fleddie !! Tis jullie gegund !! | 2016-05-22 16:38:39
11 | @Fleddie super gefeliciteerd !! Is je gegund. Groetjes een Roda jc fan die je niet vergeten is | 2016-05-22 16:22:37
12 | @HeraclesAlmelo @fleddie ... Zo vierde het Amaliaplein in Almelo mee #europesedroom #europa https://t.co/ooyz2Bfv0v | 2016-05-22 16:18:57
13 | @Fleddie GEWELDIG!! #verdiend | 2016-05-22 16:04:22
14 | @Fleddie het is je gegund! #europa | 2016-05-22 16:01:45
15 | @Fleddie Gefeliciteerd Mark-Jan. Is jullie gegund! Maak er een mooi feest van | 2016-05-22 15:58:43
16 | @Fleddie wij ook. Supertrots | 2016-05-22 15:58:19
17 | @Fleddie ooit kwam je naar @rodajckerkrade om Europees voetbal te halen. Dat is helaas nooit gelukt, maar nu ga je Europa in! Gefeliciteerd! | 2016-05-22 15:57:43
18 | @HeraclesAlmelo @Fleddie @asito Het is jullie gegund! Zo’n mooie club verdient Europees Voetbal!!! | 2016-05-22 15:55:20
19 | @Fleddie gefeliciteerd kerel | 2016-05-22 15:53:01
20 | @Fleddie gefeliciteerd met het behalen van de Europaleague | 2016-05-22 15:45:22
21 | Gefeliciteerd @Fleddie @woutweghorst9 | 2016-05-22 15:28:41
22 | @HeraclesAlmelo @Fleddie Gefeliciteerd met Europees!! #verdiend https://t.co/lQpeHT3dyF | 2016-05-22 14:51:37
23 | @Fleddie Gefeliciteerd met het behalen van Europees voetbal, Koempel! | 2016-05-22 14:36:41
24 | @HeraclesAlmelo @Fleddie gefeliciteerd met het behalen van Europees voetbal!!! Fan v @fcgroningen | 2016-05-22 14:28:46
25 | @Fleddie van harte gefeliciteerd met het behalen van Europees voetbal | 2016-05-22 14:27:14
26 | #Wedstrijdfeit: Heracles won in 14/15 voor het laatst bij @FCUtrecht (2-4). Trefzekere Heraclieden: Bruns, @Fleddie, Weghorst & IBH. #utrher | 2016-05-22 12:15:01
27 | PRETALK | Is @Fleddie fit? Dat & meer in de voorbeschouwing met hoofdcoach John Stegeman | https://t.co/i3VEMBJCbr https://t.co/v9jQQHo95M | 2016-05-19 12:18:05
28 | @Fleddie Dankje wel vriend | 2016-05-17 18:15:17
29 | #Wedstrijdfeit: @Fleddie scoorde net als @LuisSuarez9 tegen Partizan in z’n 1e Europese duel voor @FCGroningen tegen Partizan (4-2). #hergro | 2016-05-15 10:15:06
30 | Wie speelde er voor #FCGroningen n #Heracles? We trappen af met Ab Gritter en @Fleddie. Help je mee? #groher https://t.co/OcmLMj1QGT | 2016-05-12 14:25:49
31 | Als @Fleddie met Heracles de dubbele ontmoeting tegen Groningen wint, kan hij mede-recordhouder worden. https://t.co/ADsUMwIsyR | 2016-05-11 13:40:09


32 | Wederom met Levi vanuit Ermelo kilometers maken om zijn favoriete club #Heracles en spelers @Fleddie en Weghorst te bewonderen! #groher | 2016-05-08 08:35:31
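The mentions and hashtags in samples like the ones above can be pulled out with regular expressions. The following is a minimal sketch, not the thesis implementation: the patterns, the function name `count_entities`, and the choice to lowercase handles (so that `@Fleddie` and `@fleddie` are counted together) are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed patterns: a handle or hashtag is '@' or '#' followed by word characters.
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def count_entities(tweets):
    """Count mentioned accounts and hashtags over a list of tweet texts."""
    mentions = Counter()
    hashtags = Counter()
    for text in tweets:
        # Lowercase so that @Fleddie and @fleddie are treated as one account.
        mentions.update(m.lower() for m in MENTION_RE.findall(text))
        hashtags.update(h.lower() for h in HASHTAG_RE.findall(text))
    return mentions, hashtags


# Four tweet texts taken from the Dutch sample in Section B.2.
sample = [
    "@Fleddie GEWELDIG!! #verdiend",
    "@Fleddie het is je gegund! #europa",
    "Aanvoerder @fleddie spreekt! https://t.co/VadtJwgBGr",
    "@HeraclesAlmelo @Fleddie Gefeliciteerd met Europees!! #verdiend https://t.co/lQpeHT3dyF",
]

mentions, hashtags = count_entities(sample)
print(mentions.most_common(3))
print(hashtags.most_common(3))
```

Counting over the full set of retrieved tweets in the same way would give a ranked list of accounts and topics; whether handles should be case-folded is a design choice that depends on how the results are presented.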


Appendix C

Removed Stop Words

The stop words listed below were removed from the text corpus:

English: a, about, above, after, again, against, all, am, an, and, any, are, aren’t, as, at, be, because, been, before, being, below, between, both, but, by, can’t, cannot, could, couldn’t, did, didn’t, do, does, doesn’t, doing, don’t, down, during, each, few, for, from, further, had, hadn’t, has, hasn’t, have, haven’t, having, he, he’d, he’ll, he’s, her, here, here’s, hers, herself, him, himself, his, how, how’s, i, i’d, i’ll, i’m, i’ve, if, in, into, is, isn’t, it, it’s, its, itself, let’s, me, more, most, mustn’t, my, myself, no, nor, not, of, off, on, once, only, or, other, ought, our, ours, ourselves, out, over, own, same, shan’t, she, she’d, she’ll, she’s, should, shouldn’t, so, some, such, than, that, that’s, the, their, theirs, them, themselves, then, there, there’s, these, they, they’d, they’ll, they’re, they’ve, this, those, through, to, too, under, until, up, very, was, wasn’t, we, we’d, we’ll, we’re, we’ve, were, weren’t, what, what’s, when, when’s, where, where’s, which, while, who, who’s, whom, why, why’s, with, won’t, would, wouldn’t, you, you’d, you’ll, you’re, you’ve, your, yours, yourself, yourselves

Dutch: aan, al, alles, als, altijd, andere, ben, bij, daar, dan, dat, de, der, deze, die, dit, doch, doen, door, dus, een, eens, en, er, ge, geen, geweest, haar, had, heb, hebben, heeft, hem, het, hier, hij, hoe, hun, iemand, iets, ik, in, is, ja, je, kan, kon, kunnen, maar, me, meer, men, met, mij, mijn, moet, na, naar, niet, niets, nog, nu, of, om, omdat, onder, ons, ook, op, over, reeds, te, tegen, toch, toen, tot, u, uit, uw, van, veel, voor, want, waren, was, wat, werd, wezen, wie, wil, worden, wordt, zal, ze, zelf, zich, zij, zijn, zo, zonder, zou
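As an illustration of how such a list is typically applied, the sketch below filters stop words out of a tokenized text. It is a minimal example under stated assumptions rather than the preprocessing code used in this project: the tokenizer (a simple regular expression), the function name `remove_stop_words`, and the small subset of the English list are all chosen for illustration only.

```python
import re

# Illustrative subset of the English stop word list above (assumption: the
# full list would be loaded from a file in practice).
ENGLISH_STOP_WORDS = {
    "a", "about", "and", "from", "here", "i", "in", "only",
    "the", "too", "up", "was", "we", "where", "you",
}


def remove_stop_words(text, stop_words):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # Normalize typographic apostrophes so that forms like "i’m" can match.
    normalized = text.lower().replace("’", "'")
    tokens = re.findall(r"[a-z']+", normalized)
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("I thought I was the only beast in France", ENGLISH_STOP_WORDS))
# → ['thought', 'beast', 'france']
```

The same function works for Dutch text when it is given the Dutch list; note that the character class in the tokenizer would then need to be widened to cover accented letters.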


Appendix D

Factor Loadings of Personality Adjectives

Table D.1 shows the factor loadings of 435 adjectives on the five personality factors. The order of the factors reflects their relative size, and the minus sign on factor IV indicates that its negative pole has more items than its positive pole. The factors are numbered as follows:
• Factor I: Agreeableness
• Factor II: Extraversion
• Factor III: Conscientiousness
• Factor IV: Neuroticism
• Factor V: Intellect/Imagination (in this project referred to as Openness to Experience)

Table D.1: Factor loadings of 435 familiar personality adjectives on five factors, adopted from Saucier & Goldberg (1996, pp. 66-74)

Adjective I II III IV- V
Sympathetic 0.62 0.02 -0.05 0.07 0.03
Kind 0.60 0.07 0.06 0.02 0.00
Warm 0.56 0.26 0.06 0.10 0.07
Understanding 0.53 0.03 0.10 -0.04 0.13
Courteous 0.53 -0.07 0.23 0.02 0.00
Compassionate 0.52 0.12 0.02 0.13 0.18
Cooperative 0.52 -0.04 0.22 -0.18 -0.03
Polite 0.52 -0.07 0.30 0.04 -0.07
Affectionate 0.51 0.21 -0.01 0.21 0.09
Considerate 0.51 -0.01 0.18 -0.07 0.05
Respectful 0.51 -0.04 0.31 -0.02 -0.16
Sincere 0.49 -0.01 0.12 -0.05 0.10
Sentimental 0.48 0.08 0.01 0.34 -0.09
Cordial 0.47 0.07 0.01 0.02 0.16
Helpful 0.47 0.14 0.19 -0.14 0.09
Tolerant 0.46 -0.17 -0.07 -0.36 0.13
Charitable 0.46 0.12 0.04 -0.14 0.11
Sensitive 0.46 -0.10 0.00 0.35 0.23
Agreeable 0.46 -0.07 -0.01 -0.16 0.03


Pleasant 0.45 0.22 0.03 -0.22 -0.06
Feminine 0.43 0.00 -0.06 0.37 0.02
Trustful 0.43 0.06 0.15 -0.23 -0.13
Loyal 0.43 0.06 0.13 -0.03 0.09
Thoughtful 0.42 -0.07 0.20 -0.03 0.15
Peaceful 0.42 -0.08 0.10 -0.32 -0.02
Obliging 0.41 -0.08 0.05 -0.11 0.03
Generous 0.40 0.15 -0.03 -0.15 0.04
Amiable 0.40 0.08 -0.08 -0.08 0.27
Cheerful 0.40 0.38 0.03 -0.22 -0.06
Mannerly 0.39 -0.09 0.34 0.10 -0.02
Flexible 0.39 0.07 -0.02 -0.20 0.13
Reasonable 0.38 -0.06 0.25 -0.25 0.12
Modest 0.37 -0.29 0.18 -0.08 -0.01
Genial 0.37 0.14 -0.08 -0.03 0.30
Jovial 0.37 0.36 -0.06 -0.13 0.09
Accommodating 0.35 0.02 0.03 -0.12 0.01
Humble 0.35 -0.21 0.12 -0.17 -0.12
Gullible 0.33 -0.10 -0.28 0.27 -0.18
Moral 0.33 -0.09 0.26 -0.02 -0.02
Honest 0.32 0.01 0.22 -0.17 0.04
Religious 0.31 0.01 0.17 0.06 -0.18
Passionate 0.31 0.20 -0.04 0.27 0.09
Optimistic 0.31 0.30 0.09 -0.24 0.02
Lenient 0.30 -0.15 -0.16 -0.15 0.07
Benevolent 0.30 -0.09 -0.04 -0.11 0.23
Unselfish 0.30 0.08 0.06 -0.22 0.05
Conscientious 0.30 -0.08 0.30 -0.02 0.19
Reverent 0.29 -0.01 0.20 0.06 -0.14
Tactful 0.29 -0.06 0.21 -0.02 0.21
Earnest 0.29 -0.03 0.07 -0.09 0.22
Adaptable 0.27 0.06 0.11 -0.24 0.13
Truthful 0.26 -0.01 0.19 -0.22 0.10
Naive 0.26 -0.20 -0.21 0.12 -0.18
Altruistic 0.25 0.00 -0.03 -0.06 0.22
Compliant 0.24 -0.19 -0.02 0.01 -0.04
Natural 0.23 0.09 0.03 -0.17 0.14
Suggestible 0.17 0.00 -0.06 0.07 -0.16
Selfless 0.13 -0.12 0.00 -0.12 0.00
Cold -0.50 -0.33 0.03 -0.02 0.10
Harsh -0.50 0.03 0.06 0.08 -0.05
Rude -0.50 0.08 -0.15 0.00 0.06
Unsympathetic -0.48 -0.14 -0.01 -0.16 0.01
Antagonistic -0.47 -0.09 -0.09 0.19 -0.02
Abusive -0.46 0.01 -0.11 -0.03 -0.11
Rough -0.45 0.16 0.03 -0.23 -0.11
Inconsiderate -0.43 -0.06 -0.29 -0.03 0.05
Egotistical -0.42 0.12 -0.09 0.18 0.09
Combative -0.42 0.19 0.07 0.04 -0.09
Callous -0.41 -0.03 0.04 -0.11 -0.21
Domineering -0.41 0.40 0.15 0.10 -0.02
Impolite -0.41 0.00 -0.25 -0.17 0.15
Belligerent -0.41 0.09 -0.03 0.11 -0.11
Ruthless -0.41 0.09 0.01 -0.07 -0.16


Coarse -0.41 0.01 -0.13 -0.10 -0.17
Abrupt -0.41 0.12 -0.10 0.11 0.07
Insincere -0.40 -0.08 -0.11 0.01 -0.07
Cruel -0.40 0.00 -0.04 -0.06 0.00
Unkind -0.40 -0.14 -0.09 -0.12 0.01
Insensitive -0.39 -0.09 -0.04 -0.27 -0.02
Impersonal -0.38 -0.31 0.06 -0.07 -0.02
Scornful -0.38 -0.09 -0.04 0.19 -0.04
Uncharitable -0.38 -0.14 0.00 0.00 -0.01
Cynical -0.36 -0.17 -0.11 0.04 0.19
Bitter -0.35 -0.21 -0.06 0.27 0.06
Uncooperative -0.35 -0.02 -0.16 0.05 0.04
Demanding -0.34 0.19 0.15 0.31 0.00
Egocentric -0.34 0.04 -0.12 0.13 0.12
Conceited -0.34 0.11 -0.12 0.18 0.16
Bigoted -0.34 0.02 -0.07 0.05 -0.17
Intolerant -0.34 0.04 0.04 0.28 -0.08
Unforgiving -0.34 -0.12 -0.02 0.13 0.09
Disrespectful -0.33 0.00 -0.31 -0.14 0.23
Tough -0.33 0.15 0.13 -0.24 -0.05
Sly -0.33 0.16 0.04 -0.01 0.00
Stubborn -0.33 0.07 -0.04 0.23 0.05
Greedy -0.33 -0.05 -0.12 0.27 0.00
Argumentative -0.32 0.16 -0.02 0.22 0.01
Bullheaded -0.32 0.15 -0.08 0.12 0.03
Boastful -0.32 0.29 -0.10 0.17 -0.14
Critical -0.32 -0.01 0.06 0.31 0.17
Rebellious -0.30 0.15 -0.24 0.08 0.29
Deceitful -0.30 0.03 -0.15 0.07 -0.06
Sarcastic -0.30 0.03 -0.09 0.16 0.14
Underhanded -0.29 0.08 -0.08 -0.05 -0.16
Vindictive -0.29 0.03 0.05 0.11 -0.28
Suspicious -0.29 -0.18 0.05 0.24 -0.04
Devious -0.29 0.16 -0.10 0.09 -0.11
Thoughtless -0.29 -0.07 -0.26 -0.09 0.00
Ungracious -0.29 -0.16 -0.19 -0.15 0.01
Prejudiced -0.28 -0.01 -0.03 0.07 -0.08
Uncouth -0.28 0.02 -0.20 -0.18 -0.08
Tactless -0.27 -0.01 -0.21 -0.06 -0.15
Manipulative -0.27 0.16 0.08 0.13 0.02
Cunning 0.27 0.22 0.09 -0.06 0.02
Opinionated -0.26 0.22 0.07 0.11 0.03
Unruly -0.26 0.22 -0.25 -0.07 -0.02
Rigid -0.26 -0.11 0.24 0.01 -0.18
Obstinate -0.26 0.03 -0.05 0.24 0.09
Stingy -0.25 -0.13 0.04 0.22 0.01
Nonreligious -0.25 -0.08 -0.16 -0.17 0.19
Smug -0.24 0.01 0.01 0.10 -0.01
Skeptical -0.24 -0.23 0.03 0.11 0.22
Vain -0.23 0.05 -0.13 0.21 0.16
Irreverent -0.18 -0.05 -0.14 -0.11 0.09
Crafty -0.16 0.10 0.06 -0.10 -0.04
Masochistic -0.14 -0.08 -0.06 -0.01 -0.08
Self-seeking -0.11 0.05 0.04 0.07 0.08


Extroverted 0.06 0.64 -0.07 -0.01 0.07
Talkative 0.09 0.62 -0.15 0.13 -0.07
Aggressive -0.30 0.61 0.09 0.00 0.02
Sociable 0.27 0.59 -0.01 -0.04 -0.07
Social 0.25 0.58 -0.04 0.00 -0.08
Assertive -0.15 0.53 0.15 -0.08 0.14
Bold -0.18 0.53 0.03 -0.19 0.05
Verbal -0.03 0.53 -0.08 0.05 0.11
Enthusiastic 0.29 0.50 0.02 -0.03 -0.01
Spirited 0.23 0.49 -0.01 -0.02 0.10
Confident -0.07 0.49 0.26 -0.32 0.10
Communicative 0.25 0.49 0.04 -0.02 0.10
Magnetic 0.22 0.48 -0.02 -0.05 0.17
Energetic 0.16 0.48 0.12 -0.13 0.02
Daring -0.14 0.46 0.00 -0.20 0.02
Rambunctious 0.00 0.46 -0.15 0.06 -0.04
Outspoken -0.14 0.45 -0.09 -0.03 0.21
Vivacious 0.31 0.44 0.00 0.03 0.10
Dominant -0.36 0.44 0.20 0.05 0.09
Merry 0.38 0.44 -0.02 -0.15 -0.09
Unrestrained -0.01 0.43 -0.17 -0.10 -0.02
Active 0.11 0.43 0.15 -0.14 0.02
Boisterous -0.15 0.42 -0.15 -0.01 -0.07
Assured 0.01 0.42 0.23 -0.31 0.12
Uninhibited -0.06 0.42 -0.16 -0.20 0.20
Playful 0.20 0.41 -0.12 -0.02 -0.09
Happy-go-lucky 0.21 0.40 -0.21 -0.22 -0.21
Vigorous 0.04 0.40 0.05 -0.15 0.12
Friendly 0.37 0.39 -0.03 -0.17 -0.16
Flamboyant -0.04 0.39 -0.10 0.08 0.01
Adventurous 0.00 0.38 -0.04 -0.19 0.10
Expressive 0.12 0.38 0.06 0.14 0.22
Forceful -0.34 0.38 0.15 -0.01 0.11
Carefree 0.18 0.37 -0.22 -0.22 -0.12
Flirtatious 0.03 0.36 -0.15 0.20 0.00
Competitive -0.25 0.34 0.19 -0.03 -0.11
Mischievous -0.10 0.34 -0.19 0.02 0.14
Direct -0.07 0.33 0.14 -0.13 0.22
Spontaneous 0.09 0.33 -0.11 -0.04 0.25
Zealous 0.20 0.32 -0.03 0.00 0.08
Gregarious 0.07 0.32 -0.10 0.03 0.07
Exhibitionistic -0.20 0.32 -0.17 0.06 0.05
Straightforward -0.02 0.31 0.10 -0.19 0.19
Humorous 0.15 0.31 -0.09 -0.13 0.01
Frank -0.08 0.31 0.02 -0.09 0.22
Opportunistic -0.09 0.30 0.16 -0.04 0.01
Demonstrative 0.03 0.29 -0.07 0.01 0.15
Enterprising -0.04 0.28 0.21 -0.19 0.27
Sexy 0.06 0.28 -0.01 0.07 0.17
Hearty 0.11 0.28 0.07 -0.24 0.02
Wordy 0.00 0.26 -0.09 0.17 0.04
Explosive -0.25 0.25 -0.10 0.21 -0.04
Witty -0.01 0.24 -0.03 -0.15 0.21
Persistent -0.10 0.23 0.20 -0.01 0.12


Proud -0.09 0.20 0.16 0.01 0.07
Impetuous -0.04 0.17 -0.17 0.12 0.06
Intrusive -0.13 0.16 -0.05 0.03 -0.14
Withdrawn -0.05 -0.67 -0.04 0.02 0.08
Silent 0.07 -0.66 0.08 -0.08 -0.02
Introverted 0.02 -0.65 0.01 0.03 0.10
Shy 0.18 -0.65 0.06 0.03 -0.01
Quiet 0.15 -0.64 0.15 -0.09 0.12
Reserved 0.09 -0.60 0.22 -0.06 0.02
Timid 0.30 -0.60 -0.01 0.08 -0.11
Bashful 0.22 -0.59 0.05 0.03 -0.02
Unsociable -0.17 -0.59 0.02 -0.04 0.16
Unaggressive 0.35 -0.57 -0.14 -0.07 0.06
Inhibited 0.09 -0.54 -0.02 0.15 -0.06
Uncommunicative -0.18 -0.52 -0.07 -0.08 -0.01
Passive 0.22 -0.48 -0.06 -0.03 -0.19
Meek 0.22 -0.48 -0.08 0.02 -0.22
Restrained 0.12 -0.47 0.20 -0.10 0.03
Dull -0.03 -0.46 -0.05 -0.02 -0.05
Bland -0.02 -0.44 0.07 -0.09 -0.19
Sedate 0.07 -0.41 0.06 -0.10 -0.10
Somber -0.09 -0.41 0.12 -0.03 -0.14
Melancholic 0.00 -0.40 -0.04 0.20 -0.09
Unfriendly -0.28 -0.40 -0.03 -0.01 0.15
Unadventurous 0.10 -0.38 0.01 0.08 -0.08
Detached -0.18 -0.37 -0.08 -0.12 0.07
Uncompetitive 0.29 -0.37 -0.18 -0.07 0.14
Submissive 0.34 -0.36 -0.09 0.11 -0.23
Cowardly 0.25 -0.36 -0.15 0.26 -0.05
Indirect 0.11 -0.35 -0.14 0.06 0.01
Pessimistic -0.14 -0.35 -0.07 0.24 0.02
Negativistic -0.31 -0.34 -0.08 0.31 -0.03
Placid 0.12 -0.34 0.10 -0.20 -0.04
Sluggish -0.04 -0.33 -0.19 0.09 -0.11
Nonpersistent 0.16 -0.33 -0.23 -0.14 -0.04
Serious 0.03 -0.31 0.31 0.04 0.17
Aloof -0.20 -0.30 -0.03 0.03 0.06
Vague 0.03 -0.28 -0.28 0.11 -0.07
Docile 0.24 -0.28 0.05 -0.04 -0.19
Secretive -0.07 -0.28 0.02 0.11 -0.04
Unimaginative 0.05 -0.25 0.00 -0.07 -0.22
Wary -0.06 -0.25 0.07 0.09 0.18
Lethargic -0.07 -0.25 -0.19 0.04 -0.16
Unaffectionate -0.15 -0.25 0.00 -0.22 0.13
Prudish 0.13 -0.24 0.15 0.12 -0.17
Apathetic -0.08 -0.23 -0.04 -0.01 -0.18
Humorless -0.06 -0.23 0.13 -0.10 -0.06
Impartial 0.08 -0.23 0.08 -0.17 0.12
Organized 0.08 -0.06 0.65 0.08 -0.02
Precise -0.02 -0.01 0.61 -0.01 0.12
Responsible 0.28 -0.02 0.59 -0.06 0.08
Thorough 0.04 0.00 0.58 -0.10 0.13
Efficient 0.09 0.01 0.57 0.00 0.07
Orderly 0.08 -0.06 0.57 0.10 -0.06


Self-disciplined 0.12 0.01 0.55 -0.16 0.09
Practical 0.14 -0.08 0.54 -0.16 0.04
Systematic 0.05 -0.12 0.54 -0.04 0.08
Dependable 0.31 -0.05 0.51 -0.06 -0.01
Reliable 0.33 -0.02 0.49 -0.07 0.03
Exacting -0.08 -0.02 0.47 0.02 0.10
Concise 0.05 0.00 0.46 -0.13 0.13
Careful 0.19 -0.22 0.46 0.01 0.01
Prompt 0.16 -0.03 0.46 0.05 -0.03
Logical -0.03 -0.04 0.46 -0.22 0.27
Consistent 0.05 -0.03 0.45 -0.34 -0.07
Steady 0.17 -0.01 0.44 -0.33 0.06
Meticulous 0.01 -0.11 0.44 0.09 0.04
Decisive -0.03 0.18 0.43 -0.22 0.15
Punctual 0.14 -0.08 0.40 0.04 0.03
Firm -0.17 0.17 0.40 -0.15 0.08
Economical 0.14 -0.13 0.39 -0.14 0.01
Cautious 0.14 -0.32 0.39 0.12 -0.03
Strict -0.14 -0.04 0.38 0.12 -0.11
Purposeful 0.11 0.15 0.38 -0.10 0.19
Dignified 0.06 0.07 0.38 0.00 0.04
Formal 0.07 -0.09 0.37 0.14 -0.14
Perfectionistic -0.07 -0.07 0.37 0.16 0.13
Mature 0.20 -0.01 0.37 -0.21 0.20
Industrious 0.07 0.12 0.36 -0.14 0.25
Stern -0.29 0.02 0.36 0.01 -0.15
Controlled 0.10 -0.17 0.36 -0.22 -0.02
Alert 0.11 0.16 0.36 -0.09 0.20
Rational 0.13 -0.08 0.35 -0.24 0.32
Thrifty 0.12 -0.14 0.34 -0.12 -0.03
Ambitious 0.04 0.28 0.34 0.03 0.11
Conservative 0.09 -0.20 0.32 0.00 -0.28
Wise 0.07 0.08 0.30 -0.21 0.28
Sophisticated 0.14 0.13 0.30 0.07 0.18
Foresighted 0.12 -0.02 0.29 -0.11 0.27
Deliberate -0.08 0.00 0.28 -0.09 0.12
Refined 0.25 -0.05 0.27 0.07 0.15
Principled 0.20 -0.08 0.27 -0.10 0.15
Cultured 0.21 0.05 0.27 0.06 0.18
Objective 0.15 -0.04 0.27 -0.16 0.22
Moralistic 0.23 -0.11 0.26 0.07 -0.23
Discreet 0.24 -0.26 0.26 -0.09 0.18
Poised 0.18 0.13 0.26 -0.11 0.12
Disorganized 0.01 0.01 -0.64 -0.06 0.07
Haphazard -0.10 0.04 -0.57 0.00 -0.09
Disorderly -0.08 0.05 -0.57 -0.13 0.05
Careless -0.09 0.06 -0.57 -0.01 -0.04
Inefficient 0.01 -0.16 -0.54 -0.04 0.06
Impractical -0.06 -0.02 -0.53 0.11 -0.01
Unreliable -0.18 0.02 -0.51 -0.01 0.01
Inconsistent -0.06 -0.05 -0.51 0.22 0.06
Absent-minded 0.02 -0.10 -0.50 0.05 0.09
Scatterbrained 0.13 0.01 -0.47 0.15 -0.01
Illogical 0.07 0.00 -0.46 0.14 -0.14


Sloppy -0.10 -0.01 -0.46 -0.17 0.08
Undependable -0.13 0.00 -0.45 -0.04 0.10
Immature -0.05 0.00 -0.44 0.16 -0.09
Erratic -0.13 -0.01 -0.43 0.19 0.12
Negligent -0.16 -0.07 -0.43 0.03 -0.12
Reckless -0.21 0.18 -0.43 -0.03 0.07
Indecisive 0.12 -0.27 -0.41 0.20 -0.09
Forgetful 0.02 -0.06 -0.40 0.08 0.02
Lazy -0.02 -0.24 -0.40 0.10 -0.02
Unstable 0.00 -0.20 -0.39 0.28 0.05
Aimless -0.07 -0.21 -0.38 -0.16 -0.05
Foolhardy -0.02 0.14 -0.38 0.00 -0.18
Impulsive 0.06 0.26 -0.37 0.21 0.06
Lax 0.02 -0.09 -0.36 -0.08 -0.09
Wishy-washy 0.07 -0.26 -0.33 0.14 -0.06
Wasteful -0.06 0.06 -0.32 0.09 -0.03
Unambitious 0.06 -0.27 -0.30 -0.09 -0.03
Frivolous 0.07 0.09 -0.30 0.21 -0.04
Rash -0.18 0.11 -0.29 0.12 -0.04
Unsophisticated 0.03 -0.17 -0.25 -0.22 -0.10
Dishonest -0.20 0.01 -0.25 0.05 0.05
Unobservant 0.04 -0.12 -0.24 -0.12 -0.23
Indiscreet -0.11 0.14 -0.21 -0.03 -0.16
Lustful -0.16 0.07 -0.20 0.12 0.11
Indulgent 0.04 0.11 -0.18 0.10 0.09
Transparent 0.08 -0.03 -0.13 0.10 -0.12
Moody -0.13 -0.17 -0.07 0.53 0.04
Touchy -0.06 -0.15 0.01 0.51 -0.14
Temperamental -0.20 -0.05 -0.05 0.51 0.01
Irritable -0.27 -0.10 -0.01 0.49 -0.03
Emotional 0.38 0.12 -0.09 0.49 -0.02
Jealous -0.09 -0.07 -0.06 0.47 -0.10
Envious -0.04 -0.12 -0.10 0.47 -0.08
Possessive -0.07 -0.01 0.11 0.46 -0.07
Fretful 0.07 -0.24 -0.07 0.45 -0.05
Impatient -0.23 0.00 -0.02 0.44 0.01
Self-pitying -0.03 -0.36 -0.08 0.43 -0.03
Nervous 0.08 -0.23 -0.05 0.42 -0.05
Crabby -0.24 -0.15 -0.01 0.42 -0.07
Defensive -0.17 -0.17 -0.04 0.41 -0.05
Grumpy -0.27 -0.19 -0.03 0.40 -0.07
High-strung -0.03 0.10 -0.08 0.40 0.03
Insecure 0.06 -0.36 -0.29 0.39 0.02
Cranky -0.30 -0.08 -0.05 0.39 -0.11
Fearful 0.18 -0.24 -0.06 0.38 -0.12
Faultfinding -0.31 -0.10 0.05 0.38 -0.02
Quarrelsome -0.36 0.11 -0.01 0.38 -0.05
Anxious 0.05 -0.02 0.00 0.38 -0.13
Finicky -0.06 -0.06 0.07 0.35 -0.02
Snobbish -0.25 0.00 -0.05 0.33 0.08
Bossy -0.31 0.30 0.04 0.32 -0.02
Gossipy 0.08 0.13 -0.19 0.32 -0.15
Nosey -0.04 0.16 -0.15 0.32 -0.14
Selfish -0.24 -0.16 -0.14 0.31 0.21


Excitable 0.22 0.30 -0.07 0.31 -0.10
Restless -0.10 0.05 -0.20 0.30 0.07
Compulsive -0.02 0.12 -0.26 0.26 0.02
Hypocritical -0.16 -0.09 -0.18 0.26 -0.14
Obsessive -0.18 -0.05 -0.12 0.25 -0.04
Meddlesome -0.05 0.16 -0.08 0.25 -0.16
Extravagant 0.01 0.21 -0.13 0.24 0.00
Volatile -0.05 0.11 -0.08 0.22 -0.03
Self-indulgent -0.09 0.08 -0.15 0.22 0.13
Superstitious 0.05 0.05 -0.08 0.10 -0.09
Relaxed 0.21 0.17 0.04 -0.48 -0.02
Unemotional -0.30 -0.26 0.08 -0.47 0.05
Patient 0.43 -0.13 0.11 -0.44 0.07
Masculine -0.34 0.01 0.05 -0.43 0.10
Undemanding 0.30 -0.23 -0.13 -0.42 0.00
Easy-going 0.30 0.22 -0.15 -0.41 -0.06
Unexcitable -0.19 -0.34 0.04 -0.41 0.13
Courageous -0.08 0.33 0.10 -0.35 0.07
Brave -0.09 0.30 0.13 -0.35 0.07
Informal 0.07 0.01 -0.26 -0.33 0.23
Down-to-earth 0.24 -0.01 0.14 -0.28 0.09
Passionless -0.19 -0.21 0.03 -0.28 0.03
Earthy 0.22 -0.03 -0.01 -0.26 0.18
Nonchalant -0.01 0.04 -0.09 -0.26 0.03
Unassuming 0.16 -0.18 -0.14 -0.26 0.16
Casual 0.12 0.08 -0.16 -0.25 0.00
Weariless -0.01 0.18 0.08 -0.23 0.06
Intelligent 0.03 0.03 0.16 -0.13 0.55
Intellectual -0.02 0.04 0.21 -0.08 0.50
Smart 0.03 0.02 0.18 -0.17 0.49
Complex 0.00 -0.10 -0.04 0.14 0.48
Philosophical 0.03 -0.03 0.01 -0.04 0.47
Innovative 0.04 0.13 0.09 -0.16 0.44
Bright 0.09 0.12 0.12 -0.12 0.44
Unconventional -0.02 0.03 -0.32 -0.10 0.44
Knowledgeable 0.02 0.05 0.26 -0.16 0.43
Deep 0.20 -0.07 0.03 0.01 0.43
Ingenious -0.02 0.11 0.06 -0.18 0.43
Inquisitive 0.05 0.09 -0.03 0.00 0.43
Insightful 0.10 -0.04 0.01 -0.10 0.42
Nonconforming -0.09 0.04 -0.17 -0.17 0.42
Analytical -0.08 -0.12 0.27 -0.02 0.42
Introspective 0.12 -0.26 -0.05 0.05 0.41
Contemplative 0.15 -0.24 0.02 0.03 0.41
Perceptive 0.14 0.04 0.19 -0.04 0.40
Articulate 0.07 0.10 0.15 0.01 0.38
Inventive -0.04 0.16 0.06 -0.17 0.36
Creative 0.07 0.10 0.00 -0.08 0.33
Shrewd -0.33 0.02 0.02 -0.09 0.33
Individualistic -0.17 0.08 0.06 -0.13 0.33
Clever -0.09 0.18 0.14 -0.21 0.33
Intense -0.02 0.12 0.10 0.12 0.31
Imaginative 0.05 0.14 -0.04 -0.05 0.31
Independent -0.14 0.20 0.18 -0.20 0.30


Self-critical 0.18 -0.21 0.03 0.24 0.30
Progressive 0.15 0.16 0.17 -0.08 0.28
Diplomatic 0.23 0.08 0.13 -0.06 0.28
Empathic 0.22 -0.01 -0.01 0.03 0.28
Versatile 0.12 0.16 0.09 -0.23 0.27
Distrustful -0.27 -0.19 -0.11 0.18 0.27
Idealistic 0.18 0.02 -0.10 0.12 0.26
Eloquent 0.10 0.18 0.16 0.00 0.25
Artistic 0.12 0.01 0.01 0.01 0.25
Candid -0.01 0.15 -0.06 -0.10 0.25
Worldly -0.04 0.20 0.09 -0.04 0.24
Unpredictable 0.03 0.10 -0.17 0.12 0.23
Scrupulous 0.04 -0.10 0.12 -0.09 0.23
Curious 0.05 0.13 0.02 0.05 0.23
Tenacious -0.06 0.10 0.10 -0.05 0.22
Animated 0.17 0.13 -0.19 0.05 0.21
Sensual 0.12 0.16 -0.13 0.16 0.20
Ethical 0.20 -0.17 0.18 -0.10 0.20
Autonomous -0.03 -0.10 0.13 -0.12 0.18
Subjective 0.04 -0.01 -0.02 0.05 0.06
Simple 0.12 -0.13 0.01 -0.17 -0.45
Conventional 0.12 -0.13 0.33 -0.06 -0.38
Traditional 0.14 -0.14 0.28 0.02 -0.36
Uninquisitive -0.03 -0.21 0.00 -0.14 -0.35
Unintelligent 0.07 -0.09 -0.16 -0.02 -0.34
Surly -0.24 0.03 0.13 -0.04 -0.31
Pompous -0.16 0.12 0.07 0.08 -0.31
Dependent 0.22 -0.14 -0.07 0.18 -0.29
Shallow -0.10 -0.12 -0.09 -0.03 -0.29
Unintellectual 0.10 -0.10 -0.20 -0.09 -0.27
Patronizing -0.02 0.03 0.04 0.06 -0.27
Ignorant 0.04 -0.07 -0.18 0.02 -0.27
Inarticulate 0.03 -0.15 -0.12 -0.12 -0.26
Pretentious -0.15 0.14 -0.02 0.10 -0.25
Unscrupulous -0.06 0.08 -0.06 -0.05 -0.24
Predictable 0.04 -0.19 0.14 -0.19 -0.22
Condescending -0.11 -0.07 0.06 0.06 -0.15
Dogmatic -0.10 -0.04 0.06 0.01 -0.11
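To make concrete how a loadings table of this kind can feed a bag-of-words analysis, the sketch below looks up adjectives found in a text and averages their loadings per factor. This is only an illustration under stated assumptions: the averaging rule, the tokenizer, and the function name `profile` are hypothetical choices and are not claimed to be the scoring method used in the experiments. The dictionary holds five rows copied from Table D.1, with column IV used as printed.

```python
import re

FACTORS = ("I", "II", "III", "IV", "V")

# Five rows copied from Table D.1; a full implementation would load all 435.
LOADINGS = {
    "kind": (0.60, 0.07, 0.06, 0.02, 0.00),
    "talkative": (0.09, 0.62, -0.15, 0.13, -0.07),
    "organized": (0.08, -0.06, 0.65, 0.08, -0.02),
    "moody": (-0.13, -0.17, -0.07, 0.53, 0.04),
    "intelligent": (0.03, 0.03, 0.16, -0.13, 0.55),
}


def profile(text):
    """Average the loadings of all listed adjectives that occur in the text.

    Returns a dict mapping factor number to mean loading, or None when no
    listed adjective occurs. Averaging is an assumed aggregation rule.
    """
    totals = [0.0] * len(FACTORS)
    hits = 0
    for token in re.findall(r"[a-z-]+", text.lower()):
        loading = LOADINGS.get(token)
        if loading is None:
            continue
        hits += 1
        for k, value in enumerate(loading):
            totals[k] += value
    if hits == 0:
        return None
    return {factor: round(total / hits, 3) for factor, total in zip(FACTORS, totals)}


print(profile("A kind and intelligent captain"))
```

Each occurrence of an adjective counts once, so repeated adjectives weigh more heavily. Other aggregation rules (for example, summing only the dominant factor of each adjective) are equally conceivable and would give different profiles.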


Appendix E

Translation of Adjectives

This appendix lists the English adjectives of Appendix D together with their Dutch translations.

Translations of the adjectives from English to Dutch.

English | Dutch | English | Dutch
abrupt | abrupt | boastful | opschepperig
absent-minded | verstrooid | boisterous | onstuimig
abusive | beledigend | bold | stoutmoedig
accommodating | meegaand | bossy | bazig
active | actief | brave | dapper
adaptable | aanpasbaar | bright | helder
adventurous | avontuurlijk | bullheaded | stijfkoppige
affectionate | hartelijk | callous | ongevoelig
aggressive | agressief | candid | openhartig
agreeable | aangenaam | carefree | zorgeloos
aimless | doelloos | careful | voorzichtig
alert | alert | careless | zorgeloos
aloof | afzijdig | casual | luchtig
altruistic | altruïstisch | cautious | voorzichtig
ambitious | ambitieus | charitable | liefdadig
amiable | beminnelijk | cheerful | vrolijk
analytical | analytisch | clever | knap
animated | geanimeerde | coarse | grof
antagonistic | vijandig | cold | koude
anxious | angstig | combative | strijdlustig
apathetic | apathisch | communicative | communicatief
argumentative | betogend | compassionate | meelevend
articulate | geleed | competitive | competitief
artistic | artistiek | complex | complex
assertive | zelfbewust | compliant | inschikkelijk
assured | verzekerd | compulsive | gedwongen
autonomous | autonoom | conceited | verwaand
bashful | verlegen | concise | beknopt
belligerent | strijdlustig | condescending | minzaam
benevolent | welwillend | confident | zelfverzekerd
bigoted | kwezelachtig | conscientious | gewetensvol
bitter | bitter | conservative | conservatief
bland | zacht | considerate | attent


consistent | consistent | emotional | emotioneel
contemplative | beschouwend | empathic | empatische
controlled | gecontroleerde | energetic | energiek
conventional | conventioneel | enterprising | ondernemend
cooperative | coöperatief | enthusiastic | enthousiast
cordial | hartelijk | envious | afgunstig
courageous | moedig | erratic | grillige
courteous | beleefd | ethical | ethisch
cowardly | laf | exacting | veeleisend
crabby | prikkelbaar | excitable | prikkelbaar
crafty | sluw | exhibitionistic | exhibitionistische
cranky | humeurig | explosive | explosief
creative | creatief | expressive | expressief
critical | kritisch | extravagant | verkwistend
cruel | wreed | extroverted | extravert
cultured | ontwikkeld | faultfinding | gevit
cunning | sluw | fearful | angstig
curious | nieuwsgierig | feminine | vrouwelijk
cynical | cynisch | finicky | kieskeurig
daring | gedurfd | firm | stevig
deceitful | bedrieglijk | flamboyant | flamboyante
decisive | beslissend | flexible | flexibele
deep | diep | flirtatious | flirterig
defensive | defensief | foolhardy | roekeloos
deliberate | opzettelijk | forceful | krachtig
demanding | veeleisende | foresighted | vooruitziend
demonstrative | aanwijzend | forgetful | vergeetachtig
dependable | betrouwbaar | formal | formeel
dependent | afhankelijk | frank | openhartig
detached | vrijstaand | fretful | knorrig
devious | omslachtig | friendly | vriendelijk
dignified | waardig | frivolous | frivool
diplomatic | diplomatisch | generous | gul
direct | direct | genial | geniaal
discreet | discreet | gossipy | pratend
dishonest | oneerlijk | greedy | gulzig
disorderly | wanordelijk | gregarious | gezellig
disorganized | ongeorganiseerd | grumpy | knorrig
disrespectful | oneerbiedig | gullible | goedgelovig
distrustful | wantrouwend | haphazard | lukraak
docile | volgzaam | happy-go-lucky | toevallig
dogmatic | dogmatisch | harsh | hard
dominant | dominant | hearty | hartig
domineering | heerszuchtig | helpful | behulpzaam
down-to-earth | gewoon | high-strung | zenuwachtig
dull | saai | honest | eerlijk
earnest | ernstig | humble | nederig
earthy | aardsgezind | humorless | humorloos
easy-going | gemakzuchtig | humorous | humoristisch
economical | zuinig | hypocritical | schijnheilig
efficient | efficient | idealistic | idealistisch
egocentric | egocentrisch | ignorant | onwetend
egotistical | egoïstische | illogical | onlogisch
eloquent | welsprekend | imaginative | fantasierijke


immature | onvolwassen | mature | volwassen
impartial | onpartijdig | meddlesome | bemoeiziek
impatient | ongeduldig | meek | zachtmoedig
impersonal | onpersoonlijk | melancholic | melancholisch
impetuous | onstuimig | merry | vrolijk
impolite | onbeleefd | meticulous | nauwgezet
impractical | onpraktisch | mischievous | ondeugend
impulsive | impulsief | modest | bescheiden
inarticulate | ongeleed | moody | humeurig
inconsiderate | onbezonnen | moral | moreel
inconsistent | inconsequent | moralistic | moralistisch
indecisive | besluiteloos | naive | naief
independent | onafhankelijk | natural | natuurlijk
indirect | indirect | negativistic | negativistisch
indiscreet | indiscreet | negligent | nalatig
individualistic | individualistisch | nervous | nerveus
indulgent | toegeeflijk | nonchalant | nonchalant
industrious | ijverig | nonconforming | onovereenkomstig
inefficient | inefficiënt | nonpersistent | onhardnekkig
informal | informele | nonreligious | ongelovig
ingenious | vernuftig | nosey | nieuwsgierig
inhibited | geremd | objective | objectief
innovative | innovatief | obliging | bereidwillig
inquisitive | nieuwsgierig | obsessive | obsessieve
insecure | onzeker | obstinate | koppig
insensitive | ongevoelig | opinionated | eigenzinnig
insightful | inzichtelijke | opportunistic | opportunistisch
insincere | onoprecht | optimistic | optimistisch
intellectual | intellectueel | orderly | ordelijk
intelligent | intelligent | organized | georganiseerd
intense | intens | outspoken | openhartig
intolerant | onverdraagzaam | passionate | hartstochtelijk
introspective | introspectief | passionless | onhartstochtelijk
introverted | introvert | passive | passief
intrusive | opdringerig | patient | geduldig
inventive | vindingrijk | patronizing | neerbuigend
irreverent | oneerbiedig | peaceful | vredig
irritable | prikkelbaar | perceptive | opmerkzaam
jealous | jaloers | perfectionistic | perfectionistisch
jovial | joviaal | persistent | aanhoudend
kind | vriendelijk | pessimistic | pessimistisch
knowledgeable | geïnformeerde | philosophical | filosofisch
lax | laks | placid | kalm
lazy | lui | playful | speels
lenient | mild | pleasant | aangenaam
lethargic | slaperig | poised | gedoodverfd
logical | logisch | polite | beleefd
loyal | loyaal | pompous | hoogdravend
lustful | wellustig | possessive | bezittelijk
magnetic | magnetisch | practical | praktisch
manipulative | manipulatief | precise | precies
mannerly | welgemanierd | predictable | voorspelbaar
masculine | mannelijk | prejudiced | bevooroordeeld
masochistic | masochistische | pretentious | pretentieus

A Method to Perform an Automated Background Check on Professional Football Players 99 APPENDIX E. TRANSLATION OF ADJECTIVES

principled principieel sloppy slordig progressive progressief sluggish traag prompt prompt sly sluw proud trots smart slim prudish preuts smug zelfgenoegzaam punctual stipt snobbish snobistisch purposeful doelbewust sociable sociaal quarrelsome twistziek social sociaal quiet rustig somber somber rambunctious onstuimige sophisticated geraffineerd rash onbezonnen spirited levendig rational rationeel spontaneous spontaan reasonable redelijk steady standvastig rebellious opstandig stern streng reckless roekeloos stingy gierig refined verfijnd straightforward rechtdoorzee relaxed ontspannen strict streng reliable betrouwbaar stubborn koppig religious religieus subjective subjectief reserved gereserveerd submissive onderdanig respectful eerbiedig suggestible vatbaar responsible verantwoordelijk superstitious bijgelovig restless rusteloos surly nors restrained terughoudend suspicious verdacht reverent eerbiedig sympathetic sympathiek rigid stijf systematic systematisch rough ruw tactful tactvol rude onbeleefd tactless tactloos ruthless meedogenloos talkative spraakzaam sarcastic sarcastisch temperamental temperamentvolle scatterbrained warhoofdig tenacious vasthoudend scornful minachtend thorough grondig scrupulous scrupuleus thoughtful bedachtzaam secretive geheimzinnig thoughtless onnadenkend sedate bezadigd thrifty zuinig self-critical zelfkritisch timid timide self-disciplined gedisciplineerd tolerant verdraagzaam self-indulgent genotzuchtig touchy gevoelig selfish egostisch tough taai selfless onbaatzuchtig traditional traditioneel self-pitying zelfmedelijdend transparent transparant self-seeking egostisch trustful vertrouwend sensitive gevoelig truthful waarheidlievend sensual sensueel unadventurous onavontuurlijk sentimental sentimenteel unaffectionate onverschillig serious ernstig unaggressive bedaard sexy sexy unambitious ambitieloos shallow ondiep unassuming bescheiden shrewd sluw uncharitable onbarmhartig shy verlegen uncommunicative onmeedeelzaam silent stil 
uncompetitive incompetitief simple simpel unconventional onconventionele sincere oprecht uncooperative weigerachtig skeptical sceptisch uncouth lomp

100 A Method to Perform an Automated Background Check on Professional Football Players APPENDIX E. TRANSLATION OF ADJECTIVES

undemanding bescheiden unsophisticated ongeknutseld undependable onbetrouwbaar unstable wankel underhanded slinks unsympathetic onsympathiek understanding begripvol vague vaag unemotional onaandoenlijk vain ijdel unexcitable onge¨ınteresseerd verbal verbaal unforgiving onverzoenlijk versatile veelzijdig unfriendly onvriendelijk vigorous krachtig ungracious onhoffelijk vindictive wraakzuchtig unimaginative fantasieloos vivacious levendig uninhibited onbevangen volatile vluchtig uninquisitive nieuwsgierig warm warm unintellectual onintellectueel wary omzichtig unintelligent onintelligent wasteful verkwistend unkind onaardig weariless onvermoeibaar unobservant onoplettend wise wijs unpredictable onvoorspelbaar wishy-washy slap unreliable onbetrouwbaar withdrawn teruggetrokken unrestrained onbeperkt witty geestig unruly onhandelbaar wordy langdradig unscrupulous gewetenloos worldly werelds unselfish onzelfzuchtig zealous ijverig unsociable ongezellig


Appendix F

Removed Adjectives

This appendix includes the personality adjectives that did not occur in the text corpus, or were removed from it because they appeared too frequently.

F.1 Unused Adjectives

Below is a list of personality adjectives that did not occur in the text corpus of news articles. For each of the three experiments, a separate list is provided.
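As an illustration of how such a list of unused adjectives could be derived, the following Python sketch counts whole-word occurrences of each adjective in a set of articles and returns those that never occur. It is only a sketch under stated assumptions: the function names, the tokenisation rule, and the example sentences are hypothetical and are not taken from the implementation used in this thesis, and no stemming or part-of-speech tagging is applied.

```python
import re
from collections import Counter

# Hypothetical tokeniser: runs of letters, optionally joined by hyphens,
# so that "high-strung" stays one token and accented Dutch letters are kept.
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")

def count_adjectives(articles, adjectives):
    """Count how often each adjective occurs as a whole token in the articles."""
    tokens = Counter()
    for text in articles:
        tokens.update(TOKEN.findall(text.lower()))
    return {adj: tokens[adj] for adj in adjectives}

def unused_adjectives(articles, adjectives):
    """Return, in alphabetical order, the adjectives that never occur."""
    counts = count_adjectives(articles, adjectives)
    return sorted(adj for adj, n in counts.items() if n == 0)

# Invented example data, for illustration only.
articles = [
    "The striker stayed calm and was modest after the match.",
    "A calm, high-strung debut.",
]
adjectives = ["calm", "modest", "gossipy", "high-strung"]
print(unused_adjectives(articles, adjectives))
```

Running the same function once per experiment, each with its own preprocessing of the articles, would give one list per experiment, as in the subsections below.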

F.1.1 Bag-of-Words Approach with Word Stemming

English adjectives: bullheaded, dependent, dominant, empathic, exhibitionistic, faultfinding, finicky, gossipy, gregarious, high-strung, impractical, indiscreet, meddlesome, nonconforming, nonpersistent, nonreligious, nosey, quarrelsome, self-seeking, tactless, unaffectionate, unaggressive, undependable, uninquisitive, unintellectual, weariless, wordy

Dutch adjectives: aanpasbaar, aardsgezind, altruïstisch, ambitieloos, analytisch, autonoom, bazig, bedachtzaam, beknopt, beminnelijk, bemoeiziek, bevooroordeeld, bezadigd, diplomatisch, discreet, doelloos, dogmatisch, exhibitionistische, expressief, fantasierijke, flirterig, genotzuchtig, geremd, gevit, gewetenloos, gewetensvol, goedgelovig, gul, gulzig, heerszuchtig, hoogdravend, humorloos, idealistisch, incompetitief, indiscreet, inschikkelijk, intellectueel, introspectief, joviaal, kwezelachtig, langdradig, liefdadig, magnetisch, manipulatief, masochistische, meegaand, melancholisch, minzaam, moralistisch, negativistisch, obsessieve, onaandoenlijk, onavontuurlijk, onbeleefd, onderdanig, ondiep, oneerbiedig, ongeknutseld, ongeleed, ongelovig, ongezellig, onhandelbaar, onhardnekkig, onhartstochtelijk, onhoffelijk, onintellectueel, onintelligent, onmeedeelzaam, onoprecht, onovereenkomstig, onpartijdig, onpraktisch, onsympathiek, onverdraagzaam, onverzoenlijk, onzelfzuchtig, opmerkzaam, opschepperig, ordelijk, pretentieus, preuts, principieel, progressief, religieus, scrupuleus, sensueel, snobistisch, stijfkoppige, stoutmoedig, subjectief, systematisch, tactloos, tactvol, temperamentvolle, twistziek, verdraagzaam, vergeetachtig, verkwistend, verstrooid, verwaand, vredig, waarheidlievend, warhoofdig, weigerachtig, welgemanierd, wellustig, welsprekend, wraakzuchtig, zachtmoedig, zelfgenoegzaam, zelfmedelijdend

F.1.2 Bag-of-Words Approach without Word Stemming

English adjectives: bossy, bullheaded, empathic, exhibitionistic, faultfinding, finicky, foresighted, gossipy, gregarious, highstrung, impersonal, impractical, indiscreet, mannerly, masculine, meddlesome, nonconforming, nonpersistent, nonreligious, nosey, patronizing, perfectionistic, quarrelsome, scrupulous, selfdisciplined, selfpitying, selfseeking, suggestible, tactless, trustful, unaffectionate, unaggressive, uncharitable, undependable, underhanded, unexcitable, uninquisitive, unintellectual, unsociable, weariless, wordy

Dutch adjectives: aanpasbaar, aanwijzend, aardsgezind, afgunstig, altruïstisch, ambitieloos, analytisch, autonoom, bazig, bedachtzaam, bedrieglijk, beknopt, beminnelijk, bemoeiziek, beschouwend, betogend, bevooroordeeld, bezadigd, bezittelijk, conventioneel, diplomatisch, discreet, doelloos, dogmatisch, eerbiedig, efficient, egocentrisch, empatische, ethisch, exhibitionistische, expressief, fantasierijke, flirterig, geanimeerde, gedoodverfd, geestig, geleed, genotzuchtig, geremd, gevit, gewetenloos, gewetensvol, goedgelovig, gul, gulzig, heerszuchtig, hoogdravend, humeurig, humoristisch, humorloos, idealistisch, incompetitief, indiscreet, individualistisch, inschikkelijk, intellectueel, introspectief, inzichtelijke, joviaal, kwezelachtig, langdradig, liefdadig, magnetisch, manipulatief, masochistische, meegaand, meelevend, melancholisch, minzaam, moralistisch, naief, negativistisch, obsessieve, onaandoenlijk, onavontuurlijk, onbeleefd, onderdanig, ondernemend, ondeugend, ondiep, oneerbiedig, ongeïnteresseerd, ongeknutseld, ongeleed, ongelovig, ongeorganiseerd, ongezellig, onhandelbaar, onhardnekkig, onhartstochtelijk, onhoffelijk, onintellectueel, onintelligent, onmeedeelzaam, onoplettend, onoprecht, onovereenkomstig, onpartijdig, onpersoonlijk, onpraktisch, onsympathiek, onverdraagzaam, onverzoenlijk, onzelfzuchtig, opdringerig, opmerkzaam, opschepperig, ordelijk, pretentieus, preuts, prikkelbaar, principieel, progressief, religieus, scrupuleus, sensueel, slinks, snobistisch, stijfkoppige, stoutmoedig, subjectief, systematisch, tactloos, tactvol, temperamentvolle, toegeeflijk, twistziek, vasthoudend, verdraagzaam, vergeetachtig, verkwistend, verstrooid, vertrouwend, verwaand, vindingrijk, vluchtig, vooruitziend, vredig, waarheidlievend, wanordelijk, wantrouwend, warhoofdig, weigerachtig, welgemanierd, wellustig, welsprekend, wraakzuchtig, zachtmoedig, zelfgenoegzaam, zelfmedelijden

F.1.3 Part-of-Speech Tagging

English adjectives: accommodating, bossy, bullheaded, communicative, dependent, dominant, empathic, exhibitionistic, faultfinding, finicky, foresighted, gossipy, gregarious, high-strung, impractical, indiscreet, masculine, meddlesome, nonconforming, nonpersistent, nonreligious, nosey, patronizing, perfectionistic, prejudiced, quarrelsome, scrupulous, self-pitying, self-seeking, somber, tactless, unaffectionate, unaggressive, uncharitable, undependable, uninquisitive, unintellectual, weariless, wordy

Dutch adjectives: aangenaam, aanpasbaar, aanwijzend, aardsgezind, afgunstig, altruïstisch, ambitieloos, analytisch, autonoom, bazig, bedaard, bedachtzaam, bedrieglijk, begripvol, beknopt, beledigend, beminnelijk, bemoeiziek, beschouwend, bevooroordeeld, bezadigd, bezittelijk, diplomatisch, discreet, doelloos, dogmatisch, eerbiedig, exhibitionistische, expressief, fantasierijke, flirterig, geanimeerde, gecontroleerde, gedisciplineerd, gedoodverfd, gedurfd, gedwongen, geïnformeerde, geleed, genotzuchtig, georganiseerd, geraffineerd, geremd, gereserveerd, gevit, gewetenloos, gewetensvol, goedgelovig, grof, gul, gulzig, heerszuchtig, hoogdravend, humeurig, humorloos, idealistisch, impulsief, incompetitief, indiscreet, inefficiënt, inschikkelijk, intellectueel, introspectief, jaloers, joviaal, kwezelachtig, laf, langdradig, liefdadig, lomp, lui, magnetisch, manipulatief, masochistische, meegaand, meelevend, melancholisch, mild, minzaam, moralistisch, nauwgezet, neerbuigend, negativistisch, obsessieve, onaandoenlijk, onavontuurlijk, onbeleefd, onbeperkt, onderdanig, ondernemend, ondeugend, ondiep, oneerbiedig, ongeknutseld, ongeleed, ongelovig, ongezellig, onhandelbaar, onhardnekkig, onhartstochtelijk, onhoffelijk, onintellectueel, onintelligent, onmeedeelzaam, onnadenkend, onoprecht, onovereenkomstig, onpartijdig, onpraktisch, onsympathiek, ontspannen, ontwikkeld, onverdraagzaam, onverzoenlijk, onvolwassen, onzelfzuchtig, opdringerig, opmerkzaam, opschepperig, ordelijk, pratend, pretentieus, preuts, prikkelbaar, principieel, progressief, rechtdoorzee, religieus, scrupuleus, sensueel, sentimenteel, sexy, slap, snobistisch, spraakzaam, stijf, stijfkoppige, stipt, stoutmoedig, subjectief, systematisch, tactloos, tactvol, temperamentvolle, teruggetrokken, terughoudend, timide, toegeeflijk, transparant, twistziek, verdraagzaam, verfijnd, vergeetachtig, verkwistend, verlegen, verstrooid, verwaand, verzekerd, vindingrijk, vooruitziend, vredig, waarheidlievend, wanordelijk, wantrouwend, warhoofdig, weigerachtig, welgemanierd, wellustig, welsprekend, wraakzuchtig, wreed, zachtmoedig, zelfgenoegzaam, zelfmedelijdend

F.2 Too Frequently Used Adjectives

The personality adjectives that were deleted from the text corpus are listed below. These adjectives were excluded from the experiments because they appeared too often on average and were believed to serve purposes other than describing a particular player. A list is provided for each of the three experiments.
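A frequency-based exclusion of this kind could look like the Python sketch below. It is a minimal illustration and not the procedure used in this thesis: the function name, the cutoff `max_avg_per_article`, and the counts are all hypothetical, and the actual threshold applied in the experiments is not stated in this appendix.

```python
def too_frequent(counts, n_articles, max_avg_per_article=0.5):
    """Return adjectives whose average count per article exceeds the cutoff.

    counts: mapping from adjective to its total number of occurrences.
    n_articles: number of articles in the corpus.
    max_avg_per_article: hypothetical cutoff, chosen here only for illustration.
    """
    if n_articles <= 0:
        raise ValueError("n_articles must be positive")
    return sorted(
        adj for adj, n in counts.items()
        if n / n_articles > max_avg_per_article
    )

# Invented counts, for illustration only.
counts = {"direct": 120, "hard": 95, "modest": 4}
print(too_frequent(counts, n_articles=100))
```

A cutoff on the average count is only one possible reading of "too often on average"; a relative frequency over all tokens would be an equally plausible choice.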

F.2.1 Bag-of-Words Approach with Word Stemming

English adjectives: adventurous, aimless, analytical, assured, assured, bold, conservative, crafty, detached, docile, domineering, earnest, firm, forceful, forceful, helpful, innovative, insecure, kind, meddlesome, negativistic, nonconforming, objective, possessive, rambunctious, reasonable, relaxed, respectful, self-indulgent, sensitive, serious, sluggish, smart, snobbish, submissive, suggestible, superstitious, systematic, thoughtful, transparent, unadventurous, uncooperative, understanding, unforgiving, unkind, unrestrained, unsophisticated, vigorous, wasteful, zealous

Dutch adjectives: actief, afhankelijk, alert, attent, avontuurlijk, beleefd, beslissend, bezittelijk, defensief, diep, direct, dominant, eerlijk, ernstig, geduldig, geleed, gevoelig, gewoon, hard, knap, koude, kritisch, logisch, natuurlijk, nors, ontwikkeld, onzeker, opzettelijk, precies, redelijk, rustig, simpel, slap, slim, slordig, speels, stevig, stil, traditioneel, trots, vasthoudend, verantwoordelijk, vertrouwend, verzekerd, vijandig, volwassen, voorzichtig, vrijstaand, warm, werelds, wijs

F.2.2 Bag-of-Words Approach without Word Stemming

English adjectives: aggressive, alert, ambitious, assured, bitter, brave, bright, careful, clever, cold, competitive, confident, consistent, controlled, creative, critical, decisive, deep, defensive, demanding, direct, dominant, emotional, firm, frank, friendly, harsh, honest, intelligent, intense, kind, natural, nervous, optimistic, patient, poised, proud, quiet, relaxed, reliable, responsible, serious, simple, smart, social, tough, traditional, understanding, versatile, warm, withdrawn

Dutch adjectives: actief, afhankelijk, alert, ambitieus, attent, beleefd, bescheiden, beslissend, defensief, diep, direct, dominant, eerlijk, enthousiast, ernstig, formeel, geduldig, gedwongen, gewoon, hard, knap, kritisch, logisch, natuurlijk, ontwikkeld, onzeker, precies, prompt, redelijk, rustig, simpel, slap, slim, slordig, stevig, stil, traag, trots, verantwoordelijk, verzekerd, volwassen, voorzichtig, vrijstaand, vrolijk, warm, wijs

F.2.3 Part-of-Speech Tagging

English adjectives: bright, clever, competitive, confident, consistent, creative, decisive, deep, defensive, direct, domineering, excitable, friendly, honest, informal, intense, natural, optimistic, proud, quiet, respectful, serious, simple, social, tough

Dutch adjectives: actief, diep, direct, eerlijk, ernstig, gewoon, hard, knap, logisch, natuurlijk, onzeker, precies, redelijk, rustig, slordig, stevig, stil, trots, verantwoordelijk, vijandig, voorzichtig


Appendix G

Personality Profiles before Scaling

This appendix includes the primary results of the personality construction. Results are provided for both the Dutch and the non-Dutch players. Note that not all players are represented in the tables: for some players, too few articles were gathered to calculate a value for each personality trait. The factors are numbered as follows:

• Factor I: Agreeableness
• Factor II: Extraversion
• Factor III: Conscientiousness
• Factor IV: Emotional Stability (in this project referred to as Neuroticism)
• Factor V: Intellect/Imagination (in this project referred to as Openness to Experience)
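To make the numbering concrete, the Python sketch below maps the Roman numerals used as column headers to the factor names and converts one table row back to its unscaled values. It is an illustrative sketch only: the function name and the row format are assumptions, and the divisor assumes that the table captions mean the values were multiplied by 10^2. The example row is the first entry of the table in Section G.1.

```python
# Factor numbering as listed in this appendix (names as used in this project).
FACTORS = {
    "I": "Agreeableness",
    "II": "Extraversion",
    "III": "Conscientiousness",
    "IV": "Neuroticism",
    "V": "Openness to Experience",
}

SCALE = 100  # assumption: tabulated values are the raw scores multiplied by 10^2

def parse_profile(row):
    """Split 'Name v1 v2 v3 v4 v5' into (name, {factor name: unscaled value})."""
    *name_parts, i, ii, iii, iv, v = row.split()
    values = [float(x) / SCALE for x in (i, ii, iii, iv, v)]
    return " ".join(name_parts), dict(zip(FACTORS.values(), values))

name, profile = parse_profile("Cristiano Ronaldo 0.13 0.21 0.32 -0.31 0.28")
print(name, profile)
```

A row whose player name is missing still parses; the name is then returned as an empty string.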

G.1 Bag-of-Words Approach with Word Stemming

Personality profiles of football players not playing in the Netherlands before scaling (all values are multiplied by 10^2).

| Player name | I | II | III | IV | V |
|---|---|---|---|---|---|
| Cristiano Ronaldo | 0.13 | 0.21 | 0.32 | -0.31 | 0.28 |
| Lionel Messi | 0.13 | 0.17 | 0.29 | -0.07 | 0.48 |
| Luis Suarez | 0.04 | 0.21 | 0.32 | -0.09 | 0.34 |
| Karim Benzema | 0.10 | 0.17 | 0.32 | -0.34 | 0.35 |
| Juanfran | 0.11 | 0.14 | 0.41 | -0.02 | 0.30 |
| Thomas Muller | 0.17 | 0.26 | 0.49 | -0.03 | 0.30 |
| Robert Lewandowski | 0.15 | 0.16 | 0.48 | -0.01 | 0.39 |
| Diego Godin | 0.08 | 0.33 | 0.39 | -0.14 | 0.30 |
| Manuel Neuer | 0.14 | 0.24 | 0.42 | -0.04 | 0.33 |
| | 0.11 | 0.17 | 0.45 | -0.11 | 0.48 |
| Gianluigi Buffon | 0.20 | 0.15 | 0.38 | 0.15 | 0.32 |
| Toni Kroos | 0.21 | 0.28 | 0.39 | -0.13 | 0.42 |
| Neymar | 0.22 | 0.19 | 0.39 | -0.06 | 0.43 |
| Mesut Ozil | 0.12 | 0.08 | 0.40 | -0.01 | 0.33 |
| Philipp Lahm | 0.18 | 0.16 | 0.46 | -0.02 | 0.49 |
| Leonardo Bonucci | 0.18 | 0.14 | 0.39 | 0.03 | 0.24 |
| Javier Mascherano | 0.18 | 0.10 | 0.36 | -0.09 | 0.36 |
| Dani Alves | 0.08 | 0.35 | 0.26 | -0.22 | 0.22 |
| Jose Fonte | 0.21 | 0.17 | 0.43 | -0.06 | 0.32 |
| Pedro Rodriguez | 0.26 | 0.18 | 0.38 | -0.18 | 0.35 |
| Gerard Pique | 0.15 | 0.33 | 0.34 | -0.04 | 0.24 |
| Yaya Toure | 0.10 | 0.17 | 0.36 | -0.07 | 0.39 |
| | 0.14 | 0.27 | 0.54 | 0.05 | 0.37 |
| Laurent Koscielny | 0.16 | 0.25 | 0.42 | 0.04 | 0.28 |
| Alexis Sanchez | 0.13 | 0.22 | 0.37 | -0.03 | 0.31 |
| Francesc Fabregas | 0.25 | 0.09 | 0.66 | -0.17 | 0.52 |
| David Silva | 0.16 | 0.17 | 0.44 | -0.10 | 0.40 |
| Fernandinho | 0.11 | 0.17 | 0.27 | -0.08 | 0.41 |
| Angel di Maria | 0.14 | 0.28 | 0.36 | -0.09 | 0.39 |
| Giorgio Chiellini | 0.04 | 0.42 | 0.26 | -0.10 | 0.33 |
| | 0.17 | 0.18 | 0.41 | -0.04 | 0.51 |
| Petr Cech | 0.21 | 0.22 | 0.39 | 0.08 | 0.41 |
| Joe Hart | 0.17 | 0.25 | 0.53 | -0.23 | 0.39 |
| Andres Iniesta | 0.17 | 0.21 | 0.35 | -0.06 | 0.20 |
| Gareth Bale | 0.11 | 0.22 | 0.34 | -0.05 | 0.31 |
| Koke | 0.12 | 0.21 | 0.36 | -0.05 | 0.32 |
| Ivan Rakitic | 0.30 | 0.18 | 0.38 | -0.07 | 0.17 |
| Jan Oblak | 0.17 | 0.38 | 0.61 | 0.08 | 0.56 |
| Gabriel Fernandez | 1.73 | 0.20 | 1.36 | 0.00 | -0.08 |
| | 0.13 | 0.20 | 0.41 | -0.27 | 0.24 |
| Christian Eriksen | 0.13 | 0.29 | 0.41 | -0.19 | 0.25 |
| Joao Miranda | 0.46 | 0.88 | 1.16 | -0.96 | 0.82 |
| Nicolas Otamendi | 0.16 | 0.18 | 0.33 | -0.16 | 0.35 |
| Sergio Aguero | 0.21 | 0.15 | 0.33 | -0.04 | 0.48 |
| Marcelo | 0.19 | 0.23 | 0.44 | 0.03 | 0.28 |
| Wayne Rooney | 0.23 | 0.27 | 0.51 | -0.07 | 0.47 |
| | 0.28 | 0.20 | 0.45 | 0.09 | 0.31 |
| Jamie Vardy | 0.16 | 0.32 | 0.35 | 0.05 | 0.20 |
| | 0.22 | 0.24 | 0.39 | -0.01 | 0.48 |
| Filipe Luis | 0.15 | 0.14 | 0.35 | -0.02 | 0.23 |
| Jordi Alba | 0.16 | 0.21 | 0.39 | -0.15 | 0.28 |
| Luka Modric | 0.11 | 0.25 | 0.53 | -0.27 | 0.39 |
| Nemanja Matic | 0.15 | 0.29 | 0.30 | 0.09 | 0.41 |
| Sadio Mane | 0.10 | 0.18 | 0.29 | -0.06 | 0.36 |
| David Alaba | 0.08 | 0.20 | 0.42 | -0.09 | 0.34 |
| Jerome Boateng | 0.16 | 0.17 | 0.41 | -0.09 | 0.27 |
| Zlatan Ibrahimovic | 0.17 | 0.20 | 0.43 | 0.01 | 0.44 |
| | 0.10 | 0.27 | 0.41 | 0.03 | 0.44 |
| Mario Gotze | 0.17 | 0.21 | 0.37 | -0.04 | 0.25 |
| Arturo Vidal | 0.15 | 0.17 | 0.44 | -0.01 | 0.34 |
| | 0.15 | 0.23 | 0.51 | -0.04 | 0.30 |
| Willian Borges Da Silva | 1.49 | 0.50 | 1.71 | 0.75 | 0.90 |
| Daniel Carvajal | 0.34 | 0.47 | 0.85 | -0.79 | 1.10 |
| Branislav Ivanovic | 0.11 | 0.29 | 0.24 | -0.07 | 0.28 |
| David De Gea | 0.15 | 0.14 | 0.43 | -0.14 | 0.41 |
| Maxwell | 0.18 | 0.17 | 0.43 | 0.07 | 0.33 |
| Isco | 0.15 | 0.15 | 0.41 | -0.16 | 0.42 |
| Rafinha | 0.24 | 0.16 | 0.41 | 0.00 | 0.25 |
| Gonzalo Higuain | 0.18 | 0.16 | 0.44 | 0.01 | 0.35 |
| Nacho Monreal | 0.19 | 0.17 | 0.39 | -0.08 | 0.39 |
| Per Mertesacker | 0.22 | 0.26 | 0.42 | 0.04 | 0.38 |
| Hugo Lloris | 0.14 | 0.20 | 0.46 | -0.09 | 0.42 |
| Keilor Navas | 0.41 | -1.22 | 1.46 | 0.00 | 0.00 |
| Jose Gimenez | 0.31 | 0.61 | 1.01 | -0.15 | 0.61 |
| James Rodriguez | 0.20 | 0.23 | 0.44 | -0.06 | 0.33 |
| Cesar Azpilicueta | 0.23 | 0.17 | 0.39 | 0.02 | 0.35 |
| Bastian Schweinsteiger | 0.23 | 0.29 | 0.40 | -0.15 | 0.33 |
| | 0.12 | 0.21 | 0.40 | -0.02 | 0.38 |
| Romelu Lukaku | 0.28 | 0.14 | 0.45 | 0.02 | 0.51 |
| Paul Pogba | 0.16 | 0.12 | 0.42 | -0.06 | 0.51 |
| Santi Cazorla | 0.28 | 0.33 | 0.29 | 0.09 | 0.27 |
| Shinji Kagawa | 0.30 | 0.11 | 0.43 | 0.00 | 0.45 |
| Henrikh Mkhitaryan | 0.14 | 0.16 | 0.38 | -0.01 | 0.35 |
| Thiago Silva | 0.11 | 0.25 | 0.33 | 0.11 | 0.25 |
| Juan Mata | 0.16 | 0.22 | 0.51 | -0.13 | 0.50 |
| Aaron Ramsey | 0.25 | 0.19 | 0.44 | 0.02 | 0.31 |
| Gareth Barry | 0.22 | 0.27 | 0.50 | 0.02 | 0.56 |
| Vincent Kompany | 0.28 | 0.27 | 0.36 | -0.14 | 0.28 |
| Morgan Schneiderlin | 0.23 | 0.34 | 0.37 | -0.07 | 0.41 |
| Stephan Lichtsteiner | 0.16 | 0.17 | 0.35 | 0.14 | 0.36 |
| | 0.05 | 0.27 | 0.41 | -0.08 | 0.25 |
| | -0.03 | 0.22 | 0.24 | 0.03 | 0.40 |
| Claudio Marchisio | 0.10 | 0.15 | 0.36 | -0.13 | 0.31 |
| Hector Bellerin | 0.14 | 0.12 | 0.47 | -0.06 | 0.40 |
| Jesus Navas | 0.19 | 0.12 | 0.38 | -0.09 | 0.29 |
| Oscar dos Santos | 0.42 | 0.32 | 1.13 | 0.25 | 0.98 |
| Chris Smalling | 0.18 | 0.19 | 0.53 | -0.24 | 0.47 |
| Mario Balotelli | 0.10 | 0.32 | 0.30 | -0.07 | 0.28 |
| Riyad Mahrez | 0.11 | 0.14 | 0.47 | 0.06 | 0.28 |

Personality profiles of football players playing in the Netherlands before scaling (all values are multiplied by 10^2).

| Player name | I | II | III | IV | V |
|---|---|---|---|---|---|
| Jasper Cillessen | 2.22 | 0.58 | 1.48 | -0.69 | 1.48 |
| | 0.89 | 0.62 | 0.76 | -0.30 | 2.22 |
| Joel Veltman | 1.21 | 1.09 | 0.61 | 0.00 | 2.73 |
| | 1.33 | 1.47 | 2.48 | 0.00 | 1.33 |
| Hector Moreno | -0.49 | 0.83 | 0.23 | -1.62 | 1.89 |
| Luuk de Jong | 0.89 | 1.70 | 1.25 | 0.19 | 2.05 |
| Ron Vlaar | 2.05 | 1.76 | 0.95 | -1.99 | 1.49 |
| Andres Guardado | 0.47 | 0.79 | 0.43 | -0.23 | 1.44 |
| Dirk Kuijt | 0.46 | 0.43 | 0.79 | -0.40 | 1.58 |
| | 0.48 | 0.91 | 0.97 | -0.29 | 1.45 |
| Arek Milik | 0.50 | 1.27 | 2.00 | -0.67 | 2.80 |
| Lasse Schone | -0.25 | 0.25 | 0.45 | -0.82 | 2.06 |
| | 0.32 | 0.33 | 0.59 | 0.31 | 1.26 |
| | 0.09 | 0.97 | 1.04 | -0.74 | 1.04 |
| Joshua Brenet | -0.32 | 0.74 | 0.93 | -0.69 | 1.62 |
| Jeffrey Bruma | 1.43 | 1.19 | 0.24 | 0.32 | 2.62 |
| | 0.88 | 0.43 | 0.91 | -2.02 | 1.52 |
| Nemanja Gudelj | -0.40 | 0.73 | 0.41 | -0.92 | 1.88 |
| Jurgen Locadia | 0.35 | 0.23 | 1.64 | 1.61 | 2.82 |
| Ruben Schaken | 0.86 | 1.43 | -0.33 | 1.00 | 2.24 |
| Davy Propper | 1.24 | 0.66 | 0.08 | 0.37 | 1.80 |
| Marko Vejinovic | -0.62 | 0.84 | 0.25 | 0.00 | 1.47 |
| Bram van Polen | 0.93 | 0.86 | 0.00 | -0.62 | 2.78 |
| Kevin Diks | 3.45 | 0.69 | 3.45 | 1.72 | 3.45 |
| | 1.43 | 0.23 | 1.19 | -2.50 | 1.71 |
| Johnny Heitinga | 1.14 | 0.00 | 2.04 | 0.60 | 2.55 |
| Renato Ibarra | 2.08 | 0.26 | 0.69 | 0.69 | 2.08 |
| | -0.37 | 0.15 | 1.41 | 0.14 | 1.23 |
| Guram Kashia | 1.23 | 0.13 | 1.62 | 0.00 | 2.25 |
| Luciano Slagveer | -1.59 | 1.67 | 1.13 | 0.00 | 2.04 |
| | 2.02 | 0.57 | 1.80 | -1.18 | 1.47 |
| Markus Henriksen | 2.05 | 1.54 | 1.67 | -0.68 | 1.37 |
| Dabney dos Santos | 0.60 | 1.52 | 0.68 | 0.48 | 2.38 |
| Maxime Lestienne | 0.38 | 0.28 | 0.58 | 0.78 | 1.82 |
| | -0.86 | 0.24 | 0.17 | -0.58 | 1.08 |
| Joey van den Berg | 1.61 | 0.46 | 0.90 | 2.53 | 1.27 |
| Thulani Serero | 1.08 | 0.23 | 0.68 | 1.08 | 1.69 |
| Navarone Foor | 2.41 | -0.22 | 0.88 | 1.75 | 4.39 |
| | 0.70 | 1.15 | 1.08 | -0.22 | 2.15 |
| Santiago Arias | -0.33 | 1.17 | 0.74 | 0.56 | 1.67 |
| Valeri Qazaishvili | 0.33 | 1.50 | 0.00 | 0.00 | 2.33 |
| | -0.41 | 1.48 | -0.29 | 0.34 | 2.04 |
| Felipe Gutierrez | 0.65 | -0.29 | 0.74 | -1.47 | 2.65 |
| Mike Havenaar | 1.36 | 1.51 | 1.13 | -1.13 | 2.71 |
| | -0.49 | 0.42 | -0.78 | 2.45 | 3.53 |
| | 0.56 | 1.03 | 1.22 | 0.00 | 1.49 |
| Sam Larsson | 0.28 | 0.42 | 0.56 | 0.28 | 1.94 |
| Jetro Willems | 0.12 | 1.55 | 2.07 | 0.25 | 2.59 |
| Thomas Kristensen | 1.59 | 2.98 | 0.00 | 0.00 | 4.76 |
| | 0.33 | 0.60 | 0.62 | -0.09 | 1.80 |
| Anthony Limbombe | -0.51 | 0.32 | -0.64 | -0.85 | 3.42 |
| | 0.76 | 0.35 | 0.20 | 0.61 | 1.46 |
| | 1.54 | 0.83 | 2.70 | 0.23 | 1.35 |
| Edouard Duplan | 0.46 | 1.69 | 1.15 | 0.00 | 2.31 |
| Hans Hateboer | 1.22 | 0.73 | 2.44 | 4.07 | 3.25 |
| | 0.58 | 0.93 | 0.97 | -0.19 | 1.56 |
| Mitchell Dijks | -0.82 | 1.39 | 1.31 | -1.09 | 1.91 |
| Willem Janssen | 0.83 | 0.13 | 1.10 | -0.51 | 2.73 |
| Tonny Vilhena | -0.13 | 0.33 | 0.81 | 0.00 | 1.42 |
| Michiel Kramer | 0.38 | 0.00 | 0.34 | -0.35 | 1.31 |
| Nick van der Velden | 0.58 | 0.93 | 2.62 | 1.74 | 3.10 |
| Rochdi Achenteh | 0.60 | 0.95 | 0.81 | 1.19 | 2.38 |
| Guus Hupperts | 3.46 | 2.20 | 2.52 | 1.89 | 2.83 |
| | 1.00 | 0.25 | 1.43 | -2.00 | 2.50 |
| | 0.24 | 1.85 | 0.75 | -0.85 | 2.97 |
| | 1.77 | 2.13 | 1.60 | 0.61 | 3.55 |
| Eric Botteghin | 0.93 | 1.38 | 1.09 | 0.00 | 1.49 |
| Erwin Mulder | -0.85 | 0.08 | 0.63 | 0.23 | 1.63 |
| | 1.28 | -1.65 | 3.85 | 3.85 | 6.41 |
| Etienne Reijnen | -0.52 | -0.15 | 0.95 | 0.52 | 2.08 |
| Kamohelo Mokotjo | 0.68 | 0.00 | 1.62 | 0.00 | 1.56 |
| | -0.68 | 0.66 | 1.14 | 2.27 | 2.12 |
| | 0.40 | 0.62 | 0.18 | 1.81 | 2.48 |
| Martin Hansen | 1.40 | 1.48 | 0.58 | 1.86 | 3.88 |
| Ben Rienstra | 1.72 | 2.71 | 1.31 | 0.49 | 3.27 |
| Sebastien Haller | 0.68 | -0.28 | 2.12 | 1.13 | 4.24 |
| Adil Auassar | 0.00 | 1.14 | 2.05 | -3.03 | 2.84 |
| Gaston Pereiro | 0.34 | 0.18 | 0.91 | 0.00 | 2.08 |
| | 2.03 | 0.61 | 0.24 | 0.85 | 2.54 |
| | 0.20 | 0.99 | -0.18 | -2.14 | 2.14 |
| Joost van Aken | -0.56 | 1.48 | 1.90 | 0.89 | 3.56 |
| Jarchinio Antonia | 0.64 | 2.11 | 0.72 | 3.26 | 1.45 |
| Jan-Arie van der Heijden | 1.00 | 0.80 | 1.22 | 0.60 | 1.49 |
| Lucas Bijker | 0.00 | 0.35 | 1.01 | 2.78 | 4.63 |
| | 0.94 | 1.42 | 0.51 | 3.77 | 4.72 |
| | 0.00 | 1.27 | 1.23 | -0.46 | 2.31 |
| Robbert Schilder | 0.86 | 1.15 | 2.51 | 1.72 | 1.72 |
| Sergio Padt | 0.90 | 0.14 | 1.28 | 1.50 | 1.28 |
| Bilal Basacikoglu | -0.83 | 1.56 | 1.75 | 0.00 | 3.75 |
| Mark-Jan Fledderus | 0.00 | 1.54 | 0.82 | -0.67 | 2.00 |

G.2 Bag-of-Words Approach without Word Stemming

Personality profiles of football players not playing in the Netherlands before scaling without word stemming (all values are multiplied by 10^2).

| Player name | I | II | III | IV | V |
|---|---|---|---|---|---|
| Cristiano Ronaldo | 0.27 | 0.00 | 0.15 | 0.53 | 0.32 |
| Lionel Messi | 0.25 | -0.04 | 0.23 | 0.23 | 0.68 |
| Luis Suarez | 0.17 | 0.12 | 0.27 | 0.27 | 0.71 |
| Karim Benzema | 0.27 | 0.19 | 0.38 | -0.31 | 1.15 |
| Juanfran | 0.02 | 0.06 | 0.25 | -0.00 | 0.36 |
| Thomas Muller | 0.09 | 0.26 | 0.48 | -0.05 | 0.36 |
| Robert Lewandowski | 0.15 | 0.25 | 0.18 | 0.40 | 0.93 |
| Diego Godin | -0.13 | 0.32 | 0.17 | -0.41 | 0.51 |
| Manuel Neuer | 0.05 | 0.28 | 0.47 | 0.00 | 0.52 |
| Sergio Busquets | 0.10 | 0.01 | 0.71 | 0.47 | 0.67 |
| Gianluigi Buffon | 0.27 | 0.25 | 0.38 | 0.27 | 0.55 |
| Toni Kroos | 0.31 | -0.11 | 0.41 | -0.39 | 0.74 |
| Neymar | 0.27 | 0.21 | 0.49 | 0.48 | 1.01 |
| Mesut Ozil | 0.14 | 0.16 | 0.43 | 0.48 | 0.64 |
| Philipp Lahm | 0.16 | 0.15 | 0.43 | -0.17 | 0.53 |
| Leonardo Bonucci | 0.18 | 0.13 | 0.29 | 0.49 | 0.19 |
| Javier Mascherano | 0.10 | 0.06 | 0.41 | 0.46 | 0.57 |
| Dani Alves | 0.08 | 0.30 | 0.24 | 0.45 | 0.50 |
| Jose Fonte | 0.24 | 0.19 | 0.13 | 0.33 | 0.44 |
| Pedro Rodriguez | 0.43 | 0.17 | 0.60 | 0.20 | 0.60 |
| Gerard Pique | 0.10 | 0.34 | 0.23 | 0.18 | 0.25 |
| Yaya Toure | 0.23 | 0.24 | 0.24 | 0.23 | 0.82 |
| Toby Alderweireld | -0.08 | 0.33 | 0.46 | 0.38 | 0.66 |
| Laurent Koscielny | 0.17 | 0.00 | 0.37 | 0.40 | 0.13 |
| Alexis Sanchez | 0.06 | 0.00 | 0.37 | 0.52 | 0.65 |
| Francesc Fabregas | 0.51 | -0.08 | 0.75 | 0.21 | 1.11 |
| David Silva | 0.08 | 0.16 | 0.18 | 0.64 | 0.41 |
| Fernandinho | -0.11 | -0.12 | 0.05 | 0.49 | 0.48 |
| Angel di Maria | 0.21 | 0.28 | 0.33 | 0.23 | 0.99 |
| Giorgio Chiellini | 0.19 | 0.02 | -0.05 | 0.25 | 1.11 |
| Thibaut Courtois | -0.15 | -0.05 | 0.28 | 0.37 | 1.58 |
| Petr Cech | 0.26 | -0.09 | 0.43 | 0.15 | 0.50 |
| Joe Hart | 0.07 | 0.32 | -0.02 | 0.46 | 0.74 |
| Andres Iniesta | 0.19 | 0.16 | 0.21 | 0.09 | 0.22 |
| Gareth Bale | 0.18 | 0.04 | 0.12 | 0.58 | 0.93 |
| Koke | 0.04 | 0.25 | 0.10 | 0.33 | 0.65 |
| Ivan Rakitic | 0.20 | -0.04 | 0.22 | 0.22 | 0.43 |
| Jan Oblak | 0.21 | 0.50 | 0.73 | 0.21 | 1.95 |
| Gabriel Fernandez | 0.00 | 0.35 | 3.48 | -2.44 | 0.81 |
| Arjen Robben | -0.07 | 0.22 | 0.51 | -0.57 | 0.37 |
| Christian Eriksen | 0.07 | 0.57 | -0.19 | 0.18 | 0.30 |
| Joao Miranda | 0.15 | 2.84 | 2.27 | -2.27 | 2.27 |
| Nicolas Otamendi | -0.22 | 0.16 | 0.05 | 0.42 | 0.54 |
| Sergio Aguero | 0.07 | -0.02 | -0.18 | 0.58 | 0.56 |
| Marcelo | 0.22 | 0.15 | 0.42 | 0.21 | 0.72 |
| Wayne Rooney | 0.42 | 0.08 | 0.79 | 0.51 | 0.79 |
| John Terry | -0.06 | 0.09 | 0.64 | 0.27 | 1.56 |
| Jamie Vardy | -0.04 | 0.34 | 0.50 | 0.63 | 0.58 |
| Eden Hazard | 0.09 | 0.26 | 0.40 | 0.12 | 1.26 |
| Filipe Luis | 0.11 | 0.15 | 0.11 | 0.25 | 0.23 |
| Jordi Alba | 0.26 | 0.06 | 0.28 | 0.25 | 0.52 |
| Luka Modric | 0.00 | 0.29 | 0.61 | 0.21 | 0.73 |
| Nemanja Matic | -0.07 | 0.32 | 0.14 | 0.44 | 0.94 |
| Sadio Mane | 0.11 | 0.04 | 0.14 | 0.32 | 0.59 |
| David Alaba | -0.05 | 0.42 | 0.33 | 0.13 | 0.30 |
| Jerome Boateng | 0.03 | 0.01 | 0.28 | 0.21 | 0.35 |
| Zlatan Ibrahimovic | -0.06 | 0.20 | 0.27 | 0.62 | 0.77 |
| Jan Vertonghen | -0.33 | 0.50 | 0.05 | 0.42 | 1.04 |
| Mario Gotze | 0.12 | 0.10 | 0.35 | 0.11 | 0.59 |
| Arturo Vidal | 0.23 | 0.26 | 0.42 | -0.04 | 0.58 |
| Fraser Forster | 0.10 | 0.22 | 0.45 | -0.13 | 0.52 |
| Willian Borges Da Silva | 5.00 | 2.50 | 6.25 | 5.00 | 10.00 |
| Daniel Carvajal | 1.20 | 0.00 | 0.25 | 2.70 | 2.70 |
| Branislav Ivanovic | -0.22 | 0.38 | -0.05 | 0.22 | 1.32 |
| David De Gea | 0.15 | -0.03 | 0.45 | 0.04 | 0.83 |
| Maxwell | 0.45 | 0.22 | 0.21 | 0.37 | 0.70 |
| Isco | 0.05 | -0.03 | 0.10 | 0.08 | 1.49 |
| Rafinha | 0.19 | 0.10 | 0.30 | 0.15 | 0.42 |
| Gonzalo Higuain | 0.04 | 0.07 | 0.28 | 0.22 | 0.54 |
| Nacho Monreal | 0.20 | 0.03 | 0.54 | 0.23 | 0.65 |
| Per Mertesacker | 0.25 | -0.02 | 0.48 | 0.55 | 0.55 |
| Hugo Lloris | 0.08 | 0.17 | 0.03 | 0.29 | 0.96 |
| Jose Gimenez | -0.15 | 0.93 | 1.48 | 2.16 | 2.31 |
| James Rodriguez | 0.16 | 0.21 | 0.16 | 0.30 | 0.78 |
| Cesar Azpilicueta | 0.18 | 0.14 | 0.14 | 0.35 | 0.51 |
| Bastian Schweinsteiger | 0.24 | 0.12 | 0.49 | 0.17 | 0.71 |
| Claudio Bravo | 0.09 | 0.22 | 0.32 | 0.21 | 0.51 |
| Romelu Lukaku | 0.26 | -0.06 | 0.03 | 0.34 | 0.59 |
| Paul Pogba | 0.26 | 0.04 | 0.39 | 0.61 | 0.91 |
| Santi Cazorla | 0.05 | 0.22 | 0.35 | 0.30 | -0.11 |
| Shinji Kagawa | 0.46 | -0.04 | 0.18 | 0.60 | 1.27 |
| Henrikh Mkhitaryan | 0.12 | 0.06 | 0.12 | 0.36 | 0.81 |
| Thiago Silva | -0.06 | 0.28 | 0.29 | 0.30 | 0.38 |
| Juan Mata | 0.26 | 0.03 | 0.63 | 0.10 | 0.84 |
| Aaron Ramsey | 0.22 | -0.07 | 0.27 | 0.32 | 0.46 |
| Gareth Barry | -0.10 | 0.16 | 0.52 | 0.25 | 0.75 |
| Vincent Kompany | 0.00 | -0.27 | 0.16 | 0.99 | 0.66 |
| Morgan Schneiderlin | 0.21 | 0.03 | 0.31 | -0.00 | 0.86 |
| Stephan Lichtsteiner | 0.12 | 0.36 | 0.03 | 0.40 | 0.44 |
| Sergio Ramos | -0.09 | 0.18 | 0.49 | 0.03 | 0.56 |
| Patrice Evra | 0.07 | 0.28 | -0.04 | 0.67 | 1.95 |
| Claudio Marchisio | -0.09 | 0.11 | 0.18 | 0.34 | 0.17 |
| Hector Bellerin | 0.14 | -0.02 | 0.45 | 0.30 | 0.46 |
| Jesus Navas | -0.04 | -0.03 | -0.08 | 0.30 | 0.32 |
| Chris Smalling | 0.16 | -0.52 | 0.71 | 0.34 | 2.32 |
| Mario Balotelli | -0.06 | 0.23 | 0.13 | 0.21 | 1.89 |
| Riyad Mahrez | 0.06 | 0.12 | 0.37 | 0.62 | 0.60 |

Personality profiles of football players playing in the Netherlands before scaling without word stemming (all values are multiplied by 10^2).

| Player name | I | II | III | IV | V |
|---|---|---|---|---|---|
| Jasper Cillessen | 3.33 | 0.62 | 2.00 | -0.75 | 2.50 |
| Luciano Narsingh | 0.45 | 0.00 | 0.90 | 0.45 | 2.03 |
| Joel Veltman | 2.42 | 0.40 | 0.00 | -3.23 | 4.30 |
| Davy Klaassen | 0.74 | 1.84 | 3.68 | -0.42 | 2.94 |
| Hector Moreno | -0.93 | 0.95 | 0.37 | -1.11 | 1.11 |
| Luuk de Jong | 1.74 | 2.00 | 1.95 | 2.44 | 1.22 |
| Ron Vlaar | 2.42 | 1.35 | 1.52 | -4.04 | 3.03 |
| Andres Guardado | 1.16 | 0.85 | 1.33 | 0.39 | 3.10 |
| Dirk Kuijt | 1.12 | 0.53 | 1.47 | -0.56 | 1.96 |
| Ricardo van Rhijn | 0.67 | 1.21 | 0.67 | -0.48 | 3.33 |
| Arek Milik | 0.00 | 3.17 | 0.00 | 0.93 | 3.70 |
| Lasse Schone | 0.34 | -0.17 | -0.34 | -2.08 | -2.38 |
| Eljero Elia | 1.54 | 0.00 | 0.62 | 1.32 | 2.78 |
| Jeroen Zoet | -0.53 | 2.15 | 2.63 | -3.29 | 2.63 |
| Joshua Brenet | 0.00 | 0.29 | 3.68 | 0.98 | 1.47 |
| Jeffrey Bruma | 0.89 | 1.04 | -0.83 | 0.93 | 1.04 |
| Karim El Ahmadi | 1.43 | -0.81 | 6.45 | -1.29 | 3.23 |
| Nemanja Gudelj | -2.17 | 0.78 | -0.72 | -1.63 | 1.30 |
| Jurgen Locadia | 2.22 | 0.99 | 1.78 | 2.54 | 1.48 |
| Ruben Schaken | 1.88 | 2.81 | -1.00 | 1.88 | 1.50 |
| Davy Propper | 1.67 | 0.31 | -0.83 | 2.08 | 5.00 |
| Bram van Polen | 0.45 | 1.74 | -0.78 | -0.62 | 3.12 |
| Kevin Diks | 7.14 | 7.14 | 10.71 | 3.57 | 7.14 |
| Sven van Beek | -0.00 | -1.67 | 0.00 | -6.67 | 5.00 |
| Johnny Heitinga | 2.30 | -0.86 | 4.31 | 1.15 | 0.00 |
| Renato Ibarra | 4.35 | 1.86 | -1.45 | 2.17 | 0.00 |
| Terence Kongolo | 0.00 | 0.34 | 3.03 | 0.87 | 3.03 |
| Guram Kashia | 1.37 | -0.32 | 2.31 | 1.92 | 1.92 |
| Luciano Slagveer | -0.87 | 4.35 | 2.90 | 1.74 | 0.00 |
| Anwar El Ghazi | 2.78 | 0.46 | 2.08 | -1.85 | 0.69 |
| Maxime Lestienne | 1.80 | -0.45 | 0.00 | 5.41 | 1.80 |
| Kenneth Vermeer | 0.00 | 0.00 | 0.58 | -0.44 | 1.40 |
| Joey van den Berg | 1.52 | -1.42 | 1.33 | 3.55 | 2.13 |
| Thulani Serero | 1.79 | -1.79 | 0.71 | 1.79 | 3.57 |
| Navarone Foor | 6.90 | -0.99 | 0.00 | 5.52 | 6.90 |
| Nick Viergever | 0.00 | 1.68 | -0.98 | 0.00 | 2.21 |
| Santiago Arias | -1.88 | 1.25 | 1.56 | 3.12 | -1.56 |
| Valeri Qazaishvili | -2.38 | 3.70 | 0.00 | -4.76 | 4.76 |
| Jorrit Hendrix | -1.67 | 5.00 | 1.67 | 3.00 | 5.00 |
| Mike Havenaar | 1.39 | 2.50 | 0.00 | 1.39 | 1.39 |
| Wout Brama | 0.00 | 0.89 | 2.38 | 7.14 | 7.14 |
| Marco van Ginkel | -1.43 | 1.43 | 1.90 | 0.41 | 2.86 |
| Sam Larsson | 0.00 | 2.22 | 2.78 | 0.93 | 3.70 |
| Jetro Willems | -2.12 | 2.12 | -3.70 | 1.23 | 3.70 |
| Riechedly Bazoer | 1.10 | 1.19 | 0.99 | -0.52 | 1.72 |
| Anthony Limbombe | -4.35 | 0.00 | -2.17 | 0.00 | 5.80 |
| Hakim Ziyech | 3.81 | -0.63 | -2.29 | 1.43 | 2.86 |
| Vincent Janssen | 1.61 | 0.81 | 1.94 | 0.81 | 3.23 |
| Edouard Duplan | 1.11 | 2.22 | 2.22 | 1.11 | 3.33 |
| Hans Hateboer | 0.71 | 0.00 | 1.67 | 10.00 | 5.00 |
| Jens Toornstra | 0.74 | 0.00 | 2.21 | -0.59 | 2.94 |
| Mitchell Dijks | -1.94 | 1.61 | 3.23 | -1.61 | 4.30 |
| Willem Janssen | 2.23 | 0.60 | 1.79 | -2.68 | 7.14 |
| Tonny Vilhena | 0.00 | -0.21 | 1.18 | 1.70 | 1.70 |
| Michiel Kramer | 0.81 | -0.54 | -0.54 | -0.27 | 1.21 |
| Nick van der Velden | 4.44 | 2.22 | -6.67 | 5.00 | 6.67 |
| Guus Hupperts | 10.53 | 6.14 | 7.89 | 5.26 | 7.89 |
| Muamer Tankovic | 2.73 | 5.11 | 4.55 | 6.06 | -4.55 |
| Jordens Peters | 1.53 | 1.72 | 5.75 | -1.72 | 3.45 |
| Isaiah Brown | 0.00 | 2.50 | 10.00 | 10.00 | 10.00 |
| Kelvin Leerdam | -1.85 | 4.94 | 2.22 | -1.85 | 0.00 |
| Hedwiges Maduro | 6.82 | 4.55 | 4.55 | 1.95 | 4.55 |
| Eric Botteghin | 2.02 | 0.67 | 0.43 | 0.00 | -3.03 |
| Erwin Mulder | -3.33 | -0.00 | -0.42 | 1.25 | 3.33 |
| Simon Tibbling | 0.00 | -5.13 | 3.85 | 7.69 | 7.69 |
| Etienne Reijnen | 0.00 | 0.88 | 1.75 | -1.75 | 5.26 |
| Kamohelo Mokotjo | 2.12 | 0.46 | 3.70 | 3.70 | 7.41 |
| Mark van der Maarel | -1.08 | 0.72 | 1.61 | 4.30 | 4.30 |
| Mike van Duinen | 1.43 | 0.37 | 0.00 | 4.17 | 2.22 |
| Martin Hansen | 0.00 | 2.48 | -4.35 | 4.35 | 0.00 |
| | 2.00 | 3.60 | 4.00 | 2.67 | 4.00 |
| Sebastien Haller | 2.40 | -3.43 | 4.00 | 4.00 | 4.00 |
| Adil Auassar | 0.00 | 0.79 | 4.76 | -4.76 | 4.76 |
| Gaston Pereiro | -0.39 | 1.35 | 1.93 | 1.80 | 0.90 |
| Jesper Drost | 0.00 | 0.32 | -1.08 | 3.23 | 6.45 |
| Thomas Bruns | -2.21 | 1.96 | -1.96 | -2.94 | 0.00 |
| Joost van Aken | -2.61 | 1.74 | 4.35 | 3.26 | 7.25 |
| Jarchinio Antonia | -4.62 | 1.54 | 0.00 | 5.77 | 1.28 |
| Jan-Arie van der Heijden | 2.30 | -1.15 | 0.00 | 0.86 | 0.00 |
| Lucas Bijker | 0.71 | 1.02 | 3.57 | 35.71 | 7.14 |
| Michael de Leeuw | 0.00 | -0.79 | 1.11 | 11.11 | 5.56 |
| Bryan Linssen | -1.29 | 1.08 | 2.58 | -3.23 | 1.08 |
| Robbert Schilder | -5.88 | 0.00 | 3.53 | 5.88 | 5.88 |
| Sergio Padt | 1.25 | -0.58 | -1.25 | 3.75 | 0.00 |
| Mark-Jan Fledderus | -0.97 | 1.58 | 1.55 | -0.87 | 0.72 |


G.3 Part-of-Speech Tagging

Personality profiles of football players not playing in the Netherlands before scaling using part-of-speech tagging (all values are multiplied by 10^2).

Player name I II III IV V
Cristiano Ronaldo 0.15 0.26 0.29 0.10 0.27
Lionel Messi 0.10 0.14 0.48 0.10 0.46
Luis Suarez 0.06 0.13 0.41 0.31 0.32
Karim Benzema 0.04 0.34 0.52 -0.17 0.56
Juanfran -0.14 0.22 0.44 0.02 0.42
Thomas Muller 0.15 0.28 0.58 -0.07 0.24
Robert Lewandowski 0.16 0.16 0.43 0.34 0.47
Diego Godin -0.04 0.37 0.32 0.04 0.34
Manuel Neuer 0.14 0.46 0.55 0.08 0.56
Sergio Busquets 0.01 0.17 0.61 0.20 0.84
Gianluigi Buffon 0.27 0.27 0.44 0.50 0.51
Toni Kroos 0.10 0.19 0.47 0.02 0.42
Neymar 0.24 0.25 0.56 0.26 0.47
Mesut Ozil 0.09 0.18 0.47 0.16 0.58
Philipp Lahm 0.09 0.20 0.42 0.15 0.77
Leonardo Bonucci 0.15 0.15 0.47 0.28 0.14
Javier Mascherano 0.10 0.14 0.47 0.10 0.44
Dani Alves -0.10 0.32 0.31 0.38 0.34
Jose Fonte -0.00 0.02 0.48 0.44 0.57
Pedro Rodriguez 0.13 0.22 0.59 0.13 0.26
Gerard Pique 0.08 0.31 0.40 0.08 0.27
Yaya Toure 0.05 0.07 0.50 0.17 0.63
Toby Alderweireld 0.05 0.38 0.57 0.16 0.61
Laurent Koscielny -0.11 0.16 0.36 0.35 0.26
Alexis Sanchez 0.03 0.08 0.50 0.12 0.14
Francesc Fabregas 0.08 0.05 0.77 0.15 0.85
David Silva 0.01 0.07 0.37 0.30 0.42
Fernandinho 0.01 0.17 0.27 0.03 0.45
Angel di Maria 0.10 0.31 0.44 0.13 0.64
Giorgio Chiellini 0.02 0.20 0.23 0.16 0.65
Thibaut Courtois -0.02 0.33 0.53 0.36 0.57
Petr Cech 0.17 -0.00 0.55 0.37 0.62
Joe Hart -0.06 0.33 0.43 0.41 0.36
Andres Iniesta 0.16 0.18 0.35 0.24 0.16
Gareth Bale -0.04 0.17 0.30 0.32 0.36
Koke -0.03 0.30 0.35 0.13 0.42
Ivan Rakitic 0.18 0.10 0.39 0.05 0.30
Jan Oblak 0.24 0.54 0.78 0.21 0.99
Gabriel Fernandez 1.34 0.20 1.79 2.23 0.60
Arjen Robben -0.13 0.32 0.37 -0.06 0.13
Christian Eriksen -0.04 0.32 0.23 0.20 0.27
Joao Miranda 0.47 1.95 2.00 0.00 1.30
Nicolas Otamendi -0.09 0.26 0.32 0.13 0.40
Sergio Aguero 0.12 0.04 0.20 0.23 0.32
Marcelo 0.23 0.23 0.65 0.31 0.31
Wayne Rooney 0.29 0.30 0.76 0.28 0.53
John Terry 0.19 0.18 0.79 0.68 0.69
Jamie Vardy 0.06 0.37 0.52 0.63 0.11
Eden Hazard 0.06 0.26 0.51 0.26 0.48
Filipe Luis -0.00 0.23 0.25 0.16 0.15
Jordi Alba 0.14 0.18 0.37 0.18 0.32
Luka Modric 0.05 0.31 0.64 -0.12 0.47
Nemanja Matic 0.00 0.57 0.29 0.31 0.56
Sadio Mane 0.05 0.01 0.30 0.35 0.41
David Alaba -0.13 0.25 0.35 0.11 0.37
Jerome Boateng 0.02 0.09 0.40 0.07 0.38
Zlatan Ibrahimovic 0.05 0.32 0.46 0.24 0.56
Jan Vertonghen -0.11 0.42 0.40 0.11 0.66
Mario Gotze 0.03 0.14 0.37 0.15 0.49
Arturo Vidal 0.16 0.31 0.44 0.23 0.38
Fraser Forster 0.01 0.49 0.46 0.27 0.42
Willian Borges Da Silva 0.00 0.62 3.12 3.12 2.34
Daniel Carvajal 0.26 0.96 1.15 0.00 2.88
Branislav Ivanovic -0.06 0.35 0.27 0.13 0.66
David De Gea 0.15 0.13 0.52 0.31 0.30
Maxwell 0.28 0.22 0.41 0.55 0.45
Isco -0.02 0.06 0.39 0.22 0.74
Rafinha 0.04 0.15 0.42 0.21 0.25
Gonzalo Higuain -0.01 0.08 0.48 0.36 0.55
Nacho Monreal -0.00 0.10 0.45 0.12 1.08
Per Mertesacker 0.05 0.09 0.51 0.54 0.52
Hugo Lloris 0.03 0.12 0.46 0.24 0.43
Keilor Navas -0.79 0.00 7.14 2.38 7.14
Jose Gimenez -0.00 1.11 1.48 0.33 0.81
James Rodriguez 0.13 0.23 0.44 0.18 0.31
Cesar Azpilicueta 0.08 0.29 0.29 0.13 0.32
Bastian Schweinsteiger 0.09 0.39 0.55 0.12 0.52
Claudio Bravo 0.05 0.21 0.45 0.14 0.33
Romelu Lukaku 0.17 0.06 0.35 0.26 0.53
Paul Pogba 0.18 0.07 0.48 0.40 0.42
Santi Cazorla 0.08 0.20 0.40 0.52 0.41
Shinji Kagawa 0.38 0.23 0.47 0.30 0.77
Henrikh Mkhitaryan 0.13 0.10 0.41 0.20 0.78
Thiago Silva 0.01 0.34 0.41 0.46 0.34
Juan Mata 0.16 0.12 0.62 0.16 0.70
Aaron Ramsey 0.20 0.07 0.41 0.15 0.44
Gareth Barry -0.08 0.17 0.49 0.09 0.85
Vincent Kompany 0.08 -0.26 0.40 0.57 0.00
Morgan Schneiderlin 0.18 0.41 0.53 0.17 0.24
Stephan Lichtsteiner 0.02 0.26 0.29 0.47 0.64
Sergio Ramos -0.06 0.26 0.58 0.25 0.25
Patrice Evra 0.09 0.29 0.32 0.46 0.93
Claudio Marchisio -0.11 0.20 0.32 0.15 0.33
Hector Bellerin 0.07 0.12 0.53 0.08 0.67
Jesus Navas 0.05 0.03 0.30 0.13 0.25
Oscar dos Santos 1.09 1.17 2.05 1.23 4.92
Chris Smalling 0.05 0.31 0.57 0.21 0.71
Mario Balotelli 0.10 0.20 0.38 0.30 0.59
Riyad Mahrez 0.02 0.10 0.44 0.40 0.22


Personality profiles of football players playing in the Netherlands before scaling using part-of-speech tagging (all values are multiplied by 10^2).

Player name I II III IV V
Jasper Cillessen 0.43 0.12 0.44 0.29 -0.13
Luciano Narsingh -0.09 -0.05 0.34 0.28 -0.30
Joel Veltman 0.00 0.24 0.00 0.22 -0.19
Davy Klaassen 0.18 0.26 0.58 0.42 0.05
Hector Moreno -0.25 0.11 0.04 -0.24 0.04
Luuk de Jong 0.26 0.31 0.45 0.39 -0.35
Ron Vlaar 0.21 0.42 0.22 0.37 0.05
Andres Guardado 0.05 0.32 0.07 0.18 0.18
Dirk Kuijt 0.19 0.04 0.29 0.06 -0.31
Ricardo van Rhijn 0.06 0.17 0.26 0.24 -0.95
Arek Milik -0.13 0.17 0.44 0.57 -0.05
Lasse Schone -0.02 -0.02 0.09 0.20 -1.07
Eljero Elia 0.09 0.11 0.05 0.36 -0.31
Jeroen Zoet 0.02 0.08 0.51 0.18 0.00
Joshua Brenet 0.05 0.11 0.74 0.40 -0.07
Jeffrey Bruma 0.17 0.21 0.31 0.28 0.28
Karim El Ahmadi 0.11 0.00 0.38 -0.04 -0.21
Nemanja Gudelj 0.00 0.13 0.54 0.23 -0.03
Jurgen Locadia 0.00 -0.09 0.49 0.41 -0.46
Ruben Schaken 0.16 0.29 0.00 0.22 0.09
Davy Propper 0.26 0.25 0.18 0.25 0.00
Marko Vejinovic -0.09 0.22 0.23 0.44 -0.45
Bram van Polen 0.11 0.08 0.19 1.12 -0.37
Kevin Diks 0.27 0.13 1.27 0.48 -0.32
Sven van Beek 0.00 0.00 0.36 0.40 -0.10
Johnny Heitinga 0.24 -0.10 0.62 0.42 -0.06
Renato Ibarra 0.08 0.00 0.19 0.63 -0.36
Terence Kongolo 0.06 0.09 0.44 0.18 -0.44
Guram Kashia 0.06 0.06 0.40 0.24 0.06
Luciano Slagveer -0.25 0.32 0.39 0.50 -0.29
Anwar El Ghazi 0.20 0.15 0.34 0.15 -0.52
Markus Henriksen 0.56 0.28 0.30 0.00 -3.86
Dabney dos Santos 0.36 0.42 0.16 0.14 -1.42
Maxime Lestienne -0.14 -0.04 0.38 0.46 0.26
Kenneth Vermeer 0.27 0.11 0.25 0.36 -0.13
Joey van den Berg 0.22 0.06 0.19 0.62 0.25
Thulani Serero 0.18 -0.09 0.24 0.27 -0.17
Navarone Foor 0.34 -0.34 0.45 0.81 0.45
Nick Viergever 0.22 0.29 0.29 0.12 -0.80
Santiago Arias -0.10 0.12 0.14 0.70 -0.46
Valeri Qazaishvili 0.27 0.14 0.00 0.23 -0.23
Jorrit Hendrix -0.13 0.22 0.00 0.25 -0.30
Felipe Gutierrez 0.11 -0.21 1.15 0.38 0.31
Mike Havenaar 0.39 0.25 0.33 0.47 -0.11
Wout Brama 0.31 0.14 0.00 0.78 0.33
Marco van Ginkel -0.03 0.11 0.27 0.44 -0.24
Sam Larsson -0.25 0.21 0.29 0.61 -0.07
Jetro Willems -0.02 0.15 0.42 0.43 -0.83
Thomas Kristensen -0.27 1.35 0.34 1.35 -0.34
Riechedly Bazoer 0.14 0.18 0.22 0.12 0.09
Anthony Limbombe 0.41 -0.19 0.00 0.00 -0.32
Hakim Ziyech -0.06 0.17 0.20 0.61 -0.09
Vincent Janssen 0.24 0.16 0.55 0.20 -0.55
Edouard Duplan 0.15 0.29 0.63 0.06 0.15
Iliass Bel Hassani -0.26 0.25 0.16 0.62 -0.23
Hans Hateboer 0.16 0.14 0.29 0.34 -0.38
Jens Toornstra 0.12 0.28 0.40 0.03 -0.03
Mitchell Dijks -0.05 0.18 0.39 0.53 -0.05
Wout Droste 0.73 0.61 2.45 0.92 -0.31
Willem Janssen 0.20 -0.00 0.37 0.22 0.20
Brandley Kuwas 0.46 0.31 0.00 0.31 -0.92
Tonny Vilhena 0.18 0.08 0.16 0.18 -0.17
Michiel Kramer 0.00 -0.05 0.20 0.32 0.00
Nick van der Velden 0.18 0.00 0.86 0.75 -0.11
Rochdi Achenteh 0.07 0.19 0.33 0.52 -0.52
Guus Hupperts 0.50 0.29 0.70 0.29 -0.58
Muamer Tankovic 0.00 0.39 0.43 0.26 -0.77
Jordens Peters 0.67 0.00 0.50 -0.44 0.00
Isaiah Brown 0.54 0.33 1.49 3.57 0.45
Kelvin Leerdam -0.31 0.45 0.07 0.87 -0.35
Hedwiges Maduro 0.60 0.37 0.43 0.32 -0.83
Eric Botteghin 0.04 0.25 0.34 0.39 -0.79
Erwin Mulder 0.04 0.02 0.33 0.21 -0.29
Simon Tibbling 0.37 -0.15 0.93 0.37 1.11
Etienne Reijnen -0.14 0.08 0.28 0.26 -0.42
Kamohelo Mokotjo 0.16 0.00 0.84 0.42 -0.12
Mark van der Maarel -0.12 0.18 0.34 1.08 -0.69
Mike van Duinen 0.09 0.13 0.12 0.52 0.06
Martin Hansen -0.06 0.17 0.52 0.83 -0.25
Ben Rienstra 0.07 0.38 0.36 0.53 0.08
Sebastien Haller 0.16 -0.13 0.40 0.20 0.00
Adil Auassar 0.09 0.45 0.60 0.24 -0.40
Gaston Pereiro -0.07 0.06 0.24 0.40 -0.54
Jesper Drost 0.29 0.30 0.00 0.50 0.25
Thomas Bruns 0.14 0.27 0.00 0.00 -0.00
Joost van Aken -0.23 0.47 0.55 0.55 0.44
Jarchinio Antonia 0.08 0.39 0.21 1.13 -0.17
Jan-Arie van der Heijden 0.17 0.17 0.23 0.16 -0.75
Lucas Bijker 0.00 0.00 0.43 2.91 0.60
Michael de Leeuw 0.05 0.65 0.26 0.40 -0.10
Bryan Linssen 0.08 0.17 0.17 -0.06 -0.18
Robbert Schilder 0.00 -0.30 0.96 0.48 -0.24
Sergio Padt 0.27 0.12 0.37 0.34 -1.05
Bilal Basacikoglu -0.28 0.32 0.50 1.13 -0.23
Mark-Jan Fledderus 0.09 0.23 0.15 0.05 0.20

Appendix H

Personality Profiles after Scaling

This appendix includes the primary results of the personality construction. Results are provided both for players playing in the Netherlands and for players not playing in the Netherlands. Note that not all players are represented in the tables: for some players, too few articles were gathered to calculate a value for each personality trait. The factors are numbered as follows:

• Factor I: Agreeableness
• Factor II: Extraversion
• Factor III: Conscientiousness
• Factor IV: Emotional Stability (in this project referred to as Neuroticism)
• Factor V: Intellect/Imagination (in this project referred to as Openness to Experience)
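The relation between the unscaled values of Appendix G and the scaled values in this appendix can be illustrated with a small sketch. It assumes, purely for illustration, that the scaling is a column-wise z-score standardization within one group of players; the procedure actually used is the one described in the main text and may differ. The function name `zscore_columns` is hypothetical, and because only three rows are used here the output will not reproduce the tables below.

```python
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column (factor) of a list of equal-length rows.

    Assumption: sample standard deviation; a constant column maps to 0.0.
    """
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]
    return [
        [(v - mu) / sd if sd > 0 else 0.0 for v, mu, sd in zip(row, mus, sigmas)]
        for row in rows
    ]


# Three unscaled part-of-speech profiles from Appendix G (factors I-V).
raw = [
    [0.43, 0.12, 0.44, 0.29, -0.13],   # Jasper Cillessen
    [-0.09, -0.05, 0.34, 0.28, -0.30],  # Luciano Narsingh
    [0.00, 0.24, 0.00, 0.22, -0.19],   # Joel Veltman
]
scaled = zscore_columns(raw)
for row in scaled:
    print(" ".join(f"{v:.2f}" for v in row))
```

After such a scaling every factor has mean 0 and unit standard deviation within the group, which makes the five factors comparable with each other.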

H.1 Bag-of-Words Approach with Word Stemming

Personality profiles of football players not playing in the Netherlands after scaling.

Player name I II III IV V
Cristiano Ronaldo -0.33 0.01 -0.57 -1.46 -0.61
Lionel Messi -0.35 -0.26 -0.71 -0.04 0.65
Luis Suarez -0.73 0.00 -0.59 -0.15 -0.21
Karim Benzema -0.48 -0.26 -0.57 -1.63 -0.17
Juanfran -0.44 -0.39 -0.22 0.22 -0.48
Thomas Muller -0.14 0.23 0.12 0.20 -0.46
Robert Lewandowski -0.26 -0.31 0.08 0.31 0.12
Diego Godin -0.57 0.62 -0.31 -0.44 -0.48
Manuel Neuer -0.27 0.15 -0.15 0.09 -0.29
Sergio Busquets -0.41 -0.22 -0.04 -0.31 0.68
Gianluigi Buffon 0.01 -0.35 -0.34 1.21 -0.34
Toni Kroos 0.02 0.36 -0.28 -0.41 0.26
Neymar 0.06 -0.14 -0.28 0.01 0.32
Mesut Ozil -0.40 -0.73 -0.26 0.32 -0.26
Philipp Lahm -0.12 -0.29 -0.01 0.23 0.74
Leonardo Bonucci -0.12 -0.42 -0.31 0.50 -0.86
Javier Mascherano -0.11 -0.61 -0.40 -0.19 -0.10
Dani Alves -0.55 0.75 -0.83 -0.92 -0.95
Jose Fonte 0.03 -0.24 -0.13 -0.00 -0.33
Pedro Rodriguez 0.25 -0.17 -0.35 -0.71 -0.18
Gerard Pique -0.24 0.64 -0.49 0.13 -0.86
Yaya Toure -0.49 -0.26 -0.41 -0.07 0.12
Toby Alderweireld -0.30 0.32 0.34 0.64 -0.02
Laurent Koscielny -0.18 0.21 -0.15 0.56 -0.58
Alexis Sanchez -0.31 0.04 -0.36 0.18 -0.43
Francesc Fabregas 0.24 -0.65 0.83 -0.66 0.91
David Silva -0.17 -0.25 -0.09 -0.24 0.17
Fernandinho -0.43 -0.25 -0.78 -0.09 0.20
Angel di Maria -0.28 0.38 -0.41 -0.19 0.08
Giorgio Chiellini -0.73 1.12 -0.84 -0.24 -0.29
Thibaut Courtois -0.17 -0.21 -0.22 0.14 0.86
Petr Cech 0.04 0.04 -0.31 0.84 0.23
Joe Hart -0.16 0.18 0.30 -0.96 0.11
Andres Iniesta -0.13 -0.04 -0.46 0.02 -1.07
Gareth Bale -0.41 0.02 -0.52 0.05 -0.39
Koke -0.36 -0.04 -0.41 0.06 -0.35
Ivan Rakitic 0.44 -0.19 -0.32 -0.06 -1.26
Jan Oblak -0.14 0.93 0.65 0.79 1.19
Gabriel Fernandez 7.00 -0.08 3.78 0.35 -2.85
Arjen Robben -0.33 -0.09 -0.19 -1.20 -0.84
Christian Eriksen -0.34 0.43 -0.22 -0.72 -0.76
Joao Miranda 1.18 3.66 2.92 -5.22 2.78
Nicolas Otamendi -0.20 -0.20 -0.54 -0.60 -0.14
Sergio Aguero 0.04 -0.34 -0.55 0.13 0.65
Marcelo -0.06 0.11 -0.09 0.50 -0.61
Wayne Rooney 0.12 0.32 0.22 -0.03 0.59
John Terry 0.34 -0.06 -0.04 0.88 -0.41
Jamie Vardy -0.20 0.59 -0.48 0.67 -1.08
Eden Hazard 0.09 0.17 -0.29 0.27 0.68
Filipe Luis -0.22 -0.40 -0.45 0.24 -0.88
Jordi Alba -0.18 -0.04 -0.31 -0.50 -0.59
Luka Modric -0.45 0.21 0.28 -1.18 0.07
Nemanja Matic -0.25 0.43 -0.66 0.85 0.23
Sadio Mane -0.47 -0.19 -0.71 0.03 -0.07
David Alaba -0.57 -0.06 -0.16 -0.15 -0.20
Jerome Boateng -0.20 -0.23 -0.23 -0.18 -0.63
Zlatan Ibrahimovic -0.13 -0.08 -0.14 0.39 0.41
Jan Vertonghen -0.45 0.31 -0.20 0.50 0.41
Mario Gotze -0.17 -0.01 -0.39 0.11 -0.81
Arturo Vidal -0.26 -0.26 -0.07 0.27 -0.24
Fraser Forster -0.25 0.08 0.21 0.13 -0.45
Willian Borges Da Silva 5.92 1.57 5.21 4.66 3.27
Daniel Carvajal 0.62 1.43 1.64 -4.20 4.54
Branislav Ivanovic -0.41 0.41 -0.93 -0.03 -0.57
David De Gea -0.26 -0.42 -0.11 -0.46 0.25
Maxwell -0.09 -0.24 -0.14 0.74 -0.28
Isco -0.23 -0.34 -0.19 -0.59 0.29
Rafinha 0.18 -0.28 -0.23 0.35 -0.77
Gonzalo Higuain -0.10 -0.28 -0.10 0.43 -0.18
Nacho Monreal -0.06 -0.25 -0.29 -0.09 0.13
Per Mertesacker 0.10 0.24 -0.17 0.60 0.03
Hugo Lloris -0.28 -0.10 -0.01 -0.16 0.30
Keilor Navas 0.94 -7.90 4.19 0.35 -2.35
Jose Gimenez 0.48 2.21 2.29 -0.52 1.50
James Rodriguez -0.03 0.11 -0.09 0.02 -0.25
Cesar Azpilicueta 0.13 -0.26 -0.29 0.45 -0.13
Bastian Schweinsteiger 0.12 0.43 -0.26 -0.49 -0.31
Claudio Bravo -0.36 -0.02 -0.24 0.22 0.02
Romelu Lukaku 0.35 -0.38 -0.05 0.44 0.88
Paul Pogba -0.18 -0.53 -0.16 0.01 0.85
Santi Cazorla 0.37 0.64 -0.73 0.84 -0.65
Shinji Kagawa 0.43 -0.59 -0.13 0.35 0.48
Henrikh Mkhitaryan -0.31 -0.30 -0.32 0.29 -0.14
Thiago Silva -0.42 0.19 -0.56 0.98 -0.81
Juan Mata -0.20 0.06 0.23 -0.39 0.82
Aaron Ramsey 0.19 -0.10 -0.08 0.48 -0.43
Gareth Barry 0.06 0.32 0.18 0.50 1.19
Vincent Kompany 0.37 0.33 -0.40 -0.48 -0.60
Morgan Schneiderlin 0.11 0.71 -0.36 -0.06 0.20
Stephan Lichtsteiner -0.22 -0.24 -0.45 1.14 -0.11
Sergio Ramos -0.69 0.34 -0.20 -0.11 -0.80
Patrice Evra -1.05 0.05 -0.90 0.53 0.14
Claudio Marchisio -0.48 -0.32 -0.43 -0.39 -0.41
Hector Bellerin -0.26 -0.51 0.04 0.01 0.18
Jesus Navas -0.07 -0.54 -0.33 -0.14 -0.53
Oscar dos Santos 1.00 0.56 2.81 1.77 3.80
Chris Smalling -0.12 -0.10 0.28 -1.01 0.60
Mario Balotelli -0.46 0.57 -0.66 -0.04 -0.59
Riyad Mahrez -0.45 -0.43 0.06 0.73 -0.57

Personality profiles of football players playing in the Netherlands after scaling.

Player name I II III IV V
Jasper Cillessen 1.70 -0.36 0.53 -0.71 -0.84
Luciano Narsingh 0.26 -0.31 -0.34 -0.41 -0.09
Joel Veltman 0.61 0.34 -0.52 -0.19 0.43
Davy Klaassen 0.74 0.87 1.73 -0.19 -1.00
Hector Moreno -1.23 -0.02 -0.98 -1.40 -0.42
Luuk de Jong 0.26 1.20 0.25 -0.05 -0.27
Ron Vlaar 1.52 1.28 -0.11 -1.67 -0.83
Andres Guardado -0.20 -0.08 -0.74 -0.36 -0.88
Dirk Kuijt -0.21 -0.57 -0.30 -0.49 -0.74
Ricardo van Rhijn -0.18 0.09 -0.09 -0.41 -0.88
Arek Milik -0.16 0.60 1.15 -0.69 0.50
Lasse Schone -0.97 -0.83 -0.71 -0.80 -0.26
Eljero Elia -0.36 -0.72 -0.54 0.03 -1.07
Jeroen Zoet -0.60 0.18 0.00 -0.74 -1.29
Joshua Brenet -1.05 -0.14 -0.14 -0.71 -0.70
Jeffrey Bruma 0.84 0.48 -0.97 0.05 0.32
Karim El Ahmadi 0.25 -0.57 -0.16 -1.69 -0.81
Nemanja Gudelj -1.13 -0.16 -0.76 -0.87 -0.43
Jurgen Locadia -0.32 -0.85 0.72 1.00 0.52
Ruben Schaken 0.23 0.82 -1.65 0.55 -0.07
Davy Propper 0.63 -0.26 -1.16 0.09 -0.52
Marko Vejinovic -1.37 -0.01 -0.95 -0.19 -0.85
Bram van Polen 0.30 0.03 -1.25 -0.65 0.48
Kevin Diks 3.03 -0.22 2.89 1.09 1.16
Sven van Beek 0.84 -0.86 0.18 -2.05 -0.61
Johnny Heitinga 0.53 -1.18 1.20 0.25 0.25
Renato Ibarra 1.55 -0.81 -0.42 0.32 -0.23
Terence Kongolo -1.10 -0.96 0.44 -0.09 -1.10
Guram Kashia 0.62 -0.99 0.70 -0.19 -0.06
Luciano Slagveer -2.42 1.15 0.11 -0.19 -0.27
Anwar El Ghazi 1.48 -0.38 0.91 -1.07 -0.86
Markus Henriksen 1.52 0.97 0.76 -0.70 -0.96
Dabney dos Santos -0.06 0.93 -0.43 0.16 0.07
Maxime Lestienne -0.29 -0.78 -0.56 0.39 -0.50
Kenneth Vermeer -1.63 -0.85 -1.05 -0.62 -1.25
Joey van den Berg 1.04 -0.53 -0.16 1.69 -1.06
Thulani Serero 0.47 -0.86 -0.44 0.61 -0.63
Navarone Foor 1.91 -1.48 -0.20 1.11 2.12
Nick Viergever 0.05 0.43 0.04 -0.36 -0.16
Santiago Arias -1.06 0.45 -0.36 0.22 -0.66
Valeri Qazaishvili -0.34 0.91 -1.25 -0.19 0.02
Jorrit Hendrix -1.14 0.89 -1.60 0.06 -0.27
Felipe Gutierrez 0.00 -1.59 -0.37 -1.28 0.35
Mike Havenaar 0.76 0.92 0.11 -1.03 0.41
Wout Brama -1.23 -0.59 -2.19 1.63 1.25
Marco van Ginkel -0.10 0.26 0.22 -0.19 -0.83
Sam Larsson -0.40 -0.60 -0.58 0.01 -0.37
Jetro Willems -0.57 0.99 1.24 -0.01 0.28
Thomas Kristensen 1.01 2.97 -1.25 -0.19 2.51
Riechedly Bazoer -0.35 -0.34 -0.50 -0.26 -0.52
Anthony Limbombe -1.26 -0.73 -2.02 -0.83 1.13
Hakim Ziyech 0.12 -0.69 -1.01 0.26 -0.86
Vincent Janssen 0.97 -0.02 2.00 -0.03 -0.98
Edouard Duplan -0.20 1.18 0.14 -0.19 -0.00
Hans Hateboer 0.62 -0.16 1.68 2.82 0.96
Jens Toornstra -0.08 0.12 -0.08 -0.33 -0.77
Mitchell Dijks -1.59 0.76 0.33 -1.00 -0.40
Willem Janssen 0.19 -1.00 0.07 -0.57 0.43
Tonny Vilhena -0.84 -0.72 -0.28 -0.19 -0.91
Michiel Kramer -0.29 -1.18 -0.84 -0.45 -1.02
Nick van der Velden -0.07 0.12 1.89 1.10 0.81
Rochdi Achenteh -0.06 0.14 -0.28 0.69 0.07
Guus Hupperts 3.04 1.89 1.77 1.21 0.53
Jordens Peters 0.38 -0.83 0.47 -1.68 0.20
Kelvin Leerdam -0.44 1.40 -0.35 -0.82 0.67
Hedwiges Maduro 1.22 1.79 0.67 0.26 1.26
Eric Botteghin 0.31 0.74 0.05 -0.19 -0.83
Erwin Mulder -1.62 -1.06 -0.50 -0.02 -0.69
Simon Tibbling 0.68 -3.47 3.37 2.66 4.19
Etienne Reijnen -1.27 -1.38 -0.11 0.19 -0.23
Kamohelo Mokotjo 0.04 -1.18 0.70 -0.19 -0.76
Mark van der Maarel -1.44 -0.25 0.11 1.49 -0.19
Mike van Duinen -0.27 -0.31 -1.03 1.15 0.18
Martin Hansen 0.81 0.89 -0.55 1.19 1.60
Ben Rienstra 1.15 2.61 0.32 0.17 0.98
Sebastien Haller 0.03 -1.57 1.30 0.65 1.97
Adil Auassar -0.70 0.41 1.21 -2.44 0.54
Gaston Pereiro -0.33 -0.93 -0.16 -0.19 -0.23
Jesper Drost 1.50 -0.33 -0.96 0.44 0.24
Thomas Bruns -0.48 0.20 -1.47 -1.78 -0.17
Joost van Aken -1.30 0.89 1.04 0.47 1.27
Jarchinio Antonia -0.01 1.76 -0.38 2.23 -0.88
Jan-Arie van der Heijden 0.37 -0.06 0.22 0.25 -0.83
Lucas Bijker -0.70 -0.69 -0.04 1.87 2.37
Michael de Leeuw 0.32 0.80 -0.63 2.61 2.46
Bryan Linssen -0.70 0.60 0.23 -0.54 0.01
Robbert Schilder 0.23 0.43 1.76 1.09 -0.60
Sergio Padt 0.27 -0.98 0.29 0.92 -1.05
Bilal Basacikoglu -1.60 1.00 0.85 -0.19 1.47
Mark-Jan Fledderus -0.70 0.97 -0.27 -0.69 -0.31

H.2 Bag-of-Words Approach without Word Stemming

Personality profiles of football players not playing in the Netherlands without word stemming after scaling.

Player name I II III IV V
Cristiano Ronaldo 0.18 -0.48 -0.37 0.29 -0.51
Lionel Messi 0.15 -0.57 -0.26 -0.11 -0.18
Luis Suarez 0.00 -0.18 -0.21 -0.06 -0.14
Karim Benzema 0.19 -0.02 -0.05 -0.85 0.27
Juanfran -0.30 -0.32 -0.24 -0.43 -0.47
Thomas Muller -0.15 0.17 0.07 -0.51 -0.47
Robert Lewandowski -0.05 0.14 -0.33 0.11 0.06
Diego Godin -0.56 0.30 -0.35 -0.99 -0.33
Manuel Neuer -0.24 0.21 0.07 -0.43 -0.32
Sergio Busquets -0.14 -0.45 0.39 0.21 -0.18
Gianluigi Buffon 0.18 0.12 -0.06 -0.06 -0.29
Toni Kroos 0.26 -0.76 -0.03 -0.97 -0.12
Neymar 0.19 0.03 0.09 0.22 0.14
Mesut Ozil -0.06 -0.09 0.00 0.23 -0.21
Philipp Lahm -0.02 -0.12 0.01 -0.67 -0.31
Leonardo Bonucci 0.01 -0.17 -0.18 0.24 -0.63
Javier Mascherano -0.14 -0.34 -0.02 0.19 -0.27
Dani Alves -0.18 0.25 -0.24 0.18 -0.34
Jose Fonte 0.12 -0.00 -0.39 0.02 -0.40
Pedro Rodriguez 0.48 -0.07 0.24 -0.16 -0.24
Gerard Pique -0.14 0.36 -0.26 -0.18 -0.58
Yaya Toure 0.11 0.10 -0.25 -0.12 -0.04
Toby Alderweireld -0.48 0.34 0.05 0.10 -0.19
Laurent Koscielny -0.01 -0.48 -0.07 0.12 -0.68
Alexis Sanchez -0.22 -0.48 -0.07 0.28 -0.20
Francesc Fabregas 0.64 -0.69 0.43 -0.15 0.23
David Silva -0.18 -0.08 -0.33 0.45 -0.42
Fernandinho -0.53 -0.77 -0.50 0.25 -0.36
Angel di Maria 0.07 0.22 -0.13 -0.12 0.12
Giorgio Chiellini 0.03 -0.43 -0.64 -0.09 0.23
Thibaut Courtois -0.60 -0.61 -0.19 0.08 0.68
Petr Cech 0.17 -0.71 0.01 -0.22 -0.34
Joe Hart -0.19 0.31 -0.60 0.20 -0.12
Andres Iniesta 0.03 -0.08 -0.29 -0.31 -0.60
Gareth Bale 0.02 -0.37 -0.41 0.37 0.06
Koke -0.25 0.14 -0.44 0.02 -0.20
Ivan Rakitic 0.05 -0.58 -0.27 -0.14 -0.41
Jan Oblak 0.07 0.75 0.41 -0.15 1.02
Gabriel Fernandez -0.32 0.38 4.11 -3.78 -0.05
Arjen Robben -0.46 0.06 0.11 -1.21 -0.46
Christian Eriksen -0.19 0.93 -0.82 -0.18 -0.53
Joao Miranda -0.04 6.52 2.48 -3.55 1.32
Nicolas Otamendi -0.75 -0.09 -0.50 0.14 -0.30
Sergio Aguero -0.20 -0.52 -0.81 0.36 -0.28
Marcelo 0.10 -0.10 -0.01 -0.14 -0.13
Wayne Rooney 0.47 -0.28 0.49 0.27 -0.07
John Terry -0.44 -0.27 0.29 -0.06 0.66
Jamie Vardy -0.40 0.37 0.10 0.43 -0.27
Eden Hazard -0.16 0.15 -0.03 -0.27 0.37
Filipe Luis -0.11 -0.11 -0.42 -0.09 -0.59
Jordi Alba 0.16 -0.32 -0.20 -0.09 -0.32
Luka Modric -0.32 0.23 0.24 -0.15 -0.12
Nemanja Matic -0.46 0.32 -0.39 0.17 0.08
Sadio Mane -0.11 -0.37 -0.38 0.01 -0.26
David Alaba -0.41 0.55 -0.12 -0.26 -0.53
Jerome Boateng -0.27 -0.46 -0.20 -0.14 -0.48
Zlatan Ibrahimovic -0.44 0.02 -0.21 0.42 -0.08
Jan Vertonghen -0.95 0.76 -0.51 0.14 0.17
Mario Gotze -0.09 -0.23 -0.10 -0.28 -0.26
Arturo Vidal 0.11 0.15 -0.00 -0.49 -0.26
Fraser Forster -0.13 0.07 0.03 -0.61 -0.32
Willian Borges Da Silva 9.11 5.68 7.82 6.43 8.58
Daniel Carvajal 1.94 -0.48 -0.24 3.28 1.73
Branislav Ivanovic -0.74 0.45 -0.63 -0.13 0.43
David De Gea -0.04 -0.55 0.04 -0.38 -0.03
Maxwell 0.53 0.05 -0.29 0.07 -0.15
Isco -0.22 -0.56 -0.44 -0.32 0.59
Rafinha 0.02 -0.25 -0.17 -0.23 -0.42
Gonzalo Higuain -0.26 -0.30 -0.19 -0.13 -0.31
Nacho Monreal 0.05 -0.40 0.16 -0.11 -0.20
Per Mertesacker 0.15 -0.52 0.07 0.33 -0.29
Hugo Lloris -0.18 -0.06 -0.53 -0.04 0.09
Jose Gimenez -0.62 1.80 1.42 2.53 1.36
James Rodriguez -0.02 0.04 -0.36 -0.02 -0.08
Cesar Azpilicueta 0.02 -0.14 -0.38 0.04 -0.33
Bastian Schweinsteiger 0.14 -0.18 0.09 -0.20 -0.14
Claudio Bravo -0.16 0.05 -0.13 -0.15 -0.33
Romelu Lukaku 0.17 -0.64 -0.53 0.04 -0.26
Paul Pogba 0.16 -0.39 -0.05 0.40 0.05
Santi Cazorla -0.23 0.06 -0.10 -0.02 -0.91
Shinji Kagawa 0.54 -0.58 -0.33 0.39 0.38
Henrikh Mkhitaryan -0.09 -0.34 -0.41 0.07 -0.05
Thiago Silva -0.44 0.20 -0.18 -0.02 -0.45
Juan Mata 0.17 -0.40 0.28 -0.29 -0.02
Aaron Ramsey 0.09 -0.65 -0.20 0.01 -0.37
Gareth Barry -0.51 -0.09 0.12 -0.09 -0.10
Vincent Kompany -0.32 -1.16 -0.35 0.92 -0.19
Morgan Schneiderlin 0.06 -0.40 -0.15 -0.43 -0.01
Stephan Lichtsteiner -0.10 0.40 -0.53 0.12 -0.40
Sergio Ramos -0.50 -0.05 0.09 -0.39 -0.28
Patrice Evra -0.19 0.21 -0.63 0.48 1.02
Claudio Marchisio -0.49 -0.20 -0.32 0.04 -0.65
Hector Bellerin -0.06 -0.52 0.04 -0.03 -0.37
Jesus Navas -0.40 -0.57 -0.67 -0.02 -0.51
Chris Smalling -0.03 -1.75 0.38 0.04 1.37
Mario Balotelli -0.44 0.07 -0.39 -0.14 0.96
Riyad Mahrez -0.21 -0.20 -0.07 0.42 -0.25

Personality profiles of football players playing in the Netherlands without word stemming after scaling.

Player name I II III IV V
Jasper Cillessen 1.04 -0.24 0.18 -0.50 -0.20
Luciano Narsingh -0.12 -0.57 -0.23 -0.26 -0.38
Joel Veltman 0.67 -0.35 -0.56 -1.00 0.49
Davy Klaassen -0.01 0.40 0.80 -0.44 -0.03
Hector Moreno -0.68 -0.06 -0.43 -0.58 -0.73
Luuk de Jong 0.40 0.49 0.16 0.14 -0.69
Ron Vlaar 0.68 0.14 -0.00 -1.17 0.00
Andres Guardado 0.17 -0.12 -0.07 -0.28 0.03
Dirk Kuijt 0.15 -0.28 -0.02 -0.47 -0.40
Ricardo van Rhijn -0.03 0.07 -0.32 -0.45 0.12
Arek Milik -0.30 1.11 -0.56 -0.17 0.26
Lasse Schone -0.17 -0.66 -0.69 -0.77 -2.06
Eljero Elia 0.32 -0.57 -0.33 -0.09 -0.09
Jeroen Zoet -0.51 0.57 0.42 -1.02 -0.15
Joshua Brenet -0.30 -0.41 0.80 -0.16 -0.59
Jeffrey Bruma 0.06 -0.02 -0.87 -0.17 -0.75
Karim El Ahmadi 0.28 -0.99 1.84 -0.61 0.08
Nemanja Gudelj -1.18 -0.16 -0.83 -0.68 -0.65
Jurgen Locadia 0.59 -0.05 0.10 0.16 -0.59
Ruben Schaken 0.45 0.92 -0.94 0.02 -0.58
Davy Propper 0.37 -0.40 -0.87 0.06 0.75
Bram van Polen -0.12 0.35 -0.85 -0.48 0.04
Kevin Diks 2.58 3.20 3.42 0.36 1.57
Sven van Beek -0.30 -1.45 -0.56 -1.69 0.75
Johnny Heitinga 0.62 -1.02 1.04 -0.12 -1.15
Renato Ibarra 1.45 0.42 -1.10 0.08 -1.15
Terence Kongolo -0.30 -0.39 0.56 -0.18 0.00
Guram Kashia 0.25 -0.74 0.29 0.03 -0.42
Luciano Slagveer -0.65 1.73 0.51 -0.00 -1.15
Anwar El Ghazi 0.82 -0.32 0.21 -0.73 -0.89
Maxime Lestienne 0.42 -0.80 -0.56 0.73 -0.46
Kenneth Vermeer -0.30 -0.57 -0.35 -0.44 -0.62
Joey van den Berg 0.31 -1.32 -0.07 0.36 -0.34
Thulani Serero 0.42 -1.51 -0.30 0.01 0.21
Navarone Foor 2.48 -1.09 -0.56 0.76 1.48
Nick Viergever -0.30 0.32 -0.93 -0.35 -0.31
Santiago Arias -1.06 0.09 0.02 0.27 -1.74
Valeri Qazaishvili -1.26 1.39 -0.56 -1.31 0.66
Jorrit Hendrix -0.97 2.07 0.06 0.25 0.75
Mike Havenaar 0.26 0.75 -0.56 -0.07 -0.62
Wout Brama -0.30 -0.10 0.32 1.08 1.57
Marco van Ginkel -0.88 0.19 0.14 -0.27 -0.06
Sam Larsson -0.30 0.61 0.47 -0.17 0.26
Jetro Willems -1.16 0.55 -1.94 -0.11 0.26
Riechedly Bazoer 0.14 0.06 -0.20 -0.46 -0.49
Anthony Limbombe -2.06 -0.57 -1.37 -0.35 1.06
Hakim Ziyech 1.23 -0.90 -1.41 -0.07 -0.06
Vincent Janssen 0.35 -0.14 0.16 -0.19 0.08
Edouard Duplan 0.15 0.61 0.26 -0.13 0.12
Hans Hateboer -0.01 -0.57 0.06 1.66 0.75
Jens Toornstra -0.01 -0.57 0.26 -0.47 -0.03
Mitchell Dijks -1.08 0.28 0.64 -0.68 0.49
Willem Janssen 0.60 -0.25 0.10 -0.89 1.57
Tonny Vilhena -0.30 -0.68 -0.12 -0.01 -0.50
Michiel Kramer 0.02 -0.85 -0.76 -0.41 -0.69
Nick van der Velden 1.49 0.61 -3.04 0.65 1.39
Guus Hupperts 3.94 2.67 2.37 0.70 1.86
Muamer Tankovic 0.80 2.13 1.13 0.86 -2.88
Jordens Peters 0.32 0.34 1.57 -0.70 0.16
Isaiah Brown -0.30 0.75 3.16 1.66 2.66
Kelvin Leerdam -1.05 2.04 0.26 -0.73 -1.15
Hedwiges Maduro 2.45 1.83 1.13 0.04 0.58
Eric Botteghin 0.51 -0.21 -0.40 -0.35 -2.30
Erwin Mulder -1.65 -0.57 -0.72 -0.10 0.12
Simon Tibbling -0.30 -3.27 0.87 1.19 1.78
Etienne Reijnen -0.30 -0.10 0.09 -0.71 0.85
Kamohelo Mokotjo 0.55 -0.32 0.81 0.39 1.67
Mark van der Maarel -0.74 -0.19 0.04 0.51 0.49
Mike van Duinen 0.27 -0.37 -0.56 0.48 -0.30
Martin Hansen -0.30 0.74 -2.18 0.52 -1.15
Ben Rienstra 0.50 1.33 0.92 0.18 0.37
Sebastien Haller 0.67 -2.38 0.92 0.45 0.37
Adil Auassar -0.30 -0.15 1.21 -1.31 0.66
Gaston Pereiro -0.46 0.15 0.15 0.01 -0.81
Jesper Drost -0.30 -0.40 -0.96 0.29 1.31
Thomas Bruns -1.19 0.47 -1.29 -0.95 -1.15
Joost van Aken -1.35 0.35 1.05 0.30 1.61
Jarchinio Antonia -2.16 0.25 -0.56 0.81 -0.66
Jan-Arie van der Heijden 0.62 -1.17 -0.56 -0.18 -1.15
Lucas Bijker -0.01 -0.03 0.76 6.83 1.57
Michael de Leeuw -0.30 -0.99 -0.15 1.88 0.97
Bryan Linssen -0.82 0.00 0.40 -1.00 -0.74
Robbert Schilder -2.68 -0.57 0.75 0.83 1.09
Sergio Padt 0.20 -0.87 -1.03 0.40 -1.15
Mark-Jan Fledderus -0.69 0.27 0.01 -0.53 -0.87


H.3 Part-of-Speech Tagging

Scaled personality profiles of football players not playing in the Netherlands using part-of-speech tagging.

Player name I II III IV V
Cristiano Ronaldo 0.29 0.01 -0.39 -0.45 -0.42
Lionel Messi 0.06 -0.44 -0.16 -0.45 -0.19
Luis Suarez -0.12 -0.49 -0.24 0.02 -0.35
Karim Benzema -0.20 0.29 -0.10 -1.05 -0.08
Juanfran -1.04 -0.15 -0.20 -0.62 -0.24
Thomas Muller 0.30 0.07 -0.03 -0.84 -0.45
Robert Lewandowski 0.34 -0.39 -0.22 0.08 -0.18
Diego Godin -0.59 0.44 -0.36 -0.59 -0.33
Manuel Neuer 0.23 0.75 -0.07 -0.49 -0.09
Sergio Busquets -0.36 -0.33 0.01 -0.23 0.24
Gianluigi Buffon 0.86 0.06 -0.21 0.42 -0.14
Toni Kroos 0.08 -0.25 -0.17 -0.64 -0.24
Neymar 0.70 -0.04 -0.06 -0.10 -0.19
Mesut Ozil 0.04 -0.28 -0.17 -0.33 -0.06
Philipp Lahm -0.00 -0.21 -0.23 -0.33 0.15
Leonardo Bonucci 0.32 -0.43 -0.17 -0.05 -0.56
Javier Mascherano 0.08 -0.44 -0.17 -0.46 -0.21
Dani Alves -0.85 0.24 -0.38 0.17 -0.33
Jose Fonte -0.40 -0.89 -0.15 0.31 -0.07
Pedro Rodriguez 0.23 -0.15 -0.01 -0.39 -0.43
Gerard Pique -0.02 0.21 -0.25 -0.49 -0.41
Yaya Toure -0.17 -0.71 -0.14 -0.31 -0.01
Toby Alderweireld -0.15 0.48 -0.04 -0.33 -0.03
Laurent Koscielny -0.91 -0.35 -0.31 0.11 -0.42
Alexis Sanchez -0.26 -0.68 -0.13 -0.42 -0.56
Francesc Fabregas -0.02 -0.78 0.21 -0.34 0.24
David Silva -0.35 -0.71 -0.30 -0.01 -0.24
Fernandinho -0.36 -0.34 -0.43 -0.62 -0.21
Angel di Maria 0.06 0.19 -0.20 -0.39 0.01
Giorgio Chiellini -0.33 -0.21 -0.47 -0.33 0.02
Thibaut Courtois -0.50 0.28 -0.10 0.13 -0.08
Petr Cech 0.39 -0.98 -0.07 0.14 -0.01
Joe Hart -0.69 0.27 -0.22 0.23 -0.31
Andres Iniesta 0.34 -0.30 -0.32 -0.13 -0.54
Gareth Bale -0.57 -0.35 -0.39 0.03 -0.31
Koke -0.53 0.14 -0.33 -0.39 -0.24
Ivan Rakitic 0.43 -0.60 -0.28 -0.57 -0.38
Jan Oblak 0.71 1.08 0.23 -0.22 0.41
Gabriel Fernandez 5.82 -0.23 1.52 4.28 -0.04
Arjen Robben -1.01 0.24 -0.29 -0.80 -0.57
Christian Eriksen -0.59 0.22 -0.48 -0.23 -0.41
Joao Miranda 1.79 6.38 1.79 -0.67 0.76
Nicolas Otamendi -0.80 0.02 -0.36 -0.38 -0.26
Sergio Aguero 0.18 -0.82 -0.52 -0.16 -0.36
Marcelo 0.65 -0.10 0.06 0.01 -0.36
Wayne Rooney 0.94 0.18 0.20 -0.05 -0.11
John Terry 0.49 -0.30 0.24 0.82 0.07
Jamie Vardy -0.10 0.44 -0.11 0.72 -0.59
Eden Hazard -0.12 0.02 -0.11 -0.09 -0.17
Filipe Luis -0.42 -0.11 -0.45 -0.31 -0.55
Jordi Alba 0.25 -0.28 -0.29 -0.27 -0.35
Luka Modric -0.15 0.20 0.05 -0.94 -0.19
Nemanja Matic -0.40 1.18 -0.40 0.01 -0.08
Sadio Mane -0.17 -0.95 -0.38 0.10 -0.26
David Alaba -1.01 -0.02 -0.32 -0.43 -0.30
Jerome Boateng -0.31 -0.65 -0.26 -0.53 -0.28
Zlatan Ibrahimovic -0.18 0.24 -0.18 -0.15 -0.08
Jan Vertonghen -0.89 0.63 -0.25 -0.43 0.03
Mario Gotze -0.26 -0.46 -0.30 -0.33 -0.16
Arturo Vidal 0.33 0.20 -0.21 -0.16 -0.29
Fraser Forster -0.35 0.88 -0.18 -0.07 -0.24
Willian Borges Da Silva -0.40 1.39 3.24 6.26 1.95
Daniel Carvajal 0.79 2.66 0.71 -0.67 2.56
Branislav Ivanovic -0.67 0.35 -0.42 -0.38 0.03
David De Gea 0.30 -0.50 -0.10 0.01 -0.38
Maxwell 0.91 -0.16 -0.24 0.55 -0.21
Isco -0.48 -0.75 -0.28 -0.18 0.12
Rafinha -0.23 -0.43 -0.23 -0.20 -0.43
Gonzalo Higuain -0.43 -0.66 -0.15 0.12 -0.09
Nacho Monreal -0.40 -0.61 -0.19 -0.42 0.51
Per Mertesacker -0.15 -0.62 -0.12 0.52 -0.13
Hugo Lloris -0.24 -0.51 -0.18 -0.14 -0.23
Keilor Navas -4.08 -0.98 8.41 4.61 7.40
Jose Gimenez -0.40 3.21 1.13 0.06 0.20
James Rodriguez 0.19 -0.10 -0.21 -0.27 -0.37
Cesar Azpilicueta -0.03 0.13 -0.40 -0.39 -0.36
Bastian Schweinsteiger 0.01 0.51 -0.06 -0.40 -0.13
Claudio Bravo -0.15 -0.20 -0.20 -0.37 -0.34
Romelu Lukaku 0.40 -0.73 -0.32 -0.10 -0.12
Paul Pogba 0.42 -0.69 -0.16 0.21 -0.24
Santi Cazorla -0.01 -0.22 -0.26 0.48 -0.26
Shinji Kagawa 1.38 -0.11 -0.17 -0.02 0.16
Henrikh Mkhitaryan 0.22 -0.59 -0.24 -0.23 0.17
Thiago Silva -0.37 0.32 -0.25 0.35 -0.33
Juan Mata 0.32 -0.54 0.03 -0.32 0.07
Aaron Ramsey 0.51 -0.70 -0.25 -0.35 -0.22
Gareth Barry -0.76 -0.33 -0.14 -0.48 0.25
Vincent Kompany -0.03 -1.97 -0.25 0.60 -0.72
Morgan Schneiderlin 0.46 0.57 -0.09 -0.29 -0.44
Stephan Lichtsteiner -0.33 0.02 -0.40 0.36 0.01
Sergio Ramos -0.69 0.02 -0.03 -0.12 -0.43
Patrice Evra 0.04 0.12 -0.37 0.34 0.34
Claudio Marchisio -0.92 -0.21 -0.36 -0.35 -0.35
Hector Bellerin -0.05 -0.53 -0.09 -0.49 0.05
Jesus Navas -0.15 -0.88 -0.38 -0.38 -0.43
Oscar dos Santos 4.67 3.45 1.86 2.05 4.87
Chris Smalling -0.18 0.18 -0.04 -0.20 0.09
Mario Balotelli 0.08 -0.23 -0.29 -0.01 -0.05
Riyad Mahrez -0.32 -0.61 -0.21 0.21 -0.47


Scaled personality profiles of football players playing in the Netherlands using part-of-speech tagging.

Player name I II III IV V
Jasper Cillessen 1.49 -0.20 0.17 -0.29 0.20
Luciano Narsingh -0.96 -1.00 -0.12 -0.31 -0.10
Joel Veltman -0.53 0.33 -1.09 -0.42 0.10
Davy Klaassen 0.34 0.42 0.57 -0.05 0.53
Hector Moreno -1.72 -0.26 -0.99 -1.33 0.51
Luuk de Jong 0.69 0.66 0.21 -0.10 -0.21
Ron Vlaar 0.45 1.16 -0.44 -0.13 0.54
Andres Guardado -0.28 0.71 -0.88 -0.52 0.77
Dirk Kuijt 0.36 -0.61 -0.26 -0.75 -0.12
Ricardo van Rhijn -0.23 0.00 -0.33 -0.40 -1.31
Arek Milik -1.12 0.02 0.18 0.25 0.34
Lasse Schone -0.64 -0.87 -0.84 -0.46 -1.54
Eljero Elia -0.09 -0.25 -0.95 -0.16 -0.12
Jeroen Zoet -0.44 -0.42 0.38 -0.51 0.44
Joshua Brenet -0.31 -0.29 1.04 -0.08 0.32
Jeffrey Bruma 0.25 0.21 -0.18 -0.31 0.97
Karim El Ahmadi 0.01 -0.78 0.01 -0.94 0.05
Nemanja Gudelj -0.53 -0.18 0.46 -0.41 0.39
Jurgen Locadia -0.53 -1.21 0.31 -0.07 -0.40
Ruben Schaken 0.24 0.59 -1.09 -0.43 0.60
Davy Propper 0.68 0.39 -0.58 -0.37 0.44
Marko Vejinovic -0.93 0.23 -0.43 -0.00 -0.38
Bram van Polen -0.00 -0.40 -0.55 1.33 -0.25
Kevin Diks 0.76 -0.19 2.58 0.07 -0.15
Sven van Beek -0.53 -0.78 -0.06 -0.08 0.27
Johnny Heitinga 0.59 -1.26 0.71 -0.05 0.33
Renato Ibarra -0.16 -0.78 -0.55 0.38 -0.21
Terence Kongolo -0.25 -0.37 0.18 -0.52 -0.38
Guram Kashia -0.26 -0.52 0.05 -0.40 0.55
Luciano Slagveer -1.69 0.69 0.03 0.11 -0.09
Anwar El Ghazi 0.42 -0.11 -0.11 -0.58 -0.52
Markus Henriksen 2.09 0.51 -0.24 -0.87 -6.69
Dabney dos Santos 1.15 1.17 -0.64 -0.60 -2.19
Maxime Lestienne -1.18 -0.96 0.01 0.04 0.92
Kenneth Vermeer 0.73 -0.26 -0.37 -0.15 0.21
Joey van den Berg 0.48 -0.49 -0.56 0.35 0.90
Thulani Serero 0.32 -1.20 -0.40 -0.34 0.13
Navarone Foor 1.07 -2.35 0.21 0.73 1.28
Nick Viergever 0.53 0.57 -0.25 -0.62 -1.04
Santiago Arias -1.02 -0.24 -0.69 0.50 -0.42
Valeri Qazaishvili 0.76 -0.11 -1.09 -0.40 0.01
Jorrit Hendrix -1.16 0.23 -1.09 -0.37 -0.11
Felipe Gutierrez 0.01 -1.74 2.23 -0.11 1.01
Mike Havenaar 1.31 0.38 -0.13 0.05 0.24
Wout Brama 0.94 -0.12 -1.09 0.66 1.06
Marco van Ginkel -0.69 -0.27 -0.31 -0.01 -0.01
Sam Larsson -1.68 0.22 -0.27 0.34 0.31
Jetro Willems -0.64 -0.09 0.11 -0.02 -1.10
Thomas Kristensen -1.80 5.49 -0.12 1.79 -0.18
Riechedly Bazoer 0.11 0.06 -0.46 -0.64 0.61
Anthony Limbombe 1.39 -1.66 -1.09 -0.87 -0.14
Hakim Ziyech -0.79 0.00 -0.51 0.33 0.27
Vincent Janssen 0.62 -0.04 0.49 -0.48 -0.57
Edouard Duplan 0.16 0.58 0.74 -0.74 0.72
Iliass Bel Hassani -1.75 0.40 -0.64 0.36 0.01
Hans Hateboer 0.21 -0.12 -0.27 -0.19 -0.26
Jens Toornstra 0.03 0.53 0.07 -0.81 0.38
Mitchell Dijks -0.74 0.03 0.05 0.17 0.36
Wout Droste 2.93 2.06 5.97 0.94 -0.12
Willem Janssen 0.39 -0.78 -0.02 -0.43 0.82
Brandley Kuwas 1.63 0.64 -1.09 -0.26 -1.25
Tonny Vilhena 0.30 -0.42 -0.64 -0.51 0.13
Michiel Kramer -0.53 -1.02 -0.53 -0.25 0.44
Nick van der Velden 0.34 -0.78 1.40 0.62 0.24
Rochdi Achenteh -0.19 0.08 -0.15 0.17 -0.53
Guus Hupperts 1.82 0.57 0.92 -0.29 -0.63
Muamer Tankovic -0.53 1.01 0.15 -0.36 -0.99
Jordens Peters 2.61 -0.78 0.35 -1.74 0.44
Isaiah Brown 2.00 0.77 3.20 6.15 1.27
Kelvin Leerdam -1.99 1.28 -0.89 0.85 -0.20
Hedwiges Maduro 2.31 0.94 0.15 -0.24 -1.10
Eric Botteghin -0.35 0.40 -0.11 -0.10 -1.02
Erwin Mulder -0.34 -0.68 -0.15 -0.45 -0.10
Simon Tibbling 1.22 -1.47 1.58 -0.14 2.50
Etienne Reijnen -1.19 -0.43 -0.29 -0.35 -0.33
Kamohelo Mokotjo 0.22 -0.78 1.33 -0.05 0.23
Mark van der Maarel -1.08 0.05 -0.10 1.25 -0.83
Mike van Duinen -0.13 -0.16 -0.73 0.16 0.56
Martin Hansen -0.81 -0.01 0.41 0.77 -0.02
Ben Rienstra -0.20 0.96 -0.05 0.17 0.58
Sebastien Haller 0.23 -1.37 0.06 -0.47 0.44
Adil Auassar -0.13 1.29 0.63 -0.40 -0.29
Gaston Pereiro -0.85 -0.49 -0.39 -0.07 -0.56
Jesper Drost 0.82 0.62 -1.09 0.12 0.91
Thomas Bruns 0.14 0.49 -1.09 -0.87 0.44
Joost van Aken -1.63 1.39 0.48 0.21 1.25
Jarchinio Antonia -0.17 1.05 -0.49 1.36 0.13
Jan-Arie van der Heijden 0.28 -0.00 -0.44 -0.55 -0.95
Lucas Bijker -0.53 -0.78 0.15 4.85 1.56
Michael de Leeuw -0.30 2.22 -0.33 -0.09 0.26
Bryan Linssen -0.14 0.01 -0.61 -0.98 0.11
Robbert Schilder -0.53 -2.17 1.67 0.07 0.00
Sergio Padt 0.75 -0.21 -0.04 -0.20 -1.50
Bilal Basacikoglu -1.85 0.71 0.35 1.35 0.03
Mark-Jan Fledderus -0.09 0.28 -0.66 -0.77 0.81

Appendix I

Twitter Analysis Results

In the thesis, the results of the Twitter analysis are presented for some of the players. In this appendix, the results are provided for the other players.

I.1 Regular Expressions

This section provides the hashtags and mentions that were extracted from the players' tweets by applying regular expressions. Note that some players have their own user name as the most frequently used mention. This means that the player has retweeted many tweets in which he himself was mentioned.
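The extraction described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the thesis: the patterns `#\w+` and `@\w+`, the lowercasing step, the alphabetical tie-breaking, and the function name `top_hashtags_and_mentions` are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumed patterns: a '#' or '@' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def top_hashtags_and_mentions(tweets, n=20):
    """Return the n most frequent hashtags and mentions as (token, count) lists."""
    hashtags = Counter()
    mentions = Counter()
    for tweet in tweets:
        text = tweet.lower()  # assumption: counting is case-insensitive
        hashtags.update(HASHTAG_RE.findall(text))
        mentions.update(MENTION_RE.findall(text))

    def ranked(counter):
        # Sort by descending frequency, then alphabetically for ties.
        return sorted(counter.items(), key=lambda item: (-item[1], item[0]))[:n]

    return ranked(hashtags), ranked(mentions)


if __name__ == "__main__":
    example_tweets = [
        "Great win! #COYS #thfc @SpursOfficial",
        "RT @SpursOfficial: well played @ChrisEriksen8 #coys",
    ]
    tags, users = top_hashtags_and_mentions(example_tweets)
    print(tags)
    print(users)
```

A retweet keeps the `@` mentions of the original tweet, which is why a player who retweets tweets about himself ends up with his own user name at the top of the mentions list.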

Twenty most used hashtags and mentions of player Mario Balotelli.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #foreverfaster  4  @finallymario  39
2  #puma  3  @barwuahenock1  32
3  #startbelieving  3  @officialel92  26
4  #evopower  2  @ducieduce  25
5  #respect  2  @kpbofficial  20
6  #13  1  @drake  14
7  #acmilan  1  @askvianney  12
8  #bane  1  @officialmonto  12
9  #barca  1  @_igna20  11
10  #batman  1  @mario_falcone  9
11  #beast  1  @pumafootball  9
12  #bestvine  1  @realemiskilla  9
13  #bottomtothetopconcert  1  @kobebryant  8
14  #choosepower  1  @njr92  8
15  #dance  1  @gallinari8888  7
16  #doitdorthevine  1  @miss1897  7
17  #evoaccuracy  1  @officialniang  7
18  #fastergraph  1  @ndj_official  6
19  #forzabalo  1  @pato7oficial  6
20  #forzamilan  1  @riomusic10  6

Twenty most used hashtags and mentions of player Christian Eriksen.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #coys  60  @chriseriksen8  62
2  #ajax  31  @siemdejong  38
3  #nike  16  @blinddaley  27
4  #23  10  @mzanka  27
5  #denmark  9  @spursofficial  21
6  #3points  8  @gvanderwiel  19
7  #girlfriend  7  @rasmusfalk9  18
8  #golazo  6  @tsorensen1  17
9  #magista  6  @jan_vertonghen  14
10  #thfc  6  @afcajax  9
11  #  5  @nickibille14  8
12  #spurs  5  @ryanbabel  7
13  #afca  4  @ksigthorsson  6
14  #endelafnogetst  4  @mikkelhansen24  6
15  #gs2squad  4  @madsrroslind  5
16  #thfctour  4  @nicoboilesen  5
17  #amsterdam  3  @coco_lamela  4
18  #emh  3  @dbufodbold  4
19  #europaleague  3  @europaleague  4
20  #family  3  @spillerforening  4

Twenty most used hashtags and mentions of player Riyad Mahrez.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #lcfc  278  @mahrez22  2194
2  #teamdz  165  @ilfenomenoy2  168
3  #teamalgerie  108  @boucherzacharie  114
4  #mahrez  91  @officialfoxes  91
5  #teamfennecs  75  @aknockaert  90
6  #teamhac  66  @ludovicancel  81
7  #teamelkhedra  45  @mickaellebihan  79
8  #alg  42  @gozarzargo  65
9  #leicester  36  @rom_saiss26  62
10  #hac  29  @nabil_djellit  60
11  #algerie  18  @elkhedra  52
12  #havre  16  @manzala_7  51
13  #can2015  15  @distelzola  49
14  #bpl  13  @djoriv  47
15  #mesloub  13  @lcfc  47
16  #sarcelles  13  @ketket95  42
17  #algeria  12  @moussa_z_life  41
18  #lcfcfamily  12  @diimsou  39
19  #teammahrez  12  @suncompoti972  39
20  #leicestercity  11  @florianpinteaux  36

Twenty most used hashtags and mentions of player Cristiano Ronaldo.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #celebrate15m  75  @cristiano  421
2  #weare20million  56  @realmadrid  55
3  #halamadrid  37  @nikefootball  51
4  #vivaronaldo  36  @gamebyronaldo  39
5  #cr7underwear  31  @vivaronaldo  31
6  #cr7emag  24  @ochocinco  23
7  #for  23  @headsup  16
8  #mercurial  21  @mystarautograph  14
9  #mymercurials  20  @savethechildren  14
10  #worldcup  18  @tagheuer  12
11  #portugal  17  @ronaldofilm  11
12  #por  16  @apoyocr7  10
13  #dontcrackunderpressure  14  @emiliegranb  10
14  #cr7shirts  11  @daniel_nilsen  9
15  #crpenaltygame  11  @herbalife  9
16  #ff  11  @jensonbutton  9
17  #realmadrid  10  @officialpes  8
18  #cr7  9  @realkaka  8
19  #galaxy11  9  @sacoorbrosgl  8
20  #cr7footwear  8  @twitchange  8

Twenty most used hashtags and mentions of player Mark-Jan Fledderus.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #rodajc  69  @fleddie  129
2  #forzaroda  40  @rodajckerkrade  102
3  #eredivisie  22  @kim_wendy  29
4  #heracles  20  @goossensrob  22
5  #41  16  @heraclesalmelo  21
6  #koempels  8  @willieovertoom  16
7  #almelo  6  @heraclestweet  13
8  #fcgroningen  6  @pietvandijken  12
9  #knvb  6  @dannyholla  9
10  #trainingskamp  6  @markvanrijswijk  9
11  #trots  6  @dannyhesp  7
12  #heerod  5  @joshooiveld20  7
13  #nacrod  5  @medisananl  7
14  #rodajckerkrade  5  @mvandenbunder  7
15  #eredivisielive  4  @mydiaryofstyle  7
16  #knvbbeker  4  @vi_nl  7
17  #marrakech  4  @fcgroningen  6
18  #rodnac  4  @ruudjevormer8  6
19  #adorod  3  @alexkroesseg  5
20  #arjenrobben  3  @brandbierunited  5

Twenty most used hashtags and mentions of player Michiel Kramer.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #feyenoord  20  @kramerinho  323
2  #ado  10  @raafrob  47
3  #camfey  6  @feyenoord  32
4  #cavsvraptors  6  @nba  22
5  #fcafkicken  6  @footballmaties  21
6  #kramer  5  @kjansen7  21
7  #feypec  4  @raemonsluiter  16
8  #nbavine  4  @adodenhaag  12
9  #rapaja  4  @fcafkicken  12
10  #adodenhaag  3  @edgroven  7
11  #syndroom  3  @eljero11elia  7
12  #adofey  2  @grandstandfoot  6
13  #adopsv  2  @karlijn_b  6
14  #adoselfiesessie  2  @kuyt  6
15  #ambassadeur  2  @martinhansen90  6
16  #bigz  2  @bram070  5
17  #denhaag  2  @johnvanzweden  5
18  #familie  2  @mkrabby  5
19  #feyaja  2  @srhamme  5
20  #feyaz  2  @weelhaar01  5

Twenty most used hashtags and mentions of player Dirk Kuijt.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #feyenoord  24  @dkuytfoundation  38
2  #legioen  12  @dirk_18_kuyt  28
3  #fenerbah  9  @dirkkuytfoundation  13
4  #ynwa  9  @kuyt  9
5  #supercup  6  @feyenoord  7
6  #fener  5  @sportstad  7
7  #training  5  @intrmzzo  4
8  #dkfrunningteam  4  @lucasleiva87  4
9  #jft96  4  @sneijder101010  4
10  #eenvriendendienst  3  @538  3
11  #lfc  3  @barbarabarend  3
12  #1e  2  @fenerbahce  3
13  #dkfrunningteam2015  2  @heldenmagazine  3
14  #dutchteam  2  @jamiewestland  3
15  #fenerbahce  2  @marathonrdam  3
16  #feyaz  2  @rdsportopmaat  3
17  #friendlymatch  2  @aideyman  2
18  #knvbbeker  2  @ballondecoratie  2
19  #nbaglobalgames  2  @castro_maniac  2
20  #omroepwest  2  @di_rectmusic  2

Twenty most used hashtags and mentions of player Ron Vlaar.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #utv  37  @ronvlaar4  88
2  #avfc  33  @raemonsluiter  19
3  #feyenoord  30  @spierenvspieren  15
4  #avfclive  19  @tiltmonkey  13
5  #yojana  11  @acornshospice  12
6  #oranje  8  @avfcofficial  10
7  #legioen  7  @bguzan  9
8  #sterkerdoorstrijd  7  @janverhaas  9
9  #  6  @misakrajicek  9
10  #seriousrequest  6  @savinglivesuk  9
11  #believe  5  @superguidetti  9
12  #brazil  5  @jochemmyjer  8
13  #spierathlon  5  @azalkmaar  6
14  #stefandevrij  5  @heldenmagazine  6
15  #az  4  @lisanvlaar  6
16  #bedankt  4  @thiemodebakker  6
17  #cupfighters  4  @22gards  4
18  #hivtestweek  4  @bouke_deboer  4
19  #thanks  4  @darrenbent  4
20  #villapark  4  @fatimamdm  4

Twenty most used hashtags and mentions of player Jetro Willems.

Rank  Hashtags  Frequency  Mentions  Frequency
1  #14  2  @jetrowillems_15  10
2  #eendrachtmaaktmacht  2  @gwijnaldum  6
3  #fifa16  2  @psv  4
4  #fut  2  @spartarotterdam  4
5  #newliv  2  @luukdejong9  3
6  #psvhee  2  @memphis_depay22  3
7  #psvpec  2  @fbarend  2
8  #totw  2  @memphisdepay  2
9  #vitpsv  2  @mohoudoe  2
10  #amsterdam  1  @optajohan  2
11  #atlpsv  1  @axednl  1
12  #bobmarleyquotes  1  @ea_benelux  1
13  #broertje  1  @edsiliarombley  1
14  #dbovvs  1  @funx  1
15  #easports  1  @jamailoman  1
16  #foxsport  1  @jetrowill  1
17  #herpsv  1  @jorgenraymann  1
18  #higler  1  @  1
19  #johancruijff  1  @kj_huntelaar  1
20  #mumid  1  @memphis  1

I.2 Word Clouds

One of the word clouds was discussed in the thesis; the other nine are provided in this section. Note that some of them contain only a few words, because few mentions of those players were available.

Word cloud constructed from Mario Balotelli mentions.

Word cloud constructed from Christian Eriksen mentions.

Word cloud constructed from Riyad Mahrez mentions.

Word cloud constructed from Cristiano Ronaldo mentions.

Word cloud constructed from Mark-Jan Fledderus mentions.

Word cloud constructed from Michiel Kramer mentions.

Word cloud constructed from Dirk Kuijt mentions.

Word cloud constructed from Ron Vlaar mentions.

Word cloud constructed from Jetro Willems mentions.

I.3 Sentiment Analyses

Finally, this section provides the plots of the sentiment analyses for the players that were not discussed in the thesis. Recall that the size of each bubble represents the number of likes a tweet received, the color represents the number of retweets, and the position of the center indicates the polarity and the subjectivity of the tweet.

Sentiment analysis on Mario Balotelli mentions.

Sentiment analysis on Christian Eriksen mentions.

Sentiment analysis on Zlatan Ibrahimović mentions.

Sentiment analysis on Cristiano Ronaldo mentions.

Sentiment analysis on Michiel Kramer mentions.

Sentiment analysis on Dirk Kuijt mentions.

Sentiment analysis on Ron Vlaar mentions.

Sentiment analysis on Jetro Willems mentions.

Appendix J

News Article for Manual Analysis

Below, one of the articles from the text corpus is provided. This article was retrieved from the NOS source when a search request for football player Dirk Kuijt was performed. The text is copied from the original web page and is identical to the version stored in the database created during this project.

URL: http://nos.nl/artikel/666202-dirk-kuijt-de-mentor-van-oranje.html
Accessed on: 29/06/2016
Title: Dirk Kuijt, de mentor van Oranje
Date and time: 25/06/2014, 11:26
Adjusted: 25/06/2014, 11:35
Category: Football
Text:

We zaten er gistermiddag allemaal weer. De nationale en internationale pers neemt tijdens dit WK een dag na de wedstrijd plaats in de perszaal op het trainingscomplex van Flamengo. In afwachting van twee spelers van Oranje die een half uur na de hersteltraining zullen aanschuiven. Tijdens dat half uur zitten wij keurig op de campingstoeltjes te wachten terwijl achter een schermpje in de zaal de spelers nog wat oefeningen doen. Deze trainingsarbeid wordt begeleid door muziek die wij goed kunnen horen. Vandaag konden wij meeneuriën met de klanken van Toto, maar vorige week galmde André Hazes door de zaal. Je zou misschien vermoeden dat de dj van die dag was maar het was heel iemand anders. Uiteindelijk schoof er een mooi koppel aan in de perszaal. Dirk Kuijt en Memphis Depay. De eerste hoopt tijdens zijn laatste WK nog aan zijn honderdste interland toe te komen. De tweede heeft inmiddels al acht kruisjes achter zijn naam staan en mag pronken met twee doelpunten gemaakt op een wereldkampioenschap. De aandacht van de media is voor Dirk Kuijt gesneden koek. De Katwijker zit nooit om een woordje verlegen en voelt zich in persgesprekken als een vis in het water. Het was prachtig om te zien hoe hij jonkie Depay in de gaten hield toen die onder vuur werd genomen door mijn collega Bert Maalderink. Dirk bleef in de buurt van de microfoon om, indien noodzakelijk, in te grijpen. Maar dat bleek helemaal niet nodig. Dirk Kuijt keek met steeds meer ontspanning en met een grote glimlach naar het optreden van de jonge PSV'er. Als een mentor die zag dat het goed was. Die rol van groepsoudste leek ook de reden voor de bondscoach om de inmiddels 34-jarige Kuijt mee te nemen naar het WK. Leek logisch. Maar had Dirk Kuijt op zijn oude dag ook nog onderworpen aan omscholing en posteerde de routinier als linksback tegen Chili. Daar was Kuijt een net zo'n grote verrassing als Memphis Depay tijdens dit WK. De persconferentie was, hoewel soms defensief, zeer vermakelijk. Toen het over de gouden hand van wisselen van de bondscoach ging, kwam Memphis Depay tot een paar mooie woorden over het inzicht van de bondscoach. Je zag Dirk Kuijt lachen. Was de vraag hem gesteld, dan was hij nog eens over de, een dag eerder, veel besproken gouden pik van de bondscoach begonnen. En dan nog even dit. De muziek van André Hazes die wij vorige week in de perszaal tot ons kregen, bleek uitgezocht door . Oranje verrast ons dit toernooi zowel binnen als buiten het veld.

Appendix K

Big Five Inventory in Dutch

This appendix contains the questionnaire that was administered. It is the Dutch translation of the BFI created by Denissen et al. (2008), slightly adjusted to make it applicable to this particular case. The questionnaire was exactly the same for all four players, with the exception of the player's name. In the questionnaire below, the word playername marks each place where the name of the specific player was stated.

Persoonlijkheidstest playername

Hieronder vind je een lijst met karakteristieken die volgens jou wel of niet van toepassing zijn voor de voetballer playername. Je hoeft de speler niet persoonlijk te kennen om de vragen te beantwoorden. Je kunt gewoon antwoord geven op basis van hoe deze speler op jou overkomt in de media. Geef voor iedere eigenschap aan in hoeverre jij denkt dat deze van toepassing is op playername, in vergelijking tot andere voetballers, op een schaal van 1 tot 5. De getallen 1 tot en met 5 staan voor:

1. Volledig mee oneens
2. Een beetje mee oneens
3. Niet eens en niet oneens
4. Een beetje mee eens
5. Volledig mee eens

Ik zie playername, in vergelijking tot andere voetballers, als iemand die...

1. Spraakzaam is
2. Geneigd is kritiek te hebben op anderen
3. Grondig te werk gaat
4. Somber is
5. Origineel is, met nieuwe ideeën komt
6. Terughoudend is
7. Behulpzaam en onzelfzuchtig ten opzichte van anderen is
8. Een beetje nonchalant kan zijn
9. Ontspannen is, goed met stress kan omgaan
10. Benieuwd is naar veel verschillende dingen
11. Vol energie is
12. Snel ruzie maakt
13. Een werker is waar men van op aan kan
14. Gespannen kan zijn
15. Scherpzinnig, een denker is
16. Veel enthousiasme opwekt
17. Vergevingsgezind is
18. Doorgaans geneigd is tot slordigheid
19. Zich veel zorgen maakt
20. Een levendige fantasie heeft
21. Doorgaans stil is
22. Mensen over het algemeen vertrouwt
23. Geneigd is lui te zijn
24. Emotioneel stabiel is, niet gemakkelijk overstuur raakt
25. Vindingrijk is
26. Voor zichzelf opkomt
27. Koud en afstandelijk kan zijn
28. Volhoudt tot de taak af is
29. Humeurig kan zijn
30. Waarde hecht aan kunstzinnige ervaringen
31. Soms verlegen, geremd is
32. Attent en aardig is voor bijna iedereen
33. Dingen efficiënt doet
34. Kalm blijft in gespannen situaties
35. Een voorkeur heeft voor werk dat routine is
36. Hartelijk, een gezelschapsmens is
37. Soms grof tegen anderen is
38. Plannen maakt en deze doorzet
39. Gemakkelijk zenuwachtig wordt
40. Graag nadenkt, met ideeën speelt
41. Weinig interesse voor kunst heeft
42. Graag samenwerkt met anderen
43. Gemakkelijk afgeleid is
44. Het fijne weet van kunst, muziek, of literatuur

BFI Scoring

BFI scale scoring (John & Srivastava, 1999):

Extraversion: 1, 6R, 11, 16, 21R, 26, 31R, 36
Agreeableness: 2R, 7, 12R, 17, 22, 27R, 32, 37R, 42
Conscientiousness: 3, 8R, 13, 18R, 23R, 28, 33, 38, 43R
Neuroticism: 4, 9R, 14, 19, 24R, 29, 34R, 39
Openness: 5, 10, 15, 20, 25, 30, 35R, 40, 41R, 44

An R denotes a reverse-scored item, which means that the question is scored the other way around. For those questions, a 1 is scored as a 5, a 2 is scored as a 4, a 3 stays a 3, a 4 is scored as a 2, and a 5 is scored as a 1. In other words, the reversed score equals 6 minus the given answer.
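The scoring key and the reversal rule above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text gives only the item key and the reversal rule, so averaging the items of a scale into one score is an assumption made here, and the names `BFI_KEY`, `reverse_score` and `score_bfi` are hypothetical.

```python
from statistics import mean

# Scoring key as listed above: (regular items, reverse-scored items) per scale.
BFI_KEY = {
    "Extraversion": ([1, 11, 16, 26, 36], [6, 21, 31]),
    "Agreeableness": ([7, 17, 22, 32, 42], [2, 12, 27, 37]),
    "Conscientiousness": ([3, 13, 28, 33, 38], [8, 18, 23, 43]),
    "Neuroticism": ([4, 14, 19, 29, 39], [9, 24, 34]),
    "Openness": ([5, 10, 15, 20, 25, 30, 40, 44], [35, 41]),
}


def reverse_score(answer):
    """Reverse a 1-5 answer: 1 <-> 5, 2 <-> 4, 3 stays 3."""
    if answer not in (1, 2, 3, 4, 5):
        raise ValueError("answer must be an integer from 1 to 5")
    return 6 - answer


def score_bfi(answers):
    """Score one completed questionnaire.

    `answers` maps the question number (1-44) to the given answer (1-5).
    Assumption: a scale score is the mean of its items after reversal.
    """
    scores = {}
    for scale, (regular, reversed_items) in BFI_KEY.items():
        values = [answers[q] for q in regular]
        values += [reverse_score(answers[q]) for q in reversed_items]
        scores[scale] = mean(values)
    return scores


if __name__ == "__main__":
    # A respondent who answers 5 to every question.
    all_fives = {q: 5 for q in range(1, 45)}
    print(score_bfi(all_fives))
```

Because of the reverse-scored items, a respondent who gives the same extreme answer to every question does not receive an extreme score on any scale.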

Appendix L

Questionnaire Results

In this appendix, the responses to the questionnaire are presented. The number of responses for each player is as follows:

• Mario Balotelli: 25 responses
• Lionel Messi: 17 responses
• Luuk de Jong: 38 responses
• Dirk Kuijt: 25 responses

For each player, a table is provided that contains all raw responses to the questionnaire. Each table consists of 44 rows, each row representing one question of the BFI. The row numbers correspond to the question numbers in Appendix K. Each column represents one response to the questionnaire, so each digit in a row is the answer of one respondent.
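To illustrate the table layout, the sketch below reads such a table into Python. It is not part of the original analysis: the function names `parse_response_table` and `per_respondent` are hypothetical, and the parser assumes that every numeric token pair consists of a question number followed by a string of single-digit answers.

```python
from statistics import mean


def parse_response_table(text):
    """Map each question number to the list of answers (one per respondent).

    Non-numeric tokens such as the column header "Q" are ignored, so the
    table may be given as one row per line or as one long line.
    """
    tokens = [token for token in text.split() if token.isdigit()]
    table = {}
    for question, digits in zip(tokens[0::2], tokens[1::2]):
        table[int(question)] = [int(d) for d in digits]
    return table


def per_respondent(table):
    """Transpose the table: one {question: answer} dict per respondent."""
    n_respondents = len(next(iter(table.values())))
    return [
        {question: answers[i] for question, answers in table.items()}
        for i in range(n_respondents)
    ]


if __name__ == "__main__":
    # First two rows of the Lionel Messi table (17 respondents).
    sample = """Q
    1 22222112232112122
    2 11223123242233442"""
    table = parse_response_table(sample)
    print(len(table[1]), "respondents")
    print("mean answer to question 1:", mean(table[1]))
    print("first respondent:", per_respondent(table)[0])
```

Transposing the table yields one answer set per respondent, which is the form needed to apply the BFI scoring key of Appendix K to each response.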

Responses to the questionnaire considering Mario Balotelli.

Q  Responses
1  3242442224234251242451555
2  4445554555445344454554552
3  3121222322121123121121121
4  3324412233322322224214333
5  3221223254245525145441543
6  3311212111321231211212122
7  2121212121131211121211431
8  5555554451445555555555445
9  2131334544444545234444133
10  3335445241423525245312432
11  3435351423443345344342531
12  3545454555445555555555443
13  1111212221121121211121231
14  2535241523222112333254542
15  3111211131121211121211131
16  3321342422241315232341521
17  3221313322241313331321223
18  2455454354545454445355534
19  3331221521242222212124521
20  3545434445444555355541545
21  4311212421422315431144123
22  3223313433242323222222242
23  4555455245445555544555445
24  5121314341222225111311122
25  3343333242243545243441441
26  4455454554445455454554555
27  5443442433424455434355225
28  3231213121221222211111431
29  5543452435424455345255545
30  1115223333221335131111333
31  1121221122221315221124222
32  3121412222242322331311223
33  4221214324141224221241434
34  4131444553343545153541144
35  3241213222412223331251122
36  2233414222342444342411422
37  3545454455445445445555444
38  3331324522442233222241242
39  1223222142222111412122523
40  3121322341221325131222441
41  5555454334545545535554313
42  3231323521232225222211222
43  1455425555545555455554545
44  2111312332121221131111233

Responses to the questionnaire considering Lionel Messi.

Q  Responses
1  22222112232112122
2  11223123242233442
3  55345434444455421
4  33423423534224214
5  45255545555454521
6  34244442541454115
7  44344554534444523
8  22343354451434343
9  54355545445455554
10  34322434445332423
11  43245524555354454
12  13222111132111321
13  43443323213344255
14  24332224214213323
15  55435444515455551
16  53255445454455454
17  53334453524455443
18  22331241421242232
19  23332423111243121
20  53335435553325551
21  55554544553554445
22  43324444543444433
23  23432253422341411
24  32344544552455452
25  45445555544454554
26  43345425223344442
27  43434233451424334
28  43334424443455343
29  24344424313224422
30  53342434442223531
31  55352544442454415
32  53343443533445423
33  54355433555454424
34  54345545545454554
35  41331321225252242
36  53343423523354332
37  21313111242221334
38  45335434555344433
39  33311121121221213
40  54335444545445532
41  23333332235443225
42  54353455544454334
43  43321232321312323
44  43323332331123221

Responses to the questionnaire considering Luuk de Jong.

Q  Responses
1  44444333434444234445444444434444442544
2  43233222214334244224332432324222221224
3  34443444444453433435444444544445444555
4  22322222211222122314212223344222343124
5  43242433323223324322323232222422333332
6  32322343221222442323223224353323335223
7  44444554455344425555344344454444525533
8  33322313212344242242445412212324123254
9  44445443444432445455544434443454444434
10  33343324333333322434423333443432332423
11  44244454234442432435433344333442541534
12  22222211411224242221441212312212331221
13  45454554555443524554444444554445545435
14  33233333232344242322333332343344333444
15  33243444344432233324323233454323413424
16  44344523245443442343322244432424332543
17  44244444254442424444334344354344435425
18  32222222222233422222223442212325233241
19  33232334234224232312123323335222434154
20  32233224223233323222332232223422321321
21  34422335222222344212223223254313335234
22  44344444454243433444334333454444335444
23  21222213411223132242442422112222124241
24  52444454444342524544344334455354423425
25  43243333224432324234433232223332322323
26  44344444445444444445554433444444323443
27  22322233411224332223432323334224442142
28  45454454454444344545544444455544445555
29  23222333221324354333451323444335443344
30  23231334233222322211323222223422412323
31  33422245232243342332123333444313435234
32  44454544455343422444234244254442425454
33  44344444234442434445554444354432343342
34  44444453545442424435454444454243354532
35  33224333253242442335433423234234435355
36  33544443345243424454424334353424324454
37  32112322411224444124542522513224441331
38  43444454444243444534334434444443434445
39  22222223222224222222222232222223323243
40  43343444444222334334424332253433322422
41  33242422432244444455553334433343312334
42  45445444254443324544444334354344325445
43  23232212221222232223322422223324321232
44  33122324212221311311123342133322413322

Responses to the questionnaire considering Dirk Kuijt.

Q  Responses
1  4554544444452154445555443
2  3312342431254542342241223
3  4554345454544145455544455
4  3122221211112552222121121
5  2232123244343122221343523
6  3332122231222524134232222
7  4554555445255155455535545
8  2212121421222512221251321
9  3444255145545155343435444
10  2232233244344144443434343
11  4555535555455144555555545
12  2222221221133512321131121
13  5555555555555155555545555
14  2323312432252522342342334
15  2221122434352143322424322
16  4554535455555154444535445
17  3454344345455155445544444
18  2322121421233512412351222
19  2322112432232543323443122
20  3222223334442123223432243
21  2322222231412523232131222
22  4544454445544145343545444
23  2111111111212511111121111
24  4444455444445144444435345
25  3432244344344142322432424
26  4443544445455143444534444
27  2222122131221522321121122
28  4554555555524154455535445
29  3222324232344514342141323
30  2322224344221122213322331
31  2223121222221524232231223
32  3553355345545155444535444
33  3332143444543154444543433
34  4443345245555154444534344
35  3444545325454554442525442
36  4444444445444144344545443
37  2123321321154541322132111
38  3544445555454155444535545
39  2122211321232514222131312
40  2322122344353134322545431
41  5344542332525543443333435
42  4544445445455154454534444
43  2322221221233522222131222
44  1321123233331123213333211
