Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers
Total Page:16
File Type:pdf, Size:1020Kb
Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers Els Van Zegbroeck Supervisor: Prof. dr. ir. Rik Van de Walle Counsellors: Ir. Fréderic Godin, Ir. Baptist Vandersmissen Master's dissertation submitted in order to obtain the academic degree of Master of Science in de ingenieurswetenschappen: computerwetenschappen Department of Electronics and Information Systems Chairman: Prof. dr. ir. Jan Van Campenhout Faculty of Engineering and Architecture Academic year 2013-2014 Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers Els Van Zegbroeck Supervisor: Prof. dr. ir. Rik Van de Walle Counsellors: Ir. Fréderic Godin, Ir. Baptist Vandersmissen Master's dissertation submitted in order to obtain the academic degree of Master of Science in de ingenieurswetenschappen: computerwetenschappen Department of Electronics and Information Systems Chairman: Prof. dr. ir. Jan Van Campenhout Faculty of Engineering and Architecture Academic year 2013-2014 Preface This master's dissertation is written to complete my master degree in Computer Sciences - Software Engineering at UGent. I would like to take this opportunity to thank some people. My promotor, prof. dr. ir. Rik Van de Walle, I would like to thank for giving me the ability to write my master's dissertation about something that is very inspiring to me. Social media interests me greatly, and I was happy to learn about machine learning. This is an area that was completely new to me, and I am very glad that I was able to delve into it. Most of all, I would like to thank Ir. Fr´edericGodin and Ir. Baptist Vandersmissen because they were the best monitors I could have hoped for. They were there when I had any questions, and they provided constructive feedback. They also ran tests on TweetGenie enabling me to compare my results. Furthermore, they tried to make the process of writing a master's dissertation as easy as possible by accomodating me with a very helpful road map and keeping me on track. Thank you. Of course, I also want to thank my parents, for making it possible for me to complete this master and my whole family for always supporting me. Els Van Zegbroeck, June 2014 i ii Admission to Loan \The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation." Els Van Zegbroeck, June 2014 iii Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers by Els Van Zegbroeck Master's dissertation submitted in order to obtain the academic degree of Master of Engineering : Computer Sciences - Software Engineering Academic year 2013-2014 Ghent University Faculty of Engeneering and Architecture Department of Electronics and Information Systems Chairman: prof. dr. ir. J. Van Campenhout Promotor: prof. dr. ir. Rik Van de Walle Monitors: Ir. Fr´edericGodin and Ir. Baptist Vandersmissen Summary In this master's dissertation, an ensemble is built that classifies Flemish Twitter users into male and female classes. The ensemble takes the input of five different classifiers and combines them to provide a final label. It is trained with Flemish Twitter users which were classified manually. Features with the highest mutual information (MI) are selected to create vectors. A vector is a data presentation of a Twitter user. It is the input to the different classifier components. Features are extracted from three major sources of information: the user's profile information, tweets, and social network. All classifiers use either a Support Vector Machine or Naive Bayes. Six different classifiers are compared. One of them, the tweet style classifier, is not added to the ensemble. This was the only classifier that did not increase the accuracy of the ensemble. It is analyzed how each other classifier leads to a gain in accuracy. The accuracies of two test cases obtained by this paper are compared to the accuracies of TweetGenie, the most similar work found. In both cases, we obtained an accuracy of more than 6% higher than the accuracy obtained by TweetGenie. This master's dissertation also proves that using the descriptions of the friends of a user to predict the gender of that user is very beneficial. Keywords: Twitter, gender classification, machine learning, ensemble iv Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers Els Van Zegbroeck Supervisor(s): prof. dr. ir. Rik Van de Walle, Ir. Frederic´ Godin, Ir. Baptist Vandersmissen Abstract—In this paper, an ensemble is built that classifies Flemish Twit- the evaluation of each classifier and the whole ensemble can be ter users into male and female classes. The ensemble takes the input of five found. We conclude in Section IX. different classifiers and combines them to provide a final label. It is trained with Flemish Twitter users which were classified manually. Features with the highest mutual information (MI) are selected to create vectors. A vec- II. RELATED WORK tor is a data presentation of a Twitter user. It is the input to the different classifier components. Features are extracted from three major sources of Several papers have studied classifiers that try to predict the information: the user’s profile information, tweets, and social network. All classifiers use either a Support Vector Machine or Naive Bayes. Six differ- gender of English speaking Twitter users ([3], [4], [5], [6]). [7] ent classifiers are compared. One of them, the tweet style classifier, is not proves that the accuracy of these kinds of classifiers depends added to the ensemble. This was the only classifier that did not increase the on the language spoken by the instances that are to be classi- accuracy of the ensemble. It is analyzed how each other classifier leads to a gain in accuracy. The accuracies of two test cases obtained by this pa- fied. They do this by comparing results for Japanese, Indone- per are compared to the accuracies of TweetGenie, the most similar work sian, Turkish, and French users. This leads us to the conclusion found. In both cases, we obtained an accuracy of more than 6% higher than that there is a need for a classifier which focusses on Flemish the accuracy obtained by TweetGenie. This paper also proves that using the Twitter users. descriptions of the friends of a user to predict the gender of that user is very beneficial. The closest work that has been found is [2], TweetGenie, Keywords—Twitter, gender classification, machine learning, ensemble in which the classifier concentrates on Dutch Twitter users. While Dutch is a language very similar to Flemish, we believe I. INTRODUCTION that there are enough differences between the two languages to study a classifier for Flemish users. One of those differences WITTER is a social networking and microblogging service is that the Flemish are more likely to cut off their words [8]. Twhich is becoming more popular every day. Users post TweetGenie only extracts features from a user’s tweets. Conse- messages called tweets which are limited to 140 characters for quently, TweetGenie cannot classify users that have not posted many different reasons. Users want to share their daily activities any tweets. and thoughts, keep up with friends, and share and find new infor- One of the classifiers that makes up our ensemble extracts mation [1]. These tweets contain a lot of information about the features from the descriptions of the users’ friends. No work author. They can be used to predict latent attributes of the user. has been found in which the user’s social network is used in the These are attributes that may be expected to be readily available same way that we do. on Twitter, like on other similar social networking sites, but they are not. III. DATASET In this paper, we will predict one of these latent attributes for Flemish Twitter users, namely his or her gender. We will do To train and test the classifiers, a set of users was retrieved this by building an ensemble of classifiers. Each of these clas- from Twitter using the Twitter Search API. To try to limit the sifiers will take into account another piece of information about set to Flemish Twitter users, users were harvested from the the user. We will gather information from three major differ- friends of popular Flemish accounts such as e´en,´ destandaard, ent sources: the user’s profile information, tweets, and social and VTM. 4826 females and 3965 males were collected. Each network. Each classifier component of the ensemble is evalu- of them was manually labeled. ated. The whole ensemble is evaluated, next, on three different test sets. The results of the ensemble are compared to those of IV. FEATURES TweetGenie [2], the most similar work found. This paper also proves that using the descriptions of a user’s Features were selected for six different classifiers. For each friends (accounts that the user follows) is very beneficial when classifier, the features with the highest mutual information (MI) trying to predict the gender of the user. are selected because the classifier cannot find patterns in mil- The rest of the paper is organized as follows. In Section II, lions of features. The features with the highest mutual informa- related work is discussed. The dataset used to train and test tion are those that provide the highest certainty that their pres- with is described in Section III. In Section IV, the selection and ence leads to a particular label. The selected features are those extraction of features is explained. These features are used to that are used to create vectors.