Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers

Els Van Zegbroeck

Supervisor: Prof. dr. ir. Rik Van de Walle
Counsellors: Ir. Fréderic Godin, Ir. Baptist Vandersmissen

Master's dissertation submitted in order to obtain the academic degree of Master of Science in Engineering: Computer Science

Department of Electronics and Information Systems
Chairman: Prof. dr. ir. Jan Van Campenhout
Faculty of Engineering and Architecture
Academic year 2013-2014

Preface

This master's dissertation was written to complete my master's degree in Computer Science - Software Engineering at UGent. I would like to take this opportunity to thank some people. I would like to thank my supervisor, prof. dr. ir. Rik Van de Walle, for giving me the opportunity to write my master's dissertation about something that is very inspiring to me. Social media interests me greatly, and I was happy to learn about machine learning. This is an area that was completely new to me, and I am very glad that I was able to delve into it. Most of all, I would like to thank Ir. Fréderic Godin and Ir. Baptist Vandersmissen, because they were the best counsellors I could have hoped for. They were there whenever I had questions, and they provided constructive feedback. They also ran tests on TweetGenie, enabling me to compare my results. Furthermore, they tried to make the process of writing a master's dissertation as easy as possible by providing a very helpful road map and keeping me on track. Thank you.

Of course, I also want to thank my parents for making it possible for me to complete this degree, and my whole family for always supporting me.

Els Van Zegbroeck, June 2014

Admission to Loan

“The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.”

Els Van Zegbroeck, June 2014

Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers

by Els Van Zegbroeck

Master's dissertation submitted in order to obtain the academic degree of Master of Engineering: Computer Sciences - Software Engineering

Academic year 2013-2014

Ghent University
Faculty of Engineering and Architecture
Department of Electronics and Information Systems
Chairman: prof. dr. ir. J. Van Campenhout
Supervisor: prof. dr. ir. Rik Van de Walle
Counsellors: Ir. Fréderic Godin and Ir. Baptist Vandersmissen

Summary

In this master's dissertation, an ensemble is built that classifies Flemish Twitter users into male and female classes. The ensemble takes the input of five different classifiers and combines them to provide a final label. It is trained with Flemish Twitter users which were classified manually. Features with the highest mutual information (MI) are selected to create vectors. A vector is a data representation of a Twitter user and is the input to the different classifier components. Features are extracted from three major sources of information: the user's profile information, tweets, and social network. All classifiers use either a Support Vector Machine or Naive Bayes. Six different classifiers are compared. One of them, the tweet style classifier, is not added to the ensemble because it was the only classifier that did not increase the accuracy of the ensemble. It is analyzed how each of the other classifiers leads to a gain in accuracy. The accuracies of two test cases obtained in this dissertation are compared to the accuracies of TweetGenie, the most similar work found. In both cases, we obtained an accuracy more than 6% higher than the accuracy obtained by TweetGenie. This master's dissertation also proves that using the descriptions of the friends of a user to predict the gender of that user is very beneficial.

Keywords: Twitter, gender classification, machine learning, ensemble

Predicting the Gender of Flemish Twitter Users Using an Ensemble of Classifiers

Els Van Zegbroeck

Supervisors: prof. dr. ir. Rik Van de Walle, Ir. Fréderic Godin, Ir. Baptist Vandersmissen

Abstract—In this paper, an ensemble is built that classifies Flemish Twitter users into male and female classes. The ensemble takes the input of five different classifiers and combines them to provide a final label. It is trained with Flemish Twitter users which were classified manually. Features with the highest mutual information (MI) are selected to create vectors. A vector is a data representation of a Twitter user and is the input to the different classifier components. Features are extracted from three major sources of information: the user's profile information, tweets, and social network. All classifiers use either a Support Vector Machine or Naive Bayes. Six different classifiers are compared. One of them, the tweet style classifier, is not added to the ensemble because it was the only classifier that did not increase the accuracy of the ensemble. It is analyzed how each of the other classifiers leads to a gain in accuracy. The accuracies of two test cases obtained by this paper are compared to the accuracies of TweetGenie, the most similar work found. In both cases, we obtained an accuracy more than 6% higher than the accuracy obtained by TweetGenie. This paper also proves that using the descriptions of the friends of a user to predict the gender of that user is very beneficial.

Keywords—Twitter, gender classification, machine learning, ensemble

I. INTRODUCTION

Twitter is a social networking and microblogging service which is becoming more popular every day. Users post messages called tweets, which are limited to 140 characters, for many different reasons: they want to share their daily activities and thoughts, keep up with friends, and share and find new information [1]. These tweets contain a lot of information about the author and can be used to predict latent attributes of the user. These are attributes that may be expected to be readily available on Twitter, as on other similar social networking sites, but they are not.

In this paper, we will predict one of these latent attributes for Flemish Twitter users, namely his or her gender. We will do this by building an ensemble of classifiers. Each of these classifiers will take into account another piece of information about the user. We will gather information from three major sources: the user's profile information, tweets, and social network. Each classifier component of the ensemble is evaluated, and the whole ensemble is then evaluated on three different test sets. The results of the ensemble are compared to those of TweetGenie [2], the most similar work found. This paper also proves that using the descriptions of a user's friends (accounts that the user follows) is very beneficial when trying to predict the gender of the user.

The rest of the paper is organized as follows. In Section II, related work is discussed. The dataset used for training and testing is described in Section III. In Section IV, the selection and extraction of features is explained. These features are used to create vectors, which are explained in Section V. In Section VI, the different classifier components are explained. The structure of the ensemble is illustrated in Section VII. In Section VIII, the evaluation of each classifier and the whole ensemble can be found. We conclude in Section IX.

II. RELATED WORK

Several papers have studied classifiers that try to predict the gender of English-speaking Twitter users ([3], [4], [5], [6]). [7] proves that the accuracy of these kinds of classifiers depends on the language spoken by the instances that are to be classified. They do this by comparing results for Japanese, Indonesian, Turkish, and French users. This leads us to the conclusion that there is a need for a classifier which focuses on Flemish Twitter users.

The closest work that has been found is [2], TweetGenie, in which the classifier concentrates on Dutch Twitter users. While Dutch is a language very similar to Flemish, we believe that there are enough differences between the two languages to study a classifier for Flemish users. One of those differences is that the Flemish are more likely to cut off their words [8]. TweetGenie only extracts features from a user's tweets; consequently, it cannot classify users that have not posted any tweets.

One of the classifiers that makes up our ensemble extracts features from the descriptions of the users' friends. No work has been found in which the user's social network is used in the same way that we do.

III. DATASET

To train and test the classifiers, a set of users was retrieved from Twitter using the Twitter Search API. To try to limit the set to Flemish Twitter users, users were harvested from the followers of popular Flemish accounts such as één, destandaard, and VTM. 4826 females and 3965 males were collected, and each of them was manually labeled.
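The paper does not show its harvesting code or name a client library. As a rough illustration only, the sketch below uses the tweepy library (an assumption; the handles and the per-account limit are placeholders) to page through the follower lists of the seed accounts:

```python
# Sketch only: gather candidate Flemish users as followers of popular
# Flemish seed accounts; every harvested user was then labeled by hand.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

seed_accounts = ["een", "destandaard", "VTM"]  # handles are assumptions
candidate_ids = set()
for seed in seed_accounts:
    # followers_ids pages through the numeric follower ids of one account
    for user_id in tweepy.Cursor(api.followers_ids, screen_name=seed).items(5000):
        candidate_ids.add(user_id)

print(len(candidate_ids), "candidate users to label manually")
```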
IV. FEATURES

Features were selected for six different classifiers. For each classifier, the features with the highest mutual information (MI) are selected, because the classifier cannot find patterns in millions of features. The features with the highest mutual information are those that provide the highest certainty that their presence leads to a particular label. The selected features are those that are used to create vectors, the data representations of a user. Features were extracted from three major sources of information, each of which is described in the following sections.

A. Profile Information

From the profile information of the Twitter user, we take the username, name, and description if he or she has one. From each of these, n-grams are taken to create vectors.

B. Tweets

All tweets that are not retweets (tweets that are posted by a user but which were initially written by another user) are retrieved from the user. The tweet content classifier extracts n-grams from the tweets. These n-grams include words but also emoticons. The tweet style classifier counts specific emoticons, specific punctuation sequences, and the domains of the URLs that a user posts.

C. Social Network

N-grams are extracted from the descriptions of the friends of the users. To the best of our knowledge, this has never been done before.

V. VECTORS

To be able to classify the users, vectors need to be constructed that represent them. A user is represented by a different vector for each kind of classifier, as they all look at different features of the user. A created vector x_j = (x_1, x_2, ..., x_n) has n elements if n features are being counted, where x_i corresponds with feature i. It is zero if the feature does not occur for a particular user j. If the feature does occur for the user, x_i is the percentage of tweets or friend descriptions (depending on the kind of vector) in which the feature appears. For the username, name, and description, x_i is the number of times the feature occurs. These vectors are the input to the classifiers.
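As an illustration of the selection step and the vector layout described above, here is a minimal sketch under the assumption of binary (present/absent) features; it is not the thesis's WEKA-based pipeline, and the function names are hypothetical:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI between a binary feature and the binary gender label, from the
    four cells of their contingency table (n11 = present & female, etc.)."""
    n = n11 + n10 + n01 + n00
    cells = [(n11, n11 + n10, n11 + n01),   # present, female
             (n10, n11 + n10, n10 + n00),   # present, male
             (n01, n01 + n00, n11 + n01),   # absent,  female
             (n00, n01 + n00, n10 + n00)]   # absent,  male
    return sum((c / n) * math.log2(n * c / (nf * nl))
               for c, nf, nl in cells if c > 0)

def select_top_features(users, k):
    """users: list of (feature_set, label) pairs with label 'f' or 'm'.
    Scores every feature by MI with the label and keeps the top k."""
    scores = {}
    for feat in set().union(*(feats for feats, _ in users)):
        n11 = sum(1 for feats, lab in users if feat in feats and lab == "f")
        n10 = sum(1 for feats, lab in users if feat in feats and lab == "m")
        n01 = sum(1 for feats, lab in users if feat not in feats and lab == "f")
        n00 = len(users) - n11 - n10 - n01
        scores[feat] = mutual_information(n11, n10, n01, n00)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def tweet_vector(selected_features, tweets):
    """x_i = percentage of the user's tweets in which feature i appears
    (substring test as a simplification; username/name/description
    vectors use raw counts instead, per the paper)."""
    return [100.0 * sum(feat in tweet for tweet in tweets) / max(len(tweets), 1)
            for feat in selected_features]
```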
VI. CLASSIFIERS

A. Username Classifier

The username classifier uses Naive Bayes. It is trained with 3000 features which were extracted from the usernames. The features are bigrams, trigrams, and four-grams based on characters.

B. Name Classifier

The name classifier also uses Naive Bayes. It is trained with 5000 features which were extracted from the users' names. Names on Twitter can be anything at all as long as they have at least one character. Based on a manual inspection, we assume that the first token of the name (with punctuation removed) is the first name of the user. Features are the first token and the trigrams and four-grams based on characters extracted from this first token.

C. Description Classifier

This classifier uses a Support Vector Machine. 'C' and 'γ' are set to 4 and 10 respectively. These parameters were decided based upon a grid search. Unigrams and bigrams based on tokens are the features extracted from the description. The 2000 features with the highest MI were selected in this case.

D. Tweet Content Classifier

A Support Vector Machine is used by the tweet content classifier. 'C' and 'γ' are set to 4 and 0.1 respectively. 2000 features are used by this classifier. Features include the unigrams and bigrams extracted from the tweets.

E. Tweet Style Classifier

The tweet style classifier also uses a Support Vector Machine, with 'C' set to 3 and 'γ' set to 10. It is trained with 100 features. These features are emoticons, punctuation sequences, and the domains of the URLs posted.

F. Friend Description Classifier

The friend description classifier uses 5000 features extracted from the descriptions of the friends of the users. It uses a Support Vector Machine with 'C' and 'γ' set to 4 and 0.1 respectively.

VII. ENSEMBLE

The ensemble is composed of five classifiers. The tweet style classifier was not added because it decreased the accuracy of the ensemble. The description classifier, username classifier, name classifier, friend description classifier, and tweet content classifier each increased the accuracy of the ensemble. Each classifier uses either a Support Vector Machine or Naive Bayes. The ensemble takes the output of each of these classifiers and provides a final label. The output of each classifier is its certainty that the user is female and its certainty that the user is male. The ensemble adds all of the certainties together for each class and provides the label of the class with the highest summation. This comes down to classifying the instance in the class with the highest average certainty.
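The combination rule just described is simple enough to state in a few lines. A sketch, with the component classifiers as hypothetical stand-ins for the five SVM/Naive Bayes models above:

```python
def ensemble_label(component_outputs):
    """component_outputs: one (certainty_female, certainty_male) pair per
    component that could produce an output for this user (e.g. the
    description classifier is skipped when a user has no description)."""
    total_f = sum(p_f for p_f, _ in component_outputs)
    total_m = sum(p_m for _, p_m in component_outputs)
    # Highest summed certainty wins; over a fixed set of components this
    # equals picking the highest average certainty. Ties go to female here.
    return "female" if total_f >= total_m else "male"

# Hypothetical example: three components lean female, two lean male.
outputs = [(0.70, 0.30), (0.60, 0.40), (0.55, 0.45), (0.40, 0.60), (0.45, 0.55)]
print(ensemble_label(outputs))  # -> female (2.70 vs 2.30)
```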
VIII. RESULTS

Figure 1 illustrates the accuracy of each individual classifier when trained on 2500 instances of each class and tested on a test set of 1622 females and 1334 males. The name classifier reaches the highest accuracy, and the description classifier obtains the lowest accuracy. Because less than 80% of the test set has a description, however, the description classifier was unable to classify each instance.

[Fig. 1. The accuracy of each classifier with a training set size of 2500 (2500 females and 2500 males). Approximate values recovered from the chart: name 87.54%, username 78.86%, friend description 75.43%, tweet content 75.36%, tweet style 66.34%, description 65.74%.]

When tested on this set, the tweet content classifier reaches an accuracy of 74.36%. The addition of the description classifier leads to an accuracy of 75.14%. Adding the friend description classifier gives 80.99%. Including the username classifier adds another 1.52%. Finally, when adding the name classifier, we attain an accuracy of 89.28%. The confusion matrix of the final stage can be found in Table I. From this, we can conclude that the precision is close to 0.88 for females and the recall is almost 0.94; the F1 score for the female class is therefore about 0.91. For the male class, the precision is 0.92 and the recall is 0.84, giving an F1 score of 0.88.

TABLE I
Confusion matrix for the complete ensemble.

                     Predicted Female   Predicted Male
    Actual Female          1518              104
    Actual Male             213             1121
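As a quick check (our own arithmetic, not text from the paper), the reported scores follow directly from Table I, taking each gender in turn as the positive class:

```latex
% Precision, recall, and F1 derived from Table I
% (rows = actual class, columns = predicted class):
P_{\mathrm{female}} = \frac{1518}{1518 + 213} \approx 0.877, \quad
R_{\mathrm{female}} = \frac{1518}{1518 + 104} \approx 0.936, \quad
F1_{\mathrm{female}} = \frac{2 P R}{P + R} \approx 0.906

P_{\mathrm{male}} = \frac{1121}{1121 + 104} \approx 0.915, \quad
R_{\mathrm{male}} = \frac{1121}{1121 + 213} \approx 0.840, \quad
F1_{\mathrm{male}} \approx 0.876

\mathrm{accuracy} = \frac{1518 + 1121}{2956} \approx 89.28\%
```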

To perform a final evaluation, two more test sets were retrieved. Because the previous test set was retrieved in the same way as the training set, we perform extra tests on these new sets to make sure the ensemble works for all kinds of users. The first new test set (which we will from now on call 'Random') consists of users who tweeted about trending topics and, for each of them, one follower picked at random. The second set contains users who tweeted with the hashtag '#tvvv' on May 2nd of 2014. We will call this set 'TVVV'. Both sets contain 500 users. The results of these tests can be found in Table II. The table also includes the results of TweetGenie on the same test sets. Note that 13 of the users in these test sets have never tweeted before and could therefore not be classified by TweetGenie. Our ensemble correctly classifies 100% of them.

TABLE II
Results on the two new test sets.

    Test Set     Ensemble    TweetGenie
    Random        91.89%       82.15%
    TVVV          93.32%       86.44%

IX. CONCLUSION

In this paper, we presented an ensemble that predicts the gender of Flemish Twitter users. The accuracies obtained for three test cases vary from 89.28% to 93.32%. We compared the ensemble to the most similar work, TweetGenie, twice. The ensemble reaches an accuracy more than 6% higher than TweetGenie reaches in each case. With a friend description classifier that on its own reaches an accuracy of 75.43% on our test set, we have also proved that features extracted from the descriptions of friends of a user are fruitful when predicting that user's gender.

REFERENCES

[1] Akshay Java, Xiaodan Song, Tim Finin, and Belle Tseng, "Why we twitter: Understanding microblogging usage and communities," in Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, New York, NY, USA, 2007, WebKDD/SNA-KDD '07, pp. 56-65, ACM.
[2] Dong Nguyen, Rilana Gravel, Dolf Trieschnigg, and Theo Meder, "'How old do you think I am?' A study of language and age in Twitter," http://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/5984, 2013.
[3] John D. Burger, John Henderson, George Kim, and Guido Zarrella, "Discriminating gender on Twitter," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, 2011, EMNLP '11, pp. 1301-1309, Association for Computational Linguistics.
[4] Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta, "Classifying latent user attributes in Twitter," in Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, New York, NY, USA, 2010, SMUC '10, pp. 37-44, ACM.
[5] Faiyaz Al Zamal, Wendy Liu, and Derek Ruths, "Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors," 2012.
[6] David Bamman, Jacob Eisenstein, and Tyler Schnoebelen, "Gender in Twitter: Styles, stances, and social networks," CoRR, vol. abs/1210.4567, 2012.
[7] Morgane Ciot, Morgan Sonderegger, and Derek Ruths, "Gender inference of Twitter users in non-English contexts," http://aclweb.org/anthology/D/D13/D13-1114.pdf, 2013.
[8] UTwente, http://www.utwente.nl/en/newsevents/2013/5/164645/computer-guesses-age-and-gender-of-tweeters, May 2013.


Contents

1 Introduction
   1.1 Problem Definition
   1.2 Approach
   1.3 Overview

2 Related Work
   2.1 Classifying Based on Gender
       2.1.1 Twitter
       2.1.2 Facebook
       2.1.3 Blogs
   2.2 Classifying Based on Other Latent Attributes
       2.2.1 Twitter
   2.3 Personalized Recommendations
   2.4 Conclusion

3 Terminology
   3.1 Follow
   3.2 Follower
   3.3 Friend
   3.4 Hashtag
   3.5 Mention
   3.6 Protected Tweets
   3.7 Retweet
   3.8 RT
   3.9 Timeline
   3.10 Trend
   3.11 Tweet

4 The Dataset
   4.1 Retrieval of the Users
       4.1.1 Method
       4.1.2 Result
   4.2 Attributes of Users
       4.2.1 Username
       4.2.2 Name
       4.2.3 Description
       4.2.4 Tweets
       4.2.5 Friends
   4.3 Languages Most Often Used by the Users in the Dataset
   4.4 Conclusion

5 Feature Extraction
   5.1 Username
   5.2 Name
   5.3 Description
   5.4 Tweets
       5.4.1 Features based on content
       5.4.2 Features based on style
   5.5 Friends
   5.6 Conclusion

6 Data Representation
   6.1 Vectors
   6.2 Selection of Features
       6.2.1 Transformation
       6.2.2 Tokenization
       6.2.3 Feature Selection
   6.3 Vector Creation
       6.3.1 Load Dictionary with Features
       6.3.2 Create Vector
   6.4 Conclusion

7 Classification
   7.1 Used Machine Learning Techniques
       7.1.1 Naive Bayes
       7.1.2 Support Vector Machines
   7.2 Ensemble
       7.2.1 What Is An Ensemble?
       7.2.2 Structure
           7.2.2.1 Username Classifier
           7.2.2.2 Name Classifier
           7.2.2.3 Description Classifier
           7.2.2.4 Tweet Content Classifier
           7.2.2.5 Tweet Style Classifier
           7.2.2.6 Friend Description Classifier
   7.3 WEKA
   7.4 Conclusion

8 Results
   8.1 Evaluating the Different Classifiers
       8.1.1 Changing the Training Set Size
       8.1.2 Changing the Amount of Features
       8.1.3 Comparing the Classifiers
   8.2 Evaluating the Ensemble
       8.2.1 Testing the Ensemble on the Testing Set
       8.2.2 Evaluation of a Growing Ensemble
           8.2.2.1 Tweet Content Classifier
           8.2.2.2 Addition of the Description Classifier
           8.2.2.3 Addition of the Friend Description Classifier
           8.2.2.4 Addition of the Username Classifier
           8.2.2.5 Addition of the Name Classifier or The Complete Ensemble
   8.3 Testing the Ensemble on a Completely New Set of Users
       8.3.1 Test Set: Random
           8.3.1.1 Just the tweet content classifier versus TweetGenie
           8.3.1.2 Our ensemble versus TweetGenie
       8.3.2 Test Set: The Voice of Flanders (TVVV)
           8.3.2.1 Just the tweet content classifier versus TweetGenie
           8.3.2.2 Our ensemble versus TweetGenie
   8.4 Conclusion

9 Conclusion
   9.1 Future Work
   9.2 Conclusion

References

Used Abbreviations

MAP   Maximum A Posteriori
NB    Naive Bayes
RBF   Radial Basis Function
SVM   Support Vector Machine
URL   Uniform Resource Locator
WEKA  Waikato Environment for Knowledge Analysis

Chapter 1

Introduction

1.1 Problem Definition

Twitter is a social networking and microblogging service which is becoming more popular every day. Last year, Twitter proclaimed that it had more than 200,000,000 active accounts, and that these users posted an average of 400,000,000 messages, on Twitter called tweets, per day (Wickre, 2013). This means that Twitter is an enormous source of data. These tweets are limited to 140 characters and are written by users for many different reasons, such as sharing one's daily activities and thoughts, keeping up with friends, and sharing and finding information (Java et al., 2007). As these tweets contain a lot of information about what users like, think, and do, a lot of statistics could be derived from them. This is information that might be expected to be part of the profile information, as it is on many other similar social networking sites, but unfortunately it is not readily available. For this reason, many attempts are being made to create predictors, or classifiers, that correctly predict different characteristics of Twitter users.

Knowing the gender of a user is very useful for marketing purposes. It would, for example, be useful if a TV station were able to see the percentage of female viewers and the percentage of male viewers tweeting about particular shows. Certain kinds of advertisements are interesting to members of one gender while not being so alluring to the other; a commercial about lipstick, for instance, may not be very interesting to males. Information about the gender distribution of a show's viewers could then be used to air targeted advertisements during that show, making the advertisements more effective.

Twitter is known to have its highest penetration in the Netherlands, with more than 92% of the population having a Twitter account (TrendWatch, 2013). Digimeter (2013) reported that, as of September 2013, 28% of the Flemish population had a Twitter account. While the number of Flemish Twitter users has increased significantly over the last couple of years, it is still minuscule compared to the percentage in the Netherlands. This difference in activity is one reason why a Dutch classifier may not necessarily work well for Flemish users. Another reason is that there are Flemish words that are not used by the Dutch and vice versa. Additionally, the dialects are very different: Flemish users, for example, are much more likely to cut off their words (Wien, 2014).

The goal of this master's thesis is to construct a classifier especially for Flemish Twitter users. The results obtained with this classifier will be compared with the accuracy obtained when using an existing Dutch classifier, TweetGenie. TweetGenie is the only work found that focuses on Dutch-speaking Twitter users and is therefore the work closest to this thesis. UTwente (2013) claims that TweetGenie attains an accuracy of 85% for Dutch Twitter users. We will study whether TweetGenie also reaches 85% accuracy for Flemish Twitter users and compare its results with ours. 85% is a good starting point, but we would like to make our classifier even more accurate. Also, TweetGenie classifies Twitter users based on their tweets, which means that users who have not tweeted cannot be classified by TweetGenie. Additionally, TweetGenie may not have enough information to correctly classify users who have only posted a couple of tweets. By creating an ensemble that does not only look at a user's tweets, we will try to classify all users, not just those who have enough tweets, as accurately as possible. The accuracy of the main classifier will be compared at several different stages, each time adding another subclassifier. Each of these classifiers will handle the user's information in a different way.

Another important goal of this master's thesis is to study whether the social network of a user can provide additional useful information. As was just mentioned, the classifier of this thesis will not only utilize a user's tweets to predict the gender of that user. There are three main sources of information that will be used. The first is, of course, the tweets. The second is the profile information of a user, which includes the user's username, name, and description. While this information was not used by TweetGenie, it was used by many other works such as Burger et al. (2011) and Zamal et al. (2012). The final source of information is the user's social network. Looking at specific accounts that are followed by both genders has been done before: Zamal et al. (2012) study the distribution of the genders of the accounts that a user follows. No work, however, has been found in which the descriptions of the accounts that a user follows are exploited to help predict the gender of the user. We will therefore study whether this information is useful when determining the gender of a Twitter user.

To recapitulate, there are three main goals of this master's thesis. Firstly, we will build the first classifier that focuses on predicting the gender of Flemish Twitter users. Secondly, we will evaluate how TweetGenie, the most similar work, compares to the classifier we built. Finally, we will analyze whether utilizing the descriptions of the accounts that a user follows is fruitful when predicting the gender of that user.

1.2 Approach

First, a set of harvested Twitter users will be constructed. Approximating a set containing only Flemish Twitter users is one of the first big problems that has to be tackled. The location field of a user's profile cannot be used: it can contain anything, including 'Wonderland', and users are not even required to enter a location. Filtering based on location would also bias the set, as it would only contain users who want to let the world know more about themselves. This is why the set will be approximated by harvesting followers of popular Flemish accounts such as Een (21,571 followers), VTM (30,889 followers), or Studio Brussel (138,840 followers). To avoid further bias, users will not be filtered based upon whether their profile includes a URL or mentions their gender; instead, the users will be annotated manually. This set of users will act as a training and validation set for an ensemble of classifiers.

An ensemble of classifiers will be used instead of one single classifier. Different classifiers will each make their own prediction, and these predictions will be the input to one main classifier which will give the final result. Using an ensemble of classifiers means that one classifier can easily be added or removed to measure the effectiveness of a single classifier (Bishop, 2006a).

The first classifier will look at the users' tweets. N-grams will be collected and counted, based on which the classifier will learn which n-grams occur more often in which gender's tweets. This kind of classifier has been used most often in previous work, which classified either English-speaking users or classified independently of the language used. Tweets, however, are very limited in length, and many Flemish users have posted a very limited number of tweets. This will be one of the main differences with previous work.

A second classifier will be based on the stylistic characteristics of the tweets posted by the user. It will, for example, count the number of exclamation marks or question marks found in a user's tweets on average, look at how many pictures are posted by the user, and count how many smileys a user uses on average. The number of words that are entirely capitalized will be counted as well.

The three following classifiers will look at a user's name, username, and description. A user's name on Twitter can be anything, which is why we cannot predict the gender of a user based solely on their first name: even if the user enters their actual name, it is unknown whether the first token is the first name or part of the last name. Because Flemish users often put their first name first, it will be attempted to make a correct prediction based on this first token. Names may also be included in the username, but versions of 'boy' and 'girl' are found in usernames as well, and the usage of 'x's may be an indicator of being female. To extract these kinds of features, n-grams will be taken from the username, but this time the n-grams will consist of a number of adjacent characters instead of a number of adjacent tokens. To extract features from the description, n-grams consisting of adjacent tokens will be used; adjacent tokens should capture features such as 'papa van'. Studying the extracted features and the results of the classifier will reveal whether these features are beneficial at all. (A small sketch of both n-gram flavors follows at the end of this section.)

The fifth classifier looks at a user's friends, the accounts that a user follows. Classifiers which look at the descriptions of friends and followers to predict the gender of a Twitter user have not been found in previous work. One of the advantages of looking at a user's social network is that classifying is also possible if the user has not posted many tweets. This is important when trying to predict the gender of Flemish users because, as has been mentioned previously, they often have not posted many tweets. A significant group of users, called information seekers, use Twitter as a news aggregation reader: they follow several accounts but may not have posted any tweets. By looking at their social network, these users can be classified as well.

The final, and main, classifier will take the predictions of all previous classifiers as input and will output the final prediction. Different functions to translate the incoming predictions into a final prediction will be analyzed. This may be a majority vote, or the function could give different priorities to different classifiers.

The biggest obstacles that are expected are the number of inactive users and the fact that many Flemish users use a lot of English words and sometimes even post tweets completely in English.
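The two n-gram flavors mentioned above are straightforward to implement. A minimal sketch (the helper names are ours, not the thesis's implementation):

```python
# Character n-grams for usernames/names; token n-grams for descriptions.
def char_ngrams(text, n):
    """Adjacent character n-grams, e.g. bigrams of a username."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def token_ngrams(text, n):
    """Adjacent token n-grams, e.g. bigrams of a profile description."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(char_ngrams("xjuliex", 2))
# ['xj', 'ju', 'ul', 'li', 'ie', 'ex']
print(token_ngrams("papa van twee dochters", 2))
# ['papa van', 'van twee', 'twee dochters']
```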

1.3 Overview

This introduction will be followed by a review of related work in Chapter 2. Literature in which the gender of users is predicted, both on Twitter and on other social networking sites, is listed, followed by some works in which other latent attributes were predicted. In Chapter 3, several subject-specific terms are defined. The users that were used for training and testing are discussed in Chapter 4. In Chapter 5, the feature extraction process is explained step by step. Afterward, the creation of vectors is explained in Chapter 6. These vectors are the input to the ensemble, which is clarified in Chapter 7. In the same chapter, the most important points of the machine learning techniques used are described; next, the ensemble is depicted and the library used is stated. The results are evaluated in Chapter 8. In the first section of that chapter, the different classifiers that make up the ensemble are evaluated: we first see the effect of changing the number of users that the classifiers are trained on and then analyze the effect of changing the number of features they use. In the second section, the accuracy of the complete ensemble is assessed. This is done by first calculating the accuracy of the ensemble on the test set as new classifiers are added to the ensemble. Later, the accuracy of the ensemble is compared to that of TweetGenie. We close with a conclusion and an overview of possible improvements in Chapter 9.

Chapter 2

Related Work

Knowing attributes such as the age and gender of social media users can open up new opportunities for personalized advertising. To gain more insight into what has been done before, we have studied work which classifies the users of different social media. In Section 2.1 we mention several papers which have tried to classify users based on gender on Twitter (Section 2.1.1), Facebook (Section 2.1.2), and blogs (Section 2.1.3). Because similar approaches are used to predict other latent attributes (such as age and political orientation), we also studied other work which classifies Twitter users (Section 2.2).

2.1 Classifying Based on Gender

2.1.1 Twitter

When it comes to classifying non-English-speaking social network users, very little work has been done. Ciot et al. (2013) studied whether a similar accuracy can be reached for languages other than English using existing gender SVM classifiers, and whether adding language-specific features can increase this accuracy. To do this, they first evaluated an existing SVM-based gender classifier on tweets from users speaking one of four different languages: Japanese, Indonesian, Turkish, or French. This first part of the study concluded that such a classifier gives comparable results for English, French, and Indonesian, better results for Turkish, and worse results for Japanese. This shows that the accuracy obtained for one language is not necessarily a good prediction of the accuracy for other languages. To try to increase the accuracy when classifying French-speaking users, a new classifier was built which makes use of language-specific characteristics. This classifier searched for tweets containing a version of 'je suis' followed by an adjective or a past participle, since correct French spelling adds an 'e' after the adjective or past participle for females.
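A hypothetical illustration of the kind of language-specific rule described here (this is not Ciot et al.'s actual implementation, and real French morphology needs more care, since many adjectives end in -e for both genders):

```python
import re

# "je suis" followed by a word whose final -e suggests feminine agreement.
FEMININE_HINT = re.compile(r"\bje\s+suis\s+\w*[^e\s]e\b", re.IGNORECASE)

print(bool(FEMININE_HINT.search("je suis contente")))  # True  (feminine form)
print(bool(FEMININE_HINT.search("je suis content")))   # False (masculine form)
```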

Users without such tweets were classified by the regular SVM classifier. The original classifier reached an accuracy of 76% for French users, while the newly built classifier reached 90% accuracy for users with tweets containing 'je suis' and 83% overall. The classifier used by Ciot et al. (2013) only looks at users' tweets and at the frequencies at which users perform certain actions; the social network of the user was not taken into account. Also, while it is interesting to observe that adding language-specific features can greatly increase the accuracy of a classifier, this was only shown for French. French has gender-specific nouns and adjectives, which is not the case for most other languages.

The highest accuracy for Twitter users' gender, 92%, was obtained by Burger et al. (2011). Their sample of Twitter users may speak any language, but it is biased, as all of them are linked to a blog. All users are represented by a Boolean vector, meaning frequency of usage is not taken into account; they claim that this did not seem to have a big effect. Their features are n-grams of both characters and words of users' screen names, full names, descriptions, and tweets. Features that are counted for fewer than three users are discarded. The performance of Naive Bayes, a Support Vector Machine (SVM), and Balanced Winnow2 was compared, resulting in a preference for Balanced Winnow2: the training time for this algorithm was significantly shorter than for the SVM while reaching the best accuracy. They executed a grid search and established that a low learning rate was needed when training on one feature, while a higher learning rate was needed when combining features. To compare the performance of the classifier with human capabilities, they made use of Amazon Mechanical Turk (AMT); workers were able to classify users correctly 68.7% of the time based only on their tweets. Using the classifier with different feature sets demonstrated that the full name attribute was the most informative of all features used. Based on the full name alone, an accuracy of 89.1% was obtained. All features combined resulted in an accuracy of 92%, as was mentioned before.

Rao et al. (2010) not only study classification into gender classes, but also classification into age, regional origin, and political orientation classes. To gather their dataset of users, they collected an initial set of users followed by their followers using the breadth-first-search (BFS) strategy. The users they started off with to classify into male and female classes were those of sororities, fraternities, and male and female hygiene products. A disadvantage of this method for collecting users is that it is biased. When considering which features they would utilize, they performed tests on the effectiveness of follower-following ratios and follower and following frequencies, as well as the frequencies at which the users replied, retweeted, and tweeted. None of these features, however, altered the effectiveness of the classifier significantly. The classifier used was a stack of Support Vector Machines which relied only on the content of tweets. Both socio-linguistic features (such as the usage of certain smileys, the word 'OMG', ellipses, etc.) and content features (unigrams and bigrams) were used. Strong indicators of a certain class were the words which follow 'my'. Adding these to their socio-linguistic features improved the accuracy of the classifier, even though they were already accounted for by the bigrams.

The stacked model obtained an accuracy of 74.11%.

Zamal et al. (2012) is the only paper found which takes the social network of a user into account in a way similar to how we intend to use it. This paper tries to predict the age, political orientation, and gender of users. In addition to the typical features used to describe a user in other papers, their features also include averages over attributes of the user's friends. Results depended upon whether all friends, the most popular friends, or the least popular friends were considered for this purpose. They find that using features of so-called neighbors alone can lead to similar or even better accuracies for some attributes.

Because they looked at so many features of neighbors, a lot of space was needed to store the design matrix. For this reason, only a set of 400 different users was considered.

In this case, both Support Vector Machines (SVM) and Gradient Boosted Decision Trees (GBDT) were considered. Here, SVMs are said to achieve the best results.

The features used were the most used words, the most used stems, the most frequently occurring bigrams and trigrams, the top co-stems, the most used hashtags, some frequency statistics, the ratio of retweets, and the neighborhood size. Twenty neighbors were collected for each user. For gender, all previously mentioned definitions of neighborhood were tested, as well as three different ways in which the user's and their neighbors' features were combined: only the neighbors' features, the average of both user and friends, and the features of both concatenated.

No significant improvement for gender was found here by making use of the social net- work.

The point Bamman et al.(2012) are trying to make is not so much the classification accuracy they achieve, but rather that when other work sums up the stylistic and content-based features that a certain gender uses more often, the analysis is often done incorrectly. This is because users who talk about different topics use different styles and different content. Authors should first be clustered; usually, each cluster is dominated by one particular sex, and females in one cluster may write completely differently from those in other clusters. The way people talk depends on the situation they are in, and results found in one particular cluster may contradict the results found by approaches that do not cluster first. By using these clusters, other attributes that influence the way we talk, such as age, are also accounted for. They note that this only matters when the goal of gender classification is to actually study the differences between the language used by males and females.

A classifier is built to group American Twitter users. Only single-word features are used. The classifier applies logistic regression with standard regularization and achieves an accuracy of 88%. They argue that when a particular word is used more often by one gender than the other, this helps to predict the gender of a user, but no conclusions about gender differences should be drawn from it, as word usage also depends on the age of the user. They also find that the users who are classified into the wrong gender are those whose social network is not skewed toward the user's own gender. Measuring the homophily of a user's social network and using it when classifying that user does not increase the accuracy of the classifier, however, as long as enough words are available in the corpus.

Conclusion

Not much work has been done for languages other than English. Ciot et al.(2013) showed that applying previous research does not always yield similar results for different languages. Ciot et al.(2013), Rao et al.(2010), and Bamman et al.(2012) show that using only features derived from tweets can lead to a decent accuracy when enough data is available. While using features extracted from friends and followers may be very opportune to increase the accuracy, this has not been investigated much. Zamal et al.(2012) are the only ones to use the social network information of a user in an interesting way; there may, however, be more beneficial ways to use this information than taking the average of feature counts for all users in the network. Burger et al.(2011) showed us that the name of the user should be used in the classifying process. To the best of our knowledge, little research has been conducted on gender classification focused on Dutch Twitter users (except for one study we will mention in Section 2.2.1, but there the focus really is on age and not gender), let alone Flemish users.

2.1.2 Facebook

In Schwartz et al.(2013), the open vocabulary approach is compared with Linguistic Inquiry and Word Count (LIWC). Features consisted of sequences of one to three words and of topics. Only word sequences used by at least 1% of the users in their dataset were kept. Topics were procured by means of the Latent Dirichlet Allocation (LDA) algorithm. The word counts were normalized for each subject, and the Anscombe transformation was applied. A great advantage the authors of this paper had is that their data came from 75,000 volunteers who all indicated their age and gender and filled out a personality test. As expected, the open vocabulary approach did much better than the application of LIWC. An accuracy of 91.9% was achieved using a linear support vector machine with regularization.

Discarding phrases which do not have a high enough Pointwise Mutual Information (PMI) is a good way to limit the number of features in an n-gram classifier without throwing away valuable information. The PMI of two events measures the divergence between the probability of the two occurring together and the probability one would expect if the two events were independent (see Equation 2.1) (Bouma(2013)).

\[
\mathrm{PMI}(x; y) = \log \frac{p(y \mid x)}{p(y)} \tag{2.1}
\]
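As an illustration, the following sketch computes the PMI of a two-word phrase from corpus counts; the counts and the use of a base-2 logarithm are assumptions made for this example, not values taken from Schwartz et al.(2013):

import math

def pmi(count_xy, count_x, count_y, n):
    # PMI as in Equation 2.1: the log-ratio between the observed probability
    # of the phrase and the probability expected if the words were independent.
    p_xy = count_xy / n   # P(x, y): phrase frequency
    p_x = count_x / n     # P(x): frequency of the first word
    p_y = count_y / n     # P(y): frequency of the second word
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts: a phrase seen 500 times among 1,000,000 bigrams,
# whose words occur 2,000 and 600 times on their own.
print(pmi(500, 2000, 600, 1_000_000))  # roughly 8.7: a phrase worth keeping

A phrase whose PMI falls below a chosen threshold would be discarded from the feature set.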

Liu and Ruths(2013) use first names to infer the gender of Facebook users. They also created a list of first names of a large percentage of NYC users which is much more extensive than other lists of names of NYC inhabitants. When names were translated to their canonical form, their popularity followed a Zipf distribution. Names consisting of just one letter, names that do not include any vowels, and names that have been referenced only once were excluded. After these exclusions, the biggest part of the names occurs only between 20 and 50 times; considering that the dataset consisted of 1.67 million NYC Facebook users, this is an incredibly small number. Only 6% of the names were tagged for both genders. Multinomial Naive Bayes (MNB) was used to predict the gender of the users.

Seven different kinds of classifiers were compared, which make use of an offline list of names, the list created by the authors, attributes found on the profiles such as relationship status, the distribution of genders befriended (this is the only classifier which used decision trees instead of MNB), and hybrids.

The users were split into different groups, and each group was classified by a particular classifier. This resulted in a better accuracy than using a single classifier. First, the users were split into a group of users who have enough information available to be classified and a group with very little information. The latter group, which is very small (less than 4% of the users), is classified randomly; the former is split up into more groups depending on which information is available.

The achieved accuracy was 95.2%. The algorithm was also tested on Boston Facebook users and achieved similar results, proving that the classifier does not just work well for NYC users.

It was also found that there was a correlation between the privacy settings of a user and the gender of the user; females were more concerned with privacy.

The accuracies achieved by these last two papers are remarkably higher than those of studies with Twitter users. Facebook offers more personal attributes of a user, such as a relationship status, the gender the user is interested in, and more. These attributes are very advantageous when trying to predict the gender of users. One should also be reminded that a Facebook message can be up to 63,206 characters long (Agrawal(2011)), more than 450 times as long as a tweet. This means that a lot more features can be extracted from each message. Furthermore, the size of the dataset used by these works is much bigger than that of all works previously mentioned, except for Burger et al.(2011).

Conclusion

Schwartz et al.(2013) told us that using an open vocabulary approach is better than using LIWC. Liu and Ruths(2013) again shows us that the name of a user is a very valuable feature when predicting the gender of that user. Both papers which classify Facebook users conclude with a remarkably higher accuracy than studies that classify Twitter users. We know that Facebook data is more difficult to obtain, as one has to get permission from users to collect their data (although for Liu and Ruths(2013), this was not yet the case). Messages are generally much longer than tweets, and many more user attributes are specified than on Twitter.

2.1.3 Blogs

Argamon et al.(2007) study the difference between females and males and between different age categories based on blogs extracted from blogger.com. Only English blogs with at least 500 words were included in the study. To classify the authors, they took the 1000 most used words and grouped words that often occurred together in blog posts into different topics. The frequency of all topics and parts-of-speech was measured for all gender and age classes, and clear differences were found between classes. Two different classifying algorithms were tested: when classifying based on gender, the Bayesian multinomial logistic regression model obtained an accuracy of 79.3%, and the multi-class balanced real-valued Winnow achieved 80.5%. They claim to be the first to find a correlation between the gender differences and the age differences, and conclude that the words often used by younger bloggers are also often used by female bloggers.

The same authors also wrote Schler et al.(2005). In this paper, they discuss style-based features and content-based features. As to the first, they found that females use more blog words while males post more URLs. As to the second, it is reasoned that blogs written by females are much more personal, while males are keen on providing information. To train their classifier (which again uses the Multi-Class Real Winnow algorithm), the features utilized were parts-of-speech (POS), so-called 'function words', the usage of words like 'lol' and 'haha' which they call blog words, and the frequency at which URLs are posted. Again, no information is used that is not found in the blog posts themselves. The authors claim that the algorithm used is much more proficient than SVM for these purposes, while SVM is used in many other similar studies.

The accuracy reached when classifying into gender classes is 80.1%.

Mukherjee and Liu(2010) make use of POS tags with a new approach. One feature that utilizes POS tags is the F-measure, which indicates how implicit a text is; in their dataset, females were notably more implicit than males. POS tags are also used to find POS sequence patterns. These differ from POS n-grams in three major ways. Firstly, they do not have fixed lengths. Secondly, they are only included if they occur in enough documents (called the support of the pattern). Thirdly, they are only included if their so-called 'adherence' is high enough: the squared probability of the terms occurring together divided by the average of the probabilities of all possible two-subsets of the sequence occurring. These patterns are thus a way of finding sequences that are really discriminative of a gender, instead of using all possible fixed-length n-grams. POS n-grams capture the stylistic tendencies of both genders, whereas regular n-grams just look at which words are used more often by which gender.

They used many features from previous papers plus the ones mentioned in the previous paragraph. However, not all features were used to classify. To optimize their classifier, they created a feature selection algorithm which combines two popular techniques called filtering and wrapping. Filtering selects the 'best' features based on one or more selection criteria, while wrapping starts off with a set of features and only adds new ones if they improve the accuracy of the classifier. A combination of both was used to increase the efficiency; this is called the Ensemble Feature Selection (EFS) algorithm.

They sum up the three most used feature value representations: boolean, term frequency, and TF-IDF. They compared the first two in their study but did not use the third, as it did not provide good results. Three different classifiers were compared as well: SVM classification, SVM regression, and Naive Bayes.

They obtain a maximum of 88.56% accuracy. The use of their EFS algorithm, boolean values, and SVM regression yielded the best accuracy. The boolean values gave better results for all classifiers, so for this reason we will also make use of boolean values in our study.

Conclusion

Again, all studies found focus on English-speaking users. Here, SVM is not as popular as it is for Twitter and Facebook. The POS patterns used by Mukherjee and Liu(2010) are a very interesting approach to classifying based on style features. The variable-length POS patterns, however, would probably be much more limited in tweets, as these are much shorter than an average blog post. The obtained accuracies were considerably high, but one has to remember that in this case a lot more data is available than there is for Twitter users, as blog posts are generally much longer than tweets.

Table 2.1: Overview of papers that try to predict the gender of social media users.

Paper | Language(s) | Dataset | Algorithm | Features | Size (Dataset) | Accuracy
Ciot et al.(2013) | Japanese, Indonesian, Turkish, French | Streaming output with language detection (Twitter) | SVM | Only tweet-based features | 5000 per language (each user >1000 tweets) | 83% for French
Burger et al.(2011) | All | Bloggers (Twitter) | Balanced Winnow2 | N-grams of chars and words of screen name, full name, description, and tweets | 184,000 | 92%
Rao et al.(2010) | English | Followers of sororities, fraternities, and male and female hygiene products (Twitter) | Stack of SVMs | Only tweet-based features (both socio-linguistic and content uni- and bigrams) | 1000 | 74.11%
Zamal et al.(2012) | English | Americans whose first name is one of the 100 most common (Twitter) | SVM | Screen name, full name, description, tweets, and the average of these features of friends | 400 + all neighbors | 80.2%
Bamman et al.(2012) | American English | Americans (Twitter) | Logistic Regression | Unigrams | 14,464 | 88%
Schwartz et al.(2013) | English | Volunteers (Facebook) | SVM | Unigrams, bigrams, and trigrams with high PMI, topics (LDA) | 75,000 | 91.9%
Liu and Ruths(2013) | English | NYC users (Facebook) | Multinomial Naive Bayes | First names, attributes such as relationship status and 'looking for', the distribution of genders befriended | 680,000 | 95.2%
Argamon et al.(2007) | English | Bloggers (blogger.com) | Balanced Winnow | Frequency of topics and POS | 19,320 | 80.5%
Schler et al.(2005) | English | Bloggers (blogger.com) | Balanced Winnow | Style-based and content-based features | 37,478 | 80.1%
Mukherjee and Liu(2010) | English | Bloggers (blogs) | SVM Regression | POS sequence patterns plus features from previous work | 3,100 | 88.56%

2.2 Classifying Based on Other Latent Attributes

2.2.1 Twitter

Nguyen et al.(2013) classify Twitter users based on age in three different ways: into age groups, into exact ages, and into life stages. The gender of the users is also considered in the classifying process, because they believe that a person's writing style and content change differently with age depending on the person's gender. The dataset in this study is the only one found that works with Dutch tweets specifically. To accumulate a set of Dutch-speaking users, tweets are filtered based on the word 'het', as this is a word that occurs in many Dutch tweets, and the authors of those tweets are added to the set, as well as their followers and followees. Users with no Dutch tweets were discarded. To minimize the number of accounts belonging to organizations and celebrities, users with more than 5000 followers were discarded as well. The same was done with users whose tweets were private or who had not posted at least ten tweets. Both logistic and linear regression were used: the first as a classifier, and the second to predict the exact age. Linear regression is combined with ridge regularization to avoid overfitting. The utilized features consist of unigrams that appear at least ten times. The logistic regression classifier is a one-versus-all multiclass classifier. It obtains an 86.32% accuracy based on the micro-F1 measure when classifying into age categories, and 86.28% when classifying into life stages. The third classifier reaches an accuracy of 57.09%. Older people are often predicted younger than they actually are. Interestingly, it turns out to be easier to classify females than males, because females exhibit their sex more explicitly.

Pennacchiotti and Popescu(2011) use completely different sets of target classes. In this paper, three different classification tasks were performed: one based on ethnicity, one into democratic or republican, and one predicting whether a user is a fan of Starbucks or not. The classifiers used Gradient Boosted Decision Trees; the same algorithm was used for all three tasks.

This is one of the only papers found which took into account the social network of the user to be classified. Next to the typical content-based and user-behavior-based features, social features were used as well. The social features take into account who the user follows and who his friends (accounts the user follows) are. They were only used when classifying into democratic or republican, where they provided a significant information gain. This may not hold for other situations, as this particular task leads to very specific accounts being followed by a large percentage of the users. We do not expect this kind of usage of the social network to be useful when classifying into male and female, unless we only look at accounts of organizations and celebrities.

One of the user-behavior-based features utilized, the ratio of followers and friends, was the best indicator when predicting whether a user is a Starbucks fan, while for the other problems this ratio was not beneficial. This shows that different features are important for different classification problems, even when the dataset always consists of Twitter users.

The accuracies reached for these three problems when using all features are the following: 91.5% for democrats, 84% for republicans, 65.5% for ethnicity, and 75.9% for Starbucks fans.

Fabricio Benevenuto and Almeida(2010) tried to predict whether a user is a spammer or not. Knowing the most important attributes that often indicate a spammer can be very beneficial when trying to avoid spammers. The most discriminative attributes were determined using the χ² statistic. Warning signs included a high percentage of tweets with URIs, a high average number of URIs per tweet, and very recently created accounts.

A support vector machine with a Radial Basis Function kernel was used to classify users into spammers and non-spammers. 5-fold cross-validation was used to validate the results. Less than 4% of non-spammers were misclassified, but almost 30% of spammers were not classified as such. By using a weight to make the classifier more likely to predict a certain class, priority can be given to either one of the classes, depending on the particular goal of the algorithm. This is not practical in our case, as we do not prefer one gender over the other.

Also, both micro-F1 and macro-F1 measures were used to calculate the accuracy of the classifier. The F1 measure is the harmonic mean of precision and recall. Macro-F1 is especially effective when one class clearly dominates the other. Since both genders are evenly distributed in our dataset, this is not necessarily appropriate in our case.
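For reference, precision, recall, and F1 can be written as follows; this is the standard formulation, with TP, FP, and FN the numbers of true positives, false positives, and false negatives (our notation, not taken from Fabricio Benevenuto and Almeida(2010)):

\[
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2PR}{P + R}
\]

Micro-F1 computes these quantities from counts pooled over all classes, while macro-F1 averages the per-class F1 scores, which gives a minority class as much weight as a dominant one.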

Conclusion

Nguyen et al.(2013) is the only study found which focuses on Dutch users. The algorithms used there are less important for us, as they are used as multiclass classifiers. They do show that high accuracies can also be obtained for Dutch users. Pennacchiotti and Popescu(2011) use Gradient Boosted Decision Trees and make use of the network of the user. They show that the features derived from friends are much more useful in one case than in the others. While using a similar approach as the others to classify users into classes, Fabricio Benevenuto and Almeida(2010) also provide insight into how to detect spammers on Twitter.

2.3 Personalized Recommendations

Abel et al.(2011) use OpenCalais to identify entities and topics talked about in messages; it also gives a unique URI for each identified entity. This could be a convenient tool when looking at descriptions of friends and followers in our study.

2.4 Conclusion

The prediction of latent attributes of social media users is currently a topic that demands the attention of many researchers.

When trying to predict the gender of Twitter users, Rao et al.(2010) and Bamman et al.(2012) only use the users’ tweets to extract features from. Rao et al.(2010) only classifies users who follow certain accounts which are very gender-specific, and they only achieve 74.11% accuracy. Bamman et al.(2012) reaches an accuracy of 88%, but the study is solely based on American Twitter users.

Ciot et al.(2013) does not use features gathered from anything other than tweets either. It is also the only work found (except for Nguyen et al.(2013)) which does not focus on English-speaking Twitter users. In fact, this study compares the accuracies of a classifier on users of four different languages and thereby shows that the accuracy is not independent of the language the users use. This leads us to the conclusion that a classifier trained specifically for Flemish Twitter users needs to be built, even if only to compare it to the Dutch classifier of Nguyen et al.(2013).

Features other than those derived from tweets are used by Burger et al.(2011). In this case, an accuracy of 92% is obtained. The classifier in this work, however, is trained and tested only with Twitter users that are bloggers. As these form only a small subset of the general public, who all tweet to promote their blogs, this accuracy is not representative of the accuracy that can be achieved for any type of person. Zamal et al.(2012) also use other attributes to classify users, such as the user's screen name, name, description, and social network. Their dataset, nevertheless, is limited to users whose first name is in the top 100 list of most common American names. They obtain an accuracy of 80.20%.

Schwartz et al.(2013) and Liu and Ruths(2013) both attempt to predict the gender of Facebook users. The former achieves an accuracy of 91.90%, and the latter obtains 95.20%. These numbers are much higher than those of similar studies based on Twitter users. This could be because Facebook messages are generally a lot longer than tweets, and Facebook offers some advantageous attributes that Twitter does not.

The prediction of gender has also been performed for bloggers. Schler et al.(2005), Argamon et al.(2007), and Mukherjee and Liu(2010) are instances in which the gender of bloggers is predicted. All three of the mentioned studies are based on English speaking

bloggers. Blogs are, again, much longer than tweets. Accuracies obtained vary between 80.50% and 88.56%.

The gender of a user is not the only thing researchers are trying to predict. Pennacchiotti and Popescu(2011) predict the ethnicity of a user, whether he or she is a Starbucks fan, and whether he or she is democratic or republican. They show that the usefulness of certain features depends on the attribute of a user that one is trying to predict. The accuracies achieved for all classes range from 65.50% to 91.50%. We also saw Fabricio Benevenuto and Almeida(2010), in which spammers are detected. Nguyen et al.(2013) classify users into female and male, but also into age groups. This is the work that is most like ours, because it is trained with Dutch Twitter users. Nevertheless, as has been mentioned before, Dutch is not exactly the same as Flemish. Also, Nguyen et al.(2013) only use tweets to classify users.

We conclude that while much work has been done before, there is still a need for a similar study based on Flemish Twitter users. Previous work has relied greatly on features extracted from tweets (or Facebook messages, blog posts, ...). While this will also be our first step toward classifying Flemish Twitter users, it will probably not be enough, because Flemish Twitter users tweet significantly less. We will therefore, along the same lines as Zamal et al.(2012), incorporate the use of friends' features.

Chapter 3

Terminology

In this chapter, several terms peculiar to Twitter are explained. These words will be assumed to be understood in the following chapters.

3.1 Follow

One can choose to follow another by clicking the ‘follow’ button. When one follows an account, one sees the other account’s tweets in his or her timeline.

3.2 Follower

A follower is an account that follows a particular user.

3.3 Friend

A friend on Twitter is an account that is followed by a particular user.

3.4 Hashtag

A hashtag, ‘#’, is used on Twitter to categorize tweets. Hashtags make it easier for other accounts to find tweets about a particular subject. All one has to do is add a hashtag followed by a phrase. This phrase will be the subject associated with the tweet. In the retweet shown in Figure 3.1, ‘#reddevilstobrazil’ is a hashtag.

3.5 Mention

A user can be mentioned in a tweet if his or her username is entered following an '@' symbol. This may be used to refer to someone, but it is also used to engage in a conversation with someone.

3.6 Protected Tweets

Tweets that are protected are only visible to users that have the author’s permission.

3.7 Retweet

A retweet is a tweet posted by a user that was originally posted by another Twitter user. To retweet is to post a retweet. In Figure 3.1, we see an example of a retweet. This retweet was originally posted by BelgianRedDevils (with username @BelRedDevils) and was retweeted on May 13th by Eén. We can see that the tweet has been retweeted 613 times.

Figure 3.1: An example of a retweet.

3.8 RT

RT is an abbreviation for retweet. It is often placed in retweets to indicate that the tweet is a retweet.

3.9 Timeline

A user’s timeline is a list of tweets that have been posted by his or her friends in chronological order. The timeline can be found on Twitter’s homepage after logging in.

3.10 Trend

Twitter specifies several trends. These are subjects or hashtags that are the most popular on Twitter at the moment. These can be specific to the logged in user’s area. By clicking one of the trends, one sees a list of all the tweets that use the specific hashtag or are about the specific subject in chronological order.

3.11 Tweet

A tweet is a message posted on Twitter. Tweets are limited to 140 characters. When one posts a message on Twitter, one is said to tweet.

Chapter 4

The Dataset

In this chapter, we will describe the dataset of Flemish Twitter users that will be used to train and test the ensemble. In Section 4.1, we first see how the dataset is retrieved. In the same section, we also depict some characteristics of the collected set, such as the percentage of users that have a description and the average number of non-retweets. The attributes of the users from which features will later be extracted are listed in Section 4.2. Because we are building an ensemble to classify Flemish Twitter users, it is interesting to know which languages are used most often by the users in the dataset. The distribution of the most used languages is presented in Section 4.3.

4.1 Retrieval of the Users

In this section, we see how the users for the dataset were collected (Subsection 4.1.1) using the Twitter Search API. The characteristics of the resulting dataset are discussed in Subsection 4.1.2.

4.1.1 Method

Data for the training set was retrieved from Twitter through its Search API. As the aim was to collect Flemish Twitter users, a subset of the followers of popular Flemish Twitter accounts was harvested. The resulting dataset was reviewed manually: accounts of companies, organizations, and spammers were deleted, as were private accounts. Accounts of users who gave a location outside Belgium and who did not seem to have any Flemish tweets were assumed not to be Flemish and were deleted, too. Accounts of users whose gender we were unable to determine were removed as well; this set was very limited, however. It occurred when a user had a first name that could belong to either gender (or no name at all), had a nondescript profile picture (or none at all), and did not have any other characteristics that hinted at his or her gender. The remaining users were manually classified as male or female.

The manual review was performed using the webpage of which a screen capture can be seen in Figure 4.1. Several of the users' attributes were taken into account when manually classifying the users. Most of the time, one look at the user's profile picture and name was enough to determine his or her gender. But, as mentioned before, some users did not have one or both of these attributes, or had a nondescript name and profile picture. As it was highly important that the gender was entered correctly, other attributes were also considered in these cases. First, we looked at the description: users describing themselves as 'mama' (mom) or 'papa' (dad) were straightforward to classify. If the user did not enter a description, or if the description did not give any more clues as to the gender of the user, the link on the profile was followed, looking for Instagram pictures or a Facebook page. If this still did not help, the user was deleted from the dataset. At a later stage, it was decided to also delete from the training set all users with fewer than 100 tweets that are not retweets.

4.1.2 Result

Table 4.1 gives an overview of the resulting dataset.

Table 4.1: Users harvested to train classifiers.

Account Username | Total | Females | Males
VTM | 7075 | 4073 | 3002
Eén | 948 | 434 | 514
destandaard | 768 | 319 | 449
TOTAL | 8791 | 4826 | 3965

In Table 4.2, the average number of non-retweets (tweets that are not retweets), the median number of non-retweets, and the percentage of users that have a description are given for all harvested users, grouped by the account from which they were collected. The same statistics are given in Table 4.3, but grouped by gender.

Table 4.2: Statistics of the harvested users (grouped by original account).

Account Username | Average Non-Retweets | Median Non-Retweets | Has Description (%)
VTM | 854.55 | 440 | 78.83
Eén | 752.86 | 336 | 79.96
destandaard | 692.64 | 313.5 | 79.69

Table 4.3: Statistics of the harvested users (grouped by gender).

Gender | Average Non-Retweets | Median Non-Retweets | Has Description (%)
Female | 826.89 | 419.50 | 81.39
Male | 832.54 | 404 | 76.14

Figure 4.1: Screen capture of the webpage used to manually classify users.

4.2 Attributes of Users

To be able to predict the gender of the users as accurately as possible, many different attributes of the users are considered. In this section, each of them is explained.

4.2.1 Username

The one attribute every single user has on Twitter is a username. One cannot have an account without a username. Usernames are used to reference users, and they also form the URL used to visit a user’s Twitter page. Usernames on Twitter are, unlike on many other similar sites, changeable by the user. They are valuable pieces of information because many people have their first name embedded in their username. This is especially helpful when the user has not given an accurate name (see Subsection 4.2.2).

4.2.2 Name

Not all users have given an informative name. One is required to enter at least one character for their name. This character does not have to be an actual letter, so one’s name may as well be ‘!’. Also, as anything may be entered by a user, phrases like ‘I love Justin’ also occur as a name, and this is much more difficult to detect.

4.2.3 Description

Users are able to provide a description of themselves. As previously mentioned in Subsection 4.1.2, around 79% of the users in the dataset have a description. In Table 4.4, five randomly picked descriptions are listed for each gender. A description contains at most 161 characters.

4.2.4 Tweets

A tweet is a message posted on Twitter and is limited to 140 characters. Even though tweets are short, they still contain a lot of interesting hints toward the gender of the author. Five random tweets per gender are shown in Table 4.5. As mentioned before in Subsection 4.1.1, the users in the dataset all have at least 100 tweets. There are, however, many Twitter users who have posted far fewer tweets, or even none at all.

4.2.5 Friends

Friends are the accounts that a user is following. To determine the gender of the user, we will be looking at the descriptions of the accounts a user is following. To the best of our knowledge, this technique has not been used in any previous work.

Table 4.4: Examples of descriptions of random users.

Gender | Description
Female | 25 year old HR Management student at HUB, dancer, nerdfighter, potterhead and social network/coffee addict
Female | I'm not a model, I'm far from pretty. But I'm myself, and that's more beautiful than anything else.
Female | redactrice reportage en muziek voor @Flairbelgië / medewerker @StepsMagazine / altijd honger
Female | moeder van 4 kinderen, verpleegster, docent aan de vlaamse trainersschool. atletiektrainster en looptrainster Cercle Brugge, fan wielrennen en veldrijden
Female | Life is what you make of it / journalist / loves food, fashion, arts, radio, television ...
Male | Vader, echtgenoot, zakelijke dienstverlening, meteorologie, astronomie, politiek, neerlandistiek en thuis in Nederland-Vlaanderen.
Male | Dad / Soccerplayer / Outdoorlover
Male | runner fietser ex-radiomaker papa echtgenoot arbeider... Generation Connect
Male | CEO VisitFlanders - Administrateur-generaal Toerisme Vlaanderen
Male | Cityphile - Hostel Worker - History Geek - Photo & Video Addict - Book Lover - Museum Wanderer - European

Table 4.5: Examples of tweets of random users.

Gender | Tweet
Female | Rare voetballers toch die Japanners met geblondeerde haren #beljap
Female | Pff die reclame altijd #tvvv
Female | Really wanna go to Tunesia...:(
Female | Smsse whoopwhoop #Ilovemonday #Smsopschool
Female | Het is zondag! Om hoe laat ben ik wakker.. 8:40!!
Male | Ochtendspits viel mee ..
Male | Demichelis is wa zieksnel op fifa (38) :):):)
Male | Lukaku heeft nu echt ne volledigen bueno verdient
Male | Eten en studeren ..
Male | Honger

In Table 4.6, the usernames and descriptions of the five accounts most followed by the females in our dataset are shown. The same can be found for friends of males in Table 4.7. VTM is left out because most users were harvested as followers of VTM (and VTM would therefore be number one for both females and males).

Table 4.6: The five accounts that are followed most by females.

# | Username | Description
1 | Astridbryan | TV Personality, Model, Actress, Producer https://t.co/V4XdMslYfq
2 | Kurt Rogiers | acteur, ochtend-DJ Q-music, presentator #SODD, gek op zijn 3 blonde meisjes, dvd-series, Philip Glass, James Bond, bomen en beesten
3 | AnLemmens |
4 | BarackObama | This account is run by Organizing for Action staff. Tweets from the President are signed -bo.
5 | katyperry | LET THE LIGHT IN. PRISM. OUT NOW!

Table 4.7: The five accounts that are followed most by males.

# | Username | Description
1 | vrtderedactie | deredactie.be is de nieuwssite van VRT, de openbare omroep in Vlaanderen.
2 | destandaard | De officiële Twitter-account van de krant De Standaard.
3 | stubru | Life is music
4 | tomwaes |
5 | SvenOrnelis | Ochtend- en party-DJ van Q-Music. Werkt bij Medialaan. Columnist HLN. Foodie. Wandelaar. Levensgenieter. Glas is half vol maar schenk gerust bij!

4.3 Languages Most Often Used by the Users in the Dataset

In our dataset, 64.9% of the users tweeted most often in Flemish/Dutch, 32.15% primarily in English, 1.3% mainly in French, and 1.65% essentially in another language. This is portrayed in Figure 4.2. To calculate these numbers, the Java language detection library of Shuyo(2010) was used. For each user in the dataset, a maximum of 200 non-retweets was gathered; this way, tweets which were not actually written by the user are not taken into account. Every tweet was examined by the detector of the aforementioned library, and the language that was detected most often in this set of tweets was used for the final count. A sketch of this procedure is given below.
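A minimal sketch of this per-user majority vote, using the Python port of the detector (the 'langdetect' package, pip install langdetect); the dissertation itself used the original Java library with the short-text profiles, and the example tweets are merely illustrative:

from collections import Counter
from langdetect import detect, DetectorFactory

DetectorFactory.seed = 0  # make the detector deterministic between runs

def dominant_language(non_retweets):
    # Detect the language of each of (at most) 200 non-retweets and return
    # the language that wins the majority vote, as described above.
    votes = Counter()
    for tweet in non_retweets[:200]:
        try:
            votes[detect(tweet)] += 1
        except Exception:  # e.g. a tweet consisting of only a URL or emoticon
            votes["other"] += 1
    return votes.most_common(1)[0][0] if votes else "other"

print(dominant_language(["Het is zondag!", "Eten en studeren", "Ochtendspits viel mee"]))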

It should be noted that the language detector does not work correctly for every single tweet. While the profiles in 'profiles.sm' were used by the language detector (these are the profiles trained especially for short texts), very short tweets were sometimes classified incorrectly. Also, tweets consisting of just a URL or an emoticon led to the 'other' language. Flemish tweets often contain many words written in a certain dialect, which also makes it much harder for the detector to analyze these messages. Lastly, as mentioned before in Subsection 4.1.1, there is no standard way to collect Flemish users. Even though accounts which were thought not to be of a Flemish person were removed with great care, it is possible that there are still some accounts of people who are not Flemish.


Figure 4.2: Pie chart of the languages used by the users in the dataset (Dutch 64.9%, English 32.15%, other 1.65%, French 1.3%).

4.4 Conclusion

Different aspects of the collected dataset were discussed in this chapter. We saw how the Twitter Search API was used to harvest users. To limit the dataset to Flemish Twitter users, followers of 'VTM', 'De Standaard', and 'Eén' were collected. These are the Twitter accounts of two Flemish TV stations and a Flemish newspaper. We also saw how these Twitter users were classified manually using a webpage. A total of 8791 users were collected: 4826 females and 3965 males. Almost 80% of them have a description. We also saw the average number of non-retweets posted by several subsets of the dataset. Next, we discussed the different attributes of a user from which features will be extracted. These features will be counted to create the vectors which form the input to the ensemble. The attributes that are used are the username, the name, the description, the tweets, and the friends of the user. Finally, we illustrated the distribution of the most used languages among the users in the dataset. It is conspicuous that a large percentage of the users, 32.15% to be exact, posts mostly English tweets; 64.9% of the users tweet mostly in Flemish. The remaining users tweet mostly in French or another language. We noted that users who mostly just post URLs would also be classified as users who post primarily in another language. In Chapter 5, we will see how the attributes discussed in this chapter are transformed into features.

Chapter 5

Feature Extraction

In Section 4.2, the different attributes of a user that are considered to help predict the gender of users were explained. In this chapter, we explain how each of these attributes is converted into features. Features are measurable properties of the instances that are to be classified, and they should be as discriminating as possible. Machine learning techniques will try to find patterns in the values of the features for each class. Consequently, it is these features that will aid in the prediction process. Each attribute is explained in a separate section.

5.1 Username

The actual username of a user is not going to impart anything by itself, because every username is unique. Having a list of usernames and their corresponding genders is therefore not going to be helpful when trying to classify other Twitter users based on their username. But when different substrings are obtained from the username, we have features that also occur for other users. These substrings are character-based n-grams. All bigrams in 'the gender', for example, are 'th', 'he', 'e ', ' g', 'ge', 'en', 'nd', 'de', and 'er'. Analogously, the trigrams consist of all consecutive character sequences of length three.

Some users have not given a name, as elucidated in Section 4.2.2. For these users, the name attribute will not provide any advantageous information. It is, nonetheless, possible that the username contains the first name of the user. Besides containing the first name, the username can also include other useful features. For example, in our dataset, 90.7% of the users that have ‘xx’ in their username are females. Of all users with ‘x3’ embedded in their username, 96.2% are females.

We have extracted all bigrams, trigrams, and four-grams of each username.
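A sketch of this extraction is shown below; treating usernames case-insensitively and the example username itself are our own assumptions for the illustration:

def char_ngrams(text, n):
    # All consecutive character substrings of length n.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def username_features(username):
    # Bigrams, trigrams, and four-grams of the username, as described above.
    username = username.lower()  # assumption: case is ignored
    return [g for n in (2, 3, 4) for g in char_ngrams(username, n)]

print(char_ngrams("the gender", 2))
# ['th', 'he', 'e ', ' g', 'ge', 'en', 'nd', 'de', 'er']
print(username_features("els_x3"))  # hypothetical username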

5.2 Name

To extract features from the name of a user, the name is first transformed:

1. Leading and trailing whitespace is removed.
2. All uppercase letters are translated into lowercase letters.
3. All accented letters are translated into the same letter without an accent.
4. All characters that are not a letter or a space are removed.

By doing this, someone with the name 'FREDERIK' will be considered to have the same name as someone with the name 'Fréderik'. Next, everything but the first token is discarded. The first token contains all the letters until the first space, or until the end of the name if there is no whitespace. Since a user is free to enter anything at all as a name, as long as he enters at least one character, we cannot be certain that the first token is the first name of the user. For obvious reasons, the last name of a user does not reveal any information about the gender of the user (at least for the majority of Flemish Twitter users). Based on a manual review of the names in the dataset, it was therefore assumed that the first token is the first name of the user. The first token is the first feature extracted from the name. From this assumed first name, other features were extracted as well, in a way similar to the feature extraction for usernames: all trigrams and four-grams of the aforementioned feature were acquired.
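The four transformations and the first-token extraction could be sketched as follows; the exact character handling in the dissertation's implementation may differ:

import re
import unicodedata

def normalized_first_token(name):
    name = name.strip()                           # 1. trim whitespace
    name = name.lower()                           # 2. lowercase
    name = unicodedata.normalize("NFKD", name)    # 3. split accents off letters
    name = "".join(c for c in name if not unicodedata.combining(c))
    name = re.sub(r"[^a-z ]", "", name).strip()   # 4. keep letters and spaces
    return name.split(" ", 1)[0] if name else ""  # keep the assumed first name

print(normalized_first_token("  FRÉDERIK Van den Berg "))  # -> 'frederik'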

5.3 Description

In Figure 5.1, we see an example of the available profile information of one of the users in our dataset.

Figure 5.1: Example of the available profile information of one of the users in our dataset.

This user has a description from which some very obvious clues as to her gender can be extracted. For example, 'mama' means mother; this is a very important feature for this user. In a part of the description, however, we find 'Mama,boekenverslindster,fotografie,kerkhovenliefhebster,schoenenverslaafd'. These words are all glued together as if they were one word. This is why we need to perform some transformations on the description to be able to recognize these very helpful clues. The following transformations are performed:

1. URLs are removed.
2. All uppercase letters are translated into lowercase letters, because we do not want words that are capitalized differently, but are otherwise the same, to count as different features.
3. '^^' is separated from other characters, to be able to count the emoticon separately from surrounding words.
4. Question marks and exclamation points are removed; we do not want 'hello?' and 'hello' to count as different features.
5. All exotic characters are removed. This limits the number of features somewhat, and these characters are only used by a limited number of users anyway.
6. All hearts are separated from other characters, so that hearts can be counted separately from the surrounding words.
7. All repeating periods are moved away from neighboring words because, again, we want to be able to count the words separately.
8. All hashtag symbols are removed, because '#verkiezingen' and 'verkiezingen' should count as the same feature: they show the same interest of the user.
9. All referrals to other users (e.g. '@BarackObama') are removed, because these would otherwise yield a lot of features that are used by only a few people.
10. All commas are removed.
11. All tokens of length one consisting of '/' or '#' are removed.
12. An 's' at the end of a word is removed, to make sure that a word made plural by adding an 's' is not counted as a different feature than its singular form.
13. The apostrophe is removed unless it is followed by t, k, m, n, or s and whitespace. In the phrase " 'This is an example' ", we do not want " 'This " to be a feature; instead, we want 'this' to be a feature. We do, however, want to know whether the user shortens some of his or her words: for example, ''n' is the shortened version of 'een'.
14. All arrows of any length are removed.
15. '(' is removed if it follows a character that is not considered punctuation.
16. Numbers, hours, distances, etc. are removed.
17. Punctuation at the end or beginning of a series of letters is removed.
18. All accented letters are translated into the same letter without an accent.

19. All tokens of length one are removed.
20. All leading and trailing whitespace is removed.

All of these transformations are meant to exclude unimportant features, to unify some tokens so that they map onto the same feature, and to separate some characters so that they can be counted as distinct features where necessary. From the resulting phrase, all unigrams and bigrams based on tokens are extracted. Bigrams are meant to capture features such as 'father of', 'my girlfriend', etc.

In the case of the quoted part of the example, the letters are all translated to lowercase, and the commas are deleted. We then tokenize the remaining description by cutting the phrase wherever there is whitespace. The quoted part of the description would, for instance, lead to the features 'mama', 'boekenverslindster', 'fotografie', 'kerkhovenliefhebster', and 'schoenenverslaafd'. As was just stated, the corresponding bigrams are counted as features as well. A sketch of some of these transformations is given below.
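The sketch below implements a handful of the twenty rules (URL removal, lowercasing, hashtag and mention stripping, comma handling, plural-'s' removal, and dropping one-character tokens); it is an approximation for illustration, not the full rule set:

import re

def clean_description(text):
    text = re.sub(r"https?://\S+", "", text)   # rule 1: remove URLs
    text = text.lower()                        # rule 2: lowercase
    text = re.sub(r"#(\w+)", r"\1", text)      # rule 8: '#verkiezingen' -> 'verkiezingen'
    text = re.sub(r"@\w+", " ", text)          # rule 9: drop referrals to users
    text = text.replace(",", " ")              # rule 10: commas become token boundaries
    tokens = text.split()
    tokens = [t[:-1] if t.endswith("s") and len(t) > 2 else t
              for t in tokens]                 # rule 12: strip a plural 's'
    return [t for t in tokens if len(t) > 1]   # rule 19: drop 1-char tokens

print(clean_description("Mama,boekenverslindster,fotografie"))
# -> ['mama', 'boekenverslindster', 'fotografie']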

5.4 Tweets

Features are extracted from tweets in two different ways. The first way looks at the words and emoticons a user utilizes and will be explained in Section 5.4.1. The second approach looks at specific emoticons, specific URL domains, and several style attributes employed by the user as will be clarified in Section 5.4.2.

5.4.1 Features based on content

To extract features from tweets, all tweets of the same user which are not retweets are glued together as if they were one long text. This text is transformed in the same way as descriptions (see Section 5.3), after which several Flemish and English stop words are removed. All remaining tokens and all bigrams based on words are features. Every occurrence of each feature is then counted.

5.4.2 Features based on style

In the second approach, different features are considered. Firstly, any URLs found in a tweet are expanded. URLs in tweets are usually shortened to save space (remember that a tweet is limited to 140 characters); they therefore take forms like 'ow.ly/i/5v0JA', 'goo.gl/fb/nC3fu', and 'http://bit.ly/1ipbJCf'. To find out to which domain the URL leads, the URL has to be expanded. To do this, an HTTP connection is made to the shortened URL; the full URL can then be found in the header field 'location' of the response. The domain can finally be extracted from the expanded URL.
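A sketch of this expansion with Python's standard library follows; here urllib follows the redirect chain for us and exposes the final URL, rather than reading the 'location' header manually as described above:

from urllib.parse import urlparse
from urllib.request import urlopen

def expand_domain(short_url):
    # Follow the redirects of a shortened URL and return the final domain.
    with urlopen(short_url, timeout=10) as response:
        return urlparse(response.geturl()).netloc

# Requires network access; the shortened URL is the example from the text.
# print(expand_domain("http://bit.ly/1ipbJCf"))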

Next, a specific set of emoticons is counted per tweet. If an emoticon appears more than once in a tweet, it is only counted once. The aforementioned set contains ':)', ';)', '=)', '(:', '(;', '(=', '-_-', '^^', '^-^', '^_^', ':p', ':d', ":')", ':/', ':\', ':s', ':-)', ';-)', ':-p', ':-/', ':-\', and '<3'. Afterward, punctuation is checked; this, again, is counted at most once per tweet. The following occurrences are checked: a single '!', more than one '!', a series of '!' and '?' starting with '!', a series of '!' and '?' starting with '?', a single '?', more than one '?', and words that are written completely in uppercase letters. At most 2000 tweets are examined per user, solely to save time: the retrieval of the domain takes a while due to the HTTP connection that has to be made.

5.5 Friends

As mentioned in Subsection 4.2.5, features are extracted from the friends of a user by looking at the description of each friend. If a friend has a description, the features are extracted in the same way as in Section 5.3. The counts for each description are added together; it is as if the method of Section 5.3 were performed on one long text consisting of the descriptions of all of the user's friends.

5.6 Conclusion

In this chapter, we took the attributes discussed in Chapter 4 and saw how they are translated into features. Some features consist of n-grams; these n-grams can be based on either characters or tokens, depending on the feature. Many other occurrences are counted as well, such as emoticons and punctuation. For tweets, we not only look at the content, but go as far as expanding the URLs and counting the number of times a user posts links to a certain domain. All features will be counted to create the vectors needed to represent a user. The creation of these vectors is discussed in Chapter 6.

Chapter 6

Data Representation

In Chapter 5, we saw how different features are extracted from different attributes of a user. In this chapter, we see how these features are brought together in one data representation, called a vector. We explain what a vector is in Section 6.1. In Section 6.2, we see how the features that make up the vector are selected. The actual creation of the vectors is clarified in Section 6.3.

6.1 Vectors

To be able to classify the users, vectors need to be constructed that represent them. A user is represented by a different vector for each kind of classifier, as each classifier looks at different features of the user. A created vector $\vec{x}_j = (x_1, x_2, \ldots, x_n)$ has $n$ elements if $n$ features are being counted. $x_i$ corresponds to feature $i$. It is zero if the feature does not occur for a particular user $j$. If the feature does occur for the user, $x_i$ is the percentage of tweets or friend descriptions (depending on the kind of vector) in which the feature appears. It can also simply be the number of times the feature occurs, if the vector is created for something the user has only once, such as a name. The values will be normalized later on by the classifier.

The vectors are the input to the classifier. The classifier will try to recognize patterns in the values of the vectors. These patterns will help the classifier predict the gender of new users.

Example. Let us look at an example in which the only three features that were selected for the tweet content classifier are '<3', 'xx', and 'love'. Twitter user A has posted only two tweets: 'I <3 you' and 'Ik zie je morgen xx <3'. In this case, user A's vector for tweet content would be $\vec{x}_A = (\text{<3}, \text{xx}, \text{love}) = (1, 0.5, 0)$: the user has utilized the feature '<3' in 100 percent of his or her tweets, the feature 'xx' in 50% of his or her tweets, and has never posted a tweet containing the feature 'love'.

39 6.2 Selection of Features

Because the dataset contains many different languages and dialects, there is an abundance of different features. Our classifiers cannot look at every single feature, because they cannot effectively find patterns when looking at millions of features. For this reason, we need to select the features whose presence contributes most to classifying the dataset correctly. The selection of features is performed in several steps, each of which is explained in the subsequent subsections. The process is illustrated in Figure 6.1.

6.2.1 Transformation

Each input is transformed as described in Chapter 5. This is done by a particular transformer, depending on which kind of vector is to be created. The goal of this step is to make it easier to count features: it makes sure some tokens are mapped onto the same feature. For example, in tweets, tokens that are made plural by adding an 's' at the end are mapped onto the same feature as their singular form. Unnecessary features are removed by the transformation as well; for example, times (e.g. '4h30') are deleted from tweets. This also limits the number of possible features somewhat.

6.2.2 Tokenization

Next, every transformed input is tokenized, as was also explained in Chapter 5. This comes down to cutting the transformed text into pieces wherever there is whitespace. This step also gathers the n-grams if necessary. The outcome of this step is the set of actual features.

6.2.3 Feature Selection

Tokens are counted for all inputs of all users. The resulting frequencies are stored in a file. The final count will portray how many users have a certain feature.

Next, the mutual information (MI) of every feature is calculated with Equation 6.1, as explained in Press(2009). This step is necessary because the total number of possible features runs into the millions: the machine learning techniques would not be able to draw conclusions from so many features, and they would take up too much memory. Furthermore, there are many unique features that occur for only one user; these features are not useful. The mutual information of a feature is the expected value of the PMI (Equation 2.1) over all possible outcomes.

Later, when we need N features, we select the N features with the highest mutual information. The features are written to '(classifiername)_ordered.txt' in descending order of their mutual information.


Figure 6.1: Data flow for feature selection (input → transformer → tokenizer → feature selector → frequency file and ordered file).

\[
\frac{N_{11}}{N}\log_2\frac{N \cdot N_{11}}{N_{1.}\,N_{.1}}
+ \frac{N_{01}}{N}\log_2\frac{N \cdot N_{01}}{N_{0.}\,N_{.1}}
+ \frac{N_{10}}{N}\log_2\frac{N \cdot N_{10}}{N_{1.}\,N_{.0}}
+ \frac{N_{00}}{N}\log_2\frac{N \cdot N_{00}}{N_{0.}\,N_{.0}} \tag{6.1}
\]

In Equation 6.1, $N_{11}$ is the number of females that have the particular feature, and $N_{01}$ is the number of females that do not. $N_{10}$ is the number of males that have the particular feature, and $N_{00}$ is the number of males that do not. This means that $N_{11} + N_{01}$ is the total number of females, and $N_{10} + N_{00}$ is the total number of males. Analogously, $N_{1.} = N_{11} + N_{10}$ is the number of users that have the feature, $N_{0.} = N_{01} + N_{00}$ the number that do not, and $N$ the total number of users.

Each term in the equation weighs the probability that a user has a certain feature and belongs to a certain class by the discrepancy between that joint probability and the probability expected if class membership and feature occurrence were independent.
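A sketch of this computation is given below; the counts in the example are hypothetical:

import math

def mutual_information(n11, n01, n10, n00):
    # Mutual information of a feature with the gender classes (Equation 6.1).
    # n11/n01: females with/without the feature; n10/n00: the same for males.
    n = n11 + n01 + n10 + n00
    n1_, n0_ = n11 + n10, n01 + n00  # users with / without the feature
    n_1, n_0 = n11 + n01, n10 + n00  # total females / total males
    mi = 0.0
    for nij, ni_, n_j in ((n11, n1_, n_1), (n01, n0_, n_1),
                          (n10, n1_, n_0), (n00, n0_, n_0)):
        if nij > 0:  # a zero count contributes nothing in the limit
            mi += (nij / n) * math.log2(n * nij / (ni_ * n_j))
    return mi

# Hypothetical feature: 900 of 4826 females use it, against 100 of 3965 males.
print(mutual_information(900, 3926, 100, 3865))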

41 6.3 Vector Creation

After ordering the features based on mutual information, vectors can be created. This again takes a couple of steps, each of which will be explained in the coming subsections. These same steps are shown in Figure 6.2.

6.3.1 Load Dictionary with Features

In Section 6.2, a file '(classifiername)_ordered.txt' was created with the features in descending order of their mutual information. This file and N, the number of features required, are needed to load the dictionary with the N most valuable features. Of course, N features cannot be loaded if there are fewer than N features in the file; in that case, all available features are loaded.

6.3.2 Create Vector

The dictionary, which was filled in the previous step, is passed to the vector creator. Input is transformed and tokenized depending on the kind of vector that is to be created. The creator only counts the features found in the dictionary; all other tokens are disregarded. Each vector is stored in a file, with the features again in descending order of mutual information. This way, we can load vectors of any size: if we want to represent a user with N features, all we have to do is read the first N lines of this file (assuming that there are N features). A sketch of both steps is shown below.
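The two steps could be sketched as follows; the one-feature-per-line file layout and the plain whitespace tokenizer used in the example are assumptions, as the real transformers and tokenizers differ per classifier:

def load_dictionary(ordered_file, n):
    # Read at most the n features with the highest mutual information.
    with open(ordered_file, encoding="utf-8") as f:
        return [line.split()[0] for line, _ in zip(f, range(n)) if line.strip()]

def create_vector(features, tweets, tokenize):
    # Fraction of the user's tweets in which each selected feature occurs,
    # as in the worked example of Section 6.1.
    counts = {f: 0 for f in features}
    for tweet in tweets:
        for token in set(tokenize(tweet)):  # count a feature once per tweet
            if token in counts:
                counts[token] += 1
    return [counts[f] / len(tweets) for f in features]

# The example of Section 6.1, with a trivial whitespace tokenizer:
tweets = ["I <3 you", "Ik zie je morgen xx <3"]
print(create_vector(["<3", "xx", "love"], tweets, str.split))  # [1.0, 0.5, 0.0]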

6.4 Conclusion

In this chapter, we first explained how each user is represented: by a vector. Next, we saw how the features that will be counted to create the vectors are selected. The text is first transformed and then tokenized; the tokenizer's output consists of the features found in the text. All tokens are then counted over all users, and based on these counts, the mutual information of each feature is calculated. All features are ordered by mutual information, and the features with the highest mutual information are loaded into a dictionary. To create a vector, all of a user's tokens that are found in the dictionary are counted. We then form the vector $\vec{x}_j = (x_1, x_2, \ldots, x_n)$ as explained in Section 6.1. In Chapter 7, we will see how these vectors are used as inputs to the different classifiers.


Figure 6.2: Data flow of vector creation (the ordered file and N feed the feature selector, which fills the dictionary; the feature vector creator then transforms and tokenizes the input and produces the vector).

Chapter 7

Classification

In Chapter 4, we saw different characteristics of our dataset. Every member of this set has been manually classified. The objective now is to learn from these instances: when presented with a user from outside of this set, we want to be able to correctly classify the user as male or female, solely based on the combination of features he or she possesses. This competence is called generalization.

To achieve this, we start off with a subset of the dataset which we will call the training set. For each of these instances, we have created a vector in Chapter 6. The label of the instance, 'F' for female or 'M' for male, is the target. We will use two different machine learning techniques, namely Support Vector Machines (SVM) and Naive Bayes (NB). During training, the chosen technique will assemble a function y(x) which maps the vector x onto its predicted label.

The learning performed here is called supervised learning because we are starting off with instances that are labeled. If we were to take a batch of unlabeled vectors and try to decide the label of each of these by clustering them, then the learning technique would be called unsupervised. In the case of supervised learning, predicting new discrete labels is called classification (Bishop (2006b)). Predicting a value from a continuous range when having trained with labeled data is called regression.

We will explain SVM and NB in Section 7.1. In Section 7.2, we will learn what an ensemble is, and how the ensemble that was used in this master's dissertation was composed.

7.1 Used Machine Learning Techniques

In this section, we will see the two machine learning techniques that were used in this master's dissertation. Naive Bayes is explained in Section 7.1.1, and Support Vector Machines (SVM) are explained in Section 7.1.2. The model selection process used for the classifiers that use an SVM is also explained in Section 7.1.2.

7.1.1 Naive Bayes

Our first machine learning technique is called Naive Bayes. Naive Bayes is a simple machine learning technique, but it can still achieve accurate results in some situations. It considers the text as a bag of words, meaning that the order of the words in the text is irrelevant. Bayes' rule (Equation 7.1) forms the basis of the technique. The technique is called naive because it assumes that the occurrence of one feature is always independent of the others. To calculate the most likely class of an instance, maximum a posteriori (MAP) estimation is used (see Equation 7.5) (Jurafsky (20xx)). The class $c_{MAP}$, the class with the highest posterior probability, is the predicted class.

\[
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}
\tag{7.1}
\]

\[
\begin{aligned}
c_{MAP} &= \operatorname*{arg\,max}_{c \in C} P(c \mid d) && (7.2) \\
        &= \operatorname*{arg\,max}_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)} && (7.3) \\
        &= \operatorname*{arg\,max}_{c \in C} P(d \mid c)\, P(c) && (7.4) \\
        &= \operatorname*{arg\,max}_{c \in C} P(x_1, x_2, \ldots, x_n \mid c)\, P(c) && (7.5)
\end{aligned}
\]

No model parameters have to be determined for the Naive Bayes technique. However, many labeled instances are needed to train.
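The following sketch (our own, not the thesis code, which uses WEKA's Naive Bayes, see Section 7.3) shows the MAP rule of Equation 7.5 for count vectors, computed with log probabilities to avoid floating-point underflow:

// logPrior[c] and logLikelihood[c][i] are assumed to have been estimated
// from the training counts (with smoothing so that no probability is zero).
public final class NaiveBayesMap {
    public static int classify(int[] vector, double[] logPrior, double[][] logLikelihood) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int c = 0; c < logPrior.length; c++) {
            double score = logPrior[c];                   // log P(c)
            for (int i = 0; i < vector.length; i++) {
                score += vector[i] * logLikelihood[c][i]; // log P(x_i | c) per count
            }
            if (score > bestScore) {                      // arg max over classes
                bestScore = score;
                best = c;
            }
        }
        return best;                                      // e.g. 0 = 'F', 1 = 'M'
    }
}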

7.1.2 Support Vector Machines

Support Vector Machines (SVM) are our second machine learning technique. Determining the model parameters of an SVM is a convex optimization problem, meaning that any local minimum is also a global minimum.

The model parameters that have to be found are w and b in $y(x) = w^T \phi(x) + b$, where b is the bias and $\phi(x)$ transforms the feature vector to a higher-dimensional space, with the hope that in this space a good linear boundary can be found (Huynh (2008)). An SVM tries to find the boundary between the classes with the largest margin (Bishop (2006b)). The margin is the smallest distance between said boundary and any of the instances.

Because the classes will not be perfectly linearly separable, some instances will be classified incorrectly. These instances will have a certain slack $\xi_n$. An SVM will try to minimize Equation 7.6, with C > 0 the cost parameter which determines the trade-off between the slack variables and the margin (Bishop (2006b)). C is a regularization coefficient. See Figure 7.1 for an example of a two-class classification problem using an RBF kernel; the wide, black border is the margin.

\[
C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \lVert w \rVert^2
\tag{7.6}
\]
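For completeness, Equation 7.6 is minimized subject to the standard soft-margin constraints (our addition, following Bishop (2006b); the text above gives only the objective):

\[
\min_{w,\,b,\,\xi}\; C \sum_{n=1}^{N} \xi_n + \frac{1}{2} \lVert w \rVert^2
\qquad \text{subject to} \qquad
t_n \left( w^T \phi(x_n) + b \right) \ge 1 - \xi_n, \quad \xi_n \ge 0,
\]

where $t_n \in \{-1, +1\}$ is the class label of instance n.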

Figure 7.1: Example of a margin in a two-class classification problem using an RBF kernel. (Eigenvector.com(2011))

The SVM has two parameters, C and γ, which need to be optimized depending on the problem at hand. Their definitions can be found in Table 7.1. As we will see in Section 7.2, this master's dissertation uses several classifiers, some of which use an SVM. For each SVM, the C and γ parameters were determined by performing a grid search. The process of determining these parameters is called model selection. In the grid search, all combinations of C ∈ {1, 2, 3, ..., 16} and γ ∈ {0.00001, 0.0001, 0.001, ..., 100} were tested. 2-fold cross-validation (see the next paragraph) was performed for each combination of parameters. If the combination that gave the highest accuracy was not on the border of the grid, 10-fold cross-validation was performed for all surrounding parameter combinations. The combination that gave the highest accuracy in the end is the one that was fixed for further testing on other instances.

In X-fold cross-validation, the set is divided into X complementary subsets. The classifier is then trained with X − 1 of the subsets and tested on the remaining subset. The latter is repeated so that each subset has served as a test subset once. The final accuracy is the mean of the accuracies obtained for each subset.

Table 7.1: Definition of the parameters of an SVM with an RBF kernel. (Scikit-learn.org (2013b))

γ   A low γ means that one instance's influence does not reach far. A high γ means that one instance has a lot of influence.
C   Determines the trade-off between misclassification and overfitting (memorizing the training samples, poor generalization).
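The following sketch (our own, not the thesis code; it assumes WEKA and its LibSVM wrapper on the classpath, see Section 7.3) illustrates the coarse 2-fold stage of this grid search:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LibSVM;
import weka.core.Instances;

// Runs 2-fold cross-validation for every (C, gamma) combination in the
// ranges given above and keeps the best pair. The 10-fold refinement around
// the winner, described in the text, is left out of this sketch.
public final class GridSearch {
    public static double[] search(Instances data) throws Exception {
        double bestAcc = -1, bestC = 0, bestGamma = 0;
        for (double c = 1; c <= 16; c += 1) {
            for (double gamma = 0.00001; gamma <= 100; gamma *= 10) {
                LibSVM svm = new LibSVM();
                svm.setOptions(weka.core.Utils.splitOptions(
                        "-S 0 -K 2 -C " + c + " -G " + gamma)); // C-SVC, RBF kernel
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(svm, data, 2, new Random(1)); // 2-fold CV
                if (eval.pctCorrect() > bestAcc) {
                    bestAcc = eval.pctCorrect();
                    bestC = c;
                    bestGamma = gamma;
                }
            }
        }
        return new double[] { bestC, bestGamma };
    }
}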

7.2 Ensemble

In Section 7.1, we explained the different machine learning techniques that will be used by the classifiers. In this section, we will learn more about ensembles. In Section 7.2.1, we will clarify what an ensemble is, and in Section 7.2.2, we will see the structure of the ensemble of classifiers used in this master's dissertation.

7.2.1 What Is An Ensemble?

An ensemble is a set of classifiers in which the outputs of the different classifiers are combined to give a new prediction. The advantage of using an ensemble is that the accuracy obtained by the ensemble is often much higher than that of its components. The conditions that have to be met are that all components of the ensemble have an accuracy higher than 50% and that the components do not always make the same mistakes (Dietterich (2000)).
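As a worked illustration of why this helps (our own numbers, in the spirit of Dietterich (2000)): take three independent classifiers that are each correct with probability 0.7. A majority vote over them is correct whenever at least two of them are:

\[
P(\text{correct}) = 0.7^3 + 3 \cdot 0.7^2 \cdot 0.3 = 0.343 + 0.441 = 0.784,
\]

which is noticeably better than the 0.7 of any single component.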

7.2.2 Structure

In Figure 7.2, we can see that our ensemble takes the input of six different classifiers and generates its own prediction based on these inputs. This mechanism is called stacking (Gupta and Thakkar (2014)). As has been mentioned before, a specific classifier will only provide an output (input for the ensemble) if the user has the information needed to create the vector for that classifier. The six classifiers that provide an input to the ensemble are: a username classifier (Section 7.2.2.1), a name classifier (Section 7.2.2.2), a description classifier (Section 7.2.2.3), a tweet content classifier (Section 7.2.2.4), a tweet style classifier (Section 7.2.2.5), and finally a friend description classifier (Section 7.2.2.6). Even though we clearly have two discrete classes, 'F' and 'M' for female and male, we will employ regression. We have compared this output to the output obtained when classification is performed; using regression led to a gain of about 2% in accuracy.

48 By letting zero represent female and one represent male, the output of the regression problem is the percentage of certainty that a user is male. The certainty that a user is female is one minus the certainty that a user is male, as there are only two classes.

The ensemble takes these two certainties from each classifier as input. The ensemble adds up all certainties that a user is female and adds up all certainties that a user is male. The higher of the two sums decides the final label. This is the same as taking the average of all certainties.
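A minimal sketch of this averaging rule (illustrative, not the thesis code):

// Each available classifier contributes its certainty that the user is male
// (the regression output); the certainty that the user is female is the
// complement. Comparing the two sums is equivalent to thresholding the
// average male certainty at 0.5.
public final class Ensemble {
    public static char label(double[] maleCertainties) {
        double sumMale = 0.0;
        double sumFemale = 0.0;
        for (double p : maleCertainties) {
            sumMale += p;          // certainty that the user is male
            sumFemale += 1.0 - p;  // certainty that the user is female
        }
        return sumMale > sumFemale ? 'M' : 'F'; // highest total certainty wins
    }
}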

We have also tried to use the certainties of each of the classifiers to train a new classifier that would produce the output of the ensemble using a Support Vector Machine. This classifier would train on the certainties of each classifier component and then decide the label of the instance. This, however, resulted in a lower accuracy, so we opted for taking the average instead.

Figure 7.2: The structure of the ensemble: the output of each classifier (username, name, description, tweet content, tweet style, and friend description) serves as input to the ensemble.

7.2.2.1 Username Classifier

The username classifier uses vectors created as shown in Section 5.1. These are the only vectors that are always available, because the username is the only attribute we use for classification that every single user is required to have.

This classifier uses the Naive Bayes machine learning technique which was explained in Section 7.1.1. 3000 features were selected using Equation 6.1. These are the username-based features that had the most mutual information. The first twenty can be found in Table 7.2. The features shown in this table are clearly based on the name of the user. This explains why Table 7.2 has features comparable to those in Table 7.3. At positions 32 and 33 of the list of mutual information values, however, we find ' x' and 'xx' respectively. At the 47th position, we find 'love'. This indicates that the useful features in the username are not only based on the name of the user. Another reason why the username gives extra information is that sometimes the first name of a user can be found in his or her username but not in his or her actual name on Twitter.

Table 7.2: The twenty strongest features for usernames.

#   Feature   Gender
1   li        F
2   lie       F
3   ie        F
4   el        F
5   eli       F
6   ien       F
7   ne        M
8   ine       F
9   in        F
10  sa        F
11  e         F
12  elie      F
13  lien      F
14  anne      F
15  lin       F
16  ann       F
17  na        M
18  lo        F
19  a         M
20  aur       F

7.2.2.2 Name Classifier

The name classifier uses vectors created as shown in Section 5.2. These vectors are only created, and therefore used, if the user's name is not an empty string after transformation. This classifier uses the Naive Bayes machine learning technique which was explained in Section 7.1.1. 5000 features were selected using Equation 6.1. These are the name-based features that had the most mutual information. The first twenty can be found in Table 7.3.

Table 7.3: The twenty strongest features for names.

#   Feature     Gender
1   laura       F
2   lie         F
3   lisa        F
4   ine         F
5   lien        F
6   julie       F
7   bart        M
8   ien         F
9   kelly       F
10  lies        F
11  charlotte   F
12  eli         F
13  ellen       F
14  ann         F
15  annelies    F
16  eline       F
17  bert        M
18  amber       F
19  koen        M
20  jana        F

7.2.2.3 Description Classifier

Vectors used by the description classifier are created by counting features as was explained in Section 5.3. Vectors will only be made if the user has a description. Therefore, this classifier will only work for users with a description. This classifier uses the Support Vector Machine (SVM) machine learning technique which was explained in Section 7.1.2. The 'C' and 'γ' parameters were set to 4 and 10 respectively. These values were decided by performing a grid search. 2000 features were selected using Equation 6.1. These are the description-based features that had the most mutual information. The first twenty can be found in Table 7.4. From this top twenty list, it can, for example, be concluded that users mentioning the band 'One Direction' in their description are much more likely to be female. This does not mean that most of the dataset does so: the top list includes the most discriminating terms, and terms that are used often by both genders have a lower mutual information value.

In the third column of Table 7.4, we can see that the strongest features all hint at the user being female. This is similar to previous work by Burger et al. (2011), and will also be observed for the following classifiers. Some of the terms following closely include: 'dance', 'xx', 'fashion', 'studente', 'father', 'mama van', 'chocolate' and 'vader'.

Table 7.4: The twenty strongest features for description.

#   Feature        Gender
1   love           F
2   my             F
3   follow         F
4   you            F
5   me             F
6   directioner    F
7   life           F
8   to             F
9   one            F
10  i'm            F
11  back           F
12  girl           F
13  and            F
14  follow me      F
15  the            F
16  direction      F
17  follow back    F
18  belieber       F
19  one direction  F
20  :)             F

7.2.2.4 Tweet Content Classifier

Vectors used by the tweet content classifier are created by counting features as was explained in Section 5.4.1. Vectors will only be made if the user has posted tweets and if the user has not protected his or her tweets (this is never the case in our dataset, but this could happen in other cases).

This classifier uses the Support Vector Machine (SVM) machine learning technique which was explained in Section 7.1.2. The 'C' and 'γ' parameters were set to 4 and 0.1 respectively. These values were decided by performing a grid search.

2000 features were selected using Equation 6.1. These are the tweet content-based features that had the most mutual information. The first twenty can be found in Table 7.5. Remember that all tweets have been transformed to lowercase, meaning that ':d' is the same as ':D'.

Table 7.5: The twenty strongest features for tweet content.

#   Feature    Gender
1   <3         F
2   xx         F
3   love       F
4   love you   F
5   :(         F
6   so         F
7   haha       F
8   cute       F
9   happy      F
10  omg        F
11  :d         F
12  :)         F
13  girl       F
14  you're     F
15  follow me  F
16  friend     F
17  can't      F
18  please     F
19  beautiful  F
20  ooh        F

7.2.2.5 Tweet Style Classifier

Vectors used by the tweet style classifier are created by counting features as was explained in Section 5.4.2. Vectors will only be made if the user has posted tweets and if the user has not protected his or her tweets (this is never the case in our dataset, but this could happen in other cases). This classifier uses the Support Vector Machine (SVM) machine learning technique which was explained in Section 7.1.2. The 'C' and 'γ' parameters were set to 3 and 10 respectively. These values were decided by performing a grid search. 100 features were selected using Equation 6.1. These are the tweet style-based features that had the most mutual information. The first twenty can be found in Table 7.6. Note that '(^| )!{2,}[ $]' is a regular expression that matches a token consisting of at least two '!'s and nothing else. As '!'s were moved away from letters during transformation, the feature would also be counted for the tweet 'hello!!!'. '?[!?]*![!?]*' is a regular expression as well, this time describing a '?' followed by any number of '!' or '?' characters, then a '!', then again any number of '!' or '?' characters.

Table 7.6: The twenty strongest features for tweet style.

#   Feature           Gender
1   <3                F
2   :)                F
3   :d                F
4   ;)                F
5   (:                F
6   (^| )!{2,}[ $]    F
7   twitter.com       F
8   :p                F
9   ?[!?]*![!?]*      F
10  livestream.com    F
11  weheartit.com     F
12  :s                F
13  mtv.com           F
14  twitition.com     F
15  tumblr.com        F
16  instagram.com     F
17  media.tumblr.com  F
18  (;                F
19  ^^                F
20  sporza.be         M

7.2.2.6 Friend Description Classifier

Vectors used by the friend description classifier are created by counting features as was explained in Section 5.5.

This classifier uses the Support Vector Machine (SVM) machine learning technique which was explained in Section 7.1.2. The 'C' and 'γ' parameters were set to 4 and 0.1 respectively. These values were decided by performing a grid search.

5000 features were selected using Equation 6.1. These are the friend description-based features that had the most mutual information. The first twenty can be found in Table 7.7. Closely following features include: 'girl', 'boy', 'fashion', 'singer', 'xx', 'actor', and 'songwriter'.

Table 7.7: The twenty strongest features for friend descriptions.

#   Feature        Gender
1   you            F
2   my             F
3   me             F
4   follow         F
5   love           F
6   i'm            F
7   one            F
8   :)             F
9   to             F
10  one direction  F
11  direction      F
12  justin         F
13  life           F
14  directioner    F
15  so             F
16  on             F
17  the            F
18  just           F
19  be             F
20  now            F

7.3 WEKA

To perform the classification itself once the vectors were created, the Java library Waikato Environment for Knowledge Analysis (WEKA) was used. We chose WEKA because it is the best known tool for this purpose (Huynh (2008)). We have used it to perform the Naive Bayes learning, and we have utilized the LibSVM library through it. WEKA also facilitates cross-validation and the grid search process. When using Naive Bayes, no options were needed. For LibSVM, the following options were set (Chang and Lin (2014)):

• S (svm type) = 0 (C-SVC) (default)
• K (kernel) = 2 (Radial Basis Function (RBF) kernel: e^(−γ·|u−v|²))
• D (degree of kernel) = 3 (default)
• G (γ) depends on the specific classifier, see Section 7.2.2
• R (coef0) = 0 (default)
• N (nu) = 0.5 (default)
• M (cachesize) = 40
• C (cost) depends on the specific classifier, see Section 7.2.2
• E (epsilon, tolerance of termination criterion) = 0.001 (default)
• P (epsilon, in loss function) = 0.1 (default)
• B = 1 (to get probability estimates)
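As an illustration, such an option list can be passed to the LibSVM wrapper as a single string. The sketch below (our own, not the thesis code) uses the C = 4 and γ = 0.1 of the tweet content classifier; we assume the wrapper's -B flag (given without a value) corresponds to the "B = 1" probability-estimates setting above:

import weka.classifiers.functions.LibSVM;
import weka.core.Utils;

// Configures WEKA's LibSVM wrapper with the options listed above.
public final class SvmSetup {
    public static LibSVM tweetContentSvm() throws Exception {
        LibSVM svm = new LibSVM();
        // -S 0: C-SVC, -K 2: RBF kernel, -G: gamma, -C: cost,
        // -B: probability estimates (needed for the ensemble's certainties)
        svm.setOptions(Utils.splitOptions(
                "-S 0 -K 2 -D 3 -G 0.1 -R 0 -N 0.5 -M 40 -C 4 -E 0.001 -P 0.1 -B"));
        return svm;
    }
}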

7.4 Conclusion

In this chapter, we investigated the two machine learning techniques that are used by our ensemble: Naive Bayes (see Subsection 7.1.1) and SVM (see Subsection 7.1.2). Next, in Section 7.2, we learned what an ensemble actually is: a classifier that takes the output of a group of classifiers and uses those to provide a final label. We also described each classifier that produces an output for the ensemble. Finally, we discussed the different parameter settings of the classifiers in Section 7.3.

Chapter 8

Results

In this chapter, we will analyze the classifiers and the whole ensemble. The classifiers that produce the input for the ensemble are evaluated in Section 8.1. The ensemble itself is evaluated in Section 8.2.

8.1 Evaluating the Different Classifiers

In this section, the different classifiers that make up the ensemble are compared with each other. In Subsection 8.1.1 we will present the differences in accuracy depending on the size of the training set. The effect of changing the amount of features is illustrated in Subsection 8.1.2. Subsection 8.1.3 highlights the differences in accuracies obtained by each classifier.

8.1.1 Changing the Training Set Size

The effect of changing the training set size for each of the classifiers can be seen in Figure 8.1. These accuracies were obtained by performing 3-fold cross-validation on each classifier with different training set sizes. Note that with a training set size of, for example, 1000, we mean that there are actually 2000 instances that are used to train: 1000 females and 1000 males. The graph clearly demonstrates how a larger training set generally leads to a higher accuracy. This can be explained easily: when the classifier can train on more users, it can more effectively find patterns and similarities in its data. A bigger training set also tends to contain a bigger variety of instances, which makes the classifier more robust.

Note that the classifiers are each using a different amount of features. Each classifier uses the amount of features which was mentioned in Chapter 7.

Figure 8.1: The accuracies obtained by each classifier for an increasing training set size. (x-axis: training set size per gender, from 250 to 2,750; y-axis: accuracy in %.)

8.1.2 Changing the Amount of Features

In Figure 8.2, we see the effect of the amount of features used for each classifier. Adding more features has a bigger effect on some classifiers than on others. The name classifier, for example, keeps getting more accurate as more features are added. The description classifier's accuracy, on the other hand, stays rather constant. The accuracies were obtained by performing 2-fold cross-validation with a training set size of 1000.

From Figure 8.2, it is clear that the name classifier obtains the best results. It also brings to light that the friend description classifier obtains a rather high accuracy. When using more than 1000 features, it is second only to the name classifier. The reader is to be reminded that the concept of the friend description classifier has not, to the best of our knowledge, been used in any previous work.

Both Figure 8.1 and Figure 8.2 show that the description classifier by itself gets a rather low accuracy compared to the other classifiers. The reason for this is, of course, that not all users have a description, and therefore the classifier will not be able to classify every single user. The training set includes users who do not have a description. It was chosen to include these users anyway, so that the classifiers can be compared when trained using the exact same training set. In the next section, however, we will see that the description

classifier increases the accuracy of our ensemble.

Figure 8.2: The accuracies obtained by each classifier for an increasing amount of features. (x-axis: amount of features, from 250 to 2,750; y-axis: accuracy in %.)

As the training set was selected in such a way that it only includes users with at least 100 tweets, the tweet content classifier and the tweet style classifier were, in this case, able to classify each user.

8.1.3 Comparing the Classifiers

Figure 8.3 illustrates the accuracies of each classifier when trained with a training set of size 2500 (2500 females and 2500 males) and with the amount of features reported in Chapter 7. This clearly shows that the name classifier reaches the highest accuracy by far with 87.54%. The second highest accuracy is obtained by the username classifier with 78.86%. The tweet content classifier and the friend description classifier attain a very similar accuracy with 75.36% and 75.43%, respectively. The second lowest accuracy is obtained by the tweet style classifier with 66.34%, and the lowest is obtained by the description classifier with 65.74%. Remember that the description classifier can only be used for the less than 80% of users who have a description to begin with.

Figure 8.3: The accuracy of each classifier with a training set size of 2500 (2500 females and 2500 males) and the amount of features mentioned in Chapter 7.

8.2 Evaluating the Ensemble

8.2.1 Testing the Ensemble on the Testing Set

Subsection 8.1.1 and Subsection 8.1.2 showed the effect of changing the size of the training set and the amount of features for each classifier. The accuracies obtained there came from applying cross-validation. In this section, we will show the accuracy of the ensemble instead of each classifier separately, and we will do so using a separate testing set.

The dataset was described in Chapter 4, where the retrieval was explained and different characteristics of the set were presented. This dataset contains 8791 users. A subset of these was used in Section 8.1.1 and Section 8.1.2 to train. Our testing set is also a subset of the aforementioned dataset. All users are an element of either the training set or the testing set; the training set and the testing set are each other's complements. The division into sets was done randomly. The test set includes 1622 females and 1334 males. While more females are being tested, training occurred with an equal subset of the training set for both genders: 2500 females and 2500 males.

This time, a classifier is only used if the necessary information is available. For example, when classifying a user with no description, the ensemble will not use the description classifier. In the testing set, 77.77% of the users have a description. As has been mentioned before, all users have at least 100 tweets.

We have a set which was gathered in the same way as the training set, and we limited the users to ones who have a minimum amount of tweets (as was explained in Chapter 4). To make sure that the ensemble also works well on a completely independent set of users, we will test the ensemble with a completely new testing set in the next section.

To evaluate the ensemble, the accuracy is measured for just the tweet content classifier at first. In each of the subsequent stages, a classifier is added to the ensemble. The addition of each classifier (except one) leads to an accuracy gain. This is clearly depicted in Table 8.1. The addition of the tweet style classifier is the only case in which the accuracy dropped. The tweet style classifier is accordingly removed from the ensemble in all further evaluations.

By itself, the tweet content classifier reached an accuracy of 74.36%. The number increased to 75.14% when adding the description classifier. 5.85% in accuracy is gained when adding the friend description classifier. The username classifier, the only one which can be used for any user as every user has a username, induces a 1.52% increase. The accuracy of the complete ensemble is obtained by adding the name classifier. In the end, an accuracy of 89.28% is attained. We will measure several evaluation metrics in the following subsections; in each, a classifier is added to the ensemble as was done for Table 8.1.

In Subsection 7.2.1, we noted that an ensemble achieves higher results than its components if (1) all components reach an accuracy higher than 50% and (2) the components do not always make the same mistake. The first condition was already verified in Figure 8.3. To check the second condition, we calculated in how many cases all classifiers are incorrect at the same time. This happened for 21 of 2956 instances. On average, about 25.75% of the components of the ensemble are wrong. We can conclude that all components being wrong at the same time occurs very infrequently. Consequently, both of the mentioned conditions have been met by our ensemble.

8.2.2 Evaluation of a Growing Ensemble

In this section, we will discuss the confusion matrices for the ensemble as more classifier inputs are added. First, we look at the metrics for the ensemble that just includes the tweet content classifier. In Section 8.2.1, we have seen that the accuracy on the test set for this classifier alone is 74.36%. In Section 8.2.2.1, we will see the true and false positives for each class in the form of a confusion matrix. From these, we will calculate the precision, the recall, and the F1 score. We will do the same for the ensemble after adding the input of the description classifier in Section 8.2.2.2. After adding the friend description classifier, we will get the metrics shown in Section 8.2.2.3. The effect of the addition of the username classifier is depicted in Section 8.2.2.4. Finally, the metrics for the complete ensemble are presented in Section 8.2.2.5.

Table 8.1: The accuracy of the ensemble on the testing set as classifiers are added.

Classifiers in the ensemble    Accuracy
tweet content                  74.36%
+ description                  75.14%
+ friend description           80.99%
+ username                     82.51%
+ name                         89.28%

Before we get into the actual measurements, we will first explain the metrics. The precision measures which percentage of the instances that were classified into a particular class actually belong to that class (see Equation 8.1). To know the percentage of a class that was classified correctly, we calculate the recall (see Equation 8.2). The two previous measures are parameters of the F1 score, or F-measure (see Equation 8.3). It is a weighted average of its parameters used to evaluate the accuracy of the classification. The best score is reached when the F1 score is 1; the worst is reached at 0 (Scikit-learn.org (2013a)).

\[
\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}
\tag{8.1}
\]

\[
\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}
\tag{8.2}
\]

\[
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\tag{8.3}
\]
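As a small worked sketch (our own), the three metrics can be computed directly from the entries of a confusion matrix:

// For the female class of Table 8.2 below, for instance: tp = 1157 (females
// predicted female), fp = 293 (males predicted female), fn = 465 (females
// predicted male).
public final class Metrics {
    public static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);            // Equation 8.1
    }

    public static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);            // Equation 8.2
    }

    public static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2 * p * r / (p + r);                // Equation 8.3
    }
}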

8.2.2.1 Tweet Content Classifier

First, we look at the ensemble when it only contains the tweet content classifier. In Table 8.2, we see that 1157 females were classified as female, but 465 females were classified as male. Of the males, 1041 were classified correctly, and 293 were classified incorrectly. For the female class, this gives us a precision of 0.79 and a recall of 0.71 when rounded to the nearest hundredth. This leads to an F1 score of about 0.75. For the male class, a precision of 0.69 and a recall of 0.78 were reached. This gives an F1 score of 0.73. Females were classified more accurately than males. This could be explained by the fact that the features with the best mutual information were for the female class, as was depicted in Table 7.5.

Table 8.2: Confusion matrix for the tweet content classifier.

                            Predicted Class
                            Female    Male
Actual Class    Female      1157      465
                Male        293       1041

8.2.2.2 Addition of the Description Classifier

In the next stage, the output of the description classifier is added to the input of the ensemble. In Table 8.1, we saw that this made the accuracy increase by 0.78%. 1220 females are now classified correctly, while 402 of them are still classified as male. 1001 of the males were classified as male, but 333 were classified incorrectly. We immediately see that the amount of incorrectly classified females has decreased by adding the description classifier, but the amount of incorrectly classified males has increased. An overview is given in Table 8.3.

The precision for the female class at this stage is 0.79 and the recall is 0.75. The F1 score is about 0.77. For the male class, the precision is 0.71 and the recall is 0.75, leading to an F1 score of 0.73. The F1 score for males is still the same as before, but the recall has decreased while the precision has increased.

63 Table 8.3: Confusion matrix for the ensemble with the tweet content classifier and description classifier.

                            Predicted Class
                            Female    Male
Actual Class    Female      1220      402
                Male        333       1001

8.2.2.3 Addition of the Friend Description Classifier

In the third stage, the friend description classifier is added to the ensemble, giving us three classifier inputs in total. We now have an accuracy of 80.99%. In Table 8.4, we can see that nearly 1300 of the 1622 females were classified correctly, as well as almost 1100 of the 1334 males.

The female precision is 0.85 and the recall is 0.80. This means that the F1 score has increased from 0.77 to 0.82. For the male class, a precision of 0.77 and a recall of 0.82 were reached, giving an F1 score of 0.79. The male F1 score increased by 0.06. Both the precision and the recall have increased for females and males after adding the friend description classifier.

Table 8.4: Confusion matrix for the ensemble with the tweet content classifier, description classifier, and friend classifier.

                            Predicted Class
                            Female    Male
Actual Class    Female      1298      324
                Male        238       1096

8.2.2.4 Addition of the Username Classifier

The accuracy of the ensemble increased by 1.52% after adding the username classifier. Table 8.5 discloses that 1442 females are now classified correctly, leaving only 180 females that were classified incorrectly. The accuracy for the males has decreased: almost 1000 males are classified correctly, but 337 males are classified as female.

When calculating the precision for the female class at this stage, we obtain 0.81, and a recall of 0.89. The F1 score in this case is almost 0.85. For the male class, we obtain a precision of 0.85 and a recall of 0.75. We still have an F1 score of 0.79.

64 Table 8.5: Confusion matrix for the ensemble with the tweet content classifier, description classifier, friend classifier, and username classifier.

                            Predicted Class
                            Female    Male
Actual Class    Female      1442      180
                Male        337       997

8.2.2.5 Addition of the Name Classifier: The Complete Ensemble

By adding the name classifier, we have finally built the complete ensemble. We now have a total accuracy of 89.28%. This is a gain of almost 15% compared to when we just used the tweet content classifier. Both males and females are classified more accurately. More than 1500 females were classified correctly, and more than 1100 males were as well. For an overview, see Table 8.6. The precision is now close to 0.88 for females, and the recall is almost 0.94 for the same class. The F1 score for the female class is therefore about 0.91, which is the highest of all stages. For the male class, the precision is 0.92 and the recall is 0.84. The F1 score is 0.88, which is also higher than in all other stages.

Table 8.6: Confusion matrix for the complete ensemble.

                            Predicted Class
                            Female    Male
Actual Class    Female      1518      104
                Male        213       1121

8.3 Testing the Ensemble on a Completely New Set of Users

In Section 8.2.1, we tested the ensemble on the test set. We reached an accuracy of almost 90%. The test set, however, was retrieved in exactly the same way as the training set: they are friends of popular Flemish Twitter accounts. To make sure that the ensemble achieves similar results on a completely new set of users, we perform two final tests. We will not only calculate the accuracy on two new test sets, but we will also compare the accuracies with those achieved by TweetGenie (http://www.tweetgenie.nl) (see Nguyen et al. (2013)).

In Chapter 2, we noted that this is the only work found that focusses on Dutch users. One of the main differences between TweetGenie and this master's dissertation is that they focus on Dutch people in general, while we focus on Flemish people. It is known that Twitter has an extremely high penetration in the Netherlands (TrendWatch (2013)): 92% of its inhabitants have a Twitter account. This is a lot compared to the 28% of Flemish people who have a Twitter account, even though this number is increasing. Dutch Twitter users are more active on Twitter, and they generally tweet more. Another major difference is that they classify users only based on their tweets, while we also use the user's description, name, username, and friends. This implies that there will be users that we can classify that TweetGenie cannot. One major advantage TweetGenie has is that when people ask it to classify their user account, they can choose to give TweetGenie their real gender. This means that TweetGenie has a big set of correctly classified users to train on.

The test on the first new set will be discussed in Section 8.3.1. This set is made up of random users. The second test set consists of all users who tweeted with the hashtag '#tvvv' on May 2nd, 2014. This test will be reviewed in Section 8.3.2. Both sets were collected using the Twitter Streaming API.

8.3.1 Test Set: Random

The test in this subsection is performed on a new test set. This test set was collected in a way to make the set as random as possible. To make it as certain as possible that we were retrieving Flemish users, we gathered users based on their tweeting about a Flemish Twitter trend. This included cultural trends, ‘Red Devil’ trends, and political trends. We amassed 250 users this way. From each of these 250 users, we collected one follower, giving us a total of 500 users, to make the set even more random. By doing this, we also increase the chance of gathering users who do not tweet often. We collected followers instead of friends because the chance of these being Flemish as well is bigger. The only users who were removed from the set were organizations and companies, as they obviously do not have a gender. Spammers were deleted as well. Other than that, no cuts were made. Users were not removed because they had never tweeted.

226 of the 500 users are female. The test set includes 274 males. 81.20% of the users have a description: 81.9% of the females and 80.66% of the males. In total, these 500 users had posted 844,232 tweets at the time of retrieval. That is an average of 1688.45 tweets per user and a median of 1347 tweets per user. The females have posted an average of 1686.10 tweets and the males an average of 1690.38 tweets. Seven users have never posted a single tweet. The rest of the distribution of the tweets can be found in Table 8.7.

Popular people in the set include ‘Cain Ransbottyn’, ‘Pieter Collier’, ‘Jani Kazaltzis’, ‘Bart De Waele’, ‘Sven Vanthourenhout’, and ‘Jean Marie Dedecker’.

66 Table 8.7: Distribution of tweets posted by the random test set.

# Tweets    Amount of Users
= 0         7
<= 10       18
<= 25       31
<= 50       46
<= 100      78
<= 500      166
<= 1000     225
<= 5000     494
>= 5000     6

8.3.1.1 Just the tweet content classifier versus TweetGenie

Because TweetGenie only uses tweets to classify, we will first compare the results of our tweet content classifier with the TweetGenie results. This means that we have a set of 493 users to classify, as there were seven without tweets. In this case, we classify 389 of the 493 correctly, which is 78.90%. TweetGenie classifies 405 of the users correctly, which is 82.15%. In this case, TweetGenie achieves better results.

8.3.1.2 Our ensemble versus TweetGenie

The test on this set was done separately for the users with no tweets and the others, as TweetGenie is unable to classify the former. Of the seven users with no tweets, 100% were classified correctly by the ensemble. Next, we classify the other 493 users and compare the results with the TweetGenie results. TweetGenie still stands at 82.15%. The ensemble classifies 453 users correctly, which is 91.89% of the users. This is just short of 13% more than our tweet content classifier attained and almost 10% more than TweetGenie.

8.3.2 Test Set: The Voice of Flanders (The Voice Van Vlaanderen)

Another new test set is collected for this second test. All the users in this set tweeted with the hashtag '#tvvv' on May 2nd, 2014. This hashtag is for the Flemish TV show 'The Voice of Flanders', which is 'The Voice Van Vlaanderen' (TVVV) in Flemish. This set also contains 500 users. Only spammers, organizations, and businesses were removed. 405 of the 500 users are female, leaving 95 males. 70.40% of the users have a description: 69.63% of the females and 73.68% of the males. In total, these 500 users had posted 633,253 tweets at the time of retrieval. That is an average of 1266.50 tweets per user and a median of 702 tweets per user. The females have posted an average of 1225.62 tweets and the males an average of 1440.78 tweets. Six users have never posted a single tweet. The rest of the distribution of the tweets can be found in Table 8.8. Compared to the previous test set, a large part of the users has posted fewer tweets. Well-known people in this set include: 'Elke Vanelderen', 'Axelle Red' and 'Tom Helsen'.

Table 8.8: Distribution of tweets posted by the TVVV test set.

# Tweets    Amount of Users
= 0         6
<= 10       41
<= 25       73
<= 50       102
<= 100      142
<= 500      226
<= 1000     279
<= 5000     500
>= 5000     0

8.3.2.1 Just the tweet content classifier versus TweetGenie

Because TweetGenie only uses tweets to classify, we will again first compare the results of our tweet content classifier with the TweetGenie results. This means that we have a set of 494 users to classify, as there were six users without tweets. This time, we classify 417 of the 494 correctly, which is 84.13%. TweetGenie classifies 427 of the users correctly, which is 86.44%. TweetGenie’s results are 2.31% better.

8.3.2.2 Our ensemble versus TweetGenie

The test on this set was done separately for the users with no tweets and the others, as TweetGenie is unable to classify the former. Of the six users with no tweets, 100% were classified correctly by the ensemble. Next, we classify the other 494 users and compare the accuracy with the accuracy of TweetGenie. TweetGenie still stands at 86.44%. The ensemble classifies 461 users correctly, which is 93.32% of the users. This is almost 9.20% more than our tweet content classifier attained and almost 6.90% more than TweetGenie.

Use case. This set can be interpreted as an actual use case. VTM may want to know what kind of people are watching The Voice of Flanders. Knowing the percentage of male and female viewers could be interesting when deciding which commercials to air during the show. In this case, the users were first collected and then manually classified, because a label was needed to measure the accuracy. It would, however, also be possible to classify these users right after collecting them. Of course, there will be some delay because of the time that is needed to collect the friends and tweets of the users using the Twitter API (and the waiting times that may be required to do this), as well as the creation of the input vectors. In case there is no waiting time for retrieval from Twitter, and the user does not have an extremely large number of friends and tweets, the time between when the user posts a tweet about The Voice of Flanders and the time when the user is classified could be a minute or less.

8.4 Conclusion

In this chapter, we evaluated the classifiers and the ensemble described in this master's dissertation in many different ways.

In Section 8.1, we analyzed the different components of the ensemble. We saw that, in general, increasing the training set size of each classifier increases its accuracy. We also illustrated that the effect of changing the amount of features that a classifier uses to train on, meaning changing the size of the vectors, depends on the classifier. For the name and the username classifier, the accuracy keeps increasing as the size of the vectors is incremented. For the friend description classifier, the accuracy increases in smaller steps. For the tweet content, tweet style, and description classifiers, the accuracy oscillates but stays about the same. We have also compared the results of each classifier when trained with 2500 females and 2500 males and with the amount of features that is used for each classifier in the final ensemble. We noted that the name classifier obtains the highest accuracy while the description classifier reaches the lowest accuracy.

The entire ensemble was evaluated in Section 8.2. To test the accuracy of the ensemble, we first tested it with a test set that is a subset of the dataset described in Chapter 4. We started off with the tweet content classifier, which reached an accuracy of 74.36% on this set. We then analyzed the accuracy, precision, and recall of the ensemble for each addition of a new classifier component. When all components were added, we obtained an accuracy of 89.28%.

Next, we tested our ensemble by harvesting two totally new test sets. One includes random users who have posted about a trending topic and their followers, and the other includes users who posted using the hashtag '#tvvv' on May 2nd of 2014. This was done to make sure a decent accuracy was also obtained on sets that were not harvested in the same way as the training set. We also compared our results with those of the most similar work found, TweetGenie. For the first set, our ensemble reached an accuracy of 91.89%, while TweetGenie obtained one of 82.15%. When testing with the second set (containing users who tweeted with the hashtag '#tvvv'), we attained an accuracy of

93.32%. This was almost 6.90% more than TweetGenie. In both cases, we were able to correctly classify 100% of the users with no tweets. These same users could not be classified by TweetGenie because TweetGenie only uses a user's tweets to predict his or her gender. Consequently, we can conclude that we have built an ensemble that is able to predict the gender of Flemish Twitter users more accurately than TweetGenie.

Chapter 9

Conclusion

9.1 Future Work

Even though the ensemble achieves accuracies between 89.28% and 93.32%, depending on the test set, some adjustments could still be made to make the ensemble even better. More rules and features could be added to the tweet content classifier. It should be studied whether taking character-based n-grams of the tokens in the tweets provides useful information. Also, it should be analyzed whether utilizing a user's retweets yields any additional information; retweets were never used in this work. To extract even more features, we have tried to divide the friends of the user into types, with the intention of using these types to classify the user. For example, if the user had a friend with 'actor' in his or her description, then we meant to put this user in a group of actors. We tried to do this by using the DBpedia Spotlight library (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki) to annotate the descriptions of the user's friends. This, however, did not work as was hoped. When trying to annotate the text 'Barack Obama is the president of the United States', for example, 'United' was annotated and mapped to 'United and Uniting churches'. This fact, along with the fact that the annotation process took an extremely long time when performed on many different descriptions, led us to the realization that this course of action was not feasible for this master's dissertation. Nevertheless, we still think that this approach could provide fruitful information. Future work would be more practical if another library were found that works better on short texts.

9.2 Conclusion

In the past, a lot of work has been done to try to predict different latent attributes of social media users, including the users' gender. Not much work has been done, however, that focusses on Dutch-speaking Twitter users; only one such work, Nguyen et al. (2013), has been found. No work has been found that concentrates on Flemish Twitter users. This is the case in spite of the fact that Ciot et al. (2013) show that the accuracy of classifiers depends on the language spoken by the instances that are to be classified. The highest accuracy for predicting the gender of Twitter users that has been found is 92% by Burger et al. (2011). This work cannot be generalized to the whole public because it only uses Twitter users who mention that they are bloggers. The second highest accuracy that has been found is that of Bamman et al. (2012). They achieve 88% for American Twitter users.

In this master's dissertation, we successfully built an ensemble of classifiers that classifies Flemish Twitter users into female and male groups. The ensemble is trained with 2500 females and 2500 males. These users were taken from our retrieved dataset consisting of 8791 Twitter users which were manually classified.

Flemish Twitter users were classified by extracting features from three major sources.

The first source is the profile information of the user. We use the name, the username, and the description of the user. For each of these values, we have built a different classifier that is a component of the ensemble. The component will only be used if the necessary information is available. For example, we have seen that in our dataset, fewer than 80% of the users have a description. The description classifier will not provide an input to the ensemble if the user has no description.

The tweets of the users are the second source of information we retrieve from the user. Tweets are the messages that a user posts that are limited to 140 characters. They are the only information that is used by Nguyen et al.(2013).

Our final source consists of the social network of the user. To classify a user, we use the descriptions of the user’s friends. No other work has been found that exploits this information.

For each piece of information mentioned, a vector was created. The vectors are a data representation of the user and will be the input to the ensemble. Their dimensionality is equal to the amount of features that are used by the classifier. For each classifier, we have limited this number because it cannot find patterns in millions of features. We have done this by calculating the mutual information of each feature and only using the features that are ranked the highest.

During training, the ensemble attempts to recognize patterns that are peculiar to each class. Afterwards, when new vectors are provided, the ensemble predicts the label of the user that the vector represents by finding these same patterns. The classifiers that are components of the ensemble use either a Support Vector Machine or Naive Bayes, depending on their source of information.

We first evaluated each component of the ensemble with a test set which is a subset of the first 8791 users that were retrieved. The test set includes 1622 females and 1334 males.

The name classifier achieved the highest accuracy with 87.54%. The username classifier obtained an accuracy of 78.86%. The friend description classifier was third with 75.43%. The tweet content classifier attained 75.36%. 66.34% was obtained by the tweet style classifier. Lastly, the description classifier achieved 65.74%. One is to keep in mind that the description classifier could only classify those users that had a description (which was less than 80% of the test set). These results already show that the descriptions of the friends of a user are indeed a fruitful source of information. This is one of the main things that we wanted to demonstrate through this master's dissertation. Next, we analyzed the entire ensemble. We analyzed the precision, recall, and F1 score for the tweet content classifier at first, and then for every step in which an additional classifier was added. The final ensemble obtained an accuracy of 89.28% on our test set. Because this test set was retrieved in exactly the same way as our training set, we decided to fetch two completely new test sets. One consists of users that tweeted about trending topics and, for each of those users, one of their followers picked at random. The second set consists of users who tweeted about The Voice of Flanders on May 2nd of 2014. Each of these sets contains 500 Flemish Twitter users. Our ensemble reached an accuracy of 91.89% on the first test set. This is almost 10% more than TweetGenie, Nguyen et al. (2013), achieves. The ensemble attained an accuracy of 93.32% on The Voice of Flanders test set. In this case, TweetGenie obtains an accuracy of 86.44%. It is very important to note that in the previous tests, thirteen users could not be classified by TweetGenie because they have never posted a single tweet. Our ensemble, however, was able to correctly classify all of them. We conclude by stating that we have reached all three of our goals: we have built a classifier that concentrates on Flemish Twitter users, we have compared our results with the most similar work, TweetGenie, and concluded that we obtained a higher accuracy, and we have proven that the usage of the descriptions of users' friends is beneficial when predicting the gender of a user.

References

Abel, F., Gao, Q., Houben, G.-J., and Tao, K. (2011). Analyzing user modeling on Twitter for personalized news recommendations. In Proceedings of the 19th International Conference on User Modeling, Adaption, and Personalization, UMAP'11, pages 1–12, Berlin, Heidelberg. Springer-Verlag.

Agrawal, H. (2011). Facebook status length: 63,206 characters. http://www.shoutmeloud.com/how-to-update-facebook-status-in-more-than-420-character.html.

Argamon, S., Koppel, M., Pennebaker, J. W., and Schler, J. (2007). Mining the blogosphere: Age, gender and the varieties of self-expression. First Monday, 12(9).

Bamman, D., Eisenstein, J., and Schnoebelen, T. (2012). Gender in Twitter: Styles, stances, and social networks. CoRR, abs/1210.4567.

Bishop, C. M. (2006a). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Bishop, C. M. (2006b). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA.

Bouma, G. (2013). Normalized (pointwise) mutual information in collocation extraction.

Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP '11, pages 1301–1309, Stroudsburg, PA, USA. Association for Computational Linguistics.

Chang, C.-C. and Lin, C.-J. (2014). LIBSVM – a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm/.

Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of Twitter users in non-English contexts. http://aclweb.org/anthology/D/D13/D13-1114.pdf.

Dietterich, T. G. (2000). Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1–15, London, UK. Springer-Verlag.

Digimeter (2013). Samenvatting van het rapport van wave 6 (augustus 2013 – september 2013) [Summary of the wave 6 report (August 2013 – September 2013)].

Eigenvector.com (2011). Svmda. http://wiki.eigenvector.com/index.php?title=Svmda.

Benevenuto, F., Magno, G., Rodrigues, T., and Almeida, V. (2010). Detecting spammers on Twitter.

Gupta, A. and Thakkar, A. (2014). Optimization of stacking ensemble configuration based on various metaheuristic algorithms. In Advance Computing Conference (IACC), 2014 IEEE International, pages 444–451.

Huynh, D. T. G. (2008). Human Activity Recognition with Wearable Sensors. PhD thesis, TU Darmstadt.

Java, A., Song, X., Finin, T., and Tseng, B. (2007). Why we twitter: Understanding microblogging usage and communities. In Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis, WebKDD/SNA-KDD '07, pages 56–65, New York, NY, USA. ACM.

Jurafsky, D. (20xx). Text classification and naive Bayes (Stanford University). http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html.

Liu, W. and Ruths, D. (2013). What's in a name? Using first names as features for gender inference in Twitter.

Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP '10, pages 207–217, Stroudsburg, PA, USA. Association for Computational Linguistics.

Nguyen, D., Gravel, R., Trieschnigg, D., and Meder, T. (2013). "How old do you think I am?" A study of language and age in Twitter. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM13/paper/view/5984.

Pennacchiotti, M. and Popescu, A.-M. (2011). A machine learning approach to Twitter user classification. In Adamic, L. A., Baeza-Yates, R. A., and Counts, S., editors, ICWSM. The AAAI Press.

Press, C. U. (2009). Mutual information. http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html.

Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-Generated Contents, SMUC '10, pages 37–44, New York, NY, USA. ACM.

Schler, J., Koppel, M., Argamon, S., and Pennebaker, J. W. (2005). Effects of age and gender on blogging.

Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Dziurzynski, L., Ramones, S. M., et al. (2013). Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE.

Scikit-learn.org (2013a). Model evaluation: quantifying the quality of predictions. http://scikit-learn.org/stable/modules/model_evaluation.html.

Scikit-learn.org (2013b). RBF SVM parameters. http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html.

Shuyo, N. (2010). Language detection library for Java.

TrendWatch, N. (2013). Netherlands. http://www.newmediatrendwatch.com/markets-by-country/10-europe/76-netherlands.

UTwente (2013). http://www.utwente.nl/en/newsevents/2013/5/164645/computer-guesses-age-and-gender-of-tweeters.

Wickre, K. (2013). Celebrating #twitter7.

Wien, U. (2014). Belgische en nederlandse uitspraakverschijnselen [Belgian and Dutch pronunciation phenomena]. https://dutch-beta.ned.univie.ac.at/node/55.

Zamal, F. A., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of Twitter users from neighbors.

List of Figures

3.1  An example of a retweet (p. 22)
4.1  Screen capture of the webpage used to manually classify users (p. 27)
4.2  Pie chart of languages used by users in dataset (p. 31)
5.1  Example of the available profile information of one of the users in our dataset (p. 35)
6.1  Data flow for feature selection (p. 41)
6.2  Data flow of vector creation (p. 43)
7.1  Example of a margin in a two-class classification problem using an RBF kernel (Eigenvector.com (2011)) (p. 47)
7.2  The structure of the ensemble: the output of each classifier serves as input to the ensemble (p. 49)
8.1  The accuracies obtained by each classifier for an increasing training set size (p. 58)
8.2  The accuracies obtained by each classifier for an increasing amount of features (p. 59)
8.3  The accuracy of each classifier with a training set size of 2500 (2500 females and 2500 males) and the amount of features mentioned in Chapter 7 (p. 60)

List of Tables

2.1  Overview of papers that try to predict the gender of social media users (p. 16)
4.1  Users harvested to train classifiers (p. 26)
4.2  Statistics of the harvested users (grouped by original account) (p. 26)
4.3  Statistics of the harvested users (grouped by gender) (p. 26)
4.4  Examples of descriptions of random users (p. 29)
4.5  Examples of tweets of random users (p. 29)
4.6  The five accounts that are followed most by females (p. 30)
4.7  The five accounts that are followed most by males (p. 30)
7.1  Definition of the parameters of an SVM with a RBF kernel (Scikit-learn.org (2013b)) (p. 48)
7.2  The twenty strongest features for usernames (p. 50)
7.3  The twenty strongest features for names (p. 51)
7.4  The twenty strongest features for description (p. 52)
7.5  The twenty strongest features for tweet content (p. 53)
7.6  The twenty strongest features for tweet style (p. 54)
7.7  The twenty strongest features for friend descriptions (p. 55)
8.1  The accuracy of the ensemble on the testing set (p. 62)
8.2  Confusion matrix for the tweet content classifier (p. 63)
8.3  Confusion matrix for the ensemble with the tweet content classifier and description classifier (p. 64)
8.4  Confusion matrix for the ensemble with the tweet content classifier, description classifier, and friend classifier (p. 64)
8.5  Confusion matrix for the ensemble with the tweet content classifier, description classifier, friend classifier, and username classifier (p. 65)
8.6  Confusion matrix for the complete ensemble (p. 65)
8.7  Distribution of tweets posted by the random test set (p. 67)
8.8  Distribution of tweets posted by the TVVV test set (p. 68)