Data Segmentation Using NLP: Gender and Age

UPTEC STS 21001
Degree project (Examensarbete), 30 credits (hp)
January 2021

Gustav Demmelmaier
Carl Westerberg

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH-enheten
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Natural language processing (NLP) makes it possible for a computer to read, decipher, and interpret human languages, and eventually to use them in ways that further the understanding of the interaction and communication between human and computer. When appropriate data is available, NLP makes it possible to determine not only the sentiment of a text but also information about the author behind an online post. Previously conducted studies show that aspects of NLP can potentially go deeper into the subjective information, enabling author classification from text data.

This thesis addresses the lack of demographic insight into online user data by studying language use in texts. It compares four popular yet diverse machine learning algorithms for gender and age segmentation. During the project, the age analysis was abandoned due to insufficient data. The online texts were analysed and quantified into 118 parameters based on linguistic differences. Using supervised learning, the researchers correctly predicted the gender in 82% of the cases when analysing data from English online users. The training and test data may be correlated, which should be kept in mind when interpreting this result. Language is complex and, in this case, the more complex methods, SVM and neural networks, performed better than the less complex Naive Bayes and logistic regression.
Supervisor: Frederique Pirenne
Subject reader: Matteo Magnani
Examiner: Elísabet Andrésdóttir
ISSN: 1650-8319, UPTEC STS 21001

Popular science summary

As the wave of the digital era sweeps across every corner of the world, people are becoming ever more connected. Saying that owning a smartphone is the rule rather than the exception can today even feel like an understatement; not owning one can come across as a generally deviant trait. As the number of connected people keeps growing, so do the interest in, and the opportunities for, analysing all the data this creates. Every social media post, every like, every message sent, every thought published in a tweet, even the download of this thesis generates data about what you do, where you are and your personal opinions. This data attracts many kinds of stakeholders, in industry as well as in academia.

Natural Language Processing (NLP) is a branch of Artificial Intelligence that studies human-written text and, among other things, the interaction between computer and human. It makes it possible for a computer to read, decode and interpret human language, and then to understand and use it, which furthers the understanding between human and computer. A frequently studied area within NLP is sentiment analysis, which aims to analyse and classify subjective information and opinions in human-written text and is currently practised commercially. Companies from a broad range of industries have seen the potential of sentiment analysis, and with the growing number of online users and the enormous amounts of data it attracts even more, above all companies specialised in marketing.

Users, or potential customers as marketers see them, leave traces wherever they go because they are connected, and some go further still by freely expressing their opinions on various platforms, unaware of the information they provide and of the real-time data collection taking place. This in turn creates opportunities for anyone, large companies in particular, to harvest large amounts of data and explore it to produce remarkably accurate and precise analyses. For the marketing industry this has resulted in a paradigm shift in which analogue market analyses are slowly disappearing in favour of more digitalised approaches.

It is nevertheless notable that previous academic studies have shown that other aspects of NLP can potentially go even deeper into the subjective information and let a computer determine who the author of a human-written text is, without any prior knowledge of the person's characteristics and without seeing the person. Demographic factors of the author, such as gender and age, have been successfully classified in academia from human-written text, that is, natural language. These studies, however, use text data from a single online source or platform, which yields data that is homogeneous in text length, structure and contextual attributes. A data collection drawing on a more varied set of sources could potentially create better conditions for classification and for statistically representing the demographic factors of a real population, and therefore also be more useful for commercial purposes.

This thesis was carried out in collaboration with the marketing technology company Graviz Labs and studies the lack of insight into the demographic factors of online users. Using computational linguistics, the language use of different authors is studied, and four machine learning algorithms are built and trained to classify the factors gender and age. During the project the age factor was abandoned due to insufficient data. Using supervised learning, in which the algorithms learn to replicate and generalise from given training data, the study predicts gender with an accuracy of 82% when analysing English-speaking authors. Setting aside a potential correlation in the study's data set, the results indicate that the study is competitive with international studies in terms of accuracy. The study also indicates future commercial potential for work with a similar design.

Acknowledgements

This thesis is the result of a project conducted at the analytics company Graviz Labs. We would like to thank the team at Graviz Labs, and especially our supervisor Frederique Pirenne, for the opportunity to carry out the project and for all the help and assistance along the way. We would also like to thank our subject reader Matteo Magnani, who contributed valuable input and significant feedback throughout. Thank you!

Distribution of work

This thesis project was created by Gustav Demmelmaier and Carl Westerberg. It was carried out in close collaboration between the two, and all parts of the thesis have been studied, written and reviewed jointly. Individual tasks were distributed periodically during the course of the project. When a task was completed or a problem occurred, it was reviewed at once to maintain the collaboration, solve problems and share information. In general this could be done immediately, since the majority of the work was conducted in the same office.
Pair programming and joint writing were practised frequently, and the overall work distribution of this thesis was practically 50/50.

Glossary

AI - Artificial Intelligence
API - Application Programming Interface
AUC - Area Under Curve
FPR - False Positive Rate
IAA - Inter-Annotator Agreement
MLP - Multi Layer Perceptron
NLP - Natural Language Processing
POS - Part of Speech
RBF - Radial Basis Function
ROC - Receiver Operating Characteristic
SVM - Support Vector Machine
TPR - True Positive Rate

Contents

1 Introduction
  1.1 Related work
    1.1.1 Supervised learning
    1.1.2 Natural Language Processing
    1.1.3 Sentiment analysis
    1.1.4 Demographics and linguistics
  1.2 Ethics in online user data
    1.2.1 Privacy in online data
    1.2.2 Gender as a variable
    1.2.3 Age as a variable
    1.2.4 Ethical framework for this research
    1.2.5 Ethical consequences of the project
  1.3 Research definition
  1.4 Disposition
2 Theory
  2.1 Machine learning in classification
  2.2 Classification models
    2.2.1 Logistic regression
    2.2.2 Support Vector Machine
    2.2.3 Naive Bayes
    2.2.4 Neural networks
  2.3 Natural language processing
    2.3.1 Feature classes
  2.4 Calibration methods
    2.4.1 Hyper-parameter tuning and weighting
  2.5 Evaluation
    2.5.1 Evaluation metrics
    2.5.2 ROC-curve
    2.5.3 Cross validation
3 Method
  3.1 Workflow
  3.2 Data
    3.2.1 Data collection
    3.2.2 Cleaning and pre-processing
    3.2.3 Labelling
    3.2.4 Parametrisation
  3.3 Machine learning
    3.3.1 Selection of classifiers
    3.3.2 Implementing the classifiers
    3.3.3 Hyper-parameter tuning and model calibration
  3.4 Limitations
4 Results
  4.1 Data collection
    4.1.1 Data sampling
  4.2 Machine learning models and hyper-parameter tuning
  4.3 Further tuning and comparisons
5 Discussion
  5.1 The data set
  5.2 Pre-processing and parametrisation
  5.3 Model comparison
  5.4 Validity
  5.5 Ethics
6 Future work
  6.1 Age
  6.2 Parameter weighing and tuning
  6.3 Pre-processing variations
  6.4 Demographics
  6.5 Other linguistic features
7 Conclusion and summary
8 References

1 Introduction

Data has often been described as the new oil. The value of data itself is now recognised throughout the world, just as the value of oil has long been. For data to be useful, though, the user needs to know how to use and interpret it [1]. Big data is today used in many fields, both commercial and non-commercial. Capitalising on data is progressively becoming a necessity for companies to survive and keep up with their competitors.
