<<

FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

PROGRAMA DOUTORAL EM ENGENHARIA INFORMÁTICA

Noise reduction and normalization of microblogging messages

Author: Gustavo Alexandre Teixeira Laboreiro

Supervisor: Eugénio da Costa Oliveira
Co-Supervisor: Luís Morais Sarmento

February 5, 2018

Dedicated to my loving family

Acknowledgements

First of all I would like to thank my family in loving recognition of their tireless support and understanding throughout this long endeavor. Such a work would not be possible without their help.

Further thanks are also warmly extended to my supervisor and co-supervisor, who provided knowledge, help, guidance and encouragement through all these years.

Special appreciation is also due to the teachers who provided valuable insight and opinions, which greatly enriched this work and to whom the author feels indebted; particularly Eduarda Mendes Rodrigues and Carlos Pinto Soares. I would also like to express my recognition for the help provided by the members of the Scientific Committee.

For many days several colleagues accompanied me in this academic journey (but also a life journey) and selflessly shared their time and friendship. To all of them I present an expression of gratitude. The friendly department staff is not forgotten, especially Sandra Reis, who always went beyond the line of duty to provide assistance in the hours of need.

The author would also like to acknowledge the financial support that was provided through grant SAPO/BI/UP/Sylvester from SAPO/Portugal Telecom, and by Fundação para a Ciência e a Tecnologia (FCT) through the REACTION project: Retrieval, Extraction and Aggregation Computing Technology for Integrating and Organizing News (UTA-Est/MAI/0006/2009).

Abstract

User-Generated Content (UGC) was born out of the so-called Web 2.0 — a rethinking of the World Wide Web around content produced by the users. Social media posts, comments, and microblogs are examples of textual UGC that share a number of characteristics: they are typically short, sporadic, to the point and noisy.

In this thesis we propose new pre-processing and normalization tools, specifically designed to overcome some limitations and difficulties common in the analysis of UGC that prevent a better grasp on the full potential of this rich ecosystem. Machine learning was employed in most of these solutions, which improve (sometimes substantially) over the baseline methods that we used as comparison. Performance was measured using standard measures and errors were examined for a better evaluation of limitations.

We achieved 0.96 and 0.88 F1 values on our two tokenization tasks (0.96 and 0.78 accuracy), employing a language-agnostic solution. We were able to assign messages to their respective author from a pool of 3 candidates with a 0.63 F1 score, based only on their writing style, using no linguistic knowledge. Our attempt at identifying Twitter bots was met with a median accuracy of 0.97, in part due to the good performance of our stylistic features. Language variant identification is much more difficult than simply recognizing the language used, since it is a much finer problem; however, we were able to achieve 0.95 accuracy based on a sample of 100 messages per user. Finally, we address the problem of reverting obfuscation, which is used to disguise words (usually swearing) in text. Our experiments measured a 0.93 median weighted F1 score in this task. We have also made available the Portuguese annotated corpus that was created and examined for this task.

Sumário

Os Conteúdos Gerados por Utilizadores (CGU) nasceram da chamada Web 2.0 — uma revisão da World Wide Web em torno de conteúdos produzidos pelos utilizadores. Exemplos dos CGU são as mensagens colocadas nos media sociais, os comentários em websites e os microblogs. Todos eles partilham algumas características típicas como a sua brevidade, serem esporádicos, diretos e ruidosos.

Nesta tese propomos novas ferramentas de pré-processamento e normalização que foram criadas explicitamente para superarem algumas limitações e dificuldades típicas na análise de CGU e que impedem uma melhor compreensão do verdadeiro potencial deste ecossistema rico. Na maioria dos casos foi empregue aprendizagem automática, o que permitiu superar (por vezes de forma substancial) os sistemas usados para comparação. O desempenho foi calculado recorrendo a métricas padronizadas e os erros foram examinados por forma a avaliar melhor as limitações dos sistemas propostos.

Alcançámos valores F1 de 0,96 e 0,88 nas nossas duas tarefas de atomização (0,96 e 0,78 de exatidão), empregando uma solução sem qualquer dependência linguística. Conseguimos atribuir a autoria de uma mensagem, de entre 3 candidatos, com uma pontuação F1 de 0,63 baseada apenas no seu estilo de escrita, sem qualquer atenção à língua em que foi redigida. A nossa tentativa de identificar sistemas automáticos no Twitter recebeu uma mediana de exatidão de 0,97, em parte graças ao bom desempenho das características estilísticas mencionadas anteriormente. A identificação de variantes da língua é um problema muito mais complexo do que a simples identificação da língua, dado que é uma questão de granularidade mais fina; no entanto conseguimos alcançar 0,95 de exatidão operando sobre 100 mensagens por cada utilizador. Por fim, abordámos o problema de reverter a ofuscação, que é empregue para disfarçar algumas palavras (regra geral, palavrões) em texto. As nossas experiências apontaram uma pontuação mediana ponderada do F1 de 0,93 nesta tarefa. Disponibilizámos livremente o nosso corpus anotado em português que foi criado e examinado para esta tarefa.

Contents

1 Introduction 1 1.1 Research questions ...... 3 1.2 State of the art ...... 6 1.3 Methodology used ...... 7 1.4 Contributions ...... 7 1.5 Thesis Structure ...... 8

2 UGC and the world of microblogging 9 2.1 What is User-Generated Content ...... 10 2.2 A brief introduction to microblogs ...... 12 2.3 A few notes about Twitter ...... 13 2.3.1 The network ...... 14 2.3.2 User interaction ...... 14 2.3.3 Defining the context of a message ...... 16 2.3.4 Takeaway ideas ...... 16 2.4 How microblog messages can be problematic ...... 17 2.4.1 Spelling choices ...... 17 2.4.2 Orality influences ...... 18 2.4.3 Non-standard capitalization ...... 19 2.4.4 Abbreviations, contractions and acronyms ...... 20 2.4.5 Punctuation ...... 23 2.4.6 Emoticons ...... 23 2.4.7 Decorations ...... 24 2.4.8 Obfuscation ...... 25 2.4.9 Special constructs ...... 26 2.5 Conclusion ...... 26

3 Tokenization of micro-blogging messages 27 3.1 Introduction ...... 28 3.2 Related work ...... 32 3.3 A method for tokenizing microblogging messages ...... 34 3.3.1 Extracting features for classification ...... 35 3.4 Tokenization guidelines ...... 36


3.4.1 tokens ...... 37 3.4.2 Numeric tokens ...... 38 3.4.3 Punctuation and other symbols ...... 38 3.5 Experimental set-up ...... 39 3.5.1 The tokenization methods ...... 39 3.5.2 Evaluation scenarios and measures ...... 40 3.5.3 The corpus creation ...... 41 3.6 Results and analysis ...... 42 3.7 Error analysis ...... 44 3.7.1 Augmenting training data based on errors ...... 47 3.8 Conclusion and future work ...... 47

4 Forensic analysis 51 4.1 Introduction ...... 52 4.2 Related work ...... 52 4.3 Method description & stylistic features ...... 54 4.3.1 Group 1: Quantitative Markers ...... 55 4.3.2 Group 2: Marks of Emotion ...... 55 4.3.3 Group 3: Punctuation ...... 56 4.3.4 Group 4: Abbreviations ...... 56 4.4 Experimental setup ...... 56 4.4.1 Performance evaluation ...... 57 4.5 Results and analysis ...... 58 4.5.1 Experiment set 1 ...... 58 4.5.2 Experiment set 2 ...... 59 4.6 Conclusions ...... 60

5 Bot detection 63 5.1 Introduction ...... 64 5.1.1 Types of users ...... 65 5.2 Related work ...... 67 5.3 Methodology ...... 69 5.3.1 Chronological features ...... 69 5.3.2 The client application used ...... 72 5.3.3 The presence of URLs ...... 72 5.3.4 User interaction ...... 72 5.3.5 Writing style ...... 73 5.4 Experimental set-up ...... 77 5.4.1 Creation of a ground truth ...... 77 5.4.2 The classification experiment ...... 78 5.5 Results and analysis ...... 79 5.6 Conclusion and future work ...... 80

6 Nationality detection 83 6.1 Introduction ...... 84 6.2 Related work ...... 86 6.3 Methodology ...... 87 6.3.1 User selection ...... 88 6.3.2 Features ...... 88 6.4 Experimental setup ...... 90 6.4.1 Dataset generation ...... 91 6.4.2 Determining the number of messages ...... 91 6.4.3 Determining the most significant features ...... 92 6.5 Results ...... 92 6.6 Analysis ...... 94 6.7 Conclusion ...... 95 6.8 Future work ...... 96

7 An analysis of obfuscation 97 7.1 Introduction ...... 98 7.2 What is swearing, and why it is relevant ...... 99 7.2.1 On the Internet ...... 100 7.3 Related work ...... 101 7.3.1 How prevalent is swearing? ...... 102 7.3.2 Why is it difficult to censor swearing? ...... 106 7.4 Swearing and obfuscation ...... 109 7.4.1 Our objectives ...... 112 7.5 Works related to the use of swear words ...... 113 7.5.1 List-based filtering systems ...... 114 7.5.2 Detection of offensive messages ...... 115 7.5.3 Works related to the study of swear words ...... 116 7.6 The SAPO Desporto corpus ...... 120 7.6.1 Description of the dataset ...... 121 7.6.2 The annotation ...... 122 7.6.3 The lexicon ...... 124 7.7 Obfuscation ...... 126 7.7.1 Obfuscation methods ...... 128 7.7.2 Obfuscation method analysis ...... 130 7.7.3 Preserving word length ...... 131 7.7.4 Altering word length ...... 134 7.8 What we have learned so far ...... 136

8 Methods of deobfuscation 139 8.1 A formal view of the problem and our solution ...... 139 8.2 The edit distance ...... 142 8.2.1 The Levenshtein edit distance ...... 143 8.2.2 A different way to see the Levenshtein edit distance . 144

8.3 Our evaluation ...... 153 8.3.1 The dataset ...... 153 8.3.2 Text preparation ...... 154 8.3.3 The baseline algorithm ...... 155 8.3.4 The Levenshtein algorithm ...... 155 8.3.5 The experiment ...... 158 8.3.6 Results and analysis ...... 159 8.4 Summary and concluding thoughts ...... 168 8.5 Future work ...... 170

9 Concluding remarks 173

A Nationality hint words 177

B Swear words 187

List of Figures

3.1 F1 vs. training corpus size for SVM classifier and baseline . . 43 3.2 F1 vs. feature window size for SVM classifier ...... 43

4.1 Authorship assignment performance vs. data set size . . . . 61 4.2 Performance of each individual set of features ...... 62

5.1 Human, cyborg and bot posting throughout the day . . . . . 69 5.2 Human, cyborg and bot posting throughout the week . . . . 70 5.3 Distribution of the 538 users in the three classes ...... 78 5.4 Human, bot, cyborg classification results ...... 79

6.1 Impact of the number of messages per user on accuracy . . . 92 6.2 Impact of the number of messages per user on processing time 93 6.3 Feature space dimension of feature groups and n-grams . . 94

7.1 Myspace swearing grouped by gender and nationality . . . 103 7.2 Intensity of swearing on Myspace ...... 103 7.3 Swearing in different sections of Yahoo! Buzz ...... 106 7.4 Profanity usage vs. number of variants ...... 125 7.5 Top 10 most common swear words and number of times used 126 7.6 How many words are seen spelled in a certain way . . . . . 127 7.7 How many words are seen spelled in a certain way (detailed) 128 7.8 Number of uses and number of variants of each swear word 129

8.1 Levenshtein edit cost matrix ...... 144 8.2 Levenshtein’s three directions with square cell ...... 145 8.3 Levenshtein’s matrix with cursor ...... 146 8.4 Levenshtein’s paths traced on matrix ...... 147 8.5 Levenshtein’s three directions with hex cell ...... 148 8.6 Levenshtein’s hex map transforming “seeds” into “looser” . 148 8.7 Levenshtein’s hex map showing multiple operation costs . . 149 8.8 Levenshtein’s hex map with even more operation costs . . . 152 8.9 Performance deviation from baseline for our methods . . . . 163 8.10 Classification error analysis for our methods ...... 163

List of Tables

2.1 Twitter personalities ...... 15 2.2 Frequent acronyms ...... 22

3.1 Examples of tokenization challenges ...... 30 3.2 Feature window example ...... 36 3.3 Tests generated from an example message for both scenarios 40 3.4 Feature window sizes vs. F1 and accuracy ...... 44 3.5 The most problematic characters for tokenization ...... 46 3.6 Improvement capability of the classifier-based tokenizer . . 47

4.1 Best F1 values obtained for different data set sizes ...... 58

5.1 Examples of bots using punctuation for message structure . 74

6.1 Variations in accuracy and processing time ...... 93

7.1 Strong or very strong swearing on Myspace ...... 104 7.2 Tweets expressing emotion with and without swearing . . . 105 7.3 Tweets with emotions containing swearing ...... 105 7.4 Some words derived from "dog" ...... 107 7.5 Performance for detecting profanity in Yahoo! Buzz . . . . . 118 7.6 Example obfuscated swearing and canonical spelling . . . . 124 7.7 Absolute frequency of swear words found on the 2500 messages . 124 7.8 General Categories from the Unicode Character Properties . 130 7.9 Distribution of obfuscations regarding modification of word length ...... 131 7.10 The ten most popular obfuscations seen in our corpus . . . 132 7.11 Obfuscations and instances for single obfuscation method . 133 7.12 Obfuscations and instances for two obfuscation methods . . 133 7.13 Obfuscations using one method altering word length . . . . 135 7.14 Obfuscations combining two methods and altering word length . 135

8.1 Example translation from “seeds” into “looser” ...... 145 8.2 Different operation frequencies for different edit paths . . . 150 8.3 Specialized vs. general edit operations ...... 152


8.4 Baseline classifier results ...... 160 8.5 Baseline classification errors ...... 161 8.6 Method 0 vs. baseline errors ...... 164 8.7 Method 1 vs. baseline and Method 0 errors ...... 165 8.8 Method 2 vs. baseline, Method 0 and Method 1 errors . . . . 166 8.9 Method 1 classification errors vs. previous classifiers . . . . . 166 8.10 Method 2 classifier results ...... 167

A.1 Word list used for European Portuguese ...... 177 A.2 Word list used for American Portuguese ...... 180

B.1 List of obfuscation methods ...... 187 B.2 All profanities in the dataset ...... 189

Chapter 1

Introduction

Contents 1.1 Research questions ...... 3 1.2 State of the art ...... 6 1.3 Methodology used ...... 7 1.4 Contributions ...... 7 1.5 Thesis Structure ...... 8

When the movable type printing press gained presence in Europe, in the middle of the 15th century, many profound alterations took place in the society of the time. It changed sciences, religions, politics, the economy, and paved the way to the Renaissance [Eis05]. Before that, books were rare and precious items, inaccessible to the common man for fear of theft or wear. Clerics had the monopoly on the written word, and would dedicate themselves mostly to producing (inexact) copies rather than creating new works.

As books became much more accessible through mass production (a scribe's manual work was quite dear), people outside the literate elite finally had reasons to learn to read. Book production started to be dictated only by demand, meaning that Latin was no longer the language of choice for writing [Bra70, Eis05] (leading to a standardisation of the local languages [Coh61, Eis05]), and it became much easier for a person of the time to have their work published. Finally, knowledge and ideas could survive and travel longer in the form of many accessible tomes, avoiding the inevitable result of the concentration of documents [Eis05], of which the burning of the Ancient Library of Alexandria is the best-known example.

Such a significant revolution in information dissemination would not happen again until the end of the 20th century, with the World Wide Web. Internet publishing became simple enough that almost anyone was capable of creating their own website through the so-called blog services. As the information is now divorced from its physical medium, its cost of production,

propagation and storage is further reduced, providing a more unencumbered circulation of knowledge and ideas. Millions of readers may now be within instant reach, and could also interact with the original author using special mechanisms (e.g. comments).

Microblogs — like other forms of User-Generated Content arising from the so-called "web 2.0" — bring this form of communication to a different level. Every reader/consumer can also be an author/producer. This "single role" means that all interactions are made in equal standing (as opposed to blogs' author-moderator/commentator two-class system), further fostering dialog. However, the most noteworthy aspect of microblogging is its ubiquity, as any cell phone can be used to access it. People can write and publish on a whim, leading to spontaneous remarks and informal, self-centred messages. This constant "up-to-the-minute" nature of microblogging is central to the interest of this new medium.

Twitter, the most well-known microblogging service, reached perhaps its peak popularity during the 2017 US presidential election campaign and the months that followed it, due to its significant use by candidate and president Donald J. Trump, which produced some memorable moments, like the "covfefe" word1 [Her17]. But, entertainment aside, it also prompted careful thought regarding the possible impact of misusing a communication channel with such reach [Ber17]. This question is not limited to microblogs and does not affect only the popular users; hard questions are being asked regarding these newer forms of communication [Che18].

One advantage that microblogs have over many other forms of digital communication (like social networks, which share many of the relevant characteristics we describe here) is that data is more easily available, since most accounts are public. Twitter has been used to predict election outcomes [TSSW10], to detect epidemics [Cul10], to study earthquakes [SOM10], to analyse brand-associated sentiments [JZSC09] and to perceive public opinions [OBRS10], among others.

What is special about User-Generated Content is the way in which it can mirror what is on the minds and souls of people. We can see that many publications are sporadic, unfiltered and not entirely thought through; they are also a habit for many users that post somewhat regularly (or semi-regularly) about their personal lives or simply the first thing they think

of, by sheer impulse2. Most users share their publications with a close circle — just their friends and family (and for this reason feel safe to express themselves), but incidentally they also open the door to whoever may be interested — and several users do make an effort to increase the number of followers they have.

A number of research projects were developed at LIACC3, SAPO Labs4 and the REACTION work group5, sharing the objective of gaining knowledge through the analysis of User-Generated Content, with particular interest in the Portuguese community. The work we present here is the result of our contributions to those efforts, and was developed with the aim of addressing particular needs that we felt were not being adequately addressed by the more usual methods. It focuses mostly on the pre-processing of messages, either removing or handling "noisy" tokens or identifying unwanted user accounts that could introduce "noise" into our message collection. Ignoring either of those possibilities could result in less accurate final results.

When we say "noise" we refer to irrelevant or meaningless data. We noticed that very few studies involving microblogs even mention pre-processing of the messages (the stage in the processing pipeline when noise is ideally handled), or how they handled situations containing noise. From this we are led to believe that no significant attention was given to the matter. Is it because it is accepted that noise is inevitable? Is it because researchers are able to sacrifice a significant number of messages to errors among the many thousands available? Is it due to the lack of awareness of the impact of noise in the message processing? While we cannot prevent noise at the source, it is possible that some of it can be converted into "signal" or simply avoided.

1 That garnered ample news coverage from different angles. Just four examples:
https://www.nytimes.com/2017/05/31/us/politics/covfefe-trump-twitter.html
https://edition.cnn.com/2017/05/31/politics/covfefe-trump-coverage/index.html
https://www.independent.co.uk/life-style/gadgets-and-tech/news/covfefe-donald-trump-tweet-twitter-what-mean-write-negative-press-us-media-president-late-night-a7764496.html
http://www.nationalreview.com/corner/448148/covfefe-donald-trumps-twitter-habits-white-house

1.1 Research questions

Natural Language Processing, for a long time and for historical reasons, was dedicated to the handling of well-crafted corpora, such as literary or journalistic works. Those texts were carefully created: well structured, with strong coherence and usually focusing on one topic. The authors, abiding by the rules of writing, used proper grammar and spelling, and employed correct punctuation and capitalisation. The nature of most source material was either informative (like news articles), formal (e. g. legal documents) or

literary. But today most text being produced is informal and randomly includes smileys, hashtags, mentions and URLs. Being "in the moment" musings (a "write-and-forget" affair), users typically put in the minimum effort required for "good enough" posts, where being understood is sufficient. The number of creators is also expanding rapidly — which only increases the variety in writing styles — making it more difficult to keep track of each author or to know some background information that could provide better context to their writings — things that could be done with printed authors.

Our work revolves around such questions pertaining to a new type of medium, aiming to better understand what people are saying. We have grouped them into sets of two or three intimately connected research questions that we enunciate here.

2 Microblogs may be called a "write-only medium" by some, due to the number of irrelevant posts that are of interest only to the author. These are often mocked as "what I had for breakfast" kind of posts. However, in our opinion such posts demonstrate how unfiltered microblog posts can be.
3 Artificial Intelligence and Computer Science Laboratory, http://www.liacc.up.pt
4 http://labs.sapo.pt
5 http://dmir.inesc-id.pt/reaction/Reaction

First set of questions

Generally we can say that text is comprised of words, numbers and punctuation; words are formed by letters, numbers are formed by digits and punctuation is made up of a small set of specific symbols. Yet current microblog messages can also take in new types of tokens which need to be treated with different rules, like smileys, mentions, hashtags and URLs; each can contain letters, symbols and/or digits. At the same time the "old" tokens need to be handled differently, due to abbreviations and creative writing (e. g. "YES WE CAN / YES WEE.KEND"). Our first research questions, treated in Chapter 3, are as follows:

Can a classification-based approach to tokenization of UGC clearly outperform a tokenizer based on rules specifically crafted for this purpose? If so, how much training data is required for achieving such performance? What amount of contextual information does the classifier need to obtain optimal tokenization performance?

Second set of questions

While we may call it "creative writing", it is not clear if the structure displayed by microblog messages reflects some sort of personal signature or if it is more or less random. In other words, we would like to know if there are any idiosyncrasies in a text containing 13 or fewer words, and if users can find enough freedom to express themselves through their own voice, or if we are facing linguistic anarchy without gaining anything. We researched the following questions in Chapter 4:

What set of language-agnostic stylistic markers can we extract from microblog messages in order to differentiate between the writings of different people? How reliably could we attribute

authorship among a small set of possible authors / suspects? And how large a sample would we require from each user in order to obtain solid results?

Third set of questions

It is impossible for us to know who truly sits behind a given user account across the internet, but some accounts clearly show a noticeable pattern too strict for us to consider that a person is actively trying to engage with the community. Some are clearly identified as a news feed, others are dedicated to advertisement or simply spam (sometimes masquerading as a regular account).

We call such accounts "bots" (short for "robot", to indicate automated activity) and they are of significance because they can skew results from community analysis. For example, news feeds can give the impression that people are actually interested and tweeting about a given event when they are actually ignoring it. As an anecdote, when Pope Benedict XVI visited Portugal, in 2010, we noticed that about 1/3 of all tweets mentioning the event were published by news organisations. Therefore, the work presented in Chapter 5 was developed to answer the questions:

What language-independent features can we use to identify non-human-operated microblogging accounts? How dependable would such a system be (assuming the number of messages is no problem, since bots are quite prolific)?

Fourth set of questions

Just as we cannot be sure of who is behind an account, we also cannot tell where that person may be. We know that Twitter, for instance, has a user profile for each account, but that information:

a) may no longer be true;

b) may be unrelated to their nationality (e. g. a tourist),

c) may be incomplete (e. g. a village or street);

d) may be ambiguous (e. g. Lagos, Portugal or Lagos, Nigeria) or

e) may be intentionally misleading (e. g. Narnia, Hogwarts or the Moon).

In our projects we wanted to follow only users from Portugal. Selecting only users stating "Portugal" in their location would be too strict and leave many accounts out, while selecting users posting in Portuguese would be excessively tolerant due to the many Brazilian users using the same language. The resulting questions, which we treat in Chapter 6, are thus:

Is it possible for us to reliably identify the nationality of a user based on the content (not metadata) of their messages? If so, how much text would we need for an accurate identification? What features would be the most telling?

By using only the content of the messages we can apply this result in any environment, not just in microblogs.

Fifth set of questions

Finally, while a lot can be said about swearing, it is unquestionably a form of expression, even if often treated with contempt. As such, it is somewhat common to find alternate forms of writing such words, either to disguise them (to make them pass through a filter) or to show a minimum of respect for the sensibility of other people who may be reading. We call this "obfuscation", and it is not exclusive to dirty words — any kind of taboo word can be disguised; it just happens that profanity is nearly universal and in more common use than other special words.

Disguised words are most commonly ignored (and may introduce additional noise into the message). This can be interpreted as information that is being left on the table, unusable. Certainly a swear word is a swear word, and they should not be taken literally; but still they are not interchangeable and therefore contain value.

While profanity lists are common, tolerant and robust methods of recognising them are few. In Chapters 7 and 8 we try to address the following research questions related to this matter:

What methods are used by the authors to obfuscate their swearing? Can we make use of this information to construct a deobfuscation mechanism that improves on the state of the art?

While we did not seek absolute and final proof on all our answers, we were satisfied with the conclusions we reached.

1.2 State of the art

As we stated previously, noise in microblogs is not uniform, and manifests itself differently based on the task being performed. This is reflected in the way researchers typically approach noise: they either acknowledge its presence and multi-faceted nature in a shallow way when discussing the general view of their system [PPSC+12, BDF+13], or they focus their attention only on an individual task and address only a particular kind of noise that is relevant for their work [DMAB13]. This is understandable, since noise is not the main focus of either type of work.

We were unable to find a work dedicated to the in-depth study of noise in UGC, which explains, in part, the less usual structure of the present work, where we attempt to do just that. We discuss several forms of noise that we met while performing multiple tasks in microblogs and other forms of UGC. In each chapter (starting with the third) we present the works we found relevant to that matter, and in this way we address the state of the art in a more specific way.

1.3 Methodology used

After consideration and evaluation of each problem and its research questions, we deliberated on the best way to approach it. The state of the art was consulted and compared to our solutions; occasionally we were unable to contribute any improvements to it, but those situations are not described here.

Each work presented here was framed as a classification problem. This was not according to a plan, but it does mean that we present a very similar methodology and evaluation process across the work presented. The details can be found in the proper section of each chapter.

We collected relevant data in an unbiased way and performed the annotation that was required to produce a ground truth. This may then be used to create a model (training a machine learning algorithm, for example) and for testing our classifiers. To ensure we work with a reasonable variance of inputs we shuffle the data and perform an N-fold cross-validation test. We use the standard evaluation metrics of precision, recall, F1 or accuracy to quantify performance and, where possible, compare those results with the state of the art. On some occasions we implement multiple solutions to provide a better comparison using the same data and evaluation method.
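To make this evaluation protocol concrete, the sketch below shows one possible shape for such a loop. It is a minimal illustration only, assuming scikit-learn, a generic feature matrix X with labels y, and a linear SVM as a stand-in classifier; it is not the exact code used in our experiments.

```python
# Minimal sketch of the evaluation protocol described above: shuffled
# N-fold cross-validation reporting precision, recall, F1 and accuracy.
# The feature matrix, labels and classifier are illustrative placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support, accuracy_score

def cross_validate(X, y, n_folds=10, seed=42):
    """Shuffle the data and run N-fold cross-validation over (X, y)."""
    folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X, y):
        clf = LinearSVC()                       # any classifier could be plugged in here
        clf.fit(X[train_idx], y[train_idx])     # train on N-1 folds
        pred = clf.predict(X[test_idx])         # evaluate on the held-out fold
        p, r, f1, _ = precision_recall_fscore_support(
            y[test_idx], pred, average="weighted", zero_division=0)
        scores.append((p, r, f1, accuracy_score(y[test_idx], pred)))
    return np.asarray(scores).mean(axis=0)      # averaged precision, recall, F1, accuracy
```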

1.4 Contributions

Our first contribution lies in the use of machine learning and a classification approach in the process of tokenization. This is a task that is traditionally performed through hard-coded rules, usually expressed through regular expressions. The improvement in the results was significant.

A second contribution lies in the study of the significance of the relation between the writing style that can be observed in very short messages and their authorship and, in a more innovative way, in determining the nationality of the author and identifying human/robot accounts.

Another contribution we wish to highlight is our expansion of the Levenshtein edit operations with variable costs for the specific task of decoding disguised words in text.
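To illustrate the idea (the actual operations and their costs are developed in Chapter 8), the sketch below shows a Levenshtein distance in which the substitution cost depends on the characters involved; the specific cost values and character classes here are assumptions made purely for this example.

```python
# Illustrative sketch of a Levenshtein edit distance with variable operation
# costs, the general idea behind our deobfuscation approach. The cost
# function below is a toy assumption, not the tuned costs used in Chapter 8.
def substitution_cost(a, b):
    if a == b:
        return 0.0
    if b == "*" or b.isdigit():        # typical obfuscation symbols are cheap to undo
        return 0.2
    return 1.0

def weighted_levenshtein(source, target, ins_cost=1.0, del_cost=1.0):
    """Minimum-cost edit distance from a candidate word to an obfuscated form."""
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,      # delete a character
                          d[i][j - 1] + ins_cost,      # insert a character
                          d[i - 1][j - 1] + substitution_cost(source[i - 1], target[j - 1]))
    return d[m][n]

# With these costs, an obfuscated form such as "m*rd4" ends up closer to the
# swear word "merda" (cost 0.4) than to the regular word "perda" (cost 1.4),
# which is what allows the intended lexicon entry to be recovered.
```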

Besides the more intangible knowledge resulting from our research and the solutions we describe in the present work, our contribution also includes the tools we developed and used to perform our experiments. They are distributed at the SAPO Labs website through an OSI-approved free license.

While the license terms of Twitter make it impossible for us to distribute the microblog message datasets we worked with, we do include the binary models that we derived from them. This means that anyone interested in one of our solutions may start using it immediately and does not need to annotate data and generate their own models (but that possibility is always present).

Our work on obfuscation did not use data from Twitter, but from SAPO. We received official permission to distribute our annotated SAPO Desporto dataset, and do so hoping it will be useful for other researchers. More information can be found in Section 7.5.3, on page 119.

1.5 Thesis Structure

The following chapter will go into some depth explaining the relevance and challenges of microblogs (and User-Generated Content in general). It is meant to provide an introduction to User-Generated Content, with a strong focus on microblogs, which constitute most of the data we worked with.

As we stated, the problem sets we addressed are treated in their respective chapters, in the order in which we presented them. Each of these chapters is mostly self-contained but shares a similar structure. They begin with an introduction and background to the problem and refer to related work. The methodology, experimental set-up and results obtained are then presented, discussed in detail and commented. They end with the conclusions we reached and with the identification of opportunities for improvement we would like to experiment with next.

To reiterate, in Chapter 3 we present our work on tokenization; in Chapter 4 we study the feasibility of performing forensic analysis on short messages; Chapter 5 describes our efforts at detecting automatic posting accounts and Chapter 6 shows how we solved the problem of differentiating between the two major variants of Portuguese. Finally, in Chapter 7 we pore over the problem of word obfuscation, with a particular emphasis on profanity.

At the end of the thesis we present a final chapter containing some final thoughts on the work overall and a set of future directions and related problems we would like to solve.

Chapter 2

UGC and the world of microblogging

Contents 2.1 What is User-Generated Content ...... 10 2.2 A brief introduction to microblogs ...... 12 2.3 A few notes about Twitter ...... 13 2.3.1 The network ...... 14 2.3.2 User interaction ...... 14 2.3.3 Defining the context of a message ...... 16 2.3.4 Takeaway ideas ...... 16 2.4 How microblog messages can be problematic ...... 17 2.4.1 Spelling choices ...... 17 2.4.2 Orality influences ...... 18 2.4.3 Non-standard capitalization ...... 19 2.4.4 Abbreviations, contractions and acronyms . . . . 20 2.4.5 Punctuation ...... 23 2.4.6 Emoticons ...... 23 2.4.7 Decorations ...... 24 2.4.8 Obfuscation ...... 25 2.4.9 Special constructs ...... 26 2.5 Conclusion ...... 26

With the popular so-called "Web 2.0", User-Generated Content (commonly abbreviated as UGC, sometimes also called Consumer Generated Media) has become an integral part of the daily life of many Internet users. In fact, we claim it forms the central piece of many of today's most popular websites.


Wikipedia, Facebook, Youtube, Flickr, eBay, Twitter, Blogger and Craigslist are examples of very different but popular websites. But all of them are dependent on the efforts of their users to continue to attract users and new contributions, further enhancing their value.

Even if the website's "core business" does not deal with users' original work directly, many times it does so indirectly through comments, ratings, tagging, reviews, links and other forms. Amazon is one such example, as it encourages users to share their opinions, review products and suppliers, create lists of products that they think work well together, expose their wish-list, and in these ways create a sense of community.

User-Generated Content, while widely popular, is still understudied in many areas, such as those presented here. That is to say that, based on our review of the state of the art, topics related to account selection (based on nationality or level of automation of the operation) or pre-processing of messages saw very little attention compared with trend analysis, the quantification of user influence or general opinion mining. While we try to provide comparison to studies in the same area that are somewhat comparable, on most occasions their results are not directly transferable to our work environment.

2.1 What is User-Generated Content

User-Generated Content can be considered as any original or derivative product created by or through a system of peers, outside of their professional activity. This is a very broad definition, and covers everything from game "mods" and fan fiction to the "letters to the editor" newspaper column and movie re-edits. In fact, scientific papers in a journal or conference proceedings can also be considered User-Generated Content.

The disrupting factor of UGC comes from the "democratization" of the act of media distribution. It is now easy for one's work to reach a handful of friends, or many thousands or millions of other people that may be interested in what they have to say, to show or to share. This is sufficient encouragement for the massive amount of data that, for one reason or another, finds its way to Internet culture. Characteristics frequently associated with UGC media productions are the low budget, the small work team, the very personal approach to the subject, and on occasion, the novelty factor that helps a work stand out from the others.

Currently, the popularity of UGC can, at times, overshadow that of professional products. For example, Wikipedia is currently more popular than Encyclopedia Britannica, and is also responsible for the end of life of other well known digital encyclopedias, like Microsoft's Encarta. Other collaborative efforts are currently gaining momentum, like Open Street Maps, and

Youtube has currently no real rival in the area of generic on-line video.

Professional media production is expensive, and for that reason an effort is placed in quality. This implies the role of a supervisor, that is, someone assigned to ensure that a certain minimum standard of excellence and rigor is consistently respected across the entire production. That standard, of course, is not set at the same level in different places. An editor has the responsibility of reviewing the texts before they are accepted for publication, and argues with the writers to make sure that the editorial line is followed. In other businesses, that work is performed by the producer, as in the music industry. When we talk about UGC, that responsibility is unassigned. Sometimes the authorship is unknown (for example, anonymous comments in some websites or discussion forums).

We can ask then, where can we find value in User-Generated Content? Roland Smart, a marketing consultant, probably summarises it best in his blog [Sma09]. In his professional view, the value of UGC is that it:

• Democratizes sharing of information;
• Distributes content efficiently;
• Fosters dialog and diverse perspectives;
• Enables crowd-sourcing and problem solving; and
• Provides large data-sets for analysis.

We are interested in a much narrower subset of UGC, one that has the following properties:

1. Is textual in nature;
2. Has tight space limitations;
3. Has a low participation cost;
4. Is of immediate nature;
5. Is ubiquitous; and
6. The data is public.

That is, short texts that anyone can produce from anywhere.

We can find this type of content in many places on-line, such as discussion forums and the comments sections in blogs and news websites, but microblogs are the best source, as they use no markup, provide a standard API (most of them), and provide abundant, up-to-date data.

At most, rules are set at a distribution channel (a website, usually) stating what is and is not accepted — usually to avoid legal complications — and may dictate some sort of technical standard to be followed. But in general, the community is left to "sort itself out", through feedback (comments, votes, tags) and posting new versions — if at all.

2.2 A brief introduction to microblogs

Microblogs are a form of User-Generated Content that are used to follow (and propagate) the thoughts and doings of other people in a reduced number of characters (traditionally 140, but this is currently an arbitrary limit and therefore varies from platform to platform).

It is difficult to point out which system was the first microblog, as the service evolved organically from regular blogs. Some people trace its roots to personal diaries from the 18th and 19th century [Hum10]. But microblogging as we know it today certainly has its roots in IRC1, IM2 and SMS3.

Both Twitter and Jaiku were at the genesis of microblogging in 2006 [Mad09, Bul09]. Twitter's story is quite well known. It was initially conceived as a system that provided updates through SMS, a medium that allows for 140 bytes per message4. To curb their SMS costs, Twitter's creators decided to limit each update to only one phone message, and imposed the limit of 140 bytes [Mak09, Boi09]5. Many other microblog systems adopted the same "140 characters or less" limit, even though they do not communicate via SMS. As of late 2017, Twitter expanded their limit to 280 characters.

While SMS microblogging is useful and appreciated by many users, it is often not the preferred form for interaction (if for no other reason, due to the costs involved). Many (if not all) microblog systems offer a web interface, RSS reading and native client applications (for one or more Operating Systems and/or mobile platforms). Some also provide a public API, allowing the development of non-official clients and applications. This includes native access via non-officially supported Operating Systems and microblog use for custom-made systems. For example, geolocation of stratospheric balloons6, road traffic information7 or pet monitoring8.

Microblogging is not limited to just Twitter, but at the moment Twitter holds the lion's share of microblog users, relegating all other networks to market irrelevance, to the point that microblogging and Twitter are sometimes used as synonyms. There are, however, two caveats related to this.

1 Internet Relay Chat is an open protocol for real-time multi-way text communication.
2 Instant Messaging is a general term used for a form of real-time bi-directional text-based communication system, although some systems have extended it to allow for multi-way communication, and support live audio and video calling.
3 Short Message Service is a component of cell phones' communication system used for the transmission of text.
4 A total of 1120 bits, which allows the encoding of 160-character messages using a reduced 7-bit alphabet.
5 Twitter messages are mostly encoded in UTF-8, thus their maximum length in characters depends on the characters used. If a message is too long, it is stored truncated, which may sometimes be unexpected, due to special characters.
6 http://spacebits.eu/page/project.html
7 http://twitter.com/perth_traffic
8 https://www.huffingtonpost.com/2010/02/15/puppy-tweets-tweeting-dog_n_462455.html

First, QQ and Sina Weibo compete over the large Chinese market [con18a, con18c]. They are both giants, even if they are less well known outside that country. Second, there is Facebook9, with its 500 million users. While Facebook is a social network (and some people also claim that microblogs are technically social networks — a discussion that we shall not delve into), it is true that its "status update" feature is a form of microblogging, i.e. it allows its users to publish short text messages online and to read other people's updates. There are two key points that we consider to negatively impact Facebook's microblogging aspect: status updates from Facebook's users are readable only by their friends, and access to the network requires access to a computer or smartphone, limiting a person's ability to participate at any moment. Therefore we are left with less meaningful and sporadic data that is more difficult to analyse. Facebook status messages are limited to 420 characters. While people have experimented with status.net services, creating microblogs limited to 1400 characters, one of the key points in microblogging is the short message length.

On the other hand, Twitter officially claims 175 million users, who are responsible for 65 million messages per day [Gar10], accessing it through a multitude of methods [Wil10]. More recent rumours put those numbers at 200 million users and 110 million messages per day [Chi11]. This volume of information is quite impressive. Messages are in constant flow, and users can write from anywhere [LF09], resulting in up-to-the-minute updates and reactions from all around the world.

These are the reasons that turned our attention towards microblogs — and, without loss of generality, Twitter — for our work: the large volume of data available for study, its significance, its nature and its foibles.

2.3 A few notes about Twitter

As stated above, Twitter is not the only microblogging service, but it was the first and is still by far the most popular. As it still holds the dominant position in microblog culture, we will talk about microblogging as seen through a Twitter user in this segment. The basic concepts of microblogging are also valid for the competing systems, differing only in some terminology — for example, in identi.ca a message is called a "dent" instead of a "tweet", and they use the term "resend" instead of "retweet". Twitter is not limited to what is described here, but hopefully this will be sufficient to allow almost anyone to follow the rest of this document.

A Twitter timeline is simply the collection of messages that users publish, and is updated in real-time. A user also has their own timeline, containing only their posts. This timeline can be either public or private. If

private, then the user needs to grant access permission to each person that wishes to read it. Private timelines seem to be a minority in Twitter, and we intend to make no effort in accessing them.

Users are, of course, interested in consuming the most interesting timeline possible, and hopefully making a positive impact on others' timelines.

9 http://www.facebook.com/

2.3.1 The network

If a user A subscribes to another user B's tweets, meaning that they want to be able to see B's updates, then A is said to be a follower of B (or simply that A follows B). User A can also have their own followers. On occasion, the word "followee" is used to denote someone that is followed, but it seems to be an unofficial term, as the Twitter website never uses it.

We can see that the graph of followers forms the backbone of Twitter's social network. However, and contrary to "regular" social networks, following someone in Twitter is not a mutual relation. "A follows B, but B ignores A" is a very frequent pattern.

According to HubSpot's 2010 report [Hub10], Twitter users follow approximately 170 people, and have 300 followers on average. But 82% of the users have fewer than 100 followers, and 81% follow fewer than 100 people. This indicates that the social structure is very unbalanced.

Consulting Twitaholic's10 numbers related to some of Twitter's top personalities (see Table 2.1), we can notice that some users have a disproportionate number of followers, and follow a great number of people in return. It is also apparent how this falls into Twitter marketing strategies. For example, Barack Obama's team is pulling a popularity trick (as in "I'm listening to you"), while Taylor Swift takes the opposite approach.

2.3.2 User interaction

In Twitter users can publicly interact with each other in three forms: mentions, replies and retweets. Twitter also provides a private "Direct Message" mechanism, but given that such data isn't public, we will not address it.

A Twitter mention is simply a message where a user's handle occurs in its body, prefixed by "@" — what is sometimes called a @reference. This symbol is used to disambiguate usernames from normal text. That way, "Twitter" means the microblog and "@Twitter" means the official Twitter account. For example:

What I meant is @twitter could you please just let us keep old twitter? The new one kinda sucks, to be honest.

OMG my Twitter account is 579 days old!!! <3 @twitter

10 http://twitaholic.com/

Table 2.1: Some Twitter personalities and statistics from https://friendorfollow.com/twitter/most-followers/, retrieved on 2018-06-20.

User                Role                        Followers    Follows
Katy Perry          Pop singer                109,595,659        216
Justin Bieber       Pop singer                106,535,597    310,606
Barack Obama        Ex-president of the USA   103,225,557    621,452
Rihanna             Pop singer                 88,775,421      1,112
Taylor Swift        Pop singer                 85,540,254          0
Lady Gaga           Pop singer                 78,878,841    127,135
Ellen DeGeneres     Comedian                   78,098,165     35,757
Cristiano Ronaldo   Footballer                 73,913,377         99
Youtube             Internet media company     72,303,826      1,026
Justin Timberlake   Pop singer                 66,061,746        282

Replies are just like mentions, but the addressee appears at the start of the message. This indicates that the message is directed at that user (probably as a reply to something they said). However, the message is as public as any other. An example:

@twitter I keep getting spam bot followed...fix this! :D

Both mentions and replies are gathered by Twitter, so that users can easily find out what is being said to or about them. A user can mention and reply to anyone, independently of following or being followed by that person. Users can also propagate a message someone else wrote. This is akin to email’s forward action, but in Twitter it is called a retweet. A retweet is identified by the letters RT and the reference of the quoted author. A copy of the message usually follows. Users can still edit the message, adding their comments, more often at the start of the message to clearly separate them. Here is an example:

An inspiring must-read. Who knows what we’ll accomplish in 2011? RT @twitter The 10 Most Powerful Tweets of the year: http://ow.ly/3oX0J

These interactions can all be combined. For instance, we can create a retweet of a retweet, or reply to a mention, or retweet a reply to a retweet to a mention.

2.3.3 Defining the context of a message

Another interesting and noteworthy construct in Twitter is the hashtag. It is used to group messages related to a certain theme, and is created by using a hash symbol (“#”) followed by a word. For example,

My latest blog post: Gulf Seafood Safety in Question http://wp.me/pZnhC-7X #gulf #oilspill #bp #corexit, tweetin this again 4 the AM crowd!

Snowbirds are coming to Myrtle Beach, #Florida, quelling fears of #Gulf #oilspill impacts: http://bit.ly/gjYY5p

We can see that in the second example the hashtags are also used as normal words; as opposed to the first example, where they are used more like “metadata” for topic grouping.
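The constructs described in this and the previous subsection follow simple surface patterns, which the short sketch below illustrates with regular expressions. These are simplified patterns assumed purely for illustration (real tweets contain edge cases they do not handle), not the extraction rules used later in this thesis; the example tweet is adapted from the retweet shown above, with a hypothetical hashtag appended.

```python
# Simplified sketch of how the Twitter constructs described above could be
# recognised with regular expressions. These toy patterns ignore many edge
# cases (trailing punctuation, Unicode word characters, quoted retweets, ...).
import re

MENTION = re.compile(r"@\w+")              # @reference / mention
HASHTAG = re.compile(r"#\w+")              # hashtag
RETWEET = re.compile(r"\bRT\s+@\w+")       # classic "RT @user" retweet marker
URL     = re.compile(r"https?://\S+")      # very rough URL pattern

tweet = ("An inspiring must-read. RT @twitter The 10 Most Powerful "
         "Tweets of the year: http://ow.ly/3oX0J #bestof")
print(MENTION.findall(tweet))    # ['@twitter']
print(HASHTAG.findall(tweet))    # ['#bestof']
print(RETWEET.findall(tweet))    # ['RT @twitter']
print(URL.findall(tweet))        # ['http://ow.ly/3oX0J']
```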

2.3.4 Takeaway ideas

Microblogging is a relatively new medium of communication that combines SMS with blogging (among other influences). It is meant to be short, simple, to the point, spontaneous, unstructured and informal. Twitter is the most influential microblog, and for this reason, the one adopted as the example.

It was Twitter that gave rise to two (user-created) constructs that are now common in many other domains, namely referring to other people by writing a name preceded by the symbol “@” (called a @reference or mention), and the hashtag, which consists in using the symbol “#” before the name of the tag. The simplicity of adding metadata as part of the message helped to popularize them.

As a tool that is “always at hand”, microblogging quickly became part of many people’s daily habits, extending their communication medium portfolio. It is used for big and important announcements and for daily and inconsequential chitchat, providing an all-encompassing medium, probably only comparable to face-to-face talking.

Processing microblog messages automatically can be very different from processing small excerpts of formal text (i. e. the typical type of text written by a person with care and attention to how they are expressing themselves, which we can find in books or newspapers). This can result in a number of problems for text processing systems, and we will analyze some of these issues in the following section.

2.4 How microblog messages can be problematic

Many of the tools that are currently used for linguistic processing can have their roots traced to the early era of computational linguistics. Kučera and Francis created the now-called Brown Corpus in the early 1960s [KF67], the first large corpus for linguistic analysis. At the time computers were much more limited. For example, the Brown Corpus was written in all-caps, with an asterisk indicating word capitalization [MS99]. Digital texts, while not rare, were not as pervasive as they are today.

Currently, we have a very large proportion of text being created digitally, as much more powerful computers are easily accessible to many people, lowering the bar for content creation. The Internet lowered the bar that restricted content distribution, especially when blogs appeared. Today, we have an environment where large amounts of text, written by common people (i.e. available to anyone who wants it), are constantly being distributed, accessed and, hopefully, read. While no one denies the existence and relevance of what can be called “clean text”, carefully created by professionals, we claim that the more common problem (in text processing) has shifted significantly towards the disposable messages quickly typed by “average Joes”.

The first problem researchers meet when trying to apply their linguistic tools to microblog messages is the large amount of noise [DH09, SRFN09]. Noise is the general term applied to outlier values that fall outside the model used for the data, and can interfere with the meaning of the message in an unwanted way. In our context, we call noise any form of deviation between the original (microblog) message and the same message written in a clean and simple newspaper-like form — we are interested in dealing only with the lexical aspects at the moment, and will therefore ignore any syntactic or semantic factors.

Noise has a negative impact on the correct reading and understanding of the messages, for both humans and automatic systems. However, a person can more easily compensate using context, experience and cultural knowledge, which is the reason much of it is tolerated today. We follow with a brief discussion of many forms of noise present in microblogs.

2.4.1 Spelling choices

Alternate spellings are one of the most obvious differences between microblog text and “clean” text. This includes words that are misspelled either by accident or on purpose. For example, by using “Micro$oft”, “Dubya Bush” or “Deusébio”, the author is embedding a secondary message in the name: “Microsoft is greedy”, “George W. Bush is a dumb Texan”, and “Eusébio is a divine figure”.

It is also possible to see the use of numbers instead of letters to type a word. Numbers can be chosen due to their graphic appearance, which resembles a letter, often seen in “l337 5p34k” (leet speak), or due to their pronunciation, i.e., “2day” (today), “4get” (forget) or “pir8” (pirate).

Below we can see two consecutive messages from the same user, containing different spelling choices. The first message shows the misspelling of a word as another similar word (the context hints that she meant “bores” instead of “boards”). The second message has errors due to the inclusion of an extra letter, and the removal of another one.

SAM BOARDS ME *YAWN x

OPPPS SPELLING MISSTAKE AND YOU SPOT IT OUT EMBARRASING :S

2.4.2 Orality influences

In general, UGC is strongly influenced by oral forms of communication, especially in Twitter, as it contains mostly unprepared discourse. Messages tend to be short, and present syntactic structures that are usually simple, incomplete or even incorrect; we also find many unfinished sentences.

The vocabulary we find in UGC tends to be less rich since, for example, we find more broad words being employed (e. g. “thing”, “it”, “stuff”) and referring expressions (e. g. “she”, “her”, “that”, “this”, “here”, “there”). The oral influence can also be observed in the grammar (e. g. the past perfect is rarely used), and in the strong contextual reliance that is required to understand the message correctly.

Phonetic spelling

One form of intended misspelling is the use of phonetic-like writing as a stylistic choice or cultural bias, for instance, when the author intends to give a "street culture" image. It can also be used to call attention to someone's speech pattern (for example, an accent or a lisp). The following examples show what seem to be unintended errors as well as words misspelled on purpose (e.g. "ya" and "u c"):

You can super annoyin but i couldnt live without ya <3

@fakeforte wassup man u c m&m didnt win album of da year last nite or wateva instead justine bleiber won wtf shulda been weezy or soulja boy

Onomatopoeias, interjections and fillers

The orality influence also extends to the inclusion of elements such as onomatopoeias, interjections and fillers. These special lexical categories show great variety in graphic form (since there is no "official" definition of how they are written) and across cultures. For instance, the onomatopoeias "jajaja" and "jejeje" are popular in Spanish, as "hahaha", "hehehe" and "hihihi" are in European Portuguese, or "rsrsrs" and "kkkk" are in Brazilian Portuguese. Fillers are used in speech to avoid an awkward silence when the speaker organizes their thoughts. Their presence in a written medium is a strong indication that the author wishes to carry across the full experience of spoken conversation. Examples of their use follow.

@VargasAlan Cof cof werever cof cof? jajaja

@araeo Err ... that was 'writer's BLOCK', obviously. Ugh. I'm hopeless.

@meganjakar errrrrrr not too sure. I don’t know if you’re invited..?

2.4.3 Non-standard capitalization

The purposes of capitalization vary according to language and culture. Some languages do not even use it, like Chinese and Japanese. When used, capital letters are often meant to highlight something, e.g., the start of a sentence, a proper noun, any noun (as in German), an acronym, or the word "I" in English11. If the writer neglects standard or commonly accepted capitalization, they may also be discarding very useful clues that help determine whether a dot marks an abbreviation or a full stop. These clues would also be useful in hinting that "Barack" could be a correctly written proper noun absent from the dictionary, and not "barrack" misspelled. The following examples illustrate different situations. The first two show a consistent letter case choice. In the third example the user changes to all-capitals at the end of the message. The last example displays selective casing, where the author correctly capitalizes some words but not proper nouns.

christmas convo: me- wat u want fa christmas her- u kno how i lik it http://bit.ly/ih8xsA

SHOUTS TO DADDY K FOR THE KICKS HE GOT ME I LOVE EM AND SHOUT TO BOB SINCLAR I HAD A GOOD TIME @ THE TV SHOW LAST NIGHT !!

11 "i" was a word that got lost easily in texts, and capitalizing it made it stand out more. "I" became prevalent in the 13th and 14th centuries.

Look, I don’t mind following u Bieber freaks but keep those dam pe- titions OUT MY MENTIONS...

Listening to tyrone bynum talk bout barack obama..... This nigga is in his own lil world....

2.4.4 Abbreviations, contractions and acronyms

As expected, when users need to fit their message into a small space, abbreviations, contractions and acronyms are frequently used to reduce the character count. The main problem arising from their use is the creation of non-standard forms when a more common one may already exist. In fact, the variety of methods available to cut characters from text naturally leads to this situation. For example, "lots of people" can be reduced to "lots of peop.", "lts of ppl", "lotz'a ppl", "lotta ppl", "LoP" and so on.

Many factors may contribute to the choice of less popular options over more established alternatives. For example, if authors really need to save more characters, they can opt for a more aggressive abbreviation. They can be typing from a device that makes the apostrophe more difficult to input. It could also be a question of writing style, to give a more oral tone to the message. There is also the possibility that the author thinks the target readers would have more difficulty deciphering a certain reduced form (e.g. because it could collide with another word). For example, Wikipedia currently lists over 90 possible disambiguation alternatives for "PC", including "personal computer", "probable cause" and "program counter". Acronym Finder12 lists 304, adding "pocket computer", "phone call" and "prostate cancer" among others. Hence, "not with PC" could be read as "not with probable cause" by someone with a legal background, and "not with prostate cancer" by a medical worker.

Abbreviations

In regular, careful writing, some words are very frequently written in abbreviated form, such as "et cetera" ("etc."), "exempli gratia" ("e.g."), "Doctor of Philosophy" ("PhD"), "street" ("St."), "saint" ("St.") and "versus" ("vs."). On Twitter there are some standard abbreviations (e.g. "RT" for "retweet"), but they are uncommon, as most "standards" on Twitter are de facto standards that gained community support. For example, "#tt" may mean "trending topics", "travel tuesday" or "toilet tweet"13, and "#ww" could mean "worth watching" or "writer's wednesday"14.

12 http://www.acronymfinder.com/
13 https://tagdef.com/en/tag/tt
14 https://tagdef.com/en/tag/ww

The large number of words that are abbreviated, and the number of possible abbreviations each can have, lead to a fragmentation of the language. For example, "you", already a short word, can be seen written as "ya" or "u" (as in the examples in Section 2.4.2). One additional problem with abbreviations on Twitter is that many are written without the dot after them. For instance, "tattoo" is occasionally written as "tat", which is also a well-known misspelling of the frequent word "that". This problem is aggravated when the abbreviated form of the word coincides with a regular word (e.g. "category" shortened to "cat"). Recognizing an abbreviated word is thus a challenge left to context. Finally, a word can be subjected to both alternate spelling and abbreviation, as in the case of "please" written as "plz" instead of "pls". In the following examples we can see "morning" written in the form "am" (an abbreviation of "ante meridiem") and "because" shortened as "b/c". It should also be noted that in the second example "ya" is used instead of "you" and "your", showing how a user can fail at consistency even within the same message.

@ryanhall3 saw u at my hotel this am. Didn’t realize it was u until u were leaving. Prob a good thing b/c I would have asked u for a pic :)

@alexandramusic hii!! How are u?? I’ve missed ya!! Omg i love ya song’bad boys’ sooo much!!! Can u plz follow me!?? :) xx

Contractions

Contractions are very popular in general casual writing, and less so in formal writing. Since the general style in microblogs is very informal, it is expected that contractions are very frequent. For example, "I'm" is much more popular than "I am", even if it saves just one character. However, users often further reduce the expression to just "Im", leaving to the reader the task of noticing the contraction. Contractions are one of the many orality markers to be found in microblogging.

Yall im finna stop smokin

It annoys me when ppl say they waitin til the 1st to workout. U aint doin nothin today so get to gettin

Acronyms

Acronyms are also very popular in microblogging, as they allow significant character condensation. An expression or idea can be encoded in just a few letters, as can be seen in the examples in Table 2.2.

Table 2.2: Some frequent acronyms used in microblogs.

Acronym  Expansion                  Acronym  Expansion
AAMOF    As a matter of fact        GF       Girlfriend
AFAIK    As far as I know           IANAL    I am not a lawyer
AKA      Also known as              IMHO     In my humble opinion
ASAP     As soon as possible        LOL      Laughing out loud
ASL      Age, sex, location?        OTT      Over the top
BF       Boyfriend                  SMH      Shaking my head
BTW      By the way                 SO       Significant other
FTW      For the win                TA       Thanks again
FWIW     For what it's worth        TLDR     Too long, didn't read
FYI      For your information       YMMV     Your mileage may vary

Some acronyms have become very well known, like "LOL". However, and contrary to abbreviations and contractions, a good understanding of the language and the context is insufficient to decipher an acronym. It needs to be introduced or looked up somewhere. There are certainly thousands of acronyms in use, covering a wide range of situations and quotes. Most are limited to specific contexts, but many are very popular in general use.

Popular acronyms are sometimes altered to change their meanings, for instance, changing "IANAL" (I am not a lawyer) to "IANAP" (meaning "I am not a plumber"), or "IAAL" (for "I am a lawyer"), but this should occur only when very obvious from the context (or the "joke" is explained). Another type of change affecting acronyms is their use as words, such as "LOLed" or "LOLing" (meaning "laughed" and "laughing", respectively). Finally, being by far the most common acronym in microblog use, LOL is used with many variations. A common one is the exaggeration of the acronym to reflect an exaggeration in the laughing, for example, "LLLOOOLLL", "LOOOOL" or "LOLOLOL". As was explained before, proper casing rules are often not followed, making acronyms more difficult to spot. We can see this in the first two examples of acronym use in Twitter:

@LilTwist and @LilZa Need a GF ASAP Lool its upsetting seeing them lonely I need to help them find a nice girll :)

@angelajames Reviewers are entitled to opinions. I don’t reply to mine - reviews not for me, for readers imho. Otoh, do like good ones!

@aarondelp BTW, Time Machine uses (AFAIK) AFP over TCP. You just need the NAS to provide the appropriate support.

2.4.5 Punctuation

Many microblog users do not regard punctuation as important. The reading of messages is so frequently hindered by the misuse or absence of punctuation that one comes to expect it. Part of the reason may be due to cell phone typing, where some devices have only one key for all forms of punctuation, making the task of looking for the correct character tedious. Other users may see punctuation in such short messages as a mere formality. Or they are trying to save characters and/or seconds of typing (or, parsimoniously, they simply do not care enough to bother typing them). Consequently, readers must use grammatical and/or semantic clues to guess where one sentence ends and the next one starts. For example:

Ouch Russell Brand put a photo up of Katy Perry up without make-up on (and deleted it) nice to know shes human! http://plixi.com/p/66692669

Other times microbloggers exaggerate the punctuation, repeating it or writing it incorrectly, as shown below:

OMG OMG!!!I just saw that @emilyrister followed me!!!WOW!You’re awesome!!I read your tweets since the beginning,it’s crazy.THANKS THANKS!!!!

@So_Niked_Up22 What time is yo game.? N you right..it really aint nothin to do.!

2.4.6 Emoticons

In spoken conversations, a significant amount of the message is perceived through non-verbal language, such as speaking tone and body language. In textual conversations these communication channels do not exist. To address this problem, and reintroduce such information to the medium, people resort to typographic signals arranged in a way that resembles facial expressions. The most common of these is the very popular happy face ":-)". These constructions are added to informal conversations as a way of conveying a "tone" to the messages through happy (":-)"), sad (":-("), winking (";-)"), tongue out (":-P"), crying (":'-("), surprise (":-O") or other aspects from a wide and varied range. The general name given to this class of graphic markers is emoticon, as it mainly expresses emotions through iconic representations.

Many variations of emoticons have been created, such as having the faces looking in the opposite direction (e.g. "(-:") and giving them a vertical orientation instead of a horizontal one (e.g. "^_^"). These variations can also be altered to display a wide range of feelings (e.g. crying "T_T", ")-':" and winking "^_~", "(-;"). Emoticons are not limited to face-like images (e.g. a heart "<3", and a fish "><>"). One simple example of smileys in use:

@LadyGaGaMonst Oh, okay. ;D BT is amazing. I’m so happy that Dancing On My Own got a Grammy nomination. :D

Emoticons are frequently an important part of the message, and not just a complement. Should they be removed, the reader could be left unable to understand the message correctly, or at least not in its fullest. For example, in the simple message "It's already empty. :-)", the emoticon adds something to the message, and changing it to another one, or removing it altogether, can result in a significantly different meaning. In ironic messages, what is written is not to be taken literally. Many times the author means the exact opposite. Emoticons provide a very valuable clue in these situations. For example:

@Camupins Yeah, you’re right. No one loves you anymore... So HAH! Suck that! ;D

Emoticons are valid elements of microblog’s general lexicon. This is a broader lexicon than the one traditional NLP applications expect to process.
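As an illustration of why these graphic markers are hard to pin down, the minimal sketch below matches only a handful of common "horizontal" emoticons with a single regular expression. The pattern and the function name are assumptions made for this example only and, as discussed in Chapter 3, no fixed list or pattern covers the open-ended set of emoticons found in microblogs.

```python
import re

# A deliberately incomplete pattern for a few "western-style" horizontal
# emoticons; the pattern and the function name are illustrative assumptions.
WESTERN_EMOTICON = re.compile(
    r"""
    [<>]?                   # optional reversed-face marker
    [:;=8xX]                # eyes
    ['`^-]?                 # optional nose or tear
    [()\[\]DPpOo3*/\\|]+    # mouth, possibly repeated (e.g. ":-)))")
    """,
    re.VERBOSE,
)

def find_emoticons(text):
    """Return candidate horizontal emoticons such as ':-)' or ';D'."""
    return WESTERN_EMOTICON.findall(text)

print(find_emoticons("It's already empty. :-)  So HAH! Suck that! ;D"))
# A pattern like this misses vertical emoticons ("^_^"), hearts ("<3") and
# inventions such as "orz", which is why a fixed pattern is insufficient.
```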

2.4.7 Decorations

A number of microblog users believe they should highlight some parts of their messages. If a rich text format were available to them, they would use bold, italics, underlining, colour, blinking, sparkling, and/or another method to call attention. Since microblogs do not provide such features, there are only three alternatives available: using upper-case text, decorations or exploring Unicode characters.

The use of upper-case text has already been discussed in Section 2.4.3. It produces what is often called the "shouting" effect, as it gives the impression that the writer is raising their voice. The decorations consist of one or more non-alphanumeric symbols that are placed around words or characters to make them stand out. These symbols are usually easy to distinguish from letters, so as not to complicate the reading. Therefore, "." or ":" are fine, but "#" or "$" are not as common. Some characters have been used on the Internet for many years for this purpose: "*bold*", "/italics/", "_underlined_". Here are two examples of decorators:

Lea and ezra america. sitting in a tree. k.i.s.s.i.n.g

@SpeakSeduction im feelin all sorts of. "emotions" haha damn . but shoot at least im not workin yay!! *jigs*

Decorations can also be used as "ornaments" in the message, without the intent of supporting words. These characters could be deleted from the message without affecting its meaning in any way, as they function only as beautifiers or spacers. For example:

∼CHECK IT OUT∼

It should be noted that the distinction between an unknown smiley and a decoration can sometimes be very hard to make. Here is a more complex example of the use of decorators:

TYVM Love u back d(ˆ_ˆ)b #Joy RT @misslindadee *¨) God Bless!º≤≥P::E::A::C::E ::&:: H::A::P::P::I::N::E::S::S!!!•*¨*LOVE YOU!!@SPARTICUSIAN

Finally, looking for more creative forms of expressing themselves, some users opted for exploring the Unicode character set supported by Twitter and other microblogs, occasionally replacing letters with graphically similar symbols, such as "a" by "@", "E" by "∈", or "C" by "©".

2.4.8 Obfuscation

On occasion, users wish to limit the number of readers that understand the full meaning of their message. There are many reasons for that: hiding profanity, avoiding naming a person or company, or not wanting to be associated with some group or ideology by naming it.

There are many forms of obfuscation. The most common one consists of replacing one or more letters (usually not the first one) with another non-letter character. Vowels are usually the preferred letters to replace. Some users replace letters with graphically similar characters (e.g. an "o" by a "0", or an "s" by a "$"). Another popular obfuscation technique is exchanging two consecutive letters in the middle of the word. Some examples of obfuscated text follow:

Never understood why people edit they tweets by using words like fcuk,sh*t, fuggin,bi**h.What’s the point if you going to curse then curse

´´@Paris_Richie: ´´you a nothing a** person , who does nothin a** sh*t , and ion want a nothin a** nigga´´ !´´

There are some other text constructions that look like obfuscated text but have other meanings. One example is globbing (the use of a simple expression to define a pattern that encompasses a number of possibilities), such as in "*nix" to refer to Unix and Unix-like operating systems (Linux, Minix, Xenix, etc.). These occurrences are rare outside of advanced computing circles.
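The character-substitution form of obfuscation described above can, in the simplest cases, be partially reverted with a look-up table. The sketch below is only a minimal illustration under that assumption; the substitution table and the function name are hypothetical, and the deobfuscation approach actually studied in this thesis is presented in a later chapter.

```python
# A naive character-substitution table for undoing the simplest kind of
# obfuscation described above. The table, the function name and the example
# are illustrative assumptions, not the thesis' deobfuscation method.
SUBSTITUTIONS = {"$": "s", "0": "o", "3": "e", "1": "i", "@": "a"}

def naive_deobfuscate(token):
    """Replace common look-alike characters inside word-like tokens."""
    # Leave purely numeric tokens (e.g. "2010") untouched.
    if not any(ch.isalpha() for ch in token):
        return token
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in token)

print(naive_deobfuscate("M1cro$oft"))   # -> Microsoft (a candidate reading)
print(naive_deobfuscate("2010"))        # -> 2010 (unchanged)
```

Such a table obviously cannot recover wildcard replacements such as the "*" in "sh*t", which require context or a dictionary; that is precisely why obfuscation is treated as its own problem later on.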

2.4.9 Special constructs

As email brought the idea of email addresses to the public, and the World Wide Web disseminated the idea of URLs, so do microblogs bring with them a specific set of constructs, like the previously mentioned @references and hashtags. Even if for many people these are as common as phone numbers, for historical reasons (as was explained before), many linguistic tools still do not recognize any of them. In the case of microblog-specific constructs, this is understandable, as they are limited to a specific environment; but the others became mainstream many years ago.

2.5 Conclusion

We have now seen how very informal writing can challenge automatic text processing. None of the problems described is exclusive to microblogging; they can be found across most textual UGC, as well as in SMS. We can form an idea of the typical microblogging user based on some of the problems we listed:

• they can have some trouble in putting their thoughts in formal writing (they write as they talk), and while in some situations this may be intended, often it does not seem like it;

• they wish to avoid typing as much as possible, omitting characters even when they have no need to do so and disregarding letter casing; and

• they feel the need to emphasise things non-verbally, from stress (decorations) to non-verbal language (emoticons).

It would be an error to overgeneralize this idea, since some users do write in near-perfect form, but they are, unfortunately, a minority.

Chapter 3

Tokenization of micro-blogging messages

Contents

3.1 Introduction
3.2 Related work
3.3 A method for tokenizing microblogging messages
    3.3.1 Extracting features for classification
3.4 Tokenization guidelines
    3.4.1 Word tokens
    3.4.2 Numeric tokens
    3.4.3 Punctuation and other symbols
3.5 Experimental set-up
    3.5.1 The tokenization methods
    3.5.2 Evaluation scenarios and measures
    3.5.3 The corpus creation
3.6 Results and analysis
3.7 Error analysis
    3.7.1 Augmenting training data based on errors
3.8 Conclusion and future work

The automatic processing of microblogging messages may be problematic, even in the case of very elementary operations such as tokenization. The problems arise from the use of non-standard language, including media-specific words (e.g. "2day", "gr8", "tl;dr", "loool"), emoticons (e.g. "(ò_ó)", "(=ˆ-ˆ=)"), non-standard letter casing (e.g. "dr. Fred") and unusual punctuation (e.g. "......", "!??!!!?", "„,"). Additionally, spelling errors are abundant (e.g. "I;m"), and we can frequently find more than one language (with different tokenization requirements) in the same short message.


To be effective in such an environment, manually developed rule-based tokenizers have to deal with many conditions and exceptions, which makes them difficult to build and maintain. That means that, while it would be possible to add support for mentions, hashtags and URLs, other situations would be more difficult to handle. For example, the asterisks in "That is *not* OK!" (text enhancement) should be treated differently from the asterisk in "What a pr*ck!" (obfuscation) or in "Kisses! :-***" (smiley). Hard-coded rules also have the effect of calcifying language-specific conventions and practices, making them less useful in a culturally diverse environment.

Based on the requirements we mentioned, and the limitations our team was facing, we searched for a better solution. In this chapter we propose a text classification approach for tokenizing Twitter messages, which addresses complex cases successfully and which is relatively simple to set up and to maintain. Our implementation separates tokens at certain discontinuity characters, using an SVM classifier to decide whether they should introduce a break or not. We achieved F1 measures of 96%, exceeding by 11 percentage points the rule-based method that we had designed specifically for dealing with typical problematic situations. Subsequent analysis allowed us to identify typical tokenization errors, which we show can be partially solved by adding some additional descriptive examples to the training corpus and re-training the classifier.

3.1 Introduction

Despite its potential value for sociological studies and marketing intelligence [MK09], microblogging content (e.g. Twitter messages and Facebook status messages) remains a challenging environment for automatic text processing technologies due to a number of foibles that are typical of User-Generated Content. Such environments tend to present a significant number of misspelled or unknown words and acronyms, and are populated with media-specific vocabulary (e.g. "lol") that is not always lexicalized in dictionaries. Additionally, they usually suffer from inconsistent use of capitalization in both names and acronyms, may mix languages and employ typographically constructed symbols (such as "=>" and ":-)"). Also, users often create new sarcastic words to better express themselves in a short and informal way (like "Dubya" to refer to George W. Bush).

The microblogging environment has some additional characteristics that influence the quality of the text. First, since messages are limited to 140–200 characters, text economy becomes crucial and thus users tend to produce highly condensed messages, skipping characters whenever possible (including spaces), and create non-standard abbreviations (such as "altern8"), similar to SMS text [ANSR08]. Second, the quasi-instant nature of micro-blogging diffusion promotes conversations between micro-bloggers. Thus, messages tend to have a strong presence of oral markers such as emoticons and non-standard or missing punctuation. Finally, due to the almost ubiquitous access to a network, users can micro-blog from any place using mobile devices, which, for usability reasons, tend to constrain how users write messages and (not) correct them in case of spelling mistakes.

All these idiosyncrasies raise several obstacles to most natural language processing tools, which are not prepared to deal with such an irregular type of text. Therefore, some work has been done in developing text pre-processing methods for homogenizing user-generated content and transforming it into text that is more amenable to traditional Natural Language Processing procedures. These methods include, for example, contextual spell-checking [Kuk92] and name normalization [ACDK08, JKMdR08].

We focus on a fundamental text pre-processing task: tokenization (or symbol segmentation) as the initial step for processing UGC. The task consists in correctly isolating the tokens that compose the microblog message, separating "words" from punctuation marks and other symbols. Although apparently trivial, this task can become quite complex in user-generated content scenarios, and especially in micro-blogging environments, for the reasons explained above. Besides all the tokenization problems that can be found in "traditional" text (e.g. acronyms, diminutive forms, URLs, etc.), there are many other situations that are typical of micro-blogging. As a consequence, mishandling such highly non-standard language usage could compromise all further processing.

One more example: in regular text, such as printed media, the slash ("/") has a restricted number of use cases, and from a tokenization point of view it is unambiguous.
In the microblogging context, a slash may also be part of a URL or a smiley. In fact, this kind of problem is also found in some specialized fields, such as the biomedical domain [TWH07], where some common symbols (such as "/" and "-") may perform a function within the token (e.g. "Arp2/3" is a protein name, and "3-Amino-1,2,4-triazole" is an organic compound). As their interpretation becomes more context-dependent, specialized tokenizer tools become essential.

Tokenization of microblogging messages thus may be quite problematic. The most common problem is users simply skipping white space, so the tokenizer has to determine if those spaces should exist. Table 3.1 has some illustrative examples.

Table 3.1: Examples of tokenization challenges. Key regions in bold font.

1   "the people at home scream at the phone.and complain when i scream,like 'ahhhhhhh'. af,start start start"
2   "Band in the style of Morchiba I just discovered: Flunk. Very good. Thank you last.fm!"
3   "i didn't seee dougie O.o should I go to the doctor ??"
4   "@FirstnameLastname Really ? ? the first time I saw I was in Frankfurt-GERMANY, and this is the first time I saw Madri this way ,cuvered in snow.. kiss"
5   "@Berrador Benfica is not to blame that Vitor Pereira is irresponsible..But it's bad for everyone. Notice the appointing of Duarte Gomes to the fcp-scp"
6   "Be a bone marrow donor-Show up in norteshopping until friday and unite for this cause."
7   "What is worse, coca-cola without gas or beer without alcohol?"
8   "@aneite can i then schedule a check-up ? ( cc @firstlastname )"
9   "Today's theories: a) if its in april, a thousand waters, then have rain until april b) if rain wets, then whoever walks in the rain gets wet -.-' "
10  "Have a good week, Guys :o)"
11  "@j_maltez Mother so young!?!!?!! :X"
12  "5.8 earthquake in the Caiman. No victims reported"
13  "Earthquake of 6,1 felt today in Haiti."
14  "The movie's impressive numbers RT @FirstnameLastname: Ubuntu Linux is the key to Avatar's success http://bit.ly/7fefRf #ubuntu #linux #avatar"

Message 1 shows the typical situation of the missing white space. While these situations could be fixed using simple rules, messages 2 and 3 exemplify other frequent situations where such a simple approach would not work. Messages 4, 5 and 6 show examples where the tokenizer should divide the emphasized text into three tokens: "Frankfurt - GERMANY", "fcp - scp" and "donor - Show". By contrast, in messages 7 and 8, the emphasized text should be considered a single token. Cases involving parentheses also provide good examples of how tokenization may be complex. Usually, parentheses make up their own token. However, when used in enumerations and "smileys", the parentheses are part of the larger token (as in messages 9 and 10). Such cases may appear in other forms of text, but not with the frequency found in UGC. Other times users make very creative use of punctuation, often overusing it (example 11). These occurrences, nevertheless, should not be confused with the very large number of smileys that populate the messages and which the tokenizer should keep together (as found in messages 3, 10 and 11). There is also a wide variety of complex cases including names of products (example 7) and numeric quantities (or dates) in non-standard format (examples 12 and 13). Finally, there are also many microblogging-specific exceptions to traditional tokenization operations, such as usernames (@username), hash tags

(#hash_tag) and URLs that must be kept intact (example 14 illustrates these situations).

It would be possible to use an existing tokenizer and add new rules to cover these additional situations. One such tokenizer is UPenn Treebank's tokenizer1, which fails to correctly tokenize many of the above examples; for instance, it has no understanding of URLs or smileys, resulting in tokens such as "http : //bit.ly/123456" and ": - )". However, and contrary to traditional text, microblogging content is very irregular (with each author creating their individual writing style), and the token universe is open, due to the creative use of letters, numbers and symbols (i.e. characters other than letters, numbers and white space).

For the above reasons, many tokens are not in a dictionary. We cannot rely on compiling a list of possible emoticons, for example, as new ones continue to appear. Some can be quite elaborate, for instance: "8<:-)", "orz" (a kneeling man), "<(-'.'-)>", "(ò_ó)" or "(=ˆ-ˆ=)". The "orz" example in particular shows that simple heuristics are insufficient to identify all emoticons, as they can pass as an unknown word2. A more frequent example is the "XD" smiley. Without a compiled list or an adequate pattern to match them against, emoticons are difficult to accommodate in a rule system developed manually. This is not a small setback, as emoticons play a fundamental role in the interpretation of UGC-like messages, for example in sentiment analysis [Rea05] or opinion mining tasks.

We propose a text-classification approach to the tokenization of microblogging messages. We train a classifier for deciding whether a white space character should be introduced before or after certain symbols, such as punctuation, quotation marks, brackets, etc. Compound words (such as "airfield") will never be separated.

This tokenizer is only concerned with the separation of tokens. We assume that other tasks normally associated with text pre-processing, such as the identification of compound multiword expressions, character transposition, text normalization, and error correction, are to be delegated to other modules further down the processing pipeline (which can be performed in multiple cycles of operation).

For training the classifier we use a set of manually tokenized Twitter messages, which can be easily obtained, as the annotation task itself is relatively simple for humans: the annotator mostly deals with particular cases and does not have to think about the (not always foreseeable) impact of each rule they write. The key point is that adding more (and more diverse) training examples is an easier way to improve the performance than adding new rules [NY00].

1 http://www.cis.upenn.edu/~treebank/tokenizer.sed
2 "Orz" is also the name of an alien race in the fictional universe of the computer game Star Control 2.

In this chapter we address the following questions:

• Can a classification-based approach to tokenization of UGC clearly outperform a tokenizer based on rules specifically crafted for this purpose? If so, how much training data is required for achieving such performance?

• Which features — both in diversity and amount of contextual information — does the classifier need to obtain optimal tokenization performance?

Our work is mostly done on Portuguese text, and that influences some of the decisions made when defining token borders. However, our methods and results are valid for most western languages with little or no adaptation.

The remainder of the chapter is organized as follows: in Section 3.2 we introduce some work that relates to tokenization or text pre-processing. In Section 3.3 we describe the method used to tokenize UGC. In Section 3.4 we explain the rules for the processing of the corpus and why we made such choices. In Section 3.5 the experimental setup is described, explaining how we reached the results, which are then presented in Section 3.6. We analyze the most significant errors in Section 3.7, and then conclude with a summary of the results and directions for further work.

3.2 Related work

Being such a fundamental problem, tokenization is considered solved for traditional content. Maybe for this reason, there seems to be a lack of investment in developing new tokenization solutions for UGC, despite the fact that this type of content poses a completely new set of challenges for tokenization. Thus, in this section, we refer to work on tokenizing text content other than UGC, or on conceptually similar, yet different, problems.

A comparable work is the Tomanek et al. study of the tokenization problem on biomedical texts [TWH07]. The notation used to write biomedical entity names, abbreviations, chemical formulas and bibliographic references conflicts with the general heuristics and rules employed for tokenizing common text. This problem becomes quite complex, since biomedical authors tend to adopt different, and sometimes inconsistent, notations. The method proposed consists of using Conditional Random Fields (CRF) as a text classifier to perform two tasks: (i) token boundary detection and (ii) sentence boundary annotation. For obtaining the corpus, the authors gathered a number of abstracts from works in the biomedical fields, and extended them with a list of problematic entities. The set of features used for classification includes, for example, the unit itself, its size, the canonical form, the orthographical features, whether it is a known abbreviation, and the presence of white space to the left or right of the analyzed point. After training the classifier they achieved a tokenization accuracy of 96.7% (with 45% of false positives).

Tomanek's successful method on biomedical texts cannot be reproduced when applied to UGC due to the different nature of the texts. Scientific articles obey general editorial rules. They impose a certain standard writing style, assure that the author is consistent in the language used (usually English), and that the text is easy to read. In UGC there is no review process, leading to many non-uniform messages (incorrect capitalization, for instance) with higher noise levels (non-standard punctuation, like "„,"), multiple languages in the same short message (difficult to recognize due to misspellings and the small amount of text for statistical analysis), and, as stated before, an open lexicon that makes the "problematic entity list" strategy unviable.

Takeuchi et al. propose a method for solving the problem of identifying sentence boundaries in transcripts of speech data [TSR+07]. The difficulty arises from the complete lack of punctuation in the transcripts, and also from word error rates of about 40%. The method, named Head and Tail, is a probabilistic framework that uses information about typical 3-grams placed at the beginning (Heads) or at the end of sentences (Tails) to decide if there is a sentence boundary or not. The authors compare their method with a maximum entropy tagger trained on manually annotated corpora with boundary information. They also define a baseline that uses only the information about silence in the transcripts to decide if there is a boundary. Results show that the method proposed outperforms both the maximum entropy tagger and the baseline. Moreover, results improve further when the Head and Tail method is configured to use silence information and additional heuristics to remove potentially incorrect boundaries.
Furthermore, they showed that their boundary detection method led to significant improvements in subsequent morphological analysis and extraction of adjective-noun expressions.

We believe that the tokenization task we are addressing is, in certain ways, similar to the word segmentation problem in Chinese [TWW09]. Although tokenization of UGC is expected to be simpler than word segmentation in Chinese, both require making use of contextual information to decide whether two symbols should be separated or not. Additionally, and contrary to word segmentation in Chinese, tokenization approaches in microblogging environments cannot make solid use of dictionary information, since a very large number of "words" are not lexicalized. Our tests estimate that nearly 15.5% of tokens containing only alphabetic characters are not in a standard lexicon3.

3 The Portuguese GNU Aspell at http://aspell.net/

The most successful approaches to the tokenization problem use machine learning techniques. For example, Tang et al. compare the performance of three classification algorithms, namely (i) Support Vector Machines (SVM), (ii) CRF and (iii) Large Margin Methods, for tokenization [TWW09]. Results show that automatic classification methods applied to text can achieve relatively high F-measures (95%). This led us to believe that machine learning approaches can also be applied with success to the task of tokenizing microblogging messages. On the other hand, the task of word chunking, i.e. grouping words together to form lexical or syntactical compounds, can be seen as a special type of tokenization in which words play the role that characters play in ordinary tokenization. Again, machine learning approaches have proven successful. For example, Kudo and Matsumoto used several SVMs with weighted voting [KM01]. Their results also show that accuracy is higher, regardless of the voting weights, compared with any single representation system. Since the current problem is conceptually simple, a single classifier should be enough.

3.3 A method for tokenizing microblogging messages

A token is an atomic symbol used in text processing. Its definition is very dependent on what role it will play in the subsequent processing pipeline. For instance, when tokenizing text for machine translation, it may be preferable to identify frequent expressions as tokens and not just single words. This "flexibility" is reflected in different results when comparing different tokenizers [HAAD+98], which may not simply be interchanged.

Our intent is to perform information extraction on microblogging messages. For that it is fundamental that we first correct the frequent misspellings made by the users, and then normalize the inconsistent text. If we opt for a "traditional" tokenization process, we may split tokens that are difficult to piece back together at a later time (e.g. "alternate" typed as "altern8", or "I'm" typed as "I;m"). We define a token as an element that can later be considered a unit of processing, for example: a word, a punctuation block (e.g. "!?", ","), a URL, or a microblogging-specific mark such as a user reference (e.g. @PeterP).

Tokenization is achieved by adding a space character at special decision points of the microblogging message. These points surround all non-alphanumeric characters (excluding white space). In the following example, we use the symbol "k" to indicate decision points where white space should be inserted, and mark with a "|" those that should be left intact (i.e. white space is already present or adding a space character would break the token).

“Ohk,| if I was in Sampak.|.|.| I would certainly gok!| |"kBeat itk"| 3.3. A METHOD FOR TOKENIZING MICROBLOGGING MESSAGES 35

Flash Mobk,| 25|/|01k,| at 14hk,| in Anhangabaú Valleyk.| |(kvia |@|terciorsk)”

The tokenization method that we propose can thus be stated as a binary classification problem: can we train a model to determine if a certain decision point should have a space character inserted (or not)? At this point, the question of an adequate training corpus arises. In Section 3.5.3 we describe how we annotated microblogging messages to create such a data source. One aspect that is also worth exploring is the performance of the classifiers with different sizes of corpus. This is an indication of the amount of effort required in the manual annotation task to be able to achieve certain results.
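To make this framing concrete, the minimal sketch below enumerates the decision points around non-alphanumeric characters and inserts a space wherever a given predicate answers positively. The function names and the trivial always-split predicate are assumptions made for illustration; in the thesis the predicate is a trained classifier, whose features are described next.

```python
def decision_points(message):
    """Yield character positions that surround non-alphanumeric,
    non-space characters; each one is a candidate place for a space."""
    points = set()
    for i, ch in enumerate(message):
        if not ch.isalnum() and not ch.isspace():
            points.add(i)        # before the character
            points.add(i + 1)    # after the character
    return sorted(p for p in points if 0 < p < len(message))

def tokenize(message, should_split):
    """Insert a space at every decision point for which the (learned or
    rule-based) predicate `should_split(message, position)` answers True."""
    out, last = [], 0
    for p in decision_points(message):
        if message[p - 1] != " " and message[p] != " " and should_split(message, p):
            out.append(message[last:p])
            last = p
    out.append(message[last:])
    return " ".join(out)

# With a trivial predicate that always splits (the baseline discussed later),
# "Oh,if I was in Sampa..." becomes "Oh , if I was in Sampa . . .".
print(tokenize("Oh,if I was in Sampa...", lambda m, p: True))
```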

3.3.1 Extracting features for classification

The first step in our method is defining the features that we use. These features are based on the properties of each character that composes the message; therefore we can say that we work at character level. Every character is described by at least one feature. All features used are binary (they are either present or absent), and describe different aspects of the characters (e.g. nature, function and purpose). No effort is made to reduce the number of features; it is up to the classification algorithm to identify the relevant ones. In total we created 31 distinct features. Some examples follow:

• Character nature: alphanumeric, alphabetic, numeric, punctuation, symbol, space, etc.

• Type of alphabetic characters: upper case, upper case without accent, letter with accent, vowel, etc.

• Type of symbol and punctuation: bar, dash, monetary symbol, closing symbol (e.g. ")", ">", "}", "]"), opening quotation (e.g. """, "«"), accent (e.g. "ˆ", "˜"), arithmetic operator, smiley nose, etc.

The window of characters for which we generate features extends from the relevant decision point in both directions. This means that when using a 10-character window, the classifier can base its prediction on the features of the previous 10 characters and the next 10 characters. The size of the window is one of the parameters of the process. One should take into account that using all characters in the message would greatly increase the feature space, and thus the computational requirements. Even worse, it could also worsen the tokenizer's results. This could happen as the classification algorithm may infer incorrect patterns due to an insufficient number and variety of examples covering the extended context.

Table 3.2: Feature creation to the left and right of the first decision points (the first periods) of part of the text “if I was in Sampa... I would certainly go!”.

              Left (pre)         Right (pos)
Point 1    …  a  m  p  a         .  .  .      …
              4  3  2  1         1  2  3  4
Point 2    …  m  p  a  .         .  .     I   …
              4  3  2  1         1  2  3  4

The distance between the character under evaluation and the current decision point is taken into consideration, as well as whether it is located to its left or right. For example, "if i was in Sampa... I would certainly go!" from the post quoted above would be interpreted as displayed in Table 3.2 at the first relevant points. The relative position of the character and the name of the feature are encoded into the features themselves. For example, the feature we used called "char_is_alphanumeric_pre-3" says that an alphanumeric character is found 3 characters to the left of the decision point. Both positional and character information are required to address certain vocabulary-specific cases that are usually problematic in tokenization, like frequent abbreviations (for example, "Dr." and "etc.") and URLs ("http://"). The relevant feature in these cases is the literal feature. It represents each individual character, and provides the information "this is a "D" character", or "this is a "$" character".

For each decision point the number of features is usually large. For example, the simple letter "a" results in 9 features, meaning that it is easy to reach 100 features or more, even with a modest window size. Each character can be described by a number of features that is close to 180 (one for each literal character, plus the 31 category features we defined). Even with a small feature window size, the dimension of the feature space is very large.
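A minimal sketch of this feature extraction scheme follows, using only a small illustrative subset of the 31 character descriptors and the pre-/pos- positional encoding described above. The helper names are assumptions made for this example, not the exact implementation used in the thesis.

```python
def char_features(ch):
    """A tiny illustrative subset of the binary character descriptors."""
    feats = ["literal_" + ch]
    if ch.isalnum():
        feats.append("char_is_alphanumeric")
    if ch.isalpha():
        feats.append("char_is_alphabetic")
        if ch.isupper():
            feats.append("char_is_uppercase")
    if ch.isdigit():
        feats.append("char_is_numeric")
    if ch in ".,;:!?":
        feats.append("char_is_punctuation")
    return feats

def window_features(message, point, window=3):
    """Positional features for `window` characters to the left (pre) and
    right (pos) of a decision point, e.g. 'char_is_alphanumeric_pre-3'."""
    feats = []
    for offset in range(1, window + 1):
        left, right = point - offset, point + offset - 1
        if left >= 0:
            feats += ["%s_pre-%d" % (f, offset) for f in char_features(message[left])]
        if right < len(message):
            feats += ["%s_pos-%d" % (f, offset) for f in char_features(message[right])]
    return feats

# Features around the decision point just before the first "." in "Sampa...".
print(window_features("if I was in Sampa... I would certainly go!", point=17))
```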

3.4 Tokenization guidelines

The rules that govern our tokenization will have a big impact on later processing tasks. Subtle decisions at this stage can lead to very different results down the NLP pipeline [HAAD+98]. A set of annotation guidelines was produced to ensure a good level of consistency during the manual annotation of the training corpus. These are summarised below to describe the corpus creation effort: the problems found and the decisions made. As a general rule, to avoid compromising the subsequent processing phases, it is essential not to "break" some "malformed" constructions, and lose the opportunity to correct them in later stages. Hence, we focus on what the author intended rather than what they actually typed.

We have emphasised that, in tokenization, the same set of rules may not work correctly in all situations. This also applies here. The set of "rules" enforced by our model are the ones that we feed to it. In our experiments we trained it with the purpose of supplying results adequate for text processing (as was our intended use). When we later used our method to tokenize messages for deobfuscation, a much more specific situation that had no weight here (see Section 8.3.2), we found that it showed no significant improvement over the baseline in isolating those particular cases [LO14a]. Odd situations should be included in the training set; a constantly updated model may be desirable.

3.4.1 Word tokens

Word tokens are defined in a loose way, due to the nature of the messages. For example, "2day" is treated as an equivalent to "today", and both are equal word tokens. Twitter usernames (@twisted_logic_), hash tags (#Leitao2010) and hyphenated compound words ("e-mail") are also considered regular words/tokens. We commonly see the dash as a replacement for a word in certain circumstances where the length of the message is relevant. For example, with "Egypt–Nigeria" the author means "Egypt versus Nigeria", or "from Egypt to Nigeria", and therefore it should be tokenized as "Egypt – Nigeria". Also, acronyms are traditionally written with a dot, as in "U.K.", and need to be recognized as one unit. In the case of abbreviations, the dot is also considered part of the word. Another situation is the contraction used in informal oral speech, where the apostrophe is used to indicate that one syllable is unpronounced. This is seen in English, as in "li'l" or "lil'" meaning "little", or in Portuguese, where "'tava" stands for "estava". Once again, the token is considered atomic, since the apostrophe is meaningless alone in such a situation. However, the apostrophe is also used to combine two words, as in "I'm". In these situations, it should be tokenized as two separate tokens (in the example, "I 'm", referring to "I am"; we intend to modify the message only by introducing spaces).

Special care is taken with words that are typed incorrectly, as we expect them to be a frequent occurrence. Handling them properly at this stage will help in correcting them more easily later. The general rule is treating the string as if it were correctly written. A common error is the user pressing the key next to the correct one. For instance, "I;m" instead of "I'm" should be split as two tokens, while "im[ossible" instead of "impossible" should be kept as-is. Another problem with keyboards is when they are misconfigured (set to an incorrect layout). A user expecting to type "força" in a Portuguese layout would produce "for;a" if the keyboard were set to US. Other difficult situations are deliberately created by the user, for instance, when they cannot find the symbol they want on their keyboard. One example is writing the "©" symbol as "(c)". Typing accented letters on some foreign keyboards is a more complicated problem. The apostrophe is a known way of signaling acute or grave accents in these situations; for example, typing "ate'" meaning "até". One final situation happens when a user changes one or more letters in their text to obfuscate it, as a way to avoid censorship (e.g. "sh*t"). Some of the occurrences described here are not very frequent, but they illustrate the diversity of problems being addressed.

3.4.2 Numeric tokens

We decided to separate the numbers from the units they represent, as some people write "1€" and others opt for "1 €" or even "€1". Dates and times, which possess a more complex representation, are preserved as a single token whenever possible.

The comma is used as the decimal separator, and the dot as the thousands separator, in Portuguese; but the opposite is just as common due to foreign influence (see examples 12 and 13 of Table 3.1). A normalization process will solve this problem. Another problem occurs when dealing with ordinals, where numbers and letters or symbols are concatenated. For example, English UGC typically uses forms such as "1st", while in Portuguese ordinals are written as "1º" or "1ª", and are often found written as "1.º", "1.ª", "1o" or "1a". No space character should be inserted in any of these situations, as each represents a single word, and information could be lost (e.g. interpreting "st" as short for "street" or "saint").
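As a minimal illustration of how such numeric expressions could be recognized as single tokens, the sketch below uses a small set of regular expressions. The exact patterns are illustrative assumptions and do not reproduce the annotation or tokenization rules used in the thesis.

```python
import re

# Illustrative patterns that keep numeric expressions together as single
# tokens, in line with the guidelines above; these are not the thesis' rules.
NUMERIC_TOKEN = re.compile(
    r"""
    \d{1,4}[-/]\d{1,2}[-/]\d{1,4}        # dates such as 2010-07-08 or 25/12/2009
    | \d{1,2}:\d{2}                      # times such as 4:15
    | \d{1,2}h\d{2}m?                    # times such as 18h15 or 18h15m
    | [-+]?\d+(?:[.,]\d+)*               # numbers such as -12,345.6 or 6,1
      (?:\s?[ºªoa]|\.[ºª]|st|nd|rd|th)?  # optional ordinal suffix (1º, 1.ª, 1st)
    """,
    re.VERBOSE,
)

for text in ["25/12/2009", "18h15m", "-12,345.6", "1.º", "1st", "6,1"]:
    match = NUMERIC_TOKEN.match(text)
    print(text, "->", match.group(0) if match else None)
```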

3.4.3 Punctuation and other symbols

When dealing with punctuation, end-of-sentence symbols (".", "...", "!" and "?") are kept grouped as a single token. The other punctuation characters and symbols create their own token, except in the rare instance where they are duplicated. In this way, "--," is transformed into "-- ,", that is, the dashes are kept grouped, and the comma is another atom. This rule is used for all non-punctuation symbols as well. There are a few exceptions to the rules described above: URLs and emoticons/typographical symbols (e.g. "=>") are not broken up, as they would otherwise lose their meaning. The same applies to enumerators such as "a) b) c)". As an illustration, we present the original message:

“I’m off to sleep a little bit„uahua.. .but I’ll be back to see the soap opera TO LIVE LIFE =D I loove.. rs....!!! Kisssses to those who like them =D rsrsss”

And the corresponding tokenized message:

“I’m off to sleep a little bit „ uahua .. . but I’ll be back to see the soap opera TO LIVE LIFE =D I loove .. rs ....!!! Kisssses to those who like them =D rsrsss”

3.5 Experimental set-up

Our experiment is based on the comparison of the performance of a classification-based approach with that produced by a regular-expression-based method that will be used as a baseline. This rule-based method is inspired by the UPenn Treebank tokenizer, but follows our rules of tokenization previously defined in Section 3.4, and consequently is better suited for UGC than UPenn's.

3.5.1 The tokenization methods

We opted for using SVMs [Joa98] as our classification algorithm, as they were found to be extremely robust in text classification tasks and can deal with the feature space we have (on the order of 180N, where N is the number of characters in the feature window). In addition, they are binary classifiers, which matches our problem, as the decision is always between inserting or not inserting white space. We used an "off-the-shelf" implementation of the SVM classifier (SVMLight4).

To determine the best kernel to use with the SVM, simple preliminary tests were executed. The second-degree polynomial function outperformed the default linear kernel by a small but significant amount, and was therefore used. However, an exhaustive search for optimal SVM parameters falls outside the scope of this work, and the remaining parameters were left at their default values. We wish to investigate how this classification approach behaves according to two variables: (i) the dimension of the corpus and (ii) the size of the feature window.

Our baseline algorithm works by trying to isolate tokens that belong to one of a series of categories, each one defined by a regular expression (a simplified sketch of this baseline follows the category list below). They are, in order of definition:

1. a URL, identified by "http://" or "www.";

4 http://svmlight.joachims.org/

Table 3.3: Tests generated from an example message for both scenarios.

Tokenized   @PeterP Yes ;-) ...
Sall        @PeterP Yes;-)...
Sone        @PeterP Yes;-) ...
            @PeterP Yes ;-)...

2. a block of alphanumeric characters starting with a letter (like a regular word);

3. a date, as “2010-07-08” or “25/12/2009”;

4. a time, such as “4:15” or “18h15m”;

5. a number, for example, “-12,345.6” or “2º”;

6. one of about 25 popular smileys and variants;

7. a block of punctuation characters;

8. a user reference (“@john”);

9. a hash tag (“#true”);

10. a block of unexpected symbols.
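The sketch below is a simplified illustration of such an ordered, category-based regular expression tokenizer. The patterns are rough approximations of the ten categories listed above and the names are assumptions made for this example only; they are not the exact expressions used for the baseline in the experiments.

```python
import re

# A simplified sketch of the rule-based baseline: an ordered list of category
# patterns is tried at each position and the first match is emitted as a token.
CATEGORY_PATTERNS = [
    ("url",      r"(?:http://|www\.)\S+"),
    ("word",     r"[^\W\d_]\w*"),                  # starts with a letter
    ("date",     r"\d{1,4}[-/]\d{1,2}[-/]\d{1,4}"),
    ("time",     r"\d{1,2}(?::\d{2}|h\d{2}m?)"),
    ("number",   r"[-+]?\d+(?:[.,]\d+)*[ºª]?"),
    ("smiley",   r"[:;=8][-']?[)(DPpO]+"),
    ("punct",    r"[.!?,;:]+"),
    ("mention",  r"@\w+"),
    ("hashtag",  r"#\w+"),
    ("symbols",  r"\S+"),                          # anything left over
]
TOKEN = re.compile("|".join("(?P<%s>%s)" % (name, pat)
                            for name, pat in CATEGORY_PATTERNS))

def baseline_tokenize(message):
    """Return the tokens found by the first-matching category patterns."""
    return [m.group(0) for m in TOKEN.finditer(message)]

print(baseline_tokenize("RT @john: gr8 news!!! http://bit.ly/x #true :-)"))
# ['RT', '@john', ':', 'gr8', 'news', '!!!', 'http://bit.ly/x', '#true', ':-)']
```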

3.5.2 Evaluation scenarios and measures

Both methods (classification and rule-based) are evaluated under two scenarios that we consider complementary. In the first scenario, Sall, we remove all white space adjacent to each decision point, joining some tokens together (but not sequences of words). In the second scenario, Sone, the test messages have been correctly tokenized except for one white space that is removed. We generate one test for each white space symbol excluded in this scenario.

We will use the tokenized message "@PeterP Yes ;-) ..." as an example. It has 8 decision points: "@|PeterP Yes |;|-|)| |.|.|.". Only two of the three spaces will be removed, as one is not adjacent to a decision point. The test messages produced from this example text are shown in Table 3.3.

When evaluating the tokenization results, for each decision point, a True Positive (TP) is a space character that is correctly inserted, and a True Negative (TN) is a space character that is correctly not inserted. A space character inserted in the wrong place is considered a False Positive (FP), while a missing one is a False Negative (FN).
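The following minimal sketch shows how the two sets of test instances could be generated from a correctly tokenized message, treating a space as removable when it sits next to a non-alphanumeric character (the decision-point definition from Section 3.3). The function names are assumptions made for illustration, not the evaluation code used in the experiments.

```python
# A minimal sketch of generating the two evaluation scenarios from a
# correctly tokenized message; names are illustrative assumptions.
def removable_spaces(tokenized):
    """Indices of space characters that sit next to a decision point."""
    idx = []
    for i, ch in enumerate(tokenized):
        if ch != " ":
            continue
        left = tokenized[i - 1] if i > 0 else ""
        right = tokenized[i + 1] if i + 1 < len(tokenized) else ""
        if (left and not left.isalnum()) or (right and not right.isalnum()):
            idx.append(i)
    return idx

def scenario_all(tokenized):
    """S_all: remove every space adjacent to a decision point."""
    drop = set(removable_spaces(tokenized))
    return "".join(ch for i, ch in enumerate(tokenized) if i not in drop)

def scenario_one(tokenized):
    """S_one: one test per removable space, each missing a single boundary."""
    return [tokenized[:i] + tokenized[i + 1:] for i in removable_spaces(tokenized)]

msg = "@PeterP Yes ;-) ..."
print(scenario_all(msg))   # @PeterP Yes;-)...
print(scenario_one(msg))   # ['@PeterP Yes;-) ...', '@PeterP Yes ;-)...']
```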

The performance can thus be determined using Precision (P), Recall (R), F1 and Accuracy (A) measures calculated as

P = \frac{TP}{TP + FP}, \quad
R = \frac{TP}{TP + FN}, \quad
F_1 = \frac{2PR}{P + R}, \quad
A = \frac{TP + TN}{TP + TN + FP + FN}

These values are all obtained using a 5-fold cross-validation process.

We can see that both scenarios address different aspects of the problem. In the first scenario all spaces around decision points have been removed, so the tokenizer has to put them back. This scenario tests the Recall of the tokenizer: the chance of introducing incorrect white space is relatively low, and adding back spaces is what matters. The scenario Sone, where only one space next to a decision point was removed, tests the Precision: any other white space added is incorrect, as there is exactly one correct classification. The situations typically found in UGC lie somewhere in between these two scenarios.

When testing 2500 messages, Sone generates 10,838 tests, resulting in an average of 4.3 different versions generated from each test message. However, the number of possible decision points in Sone is lower than in Sall (by almost 60%), as we ignore decision points next to white space. As a way to compare the difficulty of both scenarios, we also apply a trivial tokenization method to the problem: always inserting a space character at every decision point. This will not always provide the correct tokenization, but it is a simple and predictable method, and its performance illustrates how complex each scenario is to solve.

Both scenarios are generated for each manually annotated example. For the first scenario we remove all spaces from the decision points, and for the second we produce a set of many different instances, each one missing a single token boundary (totaling the number of decision points). In this way it is trivial to compare the result of the tokenizer with the manually annotated corpus.
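For reference, a small helper computing these measures from raw counts might look as follows; the function name and the example counts are hypothetical and only serve to make the definitions concrete.

```python
# Compute the measures defined above from raw TP/TN/FP/FN counts.
def tokenization_scores(tp, tn, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"P": precision, "R": recall, "F1": f1, "A": accuracy}

# Hypothetical counts for a single cross-validation fold.
print(tokenization_scores(tp=900, tn=800, fp=40, fn=60))
```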

3.5.3 The corpus creation

The data used in the classification consists of a manually annotated corpus with 2500 messages from Twitter users in Portugal, collected using the SAPO5 broker service. These messages were randomly selected from

5 http://www.sapo.pt/

6,200,327 posts collected between 2010-01-12 and 2010-07-08, from a total of 93,701 distinct user accounts. The vast majority of the messages are in Portuguese, but there are traces of other languages. To split the workload, these messages were manually tokenized by 5 annotators using the rules stated in Section 3.4, at an average rate of 3 messages per minute. To ensure a good level of consistency, the entire set of examples was later reviewed by the first author. The testing examples used in both scenarios are automatically generated from this corpus, which has all the spaces correctly introduced, by removing the space characters at (one or all) decision points.

3.6 Results and analysis

Figure 3.1 illustrates the performance (F1) of the SVM classifier in both scenarios (Sall and Sone) as a function of the dimension of the training corpus. We used a feature window of size 10 in this experiment. The rule-based system simply ignores its training data. We can see that with just 100 training examples the SVM classifier already outperforms the rule-based tokenizer significantly, as expected. Adding more examples consistently improves the SVM's results. The rule-based approach would only provide better results if we added new rules, hence its almost stagnant F1.

It can also be observed from Figure 3.1 that the results for scenario Sall are always higher than those obtained for Sone. To illustrate the relative complexity of each scenario we tried a trivial tokenization approach, always inserting a space at every opportunity. This oversimplified method obtained F1 values of 0.672 for Sall and 0.273 for Sone.

In Figure 3.2, we can observe the results of varying the feature window's size. Scenario Sall is less influenced by the size variation of this window than scenario Sone. It can also be seen that the window of size 10 used in the previous experiment is not optimal, as the highest performance lies between 3 and 6 characters. Window sizes larger than 6 characters (to the left and the right of the decision points) show a decrease in performance. This can be justified by the sparsity of the features being generated, taking into account the size of the training set.

Significant values for feature window sizes considering F1 and accuracy are shown in Table 3.4.
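To make the notion of a feature window concrete, the sketch below shows one simple way of deriving classifier features from the characters around a decision point, each character tagged with its offset. This is an illustration under our own assumptions and does not reproduce the exact encoding used in our experiments.

    def window_features(text, point, size):
        # Characters around the decision point at index `point`, each tagged
        # with its offset, e.g. "L1=s", "R1=;".
        feats = []
        for offset in range(1, size + 1):
            left = text[point - offset] if point - offset >= 0 else "<pad>"
            right = text[point + offset - 1] if point + offset - 1 < len(text) else "<pad>"
            feats.append(f"L{offset}={left}")
            feats.append(f"R{offset}={right}")
        return feats

    # Decision point before the ";" in the untokenized "@PeterP Yes;-)"
    print(window_features("@PeterP Yes;-)", 11, 3))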

Figure 3.1: F1 vs. training corpus size for SVM classifier for Sall and Sone, and analogous test for the baseline rule-based system.

Figure 3.2: F1 vs. feature window size in characters for SVM classifier, trained using 2000 messages for Sall and Sone.

Table 3.4: Feature window sizes for Sall and Sone, by F1 and accuracy values

Window size    Sall F1    Sall A     Sone F1    Sone A
3              0.9597     0.9622     0.8775     0.7817
5              0.9600     0.9626     0.8680     0.7668
10             0.9550     0.9583     0.8596     0.7537

These include the highest F1 values, as well as the performance obtained when using the 10-character feature window used to evaluate the effect of corpus size. It is not easy to compare our results with related work, since most tokenization projects (reviewed in Section 3.2) have different goals, and do not deal specifically with user-generated content. Even so, we claim that the results that we obtain in our task are at the same level as those presented in other works, such as 96.7% accuracy in the tokenization of biomedical texts [TWH07] and 95% F-measures in the tokenization of texts in Chinese [TWW09].

3.7 Error analysis

We will describe the most significant errors made by the classification-based tokenizer when trained with 2000 messages (and tested on 500 messages), with a feature window of size 3, as this configuration provided the best general results.

Table 3.5 shows the characters most frequently associated with tokenization errors. As expected, the most problematic characters are those that should be tokenized in different ways, depending on their context. These challenges were presented in the discussion of Table 3.1 and in Section 3.4, so we were not surprised by this result. The errors that we found are related to the difficulties that we mentioned. Every character in the table can be tokenized "alone" in some situations, and as part of a larger token in other contexts, for example when part of an emoticon (even if some characters are less frequently used in this way).

We can see that there is no direct correlation between the frequency of the characters and their tokenization error rate. For example, the comma is not even present in the table, while 748 out of our 2500 messages (28.7%) have at least one comma. This can be justified by the small number of functions associated with the comma. At the same time, the apostrophe is only present in 17% of the messages, and is processed incorrectly more often than correctly. This, however, is not true of the other characters in the table.

One of the most surprising error situations arises from the use of "." as a mistyping of "-", which was more frequent than we expected; these keys lie side by side on the Portuguese keyboard layout. The other unexpected result comes from the difficulty of processing the apostrophe correctly. Part of the reason derives from the many languages found in the messages. The better known cases, such as the "'s" in English, the "l'" in French and the "d'" in Portuguese and Spanish, require more examples than those available to infer the correct classification pattern. In addition, the use of the apostrophe as a replacement for an accent or as part of an emoticon is also more frequent than we anticipated. This situation also reinforces the need for more examples that address uses of this character.

Even though the values in Table 3.5 are biased towards the False Positives, the overall values in scenario Sall were balanced.

The dot character should be tokenized as a single atom when used as a full stop, which is its most common use. When used in an abbreviation, it should join the text on its left, with white space on its right. But when used in numbers (as "2.2"), uncommon smileys (e.g. "o.o'" or "u.u"), or long URLs (as opposed to shortened URLs), in situations like domains (e.g. "www.seguindo.net") or file names (e.g. ".html"), it should remain intact. This character is also involved in some keyboard errors where a person types "." when they should be typing "-", a problem that appears to be more frequent than we expected. It can be said that, considering the number of occurrences of the character, it is not the most problematic.

The hyphen should be kept next to its surrounding text when used to create compound words. Some examples from our corpus are "Mercedes-Benz", "co-infection", "Port-au-Prince", "Spider-Man", "PL-SQL" or "C-130". Hyphenated compound words are more common in Portuguese, which partially justifies the high number of False Positives. The hyphen is also found in a number of popular smileys, in arrow-like symbols (e.g. "->") and in the middle of some long URLs.
It should be isolated when used as a range or dash, for example when used as a sentence separator in Twitter, standing between the "title" part of the message and the "message body", or between the "message body" and a URL at the end. Finally, it is also frequently used in sport-related subjects, to indicate the participants of a match or the resulting score.

The apostrophe is used as an indicator of an unpronounced syllable, as an aggregator of words, as a plural marker for initialisms, and as a possessive indicator in English. In addition, it can also be used as a quotation mark (in single or double fashion), or as an accent replacement, as mentioned in Section 3.4.1. Finally, like all other symbols, it finds its way into some emoticons. Not all of these situations should be handled in the same way, particularly given the differences in its Portuguese and English semantic roles.

As the character ":" is very popular in short smileys, almost all of its occurrences in the errors table are related to emoticons.

The False Positive errors associated with the character "?" are almost all related to long URLs, while a significant part of the False Negatives are related to the white space removal performed for test purposes, just like the "." character.

Table 3.5: The most problematic characters for both scenarios when processing 2500 messages, the errors found, total errors and the number of messages containing the character.

Char    Sall FP    Sall FN    Sone FP    Sone FN    Total errors    Messages w/ occurrences
.       160        96         812        53         1121            1587
-       123        77         586        43         829             477
'       55         7          342        1          405             86
:       52         19         255        24         350             1101
/       11         62         40         59         172             725
?       22         24         101        18         165             445
(       12         6          79         9          106             172
)       8          25         43         24         100             327

As a False Negative, if the user types ". . ." and the test removes some or all of the white space, it will not be restored.

The errors produced by the best runs of the SVM classifier for Sone were manually analyzed; we present the most common cases. The character '_' is associated with many False Negatives, and reveals a hard problem to solve. Some usernames end in this character (for example, "@foo_"), while others use it to join two words (like "@foo_bar"). When the space character after the first username is removed (turning "@foo_ baz" into "@foo_baz" for the sake of the test), the classifier might accept it as a common form of username present in the examples (like "@foo_bar" before). Given, for instance, the message "@firstnamelastname_ it would really come in handy xD oh well ..", it would be illogical to expect the space to be restored.

Another common error involves the ':' in smileys, which produces False Positives, mostly when preceding letters or numbers, as in ":x" or ":O". We expect to address these problems by adding examples that contain more similar cases to the training corpus.

The third character associated with many errors is the '-'. As stated in Section 3.4.1, the '-' symbol is ambiguous, as it can be used to create compound words (should not split, e.g. "middle-aged" or "Coca-Cola"), as well as punctuation (should split). This problem can be solved either by expanding the training set with more relevant examples, or by introducing additional knowledge into the feature generation process. For this purpose, a list of hyphenated compound words can be collected from Wikipedia (by mining the titles of articles using DBPedia6). This would not compromise the language independence of the system, since this resource is available for several languages.

6 http://dbpedia.org/
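As an illustration of how such a list could be harvested, the following sketch queries the public DBpedia SPARQL endpoint for article labels containing a hyphen. It is only a rough outline under our own assumptions (endpoint availability, the "pt" language filter and the result limit are illustrative choices), not part of the system described in this chapter.

    import requests

    QUERY = """
    SELECT DISTINCT ?label WHERE {
      ?article rdfs:label ?label .
      FILTER(lang(?label) = "pt")
      FILTER(CONTAINS(?label, "-"))
    } LIMIT 1000
    """

    def hyphenated_titles():
        # Ask the public DBpedia SPARQL endpoint for article labels with a hyphen.
        response = requests.get(
            "https://dbpedia.org/sparql",
            params={"query": QUERY, "format": "application/sparql-results+json"},
            timeout=60,
        )
        response.raise_for_status()
        rows = response.json()["results"]["bindings"]
        return [row["label"]["value"] for row in rows]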

Table 3.6: Improvement capability of the classifier-based tokenizer.

Training set size    Testing set size    Incorrect messages    Error rate
2000 × 5             500 × 5             511                   0.204
2500                 500                 110                   0.220
2900                 500                 89                    0.178

3.7.1 Augmenting training data based on errors

All errors that we detected seem solvable by adding more and better examples. To calculate the cost of improving our system's results, we performed an additional experiment; Table 3.6 presents the values obtained. We created a new test set by selecting 500 random messages from our 6 million Twitter messages collection. Using a model trained with all 2500 messages in our previous corpus, we noticed that 110 of the test messages were incorrectly tokenized. We then augmented our corpus with 400 new messages: 50 new messages containing each of the eight problematic characters in Table 3.5. The new model, trained with the now 2900-message corpus, failed on only 89 messages out of the previous 500. That is a reduction of almost 20% in the number of messages tokenized incorrectly.

With little more than 2 hours of work from a non-expert (as the only requirement is for them to understand and follow the tokenization rules stated in Section 3.4), the classification system improved by a significant amount. Manual tokenization of examples is a highly parallelizable activity. By comparison, rule creation is a very repetitive process of adding or correcting a rule and then testing it for effectiveness or interference with the others. This possibility of interference also limits the process to a single person, who also needs to be an expert in the relevant pattern matching mechanism used (e.g. regular expressions).

3.8 Conclusion and future work

We showed that the problem of tokenization of UGC, although fundamental for subsequent language processing work, cannot be considered a simple task. The usual UGC (and microblogging in particular) way of expression can be problematic for "traditional" methods of text processing, and in tokenization this is particularly evident. The difficulty arises from the lack of editorial rules and few accepted standards, meaning that ambiguity is not always avoidable. In particular, no character (other than white space) can be said to always mark the start or end of a token; or so we thought, until we saw white space itself being used as a method of obfuscation (see Table B.1 on page 187 for a list of other exceptions).

We created a corpus of Twitter messages that were tokenized manually following our rules for UGC tokenization. We then compared the results obtained by a classification-based and a rule-based tokenizer across two scenarios, which draw a typical upper and lower limit of performance for most microblogging usage.

In our experiments, we achieve F1 values of 0.96 for scenario Sall (which tests Recall) and 0.88 for scenario Sone (which focuses on Precision) with our classification-based tokenizer, when trained with 2000 examples. Our results can be said to be in line with comparable systems in the literature.

We have also shown that text classification methods can successfully tackle this problem, and do so significantly better than rule-based approaches. Not only do they achieve better performance (based on F1 values), but their performance can also be improved significantly by simply adding more examples covering the problematic situations to the training set. The text classification approach we propose is also better in terms of development costs, since the bulk of the work (i.e., the annotation) can be distributed among a group of non-experts.

Having said that, there is still a long road to travel before tokenization can be considered a solved problem, and we have a number of ideas noted down that we would like to try. For example, we intend to experiment with the SVM kernels and parameters, looking for an optimal classification configuration. We also wish to compare the results of the SVM with other classification algorithms, such as Conditional Random Fields (CRF). In addition, we consider extending our feature set to include, for instance, information about keyboard layouts (e.g. which keys are next to the one that produces a given character), as well as features not based on a single character, such as determining whether the relevant decision point is within a URL, recognized using regular expressions. Looking even further ahead, producing language- and/or user-specific models could lead to even better results, as more specific patterns (such as preferences for certain emoticons) can be identified.

Stepping aside from what could be considered "incremental improvements", we would like to experiment with a multi-stage tokenization process. This would be language-dependent, working on three levels through a backtracking process in search of the best solution (from the most obvious to the most obscure possibilities). The three stages would be:

bite: Propose a position for the end of the token; start removing spaces if necessary.

chew: Look at the proposed token and look it up in knowledge bases (is it a dictionary word, a proper name, a new trend word, or perhaps a foreign word?).
If all else fails, how likely is it to be an obfuscated word?

swallow: How well does the sentence fit together (grammatically, semantically, ...)? Score it; we will take the best score.

Chapter 4

Forensic analysis

Contents 4.1 Introduction ...... 52 4.2 Related work ...... 52 4.3 Method description & stylistic features ...... 54 4.3.1 Group 1: Quantitative Markers ...... 55 4.3.2 Group 2: Marks of Emotion ...... 55 4.3.3 Group 3: Punctuation ...... 56 4.3.4 Group 4: Abbreviations ...... 56 4.4 Experimental setup ...... 56 4.4.1 Performance evaluation ...... 57 4.5 Results and analysis ...... 58 4.5.1 Experiment set 1 ...... 58 4.5.2 Experiment set 2 ...... 59 4.6 Conclusions ...... 60

In this chapter we propose a set of stylistic markers for automatically attributing authorship to micro-blogging messages. The proposed markers include highly personal and idiosyncratic editing options, such as 'emoticons', interjections, punctuation, abbreviations and other low-level features. We evaluate the ability of these features to help discriminate the authorship of Twitter messages among three authors. For that purpose, we train SVM classifiers to learn stylometric models for each author based on different combinations of the groups of stylistic features that we propose. Results show a relatively good performance in attributing authorship of micro-blogging messages (F1 = 0.63) using this set of features, even when training the classifiers with as few as 60 examples from each author (F1 = 0.54). Additionally, we conclude that emoticons are the most discriminating features in these groups.


4.1 Introduction

In January 2010 the New York Daily News reported that a series of Twitter messages exchanged between two childhood friends led to one murdering the other. The set of Twitter messages exchanged between the victim and the accused was considered potential key evidence in the trial, but such evidence can be challenged if and when the alleged author refutes its authorship. Authorship analysis can, in this context, contribute to confirming or excluding the hypothesis that a given person is the true author of a queried message, among several candidates. However, the micro-blogging environment raises new, significant challenges, as the messages are extremely short and fragmentary. For example, Twitter messages are limited to 140 characters, but very frequently have only 10 or even fewer words. Standard stylistic markers such as lexical richness, frequency of function words, or syntactic measures, which are known to perform well with longer, 'standard' language texts, perform worse with such short texts, whose language is 'fragmentary' [Gra10]. Traditional authorship analysis methods are considered unreliable for text excerpts smaller than 250-500 words, as their accuracy tends to drop significantly as text length decreases [HF07].

In this chapter we use a text classification approach to help us investigate whether some 'non-traditional' stylistic markers, such as the type of emoticons, provide enough stylistic information to be used in authorship attribution. Given a set of Twitter messages from a group of three authors and sets of content-agnostic stylistic features, we train classifiers to learn the stylometric models for each author. We test whether the classifiers can robustly and correctly attribute authorship of an unseen message written by one of the three authors using the proposed features. We focus specifically on Twitter for its popularity, and address Portuguese in particular, which is one of the most widely used languages in this medium1.

4.2 Related work

In recent years, there has been considerable research on authorship attribution of some user-generated contents, such as e-mail (e.g. [dVACM01]) and, more recently, web logs (e.g. [PLZC09, GSR09, KSA09]) and 'opinion spam' (e.g. [JL08]). However, to the best of our knowledge, research on authorship attribution of Twitter messages has been scarce, and has raised robustness problems; before our work, it was non-existent for Portuguese. Research on authorship attribution in Portuguese in general has been very limited, and has focused mainly on online newspaper data (e.g. [PJO07, SSSG+10]).

1 http://semiocast.com/downloads/Semiocast_Half_of_messages_on_Twitter_are_not_in_English_20100224.pdf

The biggest challenge in attributing authorship of micro-blogging messages is their extremely short nature. In general, studies on intrinsically short and fragmentary messages focus on other interactive dimensions of the media, such as analysing the dialogue structure of Twitter conversations [RCD10], detecting trends and tracking memes [CL09, Che] on the Twittosphere, or performing automatic question-answering through text messaging (e.g. [KNF+09]).

To tackle the problem of robustness in computational stylometric analysis, research (e.g. the 'Writeprints technique' [AC08]) was applied to four different text genres to discriminate authorship and detect similarity of online texts among 100 authors. The performance obtained was good, but (a) the procedure did not prove to be content-agnostic, and (b) it did not analyse Twitter messages. Also, using structural features that are possibly due to editing, and considering as 'idiosyncratic features' usage anomalies that include misspellings and grammar mistakes, while leaving personal choice partly aside, is bound to compromise the results.

One of the first tasks related to authorship attribution on Twitter consisted of detecting spammers. Benevenuto et al. [BMRA10] enumerate a set of characteristics to distinguish spamming and non-spamming accounts. The authors address content clues (e.g. fraction of tweets with URLs, average number of URLs per tweet, number of tweets the user replied to, and average number of hashtags per tweet) and also behaviour clues (e.g. age of the user account, fraction of followers / followees, fraction of tweets replied by the user, number of tweets the user received in reply, number of followees and followers). The authors used an SVM classifier to correctly identify 70% of the spammers and 96% of non-spammers using a 5-fold cross-validation process, applied to a corpus of 1,065 users containing 355 manually identified spammers. The authors concluded that both kinds of features produce similarly useful clues, but that the behaviour features were harder to mask.

More recently, it has been demonstrated that the authorship of Twitter messages can be attributed with a certain degree of certainty [LWD10]. Surprisingly, the authors concluded that authorship could be identified with 120 tweets per user, and that more messages would not improve accuracy significantly. However, their method compromises the authorship identification task for most unknown messages, as they reported a loss of 27% accuracy when information about the interlocutor's user data was removed. It has also been demonstrated that authorship could be attributed using 'probabilistic context-free grammars' [RKM10], by building complete models of each author's (3 to 6) syntax. Nevertheless, the authors used both syntactic and lexical information to determine each author's writing style. Even other authors [BFD+10], who focused on short message classification for information filtering rather than authorship attribution, used domain-specific features, lexical items and participant information. Both content and behaviour features produce similarly useful clues, as determined by the χ2 rankings, but of the 62 total features used, the 10 most important were:

1. The fraction of tweets with URLs,

2. the age of the user account,

3. the average number of URLs per tweet,

4. the fraction of followers per followees,

5. the fraction of tweets the user had replied to,

6. the number of tweets the user replied to,

7. the number of tweets that originated a reply,

8. the number of followees,

9. the number of followers, and

10. the average number of hashtags per tweet.

Conversely, we propose a content-agnostic method, based on low-level features, to identify the authorship of unknown messages. This method is independent of user information, so not knowing the communication participants is irrelevant to the identification task. Moreover, although some of the features used have been studied independently, this method is innovative in that this specific combination of different stylistic features has never been used before and has not been applied to such short texts.

4.3 Method description & stylistic features

Authorship attribution can be seen as a typical text classification task: given examples of messages written by a set of authors (classes), we aim to attribute authorship to messages whose author is unknown. In a forensic scenario, the task consists of discriminating the authorship of messages among a small number of potential authors (e.g. 2 to 5), or determining whether a message can be attributed to a certain ('suspect') author.

The key to framing authorship attribution as a text classification problem is the selection of the feature sets that best describe the style of the authors. We propose four groups of stylistic features for automatic authorship analysis, each dealing with a particular aspect of tweets. All features are content-agnostic; to ensure robust authorship attribution and prevent the analysis from relying on topic-related clues, they do not contain lexical information.

4.3.1 Group 1: Quantitative Markers

These features attempt to grasp simple quantitative style markers from the message as a whole. The set includes message statistics, e.g. length (in characters) and number of tokens, as well as token-related statistics (e.g. average length, number of 1-character tokens, 2-consonant tokens, numeral tokens, choice of case, etc.). We also consider other markers, e.g. the use of dates, and words not found in the dictionary2, to indicate possible spelling mistakes or potential use of specialised language. As Twitter-specific features, we compute the number of user references (e.g. @user_123), the number and position of hashtags (e.g. #music), in-message URLs and the URL shortening service used. We also take note of messages starting with a username (a reply), as the author may alter their writing style when addressing another person.
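As a rough illustration of this kind of marker, the sketch below computes a handful of simple quantitative features from a tokenized message. The feature names and the exact selection are ours, chosen only to make the idea concrete; they do not reproduce the full feature set used in the experiments.

    def quantitative_markers(tokens):
        # A few simple quantitative style markers over a tokenized message.
        text = " ".join(tokens)
        token_lengths = [len(t) for t in tokens]
        return {
            "msg_length_chars": len(text),
            "num_tokens": len(tokens),
            "avg_token_length": sum(token_lengths) / len(tokens) if tokens else 0.0,
            "num_one_char_tokens": sum(1 for t in tokens if len(t) == 1),
            "num_numeral_tokens": sum(1 for t in tokens if t.isdigit()),
            "num_uppercase_tokens": sum(1 for t in tokens if t.isupper()),
            "num_user_references": sum(1 for t in tokens if t.startswith("@")),
            "num_hashtags": sum(1 for t in tokens if t.startswith("#")),
            "num_urls": sum(1 for t in tokens if t.startswith("http")),
            "is_reply": bool(tokens) and tokens[0].startswith("@"),
        }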

4.3.2 Group 2: Marks of Emotion

Another highly personal, and hence idiosyncratic, stylistic marker is the device used to convey emotion. There are mainly three non-verbal3 ways of expressing emotion in user-generated contents:

• smileys;

• ‘LOLs’; and

• interjections.

Smileys (':-)') are used creatively to reflect human emotions by changing the combination of eyes, nose and mouth. This work explores three axes of idiosyncratic variation: range (e.g. number of happy smileys per message), structure (e.g. whether the smiley has a nose) and direction of the smiley. First, different users express different ranges of emotion (for example, sadness ':-(', happiness ':-D', worry ':-S', playfulness ':-P', surprise ':-O' or frustration 'X-['). Second, users structure the smiley differently; some authors prefer the 'noseless' look (':)'), while others emphasise the mouth (e.g. ':-))'); likewise, the eyes of the smiley are also subject to variation: besides their most common format (i.e. ':'), they can also be winking (';') or closed ('|', 'X'). Third, different users direct smileys differently; some users choose to use them right-facing (e.g. ':-)'), while others prefer them left-facing (e.g. '(-:'); others use the much less common upright direction, e.g. happy '^_^', very happy '^___^', frustrated '>_<' or crying 'T_T'.

Another form of expression is the prevalent 'LOL', which usually stands for Laughing Out Loud. Frequently users manipulate the basic 'LOL' and

2 We use the GNU Aspell dictionary for European Portuguese.
3 In Twitter, users can also add hashtags to the message to signal a certain emotion or mood (e.g. "#sarcastic"), but we only focus on the three more general devices.

'maximise' it in various other forms, e.g. by repeating its letters (e.g. 'LLOOOLLL') or creating a loop (e.g. 'LOLOL'). The 'LOL' acronym is so prevalent in Internet slang that it is frequently typed in lowercase and used as a common word (e.g. 'I lolled'). This subgroup describes several instances of length, case and the ratio between 'L' and 'O', so as to distinguish between 'LOL' and the exaggeration of multiplying the 'O', as in 'LOOOOL'.

We identify interjections as tokens consisting of only two alternating letters that are not a 'LOL', such as 'haaahahahah'. Other popular and characteristic examples are the typical Brazilian laughing 'rsrsrs' and the Spanish laughing 'jejeje', both of which are now commonly found in European Portuguese Twitter. We count the number of interjections used in a message, their average length and their number of characters.
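The following sketch shows one possible way of detecting these three devices with regular expressions. The patterns are our own simplifications (for instance, the smiley pattern only covers common Western-style smileys) and are meant only to illustrate the kind of matching involved.

    import re

    # Simplified patterns; real usage is far more varied.
    SMILEY = re.compile(r"[:;8xX][-'o]?[)(DPpSsO|\[\]]+")       # e.g. ":-)", ";P", "X-["
    LOL = re.compile(r"[lL]+[oO]+([lL]+[oO]*)+")                # e.g. "LOL", "LLOOOLLL", "lolol"
    INTERJECTION = re.compile(r"([a-zA-Z])(?!\1)([a-zA-Z])(\1|\2){2,}")  # e.g. "hahaha", "jejeje"

    def emotion_markers(tokens):
        smileys = [t for t in tokens if SMILEY.fullmatch(t)]
        lols = [t for t in tokens if LOL.fullmatch(t)]
        interjections = [t for t in tokens
                         if INTERJECTION.fullmatch(t) and not LOL.fullmatch(t)]
        return {
            "num_smileys": len(smileys),
            "num_lols": len(lols),
            "avg_lol_length": sum(map(len, lols)) / len(lols) if lols else 0.0,
            "num_interjections": len(interjections),
        }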

4.3.3 Group 3: Punctuation

The choice of punctuation is a matter of writing style [Eag94], particularly in languages whose syntax is highly flexible (such as Portuguese and Spanish). Some authors occasionally make use of expressive and non-standard punctuation, either by repeating it ('!!!') or combining it ('?!?'). Others simply skip punctuation, assuming the meaning of the message will not be affected. Ellipses in particular can be constructed in less usual ways (e.g. '..' or '...... '). We count the frequency of these and other peculiar cases, such as the use of punctuation after a 'LOL' and at the end of a message (while ignoring URLs and hashtags).
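A small sketch of how such punctuation habits could be counted is given below; again, the patterns and names are illustrative choices of ours, not the exact features used.

    import re

    REPEATED = re.compile(r"([!?.])\1+")   # e.g. "!!!", "..", "......"

    def punctuation_markers(tokens):
        # Tokens made of one repeated punctuation mark (e.g. "!!!", "..").
        repeated = sum(1 for t in tokens if REPEATED.fullmatch(t))
        # Tokens mixing "!" and "?" (e.g. "?!?", "!?").
        mixed = sum(1 for t in tokens
                    if len(t) >= 2 and set(t) <= {"!", "?"} and len(set(t)) == 2)
        # Whether the last token is made only of non-alphanumeric characters.
        ends_with_punct = bool(tokens) and all(not c.isalnum() for c in tokens[-1])
        return {"num_repeated_punct": repeated,
                "num_mixed_punct": mixed,
                "ends_with_punctuation": ends_with_punct}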

4.3.4 Group 4: Abbreviations

Some abbreviations are highly idiolectal, thus depending on personal choice. We monitor the use of three types of abbreviations: 2-consonant tokens (e. g. ‘bk’ for ‘back’), 1- or 2-letter tokens followed by ‘.’ or ‘/’ (e. g. ‘p/’) and 3-letter tokens ending in two consonants, with (possibly) a dot at the end (e. g. ‘etc.’).
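A possible regular-expression rendering of these three abbreviation types is sketched below; the exact patterns used in our system may differ, so treat these as approximations.

    import re

    TWO_CONSONANTS = re.compile(r"[bcdfghjklmnpqrstvwxz]{2}", re.IGNORECASE)        # e.g. "bk"
    SHORT_WITH_MARK = re.compile(r"[a-z]{1,2}[./]", re.IGNORECASE)                  # e.g. "p/"
    THREE_LETTER = re.compile(r"[a-z][bcdfghjklmnpqrstvwxz]{2}\.?", re.IGNORECASE)  # e.g. "etc."

    def abbreviation_markers(tokens):
        return {
            "num_two_consonant": sum(1 for t in tokens if TWO_CONSONANTS.fullmatch(t)),
            "num_short_with_mark": sum(1 for t in tokens if SHORT_WITH_MARK.fullmatch(t)),
            "num_three_letter_abbrev": sum(1 for t in tokens if THREE_LETTER.fullmatch(t)),
        }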

4.4 Experimental setup

This study is focused on the authorship identification of a message among three candidate authors. We consider only three possible authors because forensic linguistic scenarios usually imply a limited number of suspect authors, which makes this setting more realistic (but our approach has been tested successfully with as many as ten candidate authors). We chose to use Support Vector Machines (SVM) [Joa98] as the classification algorithm for their proven effectiveness in text classification tasks and robustness in handling a large number of features. The SVM-Light implementation [Joa98] was used, parametrised with a linear kernel. We employ a 1-vs-all classification strategy; for each author, we use an SVM to learn the corresponding stylistic model, capable of discriminating that author's messages. Given a suspect message, we use each SVM to predict the degree of likelihood that the corresponding author is the true author. The message's authorship is attributed to the author of the highest scoring SVM. We also consider a threshold on the minimum value of the SVM score, so as to introduce a confidence parameter (the minimum score of the SVM classifier considered valid) into the authorship attribution process. When none of the SVM scores achieves the minimum value, authorship is left undefined.

Our data set consists of Twitter messages from authors in Portugal, collected in 2010 (January 12 to October 1). We counted over 200,000 users and over 4 million messages during this period (excluding messages posted automatically, such as news feeds). From these, we selected the 120 most prolific Twitter authors in the set, responsible for at least 2,000 distinct and original messages each (i.e. excluding retweets), to extract the sets of messages for our experiments. The messages were all tokenized using a UGC-specific method that takes into account its typical writing style, Internet slang, URLs, and Twitter usernames and hashtags. We divide the 120 authors into 40 groups of 3 users at random, and maintain these groups throughout our experiments. A group of 3 authors forms the basic testing unit of our experiment. Each test message from the group's data set is attributed to one of the 3 authors using the pre-calculated stylistic models, or to none if the minimum confidence threshold is not reached.

We perform two sets of experiments. In Experimental Set 1, the classification procedure uses all possible groups of features to describe the messages. We use data sets of sizes 75, 250, 1,250 and 2,000 messages/author. In Experimental Set 2, we run the training and classification procedure using only one group of features at a time. We use the largest data set from the previous experiment (2,000 messages/author) for this analysis.
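The attribution step can be summarised by the small sketch below: one binary scorer per author, the highest score wins, and a confidence threshold leaves low-scoring messages unattributed. The scoring functions stand in for the per-author SVM models and are an assumption of this illustration.

    def attribute_author(message_features, author_scorers, threshold=0.0):
        # author_scorers maps an author id to a scoring function
        # (e.g. the decision value of that author's 1-vs-all SVM).
        scores = {author: score(message_features)
                  for author, score in author_scorers.items()}
        best_author = max(scores, key=scores.get)
        if scores[best_author] < threshold:
            return None          # authorship left undefined
        return best_author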

4.4.1 Performance evaluation

We measure Precision (P), Recall (R) and F1 (2PR/(P + R)) considering:

P = (# messages correctly attributed) / (# messages attributed)

R = (# messages correctly attributed) / (# messages in the set)

We run the training and classification procedures in each set of experiments and use the confidence parameter to draw Precision vs. Recall graphs. As these experiments consider three different authors, the baseline is F1 = 0.33 (P = 0.33 at R = 0.33).

Table 4.1: Best F1 values obtained for different data set sizes.

Data set size    75      125     250     625     1,250    2,000
F1               0.54    0.55    0.57    0.59    0.61     0.63

All experiments were conducted using 5-fold cross validation, and run for all 40 groups of 3 authors. For varying levels of Recall (increments of 0.01) we calculate the maximum, minimum and average Precision obtained over all 40 groups. All F1 values are calculated using the average Precision.

4.5 Results and analysis

4.5.1 Experiment set 1

Figure 4.1 shows the Precision vs Recall graphs for Experimental Set 1. Increasing the data set (from 75 to 2,000 messages/author) yields improvements in the minimum, maximum and average Precision values. In addition, the robustness of the classifier also benefits from the added examples, as the most problematic situations (corresponding to the minimum Precision values) are handled correctly more frequently. The best F1 values are always obtained at the highest value of Recall, meaning that they too follow this improvement trend.

For the smaller data set (75 messages, Figure 4.1a), the minimum Precision curve is nearly constant, not showing a benefit from the decision threshold (at the cost of Recall). We speculate this is due to two reasons. First, given the large feature space (we use at least 5680 dimensions) and the relatively small number of non-negative feature components in each training example (most messages have between 64 and 70 features), a robust classification model can only be inferred using a larger training set. Second, with such small sets it is highly probable that both the training and the test sets are atypical and distinct in terms of feature distribution. Still, the performance values obtained are far above the baseline, and an F1 value of 0.54 is reached. In the larger data sets (Figures 4.1c and 4.1d) we always obtain a Precision greater than 0.5. This means that even in the more difficult cases, the attribution process is correct more often than not. However, the contribution of the extra examples to the F1 values is lower when we exceed 250 messages/author (where we get 0.59), even if the values increase almost linearly up to 0.63 (for 2,000 messages/author).

These observations are relevant since most previous work argued that longer strings of text were necessary to attribute authorship. Better results can be obtained if more than one message is available and we know them to be by the same author. In this situation, the probability of making a mistake in authorship attribution drops exponentially (following a binomial distribution for repeated trials).
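To make this last point concrete, the following sketch computes the probability that a majority vote over several messages picks the wrong author, assuming (purely for illustration) independent per-message decisions, a binary correct/incorrect outcome per message and a hypothetical fixed per-message accuracy.

    from math import comb

    def majority_error(per_message_accuracy, n_messages):
        # Probability that at most half of n independent attributions are
        # correct (ties counted as errors, for simplicity).
        p = per_message_accuracy
        return sum(comb(n_messages, k) * p**k * (1 - p)**(n_messages - k)
                   for k in range(0, n_messages // 2 + 1))

    # With a hypothetical per-message accuracy of 0.63, the error of a
    # 15-message majority vote is already well below the single-message error.
    print(majority_error(0.63, 1), majority_error(0.63, 15))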

4.5.2 Experiment set 2

Figure 4.2 presents the Performance vs Recall curves for authorship attribution with a classification procedure using only one group of features at a time.

Quantitative Markers (Group 1) show an average performance, with minimum Precision and maximum Recall of 0.49, and a maximum F1 value of 0.55 (Figure 4.2a). This shows that, despite Twitter's length constraints, there is room for stylistic choices such as the length of tokens, the length of the message posted, etc.

Markers of expression of emotion, including smileys, LOLs and interjections (Group 2), achieve a relatively high performance, and clearly outperform all other feature groups (Figure 4.2b). This group achieves an F1 value of 0.62 (where using all features together achieves 0.63). This is particularly interesting since these features are specific to user-generated contents, and to our knowledge their relevance and effectiveness in authorship attribution is quantified here for the first time. The difference between the average and minimum Precision values is an indicator that low performance of this feature group is an infrequent event.

The group of features including punctuation (Group 3) performs slightly worse than the previous groups, and scores only 0.50 on the F1 measure (Figure 4.2c). The difference between the best and worst case is significant, but the average Precision degrades as the Recall increases. Our evaluation demonstrates that our approach, although quite simplistic, is capable of detecting stylistic variation in the use of punctuation, and of successfully using this information for authorship attribution. This result is in line with those reported previously by [SSSG+10] for punctuation-based features applied to automatic authorship attribution of sentences from newspapers.

Group 4, containing features on the use of abbreviations, led to the worst results (maximum F1 value of 0.40). The shape of the curve rapidly approaches the baseline values, proving that this group is not robust (Figure 4.2d). Manual evaluation shows that these abbreviations are used rarely. However, as the low Recall/high Precision part of the curve suggests, they carry stylistic value, in spite of being used in only a relatively small number of cases.

Finally, the performance when using all groups of features simultaneously (Figure 4.1f) is better than using any group of features individually, showing that all individual groups of features carry relevant stylistic information that can be combined, and suggesting that investing in new groups of stylistic features may lead to additional global performance improvements, especially in Recall. It is apparent in Figure 4.2 that the last graph achieves larger values of Recall. This can be explained by the absence of some features from a number of messages, or by the features that were present being, on their own, incapable of providing a result (e.g. the message size).

4.6 Conclusions

Our experiment demonstrates that standard text classification techniques can be used in conjunction with a group of content-agnostic features to successfully attribute authorship of Twitter messages to three different authors. Automatic authorship attribution of such short text strings, using only content-agnostic stylistic features, had not been addressed before. Our classification approach requires a relatively small amount of training data (as little as 100 example messages) to achieve good performance in discriminating authorship. These results are due to the attribution being based on a good and robust set of features that reflect the stylistic choices of the authors, while still being content-agnostic.

Surprisingly, the group of emoticons outperforms each of the other feature groups tested, with a relatively high performance. The relevance and effectiveness of these features for automatic authorship attribution are now demonstrated for the first time. Features related to the use of abbreviations report the poorest results, their performance dropping abruptly; yet, despite lacking robustness, they prove to carry stylistic information. The group of punctuation marks is the second best performer, showing its capacity to capture stylistic variation. Quantitative and punctuation markers show average results, carrying some idiolectal information, despite the text length constraints. On balance, it can be argued that all features carry relevant information, since using all groups of features simultaneously allows inferring more robust authorship classifiers than using any group of features individually.

[Figure 4.1, panels a) to d): Precision vs. Recall for data sets of 75, 250, 1,250 and 2,000 messages/author]

Figure 4.1: Performance of each data set size. Each graph plots maximum, average and minimum Precision at varying levels of Recall (40 groups of 3 authors).

[Figure 4.2, panels a) to d): Precision vs. Recall using only Quantitative, Emotion, Punctuation and Abbreviation features, respectively]

Figure 4.2: Performance of each individual set of features. Each graph plots maximum, average and minimum Precision at varying levels of Recall (40 groups of 3 authors, 2,000 messages from each author).

Chapter 5

Bot detection

Contents 5.1 Introduction ...... 64 5.1.1 Types of users ...... 65 5.2 Related work ...... 67 5.3 Methodology ...... 69 5.3.1 Chronological features ...... 69 5.3.2 The client application used ...... 72 5.3.3 The presence of URLs ...... 72 5.3.4 User interaction ...... 72 5.3.5 Writing style ...... 73 5.4 Experimental set-up ...... 77 5.4.1 Creation of a ground truth ...... 77 5.4.2 The classification experiment ...... 78 5.5 Results and analysis ...... 79 5.6 Conclusion and future work ...... 80

In this chapter we study the problem of identifying systems that automatically inject non-personal messages into micro-blogging message streams, thus potentially biasing the results of certain information extraction procedures, such as opinion mining and trend analysis. We also study several classes of features, namely features based on the time of posting, the client used to post, the presence of links, user interaction and writing style. This last class of features, which we introduce here for the first time, proves to be a top performer, achieving accuracy near 90%, on par with the best features previously used for this task.


5.1 Introduction

Microblogging systems, of which Twitter is probably the best known example, have become a new and relevant medium for sharing spontaneous and personal information. Many studies and applications consider microblogs as a source of data precisely because these characteristics can confer authenticity to results, for example in trend detection [MK10], opinion mining [DH09] or recommendation [CNN+10].

Because of its popularity, Twitter is also part of the on-line communication strategy of many organizations, which use a Twitter account for providing updates on news, initiatives, commercial information (e.g. promotions, advertisements and spam) and various other types of information people may find interesting (like weather, traffic, TV programming guides or events). Messages conveyed by these automated accounts, which we will from now on refer to as robot accounts or, simply, bots, can easily become part of the stream of messages processed by information extraction applications. Since bots provide content aimed at being consumed by the masses, instead of the personal messages that information extraction systems consider meaningful (for example, for trend detection), automatic messages may bias the results that some information extraction systems try to generate. For this reason, from the point of view of these systems, messages sent by bots can be considered noise. The number of such robot accounts is extremely large and constantly growing; it is therefore practically impossible to manually create and maintain a list of such accounts.

Even considering that the number of messages typically produced by a bot each day is not significantly larger than the number of messages written by an active user in the same period, we must remember that bots are capable of sustaining their publication frequency for longer periods than most humans (who may stop using the service temporarily or permanently after some time). Thus, in the long run, bots are capable of producing a larger set of messages than an active person.

In this work we propose a system that can identify these robot accounts using a classification approach based on a number of observable features related to activity patterns and message style. This system cannot detect every form of bot, but it focuses on detecting a number of characteristics that are common to many automated accounts (possibly the majority). We evaluate its performance, and compare it with some of the more common approaches used for this task, such as the client used to post the messages and the regularity of new content.

5.1.1 Types of users

Based on the work of Chu et al. [CGWJ10], we start by distinguishing between three types of users. The term human is used to refer to users that author all or nearly all their messages. They usually interact with other users, post links in some of their messages, use abbreviations and emoticons, and occasionally misspell words. Many employ irregular writing styles. The subjects of their messages can differ, but they tend to express personal opinions. Below we have examples from two human users:

• Who’s idea was it to take shots of tequila? You are in so much trouble.

• I forgot to mention that I dropped said TV on my finger. ouchie.

• Heard that broseph. RT @ReggaeOCD: So bored with nothing to do. #IHateNotHavingFriends

• Just being a bum today. http://twitpic.com/4y5ftu

• aw, grantly:’destroys only happy moment in fat kids life’ when talking about food :@

• @ryrae HAHAHAHAHHAHA :’)

• JOSH IS IN SEASON 5 OF WATERLOO ROAD! WHEN DID THIS HAPPEN?

Bots, on the other hand, are in place to automate the propagation of information. The content is generally written by a person, although in some cases the entire process is automated (e.g. sensor readings). We should note that what we are distinguishing here is more a matter of content than a matter of process or form. It is possible for an account where a person writes the messages directly on the Twitter website to be labeled as a bot, if the messages are written in the cold, objective way seen in the examples below, taken from three different accounts.

• Social Security and Medicare to run short sooner than expected. http://on.cnn.com/lauSNv

• For Louisiana town, a collective gasp as it braces for floodwaters. http://on.cnn.com/mb481c

• Jindal: Morganza Spillway could open within the next 24 hours. http://on.cnn.com/j2jIBs

• 96kg-Bosak takes 7th place at the University Nationals

• 84kg-Lewnes takes 2nd to Wright of PSU and qualifies for the world team trials in Oklahoma Ciry

• Bosak loses his consolation match 0-1, 1-3 to Zac Thomusseit of Pitt.

• #Senate McConnell: Debt limit a ’great opportunity’ http://bit.ly/kanlc9 #Politics

• #Senate Wisconsin Sen. Kohl to retire http://bit.ly/kQpAsS #Politics

• #Senate Ensign may face more legal problems http://bit.ly/mN7XsB #Politics

Many accounts are not run entirely by a person, nor are they completely controlled by a machine. We label these mixed accounts as cyborgs, the term used by Chu et al. [CGWJ10], who describe them as a "bot-assisted human or human-assisted bot". For example, an enterprise can have an automatic posting service, while periodically a person provides the user interaction needed to maintain a warmer relationship with the followers and foster a sense of community. Another possibility occurs when a person uses links on websites that post a message on that person's account (for example, "share this" links). If this pre-written content is noticeable among the original messages, the user is labeled as a cyborg. If barely any original content is present in the user's timeline, they will be considered a bot. Below we show examples from a cyborg account.

• Explore The Space Shuttle Era http://go.nasa.gov/gzxst5 and immerse yourself in the Space Shuttle Experience http://go.nasa.gov/iHVfGN

• Track the space shuttle during launch and landing in Google Earth using real-time data from Mission Control http://go.nasa.gov/mwO9Ur

• RT @Rep_Giffords: Gabrielle landed safely @NASAKennedy. For more details go to www.fb.me/GabrielleGiffords. #NASATweetUp

• Space shuttle Endeavour’s preferred launch time moved two seconds later! Now 8:56:28 a.m. EDT Monday.

• @Angel_head NASA frequently tweaks the shuttle launch time by seconds based on the latest space station tracking data to use the least fuel.

As we will explain next, we used these guidelines to construct a Ground Truth that will be used in our experiments. The details of this task are given in Section 5.4.1.

To address the problem of automatic posting, we study different sets of features that allow us to classify Twitter users into the three user categories described. These features explore characteristics exhibited by the users, such as their posting times, the microblog client application they use, their interaction with other users, the content of their messages, and their writing style. The main goal of the work presented in this chapter is to assess the usefulness and robustness of the different types of features proposed. We discuss the features in Section 5.3. We describe our experiment and its parameters in Section 5.4 and evaluate our results in Section 5.5. In Section 5.6 we present our conclusions and future work.

5.2 Related work

Most literature addressing the identification of automated systems in microblogs is related to the detection of spam. While there is some overlap between spam and automated posting systems (spammers often employ automation to help them in their work), we feel that the problem we are addressing is much more general.

Wang [Wan10] presents an effort to detect spamming bots on Twitter using a classification-based approach. The author explored two sets of features: (i) information about the number of followers, friends and the follower-per-friend ratio, for the social network aspect of Twitter; and (ii) information about duplicate content and the number of links present in the last 20 messages of a given user account. The author used a manually annotated corpus to train a Naive-Bayes classifier. The classifier achieved slightly over 90% Precision, Recall and F-measure in a 10-fold cross validation experiment. However, since the training corpus was biased towards non-spam users (97% of the examples), any classifier that only reported "non-spam" would be almost always correct, so the results are not really significant.

Grier et al. [GTPZ10] analyze several features that indicate spamming on Twitter. They looked for automated behavior by inspecting the precision of timing events (minute-of-the-hour and second-of-the-minute), and the repetition of text in the messages across a user's history. They also studied the Twitter client application used to write the messages, since some allow tweets to be pre-scheduled at specific intervals.

Zhang and Paxson [ZP11] present a study where they try to identify bots by looking only at posting times, more specifically the minute-of-the-hour and the second-of-the-minute. If the posting times are either too uniform or not uniform enough, there is the possibility that the account is automated. This analysis is similar to the one presented by Grier et al. [GTPZ10], a work in which Zhang and Paxson participated. The authors present no validation of their results (since it is not possible to determine for sure whether an account is automated or not). However, they claim that "11% of accounts that appear to publish exclusively through the browser are in fact automated accounts that spoof the source of the updates".

Chu et al. [CGWJ10] propose to distinguish between humans, bots and cyborgs, but much of the effort was put into spam detection. They claim that more sophisticated bots unfollow users that do not follow back, in an effort to keep their friends-to-followers ratio close to 1, thus reducing the effectiveness of features based on the social network of Twitter. The official validation of accounts is rare, and the date of their creation is also said to be less useful. Despite its interesting observations, the article has a number of problems that we feel should be pointed out. There are too many places in the article where Chu et al. seem to be mistaking Twitter for a website (e.g. what is an "external link" in a Twitter message?). We also have some reservations about their data, as it defies common observation and our measurements (e.g. Twitter users being the most active at 6 AM local time, and not posting at midnight), but this could be explained by cultural differences. Finally, the methods employed by Chu et al. are not explained clearly, and can be questioned.
For example, they state that "Training the component with up-to-date spam text patterns on Twitter helps improve the accuracy", but fail to mention how and where they collect such information. They also omit the size of their spam and non-spam dataset, stating only that it "consists of spam tweets and spam external URLs, which are detected during the creating of the ground truth set."

The most serious methodological problem occurs during the compilation of their ground truth sets, where Chu et al. state that "The samples that are difficult and uncertain to classify fall [...]". If the ambiguous and difficult to classify examples are excluded from the experiment, we can seriously question the generality of their results.

Contrary to previous work, we focus on a problem that is much more generic than spam detection, since a very large number of bots belong to newspapers or other organizations that are voluntarily followed by users. Our goal is to separate potentially opinionated and highly personal content from content injected into the Twittosphere by media organizations (mostly informational or promotional). One other point where we distinguish ourselves from the mentioned works is in our attempt to expand the set of features used in the detection, now including a vast array of stylistic markers. In our opinion, this opens a new field for exploration and study.


Figure 5.1: Comparison of Twitter activity between bots, cyborgs and humans as a function of the hour.

5.3 Methodology

Most of our discussion is centered around distinguishing human users and bots. We propose five sets of features, described below, that are intended to help to discriminate between these two poles. A cyborg user, by definition, exhibits characteristics typical of both classes of users.

5.3.1 Chronological features

One of the characteristics of automatic message posting systems is that they can be left running indefinitely. Therefore, we can expect to see different chronological patterns between human users and bots. To address these points, we defined a number of features, divided into the four following sub-classes.

Resting and active periods

Constant activity throughout the day is an indication that the posting process is automated or that more than one person is using the same account, something we expect to be unusual for individual users. Figure 5.1 was drawn using information from our manually classified users (described in Section 5.4.1). It shows how human and cyborg activity is reduced between 1 and 9 AM. Bot activity is also reduced between 11 PM and 7 AM, but it never approaches zero. We can understand this drop by the absence of an audience to read the content being produced; however, messages posted in the very early morning are more likely to be read when people look at their feed for the first time during the day, which explains the rise in bot posting starting at 5 in the morning.

At the same time, other things can keep people from blogging. This can lead to certain hours of preferred activity, such as evenings, as suggested in Figure 5.1.


Figure 5.2: Comparison of Twitter activity between bots, cyborgs and humans as a function of the day of the week.

Bots appear to have a more evenly spaced distribution across the day, with smaller fluctuations in the level of activity. This can be a conscious choice, to allow more time for their followers to read each post. Since we tried to limit our crawling efforts to Portugal, most of the observations are expected to fall within the same time zone (with the exception of the Azores islands, which account for 2% of the population). We believe that the problem of users in different time zones cannot be avoided completely. For example, we do not expect users to correct their Twitter profile when traveling.

To detect the times at which the user is more or less active, we define 24 features that measure the fraction of messages they posted at each hour. These values should reflect the distributions represented in Figure 5.1. We also analyze the average and standard deviation of these values; we expect the standard deviation of a bot to be lower than that of a human. Finally, we register the 10 hours with the highest and lowest activity, and the average number of messages that the user posts per day (as a floating-point number).
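A minimal sketch of these hourly features is given below, assuming each message carries a datetime timestamp; the feature names are ours, and the summary statistics are computed here over the hourly counts, purely as an illustration.

    from statistics import mean, stdev

    def hourly_features(timestamps):
        # timestamps: datetime objects, one per message.
        counts = [0] * 24
        for ts in timestamps:
            counts[ts.hour] += 1
        total = len(timestamps) or 1
        fractions = [c / total for c in counts]
        features = {f"hour_{h:02d}_fraction": f for h, f in enumerate(fractions)}
        # Summary statistics over the hourly counts (bots are expected to show
        # a flatter distribution, i.e. a lower standard deviation).
        features["hourly_count_mean"] = mean(counts)
        features["hourly_count_stdev"] = stdev(counts)
        # Hours ranked by activity; the 10 most and least active can be kept.
        features["hours_by_activity"] = sorted(range(24), key=lambda h: -fractions[h])
        return features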

Long term activity

Days are not all equal, for both humans and bots. For example, as shown in Figure 5.2, most activity happens on Thursdays and Fridays. This trend matches the results published by Hubspot [Bur10]. We can see that bots are less active during the weekends, while they dominate on Mondays and lead on Tuesdays. We define seven features related to the frequency of the messages posted on each day of the week, which should reflect the proportions in Figure 5.2. We also calculate the workday and weekend posting frequencies, and rank the days of the week by posting frequency.

Inactivity periods

From direct observation of Twitter messages, we can see that bots tend to be more regular in their updates than humans. It is known that irregular accounts can lose popularity quickly. At the same time, normal people need to rest, get occupied with other matters, and can lose interest in blogging for some time. To make use of this information, we measure the periods of inactivity in minutes, and record the length of the 10 longest intervals, in decreasing order. We also calculate the average and standard deviation of all these values. From our observations, we expect that bots will have lower variation in inactivity periods (lower standard deviation) and a higher average. For example, for a user who only blogged at 1 PM, 2 PM, 3:30 PM and 7 PM on the same day, we would have the following features:

Feature name                              Value (minutes)
top_inactive_period_1                     210.00
top_inactive_period_2                      90.00
top_inactive_period_3                      60.00
average_top_inactive_period               120.00
standard_deviation_top_inactive_period     79.37

Humans are unable to match the speed at which bots can create new messages. For this reason, we also calculate the analogous features for the minimum inactivity periods (i.e. the 10 shortest inactive periods).
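A possible implementation of these inactivity features is sketched below. The feature names follow the worked example above, and the sample standard deviation is used because it reproduces the 79.37 value shown there; everything else is an assumption about details the text does not specify.

```python
from statistics import mean, stdev

def inactivity_features(timestamps, top_n=10):
    """Inactivity-period features, in minutes (sketch).

    `timestamps` is assumed to be a list of datetime objects with at least
    two entries, so that at least one gap exists.
    """
    ts = sorted(timestamps)
    gaps = [(b - a).total_seconds() / 60.0 for a, b in zip(ts, ts[1:])]

    features = {}
    for label, ordered in (("top", sorted(gaps, reverse=True)),   # longest gaps
                           ("min", sorted(gaps))):                # shortest gaps
        selected = ordered[:top_n]
        for i, gap in enumerate(selected, start=1):
            features[f"{label}_inactive_period_{i}"] = round(gap, 2)
        features[f"average_{label}_inactive_period"] = round(mean(selected), 2)
        spread = stdev(selected) if len(selected) > 1 else 0.0
        features[f"standard_deviation_{label}_inactive_period"] = round(spread, 2)
    return features
```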

Posting precision

Since some automatic posting systems work with a fixed periodicity (e.g. TV programming guides), we decided to calculate the frequency of messages created at each minute (60 features) and second (another 60 features). This approach is a simpler version of other works [GTPZ10, ZP11]. For a human, we expect posts to be evenly spread across both of these measurements. Some bots, on the other hand, are expected to concentrate their activity around the 0 seconds mark. They may also do the same around particular minutes (e.g. 0 or 30). As before, we also calculate the average and standard deviation of these measurements, where humans should result in a lower average and higher standard deviation compared to bots. For both minutes and seconds we take note of the 10 most frequent values — that is, when most activity occurs.

5.3.2 The client application used

It makes sense that the Twitter client used to post the messages is a relevant aspect in identifying automated processes. There are many clients and methods of accessing the microblogging system (e.g. web interface, several applications, etc.). From the point of view of automation, some of these methods are easier or more convenient than others. Also, most microblogging systems have an open API, meaning that it is possible to interact with them directly. In Twitter, unregistered clients are identified as “API”, while those that were registered are identified by their name. We track the number of different clients used to post the messages, and the proportion of messages posted with each client. Cyborgs are expected to have the largest variety of clients used (as they usually post automatically from several sources). Some humans can use more than one client, for example, a mobile client and the website. Bots, on the other hand, can adhere to a single, exclusive client that is tied to their on-line presence; or may use a general client that imports messages from an RSS feed, for example.

5.3.3 The presence of URLs

We can make a distinction between two types of bots: information bots, which only intend to make their readers aware of something (e.g. weather forecast, TV scheduling and traffic information), and link bots, whose main purpose is to generate traffic towards their website (e.g. news, advertisements and spam). Information bots usually don’t have URLs in their messages, while some link bots are capable of truncating the text of the message to make room for the URL. Most URLs shared by a bot usually have the same domain, i.e. they were all created by the same URL shortening system, or point to the same website. Humans are also capable of introducing many URLs, but our observation reveals that cyborgs are more likely to do so, and to vary the domains of said URLs. We can observe both types of linking behaviour represented in the bot and human examples presented in the introduction, in Section 5.1.1. We defined a feature that represents the ratio of URLs shared to the total number of messages written, and also keep track of the proportion of the domains associated with the URLs.

5.3.4 User interaction

Bots usually have one main objective, which is to spread information regularly. While they may be programmed to do more complex actions (such as follow other users), automating user interaction can have undesired repercussions for the reputation of the account holder. Thus, reblogs1 (to post a copy of another user’s message) and replies (directing a message at a user) are shunned by most bots. The main exception is some spamming bots, which send several messages directed at users [GTPZ10]. Including the names of other users in the message (mentions) also seems to be uncommon in automated accounts. However, a number of users also avoid some types of interaction, such as the ones previously mentioned. Therefore, while we expect this information to help identify humans, it may be less helpful in identifying bots. With our features we keep track of the proportion of reblogs, replies and mentions, as well as the average number of users mentioned per message written.

5.3.5 Writing style

Our observations showed that bots usually have a fixed message template. That is, almost all messages from the bots we observed end in a URL. Some have a topic at the start (“#employment”, “[politics]” or “sports:”). They also appear to be written in a “less noisy” way, that is, with standard punctuation, correct capitalization, few or no abbreviations, and other stylistic choices that benefit readability. Stylistic information has been successfully used to distinguish the writing style of different people on Twitter [SSLS+11]. Thus, we believe it to be helpful in distinguishing between automated and non-automated messages since, as observed in the examples in Section 5.1.1, these users adopt different postures. The austere writing style may help with the readability, and also the credibility, associated with the account, while many humans do not seem too concerned about that. We identify the frequency (per message) of a large number of tokens, as listed below.

Emotion tokens

Bot operators wish to maintain a serious and credible image, and for this reason avoid writing in an overly informal style. We collect information on the use of various popular variations of smileys and “LOLs”. We also try to identify interjections. While this part of speech is culturally dependent, we try to identify word tokens that have few different letters compared with the word length — if the word is longer than 4 characters and the number of different letters is less than half the word length, we consider it an interjection.
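The interjection heuristic just described can be written as a short predicate; the function name is ours, and the check is only the rule stated above.

```python
def looks_like_interjection(token):
    """True for words longer than 4 characters whose number of distinct
    letters is less than half of the word length (e.g. 'hahahaha')."""
    word = token.lower()
    if len(word) <= 4 or not word.isalpha():
        return False
    return len(set(word)) < len(word) / 2
```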

1 Called “retweets” on Twitter, often shortened to “RT”.

Example  Text
1        Tours: Brian Wilson should retire next year http://dstk.me/Oi6
2        Gilberto Jordan at Sustain Worldwide Conference 2011: Gilberto Jordan, CEO of Grupo André Jordan, is the only spea...http://bit.ly/mOOxt1

Table 5.1: Examples of bot Twitter messages making use of punctuation for structural purposes.

These three stylistic features were the most relevant features mentioned by Sousa-Silva et al. [SSLS+11]. Below, we can see example messages containing many emotion tokens:

• we talked before...... on twitter. HAHAHAHAHA RT @Farrahri: @Marcology LOL she smiled at me! Hehehe, jealuzzz not?

• riiiiiight.... im offfffffffffffffff!!!!! bye bye

• RT yesssssssss! That is my soooong!!!! @nomsed: You got the looove that I waaaant RT @LissaSoul: U got that BUTTA LOOOOooooVVVEEE!

Emotion can also be expressed through punctuation, but we include those features in the punctuation feature group, below.

Punctuation tokens

Humans vary widely in their use of punctuation. Many are not consistent across their publications. This is in contrast to bots, which can be very consistent in this regard. Punctuation can often be used as a separator between the “topic” and the “content” of the message, as can be seen in the first example in Table 5.1. Different sources of information may structure their messages differently; therefore one bot may use more than one separator. We also notice that some bots publish only the headline of the article they are linking to. These articles are usually blog posts or news items at a news website. Since headlines usually do not include a full stop, this feature receives a very low frequency (as seen in example 1). Question or exclamation marks are usually infrequent in news bots, or bots looking for credibility [CMP11]. Some bots truncate the message to make room for the URL, signaling the location of the cut with an ellipsis (some using only two dots), as shown in the second example in Table 5.1. We measure the frequency of occurrences of the items listed below (a sketch of such counters follows the list):

• Exclamation marks (single and multiple);

• Question marks (single and multiple);

• Mixed exclamation and question marks (e.g. “!?!?!?!?!!?!?!!”);

• Ellipsis (normal [i.e. “...”], or not normal [i.e. “..” or “....”]);

• Other punctuation signs (e.g. full stop, comma, colon, ...);

• Quotation marks;

• Parentheses and brackets (opening and closing);

• Symbols (tokens without letters and digits); and

• Punctuation at the end of the message (both including and excluding URLs).
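As mentioned before the list, the sketch below illustrates how such counters could be implemented with regular expressions. The patterns shown are assumptions made for illustration; the thesis does not publish the exact expressions it used, and a complete implementation would need to cover every item above.

```python
import re

# Illustrative regular expressions for a subset of the punctuation features.
PUNCTUATION_PATTERNS = {
    "exclamation_single":   re.compile(r"(?<!!)!(?!!)"),
    "exclamation_multiple": re.compile(r"!{2,}"),
    "question_single":      re.compile(r"(?<!\?)\?(?!\?)"),
    "question_multiple":    re.compile(r"\?{2,}"),
    "mixed_excl_quest":     re.compile(r"[!?]*(?:!\?|\?!)[!?]*"),
    "ellipsis_normal":      re.compile(r"(?<!\.)\.{3}(?!\.)|…"),
    "ellipsis_other":       re.compile(r"(?<!\.)\.\.(?!\.)|\.{4,}"),
    "quotation_marks":      re.compile(r"[\"“”«»]"),
    "brackets":             re.compile(r"[()\[\]{}]"),
}

def punctuation_features(messages):
    """Frequency per message of each punctuation pattern over a list of texts."""
    n = max(len(messages), 1)
    return {name: sum(len(pattern.findall(msg)) for msg in messages) / n
            for name, pattern in PUNCTUATION_PATTERNS.items()}
```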

Word tokens

This group of tokens is kept small for the sake of language independence. We begin by tracking the average length of the words used by the author. We also define features that track the frequency of words made only of consonants (which we assume to be abbreviations most of the time), and of complex words. We consider complex words to be those having more than 5 letters in which more than half of the characters are distinct. Thus “current” (7 characters in length, 6 different characters) is a complex word, while “lololol” (7 characters in length, 2 different characters) or “Mississippi” (11 characters in length, 4 different characters) do not fit the definition.
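The two word-level heuristics just described can be expressed as short predicates. The vowel set and function names are assumptions for illustration only.

```python
VOWELS = set("aeiouyáàâãéêíóôõú")

def is_abbreviation(word):
    """Consonant-only tokens are assumed to be abbreviations most of the time."""
    w = word.lower()
    return w.isalpha() and not (set(w) & VOWELS)

def is_complex_word(word):
    """More than 5 letters, with more than half of the characters distinct:
    'current' qualifies, 'lololol' and 'Mississippi' do not."""
    w = word.lower()
    return w.isalpha() and len(w) > 5 and len(set(w)) > len(w) / 2
```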

Word casing

Bots are usually careful in the casing they use, as careful writing supports the image they wish to convey. We measure the frequency with which the following are used:

• Upper and lower cased words;

• Short (≤ 3 letters), medium (4–5 letters) and long (≥ 6 letters) upper cased words;

• Capitalized words; and

• Messages that start with an upper cased letter.

Quantification tokens

We track the use of some numeric tokens. Dates and times are commonly used to mention events. Percentages can be more common in news or advertisements.

• Date (e.g. “2010-12-31” or “22/04/98”);

• Time (e.g. “04:23”);

• Numbers;

• Percentages; and

• Monetary quantities (e.g. “23,50€”, “$10” or “£5.00”).

Beginning and ending of messages

Some accounts post many messages (sometimes all or nearly all of their messages) using one or a small number of similar formats. This behavior is specific to bots, which automate their posting procedure. Below we can see two sets of examples.

• #football Kenny Dalglish says Liverpool will continue conducting their transfer business in the appropriate manner. http://bit.ly/laZYsO

• #football Borja Valero has left West Brom and joined Villarreal on a permanent basis for an undisclosed fee. http://bit.ly/iyQcXs

• #football Uefa president Michel Platini claims the introduction of technology would be bad for football. http://bit.ly/jOXocn

• New post: Google in talks to buy Hulu: report http://zd.net/kzcXFt

• New post: Federal, state wiretap requests up 34% http://zd.net/jFzKHz

• New post: Kodak wins again over Apple, RIM in ITC patent rul- ing http://zd.net/kNc1rn

To determine the pattern associated with the posts, we calculate the frequency with which messages begin with the same sequence of tokens (excluding URLs and user references, which frequently change between messages). We define tokens as words, numbers, punctuation signs, emoticons and other groups of symbols that have a specific meaning.

We group all the messages by their first token. For each group with two or more messages, we store their relative proportion in a feature related to the token. We also register the 10 highest proportions found, in descending order. This entire process is then repeated, looking at the first two tokens, then the first three, and so on. Once complete, we take note of the largest number of tokens seen, and repeat the entire process, looking at the endings of the messages. This procedure results in a number of features that are very specific. In the case of humans we collect a relatively small number of features, as their messages can be varied. Some bots will reveal a pattern that is used for all their messages (e.g. see the last bot examples in Section 5.1.1). In the case of cyborgs, it is very useful to detect a number of patterns such as “I liked a @YouTube video [URL]” or “New Blog Post ...”. The examples shown above illustrate messages where this approach is useful: the first group was taken from a bot account, while the second group was taken from a cyborg account.
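A sketch of this prefix-grouping procedure is given below for the beginnings of messages; applying the same function to reversed token lists covers the endings. Feature names and the exact bookkeeping are illustrative assumptions.

```python
from collections import Counter

def prefix_pattern_features(token_lists, max_top=10):
    """Beginning-of-message pattern features (sketch).

    `token_lists` is assumed to be a list of messages, each already tokenized
    with URLs and user references removed.
    """
    total = len(token_lists)
    features = {}
    proportions = []

    max_len = max((len(tokens) for tokens in token_lists), default=0)
    for k in range(1, max_len + 1):
        # group messages that share the same first k tokens
        groups = Counter(tuple(tokens[:k]) for tokens in token_lists if len(tokens) >= k)
        for prefix, count in groups.items():
            if count >= 2:
                share = count / total
                features[f"starts_with_{' '.join(prefix)}"] = share
                proportions.append(share)

    # the 10 highest proportions found, in descending order
    for i, share in enumerate(sorted(proportions, reverse=True)[:max_top], start=1):
        features[f"top_prefix_share_{i}"] = share
    return features
```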

5.4 Experimental set-up

Our aim is to compare the level of performance provided by the five sets of features described in Section 5.3. First we create a Ground Truth by classifying a number of Twitter users manually. This data is then used to both train and test our classification system.

5.4.1 Creation of a ground truth

In late April 2011 we started a Twitter crawl for users in Portugal. We considered only users who specifically stated that they were in Portugal or, when no known location was mentioned, whom we detected to be writing in European Portuguese. This collection started with 2,000 manually verified seeds, and grew mostly by following users referred to in the messages. In this way our collection moved towards the more active users in the country. However, there was no guarantee that we had been collecting all the messages from any of the users. At the time we had over 72 thousand users and more than 3 million messages. From this set, we selected 538 accounts that had posted at least 100 messages, and 7 people were asked to classify each user as either a human, a bot or a cyborg, in accordance with our guidelines, as described in Section 5.1.1. The annotators were presented with a series of user accounts, displaying the handle, a link to the Twitter timeline, and a sample of messages. Since the users presented to the annotators were randomly selected, not every annotator saw the users in the same order, and the sample of messages for each user was also different.

[Figure: horizontal bar chart of the number of users in each class — Humans (83%), Bots (12%) and Cyborgs (4%) — out of the 538 users.]

Figure 5.3: Distribution of the 538 users in the three classes.

For each user, we considered the classification that the annotators most often attributed to them. In the case of a tie, we asked the annotators to resolve it before ending the voting process. The manual classification system would only consider an account as classified when the most voted category led the second most voted by a difference of at least two votes. In other words, the voting system treated a difference of one vote the same way as a tie. This was done for a number of reasons: a) it reduces the possibility of a bad vote due to an unfortunate sample of messages; b) it allows the annotator to be exposed to more accounts before confirming or changing their opinion about an account; and c) it allows us to better protect the results from random or misplaced votes. To finalize the voting process, we asked an eighth annotator to resolve the remaining ties between annotators. In the end we were left with 2,721 votes, 95% of which came from the 4 main annotators, which we used to calculate the agreement. Using all 538 users, the Fleiss’ kappa value was 0.086, which represents only slight agreement. Looking only at the 197 users that were classified by all 4 main annotators, we get a substantial Fleiss’ kappa value of 0.670, showing adequate reliability in the classification. Figure 5.3 shows the distribution of the users across all three categories. We can see that humans dominate our collection of Twitter users (448), while cyborgs were the least numerous (22). In total we identified 68 bots.

5.4.2 The classification experiment

We randomly selected our example users from our manually classified examples. To have a balanced set, we limited ourselves to only 22 users of each type, randomly selected before the experiment. To handle the automatic classification, we opted for an SVM due to its ability to handle a large number of features, using the libSVM [CL01] implementation. For each user we selected up to 200 messages to analyze and create the features. Due to the chronological features (Section 5.3.1), we selected only sequential messages in our collection.

[Figure: box plots of classification accuracy (30%–100%) for each feature set — Chronological features, User interaction, Stylistic features, Client application, Use of URLs, and All.]

Figure 5.4: Box plot showing the results for the classification of users, using 50 2-fold cross validation runs. The limits of the boxes indicate the lower and higher quartile. The line inside the box indicates the median. The extremities of the lines represent the minimum and maximum values obtained.

We used the radial basis function kernel from libSVM, allowing it to look for the parameters that best fit the data, and normalized the values of the features, allowing for more accurate results. We measured the results using accuracy, i.e. the ratio between correct classifications and total classifications. We opted for a 2-fold cross validation system, where we select 11 users of each type for the training set, and the other 11 for the testing set. This allows enough testing messages to provide adequate granularity in the performance measurement, and a more reasonable number of messages to train the SVMs. We repeat each experiment 50 times (drawing different combinations of the training and testing sets). Given that we are using a balanced set of examples, we expect that a random classifier would be correct 1/3 of the time. We will consider this as the baseline in our analysis.
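A comparable experimental loop can be sketched with scikit-learn, whose SVC classifier wraps libSVM. The parameter grid and the stratified 50/50 split are assumptions meant to mirror the description above, not the exact setup used in the thesis.

```python
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def run_once(X, y, seed):
    """One 2-fold run: half of each class for training, the other half for testing."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)
    model = make_pipeline(
        StandardScaler(),                      # normalize the feature values
        GridSearchCV(SVC(kernel="rbf"),        # RBF kernel, as in the thesis
                     {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]},
                     cv=3))
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)         # accuracy on the held-out half

# X: one row of features per user; y: labels "human", "cyborg" or "bot"
# accuracies = [run_once(X, y, seed) for seed in range(50)]
```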

5.5 Results and analysis

Our results are shown in Figure 5.4, detailing the minimum, lower quartile, median, upper quartile and maximum accuracy across the 50 runs. We separate the results into two groups: the first group, which never reaches 100% accuracy, and the second group, which does. In the first group, the user interaction features outperformed the chronological features, which had two poor runs. However, neither of them shows performance similar to the other feature sets. In the second group, the stylistic features present the best results, with a median accuracy of 97%. The feature set that identifies the client application also performs adequately, but twice failed 7 or 8 of the 33 examples. The URL features showed more stable results than the client information features, but generally failed in more cases. Finally, using all the features combined yielded very good results, with a 97% median (and 97% upper quartile, hence overlapping in Figure 5.4), failing once in 5 of the examples. Over 26,000 features were generated during the experiments, most of which encode stylistic information. While as a large group they can be quite powerful (as shown), each of these features carries little information. This is in contrast with the URL, user interaction and client application features, where a small number of features can contain very meaningful information. Most features related to the client application work almost like a database of microblog applications. That is, except for the number of different clients used, we are only recalling the identification strings present in the training messages. In the presence of an unknown client, the classifier has little information to work with; hence the cases with low accuracy. The features related to the URLs and to user interaction obtain information from the presence or absence of certain elements. However, in our implementation we could not encode enough information to address all the relevant cases, especially in the case of user interaction. It is unfortunate that we are unable to compare our results with the other approaches mentioned in Section 5.2. There are three reasons for this: (i) their work has a different goal (i.e. spam detection); (ii) the authors do not provide a quantification that we could use for comparison; or (iii) we consider that their experiment is biased (e.g. excluding some messages because they are more difficult to classify).

5.6 Conclusion and future work

We have shown that automatic user classification into human, cyborg or bot — as we have defined them — is possible using standard classification techniques. With 97% accuracy we can be quite confident in the results of this method. However, we are unsure how motivated many of these users were to hide their posting strategy. Results may be worse in situations where the account owners take special care to hide the account automation. Regarding the most revealing signs, as we had supposed, stylistic features can be a reliable indicator in this type of classification. In fact, they achieved results as good as or better than other, more frequently used indicators of automatic activity. Two problems arise from using the client application as the basis for user classification: first, some applications can have mixed uses (as Chu et al. [CGWJ10] and Grier et al. [GTPZ10] point out); and second, dealing with the large number of different clients is difficult. For example, we counted

2,330 different clients in our database of 73,848 users (around 1 distinct client for every 32 users). Thus, while fast and simple, this approach does not appear to stand on its own, and should be combined with another approach. In the future, we would like to improve our chronological features by adopting the same method Zhang and Paxson used [ZP11, GTPZ10], as our minute-of-the-hour and second-of-the-minute approach was, perhaps, too simplistic. We would also like to study the scalability of the stylistic approach, as it generates a large number of new features.

Chapter 6

Nationality detection

Contents
6.1 Introduction
6.2 Related work
6.3 Methodology
    6.3.1 User selection
    6.3.2 Features
6.4 Experimental setup
    6.4.1 Dataset generation
    6.4.2 Determining the number of messages
    6.4.3 Determining the most significant features
6.5 Results
6.6 Analysis
6.7 Conclusion
6.8 Future work

When performing an experiment of a social nature, it may be important to ensure that the population satisfies a number of criteria. A frequent criterion is nationality, since it ensures a reasonable uniformity in a number of aspects such as culture, socio-economics and language. Nationality is also the defining boundary in a number of subjects, such as political matters, market research and advertisement. Since the internet is often viewed as a single global community, it is difficult to determine the country of origin of a user. When consuming information from a service provider, like Twitter, there is no access to detailed information such as IP addresses, and GPS coordinates are absent from a significant number of messages. Profile information is not always available, accurate, trusted or usable for this purpose. In this work we propose using the language used by the user in their posts. Unfortunately this is a

more complex problem when more than one country uses the same native language. In this chapter we address the specific problem of detecting the two main variants of the Portuguese language — European and Brazilian — in Twitter micro-blogging data, by proposing and evaluating a set of high-precision features. We follow an automatic classification approach using a Naïve Bayes classifier, achieving 95% accuracy. We find that our system is adequate for real-time tweet classification.

6.1 Introduction

In this work, the specific problem we are trying to solve is identifying Portuguese users in Twitter. The official language in Portugal is Portuguese. That is also the official language in other countries, such as Brazil. However, due to cultural differences, several words evolved towards different meanings in these countries. In order to semantically understand the text, we first need to identify the nationality of the author. For example, “Nivea” is a German skin-care company very popular in Portugal, but in Brazil it is a common proper name. The text “I love Nivea” could be interpreted either as a statement of brand recognition and approval, or as a personal expression of affection for another person. Hong et al. [HCC11] placed Portuguese as the third most used language on Twitter in 2011, with a percentage of about 9% of all messages. Portuguese is spoken mainly in Portugal and Brazil, with Brazil having approximately 20 times the population of Portugal. This shows how asymmetric the problem is: choosing a random Tweet in Portuguese, there is a 95% chance of it originating in Brazil — considering equal Twitter usage in both countries. Actually, Brazil is the second most represented country in this microblog system1. Since only a small fraction of users of social networks add usable information to their profiles, such as location [CCL10], identifying the nationality of the author becomes an intricate problem. Even when present, this information can be too broad (e.g. “Europe”), too specific (e.g. the street name), contain spelling errors, ambiguous names (e.g. “Lagos” — in Portugal or in Nigeria?), or present misleading information (e.g. “Hogwarts”). Thus, we are left with inferring the location of the users through indirect methods. Several solutions have been considered. Time zones could help in locating a user on the globe, for example, by observing the regular inactive hours of users, which could match their sleeping time. Unfortunately, the number of messages required to create a meaningful user profile would limit this classification to the users that post frequently enough.

1 http://semiocast.com/publications/2012_01_31_Brazil_becomes_2nd_country_on_Twitter_superseds_Japan

Portugal and Brazil are too close geographically to allow for conclusive decisions regarding a vast percentage of users. Also, automatic posting systems, such as those from online news, have posting patterns that are less dependent on the time of day [LSO11]. Finally, when analysing some events — for example, a World Cup soccer game — the difference in time zones is less meaningful, as the users are reacting at the same moment. Fink et al. used references to geographic features or locations [FPM+09], but since Portugal is a destination often chosen by emigrants, particularly from other Portuguese-speaking countries [Ins12], by itself this solution is insufficient for identifying the nationality of the author. Social network information (friends and followers) is capable of providing us with an accurate location [GCCG11]. This was not used in our work because, to the extent that is possible, we wanted to rely only on information that comes with the tweets, and the network of the user does not. We opted to tackle this problem by looking more deeply at the language used to write the messages. In many cases there is a strong affinity between languages and geographic locations, but to be truly efficient in guessing the location of a person, we need to be able to differentiate between variants of the same language [LBS+13]. Our objective is to perform this variant detection work as a classification task, where our scientific contribution lies in the set of features we propose. We distinguish between the analysis of the features (the present work) and the automatic updating of the vocabulary that assists a few of them. Since the latter task depends on the success of the former, we use manually constructed lists, and defer the maintenance of those lists to subsequent work. We focus on features that are simple to process, work with minimal available text, and that, within the scope of micro-blogging messages, reflect differences between the European and Brazilian variants of Portuguese. These features cover common expressions, named entities, common grammatical variations, URLs and writing styles. We train classification models on datasets containing up to 100 Twitter messages from each of 1400 Portuguese and 1400 Brazilian users. Although not manually annotated, our largest dataset is 56 times larger than the one used by Carter et al. [CWT13] (5000 messages distributed across 5 languages). We show that our proposed features outperform the traditional n-gram approach in accuracy (80%), achieving up to 95% when using a Naïve Bayes classifier. We also show that our proposed features can be analysed faster than n-grams while still producing more accurate results. In the next section we present related work on language identification. In Section 6.3 we explain our approach for language variant identification. In Section 6.4 we describe the generation of our dataset and the experiment performed. Our results are then introduced and analysed, and our conclusions are drawn in Section 6.7. We conclude with the future steps for our work.

6.2 Related work

Extensive work [HBB+06, GLN08] has been done in the area of automatic language identification. The most frequent approaches use character n-grams together with Naïve Bayes, or Markov Models. One of the most popular approaches is a method of profile ranking [CT94] in which the n-gram profile of a language is built, and further language classification relies on simple matching of profiles of texts of interest against the previously built reference profiles. All of the above methods attain high accuracy on moderate and large sized texts. However, the accuracy tends to drop slightly on shorter text passages. A recent study by Gottron and Lipka [GL10] shows that Naïve Bayes, trained on 5-grams of Reuters news text in 10 different languages and applied to news headlines, is able to reach 99% accuracy. The same authors achieved 81% accuracy on single-word queries. However, since news titles and news content usually have a strong affinity, high accuracy is not surprising. Vatanen et al. [VVV10] compared profile ranking [CT94] with Naïve Bayes using different smoothing methods. The models were trained on the Universal Declaration of Human Rights in 50 languages and tested on its short cut-outs (ranging from 5 to 21 characters). They attained precision of 72.5% and recall of 72.3%, and compared it to Google’s AJAX language API, with its 53.1% precision and 28.0% recall. Distinguishing very similar languages on web pages has been identified as a difficult problem. Martins and Silva [MS05] have shown that their system, using a modified approach of Cavnar and Trenkle [CT94], is unable to distinguish between Brazilian and European Portuguese due to insignificant language differences. da Silva and Lopes [dSL06] succeeded with 98% precision in distinguishing those two variants of Portuguese using n-grams with dimensionality reduction and Quadratic Discrimination Score. They used “formal writing documents” as a corpus (e.g. official Brazilian government documents and Portuguese news), with an average length of 99 lines. The work by Ljubesic et al. [LMB07] confirmed that distinguishing variants of languages is possible. They differentiated between Croatian and Serbian, using second-order Markov Models paired with a list of forbidden words. They reached precision and recall above 99%. However, both Martins and Silva and Ljubesic et al. [MS05, LMB07] used a controlled corpus of larger documents, and we are interested in short, microblog-style texts. News and web pages exhibit different linguistic properties than microblogs, which are a distinct kind of text, in part due to their short format (e.g., 140 characters in Twitter). Tang et al. [TLC11] showed that, from a linguistic and sentiment analysis perspective, there is a difference between microblog texts and other balanced corpora (e.g. news articles). This difference validates the use of different language processing methods for microblog messages. Hong et al. [HCC11] pointed to cross-language differences in the adoption of Twitter entities such as URLs, hashtags, mentions, replies and retweets. Furthermore, they showed different semantic properties of languages. Carter et al. [CWT13] used the approach of Cavnar and Trenkle [CT94] to distinguish between 5 languages on Twitter. Each language model was trained over n-grams of 1000 tweets in the corpus, using additional prior knowledge.
Prior knowledge consisted of the languages of web pages referenced in the tweet, the language guessed for previous posts, the language used by mentioned users, the language used in the previous post in the current conversation, and the language used in other tweets containing the same tag. They achieved an accuracy of 92.4% without prior knowledge and 97.4% with it. As part of their work on latent user attributes, Rao et al. [RYSG10] studied whether they could determine if Indian Twitter users, communicating in English, were from the north or the south region of India based only on their messages. With 200 users in each category, they achieved their best result (77.1% accuracy) when using their sociolinguistic-feature model. These features relate to forms of expression, such as the use of smileys, ellipses, upper-cased words, “LOL”, “OMG”, and so on. Classification was done using an SVM.

6.3 Methodology

We aim to answer the following research questions: i) how can we classify microblog users according to their variant of Portuguese (specifically European vs. Brazilian), based solely on the contents of their messages; and ii) how do our classifiers perform as we increase the number of messages per user examined? To this end, and knowing that our ultimate goal is to identify users from (or relevant to) the Portuguese Twittosphere, we outline our approach in the following steps. First we create two sets of microblog users — one for Portugal and another for Brazil — and then proceed to sample messages from each user. We then train a classifier based on the texts written by a number of users, expecting to be able to predict the nationality of the remaining ones. The classification is supported by several groups of features that relate to six different strategies in language variant identification.

6.3.1 User selection

We wanted to create a large set of Portuguese and Brazilian users for our study, both to capture the variety in language and style that exists in microblogs, and to provide statistical significance to the results. Unfortunately, it would take too long to manually annotate the required number of users in a proper way. We opted to use automatic annotations, employing filters over the stated location and the social network of each user. To improve our confidence in our annotation, we added a second source of information. Gonzalez et al. show that users tend to have geographically close followers [GCCG11]. We believe that these links are, to a degree, a user’s connection to their community, and do not change, in essence, as easily as the geographic location of the user. Thus, we have greater confidence in a user who claims to reside in the same country as their friends, and should reject accounts that show contradictory information.

6.3.2 Features

We defined six feature groups: basic n-gram features and a set of five feature groups that we refer to as our proposed features. Some of our proposed features make use of pre-compiled lists of words, names or short expressions. These lists are meant to give an indication of the suitability of the identification strategy employed.

N-gram based features

In our work, n-grams set the baseline against which all other groups of features are compared. The n-gram feature group covers all sequences of one, two and three characters in the message. N-grams are frequently used in text classification due to their simplicity and distinctive results [CT94], but they can also raise the dimensionality of the feature space and extend the processing time. To avoid this problem we consider only n-grams that occur more than 0.01 × (total number of messages) times.
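The thresholded character n-gram extraction can be sketched as follows; the function name and interface are illustrative.

```python
from collections import Counter

def char_ngram_features(messages, min_n=1, max_n=3, threshold_ratio=0.01):
    """Character n-gram counts, keeping only n-grams that occur more than
    0.01 x (number of messages) times, as described in the text (sketch)."""
    counts = Counter()
    for msg in messages:
        for n in range(min_n, max_n + 1):
            counts.update(msg[i:i + n] for i in range(len(msg) - n + 1))
    cutoff = threshold_ratio * len(messages)
    return {gram: count for gram, count in counts.items() if count > cutoff}
```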

Stylistic

The style in which messages are written can help distinguish one author from another. We define a set of stylistic features based on work by Sousa-Silva et al. [SSLS+11], since it is possible that cultural or social factors can influence the style employed by Twitter users from two different countries. This feature group is divided into 4 segments: i) Quantitative markers, ii) Marks of Emotion, iii) Punctuation and iv) Accents. Of these, Marks of Emotion is expected to be the most relevant, since these marks showed the highest precision in distinguishing authors [SSLS+11].

Interjections are a way of textually expressing emotion, and given the orality influence in microblogs, they can be frequent. Through observation, we noticed that Brazilians tend to express emotion using different onomatopoeias than the Portuguese (e.g. the laugh “kkk” is typical of Brazil). To identify interjections we look at words with 5 or more letters, having a low ratio of distinct characters to the word length (≤ 0.5). In this way we can recognise simple interjections (e.g. “hehehe”) as well as more complex ones that do not follow a pattern (e.g. “hushaushuash”).

Entities

With the help of native speakers of each language variant, we put together two lists containing names that we consider likely to be mentioned in Social Media. Some of these names are idiomatic (e.g. fans of the Brazilian soccer club Fluminense are sometimes called “bambi”). They include regions, cities, politicians, soccer players, singers, actors, political parties, soccer clubs, big national companies, and similar entities that are frequently discussed online. The lists contain approximately 130 entries for each variant, including variations, and can be seen in Annex A.

Word tokens

We want to determine the frequency of words exclusive to one of the two language variants. To this end we used GNU Aspell2 and its two dictionaries for the Portuguese variants. We also took note of the words we found in neither dictionary and used them as features, since they could be vernacular mentions or entity names. In particular, we assume that words composed only of consonants are abbreviations. We rely on this heuristic instead of employing a list of abbreviations that is difficult to keep updated. We created two lists containing expressions more popular in one of the language variants. We identified over 200 regarding Brazil, and nearly 80 used in Portugal, accounting for variations such as the lack of accents, common abbreviations and misspellings. For example, this list contains words such as “billion” (“bilião” in Portugal, “bilhão” in Brazil), “bus” (“autocarro” in Portugal, “ônibus” in Brazil), and expressions such as “pisar na bola” (literally “to step on the ball”, but in Brazil it means to act against one’s expectations). Finally, we also count the number of times monetary references occur for both Portugal (Euro) and Brazil (Real).

2 http://aspell.net/

Grammar

The Portuguese and Brazilians can, in some situations, show different preferences in grammar. This is noticeable in the verbal forms and the grammatical persons used. For example, in Brazil the gerund is used more often, and the second person of the singular “você” can be used in both formal and informal situations. In Portugal the infinitive is more frequent, and the second person of the singular “tu” is used in informal conversation. Also, in Brazil the pronoun frequently precedes the verb (e.g. “me escreve” — “[to me] write”), while in Portugal the opposite is more common (e.g. “escreve-me” — “write [to me]”). We use JSpell3 to detect these patterns, and set a feature as a flag when one is detected. Another distinction is the absence of the definite article before a possessive. In Brazil it is often omitted, as it is optional in the language, as in “meu texto” — “my text”. In Portugal the article is frequently used: “o meu texto”.

URL

The last group of features is extracted from the URLs mentioned in the tweets. We maintain two types of features related to URLs, after expansion from shortening services. The first type relates to the Top Level Domain (TLD). The national TLDs for Portugal and Brazil are, respectively, “pt” and “br”. Here, we simply count the number of URLs that we find in each message having each of these TLDs. However, there are many websites outside of these TLDs which are significantly more popular in one of the countries. For example, the hostname for the Brazilian TV station Rede Globo is “redeglobo.globo.com”. To address this, we use a second type of feature that counts the number of times we find each hostname among all URLs in the message.
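A minimal sketch of these URL features, assuming the URLs have already been expanded, could look like this (the feature names are illustrative):

```python
from collections import Counter
from urllib.parse import urlparse

def url_features(expanded_urls):
    """TLD and hostname counts from already-expanded URLs (sketch)."""
    features = Counter()
    for url in expanded_urls:
        host = urlparse(url).hostname or ""
        tld = host.rsplit(".", 1)[-1] if "." in host else ""
        if tld in ("pt", "br"):
            features[f"tld_{tld}"] += 1     # national TLD counts
        features[f"host_{host}"] += 1       # per-hostname counts
    return dict(features)
```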

6.4 Experimental setup

The tests were made in two steps. In the first step we generate the features that describe the messages in the dataset. In the second step we use a classifier to test the adequacy of the features using 5-fold cross validation. Our goal is not to contrast different classification techniques for optimising classification accuracy, but to assess the usefulness of the sets of features we propose. Thus, the choice of classifier was secondary in our work. We opted for Naïve Bayes4 due to its simplicity and because it is commonly used as a baseline. It also enables us to process large numbers of features quickly.

3 http://natura.di.uminho.pt/~jj/pln/pln.html
4 http://search.cpan.org/dist/Algorithm-NaiveBayes/

6.4.1 Dataset generation

Most of our message collection had been acquired with the purpose of representing Portuguese users. To ensure our current ground truth was unbiased, we used the TREC Tweets2011 corpus5, which provides a random sample of Twitter (approximately 16 million messages) from 2011-01-24 to 2011-02-08. Our dataset was created by sampling Portuguese and Brazilian users and collecting their tweets. In June 2011 we registered information about all the users present in the Tweets2011 corpus. In February 2012 we re-examined the users in this corpus, and discarded those that had changed the location in their profile. We filtered out all Portuguese and Brazilian users by matching their free-form location field in Twitter against “Portugal”, “Brasil” (the native spelling of “Brazil”) or one of the top 10 cities (municipalities) for each country6, excluding “Porto” (in Portugal), which is frequently mistaken for “Porto Alegre” (in Brazil). From the remaining users, we further excluded those from whom we could not retrieve more than 100 messages (excluding retweets). In order to enable the social network filtering, we also retrieved information from the accounts that were following each of the relevant users. We selected all users with more than 10 followers, where the number of followers from their own country exceeds the number from the other country by a factor greater than 3. These conditions resulted from experimentation, allow for an acceptable balance in the number of users in each set (2768 users in Brazil and 1455 users in Portugal), and we consider them strict enough to support our labelling using an unsupervised approach. From each of these sets we randomly selected 1400 users to use in our experiments. Finally, we used a specialised tokenizer [LSTO10] to process the messages from each user, and expanded the shortened URLs. After a native speaker of each language variant inspected a 5% sample of each dataset, no irregularity was found in the sample of the “Brazil” dataset. The sample of the “Portugal” dataset was more difficult to evaluate, having 6 cases that precluded a definitive conclusion. One account showed mixed spellings in words, without one of the language variants being dominant. The remaining accounts had insufficient content written in Portuguese to form an informed opinion. In short, these messages were ambiguous.
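The follower-based filtering rule described above can be sketched as a simple predicate; the country labels and function name are assumptions for illustration.

```python
def accept_user(stated_country, follower_countries, min_followers=10, ratio=3):
    """Keep a user only when enough of their followers come from the country
    stated in their profile (sketch of the rule in the text).

    `follower_countries` is assumed to be a list of country labels, one per
    follower, e.g. ["PT", "BR", "PT", ...].
    """
    if len(follower_countries) <= min_followers:
        return False
    same = sum(1 for c in follower_countries if c == stated_country)
    other = sum(1 for c in follower_countries if c in ("PT", "BR")) - same
    # followers from the stated country must exceed the other country's
    # followers by a factor greater than `ratio`
    return same > ratio * other
```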

6.4.2 Determining the number of messages

In the first experiment we wish to determine how the number of messages available influences our classification. We created several datasets with 1, 2, 3, 4, 5, 7, 10, 15, 20, 30, 50, 70 and 100 messages per user (MPU).

5 https://sites.google.com/site/microblogtrack/2011-guidelines
6 http://en.wikipedia.org/wiki/List_of_cities_in_Portugal, http://en.wikipedia.org/wiki/List_of_largest_cities_in_Brazil

[Figure: accuracy (roughly 0.65–0.95) as a function of the number of messages per user, for the “all”, “proposed” and “n-grams” feature groups.]

Figure 6.1: Accuracy of three feature groups as a function of the number of MPU.

Messages were selected randomly, and concatenated into a single text, simplifying the later processing of the text with minimal impact.

6.4.3 Determining the most significant features

To determine the impact of each of our proposed feature groups, we re-ran the experiment using 100 MPU, each time excluding one feature group. We then measured the impact this absence caused in both accuracy and execution time. This is known as a mutilation test.

6.5 Results

As Figure 6.1 shows, our proposed features offer nearly continuously increasing performance as more data is available. By contrast, n-gram features start to plateau fairly early (10 MPU), and fail to make use of the extra information. We can also see that our proposed features were more accurate than n-grams when exceeding 4 MPU. When using a small sample from each user (1 or 2 messages), n-grams achieve higher accuracy. Maximum n-gram accuracy was 0.80 (30 MPU), while our proposed features peaked at 0.87 (100 MPU). In Figure 6.2 we can observe the total running times for the experiment, as a function of the sample size. This includes both training and testing times.

Table 6.1: Variations in accuracy and processing time of our proposed features, excluding a feature group in turn, using 100 messages per user.

Feature group removed    ∆ accuracy    ∆ time
Entities                   -0.16%      -26.83%
Stylistic                  +8.81%      -21.49%
Grammar                    -0.08%      -17.18%
URL                        -0.33%      -16.53%
Word tokens               -19.37%      -66.51%

The time cost of both n-grams and our proposed features grows linearly with the number of MPU, but at different rates.

[Figure: running time in seconds (roughly 10 to 10,000, log scale) as a function of the number of messages per user, for the “all”, “proposed” and “n-grams” feature groups.]

Figure 6.2: Processing time of three feature groups as a function of the number of messages used.

The results of the mutilation tests are displayed in Table 6.1. Strong emphasis should be put on the Word tokens feature group: when we removed these features, processing time decreased by 2/3, but we also incurred an almost 20% accuracy loss. By contrast, our proposed features showed an improvement of almost 9% in accuracy when we excluded the Stylistic feature group, reaching the maximum value we obtained: 95%. To us, this was rather surprising since, based on previous results, we believed that these features would be adequate for this task [SSLS+11] and that a “community-level” style of writing could exist.

[Figure: feature space dimension (roughly 100 to 1,000,000, log scale) for the proposed features, the limited n-grams and the unlimited n-grams (“n-grams (all)”) feature groups, using 1, 10 and 100 MPU.]

Figure 6.3: Feature space dimension of the proposed features group and limited and unlimited n-gram feature groups, using 1, 10 and 100 MPU.

6.6 Analysis

We should first recall that we are comparing an optimised version of n-grams with an unoptimised version of our system. In Figure 6.3 we represent the potential n-gram feature space dimension and the dimension of the feature space we actually used, for 1, 10 and 100 MPU. We also show the dimension of the feature space created by our five proposed groups. The number of considered n-grams is almost constant, due to the threshold described in Section 6.3.2. This allows an almost fixed execution time for the n-gram feature group. Without this threshold, the n-gram group would execute slower than our proposed features, since it accumulates many more features. This sort of feature selection could be harmful to our proposed features, since they are precision-oriented rather than recall-oriented (as n-grams are). That is, they may appear infrequently, but when present they are very discriminative. We can conclude that combining all features brought little gain in accuracy over our proposed features; the extra overhead in processing time has a minimal impact on accuracy. Compared to our proposed features, the maximum gain was under 3%, when using one message per user, for over 5 times the original processing time. Compared to n-grams, 7.5% better accuracy can be obtained when using all 100 MPU, for nearly 9 times the original processing time. The almost flat accuracy of n-grams when facing longer text samples, seen in Figure 6.1, seems to indicate that there is little gain in using a larger training set. In other words, we could say that the relative frequency of n-grams remains stable after a certain number of messages has been observed. To confirm this hypothesis, we compared the most popular n-grams when using 1 and 100 MPU. We used the Kendall tau distance [FKS03] with penalty parameter 1 to measure the ranking similarity of the 200 most frequent n-grams. The similarity measures were normalised using the maximum value n(n − 1)/2, where n is the number of distinct n-grams in the union of both lists being compared. In this way, 0 means that both lists are equal, and 1 indicates that the lists are in total disagreement — e.g. the n-grams are in reverse order, or no common n-gram exists. We calculated the similarity values for the sets containing 1 message per user and 100 MPU in each language variant. One thing that we noticed was that the Brazilian variant scored higher (greater variation) in almost every case. The most significant differences were found in 1-grams — 0.258 for Portugal, 0.302 for Brazil. This is due to URLs, which can introduce seemingly random characters into the text. Each Kendall tau distance reduces to 0.061 and 0.046, respectively, when excluding URLs. Longer n-grams present greater similarity between the smaller and larger message sets. In the same order as above, the similarity for 2-grams measured 0.042 and 0.047, and for 3-grams 0.094 and 0.109. Thus, we can conclude that 1400 messages (one from each user) are sufficient to provide an accurate model of each language variant.
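For reference, a normalised Kendall tau distance with a penalty for unmatched items, as used above, can be computed along the following lines. This is a sketch following Fagin et al. [FKS03]; the thesis implementation may differ in details.

```python
from itertools import combinations

def kendall_tau_distance(list_a, list_b, penalty=1.0):
    """Normalised Kendall tau distance between two ranked lists, with a
    penalty parameter for item pairs missing from one of the lists.
    Returns 0.0 for identical rankings and 1.0 for total disagreement.
    """
    pos_a = {item: i for i, item in enumerate(list_a)}
    pos_b = {item: i for i, item in enumerate(list_b)}
    union = set(pos_a) | set(pos_b)

    distance = 0.0
    for i, j in combinations(union, 2):
        in_a = (i in pos_a, j in pos_a)
        in_b = (i in pos_b, j in pos_b)
        if all(in_a) and all(in_b):
            # both items ranked in both lists: penalise opposite orders
            if (pos_a[i] - pos_a[j]) * (pos_b[i] - pos_b[j]) < 0:
                distance += 1.0
        elif (all(in_a) and not any(in_b)) or (all(in_b) and not any(in_a)):
            # both items appear in only one of the lists
            distance += penalty
        elif all(in_a) or all(in_b):
            # both items in one list, exactly one of them in the other list:
            # disagreement iff the item missing from the other list is
            # ranked ahead in the list that contains both
            full, present = (pos_a, pos_b) if all(in_a) else (pos_b, pos_a)
            missing = i if i not in present else j
            other = j if missing == i else i
            if full[missing] < full[other]:
                distance += 1.0
        else:
            # each item appears in a different list only
            distance += 1.0

    n = len(union)
    return distance / (n * (n - 1) / 2) if n > 1 else 0.0
```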

6.7 Conclusion

Our intent was to identify the variant of a language used in microblog messages, as a way to disambiguate the nationality of users. We worked on Portuguese, the official language of Portugal and Brazil (and other countries we did not use in this work), and automatically selected 1400 users referring to each country, based on information extracted from their profile and their social network. After presenting our proposed feature groups, which we compared to an n-gram approach using a Naïve Bayes classifier, we tested both approaches using several sample datasets, varying from 1 to 100 messages per user. Our main conclusion is that lexical differences provide the best discrimination among our proposed features, and show a promising path for classification improvement. N-grams required minimal input to generate an adequate model, and thus, when only one or two tweets are available, this feature group provided the best accuracy. This is relevant when, for example, the user writes infrequently, or writes most messages in a foreign language, and thus we need to make a decision based on few messages. When more messages are available, we can expect higher accuracy from our proposed features. The most accurate classification was obtained by combining the Entities, Word tokens, Grammar, and URL features. In this situation we were able to reach 95% accuracy in our experiments. As for the number of messages from each user, we notice that beyond

10 messages, each additional message contributes less to the classification. With n-grams we see negligible improvements past that point, while our proposed features continue to improve at a lower rate.

6.8 Future work

Some of the features presented here employ vocabulary lists. The manual generation and maintenance of these lists is a tiresome endeavour, subject to mistakes or omissions that could greatly impact the results. The human factor limits the scalability of the system. For this reason, we intend to explore the process of generating these lists automatically. This could include external data sources, like newspapers or Wikipedia, and/or tweets. As improvements to the classification process, we wish to test alternative classifiers, comparing their execution times and classification accuracy. We also aim to improve some features, namely in the stylistic group, and determine whether they can help in user nationality classification or should be discarded.

Chapter 7

An analysis of obfuscation

Contents
7.1 Introduction
7.2 What is profanity, and why it is relevant
    7.2.1 On the Internet
7.3 Related work
    7.3.1 How prevalent is swearing?
    7.3.2 Why is it difficult to censor swearing?
7.4 Swearing and obfuscation
    7.4.1 Our objectives
7.5 Works related to the use of swear words
    7.5.1 List-based filtering systems
    7.5.2 Detection of offensive messages
    7.5.3 Works related to the study of swear words
7.6 The SAPO Desporto corpus
    7.6.1 Description of the dataset
    7.6.2 The annotation
    7.6.3 The lexicon
7.7 Obfuscation
    7.7.1 Obfuscation methods
    7.7.2 Obfuscation method analysis
    7.7.3 Preserving word length
    7.7.4 Altering word length
7.8 What we have learned so far

While most of the noise we have seen so far is a byproduct of style or of a very informal form of writing, some of it is intentionally added to confuse the reader (or the filtering platform). Obfuscation is thus the result of trying

to raise the noise in the message so that the signal (the intended word) can pass through “unseen”. In this chapter we will describe our efforts to “see” such words and decode their true meaning. This chapter is based on work we have previously published [LO14b, LO14a].

7.1 Introduction

On September 5th, 2017 a story appeared on the Internet reproducing a humorous article that had supposedly appeared in the Times daily newspaper that same day, and that perfectly illustrates the problem of clashing “realities” that profanity many times promotes. It was titled “War of words” and read:

Today marks the 35th anniversary of the death of Sir Douglas Bader and I couldn’t let it pass without this story of the RAF hero. He was giving a talk at an upmarket girls’ school about his time as a pilot in the Second World War. “So there were two of the f***ers behind me, three f***ers to my right, another f***er on the left,” he told the audience. The headmistress went pale and interjected: “Ladies, the Fokker was a German aircraft.” Sir Douglas replied: “That may be madam, but these f***ers were in Messershmitts.”

Confronted with the wish to tell this story, and keeping in mind the sensibilities of their readers that they ought to respect, the editorial team decided to use asterisks as a form of compromise. This is a very common strategy — which we call “obfuscation” — as it avoids printing words deemed offensive by some and still allows the columnist to tell their story. Obfuscation is thus a sort of middle road: you can no longer find the relevant word in the text, but it is still there. And of course, the primary use of obfuscation is on swearing, which seems to uniquely balance reproach and pervasiveness in language.

The work we describe here tries to improve the ability of a machine to recognise obfuscated words, and to provide the actual word excised from the text. It is curious that this task is very subjective and hard to pin down strictly: even for humans it can easily swing from trivial to a source of uncertainty and disagreement; but that may be expected from what can be described as a creative form of communication between humans. It is even debatable whether “bad obfuscations” exist, since there are no rules and “anything goes” between the extremes of “no obfuscation” and total meaningless noise, given that obfuscated text is supposed to be harder to understand. If User-Generated Content is occasionally compared with the “wild west” of language, then obfuscation mixes some anarchy into it.

Despite the difficulty, our results are encouraging, as we were able to “decode” or “recover” many of the disguised words in our corpus back into their canonical form. We studied the obfuscation methods used by users and used that knowledge to revert the operations most frequently applied to the words in a reliable manner. Our strategy is based on improvements to the Levenshtein edit distance, with custom-made operations of different costs that derive from our observation of relevant data.

This chapter is divided into three parts. The first part presents background information on profanity and obfuscation. For the readers who may wonder why profanity is even a relevant subject, we dedicate the first sections of this work to address this matter. We also refer to relevant literature where possible. In Section 7.4 we begin addressing the subject of obfuscation, discussing the why and how of its uses. Section 7.5 discusses the state of the art on profanity detection (not profanity recognition or identification, as we found no previous work on this subject).

The second part of this chapter presents the data we used in our work. It starts in Section 7.6 with the detailed description of our obfuscation corpus, which allowed us to examine a Portuguese community that frequently obfuscates words. The annotation process is presented in a subsequent section and is followed by an analysis of the profanity found. The matter of obfuscation is addressed in Section 7.7, where the methods employed are presented in detail. We conclude this section with a revision of the most relevant information that had been presented and will be relevant to understand our work.

The third and final part of the chapter presents our work on deobfuscation and profanity recognition. It starts with Section 8.1, where we present a formal view of the obfuscation process. We then present the edit distance, explain how it works and disclose our own version of it. Section 8.3 talks about the evaluation we made of our work.
This includes the methodology, the results obtained and an analysis thereof. As usual, we close the chapter by restating the most relevant ideas presented and by outlining our further lines of research.
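To give an early and concrete idea of the kind of mechanism involved, the sketch below implements a cost-weighted edit distance in Python. The operation costs and the look-alike table are illustrative assumptions only; in our work the operations and their costs derive from the observed data, as described later in the chapter.

    # A minimal sketch of a cost-weighted edit distance. The costs and the
    # look-alike table below are illustrative assumptions, not the values
    # derived from our corpus.

    def substitution_cost(a: str, b: str) -> float:
        """Cheap substitutions for characters commonly used as disguises."""
        lookalikes = {"a": "@4", "e": "3", "i": "!1", "o": "0", "s": "$5"}
        if a == b:
            return 0.0
        if b in lookalikes.get(a, "") or a in lookalikes.get(b, ""):
            return 0.2  # visually similar or masking character: low cost
        return 1.0      # unrelated characters: full cost

    def weighted_edit_distance(source: str, target: str,
                               insert_cost: float = 1.0,
                               delete_cost: float = 1.0) -> float:
        """Standard dynamic-programming edit distance with custom costs."""
        n, m = len(source), len(target)
        dist = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            dist[i][0] = i * delete_cost
        for j in range(1, m + 1):
            dist[0][j] = j * insert_cost
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                dist[i][j] = min(
                    dist[i - 1][j] + delete_cost,                          # delete
                    dist[i][j - 1] + insert_cost,                          # insert
                    dist[i - 1][j - 1] + substitution_cost(source[i - 1],  # substitute
                                                           target[j - 1]),
                )
        return dist[n][m]

    # An obfuscated candidate is closer to its canonical form than to other words.
    print(weighted_edit_distance("sh!t", "shit"))  # 0.2
    print(weighted_edit_distance("sh!t", "shot"))  # 1.0

The point of the custom costs is that an obfuscated spelling ends up much closer to its canonical form than to any other word of similar length.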

7.2 What is profanity, and why it is relevant

Profanity, swearing, cursing or taboo words — we will be using these terms interchangeably — can be regarded, in the context of this work, as a form of foul, lewd, obscene or abusive language, usually used with offensive or vulgar intentions.

It is curious to note that the names we use today to describe this vocabulary may derive from their ecclesiastic usage to describe profane speech

(profanity) or taking a sacred oath1 (swearing) [McE06, Lju10]. Edmonds also points out that “In the past, many swear words were linked to religion.” [Edm17]. Usage of certain taboo words could invoke a deity or give the person more power or credibility. The term “cursing” probably stems from a belief that some words hold magical power [Lju10, PT 94].

Profanity has always been a constant in both language and culture, evolving along with both and adapting to the needs of the times [McE06, Lov17]. This indicates that it dovetails with a certain need that can be considered typical of human nature, to the point where travel books include chapters on how to swear like a local [Mar14] so that people can better fit in.

When recalling the most common vocabulary used, the first thing we notice is that it is mainly of scatological or sexual nature (visceral). The Dutch, for instance, use diseases as their most grave swearing [Ges14]. Other forms of profanity are related to religion (deistic). This preference of subjects agrees with the taxonomy that Robert Hirsch proposed for modern English [Hir85], which states that “The Content of the swearing expressions are derived from the areas of taboo and stigma.”

7.2.1 On the Internet

Before the proliferation of the Internet, mass media was the only way to reach a large number of people; and since there was a very small number of such institutions it was possible (or at least attempted) to regulate them or to ensure a somewhat uniform code of conduct. It was this uniformity that gave rise to the famous monologue by George Carlin called “Seven Words You Can Never Say on Television” [con18b], a reflection on profanity and our uneasy relation with it that starts with the seven words he said could not be pronounced on television.

The Internet changed that, and with the so-called Web 2.0 a new paradigm appeared, as users were encouraged to share their own content, their opinions and ratings, as well as to interact with each other. In this way, the platform owners became service providers and, instead of creating new content (or in addition to doing so), took onto themselves the role of hosts or community managers. The companies cannot dissociate themselves completely from the content published by their users, and for this reason some monitoring and a code of conduct are usually present to set some boundaries. At the same time, and at the other end, other entities may wish to monitor and filter the content that is being received (such as schools and parents).

Automation is required to process such a vast amount of content and to validate it as “acceptable”, and it is this need that is driving most recent research towards the machine-driven detection of offensive content.

The automatic recognition of swear words is the oldest subset of this challenge, but it is still wanting.

The work presented in this chapter originates in a call for help from a research partner, the media company SAPO, which was part of PT Comunicações (now assimilated into Altice). They were worried about the amount of swearing that was becoming commonplace in their sports news forums, even after a censoring filter was introduced. It turns out that swearing is common in on-line platforms, not just on SAPO; and little progress is being made on the subject of profanity detection, which is a challenge much more difficult than it appears.

1 Based on passages of the Bible that condemn such actions, such as Matthew 5:34-37 and James 5:12.

7.3 Related work

The origin, dissemination, interpretation and use of profanity can be studied from numerous points of view, such as psychology, linguistics, anthropology and sociology. From the computational perspective, working with profanity is commonly associated with the identification of abusive comments in User-Generated Content, with the intent of censoring them.

But profanity is also tightly related to sentiment analysis and opinion mining tasks [CDPS09], since “certain emotional states are only adequately expressed through taboo language” [JJ07]. More specifically, profanity is most often “used to express the speaker’s frustration, anger or surprise” [Jay09], or, according to the work of Wang et al. with the microblogging platform Twitter [WCTS14], used mostly to express sadness and anger. It is also curious to notice that, while profanity is often associated with an idea of lower moral standards, it can also be positively associated with honesty [Edm17], as it conveys unfiltered feelings and sincerity.

How frequently is profanity written in on-line public postings? Many studies have tried to quantify the prevalence of profanity in public postings on the Internet, and it appears to bear some relation to the age, gender and social class of the author [The08, WCTS14, Jay09, MP03, MX03, McE06]. Perhaps it may be possible to use this information the other way around, i.e., to try to infer some profile information about the author based on profanity use, in conjunction with other stylistic, lexical and grammatical information. While this hypothesis is strongly related to our work, we did not pursue it due to the lack of reliable data.

Some relevant studies have approached the subject of swearing, both on-line and off-line. We will be using their findings to help answer some common questions regarding this matter.

7.3.1 How prevalent is swearing?

The answer to this question depends on several factors, but we will draw upon a few works for reference.

In 1992 Timothy Jay estimated that 0.7% of words spoken in American daily life were swear words [Jay92] (as cited by Wang et al. [WCTS14]). In 1994 Timothy Jay is quoted as saying that up to 3% of adult conversations at work and 13% of adult leisure conversations include swearing [PT 94]. In 2003 Mehl and Pennebaker published a study where they analysed audio recordings of conversations of 52 undergraduate students across 4 days, and estimated that 0.5% of words spoken were swear words [MP03]. In that same year Grimm wrote that in the US, 72% of men and 58% of women swear in public; 74% of people in the 18–34 age group also used such language, as did 48% of people over age 55 [Gri03]. In 2006, Subrahmanyam and Smahel observed the conversations of 583 teenagers in two chat rooms, concluding that 3% of all 12 258 utterances were obscene language, meaning one such word was used every 2 minutes [SSG06]. That same year, McEnery published an analysis of the Lancaster Corpus of Abuse (a subset of the spoken part of the British National Corpus of spoken and written English), in which 3 to 5% of all words spoken fall within the category of “Bad Language Words” [MX04].

As a comparison, first person pronouns (e.g. “I”, “us”, “our”) make up about 1% of all words spoken in a conversation [MP03], and linguists do not consider them a rare occurrence.

With this basic idea of how people use profanity in their natural surroundings, we will look at the way it is employed in four popular on-line communities: the social networks Myspace and Facebook, the microblogging platform Twitter and the social story sharing and ranking service Yahoo! Buzz.

In Myspace

In 2008 Thelwall examined the pages of 8609 North-American and 767 British users of Myspace [The08]. A list of swear words was created for each of these groups, to address cultural differences. It was observed that most teenagers had swearing on their homepages, and that about 15% of middle-aged people displayed strong swearing. Overall, 40% of users had profanity on their pages.

Figure 7.1 compares the likelihood of swearing appearing on the pages of people of each combination of gender and nationality. It is apparent that men swear more than their female compatriots, and that the British swear more than Americans. Consequently, British males are the most inclined to swear, but it is curious that British females curse as much as American males.


Figure 7.1: Proportion of the sample Myspace profiles containing swear words (mild, strong or very strong), grouped by gender and nationality.


Figure 7.2: Classification of the intensity of the swear words seen on the Myspace sample profiles.

When considering the choice of swear words, the author split the relevant vocabulary into three groups: “mild”, “strong” and “very strong”. Most swearing was considered “strong”, as shown in Figure 7.2. Moderate swearing was more common in the UK (by about 2 to 3 times), with “very strong” swearing also being a much more common occurrence among British users (2 to 5 times as much).

Thelwall’s work proceeds to focus on the “strong” and “very strong” cursing when doing word frequency analysis. The maximum percentage of such swear words in a British profile was 5%, while the maximum for an American profile was 11%. Table 7.1 shows the mean values.

In this study, the author made a considerable effort to seek out common variants of swear words (for example, suffixes like “–ed” and “–ing”), portmanteau words (which were common, with a significant number of infrequent ones), spelling mistakes and “deliberately unorthodox spellings,” a concept on which the author neglected to provide any further detail.

Country   Class     Percentage
UK        General         0.20
UK        Male            0.23
UK        Female          0.15
US        General         0.30
US        Male            0.30
US        Female          0.20

Table 7.1: Mean percentage of words considered strong or very strong cursing on the Myspace sample profiles.

In Twitter

More recently, in 2014, Wang et al. studied 14 million Twitter users and 51 million messages [WCTS14]. In this sample they found that 7.73% of the messages contained at least one curse word, and that 0.80% of all words were swear words2.

The authors pointed out that swearing was mostly associated with negative emotions. In Table 7.2 we see how often messages with swearing express the emotions sadness, anger and love, and how often messages without swearing express the same emotions. Here we can see that profanity seems to act as an expressive catalyst for the negative emotions, given the preference for these words in those situations. In addition to this, and despite the missing value for the percentage of non-swearing messages that manifested love, curse words seem to be more strongly associated with sadness and anger than with this positive emotion. Apparently “clean” messages gravitate less towards these negative feelings.

This idea is supported by the percentage of messages manifesting different emotions through swearing, which is collected in Table 7.3. Here too, the negative feelings “anger” and “sadness” appear to be the emotions more commonly expressed with cursing.

We should point out that the authors did consider alternate spellings of the words they searched for, such as “@$$”. We will provide more detail on this when we return to this corpus in Section 7.5.3.

2 Due to a programming mistake, the original work reported the value of 1.15%. This value has been corrected in an updated version of the article.

                    Sadness   Anger   Love
With swearing         21.83   16.79   6.59
Without swearing      11.31    4.50      –

Table 7.2: Percentage of tweets with and without swearing expressing certain emotions.

Emotion         Percentage
Anger                23.82
Sadness              13.93
Love                  4.16
Thankfulness          3.26
Joy                   2.50

Table 7.3: Percentage of tweets expressing different emotions containing swearing.

In Yahoo! Buzz

Sood, Antin and Churchill published several works related to the study of language and on-line communities, and raise pertinent questions regarding profanity detection systems based on lists [SAC12a]. For their analysis of swearing in a collection of 6500 messages taken from Yahoo! Buzz [SCA11] — a community-based news article website — the authors chose to employ a crowdsourcing platform called Mechanical Turk to provide the annotations for the messages. The human element provided by this solution allowed for the recognition of cursing vocabulary that could either be absent from a collection of swear words or that could have eluded the method of automatic recognition that was used. As a consequence, we trust that this operation produces a negligible number of Type II errors (false negatives), leading to a higher measurement of recall [SAC12a]. This is important to get a more accurate depiction of the pervasiveness of swearing online (assuming that the Yahoo! Buzz community is adequately representative).

The results of this human analysis [SAC12a] showed that out of 6354 messages (146 yielded insufficient consensus) 9.4% had at least one swear word. It is unfortunate that the authors did not present details at the word level. This may be a result of the difficulties of using the crowdsourcing platform for such detailed tasks, as we ourselves found Crowdflower (the most popular crowdsourcing tool) unsuitable for such fine-grained work.

Figure 7.3 shows the distribution of swearing across the different sections. Swearing was more common in political discussions and more infrequent in the sports comments. The difference between these two extremes is significant, but it also raises the question of whether any subject is resistant to swearing.

[Figure 7.3 plots the percentage of messages containing swear words for each section: World, Travel, Sports, Science, Politics, News, Lifestyle, Health, Entertainment and Business.]

Figure 7.3: Percentage of messages in each section of Yahoo! Buzz that contained swear words.

9.4% is not too different from the 7.73% of tweets containing swear words (as seen in Section 7.3.1) — the higher recall value that we expect could justify more swear words being identified. With different years, different sources, different populations and different methodologies, we can say that the results are fairly consistent, and point towards profanity having a constant and noticeable presence on platforms that take no action in moderating the language employed by the users.

In Facebook

In the 2017 study by Feldman et al. [FLKS17], 153,716 participants signed in to a Facebook application related to personality studies, granting access to their profile and messages. Of these, 73,789 fulfilled the desired criteria of using the English version of Facebook, having more than 50 status updates and more than 30 friends. 62% of the selected users were female and the average age was 25 years. The study used the Linguistic Inquiry and Word Count (LIWC)3 tool and its dictionary to calculate the rate of profanity use. The mean rate of profanity use was 0.37%, with almost 10.8% of users not having used any profanity at all.

7.3.2 Why is it difficult to censor swearing?

Part of the reason why swearing is difficult to address lies in the impossibility of compiling all the swear words in a given language. There are a number of reasons for this.

3 https://liwc.wpengine.com/

No definitive list

The first problem to overcome is the definition of what constitutes swearing and which words need to be avoided. People tend to disagree on this matter, since each person has their own sensibilities. One case in particular comes from words that used to be swear words but are usually no longer considered offensive (for example, “damn” or “butt”) [Edm17]. Not everyone agrees at the same time on the level of vulgarity that these words still hold, meaning that they may be offensive to only some people. Regardless of how comprehensive a list may be, we must also remember that new vulgar words are created all the time [TOS16].

Word variations

In addition to collecting all the basic curse words, considering all the variations of each word can be a difficult task. For example, even for the simple word “dog” we can create a significant number of variant words (even if many do not make sense without an adequate context), which we list in Table 7.4. A collection of curse words plus all their variants is something that is impossible to accomplish in practice, especially when considering portmanteaus, which are common in English [The08] (e.g., “gunt”). This situation is not exclusive to English, as other languages may have similar problems.

Table 7.4: Some words derived from “dog”.

doggy      dogging      dogged       dogful       dogless
dogable    dogist       dogal        dogious      dogish
undogging  redogged     underdog     disdogging   dogzilla
dogness    dogdom       dogmageddon  dogger       doggesque

Different meanings

We should also keep in mind that not all swear words are self-evident profanity. Seemingly benign words can be appropriated and twisted into other connotations, and through this they can be used negatively. For example, the word “gay” has changed its dominant meaning from “cheerful” to “homosexual”; “bird” acquired a slang meaning for a young woman or for an obscene gesture; “dike” can mean an embankment constructed to prevent flooding or be an offensive word for a lesbian; and a “fag” can be a student, a cigarette or a disparaging term for a gay man. Conversely, some say that taboo words, when used literally, should not be considered as swearing [Lju10]. Context and interpretation are very important when identifying swear words and offensive expressions — two things that are not easy to automate.

Cultural differences

Cultural differences should also be taken into consideration. A normal word in one language can have an offensive meaning in another language. As an anecdote, Portuguese footballer Paulo Futre caused some discomfort when he signed for a French team, since his last name was a French slang word for sexual intercourse. French sports commentators started to consistently mispronounce his surname as “Futrè” to avoid embarrassment.

However, most cultural differences are not clashes between different languages but the result of divergences within the same language. For example, “bloody” can be a swear word in England, but not in the United States; “blow me” is an expression with no sexual connotation in England, meaning to be surprised (if you blow on him, he will tip over); and “pussy” refers to cats. Similarly, the Portuguese word “veado” (stag) is used in Brazil to refer to homosexual men in an offensive way, while in Portugal it is a traditional symbol of virility. Occasionally the offensiveness of words can change even across regions within the same small country [Alm14].

The importance of context

One other problem is that some words that are often used in an offensive way may be acceptable in the right context. One example is “bitch”, which is perfectly acceptable when talking about dogs, but reproachable when applied to people [SAC12a].

Orthography

Another difficulty in recognising curse words arises from their being mainly an oral tradition. Since they are seldom observed in writing (not only in books and newspapers, but often being absent from dictionaries), the correct spelling of some less common swear words is not always obvious, and therefore they can appear in many variations — which only exacerbates the problem further. Different regions may even pronounce the words differently, making it difficult to unify them under one written form. Some misspellings are even done on purpose, for an effect of “style” or tease (for example, the word “biatch”).

The situations that we will be trying to address with our work are a subset of this problem. People can write the same word in multiple ways (some correctly, some incorrectly but still understandably). In Sections 2.4 and 3.1 we already mentioned the noisy nature of User-Generated Content, which spurs variations in the writing that can be taken as “noise” in the message. This makes words and messages overall difficult to process, especially automatically.

When the author purposefully uses an alternative orthography of a curse word, they may be doing so to “hide” it. This is the case of obfuscation, a cunning method used to circumvent profanity filters that we will be discussing in detail in the next section.

7.4 Swearing and obfuscation

Language depends on a series of symbols that convey meaning, with words being a representation of one or more ideas. Thus, when reading, we process a sequence that goes word → idea. Obfuscation adds one level of indirection, in that it references (points to or hints at) a word that is never written, existing only in the mind of the reader, but that is still an essential stepping stone to understand the message. The sequence thus becomes obfuscation → (word →) idea.

We can say that obfuscation is simply a method used by the sender to encode a word in a way that the intended receiver will (hopefully) understand, and thus avoid committing the word itself to the communication medium. But not every deviation of the word is an obfuscation, since obfuscation is a deliberate action, and intention is something that only the writer can declare.

If the author misspells a word by accident, we do not have a case of obfuscation in its strict sense. At first glance, deobfuscation (the process of resolving obfuscated writing) maintains some affinity and overlaps with the task of spellchecking (the process of correcting spelling mistakes), as they are somewhat similar text normalisation tasks. The practical difference is that deobfuscation tends to deal with a more focused vocabulary (often taboo words) and that sometimes the author imagines an adversary or an obstacle the obfuscation must overcome — someone (children, someone from outside their group) or something (an automatic filter)4. Some clever methods to circumvent automatic profanity filters will be seen later in Section 7.7.1.

We should also make a distinction between the deliberate misspelling of a word with a stylistic intent and a deliberate effort to disguise or to hide it. We use the term obfuscation to refer to the latter case, where we have a conscious change of the spelling of a word with the intent of making it less obvious or more difficult to understand.

4 This competition is akin to that played between spammers (senders of unsolicited email) and anti-spam software, where obfuscation (e.g. spelling “\/1@GrA” to avoid saying “Viagra”, which would raise a flag) escalated to the point where the important message is encoded into images, which are easier to process by humans than by machines. Spam filters now employ OCR technology to scan email attachments to handle such techniques [FPR06].

Again, the problem is that the intent of the writer is a subjective matter. From the computational point of view, meta-subjects such as author intent are irrelevant, as all deviations found will be treated in exactly the same way. But the question remains: why would someone want to make their writing harder to understand on purpose? From our personal experience we have seen several possibilities:

Pronunciation/style The writing tries to convey the way people pronounce the words, because the author thinks that is relevant, for instance, to hint at some typical form of speech. For example: “they don’t do aaanythang out o’ teh ordinary.” “When I had mah operation I was incapacitated for an ageee, liek. I was also! on lots of heavy drugs aaan’ it was super-fun! Anyhoohar, becaaause of all of this I ateded in bed a lot, which is to saaay friends ’n’ fams broughted me lots of om noms an’ I nommed ’em all up in-between zzzin’. It got borin’ after a while, tho, pluses crumbs are NO (capscapscaps) joke.” In the case of swearing a common example is “biatch”. Alternatively, the author may also wish to exaggerate the way some people write, as in “AMAZING!!!!!111” or “LOLZ”.

Attention grabbing Sometimes the author wants to call attention to one word in the text. Often this word expresses an emotion or state of mind, as in “Ohhhhh shiiit.... one more day and then it’s FRIIIIDAAAAY.”

Differentiation The author writes differently to display a sense of identity or general group association. “Pig latin” has been used for centuries as a method of communicating in the presence of other people, where only those “in the know” would understand it; this is just an updated, textual version of the same idea. A common example is the so-called “leet speak” (sometimes written as “l33t”), where numbers replace letters and upper and lower case letters alternate. “Texting”, the written communication style that emerged on cell phones, is an element in the creation of a group identity and of belonging to that group [Cad].

Criticism or pun Sometimes users change words to mock an entity or to make a pun. E.g. “Microsoft” written as “Micro$oft” is a common way to associate the company with money; “Dong-ald Trump” and “Bill-ary Clinton” were terms that appeared during the 2016 US election.

Abbreviation In order to facilitate typing, some users may employ shorter versions of common expressions or words. These can be initialisms or forms created by removing letters, usually vowels. Some examples involving swearing include “fu”, “bj”, “stfu” and “ngga”.


Acknowledging sensitivity Since many people consider a profanity written outright to be more offensive than a fig-leaf “asterisk” version of it (that is, “sh*t” appears milder than the swear word properly written), the author may wish to show their respect and abstain from employing said vocabulary outright [Edm17]. Thus, the censored version should be considered more like a concession. In these situations the words are typed differently because the offensiveness of swearing is not so much about the words but about the signals we send to each other. In the Yahoo! Buzz study [SAC12a], where no filter was in effect, 76% of the top profane words were not written correctly.

Self-censure When the author wishes to make a word inaccessible to some of the readers (for example, children). This relates in some way to the use of grawlixes, the pictograms that indicate general swearing in comic books [Wal80]. Another reason to self-censure is to provide some form of deniability, a way to be able to say later “I never said that.” Consider for example a situation where a person is describing a bad experience in their former workplace, or sharing some “inside knowledge” of products sold under a well-known brand, and wants to avoid saying its name to avoid trouble. They may thus refer to “O----e” instead of “Oracle”. Of course, this is all done “tongue-in-cheek”: they could just as well say “a certain large and well-known database software company owned by an eccentric billionaire with a dislike for Bill Gates”, and neither version would save the person from serious repercussions if any were intended. Self-censure also includes circumventing automatic filtering systems, such as those aimed at preventing the use of swearing in on-line communities. Since humans are more resilient than machines to these word alterations, such alterations are fairly common. In an analysis of a Portuguese sports forum that started using an automatic taboo word filter [LO14b], we found that 70% of cursing was being obfuscated by the users, meaning that changes were mostly made to the spelling rather than to the vocabulary.

The general idea that we are trying to convey is that, while it is common to address obfuscation (or text that can be perceived as such) only in the context of cursing, the value of deobfuscation is not restricted to recognising “shit” written in a thousand different ways. Despite this, our work on obfuscation still focuses on the issue of swearing. The reasons are simple:

• Swearing is a common event, and obfuscated swearing is also easy enough to find, thus it is easy to collect examples to work with;

• Several on-line platforms have an interest in filtering out profanity usage from the User-Generated Content they distribute, thus our work would be addressing an actual concern directly;

• The ambiguity of swear words is usually minor. Cursing is often focused on a limited lexicon and independent of the context being discussed. This is important in the annotation process;

• We have some possible elements of comparison, as we found no work dealing with non-swearing deobfuscation.

In case the reader was left wondering whether encryption constitutes obfuscation, the answer is yes: encryption is an alternative form of encoding text in a way that avoids including the “proper” text, and it is an intentional effort. It is also a very complex approach to obfuscation that we will not address in this work, as it falls outside our scope, if for nothing else because it is not a common occurrence in User-Generated Content.

7.4.1 Our objectives

We already mentioned, back in Section 7.3, how swearing may be undesirable on on-line platforms. The main focus of many projects that deal with swear words is simply to determine whether a certain message contains a swear word, or to say whether a given word is considered cursing: they are profanity detection systems, and they can be useful to better understand the authors and what they are trying to express. For example, Maynard et al. address swearing in their plans for opinion mining [MBR12], and in Section 7.3.1 other examples of work related to swearing were mentioned.

With profanity recognition or identification our goal is to extend such possibilities into the realm of obfuscated or disguised swearing, and to be able to say which swear word this is (if any). One could imagine the difference between detecting a car and identifying that particular car.

We are of the opinion that analysing messages in their unobfuscated form has advantages worth taking into account, as an enabler (or facilitator) of a number of services. The following ideas could be possible:

• It would enable the preservation (or recovery, we may say) of grammatical or semantic information that was provided by the user, like understanding that a given word is a verb or an adjective;

• Words could be considered in their context (that is, is this swearing offensive, used as an augmentative, to convey emotion, . . . );

• While all obfuscated curse words can be labelled as generic negative words (which is sometimes incorrect), it would be more valuable if a distinction could be made between words that refer to the gender of the subject, to their sexual orientation, to their ethnicity and so on;

• The message could be rewritten and the profanity replaced by another word that has a similar denotation but is less offensive, similar to the work of Xu and Zhu [XZ10] (much of its impact or meaning would be lost, but the entire message would not have to be censored);

• It could assist other systems that care for people’s well-being online, such as abuse (e.g. bullying) detection, by providing adequate pre-processing of the messages and exposing some of the vocabulary used.

The additional knowledge that is extracted by deobfuscating messages can be particularly pertinent in the current society, where so many groups are still the target of disrespectful comments and discrimination. This is a social problem that is not solvable through science or technology, but to really determine what people think (so that those concerns are addressed) we need to be able to understand the way they express themselves. Deobfuscation can improve the accuracy of automatic systems that trawl social networks to collect and interpret such impressions.

Several works have taken up the challenge of identifying swearing, and we will discuss the ones that were considered relevant to the present work.

7.5 Works related to the use of swear words

From the computational perspective, swearing has been approached infrequently and with little progress. In their analysis of the main problems of profanity detection systems [SAC12a], Sood, Antin and Churchill stated:

While many would argue that textual analysis is more tractable than visual content analysis, this may be in part because of a general misunderstanding about how difficult the problem of profanity detection is in real-world contexts. (. . . ) Because of these misunderstandings, perhaps, comparatively little research has focused on detecting inappropriate text in user-generated content systems.

We will look at three types of systems typically associated with the detection (not identification) of swear words in messages: simple list-based systems, insult or offensive message detection systems, and some studies involving a significant corpus.

7.5.1 List-based filtering systems

The problem being addressed here can be summarised very shortly: there is some vocabulary that should be avoided; the objective is to identify any messages that may contain it so that they can be processed accordingly — usually filtered out. The burden of the entire system usually falls on the lexical component — a list of swear words, for example — as the inner workings of such filters are typically quite simple. Any false negative that slips through can be remedied by adding another entry to the list.

One application of this method was to ensure more workplace correctitude, as some adjustments needed to be made when email started to become a more common tool in professional communication, with particular emphasis given to the language used in the messages sent. US patent number 5 796 948 [Coh98] describes Cohen’s invention as

(. . . ) a method for network intercepting electronic communications or mail containing profane or offensive words, word fragments, phrases, sentences, paragraphs, or any other unit of language as may be formulated in any natural language (. . . ) and as may be formulated in any artificial language. Network profanity prevented and screened by said method includes but is not limited to vulgar language; hateful, threatening and defamatory speech; derogatory labels and terms of race, religion, gender, sexual orientation; and sexually degrading, obscene, lewd, or pornographic language.

However, regardless of all the potential capabilities enunciated, this system appears to rely only on substring matching to determine what is or is not acceptable language.

Towards the end of the previous century, the Internet reached the masses, becoming accessible to young people and also to children. The youth needed to be protected from the unsuitable content that was present online, which led to the creation of tools that would check whether the website being accessed was suitable for its user (a form of Parental Control Software). Essentially, this was a lookup of the labels that had been attributed to the website by a human operator when they last visited it — a solution with an obvious scalability problem. Thus, in 1999, Jacob and his team created a tool that could perform a similar task automatically [JKR+99]: it would look at the text of the website, search for any word in a list of inappropriate vocabulary and label the website suitable or unsuitable based on what was found. This solution, of course, had some caveats that the authors were upfront about: it only worked on text, ignoring other media like images; it applied its rules in a “blind” way, and the text might not be validated correctly (offensive text may exist without offensive words, for instance). Despite these shortcomings, this solution would scale much better than a group of humans, and could complement the existing one.

List-based systems are used mostly because they are quite trivial to implement; the results they provide, however, are quite poor. Sood, Antin and Churchill also wrote about their limitations [SAC12a]:

As we have already discussed, list-based approaches perform poorly because of three primary factors: misspellings (both intentional and not), the context-specific nature of profanity, and quickly shifting systems of discourse that make it hard to maintain thorough and accurate lists.
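To make this fragility concrete, the sketch below implements the two naive matching strategies discussed above over a hypothetical two-word block list; it illustrates the general approach, not any of the cited systems in particular.

    # Minimal sketch of a list-based filter, with a hypothetical block list.
    import re

    BLOCK_LIST = {"shit", "ass"}  # illustrative entries only

    def word_match(message: str) -> bool:
        """Flag the message if any token matches a block-list entry exactly."""
        tokens = re.findall(r"[a-z]+", message.lower())
        return any(token in BLOCK_LIST for token in tokens)

    def substring_match(message: str) -> bool:
        """Flag the message if any block-list entry appears anywhere in it."""
        text = message.lower()
        return any(word in text for word in BLOCK_LIST)

    print(word_match("what a sh!t result"))       # False: obfuscation slips through
    print(substring_match("first class result"))  # True: false positive ("class" contains "ass")

Exact word matching misses any obfuscated or misspelled variant, while substring matching produces false positives on innocent words; this is precisely the trade-off that makes purely list-based filters so brittle.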

7.5.2 Detection of offensive messages

There is some affinity between detecting swear words and detecting insults or offensive messages. But offensive messages and profanity are not necessarily linked. For example, the sentence “go kill yourself” is an offensive message but contains no profanity, while the sentence “what a shitty world we live in” contains swearing but is not offensive. Still, all the offensive message detection systems we will be discussing include a lexicon of profanity to help them in their task.

List-based systems apply their rules blindly and directly, meaning that noise (such as that of obfuscation) can easily circumvent them. Let us look at some relevant work and how such problems were handled.

Back in 1997, Spertus described Smokey [Spe97], a system that detected “hostile messages”, also commonly known as “flames”. The author points out that this kind of message is not the same as “obscene expressions”, as only 12% of hostile messages contained vulgarities, and more than one third of vulgar messages were not considered abusive.

Smokey employs a series of fixed pre-defined rules, some of which make use of the profanity list. For example, the rules dealing with vulgarities are activated when one of the relevant swear words is found, but operate differently should a “villain” be mentioned in the same sentence. A villain is an entity usually disdained in a particular community (e.g. in a political discussion setting). Thus the aim of this tool is to deal with inter-user interaction, and not to deal with all offensive content.

Spertus seems not to have considered the problem of noisy text in this work, as the paper mentions removing “meaningless messages (someone randomly pressing keys)” from the collection and later, when talking about the limitations of the system, states “One flame could not be recognised because the typography was unusual: ‘G E T O V E R I T’.”

Later, in 2008, Mahmud et al. presented their approach to hostile messages and insults using semantic information [MAK08]. Each sentence is parsed into a semantic dependency tree and the system tries to determine whether it is seeing a fact being stated or an opinion. In the latter case, if an offensive word or phrase is involved, it could be an insult. The authors are upfront about the limitations of their work, stating “We didn’t yet handle any erroneous input such as misplacing of comma, unmatched punctuation marks etc. at our implemented system.”

In the year 2010 Razavi et al. developed a multi-level classification system for detecting abusive texts [RIUM10], employing statistical models and rule-based patterns. However, the authors discarded many non-alphabetic symbols from their data, preserving only “expressive characters” such as quotation marks, punctuation signs and the hyphen.

In the same year, Xu and Zhu proposed an approach to filter offensive messages that operated at the sentence level [XZ10]. Their objective was not only to identify said messages, but also to remove the offensive content while maintaining the global integrity of the sentence. Ideally the reader would not notice the editing.

Two years later, Xiang et al. tried to detect offensive tweets by employing a bootstrapping approach [XFW+12]. In their work they considered only words composed of letters and the two symbols - and ’. This, of course, leaves out much of Twitter’s “rich language”, like mentions and hashtags.

In the end, none of the work mentioned above addressed misspelled words, swearing or not.
There is still a lot to pursue at the semantic level, and the authors decided to turn their attention to those higher-level matters. Deobfuscation of messages during pre-processing could help increase the coverage of such tools without changing them.

7.5.3 Works related to the study of swear words

In Section 7.3.1 we presented a few studies addressing online profanity. We will be revisiting them for a more detailed analysis, with particular interest in the creation of their lexicons and in the methods used to recognise the swear words in the corpus.

Myspace

For the analysis of his Myspace homepage collection [The08], Thelwall compiled two lists of swear words, one for British profiles and another for American users.

The British swear word list started with the BBC’s official guide, to which Thelwall added common variations (such as different suffixes) and then expanded it by generating portmanteau words. For known swear words the author sought out and added variant spellings, and finally inserted a few swear words that were still missing. Since the portmanteau words were too numerous, Thelwall discarded those present in less than 0.1% of profiles.

For the users in the USA, the author started with the “seven dirty words” (a list of words said to be forbidden from use on broadcast television by the Federal Communications Commission), and followed a process similar to the one above.

There is no information regarding the overlap of the two lists, but for his main analysis the author considered the same short list of 6 very strong words, 7 strong words and 13 moderate words. The author presented no description of a matching algorithm, and for this reason we assume that the swear words were detected by simple word comparison.

Yahoo! Buzz

Sood, Antin and Churchill reused the data they had annotated (through crowdsourcing) for a project on the detection of personal insults [SCA11] in several subsequent works on swearing. They mention some problems often associated with list-based recognition systems, namely that they are easy to work around (through obfuscation), they are difficult to adapt (they cannot deal with abbreviations or mistakes), and it is difficult to tailor them to different communities [SAC12a].

Using their 6354 messages and two lists of swear words available online (compiled for the purpose of filtering profanity), the authors showed that direct swear word matching (traditional string search) performed poorly. When searching for messages containing profanity, their best result (taking F1 as the relevant metric) was 0.53 Precision, 0.40 Recall and 0.43 F1. This result was obtained using just one of the profanity lists and employing word stemming.

Sood, Antin and Churchill’s data was annotated at the message level (that is, the annotators simply stated whether the message contained profanity), meaning that it was not trivial to isolate the swear words. The authors tried to identify the words most likely to be responsible for the messages having been labelled as containing profanity, and noticed that only 8 of those 33 words were spelled correctly, stating:

The remaining nineteen terms are disguised or author censored profanity (e.g., ‘bullsh!t,’ ‘azz,’ ‘f*****’). Thus, a list-based profanity detection system, such as the ones evaluated in the previous section, would fail to catch twenty-five of the top thirty-three profane terms (76%) used in our data set. While these words could, of course, be added to a profanity list for future detection via a list-based system, there are countless further ways to disguise or censor words. This makes representing them in a list a significant challenge.

Table 7.5: Performance metrics for detecting messages containing profanity in the Yahoo! Buzz corpus.

Test                 Precision   Recall     F1
SVM                       0.84     0.42   0.55
Levenshtein               0.55     0.42   0.48
SVM + Levenshtein         0.62     0.64   0.63


Regarding obfuscation, half of the swear word candidates contained non-alphabetic characters used for disguise. As an illustration, the authors state that 40% of the occurrences of “@” in this collection were used in obfuscation, while the rest were part of email addresses, user mentions and other “proper” uses.

Another work by this team proposes two methods for recognising profanity [SAC12b]. The first one uses the Levenshtein edit distance [Lev66] to compare the words in the text with the words in the profanity list. This comparison is done only on words not in the profanity list and not in an English dictionary. A word is considered a swear word when the distance — the number of edits (additions, substitutions or deletions) — between the words is equal to the number of punctuation marks in the word, or when the number of edits is below some threshold that depends on the length of the word.

The second solution uses Support Vector Machines (SVM) with a “bag of words” approach, using bigrams with binary values representing the presence or absence of these bigrams in the message. A linear kernel is used for this task. The most relevant performance metrics for detecting messages with cursing, based on F1 score, are presented in Table 7.5.
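The sketch below reproduces the logic of the first method as we read it from the description above. The profanity list, the dictionary and the length-dependent threshold are small hypothetical stand-ins, since the exact resources and threshold values are not reproduced here.

    # Rough sketch of the edit-distance rule described above; the word lists
    # and the length-dependent threshold are hypothetical stand-ins.
    import string

    PROFANITY = {"shit", "ass"}               # illustrative lexicon
    DICTIONARY = {"this", "is", "a", "ship"}  # illustrative English dictionary

    def edit_distance(a: str, b: str) -> int:
        """Plain Levenshtein distance (insertions, deletions, substitutions)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def looks_like_profanity(word: str) -> bool:
        word = word.lower()
        if word in PROFANITY:
            return True
        if word in DICTIONARY:   # known clean words are never compared
            return False
        punctuation = sum(ch in string.punctuation for ch in word)
        threshold = max(1, len(word) // 4)  # assumed length-dependent threshold
        best = min(edit_distance(word, curse) for curse in PROFANITY)
        return best == punctuation or best <= threshold

    print(looks_like_profanity("sh!t"))  # True: one edit, one punctuation mark
    print(looks_like_profanity("ship"))  # False: in the dictionary, so skipped

The dictionary check matters: “ship” is only one edit away from “shit”, and would otherwise be flagged as a false positive.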

Twitter

Wang et al. created a much more elaborate lexicon in their work analysing swearing in the world’s largest microblogging platform [WCTS14]. They began by combining several swear word lists available online. From these words the authors kept only the ones that were in English and were used offensively. This was a manual chore, and resulted in 788 words, including graphic variations employed for obfuscation purposes.

The authors also went further in the recognition of obfuscated swearing. They devised the following algorithm: every word in the message that is not recognised by the swearing lexicon goes through a normalisation process. This process consists of removing repeated letters and then replacing numeric digits and symbols with letters that share some resemblance with them. Finally they calculate the Levenshtein edit distance between the normalised candidate word and each swear word in the lexicon, considering only the lowest. They accept a pair as a match if the number of edits is equal to (or lower than, we assume) the number of masking symbols (i.e. noise introduced to disguise the word). Only the following seven symbols were considered as masking symbols: _ % - . # \ ’. Two questions were left unanswered: would “####” be considered a swear word, and why was the asterisk, possibly the most common masking symbol, not included in the list?

The evaluation was performed on a set of 1000 randomly selected tweets, and achieved a 0.99 precision score, a 0.72 recall score and a 0.83 F1 score.
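The following sketch illustrates the matching procedure just described. The lexicon and the look-alike mapping are small illustrative samples, not the 788-word list or the exact mapping used by Wang et al.

    # Rough sketch of the described matching procedure; lexicon and mappings
    # are illustrative samples only.
    import re

    LEXICON = {"shit", "bitch"}        # illustrative sample
    MASKING_SYMBOLS = set("_%-.#\\'")  # the seven masking symbols listed
    LOOKALIKES = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "$": "s", "@": "a"}

    def normalise(word: str) -> str:
        """Collapse runs of repeated characters, then map digits/symbols to letters."""
        word = re.sub(r"(.)\1+", r"\1", word.lower())
        return "".join(LOOKALIKES.get(ch, ch) for ch in word)

    def edit_distance(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = curr
        return prev[-1]

    def matches_lexicon(word: str) -> bool:
        if word.lower() in LEXICON:
            return True
        masked = sum(ch in MASKING_SYMBOLS for ch in word)
        candidate = normalise(word)
        best = min(edit_distance(candidate, curse) for curse in LEXICON)
        return best <= masked

    print(matches_lexicon("sh.t"))     # True: one edit, one masking symbol
    print(matches_lexicon("b1tchhh"))  # True: normalisation recovers the word
    print(matches_lexicon("sh*t"))     # False: "*" is not a masking symbol

The last example mirrors the question raised above: because the asterisk is not among the masking symbols, “sh*t” fails to match even though it is only one edit away from the lexicon entry.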

SAPO Desporto

We consider some of our own work relevant enough to include in this section, as we also created a corpus to study the use of profanity. Although a more thorough description of this corpus is given in Section 7.6.1, a short summary follows for the sake of comparison with the works previously mentioned.

The SAPO Desporto corpus is composed of a sample of messages taken from the leading Portuguese sports news website, which hosts an active public discussion platform for its users. 2500 randomly selected messages were manually annotated, showing that one in five messages contained one or more curse words [LO14b]. Swearing was not uniform, but it averaged one bad word every 3.2 messages, or 1% of all words written.

2500 messages does not make for a large corpus, and it is furthermore dwarfed by some of the other bodies of work that we mentioned. So what makes it special?

The main distinction lies in the hand annotation, of much finer granularity than any of the other entries. For comparison, the work of Sood, Antin and Churchill [SAC12a] on the Yahoo! Buzz corpus, which was also verified by humans, had an annotation created at the message level, i.e. “does this message contain any profanity?”. Consequently, identifying the profane words was a difficult problem [SAC12b]. Wang et al. provided metrics for a sub-sample of only 1000 tweets that were hand labelled [WCTS14], and Thelwall did no evaluation of his profanity detection method on Myspace pages [The08]. As such, for practical evaluation work, our corpus does not compare badly in terms of combined size and detail.

Having all profane words annotated provides us with many benefits, the most obvious of which is the possibility of more accurate measurement of the performance of our profanity recognition method, allowing the pinpointing of its shortcomings. But it also allows us to dissociate the lexicon from the recognition process: we know that we can produce a dictionary containing all relevant profanities, and consequently any failure in recognising a word is never due to a failure in the lexicon but to a problem in the recognition process. Finally, context is not unknown during the evaluation, and we can know, for example, that “bitch” is referring to a female dog and that “go duck yourself!” is actually cursing despite containing only upstanding dictionary words.

The final contrasting factor of this corpus is that, unlike the other corpora mentioned (or any other we could find that is focused on profanity5), these messages are available for use in other studies under a liberal and free license. We believe that this may contribute to further improvements in the field of profanity detection.

When each research group needs to compile and annotate their own corpus it is difficult to reproduce or compare results, as variations can be introduced in the messages being used, in the lexicon or in the recognition method. By providing this annotated corpus we can work with a fixed and controlled corpus and lexicon, isolating (and thus being able to compare under the same conditions) the recognition methods.

As a closing remark, we are not oblivious to the fact that using the Portuguese language is detrimental to the “universal” adoption of this corpus. We feel that Portuguese is underrepresented in linguistic works — especially European Portuguese — given its position among the most used languages in the world.
Since swearing is a problem that does not exist only in English, and since we were able to ensure that the results could be redistributed, we decided to contribute in this small way to our native tongue. In the next section we will take a more technical look at this corpus, the annotation and the profanity it contains.

7.6 The SAPO Desporto corpus

The present work on obfuscation derives from a practical necessity of SAPO, a research partner that is part of a large media and communications company in Portugal. SAPO runs a sports news website called SAPO Desporto6, but the swearing in the users’ comment sections was out of control, meaning that the filters installed were ineffective. The present work originated from the need for a better solution.

Unable to find a dataset that could be used to assist us, we created our own.

5 A possible exception could be the Lancaster Corpus of Abuse [McE06], which is a subset of the spoken British National Corpus. However, we were not able to find this work available for use.
6 http://desporto.sapo.pt/

This was a laborious process, but it allowed us to set our own goals. They were as follows:

i. the messages should relate to a swearing-prone subject;

ii. the dataset needs to be of adequate size;

iii. the annotation needs to be made at the word level; and

iv. the dataset should be distributable, to be useful to others.

SAPO made available to us a sizeable sample of their user comments database and also granted the right to redistribute an annotated corpus derived from it. To our knowledge, this represents the first dataset dedicated to the study of profanity to be freely available on-line7, complete with detailed support files.

In the following sections we will elaborate on the details regarding the nature of the original data, how it was annotated, and what we could directly observe from it. We dedicate the largest subsection to the study of obfuscation methods and how people use them.

7.6.1 Description of the dataset

Our work dataset was based on text messages published on SAPO Desporto, a sports news website with a strong emphasis on Portuguese soccer. Soccer is known to be important for the social identity of many people in several European countries [CHFT07], including Portugal, where it can drive strong emotions in a significant portion of the population. We were told by SAPO that sports was clearly the subject that sees the most swearing. This is in stark contrast with the Yahoo! Buzz study, which was based on data mainly from the USA [SCA11], where Sports was the category with the least swearing [SAC12a] (as shown in Figure 7.3).

We obtained a total of 771 241 messages posted by 218 870 accounts, covering the period from 2011-04-27 to 2013-02-20. From these, we randomly selected 2500 messages, written in Portuguese, to be manually classified.

At this point we should point out that the users of the SAPO Desporto website were not entirely free to post everything they wanted. SAPO used a simple filtering system, based on a small list of “words” that were forbidden in the comments, and rejected any messages that contained them. The users were then left with a decision: they could either not use those taboo words, or they could attempt to circumvent the filtering system in some way. From what we could see in the messages, many took this filter as a challenge, with some users even going to some lengths to show how cleverly they could bypass it. In the end, filtering did not end cursing — it just pushed it into disguise. As a consequence, this data became all the more suitable for the study of obfuscation.

7http://labs.sapo.pt/2014/05/obfuscation-dataset/ 122 CHAPTER 7. AN ANALYSIS OF OBFUSCATION lengths to show how cleverly they could bypass it. In the end, filtering did not end cursing — it just pushed it into disguise. As a consequence, this data became more suitable for the study of obfuscation. The list created by SAPO contained 252 distinct entries (253 in total), including the most common profanities in the Portuguese language. We aggregated its contents into the following eight groups:

• 44 curse words,

• 85 curse word variants (plural, diminutive, etc.),

• 30 curse words written in a graphical variant (e.g. no diacritical sign, “k” replacing “c”, . . . ),

• 41 curse word variants written with graphical variants (e.g. “put@s”),

• 10 foreign curse words and variants,

• 12 non-curse words (e.g. “moranga”),

• 16 words with no obvious meaning, and

• 12 URL-like “words” (e.g. “mooo.com”, “moranga.net”, “olharestv” [sic]).

Looking at the decomposition of this data file we came to a few conclusions. First, it appears that word variants were seen as an important aspect of profanity prevention, both lexically-sanctioned variants and other sorts of variants, like abbreviations and soundalikes. The other conclusion is that the profanity filter expanded into a general-purpose filter, dealing with what appears to be "spam", given that almost 16% of the entries (the last three groups, totalling 40 entries) were not related to profanity or unwanted topics. In the next section we will describe the annotation process, and then, in Section 7.6.3, the resulting lexicon will be discussed, where we will also evaluate the adequacy of SAPO's filtering choices.

7.6.2 The annotation

Three people worked on the annotation process. Two split the entire corpus among themselves and worked on it individually, while the third annotator coordinated the result and ensured the uniformity of the process. This arrangement precluded any inter-annotator agreement value from being calculated.

In the absence of a clear definition of what constitutes profanity, each annotator used their personal judgement. We used the union of all curse words identified as the final profanity list; thus, if one annotator considered a word as being profanity, we treated it as such in the entire corpus. This task was performed by the third annotator, who also ensured that the annotation was consistent regarding the spelling of swear words.

We will take this opportunity to clarify the meaning of some of the terms we will be using. We use word in its usual sense, but also interchangeably with token, since we are working on the texts at the word level. All instances of profanity are single words that are considered taboo. In instances where offensive expressions are used, such as in "son of a bitch", we will identify the offensive word ("bitch" in this case) and ignore the rest of the expression.

Most of the time we deal with the variety of profanity used. When we say that we found N curse words it means that we observed a total of N different curse words, regardless of how many times each one was repeated. This gives us a notion of the diversity of swearing. We use the term profanity instances to report the absolute frequency of profanity, that is, the number of times we see each swear word, to provide an idea of the popularity of that particular curse word or of cursing in general. We also use variants to refer to alternative spellings of words. Variants usually refer to the many different ways we see a given swear word being obfuscated (e. g. sh!t, 5hit, sh*t), but canonical words (that is, the correct spellings of words) also carry their own sort of variants (e. g. shitty, shitless); this was discussed in Section 7.3.2.

The annotation is provided next to the messages in tab-separated values, where the first column contains the annotation inside brackets⁸. An example of the annotation follows.

[fudessssses=fodesses,du@ta=puta,ca@brone=cabrão,badalhoca]

In the above example, the last profanity is written correctly, while the first three are annotated in the form "sic=correct". The correspondence in this example is represented in Table 7.6. The original message follows the annotation unchanged.

In the 2500 messages that were manually annotated we identified 560 messages with profanities (22.4% of the messages). At the word level we counted 783 individual instances of profanity use (we disregarded repeated occurrences in the same message, when written in the same way). Of these, 544 (69.5%) were obfuscated⁹.

In short, 1 in every 5 messages contained profanity (which is not to claim that the remaining messages were free of offensive remarks), and if a profanity is found, there is a 30% chance that at least another one is present.

⁸ The brackets are meant to add redundancy and help prevent editing mistakes in messages without profanity. In those situations the line would begin with white space that could easily be removed by accident.
⁹ The dataset was revised since the publication of our prior work, which explains the minor deviation in values.

Table 7.6: Example classification between words seen in a message and their correct spelling.

Word as seen in the message    Word with canonical spelling
fudessssses                    fodesses
du@ta                          puta
ca@brone                       cabrão
badalhoca                      badalhoca

Table 7.7: Absolute frequency of profanities found on the 2500 messages.

messages containing no profanity          1940
messages containing profanity              560
  messages with 1 swear word               404
  messages with 2 swear words              108
  messages with 3 swear words               31
  messages with 4 swear words               15
  messages with 5 swear words                2
  messages with 6 or more swear words        0

Table 7.7 shows the profanity distribution, where we can see that it decreases in a near-logarithmic manner.

7.6.3 The lexicon

Like regular words, each swear word conveys a meaning, and swapping one profanity for another may change the meaning of the overall message. In order to provide a better "higher level" view of the vocabulary, we present in Table B.2 all the swear words that were identified in this corpus, arranged into 40 groups based on their base word. For clarification, this is a fine grouping where several of the groups share the same (or very similar) meaning, and not a coarse grouping based on meaning (in which case we would see significantly fewer groups). The numbers before the words refer to the number of different obfuscations seen for that word, which will be the subject of the next section. The Instances column shows the number of times that words in that group have been used in the corpus.

The same table shows how the Portuguese language facilitates the creation of word variations. Many of these variations are due to grammar or minor semantic nuances, some are attributable to culture (as in regional words or versions of the words), and others may be used only for obfuscation purposes (for example, when the more common version of the word was being filtered). In general, these variations confound simple matching algorithms that rely heavily on a dictionary to recognise words. Unsurprisingly, the most popular words also show the greatest amount of variety, as Figure 7.4 makes evident.


Figure 7.4: Relation between the usage of each of the 40 curse groups (instances of all variations, plotted on a log scale) and the number of variants that each group has. Some jitter applied.

Another observation that can be made by readers who understand Portuguese is that most of the profanity present in the SAPO Desporto messages (as shown in Table B.2) is of a scatological nature or deals with sexual matters. We also looked at the absolute frequency of each curse word in our corpus, and plotted the 10 most common in Figure 7.5, where once again we can see user preferences. Put together, these profanities are responsible for nearly 59% of all the swearing observed. On this list we can find references to faeces, anatomical parts, prostitution and allusions to horns (a cultural reference to a cuckold). Words related to copulation are also very common, but they are spread across a large number of taboo words and variations and are therefore not included in this short list. The less common terms adhere to similar themes which, in general, align with the overall trends of swearing that were presented back in Section 7.2.

Before we delve into the obfuscation matters, we should also take a look at the filter that SAPO was employing. The profanity detection method employed by SAPO consisted of trivial case-insensitive string matching. Since we could find a few of the filtered words in the messages we annotated, written with the same spelling, we believe that we sampled some messages that predated their addition to the list. Fortunately, the number of such occurrences was negligible (22 instances across 6 swear words).

Figure 7.5: The ten most common swear words (in their canonical form) seen on the 2500 messages and how many times each was used (considering all spellings). The words shown are corno, cornudo, caralho, encornado, bosta, cabrão, encornados, puta, cu and merda.

It is difficult to accurately determine whether the SAPO filter managed to reduce the usage of some profanity, but we are able to see which curse words remained popular despite being targeted by the filter.

Of the 109 profanities identified in our sample, 31 were present in the filter list provided by SAPO, which represents a coverage of about 28%. These words are represented in bold in Table B.2. Looking at the 10 most common swear words (shown in Figure 7.5), the filter includes 6 of them (surprisingly, "cu" [ass] was absent, but several variations of it were included, which illustrates some problems with an "add as you go" approach to list creation). These 6 taboo words would account for 382 swearing instances, or almost half of all the swearing in the corpus, provided the filter could recognise them.

In the next section we will look closer at the methods that users employed to disguise their swearing and circumvent the filter put in place by SAPO.

7.7 Obfuscation

We have already seen how the canonical variations provide for a very large vocabulary. In this section we will take a look at the obfuscation variations, that is, the different non-standard ways in which users wrote each of the swear words they used. While the number of uses of each profanity shows great concentration, the number of different ways in which we see words written scatters those words into many different forms. In Figures 7.6 and 7.7 we can see that many taboo words are seen only once written in a particular form. Only a handful of profane words are seen regularly with consistent orthography.


Figure 7.6: How many words are seen spelled in a certain way. Many swear words (258 to be precise) are never repeated in our sample. Some are seen twice or thrice. However, a few words are seen with the same orthography over and over. We counted eleven words that are seen more than 10 times.

In Table B.2 we precede each taboo word with the number of different spellings employed by the authors, which may or may not include its correct canonical spelling (as some words are never seen written in their canonical form, even words ignored by the SAPO filter¹⁰). It is apparent that the words that are targeted by the filter are also the words that display the greatest spelling diversity. The few exceptions are words that one could expect to be filtered as being very similar to other censored words (such as "foda-se" and "cornos"), words that are of more regional usage ("morcão" and "morcões") and, finally, as we have already mentioned, the word "cu" (ass), which is ignored by the filter while "cus" (its plural), as well as a number of other variants, are in the list, which is something we are unable to justify.

It is also possible to see the filter driving the creation of alternative spellings in Figure 7.8. Here the vertical axis indicates how common the different swear words are, and we can perceive a large cluster of infrequently used words at the bottom, with more common words increasingly spread out at the top of the graph. This arrangement is unsurprising, since Zipf's law [Zip35] is widely known to apply to language, and to profanity in particular [Pia14]. The horizontal axis of the graph shows the number of different spellings we found for each profanity, where we notice that the more modified words are in the filter list (represented by filled circles).

¹⁰ As an example, "kornudos" is seen 8 times, but its correct spelling, "cornudos", is not in our sample, nor is it included in the filter list provided by SAPO.


Figure 7.7: A more detailed version of Figure 7.6, containing only the words seen more than once.

In Figure 7.8 we can also observe a strong correlation between the popularity of a swear word and the number of different spellings it has. Each variation tends to be seen twice in our sample, with the noticeable exception of about 5 words that deviate strongly from this trend by having a significantly lower number of variations. Four of these words are absent from SAPO's filter, supporting the idea that the users felt less incentive to create alternate spellings for them¹¹. The remaining word ("cornudo") is an outlier due to a strong bias towards one particular obfuscation method ("kornudo") that appears to be exceedingly popular.

But what makes an obfuscation popular? Are there any preferences in the methods used to create them? We will take a close look at the methods used by the users in our sample.

7.7.1 Obfuscation methods

Through manual accounting we were able to identify a total of 18 different ways in which the words we found deviated from their canonical spelling. These are described in Table B.1, with the symbols we assigned to represent them and a short description. We considered all deviations (such as misspellings) as if they were intentional.

¹¹ One may wonder why there are even any variations in spelling for these words if they are not being targeted by the filter. One possibility is that some users may not know which words are being filtered and obfuscate them anyway. For our discussion about the reasons why text may appear to be obfuscated, please refer back to Section 7.4.


Figure 7.8: Number of uses and number of variants of each swear word, plotted with jittering. The presence or absence of each swear word in the SAPO filter is represented by the shape used in the plot.

These operations were selected to provide a descriptive view, rather than to create the smallest set of operations that could transform a word from its canonical representation. It is also possible to obtain the same result in different ways using the methods we described. For example, Letter Substitution can be said to be the equivalent of doing Character removal and Letter addition. But doing so would not convey the purpose of the author in the same way. Furthermore, Letter Substitution carries the information that we are operating on the same position in the word. We reserve the combination of the two operations to describe situations where the operations are independent.

Since terms such as "punctuation" and "symbol" may be slightly ambiguous, Table 7.8 defines the Unicode classes used to define them in a precise way. We did create an exception, though, and placed the symbol @ under "symbol" and not under "punctuation" as the Unicode classification dictates. This alteration was introduced because we consider that it better reflects the way this character is being used.

Table 7.8: General Categories from the Unicode Character Properties [Uni15] that are used to define the characters observed.

Our set       Major class   Subclass
punctuation   Punctuation   connector, dash, open, close, initial quote, final quote, other
symbol        Symbol        math, currency, modifier, other
space         Separator     space, line, paragraph
space         Other         control
number        Number        decimal digit, letter, other
letter        Letter        uppercase, lowercase, titlecase, modifier, other
diacritic     Mark          nonspacing, spacing combining, enclosing
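The grouping in Table 7.8 can be expressed directly in terms of Unicode General Categories. The following is a minimal sketch assuming Python's standard unicodedata module; the function name and the explicit exception for "@" mirror the description above, but the code is illustrative rather than part of the original annotation tooling.

```python
import unicodedata

# Major-class prefix of the Unicode General Category -> set name used in Table 7.8.
PREFIX_TO_SET = {
    "P": "punctuation",   # connector, dash, open, close, quotes, other
    "S": "symbol",        # math, currency, modifier, other
    "Z": "space",         # space, line and paragraph separators
    "N": "number",        # decimal digit, letter, other
    "L": "letter",        # uppercase, lowercase, titlecase, modifier, other
    "M": "diacritic",     # nonspacing, spacing combining, enclosing marks
}

def character_set(ch: str) -> str:
    """Map one character to the character set used in the annotation."""
    if ch == "@":                        # exception described in the text:
        return "symbol"                  # treated as a symbol, not punctuation
    category = unicodedata.category(ch)  # e.g. "Lu", "Po", "Zs", "Cc"
    if category == "Cc":                 # control characters join the space set
        return "space"
    return PREFIX_TO_SET.get(category[0], "other")

# character_set("k") == "letter"; character_set("3") == "number"; character_set("@") == "symbol"
```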

Despite the oral tradition of swearing, it is apparent that people make an effort to use the written medium to its potential. That is to say that most obfuscation methods are concerned with the way words look, and often create words with no defined pronunciation in the process. With the methods defined, we will proceed to see how they were used in our dataset in a detailed analysis.

7.7.2 Obfuscation method analysis

We should start by saying that our main goal in this study concerns obfuscation method choice. The number of times each method is used on each word strongly depends on word length, and provides little insight on how to reverse it; furthermore, some methods are more prone to overuse, which comes as no surprise: for example, there is no limit to the number of symbols a user can add to a word (e. g. Repetition and Punctuation added), while the number of diacritics to modify is severely limited. Thus, we do not tally reused methods nor differentiate between a modest and an exaggerated usage of methods; it all counts as just one.

We divided our obfuscation method analysis into two types: uses that maintain word length and uses that alter word length. We did this because obfuscations that keep the length of a word (and thus its general shape) intact can be much easier to decode and may employ only a subset of the obfuscation methods we described in Table B.1 (found in the Annex). Thus, the strategies used can be different.

Our analysis will focus on both types of obfuscation separately (those that maintain the length of the word and those that do not). Their distribution, shown in Table 7.9, makes evident that the extra methods available to the second type result in obfuscations that are reused less.

Table 7.9: Distribution of obfuscations regarding modification of word length.

Type                              Distinct Words   Instances
Non-obfuscated                    52               239
Obfuscation with same size        89               264
Obfuscation with different size   200              280
Total                             341              783

For each method we evaluate how many methods were combined to obfuscate the word. While we cannot claim that this is a direct indicator of how complex the obfuscation is (since some of the methods can safely be considered outside the scope of machine resolution for the time being), the number of methods used can provide an overall estimation of accessibility.

Overall, and as supported by Figures 7.8 and 7.6, with few exceptions, obfuscation reuse is not frequent, and many obfuscations are even unique. While this may, in part, be due to the size of our sample, the number of obfuscation possibilities is large enough to allow this to happen in a much larger sample. We would like to reiterate that the dataset we distribute is supplemented with the detailed information that we compile and present here.

7.7.3 Preserving word length

As we have just said, it is usually simpler to establish the connection between two similar words of the same size than between two words of differing sizes. It also helps that the author is restricted to milder types of transformations, keeping the overall "shape" of the original word. For example, "sh*t" or "f..k".

In our annotated corpus we found 264 words obfuscated while still matching their original length, which represents nearly half of the 544 instances of obfuscation. It is curious to note that these words derive from only 89 obfuscations, which points to a noticeable reuse of obfuscations. This reuse is particularly relevant since 80% of the obfuscations we saw in our corpus were used only once. We do not know if it is a consequence of fewer obfuscation methods being available or if new standards are emerging. In Table 7.10, which contains the 10 most used obfuscations, only two entries alter the length of the word.

The simpler cases we observed were obtained through the use of a single obfuscation method. Their distribution is described in Table 7.11. At first glance it is apparent that letter substitution is, by far, the most common choice, which is easy to justify by taking a look at the list of curse words we found, in Table B.2. There we can see that many of the swear words we found begin with the letter "c" (about one third of the curse words observed), and these include the most popular taboo words (as Table 7.10 confirms).

Table 7.10: The ten most popular obfuscations seen in our corpus.

Canonical profanity   Obfuscated as   Using     Number of times
cornudo               kornudo         =L        29
cu                    cù              +Ac       22
cabrão                kabrao          =L −Ac    21
corno                 korno           =L        20
puta                  putta           R         16
cornos                kornos          =L        14
merda                 m3rda           =N        13
morcões               murcões         =L        9
cu                    cú              +Ac       9
bosta                 slbosta         Ag        8

Replacing "c" with "k" does not alter the pronunciation of the word, and since "k" is unused in the Portuguese language¹², there is almost no ambiguity possible. For this reason, replacing "c" with "k" is the most common letter substitution observed in our corpus (present in 15 obfuscations), followed by the replacement of "o" by "u", always in places where these two letters are pronounced the same (seen in 10 obfuscations in our corpus).

While authors seem to make an effort to preserve the pronunciation of the word when replacing one letter with another, when they use digits or other symbols to replace letters they try to maintain the graphical representation of the word. For example, the situations where "0" replaces "o" (=N) and "@" replaces "a" (=S) were responsible for half of all the instances counted in their respective categories.

Finally, changes to the accents in words represent easy and simple ways to obfuscate several words. Since in very informal situations people can neglect to use diacritical marks correctly, and there are few situations where the interpretation can become ambiguous, reading text obfuscated in this way represents little challenge for a human, as they can usually resolve any possible ambiguity. =N and =S also appear together in several obfuscations.

When combining two obfuscation methods we once again see that letter substitution is the preferred method, as displayed in Table 7.12. We already saw an example of the combination =L −Ac in Table 7.10, the only combination of methods that is used in more than one obfuscation. But aside from this particular word, we must say that combining methods is unusual.

¹² The letter k was removed from the Portuguese alphabet in 1911 and re-introduced in 1990, meaning that only new and original Portuguese words could make use of it.

Table 7.11: Count of obfuscations and obfuscation instances resulting from the use of a single obfuscation method that preserved the word length.

Method   Obfuscations   Instances
+Ac      8              38
−Ac      3              3
=L       19             111
=N       19             39
=P       2              3
Ph       1              2
Pun      6              9
=S       20             24
=Sp      1              1

Table 7.12: Count of obfuscation instances using two methods resulting in an obfuscation of the same length. Rows list the counts for each method when combined with one of +Ac, −Ac, =Ac, =L, =N or Ph.

=L: 22, 1     =N: 1     Ph: 1     Pun: 1     R: 1     =S: 1, 1, 5

7.7.4 Altering word length

In addition to the operations that we have just seen, authors may also use obfuscation methods that introduce new characters or remove them (or, in more drastic situations, go beyond simple edits). We counted 280 instances of obfuscation that resulted in a word of different length, distributed over 200 obfuscations (Table 7.9), which points towards greater diversity than what we saw in the previous section. Of these, 224 instances did so using only one obfuscation method (corresponding to 147 obfuscations), which we break down in Table 7.13. Only about 5% of these obfuscations result in shorter words.

If no other method is used, repetition seems to be the preferred choice, possibly because it makes the word stand out through exaggeration (we saw a 53-letter repetition, for example) while making the modification more noticeable: the more obvious the change, the easier it is to account for, while also ensuring that the change was done on purpose. Repetition leads to words only 1 character longer on the median but, due to the exaggerations that we found in the corpus, we calculated that on average 5.3 extra characters were added.

One other obfuscation method worthy of note, which is also used to both hide and exaggerate the word, is the addition of characters that are easy to ignore at regular intervals. Good candidates are spaces (+Sp) and punctuation signs (+P); for example, "w o r d", "w.o.r.d" and "w-o-r-d" are all easy to read correctly. Many obfuscations through spaces and punctuation fall into this pattern, but we also saw these obfuscation methods used in a less regular fashion, as in "m erda" and "me-r.da". Such obfuscations were quite popular, but saw very little reuse, unlike the repetition method.

Puns and Aggregation of words are very cultural and language dependent, as well as computationally complex. For these reasons they will not be approached more thoroughly in our work, despite being somewhat common.

When we consider the situations in which the author uses more than one obfuscation method, the result is quite different. We can see the distribution of the 52 instances (spread over 49 obfuscations) in Table 7.14. Despite the lack of a strongly predominant obfuscation method, the use of a symbol as a letter (=S) and letter repetition (R) are seen more frequently than the other methods, even if they are not combined often.

We also accounted for the rare concurrent use of three obfuscation methods. We saw the combination =N Ph R three times (e.g., "f000daseeeeeee" meaning "fodase"), while −Ac =L +Sp were seen together once ("ka brao" written instead of "cabrão").

Table 7.13: Absolute frequency of occurrences of obfuscation using only one method that altered word length.

Operation   Obfuscations   Instances
Ag          15             29
−C          4              8
Cp          4              4
+L          6              6
+P          40             40
Ph          3              4
Pun         15             26
R           38             83
+S          3              3
+Sp         19             21

Table 7.14: Occurrences of obfuscation combining two obfuscation methods and altering word length. Rows list the counts for each method when combined with one of −Ac, Ag, −C, Cp, +L, =L, +N, =N, +P, =P, Pun, R or =S.

Cp: 1     +L: 2, 1     Ph: 1, 2     R: 3, 1, 3, 4, 3     +S: 1, 1, 1, 2, 1, 1     =S: 1, 1, 1, 1, 6, 1, 2     +Sp: 1, 1, 1, 1, 1, 6

7.8 What we have learned so far

We have talked about the nature of swearing and the benefits of providing methods to recognise and process such language. We have also explained that many scientific works addressing profanity or abusive language ignored most or all cases of obfuscation because they considered them too challenging or insignificant. The numbers we obtained say that obfuscated cursing is not negligible.

Our analysis of the most relevant studies revealed two unfulfilled needs which we addressed as scientific contributions. The first is a corpus dedicated to the study of profanity and obfuscation. This resource is available for use by other researchers, in the hope that the field will benefit from it. We provide detailed annotation of the obfuscated words, their canonical form, the methods that were used for obfuscation and also the relations between the curse words. In addition, using the annotated corpus allows for the accounting of false negative classification errors, and the performance of different classifiers can be compared when processing the same data.

With this work we were able to identify the most common swear words used in a sample of a popular Portuguese forum, where the native language was used. Many of those words were in the "forbidden list", disguised, showing that the introduction of filters had little to no significant impact on the vocabulary of the users, as many circumvented the censoring through obfuscation.

From our effort we could outline a set of relevant obfuscation techniques that a profanity identification system should address in order to find dissimulated profanity. The five most common of these are:

• replacing letters with other letters that sound the same;

• replacing letters with numbers or symbols that look like those letters;

• modifying diacritics;

• repeating letters; and

• using separation characters (usually punctuation symbols or spaces) as letter separators.

The first three of the above methods do not alter the length of the original word, while the last two increase it. Obfuscations that reduce word length are uncommon, and most often result from the removal of vowels. Also, not using accents when writing is a frequent occurrence (sometimes even unintentional) in languages that use them (in our case, a Romance language). But even in languages that do not use diacritics, adding one can allow a user to bypass a simple profanity detection filter and still be easily understood (e.g. "shït").

The ideal profanity identifier would address both the pronunciation of the words and their visual appearance. Since those are different problems that require different approaches, it would be very relevant if we were to understand what drives the choices of the user. Is it a matter of opportunity? That is, some characters have more lookalikes, while some sounds are covered by more graphemes? How much of it is a matter of personal preference, idiosyncratic in nature, as is the case with other types of "deviations" in informal on-line content [SSLS+11]? This line of research was not pursued in this work, as it is very culture-dependent, but it seems an interesting one to follow: an analysis that adapts to the preferences of each author.

Another scientific contribution concerns the recognition of profanities. Armed with the knowledge extracted from our corpus (and personal exposure to the subject) we try to understand what is truly being said. In the following sections we will present the results of our research on profanity recognition and identification, starting with some formalisation and the methods we employed.

Chapter 8

Methods of deobfuscation

Contents

8.1 A formal view of the problem and our solution   139
8.2 The edit distance   142
    8.2.1 The Levenshtein edit distance   143
    8.2.2 A different way to see the Levenshtein edit distance   144
8.3 Our evaluation   153
    8.3.1 The dataset   153
    8.3.2 Text preparation   154
    8.3.3 The baseline algorithm   155
    8.3.4 The Levenshtein algorithm   155
    8.3.5 The experiment   158
    8.3.6 Results and analysis   159
8.4 Summary and concluding thoughts   168
8.5 Future work   170

In the previous chapter we approached the subject of obfuscation, discussing the whys and the hows. But that is of little consequence to our problem (allowing better automatic processing of User-Generated Content) if we cannot reverse the noise purposefully added by the authors. This is the objective of the present chapter: to explore methods of resolving the obfuscation.

8.1 A formal view of the problem and our solution

Back in Section 7.4 we defined obfuscation as a deviation of spelling intended to make reading more difficult, a common consequence of modifying the way we write something. In this section we make an attempt to describe obfuscation and deobfuscation from a more formal perspective.


As we mentioned before, we are deliberately ignoring the intent condition of the obfuscation, for being a speculative matter even for humans; consequently, all deviations found are considered intentional. In addition, and without loss of generality, we will be focusing exclusively on the subject of swearing, both because it is easier to illustrate and comprehend these situations (being the dominant usage of obfuscation), and because we were drawn to the subject to address this particular usage.

We use Σ as a symbol representing the entire alphabet, with Σ∗ being the set of all possible arrangements of letters (the Kleene star on Σ), which we will be referring to as words or strings. All taboo words are collected in Φ ⊂ Σ∗, and we use φ to indicate a member of this set. We will include the empty string (which we designate as λ) in Φ as a special symbol to represent "no swear word". In this way we can map every word in our message to a member of our swear word collection (Σ∗ → Φ).

When obfuscating a curse word the author is replacing said word φ ∈ Φ with another word µ ∈ Σ∗ \ Φ, hoping that the reader will resolve the correspondence. In other words, through the application of some intellectual process (that we here formalise as a function) I, the reader would understand that I(µ) = φ.¹ And here is our goal: to produce the best automatic approximation of I that we can, which is an ambitious goal given the gap between the abilities of a computer and the human brain.

Since I is a function declared as I : Σ∗ → Φ, we intend to create our own version (I′ : Σ∗ → Φ) that tries to find the swear word that is most adequate. As we have seen in the previous section, most obfuscations are simple and preserve a significant similarity to the original word [LO14b]; thus the best candidate for the deobfuscated word would be the profanity that most resembles the word we see. To find the most similar word in the set Φ we will make use of a new function E : Σ∗ × Φ → [0, 1], which returns a normalised value where 0 means that both words are equal and 1 indicates that they bear no resemblance to one another. With this as a starting point, we can define I′(µ) = arg min_φ E(µ, φ).

It appears we did nothing more than defer our solution, as we are now left to define the function E. To do so we will be looking at ways of calculating the edit distance, which will be the subject of our next section; but for now it suffices to say that this is accomplished through a function that returns the cheapest set of operations required to transform one string into the other. Thus we have our distance function D : Σ∗ × Σ∗ → ℕ0. We are only left with the task of normalising the results of D, which is trivial given that the maximum edit distance between two strings is the length of the longest one (assuming unitary costs). We can now define E.

¹ In reality it is slightly more complex than this, since the reader takes context into account, but we decided not to generalise beyond our specific needs.

E(µ, φ) = D(µ, φ) / max(|µ|, |φ|)

There is one significant problem still persisting: just because we can find a swear word such that E(µ, φ) < E(µ, λ) does not mean that µ is a profanity. In fact, E(µ, λ) = 1, meaning that it was difficult to consider a word "clean" (sharing one letter with a swear word was sufficient to tilt the scale in that direction). To resolve this situation we employ a constant threshold value, which we represent as e ∈ ℚ⁺. This value indicates the minimum measurement of similarity that we consider relevant, and anything beyond that is ignored. It is derived from the length of the word we see written. Therefore we rewrite our function in the following way:

I′(µ) = arg min_φ∈Φ E(µ, φ)   if min_φ∈Φ E(µ, φ) ≤ e
I′(µ) = λ                     otherwise

However, this threshold is still insufficient to solve the problem we mentioned. Sood said [SAC12b]:

A pilot use of this tool found many false positives — English words with a small edit distance from profane terms (e.g. “shirt”, “fitch”) and first names (e.g. “mike”, “ash”). To avoid these, we first verify that the target word does not appear in an English dictionary or a known list of male and female first names. If it does not appear in these three corpora, we then calculate the edit distances and judge accordingly.

In the same manner we create a set of all the non-taboo words (a dictionary) that we denote as Ω, such that Ω ⊂ Σ∗ \ Φ. We shall then ignore such words when we see them in the text, resulting in the new version of I′:

I′(µ) = arg min_φ∈Φ E(µ, φ)   if µ ∉ Ω and min_φ∈Φ E(µ, φ) < e
I′(µ) = λ                     otherwise

This solution, however, comes with a caveat: should the author disguise a swear word as a non-taboo word, it will be ignored by the system (for example, writing "You think you understand, but in reality you know shirt about this!"). Solving such situations adequately requires very complex semantic processing, which is why this problem has been ignored in the literature. At the end of Section 8.3.6 we evaluate the impact of this condition and how common this occurrence was in our corpus.
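To make the decision rule concrete, the following is a minimal sketch of I′ under unit edit costs, written in Python. The function names, the example lexicon and dictionary, and the threshold value are illustrative assumptions; they stand in for the resources and the weighted distance described in this chapter.

```python
# A minimal sketch of the decision rule I'(µ) formalised above, assuming unit
# edit costs. Names and example values are illustrative, not from the thesis.

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / keep
        prev = curr
    return prev[-1]

def normalised_distance(mu: str, phi: str) -> float:
    """E(µ, φ): edit distance normalised by the length of the longer word."""
    return levenshtein(mu, phi) / max(len(mu), len(phi))

def deobfuscate(mu: str, taboo: set[str], dictionary: set[str], e: float) -> str | None:
    """I'(µ): return the closest taboo word, or None (λ) for a clean word."""
    if mu in dictionary:            # µ ∈ Ω: known clean word, never flagged
        return None
    best = min(taboo, key=lambda phi: normalised_distance(mu, phi))
    return best if normalised_distance(mu, best) <= e else None

# Example with a hypothetical lexicon and threshold:
# deobfuscate("sh1t", {"shit", "damn"}, {"shirt", "ship"}, e=0.3)  ->  "shit"
```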

8.2 The edit distance

The edit distance between two strings is a metric that indicates the minimum cost (usually equivalent to the number of elementary operations) required to transform one of the strings into the other (the edit distance is a symmetric measurement). For the sake of consistency and simple examples (and also because we will introduce modifications that may invalidate this symmetry), we will henceforth be transforming the first argument into the second one (that is, we replace the obfuscated word we see in the text by its canonical form). Readers interested in a more significant overview of approximate string matching are referred to the survey by Navarro [Nav01].

The most common edit distance was introduced by Levenshtein [Lev66], and is commonly known as the Levenshtein distance. Levenshtein proposed three edit operations on individual symbols: insertion, deletion and substitution. Damerau had previously worked on symbol transposition [Dam64], an operation that is occasionally added as a fourth elementary operation, resulting in what is known as the Damerau-Levenshtein distance [BM00]. We ignored this variant because character transposition was not seen often enough in our analysis [LO14b] to justify the extra overhead, but it could be easily accommodated.

Unfortunately the algorithm proposed by Levenshtein is computationally expensive and, despite several authors independently using dynamic programming to lower its execution time [Nav01], many believe it is impossible to improve substantially on its near-quadratic running time [BI14]. As we have said in the previous section, we will be comparing each word in the text to each word in the swearing dictionary, which results in a fast-growing number of distances being computed. We will discuss the Levenshtein distance in more detail, but before we do so we will look at other alternative edit distances that may complete in less time.

The Hamming distance derives from the geometric model for error detection and correction by Hamming [Ham50]. It applies over two strings of the same length and consists of the number of characters that differ. The length limitation reduces its usefulness to us, and for this reason it was not used. Twenty years later, Needleman and Wunsch published their distance-calculating algorithm, based on the concept of the Longest Common Substring [NW70]. This algorithm is usually employed in search operations and it is very sensitive to the order of the characters. We can describe it as the minimum number of symbols that need to be inserted into the shortest chain (or part of it) to turn it into the longest one.

These two edit distances can be implemented by the Levenshtein distance: the Hamming distance is no more than the Levenshtein distance using only the substitution operation, while the Longest Common Substring distance can be calculated by using only the insertion operation on the shortest string (or the removal operation on the longest one). Thus, the Levenshtein distance is the most generic of the three. Other string distance metrics are not based on the idea of an edit distance. For instance, the so-called q-grams [Ukk92] consist of the comparison of several sub-strings taken from the original chains, basing the result on the number of matches found.
In our experiments we compared three edit distance algorithms: a binomial equals that returns only zero or one, the Longest Common Substring for a faster edit distance, and the Levenshtein edit distance for a more thorough analysis [LO14a]. Of these, the Levenshtein edit distance provided the best results.

8.2.1 The Levenshtein edit distance

As we just stated, the Levenshtein edit distance is supported by three operations: deletion (d), insertion (i) and substitution (s). These operations are defined in Equation 8.1, where σ represents an element of our alphabet Σ.

d((σ1, …, σn−1, σn)) = (σ1, …, σn−1)
i((σ1, …, σn), σm) = (σ1, …, σn, σm)
s((σ1, …, σn−1, σn), σm) = (σ1, …, σn−1, σm) = i(d((σ1, …, σn−1, σn)), σm)
σi ∈ Σ, ∀i ∈ ℕ        (8.1)

The most common algorithm that computes the edit distance operates over increasingly longer prefixes of both chains in order to find the shortest possible sequence of edits that can transform the first chain into the second one. Please note that many implementations, for the sake of efficiency, discard the sequence of edits and compute only the minimal edit cost. For our purposes the edit operations are paramount.

The Levenshtein distance is usually computed through a matrix or table. The algorithm computes the edit distances of all pairs of prefixes recursively using Equation 8.2, where i and j start as the lengths of strings a and b respectively. I is the equality function defined between two symbols, as given by Equation 8.3. For illustrative purposes, our examples consist in transforming the word "seeds" into the word "looser". Recursively calculating the edit distance for these words using Equation 8.2, and laying the results out in a matrix where each cell holds the transformation cost between each pair of prefixes, we arrive at Figure 8.1. This matrix is the basis of the Wagner–Fischer algorithm [Nav01].

        l   o   o   s   e   r
    0   1   2   3   4   5   6
s   1   1   2   3   3   4   5
e   2   2   2   3   4   3   4
e   3   3   3   3   4   4   4
d   4   4   4   4   4   5   5
s   5   5   5   5   4   5   6

Figure 8.1: Matrix of the Levenshtein edit costs between the prefixes of the words “seeds” and “looser”.

leva,b(i, j) = i,   if j = 0
leva,b(i, j) = j,   if i = 0
leva,b(i, j) = min( leva,b(i−1, j) + 1,  leva,b(i, j−1) + 1,  leva,b(i−1, j−1) + I(ai, bj) ),   otherwise        (8.2)

I(a, b) = 0 if a = b,   1 otherwise        (8.3)

The lower right cell tells us that the final edit distance is 6. This is also the length of our longest word, hence the maximum edit distance possible, which means that the words are very distinct. As we are more interested in how one word is transformed into the other, having the maximum edit cost means only that all transformation arrangements are equally bad. Table 8.1 shows four different transformation possibilities, all resulting in the same final cost.
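To make the recursion and the matrix concrete, the sketch below fills the Wagner–Fischer table for "seeds" and "looser" (reproducing the values of Figure 8.1) and then backtraces one of the minimal edit scripts. It is an illustrative implementation with unit costs, not the exact code used in this work.

```python
def edit_matrix(a: str, b: str) -> list[list[int]]:
    """Fill the Wagner-Fischer table of Equation 8.2 with unit costs."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1,                           # insertion
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))  # substitution / keep
    return d

def backtrace(a: str, b: str, d: list[list[int]]) -> list[str]:
    """Recover one (of possibly several) minimal sequences of edit operations."""
    ops, i, j = [], len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            if a[i - 1] == b[j - 1]:
                ops.append(f"keep '{a[i - 1]}'")
            else:
                ops.append(f"substitute '{a[i - 1]}' with '{b[j - 1]}'")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(f"delete '{a[i - 1]}'")
            i -= 1
        else:
            ops.append(f"insert '{b[j - 1]}'")
            j -= 1
    return list(reversed(ops))

matrix = edit_matrix("seeds", "looser")
print(matrix[-1][-1])                        # 6, the value in the lower right cell
print(backtrace("seeds", "looser", matrix))  # one minimal (cost-6) edit script
```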

8.2.2 A different way to see the Levenshtein edit distance

People do not obfuscate based on standard edit costs, relying instead on other preferences, some of which we have already seen (in Section 7.7.1). We thus begin our search for an improvement to the basic Levenshtein method by following the series of operations mentioned in Table 8.1 across the matrix that we see in Figure 8.1. This provides a way to see how the edit cost progresses with each operation.

Table 8.1: Multiple options in transforming “seeds” into “looser” with edit cost 6 (the minimum possible for these words). The letter style indicates the operation: inserted, deleted, replaced, kept (i. e. replaced by the same character with no cost).

text        comment
seeds
looseeds    insert "loo"
looseeds    maintain "se"
looserds    replace "e" with "r"
looserds    remove the extra "ds"

text        comment
seeds
seeds       eliminate the starting "s"
sloos       replace "eed" with "loo"
sloos       keep the "s"
slooser     add "er"

text        comment
seeds
lseeds      insert an "l"
looser      turn "seeds" into "ooser"

text        comment
seeds
loose       replace all letters
looser      add the final "r"


Figure 8.2: How each of the three directions translates into an edit operation.

The top left cell, which compares both empty strings, always starts with zero. From there we can move to another adjacent cell (select another cell as the relevant one) across three directions, according to the prefixes we decide to compare and the operations we employ. Figure 8.2 shows how directions and operations relate.

As described previously by Equation 8.1, when we do an insertion, we take the next letter of the target string; when we do a deletion, we ignore or discard a letter from the source string; and when we do a substitution we do both operations. Another way of perceiving these operations is to imagine that we start with the source string and try to transform it into the target string, operating on a cursor whose position is defined by the target prefix being compared. The cursor separates the target string (to its left) from the source string (to its right). Figure 8.3 provides an illustration of this movement, where each cell shows the word being worked on and the dot represents the editing cursor.


Figure 8.3: The cursor makes evident the prefixes being compared and acted upon.

As with the Wagner–Fischer algorithm, we start from the top left cell and move down and right towards the final position, where we conclude our transformation. In Figure 8.4 we represent the paths of the transformation examples provided in Table 8.1. The zero-cost operations occur when the source and target letters are the same in a substitution, and we signalled those four possible occurrences with a small zero. Other paths (or sequences of operations) could be taken, including some that would accumulate a larger edit cost, for example starting with the deletion of all 5 letters of "seeds". It may just happen that some non-optimal processes are more natural to humans or more attuned to their preferences. We would like to explore this possibility, and to do so we will first restructure the matrix that we use.

The graph map

We propose that we look at the Levenshtein matrix more like a graph. Instead of using the cells to store the costs (meaning "when comparing these two prefixes the lowest possible edit cost is this", or "upon reaching this situation the lowest possible cost accumulated is such", which can account for only one value), we could interpret the cells as nodes that represent state, i. e. "which prefixes have we computed so far" or "what is the active state being worked on?". Edit operations that lead to another active cell will be interpreted as edges that lead to another node. Thus, each node will have up to three edges leading to it, and up to three edges leading out of it.


Figure 8.4: Adaptation of the Levenshtein matrix from Figure 8.1 to show the four paths presented in Table 8.1. Zeros indicate operations that could be performed without increasing the total cost of the path. Darker cells represent positions that, if crossed, would make a 6-cost path impossible; this does not mean that every path that avoids them will have the lowest cost.

We still wish to maintain similarity with the original matrix, thus we decided to represent the nodes/cells as hexagons, as shown in Figure 8.5. This has the advantage that all operations are displayed in the same way, meaning that the substitution operation is no longer displayed as a diagonal movement but has its own geometric edge, like the other operations. A desired consequence of using three edges for movement/operation is the possibility of using them to represent the operation cost. For example, Figure 8.6 shows information similar to Figure 8.1 (excluding the costs, which are path-dependent), but now zero-cost operations (substitutions of the same letters) are made evident by omitting the relevant edge. If our objective were to find the path (the sequence of operations) with the lowest cost, we would be able to employ a number of algorithms to do so, such as Dijkstra's algorithm [Dij59]. The results would be the same.

New operation costs

At this point we should point out that the costs associated with the operations need not be limited to one or zero. Any non-negative cost can be attributed to each operation. We said that users like to add extra characters in their obfuscations (back in Section 7.7.1), so we can adjust the costs to reflect that preference. That would mean that insertion operations are cheaper than deletions, but still more expensive than substitutions (because we also saw that authors like to maintain word length). The result of this example is illustrated in Figure 8.7.


Figure 8.5: The three directions of movement over a node. The three edit operations are associated with edges.


Figure 8.6: Map of the Levenshtein edit distance for transforming "seeds" to "looser". Notice the zero-cost transitions (the four missing edges).


Figure 8.7: The map now shows different costs for the operations. From the most expensive (darker) to the least expensive (lighter) we have: deletion, insertion and substitution (which could cost zero in some cases).

Let us remember that obfuscation methods can be influenced by personal preferences and by the environment itself (users may choose to employ some method that is popular in the discussion forum, adapt when typing on mobile, and so on). This means that no set of costs can be considered ideal, since what works well for venue A may do much worse in venue B, or in venue A after a year has gone by. We need some way to adapt to different environments instead of accepting a set of universal costs for the edit operations. Further opportunities for adaptation exist, with finer granularity, and will be presented in the next section and as future work at the end of this chapter.
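As an illustration of the idea, the following sketch generalises the unit-cost recursion to arbitrary non-negative per-operation costs. The particular numbers given only exemplify the ordering suggested by Figure 8.7 (deletion most expensive, then insertion, then substitution); they are not the costs used in the experiments.

```python
# A sketch of the cost-weighted variant discussed above: the same dynamic
# programme, but with configurable per-operation costs instead of unit costs.
# The concrete cost values below are illustrative only.

def weighted_distance(a: str, b: str,
                      c_del: float = 3.0, c_ins: float = 2.0,
                      c_sub: float = 1.0, c_keep: float = 0.0) -> float:
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + c_del
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + c_ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = c_keep if a[i - 1] == b[j - 1] else c_sub
            d[i][j] = min(d[i - 1][j] + c_del,      # delete a[i-1]
                          d[i][j - 1] + c_ins,      # insert b[j-1]
                          d[i - 1][j - 1] + sub)    # substitute or keep
    return d[m][n]

# weighted_distance("sh1t", "shit")  ->  1.0 (one substitution)
```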

Determining operation costs

It makes sense to base the cost of each edit operation on the number of times it needs to be employed to deobfuscate words. Thus, if the author is inserting extra letters to disguise their words, the deletion operation should be cheaper than the insertion. But there is a problem in determining the frequency of each operation: there are likely multiple ways of editing one string into the other; remember that each substitution can be replaced by an insertion and a deletion. We can see an example in Table 8.2, where approach (I) gives equal weight to operations s and i, while approach (II) results in a dominating insertion.

Given a number of edit strategies we want to calculate the usefulness of each edit operation. Let us suppose that we have n pairs of swear and regular words (φ, ω) ∈ Φ × Ω that we call our sample A. Each pair may have one or more edit paths that can transform φi into ωi.

Table 8.2: Comparison of two string transformation sequences, and how they result in different operation frequencies.

Situation:                    (I)          (II)
Transformation "A" → "B":     s            d, i
Transformation "C" → "CD":    i            i
Relative frequencies:         s = 1/2      d = 1/3
                              i = 1/2      i = 2/3

We can represent all of these paths as Pi^(φi, ωi), which we shall simplify as Pi. Each path is simply a sequence of edit operations. If we had only a single path per pair of words, calculating the relative frequency of each operation would be trivial and we would get only a single result. With a number of paths per pair of words we need to consider every permutation (∏i=1..n |Pi|), which leads to a number of possible results that grows too quickly with the sample size. Fortunately, we do not need to compute all possibilities and can reduce this number to one that is bounded by 3n, since we care only for the extreme frequencies of each edit operation. We are interested in finding out how much work would be done by each of the operations should it be preferred over all the others. The end result is just three scenarios or fewer for each pair of words, in none of which we have the guarantee that the privileged operation has the dominant frequency (we can see this happening in case (I) in Table 8.2, where we maximise the usage of s, and in case (II), where we maximise the usage of d).

Once the frequencies are obtained, they are translated into new operation costs. This is a trivial ℚ³ → ℕ³ function where the only requirement is that the proportions between the elements are preserved up to a reasonable degree. We are now left to determine which cost set is better suited.

It should now be determined which cost set has the greater discriminatory power when classifying disguised curses and innocuous words. For this we employ a second sample of annotated obfuscations Bo and of non-obfuscated "clean" words Bc that we submit to a fitness function. This function is based on the median of all normalised distances of each set, with B̃o = median(E(x)), x ∈ Bo and, conversely, B̃c = median(E(x)), x ∈ Bc. The best candidate is the one that maximises B̃c − B̃o.

Calculating the threshold

Finally, we are only missing the parameter e, the threshold: the cost that, if exceeded, means we do not consider the word an obfuscated swear word, since we consider such an association "too far-fetched". We define it as e = (B̃o + B̃c)/2.

As we are deviating from the "one size fits all" of single-cost operations, it is appropriate to consider that no single set of costs would be adequate in every situation. For example, obfuscations that change the length of a word can result in a word that is longer or shorter than the original. Those two situations require different preferences in edit operations. Thus multiple sets of operation costs may be considered, and a heuristic can be used to select the more appropriate one.
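One possible reading of the selection procedure and threshold just described is sketched below: each candidate cost set is scored by how far apart it pushes the median normalised distance of clean words (B̃c) from that of obfuscated words (B̃o), the best-separating set is kept, and e is taken as the midpoint of the two medians. The helper names, the exact shape of the samples and the distance function are assumptions made for illustration.

```python
from statistics import median

def score_cost_set(costs, obfuscated_pairs, clean_words, taboo, distance):
    """Return (fitness, threshold) for one candidate set of operation costs.

    obfuscated_pairs: (written form, annotated canonical swear word) pairs (sample Bo);
    clean_words: words known not to be profanity (sample Bc);
    distance(mu, phi, costs): normalised edit distance in [0, 1] under these costs.
    """
    b_o = median(distance(mu, phi, costs) for mu, phi in obfuscated_pairs)
    b_c = median(min(distance(w, phi, costs) for phi in taboo) for w in clean_words)
    return b_c - b_o, (b_o + b_c) / 2

def pick_cost_set(candidates, obfuscated_pairs, clean_words, taboo, distance):
    """Keep the candidate cost set that best separates clean from obfuscated words."""
    scored = [(score_cost_set(c, obfuscated_pairs, clean_words, taboo, distance), c)
              for c in candidates]
    (fitness, threshold), best = max(scored, key=lambda item: item[0][0])
    return best, threshold
```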

More edit operations

Just as the Damerau-Levenshtein algorithm can improve word recognition under certain circumstances [BM00] (not the ones we have at present, unfortunately), we decided to look at the possibility of extending the edit operations available to us based on our acquired knowledge. Specifically, we noticed that character repetition was a common occurrence in the online forum we observed [LO14b]. Based on this, two new operations were created: repetition (r), which doubles the previous character, and its opposite, unification (u), which removes a character if it follows another identical character. Since we are extending the Levenshtein operations, we will take this opportunity to further alter the substitution. The edit operation substitution had two possible costs, depending on whether the characters involved are equal; from now on this behaviour will be separated into two different operations: substitution (s) and keep (k), the latter being the no-cost operation that is still a more specific version of the former. Table 8.3 compares the specialised and general versions of all operations so far.

The extended operations roster is now as described in Equation 8.4. It should be said that the specialised operations have a more limited opportunity of being used and should take precedence over the generalised ones whenever possible. Ideally the more specific versions should have a lower cost (even if they are more seldom used), and an artificial way of lowering it may be useful, such as giving these operations extra weight during the frequency calculations. In Figure 8.8 we represent the map with all six edit operations present. The more specific operations are represented with lower cost than their more general siblings.

k((σ1, …, σn), σn) = (σ1, …, σn)
u((σ1, …, σn−1, σn)) = (σ1, …, σn−1)   iff σn−1 = σn
r((σ1, …, σn)) = (σ1, …, σn, σn)
d((σ1, …, σn−1, σn)) = (σ1, …, σn−1)
i((σ1, …, σn), σm) = (σ1, …, σn, σm)
s((σ1, …, σn−1, σn), σm) = (σ1, …, σn−1, σm) = i(d((σ1, …, σn−1, σn)), σm)
σi ∈ Σ, ∀i ∈ ℕ        (8.4)

But we can go even further and add more specific and specialised instructions. Many relevant candidates can be found in Table B.1, back on page 187.

Table 8.3: Comparison of corresponding specialised and general edit operations.

Operation          Task
keep (k)           Replace a character with itself
substitution (s)   Replace a character with another (specified) character
unify (u)          Remove a character if preceded by a similar one
delete (d)         Remove a character
repeat (r)         Insert a character equal to the last one
insert (i)         Insert a (specified) character


Figure 8.8: An update of Figure 8.7 that demonstrates the extended set of operations. The edges between the ‘ee’ and the ‘oo’ are shown lighter to represent lower-cost operations.

But we can go even further: we could add more specific and specialised instructions. Many relevant candidates can be found in Table B.1, back on page 187. Of course, adding additional operations implies that we adjust the method we use to calculate their costs.

For clarification, our matching algorithm will continue to make use of the three basic "classes" of operation: insertion, deletion and substitution. The difference is that in each situation the algorithm will give preference to the most specialised operation available (as defined by the number of input values that can make it valid). For example, when removing a character it would first check if it is white space, then see if it is a punctuation sign, then if it is a non-alphanumeric symbol, then verify if it constitutes a repetition, and finally fall back to a general character deletion. This means that the graph in Figure 8.8 will simply present a greater variety of costs.
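As a minimal illustration of this precedence rule, consider the deletion case. This is our own sketch, with a stand-in punctuation set and operation names that loosely follow Table B.1:

```python
def classify_deletion(prev_char: str, char: str) -> str:
    """Pick the most specialised deletion-type operation that applies.

    Falls through from the most specific check to the generic delete,
    mirroring the precedence order described in the text.
    """
    if char.isspace():
        return "delete space"
    if char in ".,;:!?-_*":          # a stand-in punctuation set
        return "delete punctuation"
    if not char.isalnum():
        return "delete symbol"
    if char == prev_char:
        return "unify"               # removal of a repeated character
    return "delete"                  # generic fallback

# e.g. classify_deletion("a", "a") -> "unify"
#      classify_deletion("a", "!") -> "delete punctuation"
```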

8.3 Our evaluation

In this section we describe the experiments we performed to evaluate the adequacy of our profanity recognition system. We begin by recalling some information about the dataset we use and how we preprocessed the messages.

Our experimentation approach is quite straightforward. We compare the performance of our baseline method (described in Section 8.3.3) with three variants of our proposed deobfuscation methods (which we describe in the subsequent section). The details of the experiment are shared in Section 8.3.5, and in Section 8.3.6 we present our results as well as our analysis.

8.3.1 The dataset

The dataset that we used for our experiments was SAPO Desporto, which we described in detail back in Section 7.6. Here we repeat some of the most relevant information.

SAPO Desporto is comprised of 2500 messages randomly selected from user comments posted at a Portuguese sports website [LO14b]. These messages, written in the native language, were annotated by hand, identifying all taboo words, including obfuscated cursing. 560 messages contained profanity, with 783 instances being recognised (words repeated in the same message are not counted if written in the same way), representing 109 different curse words.

Almost 70% of the profanity instances are obfuscated. This may be, in part, because of the automatic profanity filtering that the platform managers were enforcing. The obfuscation was not uniform, as 258 taboo word instances were spelled in a unique way, with an average of 3 spellings for each swear word.

8.3.2 Text preparation

We have previously (in Chapter 3) discussed how User-Generated Content presents its own set of pre-processing challenges, particularly in tokenization. Previous work [LO14a] confirmed the impact of pre-processing on recognising profanity, namely the tokenization and normalisation tasks.

As was said in that chapter, our tokenizer worked well for regular text tokenization, but proved to be a poor match when dealing with deobfuscation [LO14a]. This happens because the author of the message often "overrides" several characters that usually have very precise meanings, with particular problems arising from symbols considered word dividers, such as commas and full stops. Worse than that is the abuse of white space characters, which are the most elemental word separators. Training a model to handle this inconsistent usage is very difficult, as semantics plays a very important role in resolving many situations. For example, in "s . . . " how do we know if we are looking at an ellipsis or an obfuscation? Context, and particularly semantic context, is essential.

The most common tokenization processes make use of regular expressions (as is the case with the Penn Treebank Tokenizer2, for example). Such a solution is simple and can provide adequate results [LO14a], but these still leave room for improvement; in particular, noisy texts can make them act unpredictably. Tweaking the regular expression can improve the recall for obfuscations, but at a disservice to regular words, which suggests that a single tokenization method is incapable of supporting such disparate interpretations of the symbols and, as a result, of words/tokens. In fact, even for this single specialised task of deobfuscation, multiple approaches to tokenization are required to provide an accurate apprehension of the message as intended by the author (for example, ignoring word separators and attempting to recognise new words based on this way of interpreting the text). Thus, an accurate and complete preparation of the original message may depend on multiple iterations of different processes over the same source that are later combined into a unified view.

This line of work was consigned to the future since we did not consider it to be the dominant problem and it was too specific, despite falling in line with some of our previous work. Our annotation already contained the relevant words correctly identified, meaning that we could avoid the shortcomings of imperfect tokenization in our experiments. Or, more precisely, we could avoid them when dealing with the cursing words. Our methods require a sample of non-swear words to fine-tune the parameters of the classifier, and these are not annotated. These words were taken from the messages that contained no profanity, to safeguard against the selection of part of a mistokenized taboo word. Tokenization was done by employing a simple regular expression (\b\w+\b), which was considered sufficient for our needs.

2 http://www.cis.upenn.edu/~treebank/tokenizer.sed, seen on 2014-12-20.

The normalisation process, which tries to reduce the number of different ways in which a word is written, was also shown to have a significant impact when deobfuscating words [LO14a]. We could say that deobfuscation is normalisation, and to avoid conflict or interference we only converted the messages to an all-lowercase version before working on them.
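A minimal sketch of this preparation step, assuming messages is a list of annotated comments with a hypothetical has_profanity flag, and using the regular expression mentioned above:

```python
import re

WORD_RE = re.compile(r"\b\w+\b", re.UNICODE)

def sample_clean_words(messages):
    """Collect candidate non-swear words from messages with no annotated profanity."""
    words = []
    for msg in messages:
        if msg.get("has_profanity"):   # hypothetical annotation flag
            continue
        lowered = msg["text"].lower()  # the only normalisation applied
        words.extend(WORD_RE.findall(lowered))
    return words
```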

8.3.3 The baseline algorithm

The Levenshtein edit distance calculates the total cost of transforming a source string into a target string as a form of defining similarity. We can consider that a word in the text corresponds to a swear word if the (edit) distance we need to cover in order to transform it into said swear word is within a certain threshold. But defining a fixed threshold that works well for both small and large words is difficult, since we already saw that obfuscation often changes the word length. A solution is to define this threshold based on the length of the word we see [SAC12b], which provides better results. Through experimentation we arrived at the value of e = 0.5 as providing the best results for this algorithm; that is, we allow up to half the characters of the word we see to be edited. All operation costs were kept at 1 (with the natural exception of the keep operation).
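A minimal sketch of such a baseline, assuming lexicon is the swearing lexicon and using unit costs throughout (the function names are ours, not taken from the thesis):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution / keep
        prev = curr
    return prev[-1]

def baseline_match(word: str, lexicon, e: float = 0.5):
    """Return the closest curse word if it is within e * len(word) edits, else None."""
    best = min(lexicon, key=lambda curse: levenshtein(word, curse))
    return best if levenshtein(word, best) <= e * len(word) else None
```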

8.3.4 The Levenshtein algorithm

We will take this opportunity to describe some implementation details that fall outside the general theory we have presented and that exist for practical reasons.

The number of paths
The greatest obstacle to running our algorithm is the sheer number of possible paths that can be created when converting one word into another (that is, |Pi|). To give a quick idea, if we have two words of size 2 (meaning that |φi| = |ωi| = 2), then we have 13 different paths. Should both words be of length 4 — which are still small words — we are already dealing with 321 paths for one word pair. To get more representative values, in our data sample the average word length (from our full set of Σ∗) is 7 and the average swear word (in Φ) is 6; such a pair results in about 20,000 possible paths. The exact number of different paths can be calculated through Equation 8.5, where i and j refer to the lengths of the words.

T(i, j) = 1                                              if i = 0
T(i, j) = T(j, i)                                        if j < i          (8.5)
T(i, j) = T(i − 1, j) + T(i, j − 1) + T(i − 1, j − 1)    otherwise
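A short memoised implementation of Equation 8.5 (our own illustration) reproduces the figures quoted above:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def path_count(i: int, j: int) -> int:
    """Number of distinct edit paths between words of lengths i and j (Equation 8.5)."""
    if i == 0:
        return 1
    if j < i:
        return path_count(j, i)
    return path_count(i - 1, j) + path_count(i, j - 1) + path_count(i - 1, j - 1)

# path_count(4, 4) == 321; path_count(6, 7) == 19825, roughly the 20,000 quoted above.
```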

Each word in the text needs to be compared to each curse word from our swearing list (which comprises dozens of entries). We exclude source words that are part of a Portuguese dictionary and need only process each word once (remembering the result for later), but it is still apparent that it would be a struggle to deal with so many situations. We clearly need a search approach that is able to cull the search space.

We used the Dijkstra search algorithm [Dij59] to compute each path between the start and end node. This avoids looking through unhelpful paths (e. g. removing all letters and inserting the new ones with no regard to common substrings), and prevents the revisiting of nodes already processed. But how do we know what is a good or bad path? All edit operations look the same: we essentially have just an insert, a delete and a substitute, and the differing costs are not defined yet. As we mentioned back on page 150, we will be selecting only the paths that emphasise each of the available operations. This means that we can cull our search space by pruning the paths that fall behind others in this regard. For example, if we look back at Figure 8.4 we can see that the two inner paths both contain the same operations: 1i + 5s. That means that one of them is extraneous and will not be completed.
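The pruning of paths that repeat an operation multiset already seen can be sketched as follows (our own illustration, not the thesis's implementation):

```python
from collections import Counter

def is_redundant(path_ops, seen_signatures):
    """Return True if an identical multiset of operations has already been expanded;
    such a path can add no new extreme frequencies and need not be completed."""
    signature = frozenset(Counter(path_ops).items())   # e.g. {('i', 1), ('s', 5)}
    if signature in seen_signatures:
        return True
    seen_signatures.add(signature)
    return False

# seen = set()
# is_redundant(['i', 's', 's', 's', 's', 's'], seen) -> False  (first 1i + 5s path)
# is_redundant(['s', 's', 'i', 's', 's', 's'], seen) -> True   (same 1i + 5s, pruned)
```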

Cost attribution
When converting edit operation frequencies to costs, the more specific operations should possess a lower cost than their more generic versions (as stated in Section 8.2.2). To ensure this, we define the cost for each operation as follows: basic operations are counted normally, with a factor of 1. Specialised operations (other than keep, which always costs zero) are counted as if they were seen more frequently than they actually were. The exact value is defined by 2S + 1G, where S represents the number of times the specialised operation was used and G corresponds to the number of uses of the relevant general operation. These values were supported through experimentation and, since specialised operations are selected for use whenever possible, ensure that the algorithm looks at them favourably. Our operation costs are always rounded to integers.
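A sketch of this weighting, assuming freq maps each operation to how often it appears on the selected paths and general_of maps a specialised operation to its generic counterpart (both hypothetical structures):

```python
def weighted_counts(freq: dict, general_of: dict) -> dict:
    """Apply the 2S + 1G weighting to specialised operations before deriving costs."""
    weighted = {}
    for op, s in freq.items():
        if op == "keep":
            continue                      # keep always costs zero, no weight needed
        if op in general_of:              # specialised operation
            g = freq.get(general_of[op], 0)
            weighted[op] = 2 * s + g      # counted as more frequent than observed
        else:                             # basic operation, factor of 1
            weighted[op] = s
    return weighted

# e.g. weighted_counts({"unify": 4, "delete": 10}, {"unify": "delete"})
#      -> {"unify": 18, "delete": 10}
```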

Default costs
It is possible that, during classification, our version of the Levenshtein algorithm tries to use an operation for which no cost has been attributed, because it was never seen during the creation of the model being used. In that situation we use the largest cost that we did calculate.

Two sets of costs
As was said in Section 8.2.2, some circumstances have requirements that can conflict with others. The most obvious situation, which we presented as an example, is that of obfuscation by shortening the word vs. obfuscation by lengthening the word — one can benefit from cheaper insertions while the other works better with cheaper deletions.

We split the "training" and "fitting" situations to deal with the shortening of words separately from the remaining cases. This means that, when pairing a word with a curse word, a different context will be used depending on how the former compares with the latter in length. By context we mean the contribution to a set of operation costs or the application of a set of costs. If more than one set of costs is applied, we compare all the candidates and elect the one associated with the lowest editing cost.
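A sketch of this selection step, assuming cost_sets holds the candidate cost tables and distance(word, curse, costs) is the cost-aware edit distance (hypothetical names):

```python
def best_match(word: str, curse: str, cost_sets, distance):
    """Evaluate a word/curse pair under every applicable cost set and keep the cheapest."""
    candidates = [(distance(word, curse, costs), costs) for costs in cost_sets]
    cheapest_cost, chosen_costs = min(candidates, key=lambda pair: pair[0])
    return cheapest_cost, chosen_costs
```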

The operations implemented
We implemented three versions of our proposed method, each one improving upon the previous one, following Section 8.2.2.

The first version is similar to the baseline algorithm but allows for differing costs. The second one adds the operations r and u, which derive from i and d, respectively. The third one adds many operations, namely:

• Insert space,

• Insert punctuation,

• Delete space,

• Delete punctuation,

• Substitute a character with a diacritic by the same character without any,

• Substitute a character without a diacritic by the same character with a diacritic,

• Substitute a character with a diacritic by the same character with a different diacritic,

• Substitute a number with a letter,

• Substitute a letter with a number,

• Substitute a non-space character with a space character,

• Substitute a space character with a non-space character,

• Substitute a non-punctuation character with a punctuation character, and

• Substitute a punctuation character with a non-punctuation character.

Some of the operations were added only to provide symmetry to the operation set and we do not expect them to be used in this work. The following operations were not specialised: add letter, add number and change letter, since we considered that the generic operations handle them adequately: i and s will only be introducing letters, and the obfuscation operation +N is negligible in our corpus. The operations word aggregation, complex alteration, phonetic-driven substitution and pun were left out due to their complexity.

8.3.5 The experiment

The experiment was performed in a 10-fold cross-validation evaluation system, which was repeated 11 times (11 trials).

We employ the full set of 781 curse word instances, to which we add a random set of 781 non-curse words. As we mentioned before, the non-curse words were randomly sampled from messages containing no annotated swearing. The selection of these words, together with the shuffling, determines the differences in results from one trial to the next. The words are all shuffled together before being partitioned; the baseline classifier ignored all but the test set, while our proposed classifiers used the data with the following approximate distribution for each cross-validation:

• Of the 781 annotated curse word instances

– 90% is used for model creation, of which
  * 45% is used for the calculation of the cost candidates and
  * 45% is used for the cost selection and threshold calculation;
– 10% is reserved for testing.

• Of the 781 non-curse words

– 90% is used for the cost selection and threshold calculation and
– 10% is reserved for testing.

The imbalance that occurs in the cost selection is not as pronounced as what we find "in the real world" (as we have seen in Section 7.3.1), and a variety of non-swear words helps the system avoid false positives. Repeated words were also accepted in the samples to reflect real word frequencies. Misclassification of a common word is undesirable since it would happen over and over, but with this arrangement common words are more likely to be present during the model creation process, which reduces the likelihood of such an error happening.
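A rough sketch of the data handling for one fold of one trial, under the distribution above (array and function names are ours):

```python
import random

def one_fold_split(curse_instances, clean_words, fold: int, k: int = 10, seed: int = 0):
    """Partition one cross-validation fold roughly as described above."""
    rng = random.Random(seed)
    curses = curse_instances[:]
    cleans = clean_words[:]
    rng.shuffle(curses)
    rng.shuffle(cleans)

    def split(items):
        size = len(items) // k
        test = items[fold * size:(fold + 1) * size]
        rest = items[:fold * size] + items[(fold + 1) * size:]
        return rest, test

    train_curses, test_curses = split(curses)       # ~90% / ~10%
    half = len(train_curses) // 2
    cost_candidates = train_curses[:half]            # ~45%: candidate cost sets
    fitting_curses = train_curses[half:]              # ~45%: cost selection, threshold
    fitting_cleans, test_cleans = split(cleans)      # ~90% fitting / ~10% test

    return cost_candidates, fitting_curses, fitting_cleans, test_curses, test_cleans
```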

Our classifiers make use of a dictionary before attempting to classify a word. For this we employed the master dictionary for European Portuguese from GNU Aspell3. However, through a fault of ours, we used only a subset of the dictionary and not its full contents. Since this affected all classifiers equally (and not severely), and since it illustrates a possible problem with dictionaries (they are never perfect), we decided it did not warrant a re-run of the entire experiment.

The label set that we used comprises all swear words present in our annotated corpus. This is different from using a list of swear words obtained by other means, as was the case in the similar works we mentioned previously.

8.3.6 Results and analysis

In our experiment we quantified the adequacy of the baseline and the three methods that we implemented using the standard Precision, Recall and F1 measurements. We consider a True Positive when the classifier accurately classifies a word from the source text as the swear word declared in the annotation. If the classifier proposes a non-swearing word as a disguised curse, we consider it a False Positive. A True Negative happens when the classifier sees no significant relation between a non-taboo word and any taboo word in our swearing lexicon. Finally, a False Negative happens when our classifier ignores a swear word.

We calculated the micro and macro averages for each 10-fold cross-validation experiment, as well as a weighted average that takes into account the frequency of each label; these are standard measurements in classification evaluation. Micro averages for precision and for recall are always equal since a false positive for one class is a false negative for another. This is a consequence of us treating the absence of swearing as a regular class (we had, thus, 110 classes — too many to represent as a readable confusion matrix). To summarise the performance across the multiple experiments we used the median, as we feel it provides a more accurate notion of what can be expected from the classifiers.

Since we are performing a multi-label classification experiment (and not applying a simpler binary classifier that merely detects the presence of profanity), and the data being used is quite different from that employed in other studies, comparison with the work performed by others is very difficult.

3 http://aspell.net/
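The three averaging schemes can be reproduced with scikit-learn, for instance (a generic illustration, not the exact evaluation code used in this work):

```python
from sklearn.metrics import precision_recall_fscore_support

def summarise(y_true, y_pred):
    """Compute precision, recall and F1 under micro, macro and weighted averaging."""
    return {avg: precision_recall_fscore_support(y_true, y_pred,
                                                 average=avg, zero_division=0)[:3]
            for avg in ("micro", "macro", "weighted")}

# y_true / y_pred are lists of class labels, where the "not a swear word"
# label is treated as just another class, as described above.
```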

Table 8.4: Median values of the performance measurement of the baseline classifier after all experiments were run. Each experiment calculated Precision, Recall and F1 for the 10-fold cross validation and presented the three averaging values.

Average     Precision   Recall   F1
micro       0.90        0.90     0.90
macro       0.82        0.90     0.83
weighted    0.93        0.90     0.91

The baseline
We begin by presenting the results for the baseline classifier, which is based on the standard Levenshtein algorithm. This method performed quite well when using a proportional threshold, which improves the results significantly when compared to a fixed edit distance (we can refer to a previous work of ours [LO14a] that evaluated the Levenshtein edit distance algorithm with a fixed threshold on the same dataset). The quite positive general results obtained are shown in Table 8.4, where we see the medians of all repetitions. Sood, Antin and Churchill presented much lower results for a similar method [SAC12b], which, we can only assume, is the result of the swearing dictionary used, which was generic and obtained from unrelated sources.

Micro averages (calculated over the collection of all individual results) show that this classifier is fairly correct. Macro averages — the averages of the measurements taken per class — show a lower precision, meaning that false positives were more frequent than false negatives. The last line, weighted averages, is similar to macro averages but proportionally weights each class. Here we can see that precision increases significantly, which strongly suggests that the dominant class (i. e. not a swear word) is not the source of the false negatives. In fact, looking at the errors in more detail, we can see that the most common error was incorrectly assuming clean words were swearing (83 words on median), representing most of the errors committed. The converse situation — ignoring swear words that are present — is only seen a median of 42 times, while confusion about which swear word was actually seen occurred around 31 times.

We present some of the errors committed by the baseline classifier in Table 8.5. Some errors can be attributed to the dictionary, which prevented a word from being properly evaluated (e. g. "carvalho"), but the dictionary we used also failed by not carrying some common words (e. g. "como", "pontos" and "toda"). The dictionary was used as much for "word protection" as for reducing processing time, and this shortfall affects all classifiers equally, preserving their relative performance.

Table 8.5: Example of classification errors committed by the baseline classifier. One word was shortened to fit into this table.

Word seen                        Class expected   Class attributed
caaralhosssssss(. . . )ssssssss  caralhos         —
carvalho                         caralho          —
enculados                        enrabados        encornados
f.d.p.                           puta             —
m e r d a                        merda            —
m*da                             merda            foda
murcas                           murcões          maricas
p***                             puta             —
putax                            putas            puta
como                             —                cago
piue                             —                pila
pontos                           —                bostas
toda                             —                foda

The table includes a long and elaborate obfuscation that inserts too many characters for the baseline to handle (due to the threshold limit), leading it to consider the word a non-curse. In this table we can also see some "unfair" expectations: it is unlikely for "f.d.p." to be recognised using our current methods, just as "carvalho", being a dictionary word, requires semantic analysis to resolve.

More importantly, we show some cases of "mistaken identity" where the classifier simply goes with another "similar" curse word (similar according to its algorithm, not to humans) but ends up with the wrong answer. This is the case with "m*da", where the classifier opted for the candidate solution substitution-substitution-keep-keep with a total cost of 2 instead of the correct keep-substitution-insert-keep with the same total cost. The preference (and the error) was simply due to lexical order, but it does raise the question of how many correct classifications were made through luck in the "toss of the coin". This kind of situation fuelled our motivation to develop a model that better approaches the way people handle obfuscation. At the same time, ties in deobfuscation costs are not exclusive to the baseline and, despite being less common with our approaches, a better way of resolving them could make a measurable difference.

We should also clarify that "piue" is not a Portuguese word (taboo or not), but a corruption of a term used for mockery, based on the onomatopoeia of birdsong ("piu", which the English write as "tweet").

Our methods

We represent the results of the tests performed on the deobfuscation methods we have proposed as the difference in performance compared to the baseline. We label our proposed methods 0, 1 and 2 to refer to the three collections of edit operations we described back in Section 8.2.2.

We plotted our results in Figure 8.9 where, at first view, we notice that across all forms of measurement the number of operations available correlates with the quality of the results obtained. That is to say that method 0 shows a slight degradation in performance compared with the baseline, while method 2 compares favourably with our reference method, with method 1 exhibiting rather mixed results. We also notice that median recall shows very little variation across all forms of measurement.

To help us better understand these results and to provide a more detailed picture of the classifiers, we plot the number of classification errors in Figure 8.10. These include the number of errors in recognising the swear words — which we further detail as the words that were disregarded and the words that were mistaken for other curses — as well as the number of non-swear words that were erroneously classified as profanities. Once more we use the results for the baseline as the basis of representation for the improvements and regressions of each of our classifiers.

The rest of this section is devoted to a closer analysis of the results obtained from each classification method.

Method 0
Analysing each of our proposals in more detail, we can see that the simplest method — method 0, which is the most similar to the baseline — corrected many of the errors in the baseline classification. We highlight these changes in classification in Table 8.6, which contains a sample of errors from this classifier. That was the result we expected from our modifications: to solve the "exaggeration" obfuscations (such as in the first example).

However, with the improvements also came new errors in the classification (Table 8.6 has one example), and consequently the method 0 classifier shows slightly worse results overall when compared with our reference algorithm, particularly on the precision measurement (as Figure 8.9 makes evident). Looking at Figure 8.10 we see that 20 more swear words were disregarded by the classifier, while 43 others were correctly taken as taboo but mistaken for other swear words (the baseline failed in 31 and this method exceeded that number by 12).

An examination of the classifications revealed that many of the errors involved the popular swear word "cu". Many of its obfuscations were also 2 letters in length, and the substitution was an edit operation too expensive to be considered; for this reason a number of variations of this taboo word were overlooked.


Figure 8.9: Results of our evaluation based on three standard measurements of classification performance. Three averaging methods were considered. The results presented are the median of the 11 trials and are represented as the difference from the baseline results presented in Table 8.4.


Figure 8.10: Detailed analysis of the classification errors for the three methods we developed. The results are presented as the differences from the baseline results, in number of instances.

Table 8.6: How our method 0 classifies the previous examples of classification errors committed by our baseline classifier (Table 8.5). One word was edited to fit in the table.

Word seen                     Class expected   Baseline     Method 0
caaralhossss(. . . )ssssssss  caralhos         —            caralhos
carvalho                      caralho          —            —
enculados                     enrabados        encornados   encornados
f.d.p.                        puta             —            —
m e r d a                     merda            —            merda
m*da                          merda            foda         —
murcas                        murcões          maricas      maricas
p***                          puta             —            —
putax                         putas            puta         puta
como                          —                cago         cago
piue                          —                pila         —
pontos                        —                bostas       bostas
toda                          —                foda         foda

Method 1
The second method seems to be the most unremarkable of the three proposals. It scores very close to the baseline in many tests (according to Figure 8.9), but it does present a very significant reduction in "imagined swear words" when compared with the previous method (based on Figure 8.10); that is, this method mistakenly identifies fewer non-cursing words as being cursing. This results in fewer false positives across many classes. In fact, we can say that this was the main gain over our method 0.

However, this slight "bias" towards the "not a curse word" class also makes it ignore some actual cursing, resulting in significantly more false positives when evaluating this large class (and thus a lower weighted precision score compared to its unweighted score). We also noticed that the short taboo words were almost as problematic as they were for method 0.

The above interpretation can be substantiated with the examples of errors in Table 8.7. We see that most corrections to previous classification errors come from clearing words mistaken for cursing and from dealing with repetitions (which was clearly the goal of this method). We also see this new operation "backfiring" in one of the final cases shown. Still, the measurements show a positive progression, and the final method will bring further improvements.

As for the operation costs, the operations present in the previous method saw their costs increase by a factor close to 2 (based on the same data). The threshold also rose, but slightly less, which means that situations that were previously close to this limit are no longer satisfied — that is, unless we are in a situation where a new, specialised operation can be used. As the unify operation is introduced, it may4 cause the delete operation to rise in cost even further if it is seen being used. This happens because the more general operation is then seen as being employed less often.

Table 8.7: Comparison of errors committed by several classifiers in the same trial. Two of the words were edited to fit in the table.

Word seen               Class expected   Baseline   Method 0   Method 1
campeão                 —                cabrão     —          —
cobiçado                —                cagado     bico       —
cuuuu(. . . )uuuuu      cu               —          —          cu
có                      —                cu         cu         —
enganas                 —                enraba     enrabar    —
f0dasssseeeee           foda-se          —          —          foda-se
milhoes                 —                colhões    colhões    —
raça                    —                piça       piça       —
roupão                  —                morcão     morcão     —
cabraao                 cabrão           cabrão     cabrão     caralho
du@ta                   puta             puta       puta       —
peeee(. . . )eeeerding  —                —          —          merda
pu t @                  puta             puta       puta       —
put@2                   puta             puta       puta       putas

Method 2
To conclude our analysis, method 2 introduces new operations. We already saw in Figure 8.9 that it outperforms all other methods and the baseline across all forms of measurement we computed. From Figure 8.10 we can also understand that, while its recognition of non-swearing shows slightly less success than method 1, it is the first of our methods to exceed the baseline in the complementary task of recognising swear words correctly.

Just as we saw an increase in the costs of the edit operations when going from method 0 to method 1, so did they increase in the same fashion in method 2. In the same way, the value of the threshold increased, but in a slightly smaller proportion. This allows for the introduction of new, cheaper and more specialised operations that are preferred in use.

As for the errors, Table 8.8 shows some errors (and a few successes) of this and other classifiers. We can see that, again, this is not a story made only of victories, and some words that had been correctly classified up until now by "simpler" classifiers are now being mislabelled. In particular, we can see that the small word "ku" still does not seem to be recognised.

4 Remember that we create more than one set of costs.

Table 8.8: Comparison of errors committed by several classifiers in the same trial.

Word seen    Class expected   Baseline     Method 0     Method 1     Method 2
c o n @      cona             —            —            —            cona
enculados    enrabados        encornados   encornados   encornados   enrabados
f.o.d.@      foda             —            —            —            foda
p_u_t_@      puta             —            —            —            puta
putax        putas            puta         puta         puta         putas
ca@brone     cabrão           cabrão       cabrão       cabrão       —
como         —                cago         cago         —            cocó
crlho        caralho          caralho      caralho      caralho      —
fodasse      foda-se          foda-se      foda-se      foda-se      fodesses
fodebol      fode             fode         fode         fode         fodendo
ku           cu               cu           —            —            —
murcóes      morcões          morcões      morcões      morcões      —
piue         —                pila         —            —            piça

Table 8.9: Median number of words misclassified by method 2 that were correctly handled by previous classifiers, and correct classifications performed by this classifier that were mishandled by the other classifiers.

                Baseline, 0 and 1   Methods 0 and 1   Method 1
Regressions     4                   9                 11
Progressions    11                  25                32

Short obfuscations are quite hard to decode since they provide few characters to work with. Should we increase their threshold, then common words like "eu", "tu" or "ou" ("I", "you" or "or") could easily turn into false positives for the taboo word "cu". Despite these shortcomings, and based on the figures we have shown, method 2 achieves more than any of the previous methods, so such "slips" are more than made up for by what it achieves above the rest. Table 8.9 paints exactly that picture using numbers: for every word that our classifier stops getting right, it recognises 3 new words (median values). For reference, the hard numbers for this classifier are disclosed in Table 8.10.

We also tried to determine the overall impact of the dictionary (Ω). Running without this resource hurt the baseline method slightly more (by 2–5 percentage points) than it did our method 2 (2 percentage points) across all averages. This was too great a penalty for the two occurrences in our corpus. However, we would expect that, should profanity censoring become tighter, this solution would become much more common.

Table 8.10: Results for three measurements of our final classifier, method 2. The results were summarised using three different averages. The numbers shown are the median values obtained after the 11 repeats of the experiment.

Average     Precision   Recall   F1
micro       0.92        0.92     0.92
macro       0.86        0.94     0.88
weighted    0.94        0.92     0.93

Comparison with other works
As we have mentioned several times already, comparing our results is difficult for a number of reasons, namely different environments, different annotation types, different profanity dictionaries and different treatments of obfuscation. In addition to all that, we worked with a different language and tried a more difficult challenge, since profanity erroneously classified as another profanity was considered a miss. With all that, and with a hefty dose of caveat emptor, we can attempt to trace a quite subjective parallel with the numbers we presented in Section 7.5.3.

The work of Sood, Antin and Churchill [SAC12b] used Yahoo! Buzz and achieved the results summarised in Table 7.5 (on page 118). Their Levenshtein adaptation is quite similar to the one we used as the baseline, which makes us believe that the classification algorithm is not responsible for the discrepancy in numbers. In particular, recall is severely low, which we believe could be due to 25/33 of the top swear words being obfuscated and tokenization not helping in these situations. In addition, their corpus is more than twice the size of ours, but the swearing was not annotated by hand.

Wang et al. [WCTS14] worked on Twitter messages and, also through a classifier based on the Levenshtein edit distance, achieved near perfect precision with 0.72 recall, combined into a 0.83 F1 measure. This is a much better result, closer to ours. However, this corpus is smaller than ours (1000 tweets vs. 2500 forum messages), which may have reduced the variety of cussing and obfuscation.

Perhaps a better comparison can be made with an earlier work of ours [LO14a], where we used a tokenization and normalisation closer to the ones we believe were used in the studies just mentioned. There we again used a method based on the Levenshtein edit distance and obtained precision, recall and F1 values of 0.96, 0.68 and 0.80, quite similar to the values of Wang et al. This is again quite similar to the baseline we used in the current experiments, and should give an idea for comparison, with due reservations.

But the important point we would like to make here is not about numbers; it is the importance of pre-processing messages when working with User-Generated Content, and how the introduction of noise can really affect higher-level processing of the text.

8.4 Summary and concluding thoughts

Taboo words may well be as old as civilisation. For example, over 4000 years ago old European tribes did not use the proper word for "bear", referring to the animal instead with the word for "brown". The "real name" of the animal was taboo, since they believed that mentioning it would summon the animal. The English word "bear" actually derives from an ancient Germanic word for the colour of its fur, and its real name fell into disuse [Vot]. This is the oldest example of obfuscation we are aware of.

Most taboo words are related to swearing, which, as we have discussed, is entrenched within our society in a special way. Despite being shunned by many, these words have never really gone away — clearly they serve a purpose. They lose their literal meaning and instead are used to communicate something at a higher level.

Offensiveness and acceptability dictate that, in certain situations, people may not express themselves as they would prefer. The obfuscation of taboo words is, thus, a way for people to preserve most of the expression they intended while avoiding some of the possible backlash and consequences that arise from using taboo words.

Profanity recognition systems have seen little attention in the scientific community, particularly in ours. Perhaps this derives from the conflicting ideas or the awkwardness of addressing such a lowbrow subject in a highbrow environment; or perhaps from the deceptive apparent simplicity of the problem [SAC12a]. In addition — perhaps as a consequence of this lack of academic interest — profanity recognition systems seem to be built with the sole purpose of hiding or repressing, and not of exposing or understanding. That part has been played for hundreds of years, as humans apply social pressure and censure to put an end to foul language [McE06], with the success we all know and see. Could we realistically expect more from a machine?

Natural Language Processing systems have been created to overcome a number of hindrances, but the purposeful introduction of noise, as far as we are aware, was never one of them. With this work we are taking steps down that road.

Our goal was to study and improve automatic methods to recognise obfuscated words. The context of our application was that of swearing, due to the call for assistance from SAPO, and because profanity is the most common type of taboo word and the most frequent target of obfuscation.

To this end we created a corpus dedicated to profanity identification, which we annotated by hand and analysed in detail. We saw a correlation between swear words suppressed by filters and the variety of ways in which those words were written. We also determined the methods of obfuscation that were most frequently used and, based loosely on some of them, devised a series of classification methods to recognise the swear words disguised in messages.

Three classifiers were created, each one employing a different edit distance method based on the Levenshtein algorithm. Each of the methods has available a different set of specialised edit operations that can have different costs, with the more general operations — those present in the Levenshtein algorithm — being more expensive than their more specific variations. The costs are determined through the analysis of annotated examples, and here resides the core of our contribution.

Deobfuscation of a word we do not recognise is performed by searching for the most similar curse word using the edit distance. A threshold (computed at the same time as the costs of the edit operations) limits the maximum dissimilarity we will consider.

Experimentation results indicate that the greater variety of edit operations available provides an improvement in the number of words that are correctly classified. Our most complex classifier showed about 5% macro average F1 improvement over our baseline classifier (derived very closely from the standard Levenshtein edit distance), while the micro and weighted averages indicate about a 2% increase on this metric.

We concluded that our baseline is simpler, faster and provides quite adequate results; but it is also static and easier to fool (for example, by inserting many characters). Our methods do require annotation, but this provides a path for adaptability to either different practices or evolving environments, which is something that lists cannot provide in a flexible way. Online communication is very dynamic and subject to unpredictable changes, which are easier to account for through annotation than through modification of algorithms or hardcoded values. And since one of our classifiers does present a measurable improvement over the baseline, we think that we achieved our goal of improving the state of the art in deobfuscation.

That is not to say that everything went as expected, or that we have reached the end of this road. We were surprised by the underperformance of our simplest method, given how similar it is to the baseline. We could have tweaked our methods to provide a more similar result — which could also reflect on our more complex derivatives — but we preferred to keep our approach simple and generic, and avoided focusing too much on our particular data. In the next section we present a number of potential improvements and research directions that we would like to consider as follow-ups to this work.

8.5 Future work

As usual, time was our main constraint, not lack of ideas. For this reason we feel there is still significant ground to cover. In this section we describe the remaining steps in our plan that are still to be followed.

The first idea that we would like to evaluate is a method 3 classifier. This would add a single new operation, substitute character X with character Y, which would be able to address a number of situations that are common enough to have an impact. For example:

• Letters that sound alike — c, k and q, for example. This could solve our particular problem with "cu" / "ku", as in general small words need more careful choices during obfuscation, and soundalike characters are of greater importance in these situations.

• Common substitutions of non-letters, such as "0" and "o" or "@" and "a", that could be handled more precisely.

Similarly, we could also derive more specific operations from the insert. As we have seen, vowels are usually the first letters to be deleted, since they are fewer and, due to their nature, easier to add back. An operation insert vowel X could be added to handle those cases.

Still on the subject of more edit operations, and taking another step forward, it may be possible to automatically generate completely generic edit operations, really learning the operations from the examples. By seeing many examples like the ones we just mentioned, the system could learn such a mapping. This could replace all substitution operations — and similar adaptations could be made for insertion and deletion. Under-represented situations could be gathered in a "generic" version of the operation.

The main challenge arises from the dimension of our corpus; it may not be large enough to allow such precise operations to attain enough significance. The examples for each case may be too sparse to "learn" many of the useful substitutions. Consequently, adding more messages to the corpus is also on our road map.

We also noticed that our algorithm chooses between a number of possible sets of costs for the operations. Perhaps we could combine several suggested sets of costs into a single solution. We think there are only advantages in seeing more costs represented; in the current state some edit operations are omitted entirely from the final roster without necessity.

The threshold is currently calculated in a very simple way (the halfway point between two medians), which is likely not optimal. We would like to try other possibilities to generate better values for this parameter; one possibility is to use an iterative method that calculates the best value for e by evaluating the number of mistakes that would be made at each hypothetical point. We noticed it was usual to have some overlap between B̃o and B̃c (the costs for the obfuscations and the costs for the non-obfuscated words). Perhaps by better accounting for mistakes in a few rare situations we can improve the classification of more common situations.

Going a little further, improving and tailoring the pre-processing for the task at hand can go a long way towards making or breaking an effort. On the tokenization side, methods of discovering "broken" words are required; perhaps performing a regular segmentation into words, detecting areas of the message that "do not make sense" to be processed differently, and then merging the outputs into a final result. As for the normalisation, it is possible that our approach here was not the best one. We would like to evaluate the results of using the same method as Wang et al. [WCTS14], in which repetitions are blindly condensed. For example, "WWW" would be condensed into just "W" and "butt" would be turned into "but"; still, the principle may have merit. Not only would it make deobfuscation independent of word length, it could also allow the edit distances to be less dominated by the unification operation. However, more attention would be required, since the profanity classifier could be reporting on a word that is not in the text. An entirely new pre-processing procedure may just be the best solution.

As we stated before, per-user customisation could also help with resolving some situations, as the system could be tailored to the preferences of a single person; but we must admit the difficulty of amassing the required volume of examples from a single user to make it worthwhile.

Finally, people do not usually deobfuscate swear words in isolation; they look at the chaining of the words and, when needed, use their experience to help them complete the pattern. It is for this reason that obfuscations such as "son of a *****" work: it is such a normative expression that people already expect it, and that gives the obfuscating author a greater degree of freedom. To acquire such information we generated n-grams from Twitter messages and, while we learned that they cannot tell us what swear word to expect due to the variety involved (save for very few expressions like the one above), we can use them to learn which swear words we have seen (even if infrequently) in such a context. This information could resolve ties between curse words and help correct the number and gender of the curse words proposed by our classification.

Chapter 9

Concluding remarks

The structure of the present work places the conclusions and future work within the individual work chapters. This section is thus used to provide a general overview of the work and the future directions we would like to take.

Microblogs brought us a new form of sporadic communication, based on short and immediate messages that can reach millions of people across the globe in an instant. The barrier to participation is fairly low, since the (artificial) limitation placed on the message length helps to blur the distinction between good writers and bad writers, between people with a message to share and people who simply chat, and is, in this way, quite inclusive. The value of microblogs is, to a large extent, centred on the personal world views that are shared by their users. Unfortunately it is not easy to get to them (automatically).

The language used in microblogs is not the "conventional" language used in the traditional written word. It is constrained in some ways and exceedingly liberal in others; it is strongly influenced by oral communication; it tries to express emotion through new means; users try to set themselves apart from each other through several creative methods. . . Adding to this we have mobile devices with their limited keyboards, as well as the usual diminished context present in each message — occasionally even humans feel challenged when reading microblog posts.

The pre-processing and normalization phases of messages are of paramount importance. They "translate" the microblog messages to a more common form, and facilitate the use of higher-level text processing tools (such as part-of-speech tagging, sentiment analysis and opinion mining).

In the present work we focused on a number of specific problems related to the pre-processing of User-Generated Content, with a strong emphasis on microblogs. These problems were practically motivated by our participation in research projects that were faced with these very problems when processing data from such sources.

Our research has shown how the process of tokenization can be quite complex on microblogs, not so much due to the special content that exists (e. g., URLs and hashtags), but because users usurp the long-established meaning of certain symbols, using them to create little faces (smileys), new codes (like "xoxo" to mean "hugs and kisses") or abbreviations (e. g. "/." for the website "Slashdot").

Faced with the very difficult task of generating and maintaining a rules-based system to tokenize such high-entropy messages, we opted to employ a machine learning method to infer correct rules of tokenization, and redefined the entire process as a classification problem (in Chapter 3). This had two major benefits: the system provides better performance (exceeding the 0.95 F1 value) than the regular-expression tokenizer we created explicitly for microblogs, and it also reduced the cost of maintenance by improving much faster than a rules-based system; our tests showed that providing more relevant examples is quite fast, simple, effective and safe (whereas adding new rules may have unforeseen consequences).

The task at hand determines what can be defined as a token. In the common case of regular and generic "text processing" — semantic analysis for example, where tokens represent words, smileys, URLs. . . — our tokenizer works very well. In other, lower-level situations (as is the case of the deobfuscation that we first approach in Chapter 7), the problem is turned upside-down: noise is not something to be tolerated, ignored or discarded; noise is the data, and it was purposefully made difficult to process. Everything else is, at most, a secondary consideration.

In this context obfuscation is the act of hiding text in a way that makes it harder to understand. Its most common use is disguising swearing, which is the setting in which we developed our work. As the name says, "deobfuscation" is the act of exposing the original text.

We worked with messages taken from a popular sports website, which we annotated into (we believe) the first free-access corpus for the study of obfuscation. Based on our observations we are led to believe that, when a filter is introduced to curb cursing, a significant number of users prefer to disguise their swearing rather than change their language. We also documented preferences in the methods used to disguise taboo words, some of which we cannot realistically expect to be handled by the current state of the art in computing.

With this knowledge we proceeded to design a method, based on extensions to the Levenshtein edit distance, that would be able to better recognize the true, hidden word. This proved to be a very difficult (adversarial) problem and we managed to obtain only a small improvement over the baseline — but still exceeding the 90% F1 measurement in classification. This work is described in Chapter 8, where we also describe some very specialized tools that are required to better address it.

A very different type of problem we approached was related to user classification, where we try to determine something about users based only on their writings. This prompted us to delve into the (language agnostic) writing style employed by the users. We tested the relevancy of our stylistic features by trying to assign authorship of several messages based on past samples, using an automatic classifier. This challenge is described in Chapter 4, where we report results exceeding a 0.6 F1 value with 3 candidate authors.
Such a result is comfortably above choice by pure chance (1/3). The most revealing characteristics were the marks of emotion, which include, for example, the way users express laughter and write smileys.

When it came to recognizing automatic posting systems (also known as bots), we employed another classification system where the stylistic features proved very useful (reaching about 0.9 accuracy by themselves in our experiments). This work is presented in Chapter 5, where we also explore other revealing features that helped us boost the classification accuracy to 0.97.

We also worked on trying to determine the nationality of a user based on the variant of the language they use. We explored this problem with the American and European variants of Portuguese (in Chapter 6). It was surprising to us that the stylistic features were not as helpful this time. That probably means that there is no "nation-wide" style of writing and that features such as the use of smileys and the indication of laughter are disseminated across cultures. To solve this problem we saw a great contribution from our lists of words and expressions characteristic of each side of the Atlantic. We were able to reach 0.95 accuracy when scanning 100 messages from each user, which was higher than the n-gram approach that is commonly used for identifying languages.

Many ideas and directions are described at the end of each work chapter. But if we had to choose the one where the greatest impact would be felt, it would be in the apparently basic tokenization process. We are convinced that there is significant ground to cover towards greater noise resilience, and a multi-stage approach would provide not only better accuracy but also more information for subsequent tools in the pipeline.

Appendix A

Nationality hint words

Table A.1: Words related to the Portuguese (European) variant of the Portuguese Language

Districts Provinces Aveiro Alentejo Beja Algarve Braga Alto Alentejo Braganca Alto Douro Bragança Baixo Alentejo Castelo Branco Beira Coimbra Beira Alta Evora Beira Baixa Évora Beira Litoral Faro Douro Guarda Douro Litoral Leiria Estremadura Lisboa Minho Lisbon Ribatejo Portalegre Tras-os-Montes Porto Trás-os-Montes Santarem Trás-os-Montes e Alto Douro Santarém Setubal Setúbal Viana do Castelo Vila Real Viseu


Football Companies Benfica CP FCP Clix Futebol Clube do Porto EDP Porto Optimus Porto PT SCP Phone-ix SLB Portugal Telecom Sporting Rede4 lagarto SAPO lagartos TAP lampiao TMN lampioes Uzo lampião Vodafone lampiões Yorn tripeiro tripeiros Personalities Vocabulary Alberto Martins Euro Ana Jorge Nestum Antonio Costa a gente António Costa abarrotar BE acto Bloco de Esquerda agreste CDS algures CDS-PP amo-a CDU amo-o Cavaco Silva amo-te Comunista apetece Comunistas apetecer Democrata autocarro Democrats biliao Durao Barroso bilião Durão Barroso briosa Fernando Pinto Monteiro brioso Francisco Assis bue Francisco Louca bué Isabel Alcada camiao Isabel Alçada camião Jorge Lacao casa-de-banho Jorge Lacão catraio 179

Jose Seguro chavala Jose Viegas chavalo José Seguro comboio José Viegas cá João Proenca electrónica João Proença facto Loucã fixe Louça gaiato Louçã gajo Luis Filipe Vieira ginasio Luís Filipe Vieira ginásio Mario Nogueira gira Miguel Macedo giras Miguel Relvas giro Mourinho giros Mário Nogueira golo PP invicta PS lavar os dentes PSD lx Partido Comunista mais pequeno Partido Socialista mb Passos Coelho melga Paulo Macedo miuda Paulo Portas miudas Pinto da Costa miúda Portas miúdas Santos Silva multibanco Sociais Democratas pa Social Democrata parvo Socialista parvoice Socialistas parvoíce Socrates petiscar Sócrates portugal Teixeira dos Santos portugues Vitor Gaspar portuguesa Vítor Gaspar portuguesas portugueses português presidente da câmara pá sacas se calhar sesta tabaco 180 APPENDIX A. NATIONALITY HINT WORDS

tasca telemovel telemóvel toda a gente treta triliao trilião tu vós ya

Table A.2: Words related to the Brazilian (American) variant of the Portuguese Language

States: Acre, Alagoas, Amapá, Amazonas, Bahia, Ceará, Distrito Federal, Espirito Santo, Espírito Santo, Goiás, Maranhão, Mato Grosso, Mato Grosso do Sul, Minas Gerais, Paraiba, Parana, Paraná, Paraíba, Pará, Pernambuco, Piaui, Piauí, Rio Grande do Norte, Rio Grande do Sul, Rio de Janeiro, Rondonia, Rondônia, Roraima, Santa Catarina, Sergipe, São Paulo, Tocantins

Cities: Belo Horizonte, Brasilia, Brasília, Curutiba, Forteleza, Manaus, Porto Alegre, Recife, Rio, Rio de Janeiro, Salvador, Sao Paulo, São Paulo

Football: Atletico, Atletico Paranaense, Atlético Mineiro, Atlético Paranaense, Bambi, Bota Fogo, Corinthians, Cruzeiro, Curintia, Flamengo, Fluminense, Gremio, Grêmio, Internacional, Mengao, Mengão, Palmeiras, Paranaense, Santos, São Paulo, Urubu, Urubus, Vasco, Vasco da Gama

Companies: Band, Bandeirantes, Claro, Globo, InfrAero, Oi, Petrobras, Petrobrás, Record, Rede Globo, Rede TV, SBT, TV Cultura, Tam, Tim, Vale do Rio Doce, Vivo, Voe Gol

Personalities: Adriana Galisteu, Adriana Lima, Alcione Nazaré, Ayrton Senna, Caetano veloso, Cláudia Leite, DEM, Didi, Dilma, Eliana, Fafa, Fafá, Faustao, Faustão, Fenomeno, Fenómeno, Galvao Bueno, Galvão Bueno, Ganso, Gisele, Gugu, Ivete Sangalo, Jo Soares, Jô Soares, Lima Duarte, Lucas, Luis Fabiano, Lula, Luís Fabiano, Neymar, PC do B, PDT, PMDB, PPS, PSDB, Panicat, Rodrigo Santoro, Rogerio Ceni, Ronaldinho, Tiririca, Xuxa

Vocabulary: BBB, academia, acesse, alo, alô, amo voce, amo você, apavorado, avacalhar, aí ela, aí ele, aí nós, aí você, babado, bacana, bala, banheiro, barramento, bicha, bilhao, bilhoes, bilhão, bilhões, bobao, bobo, bobão, bombom, bombril, botar, botaram, botou, brasil, brasileira, brasileiras, brasileiro, brasileiros, brega, brotar, cachaca, cachaça, cachorra, cachorro, cade, cadê, caixa electronica, caixa electrônica, cala a boca, caminhao, caminhão, cara, celular, chiclete, copa, credo, câncer, demais, desencana, desetressa, direito, doida, doidao, doido, doidão, durmir, econômico, eita, electronica, electrônica, eletrônica, eletrônico, em meu, em minha, enturmar, equipe, escovar os dentes, esporte, é negócio, fala sério, festão, fofoca, galera, ganhador, garota, garoto, gaucho, gaúcho, gente, gol, gorducha, gringo, jeito, legal, libertadores, linda, lindo, machuca, machucada, machucado, machucou, machuquei, mais tem, malucona, mamae, mamãe, mané, mascada, mascado, mascando, mascar, metrô, milha, milhas, moca, moco, molenga, moleque, moleza, mouse, moça, moço, mulherada, na raça, ne, negao, negrao, negrão, negão, nessa, nesse, num, num fosse, num pode, né, onibus, ónibus, orkut, ô louco, papai, pegar, pelada, peladinha, picanha, pisar na bola, pisou na bola, planejar, planeje, planejei, planejo, planejou, poxa, pra, presente, pro, puxa, puxa saco, puxa vida, qi, que saco, quero ver, rola, rolar, rolou, samba, sambar, sambei, saque, sei não, sertao, sertão, show, so tem, soneca, sossega o facho, só tem, te amo, time, to, todo mundo, tolo, tomar cafe, tomar café, tou nem aí, trem, trilhao, trilhoes, trilhão, trilhões, tô, ué, vagabunda, vagabundo, vai na, vai no, vai pro, vc, vcs, vei, vem nao, vem não, vestibular, viado, viu, voce, voces, você, vocês, vovó, véi, xoxar, zoar

Appendix B

Swear words

Table B.1: Methods used by the authors to obfuscate swear words in our sample of the SAPO Desporto dataset.

Symbol (Name): Description

−Ac (Accent removed): A diacritical mark is removed from the word. Examples: “cabrão” → “cabrao”, “piço” → “pico”.
−C (Character removed): One or more letters and/or symbols are removed from the word, such as “merda” → “mrda”.
+Ac (Accent added): A diacritical mark is added where it is not supposed to exist. This alteration seldom has any phonetic impact on the word. For example: “cocó” → “cócó”.
+L (Letter added): An extra letter is added to the word, but it is not a repetition of the preceding letter. Example: “paneleiro” → “pandeleiro”.
+N (Number added): A seemingly random number is added to the word. Single occurrence: “puta” → “put@2”.
+P (Punctuation added): A punctuation sign, or similar character, is inserted into the word. We noticed that these characters are often used as separators since they are easily distinguished from letters. An author may choose to use more than one such character in a word. Example: “fodam” → “f-o-d-a-m”.
+S (Symbol added): A symbol not from our punctuation set is inserted in a word. Example: “foder” → “fu£der”.
+Sp (Space added): A space is used to break the word into two or more segments. As in “cornos” → “co rnos”, “puta” → “p u t a”.
=Ac (Change accent): One letter with an accent is replaced by the same letter bearing a different accent. No pure example appears in our sample, but it can be observed in “morcões” → “murcóes”.
=L (Letter substituted): One letter is replaced by one other letter. Usually this change does not alter the pronunciation of the word.
=N (Number as a letter): A number takes the place of a letter. For example, “3” replaces an “e”, “4” stands for an “a”, and “0” is used as an “o”. Often the digit that is introduced shares a graphical resemblance with the letter it replaces. Example: “foda” → “f0da”.
=P (Punctuation as a letter): One of the characters of our punctuation set is used as a placeholder for one or more letters. Example: “puta” → “p...”.
=S (Symbol as a letter): A symbol from outside our punctuation set is used as a letter. Example: “merda” → “m€rda”.
Ag (Word aggregation): Two words are combined into just one, but not always by simple concatenation, like a portmanteau. For example, “americuzinho”, combining “Américo” and “cuzinho” (“cu” and “co” sounding similar in this case).
Cp (Complex alteration): We use this classification to encompass all forms of obfuscation that are too complex to be described with the other methods. A common occurrence is “fdp”, which are the initials for “son of a bitch” (“sob”) in Portuguese.
Ph (Phonetic-driven substitution): The author changes the word in ways that make it sound the same when read, but goes beyond a simple letter substitution. Example: “foda-se” → “fodassse”.
Pun (Pun): The author relies on the context to explain the obfuscation. One example is “caralho” → “carvalho” (which is Portuguese for “oak tree”).
R (Repetition): One letter is repeated. As an example we have “merda” → “merddddddddda”.
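Most of the methods above leave the obfuscated token within a small edit distance of its source word. As a purely illustrative sketch (not the deobfuscation system evaluated in this work), the following fragment strips non-letter characters, which undoes number, punctuation, symbol and space insertions (+N, +P, +S, +Sp) in a single token, and then matches the remainder against a lexicon by Levenshtein distance; the names LEXICON and deobfuscate, the toy lexicon and the 0.5 threshold are assumptions made only for this example.

def levenshtein(a, b):
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Toy lexicon; the full one would hold the 109 profanities of Table B.2.
LEXICON = ["merda", "caralho", "foda", "puta", "cabrão"]

def deobfuscate(token, max_ratio=0.5):
    """Return the closest lexicon entry if it is near enough, otherwise None."""
    cleaned = "".join(ch for ch in token.lower() if ch.isalpha())
    best = min(LEXICON, key=lambda w: levenshtein(cleaned, w))
    if levenshtein(cleaned, best) / max(len(best), 1) <= max_ratio:
        return best
    return None

print(deobfuscate("m3rd@"))    # prints "merda"
print(deobfuscate("f-o-d-a"))  # prints "foda"
print(deobfuscate("bonito"))   # prints "None"

Methods such as Ag, Cp and Pun fall outside what a plain edit-distance match can recover, which is one reason the problem amounts to more than simple spelling correction.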

Table B.2: All 109 profanities put into 40 groups, showing the number of different ways each one was written. Words filtered by SAPO are in bold. The instances column accounts for the popularity of each group.

Instances: number of variants and swear words

5: 1 badalhoca, 1 badalhoco, 1 abadalhocado
4: 1 bastardos
9: 1 bico, 1 bicos
39: 3 bosta, 2 bostas
7: 3 broche, 1 broches, 2 brochista
47: 14 cabrão, 5 cabrões
20: 1 cagado, 1 cagadeira, 4 cagalhão, 2 cagalhões, 1 caganeira, 1 cagarolas, 3 cago, 1 caguem
36: 18 caralho, 1 caralhinhos, 2 caralhos
1: 1 chulecos
2: 2 cocó
3: 3 colhões
6: 5 cona, 1 conas
191: 9 corno, 8 cornos, 1 cornudas, 3 cornudo, 1 cornudos, 1 encornada, 1 encornadinhos, 2 encornado, 1 encornador, 2 encornados, 1 encornar, 1 encornei
71: 11 cu, 3 cuzinho
9: 1 enraba, 1 enrabada, 1 enrabado, 2 enrabados, 2 enrabar, 1 enrabá-lo
6: 3 esporra, 1 esporrada
47: 8 foda, 3 fodam, 9 foda-se, 4 fode, 1 fodei-vos, 1 fodem, 1 fodendo, 5 foder, 1 fodesses, 1 fodeu, 2 fodida, 2 fodido, 2 fodidos, 1 fodo
2: 1 mamões, 1 mamadas
5: 2 maricas, 1 mariconço
1: 1 masturbar-se
87: 40 merda, 1 merditas, 2 merdoso, 1 bardamerda
1: 1 mijo
29: 6 morcão, 1 morcãozada, 5 morcões, 1 morcona
2: 1 pachacha, 1 pachachinha
8: 7 panasca, 1 panascos
13: 9 paneleiro, 1 paneleirice, 2 paneleiros, 1 paneleirote
2: 1 panisgas
5: 5 peida
4: 2 piça, 1 piças
2: 2 picha
4: 2 pila, 1 pilas
5: 3 piroca, 2 pirocas
2: 1 pisso
12: 1 pizelo, 1 pizelos
9: 1 porcalhota
1: 1 punhetas
73: 35 puta, 5 putas, 1 putéfia, 1 putinha
9: 2 rabeta, 1 rabetas
1: 1 rameira
3: 3 tomates

Bibliography

[AC08] Ahmed Abbasi and Hsinchun Chen. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Trans. Inf. Syst., 26(2):1–29, 2008.

[ACDK08] Rema Ananthanarayanan, Vijil Chenthamarakshan, Prasad M Deshpande, and Raghuram Krishnapuram. Rule based synonyms for entity extraction from noisy text. In Proceedings of the second workshop on Analytics for noisy unstructured text data, AND ’08, page 31–38, New York, NY, USA, 2008. ACM.

[Alm14] José João Almeida. Dicionário aberto de calão e expressões idiomáticas, October 2014. (Retrieved on 2017-08-07).

[ANSR08] Sreangsu Acharyya, Sumit Negi, L. V. Subramaniam, and Shourya Roy. Unsupervised learning of multilingual short message service (SMS) dialect from noisy examples. In Pro- ceedings of the second workshop on Analytics for noisy unstruc- tured text data, AND ’08, page 67–74, New York, NY, USA, 2008. ACM.

[BDF+13] Kalina Bontcheva, L Derczynski, A Funk, M.A. Greenwood, Diana Maynard, and Niraj Aswani. TwitIE: An open-source information extraction pipeline for microblog text. In Interna- tional Conference Recent Advances in Natural Language Process- ing, RANLP, page 83–90, 01 2013.

[Ber17] Leonid Bershidsky. Why I didn’t laugh at Trump’s ’covfefe’ tweet: Bloomberg View. http://www.oregonlive.com/opinion/index.ssf/2017/06/why_i_didnt_laugh_at_trumps_co.html, June 2017.

[BFD+10] Sriram Bharath, Dave Fuhry, Engin Demir, Hakan Ferhatosmanoglu, and Murat Demirbas. Short text classification in twitter to improve information filtering. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’10, page 841–842, New York, NY, USA, 2010. ACM.

[BI14] Arturs Backurs and Piotr Indyk. Edit Distance Cannot Be Computed in Strongly Subquadratic Time (unless SETH is false). CoRR, abs/1412.0348, 2014.

[BM00] Eric Brill and Robert C Moore. An improved error model for noisy channel spelling correction. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, ACL ’00, page 286–293, Morristown, NJ, USA, 2000. Association for Computational Linguistics.

[BMRA10] Fabrício Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida. Detecting spammers on Twitter. In Proceedings of the 7th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), 2010.

[Boi09] Leonardo Boiko. How to count to 140: or, characters vs. bytes vs. entities, third strike. http://www.mail-archive.com/twitter-development-talk@googlegroups.com/msg06972.html, May 2009.

[Bra70] Fernand Braudel. Civilização material e capitalismo, volume 1. Edições Cosmos, 1970.

[Bul09] Susanne Bullo. Microblog Battle - Twitter, Jaiku and Identi.ca. https://web.archive.org/web/20110127203656/http://www.suite101.com/content/microblog-battle-twitter-jaiku-and-identica-a170508, November 2009. (Retrieved on 2011-01-23).

[Bur10] Rick Burnes. When Do Most People Tweet? At the End of the Week. http://blog.hubspot.com/blog/tabid/6307/bid/5500/When-Do-Most-People-Tweet-At-the-End-of-the-Week.aspx, January 2010.

[Cad] Sean Cadhain. Teen textuality and the txt flirt. http://www.txt2nite.com/teen-textuality-and-the-txt-flirt.html. (Retrieved on 2017-04-22).

[CCL10] Zhiyuan Cheng, James Caverlee, and Kyumin Lee. You are where you tweet: a content-based approach to geo-locating twitter users. In CIKM, page 759–768. ACM, 2010.

[CDPS09] Noah Constant, Christopher Davis, Christopher Potts, and Florian Schwarz. The pragmatics of expressive content: Evidence from large corpora. Sprache und Datenverarbeitung: International Journal for Language Data Processing, 33(1-2):5–21, 2009.

[CGWJ10] Zi Chu, Steven Gianvecchio, Haining Wang, and Sushil Jajodia. Who is tweeting on Twitter: human, bot, or cyborg? In Carrie Gates, Michael Franz, and John P. McDermott, editors, Proceedings of the 26th Annual Computer Security Applications Conference, ACSAC ’10, page 21–30, New York, NY, USA, 2010. ACM.

[Che] Marc Cheong. ‘What are you Tweeting about?’: A survey of Trending Topics within Twitter. http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=C213B0C404F694DB7F76EAFC0EB4E59B?doi=10.1.1.158.4679&rep=rep1&type=pdf. (Retrieved on 2018-01-04).

[Che18] Mike Cherney. Lawyers Faced With Emojis and Emoticons Are All ¯\_(ツ)_/¯. https://www.wsj.com/articles/lawyers-faced-with-emojis-and-emoticons-are-all-1517243950, January 2018. Via https://9to5mac.com/2018/01/30/emoji-case-law/ (Retrieved on 2018-02-01).

[CHFT07] Richard J. Crisp, Sarah Heuston, Matthew J. Farr, and Rhian- non N. Turner. Seeing Red or Feeling Blue: Differentiated In- tergroup Emotions and Ingroup Identification in Soccer Fans. Group Processes Intergroup Relations, 10(1):9–26, February 2007.

[Chi11] Oliver Chiang. Twitter Hits Nearly 200M Accounts, 110M Tweets Per Day, Focuses On Global Expansion. http://blogs.forbes.com/oliverchiang/2011/01/19/twitter-hits-nearly-200m-users-110m-tweets-per-day-focuses-on-global-expansion/, January 2011.

[CL01] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[CL09] Marc Cheong and Vincent Lee. Integrating web-based intelligence retrieval and decision-making from the twitter trends knowledge base. In SWSM ’09: Proceedings of the 2nd ACM workshop on Social web search and mining, page 1–8, New York, NY, USA, 2009. ACM.

[CMP11] Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. Information credibility on twitter. In Proceedings of the 20th international conference on World wide web, WWW ’11, page 675–684, New York, NY, USA, 2011. ACM.

[CNN+10] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi. Short and Tweet: Experiments on Recommending Content from Information Streams. In ACM Conference on Human Factors in Computing, Atlanta, GA, April 2010. Association for Computing Machinery.

[Coh61] Marcel Cohen. A escrita. Colecção Saber. Publicações Europa-América, Rua das Flores, 45, September 1961.

[Coh98] Elliot D. Cohen. Offensive message interceptor for computers, August 1998.

[con18a] Wikipedia contributors. Microblogging in China. https://en.wikipedia.org/wiki/Microblogging_in_China, February 2018. (Retrieved on 2018-02-05).

[con18b] Wikipedia contributors. Seven dirty words — Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/w/index.php?title=Seven_dirty_words&oldid=819510861, 2018. (Retrieved on 2018-01-10).

[con18c] Wikipedia contributors. Tencent QQ. https://en.wikipedia.org/wiki/Tencent_QQ, February 2018. (Retrieved on 2018-02-05).

[CT94] William B. Cavnar and John M. Trenkle. N-Gram-Based Text Categorization. In Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, page 161–175, 1994.

[Cul10] Aron Culotta. Detecting influenza outbreaks by analyzing Twitter messages. In 1st Workshop on Social Media Analytics (SOMA ’10), July 2010.

[CWT13] Simon Carter, Wouter Weerkamp, and Manos Tsagkias. Microblog Language Identification: Overcoming the Limitations of Short, Unedited and Idiomatic Text. Language Resources and Evaluation Journal, 2013.

[Dam64] Fred J. Damerau. A Technique for Computer Detection and Correction of Spelling Errors. Commun. ACM, 7(3):171–176, March 1964.

[DH09] Lipika Dey and Sk. Mirajul Haque. Opinion mining from noisy text data. International Journal on Document Analysis and Recognition, 12:205–226, 2009.

[Dij59] Edsger W Dijkstra. A note on two problems in connexion with graphs. Numerische mathematik, 1(1):269–271, 1959.

[DMAB13] Leon Derczynski, Diana Maynard, Niraj Aswani, and Kalina Bontcheva. Microblog-genre Noise and Impact on Semantic Annotation Accuracy. In Proceedings of the 24th ACM Confer- ence on Hypertext and Social Media, HT ’13, page 21–30, New York, NY, USA, 2013. ACM.

[dSL06] Joaquim Ferreira da Silva and Gabriel Pereira Lopes. Identification of Document Language is not yet a completely solved problem. In Lotfi A. Zadeh and Stephen Grossberg, editors, Proceedings of the International Conference on Computational Intelligence for Modelling Control and Automation and International Conference on Intelligent Agents Web Technologies and International Commerce (CIMCA-IAWTIC’06), page 212–219, Los Alamitos, CA, USA, November 2006. IEEE Computer Society.

[dVACM01] O. de Vel, A. Anderson, M. Corney, and G. Mohay. Mining e-mail content for author identification forensics. SIGMOD Rec., 30(4):55–64, 2001.

[Eag94] Robert Eagleson. Forensic analysis of personal written texts: a case study. In John Gibbons, editor, Forensic Linguistics: An Introduction to Language in the Justice System, page 362–373. Longman, 1994.

[Edm17] David Edmonds. Why do people swear? http://www.bbc.com/news/magazine-39082467, February 2017. (Retrieved on 2017-08-31).

[Eis05] Elizabeth L. Eisenstein. The Printing Revolution in Early Modern Europe. Cambridge University Press, 2005.

[FKS03] Ronald Fagin, Ravi Kumar, and D. Sivakumar. Comparing top k lists. In Proceedings of the fourteenth annual ACM-SIAM Symposium on Discrete algorithms, SODA ’03, page 28–36, 2003.

[FLKS17] Gilad Feldman, Huiwen Lian, Michal Kosinski, and David Stillwell. Frankly, We Do Give a Damn. Social Psychological and Personality Science, January 2017.

[FPM+09] Clayton Fink, Christine D. Piatko, James Mayfield, Tim Finin, and Justin Martineau. Geolocating Blogs from Their Textual Content. In AAAI Spring Symposium: Social Semantic Web: Where Web 2.0 Meets Web 3.0, page 25–26. AAAI, 2009.

[FPR06] Giorgio Fumera, Ignazio Pillai, and Fabio Roli. Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research, 7:2006, 2006.

[Gar10] Sean Garrett. Big Goals, Big Game, Big Records. http://blog.twitter.com/2010/06/big-goals-big-game-big-records.html, June 2010.

[GCCG11] Roberto Gonzalez, Ruben Cuevas, Angel Cuevas, and Carmen Guerrero. Where are my followers? Understanding the Effect in Twitter. ArXiv e-prints, May 2011.

[Ges14] Colleen Geske. Stuff Dutch People Like: A Humorous Cultural Guide to the Netherlands. Stuffdutchpeoplelike.com, 2014. (Retrieved on 2017-12-21).

[GL10] Thomas Gottron and Nedim Lipka. A comparison of language identification approaches on short, query-style texts. In Proceedings of the 32nd European conference on Advances in Information Retrieval (ECIR’2010), page 611–614, 2010.

[GLN08] Lena Grothe, Ernesto William De Luca, and Andreas Nürnberger. A Comparative Study on Language Identification Methods. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), 2008.

[Gra10] Tim Grant. Txt 4n6: free authorship analysis. In Mal- colm Coulthard and Alison Johnson, editors, Routledge Hand- book of Forensic Linguistics. Routledge, 2010.

[Gri03] Matthew Grimm. When the Sh*t Hits the Fan. American Demographics, 25(10), December 2003. Via Tracy V. Wilson, Profanity: Swearing, Cursing and Expletives. SHARE, 12(208), November 2010, http://www.shareeducation.com.ar/pastissues3/208/1.htm (Retrieved on 2017-08-31).

[GSR09] Sumit Goswami, Sudeshna Sarkar, and Mayur Rustagi. Stylometric Analysis of Bloggers’ Age and Gender. In International AAAI Conference on Weblogs and Social Media, 2009.

[GTPZ10] Chris Grier, Kurt Thomas, Vern Paxson, and Michael Zhang. @spam: the underground on 140 characters or less. In Proceedings of the 17th ACM conference on Computer and communications security, CCS ’10, page 27–37, New York, NY, USA, 2010. ACM.

[HAAD+98] B. Habert, G. Adda, M. Adda-Decker, P. Boula de Marëuil, S. Ferrari, O. Ferret, G. Illouz, and P. Paroubek. Towards tokenization evaluation. In Navidad Gallardo, Rosa Castro, Antonio Rubio, and Antonio Tejada, editors, Proceedings of First International Conference on Language Resources and Evaluation (LREC), volume I, page 427–431, Granada, Spain, 1998.

[Ham50] R. W. Hamming. Error Detecting and Error Correcting Codes. The Bell System Technical Journal, 26(2):147–160, 1950.

[HBB+06] Baden Hughes, Timothy Baldwin, Steven Bird, Jeremy Nicholson, and Andrew Mackinlay. Reconsidering language identification for written language resources. In Proceedings of the fifth International Conference on Language Resources and Evaluation (LREC’2006), page 485–488, 2006.

[HCC11] Lichan Hong, Gregorio Convertino, and Ed Chi. Language Matters In Twitter: A Large Scale Study. In Proceedings of the fifth International AAAI Conference on Weblogs and Social Media (ICWSM’2011), page 518–521, 2011.

[Her17] Josh Paul Herron. Moving Composition: Writing in a Mobile World. PhD thesis, Clemson University, 2017.

[HF07] Graeme Hirst and Ol’ga Feiguina. Bigrams of Syntactic Labels for Authorship Discrimination of Short Texts. Literary and Linguistic Computing, 22(4):405–417, 2007.

[Hir85] Robert Hirsch. Taxonomies of Swearing. In Perspectives on Swearing, number 2 in Swearing Report, page 37–59. Univer- sity of Gothenburg, Dept. of Linguistics, 1985.

[Hub10] HubSpot. State of the Twittersphere. Technical report, HubSpot, January 2010.

[Hum10] Lee Humphreys. Historicizing Microblogging. CHI 2010, page 1–4, 2010.

[Ins12] IP Instituto Nacional de Estatística, editor. Estatísticas Demográficas 2010. Instituto Nacional de Estatística, Av. António José de Almeida 1000-043 Lisboa Portugal, 2012.

[Jay92] Timothy Jay. Cursing in America: A Psycholinguistic Study of Dirty Language in the Courts, in the Movies, in the Schoolyards and on the Streets. John Benjamins Publishing Company, January 1992.

[Jay09] Timothy Jay. The utility and ubiquity of taboo words. Perspectives on Psychological Science, 4(2):153–161, 2009.

[JJ07] Timothy Jay and Kristin Janschewitz. Filling the emotional gap in linguistic theory: Commentary on Potts’ expressive dimension. Theoretical Linguistics, 33(2):215–221, 2007.

[JKMdR08] Valentin Jijkoun, Mahboob Alam Khalid, Maarten Marx, and Maarten de Rijke. Named entity normalization in user generated content. In Proceedings of the second workshop on Analytics for noisy unstructured text data, AND ’08, page 23–30, New York, NY, USA, 2008. ACM.

[JKR+99] Varghese Jacob, Ramayya Krishnan, Young U. Ryu, R. Chandrasekaran, and Sungchul Hong. Filtering objectionable internet content. In Proceedings of the 20th international conference on Information Systems, ICIS ’99, page 274–278, Atlanta, GA, USA, 1999. Association for Information Systems.

[JL08] Nitin Jindal and Bing Liu. Opinion spam and analysis. In WSDM ’08: Proceedings of the international conference on Web search and web data mining, page 219–230, New York, NY, USA, 2008. ACM.

[Joa98] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398 in Lecture Notes in Computer Science, page 137–142, Chemnitz, DE, 1998. Springer Verlag, Heidelberg, DE.

[JZSC09] Bernard J. Jansen, Mimi Zhang, Kate Sobel, and Abdur Chowdury. Twitter power: Tweets as electronic word of mouth. Journal of the American Society for Information Science and Technology, 60(11):2169–2188, 2009.

[KF67] Henry Kučera and W. Nelson Francis. Computational analysis of present-day American English. Brown University Press, Providence, RI, 1967.

[KM01] Taku Kudo and Yuji Matsumoto. Chunking with support vector machines. In Second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies 2001, NAACL ’01, page 1–8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.

[KNF+09] Govind Kothari, Sumit Negi, Tanveer A. Faruquie, Venkate- san T. Chakaravarthy, and L. Venkata Subramaniam. SMS based interface for FAQ retrieval. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th In- ternational Joint Conference on Natural Language Processing of the AFNLP: Volume 2 - Volume 2, ACL-IJCNLP ’09, page 852–860, Morristown, NJ, USA, 2009. Association for Computational Linguistics.

[KSA09] Moshe Koppel, Jonathan Schler, and Shlomo Argamon. Com- putational Methods in Authorship Attribution. Journal of the American Society for Information Science and Technology, Volume 60(Issue 1):9–26, January 2009.

[Kuk92] Karen Kukich. Techniques for automatically correcting words in text. ACM Computer Survey, 24:377–439, December 1992.

[LBS+13] Gustavo Laboreiro, Matko Bošnjak, Luís Sarmento, Eduarda Mendes Rodrigues, and Eugénio Oliveira. Determining language variant in microblog messages. In Proceedings of the 28th Annual ACM Symposium on Applied Computing 2013, page 902–907. Association for Computing Machinery, 2013.

[Lev66] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8):707–710, February 1966.

[LF09] Amanda Lenhart and Susannah Fox. Twitter and status up- dating, Fall 2009. Technical report, Pew Research Center, Oc- tober 2009.

[Lju10] Magnus Ljung. Swearing: A Cross-Cultural Linguistic Study. Palgrave Macmillan, 2011 edition, November 2010.

[LMB07] Nikola Ljubesic, Nives Mikelic, and Damir Boras. Language Identification: How to Distinguish Similar Languages? In Proceedings of 29th International Conference on Information Technology Interfaces (ITI’2007), page 541–546, 2007.

[LO14a] Gustavo Laboreiro and Eugénio Oliveira. Avaliação de métodos de desofuscação de palavrões. Linguamática, 6(2):25–43, December 2014.

[LO14b] Gustavo Laboreiro and Eugénio Oliveira. What we can learn from looking at profanity. In Jorge Baptista, Nuno Mamede, Sara Candeias, Ivandré Paraboni, Thiago A. S. Pardo, and Maria das Graças Volpe Nunes, editors, Computational Processing of the Portuguese Language, volume 8775 of Lecture Notes in Computer Science, page 108–113. Springer International Publishing, September 2014. http://labs.sapo.pt/2014/05/obfuscation-dataset/.

[Lov17] Robbie Love. Bad language revisited: swearing in the Spoken BNC2014. In 2017 Conference. University of Birmingham, July 2017.

[LSO11] Gustavo Laboreiro, Luís Sarmento, and Eugénio Oliveira. Identifying Automatic Posting Systems in Microblogs. In Luis Antunes and H. Pinto, editors, Progress in Artificial Intelligence, volume 7026 of Lecture Notes in Computer Science, page 634–648. Springer Berlin / Heidelberg, 2011. 10.1007/978-3-642-24769-9_46.

[LSTO10] Gustavo Laboreiro, Luís Sarmento, Jorge Teixeira, and Eugénio Oliveira. Tokenizing micro-blogging messages using a text classification approach. In Proceedings of the fourth workshop on analytics for noisy unstructured text data, AND ’10, page 81–88, New York, NY, USA, October 2010. ACM. http://labs.sapo.pt/2011/11/sylvester-ugc-tokenizer/.

[LWD10] Robert Layton, Paul Watters, and Richard Dazeley. Author- ship Attribution for Twitter in 140 Characters or Less. In Pro- ceedings of the 2010 Second Cybercrime and Trustworthy Comput- ing Workshop, CTC ’10, page 1–8, Washington, DC, USA, 2010. IEEE Computer Society.

[Mad09] Shivendu Madhava. Top 3 Microblogging Plateforms to Promote Your Content. https://web.archive.org/web/20100729025620/http://www.technozeast.com/top-3-microblogging-plateforms-to-promote-your-content.html, May 2009. (Retrieved on 2011-01-23).

[MAK08] Altaf Mahmud, Kazi Zubair Ahmed, and Mumit Khan. De- tecting flames and insults in text. Technical report, BRAC University, 2008.

[Mak09] Kevin Makice. Twitter API: Up and Running. O’Reilly Me- dia, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472, first edition, March 2009.

[Mar14] Marco Polo Travel Publishing. Italian Marco Polo Phrasebook. Marco Polo Phrasebook. Marco Polo Travel Publishing, Ltd., July 2014.

[MBR12] Diana Maynard, Kalina Bontcheva, and Dominic Rout. Chal- lenges in developing opinion mining tools for social media. In Proceedings of @NLP can u tag #usergeneratedcontent?!, Work- shop at LREC 2012, May 2012.

[McE06] Tony McEnery. Swearing in English: Bad Language, Purity and Power from 1586 to the Present. Routledge Advances in Corpus Linguistics. Routledge, 2006.

[MK09] Michael Mathioudakis and Nick Koudas. Efficient identifi- cation of starters and followers in social media. In EDBT ’09: Proceedings of the 12th International Conference on Extend- ing Database Technology, page 708–719, New York, NY, USA, 2009. ACM.

[MK10] Michael Mathioudakis and Nick Koudas. TwitterMonitor: trend detection over the twitter stream. In Proceedings of the 2010 international conference on Management of data, SIGMOD ’10, page 1155–1158, New York, NY, USA, 2010. ACM.

[MP03] Matthias R. Mehl and James W. Pennebaker. The sounds of social life: A psychometric analysis of students’ daily social environments and natural conversations. Journal of Personality and Social Psychology, 84(4):857–870, 2003.

[MS99] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.

[MS05] Bruno Martins and Mário J. Silva. Language identification in web pages. In Proceedings of the 2005 ACM symposium on Applied computing (SAC’05), page 764–768, 2005.

[MX03] AM McEnery and Zhonghua Xiao. Fuck revisited. In Pro- ceedings of the corpus linguistics 2003 conference, page 504–512. Centre for Computer Corpus Research on Language Techni- cal Papers, 2003.

[MX04] Anthony McEnery and Zhonghua Xiao. Swearing in modern British English: the case of fuck in the BNC. Language and Literature, 13(3):235–268, August 2004.

[Nav01] Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31–88, 2001.

[NW70] Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970. Largest Common Substring.

[NY00] Grace Ngai and David Yarowsky. Rule Writing or Annota- tion: Cost-efficient Resource Usage For Base Noun Chunk- ing. In In Proceedings of the 38th Meeting of ACL, page 117–125, 2000.

[OBRS10] Brendan O’Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. From Tweets to Polls: Link- ing Text Sentiment to Public Opinion Time Series. In Proceed- ings of the International AAAI Conference on Weblogs and Social Media, 2010.

[Pia14] Steven T Piantadosi. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review, 21(5):1112–1130, 2014.

[PJO07] D. Pavelec, E. Justino, and L. S. Oliveira. Author Identifica- tion using Stylometric Features. Inteligencia Artificial, Revista Iberoamericana de IA, 11(36):59–66, 2007.

[PLZC09] Thomas Park, Jiexun Li, Haozhen Zhao, and Michael Chau. Analyzing Writing Styles of Bloggers with Different Opin- ions. In Proceedings of the 19th Annual Workshop on Information Technologies and Systems (WITS’09), Phoenix, Arizona, USA, December 14-15, 2009 2009.

[PPSC+12] Daniel Preotiuc-Pietro, Sina Samangooei, Trevor Cohn, Nicholas Gibbins, and Mahesan Niranjan. Trendminer: An Architecture for Real Time Analysis of Social Media Text. In Proceedings of the Sixth International AAAI Conference on We- blogs and Social Media, Workshop on Real-Time Analysis and Min- ing of Social Streams (ICWSM’12), 06 2012.

[PT 94] PT Staff. Oh, #@*...!! Psychology Today, 27(3), May 1994. (Retrieved on 2017-08-31).

[RCD10] Alan Ritter, Colin Cherry, and Bill Dolan. Unsupervised Mod- eling of Twitter Conversations. In Proc. HLT-NAACL 2010, page 172–180, Jun 2010.

[Rea05] Jonathon Read. Using emoticons to reduce dependency in machine learning techniques for sentiment classification. In ACL ’05: Proceedings of the ACL Student Research Workshop, page 43–48, Morristown, NJ, USA, 2005. Association for Computational Linguistics.

[RIUM10] Amir Hossein Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. Offensive language detection using multi-level clas- sification. In Atefeh Farzindar and Vlado Keselj, editors, Ad- vances in Artificial Intelligence, volume 6085 of Lecture Notes in Computer Science, page 16–27. Canadian Conference on Arti- ficial Intelligence, Springer, 2010.

[RKM10] Sindhu Raghavan, Adriana Kovashka, and Raymond Mooney. Authorship attribution using probabilistic context- free grammars. In Proceedings of the ACL 2010 Conference Short Papers, ACLShort ’10, page 38–42, Stroudsburg, PA, USA, 2010. ACL, Association for Computational Linguistics.

[RYSG10] Delip Rao, David Yarowsky, Abhishek Shreevats, and Man- aswi Gupta. Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and min- ing user-generated contents, SMUC ’10, page 37–44. ACM, 2010.

[SAC12a] Sara Owsley Sood, Judd Antin, and Elizabeth F. Churchill. Profanity use in online communities. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’12, page 1481–1490, New York, NY, USA, 2012. ACM.

[SAC12b] Sara Owsley Sood, Judd Antin, and Elizabeth F. Churchill. Using crowdsourcing to improve profanity detection. In AAAI Spring Symposium: Wisdom of the Crowd, volume SS-12- 06 of AAAI Technical Report. AAAI, 2012.

[SCA11] Sara Owsley Sood, Elizabeth F. Churchill, and Judd Antin. Automatic identification of personal insults on social news sites. Journal of the American Society for Information Science and Technology, October 2011.

[Sma09] Roland Smart. The Value of UGC. http://www.rolandsmart.com/2009/02/the-value-of-ugc/, February 2009. (Retrieved on 2010-11-23).

[SOM10] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earth- quake shakes Twitter users: real-time event detection by so- cial sensors. In Proceedings of the 19th international conference on World wide web, WWW ’10, page 851–860, New York, NY, USA, 2010. ACM.

[Spe97] Ellen Spertus. Smokey: Automatic recognition of hostile messages. In Proceedings of Innovative Applications of Artificial Intelligence (IAAI), page 1058–1065, 1997.

[SRFN09] L. Venkata Subramaniam, Shourya Roy, Tanveer A. Faruquie, and Sumit Negi. A survey of types of text noise and tech- niques to handle noisy text. In Proceedings of The Third Work- shop on Analytics for Noisy Unstructured Text Data, AND ’09, page 115–122, New York, NY, USA, 2009. ACM.

[SSG06] Kaveri Subrahmanyam, David Smahel, and Patricia Green- field. Connecting Developmental Constructions to the Inter- net: Identity Presentation and Sexual Exploration in Online Teen Chat Rooms. Developmental Psychology, 42(3):395–406, 2006.

[SSLS+11] Rui Sousa-Silva, Gustavo Laboreiro, Luís Sarmento, Tim Grant, Eugénio Oliveira, and Belinda Maia. ‘twazn me!!! ;(’ Automatic Authorship Analysis of Micro-Blogging Messages. In Rafael Muñoz, Andrés Montoyo, and Elisabeth Métais, editors, Proceedings of the 16th International Conference on Applications of Natural Language to Information Systems, NLDB 2011, number LNCS 6716 in Lecture Notes in Computer Science, page 161–168. Springer, June 2011.

[SSSG+10] Rui Sousa-Silva, Luís Sarmento, Tim Grant, Eugénio C. Oliveira, and Belinda Maia. Comparing Sentence-Level Features for Authorship Analysis in Portuguese. In PROPOR, page 51–54, 2010.

[The08] Mike Thelwall. Fk yea I swear: cursing and gender in MySpace. Corpora, 3(1):83–107, 2008.

[TLC11] Yi-Jie Tang, Chang-Ye Li, and Hsin-Hsi Chen. A Compari- son between Microblog Corpus and Balanced Corpus from Linguistic and Sentimental Perspectives. In Workshop on Ana- lyzing Microtext (AAAI’2011), 2011.

[TOS16] The 100 most vulgar slang words. http://onlineslangdictionary.com/lists/most-vulgar-words/, 2016. (Retrieved on 2016-12-28).

[TSR+07] Hironori Takeuchi, L. Venkata Subramaniam, Shourya Roy, Diwakar Punjani, and Tetsuya Nasukawa. Sentence bound- ary detection in conversational speech transcripts using nois- ily labeled examples. International Journal on Document Analy- sis and Recognition, 10(3):147–155, December 2007.

[TSSW10] Andranik Tumasjan, Timm Sprenger, Philipp Sandner, and Isabell Welpe. Predicting Elections with Twitter: What 140 Characters Reveal about Political Sentiment. In Proceedings of the Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[TWH07] Katrin Tomanek, Joachim Wermter, and Udo Hahn. Sentence and token splitting based on conditional random fields. In Proceedings of the 10th Conference of the Pacific Association for Computational Linguistics, page 49–57, Melbourne, Australia, 2007.

[TWW09] Buzhou Tang, Xuan Wang, and Xiaohong Wang. Chinese Word Segmentation Based on Large Margin Methods. Inter- national Journal of Asian Language Processing, 19(2):55–68, 2009.

[Ukk92] Esko Ukkonen. Approximate string-matching with q- grams and maximal matches. Theoretical Computer Science, 92(1):191–211, 1992.

[Uni15] Unicode, Inc. Unicode Character Database, v8.0. http://www.unicode.org/Public/8.0.0/ucd/PropList.txt, May 2015. (Retrieved on 2018-01-10).

[Vot] Martin Votruba. The word for "bear". https://www.pitt.edu/~votruba/qsonhist/bearetymologyslovakenglishwelsh.html. (Retrieved on 2018-01-04).

[VVV10] Tommi Vatanen, Jaakko J. Väyrynen, and Sami Virpioja. Lan- guage Identification of Short Text Segments with N-gram Models. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Ste- lios Piperidis, Mike Rosner, and Daniel Tapias, editors, Pro- ceedings of the Seventh conference on International Language Re- sources and Evaluation (LREC’10), Valletta, Malta, May 2010. European Language Resources Association (ELRA).

[Wal80] Mort Walker. The Lexicon of Comicana. iUniverse, 1980. Via Wikipedia https://en.wikipedia.org/wiki/The_Lexicon_of_Comicana (Retrieved on 2017-01-12).

[Wan10] Alex Wang. Detecting Spam Bots in Online Social Network- ing Sites: A Machine Learning Approach. In Sara Foresti and Sushil Jajodia, editors, Data and Applications Security and Pri- vacy XXIV, volume 6166 of Lecture Notes in Computer Science, page 335–342. Springer Berlin / Heidelberg, 2010.

[WCTS14] Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan, and Amit P. Sheth. Cursing in English on Twitter. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, CSCW ’14, February 2014. Updated version at http://wiki.knoesis.org/index.php/Cursing_in_English_on_Twitter (Retrieved on 2016-12-05).

[Wil10] Evan Williams. The Evolving Ecosystem. http://blog.twitter.com/2010/09/evolving-ecosystem.html, September 2010.

[XFW+12] Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn Rose. Detecting offensive tweets via topical feature discov- ery over a large scale twitter corpus. In Proceedings of the 21st ACM international conference on Information and knowledge man- agement, page 1980–1984. ACM, 2012.

[XZ10] Zhi Xu and Sencun Zhu. Filtering offensive language in on- line communities using grammatical relations. In Proceedings of the Seventh Annual Collaboration, Electronic Messaging, Anti- Abuse and Spam Conference, 2010.

[Zip35] George Kingsley Zipf. The Psychobiology of Language. Houghton-Mifflin, New York, NY, USA, 1935.

[ZP11] Chao Michael Zhang and Vern Paxson. Detecting and Analyzing Automated Activity on Twitter. In Proceedings of the 12th Passive and Active Measurement Conference (PAM 2011), 2011.