
Minority Target Class Detection for Short Text Classification

by Fatima Chiroma

The thesis is submitted in partial fulfilment of the requirements for the award of the degree of Doctor of Philosophy of the University of Portsmouth.

January 2021

To the memory of my father.

Declaration

Whilst registered as a candidate for the above degree, I have not been registered for any other research award. The results and conclusions embodied in this thesis are the work of the named candidate and have not been submitted for any other academic award. Parts of the work presented in this thesis draw from the following publications:

• Fatima Chiroma, Han Liu, and Mihaela Cocea. Text classification for suicide related tweets. In 2018 International Conference on Machine Learning and Cybernetics (ICMLC), volume 2, pages 587–592. IEEE, 2018. (Chapter 3)

• Fatima Chiroma, Han Liu, and Mihaela Cocea. Suicide related text classification with Prism algorithm. In 2018 International Conference on Machine Learning and Cybernetics (ICMLC), volume 2, pages 575–580. IEEE, 2018. (Chapter 3)

• Fatima Chiroma, Mihaela Cocea, and Han Liu. Detection of suicidal Twitter posts. In UK Workshop on Computational Intelligence, pages 307–318. Springer, 2019. (Chapter 3)

• Fatima Chiroma and Ella Haig. Disambiguation of features for improving target class detection from social media text. In Proceedings of the IEEE World Congress on Computational Intelligence (WCCI) 2020. Institute of Electrical and Electronics Engineers, 2020. (Chapter 4)

Fatima Chiroma
January 2021

Abstract

The rise of social media has resulted in large volumes of socially relevant information, but also in unhealthy behaviours such as cyberbullying, suicidal ideation and hate speech. These behaviours have been shown to have offline consequences, and lawmakers and social media platforms have put measures in place to detect them. However, these measures are largely manual and do not scale, making them ineffective for the evolving web. A substantial body of research, from both computational linguistics and machine learning perspectives, has addressed the effective and robust automatic detection of such content (the target classes). This content makes up only a small percentage of overall social media posts and needs to be distinguished from other discourse that may discuss such behaviours without displaying them (the non-target classes), which makes the task challenging. In this thesis, we employ short text classification to improve the detection of the target classes from Twitter. We reviewed the literature related to short text classification of unhealthy social media behaviours, highlighting the impact of text ambiguity on classification performance when distinguishing target classes from non-target classes. In addition, relevant machine learning techniques and methods were identified, and the performance of the most popular machine learning algorithms for short text classification of unhealthy social media behaviours on Twitter data was empirically investigated. Furthermore, we introduce two methods that aim to improve the detection of the target class in a binary classification problem by minimizing common or ambiguous terms; we refer to this minimization process as "term disambiguation". The first method, Short Text Term Disambiguation (STTD), identifies and minimizes terms that are common to the two classes, thereby increasing the distinctiveness of the target and non-target class terms.
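As a rough illustration of this idea only (a minimal sketch, not the implementation described in Chapter 4; the function names and the whitespace tokenisation are assumptions made purely for illustration), the core operation can be thought of as separating the two class vocabularies into target-only (T), non-target-only (O) and shared B(T,O) term sets, following the set naming used later in the thesis:

```python
# Illustrative sketch of "term disambiguation" as described in the abstract:
# identify terms that occur in both the target and non-target class vocabularies
# and set them aside, so that the remaining terms are more distinctive of each class.
# This is NOT the thesis implementation; tokenisation here is a simple whitespace
# split, whereas the thesis applies its own preprocessing pipeline (Chapter 3).

from typing import Iterable, Set, Tuple


def class_vocabulary(texts: Iterable[str]) -> Set[str]:
    """Collect the set of unique terms used in one class's posts."""
    return {term for text in texts for term in text.lower().split()}


def disambiguate_terms(target_texts: Iterable[str],
                       non_target_texts: Iterable[str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split vocabularies into target-only (T), non-target-only (O) and shared B(T,O) sets."""
    target_vocab = class_vocabulary(target_texts)
    non_target_vocab = class_vocabulary(non_target_texts)
    shared = target_vocab & non_target_vocab  # ambiguous terms common to both classes
    return target_vocab - shared, non_target_vocab - shared, shared


# Toy usage with made-up posts
t_only, o_only, both = disambiguate_terms(
    ["i want to end my life"],
    ["this game is the end of my patience"],
)
print(sorted(both))  # terms shared by both classes, e.g. ['end', 'my']
```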
The second method, Partition Based - Short Text Term Disambiguation (PB-STTD), aims to further improve the detection of the target class by explicitly addressing class imbalance as part of the term disambiguation process. Finally, we validated and evaluated the proposed term disambiguation methods on three data sets containing unhealthy social media behaviours, using different machine learning algorithms. The results showed that both proposed term disambiguation methods led to improved detection of the target class (i.e. unhealthy behaviours in Twitter data).

Acknowledgements

I wish to express my deepest gratitude and appreciation to my supervisor, Dr. Ella Haig, for her continuous support, guidance, encouragement, understanding and patience throughout this research. I would also like to extend my appreciation to my co-supervisor, Dr. Han Liu, for his support and guidance. I would like to thank my sponsor, the Petroleum Technology Development Fund (PTDF); I am grateful for the support. Also, I would like to express my gratitude to my family for their dedicated support and encouragement. Thank you to everyone who supported me during my study.

Table of Contents

Declaration
Abstract
Acknowledgements
List of Tables
List of Figures
List of Abbreviations
List of Publications
1 Introduction
  1.1 Research Motivation
  1.2 Research Aim
  1.3 Thesis Overview
2 Literature Review
  2.1 Background on Machine Learning
    2.1.1 Relevant Methods and Techniques
      2.1.1.1 Decision Tree (DT)
      2.1.1.2 Ensemble Methods
      2.1.1.3 k-Nearest Neighbor (kNN)
      2.1.1.4 Logistic Regression (LR)
      2.1.1.5 Naïve Bayes (NB)
      2.1.1.6 Neural Networks
      2.1.1.7 Random Forest (RF)
      2.1.1.8 Support Vector Machine (SVM)
  2.2 Text Classification
    2.2.1 Text Classification Process
      2.2.1.1 Preprocessing
      2.2.1.2 Feature Engineering
      2.2.1.3 Classification
      2.2.1.4 Evaluation
  2.3 Short Text Classification
    2.3.1 Social Media Text
    2.3.2 Challenges of Short Text Classification
      2.3.2.1 Noisy and High Dimensional
      2.3.2.2 Short, Informal and Sparse
      2.3.2.3 Ambiguous
      2.3.2.4 Imbalanced
  2.4 Short Text Classification of Unhealthy Social Media Behaviours (USMB)
    2.4.1 Suicidal Ideation
    2.4.2 Hate Speech
    2.4.3 Misogyny
  2.5 Chapter Conclusion
3 Investigating Short Text Classification of Unhealthy Social Media Behaviours
  3.1 Introduction
  3.2 Data sets
    3.2.1 DS1: Suicidal Ideation
    3.2.2 DS2: Online Misogyny
    3.2.3 DS3: Online Hate Speech
  3.3 Proposed Short Text Classification Approach
  3.4 Data Preparation
    3.4.1 Text preprocessing
    3.4.2 Data manipulation
  3.5 Feature Engineering
  3.6 Experiments
    3.6.1 Results and Discussion
      3.6.1.1 Baseline Results
      3.6.1.2 Ensemble Classification Results
  3.7 Chapter Conclusion
4 Short Text Term Disambiguation (STTD)
  4.1 Introduction
  4.2 Short Text Term Disambiguation (STTD) Method
  4.3 Validation of STTD Method
    4.3.1 Results and Discussion
  4.4 Chapter Conclusion
5 Partition Based - Short Text Term Disambiguation (PB-STTD)
  5.1 Introduction
  5.2 Partition Based - Short Text Term Disambiguation (PB-STTD) Method
  5.3 Validation of PB-STTD Method
    5.3.1 Results and Discussion
      5.3.1.1 Real world context
  5.4 Evaluation of PB-STTD in dealing with class imbalance
    5.4.1 Results and Discussion
  5.5 Chapter Conclusion
6 Conclusion and Future Work
  6.1 Research Contributions
  6.2 Limitations and Direction for Future Work
References
Appendix A: Research Ethics Review
Appendix B: Data Description
Appendix C: Baseline Result for Binary-class (Suicide and Non-suicide) without the Flippant Class
Appendix D: Full List of Sets' Terms
  D.1 No Disambiguation
    D.1.1 DS1: Suicidal Ideation
    D.1.2 DS2: Misogyny
    D.1.3 DS3: Hate Speech
  D.2 Post-STTD
    D.2.1 DS1: Suicidal Ideation
    D.2.2 DS2: Misogyny
    D.2.3 DS3: Hate Speech
  D.3 Post-PB-STTD
    D.3.1 DS1: Suicidal Ideation
    D.3.2 DS2: Misogyny
    D.3.3 DS3: Hate Speech
Appendix E: Confusion Matrix for STTD Results in Chapter 4
Appendix F: Significance Test for Performance of STTD and PB-STTD

List of Tables

3.1 Summary of manipulated data sets with their Class Imbalance Probability (CIP) measures
3.2 Summary of feature sets
3.3 Baseline Results: Set1
3.4 Baseline Results: Set2
3.5 Baseline Results: Set3
4.1 Sample terms for DS1 prior to the application of STTD
4.2 Sample terms for the T, O and B(T,O) sets from DS1 after the application of STTD
4.3 Short Text Term Disambiguation (STTD) feature count
4.4 Baseline Results
4.5 STTD Results
4.6 Impact of the STTD method based on the difference in F-measure, where "–" implies a decrease in performance after application of STTD
5.1 Sample terms for each data set prior to term disambiguation
5.2 Sample terms for the T, O and B(T,O) sets for DS1, DS2 and DS3 after applying PB-STTD
5.3 Data sets description
5.4 Term count for each data set before (original) and after (post) applying the term disambiguation methods
5.5 Baseline Results
5.6 STTD Results
5.7 PB-STTD Results
5.8 FastText Results
5.9 Comparison of the STTD and PB-STTD methods based on the difference in F-measure, where "–" implies a higher F-measure for STTD and "+" implies a higher F-measure for PB-STTD
5.10 SMOTE Results