Towards Better Prediction and Content Detection Through Online Social Media Mining

This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg) Nanyang Technological University, Singapore. Towards better prediction and content detection through online social media mining Chen, Weiling 2018 Chen, W. (2018). Towards better prediction and content detection through online social media mining. Doctoral thesis, Nanyang Technological University, Singapore. http://hdl.handle.net/10356/75925 https://doi.org/10.32657/10356/75925 Downloaded on 24 Sep 2021 22:29:20 SGT TOWARDS BETTER PREDICTION AND CONTENT DETECTION THROUGH ONLINE SOCIAL MEDIA MINING CHEN WEILING SCHOOL OF COMPUTER SCIENCE AND ENGINEERING 2018 TOWARDS BETTER PREDICTION AND CONTENT DETECTION THROUGH ONLINE SOCIAL MEDIA MINING CHEN WEILING SCHOOL OF COMPUTER SCIENCE AND ENGINEERING A thesis submitted to Nanyang Technological University in partial fulfilment of the requirement for the degree of Doctor of Philosophy 2018 To Dad and Mom, Cao Pi and Furuya Rei, for their encouragements and love. i Acknowledgements This thesis would not be possible without many people who have helped me and changed my life deeply during my study at Nanyang Technological University (NTU). First and foremost, I would like to express my sincerest gratitude to my supervi- sors, Associate Professors Lau Chiew Tong and Yeo Chai Kiat for giving me all the support and freedom to carry out the research which I am interested in. In addition, I would also like to extend my appreciation to Associate Professor Lee Bu Sung who has given me many useful suggestions for my research. Their precious and warm help in my research and studies, infectious enthusiasm and kindness, unlimited patience and inspiring guidance have been the major driving force during my candidature at NTU. Under their supervision, I have developed skills in critical thinking, research methodologies and integrity, technical writing, communication and leadership, which are invaluable for my future career. Moreover, I would like to thank the students and staff of Computer Networks and Communications Graduate Lab (CNCL) in the School of Computer Science and En- gineering (SCSE) at NTU. I would like to thank my seniors, Yang Yiqun, Pham Thi Ngoc Diep; my lab peers, Zhang Yan and Chen Zhaomin; my junior, Yean SeangLidet and my good friends, Liang Yuhan, Chu Zhaowei and Li Qiye. They have not only pro- vided me with their valuable suggestions for my studies and research, but also enriched my life at NTU with unforgettable experiences. Last but not least, I cannot end without giving my special thanks to my parents for their continuous support and endless love. I am also grateful to Cao Pi and Furuya Rei whom I always admire and give me all the courage to overcome the difficulties I have met. This thesis cannot be finished without them and is dedicated to them. Contents Acknowledgements i List of Figures viii List of Tables ix List of Abbreviations x Abstract x 1 Introduction 1 1.1 Background . 1 1.1.1 Good Aspects of OSN . 2 1.1.1.1 Social Functions . 2 1.1.1.2 Marketing and Business . 3 1.1.1.3 Data Analytics . 4 1.1.2 Bad Aspects of OSN . 5 1.1.2.1 Low-quality content . 6 1.1.2.2 Misinformation . 7 1.1.3 Discussions . 9 1.2 Motivation and Scope . 9 1.2.1 Content Polluters Detection . 9 1.2.2 Rumor Detection . 11 1.2.3 Stock Index Prediction . 14 1.3 Methodology . 16 1.4 Major Contributions . 16 Contents iii 1.4.1 A Real-time Low-quality Content Detection Framework . 17 1.4.2 An Unsupervised Rumor Detection Model based on Users' Be- haviors . 18 1.4.3 RNN-Boost - A Hybrid Model for Predicting Stock Market Index 19 1.5 Thesis Organization . 19 2 Literature Review 22 2.1 Low-quality Content Detection . 22 2.1.1 Definition of Low-quality Content . 22 2.1.2 Spam Detection . 24 2.1.2.1 Content feature based filters . 24 2.1.2.2 Non-content feature based filters . 25 2.1.3 Phishing Detection . 26 2.1.3.1 Blacklists . 26 2.1.3.2 Heuristics . 27 2.1.3.3 Data Mining . 29 2.1.4 Low-quality Content Detection . 29 2.1.4.1 Content-based methods . 30 2.1.4.2 Non-content based methods . 31 2.1.4.3 Real-time detection . 32 2.1.5 Discussions . 33 2.2 Rumor Detection . 33 2.2.1 Studying Rumors . 33 2.2.1.1 Definition of Rumor . 33 2.2.1.2 Taxonomy of Rumors . 34 2.2.2 Theories Related to Rumors . 35 2.2.2.1 User Behaviors . 35 2.2.2.2 Propagation . 36 2.2.3 Feature Selection . 37 2.2.4 Detection Methods . 39 2.2.4.1 Classification . 39 2.2.4.2 Anomaly Detection . 40 Contents iv 2.2.5 Discussions . 41 2.3 Stock Index Prediction . 42 2.3.1 Market Efficiency . 42 2.3.1.1 Definition . 42 2.3.1.2 Limitations . 43 2.3.1.3 Adaptive Market Hypothesis . 44 2.3.2 Feature engineering in financial prediction . 45 2.3.2.1 Technical Analysis . 45 2.3.2.2 Fundamental Analysis . 46 2.3.2.3 Combination of Technical and Fundamental Analysis . 48 2.3.3 Machine learning in financial prediction . 49 2.3.4 Discussions . 51 2.4 Evaluation metrics . 51 2.4.1 Content Detection . 51 2.4.2 Trend Prediction . 52 2.5 Summary . 53 3 Real-time Low-quality Content Detection Framework from the Users' Perspective 54 3.1 General Overview . 55 3.1.1 Terminology . 55 3.1.2 Overview of the Framework . 56 3.2 A study on low-quality content from users' perspective . 57 3.2.1 Cluster analysis of low-quality content . 57 3.2.2 Design of the survey . 60 3.2.3 Results of the survey . 61 3.3 Identifying features characterizing low-quality content . 66 3.3.1 Direct features . 66 3.3.2 Indirect features . 68 3.3.3 Word level analysis . 69 3.4 Pre-implementation tweet processing . 70 3.4.1 Data collection and preprocessing . 70 Contents v 3.4.2 Labeling tweets . 71 3.4.3 Training and testing classifiers . 72 3.5 Implementation results and evaluation . 73 3.5.1 Word level analysis . 73 3.5.2 Feature rank . 74 3.5.3 Detection performance . 76 3.5.4 Comparisons with other methods . 77 3.5.4.1 Blacklists and Twitter policy . 77 3.5.4.2 Other spam/phishing detection methods . 78 3.6 Conclusions . 79 4 Unsupervised Rumor Detection Model based on User Behaviors 81 4.1 General Overview . 82 4.1.1 Definition of Rumors . 82 4.1.2 Descriptions of the Problem . 83 4.1.3 Overview of the Model . 84 4.2 Data Collection Methods and Detection Models . 85 4.2.1 Data Collection . 85 4.2.2 Feature Selection . 87 4.2.3 Recurrent Neural Networks . 89 4.2.4 Autoencoder . 92 4.2.5 Combination Model of RNN and AE . 93 4.2.6 Rumor Detection . 95 4.3 Results and Comparisons . 97 4.3.1 One Hidden Layer vs. Multiple Hidden Layers . 98 4.3.2 Standard Autoencoder vs. Proposed Variant Autoencoder . 99 4.3.3 One Aggregated Model vs. Individual Models . 99 4.3.4 Proposed Model vs. Other Methods . 100 4.4 Conclusions . 101 5 Leveraging Social Media News to Predict Stock Index Movement Us- ing RNN-Boost 102 Contents vi 5.1 Overview of the Model . 103 5.2 Methods and models . 104 5.2.1 Data collection . 105 5.2.2 Feature engineering . 107 5.2.2.1 Technical features . 107 5.2.2.2 Content features . 108 5.2.3 Recurrent Neural Networks . 110 5.2.4 Adaboost . 113 5.2.5 RNN-Boost . 114 5.3 Results and Comparisons . 116 5.3.1 Single RNN with different feature sets . 117 5.3.2 Single RNN vs. RNN-Boost . 118 5.3.3 Comparisons with other methods . 119 5.4 Conclusions . 119 6 Conclusions and Future Work 121 6.1 Conclusions . ..

Towards Better Prediction and Content Detection Through Online Social Media Mining

On the Road to Hades–Helpful Automatic Development Email Summarization

Police Interactions with the Community on Facebook: an Examination of the Content Of

Forming Innovative Learning Environments Through Technology. Conversations in Excellence. INSTITUTION National Catholic Educational Association, Washington, DC

Simon Young from the Slants

I Give Permission for Public Access to My Thesis and for Any Copying to Be Done at the Discretion of the Archives Librarian And/Or the College Librarian

And HTML Attachments Are Evil a Radical Solution to Get Rid of Them

Twitter: Expressions of the Whole Self

Comprehensive Guide to a Professional Blog Site: a Wordpress Example

First Name Initial Last Name

Newsreaders.Com: Getting Started: Netiquette

The Australian Library Journal the Australian Library Journal Is the Flagship Publication Volume 57 No

Raporttien Ulkoasu Ja Lähteisiin Viittaaminen