Towards Better Prediction and Content Detection Through Online Social Media Mining

This document is downloaded from DR‑NTU (https://dr.ntu.edu.sg), Nanyang Technological University, Singapore.

Towards better prediction and content detection through online social media mining
Chen, Weiling
2018

Chen, W. (2018). Towards better prediction and content detection through online social media mining. Doctoral thesis, Nanyang Technological University, Singapore.
http://hdl.handle.net/10356/75925
https://doi.org/10.32657/10356/75925

Downloaded on 24 Sep 2021 22:29:20 SGT

TOWARDS BETTER PREDICTION AND CONTENT DETECTION THROUGH ONLINE SOCIAL MEDIA MINING

CHEN WEILING

SCHOOL OF COMPUTER SCIENCE AND ENGINEERING

A thesis submitted to Nanyang Technological University in partial fulfilment of the requirement for the degree of Doctor of Philosophy

2018

To Dad and Mom, Cao Pi and Furuya Rei, for their encouragement and love.

Acknowledgements

This thesis would not have been possible without the many people who have helped me and changed my life deeply during my study at Nanyang Technological University (NTU). First and foremost, I would like to express my sincerest gratitude to my supervisors, Associate Professors Lau Chiew Tong and Yeo Chai Kiat, for giving me all the support and freedom to carry out the research in which I am interested. In addition, I would also like to extend my appreciation to Associate Professor Lee Bu Sung, who has given me many useful suggestions for my research. Their precious and warm help in my research and studies, infectious enthusiasm and kindness, unlimited patience and inspiring guidance have been the major driving force during my candidature at NTU. Under their supervision, I have developed skills in critical thinking, research methodologies and integrity, technical writing, communication and leadership, which are invaluable for my future career.

Moreover, I would like to thank the students and staff of the Computer Networks and Communications Graduate Lab (CNCL) in the School of Computer Science and Engineering (SCSE) at NTU. I would like to thank my seniors, Yang Yiqun and Pham Thi Ngoc Diep; my lab peers, Zhang Yan and Chen Zhaomin; my junior, Yean SeangLidet; and my good friends, Liang Yuhan, Chu Zhaowei and Li Qiye. They have not only provided me with valuable suggestions for my studies and research, but also enriched my life at NTU with unforgettable experiences.

Last but not least, I cannot end without giving my special thanks to my parents for their continuous support and endless love. I am also grateful to Cao Pi and Furuya Rei, whom I have always admired and who have given me the courage to overcome the difficulties I have met. This thesis could not have been finished without them and is dedicated to them.

Contents

Acknowledgements
List of Figures
List of Tables
List of Abbreviations
Abstract

1 Introduction
  1.1 Background
    1.1.1 Good Aspects of OSN
      1.1.1.1 Social Functions
      1.1.1.2 Marketing and Business
      1.1.1.3 Data Analytics
    1.1.2 Bad Aspects of OSN
      1.1.2.1 Low-quality content
      1.1.2.2 Misinformation
    1.1.3 Discussions
  1.2 Motivation and Scope
    1.2.1 Content Polluters Detection
    1.2.2 Rumor Detection
    1.2.3 Stock Index Prediction
  1.3 Methodology
  1.4 Major Contributions
    1.4.1 A Real-time Low-quality Content Detection Framework
    1.4.2 An Unsupervised Rumor Detection Model based on Users' Behaviors
    1.4.3 RNN-Boost - A Hybrid Model for Predicting Stock Market Index
  1.5 Thesis Organization

2 Literature Review
  2.1 Low-quality Content Detection
    2.1.1 Definition of Low-quality Content
    2.1.2 Spam Detection
      2.1.2.1 Content feature based filters
      2.1.2.2 Non-content feature based filters
    2.1.3 Phishing Detection
      2.1.3.1 Blacklists
      2.1.3.2 Heuristics
      2.1.3.3 Data Mining
    2.1.4 Low-quality Content Detection
      2.1.4.1 Content-based methods
      2.1.4.2 Non-content based methods
      2.1.4.3 Real-time detection
    2.1.5 Discussions
  2.2 Rumor Detection
    2.2.1 Studying Rumors
      2.2.1.1 Definition of Rumor
      2.2.1.2 Taxonomy of Rumors
    2.2.2 Theories Related to Rumors
      2.2.2.1 User Behaviors
      2.2.2.2 Propagation
    2.2.3 Feature Selection
    2.2.4 Detection Methods
      2.2.4.1 Classification
      2.2.4.2 Anomaly Detection
    2.2.5 Discussions
  2.3 Stock Index Prediction
    2.3.1 Market Efficiency
      2.3.1.1 Definition
      2.3.1.2 Limitations
      2.3.1.3 Adaptive Market Hypothesis
    2.3.2 Feature engineering in financial prediction
      2.3.2.1 Technical Analysis
      2.3.2.2 Fundamental Analysis
      2.3.2.3 Combination of Technical and Fundamental Analysis
    2.3.3 Machine learning in financial prediction
    2.3.4 Discussions
  2.4 Evaluation metrics
    2.4.1 Content Detection
    2.4.2 Trend Prediction
  2.5 Summary

3 Real-time Low-quality Content Detection Framework from the Users' Perspective
  3.1 General Overview
    3.1.1 Terminology
    3.1.2 Overview of the Framework
  3.2 A study on low-quality content from users' perspective
    3.2.1 Cluster analysis of low-quality content
    3.2.2 Design of the survey
    3.2.3 Results of the survey
  3.3 Identifying features characterizing low-quality content
    3.3.1 Direct features
    3.3.2 Indirect features
    3.3.3 Word level analysis
  3.4 Pre-implementation tweet processing
    3.4.1 Data collection and preprocessing
    3.4.2 Labeling tweets
    3.4.3 Training and testing classifiers
  3.5 Implementation results and evaluation
    3.5.1 Word level analysis
    3.5.2 Feature rank
    3.5.3 Detection performance
    3.5.4 Comparisons with other methods
      3.5.4.1 Blacklists and Twitter policy
      3.5.4.2 Other spam/phishing detection methods
  3.6 Conclusions

4 Unsupervised Rumor Detection Model based on User Behaviors
  4.1 General Overview
    4.1.1 Definition of Rumors
    4.1.2 Descriptions of the Problem
    4.1.3 Overview of the Model
  4.2 Data Collection Methods and Detection Models
    4.2.1 Data Collection
    4.2.2 Feature Selection
    4.2.3 Recurrent Neural Networks
    4.2.4 Autoencoder
    4.2.5 Combination Model of RNN and AE
    4.2.6 Rumor Detection
  4.3 Results and Comparisons
    4.3.1 One Hidden Layer vs. Multiple Hidden Layers
    4.3.2 Standard Autoencoder vs. Proposed Variant Autoencoder
    4.3.3 One Aggregated Model vs. Individual Models
    4.3.4 Proposed Model vs. Other Methods
  4.4 Conclusions

5 Leveraging Social Media News to Predict Stock Index Movement Using RNN-Boost
  5.1 Overview of the Model
  5.2 Methods and models
    5.2.1 Data collection
    5.2.2 Feature engineering
      5.2.2.1 Technical features
      5.2.2.2 Content features
    5.2.3 Recurrent Neural Networks
    5.2.4 Adaboost
    5.2.5 RNN-Boost
  5.3 Results and Comparisons
    5.3.1 Single RNN with different feature sets
    5.3.2 Single RNN vs. RNN-Boost
    5.3.3 Comparisons with other methods
  5.4 Conclusions

6 Conclusions and Future Work
  6.1 Conclusions
  ...

Details

  • File Type
    pdf
  • Content Languages
    English
  • File Pages
    170 pages
