Statistical Data Mining for Sina Weibo, a Chinese Micro-Blog: Sentiment Modelling and Randomness Reduction for Topic Modelling
Total Page:16
File Type:pdf, Size:1020Kb
Statistical Data Mining for Sina Weibo, a Chinese Micro-blog: Sentiment Modelling and Randomness Reduction for Topic Modelling London School of Economics and Political Sciences Wenqian Cheng A thesis submitted to the Department of Statistics of the London School of Economics for the degree of Doctor of Philosophy, London, March 2017 2 Declaration I certify that the thesis I have presented for examination for the MPhil/PhD degree of the London School of Economics and Political Science is solely my own work other than where I have clearly indicated that it is the work of others (in which case the extent of any work carried out jointly by me and any other person is clearly identified in it). The copyright of this thesis rests with the author. Quotation from it is permitted, provided that full acknowledgement is made. This thesis may not be reproduced without my prior written consent. I warrant that this authorisation does not, to the best of my belief, infringe the rights of any third party. I declare that my thesis consists of approximately 50,000 words. 1 Abstract Before the arrival of modern information and communication technology, it was not easy to capture people's thoughts and sentiments; however, the development of statistical data mining techniques and the prevalence of mass social media provide opportunities to capture those trends. Among all types of social media, micro-blogs make use of the word limit of 140 characters to force users to get straight to the point, thus making the posts brief but content-rich resources for investigation. The data mining object of this thesis is Weibo, the most popular Chinese micro-blog. In the first part of the thesis, we attempt to perform various exploratory data mining on Weibo. After the literature review of micro-blogs, the initial steps of data collection and data pre-processing are introduced. This is followed by analysis of the time of the posts, analysis between intensity of the post and share price, term frequency and cluster analysis. Secondly, we conduct time series modelling on the sentiment of Weibo posts. Considering the properties of Weibo sentiment, we mainly adopt the framework of ARMA mean with GARCH type conditional variance to fit the patterns. Other distinct models are also considered for negative sentiment for its complexity. Model selection and validation are introduced to verify the fitted models. Thirdly, Latent Dirichlet Allocation (LDA) is explained in depth as a way to discover topics from large sets of textual data. The major contribution is creating a Randomness Reduction Algorithm applied to post-process the output of topic 2 3 models, filtering out the insignificant topics and utilising topic distributions to find out the most persistent topics. At the end of this chapter, evidence of the effectiveness of the Randomness Reduction is presented from empirical studies. The topic classification and evolution is also unveiled. Acknowledgements First, I would like to express my sincere gratitude to Prof. Piotr Fryzlewicz for the continuous support of my PhD study, for his patience, motivation, and immense knowledge. I am grateful to him for his constructive criticism during my thesis writing and for his encouragement at all stages of my research. This work would not have been done without his continued guidance and generous support. My thanks also goes to Dr. Clifford Lam and Dr. Wicher Bergsma for their comments and suggestions on my work. I would like to thank the Centre of Analysis of Time Series (CATS) and the Time Series Group of the Statistics Department at LSE for providing a perfect environment in which to pursue research. I am thankful to Ian Marshall (the Research Administrator of the Department of Statistics), and Lyn Grove and Jill Beattie (the Administrators at CATS) for their kind support. There are many people whose suggestions, discussions and comments have greatly contributed to my PhD work. I very much appreciate Christopher Sciberras's efforts and kind assistance in proofreading this thesis and correcting my grammatical errors. I am grateful to Dr. Alan Pryor, my master project supervisor, to keep giving me valuable advice and inspiration in my research. I am also grateful to Ivan Sanchez, Dr. Sebastian Riedel and Dr. Nikos Aletras from the Computer Science Department at University College London and Dr. Georg Hahn from the Statistics Department at Imperial College for their comments and suggestions about my research. Hearty thanks go to my colleagues over the years for their stimulating 4 5 discussions and after-work chats. They include Rafal Baranowski, Anna Louise Schroeder, Na Huang, Majeed Simaan, Ewelina Sienkiewicz, Ali Habibnia, Cheng Qian, Yajing Zhu, and Hyeyoung Maeng. I would also like to extend thanks to the following for their constant and unfailing support, and I could not have managed it without them: Chen Lu, Si Qiao, Ying Chen, Ruoxi Li, Anran Chen, Kun Wang, Jiang Shu, Nawal Mustafa, Michelle Warbis and Cynthia Endezoumou. Finally, I am very grateful to my family who offered me both spiritual encouragement and financial support throughout all the stages of my research. Thanks for your unconditional love and belief in me. Thank you all! Contents 1 Introduction 21 2 Exploratory Data Mining 25 2.1 Literature Review of Data Analysis of Microblogs . 25 2.1.1 Review of Relevant Twitter Data Analysis . 25 2.1.2 Review of Weibo Data Analysis . 28 2.2 Data Collection . 31 2.3 Analysis of the Time of the Posts . 36 2.4 Intensity of Posts vs Share Price . 44 2.5 Chinese Word Segmentation . 54 2.6 Term Frequency . 55 2.6.1 Most Frequent Terms for Vanke and Biguiyuan . 57 2.6.2 Most Frequent Terms for Suning and Guomei . 58 2.6.3 Most Frequent Terms for Donghang, Biyadi and Maotai . 60 2.6.4 Beyond Term Frequency . 62 2.7 Cluster Analysis . 63 6 CONTENTS 7 2.8 Summary . 73 3 Time Series Modelling on Sentiment 75 3.1 Introduction to Sentiment Analysis . 75 3.2 Time Series of Sentiment . 84 3.3 Univariate Time Series Model Fitting . 89 3.3.1 A General Framework . 89 3.3.2 Fitting Proportional Positive Sentiment . 97 3.3.3 Fitting Proportional Negative Sentiment . 103 3.3.4 Other Approaches for Fitting Proportional Negative Sentiment112 3.3.5 Model Comparison and Validation for Proportional Negative Sentiment . 118 3.4 Multivariate Time Series Model Fitting . 122 3.5 Summary . 128 4 Topic Modelling and Randomness Reduction 130 4.1 Introduction to Topic Models . 130 4.2 Latent Dirichlet Allocation in Depth . 132 4.2.1 The Development of Topic Models . 132 4.2.2 LDA: a Generative Probabilistic Model . 138 CONTENTS 8 4.2.3 Learning LDA by Gibbs Sampling . 142 4.3 Application of LDA on Microblogs' data . 149 4.3.1 Data Pre-processing for LDA . 149 4.3.2 Parameter Control for Functions . 150 4.4 Randomness Reduction . 152 4.4.1 Motivation and the Literature . 152 4.4.2 The Algorithm . 156 4.4.3 Empirical Examples for Randomness Reduction . 160 4.5 Case Studies for the Significance of Randomness Reduction . 166 4.6 Significance of Randomness Reduction on Twitter Datasets . 173 4.6.1 Randomness Reduction on Tweets about Apple . 174 4.6.2 Randomness Reduction on Tweets about US Airlines . 176 4.7 Topic Classification and Evolution . 178 4.8 Summary . 192 5 Conclusion, Discussion and Future Direction 194 Appendix 200 A Appendix for Chapter 2 200 CONTENTS 9 A.1 Capturing Data via API . 200 A.2 Details for Web Crawling . 202 A.3 Details for Web Parsing . 202 A.4 Additional Results for the Analysis of the Time of the Posts . 204 A.5 Result of Linear Regression for Vanke . 207 A.6 Additional Results for Intensity of the Posts vs Share Price . 209 A.6.1 Time Series Figures for Company Guomei . 209 A.6.2 Time Series Figures for Company Donghang . 213 A.6.3 Time Series Figures for Company Biyadi . 217 A.7 Word Cloud . 221 A.8 Additional Figures for Cluster Analysis . 222 B Appendix for Chapter 3 229 B.1 Additional Figures for Positive/Negative Polarity . 229 B.2 Additional Figures for Seven Dimensions Sentiments . 233 B.3 Details for Fitting Proportional Negative Sentiment Time Series . 236 B.3.1 R Package rugarch for Fitting Proportional Negative Sentiment236 B.3.2 Additional Figures for Proportional Negative Sentiment Time Series . 238 CONTENTS 10 C Appendix for Chapter 4 239 C.1 Additional Topic Words and Original Posts . 239 C.2 Non-persistent Topics in Monthly Topic Evolutions . 242 List of Figures 2.3.1 Daily post amount of Vanke, coloured by time, from 3rd May to 9th Dec 2013. 38 2.3.2 Hourly post amount of Vanke. 40 2.3.3 Hourly time series plot for Vanke post amount from 3rd May to 9th December 2013. 42 2.3.4 Hourly time series plot for post amount (multiple companies) from 3rd May to 9th December 2013. 43 2.4.1 Post amount vs share price for Vanke. 45 2.4.2 Dot plot of post amount (x-axis) and share price (y-axis) for Vanke. 46 2.4.3 Time series for share price of Vanke. 48 2.4.4 Time series for post amount of Vanke. 49 2.4.5 Nadaraya-Watson kernel regression estimate with Bandwith 0.5 for 0 Vanke (Rt and Dt). .......................... 51 2.4.6 Regression estimate using local polynomials with Bandwith 0.5 for 0 Vanke (Rt and Dt). .......................... 52 0 2.4.7 SiZer plot for Vanke (Rt and Dt). 53 2.6.1 Term frequency for Vanke. 58 11 LIST OF FIGURES 12 2.6.2 Term frequency for Biguiyuan.