Modelling Stock Market Manipulation in Online Forums
by
David Nam

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen's University
Kingston, Ontario, Canada
October 2020

Copyright © David Nam, 2020

Abstract

Over the past several decades, advances in technology have significantly impacted all aspects of the financial system. While these advances have brought numerous benefits, they have also increased the number of ways in which the market can be manipulated. Social media is a platform frequently used to carry out these manipulation schemes. In particular, online forums have become a tool for manipulators to disseminate false or misleading information so that they can profit from other investors. In response, my research provides investors with valuable insights and the tools necessary for detecting pump-and-dump schemes. To achieve this, posts and comments within financial forums were first collected. Then, financial data was added to associate the texts with the resulting market behaviours. Using statistical methods, the records were initially labelled according to whether they exhibited a known market pattern that commonly occurs when investors act upon deceptive content. To further improve the labelling method, comments on deceptive posts were then relabelled based on their level of agreement with the fraudulent information. With this agreement model, results showed that predictions improved across the tested classification techniques (XGBoost, Random Forest, SVM, MLP, CNN, BiLSTM). Additionally, a comparison of the classifiers found CNNs to be the best performing model among those tested.

Acknowledgments

I would like to express my sincere gratitude to my supervisor, Dr. David Skillicorn. His guidance and patience throughout this journey have been invaluable to my success. Thank you to all the professors whom I had the pleasure of meeting.
The knowledge they shared has played an important role in shaping my research. I would also like to thank all my lab-mates. Their company and support have made my days at the lab enjoyable, and my overall experience at Queen's University memorable. Lastly, I would like to thank my family and friends who have been supportive and patient with me. Their encouragement pushed me to become the best version of myself.

Contents

Abstract
Acknowledgments
Contents
List of Tables
List of Figures
Chapter 1: Introduction
Chapter 2: Background and Related Work
    2.1 Data Sources
    2.2 Market Data
        2.2.1 OHLCV
        2.2.2 Market Capitalization
        2.2.3 Penny Stocks
        2.2.4 Market Manipulation
        2.2.5 Pump and Dump
        2.2.6 Event Study
    2.3 Tools
        2.3.1 Stance Detection
        2.3.2 Empath
        2.3.3 Synthetic Minority Oversampling Technique (SMOTE)
        2.3.4 Adaptive Synthetic (ADASYN) Approach
        2.3.5 SHAP
    2.4 Techniques
        2.4.1 Non-negative Matrix Factorization (NMF)
        2.4.2 Latent Dirichlet Allocation (LDA)
        2.4.3 Singular Value Decomposition (SVD)
        2.4.4 Extreme Gradient Boosting (XGBoost)
        2.4.5 Random Forest (RF)
        2.4.6 Support Vector Machine (SVM)
        2.4.7 Artificial Neural Networks
    2.5 Related Work
    2.6 Summary
Chapter 3: Experiments
    3.1 Data Collection
        3.1.1 Collecting Data from Reddit
        3.1.2 Collecting Data from Yahoo! Finance
    3.2 Text Preprocessing
    3.3 Data Labelling
        3.3.1 Anomaly Detection
        3.3.2 Price Trend
        3.3.3 Agreement Model
    3.4 Handling Class Imbalance
    3.5 Techniques Used
    3.6 Summary
Chapter 4: Results
    4.1 Data Overview
    4.2 Clustering Results
    4.3 Classification Results
    4.4 Discussion
    4.5 Results Summary
Chapter 5: Conclusion
    5.1 Summary
    5.2 Limitations
        5.2.1 Accuracy of Market Behaviour
        5.2.2 Labelling of Market Data
        5.2.3 Biases within Agreement Model
        5.2.4 Mistaking as Deceptive Content
Bibliography
Appendix A: Computational Resources
Appendix B: SHAP Summary Plots

List of Tables

3.1 Features of Reddit data
3.2 Features of Yahoo! Finance data
3.3 List of stopwords that were removed
3.4 List of generated words by Empath
3.5 List of custom words used in the Agreement Model
4.1 Breakdown of records collected from subreddits
4.2 Dataset class distribution
4.3 Comments pre and post agreement model
4.4 Summary of individual model performance
4.5 Examples of misclassified posts from CNN model

List of Figures

2.1 Screenshot of a subreddit on Reddit
2.2 Sample chart of a stock on Yahoo! Finance
2.3 Candlestick for Stock Price (OHLC)
2.4 Stages of Pump and Dump
3.1 Experiment workflow
3.2 Time window used to collect market data
3.3 Labelling of stock behaviours
3.4 Distribution of stock price trend slopes
4.1 Data Collection Trend
4.2 Top 20 frequent words
4.3 Histogram of discussed market sectors within subreddits
4.4 Histogram of discussed market sectors within texts labelled as P&Ds
4.5 Healthcare posts and comments trend
4.6 Technology posts and comments trend
4.7 Pump and dump posts trend
4.8 NMF plot for posts
4.9 NMF plot for posts and comments
4.10 LDA plot for posts
4.11 LDA plot for posts and comments
4.12 SVD plot for posts
4.13 SVD plot for posts and comments
4.14 MLP SHAP Summary Plot for posts
4.15 MLP SHAP Summary Plot for posts and comments
4.16 CNN SHAP Summary Plot for posts
4.17 CNN SHAP Summary Plot for posts and comments
4.18 BiLSTM SHAP Summary Plot for posts
4.19 BiLSTM SHAP Summary Plot for posts and comments
B.1 XGBoost SHAP Summary Plot for posts
B.2 XGBoost SHAP Summary Plot for posts and comments
B.3 RF SHAP Summary Plot for posts
B.4 RF SHAP Summary Plot for posts and comments
B.5 SVM SHAP Summary Plot for posts
B.6 SVM SHAP Summary Plot for posts and comments

Chapter 1

Introduction

Ever since the introduction of the global financial system, market manipulation has been an important issue. Broadly defined as the intentional act of deceiving others to alter or misrepresent market prices, its presence poses a threat to the belief held by many investors that the market is fair and free. In order to give the public confidence and promote efficiency in the markets, financial regulators (e.g., the SEC) employ various monitoring techniques to detect, investigate and prosecute these illicit activities [12, 27].

While the schemes that manipulate the market are well documented and have stayed relatively the same, the means and opportunities for conducting them have continued to evolve through the years. In particular, the introduction of new financial products and technologies has allowed investors to enter the market easily. However, it has also increased the risk of manipulation. With the growth in participants, detecting and investigating fraudulent activities has become much more difficult, and many such activities go undetected [32].

The advent of social media has given rise to new methods for manipulating the market. With its development and popularity, many fraudsters have viewed it as an easy and inexpensive means of exchanging financial information to conduct illegal activities [49]. As a result, investors who acquire information from online forums must always be cautious of the content they come across.

The alluring potential of obtaining a quick and easy return on investment is what many manipulators seek to exploit. A scheme known as Pump and Dump (P&D) is popular among forums.
Fraudsters disseminate false information about a particular stock in an attempt to artificially raise its price so that they can sell their purchased shares at a higher price. Investors with little knowledge or trading experience may act upon the information, believing it to be credible, and fall victim to the scheme by buying in. Once the fraudsters sell off their shares, the price of the stock begins to plummet, resulting in many investors losing their money. While it can be difficult for investors to detect these types of deceptive content, computers may fare better, as they are known to be capable of handling such tasks [34, 73].

Even though technological advancements have contributed to an increased risk of manipulation, they have also provided new methods for collecting and analyzing data to identify its occurrence. An earlier computer-based approach to detecting manipulation has been to observe known patterns and predefined thresholds. Taking in market data, such as the price and trading volume of stocks, such systems monitor for suspicious activities using a set of rules and triggers for notification. However, those methods suffer from various weaknesses, such as the inability to detect abnormal behaviours that deviate from historical patterns, as well as difficulty adapting to changing market conditions [44]. Machine learning is an alternative method that can overcome these challenges, as it can learn and improve through experience.
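
To make the rule-based monitoring described above concrete, the following is a minimal sketch of a fixed-threshold detector over daily price and volume data. It is an illustration only and not a method taken from this thesis: the `Bar` record, the `flag_suspicious_days` function, the window length and both thresholds are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Bar:
    """One trading day for a single stock (hypothetical record layout)."""
    close: float
    volume: float


def flag_suspicious_days(bars, window=5, price_jump=0.20, volume_multiple=3.0):
    """Return indices of days that trip both fixed rules.

    Rule 1: the close rose by more than `price_jump` over the previous close.
    Rule 2: volume exceeded `volume_multiple` times the trailing average volume.
    All thresholds are illustrative assumptions, not values from the thesis.
    """
    flagged = []
    for i in range(window, len(bars)):
        prev_close = bars[i - 1].close
        avg_volume = mean(b.volume for b in bars[i - window:i])
        if prev_close <= 0 or avg_volume <= 0:
            continue  # skip days where the ratios are undefined
        price_change = (bars[i].close - prev_close) / prev_close
        volume_ratio = bars[i].volume / avg_volume
        if price_change > price_jump and volume_ratio > volume_multiple:
            flagged.append(i)
    return flagged


# Synthetic data: five quiet days, then a sharp price and volume spike on day 5.
history = [Bar(1.00, 10_000)] * 5 + [Bar(1.50, 80_000), Bar(0.90, 20_000)]
suspicious = flag_suspicious_days(history)
print(suspicious)
```

Because the thresholds are fixed in advance, a scheme that raises the price more gradually or in a less liquid stock would pass unnoticed, which is the kind of weakness that motivates learning the patterns from data instead.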