
Modelling in Online Forums

by

David Nam

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science

Queen’s University
Kingston, Ontario, Canada
October 2020

Copyright © David Nam, 2020

Abstract

Over the past several decades, advances in technology have significantly impacted all aspects of the financial system. While these advances have led to numerous benefits, they have also increased the methods for manipulating the market. A frequent platform used to perform these market manipulation schemes has been social media. In particular, online forums have become a tool for manipulators to disseminate false or misleading information so that they can profit from other investors. As a result, my research provides investors with valuable insights and the tools necessary for detecting pump-and-dump schemes. To achieve this, posts and comments within financial forums were first collected. Then, financial data was added to associate the texts with resulting market behaviours. By using statistical methods, the records were then initially labelled depending on whether they exhibited a known market pattern that commonly occurs when investors act upon deceptive content. To further improve upon the labelling method, comments of deceptive posts were then relabelled based on their level of agreement with the fraudulent information. With the described agreement model, results showed that predictions among the tested classification techniques (XGBoost, Random Forest, SVM, MLP, CNN, BiLSTM) were improved. Additionally, by comparing the performance of the classifiers, CNNs were found to be the best performing model among those that were tested.

Acknowledgments

I would like to express my sincere gratitude to my supervisor, Dr. David Skillicorn. His guidance and patience throughout this journey have been invaluable to my success. Thank you to all the professors whom I had the pleasure of meeting. The knowledge they shared has played an important role in shaping my research. I would also like to thank all my lab-mates. Their company and support have made my days at the lab enjoyable, and my overall experience at Queen’s University memorable. Lastly, I would like to thank my family and friends who have been supportive and patient with me. Their encouragement pushed me to become the best version of myself.

Contents

Abstract i

Acknowledgments ii

Contents iii

List of Tables v

List of Figures vi

Chapter 1: Introduction 1

Chapter 2: Background and Related Work 6
2.1 Data Sources ...... 6
2.2 Market Data ...... 8
2.2.1 OHLCV ...... 9
2.2.2 Market Capitalization ...... 10
2.2.3 Penny Stocks ...... 11
2.2.4 Market Manipulation ...... 11
2.2.5 Pump and Dump ...... 13
2.2.6 Event Study ...... 15
2.3 Tools ...... 16
2.3.1 Stance Detection ...... 16
2.3.2 Empath ...... 16
2.3.3 Synthetic Minority Oversampling Technique (SMOTE) ...... 17
2.3.4 Adaptive Synthetic (ADASYN) Approach ...... 17
2.3.5 SHAP ...... 18
2.4 Techniques ...... 18
2.4.1 Non-negative Matrix Factorization (NMF) ...... 19
2.4.2 Latent Dirichlet Allocation (LDA) ...... 19
2.4.3 Singular Value Decomposition (SVD) ...... 20
2.4.4 Extreme Gradient Boosting (XGBoost) ...... 20

2.4.5 Random Forest (RF) ...... 21
2.4.6 Support Vector Machine (SVM) ...... 22
2.4.7 Artificial Neural Networks ...... 22
2.5 Related Work ...... 25
2.6 Summary ...... 32

Chapter 3: Experiments 33
3.1 Data Collection ...... 34
3.1.1 Collecting Data from Reddit ...... 35
3.1.2 Collecting Data from Yahoo! Finance ...... 36
3.2 Text Preprocessing ...... 39
3.3 Data Labelling ...... 45
3.3.1 Anomaly Detection ...... 46
3.3.2 Price Trend ...... 48
3.3.3 Agreement Model ...... 50
3.4 Handling Class Imbalance ...... 55
3.5 Techniques Used ...... 56
3.6 Summary ...... 59

Chapter 4: Results 60
4.1 Data Overview ...... 61
4.2 Clustering Results ...... 69
4.3 Classification Results ...... 75
4.4 Discussion ...... 86
4.5 Results Summary ...... 89

Chapter 5: Conclusion 92
5.1 Summary ...... 92
5.2 Limitations ...... 94
5.2.1 Accuracy of Market Behaviour ...... 94
5.2.2 Labelling of Market Data ...... 95
5.2.3 Biases within Agreement Model ...... 95
5.2.4 Mistaking as Deceptive Content ...... 96

Bibliography 97

Appendix A: Computational Resources 111

Appendix B: SHAP Summary Plots 113

List of Tables

3.1 Features of Reddit data ...... 36
3.2 Features of Yahoo! Finance data ...... 38
3.3 List of stopwords that were removed ...... 44
3.4 List of generated words by Empath ...... 52
3.5 List of custom words used in the Agreement Model ...... 54

4.1 Breakdown of records collected from subreddits ...... 61
4.2 Dataset class distribution ...... 68
4.3 Comments pre and post agreement model ...... 71
4.4 Summary of individual model performance ...... 76
4.5 Examples of misclassified posts from CNN model ...... 87

List of Figures

2.1 Screenshot of a subreddit on Reddit ...... 7
2.2 Sample chart of a stock on Yahoo! Finance ...... 8
2.3 Candlestick for Stock Price (OHLC) ...... 10
2.4 Stages of Pump and Dump ...... 14

3.1 Experiment workflow ...... 34
3.2 Time window used to collect market data ...... 39
3.3 Labelling of stock behaviours ...... 49
3.4 Distribution of stock price trend slopes ...... 51

4.1 Data Collection Trend ...... 62
4.2 Top 20 frequent words ...... 64
4.3 Histogram of discussed market sectors within subreddits ...... 65
4.4 Histogram of discussed market sectors within texts labelled as P&Ds ...... 66
4.5 Healthcare posts and comments trend ...... 68
4.6 Technology posts and comments trend ...... 69
4.7 Pump and dump posts trend ...... 70
4.8 NMF plot for posts ...... 72
4.9 NMF plot for posts and comments ...... 72
4.10 LDA plot for posts ...... 73

4.11 LDA plot for posts and comments ...... 73
4.12 SVD plot for posts ...... 74
4.13 SVD plot for posts and comments ...... 74
4.14 MLP SHAP Summary Plot for posts ...... 80
4.15 MLP SHAP Summary Plot for posts and comments ...... 81
4.16 CNN SHAP Summary Plot for posts ...... 82
4.17 CNN SHAP Summary Plot for posts and comments ...... 83
4.18 BiLSTM SHAP Summary Plot for posts ...... 84
4.19 BiLSTM SHAP Summary Plot for posts and comments ...... 85

B.1 XGBoost SHAP Summary Plot for posts ...... 114
B.2 XGBoost SHAP Summary Plot for posts and comments ...... 115
B.3 RF SHAP Summary Plot for posts ...... 116
B.4 RF SHAP Summary Plot for posts and comments ...... 117
B.5 SVM SHAP Summary Plot for posts ...... 118
B.6 SVM SHAP Summary Plot for posts and comments ...... 119


Chapter 1

Introduction

Ever since the introduction of the global financial system, market manipulation has been an important issue. Broadly defined as the intentional act of deceiving others to alter or misrepresent market prices, its presence poses a threat to the belief held by many investors that the market is fair and free. In order to provide the public with confidence and bring efficiency to the markets, financial regulators (e.g., the SEC) employ various monitoring techniques to detect, investigate and prosecute these illicit activities [12, 27]. While the schemes that manipulate the market are well documented and have stayed relatively the same, the means and opportunities for conducting them have continued to evolve through the years. In particular, the introduction of new financial products and technologies has allowed investors to easily enter the market. However, it has also increased the risk of manipulation. With the growth in participants, detecting and investigating fraudulent activities have become much more difficult, resulting in many going undetected [32]. The advent of social media has given rise to new methods for manipulating the market. With its development and popularity, many fraudsters have viewed it as an

easy and inexpensive means of exchanging financial information to conduct illegal activities [49]. As a result, investors who acquire information from online forums must always be cautious of the content that they come across. The alluring potential of obtaining a quick and easy return on investment is what many manipulators seek to exploit. A scheme known as Pump and Dump (P&D) is popular among forums. Fraudsters disseminate false information about a particular stock in an attempt to artificially raise the price such that they can sell their purchased shares at a higher rate. Investors with little knowledge or trading experience may act upon the information believing it to be credible and fall victim to it by buying in. Once the fraudsters sell off their shares, the price of the stock begins to plummet, resulting in many investors losing their money. While it can be difficult for investors to detect this type of deceptive content, it may be different for computers, as they are known to be capable of handling such tasks [34, 73]. Even though technological advancement has contributed to an increased risk of manipulation, it has also provided new methods for collecting and analyzing data to identify its occurrences. With the use of a computer, an earlier approach to detecting manipulation has been to observe known patterns and predefined thresholds. By taking in market data, such as the price and trading volume of stocks, suspicious activities are monitored using a set of rules and triggers for notification. However, those methods suffer from various weaknesses, such as the inability to detect abnormal behaviours that deviate from historical patterns, as well as struggling to adapt to changing market conditions [44]. Machine learning is an alternative method that can overcome these challenges, as it can learn and improve through experience. More importantly, it does not require in-depth financial knowledge to build a model.

Thus, much research has been conducted within the field to use machine learning to detect manipulation. The majority of available literature on the subject has focused on analyzing market data specifically to build models that can identify manipulations and aid regulators in catching those that conduct them. While the proposed approaches may be useful in prosecuting individuals that engage in such activities, they do little to prevent investors from being deceived. In most cases, many victims never recover their investments. There is, however, a different approach to address this issue. Rather than analyzing the market data to identify manipulations, machine learning and natural language processing (NLP) may provide a model that can distinguish P&D schemes within financial forums. With its use, investors could be quickly informed about the material that they come across, ensuring that they refrain from investing in illegal schemes that aim to defraud them. While there are a few studies that have explored this avenue, they do not extensively look into the application of the different machine learning techniques that are available. Furthermore, little is discussed on how the textual data are handled to improve the performance of the models. Therefore, my research focuses on those areas to investigate a better solution for detecting market manipulation. In order to develop a model that can detect P&D content within financial forums, I begin by constructing a labelled dataset that contains texts that are associated with P&Ds. I achieve this by first collecting textual data from different forums and then identifying the mentioned stock within each text. By using that information, I then retrieve the corresponding market data surrounding the day that the message is published. By combining the text and market data, it is possible to differentiate which content is influencing the market. As a result, with the use of statistical

methods, I initially label all texts that exhibit abnormal market behaviour as P&D. This leads to the discovery that I should not treat all textual data from the forums as the same. Within a forum, there exist two different categories of texts. The first is known as a post, which initiates a discussion. The other category is the comments about the post. For example, an individual may create a post to announce that a stock is about to rise, which may result in many others sharing their opinion under the same discussion thread. Among the responses, not all messages may be the same, as there may be those that disagree with the original statement. This analysis leads to an agreement model which looks at the messages and determines their level of agreement. As a result, the model only labels the texts that agree with the post as P&D. This approach establishes the idea that responses which agree with a fraudulent statement may exhibit similar language. By applying the agreement model to the initially labelled dataset, I achieve a more refined set of record labels for the classification models. With the new labelling method, the results of the classifiers reveal a more robust prediction of P&D content. It also indicates that comments which agree with fraudulent posts can provide insight into detecting other P&D posts. By also comparing the performance of the classifiers, the best performing model achieves an accuracy of 85% and an F1-Score of 62%. Even though detecting market manipulation is inherently difficult, my research demonstrates that by taking a different approach, machine learning and NLP can be used to tackle the issue. Furthermore, it provides a unique method of labelling textual data from online forums to improve the performance of the detection models. This thesis is organized into five chapters. Following the introduction, Chapter 2

provides the background knowledge and the related work that are necessary for the understanding of my research. In Chapter 3, I outline the details of my experimental setup. Chapter 4 presents the initial findings of the data and the resulting performance of the individual techniques that were used. Chapter 5 summarizes my work within this thesis and presents its limitations.

Chapter 2

Background and Related Work

This chapter explains the various concepts and techniques that are relevant to the understanding of my research. Furthermore, it presents the work that has been conducted by others with regard to market manipulation. The first section describes the data sources that were used to conduct the study. The second section provides information on the market data, which includes basic terminology, types of manipulation, and the event study methodology. The third section provides descriptions of the critical tools that were used within my research to help analyze the data. The subsequent section provides a brief description of each of the techniques that were used. Finally, the last section presents previous work on the detection of market manipulation.

2.1 Data Sources

In order to obtain the data for my research, two different data sources were utilized. The first was a popular online website called Reddit, where many users frequently interact to discuss various topics, including the stock market. The second source of data was Yahoo! Finance, a financial market platform that provides historical data about any given company.

Figure 2.1: Screenshot of a subreddit on Reddit

Reddit:

Reddit is an online social news platform that provides access to a collection of forums for users to discuss and vote on content. Each forum is commonly referred to as a subreddit, where each is dedicated to the discussion of a specific topic. Additionally, all users can create a post, which is a new thread for discussion. They may also comment on any existing post to engage in a discussion about its contents. While there are many subreddits which discuss a myriad of topics, the following are a few popular forums that were created explicitly for the discussion of stocks: r/pennystocks, r/wallstreetbets, r/stocks, r/RobinHoodPennyStocks, r/TheWallStreet. To help visualize, Figure 2.1 provides a screenshot of what a subreddit looks like.

Figure 2.2: Sample chart of a stock on Yahoo! Finance

Yahoo! Finance:

Yahoo! Finance is a free platform provided by Yahoo! for investors to access financial news, market data, and basic financial tools to aid in their investment decisions. Among the many features that are provided by the website, the one that is of interest for my research is the ability to look at the historical and present information of any given stock. Given a stock symbol or company name, it provides the relevant market data, similar to what is shown in Figure 2.2. It contains the volume and the various price points at which the stock is traded within the given time frame.

2.2 Market Data

In a time when market trades no longer take place over the phone but are conducted over the Internet, market data refers to the real-time information of a financial instrument, such as a stock. Among many other things, the data contains the price,

bid/ask quotes, and market volume of the stock. This information is reported by various trading venues such as stock exchanges. Some examples of well-known stock exchanges are the New York Stock Exchange (NYSE) and the Toronto Stock Exchange (TSX). Traders and investors take the market data from the exchanges to gain information about a stock before they make a trade. While market data is generated in real-time, it can also be used to obtain historical information. It is with this information that many traders try to derive patterns and strategies for future trades.

2.2.1 OHLCV

To help display the market data, a candlestick chart is often used to graphically illustrate the change in price and trading volume of a given stock. It utilizes five different values, known as Open, High, Low, Close, and Volume (OHLCV), to describe the price and volume movements within a specific time frame. As shown in Figure 2.3, the candlesticks are composed of two components. The wide bar, known as the real body, represents the range between the opening and closing prices of the trading interval, whereas the wicks, or thin lines, represent the highest and lowest prices for the stock. In order to help visualize whether the closing price was higher or lower than the opening price, the colours green and red are typically used to indicate the movement. Along with the prices, the lower portion of the chart is used to display the volume movements for a given stock. Unlike the OHLC, the volume is typically represented by bars, where the height indicates the amount traded within the time interval. The volume also utilizes colours to display whether the stock price closed higher for that interval than it did for the prior one. Putting the OHLCV values together, a chart like Figure 2.2 can be made.

Figure 2.3: Candlestick for Stock Price (OHLC)
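To make the definitions concrete, the OHLCV values for a single interval can be derived from the individual trades that occurred within it. The sketch below is purely illustrative; the trade records are hypothetical.

```python
# Illustrative sketch: derive one interval's OHLCV bar from a list of
# (price, size) trades in chronological order. Trade data is hypothetical.

def ohlcv(trades):
    """trades: list of (price, size) tuples in chronological order."""
    prices = [p for p, _ in trades]
    return {
        "open": prices[0],                    # first traded price
        "high": max(prices),                  # highest traded price
        "low": min(prices),                   # lowest traded price
        "close": prices[-1],                  # last traded price
        "volume": sum(s for _, s in trades),  # total shares traded
    }

bar = ohlcv([(5.10, 200), (5.25, 100), (5.05, 400), (5.20, 300)])
print(bar)  # {'open': 5.1, 'high': 5.25, 'low': 5.05, 'close': 5.2, 'volume': 1000}
```

A real candlestick chart simply repeats this aggregation for every interval and colours the bar green or red depending on whether the close exceeds the open.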

2.2.2 Market Capitalization

Also referred to as “market cap”, market capitalization is the current market value of a company. It is calculated by taking the market price of one company share and multiplying it by the total number of outstanding shares. Due to the varying prices and numbers of shares that companies can have, they can differ widely in their market capitalization. Therefore, companies are categorized within specific market cap ranges to help group those of similar sizes. The following are common classifications for different market caps:

• Large-cap (> $10 billion)

• Mid-cap ($2 billion - $10 billion)

• Small-cap ($300 million - $2 billion)

• Micro-cap ($50 million - $300 million)

• Nano-cap (< $50 million)
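As a small worked example, the calculation and the category thresholds above can be sketched as follows; the share price and share count are hypothetical.

```python
# Hypothetical sketch: market capitalization is share price times shares
# outstanding, mapped onto the cap ranges listed above.

def market_cap(share_price, shares_outstanding):
    return share_price * shares_outstanding

def cap_category(cap):
    if cap > 10e9:       # > $10 billion
        return "Large-cap"
    if cap >= 2e9:       # $2 billion - $10 billion
        return "Mid-cap"
    if cap >= 300e6:     # $300 million - $2 billion
        return "Small-cap"
    if cap >= 50e6:      # $50 million - $300 million
        return "Micro-cap"
    return "Nano-cap"    # < $50 million

cap = market_cap(3.50, 40_000_000)  # a $3.50 stock with 40M shares: $140M
print(cap_category(cap))            # Micro-cap
```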

2.2.3 Penny Stocks

A penny stock refers to the stock of a company whose market capitalization is typically either micro-cap or nano-cap. The term penny stock came about as it referred to stocks that traded for less than a dollar. However, it has been reclassified by the U.S. Securities and Exchange Commission (SEC) to refer to any stock that is traded by a small public company for less than $5 per share [70]. Many of these companies are known for their low liquidity, due to their limited coverage by analysts and limited interest from institutional buyers. Furthermore, due to their low price, retail investors can often buy a large quantity of these stocks without having to invest too much money. This, however, can make the price of the stock susceptible to changes in demand and supply. It is from this volatility that there is the potential to make large returns on investment. While this has allured many investors to take part in trading penny stocks, it has also left these stocks vulnerable to manipulation by malicious actors. One particular study found that 50% of manipulated stocks are those with a small market capitalization [14].

2.2.4 Market Manipulation

Market manipulation can be broadly defined as intentional attempts to deceive investors by affecting or controlling the price of a security (e.g., stocks). These types of actions are prohibited by laws and regulations as they damage the trust and belief held by many investors that the market is fair. While market manipulation has become easier to conduct through the use of the Internet, it is not an issue that has only recently emerged. It is often described as a perpetual game between manipulators and regulatory bodies that has existed for centuries [62]. Even when the market is well regulated and mature, a study by Aggarwal and Wu [14] has shown that within

the U.S. stock market, there were 142 manipulation cases in the 1990s. Despite its broad definition, there exist widely differing strategies for manipulation [44]. It was not until the work of Allen and Gale [16] that the study of manipulation was brought to great attention. Within it, they introduced the concept that manipulation strategies can be classified into the following three forms:

• Action-Based Manipulation

• Information-Based Manipulation

• Trade-Based Manipulation

Action Based Manipulation

Manipulation strategies that fall under this category involve various actions taken by an individual or the management of a company to change the actual or perceived value of the stock [47]. Any action, besides trading, that distorts the actual value of an asset can fall under this category [66]. A simple example is when a manipulator acquires shares of a company and then promptly announces that they plan to purchase the company. This would lead to an increase in price, allowing the manipulator to sell their shares at a higher price. Once they have sold their shares, they would subsequently withdraw their bid for the company.

Information Based Manipulation

This form of manipulation often involves the release of false or privileged information to affect the value of a stock. With the use of false information, individuals may look to spread favourable or damaging rumours to change the price artificially. An

example might be a large shareholder of a company disseminating false positive information to inflate the price, selling once it reaches a higher price. While this type of manipulation is often conducted by unknown individuals, it can also be conducted by those who possess privileged information. Market analysts and forecast agencies are among the many who may announce misleading information or influence the stock price with the announcement itself, as they are considered credible among investors [14].

Trade Based Manipulation

This type of manipulation involves influencing the price of a stock through trading. It often occurs when an individual artificially generates activity for a stock in hopes of raising the price. An example of this would be when a manipulator buys and sells shares of a stock to themselves at a high frequency in order to make the stock seem more desirable than it really is. Outside investors may look towards the activity of the stock and believe that there is a growing demand for it. Subsequently, those who purchase it would only further increase the price and help the manipulator reach their goal of selling their shares at a profit.

2.2.5 Pump and Dump

Classified under information-based manipulation, pump and dump (P&D) is the act of artificially raising the price of a stock through the dissemination of false information. As shown in Figure 2.4, this manipulation strategy involves three different stages [51]. The operators of the scheme will first purchase the desired stock that they are looking to manipulate (Accumulation). Once they have acquired the shares, they will

Figure 2.4: Stages of Pump and Dump

release false information in order to make the stock seem more desirable than it is, subsequently driving up the price (Pump). Once the price has risen to the desired level of profit, the operators will sell off their shares before anyone uncovers that the information has no basis or the hype dies down (Dump). Prior to the Internet, manipulators would often conduct cold calls in an attempt to convince individuals to purchase shares of a stock. However, through the use of the Internet, it has become much more common and simpler for fraudsters to reach a wider audience through spam email and social media (e.g., online forums, chat rooms) [6, 58]. Thus, P&Ds have become the most popular type of manipulation conducted over the Internet. While P&D schemes can occur with any stock on the market, they are typically prevalent amongst those that are micro- and small-cap [35]. The study by Aggarwal and Wu [14] found that around 50% of manipulated stocks are small-cap stocks. This is primarily due to the low liquidity of these stocks, as it does

not take a large number of buyers to push the price of a stock higher. In order to identify P&Ds within the market, typical patterns of the scheme must be established. While the duration and method for conducting a P&D may vary, two leading indicators that can help identify them within the market are the price and volume [51]. When observing the price, a P&D will display a significant increase within a short amount of time. The gain in price will be larger than the average fluctuations that the stock typically experiences. This is usually accompanied by a dramatic decrease in the price once the pump has been completed. As for the volume, it will display a similar increase as the stock gains more interest among investors during the pump phase. However, the volume will not immediately experience the same sort of decline as the price when the operators begin to dump their shares. This is typically due to the investors who will also try to sell off their investments once they realize the price is falling.
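The price/volume pattern just described can be sketched as a simple rule-based check of the kind used by the earlier threshold-driven detectors mentioned in Chapter 1. The 50% price jump and 3x volume thresholds below are hypothetical, not values used in this thesis.

```python
# Simplified sketch of the pump pattern: flag a day whose close and volume
# both spike far above the trailing average. Thresholds are hypothetical.

def pump_suspect(closes, volumes, window=5, price_jump=0.5, vol_jump=3.0):
    """Return True if the last close is more than `price_jump` (as a
    fraction) above the trailing mean close AND the last volume is more
    than `vol_jump` times the trailing mean volume."""
    base_close = sum(closes[-window - 1:-1]) / window
    base_vol = sum(volumes[-window - 1:-1]) / window
    return (closes[-1] > base_close * (1 + price_jump)
            and volumes[-1] > base_vol * vol_jump)

closes = [1.00, 1.02, 0.99, 1.01, 1.00, 2.40]            # sudden price spike
volumes = [10_000, 12_000, 9_000, 11_000, 10_500, 90_000]  # volume surge
print(pump_suspect(closes, volumes))  # True
```

As the thesis notes, such fixed rules are brittle; they serve here only to make the indicator pattern explicit.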

2.2.6 Event Study

In finance, an event study is an empirical analysis performed to examine the impact of a specific event on the value of a company [64]. It involves looking at the price returns within a given period prior to an event, which is referred to as the estimation window. From these, the normal return of a stock can be estimated for the day of the event and the days surrounding it. The time frame surrounding the event is commonly referred to as the event window. By comparing the estimated returns to the actual returns within the event window, the difference can show what the impact was on the company. Examples of its use can be when companies declare bankruptcy or when they announce a merger. Using an event study can reveal trends

or patterns among similar events which can help investors predict how certain types of stocks will behave [52].
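A minimal numeric sketch of this methodology, under a constant-mean-return model of "normal" returns, looks as follows; the daily returns and window split are hypothetical.

```python
# Constant-mean-return event study sketch: estimate the normal daily return
# over the estimation window, then report abnormal returns (actual minus
# expected) over the event window. Data is hypothetical.

def abnormal_returns(returns, est_end):
    """returns: daily returns; days [0, est_end) form the estimation
    window, days [est_end, ...] form the event window."""
    expected = sum(returns[:est_end]) / est_end  # estimated normal return
    return [r - expected for r in returns[est_end:]]

# Five quiet estimation days, then a three-day event window.
daily = [0.01, -0.01, 0.02, 0.00, -0.02, 0.15, 0.10, -0.20]
ars = abnormal_returns(daily, est_end=5)
print([round(a, 3) for a in ars])  # [0.15, 0.1, -0.2]
```

Real event studies usually estimate expected returns with a market model regressed against an index rather than a simple mean, but the comparison of estimation and event windows is the same.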

2.3 Tools

Within my research, various tools were utilized to aid in the analysis of the data that was collected. This section provides descriptions of those tools and explains any underlying concepts that may be required to understand them.

2.3.1 Stance Detection

Within the field of Natural Language Processing (NLP), stance detection is the process of determining the attitude or viewpoint of a text towards a target. It aims to detect whether the author of the text is in support of or against a given entity [59]. Unlike sentiment analysis, which only determines the polarity of a text, stance detection considers the author’s favourability towards the target [4]. The target may not be explicitly stated within the text, but it is understood that the text is addressed to it. Some known applications of stance detection have been to political debates, fake news, and social media [41, 77, 83].

2.3.2 Empath

Empath is a tool that was developed by Fast et al. [38] for researchers to generate and validate new lexical categories on demand. It utilizes deep learning to establish connections between words and phrases that were used in modern fiction. Given a small set of seed words that represents a category, Empath can provide new related terms

using its neural embeddings. The embeddings are a low-dimensional vector representation of the text data that it has learned. It also employs crowd-sourcing to validate that the terms it provides are related. Along with the ability to create new categories, Empath comes with 200 built-in, pre-validated categories covering common topics (e.g., neglect, government, social media).

2.3.3 Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is a method that was developed to address the problems associated with imbalanced datasets [21]. When only a small number of examples exist within the minority class, machine learning techniques often have difficulty learning the features that are associated with that class. One approach to address the problem is to oversample the minority class [26]. Rather than just duplicating the necessary examples, SMOTE provides new information to the models by synthetically creating new examples. Selecting a random example within the minority class, it chooses one of its nearest neighbours within the same class. It then creates a synthetic example that lies between the two. This approach allows new examples to be created which are relatively close in feature space to those that already exist within the minority class.
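The core interpolation step can be sketched in plain Python (the full SMOTE algorithm, as in the imbalanced-learn library, adds neighbour selection over all minority points and a target sampling ratio); the data points here are hypothetical.

```python
import random

# Core SMOTE step, sketched: pick a minority example, pick one of its k
# nearest minority-class neighbours, and create a synthetic point on the
# line segment between them. Minority data is hypothetical.

def smote_sample(minority, k=3, rng=random):
    x = rng.choice(minority)
    others = [m for m in minority if m is not x]
    # k nearest neighbours of x within the minority class
    neighbours = sorted(
        others,
        key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
    n = rng.choice(neighbours)
    gap = rng.random()  # random position along the segment between x and n
    return tuple(a + gap * (b - a) for a, b in zip(x, n))

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0)]
synthetic = smote_sample(minority)
print(synthetic)  # a new point between two existing minority examples
```

Because the synthetic point is a convex combination of two real minority examples, it always lies within the region the minority class already occupies.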

2.3.4 Adaptive Synthetic (ADASYN) Approach

Similar to SMOTE, ADASYN is another approach that addresses the problems associated with the minority class in imbalanced datasets. It achieves this by also creating synthetic examples. The critical difference between ADASYN and SMOTE is that the former uses a density function. It uses the function to decide the number

of synthetic examples to create for a given minority data point [85]. By observing the distribution of the majority class around a minority data point, it generates more synthetic examples within neighbourhoods dominated by the majority class. This allows the approach to be deemed adaptive, as it produces more data within areas that are harder for the models to learn [72].
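The adaptive allocation step can be sketched as follows: each minority point's share of the new synthetic examples is proportional to how strongly the majority class dominates its neighbourhood. The data and the choice of k are hypothetical.

```python
# Sketch of ADASYN's adaptive allocation: for each minority point, measure
# the fraction of its k nearest neighbours (over the whole dataset) that
# belong to the majority class, then allot synthetic examples in
# proportion to that ratio. Data and k are hypothetical.

def adasyn_allocation(minority, majority, n_new, k=3):
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for x in minority:
        neighbours = sorted(
            (q for q in labelled if q[0] is not x),
            key=lambda q: sum((a - b) ** 2 for a, b in zip(x, q[0])))[:k]
        # fraction of the neighbourhood occupied by the majority class
        ratios.append(sum(label for _, label in neighbours) / k)
    total = sum(ratios) or 1.0
    return [round(n_new * r / total) for r in ratios]

minority = [(0.0, 0.0), (10.0, 10.0), (10.1, 10.0)]
majority = [(0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (10.2, 10.1)]
print(adasyn_allocation(minority, majority, n_new=7))  # [3, 2, 2]
```

The isolated minority point at the origin is surrounded by majority examples, so it receives the largest share of the synthetic data; the actual interpolation of each synthetic point then proceeds as in SMOTE.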

2.3.5 SHAP

SHAP (SHapley Additive exPlanations) is a tool that was developed by Lundberg and Lee [63] as a unified approach to interpreting the output of any machine learning model and observing the individual factors that contribute to its prediction. It is based on Shapley values, a concept from game theory that determines a fair way to distribute the payoff among players that have worked in coalition towards an outcome [87]. It works best when the contribution of each player is different, and the payoff must be made to reflect their input. Taking this concept to machine learning, SHAP provides the estimated contribution of each feature to the model’s prediction. To support different model types, several functions known as SHAP Explainers are provided to compute a model’s SHAP values. One particular function that is commonly used is TreeExplainer, which is optimized for interpreting tree-based algorithms.
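The underlying Shapley computation (not the shap library itself, which uses much faster model-specific approximations) can be shown exactly on a toy model: average each feature's marginal contribution over every ordering in which features are revealed. The model and the zero baseline below are hypothetical stand-ins.

```python
from math import factorial
from itertools import permutations

# Exact Shapley values for a toy model: a "missing" feature is replaced by
# the baseline value, and each feature's marginal contribution is averaged
# over all orderings. Model and baseline are hypothetical.

def shapley_values(model, x, baseline=0.0):
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        filled = [baseline] * n       # start with every feature "missing"
        prev = model(filled)
        for i in order:
            filled[i] = x[i]          # reveal feature i
            curr = model(filled)
            phi[i] += curr - prev     # its marginal contribution here
            prev = curr
    return [p / factorial(n) for p in phi]

model = lambda z: 2 * z[0] + 3 * z[1]     # a simple additive model
print(shapley_values(model, [1.0, 1.0]))  # [2.0, 3.0]
```

For an additive model the Shapley values recover each feature's own term, and they always sum to the difference between the prediction and the baseline output, which is the "additive" property SHAP relies on.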

2.4 Techniques

Several techniques were used in my research to help in the detection of P&Ds within online forums. This section provides a brief description of those techniques and their implementation. The first three descriptions are of the clustering algorithms that

were employed, followed by the different classification algorithms that were used.

2.4.1 Non-negative Matrix Factorization (NMF)

Non-negative matrix factorization is a matrix factorization technique that produces two smaller matrices, W and H, when given a matrix X [65]. As the name suggests, it enforces the property that all three matrices contain non-negative elements, which allows the resulting matrices to be easily interpreted. In short, the technique reduces the dimensionality of the given matrix such that it is split into underlying components and weights. The rows of matrix W are the weights of each component, whereas the rows of matrix H are the components [54]. Due to its non-negativity constraint, NMF is frequently used in image processing and document clustering, among many other applications [81]. While there are many implementations of NMF available online (e.g., Nimfa), the one used within this study was Scikit-Learn's [74].
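A minimal sketch of the factorization with Scikit-Learn's NMF, on a toy document-term matrix (the matrix values are illustrative only):

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy document-term count matrix: the first two "documents" share one
# vocabulary, the last two share another.
X = np.array([[5, 4, 0, 0],
              [4, 5, 1, 0],
              [0, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

model = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = model.fit_transform(X)  # rows: per-document component weights
H = model.components_       # rows: per-component term loadings
```

Because W and H are non-negative, each document can be read off as an additive mixture of the two components.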

2.4.2 Latent Dirichlet Allocation (LDA)

Primarily used for topic modelling within NLP, Latent Dirichlet Allocation is a generative probabilistic model for collections of discrete data [19]. Given a set number of components, it provides the probability distribution over the components for each observation [57]. In the context of topic modelling, LDA returns the probability of a document belonging to each topic. It achieves this by representing each topic with a set of words and then mapping all the documents to the topics until most are represented by them [40]. Thus, a document can either be

a part of multiple topics or of just one. Among the various implementations of LDA (e.g., Gensim, MALLET), the one from Scikit-Learn was used within my research [74].
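A small sketch with Scikit-Learn's LatentDirichletAllocation on four toy documents; the rows returned by `fit_transform` are per-document topic distributions:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["buy stock price shares buy",
        "stock price buy sell",
        "team match goal score",
        "win match team score goal"]

# LDA works on word counts, so vectorize the documents first.
X = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)  # one topic distribution per document
```

Each row of `doc_topic` sums to one, so a document may load almost entirely on one topic or be split across several.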

2.4.3 Singular Value Decomposition (SVD)

Similar to NMF, Singular Value Decomposition is a matrix factorization technique that produces three matrices, UDV^T, when given a matrix A [20]. While SVD is often used in digital signal processing for noise reduction and data compression, it can also be used for dimensionality reduction. Unlike some matrix decompositions, SVD can be applied to any square or rectangular matrix [71]. It produces two matrices, U and V, which are each orthogonal, and a square diagonal matrix called D. The entries of D are non-negative values ordered from largest to smallest; they indicate the amount of variance captured by the corresponding columns of U and V. For the purposes of my research, a reduced version of SVD called Truncated SVD from Scikit-Learn was used [74]. It only calculates the specified number of columns of U and V, which correspond to the largest singular values. This allows the calculations to be done much faster, as the full SVD is not required.
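Truncated SVD with Scikit-Learn, sketched on random data; only the requested number of components is computed, and the singular values come back ordered largest to smallest:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)
A = rng.random((20, 8))

# Only the top-3 singular vectors/values are computed, not the full SVD.
svd = TruncatedSVD(n_components=3, random_state=0)
reduced = svd.fit_transform(A)  # equivalent to U_k @ diag(D_k)
```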

2.4.4 Extreme Gradient Boosting (XGBoost)

Extreme Gradient Boosting is a decision-tree-based ensemble algorithm known for its speed and performance [23]. Used for supervised learning problems, it combines the predictive power of multiple learners (decision trees) to create a robust model. The term boosting refers to models that are built sequentially, such that each subsequent learner aims to reduce the errors of the previous one. This allows for the

trees to learn from their predecessors and improve their own performance [11]. Accordingly, gradient boosting refers to the use of a gradient descent algorithm to minimize the errors within the sequential model [69]. Among XGBoost's unique features, the most important are its abilities to handle sparse and weighted data. It uses a sparsity-aware split-finding algorithm to handle sparse data, which contain missing values or result from preprocessing steps like one-hot encoding. Unlike other tree-based algorithms, which can only handle data of equal weights, XGBoost uses a distributed weighted quantile sketch algorithm to find the best split points for weighted data [11]. To use XGBoost within the research, an open-source Python library called xgboost was used [30].
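The sequential error-correction idea can be illustrated with Scikit-Learn's GradientBoostingClassifier — used here as a self-contained stand-in for the xgboost library itself, on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Trees are added one at a time; each new tree is fit to the gradient of
# the loss left behind by its predecessors.
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
```

The xgboost library follows the same boosting scheme but adds the sparsity-aware and weighted-quantile machinery described above.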

2.4.5 Random Forest (RF)

Random Forest is a decision-tree-based ensemble algorithm often used for classification problems. It is made up of small decision trees, called estimators, each of which makes its own predictions based on a random subset of features [88]. This allows for variation among the trees and results in lower correlation and greater diversification. Low correlation among the estimators helps produce accurate predictions, as the trees protect each other from their individual errors [92]. When making a prediction, RF combines the predictions of all the estimators. For classification tasks, it uses a majority vote to decide on a predicted class; for regression, it takes the mean of the predictions provided by the estimators [88]. Within my research, the RandomForestClassifier class from Scikit-Learn was used [74].
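The aggregation step can be checked directly against the individual estimators; note that Scikit-Learn implements the classification vote as an average of the trees' class probabilities, a soft version of the majority vote described above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# max_features="sqrt": each split considers only a random subset of features,
# which is what decorrelates the trees.
rf = RandomForestClassifier(n_estimators=25, max_features="sqrt",
                            random_state=1).fit(X, y)

# Average the per-tree class probabilities by hand; the forest's prediction
# is the argmax of this average.
avg = np.mean([est.predict_proba(X[:5]) for est in rf.estimators_], axis=0)
```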

2.4.6 Support Vector Machine (SVM)

Primarily used for classification problems, Support Vector Machine is a supervised learning algorithm that looks for the hyperplane that best separates the given data points. Hyperplanes can be described as decision boundaries that help classify the data [39]. While many separating hyperplanes may exist, SVM finds the optimal one by maximizing the margin around it. This is achieved using the data points closest to the hyperplane, known as support vectors, which influence the position and orientation of the boundary. For data points that are not linearly separable, SVM uses a technique called the kernel trick: kernel functions allow points in a low-dimensional space to be operated on as if they had been transformed into a higher-dimensional space [61]. For my research, due to the size of the dataset, Scikit-Learn's implementation of Linear Support Vector Classification was used [74]. This class is similar to the Support Vector Classification class with the kernel parameter set to linear, but performs faster on large datasets.
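The learned hyperplane of a linear SVM can be inspected directly; with Scikit-Learn's LinearSVC, the prediction is simply the sign of w·x + b:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)

# The decision boundary is the hyperplane w.x + b = 0; points on the
# positive side are assigned class 1, the rest class 0.
w, b = clf.coef_[0], clf.intercept_[0]
scores = X @ w + b
```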

2.4.7 Artificial Neural Networks

Artificial Neural Networks are computational networks inspired by the biological nervous system [33]. Much like how the brain processes information, neural networks are composed of a large number of highly interconnected units called artificial neurons. Each artificial neuron is capable of receiving and outputting a signal in the form of real numbers. The output of each neuron is calculated by what is known as an activation function (a non-linear function), which processes the received input signals, taking their weighted sum along with a bias, to determine

whether it should send a signal [29]. The connections among the neurons, through which they communicate, are known as edges. An aggregation of neurons makes up a layer. Each neural network has an input layer, an output layer, and one or more layers in between, known as hidden layers. By taking in information through the input layer, the hidden layers transform the input such that the output layer provides some prediction [37]. Building upon ANNs, deep learning refers to models that use more than one hidden layer. The increase in layers allows the neural network to learn representations of data at various levels of abstraction [55]. This has allowed for improvements within fields such as object recognition and text classification. Within my research, different neural network models were employed to detect P&Ds within the forums. Each was built using an open-source Python library called Keras [31], which runs on top of TensorFlow [13].

Multilayer Perceptron (MLP)

Multilayer Perceptron is known to be the most common type of neural network [5]. It is a class of feedforward neural network that consists of at least three layers (input, hidden, output). Feedforward refers to the direction in which the data travel, from the input layer to the output layer. The connections between the layers are assigned individual weight values to represent their importance. During the training phase, random weight values are initially assigned to the edges, which causes the model's output to differ from the expected output. This difference is referred to as the error. To correct it, a technique called backpropagation is used iteratively to send the error back through the network so that the weights can be adjusted to produce the correct prediction [2].

Convolutional Neural Network (CNN)

Convolutional Neural Network is a class of deep neural network that was initially designed to work with image data [80]. However, CNNs are not limited to that field, as they can also be used for recommendation systems, NLP, and many other tasks. While CNNs are very similar to MLPs, they are capable of considering the locality of features [50]. This is achieved through what are known as convolutional layers, the core building blocks of the network, which consist of learnable filters [1]. Each filter is relatively small, but extends through the full depth of the input data. By sliding the filters across the width and height of the data, the dot product between the filter and the input is computed. The resulting output is a two-dimensional array called a feature map [24]. The values of the feature map can then be passed through a non-linear activation function like ReLU (Rectified Linear Unit). A pooling layer is then used to reduce the spatial size of the features. Through this dimensionality reduction, the dominant features are extracted and then given to a fully-connected layer (i.e., an MLP) to compute the output [80].
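The convolution → ReLU → pooling pipeline can be sketched in a few lines of NumPy for the 1-D case (as used for text); the "falling edge" filter is fixed by hand here, whereas in a CNN the filter weights would be learned:

```python
import numpy as np

def conv1d(x, kernel):
    """'Valid' 1-D convolution (cross-correlation, as CNN layers compute it)."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def relu(v):
    return np.maximum(v, 0)

def max_pool(v, size=2):
    """Non-overlapping max pooling: keep the strongest response per window."""
    return np.array([v[i:i + size].max() for i in range(0, len(v) - size + 1, size)])

signal = np.array([0.0, 0, 1, 1, 0, 0, 1, 1, 0])
falling_edge = np.array([1.0, -1.0])   # filter that fires on a 1 -> 0 transition
feature_map = max_pool(relu(conv1d(signal, falling_edge)))
```

The pooled feature map fires exactly where the pattern occurs, regardless of position — the locality property the section describes.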

Bidirectional Long Short-Term Memory (BiLSTM)

The Long Short-Term Memory (LSTM) network is a type of recurrent neural network (RNN) designed to handle the long-term dependencies present in sequence prediction problems (e.g., speech recognition, image generation, machine translation) [22]. LSTM improves upon a shortcoming of RNNs, namely that they were unable to handle long sequential data. The reason for this is a problem called the vanishing gradient: during training, the values (i.e., gradients) used to update the neural network weights shrink as they are propagated backwards.

Over time, the gradient values become small enough that the model stops learning. To address this, LSTM introduced an internal mechanism called gates that regulates the flow of information. With three types of gates (forget gate, input gate, output gate), it can learn which data in a sequence are necessary to remember or forget [75]. This allows it to pass relevant information down the long chain of sequential data to make a prediction. Improving upon LSTM, a BiLSTM processes the data in both directions (forwards and backwards) with two separate hidden layers. This provides more context to the model and results in better learning of the problem [25].
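A single LSTM step, with the three gates written out in NumPy (parameter shapes here are illustrative; a BiLSTM simply runs this recurrence once forwards and once backwards over the sequence and concatenates the two hidden states):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step. W (4n x d), U (4n x n) and b (4n,) hold the
    stacked parameters for the three gates and the candidate update."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    f = sigmoid(z[0:n])          # forget gate: what to drop from the cell state
    i = sigmoid(z[n:2 * n])      # input gate: what new information to store
    o = sigmoid(z[2 * n:3 * n])  # output gate: what to expose as the hidden state
    g = np.tanh(z[3 * n:4 * n])  # candidate cell values
    c_new = f * c + i * g        # gated update of the long-term cell state
    h_new = o * np.tanh(c_new)   # short-term hidden state
    return h_new, c_new
```

The additive `f * c + i * g` update is what lets gradients survive over long sequences instead of vanishing.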

2.5 Related Work

The application of machine learning algorithms for detecting market manipulation is a relatively new approach in the field of finance. Despite its infancy, there is evidence of growing interest, as the number of studies has continued to increase steadily. Among the various works, a frequently cited study is the paper by Ogut et al. [93], which looked at detecting trade-based manipulation in the emerging Istanbul Stock Exchange (ISE). Their detection model examined the statistical difference between the daily returns, volumes, and volatilities of manipulated stocks and those of the market. Using Artificial Neural Networks and Support Vector Machines, they compared their results against those of discriminant analysis and logistic regression. They found that ANNs and SVMs were better at detecting manipulation than the multivariate statistical techniques; both achieved a specificity of 0.88 and a sensitivity of 0.65. Their work has been considered by others to be the first to introduce machine learning techniques for detecting stock price manipulation [86].

Also looking for better performance than traditional multivariate statistical techniques, Wang et al. [86] used machine learning methods to improve manipulation detection capabilities. They proposed a novel RNN-based ensemble learning (RNN-EL) method to detect trade-based manipulation. The authors state that existing research often dismisses the fact that stock trading data are complex time series consisting of variables like price and volume. As such, they looked to exploit the properties of time series with Recurrent Neural Networks to detect trade-based manipulation. Using the prosecuted manipulation cases reported by the China Securities Regulatory Commission (CSRC), they built a labelled dataset containing trading data and characteristic information to conduct their experiments. Their results showed that their proposed method significantly outperformed traditional multivariate statistical techniques, achieving an F1-score of 47%, with precision at 31.7% and recall at 90.2%. Taking a different approach to the properties of time series, Cao et al. [28] transformed time-varying financial data into pseudo-stationary time series so that machine learning algorithms could be easily applied to detecting market manipulation. Taking real trading data from four popular stocks on NASDAQ, they injected synthetic cases of manipulation (spoofing and quote stuffing) into the data to test their models. Using One-class Support Vector Machine (OCSVM) and K-Nearest Neighbour (KNN), they found not only that the transformation of the data helped the models' performance but also that the algorithms successfully discovered the manipulation cases. Building upon their earlier work, Cao et al. [90] state that existing work primarily focuses on either empirical studies of manipulation or analysis of specific types of

manipulation based on given assumptions. Furthermore, they claim that there is a lack of effective approaches for detecting and analyzing price manipulation in real time. As such, they propose a novel approach called the Adaptive Hidden Markov Model with Anomaly States (AHMMAS) for modelling and detecting trade-based manipulation. Taking a similar approach to their former study, they took data for seven popular stocks from NASDAQ and injected ten simulated manipulated price series. Comparing their proposed model to other benchmark models (OCSVM, KNN, and GMM), they found that AHMMAS gives the best results and effectively detects price manipulation when given time series data. The work by Diaz et al. [36] used an open-box approach that took in financial variables, ratios, and textual sources to detect trade-based manipulation. The study incorporated manipulation cases pursued by the U.S. Securities and Exchange Commission (SEC) in 2003, as well as other sources of data (profiling information, intraday trading information, financial news, and filing relations), to analyze over 100 million trades and 170 thousand quotes. They first used various clustering algorithms to create the training dataset and then applied decision trees (QUEST, C5.0, and CART) and knowledge discovery techniques to detect stock price manipulation. Their research found that their model achieved better results than traditional statistical methods in classifying trades and identifying new fraud patterns associated with market manipulation.

Building upon the work of Diaz et al. [36], the research by Golomohammadi et al. [44] looked at adopting various supervised learning algorithms to detect market manipulation. They found that the manipulation schemes observed all fell within one of three groups, including marking the close and wash trades. By utilizing the dataset from the base work, they used CART, Conditional Inference Trees, C5.0, Random Forest, Naïve Bayes, Neural Networks, SVM, and KNN to find the best classifier. They also used SMOTEBoost to tackle the issue of imbalanced classes within their dataset. Their results showed that Naïve Bayes was the best algorithm among those tested, achieving an F2-score of 53% along with a sensitivity of 0.89 and a specificity of 0.83. Among the literature, there is clear evidence that the majority of the research primarily focuses on the detection of trade-based manipulation. This is because, of the three classifications of manipulation, trade-based is the most common within the market [86]. Research by Huang and Chang further supports this claim: they found that of the manipulation cases prosecuted in Taiwan from 1991 to 2010, 96.61% were trade-based and 3.39% were information-based [48]. There were no cases of action-based manipulation. Despite its prevalence, these statistics do not accurately reflect the difficulty of detecting trade-based manipulation. While action-based and information-based manipulation can be traced back to a single instance of exchange, trade-based manipulation relies on a set of trades that must all be tied together in order to be detected. However, based on the literature, machine learning techniques do show greater performance than traditional methods and have demonstrated that they are capable of detecting market manipulation. While many look towards the developed financial markets as areas for research, it is important to note that emerging markets such as cryptocurrency are also highly vulnerable to manipulation. Cryptocurrencies have recently surged in popularity due to their ability to facilitate payments without the need for a central authority (i.e., banks). Due to their future potential, many have turned to digital currencies as a form

of investment. This, however, has left many investors vulnerable to manipulation schemes, as there are many risks associated with emerging markets. Among them, the lack of credible information and of regulatory oversight has been the most prominent issue that manipulators look to take advantage of [51]. A scheme that thrives under those conditions is the P&D, a popular form of manipulation within cryptocurrencies. In a study of the cryptocurrency market, Xu and Livshits [89] looked to identify market patterns associated with P&D schemes. By investigating 412 P&D activities organized in Telegram channels, they were able to derive features of manipulated coins before, during, and after their pumps. Given that information, they presented Random Forest and Generalized Linear Models capable of providing the likelihood of a pump event. Their findings showed that their model was capable of detecting market manipulation before being given a Telegram message. Similarly, the work of Victor and Hagemann [84] also looked at P&D schemes coordinated through Telegram chats. They examined 149 confirmed cases (ground truth) of P&Ds by observing market capitalization, trading volume, price impact, and profitability. By combining the messages from Telegram channels with those collected from Twitter concerning cryptocurrency, they looked to detect suspicious trading activity within the market. Using XGBoost, they found that the proposed model was able to identify P&D schemes beyond the 149 known cases, achieving a sensitivity of 85% and a specificity of 99%. Within their research, they concluded that P&Ds were frequent among cryptocurrencies with a market capitalization of $50 million or below and often involved trading volumes of

several hundred thousand dollars within a short time frame. While the previous work utilized social platforms to support the market data in detecting manipulation, the work by Mirtaheri et al. [67] looked specifically at forecasting P&Ds by combining information from Twitter and Telegram. By manually labelling known P&D operation messages on Telegram, they used an SVM with a stochastic gradient descent optimizer to label the remaining messages as either P&D-related or not. Using these data, they then used Random Forest to detect whether a manipulation event was going to take place within the market. Their results showed that they were able to detect, with reasonable accuracy, whether an unfolding manipulation scheme was occurring on Telegram. Their proposed model achieved an accuracy of 87% and an F1-score of 90%. While the field of cryptocurrency and its market is fairly new, the available literature shows that it is not immune to market manipulation. Differing from the research within developed markets, work on manipulation detection in cryptocurrency has favoured cases of P&Ds. A possible explanation is that the cryptocurrency market is not developed enough to support the complexity of the other schemes. Regardless of the market and the schemes being detected, the application of machine learning techniques continues to prove useful within the field. Among the research that I have examined regarding market manipulation, only a small amount has looked into its detection within social media. While the works that I have presented concerning cryptocurrency use online messages, those messages are specifically used either to verify known manipulations or to detect those (e.g., pump groups) that are orchestrating them. Only a few studies have actually looked into automating

the detection of texts on forums that are made to deceive investors. One is the research by Delort et al. [34], which proposed a model based on Naïve Bayes classifiers. They used their model to examine messages collected from HotCopper, an Australian stock message board, to identify manipulations and various other issues. While their model found success in identifying messages of concern, they reported that the number of misclassification errors was too significant for it to be used autonomously. Therefore, they suggested that the model be used in a semi-automated context to help moderators quickly identify messages that required attention. Another paper that looked into automating the detection of fraudulent text within online forums was by Owda et al. [73]. Similar to the previous work, their goal was to develop a system that identified suspicious comments warranting further investigation by moderators. Using an Information Extraction system, they compared the collected messages to lexicon templates of known illegal financial activities (e.g., pump and dump, insider information). They determined that the more words and phrases that were matched, the greater the probability that the message concerned something illegal. With the system, they found that of the 3000 comments collected daily, 0.2% were deemed suspicious. Building upon that research, Lee et al. [56] released another study that incorporated a methodology called Forward Analysis into the system. By using stock prices for the comments flagged as suspicious, the system observed and classified them within different ranges of price movements. This allowed them to detect potentially illegal comments based on levels of risk. If their model were implemented, moderators could start investigating the comments that had the most significant impact on the market.

By looking at the available literature on the detection of manipulation within online forums, it is clear that more work can be done within the area. The use of other machine learning techniques that are better suited for text is something that can be explored. Furthermore, a more robust and flexible method to determine the lexicon of those that are known for manipulation must be investigated. Taking these into consideration, my research will look at the gaps within the known literature and identify a model that can improve the detection of manipulation that exists within online forums.

2.6 Summary

In this chapter, the background knowledge for the concepts and techniques used in my research was presented. I began by outlining the requirements for the necessary data sources and then provided a description of those that were selected. Next, I introduced various financial terms that are essential to understanding the data that I collected. I then explained what market manipulation is and the broad classifications of the different strategies that are possible. Among the various types of manipulation, I also discussed the most common scheme conducted within online forums. Subsequently, I described the tools used in my research and then the various techniques used to analyze the data. Finally, I presented the related works that looked into detecting market manipulation with machine learning algorithms. In addition, I presented works regarding the detection of market manipulation within online forums and discussed the areas of improvement that will be explored within my research.

Chapter 3

Experiments

This chapter highlights the various methodologies and procedures employed in my research. First, I describe the process and the various tools used to collect the necessary information from the following data sources: Reddit and Yahoo! Finance. Next, I outline the various steps for preprocessing the data. Among them, I describe how the text data extracted from Reddit are labelled as being a Pump and Dump (P&D). To further refine the labelled records, I explain the method that I utilize to boost the detection of posts associated with market manipulation. While working with the data, I explored several approaches to identify any linguistic cues or patterns among the texts that can help in detecting P&D posts. My first approach involved the use of clustering techniques: I used Non-negative Matrix Factorization (NMF), Latent Dirichlet Allocation (LDA), and Singular Value Decomposition (SVD) to group similar documents and identify possible relationships between them and their labels. The second approach used several classification techniques, such as Extreme Gradient Boosting (XGBoost), Random Forest, Support Vector Machine (SVM), and Artificial Neural Networks (ANNs), to identify feature

importance and to determine the best performing technique that can detect P&D content. In order to help visualize the flow of the experimental setup, Figure 3.1 has been provided.

Figure 3.1: Experiment workflow

3.1 Data Collection

An important component of any problem addressed by machine learning is the data. The quality and quantity of the data play a significant role in dictating the amount of effort required to find a solution. As such, careful consideration must be taken when gathering the right information. In my research, all data were personally collected; I did not rely on any pre-collected dataset that others have provided online. Although using existing datasets leaves more time to focus on the issue at hand, it also has many drawbacks. Collecting the data personally provides more flexibility and control over the material being gathered and also ensures its integrity.

3.1.1 Collecting Data from Reddit

For those looking to gain access and build applications with Reddit in mind, the social news company provides developers free public access to interact with its platform through an Application Programming Interface (API). While certain terms and conditions must be strictly followed, the data provided are readily accessible for non-commercial use. To quickly build an application that could collect the data from Reddit, I relied on the Python Reddit API Wrapper (PRAW), a popular open-source Python package that allows users to easily access Reddit's API [3]. By simply providing the necessary account credentials (Client ID, Client Secret, User-Agent, Username, Password) from Reddit, I was able to draw upon the data from the following subreddits within the social news platform:

• r/pennystocks

• r/RobinHoodPennyStocks

The mentioned subreddits were selected based upon their popularity and their content, which primarily concerns penny stocks. Due to the inherent volatility of penny stocks, many manipulators target them to turn a quick profit. As such, I determined that the chosen subreddits would be the best source for observing P&Ds take place. While both posts and comments were collected from the subreddits mainly for their text, I also gathered several other key pieces of information to aid in the steps beyond data collection. Table 3.1 outlines all the information that I collected for each post and comment. With the application, a Bash script was written and executed by cron (a time-based job scheduler) at the end of every day to collect the data and save them as CSV files. To avoid duplications, I set the application to collect only the content submitted that day within the subreddits.

Feature            Description
Post Title         Title of the post.
Post ID            Unique identification code for the post.
Post Author        Author of the post.
Post Created       Unix timestamp of when the post was submitted.
Post Body          Text of the post.
Comment ID         Unique identification code for the comment.
Comment Author     Author of the comment.
Comment Created    Unix timestamp of when the comment was submitted.
Comment Body       Text of the comment.

Table 3.1: Features of Reddit data
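A crontab entry of the kind described might look as follows — the schedule time, paths, and file names here are hypothetical placeholders, not the thesis's actual configuration:

```shell
# m h dom mon dow  command  — run the collector near the end of every day
55 23 * * * /usr/bin/python3 /home/user/reddit_pnd/collect.py >> /home/user/reddit_pnd/collect.log 2>&1
```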

3.1.2 Collecting Data from Yahoo! Finance

Similar to Reddit, Yahoo! Finance once provided access to a historical data API for users to gather information. However, the service was discontinued in 2017, and those seeking financial data have had to rely on alternative methods. As such, I relied upon a Python module called yfinance to gather the data that I required. Yfinance was built as a workaround that scrapes data from the Yahoo! Finance website and returns the requested financial information (e.g., OHLCV) [17]. As previously mentioned, the relevance of the data collected from Reddit to a particular stock must be identified. However, due to the nature of my research, the texts were analyzed a week after they were submitted on the subreddits. This was done to capture any after-effects that the posts and comments may

have had on the stock market. As such, I built an application that first loaded and analyzed each post within the Reddit CSV file for any mention of a stock symbol within the text. A symbol is an arrangement of characters, usually one to six letters in length, chosen to represent the company on publicly traded stock exchanges (e.g., NASDAQ). To verify that a word within a text was a possible symbol, I first filtered candidates with regular expressions and then verified them against a list of publicly traded stock symbols on NASDAQ, NYSE, and AMEX. If the word did not exist within the three exchanges, it was also checked with yfinance as a last resort. In the case that a post mentioned more than one stock, the post and its associated comments were skipped, as it was difficult to link several market behaviours to a particular text. Conversely, posts that did not contain any stock symbols were also skipped, as they could not be directly associated with any market behaviour. While comments are treated as individual documents much like posts, they were not examined for symbols because they are understood to be responses to the content within the post. Therefore, any comment made about a post is associated with the stock information identified in that post. Under the outlined conditions, if the text matched an existing stock symbol, yfinance was used to collect the financial information outlined in Table 3.2. As shown in Figure 3.2, the period of daily Open, High, Low, Close, and Volume (OHLCV) data that was collected totalled nine business days. The reasoning for the chosen number of days can be broken down into two factors:

• Estimation Window

• Event Window

Feature                Description
Open                   Opening price of the stock for the given period.
High                   Highest price of the stock within the given period.
Low                    Lowest price of the stock within the given period.
Close                  Closing price of the stock for the given period.
Volume                 Total number of shares traded within the given period.
Market Sector          Associated industry that the company is in.
Market Capitalization  Total market value of the company's outstanding shares.

Table 3.2: Features of Yahoo! Finance data
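The symbol-matching step described above can be sketched in Python. The symbol set and regular expression here are illustrative stand-ins for the full NASDAQ/NYSE/AMEX listings and the thesis's actual filter, which are not spelled out:

```python
import re

# Illustrative mini-universe of verified symbols; the real check used
# full NASDAQ/NYSE/AMEX listings, with yfinance as a last resort.
KNOWN_SYMBOLS = {"AYTU", "TSLA", "AMD"}

# Candidate symbols: runs of 1-6 capital letters, optionally prefixed
# with "$" as is common in financial forums (an assumed convention).
CANDIDATE_RE = re.compile(r"\$?\b[A-Z]{1,6}\b")

def find_symbols(text):
    """Return the set of verified stock symbols mentioned in a text."""
    candidates = {m.lstrip("$") for m in CANDIDATE_RE.findall(text)}
    return candidates & KNOWN_SYMBOLS

def usable_record(text):
    """A post is kept only if it mentions exactly one known symbol."""
    return len(find_symbols(text)) == 1
```

Posts with zero or multiple symbols fail `usable_record` and are skipped, matching the conditions described above.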

The first five days [T0,T1] were set aside for the estimation window, as a time frame beyond that might introduce greater chances of overlapping with other P&Ds or events. The data for those days were collected to identify the stock price and volume trend leading up to the day of the post. The trend information was critical in establishing the baseline for the stock's behaviour within the market. For the remaining four business days [T1,T2], beginning from the day of post submission, OHLCV data were collected to analyze the short-term impact of the specific content from the subreddit. While there is no standard for the duration of the event window, it is expected that a P&D would take place quickly once information about a particular stock has been disseminated to the public. Research by Sabherwal et al. [79], which studied the effects of online message boards in conducting market manipulation, found that P&Ds typically occur within four days. Therefore, a four-day window was considered adequate to observe the expected changes in the market. In the scenario that yfinance was not able to retrieve the OHLCV information for a particular

Figure 3.2: Time window used to collect market data.

day due to events such as national holidays, the application automatically looked for the next available business day. Once the necessary amount of financial data was collected, the information was added onto the relevant posts and comments within the dataset and saved as a CSV file.

3.2 Text Preprocessing

After gathering and combining the data from Reddit and Yahoo! Finance, the dataset was preprocessed. Before any of the data was edited, the first step was to filter out any records deemed irrelevant to the research.

In particular, the records that were intentionally skipped during the financial data collection phase were removed from the dataset. Next, the text within each record was preprocessed into a form on which machine learning algorithms perform better. The following steps were taken to preprocess the text within each record:

1. Remove URLs

Within posts and comments, it is common for users to provide links to where they acquired their information. Since it is difficult to determine what information a link refers to, the links themselves are not conducive to solving the problem addressed in my research. Therefore, with the use of regular expressions, any links starting with HTTP or HTTPS were removed from the text. Removing the links also eliminates any possibility of the words within them being taken as meaningful.
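A minimal version of this filter might look as follows; the exact expression used in the thesis is not given, so this regular expression is an assumption:

```python
import re

# Matches links beginning with http:// or https:// up to the next
# whitespace character (an assumed pattern, not the thesis's own).
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)

def remove_urls(text):
    """Drop links and collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", URL_RE.sub(" ", text)).strip()
```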

2. Expand Contractions

Used in both written and spoken English, contractions are combinations of words that are shortened by removing specific characters and replacing them with an apostrophe. Some examples of contractions are:

• don’t ⇒ do not

• I’m ⇒ I am

• we’re ⇒ we are

It is important in Natural Language Processing (NLP) tasks to deal with contractions, as they present several challenges if left unaddressed. If they are left alone and only punctuation is removed, words such as “don’t” would be transformed into “don” and “t”, which can be misleading and confusing in later phases of the research. On the other hand, even if the apostrophes were kept to maintain the original contraction, not only would this increase the vocabulary size within the texts, but the models may perceive the terms differently despite them having the same meaning. Therefore, a Python library called contractions was used to expand the terms [53].
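The behaviour of the contractions library can be illustrated with a small stand-in mapping; only a handful of entries are shown, and matching is done on lowercased whole tokens for simplicity, whereas the real library covers far more cases:

```python
# A tiny stand-in for the `contractions` library used in the thesis.
CONTRACTION_MAP = {
    "don't": "do not",
    "i'm": "i am",
    "we're": "we are",
    "can't": "cannot",
}

def expand_contractions(text):
    """Expand any known contraction token; other tokens pass through."""
    return " ".join(CONTRACTION_MAP.get(w.lower(), w) for w in text.split())
```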

3. Remove HTML Tags

While it is unlikely that the texts collected from Reddit contain HTML tags, it is possible that some posts and comments do. As tags add no value to the texts, they were removed with the use of a popular Python module called Gensim [78]. Among the tools that Gensim provides is the ability to quickly preprocess raw text with custom filters, including one for HTML tags.

4. Remove Punctuation

Punctuation marks are important in written English as they help convey importance and meaning within a sentence, but the complexity of their use is yet to be fully understood by a machine. As such, their presence may only add extra noise if they remain within the texts. Therefore, by using Gensim's punctuation filter,

they were also removed.

5. Remove Extra Whitespaces

As is standard practice among text normalization techniques, extra whitespace within the texts was removed with the use of Gensim.

6. Remove Numbers

While numerals are common, especially in content dealing with financial information, they are of little importance to research that focuses on the language being used. Therefore, numbers were also removed from the texts by using Gensim.
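Steps 3–6 can be approximated without Gensim using only the Python standard library. This sketch mirrors the intent of Gensim's strip_tags, strip_punctuation, strip_numeric, and strip_multiple_whitespaces filters, though the exact behaviour may differ:

```python
import re
import string

def strip_tags(text):
    """Step 3: remove HTML tags."""
    return re.sub(r"<[^>]+>", " ", text)

def strip_punctuation(text):
    """Step 4: replace punctuation characters with spaces."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)

def strip_numeric(text):
    """Step 6: remove digit runs."""
    return re.sub(r"\d+", " ", text)

def strip_multiple_whitespaces(text):
    """Step 5: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()

def clean(text):
    for f in (strip_tags, strip_punctuation, strip_numeric):
        text = f(text)
    return strip_multiple_whitespaces(text)
```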

7. Lemmatization

Within linguistics, lemmatization is the process of returning an inflected word to its base form. It is common within NLP tasks, as it reduces noise in texts by mapping the different forms of a word to its root. The following examples demonstrate lemmatization:

• Selling ⇒ Sell

• Am, Are, Is ⇒ Be

• Stocks ⇒ Stock

While there is a similar method called stemming, which also obtains the base form of a word, its approach differs in that it simply removes inflections from words. For example:

Stemming:      change, changing, changes ⇒ chang
Lemmatization: change, changing, changes ⇒ change

Lemmatization, however, relies on lexical knowledge to obtain the correct base form, whereas stemmed words may not always be present in the dictionary, as shown above. Therefore, the texts were lemmatized with the use of an open-source NLP library called spaCy [46].

8. Remove Stopwords

In NLP, stopwords are a specific set of words that are filtered out before or after processing of text data. The list of words may vary by user, as it can contain words that are common within a corpus, or words that provide little meaning to a sentence and can be removed without sacrificing context. For the purposes of my research, Table 3.3 contains the list of words that were removed from the texts. The stopwords were retrieved and modified from the list of English stopwords contained within a Python library called the Natural Language Toolkit (NLTK) [18]. While many other libraries (e.g., spaCy, Gensim) provide a prepared list of stopwords, NLTK's default list appeared to take the least aggressive approach, with only 179 stopwords. The list used within the research contained a total of 124 stopwords, which excluded the contractions and incomplete words that were part of the initial list.

Stopwords: i, them, does, before, any, me, their, did, after, both, my, theirs, doing, above, each, myself, themselves, a, below, few, we, what, an, to, more, our, which, the, from, most, ours, who, and, up, other, ourselves, whom, but, down, some, you, this, if, in, such, your, that, or, out, no, yours, these, because, on, nor, yourself, those, as, off, not, yourselves, am, until, over, only, he, is, while, under, own, him, are, of, again, same, his, was, at, further, so, himself, were, by, then, than, she, be, for, once, too, her, been, with, here, very, hers, being, about, there, can, herself, have, against, when, will, it, has, between, where, just, its, had, into, why, should, itself, having, through, how, now, they, do, during, all

Table 3.3: List of stopwords that were removed
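Stopword removal itself is a simple filter; a sketch using a small illustrative subset of the Table 3.3 list:

```python
# An illustrative subset of the 124 NLTK-derived stopwords in Table 3.3.
STOPWORDS = {"i", "the", "a", "to", "is", "it", "and", "of"}

def remove_stopwords(tokens):
    """Filter a token list, keeping only non-stopwords."""
    return [t for t in tokens if t.lower() not in STOPWORDS]
```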

Replacing Stock Symbols

Once the text data were transformed by standard normalization techniques, an additional step was taken to replace the stock symbols mentioned within them. Each mentioned symbol was replaced with the market sector that yfinance designates for the stock. The reasoning behind this was that the specific stocks targeted by P&D events may change quickly over time, and any

individual stocks deemed important by models may be irrelevant when new data is presented. However, if the symbols are converted to a grouped term, a trend may be identified. Therefore, all symbols mentioned among posts and comments were replaced with their respective sectors. In the case that a sector could not be identified for a given symbol, it was designated as Unknown. Once the symbols were replaced by their market sectors, it was possible for them to be mistaken for common words. As such, the prefix Sector was added onto the sector names to differentiate them. The following example illustrates the process:

• “AYTU perfect time to buy” ⇒ “SectorHealthcare perfect time to buy”
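The replacement step can be sketched as follows; the symbol-to-sector lookup is a hypothetical stand-in for the yfinance query:

```python
# Hypothetical symbol-to-sector lookup; the thesis resolved sectors
# through yfinance, falling back to "Unknown" when none was found.
SECTORS = {"AYTU": "Healthcare", "TSLA": "Consumer Cyclical"}

def replace_symbol(text, symbol):
    sector = SECTORS.get(symbol, "Unknown")
    # Prefix with "Sector" so the token cannot be mistaken for a
    # common word, and drop internal spaces to keep it one token.
    return text.replace(symbol, "Sector" + sector.replace(" ", ""))
```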

3.3 Data Labelling

After the text data was preprocessed, the next step was to associate each record with a market behaviour so that models could learn the language used in P&D texts. In order to establish a link between a text and the resulting market behaviour, I relied on the historical financial data collected from Yahoo! Finance. For a given record, the stock data surrounding the day on which the text was submitted to Reddit was analyzed. If the market data exhibited behaviours known to be characteristic of P&Ds, the record was labelled accordingly. While it is possible to detect a P&D event by human observation, the number of exchanges and transactions that occur within a day makes a manual approach infeasible. As such, like others who have conducted research within this field, I took an automated approach using anomaly detection [51, 82].

3.3.1 Anomaly Detection

Anomaly detection can be broadly defined as the process of identifying occurrences that deviate from the norm [43]. As outlined in Chapter 2, P&Ds are often characterized in the market by a sharp abnormal increase in volume and price, followed by a quick decline in price. Therefore, anomaly detection algorithms can be used to identify these price and volume movements as they break from their historical trend. In order to achieve this, the following factors were taken into consideration when looking for an anomaly:

Baseline Values

To establish whether an event is anomalous, the stock's price and volume data prior to when the post was submitted were used as a baseline for normal market behaviour. The baseline values for average price and volume were calculated from the data collected within the five-day estimation window.

Since the stock price for a given date, X_t, was collected as OHLC values, the daily average price (DAP) was first calculated for each of the five days.

DAP(X_t) = \frac{1}{4}\,(X_{t_{open}} + X_{t_{high}} + X_{t_{low}} + X_{t_{close}}) \qquad (3.1)

By using the daily averages, the baseline average price (BAP) was then calculated. Within Equation 3.2, X_est denotes the financial data collected within the estimation window, starting from T_0, five days prior to the post submission date, up

until the day before T1.

BAP(X_{est}) = \frac{1}{5} \sum_{t=T_0}^{T_1} DAP(X_t) \qquad (3.2)

As for the baseline average volume (BAV), it was calculated by taking the average of the volume values collected within the estimation window.

BAV(X_{est}) = \frac{1}{5} \sum_{t=T_0}^{T_1} X_{t_{volume}} \qquad (3.3)
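Equations 3.1–3.3 translate directly into code; a sketch assuming each estimation-window day is represented as a dict holding its OHLC tuple and volume:

```python
def daily_average_price(ohlc):
    """Equation 3.1: mean of the open, high, low, and close prices."""
    o, h, l, c = ohlc
    return (o + h + l + c) / 4

def baseline_average_price(window):
    """Equation 3.2: mean daily average price over the estimation window."""
    return sum(daily_average_price(day["ohlc"]) for day in window) / len(window)

def baseline_average_volume(window):
    """Equation 3.3: mean volume over the estimation window."""
    return sum(day["volume"] for day in window) / len(window)
```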

Price Anomaly

Once the baseline values were computed, a threshold was established such that if a stock price went above a certain point, it would be considered anomalous. The threshold was set at two standard deviations above the average price within the estimation window. Using Equation 3.1 on each day of the four-day event window, it was determined whether the daily average price exceeded the threshold.

Price\ Anomaly(X_{event}) =
\begin{cases}
True, & \text{if } DAP(X_t) \geq 2\sigma + BAP(X_{est}) \\
False, & \text{otherwise}
\end{cases}

Volume Anomaly

Similar to detecting the price anomaly, if the volume of the stock went beyond the established threshold, it was considered anomalous. Given the daily volumes within the four-day event window, it was determined whether the volume went

higher than the threshold.

Volume\ Anomaly(X_{event}) =
\begin{cases}
True, & \text{if } X_{t_{volume}} \geq 2\sigma + BAV(X_{est}) \\
False, & \text{otherwise}
\end{cases}

Labelling Anomalies

By using the equations shown above, each record was labelled as either associated with a P&D or not. To be linked to a P&D, the market data needed to exhibit a sufficiently large rise in both price and volume to be classified as anomalous. Furthermore, the data needed to show a decrease in price subsequent to the rise, at some point within the event window. Only when these criteria were met was the record labelled as being part of a P&D. Figure 3.3 shows a comparison of the stock behaviours that were labelled using this approach.
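Putting the pieces together, the labelling criteria can be sketched as follows, with the standard deviation taken over the estimation-window values:

```python
from statistics import pstdev

def is_pnd(est_prices, est_volumes, event_prices, event_volumes):
    """Label a record as P&D when the event window shows both a price
    and a volume anomaly (two standard deviations above the
    estimation-window average) followed by a price decline.
    est_prices/event_prices are daily average prices (Eq. 3.1)."""
    price_thr = sum(est_prices) / len(est_prices) + 2 * pstdev(est_prices)
    vol_thr = sum(est_volumes) / len(est_volumes) + 2 * pstdev(est_volumes)
    price_anomaly = any(p >= price_thr for p in event_prices)
    volume_anomaly = any(v >= vol_thr for v in event_volumes)
    # Require a drop after the event-window peak.
    peak = event_prices.index(max(event_prices))
    declined = any(p < event_prices[peak] for p in event_prices[peak + 1:])
    return price_anomaly and volume_anomaly and declined
```

A spike followed by a drop is labelled positive, while a price that keeps rising through the end of the event window is not.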

3.3.2 Price Trend

While anomaly detection is sufficient to collect P&D events that occurred within the market, it is not guaranteed that the content within the subreddits affected the market's behaviour. Therefore, to ensure that the change in price and volume closely coincided with the post submission date, records indicated as P&Ds were filtered by taking the stock price trend into account. The slope of the trend line was calculated using the data from the estimation window. The average prices, denoted as Y, were retrieved from the relevant period and normalized using min-max normalization. As shown in Equation 3.4, min-max normalization retains the distance ratios while scaling

Figure 3.3: Comparison of stock behaviours that have been labelled using anomaly detection

every value into the range of [0,1].

Y_{norm} = \frac{Y - Y_{min}}{Y_{max} - Y_{min}} \qquad (3.4)

Once the daily average prices were normalized, linear regression was then used to obtain the slope of the price trend. Equation 3.5 is the formula for how the slopes

were calculated, where Yi represents the index value assigned to each of the normalized prices within [T0,T1].

Slope = \frac{\sum_{t=T_0}^{T_1} (Y_i - \bar{Y}_i)(Y_{t_{norm}} - \bar{Y}_{norm})}{\sum_{t=T_0}^{T_1} (Y_i - \bar{Y}_i)^2} \qquad (3.5)

This process was performed to verify that the stock price was not already dramatically increasing prior to the post submission date. A steep trend slope would indicate that other factors may have been driving up the price of the stock well before the post. In order to reduce such cases, the threshold value was set as the median of all the slopes calculated within the dataset. To illustrate this, Figure 3.4 shows the resulting distribution of the stock price trend slopes within the dataset, where the median value was identified as 0.18. Records with slopes lower than the threshold were considered more likely to have been influenced by the content presented on the subreddits. Conversely, records with a slope greater than the median that had been labelled as positive for being a P&D were relabelled as negative.
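Equations 3.4 and 3.5 and the median-slope filter can be sketched as follows; the 0.18 default reflects the median reported in Figure 3.4:

```python
def trend_slope(prices):
    """Min-max normalize the estimation-window prices (Eq. 3.4) and
    return the least-squares slope against the day index (Eq. 3.5)."""
    lo, hi = min(prices), max(prices)
    if hi == lo:  # flat series: slope is zero
        return 0.0
    norm = [(y - lo) / (hi - lo) for y in prices]
    idx = range(len(norm))
    mean_i = sum(idx) / len(norm)
    mean_y = sum(norm) / len(norm)
    num = sum((i - mean_i) * (y - mean_y) for i, y in zip(idx, norm))
    den = sum((i - mean_i) ** 2 for i in idx)
    return num / den

def keep_pnd_label(prices, median_slope=0.18):
    """Retain a positive label only when the pre-post trend is below
    the dataset-median slope (0.18 in Figure 3.4)."""
    return trend_slope(prices) < median_slope
```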

3.3.3 Agreement Model

As mentioned within the data collection stage, individual posts and their respective comments were associated with the same financial information. As a result, those considered linked to market manipulation were all labelled the same. However, early testing showed that the inclusion of comments led to lower model performance. Rather than disregarding parts of the data, a closer examination of the comments was made. Upon inspection, I found that not all records should be treated equally. By looking at the comments, it was evident that not

Figure 3.4: Distribution of stock price trend slopes

all agreed with the post, as there were those that disagreed or were neutral on the matter. Thus, I determined that the language patterns of comments which did not support or encourage fraudulent claims were not conducive to the detection of P&D posts. The reasoning for this approach was that if manipulators were looking to deceive investors, they would also comment on the post to convince others that the information is legitimate. Such comments could be made either from the same account used to create the post or from different accounts. Drawing inspiration from work on stance detection, an approach was taken to identify the comments that supported P&D posts [60, 68]. While the goal of stance

Empath Agreement Words: only, done, better, true, knew, besides, like, maybe, wanted, liked, also, important, buying, understand, good, understood, needed, work, because, successful, knowing, grateful, plus, much, reasonable, should, give, happy, course, glad, well, considering, anyway, agree, meaning, great, probably, sure, thought, guaranteed, more, honestly, positive, thankful, actually, agreed, special, doubt, guess, though, bet, buy, surpass, worth, suppose, although, especially, definitely, certain, figured, given, means

Table 3.4: List of words generated by Empath

detection within NLP is to classify texts into three categories (Positive, Negative, Neutral), only two (Positive, Negative) were considered necessary in my research. The intention is that only the comments which agree with the post need to be identified, since they express similar ideas. A simpler approach than full stance detection was used within the research, which I refer to as the Agreement Model. Within the text of each comment, words used to convey agreement were sought out. This was achieved through a two-step process that involved constructing a lexicon for detecting comments that agreed with P&D posts. The first step used Empath to retrieve a set of words by creating a new lexical category for agreement [38]. The following seed words were passed to Empath to create the category: bought, agree, positive, increasing, good, now. They were chosen to reflect what may be expressed by those supporting P&Ds. With these words as the basis for agreement, Empath returned the words listed in Table 3.4.

While the words provided by Empath were a good starting point for an agreement vocabulary within P&Ds, I determined that the list still lacked various terms unique to the topic. Therefore, the second step added words that were either missed or could not be properly processed by Empath. For example, words like “moon” or “rocket” could not be given to Empath because it associated them with terms related to space, whereas in financial forums they denote an upward trend. Thus, a separate set of words was required to extend the agreement vocabulary. This was achieved manually, by looking at the comments labelled as associated with market manipulation and selecting some of the most common terms that conveyed support for or agreement with P&Ds. The following are some examples of those comments:

• “probably go to shoot up tomorrow”

• “this bad boy just rocket”

• “i will see you on the moon”

Table 3.5 contains the list of words chosen through this approach. Once the vocabulary was established to adequately capture the comments that agreed with P&Ds, the comments initially labelled as associated with a P&D were re-examined. Under the defined model, the labels remained the same for comments that contained at least two of the agreement words, whereas those that did not meet the requirement were relabelled. Furthermore, comments made by the author of the post were also unaffected by the change, as it was deemed that they would only write comments supporting their original claim. The goal was to reduce the noise within the data such that machine learning

Custom Agreement Words: moon, fast, massive, rich, surprise, rocket, profit, top, easy, move, pump, peak, early, load, soar, climb, worth, shoot, quick, jump, rise, sale, money, burst, pop, high, gain, breakout, drive, hype, spike, run, cash, nice, fly, go up, hit, bank, awesome, confident, surpass, more, zoom, big, great, potential, advantage

Table 3.5: List of custom words used in the Agreement Model

algorithms will have an easier time identifying texts that are of market manipulation. As such, the following are some examples of comments that were labelled as not being P&D by the agreement model:

• “it be the american dream to fall for snake oil salesman and then lose everything it be a story as old as humanity”

• “clearly a pump and dump scheme”

• “do not touch it if the chart look like a hockey stick”
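The relabelling rule described above can be sketched as follows, using an illustrative subset of the Table 3.4 and 3.5 vocabularies:

```python
# Illustrative subset of the Empath (Table 3.4) and custom (Table 3.5)
# agreement vocabularies.
AGREEMENT_WORDS = {"moon", "rocket", "pump", "buy", "agree", "soar", "shoot"}

def keep_agreement_label(comment, by_post_author=False):
    """Keep a comment's P&D label if its author also wrote the post,
    or if the text contains at least two agreement words; otherwise
    the comment is relabelled as not P&D."""
    if by_post_author:
        return True
    tokens = comment.lower().split()
    return sum(t in AGREEMENT_WORDS for t in tokens) >= 2
```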

While the examples are texts that obviously should not be labelled as P&D, it is important to mention that this model is by no means perfect at filtering out comments that do not agree with the post. There are several reasons for this. First, it does not take into account negation words, which can reverse the meaning of a sentence. This means that comments containing one or more agreement words may actually be against the claims of the post. Next, the words provided by Empath and those that I chose do not perfectly capture the words that may

indicate agreement with P&Ds. Finally, as with negation, the model does not take into consideration the context in which the words are being used; in certain situations, a word may not actually indicate agreement. As a result, the agreement model was introduced within my research as a way to avoid labelling texts that are obviously against the P&D posts.

3.4 Handling Class Imbalance

With P&Ds considered anomalies, one of the effects on the data is class imbalance. A problem that arises when training a model on an unequal distribution of classes is that it overfits the majority class, which in this case is the records labelled as not fraudulent. This results in the model having difficulty predicting the minority class, the P&Ds. To address this problem, two common approaches were tested on the data: under-sampling and over-sampling.

Under-sampling

To balance the data, under-sampling is a technique that is used to reduce the size of the majority class. It achieves this by keeping all the samples within the minority class and randomly selecting an equal number of records within the majority class.

Over-sampling

In contrast, over-sampling is a technique that is used to increase the size of the minority class. It balances the data by generating new minority samples. Some

methods used to produce them are repetition, SMOTE (Synthetic Minority Over-Sampling Technique) [21], or ADASYN (Adaptive Synthetic Sampling) [45].

Class Weight

The techniques mentioned above were applied independently and in combination to address the issue of class imbalance. However, they did not improve the results of the classification models that were tested. As such, rather than modifying the data, the class weight parameter was used. By providing a weight for each class, models were able to penalize mistakes on the minority class more heavily.
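One common way to derive such weights is scikit-learn's "balanced" heuristic, n_samples / (n_classes × class count); the thesis does not state the exact weights used, so this is illustrative:

```python
def balanced_class_weights(labels):
    """Scikit-learn-style 'balanced' weights:
    n_samples / (n_classes * count(class)), so the minority class
    contributes proportionally more to the loss."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {y: n / (len(counts) * c) for y, c in counts.items()}
```

With an 8:2 split, for example, the minority class receives a weight four times that of the majority class.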

3.5 Techniques Used

Once the data was preprocessed, it was used for data analysis. In order to better understand the data and identify the patterns for detecting P&D posts, several clustering and classification techniques were chosen for my research.

Clustering Techniques

The clustering techniques that I used were:

• Non-negative Matrix Factorization (NMF)

• Latent Dirichlet Allocation (LDA)

• Singular Value Decomposition (SVD)

The techniques were selected to determine whether any insight could be derived from the texts. Due to the sparsity and large dimensionality of the data, clustering algorithms were a logical choice as an initial step for identifying patterns within it. However, in

order for the models to process the texts, they were first converted into a document-term matrix with scikit-learn's TfidfVectorizer [74]. A document-term matrix is a representational model that captures the frequency of terms within a collection of documents. Each row corresponds to a document, whereas each column corresponds to a unique term used within the corpus. By using the TF-IDF vectorizer, the term frequency (TF) of each word was normalized by the inverse document frequency (IDF). This allowed rare terms to be weighted more heavily than those that occurred frequently across the documents. Subsequently, each of the algorithms was given the normalized text and tested with varying numbers of components. The outputs of each technique were then plotted with their designated labels to analyze possible relationships between the grouped documents and market manipulation.
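A textbook TF-IDF computation can be sketched in pure Python; note that scikit-learn's TfidfVectorizer additionally applies smoothing and l2 normalization, so its numbers differ:

```python
import math

def tfidf_matrix(docs):
    """Build a dense document-term matrix with textbook TF-IDF
    weighting: tf(t, d) * (log(n / df(t)) + 1)."""
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = {t: sum(t in doc for doc in tokenized) for t in vocab}
    matrix = []
    for doc in tokenized:
        row = []
        for t in vocab:
            tf = doc.count(t) / len(doc)
            idf = math.log(n / df[t]) + 1  # rarer terms weigh more
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix
```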

Classification Techniques

As for the classification techniques, the following were used within my research:

• Extreme Gradient Boosting (XGBoost)

• Random Forest (RF)

• Support Vector Machine (SVM)

• Artificial Neural Networks (ANNs)

– Multilayer Perceptron (MLP)

– Convolutional Neural Network (CNN)

– Bidirectional Long Short-Term Memory (BiLSTM)

The outlined techniques were chosen to discover the best-performing algorithm for detecting content that was part of P&Ds. Furthermore, they were also used to gain a better understanding of which words were most important in detecting those types of events. For training and testing, all models except the ANNs were given TF-IDF text matrices. For the ANNs, the text data were converted into integer sequences using Keras' Tokenizer. Because the resulting sequences had different lengths, they were padded; the pad length was set to the median word length of the texts within the dataset. To accurately evaluate the performance of the models, five-fold cross-validation was used. In order to compare the performance of the models and understand them, the following metrics and techniques were utilized:

• Accuracy, Precision, Recall, F1-Score

• Confusion Matrix

• SHAP Summary Plot

By using scikit-learn's classification report, a detailed analysis of each model's performance was obtained. Among the information provided, the recall, precision, and F1-score for each class were useful in determining whether the models were correctly classifying the P&D posts. Furthermore, the confusion matrix was used to help visualize the number of records being misclassified. To gain a better understanding of how the models classified the text, SHAP's summary plot was used to rank the features by importance and to visualize the impact of each feature on the prediction of P&D posts.
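The per-class metrics in the classification report derive directly from the confusion-matrix counts; for the positive (P&D) class:

```python
def positive_class_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy computed from
    confusion-matrix counts, treating P&D as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

With imbalanced classes, accuracy can stay high even when recall on the minority class is poor, which is why the per-class numbers matter here.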

3.6 Summary

This chapter outlined the various steps that made up my experimental setup. It first described the systems involved and their purposes. The data collection process was then laid out, detailing the sources, collection methods, and the features of the data that were of interest. Afterwards, the preprocessing steps for the text were outlined. Along with methods common to NLP, these included replacing all mentions of stock symbols within texts with a more generalized form. The process used to label the data was also explained, as I used anomaly detection and the data from Yahoo! Finance to associate P&D market behaviour with the text. To help reduce the noise brought on by the comments labelled as P&D, I introduced and described an approach based on stance detection, which I referred to as the Agreement Model. Furthermore, the issue of class imbalance was also discussed, along with how it was addressed within my research. Finally, the various techniques that I used to gain insight and help detect P&Ds within the subreddits were outlined, including the different methods and tools used to gauge the performance of the models.

Chapter 4

Results

In this chapter, I present and discuss the results obtained in my research. I begin by examining various findings from an exploratory data analysis (EDA) of the collected data. Some of the insight gained from the analysis can be attributed to the design choices and concepts introduced in my research. Afterwards, the results gathered from the clustering techniques are shown, along with a discussion of whether their outputs were significant. Subsequently, the results from the classification techniques are also presented and discussed. Each classifier's best performance and its respective feature importance, computed using SHAP values, are provided for comparison. Among those observed, the best-performing model is identified and closely examined. Finally, I discuss the overall results of the research, addressing the viability of the proposed model in detecting P&Ds when given textual data.

Subreddit               Number of Posts  Number of Comments  Total
r/pennystocks           12,049           234,149             246,198
r/RobinHoodPennyStocks  6,506            78,429              84,935
Total                   18,555           312,578             331,133

Table 4.1: Breakdown of records collected from subreddits

4.1 Data Overview

Within the duration of my research, data from Reddit and Yahoo! Finance were collected from October 1, 2019, to June 28, 2020. A breakdown of the data is presented in Table 4.1. As shown within the table, the majority of the data was retrieved from r/pennystocks. The records collected from r/RobinHoodPennyStocks account for only roughly a quarter of the total. This is to be expected, as the content of the latter is specifically associated with users who invest in penny stocks on the commission-free investing platform Robinhood. Another expected aspect of the data was that the number of comments would be larger than the number of posts. However, it was surprising to see that posts make up only roughly 5% of the data, which suggests that while posts are the primary focus of my research, the inclusion of comments can significantly influence the outcome of the experiments. Therefore, careful consideration should be taken when including comments, as they should not be treated the same as posts. An interesting event that should also be brought to attention is the change in the number of records collected daily. As shown in Figure 4.1, an increase in submissions can be seen. This is supported by the growing number of users who subscribed to the subreddits within the period of data collection. The following is the change in the number of members within the respective subreddits:

• r/pennystocks - 139,000 Members ⇒ 257,000 Members

Figure 4.1: Data Collection Trend

• r/RobinHoodPennyStocks - 52,000 Members ⇒ 133,000 Members

By looking at the difference, an approximately two-fold increase in membership is shown. The difference, however, does not correlate linearly with the amount of content that has been submitted online. This can be due to a large number of users who are not subscribed to the subreddits but still communicate within them. Given the influx of users, an explanation as to why there has been a large increase in traffic can be traced back to the COVID-19 pandemic, which has had a large effect on the world. In regards to the financial market, due to the virus, the S&P 500 experienced a significant drop for approximately one month until late March of 2020. However, unlike the

events of the 2008-2009 financial crisis, the market began to recover at an incredibly fast pace. As a result, news articles such as the one from Global News have reported that many stock trading platforms have experienced a similar increase within their user base [15]. According to the business professionals who were interviewed within the article, many stated that this influx of users was due to investors reconsidering their investment strategies and taking a more hands-on approach to investing. Furthermore, they also stated that it may be due to new investors, primarily millennials, who are looking to capitalize on the cheap stocks. Taking that into consideration, a reason as to why the subreddits have increased participation may be that investors have more time at home. With many countries going under quarantine, investors may have felt that it was the perfect time to do their research on investments and to gain insight from others within online forums. It is also to be expected that a higher number of P&Ds occurred within that time frame, as manipulators look to take advantage of new investors during the pandemic. Alerts and press releases by the SEC and the Canadian Securities Administrators (CSA) seem to agree with that outlook, as they have cautioned new investors to be vigilant about the increasing number of P&D schemes that have occurred within that time [7, 8, 9]. Moving away from the observations made about the data collection, the analysis that was done on the text data also yielded interesting results. Looking at the texts that were retrieved from Reddit, I observed that the dimensionality of the text representation was large, having 4,862 unique words. Furthermore, the underlying data itself was sparse, as the median word count was 22 words for posts and comments. This meant that while the lexicon for the documents was large, only a few words were used within each document. To provide an example of the data,

Figure 4.2: Top 20 frequent words

Figure 4.2 shows the frequency of the top 20 words. The frequency distribution of words gathered from the data corresponds to what is well-known within NLP, which is that a few high-frequency English words account for most of the vocabulary within texts (e.g., “the”, “of”, “i”) [76]. As seen within the figure, the word “be” is used most frequently. This is due to the use of lemmatization, where words such as “am”, “are”, and “is” are all considered the same when they are returned to their base form. Moreover, other words also look to be relatively common, as they are used quite often within regular communications. Since my research is looking at the words that are used within P&Ds, I determined that the high-frequency words should be removed. Thus, a stop-word list was applied in order to eliminate the noise, allowing the machine learning algorithms to concentrate on words of greater significance.
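The preprocessing steps described above can be sketched as follows. This is a minimal illustration only: the lemma map and stop-word list below are tiny hypothetical samples, not the full lemmatizer and stop-word resources used in my experiments.

```python
# Illustrative sketch of the text preprocessing: lemmatize tokens to
# their base form, then drop high-frequency stop words so that the
# remaining tokens carry more discriminative information.
# LEMMAS and STOP_WORDS are hypothetical samples for demonstration.

LEMMAS = {"am": "be", "are": "be", "is": "be", "was": "be", "were": "be",
          "bought": "buy", "buying": "buy", "stocks": "stock"}
STOP_WORDS = {"the", "of", "i", "a", "to", "and", "be"}

def preprocess(text):
    tokens = text.lower().split()
    lemmatized = [LEMMAS.get(t, t) for t in tokens]
    return [t for t in lemmatized if t not in STOP_WORDS]

print(preprocess("I am buying the stocks"))  # ['buy', 'stock']
```

Note how "am" first becomes "be" through lemmatization and is then removed as a stop word, which is exactly why "be" would otherwise dominate the frequency counts in Figure 4.2.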

Figure 4.3: Histogram of discussed market sectors within subreddits

Another interesting result arose when the stock symbols within the texts were replaced with their respective market sectors. This allowed for a better understanding as to which stocks were generally discussed within the online forums. By taking a look at Figure 4.3, it was revealed that healthcare stocks were the most frequently mentioned, followed by technology stocks. To get a better understanding

Figure 4.4: Histogram of discussed market sectors within texts labelled as P&Ds

as to why these sectors were discussed more than others, a closer examination was conducted. Figures 4.5 and 4.6 show what was discovered upon review. By taking a look at the trends for the particular sectors, a key factor was identified. Both trends showed a similar behaviour beginning from March until the end of June, where there was an increase in activity within the relevant sectors. This coincided

with the previous remark on the impact of the pandemic, where there has been heightened interest among investors. In regards to the posts and comments with an unknown sector, it is surprising to see that a high number of records could not be attributed to a specific sector. The reason for this was traced back to an issue with Yahoo! Finance, where the specified information could not be retrieved. While there may be different methods to retrieve the data from other sources, it was left as-is for my research. Taking that into consideration, the overall result shows that there are a select number of sectors that are predominantly discussed within the subreddits. Judging by this information, it can also be expected that most P&Ds will occur within those sectors. As such, by removing the records that were labelled as not being P&Ds within the dataset, Figure 4.4 shows the previous claim to hold true, as a similar distribution is presented. Furthermore, Figure 4.7 shows an increasing trend of posts that are labelled as P&D. This aligns with the warnings provided by the SEC and CSA that there is an increasing number of P&D activities. Having labelled the records using anomaly detection, price trend, and the agreement model, Table 4.2 shows the resulting class distribution for the dataset. By comparing the number of records within the two classes, it is evident that the data is imbalanced. Upon calculation, it is revealed that almost 9% of the records are labelled as being P&D. This is to be expected, as the area of research is in regards to fraud, where class imbalance is a common characteristic of the problem. Focusing on the number of posts and comments that are labelled as P&D, it can be seen that the number of comments significantly outweighs that of the posts, as they account for 90% of the records. Thus, I determined that if the comments were to be

Figure 4.5: Trend of posts and comments that discussed healthcare stocks

Record Type   P&D      Not P&D   Total
Posts          3,006    15,549    18,555
Comments      26,727   285,851   312,578
Total         29,733   301,400   331,133

Table 4.2: Dataset class distribution

included and provided to the models, they would have to be treated differently. In the earlier stages of my research, all comments under a post that was considered to be P&D were labelled the same. This proved to be difficult for the models, as their performance got worse when the comments were included in the dataset. This was attributed to the fact that the models were being trained with comments which did not

Figure 4.6: Trend of posts and comments that discussed technology stocks

use P&D language. In order to address this, the agreement model was introduced to identify the comments that supported the deceptive posts. As shown by Table 4.3, this helped to reduce the number of comments that were initially labelled as P&D.
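A minimal sketch of the agreement model's relabelling logic is shown below. The agreement and disagreement word lists here are hypothetical stand-ins for the actual lists used in the research, and the comparison rule is a simplified illustration of the idea, not the exact implementation.

```python
# Sketch of the agreement-model relabelling: a comment under a P&D post
# keeps the P&D label only if it expresses agreement with the post.
# AGREEMENT_WORDS and DISAGREEMENT_WORDS are hypothetical examples.

AGREEMENT_WORDS = {"buy", "bought", "agree", "holding"}
DISAGREEMENT_WORDS = {"avoid", "scam", "sell", "dump"}

def relabel_comment(tokens, parent_is_pnd):
    """Return the comment's label given its parent post's label."""
    if not parent_is_pnd:
        return "Not P&D"
    agree = len(AGREEMENT_WORDS.intersection(tokens))
    disagree = len(DISAGREEMENT_WORDS.intersection(tokens))
    return "P&D" if agree > disagree else "Not P&D"

print(relabel_comment({"i", "bought", "as", "well"}, True))  # P&D
print(relabel_comment({"avoid", "this", "stock"}, True))     # Not P&D
```

Under this scheme, a dissenting comment under a deceptive post no longer inherits the P&D label, which is the effect summarized in Table 4.3.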

4.2 Clustering Results

With the texts being preprocessed and labelled, the data was then used by the previously identified clustering techniques and analyzed for insights that could help in the detection of P&D posts. For each of the techniques, two different versions of the data

Figure 4.7: Trend of posts that have been labelled as P&D

were provided. First, the data consisting of only posts was analyzed. Then, the normalized text data consisting of both posts and comments was provided. In addition, each technique was tested with various numbers of components to identify any that may look promising for the purposes of my research. To help visualize the relationship between the document clusters and the market behaviours, the outputs of the techniques were plotted and labelled according to their respective classes. As such, to provide a sample of the results that were gathered, Figures 4.8 to 4.13 show the 3D visualizations of the three most significant components for the individual techniques. By taking a look at the figures, it can be seen that none of the observations

Comments               P&D      Not P&D
Pre-Agreement Model    44,098   268,480
Post-Agreement Model   26,727   285,851

Table 4.3: Comments pre and post agreement model

demonstrate any apparent clustering of documents which may be linked to P&Ds. Despite further testing, none of the techniques that were used revealed any meaningful results that could help in the detection of P&D posts. While the results themselves may not have shown anything interesting, I learned that the problem might be a lot more challenging to solve with the taken approach. It is also important to acknowledge that these results do not imply that different clustering techniques would give a similar outcome. Therefore, the results that are shown only apply to the techniques that I have tested. Those who are conducting similar research may find success by using different techniques.
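To make the setup concrete, the sketch below applies truncated SVD (one of the three techniques tested) to a tiny hypothetical document-term matrix, projecting each document onto its three most significant components, as was done before producing the 3D plots. The matrix values are illustrative only.

```python
import numpy as np

# Toy sketch of truncated SVD on a document-term count matrix: each row
# is a document, each column a term.  Documents are projected onto the
# three most significant components for 3D visualization.
docs_terms = np.array([
    [2, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 2, 1, 0],
    [0, 1, 0, 2, 1],
], dtype=float)

# Full SVD; keep only the top three singular vectors/values.
U, s, Vt = np.linalg.svd(docs_terms, full_matrices=False)
k = 3
projected = U[:, :k] * s[:k]  # document coordinates in 3D component space

print(projected.shape)  # (4, 3) -- one 3D point per document
```

Each row of `projected` is one document's coordinates in the reduced space; plotting these points coloured by class label is how Figures 4.8 to 4.13 were inspected for separation between P&D and non-P&D documents.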

Figure 4.8: NMF plot for posts

Figure 4.9: NMF plot for posts and comments

Figure 4.10: LDA plot for posts

Figure 4.11: LDA plot for posts and comments

Figure 4.12: SVD plot for posts

Figure 4.13: SVD plot for posts and comments

4.3 Classification Results

In light of the results obtained in the previous section, a different approach was taken to address the problem raised in my research. I believed that classification techniques might be better suited to pick up on features that may lead to the detection of P&D posts. Similar to what had been conducted for the clustering techniques, the binary classifiers were provided with two versions of the data. This was to determine whether the inclusion of comments would actually be beneficial. Furthermore, in order to address the imbalance within the dataset, all models were provided with class weights. To gauge the performance of the models, each underwent five-fold cross-validation. This meant that for each segment of the dataset, the models were given different training, validation, and testing sets. In addition, for the version of the data that had both the posts and comments, the models were only measured on the accuracy of the posts within the testing phase. As such, the comments were removed at the end to assess how well the models can detect P&D posts. In order to compare the performance of each model, I generated the confusion matrix and classification report (precision, recall, F1-score) for every segment during cross-validation. Relative to the provided version of the data, Table 4.4 provides the total number of true positives (P&D), false positives, true negatives (Not P&D), and false negatives for each technique, along with its accuracy, precision, recall, and F1-score. By taking a look at the performance values of the models, it is evident that the classification techniques are capable of detecting P&D posts. In particular, the different types of ANNs show the best performance compared to the other models. While models such as Random Forest show a high accuracy, their precision and recall values

Model                            TP    FP    TN     FN    Accuracy         Precision         Recall            F1-Score
XGBoost with posts               1728  6615   8934  1278  57.46% (±3.73%)  20.71% (±0.48%)   57.49% (±0.68%)   30.45% (±2.25%)
XGBoost with posts and comments  2007  7646   7903   999  53.41% (±1.42%)  20.79% (±0.85%)   66.77% (±1.58%)   31.71% (±0.96%)
RF with posts                     271   646  14903  2735  81.78% (±0.51%)  29.55% (±1.40%)    9.01% (±0.52%)   13.81% (±0.78%)
RF with posts and comments        414   211  15338  2592  84.89% (±0.69%)  66.24% (±1.69%)   13.77% (±0.47%)   22.80% (±0.75%)
SVM with posts                   1752  5263  10286  1254  64.88% (±1.14%)  24.98% (±0.76%)   58.28% (±1.05%)   34.97% (±1.16%)
SVM with posts and comments      2125  4559  10990   881  70.68% (±0.49%)  31.79% (±0.43%)   70.69% (±0.56%)   43.86% (±0.57%)
MLP with posts                   2382  1718  13831   624  87.38% (±6.66%)  58.10% (±11.65%)  79.24% (±12.76%)  67.04% (±12.12%)
MLP with posts and comments      2103  2602  12947   903  81.11% (±3.71%)  44.70% (±4.28%)   69.96% (±3.80%)   54.55% (±4.36%)
CNN with posts                   2373  1709  13840   633  87.38% (±7.04%)  58.13% (±12.02%)  78.94% (±12.76%)  66.96% (±12.37%)
CNN with posts and comments      2304  2068  13481   702  85.07% (±1.25%)  52.70% (±2.33%)   76.65% (±3.45%)   62.46% (±2.64%)
BiLSTM with posts                2297  2495  13054   709  82.73% (±8.11%)  47.93% (±9.92%)   76.41% (±10.94%)  58.91% (±10.82%)
BiLSTM with posts and comments   2288  2370  13179   718  83.36% (±2.27%)  49.12% (±3.25%)   76.11% (±3.86%)   59.71% (±3.54%)

Table 4.4: Summary of individual model performance

reveal that it does not perform well on identifying the P&D posts specifically. As such, it is essential to recognize that in cases where there is a class imbalance, such as the one within my research, high accuracy should not be the only metric that is evaluated. Therefore, when observing the models, precision and recall values were given higher importance for assessing their performance. An exceptionally high recall value was required, as it determined how many of the given P&D posts were predicted correctly by the classifier. In light of the results made by the ANNs, an additional experiment was conducted to determine whether the performance would improve if the weaker performing models (XGBoost, RF, SVM) worked together. By using a voting classifier, the models

were used in conjunction to predict P&D posts. The intuition behind this approach was that the models would work collectively to overcome their weaknesses in the predictions. However, their results did not prove to be an improvement over their individual performances. Therefore, I determined that the weaker performing models share similar weaknesses that the ANNs do not. At first glance, by comparing the performance values of the ANNs based upon the data that is given, it would seem that the models with posts perform better than those with posts and comments. However, when taking into account the standard deviation of the values, it can be concluded that the inclusion of comments provides stability for correctly identifying P&D posts. Therefore, when examining the models that have been provided with both posts and comments, the best performing model is the CNN. While its accuracy and recall are promising, its precision is relatively low. This suggests that of all the records that the model predicts to be P&D, only 52.7% are actually correct. As previously mentioned, this is partly due to the imbalance of the data, as there are more negative cases within the dataset. As such, if we look at the rate at which each class is labelled to be positive, a better outlook of the model is provided. If given a positive P&D text, the model has a 76.65% chance of classifying it correctly. Whereas, if it is given a negative text, it has a 13.3% chance of classifying it incorrectly as positive. In that perspective, given the number of records within the majority class, the model performs relatively well in identifying P&D posts. When examining the performance of the CNN, it is not surprising that it was the best against the other models. When comparing traditional machine learning methods to deep learning, the latter has been known to perform better for tasks within NLP, as it is considered to be state of the art [10, 91]. While the BiLSTM was able to

get decent results, it fell a little short compared to those achieved by the CNN. The reason for this may be attributed to how CNNs learn to recognize patterns across space rather than time. With patterns among texts being phrases of expression, CNNs have found success in classification tasks such as sentiment analysis [42]. With that being said, the problem that is being addressed within my research may be similar in nature. Key phrases among the texts linked to P&D posts may be how the CNN is able to perform better than the other models. To also get a deeper understanding of the decisions that were made by the various models, SHAP summary plots were used. With the use of various SHAP explainers, diagrams were produced which ranked the features by their impact on the classifications. Furthermore, the dataset samples were plotted to show the positive and negative relationships of the features to their target label. The horizontal location of the dots represents the effect of the values on the outcome of the prediction. Those that were plotted to the right of the centre line indicate a positive effect, whereas those that are on the left have a negative effect. The colour also signifies the values of the observation: those that are high are coloured in red, while the low values are blue. Therefore, a large grouping of similar dots situated to the right indicates a positive relationship between the values of that feature and those that are classified by the models. As such, Figures 4.14 to 4.19 show the resulting diagrams for the top three performing classifiers. The diagrams for the others can be found in Appendix B. Each diagram contains the 30 features that were the most impactful in the predictions made by the models. By taking a look at the diagrams, a commonality among them can be easily identified, which is the presence of the stock sectors that I had discussed in the earlier

part of this chapter. Each model has shown that the presence of the prevalent sectors outlined in Figure 4.4 is an important indicator in detecting P&Ds. Furthermore, several words from the agreement model such as “buy” and “go” also seem to contribute to their detection. Specifically looking at those that had the best performance, it appears that the same set of words emerge as the most impactful features. Despite this, by looking at the values of all the features, none have a strong positive relationship with the model output. The majority of values for the features tend to be plotted around the centre line. This suggests that while they are the top indicators for the detection of P&D posts, none are a clear predictor for the classifiers. It may also indicate that the features alone do not determine a P&D post but rather the context in which they are used. This supports the explanation of how the CNN was able to achieve the best performance, as it works well by identifying key phrases within texts.
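As a check on the figures quoted above, the precision, recall, and false-positive rate for the CNN with posts and comments can be recomputed directly from its confusion-matrix counts in Table 4.4:

```python
# Recomputing the CNN (posts and comments) metrics from the counts
# reported in Table 4.4: TP=2304, FP=2068, TN=13481, FN=702.
tp, fp, tn, fn = 2304, 2068, 13481, 702

precision = tp / (tp + fp)  # of texts predicted P&D, fraction actually P&D
recall = tp / (tp + fn)     # of actual P&D texts, fraction caught
fpr = fp / (fp + tn)        # chance a Not P&D text is flagged as P&D

print(f"precision={precision:.4f} recall={recall:.4f} fpr={fpr:.4f}")
# precision=0.5270 recall=0.7665 fpr=0.1330
```

These match the 52.7% precision, 76.65% recall, and 13.3% false-positive rate discussed above, and make explicit why a low precision can coexist with a usable detector when the negative class is so large.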

Figure 4.14: MLP SHAP Summary Plot for posts

Figure 4.15: MLP SHAP Summary Plot for posts and comments

Figure 4.16: CNN SHAP Summary Plot for posts

Figure 4.17: CNN SHAP Summary Plot for posts and comments

Figure 4.18: BiLSTM SHAP Summary Plot for posts

Figure 4.19: BiLSTM SHAP Summary Plot for posts and comments

4.4 Discussion

Given the results of the classification techniques, it is also essential to discuss the misclassifications made by the proposed model. Depending on the problem that is being addressed, the consequences of misclassifications can vary for those looking to implement their model in the real world. For those that handle imbalanced data, missing a case within the minority class is worse than incorrectly classifying one from the majority class. A real-life example of this is within the medical field, where diagnosing a condition may be critical for the survivability of a patient. While a patient getting a false positive may not be optimal, the worst-case scenario would be if they were to get a false negative. This may lead to a patient being left untreated for a condition that may be fatal. While it may not be to the same degree of criticality, fraud detection problems also share a similar priority of correctly classifying those within the minority class. Depending on how the model of my research is used, the misclassifications of P&D posts and the resulting consequences will vary. While it has been previously stated that the aim of my research is not to identify deceitful actors, if regulatory bodies chose to use the model to do so, the errors from misclassifications would be dangerous. An example would be if a securities commission were to wrongfully charge an individual for conducting market manipulation. As seen by the results, even the best model is not perfect in detecting P&D posts. Therefore, at the very most, the model may be used to point regulatory bodies in the right direction or provide further evidence against those that are conducting P&Ds. In the case that the model is used by investors, the harm in misclassifying P&D

Predicted Label   Actual Label   Misclassified Post
P&D               Not P&D        sectorunknown about to soar
P&D               Not P&D        sectorunknown fitness equipment maker owner of bow flex completely sell out of most retail store how be this look just buy in share
P&D               Not P&D        quick all in sectorcommunicationservices pump my first time actually do something right the lambos go to be green for gain
P&D               Not P&D        blast off look like gold and oil will be big player this i also suggest look at sectortechnology
P&D               Not P&D        sectorenergy drop time to buy it be drop below which be its day low be it a good time to buy
Not P&D           P&D            sectortechnology release patent news on thermal tech could be a mark sympathy play bust out over
Not P&D           P&D            sectorhealthcare do anyone understand why sectorhealthcare shoot up soo much i be not able to find any real catalyst
Not P&D           P&D            sectorhealthcare on the move this have potential reach today
Not P&D           P&D            sectorhealthcare to the moon
Not P&D           P&D            any thought on when to sell sectorenergy bought in late i be up after hour should i wait til tomorrow or sell as soon as possible in the am

Table 4.5: Examples of misclassified posts from the CNN model

posts may not be as severe. If investors used the model to avoid P&Ds, a misclassification might lead them to miss out on an investment opportunity that could have been potentially profitable. It may also lead an investor to buy into a stock that is part of a P&D, in which case the downside would be a loss of investment if they are unable to sell it within a short amount of time. In either case, the consequences of misclassifying for investors look to be less severe than those for regulators. In order to understand what sort of texts the CNN model is having difficulty

with, Table 4.5 shows a small collection of misclassified posts. For those that are falsely predicted as being P&D, there are several explanations. The first may be that the post was a failed attempt by the fraudster. It is to be expected that not all fraudulent information that is posted on online forums will result in a rise in price and volume. Some misleading posts may not garner enough attention from investors to influence the market. Others may be so blatantly apparent as P&Ds that many would avoid their advice. Another reason could be that the stock price and volume movements were not significant enough to be caught by my detection method. While the market behaviours of successful P&Ds are considered anomalous, there is no set understanding as to how much they can impact the market. As such, while the methods within my experiment have identified clear cases of P&D, they may have missed some that did not fit the conditions for detection. As for the records that should have been classified as P&D, there may also be several reasons for their misclassifications. For texts that are relatively short, such as “sectorhealthcare to the moon”, the model may have difficulty due to the limited number of words. Without being given enough content, the model may rely on specific keywords, which alone can be weak indicators for P&D. This aligns with what was discussed in the previous section, as no specific word was a definite predictor for the classifiers. Another reason as to why some records have been misclassified as P&D may be a weakness in the labelling method. Since texts are labelled according to their market behaviour, there may be those that did not actually aim to pump up the price of a stock but have been labelled otherwise. Upon looking at the examples, it is not easy to precisely determine why the model is not correctly classifying the posts. However, the issue most likely lies

with how the records were labelled. While further adjustments can be made to the experimental design (e.g., extending the event window) to reduce the errors, due to the complexity of the problem, there may always be misclassifications. Also, it must be taken into consideration that while the texts have been labelled as P&D within my research, it does not guarantee that they are deceptive, or that they have manipulated the market. As such, it is important to acknowledge the mentioned weaknesses when looking to implement the proposed model in the real world. One possible use of the model is as a tool (e.g., a browser add-on) that aids investors with any financial information that they encounter within online forums. Since it is capable of identifying deceptive content, investors may be notified early on as to whether or not what they are viewing is credible. While the performance values of the model have shown that it is far from being reliably accurate, it could encourage investors to do more research about the information that is marked as deceptive. A clear benefit of the proposed model is that once it is trained, it does not require market data to recognize P&D content. Having learned the language patterns that appear among P&Ds, the tool can search for similarities within the posts that investors are interested in. Thus, the described concept is an example of how the model from my research can be used to help investors make informed decisions about their investments.
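To illustrate the tool concept just described, the sketch below shows the shape such an add-on's backend could take: once a classifier is trained, flagging a post requires only its text, with no market data. The keyword-based `ToyModel` is a hypothetical stand-in for the trained CNN, used purely so the example is self-contained.

```python
# Sketch of a text-only flagging step for a hypothetical browser tool.
# ToyModel is an illustrative stand-in for a trained classifier; a real
# deployment would load the trained CNN and its tokenizer instead.

class ToyModel:
    PUMP_WORDS = {"moon", "soar", "pump"}

    def predict(self, tokens):
        return "possible P&D" if self.PUMP_WORDS & set(tokens) else "ok"

def check_post(text, model):
    """Classify a forum post from its text alone -- no market data."""
    tokens = text.lower().split()
    return model.predict(tokens)

print(check_post("sectorhealthcare to the moon", ToyModel()))  # possible P&D
```

The key point is the interface: `check_post` consumes nothing but the post's text, which is what makes a purely client-side warning tool feasible after training.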

4.5 Results Summary

Within this chapter, I showed that it is possible to detect P&D posts with reasonable accuracy. In order to do so, I began by presenting and discussing the results

that I had gathered by conducting an exploratory data analysis. Within the analysis, several interesting characteristics of the data were discovered. First, I showed the distribution of data collected from the two subreddits and also the increase in record collection over the duration of my research. Along with it, I provided a brief explanation of the reason for the growth in content and its implications within the stock market. Next, I discussed the results of the analysis done on the texts within the dataset. It showed that certain frequent words accounted for most of the vocabulary within texts. This led to the use of stop words within the experiments such that words of greater significance could be identified for the detection of P&Ds. Subsequently, I presented my findings for changing the stock symbols to their market sectors. This revealed that a select number of sectors were predominantly discussed within the subreddits. Upon further investigation, it was shown that this was strongly related to the mentioned increase in content and traffic among the forums. As expected, this also meant that a majority of the P&Ds that I had identified were within those sectors. Next, I presented the class distribution within the dataset upon using anomaly detection, price trend, and the agreement model. This showed that a class imbalance existed within the dataset and provided the basis for the use of the agreement model within my research. Once the results gathered from the EDA were discussed, I moved on to presenting the results of the clustering techniques. Despite various methods, I showed that an apparent separation could not be identified among the documents for P&Ds. As a result, I moved on to the findings gathered from the classification techniques. With each binary classifier undergoing five-fold cross-validation, the performance of each model and their respective outputs for feature importance were provided.
The results showed that comments labelled

under the agreement model do help to improve the performance of the models. In addition, I presented the best performing model that was able to detect P&D posts accurately. Finally, I discussed the misclassifications made by the proposed model and how it can be used as a tool by investors in identifying deceptive content.

Chapter 5

Conclusion

5.1 Summary

With the recent advances in technology, the methods for manipulating the market have become more readily accessible. Online forums are among the many platforms that manipulators have used to defraud investors. With P&Ds being a typical scheme within the forums, investors must always be cautious of the information that they come across. However, it may not always be apparent whether the content is deceptive or not. While the efforts made by regulators have led to the prosecution of many perpetrators, they have done little to eliminate or deter such activities from taking place. Most of the actions taken by the regulators are deemed to be too late for investors; by the time a scheme is detected, many investors may have already become victims. As a result, providing investors with the information and the tools necessary for detecting such deceptive content can have a greater impact in addressing market manipulation. By using machine learning techniques, a model can be developed to help investors detect P&Ds within online forums. In order to do this, I first collect textual data

from popular penny stock forums within Reddit. The subreddits were chosen based upon the characteristics of P&Ds, as they typically occur among stocks that have small market capitalization. To associate the texts with relevant market behaviours, I collect financial data from Yahoo! Finance and then add it to the records. By using statistical methods, all records showing abnormal market returns were labelled as P&D. Posts that intended to manipulate the market were predicted based on their language patterns. However, including the comments associated with each post tended to make prediction more difficult for the models. As a result, to effectively use the comments, I found that they should not all be treated the same, as they varied in opinion. While a comment on a post may express agreement (“i bought as well”) or disagreement (“avoid this stock”), my objective was to take the language patterns from only those that support the post. Comments of P&D posts were of particular interest, as those that did not agree with the initial statements were still labelled as deceptive by my labelling methods. Thus, to help distinguish the comments that agree with P&D posts, I sought to develop an agreement model. By observing the use of agreement words among the comments, the model would only label them as P&D if they agreed with the deceptive content. My goal was to use the comments to boost the performance of the classifiers, as they may exhibit similar language to that within fraudulent posts. With the use of the agreement model, I achieve a more refined set of record labels for the classification models. By using this dataset, the results of the models show a more robust prediction of P&D posts. This indicates that the use of specific comments does strengthen the detection of posts that contain deceptive content. By

comparing the performances of the various classification models, I found that the CNN produced the best results. This was somewhat expected, as CNNs are known to perform well on classification tasks within NLP. Since CNNs are good at extracting local and position-invariant features, it is likely that the model learned common key phrases associated with P&Ds to achieve the best performance.

The results of my research show that it is possible to detect P&Ds within online forums. Despite the difficulties associated with reducing market manipulation, my research demonstrates that, by taking a different approach, machine learning and NLP can be used to tackle the issue. My research provides a unique method of labelling textual data from online forums to improve the performance of the detection models. Additionally, it provides several insights into the occurrences and language of P&Ds, which may assist investors in avoiding fraudulent content.
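The position invariance mentioned above can be illustrated with a toy one-dimensional convolution followed by global max pooling, the two core operations of a text CNN. This is a hand-rolled sketch over lists of token-embedding vectors, not the Keras model used in the thesis:

```python
def conv1d_max_pool(embeddings, kernel):
    """Slide a kernel over a sequence of token-embedding vectors and
    return the maximum activation (global max pooling). The maximum is
    the same wherever the matching phrase occurs in the sequence,
    which is the position invariance a text CNN exploits."""
    k = len(kernel)
    activations = []
    for i in range(len(embeddings) - k + 1):
        # Dot product of the kernel with a window of k token vectors.
        window = embeddings[i:i + k]
        act = sum(w * x
                  for wv, kv in zip(window, kernel)
                  for w, x in zip(wv, kv))
        activations.append(act)
    return max(activations)
```

The matching window scores the same whether the "phrase" appears at the start or at the end of the sequence, which is why a CNN can pick out key P&D phrases regardless of where they occur in a post.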

5.2 Limitations

5.2.1 Accuracy of Market Behaviour

When selecting a source for the financial data, Yahoo! Finance was chosen for its accessibility. Furthermore, its popularity, community support, and available libraries have proven it to be a good choice for my research. Despite these benefits, one downside to the platform is that it cannot provide minute-by-minute market data once a certain amount of time has passed. Since there is a gap in time between collecting the text and the market data within my research, it was difficult to retrieve the OHLCV values on a per-minute basis for a given stock. As a result, day-to-day values were retrieved instead. This means that the market data associated

with the texts does not accurately reflect the resulting market behaviours. If minute-interval data had been available, the procedure would have been to start collecting the OHLCV values at the time the post was submitted. Doing so would have allowed a more precise determination of whether the text truly impacted the stock market. Furthermore, it may have led to more accurate labels for the texts, which may also explain the misclassifications that I discussed in Section 4.4.
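With only daily bars available, a post's timestamp has to be mapped to the nearest daily OHLCV record. A minimal sketch of such a mapping is given below; the helper name, the 16:00 market close, and the weekend roll-forward are assumptions for illustration, and market holidays are ignored:

```python
from datetime import datetime, timedelta

def next_trading_day(post_time):
    """Map a post's timestamp to the date of the daily OHLCV bar used
    to measure the market response. Posts made after the 16:00 close
    (or on a weekend) roll forward to the next weekday. Simplified
    sketch: ignores market holidays."""
    day = post_time.date()
    if post_time.hour >= 16:          # after the close -> next day
        day += timedelta(days=1)
    while day.weekday() >= 5:         # 5 = Saturday, 6 = Sunday
        day += timedelta(days=1)
    return day
```

A Friday-evening post, for instance, maps to the following Monday's bar, so any intraday price reaction over the weekend gap is invisible, which is exactly the loss of precision this section describes.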

5.2.2 Labelling of Market Data

Given the market data, the two methods used to initially label the records were based on the known behaviours of P&Ds. They were formed under the assumption that a P&D would follow a pattern similar to the one shown in Figure 2.4. However, not all P&Ds can be expected to exhibit the same behaviour. Due to the nature of the stock market, it would be difficult to detect all forms of P&D, as there is no clear definition of the duration and impact that one can have on the market. Therefore, the approach I took within my research was to identify those that were undoubtedly P&Ds. This, however, meant that actual P&Ds that did not meet my requirements were mislabelled within the dataset. While this approach may have contributed to some of the misclassifications made by the models, it is an issue that is difficult to address, given the unpredictability of the market.
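The canonical shape assumed by the labelling methods, a sharp rise followed by a collapse as in Figure 2.4, can be checked directly against a window of closing prices. The sketch below uses illustrative 50% pump and 30% dump thresholds; the thesis's statistical rules differ in detail:

```python
def is_pump_and_dump(closes, pump=0.5, dump=0.3):
    """Flag a price series as following the canonical pump-and-dump
    shape: the peak rises at least `pump` (50%) above the starting
    price, and the series then falls at least `dump` (30%) back from
    that peak. Illustrative thresholds only."""
    start = closes[0]
    peak_idx = max(range(len(closes)), key=lambda i: closes[i])
    peak = closes[peak_idx]
    pumped = (peak - start) / start >= pump
    trough_after = min(closes[peak_idx:])
    dumped = (peak - trough_after) / peak >= dump
    return pumped and dumped
```

A series such as [1.0, 1.2, 1.8, 1.9, 1.1, 1.0] matches the shape, while a steady rise does not; P&Ds that fall outside whatever thresholds are chosen are exactly the mislabelled cases this section describes.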

5.2.3 Biases within Agreement Model

The agreement model presented within my research is a combination of Empath and custom words. In order to create the initial lexicon for agreement, select seed words were provided to Empath. The words that were chosen encapsulated

what I determined P&D comments would be like. Furthermore, because financial forums use domain-specific words, terms frequently observed among the comments that supported P&D posts were added to expand the lexicon. By having direct input into creating the lexicon for agreement, it is possible that my personal biases influenced the choice of words. To address this, a different approach could have been explored to identify agreement. Applying clustering algorithms to the comments of P&D posts could have revealed groupings of documents indicating which comments agreed with the fraudulent information.
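A minimal sketch of a lexicon-based agreement check is shown below. The seed terms are hypothetical stand-ins for the Empath-expanded lexicon and the manually added forum-specific words:

```python
# Hypothetical seed lexicons; the thesis's actual lists came from
# Empath expansion plus manually added forum-specific terms.
AGREE_TERMS = {"bought", "holding", "moon", "agree", "loaded"}
DISAGREE_TERMS = {"avoid", "scam", "sell", "dump", "overvalued"}

def agrees_with_post(comment):
    """Return True if a comment on a P&D-labelled post appears to
    agree with it, judged by counting token matches against the
    agreement vs. disagreement lexicons. Simplified sketch of the
    agreement model."""
    tokens = [t.strip(".,!?") for t in comment.lower().split()]
    agree = sum(t in AGREE_TERMS for t in tokens)
    disagree = sum(t in DISAGREE_TERMS for t in tokens)
    return agree > disagree
```

On the examples quoted in the summary, "i bought as well" is scored as agreement and "avoid this stock" as disagreement. A clustering-based alternative would replace these hand-built lexicons with groupings learned from the comments themselves, removing the direct human input discussed above.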

5.2.4 Mistaking Genuine Content as Deceptive

While the goal of this research is to detect deceptive content disseminated by manipulators, not all content identified by the models may be deceptive. Some investors post information online genuinely believing that what they are sharing is reliable, despite it being incorrect. Such content is impossible to distinguish, by appearance alone, from content that is deliberately deceptive. Only the individual who shared the information could reveal the intent behind their submission. Therefore, the proposed model is better suited as a tool to aid investors in their decisions than as a mechanism for finding individuals who are looking to manipulate the market.


Appendix A

Computational Resources

For the purposes of my research, three different computational resources were used. First, a Linux Virtual Machine (VM) was set up within the Queen’s University School of Computing Research vSphere Cluster. The specifications for the machine were as follows:

• CPU - Intel Xeon Gold 6130 2.10GHz (12 Cores Allocated)

• RAM - 12 GB Allocated

• Storage - 64 GB Allocated

The VM was used for cron, a time-based job scheduler, in order to automate the data collection process. Once enough data was collected, a macOS machine was then used to preprocess the data. The following are the specifications for the machine:

• CPU - 8-Core Intel Core i9 2.3GHz

• RAM - 32 GB DDR4

• Storage - 512 GB SSD

Finally, all analysis and testing of the data were conducted on Google Colab. These were the settings that were used:

• Runtime Type: Python 3

• Runtime Shape: High-RAM

• Hardware Accelerator: GPU

Appendix B

SHAP Summary Plots

Figure B.1: XGBoost SHAP Summary Plot for posts

Figure B.2: XGBoost SHAP Summary Plot for posts and comments

Figure B.3: RF SHAP Summary Plot for posts

Figure B.4: RF SHAP Summary Plot for posts and comments

Figure B.5: SVM SHAP Summary Plot for posts

Figure B.6: SVM SHAP Summary Plot for posts and comments