INFLUENCE OF CRYPTOCURRENCIES ON LSE TWITTER CASHTAGS

Antón Lorenzo García

Master’s Thesis presented to the Telecommunications Engineering School
Master’s Degree in Telecommunications Engineering

Supervisors: Rebeca P. Díaz Redondo, Ana Fernández Vilas

2018

Acknowledgements

This work is funded by the European Regional Development Fund (ERDF) and the Galician Regional Government under agreement for funding the Atlantic Research Center for Information and Communication Technologies (AtlantTIC), and by the Spanish Ministry of Economy and Competitiveness under the National Science Program (TEC2014-54335-C4-3-R and TEC2017-84197-C4-2-R). We thank the Centro de Supercomputación de Galicia (CESGA) for its computational support during the research stay.

Vigo, July 15, 2018


Abstract

There is a general consensus about the good sensing and novelty characteristics of Twitter as an information medium for complex financial markets. Previous analyses establish Twitter as a relevant input for taking decisions regarding the financial market and even regarding fraudulent activities in that market. One of the main mechanisms used in Twitter to track financial tweets is the cashtag, a label formed by the ticker of a company preceded by the $ symbol. However, in recent months the emergence of cryptocurrencies has degraded the quality of the information obtained through this mechanism. This is because a few of them have tickers that are homonyms of those of some companies in the main markets, which means that a search by cashtag returns results referring both to stock companies and to cryptocurrencies. The overall aim of this research is to deploy a classification system that allows both types of tweets to be split. To this end, a set of analyses was carried out to extract the distinctive features of both information sets. To be precise, the interference between both types is studied for the main London Stock Exchange (LSE) companies through the constituent companies of its two main markets, the FTSE-100 and the AIM-100, between July 1, 2017 and February 15, 2018. In addition, different classification systems using adapted heuristics and supervised methods are proposed, and their main advantages and limitations, as well as their useful lifespan, are analysed. The experimental results confirm that a change in behaviour can be observed in recent months in the data collected through the cashtags of those LSE companies for which there is a cryptocurrency with the same ticker. However, this analysis also shows that both types of tweets can be accurately split using classifiers that consider the distinctive features of each type.

Key words: AIM-100, Cashtags, Cryptocurrencies, Data Analysis, FTSE-100, London Stock Exchange, Support Vector Machines, Twitter


Contents

Acknowledgements

Abstract

List of figures

List of tables

1 Introduction

2 Related Work

3 Motivation
3.1 Original idea
3.2 Cashtag behaviour change

4 Datasets
4.1 Extraction process
4.2 Tweet structure

5 Methodology

6 Tweet features
6.1 Corpus information
6.2 User information
6.3 Tweet time and place

7 Application of filtering criteria and results
7.1 Heuristic filters
7.2 SVM classifiers
7.3 Combined systems
7.4 LSTM classifiers
7.5 Logistic regression systems
7.6 Conclusions and limitations

8 Conclusions and future lines

9 Appendix I: Classifiers
9.1 Supervised Methods
9.2 Measurements

10 Appendix II: LSTM networks

Bibliography

List of Figures

3.1 Searches on Google trend evolution
3.2 LSE-100 tweet time distribution
3.3 LSE-100 tweet time distribution, Homonym (black) vs No Homonym (blue)
3.4 Homonym tweets time distribution, LSE (blue) vs Cryptocurrency (black)

5.1 Block diagram

6.1 Cryptocurrency text word cloud
6.2 Company text word cloud
6.3 Ticker distribution, LSE (blue) vs Cryptocurrency (black)
6.4 Cryptocurrency hashtag word cloud
6.5 Company hashtag word cloud
6.6 Cryptocurrency user description word cloud
6.7 Company user description word cloud
6.8 Follower distribution by user, LSE (blue) vs Cryptocurrency (black)
6.9 Friend distribution by user, LSE (blue) vs Cryptocurrency (black)
6.10 Cryptocurrency default profile distribution
6.11 Company default profile distribution
6.12 Account creation time distribution, LSE (blue) vs Cryptocurrency (black)
6.13 Tweet time distribution, LSE (blue) vs Cryptocurrency (black)

7.1 AUC Basic SVM classifier
7.2 AUC Extended SVM classifier
7.3 AUC Combined SVM classifier
7.4 AUC Independent SVM classifier
7.5 AUC LSTM SVM classifier
7.6 AUC independent LSTM SVM classifier

10.1 Recurrent network model
10.2 LSTM cell structure


List of Tables

3.1 Homonym tickers

4.1 Cryptocurrencies captured
4.2 LSE-100 tickers
4.3 Datasets overview
4.4 Tweet structure

6.1 Tweet main features

7.1 Wordbase heuristic filter measurements
7.2 Cryptocurrencies used (Heuristic word filter)
7.3 Words used (Heuristic word filter)
7.4 Independent variables (Basic SVM classifier)
7.5 Vocabulary Extended SVM classifier
7.6 SVM classifier measurements
7.7 Combined SVM classifier
7.8 Independent SVM classifier
7.9 Independent variables (Independent SVM filter)
7.10 Vocabulary independent SVM classifier
7.11 LSTM SVM classifier
7.12 Independent LSTM SVM classifier
7.13 Logistic regression systems measurements


1 Introduction

The progressive use of technology in the stock market has led to continuous growth in its business, helping both business and individual investors harvest information about diverse topics such as the outlook of a company, the opinions of its clients, news about significant changes or rumours. As is well known, success in stock trading depends heavily on the quality and speed of the information that supports decision-making. As on-line social media became part of people’s habits, companies, brokers and other key actors in the financial market also began to share more and more useful information and professional opinions about the stock exchanges. All this public information turns social media into one of the main, if not the greatest, information sources for brokers.

Currently, Twitter is one of the most used platforms for sharing financial information from companies, brokers, news agencies or individual investors. As Twitter usage in this context is clearly increasing, it is important to stress that, according to (Sprenger et al., 2014), stock microblogs exhibit three distinct characteristics with respect to stock message boards: (1) Twitter’s public timeline may capture the natural market conversation more accurately and reflect up-to-date developments; (2) Twitter reflects a more ticker-like live conversation, which exposes twitter-bloggers to the most recent information on all stocks and does not require users to actively enter the forum for a particular stock; and (3) twitter-bloggers have a strong incentive to publish valuable information to maintain their reputation (increasing mentions, the rate of retweets and their followers), while financial bloggers can be indifferent to their reputation in the forum. Given suitable sensing, harvesting and analysis methods, this information can be very useful for many stakeholders such as businesses, individuals making investment decisions, stock market analysts or law enforcement agencies.

One of the main mechanisms provided by Twitter to track the financial information about a stock company is the cashtag. A cashtag is a label formed by the ticker of a company preceded by the $ symbol. Recall that the ticker of a company is a short sequence of letters, and sometimes a few numbers, that identifies a stock company in financial environments. For example, in the case of Vodafone, its ticker is VOD and its cashtag is $VOD. This label is added to tweets, similarly to what happens with hashtags, and indicates that the tweet contains financial information about the company the ticker references. Twitter also provides resources to track the tweets that contain a specific cashtag. All of this turns cashtags into one of the most useful mechanisms to easily harvest financial information on Twitter.

However, the emergence of cryptocurrencies in recent months has degraded the quality of the information obtained through this resource. This is because a few of them have tickers that are homonyms of those of some companies in the main markets, that is, some cryptocurrencies have the same ticker as some stock companies, largely due to the huge number and lack of regulation of the former. As a result, when the cashtag is used, results referring to both stock companies and cryptocurrencies can be obtained. Moreover, due to their recent popularity, the number of published tweets that refer to cryptocurrencies far surpasses the number of those about stock companies. This, added to the low quality of cryptocurrency tweets, most of them spam or auto-generated messages, very significantly degrades the informative capacity of this resource and evidences the need for new mechanisms that allow differentiating between both types of tweets.

The overall aim of this research is to deploy a classification system that allows both types of tweets to be split. To this end, the change in the nature of the information collected through the cashtags of companies from the London Stock Exchange (LSE) is analysed. To be precise, the interference in the LSE companies is studied through the constituents of its two main markets, the FTSE-100 and the AIM-100, between July 1, 2017 and February 15, 2018. The objective of this initial analysis is to identify companies whose ticker is a homonym of a cryptocurrency ticker and to study the impact of new tweets about cryptocurrencies on the tweets related to LSE companies collected through the cashtag. Then, amongst all the information available in a tweet, the distinctive features of both types have been identified through a descriptive analysis, using sets of tweets that only refer to a cryptocurrency or to a stock company. With all of this information, a set of classification systems able to split both types of tweets has been proposed. These systems use both heuristics adapted to the situation and supervised methods. Finally, for each of them, its main advantages and limitations, as well as its useful lifespan, are analysed.

To do so, the paper is structured as follows. First of all, the related work is discussed (Section 2). In Section 3, the scope of the change, the companies affected and its impact are analysed. Thereafter, the datasets used and the methodology followed are described in Section 4 and Section 5, respectively. Then, amongst all the information available in a tweet, the distinctive features of both types are identified using groups of tweets that only refer to a cryptocurrency or to a stock company (Section 6). With all this information, a set of classification systems able to split both types of tweets using heuristics and supervised methods is proposed, analysing their main advantages and limitations, as well as their useful lifespan (Section 7). All the main ideas presented within the paper and the applicability and limitations of the results obtained are summarized in Section 8. Finally, the theoretical bases of the models used and their performance measurements are explained in Appendices I and II (Sections 9 and 10, respectively).


2 Related Work

The modernization and digitalization of the financial market has been accompanied by an enormous increase in the information available in the diverse web information sources for brokers and individual investors. All this information forms a huge source of knowledge that can be applied to predict and study movements and make decisions in financial markets. Thus, several studies have addressed the predictive power of the information contained in social media. Initially, (Bordino et al., 2012) showed that trading volumes of stocks listed in NASDAQ-100 are correlated with their query volumes, that is, the number of requests submitted on the Internet, and (Wang et al., 2012) proposed sentiment analysis as one of the most relevant features to improve the accuracy of financial time series forecasting. More recently, (Cavalcante et al., 2016) and (Li et al., 2017) raised similar ideas, such as the importance of mining textual context and of sentiment analysis of professional opinions in social media and financial news as useful supplementary sources.

In addition, several intelligent trading systems have been proposed, such as (Gunduz and Cataltepe, 2015), which deployed a forecasting method that combines the analysis of news articles from Turkish finance websites, the extraction of feature vectors and stock prices to predict future market movements, or (Nassirtoussi et al., 2015), which used text mining of financial news headlines to predict movements in the FOREX market. Deep learning has also been applied to model both short-term and long-term influences of events on stock price movements in (Ding et al., 2015). Finally, (Ranco et al., 2016) explored the combination of public news with the browsing activity of the users of Yahoo! Finance to forecast intra-day and daily price changes of a set of 100 highly capitalised US stocks. To sum up, most of the research in this field addresses the predictive power of social media, especially when combined with other information sources.

If the impact of information from on-line data sources on the financial market is broadly recognized by researchers and professionals, there is also a broad consensus about Twitter, one of the most used social media regarding financial markets. In addition to the well-known hashtag, which allows people to follow topics they are interested in, Twitter unveiled a new clicking and tracking feature for stock symbols known as cashtags, which are, as explained before, stock market symbols that can be included in tweets preceded by a dollar sign (for example, $TSCO for Tesco). This resource makes it possible to track financial information about a specific company or market. Related to this mechanism, (Hentschel and Alonso, 2014) reported an exploratory analysis of public tweets in English which contain at least one cashtag from NASDAQ (National Association of Securities Dealers Automated Quotation) or NYSE (New York Stock Exchange). The research concludes that the use of cashtags is higher in the technology sector, which seems to be related to the technological profile of most Twitter users. It also highlights the existence of relevant information behind the co-occurrence of cashtags and the co-occurrence of cashtags and hashtags together.

It should be mentioned that, according to (Dredze et al., 2016), there are mainly five types of users that post financial information on Twitter: journalists, companies and their representatives, activist investors, government agencies and citizen journalists. It also states that, unlike in the classic information sources, the financial information is of a different type, formed by breaking news, rumours and speculation. (Ceccarelli et al., 2016), which also agrees on the usefulness of Twitter as a source of financial information and on its complementarity with the classic information sources, suggests that popularity within finance is not necessarily the same as popularity within other areas of Twitter and that novelty is highly related to the popularity of financial tweets. Moreover, (Elliott et al., 2017) shows that the impact of negative financial news on investors differs depending on whether it comes from an Investor Relations Twitter account or from the CEO’s Twitter account. When the news comes from the former, the investors’ willingness to invest decreases sharply, whereas when it comes from the latter, it has no effect on it.

(Rao and Srivastava, 2014) studied the relationship between Twitter sentiment and financial market instruments such as volatility and trading volume, with promising results in the Dow Jones Industrial Average (DJIA) and NASDAQ-100 for high frequency trades. In this type of operation, where traders take an investment position that is held only for very brief periods of time, traders use social media information as a fast mechanism to observe public behaviour and opinion and make decisions based on it. More recently, (Fernández Vilas et al., 2018) analysed the variation in the information volume, content, sentiment and geographical provenance after far-impacting financial events, such as a merger, and concluded that although Twitter is not a specific financial forum, it is permeable to financial events. Thus, it shows that Twitter sentiment changes with financial events and that this information can be very useful to gauge the state of the stock market. However, not everything related to Twitter is good news. According to (Dredze et al., 2016), the use of Twitter adds new challenges such as the huge volume of available data, the frequent repetition of the same information or the quality of the tweets.

In addition to what was previously discussed, one of the most studied points is the analysis of the relationship between Twitter behaviour and stock share price, especially at some relevant moments such as quarterly announcements or big movements. Thus, (Ranco et al., 2015), which investigated a 15-month period of Twitter data about 30 stock companies of the Dow Jones Industrial Average (DJIA) index, shows that there is not only a strong relation between Twitter behaviour and stock share price at well-known relevant moments, but also correlation peaks that do not correspond to any expected news about the stock market. Moreover, (Liu et al., 2015) used Twitter to identify and predict stock co-movement according to firm-specific social media metrics, and (Shutes et al., 2016) studied tweets related to the US market as an indicator of new information in the stock market, rather than evaluating the problem of causality on stock prices. Also, (Shutes et al., 2016) shows that nearly a third of the tweets in the study are associated with abnormal price movements. This can be a problem because of the lack of information during regular periods, which, together with the absence of the concrete trading recommendations that other information sources provide, makes it difficult for Twitter to completely replace the traditional sources for the financial market.

Other studies have examined the predictive value of the information extracted from Twitter for making trading decisions, unlike the previously discussed papers, which deal only with the relation between Twitter and the financial market. (Ruiz et al., 2012), which investigated the correlation between the activity on Twitter and financial time series, showed that the price of a stock is weakly correlated with the analysed features if they are used alone. In (Bormetti et al., 2015), tick-by-tick transaction data was analysed for 20 Italian stocks over a period of approximately four months. In addition, (Cazzoli et al., 2016) analysed over 1700 stocks for a period of more than two years, building a huge financial tweet dataset. In this study, the authors found that expert users impact the financial market more than others and that sectors such as technology and customers show a better correlation than others with the financial movements.

Even though most studies are based on the Twitter data volume, as the ones previously presented, there are also studies that apply sentiment analysis techniques to tweets to distinguish the polarity of content and its impact on the financial market. (Bollen et al., 2011) showed that public mood analysed through Twitter feeds is well correlated with the Dow Jones Industrial Average (DJIA). (Zhangt, 2013) found a high negative correlation between mood states like hope, fear and worry in tweets and the Dow Jones Average Index. Furthermore, in (Al Nasseri et al., 2014), (Liew and Budavári, 2016), (Rajesh and Gandy, 2016), (Cortez et al., 2016) and (Nguyen et al., 2015), text-based sentiment was considered useful to make trading decisions or predict useful stock market variables. (Pagolu et al., 2016) applied sentiment analysis and unsupervised machine learning to Twitter to analyse the correlation between stock market movements of a company and sentiments in tweets, finding a strong correlation between the rises and falls in stock prices of a company and public opinions or emotions about that company expressed on Twitter. Also, (Dickinson and W., 2015) investigated the correlation of public sentiment with stock increases and decreases using the Pearson correlation coefficient. In addition, (Oliveira et al., 2016) proposed an approach for creating stock market lexicons to deal with the short length of tweets, one of the main issues of natural language processing when working with Twitter. Then, using some investor sentiment indicators, they studied the correlation between these indicators and two traditional survey indicators, Investors Intelligence (II) and American Association of Individual Investors (AAII), with moderate correlation results.

Consequently, there is a general consensus about the good sensing and novelty characteristics of Twitter as a source of information for the complex financial market, especially if it is combined with other information sources. Thus, while most current research focuses on the potential of Twitter as a source of predictive information and decision making in financial markets and on the development of various expert systems capable of using such information, the approach of this paper has a completely different objective: to illustrate the effect of the recent popularity of cryptocurrencies on the tweets collected through the cashtags of an LSE company and to deploy systems capable of differentiating both types of tweets, those related to stock companies and those related to cryptocurrencies. In this way, it is possible to use the information contained in both types, restoring the informative capacity the cashtag initially offered.

3 Motivation

3.1 Original idea

Although the objective of this master’s thesis is to show the influence of the new use of cryptocurrencies on the correct functioning of the Twitter cashtag and to propose different tools that allow the distinction between classical tweets about stock companies and new tweets about cryptocurrencies, the original purpose of this project was quite different. It was to compare the information capacity offered by various social media and specialized web forums to predict the stock market variations of the LSE-100 companies, that is, to analyse and compare the relationship between the information posted in different web resources and the variations in the stock index value of the LSE-100 constituents.

Thus, the cross-correlation between the number of tweets with cashtags and the quotation of the LSE-100 companies in the stock market was studied, together with a descriptive analysis to determine the main characteristics of the tweets, their most useful fields and possible gaps during the extraction process. In particular, the cross-correlation was obtained between the set of tweets that contain at least one ticker of a company from one of the markets analysed (FTSE-100, AIM-100) and the market price. This was studied both for the general market value and for the relative differential change in its price. To obtain the cross-correlation, both the raw time series and the trend of these series, that is, the result of removing the seasonal and noise components, were used.

Additionally, for those companies that had a sufficient number of tweets, the cross-correlation between the tweets of each company and the value of the market where the company is quoted was also studied, as well as the cross-correlation between the tweets of each company and its own value on the stock market. All these operations were complemented with moving windows of 5, 10, 15 and 20 days, applied both to stock values and to their relative differential change.
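As an illustration of this kind of analysis, the following minimal sketch computes a lagged cross-correlation between a daily tweet-count series and a daily price series, together with helpers for the relative differential change and a moving window. The function names, the synthetic data and the choice of a simple moving average are assumptions made for this example, not the implementation used in this work.

```python
import numpy as np


def relative_change(series):
    """Relative differential change: (x[t] - x[t-1]) / x[t-1]."""
    series = np.asarray(series, dtype=float)
    return np.diff(series) / series[:-1]


def moving_average(series, window):
    """Simple moving average over `window` days (fully covered days only)."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")


def cross_correlation(x, y, max_lag):
    """Pearson correlation between x[t] and y[t + lag] for each lag.

    Both series must have the same length. A positive lag means that
    x (e.g. tweets) leads y (e.g. price).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x[: len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[: len(y) + lag]
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result


# Synthetic, hypothetical data: the price follows the tweet volume
# with a delay of two days.
rng = np.random.default_rng(0)
tweets = rng.poisson(50, size=200).astype(float)
price = 100.0 + 0.5 * np.roll(tweets, 2)
price[:2] = 125.0  # overwrite the two values wrapped around by np.roll

corr = cross_correlation(tweets, price, max_lag=5)
best_lag = max(corr, key=corr.get)  # lag with the highest correlation
```

Under these assumptions, the windows of 5, 10, 15 and 20 days mentioned above would correspond to calling `moving_average` with each window size on the price series, or on its `relative_change`, before computing the cross-correlation.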


However, when we studied these temporal results, they were not as expected. When we examined the problem in detail, we detected that this was due to the influence of the new cryptocurrencies on Twitter. So, the objective of the project changed to the current one. For this reason, other sources of information and the resources to obtain them had been developed, although they were not finally used due to this change. For all these alternative sources of information, descriptive and temporal analyses, similar to those done with the tweets with cashtags, were also carried out. The different sources of information analysed are described below.

First, in addition to the cashtag tweets used in this project, other tweet downloads were made. Specifically, for each FTSE-100 member company, the tweets that contain any of the names by which the company may be known were obtained. These tweets, much more common than tweets with cashtags, contain both information about the financial situation of the company and information from the point of view of the common consumer, since the name of the company is used by both investors and ordinary users. Therefore, this set of tweets was divided into two subsets depending on whether its nature was economic or not. For this, a vocabulary related to financial tweets was created from the most common terms in the tweets with cashtags and applied to these tweets. Thus, if the presence of financial vocabulary terms in a tweet exceeded a certain threshold, it was classified as financial, and if not, it was considered as written by a common consumer. As a result, two different datasets of tweets were created: the first one made of financial tweets that contain the name of an FTSE-100 constituent, and the second one made of tweets that contain the name of an FTSE-100 constituent and information from the point of view of a common consumer.
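The vocabulary-based split described above can be sketched as follows. This is only a minimal illustration: the vocabulary contents, the threshold value and the decision to measure the presence of financial terms as a fraction of the tokens in the tweet are all assumptions, since the text does not specify them.

```python
import re

# Hypothetical financial vocabulary; in the text it is built from the
# most common terms found in the tweets with cashtags.
FINANCIAL_VOCABULARY = {
    "shares", "dividend", "profit", "stock", "trading", "investors",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def is_financial(text, vocabulary=FINANCIAL_VOCABULARY, threshold=0.15):
    """Return True if the share of vocabulary terms exceeds the threshold.

    The threshold of 0.15 is an arbitrary value chosen for this example.
    """
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens) > threshold


def split_tweets(tweets, **kwargs):
    """Split tweet texts into (financial, consumer) lists."""
    financial, consumer = [], []
    for tweet in tweets:
        target = financial if is_financial(tweet, **kwargs) else consumer
        target.append(tweet)
    return financial, consumer


financial, consumer = split_tweets([
    "Vodafone shares rise after dividend announcement",
    "My Vodafone signal is terrible today",
])
```

A ratio is used here rather than an absolute count so that short and long tweets are treated comparably; an absolute count would be an equally plausible reading of the description.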

In order to enrich the original analysis and expand the information contained in the tweets, the use of new data sources was studied. In particular, an application was developed to allow, in a simple way, the download and storage of Reddit posts about a given subject, in this case LSE-100 companies. For this purpose, different Reddit groups specialized in economic issues were investigated, together with the results obtained in these groups for different key financial terms, such as the name of the market, some of the most significant companies or their tickers. To search and download the information, the REST API offered by Reddit for accessing its information was used.

The other source of information studied was the London South East website. This is a specialized forum about the London Stock Exchange where news about its constituent companies is published regularly. In addition, it also offers a public chat service where different users can exchange opinions about either the market in general or a particular company. Unlike Reddit, the forum does not offer a service that allows the acquisition of information in a simple way. So, we had to study the way in which the page organized the information and build a web scraper based on this that would allow us to extract its content and store it in an easily processable format.


The extraction of the content of these forums raised some issues that had to be solved first. In the chats offered by the platform it is usual to see entries that are published in duplicate or even in triplicate. However, unlike in Twitter, each entry lacks a unique identifier that distinguishes it from the others, so a differentiator had to be built to detect and remove duplicate entries. Since both the news and the chat capturers were intended to be programs executed periodically, the same duplicate detector was used to determine whether an entry had already been extracted in a previous iteration, thus avoiding extracting all previously downloaded entries again. This same mechanism was used to determine whether all the available entries had already been extracted. The chats are organized in pages and, if a number higher than the last page is requested, the server simply returns the content of the last one. So it was necessary to check whether the content of the previous response was repeated to determine the completion of the extraction process. Finally, we had to deal with the platform’s prevention mechanism against DoS attacks. The existence of this system meant that, after a certain number of requests in a short period of time, the server began to return erroneous information, with catastrophic consequences for the extractor. This policy was studied and mechanisms were developed to detect this incorrect information satisfactorily.
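The duplicate detection and the end-of-pagination check described above can be sketched as follows. This is a minimal illustration under stated assumptions: the entry fields (`author`, `timestamp`, `text`), the use of a SHA-256 hash as a surrogate identifier and the `fetch_page` callback are hypothetical and do not reflect the actual structure of the forum or of the extractor developed in this work.

```python
import hashlib


def fingerprint(entry):
    """Build a surrogate identifier from the content of a chat entry.

    The field names (author, timestamp, text) are hypothetical.
    """
    fields = ("author", "timestamp", "text")
    raw = "|".join(str(entry.get(key, "")).strip() for key in fields)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def extract_new_entries(fetch_page, seen, max_pages=1000):
    """Collect entries not seen before, page by page.

    `fetch_page(n)` is a caller-supplied function returning the list of
    entries on page n. Extraction stops when a page repeats the previous
    one, since the server returns the last page for any out-of-range
    page number. `seen` is a set of fingerprints kept between runs.
    """
    new_entries = []
    previous_page = None
    for page_number in range(1, max_pages + 1):
        entries = fetch_page(page_number)
        digests = [fingerprint(entry) for entry in entries]
        if not entries or digests == previous_page:
            break  # past the last page
        for entry, digest in zip(entries, digests):
            if digest not in seen:  # drops duplicates and old entries
                seen.add(digest)
                new_entries.append(entry)
        previous_page = digests
    return new_entries
```

In this sketch the set `seen` plays both roles mentioned in the text: it removes entries published in duplicate within a run and skips entries already downloaded in a previous run. A periodic capturer could additionally stop as soon as a whole page contains only known entries; that optimisation is omitted here.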

After having developed the two extractors, a platform was implemented consisting of a database and a Web interface that would allow the launch, management and monitoring of different captures in both new sources of information. Through this server it is possible to consult the current status of the different active captures, launch new ones, stop current ones or modify the operating parameters of the system, among other functionalities. As has been commented, for all these alternative sources of information, descriptive and temporal analyses, similar to the ones done with the tweets with cashtags, were made. However, due to the change of project objective, they have not been used.

3.2 Cashtag behaviour change

Since 2012, Twitter has incorporated the cashtag, which makes it much easier to find tweets that address the stock market status of a company. However, in recent months its usefulness has been reduced due to the interference of cryptocurrencies. Cryptocurrencies, despite having existed for a long time, have become especially popular in recent months, as can be seen in figures 3.1(a), 3.1(b) and 3.1(c), where the searches on Google about some related topics are shown, unlike searches about stock companies, which have remained constant (figure 3.1(d)).


Figure 3.1 – Searches on Google trend evolution: (a) Cryptocurrency (general term); (b) NXT (ticker of Nxt platform, cryptocurrency); (c) XLM (ticker of Stellar Lumens, cryptocurrency); (d) VOD (ticker of Vodafone, stock company)

This change in behaviour is also visible on Twitter, where the number of daily results about cryptocurrencies has increased by more than 40 times, according to our analysis. Although these tweets should not interfere with the correct functioning of cashtags, many of them use the dollar symbol followed by the acronym of the cryptocurrency to indicate that the tweet refers to it. The conflict arises when the ticker of a company and the acronym of a cryptocurrency coincide, largely due to the huge number of the latter that have emerged in recent months. As a result, when collecting tweets related to a specific cashtag, most of them do not refer to the company they should identify; instead, they address the coincident cryptocurrency, as is the case for cashtags such as XLM (XLMedia stock vs Stellar Lumens cryptocurrency) and NXT (Next plc stock vs Nxt platform cryptocurrency). We will call these cashtags homonym cashtags and the tweets that contain at least one of them homonym tweets.


Thus, a homonym cashtag is any cashtag that can refer to both an LSE-100 stock market company and a cryptocurrency because both have the same acronym, and a homonym tweet is any tweet that has at least one homonym cashtag. The list of homonym tickers on the LSE-100 can be seen in table 3.1. These tickers were identified manually, looking for coincident cryptocurrencies for each constituent company. Conversely, we will call non-homonym cashtag any cashtag that can only refer to a stock market company or to a cryptocurrency, because no two of them share the same ticker, and non-homonym tweet any tweet that has at least one cashtag from an LSE-100 company, as long as none of its tickers is included in the list of homonym cashtags. In addition, we will call company tweet any tweet that contains at least one cashtag that refers to a stock company, and cryptocurrency tweet any tweet that contains at least one cashtag that refers to a cryptocurrency.

Homonym cashtags  LSE company (market)                   Cryptocurrency
$NXT              Next plc (FTSE-100)                    Nxt (coin and platform)
$SKY              SKY plc (FTSE-100)                     Skycoin
$XLM              XLMEDIA (AIM-100)                      Stellar
$BRK              BROOKS (AIM-100)                       Breakout coin
$GBG              GB group (AIM-100)                     Golos Gold
$APH              Alliance pharma (AIM-100)              Aphroditecoin
$AMS              Advanced medical solutions (AIM-100)   Amsterdamcoin
$CRW              Craneware (AIM-100)                    Crown

Table 3.1 – Homonym tickers
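The definitions of homonym and non-homonym tweets given above can be illustrated with a minimal Python sketch. It is not the code used in this work: the regular expression for cashtags, the function names and the short excerpt of non-homonym tickers are assumptions made for illustration, and only the homonym tickers come from table 3.1.

```python
import re

# Homonym tickers, as listed in table 3.1.
HOMONYM_TICKERS = {"NXT", "SKY", "XLM", "BRK", "GBG", "APH", "AMS", "CRW"}

# Hypothetical excerpt of non-homonym LSE-100 tickers, for illustration only.
NON_HOMONYM_LSE_TICKERS = {"VOD", "BARC", "RIO", "BOO"}

# Assumed cashtag pattern: a dollar sign followed by letters.
CASHTAG_RE = re.compile(r"\$([A-Za-z]+)")


def extract_tickers(text):
    """Return the set of upper-cased tickers that appear as cashtags in text."""
    return {match.upper() for match in CASHTAG_RE.findall(text)}


def label_tweet(text):
    """Label a tweet as 'homonym', 'non-homonym' or 'other'."""
    tickers = extract_tickers(text)
    if tickers & HOMONYM_TICKERS:
        # At least one homonym cashtag makes the tweet a homonym tweet.
        return "homonym"
    if tickers & NON_HOMONYM_LSE_TICKERS:
        # LSE-100 cashtags are present and none of them is homonym.
        return "non-homonym"
    return "other"
```

With these assumptions, a tweet such as "Buy $XLM now, also $VOD" is labelled as homonym, because a single homonym cashtag is enough regardless of the other tickers it contains.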

Figure 3.2 – LSE-100 tweet time distribution


The number of tweets that contain an FTSE-100 or an AIM-100 cashtag has increased greatly in recent months, as can be seen in figure 3.2. However, most of these tweets do not refer to LSE companies. Recall that the FTSE-100 and AIM-100 are the two most important markets of the London Stock Exchange (LSE). The FTSE-100 is the principal London market, where the one hundred most valuable companies are listed, such as Vodafone, Coca-Cola HBC or Rio Tinto. The AIM-100 is formed by the one hundred most valuable companies listed in the secondary market; these companies, such as Alliance Pharma, Hutchison China or Staffline, are less well known than those of the main market.

This interference is amplified by the very different number of results obtained for each information source. While the non-homonym tickers of the FTSE-100 companies yield up to 1,000 daily results, the XLM and NXT cashtags alone yield more than 10,000 daily results. This difference can be seen in figure 3.3, which shows the number of daily tweets containing a ticker of the FTSE-100 and AIM-100 companies collected through the Twitter stream API, separating the tweets that contain only non-homonym cashtags from those that contain a homonym one.

As can be seen in figure 3.3, some homonym tweets were already collected in mid-2017, but their number was much lower, so most of the results obtained referred to the expected stock companies. However, since October 2017 the number of homonym tweets has skyrocketed to more than 30 times the amount collected in the previous months for both the FTSE-100 and the AIM-100, making up practically all the results obtained. In fact, the number of results obtained in December for the homonym tickers is 5.6 times the amount collected for the non-homonym tickers in the FTSE-100 market and up to 40 times in the AIM-100.

Figure 3.3 – LSE-100 tweet time distribution, Homonym(black) vs No Homonym(blue)


If we split the homonym tweets depending on whether they refer to the corresponding cryptocurrency or to the company, it is possible to analyse in more detail what happened within these tweets. To do so, we have manually classified all of the homonym tweets, identifying whether they talk about a cryptocurrency or an LSE company. During the classification process, the content of the tweet, the name of the user who posted it, the user's description, the history of tweets posted by the user, the profile of the referenced users, if any, and any additional information added to the tweet through a hyperlink were carefully analysed in order to determine whether the tweet talks about the cryptocurrency or the stock company. With all this information and some basic knowledge about the stock exchange, its constituents and the cryptocurrencies, a human being can easily differentiate between both information types.

The first feature that can be seen (figure 3.4) is that the increase in the number of tweets collected happens only for cryptocurrency tweets, while the number of tweets about LSE companies remains constant. Likewise, the number of tweets related to cryptocurrencies is much higher than that of the financial ones, with a ratio reaching 100:1 at the beginning of 2018. As would be expected, this large number of cryptocurrency tweets is the reason for the aforementioned general behaviour change, and it evidences the need to implement filtering methods to collect information regarding stock companies. However, once both parts are split, the financial tweets maintain a stable behaviour similar to that of the other LSE tickers, so their information can be used easily.

Figure 3.4 – Homonym tweets time distribution, LSE(blue) vs Cryptocurrency(black)

Given the above, the influence of cryptocurrencies on the information collected for those cashtags whose acronym coincides with the ticker of a pre-existing stock company is fairly clear. Although the situation varies slightly from one homonym cashtag to another, in general most of the collected tweets deal with cryptocurrencies instead of the stock companies to which they should refer. This makes it difficult to track the stock status of the companies, which is what the mechanism originally offered. In addition, almost all the tweets that talk about cryptocurrencies are spam or auto-generated by applications. For this reason, the informative purpose of the cashtag is almost lost, and it becomes necessary to develop disambiguation mechanisms, or to adapt existing ones, that distinguish between cryptocurrency and company tweets so that the information they contain can be used satisfactorily. This document presents different alternatives to achieve this disambiguation, based on the knowledge acquired through an exploratory analysis.

It should be mentioned that the purpose of this document is not to differentiate between the different stock companies that may share the same ticker, but to differentiate between tweets that contain financial information and those that do not. For this reason, all financial tweets are considered jointly, regardless of the market in which the company they refer to is listed.

4 Datasets

To carry out this work, three different datasets have been used. The first one consists of tweets that contain at least one of the cashtags of the main non-homonym cryptocurrencies at the time of writing, that is, cryptocurrencies whose ticker does not coincide with that of any stock company of the main markets. The list of tickers has been determined by consulting different web pages specialized in the tracking of cryptocurrencies. The full list of captured cryptocurrencies is given in table 4.1. This dataset will be used to determine the main features of the cryptocurrency tweets, as detailed in section 5, since there is no interference with any other type of tweet.

Cryptocurrency captured $SNT, $ADA, $MTH, $ADX, $LSK, $DSR, $ARK, $CLOAK, $TKN, $DLC, $DCR, $KMD, $IQT, $ZCL, $DCY, $ALIS, $RBY, $SYS, $EXP, $BCY, $VEN, $BCN, $BLITZ, $UGT, $GVT, $MONA, $QASH, $, $AUR, $UNO, $BURST, $REQ, $PART, $TRIG, $GCR, $LMC, $XEM, $BNB, $SNGLS, $BITSILVER, $PDC, $ELIX, $XVG, $DOPE, $LEND, $SNRG, $NLG, $ARDR, $QSP, $SALT, $SYNX, $GRC, $XDN, $PIVX, $DCT, $WAVES, $PTOY, $SIB, $LTC, $CPC, $NAS, $XMR, $LOCI, $ION, $VSX, $NXS, $XMY, $GBYTE, $XMG, $BAT, $IOP, $HMQ, $NTCC, $PKB, $BAY, $PBL, $BYC, $MINT, $HSR, $MUSIC, $XSPEC, $IGNIS, $ETP, $BWK, $FCT, $DRGN, $MUE, $XPM, $STEEM, $FTC, $SPHR, $DGB, $DGD, $SUB, $VOX, $MAID, $RPX, $AEON, $XAUR, $MIOTA, $CRC, $BET, $ENG, $XVJ, $POWR, $STORJ, $GUP, $UBQ, $SBD, $INFX, $LGD, $DYN, $INFR, $ONION, $MANA, $SLR, $FUN, $CURE, $BITB, $EMC2, $XZC, $IOTA, $COVAL, $AGRS, $PASC, $DOGE, $XRB, $SWT, $FLDC, $ZEC, $NBT, $XRP, $ETH, $RADS, $ETC, $PANGEA, $CLAM, $PHR, $APX, $BTC, $NEM, $NEO, $MYST, $START, $ENJ, $WTC, $PPT, $STR, $ARDOR, $ITZ, $BCPT, $ITC, $TAAS, $STRAT, $SEQ, $EDG

Table 4.1 – Cryptocurrencies captured


The second dataset is made up of those tweets that have at least one cashtag from an LSE-100 company, as long as none of their tickers is a homonym cashtag. The complete list of cashtags used for this dataset is given in table 4.2. It can be split into two subsets: one formed by the tweets that contain cashtags of FTSE-100 companies and another formed by the tweets that contain cashtags of AIM-100 companies. Note that a tweet can be in both subsets if it has at least one ticker that refers to an FTSE-100 company and at least one that refers to an AIM-100 company. Likewise, if the tweet also has a cashtag of a non-homonym cryptocurrency, it can belong to both the previous dataset and this one. This dataset will be used to determine the main features of stock market tweets, as detailed in section 5, since there is no interference with any other type of tweet.

FTSE-100 AIM-100 $CPI, $DC., $HIK, $INTU, $SN., $OPG, $SQS, $PAF, $BOO, $CPG, $CCL, $BARC, $CCH, $GHH, $TAP, $MANX, $SAA, $GSK, $BDEV, $DCC, $BLND, $VNL, $KWS, $IOM, $PLUS, $RIO, $WTB, $SMIN, $IAG, $HZD, $ARBB, $BNN, $IPEL, $MRW, $SVT, $III, $ITRK, $CVSG, $SFE, $OCI, $CRS, $AHT, $JMAT, $IHG, $LGEN, $DTG, $STAF, $FDP, $ABC, $HL., $AV., $BATS, $STAN, $XSG, $SCH, $BUR, $BMK, $CRH, $LSE, $RTO, $SGRO, $APGN, $TCM, $HUR, $NFC, $SBRY, $CRDA, $SHP, $DLG, $IGR, $SOLG, $YNGN, $FOG, $BLT, $PSON, $GKN, $GLEN, $QXT, $REDD, $YNGA, $PSN, $NG., $SSE, $INF, $SMT, $HCM, $MPE, $BREE, $FEVR, $BNZL, $UU., $MERL, $REL, $RNWH, $RWS, $WINE, $PRU, $LAND, $FERG, $DGE, $POLR, $DOTD, $SMS, $TEF, $MDC, $WPP, $MCRO, $EXPN, $GAMA, $CLIN, $MUL, $CTH, $WPG, $RRS, $VOD, $RMG, $TMO, $JSG, $CAM, $ASY, $RR., $IMB, $RDSB, $RDSA, $QTX, $ASC, $NUM, $CAKE, $HMSO, $FRES, $ADM, $TSCO, $EMIS, $LTG, $SMTG, $MAB1, $PFG, $HSBA, $SKG, $OML, $VTU, $JHD, $CVR, $PRSM, $TUI, $ITV, $MKS, $ULVR, $IQE, $AMER, $EAH, $WJG, $AZN, $AAL, $BT.A, $BAB, $ACSO, $SOU,$CAML, $JOUL, $PPB, $BRBY, $MNDI, $RB., $PANR, $RTHM, $FPM, $MTW, $SL., $LLOY, $SGE, $ABF, $VCP, $HGM, $DDDD, $SLE, $RBS, $STJ, $ANTO, $CNA, $YOU, $PURP, $IDOX, $OGN, $SDR, $GFS, $TW., $RSA, $BP., $HOTC, $NICL, $RST, $MIDW, $CTEC, $EZJ, $KGF, $BA. $TFW, $SCPA, $SPH

Table 4.2 – LSE-100 tickers

Finally, the third dataset consists of those tweets that contain at least one ticker from an LSE-100 company whose cashtag matches a cryptocurrency, that is, the homonym tweets. This dataset can also be split into two subsets: one formed by the tweets whose tickers refer to FTSE-100 companies or a cryptocurrency with the same cashtag, and another formed by those regarding AIM-100 companies or a cryptocurrency with the same cashtag. The set of tickers that make up this dataset is shown in table 3.1. This dataset captures the situation on Twitter that is analysed in this work; in other words, it shows the incidence of the new tweets about cryptocurrencies on the tweets collected with the cashtag of a stock exchange company. Therefore, it will be used to train and test both the heuristic filters and the classifiers that are deployed in this document.

In order to achieve a better measurement of the performance of the proposed systems, these datasets have been manually classified, indicating for each tweet whether the cashtag by which it has been captured refers to a stock company or a cryptocurrency. The content of the tweet, the name of the user who posted it, the user's description, the history of tweets posted by the user, the profile of the referenced users, if any, and any additional information added to the tweet through a hyperlink were carefully analysed in order to determine whether the tweet talks about the cryptocurrency or the stock company. A summary of the datasets can be seen in table 4.3.

Name                                         Data interval                    Description                                                                                            Number of results
Cryptocurrencies non-homonym tweets (CNHDS)  From 15 Jan 2018 to 15 Feb 2018  Tweets that contain a cashtag of one of the main cryptocurrencies                                      1,023,232
FTSE-100 homonym tweets (FTHDS)              From 1 Jul 2017 to 15 Feb 2018   Tweets that contain a cashtag of companies of the FTSE-100 coincident with a cryptocurrency            292,864
FTSE-100 non-homonym tweets (FTNHDS)         From 1 Jul 2017 to 15 Feb 2018   Tweets that contain a cashtag of companies of the FTSE-100 that do not coincide with a cryptocurrency  144,787
AIM-100 homonym tweets (AMHDS)               From 1 Jul 2017 to 15 Feb 2018   Tweets that contain a cashtag of companies of the AIM-100 coincident with a cryptocurrency             405,625
AIM-100 non-homonym tweets (AMNHDS)          From 1 Jul 2017 to 15 Feb 2018   Tweets that contain a cashtag of companies of the AIM-100 that do not coincide with a cryptocurrency   69,138

Table 4.3 – Datasets overview


Moreover, the homonym dataset has been divided into three subsets whose elements have been chosen randomly according to the distribution explained below. The first one, called trainset, consists of 70% of the tweets in the dataset and is used to train the deployed classifiers. The second one, the testset, consists of those tweets that are not in the trainset, that is, the remaining 30% of the dataset, and is used to measure the performance of the models. Finally, 10% of the trainset forms the tuneset, which is used to adjust the configuration parameters of the classifiers. This partition prevents the measurements from taking unusually high values that would not reflect the real performance of the system.
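The partition described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the fixed random seed and the decision to keep the tuneset as a subset of the trainset (rather than removing it from the trainset) are not specified in this work.

```python
import random


def split_dataset(tweets, train_fraction=0.7, tune_fraction=0.1, seed=0):
    """Randomly split tweets into trainset, testset and tuneset.

    The tuneset is drawn from the trainset (assumption: it is kept as a
    subset of the trainset instead of being removed from it).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(tweets)
    rng.shuffle(shuffled)

    n_train = int(round(len(shuffled) * train_fraction))
    trainset = shuffled[:n_train]
    testset = shuffled[n_train:]

    n_tune = int(round(len(trainset) * tune_fraction))
    tuneset = trainset[:n_tune]
    return trainset, testset, tuneset
```

For a dataset of 1,000 tweets this yields 700 tweets for training, 300 for testing and 70 for tuning, with no tweet shared between the trainset and the testset.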

4.1 Extraction process

The Marble platform (Fernandez, 2014) has been used to download the tweets of the datasets used for this project. It provides a web interface for the management of tweet downloads using the Twitter API, storing the results in a database. In particular, the tweets have been downloaded using the streaming function of the API. This creates a listener that, once started, captures the posted tweets that meet the characteristics set in the capture. In this case, the criterion used for each capture was the presence of at least one of the cashtags of a fixed list. Three different captures have been made. The first one picks up tweets that have the cashtag of at least one non-homonym cryptocurrency; this is the CNHDS. The second capture filters by the tickers of the members of the FTSE-100, so its results have to be divided later to form the FTNHDS and the FTHDS. Finally, the third capture includes those tweets that have at least one cashtag of an AIM-100 company; these tweets have to be divided to form the AMNHDS and the AMHDS.

To house these captures, six complete Marble systems have been deployed in six virtual machines on the computational resources offered by CESGA (Centro de Supercomputación de Galicia) (CESGA, 2018). Each machine runs one of the three captures described above, so every capture is duplicated. For each capture, one of the servers where it is active has been prepared to periodically dump the downloaded information into a Drive repository, while the other acts as a backup in case of failure. During the development of the project, the status of these servers has been monitored and managed periodically. Although the servers were finally launched on the computational resources offered by CESGA, the first version of Marble initially ran on servers offered by the Universidade de Vigo, so two migrations had to be carried out. First, Marble was updated from version one to version two, which entailed the deployment of the entire system. Secondly, due to the frequent outages suffered by these machines, Marble was transferred to CESGA's virtual resources, which are much more stable, so a second migration of platforms had to be made.

Additionally, the downloaded data has been cleaned and preprocessed. During this process, the absence of duplicate tweets has been verified, ensuring that no tweet has been processed more than once. The different formats that the data could have, depending on the version of the platform and the download process, have been converted to a common format, and the tweets obtained have been split into the datasets described above, among other minor changes necessary to prepare the data for later steps.
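As an illustration of the duplicate check described above, the following Python sketch keeps only the first occurrence of each tweet. It assumes that every tweet is represented as a dictionary with the id field listed in table 4.4; the function name and the in-memory representation are hypothetical and do not describe the actual cleaning scripts.

```python
def deduplicate(tweets):
    """Return the tweets without duplicates, keeping the first occurrence.

    Each tweet is assumed to be a dict with an "id" key (see table 4.4).
    """
    seen_ids = set()
    unique = []
    for tweet in tweets:
        if tweet["id"] in seen_ids:
            continue  # already processed, skip the duplicate
        seen_ids.add(tweet["id"])
        unique.append(tweet)
    return unique
```

Keeping the first occurrence preserves the capture order, which matters when the same tweet was downloaded by both the active server and its backup.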

4.2 Tweet structure

As previously mentioned, all the datasets used in this work are made up of different types of tweets. The information available for each of them can be divided into three main blocks. The first one comprises those fields that provide general information about the tweet, such as the ID, the language, the number of retweets and favourites and, especially, the body of the tweet. It must be mentioned that the tweets were captured as soon as they were posted, so the values of retweets and favourites are 0.

The second block comprises those fields that refer to the geolocation from which the tweet was sent. These fields only have non-zero values if the user has enabled geolocation beforehand. Within this block we find information such as the latitude, longitude, country or city where the tweet was sent, among others.

Finally, each tweet contains extensive information about the user who posted it. This information includes the name of the user, the user's description, followers, friends, number of tweets marked as favourite, number of retweets, account location, language and whether it is a verified account, among others. Additionally, there is a lot of information about the graphic representation of the account, in order to be able to show the user's interface on any device: profile image, reduced profile image, background image, background colour, etc. The full tweet structure is shown in table 4.4. All these fields are analysed in depth both during the descriptive analysis and during the development of the classifiers.

General Tweet fields: createdAt, id, text, inReplytoUserId, inReplytoStatusId, inReplytoScreenName, source, lang, contributorsId, retweetedCount, retweetedStatus, currentUserRetweetedId, scopes, favouritedCount, favourited, withHeldInCountries, polarityTags, truncated, retweeted, possiblySensitive

Geographic Tweet fields: latitude, longitude, placename, streetAddress, countryCode, Id, country, placeType, Url, fullName, containedWithIn, geometryType, geometryCoordinates, boundingBoxCoordinates, boundingBoxType

General User fields: id, name, screenName, location, description, contributorsEnabled, Url, showAllinLineMedia, defaultProfile, createdAt, utcOffset, timeZone, friendsCount, favouritesCount, followersCount, statusesCount, withHeldInCountries, listedCount, geoEnabled, verified, translator, followRequestSent, protected, lang

Graphic User fields: ProfileImageURL, biggerProfileImageURL, miniProfileImageURL, originalProfileImageURL, ProfileImageURLHttps, biggerProfileImageURLHttps, miniProfileImageURLHttps, originalProfileImageURLHttps, defaultProfileImage, profileBackgroundColor, profileTextColor, profileLinkColor, profileSidebarFillColor, profileSidebarBorderColor, profileUserBackgroundImage, profileBackgroundImageURL, profileBackgroundImageURLHttps, profileBannerURL, profileBannerRetinaURL, profileBannerIpadURL, profileBannerIpadRetinaURL, profileBannerMobileURL, profileBannerMobileRetinaURL, profileBackgroundTiled

Table 4.4 – Tweet structure

5 Methodology

A descriptive analysis of the data has been made in order to differentiate between cryptocurrency and financial tweets in homonym cashtags. Its aim is to find the most distinctive features of each type of tweet, which help in the deployment of systems that allow the division of tweets. The features of each type are analysed on two sets of non-interfering data in order to expand the applicability of the classifiers. To be precise, the CNHDS has been used to extract the common features of cryptocurrency tweets, while the FTNHDS and AMNHDS have been used to extract the common features of company tweets. Thus, the characteristics detected for each tweet type are not influenced by tweets of other types due to homonym tickers, and the number of tickers considered is increased. As a result, the detected features can be applied to homonym tickers different from those studied. In particular, the information analysed about each tweet can be grouped into three different blocks.

The first one is the information in the general tweet fields. Within this block, we examine features such as the most common terms for each type of tweet, the most common hashtags or the number of tickers in the tweet. Punctuation symbols, stop words, emoticons and URLs have been removed and the text has been normalized to lower-case in order to extract the most common terms of the tweet in a more representative way. A stemming process has also been applied to each term in the tweet. This processing has been applied only for the extraction of common terms and not to the hashtags, which have only been converted to lower-case.

Secondly, a database with the information regarding the users who have posted each tweet has been created using the general user fields, both for cryptocurrency and company tweets. The most representative features of each type of user have been studied, such as the number of followers, the number of favourites, the date of creation of the account or the edition of the default interface offered by Twitter, among others. On Twitter, each user can provide a short description of themselves or their account. Therefore, the most common terms of these descriptions have been extracted, applying the same processing as the one carried out for the extraction of the most common terms in the tweet body.

The last block of information analysed regarding the tweet is the place and time when it was posted, such as the weekday, time of day or geolocation. In order to obtain this information, the geographic tweet fields and the creation time are used. It should be noted that, although other parameters can be observed within the analysed blocks, only those features that allow differentiating between one type of tweet and the other are discussed, since this is the objective of this work.
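The text processing applied to tweet bodies and user descriptions can be sketched in Python as follows. This is a simplified illustration, not the processing chain actually used: the stop-word list is a small hypothetical sample, non-ASCII characters are dropped as a crude way of removing emoticons, and the suffix stripper is a naive stand-in for a real stemming algorithm, so its output does not always match the stems reported in this document.

```python
import re
import string

# Small hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "and", "for", "on"}

URL_RE = re.compile(r"https?://\S+")
NON_ASCII_RE = re.compile(r"[^\x00-\x7f]")  # crude emoticon/emoji removal
PUNCTUATION_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

# Naive suffix list standing in for a real stemmer (assumption).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(term):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def preprocess(text):
    """Lower-case, remove URLs, emoticons, punctuation and stop words, then stem."""
    text = URL_RE.sub(" ", text.lower())
    text = NON_ASCII_RE.sub(" ", text)
    text = text.translate(PUNCTUATION_TABLE)  # also drops "$" and "#"
    return [naive_stem(term) for term in text.split() if term not in STOP_WORDS]
```

The resulting list of terms can then be counted to obtain the most common terms of each type of tweet or description.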

Based on the information observed during the descriptive analysis, different classifier systems capable of dividing the homonym tweets between those relative to stock companies and those relative to cryptocurrencies have been created and analysed. Specifically, two types of classifiers have been developed, each with different alternatives. First, a set of heuristic filters has been proposed, based on the presence of certain key terms for the detection of each type of tweet, discovered during the descriptive analysis and the manual classification. Two alternatives are presented: one uses only general terms related to cryptocurrencies and the other adds terms related to the specific tickers that are being analysed. The purpose of these heuristics is to identify as many tweets about cryptocurrencies as possible without misclassifying any tweet about a stock company. That is, given the interest in detecting tweets about stock companies, the filters aim to discard as many tweets referring to cryptocurrencies as possible while losing the minimum amount of information. By default, the filter considers that a tweet refers to the stock company behind the cashtag. However, if it contains any of the searched terms, the tweet is marked as related to cryptocurrencies. For this reason, the terms used are only words that identify almost unequivocally that a tweet is about a cryptocurrency.
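The behaviour of such a heuristic filter can be sketched in Python as follows. The term lists are illustrative assumptions based on the terms and cryptocurrency names that appear in this document, not the final lists used by the filters, and the function and variable names are hypothetical.

```python
import re

# Illustrative general terms related to cryptocurrencies (assumption).
GENERAL_CRYPTO_TERMS = {
    "coin", "crypto", "cryptocurrency", "altcoin", "bitcoin", "ethereum", "binance",
}

# Illustrative ticker-specific terms, taken from the names in table 3.1 (assumption).
TICKER_SPECIFIC_TERMS = {
    "XLM": {"stellar"},
    "SKY": {"skycoin"},
}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def heuristic_filter(text, ticker=None, use_specific=False):
    """Return 'cryptocurrency' if any key term is present, 'company' otherwise."""
    tokens = set(TOKEN_RE.findall(text.lower()))
    terms = set(GENERAL_CRYPTO_TERMS)
    if use_specific and ticker is not None:
        terms |= TICKER_SPECIFIC_TERMS.get(ticker.upper(), set())
    # By default the tweet is assumed to refer to the stock company.
    return "cryptocurrency" if tokens & terms else "company"
```

With only general terms, a tweet such as "$XLM Stellar partnership announced" is kept as a company tweet, whereas the ticker-specific variant marks it as a cryptocurrency tweet. This sketch matches whole tokens only; matching on stems would require applying the same preprocessing to the tweet and to the term lists.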

Given the difficulty of capturing some of the patterns seen during the descriptive analysis in a heuristic, supervised classification methods have also been employed. As mentioned above, we have manually classified the type to which each tweet belongs, so it is possible to use techniques such as SVM or logistic regression, which need a previously classified sample. These techniques can easily incorporate the patterns discovered during the previous analysis. Thus, the independent variables used by these classifiers are those fields that provide distinctive features between both types of tweets discovered during the descriptive analysis. Unlike the heuristic filters, these classifiers are optimized to achieve a trade-off between precision and recall; in other words, they try to maintain a compromise between the number of tweets about cryptocurrencies that are marked as stock companies and the number of financial tweets that are mistakenly detected as cryptocurrencies. As for the heuristic filters, different versions of the classifiers are shown, depending on the complexity of the system, the benefits achieved and the scope of use that it may have. Initially, the supervised classifiers employ support vector machines. However, other classifiers based on logistic regression are developed to verify whether the higher computational load of the former is justified or whether those faster methods can provide similar results.
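To illustrate how a supervised method uses the manually classified sample, the following Python sketch trains a logistic regression by batch gradient descent on a toy set of binary features. It is a didactic example under stated assumptions: the two features, the toy labels, the learning rate and the number of epochs are hypothetical, and in practice an established machine learning library would normally be used instead of this hand-written training loop.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples = len(features)
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


def predict(weights, bias, x, threshold=0.5):
    """Return 1 (cryptocurrency) or 0 (company) for a feature vector."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= threshold else 0


# Toy data (hypothetical): [contains_crypto_term, has_many_tickers];
# label 1 = cryptocurrency tweet, label 0 = company tweet.
X = [[1, 1], [1, 0], [0, 1], [0, 0], [0, 0], [1, 1]]
y = [1, 1, 1, 0, 0, 1]
weights, bias = train_logistic_regression(X, y)
```

The decision threshold is the parameter that would be moved to trade precision against recall; 0.5 is only a default choice.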

Once both types of classifiers have been defined, their combined performance has been studied. This combined system uses the results of the heuristic filters as an independent variable of the supervised model. Moreover, the classifiers described so far use some variables adapted to the specific tickers analysed, such as the name of the ticker by which the tweet was captured or some specific terms. For this reason, a system that identifies both types of tweets using only general features of cryptocurrencies and stock companies is also deployed.

Finally, the previous models only use the content of the tweet through certain key terms. They do not take into account the relative importance of each term or the relationships that may exist between them. However, the tweet body is one of the largest sources of information that a tweet has. For this reason, an embedding matrix has been obtained through an LSTM network. The LSTM network, a type of recurrent neural network, is trained to predict the next word in the dataset. As part of this training, the network learns an embedding matrix for the most common terms of the dataset, in this case the 10,000 most common ones, which captures the relative importance of each term and the relationships that exist between them. The result of applying this matrix to each tweet is used as new independent variables for the supervised models, replacing the list of key terms originally employed. Thus, new combined systems, which use the results of the embedding matrix instead of key terms, are proposed.
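The restriction to the most common terms, which precedes the training of the embedding matrix, can be sketched in Python as follows. Only the vocabulary construction and the encoding of tweets as index sequences are shown; the LSTM network itself is not reproduced here. The function names and the convention of reserving index 0 for terms outside the vocabulary are assumptions made for illustration.

```python
from collections import Counter


def build_vocabulary(tokenised_tweets, max_terms=10000):
    """Map the max_terms most common terms to indices starting at 1.

    Index 0 is reserved for terms outside the vocabulary (assumption).
    """
    counts = Counter(term for tweet in tokenised_tweets for term in tweet)
    most_common = counts.most_common(max_terms)
    return {term: index for index, (term, _) in enumerate(most_common, start=1)}


def encode(tweet_tokens, vocabulary):
    """Replace each term by its vocabulary index, or 0 if it is unknown."""
    return [vocabulary.get(term, 0) for term in tweet_tokens]
```

Each index selects one row of the embedding matrix, so an encoded tweet becomes a sequence of vectors that the network, and later the supervised models, can consume.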

The classical measurements for classifiers are used in order to evaluate the performance of each system. Specifically, the precision, recall, specificity, accuracy, F-score and AUC are calculated. Given the objective of the heuristic filters, the recall value is the key in their evaluation, without neglecting the precision value. However, in the case of the supervised classifiers and combined systems, the analysis focuses on the F-score and the AUC, since they combine in a single value both the precision and the recall of the deployed system. In addition to these measurements, the complexity, the estimated useful lifespan and the scope of use of each system are discussed, as well as the tasks necessary to update them.
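The measurements derived from the confusion matrix can be computed as in the following Python sketch. The choice of the positive class is left as a parameter because it is an assumption of this example, and the AUC is omitted since it requires the scores of the classifier rather than its final labels.

```python
def classification_metrics(y_true, y_pred, positive="company"):
    """Compute precision, recall, specificity, accuracy and F-score."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    if precision + recall:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "fscore": fscore,
    }
```

The F-score computed here is the harmonic mean of precision and recall, which is why it is used as a single summary value for the supervised classifiers.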


Figure 5.1 – Block diagram

6 Tweet features

As shown above, when the results that contain a homonym cashtag are collected, results that refer to stock companies and to cryptocurrencies are obtained jointly. Therefore, in order to obtain the results of interest, whether the tweets about a company or those about a cryptocurrency, it is necessary to be able to differentiate one type from the other. In this section, some distinctive features of each type of tweet are shown in order to use them to build classifier systems, as discussed in the following sections. To obtain these features, a descriptive analysis comparing both types of tweets has been made.

To identify the main defining features of each of the main types of tweets, two different data sets were used. The first one comprises the tweets that contain tickers from non-homonym companies of the FTSE-100 and AIM-100, that is, the FTNHDS and AMNHDS; from these tweets the main features of the tweets referring to companies are extracted. The second source of information comprises the tweets that contain the ticker of one of the main cryptocurrencies, such as $BTC for Bitcoin or $ETH for Ethereum; this is the CNHDS described before. This set of tweets is taken as the reference for the tweets related to cryptocurrencies.

This analysis seeks to highlight the differences between one type of message and the other, so only those features that differentiate between both will be discussed. As described in section 5, the information analysed about each tweet is grouped into three blocks: the general tweet fields (most common terms, most common hashtags and number of tickers in the tweet), the general user fields (such as the number of followers, the number of favourites, the date of creation of the account, the edition of the default interface offered by Twitter and the most common terms of the user description) and the place and time when the tweet was posted (weekday, time of day and geolocation). The tweet bodies and user descriptions have been preprocessed as explained in that section, whereas the hashtags have only been converted to lower-case.

6.1 Corpus information

Regarding general tweet information, the most distinctive features of each type are found in the content of the tweet itself. The most common terms differ between one type of tweet and the other, so the presence of certain terms can help to indicate which type a tweet belongs to. Figures 6.1 and 6.2 show the most common terms for each information source.

Figure 6.1 – Cryptocurrency text word cloud

Figure 6.2 – Company text word cloud


In view of the results of the previous figures, terms like coin, crypto, cryptocurrency, binanc, signal, fee or join can be very useful to identify tweets about cryptocurrencies, while terms such as rate, group, inc, plc, finance or company can be used to identify companies. The use of the proper names of the companies and cryptocurrencies deserves particular mention. In both cases, these names are among the most common words, so using them as a differentiating criterion may be interesting. Although many of the most common words differ between one type and the other, a large number of them appear in both, as is the case of terms such as buy, hold, trade, rt, news or price. These terms mostly refer to market operations, which are common to both companies and cryptocurrencies. Given the ambiguity they represent, the use of these common terms as a criterion for differentiation is not advisable, despite their slightly higher percentages of appearance for one type of tweet than for the other.

Another interesting point is the number of tickers that each tweet contains. While tweets referring to companies contain three tickers on average, with a median of a single ticker, the figure is much higher for cryptocurrency tweets, with a mean and median of 18 and 20 tickers per tweet respectively (figure 6.3). However, a few tweets referring to companies still contain a large number of tickers. Therefore, although this criterion can be very useful, it should not be used on its own.

Figure 6.3 – Ticker distribution, LSE(blue) vs Cryptocurrency(black)
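As a minimal sketch of how the number of tickers per tweet could be obtained, the following Python function counts the distinct cashtags in a tweet body. The regular expression (a `$` followed by a letter and then letters or digits) and the function name are assumptions made for illustration; the exact rule used in this work is not stated in the text.

```python
import re

# Assumed cashtag pattern: "$" followed by a letter, then letters or digits.
CASHTAG_PATTERN = re.compile(r"\$[A-Za-z][A-Za-z0-9]*")

def count_distinct_tickers(text):
    """Return the number of different cashtags found in a tweet body."""
    # Upper-case the matches so that $btc and $BTC count as one ticker.
    return len({match.upper() for match in CASHTAG_PATTERN.findall(text)})

# "$100" is ignored because the character after "$" is not a letter.
n_tickers = count_distinct_tickers("$BTC $ETH $btc up 5% to $100")
```

With a count of this kind per tweet, the distributions compared in figure 6.3 can be summarised through their mean and median.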

As with the body of the tweets, the content of the hashtags also differs between one type and the other. In fact, the most representative hashtags of each type are similar to the most common words in the body of the message. The most common hashtags for both types can be seen in figures 6.4 and 6.5. As was the case for common words, terms such as #bitcoin, #ethereum, #cryptocurrency, #altcoin, # or #binance allow us to identify tweets about cryptocurrencies, while hashtags like #ftse, #mkt, #premarket or #earnings allow us to identify tweets about stock companies. As in the previous case, terms like #hold, #buy, #stock or #trading should be used with care, since they appear in both types, although with very different percentages.


Figure 6.4 – Cryptocurrency hashtag word cloud

Figure 6.5 – Company hashtag word cloud

6.2 User information

Regarding the information available per user, the terms in each user's description are quite similar to those seen in the body of the tweets. Words like crypto, bitcoin or join allow us to distinguish tweets that address cryptocurrencies, while words like finance, company, network, bank or institut are more common in tweets about the FTSE and AIM. Moreover, the cryptocurrency descriptions tend to contain less formal and more personal words such as enthusiast, tip, love, person or expert. However, most of the common terms are shared between both types of description and are related to economic activity. Examples are news, invest, stock, market or trader, all of which appear in figures 6.6 and 6.7. As a result, the usefulness of the description for differentiating the type of tweet is limited. Nevertheless, the most identifying terms, as well as the names of the homonym cryptocurrencies, can be used as search criteria over this field.

Figure 6.6 – Cryptocurrency user description word cloud

Figure 6.7 – Company user description word cloud


Other user fields that can be useful are the number of followers and friends of each account; see figures 6.8 and 6.9. For cryptocurrency tweets these numbers tend to be quite small: more than three quarters of users do not exceed 200 followers. In fact, most of them have no more than a couple of followers, probably because they are secondary accounts used to disseminate self-generated tweets. These numbers are higher for tweets about companies, where more than 75% of the users exceed one hundred followers. However, even for cryptocurrency tweets there are a few users with millions of followers, which shows that the use of cashtags for cryptocurrencies is quite widespread, even among specialized entities.

Figure 6.8 – Follower distribution by user, LSE(blue) vs Cryptocurrency(black)

Figure 6.9 – Friend distribution by user, LSE(blue) vs Cryptocurrency(black)

A similar pattern appears in the number of retweets and favourites of the account that writes the tweet, largely because a greater number of followers see the tweet and can retweet it or mark it as a favourite. However, this effect is much less marked than in the previous case, so followers and friends are recommended as the primary differentiating criterion.

Another criterion regarding the user that can be taken into account is the number of verified users who tweet on each topic. Although the percentage of verified users is low in both cases (1% for tweets about companies and 0.1% for tweets about cryptocurrencies), verified accounts are more common among company tweets, largely because the accounts that tweet about FTSE and AIM companies have more followers.

Regarding the type of accounts that usually post each type of tweet, most accounts that tweet about cryptocurrencies have not changed the default profile interface offered by Twitter; only 28% have modified it. This is consistent with the kind of accounts that publish these tweets: non-personal accounts, recently created (as we will see later) and mainly devoted to spreading self-generated messages. For LSE users, the percentage that keeps the default interface is lower, 58%, which suggests that these accounts are somewhat more reliable; see figures 6.10 and 6.11.

Figure 6.10 – Cryptocurrency default profile distribution

Figure 6.11 – Company default profile distribution

The last interesting field related to the account that posts the tweet is the account creation time. While the accounts that tweet about the LSE were created between 2009 and 2017, virtually all the cryptocurrency accounts are recent, from mid-2017 to the present, a period that coincides with the expansion of cryptocurrencies, as can be seen in figure 6.12. This behaviour, consistent with those previously seen, can be very useful to detect tweets referring to companies, especially if the account was created long ago. However, its ability to discriminate among the most recent accounts is limited, so it should be combined with other criteria such as those previously described.

Figure 6.12 – Account creation time distribution, LSE(blue) vs Cryptocurrency(black)


6.3 Tweet time and place

The last criterion to be mentioned is the time when the tweet was posted. Within this field, the most differentiating feature is the number of tweets collected per hour of the day; see figure 6.13. Most LSE tweets are posted between 10:00 and 18:00 GMT, when the stock market is open. Tweets about cryptocurrencies, by contrast, are spread more evenly throughout the day, as cryptocurrency markets have neither a closing time nor a specific geographical area. Regarding the place where the tweet was posted, most accounts have the geolocation option disabled, so no really useful information can be collected from this field.

Figure 6.13 – Tweet time distribution, LSE(blue) vs Cryptocurrency(black)

In summary, although both cryptocurrency and company tweets are collected through homonym cashtags, each type has distinctive features that help to tell them apart. Tweets related to cryptocurrencies tend to contain terms such as "crypto", "coin" or the name of a cryptocurrency, as well as a large number of tickers, while tweets referring to LSE companies incorporate terms such as "financi", "ftse" or "plc", as well as the names of the main stock markets. These terms are used both in the body of the tweet and in its hashtags. However, both types of tweets also share many common terms because of their financial nature. Regarding the user who posts the tweet, cryptocurrency tweets come from very small and recent accounts with few followers, while tweets that deal with LSE companies come from more visible accounts. Nevertheless, a few cryptocurrency tweets and some company tweets are posted by large accounts with millions of followers. In addition, the vocabulary used in the description of cryptocurrency accounts tends to be less formal than that of company accounts, and the representative terms of both types of tweets are also common in the user's description. Finally, while the volume of cryptocurrency tweets tends to stay constant throughout the day, most company tweets are posted when the London exchange is open.


| Feature | Cryptocurrency tweets | LSE tweets |
|---|---|---|
| Tweet body | Terms like crypto, coin, binanc or names of cryptocurrencies | Terms like group, inc, plc, financ or names of markets |
| | Many different tickers in the body | Only a few tickers per body (one or two) |
| User information | Small accounts, with a few followers and friends | Moderate number of followers and friends |
| | Accounts created recently | Accounts created from 2010 to now |
| | Informal description, with words like crypto and coin | Formal description, with words like financi or group |
| | Almost no verified users (0.1%) | A few verified users (1%) |
| Tweet time and place | Similar amount of tweets throughout the day | Most tweets posted when the LSE is open |
| | No geographic information | No geographic information |

Table 6.1 – Tweet main features

7 Application of filtering criteria and results

In the previous chapter we showed the distinctive features of cryptocurrency tweets and company tweets from two blocks of data without interference. In this chapter, based on the information observed during the descriptive analysis, different classifier systems capable of dividing the homonym tweets into those about stock companies and those about cryptocurrencies are created and analysed. Specifically, two types of classifiers have been developed, each with several alternatives. First, a set of heuristic filters is proposed, based on the presence of certain key terms for the detection of each type of tweet, discovered during the descriptive analysis and the manual annotation. Two alternatives are presented: one uses only general terms related to cryptocurrencies, and the other adds terms related to the specific tickers being analysed. The purpose of these heuristics is to detect as many tweets as possible about cryptocurrencies without misclassifying any tweet about a stock company. That is, given the interest in tweets about stock companies, the aim is to discard as many cryptocurrency tweets as possible while losing the minimum amount of information. By default, the filter considers that a tweet refers to the company behind the cashtag. However, if the tweet contains any of the searched terms, it is marked as related to cryptocurrencies. For this reason, the terms used are only words that identify almost unequivocally that a tweet is about a cryptocurrency.

Given the difficulty of capturing some of the patterns seen during the descriptive analysis in a heuristic, supervised classification methods have also been employed. As previously mentioned, we have manually labelled the type to which each tweet belongs, so it is possible to use techniques such as SVM or logistic regression, which need a previously classified sample. These techniques can easily incorporate the patterns discovered during the previous analysis. Thus, the independent variables of these classifiers are those fields that provide distinctive features between both types of tweets. Unlike the heuristic filters, these classifiers are optimized to reach a trade-off between precision and recall; in other words, they try to maintain a compromise between the number of tweets about cryptocurrencies that are marked as stock companies and the number of financial tweets that are mistakenly detected as cryptocurrencies. As for the heuristic filters, different versions of the classifiers are shown, depending on the complexity of the system, the performance achieved and its scope of use. Initially, the supervised classifiers employ support vector machines. However, other classifiers based on logistic regression are also developed to verify whether the higher computational load of the former is justified or whether these faster methods can provide similar results.

Once both types of classifiers have been presented, their combined performance is studied. The combined system uses the results of the heuristic filters as an independent variable of the supervised model. Moreover, the classifiers described above use some variables adapted to the specific tickers analysed, such as the ticker by which the tweet was captured or some specific terms. For this reason, a system that identifies both types of tweets using only general features of cryptocurrencies and stock companies is also deployed.

Finally, the previous models only use the content of the tweet through certain key terms. They do not take into account the relative importance of each term or the relationships that may exist between them. However, the tweet body is one of the largest sources of information that a tweet has. For this reason, an embedding matrix has been obtained through an LSTM network. The LSTM network, a type of recurrent neural network, is trained to predict the next word in the dataset. Among other parameters, it trains an embedding matrix for the most common terms of the dataset, in this case the 10,000 most common ones, which captures the relative importance of each term and the relationships that exist between them. The result of applying this matrix to each tweet is used as a new set of independent variables for the supervised models, replacing the list of key terms originally employed. Thus, new combined systems, which use the results of the embedding matrix instead of key terms, are proposed.

The classical measurements for classifiers are used in order to evaluate the performance of each system. Specifically, the precision, recall, specificity, accuracy, fscore and AUC are calculated. Given the objective of the heuristic filters, the recall value will be the key in their evaluation, without neglecting the precision value. However, in the case of the supervised classifiers and combined systems, the analysis will be focused on the fscore and the AUC, since they combine in a single term both the precision and the recall of the deployed system. In addition to these measurements, the complexity, the estimated useful lifespan and the scope of use of each system will be discussed, as well as the tasks necessary to update them.
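As a minimal sketch of how these measurements relate to each other, the following Python function derives them from the four counts of a confusion matrix, taking company tweets as the positive class (as the "Precision (Company)" rows of the result tables suggest). The function name and the example counts are illustrative assumptions, not values from this study. The AUC is omitted because it requires the classifier scores rather than a single confusion matrix.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the classical measurements from confusion-matrix counts.

    Positive class: company tweets. Negative class: cryptocurrency tweets.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # The fscore is the harmonic mean of precision and recall.
    fscore = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "fscore": fscore,
    }

# Hypothetical counts, chosen only to illustrate the calculation.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```

Because the fscore depends only on precision and recall, it can also be recomputed directly from the precision and recall rows of any of the result tables in this chapter.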


7.1 Heuristic filters

First, a heuristic filter based on the presence of certain key terms in the body of the tweet tries to detect as many tweets about cryptocurrencies as possible while misclassifying as few company tweets as possible. To minimize the number of misclassified company tweets, the terms used are only those that make it possible to determine almost unmistakably that a tweet is about a cryptocurrency, such as cryptocurrency, lumen, ethereum, bitcoin or stellar. Likewise, a list of cryptocurrencies whose acronyms do not coincide with those of other companies is also used. This way, tweets that contain any of these terms are marked as cryptocurrency tweets, while the others are considered company tweets.

As can be seen in table 7.1, 98.0% of the cryptocurrency tweets are correctly detected (specificity), while the recall for company tweets remains at 93.2%. As a consequence, the precision obtained (55.1%), although not very high, is much higher than that of the null model (2.7%). This makes the filter a good option for discarding a large number of tweets about cryptocurrencies while losing only a limited fraction of tweets about companies.

If we analyse the terms used (tables 7.2 and 7.3), we can see that they are all specific names of the main current cryptocurrencies or words that refer to them, as is the case of blockchain or binance. Therefore, the performance of the filter should hold in the medium term and decline gradually as the popular cryptocurrencies change. To avoid this, the list of cryptocurrencies should be updated periodically. As it uses a fixed list of cryptocurrencies, the filter should obtain similar results when working with tickers other than those studied.

Although the precision and recall values obtained are significantly better than those of the null model, more than a thousand company tweets are misclassified, which departs from the initial objective of the filter: to achieve a practically perfect recall. Although all the considered terms refer directly to cryptocurrencies, some company tweets mention cryptocurrencies even when the captured ticker does not refer to one. This happens for $BRK, where various tweets refer to Berkshire Hathaway while talking about cryptocurrencies, and it is the cause of most of the failures of the heuristic. To avoid this and improve the performance of the filter, it has been extended with a series of terms that depend on the ticker considered. This way, if for example the tweet contains the ticker $NXT and terms like Ignis or Ardor (elements related to the crypto platform), the tweet is classified as a cryptocurrency tweet. However, if the ticker is $BRK and the tweet contains words like Berkshire or Brookline, it is marked as a company tweet. These specific criteria take priority over the general ones: if the two disagree, the label given by the ticker-specific terms is used. The results of this filtering system can be seen in table 7.1.
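The logic of the basic and extended filters can be sketched in Python as follows. This is only an illustration under several assumptions: terms are matched as lower-cased substrings of the tweet body, only two tickers from table 7.3 are included, the list of cryptocurrency codes from table 7.2 is left out, and company-specific terms are checked before cryptocurrency-specific ones. The names `GENERAL_CRYPTO_TERMS`, `TICKER_TERMS` and `classify` are hypothetical.

```python
# General cryptocurrency terms, taken from table 7.3.
GENERAL_CRYPTO_TERMS = ["coin", "crypt", "btc", "lumen", "ethereum",
                        "bitcoin", "whale", "stellar", "binanc", "blockchain"]

# Subset of the ticker-specific terms in table 7.3 (illustrative only).
TICKER_TERMS = {
    "$NXT": {"company": ["plc"], "crypto": ["ignis", "ardor", "jelurida"]},
    "$BRK": {"company": ["berkshire", "brookline"], "crypto": []},
}

def classify(text, ticker, extended=True):
    """Label a tweet as 'company' or 'crypto' using key terms."""
    lowered = text.lower()
    # Extended filter: ticker-specific terms take priority over general ones.
    if extended and ticker in TICKER_TERMS:
        specific = TICKER_TERMS[ticker]
        if any(term in lowered for term in specific["company"]):
            return "company"
        if any(term in lowered for term in specific["crypto"]):
            return "crypto"
    # Basic filter: any general cryptocurrency term marks the tweet as crypto.
    if any(term in lowered for term in GENERAL_CRYPTO_TERMS):
        return "crypto"
    # Default: the tweet refers to the company behind the cashtag.
    return "company"

tweet = "$BRK Berkshire Hathaway chief comments on bitcoin"
basic_label = classify(tweet, "$BRK", extended=False)
extended_label = classify(tweet, "$BRK", extended=True)
```

In this example the basic filter labels the tweet as a cryptocurrency tweet because of the word bitcoin, whereas the extended filter keeps it as a company tweet, which is the kind of correction described for $BRK.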


| Measurement | Basic system | Extended system |
|---|---|---|
| Precision (Company) | 0.551 | 0.609 |
| Recall | 0.932 | 0.999 |
| Specificity | 0.980 | 0.983 |
| Accuracy | 0.978 | 0.983 |
| FMeasure | 0.692 | 0.757 |

Table 7.1 – Wordbase heuristic filter measurements

The results of the extended filter are significantly higher than those of the basic filter. The recall of the system has increased to 99.9% and only seventeen company tweets are misclassified, an amount more in line with what was sought for this type of filters. In addition, the accuracy of the system has also increased slightly thanks to specific knowledge for each ticker.

However, this new filter is limited to the tickers analysed and cannot be used for other cases of interference between a company and a cryptocurrency, since it relies on information specific to each company. The complete list of terms used in each filter can be seen in tables 7.2 and 7.3.

Cryptocurrency codes (Word filter):

$SNT, $ADA, $MTH, $ADX, $LSK, $DSR, $ARK, $CLOAK, $TKN, $DLC, $DCR, $KMD, $IQT, $ZCL, $DCY, $ALIS, $RBY, $SYS, $EXP, $BCY, $VEN, $BCN, $BLITZ, $UGT, $GVT, $MONA, $QASH, $DASH, $AUR, $UNO, $BURST, $REQ, $PART, $TRIG, $GCR, $LMC, $XEM, $BNB, $SNGLS, $BITSILVER, $PDC, $ELIX, $XVG, $DOPE, $LEND, $SNRG, $NLG, $ARDR, $QSP, $SALT, $SYNX, $GRC, $XDN, $PIVX, $DCT, $WAVES, $PTOY, $SIB, $LTC, $CPC, $NAS, $XMR, $LOCI, $ION, $VSX, $NXS, $XMY, $GBYTE, $XMG, $IGNIS, $ETP, $BWK, $FCT, $DRGN, $MUE, $XPM, $STEEM, $FTC, $SPHR, $DGB, $DGD, $SUB, $VOX, $MAID, $RPX, $AEON, $XAUR, $MIOTA, $CRC, $BET, $ENG, $XVJ, $POWR, $STORJ, $GUP, $UBQ, $SBD, $INFX, $LGD, $DYN, $INFR, $ONION, $MANA, $SLR, $FUN, $CURE, $BITB, $EMC2, $XZC, $IOTA, $COVAL, $AGRS, $PASC, $DOGE, $XRB, $SWT, $FLDC, $ZEC, $NBT, $XRP, $ETH, $RADS, $ETC, $PANGEA, $CLAM, $PHR, $APX, $BTC, $NEM, $NEO, $MYST, $START, $ENJ, $WTC, $PPT, $STR, $ARDOR, $ITZ, $BCPT, $ITC, $TAAS, $STRAT, $SEQ, $EDG

Table 7.2 – Cryptocurrencies used (Heuristic word filter)


| Reference ticker | Word list |
|---|---|
| General Cryptocurrencies | coin, crypt, btc, lumen, ethereum, bitcoin, whale, stellar, binanc, blockchain |
| $NXT(LSE) | plc |
| $NXT(Crypto) | ignis, ardor, jelurida |
| $XLM(LSE) | xlmedia |
| $XLM(Crypto) | rocket, moon, $str, worth, now, trx |
| $CRW(LSE) | craneware |
| $APH(LSE) | weed, fire, emc, cannabis, medical, amphenol, aphria, $app, $acb |
| $BRK(LSE) | amz, aapl, twtr, berkshire, buffet, warren, brookline, brooks, oil |
| $SKY(LSE) | skyline, fox |
| $GBG(LSE) | plc, group |
| $AMS(LSE) | hospital, medical |

Table 7.3 – Words used (Heuristic word filter)

7.2 SVM classifiers

Although the heuristic filters successfully detect a large number of tweets about cryptocurrencies, adapting some of the patterns seen during the descriptive analysis to this type of technique is complex. Thus, the second type of proposed system tries to split both types of tweets through supervised methods, specifically support vector machines, using the useful information from each tweet discovered during the analysis as independent variables. Unlike the previous filters, these classifiers seek a compromise between precision and recall: they try to achieve significant improvements in precision at the expense of incorrectly classifying some company tweets. Therefore, the fundamental measurement used to evaluate these classifiers is the fscore, which allows a clear comparison of the performance of the different classifiers deployed.

The FTNHDS and AMNHDS have been manually classified in order to design these models. In addition, this dataset has been divided into three subsets whose elements have been chosen randomly according to the following distribution. The first one, called trainset, consists of 70% of the tweets in the dataset and is used to train the deployed classifiers. The second is the testset, which consists of the tweets that are not in the trainset, that is, the remaining 30% of the dataset, and is used to measure the performance of the models. Finally, 10% of the trainset forms the tuneset, which is used to adjust the configuration parameters of the classifiers.
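A minimal sketch of such a random split in Python is shown below. The function name, the fixed seed and the rounding of subset sizes are assumptions; the text also does not state whether the tuneset tweets remain available for training, so here the tuneset is simply sampled from the trainset.

```python
import random

def split_dataset(items, train_fraction=0.7, tune_fraction=0.1, seed=0):
    """Randomly split items into trainset, testset and tuneset."""
    rng = random.Random(seed)  # fixed seed only for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = int(round(train_fraction * len(shuffled)))
    trainset = shuffled[:n_train]   # 70% of the dataset
    testset = shuffled[n_train:]    # remaining 30%
    n_tune = int(round(tune_fraction * len(trainset)))
    tuneset = rng.sample(trainset, n_tune)  # 10% of the trainset
    return trainset, testset, tuneset

# Integers stand in for the labelled tweets.
trainset, testset, tuneset = split_dataset(range(1000))
```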

The first result that should be highlighted is the SVM classifier that uses the differentiating features observed during the comparison of both types of tweets as independent variables.


Within this set of variables, the use of the date of the tweet has been discarded to extend the options of the filter and not restrict it to the period studied. Thus, the list of variables used can be seen in table 7.4.

| Variable | Type | Description |
|---|---|---|
| Ticker | Factor | Tickers of the different companies |
| Weekday | Integer | Day of the week when the tweet was posted (from 0 to 4) |
| Hour | Integer | Hour of the day when the tweet was posted |
| Followers | Numeric | Log10 account followers |
| Friends | Numeric | Log10 account friends |
| Favourites | Numeric | Log2 account favourites |
| Dollars | Numeric | Log2 number of different tickers in the tweet |
| DefaultProfile | Logical | True if the account has not changed the default account interface |
| AccountCreationTime | Factor | Moment when the account was created (divided in half years) |

Table 7.4 – Independent variables (Basic svm classifier)
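The construction of these variables from a raw tweet could look like the following Python sketch. The input field names (`followers`, `n_tickers`, `account_created_at` and so on) are hypothetical, and adding one before taking logarithms is an assumption made here to handle accounts with zero followers, friends or favourites; the text does not say how zeros were treated. Python's `weekday()` returns 0 for Monday up to 6 for Sunday, which may differ from the 0 to 4 coding in the table.

```python
import math
from datetime import datetime

def basic_features(tweet):
    """Build the independent variables of table 7.4 from one tweet record."""
    created = tweet["created_at"]
    account_created = tweet["account_created_at"]
    half = 1 if account_created.month <= 6 else 2
    return {
        "Ticker": tweet["ticker"],
        "Weekday": created.weekday(),
        "Hour": created.hour,
        # The +1 offsets are an assumption to avoid log(0).
        "Followers": math.log10(tweet["followers"] + 1),
        "Friends": math.log10(tweet["friends"] + 1),
        "Favourites": math.log2(tweet["favourites"] + 1),
        "Dollars": math.log2(tweet["n_tickers"] + 1),
        "DefaultProfile": tweet["default_profile"],
        # Account creation time grouped in half years, e.g. "2017H2".
        "AccountCreationTime": f"{account_created.year}H{half}",
    }

# Hypothetical tweet record used only to show the transformation.
example = basic_features({
    "ticker": "$NXT",
    "created_at": datetime(2018, 1, 15, 14, 30),
    "account_created_at": datetime(2017, 9, 1),
    "followers": 99,
    "friends": 9,
    "favourites": 7,
    "n_tickers": 3,
    "default_profile": True,
})
```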

The results in table 7.6 show that, compared to the filters previously presented, the precision values obtained are significantly higher, reaching values close to 90% with only a small reduction in recall. In addition, the variables used in this classifier are largely independent of temporal variations, which significantly extends the useful life of the classifier. Only the account creation time has a clear temporal component, but its use is based on distinguishing the accounts created before the irruption of cryptocurrencies from those created after it. Therefore, the performance of the classifier should remain stable in the medium term.

In order to apply this system to other cryptocurrencies, it must be considered that one of the independent variables of the model is the ticker by which the tweet was collected, that is, the company to which it refers. Therefore, if the system is used with other cashtags, either an equivalent model that uses the new tickers should be developed or a slight degradation in performance should be expected. A model applicable to situations different from those considered here is presented later.

Although the results obtained with the previous model are quite satisfactory, it does not use information about the content of the tweet. In this second classifier, a list of words of interest has been created from the most representative terms observed during the descriptive analysis. Additionally, certain key words for the differentiation between both types, detected during the manual classification, have been added to improve the performance. Based on these terms, a vocabulary has been created and applied to the tweets, expanding the available information and providing new independent variables for the model, one for each word considered. Each of these variables is 1 if the word is in the tweet, no matter how many times, and 0 otherwise. The complete vocabulary can be seen in table 7.5.

Words used (Extended SVM classifier):

Binac, Bitcoin, Signal, Join, Crypto, Fee, Plc, Inc, Group, Company, Finance, Weed, Aapl, Moon, Cannabis, Berkshire, Brooks, Ltc, Eth, Dash, Xrp, Xmr, Xem, Nem, Rocket, Jelurida, Ignis, Medical, Buffet, Warren, Stellar

Table 7.5 – Vocabulary Extended SVM classifier
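The binary variables derived from this vocabulary can be sketched in Python as follows. The vocabulary is copied from table 7.5 in lower case; the matching rule (exact match against lower-cased alphanumeric tokens, with no stemming) and the function name are assumptions, since the text does not detail how the words were matched.

```python
import re

# Vocabulary of table 7.5, lower-cased.
VOCABULARY = ["binac", "bitcoin", "signal", "join", "crypto", "fee", "plc",
              "inc", "group", "company", "finance", "weed", "aapl", "moon",
              "cannabis", "berkshire", "brooks", "ltc", "eth", "dash", "xrp",
              "xmr", "xem", "nem", "rocket", "jelurida", "ignis", "medical",
              "buffet", "warren", "stellar"]

def vocabulary_indicators(text, vocabulary=VOCABULARY):
    """Return one 0/1 variable per vocabulary word for a tweet body."""
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    # 1 if the word appears at least once, regardless of how many times.
    return [1 if word in tokens else 0 for word in vocabulary]

indicators = vocabulary_indicators(
    "Join our crypto signal group, bitcoin bitcoin to the moon")
```

Here the word bitcoin appears twice in the tweet but still contributes a single 1, in line with the definition of these variables.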

Table 7.6 shows a slight improvement in all measurements, especially in precision and fscore, which rise from about 0.90 to 0.94. Especially noteworthy is the AUC, with a value practically equal to 1 even in the testset. This shows the quality of the results obtained and allows working points other than the one shown in table 7.6. For example, this classifier can provide precision values greater than 95% while maintaining a recall higher than 90%. In terms of the useful life of this model and its applicability to cases different from those considered, the addition of terms related to the content of the tweet should not reduce the useful life of the classifier, since these words refer to cryptocurrencies and companies and not to a specific temporary situation, so the results should hold in the medium term. Likewise, the terms used are, for the most part, general and do not refer to any specific cryptocurrency analysed. However, a few contain information related to a specific company or cryptocurrency, as is the case of ARDOR. Therefore, if the classifier is applied to other similar situations, its performance could decline. One detail to consider when comparing it to the previous model is that both the execution and, especially, the training of the model are slightly slower than in the previous case, since this classifier has to apply the vocabulary to the set of tweets of interest and the number of support vectors used is higher.

| Measurement | Basic system | Extended system |
|---|---|---|
| Precision (Company) | 0.898 | 0.941 |
| Recall | 0.897 | 0.935 |
| Specificity | 0.997 | 0.998 |
| Accuracy | 0.995 | 0.997 |
| FMeasure | 0.898 | 0.938 |
| AUC | 0.977 | 0.997 |

Table 7.6 – SVM classifier measurements


Figure 7.1 – AUC Basic SVM classifier

Figure 7.2 – AUC Extended SVM classifier

7.3 Combined systems

In view of the results seen so far, the heuristic filters and the SVM classifiers can be used together to improve the performance obtained. Specifically, to build this system, the output of the extended word filter has been introduced as a new independent variable of the extended SVM model.

Given the high recall of the heuristic filter, it will allow a large number of cryptocurrency results to be discarded quickly. So the SVM can focus more on the precision and, therefore, the final system shows an improvement in both metrics. This can be seen in the results of the joint model shown in table 7.7.
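As a brief illustration of this combination, the output of the heuristic filter can be appended to the feature vector of the SVM as one more variable. In the Python sketch below, the 0/1 encoding of the heuristic label and the function name are assumptions, and the example vector is hypothetical.

```python
def combined_features(svm_features, heuristic_label):
    """Append the heuristic filter output to the SVM feature vector."""
    # Assumed encoding: 1 if the filter marked the tweet as cryptocurrency.
    crypto_flag = 1 if heuristic_label == "crypto" else 0
    return list(svm_features) + [crypto_flag]

# Hypothetical feature vector extended with the heuristic result.
features = combined_features([2.0, 1.0, 0, 1], "crypto")
```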

A notable improvement in all the measurements can be seen, with precision, recall and fscore values close to 0.97 in the testset. Even the AUC improves slightly despite its already high value. In view of these results, virtually all of the tweets are correctly classified, with only a small fraction of cases missed. In addition, the working point of the system can be adjusted to obtain slightly higher values of precision or recall depending on the needs.

From the point of view of useful life and applicability, the system behaves as described for the two previous ones. Given the features of the variables used, the results should remain stable in the medium term, but the performance will be slightly lower if the system is applied to other coincident tickers not included here. Finally, in terms of execution time, it is slower than the previous systems, since it requires running both of them consecutively.


| Measurement | Value |
|---|---|
| Precision (Company) | 0.976 |
| Recall | 0.968 |
| Specificity | 0.999 |
| Accuracy | 0.999 |
| FMeasure | 0.972 |
| AUC | 0.9994 |

Table 7.7 – Combined SVM classifier

Figure 7.3 – AUC Combined SVM classifier

Although the results obtained for the different classifiers are satisfactory, their applicability to other situations of homonym cashtags is limited, as has been mentioned, given the features of some of the variables used. Therefore, the results of a model that uses only general information are shown in table 7.8. To generate this classifier, the basic filter was used instead of the extended heuristic filter, which contains words related to specific cashtags.

In addition, both the captured ticker and the terms related to specific tickers have been removed from the extended SVM classifier. The full list of variables and the vocabulary used for the SVM filter can be seen in tables 7.9 and 7.10. This yields a model that is easily applicable to other situations of interference similar to those studied, with a performance similar to the one obtained with the testset.

| Measurement | Value |
|---|---|
| Precision (Company) | 0.933 |
| Recall | 0.855 |
| Specificity | 0.998 |
| Accuracy | 0.995 |
| FMeasure | 0.892 |
| AUC | 0.988 |

Table 7.8 – Independent SVM classifier

Figure 7.4 – AUC Independent SVM classifier


| Variable | Type | Description |
|---|---|---|
| Weekday | Integer | Day of the week when the tweet was posted (from 0 to 4) |
| Followers | Numeric | Log10 account followers |
| Friends | Numeric | Log10 account friends |
| Favourites | Numeric | Log2 account favourites |
| Dollars | Numeric | Log2 number of different tickers in the tweet |
| DefaultProfile | Logical | True if the account has not changed the default account interface |
| AccountCreationTime | Factor | Moment when the account was created (divided in half years) |

Table 7.9 – Independent variables (Independent SVM filter)

Words used (Extended SVM classifier):

Binac, Bitcoin, Signal, Join, Crypto, Fee, Plc, Inc, Group, Company, Finance, Aapl, Moon, Ltc, Eth, Dash, Xrp, Xmr, Xem, Nem, Rocket

Table 7.10 – Vocabulary independent SVM classifier

The results show a slight fall in the measurements, which is especially noticeable in the recall. However, the precision of the classifier remains high, exceeding 90%, and the accuracy is greater than 99%. In addition, the area under the curve remains high, which allows other working points to be chosen that favour precision or recall depending on the desired features. As with the previous combined classifier, it does not use variables with high temporal variability, so the results obtained should hold in the medium term.

7.4 LSTM classifiers

The above classifiers use a list of key terms to process the tweet content. However, these variables do not consider the relative importance of each term or the relationships that may exist between them. For this reason, the aforementioned combined systems have been adapted to use, instead of this list of key terms, an embedding matrix that captures both aspects. It also allows a greater number of terms to be considered in the vocabulary without an excessive increase in the number of independent variables.


The FTHDS and AMHDS have been fed to an LSTM network, a type of recurrent network, to generate the embedding matrix. In particular, this network tries to predict the next word of the dataset from the previous words. Although ideally all of the preceding terms should be considered, only a finite number of them is used in order to keep the problem computationally tractable. In order to predict the next word, the network generates a matrix that better represents the relationships between the different terms and the weight of each of them. This matrix is trained, together with the network, in each iteration.

To facilitate the processing of the tweet body by the neural network and the supervised methods, each tweet has been represented as a vector. A vocabulary made up of the 10,000 most common terms within the homonymous tweets has been used to carry out this transformation. Before creating this vocabulary, the text of each tweet has been preprocessed. In particular, non-standard characters, punctuation, emoticons, URLs and stop words have been removed. Likewise, the tickers and names of the analysed cashtags have not been considered either. Finally, each term has been stemmed. This processing increases the representative capacity of the terms used and avoids the use of very common terms. The vocabulary consists of the 9,998 most common terms after this processing, plus one term for those terms not included and another for the line break.
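The preprocessing and vocabulary construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the stop-word list, the `naive_stem` suffix stripper (standing in for a real stemmer), the reserved token names `<unk>` and `<eol>`, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (the real list is not given in the text).
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for", "on"}
# Assumed names for the two reserved entries: unknown term and line break.
UNK, EOL = "<unk>", "<eol>"


def naive_stem(word):
    """Crude suffix stripping, used here only as a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, excluded=()):
    """Lower-case, drop URLs and non-letter characters, remove stop words and
    excluded tickers/names, then stem the remaining terms."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)  # also discards digits, punctuation, emoticons
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS and t not in excluded]


def build_vocabulary(tweets, size, excluded=()):
    """Keep the (size - 2) most common terms plus the two reserved entries."""
    counts = Counter(tok for tweet in tweets for tok in preprocess(tweet, excluded))
    terms = [term for term, _ in counts.most_common(size - 2)]
    return {term: idx for idx, term in enumerate([UNK, EOL] + terms)}


def encode(text, vocab, excluded=()):
    """Map a tweet to vocabulary indices, ending with the line-break entry."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in preprocess(text, excluded)]
    return ids + [vocab[EOL]]
```

With `size=10000` this reproduces the 9,998 + 2 split described in the text; the `excluded` argument is where the tickers and company names of the analysed cashtags would be passed.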

Once the LSTM network is trained, the resulting matrix is used, together with the vectors previously created for each tweet, to generate the new independent variables: each of these vectors is multiplied by the aforementioned matrix. As a result, in addition to a significant reduction in the number of variables (from 10,000 to 200), a better representation of the information contained in each tweet is obtained. Thus, each tweet is represented by a vector of 200 variables, which are used as independent variables for the SVM.
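The projection step can be illustrated with a short sketch. It assumes that each tweet vector is a term-count vector over the vocabulary (the text only states that each tweet is represented as a vector), and it uses a toy 3 x 2 matrix in place of the 10,000 x 200 embedding matrix; the function names are hypothetical.

```python
def bag_of_words(term_ids, vocab_size):
    """Assumed tweet representation: how many times each vocabulary term occurs."""
    vector = [0] * vocab_size
    for idx in term_ids:
        vector[idx] += 1
    return vector


def project(tweet_vector, embedding):
    """Multiply a row vector of length V by a V x D matrix, giving D variables."""
    dims = len(embedding[0])
    return [
        sum(tweet_vector[v] * embedding[v][d] for v in range(len(tweet_vector)))
        for d in range(dims)
    ]


# Toy embedding matrix: 3 vocabulary terms, 2 embedding dimensions.
toy_embedding = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
features = project(bag_of_words([0, 2, 2], vocab_size=3), toy_embedding)
```

Under this assumption, multiplying the count vector by the matrix is the same as summing the embedding rows of the terms that appear in the tweet, so the result has one value per embedding dimension regardless of the vocabulary size.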

The combined classifiers previously proposed are developed again, but using the result of the embedding matrix instead of the list of key terms. As shown in table 7.11, there are no major changes in the performance of the combined classifier, only a very slight improvement. Therefore, the LSTM network does not seem really useful in this case compared to the previous classifier, since the additional computational load that it requires does not provide a significant improvement in performance. However, this was expected given the already high performance of the combined system.

Regarding the limitations and applicability of this model, they are the same as in the previous case. This classifier should register a small drop in its performance if it is used outside the companies studied, and should maintain its performance in the medium term, since it does not use any variable with a clear temporal variation.


Measurements
Precision (Company)    0.981
Recall                 0.969
Specificity            0.9995
Accuracy               0.999
FMeasure               0.975
AUC                    0.9990

Table 7.11 – LSTM SVM classifier

Figure 7.5 – AUC LSTM SVM classifier

Unlike the LSTM classifier, the independent LSTM classifier provides large improvements compared to the same model with key terms. These improvements are especially noticeable in the precision, recall and fmeasure of the system, all of which surpass 0.92, compared to the 0.855 of the previous independent classifier. The significantly larger vocabulary increases the representative capacity and compensates for the reduction in the other fields. In fact, the performance obtained is similar to that of the LSTM system, which classifies virtually all tweets correctly. For this reason, this model would be advisable for processing homonym tickers different from those studied, although at the cost of a computational overhead due to the large size of the vocabulary, the embedding matrix and the support vectors. As with the previous classifiers, it does not use variables with high temporal variability, so the results obtained should be maintained in the medium term. To maintain its performance in the long term, it would be necessary to update the embedding matrix every few months to adapt to changes in the terms used. However, given the large size of the vocabulary, most of the terms should not change, so the performance of the system should degrade more slowly than that of the independent classifier.

Measurements
Precision (Company)    0.967
Recall                 0.928
Specificity            0.999
Accuracy               0.997
FMeasure               0.947
AUC                    0.992

Table 7.12 – Independent LSTM SVM classifier

Figure 7.6 – AUC independent LSTM SVM classifier


7.5 Logistic regression systems

Despite the good results obtained with an SVM, the execution and especially the training of these models can be slow. Therefore, other faster and simpler systems have been proposed to see whether similar results are obtained or whether the complexity of the previously used models is justified. Table 7.13 shows the results obtained for the different situations previously analysed when logistic regression is applied instead of an SVM. It should be mentioned that, because of the high computational cost of the LSTM network, logistic regression is not used in the LSTM model.

Measurements (by system)   Basic    Extended   Combined   Independent
Precision (Company)        0.816    0.914      0.950      0.871
Recall                     0.807    0.872      0.960      0.801
Specificity                0.995    0.998      0.999      0.997
Accuracy                   0.990    0.994      0.998      0.991
FMeasure                   0.812    0.892      0.955      0.835
AUC                        0.977    0.993      0.9997     0.986

Table 7.13 – Logistic regression systems measurements

As can be seen, although the performance of all the models is lower than that of the SVM, the fall is not very significant, and it becomes smaller as the generated system becomes more complex. In fact, the performance of both supervised techniques for the combined model is practically the same. However, the execution time of these classifiers is significantly lower than that of the previous ones, being more than five times faster. Especially noteworthy is the basic logistic regression classifier: since it does not require processing the text of the tweet, it can be trained and used in a few minutes. Regarding the limitations of the different models, they maintain the same restrictions as the previous ones, since they use the same set of independent variables.

7.6 Conclusions and limitations

In summary, different alternatives have been presented to separate both types of tweets, each one oriented to a specific use. Thus, heuristic filters based on words seek to discard a large number of cryptocurrency tweets while missing practically no company tweets, so they achieve high recall values with acceptable levels of precision. On the other hand, classifiers based on supervised methods provide compromise solutions between precision and recall, maximizing the fscore. Within each type, different alternatives have been presented depending on the quality of the results to be achieved and the computational load associated with them. High-quality results have been obtained for the more complex and expensive models.


In addition, we have analysed the combined action of both types of systems and the results that this approach offers, reaching AUC values very close to one and an fscore higher than 0.975. Finally, classifiers able to identify both types of tweets without using information related to any of the studied cryptocurrencies are shown. These models, despite a small decrease in the measurements, still have high levels of precision and recall, especially if they use an embedding matrix instead of selected key terms. This performance shows that they can be used in situations different from those studied. It is up to the user to choose and adapt the operating point of the model that best suits the criteria sought. However, it is advisable to use the extended logistic regression classifier first, in order to obtain a quick initial estimate of the performance that can be achieved.

Regarding the limitations of the developed models, two groups must be differentiated. In the first place, there are those classifiers that use information referring to some of the analysed tickers, such as the extended SVM, the combined classifier, the extended heuristic word filter or the LSTM classifier. They use parameters such as the company’s ticker or certain key terms related to some of the cashtags used in order to improve the separation. This means that their performance for cashtags different from those analysed is lower. So they are ideal for working with the LSE companies studied (FTSE-100 and AIM-100), but their performance falls outside them.

However, other systems that do not use any type of information regarding the cashtags studied have also been deployed. These classifiers keep their performance outside the analysed tickers. Some examples are the independent classifiers, the independent LSTM classifier and the basic word-based heuristic filter. For both the adapted and the independent classifiers, the performance of the models should be maintained in the medium term, since no information with a marked temporal nature is used, except the account creation time. This field, however, is mostly used to differentiate between accounts created before and after the popularization of cryptocurrencies, so it should continue to work correctly. On the other hand, the most popular cryptocurrencies may vary over time, so the list of cryptocurrency tickers should be updated every few months to maintain the performance of the system. Likewise, the most common terms of the tweets may also vary, so it is advisable to update the embedding matrix every few months. The other parameters considered should maintain a regular behaviour, at least in the medium term.

8 Conclusions and future lines

There is a general consensus about the good sensing and original characteristics of Twitter as an information medium for complex financial markets. Analysis establishes Twitter as a relevant feeder for taking decisions regarding the financial market and even for detecting fraudulent activities in that market. One of the main mechanisms used in Twitter to track financial tweets is the cashtag. However, in recent months the emergence of cryptocurrencies has degraded the quality of the information obtained through this mechanism. This is due to the fact that a few of them have tickers that are homonyms of those of some companies in the main markets, which means that, when using the cashtag, results referring both to stock companies and to cryptocurrencies are obtained. In addition, most cryptocurrency tweets are automatically generated spam messages. All this greatly degrades the informative capacity that the cashtag is meant to provide. So new disambiguation mechanisms, or the adaptation of existing ones, are necessary to restore it.

Thus, while most current research focuses on the potential of Twitter as a source of predictive information and decision making in financial markets, and on the development of expert systems capable of using such information, this work focuses on a completely different objective: to illustrate the effect of the recent popularity of cryptocurrencies on the tweets collected through the cashtags of LSE companies, and to deploy systems capable of differentiating both types of tweets, related to stock companies or to cryptocurrencies, so that the information contained in them can be used, restoring the informative capacity that the cashtag initially offered.

Although both cryptocurrency and company tweets are collected through homonym cashtags, each of these types has distinctive features that help to differentiate them. Tweets related to cryptocurrencies tend to contain terms such as "crypto", "coin" or the name of a cryptocurrency, as well as a large number of tickers, while tweets referring to LSE companies incorporate terms such as "financi", "ftse" or "plc", as well as the names of the main stock markets. These terms are used both in the body of the tweet and in its hashtags. However, both types of tweets also share many common terms because of their financial nature. Regarding the user who posts the tweet, cryptocurrency tweets tend to be posted from very small and recent accounts with few followers, while those that deal with LSE companies are posted from more visible accounts. Nevertheless, a few cryptocurrency tweets and some company tweets are posted by large accounts with millions of followers. In addition, the vocabulary used in the description of cryptocurrency accounts tends to be less formal than that of company accounts. Moreover, the representative terms of both types of tweets are also common in the user’s description. Finally, while the volume of cryptocurrency tweets tends to stay constant throughout the day, most company tweets are posted when the London exchange is open.

Based on these criteria, different alternatives have been presented to separate both types of tweets, each one oriented to a specific use. Thus, heuristic filters based on words seek to discard a large number of cryptocurrency tweets while missing practically no company tweets, so they achieve high recall values with acceptable levels of precision. On the other hand, classifiers based on supervised methods provide compromise solutions between precision and recall, maximizing the fscore. Within each type, different alternatives have been presented depending on the quality of the results to be achieved and the computational load associated with them. High-quality results have been obtained for the more complex and expensive models.

In addition, we have analysed the combined action of both types of systems and the results that this approach offers, reaching AUC values very close to one and an fscore higher than 0.975. Finally, classifiers able to identify both types of tweets without using information related to any of the studied cryptocurrencies are shown. These models, despite a small decrease in the measurements, still have high levels of precision and recall, especially if they use an embedding matrix instead of selected key terms. This performance shows that they can be used in situations different from those studied.

It is up to the user to choose and adapt the operating point of the model that best suits the criteria sought. However, it is advisable to use the extended logistic regression classifier first, in order to obtain a quick initial estimate of the performance that can be achieved.

In this work, the influence of cryptocurrency tweets on the cashtag results is analysed for the main LSE companies. However, homonym tickers between cryptocurrencies and stock companies also occur in other markets such as the NYSE or NASDAQ. We are currently looking for similar situations in markets other than those studied, testing the performance of the independent classifiers and adapting the other classifier systems to these new cases.

Finally, although during the study period, from July 1, 2017 to February 15, 2018, the interference between cryptocurrencies and the tickers of financial companies only happened for the cashtags indicated in table 3.1, the situation has recently extended to other cases. In particular, the new conflicting tickers are: $SPH (Sinclair Pharma (AIM-100) vs Sphere (coin)), $REDD (Redde (AIM-100) vs Reddcoin) and $SMT (Scottish Mortgage Investment Trust plc (FTSE-100) vs SmartMesh (coin)). The popularity of all these new cryptocurrencies increased markedly before the study period. Therefore, we are also currently testing the performance of the independent classifiers in these new cases and adapting the existing ones to achieve a good separation for these tickers.


9 Appendix I: Classifiers

9.1 Supervised Methods

A support vector machine (SVM) is a supervised method that classifies data into different sets. Although there are variants able to divide the data into a greater number of groups, it normally uses only two. An SVM seeks the separating hyperplane of maximum generality between the two sets. To do this, it first tries to determine hyperplanes $w$ for which

\[ w^{T} x + b \geq u \tag{9.1} \]

if the point $x$ belongs to one class and

\[ w^{T} x + b \leq v \tag{9.2} \]

if the point belongs to the other. The distance between the two sets can be defined as

\[ \frac{u - v}{\sqrt{w^{T} w}} \tag{9.3} \]

The objective of the SVM optimization is to maximize this margin, that is, to find the hyperplane $w$ for which the distance between both sets is maximum. Note that this is only possible if $u > v$, that is, if the sets are linearly separable.

In case they are not, the SVM can transform the vector space to move the data to a higher dimensional kernel space in which the samples are linearly separable by a hyperplane. For this, a function $\phi()$ that performs the spatial transformation is used. Thus, the plane sought must fulfil

\[ w^{T} \phi(x) + b \geq u \tag{9.4} \]

if the point $x$ belongs to one class and

\[ w^{T} \phi(x) + b \leq v \tag{9.5} \]

if the point belongs to the other. However, sometimes it is not possible to convert the data into two linearly separable sets. To work around this, most SVM implementations use the so-called soft margin optimization goal. A soft margin optimizer adds additional error terms that allow a limited fraction of training examples to be on the wrong side of the decision surface. As a consequence, the model does not actually perform well on all transformed training examples, but trades the error on these examples against an increased margin on the remaining training examples. So, the model seeks the hyperplane of maximum generality that fulfils

\[ w^{T} \phi(x) + b + e_i \geq u \tag{9.6} \]

if the point $x$ belongs to one class and

\[ w^{T} \phi(x) + b - e_i \leq v \tag{9.7} \]

if the point belongs to the other. Among the planes that meet these conditions, the SVM chooses the one that maximizes the difference between the margin between the points and the errors made, the so-called soft margin. The weight of these errors can be adjusted through the $C$ parameter.

The function $\phi()$ is allowed to map into a very large or even infinite-dimensional vector space. Support vector machines can get away with this because they never explicitly compute $\phi()$. Instead, they compute the kernel $k(u, v)$, where $u$ and $v$ here denote two arbitrary points, which is by definition equal to $\phi(u)^{T} \phi(v)$ and is computable. So, SVMs look for an $s$ such that $w = \phi(s)$. However, usually such an $s$ does not exist. But there is always a set of vectors $s_1, \dots, s_m$ and numbers $a_1, \dots, a_m$ such that

\[ w = \sum_{i=1}^{m} a_i \phi(s_i) \tag{9.8} \]

where each $s_i$ is one training example. These $m$ training examples are called support vectors, hence the name of this classifier. Using these vectors,

\[ w^{T} \phi(x) + b = b + \sum_{i=1}^{m} a_i k(s_i, x) \tag{9.9} \]

which is computable. In order to classify a point, we evaluate the previous expression and, depending on the side of the plane on which the point lies, it is assigned to one class or the other.
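As an illustration of how the kernel expansion is evaluated, the following sketch computes the decision value from a set of support vectors. The Gaussian (RBF) kernel, the value of `gamma`, the toy support vectors and the use of zero as the decision threshold are all assumptions made for the example; the text above does not state which kernel or thresholds were used.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Assumed kernel: k(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """Evaluate b + sum_i a_i * k(s_i, x) without ever computing phi()."""
    return bias + sum(a * kernel(s, x) for a, s in zip(coefficients, support_vectors))


def classify(x, support_vectors, coefficients, bias, kernel=rbf_kernel):
    """Assumed convention: class +1 on one side of the plane, -1 on the other."""
    return 1 if decision_value(x, support_vectors, coefficients, bias, kernel) >= 0 else -1


# Toy model with two hypothetical support vectors of opposite sign.
toy_vectors = [(0.0, 0.0), (2.0, 2.0)]
toy_coefficients = [1.0, -1.0]
label_near_first = classify((0.1, 0.0), toy_vectors, toy_coefficients, bias=0.0)
```

The cost of each prediction grows with the number of support vectors, which is one reason why models with many support vectors are slower to run.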

In addition to SVMs, logistic regression has been used to classify the dataset. This supervised method estimates the probability that an element belongs to a class conditioned on the values of each of its fields. To do this, it models the logit of this probability as a linear combination of the independent variables of the model, so that

\[ \log\left(\frac{P(x)}{1 - P(x)}\right) = b + \sum_{i=1}^{n} a_i x_i \tag{9.10} \]

During training, this type of model tries to find the values of $a_1, \dots, a_n$ and $b$ that maximize the likelihood of the training data (or, equivalently, its logarithm), that is, that maximize

\[ \prod_{i : y_i = 1} P(x_i) \prod_{i' : y_{i'} = 0} \left(1 - P(x_{i'})\right) \tag{9.11} \]

where, in this expression, $x_i$ denotes the $i$-th training example and $y_i$ its class label.

The main idea behind this optimization process is to estimate the coefficients so that the predicted probability of the main class is close to 1 for the elements of the main class and close to 0 for those that do not belong to it. Thus, once trained, it is enough to evaluate

\[ P(x) = \frac{e^{b + \sum_{i=1}^{n} a_i x_i}}{1 + e^{b + \sum_{i=1}^{n} a_i x_i}} \tag{9.12} \]

to calculate the probability. In order to classify each element, a threshold probability is defined. Those elements whose probability is greater than the threshold are marked as elements of the class considered, while those below it are assigned to the other class.
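A minimal sketch of the evaluation and thresholding steps for logistic regression is given below. The coefficient values used in the example and the default threshold of 0.5 are illustrative assumptions, not values taken from the trained models, and the function names are hypothetical.

```python
import math


def predict_probability(x, weights, bias):
    """P(x) = e^z / (1 + e^z) with z = b + sum_i a_i * x_i, written so that
    math.exp never receives a large positive argument."""
    z = bias + sum(a * xi for a, xi in zip(weights, x))
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_likelihood(samples, labels, weights, bias):
    """Logarithm of the product maximized during training (labels are 0 or 1)."""
    total = 0.0
    for x, y in zip(samples, labels):
        p = predict_probability(x, weights, bias)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def classify(x, weights, bias, threshold=0.5):
    """Assign class 1 when the estimated probability exceeds the threshold."""
    return 1 if predict_probability(x, weights, bias) > threshold else 0


# Hypothetical coefficients for a model with two independent variables.
example_probability = predict_probability([2.0, 1.0], [0.5, -1.0], 0.25)
```

Raising the threshold favours precision and lowering it favours recall, which is how the operating point of such a classifier can be adjusted.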

9.2 Measurements

In order to evaluate the classifiers, the precision, recall, specificity, accuracy and fmeasure are used. These values can be calculated from the truth table (confusion matrix), which collects the correspondence between the real class of each sample and the class assigned by the classifier. Thus, considering the class of interest as class A and the other as class B, four types of elements can be found: true positives (TP), the class A data that are correctly classified; false positives (FP), the elements of class B that are marked as belonging to class A; true negatives (TN), the samples of class B that are correctly detected; and false negatives (FN), the data of class A marked as belonging to class B.

From these values it is possible to calculate the aforementioned performance measurements. So the precision is

\[ \frac{TP}{TP + FP}, \tag{9.13} \]

that is, the fraction of elements classified as belonging to class A that really are of that type. The recall is

\[ \frac{TP}{TP + FN}, \tag{9.14} \]

that is, the fraction of the elements of class A that are correctly detected. The fmeasure is the harmonic mean of precision and recall. Moreover, the system accuracy can be calculated as

\[ \frac{TP + TN}{TP + TN + FP + FN}, \tag{9.15} \]

which is equivalent to the fraction of correctly classified elements. Finally, the specificity is the recall computed on class B instead of class A:

\[ \frac{TN}{TN + FP}. \tag{9.16} \]
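These definitions translate directly into code. The sketch below is illustrative only: the function names are hypothetical, the counts used in any call are made up, and it assumes that none of the denominators is zero.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, tn, fn):
    """Compute the measurements of this section from the four counts.
    Assumes every denominator is non-zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "fmeasure": f_measure(precision, recall),
    }


# Consistency check against table 7.11: precision 0.981 and recall 0.969.
table_7_11_fmeasure = f_measure(0.981, 0.969)
```

Applying `f_measure` to the precision and recall reported in table 7.11 gives approximately 0.975, which agrees with the fmeasure listed in that table.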

In addition to these measurements, the ROC curve and the AUC are also used. The ROC curve plots the true positive rate (recall) against the false positive rate (one minus the specificity) at various threshold settings, and the AUC is the area under this curve. The closer the AUC value is to one, the better the classifier.
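The following sketch shows one way to obtain a ROC point and the AUC from classifier scores. It relies on the known equivalence between the AUC and the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half); the scores and labels are made up for illustration and the function names are hypothetical.

```python
def roc_point(scores, labels, threshold):
    """Return (false positive rate, true positive rate) for one threshold.
    A sample is predicted positive when its score is >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores: higher means "more likely to be the class of interest".
toy_scores = [0.9, 0.8, 0.4, 0.3]
toy_labels = [1, 0, 1, 0]
toy_auc = auc(toy_scores, toy_labels)
```

The pairwise computation is quadratic in the number of samples, so it is only meant to make the definition concrete; sorting-based implementations are preferable for large datasets.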


10 Appendix II: LSTM networks

Recurrent neural networks (RNNs) are a type of network that makes it simple to work with relationships with previous terms. In particular, an RNN is a network with loops in it, allowing information to persist. This structure can be thought of as multiple copies of the same network, each passing a message to a successor.

Figure 10.1 – Recurrent network model

In theory, RNNs should be able to learn, with the correct selection of parameters, long-term dependencies, that is, dependencies where the terms are distant from each other, such as the relations between the words in a text. However, in practice, RNNs do not seem to be able to learn them.

Long Short Term Memory networks (LSTMs) are a special kind of RNN, explicitly designed to learn long-term dependencies. The core of this network is the LSTM cell. It is the unit that processes one input at a time and computes its results depending on the previous terms. An LSTM network can have several LSTM cells working in series or in parallel. The LSTM cell structure can be seen in figure 10.2.


Figure 10.2 – LSTM cell structure

The key to LSTMs is the horizontal line running through the top of the block, the cell state. This state collects the information from the previously seen terms and is updated in each time step. Two different processes are performed in order to update this state. The first one is the forget gate layer, the left side of the model. This layer controls the amount of previous information that is kept. It is based on a sigmoid neural network layer that uses the current word and the output of the previous iteration to decide which information is discarded and which is kept.

The second one is the input gate layer, the middle part of the diagram. This block is formed by a sigmoid layer that decides which values to update and a tanh layer that creates a vector of new candidate values that could be added to the state. Both layers are combined to update the cell state. This cell state will be used as the initial state for the next input.

The last part of the cell model calculates the output. This output will be based on the cell state, but will be a filtered version. First, a sigmoid layer, which decides what parts of the cell state are going to be output, is run. Then, the cell state is put through a tanh and multiplied by the output of the sigmoid gate, so that only the desired parts are output.
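The forget, input and output steps described above can be summarized in a short sketch of a single cell update. It is written in plain Python for readability, with hypothetical parameter names (`Wf`, `Wi`, `Wc`, `Wo` and the matching biases) and an assumed layout in which each gate acts on the concatenation of the previous output and the current input; it is not the implementation used in this document.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def affine(weights, bias, vector):
    """Return W * vector + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One time step of an LSTM cell; returns the new output and cell state."""
    z = h_prev + x  # concatenation of the previous output and the current input
    forget = [sigmoid(v) for v in affine(params["Wf"], params["bf"], z)]
    update = [sigmoid(v) for v in affine(params["Wi"], params["bi"], z)]
    candidate = [math.tanh(v) for v in affine(params["Wc"], params["bc"], z)]
    output = [sigmoid(v) for v in affine(params["Wo"], params["bo"], z)]
    # Keep part of the old state and add the selected candidate values.
    c = [f * c_old + u * g for f, c_old, u, g in zip(forget, c_prev, update, candidate)]
    # The output is a filtered version of the new cell state.
    h = [o * math.tanh(c_new) for o, c_new in zip(output, c)]
    return h, c
```

For a cell with hidden size H and input size X, each weight matrix has H rows of length H + X. Processing a sequence consists of calling `lstm_step` once per term, feeding the returned output and state into the next call.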

Like any conventional neural network, this model is trained by optimizing an objective function. In particular, in this document we want to minimize the average negative log probability of the target words, that is,

\[ \mathrm{loss} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{\mathrm{target}_i} \tag{10.1} \]
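As a small numerical illustration of this loss, the sketch below evaluates it for a list of probabilities that a model assigned to the correct next words. The probabilities are made up and the function name is hypothetical.

```python
import math


def average_negative_log_probability(target_probabilities):
    """loss = -(1/N) * sum_i log(p_target_i), for probabilities in (0, 1]."""
    n = len(target_probabilities)
    return -sum(math.log(p) for p in target_probabilities) / n


# Made-up probabilities assigned by a model to two correct target words.
toy_loss = average_negative_log_probability([0.5, 0.25])
```

The loss is zero only when every target word receives probability one, and it grows as the model assigns lower probability to the words that actually follow.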

The cell model shown, although widely used, is not unique. There are other alternative versions with minor changes that slightly modify the operation of the cell. However, the one explained above is the one used in this document. In (Olah, 2015) and (TensorFlow, 2018) more detailed explanations about LSTM networks and their alternatives can be found, as well as different implementations.


Bibliography

Agarwal, S., Chomsisengphe, S., Cheryl, L., 2017. Consumer choice and financial products. Annual Review of Financial Economics 9.

Antweiler, W., Frank, M., 2004. Is all that talk just noise? the information content of internet stock message boards. The Journal of Finance 59 (3).

Baron, J., O’Mahony, A., Manheim, D., Dion-Schwarz, C., 2015. National security implications of virtual currency: Examining the potential for non-state actor deployment. Tech. rep., RAND Corporation-NDRI Santa Monica United States.

ben Khalifa, M., Díaz Redondo, R., Fernández Vilas, A., Servia Rodríguez, S., 2016. Identifying urban crowds using geo-located social media data: a twitter experiment in new york city. Journal of Intelligent Information Systems.

Billett, M., Yu, M., 2016. Asymmetric information, financial reporting, and open-market share repurchases. Journal of Financial and Quantitative Analysis 51 (4).

Bollen, J., Mao, H., Zeng, X., 2011. Twitter mood predicts the stock market. Journal of computational science 2 (1), 1–8.

Brennan, C., Lunn, W., 2016. Blockchain: the trust disrupter. Credit Suisse Securities (Europe) Ltd.: London, UK.

Buterin, V., 2017. Ethereum: a next generation smart contract and decentralized application platform (2013). URL http://ethereum.org/ethereum.html

Campbell, J., Cecez-Kecmanovic, D., 2011. Communicative practices in an online financial forum during abnormal stock market behavior 48.

Ceccarelli, D., Nidito, F., Osborne, M., 2016. Ranking financial tweets. In: ACM (Ed.), Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR ’16). pp. 527–528.

CESGA, 2018. Cesga main website. URL http://www.cesga.es


Choi, H., Varian, H., 2012. Predicting the present with google trends. Economic Record 88 (s1), 2–9.

Colianni, S., Rosales, S., Signorotti, M., 2015. Algorithmic trading of cryptocurrency based on twitter sentiment analysis. CS229 Project.

Cortez, P., Oliveira, N., Ferreira, J. P., 2016. Measuring user influence in financial microblogs: experiments using stocktwits data. In: ACM (Ed.), WIMS’16 Proceedings of the 6th International Conference on Web Intelligence, Mining and Semantics.

De Choudhury, M., 2011. Tie formation on twitter: Homophily and structure of egocentric networks. In: IEEE (Ed.), Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), 2011 IEEE Third International Conference on.

Delort, J.-Y., Arunasalam, B., Leung, H., Milosavljevic, M., 2012. The impact of manipulation in internet stock message boards. International Journal of Banking and Finance 8 (4).

Dredze, M., Kambadur, P., Kazantsev, G., Mann, G., Osborne, M., 2016. How twitter is changing the nature of financial news discovery. In: ACM (Ed.), Proceedings of the Second International Workshop on Data Science for Macro-Modeling.

Fernandez, M., 2014. Marble initiative. URL http://marble.miguelfc.com

Fernández Vilas, A., Díaz Redondo, R., Crockett, K., Owda, M., Evans, L., 2018. Twitter permeability to financial events: an experiment towards a model for sensing irregularities.

Fernández-Vilas, A., Evans, L., Owda, M., Díaz Redondo, R. P., Crockett, K., 2017. Experiment for Analysing the Impact of Financial Events on Twitter. Springer International Publishing, Cham, pp. 407–419.

Go, A., Bhayani, R., Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford 1 (12).

Google, 2018. Google trends. URL http://www.google.it/trends

He, D., Habermeier, K. F., Leckow, R. B., Haksar, V., Almeida, Y., Kashima, M., Kyriakos-Saad, N., Oura, H., Sedik, T. S., Stetsenko, N., et al., 2016. Virtual currencies and beyond: initial considerations. Tech. rep., International Monetary Fund.

Hentschel, M., Alonso, O., 2014. Follow the money: A study of cashtags on twitter. First Monday 19 (8).


Hobijn, B., Jovanovic, B., 2001. The information technology revolution and the stock market: Evidence. American Economic Review 91, 1203–1220.

Hu, T., Tripathi, A., 2016. Impact of social media and news media on financial markets. SSRN.

Kaminski, J., 2014. Nowcasting the bitcoin market with twitter signals. arXiv preprint arXiv:1406.7577.

Karppi, T., Crawford, K., 2016. Social media, financial algorithms and the hack crash. Theory, Culture & Society 33 (1), 73–92.

Kimoto, T., Asakawa, K., Yoda, M., Takeoka, M., 1990. Stock market prediction system with modular neural networks. In: Neural Networks, 1990., 1990 IJCNN International Joint Conference on. IEEE, pp. 1–6.

Liew, J. K.-S., Budavári, T., 2016. Do tweet sentiments still predict the stock market? SSRN.

Liu, H., Morstatter, F., Tang, J., Zafarani, R., 2016. The good, the bad, and the ugly: uncovering novel research opportunities in social media mining. International Journal of Data Science and Analytics 1 (3-4), 137–143.

Liu, L., Wu, J., Li, P., Li, Q., 2015. A social-media-based approach to predicting stock comovement. Expert Systems with Applications 42 (8).

Liu, Y., Huang, X., An, A., Yu, X., 2007. Arsa: a sentiment-aware model for predicting sales performance using blogs. In: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, pp. 607–614.

Loria, S., 2014. Textblob: simplified text processing.

Mai, F., Bai, Q., Shan, Z., Wang, X., Chiang, R., 2015. From bitcoin to big coin: The impacts of social media on bitcoin performance. SSRN Electronic Journal.

MARKET, P., 2011. Twitter mood as a stock market predictor.

Matta, M., Lunesu, I., Marchesi, M., 2015. Bitcoin spread prediction using social and web search media. In: UMAP Workshops.

McWaters, R., Galaski, R., Chatterjee, S., 2016. The future of financial infrastructure: An ambitious look at how blockchain can reshape financial services. In: World Economic Forum.

Miller, G. S., Skinner, D. J., 2015. The evolving disclosure landscape: How changes in technology, the media, and capital markets are affecting disclosure. Journal of Accounting Research 53 (2).


Morstatter, F., Pfeffer, J., Liu, H., Carley, K. M., 2013. Is the sample good enough? comparing data from twitter’s streaming api with twitter’s firehose. In: Proceedings of the 7th International Conference on Weblogs and Social Media, ICWSM 2013. AAAI press, pp. 400–408.

Muhammad, A., Leak, A., Longley, P., 2014. A geocomputational analysis of twitter activity around different world cities. Information Science 17 (3).

Nakamoto, S., 2008. Bitcoin: A peer-to-peer electronic cash system.

Olah, C., 2015. Understanding lstm networks. URL https://www.tensorflow.org/tutorials/sequences/recurrent

Oliveira, N., Cortez, P., Areal, N., 2017. The impact of microblogging data for stock market prediction: Using twitter to predict returns, volatility, trading volume and survey sentiment indices. In: Expert Systems with Applications. pp. 125–144.

Owda, M., Crockett, K., Lee, P., 2017. Financial discussion boards irregularities detection system (fdbs-ids) using information extraction. In: Intelligent Systems Conference 2017.

Pak, A., Paroubek, P., 2010. Twitter as a corpus for sentiment analysis and opinion mining. In: LREC. Vol. 10.

Rajesh, N., Gandy, L., 2016. Cashtagnn: Using sentiment of tweets with cashtags to predict stock market prices. In: 11th International Conference on Intelligent Systems: Theories and Applications (SITA). IEEE.

Ranco, G., Aleksovski, D., Caldarelli, G., Grčar, M., Mozetič, I., 2015. The effects of twitter sentiment on stock price returns. PLoS ONE 10 (9).

Rao, T., Srivastava, S., 2012. Analyzing stock market movements using twitter sentiment analysis. In: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012). IEEE Computer Society, pp. 119–123.

Rao, T., Srivastava, S., 2014. Twitter Sentiment Analysis: How to Hedge Your Bets in the Stock Markets. Springer International Publishing, Cham, pp. 227–247.

Renault, T., 2017. Market manipulation and suspicious stock recommendations on social media.

Ruiz, E. J., Hristidis, V., Castillo, C., Gionis, A., Jaimes, A., 2012. Correlating financial time series with microblogging activity. In: Proceedings of the fifth ACM international conference on Web search and data mining.

Sato, Y., 2017. Market sentiment helps explain the price of bitcoin.

Seibold, S., Samman, G., 2016. Consensus: Immutable agreement for the internet of value. KPMG. URL https://assets.kpmg.com/content/dam/kpmg/pdf/2016/06/kpmgblockchain-consensus-mechanism.pdf

Servia-Rodríguez, S., Díaz-Redondo, R., Fernández-Vilas, A., 2015. Are tweets biased by audience? an analysis from the view of topic diversity. In: International Conference on Social Computing, Behavioral-Cultural Modeling, and Prediction. Springer International Publishing.

Shutes, K., McGrath, K., Lis, P., Riegler, R., 2016. Twitter and the US stock market: The influence of micro-bloggers on share prices. Economics and Business Review 2 (3).

Sprenger, T. O., Tumasjan, A., Sandner, P. G., Welpe, I. M., 2014. Tweets and trades: the information content of stock microblogs. European Financial Management 20, 926–957.

Tafti, A., Zotti, R., Jank, W., 2016. Real-time diffusion of information on twitter and the financial markets. PLoS ONE 11 (8).

TensorFlow, 2018. Recurrent neural networks. URL https://www.tensorflow.org/tutorials/sequences/recurrent

Tschorsch, F., Scheuermann, B., 2016. Bitcoin and beyond: A technical survey on decentralized digital currencies. IEEE Communications Surveys & Tutorials 18 (3), 2084–2123.

Tumarkin, R., Whitelaw, R. F., 2001. News or noise? internet postings and stock prices. Financial Analysts Journal 57 (3), 41–51.

Twitter, 2018. Twitter api documentation. URL https://dev.twitter.com/rest/tools/console

Vosoughi, S., 2015. Automatic detection and verification of rumors on twitter.

Wu, L., Hoi, S. C., Yu, N., 2010. Semantics-preserving bag-of-words models and applications. IEEE Transactions on Image Processing 19 (7).

Xiong, F., MacKenzie, K., 2015. The business use of twitter by australian listed companies. The Journal of Developing Areas 49 (6).

Xiong, F., Prasad, A., Chapple, L., 2016. The economic consequences of corporate financial reporting on twitter. In: 7th Conference on Financial Markets and Corporate Governance.

Zheludev, I., Smith, R., Aste, T., 2014. When Can Social Media Lead Financial Markets? Scientific Reports 4, 4213.
