Andrew Peter Burnie
Total Page:16
File Type:pdf, Size:1020Kb
Processing Social Media Text for the Quantamental Analyses of Cryptoasset Time Series Andrew Peter Burnie A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy of University College London. Department of Computer Science University College London February 20, 2020 2 I, Andrew Peter Burnie, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the work. Abstract This thesis analyses social media text to identify which events and concerns are associated with changes between phases of rising and falling cryptoasset prices. A new cryptoasset classification system, based on token functionality, high- lights Bitcoin as the largest example of a ‘crypto-transaction’ system and Ethereum as the largest example of a ‘crypto-fuel’ system. The price of ether is only weakly correlated with that of bitcoin (Spearman’s rho 0.3849). Both bitcoin and ether show distinct phases of rising or falling prices and have a large, dedicated social media forum on Reddit. A process is developed to ex- tract events and concerns discussed on social media associated with these different phases of price movement. This innovative data-driven approach circumvents the need to pre-judge social media metrics. First, a new, non-parametric Data-Driven Phasic Word Identification methodol- ogy is developed to find words associated with the phase of declining bitcoin prices in 2017-18. This approach is further developed to find the context of these words, from which topics are inferred. Then, neural networks (word2vec) are applied to evolve analysis from extracting words to extracting topics. Finally, this work en- ables the development of a framework for identifying which events and concerns are plausible causes of changes between different phases in the ether and bitcoin price series. Consistent with Bitcoin providing a form of money and Ethereum providing a platform for developing applications, these results show the one-off effect of regu- latory bans on bitcoin, and the recurring effects of rival innovations on ether price. The results also suggest the influence of technical traders, captured through mar- Abstract 4 ket price discourse, on both cryptoassets. This thesis demonstrates the value of a quantamental approach to the analysis of cryptoasset prices. Impact Statement The first benefit of this research is to develop a user-friendly cryptoasset classifi- cation system based on token functionality. This has been published in the peer- reviewed journal Ledger [40], and formed part of the written evidence submitted to the UK Parliament Digital Currencies Inquiry to inform public policy on cryptoas- sets, in conjunction with Eversheds Sutherland (International) LLP [100]. The second impact is to quantitatively assess social media discussion forums to identify what events and concerns are associated with major shifts between dif- ferent phases of price. The benefit of this is that it necessitated the development of new methodologies that recognise the need for non-parametric analyses to quanti- tatively examine discussion forums. This moves the debate from previous analyses of volume and sentiment to associating changes in price with specific events and concerns. This starts with Data-Driven Phasic Word Identification (DDPWI; see Chapter 5), and then uses word2vec neural networks to evolve from finding ‘price dynamic words’ to topics (see Chapter 6). It demonstrates the benefits of data de- rived from social media discussion forums over alternative sources such as web search or Twitter data used in previous studies. Rather than pre-judging potential causes of movement that are then tested, these data-driven approaches discover rel- evant events and concerns from social media text. These methodologies could be applied to other cryptoassets and more generally to other research areas where there is a time series and a relevant social media text source. Outside academia the emphasis has been on developing trading algorithms to predict cryptoasset price. These have used prejudged metrics and ignored the inher- ently phasic nature of the price series, with the possibility that causal effects may Impact Statement 6 vary over time. The impact of this research is that it identifies the limitation of this approach by showing that there are both recurring events and unanticipated, one- off, ‘black swan’ events associated with phasic shifts in price. The results differ between Bitcoin and Ethereum, with the exception of speculation (see Chapter 7), which is consistent with their different token functionality (see Chapter 4). The impact of the research has been brought about through publications and conference proceedings to international academics at SIGIR and to the FinTech industry (see Section 1.5). This included three peer-reviewed, open-access arti- cles [40,45,46]. The correlation analyses presented at the Cryptocurrency Research Conference 2018 has been cited 9 times [39]. The article on cryptoasset classi- fication has been downloaded 3,024 times [40], DDPWI [45] 776 times and the word2vec topic modelling technique [43] 132 times (all by 22nd January 2020). Acknowledgements I would like to express my gratitude to my supervisors Professor Emine Yilmaz and Professor Tomaso Aste for their guidance and critical feedback, which were invaluable in encouraging me to take my ideas to the next level and in determining where the research contained within this thesis might be published for maximum impact. I would also like to extend my gratitude to my examiner in the PhD transfer viva, Professor Suzy Moat, who provided greatly appreciated recommendations and helpful advice in guiding the direction of this research. This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1 and Turing award number TU/C/000028. I would like to thank The Alan Turing Institute for their financial support as well as for the provision of lec- tures, workshops and facilities that helped to stimulate the research in this PhD. I would also like to extend this thanks to University College London which provided a fantastic environment in which to pursue a PhD. I would like to provide a special thanks to my brother, James Burnie, Head of Blockchain and Cryptoassets (United Kingdom) at Eversheds Sutherland (Interna- tional) LLP. James’s enthusiasm for cryptoassets and innovation in financial tech- nology sparked my interest in pursuing research in this exciting field; his expertise was a valuable source of feedback, particularly on the cryptoasset classification; and he was an invaluable liaison with Eversheds Sutherland (International) LLP. Finally, I would like to thank my parents, Professor James Burnie and Professor Ruth Matthews, for inspiring my lifetime love for science and data that motivated me to pursue this PhD. Table 1: Abbreviations used in Thesis Abbreviation Text AODE: Averaged One-dependence Estimators ARDL: Autoregressive Distributed Lag ARIMA: Auto-Regressive Moving Average ARIMAX: Extended version of ARIMA that includes other predictors. EC: Empirical Conditional Model ECM/VECM: Error Correction Model / Vector Error Correction Model EEMD: Ensemble Empirical Mode Decomposition method ENET: Elastic-Net regularized regression method EWMA: Exponential Weighted Moving Average GBT: Gradient Boosted Tree GDA: Gaussian Discriminant Analysis GLM: Generalised Linear Model GP: Gaussian process based regression HMM: Hidden Markov Model ICO: Initial Coin Offering LASSO: Least Absolute Shrinkage and Selection LDA: Linear Discriminant Analysis QDA: Quadratic Discriminant Analysis LIWC: Linguistic Inquiry and Word Count framework LR/WLR: Logistic Regression / Weighted Logistic Regression PCA: Principal Component Analysis RF: Random Forest STR: Structured Time Series Model STRX: STR plus regression terms on external features similar to ARIMAX SVM/SVR: Support Vector Machine / Support Vector Regression VADER: Valence Aware Dictionary for sEntiment Reasoning VAR: Vector Autoregression XGT: Extreme gradient boosting Evaluation Metrics RMSE: Root Mean Square Error MAE: Mean Absolute Error MAPE: Mean Absolute Percentage Error FEVD: Forecast-Error Variance Decomposition Correlation Metrics PMCC: Pearson’s Product Moment Correlation Coefficient SR: Spearman’s Rho KT: Kendall’s Tau VIF: Variance Inflation Factor Neural Networks BNN: Bayesian Neural Networks CNN: Convolutional Neural Network FFN: Feedforward Neural Network GASEN: Genetic Algorithm based Selective Neural Network Ensemble GRU: Gated Recurrent Unit LSTM: Long Short-Term Memory RNN: Recurrent Neural Network EEMD-ELMAN: applies EEMD then RNN RRL: Recurrent Reinforcement Learning Variants on ARCH ARCH: Auto-Regressive Conditional Heteroskedasticity GARCH: Generalized Auto-Regressive Conditional Heteroskedasticity EGARCH: Exponential GARCH AR-GARCH: Asymmetric Power GARCH AR-CGARCH: Asymmetric Power Component GARCH BEGARCH: GARCH but lets conditional log-transformed volatility be dependent on past values of a t-distribution score Regulatory Bodies CFTC: Commodity Futures Trading Commission EBA: European Banking Authority ESMA: European Securities and Markets Authority FCA: Financial Conduct Authority FINMA: Swiss Financial Market Supervisory Authority SEC: United States Securities and Exchange Commission Contents 1 Introduction 18 1.1 Research Background and Context . 18 1.2 Cryptoasset or Cryptocurrency? . 24 1.3 Research Objective . 25 1.3.1 Delineating the System to be Analysed . 25 1.3.2