
A prediction based model for forex markets combining Genetic Algorithms and Neural Networks

Rui Miguel Hungria Furtado

Thesis to obtain the Master of Science Degree in Telecommunications and Computer Science Engineering

Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves

Examination Committee

Chairperson: Prof. Ricardo Jorge Fernandes Chaves
Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves
Member of the Committee: Prof. Aleksandar Ilic

October 2018

To my family and friends

Acknowledgments

First I would like to thank my supervisor, Prof. Rui Neves, who provided weekly support throughout the thesis development without being extremely strict about the path needed to achieve the established goals. He allowed the thesis to be completely my own work, but guided me in the right direction whenever needed. Second, I want to thank my colleagues, who gave me the strength and advice to finish this work. It was a pleasure to share with you this academic journey that now comes to an end. Finally, I want to thank my family and closest ones. Without your patience and unconditional support this thesis would not have been possible.

Resumo

Investir em mercados financeiros é sempre uma tarefa complexa e incerta. De modo a aumentar as pequenas possibilidades de obter uma rentabilidade que ultrapasse o índice de mercado, os investidores recorrem a uma série de técnicas que têm como objetivo tentar determinar futuros pontos de entrada e saída do mercado. Esta tese propõe um sistema de trading otimizado para o mercado cambial, normalmente conhecido por FOREX. Para desempenhar tal tarefa é usada uma Feedforward Neural Network (FNN), que recebe como input um conjunto de indicadores técnicos (TI) calculados a partir de dados históricos de mercados FOREX, com uma amostragem horária. O sistema foi estruturado seguindo uma metodologia de Supervised Learning para criação das target variables, convertendo retornos horários num sinal binário, transposto para um problema de classificação. De modo a obter o melhor conjunto de parâmetros usados para gerar os indicadores técnicos e os hiper-parâmetros da rede neuronal, foi desenvolvida uma Estratégia Evolutiva (ES), baseada num Algoritmo Genético (GA), dado que fazer uma busca exaustiva pelo espaço de resultados iria conduzir a um tempo de espera demasiado grande. O Algoritmo Genético providencia também um processo automático de Feature Selection, de modo a seleccionar apenas as features mais relevantes. O sistema proposto foi testado com dados históricos de 5 diferentes mercados, de modo a serem testadas diferentes condições de investimento. As estratégias produzidas são posteriormente avaliadas contra estratégias de investimento clássicas. Os resultados obtidos demonstram que esta abordagem é capaz de superar a estratégia Buy & Hold (B&H) no mercado GBP/USD, alcançando um resultado médio de 14,19% de retorno de investimento (ROI), contra 10,69% de ROI para B&H. O sistema também superou a estratégia Sell & Hold (S&H) para o par USD/CHF, alcançando um resultado de 4,45% de ROI contra 4,09% para o S&H. É também discutido o uso de Batch Normalization como técnica de pré-processamento durante o desenvolvimento de cada estratégia de mercado.

Palavras-chave: Algoritmos Genéticos, Aprendizagem Profunda, Aprendizagem Automática, Otimização de Funcionalidades, FOREX, Análise Técnica

Abstract

Investing in financial markets is always a complex and difficult task. To raise the small chances of beating the market, investors usually rely on several techniques that attempt to determine the underlying trading signal and, hopefully, predict future market entry and exit points. This thesis proposes a trading system optimized for the Foreign Exchange Market, widely known as FOREX. To perform such a task, we use a Feedforward Neural Network (FNN) that takes as input features a set of technical indicators (TI), calculated using FOREX hourly data. A supervised learning approach was considered to create the target variables, converting hourly returns into a binary trading signal, suitable for a classification problem. To get the best combination of parameters used to generate each TI and the FNN hyperparameters, we deployed an Evolutionary Strategy (ES) based on a Genetic Algorithm (GA), since making an exhaustive search through the entire feature space would be an unfeasible task. The GA also deploys an automatic Feature Selection (FS) mechanism that enables the FNN to use only relevant features for the given problem. The proposed system is tested with real hourly data from 5 different markets, each one exhibiting different behavior during the sampled time. The produced investment strategies are compared with classical trading strategies. The achieved results show that this approach is capable of outperforming the Buy & Hold (B&H) strategy in the GBP/USD market, achieving an average result of 14.19% of Return on Investment (ROI), against 10.69% of ROI for B&H. The system also outperformed the Sell & Hold (S&H) strategy for the USD/CHF pair, achieving a result of 4.45% of ROI against 4.09% for S&H. Furthermore, the usage of Batch Normalization as a preprocessing technique during the development of each market strategy is also discussed.

Keywords: Genetic Algorithms, Deep Learning, Machine Learning, Feature Optimization, FOREX, Technical Analysis

Contents

Acknowledgments
Resumo
Abstract
List of Tables
List of Figures
Nomenclature
Glossary

1 Introduction
1.1 Motivation
1.2 Main Contributions
1.3 Goals
1.4 Document structure

2 Background
2.1 Market trading
2.1.1 FOREX market
2.1.2 Financial data
2.1.3 Technical Analysis
2.1.4 Trend following TI
2.1.5 Momentum oscillators
2.1.6 Volatility
2.1.7 Other indicators
2.2 Machine Learning
2.2.1 Artificial Neural Networks
2.2.2 Genetic Algorithms
2.3 Related Work
2.3.1 Related works on Neural Networks
2.3.2 Related works on Genetic Algorithms

3 Implementation
3.1 Model overview
3.2 User input
3.2.1 Data
3.3 Feature calculation
3.4 Optimization
3.4.1 Population generation
3.4.2 Model creation
3.4.3 Fitness computation
3.4.4 GA operators
3.5 Model prediction
3.6 Market simulation

4 Results
4.1 FOREX Data
4.1.1 Data statistics
4.2 Evaluation metrics
4.2.1 Classification metrics
4.2.2 Financial metrics
4.3 Experimental setup
4.4 Case study A - Simple prediction
4.4.1 Classification results
4.4.2 Market simulator
4.5 Case study B.1 - Accuracy as fitness function
4.5.1 Classification results
4.5.2 Market Simulator
4.6 Case study B.2 - ROI as fitness function
4.6.1 Classification results
4.6.2 Market Simulator
4.7 Case Study 3 - Further investigation on profitable markets
4.7.1 Benchmark comparisons
4.7.2 USD/CHF
4.7.3 GBP/USD
4.7.4 GBP/USD without Batch Normalization
4.7.5 Feature selection results
4.7.6 Fitness evolution
4.7.7 Topology evolution

5 Conclusions
5.1 Future Work

Bibliography

A Topology evolution plots

List of Tables

4.1 Summary of market indices
4.2 Summary of market returns
4.3 Distribution shape descriptors
4.4 System parameters
4.5 Classification results
4.6 Classification results with Batch Normalization
4.7 Financial results
4.8 Financial results with Batch Normalization
4.9 Classification results with ACC fitness
4.10 Financial results with ACC Fitness
4.11 Classification results with ROI fitness
4.12 Financial results with ROI fitness
4.13 USD/CHF strategies comparison
4.14 GBP/USD strategies comparison
4.15 GBP/USD without BN strategies comparison
4.16 Selected TI
4.17 Solutions architecture

List of Figures

2.1 Currency pair EUR/USD
2.2 time series plot
2.3 SMA, EMA and HMA plot with time frame of 20 days
2.4 MOM plot
2.5 MACD and their 3 signals plot
2.6 Bollinger Bands plot
2.7 Double Smoothed Stochastic plot
2.8 Perceptron architecture
2.9 Feedforward Neural Network architecture
2.10 Sigmoid function
2.11 Binary 4 gene chromosome representation
2.12 Tournament selection method
2.13 Crossover operation
2.14 Genetic algorithm flowchart

3.1 System workflow
3.2 Raw input data
3.3 SMA csv
3.4 Optimization layer
3.5 Chromosome structure
3.6 Final prediction pipeline
3.7 Market simulation

4.1 EUR/USD market index
4.2 GBP/USD market index
4.3 GBP/JPY market index
4.4 USD/JPY market index
4.5 USD/CHF market index
4.6 Non-normalized test ACC vs Normalized test ACC
4.7 Non-normalized test ROI vs Normalized test ROI
4.8 Test ACC vs Test ROI
4.9 Maximum Drawdown for GBP/USD
4.10 Best, average and worst system individuals for GBP/USD
4.11 Best, average and worst system individuals for USD/CHF
4.12 USD/CHF strategies evolution over time
4.13 USD/CHF market entry points
4.14 GBP/USD strategies evolution over time
4.15 GBP/USD without BN strategies evolution over time
4.16 GBP/USD without BN market entry points
4.17 GBP/USD histogram
4.18 USD/CHF histogram
4.19 USD/CHF and GBP/USD number of features over generation
4.20 GBP/USD box-and-whisker plot
4.21 USD/CHF box-and-whisker plot
4.22 USD/CHF roi vs gen
4.23 15 most used number of neurons in GBP/USD 1st FNN layer
4.24 15 most used number of neurons in GBP/USD 2nd FNN layer
4.25 15 most used number of neurons in USD/CHF 1st FNN layer
4.26 15 most used number of neurons in USD/CHF 2nd FNN layer

A.1 Evolution of the number of neurons in GBP/USD
A.2 Evolution of the number of neurons in USD/CHF

Nomenclature

Optimization and Computer Engineering Related

ACC Accuracy

DL Deep Learning

EC Evolutionary Computing

FNN Feedforward Neural Network

GA Genetic Algorithm

ML Machine Learning

NN Neural Network

Investment Related

AA Aroon

ADX Average Directional Index

ATR Average True Range

BB Bollinger Bands

CCI Commodity Channel Index

CMO Chande Momentum Oscillator

DPO Detrended Price Oscillator

DSS Double Smoothed Stochastic

EMA Exponential Moving Average

EMH Efficient Market Hypothesis

HMA Hull Moving Average

KURT Kurtosis

MACD Moving Average Convergence Divergence

MDD Maximum Drawdown

MOM Momentum

PO Percentage Price Oscillator

ROC Rate of Change

ROI Return on Investment

RSI Relative Strength Index

SKEW Skewness

SMA Simple Moving Average

STD Standard Deviation

STV Standard Variance

TA Technical Analysis

TI Technical Indicator

Chapter 1

Introduction

Computational Finance has been growing over the years. The latest progress in the Machine Learning (ML) area and the increase of complexity in financial securities, backed up by a huge growth in computer processing power, have boosted the area of quantitative analysis. This field continuously tries to explain the behavior of financial markets through complex mathematical measures and calculations, normally using techniques such as stochastic calculus, statistical models, and data analysis. Quantitative analysis tries to represent a given reality as numerical values, with the goal of reducing investment risk while generating the highest possible profit. The potential and diversity shown by such methods led to the creation of powerful systems capable of providing promising results in forecasting how markets will react in the relatively near future. These systems are often used by large firms. But since they do not reveal how they are performing nor what results they are achieving, this domain remains a highly competitive and private environment. Although it is undeniable that this is a field with an outstanding potential for improvement, the activity of forecasting the market is and will always be considered a controversial activity. There is a well established community of critics who claim that the market cannot be predicted, and that all efforts to accomplish this goal are useless due to the randomness associated with market variations. In fact, many economists say that the market is completely unforecastable. For example, in his 1973 bestseller "A Random Walk Down Wall Street", Burton Malkiel stated that "a blindfolded monkey throwing darts at a newspaper's financial pages, could select a portfolio that would do just as well as one carefully selected by experts" [1]. The Random Walk Theory (RWT) affirms that stock market prices are completely random, making it impossible to outperform the market. This randomness is explained by the Efficient Market Hypothesis (EMH), which states that financial markets are efficient, and that prices already reflect all known information concerning a stock [2]. This implicitly states that trends and patterns observed in past data are not correlated with future outcomes, and the occurrence of new information is apparently random as well. This is in direct opposition to Technical Analysis (TA), which claims that a stock's future price can be forecast based on historical information, through observing chart patterns created by Technical Indicators (TI). These are a vast set of mathematical formulations based on past prices, widely used by active traders in order to create educated guesses about future market trends, identifying suitable entry and exit points in the market. We directly apply this classic trading methodology to our model, to empirically prove that the EMH is not completely right. The hypothesis is that if people using this type of procedure can consistently beat the market, then ML methods backed up by Evolutionary Computation (EC), such as the Genetic Algorithms (GA) used in this work,

should also be able to reproduce an identical behavior in an automatic way, avoiding the need for manual labour in defining a trading strategy. However, there is always a certain degree of randomness associated with every market, and that trait could never be dissociated from them. The market is and will always be a noisy, non-stationary, and non-linear system.

1.1 Motivation

The Foreign Exchange Market (FOREX) is the global market for currency trading. It is considered the largest and most liquid financial market in the world. The reason behind its current and increasing popularity can be easily explained by the vast amount of benefits associated with this type of market (section 2.1.1). The forecasting theme in financial markets has become a topic of major interest to investors, used as a measure to secure and manage their portfolio's "health". Therefore, the main motivation of this work is to create a suitable system for the FOREX market that is capable of providing the best possible return on investment (ROI). It is intended to study whether Deep Learning (DL) is a suitable tool to deal with Time Series (TS) analysis, exploring its adaptability to different FOREX markets, forecasting hourly returns transformed into binary classifications. We also want to identify whether evolutionary techniques such as GA are a good feature selection and optimization tool, minimizing the used features to a number that is capable of extracting the maximum performance out of each produced Feedforward Neural Network (FNN) model, at the same time that some of its internal hyperparameters are optimized by the GA as well. An important point in the optimization process is also to determine if the accuracy achieved by the deployed model is correlated with the achieved ROI, i.e., if an optimized FNN with the highest possible accuracy is also capable of providing the highest ROI.

1.2 Main Contributions

This thesis tries to contribute to and advance the field of Computational Finance, exploring and combining the above mentioned techniques on a set of different FOREX markets. The main contributions of this work are the following:

• The application of DL algorithms such as FNN to analyze FOREX markets, which allow the usage of bigger volumes of data that condense information to a higher level of granularity (for example hourly records instead of daily records);

• The usage of novel DL procedures to tweak and optimize the FNN data processing;

• The usage of GA as feature selection tool, deciding the best TI features for the system, at the same time that individual feature parameters are optimized as well;

• The integration of different fitness functions, accuracy and ROI, and the study of the markets in which the improvement of one metric can be related to the improvement of the other;

1.3 Goals

By combining the above mentioned Machine Learning techniques with trading knowledge, the goal of this work is to create a forecasting software that can offer a sufficiently accurate market prediction, ultimately leading towards the creation of a trading strategy that minimizes the risk involved in market investments. To achieve this objective, the project intends to fulfill the following requirements:

• Explore different FOREX markets, studying their potential for trading profit;

• Use TA as feature generation tool;

• Explore the suitability of FNN for time series analysis;

• Use market returns transformed into binary labels, as model target;

• Explore the potential of using a GA for technical indicator parameter optimization;

• Explore the potential of using a GA for FNN hyperparameter optimization;

• Select an optimal set of TI, through the use of GA feature selection;

• Forecast one hour-ahead binary market returns;

• Use predicted outputs to create a trading strategy;

• Compare the performance of the created strategy with the traditional trading strategies used by traders;

1.4 Document structure

The remainder of this document is organized as follows:

Section 1: Introduction to the theme of the thesis and used methodology.

Section 2: Description of some background relevant for the work, with respect to financial and technological aspects relevant to the theme.

Section 3: Describes all the related work useful for the development of this project, divided by works related with FOREX, FNN and GA applied in financial domains.

Section 4: Proposes a solution combining all the techniques mentioned in the previous chapters.

Section 5: Introduces evaluation metrics for each step of the created model and presents the obtained results.

Section 6: Finishes the work with a conclusion and recognizes the potential for new developments in this field.

Chapter 2

Background

This chapter provides an overview of the most important concepts related to this thesis. We divided this section into 3 main domains related to the subject: Market trading, Machine Learning and Evolutionary Computation. In the financial part, we take a closer look at the history of market trading and at what type of techniques are commonly used to analyze market evolution. After that, we dive into the ML concepts, especially into DL, a subset of ML created to deal with sizable batches of data, which is the case of Neural Networks (NN). To conclude this part, we introduce the chosen prediction model, a FNN, and all the details and technical insight that compose this complex algorithm. Finally, we present EC, a family of algorithms for global optimization inspired by biological evolution, to which GA belongs. It is also important to state that the focus of this thesis is not the financial world per se, but rather the combination of trading with the computational domain.

2.1 Market trading

To develop a better understanding of the subject addressed by this work, it is important to give the reader some clear notions related to the stock trading system and the financial world. The stock markets are currently one of the most important parts of today's global modern economy, since they influence the economic development of several countries [3], have the ability to create new opportunities for business growth [4], and are frequently considered a valuable indicator of how healthy a specific country's economy is. A trading market is essentially a place where buyers and sellers come together in order to exchange different types of financial products. Different products are generally traded in different markets. The most popular type of market among all the available ones is, without any doubt, the stock market. In it, people trade shares of publicly traded companies, providing the buyer with ownership in a corporation. Other important markets worth naming are, for example, the bond market, where buyers loan money to an entity at a fixed interest rate, and the currency market, where currency pairs are traded, which is the case of the FOREX market. Despite the differences among markets, the actions performed at each market remain equal. When trading, an investor can assume three different positions: long, short and neutral. A long position simply indicates an order for performing a security purchase, hoping that its price will increase. Once the price rises the investor has to make the decision of selling the purchased positions, or keeping them, in order to create even more profit. There is always the risk of the price falling, but it is always limited to the amount invested. Alternatively, a

short position in the market represents exactly the opposite. The investor borrows a security from a broker, and sells it immediately at the current market price, with the expectation that its value will decline. When the price decreases, the investor repurchases the same shares and returns them to the broker, making profit out of the created price difference. This action substantially increases the trading risks, and could magnify the potential losses if prices increase. Finally, a neutral position is usually taken when the investor thinks that a security is neither going up nor going down, therefore preferring to stay out of the market. Market positions should be taken according to the current market trend [5]. There are three possible market trends: bull, bear and sideways. A bull market is a market where prices are continuously increasing or are expected to increase. It could be characterized by constant uptrends which reflect an optimistic approach regarding securities' growth. This raises an opportunity for successful long positions. On the other hand, bear markets are purely the opposite. They reflect pessimism regarding price growth, and raise an opportunity for successful short positions. A sideways trend represents a constant up and down movement, with prices bouncing around a given range.

2.1.1 FOREX market

The FOREX market is the buying and selling market for currencies. FOREX is one of the largest and fastest growing markets in the world, with average daily turnover reaching nearly 5.1 trillion dollars in April 2016, according to the 2016 Triennial Central Bank Survey of FX [6]. This increasing popularity is related to some relevant aspects:

• Trading takes place 24 hours a day, 5 days a week. There are 4 FOREX markets: London, New York, Sydney and Tokyo. Every market opens 8 hours per day, starting at different hours due to time zone differences. This means that a trader can easily change from one market to another when the first one closes.

• It is not necessary to invest in a specific company or sector. The only choice that needs to be made is the currency pair that is going to be used during the trading period.

• FOREX can remain profitable even in the worst times because currencies are always traded in pairs. When the value of a currency declines, another currency's value inevitably rises.

• The entry costs are extremely low when compared to other well known markets.

Currency pairs: Cross Currency Pairs available in FOREX can be divided in three types: majors, crosses and exotics. The most important pairs are included in the majors group: EUR/USD, GPB/USD, USD/JPY, USD/CHF, USD/CAD, AUD/USD and NZD/USD. Crosses do not include the U.S dollar and are ideal for a di- versified portfolio. The exotics group includes pairs from developing countries, so they are quite illiquid with very high spreads.

Currency pair quotes: It is really simple to understand what a given quote stands for. In fig 2.1 we can see an example of a EUR/USD currency pair. The first acronym represents the base currency and the second one the quote currency. The value given to the pair represents how much one unit of the base currency is worth when converted to the quote currency. For example, if a EUR/USD pair has the value of 1.5 it

means that one Euro (EUR) is worth 1.5 US dollars (USD).

Figure 2.1: Currency pair EUR/USD

PIP & Spread: PIP stands for Price Interest Point and it represents the smallest possible change that an exchange rate can make. For currency pairs that include the US dollar, a PIP is 1/10000 of a dollar, whereas when the rate includes the Yen (JPY) a PIP is just 1/100. For example, a 5 pip spread for EUR/USD is 1.1530/1.1535. Having said that, we can divide a cross currency pair quote in two parts. The first one is the buying price and the second one the selling price. In the example stated above the buying price is 1.1530 and the selling price is 1.1535, meaning that 1.1530 is the price at which you sell this currency to the broker, and 1.1535 the price that you have to pay if you want to buy this currency from the broker. The concept of spread is simply the subtraction of the buying price from the selling price. This represents the profit taken by the broker in each transaction, since there are no costs associated with making a trade, or monthly fees for managing accounts as the ones available in other well known markets.
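As a minimal illustration of the arithmetic above, and assuming a four-decimal quote (i.e. a pair not involving the Yen), the spread in pips can be computed directly from the two quoted prices; the snippet below is only a sketch in Python:

bid = 1.1530               # buying price: what the broker pays you for the currency
ask = 1.1535               # selling price: what you pay the broker for the currency
pip_size = 0.0001          # 1/10000 of a dollar for USD-quoted pairs
spread_in_pips = (ask - bid) / pip_size
print(round(spread_in_pips, 1))   # 5.0 pips, the broker's profit per transaction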

2.1.2 Financial data

Financial data is represented through time series that sample a specific time period at a daily, monthly, hourly or even smaller sample rate. In comparison to other existent time series, financial time series present some special properties, given by the underlying structure present in financial markets. This is mainly due to the huge quantity of different factors that may influence final market prices. Figure 2.2 presents a time series plot of a market closing price. Time series patterns can be decomposed in four main components: trend, seasonal, cyclic movements and irregular fluctuations [7]. A trend pattern is a long term increase or decrease present in data. It could be seen as the directional movement taken by the analyzed series, divided in global - a trend applied to the whole time series - or local - applied to a sub-sequence of a time series. The seasonal component reflects regular fluctuations influenced by seasonal factors. Seasonal patterns are noticeable when data faces predictable changes within a fixed period of time - year, month, day, week or smaller. On the other hand, cyclic movements represent long term oscillations occurring in time series around a specific trend. Finally, irregular fluctuations are the random component associated with time series. They are unlikely to be repeated, and are generally associated with non-predictable events. Financial time series exhibit an extremely high frequency of possible

values, which is known as their most recognizable feature, creating high volatility that usually changes over time. This is due to the influence of external non-systematic factors which lead to irregular fluctuations. By contrast, systematic factors that influence a financial time series create the cycle and trend patterns present in it. Seasonality does not play an important role in this type of series [8]. Having the announced properties in mind, we can describe financial time series as:

• Temporally ordered events: Unlike other series used in cross-sectional studies, financial time series are naturally ordered events, which means that observations could never be shuffled or mixed when conducting a time series analysis.

• Non-linear: When modelling financial time series, the relationship between the used independent variables cannot be explained as a linear combination.

• Non-stationary: Time series data presents different statistics at different times. The mean and variance shown by a financial time series constantly change due to the high frequency of values that prices can assume.

• Noisy: Every sampled time series is always corrupted with an amount of noise that pollutes the signal. Financial time series are no exception, and the great of random information makes it harder to predict future price values.

Figure 2.2: time series plot

2.1.3 Technical Analysis

There are two methodologies that traders use to make investment decisions: Technical Analysis (TA) and Fundamental Analysis (FA). For the purposes of this work, we are going to focus on TA applied to FOREX [9]. FA focuses on economic and financial factors that affect a business, such as market conditions, data related with business management, and economic news, aiming to measure a security's intrinsic value. On the other hand, TA is a methodology used to forecast the future price direction of the market by analyzing market data

gathered from trading activity. Technical analysts focus on charts of price movement and various analytic tools, purely relying on statistical metrics to evaluate the health and strength of a given security. In the following subsections we present TA and consider a set of Technical Indicators (TI), the most important analytic tool used by technical analysts. Before digging any deeper into how TA and TI should be used, it is very important to understand the underlying assumptions and principles related to the domain. There are 3 basic principles [10]:

1. The market discounts everything: This statement assumes that all past, current and future information is discounted into the markets and reflected in the price of securities. Even though TA ignores funda- mental factors related to firms, the market still discounts that information considering everything from the inflation to the emotions of investors.

2. Price moves in trends: According to TA, prices always follow some kind of trend. It is believed that the probability of prices following a specific trend is much higher than that of sudden erratic movements.

3. History tends to repeat itself: TA states that there is a repetitive nature associated with price move- ments, and that market participants tend to react in a similar way to similar events. This means that historical data can be used as an important instrument to make a prediction of how the markets will behave.

These 3 main ideas lay down the foundations for modern TA. They were first developed and introduced in the Dow Theory by Charles Dow [10]. Released in the 1800s, it was the first investment and trading theory and was later refined by William Peter Hamilton [11]. TA is used in the form of Technical Indicators (TI). TI are simply mathematical calculations (traditionally on price or volume) based on the past variations of the market and defined by a formula [12]. Although there is a great amount of different TI, they can be classified according to their oscillatory behavior as trend following or momentum oscillators [13]. Next we provide the explanation of the ones used in this work.

2.1.4 Trend following TI

A trend following indicator tries to identify trends in the market. A trend represents a consistent change in prices and, hence, in the investors' expectations [13]. They are extremely useful to identify entry or exit points in a specific trend cycle. This is done by identifying whether a cycle has begun or ended from the existing crossovers between the TI and the market index. We used the following:

Simple Moving Average - SMA

Moving averages (MA) are the most basic tool used in TA. Their main goal is to smooth the trading signal according to a predefined lag, defined by a number of past periods. They are usually used as noise filters, because they allow the trader to see a cleaner signal by reducing the number of present oscillations, making a price average over the number of selected past days. Traders that use MA usually intend to buy when a MA crosses the price index in a descending way, and to sell when there is a cross in an ascending way. The SMA

is simply the arithmetic mean of the past n time periods. Close(i) represents the price of an asset on a specific day [12].

SMA_n = \frac{1}{n} \sum_{i=1}^{n} Close(i)    (2.1)
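Throughout this chapter, the indicator formulas can be sketched in Python; the library choices (pandas/NumPy) below are illustrative assumptions, not the implementation used in this work. For equation 2.1:

import pandas as pd

def sma(close: pd.Series, n: int) -> pd.Series:
    # Arithmetic mean of the last n closing prices (eq. 2.1);
    # the first n-1 positions have no complete window and stay NaN.
    return close.rolling(window=n).mean()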

Exponential Moving Average - EMA

It follows the same principle as the SMA, but it gives more importance to the latest information. This is performed by assigning more weight (exponentially) to the latest available data.

EMA_d = Close_d \times \alpha + EMA_{d-1} \times (1 - \alpha), \quad with \quad \alpha = \frac{2}{n + 1}    (2.2)

In equation 2.2, d refers to the current day, n the number of past time periods and α the smoothing factor [12].
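A small sketch of the recursion in equation 2.2, seeding the series with the first closing price (one common convention, assumed here):

import pandas as pd

def ema(close: pd.Series, n: int) -> pd.Series:
    alpha = 2 / (n + 1)                 # smoothing factor of eq. 2.2
    values = [close.iloc[0]]            # seed with the first observation
    for price in close.iloc[1:]:
        values.append(price * alpha + values[-1] * (1 - alpha))
    return pd.Series(values, index=close.index)

The same result can be obtained directly with pandas through close.ewm(span=n, adjust=False).mean().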

Figure 2.3: SMA, EMA and HMA plot with time frame of 20 days

Hull Moving Average - HMA

The HMA is an extremely fast and smooth moving average that almost eliminates the existing lag at the same time it improves smoothing along the created average.

HMA_n = WMA(2 \times WMA(data, n/2) - WMA(data, n), \sqrt{n})    (2.3)

Equation 2.3 shows the HMA formula. It is made using a Weighted Moving Average (WMA) of the difference between two WMAs, computed over a period of \sqrt{n}, with n representing the number of past time periods and data the whole time series [12]. We chose to write this formula in a more abstract way, to benefit readability.
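A possible reading of equation 2.3 in code, built on an explicit weighted moving average helper; rounding the two shorter windows to integers is an implementation detail not fixed by the text:

import numpy as np
import pandas as pd

def wma(series: pd.Series, n: int) -> pd.Series:
    weights = np.arange(1, n + 1)                      # linearly increasing weights
    return series.rolling(n).apply(
        lambda window: np.dot(window, weights) / weights.sum(), raw=True)

def hma(close: pd.Series, n: int) -> pd.Series:
    # WMA of (2 * WMA over n/2 periods - WMA over n periods), smoothed over sqrt(n) periods
    return wma(2 * wma(close, n // 2) - wma(close, n), int(np.sqrt(n)))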

In fig 2.3 we present the three moving averages introduced above. We chose a time period of twenty days, making it possible to observe the different behaviours assumed by each one of them.

Aroon - AA

The Aroon indicator attempts to identify trending or non-trending periods, and how strong the actual trend is. It quantifies the time needed for the price to reach the highest and lowest points over a set time period, as a percentage of total time. It is composed of 2 separate indicators, Aroon-Up and Aroon-Down. As initial parameter, both indicators receive one individual value n, the lookback window used to locate the n-day high for Aroon-Up and the n-day low for Aroon-Down [14]. Aroon values oscillate between 0 and 100. The higher the Aroon-Up, the stronger the uptrend is; the higher the Aroon-Down, the stronger the downtrend is.

AroonUp_n = \frac{n - \text{periods since n-period high}}{n} \times 100    (2.4a)
AroonDown_n = \frac{n - \text{periods since n-period low}}{n} \times 100    (2.4b)
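One way to sketch equations 2.4a and 2.4b, assuming the highs and lows of each period are available as pandas Series:

import pandas as pd

def aroon(high: pd.Series, low: pd.Series, n: int):
    # position of the extreme inside a window of n+1 bars (0 = oldest, n = current bar),
    # so n minus that position equals the number of periods since the n-period high/low
    pos_high = high.rolling(n + 1).apply(lambda w: w.argmax(), raw=True)
    pos_low = low.rolling(n + 1).apply(lambda w: w.argmin(), raw=True)
    aroon_up = pos_high / n * 100       # eq. 2.4a
    aroon_down = pos_low / n * 100      # eq. 2.4b
    return aroon_up, aroon_down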

2.1.5 Momentum oscillators

A momentum based indicator tries to measure the velocity of directional price movements, in order to identify the speed/strength of a price change influenced by the enthusiasm of buyers and sellers involved in the price development [13]. This type of indicator is especially important to predict rapid unexpected changes in financial assets' behavior.

Momentum - MOM

The simplest TI among all oscillators. It measures the absolute difference between today's close value and the close value n days ago. Its values express the existing trend. If the values assumed by Momentum (MOM) are positive, then we are in an uptrend; if negative, we are in the presence of a downtrend.

MOM_n = close_t - close_{t-n}    (2.5)

Equation 2.5 is used in each point of the closing price values array. t represents the current day, and n the number of considered past periods [12].

11 Figure 2.4: MOM plot

Rate of Change - ROC

The Rate of Change (ROC) is a momentum indicator that measures the speed of a change in a value over a predefined period of time. If values are above 50%, traders should be aware of overbought conditions. If they are below -50% we are under an oversold period. In terms of calculations, ROC takes the current value of a stock or index and divides it by the value from an earlier period. The formula is:

ROC_n = \frac{Close - Close_n}{Close_n} \times 100    (2.6)

With n being the number of used past periods, and close corresponding to the available closing prices [12].

Relative Strength Index - RSI

It measures the speed and change of price movements. The RSI is most commonly used on a 14-day time- frame, oscillating between 0 and 100. Traditionally, and according to Wilder, RSI is considered overbought when above 70 and oversold when below 30 [12]. The Avg Gain represents the average gain of up periods during the specified time-frame. Same thing applies to Avg Loss, but using the down periods of the time-frame. As initial parameter it receives n, the number of past periods to be considered.

RSI = 100 - \frac{100}{1 + RS}    (2.7a)
RS = \frac{\text{Avg Gain}}{\text{Avg Loss}}    (2.7b)

With Avg Gain given by the sum of gains over the past n periods divided by n.
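A sketch of equations 2.7a and 2.7b; plain n-period averages of gains and losses are used here for clarity (Wilder's original smoothing is slightly different):

import pandas as pd

def rsi(close: pd.Series, n: int = 14) -> pd.Series:
    delta = close.diff()
    avg_gain = delta.clip(lower=0).rolling(n).mean()       # average of up moves
    avg_loss = (-delta.clip(upper=0)).rolling(n).mean()    # average of down moves, as a positive number
    rs = avg_gain / avg_loss                               # eq. 2.7b
    return 100 - 100 / (1 + rs)                            # eq. 2.7a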

Moving Average Convergence Divergence - MACD

The Moving Average Convergence Divergence (MACD) is a tool for identifying entry and exit points in the market. Although we classified it as a momentum oscillator indicator, it can also be seen as a trend following TI. The MACD line is the subtraction of a slower EMA from a faster EMA (usually 26 and 12 days, respectively), and its signal line is a 9 day EMA of that difference [12]. This TI generates 3 different signals: the MACD signal, the signal line and the MACD histogram. Their formulas are:

Figure 2.5: MACD and their 3 signals plot

MACD(n, m) = EMAn − EMAm (2.8)

With n representing the number of periods of the faster EMA, and m the number of periods of the slower EMA. After calculating the MACD signal, the signal line should be calculated. It is used to trigger buy and sell signals.

Signal = EMA9(MACD) (2.9)

Next we calculate the histogram line. It visually represents the sell and buy signal, which can be identified by a change from positive to negative in MACD histogram signal value.

Histogram = MACD − Signal (2.10)

When the signal line crosses above the MACD line, a buy signal is activated, and when the signal line falls below the MACD line a sell signal is sent. The MACD histogram indicates trend turning points.

As we can observe in fig 2.5, the histogram area represents trend inversion, which appears when the MACD line (the blue signal) touches the signal line (the orange line).
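The three MACD signals of equations 2.8 to 2.10 can be sketched as follows, using the conventional 12/26/9 periods mentioned above:

import pandas as pd

def macd(close: pd.Series, n: int = 12, m: int = 26, s: int = 9):
    ema_fast = close.ewm(span=n, adjust=False).mean()
    ema_slow = close.ewm(span=m, adjust=False).mean()
    macd_line = ema_fast - ema_slow                              # eq. 2.8
    signal_line = macd_line.ewm(span=s, adjust=False).mean()     # eq. 2.9
    histogram = macd_line - signal_line                          # eq. 2.10
    return macd_line, signal_line, histogram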

Bollinger Bands - BB

Bollinger Bands (BB) are one of the most popular momentum oscillator TI. They are created by combining two volatility bands with a regular moving average (Upper band, Lower band and Middle band). The volatility bands are calculated based on the Standard Deviation (STDV), and together create a channel that automatically expands when volatility increases, and narrows when the opposite happens. The BB formulas are:

Upper band = SMA(n) + 2 × σn(close) (2.11a)

Middle band = SMA(n) (2.11b)

Lower band = SMA(n) − 2 × σn(close) (2.11c)

With n being the number of used past periods, which is generally set to 20. Once again, close corresponds to the available closing prices. Stocks are considered overbought when the trading signal touches the Upper band, thus creating a selling opportunity. Oversold periods are identified when the Lower band is touched by the trading signal, creating a buying opportunity [12].
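Equations 2.11a to 2.11c translate almost directly into code, with a rolling standard deviation standing in for σ_n:

import pandas as pd

def bollinger_bands(close: pd.Series, n: int = 20):
    middle = close.rolling(n).mean()        # eq. 2.11b
    std = close.rolling(n).std()
    upper = middle + 2 * std                # eq. 2.11a
    lower = middle - 2 * std                # eq. 2.11c
    return upper, middle, lower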

Figure 2.6: Bollinger Bands plot

Commodity Channel Index - CCI

The Commodity Channel Index (CCI) is a momentum oscillator TI that attempts to identify overbought and

oversold conditions. To identify them, as in many momentum oscillators, two thresholds are defined, generally 200 and -200. When CCI is above 200, stocks are considered overbought, and when it is lower than -200 stocks are oversold [12]. Formula:

CCI = \frac{TP - SMA_n(TP)}{0.015 \times MD(TP)}, \quad with \quad TP = \frac{Close + High + Low}{3}    (2.12)

In equation 2.12 TP stands for Typical Price and is the average of the 3 sample values. The MD present in the CCI formula is the Mean Deviation of TP.
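A sketch of equation 2.12; computing the mean deviation over the same n-period window is an assumption, since the text does not fix the window used for MD:

import pandas as pd

def cci(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 20) -> pd.Series:
    tp = (close + high + low) / 3                                   # typical price
    sma_tp = tp.rolling(n).mean()
    mean_dev = tp.rolling(n).apply(lambda w: abs(w - w.mean()).mean(), raw=True)
    return (tp - sma_tp) / (0.015 * mean_dev)                       # eq. 2.12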

Percentage Price Oscillator - PO

The Percentage Price Oscillator (PO) is a momentum oscillator that measures the difference between two moving averages over the value of the slower (longer-term) moving average. The end result is a signal that tells the trader where the short-term average is in relation to the longer-term average.

PO(n, m) = \frac{EMA_n - EMA_m}{EMA_m}    (2.13)

With n the time period relative to the shorter moving average and m the time period relative to the slower one.

Chande Momentum Oscillator - CMO

The Chande Momentum Oscillator (CMO) is a technical momentum indicator that also attempts to identify overbought and oversold conditions. The CMO is created by calculating the difference between the sum of all recent higher closes and the sum of all recent lower closes and then dividing the result by the sum of all price movement over a given time period. The result is multiplied by 100 to give the -100 to +100 range. The defined time period is usually 20 periods.

CMO_n = 100 \times \frac{Su_n - Sd_n}{Su_n + Sd_n}    (2.14)

In the above equation 2.14, Su_n represents the sum of the differences between the current close and the previous close on up days for the specified period n. Sd_n is the sum of the absolute value of the difference between the current close and the previous close on down days for the specified period. Up days are days when the current close is greater than the previous close. Down days are days when the current close is less than the previous close.

Average Directional Index - ADX

The Average Directional Index (ADX) is a directional movement indicator that has as its main objective quantifying the strength of a given trend, regardless of its type (bullish or bearish). The ADX derives from two other indicators: the Plus Directional Indicator (+DI) and the Minus Directional Indicator (-DI). The +DI measures the presence of an uptrend, while -DI measures the presence of downtrends. Both of them are usually plotted together

15 with the ADX, to get a better interpretation of price movements. Needed calculations are the following:

1. Calculate the up and down movements, by comparing the current high and low prices with the previous ones.

UpMoves_t = High_t - High_{t-1}    (2.15a)

DownMoves_t = Low_{t-1} - Low_t    (2.15b)

2. Calculate the Positive Directional Movement (+DM) and Negative Directional Movement (-DM). These two formulas are based on the last two announced equations. Their job is to filter them according to each value's sign. +DM filters out negative values that represent price decline, replacing them by 0. -DM does exactly the opposite, replacing price increase values by 0. Note that in eq. 2.15b positive values represent price decreases.

+DM = Max(UpMoves, 0) (2.16a)

−DM = Max(DownMoves, 0) (2.16b)

3. Calculate the Positive Directional Index (+DI) and Negative Directional Index (-DI). This is done by computing a Smoothed Moving Average (SMMA), with respect to n past periods, of +DM and -DM over the price volatility, expressed by the Average True Range (ATR) TI. The value is then multiplied by 100 in order to be expressed as a percentage.

+DI = \frac{SMMA(+DM)}{ATR(Close, n)} \times 100    (2.17a)
-DI = \frac{SMMA(-DM)}{ATR(Close, n)} \times 100    (2.17b)

4. Finally we can calculate the ADX. To obtain it, we compute an SMMA of the absolute value of +DI minus -DI, divided by +DI plus -DI (a code sketch of the whole procedure follows this list).

ADX = SMMA\left(\frac{\left|{+DI} - ({-DI})\right|}{{+DI} + ({-DI})}\right) \times 100    (2.18)

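A sketch of the four steps above; an exponentially weighted average with α = 1/n stands in for the SMMA, and the true range is computed inline instead of reusing the ATR of section 2.1.6, so this is only an approximation of the described procedure:

import pandas as pd

def adx(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    up_moves = high.diff()                          # eq. 2.15a
    down_moves = -low.diff()                        # eq. 2.15b (Low_{t-1} - Low_t)
    plus_dm = up_moves.clip(lower=0)                # eq. 2.16a
    minus_dm = down_moves.clip(lower=0)             # eq. 2.16b
    tr = pd.concat([high - low,
                    (high - close.shift()).abs(),
                    (low - close.shift()).abs()], axis=1).max(axis=1)
    smma = lambda s: s.ewm(alpha=1 / n, adjust=False).mean()
    plus_di = smma(plus_dm) / smma(tr) * 100        # eq. 2.17a
    minus_di = smma(minus_dm) / smma(tr) * 100      # eq. 2.17b
    dx = (plus_di - minus_di).abs() / (plus_di + minus_di) * 100
    return smma(dx)                                 # eq. 2.18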
Double Smoothed Stochastic - DSS

The Double Smoothed Stochastic (DSS) can be a helpful momentum based TI for swing traders. It applies two EMAs of two different periods to a standard Stochastic %K. Usually values of 8 and 13 are used for calculating the EMAs. It ranges from 0 to 100 and identifies overbought and oversold periods. Formula:

DSS = 100 \times \frac{EMA_n(Close - LowestClose)}{EMA(HighestClose - LowestClose)}    (2.19)

16 Figure 2.7: Double Smoothed Stochastic plot

In fig. 2.7 we also plotted the overbought and oversold thresholds, in order to see where the DSS thinks that the index is in extreme conditions. As one can see, overbought conditions appear when values are over 80, and oversold when values are below 20.

2.1.6 Volatility

Volatility is a measure of uncertainty and risk. It reflects how prices are currently moving. A high volatility value means that the range of values assumed by a security could change very dramatically in a small period of time. On the other hand, low volatility means that prices are stable, and decreasing or increasing at an acceptable rate.

Average True Range - ATR

The Average True Range (ATR) is usually the TI chosen by traders to quantify the amount of existing volatility in a specific day. The ATR looks at how far price swings, comparing the highest price values with the lowest ones.

ATR_t = \frac{ATR_{t-1} \times (n - 1) + TR_t}{n}    (2.20)

Where t represents the current day and n the number of past time periods. TR_t represents the True Range (TR), which can be calculated as:

TR_t = \max\{High_t - Low_t,\ |High_t - Close_{t-1}|,\ |Low_t - Close_{t-1}|\}    (2.21)

This calculation retrieves the highest value among the 3 calculated ranges, taking that as the true one.

We also used the Average True Range Percent (ATRP). It is almost the same as the ATR, but results from dividing the ATR by the close values and converting the result to a percentage.

ATRP_n = \frac{ATR(Close, n)}{Close} \times 100    (2.22)
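Equations 2.20 to 2.22 can be sketched as below; the exponentially weighted mean with α = 1/n reproduces Wilder's recursion of equation 2.20:

import pandas as pd

def true_range(high: pd.Series, low: pd.Series, close: pd.Series) -> pd.Series:
    prev_close = close.shift()
    # eq. 2.21: largest of the three candidate ranges
    return pd.concat([high - low,
                      (high - prev_close).abs(),
                      (low - prev_close).abs()], axis=1).max(axis=1)

def atr(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    tr = true_range(high, low, close)
    return tr.ewm(alpha=1 / n, adjust=False).mean()       # eq. 2.20

def atrp(high: pd.Series, low: pd.Series, close: pd.Series, n: int = 14) -> pd.Series:
    return atr(high, low, close, n) / close * 100         # eq. 2.22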

2.1.7 Other indicators

In this subsection we highlight indicators used as features in this work that do not fit into the above mentioned categories. We also present some indicators that do not fall under the TI family, but since they are used as features as well, we decided to depict them in this section.

Detrended Price Oscillator - DPO

The Detrended Price Oscillator (DPO) is another oscillator TI used in this work. The DPO is an extremely useful tool, because it removes the trend from prices, making it easier to identify the cyclical price movements present in a given index. Although being an oscillator, the DPO is not considered a momentum one. This is because it is not a real-time indicator and cycles are identified using a time displacement technique. Cycles are estimated by counting the periods between peaks or troughs.

DPO_n = Close - SMA_{\frac{n}{2}+1}(Close)    (2.23)

The displacement effect of this indicator is caused, as we can see in equation 2.23, by the subtraction of an SMA over the last n/2 + 1 periods, with n being the number of past periods for which the DPO signal is calculated.

Kurtosis - Kurt

Kurtosis is not a financial TI but rather a statistical measure that indicates the "tailedness" of the probability distribution of the provided data around the mean. It indicates the combined weight of a distribution's tails relative to a Gaussian distribution. Data could be heavy-tailed, meaning there is excess data in the tails, or light-tailed, which suggests a lack of it. Kurtosis is applied to financial returns instead of the market index. Since financial time series present a non-stationary behaviour, the usage of returns is much more reasonable. Their distribution resembles a Gaussian distribution much more closely than absolute market values do. The kurtosis of a Gaussian distribution is 3. Kurtosis is given by the following formula:

kurtosis = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^4 - 3    (2.24)

In the above equation, n represents the number of samples, x_i an individual sample, \bar{x} the mean, and \sigma the standard deviation of the given data. Note that this formulation retrieves the excess kurtosis present in the data, since we subtract 3 from the data kurtosis.

Skewness - SKEW

Like Kurtosis, Skewness is also a statistical measure that quantifies shape properties of the data distribution. Skewness measures the asymmetry of the distribution when comparing it with a Gaussian distribution. It retrieves how skewed the data is in terms of amount and direction. If skewness is 0, the distribution of the data is perfectly symmetric. A positive skew indicates a longer right tail, while a negative value indicates a longer left tail. Like kurtosis, it is also applied to returns. Its formula is:

skewness = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{\sigma}\right)^3    (2.25)

With n representing the number of samples, x_i an individual sample, \bar{x} the mean, and \sigma the standard deviation of the given data.
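Both shape descriptors are readily available in scipy; a toy sketch applied to hourly returns (the closing prices are made up for illustration, and fisher=True already returns the excess kurtosis of equation 2.24):

import pandas as pd
from scipy.stats import kurtosis, skew

close = pd.Series([1.1510, 1.1532, 1.1528, 1.1544, 1.1539, 1.1551])   # toy hourly closes
returns = close.pct_change().dropna()       # descriptors are computed on returns, not prices
print(kurtosis(returns, fisher=True))       # excess kurtosis, eq. 2.24
print(skew(returns))                        # skewness, eq. 2.25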

Standard Deviation - STD

Standard Deviation is a statistical measure that indicates how spread out the samples of a dataset are, compared to its mean. It is generally represented by the Greek letter \sigma. Considering n the number of available samples, x_i a sample and \bar{x} the mean of all samples, we can write the standard deviation as:

\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}    (2.26)

Standard Variance - STV

The concept of Standard Variance is analogous to Standard Deviation. The main difference is that the values are squared, which shifts absolute units to square units. It measures the average degree to which each point differs from the mean.

\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}    (2.27)

2.2 Machine Learning

Machine Learning (ML) is a sub-field of Artificial Intelligence (AI) that, given a great amount of information about a specific problem, tries to provide an answer to that problem by describing it in discrete terms, with the main focus of providing a correct answer or performing the correct action. Contrary to many other subareas present in the Computer Science field, which apply a logical deductive way of working, ML has a different way of achieving results, due to its inductive behavior, which could be seen as an empirical way of gaining knowledge [15]. This is the act of supposing something based on previous observations or assumptions already made about something. For example, we know that the sun rises every day, so we could inductively suppose that tomorrow the sun is going to rise again. This notion of inferring or inducting some of this domain knowledge into an AI algorithm makes it gain some sort of understanding about the problem and the environment where it is inserted. This capacity is achieved by feeding data into algorithms. Data is normally divided by columns, usually called features, that quantify different dimensions of the problem. In the given example, we could have as daily features the current weather and the current wind. The ML algorithm is going to try to

make correlations between the existent values in each one of them, in order to know if tomorrow the sun is going to rise or not. Regarding ML, one could identify 3 types of learning techniques: supervised learning, unsupervised learning and reinforcement learning. Since this work deals with supervised learning, it is worthwhile to mention the 2 types of problem approaches within it: classification and regression problems. Classification aims to classify something with respect to a discrete set of values. Alternatively, regression is related with predicting continuous values. Supervised Learning models are algorithms that learn by correct examples. Input must be divided in two vectors: the model features x, and the output label or target variable y. Model features were already explained above, and represent the dimensionality of the problem. The target variable y represents correctly labelled examples corresponding to a given set of features, in a way that y = f(x), with f being the mapping function that correctly represents the given data, i.e., the function that correlates the x values with a y value. When running a supervised learning algorithm, 2 different phases are normally performed: the training phase and the testing phase. This is done with two different datasets, obtained from one original dataset. They both result from splitting the initial data into two smaller chunks, the trainset and the testset. The trainset usually has a size of 70% to 85% of the original data and, as the name indicates, is used to train the model. With it, the algorithm tries to find an approximation of the mapping function f, using the y vector values as a "gold standard". This procedure shows how the model gains insight about the given data, making it possible to predict new values of y according to the found function. This is where the testset comes into use. Generally with a size between 15% and 30% of the original data, it is used to test the trained model with new, never seen data, outputting a score or prediction of it. The key difference between the train and test lies in the fact that when test data is fed to the algorithm, the y vector is not present, because it holds the desired results. It is later used when evaluating the model, by comparing it with the achieved results using different evaluation metrics. In order to create a ML model that correctly classifies new examples, the inference process must be carefully performed. Too much training, or a lack of training, can result in two well known problems in this domain: overfitting and underfitting. Overfitting occurs when the created model learns the training data too well. This means that all the details and noise present in it are assumed by the model in such a way that the target function f is not able to correctly map new elements, which results in a poor performance during the testing period. Underfitting is exactly the opposite. It occurs when a model is not able to capture the underlying structure present in the given data, hence not being able to correctly approximate f. A compromise between fitting and generalization should be made in order to create a balanced model that is able to capture the relevant assumptions present in the training data without compromising its performance when new elements are presented to it.
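Because financial observations are temporally ordered (section 2.1.2), the split described above reduces to slicing the data chronologically rather than sampling it at random. A minimal sketch, with the 80/20 proportion chosen only as an example:

import numpy as np

def chronological_split(X: np.ndarray, y: np.ndarray, train_fraction: float = 0.8):
    # keep temporal order: the oldest part trains the model, the most recent part tests it
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]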

2.2.1 Artificial Neural Networks

Artificial Neural Networks (ANN) are a powerful information processing model that was designed using as inspiration the way the human biological nervous system works and deals with information. This type of architecture comes from an AI movement known as connectionism [16], which believes that knowledge is stored in the connections between interconnected processing units, usually known as neurons. Unlike other types of models available in the ML domain, ANN are also distributed models. Each concept learned by them is represented by the combination of many neurons, and each individual neuron participates in the representation

of many different concepts. Another important factor that sets them apart is the parallel way of processing information, in contrast with the sequential approach used by the majority of prediction models. In a sequential model, it is possible to use reverse engineering processes to figure out what were the premises that made the algorithm reach a specific conclusion. In ANN that procedure is not possible, because each neuron learns at the same time. That is why people usually say that ANN act as a "black box".

Perceptron Model

To understand how neural networks work, it is important to first introduce the concept of the perceptron. The perceptron model was the first model that tried to replicate the structure of a biological neuron. It was developed in the 1950s by Frank Rosenblatt [17], and its behavior resembles the logical gates AND and OR, available on computers. A perceptron is essentially a gate that receives several inputs and produces a single binary output that corresponds to a decision considering the received inputs. To generate a decision making mechanism, a set of weights is attributed to the connections between neurons. The number of weights is equal to the number of inputs received by the network. To activate the neuron (having 1 as output), the weighted sum of the inputs must be greater than a certain threshold. This procedure is done by a step function, and expresses how a perceptron can weigh up different kinds of evidence in order to make decisions. Fig 2.8 represents the perceptron architecture.

Figure 2.8: Perceptron architecture

The final output is computed by applying the step function to the weighted sum of inputs, i.e., the sum of each input multiplied by its weight. It is given by the formula:

output = \begin{cases} 0 & \text{if } \sum_{j=1}^{n} w_j x_j \leq threshold \\ 1 & \text{if } \sum_{j=1}^{n} w_j x_j > threshold \end{cases}    (2.28)

With w_j being the weight that corresponds to the input x_j. It is possible to obtain different results by varying the values of the weights and threshold. This allows us to create new models, capable of outputting different decisions. This may seem an inflexible and simple approach to achieve a decision making model, but if we put together a large number of connected perceptrons forming a complex layered network - a Multilayer Perceptron Network (MLP) - it is possible to produce incredibly sophisticated decisions.

As one can expect, a network created by perceptrons represents a large number of variables that need to be controlled. The number of inputs, weights and thresholds could grow exponentially and therefore formula 2.28 should be simplified. MLPs store their parameters as matrices. Therefore, we could represent the weighted sum \sum_{j=1}^{n} w_j x_j as a dot product and rewrite it as w \cdot x, where w and x are vectors whose components are the weights and inputs, respectively. Another change is to remove the threshold by moving it to the opposite side of the expression, substituting it by what is known as the bias, b \equiv -threshold [18]. We could interpret b as being a measure of how prone the perceptron is to "fire" or not. The whole formula could be rewritten as:

output = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}    (2.29)
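Equation 2.29 in code, a minimal sketch of a single perceptron; the weights and bias below are purely illustrative values:

import numpy as np

def perceptron(x: np.ndarray, w: np.ndarray, b: float) -> int:
    # step activation: fire (output 1) only if the weighted evidence plus the bias is positive
    return 1 if np.dot(w, x) + b > 0 else 0

w, b = np.array([0.6, 0.4]), -0.5
print(perceptron(np.array([1.0, 0.0]), w, b))   # 1: enough evidence to fire
print(perceptron(np.array([0.0, 1.0]), w, b))   # 0: weighted sum does not exceed the threshold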

Although things now seem more consistent, there is still a major downside in the perceptron architecture: the reaction of the step function to changes in the network tunable parameters. A small change in weights or biases could easily flip the neuron output from 0 to 1, thus activating it. This could potentially make the network assume a "cascade" behaviour, where a small change in a neuron changes the output of other attached neurons. That is why we introduce Feedforward Neural Networks in the next subsection, which overcome this problem by using smoother activation functions that, when facing small changes in input parameters, produce small changes in output values.

Feedforward Neural Network - FNN

As introduced in the last part of subsection 2.2.1, a group of interconnected perceptrons forms a network. A Feedforward Neural Network (FNN) is simply a group of fully connected neurons organized in layers, where the connections between them do not form any type of cycle, i.e, information flows only in one direction, from the input nodes to the output nodes. This means that there are no recursive connections or loops linking neurons and making information go backwards, like in Recurrent Neural Networks (RNN).

FNN are part of a subclass of ML algorithms known as Deep Learning (DL), which concerns algorithms that try to reproduce the structure of the brain. Besides the traditional input and output layers, they use a "cascade" of multiple fully connected layers, responsible for non-linear processing, transformation, and feature extraction. They usually have built-in non-linear functions. This is needed for two reasons: first, the majority of problems cannot be explained by a linear function. If that happens, there is no need for DL and conventional ML methods could be used. The second is that linear combinations of linear functions result in a linear function. Fig 2.9 showcases the FNN architecture.

The usage of non-linear functions in the middle layers leads us to one of the main advantages of using standard FNN (and other types of NN): they can approximate any given non-linear function. This means that, in theory, NNs can solve every existent problem, since the majority of problems can be translated into mathematical terms by means of a function. This was proved by Hornik et al. in 1989 with the universal approximation theorem [19], showing that for any continuous function f on a compact set K, there exists a feedforward neural network, having only a single hidden layer, which uniformly approximates f to within an arbitrary ε > 0 on K.

Figure 2.9: Feedforward Neural Network architecture

As with other algorithms in ML, FNN can be used for several different tasks, namely supervised learning and unsupervised learning. Supervised learning tasks can be divided in regression and classification. This defines the type of activation present in each layer and neuron of the network (different layers could present neurons with different functions). Input neurons just let information pass and usually do not have any type of activation function. Hidden and output neurons use more refined functions for neural activation. Below we present a list of the most common activation functions, followed by a short code sketch of them. Have in mind that z = w · x + b, as explained in eq. 2.29.

• Linear: This represents a straight line function, where the computed activation is proportional to the input. The output produced by a function of this type will lie in the domain to which the input values belong. For example, if the input is a continuous value, then the output y of this activation will respect the condition y ∈ R. This is useful when one deals with regression problems and tries to predict a continuous value. In this type of problem, linear activations must be applied to the final layer of the network.

σ(z) = z (2.30)

• Sigmoid: The sigmoid or logistic function is a non-linear function that resembles the behaviour of the perceptron step function in a smoother way. Like the step function, it is bounded between 0 and 1, but instead of having a step, with values above a certain x abruptly changing the y value from 0 to 1, it forms an "S" shaped curve that allows the output to take any value in the interval between 0 and 1, as shown in fig 2.10. Therefore, small changes in weights or biases result in small changes in the neuron's output. It is applied to the hidden layer neurons or to the final layer of the network if we are in the presence of a binary classification problem.

σ(z) = \frac{1}{1 + e^{-z}} \qquad (2.31)

Figure 2.10: Sigmoid function

• Softmax: The softmax function is a generalization of the sigmoid function. While the sigmoid function can only handle the prediction of 2 classes, the softmax function is used when dealing with multi-class classification problems. Like the sigmoid function, it also squashes each output to be between 0 and 1, and divides each output such that the total sum of the outputs is equal to 1. This transforms the output into class probabilities, calculating the probability distribution over K different classes. It should be used in the last layer of the network.

σ(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \qquad (2.32)

• Hyperbolic Tangent: The hyperbolic tangent or tanh is another non-linear function applied to the neurons present in the hidden layer. It is similar to the sigmoid, but the output is bounded between -1 and 1. It is also a zero centred function (the sigmoid is not), which could be beneficial during the learning phase of the network [20].

σ(z) = \tanh(z) = \frac{2}{1 + e^{-2z}} - 1 \qquad (2.33)

• Rectified Linear Unit: Rectified Linear Units, usually known as ReLUs, are the simplest non-linear activation functions that could be used. Their output is simply 0 if the input value is less than 0, and the raw input otherwise. A good thing about using ReLUs is the fact that training a network becomes much faster. When using tanhs or sigmoids in the hidden neurons, almost every neuron present in the network is likely to be activated when processing an input, i.e, almost every activation is used to describe an output. The activations could become really dense. With ReLU only the most important neurons remain activated, making activations more sparse and effective.

σ(z) = max(z, 0) (2.34)
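As referenced above, the following minimal NumPy sketch gathers the listed activation functions; the example input vector is purely illustrative.

```python
import numpy as np

def linear(z):
    return z                                     # eq. 2.30

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))              # eq. 2.31

def softmax(z):
    e = np.exp(z - np.max(z))                    # shifted for numerical stability
    return e / e.sum()                           # eq. 2.32, outputs sum to 1

def tanh(z):
    return 2.0 / (1.0 + np.exp(-2.0 * z)) - 1.0  # eq. 2.33, equivalent to np.tanh(z)

def relu(z):
    return np.maximum(z, 0)                      # eq. 2.34

z = np.array([-2.0, 0.0, 3.0])                   # illustrative pre-activation values
print(sigmoid(z), softmax(z), relu(z))
```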

Backpropagation

Backpropagation is a method for training ANN, frequently described as the learning mechanism of neural nets. Developed by Werbos in 1974 [21], it is used for optimizing the final network output by adjusting its weights. The optimization is performed during the training phase, through the usage of Gradient Descent [22], or other variants of this mathematical optimization algorithm. When training an ANN with the backpropagation algorithm, one can identify two major procedures: the forward pass and the backward pass. The first one is simply the forward passage of the input values from the input layer to the output layer, including all the transformations performed on the hidden layers of the network. When arriving at the last layer, a cost function is computed between the outputted values and the target vector that includes the correctly labelled data. This cost function expresses the difference between the achieved results y and the true values yˆ, and is going to be minimized during backpropagation. Different cost functions are used for different problems. The most used ones are:

1. Mean Squared Error: The mean squared error (MSE) is the most commonly used loss function when dealing with regression problems. It simply calculates the average of the squared deviation errors, that is, the differences between the true values and what was estimated. Squaring removes any negative signs, and also gives more weight to larger differences.

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \qquad (2.35)

With n being equal to the number of predictions, yi the true value or estimator, and yˆi the predicted value.

2. Categorical Cross Entropy: The categorical cross entropy is a function used in classification problems. It can be used for binary and multi-class problems.

Loss = -\frac{1}{n} \sum_{i=1}^{n} y_i \log \hat{y}_i \qquad (2.36)

With n being equal to the number of predictions, yi the true value or estimator, and yˆi the predicted value.

Note that each one of these functions, as well as the functions used as activations, share an important property: they are all differentiable functions. This means that their derivative exists at all points. This is a key factor for the backward pass. After computing prediction errors using the cost function, network weights must be updated in order to improve the outputted results, ultimately converging the system to the smallest possible error. This is done by using Gradient Descent, an algorithm that tries to find the optimal parameters that allow the network to minimize the cost function to a global minimum. To update parameters, a constant ∆W should be added to each weight. For each individual weight, ∆W is given by:

\Delta W_j = -\eta \times \frac{\partial E_{total}}{\partial W_j} \qquad (2.37)

This calculates the partial derivative of Etotal, the total error given by the cost function, with respect to Wj, the existent network weights, i.e, how much a change in Wj will affect the total error Etotal. This is why every system function should be differentiable. Calculating ∆Wj takes advantage of a well known calculus formula, the chain rule. The chain rule says that if a function g is differentiable at the point x, and another function f is differentiable at g(x), then the composite function f ◦ g is also differentiable at the point x, and its derivative is given by [18]:

(f \circ g)'(x) = f'(g(x)) \cdot g'(x)

If z = f(y) and y = g(x), then we can write: \qquad (2.38)

\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x}

The parameter η is called the learning rate, and it defines the speed of learning. It should be carefully chosen: a large learning rate could make the gradient take large steps towards the optimal solution, raising the possibility of skipping an optimal minimum. On the other hand, a too small learning rate could make the system too slow in terms of convergence. This algorithm updates its parameters for a specified number of epochs, where an epoch corresponds to a forward pass plus a backward pass over the training data. When the final epoch is reached, the final weights are updated and stored, thus creating a final model, fitted in relation to the given data.
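The sketch below illustrates one gradient descent weight update for a single sigmoid neuron trained with a squared error cost (a 0.5 factor is used so the derivative comes out clean); it is an illustrative toy, not the training loop used later in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_weights(x, y, w, b, eta=0.1):
    # Forward pass for a single sigmoid neuron
    y_hat = sigmoid(np.dot(w, x) + b)
    # Backward pass via the chain rule (eq. 2.38):
    # dE/dw_j = dE/dy_hat * dy_hat/dz * dz/dw_j
    dE_dyhat = y_hat - y                 # derivative of E = 0.5 * (y_hat - y)^2
    dyhat_dz = y_hat * (1.0 - y_hat)     # derivative of the sigmoid
    grad_w = dE_dyhat * dyhat_dz * x     # dz/dw_j = x_j
    grad_b = dE_dyhat * dyhat_dz         # dz/db = 1
    # Weight update following eq. 2.37: delta_W = -eta * dE/dW
    return w - eta * grad_w, b - eta * grad_b

w, b = np.array([0.5, -0.3]), 0.1        # illustrative initial parameters
w, b = update_weights(np.array([1.0, 2.0]), 1.0, w, b)
print(w, b)
```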

2.2.2 Genetic Algorithms

The infinite variety of living species on planet Earth is a result of one of the most sophisticated and unique mechanisms that one could experience: natural selection. Evolutionary Computation (EC) is a family of algorithms inspired by this process, which can be described as empirical problem solvers that use a set of stochastic and metaheuristic optimization tools in order to achieve an optimal or near-optimal problem solution. Genetic Algorithms (GA) are a powerful and extremely versatile optimization tool that is part of this broader family. As part of it, they try to mimic the process observed in the natural evolution of species, being often considered an evolutionary parameter tuning algorithm. They follow the "survival of the fittest" methodology introduced by Charles Darwin in evolutionary theory [23]. This implicitly means the application of that principle in the breeding process of new generations, composed by different individuals. Individuals are represented by chromosomes. Therefore, each generation is represented by a group of chromosomes, and each chromosome represents a point in the existing search space, interpreted as a possible problem solution. From a biological point of view, a chromosome is a DNA molecule composed of a set of different genes. We can translate this simple analogy into the algorithmic spectrum by qualifying a chromosome as an array structure where each element, a gene, is a weight that represents a model parameter. To apply Darwin's methodology, a fitness function is applied to each one of the created individuals. The fitness function tells us how an individual performs with respect to a set of different parameters (mapped as genes) that try to explain a problem domain. After its calculation, a set of operations present in evolutionary theory is applied, in order to form a new generation of individuals based on the characteristics of the previous generation's individuals. In this process some individuals are rejected based on the performance calculated by the fitness function. Only the elements with the highest values of performance will survive, and those will be the ones used to breed new chromosomes. By the end, the fittest element is the one chosen as the optimized individual. The introduction of these biological processes accelerates the algorithmic convergence, making GAs viable alternatives to other more exhaustive and potentially expensive parameter tuning methods like Grid Search and Random Search. The algorithm can be decomposed in the following steps:

1. Population initialization - At the beginning of the algorithm, chromosomes are randomly generated in order to achieve diversity. If one wants to obtain the best fitted optimized values of a given population, then having a diverse set of individuals is a key factor for finding an optimal problem solution. A generally used and well accepted rule is to initialize the algorithm with a population size approximately equal to 10 times the dimensionality, i.e, the number of genes [24]. Samples have a fixed length of genes. Each gene is coded with values in a given interval, which usually is binary, but could represent a different range of values. This decision is problem dependent, but one should have in mind that larger ranges lead to larger search spaces. A too large search space could be beneficial in terms of finding an optimal solution, but it could also be too time and resource consuming. A binary, 4 dimensional chromosome is represented in fig. 2.11.

Figure 2.11: Binary 4 gene chromosome representation

2. Fitness function calculation - This function's main goal is to evaluate the solution domain. It takes a candidate individual as input and outputs how fit it is as a solution of a given problem. Different fitness functions could be used, since this decision is strictly domain dependent. For example, imagine an optimization problem concerning a classifier that aims to classify its input data into 3 different categories. To achieve maximum performance in this task, one could select as fitness function the well known classification metric, accuracy. This way, the GA is going to evaluate each individual by calculating the number of correct predictions in terms of percentage. In this case the GA will face a maximization problem, and the fittest individual will be the one with the highest value of accuracy.

3. Selection - After performing an evaluation of every solution available in a given population, it is time to apply the "survival of the fittest" principle. As previously explained, only the fittest solutions, ranked with higher fitness values, are going to be selected to proceed and take part in the breeding process of a new chromosome generation. There are several ways of performing this decision process [25]:

• Roulette Wheel Selection - The Roulette Wheel Selection (RWS) is part of a set of methods where each individual can turn into a parent chromosome with a probability proportional to its fitness function value. This probability is given by:

p_i = \frac{f_i}{\sum_{i=1}^{n} f_i} \qquad (2.39)

With fi being the fitness function value of a population individual. A roulette wheel is then formed with each individual's probability, and a random selection (resembling rotating a wheel) is performed. Note that individuals with higher probabilities will have a larger slice of the wheel and obviously a higher probability of being chosen as candidates for the next generation breeding process.

• Stochastic Universal Sampling - The Stochastic Universal Sampling (SUS) is similar to RWS, since each individual also has a probability proportional to its fitness. The main difference between these two methods is that instead of "spinning the roulette wheel" k times to obtain k parents, all parents are chosen in just one spin of the wheel. This tries to give weaker members a chance of being used to form the next generations.

• Tournament Selection - Tournament Selection (TOS) is one of the most used methods in the literature. TOS implements a tournament mechanism among population individuals, where only the ones with the highest fitness remain in the end. For a given population of individuals, n tournaments are performed. Each tournament is done by selecting k individuals by random sampling. The elements with the highest fitness value in each tournament are chosen to be parents in the next generation. Small values of the tournament size k will preserve diversity among the population, while large values will give smaller chances to weak individuals. Fig. 2.12 presents the above explained behaviour.

Figure 2.12: Tournament selection method

• Rank Selection - Rank Selection is a method that attempts to prevent early convergence. It is mostly used when individuals in the population have similar fitness values. Instead of using fitness values directly to perform the selection, this method creates a rank constructed from the fitness values. For example, consider 4 individuals with fitness values 14, 5, 36 and 75, respectively. To create ranks for each individual, the fitness values should be ordered in ascending order: 1st: 5, 2nd: 14, 3rd: 36 and 4th: 75. The sum of all ranks is then computed, being equal to 10 (1+2+3+4). Individual probabilities are now calculated by dividing each rank by the sum of all ranks.

• Random Selection - Simply selecting parent chromosomes by random sampling. This improves diversity but slows down the total convergence time.

4. Crossover - In order to generate new individuals, a crossover methodology is used. This process takes two or more parent chromosomes (selected via the selection process), and creates new individuals based on a combination of each parent's genes. To perform the crossover, a splitting point must be chosen. The two most common methodologies are one point crossover and multi-point crossover. In the first one, each parent chromosome is divided at the same randomly generated crossover point, and the created tails are combined to form a new individual. Multi-point crossover is the generalization of the previous method, but with n splitting points. Fig 2.13 represents this operation.

Figure 2.13: Crossover operation

5. Mutation - Mutation, as crossover, is a genetic operator, commonly performed after crossover takes place. As the name indicates, the mutation operation is simply the process of mutating one or more gene values that exist in the chromosome vector structure. It is used to introduce and preserve diversity in the population. Individuals are selected for mutation based on a low value probability pm. This probability should be low, otherwise the GA will resemble a random search mechanism. Mutated values are sampled from a distribution function selected according to the gene's value domain. For example, if we are using a chromosome that has only binary genes, a gene mutation will correspond to a bit flipping from 0 to 1 or vice-versa. If each gene assumes integer values in a range between two values, then a mutation value could be sampled from a Uniform distribution, defining the lower and upper bounds of the distribution.

6. Fitness evaluation - This part of the algorithm is extremely important, because it is where it is determined when a GA run will end. The fitness value calculated by the fitness function, now attached to each received chromosome, is again checked to evaluate the fitness obtained by every candidate. Until the termination criteria is reached, an initially predefined number of generations, the algorithm keeps taking two steps backwards, performing the crossover and mutation operations again, therefore initializing the breeding cycle of a new generation. The whole cycle of the GA is depicted in fig 2.14, and a minimal code sketch of this cycle is given after it.

Figure 2.14: Genetic algorithm flowchart
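As referenced above, the following self-contained Python sketch runs the cycle of fig 2.14 on binary chromosomes, using tournament selection, one point crossover and bit-flip mutation; the placeholder fitness function (counting ones) and all numeric settings are illustrative assumptions, not the configuration used in this thesis.

```python
import random

def fitness(chrom):
    return sum(chrom)                               # placeholder fitness: number of ones

def tournament(pop, k=3):
    return max(random.sample(pop, k), key=fitness)  # tournament selection of size k

def crossover(p1, p2):
    point = random.randint(1, len(p1) - 1)          # one point crossover
    return p1[:point] + p2[point:]

def mutate(chrom, pm=0.05):
    return [1 - g if random.random() < pm else g for g in chrom]  # bit-flip mutation

def run_ga(n_genes=10, pop_size=20, n_gen=30, cxpb=0.9):
    # Population initialization with random binary chromosomes
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for _ in range(n_gen):                          # termination criteria: number of generations
        offspring = []
        for _ in range(pop_size):
            child = tournament(pop)
            if random.random() < cxpb:
                child = crossover(child, tournament(pop))
            offspring.append(mutate(child))
        pop = offspring
    return max(pop, key=fitness)                    # fittest individual of the last generation

print(run_ga())
```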

2.3 Related Work

This section is intended to address and present some solutions related to the work that was developed. The section is divided in 2 subsections. The first one is directly related to works where NN take a major part in the problem solution, in financial contexts. The second subsection reports works performed with GAs. In the end, a summary table is provided, where the most relevant solutions can be compared.

2.3.1 Related works on Neural Networks

Over the last 20 years, NN have been a largely used model when it comes to financial related works. They have proved to be solid models, capable of extracting and correlating different existent features in order to achieve useful and reliable results. FNN are one of the most used models, although other NN types have been used for forecasting and modelling of financial markets. Yao and Tan [26] proposed a neural network model that received as input a set of different technical indicators along with weekly time series data, to capture the underlying "rules" of the movement in currency exchange rates. They tested it with 5 major currencies against the US dollar (USD), namely the Japanese Yen (JPY), Deutsch Mark (DEM), British Pound (GBP), Swiss Franc (CHF) and Australian Dollar (AUD). They structured the problem as a regression one, forecasting the weekly market closing price for each market. Evaluation was performed using the Normalized Mean Squared Error (NMSE), weekly return and directional change, expressed as gradient. Regarding these metrics, the authors acknowledged that the aim of market forecasting is achieving the highest possible value of trading profits, and it does not matter whether the forecasts are accurate or not in terms of NMSE or gradient. These are useful measures to assess the overall model quality, but should be carefully interpreted, because they do not consider the existent market trend. They experimented with different topologies of the NN model with TI as features for the above mentioned markets, achieving as their best result a return of 28.49% for the CHF market. They ultimately report that the obtained results imply that all the studied markets are not random walks and are not highly efficient, making the forecasting task realistically possible.

Kara et al. [27] also confirmed the supremacy of NN models over classical ML approaches in financial applications, in the Istanbul Stock Exchange (ISE) National 100 Index. They deployed a comparison between two classification techniques, NN and Support Vector Machines (SVM). They also used TA as input feature generator, selecting ten TI as inputs of the proposed models. Instead of predicting future market values, they decided to forecast the direction of daily change in the stock price index, creating labels that identify stock movements. Labels are categorized as 0 or 1. If the ISE National 100 Index at time t is higher than that at time t − 1, direction t is 1. If the ISE National 100 Index at time t is lower than that at time t − 1, direction t is 0. To evaluate the deployed models they used daily data from 1997 to 2007. They concluded that both the ANN and SVM models showed significant performance in predicting the direction of stock price movement. However, the performance of the ANN model (75.74%) was found to be significantly better than that of the SVM model (71.52%). Jigar Patel et al. [28] predicted CNX Nifty and S&P Bombay Stock Exchange (BSE) Sensex indices from Indian stock markets with a fusion of different machine learning techniques. A two stage fusion approach supporting ten different technical indicators as input features was proposed, and applied to three different models: an SVR combined with an Artificial Neural Network (SVR-ANN), a twofold SVR model (SVR-SVR) and finally an SVR combined with a Random Forest model (SVR-RF). The first stage of each one of these models applied an SVR model to each input feature in order to predict the next day value of that feature. The second stage consisted of an ANN, SVR or RF model, fed with the ten future values of the previously predicted statistical parameters. A comparison between a single stage model and a two stage model of each one of the deployed solutions was performed. The results proved that a combination of two techniques could achieve impressive results, totally overcoming a single layer methodology. The authors justify this with the fact that in a two layered procedure, prediction models in the second stage have to identify the transformation from technical parameters describing the (t + n)th day to the (t + n)th day's closing price, while in a single stage approach, prediction models have to identify the transformation from technical parameters describing the tth day to the (t + n)th day's closing price. The best overall prediction performance was achieved by the SVR-ANN model. A common approach to financial market analysis is also the forecasting of returns. Qiu et al. [29] applied this methodology to the Japanese Nikkei 225 index. They collected 71 variables that included financial indicators and macroeconomic data, sampled monthly, covering the period from November 1993 to July 2013. A feature selection algorithm called fuzzy surfaces was used to reduce data dimensionality to a minimal combination of explanatory variables. They discovered that from the initially collected 71 features only 18 were statistically significant. Finally, data was fed into a three layer ANN with a regular backpropagation mechanism, with a GA for parameter optimization. Performance was assessed using the Mean Squared Error (MSE). Results showed an MSE value of 0.0017 for the best model, and an average MSE of 0.1219, obtained from 900 training experiments. There are also other types of NN that achieved promising results in the financial domain.
Long Short-Term Memory (LSTM) networks are one of them. They are a state-of-the-art RNN model, widely used in the Natural Language Processing (NLP) domain for sequence learning, whose abilities are gradually being explored in market applications. A good example of their application was designed by Fischer and Krauss [30]. They used an LSTM network in order to predict the next day's return, transformed into binary labels. To convert continuous return values into discrete labels, they created a strategy based on the cross-sectional median return of all stocks in period t + 1 (the same period of the target variable). If the return value was smaller than the median value, a label 0 would be given, otherwise it would be 1. As input features they used 240 past return values

(approximately one trading year), i.e, every t − n return with n from 0 to 239. They compared an LSTM architecture with Random Forests (RAF), FNN and Logistic Regression (LOG). The LSTM achieved the best result among all the used strategies, with an accuracy of 54.3% and a mean daily return of 0.46% (excluding transaction costs).

2.3.2 Related works on Genetic Algorithms

As previously stated in subsection 2.2.2, a Genetic Algorithm (GA) is an evolutionary computing technique designed to search for an optimal or near optimal solution in the search space where the algorithm is confined, always following a methodology that tries to mimic and approximately replicate the principles of genetics and natural selection. A good example of how parameter optimization can be performed was presented by Chih-Hung Wu et al. [31]. This work aimed to develop a genetic-based SVM (GA-SVM) model that could automatically determine the optimal parameters, C and σ, of the SVM with the highest predictive accuracy and generalization ability simultaneously. The model was built to predict bankruptcy, and was tested on the prediction of financial crisis in the Taiwan market, achieving results that empirically confirm that the GA-SVM model performs with the best predictive accuracy when compared with the other tested models, namely classic financial statistic predictive methods and a FNN. To optimize the initial parameters of the SVM, the GA-SVM first generates a random population, where real values of C and σ are coded into the chromosome structure of each element of every generation. Finally, the model searches for optimal values iteratively, applying a survival of the fittest strategy. A work more similar to the one proposed in this thesis was done by Sezer et al. [32]. They built a deep MLP neural network for buy-sell-hold predictions, with TI parameters optimized by a GA. As input data, they used daily stock prices of Dow 30 stocks between 1/1/1997 and 12/31/2006 for training purposes, and stock prices between 1/1/2007 and 1/1/2017 for testing. The target vector of buy-sell-hold points was created based on values given by an RSI trading strategy in combination with a trading strategy based on SMA to identify uptrend and downtrend market periods. The GA was used to find the best RSI values for buy and sell points in downtrend and uptrend, in a randomly initialized population of 50 individuals. Generated chromosomes are divided in two distinct parts: 4 initial genes for identified uptrend periods and 4 genes for downtrend periods. Each gene codifies different parameters. The first one randomly creates RSI Buy values between 5 and 40. This corresponds to RSI oversold periods, which according to RSI trading strategies are plausible periods for market investments. Regarding this value, RSI Buy intervals are created randomly between 5 and 20, in the second gene. The third and fourth genes are analogous to these. The only difference is that they are related to the sell periods, making the RSI Sell signal (in the third gene) assume values between 60 and 95, which according to RSI trading strategies are considered overbought market periods. The most profitable chromosome is retrieved, and training data is generated according to it. The achieved results proved that using this strategy, the created system had the capability to beat the classical Buy and Hold trading strategy in 16 out of 29 Dow 30 stocks. Yusuf and Asif Perwej [33] also proposed a system that combined a GA with a FNN, optimized for the BSE market. They used a GA to search a space of ANN topologies and select those that optimally matched their criteria. The deployed network consisted of one input layer, two hidden layers and one output layer.
The searched topologies included the number of neurons of the input and hidden layers, since the last layer was always confined to one neuron, because the defined output was the prediction of tomorrow’s excess return.

They compared their results with classical Time Series prediction methods, namely Autoregressive models, and concluded that ANN models are superior, due to being able to capture not only linear but also non-linear patterns in the underlying data. They also discovered that their ANN performance is influenced by the way the weights are initialized, ultimately concluding that this step should be performed in terms of the mean and standard deviation of several randomly selected initializations. Gorgulho, Neves and Horta [34] proposed a work that used a GA kernel to optimize technical analysis rules for stock picking and portfolio composition. Their work aimed to manage a financial portfolio by using technical analysis indicators as trading rules. The used TIs were EMA, HMA, ROC, RSI, MACD, TSI and OBV. The Dow Jones Industrial Average Index (DJI) was used as the selected market, giving the system user the possibility of choosing the data frequency: daily, weekly or monthly. For each trading rule, a score is assigned according to a specific set of hard-coded rules, different from TI to TI. Regarding this mechanism, 4 scores were assigned. A very low score indicates a strong sell/short signal, and a value of 1 is given to it. An equal value is attributed to a very high score, indicating a strong buy signal. To low and high scores, a value of 0.5 is attributed. A low score indicates an under-performing signal with potential to sell or to go short, while a high score indicates a reasonable buy signal. In this work the GA is used to optimize the classified trading rules. After the optimization performed by the algorithm, resulting in a classifier equation where a set of technical indicators is correctly balanced, all the assets within the market are classified with weights. In order to validate the developed solution, an evaluation was performed comparing the designed strategy against the market itself and several other investment methodologies, such as Buy and Hold and a purely random strategy. The testing period from 2003 to 2009 allowed the performance evaluation under distinct market conditions, culminating with the 2008-2009 financial crash. The results are promising since the number of positions with positive return exceeds 80% for the GA, confirming the high confidence level of the proposed approach. Regarding trading rule optimization, Hirabayashi et al. [35] proposed a GA system to automatically generate trading rules based on TI, that instead of trying to predict future trading prices focuses on calculating the most appropriate trade timing. The training data used in this work consists of historical rates of the U.S. Dollar towards the Japanese Yen (USD/JPY) and the Euro towards the Yen (EUR/JPY). For each data set, they used hourly closing prices. As selected TIs they used RSI, EWMA, Percent Difference from Moving Averages (PD) and the rising rate from one hour ago of the original data (RR). These indicators are used to generate trading rules. The GA is going to try to optimize the system, searching for the most profitable individual according to 3 conditional equations involving the mentioned TIs, establishing boundaries for each TI that are going to determine if we are going to have a buy or a sell order. System results are compared with a Buy & Hold strategy and with a Neural Network strategy. They claim that their work surpassed these strategies, achieving an average of 17% of profit for the EUR market and 80% for the USD market.

Work | Date | Application | Heuristic | Main goal | Input variables | Data period | Performance
[35] | 2009 | FX market EUR, AUD and USD against JPY | GA | Optimization of trading rules | Technical indicators | 2005 - 2008 - Hourly window | Approximately 15% ROI for AUD, 50% ROI for EUR and 40% ROI for USD
[26] | 2000 | FX market AUD, CHF, DEM, GBP and JPY against USD | ANN | Forecasting FX market - Regression | Time-series data and technical indicators | 1993:11 - 1995:07 - Weekly window | 29.45% ROI and 0.043 NMSE for the USD/CHF market
[34] | 2011 | DJI 30 stocks | GA | Technical indicators rules optimization | Technical indicators | Not specified | ROI 25,29%
[27] | 2011 | Istanbul Stock Exchange (ISE) National 100 Index | SVM and ANN | Market trend binary prediction | Technical indicators | 1997:01 - 2007:12 | 75,74% ACC for ANN and 71,24% ACC for SVM
[29] | 2016 | Japanese Nikkei 225 index | ANN, Fuzzy Surfaces and GA | Forecasting stock market - Regression | Fundamental indicators | 1993:11 - 2007:07 - Monthly window | 0.0043
[28] | 2015 | BSE Sensex and CNX Nifty | SVR-ANN, SVR-RF and SVR-SVR | Forecasting stock market - Regression | Time-series data and technical indicators | 2003:01 - 2012:12 - Adjustable window | 139.39 MAE, 3.41 RMSE for CNX Nifty and 449.75 MAE, 3.34 RMSE for BSE Sensex
[31] | 2007 | Curated list of industries | GA-SVM | Bankruptcy classification | Bankruptcy data | Monthly window | 76% Accuracy
[32] | 2017 | DJI 30 stocks | GA-ANN | Buy-sell-hold points prediction | Technical indicators | 1997:1 - 2007:1 - Daily window | 22.4% ROI for the MLP+GA approach
[33] | 2012 | Bombay Stock Exchange (BSE) | GA-ANN | Daily return prediction | Fundamental and technical indicators | Daily window | Mean excess returns - 1.026%
[30] | 2018 | S&P 500 index | RAF, ANN, LOGIT and LSTM | Prediction of directional movements | Fundamental and technical indicators | 1992 - 2015 - Daily window | 0.46% Daily return

Chapter 3

Implementation

3.1 Model overview

In this section, it is intended to provide a general overview of the solution that was developed. The overall approach to market prediction is decomposed into small parts, where each part has a distinct role in the deployed model, using different methodologies and technologies, useful for achieving the desired result. The presented model results from the combination of concepts that were introduced in the previous sections, namely FNN and GA. All the code was written using Python3 [36], due to the great support, accessibility, speed and available packages related to this work. We could summarize the developed system workflow in the following steps:

1. User input: The user starts by providing initial configuration data to the system, specifically FOREX market data, model parameters, GA settings, and TA desired features.

2. Feature calculation: After specifying the initial parameters and configurations, the system calculates the desired TA features with the provided data.

3. Optimization: In this module the GA is used to find the best optimized version of the FNN model. According to the characteristics of each generated individual, data is fetched from the produced feature data. Models are evaluated, and the best one is selected according to the selected fitness function.

4. Model prediction: After selecting the best optimized model, the testset will be used to make a prediction.

5. Market simulation: The outputted prediction is used to create a market strategy. The market strategy is evaluated using financial metrics.

The complete sequence of processing could be seen in fig 3.1, where each different layer is depicted for further explanation in the following sections.

Figure 3.1: System workflow

3.2 User input

The user input module is responsible for the interaction between the user and the system. We created a Python configuration file, where several system parameters can be configured, enabling the creation of different system architectures. The user can configure the following parameters, divided in 5 different sections (an illustrative sketch of such a configuration file is shown after this list):

• Data: In the configuration file, the final path to the market file where market records are available, must be specified in the path parameter.

• TI features: It is possible to select which TI are going to be used by enabling them in the configuration file. All the usable indicators have a parameter corresponding to their acronym. To enable them, the user must simply set the desired ones to TRUE. We also specified two additional parameters: upper bound and lower bound, which set the maximum and minimum time periods for which each indicator is going to be created.

• GA: The GA parameters configure the behavior of the evolutionary operators and the number of created chromosomes. The initial population is controlled by the pop parameter, which should receive an integer corresponding to the number of desired individuals. The same applies to the ngen parameter, which controls the number of generations that are going to be used as the GA termination criteria. The remaining parameters control the selection, mutation, and crossover operations. Regarding selection, the tournsize parameter configures the number of selected individuals for Tournament Selection (section 2.2.2). For mutation and crossover, mutp and cxpb were created, representing the probability of applying the mutation and crossover to a selected individual. Since mutp and cxpb are probabilities, they should be defined by a float in the range from 0 to 1. Finally, we have a parameter that controls the used fitness function. The fitness func parameter receives a string corresponding to the name of the fitness function to be used (section 3.4.3).

• Neural Net: This section of the configuration file concerns the parameters that control the behaviour and architecture of the neural network. Activation receives a string corresponding to the activation function used by the hidden layers of the network. Epochs receives an integer corresponding to the number of epochs selected for training the network. Then we have batch size, which indicates the number of samples that are going to be propagated through the network in each forward and backward passage, and the network optimizer, a string representing the optimizer used in the backpropagation process. Still regarding the network internal architecture, we also deployed the Batch Norm parameter.

This parameter controls the usage of a Batch Normalization process on the second layer of the network, by being set to TRUE or FALSE. Details about this procedure are explained in section 3.5, while in chapter 4 its introduction or not in the model is discussed throughout the experimental analysis conducted in different FOREX markets. There are also two more relevant parameters concerning internal data usage, which greatly influence the system performance. Train size specifies the percentage of the dataset that is going to be used for training the network. Since it is given as a percentage, the system is able to automatically infer the size of the data used for testing. Validation size is used to indicate the portion of the training set used to validate the system performance.

• Investment: Investment parameters reflect how the system is going to invest in the Market Simulator module. 3 parameters were created. Initial capital indicates how much money the user has to invest, while asset number indicates the number of assets to be acquired upon transaction. Finally, transaction cost simulates the usual costs involved in sell/buy transactions between brokers and traders.
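As referenced above, the sketch below illustrates what such a configuration file could look like; every parameter name and value shown here is an illustrative assumption, not the exact naming or the defaults used by the system.

```python
# Illustrative configuration file sketch (names and values are assumptions).

# Data
path = "data/GBPUSD_hourly.csv"

# TI features
EMA, SMA, RSI, MACD = True, True, True, False   # enable/disable each indicator
lower_bound, upper_bound = 5, 50                # min/max TI time periods

# GA
pop = 50                 # population size
ngen = 20                # number of generations (termination criteria)
tournsize = 3            # tournament selection size
mutp = 0.1               # mutation probability
cxpb = 0.9               # crossover probability
fitness_func = "accuracy"

# Neural Net
activation = "relu"
epochs = 100
batch_size = 32
optimizer = "adam"
batch_norm = True        # enable Batch Normalization on the second layer
train_size = 0.8         # fraction of the dataset used for training
validation_size = 0.2    # fraction of the trainset used for validation

# Investment
initial_capital = 10000
asset_number = 1000
transaction_cost = 0.0002
```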

3.2.1 Data

The system is prepared to receive Time Series market data that respects some feature constraints imposed by the created solution. The path specified in the config file should point to a .csv file with periodic information from a selected FOREX market. It should comprise the following fields:

• Date: The date corresponding to each available record. The system accepts different data periods, but smaller frequencies like tick, hourly or minute data are recommended to achieve more interesting results. Daily data is not recommended since the majority of the available FOREX datasets usually start in 1999, which corresponds to a small amount of records, insufficient for extracting reliable insights out of the system.

• Open: The open price rate at the beginning of the given time period.

• Close: The close price rate at the given time period.

• High: The highest achieved rate price at the given time period.

• Low: The lowest achieved rate at the given time period.

Note that the provided file columns must respect the above mentioned order. Data will be stored in the dataframe format provided by the Python library Pandas [37]. Pandas is an extremely popular Python package for Data Science, that offers powerful tools to manipulate, analyze and store data. A dataframe is a flexible way to store data in a two dimensional data structure, aligned in a grid fashion composed of rows and columns. Rows correspond to the existent data records, while columns include the variables collected at a specific data record (in this case at a specific date period). Pandas dataframes can also be effortlessly manipulated due to the large set of available operations. They work in such a way that access to information does not require going through iterative loops, making splitting and transformation operations flexible and efficient. Fig 3.2 shows an example of FOREX EUR/USD market data stored in a Pandas dataframe, in a raw stage, i.e, without any additional financial features.

Figure 3.2: Raw input data
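A minimal sketch of loading such a market file into a Pandas dataframe is shown below; the file name and the assumption that the .csv contains a header row with the listed column names are illustrative.

```python
import pandas as pd

# Load the raw hourly FOREX records into a dataframe (fig. 3.2),
# assuming the .csv provides the columns in the required order.
df = pd.read_csv("data/EURUSD_hourly.csv", parse_dates=["Date"])
print(df[["Date", "Open", "Close", "High", "Low"]].head())
```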

3.3 Feature calculation

This is the module that adds and calculates the TA features using the initially given data. As previously specified in section 2.1.3, this corresponds to the addition of a set of different TI that use past information to generate new signals. These operations will be performed using PyTi [38], a Python library that contains various financial TI that can be used to analyze data. Throughout the system workflow, feature data will be accessed several times. With that in mind, we chose to calculate each feature only at the beginning of the program execution, avoiding repeated calculations. Calculated features will be stored in a folder that contains an individual .csv file for each feature. The lower bound and upper bound parameters initially selected in the configuration file are going to set the number of columns of the created file, with each column corresponding to the TI calculation accounting for a different number of past periods. The selected range should be wide enough to comprise small, medium and large time periods, but also not excessively large, in order to avoid the creation of a too big search space, which would greatly increase the complexity of the system optimization task. To facilitate the TI creation, we also divided the used indicators in 2 different types: normal TI and special TI. Normal TIs are features that take only 1 parameter, hence the above explained methodology is applied. Special TIs take more than one parameter, with values also contained in the same used range. The creation of a .csv file for those indicators would be too time consuming, inevitably creating big data structures that would require great amounts of memory. We opted to calculate them on the fly, since the system uses a smaller number of special TIs when compared to normal TIs. Regarding the used TIs (section 2.1.3), we separated them in the following order:

• Normal TI = [EMA, SMA, RSI, MOM, ATR, ATRP, BB, ADX, AA, CMO, DPO, DEMA, ROC, DSS, ROC, KURT, SKEW, STD, STV ]

38 • Special TI = [CCI, MACD, PO]

The system has the ability to automatically separate the 2 types of TI. Some of the indicated TIs condense more than one indicator, but for ease of interpretation we chose to represent them with only one symbol. For example, the AA indicator is divided in 2 separate indicators, the AA up and the AA down. These correspond to the 2 bands that are used to create the AA indicator, and their calculation relies on 2 different formulas. Therefore, they are processed by the system as 2 normal TIs, with different input parameters. Figure 3.3 shows an example of the csv created for the SMA indicator.

Figure 3.3: SMA csv
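The sketch below illustrates how such a per-indicator file could be produced for the SMA, with one column per period between lower bound and upper bound; for brevity it uses a Pandas rolling mean instead of the corresponding PyTi call, and the output path is an assumption.

```python
import pandas as pd

def build_sma_file(df, lower_bound, upper_bound, out_path="features/SMA.csv"):
    # One column per SMA period in [lower_bound, upper_bound] (fig. 3.3 layout)
    sma = pd.DataFrame(index=df.index)
    for n in range(lower_bound, upper_bound + 1):
        sma["SMA_" + str(n)] = df["Close"].rolling(window=n).mean()
    sma.to_csv(out_path, index=False)   # the first n-1 rows of each column are NaN
    return sma

# Example usage with the dataframe of fig. 3.2:
# build_sma_file(df, lower_bound=5, upper_bound=50)
```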

3.4 Optimization

The optimization layer is the core layer of the created system. In it, the previously provided market data goes through several different procedures in order to create the individual that has the best possible performance according to the defined fitness function. As explained in section 2.2.2, individuals are represented by array structures called chromosomes. In this layer, the generated chromosomes are used for two main purposes: first, to join input data with calculated features, and second, to create the individual FNNs where the grouped data is going to be inputted.

Several individuals in the form of FNNs are then going to be created, with each one of them going through the evolutionary process computed by the GA. Predictions made by each FNN are going to be evaluated by the fitness function, and the process is going to be repeated until a stopping condition is met. By the end, the best individual is returned. Fig 3.4 depicts the overall procedure used in the optimization layer.

Figure 3.4: Optimization layer

3.4.1 Population generation

The first step to be performed in the optimization layer is the creation of a population of individuals. The size of each generated individual varies according to the number of initially selected TI features, corresponding to a chromosome whose genes take values from lower bound to upper bound (section 3.3). Besides the genes used as the number of past TI periods, each chromosome also encodes two more types of genes, namely presence and neural network genes. Each type has a different purpose, and together they represent the 3 main tasks performed by the GA:

• TI creation: The first main functionality was already explained, and is TI creation, which is achieved through the usage of TI genes as feature parameters. Gene values codify the number of desired past periods, and are randomly selected by the GA when an individual is initialized.

• Feature selection: Feature selection is a key mechanism in ML solutions. The main idea is to select only a relevant set of features, promoting the model generalization capabilities without compromising its correctness, thus reducing the existent model variance. Therefore, unnecessary, irrelevant, and insignificant attributes that do not contribute to the accuracy of the predictive model are removed. The system attempts to achieve this goal by using the previously mentioned presence genes available in each chromosome. For each one of the TI parameter genes, there is a corresponding presence gene that indicates if the respective feature will be used in the model creation or not. Each presence gene also codifies a value between lower bound and upper bound, but the main difference is that the system interprets it as a binary presence flag. This is done by using the statistical median of the data population. If the codified value is higher than the median, the feature will be used in the model creation, otherwise it will be discarded, thus reducing the number of used attributes (an illustrative sketch of this chromosome decoding is shown after fig 3.5).

• Neuroevolution: Neuroevolution is a term applied to models that use evolutionary capabilities, such as GAs, to evolve through time in order to achieve optimized models. In this system, this is done by encoding 2 parameters that correspond to the number of neurons present in the input and hidden layer of

the network. Hence, during the initialization process, individuals will assume different FNN architectures, with different network topologies. For deploying this mechanism, two additional neural network genes were deployed, which also codify values between lower bound and upper bound, giving to each layer a value in that range.

A possible chromosome structure of the generated individuals is depicted in fig 3.5.

Figure 3.5: Chromosome structure
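A small sketch of how such a chromosome could be initialized and decoded is given below; the exact gene ordering and the reference value used for the presence-gene median test (here taken as the midpoint of the allowed gene range) are assumptions made for illustration only.

```python
import random

def init_chromosome(n_ti, lower_bound, upper_bound):
    # n_ti TI period genes + n_ti presence genes + 2 neural network genes (fig. 3.5)
    return [random.randint(lower_bound, upper_bound) for _ in range(2 * n_ti + 2)]

def decode(chrom, n_ti, lower_bound, upper_bound):
    ti_genes = chrom[:n_ti]               # number of past periods for each indicator
    presence = chrom[n_ti:2 * n_ti]       # interpreted as binary presence flags
    nn_genes = chrom[2 * n_ti:]           # neurons in the input and hidden layers
    median = (lower_bound + upper_bound) / 2.0   # assumed reference for the median test
    selected = [i for i, g in enumerate(presence) if g > median]  # feature selection
    return ti_genes, selected, nn_genes

chrom = init_chromosome(n_ti=5, lower_bound=5, upper_bound=50)
print(decode(chrom, n_ti=5, lower_bound=5, upper_bound=50))
```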

As explained in section 2.2.2, the number of created individuals is going to correspond to the value of the population size. New individuals are going to be successively created throughout the GA running time. They will be created according to the number of generations initially set, with each generation being based on the previous generation's fittest elements.

3.4.2 Model creation

After initializing a population of individual chromosomes, the system is ready to create a dedicated FNN for each individual, considering the GA selected parameters. This is the part where data is grouped and the model topology is selected according to each gene value. Regarding data grouping, some procedures need to be performed in order to correctly use the existent data - data preprocessing. By the end, the model outputs a prediction vector, which corresponds to a trading strategy based on the approximated function discovered by the model. Later on, this prediction is going to be used to assess the quality of each individual.

X matrix creation

The first step of the model creation is the establishment of a feature matrix, generally called the X matrix, where each x input (or model feature) is present. As explained in section 3.3, only special TI features will be calculated at this point, with the remaining ones being selected from the initially created feature csv files. Thus, the generated features will be appended to the initially provided data (which contains the Date, Open, Close, High and Low features), creating a new dataset with a higher dimensionality. Features will be appended or not according to the values present in each presence gene. Regarding the train test split explained in section 2.2, the created matrix will be split in Xtrain and Xtest, with the splitting point being initially set by the system user. The Xtrain vector is further decomposed in order to generate the Xval vector, which corresponds to a percentage of the trainset (selected by the user in the config file) used to evaluate the model performance during the optimization period. The Xtest is preserved to be later used with the optimized model.

Y vector creation

Since in this thesis we structured the prediction problem as a Supervised Learning one, the creation of a vector that contains the correctly labelled targets is needed. This is what is present in the y vector, with each position corresponding to the market rate return at time t. Signal variations are achieved by calculating the return, i.e, comparing the closing price at time t with the closing price at t − 1, being characterized as positive or negative. Since financial data exhibits a non-stationary behavior, this procedure is crucial to ensure that the data used in the train and test sets lies on the same joint distribution, which is guaranteed by the Gaussian nature of financial returns [39]. To create discrete labels instead of the continuous values given by this calculation, a binarization process by threshold is performed. If the calculated return is positive, a label of 1 is created, otherwise a 0 is given, meaning that the return at time t is negative. By deploying this methodology, the model attempts to predict future market returns instead of forecasting future rates. We find this approach much more useful, since one cannot expect the created model to accurately forecast a continuous value that correctly expresses market rates. From the trader perspective, it is much more profitable and beneficial to only know when it is a good time to invest in the market, or when one should stay out of it. Equation 3.1 represents the binarization process used to create the y vector.

y_t = \begin{cases} 0, & \text{if } \frac{Close_t - Close_{t-1}}{Close_{t-1}} \leq 0 \\ 1, & \text{if } \frac{Close_t - Close_{t-1}}{Close_{t-1}} > 0 \end{cases} \qquad (3.1)

It is important to notice that we are attempting to forecast the next period's market variation, and not the present market variation at time t. Therefore, for each set of features at time t, the constructed label corresponds to equation 3.1 for t + 1. This is performed by using the above formula for each row of the dataset and, after it, shifting the created y vector one row up. Similarly to what is done in the creation of the X matrix, a split into train, validation and test is also done on the created y vector, giving rise to ytrain, yval, and ytest.
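A minimal Pandas sketch of this target construction (eq. 3.1 followed by the one-row shift) is given below; the dataframe df is assumed to be the one of fig. 3.2.

```python
import pandas as pd

def make_target(df):
    # Hourly return: (Close_t - Close_{t-1}) / Close_{t-1}
    returns = df["Close"].pct_change()
    y = (returns > 0).astype(int)   # binarization by threshold (eq. 3.1)
    y = y.shift(-1)                 # label at time t refers to the return at t + 1
    return y                        # the last row becomes NaN and is later dropped

# Example usage:
# y = make_target(df)
```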

Data cleaning

The stage of data cleaning is needed in order to eliminate undesirable values from the data. When performing all the computations needed for feature calculation, undefined or unrepresentable values emerge in each data column. This is due to the formulation used to calculate TIs, which considers the past periods as the initial parameter. For example, let's consider an SMA that uses the n past hours as the initial parameter. This necessarily makes the first n − 1 values present in the SMA feature column undefined, being treated as not-a-number values (NaN). Since the SMA TI calculates a simple average over the n past periods, periods with t smaller than n can not be used, thus being populated with NaNs. A similar problem arises when the y vector is being created. By shifting the y vector one row up, making each row at time t correspond to the next period t + 1 market return, the last value of the y vector will also be treated as NaN. These are values that can not be processed by the neural network, thus their removal is compulsory. The chosen procedure was to drop the rows where NaNs were encountered. This is an acceptable approach when dealing with data sampled from financial markets, since other well known practices like value interpolation, replacement by mean and replacement by median, would introduce bias into the system data.

Feature Normalization

To improve the overall model convergence, feature scaling is a needed measure. When data includes a high number of dimensions, the values assumed by each individual feature could present a different range of values, i.e, different levels of variance. This is undesired during the FNN training phase. Although it is possible for NNs to naturally adapt to such heterogeneous data, the existence of such dispersed features makes training much more difficult and time consuming. To overcome this problem, a normalization method called standardization (or z-score) was applied. This procedure is individually applied to every dataset feature, making each one of them have zero mean and unit variance, following the properties of a normal distribution. This helps the convergence of the learning process, more specifically the Gradient Descent algorithm, explained in section 2.2.1. If features on considerably different scales are used during the application of Gradient Descent, gradients will not take a direct path towards the global minimum due to the shape assumed by the cost function, which could result in a slow training procedure or even make the system get stuck at a local minimum [40].

Considering µi as the mean value of feature i, and σi as the standard deviation of feature i, we could formulate the standardization formula as:

$$x_t = \frac{x_t - \mu_i}{\sigma_i} \qquad (3.2)$$

Where xt represents each value present in feature vector i. An important note about the standardization process is the way it is performed. In order to correctly use this normalization procedure, the above mentioned formula needs to be fitted only on the values present in the Xtrain matrix. We exclude the test set to prevent a well-known problem called look-ahead bias. This is a type of model bias that emerges when assumptions regarding the test set are used in the training period, ultimately leading to a high-performance model whose achieved results are not trustworthy. This would be the case if one tried to standardize the whole dataset, with bias being leaked into the train set in the form of µi and σi. To get around this issue, the values of µi and σi should be drawn only from the train set.
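A sketch of this step, assuming the feature matrices are NumPy arrays and that µ and σ are computed only on the train split:

```python
import numpy as np


def standardize(X_train, X_val, X_test):
    """z-score every feature with statistics drawn only from the train set,
    avoiding look-ahead bias (equation 3.2)."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma = np.where(sigma == 0.0, 1.0, sigma)   # guard against constant features
    return ((X_train - mu) / sigma,
            (X_val - mu) / sigma,
            (X_test - mu) / sigma)
```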

FNN creation

Now that all the needed information is preprocessed and ready to be used, it is time to create each individual FNN, according to each initialized individual. The model was created using the Python Deep Learning library Keras [41], which contains all the tools needed to easily build any type of neural net. Data is fed into a 3-layer model that forecasts market returns as a binary vector of predictions ŷ. The model is trained using the Xtrain and ytrain matrices, and tested with Xval and yval. The validation accuracy will be the metric optimized during the backpropagation process. In each epoch, the FNN trains and tests the network, continuously calculating the train and validation accuracy. The run with the highest value of

validation accuracy is going to be saved as the optimized neural network. In it, the Xtest vector will be used to evaluate the behaviour of the network on new, never seen information. In section 3.5, we explain in detail how the FNN model is created during each system stage.

3.4.3 Fitness computation

The fitness function is the evaluation mechanism deployed in the GA. It takes as input a candidate solution (a vector of predictions ŷ) and outputs how fit the provided solution is with respect to the function. After the FNN creation, the fitness of each created population chromosome is assessed. Therefore, due to the high number of repetitions that are going to be performed, the fitness function should be implemented rigorously, in order not to slow down the entire system. There is no strict rule regarding the selection of the fitness function. It could simply be the FNN model cost function, a classification metric, a financial metric or any other custom one. Therefore, one could choose to directly enhance the model performance, optimizing a metric used during the learning process of the network, or indirectly improve the model capabilities with an external metric, which is not internally calculated by the model but contributes to its overall performance. We propose 2 different fitness functions, one directly related to the system performance, and the other related to market profitability:

• Accuracy: Accuracy is a metric that is calculated throughout the model learning process. It gives the percentage of correct predictions by comparing the correctly labelled vector yval or ytest with the prediction vector ŷ. It directly impacts the model performance.

$$accuracy = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \frac{y_i - \hat{y}_i}{y_i}\right) \times 100 \qquad (3.3)$$

With N being the total number of samples to be predicted, yi the individual values of the y vector, and ŷi the individual values of the ŷ vector.

• Return of investment: This is a financial metric that measures the efficiency of the performed investments, giving a result that can be translated into the achieved gains in terms of percentage. Contrary to the previous formula, ROI is not directly applied in the model calculation; it is used after the model prediction, which means that its optimization affects the FNN in an indirect way (a sketch of both fitness functions is given below).
$$ROI = \frac{Returns - InitialInvestment}{InitialInvestment} \times 100 \qquad (3.4)$$
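Both fitness functions can be written as small stand-alone functions. The sketch below is only meant to illustrate the two options under the assumption that the prediction vector and the simulation output are already available; it is not the exact thesis implementation.

```python
import numpy as np


def accuracy_fitness(y_true, y_pred):
    """Percentage of correctly predicted binarized returns (directly tied to
    the classification performance of the FNN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true == y_pred)) * 100.0


def roi_fitness(final_returns, initial_investment):
    """Return of investment of the simulated strategy, in percentage
    (equation 3.4); only indirectly related to the FNN internals."""
    return (final_returns - initial_investment) / initial_investment * 100.0
```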

3.4.4 GA operators

This section covers all the blocks depicted in fig. 3.5 that are used by the GA as evolutionary operators: stop condition, selection, mutation and crossover. We took advantage of the DEAP library [42], which enables the individual creation of each evolutionary operator, offering a great set of possible configurations. The implementation regarding each one of them was the following:

• Stop condition: To terminate the ongoing evolutionary optimization process, a certain criterion should be met. We simply selected the number of generations as the GA stopping criterion. Until reaching the final

number of generations, the GA continues the breeding process driven by the developed evolutionary operators. Once the condition is met, the GA outputs the best individual found during the running period.

• Selection: To perform the selection phase, we selected as operator the Tournament Selection method (section 2.2.2). The process is repeated a number of times equal to the size of the initially created population.

• Mutation: The mutation operator is responsible for mutating one or more genes according to a certain mutation probability. Since each created individual encapsulates different types of genes (presence and TI parameter genes), a simple mutation process would not be adequate for this problem. Therefore, we developed a custom mutation that splits the selected individual in half and performs the mutation operation on the first half of the chromosome genes, i.e., where the TI parameter genes are located. We do this because we do not want to change the presence genes, located in the second half of the chromosome. The mutation itself is done by sampling an integer uniformly drawn between the lower and upper bounds defined for the TI parameters.

• Crossover: The crossover operator is used to combine different individuals by mixing their chromosome genes. Similarly to what happens with the mutation operator, the crossover must be redesigned in order to avoid a possible blend between presence and TI parameter genes. We created a custom crossover that receives two individuals, divides each one of them in half, and applies a multi-point crossover in each split. The final result is given by 2 new individuals with genes switched between them, picked according to 2 randomly selected crossover points in each half (a sketch of both custom operators is given after this list).
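Using DEAP, the two custom operators could look roughly like the following, assuming the chromosome is a flat list whose first half holds the TI parameter genes and whose second half holds the presence genes (a sketch, not the exact thesis code).

```python
import random

from deap import tools


def mutate_half(individual, low, up, indpb):
    """Mutate only the first half of the chromosome (TI parameter genes),
    leaving the presence genes in the second half untouched."""
    half = len(individual) // 2
    for i in range(half):
        if random.random() < indpb:
            individual[i] = random.randint(low, up)
    return individual,


def crossover_half(ind1, ind2):
    """Apply a two-point crossover separately to the parameter half and to the
    presence half, so the two gene types never get blended together."""
    half = len(ind1) // 2
    ind1[:half], ind2[:half] = tools.cxTwoPoint(ind1[:half], ind2[:half])
    ind1[half:], ind2[half:] = tools.cxTwoPoint(ind1[half:], ind2[half:])
    return ind1, ind2
```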

3.5 Model prediction

After the termination of the GA procedure, we obtain an optimized set of parameters, given by the fittest chromosome according to the selected fitness function. Since this individual could appear anywhere in the GA search space, it was not feasible to store the data generated by each individual. Hence, a final prediction is made using the parameters of the optimized version of the system, which is the purpose of this module. This final prediction is performed by the optimized FNN, i.e., the fittest individual among the created population. The prediction itself derives from the application of Xtest and ytest (which had both been held out until this procedure), instead of Xval and yval as used during the optimization process. The architecture of the developed system is composed of 3 individual layers. The first layer, the input layer, is heavily shaped by the chromosome genes. Its size is selected by the penultimate chromosome gene, and it cannot be smaller than the number of columns present in the provided data, in this case the number of features picked by the GA plus the number of initial data features. If the GA selects a smaller number, the number of columns is used as the number of neurons in the first layer. Activations are performed by ReLU; the technical details of this function are presented in section 2.2.1. The second layer is technically similar to the first one, with the exception of the number of units: the number of neurons is selected according to the value of the last neural network gene. Before passing the output from the first layer to the second one, a Batch Normalization process can be applied to each mini-batch,

with the batch size and the Batch Normalization usage both being selected and initialized in the configuration file. Batch Normalization is a process that aims to recreate the initial data normalization step, explained in section 3.4.2, in the hidden layers of the neural network. The general idea is that if standardizing the initial inputs of the network improves the overall model convergence, then applying the same principle to the middle layers of the created model will also be helpful. In fact, Batch Normalization speeds up the training process and minimizes the change in data distribution across layers by forcing the mean and variance through standardization. The main idea is to reduce the internal Covariate Shift, i.e., the change in the distribution of network activations across different network layers [43]. Finally, the network is also composed of a third and final layer. Since the prediction targets are binarized returns, the number of neurons in the last layer is fixed and permanently set to 1. The activation function of the last layer neuron must account for that, and the predictions of the network must be forced to 0 or 1. Considering the binary classification task performed by the FNN, a sigmoid function (section 2.2.1) was selected. As loss function, binary cross entropy was chosen, which is no more than a special case of the categorical cross entropy introduced in section 2.2.1. Regarding the network optimization procedure explained in section 2.2.1, the optimizer field in the configuration file determines the optimization algorithm used in backpropagation. Keras provides a set of different optimizers, such as Stochastic Gradient Descent and improved versions of it. Still concerning the network data fitting procedure, it is worth noticing that Keras forces each epoch to use both the train and validation sets. Therefore, in each pass, the network is fully trained with the train set and fully tested with the validation set, making it possible to check in real time how well fitted to the received data the network is.
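Putting this description together, each per-individual model could be built roughly as follows with Keras; the gene-derived arguments and names are illustrative, not the exact thesis code.

```python
from keras.models import Sequential
from keras.layers import Dense, BatchNormalization


def build_fnn(n_features, units_1, units_2, use_batch_norm, optimizer="adam"):
    """3-layer FNN: two hidden ReLU layers sized by the chromosome genes and a
    single sigmoid output neuron for the binarized-return classification."""
    model = Sequential()
    # The first layer can never be smaller than the number of input columns.
    model.add(Dense(max(units_1, n_features), activation="relu",
                    input_shape=(n_features,)))
    if use_batch_norm:
        model.add(BatchNormalization())   # normalize activations per mini-batch
    model.add(Dense(units_2, activation="relu"))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer=optimizer, loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```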

Figure 3.6: Final prediction pipeline

Besides the architecture specified above, we also introduce two other mechanisms that attempt to contribute to the improvement of model convergence. The first one is the model checkpointer, which saves the model after an epoch only if the monitored metric, loss or accuracy, has improved since the last epoch. This is done by storing the network weights in an hdf5 file, which is later used to load the network again and output a prediction. The other mechanism is early stopping. Early stopping greatly reduces the computation time, since it stops the learning process whenever there is no improvement in the last x

epochs, with x being referred to as the patience threshold. The combination of these two tools makes the model able to simultaneously filter out undesirable results and return the best element seen.
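With Keras, both mechanisms are available as callbacks. A minimal sketch, assuming the monitored metric is the validation accuracy, a patience of 20 epochs, and that model and the data splits come from the previous steps:

```python
from keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    # Keep only the best weights seen so far (the model checkpointer).
    ModelCheckpoint("best_model.hdf5", monitor="val_acc",
                    save_best_only=True, save_weights_only=True),
    # Stop training after 20 epochs without improvement (early stopping).
    EarlyStopping(monitor="val_acc", patience=20),
]

model.fit(X_train, y_train, validation_data=(X_val, y_val),
          epochs=100, batch_size=32, callbacks=callbacks)
model.load_weights("best_model.hdf5")   # reload the best epoch before predicting
```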

3.6 Market simulation

The market simulation module is designed to test market trading strategies in a simulated environment. It receives the prediction made by the optimized model and tests it against the market for which it was designed. We defined 2 main market positions, long and short. When the simulator orders a long position, the system invests in the market and purchases a predefined number of assets. Going short means exactly the opposite, with the system selling first and buying when the position is closed (section 2.1). The applied strategy is defined by the binary returns present in the predictions array. When the next period return, at t + 1, is positive, we have at time t a label with value 1, which indicates proper investing conditions. When the return at t + 1 is negative, the system labels time t with 0, meaning that the next period is not suitable for market investments. To create the final market strategy, a new signal derived from the model predictions vector must be generated. In order to simulate the transition from long to short investing, a third label, -1, is then created. Therefore, the system investment procedure condenses 3 different states: 1 represents a long position, -1 represents a short position and 0 stands for a non-investing behaviour. Before investing, 2 parameters must be set: the initial capital and the number of assets to be bought or sold when a long or short position is requested, respectively defined as initial capital and asset number in the configuration file. Their values should be established and varied according to the specified market. To showcase all the calculations and system behaviour during the market simulation, a csv with 9 columns is created, with each column being:

• Date - The date of the respective market period.

• Close - The close price on a given date.

• Prediction - The signal generated by the optimized FNN.

• Signal - The generated market strategy. Represents the difference between period t + 1 and period t of the prediction column. This generates the 3 above-mentioned labels: 1 - long, -1 - short and 0 - hold.

• Positions - Number of positions to be opened according to the trading signal. This number is pre-defined in the configuration file, and when an investment is performed the number of transacted assets is always equal to it, with its sign varying according to the signal column value: if 1, Positions takes a positive value, and if -1, a negative value. When 0, the value remains equal to the previous one, since no transactions are performed.

• Positions value - Indicates the value of all the opened or in-debt market positions. It is calculated by multiplying the closing market value by the number of acquired assets, i.e., the values in the Close column by the Positions column.

• Cash - Indicates the total amount of money owned by the user, from the beginning of the market simulation until the end of it. This quantity varies according to the market investments.

• Total - The total value owned by the user. This combines the money the user has at a specific time period plus the combined value of the open market positions.

• Returns - The market return for a given time period t. Useful to identify whether the system correctly bought/sold in that period.

Upon csv creation, the system uses the calculated values to assess the overall performance of the market investments. The financial metrics presented in section 4.2 were used for that purpose. Fig. 3.7 illustrates the mechanism scrutinized throughout this chapter.

Figure 3.7: Market simulation

The above calculations account for the usual transaction costs charged by the broker. This cost can be relative or fixed. Relative fees change according to the trade volume, while fixed fees are, as the name indicates, fixed, with their value remaining the same regardless of the size and volume of the trade being placed. In FOREX markets the commission fee is normally included in the market spread (section 4.1). Since in this application we do not use bid and ask prices, the spread calculation is impossible (the available data does not include these two fields, since they change according to the chosen broker); hence, we decided to use an average fixed commission cost of 0.0001% per transaction, in order to cover all commission plus spread trading expenses.
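The whole simulation can be condensed into a few vectorized pandas operations. The sketch below follows the column names of section 3.6 but is only an approximation of the actual accounting (the 0.0001% fee is expressed as a fraction); it is not the exact thesis implementation.

```python
import pandas as pd


def simulate_market(close, prediction, n_assets, initial_capital, fee=0.000001):
    """Rebuild the columns of the simulation csv from the FNN predictions."""
    sim = pd.DataFrame({"Close": close, "Prediction": prediction})
    # 1 = open long, -1 = open short, 0 = keep the current exposure.
    sim["Signal"] = sim["Prediction"].diff().fillna(0.0)
    # +n_assets while the prediction is 1, -n_assets while it is 0 (short).
    sim["Positions"] = n_assets * sim["Prediction"].replace(0, -1)
    sim["Positions value"] = sim["Close"] * sim["Positions"]
    # Cash decreases when assets are bought and increases when they are sold.
    trades = sim["Positions"].diff().fillna(sim["Positions"])
    sim["Cash"] = initial_capital - (trades * sim["Close"]).cumsum()
    sim["Cash"] -= (trades.abs() * sim["Close"] * fee).cumsum()   # transaction costs
    sim["Total"] = sim["Cash"] + sim["Positions value"]
    sim["Returns"] = sim["Close"].pct_change()
    return sim
```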

Chapter 4

Results

4.1 FOREX Data

To evaluate the performance and robustness of the previously defined model, we tested market data from 5 different FOREX currency pairs, each one described by date, open, close, high and low. As sample rate, we chose to use hourly data, since the available FOREX data only starts in the year 1999; with a coarser sampling rate this would intrinsically mean a small amount of usable information, and deep learning algorithms such as FNNs work better when fed with larger data quantities. All the tested FOREX datasets have the same number of rows, comprising hourly data over the period from 12/03/2013 to 12/03/2018, which represents a 5-year period of 31167 trading hours. Experiments were performed in the following markets, with indexes depicted in figures 4.1, 4.2, 4.3, 4.4 and 4.5:

• EUR/USD: The most popular currency pair among FOREX traders. It compares the 2 largest economies in the world.

• GBP/USD: One of the oldest currency pairs, that puts the British Pound (GBP) against the US Dollar (USD).

• GBP/JPY: This currency pair compares the British Pound (GBP) against the Japanese Yen (JPY), and is known for its great volatility. It is considered as a cross currency pair (section 2.1.1), since the US Dollar is not used to calculate the exchange rate.

• USD/JPY: Compares the US Dollar against the Japanese Yen. This market presents low interest rates, which consistently make it one of the most popular among traders.

• USD/CHF: Another major currency pair. It compares the US Dollar with the Swiss Franc (CHF), and it is considered as a safe market due to its behavior during times of uncertainty, usually staying stable or suffering some appreciation.

Figure 4.1: EUR/USD market index
Figure 4.2: GBP/USD market index

Figure 4.3: GBP/JPY market index
Figure 4.4: USD/JPY market index

Figure 4.5: USD/CHF market index

A descriptive summary of each market dataset is presented in table 4.1.

Table 4.1: Summary of market indices

Markets Start date End date Samples Mean Max Min Std Avg. candle

EUR/USD 12/03/2013 12/03/2018 31167 1.197 1.396 1.036 0.109 0.0016

GBP/USD 12/03/2013 12/03/2018 31167 1.470 1.717 1.202 0.143 0.002

GBP/JPY 12/03/2013 12/03/2018 31167 161.01 195.846 125.069 17.694 0.3127

USD/JPY 12/03/2013 12/03/2018 31167 109.743 125.683 92.739 8.019 0.0017

USD/CHF 12/03/2013 12/03/2018 31167 0.955 1.033 0.840 0.038 0.0014

We can confirm that the highest values of standard deviation are given by the markets that include the Japanese Yen as currency pair. Such values indicate evidence of high volatility in this type of markets, confirm- ing the riskier profile that often describes them. We can further check this information by observing the range created between the Min and Max columns for each market. However, a high volatility market does not imply high volatility during hourly periods. This is showed by the last table column Avg. candle, where an average of the hourly market rate variation, is computed for each tested market, by subtracting the High and Low columns for each dataset record. We also present a summary of market returns for each one of the selected markets. We present this because the model attempts to learn and predict binarized returns instead of the actual market index (section 3.4.2). Table 4.2 shows a summary of the analyzed market returns for each tested market.

Table 4.2: Summary of market returns

Markets Start date End date Samples Mean Max Min Std

EUR/USD 12/03/2013 12/03/2018 31166 −2.081 × 10−6 1.572 × 10−2 −2.035 × 10−2 1.114 × 10−3

GBP/USD 12/03/2013 12/03/2018 31166 −1.557 × 10−6 2.176 × 10−2 −5.69 × 10−2 1.159 × 10−3

GBP/JPY 12/03/2013 12/03/2018 31166 2.086 × 10−6 3.247 × 10−2 −8.468 × 10−2 1.566 × 10−3

USD/JPY 12/03/2013 12/03/2018 31166 3.847 × 10−6 1.442 × 10−2 −2.968 × 10−2 1.234 × 10−3

USD/CHF 12/03/2013 12/03/2018 31166 1.022 × 10−6 2.604 × 10−2 −1.402 × 10−1 1.438 × 10−3

4.1.1 Data statistics

Since we chose financial returns as the methodology to transform raw market rates, we can compute helpful statistics that provide useful information about the profitability and risk associated with each selected market. Kurtosis and Skewness (section ??) are 2 powerful statistical metrics that provide information about the shape of a given data distribution, comparing it to a Gaussian distribution. This is possible considering that the probability distribution of market returns is approximately normal. Table 4.3 presents the Skewness and Kurtosis values of each market test set.

Table 4.3: Distribution shape descriptors

Markets Kurtosis Skewness

EUR/USD 22.415 1.038

GBP/USD 19.990 -0.288

GBP/JPY 12.841 -0.114

USD/JPY 9.033 -0.348

USD/CHF 9.917 -0.216

Positive kurtosis indicates excess kurtosis in the market data. Since all the measured return kurtosis values are above 0 (note that the formulation used in section ?? discounts 3), we confirm the presence of heavy tails in the distribution of events. This indicates a higher likelihood of extreme losses and gains. The Skewness parameter represents how skewed the data is. When combined with Kurtosis, it indicates which type of outliers is more likely to be found (negative values indicate a left skew, and positive values a right skew). Since the Skewness measures present values close to 0, we can conclude that the data distribution is almost symmetrical, and the likelihood of achieving higher returns and losses is almost identical.
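Both statistics can be obtained directly from SciPy over the hourly return series; note that scipy.stats.kurtosis uses Fisher's definition by default, which already subtracts 3. A quick sketch, assuming close is a pandas Series of closing prices:

```python
from scipy.stats import kurtosis, skew

returns = close.pct_change().dropna()
print("Excess kurtosis:", kurtosis(returns, fisher=True))   # > 0 means heavy tails
print("Skewness:", skew(returns))                            # sign gives the tail side
```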

4.2 Evaluation metrics

In order to evaluate and assess the quality of a created ML model, one must choose adequate evaluation metrics, capable of correctly expressing the model performance. Evaluation metrics vary according to the type of ML problem being proposed. We can divide the used evaluation metrics into two different types: classification metrics and financial metrics.

4.2.1 Classification metrics

Classification metrics are formulas that express the performance of ML models in classification tasks. Since in this thesis the prediction problem is structured as a classification one, it is important to assess the effectiveness of the created system under some classical classification criteria, in order to check how good the model's predictive power is, and how well it performs the desired task. We chose the following metrics:

• Accuracy: Accuracy is by far the most used metric in classification tasks. It simply measures how frequently a created classifier makes correct predictions. It is given by the proportion between the number of correct predictions and the total number of predictions, therefore referring to the bias of the predictions.

$$Accuracy = \frac{tp + tn}{tp + tn + fp + fn} \qquad (4.1)$$

Note that the numerator of the above equation refers to the number of correct predictions made by the model, decomposed into two main components. The first, tp, refers to the number of correctly identified positive individuals (for example, in a binary classification, how many individuals with label 1 were correctly identified), and tn to the number of correctly identified negative individuals (how many individuals with label 0 were correctly identified). Regarding the denominator, fp represents the number of incorrectly identified individuals and fn the number of individuals incorrectly rejected.

• Precision: Precision is a metric that measures how good a classifier is at the task of classifying positive elements. We could say that precision answers the question "How many selected items are relevant?", being interpreted as the ratio of elements classified as positive which were correctly identified.

$$Precision = \frac{tp}{tp + fp} \qquad (4.2)$$

• Recall: Like precision, recall is a metric that complements accuracy and gives deeper insight into a classifier's performance. Recall answers the question "How many relevant items are selected?", i.e., among all the existing positive items, how many were retrieved by the system (a short sketch using all three metrics follows this list).

$$Recall = \frac{tp}{tp + fn} \qquad (4.3)$$
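All three metrics are readily available in scikit-learn; a short sketch, assuming y_test and y_pred are the binary label and prediction vectors:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

acc = accuracy_score(y_test, y_pred)     # (tp + tn) / (tp + tn + fp + fn)
prec = precision_score(y_test, y_pred)   # tp / (tp + fp)
rec = recall_score(y_test, y_pred)       # tp / (tp + fn)
```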

4.2.2 Financial metrics

Besides assessing the predictive capacity of the system with classification metrics, it is equally important to interpret the achieved results through traditional trading measurements. Since we are dealing with markets, and more specifically with investments in the FOREX domain, using performance measures that evaluate the efficiency of the built model in terms of gains and losses is a natural way to assess the overall quality of the model. We used the following:

• Return of Investment: Measures the efficiency of the performed investments, giving a result that could be translated into a measure of the achieved gains in terms of percentage.

$$ROI = \frac{Returns - InitialInvestment}{InitialInvestment} \qquad (4.4)$$

• Maximum Drawdown: This metric provides a close approximation of the risk associated with an investment, measuring the peak-to-trough decline during a specific period of investment. Drawdowns are calculated as the difference between the highest local maximum and the lowest local minimum that follows it. The recording of a Drawdown is performed until the occurrence of a new local maximum (a sketch of this computation follows the list).

$$MDD = \frac{TroughValue - PeakValue}{PeakValue} \qquad (4.5)$$
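A minimal sketch of how the Maximum Drawdown of a value series (for example, the Total column of the simulation) could be computed:

```python
import numpy as np


def max_drawdown(values):
    """Largest peak-to-trough decline of a value series (equation 4.5)."""
    values = np.asarray(values, dtype=float)
    running_peak = np.maximum.accumulate(values)
    drawdowns = (values - running_peak) / running_peak
    return drawdowns.min()   # most negative value, e.g. -0.06 for a 6% drawdown
```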

4.3 Experimental setup

Before presenting any results related to each market, it is necessary to configure the developed system. Since there are many configurable parameters (section 3.2), we decided to create a standard setup that will serve as the baseline configuration for the set of experiments to be performed. We selected the following model setup:

Table 4.4: System parameters

Parameter Value Component

TI upper bound 100 TI features

TI lower bound 5 TI features

Tournsize 3 GA

Initial population 200 GA

Number of generations 20 GA

Crossover probability 0.5 GA

Mutation probability 0.2 GA

Train size 0.8 FNN

Validation size 0.2 FNN

Activation function ReLU FNN

Epochs 100 FNN

Batch Size 32 FNN

Optimizer Adam FNN

Transaction costs 0.0001 Trading module

Number of assets 80 000 Trading module

As it is possible to check in table 4.4, 4 components of the system were tuned.

• For the TI upper and lower bounds we selected a range from 5 to 100, in order to have indicators that work with small, medium, and long time periods. Also, with this specification, we are expanding the GA search space, extending the number of possible combinations used during the feature selection process. It is also worth mentioning that as input features we selected all the TIs announced in section ??.

• In terms of the GA we tuned 5 parameters. We selected a tournsize value of 3, which makes the Tournament Selection mechanism (section 2.2.2) compare 3 individuals at a time. The GA initializes itself with an initial population of 200 individuals. We chose this value according to R. Storn [24], who states that the number of initially created individuals should be 10 times the dimensionality of the problem; since on average each individual accounts for 20 features, we considered that starting with 200 chromosomes would be a suitable value for this problem. The number of generations was set to 20 due to the lack of computing power.

Greater values could achieve better results, but that would greatly increase the computing time, which is already expensive when working with 20 generations (approximately 32 hours). Concerning the other two GA operators, Crossover and Mutation, we selected triggering probabilities of 0.5 and 0.2, which are standard values for both operators.

• With respect to FNN-related settings, we set up 6 parameters. The first two, Train size and Validation size, simply establish the percentage of data used for its creation. Note that the value of Train size corresponds to a percentage of the whole dataset, while Validation size indicates the percentage of the train set used for validation. For neural activations we selected the ReLU function due to the advantages that it brings in terms of algorithmic convergence [44]. The number of epochs for each individual is 100, since the implemented early stopping mechanism (section 3.5), with a patience level of 20 epochs, rarely lets the training procedure reach the 100 epochs. Batch size is set to 32 for 2 main reasons: first, 32 is a commonly used standard value, and second, lower values of batch size improve the overall convergence of the network [45]. Finally the optimizer, which establishes the algorithm used in the backpropagation procedure, is Adam, an extension of Stochastic Gradient Descent that has proven to be effective among a large variety of domains [46]. We also have an extra parameter that controls the introduction of a Batch Normalization layer in the network, but since its inclusion will be changed during the experimental process, we decided not to include it here.

• As trading parameters we defined the transaction costs and the number of assets as 0.0001% and 80 000. The initial capital is market dependent, due to the differences in currency value presented by the selected markets. Another parameter that is not specified in this setup is the fitness function. We did not define it here because it will be changed during the experimental procedure, in order to evaluate the system behaviour when changing the fitness function.

4.4 Case study A - Simple prediction

In this first case study, we intend to showcase the system behaviour without performing any type of optimization to enhance the model performance. The idea is to get the simplest, purest predictions from the created FNN, using as input vector a GA individual created from a single run. Parameters are set according to the experimental setup (section 4.3). The evaluation methodology is based on the previously announced metrics (section 4.2). Predictions are performed on the datasets presented in section 4.1. The input features are all the ones announced in section ??. Note that not all of them are going to be selected, since feature selection is always performed by the GA.

4.4.1 Classification results

The first way to assess the performance of a simple prediction of a single individual is to evaluate the system according to traditional classification metrics. We present the validation and test accuracy because it is extremely important to keep track of how different their performance is. Test precision and recall give us an idea of how well the system is hitting or missing the ground truth results. For the 5 different markets we obtained:

55 Table 4.5: Classification results

Market Train ACC Validation ACC Test ACC Test Precision Test Recall

EUR/USD 54.56% 50.34% 50.99% 56.82% 50.50%

GBP/USD 52.46% 50.76% 50.41% 15.12% 50.46%

GBP/JPY 53.36% 50.37% 50.87% 32.14% 50.63%

USD/JPY 54.39% 52.16% 50.37% 47.13% 49.91%

USD/CHF 54.58% 51.01% 50.08% 35.03% 50.57%

The above results reflect an average of each monitored metric throughout 10 different system runs in each of the presented markets. We showcase accuracy for 3 different sets. First, for the train set, we present the final accuracy shown by the prediction model when given the train set as test. This displays how well the FNN model learned the given train data. Both Train ACC and Validation ACC represent the final values of the FNN in the final training epoch. Note that the implemented backpropagation algorithm, backed up by the early stopping mechanism, decides when the network has reached the final epoch by searching for the best accuracy ever seen in validation. The performance on the validation set (a cluster of held-out, never seen data used as a test set during the optimization period) reflects how well the system behaved in terms of predicting next hour market variations. Since the backpropagation process optimizes the initially selected fitness function on the validation set, we also showcase Test ACC, a metric that represents accuracy measurements on new data, the test set.

Table 4.6: Classification results with Batch Normalization

Market Train ACC Validation ACC Test ACC Test Precision Test Recall

EUR/USD 58.33% 51.85% 51.03% 60.74% 50.51%

GBP/USD 59.65% 52.5% 51.61% 48.55% 50.84%

GBP/JPY 62.23% 51.02% 50.64% 2.69% 35.33%

USD/JPY 61.15% 51.46% 51.05% 31.26% 50.89%

USD/CHF 59.48% 51.06% 50.77% 49.15% 51.08%

For the sake of model comparison and system optimization, we also studied the impact of having a Batch Normalization layer in the created FNN. This technique is known for speeding up the training procedure, enhancing the FNN learning capabilities (section 3.5). By normalizing the inputs, we keep the distribution of activations close to a Gaussian, forcing it to have zero mean and unit variance. By linearly transforming the given data, we increase the training speed, improving the overall convergence of the network. With the selected parameters, each batch of 32 data records is normalized between the first and the second layer of the network. The hypothesis here was to understand if higher train accuracies could ultimately lead to improved results in terms of test accuracy and ROI. Applying this methodology we obtained the results presented

in table 4.6. As it is possible to notice, the introduction of a Batch Normalization layer significantly changed the results for each market. Comparing the obtained outcomes, the following observations can be made:

• As expected, the application of a Batch Normalization layer improves every market's results in terms of training accuracy. This empirically shows that Batch Normalization accelerates the training procedure of the FNN model, making the network converge faster and learn more from the given data. We were able to achieve an average improvement of 11.27%, which is an extremely positive indicator in favour of the usage of Batch Normalization.

• Validation accuracy also improved with Batch Normalization. This is a promising achievement for the rest of this work, since in this test case we are only dealing with one single individual (no optimization is performed on the validation set). We obtained an average improvement of 1.28%, which at least indicates some enhancement of the prediction capabilities of the network. It is also important to notice that this metric did not improve in some tested markets. For the USD/JPY market, the inclusion of Batch Normalization reduced the performance from 52.16% to 51.46%, and in the USD/CHF currency pair it resulted in only a minor improvement. We believe that this could be related to the high volatility presented by these currency pairs.

• In this particular case study, the achieved test accuracy is obtained in a similar way to validation accuracy, since they are both collected from held-out data, never seen by the system. We simply decided to keep both in order to better understand how the created model works when trying to predict events that break the temporal order displayed by financial time series. By introducing Batch Normalization we accomplished a 0.7% accuracy improvement when predicting on the available test set. A comparison between a normalized and a non-normalized approach is displayed in fig. 4.6.

Figure 4.6: Non-normalized test ACC vs Normalized test ACC

• Precision and recall do not directly improve with the introduction of Batch Normalization. The differences

between the two approaches simply reflect how the trained system adjusts to the data and forecasts price return variations.

4.4.2 Market simulator

The second way to test the model is to evaluate it under financial metrics. We showcase the validation and test ROI since they show how well the model performed in terms of actual gains on held-out data. The amount of available investment capital varies in terms of currency, since the system is tested in the 5 announced markets and currency values are adjusted to always enable the purchase of 80 000 market positions. We also present the number of times that, on average, the system assumed long, short and hold positions.

Table 4.7: Financial results

Market Validation ROI Test ROI Initial Capital Long Short Hold

EUR/USD -13.59% -9.94% 120 000 USD 478 477 5249

GBP/USD 5.18% 4.05% 120 000 USD 154 153 5897

GBP/JPY -0.49% -10.37% 13 200 000 YEN 444 444 5316

USD/JPY -7.47% -9.70% 13 200 000 YEN 516 515 5172

USD/CHF -9.33% -9.79% 120 000 CHF 538 536 5129

The results presented in table 4.7 were obtained using the same methodology as for the classification metrics, i.e., as an average of 10 separate runs for each of the analyzed currency pairs. Similarly to what was done for analyzing the predictive capabilities of the constructed model, we also added a Batch Normalization layer to the neural network in order to study whether an extra normalization layer contributes to greater market returns, both in the validation set and in the test set. It is also intended to analyze the relation between accuracy and ROI results, i.e., whether the improvement of one of the metrics could lead to the improvement of the other.

Table 4.8: Financial results with Batch Normalization

Market Validation ROI Test ROI Initial Capital Long Short Hold

EUR/USD -6.69% -8.29% 120 000 USD 276 275 5652

GBP/USD -10.77% -2.35% 120 000 USD 278 277 5649

GBP/JPY -3.00% 1.58% 13 200 000 YEN 52 51 6102

USD/JPY 2.82% -2.09% 13 200 000 YEN 223 222 5758

USD/CHF -2.70% -0.45% 120 000 CHF 238 237 5729

With the introduction of Batch Normalization, it is possible to identify some variations when comparing the obtained financial metrics with the non-normalized approach. We concluded:

• For the calculated validation ROI, we concluded that not all the tested pairs improve their performance with the introduction of Batch Normalization. Markets like GBP/JPY and GBP/USD displayed a substantial decrease in return, both in validation and test (table 4.8). It seems that for the GBP/USD market the additional layer makes the system increase the number of long and short investments performed, and since the used validation set presents a long bullish trend, having a reduced number of investments could be beneficial for raising the achieved gains. For the GBP/JPY market the system presented a behaviour that appears to be exactly the opposite. Since the validation and test sets are taken from an area where the GBP/JPY market is extremely volatile, the decrease in market investments caused by the usage of Batch Normalization likely reduces the performance obtained with a non-normalized approach.

• When it comes to the achieved test ROI, the results seem to follow the pattern obtained for the validation ROI. The introduction of Batch Normalization enhances the performance of all currency pairs with the exception of the GBP/USD market (figure 4.8). Similarly to what happened when applying this methodology to the validation set, the GBP/USD market presented worse results with the second approach (fig. 4.7). The behaviour of the markets involving the JPY currency also seems to vary when comparing the different approaches. We believe that this may be related to the high volatility presented by these markets (table 4.1), and also to the trend shifts observed when comparing their test and validation sets. For the other two markets, EUR/USD and USD/CHF, the introduction of Batch Normalization remains beneficial, especially in the latter, where its inclusion led to significant improvements.

Figure 4.7: Non-normalized test ROI vs Normalized test ROI
Figure 4.8: Test ACC vs Test ROI

• Regarding the assumed market positions, it is also possible to find a pattern in the above experiments. Batch Normalization reduces the number of opened short and long positions in every analyzed market with the exception of GBP/USD. This confirms the hypothesis that for this market a higher number of investments results in worse returns.

4.5 Case study B.1 - Accuracy as fitness function

In this second test case, we study how accuracy impacts the overall performance when used as the GA fitness function. The main idea is to let the system create several individuals following the GA evolutionary breeding process, with the final goal of maximizing the achieved accuracy, i.e., finding the individual that maximizes this metric. The optimization process is performed on the validation set, and for the fittest individual the associated FNN model weights are saved in order to later load the network and evaluate the test set under the same experimental conditions. Therefore, the intended experiment is to examine and compare whether the performance on validation and test data is enhanced when maximizing only validation accuracy, and ultimately to study whether ROI is improved by this approach. The used system configuration is the one presented in section 4.3. Regarding Batch Normalization, due to the positive results achieved in the majority of markets, we decided to keep it during this test case.

4.5.1 Classification results

Similarly to what was done in the last case study, we started by assessing the system's predictive performance. Table 4.9 displays each market's results, with each single outcome being the result of averaging 10 separate system runs.

Table 4.9: Classification results with ACC fitness

Market Train ACC Validation ACC Test ACC Test Precision Test Recall

EUR/USD 63.15% 53.89% 49.92% 55.65% 49.71%

GBP/USD 66.16% 62.78% 60.55% 66.09% 63.97%

GBP/JPY 63.11% 53.05% 51.41% 48.82% 50.81%

USD/JPY 62.49% 52.54% 50.49% 51.42% 49.79%

USD/CHF 59.48% 52.94% 50.59% 45.34% 51.34%

• The introduction of the accuracy fitness led both train and validation ACC to a substantial increase in every presented market. However, when analyzing the results, the superior performance of the system when optimized for the GBP/USD market is clearly noticeable. Results are aligned with what was showcased during the simple prediction approach, with GBP/USD having the highest results. This indicates a higher capacity towards learning GBP/USD data. Test ACC shows that optimization on the validation set also guarantees valuable results on test, suggesting that the GA made the model converge to a solution where some patterns and non-linearities of the validation data still hold true for the test data.

• Improved results of Validation ACC do not increase the predictive capacity for all the tested currencies. Results show that in some cases the performance was worse than what was achieved with the simple prediction approach.

• Precision and recall, as expected, improved and seem to be more stable, since they are both components of the test accuracy.

4.5.2 Market Simulator

Besides the purely predictive capacities assessed with the above measurements, the investment performance was also evaluated. Market simulations were performed for the 5 selected markets. In addition to the measurements taken for the simple prediction test case (section 3.6), we decided to also keep track of the Maximum Drawdown (section 4.2.2). The introduction of this metric only in the optimization approach is needed in order to track and present to the user the risk involved during the test set period, indicating the possible drawdowns during future investment times. Results are displayed in table 4.10.

Table 4.10: Financial results with ACC Fitness

Market Validation ROI Test ROI MDD Long Short Hold

EUR/USD -9.89% -13.86% 6.72% 347 346 5513

GBP/USD -31.95% -37.54% 40.12% 1090 1090 4037

GBP/JPY -12.91% -0.36% 6.22% 367 367 5470

USD/JPY 5.06% -2.40% 5.10% 113 112 5985

USD/CHF -2.19% -1.82% 6.72% 199 198 5809

• The achieved results indicate worse returns in all markets when compared to the non-optimized version of the system. This may suggest that, for the developed system, accuracy is not a well-suited metric for conducting valuable market investments.

Figure 4.9: Maximum Drawdown for GBP/USD

• The worst returns are obtained in the GBP/USD market. Such results indicate that the higher predictive capacity achieved in this currency pair, 62.78% for Validation ACC and 60.55% for Test ACC, is not reflected in terms of actual gains. The high number of long and short investments follows

the initially stated hypothesis that for the GBP/USD market a higher number of investments may decrease the system performance. The developed strategy seems not to be suitable for this type of market, also presenting the highest risk among all the tested currencies. Figure 4.9 presents the evolution of this metric throughout the entire market simulation for one of the executed runs, in which it is possible to see that for the majority of the time spent in the market the system is only consuming the invested capital, without being capable of generating any profit.

4.6 Case study B.2 - ROI as fitness function

Similarly to what was done in the previous case study, we also decided to test the system using ROI as the GA fitness function. This way, we make the GA evolutionary process search for the fittest individual in terms of validation ROI and check whether its superior performance also holds true for the test set. The used system configuration is the same as the one used in the previous case study (section 4.5). The followed methodology is also the same, with classification and financial results being presented.

4.6.1 Classification results

Table 4.11 presents the predictive capacities of the optimized model using ROI as fitness function. The idea is to showcase how optimizing a different fitness function impacts the achieved goodness of fit in each studied market.

Table 4.11: Classification results with ROI fitness

Market Train ACC Validation ACC Test ACC Test Precision Test Recall

EUR/USD 59.88% 50.19% 49.58% 65.44% 49.33%

GBP/USD 61.59% 51.36% 50.79% 27.85% 54.68%

GBP/JPY 60.73% 51.01% 50.64% 23.83% 49.76%

USD/JPY 60.48% 50.41% 50.06% 68.43% 49.59%

USD/CHF 59.63% 50.71% 50.67% 42.92% 51.17%

• The presented results indicate low accuracy both in validation and test. This is somewhat expected, since ROI is the function maximized in the GA architecture. However, this may not be indicative of an inferior market performance: although the FNN may not be capable of predicting the majority of positive and negative market returns, if it still has the capability to correctly predict some of the higher-profit entry and exit market points, the returns could still be positive.

• Once again, the best performance is shown by the GBP/USD market. This shows that the system's predictive capabilities are more prone to correct forecasts in this market, and despite the fitness function not accounting for the percentage of correctly classified individuals, the system still has a higher learning capacity when compared to the learning process of the other markets.

4.6.2 Market Simulator

Table 4.12 presents the results of ROI in each evaluated market.

Table 4.12: Financial results with ROI fitness

Market Validation ROI Test ROI MDD Long Short Hold

EUR/USD 14.92% -1.93% 5.91% 52 53 6167

GBP/USD 13.40% 5.39% 6.08% 104 103 5961

GBP/JPY 12.29% -2.34% 7.14% 144 144 5918

USD/JPY 9.79% -1.21% 6.65% 103 102 6011

USD/CHF 10.41% 4.55% 3.17% 56 55 6077

• Comparing the results with the ones obtained in the simple prediction approach, every evaluated market improved its performance in terms of validation ROI. This is obviously an anticipated result. The range of obtained values gives an idea of the average performance of the best individual throughout the 20 generations over which the system was optimized.

• In terms of market transactions, it seems that, on average, long and short positions are also kept at small numbers. The system is more likely to favour a low number of investments.

• Regarding the achieved test ROI, the system exhibits a positive performance in the GBP/USD market. We empirically showed that it is possible to achieve positive results in this market. This follows what was previously measured in section 4.1.1, where the Kurtosis values show the presence of "fat tails" in the return distribution, with GBP/USD being the second highest market concerning this measurement. Figure 4.10 presents the achieved ROI results, showing the average, best and worst individuals generated throughout 10 system runs, calculated in hourly periods during the test period. For the best and worst individuals we got a ROI of 11.51% and 2.98% respectively, which indicates a substantial amount of variance in terms of obtained results. However, it is worth noticing the system's ability to obtain only positive results during the 10 performed system runs.

• The USD/CHF market also presents a positive result in terms of test ROI. However, the achieved performance is slightly lower when compared to the GBP/USD market. We believe that this may be due to the extremely oscillatory behavior and low volatility presented by this market. This is also shown by the lower values of drawdown, which are the lowest among all the tested pairs. Similarly to what has been displayed for the GBP/USD pair, figure 4.11 demonstrates the ROI results for the average, best and worst individuals, all sampled in hourly periods during the test period. For the best and worst individuals we obtained a ROI of 6.35% and 3.40% respectively. This showcases the stability and low variance of the results produced by this system solution.

• The above results confirm the hypothesis that using ROI as GA fitness function yields better results than using a fitness function purely based on the classification capabilities of the system, such as the accuracy used in section 4.5.

Figure 4.10: Best, average and worst system individuals for GBP/USD

Figure 4.11: Best, average and worst system individuals for USD/CHF

4.7 Case Study 3 - Further investigation on profitable markets

This section was created in order to extend and enhance the previously performed system analysis on a specific set of currency pairs that achieved promising results in the proposed case studies. We felt that such an approach was needed in order to better understand how the system works in profitable market simulations, analyzing its decisions and results in greater depth. Additionally, the large amount of tested models, accounting for different system parameter tunings, poses a computationally intensive problem in terms of power and time, and could potentially not be helpful to find the best system architecture for each one of the displayed

markets. Therefore, it seemed reasonable to extend this analysis only to currencies that had already displayed superlative behaviour. Following the mentioned approach, we decided to conduct the investigation on the GBP/USD and USD/CHF markets. We based this decision on the results achieved in the prior case studies, where both markets displayed promising outcomes. The selected base architecture for further experiments will be based on the initially selected configuration (section 4.3), with ROI being the selected fitness function. The experiments performed in sections 4.5 and 4.6 confirm that for the studied markets, using ROI over accuracy as GA fitness function improves the overall system performance, ultimately leading to better results. This is the case for the two selected currency pairs, which were the only ones able to generate a profitable return during the test period when optimized for ROI.

4.7.1 Benchmark comparisons

This section focuses on comparing and evaluating the results obtained by the two proposed investment solutions against traditional trading benchmarks. This is crucial in order to assess the overall model stability and trading utility in terms of profit generation, despite the positive returns in both markets (GBP/USD and USD/CHF), as seen in section 4.6. All the strategies are compared using ROI as the comparative metric. It is also important to note that transaction costs were set to 0.0001% of the opened position for every used strategy. As comparative benchmarks we selected the following 3 trading strategies:

• Buy & Hold: This classical approach relies upon the belief that prices move in long bullish trends, and it is not possible to forecast market variations relying on past data. Therefore, a Buy & Hold strategy is a passive investment where the trader opens a long position and holds it for a long period of time until an opinion reversal.

• Sell & Hold: The Sell & Hold strategy is another classical strategy, similar to Buy & Hold, but based on the belief that prices move in a long bearish trend. It also results in a passive investment operation, where trader opens a short position, and closes it only after a change in opinion.

• Random Walk: This trading benchmark results from the Random Walk Theory [1], which states that market fluctuations are randomly generated and completely unpredictable from historical market information. Hence, this strategy is applied by generating a binary random signal, analogous to the y label vector created during the system workflow. Contrary to the other two presented market strategies, which only rely on one single market operation, the Random Walk is able to open both long and short market positions.

The following comparisons serve as an empirical confirmation that the deployed system is suitable for possible market investments in the GBP/USD and USD/CHF markets. The 3 selected strategies were specifically chosen to test whether the created strategy is capable of reacting to different market conditions. The Buy & Hold and Sell & Hold tests were picked because each market has an underlying bullish or bearish trend in the selected testing period. The Random Walk strategy is used to check that the created model is not beaten by a purely random strategy, which would eventually confirm the previously outlined Efficient Market Hypothesis (section 1).
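The three benchmark signals can be generated in a few lines; the sketch below assumes n_periods is the length of the test set and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

buy_and_hold = np.ones(n_periods)                  # always long
sell_and_hold = -np.ones(n_periods)                # always short
random_walk = rng.integers(0, 2, size=n_periods)   # random binary signal, like y
```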

4.7.2 USD/CHF

Table 4.13 presents a comparison of the USD/CHF average, best and worst individuals with the 3 benchmark strategies mentioned above. As comparative indicators we selected 4. The first one is obviously the ROI, which is the foundation of this entire comparison and serves as the basis for the other used metrics. Profitable transactions displays the percentage of profitable transactions among all the ones performed during trading. The third metric, days with positive ROI, gives the user an idea about how risky each strategy is in terms of stability and sustainable growth. Finally, Maximum Drawdown is used again, but this time accounting for drawdowns in past ROI data, instead of returns or portfolio value. Figure 4.12 presents the evolution of the 3 benchmark strategies against the proposed system.

Table 4.13: USD/CHF strategies comparison

Parameters Avg Best Worst Buy&Hold Sell&Hold Random

ROI 4.55% 6.35% 3.41% -4.09% 4.09% -12.98%

Profitable transactions 67.84% 76.05% 58.27% 0% 100% 21.55%

Days with positive ROI 83.11% 86.71% 83.41% 2.51% 99.72% 2.25%

Maximum Drawdown 3.18% 3.17% 3.48% 6.05% 3.98% 29.29%

Figure 4.12: USD/CHF strategies evolution over time

The proposed solution for the USD/CHF currency pair exhibits a promising behavior, illustrated in figure 4.12. Both the Buy&Hold and the Random Walk strategies are clearly outperformed by the proposed solution, reaching negative ROIs of -4.09% and -12.98% respectively and not being able to provide any return to the user at any point during the market simulation period, due to a constantly decaying evolution. The Sell&Hold strategy is the only benchmark capable of presenting a performance suited to the behavior displayed by the USD/CHF. As it is possible to see in figure 4.12, the Sell&Hold is only surpassed by the developed strategy at the end of the trading year, reaching a ROI of 4.09%, a value extremely close to the average ROI of 4.55% shown by the system. However, when examining its evolution, it is possible to see high volatility throughout time, which indicates undesirable risk to the trader. In contrast, the proposed system displays a steady growth during the whole trading period. It is also worth noticing its safer and less risky conduct during the 10 performed system runs, which can be seen in the small range of ROI values that separates the worst and best individuals. The less risky profile assumed by the developed solution can be identified through the values presented in the Maximum Drawdown row, which indicate the biggest drawdown during the trading period in terms of ROI. Regarding that metric, we can confidently state that the proposed solution is less risky than all the benchmark strategies.

Figure 4.13: USD/CHF market entry points

In figure 4.13 it is possible to observe the 72 short and long positions opened by the best proposed system. The densest area, starting at 2017-10, represents the turnover point for the presented solution. This is the point where the proposed solution surpassed the Sell&Hold strategy (visible in figure 4.12, on the same date), being able to outperform it until the end of the trading period. Moreover, despite not being able to stay at its highest ROI value, the system still remains profitable in the beginning of 2018, a period where the trend inverted its path, making the index decrease to the lowest market quote available in the performed test. Such

Such abilities prove the capacity of the algorithm to deal with both bullish and bearish market periods and, although it presents slow and steady growth, to remain profitable in this market.

4.7.3 GBP/USD

Similarly to what was done for the USD/CHF market, table 4.14 presents a comparison between the average, best and worst individuals and the 3 selected benchmark strategies. The presented measurements reveal that the Sell&Hold and Random Walk strategies provide negative ROIs of -10.69% and -15.78%. This can be confirmed by looking at figure 4.14, leading to the conclusion that, from the 3 benchmarks, only the Buy&Hold strategy is capable of adapting to the GBP/USD index, mainly due to the long bullish trend present in the selected trading period. In fact, the Buy&Hold strategy presents an extremely profitable evolution, reaching a return of 10.69%. It is therefore possible to notice the ability of the proposed system to create conservative strategies similar to the Buy&Hold.

Table 4.14: GBP/USD strategies comparison

Parameters                 Avg       Best      Worst     Buy&Hold   Sell&Hold   Random

ROI                        5.65%     11.51%    2.98%     10.69%     -10.69%     -15.78%

Profitable transactions    45.87%    47.37%    42.45%    100%       0%          14.52%

Days with positive ROI     98.77%    99.95%    97.47%    99.38%     1.59%       4.05%

Maximum Drawdown           5.97%     4.24%     5.76%     3.94%      14.58%      36.09%

Figure 4.14: GBP/USD strategies evolution over time

This happens because, during the system optimization period, the best individuals behave in a similar way, favouring a small number of open market positions, heavily shaped by the way the market grows. However, when comparing the achieved results with this benchmark, it is possible to check that only the best individual was capable of outperforming it, achieving a ROI of 11.51%. Both the average and worst individuals were not capable of surpassing it, which indicates that the proposed strategy presents a lot of variance throughout the 10 system runs. We can confirm this by looking at table 4.14, where the displayed ROI results for the average and worst individuals reached disappointing values of 5.65% and 2.98%, respectively. It is therefore possible to conclude that this strategy is not capable of beating the Buy&Hold strategy, which revealed itself to be more profitable and less risky, with a Maximum Drawdown of 3.94%. Still regarding the risk involved in each strategy, Sell&Hold and Random Walk present an extremely high Maximum Drawdown due to their continuous bearish trend.

4.7.4 GBP/USD without Batch Normalization

Since the last experiment proved that the deployed system was not capable of developing a sufficiently reliable investment strategy for the GBP/USD market, we decided to extend the experimental process in order to improve the achieved results. We therefore proceeded with some architectural changes towards a more stable configuration, able to create a new trading strategy with a performance superior to the Buy&Hold benchmark. The main idea is to develop a new version of the proposed system in which the majority of runs creates individuals capable of surpassing the selected benchmark, which would ultimately result in a high-profit average individual. The adopted approach was based on the measurements taken throughout the system analysis, especially the ones obtained in the first case study (4.4), where we focused on measuring the results of a non-optimized approach, with and without a Batch Normalization layer in the FNN internal architecture. Looking at the results obtained during that market simulation, it is possible to see that the introduction of a Batch Normalization layer produced better results for every tested market except the GBP/USD, which led to its inclusion in the model in case study B.1 and case study B.2 (4.5 and 4.6). Hence, based on those results, we decided to test the system without the Batch Normalization layer previously introduced. The achieved results are presented in table 4.15.
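For reference, the sketch below shows in Keras (the library used to build the FNN) how such a configuration switch could look: the same two-hidden-layer network built with or without a Batch Normalization layer after the first dense layer. The ReLU activations, the sigmoid output and the Adam optimizer are assumptions made for illustration, not copied from the thesis configuration files; the layer sizes follow the GBP/USD architecture reported in table 4.17.

from keras.models import Sequential
from keras.layers import Dense, BatchNormalization

def build_fnn(n_inputs, n_layer1, n_layer2, use_batch_norm=True):
    model = Sequential()
    model.add(Dense(n_layer1, activation='relu', input_dim=n_inputs))
    if use_batch_norm:
        # Normalizes the activations of the first hidden layer per mini-batch.
        model.add(BatchNormalization())
    model.add(Dense(n_layer2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))   # binary up/down trading signal
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# GBP/USD variant evaluated in this subsection: no Batch Normalization layer.
model = build_fnn(n_inputs=11, n_layer1=80, n_layer2=63, use_batch_norm=False)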

Table 4.15: GBP/USD without BN strategies comparison

Parameters                 Avg       Best      Worst     Buy&Hold   Sell&Hold   Random

ROI                        14.19%    17.81%    8.08%     10.69%     -10.69%     -19.36%

Profitable transactions    45.87%    50.85%    46.53%    100%       0%          12.78%

Days with positive ROI     99.79%    99.91%    99.74%    99.36%     1.61%       1.53%

Maximum Drawdown           2.94%     4.45%     7.88%     3.94%      14.58%      40.25%

By analyzing the above results, it is possible to conclude that the system performance was greatly improved by the removal of the batch normalization layer. Figure 4.15 depicts the evolution of the new configuration

against the 3 benchmarks. Since the tested market remains unchanged, we can immediately exclude the Sell&Hold and Random Walk strategies, as they are not suited to the evolution of the market index. For the new proposed solution, the achieved average ROI is 14.19%, a value that significantly beats the ROI produced by the Buy&Hold, indicating less variance in the produced solutions. Besides that, the system was also able to improve the results returned by the best and worst individuals. For the best individual we achieved a ROI of 17.81%, a value that clearly outperforms the benchmark strategy. For the worst individual we improved the system efficiency as well, achieving a ROI of 8.08%, a value closer to the Buy&Hold ROI than what we obtained from the normalized version of this experiment. We believe that this approach led to better results because it reduced the predictive capacity achieved by the system. Throughout the entire experimental test, which included all the available case studies, it is noticeable that the system is more prone to learn the GBP/USD market. Therefore, the introduction of a Batch Normalization layer, a technique that seeks accuracy improvement, drove each individual FNN to enhanced results in training, which created an overfitted model that learnt the training data too well. We can confirm this by looking at each case study table and checking that the GBP/USD train accuracy is always superior to its peers.

Figure 4.15: GBP/USD without BN strategies evolution over time

Moreover, by analyzing the results obtained in table 4.9 and table 4.10, in case study 4.5, we can observe that the optimization procedure was able to find extremely fitted individuals that held the best performance for train, validation and test accuracy; but when investing with those individuals, the average ROI results, both in validation and test, displayed an awful performance. This suggests that the model was capable of learning certain parts of the training data but, by creating an excessively active strategy (1090 long and 1090 short positions), was not able to generalize and be conscious of the overall market trend, assuming a reactive behaviour that

attempted to predict all the small return variations. This confirms the hypothesis that Batch Normalization is not suited for this market, and that its usage undermines the trading capacity of the created system. Figure 4.16 displays the positions opened by the best individual during the trading period.

Figure 4.16: GBP/USD without BN market entry points

4.7.5 Feature selection results

In this section we focus on analyzing the impact of feature selection on the two proposed solutions, for both the GBP/USD and USD/CHF markets. This includes the number and type of selected features, in this case technical indicators, among all the implemented ones. Since the system configuration remains unchanged in comparison to the last experimental analysis, the enabled features are EMA, SMA, RSI, MOM, ATR, ATRP, BB, ADX, AA, CMO, DPO, DEMA, ROC, DSS, ROC, KURT, SKEW, STD, STV, CCI, MACD and PO. The optimization procedure carried out by the GA selects the relevant features among all these, defining and tuning the parameters of each TI as well, following the evolutionary optimization process described in section 3.4. Additionally, both solutions are also analyzed in terms of internal model structure. This is also defined by the GA optimization methodology, by selecting the number of neurons present in each neural layer of the final created FNN. Table 4.16 presents the selected features for each one of the proposed solutions, with the respective parameters, both resulting from the evolutionary process of the GA.
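Before turning to table 4.16, the following minimal sketch illustrates the idea behind this mechanism: a chromosome that concatenates a binary selection mask, one look-back period per TI and the two neuron-count genes, so that the GA simultaneously selects features and tunes their parameters. The names, the reduced TI list and the gene layout are assumptions made for illustration, not the actual chromosome encoding.

import random

TI_NAMES = ['EMA', 'SMA', 'RSI', 'MOM', 'ATR', 'BB', 'DEMA', 'PO']  # subset for brevity

def random_chromosome():
    mask = [random.randint(0, 1) for _ in TI_NAMES]              # 1 = feature enabled
    periods = [random.randint(5, 100) for _ in TI_NAMES]         # look-back of each TI
    neurons = [random.randint(5, 100), random.randint(5, 100)]   # FNN layer sizes
    return mask + periods + neurons

def decode(chromosome):
    n = len(TI_NAMES)
    mask, periods, neurons = chromosome[:n], chromosome[n:2 * n], chromosome[2 * n:]
    selected = [(name, p) for name, used, p in zip(TI_NAMES, mask, periods) if used]
    return selected, neurons

selected_tis, layer_sizes = decode(random_chromosome())
print(selected_tis)    # e.g. [('EMA', 97), ('SMA', 50), ('DEMA', 24)]
print(layer_sizes)     # e.g. [80, 63]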

Table 4.16: Selected TI

Market Selected features

GBP/USD (EMA, 97), (SMA, 50), (AA DOWN, 84), (DSS, 99),(SKEW, 83), (STV, 72), (DEMA, 24), (PO,88)

USD/CHF (EMA, 81), (SMA, 77), (HMA, 83), (AA UP, 35), (LOW BB, 9), (MID BB, 50), (DSS, 98), (ATRP, 57), (STD, 18), (DEMA, 85)

The above results denote one tendency across the two different markets. Although the system has at its disposal 22 different TIs, we can clearly notice that the most profitable individual, for both currency pairs, only uses half of them or fewer. In terms of selected TI parameters, it is also possible to notice that the GA seems more prone to use TIs that account for a large period of past hours, which is probably related to the extensive size of the validation period. In order to analyze the usage of features throughout the optimization procedure, we plotted two histograms, in figures 4.17 and 4.18, that show the distribution of used features across each GA generation, with the red vertical dashed line representing the mean value.
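A small matplotlib sketch of how such histograms can be produced from the GA logs is shown below: one count of enabled features per evaluated individual, accumulated over the 20 generations, with the mean marked by a dashed vertical line. The variable names and the simulated data are assumptions; the real counts come from the optimization history.

import matplotlib.pyplot as plt
import numpy as np

# One integer per individual evaluated across the 20 generations of 200 individuals
# (here simulated; in the real run it comes from the GA history).
rng = np.random.default_rng(0)
features_per_individual = rng.integers(8, 20, size=20 * 200)

plt.hist(features_per_individual, bins=range(5, 26), edgecolor='black')
plt.axvline(features_per_individual.mean(), color='red', linestyle='--',
            label='mean number of used features')
plt.xlabel('Number of selected features (TIs plus Close, High, Low)')
plt.ylabel('Number of individuals')
plt.legend()
plt.show()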

Figure 4.17: GBP/USD histogram

Regarding the presented images, it is important to mention that the showcased results also include 3 features to which the selected ones were aggregated, namely Close, High and Low. The figures show that, over 20 generations of 200 initially created individuals, both solutions use, on average, approximately 13 to 14 features throughout the entire system run. For the GBP/USD we can conclude that the achieved result is not in agreement with the presented histogram, with 11 features only being used by approximately 600 individuals. For the USD/CHF the achieved results are distinct, and the majority of the population uses the same number of features as the proposed solution. To complement this information, we also present in figure 4.19 the evolution of the average number of used features across the 20 generations of the GA. The displayed values show that, for both currencies, the system individuals converge to values close

to the previously calculated mean. This confirms that the feature selection method integrated in the GA is functioning, decreasing the number of used features and reducing the variance of each generation's results.

Figure 4.18: USD/CHF histogram

Figure 4.19: USD/CHF and GBP/USD number of features over generation

4.7.6 Fitness evolution

In order to further validate the GA evolutionary optimization process, it is also important to analyze how the validation ROI evolves over the GA generations. We decided to create two box-and-whisker plots showing the validation ROI of each individual in every generation, in fig. 4.20 and fig. 4.21. Box-and-whisker plots are useful to understand the variability of the data in each one of the GA generations. The whiskers present the maximum and minimum values encountered in that generation, while the box displays the range between the data's first and third quartile (interquartile range - IQR), with both quartiles being separated by the ROI median value (the black line inside each box).
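The sketch below shows how such plots can be generated with matplotlib: one box per generation, fed with the validation ROI of every individual in that generation. The data is simulated purely to make the example runnable; in the real analysis it comes from the GA evaluation logs.

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
n_generations, pop_size = 20, 200
# Simulated convergence: ROI distributions whose median rises and spread shrinks.
roi_per_generation = [rng.normal(loc=0.5 * g, scale=10.0 / (1 + 0.2 * g), size=pop_size)
                      for g in range(n_generations)]

plt.boxplot(roi_per_generation, labels=range(1, n_generations + 1))
plt.xlabel('GA generation')
plt.ylabel('Validation ROI (%)')
plt.show()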

Figure 4.20: GBP/USD box-and-whisker plot

These two figures prove that, in each generation, individuals are converging and achieving better ROI scores. In figure 4.20, we can see that each generation's IQR is getting increasingly smaller, at the same time that the median values are becoming progressively higher. These two factors combined indicate that at least 50% of the created individuals are following the evolutionary procedure introduced by the GA. Values above and below the IQR represent outliers (values that achieve surprisingly high or low ROI), which can always be found throughout the optimization. This is the case of the best individual, which is given by the fittest value of the fifteenth generation. This is also the case for the USD/CHF currency pair, where the fittest individual can again be found in the fifteenth generation of the GA (figure 4.21). In terms of fitness optimization, this market presents a slower convergence, with individuals improving at a slower rate when compared to GBP/USD. It may seem that values are not converging, but if we look at figure 4.22, we can see that the average of each generation is increasing towards fitter ROI values.

Figure 4.21: USD/CHF box-and-whisker plot

Figure 4.22: USD/CHF and GBP/USD average ROI per generation

In fact, we can even confirm that USD/CHF provides a smoother fitness convergence, with ROI values presenting a steadier increase when compared to the oscillating behavior assumed by the GBP/USD evolution signal. This proves that the best individuals are not always located in the last generations of the optimization process,

and often the optimized solutions are given by outliers present in later generations, which are subsequently chosen by the GA selection mechanism for the evolutionary breeding procedure. This results in the application of the crossover and mutation operators to the best individuals, which in many cases does not implicitly create the fittest individual, but rather drives the convergence of the whole population. Figure 4.22 presents the average ROI evolution of both markets, with results revealing that the optimization procedure is working, with both curves converging to higher values and maximizing the desired fitness function. The best individuals of each solution were found in the 19th and 20th generations of GBP/USD and USD/CHF respectively, which also indicates that the values plotted in figure 4.22 are not created by high ROI variation among same-generation individuals, and that the highest value is achieved in the final phase of the optimization.

4.7.7 Topology evolution

Finally, to conclude the analysis of the proposed system solutions, we must also inspect and examine the architecture of the two final FNN models. As explained in section 3.4, the final optimized chromosome encodes two genes that are responsible for the number of neurons used in the first and second layer of the developed FNN. The system creates different architectures for the two obtained solutions. The number of neurons used in each layer for GBP/USD and USD/CHF is presented in table 4.17.

Table 4.17: Solutions architecture

Market Number of inputs First layer neurons Hidden layer neurons

GBP/USD 11 80 63

USD/CHF 13 32 40

It is also important to remember that the range of values allowed for each network layer is set by the initial configuration file, with both solutions following the setup parameters presented in section 4.3. Therefore, the number of neurons ranges from 5 to 100. Furthermore, to complement this information and locate the chosen values among all the selected ones, we decided to create two bar-plots that display the 15 most used number-of-neurons genes, one for the first layer and another one for the second layer. Note that the displayed genes are the ones with the highest number of occurrences during the optimization period, i.e., throughout the 20 generations of the GA. For the GBP/USD we obtained the two plots in fig. 4.23 and fig. 4.24. As can be seen in figure 4.23, for the first layer of the FNN generated for the GBP/USD currency pair, the value of 80 neurons is the most frequent one, which is in accordance with the selected solution. The same happens for the second layer (fig. 4.24), with 63 also being the most frequent gene among the 15 most frequent ones. This suggests that the GA converges to such values throughout the optimization procedure. We can confirm this information in fig. A.1, where we plotted a violin plot that shows the evolution of the number of selected neurons in layer 1 and layer 2 during the optimization procedure. This is done by computing the kernel density estimation (KDE) of all the selected neuron values during a single generation. Results show that, as the algorithm proceeds, the choice of a small number of neurons for both layers is substantially reduced, and in the last generations the selected values range between 40 and 100 neurons.
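The bar-plots themselves can be obtained by counting gene occurrences over the whole optimization history, as in the minimal sketch below. The history format and the example values are assumptions used only for illustration.

from collections import Counter

def top_neuron_genes(layer_values, k=15):
    """layer_values: list with the gene value (5..100) of one layer for every
    evaluated individual; returns the k most frequent (value, count) pairs."""
    return Counter(layer_values).most_common(k)

# Example with made-up history values for the first GBP/USD layer:
history_layer1 = [80, 80, 75, 80, 62, 80, 91, 75, 80, 44]
print(top_neuron_genes(history_layer1, k=3))   # [(80, 5), (75, 2), (62, 1)]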

76 Figure 4.23: 15 most used number of neurons in GBP/USD 1st FNN layer

Figure 4.24: 15 most used number of neurons in GBP/USD 2nd FNN layer

For the USD/CHF currency pair, two bar-plots with the 15 most frequent gene values were also created (fig. 4.25 and fig. 4.26). The achieved results present a different behavior when compared to the one obtained for the GBP/USD market, with the GA usually picking a high number of neurons for the first FNN layer and a small number of neurons for the second FNN layer. However, the values selected for the final solution do not follow this trend, with 32 neurons for the first layer and 40 for the second one. Moreover, the values selected for the final solution do not correspond to the most frequent genes during the optimization. For both layers they differ from the spotted tendency, which may confirm the hypothesis that the individual chosen by the GA represents a

minority in terms of architecture, and that the GA convergence did not follow that direction.

Figure 4.25: 15 most used number of neurons in USD/CHF 1st FNN layer

Figure 4.26: 15 most used number of neurons in USD/CHF 2nd FNN layer

The topological convergence of the USD/CHF market can be seen in fig. A.2. In it, we can verify that, as individuals evolve throughout the optimization procedure, the tendency is to select a higher number of neurons for the first layer and a lower number of neurons for the second one. We can also confirm that the values achieved in the proposed solution are not the most frequent ones in the generation where its individual was created. If we look at the distributions available in the fifteenth generation, it is possible to notice that the KDE

curve is extremely small in the areas corresponding to the number of neurons selected for the first and second layers. However, we can clearly spot the influence of that individual in the convergence of the algorithm in the following GA generations. After the 15th generation, the KDE curves that accommodate 32 and 40 neurons (for the first and second layer respectively) become progressively larger towards the 20th generation.

Chapter 5

Conclusions

This thesis presented a financial forecasting system that combines Evolutionary Computing with Deep Learning, in order to provide a trading strategy capable of maximizing the obtained returns and minimizing the associated investment risk. The baseline model for the developed system was a Feedforward Neural Network, optimized by a Genetic Algorithm. To test the developed system, 5 different FOREX currency pairs were selected: EUR/USD, GBP/USD, GBP/JPY, USD/JPY and USD/CHF. This provided a way to assess the system's capacity to adapt to distinct market conditions, since each currency pair index exhibited its own particular properties throughout the chosen data sample time span of 5 years of hourly data. To create more data to feed the developed system, a vast number of Technical Indicators, mathematical formulations that account for past market variations, were used as feature generation tools. The main idea behind this decision was based on the premise that, if there are traders that can consistently beat the market using Technical Analysis, then an expert system that learns through the usage of past data should also be capable of getting some insights about the market's current behavior, creating educated guesses about how it is going to evolve in the near future. Therefore, as a consequence of this choice, the system deeply relies on past data to create future predictions, based on the assumption that markets are inefficient. This is in direct opposition to the Efficient Market Hypothesis, which states that markets are completely random, a belief that we tried to refute throughout the course of this work. Following this methodology, we were able to achieve promising results in some of the tested currency pairs. A simple prediction model without optimization was created in order to check the behavior of each market based on purely random internal parameter selection. Two versions of this test case were performed, one with Batch Normalization and another without it. Results proved that on all tested markets except the GBP/USD, the usage of Batch Normalization improves the predictive performance and the achieved returns (section 4.4). Therefore, backed up by such results, we decided to proceed with the optimization case studies with the inclusion of a Batch Normalization layer in the FNN architecture. For these case studies (section 4.5 and section 4.6), the best results were obtained using the ROI as the GA fitness function, instead of the traditional predictive performance measurement given by ACC. By testing both, we were able to conclude that, when optimizing the system with ACC, we create models that attempt to make a huge number of transactions in a short period of time. Due to their enhanced predictive capacities, they are able to predict small market variations more accurately, which ultimately leads them to negative returns, because that behavior is wrongly replicated in distinct periods of data.

We were able to come up with two different solutions, one for the GBP/USD market and another for the USD/CHF. For the GBP/USD we were able to surpass the Buy&Hold strategy by achieving an average ROI of 14.19%, and a 17.81% ROI for the best achieved individual, against 10.69% achieved by the Buy&Hold. Regarding the USD/CHF currency pair, we were able to outperform the Sell&Hold strategy, which obtained a ROI of 4.09%, by achieving an average ROI of 4.45% and a ROI of 6.35% for the best seen individual. Such results prove that it is possible to extract some profit from market trading by resorting to previous periods of data, indicating that markets are not completely efficient. Finally, in the last 3 subsections of section 4.7, we conducted an extended analysis of the two achieved solutions to understand the convergence of the GA in terms of features, fitness and network topology. The achieved results show that the algorithm converges to improved values of validation ROI, at the same time that it reduces the number of used features, normally preferring TIs that account for long periods of information.

5.1 Future Work

As a follow-up to the presented work, a series of distinct directions can be explored in order to try to improve the developed system. Some of the most relevant are presented next:

• Introduction of a leverage mechanism. This would be interesting in order to evaluate how the system would deal with the additional risk that its usage could potentially bring.

• Introduction of a dynamic GA. There are several implementations of this idea, but the main one is to give the GA the possibility to dynamically adapt the mutation and crossover rates according to the fitness values of the current generation. This would benefit the overall convergence of the algorithm towards a global minimum/maximum.

• Test more fitness functions, especially ones related to the financial domain. A good candidate would be the Sharpe ratio, an indicator that attempts to measure the risk-adjusted return (see the sketch after this list).

• Introduction of time series cross-validation with an expanding window, to better assess the model's predictive power.

• Explore the usage of LSTM and GRU neural networks, since they account for temporal dependencies.

• Unsupervised Learning for feature extraction.
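As referenced in the fitness-function item above, a minimal sketch of the Sharpe ratio is given below. The annualization factor chosen for hourly FOREX data and the zero risk-free rate are illustrative assumptions.

import numpy as np

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=24 * 252):
    """Risk-adjusted return of a series of per-period (e.g. hourly) returns."""
    r = np.asarray(returns, dtype=float) - risk_free
    if r.std() == 0:
        return 0.0
    return np.sqrt(periods_per_year) * r.mean() / r.std()

# Example: reward strategies whose returns are high relative to their volatility.
print(sharpe_ratio([0.001, -0.0005, 0.002, 0.0008]))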

Bibliography

[1] B. G. Malkiel. The efficient market hypothesis and its critics. Journal of economic perspectives, 2003.

[2] E. F. Fama. Efficient capital markets: A review of theory and empirical work. The journal of Finance, 1970.

[3] R. C. Cavalcante, R. C. Brasileiro, V. L. Souza, J. P.Nobrega, and A. L. Oliveira. Computational intelligence and financial markets: A survey and future directions. Expert Systems with Applications, 2016.

[4] M. Casson and J. S. Lee. The origin and development of markets: A business history perspective. Business History Review, 2011.

[5] A. Lunde and A. Timmermann. Duration dependence in stock prices: An analysis of bull and bear markets. Journal of Business & Economic Statistics, 2004.

[6] Bank for International Settlements. Triennial central bank survey of foreign exchange and OTC derivatives markets in 2016, 2016.

[7] R. Cont. Empirical properties of asset returns: stylized facts and statistical issues. 2001.

[8] M. A. Josef Arlt. Financial time series and their features. Acta oeconomica pragensia.

[9] M. P. Taylor and H. Allen. The use of technical analysis in the foreign exchange market. Journal of international Money and Finance, 1992.

[10] F. B. Matos. Ganhar em Bolsa. Leya, 2015.

[11] Y. Zhu and G. Zhou. Technical analysis: An asset allocation perspective on the use of moving averages. Journal of Financial Economics, 2009.

[12] J. J. Murphy. Study Guide for Technical Analysis of the Futures Markets: A Self-training Manual. New York institute of finance New York, 1987.

[13] A. Gorgulho, R. Neves, and N. Horta. Applying a ga kernel on optimizing technical analysis rules for stock picking and portfolio composition. Expert systems with Applications, 2011.

[14] A. Bakhach, E. Tsang, and W. L. Ng. Forecasting directional changes in financial markets. Technical report, 2015.

[15] R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine learning: An artificial intelligence approach. Springer Science & Business Media, 2013.

[16] P. Domingos. The master algorithm: How the quest for the ultimate learning machine will remake our world. Basic Books, 2015.

[17] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 1958.

[18] M. A. Nielsen. Neural networks and deep learning. Determination Press, 2015.

[19] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural networks, 1989.

[20] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade. Springer, 1998.

[21] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 1990.

[22] J. Kiefer, J. Wolfowitz, et al. Stochastic estimation of the maximum of a regression function. The Annals of Mathematical Statistics, 1952.

[23] T. Bäck, D. B. Fogel, and Z. Michalewicz. Evolutionary computation 1: Basic algorithms and operators. CRC press, 2000.

[24] R. Storn. On the usage of differential evolution for function optimization. In Proceedings of the 1996 Biennial Conference of the North American Fuzzy Information Processing Society (NAFIPS). IEEE, 1996.

[25] K. Jebari and M. Madiafi. Selection methods for genetic algorithms. International Journal of Emerging Sciences, 2013.

[26] J. Yao and C. L. Tan. A case study on using neural networks to perform technical forecasting of forex. Neurocomputing, 2000.

[27] Y. Kara, M. Acar Boyacioglu, and O. K. Baykan. Predicting direction of stock price index movement using artificial neural networks and support vector machines. Expert Syst. Appl., 2011.

[28] J. Patel, S. Shah, P. Thakkar, and K. Kotecha. Predicting stock market index using fusion of machine learning techniques. Expert Systems with Applications, 2015.

[29] M. Qiu, Y. Song, and F. Akagi. Application of artificial neural network for the prediction of stock market returns: The case of the japanese stock market. Chaos, Solitons & Fractals, 2016.

[30] T. Fischer and C. Krauss. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research, 2017.

[31] C.-H. Wu, G.-H. Tzeng, Y.-J. Goo, and W.-C. Fang. A real-valued genetic algorithm to optimize the parameters of support vector machine for predicting bankruptcy. Expert systems with applications, 2007.

[32] O. B. Sezer, M. Ozbayoglu, and E. Dogdu. A deep neural-network based stock trading system based on evolutionary optimized technical analysis parameters. Procedia Computer Science, 2017.

[33] Y. Perwej and A. Perwej. Prediction of the bombay stock exchange (bse) market returns using artificial neural network and genetic algorithm. Journal of Intelligent Learning Systems and Applications, 2012.

[34] A. Gorgulho, R. Neves, and N. Horta. Applying a ga kernel on optimizing technical analysis rules for stock picking and portfolio composition. Expert systems with Applications, 2011.

[35] A. Hirabayashi, C. Aranha, and H. Iba. Optimization of the trading rule in foreign exchange using genetic algorithm. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation. ACM, 2009.

[36] G. van Rossum. Python reference manual. Technical report, 1995.

[37] W. McKinney. Pandas: Python data analysis library. Reference Source, 2014.

[38] K. J. Magnuson. Pyti, 2017.

[39] S. J. Brown and J. B. Warner. Using daily stock returns: The case of event studies. Journal of financial economics, 1985.

[40] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural networks: Tricks of the trade. Springer, 2012.

[41] F. Chollet et al. Keras, 2015.

[42] F.-A. Fortin, F.-M. De Rainville, M.-A. Gardner, M. Parizeau, and C. Gagné. DEAP: Evolutionary algorithms made easy. Journal of Machine Learning Research, 2012.

[43] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[44] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), 2010.

[45] Y. Bengio. Practical recommendations for gradient-based training of deep architectures. In Neural net- works: Tricks of the trade. Springer, 2012.

[46] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[47] M. Waskom, O. Botvinnik, D. O’Kane, P. Hobson, J. Ostblom, S. Lukauskas, D. C. Gemperline, T. Augspurger, Y. Halchenko, J. B. Cole, J. Warmenhoven, J. de Ruiter, C. Pye, S. Hoyer, J. Vander- plas, S. Villalba, G. Kunter, E. Quintero, P. Bachant, M. Martin, K. Meyer, A. Miles, Y. Ram, T. Brunner, T. Yarkoni, M. L. Williams, C. Evans, C. Fitzgerald, Brian, and A. Qalieh. mwaskom/seaborn: v0.9.0 (july 2018), 2018.

Appendix A

Topology evolution plots

In this appendix we provide two plots related to the last subsection of this work (section 4.7.7). The plots were created with the help of Seaborn [47], a Python statistical data visualization library.
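As an illustration of how such plots can be generated, the minimal Seaborn sketch below draws one violin (a KDE of the first-layer neuron gene) per GA generation. The DataFrame columns and the simulated drift towards larger layers are assumptions; the real values come from the optimization logs.

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
records = []
for generation in range(1, 21):
    # Simulated drift towards larger first-layer sizes as generations advance.
    neurons = rng.normal(loc=40 + 2 * generation, scale=15, size=200).clip(5, 100)
    records += [{'generation': generation, 'neurons_layer1': int(n)} for n in neurons]

df = pd.DataFrame(records)
sns.violinplot(data=df, x='generation', y='neurons_layer1')
plt.xlabel('GA generation')
plt.ylabel('Neurons in first FNN layer')
plt.show()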

Figure A.1: Evolution of the number of neurons in GBP/USD

Figure A.2: Evolution of the number of neurons in USD/CHF