Classification of Price Extrema Based on Market Microstructure and Price Action Features A Case Study of S&P500 E-mini Futures

Artur Sokolovsky* Luca Arnaboldi†‡

Newcastle University, School of Computing Newcastle Upon Tyne, UK

The University of Edinburgh, School of Informatics Edinburgh, UK

Abstract

The study introduces an automated trading system for S&P500 E-mini futures (ES) based on state-of-the-art machine learning. Concretely: we extract a set of scenarios from the tick market data to train the models and further use the predictions to statistically assess the soundness of the approach. We define the scenarios from the local extrema of the price action. Price extrema is a commonly traded pattern, however, to the best of our knowledge, there is no study presenting a pipeline for automated classification and profitability evaluation. Additionally, we evaluate the ap- proach in the simulated trading environment on the historical data. Our study is filling this gap by presenting a broad evaluation of the approach supported by statistical tools which make it general- isable to unseen data and comparable to other approaches.

1 Introduction

As machine learning (ML) changes and takes over virtually every aspect of our lives, we are now able to automate tasks that previously were only possible with human intervention. A field in which it has quickly gained in traction and popularity is finance [18]. This field, which is often dominated by organisations with extreme expertise, knowledge and assets, is often considered out of reach to indi- viduals, due to the complex decision making and high risks. However, if one sets aside the financial risks, the people, emotions, and the many other aspects involved, the core process of trading can be simplified to decision making under pre-defined rules and contexts, making it a perfect ML scenario. Most current day trading is done electronically, through various available applications. Market data is propagated by the trading exchanges and handled by specialised trading feeds to keep track of trades, bids and asks by the participants of the exchange. Different exchanges provide data in different arXiv:2009.09993v3 [q-fin.TR] 14 May 2021 formats following predetermined protocols and data structures. Finally the dataset is relayed back to a trading algorithm or human to make trading decisions. Decisions are then relayed back to the exchange, through a gateway, normally by means of a broker, which informs the exchange about the

*Email: [email protected], ORCID - 0000-0001-8080-1331 †Email: [email protected], ORCID - 0000-0002-0808-2456 ‡Work partially done whilst at Newcastle University

1 wish to buy (long) or sell (short) specific assets. This series of actions rely on the understanding of a predetermined protocol which allows communication between the various parties. Several software tools exist to ensure that almost all these steps are done for you, with the decisions made being the single point that may be uniquely done by the individual. After a match is made (either bid to ask or ask to bid) with another market participant, the match is conveyed back to the software platform and the transaction is completed. In this context, the main goal of ML is to automate the decision making in this pipeline. When constructing algorithmic trading software, or Automatic Trading Pipeline (ATP), each of the components of the exchange protocol needs to be included. Speed is often a key factor in these ex- changes as a full round of the protocol may take as little as milliseconds. So to construct a robust ATP, time is an important factor. This extra layer adds further complexity to the machine learning problem. A diagram of what an ATP looks like in practice, is presented in Fig1.

Order Entry Gateway

7.

5. 6.6. . market data Exchange Feed Handler Electronic Trading Platform

Order Book Bid Ask

1. 2.

3. Live 4. Data

Decision preprocess Historic Data Machine Learning Trading Strategy

A. B.

Figure 1: Full overview of Automated Trading Platform components

In Fig1 it can be observed that the main ML component is focused on training of the decision making and the strategy. This is by no means a straightforward feat as successful strategies are often jealously guarded secrets, as a consequence of potential financial profits. Several different compo- nents are required, not the least analysing the market to establish components of interest. Historical raw market data contains unstructured information, allowing one to reconstruct all the trading activ- ity, however that is usually not enough to establish persistent and predictable price-action patterns due to the market non-stationarity. This characterisation is a complex process, which requires guid-

2 ance and domain understanding. While traditional approaches have focused on trying to learn from the market time series over the whole year or potentially across dozens of years, more recent work has proposed the usage of data manipulation to identify key events in the market [40], this advanced categorization can then become focus for the machine learning input to improve performance. This methodology focuses on identifying states of a financial market, which can then be used to identify points of drastic change in the correlation structure, whether positive or negative. Previous approaches have used these states to correlate them to world wide events and general market values to categorize interesting scenarios [6], showing that using these techniques the training of the strategy can be greatly optimized. At the same time there is lack of research proposing a full-stack automated trading platform and evaluating it using statistical methods. In this work we build above the existing body of knowledge by proposing an approach for the extraction and classification of financial market patterns based on price action and market microstructure. Moreover we quantify the effects of the approach by using effect sizes, as well as perform hypothesis testing. The contributions of this paper are as follows: 1) a methodology to construct an automated trading platform using a state of the art machine learning techniques; 2) we present an automated market profiling technique based on machine learning; 3) we formulate research questions and assess them statistically and 3) we propose and evaluate the performance of a futures trading strategy based on market profiles. The remainder of the paper is structured as follows: Section 2) provides financial and machine learning background needed to understand our approach; Section 3) provides descriptions of some related work; Section 4) states the research questions of the paper; Section 5) presents the studied data and the study methodology for the automated market profiling approach and assessment methodol- ogy; Section 6) details our constructed APT results; in Section 7) a discussion on results, their implica- tions and limitations is provided; and Section 8) concludes the work.

2 Background

There are several different types of financial data, and each of these has a different role in financial trading. They are widely classified into four categories: i. Fundamental Data, this kind of data is formed by a set of documents, for example financial accounts, that a company has to send to the orga- nization that regulates its activities, this is most commonly accounting data of the business, ii. Market Data, this constitutes all trading activities that take place, allowing you to reconstruct a trading book, iii. Analytics, this is often derivative data acquired by analysing the raw data to find patterns, and can take the form of fundamental or market analytics, and iv. Alternate data, this is extra domain knowl- edge that might help with the understanding of the other data, such as world events, social media, Twitter and any other external sources. In this work we analyse type ii data to extract patterns and construct market profiles, and therefore focus our background on this data type, a more comprehen- sive review of different data types is available in Prado & Lopez [12].

2.1 Data pre-processing In order to prepare data for processing, the raw data is structured into predetermined formats to make it easier for a machine learning algorithms to digest. There are several ways to group data, and various different features may be aggregated. The main idea is to identify a window of interest based on some heuristic, and then aggregate the features of that window to get a representation, called Bar. Bars may contain several features and it is up to the individual to decide what features to select, common features include: Bar start time, Bar end time, Sum of Volume, Open Price, Close Price, Min and Max (usually called High and Low) prices, and any other features that might help characterise the trading performed within this window. The decision of how to select this window may be a make or break

3 for your algorithm, as it will mean you either have good useful data or data not representative of the market. An example of this would be the choice of using time as a metric for the bar window, e.g. take n hours snapshots. However, given the fact there are active and non-active trading periods, you might find that only some bars are actually useful using this methodology. In practice, the widely considered way to construct bars is based on the number of transactions that have taken place or volumes traded. This allows for the construction of informative bars which are independent of timings and get a good sampling of the market, as it is done as a function of trading activity. There are of course many other ways to select a bar [12], so it is up to the prospective user to select one that works for their case.

2.2 Discovery of the price extrema In mathematics an extremum is any point at which the value of a function is largest (a maximum) or smallest (a minimum). These can either be local or global extrema. At local extremum the value is larger/lower at immediately adjacent points, while at a global extremum the value of the function is larger than its value at any other point in the interval of interest. If one wants to maximise their profits theoretically, their intent would be to identify an extremum and trade at that point of optimality i.e. the peak. This is one of the many ways of defining the points of optimality. As far as the algorithms for an ATP are concerned, they will often perform with active trading, so finding a global extremum serves little purpose. Consequently, local extrema within a pre-selected window are instead chosen. Several complex algorithms exist for this with use cases in many fields such as biology [15]. However, the objective is actually quite simple: identify a sample for which neigh- bours on each side have a lower values for maxima, and higher values for minima. This approach is very straight forward and can be implemented with a linear search. In the case where there are flat peaks, which means several entries are of equal value the middle entry is selected. Two further met- rics of interest are, the prominence and width of a peak. Prominence of a peak measures how much a peak stands out from the surrounding baseline of the near entries, and is defined as the vertical dis- tance between the peak and lowest point. Width of the peak is the distance between each of the lower bounds of the peak, signifying the peaks duration. In case of peak classification, these measures can aid a machine learning estimator to relate the obtained features with the discovered peaks, this avoids attempts to directly relate properties of narrow or less prominent peaks with wider or more prominent peaks. These measures allow for the classification of good points of trading as well as giving insight as to what led to this classification with the prominence and width.

2.3 Derivation of the market microstructure features A market microstructure is the study of financial markets and how they operate. It’s features represent the way that the market operates, how decisions are made about trades, price discovery process and many more [32]. The process of market microstructure analysis is the identification of why and how the market prices will change, in order to trade profitably. These may include, 1) the time between trades, as it is usually an indicator of trading intensity [3] 2) volatility, which might represent evi- dence of good and bad trading scenarios, as high volatility may lead to unsuitable market state [23], 3) volume, which may directly correlate with trade duration, as it might represent informed trading rather than less high volume active trading [37], and 4) trade duration, high trading activity is related to greater price impact of trades and faster price adjustment to trade-related events, whilst slower trades may indicate informed single entities [16]. Whist several other options are available they are often instrument related and require expert domain knowledge. In general it is important to tailor and evaluate your features to cater the specific scenario identified. One such important scenario to consider when catering to prices, is the aggressiveness of buyers and sellers. In an Order Book, a match implies a trade, which occurs whenever a bid matches an ask

4 and conversely, however the trade is only ever initiated by one party. In order to dictate who is the aggressor is in this scenario (if not annotated by the marketplace), the tick rule is used [1]. The rule labels a buy initiated trade as 1, and a sell-initiated trade as -1. The logic is the following an initial label l is assigned an arbitrary value of 1 if a trade occurs and the price change is positive, then l 1 if the = price change is negative, and l 0 and if there is no price change l is inverted. This has been shown to = be able to identify the aggressor with high degree of accuracy [17]

2.4 Machine learning algorithms Machine learning is a field that has come to take over almost every aspect of our lives, from personal voice assistants to healthcare. So it comes at no surprise that it is also gaining popularity in the field of algorithmic trading. There is a wide range of techniques for machine learning, ranging from very sim- ple such as regression, to techniques used for deep learning such as neural networks. Consequently, its important to choose an algorithm which is most suited to the problem one wishes to tackle. In ATPs one of the possible roles of the machine learning is to identify situations in which it is profitable to trade, depending on the strategy, for example if using a flat strategy the intent is to identify when the market is flat. In these circumstances, and due to the potentially high financial implications of false negatives understanding the prediction is key. Understanding the prediction involves the process of being able to understand why the algorithm made this decision. This is a non trivial issue and some- thing very difficult to do for a wide range of techniques, neural networks being the prime example (although current advances are being made [53,19]. Perhaps one of the simplest yet highly effective techniques is known as Support Vector Machines (SVM). SVMs are used to identify the hyperplane that best separates a binary sampling. If we imagine a set of points mapped onto a 2d plane, the SVM will find the best line that divides the two different classifications in half. This technique can easily be expanded to work on higher-dimensional data, and since it is so simple, it becomes intuitive to see the reason behind a classification. However, this technique whilst popular for usage in financial forecasting [49, 27], suffers from the drawback that it is very sensitive to parameter tuning, making it harder to use, and also does not work with categorical features directly, making it less suited to complex analysis. Another popular approach are decision trees. The reason tree-based approaches are hugely popu- lar is because they are directly interpretable. To improve the efficacy of this technique, several different trees are trained and used in unison to come up with the result. The most popular case of this is Ran- dom Forest. Random Forest operates by constructing a multitude of decision trees at training time and outputting the classification of the individual trees. However, this suffers from the fact that the differ- ent trees are not weighted and contribute equally, which might lead to inaccurate results. One class of algorithms which has seen mass popularity for its robustness, effectiveness and clarity is boosting algorithms. Boosters create “bins” of classifications that can be combined to reduce overfitting and improve the prediction. The data is split into n samples, either randomly or by some heuristic, and each tree is trained using one of the samples. The results of each tree are then used in an ensemble way to help make the final prediction, effectively each individual weak classifier helps contribute to a final grouped together and much more powerful classifier. Finally each tree features are internally evaluated leading to a weakness measure which dictates the overall contribution to the result. The first usage of boosting using a notion of weakness was introduced by AdaBoost [21], this work presented the concept of combining the output of the boosters into a weighted sum that represents the final output of the boosted classifier. This allows for adaptive analysis as subsequent weak learn- ers are tweaked in favor of those instances misclassified by previous classifiers. Following on from this technique two other techniques were introduced XGBoost [9] and LightGBM [29], both libraries gained a lot of traction in the machine learning community for their efficacy, and are widely used. In this category, the most recent algorithm is CatBoost. CatBoost [42] highly efficient and less prone to

5 bias than its predecessors, it is quickly becoming one of the most used approaches, in part due to its high flexibility. CatBoost was specifically proposed to expand issues in the previous approaches which lead to target leakage, which sometimes led to overfitting. This was achieved by using ordered boost- ing, a new technique allowing independent training and avoiding leakage. Also allowing for better performance on categorical features.

2.5 Feature and prediction analysis Feature analysis is the evaluation of the input data to assess their effectiveness and contribution to the prediction. This may also take the form of creating new features using domain knowledge to improve the data. The features in the data will directly influence the predictive models used and the results that can be achieved. Intuitively, the better the chosen and prepared features, the better the results that can be achieved. However, this may not always be the case for every scenario, as it may lead to overfitting due to a too large dimensionality of the data. The process of evaluating and selecting the best features is referred to as feature engineering. One of the methods supporting feature engineering is feature importance evaluation. The simplest way to achieve this is by feature ranking [24], in essence a heuristic is chosen and each feature is assigned a score based on this heuristic, ordering the features in descending order. This approach however may be problem specific and require previous domain knowledge. Another common approach is the usage of correlations, to evaluate how the features relate to the output. The intent of this approach is to evaluate the dependency between the features and the result, which intuitively might lead in a feature that contributes more to the output [4]. However these approaches evaluate the feature as a single component, in relation to the output, independent of the other features. Realistically one would want to understand their features as a whole and see how they contribute to a prediction as a group. Beyond the initial understanding of the features it is important to get an understanding of the prediction. Compared to previously discussed approaches this starts from the result of the model and goes back to the feature to see which ones contributed to the prediction. This has advantages over the pure feature analysis approaches as it can be applied to all the different predictors individually and gives insights into the workings of the predictor. Recent advances into this approach, namely SHAP (SHapleyAdditive exPlanations) [36], are able to provide insight into a by prediction scoring of each feature. This innovative technique can allow the step through assessment of features throughout the different predictions, providing guided insight which can also be averaged for an overall assessment. This is very useful for debugging an algorithm, assessing the features and understanding the market classifications, making it particularly relevant for this case study.

2.6 Trading strategy In this context we only refer to common strategies for active trading. Active trading seeks to gain profit by exploiting price variations, to beat the market over short holding periods. Perhaps the most com- mon approach is trend-based strategies. These strategies aim to identify shifts in the market towards a raise or decrease in price and sell at the point where they are likely to gain profit. The second com- mon approach is called flat strategy. Unlike trending markets a flat market is a stable state in which the range for the broader market does not move either higher or lower, but instead trades within the boundaries of recent highs and lows. This makes it easier to understand changes of the market and make a profit with a known market range. The role of the machine learning in both these strategies is to predict whether the market is entering a state of flatness or trending respectively. To evaluate the effectiveness of the trading strategy the Sharpe Ratio is used. This is a measure for assessing the performance of an investment or trading approach and can be computed as the

6 following: Rp f S − , (1) = σp where Rp & R f correspond to the portfolio and risk free returns, respectively, and σp is a standard de- viation of the portfolio return. While the equation gives a good intuition of the measure, in practice its annualised version is often computed. It assumes that daily returns follow the Wiener process distri- bution, hence to obtain annualised values, the daily Sharpe values are multiplied by p252 - the annual number of trading days. It should be noted that such an approach might overestimate the resulting Sharpe ratios as returns auto-correlations might be present, violating the original assumption [35].

2.7 Backtesting In order to test a trading strategy evaluation is performed to assess profitability. Whilst it is possible to do so on real market data, it is generally more favourable to do so on historical data to get a risk-free estimation of performance. The notion is that a strategy that would have worked poorly in the past will probably work poorly in the future, and conversely. But as you can see, a key part of backtesting is the risky assumption that past performance predicts future performance. Several approaches exist to perform backtesting and different things can be assessed. Beyond testing of trading strategies, backtesting can show how positions are opened and likelihoods of certain scenarios taking place within a trading period. The more common technique is to implement the backtesting within the trading platform, as this has the advantage that the same code as live trading can be used. Almost all platforms allow for simulations on historical data, although it may differ in form from the raw data one may have used for training. For more flexibility one can implement their own backtesting system in languages such as Python or R. This specific approach enables for the same code pipeline that is training the classifier to also test the data, allowing for much smoother testing. Whilst this will ensure the same data that is used for training may be used for testing it may suffer from differences to the trading software that might skew the results. Another limitation of this approach is that there is no connection to the exchange or the broker, there will be limitations on how order queues are implemented as well as the simulating of latency which will be present during live trading. This means that the identification of slippages, which is the difference between where the order is submitted by the algorithm and the actual market entry/exit price, will differ and impact the order of trades.

2.8 Statistical Reproducibility In order to evaluate the results of our research, answer the research questions, add explainability and increase reproducibility of our study, we make use of several statistical techniques.

2.8.1 Effect Sizes

The first step that has to be done is quantifying the effectiveness of the approach in relation to a con- trol. Statistically this is done using effect sizes. Effect size is a measure for calculating the strength of a statistical claim. A larger effect size indicates that there is a larger difference between the treatment (method) and the control sample. Reporting effect sizes is considered a good practice when present- ing empirical research findings in many fields [47, 25,52]. Two types of effect sizes exist: relative and absolute. Absolute ones provide a raw difference between the two groups and are usually used for quantifying the effect for a particular use case. Relative are obtained by normalising the difference by the absolute value of the control group. Depending on the setting, computing effect sizes one might look at differences in variances explained, differences in mean, and associations in variables [30].

7 Differences in variance explained measures the proportion to which a mathematical model ac- counts for the variation. One of the most common ways to do this is making use of Pearson Correla- tion [31]. Pearson correlation is defined as the covariance of the two variables, divided by the product of their standard deviations. This normalises the measurement of the covariance, such that the re- sult always has a value between -1 and 1. A further commonly used measure is known as r-squared, taking the Pearson correlation and squaring it. By doing this the we can measure the proportion of variance shared by the two variables. The second approach is instead to look at the differences in population means, using a standardization factor. Popular approaches include Cohen’s d [10], which calculates the difference between two samples means with pooled standard deviation as the standard- ization factor. However it was found that the standard deviation may be biased as the standardiza- tion factor, meaning that when the two means are compared and standardized by division as follows f r acu u SD, if the standard deviation SD is used it may cause some bias and alternative standard- 1 − 2 izations may be preferred. This is rectified in the Hedge’s g [26] method, which corrects the bias using a correction factor when computing the pooled standard deviation. A further extension that can be added on top of this correction is to use av or rather average variance instead of variance, this is more powerful when applied to two correlated samples, and uses the average standard deviation instead, once again the corrected Cohen’s da v is referred to as Hedge’s gav [11,34, ?]. The final type of effect size is categorical variable associations, which checks the inter-correlation of variables, and can evalu- ate the probability of variables being dependent on one another, examples of this are the chi-squared test [20], also effective on ordinal variables.

2.8.2 Inferential Statistics

Another core component of a statistical assessment is Inferential Statistics. With inferential statis- tics one aims to determine whether the findings are generalisable to the population. It is also used to determine if there is a significant difference between the means of two groups, which may be re- lated in certain features. The most popular approaches fall under the general linear model [38]. The general concept is that we want to use a null hypothesis to test the probabilistic difference between our sample population and another population. Popular approaches include t-test [48], ANOVA [22], Wilcoxon [51], and many more, depending on the considered setting. A t-test is a type of inferen- tial statistic used to determine if there is a significant difference between the means of two groups, which may be related in certain features. The basic functioning is that you take a sample from two groups, and establish the null hypothesis for which the sample means are equal, it then calculates the mean difference, standard deviations of the two groups and number of data values of two groups and attempts to reject the null hypothesis. If the null hypothesis is rejected it means that the mean dif- ference is statistically significant. However the t-test relies on several assumptions: 1) that the data is continuous, 2) that the sample is randomly collected from the total populations, 3) that the data is nor- mally distributed, and 4) that the data variance is homogeneous [50]. This makes the t-test not suited to analysis of small samples, where normality and other sample properties are hard to assess reliably. An approach which doesn’t face the same limitations is the Wilcoxon test [51]. The advantage of this approach is that instead of comparing means, it repeatedly compares the samples to check if the mean ranks differ, this means it will check the arithmetic average of the indexed position within a list. This type of comparison is applicable for paired data only and done on individual paired subjects, increas- ing the power of the comparison. However a downside of this approach is that it is non-parametric. A parametric test is able to better observe the full distribution, and is consequently able to observe more differences and specific patterns, however as we saw with t-test they rely on stronger assumptions and are sometimes impractical.

8 2.8.3 Inferential Corrections

The more inferences are made, the more likely erroneous inferences are to occur. Multiple compar- isons arise when a statistical analysis involves multiple simultaneous statistical tests, each of which has a potential to produce a discovery, of the same dataset or dependent datasets. Hence, the overall probability of the discovery increases. This increased chance should be corrected. Some methods are more specific but there exist a class of general significance level α adjustments. Examples of these are the Bonferroni Corrections [5] and Šidák Corrections [45]. The general idea follows on from the fol- lowing: given that the p-value establishes that if the null hypothesis holds what is likelihood of getting an effect at least as large in your own sample. Then if the p-value is small enough you can conclude that your sample is inconsistent with the null hypothesis and reject it for the population. So the idea of the corrections is that to retain a prescribed significance level α in an analysis involving more than one comparison, the significance level for each comparison must be more stringent than the initial α. In case of Bonferroni corrections, if for some test performed out of the total n we ensure that its p-value is less than 1.0 α/n, then we can conclude, as previously, that the associated null hypothesis − is rejected.

3 Related Work

In this section we break down related work in this area, including market characterization, price ex- trema and optimal trading points, and automated trading systems. Each of the previous works is compared to our approach. In their seminal work Munnix et al. [41] first proposed the characterization of market structures based on correlation. Through this they were able to detect key states of market crises from raw mar- ket data. This same technique also allowed the mapping of drastic changes in the market, which corresponded to key points of interest for trading. By using k means clustering, the authors were able to predict whether the market was approaching a crisis, allowing them to react accordingly and con- struct a resilient strategy. Whilst this approach was a seminal work in the understanding of market dynamics, it was still based on statistical dependencies and correlations which are not quite as ad- vanced as more modern machine learning approaches. Nonetheless their successful results initiated a lot more research in this area. Their way of analysing a market as a series of states proved to be a winning strategy allowing for more focused decision making and improving the understanding of the market. Following on from this same approach we seek to characterise the market as a series of peaks of interest and understand whether the market structure allows their classification in difference scenarios. This constitutes the initial stage of the ATP,or the preprocessing, and with their approach several steps of manual intervention are still required. Historically there has been an intuition that the changes in market price are random. By this it is understood that whilst volatility is due to certain events, it is not possible to extract them from raw data. However, despite this, volatility is still one of the core metrics for trading [2]. In an effort to statistically analyse price changes and break down key events in the market Caginalp & Caginalp [7], propose a method to find peaks in the volatility, representing price extrema. The price extrema rep- resent the optimal point at which the price is being traded before a large fluctuation. This strategy depends on the exploitation of a shift away from the optimal point to either sell high or buy low. The authors describe the supply and demand of a single asset as a stochastic equation where the peak is found when maximum variance is achieved. Since the implied relationship of supply and demand is something that will hold true for any exchange, this is a great fit for various different instruments. In a different context, Miller et al [39], analyse Bitcoin data to find profitable trading bounds. Bitcoin, unlike more traditional exchanges, is decentralised and traded 24h a day, making the data much more sparse and with less concentrated trading periods. This makes the trends harder to analyse. Their

9 approach manipulates the data in such a way that it is smoothed, through the removal of splines, this seeks to manipulate curves to make its points more closely related. By this technique they are able to remove outliers and find clearer points of fluctuation as well as peaks. The authors then construct a bounded trading strategy which proves to perform well against unbounded strategies. Since Bitcoin is more decentralised and by the very nature of those investing in it, this also reduces barriers to entry, making automated trading is much more common. This means that techniques to identify bounds and points of interest in the market are also more favoured and widely used. An automated trading system, is a piece of code that autonomously trades in the market. The goal of such machine learning efforts is the identification of a market state in which a trade is prof- itable, and to automatically perform the transaction at that stage. Such a system is normally tailored for a specific instrument, analysing unique patters to improve the characterisation. One such effort focusing on FX markets is, Dempster & Leemans [13]. In this work, a technique using reinforcement learning is proposed, to learn market behaviours. Reinforcement learning is an area of machine learn- ing concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward. This is achieved by assigning positive rewards to desired actions and negative rewards to undesired actions, leading to an optimization towards actions that increment re- wards. In financial markets this naturally corresponds to profitable trades. Using this approach the authors are able to characterise when to trade, perform analysis of associated risks, and automatically make decisions based on these factors. In the more recent work, Booth et al. [6], describe a model for seasonal stock trading using an ensemble of Random Forests. This leveraged variability in sea- sonal data to predict the price return based on these events taking place. Their random forest based approach reduces the drawn-down change of peak to through events. Their approach is based on domain knowledge of well-known seasonality events, usual approaches following this technique find that whilst the event is predictable, the volatility is not. So their characterisation allows to predict which events will lead to profits. The Random Forests are used to characterise features of interest in a time window, and multiple of these are aggregated to inform the decision process. These ensembles are then weighted based on how effective they are and used to inform decision with higher weights having more input. Results across fifteen DAX assets show increases in profitability and in prediction precision. As can be seen, statistical and machine learning techniques have been successfully applied in a variety of scenarios proving effective as the basis of automatic trading and identification of profitable events. This makes the further investigation into more advanced machine learning techniques a de- sirable and interesting area. Our work expands into these previous concepts to improve performance and seek new ways to characterise the market.

4 Aim

The aim of the study is comprised of multiple aspects: i) propose an automated trading system and check the significance of its performance ii) propose a two-step way of feature design in the con- text of the automated trading system and assess its performance, iii) interpret the system outputs, iv) demonstrate benefits of the statistical methods in the context of financial time series analysis. We formulate the research questions and statistical hypotheses which are later evaluated as fol- lows. As the first research question we aim to assess whether the estimator is capable of fitting to the data on the considered feature space. We believe, it is a necessary initial step, as without this informa- tion it would be hard to judge on the following findings. RQ1: Is it feasible to classify the extracted price extrema using the proposed feature space and Cat- Boost estimator better than the baseline precision? By the baseline precision we mean using an esti- mator with an always positive class output.

10 As the second research question we aim to investigate if the proposed 2-step method for feature extraction gives any benefit in comparison to using any of the extraction steps alone. RQ2: Does the use of the 2-step feature extraction improve the extremum classification performance with respect to the individual steps? We provide more details and formulate the null and alternative hypotheses in Material and Meth- ods section.

5 Material and Methods

In our study we demonstrate how off the shelf machine learning methods can be applied to financial markets analysis and trading in particular. Taking into account the non-stationary nature of the finan- cial markets, we are achieving this goal by considering only subsets of the time series. We propose a price-action-based way of defining the subsets of interest and perform their classification. Concretely: we identify local price extrema and predict whether the price will reverse (or ’rebound’) or continue its movement (also called ’crossing’). For the demonstration purposes, we set up a simplistic trading strategy, where we are trading a price reversal after a discovered local extremum is reached as shown in Fig.4. We statistically assess our choices of the feature space and feature extraction method. In the simulated trading, we limit our analysis to backtesting and do not perform any live trading. In the Discussion section we address the limitations of such an approach. We share the reproducibility package for the study [46]. In the section we first describe the datasets and pre-processing procedures. Then, we outline the experiment design comprising of entries labelling and setting up the classification task, designing features, evaluating the model performance, evaluating the results statistically, interpreting model results and, finally, simulated trading.

5.1 Data In the study we use S&P500 E-mini CME futures contracts ES(H-Z)2017, ES(H-Z)2018 and ES(H-U)2019. Which correspond to ES futures contracts with expiration in March (H), June (M), September (U) and December (Z). We operate on ticks data which includes Time&Sales records statistics, namely: bid and ask volumes and numbers of trades, as well as the largest trade volumes per tick. We consider a tick to incorporate all the market events between the two price changes by a minimum price step. For the considered financial instrument the tick size is $0.25.

5.2 Data pre-processing We sample the contracts data to the active trading periods by considering only the nearest expiring contracts with the conventionally accepted rollover dates. The samples end on the second Thursday of the expiration month - on that day the active trading is usually transferred to the following contract. This decision ensures highest liquidity, and, due to the double-auction nature of the financial markets, stable minimum bid-ask spreads [28]. In the current study we consider two simplest scenarios of the price behaviour after it reaches the local extremum - reversal and extremum crossing. When labelling the entries, we require up to 15 ticks price movement as a reversal (or rebound) and only 3 ticks for the extremum crossing. The labelling approach allows us to study a range of configurations of the reversals and investigate how the configurations affect the performance of the models. At the same time these ranges are well within the boundaries of the intraday price changes.

11 An essential part of the proposed automated trading system is detection of the price extrema. The detection is performed on a sliding window of the streamed ticks with a window size of 500. We cap- ture peaks with the widths from 100 to 400. The selected widths range serves three purposes: i) en- sures that we do not consider high-frequency trading (HFT) scenarios which require more modelling assumptions and different backtesting engine; ii) allows to stay in intraday trading time frames and have a large enough number of trades for analysis; iii) makes the price level feature values comparable across many of the entries.

5.3 Experiment design 5.3.1 Classification task

In order to incorporate machine learning into the automated trading system, we design a binary classi- fication task, where the labels correspond to price reversals (positives) and crossings (negatives). Due to the labelling design, we are more flexible with take profits and stop losses when trading reversals (up to 15 ticks versus 3 ticks) - this explains their positive labelling.

5.3.2 Feature design

To perform the extrema classification, we obtain two types of features: i) designed from the price level ticks (called price level (PL) features), and ii) obtained from the ticks right before the extremum is approached (called market shift (MS) features) as we illustrate in Fig.2. We think it is essential to perform the two-step collection since the PL features contain properties of the extremum, and the MS features allow to spot any market changes happened between the time when the extremum was formed and the time we are trading it.

PL Features MS Features

Price Level Crossing

Figure 2: The figure illustrates what data we use for the 2-step feature extraction: price level (PL) and market shift (MS) components.

Considering different extrema widths, varying dimensionality of the data does not allow to use it directly for classification - most of the algorithms take fixed-dimensional vectors as input. We ensure

12 the fixed dimensionality of a classifier input by aggregating per-tick features by price. We perform the aggregation for the price range of 10 ticks below (or above in case of a minimum) the extremum. This price range is flexible - 10 ticks are often not available within ticks associated with the price level (red dashed rectangle in Fig.2) in this case we fill the empty price features with zeros. We assume that the further the price from the extremum the less information relevant for the classification it contains. Considering the intraday volatility of ES, we expect that the information beyond 10 ticks from the extremum is unlikely to improve the predictions. If one considers larger time frames (peak widths), this number might need increasing. PL features are obtained from per-tick features by grouping by price with sum, max or count statis- tics. For instance: if one is considering volumes, it is reasonable to sum all the aggressive buyers and sellers before comparing them. Of course, one can also compute mean or consider max and min volumes per tick. If following this line of reasoning, the feature space can be increased to very large dimensions. We empirically choose a feature space described in Tables1 and2 for Price Level and Mar- ket Shift components, respectively. Defining the feature space we aim to make the feature selection step computationally feasible. Too large feature space might be also impractical from the optimisation point of view, especially if the features are correlated. To track the market changes, for the MS feature component we use 237 and 21 ticks and compare statistics obtained from these two periods. Non-round numbers help avoiding interference with the majority of manual market participants who use round numbers [12]. We also choose the values to be comparable to our expected trading time frames. No optimisation was made on them. We obtain the MS features being 2 ticks away from the price level to ensure that our modelling does not lead to any time-related bias where one cannot physically send the order fast enough to be executed on time.

5.3.3 Model evaluation

After the features are designed and extracted, the classification can be performed. As a classifier we choose CatBoost estimator. We feel that CatBoost is a good fit for the task since it is resistant to over- fitting, stable in terms of parameter tuning, efficient and one of the best-performing boosting algo- rithms [42]. Finally, being based on decision trees, it is capable of correctly processing zero-padded feature values when no data at price is available. Other types of estimators might be comparable in one of the aspects and require much more focus in the other ones. For instance, neural networks might offer a better performance, but are very demanding in terms of architecture and parameter optimization. In this study we use precision as the main scoring function (S):

TP S , (2) = TP FP + where TP is the number of true positives and FP the number of false positives. All the statistical tests are run on the precision scores of the samples. This was chosen as the main metric since by design every FP leads to losses, and false negative (FN) means only a lost trading opportunity. To give a more comprehensive view on the model performance, we report F1 scores, PR-AUC (precision-recall area under curve) and ROC-AUC (receiver-operating characteristic area under curve) metrics. We report model performances for the two-step feature extraction approach as well as for the each of the feature extraction steps separately. In order to avoid large bias in the base classifier probability, we introduce balanced class weights into the model. The weights are inversely proportional to the number of entries per class. The con- tracts for training and testing periods are selected sequentially - training is done on the active trading phase of contract N, testing - on N 1, for N [0,B 1], where B is the number of contracts considered + ∈ − in the study.

13 Table 1: Price level feature space component used in the study. These features are obtained when the price level is formed. When discussed, features are referred by the codes in the square brackets at the end of descriptions.

Equation Description Pp PL t t =| − |(Vb Va) Bid and ask volumes summed across all the ticks for t [0,1,2] [PL0] Pp PL t + ∈ t =| − | Vb Bid volumes summed across all the ticks for t [0,1,2] [PL1] Pp PL t ∈ t =| − | Va Ask volumes summed across all the ticks for t [0,1,2] [PL2] Pp PL t ∈ t =| − | Tb Number of bid trades summed across all the ticks for t [0,1,2] [PL3] Pp PL t ∈ =| − | Ta Number of ask trades summed across all the ticks for t [0,1,2] [PL4] t ∈ M(T p PL t ) Maximum bid trade across all the ticks for t [0,1,2] [PL5] = | − | b ∈ M(T )a Maximum ask trade across the ticks for t [0,1,2] [PL6] p PL t ∈ Pp PL=| t − | t =| − | 1 Number of ticks at price for t [0,1,2] [PL7] Pp PL t ∈ t =| − | Vb Price level (PL) features Pp PL t PL1 divided by PL2, for t [0,1,2] [PL8] t =| − | Va ∈ Pp PL t t =| − | Tb Pp PL t Feature PL3 divided by feature PL4 [PL9] t =| − | Ta Pp PL t t =| − | M(T )b Pp PL t Feature PL5 divided by feature PL6 [PL10] t =| − | M(T )a Pp PL t =| − |(V Va ) t b + Pp PL t Total volume at price PL t divided by the number of ticks [PL11] t =| − | 1 | − | P10 Pp PL t t 0 t =| − | Va Total Ask Volume [PL12] = P10 Pp PL t t 0 t =| − | Vb Total Bid Volume [PL13] = P10 Pp PL t t 0 t =| − | Ta Total Ask Trades [PL14] = P10 Pp PL t t 0 t =| − | Tb Total Bid Trades [PL15] = P10 Pp PL t t 0 t =| − |(Va Vb) Total Volume [PL16] = + - Peak extremum - minimum or maximum [PL17] - Peak width in ticks described in the Background section [PL18] - Peak prominence - described in the Background section [PL19] - Peak width height - described in the Background section [PL20] p PL t P =| − | t [0,1,2] Vb p∈ PL t Bid volumes close to extremum divided by ones which are further [PL21] P =| − | t [3..9] Vb ∈ p PL t P =| − | t [0,1,2] Va p∈ PL t Ask volumes close to extremum divided by ones which are further [PL22] P =| − | t [3..9] Va ∈ p PL t P =| − | t [0,1,2] Vb Sum bid volume close to the price extremum divided by p∈ PL t P =| − | the close ask volume [PL23] t [0,1,2] Va ∈ p PL t P =| − | t [3..9] Vb Sum bid volume far from the price extremum divided by p∈ PL t P =| − | the far ask volume [PL24] t [3..9] Va ∈ OB - order book T - trades t - ticks N - total ticks p - price w - tick window PL - extremum price V - volume b - bid a - ask Key PN - price level neighbours until distance M(X) - Max value in set X

14 Table 2: Market shift feature space component used in the study. These features are obtained right before the already formed price level is approached. When discussed, features are referred by the codes in the square brackets at the end of descriptions.

Equation Description Pw 237 t,b= (Vb ) Pw 237 Fraction of bid over ask volume for last 237 ticks [MS0] t,a= (Va ) Pw 237 t,b= (Tb ) Pw 237 Fraction of bid over ask trades for last 237 ticks [MS1] t,a= (Ta )

Pw 237 Pw 21 t = Vb t = Vb Fraction of bid/ask volumes for long minus Pw 237 Pw 21 t = Va − t = Va short periods [MS2] Pw 237 Pw 21 t = Tb t = Tb Pw 237 Pw 21 Fraction of bid/ask trades for long minus short periods [MS3] t = Ta − t = Ta

Pw 237 Pw 21 t = M(T )b t = M(T )b Max bid trade divided by ask for long periods minus Pw 237 Pw 21 t = M(T )a t = M(T )a short periods [MS4]

Market Shift (MS) features − RSI(p S) Technical indicator RSI with the stated periods p [MS5 ] ∈ X Technical indicator MACD with the stated long & short periods MACD(l p S; sp l p/2) ∈ = l p,sp [MS6X ] T - trades N - total ticks w - tick window V - volume b - bid a - ask PN - price level neighbours until distance Key M(X) - Max value in set X S - ranges [20,40,80,120,160,200]

We apply a commonly accepted in ML community procedure for input feature selection and model parameter tuning [33]. Firstly, we perform the feature selection step using a Recursive Feature Elim- ination with cross-validation (RFECV) method. The method is based on gradual removal of features from the model input starting from the least important ones (based on the model’s feature impor- tance), and measuring the performance on a cross-validation dataset. In the current study on each RFECV step we remove 10% of the least important features. Cross-validation allows robust assessment of how the model performance generalizes into unseen data. Since we operate on the time series, we use time series splits for cross-validation to avoid the look-ahead bias. For the feature selection, the model parameters are left default, the only configuration we adjust is class labels balancing as our data is imbalanced. Secondly, we optimize the parameters of the model in a grid-search fashion. Even though CatBoost has a very wide range of parameters which can be optimized, we choose the param- eters common for boosting trees models for the sake of feasibility of the optimization and leaving the possibility of comparing the optimisation behaviour to the other boosting algorithms. The following parameters are optimized: 1) Number of iterations, 2) Maximum depth of trees, 3) has_time parameter set to True or False, and 4) L2 regularisation. For the parameter optimisation we use a cross-validation dataset as well. We perform training and cross-validation within a single contract and the backtesting of the strategy on the subsequent one to ensure relevance of the optimized model.

5.3.4 Statistical evaluation

Here we formalise the research questions by proposing null and alternative hypotheses, suggesting statistical tests for validating them, as well as highlighting the importance of the effect sizes. The effect sizes are widely used in empirical studies in social, medical and psychological sci- ences [34]. They are a practical tool allowing to quantify the effect of the treatment (or a method)

15 in comparison to a control sample. Moreover, they allow generalization of the findings to the whole population (unseen data in our case). Finally, effect sizes can be compared across studies [34]. We be- lieve that introduction of the effect sizes into the financial markets domain contributes to the research reproducibility and comparability. In the current study we report Hedge’s gav - unbiased measure designed for paired data. The effect sizes are visualised in a form of forest plots with .95 confidence intervals (CIs), representing the range where the effect size for the population might be found with the .95 probability. We correct the confidence intervals for multiple comparisons by applying Bonferroni corrections. When testing the hypotheses, the samples consist of the test precisions on the considered con- tracts, leading to equal sample sizes in the both groups, and entries are paired as the same underlying data is used. Comparing small number of paired entries and being unsure about normality of their dis- tribution, we take a conservative approach and for hypothesis testing use the single-tailed Wilcoxon signed-rank test. This test is a non-parametric paired difference test, which is used as an alternative to a t-test when the data does not fulfill the assumptions required for the parametric statistics. When reporting the test outcomes, we support them with the statistics of the compared groups. Namely, we communicate standard deviations, means and medians. We set the significance level of the study to α .05. Also, we account for multiple comparisons by = applying Bonferroni corrections inside of each experiment family [8]. We consider research questions as separate experiment families.

5.3.5 Model analysis

We perform the model analysis in an exploratory fashion - no research questions and hypotheses are stated in advance. Hence, the outcomes of the analysis might require additional formal statistical assessment. We use SHAP local explanations to understand how models end up with the particu- lar outputs. Through the decision plot visualisations we aim to find common decision paths across entries as well as informally compare models with small and large numbers of features. The repro- ducibility package contains the code snippets as well as the trained models, which allow to repeat the experiments for all the models and contracts used in the study.

5.3.6 Simulated trading

The trading strategy is defined based on our definition of the crossed and rebounded price levels, and schematically illustrated in Fig.3. It is a flat market strategy, where we expect a price reversal from the price level. Backtrader Python package 1 is used for backtesting the strategy. Backtrader does not allow taking bid-ask spreads into account, that is why we are minimizing its affects by excluding HFT trading opportunities (by limiting peak widths) and limiting ourselves to the actively traded contracts only. Since ES is a very liquid trading instrument, its bid-ask spreads are usually 1 tick, which however does not always hold during extraordinary market events, scheduled news, session starts and ends. We additionally address the impact of spreads as well as order queues in the Discussion section. In our backtests, we evaluate performance of the models with different rebound configurations and fixed take-profit parameter, and a varying take-profits with a fixed rebound configuration to better understand the impact of both variables on the simulated trading performance.

5.4 Addressing research questions In the current subsection we continue formalizing the research questions by proposing the research hypotheses. The hypotheses are aimed to support the findings of the paper, making them easier to

1Available at: https://www.backtrader.com/

16 start Tick

yes get input features 2 ticks from PL yes|no

no

classify crossing crossing|rebound end

rebound

submit limit order at price level short long limit limit max order order

price level min min|max

stop loss: 3 ticks take profit: 15 ticks

Figure 3: The block diagram illustrates steps of the trading strategy. communicate. For both research questions we run the statistical tests on the precision metric. In the current study we encode the hypotheses in the following way: H0X and H1X correspond to null and alternative hypotheses, respectively, for research question X.

5.4.1 CatBoost versus no-information model (RQ1)

In the first research question we investigate whether it is feasible to improve the baseline performance for the extrema datasets using the chosen feature space and CatBoost classifier. We consider the base- line performance to be precision of an always positive class output estimator. The statistical test ad- dresses the following hypothesis: H11: CatBoost estimator allows classification of the extracted extrema with precision better than

17 the no-information approach. H01: CatBoost estimator allows classification of the extracted extrema with a worse or equal preci- sion in comparison to the no-information approach.

5.4.2 Two-step versus single-step feature extraction (RQ2)

The second research question assesses if the proposed two-step feature extraction gives any statis- tically significant positive impact on the extremum classification performance. The statistical test addresses the following hypothesis: H12: Two-step feature extraction leads to an improved classification precision in comparison to using features extracted from any of the steps on their own. H02: Two-step feature extraction gives equal or worse classification precision than features ex- tracted from any of the steps on their own. In the current setting we are comparing the target sample (the 2-step approach) to the two control samples (MS and PL components). We are not aiming to formally relate the MS and PL groups, hence only comparisons to the 2-step approach are necessary. In order to reject the null hypothesis, the test outcomes for both MS and PL components should be significant.

6 Results

In the current section we communicate the results of the study. Namely: the original dataset and pre- processed data statistics, model performance for all the considered configurations, statistical evalu- ation of the overall approach and the two-step feature extraction, and, finally, simulated trading and model analysis. An evaluation of these results is presented in Sec.7.

6.1 Raw and pre-processed data The considered data sample is described in Tab.3. We provide the numbers of ticks per contract. The contracts are sorted by the expiration date from top to bottom ascendingly. The number of ticks changes non-monotonically - while the overall trend is raising, the maximum number of ticks is ob- served for ESZ2018 contract. And the largest change is observed between ESZ2017 & ESH2018.

Table 3: Numbers of reconstructed ticks per contract used in the study.

Contract Number of ticks ESH2017 1271810 ESM2017 1407792 ESU2017 1243120 ESZ2017 1137427 ESH2018 2946336 ESM2018 2919757 ESU2018 1825417 ESZ2018 3633969 ESH2019 3066530 ESM2019 2591000 ESU2019 2537197

We perform the whole study on 3 different rebound configurations: 7, 11 and 15 ticks price move- ment required for the positive labelling. We communicate the numbers of entries in the classification

18 Table 4: The table communicates numbers of positive labels as well as total numbers of entries per contract. ’Reb.’ corresponds to the rebound - required number of ticks for the positive labelling. Rebound columns show numbers of positively labeled entries.

Contract Reb. 7 Reb. 11 Reb. 15 Total ESH2017 896 804 703 4911 ESM2017 951 881 756 4181 ESU2017 858 812 693 3446 ESZ2017 689 640 518 12317 ESH2018 2014 1983 1953 11537 ESM2018 1965 1936 1868 6534 ESU2018 1271 1226 1095 16331 ESZ2018 2677 2620 2565 11718 ESH2019 1990 1949 1883 9743 ESM2019 1711 1692 1630 10389 ESU2019 1761 1735 1704 6191 tasks in Table4. As one can see, the numbers of the extracted extrema do not strictly follow the lin- ear relation with the numbers of ticks per contract. Considering the numbers of positively labelled entries, the numbers decrease for the larger rebound sizes.

6.2 Price Levels 6.2.1 Automatic extraction

Since the first step of the pipeline is detecting peaks, we show a sample of data with automatically detected peaks and widths in Fig.4. We provide a 2k ticks sample with an upward price trend. The automatically detected peaks are marked with grey circles and the associated peak widths are depicted with solid grey lines. Some of the peaks are not automatically discovered as not satisfying conditions of the algorithm by having insufficient widths or being not prominent enough (see Section 2.2 for the definitions of both).

6.2.2 Classification of the extrema

For all the considered models we perform feature selection and parameter tuning. We make the opti- misation results available as a part of the reproducibility package. The model precisions obtained on a per-contract basis are provided in Table5. The relative changes in the precision across contracts are preserved across the labelling configurations. There is no evidence that certain configuration shows consistently better performance across contracts. We report the rest of the metrics in the supplemen- tary data, in TableS1. The precision of the models for each of the feature extraction steps: Market Shift (MS) and Price Level (PL) components are presented in Table6. The rest of the metrics are reported in the supplementary materials, in TablesS3 andS2 for the MS and PL, respectively.

6.2.3 Price Levels, CatBoost versus No-information estimator (RQ1)

Below we present the effect sizes (Fig.5) associated with the research question. Concretely, we use precision as the measurement variable for comparing between the no-information model and the CatBoost classifier. There are no configurations showing significant effect sizes. The largest effect size is observed for the 15 tick rebound labelling. Large CIs are observed partially due to a small sample size.

19 Crossing

Figure 4: A sample of data demonstrating automated peak detection and peak widths annotation.

Table 5: Precisions of CatBoost classifier with the 2-step feature extraction (’2-step’) and always-positive output classifier (’Null’). The performance is reported for 3 labelling configurations: 7,11 and 15 ticks rebounds.

Contract Rebound 7 Rebound 11 Rebound 15 2-step Null 2-step Null 2-step Null ESH2017 0.20 0.19 0.20 0.18 0.15 0.15 ESM2017 0.22 0.21 0.23 0.19 0.19 0.17 ESU2017 0.25 0.20 0.15 0.19 0.18 0.15 ESZ2017 0.17 0.16 0.17 0.16 0.19 0.16 ESH2018 0.19 0.17 0.18 0.17 0.17 0.16 ESM2018 0.21 0.19 0.21 0.19 0.18 0.17 ESU2018 0.17 0.16 0.16 0.16 0.16 0.16 ESZ2018 0.18 0.17 0.17 0.17 0.16 0.16 ESH2019 0.18 0.18 0.18 0.17 0.18 0.17 ESM2019 0.18 0.17 0.17 0.17 0.17 0.16

We test the null hypothesis for rejection for the 3 considered configurations. The original data used in the tests is provided in Table5. We report test outcomes in a form of test statistics and p-values in Table7. Additionally, in the same table the compared groups statistics are included. We illustrate the performance of the compared groups in the supplementary materials, in Figure S1. There is now skew in any of the labelling configurations - medians and means do not differ within the groups.

20 Table 6: Precisions of the Market Shift (MS) and Price Level (PL) feature spaces reported for the labelling configurations of 7, 11 and 15 ticks.

Contract Rebound 7 Rebound 11 Rebound 15 MS PL MS PL MS PL ESH2017 0.20 0.19 0.19 0.17 0.15 0.15 ESM2017 0.24 0.21 0.20 0.19 0.16 0.18 ESU2017 0.25 0.23 0.23 0.17 0.15 0.19 ESZ2017 0.18 0.16 0.16 0.16 0.16 0.15 ESH2018 0.19 0.17 0.18 0.18 0.17 0.16 ESM2018 0.22 0.19 0.21 0.20 0.18 0.17 ESU2018 0.17 0.15 0.17 0.16 0.16 0.15 ESZ2018 0.18 0.17 0.17 0.18 0.17 0.16 ESH2019 0.19 0.18 0.18 0.17 0.17 0.17 ESM2019 0.18 0.18 0.18 0.17 0.18 0.16

We see around 2 times larger standard deviations for the CatBoost model in comparison to the no- information model. The potential reasons and implications are discussed in sections 7.2 and 7.5, re- spectively. There are 3 tests run in this experiment family, hence after applying Bonferroni corrections for multiple comparisons, the corrected significance level is α .05/3 .0167. = = Table 7: Statistics supporting the outcomes of the Wilcoxon test which assesses whether CatBoost estimator with the 2-step feature extraction (’CB’ column) leads to a better classification performance than the always-positive output classifier (’Null’ column). The result is reported for the rebound labelling configurations of 7, 11 and 15 ticks.

Statistics Test Groups Rebound 7 One-tailed Wilcoxon test p-value .001 < Test Statistics 55.0 CB Null Mean (Precision) 0.19 0.18 Median (Precision) 0.18 0.17 Standard Deviation (Precision) 0.025 0.0151 Rebound 11 One-tailed Wilcoxon test p-value .053 Test Statistics 44.0 CB Null Mean (Precision) 0.18 0.17 Median (Precision) 0.18 0.17 Standard Deviation (Precision) 0.022 0.0112 Rebound 15 One-tailed Wilcoxon test p-value .0049 Test Statistics 52.0 CB Null Mean (Precision) 0.17 0.16 Median (Precision) 0.17 0.16 Standard Deviation (Precision) 0.012 0.0055

21 Rebound 15

Rebound 11

Rebound 7

0.5 0.0 0.5 1.0 1.5 2.0

Hedge's gav

Figure 5: Hedge’s gav effect sizes quantifying the improvement of the precision from using the CatBoost over the no-information estimator. The error bars illustrate the .95 confidence intervals, corrected for multiple comparisons (3 in this case). The dashed line corresponds to the significance threshold. Rebounds accord to different labelling configurations, where 7, 11 and 15 are ticks required for the positive labelling of an entry.

6.2.4 Price Levels, 2-step feature extraction versus its components (RQ2)

We display the effect sizes related to the second research question in Figure6. We report them sep- arately for the two feature extraction components versus the 2-step. There are no significant effects observed for any of the labelling configurations. Moreover, for the considered sample MS effect sizes are negative for rebounds 7 and 15. The negative effect size in the considered setting means that MS compound performs better than the 2-step approach. This effect is insignificant, hence does not gen- eralise to the population. We perform statistical tests to check if the null hypothesis H02 can be rejected. The original data used in the tests is provided in Tables5 and6, for target (2-step) and control groups (MS and PL), respectively. We communicate the test outcomes in Table8. For the sake of reproducibility, in the same table we report the compared groups’ standard deviations, means and medians. We interpret the results of the tests in section 7.3. Finally, to support the reader, we plot the performance of the considered groups in the supplementary data, Figure S2. There are 6 tests run in this experiment family, hence after applying Bonferroni corrections, the corrected significance level is α .05/6 = = .0083.

6.3 Model analysis Here we communicate the exploratory analysis of the trained models. We choose two models trained on the same contract but with different labelling configurations. Namely, we report the analysis of the models trained on the ESH2019 contract, with rebounds 7 and 11. The choice is motivated by very different final numbers of features after the feature selection step. The core analysis is done on the decision plots, provided in Figures7 and8. Since no pattern was observed when plotting the whole sample, we illustrate a random subsample of 100 entries from top

22 PL MS Rebound 15

Rebound 11

Rebound 7

0 1 2 1 0 1

Hedge's gav Hedge's gav

Figure 6: Hedge’s gav effect sizes quantifying the improvement of the precision from using the 2-step feature extraction over the each of the components (PL and MS). The error bars illustrate the .95 confidence intervals, corrected for multiple comparisons (6 in this case). The dashed line corresponds to the significance threshold. Rebounds accord to different labelling configurations, where 7, 11 and 15 are ticks required for the positive labelling of an entry.

1000 entries sorted by the per-feature contribution. Additionally, we provide summary plots with the SHAP values of the models for the whole sample in the supplementary materials, in Figures S3, S4. There are 29 features in the model trained on the rebound 11 configuration, and 8 features in the one trained on the rebound 7 labels. 7 features are present in both models: PL23, MS6_80, MS0, PL24, MS2, MS6_20, MS6_200. Contributions from the features in the rebound 7 model are generally larger, also, overall confidence of the model is higher. We notice that the most impactful features are coming from the Market Shift features. In Figure7 one can see two general decision patterns: one ending up at around 0.12-0.2 output probability and another one consisting of a number of misclassified entries ending up at around 0.8 output probability. In the first decision path most of the MS features contribute consistently towards the negative class, however PL23 and PL2 often push the probability in the opposite direction. In the second decision path PL23 has the most persistent effect towards the positive class, which is opposed by MS6_80 in some cases and gets almost no contribution from MS0 and MS6_200 features. In Figure8 there is a skew in the output probabilities towards the negative class. Contributions from the PL features are less pronounced than for the rebound 7 model - top 8 features belong to the MS feature extraction step. There is no so obvious decision path with misclassified entries. At the same time, we see strong contributions towards negative outputs from MS6_20 and MS6_40. This is especially noticeable for the entries which have output probabilities around 0.5 before MS6 are taken into account.

6.4 Simulated trading Here we report two types of the simulated trading experiments - with a fixed take profit (15 ticks, as shown in Fig3) and varying rebounds of 7, 11 and 15 ticks, as well as fixed rebound with varying take

23 Table 8: Statistics supporting the outcomes of the Wilcoxon test which assesses whether CatBoost estimator with the 2-step feature extraction (’2-step’ column) leads to a better classification performance than each of the single-step feature extraction (’PL’ and ’MS’ columns). The result is reported for the rebound labelling configurations of 7, 11 and 15 ticks.

Statistics Rebound 7 One-tailed Wilcoxon test p-value .0049 .96 Test Statistics 52.0 11.0 2-step PL 2-step MS Mean (Precision) 0.19 0.18 0.19 0.20 Median (Precision) 0.18 0.18 0.18 0.19 Standard Deviation (Precision) 0.025 0.018 0.025 0.028 Rebound 11 One-tailed Wilcoxon test p-value .080 .58 Test Statistics 42.0 26.0 2-step PL 2-step MS Mean (Precision) 0.18 0.18 0.18 0.19 Median (Precision) 0.18 0.18 0.18 0.18 Standard Deviation (Precision) 0.022 0.012 0.022 0.017 Rebound 15 One-tailed Wilcoxon test p-value .052 .35 Test Statistics 44.0 32.0 2-step PL 2-step MS Mean (Precision) 0.17 0.16 0.17 0.17 Median (Precision) 0.17 0.16 0.17 0.17 Standard Deviation (Precision) 0.012 0.006 0.012 0.010 profits of 7, 11 and 15 ticks. The results are shown in Figures9 and 10 for the fixed rebounds and take profits, respectively. In addition to the cumulative profits we report annualised Sharpe ratios with a 5% risk-free annual profit. Varying take profit configuration has more evident effects in 2019 (Fig9), while behaviour divergence for varying rebounds is observed already in 2018 (Fig10). When computing the net outcomes of the trades, we add $4.2 per-contract trading costs based on our assessment of current broker and clearance fees. We do not set any slippage in the backtesting engine, since ES liquidity is large. However, we execute the stop-losses and take-profits on the tick following the close-position signal to account for order execution delays and slippages. This allows to take into account uncertainty rooting from large volatilities and gaps happening during the extraor- dinary market events. The backtesting is done on the tick data therefore there are no bar-backtesting assumptions made.

24 0.0 0.2 0.4 0.6 0.8 1.0

MS0

MS6_200

MS6_20

MS6_80

PL23

PL2, t=1

MS2

PL24

0.0 0.2 0.4 0.6 0.8 1.0 Model output value

Figure 7: The figure illustrates the feature contributions to the output on a per-entry basis for the CatBoost model, trained on ESH2019, rebound 7 configuration. X axis shows the strength of the contribution either towards a positive class (when the change is >0) or towards the negative one. Colors indicate the model output confidence - blue corresponds to the negative class (crossing) and red - to the positive (rebound). Features are sorted by importance descendingly from the top. Misclassified entries are depicted with dashed lines.

25 0.2 0.3 0.4 0.5 0.6

MS6_40 MS6_20 MS1 MS6_120 MS0 MS6_80 MS6_200 MS6_160 PL0, t=0 PL23 MS2 MS5_20 PL8, t=2 PL9, t=2 PL9, t=1 PL4, t=2 PL24 PL4, t=1 PL4, t=3 PL11, t=0 MS4 PL6, t=1 MS3 PL5, t=2 PL7, t=2 PL15 PL6, t=2 PL18 PL1, t=3

0.2 0.3 0.4 0.5 0.6 Model output value

Figure 8: The figure illustrates the feature contributions to the output on a per-entry basis for the CatBoost model, trained on ESH2019, rebound 11 configuration. X axis shows the strength of the contribution either towards a positive class (when the change is >0) or towards the negative one. Colors indicate the model output confidence of the positive class - blue corresponds to the negative class (crossing) and red - to the positive (rebound). Features are sorted by importance descendingly from the top. DMisclassified entries are depicted with dashed lines.

26 Take Profit 7 0 Take Profit 11 Take Profit 15 2 Sharpe Ratio

0

500

1000 Cum. profit, [ticks]

2017-01-28 2017-08-16 2018-03-04 2018-09-20 2019-04-08 2019-10-25 Date

Figure 9: Cumulative profit curves for all the rebound configurations and fixed take profit of 15 ticks for years 2017-2019 with the corresponding annualized rolling Sharpe ratios (computed for 5% risk-free income). The trading fees are already included in the cumulative profits.

27 Rebound 7 0 Rebound 11 Rebound 15

2 Sharpe Ratio

0

500

1000 Cum. profit, [ticks]

2017-01-28 2017-08-16 2018-03-04 2018-09-20 2019-04-08 2019-10-25 Date

Figure 10: Cumulative profit curves for the best-performing rebound configuration of 15 ticks and take profits of 7, 11 and 15 ticks for years 2017-2019 with the corresponding annualized rolling Sharpe ratios (computed for 5% risk-free income). The trading fees are already included in the cumulative profits.

28 7 Discussion

This section breaks down and analyses the results presented in Sec.6. Results are discussed in relation to the overall pattern extraction and model performance, then research questions, model analysis and, finally, simulated trading. Additionally, we discuss limitations in regards to our approach and how they can be addressed. Finally we present our view on implications for practitioners and our intuition of potential advancements and future work in this area.

7.1 Pattern extraction The numbers of price levels and ticks per contract follow the same trend (Table3). Since the number of peaks is proportional to the number of ticks, we can say that the mean peak density is preserved over time to a large extent. In this context, the peak pattern can be considered stationary and appearing in various market conditions.

7.2 Price Levels, CatBoost versus No-information estimator (RQ1) The precision improvement for the CatBoost over the no-information estimator varies a lot across contracts (Table5). At the same time the improvement is largely preserved across the labelling con- figurations. It might be due to the original feature space, whose effectiveness relies a lot on the market state - it certain market states (and time periods) the utility of the feature space drops and the drop is quite consistent across the labelling configurations. An extensive research of the feature spaces would be necessary to make further claims. The overall performance of the models is weak in an absolute scale, however it is comparable to the existing body of knowledge in the area of financial markets [14]. Effect sizes in Fig.5 are not significant. Significance here means that the chosen feature space with the model significantly contribute towards the performance improvement with respect to the no- information model. Positive but insignificant effect size means that there is an improvement which is limited to the considered sample and is unlikely to generalise to the unseen data (population). Assessing the results of the statistical tests, we use the significance level corrected for the multiple comparisons. Tests run on the rebound 7 and 15 configurations result in significant p-values, rebound 11 - insignificant (Table7). Hence, we reject the null hypothesis H 01 for the labelling configurations of 7 and 15 ticks. This outcome is not supported by the effect sizes. This divergence between the test and effect sizes indicates a need for the feature space optimisation before use of the approach in the live trading setting. The insignificant outcome may be caused by particular market properties, or by unsuitable feature space for this particular labelling configuration. Interestingly, standard deviations differ consistently between the compared groups across the configurations. While no-information model performance depends solely on the fraction of the positively labelled entries, CatBoost perfor- mance additionally depends on the suitability of the feature space and model parameters - this likely explains higher standard deviations in the CatBoost case.

7.3 Price Levels, 2-step feature extraction versus its components (RQ2) For the effect sizes in Figure6 we compare the 2-step feature extraction approach to its components - PL and MS. We do not see any significant effects showing supremacy of any of the approaches. Neg- ative effect sizes in case of the Market Shift component mean that in the considered sample MS per- forms better than the 2-step approach. This result does not generalise to the unseen data as its confi- dence intervals cross the 0-threshold. The possible explanation of the result is the much larger feature space of the 2-step approach (consisting of PL and MS features) than the MS compound. In case if PL features are generally less useful than MS, which is empirically supported by out model analysis

29 (Figs S3 and S4), they might have a negative impact during the feature selection process by introduc- ing noise. Assessing the statistical tests, we use the significance level corrected for 6 comparisons. The p- values from Table8 show that there is no significant outcome and the null hypothesis H 02 cannot be rejected. While the 2-step approach does not bring any improvement to the pipeline, there is no evidence that it significantly harms the performance either. It might be the case that if PL features are designed differently, the method could benefit from them. In this study we withhold from iteratively tweaking the feature space to avoid any loss of the statistical power. We find this aspect interesting for the future work, however it would require increasing the sample size to be be able to account for the increased number of comparisons.

7.4 Simulated trading In the simulated trading we observed an interesting result - data labelling configuration has more im- pact on the profitability than the take profits (Figures9,10). We hypothesise that the reason is the simplistic trading strategy which is overused by the trading community in various configurations. In contrast, the labelling configuration is less straightforward and has more impact on the profitabil- ity. Note that the obtained precisions (Table5) cannot be directly related to the considered trading strategy as there might be multiple price levels extracted within the time interval of a single trade. Consequently, there might be extrema which are not traded. It is hard to expect consistent profitability considering the simplicity of the strategy and lack of optimisation of the feature space, however, even in the current setting one can see profitable episodes (Fig9). The objective of the study was not to provide a ready-to-trade strategy, but rather demonstrate a proof of concept. We believe that the demonstrated approach is generalisable to other trading strate- gies.

7.5 Limitations The proposed experiment design is one of the many ways the financial markets can be studied empir- ically. Statistical methods often have strong use case conditions and assumptions. When there is no entirely suitable tool available, one chooses the closest matching solution. While it is advised to use Glass’s ∆ in case of the significantly different standard deviations between groups, this measure does not have corrections for the paired data. Hence, in our experiment design we choose to stick to the Hedge’s gav . For the sake of completeness, we verified the results using Glass’s ∆ - 15 ticks rebound effect size becomes significant in the RQ1. In the backtesting we use the last trade price to define ticks, we do not take into account bid-ask spreads. In live trading, trades as executed by bid or ask price, depending on the direction of the trade. It leads to a hidden fee of the bid-ask spread per trade. This is crucial for intraday trading as average profits per trade often lay within a couple of ticks. Moreover, when modeling order executions, we do not consider per-tick volumes coming from aggressive buyers and sellers (bid and ask). It might be the case that for some ticks only aggressive buyers (sellers) were present, and our algorithm exe- cuted a long (short) limit order. This leads to uncertainty in opening positions - in reality some of the profitable orders may have not been filled. At the same time, losing orders would always be executed. Another limitation is that we do not model order queues, and, consequently cannot guarantee that our orders would have been filled if we submitted them live even if both bid and ask volumes were present in the tick. This is crucial for high frequency trading (HFT), where thousands of trades are performed daily with tiny take-profits and stop-losses, but has less impact for the trade intervals

30 considered in the study. Finally, there is an assumption that us entering the market does not change its state significantly. We believe it is a valid assumption considering the liquidity of S&P E-mini futures.

7.6 Implications for practitioners We provide a systematic approach to evaluation of automated trading strategies. While large market participants have internal evaluation procedures, we believe that our research could support various existing pipelines. Considering the state of the matter with lack of code and data publishing in the field, we are confident that the demonstrated approach can be used towards improving generalisabil- ity and reproducibility of research. Specific methods like extrema classification and 2-step feature extraction are a good baseline for the typical effect sizes observed in the field.

7.7 Future work In the current study we have proposed an approach for extracting extrema from the time series and classifying them. Since the peaks are quite consistently present in the market, their characteristics might be used for assessing the market state in a particular time scale. The next step would be to propose a more holistic scenario extraction. We would aim to define the scenarios using other market properties instead of the price. Also, the approach can be validated for trading trends - in this case one would aim to classify price level crossings with high precision. Stricter definition of the crossings in terms of the price movement is necessary for that. In terms of improving the strategy, there is a couple of things can be done. For instance: take-profit and stop-loss offsets might be linked to the volatility instead of being constant. Also, flat strategies usually work better in certain times of the day - it would be wise to interrupt trading before USA and EU session starts and ends, as well as scheduled reports, news and impactful speeches. Additionally, all the mentioned parameters we have chosen can be looked into and optimized to the needs of the market participant. In terms of the chosen model, it would be interesting comparing CatBoost classifier to DA-RNN [43] model as it makes use of the attention-based architecture designed on the basis of the recent break- through in the area of natural language processing [44]. Finally, we see a gap in the available FLOSS (Free/Libre Open Source Software) backtesting tools. To the best of our knowledge, there is no publicly available backtesting engine taking into account bid and ask prices and order queues. While there are solutions with this functionality provided as parts of the proprietary trading platforms, they can only be used as a black box. An open source engine would contribute to transparency and has a potential to become the solution for both research and industry worlds.

8 Conclusion

Our work showcased an end-to-end approach to perform automated trading using price extrema. Whilst extrema have been discussed as potentially high performance means for trading decisions, there has been no work proposing means to automatically extract them from data and design a strat- egy. Our work demonstrated an automated pipeline using this approach, and our evaluation showed some interesting results. This paper has presented every single aspect of data processing, feature extraction, feature eval- uation and selection, machine learning estimator optimisation and training, as well as details of the trading strategy. Moreover, we statistically assessed the findings. We rejected the null hypothesis an- swering RQ1 - our approach performs statistically better than the baseline. We did not observe any significant effect sizes for RQ2 and could not reject the null hypothesis. Hence, the use of the 2-step

31 feature extraction does not improve the performance of the approach for the proposed feature space and the model. We hope that by providing every single step of the ATP,it will enable further research in this area and be useful to a varied audience. We conclude by providing samples of our code online [46].

Acknowledgement

The research is supported by CRITiCaL - Combatting cRiminals In The CLoud, under grant EP/M020576/1. The authors would like to thank Roberto Metere for his drawing contributions.

References

[1] Aitken, M., Frino, A.: The determinants of market bid ask spreads on the australian stock ex- change: Cross-sectional analysis. Accounting & Finance 36(1), 51–63 (1996)

[2] Bahmani-Oskooee, M., Hegerty, S.W.: Exchange rate volatility and trade flows: a review article. Journal of Economic studies (2007)

[3] Bauwens, L., Giot, P., Grammig, J., Veredas, D.: A comparison of financial duration models via density forecasts. International Journal of Forecasting 20(4), 589–609 (2004)

[4] Blessie, E.C., Karthikeyan, E.: Sigmis: A feature selection algorithm using correlation based method. Journal of Algorithms & Computational Technology 6(3), 385–394 (2012)

[5] Bonferroni, C.: Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze 8, 3–62 (1936)

[6] Booth, A., Gerding, E., Mcgroarty, F.: Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications 41(8), 3651–3661 (2014)

[7] Caginalp, C., Caginalp, G.: Asset price volatility and price extrema. arXiv preprint arXiv:1802.04774 (2018)

[8] Ce, B.: Teoria statistica delle classi e calcolo delle probabilit. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8, 3–62 (1936)

[9] Chen, T., Guestrin, C.: Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining. pp. 785–794 (2016)

[10] Cohen, J.: A coefficient of agreement for nominal scales. Educational and psychological mea- surement 20(1), 37–46 (1960)

[11] Cumming, G.: Understanding the new statistics: Effect sizes, confidence intervals, and meta- analysis. Routledge (2013)

[12] De Prado, M.L.: Advances in financial machine learning. John Wiley & Sons (2018)

[13] Dempster, M.A., Leemans, V.: An automated fx trading system using adaptive reinforcement learning. Expert Systems with Applications 30(3), 543–552 (2006)

[14] Dixon, M., Klabjan, D., Bang, J.H.: Classification-based financial markets prediction using deep neural networks. Algorithmic Finance 6(3-4), 67–77 (12 2017). https://doi.org/10.3233/AF- 170176, https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10. 3233/AF-170176

32 [15] Du, P.,Kibbe, W.A., Lin, S.M.: Improved peak detection in mass spectrum by incorporating con- tinuous wavelet transform-based pattern matching. Bioinformatics 22(17), 2059–2065 (2006)

[16] Dufour, A., Engle, R.F.: Time and the price impact of a trade. The Journal of Finance 55(6), 2467– 2498 (2000)

[17] Easley, D., de Prado, M.L., O’Hara, M.: Discerning information from trade data. Journal of Finan- cial Economics 120(2), 269–285 (2016)

[18] Easley, D., de Prado, M.L., O’Hara, M., Zhang, Z.: Microstructure in the machine age. Available at SSRN 3345183 (2019)

[19] Fan, F., Xiong, J., Wang, G.: On interpretability of artificial neural networks. arXiv preprint arXiv:2001.02522 (2020)

[20] Fraser, C.: Association between two categorical variables: Contingency analysis with chi square. In: Business Statistics for Competitive Advantage with Excel 2019 and JMP,pp. 341–377. Springer (2019)

[21] Freund, Y., Schapire, R.E.: A decision-theoretic generalization of on-line learning and an appli- cation to boosting. Journal of computer and system sciences 55(1), 119–139 (1997)

[22] Girden, E.R.: ANOVA: Repeated measures. No. 84, Sage (1992)

[23] Grammig, J., Wellner, M.: Modeling the interdependence of volatility and inter-transaction dura- tion processes. Journal of Econometrics 106(2), 369–400 (2002)

[24] Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of machine learning research 3(Mar), 1157–1182 (2003)

[25] Hawkins, D., Gallacher, E., Gammell, M., et al.: Statistical power, effect size and animal welfare: recommendations for good practice. Animal Welfare 22(3), 339–344 (2013)

[26] Hedges, L.V.: Distribution theory for glass’s estimator of effect size and related estimators. journal of Educational Statistics 6(2), 107–128 (1981)

[27] Huang, W., Nakamori, Y., Wang, S.Y.: Forecasting stock market movement direction with support vector machine. Computers & operations research 32(10), 2513–2522 (2005)

[28] Iori, G., Chiarella, C.: A Simulation Analysis of the Microstructure of Double Auction Markets. Quantitative Finance 2, 346–353 (2002)

[29] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y.: Lightgbm: A highly efficient decision tree. In: Advances in neural information processing systems. pp. 3146–3154 (2017)

[30] Kelley, K., Preacher, K.J.: On effect size. Psychological methods 17(2), 137 (2012)

[31] Kirch, W. (ed.): Pearson’s Correlation Coefficient, pp. 1090–1091. Springer Netherlands, Dor- drecht (2008). https://doi.org/10.1007978-1-4020-5614-7_2569

[32] Kissell, R.L.: The science of algorithmic trading and portfolio management. Academic Press (2013)

[33] Kuhn, M., Johnson, K.: Feature engineering and selection: A practical approach for predictive models (2019). https://doi.org/10.1201/9781315108230

33 [34] Lakens, D.: Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology 4(NOV) (2013). https://doi.org/10.3389/fpsyg.2013.00863

[35] Lo, A.W.: The Statistics of Sharpe Ratios. Financial Analysts Journal 58(4), 36–52 (7 2002). https://doi.org/10.2469/faj.v58.n4.2453, https://www.tandfonline.com/doi/full/ 10.2469/faj.v58.n4.2453

[36] Lundberg, S.M., Lee, S.I.: A unified approach to interpreting model predictions. In: Advances in neural information processing systems. pp. 4765–4774 (2017)

[37] Manganelli, S.: Duration, volume and volatility impact of trades. Journal of Financial markets 8(4), 377–399 (2005)

[38] McNeil, K.A., Newman, I., Kelly, F.J.: Testing research hypotheses with the general linear model. SIU Press (1996)

[39] Miller, N., Yang, Y., Sun, B., Zhang, G.: Identification of technical analysis patterns with smooth- ing splines for bitcoin prices. Journal of Applied Statistics 46(12), 2289–2297 (2019)

[40] Münnix, M.C., Shimada, T., Schäfer, R., Leyvraz, F., Seligman, T.H., Guhr, T., Stan- ley, H.E.: Identifying states of a financial market. Scientific Reports 2, 1–9 (2 2012). https://doi.org/10.1038/srep00644, http://arxiv.org/abs/1202.1623

[41] Münnix, M.C., Shimada, T., Schäfer, R., Leyvraz, F., Seligman, T.H., Guhr, T., Stanley, H.E.: Identi- fying states of a financial market. Scientific reports 2, 644 (2012)

[42] Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A.V., Gulin, A.: Catboost: unbiased boosting with categorical features. In: Advances in neural information processing systems. pp. 6638–6648 (2018)

[43] Qin, Y., Song, D., Cheng, H., Cheng, W., Jiang, G., Cottrell, G.W.: A dual-stage attention-based recurrent neural network for time series prediction. IJCAI International Joint Conference on Ar- tificial Intelligence 0, 2627–2633 (2017). https://doi.org/10.24963/ijcai.2017/366

[44] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsuper- vised multitask learners (2019)

[45] Šidák, Z.: Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association 62(318), 626–633 (1967)

[46] Sokolovsky, Artur and Arnaboldi, Luca: Machine learning classification of price extrema based on market microstructure and price action features. a case study of s&p 500 e-mini futures. repro- ducibility package. (2021). https://doi.org/10.5281/ZENODO.4036850, https://zenodo.org/ record/4036850

[47] Stern, R.: Good statistical practice for natural resources research. CABI (2004)

[48] Student: The probable error of a mean. Biometrika pp. 1–25 (1908)

[49] Tabak, B.M., Feitosa, M.A.: An analysis of the yield spread as a predictor of inflation in brazil: Evidence from a wavelets approach. Expert Systems with Applications 36(3), 7129–7134 (2009)

[50] Usman, M.: On consistency and limitation of independent t-test kolmogorov smirnov test and mann whitney u test. IOSR Journal of Mathematics 12(4), 22–27 (2016)

34 [51] Wilcoxon, F.: Individual comparisons by ranking methods. In: Breakthroughs in statistics, pp. 196–202. Springer (1992)

[52] Wilkinson, L.: Statistical methods in psychology journals: Guidelines and explanations. Ameri- can psychologist 54(8), 594 (1999)

[53] Zhou, B., Bau, D., Oliva, A., Torralba, A.: Interpreting deep visual representations via network dis- section. IEEE transactions on pattern analysis and machine intelligence 41(9), 2131–2145 (2018)

35 Supplementary Materials

Table S1: Model performance measures reported for the 2-step feature extraction, 3 different labelling configurations: 7, 11 and 15 ticks rebounds. Null-Precision corresponds to the performance of an always-positive classifier.

Contract PR-AUC F1-score Precision ROC-AUC Null-Precision Rebound 7 ESH2017 0.19 0.26 0.20 0.51 0.19 ESM2017 0.25 0.28 0.22 0.54 0.21 ESU2017 0.25 0.29 0.25 0.56 0.20 ESZ2017 0.18 0.16 0.17 0.52 0.16 ESH2018 0.19 0.28 0.19 0.54 0.17 ESM2018 0.22 0.32 0.21 0.55 0.19 ESU2018 0.17 0.18 0.17 0.52 0.16 ESZ2018 0.18 0.28 0.18 0.53 0.17 ESH2019 0.18 0.24 0.18 0.51 0.18 ESM2019 0.17 0.24 0.18 0.52 0.17 Rebound 11 ESH2017 0.19 0.30 0.20 0.53 0.18 ESM2017 0.21 0.14 0.23 0.54 0.19 ESU2017 0.18 0.14 0.15 0.50 0.19 ESZ2017 0.17 0.23 0.17 0.52 0.16 ESH2018 0.19 0.27 0.18 0.53 0.17 ESM2018 0.21 0.31 0.21 0.55 0.19 ESU2018 0.16 0.17 0.16 0.50 0.16 ESZ2018 0.19 0.27 0.17 0.52 0.17 ESH2019 0.18 0.24 0.18 0.52 0.17 ESM2019 0.17 0.25 0.17 0.52 0.17 Rebound 15 ESH2017 0.15 0.20 0.15 0.51 0.15 ESM2017 0.18 0.21 0.19 0.53 0.17 ESU2017 0.17 0.17 0.18 0.54 0.15 ESZ2017 0.17 0.11 0.19 0.51 0.16 ESH2018 0.17 0.26 0.17 0.52 0.16 ESM2018 0.18 0.28 0.18 0.53 0.17 ESU2018 0.16 0.22 0.16 0.51 0.16 ESZ2018 0.17 0.26 0.16 0.52 0.16 ESH2019 0.17 0.26 0.18 0.51 0.17 ESM2019 0.17 0.26 0.17 0.52 0.16

36 Table S2: Model performance measures reported for the Price Level feature extraction component, 3 different labelling configurations: 7, 11 and 15 ticks rebounds. Null-Precision corresponds to the performance of an always-positive classifier.

Contract PR-AUC F1-score Precision ROC-AUC Null-Precision Rebound 7 ESH2017 0.19 0.24 0.19 0.50 0.19 ESM2017 0.21 0.28 0.21 0.50 0.21 ESU2017 0.22 0.25 0.23 0.53 0.20 ESZ2017 0.16 0.19 0.16 0.49 0.16 ESH2018 0.18 0.23 0.17 0.50 0.17 ESM2018 0.19 0.25 0.19 0.49 0.19 ESU2018 0.16 0.15 0.15 0.50 0.16 ESZ2018 0.18 0.24 0.17 0.51 0.17 ESH2019 0.18 0.26 0.18 0.51 0.18 ESM2019 0.18 0.21 0.18 0.50 0.17 Rebound 11 ESH2017 0.17 0.19 0.17 0.49 0.18 ESM2017 0.20 0.18 0.19 0.51 0.19 ESU2017 0.18 0.18 0.17 0.50 0.19 ESZ2017 0.16 0.19 0.16 0.50 0.16 ESH2018 0.17 0.23 0.18 0.51 0.17 ESM2018 0.19 0.17 0.20 0.50 0.19 ESU2018 0.16 0.22 0.16 0.50 0.16 ESZ2018 0.18 0.20 0.18 0.52 0.17 ESH2019 0.18 0.25 0.17 0.51 0.17 ESM2019 0.17 0.22 0.17 0.50 0.17 Rebound 15 ESH2017 0.15 0.20 0.15 0.50 0.15 ESM2017 0.17 0.24 0.18 0.51 0.17 ESU2017 0.17 0.16 0.19 0.53 0.15 ESZ2017 0.16 0.13 0.15 0.50 0.16 ESH2018 0.16 0.25 0.16 0.50 0.16 ESM2018 0.18 0.21 0.17 0.51 0.17 ESU2018 0.15 0.16 0.15 0.49 0.16 ESZ2018 0.16 0.19 0.16 0.51 0.16 ESH2019 0.17 0.24 0.17 0.50 0.17 ESM2019 0.16 0.18 0.16 0.49 0.16

37 Table S3: Model performance measures reported for the Market Shift feature extraction component, 3 different labelling configurations: 7, 11 and 15 ticks rebounds. Null-Precision corresponds to the performance of an always-positive classifier.

Contract PR-AUC F1-score Precision ROC-AUC Null-Precision Rebound 7 ESH2017 0.20 0.26 0.20 0.51 0.19 ESM2017 0.25 0.28 0.24 0.55 0.21 ESU2017 0.25 0.31 0.25 0.56 0.20 ESZ2017 0.17 0.26 0.18 0.52 0.16 ESH2018 0.20 0.29 0.19 0.54 0.17 ESM2018 0.21 0.33 0.22 0.55 0.19 ESU2018 0.17 0.23 0.17 0.51 0.16 ESZ2018 0.18 0.28 0.18 0.52 0.17 ESH2019 0.19 0.28 0.19 0.53 0.18 ESM2019 0.18 0.27 0.18 0.53 0.17 Rebound 11 ESH2017 0.18 0.24 0.19 0.51 0.18 ESM2017 0.21 0.24 0.20 0.50 0.19 ESU2017 0.21 0.31 0.23 0.56 0.19 ESZ2017 0.16 0.12 0.16 0.50 0.16 ESH2018 0.20 0.28 0.18 0.54 0.17 ESM2018 0.20 0.32 0.21 0.55 0.19 ESU2018 0.17 0.22 0.17 0.51 0.16 ESZ2018 0.17 0.27 0.17 0.52 0.17 ESH2019 0.18 0.26 0.18 0.52 0.17 ESM2019 0.18 0.27 0.18 0.53 0.17 Rebound 15 ESH2017 0.15 0.22 0.15 0.48 0.15 ESM2017 0.17 0.20 0.16 0.50 0.17 ESU2017 0.16 0.17 0.15 0.51 0.15 ESZ2017 0.16 0.11 0.16 0.50 0.16 ESH2018 0.17 0.27 0.17 0.53 0.16 ESM2018 0.18 0.28 0.18 0.54 0.17 ESU2018 0.16 0.22 0.16 0.51 0.16 ESZ2018 0.17 0.27 0.17 0.52 0.16 ESH2019 0.17 0.27 0.17 0.52 0.17 ESM2019 0.18 0.27 0.18 0.53 0.16

38 CatBoost Null Rebound 7

0.24

0.22

0.20

0.18

0.16 Rebound 11

0.22

0.20

0.18 Precision

0.16

Rebound 15

0.18

0.17

0.16

0.15

ESH2017 ESU2017 ESH2018 ESU2018 ESH2019 Contract

Figure S1: Precision of the CatBoost model and the always-positive estimator. The plot shows the labelling configurations of 7, 11 and 15 ticks rebounds.

39 PL+MS feature space PL MS Rebound 7

0.24

0.22

0.20

0.18

0.16

Rebound 11

0.22

0.20

0.18 Precision

0.16

Rebound 15

0.19

0.18

0.17

0.16

0.15

ESH2017 ESU2017 ESH2018 ESU2018 ESH2019 Contract

Figure S2: Precision of the model which uses the 2-step feature extraction (MS+PL) versus the performance of the models using the single-step feature extraction (MS, PL). The plot shows the labelling configurations of 7, 11 and 15 ticks rebounds.

40 High

MS6_80

MS0

MS6_20

MS6_200

PL2, t=1

PL23 Feature value

MS2

PL24

Low 1.5 1.0 0.5 0.0 0.5 1.0 SHAP value (impact on model output)

Figure S3: SHAP summary plot of the model trained on ESH2019 contract, rebound 7 configuration. Each marker is a classified entry. X axis quantifies the contribution of the entries towards positive or negative class output.

41 High MS1 MS6_40 MS0 MS6_20 MS6_120 PL23 PL0, t=0 MS6_80 MS2 MS5_20 MS6_200 MS6_160 PL4, t=1 PL24 PL4, t=2 MS4 Feature value PL9, t=2 PL4, t=3 PL9, t=1 PL8, t=2 PL11, t=0 MS3 PL6, t=1 PL7, t=2 PL5, t=2 PL18 PL6, t=2 PL1, t=3 PL15

Low 0.3 0.2 0.1 0.0 0.1 0.2 SHAP value (impact on model output)

Figure S4: SHAP summary plot of the model trained on ESH2019 contract, rebound 11 configuration. Each marker is a classified entry. X axis quantifies the contribution of the entries towards positive or negative class output.

42