Hidden Markov Models Applied To Intraday Momentum Trading With Side Information

Hugh Christensen (a,∗), Richard Turner (b), Simon Godsill (a)

(a) Signal Processing and Communications Laboratory, Engineering Department, Cambridge University, CB2 1PZ, UK
(b) Machine Learning Group, Engineering Department, Cambridge University, CB2 1PZ, UK

Abstract

A Hidden Markov Model for intraday momentum trading is presented which specifies a latent momentum state responsible for generating the observed securities' noisy returns. Existing momentum trading models suffer from time-lagging caused by the delayed frequency response of digital filters. Time-lagging results in a momentum signal of the wrong sign when the market changes trend direction. A key feature of this state space formulation is that no such lagging occurs, allowing for accurate shifts in signal sign at market change points. The number of latent states in the model is estimated using three techniques: cross validation, penalized likelihood criteria and simulation based model selection for the marginal likelihood. All three techniques suggest either 2 or 3 hidden states. Model parameters are then found using Baum-Welch and Markov Chain Monte Carlo, whilst assuming a single (discretized) univariate Gaussian distribution for the emission matrix. Often a momentum trader will want to condition their trading signals on additional information. To reflect this, learning is also carried out in the presence of side information. Two sets of side information are considered, namely a ratio of realized volatilities and intraday seasonality. It is shown that splines can be used to capture statistically significant relationships from this information, allowing returns to be predicted. An Input Output Hidden Markov Model is used to incorporate these univariate predictive signals into the transition matrix, presenting a possible solution for dealing with the signal combination problem. Bayesian inference is then carried out to predict the security's t + 1 return using the forward algorithm. The model is simulated on one year's worth of e-mini S&P500 futures data at one minute sampling frequency, and it is shown that pre-cost the models have a Sharpe ratio in excess of 2.0. Simple modifications to the current framework allow for a fully non-parametric model with asynchronous prediction.
Keywords: Bayesian inference, trend following, high frequency futures trading, quantitative finance.

1. Introduction

An intraday momentum trading strategy is presented, consisting of a Hidden Markov Model (HMM) framework that has the ability to use side information from external predictors. The proposed framework is quite general and allows any predictors to be used in conjunction with the momentum model. An appealing aspect of this model is that all the computationally demanding learning is done off-line, allowing for a fast inference phase, meaning the model can be applied to high-frequency financial data.

Quantitative trading, namely the application of the scientific method, is now well established in the financial markets. A sub-section of this field is termed algorithmic trading, where algorithms are responsible for the full trade cycle, including the decision of when to buy and sell. When this process is dependent on the prior behavior of the security, it historically was termed technical analysis (Lo et al., 2000). Momentum trading (or trend following) falls into this category and is the most popular hedge fund style trading strategy currently used. For example, the largest quantitative hedge funds by assets under management famously trade momentum strategies (Anon, 2011). It can be inferred from this that momentum is the most significant exploitable effect in the financial markets, and as a result of this there is a large body of literature published on the effect (Hong and Stein, 1999).

Momentum (or trend) can be defined as the rate of change of price. As a strategy, momentum trading aims to forecast future security returns by exploiting the positive autocorrelation structure of the data. Once a trend is detected by careful estimation of the mean return (in the presence of noise), it can be predicted. The most well known trend-following system is that introduced by Gerald Appel in the 1970s, the moving-average convergence-divergence (MACD) (Gerald, 1999), made famous by the success of a group of traders named the "turtles" (Faith, 2007). The MACD strategy uses the difference between a pair of cascaded low pass filters in parallel to remove noise while estimating the true mean of the rate of change of price (Satchell and Acar, 2002).

The reasons for the momentum effect existing are less than clear despite extensive academic research on the subject. Financial data consists of deterministic and stochastic components and both of these components can exhibit trends. Significant trends commonly occur even in data which is generated by a random process, such as geometric Brownian motion (Wilmott, 2006), and can be explained by the effect of summing random disturbances (Lo and MacKinlay, 2001). Attempting to model such stochastic trends can lead to spurious results. Deterministic reasons for trends existing are thought to include herding behaviour (Shiller, 2005), supply-and-demand arguments (Johnson, 2002) and delayed over-reactions that are eventually reversed (Jegadeesh and Titman, 1999). While there is debate in the academic literature between those that believe the momentum effect is still viable post-transaction costs, for example (Jegadeesh and Titman, 1999), and those that believe the effect has been arbitraged away, for example (Lesmond et al., 2004), the continued profitability of large momentum trading hedge funds is testament to the enduring nature of the momentum effect.

The motivation for this paper is to apply HMMs to produce a trading algorithm that exploits the momentum effect, and that can be applied to the financial markets in real-time by industry practitioners. The core aim of the paper is to give the algorithm the best predictive performance possible, irrespective of methodology. Application of such work to the financial markets has obvious economic benefits.

The two main innovations presented in this paper are both novel applications of existing statistical techniques to an applied problem. No new methodologies are introduced in the paper. Firstly, the price discovery process of a security is described by a trend term in the presence of noise. This process is fitted into an HMM framework and various means of parameter estimation are inspected. Secondly, momentum traders often want to incorporate other information into their momentum based forecast, the signal combination problem, and an IOHMM framework is established to allow this. For both innovations, realistic experiments are conducted (including transaction costs and slippage), results presented and conclusions drawn.

This paper is structured as follows. In Section 2 HMMs in finance and economics are reviewed and the HMM framework is introduced. In Section 3 the three learning methodologies are presented. In Section 4 two extrinsic predictors are developed and tested, and then in Section 5 learning is carried out using this side information. In Section 6 our inference algorithm is presented. In Section 7 we present the historical futures data and then simulate the performance of the models with data and present results. Finally in Section 8 conclusions are presented, along with suggestions for further work.

∗ Corresponding author. Email addresses: [email protected] (Hugh Christensen), [email protected] (Richard Turner), [email protected] (Simon Godsill)

Preprint submitted to arXiv June 22, 2020. arXiv:2006.08307v1 [q-fin.TR] 15 Jun 2020

2. Hidden Markov Models

An HMM is a Bayesian state space model that assumes discrete time unobserved (hidden or latent) states (Gales and Young, 2008). The basic assumptions of a Markov state space model are firstly that states are conditionally independent of all other states given the previous state, and secondly that observations are conditionally independent of all other observations given the state that generated it.

2.1. Literature Review of HMM in Finance and Economics

In the 1970s Leonard Baum was one of the first researchers to work with what is now known as an HMM. He applied the methodology to securities' trading for the hedge fund Renaissance Technologies (Baum et al., 1970; Teitelbaum, 2008). Since then HMMs have been used extensively in finance and economics (Bhar and Hamori, 2004; Mamon and Elliott, 2007). The first widely attributed public application of HMMs to finance and economics was by James Hamilton in 1989 (Hamilton, 1989). In his seminal paper, Hamilton views the parameters of an autoregression as the outcome of a discrete Markov process, where the observed variable is GNP and the latent variable is the business cycle. By observing GNP, the position in the business cycle can be estimated and future activity predicted.

Following Hamilton's paper there has been much Bayesian work discussing estimation of these models and providing financial and economic applications, most of which focus on Markov chain Monte Carlo (MCMC). MCMC is a means of providing a numerical approximation to the posterior distribution using a set of samples, allowing approximate posterior probabilities and marginal likelihoods to be found. Two excellent reviews of the field of Bayesian estimation using MCMC are given by Chib (Chib, 2001) and Scott (Scott, 2002). Noteworthy papers applying Bayesian estimation techniques include the following. Frühwirth-Schnatter applies MCMC to a clustering problem from a panel data set of US bank lending data, where model parameters are time-varying (Frühwirth-Schnatter, 2001). Shephard applies the Metropolis algorithm to a non-Gaussian state space model and illustrates the technique by seasonally adjusting a money supply time series (Shephard, 1994). McCulloch et al. apply a Gibbs sampler for parameter estimation in their Markov switching model and illustrate their technique using the growth rates of GNP (McCulloch and Tsay, 1994). Meligkotsidou et al. tackle interest rate forecasting with a non-constant transition matrix using an MCMC reversible jump algorithm for predictive inference (Meligkotsidou and Dellaportas, 2011). Less commonly applied in the economic and financial literature is the technique of variational Bayes (VB) (Attias, 1999). VB provides a parametric approximation to the posterior, often using independence assumptions, in a computationally efficient manner. McGrory et al. apply VB to estimate the number of unknown states along with the model parameters from the daily returns of the S&P500 (McGrory and Titterington, 2009). Finally, the debate between learning in HMMs using frequentist methods such as expectation maximization (EM), versus Bayesian methods such as MCMC, is reviewed by Rydén, who highlights poor […] and long computation times as potential computational disadvantages of MCMC (Rydén, 2008).

Many different extensions and modifications to the "vanilla" HMM have been proposed, and applied to economics and finance. Input output HMMs (IOHMM) include inputs and outputs and can be viewed as a directed version of a Hidden […] (Bengio and Frasconi, 1995; Kakade et al., 2002). Unlike HMMs, the output and transition distributions are not only conditioned on the current state, but are also conditioned on an observed input value. Bengio et al. carry out learning in an IOHMM using a feed-forward neural network. Kim et al. use an IOHMM to model stock order flows in the presence of two hidden market states (Kim et al., 2002). HMMs are a generalization of a mixture model where latent variables control the mixture component to be selected for each observation. In a mixture model, the latents are assumed to be i.i.d. random variables, as opposed to being related through a Markov process as in an HMM (Bishop, 2006). Liesenfeld et al. apply a bivariate mixture model to stock price and trading volume (Liesenfeld, 2001). In their model, the behavior of volatility and volume results from the simultaneous interaction of the number of information arrivals and traders' sensitivity to new information, both of which are treated as latent variables. In a hierarchical HMM (HHMM), each state is itself an HHMM, allowing modelling of "the complex multi-scale structure which appears in many natural sequences" (Fine et al., 1998). Wisebourt et al. generate a measure of the limit order book imbalance and use it to drive latent market regimes inside an HHMM (Wisebourt, 2011). Poisson HMMs (PHMM) are a special case of HMMs where a Poisson process has a rate which varies in association with changes between the different states of a Markov model (Scott, 2002). Branger et al. apply a PHMM to model jumps in asset price in order to help inform contagion risk and portfolio choice (Branger et al., 2012). Hidden semi-Markov models (HSMM) have the same structure as an HMM except that the unobservable process is semi-Markov rather than Markov. Here the probability of a change in the hidden state depends on the amount of time that has elapsed since entry into the current state (Yu, 2010). Bulla et al. apply an HSMM to daily security returns in order to capture the slow decay in the autocorrelation function of the squared returns, which HMMs fail to capture (Bulla and Bulla, 2006). Finally, factorial HMMs (FHMM) distribute the latent state into multiple state variables in a distributed manner, allowing a single observation to be conditioned on the corresponding latent variables of a set of independent Markov chains, rather than a single Markov chain (Ghahramani and Jordan, 1997). Charlot applies an FHMM to design a new multivariate GARCH model with time varying conditional correlation (Charlot, 2012).

Applications of HMMs in finance and economics range extensively, with latent variables including the business cycle (Grégoir and Lenglart, 2000), inter-equity market correlation (Bhar and Hamori, 2003), bond-credit risk spreads (Thomas et al., 2002), inflation (Kim, 1993; Chopin and Pelgrin, 2004), credit risk (Giampieri et al., 2005), options pricing (Buffington and Elliott, 2002), portfolio allocation (Elliott and Hinz, 2002; Roman et al., 2010), volatility (Rossi and Gallo, 2006; Dueker, 1997), interest rates (Elliott and Wilson, 2007; Ang and Bekaert, 2002), trend states (Dai et al., 2010; Pidan and El-yaniv, 2011) and future asset returns (Shi and Weigend, 1997; Hassan and Nath, 2005; Dueker and Neely, 2007).

This paper relates to the broader field of research into the prediction of security returns by exploiting the momentum effect. To our knowledge, no other authors have considered momentum as a latent variable in an HMM setting. However Christensen et al. have considered a latent momentum formulation in a Bayesian filtering setting (Christensen et al., 2012). In this paper the authors track a continuous latent momentum state of a time series using a Rao-Blackwellized particle filter. The paper finds that the predictions are statistically significant when applied to a portfolio of futures in the presence of transaction costs. In general terms it is expected that an HMM would be able to outperform the Rao-Blackwellized particle filtering formulation. This is because an HMM with lots of states can model arbitrary transitions between trend states, e.g. a sudden reversal of trend at the top of the market, whereas a linear Gaussian model is limited to linear changes.

2.2. An HMM for Trading Momentum

Our model is based on the concept of a noisy trend, where the trend is a latent state and the price series is Brownian with a stochastic drift. In order to forecast the next time step in the HMM we begin with a distribution over the current hidden state and use the transition function to propagate this distribution forward in time. At the next time step we are able to infer the most likely hidden state and generate a predictive distribution over observables. This is done by taking a weighted average of the conditional distribution of the observations where the weights are from the distribution over the hidden state.

We are not interested in predicting price, an arbitrary value, rather the change in price or the return. Let $y_t$ be the price, such that $Y = \{y_1, \dots, y_T\}$, and $\Delta y_t = \log(y_t / y_{t-1})$ be the return, such that $\Delta Y = \{\Delta y_2, \dots, \Delta y_T\}$. In our model $\Delta y_t$ (the observation) is influenced by a hidden, unobserved state, $m_t$ (the trend, where $d\Delta y/dt$ is a noisy estimate of $m_t$), such that $M = \{m_1, \dots, m_T\}$. In order to find $E(\Delta y_t \mid \Delta y_{1:t-1})$, a two step process of learning followed by inference is carried out. This model is shown in Figure 1.

Figure 1: A state space model for a discretized continuous observed state ∆y (the change in price) and a discrete hidden state m (the trend). The relationship between the latent variables and the system parameters is shown.
[Diagram: hidden states m1, m2, m3 linked by the state transition function A, with a state prior on m1 and an observation function emitting the observations y1, y2, y3.]

The intuition behind the model is that security returns can be modelled as a noisy trend process and that while return can be observed, the trend state cannot be and must be inferred. While the MACD algorithm of Section 1 attempts to find the true value of this hidden state empirically by use of a digital filter, in this paper we model the observations and the hidden trend state explicitly, therefore allowing interpretation of all the parameters in a meaningful way. An additional advantage of an HMM formulation over MACD is that HMMs are able to track trends in a much more flexible way, by encoding non-linear relationships between the states. This allows for sudden changes to a new trend, whereas digital filters inevitably have some delay in response depending upon their frequency response. Finally, we note that digital smoothing filters can often be written equivalently as the stationary solution of particular linear state-space models (Harvey, 1991).

Time series, such as security returns, can be synchronous or asynchronous. A synchronous time series is one where the time stamps lie on a regular grid. The grid spacing is referred to as the sampling frequency. An asynchronous time series is one where the time stamps do not lie on a regular grid. Raw security returns are generally asynchronous, but are often sampled to make them synchronous. Security returns are not continuous in value, but lie on a discrete price grid, with a grid spacing defined on a security specific basis by the exchange. This grid spacing is called the tick size, $\alpha$. The state space of the latent states consists of a total of $K$ possible values of trend. An upper limit to $K$ can be found by knowing the grid size and calculating $\Omega = \max(|\Delta Y|)$. The latent variables are indexed on this grid as,

$$m_t \in \{1, \dots, k, \dots, K\} \tag{1}$$

where $k$ refers to the $k$-th latent state. By fixing $K$ the set of time-dependent trend terms can be specified by $M$. The observations depend on the latent states as follows: the return at time $t$ is equal to the trend term, plus Gaussian noise,

$$\Delta y_t = \mu_{m_t} + \epsilon_t, \qquad \epsilon_t \sim \mathrm{Norm}(0, \sigma^2_{m_t}) \tag{2}$$

Given the indexed grid of Equation (1), one mode of initialization would be $\mu_{1:K} = \{-\Omega, -(\Omega - \alpha), -(\Omega - 2\alpha), \dots, 0, \dots, (\Omega - 2\alpha), (\Omega - \alpha), \Omega\}$. The Gaussian assumption of Equation (2) could be replaced with any other parametric distribution (for example, fat-tailed Cauchy) or a non-parametric approach (for example, kernel density estimation).
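The grid initialization of the trend means and the forecasting step described at the start of this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `trend_grid` and `predict_next_return`, the three-state transition matrix and all numerical values are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def trend_grid(omega: float, alpha: float) -> np.ndarray:
    """Trend means -omega, ..., 0, ..., +omega in steps of the tick size alpha."""
    n = int(round(omega / alpha))        # number of grid points on each side of zero
    return alpha * np.arange(-n, n + 1)  # K = 2n + 1 values

def predict_next_return(belief: np.ndarray, A: np.ndarray, mu: np.ndarray):
    """Propagate the hidden-state distribution one step and average the state means.

    belief[k] = p(m_t = k | returns so far); A[i, j] = p(m_{t+1} = j | m_t = i).
    """
    next_belief = belief @ A             # distribution over m_{t+1}
    return float(next_belief @ mu), next_belief

# Hypothetical values: tick size 0.25 and largest absolute return 0.25 give K = 3.
mu = trend_grid(omega=0.25, alpha=0.25)
A = np.array([[0.80, 0.15, 0.05],        # illustrative row-stochastic matrix
              [0.10, 0.80, 0.10],
              [0.05, 0.15, 0.80]])
belief = np.array([0.0, 0.0, 1.0])       # currently certain of the up-trend state
expected_return, next_belief = predict_next_return(belief, A, mu)
print(mu, next_belief, expected_return)
```

The expected return is the belief-weighted average of the state means, which is the mean of the predictive distribution described above; the full predictive distribution would weight the per-state Gaussians of Equation (2) in the same way.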

The resulting conditional distribution from Equation (2) is,

$$p(\Delta y_t \mid m_t) = \mathrm{Norm}(\Delta y_t; \mu_{m_t}, \sigma^2_{m_t})$$

As the latents lie on a discrete grid, yet the noise model is continuous, the implementation is required to discretize the Gaussian noise variable to ensure the results lie on the grid. The joint distribution of this state space model is therefore given by,

$$p(\Delta Y, M) = p(m_1) \left[ \prod_{t=2}^{T} p(m_t \mid m_{t-1}) \right] \prod_{t=1}^{T} p(\Delta y_t \mid m_t)$$

Given the HMM and the observations $p(\Delta Y, M)$ one can deduce information about the states occupied by the underlying Markov chain $m_{1:T}$. What we are interested in finding in this model is the probability of a trend given all our observations of price up to now, $p(m_t \mid \Delta y_{1:t})$, also known as the filtering distribution.

2.3. Model Parameters

Our model requires that the transition matrix $A$, emission matrix $\phi$ and latent node initial value $\pi$ are known a-priori. Together these form the parameters of our model $\Theta = \{A, \pi, \phi\}$, as shown in Figure 1. Finding $\Theta$ constitutes the learning phase of the HMM. This batch approach to learning suits the structure of the financial markets, as parameter estimation can be done using the previous $H$ days of market data, when the market is shut. Before discussing learning, the connection between the hidden states $M$ and the model parameters $\Theta$ is explained. $A$ specifies the probability of transitions between the latent states, $\pi$ is the probability of the initial latent state and $\phi$ is the probability of the observed return occurring. The connection between parameter $\phi$ and Equation (2) is that $\phi(\Delta y_t)$ follows a Gaussian distribution. In this paper four different off-line learning approaches are considered,

1. Θ is learnt using piecewise linear regression (PLR).
2. Θ is learnt using the Baum-Welch algorithm.
3. Θ is learnt using Markov Chain Monte Carlo (MCMC).
4. Θ is learnt using the Baum-Welch algorithm in the presence of side information.

PLR and Baum-Welch are both frequentist methods, while MCMC is a Bayesian method. The inference phase of this paper is purely Bayesian. At this point we consider the correctness of combining frequentist and Bayesian methods in the same model. The core aim of this paper is to produce the best predictive performance possible, irrespective of methodology used, and so from a philosophical viewpoint we are agnostic. From a practical point of view, frequentist methods are more commonly found in the trading industry. It is reasoned this is due to the relative simplicity of the methods, the parsimony of the models and the associated low computational loads. In particular, trading practitioners tend to dislike complex models due to the risks associated with model failure being low-probability, high-impact. These risks are easier to understand and monitor in simple models.

The major issue when learning is the transient nature of the latent state and how stable its estimated means are. In order to ensure the most accurate estimation possible, the means of the $K$ Gaussian distributions (the trends) are efficiently estimated using short windows of data. This is implemented using a rolling window of data that consists of 23 trading days (one month). This window size approximately agrees with the lowest frequency information we are trying to exploit in our system.

The mixing of frequentist learning with Bayesian inference is a well established approach in the literature, for example Andrieu et al. estimate static parameters in non-linear non-Gaussian state space models using EM type algorithms (Andrieu and Doucet, 2003). Other examples of merging frequentist and Bayesian methodologies are given by Gelman (Gelman, 2011) and Jordan (Jordan, 2009) who suggest using Bayesian inference coupled with frequentist model-checking techniques. Such an approach gives the performance benefits of using a Bayesian prior, while allowing for the easily checkable assumptions given by frequentist confidence intervals. Completely integrated Bayesian techniques to our problem do exist, such as particle MCMC which allows for fully Bayesian learning and inference, however such techniques suffer from impractically high computational complexity (Andrieu et al., 2010).

2.3.1. State Transition Matrix

A conditional distribution for the latent variables $p(m_t \mid m_{t-1})$ is specified. Because the latent variables can take one of $K$ values, this distribution is the transition matrix $A$, size $(K \times K)$. The transition probabilities are given by,

$$A = \begin{pmatrix} a_{1,1} & a_{1,2} & \cdots & a_{1,K} \\ a_{2,1} & a_{2,2} & \cdots & a_{2,K} \\ \vdots & \vdots & \ddots & \vdots \\ a_{K',1} & a_{K',2} & \cdots & a_{K',K} \end{pmatrix}$$

where

$$a_{k',k} = p(m_t = M_k \mid m_{t-1} = M_{k'}), \qquad k', k = 1, \dots, K$$

i.e. the probability of making a particular transition from state $k'$ to state $k$ in one time step is given by $a_{k',k}$. The diagonal corresponds to the probability of the system staying in its current state, while the lower diagonal corresponds to the system moving to a negative price trend and the upper diagonal corresponds to moving to a positive price trend. $A$ has $K(K - 1)$ independent parameters and each row of $A$ is a probability distribution function such that $\sum_k a_{k',k} = 1$.

2.3.2. Emission Matrix

The probability of an observation given the hidden state is given by the emission matrix $\phi$. This matrix is a set of parameters governing the conditional distribution of the observed variables $p(\Delta Y \mid M, \phi) = \phi_k(\Delta y)$. In a discrete HMM model, the emission matrix has size the number of states in the hidden representation by the number of possible output states. For our continuous model each one of the $K$ states has an associated output distribution of a single univariate discretized Gaussian, with a mean $\mu_k$ and a variance $\sigma^2_k$, as given by Equation (3).

$$\phi_k(\Delta y) = \frac{\mathrm{Norm}(\Delta y; \mu_k, \sigma^2_k)}{\sum_{\Delta y' \in \mathcal{Y}} \mathrm{Norm}(\Delta y'; \mu_k, \sigma^2_k)} \propto \mathrm{Norm}(\Delta y; \mu_k, \sigma^2_k) \tag{3}$$

where $\mathcal{Y}$ denotes the set of all possible $\Delta y$.

2.3.3. Initial Latent Node

The initial latent node $m_1$ is special in that it does not have a parent node and so it has a marginal distribution $p(m_1)$ represented by a vector of probabilities $\pi$ with elements $\pi_k \equiv p(m_1 = k)$.

2.3.4. Number of Unknown States

As the number of latent momentum states $K$ is unknown, estimating $K$ is a model selection problem. There are various methodologies for determining $K$, both heuristic and formal, frequentist and Bayesian. We summarize some of these techniques here,

• Cross validation (Kohavi et al., 1995). Segment the data set into training and test portions. Select K which gives the best predictive performance on the training data set and then apply it to the test data set.
• Generalized likelihood ratio tests (Vuong, 1989). The ratio of two models' likelihoods is used to compute a p-value, which allows the null hypothesis to be accepted or rejected.
• Penalized likelihood criteria, such as Bayesian information criterion (BIC) (Schwarz, 1978) and Akaike information criterion (AIC) (Akaike, 1974). These criteria penalize the maximized likelihood function by the number of model parameters. The disadvantage is that they do not provide any measure of confidence in the selected model.
• Approximate Bayesian computation (Toni et al., 2009). Simulation based model selection.
• Bayesian model comparison (Kass and Raftery, 1995). Theoretically powerful, but difficult to apply in practice. This approach is often approximated by MCMC (Gilks et al., 1996).

The posterior probability $p(\mathcal{M}_k \mid \Delta Y, \Theta)$ of a model $\mathcal{M}_k$ given data $\Delta Y$ is given by Bayes theorem,

$$p(\mathcal{M}_k \mid \Delta Y, \Theta) = \frac{p(\Delta Y \mid \mathcal{M}_k, \Theta)\, p(\mathcal{M}_k)}{p(\Delta Y)}$$

For two different models $\mathcal{M}_1, \mathcal{M}_2$ with parameters $\Theta_1, \Theta_2$, the Bayes factor $B$ can be used to carry out model selection,

$$B = \frac{p(\Delta Y \mid \mathcal{M}_1)}{p(\Delta Y \mid \mathcal{M}_2)} = \frac{\int p(\Theta_1 \mid \mathcal{M}_1)\, p(\Delta Y \mid \Theta_1, \mathcal{M}_1)\, d\Theta_1}{\int p(\Theta_2 \mid \mathcal{M}_2)\, p(\Delta Y \mid \Theta_2, \mathcal{M}_2)\, d\Theta_2}$$

The chosen model is simply the model with the highest integrated likelihood $p(\mathcal{M}_k \mid \Delta Y, \Theta)$. However, at times the prior $p(\Theta \mid \mathcal{M}_k)$ is unknown and so the logarithm of the integrated likelihood can be approximated by the BIC. More accurate selection of $K$ requires evaluating the marginal likelihood $\int p(\Theta \mid \mathcal{M}_k)\, p(\Delta Y \mid \Theta, \mathcal{M}_k)\, d\Theta$, however this is an extremely difficult integral to calculate. Dealing with this integral is covered in Section 3.3 once MCMC has been introduced. In the following Section, each method of learning uses one of the above techniques to estimate $K$.

3. Learning Phase

The three independent methods of learning Θ are now presented. Results are shown for applying the methods to one year's worth of data at one minute sampling frequency from the ES future, a traded security. Full details of the dataset and its processing are described in Section 7.

3.1. Piecewise Linear Regression

The "default case" is presented as a baseline against which other methods of learning can be compared. $A$ is initialized as,

$$a_{k',k} = \begin{cases} \beta, & k' = k \\ \dfrac{1-\beta}{K-1}, & k' \neq k \end{cases}$$

where $\beta$ is the probability of the state staying in its current state and is set as $\beta = 0.5$. The $(1-\beta)/(K-1)$ term reflects the fact that in the absence of conditioning information, no state is more likely than any other state, though it is most likely to stay in its current state and thus is described as "sticky".
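The "sticky" initialization of A above and the discretized Gaussian emission of Equation (3) can be sketched as follows. This is an illustrative sketch only: the helper names `sticky_transition_matrix` and `discretized_gaussian`, the return grid and the parameter values are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

def sticky_transition_matrix(K: int, beta: float = 0.5) -> np.ndarray:
    """beta on the diagonal, (1 - beta) / (K - 1) elsewhere; requires K >= 2."""
    A = np.full((K, K), (1.0 - beta) / (K - 1))
    np.fill_diagonal(A, beta)
    return A

def discretized_gaussian(grid: np.ndarray, mu_k: float, sigma_k: float) -> np.ndarray:
    """Gaussian density evaluated on the return grid and renormalized to sum to one."""
    density = np.exp(-0.5 * ((grid - mu_k) / sigma_k) ** 2)
    return density / density.sum()

K = 3
A = sticky_transition_matrix(K, beta=0.5)

# Hypothetical return grid with tick size 0.25 and one emission row per hidden state.
grid = 0.25 * np.arange(-4, 5)
state_means = (-0.25, 0.0, 0.25)
phi = np.vstack([discretized_gaussian(grid, m, sigma_k=0.25) for m in state_means])
print(A)
print(phi.round(3))
```

Each row of `A` and each row of `phi` sums to one, so both can be used directly as conditional probability tables in the forward algorithm.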
Change points $P$ in the price $Y$ represent breaks between latent momentum states (i.e. trends). Using piecewise linear regression (PLR) on the training data set, $P$ is found (Oh, 2011). PLR gives two things: firstly a state sequence which can be used in learning later and secondly, the model mean $\mu_k$ and variance $\sigma^2_k$. PLR is simply ordinary least squares carried out over segmented data, with change points tested for by t-stats. For each segment of data that contains a trend, $\mu_k$ is the gradient of the regression and $\sigma^2_k$ the variance, found from the maximum likelihood estimate for a Gaussian noise model,

$$\sigma = \sqrt{\frac{\sum_{t=1}^{T} \epsilon_t^2}{T}}$$

where $\epsilon$ are the regression residuals. The presence of autocorrelation would suggest that PLR was not working correctly and so is checked for using the Durbin-Watson test (Durbin and Watson, 1971). Finally, it is noted that many other approaches exist for change point detection, for example (Adams and MacKay, 2007; Punskaya et al., 2002).

For the default case, the number of hidden states is found using cross validation and is set to $K = 2$.

3.2. Baum-Welch

The Baum-Welch algorithm is a special case of the EM algorithm which can be used to determine parameter estimates in an HMM when the state sequence of the latents is unknown (Baum et al., 1970). The algorithm attempts to find the parameters $\Theta$ which will maximize the likelihood of having generated $\Delta Y$, with the latent state sequence $M$ marginalized out,

$$\hat{\Theta} = \underset{\Theta}{\operatorname{argmax}}\; p(\Delta Y \mid \Theta)$$

Finding this global maximum is intractable as it requires enumerating over all parameter values $\Theta_k$ and then calculating $p(\Delta Y \mid \Theta_k)$ for each $k$. Baum-Welch avoids this global maximum and instead settles for a local maximum. As with other members of the EM class, this is achieved by computing the expected log-likelihood of the data under the current parameters and then using this to iteratively re-estimate the parameters until convergence.

In the first step of the algorithm (the E-step), Baum-Welch uses the forward-backward algorithm, which finds the smoothing distribution $p(k \mid \Delta y_{1:T})$. The Forward algorithm gives $\alpha_t(k)$, which is the probability that the model is in state $k$ at time $t$, based on the current parameters. The Backward algorithm gives $\beta_{t+1}(k)$ which is the probability of emitting the rest of the sequence if we are in state $k$ at time $t + 1$, based on the current parameters. For large numbers of observations, numerical under-flow can occur, hence in implementation log-probabilities are used (Kingsbury and Rayner, 1971). The second step of the algorithm (the M-step) sees successive iterations of the algorithm update $\Theta$, improving the likelihood up to some local maximum. This is done by calculating the occupation probabilities $\gamma_t(k)$, which is the probability of the model occupying state $k$ at time $t$. These probabilities are then used to find the maximum likelihood estimates of $A$ and $\phi$ (Juang et al., 1986).

Baum-Welch is used to find $K$ by maximizing the log-likelihood of models with $K = 1, \dots, 50$ hidden states. Penalized likelihood criteria are calculated for each model and the maximum value $K = 3$ selected. The results are shown in Figure 2.

Figure 2: Penalized likelihood criteria. Finding the number of hidden states using Baum-Welch. The optimal model of K = 3 is shown by a red dot.
[Plot: AIC and BIC statistic (×10^4) against hidden state K from 0 to 50, with the maxima marked.]

Baum-Welch requires estimates for the initial values of the emission and transition matrices. A "flat start model" is defined by setting all the values of $A$ to be equal and $\phi$ to the global mean/variance of the data. The problem with this approach is that, depending on how the initial HMM parameters are chosen, the local maximum to which Baum-Welch converges may not be the global maximum. Convergence is deemed to have occurred when either a certain number of iterations have passed or a certain log-likelihood tolerance has been met. In order to hit the global maximum, good initialization is crucial. To avoid poor local maxima, a prior is set over $\Theta$ using training data $Z$. Applying Baum-Welch to $Z$ it is noted that the square root of the model variances $\sigma^2_k$ is of the same order of magnitude as the tick-size $\alpha$ for the ES contract. This is as expected as the algorithm is unable to predict with an accuracy smaller than the grid size.
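The forward pass of the E-step, computed with log-probabilities to avoid the numerical under-flow mentioned above, can be sketched as follows. This is a minimal illustration under assumed inputs, not the authors' code: the function name `log_forward` and the small two-state example are hypothetical.

```python
import numpy as np

def log_forward(log_pi: np.ndarray, log_A: np.ndarray, log_emis: np.ndarray):
    """Forward algorithm in the log domain.

    log_pi[k]      = log p(m_1 = k)
    log_A[i, j]    = log p(m_t = j | m_{t-1} = i)
    log_emis[t, k] = log p(z_t | m_t = k)
    Returns (log_alpha, log_likelihood) with log_alpha[t, k] = log p(z_{1:t}, m_t = k).
    """
    T, K = log_emis.shape
    log_alpha = np.empty((T, K))
    log_alpha[0] = log_pi + log_emis[0]
    for t in range(1, T):
        # scores[i, j] = log alpha_{t-1}(i) + log a_{i,j}; sum over i with the max trick
        scores = log_alpha[t - 1][:, None] + log_A
        col_max = scores.max(axis=0)
        log_alpha[t] = log_emis[t] + col_max + np.log(np.exp(scores - col_max).sum(axis=0))
    top = log_alpha[-1].max()
    log_likelihood = top + np.log(np.exp(log_alpha[-1] - top).sum())
    return log_alpha, log_likelihood

# Hypothetical two-state example with three observations.
pi = np.array([0.5, 0.5])
A = np.array([[0.5, 0.5],
              [0.5, 0.5]])
emis = np.array([[0.9, 0.1],
                 [0.2, 0.8],
                 [0.5, 0.5]])
log_alpha, log_lik = log_forward(np.log(pi), np.log(A), np.log(emis))
print(np.exp(log_lik))
```

Normalizing `np.exp(log_alpha[t])` over the states gives the filtering distribution at time t, and the returned log-likelihood is the quantity that the penalized likelihood criteria adjust when comparing values of K.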
Learning the covariance structure (untied) can re- from the estimate} sult in a implementation issue that for one or more states 3: while q < maxIterations do the local maxima might settle on a small number of data 4: Go around loop until parameters converge or tol 2 points, giving σk → 0, preventing the log-likelihood in- is met creasing at each iteration of the M-Step. This is dealt 5: Forward Pass with in our implementation by never allowing the vari- 6: α1(k) = p(m1)p(z1|m1) {Initialization} 2 ance to decrease below a fraction of the tick-size, α /2. 7: for t = 2 to T do The Baum-Welch algorithm is shown in Algorithm 1. P | | 8: αt(k) = mt−1 p(zt mt)p(mt mt−1)αt−1(k) {Gen- The notation of k to refers to a particular state and not erate a forwards factor by eliminating mt−1} the indicator variable mt, as that is path-dependent. 9: end for 10: Backward Pass 3.3. Markov Chain Monte Carlo 11: βt(k) = 1 {Initialization} 12: for t = T − 1 to 1 do MCMC methods are a class of algorithms for sam- P 13: β (k) = p(z |m )p(m |m )β (k) pling from probability distributions based on construct- t mt+1 t+1 t+1 t+1 t t+1 {Generate a backwards factor by eliminating ing a Markov chain that has the desired distribution as m } its equilibrium distribution. By constructing Markov t+1 14: end for chains for sampling specific densities, marginal den- 15: Occupation Probabilities sities, posterior expectations and evidence can be cal- αt(k)βt(k) 16: γt(k) = culated. The Metropolis-Hastings algorithm (MHA) is p(zt) 17: Parameter Estimation PT a simple and widely applicable MCMC algorithm that t=1 γt(k)zt 18: µ(k) = PT uses proposal distributions to explore the target distri- t=1 γt(k) PT γ (k)(z −µ )(z −µ )T bution (Metropolis et al., 1953). 
MHA constructs a 19: σ2(k) = t=1 t t k t k {Marginalizing over PT γ (k) Markov chain by proposing a value for Θ from the pro- t=1 t k gives “tied” σ2} posal distribution, and then either accepting or rejecting  2 20: φ ∼ Norm Z; µk, σk this value (with a certain probability). Given the well PT α (k)a 0 φ (z )β (k)z established literature on MHA in the financial field, the 1 t=1 t k ,k k t+1 t+1 t 21: A = p(z) PT t=1 γt(k) reader is referred to the review at (Chib, 2001). 22: score = p(Z|A, φ) In order to find the unknown number of states in a 23: Terminate Bayesian framework, a prior distribution is placed on 24: if score < tol then model Mk and then posterior distribution of Mk is esti- 25: Θˆ = {A, φ} {Maximum likelihood estimates} mated given data ∆Y, 26: return(Θˆ ) 27: end if p(Mk|∆Y) ∝ p(∆Y|Mk) × p(Mk) 28: end while where p(Mk) is the prior, p(Mk|∆Y) is the posterior and the quantity we wish to estimate is the marginalized likelihood p(∆Y|Mk). However, as marginal likelihood integration is intractable, simulation based approaches must be used. There are many ways to approximate this marginal likelihood using MCMC draws, typically done using MHA for each Mk separately. However, all 8 known estimators have been shown to be biased (Robert x 104 Log Estimates of Marginal Likelihood −2 1 and Marin, 2008). Another technique from the literature is reversible-jump MCMC (RJMCMC), however this is highly computationally intensive (Green, 1995). Based on the lower run-time, K is estimated using marginal likelihoods. To avoid a biased estimator, this is done using a simulation based approximation of the marginal −4 0.5 likelihood called bridge sampling (Frühwirth-Schnatter,

2006). Bridge sampling takes an i.i.d. sample from an importance density and combines it with the MCMC draws from the posterior density in an appropriate way. With bridge sampling, $p(\Delta Y \mid M_k)$ is approximated by,

$$\hat{p}(\Delta Y \mid M_k) = \frac{L^{-1} \sum_{l=1}^{L} \kappa(\tilde{\theta}^{[l;k]})\, p^{*}(\tilde{\theta}^{[l;k]} \mid \Delta Y, M_k)}{N^{-1} \sum_{n=1}^{N} \kappa(\breve{\theta}^{[n;k]})\, q(\breve{\theta}^{[n;k]})}$$

where $p^{*}(\theta \mid \Delta Y, M_k) = p(\Delta Y \mid \theta, M_k) \times p(\theta \mid M_k)$ is the unnormalised posterior density of $\theta$ on $\Theta_k$, $\kappa$ is an arbitrary function on $\Theta_k$, $q$ is an arbitrary probability density on $\Theta_k$, $\breve{\theta}^{[n;k]}$ are samples from the posterior $p(\theta \mid \Delta Y, M_k)$ obtained using MHA and $\tilde{\theta}^{[l;k]}$ are i.i.d. samples from $q$ (Rydén, 2008).

A drawback to the bridge-sampling approach is that if the number of hidden states is suspected to be larger than about six, then empirically the technique becomes inaccurate and a trans-dimensional approach such as RJMCMC has to be used. This is because it is essential that all modes of the posterior density are covered by the importance density $q(\theta)$, to avoid any instability in the estimators (Frühwirth-Schnatter, 2006).

A literature review was conducted on the estimation of the number of hidden states in S&P500 daily return data. Assorted techniques including VB, RJMCMC, EM, penalized likelihood criteria and bridge-sampling all estimated the data to contain between 2 and 3 hidden states (McGrory and Titterington, 2009; Robert et al., 2000; Rydén et al., 1998; Frühwirth-Schnatter, 2008; Rydén, 2008). As a result of this we believe that $K \le 10$, while noting our data sampling frequency is significantly different from that used in the literature (one minute versus daily). In order to find $K$, a series of mixture distributions of a univariate normal are specified. For each of the $k = 1, \ldots, 10$ models the log of the bridge sampling estimator of the marginal likelihood $\hat{p}(\Delta Y \mid M_k)$ is found. The results are shown in Figure 3. It can be seen that the largest marginal likelihood is a mixture of three normal distributions, meaning $K = 3$ is the number of hidden states suggested by MCMC.

[Figure 3 plot, "Log Estimates of Marginal Likelihood": log marginal likelihood $\hat{p}(\Delta Y \mid M_k)$ and maximum log-likelihood (left axis, ×10^4) with standard error (right axis) against number of hidden states K, 1 to 10.]

Figure 3: Log of the bridge sampling estimator of the marginal likelihood $\hat{p}(\Delta Y \mid M_k)$ under the default prior for K = 1,..., 10. The maximum is at K = 3. On the right-hand axis the standard error is shown for each model.

Using this number of hidden states, $\Theta$ is found. The choice of prior is a critical step in the MCMC process and can lead to significant variations in the posterior probabilities. A proper prior is defined based on the methodology suggested by Frühwirth-Schnatter (Frühwirth-Schnatter, 2008). The prior combines the hierarchical prior for state specific variances $\sigma_k^2$ with an informative prior on the transition matrix $A$ by assuming that each row $(a_{i1}, \ldots, a_{iK})$, $i = 1, \ldots, K$ follows a Dirichlet $\mathrm{Dir}(e_{i1}, \ldots, e_{iK})$ prior where $e_{ii} = 4$ and $e_{ij} = 1/(d-1)$ for $i \ne j$. By choosing $e_{ii} > e_{ij}$ the HMM is bounded away from a finite mixture model (Frühwirth-Schnatter, 2008). The vector $\pi = \{\pi_1, \ldots, \pi_K\}$ of the initial states is drawn from the ergodic distribution of the hidden Markov chain.

As a point estimate is required for $\Theta$, we must move from the distributional estimate to a point estimate. This is done by approximating the posterior mode. The posterior mode is the value of $\Theta$ which maximizes the non-normalized mixture posterior density $\log p^{*}(\Theta \mid \Delta Y) = \log p(\Delta Y \mid \Theta) + \log p(\Theta)$. The posterior mode estimator is the optimal estimator with respect to the 0/1 loss function. The estimator is approximated by the MCMC draw with the largest value of $\log p^{*}(\Theta \mid \Delta Y)$.

As samples from the beginning of the chain may not accurately represent the desired distribution, a "burn-in" period of 2,000 draws was used. Run length was set to 10,000 draws. Implementation used the Bayesf toolbox, with full details of the approach followed found in subsection 11.3.3 of (Frühwirth-Schnatter, 2006). A selection of the MCMC outputs are shown in Figure 4.

[Figure 4 plot: four panels. Histogram of the data (density against $\Delta Y$); point process representation ($\sigma^2$ against $\mu$); posterior draws for $\mu_k$ and posterior draws for $\sigma_k^2$ (each against MCMC draw number, 0 to 10,000, for k = 1, 2, 3).]

Figure 4: Markov Chain Monte Carlo by the Metropolis-Hastings algorithm. Subplot one, histogram of the data in comparison to the fitted 3 component Gaussian mixture distribution. Subplot two, a point process representation for K = 3. Subplot three, MCMC posterior draws for $\mu_k$. Subplot four, MCMC posterior draws for $\sigma_k^2$.

3.4. Learning Summary

In this subsection the major differences between the three methods of learning are considered and the results compared. The three techniques for estimating the number of hidden states all gave very similar results. Cross validation for PLR gave $K = 2$, penalized likelihood criteria for Baum-Welch gave $K = 3$ and bridge-sampling for MCMC gave $K = 3$. For a momentum model both $K = 2$ and $K = 3$ make sense, as $K = 2$ could correspond to upward/downward-trending momentum states, with $K = 3$ meaning an additional no-trending momentum state. Any higher values of $K$ may just be considered noise. Subplot three of Figure 4 supports this hypothesis by showing that the gradient of the trend is either positive, negative or zero, corresponding to upward/downward/no-trending states. These observations translate into different conditional means for the two/three normal distributions and are reported in the results Section 7. The framework of two or three states is appealing as experiments with MACD momentum models have shown only the sign of the predictive signal has traction against the sign of future returns. The magnitude of the signal does not seem able to predict the magnitude of future returns.

The inclusion of the PLR learning allows a "naive" estimate of the system parameters to be compared to the formal EM and MCMC techniques. Both deterministic Baum-Welch and stochastic MCMC use statistical inference to find the number of hidden states and system parameters in an HMM. Baum-Welch can be used for maximum likelihood inference or for maximum a-posteriori (MAP) estimates. A Bayesian approach retains distributional information about the unknown parameters which MCMC can be used to approximate. Baum-Welch computes point estimates (modes) of the posterior distribution of parameters, while MCMC generates distributional outputs. Both learning approaches have their advantages and disadvantages. One pass of the EM algorithm is computationally similar to one sweep of MCMC, however typically many more MCMC sweeps are run than EM iterations, meaning the computational cost for MCMC is much higher. Baum-Welch does not always converge on the global maximum, whereas MCMC suffers from the difficulty of choosing a good prior and potentially poor mixing of the chain. For MCMC, estimating the number of latent states by a mixture likelihood may be a fragile process. It will obviously depend upon the distributions chosen. If a non-Gaussian distribution were selected, the mixture might be of lower order. This point also applies to the other learning approaches. In summary EM is found to be the simplest and quickest solution (Rydén, 2008). The relative predictive performance of the three sets of $\Theta$ is presented in Section 7.

So far, we have considered the relatively simple specification of two and three state Markov regime switching between Gaussian distributions. This approach is well known to be able to capture some aspects of the nonlinearity of price formation, however it does suffer from overfitting and unobservability in the underlying states. Chen and Bunn provide an interesting critique of other such approaches applied to forecasting electricity prices (Chen and Bunn, 2014). The authors conclude that a finite mixture approach to regime switching performs best in out-of-sample testing, a methodology that we may look to in future work. In the following section, the sophistication of the model is increased by the inclusion of exogenous information.

4. Side Information

In this Section, the case where the probability of any given state in $A$ is affected by side information from outside the model is considered. This is important as $A$ governs the dynamics of $Y$. In "classical" trading models, the $t + 1$ return of a security is forecast by a "signal" which is a univariate time series, typically synchronous and continuous between ±1. When this signal is > 0 the trader will go "long" and when the signal is < 0 the trader will go "short". In the simple case of a portfolio
In the simple case of a portfolio 10 consisting of only one security, the number of lots of zero mean spline ensures that no persistent bias is al- security to be traded is directly proportional to the prod- lowed over the interval of estimation. For a predictor uct of the signal magnitude and available capital. His- X = {x1, x2,..., xT } the learning and subsequent fore- torically such predictive trading signals are generated casting procedure is shown in Algorithm 2. from either “technicals” or “fundamentals”. Technicals where t = 1,..., T is intra-day time and n = are signals based on the prior behaviour of the security (Schwager, 1995b). Fundamentals are signals based on Alg. 2 Learning and Forecasting With Splines. upon extrinsic factors (such as economic data) (Schwa- ∆Yˆ = LAFWS(Y, X) ger, 1995a). In this section two predictive signals are 1: for n = 1 to N do generated and shown to have statistical traction against 2: for t = 1 to T do  y  security returns. In Section 5, the information held in 3: ∆y = log nt {Take re- ynt−1 these signals is used when learning A. This methodol- turns} ogy is quite general and as such could be applied to any 4: end for technical or fundamental predictor. ∆y−µ∆y 5: ∆y¯ = σ {Normalize the Momentum traders often want to combine their mo- ∆y return, ∆y¯ ∼ Norm(0, 12)} mentum signal with one or more extrinsic predictive sig- 6: G 0 = spline(x 0 , ∆y¯ 0 ) {Generate nals to give a single forecast. This is called the signal n−n :n n−n :n n−n :n spline G} combination problem for which a variety of different 00 7: if n > n then solutions exist, for example, Bayesian model averag- 8: for t = 1 to T do ing (Hoeting et al., 1999), frequentist model averaging 9: ∆yˆ = G 00 (x ) (Wang et al., 2009), expert tracking (Cesa-Bianchi and t n−n :n t {Evaluate the spline} Lugosi, 2006) and filtering (Genasay et al., 2001). 
It is 10: end for noted that our approach of biasing the transition dynam- 11: end if ics of an HMM momentum trading system using exter- 12: end for nal predictors seems to be another possible solution to this problem. 0 00 1,..., n , n ,..., N is inter-day time. Spline evaluation is intra-day while spline learning is inter-day where the 4.1. Forecasting with Splines spline is “grown” over time, allowing it to capture new Splines are now introduced as the methodology by information and forget old information. The normaliza- which we condition learning of the transition matrix. tion step for price is carried out using an exponentially weighted moving average process for both mean µ∆y and Splines are a way of estimating a noisy relationship 00 between dependent and independent variables, while volatility σ∆y (Pesaran et al., 2009). In our code n = 66 allowing for subsequent interpolation and evaluation days with N = 258 days and T = 856 observations per (Reinsch, 1967). Splines have been used extensively day. In this way the spline is estimated using the previ- in the financial trading literature, in areas as diverse ous 66 trading days worth of data, on a rolling basis. as volatility estimation (Audrino and Bühlmann, 2009), In the next two sections we implement two popular yield curve modelling (Bowsher and Meeks, 2008) and “off the shelf” predictors from the literature which ex- returns forecasting (Dablemont, 2010). We use a B- ploit intraday effects and use them to generate X. spline as a way of capturing a stationary, non-linear rela- tionship between predictor and security return. Splines 4.2. Predictor I: Volatility Ratio are implemented in MATLAB using the shape mod- An extensive body of empirical research exists show- elling language toolbox (D’Errico, 2011) and the curve ing that realized volatility has predictive power against fitting toolbox (MATLAB, 2009). 
In our experience fit- security returns (Christoffersen and Diebold, 2003; Hib- ting splines seems to be as much an art as a science bert et al., 2008; Giot, 2005; Burghardt and Liu, 2008). with sources of variability including how to treat end- These observations can be explained by showing that points and the number of knots depending on the degree the sign dynamics of security returns are driven by of “belief” in the underlying economic argument of the volatility dynamics (Kinlay, 2006). Modelling the re- relationship. Each spline is forced to be zero mean by turns process ∆yt as Gaussian with mean µ and con- setting the integral of the spline to be zero, as the mean ditional volatility σt allows for probability distribution value of a function is the integral of that function di- function f and a cumulative distribution function F. vided by the length of the support of that function. A The probability of a positive return p(∆yt+1) > 0 is given 11 by F = 1 − p([0, f ]). This shows the probability of 4.3. Predictor II: Seasonality a positive return is a function of conditional volatility Seasonality is an extremely well documented effect σt+1|t and so as σt+1|t increases, the probability of a pos- in the financial markets and is defined by returns from itive return falls. In order to be able to benefit from this securities’ varying predictably according to a cyclical relationship a forecast for σt+1|t is required. pattern across a time-scale (Bernstein, 1998). The time- Much literature exists on the subject of volatility fore- scale of the variation in question varies from multi-year casting, a summary of which is beyond the scope of this (Booth and Booth, 2003) to yearly (Lakonishok and paper so instead the reader is directed to three excellent Smidt, 1988), monthly (Ariel, 1987), weekly (Franses reviews (Poon and Granger, 2003; Pesaran et al., 2009; and Paap, 2000), daily (Peiro, 1994) and intraday (Tay- Zaffaroni, 2008). 
The main finding of these reviews lor, 2010). The fact that the periodicity (i.e. frequency) is that the sophisticated volatility models can not out is fixed and known a-priori, distinguishes the effect from perform the simplest models with any statistical signifi- other cyclical patterns in security returns (Taylor, 2007). cance and for that reason we use the IGARCH(1,1), oth- Intra-daily seasonality is where returns vary condi- erwise known as the J.P. Morgan Risk Metrics EWMA tionally on the location within the trading day. Hirsch model (JPM, 1996; Pafka and Kondor, 2001). A draw- observes in 1987, that in the case of the Dow Jones In- back to this approach are the recent findings in the lit- dustrial Average, the market spends most of the trading erature that volatility estimation with data above ∼20- day going down and a very small amount of time going minute frequency can lead to artifacts in the estimate up (i.e. the rises are large and fast and the falls are grad- (Andersen et al., 2011). ual and slow), with the rises happening post-open and The EWMA methodology exponentially weights the post-lunch (Hirsch, 1987). observations, representing the finite memory of the mar- A wide range of methodologies for extracting season- ket, as per Equation (4), ality signals from financial data exist in the literature, in- cluding FFT (Murphy, 1987), seasonal GARCH (Bail- lie and Bollerslev, 1991), flexible Fourier form (An- tuv Xψ dersen and Bollerslev, 1997), wavelets (Gencay et al., τ 2 σt+1|t = (1 − λ) λ ∆yt−τ (4) 2002), Bayesian auto-regression (Canova, 1993), linear τ=0 regression (Lovell, 1963) and splines (Huptas, 2009). 
As we know the size of the cycle a-priori, believe the The model has two parameters ψ (window size) and λ effect to be non-linear and prefer to work in the time- (variance decay factor where 0 < λ < 1) which are domain, splines are chosen to estimate the relation- fixed a-priori with a trade-off between λ and ψ, with a ship between time of day and security return. The small λ yielding similar results to a small ψ. The origi- use of splines seems to be a well accepted way of nal J.P. Morgan documentation suggests using λ = 0.94 capturing seasonality, for example (Martin-Rodriguez with daily frequency data, though we increase the reac- and Caceres-Hernandez, 2005; Robb, 1980; Cáceres- tivity of the term to fit our one minute frequency data Hernández and Martín-Rodríguez, 2007; Taylor, 2010; and set λ = 0.79 (Pesaran and Pesaran, 2007; Patton, Martín Rodríguez and Cáceres Hernández, 2010). 2010). This leaves the only parameter of the model as Following the approach of Martin et al a seasonal the number of historical observations ψ to include in the index is used to quantify the cycle (Martín Rodríguez estimate. and Cáceres Hernández, 2010). Here the author con- structs an index by defining the period of time under There are many technical indicators that are based consideration and then partitioning it into a periodic grid on volatility in the popular trading literature, includ- between one and T and then assigning observations to ing bollinger bands, the ratio of implied to realized buckets on this grid. The authors then capture the sea- volatility, and the ratio of current volatility to historical sonal variation by fitting a spline to the seasonal index volatility. We choose to implement the latter termed the and bucketed-data. In the case of our one-minute fre- volatility ratio as designed by Chande in 1992 (Chande, quency data the size of the period is T = 856. 
The input 1992) which requires estimating conditional volatilities to Algorithm 2 is given by X = [1,..., T]. for “now” and in the “past” (Colby, 2002; investope- dia.com, 2016; quantshare.com, 2016). We parameter 4.4. Simulation and Results sweep the ratio and select values ψ f ast = 50 and ψslow = 100 based on stability and predictive performance. The For our data set consisting of one-years worth of ES σ (ψ ) input to Algorithm 2 is given by X = t+1|t fast /σt+1|t(ψslow). data at one minute frequency, Algorithm 2 is applied to 12 the two predictors and results presented. Firstly, the two gies against a long-only portfolio for the 258 trading splines generated from the training data set are shown in days of 2011. It can be seen that the returns profile is Figure 5. It can be seen the relationship is non-linear. It different for the two strategies so that while the volatil- is also clear that the integral of the splines is zero, mean- ity strategy return is higher it also more volatile which ing that a series of random evaluations of the spline will results in the two strategies having similar risk adjusted lead to a zero mean signal as required. By the degree return profiles. Subplot two shows the mean (pre-cost) of local structure of the splines it is clear that these are annualized Sharpe ratios for the strategies. The Sharpe empirical relationships, however this does not invalidate ratio of both strategies is around 2.0, as commonly re- them as predictors, but merely requires a stronger belief quired for an intraday trading strategy to be successful. in the underlying economic hypotheses behind them. Subplot three shows the correlation coefficients between The economic interpretation of Figure 5 for the volatil- the strategies, which are either small and positive or ity ratio predictor is that a small (0.6) ratio of recent negative, as required for a diverse portfolio. 
to old volatility means that risk is falling, and so the In summary both predictors seem to have traction spline suggests buying. A large (0.8) ratio of recent to against forecasting the returns of ES and thus contain old volatility means that risk is rising, and so the spline information of predictive use. For that reason we try suggests selling. For the seasonality predictor the spline and incorporate them into our HMM momentum model. suggests buying in the early morning and selling in the The “classical way” of doing this would be to combine afternoon. the final three signals, for example, by taking a weighted The choice of the number of knots for the spline is im- mean. Rather than combine the signals outright, the portant. Too many knots means the spline will be very information held in the splines is used in the learning tightly fitted to the data, while too few knots may fail to phase. capture the relationship of interest. The problem with over-fitting the relationship being that the in-sample 5. Learning With Side Information performance will be great, but the out-of-sample per- formance will be poor. Hence it is a matter of balance 5.1. Introduction which is decided upon by intuition about the variabil- ity of the underlying economic relationship. 6 knots are The HMM of Figure 1 states that the probability of chosen for the volatility predictor and 10 knots for the transitioning between momentum states is only depen- seasonality predictor. Increasing the number of knots on dent on the last momentum state, p(mt|mt−1). From Sec- the volatility predictor to 40, doubles the predictive per- tion 4 we have two splines that we know contain use- formance in-sample, but is probably just fitting to noise. ful information when it comes to predicting security re- turns. 
In this Section the HMM is re-specified by incor- The performance of the two strategies can now be porating the side information held in the splines, such simulated against a benchmark of a long-only strategy. that the transition distribution is given by p(mt|mt−1, xt). Special care is taken to ensure that the simulation is a The belief behind this new model is that the extrinsic truly out-of-sample simulation. Specifically, each one data is of value to predicting the change in price of the of the data points used to evaluate a trade had not been security. Essentially we are saying that not all of the used in any of the previous stages of model identifica- securities’ variance can be explained by the momentum tion, learning and estimation. The annualized Sharpe effect, even though we believe it to be the dominant fac- ratio is a popular√ measure of risk-adjusted return and is tor. N(µ−r) defined as σ where µ is the mean strategy return, σ is the standard deviation of the strategy return, r is the 5.2. Input Output Hidden Markov Models risk free rate and N is the number of trading periods in In Input Output Hidden Markov Models (IOHMMs) the year. The ratio is computed by calculating a vector the observed distributions are referred to as inputs and of daily returns, generated by finding the total intraday the emission distributions as outputs (Bengio and Fras- strategy return each day and setting N = 258. This√ ag- coni, 1995). Like regular HMMs, IOHMMs have a fixed gregation approach is preferable to scaling by N for number of hidden states, however the output and transi- intraday N, as the output is more stable. As our final tion distributions are not only conditioned on the cur- signal is zero mean and interest is earned at rate r on rent state, but are also conditioned on an observed dis- short futures positions, we set r = 0. crete input value X. In the HMM of Section 2.2, Θ was The results of the simulation are shown in Figure 6. 
chosen to maximize the likelihood of the observations Subplot one shows the annual returns for the two strate- given the correct classification sequence p(∆Y|M, Θ). 13 Volatility Ratio Spline on e−Mini S&P500 Future (258 trading days of 2011) 0.2

0.1

0 Returns −0.1

−0.2 0.5 0.55 0.6 0.65 0.7 0.75 0.8 0.85 0.9 0.95 1 Volatility Ratio

x 10−3 6 Interday Seasonality Spline on e−Mini S&P500 Future (258 trading days of 2011)

4

2

0

Returns −2

−4

−6

0100 0200 0300 0400 0500 0600 0700 0800 0900 1000 1100 1200 1300 1400 15001515 Time of day (Chicago local time)

Figure 5: Forecasting splines. Subplot one shows the spline generated by the volatility ratio predictor. Subplot two shows the spline generated by the seasonality predictor. This approach could be generalized when using N predictors, by generating an N-dimensional spline.
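
The zero-mean constraint described in Section 4.1 can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses NumPy and a cubic truncated-power basis fitted by least squares rather than the MATLAB toolboxes used for the splines shown here, and the name `fit_zero_mean_spline`, the interior knot positions and the default grid size are hypothetical choices. The fitted curve is shifted by its mean value, that is, its integral divided by the length of its support.

```python
import numpy as np

def _basis(x, knots):
    """Cubic truncated-power basis: 1, x, x^2, x^3 and (x - k)_+^3 per knot."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - k, 0.0, None) ** 3 for k in knots]
    return np.column_stack(cols)

def fit_zero_mean_spline(x, y, knots, grid_size=2001):
    """Least-squares cubic spline, shifted so its mean over [min(x), max(x)] is zero.

    Hypothetical helper: the knot handling and the numerical integration
    are assumptions, not the procedure used in the paper.
    """
    y = np.asarray(y, dtype=float)
    coef, *_ = np.linalg.lstsq(_basis(x, knots), y, rcond=None)
    lo, hi = float(np.min(x)), float(np.max(x))
    grid = np.linspace(lo, hi, grid_size)
    vals = _basis(grid, knots) @ coef
    # Mean value of a function = integral / length of support (trapezoid rule).
    integral = np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2.0
    coef[0] -= integral / (hi - lo)  # shift the intercept only
    return lambda q: _basis(q, knots) @ coef
```

Because the shift changes only the intercept, the shape of the fitted relationship is unaffected; only the persistent bias is removed.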

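The volatility ratio plotted on the horizontal axis of subplot one follows from Equation (4). A minimal sketch is given below, assuming a plain Python list of one-minute returns ordered oldest to newest; the function names `ewma_vol` and `volatility_ratio` are hypothetical, and the handling of histories shorter than the window and of a zero denominator is an assumption rather than something specified in the text.

```python
import math

def ewma_vol(returns, lam=0.79, psi=50):
    """sigma_{t+1|t} = sqrt((1 - lam) * sum_{tau=0..psi} lam**tau * r_{t-tau}**2).

    Uses the most recent psi + 1 returns (fewer if the history is shorter,
    which is an assumption made for this sketch).
    """
    recent = returns[::-1][: psi + 1]  # r_t, r_{t-1}, ...
    acc = sum((lam ** tau) * r * r for tau, r in enumerate(recent))
    return math.sqrt((1.0 - lam) * acc)

def volatility_ratio(returns, lam=0.79, psi_fast=50, psi_slow=100):
    """Ratio of the fast-window to the slow-window EWMA volatility."""
    slow = ewma_vol(returns, lam, psi_slow)
    if slow == 0.0:
        return float("nan")  # undefined when there is no variation at all
    return ewma_vol(returns, lam, psi_fast) / slow
```

With a common decay factor, every term in the fast window also appears in the slow window, so the ratio computed this way cannot exceed one.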
[Figure 6 plot: three panels. "Cumulative Percentage Returns (2011-2012)": % return against time, January to January, for the volatility ratio, seasonality and long-only strategies. "Annualized Strategy Sharpe Ratios": Sharpe ratio by strategy. "Inter-Strategy Correlation Coefficients": long-only/seasonality −0.48, long-only/volatility ratio 0.17, seasonality/volatility ratio −0.15.]

Figure 6: Forecasting splines results. Subplot one shows the annual returns for the two strategies against a long-only portfolio for the 258 trading days of 2011. Subplot two shows the mean (pre-cost) annualized Sharpe ratios for the strategies. Subplot three shows the correlation coefficients between the strategies.
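
The Sharpe ratios reported in subplot two follow the daily aggregation described in Section 4.4. The sketch below assumes the intraday strategy returns are already grouped by trading day; the function names `daily_returns` and `annualized_sharpe` are hypothetical, and the use of the sample standard deviation is an assumption, as the text does not say which estimator was used.

```python
import math
from statistics import mean, stdev

def daily_returns(intraday_returns_by_day):
    """Collapse each day's intraday strategy returns into one daily total."""
    return [sum(day) for day in intraday_returns_by_day]

def annualized_sharpe(daily, periods_per_year=258, risk_free=0.0):
    """sqrt(N) * (mu - r) / sigma on a vector of daily strategy returns."""
    sigma = stdev(daily)  # sample standard deviation (assumed)
    if sigma == 0.0:
        raise ValueError("Sharpe ratio is undefined for zero-variance returns")
    return math.sqrt(periods_per_year) * (mean(daily) - risk_free) / sigma
```

Aggregating to daily totals before annualizing mirrors the choice of N = 258 in the text, rather than scaling one-minute returns by the square root of the number of intraday periods.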

IOHMMs are trained to maximize the likelihood of the conditional distribution $p(\Delta Y \mid X, \Theta)$, where the latent variable $M$ is stochastically relayed to the output variable $\Delta Y$. A schematic of the model is shown in Figure 7.

We consider the simplifying case where the input and output sequences are synchronous (Bengio et al., 1999). Such a system can be represented with discrete state space distributions for emission $p(\Delta y_t \mid m_t, x_t)$ and transition $p(m_t \mid m_{t-1}, x_t)$. When the extrinsic predictor and the HMM momentum predictor have different time stamps, or are of different sampling frequencies, an asynchronous setup is required, adding computational complexity to the forward-backward recursion (Bengio and Bengio, 1996). It is noted such a technique could allow signals of a lower frequency to be used in a high-frequency inference problem, for example, low-frequency macro-economic data could be used to bias intraday trading.

[Figure 7 diagram: nodes for the state prior $m_p$, the states $m_1, m_2, m_3$, the external information $x_1, x_2, x_3$ and the observations $\Delta y_1, \Delta y_2, \Delta y_3$, with the state transition function and the observation function labelled.]

Figure 7: Bayesian network showing the conditional independence assumption of a synchronous IOHMM. $\Delta Y$ is an observable discrete output, $X$ is an observable discrete input and $M$ is an unobservable discrete variable. The model at time $t$ is described by the latent state conditional on the observed state and some external information $p(m_t \mid \Delta y_{1:t}, x_t)$.

The literature suggests three main approaches to learning in IOHMM: artificial neural networks (Bengio and Frasconi, 1995), partially observable Markov decision processes (Bäuerle and Rieder, 2011) and EM (Bengio et al., 1999). As Baum-Welch (an EM variant) was used for learning in the HMM case, in order to be consistent we opt to learn by EM for the IOHMM case too. In terms of Algorithm 1 the only changes required to deal with the IOHMM case are to lines 8 and 13,

$$\alpha_t(k) = \sum_{m_{t-1}} p(z_t \mid m_t, x_t)\, p(m_t \mid m_{t-1}, x_{t-1})\, \alpha_{t-1}(k)$$

$$\beta_t(k) = \sum_{m_{t+1}} p(z_{t+1} \mid m_{t+1}, x_{t+1})\, p(m_{t+1} \mid m_t, x_t)\, \beta_{t+1}(k)$$

To implement this methodology a different $A$ is trained for every unique value of $X$. Such an approach has the drawbacks of over parameterization and requiring large amounts of data. This is solved by discretizing the spline according to its roots, with $R - 1$ roots giving $R$ "buckets" of spline. $x_t$ is then aligned with $\Delta y_t$, and $\Delta y_t$ assigned to one of the $R$ buckets, the contents of each bucket being concatenated to give a data vector. Baum-Welch learning with Algorithm 1 is then carried out on each of these vectors, as before. As the transition distribution $p(m_t \mid m_{t-1}, x_t)$ is time sequential, concatenating the bucketed data is strictly incorrect, as occasionally $p(m_t \mid m_{t-\tau}, x_t)$ occurs, where $\tau > 1$. In the case of the two splines in Figure 5, the discretization gives $R = 5$ and $R = 2$ for the volatility and seasonality predictors respectively. Given the smoothness of the splines, such breaks in the sequence are rare and so the resulting small loss of Markovian structure can be ignored. The obvious advantage of discretizing by roots is that parameters $\{A_1, A_2, \ldots, A_R\}$ map to signed returns. The learning algorithm for IOHMM is shown in Algorithm 3.

Alg. 3 IOHMM Learning.
$\hat{\Theta}$ = iohhmLearning($\Delta Y$, $X$)
1: $R$ = NewtonRaphson($G$) {Find the roots of spline $G$}
2: $Z_{1:R}$ = map($\Delta Y$, $X$, $R$) {Map $\Delta Y$ to buckets corresponding to the roots of $G$}
3: for $r = 1$ to $R$ do
4:   $\hat{\Theta}_r$ = BW($Z_r$) {Baum-Welch on the $R$ buckets, as per Algorithm 1}
5: end for

Using the methodology described above, two independent predictions are generated, one from each of the two IOHMM models (volatility ratio and seasonality). However, it may be the case we wish to combine the two predictors into a single prediction. In this case of more than one predictor, $X$ is treated as multivariate and a multi-dimensional spline is generated. Subject to some appropriate discretization of the spline, Algorithm 3 can then be applied to solve $p(m_t \mid m_{t-1}, \bar{x}_t)$ where $\bar{x}_t$ is a vector.

6. Inference Phase

We first present inference for the default HMM case and then consider the IOHMM case. The aim of the inference phase is to find the marginal predictive distribution $p(\Delta y_t \mid \Delta y_{1:t-1}, \Theta)$. This is found using the forward algorithm (Bishop, 2006).

The likelihood vector, size $K \times 1$, corresponds to the observation probabilities and together with the transition probabilities fully describes the model. It is defined as,

$$p(\Delta y_t \mid m_t = k, \Theta) \propto \mathrm{Norm}(\Delta y_t; \mu_k, \sigma_k^2) = \frac{1}{\sigma_k \sqrt{2\pi}} \exp\left(-\frac{1}{2\sigma_k^2} (\Delta y_t - \mu_k)^2\right) \qquad (5)$$

If the Gaussian assumption of Equation (2) was dropped then Equation (5) would be of a different form. Or in the case of a non-parametric approach, the density of $p(\Delta Y \mid M, \Theta)$ would be evaluated at this step.

The first step of the prediction is different to the subsequent steps, due to not yet being in the recursive chain. The first step starts with a prior over the hidden states,

$$p(m_1 = k \mid \Delta y_1) \propto p(m_1 = k)\, p(\Delta y_1 \mid m_1 = k) \propto \pi_k \times \mathrm{Norm}(\Delta y_1; \mu_k, \sigma_k^2)$$

$$p(m_1 = k \mid \Delta y_1) = \frac{\pi_k \times \mathrm{Norm}(\Delta y_1; \mu_k, \sigma_k^2)}{\sum_{k'} \pi_{k'} \times \mathrm{Norm}(\Delta y_1; \mu_{k'}, \sigma_{k'}^2)}$$

Once initialization has been dealt with, the rest of the process can be decomposed into a recursive formulation. The recursions update the posterior filtering distribution in two steps. Firstly, a prediction step propagates the posterior distribution at the previous time step through the target dynamics to form the one step ahead prediction distribution. Secondly, an update step incorporates the new data through Bayes' rule to form the new filtering distribution. The filtering distribution $\omega_{t|t,k}$ is given by,

$$\omega_{t|t,k} \triangleq p(m_t = k \mid \Delta y_{1:t}) \propto p(m_t = k \mid \Delta y_{1:t-1})\, p(\Delta y_t \mid m_t = k) \propto \omega_{t|t-1,k}\, p(\Delta y_t \mid m_t = k)$$

$$\omega_{t|t,k} = \frac{\omega_{t|t-1,k} \times \mathrm{Norm}(\Delta y_t; \mu_k, \sigma_k^2)}{\sum_{k'} \omega_{t|t-1,k'} \times \mathrm{Norm}(\Delta y_t; \mu_{k'}, \sigma_{k'}^2)}$$

The predictive distribution $\omega_{t|t-1,k}$ is found by multiplying the filtering distribution by the state transition matrix,

$$\omega_{t|t-1,k} = \sum_{k'} a_{kk'}\, p(m_{t-1} = k' \mid \Delta y_{1:t-1}) = \sum_{k'} a_{kk'}\, \omega_{t-1|t-1,k'}$$

Alg. 4 HMM Prediction.
Signal = HMM($\Delta Y$, $\Theta$)
1: Update for first step
2: $\omega_{1|1,k} = \pi_k \times \mathrm{Norm}(\Delta y_1; \mu_k, \sigma_k^2) \,/\, \sum_{k'} \pi_{k'} \times \mathrm{Norm}(\Delta y_1; \mu_{k'}, \sigma_{k'}^2)$
3: for $t = 2$ to $T$ do
4:   Predict
5:   $\omega_{t|t-1,k} = \sum_{k'} a_{kk'}\, \omega_{t-1|t-1,k'}$
6:   $\Delta\hat{y}_t = \sum_{k} \omega_{t|t-1,k} \times \mu_k^{*}$
7:
8:   Update
9:   $\omega_{t|t,k} = \omega_{t|t-1,k} \times \mathrm{Norm}(\Delta y_t; \mu_k, \sigma_k^2) \,/\, \sum_{k'} \omega_{t|t-1,k'} \times \mathrm{Norm}(\Delta y_t; \mu_{k'}, \sigma_{k'}^2)$
10:
11:  Output
12:  Signal$_t$ = TF($\Delta\hat{y}_t$) {Apply a transfer function}
13: end for

IOHMM version of the prediction algorithm is summarized in Algorithm 5.

Alg. 5 IOHMM Prediction.
Signal = IOHMM($\Delta Y$, $X$, $\bar{\Theta}$)
1: Update for first step
2: $\omega_{1|1,k} = \pi_k \times \mathrm{Norm}(\Delta y_1; \mu_k, \sigma_k^2) \,/\, \sum_{k'} \pi_{k'} \times \mathrm{Norm}(\Delta y_1; \mu_{k'}, \sigma_{k'}^2)$
3: for $t = 2$ to $T$ do
4:   $\Theta = F(\bar{\Theta}, x_t)$ {Parameter lookup table}
5:   Predict
6:   $\omega_{t|t-1,k} = \sum_{k'} a_{kk'}\, \omega_{t-1|t-1,k'}$
7:   $\Delta\hat{y}_t = \sum_{k} \omega_{t|t-1,k} \times \mu_k^{*}$
8:
9:   Update
10:  $\omega_{t|t,k} = \omega_{t|t-1,k} \times \mathrm{Norm}(\Delta y_t; \mu_k, \sigma_k^2) \,/\, \sum_{k'} \omega_{t|t-1,k'} \times \mathrm{Norm}(\Delta y_t; \mu_{k'}, \sigma_{k'}^2)$
11:
12:  Output
13:  Signal$_t$ = TF($\Delta\hat{y}_t$) {Apply a transfer function}
14: end for

The prediction $\Delta\hat{y}_t$ is then found by taking the expectation of the marginal predictive density distribution $p(\Delta y_t \mid \Delta y_{1:t-1})$,

$$\Delta\hat{y}_t = \sum \Delta y_t \times p(\Delta y_t \mid \Delta y_{1:t-1})$$

∆yt X X 6.0.1. Asynchronous Price Data = ∆yt p(mt = k|∆y1:t−1)p(∆yt|mt = k) In the above form, Algorithm 4 supports data which ∆yt k lies on a discrete time grid. The popularity of such syn- X ∗ = ωt|t−1,k × µk chronous methodologies in dealing with financial data k arises from the computational challenge of dealing with ∗ Where µk is the mean of the discretized Gaussian the huge amounts of data generated by the markets. In p(∆yt|mt = k). The full approach is summarized in Al- reality, financial data is asynchronous due to trades clus- gorithm 4. tering together (Dufour and Engle, 2000). Aggregation Inference in the IOHMM case is very similar to the is the process of moving from asynchronous to syn- HMM case, though here Θ is conditional on xt. The chronous data and this acts as a zero-one filter. Such a 16 rough down-sampling procedure means potentially use- The performance of the default HMM is the worst of ful high-frequency information is thrown away. The the group of models. This is as expected and reflects Bayesian approach to this problem is to keep as much the fact that A contains no information about the mar- information as possible and then let the model decide ket, as all states are equally likely. The poor PLR per- how what parts are/are not needed. formance can also be explained by the pair of negative Our model can be altered to deal with asynchronous trend terms (µPLR = [−8.99, −0.0207]), in what was a data by modifying the observation equation in Equation rising market over the simulation period. While Baum- (2) by scaling up the observation inter-arrival times, Welch was able to beat both the default HMM and the long-only case, MCMC was not able to beat the long- 2 ∆yt = µm ∆ti + t , t ∼ N(0, σ ∆ti) i ti i i mti only case. There is no reason why Baum-Welch should be able to outperform MCMC - we believe this may re- where ∆t = t − t is the time between asynchronous i i i−1 flect the difficulty in using MCMC correctly. Reasons observations. 
Such a representation suffers the draw- for the poor MCMC performance are now discussed, back that µm does not change evenly over time, but ti along with suggestions for improvement, changes asynchronously according to observation time. • Just as EM can fail to find the true global max- The HMM could be further modified to incorporate ima, MCMC can fail to converge to the station- smooth µm change, for example by using continuous- ti ary distribution of the posterior probabilities (Gilks time HMMs, but this is beyond the scope of this paper. et al., 1996). Common causes for convergence fail- ure are too few draws and poor proposal densities 7. Data and Simulation (Kalos and Whitlock, 2008). Ergodic averages of MCMC draws which were generated by random Data from the CME GLOBEX e-mini S&P500 (ES) permutation sampling are used to check conver- future is used, one of the most liquid securities’ in the gence. Convergence can be seen to occur in Fig- world. Tick data is used for the period 01/01/2011 to ure 4 as the entire MCMC chain is roughly sta- 31/12/2011, giving 258 days data. The synchronous tionary for first and second moment parameters. form of the algorithm is implemented and the tick data Cowles et al recommend checking for convergence pre-processed by aggregating to periodic spacing on a by a combination of strategies including applying one minute grid, giving 856 observations per day. Only diagnostic procedures to a small number of paral- 0100-1515 Chicago time is considered, Monday-Friday, lel chains, monitoring auto-correlations and cross- corresponding to the most liquid trading hours. 1515 correlations (Cowles and Carlin, 1996). However, Chicago time is when the GLOBEX server closes down we do not believe convergence has failed in this for its maintenance break and when the exchange offi- case. cially defines the end of the trading day. 
Only use the • The mean emission parameters are Baum-Welch front month contract is used, with contract rolling car- µ1:3 = [−0.0198, −0.00573, 0.0183], MCMC ried out 12 days before expiry. Only GLOBEX (elec- µ1:3 = [−0.122, −0.0117, 0.121]. It can be seen tronic) trades are considered, with pit (human) trades that both have negative/zero/positive trend terms, being excluded. No additional cleaning beyond what but that the numerical values for the first and third the data provider has done is carried out. state are quite different. It maybe the case that As synchronous prices are generated on a close-to- MCMC has failed to visit all the highly probable close basis, in simulation the forecast signal is lagged regions of the parameter space because of local by one period so that look-ahead is not incurred. The maxima in the posterior distribution. strategy return is then equal to the security return mul- • The step of moving from the distributional estimate tiplied by the lagged signal. Learning is carried out on to the point estimate presents an opportunity for data from the second half of 2010. For all five momen- selecting sub-optimal Θ. Our implementation of tum strategies, the mean and variance were specified for MCMC approximates the posterior mode by keep- each state (i.e. the system was not tied). Evaluation of ing the sample with the highest posterior probabil- trading strategies is an extensive field, e.g. (Aronson, ity. It is possible however, that this approach could 2006) and so just the key metrics of Sharpe ratio and end up selecting a local maxima, as opposed the returns are presented. The HMM strategies are bench- global maxima, leading to a sub-optimal estimate marked against the long-only case. The pre-cost results of Θ. In future work this step could be done by of the simulation using ES for the year 2011 are shown estimating the likelihood of each sample and then in Figure 8. taking the maximum. 
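For reference alongside the simulation results, the HMM prediction recursion of Section 6 (Algorithm 4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the MATLAB implementation used for the results: the names `norm_pdf` and `hmm_predict` are hypothetical, the transition matrix is assumed to be stored so that `A[k, j]` is the probability of moving to state k from state j (columns sum to one, matching $a_{kk'}$ in the prediction step), the state means $\mu_k$ stand in for the discretized means $\mu_k^*$, and the sign function is used as a placeholder transfer function TF.

```python
import numpy as np


def norm_pdf(x, mu, sigma):
    """Gaussian density Norm(x; mu, sigma^2), evaluated for every state at once."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def hmm_predict(dy, pi, A, mu, sigma, transfer=np.sign):
    """One-step-ahead return forecasts and trading signals from an HMM filter.

    dy    : (T,) observed returns
    pi    : (K,) prior over the hidden states
    A     : (K, K) transitions, A[k, j] = p(m_t = k | m_{t-1} = j)  (assumed layout)
    mu    : (K,) state means; sigma : (K,) state standard deviations
    """
    dy = np.asarray(dy, dtype=float)
    pi, A = np.asarray(pi, dtype=float), np.asarray(A, dtype=float)
    mu, sigma = np.asarray(mu, dtype=float), np.asarray(sigma, dtype=float)

    forecast = np.full(dy.shape[0], np.nan)  # no forecast is available at t = 1

    # Update for the first step: prior times likelihood, then normalize.
    w = pi * norm_pdf(dy[0], mu, sigma)
    w = w / w.sum()

    for t in range(1, dy.shape[0]):
        # Predict: propagate the filtering weights through the transitions.
        w_pred = A @ w
        forecast[t] = w_pred @ mu  # expected return under the predictive weights

        # Update: fold in the new observation via Bayes' rule.
        w = w_pred * norm_pdf(dy[t], mu, sigma)
        w = w / w.sum()

    return forecast, transfer(forecast)
```

Calling `hmm_predict(dy, pi, A, mu, sigma)` returns the forecast series and the signal series; the first element of each is NaN because the recursion only produces a forecast from the second observation onwards. An IOHMM variant in the spirit of Algorithm 5 would look up `A`, `mu` and `sigma` from the side information at the top of each loop iteration.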
[Figure 8 (image not reproduced). Top panel: Cumulative Percentage Returns (2011−2012), % Return against Time, for the Default HMM, Baum-Welch HMM, Volatility Ratio IOHMM, Seasonality IOHMM, MCMC HMM and Long-only strategies. Bottom left panel: Annualized Strategy Sharpe Ratios. Bottom right panel: Inter-Strategy Correlation Coefficients, with the values below; columns follow the same order as the rows.]

Default HMM               1       −0.68    −0.54    −0.78    −0.27     0.096
Baum-Welch HMM           −0.68     1        0.71     0.91     0.41    −0.017
Volatility Ratio IOHMM   −0.54     0.71     1        0.75     0.25     0.078
Seasonality IOHMM        −0.78     0.91     0.75     1        0.43    −0.12
MCMC HMM                 −0.27     0.41     0.25     0.43     1       −0.039
Long-only                 0.096   −0.017    0.078   −0.12    −0.039    1

Figure 8: Simulation results for the five variations of the HMM intraday momentum trading strategy, with K = 2 or 3, plus the long-only case.
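The bookkeeping behind the returns and Sharpe ratios summarized in Figure 8 follows the description in Section 7: the signal is lagged by one period and multiplied by the security return. A minimal sketch is given below. The helper names `strategy_returns` and `annualized_sharpe` are hypothetical, the risk-free rate is assumed to be zero, the position is assumed to be flat in the first period, and the annualization factor is left as a parameter because the paper does not state the one it uses.

```python
import numpy as np


def strategy_returns(security_returns, signal):
    """Strategy return = security return multiplied by the signal lagged one period."""
    r = np.asarray(security_returns, dtype=float)
    s = np.asarray(signal, dtype=float)
    lagged = np.concatenate(([0.0], s[:-1]))  # assumed flat before the first signal
    return r * lagged


def annualized_sharpe(returns, periods_per_year):
    """Mean over sample standard deviation of per-period returns, annualized."""
    returns = np.asarray(returns, dtype=float)
    sd = returns.std(ddof=1)
    if sd == 0.0:
        return float("nan")  # undefined for a constant return series
    return float(returns.mean() / sd * np.sqrt(periods_per_year))
```

With one-minute data, one plausible choice (an assumption here) is `periods_per_year = 856 * 258`, the number of observations per day times the number of trading days reported in Section 7.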

• Choice of prior may be more influential than might be expected (Frühwirth-Schnatter, 2008). While we have followed the recommendations of the literature, it might be that using a more diffuse prior on A would give better results, as it would allow the parameter space to be more thoroughly searched. Failure to search correctly could happen if the existing prior was too strong and overwhelmed the data, but this would be unusual given the amount of data used for learning. In particular, we believe the use of a uniform prior should cause the results of MCMC and EM to converge. Another approach would be to initialize MCMC with Baum-Welch. If the search moves away from the initial search space, then it might be the case that the chains are taking a very long time to mix.
• The proposal density may be poorly chosen, leading to acceptance rates which are too high or too low. In future work we suggest modifying the proposal density to incorporate work from optimal proposal scalings (Neal and Roberts, 2006) and adaptive algorithms (Levine and Casella, 2006) to attempt to find good proposals automatically.

Interestingly, the MCMC and Baum-Welch strategy returns have a reasonably high correlation at 0.41, suggesting that they may be picking up the same market moves, but with MCMC doing so in a less timely (optimal) fashion. Over the trading times considered, ES rose, resulting in a Sharpe ratio of 0.4, approximately equal to its long run average. The failure of MCMC to beat the long-only case, while Baum-Welch does, again points to the fact that parameter selection has failed for MCMC. All of the HMMs have a low correlation to the long-only case, which is as expected given that all the HMMs have a zero mean signal. Post-cost, the Sharpe ratio of the HMM strategies is reduced by approximately 15%.

The IOHMM models are both able to beat the Baum-Welch HMM model Sharpe ratio by more than 10%, reflecting the fact that the model is able to use the information X, which, as shown in Section 4, has predictive value. The Sharpe ratio from the IOHMM models is smaller than that from the individual side-information X predictors, because the covariance between the individual predictors and the momentum signal is greater than zero. Even though the Sharpe ratio of the IOHMM signal is less than the Sharpe ratio of the individual X predictors, this is not necessarily a bad thing, as the correlation of the IOHMM returns has decreased relative to the benchmark returns. Institutional investors tend to run "portfolios of strategies" in order to diversify strategy risk. Here any strategy with a positive expectation and a low correlation to the existing return stream may be worthy of inclusion in that portfolio, even if the performance of the new strategy does not beat the benchmark. The simulation results shown in Figure 8 suggest that this strategy could be worthy of inclusion in such a portfolio.

8. Conclusions and Further Work

8.1. Conclusions

This paper has presented a viable framework for intraday trading of the momentum effect using both Bayesian sampling and maximum likelihood for parameters, and Bayesian inference for state. The framework is intended to be a practical proposition to augment momentum trading systems based on low-pass filters, which have been in use since the 1970s. A key advantage of our state space formulation is that it does not suffer from the delayed frequency response that digital filters do. It is this time lag which is the biggest cause of predictive failure in digital filter based momentum systems, due to their poor ability to detect reversals in trend at market change points.

As the number of latent momentum states in the market data is never known, it has to be estimated. Three estimation techniques are used: cross-validation, penalized likelihood criteria and MCMC bridge sampling. All three techniques give very similar results, namely that the system consists of 2 or 3 hidden states.

Learning of the system parameters is principally carried out by two methods, namely frequentist Baum-Welch and Bayesian MCMC. Theoretically MCMC probably should be able to outperform Baum-Welch; however, when carrying out simulations on out of sample data, it is found that Baum-Welch gives the best predictive performance. The reasons for this are unclear, but it may be because selecting a good prior is hard for our system, or that the single point estimate of Baum-Welch may be close to the "correct" value, giving superior performance over the Bayesian marginalization of the parameters by MCMC.

Often a trend-following system will want to incorporate external information, in addition to the momentum signal, leading to the signal combination problem. An IOHMM is formulated as a possible solution to this problem. In an IOHMM, the transition distribution is conditioned not only on the current state, but also on an observed external signal. Two such external signals are generated, seasonality and volatility ratio, both with positive Sharpe ratios, and are incorporated into the IOHMM. The performance of the IOHMM can be seen to be improved over the HMM, suggesting the IOHMM methodology used is a possible solution to the signal combination problem.

In addition to presenting novel applications of HMMs, this paper provides additional support for the momentum effect being profitable, pre- and post-cost, and adds to the substantial body of evidence on the effect. While much of the existing literature shows that the momentum effect is strongest at the 1-3 month period, we have shown the effect is viable at higher trading frequencies too.

Finally, it is noted that this work is an instance of unsupervised learning under a single basic generative model. As such it can be linked to other work in the field by noting that when the state variables presented in this model become continuous and Gaussian, the problem can be solved by a Kalman filter, and when continuous and non-Gaussian the problem can be solved by a particle filter, for example (Christensen et al., 2012).

8.2. Further Work

In future work we would like to explore in detail why learning $\Theta$ by MCMC results in poorer performance than by Baum-Welch. In particular, the selection of the prior and the proposal density seem worthy of further investigation, as discussed in Section 7.

In this paper just the best sample of $\Theta$ was retained. An improved prediction might be possible by retaining all the samples and averaging their predictions. Fully Bayesian inference uses the distributional estimate of $\Theta$ output from MCMC. Denoting the training data as $Z$ and the out of sample data as $\Delta Y$, MCMC gives $i = 1, \ldots, I$ samples from the posterior distribution, s.t. $\Theta_i \sim p(\Theta | Z, M_k)$. The predictive density can then be determined by,

$$p(\Delta Y | Z, M_k) = \int p(\Delta Y | Z, \Theta, M_k)\, p(\Theta | Z, M_k)\, d\Theta \approx \frac{1}{I} \sum_{i=1}^{I} p(\Delta Y | Z, \Theta_i, M_k)$$

A closely related approach that could also be investigated is Bayesian Model Averaging (BMA) (Hoeting et al., 1999). While the Bayesian inference just described performs averaging over the distribution of parameters $\Theta$, BMA performs averaging at the level of the model $M_k$. BMA might be a sensible approach given the similarity of the MCMC marginal likelihoods used for model selection.

Predictive performance may also be improved by removing the model's parametric assumption and changing to use asynchronous data. By using a more natural description of emission noise, the fit of the model could be improved. In the current downsampling of the data it may be that useful high-frequency information is getting thrown away. Using asynchronous data would be the most Bayesian approach, allowing the model to decide what to do with that high-frequency information.

Finally, an interesting area of future research could be to compare the IOHMM methodology with other approaches to signal combination, such as a weighted mean of the Baum-Welch HMM and the individual predictor signals.

9. Acknowledgements

We acknowledge use of the following MATLAB toolboxes: Kevin Murphy's "Probabilistic Modeling Toolkit" https://github.com/probml/pmtk and Sylvia Frühwirth-Schnatter's "Bayesf" www.wu.ac.at/statmath/en/faculty_staff/faculty/sfruehwirthschnatter.

References

Adams, R. P., MacKay, D. J., 2007. Bayesian online changepoint detection. Tech. rep., Cambridge University.
Akaike, H., 1974. A new look at the statistical model identification. Automatic Control, IEEE Transactions on 19 (6), 716–723.
Andersen, T. G., Bollerslev, T., 1997. Intraday periodicity and volatility persistence in financial markets. Journal of Empirical Finance 4 (2-3), 115–158. High Frequency Data in Finance, Part 1.
Andersen, T. G., Bollerslev, T., Meddahi, N., 2011. Realized volatility forecasting and market microstructure noise. Journal of Econometrics 160 (1), 220–234.
Andrieu, C., Doucet, A., 2003. Online expectation-maximization type algorithms for parameter estimation in general state space models. In: Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP'03). 2003 IEEE International Conference on. Vol. 6. IEEE, pp. VI–69.
Andrieu, C., Doucet, A., Holenstein, R., 2010. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (3), 269–342.
Ang, A., Bekaert, G., 2002. Regime switches in interest rates. Journal of Business & Economic Statistics 20 (2), 163–182.
Anon, 6th Jan 2011. Momentum in financial markets: Why Newton was wrong. The Economist.
Ariel, R., 1987. A monthly effect in stock returns. Journal of Financial Economics 18 (1), 161–174.
Aronson, D., 2006. Evidence-Based Technical Analysis: Applying the Scientific Method and Statistical Inference to Trading Signals. Wiley Trading.
Attias, H., 1999. Inferring parameters and structure of latent variable models by variational Bayes. In: Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp. 21–30.
Audrino, F., Bühlmann, P., 2009. Splines for financial volatility. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71 (3), 655–670.
Baillie, R., Bollerslev, T., 1991. Intra-day and inter-market volatility in foreign exchange rates. The Review of Economic Studies 58 (3), 565–585.
Bäuerle, N., Rieder, U., 2011. Markov Decision Processes with applications to finance. Springer.
Baum, L. E., Petrie, T., Soules, G., Weiss, N., 1970. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. The annals of mathematical statistics 41 (1), 164–171.
Bengio, S., Bengio, Y., 1996. An EM algorithm for asynchronous input/output hidden Markov models. In: International Conference On Neural Information Processing. Citeseer, pp. 328–334.
Bengio, Y., Frasconi, P., 1995. An input output hmm architecture. Advances in neural information processing systems, 427–434.
Bengio, Y., et al., 1999. Markovian models for sequential data. Neural computing surveys 2, 129–162.
Bernstein, J., 1998. Seasonality: Systems, Strategies and Signals. John Wiley & Sons.
Bhar, R., Hamori, S., 2003. New evidence of linkages among g7 stock markets. Finance Letters 1 (1).
Bhar, R., Hamori, S., 2004. Hidden Markov models: applications to financial economics. Kluwer Academic Pub.
Bishop, C., 2006. Pattern Recognition and Machine Learning. Springer.
Booth, J., Booth, L., 2003. Is presidential cycle in security returns merely a reflection of business conditions? Review of Financial Economics 12 (2), 131–159.
Bowsher, C., Meeks, R., 2008. The dynamics of economic functions: modeling and forecasting the yield curve. Journal of the American Statistical Association 103 (484), 1419–1437.
Branger, N., Kraft, H., Meinerding, C., 2012. Partial information about contagion risk and portfolio choice. Tech. rep., Department of Finance, Goethe University.
Buffington, J., Elliott, R. J., 2002. American options with regime switching. International Journal of Theoretical and Applied Finance 5 (05), 497–514.
Bulla, J., Bulla, I., 2006. Stylized facts of financial time series and hidden semi-Markov models. Computational Statistics & Data Analysis 51 (4), 2192–2209.
Burghardt, G., Liu, L., 2008. How stock price volatility affects stock returns and cta returns. Tech. rep., Newedge Brokerage.
Cáceres-Hernández, J., Martín-Rodríguez, G., 2007. Heterogeneous seasonal patterns in agricultural data and evolving splines. The IUP Journal of Agricultural Economics 4 (3), 48–65.
Canova, F., 1993. Forecasting time series with common seasonal patterns. Journal of Econometrics 55 (1-2), 173–200.
Cesa-Bianchi, N., Lugosi, G., 2006. Prediction, Learning, and Games. Cambridge University Press.
Chande, T. S., March 1992. Adapting moving averages to market volatility. Technical Analysis of Stocks & Commodities magazine 10 (3), 108–114.
Charlot, P., 2012. Modelling volatility and correlations with a hidden Markov decision tree. Tech. rep., Aix-Marseille University.
Chen, D., Bunn, D., 2014. The forecasting performance of a finite mixture regime-switching model for daily electricity prices. Journal of Forecasting.
Chib, S., 2001. Markov chain Monte Carlo methods: computation and inference. Handbook of econometrics 5, 3569–3649.
Chopin, N., Pelgrin, F., 2004. Bayesian inference and state number determination for hidden Markov models: An application to the information content of the yield curve about inflation. Journal of Econometrics 123 (2), 327–344.
Christensen, H. L., Murphy, J., Godsill, S. J., 2012. Forecasting high-frequency futures returns using online langevin dynamics. Selected Topics in Signal Processing, IEEE Journal of 6 (4), 366–380.
Christoffersen, P., Diebold, F., 2003. Financial asset returns, direction-of-change forecasting, and volatility dynamics. Tech. rep., NBER.
Colby, R. W., 2002. The Encyclopedia Of Technical Market Indicators. McGraw-Hill.
Cowles, M. K., Carlin, B. P., 1996. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association 91 (434), 883–904.
Dablemont, S., 2010. Forecasting of High Frequency Financial Time Series: Concepts, Methods, Algorithms. Lambert Academic Publishing.
Dai, M., Zhang, Q., Zhu, Q. J., 2010. Trend following trading under a regime switching model. SIAM Journal on Financial Mathematics 1 (1), 780–810.
D'Errico, J., 2011. Shape language modelling. www.mathworks.com/matlabcentral/fileexchange/24443-slm-shape-language-modeling.
Dueker, M., Neely, C. J., 2007. Can Markov switching models predict excess foreign exchange returns? Journal of Banking & Finance 31 (2), 279–296.
Dueker, M. J., 1997. Markov switching in GARCH processes and mean-reverting stock-market volatility. Journal of Business & Economic Statistics 15 (1), 26–34.
Dufour, A., Engle, R., 2000. Time and the price impact of a trade. The Journal of Finance 55 (6), 2467–2498.
Durbin, J., Watson, G., 1971. Testing for serial correlation in least squares regression. iii. Biometrika 58 (1), 1–19.
Elliott, R., Hinz, J., 2002. Portfolio optimization, hidden Markov models, and technical analysis of P&F charts. International Journal of Theoretical and Applied Finance 5 (04), 385–399.
Elliott, R. J., Wilson, C. A., 2007. The term structure of interest rates in a hidden Markov setting. In: Hidden Markov Models in Finance. Springer, pp. 15–30.
Faith, C., 2007. Way of the Turtle. McGraw-Hill Professional.
Fine, S., Singer, Y., Tishby, N., 1998. The hierarchical hidden Markov model: Analysis and applications. Machine learning 32 (1), 41–62.
Franses, P., Paap, R., 2000. Modelling day-of-the-week seasonality in the s&p 500 index. Applied Financial Economics 10 (5), 483–488.
Frühwirth-Schnatter, S., 2001. Markov chain Monte Carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association 96 (453), 194–209.
Frühwirth-Schnatter, S., 2006. Finite mixture and Markov switching models. Springer Science+ Business Media.
Frühwirth-Schnatter, S., 2008. Comment on article by rydén. Bayesian Analysis 3 (4), 689–698.
Gales, M., Young, S., 2008. The application of hidden Markov models in speech recognition. Foundations and Trends in Signal Processing (Now publishers).
Gelman, A., 2011. Induction and deduction in Bayesian data analysis. Rationality, Markets and Morals (RMM) 2, 67–78.
Genasay, R., Dacorogna, M., Muller, U. A., Pictet, O., 2001. An Introduction to High-Frequency Finance. Academic Press.
Gencay, R., Selcuk, F., Whitcher, B., 2002. An Introduction to Wavelets and Other Filtering Methods in Finance and Economics. Elsevier.
Gerald, A., 1999. Technical analysis power tools for active investors. Financial Times Prentice Hall.
Ghahramani, Z., Jordan, M. I., 1997. Factorial hidden Markov models. Machine learning 29 (2-3), 245–273.
Giampieri, G., Davis, M., Crowder, M., 2005. A hidden Markov model of default interaction. Quantitative Finance 5 (1), 27–34.
Gilks, W. R., Richardson, S., Spiegelhalter, D. J., 1996. Markov chain Monte Carlo in practice. Vol. 2. Chapman & Hall/CRC.
Giot, P., 2005. Relationships between implied volatility indexes and stock index returns. The Journal of Portfolio Management 31 (3), 92–100.
Green, P. J., 1995. Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 (4), 711–732.
Grégoir, S., Lenglart, F., 2000. Measuring the probability of a business cycle turning point by using a multivariate qualitative hidden Markov model. Journal of forecasting 19 (2), 81–102.
Hamilton, J. D., 1989. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the Econometric Society, 357–384.
Harvey, A., 1991. Forecasting, structural time series models and the Kalman filter. Cambridge university press.
Hassan, M. R., Nath, B., 2005. Stock market forecasting using hidden Markov model: a new approach. In: Intelligent Systems Design and Applications, 2005. ISDA'05. Proceedings. 5th International Conference on. IEEE, pp. 192–196.
Hibbert, A. M., Daigler, R. T., Dupoyet, B., 2008. A behavioral explanation for the negative asymmetric return-volatility relation. Journal of Banking & Finance 32 (10), 2254–2266.
Hirsch, Y., 1987. Don't Sell Stocks on Monday: An Almanac for Traders, Brokers and Stock Market Investors. Penguin.
Hoeting, J., Madigan, D., Raftery, A., Volinsky, C., 1999. Bayesian model averaging: A tutorial. Statistical science 14 (4), 382–401.
Hong, H., Stein, J. C., 1999. A unified theory of underreaction, momentum trading, and overreaction in asset markets. The Journal of Finance 54 (6), 2143–2184.
Huptas, R., 2009. Intraday seasonality in analysis of uhf financial data: Models and their empirical verification. Dynamic Econometric Models 9, 1–10.
investopedia.com, 2016. The volatility ratio. www.investopedia.com/terms/v/volatility-ratio.asp.
Jegadeesh, N., Titman, S., 1999. Profitability of momentum strategies: An evaluation of alternative explanations. Tech. rep., National Bureau of Economic Research.
Johnson, T., 2002. Rational momentum effects. Journal of Finance, 585–608.
Jordan, M. I., 2009. Are you a Bayesian or a frequentist? Summer School Lecture, Cambridge.
JPM, 1996. JPM RiskMetrics - technical document. Tech. rep., JPM. URL www.riskmetrics.com/system/files/private/td4e.pdf
Juang, B., Levinson, S., Sondhi, M., 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains (corresp.). Information Theory, IEEE Transactions on 32 (2), 307–309.
Kakade, S., Teh, Y. W., Roweis, S. T., 2002. An alternate objective function for Markovian fields. In: Machine Learning International Workshop. pp. 275–282.
Kalos, M. H., Whitlock, P. A., 2008. Monte Carlo methods. Wiley-VCH.
Kass, R. E., Raftery, A. E., 1995. Bayes factors. Journal of the american statistical association 90 (430), 773–795.
Kim, A., Shelton, C., Poggio, T., 2002. Modeling stock order flows and learning market-making from data. Tech. rep., Massachusetts Institute of Technology.
Kim, C.-J., 1993. Unobserved-component time series models with Markov-switching heteroscedasticity: Changes in regime and the link between inflation rates and inflation uncertainty. Journal of Business & Economic Statistics 11 (3), 341–349.
Kingsbury, N., Rayner, P., 1971. Digital filtering using logarithmic arithmetic. Electronics Letters 7 (2), 56–58.
Kinlay, J., 2006. Predicting market direction. Tech. rep., Investment Analytics LLP.
Kohavi, R., et al., 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: International joint Conference on artificial intelligence. Vol. 14. Lawrence Erlbaum Associates Ltd, pp. 1137–1145.
Lakonishok, J., Smidt, S., 1988. Are seasonal anomalies real? a ninety-year perspective. Review of Financial Studies 1 (4), 403–425.
Lesmond, D., Schill, M., Zhou, C., 2004. The illusory nature of momentum profits. Journal of Financial Economics 71 (2), 349–380.
Levine, R. A., Casella, G., 2006. Optimizing random scan gibbs samplers. Journal of multivariate analysis 97 (10), 2071–2100.
Liesenfeld, R., 2001. A generalized bivariate mixture model for stock price volatility and trading volume. Journal of Econometrics 104 (1), 141–178.
Lo, A., MacKinlay, A., 2001. A non-random walk down Wall Street.

Princeton University Press.
Lo, A., Mamaysky, H., Wang, J., 2000. Foundations of technical analysis: Computational algorithms, statistical inference, and empirical implementation. Journal of Finance, 1705–1765.
Lovell, M. C., 1963. Seasonal adjustment of economic time series and multiple regression analysis. Journal of the American Statistical Association 58 (304), 993–1010.
Mamon, R., Elliott, R., 2007. Hidden Markov models in finance. Vol. 104. Springer Verlag.
Martin-Rodriguez, G., Caceres-Hernandez, J., 2005. Modelling the hourly spanish electricity demand. Economic Modelling 22 (3), 551–569.
Martín Rodríguez, G., Cáceres Hernández, J., 2010. Splines and the proportion of the seasonal period as a season index. Economic Modelling 27 (1), 83–88.
MATLAB, 2009. Spline Toolbox User's Guide 3.
McCulloch, R. E., Tsay, R. S., 1994. Statistical analysis of economic time series via Markov switching models. Journal of time series analysis 15 (5), 523–539.
McGrory, C. A., Titterington, D., 2009. Variational Bayesian analysis for hidden Markov models. Australian & New Zealand Journal of Statistics 51 (2), 227–244.
Meligkotsidou, L., Dellaportas, P., 2011. Forecasting with non-homogeneous hidden Markov models. Statistics and Computing 21 (3), 439–449.
Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., Teller, E., 1953. Equation of state calculations by fast computing machines. The journal of chemical physics 21, 1087.
Murphy, J., 1987. The seasonality of risk and return on agricultural futures positions. American Journal of Agricultural Economics 69 (3), 639–646.
Neal, P., Roberts, G., 2006. Optimal scaling for partially updating mcmc algorithms. The Annals of Applied Probability 16 (2), 475–515.
Oh, E. S., May 2011. Bayesian particle filtering for prediction of financial time series. Master's thesis, Cambridge University.
Pafka, S., Kondor, I., 2001. Evaluating the riskmetrics methodology in measuring volatility and value-at-risk in financial markets. Physica A: Statistical Mechanics and its Applications 299, 305–310.
Patton, A. J., 2010. Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics 160 (1), 246–256.
Peiro, A., 1994. Daily seasonality in stock returns: Further international evidence. Economics Letters 45 (2), 227–232.
Pesaran, B., Pesaran, M. H., 2007. Volatilities and conditional correlations in futures markets with a multivariate t distribution. CESifo Working Paper Series 2056, CESifo Group Munich.
Pesaran, M., Schleicher, C., Zaffaroni, P., 2009. Model averaging in risk management with an application to futures markets. Journal of Empirical Finance 16 (2), 280–305.
Pidan, D., El-yaniv, R., 2011. Selective prediction of financial trends with hidden Markov models. In: Advances in Neural Information Processing Systems. pp. 855–863.
Poon, S. H., Granger, C. W., 2003. Forecasting volatility in financial markets: A review. Journal of Economic Literature XLI, 478–539.
Punskaya, E., Andrieu, C., Doucet, A., Fitzgerald, W., 2002. Bayesian
rior probability approximation technique. Bayesian Analysis 3 (2), 427–441.
Robert, C. P., Ryden, T., Titterington, D. M., 2000. Bayesian inference in hidden Markov models through the reversible jump Markov chain Monte Carlo method. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 62 (1), 57–75.
Roman, D., Mitra, G., Spagnolo, N., 2010. Hidden Markov models for financial optimization problems. IMA Journal of Management Mathematics 21 (2), 111–129.
Rossi, A., Gallo, G. M., 2006. Volatility estimation via hidden Markov models. Journal of Empirical Finance 13 (2), 203–230.
Rydén, T., 2008. EM versus Markov chain Monte Carlo for estimation of hidden Markov models: A computational perspective. Bayesian Analysis 3 (4), 659–688.
Rydén, T., Teräsvirta, T., Åsbrink, S., 1998. Stylized facts of daily return series and the hidden Markov model. Journal of applied econometrics 13 (3), 217–244.
Satchell, S., Acar, E., 2002. Advanced Trading Rules. Butterworth-Heinemann.
Schwager, J. D., 1995a. Fundamental Analysis (Schwager on Futures). John Wiley & Sons.
Schwager, J. D., 1995b. Technical Analysis (Schwager on Futures). John Wiley & Sons.
Schwarz, G., 1978. Estimating the dimension of a model. The annals of statistics 6 (2), 461–464.
Scott, S. L., 2002. Bayesian methods for hidden Markov models: Recursive computing in the 21st century. Journal of the American Statistical Association 97 (457), 337–351.
Shephard, N., 1994. Partial non-gaussian state space. Biometrika 81 (1), 115–131.
Shi, S., Weigend, A. S., 1997. Taking time seriously: Hidden Markov experts applied to financial engineering. In: Computational Intelligence for Financial Engineering (CIFEr), 1997., Proceedings of the IEEE/IAFE 1997. IEEE, pp. 244–252.
Shiller, R., 2005. Irrational exuberance. Princeton University Press.
Taylor, J., 2010. Exponentially weighted methods for forecasting intraday time series with multiple seasonal cycles. International Journal of Forecasting 26 (4), 627–646.
Taylor, S. J., 2007. Asset Price Dynamics, Volatility, and Prediction. Princeton University Press.
Teitelbaum, R., January 2008. The code breaker. Bloomberg Magazine, 32–48. An interview with the CEO of Renaissance Technologies.
Thomas, L. C., Allen, D. E., Morkel-Kingsbury, N., 2002. A hidden Markov chain model for the term structure of bond credit risk spreads. International Review of Financial Analysis 11 (3), 311–329.
Toni, T., Welch, D., Strelkowa, N., Ipsen, A., Stumpf, M. P., 2009. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of the Royal Society Interface 6 (31), 187–202.
Vuong, Q. H., 1989. Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica: Journal of the Econometric Society, 307–333.
Wang, H., Zhang, X., Zou, G., 2009.
Frequentist model averaging curve fitting using MCMC with applications to signal segmenta- estimation: a review. Journal of Systems Science and Complexity tion. Signal Processing, IEEE Transactions on 50 (3), 747–758. 22, 732–748. quantshare.com, 2016. The standard de- Wilmott, P., 2006. Paul Wilmott on Quantitative Finance. Wiley. viation ratio.www.quantshare.com/ Wisebourt, S., 2011. Hierarchical hidden Markov model of high- item-1039-standard-deviation-ratio. frequency market regimes using trade price and limit order book Reinsch, C., 1967. Smoothing by spline functions. Numerical Mathe- information. Master’s thesis, University of Waterloo. matics 10, 177–183. Yu, S.-Z., 2010. Hidden semi-Markov models. Artificial Intelligence Robb, A. L., 1980. Accounting for seasonality with spline functions. 174 (2), 215–243. The Review of Economics and Statistics 62 (2), 321–323. Zaffaroni, P., 2008. Large-scale volatility models: theoretical prop- Robert, C. P., Marin, J.-M., 2008. On some difficulties with a poste- erties of professionals’ practice. Journal of Time Series Analysis

22 29 (3), 581–599.

23