Deep Learning Financial Market Data

Steven Hutt, [email protected], 12 December 2016

The Limit Order Book

A financial exchange provides a central location for electronically trading contracts. Trading is implemented on a Matching Engine, which matches buy and sell orders according to certain priority rules and queues orders which do not immediately match. A Marketable limit order is immediately matched and results in a trade. A Passive limit order cannot be matched and joins the queue.

Data In: client ID tagged order flow. Data Out: LOB state (quantity at price).
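As a concrete sketch of these priority rules, a toy matching step might look as follows (the class and method names are illustrative, not any real exchange API):

from collections import defaultdict, deque

class ToyBook:
    """Toy limit order book: price level -> FIFO queue of resting order sizes."""
    def __init__(self):
        self.bids = defaultdict(deque)   # buy side
        self.asks = defaultdict(deque)   # sell side

    def submit_buy(self, price, qty):
        # Marketable portion: crosses the best ask and trades immediately.
        while qty > 0 and self.asks and price >= min(self.asks):
            best = min(self.asks)        # price priority: best ask first
            queue = self.asks[best]
            fill = min(qty, queue[0])    # time priority: oldest order first
            qty -= fill
            queue[0] -= fill
            if queue[0] == 0:
                queue.popleft()
            if not queue:
                del self.asks[best]      # price level exhausted
        if qty > 0:
            self.bids[price].append(qty) # Passive remainder joins the queue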

We need to learn data sequences. [Figure: LOB dynamics; credit A. Booth.]

1 Representing the Limit Order Book

At a given point in time, the LOB may be represented by:

Level   Bid Px   Bid Sz   Ask Px   Ask Sz
  1      23.0       6      24.0       2
  2      22.5      13      24.5       5
  3      22.0      76      25.0      17
  4      21.5     132      25.5      14
  5      21.0      88      26.0       8

5-Level Limit Order Book

Bid Side: the quantity of orders at each price level at or below the best bid.
Ask Side: the quantity of orders at each price level at or above the best ask.

A possible representation is $x_t = a_t \sqcup b_t$, where
$$a_t = \underbrace{(88, 132, 76, 13, 6, 2, 5, 17, 14, 8)}_{10\ \text{elements}}, \qquad b_t = (23.0, 24.0),$$
i.e. the sizes at the five bid and five ask levels, together with the best bid and ask prices. The evolution of the LOB over time is represented by a sequence:

$(x_1, x_2, x_3, \dots, x_n, \dots)$
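As a sketch, the vector $x_t$ for the 5-level book above can be assembled as follows (the ordering of the sizes in $a_t$ is an assumption consistent with the figure):

import numpy as np

# 5-level book from the table above.
bid_px = [23.0, 22.5, 22.0, 21.5, 21.0]
bid_sz = [6, 13, 76, 132, 88]
ask_px = [24.0, 24.5, 25.0, 25.5, 26.0]
ask_sz = [2, 5, 17, 14, 8]

# a_t: sizes from the deepest bid up to the best bid, then best ask down.
a_t = np.array(bid_sz[::-1] + ask_sz, dtype=np.float32)
# b_t: best bid and best ask prices.
b_t = np.array([bid_px[0], ask_px[0]], dtype=np.float32)
# x_t = a_t ⊔ b_t: one 12-dim snapshot in the sequence (x_1, x_2, ...).
x_t = np.concatenate([a_t, b_t])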

2 LOB Pattern Search for Prohibited Trading Patterns

Motivation: Regulators identify prohibited patterns of trading activity detrimental to orderly markets. Financial exchanges are responsible for maintaining orderly markets (e.g. the Flash Crash and the 'Hound of Hounslow').
Challenge: Identify prohibited trading patterns quickly and efficiently.
Goal: Build a trading pattern search function using deep learning: given a sample trading pattern, identify similar patterns in historical LOB data.

3 Learning Sequences with RNNs

In a Recurrent Neural Network (RNN) there are nodes which feed back on themselves. Below, $x_t \in \mathbb{R}^{n_x}$, $h_t \in \mathbb{R}^{n_h}$ and $z_t \in \mathbb{R}^{n_z}$, for $t = 1, 2, \dots, T$. Each node is a vector (1-dim array).

[Figure: an RNN unrolled in time. Inputs $x_1, \dots, x_T$ feed the states via $W_{hx}$; states $h_0, h_1, \dots, h_T$ propagate via $W_{hh}$; outputs $z_1, \dots, z_T$ read the states via $W_{zh}$.]

The RNN is defined by update equations:

$$h_t = f(W_{hh} h_{t-1} + W_{hx} x_t + b_h), \qquad z_t = g(W_{zh} h_t + b_z),$$

where $W_{uv} \in \mathbb{R}^{n_u \times n_v}$ and $b_u \in \mathbb{R}^{n_u}$ are the RNN weights and biases respectively. The functions $f$ and $g$ are generally nonlinear.
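A minimal sketch of the forward pass, assuming $f = \tanh$ and $g$ the identity (the dimensions and initialization below are illustrative only):

import numpy as np

n_x, n_h, n_z, T = 12, 32, 12, 100
rng = np.random.default_rng(0)

# RNN parameters theta = {W_hh, W_hx, W_zh, b_h, b_z}.
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_hx = rng.normal(scale=0.1, size=(n_h, n_x))
W_zh = rng.normal(scale=0.1, size=(n_z, n_h))
b_h, b_z = np.zeros(n_h), np.zeros(n_z)

def rnn_forward(xs, h):
    """Map an input sequence (x_1, ..., x_T) to outputs (z_1, ..., z_T)."""
    zs = []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)   # state update, f = tanh
        zs.append(W_zh @ h + b_z)                # readout, g = identity
    return zs

zs = rnn_forward(rng.normal(size=(T, n_x)), np.zeros(n_h))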

Given parameters $\theta = \{W_{hh}, W_{hx}, W_{zh}, b_h, b_z\}$, an RNN defines a mapping of sequences

$$F_\theta : X = (x_1, x_2, \dots, x_T) \mapsto Z = (z_1, z_2, \dots, z_T).$$

4 Supervised Training for RNNs

Supervised Learning: Given sample input sequences $\{X^{(1)}, X^{(2)}, \dots, X^{(K)}\}$ with corresponding outputs $\{Z^{(1)}, Z^{(2)}, \dots, Z^{(K)}\}$, choose the RNN parameters $\theta$ so as to minimize
$$\mathcal{L}(\theta) = \sum_{k=1}^{K} \mathcal{L}_k(\theta) \quad \text{where} \quad \mathcal{L}_k(\theta) = \sum_{t=1}^{T} L\bigl(F_\theta(x_t^{(k)}),\, z_t^{(k)}\bigr)$$

and $\mathcal{L}_t(\theta) = L(F_\theta(x_t), z_t)$ is some loss function, e.g. least squares. Backpropagation Through Time (BPTT): Fixing some time $t$ above, we have

$$\frac{\partial \mathcal{L}_t}{\partial \theta} = \sum_{i=1}^{t} \frac{\partial \mathcal{L}_t}{\partial x_t}\, \frac{\partial x_t}{\partial x_i}\, \frac{\partial x_i}{\partial \theta} \qquad \text{where} \qquad \frac{\partial x_t}{\partial x_i} = \prod_{t \ge j > i} \frac{\partial x_j}{\partial x_{j-1}}.$$

But this final product leads to vanishing / exploding gradients! Vanishing gradients limit the RNN's capacity to learn from the distant past. Exploding gradients destroy the Gradient Descent algorithm. Problem: How to allow information in early states to influence later states? Note that trading patterns can extend over 100s of time steps.
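The mechanism is easy to see in a toy case: the Jacobian product behaves like a matrix power, so its norm collapses or blows up geometrically with the gain of the recurrence (a sketch with an orthogonal matrix, not tied to any data):

import numpy as np

rng = np.random.default_rng(0)
# Orthogonal W isolates the gain: ||(s W)^T|| is exactly s**T.
W, _ = np.linalg.qr(rng.normal(size=(32, 32)))

for scale in (0.9, 1.1):            # contractive vs expansive recurrence
    g = np.eye(32)
    for _ in range(100):            # 100 steps, like long trading patterns
        g = (scale * W) @ g         # stand-in for the Jacobian product
    print(scale, np.linalg.norm(g, 2))   # ~3e-5 vs ~1e4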

5 Long Short-Term Memory Units

We would like to create paths through time through which the gradient neither vanishes nor explodes. We also sometimes need to forget information once it has been used. An LSTM [3] learns long-term dependencies by adding a memory cell and controlling what gets modified and what gets forgotten:

[Figure: LSTM cell. The memory $c_t$ first passes through a forget gate ($\sigma$, $*$), is then modified by an input gate ($\sigma$, $\tanh$, $+$) to give $c_{t+1}$, and an output gate ($\sigma$) applied to $\tanh(c_{t+1})$ produces the new state $h_{t+1}$; all gates are driven by the state $h_t$ and the input $x_t$.]
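For reference, one standard formulation of the cell above (as in [3]; the exact wiring of the diagram is assumed), with $*$ denoting elementwise multiplication:

$$\begin{aligned}
f_t &= \sigma(W_f [h_t, x_t] + b_f) && \text{forget gate} \\
i_t &= \sigma(W_i [h_t, x_t] + b_i), \quad \tilde{c}_t = \tanh(W_c [h_t, x_t] + b_c) && \text{modify} \\
c_{t+1} &= f_t * c_t + i_t * \tilde{c}_t && \text{memory update} \\
h_{t+1} &= \sigma(W_o [h_t, x_t] + b_o) * \tanh(c_{t+1}) && \text{new state}
\end{aligned}$$

The additive memory update creates a path through time along which gradients need not vanish.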

An LSTM can remember state dependencies further back than a standard RNN and has become one of the standard approaches to learning sequences.

6 Unitary RNNs

The behaviour of the LSTM can be difficult to analyse. An alternative approach [1] is to ensure there are no contractive directions, that is, to require $W_{hh} \in U(n)$, the space of unitary matrices. This preserves the simple form of an RNN but ensures the state dependencies go further back in the sequence. A key problem is that gradient descent does not preserve the unitary property:
$$x \in U(n) \mapsto x + \lambda v \notin U(n).$$

The solution in [1] is to choose $W_{hh}$ from a subset generated by parametrized unitary matrices:
$$W_{hh} = D_3 R_2 \mathcal{F}^{-1} D_2 \Pi R_1 \mathcal{F} D_1.$$

Here each $D_i$ is diagonal, each $R_i$ is a reflection, $\mathcal{F}$ is the Fourier transform and $\Pi$ is a permutation. Each factor is unitary, so $W_{hh}$ is unitary by construction.
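A sketch of the construction in NumPy, checking that the composite is unitary (the dimensions and random parameter choices are illustrative):

import numpy as np

n = 8
rng = np.random.default_rng(0)

def diag_unitary():
    # Diagonal with unit-modulus entries e^{iw}, parametrized by angles w.
    return np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, n)))

def reflection():
    # Householder reflection I - 2vv*/||v||^2, parametrized by a vector v.
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return np.eye(n) - 2 * np.outer(v, v.conj()) / np.vdot(v, v).real

F = np.fft.fft(np.eye(n)) / np.sqrt(n)    # unitary DFT matrix
P = np.eye(n)[rng.permutation(n)]         # fixed permutation matrix
D1, D2, D3 = diag_unitary(), diag_unitary(), diag_unitary()
R1, R2 = reflection(), reflection()

W_hh = D3 @ R2 @ F.conj().T @ D2 @ P @ R1 @ F @ D1
print(np.allclose(W_hh.conj().T @ W_hh, np.eye(n)))   # True: W_hh is unitary

Because each factor is parametrized (angles, reflection vectors), gradient steps on those parameters keep $W_{hh}$ inside $U(n)$ by construction.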

7 RNNs as Programs addressing External Memory

RNNs are Turing complete but the theoretical capabilities are not matched in practice due to training inefficiencies. The Neural Turing Machine (NTM) emulates a differentiable Turing Machine by adding external addressable memory and using the RNN state as a memory controller [2].

[Figure: vanilla NTM architecture. An RNN controller receives input x_t and emits output y_t, reading from and writing to external memory locations m_l through read and write heads.]

NTM equivalent 'code':

    initialise: move head to start location
    while input delimiter not seen do
        receive input vector
        write input to head location
        increment head location by 1
    end while
    return head to start location
    while true do
        read output vector from head location
        emit output
        increment head location by 1
    end while

8 Deep RNNs

[Figure: a 3-layer deep RNN unrolled in time. The input $x_t$ feeds layer $L_1$ via $W_{h^1 x}$; each layer $L_i$ evolves its state $h^i_t$ via $W_{h^i h^i}$ and receives the state of $L_{i-1}$; the output $z_t$ reads the top layer via $W_{z h^3}$.]

The input to hidden layer $L_i$ is the state $h^{i-1}_t$ of hidden layer $L_{i-1}$. Deep RNNs can learn at different time scales.

9 Autoencoders

The more regularities there are in data the more it can be compressed. The more we are able to compress the data, the more we have learned about the data. (Grünwald, 1998)


[Figure: feedforward autoencoder. The input $(x_1, \dots, x_6)$ is encoded through hidden layers $h^1$ and $h^2$ (the narrow code layer) and decoded through $h^3$ to a reconstruction $(y_1, \dots, y_6)$, with weights $W_{h^1 x}, W_{h^2 h^1}, W_{h^3 h^2}, W_{y h^3}$: input → encode → code → decode → reconstruct.]

10 Recurrent Autoencoders

[Figure: recurrent autoencoder. An encoder RNN with states $h^e_0, h^e_1, \dots, h^e_T$ reads $(x_1, \dots, x_T)$; its final state is copied into a decoder RNN ($h^d_0 = h^e_T$) whose states $h^d_1, \dots, h^d_T$ emit $(z_1, \dots, z_T)$, with $z_0, \dots, z_{T-1}$ fed back as decoder inputs.]

A Recurrent Autoencoder is trained to output a reconstruction $z$ of the input sequence $x$. All data must pass through the narrow final encoder state $h^e_T$. Compression between input and output forces pattern learning.

11 Recurrent Autoencoder Updates


Encoder updates:
$$h^e_1 = f(W^e_{hh} h^e_0 + W^e_{hx} x_1 + b^e_h), \quad h^e_2 = f(W^e_{hh} h^e_1 + W^e_{hx} x_2 + b^e_h), \quad h^e_3 = f(W^e_{hh} h^e_2 + W^e_{hx} x_3 + b^e_h), \; \dots$$

Copy: $h^d_0 = h^e_T$.

Decoder updates:
$$h^d_1 = f(W^d_{hh} h^d_0 + W^d_{hx} z_0 + b^d_h), \quad h^d_2 = f(W^d_{hh} h^d_1 + W^d_{hx} z_1 + b^d_h), \quad h^d_3 = f(W^d_{hh} h^d_2 + W^d_{hx} z_2 + b^d_h), \; \dots$$

12 Deep Recurrent Autoencoder

z1 z2 z3 ··· zT

d d d d ··· d h30 h31 h32 h33 h3T decoder_2

d d d d ··· d h20 h21 h22 h23 h2T decoder_1

= he he he he ··· he d d d d ··· d encoder_2 30 31 32 33 3T h10 h11 h12 h13 h1T decoder_0

e e e e ··· e z z z ··· z − encoder_1 h20 h21 h22 h23 h2T 0 1 2 T 1

e e e e ··· e . encoder_0 h10 h11 h12 h13 h1T

x1 x2 x3 ··· xT

A Deep Recurrent Autoencoder allows learning patterns on different time scales. This is important for trading patterns, which may occur over a few or over many time steps.

13 TensorFlow Deep Recurrent Autoencoder

with tf.variable_scope('encoder_0') as scope:
    for t in range(1, seq_len):
        # placeholder for input data
        xs_0_enc[t] = tf.placeholder(tf.float32, shape=[None, nx_enc])
        hs_0_enc[t] = lstm_cell_l0_enc(xs_0_enc[t], hs_0_enc[t - 1])
        scope.reuse_variables()

with tf.variable_scope('encoder_1') as scope:
    for t in range(1, seq_len):
        # encoder_1 input is encoder_0 hidden state
        xs_1_enc[t] = hs_0_enc[t]
        hs_1_enc[t] = lstm_cell_l1_enc(xs_1_enc[t], hs_1_enc[t - 1])
        scope.reuse_variables()

with tf.variable_scope('encoder_2') as scope:
    for t in range(1, seq_len):
        # encoder_2 input is encoder_1 hidden state
        xs_2_enc[t] = hs_1_enc[t]
        hs_2_enc[t] = lstm_cell_l2_enc(xs_2_enc[t], hs_2_enc[t - 1])
        scope.reuse_variables()
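The slide shows only the encoder stack. A sketch of the copy step, a decoder (single layer, for brevity) and the reconstruction loss might look as follows; lstm_cell_l0_dec, nh_dec and the readout variables are assumptions mirroring the encoder code:

with tf.variable_scope('decoder_0') as scope:
    # copy: decoder starts from the final top-layer encoder state
    hs_0_dec = {0: hs_2_enc[seq_len - 1]}
    zs = {0: tf.zeros_like(xs_0_enc[1])}   # z_0: zero start input
    for t in range(1, seq_len):
        # decoder input at step t is the previous reconstruction z_{t-1}
        hs_0_dec[t] = lstm_cell_l0_dec(zs[t - 1], hs_0_dec[t - 1])
        W_out = tf.get_variable('W_out', shape=[nh_dec, nx_enc])
        b_out = tf.get_variable('b_out', shape=[nx_enc])
        zs[t] = tf.matmul(hs_0_dec[t], W_out) + b_out   # readout to input space
        scope.reuse_variables()

# least-squares reconstruction loss, summed over the sequence
loss = tf.add_n([tf.reduce_mean(tf.square(zs[t] - xs_0_enc[t]))
                 for t in range(1, seq_len)])
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)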

14 Pattern Search

Goal: Build a trading pattern search function using Deep Learning. Given a sample trading pattern, identify similar patterns in historical LOB data. Methodology: Train the deep recurrent autoencoder on sequences of LOB data to define a mapping from LOB sequences to fixed-length vector encodings:

$$X = \{x_1, x_2, \dots, x_T\} \;\xrightarrow{\ \text{encode}\ }\; X_{enc} = h^e_{3,T}$$

For all historical LOB data $\{X^{(1)}, X^{(2)}, \dots, X^{(K)}\}$, compute $\{X^{(1)}_{enc}, X^{(2)}_{enc}, \dots, X^{(K)}_{enc}\}$. Pattern Search: Define $\mathrm{dist}(X, Y) = d(X_{enc}, Y_{enc})$, where $d$ is some metric on $\mathbb{R}^n$ with $n = \dim(h^e_{3,T})$.
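With the encodings stacked into an array, the search reduces to nearest neighbours under $d$ (here Euclidean; the function is an illustrative sketch):

import numpy as np

def search(target_enc, hist_enc, k=5):
    """Indices of the k historical encodings nearest the target pattern.

    target_enc: (n,) encoding of the sample pattern, h^e_{3,T}
    hist_enc:   (K, n) encodings of all historical LOB sequences
    """
    d = np.linalg.norm(hist_enc - target_enc, axis=1)   # dist(X, Y)
    return np.argsort(d)[:k]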

15 Sample Search Results (Small Data Set)

[Figures: search target (LHS) vs search result (RHS) pairs. The example pairs are labelled 'time displacement' and 'large quantity bias'.]

16 Generative Models I

It is useful to learn the distribution of LOB sequences and be able to generate samples from this distribution:
Backtesting: generate large-scale sample sets for systems testing.
Transfer Learning: augment historical data for illiquid contracts.
Latent Variables: learn driving variables for LOB markets.

The recurrent autoencoder above does not learn the data distribution and cannot generate samples of LOB sequences. Instead: Generative Models generate examples like those already in the data.

17 Generative Models II

We assume a probabilistic model for the data $x$ in the form
$$p(x) = \int p_\theta(x \mid h)\, p(h)\, dh$$

for latent variables $h$, and seek to choose parameters $\theta$ which maximize $p(x)$ over the data. General Approach: the distribution $p(h)$ can be simple (e.g. Gaussian) so long as

$p_\theta(x \mid h)$ is universal (e.g. a neural network).

How to maximize $p(x)$? Sample $h_k \sim p(h)$ and set $p(x) \simeq \frac{1}{n} \sum_k p_\theta(x \mid h_k)$.

But: for most $h$, $p_\theta(x \mid h)$ will be close to zero, so convergence is poor.
Better: sample $h$ from $p_\theta(h \mid x)$ and set $p(x) = \mathbb{E}_{h \sim p_\theta(h \mid x)}\bigl(p_\theta(x \mid h)\bigr)$.
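A toy illustration of why the naive estimate converges poorly, on a 1-d Gaussian model (the numbers are purely illustrative): almost all prior samples contribute essentially nothing.

import numpy as np

rng = np.random.default_rng(0)
x = 4.0                               # observed data point

# Model: h ~ N(0, 1), x | h ~ N(h, 0.1^2).
h = rng.normal(size=100_000)          # naive: sample h_k from the prior p(h)
lik = np.exp(-0.5 * ((x - h) / 0.1) ** 2) / (0.1 * np.sqrt(2 * np.pi))
print(lik.mean())                     # noisy Monte Carlo estimate of p(x)
print((lik > 1e-6).mean())            # tiny fraction of samples matter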

18 Variational Autoencoder

We assume the market data {xk} is sampled from a probability distribution p(x) with a 'small' number of latent variables h.

Assume pθ(x, h) = pθ(x | h)pθ(h) where θ are weights of a NN/RNN. Then

$$p_\theta(x) = \sum_h p_\theta(x \mid h)\, p_\theta(h)$$

[Figure: graphical model. Latent $h$ with prior $p_\theta(h)$; likelihood $p_\theta(x \mid h)$ and posterior $p_\theta(h \mid x)$ connect $h$ and $x$; marginal $p_\theta(x)$.]

Train the network to minimize
$$\mathcal{L}_\theta(\{x_k\}) = \min_\theta \sum_k -\log(p_\theta(x_k)).$$

Compute pθ(h | x) and cluster.

But pθ(h | x) is intractable!

19 Variational Inference

Variational Inference learns an NN/RNN approximation qϕ(h | x) to pθ(h | x) during training.

Think of qϕ as encoder and pθ decoder networks: an autoencoder.

Then $\log p_\theta(x_k) \ge \mathcal{L}_{\theta,\phi}(x_k)$, where
$$\mathcal{L}_{\theta,\phi}(x_k) = \underbrace{-D_{KL}\bigl(q_\phi(h \mid x_k)\,\|\,p_\theta(h)\bigr)}_{\text{regularization term}} + \underbrace{\mathbb{E}_{q_\phi}\bigl(\log p_\theta(x_k \mid h)\bigr)}_{\text{reconstruction term}}$$
is the variational lower bound.

[Figure: graphical model as before, with the encoder $q_\phi(h \mid x)$ approximating the posterior $p_\theta(h \mid x)$.]

Reconstruction term = log-likelihood w.r.t. $p_\theta(x \mid h)$, approximated by sampling $h$ from $q_\phi$.
Regularization term = pulls the encoder towards the target prior.
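A sketch of the two terms for a Gaussian encoder $q_\phi(h \mid x) = N(\mu(x), \sigma(x)^2)$ and standard normal prior, in the same TF1 style as the earlier code (the linear stand-in layers and dimensions are assumptions, not the talk's architecture):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 10])
mu = tf.layers.dense(x, 2)                    # encoder mean
log_var = tf.layers.dense(x, 2)               # encoder log-variance
eps = tf.random_normal(tf.shape(mu))
h = mu + tf.exp(0.5 * log_var) * eps          # reparametrized sample from q_phi
x_recon = tf.layers.dense(h, 10)              # decoder p_theta(x | h) mean

# Regularization: closed-form KL(q_phi(h|x) || N(0, I)) for diagonal Gaussians.
kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=1)
# Reconstruction: a Gaussian likelihood gives a squared-error term.
recon = tf.reduce_sum(tf.square(x - x_recon), axis=1)
# Minimizing recon + kl maximizes the variational lower bound.
loss = tf.reduce_mean(recon + kl)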

20 Questions?

References

[1] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. arXiv:1511.06464v4, 2016.
[2] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv:1410.5401v2, 2014.
[3] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735--1780, 1997.
