Deep Learning Financial Market Data
Steven Hutt, [email protected]
12 December, 2016

The Limit Order Book
A financial exchange provides a central location for electronically trading contracts. Trading is implemented on a Matching Engine, which matches buy and sell orders according to certain priority rules and queues orders which do not immediately match. A Marketable limit order is immediately matched and results in a trade. A Passive limit order cannot be matched immediately and joins the queue.
Data In: client-ID-tagged order flow. Data Out: LOB state (quantity at price).
We need to learn data sequences. [Figure © A. Booth]
1 Representing the Limit Order Book
At a given point in time, the LOB may be represented by:
Level  Bid Px  Bid Sz  Ask Px  Ask Sz
  1     23.0      6     24.0      2
  2     22.5     13     24.5      5
  3     22.0     76     25.0     17
  4     21.5    132     25.5     14
  5     21.0     88     26.0      8
          5-Level Limit Order Book

Bid Side: The quantity of orders at each price level at or below the best bid.
Ask Side: The quantity of orders at each price level at or above the best ask.

A possible representation is x_t = a_t ⊔ b_t, where
  a_t = (88, 132, 76, ..., 14, 8)   (10 elements: bid sizes from the deepest level up, then ask sizes)
  b_t = (23.0, 24.0)                (best bid and ask prices).
The evolution of the LOB over time is represented by a sequence:
(x_1, x_2, x_3, ..., x_n, ...)
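As a concrete sketch of this representation (using numpy; the ordering of a_t follows the table above, and is otherwise an assumption), the state vector x_t can be assembled as:

```python
import numpy as np

# 5-level book from the table above: one array per column
bid_px = np.array([23.0, 22.5, 22.0, 21.5, 21.0])
bid_sz = np.array([6, 13, 76, 132, 88], dtype=float)
ask_px = np.array([24.0, 24.5, 25.0, 25.5, 26.0])
ask_sz = np.array([2, 5, 17, 14, 8], dtype=float)

# a_t: sizes from the deepest bid up through the deepest ask (10 elements)
a_t = np.concatenate([bid_sz[::-1], ask_sz])
# b_t: best bid and ask prices
b_t = np.array([bid_px[0], ask_px[0]])
# x_t = a_t ⊔ b_t (concatenation)
x_t = np.concatenate([a_t, b_t])
```

Stacking the x_t over time then yields the sequence (x_1, x_2, ...) fed to the sequence learner.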
2 LOB Pattern Search for Prohibited Trading Patterns
Motivation: Regulators identify prohibited patterns of trading activity detrimental to orderly markets. Financial exchanges are responsible for maintaining orderly markets (e.g. the Flash Crash and the 'Hound of Hounslow').
Challenge: Identify prohibited trading patterns quickly and efficiently.
Goal: Build a trading pattern search function using Deep Learning. Given a sample trading pattern, identify similar patterns in historical LOB data.
3 Learning Sequences with RNNs
In a Recurrent Neural Network there are nodes which feed back on themselves. Below, x_t ∈ R^{n_x}, h_t ∈ R^{n_h} and z_t ∈ R^{n_z}, for t = 1, 2, ..., T. Each node is a vector (1-dim array).
[Diagram: RNN unrolled over time — inputs x_1, ..., x_T feed states h_0, h_1, ..., h_T via W_hx; states feed forward via W_hh; outputs z_1, ..., z_T are read off via W_zh.]
The RNN is defined by update equations:
h_{t+1} = f(W_hh h_t + W_hx x_t + b_h),    z_t = g(W_zh h_t + b_z).
where W_uv ∈ R^{n_u × n_v} and b_u ∈ R^{n_u} are the RNN weights and biases respectively. The functions f and g are generally nonlinear.
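A minimal numpy sketch of these update equations (the sizes, the choice f = tanh, and taking g to be affine are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_z, T = 4, 8, 3, 10

# weights W_uv ∈ R^{n_u × n_v} and biases b_u ∈ R^{n_u}
W_hh = rng.normal(scale=0.1, size=(n_h, n_h))
W_hx = rng.normal(scale=0.1, size=(n_h, n_x))
W_zh = rng.normal(scale=0.1, size=(n_z, n_h))
b_h = np.zeros(n_h)
b_z = np.zeros(n_z)

def rnn_forward(xs, h0):
    """Apply the update equations step by step: the new state is
    f(W_hh h + W_hx x + b_h), and each output is read off the state."""
    h, zs = h0, []
    for x in xs:
        h = np.tanh(W_hh @ h + W_hx @ x + b_h)
        zs.append(W_zh @ h + b_z)
    return np.array(zs), h

xs = rng.normal(size=(T, n_x))
zs, hT = rnn_forward(xs, np.zeros(n_h))
```

This is exactly the mapping F_θ from an input sequence to an output sequence described next.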
Given parameters θ = {W_hh, W_hx, W_zh, b_h, b_z}, an RNN defines a mapping of sequences
F_θ : X = (x_1, x_2, ..., x_T) ↦ Z = (z_1, z_2, ..., z_T).

4 Supervised Training for RNNs
Supervised Learning: Given sample input sequences {X^(1), X^(2), ..., X^(K)} with corresponding outputs {Z^(1), Z^(2), ..., Z^(K)}, choose the RNN parameters θ so as to minimize

L(θ) = ∑_{k=1}^K L_k(θ),   where   L_k(θ) = ∑_{t=1}^T L(F_θ(x_t^(k)), z_t^(k))

and L is some loss function, e.g. least squares.
Gradient Descent: Fixing some time t above, we have
∂L_t/∂θ = ∑_{i=1}^t (∂L_t/∂x_t)(∂x_t/∂x_i)(∂x_i/∂θ),   where   ∂x_t/∂x_i = ∏_{t ≥ j > i} ∂x_j/∂x_{j−1}.
But this final product leads to vanishing / exploding gradients! Vanishing gradients limit the RNN's capacity to learn from the distant past; exploding gradients destroy the gradient descent algorithm.
Problem: How do we allow information in early states to influence later states? Note that trading patterns can extend over hundreds of time steps.
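The mechanism is easy to see numerically. A toy numpy sketch (the matrices and the horizon of 100 steps are illustrative): multiplying t − i Jacobians, each of spectral norm s, makes the product's norm behave like s^(t−i), which vanishes for s < 1 and explodes for s > 1:

```python
import numpy as np

def product_norm(s, steps):
    """Spectral norm of a product of `steps` Jacobians, each with norm s."""
    J = s * np.eye(2)   # toy Jacobian with spectral norm exactly s
    P = np.eye(2)
    for _ in range(steps):
        P = J @ P
    return np.linalg.norm(P, 2)

vanish = product_norm(0.9, 100)   # ~ 0.9**100, about 3e-5
explode = product_norm(1.1, 100)  # ~ 1.1**100, about 1e4
```

Over the hundreds of time steps a trading pattern can span, either regime makes plain gradient descent unworkable.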
5 Long Short Term Memory Units
We would like to create paths through time along which the gradient neither vanishes nor explodes. Sometimes we also need to forget information once it has been used. An LSTM [3] learns long-term dependencies by adding a memory cell and controlling what gets modified and what gets forgotten:
[Diagram: LSTM cell — the memory c_t flows to c_{t+1} through a forget gate (σ, ∗) and a modify step (σ, tanh, +); an output gate (σ, tanh) produces the next state h_{t+1} from the memory, the previous state h_t and the input x_t.]
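The gating can be written out in a few lines. This is a generic LSTM step in numpy; the single stacked weight matrix W and the gate ordering are implementation conventions assumed here, not taken from the slides:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(x, h, c, W, b):
    """One LSTM step: W maps [h; x] to four gate pre-activations."""
    z = W @ np.concatenate([h, x]) + b
    n = h.size
    f = sigmoid(z[:n])            # forget gate: what the memory keeps
    i = sigmoid(z[n:2*n])         # input gate: what gets modified
    o = sigmoid(z[2*n:3*n])       # output gate: what the state exposes
    g = np.tanh(z[3*n:])          # candidate memory content
    c_next = f * c + i * g        # forget, then modify
    h_next = o * np.tanh(c_next)  # expose part of the memory as state
    return h_next, c_next

rng = np.random.default_rng(1)
n_h, n_x = 4, 3
W = rng.normal(scale=0.1, size=(4 * n_h, n_h + n_x))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(rng.normal(size=n_x), h, c, W, b)
```

The additive update c_next = f ∗ c + i ∗ g is the path through time along which gradients can flow without repeatedly passing through W_hh.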
An LSTM can remember state dependencies further back than a standard RNN and has become one of the standard approaches to learning sequences.

6 Unitary RNNs
The behaviour of the LSTM can be difficult to analyse. An alternative approach [1] is to ensure there are no contractive directions, that is, to require W_hh ∈ U(n), the space of unitary matrices. This preserves the simple form of an RNN but ensures that state dependencies reach further back in the sequence. A key problem is that gradient descent does not preserve the unitary property:

W ∈ U(n) ↦ W + λv ∉ U(n)
The solution in [1] is to choose W_hh from a subset generated by parametrized unitary matrices:

W_hh = D_3 R_2 F^{−1} D_2 Π R_1 F D_1.
Here each D is diagonal, each R is a reflection, F is the Fourier transform and Π is a permutation.
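The factorization can be checked numerically. A numpy sketch (assuming, in the spirit of [1], diagonal matrices with unit-modulus entries and Householder reflections; the dimension is illustrative) builds each factor and confirms the product is unitary:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8

def diag_phase():
    # diagonal matrix of unit-modulus entries exp(iθ)
    return np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, n)))

def reflection():
    # complex Householder reflection I - 2 v v^H / |v|^2 (unitary)
    v = rng.normal(size=n) + 1j * rng.normal(size=n)
    return np.eye(n) - 2 * np.outer(v, v.conj()) / np.vdot(v, v).real

F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix
Pi = np.eye(n)[rng.permutation(n)]       # permutation matrix

D1, D2, D3 = diag_phase(), diag_phase(), diag_phase()
R1, R2 = reflection(), reflection()
# W_hh = D3 R2 F^{-1} D2 Π R1 F D1; F^{-1} = F^H for the unitary DFT
W_hh = D3 @ R2 @ F.conj().T @ D2 @ Pi @ R1 @ F @ D1
```

Since each factor is unitary, so is the product, and gradient descent over the parameters (phases, reflection vectors) stays inside this parametrized family.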
7 RNNs as Programs addressing External Memory
RNNs are Turing complete, but their theoretical capabilities are not matched in practice due to training inefficiencies. The Neural Turing Machine emulates a differentiable Turing Machine by adding external addressable memory and using the RNN state as a memory controller [2].

[Diagram: vanilla NTM architecture — input x_t and output y_t pass through an RNN controller, which drives read and write heads over an external memory m_l.]

NTM equivalent 'code':

initialise: move head to start location
while input delimiter not seen do
    receive input vector
    write input to head location
    increment head location by 1
end while
return head to start location
while true do
    read output vector from head location
    emit output
    increment head location by 1
end while
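What makes the machine differentiable is soft addressing: heads read and write everywhere at once, weighted by attention. A numpy sketch of content-based addressing (the toy memory and the key-strength parameter beta are illustrative, following [2] in spirit):

```python
import numpy as np

def content_address(memory, key, beta):
    """Cosine similarity between the key and each memory row,
    sharpened by beta and normalized to attention weights."""
    sim = memory @ key / (np.linalg.norm(memory, axis=1)
                          * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sim)
    return w / w.sum()

def ntm_read(memory, w):
    # differentiable read: weighted sum of memory rows
    return w @ memory

memory = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = content_address(memory, np.array([1.0, 0.0]), beta=10.0)
r = ntm_read(memory, w)
```

Because reads and writes are weighted sums, gradients flow through the memory access and the whole machine can be trained end to end.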
8 Deep RNNs
[Diagram: deep RNN unrolled over time — the input x_1, ..., x_T feeds layer L1 states h1_0, ..., h1_T via W_h1x; each layer Li has recurrent weights W_hihi and feeds its state up to layer Li+1; the top layer L3 produces outputs z_1, ..., z_T via W_zh3.]
The input to hidden layer L_i is the state h^{i−1} of hidden layer L_{i−1}. Deep RNNs can learn at different time scales.
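The stacking rule can be sketched in numpy (sizes illustrative): at each time step, the input to layer i is the state just computed by layer i − 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n_x, n_h, L, T = 3, 5, 3, 7

# layer 0 reads the input x_t; layers i > 0 read the state of layer i-1
W_in = [rng.normal(scale=0.1, size=(n_h, n_x if i == 0 else n_h))
        for i in range(L)]
W_rec = [rng.normal(scale=0.1, size=(n_h, n_h)) for _ in range(L)]

def deep_rnn(xs):
    hs = [np.zeros(n_h) for _ in range(L)]
    for x in xs:
        inp = x
        for i in range(L):
            hs[i] = np.tanh(W_rec[i] @ hs[i] + W_in[i] @ inp)
            inp = hs[i]   # state of layer i becomes input to layer i+1
    return hs[-1]

top = deep_rnn(rng.normal(size=(T, n_x)))
```

Lower layers track fast dynamics in x_t directly, while higher layers see only the already-smoothed states below them, which is one intuition for learning at different time scales.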
9 Unsupervised Learning: Autoencoders
The more regularities there are in data the more it can be compressed. The more we are able to compress the data, the more we have learned about the data. (Grünwald, 1998)
[Diagram: feedforward autoencoder — input x_1, ..., x_6 is encoded through hidden layers h1 and h2 (the code) via W_h1x and W_h2h1, then decoded through h3 via W_h3h2 and W_yh3 to the reconstruction y_1, ..., y_6: input → encode → code → decode → reconstruct.]
10 Recurrent Autoencoders
[Diagram: recurrent autoencoder — the encoder RNN folds x_1, ..., x_T into states h^e_0, ..., h^e_T; the final encoder state h^e_T is copied to the decoder state h^d_0, and the decoder RNN emits the reconstruction z_1, ..., z_T, feeding z_0, ..., z_{T−1} back as its inputs.]
A Recurrent Autoencoder is trained to output a reconstruction z of the input sequence x. All data must pass through the narrow h^e_T. Compression between input and output forces pattern learning.
11 Recurrent Autoencoder Updates
[Diagram: recurrent autoencoder, as on the previous slide.]
Encoder updates:
  h^e_1 = f(W^e_hh h^e_0 + W^e_hx x_1 + b^e_h)
  h^e_2 = f(W^e_hh h^e_1 + W^e_hx x_2 + b^e_h)
  h^e_3 = f(W^e_hh h^e_2 + W^e_hx x_3 + b^e_h)
  ...
Copy: h^d_0 = h^e_T
Decoder updates:
  h^d_1 = f(W^d_hh h^d_0 + W^d_hx z_0 + b^d_h)
  h^d_2 = f(W^d_hh h^d_1 + W^d_hx z_1 + b^d_h)
  h^d_3 = f(W^d_hh h^d_2 + W^d_hx z_2 + b^d_h)
  ...
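These updates translate directly into numpy (biases are dropped and an affine output map W_out producing z from h^d is an added assumption, since the slides leave the decoder output map implicit):

```python
import numpy as np

rng = np.random.default_rng(4)
n_x, n_h, T = 3, 6, 5
We_hh = rng.normal(scale=0.1, size=(n_h, n_h))   # encoder weights
We_hx = rng.normal(scale=0.1, size=(n_h, n_x))
Wd_hh = rng.normal(scale=0.1, size=(n_h, n_h))   # decoder weights
Wd_hx = rng.normal(scale=0.1, size=(n_h, n_x))
W_out = rng.normal(scale=0.1, size=(n_x, n_h))   # assumed output map

def autoencode(xs):
    # encoder: fold the whole sequence into h^e_T
    h = np.zeros(n_h)
    for x in xs:
        h = np.tanh(We_hh @ h + We_hx @ x)
    # copy: decoder state starts at the encoding, h^d_0 = h^e_T
    zs, z = [], np.zeros(n_x)   # z_0 = 0 before the first emission
    for _ in range(len(xs)):
        h = np.tanh(Wd_hh @ h + Wd_hx @ z)
        z = W_out @ h
        zs.append(z)
    return np.array(zs), h

xs = rng.normal(size=(T, n_x))
zs, _ = autoencode(xs)
```

Training minimizes the reconstruction error between zs and xs, forcing all information through the single vector h^e_T.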
12 Deep Recurrent Autoencoder
[Diagram: deep recurrent autoencoder — three stacked encoder RNNs (encoder_0, encoder_1, encoder_2) fold x_1, ..., x_T into the top encoder state h^e_3T, which initialises three stacked decoder RNNs (decoder_0, decoder_1, decoder_2) emitting the reconstruction z_1, ..., z_T from inputs z_0, ..., z_{T−1}.]
A Deep Recurrent Autoencoder allows learning patterns on different time scales. This is important for trading patterns, which may occur over a few or over many time steps.
13 TensorFlow Deep Recurrent AutoEncoder
with tf.variable_scope('encoder_0') as scope:
    for t in range(1, seq_len):
        # placeholder for input data
        xs_0_enc[t] = tf.placeholder(tf.float32, shape=[None, nx_enc])
        hs_0_enc[t] = lstm_cell_l0_enc(xs_0_enc[t], hs_0_enc[t-1])
        scope.reuse_variables()

with tf.variable_scope('encoder_1') as scope:
    for t in range(1, seq_len):
        # encoder_1 input is encoder_0 hidden state
        xs_1_enc[t] = hs_0_enc[t]
        hs_1_enc[t] = lstm_cell_l1_enc(xs_1_enc[t], hs_1_enc[t-1])
        scope.reuse_variables()

with tf.variable_scope('encoder_2') as scope:
    for t in range(1, seq_len):
        # encoder_2 input is encoder_1 hidden state
        xs_2_enc[t] = hs_1_enc[t]
        hs_2_enc[t] = lstm_cell_l2_enc(xs_2_enc[t], hs_2_enc[t-1])
        scope.reuse_variables()
14 Pattern Search
Goal: Build a trading pattern search function using Deep Learning. Given a sample trading pattern, identify similar patterns in historical LOB data.
Methodology: Train the deep recurrent autoencoder on sequences of LOB data to define a mapping from LOB sequences to fixed-length vector encodings:
X = {x_1, x_2, ..., x_T}  --encode-->  X_enc = h^e_3T
For all historical LOB data {X^(1), X^(2), ..., X^(K)}, compute {X^(1)_enc, X^(2)_enc, ..., X^(K)_enc}.
Pattern Search: Define dist(X, Y) = d(X_enc, Y_enc), where d is some metric on R^n with n = dim(h^e_3T).
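A sketch of the resulting search loop (with a random, untrained single-layer stand-in for the encoder, and Euclidean distance as the assumed choice of d):

```python
import numpy as np

def encode(X, W_hh, W_hx):
    # stand-in for the trained encoder: fold sequence X into a fixed vector
    h = np.zeros(W_hh.shape[0])
    for x in X:
        h = np.tanh(W_hh @ h + W_hx @ x)
    return h

def search(target, history, W_hh, W_hx):
    """Rank historical sequences by distance between their encodings."""
    t_enc = encode(target, W_hh, W_hx)
    dists = [np.linalg.norm(t_enc - encode(X, W_hh, W_hx)) for X in history]
    return int(np.argmin(dists))

rng = np.random.default_rng(5)
W_hh = rng.normal(scale=0.1, size=(6, 6))
W_hx = rng.normal(scale=0.1, size=(6, 3))
target = rng.normal(size=(4, 3))
history = [rng.normal(size=(4, 3)) for _ in range(10)] + [target.copy()]
best = search(target, history, W_hh, W_hx)   # index of the nearest sequence
```

Since encodings are fixed-length vectors, the historical side can be precomputed once and the search reduces to nearest-neighbour lookup in R^n.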
15 Sample Search Results (Small Data Set)
[Figures: LHS shows the search target, RHS the search result. Observed effects: time displacement and a large-quantity bias.]
16 Generative Models I
It is useful to learn the distribution of LOB sequences and be able to generate samples from this distribution:
Backtesting: Generate large-scale sample sets for systems testing.
Transfer Learning: Augment historical data for illiquid contracts.
Latent Variables: Learn driving variables for LOB markets.
The recurrent autoencoder above does not learn data distributions and cannot generate samples of LOB sequences. Instead we need:
Generative Models: generate examples like those already in the data.
17 Generative Models II
We assume a probabilistic model for the data x in the form

p(x) = ∫ p_θ(x | h) p(h) dh
for latent variables h, and seek to choose parameters θ which maximize p(x) over the data.
General Approach: The distribution p(h) can be simple (e.g. Gaussian) so long as p_θ(x | h) is universal (e.g. a neural network).
How to maximize p(x)? Sample h_k ∼ p(h) and set p(x) ≃ (1/n) ∑_k p_θ(x | h_k).
But: For most h, p_θ(x | h) will be close to zero, so convergence is poor.
Better: Sample h from p_θ(h | x) and set p(x) = E_{h ∼ p_θ(h|x)}[p_θ(x | h)].
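The poor convergence of the naive estimate can be seen in a toy numpy model (the one-dimensional Gaussian model, x_obs and sigma are illustrative assumptions): with a narrow likelihood, almost no prior samples contribute to the average.

```python
import numpy as np

rng = np.random.default_rng(6)
sigma = 0.1    # narrow likelihood: most prior samples give ~0 density
x_obs = 2.0

def gauss_pdf(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

# naive estimate: sample h_k ~ p(h) = N(0,1), average p(x|h_k)
hs = rng.normal(size=100_000)
p_naive = gauss_pdf(x_obs, hs, sigma).mean()

# exact marginal for this toy model: x ~ N(0, 1 + sigma^2)
p_true = gauss_pdf(x_obs, 0.0, np.sqrt(1 + sigma**2))

# fraction of prior samples with non-negligible likelihood
frac_useful = (gauss_pdf(x_obs, hs, sigma) > 1e-6).mean()
```

Even in one dimension only a few percent of prior samples matter; in high dimensions effectively none do, which motivates sampling h near the posterior instead.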
18 Variational Autoencoder
We assume the market data {xk} is sampled from a probability distribution p(x) with a 'small' number of latent variables h.
Assume p_θ(x, h) = p_θ(x | h) p_θ(h), where θ are the weights of an NN/RNN. Then

p_θ(x) = ∑_h p_θ(x | h) p_θ(h).

Train the network to minimize

L_θ({x_k}) = min_θ ∑_k −log p_θ(x_k).

[Diagram: latent variable model — prior p_θ(h), likelihood p_θ(x | h), posterior p_θ(h | x), marginal p_θ(x).]

Compute p_θ(h | x) and cluster. But p_θ(h | x) is intractable!
19 Variational Inference
Variational Inference learns an NN/RNN approximation qϕ(h | x) to pθ(h | x) during training.
Think of q_ϕ as the encoder and p_θ as the decoder network: an autoencoder.
Then log p_θ(x_k) ≥ L_{θ,ϕ}(x_k), where

L_{θ,ϕ}(x_k) = −D_KL(q_ϕ(h | x_k) || p_θ(h)) + E_{q_ϕ}[log p_θ(x_k | h)]
                  (regularization term)         (reconstruction term)

is the variational lower bound.
Reconstruction term = log-likelihood of p_θ(x | h) under the approximation q_ϕ(h | x).
Regularization term = targets the prior on the encoder.

[Diagram: as before — prior p_θ(h), likelihood p_θ(x | h), approximate posterior q_ϕ(h | x), marginal p_θ(x).]
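When q_ϕ(h | x) is a diagonal Gaussian N(μ, diag(σ²)) and the prior p_θ(h) = N(0, I) — standard modelling choices assumed here, not taken from the slides — the regularization term has a closed form, sketched in numpy:

```python
import numpy as np

def kl_to_std_normal(mu, logvar):
    """D_KL(N(mu, diag(exp(logvar))) || N(0, I)) in closed form,
    summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

# when the encoder output matches the prior, the KL penalty vanishes
kl_zero = kl_to_std_normal(np.zeros(4), np.zeros(4))
# a shifted encoder output is penalized
kl_pos = kl_to_std_normal(np.ones(4), np.zeros(4))
```

Training then maximizes the reconstruction term while this penalty keeps the encoder distribution close to the prior, which is what makes sampling from p_θ(h) generate plausible data.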
20 Questions?
References
[1] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. arXiv:1511.06464v4, 2016.
[2] A. Graves, G. Wayne, and I. Danihelka. Neural Turing machines. arXiv:1410.5401v2, 2014.
[3] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.