Model-based Deep for Dynamic Portfolio Optimization

Pengqian Yu * 1 Joon Sern Lee * 1 Ilya Kulyatin 1 Zekun Shi 1 Sakyasingha Dasgupta 1

Abstract Dynamic portfolio optimization remains one of the most Dynamic portfolio optimization is the process of challenging problems in the field of finance (Markowitz, sequentially allocating wealth to a collection of as- 1959; Haugen & Haugen, 1990). It is a sequential decision- sets in some consecutive trading periods, based on making process of continuously reallocating funds into a investors’ return-risk profile. Automating this pro- number of different financial investment products, with the cess with remains a challenging main aim to maximize return while constraining risk. Clas- problem. Here, we design a deep reinforcement sical approaches to this problem include dynamic program- learning (RL) architecture with an autonomous ming and convex optimization, which require discrete ac- trading agent such that, investment decisions and tions and thus suffer from the ‘curse of dimensionality’ (e.g., actions are made periodically, based on a global (Cover, 1991; Li & Hoi, 2014; Feng et al., 2015)). objective, with autonomy. In particular, with- There have been efforts made to apply RL techniques to al- out relying on a purely model-free RL agent, we leviate the dimensionality issue in the portfolio optimization train our trading agent using a novel RL architec- problem (Moody & Saffell, 2001; Dempster & Leemans, ture consisting of an infused prediction module 2006; Cumming et al., 2015; Jiang et al., 2017; Deng et al., (IPM), a generative adversarial 2017; Guo et al., 2018; Liang et al., 2018). The main idea is module (DAM) and a behavior cloning module to train an RL agent that is rewarded if its investment deci- (BCM). Our model-based approach works with sions increase the logarithmic rate of return and is penalised both on-policy or off-policy RL algorithms. We otherwise. However, these RL algorithms have several draw- further design the back-testing and execution en- backs. In particular, the approaches in (Moody & Saffell, gine which interact with the RL agent in real time. 2001; Dempster & Leemans, 2006; Cumming et al., 2015; Using historical real financial market data, we Deng et al., 2017) only yield discrete single-asset trading simulate trading with practical constraints, and signals. The multi-assets setting was studied in (Guo et al., demonstrate that our proposed model is robust, 2018), however, the authors did not take transaction costs profitable and risk-sensitive, as compared to base- into consideration, thus limiting their practical usage. In re- line trading strategies and model-free RL agents cent study (Jiang et al., 2017; Liang et al., 2018), transaction from prior work. costs were considered but it did not address the challenge of having insufficient data in financial markets for the train- ing of robust machine learning algorithms. Moreover, the 1. Introduction methods proposed in (Jiang et al., 2017; Liang et al., 2018) directly apply a model-free RL algorithm that is sample arXiv:1901.08740v1 [cs.LG] 25 Jan 2019 Reinforcement learning (RL) consists of an agent interact- ing with the environment, in order to learn an optimal policy inefficient and also doesn’t account for the stability and risk by trial and error for sequential decision-making problems issues caused by non-stationary financial market environ- (Bertsekas, 2005; Sutton & Barto, 2018). The past decade ment. In this paper, we propose a novel model-based RL has witnessed the tremendous success of deep reinforce- approach, that takes into account practical trading restric- ment learning (RL) in the fields of gaming, robotics and tions such as transaction costs and order executions, to stably recommendation systems (Lillicrap et al., 2015; Silver et al., train an autonomous agent whose investment decisions are 2016; Mnih et al., 2015; 2016). However, its applications in risk-averse yet profitable. the financial domain have not been explored as thoroughly. We highlight our three main contributions to realize a model- based RL algorithm for our problem setting. Our first contri- *Equal contribution 1Neuri PTE LTD, One George Street, #22- 01, Singapore 049145. Correspondence to: Sakyasingha Dasgupta bution is an infused prediction module (IPM), which incor- . porates the prediction of expected future observations into state-of-the-art RL algorithms. Our idea is inspired by some Pre-print paper. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization attempts to merge prediction methods with RL. For exam- ventional RL algorithms and to solve ple, RL has been successful in predicting the behavior of complex tasks. This technique is similar in spirit to the simple gaming environments (Oh et al., 2015). In addition, work in (Nair et al., 2018). The difference is that we create prediction based models have also been shown to improve the expert behavior based on a one-step greedy strategy by the performance of RL agents in distributing energy over solving an optimization problem that maximizes immediate a smart power grid (Marinescu et al., 2017). In this paper, rewards in the current time step. Additionally, we only up- we explore two prediction models; a nonlinear dynamic date the actor with respect to its auxiliary behavior cloning Boltzmann machine (Dasgupta & Osogami, 2017) and a loss in an actor-critic algorithm setting. We demonstrate that variant of parallel WaveNet (van den Oord et al., 2018). BCM can prevent large changes in portfolio weights and These models make use of historical prices of all assets in thus keep the volatility low, while also increasing returns in the portfolio to predict the future price movements of each some cases. asset, in a codependent manner. These predictions are then To the best of our knowledge, this is the first work that treated as additional features that can be used by the RL leverages the deep RL state-of-art, and further extends it agent to improve its performance. Our experimental results to a model-based setting and integrate it into the financial show that using IPM provides significant performance im- domain. Even though our proposed approach has been rig- provements over baseline RL algorithms in terms of Sharpe orously tested on the off-policy RL algorithm (in particular, ratio (Sharpe, 1966), Sortino ratio (Sortino & Price, 1994), the deep deterministic policy gradients (DDPG) algorithm maximum drawdown (MDD, see (Chekhlov et al., 2005)), (Lillicrap et al., 2015)), these concepts can be easily ex- value-at-risk (VaR, see (Artzner et al., 1999)) and condi- tended to on-policy RL algorithms such as proximal policy tional value-at-risk (CVaR, see (Rockafellar et al., 2000)). optimization (PPO) (Schulman et al., 2017) and trust region Our second contribution is a data augmentation module policy optimization (Schulman et al., 2015) algorithms. We (DAM), which makes use of a generative adversarial net- showcase the overall algorithm for model-based PPO for work (GAN, e.g., (Goodfellow et al., 2014)) to generate portfolio management and the corresponding results in the synthetic market data. This module is well motivated by supplementary material. Additionally, we also provide algo- the fact that financial markets have limited data. To illus- rithms for differential risk sensitive deep RL for portfolio trate this, consider the case where new portfolio weights are optimization in the supplementary material. For the rest of decided by the agent on a daily basis. In such a scenario, the main paper, our discussion will be centered around how which may not be uncommon, the size of the training set our three contributions can improve the performance of the for a particular asset over the past 10 years is only around off-policy DDPG algorithm. 2530, due to the fact that there are only about 253 trading This paper is organized as follows: In Section2, we re- days a year. Clearly, this is an extremely small dataset that view the deep RL literature, and formulate the portfolio may not be sufficient for training a robust RL agent. To optimization problem as a deep RL problem. We describe overcome this difficulty, we train a recurrent GAN (Esteban the structure of our automatic trading system in Section3. et al., 2017) using historical asset data to produce realistic Specifically, we provide details of the infused prediction multi-dimensional time series. Different from the objective module, data augmentation module and behavior cloning function in (Goodfellow et al., 2014), we explicitly consider module in Section 3.2 to Section 3.4. In Section4, we report the maximum mean discrepancy (MMD, see (Gretton et al., numerical experiments that serve to illustrate the effective- 2007)) in the generator loss which further minimizes the ness of methods described in this paper. We conclude in distribution mismatch between real and generated data dis- Section5. tributions. We show that DAM helps to reduce over-fitting and typically leads to a portfolio with less volatility. 2. Preliminaries and Problem Setup Our third contribution is a behavior cloning module (BCM), which provides one-step greedy expert demonstration to In this section, we briefly review the literature of deep rein- the RL agent. Our idea comes from the imitation learning forcement learning and introduce the mathematical formula- paradigm (also called learning from demonstrations), with tion of the dynamic portfolio optimization problem. its most common form being behavior cloning, which learns A Markov Decision Process (MDP) is defined as a 6-tuple a policy through supervision provided by expert state-action hT, γ, S, A, P, ri. Here, T is the (possibly infinite) decision pairs. In particular, the agent receives examples of behavior horizon; γ ∈ (0, 1] is the discount factor; S = S S is from an expert and attempts to solve a task by mimicking t t the state space and A = S A is the action space, both the expert’s behavior, e.g., (Bain & Sommut, 1999; Abbeel t t assumed to be finite dimensional and continuous; P : S × & Ng, 2004; Ross et al., 2011). In RL, an agent attempts A × S → [0, 1] is the transition kernel and r : S × A → to maximize expected reward through interaction with the R is the reward function. Policy is a mapping µ : S → A, environment. Our proposed BCM combines aspects of con- Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

th specifying the action to choose in a particular state. At each the price relative vector of the t trading period as ut , > time step t ∈ {1,...,T }, the agent in state st ∈ St takes pt pt−1 = (1, p1,t/p1,t−1, . . . , pm,t/pm,t−1) where an action at = µ(st) ∈ At, receives the reward rt and denotes the element-wise division. In addition, we let transits to the next state st+1 according to P. The agent’s hi,t , (pi,t − pi,t−1)/pi,t−1 denote the percentage change objective is to maximize its expected return given the start of closing price at time t for asset i, the space associated µ PT t−1 m k1 distribution, J , Est∼P,at∼µ[ t=1 γ rt]. The state- with its vector form h:,t (hi,:) as H:,t ⊂ R (Hi,: ⊂ R ) action value function, or the Q value function, is defined as where k1 is the time embedding of prediction model. We de- µ PT (i−t−1) Q (st, at) , Esi>t∼P,ai>t∼µ[ i=t γ ri|st, at]. fine wt−1 as the portfolio weight vector at the beginning of trading period t where its ith element w represents the Deep deterministic policy gradient (DDPG) algorithm (Lil- i,t−1 proportion of asset i in the portfolio after capital realloca- licrap et al., 2015) is an off-policy model-free reinforce- tion and Pm w = 1 for all t. We initialize our portfolio ment learning algorithm for continuous control which utilize i=0 i,t with w = (1, 0,..., 0)>. Due to price movements in the large function approximators such as deep neural networks. 0 market, at the end of the same period, the weights evolve DDPG is an actor-critic method, which bridges the gap 0 according to w = (u w )/(u · w ), where is between policy gradient methods and value function ap- t t t−1 t t−1 the element-wise multiplication. Our goal at the end of proximation methods for RL. Intuitively, DDPG learns a 0 period t is to reallocate portfolio vector from w to w by state-action value function (critic) by minimizing the Bell- t t selling and buying relevant assets. Paying all commission man error, while simultaneously learning a policy (actor) by fees, this reallocation action shrinks the portfolio value by directly maximizing the estimated state-action value func- 0 a factor c¯ c Pm |w − w | where c is the transac- tion with respect to the network parameters. t , i=1 i,t i,t tion fees for purchasing and selling. In particular, we let In particular, DDPG maintains an actor function µ(s) with ρt−1 denote the portfolio value at the beginning of period µ Q 0 0 parameters θ , a critic function Q(s, a) with parameters θ , t and ρt at the end. We then have ρt =c ¯tρt. The imme- and a replay buffer R as a set of tuples (st, at, rt, st+1) for diate reward is the logarithmic rate of return defined by 0 each experienced transition. DDPG alternates between run- rt , ln(ρt/ρt−1) = ln(¯ctρt/ρt−1) = ln(¯ctut · wt−1). ning the policy to collect experiences (i.e. training roll-outs) We define the normalized close price matrix at time t by and updating the parameters. In our implementation, train- P [p p |p p | · · · |p p |1] ing roll-outs were conducted with noise added to the policy t , t−k2+1 t t−k2+2 t t−1 t where 1 (1, 1,..., 1)> and k is the time embed- network’s parameter space to encourage exploration (Plap- , 2 ding. The normalized high price matrix Ph is defined by pert et al., 2017). During each training step, DDPG samples t Ph [ph p |ph p | · · · |ph p |ph p ], t , t−k2+1 t t−k2+2 t t−1 t t t a minibatch consisting of N tuples from R to update the l and low price matrix Pt can be defined similarly. We fur- actor and critic networks. DDPG minimizes the follow- h l Q PN ther define the price tensor as Yt , [Pt Pt Pt]. Our ing loss L w.r.t. θ to update the critic, L = i=1(ri + Q 2 objective is to design a RL agent that observes the state γQ(si+1, µ(si+1)) − Q(si, ai|θ )) /N. The actor parame- µ µ st (Yt, wt−1) and takes a sequence of actions (port- ters θ are updated using the policy gradient ∇θµ J = , PN Q µ folio weights) over the time at = wt such that the final ∇aQ(s, a|θ )| ∇θµ µ(s|θ )|s=s /N. To i=1 s=si,a=µ(si) i portfolio value ρ = ρ [exp(PT γt−1r )] is stabilize learning, the Q value function is usually computed T 0Est∼P,at∼µ t=1 t using a separate network (called the target network) whose maximized. weights are an exponential average over time of the critic network. This results in smoother target values. 3. System Design and Data Financial portfolio management is the process of constant In this section, we discuss the detailed design of our pro- redistribution of available funds to a set of financial assets. posed RL based automatic trading system. Our goal is to create a dynamic portfolio allocation sys- tem that periodically generates investment decisions and then act on these decisions autonomously. Following (Jiang AE et al., 2017), we consider a portfolio of m + 1 assets, in- DAM cluding m risky assets and 1 risk-free asset (e.g., broker IPM BCM DH Agent cash balance or U.S. treasury bond). We introduce such Data Fetcher notation: given a matrix g, we denote the ith row of g by th gi,:, and the j column by g:,j. We denote the closing, Market h high and low price vectors of trading period t as pt, pt and pl where p is the closing price of the ith asset in the tth t i,t MSE period. In this paper, we choose the first asset to be risk- Broker Trade Execution Engine h l free cash, i.e., p0,t = p0,t = p0,t = 1. We further define Figure 1. Trading framework. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

The trading framework referenced in this paper is repre- applied. sented in Figure1 and is a modular system composed of a Performance monitoring is based on a set of evaluation data handler (DH), an algorithm engine (AE) and a market measures such as Sharpe and Sortino ratios, value-at-risk simulation engine (MSE). The DH retrieves market data (VaR), conditional value-at-risk (CVaR), maximum draw- and deals with the required data transformations. It is de- down (MDD), annualized volatility and annualized returns. signed for continuous data ingestion in order to provide Let Y denote a bounded random variable. The Sharpe ratio the AE with the required set of information. The AE is a of Y is defined as SR [Y ]/pvar[Y ]. Sharpe ratio, rep- collection of models containing RL agents and environment , E resenting the reward per unit of risk, has been recognized specifications. We refer the readers to Algorithm1 in the not to be desirable since it is a symmetric measure of risk supplementary material for further details. The MSE is an and, hence, penalizes the low-cost events. Sortino ratio, online event-driven module that provides feedback of ex- VaR and CVaR are risk measures which gained popularity ecuted trades, which can eventually be used by the AE to for taking into consideration only the unfavorable part of compute rewards. In addition, it also executes investment the return distribution, or, equivalently, unwanted high cost. decisions made by the AE. The strategy applied in this study Sortino ratio is defined similarly to Sharpe ratio, though is an asset allocation system that rebalances the available replacing the standard deviation pvar[Y ] with the down- capital between the provided set of assets and a cash asset side deviation. The VaR with level α ∈ (0, 1) of Y is the on a daily frequency. α (1 − α)-quantile of Y , and CVaRα at level α is the expected The data used in this paper is a mix of U.S. equities1 on return of Y in the worst (1 − α) of cases. tick level (trade by trade) aggregated to form open-high-low- close (OHLC) bars on an hourly frequency2. 3.1. Network Architecture In the financial domain, it is common to use benchmark In this subsection, we provide a qualitative description of strategies to evaluate the relative profitability and risk pro- our proposed network before delving into the mathematical file of the tested strategies. A common benchmark strategy details in the subsequent subsections. is the constantly rebalanced portfolio (CRP), where at each Agent period the portfolio is rebalanced to the initial wealth dis- Memory tribution among the m + 1 assets including the cash. This strategy can be seen as using the mean-reverting nature of FE FA NI stock prices, as it sells those that gained value while buying Critic Network BCM more of those losing value. In (Cover, 1991) it is shown Actions IPM (desired portfolio FE FA NI that such a strategy is asymptotically the best for stationary allocation) stochastic processes such as stock prices, offering exponen- Actor Network tial returns and is, therefore, an optimistic benchmark to compare against. Transaction fees c have been fixed at a Data Fetcher DAM conservative level of 20 basis points3 and, given the use 4 of market orders, an additional 50 basis points slippage is Figure 2. Agent architecture. 1 We use data from Refinitiv DataScope, with experiments car- Figure2 shows how we integrate the IPM, BCM and DAM ried out on the following U.S. Equities: Costco Wholesale Corpo- ration, Cisco Systems, Ford Motors, Goldman Sachs, American into our network architecture. In particular, the DAM is International Group and Caterpillar. The selection is qualitative used to append generated data to each training episode. The with a balanced mix of volatile and less volatile stocks. An ad- data fetcher fetches data from the augmented data in a step- ditional asset is cash (in U.S. dollars), representing the amount wise fashion. This data is then fed to both the IPM and of capital not invested in any other asset. Furthermore, a generic agent, which have multiple neural networks. market variable (S&P 500 index) has been added as an additional feature. The data is shared in supplementary files, which will be To illustrate our proposed off-policy version of dynamic made available publicly later. 2 portfolio optimization algorithm, we adapt the actor-critic We execute orders hourly, while the agent produces decisions style DDPG algorithm (Lillicrap et al., 2015). In this set- daily. 3One basis point is equivalent to 0.01%. ting, at least two networks (one for the actor and one for the 4Slippage is defined as the relative deviation of the price at critic) are required in the agent as shown in Figure2. Fur- which the orders get executed and the price at which the agent thermore, our implementation utilizes both target networks produced the reallocation action. This conservative level of slip- (Mnih et al., 2015) and parameter noise exploration (Plap- page is due to the lack of equity volume data, as during less liquid pert et al., 2017), which in itself necessitates two additional trading periods the agent market orders might influence the market networks for the actor. Our agent, comprising of six sepa- price and get a worse trade execution than expected in a simulated environment. rate networks (four networks for the actor and two networks for the critic), is described in Algorithm1 of supplementary Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization material Section A. Finally, we have a network improvement (NI) component for each network which specifies how the network is updated. We next discuss how the agent is trained and tested. For Specifically, NI synchronizes the learning rates between the each episode during training, we select an episode of data actor and the critic, which preserves training stability by by selecting a random trading date that satisfies our desired ensuring the actor is updated at a slower rate than the critic episode length. The training data of each episode is then (Bhatnagar et al., 2009). It is also important to note that augmented by synthetic data generated by the DAM. At the actor network’s NI component receives gradients from each time step of the episode, market data, i.e., data cor- the BCM, which makes use of one-step greedy actions to responding to the current time step of interest, fetched via provide supervised updates to the actor network to reduce the data fetcher is used to train the IPM which produces portfolio volatility. predictions of price percentage changes in the next time step. At the same time, the agent’s networks are updated by making use of data sampled from the memory (also re- 3.2. Infused Prediction Module ferred to as the replay buffer) via the prioritized experience Here, for the IPM, we implemented and evaluated two dif- replay (Schaul et al., 2015). Once the agent’s networks are ferent multivariate prediction models, trained in an online updated, an action, i.e., desired portfolio allocation in the manner, differing in their computational complexity. As next time step, can be obtained from the actor network that the most time efficient model, we implemented the nonlin- has been perturbed with the parameter noise. The MSE ear dynamic Boltzmann machine (Dasgupta & Osogami, then executes this noisy action and the agent moves to the 2017) (NDyBM) that predicts the future price of each asset next state. The corresponding state, action, rewards, next conditioned on the history of all assets in the portfolio. As state and computed one-step greedy action, produced by the NDyBM does not require through time, BCM, are stored in the memory. This process is repeated it has a parameter update time complexity of O(1). This for each step in the episode and for all episodes. During the makes it very suitable for fast computation in online time- testing period, the IPM continues to be updated at each time series prediction scenarios. However, NDyBM assumes that step while the agent is frozen in that it is no longer trained the inputs come from an underlying Gaussian distribution. and actions are obtained from the actual actor network, i.e., In order to make IPM more generic, we also implemented the actor network without parameter noise. a novel prediction model using dilated layers As shown in Figure2, actor and critic networks in the agent inspired by the WaveNet architecture (van den Oord et al., consist of feature extraction (FE), feature analysis (FA) and 2018). As there was no significant difference in predictive network improvement (NI) components. The FE layers aim performance noticed between the two models, in the rest of the paper we provide results with the faster NDyBM based to extract features given the current price tensor Yt. In our experiments, we have 8 assets including cash and a time IPM module. However, details of our WaveNet inspired architecture can be seen in supplementary section D. embedding k2 = 10. This essentially means that we have a price tensor of the shape 3 × 8 × 10 if using the channels We use the state-space augmentation technique, and con- first convention. The FE layers can be either LSTM-based struct the augmented state-space Se , St × H:,t+1 × recurrent neural networks (RNN, see (Hochreiter & Schmid- H:,t+1 × H:,t+1: each state is now a pair s˜t , huber, 1997)) or convolutional neural networks (CNN, see (st, xt+1), where st ∈ St is the original state, and (Krizhevsky et al., 2012)). We find that the former typically h l m×3 xt+1 , (h:,t+1, h:,t+1, h:,t+1) ∈ X ⊂ R where yields better performance. Thus, the price tensor is reshaped h l h:,t+1, h:,t+1, h:,t+1 ∈ H:,t+1 is the predicted future close, to a 24 × 10 tensor prior to being fed to the LSTM-based high and low asset percentage price change tensor. FE network. The NDyBM can be seen as an unfolded Gaussian Boltz- The outputs at each time step of the LSTM-based FE net- mann machine for an infinite time horizon i.e. T → ∞ work are concatenated together into a single vector. Next, history, that generalizes a standard vector auto-regressive the previous actions, wt−1, is concatenated to this vector. model with eligibility traces and nonlinear transformation Finally, it is further concatenated with a one-step predicted of historical data (Dasgupta & Osogami, 2017). It repre- price percentage change vector produced by the IPM and a sents the conditional probability density of x[t] given x[:t−1] market index performance indicator (i.e., the price ratio of [t] as, p(x[t]|x[:t−1]) = QN p (x |x[:t−1])5. Where, each a market index such as the S&P 500 index). The resulting j=1 j j factor of the right-hand side denotes the conditional proba- vector is then passed to a series of dense layers (i.e., multi- [t] [:t−1] ), which we refer to as the feature analysis bility density of xj given x for j = 1,...,N. Where, (FA) component. We remark that the network in (Jiang et al., N = m × 3 are the number of units in the NDyBM. Here, 5 [t] 2017) does not have dense layers, which may not account For mathematical convenience, xt and x are used inter- for non-linear relationships across various assets. changeably. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

[t] xj is considered to have a Gaussian distribution for each j: synthetic series of high, low, close (HLC) triplet vector [t] [t]2  x −µ  xt. This approach avoids the non-stationarity in market [t] [:t−1] √ 1 j j pj(xj |x ) = 2 exp − 2σ2 . price dynamics by generating a stationary percentage change 2 π σj j series instead, and the generated HLC triplet is guaranteed to Here, µ , (µj)j=1,...,N is the vector of expected values of the j-th unit at time t given the history up to t − 1 patterns. maintain their relationship (i.e., generated highs are higher It is represented as: than generated lows). [t] Pd−1 [δ] [t−δ] PK [t−1] th µ = b + δ=1 F x + k=1 Gk αk . We assume implicitly that the i assets’ percentage change [t−1] i vector hi,: follows a distribution: hi,: ∼ p (hi,:). Let H Where, b , (bj)j=1,...,N is a bias vector, αk , data [t−1] be the hidden dimension. Our goal is to find a parameterized (α )j=1,...,N are K eligibility trace vectors, d is the time- j,k Gi p (z), z ∈ k1×H delay between connections (i, j) and function ψ such that given a noise prior z R [δ] ˜ the generated distribution pi (Gi (z)) are empirically simi- F (fi,j)(i,j)∈{1,...,N}2 for 0 < δ < d, Gk g ψ , , i (gi,j,k)(i,j)∈{1,...,N}2 for k = 1,...,K are N × N weight lar to the data distribution pdata(hi,:). Formally, given two i (j) b (j) b matrices. The eligibility trace can be updated recursively as, batches of observations {Gψ(z) }j=1, {hi,: }j=1 from [t] [t−1] [t−di,j +1] i i αi,j,k = λk αi,j,k + xi . Here, λk is a fixed decay distributions pg and pdata where b is the batch size, we want k i i rate factor defined for each of the column vectors. Addi- pg ≈ pdata under certain similarity measure of distributions. tionally, the bias parameter vector b, is updated at each time using a RNN layer. This RNN layer computes a nonlinear One suitable choice of such similarity measure is the maxi- feature map of the past time series. Where in, the output mum mean discrepancy (MMD) (Gretton et al., 2012). Fol- lowing (Li et al., 2015), we can show (see supplementary weights from the RNN to the bias layer along with other 2 NDyBM parameters (bias, weight matrices and variance), material) that with RBF kernel, minimizing MMD\b , which are updated online using a stochastic gradient method. is the biased estimator of squared MMD results in matching all moments between the two distribution pa, pb. Following (Dasgupta & Osogami, 2017)6, the NDyBM is trained to predict the next time-step close, high and low As discussed in previous works (Arjovsky et al., 2017; Ar- percentage change for each asset conditioned on the history jovsky & Bottou, 2017), vanilla GANs suffer from the prob- of all other assets, such that the log-likelihood of each time- lem that discriminator becomes perfect when the real and series is maximised. As such NDyBM parameters are up- the generated probabilities have disjoint supports (which is dated at each step t, following the gradient of the conditional often the case under the hypothesis that real-world data lies [t] [t] [−∞,t−1] probability density of x : ∇θ log p(x |x ) = in low dimensional manifolds). This could lead to generator PN [t] [−∞,t−1] gradient vanishing, making the training difficult. Further- i=1 ∇θ log pi(xi |x ). more, the generator can suffer from mode collapse issue In the spirit of providing the agent with additional mar- where it succeeds in tricking the discriminator but the gen- ket signals and removing non-Markovian dynamics in the erated samples have low variation. model, along with the prediction, we further augment the state space with a market index performance indicator. The In our RGAN architecture, we model each asset i separately by a pair of parameterized function Di ,Gi , and we use state space now has the form s˜t (st, xt+1,It) where, φ ψ , 2 It ∈ R+ is the market index performance indicator for time MMD\ , the unbiased estimator of squared MMD between step t. i i i pg and pdata, as a regularizer for the generator Gψ, such that the generator not only tries to ’trick’ the discriminator into i 3.3. Data Augmentation Module classifying its output as coming from pdata, but also tries i i to match pg with pdata in all moments. This alleviates the The limited historical financial data prevents us from scaling 2 up the deep RL agent. To mitigate this issue, we augment the aforementioned issues in vanilla GAN: firstly MMD\ is data set via the recurrent generative adversarial networks defined even when distributions have disjoint supports, and 2 (RGAN) framework (Esteban et al., 2017) and generate secondly gradients provided by MMD\ are not dependent synthetic time series. The RGAN follows the architecture of on the discriminator but only on the real data. Specifically, a regular GAN, with both the generator and the discriminator i i both the discriminator Dφ and the generator Gψ is trained substituted by recurrent neural networks. Specifically, we with method. The discriminator objective generate percentage change of closing price for each asset is: separately at higher frequency than what the RL agent uses. 1 Pb h i (j)  i i (j) i maxφ j=1 log Dφ(hi ) + log 1 − Dφ(Gψ(z )) , We then downsample the generated series to obtain the b and the generator objective is: 6Details of the learning rule and derivation of the model as per 2 1 Pb h  i i (j) i the original paper. Algorithm steps and hyper-parameter settings minψ b j=1 log 1 − Dφ(Gψ(z )) + ζMMD\ are provided in supplementary. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

(j) b given a batch of b samples {z }j=1 drawn independently actions a¯i is calculated: from a diagonal Gaussian noise prior pz(z), and a batch of µ (j) b i L¯ = −1× b samples {hi }j=1 drawn from the data distribution pdata th PN Pm of the i asset. a¯i,j log(µ(si))j + (1 − a¯i,j) log(1 − (µ(si))j) i=1 j=0 . In order to select the bandwidth parameter σ in the RBF N(m + 1) kernel, we set it to the median pairwise distance between (2) the joint data. Both discriminator and generator networks are LSTMs (Hochreiter & Schmidhuber, 1997). Gradients of the log-loss with respect to the actor net- µ 2 work ∇ µ L¯ can then be calculated and used to perturb Although generator directly minimises estimated MMD\ , θ the weights of the actor slightly. Using this loss directly we go one step further to validate the RGAN by conducting prevents the learned policy from improving significantly Kolmogorov-Smirnov (KS) test. Our results show that the beyond the demonstration policy, as the actor is always tied RGAN is indeed generating data representative of the true back to the demonstrations. To achieve this, a factor λ is underlying distribution. More details of the KS test can be used to discount the gradients such that the actor network found in supplementary material. is only slightly perturbed towards the one-step greedy ac- tion, thus maintaining the stability of the underlying RL 3.4. Behavior Cloning Module algorithm. Following this, the typical DDPG algorithm as In finance, some investors may favor a portfolio with lower described above is executed to train the agent. volatility over the investment horizon. To achieve this, we propose a novel method of behavior cloning, with the pri- 4. Experiments mary purpose of reducing portfolio volatility while main- 7 taining reward to risk ratios. We remark that one can broadly All experiments were conducted using a backtesting envi- dichotomize imitation learning into a passive collection of ronment, and the assets used are as described in Section3. demonstrations (behavioral cloning) versus an active col- The time period spanning from 1 Jan 2005 to 31 Dec 2016 lection of demonstrations. The former setting (Abbeel & was used to train the agent, while the time period between Ng, 2004; Ross et al., 2011) assumes that demonstrations 1 Jan 2017 to 4 Dec 2018 was used to test the agent. We are collected a priori and the goal of imitation learning is initialize our portfolio with $500, 000 in cash. We imple- to find a policy that mimics the demonstrations. The latter ment a CRP benchmark where funds are equally distributed setting (Daume´ et al., 2009; Sun et al., 2017) assumes an among all assets, including the cash asset. We also compare interactive expert that provides demonstrations in response the previous work (Jiang et al., 2017) as a baseline which to actions taken by the current policy. Our proposed BCM can be considered as a vanilla model-free DDPG algorithm in this definition is an active imitation learning algorithm. (i.e., without data augmentation, infused prediction and be- 8 In particular, for every step that the agent takes during train- havioral cloning) . It should be noted that, compared to the ing, we calculate the one-step greedy action in hindsight. implementation in (Jiang et al., 2017) our baseline agent is This one-step greedy action is computed by solving an opti- situated in a more realistic trading environment that does mization problem, given the next time period’s price ratios, not assume an immediate trade execution. In particular, our current portfolio distribution and transaction costs. The ob- backtesting engine executes market orders at the open of 9 jective function is to maximize returns in the current time the next OHLC bar , as well as adds slippage to the trading step, which is why the computed action is referred to as costs. Additional practical constraints are applied such that the one-step greedy action. For time step t, the objective fractional trades of an asset are not allowed. function is as follows: As shown in Table1, adding both DAM and BCM to the Pm baseline leads to a very small increase in Sharpe ratio (from max ut · wt − c i=1 |wi,t − wi,t−1| wt (1) 0.56% to 0.57%) and Sortino ratio (from 0.78% to 0.79%). Pm s.t. i=0 wi,t = 1, 0 ≤ wi,t ≤ 1, ∀i. There are two hypotheses that we can draw here. First, we hypothesize that the vanilla DDPG algorithm (in its usage Solving the above optimization problem for wt yields an within our framework) is not over-fitting to the data since optimal expert greedy action denoted by a¯t. This one-step the use of data augmentation has a relatively small impact greedy action is then stored in the replay buffer together on the performance. Similarly, it could be hypothesized that with the corresponding (s , a , r , s ) pair. In each train- t t t t+1 7 ing iteration of the actor-critic algorithm, a mini-batch of Algorithm of our model-based DDPG agent is detailed in supplementary material Section A. {(s , a , r , s , a¯ )}N is sampled from the replay buffer. i i i i+1 i i=1 8In fact, our baseline is superior to the prior work (Jiang et al., Using the states si that were sampled from the replay buffer, 2017) due to the addition of prioritized experience replay and the actor’s corresponding actions are computed and the log- parameter noise for better exploration. loss between the actor’s actions and the one-step greedy 9Refer to Section3 for details on these terms. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Table 1. Performances for different models (last column represents IPM+DAM+BCM model. ann. and accnt. stand for annualized and account, respectively). CRP Baseline IPM DAM BCM IPM+DAM IPM+BCM DAM+BCM All Final accnt. value 574859 570482 586150 571853 571578 575430 577293 571417 580899 Ann. return 7.25% 7.14% 8.64% 7.26% 7.24% 7.60% 7.77% 7.22% 8.09% Ann. volatility 12.65% 12.79% 14.14% 12.80% 12.79% 12.84% 12.81% 12.77% 12.77% Sharpe ratio 0.57 0.56 0.61 0.57 0.57 0.59 0.61 0.57 0.63 Sortino ratio 0.80 0.78 0.87 0.79 0.79 0.83 0.85 0.79 0.89 VaR0.95 1.27% 1.30% 1.41% 1.30% 1.30% 1.29% 1.28% 1.30% 1.27% CVaR0.95 1.91% 1.93% 2.11% 1.94 1.93% 1.93% 1.92% 1.93% 1.91% MDD 13.10% 13.80% 12.20% 13.70% 13.70% 12.70% 12.60% 13.70% 12.40% the baseline is efficient in its use of available signals in the though annualized volatility is significantly reduced. In training set, such that, the addition of behavior cloning by addition, when comparing baseline versus baseline with itself, has a negligible improvement of performance. By BCM, we see that MDD is reduced and Sharpe and Sortino making use of just IPM, we observe a significant improve- ratios have small improvements. ment over the baseline in terms of Sharpe (from 0.56 to We can conclude that IPM significantly improves portfolio 0.61) and Sortino (from 0.78 to 0.87) ratios, indicating that management performances in terms of Sharpe and Sortino we are able to get more returns per unit risk taken. However, ratios. This is particularly attractive for investors who aim to we note that the volatility of the portfolio is significantly maximize their returns per unit risk. In addition, DAM helps increased (from 12.79% to 14.14%). This could be due to to prevent over-fitting especially in the case of larger and the agent over-fitting to the training set, with the addition more complex network architectures. However, it is impor- of this module. Therefore, when the agent sees the testing tant to note that this may impact risk to reward performance. set, its actions could result in more frequent losses, thereby BCM as envisioned, helps to reduce portfolio risk as seen by increasing volatility. its ability to either reduce volatility or MDD across all runs. To reduce over-fitting, we integrate IPM with DAM. As It is also interesting to note its ability to slightly improve the can be seen in Table1, the Sharpe ratio is actually reduced risk to return (reward) ratios. Enabling all three modules, slightly compared to the IPM case (from 0.61 to 0.59) al- our proposed model-based approach can achieve significant though volatility is significantly reduced (from 14.14% to performance improvement as compared with benchmark 12.84%). This reduces the overall risk of the portfolio. and baseline. In addition, we provide details on an addi- Moreover, this substantiates our hypothesis of the agent tional risk adjustment module (that can adjust the reward over-fitting to the training set. One possible reason for IPM function contingent on risk) with experimental results in over-fitting to the training set while the baseline agent does supplementary material. Usage of the presented modules not is the following: the IPM is inherently more complex with a PPO based on-policy algorithm is also provided. due to the presence of a prediction model. The increased complexity in terms of larger network structure essentially 5. Conclusion results in higher probability of over-fitting to the training data. As such, using DAM to diversify the training set is In this paper, we proposed a model-based deep reinforce- highly beneficial in containing such model over-fitting. ment learning architecture to solve the dynamic portfolio optimization problem. To achieve a profitable and risk- One drawback of using IPM+DAM is the decreased return sensitive portfolio, we developed infused prediction, GAN to risk ratios as mentioned above. To mitigate this, we based data augmentation and behavior cloning modules, and make use of all our contributions as a single framework (i.e. further integrated them into an automatic trading system. IPM+DAM+BCM). As shown in Table1, we are not only The stability and profitability of our proposed model-based able to recover the performance of the original model trained RL trading framework were empirically validated on sev- using just IPM, but surpass the IPM performance in terms eral independent experiments with real market data and of both Sharpe (from 0.61 to 0.63) and Sortino ratios (from practical constraints. Our approach is applicable not only 0.87 to 0.89). It is worthwhile to note that addition of BCM to the financial domain, but also to general reinforcement has a strong impact compared to IPM+DAM in reducing learning domains which require practical considerations on volatility (from 12.84% to 12.77%) and MDD (from 12.7% decision making risk and training data scarcity. Future work to 12.4%). The observation of behavior cloning reducing could test our model’s performance on gaming and robotics downside risk is further substantiated when comparing the applications. model trained via IPM versus the model trained with IPM and BCM; the Sharpe and Sortino ratios are maintained Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

References Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. Deep direct reinforcement learning for financial signal representation Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse and trading. IEEE transactions on neural networks and reinforcement learning. In Proceedings of the twenty-first learning systems, 28(3):653–664, 2017. international conference on Machine learning, pp. 1. ACM, 2004. Esteban, C., Hyland, S. L., and Ratsch,¨ G. Real-valued (medical) time series generation with recurrent condi- Arjovsky, M. and Bottou, L. Towards principled methods for tional GANs. arXiv preprint arXiv:1706.02633, 2017. training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017. Feng, Y., Palomar, D. P., and Rubio, F. Robust optimization of order execution. IEEE Transactions on Signal Process- Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein ing, 63(4):907–920, Feb 2015. ISSN 1053-587X. doi: generative adversarial networks. In Precup, D. and Teh, 10.1109/TSP.2014.2386288. Y. W. (eds.), Proceedings of the 34th International Con- ference on Machine Learning, volume 70 of Proceed- Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., ings of Machine Learning Research, pp. 214–223, In- Warde-Farley, D., Ozair, S., Courville, A., and Bengio, ternational Convention Centre, Sydney, Australia, 06– Y. Generative adversarial nets. In Advances in neural 11 Aug 2017. PMLR. URL http://proceedings. information processing systems, pp. 2672–2680, 2014. mlr.press/v70/arjovsky17a.html. Gretton, Arthur, Borgwardt, M., K., Rasch, Malte, Schlkopf, Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent Bernhard, and Alexander. A kernel two-sample test, measures of risk. Mathematical finance, 9(3):203–228, Dec 2012. URL http://jmlr.csail.mit.edu/ 1999. papers/v13/gretton12a.html. Bain, M. and Sommut, C. A framework for behavioural Gretton, A., Borgwardt, K. M., Rasch, M., Scholkopf,¨ B., cloning. Machine intelligence, 15(15):103, 1999. and Smola, A. J. A kernel method for the two-sample- Bertsekas, D. P. Dynamic programming and optimal control, problem. In Advances in neural information processing volume 1. Athena scientific Belmont, MA, 2005. systems, pp. 513–520, 2007. Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, Guo, Y., Fu, X., Shi, Y., and Liu, M. Robust log-optimal M. Natural actor–critic algorithms. Automatica, 45(11): strategy with reinforcement learning. arXiv preprint 2471–2482, 2009. arXiv:1805.00205, 2018. Borovykh, A., Bohte, S., and Oosterlee, C. W. Condi- Haugen, R. A. and Haugen, R. A. Modern investment theory. tional time series forecasting with convolutional neural 1990. arXiv preprint arXiv:1703.04691 networks. , 2017. Heess, N., Hunt, J. J., Lillicrap, T. P., and Silver, D. Memory- Chekhlov, A., Uryasev, S., and Zabarankin, M. Drawdown based control with recurrent neural networks. arXiv measure in portfolio optimization. International Journal preprint arXiv:1512.04455, 2015. of Theoretical and Applied Finance, 8(01):13–58, 2005. Hochreiter, S. and Schmidhuber, J. Long short-term memory. Cover, T. M. Universal portfolios. Mathematical finance, 1 Neural computation, 9(8):1735–1780, 1997. (1):1–29, 1991. Jaeger, H. Adaptive nonlinear system identification with Cumming, J., Alrajeh, D. D., and Dickens, L. An investi- echo state networks. In Advances in neural information gation into the use of reinforcement learning techniques processing systems, pp. 609–616, 2003. within the algorithmic trading domain. PhD thesis, 2015. Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting Dasgupta, S. and Osogami, T. Nonlinear dynamic boltz- chaotic systems and saving energy in wireless communi- mann machines for time-series prediction. In AAAI, pp. cation. science, 304(5667):78–80, 2004. 1833–1839, 2017. Jiang, Z., Xu, D., and Liang, J. A deep reinforcement learn- Daume,´ H., Langford, J., and Marcu, D. Search-based ing framework for the financial portfolio management . Machine learning, 75(3):297–325, problem. arXiv preprint arXiv:1706.10059, 2017. 2009. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet Dempster, M. A. and Leemans, V. An automated FX trading classification with deep convolutional neural networks. system using adaptive reinforcement learning. Expert In Advances in neural information processing systems, Systems with Applications, 30(3):543–552, 2006. pp. 1097–1105, 2012. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Li, B. and Hoi, S. C. Online portfolio selection: A survey. Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, ACM Computing Surveys (CSUR), 46(3):35, 2014. R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychow- icz, M. Parameter space noise for exploration. arXiv Li, Y., Swersky, K., and Zemel, R. Generative moment preprint arXiv:1706.01905, 2017. matching networks. In Bach, F. and Blei, D. (eds.), Pro- ceedings of the 32nd International Conference on Ma- Rockafellar, R. T., Uryasev, S., et al. Optimization of condi- chine Learning, volume 37 of Proceedings of Machine tional value-at-risk. Journal of risk, 2:21–42, 2000. Learning Research, pp. 1718–1727, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr. Ross, S., Gordon, G., and Bagnell, D. A reduction of imita- press/v37/li15.html. tion learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international Liang, Z., Chen, H., Zhu, J., Jiang, K., and Li, Y. Adversar- conference on artificial intelligence and statistics, pp. ial deep reinforcement learning in portfolio management. 627–635, 2011. arXiv preprint arXiv:1808.09940, 2018. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Priori- Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, tized experience replay. arXiv preprint arXiv:1511.05952, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous 2015. control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015. Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Marinescu, A., Dusparic, I., and Clarke, S. Prediction- Conference on Machine Learning, pp. 1889–1897, 2015. based multi-agent reinforcement learning in inherently non-stationary environments. ACM Transactions on Au- Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and tonomous and Adaptive Systems (TAAS), 12(2):9, 2017. Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017. Markowitz, H. M. Portfolio selection: Efficient diversifica- tion of investment. 344 p, 1959. Sharpe, W. F. Mutual fund performance. The Journal of business, 39(1):119–138, 1966. Mittelman, R. Time-series modeling with undecimated fully convolutional neural networks. arXiv preprint Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., arXiv:1508.00317, 2015. Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, game of go with deep neural networks and tree search. J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje- nature, 529(7587):484, 2016. land, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540): Sortino, F. A. and Price, L. N. Performance measurement 529, 2015. in a downside risk framework. the Journal of Investing, 3 (3):59–64, 1994. Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asyn- Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and chronous methods for deep reinforcement learning. In Bagnell, J. A. Deeply aggrevated: Differentiable imita- International conference on machine learning, pp. 1928– tion learning for sequential prediction. arXiv preprint 1937, 2016. arXiv:1703.01030, 2017.

Moody, J. and Saffell, M. Learning to trade via direct Sutton, R. S. and Barto, A. G. Reinforcement learning: An reinforcement. IEEE transactions on neural networks, 12 introduction. MIT press, 2018. (4):875–889, 2001. van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., and Abbeel, P. Overcoming exploration in reinforcement Lockhart, E., Cobo, L., Stimberg, F., Casagrande, N., learning with demonstrations. In 2018 IEEE Interna- Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalch- tional Conference on Robotics and Automation (ICRA), brenner, N., Zen, H., Graves, A., King, H., Walters, pp. 6292–6299. IEEE, 2018. T., Belov, D., and Hassabis, D. Parallel WaveNet: Fast high-fidelity . In Dy, J. and Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action- Krause, A. (eds.), Proceedings of the 35th Interna- conditional video prediction using deep networks in atari tional Conference on Machine Learning, volume 80 of games. In Advances in neural information processing Proceedings of Machine Learning Research, pp. 3918– systems, pp. 2863–2871, 2015. 3926, Stockholmsmssan, Stockholm Sweden, 10–15 Jul Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

2018. PMLR. URL http://proceedings.mlr. press/v80/oord18a.html. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

A. Algorithms An on-policy version of our proposed model-based PPO style RL architecture can be found in Algorithm2. Our proposed model-based off-policy actor-critic style RL architecture is summarized in Algorithm1. Algorithm 2 Dynamic portfolio optimization algorithm (on-policy version). Algorithm 1 Dynamic portfolio optimization algorithm 1: Input: Two-head policy network µ with weights θµ, policy (off-policy version). policy 2m+2 value head µ : Se → R and value head µ : Se → R, 1: Input: Critic Q(˜s, a|θQ), actor µ(˜s|θµ) and perturbed ac- and clipping threshold . tor networks µ(˜s|θµ˜) with weights θQ, θµ, θµ˜ and standard 2: for episode = 1,...,M do deviation of parameter noise σ. 3: Receive initial observation state s1 0 2: Initialize target networks Q0, µ0 with weights θQ ← θQ, 4: for t = 1,...,T do 0 θµ ← θµ 5: Predict future price tensor xt+1 with prediction models 3: Initialize replay buffer R using st and form augmented state s˜t 6: Use the policy-head of policy network to produce the 4: for episode = 1,...,M do m+1 action mean m(˜st) ∈ and variance Σ(˜st) = 5: Receive initial observation state s1 R 6: for t = 1,...,T do diag(σ1, σ2, . . . , σm+1) 7: Execute action at ∼ N (m(˜st), Σ(˜st)), observe reward 7: Predict future price tensor xt+1 with prediction models rt and new state st+1 using st and form augmented state s˜t µ˜ µ˜ 8: Compute the DSR or D3R dt based on (4) or (6) 8: Use perturbed weight θ to select action at = µ(˜st|θ ) 9: Predict next future price tensor xt+2 using prediction 9: Execute action at, observe reward rt and new state st+1 models with inputs st+1 and form the augmented state 10: Predict next future price tensor xt+2 using prediction models with inputs s and form the augmented state s˜t+1 t+1 10: Solve the optimization problem (1) for the expert greedy s˜t+1 11: Solve the optimization problem (1) for the expert greedy action a¯t action a¯ 11: end for t 12: Compute the probability ratio: 12: Store transition (˜st, at, rt, s˜t+1, a¯t) in R 13: Sample a minibatch of N transitions, policy µ µ (at|s˜t) (˜si, ai, ri, s˜i+1, a¯i), from R via prioritized replay, µ θ rt(θ ) = policy according to temporal difference error µ µ (a |s˜ ) θ t t 0 0 µ0 Q0 old 14: Compute yi = ri + γQ (˜si+1, µ (˜si+1|θ )|θ ) Q 15: Update the critic θ by annealing the prioritized replay 13: Obtain the value estimate from the value head V (˜st) = value bias while minimizing the loss: µ (˜st) and calculate the advantage estimate Aˆt = t0−t P 0 N t0>t γ dt − V (˜st) 1 X Q 2 14: for i = 1,...,N do (y − Q(˜s , a |θ )) N i i i 15: Update the policy network using the gradient of the i=1 clipped surrogate objective:

16: Maintain the ratio between the actual learning rates of T the actor and critic X µ µ ∇ µpolicy Aˆ min(r (θ ), clip(r (θ ), 1 − , 1 + )) 17: Update θµ using the sampled policy gradient: θ t t t t=1

1 X Q µ ∇ Q(˜s, a|θ )| µ ∇ µ µ(˜s|θ )| 16: Update the policy network using the gradient of value N a s˜=˜si,a=µ(˜si|θ ) θ s˜=˜si i loss: T X 2 ∇ µvalue Aˆ 18: Calculate the expert auxiliary loss L¯ in (2) and update θ t µ µ t=1 θ using ∇θµ L¯ with factor λ 0 0 19: Update the target networks: θQ ← τθQ + (1 − τ)θQ , 17: Calculate the expert auxiliary loss L¯ in (2) and update 0 0 ¯µ θµ ← τθµ + (1 − τ)θµ policy network using ∇θµpolicy L with factor λ 20: Create adaptive actor weights µ˜0 from current actor 18: end for 0 19: end for weight θµ and current σ: θµ˜ ← θµ + N (0, σ) 21: Generate adaptive perturbed actions a˜0 for the sampled 0 µ˜0 transition starting states s˜i: a˜ = µ(s ˜i|θ ). With pre- B. Experiments for On-policy Algorithm µ viously calculated actual actions a = µ(˜si|θ ), cal- 0 culate the mean induced action noise: d(θµ, θµ˜ ) = As shown in Table2, the proposed IPM can improve the q PPO baseline and baseline+BCM model performance. For 1 PN [(a − a0 )2] N i=1 Es i i brevity, we do not enumerate all module combinations. The µ µ˜0 22: Update σ: if d(θ , θ ) ≤ δ, σ ← ασ, otherwise σ ← purpose is to show the proposed RL architecture could be σ/α extended to the on-policy setting. 23: end for µ˜ µ 24: Update perturbed actor: θ ← θ + N (0, σ) For the trained PPO+BCM model, we arbitrarily chose the 25: end for testing period between 17 Apr 2018 to 03 Dec 2018 to depict Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

polynomial order 3.

Cash For the variant of WaveNet (van den Oord et al., 2018) in 0.0194 the infused prediction module, we choose the number of

0.0193 Costco Wholesale Corporation dilation levels L to be 6, filter length f to be 2, the number 0.177 −4 0.176 of filters to be 32 and the learning rate to be 10 . Inputs are 0.175 Cisco Systems scaled with min-max scaler with a window size equal to the 0.242 L PL−2 i receptive field of the network, which is f + i=0 f = 95. 0.240 Ford Motors 0.0301 For the DAM, we select the number of training set to be 0.0300

Goldman Sachs 30,000, noise latent dimension H to be 8, batch size to be 0.01420 128, time embedding k1 to be 95, generator RNN hidden 0.01415 American International Group units to be 32, discriminator RNN hidden unit to be 32 and 0.1100 −3 0.1095 a learning rate 10 . For each episode in Algorithm1, we Amgen Inc generate and append two months of synthetic market data 0.395 0.394 for each asset. 0.393 Caterpillar

0.01485 For the BCM, we choose the factor λ = 0.1 to discount the 0.01480 gradient of the log-loss.

2018-04-17 2018-05-15 2018-06-13 2018-07-12 2018-08-09 2018-09-07 2018-10-05 2018-11-02 2018-12-03 Elasped time steps D. Details of Infused Prediction Module D.1. Nonlinear Dynamic Boltzmann Machine Algorithm Figure 3. Portfolio weights for different assets (assets from top to bottom are Cash, Costco Wholesale Corporation, Cisco Systems, Ford Motors, Goldman Sachs, American International Group, Am- gen Inc and Caterpillar). how the portfolio weights evolve over the trading time in Figure3. The corresponding trading signals are shown in Figure4. It can be seen from both figures that the portfolio weights produced by the RL agent can adapt to the market change and evolve as time goes by without falling into a local optimum.

C. Hyperparameters for Experiments We report the hyperparameters for the actor and critic net- works in Table3. Nonlinear Dynamic Boltzmann machine in the infused pre- diction module uses the following hyper-parameter settings: delay d = 3, decay rates λ = [0.1, 0.2, 0.5, 0.8] i.e. k = 4, learning rate was 10−3, with standard RMSProp optimizer. Figure 5. A nonlinear dynamic Boltzmann machine unfolded in time. There are no connections between units i and j within a The input and output dimensions were fixed at three times layer. Each unit is connected to each other unit only in time. The the number of assets, corresponding to the high, low and lack of intra-layer connections enables conditional independence close percentage change values. The RNN layer dimen- as depicted. sion is fixed at 100 units with a tanh nonlinear activa- tion function. A zero mean, 0.01 standard deviation noise was applied to each of the input dimensions at each time step, in order to slightly perturb the inputs to the network. In Figure5 we show the typical unfolded structure of a This injected noise was cancelled by applying a standard10 nonlinear dynamic Boltzmann machine. The RNN layer Savitzky-Golay (savgol) filter with window length 5 and (of a reservoir computing type architecture) (Jaeger, 2003; Jaeger & Haas, 2004) is used to create nonlinear feature 10As implemented in scientific computing package Scipy. transforms of the historical time-series and update the bias Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Table 2. Performance for on-policy PPO-style dynamic portfolio optimization algorithm (accnt. and ann. are abbreviations for account and annualized respectively). Baseline BCM IPM+BCM Final accnt. value 595332 672586 711475 Ann. return 5.23% 15.79% 18.25% Ann. volatility 14.24% 13.96% 14.53% Sharpe ratio 0.37 1.13 1.26 Sortino ratio 0.52 1.62 1.76 VaR0.95 1.41% 1.39% 1.38% CVaR0.95 2.10% 2.09% 2.20% MDD 22.8% 12.40% 11.60%

Table 3. Hyperparameters for the DDPG actor and critic networks. Actor network Critic network FE layer type RNN bidirectional LSTM RNN bidirectional LSTM FE layer size 20, 8 20, 8 FA layer type Dense Dense FA layer size 256,128,64,32 256,128,64,32 FA layer Leaky relu Leaky relu Optimizer Gradient descent optimizer Adam optimizer Dropout 0.5 0.5 Synchronized to be 100 times slower than Learning rate 10−3 the critic’s actual learning rate Episode length 650 Number of episodes 200 σ for parameter noise 0.01 Replay buffer size 1000 parameter vector as follows: learning scenario this can provide significant computational benefits. b[t] = b[t−1] + A>Ψ[t] Our implementation of the NDyBM based infused predic- Where, Ψ[t] is a M × 1 dimensional state vector at time t tion module is based on the open-source code available at of a M dimensional RNN. A is the M × N dimensional https://github.com/ibm-research-tokyo/ learned output weight matrix that connects the RNN state dybm. Algorithm 3 describes the basics steps. to the bias vector. The RNN state is updated based on the input time-series x[t] as follows: Algorithm 3 Online asset price change prediction with the [t]  [t−1] [t] Ψ = tanh WrnnΨ + Winx . NDybM IPM module. 1: Require: All the weight and bias parameters of NDyBM Here, Wrnn and Win are the RNN internal weight matrix are initialized to zero. The RNN weights, Wrnn initial- and the weight matrix corresponding to the projection of the ized randomly from N (0, 1), Win initialized randomly from time-series input to the RNN layer, respectively. N (0, 0.1). The FIFO queue is initialized with d − 1 zero vectors. K eligibility traces {α[−1]}K are initialized with [t+1] k k=1 The NDyBM is trained online to predict ˜x based on zero vectors the estimated µ[t] for all time observations t = 1, 2, 3,... . 2: Input: Close, high and low percentage price change for each The parameters are updated such that the log-likelihood asset at each time step LL(D) = P P log p(x[t]|x[−∞,t−1]), of the given fi- 3: for t = 0, 1, 2, ... do x∈D t 4: Compute µ[t] using µ[t] = b + Pd−1 F[δ]x[t−δ] + nancial time-series data D is maximized. We can derive δ=1 PK G α[t−1] & update the bias vector based on RNN exact stochastic gradient update rules for each of the param- k=1 k k layer eters using this objective (can be referenced in the original 5: Predict the expected price change pattern at time t using paper). Such a local update mechanism, allows an O(1) up- µ[t] date of NDyBM parameters. As such a single epoch update 6: Observe the current time series pattern at x[t] of NDyBM occurs in sub-seconds  1s as compared to a 7: Update the parameters of NDyBM based on tens of seconds update of the WaveNet inspired model. Scal- ∇ log p(x[t]|x[−∞,t−1]) [t] ing up to large number of assets in the portfolio, in an online 8: Update FIFO queues and eligibility traces by αi,j,k = Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

[t−1] [t−di,j +1] λk αi,j,k + xi fully convolutional model can process inputs in parallel, 9: Update RNN layer state vector thus resulting in faster training speed compared to recurrent 10: end for models, but the number of layers needed is linear to k1. This inefficiency can be solved by using stacked dilated convolu- D.2. WaveNet inspired multivariate time-series tion. A dilated convolution with dilation factor d uses filters prediction with d − 1 zero inserted between its values, which allows We use an autoregressive generative model, which is a vari- it to operate in a coarser scale than a normal convolution ant of parallel WaveNet (van den Oord et al., 2018), to with same effective filter length. When stacking dilated learn the temporal pattern of the percentage price change convolution layers with filter length f, if an exponentially m×3 increasing dilation factor d = f l−1 is used in each layer tensor xt ∈ , which is the part of state space that l R L is assumed to be independent of agent’s actions. Our net- l, the effective receptive fields will be f , where L is the work is inspired by previous works on adapting WaveNet total number of layers. Thus large receptive field can be to time series prediction (Mittelman, 2015; Borovykh et al., achieved with having logarithmic many layers to k1, which 2017). We denote the ith asset’s price tensor at time t as has much fewer parameters needed compared to a normal h l fully convolutional model. xi,t = (pi,t, pi,t, pi,t). The joint distribution of price over time X = {x1,..., xT } is modelled as a factorized product Secondly, by factoring not only over time but over different of probabilities conditioned on a past window of size k1: assets as well, this model makes parallelization easier and

T m potentially has better interpretability. The prediction of each Y Y asset is conditioned on all the asset prices in the past window, p(X) = p(xi,t|xt−k1 ,..., xt−1, θ). t=1 i=1 which makes the model easier to run in parallel since each stack can be placed on different GPUs. The model parameter θ is estimated through maximum like- lihood estimation (MLE) respective to p(X). The joint The serial cross correlation R at lag ` for a pair of discrete probability is factorized both over time and different assets, time series x(t), y(t) is defined as and the conditional probabilities is modelled as stacks of X dilated causal . Causal convolution ensures that Rxy(`) = x(t)y(t − `). the output does not depend on future data, which can be t implemented by front zero padding convolution output in time dimension such that output has the same size in time 44 Predicted dimension as input. 42 Actual 40 Figure6 shows the tensorboard visualization of a dilated 38 convolution stack of our WaveNet variant. At each time t, 36 k1 m×3×k1 Price the input window (xt−k ,..., xt−1) ∈ X ⊂ 1 R 34 first goes through the common F1 layer, which is a depth- 32 wise separable 2d convolution followed by a 1 × 1 convolu- 30 tion, with time and features (close, high, low) as height and 2017-08 2017-09 2017-10 2017-11 2017-12 2018-01 2018-02 2018-03 width and assets as channel. Output from F1 is then feed Date into different stacks of the same architecture as depicted in Figure6, one for each different asset. Fl and Rl denotes Figure 7. Out of sample actual and predicted closing price move- dilated convolution with filter length f and relu activation at ment for asset Cisco Systems. l−1 level l, which has dilation factor dl = f . Each Rl takes the concatenation of F and R as input. This concate- l l+1 Roughly speaking, R (`) measure the similarity be- nation is represented by the residual blocks in the diagram. xy tween x and a lagged version of the y. The peak α = The final M layer in Figure6 is a 1 × 1 convolution with 3 arg max R (`) indicates that x(t) has highest correlation filters, one for each of high, low and close, and the output ` xy with y(t + α) for all t on average. If x(t) is the predicted is exactly x . The output from m different stacks is then i,t time series, and y(t) is the real data, a naive prediction concatenated to produce the prediction x . t model simply takes the last observation as prediction and The reason for modelling the probabilities as such is two- would thus have α = −1. Sometimes a prediction model fold. First, it addresses the dependency on historical patterns will learn this trivial prediction and we call this kind of of the financial market by using a high order autoregressive model trend following. To test whether our model has model to capture long term patterns. Models with recur- learned a trend following prediction, we convert the pre- rent connections can capture long term information since dicted percentage change vector xt back to price (see Figure they have internal memory, but they are slow to train. A 7) and calculated the serial cross-correlation between the Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization predicted series and the actual. Figure8 clearly shows that Note that the RBF kernel has the following representation there is no trend following behavior as α = 0.  kxk2 + kyk2  hx, yi K(x, y) = exp − exp 2σ2 σ2 1.0 ∞ X hx, yin 0.8 =C σ2nn! 0.6 n=0 ∞ 0.4 X hφn(x), φn(y)i =C 0.2 σ2nn! n=0 0.0

Correlation Coefficient 0.2 where C is the constant exp −||x||2 + ||y||2/(2σ2), and 0.4 100 50 0 50 100 φn({·) is the feature map of polynomial kernel of degree Lag n. Following (Li et al., 2015), we can rewrite the biased 2 φ Figure 8. Trend following analysis on predicted closing price estimator MMD\b with n: movement for asset Cisco Systems. ∞ M M 2 X 1  1 X X MMD\ =C hφ (x ), φ (x )i b σ2nn! M 2 n i n j n=0 i=1 j6=i M N E. Details of Data Augmentation Module 2 X X − hφ (x ), φ (y )i MN n i n j Maximum mean discrepancy (MMD) (Gretton et al., 2012) i=1 j=1 is a pseudometric over Prob(X ), the space of probability N N 1 X X  measures on some compact metric set X . Given a family of + hφ (y ), φ (y )i N 2 n i n j functions F, MMD is defined as i=1 j6=i 2 MMD(F, pa, pb) = sup Ex∼pa [f(x)] − Ex∼pb [f(x)]. ∞ M N f∈F X 1 1 X 1 X =C φn(xi) − φn(yj) . σ2nn! M N n=0 i=1 j=1 When F is the unit ball in a Reproducing Kernel Hilbert Space (RKHS) H associated with a universal kernel K : 2 X × X → {f ∈ H : kfk ≤ 1} (F, p , p ) R, i.e. ∞ , MMD a b Therefore minimizing MMD\b can be seen as matching all is not only a pseudometric but a proper metric as well, moments between the two distribution pa, pb. that is MMD(F, pa, pb) = 0 if and only if pa = pb. One example of a universal kernel is the commonly used Although we can do multivariate two-sample test with Gaussian radial basis function (RBF) kernel K(x, y) = MMD, it suffers from the curse of dimensionality. Thus exp −kx − yk2/(2σ2). instead of looking at the distribution of percentage change vector hi, we perform two-sample test on the distribution M N Given samples {xi}i=1, {yi}i=1 from pa, pb, the square of ph of percentage change series hi,t itself. Specifically, we MMD has the following unbiased estimator: use the Kolmogorov-Smirnov test, which is a two-sample M M M N test method based on k · k∞ norms between the empirical 2 1 X X 2 X X MMD\ = K(x , x ) − K(x , y ) cumulative distribution functions of the two distributions. M  i j MN i j 2 i=1 j6=i i=1 j=1 For each asset of interest, we divide the data set into a N N training and a validation set, where the training set is used 1 X X + K(y , y ). to train the RGAN model. Then we generate a batch of N  i j 2 i=1 j6=i samples with the trained RGAN. For each generated series, we perform the KS test between it and every series in the We also have the biased estimator where the empirical esti- validation set and calculates the maximum p-value from mate of feature space means is used which. The average of these maximum p-values, denoted M M M N by p¯max, among all the generated samples is then used to 2 1 X X 2 X X MMD\ = K(x , x ) − K(x , y ) determine the goodness of fit of the given generated series. b M 2 i j MN i j i=1 j6=i i=1 j=1 The average p¯max across all assets in our portfolio is 0.11, N N which shows that we cannot reject the null hypothesis that 1 X X + K(y , y ). the generated distribution and the data distribution is the N 2 i j i=1 j6=i same. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

F. Risk-adjustment Module The DSR has several attractive properties including facil- itating recursive updating, enabling efficient on-line opti- In this section, we discuss the risk-adjustment module mization, weighting recent returns more and providing in- (RAM) which is aimed to optimize portfolio risk directly. terpretability (Moody & Saffell, 2001). We adapt the framework of recurrent deterministic policy gradient (RDPG, see (Heess et al., 2015)) technique which To define the D3R, we need first to define the downside is a natural extension of DDPG to memory-based control deviation ddT as setting with recurrent neural networks, and adjust our risk 1 T ! 2 preference. 1 X dd min{r , 0}2 , T , T t Note that the sampled minibatch from the replay buffer in t=1 Algorithm1 is of the form which is the square root of the average of the square of the (st1 , at1 , rt1 , st1+1),..., (stK , atK , rtK , stK +1). negative returns. Using the downside deviation as a measure The critic and actor networks are then updated according of risk, we can now define the downside deviation ratio (DDR) as to these disjointed (non-successive) samples, i.e., stk+1 6= s , ∀k = 1,...,K − 1. Different from DDPG replay E[rt] tk+1 DDRT , . (5) buffer which stores a time-step experience, RDPG replay ddT buffer stores a complete trajectory. In other words, each The D3R is then defined by considering the exponential sample in the DDPG replay buffer is a state-action-reward- moving average of the returns and the squared downside next state tuple, while each sample in the RDPG replay deviation of returns in (5), and by expanding to the first buffer is a trajectory starting from a given initial state. When order in the adaption rate η of DDR: we sample from the RDPG replay buffer, different trajecto-  rt−0.5νt−1 ries are sampled. if rt > 0, dDDRt  ddt−1 d = 2 2 t , ddt−1(rt−0.5νt−1)−0.5νt−1rt Different from the original RDPG (Heess et al., 2015), the dη 3 otherwise,  ddt−1 update rules of our RDPG are the same as DDPG except that (6) samples from the replay buffer are trajectories instead of where disjoint transitions, and that both critic and actor networks in RDPG still take augmented state s˜i as inputs. The details νt = νt−1 + η(rt − νt−1), of these changes are in Algorithm4 2 2 2 2 ddt = ddt−1 + η(min{rt, 0} − ddt−1). Having access to trajectory sampling also allows us to apply The D3R rewards the presence of large average positive risk-adjustment to our RL objective. Following (Moody returns and penalizes risky downside returns. The procedure & Saffell, 2001), we consider new risk-aware objectives of obtaining the risk-sensitive RDPG agent for dynamic including differential Sharpe ratio (DSR) and differential portfolio optimization is summarized in Algorithm4. downside deviation ratio (D3R). To define the DSR, recall the definition of Sharpe ratio SR and denote its value at time We perform the experiments with initial values ν0 = ω0 = −8 t by SRt: dd0 = 0, and add ε = 10 to the denominators in (4) E[rt] and (6) to avoid division by zero. The results of DSR can SRt = p , (3) var[rt] be found in Table4 below. We can observe similar phe- nomenon to before as shown in Table1, where the rewards where rt is the logarithmic rate of return for period t. The were scaled by a fixed number12. In particular, the use of differential Sharpe ratio Dt is then obtained by considering the moving average of the returns and standard deviation of IPM drastically improves the Sharpe (from 0.56 to 0.60) and returns in (3), and expanding to first order in the adaptation Sortino (from 0.78 to 0.84) ratios. Furthermore, looking rate η at the model trained with IPM and the model trained with IPM+BCM, we see that the use of BCM is able to slightly dSR ω ∆ν − 0.5ν ∆ω t t−1 t t−1 t improve both Sharpe (from 0.60 to 0.62) and Sortino ratios dt , = 3 , (4) dη 2 2 (ωt−1 − νt−1) (from 0.84 to 0.87). This is consistent with earlier results. 11 Using both IPM+DAM+BCM actually produces the best re- where νt and ωt are exponential moving estimates of the sults compared to just DSR in terms of Sharpe (from 0.56 to first and second moments of rt 0.63) and Sortino (from 0.78 to 0.88) ratios. It is worthwhile νt = νt−1 + η∆νt = νt−1 + η(rt − νt−1), to note that using IPM+DAM compared to just IPM does 2 ωt = ωt−1 + η∆ωt = ωt−1 + η(rt − ωt−1). not reduce the volatility as in the earlier case, though the

11 12 This is different from wt in the main paper to denote the All experiment results shown in Table1 were generated by portfolio weights. scaling the rewards by a factor of 103. Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Sharpe (from 0.60 to 0.62) and Sortino (from 0.84 to 0.87) 19: Update θµ using the sampled policy gradient: ratios are improved. This could be due to the rewards being T K risk-adjusted and the DAM allowing better generalization 1 X X Q ∇aQ(˜st, a)|θ )|s˜ =˜s ,a=µ(˜s |θµ)× TK t kt kt of the model to the testing set. t=1 k=1 µ ∇ µ µ(˜st|θ )|s˜ =˜s Results for models with D3R are given in Table5. We θ t kt again see that using IPM compared to just D3R improves 20: Calculate the expert auxiliary loss L¯ in (2) and update θµ both Sharpe (from 0.57 to 0.59) and Sortino (from 0.79 to µ using ∇θµ L¯ with factor λ 0.83) ratios. However, we can observe that the annualized 0 0 21: Update the target networks: θQ ← τθQ + (1 − τ)θQ , volatility is not significantly impacted when comparing IPM 0 0 θµ ← τθµ + (1 − τ)θµ 12.75% 12.74% 0 with IPM+DAM+BCM (from to ). This 22: Create adaptive actor weights µ˜ from current actor weight 0 could be due to the difference in risk adjustment of the θµ and current σ: θµ˜ ← θµ + N (0, σ) rewards. It is nevertheless important to note that using 23: Generate adaptive perturbed actions a˜0 for the sampled 0 µ˜0 IPM+DAM+BCM significantly improves Sharpe (from 0.56 transition starting states s˜i: a˜ = µ(s ˜i|θ ). With pre- µ to 0.63) and Sortino (from 0.78 to 0.88) ratios over the viously calculated actual actions a = µ(˜si|θ ), cal- 0 model trained with just DSR. culate the mean induced action noise: d(θµ, θµ˜ ) = q 1 PN 0 2 We show that the use of risk-adjusted rewards such as DSR N i=1 Es [(ai − ai) ] 0 and D3R does not detrimentally impact performance, more- 24: Update σ: if d(θµ, θµ˜ ) ≤ δ, σ ← ασ, otherwise σ ← over, it presents an alternative way of scaling the rewards σ/α without the need of heuristically setting a scaling factor. 25: Update perturbed actor: θµ˜ ← θµ + N (0, σ) 26: end for

Algorithm 4 Risk-adjusted dynamic portfolio optimization algorithm. 1: Input: Critic Q(˜s, a|θQ), actor µ(˜s|θµ) and perturbed actor networks µ(˜s|θµ˜) with weights θQ, θµ, θµ˜, standard deviation of parameter noise σ and risk adaption rate η. 0 2: Initialize target networks Q0, µ0 with weights θQ ← θQ, 0 θµ ← θµ 3: Initialize replay buffer R 4: for episode= 1,...,M do 5: Receive initial observation state s1 6: Initialize ν0, ω0 and dd0 for DSR and D3R 7: for t = 1 ...,T do 8: Predict future price tensor xt+1 using prediction models with inputs st and form augmented state s˜t µ˜ µ˜ 9: Use perturbed weight θ to select action at = µ(˜st|θ ) 10: Take action at, observe reward rt and new state st+1 11: Predict next future price tensor xt+2 using prediction models with inputs st+1 and form the augmented state s˜t+1 12: Solve the optimization problem (1) for the expert greedy action a¯t 13: Compute the DSR or D3R dt based on (4) or (6) 14: end for 15: Store trajectory (˜s1, a1, d1, s˜2, a¯t,..., s˜T , aT , dT , s˜T +1, a¯T ) in R 16: Sample a minibatch of size K from replay buffer K {(˜sk1 , ak1 , dk1 , s˜k2 , a¯k1 ,..., s˜kT , akT , dkT , s˜kT +1 , a¯kT )}k=1. 0 0 µ0 Q0 17: Compute ykt = dkt + γQ (˜skt+1 , µ (˜skt+1 |θ )|θ ) for all k = 1,...,K and t = 1,...,T 18: Update the critic θQ by minimizing the loss:

T K 1 X X Q 2 (y − Q(˜s , a |θ )) TK kt kt kt t=1 k=1 Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Table 4. Performances for different models (all models are risk-adjusted by DSR risk measure. accnt. and ann. are abbreviations for account and annualized respectively). DSR IPM DAM BCM IPM+DAM IPM+BCM DAM+BCM IPM+DAM+BCM Final accnt. value 570909 575709 570128 571595 579071 578478 571880 579444 Ann. return 7.17% 7.62% 7.10% 7.24% 7.93% 7.88% 7.27% 7.96% Ann. volatility 12.79% 12.75% 12.75% 12.81% 12.75% 12.73% 12.75% 12.68% Sharpe ratio 0.56 0.60 0.56 0.57 0.62 0.62 0.57 0.63 Sortino ratio 0.78 0.84 0.78 0.79 0.87 0.87 0.79 0.88 VaR0.95 1.30% 1.27% 1.28% 1.30% 1.27% 1.26% 1.30% 1.25% CVaR0.95 1.93% 1.91% 1.93% 1.94% 1.91% 1.91% 1.93% 1.90% MDD 13.70% 12.40% 13.60% 13.70% 12.40% 12.40% 13.70% 12.20%

Table 5. Performances for different models (all models are risk-adjusted by D3R risk measure. accnt. and ann. are abbreviations for account and annualized respectively). DDR IPM DAM BCM IPM+DAM IPM+BCM DAM+BCM IPM+DAM+BCM Final accnt. value 571790 574791 571717 571737 577833 576809 571973 578091 Ann. return 7.26% 7.54% 7.25% 7.25% 7.82% 7.72% 7.27% 7.84% Ann. volatility 12.81% 12.75% 12.79% 12.79% 12.75% 12.75% 12.81% 12.74% Sharpe ratio 0.57 0.59 0.57 0.57 0.61 0.61 0.57 0.62 Sortino ratio 0.79 0.83 0.79 0.79 0.86 0.85 0.79 0.86 VaR0.95 1.30% 1.27% 1.29% 1.30% 1.27% 1.27% 1.30% 1.27% CVaR0.95 1.94% 1.91% 1.93% 1.93% 1.91% 1.91% 1.94% 1.91% MDD 13.70% 12.40% 13.60% 13.70% 12.40% 12.40% 13.70% 12.40% Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

DrawDown (None) 9 drawdown 3.55 6 3.553

COST.OQ (60 Minutes) C:232.08 248 BuySell (True, 0.01) buy 240 sell 232.08232 224 216 208 200 192 184

CSCO.OQ (60 Minutes) C:48.74 BuySell (True, 0.01) 48.7449 buy sell 48 47 46 45 44 43 42

F.N (60 Minutes) C:9.60 BuySell (True, 0.01) 11.5 buy sell 11.0 10.5 10.0 9.609.5 9.0 8.5

GS.N (60 Minutes) C:191.63 BuySell (True, 0.01) 250 buy sell 240 230 220 210 200 191.63190

AIG.N (60 Minutes) C:43.18 BuySell (True, 0.01) 54 buy sell 52 50 48 46 44 43.18 42 40

AMGN.OQ (60 Minutes) C:202.93 205 BuySell (True, 0.01) 202.93 buy sell 200 195 190 185 180 175 170 165

CAT.N (60 Minutes) C:138.95 BuySell (True, 0.01) 155 buy sell 150 145

138.95140 135 130 125 120 115

.SPX (60 Minutes) C:2790.37 BuySell (True, 0.01) 2920 buy sell 2880 2840

2790.372800 2760 2720 2680 2640

May 2018 Jun 2018 Jul 2018 Aug 2018 Sep 2018 Oct 2018 Nov 2018 Dec 2018

Figure 4. Trading signals (last row is the S&P 500 index). quick_test-1547105898 quick_test-... init add init_all_ta...

dfcn

r

a l

r a c la s a quick_test-... c s init_1 beta1_power train_op beta2_power quick_test-... SGDRDecay ... 2 more Sum Mean_1

default-metrics_... init_1 default-grad_sum... input default-loss Const_2

s

r beta1_power o

s

n e t

8 beta2_power rs o

s r

n s dfcn r a

e 1 l o save t

a

s c

0 n s global_step...

1 e t

58 tenso 58

4 quick_test-...

Reshape_1 Identity transpose start range Mean shape while perm while gradients train_op Rank Const_1 histogram default-CAT.N-lo... default-AMGN.O... default-AIG.N-lo... default-GS.N-los... default-F.N-loss... default-CSCO.OQ-... default-COST.OQ-... delta while

sors

ten 7

0 ×

8 1

×

? 1

×

? × Identity_3 Exit_3 Identity_2 Exit_2

? asset-step-...

s s r

×

o r s

n

e

7 t o

?×7 3

s s

r TensorArra... TensorArra...

×

o n s 4 tensors 4

?

n e

e t s t s

r

r

?×95×3×7 o

s

4 o

n 4 s e gradients s t r

n o

s 4 e n e t t s 4 4 sor 4 ten

r r

a a

l l

a a

c c

s s alar y Reshape train_op Switch_3 scalar Switch_2 sc

save scalar shape while dfcn while TensorArraySt... while TensorArrayStac... while asset-step-wise-... SGDRDecay train_op while scalar r r la la init a a c c s s

Merge_3 Merge_2

r

scalar r

s

r a a

o

l l

?×1995 s scalar n

e t a a

2 r

c

2 tensors 2 tensors a c l

a s s c s 7 tensors Enter_3NextItera... Enter_2 NextItera... x mode TensorAr... TensorAr... train_op TensorArra... TensorArray size while size while global_step_defa... save

init

scalar scalar

Merge_1 NextItera... TensorArra... TensorArrayWr... TensorArray TensorArrayWrite...

scalar ×7

1 1

×

? ?× × 7 dfcn

stack strided_s... stack strided_s... stack strided_s... stack strided_s... stack strided_s... stack strided_s... stack strided_s... strided_slice concat[0-6] ExpandDim... stack_1 ExpandDim... stack_1 ExpandDim... concat[0-6] ExpandDim... stack_1 ExpandDim... concat[0-6] ExpandDim... stack_1 ExpandDim... concat[0-6] ExpandDim... stack_1 ExpandDim... concat[0-6] ExpandDim... stack_1 ExpandDim... concat[0-6] ExpandDim... stack_1 ExpandDim... concat[0-6] ExpandDim... mean_squa...ExpandDims[0-27] concat[0-6] mean_squa... mean_squared_er... mean_squared_er... mean_squared_er... mean_squared_er... mean_squared_er... mean_squared_er... concat_7 stack_2 gradients stack_2 gradients Identity gradients stack_2 gradients Identity gradients stack_2 gradients Identity gradients stack_2 gradients Identity gradients stack_2 gradients Identity gradients stack_2 gradients Identity gradients COST.OQ concat_8 ExpandDim... mean_squa... Identity Identity Identity Identity Identity Identity Identity concat[0-6] concat_9 mean_squa... strided_slic... gradients mean_squa...... 18 more ... 4 more ?×3

strided_slic... COST.OQ add[0-6] strided_slic... Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization Identity strided_slic... strided_slic...

3 strided_slic... ×

? ... 2 more

mean_squa... strided_slic... stack strided_s... add[0-6] stack_1 ExpandDim... Identity strided_slice_1 strided_slic... stack_2 gradients Reshape_1 strided_slic... Identity strided_slic...... 3 more

?×95×3 7

× Squeeze

7 3

× COST.OQ ×

3

5 gradients

× 9

5 CSCO.OQ × Identity_1

9 ?

× ? Identity input-module F. N dfcn GS.N

?×95×1×3 AIG.N ... 3 more Identity gradients dfcn M

axis concat_7 ?×3 ?×3 ?×3 ?×3 ?×3 ?×3 ExpandDim... ?×1 ?×1 ?×1 ?×1 ?×1 ?×1 Identity ?×95×1×32 Identity gradients dfcn R-1 axis concat_8 ExpandDim... TensorArra... Identity ?×95×1×64 input-module gradients Identity residual-1 axis concat_9 ExpandDim... TensorArra... Identity ?×95×1×32 Identity gradients dfcn R-2 ?×95×1×64 Identity residual-2 gradients scalar 32 1× × 5 9 × ? Enter_1NextItera... add Identity Reshape y gradients dfcn R-3 ?×95×3×7

concat[0-6] input-module add[0-6] input-module add[0-6] input-module add[0-6] input-module add[0-6] input-module add[0-6] input-module add[0-6]

?×95×1×64 ExpandDim... ExpandDim... ExpandDim... ExpandDim... ExpandDim... ExpandDim... ExpandDim... strided_slic... mean_squared_er... Identity strided_slic... gradients Identity strided_slic... gradients Identity strided_slic... gradients Identity strided_slic... gradients Identity strided_slic... gradients Identity strided_slic... gradients gradients CSCO.OQ gradients strided_slice_3 F. N gradients strided_slice_5 GS.N gradients strided_slice_7 AIG.N gradients strided_slice_9 AMGN.OQ gradients strided_slice_11 CAT.N gradients strided_slice_13 Identity residual-3 gradients Identity dfcn Identity dfcn Identity dfcn Identity dfcn Identity dfcn Identity dfcn Identity

32 1× × 5 9 × ?

Identity gradients dfcn R-4 ?×95×1×64 Identity residual-4 gradients

32 1× × 5 9 × ?

Identity gradients dfcn R-5 ?×95×1×64 ?×95×1×32 Identity residual-5 gradients

32 ?×95×1×32 1× × 5 9 ×

?

7 ×

?×95×1×32

2

3

3

× ×

5 Identity 1 9 gradients

× × F-6

5 ? dfcn

9

×

?

2

3

×

1

×

5

9

× ?

Identity gradients

dfcn F-5

2

3

×

1

×

5

9

× ?

Identity

r gradients

a

l dfcn F-4 a

c

s

2

3

×

1

×

5

9

× ?

Identity gradients

dfcn F-3

2

3

×

1

×

5

9

× ? input-module

Identity F-2 gradients

s

r dfcn

o

s

n

e t

2

scalar Figure 6. The network structure of WaveNet.

Identity_1 Exit Exit_1 2 tensors Identity 263 tensors

gradients gradients gradients

7

×

3

×

5

r

9 a l

×

a

c ? s 7 ×3×

Switch_1 ?×95 Switch scalar

r

a

l

a

c

s

r a l a c s scalar LoopCond

gradients scalar Less

y

r

a

l

a s c a l a

r c s s r

o

s

n

e t

2 Merge

Enterscalar Const

train_op beta1_power save init

train_op beta2_power save init

train_op quick_test-15471... save init