Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

Pengqian Yu*1, Joon Sern Lee*1, Ilya Kulyatin1, Zekun Shi1, Sakyasingha Dasgupta1

*Equal contribution. 1 Neuri PTE LTD, One George Street, #22-01, Singapore 049145. Correspondence to: Sakyasingha Dasgupta.

Abstract

Dynamic portfolio optimization is the process of sequentially allocating wealth to a collection of assets in some consecutive trading periods, based on investors' return-risk profile. Automating this process with machine learning remains a challenging problem. Here, we design a deep reinforcement learning (RL) architecture with an autonomous trading agent such that investment decisions and actions are made periodically, based on a global objective, with autonomy. In particular, without relying on a purely model-free RL agent, we train our trading agent using a novel RL architecture consisting of an infused prediction module (IPM), a generative adversarial data augmentation module (DAM) and a behavior cloning module (BCM). Our model-based approach works with both on-policy and off-policy RL algorithms. We further design a back-testing and execution engine which interacts with the RL agent in real time. Using historical real financial market data, we simulate trading with practical constraints, and demonstrate that our proposed model is robust, profitable and risk-sensitive, as compared to baseline trading strategies and model-free RL agents from prior work.

1. Introduction

Reinforcement learning (RL) consists of an agent interacting with the environment in order to learn an optimal policy by trial and error for sequential decision-making problems (Bertsekas, 2005; Sutton & Barto, 2018). The past decade has witnessed the tremendous success of deep reinforcement learning in the fields of gaming, robotics and recommendation systems (Lillicrap et al., 2015; Silver et al., 2016; Mnih et al., 2015; 2016). However, its applications in the financial domain have not been explored as thoroughly.

Dynamic portfolio optimization remains one of the most challenging problems in the field of finance (Markowitz, 1959; Haugen & Haugen, 1990). It is a sequential decision-making process of continuously reallocating funds into a number of different financial investment products, with the main aim of maximizing return while constraining risk. Classical approaches to this problem include dynamic programming and convex optimization, which require discrete actions and thus suffer from the 'curse of dimensionality' (e.g., (Cover, 1991; Li & Hoi, 2014; Feng et al., 2015)).

There have been efforts to apply RL techniques to alleviate the dimensionality issue in the portfolio optimization problem (Moody & Saffell, 2001; Dempster & Leemans, 2006; Cumming et al., 2015; Jiang et al., 2017; Deng et al., 2017; Guo et al., 2018; Liang et al., 2018). The main idea is to train an RL agent that is rewarded if its investment decisions increase the logarithmic rate of return and is penalized otherwise. However, these RL algorithms have several drawbacks. In particular, the approaches in (Moody & Saffell, 2001; Dempster & Leemans, 2006; Cumming et al., 2015; Deng et al., 2017) only yield discrete single-asset trading signals. The multi-asset setting was studied in (Guo et al., 2018); however, the authors did not take transaction costs into consideration, thus limiting the practical usage of their method. In recent studies (Jiang et al., 2017; Liang et al., 2018), transaction costs were considered, but these works did not address the challenge of having insufficient data in financial markets for training robust machine learning algorithms. Moreover, the methods proposed in (Jiang et al., 2017; Liang et al., 2018) directly apply a model-free RL algorithm that is sample inefficient and does not account for the stability and risk issues caused by the non-stationary financial market environment.

In this paper, we propose a novel model-based RL approach that takes into account practical trading restrictions such as transaction costs and order executions, to stably train an autonomous agent whose investment decisions are risk-averse yet profitable.

We highlight our three main contributions to realize a model-based RL algorithm for our problem setting. Our first contribution is an infused prediction module (IPM), which incorporates the prediction of expected future observations into state-of-the-art RL algorithms. Our idea is inspired by earlier attempts to merge prediction methods with RL. For example, RL has been successful in predicting the behavior of simple gaming environments (Oh et al., 2015). In addition, prediction-based models have also been shown to improve the performance of RL agents in distributing energy over a smart power grid (Marinescu et al., 2017). In this paper, we explore two prediction models: a nonlinear dynamic Boltzmann machine (Dasgupta & Osogami, 2017) and a variant of parallel WaveNet (van den Oord et al., 2018). These models make use of historical prices of all assets in the portfolio to predict the future price movements of each asset, in a codependent manner. These predictions are then treated as additional features that can be used by the RL agent to improve its performance. Our experimental results show that using IPM provides significant performance improvements over baseline RL algorithms in terms of Sharpe ratio (Sharpe, 1966), Sortino ratio (Sortino & Price, 1994), maximum drawdown (MDD, see (Chekhlov et al., 2005)), value-at-risk (VaR, see (Artzner et al., 1999)) and conditional value-at-risk (CVaR, see (Rockafellar et al., 2000)).

Our second contribution is a data augmentation module (DAM), which makes use of a generative adversarial network (GAN, e.g., (Goodfellow et al., 2014)) to generate synthetic market data. This module is motivated by the fact that financial markets have limited data. To illustrate this, consider the case where new portfolio weights are decided by the agent on a daily basis. In such a scenario, which may not be uncommon, the size of the training set for a particular asset over the past 10 years is only around 2530, due to the fact that there are only about 253 trading days a year. Clearly, this is an extremely small dataset that may not be sufficient for training a robust RL agent. To overcome this difficulty, we train a recurrent GAN (Esteban et al., 2017) using historical asset data to produce realistic multi-dimensional time series. Different from the objective function in (Goodfellow et al., 2014), we explicitly consider the maximum mean discrepancy (MMD, see (Gretton et al., 2007)) in the generator loss, which further minimizes the distribution mismatch between real and generated data distributions. We show that DAM helps to reduce over-fitting and typically leads to a portfolio with less volatility.

Our third contribution is a behavior cloning module (BCM), which provides one-step greedy expert demonstrations to the RL agent. Our idea comes from the imitation learning paradigm (also called learning from demonstrations), whose most common form, behavior cloning, learns a policy through supervision provided by expert state-action pairs. In particular, the agent receives examples of behavior from an expert and attempts to solve a task by mimicking the expert's behavior, e.g., (Bain & Sommut, 1999; Abbeel & Ng, 2004; Ross et al., 2011). In RL, an agent attempts to maximize expected reward through interaction with the environment. Our proposed BCM combines aspects of conventional RL algorithms and imitation learning to solve complex tasks. This technique is similar in spirit to the work in (Nair et al., 2018). The difference is that we create the expert behavior based on a one-step greedy strategy by solving an optimization problem that maximizes immediate rewards in the current time step. Additionally, we only update the actor with respect to its auxiliary behavior cloning loss in an actor-critic algorithm setting. We demonstrate that BCM can prevent large changes in portfolio weights and thus keep the volatility low, while also increasing returns in some cases.

To the best of our knowledge, this is the first work that leverages the deep RL state of the art, extends it to a model-based setting and integrates it into the financial domain. Even though our proposed approach has been rigorously tested on an off-policy RL algorithm (in particular, the deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2015)), these concepts can be easily extended to on-policy RL algorithms such as the proximal policy optimization (PPO) (Schulman et al., 2017) and trust region policy optimization (Schulman et al., 2015) algorithms. We showcase the overall algorithm for model-based PPO for portfolio management and the corresponding results in the supplementary material. Additionally, we also provide algorithms for differential risk sensitive deep RL for portfolio optimization in the supplementary material. For the rest of the main paper, our discussion is centered around how our three contributions improve the performance of the off-policy DDPG algorithm.

This paper is organized as follows. In Section 2, we review the deep RL literature and formulate the portfolio optimization problem as a deep RL problem. We describe the structure of our automatic trading system in Section 3. Specifically, we provide details of the infused prediction module, data augmentation module and behavior cloning module in Sections 3.2 to 3.4. In Section 4, we report numerical experiments that serve to illustrate the effectiveness of the methods described in this paper. We conclude in Section 5.

2. Preliminaries and Problem Setup

In this section, we briefly review the literature of deep reinforcement learning and introduce the mathematical formulation of the dynamic portfolio optimization problem.

A Markov Decision Process (MDP) is defined as a 6-tuple $\langle T, \gamma, \mathcal{S}, \mathcal{A}, P, r \rangle$. Here, $T$ is the (possibly infinite) decision horizon; $\gamma \in (0, 1]$ is the discount factor; $\mathcal{S} = \bigcup_t \mathcal{S}_t$ is the state space and $\mathcal{A} = \bigcup_t \mathcal{A}_t$ is the action space, both assumed to be finite dimensional and continuous; $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0, 1]$ is the transition kernel and $r : \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function. A policy is a mapping $\mu : \mathcal{S} \rightarrow \mathcal{A}$, specifying the action to choose in a particular state.

At each time step $t \in \{1, \ldots, T\}$, the agent in state $s_t \in \mathcal{S}_t$ takes an action $a_t = \mu(s_t) \in \mathcal{A}_t$, receives the reward $r_t$ and transits to the next state $s_{t+1}$ according to $P$. The agent's objective is to maximize its expected return given the start distribution, $J^{\mu} \triangleq \mathbb{E}_{s_t \sim P, a_t \sim \mu}[\sum_{t=1}^{T} \gamma^{t-1} r_t]$. The state-action value function, or the Q value function, is defined as $Q^{\mu}(s_t, a_t) \triangleq \mathbb{E}_{s_{i>t} \sim P, a_{i>t} \sim \mu}[\sum_{i=t}^{T} \gamma^{i-t-1} r_i \mid s_t, a_t]$.

The deep deterministic policy gradient (DDPG) algorithm (Lillicrap et al., 2015) is an off-policy model-free reinforcement learning algorithm for continuous control which utilizes large function approximators such as deep neural networks. DDPG is an actor-critic method, which bridges the gap between policy gradient methods and value function approximation methods for RL. Intuitively, DDPG learns a state-action value function (critic) by minimizing the Bellman error, while simultaneously learning a policy (actor) by directly maximizing the estimated state-action value function with respect to the network parameters.

In particular, DDPG maintains an actor function $\mu(s)$ with parameters $\theta^{\mu}$, a critic function $Q(s, a)$ with parameters $\theta^{Q}$, and a replay buffer $R$ as a set of tuples $(s_t, a_t, r_t, s_{t+1})$ for each experienced transition. DDPG alternates between running the policy to collect experiences (i.e., training roll-outs) and updating the parameters. In our implementation, training roll-outs were conducted with noise added to the policy network's parameter space to encourage exploration (Plappert et al., 2017). During each training step, DDPG samples a minibatch consisting of $N$ tuples from $R$ to update the actor and critic networks. DDPG minimizes the following loss $L$ w.r.t. $\theta^{Q}$ to update the critic: $L = \frac{1}{N}\sum_{i=1}^{N} \left( r_i + \gamma Q(s_{i+1}, \mu(s_{i+1})) - Q(s_i, a_i|\theta^{Q}) \right)^2$. The actor parameters $\theta^{\mu}$ are updated using the policy gradient $\nabla_{\theta^{\mu}} J = \frac{1}{N}\sum_{i=1}^{N} \nabla_a Q(s, a|\theta^{Q})|_{s=s_i, a=\mu(s_i)} \, \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})|_{s=s_i}$. To stabilize learning, the Q value function is usually computed using a separate network (called the target network) whose weights are an exponential average over time of the critic network. This results in smoother target values.

Financial portfolio management is the process of constant redistribution of available funds to a set of financial assets. Our goal is to create a dynamic portfolio allocation system that periodically generates investment decisions and then acts on these decisions autonomously. Following (Jiang et al., 2017), we consider a portfolio of $m + 1$ assets, including $m$ risky assets and 1 risk-free asset (e.g., broker cash balance or U.S. treasury bond). We introduce the following notation: given a matrix $g$, we denote the $i$-th row of $g$ by $g_{i,:}$ and the $j$-th column by $g_{:,j}$. We denote the closing, high and low price vectors of trading period $t$ as $p_t$, $p^h_t$ and $p^l_t$, where $p_{i,t}$ is the closing price of the $i$-th asset in the $t$-th period. In this paper, we choose the first asset to be risk-free cash, i.e., $p_{0,t} = p^h_{0,t} = p^l_{0,t} = 1$. We further define the price relative vector of the $t$-th trading period as $u_t \triangleq p_t \oslash p_{t-1} = (1, p_{1,t}/p_{1,t-1}, \ldots, p_{m,t}/p_{m,t-1})^{\top}$, where $\oslash$ denotes element-wise division. In addition, we let $h_{i,t} \triangleq (p_{i,t} - p_{i,t-1})/p_{i,t-1}$ denote the percentage change of closing price at time $t$ for asset $i$, and the spaces associated with its vector forms $h_{:,t}$ ($h_{i,:}$) are $\mathcal{H}_{:,t} \subset \mathbb{R}^m$ ($\mathcal{H}_{i,:} \subset \mathbb{R}^{k_1}$), where $k_1$ is the time embedding of the prediction model. We define $w_{t-1}$ as the portfolio weight vector at the beginning of trading period $t$, where its $i$-th element $w_{i,t-1}$ represents the proportion of asset $i$ in the portfolio after capital reallocation, and $\sum_{i=0}^{m} w_{i,t} = 1$ for all $t$. We initialize our portfolio with $w_0 = (1, 0, \ldots, 0)^{\top}$. Due to price movements in the market, at the end of the same period, the weights evolve according to $w'_t = (u_t \odot w_{t-1})/(u_t \cdot w_{t-1})$, where $\odot$ is element-wise multiplication. Our goal at the end of period $t$ is to reallocate the portfolio vector from $w'_t$ to $w_t$ by selling and buying relevant assets. Paying all commission fees, this reallocation action shrinks the portfolio value by a factor $\bar{c}_t \triangleq 1 - c \sum_{i=1}^{m} |w'_{i,t} - w_{i,t}|$, where $c$ is the transaction fee rate for purchasing and selling. In particular, we let $\rho_{t-1}$ denote the portfolio value at the beginning of period $t$ and $\rho'_t$ the value at the end. We then have $\rho_t = \bar{c}_t \rho'_t$. The immediate reward is the logarithmic rate of return defined by $r_t \triangleq \ln(\rho_t/\rho_{t-1}) = \ln(\bar{c}_t \rho'_t/\rho_{t-1}) = \ln(\bar{c}_t\, u_t \cdot w_{t-1})$.

We define the normalized close price matrix at time $t$ by $P_t \triangleq [\,p_{t-k_2+1} \oslash p_t \mid p_{t-k_2+2} \oslash p_t \mid \cdots \mid p_{t-1} \oslash p_t \mid \mathbf{1}\,]$, where $\mathbf{1} \triangleq (1, 1, \ldots, 1)^{\top}$ and $k_2$ is the time embedding. The normalized high price matrix is defined by $P^h_t \triangleq [\,p^h_{t-k_2+1} \oslash p_t \mid p^h_{t-k_2+2} \oslash p_t \mid \cdots \mid p^h_{t-1} \oslash p_t \mid p^h_t \oslash p_t\,]$, and the low price matrix $P^l_t$ can be defined similarly. We further define the price tensor as $Y_t \triangleq [P_t\ P^h_t\ P^l_t]$. Our objective is to design an RL agent that observes the state $s_t \triangleq (Y_t, w_{t-1})$ and takes a sequence of actions (portfolio weights) $a_t = w_t$ over time such that the final portfolio value $\rho_T = \rho_0\, \mathbb{E}_{s_t \sim P, a_t \sim \mu}[\exp(\sum_{t=1}^{T} \gamma^{t-1} r_t)]$ is maximized.
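To make the transition and reward above concrete, the following is a minimal sketch (ours, not taken from the paper's implementation) of one rebalancing step using NumPy; the 20-basis-point cost rate and the toy prices are illustrative assumptions.

```python
import numpy as np

def rebalance_step(p_prev, p_curr, w_prev, w_target, rho_prev, c=0.002):
    """One trading period: price move, reallocation cost and log-return reward.

    p_prev, p_curr : close prices at t-1 and t (index 0 is cash, always 1.0)
    w_prev         : weights w_{t-1} held over period t (sum to 1)
    w_target       : new weights w_t chosen at the end of period t
    rho_prev       : portfolio value at the beginning of period t
    c              : proportional transaction cost (20 bps = 0.002)
    """
    u_t = p_curr / p_prev                                # price relative vector u_t
    rho_drift = rho_prev * np.dot(u_t, w_prev)           # value before rebalancing, rho'_t
    w_drift = (u_t * w_prev) / np.dot(u_t, w_prev)       # weights after the market move, w'_t
    c_bar = 1.0 - c * np.abs(w_target[1:] - w_drift[1:]).sum()  # shrinkage factor c_bar_t
    rho_t = c_bar * rho_drift
    reward = np.log(rho_t / rho_prev)                    # r_t = ln(c_bar_t * u_t . w_{t-1})
    return rho_t, w_target, reward

# toy example: cash plus two risky assets
p_prev = np.array([1.0, 100.0, 50.0])
p_curr = np.array([1.0, 102.0, 49.0])
w_prev = np.array([0.2, 0.5, 0.3])
w_new  = np.array([0.1, 0.6, 0.3])
print(rebalance_step(p_prev, p_curr, w_prev, w_new, 500000.0))
```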

3. System Design and Data

In this section, we discuss the detailed design of our proposed RL-based automatic trading system.

[Figure 1. Trading framework.]

The trading framework referenced in this paper is represented in Figure 1 and is a modular system composed of a data handler (DH), an algorithm engine (AE) and a market simulation engine (MSE). The DH retrieves market data and deals with the required data transformations. It is designed for continuous data ingestion in order to provide the AE with the required set of information. The AE is a collection of models containing RL agents and environment specifications. We refer the reader to Algorithm 1 in the supplementary material for further details. The MSE is an online event-driven module that provides feedback on executed trades, which can eventually be used by the AE to compute rewards. In addition, it also executes investment decisions made by the AE. The strategy applied in this study is an asset allocation system that rebalances the available capital between the provided set of assets and a cash asset on a daily frequency.

The data used in this paper is a mix of U.S. equities on tick level (trade by trade) aggregated to form open-high-low-close (OHLC) bars on an hourly frequency. We use data from Refinitiv DataScope, with experiments carried out on the following U.S. equities: Costco Wholesale Corporation, Cisco Systems, Ford Motors, Goldman Sachs, American International Group and Caterpillar. The selection is qualitative, with a balanced mix of volatile and less volatile stocks. An additional asset is cash (in U.S. dollars), representing the amount of capital not invested in any other asset. Furthermore, a generic market variable (the S&P 500 index) has been added as an additional feature. The data is shared in supplementary files, which will be made available publicly later. We execute orders hourly, while the agent produces decisions daily.

In the financial domain, it is common to use benchmark strategies to evaluate the relative profitability and risk profile of the tested strategies. A common benchmark strategy is the constantly rebalanced portfolio (CRP), where at each period the portfolio is rebalanced to the initial wealth distribution among the m + 1 assets including the cash. This strategy can be seen as exploiting the mean-reverting nature of stock prices, as it sells those that gained value while buying more of those losing value. In (Cover, 1991) it is shown that such a strategy is asymptotically the best for stationary stochastic processes such as stock prices, offering exponential returns; it is, therefore, an optimistic benchmark to compare against. Transaction fees c have been fixed at a conservative level of 20 basis points (one basis point is equivalent to 0.01%) and, given the use of market orders, an additional 50 basis points of slippage is applied. Slippage is defined as the relative deviation between the price at which the orders get executed and the price at which the agent produced the reallocation action. This conservative level of slippage is due to the lack of equity volume data, as during less liquid trading periods the agent's market orders might influence the market price and obtain a worse trade execution than expected in a simulated environment.

Performance monitoring is based on a set of evaluation measures such as Sharpe and Sortino ratios, value-at-risk (VaR), conditional value-at-risk (CVaR), maximum drawdown (MDD), annualized volatility and annualized returns. Let $Y$ denote a bounded random variable. The Sharpe ratio of $Y$ is defined as $\mathrm{SR} \triangleq \mathbb{E}[Y]/\sqrt{\mathrm{var}[Y]}$. The Sharpe ratio, representing the reward per unit of risk, has been recognized as not entirely desirable since it is a symmetric measure of risk and, hence, penalizes low-cost events. The Sortino ratio, VaR and CVaR are risk measures which gained popularity for taking into consideration only the unfavorable part of the return distribution, or, equivalently, unwanted high cost. The Sortino ratio is defined similarly to the Sharpe ratio, though replacing the standard deviation $\sqrt{\mathrm{var}[Y]}$ with the downside deviation. The $\mathrm{VaR}_{\alpha}$ with level $\alpha \in (0, 1)$ of $Y$ is the $(1-\alpha)$-quantile of $Y$, and $\mathrm{CVaR}_{\alpha}$ at level $\alpha$ is the expected return of $Y$ in the worst $(1-\alpha)$ of cases.
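As a concrete reference for these evaluation measures, here is a minimal sketch (ours, with simplified conventions) that computes them from a series of periodic returns; the 252-day annualization factor and the downside-deviation convention are assumptions rather than choices stated in the paper.

```python
import numpy as np

def evaluation_measures(returns, alpha=0.95, periods_per_year=252):
    """Sharpe, Sortino, VaR, CVaR, MDD and annualized return/volatility for a
    series of periodic simple returns (simplified conventions)."""
    mean, std = returns.mean(), returns.std()
    downside = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))   # downside deviation
    q = np.quantile(returns, 1 - alpha)            # (1 - alpha)-quantile of returns
    var = -q                                       # VaR_alpha reported as a positive loss
    cvar = -returns[returns <= q].mean()           # expected loss in the worst (1 - alpha) cases
    wealth = np.cumprod(1.0 + returns)
    mdd = np.max(1.0 - wealth / np.maximum.accumulate(wealth))   # maximum drawdown
    return {
        "sharpe": mean / std,
        "sortino": mean / downside,
        f"VaR_{alpha}": var,
        f"CVaR_{alpha}": cvar,
        "MDD": mdd,
        "ann_return": mean * periods_per_year,
        "ann_volatility": std * np.sqrt(periods_per_year),
    }

daily = np.random.default_rng(0).normal(5e-4, 1e-2, size=500)    # toy return series
print(evaluation_measures(daily))
```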
3.2. Infused Prediction Module

For the IPM, we implemented and evaluated two different multivariate prediction models, trained in an online manner and differing in their computational complexity. As the most time-efficient model, we implemented the nonlinear dynamic Boltzmann machine (Dasgupta & Osogami, 2017) (NDyBM), which predicts the future price of each asset conditioned on the history of all assets in the portfolio. As the NDyBM does not require backpropagation through time, it has a parameter update time complexity of O(1). This makes it very suitable for fast computation in online time-series prediction scenarios. However, the NDyBM assumes that the inputs come from an underlying Gaussian distribution. In order to make the IPM more generic, we also implemented a novel prediction model using dilated convolution layers inspired by the WaveNet architecture (van den Oord et al., 2018). As there was no significant difference in predictive performance between the two models, in the rest of the paper we provide results with the faster NDyBM-based IPM module. Details of our WaveNet-inspired architecture can be found in supplementary Section D.

We use a state-space augmentation technique and construct the augmented state space $\tilde{\mathcal{S}} \triangleq \mathcal{S}_t \times \mathcal{H}_{:,t+1} \times \mathcal{H}_{:,t+1} \times \mathcal{H}_{:,t+1}$: each state is now a pair $\tilde{s}_t \triangleq (s_t, x_{t+1})$, where $s_t \in \mathcal{S}_t$ is the original state and $x_{t+1} \triangleq (h_{:,t+1}, h^h_{:,t+1}, h^l_{:,t+1}) \in \mathcal{X} \subset \mathbb{R}^{m \times 3}$, with $h_{:,t+1}, h^h_{:,t+1}, h^l_{:,t+1} \in \mathcal{H}_{:,t+1}$ being the predicted future close, high and low asset percentage price changes.

The NDyBM can be seen as a Gaussian Boltzmann machine unfolded over an infinite time horizon, i.e., $T \rightarrow \infty$ history, that generalizes a standard vector auto-regressive model with eligibility traces and a nonlinear transformation of historical data (Dasgupta & Osogami, 2017). It represents the conditional probability density of $x^{[t]}$ given $x^{[:t-1]}$ as $p(x^{[t]}|x^{[:t-1]}) = \prod_{j=1}^{N} p_j(x_j^{[t]}|x^{[:t-1]})$, where each factor of the right-hand side denotes the conditional probability density of $x_j^{[t]}$ given $x^{[:t-1]}$ for $j = 1, \ldots, N$, and $N = m \times 3$ is the number of units in the NDyBM. (For mathematical convenience, $x_t$ and $x^{[t]}$ are used interchangeably.) Each $x_j^{[t]}$ is assumed to have a Gaussian distribution:
$$p_j(x_j^{[t]}|x^{[:t-1]}) = \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\!\left(-\frac{(x_j^{[t]} - \mu_j^{[t]})^2}{2\sigma_j^2}\right).$$
Here, $\mu^{[t]} \triangleq (\mu_j^{[t]})_{j=1,\ldots,N}$ is the vector of expected values of the units at time $t$ given the history up to $t-1$, represented as
$$\mu^{[t]} = b + \sum_{\delta=1}^{d-1} F^{[\delta]} x^{[t-\delta]} + \sum_{k=1}^{K} G_k\, \alpha_k^{[t-1]},$$
where $b \triangleq (b_j)_{j=1,\ldots,N}$ is a bias vector, $\alpha_k^{[t-1]} \triangleq (\alpha_{j,k}^{[t-1]})_{j=1,\ldots,N}$ are $K$ eligibility trace vectors, $d_{i,j}$ is the time delay between connections $(i, j)$, and $F^{[\delta]} \triangleq (f_{i,j})_{(i,j)\in\{1,\ldots,N\}^2}$ for $0 < \delta < d$ and $G_k \triangleq (g_{i,j,k})_{(i,j)\in\{1,\ldots,N\}^2}$ for $k = 1,\ldots,K$ are $N \times N$ weight matrices. The eligibility traces can be updated recursively as $\alpha_{i,j,k}^{[t]} = \lambda_k\, \alpha_{i,j,k}^{[t-1]} + x_i^{[t-d_{i,j}+1]}$, where $\lambda_k$ is a fixed decay rate defined for each of the column vectors. Additionally, the bias parameter vector $b$ is updated at each time step using an RNN layer that computes a nonlinear feature map of the past time series. The output weights from the RNN to the bias layer, along with the other NDyBM parameters (bias, weight matrices and variance), are updated online using a stochastic gradient method.

Following (Dasgupta & Osogami, 2017) (details of the learning rule and the derivation of the model are as per the original paper; algorithm steps and hyper-parameter settings are provided in the supplementary material), the NDyBM is trained to predict the next time-step close, high and low percentage change for each asset conditioned on the history of all other assets, such that the log-likelihood of each time series is maximized. As such, the NDyBM parameters are updated at each step $t$ following the gradient of the conditional probability density of $x^{[t]}$: $\nabla_{\theta} \log p(x^{[t]}|x^{[-\infty,t-1]}) = \sum_{i=1}^{N} \nabla_{\theta} \log p_i(x_i^{[t]}|x^{[-\infty,t-1]})$.
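For concreteness, the following is a simplified sketch (ours) of the NDyBM mean prediction and eligibility-trace update described above; it assumes a single uniform delay d for all connections and omits parameter learning and the RNN-updated bias.

```python
import numpy as np

class NDyBMSketch:
    """Minimal sketch of the NDyBM mean prediction and eligibility-trace update."""

    def __init__(self, n_units, delay=3, decay_rates=(0.1, 0.2, 0.5, 0.8), seed=0):
        rng = np.random.default_rng(seed)
        self.d = delay
        self.lam = np.array(decay_rates)                    # lambda_k, K decay rates
        self.b = np.zeros(n_units)                          # bias vector b
        self.F = 0.01 * rng.standard_normal((delay - 1, n_units, n_units))          # F^[delta]
        self.G = 0.01 * rng.standard_normal((len(decay_rates), n_units, n_units))   # G_k
        self.alpha = np.zeros((len(decay_rates), n_units))  # eligibility traces alpha_k
        self.history = [np.zeros(n_units) for _ in range(delay - 1)]  # x^[t-1], ..., x^[t-d+1]

    def predict_mean(self):
        # mu^[t] = b + sum_delta F^[delta] x^[t-delta] + sum_k G_k alpha_k^[t-1]
        mu = self.b.copy()
        for delta in range(1, self.d):
            mu += self.F[delta - 1] @ self.history[delta - 1]
        for k in range(len(self.lam)):
            mu += self.G[k] @ self.alpha[k]
        return mu

    def update(self, x_t):
        # alpha_k^[t] = lambda_k * alpha_k^[t-1] + x^[t-d+1]  (uniform delay d assumed)
        self.alpha = self.lam[:, None] * self.alpha + self.history[-1]
        self.history = [x_t] + self.history[:-1]

model = NDyBMSketch(n_units=24)          # 3 x 8 units: close/high/low of 8 assets
print(model.predict_mean().shape)        # (24,)
model.update(np.random.randn(24))
```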

In the spirit of providing the agent with additional market signals and removing non-Markovian dynamics in the model, along with the prediction, we further augment the state space with a market index performance indicator. The state space now has the form $\tilde{s}_t \triangleq (s_t, x_{t+1}, I_t)$, where $I_t \in \mathbb{R}_+$ is the market index performance indicator for time step $t$.

3.3. Data Augmentation Module

The limited historical financial data prevents us from scaling up the deep RL agent. To mitigate this issue, we augment the data set via the recurrent generative adversarial network (RGAN) framework (Esteban et al., 2017) and generate synthetic time series. The RGAN follows the architecture of a regular GAN, with both the generator and the discriminator substituted by recurrent neural networks. Specifically, we generate the percentage change of closing price for each asset separately at a higher frequency than what the RL agent uses. We then downsample the generated series to obtain the synthetic series of high, low, close (HLC) triplet vectors $x_t$. This approach avoids the non-stationarity in market price dynamics by generating a stationary percentage change series instead, and the generated HLC triplet is guaranteed to maintain the correct relationship (i.e., generated highs are higher than generated lows).

We assume implicitly that the $i$-th asset's percentage change vector $h_{i,:}$ follows a distribution: $h_{i,:} \sim p^i_{\mathrm{data}}(h_{i,:})$. Let $H$ be the hidden dimension. Our goal is to find a parameterized function $G^i_{\psi}$ such that, given a noise prior $p_z(z)$, $z \in \mathbb{R}^{k_1 \times H}$, the generated distribution $p^i_g(G^i_{\psi}(z))$ is empirically similar to the data distribution $p^i_{\mathrm{data}}(h_{i,:})$. Formally, given two batches of observations $\{G^i_{\psi}(z)^{(j)}\}_{j=1}^{b}$, $\{h_{i,:}^{(j)}\}_{j=1}^{b}$ from distributions $p^i_g$ and $p^i_{\mathrm{data}}$, where $b$ is the batch size, we want $p^i_g \approx p^i_{\mathrm{data}}$ under a certain similarity measure of distributions.

One suitable choice of such a similarity measure is the maximum mean discrepancy (MMD) (Gretton et al., 2012). Following (Li et al., 2015), we can show (see supplementary material) that with an RBF kernel, minimizing $\widehat{\mathrm{MMD}}_b^2$, the biased estimator of squared MMD, results in matching all moments between the two distributions $p_a$, $p_b$.

As discussed in previous works (Arjovsky et al., 2017; Arjovsky & Bottou, 2017), vanilla GANs suffer from the problem that the discriminator becomes perfect when the real and the generated probabilities have disjoint supports (which is often the case under the hypothesis that real-world data lies on low-dimensional manifolds). This could lead to vanishing generator gradients, making training difficult. Furthermore, the generator can suffer from the mode collapse issue, where it succeeds in tricking the discriminator but the generated samples have low variation.

In our RGAN architecture, we model each asset $i$ separately by a pair of parameterized functions $D^i_{\phi}$, $G^i_{\psi}$, and we use $\widehat{\mathrm{MMD}}^2$, the unbiased estimator of squared MMD between $p^i_g$ and $p^i_{\mathrm{data}}$, as a regularizer for the generator $G^i_{\psi}$, such that the generator not only tries to 'trick' the discriminator into classifying its output as coming from $p^i_{\mathrm{data}}$, but also tries to match $p^i_g$ with $p^i_{\mathrm{data}}$ in all moments. This alleviates the aforementioned issues in the vanilla GAN: firstly, $\widehat{\mathrm{MMD}}^2$ is defined even when the distributions have disjoint supports, and secondly, the gradients provided by $\widehat{\mathrm{MMD}}^2$ do not depend on the discriminator but only on the real data. Specifically, both the discriminator $D^i_{\phi}$ and the generator $G^i_{\psi}$ are trained with stochastic gradient descent. The discriminator objective is
$$\max_{\phi} \frac{1}{b} \sum_{j=1}^{b} \left[ \log D^i_{\phi}(h_i^{(j)}) + \log\left(1 - D^i_{\phi}(G^i_{\psi}(z^{(j)}))\right) \right],$$
and the generator objective is
$$\min_{\psi} \frac{1}{b} \sum_{j=1}^{b} \left[ \log\left(1 - D^i_{\phi}(G^i_{\psi}(z^{(j)}))\right) \right] + \zeta\, \widehat{\mathrm{MMD}}^2,$$
given a batch of $b$ samples $\{z^{(j)}\}_{j=1}^{b}$ drawn independently from a diagonal Gaussian noise prior $p_z(z)$, and a batch of $b$ samples $\{h_i^{(j)}\}_{j=1}^{b}$ drawn from the data distribution $p^i_{\mathrm{data}}$ of the $i$-th asset.

In order to select the bandwidth parameter $\sigma$ in the RBF kernel, we set it to the median pairwise distance between the joint data. Both discriminator and generator networks are LSTMs (Hochreiter & Schmidhuber, 1997).

Although the generator directly minimizes the estimated $\widehat{\mathrm{MMD}}^2$, we go one step further and validate the RGAN by conducting a Kolmogorov-Smirnov (KS) test. Our results show that the RGAN is indeed generating data representative of the true underlying distribution. More details of the KS test can be found in the supplementary material.
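The MMD regularizer above can be summarized with a short sketch (ours) of the RBF-kernel squared-MMD estimator with the median-distance bandwidth heuristic; for simplicity the biased (V-statistic) form is shown, and the shapes are illustrative.

```python
import numpy as np

def rbf_mmd2(x, y, sigma=None):
    """Biased estimator of squared MMD with an RBF kernel between two batches
    x, y of shape (b, d); sigma defaults to the median pairwise distance."""
    z = np.vstack([x, y])
    sq = np.sum(z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * z @ z.T          # pairwise squared distances
    if sigma is None:
        sigma = np.sqrt(np.median(d2[d2 > 0]))            # median heuristic bandwidth
    k = np.exp(-d2 / (2 * sigma ** 2))
    b = len(x)
    k_xx, k_yy, k_xy = k[:b, :b], k[b:, b:], k[:b, b:]
    return k_xx.mean() + k_yy.mean() - 2 * k_xy.mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(128, 95))     # e.g., real percentage-change windows
fake = rng.normal(0.5, 1.0, size=(128, 95))     # generator output batch
print(rbf_mmd2(real, fake))                      # added to the generator loss, scaled by zeta
```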

3.4. Behavior Cloning Module

In finance, some investors may favor a portfolio with lower volatility over the investment horizon. To achieve this, we propose a novel method of behavior cloning, with the primary purpose of reducing portfolio volatility while maintaining reward-to-risk ratios. We remark that one can broadly dichotomize imitation learning into a passive collection of demonstrations (behavioral cloning) versus an active collection of demonstrations. The former setting (Abbeel & Ng, 2004; Ross et al., 2011) assumes that demonstrations are collected a priori and the goal of imitation learning is to find a policy that mimics the demonstrations. The latter setting (Daumé et al., 2009; Sun et al., 2017) assumes an interactive expert that provides demonstrations in response to actions taken by the current policy. Our proposed BCM is, in this definition, an active imitation learning algorithm.

In particular, for every step that the agent takes during training, we calculate the one-step greedy action in hindsight. This one-step greedy action is computed by solving an optimization problem, given the next time period's price ratios, the current portfolio distribution and transaction costs. The objective function is to maximize returns in the current time step, which is why the computed action is referred to as the one-step greedy action. For time step $t$, the optimization problem is as follows:
$$\max_{w_t} \; u_t \cdot w_t - c \sum_{i=1}^{m} |w_{i,t} - w_{i,t-1}| \quad \text{s.t.} \quad \sum_{i=0}^{m} w_{i,t} = 1, \; 0 \le w_{i,t} \le 1, \; \forall i. \qquad (1)$$

Solving the above optimization problem for $w_t$ yields an optimal expert greedy action denoted by $\bar{a}_t$. This one-step greedy action is then stored in the replay buffer together with the corresponding $(s_t, a_t, r_t, s_{t+1})$ pair. In each training iteration of the actor-critic algorithm, a mini-batch $\{(s_i, a_i, r_i, s_{i+1}, \bar{a}_i)\}_{i=1}^{N}$ is sampled from the replay buffer. Using the states $s_i$ that were sampled from the replay buffer, the actor's corresponding actions are computed and the log-loss between the actor's actions and the one-step greedy actions $\bar{a}_i$ is calculated:
$$\bar{L}^{\mu} = -\frac{1}{N(m+1)} \sum_{i=1}^{N} \sum_{j=0}^{m} \left[ \bar{a}_{i,j} \log(\mu(s_i))_j + (1 - \bar{a}_{i,j}) \log\left(1 - (\mu(s_i))_j\right) \right]. \qquad (2)$$

Gradients of the log-loss with respect to the actor network, $\nabla_{\theta^{\mu}} \bar{L}^{\mu}$, can then be calculated and used to perturb the weights of the actor slightly. Using this loss directly would prevent the learned policy from improving significantly beyond the demonstration policy, as the actor is always tied back to the demonstrations. To avoid this, a factor $\lambda$ is used to discount the gradients such that the actor network is only slightly perturbed towards the one-step greedy action, thus maintaining the stability of the underlying RL algorithm. Following this, the typical DDPG algorithm as described above is executed to train the agent.
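To illustrate problem (1), here is a small sketch (ours) that computes the one-step greedy expert action with an off-the-shelf solver; SLSQP and the toy inputs are our own choices, since the paper does not state which optimizer it uses.

```python
import numpy as np
from scipy.optimize import minimize

def one_step_greedy_action(u_next, w_prev, c=0.002):
    """Solve problem (1): maximize next-period return net of transaction costs
    over the probability simplex (cash is index 0 and incurs no cost)."""
    m1 = len(u_next)                                       # m + 1 assets including cash

    def neg_objective(w):
        return -(u_next @ w - c * np.abs(w[1:] - w_prev[1:]).sum())

    cons = ({'type': 'eq', 'fun': lambda w: w.sum() - 1.0},)
    bounds = [(0.0, 1.0)] * m1
    res = minimize(neg_objective, w_prev, bounds=bounds, constraints=cons,
                   method='SLSQP')
    return res.x

u_next = np.array([1.0, 1.02, 0.98])                       # next-period price relatives
w_prev = np.array([0.2, 0.4, 0.4])
a_bar = one_step_greedy_action(u_next, w_prev)
print(np.round(a_bar, 3))                                  # expert greedy action
```

The resulting greedy action is what gets stored in the replay buffer and later compared against the actor's output through the log-loss in (2).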
4. Experiments

All experiments were conducted using a backtesting environment (the algorithm of our model-based DDPG agent is detailed in supplementary material Section A), and the assets used are as described in Section 3. The time period spanning from 1 Jan 2005 to 31 Dec 2016 was used to train the agent, while the time period between 1 Jan 2017 and 4 Dec 2018 was used to test the agent. We initialize our portfolio with $500,000 in cash. We implement a CRP benchmark where funds are equally distributed among all assets, including the cash asset. We also compare against the previous work (Jiang et al., 2017) as a baseline, which can be considered a vanilla model-free DDPG algorithm (i.e., without data augmentation, infused prediction and behavioral cloning); in fact, our baseline is superior to the prior work (Jiang et al., 2017) due to the addition of prioritized experience replay and parameter noise for better exploration. It should be noted that, compared to the implementation in (Jiang et al., 2017), our baseline agent is situated in a more realistic trading environment that does not assume immediate trade execution. In particular, our backtesting engine executes market orders at the open of the next OHLC bar (see Section 3 for details on these terms), as well as adding slippage to the trading costs. Additional practical constraints are applied such that fractional trades of an asset are not allowed.

As shown in Table 1, adding both DAM and BCM to the baseline leads to a very small increase in Sharpe ratio (from 0.56 to 0.57) and Sortino ratio (from 0.78 to 0.79). There are two hypotheses that we can draw here. First, we hypothesize that the vanilla DDPG algorithm (in its usage within our framework) is not over-fitting to the data, since the use of data augmentation has a relatively small impact on the performance. Similarly, it could be hypothesized that the baseline is efficient in its use of available signals in the training set, such that the addition of behavior cloning by itself has a negligible improvement in performance.

Table 1. Performance for different models (the last column represents the IPM+DAM+BCM model; ann. and accnt. stand for annualized and account, respectively).

| | CRP | Baseline | IPM | DAM | BCM | IPM+DAM | IPM+BCM | DAM+BCM | All |
| Final accnt. value | 574859 | 570482 | 586150 | 571853 | 571578 | 575430 | 577293 | 571417 | 580899 |
| Ann. return | 7.25% | 7.14% | 8.64% | 7.26% | 7.24% | 7.60% | 7.77% | 7.22% | 8.09% |
| Ann. volatility | 12.65% | 12.79% | 14.14% | 12.80% | 12.79% | 12.84% | 12.81% | 12.77% | 12.77% |
| Sharpe ratio | 0.57 | 0.56 | 0.61 | 0.57 | 0.57 | 0.59 | 0.61 | 0.57 | 0.63 |
| Sortino ratio | 0.80 | 0.78 | 0.87 | 0.79 | 0.79 | 0.83 | 0.85 | 0.79 | 0.89 |
| VaR 0.95 | 1.27% | 1.30% | 1.41% | 1.30% | 1.30% | 1.29% | 1.28% | 1.30% | 1.27% |
| CVaR 0.95 | 1.91% | 1.93% | 2.11% | 1.94% | 1.93% | 1.93% | 1.92% | 1.93% | 1.91% |
| MDD | 13.10% | 13.80% | 12.20% | 13.70% | 13.70% | 12.70% | 12.60% | 13.70% | 12.40% |

By making use of just IPM, we observe a significant improvement over the baseline in terms of Sharpe (from 0.56 to 0.61) and Sortino (from 0.78 to 0.87) ratios, indicating that we are able to obtain more return per unit of risk taken. However, we note that the volatility of the portfolio is significantly increased (from 12.79% to 14.14%). This could be due to the agent over-fitting to the training set with the addition of this module. Therefore, when the agent sees the testing set, its actions could result in more frequent losses, thereby increasing volatility.

To reduce over-fitting, we integrate IPM with DAM. As can be seen in Table 1, the Sharpe ratio is actually reduced slightly compared to the IPM case (from 0.61 to 0.59), although volatility is significantly reduced (from 14.14% to 12.84%). This reduces the overall risk of the portfolio. Moreover, this substantiates our hypothesis of the agent over-fitting to the training set. One possible reason for IPM over-fitting to the training set while the baseline agent does not is the following: the IPM is inherently more complex due to the presence of a prediction model. The increased complexity in terms of a larger network structure essentially results in a higher probability of over-fitting to the training data. As such, using DAM to diversify the training set is highly beneficial in containing such model over-fitting.

One drawback of using IPM+DAM is the decreased return-to-risk ratios mentioned above. To mitigate this, we make use of all our contributions as a single framework (i.e., IPM+DAM+BCM). As shown in Table 1, we are not only able to recover the performance of the original model trained using just IPM, but surpass the IPM performance in terms of both Sharpe (from 0.61 to 0.63) and Sortino (from 0.87 to 0.89) ratios. It is worthwhile to note that the addition of BCM has a strong impact compared to IPM+DAM in reducing volatility (from 12.84% to 12.77%) and MDD (from 12.7% to 12.4%). The observation of behavior cloning reducing downside risk is further substantiated when comparing the model trained via IPM versus the model trained with IPM and BCM: the Sharpe and Sortino ratios are maintained though annualized volatility is significantly reduced. In addition, when comparing the baseline versus the baseline with BCM, we see that MDD is reduced and the Sharpe and Sortino ratios have small improvements.

We can conclude that IPM significantly improves portfolio management performance in terms of Sharpe and Sortino ratios. This is particularly attractive for investors who aim to maximize their returns per unit risk. In addition, DAM helps to prevent over-fitting, especially in the case of larger and more complex network architectures. However, it is important to note that this may impact risk-to-reward performance. BCM, as envisioned, helps to reduce portfolio risk, as seen by its ability to either reduce volatility or MDD across all runs. It is also interesting to note its ability to slightly improve the risk-to-return (reward) ratios. Enabling all three modules, our proposed model-based approach achieves significant performance improvement compared with the benchmark and baseline. In addition, we provide details on an additional risk adjustment module (that can adjust the reward function contingent on risk) with experimental results in the supplementary material. Usage of the presented modules with a PPO-based on-policy algorithm is also provided.

5. Conclusion

In this paper, we proposed a model-based deep reinforcement learning architecture to solve the dynamic portfolio optimization problem. To achieve a profitable and risk-sensitive portfolio, we developed infused prediction, GAN-based data augmentation and behavior cloning modules, and further integrated them into an automatic trading system. The stability and profitability of our proposed model-based RL trading framework were empirically validated in several independent experiments with real market data and practical constraints. Our approach is applicable not only to the financial domain, but also to general reinforcement learning domains which require practical consideration of decision-making risk and training data scarcity. Future work could test our model's performance on gaming and robotics applications.

References

Abbeel, P. and Ng, A. Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Arjovsky, M. and Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 214–223, International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR. URL http://proceedings.mlr.press/v70/arjovsky17a.html.

Artzner, P., Delbaen, F., Eber, J.-M., and Heath, D. Coherent measures of risk. Mathematical Finance, 9(3):203–228, 1999.

Bain, M. and Sommut, C. A framework for behavioural cloning. Machine Intelligence, 15(15):103, 1999.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 2005.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.

Borovykh, A., Bohte, S., and Oosterlee, C. W. Conditional time series forecasting with convolutional neural networks. arXiv preprint arXiv:1703.04691, 2017.

Chekhlov, A., Uryasev, S., and Zabarankin, M. Drawdown measure in portfolio optimization. International Journal of Theoretical and Applied Finance, 8(01):13–58, 2005.

Cover, T. M. Universal portfolios. Mathematical Finance, 1(1):1–29, 1991.

Cumming, J., Alrajeh, D. D., and Dickens, L. An investigation into the use of reinforcement learning techniques within the algorithmic trading domain. PhD thesis, 2015.

Dasgupta, S. and Osogami, T. Nonlinear dynamic Boltzmann machines for time-series prediction. In AAAI, pp. 1833–1839, 2017.

Daumé, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine Learning, 75(3):297–325, 2009.

Dempster, M. A. and Leemans, V. An automated FX trading system using adaptive reinforcement learning. Expert Systems with Applications, 30(3):543–552, 2006.

Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664, 2017.

Esteban, C., Hyland, S. L., and Rätsch, G. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.

Feng, Y., Palomar, D. P., and Rubio, F. Robust optimization of order execution. IEEE Transactions on Signal Processing, 63(4):907–920, Feb 2015. ISSN 1053-587X. doi: 10.1109/TSP.2014.2386288.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. A kernel two-sample test. Journal of Machine Learning Research, 13:723–773, 2012. URL http://jmlr.csail.mit.edu/papers/v13/gretton12a.html.

Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. J. A kernel method for the two-sample-problem. In Advances in Neural Information Processing Systems, pp. 513–520, 2007.

Guo, Y., Fu, X., Shi, Y., and Liu, M. Robust log-optimal strategy with reinforcement learning. arXiv preprint arXiv:1805.00205, 2018.

Haugen, R. A. and Haugen, R. A. Modern investment theory. 1990.

Heess, N., Hunt, J. J., Lillicrap, T. P., and Silver, D. Memory-based control with recurrent neural networks. arXiv preprint arXiv:1512.04455, 2015.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Jaeger, H. Adaptive nonlinear system identification with echo state networks. In Advances in Neural Information Processing Systems, pp. 609–616, 2003.

Jaeger, H. and Haas, H. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.

Jiang, Z., Xu, D., and Liang, J. A deep reinforcement learning framework for the financial portfolio management problem. arXiv preprint arXiv:1706.10059, 2017.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.

Li, B. and Hoi, S. C. Online portfolio selection: A survey. ACM Computing Surveys (CSUR), 46(3):35, 2014.

Li, Y., Swersky, K., and Zemel, R. Generative moment matching networks. In Bach, F. and Blei, D. (eds.), Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 1718–1727, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/li15.html.

Liang, Z., Chen, H., Zhu, J., Jiang, K., and Li, Y. Adversarial deep reinforcement learning in portfolio management. arXiv preprint arXiv:1808.09940, 2018.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

Marinescu, A., Dusparic, I., and Clarke, S. Prediction-based multi-agent reinforcement learning in inherently non-stationary environments. ACM Transactions on Autonomous and Adaptive Systems (TAAS), 12(2):9, 2017.

Markowitz, H. M. Portfolio selection: Efficient diversification of investment. 344 p, 1959.

Mittelman, R. Time-series modeling with undecimated fully convolutional neural networks. arXiv preprint arXiv:1508.00317, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Moody, J. and Saffell, M. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875–889, 2001.

Nair, A., McGrew, B., Andrychowicz, M., Zaremba, W., and Abbeel, P. Overcoming exploration in reinforcement learning with demonstrations. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6292–6299. IEEE, 2018.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pp. 2863–2871, 2015.

Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.

Rockafellar, R. T., Uryasev, S., et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627–635, 2011.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Sharpe, W. F. Mutual fund performance. The Journal of Business, 39(1):119–138, 1966.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Sortino, F. A. and Price, L. N. Performance measurement in a downside risk framework. The Journal of Investing, 3(3):59–64, 1994.

Sun, W., Venkatraman, A., Gordon, G. J., Boots, B., and Bagnell, J. A. Deeply AggreVaTeD: Differentiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. MIT Press, 2018.

van den Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., van den Driessche, G., Lockhart, E., Cobo, L., Stimberg, F., Casagrande, N., Grewe, D., Noury, S., Dieleman, S., Elsen, E., Kalchbrenner, N., Zen, H., Graves, A., King, H., Walters, T., Belov, D., and Hassabis, D. Parallel WaveNet: Fast high-fidelity speech synthesis. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 3918–3926, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/oord18a.html.

A. Algorithms

Our proposed model-based off-policy actor-critic style RL architecture is summarized in Algorithm 1. An on-policy version of our proposed model-based PPO-style RL architecture can be found in Algorithm 2.

Algorithm 1 Dynamic portfolio optimization algorithm (off-policy version).
1: Input: Critic $Q(\tilde{s}, a|\theta^Q)$, actor $\mu(\tilde{s}|\theta^\mu)$ and perturbed actor network $\mu(\tilde{s}|\theta^{\tilde{\mu}})$ with weights $\theta^Q$, $\theta^\mu$, $\theta^{\tilde{\mu}}$, and standard deviation of parameter noise $\sigma$.
2: Initialize target networks $Q'$, $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^Q$, $\theta^{\mu'} \leftarrow \theta^\mu$
3: Initialize replay buffer $R$
4: for episode = 1, ..., M do
5:   Receive initial observation state $s_1$
6:   for t = 1, ..., T do
7:     Predict future price tensor $x_{t+1}$ with the prediction models using $s_t$ and form the augmented state $\tilde{s}_t$
8:     Use the perturbed weights $\theta^{\tilde{\mu}}$ to select the action $a_t = \mu(\tilde{s}_t|\theta^{\tilde{\mu}})$
9:     Execute action $a_t$, observe reward $r_t$ and new state $s_{t+1}$
10:    Predict the next future price tensor $x_{t+2}$ using the prediction models with input $s_{t+1}$ and form the augmented state $\tilde{s}_{t+1}$
11:    Solve the optimization problem (1) for the expert greedy action $\bar{a}_t$
12:    Store transition $(\tilde{s}_t, a_t, r_t, \tilde{s}_{t+1}, \bar{a}_t)$ in $R$
13:    Sample a minibatch of $N$ transitions $(\tilde{s}_i, a_i, r_i, \tilde{s}_{i+1}, \bar{a}_i)$ from $R$ via prioritized replay, according to temporal difference error
14:    Compute $y_i = r_i + \gamma Q'(\tilde{s}_{i+1}, \mu'(\tilde{s}_{i+1}|\theta^{\mu'})|\theta^{Q'})$
15:    Update the critic $\theta^Q$ by annealing the prioritized replay bias while minimizing the loss: $\frac{1}{N}\sum_{i=1}^{N}(y_i - Q(\tilde{s}_i, a_i|\theta^Q))^2$
16:    Maintain the ratio between the actual learning rates of the actor and critic
17:    Update $\theta^\mu$ using the sampled policy gradient: $\frac{1}{N}\sum_{i} \nabla_a Q(\tilde{s}, a|\theta^Q)|_{\tilde{s}=\tilde{s}_i, a=\mu(\tilde{s}_i|\theta^\mu)} \nabla_{\theta^\mu} \mu(\tilde{s}|\theta^\mu)|_{\tilde{s}=\tilde{s}_i}$
18:    Calculate the expert auxiliary loss $\bar{L}^\mu$ in (2) and update $\theta^\mu$ using $\nabla_{\theta^\mu}\bar{L}^\mu$ with factor $\lambda$
19:    Update the target networks: $\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}$
20:    Create adaptive actor weights $\tilde{\mu}'$ from the current actor weights $\theta^\mu$ and the current $\sigma$: $\theta^{\tilde{\mu}'} \leftarrow \theta^\mu + \mathcal{N}(0, \sigma)$
21:    Generate adaptive perturbed actions $\tilde{a}'$ for the sampled transition starting states $\tilde{s}_i$: $\tilde{a}' = \mu(\tilde{s}_i|\theta^{\tilde{\mu}'})$. With the previously calculated actual actions $a = \mu(\tilde{s}_i|\theta^\mu)$, calculate the mean induced action noise: $d(\theta^\mu, \theta^{\tilde{\mu}'}) = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_s[(a_i - a'_i)^2]}$
22:    Update $\sigma$: if $d(\theta^\mu, \theta^{\tilde{\mu}'}) \le \delta$, $\sigma \leftarrow \alpha\sigma$, otherwise $\sigma \leftarrow \sigma/\alpha$
23:  end for
24:  Update the perturbed actor: $\theta^{\tilde{\mu}} \leftarrow \theta^\mu + \mathcal{N}(0, \sigma)$
25: end for

Algorithm 2 Dynamic portfolio optimization algorithm (on-policy version).
1: Input: Two-head policy network $\mu$ with weights $\theta^\mu$, policy head $\mu^{\mathrm{policy}}: \tilde{\mathcal{S}} \rightarrow \mathbb{R}^{2m+2}$ and value head $\mu^{\mathrm{value}}: \tilde{\mathcal{S}} \rightarrow \mathbb{R}$, and clipping threshold $\epsilon$.
2: for episode = 1, ..., M do
3:   Receive initial observation state $s_1$
4:   for t = 1, ..., T do
5:     Predict future price tensor $x_{t+1}$ with the prediction models using $s_t$ and form the augmented state $\tilde{s}_t$
6:     Use the policy head of the policy network to produce the action mean $m(\tilde{s}_t) \in \mathbb{R}^{m+1}$ and variance $\Sigma(\tilde{s}_t) = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_{m+1})$
7:     Execute action $a_t \sim \mathcal{N}(m(\tilde{s}_t), \Sigma(\tilde{s}_t))$, observe reward $r_t$ and new state $s_{t+1}$
8:     Compute the DSR or D3R $d_t$ based on (4) or (6)
9:     Predict the next future price tensor $x_{t+2}$ using the prediction models with input $s_{t+1}$ and form the augmented state $\tilde{s}_{t+1}$
10:    Solve the optimization problem (1) for the expert greedy action $\bar{a}_t$
11:  end for
12:  Compute the probability ratio: $r_t(\theta^\mu) = \frac{\mu^{\mathrm{policy}}_{\theta^\mu}(a_t|\tilde{s}_t)}{\mu^{\mathrm{policy}}_{\theta^\mu_{\mathrm{old}}}(a_t|\tilde{s}_t)}$
13:  Obtain the value estimate from the value head, $V(\tilde{s}_t) = \mu^{\mathrm{value}}(\tilde{s}_t)$, and calculate the advantage estimate $\hat{A}_t = \sum_{t'>t} \gamma^{t'-t} d_{t'} - V(\tilde{s}_t)$
14:  for i = 1, ..., N do
15:    Update the policy network using the gradient of the clipped surrogate objective: $\nabla_{\theta^{\mu^{\mathrm{policy}}}} \sum_{t=1}^{T} \hat{A}_t \min\left(r_t(\theta^\mu), \mathrm{clip}(r_t(\theta^\mu), 1-\epsilon, 1+\epsilon)\right)$
16:    Update the policy network using the gradient of the value loss: $\nabla_{\theta^{\mu^{\mathrm{value}}}} \sum_{t=1}^{T} \hat{A}_t^2$
17:    Calculate the expert auxiliary loss $\bar{L}^\mu$ in (2) and update the policy network using $\nabla_{\theta^{\mu^{\mathrm{policy}}}}\bar{L}^\mu$ with factor $\lambda$
18:  end for
19: end for
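Steps 20-22 of Algorithm 1 (the adaptive scaling of the parameter-space noise) reduce to a small amount of arithmetic; the sketch below (ours) shows that update in isolation, with the threshold delta and factor alpha as illustrative values.

```python
import numpy as np

def adapt_param_noise(actions, perturbed_actions, sigma, delta=0.2, alpha=1.01):
    """Adaptive scaling of parameter-space noise (Algorithm 1, steps 20-22).

    actions, perturbed_actions : actions of the actual and noise-perturbed actors
                                 for the same sampled states, shape (N, m + 1)
    sigma : current standard deviation of the parameter noise
    delta : target action-space distance (illustrative value)
    alpha : adaptation factor (illustrative value)
    """
    d = np.sqrt(np.mean((actions - perturbed_actions) ** 2))  # mean induced action noise
    return sigma * alpha if d <= delta else sigma / alpha

print(adapt_param_noise(np.zeros((4, 3)), 0.1 * np.ones((4, 3)), sigma=0.05))
```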

Figure 3. Portfolio weights for different assets (assets from top to bottom are Cash, Costco Wholesale Corporation, Cisco Systems, Ford Motors, Goldman Sachs, American International Group, Amgen Inc and Caterpillar).

C. Hyperparameters for Experiments

We report the hyperparameters for the actor and critic networks in Table 3.

The nonlinear dynamic Boltzmann machine in the infused prediction module uses the following hyperparameter settings: delay d = 3, decay rates λ = [0.1, 0.2, 0.5, 0.8] (i.e., k = 4) and a learning rate of 10^-3 with the standard RMSProp optimizer. The input and output dimensions were fixed at three times the number of assets, corresponding to the high, low and close percentage change values. The RNN layer dimension is fixed at 100 units with a tanh nonlinear activation function. A zero-mean, 0.01 standard deviation noise was applied to each of the input dimensions at each time step, in order to slightly perturb the inputs to the network. This injected noise was cancelled by applying a standard Savitzky-Golay (savgol) filter, as implemented in the scientific computing package SciPy, with window length 5 and polynomial order 3.

For the variant of WaveNet (van den Oord et al., 2018) in the infused prediction module, we choose the number of dilation levels L to be 6, the filter length f to be 2, the number of filters to be 32 and the learning rate to be 10^-4. Inputs are scaled with a min-max scaler whose window size equals the receptive field of the network, which is f^L + Σ_{i=0}^{L-2} f^i = 95.

For the DAM, we select the training set size to be 30,000, the noise latent dimension H to be 8, the batch size to be 128, the time embedding k1 to be 95, the generator RNN hidden units to be 32, the discriminator RNN hidden units to be 32 and the learning rate to be 10^-3. For each episode in Algorithm 1, we generate and append two months of synthetic market data for each asset.

For the BCM, we choose the factor λ = 0.1 to discount the gradient of the log-loss.
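Two of the preprocessing settings above can be checked directly: the Savitzky-Golay smoothing is available in SciPy, and the receptive-field expression f^L + Σ_{i=0}^{L-2} f^i fixes the min-max scaling window at 95 steps. This is only an illustrative sketch; the noisy series below is synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter

# Receptive field of the dilated stack for filter length f = 2 and L = 6 levels,
# using the expression quoted above: f**L + sum_{i=0}^{L-2} f**i = 95.
f, L = 2, 6
window = f**L + sum(f**i for i in range(L - 1))
assert window == 95

# Cancel a zero-mean, 0.01-std input noise with a savgol filter
# (window length 5, polynomial order 3), as implemented in scipy.signal.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 10.0, 500))        # stand-in for a percentage-change series
noisy = clean + rng.normal(0.0, 0.01, clean.shape)
smoothed = savgol_filter(noisy, window_length=5, polyorder=3)
```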

Table 2. Performance for on-policy PPO-style dynamic portfolio optimization algorithm (accnt. and ann. are abbreviations for account and annualized respectively).

                     Baseline    BCM        IPM+BCM
Final accnt. value   595332      672586     711475
Ann. return          5.23%       15.79%     18.25%
Ann. volatility      14.24%      13.96%     14.53%
Sharpe ratio         0.37        1.13       1.26
Sortino ratio        0.52        1.62       1.76
VaR_0.95             1.41%       1.39%      1.38%
CVaR_0.95            2.10%       2.09%      2.20%
MDD                  22.8%       12.40%     11.60%
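For reference, the summary statistics reported in Table 2 (and in Tables 4 and 5 below) can be estimated from a realized per-period return series along the following lines. This is only a sketch under common conventions: the annualization factor, the empirical-quantile VaR/CVaR and the return series itself are assumptions here, and the exact evaluation code is not reproduced.

```python
import numpy as np

def summary_stats(returns, periods_per_year=252, alpha=0.95):
    """Sharpe and Sortino ratios, tail VaR/CVaR and maximum drawdown
    computed from a 1-D array of per-period returns."""
    mu, sigma = returns.mean(), returns.std()
    downside = np.sqrt(np.mean(np.minimum(returns, 0.0) ** 2))   # downside deviation
    sharpe = np.sqrt(periods_per_year) * mu / sigma
    sortino = np.sqrt(periods_per_year) * mu / downside
    var = -np.quantile(returns, 1.0 - alpha)                     # loss not exceeded with prob. alpha
    cvar = -returns[returns <= -var].mean()                      # expected loss beyond the VaR level
    wealth = np.cumprod(1.0 + returns)
    mdd = np.max(1.0 - wealth / np.maximum.accumulate(wealth))   # maximum drawdown
    return {"Sharpe": sharpe, "Sortino": sortino, "VaR": var, "CVaR": cvar, "MDD": mdd}
```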

D. Details of Infused Prediction Module

D.1. Nonlinear Dynamic Boltzmann Machine Algorithm

Table 3. Hyperparameters for the DDPG actor and critic networks.

                          Actor network                               Critic network
FE layer type             RNN bidirectional LSTM                      RNN bidirectional LSTM
FE layer size             20, 8                                       20, 8
FA layer type             Dense                                       Dense
FA layer size             256, 128, 64, 32                            256, 128, 64, 32
FA layer activation       Leaky relu                                  Leaky relu
Optimizer                 Gradient descent optimizer                  Adam optimizer
Dropout                   0.5                                         0.5
Learning rate             Synchronized to be 100 times slower         10^-3
                          than the critic's actual learning rate
Episode length            650
Number of episodes        200
σ for parameter noise     0.01
Replay buffer size        1000

Figure 5. A nonlinear dynamic Boltzmann machine unfolded in time. There are no connections between units i and j within a layer. Each unit is connected to each other unit only in time. The lack of intra-layer connections enables conditional independence as depicted.

In Figure 5 we show the typical unfolded structure of a nonlinear dynamic Boltzmann machine (NDyBM). The RNN layer (of a reservoir computing type architecture) (Jaeger, 2003; Jaeger & Haas, 2004) is used to create nonlinear feature transforms of the historical time-series and to update the bias parameter vector as follows:

b[t] = b[t-1] + A^T Ψ[t],

where Ψ[t] is the M × 1 dimensional state vector at time t of an M dimensional RNN, and A is the M × N dimensional learned output weight matrix that connects the RNN state to the bias vector. The RNN state is updated based on the input time-series x[t] as follows:

Ψ[t] = tanh(W_rnn Ψ[t-1] + W_in x[t]).

Here, W_rnn and W_in are the RNN internal weight matrix and the weight matrix corresponding to the projection of the time-series input to the RNN layer, respectively.

The NDyBM is trained online to predict x̃[t+1] based on the estimated µ[t] for all time observations t = 1, 2, 3, . . . . The parameters are updated such that the log-likelihood LL(D) = Σ_{x∈D} Σ_t log p(x[t] | x[-∞,t-1]) of the given financial time-series data D is maximized. We can derive exact stochastic gradient update rules for each of the parameters using this objective (they can be referenced in the original paper). Such a local update mechanism allows an O(1) update of the NDyBM parameters. As a result, a single epoch update of the NDyBM occurs in sub-seconds (≪ 1 s), compared to a tens-of-seconds update of the WaveNet inspired model. When scaling up to a large number of assets in the portfolio in an online learning scenario, this can provide significant computational benefits.

Our implementation of the NDyBM based infused prediction module is based on the open-source code available at https://github.com/ibm-research-tokyo/dybm. Algorithm 3 describes the basic steps.

Algorithm 3 Online asset price change prediction with the NDyBM IPM module.
1: Require: All the weight and bias parameters of the NDyBM are initialized to zero. The RNN weights W_rnn are initialized randomly from N(0, 1) and W_in is initialized randomly from N(0, 0.1). The FIFO queue is initialized with d - 1 zero vectors. K eligibility traces {α_k^[-1]}_{k=1}^K are initialized with zero vectors
2: Input: Close, high and low percentage price change for each asset at each time step
3: for t = 0, 1, 2, . . . do
4:   Compute µ[t] using µ[t] = b + Σ_{δ=1}^{d-1} F[δ] x[t-δ] + Σ_{k=1}^K G_k α_k^[t-1] and update the bias vector based on the RNN layer
5:   Predict the expected price change pattern at time t using µ[t]
6:   Observe the current time-series pattern x[t]
7:   Update the parameters of the NDyBM based on ∇ log p(x[t] | x[-∞,t-1])
8:   Update the FIFO queues and eligibility traces by α_{i,j,k}^[t] = λ_k α_{i,j,k}^[t-1] + x_i^[t-d_{i,j}+1]
9:   Update the RNN layer state vector
10: end for
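The RNN-driven bias update in Algorithm 3 (step 4, together with the two equations above) can be written compactly as below. The sizes M and N and the random draws are illustrative only; the actual implementation is the open-source dybm package referenced above.

```python
import numpy as np

M, N = 100, 24                          # RNN units; N = 3 x (number of assets) for 8 assets
rng = np.random.default_rng(0)
W_rnn = rng.normal(0.0, 1.0, (M, M))    # internal RNN weights, drawn from N(0, 1)
W_in = rng.normal(0.0, 0.1, (M, N))     # input-to-RNN projection, drawn from N(0, 0.1)
A = np.zeros((M, N))                    # learned output weights mapping the RNN state to the bias

def rnn_bias_step(x_t, psi, b):
    """Psi[t] = tanh(W_rnn Psi[t-1] + W_in x[t]);  b[t] = b[t-1] + A^T Psi[t]."""
    psi = np.tanh(W_rnn @ psi + W_in @ x_t)
    b = b + A.T @ psi
    return psi, b

psi, b = np.zeros(M), np.zeros(N)
psi, b = rnn_bias_step(rng.normal(size=N), psi, b)   # one online step on a dummy observation
```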

D.2. WaveNet inspired multivariate time-series prediction

We use an autoregressive generative model, which is a variant of parallel WaveNet (van den Oord et al., 2018), to learn the temporal pattern of the percentage price change tensor x_t ∈ R^{m×3}, which is the part of the state space that is assumed to be independent of the agent's actions. Our network is inspired by previous work on adapting WaveNet to time series prediction (Mittelman, 2015; Borovykh et al., 2017). We denote the ith asset's price tensor at time t as x_{i,t} = (p^h_{i,t}, p^l_{i,t}, p_{i,t}). The joint distribution of prices over time, X = {x_1, . . . , x_T}, is modelled as a factorized product of probabilities conditioned on a past window of size k1:

p(X) = Π_{t=1}^T Π_{i=1}^m p(x_{i,t} | x_{t-k1}, . . . , x_{t-1}, θ).

The model parameter θ is estimated through maximum likelihood estimation (MLE) with respect to p(X). The joint probability is factorized both over time and over different assets, and the conditional probabilities are modelled as stacks of dilated causal convolutions. Causal convolution ensures that the output does not depend on future data, which can be implemented by front zero padding the convolution output in the time dimension such that the output has the same size in the time dimension as the input.

Figure 6 shows the tensorboard visualization of a dilated convolution stack of our WaveNet variant. At each time t, the input window (x_{t-k1}, . . . , x_{t-1}) ∈ X^{k1} ⊂ R^{m×3×k1} first goes through the common F1 layer, which is a depthwise separable 2d convolution followed by a 1 × 1 convolution, with time and features (close, high, low) as height and width and assets as channels. The output from F1 is then fed into different stacks of the same architecture as depicted in Figure 6, one for each asset. F_l and R_l denote a dilated convolution with filter length f and relu activation at level l, which has dilation factor d_l = f^{l-1}. Each R_l takes the concatenation of F_l and R_{l+1} as input. This concatenation is represented by the residual blocks in the diagram. The final M layer in Figure 6 is a 1 × 1 convolution with 3 filters, one for each of high, low and close, and its output is exactly x_{i,t}. The outputs from the m different stacks are then concatenated to produce the prediction x_t.

The reason for modelling the probabilities in this way is twofold. First, it addresses the dependency on historical patterns of the financial market by using a high order autoregressive model to capture long term patterns. Models with recurrent connections can capture long term information since they have internal memory, but they are slow to train. A fully convolutional model can process inputs in parallel, thus resulting in faster training compared to recurrent models, but the number of layers needed is linear in k1. This inefficiency can be solved by using stacked dilated convolutions. A dilated convolution with dilation factor d uses filters with d - 1 zeros inserted between its values, which allows it to operate at a coarser scale than a normal convolution with the same effective filter length. When stacking dilated convolution layers with filter length f, if an exponentially increasing dilation factor d_l = f^{l-1} is used in each layer l, the effective receptive field will be f^L, where L is the total number of layers. Thus a large receptive field can be achieved with logarithmically many layers in k1, which requires far fewer parameters than a normal fully convolutional model.

Secondly, by factoring not only over time but over different assets as well, this model makes parallelization easier and potentially has better interpretability. The prediction of each asset is conditioned on all the asset prices in the past window, which makes the model easier to run in parallel since each stack can be placed on a different GPU.

The serial cross correlation R at lag ℓ for a pair of discrete time series x(t), y(t) is defined as

R_xy(ℓ) = Σ_t x(t) y(t - ℓ).

Figure 7. Out of sample actual and predicted closing price movement for asset Cisco Systems.

Figure 8. Trend following analysis on predicted closing price movement for asset Cisco Systems.

Roughly speaking, R_xy(ℓ) measures the similarity between x and a lagged version of y. The peak α = argmax_ℓ R_xy(ℓ) indicates that x(t) has the highest correlation with y(t + α) for all t on average. If x(t) is the predicted time series and y(t) is the real data, a naive prediction model that simply takes the last observation as its prediction would thus have α = -1. Sometimes a prediction model will learn this trivial prediction, and we call this kind of model trend following. To test whether our model has learned a trend following prediction, we convert the predicted percentage change vector x_t back to prices (see Figure 7) and calculate the serial cross-correlation between the predicted series and the actual one. Figure 8 clearly shows that there is no trend following behavior, as α = 0.
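The trend-following diagnostic can be reproduced with a plain cross-correlation; `predicted` and `actual` below are placeholder names for the two closing-price series.

```python
import numpy as np

def trend_following_lag(predicted, actual):
    """Return alpha = argmax_l R_xy(l), with R_xy(l) = sum_t x(t) y(t - l).

    alpha = -1 corresponds to a trivial trend-following predictor that repeats
    the last observation; alpha = 0 indicates no such behaviour.
    """
    corr = np.correlate(predicted, actual, mode="full")    # R_xy(l) over all lags
    lags = np.arange(-(len(actual) - 1), len(predicted))   # lag value of each output entry
    return lags[np.argmax(corr)]
```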

E. Details of Data Augmentation Module

Maximum mean discrepancy (MMD) (Gretton et al., 2012) is a pseudometric over Prob(X), the space of probability measures on some compact metric set X. Given a family of functions F, the MMD is defined as

MMD(F, p_a, p_b) = sup_{f∈F} E_{x∼p_a}[f(x)] - E_{x∼p_b}[f(x)].

When F is the unit ball {f ∈ H : ||f|| ≤ 1} in a Reproducing Kernel Hilbert Space (RKHS) H associated with a universal kernel K : X × X → R, MMD(F, p_a, p_b) is not only a pseudometric but a proper metric as well, that is, MMD(F, p_a, p_b) = 0 if and only if p_a = p_b. One example of a universal kernel is the commonly used Gaussian radial basis function (RBF) kernel K(x, y) = exp(-||x - y||²/(2σ²)).

Given samples {x_i}_{i=1}^M and {y_j}_{j=1}^N from p_a and p_b, the squared MMD has the following unbiased estimator:

MMD̂² = 1/(M(M-1)) Σ_{i=1}^M Σ_{j≠i} K(x_i, x_j) - 2/(MN) Σ_{i=1}^M Σ_{j=1}^N K(x_i, y_j) + 1/(N(N-1)) Σ_{i=1}^N Σ_{j≠i} K(y_i, y_j).

We also have the biased estimator, where the empirical estimate of the feature space means is used:

MMD̂_b² = 1/M² Σ_{i=1}^M Σ_{j≠i} K(x_i, x_j) - 2/(MN) Σ_{i=1}^M Σ_{j=1}^N K(x_i, y_j) + 1/N² Σ_{i=1}^N Σ_{j≠i} K(y_i, y_j).

Note that the RBF kernel has the following representation:

K(x, y) = exp(-(||x||² + ||y||²)/(2σ²)) exp(⟨x, y⟩/σ²)
        = C Σ_{n=0}^∞ ⟨x, y⟩ⁿ/(σ^{2n} n!)
        = C Σ_{n=0}^∞ ⟨φ_n(x), φ_n(y)⟩/(σ^{2n} n!),

where C is the constant exp(-(||x||² + ||y||²)/(2σ²)) and φ_n(·) is the feature map of the polynomial kernel of degree n. Following (Li et al., 2015), we can rewrite the biased estimator MMD̂_b² with φ_n:

MMD̂_b² = C Σ_{n=0}^∞ 1/(σ^{2n} n!) [ 1/M² Σ_{i=1}^M Σ_{j≠i} ⟨φ_n(x_i), φ_n(x_j)⟩ - 2/(MN) Σ_{i=1}^M Σ_{j=1}^N ⟨φ_n(x_i), φ_n(y_j)⟩ + 1/N² Σ_{i=1}^N Σ_{j≠i} ⟨φ_n(y_i), φ_n(y_j)⟩ ]
        = C Σ_{n=0}^∞ 1/(σ^{2n} n!) || (1/M) Σ_{i=1}^M φ_n(x_i) - (1/N) Σ_{j=1}^N φ_n(y_j) ||².

Therefore, minimizing MMD̂_b² can be seen as matching all moments between the two distributions p_a and p_b.

Although we can perform a multivariate two-sample test with the MMD, it suffers from the curse of dimensionality. Thus, instead of looking at the distribution of the percentage change vector h_i, we perform a two-sample test on the distribution p_h of the percentage change series h_{i,t} itself. Specifically, we use the Kolmogorov-Smirnov (KS) test, which is a two-sample test based on the ||·||_∞ norm between the empirical cumulative distribution functions of the two distributions.

For each asset of interest, we divide the data set into a training and a validation set, where the training set is used to train the RGAN model. We then generate a batch of samples with the trained RGAN. For each generated series, we perform the KS test between it and every series in the validation set and record the maximum p-value. The average of these maximum p-values over all the generated samples, denoted by p̄_max, is then used to determine the goodness of fit of the generated series. The average p̄_max across all assets in our portfolio is 0.11, which shows that we cannot reject the null hypothesis that the generated distribution and the data distribution are the same.
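A direct NumPy transcription of the unbiased squared-MMD estimator above (RBF kernel with an arbitrarily chosen bandwidth `sigma`), together with SciPy's two-sample KS test used for the per-asset check; the random arrays stand in for generated and real series.

```python
import numpy as np
from scipy.stats import ks_2samp

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of the squared MMD between samples X (M x d) and Y (N x d)
    under the Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    def rbf(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
        return np.exp(-d2 / (2.0 * sigma**2))

    Kxx, Kyy, Kxy = rbf(X, X), rbf(Y, Y), rbf(X, Y)
    M, N = len(X), len(Y)
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (M * (M - 1))   # sum over i != j
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (N * (N - 1))
    return term_xx + term_yy - 2.0 * Kxy.mean()

# Two-sample Kolmogorov-Smirnov test between one generated series and one real series;
# in the procedure above this is repeated against every validation series and the
# maximum p-value is recorded.
rng = np.random.default_rng(0)
p_value = ks_2samp(rng.normal(size=500), rng.normal(size=500)).pvalue
```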

F. Risk-adjustment Module

In this section, we discuss the risk-adjustment module (RAM), which aims to optimize portfolio risk directly. We adapt the framework of the recurrent deterministic policy gradient (RDPG, see (Heess et al., 2015)) technique, which is a natural extension of DDPG to memory-based control settings with recurrent neural networks, and adjust our risk preference.

Note that the sampled minibatch from the replay buffer in Algorithm 1 is of the form

(s_{t_1}, a_{t_1}, r_{t_1}, s_{t_1+1}), . . . , (s_{t_K}, a_{t_K}, r_{t_K}, s_{t_K+1}).

The critic and actor networks are then updated according to these disjointed (non-successive) samples, i.e., s_{t_k+1} ≠ s_{t_{k+1}} for all k = 1, . . . , K - 1. Different from the DDPG replay buffer, which stores single time-step experiences, the RDPG replay buffer stores complete trajectories. In other words, each sample in the DDPG replay buffer is a state-action-reward-next-state tuple, while each sample in the RDPG replay buffer is a trajectory starting from a given initial state. When we sample from the RDPG replay buffer, different trajectories are sampled.

Different from the original RDPG (Heess et al., 2015), the update rules of our RDPG are the same as those of DDPG, except that samples from the replay buffer are trajectories instead of disjoint transitions, and that both critic and actor networks in RDPG still take the augmented state s̃_i as input. The details of these changes are in Algorithm 4.

Having access to trajectory sampling also allows us to apply risk adjustment to our RL objective. Following (Moody & Saffell, 2001), we consider new risk-aware objectives including the differential Sharpe ratio (DSR) and the differential downside deviation ratio (D3R). To define the DSR, recall the definition of the Sharpe ratio SR and denote its value at time t by SR_t:

SR_t = E[r_t] / sqrt(var[r_t]),     (3)

where r_t is the logarithmic rate of return for period t. The differential Sharpe ratio d_t is then obtained by considering exponential moving averages of the returns and standard deviation of returns in (3), and expanding to first order in the adaptation rate η:

d_t ≜ dSR_t/dη = (ω_{t-1} Δν_t - 0.5 ν_{t-1} Δω_t) / (ω_{t-1} - ν_{t-1}²)^{3/2},     (4)

where ν_t and ω_t are exponential moving estimates of the first and second moments of r_t (see footnote 11):

ν_t = ν_{t-1} + η Δν_t = ν_{t-1} + η(r_t - ν_{t-1}),
ω_t = ω_{t-1} + η Δω_t = ω_{t-1} + η(r_t² - ω_{t-1}).

The DSR has several attractive properties, including facilitating recursive updating, enabling efficient on-line optimization, weighting recent returns more heavily and providing interpretability (Moody & Saffell, 2001).

To define the D3R, we first need to define the downside deviation dd_T as

dd_T ≜ ( (1/T) Σ_{t=1}^T min{r_t, 0}² )^{1/2},

which is the square root of the average of the squared negative returns. Using the downside deviation as a measure of risk, we can now define the downside deviation ratio (DDR) as

DDR_T ≜ E[r_t] / dd_T.     (5)

The D3R is then defined by considering the exponential moving average of the returns and the squared downside deviation of returns in (5), and by expanding to first order in the adaptation rate η of the DDR:

d_t ≜ dDDR_t/dη = (r_t - 0.5 ν_{t-1}) / dd_{t-1}   if r_t > 0,
d_t ≜ dDDR_t/dη = (dd_{t-1}² (r_t - 0.5 ν_{t-1}) - 0.5 ν_{t-1}² r_t) / dd_{t-1}³   otherwise,     (6)

where

ν_t = ν_{t-1} + η(r_t - ν_{t-1}),
dd_t² = dd_{t-1}² + η(min{r_t, 0}² - dd_{t-1}²).

The D3R rewards the presence of large average positive returns and penalizes risky downside returns. The procedure for obtaining the risk-sensitive RDPG agent for dynamic portfolio optimization is summarized in Algorithm 4.

We perform the experiments with initial values ν_0 = ω_0 = dd_0 = 0, and add ε = 10^-8 to the denominators in (4) and (6) to avoid division by zero. The results with the DSR can be found in Table 4 below. We observe a phenomenon similar to that shown in Table 1, where the rewards were scaled by a fixed number (see footnote 12). In particular, the use of IPM drastically improves the Sharpe (from 0.56 to 0.60) and Sortino (from 0.78 to 0.84) ratios. Furthermore, looking at the model trained with IPM and the model trained with IPM+BCM, we see that the use of BCM slightly improves both the Sharpe (from 0.60 to 0.62) and Sortino (from 0.84 to 0.87) ratios. This is consistent with earlier results. Using IPM+DAM+BCM actually produces the best results compared to just DSR in terms of Sharpe (from 0.56 to 0.63) and Sortino (from 0.78 to 0.88) ratios. It is worthwhile to note that using IPM+DAM compared to just IPM does not reduce the volatility as in the earlier case, though the Sharpe (from 0.60 to 0.62) and Sortino (from 0.84 to 0.87) ratios are improved. This could be due to the rewards being risk-adjusted and the DAM allowing better generalization of the model to the testing set.
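The recursive DSR and D3R rewards in (4) and (6) reduce to a few lines of state updates; `eta` is the adaptation rate and `eps` the small constant added to the denominators as described above. This is a sketch of the equations only, not the agent's reward code.

```python
def dsr_update(r_t, nu, omega, eta=0.01, eps=1e-8):
    """One differential Sharpe ratio step, eq. (4); nu and omega are the previous
    exponential moving estimates of the first and second moments of the returns."""
    d_nu, d_omega = r_t - nu, r_t**2 - omega
    d_t = (omega * d_nu - 0.5 * nu * d_omega) / ((omega - nu**2) ** 1.5 + eps)
    return d_t, nu + eta * d_nu, omega + eta * d_omega

def d3r_update(r_t, nu, dd2, eta=0.01, eps=1e-8):
    """One differential downside deviation ratio step, eq. (6); dd2 is the previous
    exponential moving estimate of the squared downside deviation."""
    dd = dd2 ** 0.5
    if r_t > 0:
        d_t = (r_t - 0.5 * nu) / (dd + eps)
    else:
        d_t = (dd2 * (r_t - 0.5 * nu) - 0.5 * nu**2 * r_t) / (dd**3 + eps)
    nu = nu + eta * (r_t - nu)
    dd2 = dd2 + eta * (min(r_t, 0.0) ** 2 - dd2)
    return d_t, nu, dd2
```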

11. This differs from w_t, which denotes the portfolio weights in the main paper.
12. All experiment results shown in Table 1 were generated by scaling the rewards by a factor of 10^3.

Results for models with the D3R are given in Table 5. We again see that using IPM compared to just the D3R improves both the Sharpe (from 0.57 to 0.59) and Sortino (from 0.79 to 0.83) ratios. However, we observe that the annualized volatility is not significantly impacted when comparing IPM with IPM+DAM+BCM (from 12.75% to 12.74%). This could be due to the difference in the risk adjustment of the rewards. It is nevertheless important to note that using IPM+DAM+BCM significantly improves the Sharpe (from 0.56 to 0.63) and Sortino (from 0.78 to 0.88) ratios over the model trained with just the DSR.

We show that the use of risk-adjusted rewards such as the DSR and D3R does not detrimentally impact performance; moreover, it presents an alternative way of scaling the rewards without the need to heuristically set a scaling factor.

Algorithm 4 Risk-adjusted dynamic portfolio optimization algorithm.
1: Input: Critic Q(s̃, a|θ^Q), actor µ(s̃|θ^µ) and perturbed actor network µ(s̃|θ^µ̃) with weights θ^Q, θ^µ, θ^µ̃, standard deviation of parameter noise σ and risk adaptation rate η
2: Initialize target networks Q′, µ′ with weights θ^Q′ ← θ^Q, θ^µ′ ← θ^µ
3: Initialize replay buffer R
4: for episode = 1, . . . , M do
5:   Receive initial observation state s_1
6:   Initialize ν_0, ω_0 and dd_0 for the DSR and D3R
7:   for t = 1, . . . , T do
8:     Predict future price tensor x_{t+1} using prediction models with inputs s_t and form augmented state s̃_t
9:     Use perturbed weight θ^µ̃ to select action a_t = µ(s̃_t|θ^µ̃)
10:    Take action a_t, observe reward r_t and new state s_{t+1}
11:    Predict next future price tensor x_{t+2} using prediction models with inputs s_{t+1} and form the augmented state s̃_{t+1}
12:    Solve the optimization problem (1) for the expert greedy action ā_t
13:    Compute the DSR or D3R d_t based on (4) or (6)
14:  end for
15:  Store trajectory (s̃_1, a_1, d_1, s̃_2, ā_1, . . . , s̃_T, a_T, d_T, s̃_{T+1}, ā_T) in R
16:  Sample a minibatch of size K from the replay buffer: {(s̃_{k_1}, a_{k_1}, d_{k_1}, s̃_{k_2}, ā_{k_1}, . . . , s̃_{k_T}, a_{k_T}, d_{k_T}, s̃_{k_{T+1}}, ā_{k_T})}_{k=1}^K
17:  Compute y_{k_t} = d_{k_t} + γ Q′(s̃_{k_{t+1}}, µ′(s̃_{k_{t+1}}|θ^µ′)|θ^Q′) for all k = 1, . . . , K and t = 1, . . . , T
18:  Update the critic θ^Q by minimizing the loss:
       (1/(TK)) Σ_{t=1}^T Σ_{k=1}^K (y_{k_t} − Q(s̃_{k_t}, a_{k_t}|θ^Q))²
19:  Update θ^µ using the sampled policy gradient:
       (1/(TK)) Σ_{t=1}^T Σ_{k=1}^K ∇_a Q(s̃_t, a|θ^Q)|_{s̃=s̃_{k_t}, a=µ(s̃_{k_t}|θ^µ)} ∇_{θ^µ} µ(s̃_t|θ^µ)|_{s̃=s̃_{k_t}}
20:  Calculate the expert auxiliary loss L̄ in (2) and update θ^µ using ∇_{θ^µ} L̄ with factor λ
21:  Update the target networks: θ^Q′ ← τθ^Q + (1 − τ)θ^Q′, θ^µ′ ← τθ^µ + (1 − τ)θ^µ′
22:  Create adaptive actor weights µ̃′ from the current actor weight θ^µ and current σ: θ^µ̃′ ← θ^µ + N(0, σ)
23:  Generate adaptive perturbed actions ã′ for the sampled transition starting states s̃_i: ã′ = µ(s̃_i|θ^µ̃′). With the previously calculated actual actions a_i = µ(s̃_i|θ^µ), calculate the mean induced action noise: d(θ^µ, θ^µ̃′) = sqrt((1/N) Σ_{i=1}^N E_s[(a_i − a′_i)²])
24:  Update σ: if d(θ^µ, θ^µ̃′) ≤ δ, σ ← ασ, otherwise σ ← σ/α
25:  Update perturbed actor: θ^µ̃ ← θ^µ + N(0, σ)
26: end for
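A minimal sketch of the trajectory-level targets and critic loss in steps 17-18 of Algorithm 4, with the online and target networks abstracted as plain callables (`critic`, `target_critic`, `target_actor` are placeholder names):

```python
import numpy as np

def critic_loss(batch, critic, target_critic, target_actor, gamma=0.99):
    """Mean-squared TD error over a minibatch of K trajectories of length T.

    batch: list of K trajectories, each a list of (s_tilde, a, d, s_tilde_next) tuples,
    where d is the DSR or D3R reward for that step.
    """
    errors = []
    for trajectory in batch:
        for s_tilde, a, d, s_tilde_next in trajectory:
            y = d + gamma * target_critic(s_tilde_next, target_actor(s_tilde_next))
            errors.append((y - critic(s_tilde, a)) ** 2)
    return np.mean(errors)
```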

Table 4. Performances for different models (all models are risk-adjusted by the DSR risk measure; accnt. and ann. are abbreviations for account and annualized respectively).

                    DSR      IPM      DAM      BCM      IPM+DAM  IPM+BCM  DAM+BCM  IPM+DAM+BCM
Final accnt. value  570909   575709   570128   571595   579071   578478   571880   579444
Ann. return         7.17%    7.62%    7.10%    7.24%    7.93%    7.88%    7.27%    7.96%
Ann. volatility     12.79%   12.75%   12.75%   12.81%   12.75%   12.73%   12.75%   12.68%
Sharpe ratio        0.56     0.60     0.56     0.57     0.62     0.62     0.57     0.63
Sortino ratio       0.78     0.84     0.78     0.79     0.87     0.87     0.79     0.88
VaR_0.95            1.30%    1.27%    1.28%    1.30%    1.27%    1.26%    1.30%    1.25%
CVaR_0.95           1.93%    1.91%    1.93%    1.94%    1.91%    1.91%    1.93%    1.90%
MDD                 13.70%   12.40%   13.60%   13.70%   12.40%   12.40%   13.70%   12.20%

Table 5. Performances for different models (all models are risk-adjusted by the D3R risk measure; accnt. and ann. are abbreviations for account and annualized respectively).

                    DDR      IPM      DAM      BCM      IPM+DAM  IPM+BCM  DAM+BCM  IPM+DAM+BCM
Final accnt. value  571790   574791   571717   571737   577833   576809   571973   578091
Ann. return         7.26%    7.54%    7.25%    7.25%    7.82%    7.72%    7.27%    7.84%
Ann. volatility     12.81%   12.75%   12.79%   12.79%   12.75%   12.75%   12.81%   12.74%
Sharpe ratio        0.57     0.59     0.57     0.57     0.61     0.61     0.57     0.62
Sortino ratio       0.79     0.83     0.79     0.79     0.86     0.85     0.79     0.86
VaR_0.95            1.30%    1.27%    1.29%    1.30%    1.27%    1.27%    1.30%    1.27%
CVaR_0.95           1.94%    1.91%    1.93%    1.93%    1.91%    1.91%    1.94%    1.91%
MDD                 13.70%   12.40%   13.60%   13.70%   12.40%   12.40%   13.70%   12.40%

Figure 4. Trading signals (last row is the S&P 500 index).

Figure 6. The network structure of WaveNet.