Option Hedging with Risk Averse Reinforcement Learning
Edoardo Vittori1,2, Michele Trapletti1, Marcello Restelli2
{edoardo.vittori,michele.trapletti}@intesasanpaolo.com, [email protected]
1Intesa Sanpaolo, 2Politecnico di Milano

ABSTRACT
In this paper we show how risk-averse reinforcement learning can be used to hedge options. We apply a state-of-the-art risk-averse algorithm, Trust Region Volatility Optimization (TRVO), to a vanilla option hedging environment, considering realistic factors such as discrete time and transaction costs. Realism makes the problem twofold: the agent must both minimize volatility and contain transaction costs, tasks that are usually in competition. We use the algorithm to train a sheaf of agents, each characterized by a different risk aversion, so as to span an efficient frontier in the volatility-p&l space. The results show that the derived hedging strategy not only outperforms the Black & Scholes delta hedge, but is also extremely robust and flexible, as it can efficiently hedge options with different characteristics and work on markets that behave differently from the one used in training.

KEYWORDS
Deep Hedging, Reinforcement Learning, Transaction Costs

1 INTRODUCTION
Vanilla options, contracts that offer the buyer the right to buy or sell a certain amount (the option's notional) of the underlying asset at a predefined price (the strike) at a certain future time (the maturity), are a fundamental building block in the derivatives business. They
offer an investor the opportunity to take advantage of a movement in the price of an asset aligned with her market view, while avoiding the risk of losing money should the asset price instead move the opposite way, at a cost, the option premium, completely defined at inception.

In the options market a crucial role is played by traders who quote both the buy and sell prices of an option. They are usually referred to as sell-side or market-making traders and generally cover a large range of options on very few underlying instruments. When a market-maker succeeds in selling and buying the same amount of the same option, she makes a profit, which is her main aim, equal to the difference between the two premia, and the financial risk is perfectly offset since the value of the portfolio realized by the two trades does not depend on the underlying asset. However, most of the time, requests are not symmetric, leaving an open risk position that must be closed (i.e., hedged).

Option pricing and hedging builds on the standard Black & Scholes (B&S) [3] model, which is based on a strong set of assumptions that tend to be unrealistic [33]: hedging is assumed to be costless and continuous. Due to this, the hedging process, which usually consists of a purchase or sale of the option's underlying in a quantity based on the first-order derivative of the B&S price (known as delta), must be adjusted with the trader's experience in order to both reduce the risks and contain hedging costs.

In this paper we focus on the option hedging problem in a realistic environment, where we exploit the power of Reinforcement Learning (RL). In some sense we aim at replicating and hopefully improving, in an automatic way, the trader's experience of containing both risk and hedging costs. While there is an extensive literature on both option hedging [13] and reinforcement learning [29], there are very few works on the combined topics, the main ones being [4, 5, 11, 15], which we analyze in Section 5.

Here we implement a robust tool capable of providing the trader with a hedging signal more accurate than the delta hedge, as it is optimized in a realistic environment with discrete time and transaction costs. We achieve this result through the use of risk-averse RL by applying TRVO [2], an algorithm capable of jointly optimizing the hedging (i.e., risk reduction) and p&l objectives. By controlling the risk-aversion parameter, we are able to create a Pareto frontier in the volatility-p&l space, which strictly dominates the delta hedge since it performs better both in terms of variance and in terms of p&l. Having trained a sheaf of agents, each characterized by a different risk aversion, we reduce the job of the trader to simply deciding on which point of the frontier to place herself. We also show that the trained agents are robust, since they can efficiently hedge options with different characteristics and work on markets which behave differently from the one used in training. These experimental results are the main contribution of this work.

The paper is structured as follows: in Section 2 we present the hedging of a vanilla option using the B&S model and subsequently explain how the hedging environment can be embedded in a reinforcement learning context. For simplicity, we restrict our study to a long position in call options of unitary notional (extensions to put options, to short positions, or to a generic notional are trivial). In Section 3 we describe the chosen reinforcement learning algorithm. In Section 4 we present and evaluate the empirical performance. In Section 5 we compare with the current literature, and in Section 6 we present our conclusions and outlook.

2 DELTA HEDGING
In this section we describe the Black & Scholes pricing framework. It is the main option model used in practice, and thus the benchmark considered in this paper. There are two things to model: the underlying and the derived option price. In the B&S framework, the underlying follows a Geometric Brownian Motion (GBM); thus, letting $S_t$ be the underlying price at time $t$, its dynamics can be described as:
$$dS_t = \mu S_t\,dt + \sigma S_t\,dW_t,$$
where $W_t$ is a Brownian motion, $\mu$ the drift (which we assume to be 0 throughout the paper without loss of generality) and $\sigma$ the volatility. For an initial value $S_0$, the SDE has the analytic solution:
$$S_t = S_0 \exp\left(\left(\mu - \frac{\sigma^2}{2}\right)t + \sigma W_t\right),$$
where $W_{t+u} - W_t \sim \mathcal{N}(0, u) = \mathcal{N}(0, 1)\times\sqrt{u}$, $\mathcal{N}$ being the normal distribution. Let $C_t$ be the call option price at time $t$, $T$ the time of maturity, $T - t$ the Time To Maturity (TTM) and $K$ the strike price; $\mu$ and $\sigma$ coincide with those of the GBM.
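To make the discrete delta-hedging mechanics above concrete, the following Python sketch simulates a GBM path with the analytic solution, computes the B&S price and delta (with $\mu = 0$), and accumulates the per-step hedging p&l $\rho_{t+k}$, optionally charging a proportional transaction cost. It is a minimal illustration under our own assumptions, not the paper's environment: the function names, the 252-day year convention, the parameter values and the proportional cost model (cost_rate) are ours.

```python
import numpy as np
from scipy.stats import norm

def simulate_gbm(s0, mu, sigma, n_steps, dt, rng):
    """Exact GBM simulation: S_t = S_0 exp((mu - sigma^2/2) t + sigma W_t)."""
    dw = rng.normal(0.0, np.sqrt(dt), n_steps)
    w = np.concatenate(([0.0], np.cumsum(dw)))
    t = np.arange(n_steps + 1) * dt
    return s0 * np.exp((mu - 0.5 * sigma**2) * t + sigma * w)

def bs_call_price_delta(s, k, sigma, ttm, mu=0.0):
    """B&S call price C_t and delta Phi(d_t) for a given time-to-maturity."""
    ttm = np.maximum(ttm, 1e-12)          # avoid division by zero at maturity
    d = (np.log(s / k) + (mu + 0.5 * sigma**2) * ttm) / (sigma * np.sqrt(ttm))
    e = d - sigma * np.sqrt(ttm)
    return norm.cdf(d) * s - norm.cdf(e) * k * np.exp(-mu * ttm), norm.cdf(d)

def delta_hedge_pnl(s_path, k, sigma, maturity, dt, cost_rate=0.0):
    """Per-step p&l of a long call hedged with the B&S delta at each rebalance:
    rho_{t+k} = C_{t+k} - C_t - h_t (S_{t+k} - S_t), minus proportional costs."""
    ttms = maturity - np.arange(len(s_path)) * dt
    prices, deltas = bs_call_price_delta(s_path, k, sigma, ttms)
    pnl = np.diff(prices) - deltas[:-1] * np.diff(s_path)
    trades = np.abs(np.diff(deltas, prepend=0.0))   # change in hedge position
    return pnl - cost_rate * trades[:-1] * s_path[:-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rebalances_per_day, days = 5, 60
    dt = 1.0 / (252 * rebalances_per_day)           # time step in years
    path = simulate_gbm(s0=100.0, mu=0.0, sigma=0.2,
                        n_steps=rebalances_per_day * days, dt=dt, rng=rng)
    pnl = delta_hedge_pnl(path, k=100.0, sigma=0.2,
                          maturity=days / 252, dt=dt, cost_rate=0.0)
    print(f"mean p&l per step: {pnl.mean():.6f}, volatility: {pnl.std():.6f}")
```

With cost_rate = 0 and ever finer rebalancing the p&l volatility shrinks towards zero, which is the B&S zero-profit argument recalled above; with a positive cost rate, more frequent rebalancing increases the cost drag, which is precisely the risk/cost trade-off the risk-averse agents are trained to manage.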
The B&S call price is:
$$C_t(S_t) = \Phi(d_t)\,S_t - \Phi(e_t)\,K e^{-\mu(T-t)},$$
$$d_t = \frac{1}{\sigma\sqrt{T-t}}\left[\ln\frac{S_t}{K} + \left(\mu + \frac{\sigma^2}{2}\right)(T-t)\right],$$
$$e_t = d_t - \sigma\sqrt{T-t},$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution. We introduce $\frac{\partial C_t}{\partial S_t}$, which is known as the option delta and for our position (a long call of unitary notional) is bounded between 0 and 1. In particular, when $T-t$ is relatively small and $\frac{S_t}{K} \ll 1$, $\frac{\partial C_t}{\partial S_t} \to 0$ and $C_t \to 0$; instead, if $\frac{S_t}{K} \gg 1$, $\frac{\partial C_t}{\partial S_t} \to 1$ and $C_t \to S_t$. A trader who has a long position in a call option will endure a profit swing of $C_{t+k}(S_{t+k}) - C_t(S_t)$ for a time-lag of $k$. A delta hedge is a strategy to limit this profit movement by buying or selling a certain quantity of the underlying; call this function $h\!\left(\frac{\partial C_t}{\partial S_t}, e\right)$, which depends on the delta and the trader's experience $e$; we will refer to it as just $h_t$ for ease of notation.

The profit over one timestep of a trader who bought a call option and performs a delta hedge can be calculated as $\rho_{t+k} = C_{t+k}(S_{t+k}) - C_t(S_t) - h_t \times (S_{t+k} - S_t)$. Now assume that we replicate the delta exactly, so $h_t = \frac{\partial C_t}{\partial S_t}$ and no experience comes into play. Then the B&S model assures a zero profit in the continuous limit ($k \to 0$) and in the absence of transaction costs.

From now on, we consider $k > 0$; in particular, we take as a reference point 5 evenly spaced rebalances of $h_t$ per day.

We consider infinite-horizon problems in which future rewards are exponentially discounted with $\gamma$. Following a trajectory $\tau := (s_0, a_0, s_1, a_1, s_2, a_2, \dots)$, let the return be defined as the discounted cumulative reward:
$$G = \sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t). \tag{3}$$
For each state $s$ and action $a$ the action-value function is defined as:
$$Q_\pi(s, a) := \mathbb{E}_{\substack{s_{t+1}\sim \mathcal{P}(\cdot\,|\,s_t, a_t)\\ a_t\sim\pi(\cdot\,|\,s_t)}}\left[\sum_{t=0}^{\infty} \gamma^t \mathcal{R}(s_t, a_t) \,\middle|\, s_0 = s,\, a_0 = a\right],$$
which can be recursively defined by the Bellman equation:
$$Q_\pi(s, a) = \mathcal{R}(s, a) + \gamma\, \mathbb{E}_{\substack{s'\sim\mathcal{P}(\cdot\,|\,s,a)\\ a'\sim\pi(\cdot\,|\,s')}} Q_\pi(s', a'). \tag{4}$$
The typical RL objective $J_\pi$ is defined as
$$J_\pi := \mathbb{E}_{\substack{s_0\sim\mu\\ a_t\sim\pi(\cdot\,|\,s_t)\\ s_{t+1}\sim\mathcal{P}(\cdot\,|\,s_t,a_t)}}\left[\sum_{t=0}^{\infty}\gamma^t \mathcal{R}(s_t, a_t)\right] = \mathbb{E}_{\substack{s_0\sim\mu\\ a\sim\pi(\cdot\,|\,s)}}\left[Q(s, a)\right]. \tag{5}$$
This objective can be maximized in two main ways. The first is by learning the optimal action-value function $Q^*$ for each state and action, so that the optimal policy is $\pi^*(a|s) = \operatorname{argmax}_a Q^*(s, a)$. These algorithms are called value-based [29], and are used in [5, 15]. This type of algorithm can become impractical in a hedging environment where both states and actions are continuous. In fact, it is necessary to approximate the value function, and it has been shown that even in relatively simple cases such algorithms fail to converge [1]. The other family is instead policy search methods [7], which optimize the policy directly.
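As a toy illustration of equations (3)–(5) and of the value-based route, the sketch below computes the discounted return of a sampled (finite) trajectory and extracts the greedy policy $\operatorname{argmax}_a Q(s,a)$ from a tabular action-value function. Such a table exists only for small, discrete state and action spaces, which is exactly why this route becomes impractical in the continuous hedging setting discussed above. The array shapes and numbers are made up for illustration and are not taken from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Discounted cumulative reward of a sampled trajectory, as in eq. (3)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

def greedy_policy(q_table):
    """Value-based route: pi*(a|s) = argmax_a Q*(s, a) for a tabular Q (states x actions)."""
    return np.argmax(q_table, axis=1)

# Toy example: 3 states, 2 actions.
q = np.array([[0.1, 0.4],
              [0.7, 0.2],
              [0.0, 0.3]])
print(greedy_policy(q))                               # -> [1 0 1]
print(discounted_return([1.0, 0.5, 0.25], gamma=0.99))
```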