A Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding

Ioannis Boukas 1, Damien Ernst 1, Thibaut Théate 1, Adrien Bolland 1, Alexandre Huynen 2, Martin Buchwald 2, Christelle Wynants 2, Bertrand Cornélusse 1

1 Department of Electrical Engineering and Computer Science, University of Liège, Liège, Belgium. 2 Market Modeling and Market View, ENGIE, Brussels, Belgium. Correspondence to: Ioannis Boukas.

arXiv:2004.05940v1 [q-fin.TR] 13 Apr 2020

Abstract

The large integration of variable energy resources (RES) is expected to shift a large part of the energy exchanges closer to real time, where more accurate forecasts are available. In this context, the short-term electricity markets, and in particular the intraday market, are considered a suitable trading floor for these exchanges to occur. A key component for the successful integration of renewable energy sources is the usage of energy storage. In this paper, we propose a novel modelling framework for the strategic participation of energy storage in the European continuous intraday market, where exchanges occur through a centralized order book. The goal of the storage device operator is the maximization of the profits received over the entire trading horizon, while taking into account the operational constraints of the unit. The sequential decision-making problem of trading in the intraday market is modelled as a Markov Decision Process. An asynchronous distributed version of the fitted Q iteration algorithm is chosen for solving this problem due to its sample efficiency. The large and variable number of existing orders in the order book motivates the use of high-level actions and an alternative state representation. Historical data are used for the generation of a large number of artificial trajectories in order to address exploration issues during the learning process. The resulting policy is back-tested and compared against a benchmark strategy that is the current industrial standard. Results indicate that the agent converges to a policy that achieves on average higher total revenues than the benchmark strategy.

1. Introduction

The vast integration of renewable energy resources (RES) into (future) power systems, as directed by the recent worldwide energy policy drive (The European Commission, 2017), has given rise to challenges related to the security, sustainability and affordability of the power system ("The Energy Trilemma"). The impact of high RES penetration on modern short-term electricity markets has been the subject of extensive research over the last few years. Short-term electricity markets in Europe are organized as a sequence of trading opportunities where participants can trade energy in the day-ahead market and can later adjust their schedule in the intraday market until the physical delivery. Deviations from this schedule are then corrected by the transmission system operator (TSO) in real time, and the responsible parties are penalized for their imbalances (Meeus & Schittekatte, 2017).

Imbalance penalties serve as an incentive for all market participants to accurately forecast their production and consumption and to trade based on these forecasts (Scharff & Amelin, 2016). Due to the variability and the lack of predictability of RES, the output planned in the day-ahead market may differ significantly from the actual RES output in real time (Karanfil & Li, 2017). Since the RES forecast error decreases substantially with a shorter prediction horizon, the intraday market allows RES operators to trade these deviations whenever an improved forecast is available (Borggrefe & Neuhoff, 2011). As a consequence, intraday trading is expected to reduce the costs related to the reservation and activation of capacity for balancing purposes. The intraday market is therefore a key aspect towards cost-efficient RES integration and enhanced security of supply.

Owing to the fact that commitment decisions are taken close to real time, the intraday market is a suitable market floor for the participation of flexible resources (i.e. units able to rapidly increase or decrease their generation/consumption). However, fast-ramping thermal units (e.g. gas power plants) incur a high cost when forced to modify their output, to operate in part load, or to frequently start up and shut down. The increased cost related to the cycling of these units will be reflected in their offers in the intraday market (Pérez-Arriaga & Knittel et al., 2016). Alternatively, flexible storage devices (e.g. pumped hydro storage units or batteries) with low cycling and zero fuel cost can offer their flexibility at a comparatively low price, close to the gate closure. Hence, they are expected to play a key role in the intraday market.

1.1. Intraday markets in Europe

In Europe, the intraday markets are organized in two distinct designs, namely auction-based or continuous trading.

In auction-based intraday markets, participants can submit their offers to produce or consume energy at a certain time slot until gate closure. After the gate closure, the submitted offers are used to form the aggregate demand and supply curves. The intersection of the aggregate curves defines the clearing price and quantity (Neuhoff et al., 2016). The clearing rule is uniform pricing, according to which there is only one clearing price at which all transactions occur. Participants are incentivized to bid at their marginal cost since they are paid at the uniform price. This mechanism increases price transparency, although it leads to inefficiencies, since imbalances after the gate closure can no longer be traded (Hagemann, 2015).

In continuous intraday (CID) markets, participants can submit orders to buy or to sell energy at any point during the trading session. The orders are treated according to the first come first served (FCFS) rule. A transaction occurs as soon as the price of a new "Buy" ("Sell") order is equal to or higher (lower) than the price of an existing "Sell" ("Buy") order. Each transaction is settled following the pay-as-bid principle, stating that the transaction price is specified by the oldest of the two orders present in the order book. Unmatched orders are stored in the order book and are accessible to all market participants. The energy delivery resolution offered by the CID market in Europe ranges between hourly, 30-minute and 15-minute products, and the gate closure takes place between five and 60 minutes before actual delivery.

Continuous trading gives the opportunity to market participants to trade imbalances as soon as they appear (Hagemann, 2015). However, the FCFS rule is inherently associated with lower allocative efficiency compared to auction rules. This implies that, depending on the time of arrival of the orders, some trades with a positive welfare contribution may not occur, while others with a negative welfare contribution may be realised (Henriot, 2014). It is observed that a combination of continuous and auction-based intraday markets can increase the market efficiency in terms of liquidity, and results in reduced price volatility (Neuhoff et al., 2016).

In practice, the available contracts ("Sell" and "Buy" orders) can be categorized into three types:

• The market order, where no price limit is specified (the order is matched at the best available price);
• The limit order, which contains a price limit and can only be matched at that or at a better price;
• The market sweep order, which is executed immediately (fully or partially) or gets cancelled.

Limit orders may appear with restrictions related to their execution and their validity. For instance, an order that carries the specification Fill or Kill should either be fully and immediately executed or cancelled. An order that is specified as All or Nothing remains in the order book until it is entirely executed (Balardy, 2017a).

The European Network Codes, and specifically the capacity allocation and congestion management guidelines (CACM GL) (Meeus & Schittekatte, 2017), suggest that continuous trading should be the main intraday market mechanism. Complementary regional intraday auctions can also be put in place if they are approved by the regulatory authorities (Meeus & Schittekatte, 2017). In that direction, the Cross-Border Intraday (XBID) Initiative (Spot, 2018) has enabled continuous cross-border intraday trading across Europe. Participants of each country have access to orders placed by participants of any other country in the consortium through a centralized order book, provided that there is available cross-border capacity.

1.2. Bidding strategies in literature

The strategic participation of power producers in short-term electricity markets has been extensively studied in the literature. In order to co-optimise the decisions made in the sequential trading floors from day-ahead to real time, the problem has traditionally been addressed using multi-stage stochastic optimisation. Each decision stage corresponds to a trading floor (i.e. day-ahead, capacity markets, real time), where the final decisions take into account uncertainty using stochastic processes. In particular, the influence that the producer may have on the market price formation leads to the distinction between "price-maker" and "price-taker" and results in a different modelling of the uncertainty.

In (Baillo et al., 2004), the optimisation of a portfolio of generating assets over three trading floors (i.e. the day-ahead, the adjustment and the reserves market) is proposed, where the producer is assumed to be a "price-maker". The offering strategy of the producer is a result of the stochastic residual demand curve as well as the behaviour of the rest of the market players. On the contrary, a "price-taker" producer is considered in (Plazas et al., 2005) for the first two stages of the problem studied, namely the day-ahead and the automatic generation control (AGC) market. However, since the third-stage (balancing market) traded volumes are small, the producer can negatively affect the prices with its participation. Price scenarios are generated using ARIMA models for the first two stages, whereas for the third stage a linear curve with negative slope is used to represent the influence of the producer's offered capacity on the market price.

Hydro-power plant participation in short-term markets, accounting for the technical constraints and several reservoir levels, is formulated and solved in (Fleten & Kristoffersen, 2007). Optimal bidding curves for the participation of a "price-taker" hydro-power producer in the Nordic spot market are derived, accounting for price uncertainty. In (Boomsma et al., 2014), the bidding strategy of a two-level reservoir plant is cast as a multi-stage stochastic program in order to represent the different sequential trading floors, namely the day-ahead spot market and the hour-ahead balancing market. The effects of coordinated bidding and of the "price-maker" versus "price-taker" assumptions on the generated profits are evaluated. In (Pandžić et al., 2013), bidding strategies for a virtual power plant (VPP) buying and selling energy in the day-ahead and the balancing market are investigated in the form of a multi-stage stochastic optimisation. The VPP aggregates a pumped hydro energy storage (PHES) unit as well as a conventional generator with stochastic intermittent power production and consumption. The goal of the VPP operator is the maximization of the expected profits under price uncertainty.

In these approaches, the intraday market is considered as auction-based and is modelled as a single recourse action. For each trading period, the optimal offered quantity is derived according to the realization of various stochastic variables. However, in reality, for most European countries, according to the EU Network Codes (Meeus & Schittekatte, 2017), modern intraday market trading will primarily be a continuous process.

The strategic participation in the CID market is investigated for the case of an RES producer in (Henriot, 2014) and (Garnier & Madlener, 2015). In both works, the problem is formulated as a sequential decision-making process, where the operator adjusts its offers during the trading horizon according to the RES forecast updates for the physical delivery of power. Additionally, in (Gönsch & Hassler, 2016) the use of a PHES unit is proposed to undertake energy arbitrage and to offset potential deviations. The trading process is formulated as a Markov Decision Process (MDP) where the future commitment decision in the market is based on the realization of the intraday price, the imbalance penalty, the RES production and the storage availability.

The volatility of the CID prices, along with the quality of the forecast updates, are found to be key factors that influence the degree of activity and success of the deployed bidding strategies (Henriot, 2014). Therefore, the CID prices and the forecast errors are considered as correlated stochastic processes in (Garnier & Madlener, 2015). Alternatively, in (Henriot, 2014), the CID price is constructed as a linear function of the offered quantity with an increasing slope as the gate closure approaches. In this way, the scarcity of conventional units approaching real time is reflected. In (Gönsch & Hassler, 2016), real weather data and market data are used to simulate the forecast error and CID price processes.

For the sequential decision-making problem in the CID market, the offered quantity of energy is the decision variable to be optimised (Garnier & Madlener, 2015). The optimisation is carried out using Approximate Dynamic Programming (ADP) methods, where a parameterised policy is obtained based on the observed stochastic processes for the price, the RES error and the level of the reservoir (Gönsch & Hassler, 2016). The ADP approach presented in (Gönsch & Hassler, 2016) is compared in (Hassler, 2017) to some threshold-based heuristic decision rules. The parameters are updated according to simulation-based experience and the obtained performance is comparable to the ADP algorithm. The obtained decision rules are intuitively interpretable and are derived efficiently through simulation-based optimisation.

The bidding strategy deployed by a storage device operator participating in a slightly different real-time market organized by NYISO is presented in (Jiang & Powell, 2014). In this market, the commitment decision is taken one hour ahead of real time and the settlements occur intra-hour every five minutes. In this setting, the storage operator selects two price thresholds at which the intra-hour settlements occur. The problem is formulated as an MDP and is solved using an ADP algorithm that exploits a particular monotonicity property. A distribution-free variant that assumes no knowledge of the price distribution is proposed. The optimal policy is trained using historical real-time price data.

Even though the focus of the mentioned articles lies on the CID market, the trading decisions are considered to take place in discrete time-steps. A different approach is presented in (Aïd et al., 2016), where the CID market participation is modelled as a continuous-time process using stochastic differential equations (SDE). The Hamilton-Jacobi-Bellman (HJB) equation is used for the determination of the optimal trading strategy. The goal is the minimization of the imbalance cost faced by a power producer, arising from the residual error between the RES production and demand. The optimal trading rate is derived assuming a stochastic process for the market price, using real market data and the residual error.

In the approaches presented so far, the CID price is modelled as a stochastic process assuming that the participating agent is a "price-taker". However, in the CID market, this assumption implies that the CID market is liquid and that the prices at which one can buy or sell energy at a given time are similar or the same. This assumption does not always hold, since the mean bid-ask spread in a trading session in the German intraday market for 2015 was several hundred times larger than the tick-size (i.e. the minimum price movement of a trading instrument) (Balardy, 2017b). It is also reported in the same study that the spread decreases as trading approaches the gate closure.

An approach that explicitly considers the order book is presented in (Bertrand & Papavasiliou, 2019). A threshold-based policy is used to optimise the bid acceptance for storage units participating in the CID market. A collection of different factors, such as the time of the day, are used for the adaptation of the price thresholds. The threshold policy is trained using a policy gradient method (REINFORCE) and the results show improved performance against the "rolling intrinsic" benchmark. In this paper, we present a methodology that serves as a wrapper around the existing industrial state of the art. We provide a generic modelling framework for the problem and we elaborate on all the assumptions that allow the formulation of the problem as an MDP. We solve the resulting problem using a value function approximation method.

1.3. Contribution of the paper

In this paper, we focus on the sequential decision-making problem of a storage device operator participating in the CID market. Firstly, we present a novel modelling framework for the CID market, where the trading agents exchange energy via a centralized order book. Each trading agent is assumed to dynamically select the orders that maximize its benefits throughout the trading horizon. In contrast to the existing literature, all the available orders are considered explicitly with the intention to uncover more information at each trading decision. The liquidity of the market and the influence of each agent on the price are directly reflected in the observed order book, and the developed strategies can adapt accordingly. Secondly, we model explicitly the dynamics of the storage system. Finally, the resulting problem is cast as an MDP.

The intraday trading problem of a storage device is solved using Deep Reinforcement Learning techniques, specifically an asynchronous distributed variant of the fitted Q iteration RL algorithm with deep neural networks as function approximators (Ernst et al., 2005). Due to the high dimensionality and the dynamically evolving size of the order book, we motivate the use of high-level actions and a constant-size, low-dimensional state representation. The agent can select between trading and idling. The goal of the selected actions is the identification of the opportunity cost of trading given the state of the order book observed and the maximization of the total return over the trading horizon. The resulting optimal policy is evaluated using real data from the German ID market (EPEXSPOT, 2017). In summary, the contributions of this work are the following:

• We model the CID market trading process as an MDP where the energy exchanges occur explicitly through a centralized order book. The trading policy is adapted to the state of the order book.
• The operational constraints of the storage device are considered explicitly.
• A state space reduction is proposed in order to deal with the size of the order book.
• A novel representation of high-level actions is used to identify the opportunity cost between trading and idling.
• The fitted Q iteration algorithm is used to find a time-variant policy that maximizes the total cumulative profits collected.
• Artificial trajectories are produced using historical data from the German CID market in order to address exploration issues.

1.4. Outline of the paper

The rest of the paper is organized as follows. In Section 2, the CID market trading framework is presented. The interaction of the trading agents via a centralized order book is formulated as a dynamic process. All the available information for an asset trading agent is detailed and the objective is defined as the cumulative profits. In Section 3, all the assumptions necessary to formulate the bidding process in the CID market as an MDP are listed. The methodology utilised to find an optimal policy that maximizes the cumulative profits of the proposed MDP is detailed in Section 4. A case study using real data from the German CID market is performed in Section 5. The results, as well as considerations about the limitations of the developed methodology, are discussed in Section 6. Finally, conclusions of this work are drawn and future recommendations are provided in Section 7. A detailed nomenclature is provided in Appendix A.

2. Continuous Intraday Bidding process

2.1. Continuous Intraday market design

The participation in the CID market is a continuous process similar to the stock exchange. Each market product x ∈ X, where X is the set of all available products, is defined as the physical delivery of energy in a pre-defined time slot. The time slot corresponding to product x is defined by its starting point t_delivery(x) and its duration λ(x). The trading process for time slot x opens at t_open(x) and closes at t_close(x).

Figure 1. Trading (continuous and discrete) and delivery timelines for products Q1 to Q4

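As a concrete illustration of this notation, the following minimal Python sketch represents the quarter-hour products Q1-Q4 shown in Figure 1 with their opening, closing and delivery times. The 16:00 gate opening and the 30-minute-ahead gate closure follow the German market example used later in this section; the Product class and all field names are purely illustrative, not part of the paper's notation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class Product:
    """A CID market product x: physical delivery of energy over one pre-defined time slot."""
    name: str
    t_delivery: datetime          # start of physical delivery, t_delivery(x)
    duration: timedelta           # slot length, lambda(x)
    t_open: datetime              # gate opening, t_open(x)
    t_close: datetime             # gate closure, t_close(x)

def quarter_hour_products(day_start: datetime) -> list[Product]:
    """Build the quarterly products Q1-Q4 of Figure 1 (delivery 00:00-01:00).

    Assumptions (German market example): trading opens at 16:00 on day D-1 and
    the gate closes 30 minutes before delivery.
    """
    products = []
    for k in range(4):
        delivery = day_start + timedelta(minutes=15 * k)
        products.append(Product(
            name=f"Q{k + 1}",
            t_delivery=delivery,
            duration=timedelta(minutes=15),
            t_open=day_start - timedelta(hours=8),     # 16:00 on day D-1
            t_close=delivery - timedelta(minutes=30),  # 30 min before delivery
        ))
    return products

if __name__ == "__main__":
    for p in quarter_hour_products(datetime(2020, 1, 1)):
        print(p.name, p.t_open, "->", p.t_close, "| delivery", p.t_delivery)
```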
During the time interval t ∈ [t_open(x), t_close(x)], a participant can exchange energy with other participants for the lagged physical delivery during the interval δ(x), with:

    δ(x) = [t_delivery(x), t_delivery(x) + λ(x)].

The exchange of energy takes place through a centralized order book that contains all the unmatched orders o_j, where j ∈ N_t corresponds to a unique index that every order receives upon arrival. The set N_t ⊆ N gathers all the unique indices of the orders available at time t. We denote the status of the order book at time t by O_t = (o_j, ∀j ∈ N_t). As time progresses, new orders appear and existing ones are either accepted or cancelled.

Trading for a set of products is considered to start at the gate opening of the first product and to finish at the gate closure of the last product. More formally, considering an ordered set of available products X = {Q1, ..., Q96}, the corresponding trading horizon is defined as T = [t_open(Q1), t_close(Q96)]. For instance, in the German CID market, trading of hourly (quarterly) products for day D opens at 3 pm (4 pm) of day D−1, respectively. For each product x, the gate closes 30 minutes before the actual energy delivery at t_delivery(x). The timeline for trading products Q1 to Q4, which correspond to the physical delivery in 15-minute time slots from 00:00 until 01:00, is presented in Figure 1. It can be observed that the agent can trade for all products until 23:30. After each subsequent gate closure, the number of available products decreases and the commitment for the corresponding time slot is defined. Potential deviations during the physical delivery of energy are penalized in the imbalance market.

2.2. Continuous Intraday market environment

As its name indicates, the CID market is a continuous environment. In order to solve the trading problem presented in this paper, it has been decided to perform a relevant discretisation operation. As shown in Figure 1, the trading timeline is discretised in a high number of time-steps of constant duration Δt. Each discretised trading interval for product x can be denoted by the set of time-steps T(x) = {t_open(x), t_open(x) + Δt, ..., t_close(x) − Δt, t_close(x)}. Then, the discrete-time trading opportunities for the entire set of products X can be modelled such that the time-steps are defined as t ∈ T = ∪_{x∈X} T(x). In the following, for the sake of clarity, the increment (decrement) operation t + 1 (t − 1) will be used to model the discrete transition from time-step t to time-step t + Δt (t − Δt).

It is important to note that, in theory, the discretisation operation leads to suboptimalities in the decision-making process. However, as the discretisation becomes finer (Δt → 0), the decisions taken can be considered near-optimal. Increasing the granularity of the decision timeline results in an increase of the number of decisions that can be taken and, hence, of the size of the decision-making problem. Thus, there is a clear trade-off between complexity and quality of the resulting decisions when using a finite discretisation.

Let X_t denote the set of available products at time-step t ∈ T, such that:

    X_t = {x | x ∈ X, t ≤ t_close(x)}.

We define the state of the CID market environment at time-step t as s_t^OB = O_t ∈ S^OB. The state contains the observation of the order book at time-step t ∈ T, i.e. the unmatched orders for all the available products x ∈ X_t ⊂ X.

A set of n agents I = {1, 2, ..., n} are continuously interacting in the CID environment, exchanging energy. Each agent i ∈ I can express its willingness to buy or sell energy by posting at instant t a set of new orders a_{i,t} ∈ A_i in the order book, which results in the joint action a_t = (a_{1,t}, ..., a_{n,t}) ∈ ∏_{i=1}^{n} A_i.

The process of designing the set of new orders a_{i,t} for agent i at instant t consists, for each new order, in determining the product x ∈ X_t, the side of the order y ∈ {"Sell", "Buy"}, the volume v ∈ R+, the price level p ∈ [p_min, p_max] of each unit offered to be produced or consumed, and the various validity and execution specifications e ∈ E. The index j of each new order belongs to the set N'_t.

The set of new orders is defined as a_{i,t} = ((x_j, y_j, v_j, p_j, e_j), ∀j ∈ N'_t ⊆ N). We will use the notation for the joint action a_t = (a_{i,t}, a_{-i,t}) to refer to the action a_{i,t} that agent i selects and the joint action that all other agents use, a_{-i,t} = (a_{1,t}, ..., a_{i−1,t}, a_{i+1,t}, ..., a_{n,t}).

Table 1. Order book for Q1 and time slot 00:00-00:15.

    Order | Side   | v [MW] | p [€/MWh]
    4     | "Sell" | 6.25   | 36.3
    2     | "Sell" | 2.35   | 34.5    <- ask
    1     | "Buy"  | 3.15   | 33.8    <- bid
    3     | "Buy"  | 1.125  | 29.3
    5     | "Buy"  | 2.5    | 15.9

The orders are treated according to the first come first served (FCFS) rule. Table 1 presents an observation of the order book for product Q1. The difference between the most expensive "Buy" order ("bid") and the cheapest "Sell" order ("ask") defines the bid-ask spread of the product. A deal between two counter-parties is struck when the price p_buy of a "Buy" order and the price p_sell of a "Sell" order satisfy the condition p_buy ≥ p_sell. This condition is tested at the arrival of each new order. The volume of the transaction is defined as the minimum quantity between the "Buy" and "Sell" order (min(v_buy, v_sell)). The residual volume remains available in the market at the same price. As mentioned in the previous section, each transaction is settled following the pay-as-bid principle, at the price indicated by the oldest order.

Finally, at each time-step t, every agent i observes the state of the order book s_t^OB and performs certain actions (posting a set of new orders) a_{i,t}, inducing a transition which can be represented by the following equation:

    s_{t+1}^OB = f(s_t^OB, a_{i,t}, a_{-i,t}).    (1)

2.3. Asset trading

An asset-optimizing agent participating in the CID market can adjust its position for product x until the corresponding gate closure t_close(x). However, the physical delivery of power is decided at t_delivery(x). An additional amount of information (potentially valuable for certain players) is received during the period [t_close(x), ..., t_delivery(x)], from the gate closure until the delivery of power. Based on this updated information, an asset-trading agent may need to, or have an incentive to, deviate from the net contracted power in the market.

Let v_{i,t}^con = (v_{i,t}^con(x), ∀x ∈ X_t) ∈ R^{|X_t|} gather the volumes of power contracted by agent i for the available products x ∈ X_t at each time-step t ∈ T. In the following, we will adopt the convention for v_{i,t}^con(x) to be positive when agent i contracts the net volume to sell (produce) and negative when the agent contracts the volume to buy (consume) energy for product x at time-step t.

Following each market transition, as indicated by equation (1), the volumes contracted v_{i,t}^con are determined based on the transactions that have occurred. The contracted volumes v_{i,t}^con are derived according to the FCFS rule that is detailed in (EPEXSPOT, 2019). The mathematical formulation of the clearing algorithm is provided in (Le et al., 2019). The objective function of the clearing algorithm is comprised of two terms, namely the social welfare and a penalty term modelling the price-time priority rule. The orders that maximize this objective are matched, provided that they satisfy the balancing equations and the constraints related to their specifications. The clearing rule is implicitly given by:

    v_{i,t}^con = clear(i, s_t^OB, a_{i,t}, a_{-i,t}).    (2)

We denote as P_{i,t}^mar(x) ∈ R the net contracted power in the market by agent i for each product x ∈ X, which is updated at every time-step t ∈ T according to:

    P_{i,t+1}^mar(x) = P_{i,t}^mar(x) + v_{i,t}^con(x),  ∀x ∈ X_t.    (3)

The discretisation of the delivery timeline T̄ is done with time-steps of duration Δτ, equal to the minimum duration of delivery for the products considered. The discrete delivery timeline T̄ is considered to start at the beginning of delivery of the first product, τ_init, and to finish at the end of the delivery of the last product, τ_term. For the simple case where only four quarterly products are considered, as shown in Figure 1, the delivery time-step is Δτ = 15 min and the delivery timeline is T̄ = {00:00, 00:15, ..., 01:00}, where τ_init = 00:00 and τ_term = 01:00. In general, when only one type of product is considered (e.g. quarterly), there is a straightforward relation between the time of delivery τ and the product x, since τ = t_delivery(x) and Δτ = λ(x). Thus, the terms x and τ can be used interchangeably. For the sake of keeping the notation relatively simple, we will only consider quarterly products in the rest of the paper. In such a context, the terms P_{i,t}^mar(τ) or P_{i,t}^mar(x) can be used interchangeably to denote the net contracted power in the market by agent i at trading step t for delivery time-step τ (product x).

As the trading process evolves, the set of delivery time-steps τ for which the asset-optimizing agent can make decisions decreases as trading time t crosses the delivery time τ. Let T̄(t) ⊆ T̄ be a function that yields the subset of delivery time-steps τ ∈ T̄ that follow time-step t ∈ T, such that:

    T̄(t) = {τ | τ ∈ T̄ \ {τ_term}, t ≤ τ}.

The participation of an asset-optimizing agent in the CID market is composed of two coupled decision processes with different timescales. First, the trading process, where a decision is taken at each time-step t about the energy contracted until the gate closure t_close(x). During this process, the agent can decide about its position in the market and create scenarios/make projections about the actual delivery plan based on its position. Second, the physical delivery decision, which is taken at the time of the delivery τ or t_delivery(x) based on the total net contracted power in the market during the trading process.

An agent i participating in the CID market is assumed to monitor the state of the order book s_t^OB and its net contracted power in the market P_{i,t}^mar(x) for each product x ∈ X, which becomes fixed once the gate closure occurs at t_close(x). Depending on the role it presumes in the market, an asset-optimizing agent is assumed to monitor all the available information about its assets. We distinguish the three following cases among the many different roles that can be played by an agent in the CID market:

• The agent controls a physical asset that can generate and/or consume electricity. We define as G_{i,t}(τ) ∈ [G_i^min, G_i^max] the power production level for agent i at delivery time-step τ as computed at trading step t. In a similar way, we define the power consumption level C_{i,t}(τ) ∈ [C_i^min, C_i^max], where C_i^min, C_i^max, G_i^min, G_i^max ∈ R+. We further assume that the actual production g_{i,t}(t') and consumption level c_{i,t}(t') during the time period of delivery t' ∈ [τ, τ + Δτ) are constant for each product x, such that:

      g_{i,t}(t') = G_{i,t}(τ),    (4)
      c_{i,t}(t') = C_{i,t}(τ),    (5)
      ∀t' ∈ [τ, τ + Δτ).

  At each time-step t during the trading process, agent i can decide to adjust its generation level by ΔG_{i,t}(τ) or its consumption level by ΔC_{i,t}(τ). According to these adjustments, the generation and consumption levels can be updated at each time-step t according to:

      G_{i,t+1}(τ) = G_{i,t}(τ) + ΔG_{i,t}(τ),    (6)
      C_{i,t+1}(τ) = C_{i,t}(τ) + ΔC_{i,t}(τ),    (7)
      ∀τ ∈ T̄(t).

  Let w_{i,t}^exog denote any other relevant exogenous information available to agent i, such as the RES forecast, a forecast of the actions of other agents, or the imbalance prices. The computation of ΔG_{i,t}(·) and ΔC_{i,t}(·) depends on the market position, the technical limits of the assets, the state of the order book and the exogenous information w_{i,t}^exog. We define the residual production P_{i,t}^res(τ) ∈ R at delivery time-step τ as the difference between the production and the consumption levels, which can be computed by:

      P_{i,t}^res(τ) = G_{i,t}(τ) − C_{i,t}(τ).    (8)

  We note that the amount of residual production P_{i,t}^res(τ) aggregates the combined effects that G_{i,t}(τ) and C_{i,t}(τ) have on the revenues made by agent i through interacting with the markets (intraday/imbalance).

  The level of generation and consumption for a market period τ can be adjusted at any time-step t before the physical delivery τ, but it becomes binding when t = τ. We denote as Δ_{i,t}(τ) the deviation from the market position for each time-step τ, as scheduled at time t, after having computed the variables G_{i,t}(τ) and C_{i,t}(τ), as follows:

      P_{i,t}^mar(τ) + Δ_{i,t}(τ) = P_{i,t}^res(τ),    (9)
      ∀τ ∈ T̄(t).

  The term Δ_{i,t}(τ) represents the imbalance for market period τ as estimated at time t. This imbalance may evolve up to time t = τ. We denote by Δ_i(τ) = Δ_{i,t=τ}(τ) the final imbalance for market period τ.

  The power balance of equation (9) written for time-step t + 1 is given by:

      P_{i,t+1}^mar(τ) + Δ_{i,t+1}(τ) = G_{i,t+1}(τ) − C_{i,t+1}(τ),    (10)
      ∀τ ∈ T̄(t + 1).

  It can be observed that by substituting equations (3), (6) and (7) in equation (10), we have:

      P_{i,t}^mar(τ) + v_{i,t}^con(τ) + Δ_{i,t+1}(τ) = G_{i,t}(τ) + ΔG_{i,t}(τ) − (C_{i,t}(τ) + ΔC_{i,t}(τ)),    (11)
      ∀τ ∈ T̄(t).

  The combination of equations (8) and (9) with equation (11) yields the update of the imbalance vector according to:

      Δ_{i,t+1}(τ) = Δ_{i,t}(τ) + ΔG_{i,t}(τ) − ΔC_{i,t}(τ) − v_{i,t}^con(τ),    (12)
      ∀τ ∈ T̄(t).

• The agent does not own any physical asset (market maker). This is equivalent to the first case with C_i^min = C_i^max = G_i^min = G_i^max = 0. The net imbalance Δ_{i,t}(τ) is updated at every time-step t ∈ T according to:

      P_{i,t}^mar(τ) + Δ_{i,t}(τ) = 0,    (13)
      ∀τ ∈ T̄(t).

• The agent controls a storage device that can produce, store and consume energy. We can consider an agent controlling a storage device as an agent that controls generation and consumption assets with specific constraints on the generation and the consumption level, related to the nature of the storage device. Following this argument, let G_{i,t}(τ) (C_{i,t}(τ)) refer to the level of discharging (charging) of the storage device for delivery time-step τ, updated at time t. Obviously, if G_{i,t}(τ) > 0 (C_{i,t}(τ) > 0), then we automatically have C_{i,t}(τ) = 0 (G_{i,t}(τ) = 0), since a battery cannot charge and discharge energy at the same time. In this case, agent i can decide to adjust its discharging (charging) level by ΔG_{i,t}(τ) (ΔC_{i,t}(τ)). Let SoC_{i,t}(τ) denote the state of charge of the storage unit at delivery time-step τ ∈ T̄ as it is computed at time-step t, where SoC_{i,t}(τ) ∈ [SoC_i^min, SoC_i^max]. The evolution of the state of charge during the delivery timeline can be updated at decision time-step t as:

      SoC_{i,t}(τ + Δτ) = SoC_{i,t}(τ) + Δτ · (η C_{i,t}(τ) − G_{i,t}(τ)/η),    (14)
      ∀τ ∈ T̄(t).

  Parameter η represents the charging and discharging efficiencies of the storage unit which, for simplicity, we assume are equal. We note that for batteries, charging and discharging efficiencies may be different and depend on the charging/discharging speeds. As can be observed from equation (14), time-coupling constraints are imposed on C_{i,t}(τ) and G_{i,t}(τ) in order to ensure that the amount of energy that can be discharged during some period already exists in the storage device. Additionally, constraints associated with the maximum charging power C_i^max and discharging power G_i^max, as well as the maximum and minimum energy levels (SoC_i^max, SoC_i^min), are considered in order to model the operation of the storage device.

  Equation (14) can be written for time-step t + 1 as:

      SoC_{i,t+1}(τ + Δτ) = SoC_{i,t+1}(τ) + Δτ · (η C_{i,t+1}(τ) − G_{i,t+1}(τ)/η),    (15)
      ∀τ ∈ T̄(t + 1).

  Combining equations (14) and (15), we can derive the updated vector of the state of charge at time-step t + 1, depending on the decided adjustments (ΔG_{i,t}(τ), ΔC_{i,t}(τ)), as:

      SoC_{i,t+1}(τ + Δτ) − SoC_{i,t+1}(τ) = SoC_{i,t}(τ + Δτ) − SoC_{i,t}(τ) + Δτ · (η ΔC_{i,t}(τ) − ΔG_{i,t}(τ)/η),    (16)
      ∀τ ∈ T̄(t).

  The state of charge SoC_{i,t}(τ) at delivery time-step τ can be updated until t = τ. Let us also observe that there is a bijection between P_{i,t}^res(τ) and the terms C_{i,t}(τ) and G_{i,t}(τ); in other words, determining P_{i,t}^res is equivalent to determining C_{i,t}(τ) and G_{i,t}(τ), and vice versa. The deviation from the committed schedule Δ_{i,t+1}(τ) at delivery time-step τ at each time-step t + 1 can be computed by equation (12).

  All the new information arriving at time-step t for an asset-optimizing agent i (controlling a storage device) is gathered in the variable:

      s_{i,t} = (s_t^OB, (P_{i,t}^mar(τ), Δ_{i,t}(τ), G_{i,t}(τ), C_{i,t}(τ), SoC_{i,t}(τ), ∀τ ∈ T̄), w_{i,t}^exog) ∈ S_i.
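As a small illustration of the state-of-charge recursion in equation (14), the Python sketch below propagates one charging/discharging schedule over the delivery timeline with a symmetric efficiency η and flags violations of the power and energy limits; the function name, the limits and the numerical values are illustrative assumptions, not quantities from the case study.

```python
def propagate_soc(soc_init, charge, discharge, dt=0.25, eta=0.9,
                  soc_min=0.0, soc_max=10.0, c_max=5.0, g_max=5.0):
    """Propagate SoC_{i,t}(tau) over the delivery timeline (equation (14)).

    charge[k] / discharge[k] are C_{i,t}(tau_k) and G_{i,t}(tau_k) in MW,
    dt is the delivery step in hours (15 min) and eta the (symmetric)
    charging/discharging efficiency. Returns the SoC profile in MWh.
    """
    soc = [soc_init]
    for c, g in zip(charge, discharge):
        assert not (c > 0 and g > 0), "a battery cannot charge and discharge at the same time"
        assert 0 <= c <= c_max and 0 <= g <= g_max, "power limits violated"
        nxt = soc[-1] + dt * (eta * c - g / eta)
        assert soc_min <= nxt <= soc_max + 1e-9, "energy limits violated"
        soc.append(nxt)
    return soc

# Charge at 4 MW for two quarters, idle for one, then discharge at 4 MW.
print(propagate_soc(2.0, charge=[4, 4, 0, 0], discharge=[0, 0, 0, 4]))
# -> [2.0, 2.9, 3.8, 3.8, 2.688...]
```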

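Looking back at the matching mechanics of Section 2.2 (Table 1, the condition p_buy ≥ p_sell, the minimum-volume rule and pay-as-bid settlement at the price of the older order), the following simplified Python sketch matches one incoming order against a toy order book for a single product. It ignores validity and execution specifications and is only an illustration of the FCFS principle, not the EPEX clearing algorithm referenced above; all names and figures are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Order:
    idx: int        # arrival index j (smaller = older)
    side: str       # "Buy" or "Sell"
    volume: float   # MW
    price: float    # EUR/MWh

def match_new_order(book, new):
    """Match `new` against `book` (FCFS, pay-as-bid at the existing order's price).

    Returns the list of (traded_volume, transaction_price) and the residual
    volume of the new order that remains unmatched.
    """
    # Counter-side orders, best price first, ties broken by arrival order.
    counter = [o for o in book if o.side != new.side]
    counter.sort(key=lambda o: (o.price if new.side == "Buy" else -o.price, o.idx))

    trades, residual = [], new.volume
    for o in counter:
        crossed = (new.price >= o.price) if new.side == "Buy" else (new.price <= o.price)
        if not crossed or residual <= 0:
            break
        qty = min(residual, o.volume)     # min(v_buy, v_sell)
        trades.append((qty, o.price))     # price set by the existing (older) order
        o.volume -= qty                   # residual volume stays in the book
        residual -= qty
    book[:] = [o for o in book if o.volume > 1e-9]
    return trades, residual

# Order book of Table 1 and an aggressive "Buy" order for 3 MW at 35 EUR/MWh.
book = [Order(4, "Sell", 6.25, 36.3), Order(2, "Sell", 2.35, 34.5),
        Order(1, "Buy", 3.15, 33.8), Order(3, "Buy", 1.125, 29.3), Order(5, "Buy", 2.5, 15.9)]
print(match_new_order(book, Order(6, "Buy", 3.0, 35.0)))
# -> matches 2.35 MW at 34.5 EUR/MWh (the ask); 0.65 MW remain unmatched.
```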
The control action applied by an asset-optimizing agent i trading in the CID market at time-step t consists of posting new orders in the CID market and adjusting its production/consumption level or, equivalently, its charging/discharging level for the case of the storage device. The control actions can be summarised in the variable u_{i,t} = (a_{i,t}, (ΔC_{i,t}(τ), ΔG_{i,t}(τ), ∀τ ∈ T̄)).

In this paper, we consider that the trading agent adopts a simple strategy for determining, at each time-step t, the variables ΔC_{i,t}(τ), ΔG_{i,t}(τ) once the trading actions a_{i,t} have been selected. In this case, the decision regarding the trading actions a_{i,t} fully defines the action u_{i,t} and thus the notation u_{i,t} will not be further used. This strategy will be referred to in the rest of the paper as the "default" strategy for managing the storage device. According to this strategy, the agent aims at minimizing any imbalances (Δ_{i,t+1}(τ)) and therefore we use the following decision rule:

    (ΔC_{i,t}(τ), ΔG_{i,t}(τ), ∀τ ∈ T̄) = argmin Σ_{τ∈T̄} |Δ_{i,t+1}(τ)|,
    s.t. (2), (3), (8), (9), (12), (14).    (17)

One can easily see from equation (11) that this decision rule is equivalent to imposing P_{i,t+1}^res(τ) as close as possible to P_{i,t+1}^mar(τ), given the operational constraints of the device. We will elaborate later in this paper on the fact that adopting such a strategy is not suboptimal in a context where the agent needs to be balanced for every market period while being an aggressor in the CID market.

For the sake of simplicity, we assume that the decision process of an asset-optimizing agent terminates at the gate closure t_close(x), along with the trading process. Thus, the final residual production P_i^res(τ) for delivery time-step τ is given by P_i^res(τ) = P_{i,t=t_close(x)}^res(τ). Similarly, the final imbalance is provided by Δ_i(τ) = Δ_{i,t=t_close(x)}(τ).

Although this approach can be used for the optimisation of a portfolio of assets, in this paper the focus lies on the case where the agent is operating a storage device. We note that this case is particularly interesting in the context of the energy transition, where storage devices are expected to play a key role in the energy market.

2.4. Trading rewards

The instantaneous reward signal collected after each transition for agent i is given by:

    r_{i,t} = R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}),    (18)

where R_i: T × S_i × A_1 × ... × A_n → R. The reward function R_i is composed of the following terms:

i. The trading revenues obtained from the matching process of orders at time-step t, given by ρ, where ρ is a stationary function ρ: S^OB × A_1 × ... × A_n → R;

ii. The imbalance penalty for the deviation Δ_i(τ) from the market position for delivery time-step τ at the imbalance price I(τ). The imbalance settlement process for product x ∈ X (delivery time-step τ) takes place at the end of the physical delivery t_settle(x) (i.e. at τ + Δτ), as presented in Figure 1. We define the imbalance settlement timeline T^Imb as T^Imb = {τ + Δτ, ∀τ ∈ T̄}. The imbalance penalty¹ is only applied when the time instance t is an element of the imbalance settlement timeline.

The function R_i is defined as:

    R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}) = ρ(s_t^OB, a_{i,t}, a_{-i,t}) + { Δ_i(τ) · I(τ)  if t ∈ T^Imb;  0  otherwise }.    (19)

2.5. Trading policy

All the relevant information that summarises the past and that can be used to optimise the market participation is assumed to be contained in the history vector h_{i,t} = (s_{i,0}, a_{i,0}, r_{i,0}, ..., s_{i,t−1}, a_{i,t−1}, r_{i,t−1}, s_{i,t}) ∈ H_i. Trading agent i is assumed to select its actions following a non-anticipative history-dependent policy π_i(h_{i,t}) ∈ Π from the set of all admissible policies Π, according to a_{i,t} ∼ π_i(·|h_{i,t}).

2.6. Trading objective

The return collected by agent i in a single trajectory ζ = (s_{i,0}, a_{i,0}, ..., a_{i,K−1}, s_{i,K}) of K − 1 time-steps, given an initial state s_{i,0} = s_i ∈ S_i, which is the sum of cumulated rewards over this trajectory, is given by:

    G^ζ(s_i) = Σ_{t=0}^{K−1} R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}) | s_{i,0} = s_i.    (20)

The sum of returns collected by agent i, where each agent i is following a randomized policy π_i ∈ Π, is consequently given by:

    V^{π_i}(s_i) = E_{a_{i,t} ∼ π_i, a_{-i,t} ∼ π_{-i}} { Σ_{t=0}^{K−1} R_i(t, s_{i,t}, a_{i,t}, a_{-i,t}) | s_{i,0} = s_i }.    (21)

The goal of the trading agent i is to identify an optimal policy π_i^* ∈ Π that maximizes the expected sum of rewards collected along a trajectory. An optimal policy is obtained by:

    π_i^* = argmax_{π_i ∈ Π} V^{π_i}(s_i).    (22)

¹ The imbalance price I(τ) is defined by a process that depends on a plethora of factors, among which is the net system imbalance during delivery period τ, defined by the imbalance volumes of all the market players (Σ_{i∈I} Δ_i(τ)). For the sake of simplicity, we will assume that it is randomly sampled from a known distribution over prices that is not conditioned on any variable.
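The reward decomposition of equation (19) can be summarised in a few lines of Python. In the sketch below, trading_revenue stands for the matching term ρ and the imbalance term is charged only at settlement instants; all argument names and numbers are illustrative placeholders rather than quantities defined by the market.

```python
def reward(t, trading_revenue, settlement_times, final_imbalance, imbalance_price):
    """Instantaneous reward of equation (19).

    trading_revenue: revenue rho collected from orders matched at time t (EUR).
    settlement_times: the imbalance settlement timeline T^Imb = {tau + dtau, ...}.
    final_imbalance / imbalance_price: Delta_i(tau) in MWh and I(tau) in EUR/MWh
    for the delivery period tau settled at t (if any).
    """
    r = trading_revenue
    if t in settlement_times:
        r += final_imbalance[t] * imbalance_price[t]   # imbalance settlement term
    return r

# A settlement instant with a 0.5 MWh deviation priced at -60 EUR/MWh.
print(reward(t=9000, trading_revenue=12.5,
             settlement_times={9000}, final_imbalance={9000: 0.5},
             imbalance_price={9000: -60.0}))   # -> -17.5
```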

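The "default" storage-management rule of equation (17) essentially re-dispatches the device so that its residual production tracks the contracted position as closely as the operational limits allow. Below is a rough single-period Python reading of that projection, assuming one delivery period, no efficiency losses and illustrative power limits; the full rule is a small optimisation over all remaining delivery periods subject to constraints (2), (3), (8), (9), (12) and (14), including the energy limits that are ignored here.

```python
def default_dispatch(p_mar, g_prev, c_prev, g_max=5.0, c_max=5.0):
    """Pick (G, C) for one delivery period so that P_res = G - C is as close as
    possible to the contracted position P_mar (a one-period view of eq. (17)).

    Returns the new levels, the adjustments and the remaining imbalance Delta.
    """
    if p_mar >= 0:                      # net seller: discharge up to the limit
        g, c = min(p_mar, g_max), 0.0
    else:                               # net buyer: charge up to the limit
        g, c = 0.0, min(-p_mar, c_max)
    imbalance = (g - c) - p_mar         # Delta = P_res - P_mar (zero if feasible)
    return g, c, g - g_prev, c - c_prev, imbalance

print(default_dispatch(p_mar=3.0, g_prev=1.0, c_prev=0.0))   # -> (3.0, 0.0, 2.0, 0.0, 0.0)
print(default_dispatch(p_mar=-7.0, g_prev=0.0, c_prev=2.0))  # -> (0.0, 5.0, 0.0, 3.0, 2.0)
```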
3. Reinforcement Learning Formulation

In this section, we propose a series of assumptions that allow us to formulate the previously introduced problem of a storage device operator trading in the CID market using a reinforcement learning (RL) framework. Based on these assumptions, the decision-making problem is cast as an MDP; the action space is tailored in order to represent a particular market player and additional restrictions on the operation of the storage device are introduced.

3.1. Assumptions on the decision process

Assumption 1 (Behaviour of the other agents). The other agents −i interact with the order book in between two discrete time-steps in such a way that agent i can be considered independent when interacting with the CID market at each time-step t. Moreover, it is assumed that these actions a_{-i,t} depend strictly on the history of order book states s_{t−1}^OB and thus, by extension, on the history h_{i,t−1} for every time-step t:

    a_{-i,t} ∼ P_{a_{-i,t}}(·|h_{i,t−1}).    (23)

Assumption (1) suggests that the agents engage in a way that is very similar to a board game like chess, for which player i can make a move only after players −i have made their move. This behaviour is illustrated in Figure 1 (magnified area). Given this assumption, the notation a_{-i,t} can also be seen as referring to actions selected during the interval (t − Δt, t).

Assumption 2 (Exogenous information). The exogenous information w_{i,t}^exog is given by a stochastic model that depends solely on k past values, where 0 < k ≤ t, and a random disturbance e_{i,t}, according to:

    w_{i,t}^exog = b(w_{i,t−1}^exog, ..., w_{i,t−k}^exog, e_{i,t}),    (24)
    e_{i,t} ∼ P_{e_{i,t}}(·|h_{i,t}).    (25)

Assumption 3 (Strategy for storage control). The control decisions related to the charging (ΔC_{i,t}(τ)) or discharging (ΔG_{i,t}(τ)) power to/from the storage device are made based on the "default" strategy described in Section 2.3.

It can be observed that with such an assumption, the storage control decisions (ΔC_{i,t}(τ) and ΔG_{i,t}(τ)) are obtained as a direct consequence of the trading decisions a_{i,t}. Indeed, after the trading decisions are submitted and the market position is updated, the storage control decisions are subsequently derived following the "default" strategy. Assumption (3) results in reducing the dimensionality of the action space and consequently the complexity of the decision-making problem.

Following Assumptions (1), (2) and (3), one can simply observe that the decision-making problem faced by an agent i operating a storage device and trading energy in the CID market can be formalised as a fully observable finite-time MDP with the following characteristics:

• Discrete time-step t ∈ T, where T is the optimisation horizon.

• State space H_i, where the state of the system h_{i,t} ∈ H_i at time t summarises all past information that is relevant for future optimisation.

• Action space A_i, where a_{i,t} ∈ A_i is the set of new orders posted by agent i at time-step t.

• Transition probabilities h_{i,t+1} ∼ P(·|h_{i,t}, a_{i,t}), which can be inferred by the following processes:
  1. a_{-i,t} ∈ A_{-i} is drawn according to equation (23);
  2. The state of the order book s_{t+1}^OB follows the transition given by equation (1);
  3. The exogenous information w_{i,t}^exog is given by equation (24) and the noise by (25);
  4. The variable s_{i,t+1} that summarises the information of the storage device optimizing agent follows the transition given by equations (1), (6)-(12), (24), (25) and (16);
  5. The instantaneous reward r_{i,t} collected after each transition is given by equations (18) and (19).

  The elements resulting from these processes can be used to construct h_{i,t+1} in a straightforward way.

3.2. Assumptions on the trading actions

Assumption 4 (Aggressor). The trading agent can only submit new orders that match already existing orders at their price (i.e. aggressor or liquidity taker).

Let A_i^red be the space that contains only actions that match pre-existing orders in the order book. According to Assumption (4), the i-th agent, at time-step t, is restricted to select actions a_{i,t} ∈ A_i^red ⊂ A_i. Let s_t^OB = ((x_j^OB, y_j^OB, v_j^OB, p_j^OB, e_j^OB), ∀j ∈ N_t) be the order book observation at trading time-step t. We use y_j^{'OB} to denote that the new orders have the opposite side ("Buy" or "Sell") than the existing orders. We denote as a_{i,t}^j ∈ [0, 1] the fraction of the volume accepted from order j. The reduced action space A_i^red is then defined as:

    A_i^red = {(x_j^OB, y_j^{'OB}, a_{i,t}^j · v_j^OB, p_j^OB, e_j^OB), a_{i,t}^j ∈ [0, 1], ∀j ∈ N_t}.

At this point, posting a new set of orders a_{i,t} ∈ A_i^red boils down to simply specifying the vector of fractions

    ā_{i,t} = (a_{i,t}^j, ∀j ∈ N_t) ∈ Ā_i^red

that defines the partial or full acceptance of the existing orders. The action a_{i,t} submitted by an aggressor is a function of the observed order book s_t^OB and the vector of fractions ā_{i,t}, and is given by:

    a_{i,t} = l(s_t^OB, ā_{i,t}).    (26)

3.3. Restrictions on the storage operation

Assumption 5 (No imbalances permitted). The trading agent can only accept an order to buy or sell energy if and only if it does not result in any imbalance for the remaining delivery periods.

According to Assumption (5), the agent is completely risk-averse in the sense that, even if it stops trading at any given point, its position in the market can be covered without causing any imbalance. This assumption is quite restrictive with respect to the full potential of an asset-optimizing agent in the CID market. We note that, according to the German regulation policies (see (Braun & Hoffmann, 2016)), the imbalance market should not be considered as an optimisation floor and the storage device should always be balanced at each trading time-step t (Δ_{i,t}(τ) = 0, ∀τ ∈ T̄). In this respect, we can view Assumption 5 as a way to comply with the German regulation policies in a risk-free context where each new trade should not create an imbalance that would have to be covered later.

Assumption 6 (Optimization decoupling). The storage device has a given initial value for the storage level SoC_i^init at the beginning of the delivery timeline. Moreover, it is constrained to terminate at a given level SoC_i^term at the end of the delivery timeline.

Under Assumption (6), the optimisation of the storage unit over a trading horizon can be decomposed into shorter optimisation windows (e.g. of one day). In the simulation results reported later in this paper, we will choose SoC_i^init = SoC_i^term.

4. Methodology

In this section, we describe the methodology that has been applied for tackling the MDP problem described in Section 3. We consider that, in reality, an asset-optimizing agent has at its disposal a set of trajectories (one per day) from participating in the CID market in the past years. The process of collecting these trajectories and their structure is presented in Section 4.1. Based on this dataset, we propose in subsection 4.2 the deployment of the fitted Q iteration algorithm as introduced in (Ernst et al., 2005). This algorithm belongs to the class of batch-mode RL algorithms that make use of all the available samples at once for updating the policy. This class of algorithms is known to be very sample efficient.

Despite the different assumptions made on the operation of the storage device and the way it is restricted to interact with the market, the dimensionality of the action space still remains very high. Due to limitations related to the function approximation architecture used to implement the fitted Q iteration algorithm, a low-dimensional and discrete action space is necessary, as discussed in subsection 4.3. Therefore, as part of the methodology, in subsection 4.4 we propose a way for reducing the action space. Afterwards, in subsection 4.5, a more compact representation of the state space is proposed in order to reduce the computational complexity of the training process and increase the sample efficiency of the algorithm.

Finally, the low number of available samples (one trajectory per day) gives rise to issues related to the limited exploration of the agent. In order to address these issues, we generate a large number of trading trajectories of our MDP according to an ε-greedy policy, using historical trading data. In the last part of this section, we elaborate on the strategy that is used in this paper for generating the trajectories and the limitations of this procedure.

4.1. Collection of trajectories

As previously mentioned, an asset-optimizing agent can collect a set of trajectories from previous interactions with the CID market. Based on Assumption (6), each day can be optimised separately and thus, trading for one day corresponds to one trajectory. We consider that the trading horizon defined in Section 2.2 consists of K discrete trading time-steps such that T = {0, ..., K}. A single trajectory sampled from the MDP described in Section 3 is defined as:

    ζ_m = (h_{i,0}^m, a_{i,0}^m, r_{i,0}^m, ..., h_{i,K−1}^m, a_{i,K−1}^m, r_{i,K−1}^m, h_{i,K}^m).

A set of M trajectories can then be defined as:

    F = {ζ_m, m = 1, ..., M}.

The set of trajectories F can be used to generate the set of sampled one-step system transitions F' defined as:

    F' = {(h_{i,0}^m, a_{i,0}^m, r_{i,0}^m, h_{i,1}^m), ..., (h_{i,K−1}^m, a_{i,K−1}^m, r_{i,K−1}^m, h_{i,K}^m), m = 1, ..., M}.

The set F' is split into K sets of one-step system transitions F'_t defined as:

    F'_t = {(h_{i,t}^m, a_{i,t}^m, r_{i,t}^m, h_{i,t+1}^m), m = 1, ..., M},  ∀t ∈ {0, ..., K − 1}.

A batch-mode RL algorithm can be implemented for extracting a relevant trading policy from these one-step system transitions. In the following subsection, the type of RL algorithm used for inferring a high-quality policy from this set of one-step system transitions is explained in detail.

4.2. Batch-mode reinforcement learning

Q-functions and dynamic programming: In this section, the fitted Q iteration algorithm is proposed for the optimisation of the MDP defined in Section 3, using a set of collected trajectories. In order to solve the problem, we first define the Q-function for each state-action pair (h_{i,t}, a_{i,t}) at time t, as proposed in (Bertsekas, 2005), as:

Q_t(h_{i,t}, a_{i,t}) = E_{a_{-i,t}, e_{i,t}} { r_{i,t} + V_{t+1}(h_{i,t+1}) },   (27)

∀t ∈ {0, ..., K−1}.

A time-variant policy π = {μ_0, ..., μ_{K−1}} ∈ Π consists in a sequence of functions μ_t, where μ_t : H_i → A_i^{red}. An action a_{i,t} is selected from this policy at each time-step t, according to a_{i,t} = μ_t(h_{i,t}). We denote by π^{t+1} = {μ_{t+1}, ..., μ_{K−1}} the sequence of functions μ_t from time-step t+1 until the end of the horizon. Standard results from dynamic programming (DP) show that, for the finite-time MDP addressed in this paper, there exists at least one such time-variant policy that is an optimal policy as defined by equation (22). Therefore, we focus on the computation of such an optimal time-variant policy. We define the value function V_{t+1} as the optimal expected cumulative reward from stage t+1 until the end of the horizon K, given by:

V_{t+1}(h_i) = max_{π^{t+1} ∈ Π} E_{(a_{-i,t+1}, e_{i,t+1}), ..., (a_{-i,K−1}, e_{i,K−1})} { Σ_{k=t+1}^{K−1} R_{i,k}(h_{i,k}, μ_k(h_{i,k}), a_{-i,k}) | h_{i,t+1} = h_i }.   (28)

We observe that Q_t(h_{i,t}, a_{i,t}) is the value attained by taking action a_{i,t} at state h_{i,t} and subsequently using an optimal policy. Using the dynamic programming algorithm (Bertsekas, 2005), we have:

V_t(h_{i,t}) = max_{a_{i,t} ∈ A_i^{red}} Q_t(h_{i,t}, a_{i,t}).   (29)

Equation (27) can be written in the following form that relates Q_t and Q_{t+1}:

Q_t(h_{i,t}, a_{i,t}) = E_{a_{-i,t}, e_{i,t}} { r_{i,t} + max_{a_{i,t+1} ∈ A_i^{red}} Q_{t+1}(h_{i,t+1}, a_{i,t+1}) }.   (30)

An optimal time-variant policy π* = {μ_0*, ..., μ_{K−1}*} can be identified using the Q-functions as follows:

μ_t*(h_{i,t}) = argmax_{a_{i,t} ∈ A_i^{red}} Q_t(h_{i,t}, a_{i,t}),   (31)

∀t ∈ {0, ..., K−1}.

Computing the Q-functions from a set of one-step system transitions: In order to obtain the optimal time-variant policy π*, the effort is focused on computing the Q-functions defined in equation (30). However, two aspects render the use of the standard value iteration algorithm impossible for solving the MDP defined in Section 3. First, the transition probabilities of the MDP defined in Section 3 are not known. Instead, we can exploit the set of collected historical trajectories to compute the exact Q-functions using an algorithm such as Q-learning (presented in (Watkins & Dayan, 1992)). Q-learning is designed for working only with trajectories, without any knowledge of the transition probabilities. Optimality is guaranteed provided that all state-action pairs are observed infinitely often within the set of historical trajectories and that the successor states are independently sampled at each occurrence of a state-action pair (Bertsekas, 2005). In Section 4.6, we discuss the validity of this condition and address the problem of limited exploration by generating additional artificial trajectories. Second, due to the continuous nature of the state and action spaces, a tabular representation of the Q-functions used in Q-learning is not feasible. In order to overcome this issue, we use a function approximation architecture to represent the Q-functions (Busoniu et al., 2017).

The computation of the approximate Q-functions is performed using the fitted Q iteration algorithm (Ernst et al., 2005). We present the algorithm for the case where a parametric function approximation architecture Q_t(h_{i,t}, a_t; θ_t) is used (e.g. neural networks). In this case, the algorithm computes, recursively, the parameter vectors θ_t starting from t = K−1. However, it should be emphasized that the fitted Q iteration algorithm can be adapted in a straightforward way to the case in which a non-parametric function approximation architecture is selected. The set of M samples of quadruples F′ = {(h^m_{i,t}, a^m_{i,t}, r^m_t, h^m_{i,t+1}), m = 1, ..., M} obtained from previous experience is exploited in order to update the parameter vectors θ_t by solving the supervised learning problem presented in equation (32). The targets y^m_t are computed using the Q-function approximation of the next stage, Q_{t+1}(h_{i,t+1}, a_{t+1}; θ_{t+1}), according to equation (33). The Q-function for the terminal state is set to zero (Q̂_K ≡ 0) and the algorithm iterates backwards over the trading horizon, producing a sequence of approximate Q-functions Q̂ = {Q̂_0, ..., Q̂_{K−1}} until termination at t = 0.

θ_t = argmin_{θ_t} Σ_{m=1}^{M} ( Q_t(h^m_{i,t}, a^m_t; θ_t) − y^m_t )²   (32)

y^m_t = r^m_t + max_{a_{i,t+1} ∈ A_i^{red}} Q_{t+1}(h^m_{i,t+1}, a_{i,t+1}; θ_{t+1})   (33)

Once the parameters θ_t are computed, the time-variant policy π̂* = {μ̂_0*, ..., μ̂_{K−1}*} is obtained as:

μ̂_t*(h_{i,t}) = argmax_{a_{i,t} ∈ A_i^{red}} Q_t(h_{i,t}, a_{i,t}; θ_t),   (34)

∀t ∈ {0, ..., K−1}.

In practice, a new trajectory is collected after each trading day. The set of collected trajectories F is consequently augmented. Thus, the fitted Q iteration algorithm can be used to compute a new optimal policy when new data arrive.
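To make the backward pass of equations (32) to (34) concrete, the following minimal sketch implements fitted Q iteration with a generic scikit-learn regressor standing in for the parametric approximator Q_t(·; θ_t) of the paper (which is the LSTM network of Section 4.7). The container `transitions`, the feature encoding and the two-action space are illustrative assumptions, not the authors' exact implementation.

# Minimal fitted Q iteration sketch (equations (32)-(33)), assuming transitions[t]
# holds the quadruples (states, actions, rewards, next_states) sampled at step t.
# An ExtraTreesRegressor replaces the paper's neural approximator for brevity.
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

ACTIONS = [0, 1]  # 0 = "Idle", 1 = "Trade" (reduced action space, Section 4.4)

def fitted_q_iteration(transitions, K):
    """Backward pass over the trading horizon, returning one model per time-step."""
    q_models = [None] * K  # q_models[t] approximates Q_t(h, a)
    for t in reversed(range(K)):
        states, actions, rewards, next_states = transitions[t]
        if t == K - 1:
            targets = rewards                      # Q_K = 0, so y = r at the last step
        else:
            # y^m_t = r^m_t + max_a' Q_{t+1}(h^m_{t+1}, a')   (equation (33))
            next_q = np.column_stack([
                q_models[t + 1].predict(np.column_stack([next_states,
                                                         np.full(len(next_states), a)]))
                for a in ACTIONS
            ])
            targets = rewards + next_q.max(axis=1)
        # Solve the regression problem of equation (32) on (state, action) -> target.
        model = ExtraTreesRegressor(n_estimators=50, random_state=0)
        model.fit(np.column_stack([states, actions]), targets)
        q_models[t] = model
    return q_models

def greedy_action(q_models, t, state):
    """Equation (34): pick the action with the highest approximate Q-value."""
    scores = [q_models[t].predict(np.hstack([state, a]).reshape(1, -1))[0]
              for a in ACTIONS]
    return ACTIONS[int(np.argmax(scores))]

In the paper, the maximisation in equation (33) only runs over the two high-level actions of Section 4.4, which is what keeps this backward pass tractable.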

4.3. Limitations

The fitted Q iteration algorithm, described in the previous section, can be used to provide a trading policy based on the set of past trajectories at the disposal of the agent. Even though this approach is theoretically sound, in practice there are several limitations to overcome. The efficiency of the described fitted Q iteration algorithm is overshadowed by the high dimensionality of the state and the action space.

The state variable h_{i,t} = (s_{i,0}, a_{i,0}, r_{i,0}, ..., s_{i,t−1}, a_{i,t−1}, r_{i,t−1}, s_{i,t}) ∈ H_i is composed of:

• the entire history of actions (a_{i,0}, ..., a_{i,t−1}) before time t;

• the entire history of rewards (r_{i,0}, ..., r_{i,t−1}) before time t;

• the history of order book states s^{OB}_0, ..., s^{OB}_t up to time t and of the private information (s^{private}_{i,0}, ..., s^{private}_{i,t}) up to time t, where:

s^{private}_{i,t} = ((P^{mar}_{i,t}(τ), ∆_{i,t}(τ), G_{i,t}(τ), C_{i,t}(τ), SoC_{i,t}(τ), ∀τ ∈ T̄), w^{exog}_{i,t}).

The state space H_i as well as the action space A_i^{red}, as described in Section 3.2, depend explicitly on the content of the order book s^{OB}_t. The dimension of these spaces at each time-step t depends on the total number of available orders |N_t| in the order book. However, the total number of orders changes at each step t. Thus, both the state and the action spaces are high-dimensional spaces of variable size. In order to reduce the complexity of the decision-making problem, we have chosen to reduce these spaces so as to work with a small action space of constant size and a compact state space. In the following, we describe the procedure that was carried out for the reduction of the state and action spaces.

4.4. Action space reduction: High-level actions

In this section, we elaborate on the design of a small and discrete set of actions that approximates the original action space. Based on Assumptions (1), (2), (3), (4), (5) and (6), a new action space A′_i is proposed, defined as A′_i = {"Trade", "Idle"}. The new action space is composed of two high-level actions a′_{i,t} ∈ A′_i. These high-level actions are transformed into an original action through the mapping p : A′_i → A_i^{red}, from space A′_i to the reduced action space A_i^{red}. The high-level actions are defined as follows:

4.4.1. "TRADE"

At each time-step t, agent i selects orders from the order book with the objective of maximizing the instantaneous reward under the constraint that the storage device can remain balanced for every delivery period, even if no further interaction with the CID market occurs. As a reminder, this constraint was imposed by Assumption (5).

Under this assumption, the instantaneous reward signal r_{i,t}, presented in equation (19), consists only of the trading revenues obtained from the matching process of orders at time-step t. We further introduce the mapping u : R_+ × {"Sell", "Buy"} → R that adjusts the sign of the volume v^{OB} of each order according to its side y^{OB}. Orders posted for buying energy are associated with a positive volume and orders posted for selling energy with a negative volume, or equivalently:

u(v^{OB}, y^{OB}) = { v^{OB}, if y^{OB} = "Buy";  −v^{OB}, if y^{OB} = "Sell". }   (35)

Consequently, the reward function ρ defined in Section 2.4 is adapted according to the proposed modifications. The new reward function ρ : S^{OB} × Ā_i^{red} → R is a stationary function of the orders observed at each time-step t and of the agent's response to the observed orders. An analytical expression for the instantaneous reward collected is given by:

r_{i,t} = ρ(s^{OB}_t, ā_{i,t}) = Σ_{j ∈ N_t} a^j_{i,t} · u(v^{OB}_j, y^{OB}_j) · p^{OB}_j.   (36)

The high-level action "Trade" amounts to solving the bid acceptance optimisation problem presented in Algorithm 1. The objective function of the problem, formulated in equation (37), consists of the revenues arising from trading. It is important to note that the operational constraints guarantee that no order will be accepted if it causes any imbalance. We denote by N_τ ⊂ N the set of unique indices of the available orders that correspond to delivery time-step τ, and N_t = ∪_{τ ∈ T̄} N_τ. In equation (38), the energy purchased and sold (Σ_{j ∈ N_τ} a^j_{i,t} u(v^{OB}_j, y^{OB}_j)), the past net energy trades (P^{mar}_{i,t}(τ)) and the energy discharged by the storage (G_{i,t}(τ)) must match the energy charged by the storage (C_{i,t}(τ)) for every delivery time-step τ. The energy balance of the storage device, presented in equation (39), is responsible for the time-coupling and the arbitrage between two products x (delivery time-steps τ). The technical limits of the storage level and of the charging and discharging process are described in equations (40) to (44). The binary variables k_{i,t} = (k_{i,t}(τ), ∀τ ∈ T̄) restrict the operation of the unit, for each delivery period, to only one mode, either charging or discharging.

Algorithm 1 "Trade"
  Input: t, s^{OB}_t, P^{mar}_{i,t}, \underline{SoC}_i, \overline{SoC}_i, \underline{C}_i, \overline{C}_i, \underline{G}_i, \overline{G}_i, SoC^{init}_i, SoC^{term}_i, τ_{init}, τ_{term}, G_{i,t}, C_{i,t}
  Output: ā_{i,t}, SoC_{i,t+1}, G_{i,t+1}, C_{i,t+1}, ∆G_{i,t}, ∆C_{i,t}, k_{i,t+1}, r_{i,t}
  Solve:

    max_{ā_{i,t}, SoC_{i,t+1}, G_{i,t+1}, C_{i,t+1}, ∆G_{i,t}, ∆C_{i,t}, k_{i,t+1}, r_{i,t}}  Σ_{j ∈ N_t} a^j_{i,t} · u(v^{OB}_j, y^{OB}_j) · p^{OB}_j   (37)

    s.t.  Σ_{j ∈ N_τ} a^j_{i,t} u(v^{OB}_j, y^{OB}_j) + P^{mar}_{i,t}(τ) + C_{i,t+1}(τ) = G_{i,t+1}(τ),  ∀τ ∈ T̄(t)   (38)
          SoC_{i,t+1}(τ + ∆τ) = SoC_{i,t+1}(τ) + ∆τ · ( η · C_{i,t+1}(τ) − G_{i,t+1}(τ)/η ),  ∀τ ∈ T̄(t)   (39)
          \underline{SoC}_i ≤ SoC_{i,t+1}(τ) ≤ \overline{SoC}_i,  ∀τ ∈ T̄(t)   (40)
          SoC^{init}_i = SoC_{i,t+1}(τ_{init}),   (41)
          SoC^{term}_i = SoC_{i,t+1}(τ_{term}),   (42)
          \underline{C}_i ≤ C_{i,t+1}(τ) ≤ k_{i,t+1}(τ) · \overline{C}_i,  ∀τ ∈ T̄(t)   (43)
          \underline{G}_i ≤ G_{i,t+1}(τ) ≤ (1 − k_{i,t+1}(τ)) · \overline{G}_i,  ∀τ ∈ T̄(t)   (44)
          G_{i,t+1}(τ) = G_{i,t}(τ) + ∆G_{i,t}(τ),  ∀τ ∈ T̄(t)   (45)
          C_{i,t+1}(τ) = C_{i,t}(τ) + ∆C_{i,t}(τ),  ∀τ ∈ T̄(t)   (46)
          k_{i,t+1}(τ) ∈ {0, 1},  ∀τ ∈ T̄(t)   (47)
          a^j_{i,t} ∈ [0, 1],  ∀j ∈ N_t   (48)

The optimal solution to this problem yields the vector of fractions ā_{i,t} = (a^j_{i,t}, ∀j ∈ N_t) ∈ Ā_i^{red} that is used in equation (26) to construct the action a_{i,t} ∈ A_i^{red}. The optimal solution also defines, at each time-step t, the adjustments in the level of the production (discharge) ∆G_{i,t} = (∆G_{i,t}(τ), ∀τ ∈ T̄(t)) and of the consumption (charge) ∆C_{i,t} = (∆C_{i,t}(τ), ∀τ ∈ T̄(t)). The evolution of the state of charge SoC_{i,t+1} = (SoC_{i,t+1}(τ), ∀τ ∈ T̄(t)) of the unit, as well as the production G_{i,t+1} = (G_{i,t+1}(τ), ∀τ ∈ T̄(t)) and consumption C_{i,t+1} = (C_{i,t+1}(τ), ∀τ ∈ T̄(t)) levels, are computed for each delivery period.

4.4.2. "IDLE"

No transactions are executed, and no adjustment is made to the previously scheduled quantities. Under this action, the vector of fractions ā_{i,t} is a zero vector. The discharge and charge as well as the state of charge of the storage device remain unchanged (∆G_{i,t} ≡ 0 and ∆C_{i,t} ≡ 0) and we have:

G_{i,t+1}(τ) = G_{i,t}(τ), ∀τ ∈ T̄(t),   (49)

C_{i,t+1}(τ) = C_{i,t}(τ), ∀τ ∈ T̄(t),   (50)

SoC_{i,t+1}(τ) = SoC_{i,t}(τ), ∀τ ∈ T̄(t).   (51)
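Returning to the "Trade" action, the bid-acceptance problem of Algorithm 1 is a small mixed-integer linear program. The sketch below formulates a stripped-down version of it with PuLP for a single trading step: the order data, the net positions `p_mar`, the technical limits and the reduced two-period delivery grid are all made-up inputs for illustration, not the authors' data or code.

# Simplified single-step version of the "Trade" MILP (equations (37)-(48)),
# sketched with PuLP. Orders, positions and limits below are made-up inputs.
import pulp

# Hypothetical order book: (delivery period index, side, volume MWh, price EUR/MWh)
orders = [(0, "Sell", 10.0, 20.0), (1, "Buy", 10.0, 45.0), (1, "Sell", 5.0, 30.0)]
periods = range(2)                      # reduced delivery timeline
p_mar = {0: 0.0, 1: 0.0}                # past net trades per delivery period
soc_min, soc_max, soc_init, soc_term = 0.0, 200.0, 100.0, 100.0
c_max = g_max = 200.0                   # charging / discharging limits
eta, dtau = 1.0, 0.25                   # efficiency and delivery interval (h)

def u(volume, side):                    # sign convention of equation (35)
    return volume if side == "Buy" else -volume

m = pulp.LpProblem("trade", pulp.LpMaximize)
a = [pulp.LpVariable(f"a_{j}", 0, 1) for j in range(len(orders))]        # fractions
c = [pulp.LpVariable(f"c_{tau}", 0, c_max) for tau in periods]           # charging
g = [pulp.LpVariable(f"g_{tau}", 0, g_max) for tau in periods]           # discharging
k = [pulp.LpVariable(f"k_{tau}", cat=pulp.LpBinary) for tau in periods]  # mode
soc = [pulp.LpVariable(f"soc_{tau}", soc_min, soc_max) for tau in range(len(periods) + 1)]

# Objective (37): trading revenue, with buy volumes positive and sell volumes negative.
m += pulp.lpSum(a[j] * u(v, y) * p for j, (tau, y, v, p) in enumerate(orders))

m += soc[0] == soc_init                                   # (41)
m += soc[len(periods)] == soc_term                        # (42)
for tau in periods:
    traded = pulp.lpSum(a[j] * u(v, y) for j, (tj, y, v, p) in enumerate(orders) if tj == tau)
    m += traded + p_mar[tau] + c[tau] == g[tau]           # balance constraint (38)
    m += soc[tau + 1] == soc[tau] + dtau * (eta * c[tau] - g[tau] / eta)   # (39)
    m += c[tau] <= k[tau] * c_max                         # (43): charging only if k = 1
    m += g[tau] <= (1 - k[tau]) * g_max                   # (44): discharging only if k = 0

m.solve(pulp.PULP_CBC_CMD(msg=False))
print([pulp.value(x) for x in a], pulp.value(m.objective))

In the full framework this problem is rebuilt and solved at every step where "Trade" is selected, which is also why the asynchronous architecture of Section 4.8 matters for simulation time.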

With such a reduction of the action space, the agent can choose at every time-step t between the two described high-level actions (a′_{i,t} ∈ A′_i = {"Trade", "Idle"}). Note that when the agent learns to idle, given a current situation, it does not necessarily mean that, had it chosen to "Trade" instead, it would not have collected a positive immediate reward. Indeed, the agent would choose "Idle" if it believes that a better market state may be emerging, i.e. the agent would learn to wait for the ideal opportunity of orders appearing in the order book at subsequent time-steps. We compare this approach to an alternative, which we refer to as the "rolling intrinsic" policy. According to this policy, at every time-step t of the trading horizon the agent selects the combination of orders that optimises its operation and profits based on the current information, assuming that the storage device must remain balanced for every delivery period, as presented in (Lohndorf & Wozabal, 2015). The "rolling intrinsic" policy is thus equivalent to sequentially selecting the action "Trade" (Algorithm 1), as defined in this framework. The algorithm proposed later in this paper exploits the experience that the agent can gain through (artificial) interaction with its environment, in order to learn the value of trading or idling at every different state that the agent may encounter.

4.5. State space reduction

In this section, we propose a more compact and low-dimensional representation of the state space H_i. The state h_{i,t}, as explained in Section 4.3, contains the entire history of all the relevant information available for the decision-making up to time t. We consider each one of the components of the state h_{i,t}, namely the entire history of actions, order book states and private information, and we provide a simplified form.

First, the vector containing the entire history of actions is reduced to a vector of binary variables after the modifications introduced in Section 4.4.

Second, the vector containing the history of order book states is reduced into a vector of engineered features. We start from the order book state s^{OB}_t = ((x^{OB}_j, y^{OB}_j, v^{OB}_j, p^{OB}_j, e^{OB}_j), ∀j ∈ N_t ⊆ N) ∈ S^{OB} that is defined in Sections 2.1 and 2.2 as a high-dimensional continuous vector used to describe the state of the CID market. Owing to the variable (non-constant) and large number of orders |N_t|, the space S^{OB} has a non-constant size and high dimensionality.

In order to overcome this issue, we proceed as follows. First, we consider the market depth curves for each product x. The market depth of each side ("Sell" or "Buy") at a time-step t is defined as the total volume available in the order book per price level for product x. The market depth for the "Sell" ("Buy") side is computed by stacking the existing orders in ascending (descending) price order and accumulating the available volume. The market depth for each of the quarter-hourly products Q1 to Q6 at time instant t is illustrated in Figure 2a using data from the German CID market. The market depth curves serve as a visualization of the order book that provides information about the liquidity of the market. Moreover, they provide information about the maximum (minimum) price that a trading agent will have to pay in order to buy (sell) a certain volume of energy. If we assume a fixed-price discretisation, certain upper and lower bounds on the prices and interpolation of the data in this price range, the market depth curves of each product x can be approximated by a finite and constant set of values.

Even though this set of values has a constant size, it can still be extremely large. Its dimension is no longer a function of the number of existing orders, but it depends on the resolution of the price discretisation, the price range considered, and the total number of products in the market. Instead of an individual market depth curve for each product x, we consider a market depth curve for all the available products, i.e. existing orders in ascending (descending) price order and accumulating the available volumes for all the products. In this way we can construct the aggregated market depth curve, presented in Figure 2b. The aggregated market depth curve illustrates the total available volume ("Sell" or "Buy") per price level for all products.

The motivation for considering the aggregated curves comes from the very nature of a storage device. The main profit-generating mechanism of a storage device is the arbitrage between two delivery periods. Its functionality involves the purchasing (charging) of electricity during periods of low prices and the selling (discharging) during periods of high prices. For instance, in Figure 2a, a storage device would buy volume for product Q4 and sell volume back for product Q5. The intersection of the "Sell" and "Buy" curves in Figure 2b defines the maximum volume that can be arbitraged by the storage device if no operational constraints were considered, and serves as an upper bound for the profits at each step t. Alternatively, the market depth for the same products Q1 to Q6 at a different time-step of the trading horizon is presented in Figure 3a. As illustrated in Figure 3b, there is no arbitrage opportunity between the products, hence the aggregated curves do not intersect. Thus, we assume that the aggregated curves provide a sufficient representation of the order book.

Figure 2. (a) Market depth per product (for products Q1 to Q6) at a time-step t with arbitrage potential. (b) The corresponding aggregated curves for a profitable order book.

Figure 3. (a) Market depth per product (for products Q1 to Q6) at a time-step t with no arbitrage potential. (b) The corresponding aggregated curves for a non-profitable order book.

At this point, considering a fixed-price discretisation and a fixed price range would yield a constant set of values able to describe the aggregated curves. However, in order to further decrease the size of this set of values while keeping a sufficient price discretisation, we motivate the use of a set of distance measures between the two aggregated curves, which succeed in capturing the arbitrage potential at each trading time-step t, as state variables, as presented in Figures 2b and 3b. For instance, we define as D1 the signed distance between the 75th percentile of the "Buy" prices and the 25th percentile of the "Sell" prices, and as D2 the absolute distance between the mean values of the "Buy" and "Sell" volumes. Other measures used are the signed price differences and absolute volume differences between percentiles (25%, 50%, 75%) and the bid-ask spread. A detailed list of the distance measures is provided in Table 2.

The new, continuous, low-dimensional observation of the order book s′^{OB}_t ∈ S′^{OB} = {D1, ..., D10} is used to represent the state of the order book and, in particular, its profit potential. It is important to note that, in contrast to s^{OB}_t ∈ S^{OB}, the new order book observation s′^{OB}_t ∈ S′^{OB} does not depend on the number of orders in the order book and therefore has a constant size, i.e. the cardinality of S′^{OB} is constant over time.

Finally, the history of the private information of agent i, which is not publicly available, is a vector that contains the high-dimensional continuous variables s^{private}_{i,t} related to the operation of the storage device. As described in Section 4.3, s^{private}_{i,t} is defined as:

s^{private}_{i,t} = ((P^{mar}_{i,t}(τ), ∆_{i,t}(τ), G_{i,t}(τ), C_{i,t}(τ), SoC_{i,t}(τ), ∀τ ∈ T̄), w^{exog}_{i,t}).

According to Assumption (5), the trading agent cannot perform any transaction if it results in imbalances. Therefore, it is not relevant to consider the vector ∆_{i,t}, since it will always be zero according to the way the high-level actions are defined in Section 4.4. Additionally, Assumption (3) regarding the default strategy for storage control, in combination with Assumption (5), yields a direct correlation between the vector P^{mar}_{i,t} and G_{i,t}, C_{i,t}, SoC_{i,t}. Thus, it is considered that P^{mar}_{i,t} contains all the required information, and the vectors G_{i,t}, C_{i,t} and SoC_{i,t} can be dropped.

Following the previous analysis, we can define the low-dimensional pseudo-state z_{i,t} = (s′_{i,0}, a′_{i,0}, r_{i,0}, ..., a′_{i,t−1}, r_{i,t−1}, s′_{i,t}) ∈ Z_i, where s′_{i,t} = (s′^{OB}_t, P^{mar}_{i,t}, w^{exog}_{i,t}) ∈ S′_i. This pseudo-state can be seen as the result of applying an encoder e : H_i → Z_i which maps a true state h_{i,t} to the pseudo-state z_{i,t}.

Assumption 7 (Pseudo-state). The pseudo-state z_{i,t} ∈ Z_i contains all the relevant information for the optimisation of the CID market trading of an asset-optimizing agent.

Under Assumption (7), using the pseudo-state z_{i,t} instead of the true state h_{i,t} is equivalent and does not lead to a sub-optimal policy. The resulting decision process after the state and action space reductions is illustrated in Figure 4.


Figure 4. Schematic of the decision process. The original MDP is highlighted in a gray background. The state of the original MDP h_{i,t} is encoded into the pseudo-state z_{i,t}. Based on z_{i,t}, agent i takes a high-level action a′_{i,t} according to its policy π_i. This action a′_{i,t} is mapped to an original action a_{i,t} and submitted to the CID market. The CID market makes a transition based on the action of agent i and the actions of the other agents a_{−i,t}. After this transition, the market position of agent i is defined and the control actions for the storage device are derived according to the "default" strategy. Each transition yields a reward r_{i,t} and a new state h_{i,t+1}.
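The encoder e and the mapping p of Figure 4 can be thought of as thin wrappers around the CID market environment. The sketch below, under the assumption of hypothetical `encode_state`, `solve_trade_milp` and `cid_market_step` helpers (none of which are defined in the paper), shows how a high-level policy decision is turned into an original market action and a one-step transition.

# Illustrative wiring of Figure 4: pseudo-state encoding (e), high-level action
# mapping (p) and one environment transition. All helper functions are hypothetical.
TRADE, IDLE = "Trade", "Idle"

def p_mapping(high_level_action, order_book, private_info):
    """Map a high-level action to a vector of acceptance fractions (Section 4.4)."""
    if high_level_action == IDLE:
        return [0.0] * len(order_book)                      # zero vector: nothing accepted
    return solve_trade_milp(order_book, private_info)       # Algorithm 1 ("Trade")

def act_and_step(history, q_policy):
    """One transition of the reduced decision process."""
    z = encode_state(history)                               # e : H_i -> Z_i (Section 4.5)
    a_high = q_policy(z)                                    # "Trade" or "Idle"
    fractions = p_mapping(a_high, history.order_book, history.private)   # p : A'_i -> A_i^red
    reward, next_history = cid_market_step(history, fractions)   # market + "default" storage control
    return a_high, reward, next_history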

[Figure 5 shows the network unrolled over the input sequence: an LSTM layer (hidden size 128) processes the inputs s̄_{i,t−h̄}, ..., s̄_{i,t}, each of size 263 (input tensor of shape (batch size, h̄, 263)); its output feeds five fully connected layers of 36 units (FC1 to FC5), and the final layer outputs the Q-values of the two high-level actions (the labelled output is Q_{Idle}).]

Figure 5. Schematic of the neural network architecture.

Table 2. Order book features used for the state reduction.

  D1:  p^{Buy}_{max} − p^{Sell}_{min}.    Signed diff. between the maximum "Buy" price and the minimum "Sell" price.
  D2:  p^{Buy}_{mean} − p^{Sell}_{mean}.  Signed diff. between the mean "Buy" price and the mean "Sell" price.
  D3:  p^{Buy}_{25%} − p^{Sell}_{75%}.    Signed diff. between the 25th percentile "Buy" price and the 75th percentile "Sell" price.
  D4:  p^{Buy}_{50%} − p^{Sell}_{50%}.    Signed diff. between the 50th percentile "Buy" price and the 50th percentile "Sell" price.
  D5:  p^{Buy}_{75%} − p^{Sell}_{25%}.    Signed diff. between the 75th percentile "Buy" price and the 25th percentile "Sell" price.
  D6:  |v^{Buy}_{min} − v^{Sell}_{min}|.  Abs. diff. between the minimum "Buy" cum. volume and the maximum "Sell" cum. volume.
  D7:  |v^{Buy}_{mean} − v^{Sell}_{mean}|. Abs. diff. between the mean "Buy" cum. volume and the mean "Sell" cum. volume.
  D8:  |v^{Buy}_{25%} − v^{Sell}_{25%}|.  Abs. diff. between the 25th percentile "Buy" cum. volume and the 25th percentile "Sell" cum. volume.
  D9:  |v^{Buy}_{50%} − v^{Sell}_{50%}|.  Abs. diff. between the 50th percentile "Buy" cum. volume and the 50th percentile "Sell" cum. volume.
  D10: |v^{Buy}_{75%} − v^{Sell}_{75%}|.  Abs. diff. between the 75th percentile "Buy" cum. volume and the 75th percentile "Sell" cum. volume.
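As an illustration of how the aggregated depth curves and the features of Table 2 can be computed, the sketch below builds cumulative "Buy" and "Sell" curves from a flat list of orders and evaluates a few of the distances. The order tuple layout and the percentile convention are assumptions made for the example, not the paper's implementation.

# Sketch of the state-reduction features of Table 2, computed on an aggregated
# order book. Each order is (side, volume MWh, price EUR/MWh); data are made up.
import numpy as np

orders = [("Sell", 5.0, 32.0), ("Sell", 10.0, 35.0), ("Sell", 8.0, 41.0),
          ("Buy", 6.0, 44.0), ("Buy", 12.0, 38.0), ("Buy", 4.0, 30.0)]

def depth_curve(orders, side):
    """Cumulative volume per price level: ascending prices for 'Sell', descending for 'Buy'."""
    subset = sorted((o for o in orders if o[0] == side),
                    key=lambda o: o[2], reverse=(side == "Buy"))
    prices = np.array([o[2] for o in subset])
    cum_vol = np.cumsum([o[1] for o in subset])
    return prices, cum_vol

buy_p, buy_v = depth_curve(orders, "Buy")
sell_p, sell_v = depth_curve(orders, "Sell")

features = {
    # D1: signed difference between the maximum "Buy" and minimum "Sell" price
    "D1": buy_p.max() - sell_p.min(),
    # D5: signed difference between the 75th pct "Buy" and 25th pct "Sell" price
    "D5": np.percentile(buy_p, 75) - np.percentile(sell_p, 25),
    # D7: absolute difference between the mean cumulative volumes on the two sides
    "D7": abs(buy_v.mean() - sell_v.mean()),
}
print(features)   # positive D1/D5 values indicate arbitrage potential in the book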

4.6. Generation of artificial trajectories

In this section, the generation of additional artificial trajectories for addressing exploration issues is discussed. Indeed, if we were to implement an agent that selects at every time-step among the "Idle" and "Trade" actions, we would collect a certain number of trajectories (one per day) over a certain period of interactions. The collected dataset could be used to train a policy using a batch-mode RL algorithm, as described in Section 4.2. Every time a new trajectory arrives, it would be appended to the previous set of trajectories and the entire dataset could be used to improve the trading policy.

This approach requires a large number of days in order to acquire a sufficient amount of data. Additionally, as discussed in Section 4.2, sufficient exploration of the state and action spaces is a key requirement for converging to a near-optimal policy. It is required that all parts of the state and action spaces are visited sufficiently often. As a result, the RL agent needs to explore unknown grounds in order to discover interesting policies (exploration). It should also apply these learned policies to get high rewards (exploitation). However, since the set of collected trajectories would come from a real agent, the visitation of many different states is expected to be limited.

In the RL context, exploration is performed when the agent selects a different action than the one that, according to its experience, will yield the highest rewards. In real life, it is unlikely for a trader to select such actions, and potentially bear negative revenues, for the sake of gaining more experience. This leads to limited exploration during the learning process and would result in a suboptimal policy.

Assumption 8 (No impact on the behaviour of the rest of the agents). The actions of trading agent i do not influence the future actions of the rest of the agents −i in the CID market. In this way, agent i is not capable of influencing the market.

Assumption (8) implies that each of the agents −i entering the market would post orders solely based on their individual needs. Furthermore, their actions are not considered as a reaction to the actions of the other market players.

Leveraging Assumption (8) allows one to tackle the exploration issues discussed previously by generating several artificial trajectories using historical order book data. We denote by E the number of episodes (times) each day from historical data is repeated and by L^{train} the set of trading days used to train the agent. We can then obtain the total number of trajectories M as M = E · |L^{train}|.

The generation of artificial trajectories is performed according to the process described in (Ernst et al., 2005). This process interleaves the generation of trajectories with the computation of an approximate Q-function using the trajectories already generated. As shown in Algorithm 2, for a number of episodes ep, we randomly select days from the training set, which we simulate using an ε-greedy policy. According to this policy, an action is chosen at random with probability ε and according to the available Q-functions with probability (1 − ε). The generated trajectories are added to the set of trajectories. The second step consists of updating the Q-function approximation using the set of collected trajectories. This process is terminated when the total number of episodes has reached the specified number E.

Algorithm 2 Generation of artificial trajectories
  Input: L^{train}, E, ep, ε, decay
  Output: Q̂, F
  Initialize Q̂ ≡ 0
  M ← E · |L^{train}|
  m ← 0
  repeat
    for iter_j ← 0 to ep do
      d ← rand(L^{train})                     ▷ Randomly pick a day d from the train set L^{train}
      ζ_m ← simulate(d, ε-greedy(Q̂))          ▷ Generate trajectory ζ_m by simulating day d with an ε-greedy policy
      F.add(ζ_m)                              ▷ Append the trajectory from day d to the set F
      ε ← anneal(ε, decay, iter_j)            ▷ Anneal the value of ε based on the decay parameter
      m ← m + 1
    Update Q̂ using the set F according to equations (32), (33)   ▷ Fit new Q̂ functions
  until m ≥ M
  return Q̂, F

This process introduces the parameters L^{train}, E, ep, ε and decay. The selection of these parameters impacts the training progress and the quality of the resulting policy. The set of days considered for training (L^{train}) is typically selected as a proportion (e.g. 70%) of the total set of days available. The total number of episodes E should be large enough so that convergence is achieved and is typically tuned based on the application. The frequency with which the trajectory generation and the updates are interleaved is controlled by the parameter ep. A small value of ep results in a large number of updates. The parameter ε is used to address the trade-off between exploration and exploitation during the training process. As the training evolves, this parameter is annealed based on some predefined parameter decay, in order to gradually reduce exploration and to favour exploitation along the (near-)optimal trajectories. In practice, the size of the buffer F cannot grow infinitely due to memory limitations, so typically a limit on the number of trajectories stored in the buffer is imposed. Once this limit is reached, the oldest trajectories are removed as new ones arrive. The buffer is a double-ended queue of fixed size.
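A compact Python rendering of Algorithm 2 is given below. The environment simulator `simulate_day` and the Q-update routine `fit_q` are placeholders for the components described in Sections 4.2 and 4.7, and the decay rule is one plausible choice of exponential annealing rather than the paper's exact schedule.

# Sketch of Algorithm 2: interleaved epsilon-greedy trajectory generation and
# Q-function refitting. simulate_day() and fit_q() stand in for the paper's
# market simulator and the fitted Q update of equations (32)-(33).
import random
from collections import deque

def generate_trajectories(train_days, E, ep, eps, decay, simulate_day, fit_q,
                          buffer_size=10_000):
    q_hat = None                                    # Q-hat = 0 at initialisation
    buffer = deque(maxlen=buffer_size)              # F: fixed-size double-ended queue
    M = E * len(train_days)                         # total number of trajectories
    m = 0
    while m < M:
        for _ in range(ep):                         # ep simulations between two updates
            day = random.choice(train_days)         # d <- rand(L_train)
            trajectory = simulate_day(day, q_hat, eps)   # epsilon-greedy rollout of one trading day
            buffer.append(trajectory)
            eps = max(0.0, eps * decay)             # anneal epsilon (one possible schedule)
            m += 1
        q_hat = fit_q(buffer)                       # refit Q-hat on all stored trajectories
    return q_hat, buffer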

4.7. Neural Network architecture

As described in Section 4.5, the pseudo-state z_{i,t} contains a sequence of variables whose length is proportional to t. This motivates the use of Recurrent Neural Networks (RNNs), which are known for being able to efficiently process variable-length sequences of inputs. In particular, we use Long Short-Term Memory (LSTM) networks (Goodfellow et al., 2016), a type of RNN in which a gating mechanism is introduced to regulate the flow of information to the memory state.

All the networks in this study have the architecture presented in Figure 5. It is composed of one LSTM layer with 128 neurons followed by five fully connected layers with 36 neurons, where "ReLU" was selected as the activation function. The structure of the network (number of layers and neurons) was selected after cross-validation.

Theoretically, the length of the sequence of features that is provided as input to the neural network can be as large as the total number of trading steps in the optimisation horizon. In practice though, there are limitations with respect to the memory that is required to store a tensor of this size. As we can observe in Figure 5, each sample in the batch contains a vector of size 263 for each time-step. Assuming a certain batch size, there is a certain limit to the number of steps that can be stored in memory. Therefore, for practical reasons and due to hardware limitations, we assume a history length h̄ and define z_{i,t} = (a′_{i,t−h̄−1}, r_{i,t−h̄−1}, s′_{i,t−h̄}, a′_{i,t−h̄}, r_{i,t−h̄}, ..., a′_{i,t−1}, r_{i,t−1}, s′_{i,t}) ∈ Z_i. At each step t, the history length h̄ takes the minimum value between the time-step t and h̄_max, i.e. h̄ = min(t, h̄_max). Additionally, we provide the variable s̄_t = (a′_{i,t−1}, r_{i,t−1}, s′_{i,t}) as a fixed-size input for each step t of the LSTM. Consequently, the pseudo-state can be written as z_{i,t} = (s̄_{t−h̄}, ..., s̄_t).
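Under one plausible reading of Figure 5, and assuming that the per-step input s̄_t has 263 features and that one Q-value is produced per high-level action, the architecture can be sketched in PyTorch as follows (the choice of framework is ours for the example; the paper does not name one).

# Sketch of the Figure 5 architecture: LSTM(128) followed by five fully connected
# ReLU layers of 36 units and a 2-unit output (Q-values for "Idle" and "Trade").
import torch
import torch.nn as nn

class LstmQNetwork(nn.Module):
    def __init__(self, input_size=263, lstm_size=128, fc_size=36, n_actions=2, n_fc=5):
        super().__init__()
        self.lstm = nn.LSTM(input_size, lstm_size, batch_first=True)
        layers, width = [], lstm_size
        for _ in range(n_fc - 1):                    # FC1..FC4 with ReLU activations
            layers += [nn.Linear(width, fc_size), nn.ReLU()]
            width = fc_size
        layers.append(nn.Linear(width, n_actions))   # FC5: outputs Q_Idle, Q_Trade
        self.head = nn.Sequential(*layers)

    def forward(self, sequence):
        # sequence: (batch, history length h_bar, 263); only the last hidden state is used.
        _, (h_n, _) = self.lstm(sequence)
        return self.head(h_n[-1])                    # (batch, 2)

q_net = LstmQNetwork()
q_values = q_net(torch.randn(4, 10, 263))            # e.g. batch of 4, h_bar_max = 10
print(q_values.shape)                                 # torch.Size([4, 2])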

4.8. Asynchronous Distributed Fitted Q iteration

The exploration requirements of the continuous state space, as defined previously, introduce the necessity of collecting a large number of trajectories M. The total time required for gathering these trajectories heavily depends on the simulation time needed for one episode. In the particular setting developed here, the simulation time can be quite long since, at each decision step, if the action selected is "Trade", an optimisation model is constructed and solved.

In this paper, and in order to address this issue, we resort to an asynchronous architecture, similar to the one proposed in (Horgan et al., 2018), presented in Figure 6. The two processes described in Section 4.6, namely the generation of trajectories and the computation of the Q-functions, run concurrently with no high-level synchronization.

Figure 6. Schematic of the asynchronous distributed architecture. Each actor runs on a different thread and contains a copy of the environment, an individual ε-greedy policy based on the latest version of the network parameters and a local buffer. The actors generate trajectories that are stored in their local buffers. When the local buffer of each actor is filled, it is appended to the global buffer and the agent collects the latest network parameters from the learner. A single learner runs on a separate thread and is continuously training using experiences from the global buffer.
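The actor/learner split of Figure 6, detailed in the next paragraphs, can be emulated with standard Python threading as in the hedged sketch below: each actor thread owns an environment copy and its own ε, pushes finished trajectories to a shared queue, and periodically refreshes its parameters, while a single learner consumes the queue and refits the Q-functions. The components `make_env`, `run_episode` and `fit_q` are placeholders.

# Minimal actor/learner sketch of the asynchronous architecture in Figure 6.
# make_env(), run_episode() and fit_q() are placeholders for the paper's components.
import queue
import random
import threading

global_buffer = queue.Queue()          # trajectories shared by all actors
latest_params = {"q": None}            # most recent Q-function parameters
param_lock = threading.Lock()
stop = threading.Event()

def actor(make_env, run_episode, local_size=5):
    env = make_env()                                   # each actor owns an environment copy
    eps = random.uniform(0.1, 0.5)                     # individual exploration schedule
    local = []
    while not stop.is_set():
        with param_lock:
            params = latest_params["q"]                # grab the latest network parameters
        local.append(run_episode(env, params, eps))
        eps *= 0.999                                   # per-actor annealing (illustrative)
        if len(local) >= local_size:                   # flush the local buffer
            for traj in local:
                global_buffer.put(traj)
            local = []

def learner(fit_q, batch=32):
    data = []
    while not stop.is_set():
        data.append(global_buffer.get())               # blocks until an actor delivers
        if len(data) >= batch:
            new_params = fit_q(data)                   # refit the Q-functions on experiences
            with param_lock:
                latest_params["q"] = new_params
            data = []

# threads = [threading.Thread(target=actor, args=(make_env, run_episode)) for _ in range(8)]
# threads.append(threading.Thread(target=learner, args=(fit_q,)))
# for th in threads: th.start()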

Multiple actors that run on different threads are used to generate trajectories. Each actor contains a copy of the environment, an individual ε-greedy policy based on the latest version of the Q-functions, and a local buffer. The actors use their ε-greedy policy to perform transitions in the environment. The transitions are stored in the local buffer. When the local buffer of an actor is filled, it is appended to the global buffer, the actor collects the latest Q-functions from the learner and continues the simulation. A single learner continuously updates the Q-functions using the simulated trajectories from the global buffer.

The benefits of asynchronous methods in Deep Reinforcement Learning (DRL) are elaborated in (Mnih et al., 2016). Each actor can use a different exploration policy (different initial ε value and decay) in order to enhance diversity in the collected samples, which leads to a more stable learning process. Additionally, it is shown that the total computational time scales linearly with the number of threads considered. Another major advantage is that distributed techniques were shown to have a super-linear speedup for one-step methods that is not only related to computational gains. It is argued that the positive effect of having multiple threads leads to a reduction of the bias in one-step methods (Mnih et al., 2016). In this way, these algorithms are shown to be much more data efficient than the original versions.

5. Case study

The proposed methodology is applied for the case of a PHES unit. First, the parameters and the exogenous information used for the optimisation of the CID market participation of a PHES operator are described. Second, the benchmark strategy used for comparison purposes and the process that was carried out for validation are presented. Finally, performance results of the obtained policy are presented and discussed.

5.1. Parameters specification

The proposed methodology is applied for a PHES unit participating in the German CID market with the following characteristics:

• \overline{SoC}_i = 200 MWh,

• \underline{SoC}_i = 0 MWh,

• SoC^{init}_i = SoC^{term}_i = (\overline{SoC}_i − \underline{SoC}_i)/2,

• \overline{C}_i = \overline{G}_i = 200 MW,

• \underline{C}_i = \underline{G}_i = 0 MW,

• η = 100%.

The discrete trading horizon has been selected to be ten hours, i.e. T = {17:00, ..., 03:00}. The trading time interval is selected to be ∆t = 15 min. Thus, the trading process takes K = 40 steps until termination. Moreover, all 96 quarter-hourly products of the day, X = {Q1, ..., Q96}, are considered. Consequently, the delivery timeline is T̄ = {00:00, ..., 23:45}, with τ_init = 00:00 and τ_term = 24:00, and the delivery time interval is ∆τ = 15 min. Each product can be traded until 30 minutes before the physical delivery of electricity begins (e.g. t_close(Q1) = 23:30, etc.).

The number of days selected for training was |L^{train}| = 252 days. For the construction of the training/test sets, the days were sampled uniformly without replacement from a pool of 362 days. The number of simulated episodes for each training day was selected to be E = 2000 episodes for the artificial trajectory generation process described in Section 4.6. During the trajectory generation process, the high-level actions ("Trade" or "Idle") were chosen following an ε-greedy policy. As described in Section 4.8, each of the actor threads is provided with a different exploration parameter ε that is initialised with a random uniform sample in the range [0.1, 0.5]. The parameter ε is then annealed exponentially until a zero value is reached.

The pseudo-state z_{i,t} = (s′_{i,0}, a′_{i,0}, r_{i,0}, ..., a′_{i,t−1}, r_{i,t−1}, s′_{i,t}) ∈ Z_i is composed of the entire history of observations and actions up to time-step t, as described in Section 4.5. For the sake of memory requirements, as explained in Section 4.7, we assume that the last ten trading steps contain sufficient information about the past. Thus, the pseudo-state is transformed into sequences of fixed length h̄_max = 10.

5.2. Exogenous variable

The exogenous variable w^{exog}_{i,t} represents any relevant information available to agent i about the system. In this case study, we assumed that the variable w^{exog}_{i,t} contains:

• the 24 values of the day-ahead price for the entire trading day;

• the imbalance price and the system imbalance for the four quarters preceding each time-step t;

• time features: i) the hour of the day, ii) the month and iii) whether the traded day is a weekday or a weekend day.

5.3. Benchmark strategy

The strategy selected for comparison purposes is the "rolling intrinsic" policy, denoted by π^{RI} (Lohndorf & Wozabal, 2015). According to this policy, the agent selects at each trading time-step t the action "Trade", as described in Section 4.4. This benchmark is selected since it represents the current state of the art used in the industry for the optimisation of PHES unit market participation.

5.4. Validation process

The performance of the policy obtained using the fitted Q algorithm, denoted by π^{FQ}, is evaluated on a test set L^{test} that contains historical data from 110 days. These days are not used during the training process. This process of backtesting a strategy on historical data is widely used because it can provide a measure of how successful a strategy would have been if it had been executed in the past. However, there is no guarantee that this performance can be expected in the future. This validation process heavily relies on Assumption (8) about the inability of the agent to influence the behaviour of the other players in the market. It can still provide an approximation of the results of the obtained policy before deploying it in real life. However, the only way to evaluate the exact viability of a strategy is to deploy it in real life.

We compare the performances of the policy obtained by the fitted Q iteration algorithm, π^{FQ}, and the "rolling intrinsic" policy, π^{RI}. The comparison will always be based on the computation of the return of the policies on each day. For a given policy, the return over a day will simply be computed by running the policy on the day and summing up the rewards obtained.

The learning algorithm that we have designed for learning a policy has two sources of variance, namely those related to the generation of the new trajectories and those related to the learning of the Q-functions from the set of trajectories. To evaluate the performance of our learning algorithm we perform several runs and average the performances of the policies learned. In the following, when we report the performance of a fitted Q iteration policy over one day, we actually report the average performance of ten learned policies over this day.

We now describe the different indicators that will be used afterwards to assess the performance of our method. These indicators will be computed for both the training set and the test set, but are detailed hereafter as they are computed for the test set. It is straightforward to adapt the procedure for computing the indicators for the training set.

Let V^{π^{FQ}}_d and V^{π^{RI}}_d denote the total return of the fitted Q and the "rolling intrinsic" policy for day d, respectively. We gather the obtained returns of each policy for each day d ∈ L^{test}. We sort the returns in ascending order, and we obtain an ordered set containing |L^{test}| values for each policy. We provide descriptive statistics about the distribution of the returns of each policy, V^{π^{FQ}}_d and V^{π^{RI}}_d, on the test set L^{test}. In particular, we report the mean, the minimum and the maximum values achieved for the set considered. Moreover, we provide the values obtained for each of the quartiles (25%, 50% and 75%) of the set.

Additionally, we compute the sum of returns over the entire set of days as follows:

V^{π^{FQ}} = Σ_{d ∈ L^{test}} V^{π^{FQ}}_d,   (52)

V^{π^{RI}} = Σ_{d ∈ L^{test}} V^{π^{RI}}_d.   (53)
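The indicators above, together with the profitability ratios defined next in equations (54) and (55), reduce to a few lines of NumPy once the daily returns of both policies are available; the arrays below are placeholders for back-tested returns, not the paper's data.

# Sketch of the performance indicators of equations (52)-(55) from daily returns.
import numpy as np

v_fq = np.array([8200.0, 7600.0, 9100.0])   # placeholder daily returns of pi_FQ
v_ri = np.array([8100.0, 7550.0, 8900.0])   # placeholder daily returns of pi_RI

V_fq, V_ri = v_fq.sum(), v_ri.sum()          # equations (52) and (53)
r_d = (v_fq - v_ri) / v_ri * 100.0           # equation (54): daily profitability ratio (%)
r_sum = (V_fq - V_ri) / V_ri * 100.0         # equation (55): ratio of the summed returns (%)

stats = {q: np.percentile(r_d, q) for q in (25, 50, 75)}   # quartiles reported in Tables 3-4
print(V_fq, V_ri, r_sum, stats)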

An alternative performance indicator considered is the discrepancy of the returns of the fitted Q policy with respect to the risk-averse rolling intrinsic policy. We define the profitability ratio r_d for each day d ∈ L^{test}, which corresponds to the signed percentage difference between the two policies, as follows:

r_d = (V^{π^{FQ}}_d − V^{π^{RI}}_d) / V^{π^{RI}}_d · 100%.   (54)

In a similar fashion, we sort the profitability ratios obtained for each day in the test set and we provide descriptive statistics about their distribution across the set. The mean, minimum and maximum values of the profitability ratio as well as the values of each quartile are reported. Finally, we compute the profitability ratio for the sum of returns over the entire set between the two policies, as:

r_sum = (V^{π^{FQ}} − V^{π^{RI}}) / V^{π^{RI}} · 100%.   (55)

5.5. Results

The performance indicators described previously are computed for both the training and the test set. The results obtained are summarised in Tables 3 and 4. Descriptive statistics about the distribution of the returns from both policies as well as the profitability ratio are presented for each dataset.

Table 3. Descriptive statistics of the returns obtained on the days of the training set for policies π^{FQ} and π^{RI}. The last column also provides the corresponding profitability ratios.

          π^{FQ} returns (€)   π^{RI} returns (€)    r (%)
  mean          8670                 8440             2.56
  min           2586                 2524            −0.42
  25%           5571                 5537             0.33
  50%           7496                 7385             1.45
  75%          11081                10632             3.69
  max          54967                54285            23.12
  sum        2167624              2110200             2.7

Table 4. Descriptive statistics of the returns obtained on the days of the test set for policies π^{FQ} and π^{RI}. The last column also provides the corresponding profitability ratios.

          π^{FQ} returns (€)   π^{RI} returns (€)    r (%)
  mean          8583                 8439             1.50
  min           2391                 2366            −0.73
  25%           5600                 5527             0.23
  50%           7661                 7622             0.87
  75%          10823                10721             2.22
  max          37902                36490             6.56
  sum         935552               919871             1.7

It can be observed that on average π^{FQ} yields better returns than π^{RI} on both the training and the test set. More specifically, on the training set, the obtained policy performs on average 2.56% better than the "rolling intrinsic" policy. For the top 25% of the training days the profitability ratio is higher than 3.69%, and in some cases it even exceeds 10%. Overall, the total profits coming from the fitted Q policy add up to €2,167,624, yielding a difference of €57,423.3 (2.7%) more than the profits from the "rolling intrinsic" for the set of 252 days considered.

The fitted Q policy yields on average a 1.5% greater profit on the test set with respect to the returns of the "rolling intrinsic" policy. It is important to highlight that for 50% of the test set, the profits from the fitted Q policy are higher than 1% in comparison to the "rolling intrinsic". The difference between the total profits resulting from the two policies over the set of 110 days considered amounts to €15,681 (1.7%).

The distribution of the training and test set samples according to the obtained profitability ratio is presented in Figure 7. It can be observed that most samples are spread in the interval between 0% and 5% and that the distribution has a positive skew. It is important to note that, for the vast majority of the samples, the profitability ratio is non-negative. This result allows us to characterize the new policy as risk-free, in the sense that in the worst case the fitted Q policy performs as well as the "rolling intrinsic" policy. From the standpoint of practical implementation, this result allows us to construct a wrapper around the current industrial standard practices and expect an average improved performance of 2% with almost no risk. However, as discussed earlier, the back-testing of a strategy on historical data may differ from the outcomes in real deployment for various reasons.

6. Discussion

In this section, we provide some remarks related to the practical challenges encountered and the validity of the assumptions considered throughout this paper.

6.1. Behaviour of the rest of the agents

In this paper, we assumed (Assumption 1) that the rest of the agents −i post orders in the market based on their needs and some historical information of the state of the order book. In reality, the available information that the other agents possess is not accessible by agent i. This fact gives rise to issues related to the validity of the assumption that the process is Markovian.

We further assumed (Assumption 8) in Section 4.6 that the behaviour of agent i does not influence the strategy of the other agents −i. Based on this assumption, the training and the validation process were performed using historical data. However, the strategy of each market participant is highly dependent on the actions of the rest of the participants, especially in a market with limited liquidity such as the CID market.

These assumptions, although slightly unrealistic and optimistic, provide us with a meaningful testing protocol for a trading strategy. The actual profitability of a strategy can only be obtained by deploying the strategy in real time. However, it is important to show that the strategy is able to obtain substantial profits in back-testing first.

Figure 7. Profitability ratio.

6.2. Partial observability of the process

In Section 3, the decision-making problem studied in this paper was framed as an MDP after considering certain assumptions. Theoretically, this formulation is very convenient, but it does not hold in practice, especially considering Assumption 7, where the reduced pseudo-state is assumed to contain all the relevant information required.

Indeed, the trading agents do not have access to all the information required. For instance, a real agent does not know how many other agents are active in the market. It does not know the strategy of each agent either. There is also a lot of information gathered in w^{exog} which is not available to the agent. Finally, the fact that the state space was reduced results in an inevitable loss of information. Therefore, it would be more accurate to consider a Partially Observable Markov Decision Process (POMDP) instead. In a POMDP, the real state is hidden and the agent only has access to observations. For an RL algorithm to properly work with a POMDP, the observations have to be representative of the real hidden states.

6.3. Exploration

There are two main issues related to the state space exploration that result in the somewhat limited performance of the obtained policy. First, in the described setting, the way in which we generate the artificial trajectories is very important for the success of the method. The generated states must be "representative" in the sense that the areas around these states are visited often under a near-optimal policy (Bertsekas, 2005). In particular, the frequency of appearance of these areas of states in the training process should be proportional to their probability of occurrence under the optimal policy. However, in practice, we are not in a position to know which areas are visited by the optimal policy. In that respect, the asynchronous distributed algorithm used in this paper was found to successfully address the issue of state exploration.

Second, the assumptions (Assumptions 3, 4, 5) related to the operation of the storage device according to the "default" strategy, without any imbalances allowed, as well as the participation of the agent as an aggressor, are restrictive with respect to the set of all admissible policies. Additionally, the adoption of the reduced discrete action space described in Section 4.4 introduces further restrictions on the set of available actions. Although having a small and discrete space is convenient for the optimisation process, it leads to limited state exploration. For instance, the evolution of the state of charge of the storage device is always given as the output of the optimisation model based on the order book data. Thus, in this configuration, it is not possible to explore all areas of the state space (storage levels) but only a certain area driven by the historical order book data. However, evaluating the policy on a different dataset might lead to areas of the state space (e.g. storage levels) that were never visited during training, leading to poor performance. Potential mitigations of this issue involve diverse data augmentation techniques and/or a different representation of the action space.

7. Conclusions and future work

In this paper, a novel RL framework for the participation of a storage device operator in the CID market is proposed. The energy exchanges between market participants occur through a centralized order book. A series of assumptions related to the behaviour of the market agents and the operation of the storage device are considered. Based on these assumptions, the sequential decision-making problem is cast as an MDP. The high dimensionality of both the action and the state spaces increases the computational complexity of finding a policy. Thus, we motivate the use of discrete high-level actions that map into the original action space. We further propose a more compact state representation. The resulting decision process is solved using fitted Q iteration, a batch-mode reinforcement learning algorithm. The results illustrate that the obtained policy is a low-risk policy that is able to outperform on average the state-of-the-art industry benchmark strategy ("rolling intrinsic") by 1.5% on the test set. The proposed method can serve as a wrapper around the current industrial practices that provides decision support to energy trading activities with low risk.

The main limitations of the developed strategy originate from: i) the insufficient amount of relevant information contained in the state variable, either because the proposed state reduction leads to a loss of information or due to the unavailability of information, and ii) the limited state space exploration as a result of the proposed high-level actions in combination with the use of historical data. To this end, and as future work, a more detailed and accurate representation of the state should be devised. This can be accomplished by increasing the amount of information considered, such as RES forecasts, and by improving the order book representation. We also propose the use of continuous high-level actions, in an effort to gain state exploration without leading to a very complex and high-dimensional action space.

References

Aïd, R., Gruet, P., and Pham, H. An optimal trading problem in intraday electricity markets. Mathematics and Financial Economics, 10(1):49–85, 2016.

Baillo, A., Ventosa, M., Rivier, M., and Ramos, A. Optimal offering strategies for generation companies operating in electricity spot markets. IEEE Transactions on Power Systems, 19(2):745–753, May 2004. doi: 10.1109/TPWRS.2003.821429.

Balardy, C. German continuous intraday market: Orders book's behavior over the trading session. In Meeting the Energy Demands of Emerging Economies, 40th IAEE International Conference, June 18-21, 2017. International Association for Energy Economics, 2017a.

Balardy, C. An analysis of the bid-ask spread in the German power continuous market. In Heading Towards Sustainable Energy Systems: Evolution or Revolution?, 15th IAEE European Conference, Sept 3-6, 2017. International Association for Energy Economics, 2017b.

Bertrand, G. and Papavasiliou, A. Adaptive trading in continuous intraday electricity markets for a storage unit. IEEE Transactions on Power Systems, pp. 1–1, 2019. doi: 10.1109/TPWRS.2019.2957246.

Bertsekas, D. P. Dynamic programming and optimal control, volume 1. Athena Scientific, Belmont, MA, 2005.

Boomsma, T. K., Juul, N., and Fleten, S.-E. Bidding in sequential electricity markets: The Nordic case. European Journal of Operational Research, 238(3):797–809, 2014. doi: 10.1016/j.ejor.2014.04.027.

Borggrefe, F. and Neuhoff, K. Balancing and intraday market design: Options for wind integration. DIW Discussion Papers 1162, Berlin, 2011. URL http://hdl.handle.net/10419/61319.

Braun, S. and Hoffmann, R. Intraday Optimization of Pumped Hydro Power Plants in the German Electricity Market. In Energy Procedia, volume 87, pp. 45–52, 2016. doi: 10.1016/j.egypro.2015.12.356.

Busoniu, L., Babuska, R., De Schutter, B., and Ernst, D. Reinforcement learning and dynamic programming using function approximators. CRC Press, 2017.

EPEXSPOT. Intraday continuous, 2017. URL http://www.epexspot.com/en/market-data/intradaycontinuous.

EPEXSPOT. EPEXSPOT Operational rules, 2019. URL https://www.epexspot.com/document/40170/EPEX%20SPOT%20Market%20Rules.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Fleten, S. E. and Kristoffersen, T. K. Stochastic programming for optimizing bidding strategies of a Nordic hydropower producer. European Journal of Operational Research, 181(2):916–928, 2007. doi: 10.1016/j.ejor.2006.08.023.

Garnier, E. and Madlener, R. Balancing forecast errors in continuous-trade intraday markets. Energy Systems, 6(3):361–388, 2015.

Gönsch, J. and Hassler, M. Sell or store? An ADP approach to marketing renewable energy. OR Spectrum, 38(3):633–660, 2016. doi: 10.1007/s00291-016-0439-x.

Goodfellow, I., Bengio, Y., and Courville, A. Deep Learning. MIT Press, 2016. URL http://www.deeplearningbook.org.

Hagemann, S. Price determinants in the German intraday market for electricity: an empirical analysis. Journal of Energy Markets, 2015.

Hassler, M. Heuristic decision rules for short-term trading of renewable energy with co-located energy storage. Computers and Operations Research, 83, 2017. doi: 10.1016/j.cor.2016.12.027.

Henriot, A. Market design with centralized wind power management: Handling low-predictability in intraday markets. Energy Journal, 35(1):99–117, 2014. doi: 10.5547/01956574.35.1.6.

Horgan, D., Quan, J., Budden, D., Barth-Maron, G., Hessel, M., Van Hasselt, H., and Silver, D. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

Jiang, D. R. and Powell, W. B. Optimal hour-ahead bidding in the real-time electricity market with battery storage using Approximate Dynamic Programming. pp. 1–28, 2014. doi: 10.1287/ijoc.2015.0640. URL http://arxiv.org/abs/1402.3575.

Karanfil, F. and Li, Y. The role of continuous intraday electricity markets: The integration of large-share wind power generation in Denmark. Energy Journal, 38(2):107–130, 2017. doi: 10.5547/01956574.38.2.fkar.

Le, H. L., Ilea, V., and Bovo, C. Integrated European intra-day electricity market: Rules, modeling and analysis. Applied Energy, 238:258–273, 2019. doi: 10.1016/j.apenergy.2018.12.073.

Lohndorf, N. and Wozabal, D. Optimal gas storage valuation and futures trading under a high-dimensional price process. Technical report, 2015.

Meeus, L. and Schittekatte, T. The EU Electricity Network Codes: Course text for the Florence School of Regulation online course. October, 2017.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.

Neuhoff, K., Ritter, N., Salah-Abou-El-Enien, A., and Vassilopoulos, P. Intraday markets for power: Discretizing the continuous trading? 2016.

Pandžić, H., Morales, J. M., Conejo, A. J., and Kuzle, I. Offering model for a virtual power plant based on stochastic programming. Applied Energy, 105:282–292, 2013. doi: 10.1016/j.apenergy.2012.12.077.

Pérez Arriaga, I. and Knittel et al., C. Utility of the future. An MIT Energy Initiative response. 2016. ISBN 9780692808245. URL energy.mit.edu/uof.

Plazas, M. A., Conejo, A. J., and Prieto, F. J. Multimarket optimal bidding for a power producer. IEEE Transactions on Power Systems, 20(4):2041–2050, Nov 2005. doi: 10.1109/TPWRS.2005.856987.

Scharff, R. and Amelin, M. Trading behaviour on the continuous intraday market Elbas. Energy Policy, 88:544–557, 2016. doi: 10.1016/j.enpol.2015.10.045.

Spot, N. Xbid cross-border intra day market project, 2018. URL https://www.nordpoolspot.com/globalassets/download-center/xbid/xbid-qa_final.pdf.

The European Commission. 2030 energy strategy, 2017. URL https://ec.europa.eu/energy/en/topics/energy-strategy-and-energy-union/2030-energy-strategy.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

A. Nomenclature

Acronyms

ADP: Approximate Dynamic Programming.
CID: Continuous Intraday.
DRL: Deep Reinforcement Learning.
FCFS: First Come First Served.
MDP: Markov Decision Process.
OB: Order Book.
PHES: Pumped Hydro Energy Storage.
RES: Renewable Energy Sources.

Sets and indexes

i: Index of an agent.
−i: Index of all the agents except agent i.
j: Index of an order.
m: Index of a sample of quadruples.
d: Index of a day in a set.
t: Trading time-step.
τ: Discrete time-step of delivery.
A: Joint action space for all the agents.
A_i: Action space of agent i.
A_{−i}: Action space of the rest of the agents −i.
A_i^{red}: Reduced action space of agent i.
A′_i: Set of high-level actions for agent i.
Ā_i: Set of all factors for the partial/full acceptance of orders by agent i.

E Set of conditions that can apply to an order. h¯max Maximum sequence length of past information. F Set of all sampled trajectories. I(τ) Imbalance price for delivery period δ(x). F0 Set of sampled one-step transitions. K Number of steps in the trading period. 0 Ft Set of sampled one-step transitions for time t. M Number of samples of quadruples. Hi Set of all histories for agent i. n Number of agents. I Set of agents. ot Market order. Ltrain Set of trading days used to train the agent. p Price of an order. test L Set of trading days used to evaluate the agent. pmax Maximum price of an order. Nt Set of all available order unique indexes at time pmin Minimum price of an order. t. SoCi Maximum state of charge of storage device. 0 Nt Set of all the unique indexes of new orders SoCi Minimum state of charge of storage device. init posted at time t. SoCi State of charge of storage device at the begin- Nτ Set of all the unique indexes of orders for deliv- ning of the delivery timeline. term ery at τ. SoCi State of charge of storage device at the end of Ot Set of all available orders in the order book at the delivery timeline. time t. tclose(x) End of trading period for product x. OB S Set of all available orders in the order book. tdelivery(x) Start of delivery of product x. 0OB S Low dimensional set of all available orders in topen(x) Start of trading period for product x. the order book. tsettle(x) Time of settlement for product x. Si State space of agent i. v Volume of an order. T Trading horizon, i.e. time interval between first x Market product. possible trade and last possible trade. y Side of an order (“Sell” or “Buy”). m T(x) Discretization of the trading timeline for product yt Target computed for sample m at time t. x. δ(x) Time interval covered by product x (delivery). T¯ Discretization of the delivery timeline. ∆t Time interval between trading time-steps. T¯(t) Discretization of the delivery timeline at trading ∆τ Time interval between delivery time-steps. step t. ε Parameter for the ε-greedy policy. T Imb Discretization of the imbalance settlement time- η Charging/discharging efficiency of storage de- line. vice. X Set of all available products. θt Parameters vector of function approximation at Xt Set of all available products at time t. time t. Zi Set of pseudo-states for agent i. λ(x) Duration of time-interval δ(x). Π Set of all admissible policies. ζ A single trajectory. ζm A single indexed trajectory. Parameters τinit Initial time-step of the delivery timeline. τ Terminal time-step of the delivery timeline. Name Description term C Maximum consumption level for the asset of i Variables agent i. Ci Minimum consumption level for the asset of Name Description agent i. at Joint action from all the agents at time t. E Number of episodes. ai,t Action of posting orders by agent i at time t. e Conditions applying on an order other than vol- a−i,t Action of posting orders by the rest of the agents ume and price. −i at time t. 0 ep Number of simulations between two successive ai,t High-level action by agent i at time t. Q function updates. j ai,t Acceptance (partial/full) factor for order j by decay Parameter for the annealing of ε. agent i at time t. Gi Maximum production level for the asset of agent a¯i,t Factors for the partial/full acceptance of all or- i. ders by agent i at time t. Gi Minimum production level for the asset of agent Ci,t (τ) Consumption level at delivery time-step τ com- i. puted at time t. ¯ 0 h Sequence length of past information. 
ci,t (t ) Consumption level during the delivery interval. ei,t Random disturbance for agent i at time t. A Deep Reinforcement Learning Framework for Continuous Intraday Market Bidding

Gi,t (τ) Generation level at delivery time-step τ com- Pet (·) Random disturbance probability distribution puted at t. function. 0 gi,t (t ) Generation level during the delivery interval. P(·) Transition probabilities of the MDP. hi,t History vector of agent i at time t. PFQ(·) The stochastic process (algorithm) of fitted Q ki,t (τ) Binary variable that enforces either charging or iteration.

discharging of the storage device. Pθt,0 (·) Distribution of the initial parameters θt,0. mar 0 Pi,t (x) Net contracted power of agent i for product x p(·) Mapping from high-level actions Ai to the re- red (delivery time-step τ) at time t. duced action space Ai . res Pi,t (τ) Residual production of agent i delivery time- Qt (·,·) State-action value function at time t. step τ (for product x) at time t. Qˆ(·,·) Sequence of Q-function approximations. res Pi (τ) Final residual production of agent i for product R(·) Reward function. ri,t Instantaneous reward of agent i at time t. u(·) Signing convention for the volume wrt. the side rd Profitability ratio at day d. (‘Buy” or ‘Sell”) of each order. π rsum Profitability ratio for the sum of returns over set. V i (·) Total expected reward function for policy πi. FQ si,t State of agent i at time t. π FQ Vd (·) Return of the fitted Q policy πi for day d. SoC ( ) State of charge of device at delivery time-step RI i,t τ τ V π (·) Return of the “rolling” intrinsic policy πRI for computed at t. d i OB day d. st State of the order book at time t. µt (·) Policy function at time t. s0OB Low dimensional state of the order book at time t πi(·) Policy followed by agent i. t. private ρ(·) Trading revenue function. si,t Private information of agent i at time t. s¯t Triplet of fixed size, part of pseudo-state zi,t that serves as an input at LSTM at time t. ui,t Aggregate (trading and asset) control action of the asset trading agent i at time t. con vi,t (x) Volume of product x contracted by agent i at time t. exog wi,t Exogenous information of agent i at time t. zi,t Pseudo-state for agent i at time t. ∆i,t (τ) Imbalance for delivery time τ for agent i com- puted at time t. ∆i(τ) Final imbalance for delivery time τ for agent i. ∆Gi,t Change in the production level for the asset of agent i at time t. ∆Ci,t : Change in the consumption level for the asset of agent i at time t.

Functions Name Description clear(·) Market clearing function. b(·) Univariate stochastic model for exogenous in- formation. e(·) Encoder that maps from the original state space Hi to pseudo-state space Zi. f (·) Order book transition function. Gζ (·) Revenue collected over a trajectory. g(·) System dynamics of the MDP. k(·) System dynamics of asset trading process. l(·) Reduced action space construction function.

Pa−i,t (·) Probability distribution function for the actions of the rest of the agents −i.