A gradient boosting approach for optimal selection of bidding strategies in reservoir hydro

Hans Ole Riddervold a,b,∗, Signe Riemer-Sørensen c, Peter Szederjesi b and Magnus Korpås a

a NTNU, b Norsk Hydro, c SINTEF

ABSTRACT

Power producers use a wide range of decision support systems to manage and plan for sales in the day-ahead electricity market, and they are often faced with the challenge of choosing the most advantageous bidding strategy for any given day. The optimal solution is not known until after spot clearing. Results from the models and strategies used, and their impact on profitability, can either be registered continuously, or simulated with the use of historic data. Access to an increasing amount of data opens for the application of models to predict the best combination of models and strategy for any given day. In this article, the historical performance of two given bidding strategies over several years has been analyzed with a combination of domain knowledge and machine learning techniques (gradient boosting and neural networks). A wide range of variables accessible to the models prior to bidding has been evaluated to predict the optimal strategy for a given day. Results indicate that a machine learning model can learn to slightly outperform a static strategy where one bidding method is chosen based on overall historic performance.

Keywords: reservoir hydro; bidding strategies; hydro power; gradient boosting; neural network

1. Introduction

One of the main tasks for an operator of hydro electric power in a deregulated market is to decide how much power should be produced the following day. Several strategies for bidding available production exist and have been described in the literature [4, 33]. Each strategy will potentially lead to different commitments for production, which again will have an impact on profitability. Market prices and inflow for the next day are uncertain, and the profit associated with any selected strategy is not revealed until after the decisions are made. The question addressed in this article is whether a producer with access to sufficient amounts of information about the historical performance of different strategies can predict in advance which bidding strategy should be selected for a given day.

Bidding day-ahead production to the power exchange is typically done the day before the actual commitments are executed. "Issue date" is defined as the date when bidding is conducted, while "value date" is used for the date when the commitments from the bidding are realized through costs and income. Only variables that can be identified on or before the issue date can be used to classify the optimal strategy associated with the performance of a corresponding value date. Fig. 1 describes the relationship between the two.

Figure 1: Definition of issue- and value date, where "d" is date, and "P" is any date before the bidding time.

In the following sections, results from a case study investigating the historical performance of two different bidding strategies will be presented [29]. Further, the steps associated with the machine learning process applied in this article will be described, together with examples of application on the historical data. In Section 4, a concept for application and a case study will be presented, before drawing a conclusion in Section 5.

∗ Corresponding author: [email protected] (H.O. Riddervold); [email protected] (S. Riemer-Sørensen); [email protected] (P. Szederjesi); [email protected] (M. Korpås). https://www.ntnu.no/ansatte/hans.o.riddervold (H.O. Riddervold)

2. Results from historical bidding strategies

Two strategies have been evaluated in this analysis. In the first method, the expected volumes are found by deterministic optimization with a price forecast and inflow, and submitted as fixed hourly bids to the power exchange. The optimization is performed with SHOP, which is a software tool for optimal short-term hydropower scheduling developed by SINTEF Energy Research, and is used by many hydropower producers in the Nordic market [11]. The second strategy is stochastic bidding. The stochastic model is based on the deterministic method, but allows for a stochastic representation of the inflow to the reservoir and the day-ahead market prices. In this case, bid-curves can be generated from the stochastic model as described in [3].

Table 1: Example of performance quantification in EUR for the deterministic and stochastic models for three days in 2017 for the use-case river system.

Issue date | Value date | Deterministic (δ_det) | Stochastic (δ_stoch) | MIN | DELTA (Δ_s) | BEST
2017-07-01 | 2017-07-02 | 69.2 | 137.9 | 69.2 | 68.7 | 1
2017-07-02 | 2017-07-03 | 16.5 | 65.1 | 16.5 | 48.6 | 1
2017-07-03 | 2017-07-04 | 31.1 | 29.9 | 29.9 | -1.2 | 0

Examples of results from evaluating the two strategies for some selected days for a specific river system are shown in Table 1. The performance quantification for the deterministic and stochastic models are the performance gaps in EUR for these strategies.

In [29], a method for measuring the performance of individual historical bidding days has been proposed, where Π_s,d in Eq. 1 represents a measure of value for bidding strategy s on day d, and Π_opt,d describes the optimal value for the relevant bidding date based on a deterministic strategy with perfect foresight of price and inflow. The performance gap (δ_s,d) reflects the loss of choosing strategy s for day d compared to an optimal deterministic strategy, and is calculated as the difference between Π_opt,d and Π_s,d. A high value of δ_s,d indicates poor performance.

δ_s,d = Π_opt,d − Π_s,d    (1)

Since the deterministic and stochastic predictions are based on pre-bidding values, even the best model will rarely correspond exactly to the perfect foresight strategy computed after the actual inflows and prices are known. Consequently, we define the best model as the one with the lowest gap relative to the perfect scenario. We classify each date (data point) with "0" if the stochastic strategy leads to the lowest gap, and "1" if the deterministic strategy leads to the lowest gap, as shown in Table 1.

If we define the "strategy gap" (Δ_s,d) as the "performance gap stochastic" (δ_stoch,d) minus the "performance gap deterministic" (δ_det,d),

Δ_s,d = δ_stoch,d − δ_det,d    (2)

a high value is a strong signal to choose the deterministic model for that day. Negative values indicate that a stochastic model is preferred, and the more negative, the higher the importance of a stochastic model. Values around zero indicate a negligible difference in the choice of model for the day. With this insight, we see that a pure classification model only measuring the correct number of classifications will not necessarily give the best results if the overall target is to have a low deviation from optimum over time.

It seems obvious to apply supervised machine learning to the selection problem. We have tested two different approaches: i) labelling each data point as stochastic or deterministic based on the performance gap, and then training a classification model to predict which category unseen data points belong to; and ii) a regression model trained to predict the performances of each model directly, using a simple decision heuristic (minimum gap) to decide on the most advantageous strategy. In Sec. 3, we will systematically go through the different steps associated with the machine learning challenges for the two approaches.
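As a concrete illustration of the labelling in approach i) and the regression target in approach ii), the following is a minimal sketch, assuming pandas Series pi_opt, pi_det and pi_stoch holding the daily values Π_opt,d, Π_det,d and Π_stoch,d (the function and column names are ours, not from the article's codebase):

import pandas as pd

def label_days(pi_opt: pd.Series, pi_det: pd.Series, pi_stoch: pd.Series) -> pd.DataFrame:
    """Per-day performance gaps (Eq. 1), strategy gap (Eq. 2) and BEST label."""
    delta_det = pi_opt - pi_det        # performance gap of the deterministic strategy
    delta_stoch = pi_opt - pi_stoch    # performance gap of the stochastic strategy
    return pd.DataFrame({
        "delta_det": delta_det,
        "delta_stoch": delta_stoch,
        "strategy_gap": delta_stoch - delta_det,        # high -> deterministic preferred
        "best": (delta_det < delta_stoch).astype(int),  # 1 = deterministic, 0 = stochastic
    })

The "best" column is the classification target, while "strategy_gap" is the regression target.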

Figure 2: Typical sub-tasks involved in the machine learning analysis process: data integration, feature engineering, selection of algorithm, training of models, scaling and tuning, prediction, and interpreting, consolidating and deploying.

3. Machine Learning Process

The use of machine learning to classify strategies or predict values has gained significant momentum during the last few years [18]. Machine learning is a set of techniques that allow a computer algorithm to improve performance as it gains experience, which in our case is exposure to additional data. No explicit instructions are required, but instead, the algorithm needs some sort of training on representative input and output data. If the training is successful and the model can find some general patterns or behaviour in the data, it can subsequently be used to predict output for previously unseen input data. Machine learning basically covers everything from simple regression to deep neural networks.

Classification and regression problems are categorised as supervised learning, where the training takes place on pairs of input and output data, and the model is subsequently applied to unseen input data to generate predictions. Other categories are unsupervised learning, used for clustering and association, and reinforcement learning, used for decision optimisation.

Within the area of electricity power market analysis, neural networks have been used to investigate strategic and algorithmic bidding [7, 28] as well as for price- and load forecasting [6, 20]. Gradient boosting methods have received less attention, but have successfully been applied for load- and price forecasting [24, 14]. Results from the previously published articles indicate that machine learning techniques can successfully be applied to improve forecasting of load and prices in the power market, but there exist limited publications documenting operational use and added value from these techniques compared to what is state of the art in the industry. Several energy companies, as well as software-, data- and consultancy companies supplying the energy sector, advertise and promote the use of machine learning, and the European intra-day market has in particular been an area of interest in relation to application [26, 25].

In this article, we assume that the data is time-independent in the sense that a strategy choice for the next day does not affect which strategy will be the best choice further in the future. Such time dependence could potentially be accounted for in a reinforcement learning framework, which has received increased attention as a solution to the increasing amount of non-linear relationships and high-dimensional problems associated with hydropower production planning [23, 10, 34]. However, in general, reinforcement methods are still fairly immature and require significant fine-tuning on individual problems in order to work [15], so we consider them outside the scope of this work.

There are several ways of approaching a machine learning problem, but they often involve some or all of the sub-tasks illustrated in Fig. 2, and several iterative loops over the process. Our implementation is discussed in Sec. 3.1-3.6.

3.1. Data integration
3.1.1. Data gathering
Data gathering can be time consuming, and the data might require significant cleaning and quality assurance before it can be injected into a learning model. In addition to gathering data for the values we want to predict, defined as output variables, it could be wise to have an idea about what could be relevant input variables to be collected in the same process. In the case of predicting the best strategy using a classification model, the output variable is the prediction of a stochastic or deterministic model, while the input variables could be inflow and prices. The output from a regression model would be the strategy gap (Δ_s). That is, given some input variables (X), what is the predicted output variable (y).

Experienced production planners may have an intuitive perception of what the important input variables could be, and these can be used as initial values in the learning process. This is often an iterative process where additional insight gradually is gained and frequent reviews of the initial conditions are required. In order to maximise learning and minimise bias, it can be necessary to adjust the variables. Insufficient performance of a model should call for rethinking related to input, model structure and system design.

3.1.2. Input variables
The two strategies evaluated in this analysis are measured against a perfect foresight model where prices and inflow are known prior to bidding. All other factors going into the performance evaluation are equal. We expect that factors affecting the prices and inflows will have a significant effect on the performance gap. Another important factor affecting the production schedule is the water value. The water value can be defined as the future expected value of the stored marginal kWh of water, i.e. its alternative cost [32, 8]. The basic concept is to produce when the water value is lower than the price. The lower the water value is compared to the price, the stronger is the signal to produce. Based on this insight and domain knowledge of what might affect production, we apply some general hypotheses to select an initial set of input variables for further analysis, as given in Table 2.

All input variables in Table 2 are related to the issue date, and consequently available when the strategy for the next day is decided. The high and low columns indicate what is believed to be the preferred strategy, based on operational experience, in case of high and low values. "U" indicates that the domain experts are uncertain how the value will affect the classification, but still believe it is important for the choice of strategy.

3.1.3. Preparation and visualisation
Visualisation and sanity checking of data is important in any data analysis. The first step is to visualize all the input (X) and output variables (y). This is to detect missing or obviously incorrect data in the data set, or anything that should be corrected for during the analysis. A plot may quickly reveal periods of missing data or unrealistic values, or provide ideas of interactions between input and output values. Fig. 3 shows performance gaps (δ) together with the inflow deviation from normal. It provides a clear indication of a connection between periods with high inflow and poor performance of deterministic bidding.

Figure 3: Detailed results for performance gap from April-May 2018, compared to inflow relative to normal. Data from [29].

In Fig. 4 we plot the strategy gap (Δ_s) defined in Eq. 2, but only for absolute values larger than 200 €/day, corresponding to days where there is a significant impact from the choice of strategy. These values typically concentrate around the second quarter every year, indicating that the time of year, e.g. month, may be a relevant variable to include in the analysis.

Table 2: Initial set of input variables, hypotheses for how they affect the choice of strategy, and the strategy that the domain experts presumably would select for high/low values of the variables.

Nr. | Variable | Hypothesis | High | Low
1 | Inflow deviation from historic normal | Higher observed inflow increases the risk of flooding for the next day (value date) | S | D
2 | Reservoir filling | High or low reservoir filling increases the risk of flooding or resource shortage during the value date | S | S
3 | Price volatility | High volatility in prices gives increased uncertainty for prices the next day | S | D
4 | Price volatility in the prognosis | High volatility in the price prognosis gives increased uncertainty for prices the next day | S | D
5 | Water value | The water value is the primary deciding factor for production, and will clearly have an effect on the strategy choice when seen in relation to other variables | U | U
6 | Average price | Price relative to water value is important | U | U
7 | Average price prognosis | Price prognosis relative to water value is important | U | U

Figure 4: Strategy gaps in the period 2016-2018.

3.1.4. Correlation
In Fig. 5 the relative (Spearman) correlations between the variables are shown. If some of the independent variables (X) are un-correlated with the dependent variable(s) (y) of interest, they can be removed. If some of the independent variables are correlated with each other, a subset can be removed or, even better, combined.

Figure 5: Correlation matrix for the initial set of variables.

We are mainly interested in the correlation between the variables and the prediction indicated in the "BEST" column. A higher number indicates that the variable will play a more important role in the classification. In this case, the water value and the filling in reservoir 2 are the variables with the highest correlation to the prediction. There is also a strong correlation between the prices associated with the issue date (average_p) and the prognosis for the value date (average_prog).

Grid plots as shown in Fig. 6 can also be used to plot all variables against each other to see if they group clearly into categories. A trained eye will spot the correlations from the grid plot. In this plot, blue (triangular) markers indicate when the stochastic model performs best, while red (round) markers indicate when the deterministic model is preferred. When the level in the intake reservoir is high or the inflow is well above normal, a stochastic model is preferred.

Figure 6: Grid plot of inflow deviation from normal (m3/s) against reservoir level for the intake reservoir (%) (denoted reservoir filling in Fig. 5).

3.2. Feature selection and engineering
The terms variable and feature are often used without distinction when referring to machine learning problems, except for kernel methods for which the features are not explicitly computed [12]. In this article, input variables are primarily used when referring to variables collected in the data collection phase, while we use the term feature for constructed variables engineered to maximise the information available for the model.

We thereby define feature engineering as the selection and combination of relevant information we are investigating in order to optimally exploit the available characteristics in the data. In an ideal world with unlimited training data and computational power, the machine learning algorithm could be fed all the available variables and figure out the relevant features itself. However, with limited access to data, feature engineering is often an iterative process where you go back and forth between data pre-processing, correlations, feature engineering and model fitting.

In this case study, two examples of extracted features are the month and the strategy gap associated with the issue date. In the input data, the date is given, and extracting the month is therefore a trivial task. The same applies for the strategy gap. In the initial choice of variables, we are not considering the time dimension of each variable. Given our physical knowledge of the system, we know there could be some delay between parameter changes. Consequently, it might be better to shift some parameters by a certain amount of time to get a variable called X-t. As illustrated in Table 1, the data contains the strategy gap for each day, and the new feature is then the strategy gap for the value day minus one.

More advanced methods, such as principal component analysis, can also be used to get ideas for feature engineering, but have not been exploited further in this analysis.

3.2.1. Bid- and ask curves
It can be hypothesised that price volatility can be predicted by investigating bid/ask sensitivities in the day-ahead market. In the Nordic market, the spot price for each hour is determined by the crossing point between the bid and ask curves published daily after spot clearing by Nord Pool. Two examples of bid- and ask curves after interpolation are illustrated in Figures 7a and 7b, where the dashed line is the bid curve and the solid line is the ask curve. In the left plot, the curves show a fairly stable balance but with a larger price sensitivity to the upside, whereas the curves in the right plot show a balance with strong sensitivity to the downside, i.e. a price collapse.

Thus, the spot price volatility depends on the shape of the two curves and how they change from hour to hour. As seen, the (dashed) bid curve is normally vertically shaped, implying that power consumption on a given day is fairly inelastic. The (solid) ask curve has a significant price drop for low volumes and a significant price rise for high volumes. This reflects the underlying price elasticity of supply from different sources of energy.

The hypothesis is that when demand is either close to the point where the bid price increases almost exponentially, or close to the point where the bid price drops toward zero, an increased volatility is probable, and a stochastic bidding strategy is preferred. Again, only information available prior to bidding can be used to predict the optimal strategy for the next day, and we therefore use the bid curves for the issue date to predict the strategy for the value date.

As a base we use the bid and ask volumes for the full range of prices available from Nord Pool. The bid and ask curves are constructed by interpolating the points with a 0.1 EUR/MWh granularity. They are then used to compute price sensitivities based on shifts in the bid curve. The resulting features are the price differences resulting from 1000 MW changes in load, i.e. one price difference after a horizontal shift of +1000 MW in the bid curve and one price difference based on a -1000 MW shift. The price difference is computed as the crossing-point price between the bid and ask curves after a shift, minus the pre-shift crossing-point price. The computed hourly sensitivities are presented in Fig. 8a. These are used to compute a rolling volatility feature based on the standard deviation of the price sensitivities for the previous day, presented over time in Fig. 8b.

The price sensitivity to shifts in the bid curve thus varies over time and can occasionally experience sudden spikes or build-ups over multiple days. The assumption that the next day will experience a high volatility if the current day had a high volatility is intended to be captured by the t-1 rolling volatility feature. An additional t-2 feature was tested, but did not yield any qualitative improvements in the classification of the optimal strategy. Further, rate-of-change features were computed as the differences in standard deviation of price sensitivities from day t-2 to day t-1, but did not yield any significant improvements in the quality of the predictions when added to the model.

3.2.2. Features used
To evaluate the benefit of increasing the number of variables, two input data-sets are defined. The first data-set is referred to as simple input, and consists of the eight variables listed in Table 2. The second set is referred to as complex input and consists of the following variables:

• the eight simple variables
• all hourly prices for both the issue date and the prognosis for the value date
• bid-ask curves
• rolling volatility
• month, year, day and performance of similar weekdays
• strategy gap for the issue date
• rate of change for reservoir filling
• difference between price and water value

In total, the complex-input data-set consists of 113 variables.

3.3. Selecting the model and algorithm
The problem at hand can either be solved as a classification problem, where the target is to predict the best bidding strategy, or by predicting the strategy gap directly and deciding the optimal strategy based on this.

Figure 7: Bid (dashed) and ask (solid) curves from the Nord Pool power market. (a) February 11, 2016, hour 11. (b) May 1, 2018, hour 2.

Figure 8: Bid- and ask sensitivities and rolling volatility in the Nord Pool power market. (a) Price difference based on a +/- 1000 MW shift in load. (b) Rolling volatility feature based on the standard deviation of price sensitivities for the previous day.
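To make the construction of these sensitivity and volatility features concrete, here is a minimal sketch, assuming interpolated bid and ask curves on a common 0.1 EUR/MWh price grid (all function and variable names are ours, not from the article):

import numpy as np
import pandas as pd

def crossing_price(prices, bid_vol, ask_vol):
    # Bid volume falls and ask volume rises with price, so the spot price is
    # (approximately) the grid point where the two interpolated curves intersect.
    return prices[np.argmin(np.abs(bid_vol - ask_vol))]

def bid_shift_sensitivities(prices, bid_vol, ask_vol, shift_mw=1000.0):
    # Price differences after horizontal +/- 1000 MW shifts of the bid curve,
    # relative to the pre-shift crossing-point price.
    base = crossing_price(prices, bid_vol, ask_vol)
    up = crossing_price(prices, bid_vol + shift_mw, ask_vol) - base
    down = crossing_price(prices, bid_vol - shift_mw, ask_vol) - base
    return up, down

def rolling_volatility(hourly_sens: pd.Series) -> pd.Series:
    # t-1 rolling volatility: standard deviation over the previous 24 hourly
    # sensitivities, shifted by one step so only past data enters the feature.
    return hourly_sens.rolling(window=24).std().shift(1)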

An algorithm suitable for both approaches is boosted decision trees, as implemented in the XGBoost library [9]. XGBoost implements the gradient boosting [13] decision tree algorithm and uses multiple trees to classify samples. This gives valuable insight into the feature importance and possible feature interactions. The main hyperparameters are the learning rate, the maximum tree depth and the number of estimators, represented by trees [1]. The importance of each parameter in the trees can be derived in several ways. In this article, we have applied two methods to quantify the importance, GAIN and SHAP, which will be described further in Sec. 3.5.

Figure 9 illustrates a randomly selected tree from the resulting classification model for a limited set of features. The first split feature is placed highest, or to the left, in the tree, followed by subsequent criteria. The total model is an average of multiple trees with different representations of the features.

Fully connected neural networks (NN) and recurrent neural networks (RNN) have been tested with less success, and a comparison with these methods is included in Sec. 4.2.1.
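A minimal sketch of such a classifier with the XGBoost Python API (the hyperparameter values are illustrative placeholders, not the tuned values from Appendix A; X_train, y_train and X_test are assumed prepared as described in Sec. 3.4):

from xgboost import XGBClassifier

model = XGBClassifier(
    objective="binary:logistic",  # outputs probabilities rather than hard labels
    learning_rate=0.1,
    max_depth=4,
    n_estimators=200,
)
model.fit(X_train, y_train)                 # y: 0 = stochastic, 1 = deterministic
proba = model.predict_proba(X_test)[:, 1]   # probability that deterministic is best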

Figure 9: An example of one classification tree from the model trained on a limited set of features, with splits on water_value, reservoir_filling_1 and vol_p. The total model is an average over many trees.

3.4. Splitting the data
Before we can train a model, we need to split the data into training samples and test samples. The basic principle is that we build and train the model based on a fraction of the data called training data, and then test the performance of the model on data that are not used when building the model. This sample is referred to as test data, and represents unseen samples that are not accessible in the model building phase. There exist several ways of generating training and test sets. A common approach is to define the ratio between training and test data, e.g. 0.7/0.3, and let the model randomly split the data into sub-sets.

Applying the concept of randomly splitting the data into training and test sets might conflict with the underlying mechanisms of the power market. When managing hydro power, weather plays an important role. The possible outcomes and combinations of reservoir levels, inflow, snow and prices are widely spread, and power producers therefore tend to use a long history of observations to represent the possible scenarios in their decision support models, typically in the range of 30-60 years of history. Consequently, a random split of historic data may leak information from the historic future into the training sample, which is not desirable. Instead, we can use a sequential split of the data, and test the trained model on years not previously seen by the model. Such a split will lead to a model that performs well on years with a similar historic representation, but with very little predictive power in "freak" years. Three years of data are available in this analysis. If we assume that two years are used for training and one year is used for testing, we get a split of 67/33 between training and test data. In addition to random splitting, we have therefore tested splitting the data with the years 2016 and 2017 used for training and 2018 used for testing. Results from this analysis are presented in Sec. 4.1.

Another method, which has not been tested in this analysis, is to use all available historic information, or a rolling number of historic days, to train the model. The model is then used to classify the next day. To test the performance of such a model, a simulation framework would be needed, since we depend on updating the training set every day, as well as re-calibrating the model.

3.5. Feature scaling and hyperparameter tuning
3.5.1. Feature scaling
Not all features have the same scale: some have values of the order of 1000s, and some are 0.1. In order to let them influence the model equally, we need to "put everything on the same scale". We can either scale everything to a fixed range of values or change the distribution to become a normalised Gaussian.

In general, decision tree algorithms such as XGBoost do not require scaling [1], but it might help with quicker convergence in relation to numerical processing. Scaling is required for neural networks, so to be able to apply the same pre-processing of the data, similar scaling has been applied for both XGBoost and the neural networks in this article.

Depending on the sample size, the test data can either be scaled with their own scaling (for large samples), or with the training sample (small samples). What to choose depends on how you would treat the actual data you will later use with the model.

Time-dependent data may have trends that make a simple scaling meaningless, in particular when training on historic data and applying the model to new data. Fig. 10a illustrates the distribution of water values for each of the three years of data. Fig. 10b shows the results after applying standard scaling to all years jointly. The distributions differ significantly between the three years. If two years with relatively low values, such as 2016 and 2017, are used for training, the values are not representative for the 2018 test data, which is dominated by higher values. If we instead scale the water values per year, the resulting distributions become very similar, as shown in Fig. 10c. Only the scaling of water values has been illustrated in the figures, but the same pattern can be observed for several of the input variables. Scaling on individual years has therefore been applied to all variables in this analysis. In real-life application, it is a challenge that we are not able to scale the daily input variables with the yearly average, since this information first becomes available after the year has passed. An implementable alternative would be to scale the input variables with a 365-day rolling average.
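A minimal sketch of the sequential split and the per-year scaling described above (df is a hypothetical DataFrame of daily samples with a "year" column, and feature_cols a list of the input variable names; both are assumptions for illustration):

train = df[df["year"].isin([2016, 2017])]
test = df[df["year"] == 2018]

def scale_per_year(frame, feature_cols):
    # Standard-scale each variable within each year, so that the yearly
    # distributions become comparable (cf. Fig. 10c).
    scaled = frame.copy()
    for _, group in frame.groupby("year"):
        mu = group[feature_cols].mean()
        sigma = group[feature_cols].std()
        scaled.loc[group.index, feature_cols] = (group[feature_cols] - mu) / sigma
    return scaled

train_scaled = scale_per_year(train, feature_cols)
test_scaled = scale_per_year(test, feature_cols)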

Figure 10: Distribution profile for water values un-scaled (a), scaled over all years together (b), and scaled individually per year (c).

3.5.2. Hyperparameter tuning
There are several hyperparameters in the XGBoost model that can be tuned. We have applied the randomized search functionality in sklearn [27] on a selection of parameters listed in Appendix A, along with their final values. The resulting hyperparameter values after tuning vary both depending on the number of features that are included, and as a result of the random selection of training and test data.

To avoid that the parameters are over-optimised towards the training data, the training data can be further split into sub-sets. A validation data set is a subset of the training data that is used during the training to evaluate the model variation due to changes in hyperparameter values and data preparation.

In our case, we have split the randomly selected training set into 5 sub-sets of approximately equal size, normally defined as folds. The first fold is treated as a validation set, and the method is fit on the remaining k-1 folds [16]. These folds are then evaluated with 1000 sets of randomly generated hyperparameters. The resulting accuracy obtained for each fold for every combination of hyperparameters is plotted together with the average value in Fig. 11. Firstly, we observe a spread in accuracy within each fold, illustrating the importance of tuning the hyperparameters. Further inspection of the optimization output reveals a complicated parameter landscape with multiple local minima. We also observe a spread in accuracy between the folds, indicating that the folds may not be representative of the full data set. This could be improved with additional data. The range in accuracy observed indicates that a fairly large spread can be expected when the model is applied on randomly generated training and test data. Finally, the hyperparameters associated with the best average score, which can be observed around iteration nr. 400 in Fig. 11, are selected as the parameters used further in the analysis.

Figure 11: Hyperparameter tuning over 1000 iterations of random combinations of parameters.

3.5.3. Data
In machine learning problems, it is tempting to include all parameters which are assumed to have an impact on the results. However, introducing too many features might introduce noise and distort the solution [30]. Reducing the number of features can be done manually by inspecting the data and removing parameters with low correlation with the results, or high correlation with another parameter, making one of them redundant. This topic is addressed in Sec. 3.1.4.

XGBoost will also by default remove variables that do not have sufficient impact on the model performance. The variables remaining after hyperparameter tuning and model optimization may be a subset of the original input variables, and the result is an optimized combination of variables and parameters. When expanding the model with additional variables, the time used for parameter tuning increases considerably. The question is if we, prior to model fine-tuning, can reduce the complexity by removing variables that have limited effect on the results. Here we have used a rule-based algorithm where the model performance is evaluated as we gradually remove the features with the lowest impact on the classification results for a fixed set of hyperparameters. The feature importance has been evaluated as the "gain" computed from the decision trees in XGBoost. The gain [5] is defined as the improvement in accuracy from adding a split on a given feature to a branch in the classification tree. Before adding the new split, there were some wrongly classified elements, but after adding the split there are two branches which are more accurate.

We refit the model while subsequently removing the feature with the lowest gain. Afterwards, we select the feature combination with the highest accuracy as the best set of features. This process will be referred to as GAINS in this article.

The GAINS method has been applied on the complex data set described in Sec. 4.1.2. Before applying GAINS, a rough screening of hyperparameters over 10 iterations is conducted. The number of input variables is then reduced based on feature importance, and the final hyperparameters are calculated over 1000 iterations on the smaller set of input variables.
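A sketch of the GAINS loop under these assumptions (fixed_params stands for the fixed hyperparameter set, which is our placeholder name; the accuracy-based selection criterion follows the description above):

from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score

features = list(X_train.columns)
best_features, best_acc = list(features), 0.0
while len(features) > 1:
    model = XGBClassifier(**fixed_params).fit(X_train[features], y_train)
    acc = accuracy_score(y_test, model.predict(X_test[features]))
    if acc > best_acc:
        best_features, best_acc = list(features), acc
    gains = model.get_booster().get_score(importance_type="gain")
    # Features never used in a split are missing from `gains`; treat them as zero.
    features.sort(key=lambda f: gains.get(f, 0.0))
    features = features[1:]  # drop the feature with the lowest gain and refit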

Figure 12: SHAP values on individual and aggregated feature values. (a) Individual predictions. (b) Aggregation of SHAP values for the two most important features.

3.5.4. Feature importance and explanations
Another useful tool when evaluating the relevance and importance of different features is SHapley Additive exPlanations (SHAP). The framework interprets a target model by applying Shapley game theory for how a reward given to a team should be distributed between the individual players based on their contributions [22]. The features are interpreted as "contributors", and the prediction task corresponds to the "game". The "reward" is the actual prediction minus the result from the explanation model. The underlying idea is to take a complex model, which has learnt global non-linear patterns in the data, and break it down into lots of local linear models which describe individual data points.

The SHAP values can be interpreted either for individual predictions or for the entire sample. Fig. 12a shows one sample classified using the regression model described in Sec. 4.2. The output value is the prediction for that observation, which in this case is -224.45, and consequently the sample is classified as stochastic. The base value of -32 is the value that would be predicted if we did not know any other features for the current output. This is the mean value of the strategy gap for all samples in the training set, which is logical, since in lack of other information, we would predict the average value for any new sample. Interestingly, if no other information is available, we would classify the sample as stochastic. The red and blue arrows illustrate how adding other feature values pushes the prediction towards higher or lower values. Red indicates a push towards higher values and deterministic bidding, and blue a push towards lower values and stochastic bidding. In this sample there is a clear push towards stochastic bidding, mainly driven by the water value. The only other feature with any significant impact on this sample is the volatility of today's price, which pushes in the direction of deterministic bidding.

Fig. 12b illustrates an aggregation of values for two of the most important features in the regression problem. Variables are ranked in descending order of feature importance. The SHAP values can be interpreted as "odds", i.e. the probability of predicting the higher/correct value, which in our case is the probability of predicting 1, meaning that the deterministic model is best. For high (low) SHAP values, that specific variable contributes to increasing (decreasing) the probability of predicting deterministic (stochastic) bidding. This is illustrated by the horizontal location, which shows whether the effect of that value is associated with a higher or lower prediction. The color of the individual sample shows whether that variable is high (red) or low (blue) for that observation. If the colors are split similarly to the SHAP values, the relation is simple; otherwise it is probably complex and dependent on multiple variables. A low water value has a large and negative impact on the strategy gap, pushing in favour of stochastic bidding, while a high water value favours deterministic bidding. The "low" comes from the blue color, and the "negative" impact is shown on the x-axis.

The computation of SHAP values assumes independence between the features. Our features are not independent, which may affect the ranking somewhat [2]. However, since SHAP primarily is used in this article to qualitatively understand and illustrate the potential impact of different features, and the selection of features is done in combination with other methods such as GAINS, the fact that some of the variables are dependent plays a minor role in relation to prediction accuracy.

3.6. Prediction
The main target of the classification problem is to predict which bidding strategy to use for the next day. Zeros indicate that a stochastic strategy should be chosen, while ones indicate a deterministic strategy. Fig. 13a illustrates predictions from the classification model together with the actual historic best strategy from the test data. A nice attribute of the XGBoost classifier with a binary logistic objective is that the outcome is given as probabilities and not pure classifications. This means that a figure close to zero gives a high probability of the sample being stochastic, while a figure just beneath 0.5 still classifies as stochastic, but with lower probability.

When applying a single-output regression model, the output will be the strategy gap representing the predicted difference in value for choosing one strategy in favour of another. It has previously been explained that a negative strategy gap indicates stochastic bidding, while positive numbers indicate deterministic bidding. While the classification model gives probabilities between zero and one, the regression model will span values in a much larger range, capturing the values that are at stake (after inversion of any applied scaling). Fig. 13b illustrates the predicted strategy gap from the XGBoost regression model compared with the observations in the test data. The target is still to decide on one bidding strategy for the next day. The results from the regression model must therefore be transformed to binary recommendations of either a stochastic or a deterministic strategy.
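A minimal sketch of how the two model outputs might be turned into a daily decision (clf and reg are assumed fitted classification and regression models, and X_next_day the prepared feature row(s) for the issue date):

# Classification: probability of the deterministic strategy, cut at 0.5.
proba_det = clf.predict_proba(X_next_day)[:, 1]
choice_clf = (proba_det > 0.5).astype(int)   # 1 = deterministic, 0 = stochastic

# Regression: sign of the predicted strategy gap decides the strategy.
gap_pred = reg.predict(X_next_day)           # predicted strategy gap [EUR/day]
choice_reg = (gap_pred > 0).astype(int)      # positive gap favours deterministic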

Both the classification and the regression model open for the possibility of applying thresholds on the results/probabilities. As an example, when applying the classification model with a threshold of 0.5, all values above the threshold will be classified as deterministic, while all below will be classified as stochastic. If thresholds of 0.4 and 0.6 are chosen, only samples with values under or over these values will be classified into a category; values in between will not be classified. The accuracy will typically increase when applying a tighter threshold, however with the price of an increased number of unclassified days. In both these cases the model is risk-neutral when deciding on a strategy. It is also possible to have a skewed threshold where we are more risk averse to choosing one strategy as opposed to another. If previous analysis [29] has shown that one strategy is to be preferred, a threshold can be required before the other strategy is selected. If for instance a threshold of 0.6 is applied, a clear tendency in the direction of deterministic bidding is required before this strategy is chosen in favour of a stochastic strategy. Applying thresholds might have an even higher impact on regression models. While the prediction probabilities associated with the classification models are only proxies for the importance of choosing the correct strategy for a selected day, the regression model directly indicates the potential losses associated with choosing the wrong strategy. A strategy focusing on correct classification of the samples with major cost impact could be a viable solution.

Figure 13: Predictions with classification and regression for the simple model. (a) Classification results: predicted probability of the sample being deterministic. (b) Regression results: predicted strategy gap (Δ) [€/day] compared with the actual test data.

Even though high classification accuracy is an important target, it is not really the main objective. The main objective related to model selection is to reduce the performance gap compared to a strategy where the best strategy is selected every day. In this sense, identifying the correct bidding strategy on dates where there is a large performance gap between the two strategies, as quantified by the previously defined strategy gap, will be more important.

3.6.1. Evaluation of model performance
To monitor the performance of the model, two measures are introduced. The first is the accuracy A:

A = (Number of correct predictions) / (Total number of predictions)    (3)

The second measure, δ_realistic, is a measure of how far we are from the optimal bidding strategy:

δ_realistic = (δ_gap − δ_gap,opt) / δ_gap,opt    (4)

where δ_gap represents the average performance gap from the optimal plan for all samples in the test data with the selected strategy, and δ_gap,opt represents the average performance gap from the optimal plan for all classified samples if the optimal strategy had been selected every day. δ_realistic is measured as the percent deviation from the optimal bidding strategy, and will be referred to as the "Realistic Performance Gap". The model is designed to suggest an optimal bidding strategy, and performance should therefore be measured against a benchmark where an optimal bidding strategy is selected for every day, and not against an optimal plan with perfect foresight on prices and inflow. This is why the measure δ_realistic is introduced, rather than using δ_gap as the performance measure.

0.63 0.155 dom seeds. This indicates noise in the data, and that finding a universal set of variables and hyperparameters that fit well 0.150 0.62 with all data, might be difficult. The result for the random ]

0.145 % [ sampling in Table3 is therefor when the model is fitted using

i

0.61 , c i t s i

l the eight original variables. In order to determine the vari- a e

0.140 r

ance of possible outputs on the random sample, we bootstrap 0.60 e c n 0.135 a [31] the test sample and determine the performance on each m r accuracy Ai [%]

o * 0.59 f A r sub-sample individually. The mean value( mean) and stan- e 0.130 P dard deviation(A*std ) after boostrapping on 100 samples is 0.58 0.125 also indicated in the same table. As expected, applying the model on sequential data give 0.57 0.120 poorer accuracy performance than for random samples. When 0 1 2 3 4 5 6 7 number of variables dropped [i] 2016 and 2017 data are used to predict 2018 strategies, the results are barely better than random guessing. When in- Figure 14: Variable reduction vestigation of 2016 and 2017 data in addition reveal that stochastic bidding is best 58% of the time, and the amount of stochastic samples in 2018 data is 52%, purely applying a binomial selection with a probability factor for stochastic 4. Concept for application and case study results equal to 0.58 would increase the accuracy to above An important prerequisite when building a learning model, 50% without taking into account any variables. is the access to data, and specifically in this case, historic bid- One would however, expect that the results from sequen- ding performance. To benefit from previous work, this anal- tial splitting would approach the accuracy obtained by ran- ysis is based on the results obtained in [29] where bidding dom split as more data will become available. performance for a river system located in South-Western Nor- The random sampling give an accuracy around 62% , but way has been analysed for 2018. Given the large variations also here the training and test set consists of an average of that are associated with operation of a hydro-based system, 54% stochastic samples, so we are only increasing accuracy one year of data is considered to be too short. The period by 8% compared to applying a binomial selection without evaluated has therefore been extend with additional simula- any variable input. tions for the years 2016 and 2017. An interesting observation is that even though there is an  The framework for how a boosting algorithm can be used average of 18% realistic performance gap ( ) for the predic- to indicate the optimal bidding strategy for any given day tion model when applying random sampling, the gap is still has been presented in section3. The examples used to il- 3% better than when applying a pure stochastic strategy for lustrate the learning process are all linked to the real life all days in the evaluation period. The benefit in addition to case discussed in this section. The data used for illustration a marginal better performance, is the reduced computational

HO Riddervold et al.: Page 11 of 17 Table 3

Results for predicted accuracy (A) and realistic performance gap (realistic ) with different modelling approaches * * Model description A realistic A mean A std (1) XSCR Xgboost Simple Classification Random 0.62 0.18 0.61 0.03 (2) XSCS Xgboost Simple Classification Sequential 0.56 0.21 0.55 0.03 (3) XCCR Xgboost Complex Classification Random 0.63 0.15 0.63 0.03 (4) XSRR Xgboost Simple Regression Random 0.60 0.15 0.60 0.03 (5) NNRR Neural Network Regression Random 0.59 0.19 NA NA (6) RNRS Regression Sequential 0.57 0.22 NA NA

capacity required to perform the bidding process, since we 4.1.3. Comparison of GAINS and SHAP reduce the number of days when the stochastic bidding pro- SHAP and GAINS values carry different information [22]. cess must be used by almost 40%. Relying solely on a deter- In Fig.16, the feature importance from GAINS is displayed ministic bidding strategy is not a good idea, since this would (in black) together with the resulting values from the SHAP give a realistic performance gap () of 43%. analysis (red/blue for positive/negative impact). Feature im- portance from the GAIN-loop have been normalised to the 4.1.2. Complex model - Variable extension and average value of the SHAP samples to facilitate for better reduction comparison. The results from the boosting model when applying the The SHAP values are sorted with the most important limited set of original variables will potentially give the power variables at the top. Red color means a feature is positively producer better insight in relation to choosing the correct correlated with the target variable, while blue color indicate strategy, but the accuracy of the model still seems to be low negative correlation. High water values correlates with de- compared to what is required from a good prediction model. terministic bidding. Inflow deviation shown in blue indicate Will increasing the amount of variables and complexity in that low values for this variable favours deterministic bid- the model increase the performance? In addition to daily av- ding. The same is the case for strategy gap for day minus erage values associated with price and volatility, all hourly 1 (DELTA_1). We already know that a high value for strat- prices for both issue date and prognosis for value date are egy gap indicates deterministic bidding. The SHAP value included. The hypothesis is that the classification model in- further implies that if deterministic bidding was the best for terprets the inherent volatility and importance of individual day minus one (issue date), there is an increased probability hours when predicting the bidding strategy. Bid-ask sensi- for deterministic bidding also is the best strategy for the day tivities as described in 3.2.1 have also been added. Finally, we investigate bidding for (value date). additional features such as month, year, day, performance of There is clearly a deviation between the two methods. similar week-days, performance gap for day minus one, rate SHAP seems to emphasize the importance of the strategy of change for reservoir filling and difference between price gap observed for day minus one to a larger extent than fea- and water value are included in the model. The extended ture importance from GAIN. On the other hand, GAINS give model now consist of 113 variable compared to the 8 orig- higher relative scores for hourly variables such as bid sen- inal variables. A correlation matrix and intuition would re- sitivity in hour 8 (bd_8) and price prognosis for hour 24 veal that several of the new and old variables are correlated (p_prog24). and could potentially be removed. When the SHAP analysis is performed after a consider- We apply the GAIN-loop and SHAP analysis to the same able number of variables are removed, which is the case in data sample as used for the eight original variables. Due to Fig. 16, we have lost information about the potential effect the computational requirements from increasing the number of the filtered variables. 
One would assume that these any- of features, the initial hyperparameter tuning is reduced to 10 how contribute with limited information, but a test applying iterations. This is both to reduce the time used in the tuning SHAP on an un-reduced model, indicate that two variables, process, but also to avoid that the parameters are customized (vol_roll_2) and (bd_1), also appear among the top eight to the data-set consisting of all variables. After the parame- variables in the SHAP analysis. These are therefor also in- ter tuning process, only 15 of the original 113 variables are cluded when moving forward to the next step in the analysis. still present in the model, and the GAIN-loop further reduces Water value, inflow deviation and reservoir filling 2 which the number to 11. were part of the eigth original variables are listed as impor- Fig. 15 illustrate how the accuracy gradually increase as tant in both SHAP and GAINS. In addition, features such as more variables are removed until we reach the highest accu- strategy gap for value day minus one, delta water value and i  racy at = 11. The realistic performance gap( ) is also low price, rate of change for reservoir filling seem to have sig- at this point. nificant impact on the resulting predictions. Furthermore, GAINS seem to emphasize the importance of the bid curves for hour 8 (bd_8) and the volatility in the next days progno-

An interesting observation is that of the 96 hourly variables, we are now left with two. These are most likely chosen by the model because they represent the possibility for low prices during the night (bd_1) and morning peak prices (bd_8).

Figure 15: Variable reduction for the complex model.

Figure 16: Top 15 SHAP values (DELTA_1, inflow_deviation, delta_watervalue_price, water_value, reservoir_filling_2, rate_of_change_per_day_resevoir_filling, bd_8, reservoir_filling_1, p_prog18, vol_prog and p_prog24, among others) compared with the feature importance from GAINS.

Exactly where the cut-off should be in relation to how many variables should be brought further to the next step in the analysis is difficult to measure, but three additional variables (reservoir_filling_1, p_prog18, and p_prog24) listed in Fig. 16 are removed manually, since these neither seem to be significant in GAINS nor in SHAP.

The 10 final variables are the 8 remaining variables in Fig. 16, in addition to (vol_roll_2) and (bd_1). The next step is to re-run the parameter tuning process with the 10 variables over 1000 iterations, and test the model on randomly sampled training and test data. The same random seed as used for evaluating the original model with eight variables is used to illustrate the performance on an identical set of data.

The accuracy using these 10 variables turns out to be 63%, which is 1% better than the result from the simple model where the eight original variables are used. Given that the 113 new variables are reduced to 10, this is a clear indication that most of the new variables contain little or no additional information relevant for improving the prediction accuracy. On the contrary, including new variables might just as well introduce noise to the problem and distort the solution, as explained in Sec. 3.5.3.

Two main conclusions can be drawn from the comparison between SHAP and GAINS, and the re-run of the model with a reduced number of variables. Firstly, variable reduction based on GAIN and SHAP might not necessarily end up giving us the combination of variables that gives the best prediction. This is because the potential interactions between variables are not sufficiently addressed in this process [21]. Secondly, there is a substantial amount of noise in the data, leading to high sensitivity of the final results to the choice of variables and the composition of training and test data. This leaves us with a relatively weak prediction model. However, the situation can be improved with more, or less noisy, data.

4.2. Regression
The classification process described in Sec. 4.1 is primarily designed and optimized towards obtaining the best prediction accuracy for stochastic and deterministic bidding. The best accuracy correlates (albeit weakly) with when the lowest realistic performance gap (δ_realistic) is observed, but it can be seen that there are instances where a low δ_realistic could have been obtained at the expense of poorer overall accuracy. In Fig. 15 the performance gap barely changes when the number of variables dropped increases from four to six, but the accuracy is reduced by almost 4%, from 62% to 58%.

If the primary target is to obtain as low a δ_realistic as possible, several adjustments to the model could be made. First of all, the classification model has been trained with a binary logistic objective, which specifically optimizes the accuracy regardless of the resulting gap. Fitting for the actual gap values with e.g. a mean squared error (MSE) objective would penalise large gaps more than small gaps.

It is also possible to introduce more categories and perform a multi-class classification. One could for instance use five groups representing ranges of the strategy gap. One group could be samples with strategy gaps in the range -400 to -200 €/day. These are samples which we obviously would like to predict as stochastic, and additional weight on this group could be incorporated in the objective (loss) function.

The input data has already been classified as stochastic or deterministic based on the positive/negative values of the strategy gap. To take into account the importance of the numeric values of the strategy gap, an alternative method is to perform regression on the strategy gap directly, and rather classify the best strategy with simple heuristics after predicting the strategy gap.
The input data has already been classified as stochastic or deterministic based on the positive/negative values of the strategy gap. To take into account the importance of the numeric values of the strategy gap, an alternative method is to perform regression on the strategy gap directly, and rather classify the best strategy with simple heuristics after predicting the strategy gap. The reason for not pursuing this method as our primary approach is the relatively limited amount of data, and the amount of noise in the input data, which lead to a model with relatively poor performance on predicting the strategy gap.
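A minimal sketch of this regression-plus-heuristic idea, again with hypothetical stand-ins for the real data, could look as follows:

```python
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for the real feature matrix and strategy gap.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
gap = 100.0 * X[:, 0] + rng.normal(scale=50.0, size=500)

X_train, X_test, gap_train, gap_test = train_test_split(
    X, gap, test_size=0.25, random_state=42)

# Regression on the numeric strategy gap with a squared-error objective.
reg = xgb.XGBRegressor(objective="reg:squarederror",
                       n_estimators=250, max_depth=3)
reg.fit(X_train, gap_train)

# Simple heuristic: a negative predicted gap favours stochastic bidding,
# a positive one deterministic bidding.
pred_gap = reg.predict(X_test)
strategy = np.where(pred_gap < 0, "stochastic", "deterministic")
```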

However, if we assume that we are only interested in whether the model predicts a number that is higher or lower than zero, the regression model might actually be able to incorporate an important aspect not sufficiently addressed in the classification. With a mean squared error objective, the GAIN-loop can focus on variables that contribute to reducing the gap. The criteria for selecting the best set of variables can therefore be based on the strategy gap, rather than the classification accuracy. MSE is given by Eq. 5, where $\Delta_i$ is the strategy gap for sample $i$ and $\hat{\Delta}_i$ is the predicted value.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\Delta_i - \hat{\Delta}_i\right)^2 \qquad (5)$$

Given the relatively small difference in results obtained by introducing the increased complexity in the complex model, regression results have only been tested on the simple model, and the results can be seen in Table 3. As expected, the classification accuracy is 2% worse than for the model optimised towards classification, but the performance gap for the model is actually 3% better.

4.2.1. Neural Networks

The XGBoost algorithm used in the analysis in the previous sections was chosen for two main reasons. First, it is a method allowing for transparency in relation to tracking the importance of the different variables in the prediction process, thereby avoiding the "black-box" perception that is associated with neural networks. The second argument is that the algorithm is suitable for both classification and regression.

As a benchmark for the results obtained by the XGBoost regression model, a multilayer perceptron (MLP) model, often referred to as a classical type of neural network, has been tested on the same random sample analysed using the XGBoost model on the original set of variables (XSRR). Typically, with neural networks, we seek to minimize the error the prediction model makes, also referred to as the loss. When training a neural network, it is important to avoid over-fitting. This can be monitored by comparing the loss function for the training and validation sets. If the two values converge with training, the result from the model is assumed to fit well. An example of a good fit can be observed for the first 150 iterations in Fig. 17a, which shows the results from 400 iterations with the eight original variables. When the loss for the training and validation data diverges as the number of iterations increases, the model is over-fitted. This seems to be the case if the number of iterations is increased beyond 150. The result in this case is a model which is really good at representing the training data, but not necessarily good at predicting results for unseen data. The neural network benchmark is therefore established using results after 150 iterations, and the results can be found in Table 3. The prediction accuracy and realistic performance gap are somewhat worse than for the regression using XGBoost. The Neural Network models have been manually tuned, and the resulting model architectures and hyper-parameters are listed in Appendix A.
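As a sketch of how such loss monitoring can be set up, the following Keras snippet trains a small MLP on hypothetical input arrays. The layer sizes, dropout, regularisation and optimizer loosely follow the NNRR column of Table 4; the data and the early-stopping patience are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

# Hypothetical stand-ins for the real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8)).astype("float32")
gap = (100.0 * X[:, 0] + rng.normal(scale=50.0, size=500)).astype("float32")

model = keras.Sequential([
    keras.Input(shape=(X.shape[1],)),
    layers.Dense(50, activation="linear",
                 kernel_regularizer=regularizers.l1(0.005)),
    layers.Dropout(0.2),
    layers.Dense(20, activation="linear",
                 kernel_regularizer=regularizers.l1(0.005)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Rather than running a fixed 400 epochs, stop once the validation loss
# starts to diverge from the training loss (around epoch 150 in Fig. 17a).
early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=20, restore_best_weights=True)
model.fit(X, gap, validation_split=0.2, epochs=400,
          callbacks=[early_stop], verbose=0)
```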
In a traditional neural network we assume that all inputs (and outputs) are independent of each other. To exploit any time dependence in the data, we have applied a recurrent neural network (RNN) on the sequential split of training and test data. Since the time dimension is a key aspect of RNNs, it does not make sense to apply it on a random selection of data. The most relevant test of the RNN in this case study is to train and validate the model on the historic data in sequential order, and test it on samples that follow sequentially after the period of training. To be able to compare with a similar period investigated with the classification model, the RNN has been trained on 2016 and 2017 and tested on 2018 data. The network has been trained over 400 epochs.

The results can be seen in Table 3. Similarly to the classification results for the same period, the accuracy is poor, with only a 57% hit-rate and a realistic performance gap of 22%. Using this model to select bidding strategy would actually give higher losses than purely choosing a static strategy with stochastic bidding. The loss for training and validation as shown in Fig. 17b indicates that we have a poorly fitted model.
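A minimal sketch of such a sequential-split LSTM setup is given below, assuming hypothetical daily feature and strategy-gap arrays. The lookback of 30 days and the dropout of 0.3 are taken from the RNRS column of Table 4, while the layer width and split point are illustrative:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical stand-ins for roughly three years of daily features
# (2016-2018) and the corresponding strategy gap.
rng = np.random.default_rng(1)
features = rng.normal(size=(1096, 8)).astype("float32")
gap = rng.normal(scale=100.0, size=1096).astype("float32")

lookback = 30  # from the RNRS column of Table 4

def make_windows(x, y, lookback):
    """Stack the previous `lookback` days as the input for each sample."""
    xw = np.stack([x[i - lookback:i] for i in range(lookback, len(x))])
    return xw, y[lookback:]

Xw, yw = make_windows(features, gap, lookback)

# Sequential split: train on the first two years and test on the days
# that follow; a random split would break the time structure.
split = 730 - lookback
X_train, y_train = Xw[:split], yw[:split]
X_test, y_test = Xw[split:], yw[split:]

model = keras.Sequential([
    keras.Input(shape=(lookback, features.shape[1])),
    layers.LSTM(50, dropout=0.3),  # layer width is illustrative
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X_train, y_train, validation_data=(X_test, y_test),
          epochs=400, verbose=0)
```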
4.3. Alternative and combined approaches

Both XGBoost and neural networks have strengths and weaknesses that can influence the results of the analysis. A nice attribute of the XGBoost algorithm is the ability to track the importance of different features, and thereby remove features that have little or no impact on the results. A useful attribute of the neural network is the possibility to predict more than one value, and to combine this with a custom-made loss function. A hybrid model, where XGBoost is used to select the most important features and a Neural Network is applied with the recommended set of features and a custom-made loss function (CL), could be a possibility to combine the strengths of the two models (a sketch of this hybrid setup is given at the end of this subsection). This approach has been investigated, and indicates that results might improve marginally for the neural network model, but the results are still poorer than for only using the XGBoost model. The custom loss function used is designed to focus on the days where the highest performance gaps can be observed.

$$\mathrm{CL} = \frac{1}{n}\sum_{t=1}^{n}\left|\min\left(\Pi_{det}, \Pi_{stoch}\right) - \min\left(\hat{\Pi}_{det}, \hat{\Pi}_{stoch}\right)\right| \qquad (6)$$

In comparison with Eq. 5, Eq. 6 performs regression on the performance gap ($\Pi_s$) for each bidding strategy directly, and not on the strategy gap ($\Delta_s$). This makes it possible to "punish" the days where one strategy performs poorly compared to another. The delta can be raised to a power of $n$ to emphasize the difference even more.

Another concept that could be an alternative to the investigated approaches is Bayesian model averaging, where multiple models can contribute to the decision support, with weights based on how sure each model is.
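The sketch below illustrates the hybrid idea, with hypothetical stand-ins for the features and for the realised performance of each strategy; the custom Keras loss implements Eq. 6 (with the convention that the lower performance-gap value identifies the better strategy), and the network architecture is an assumption:

```python
import numpy as np
import tensorflow as tf
import xgboost as xgb
from tensorflow import keras
from tensorflow.keras import layers

# Hypothetical stand-ins: features plus the realised performance of each
# strategy (Pi_det, Pi_stoch) for every day.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20)).astype("float32")
perf = rng.normal(scale=100.0, size=(500, 2)).astype("float32")

# Step 1: use XGBoost feature importance to keep the strongest features.
ranker = xgb.XGBRegressor(n_estimators=100, max_depth=3)
ranker.fit(X, perf[:, 0] - perf[:, 1])  # fitted on the strategy gap
top = np.argsort(ranker.feature_importances_)[-8:]
X_sel = X[:, top]

# Step 2: a two-output network trained with the custom loss of Eq. 6,
# which penalises days where the chosen strategy would perform poorly.
def custom_loss(y_true, y_pred):
    best_true = tf.reduce_min(y_true, axis=1)  # best realised performance
    best_pred = tf.reduce_min(y_pred, axis=1)  # best predicted performance
    return tf.reduce_mean(tf.abs(best_true - best_pred))

model = keras.Sequential([
    keras.Input(shape=(X_sel.shape[1],)),
    layers.Dense(50, activation="relu"),  # architecture is illustrative
    layers.Dense(2),                      # predicted (Pi_det, Pi_stoch)
])
model.compile(optimizer="adam", loss=custom_loss)
model.fit(X_sel, perf, epochs=100, verbose=0)
```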
Figure 17: Plot of model loss for Neural Networks. (a) Well fitted Neural Network with eight variables; (b) poorly fitted Recurrent Neural Network. Both panels plot training and validation loss against epochs 0-400.

5. Conclusion

In this article, various techniques for classification and regression have been tested on historical data representing bidding performance for a reservoir-based river-system in the Nord Pool market. The primary objective has been to investigate the possibility of predicting, prior to day-ahead bidding, whether stochastic or deterministic bidding would be the preferred strategy under the prevailing market and hydrological conditions.

A simple plot of inflow deviation together with performance gap for the year 2018, as seen in Fig. 3, gave an indication that periods with high inflow were to a certain degree correlated with periods with negative strategy-gaps favouring stochastic bidding. This was partly the motivation for conducting the analysis, and also one of the hypotheses in the initial selection process for variables to be used in the prediction model. Inflow deviation together with water value have also proved to be the variables that play the largest role in the prediction model. This has been demonstrated using two different techniques for investigating feature importance that are associated with the gradient boosting in XGBoost.

If historical data are assumed to represent future instances with sufficient accuracy, applying a prediction model trained on available data will be able to predict the correct strategy with an accuracy of 62-63%. If the benchmark is a strategy where only stochastic bidding is performed, the prediction model will outperform this strategy, and the number of days that must be analysed with the computationally demanding stochastic model is reduced by almost 40%. A considerable increase in variables beyond what was assumed by domain experts to be the most important ones does not seem to improve prediction accuracy considerably. A prediction model with only 62% accuracy and a standard deviation of 3% indicates that there could be a considerable amount of noise in the data. Given that only one combination of hyper-parameters and variables can be selected in an operational environment, the risk of over-fitting a model increases when the model is fit to a large set of variables. This might favour choosing a simple model with a limited set of variables. Neural networks have been tested to benchmark the results from XGBoost, but the results indicate that the initial assumption that gradient boosting models are suited for this type of analysis was well founded. Using regression rather than classification has proved able to reduce the performance losses, but at the expense of a somewhat lower accuracy score.

One challenge associated with the use of the prediction model arises when the model is used to analyse data that have a distribution profile on key variables that varies considerably from what is represented in the training data. This is clearly the case when the model is used to analyse a year without any samples represented in the training set. An increased amount of historic data will most likely increase the probability that training data are representative, but the volatile nature of power markets and weather does not guarantee that this is going to be the case.

If a power producer decides to implement a strategy selection model, including new observations and re-calibrating the model on a daily basis could be the most robust solution. One difference between neural networks and boosting methods is that neural networks can be updated on the fly, whereas boosting methods must be fully retrained when the training data change. Tracking the performance of a strategy selection model will be important, and power producers should consider running the model in parallel with existing systems for a period of at least one year to account for seasonal effects. If the performance of the model turns out to be poor, it is an indication that the historic information available might be insufficient to give results with sufficient quality.

A. Appendix 1

Table 4
Model architecture and hyper-parameters

Param                XSCR             XSCS             XCCR             XSRR              NNRR              RNRS
Algorithm            Xgb              Xgb              Xgb              Xgb               Keras Sequential  Keras Sequential
                                                                                         Dense             LSTM
objective            binary:logistic  binary:logistic  binary:logistic  reg:squarederror  mse               mse
learning_rate        0.075            0.075            0.032            0.092
max_depth            4                4                4                3
n_estimators         178              178              323              259
number_boost_round   9282             9282             2761             2509
gamma                4.26             4.26             9.68             4.34
subsample            0.77             0.77             0.85             0.64
dropout_frac                                                                              0.2               0.3
number_neurons1                                                                           50
number_neurons2                                                                           20
lookback                                                                                                    30
L2                                                                                        0.005             0.005
kernel_regularizers                                                                       l1(L2)            l1(L2)
activation                                                                                linear            linear
optimizer                                                                                 adam              adam

References

[1] XGBoost documentation. URL: https://xgboost.readthedocs.io/en/latest.
[2] Aas, K., Jullum, M., Løland, A., 2019. Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. ArXiv abs/1903.10464.
[3] Aasgård, E., Naversen, C., Fodstad, M., Skjelbred, H., 2018. Optimizing day-ahead bid curves in hydropower production. Energy Systems 9, 257–275.
[4] Aasgård, E.K., Skjelbred, H.I., Solbakk, F., 2016. Comparing bidding methods for hydropower. Energy Procedia 87, 181–188. URL: http://www.sciencedirect.com/science/article/pii/S1876610215030386, doi:10.1016/j.egypro.2015.12.349. 5th International Workshop on Hydro Scheduling in Competitive Electricity Markets.
[5] Abu-Rmileh, A. The multiple faces of 'feature importance' in XGBoost. URL: https://towardsdatascience.com/be-careful-when-interpreting-your-features-importance-in-xgboost-6e16132588e7.
[6] Anbazhagan, S., Kumarappan, N., 2012. A neural network approach to day-ahead deregulated electricity market prices classification. Electric Power Systems Research 86, 140–150. URL: http://www.sciencedirect.com/science/article/pii/S0378779611003208, doi:10.1016/j.epsr.2011.12.011.
[7] Baltaoglu, S., Tong, L., Zhao, Q., 2019. Algorithmic bidding for virtual trading in electricity markets. IEEE Transactions on Power Systems 34, 535–543. doi:10.1109/TPWRS.2018.2862246.
[8] Brovold, S.H., Skar, C., Fosso, O.B., 2014. Implementing hydropower scheduling in a European expansion planning model. Energy Procedia 58, 117–122. URL: http://www.sciencedirect.com/science/article/pii/S1876610214017846, doi:10.1016/j.egypro.2014.10.417. Renewable Energy Research Conference, RERC 2014.
[9] Chen, T., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. ArXiv abs/1603.02754.
[10] Fidje, J.T., Haraldseid, C.K., Granmo, O.C., Goodwin, M., Matheussen, B.V., 2017. A learning automata local contribution sampling applied to hydropower production optimisation, in: Bramer, M., Petridis, M. (Eds.), Artificial Intelligence XXXIV, Springer International Publishing, Cham, pp. 163–168.
[11] Fosso, O., Belsnes, M., 2004. Short-term hydro scheduling in a liberalized power system, in: 2004 International Conference on Power System Technology (PowerCon 2004), IEEE, pp. 1321–1326 Vol. 2.
[12] Guyon, I., Elisseeff, A., 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3, 1157–1182.
[13] Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2 ed., Springer. URL: http://www-stat.stanford.edu/~tibs/ElemStatLearn/.
[14] Iria, J., Soares, F., 2019. A cluster-based optimization approach to support the participation of an aggregator of a larger number of prosumers in the day-ahead energy market. Electric Power Systems Research 168, 324–335. URL: http://www.sciencedirect.com/science/article/pii/S0378779618303973, doi:10.1016/j.epsr.2018.11.022.
[15] Irpan, A., 2018. Deep reinforcement learning doesn't work yet. URL: https://www.alexirpan.com/2018/02/14/rl-hard.html.
[16] James, G., Witten, D., Hastie, T., Tibshirani, R., 2014. An Introduction to Statistical Learning: with Applications in R. Springer Texts in Statistics, Springer New York. URL: https://books.google.no/books?id=at1bmAEACAAJ.
[17] Jiang, L., Zhang, B., Ni, Q., Sun, X., Dong, P., 2019. Prediction of SNP sequences via Gini impurity based gradient boosting method. IEEE Access PP, 1–1. doi:10.1109/ACCESS.2019.2893269.
[18] Jordan, M.I., Mitchell, T.M., 2015. Machine learning: Trends, perspectives, and prospects. Science 349, 255–260. URL: https://science.sciencemag.org/content/349/6245/255, doi:10.1126/science.aaa8415.
[19] Krukrubo, L.A. The confusion matrix for classification. URL: https://medium.com/towards-artificial-intelligence/the-confusion-matrix-for-classification-eb3bcf3064c7.
[20] Li, S., Wang, P., Goel, L., 2015. Short-term load forecasting by wavelet transform and evolutionary extreme learning machine. Electric Power Systems Research 122, 96–103. URL: http://www.sciencedirect.com/science/article/pii/S0378779615000036, doi:10.1016/j.epsr.2015.01.002.
[21] Lu, M. SHAP for explainable machine learning. URL: https://meichenlu.com/2018-11-10-SHAP-explainable-machine-learning.
[22] Lundberg, S., Erion, G., Chen, H., DeGrave, A., Prutkin, J., Nair, B., Katz, R., Himmelfarb, J., Bansal, N., Lee, S.I., 2019. Explainable AI for trees: From local explanations to global understanding. ArXiv abs/1905.04610.
[23] Matheussen, B.V., Granmo, O.C., Sharma, J., 2019. Hydropower optimization using deep learning, in: Wotawa, F., Friedrich, G., Pill, I., Koitz-Hristov, R., Ali, M. (Eds.), Advances and Trends in Artificial Intelligence. From Theory to Practice, Springer International Publishing, Cham, pp. 110–122.
[24] Mayrink, V., Hippert, H.S., 2016. A hybrid method using exponential smoothing and gradient boosting for electrical short-term load forecasting, in: 2016 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6. doi:10.1109/LA-CCI.2016.7885697.
[25] NN1. Machine learning on the intraday power market. URL: https://innovation.engie.com/en/innovation-trophies/machine-learning-on-the-intraday-power-market/4602.
[26] NN2. Wattsight launches price forecast for continuous intraday trading. URL: https://www.wattsight.com/blog/intraday-predict-launch/.
[27] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E., 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.
[28] Pinto, T., Sousa, T.M., Morais, H., Praça, I., Vale, Z., 2016. Metalearning to support competitive electricity market players' strategic bidding. Electric Power Systems Research 135, 27–34. URL: http://www.sciencedirect.com/science/article/pii/S0378779616300591, doi:10.1016/j.epsr.2016.03.012.
[29] Riddervold, H.O., Aasgård, E.K., Skjelbred, H.I., Naversen, C., Korpås, M., 2019. Rolling horizon simulator for evaluation of bidding strategies for reservoir hydro, in: 2019 16th International Conference on the European Energy Market (EEM), pp. 1–7. doi:10.1109/EEM.2019.8916227.
[30] Silipo, R., Widmann, M. 3 new techniques for data-dimensionality reduction in machine learning. URL: https://thenewstack.io/3-new-techniques-for-data-dimensionality-reduction-in-machine-learning.
[31] Breiman, L. Out-of-bag estimation. Technical report, Statistics Department, University of California, Berkeley.
[32] Wolfgang, O., Haugstad, A., Mo, B., Gjelsvik, A., Wangensteen, I., Doorman, G., 2009. Hydro reservoir handling in Norway before and after deregulation. Energy 34, 1642–1651. URL: http://www.sciencedirect.com/science/article/pii/S0360544209003119, doi:10.1016/j.energy.2009.07.025. 11th Conference on Process Integration, Modelling and Optimisation for Energy Saving and Pollution Reduction.
[33] Yamin, H., 2004. Review on methods of generation scheduling in electric power systems. Electric Power Systems Research 69, 227–248. URL: http://www.sciencedirect.com/science/article/pii/S0378779603002402, doi:10.1016/j.epsr.2003.10.002.
[34] Zarghami, M., 2018. Short term management of hydro-power system using reinforcement learning.
