ESTIMATING AND BACKTESTING RISK UNDER HEAVY TAILS

MARCIN PITERA AND THORSTEN SCHMIDT

Abstract. While the estimation of risk is an important question in the daily business of banks and insurances, many existing plug-in estimation procedures suffer from an unnecessary bias. This often leads to the underestimation of risk and negatively impacts backtesting results, especially in small sample cases. In this article we show that the link between estimation bias and backtesting can be traced back to the dual relationship between risk measures and the corresponding performance measures, and discuss this in reference to value-at-risk and frameworks. Motivated by this finding, we propose a new algorithm for bias correction and show how to apply it for generalized Pareto distributions. In particular, we consider value-at-risk and expected short- fall plug-in estimators, and show that the application of our algorithm leads to gain in efficiency when heavy tails exist in the data.

Keywords: value-at-risk, expected shortfall, estimation of risk measures, bias, risk estimation, backtesting, unbiased estimation of risk measures, generalized Pareto distribution.

1. Introduction Risk measures are a central tool in capital reserve evaluation and quantitative risk management, see McNeil et al.(2015) and references therein. Since efficient statistical estimation procedures for risk are highly important in practice, this aspect of risk measurement recently raised a lot of attention, see Cont et al.(2010), Kr¨atschmer et al.(2014), Acerbi and Sz´ekely(2014), Davis(2016), Frank(2016), Ziegel(2016), Lauer and Z¨ahle(2016, 2017), Yuen et al.(2020), or Bartl and Tangpi (2020) amongst many others. When the estimation of risk is considered, majority of literature focuses on the plug-in approach combined with standard estimation techniques like maximum-likelihood estimation, see Embrechts and Hofert(2014), Kr¨atschmer and Z¨ahle(2017) and references therein. In this context, topics like predictive statistical inference analysis and bias quantification are typically studied in reference to the fit of the whole estimated distribution. Still, for risk measurement purposes, it is also important to measure adequacy and conservativeness of the capital projections directly, e.g. via backtesting, see Nolde and Ziegel(2017) where the importance of such studies is shown in reference to elicitability and backtesting. The aim of this article is to consider this topic when heavy tails arXiv:2010.09937v2 [q-fin.RM] 20 Mar 2021 are present. In particular, capital underestimation risk that is embedded into the risk estimation process or theoretical links between risk estimation procedures and backtesting should be further studied; see e.g. Bignozzi and Tsanakas(2016). To justify this point, let us mention that it was only recently pointed out in Moldenhauer and Pitera(2019) that the standard value-at-risk (VaR) backtesting breach count statistic is in fact a performance measure dual to the VaR family of risk measures as introduced in Cherny and Madan(2009). This shows that VaR breach count statistic is bound to the VaR family in the same manner that e.g. Sharpe Ratio is bound to risk aversion specification in mean-variance portfolio optimization or acceptability indices are bound to coherent risk measures,

Date: First circulated: October 20, 2020, This version: March 23, 2021.

1 2 MARCIN PITERA AND THORSTEN SCHMIDT see Bielecki et al.(2016). This indicates that there should be a direct statistical connection between the way how risk measure estimators are designed and how they perform in backtesting. In this article, we study this connection by discussing estimation bias in the context of estimating risk and its connection to backtesting performance. We focus on an economic notion of unbiasedness for risk measures which turns out to be important whenever the risk measure needs to be estimated. Unbiasedness in the context of risk measures is the generalization of the well-known statistical unbiasedness to the risk landscape, see Pitera and Schmidt(2018). Also, we refer to Bignozzi and Tsanakas(2016) where a similar approach has been applied to residual risk quantification. Here, we link risk unbiasedness concept to backtesting and compare the performance of existing estimators to the proposed unbiased counterparts in Gaussian and generalized Pareto distribu- tion (GPD) frameworks. The results presented for value-at-risk and expected shortfall show that plug-in estimators show a systematic underestimation of risk capital which results in deteriorated backtesting performance. While this is true even in an i.i.d. and light-tailed setting, it is more pronounced for heavy tails, small sample sizes, and high confidence levels. Also, we propose a new bias reduction technique that increases the efficiency of value-at-risk and expected shortfall plug-in estimators. This article is organized as follows. In Section2 we introduce the theoretical background, while in Section3 we comment on the relation between risk bias and backtesting. In Section4 and Section 5 we illustrate this relation for value-at-risk and expected shortfall. In Section6, we introduce a bootstrapping algorithm for bias reduction. Section7 illustrates the performance on simulated examples under GPD distribution and shows that risk bias correction leads to better estimator performance.

2. The estimation of risk and the associated bias Consider the risk of a financial position measured by a monetary risk measure ρ. The position’s future P&L is denoted by X. The estimation relies on a sample X1,...,Xn corresponding to historic realizations of the P&L. An estimator of the risk ρ(X) is a function of the historical data, which we denote byρ ˆ =ρ ˆ(X1,...,Xn). The position is secured by adding the estimated riskρ ˆ to the position and we denote the secured position by Y := X +ρ. ˆ Note that Y is a random variable where both X andρ ˆ are random. For simplicity, in this paper we focus on the i.i.d. case, i.e. we assume that X1,...,Xn,X are independent and identically distributed. This is a frequent choice, e.g. when the historical simula- tion approach is used, see Jorion(2001). Nevertheless, it should be noted that most of the analysis presented in this paper is also true in the generalised non i.i.d. framework. Assuming the monetary risk measure ρ is law-invariant, the risk of Y depends on the underlying distribution. Let Θ denote the parameter space and let each θ ∈ Θ identify a certain distribution choice; Θ could be infinite-dimensional when linked to a non-parametric framework. Moreover, let

R := {ρθ : θ ∈ Θ} denote the corresponding family of risk measures in which ρθ is used to quantify the risk of Y (and X) under θ ∈ Θ; see Remark 2.1 where the link between ρ and R is discussed in detail. As in Pitera and Schmidt(2018), we call an estimator ˆρ unbiased with respect to the family of risk measures R, if ρ(X +ρ ˆ) = 0 for all ρ ∈ R. (2.1) ESTIMATING AND BACKTESTING RISK 3

Relaxing this notion, we callρ ˆ sufficient if ρ(X +ρ ˆ) ≤ 0 for all ρ ∈ R. If we want to emphasize the difference to statistical notions of unbiasedness and sufficiency, we will use the notions risk unbiasedness and risk sufficiency. We also refer to Bignozzi and Tsanakas(2016), where a concept of residual risk measuring the size of ρ(X +ρ ˆ) is introduced, and to Francioni and Herzog(2012) where so called probability unbiasedness is introduced – those concepts are consistent with risk unbiasedness in the VaR case. The definition of unbiasedness in (2.1) has a direct economic interpretation: the secured position is acceptable under all risk measures ρ ∈ R. This quantifies the estimation error arising fromρ ˆ and contains the information on the actual risk the holder of the position faces when using the estimator ρˆ to secure the financial position X. This is in line with the predictive inference paradigm as we quantify the actual risk of secured position when using predicted risk for securitization. The concept of risk unbiasedness is the proper generalization of the statistical bias to non- linear measures; see Remark 2.1. From this perspective, apart from unbiasedness, it is natural to require additional properties. For example, consistency or other economic-driven properties such as minimization of the average mean capital reserve could be asked for. We refer to Section 4 and Example 7.3 in Pitera and Schmidt(2018) for additional details. Remark 2.1 (Relation to statistical bias). When the risk measure ρ corresponds to negative of the expectation, Equation (2.1) coincides with the statistical bias: indeed in this case, risk unbiasedness is equivalent to Eθ[ˆρ] = −Eθ[X], for all θ ∈ Θ, where Eθ is the expectation under the parameter θ ∈ Θ. Since we are in the i.i.d. case, the arithmetic mean turns out to be an unbiased estimator in the risk sense when risk is measured by expectation. Moreover, note that while the estimatorρ ˆ := −X1 is also unbiased it is not a good choice e.g. due to lack of consistency and high variance. This illustrates why additional properties need to be imposed apart from (risk) unbiasedness. Remark 2.2 (On the use of cash-additivity). On the first sight, it might be tempting to use cash additivity property in (2.1) to split X andρ ˆ in order to separate estimation procedure from secured position risk measurement. Nevertheless, recalling that both X andρ ˆ =ρ ˆ(X1,...,Xn) are random variables, this is not possible. Also, note that most risk measures are non-linear operators so that we cannot hope for a relationship similar to the one presented in Remark 2.1. Remark 2.3 (Risk unbiasedness under normality). In the case of normally distributed random 2 variables one would consider θ = (µ, σ ) ∈ Θ = R × R>0. Then, unbiasedness means that the risk of the secured position Y is zero for all possible choices of µ ∈ R and σ > 0. In Section 4.1 we state the closed-form expression for the VaR unbiased estimator in the Gaussian setting. 2.1. Risk unbiasedness for law-invariant risk measures. In the law-invariant setting, the risk measure only depends on the distribution of the underlying losses. Then it can be represented as a 1 function R : F → R ∪ {+∞}, where F denotes the set of all probability distributions. VaR and ES are prominent examples. Then, the family R = {ρθ : θ ∈ Θ} is simply given by ρθ(Z) = R(Fθ(Z)), where Fθ(Z) is the distribution of some arbitrary position Z under θ ∈ Θ. Consequently,ρ ˆ is unbiased if R(Fθ(Y )) = 0, θ ∈ Θ, (2.2)

Note that in the i.i.d. case Fθ(Y ) can be obtained from the convolution of the distribution of X with the distribution of the estimatorρ ˆ.

1 Here, we adopt the usual risk measure convention ∞ − ∞ := ∞, so that the domain of R could be F. 4 MARCIN PITERA AND THORSTEN SCHMIDT

3. Relation between Backtesting and Bias The notions of unbiasedness and sufficiency have an intrinsic connection to backtesting. In particular, the backtesting framework representation from Moldenhauer and Pitera(2019) can be placed into the context of risk unbiasedness which we illustrate in the following. To begin with, we detail the considered backtesting framework. The observed sample of n + m observations is denoted by x−(n−1), . . . , x0, x1, . . . , xm. We aim at m backtestings, each based on a sample of length n. This is done as follows: first, we compute m risk estimatorsρ ˆ1,..., ρˆm where each risk estimatorρ ˆt =ρ ˆ(xt−n, . . . , xt−1) is based on a historical sample of size n, starting at t − n and ending at t. The corresponding realization of the associated P&L is given by xt and the associated secured position sample is given by

yt := xt +ρ ˆt. ρˆ For brevity, we denote by y = y := (yt)t=1,...,m the sample for the m backtestings. Regarding the mentioned duality, we start from the observation that value-at-risk and expected shortfall are risk measures indexed by a parameter, the risk level α ∈ (0, 1]. We therefore con- sider a family of risk measures (Rα) which is indexed by a parameter α ∈ (0, 1]. We assume the family is monotone and law-invariant: that is, every member of the family is given by a function Rα : F → R ∪ {+∞}, see Section 2.1, and for any distribution F ∈ F and α1, α2 ∈ (0, 1] such that α1 < α2, we have

Rα1 (F ) ≥ Rα2 (F ). This could correspond to VaR or ES with α ∈ (0, 1] describing to the underlying risk level, also referred to as confidence level or confidence threshold; see BCBS(1996). Our goal is to estimate and validate the risk of X at the pre-defined reference risk level α0 ∈ (0, 1], e.g. 0.01 or 0.025. A natural choice for the objective function of the validation is the performance measure that is dual to (Rα) as defined in Cherny and Madan(2009). Here, we concentrate on the empirical coun- terpart of the dual performance measure which is obtained by replacing the unknown distribution by its empirical counterpart. More precisely, we introduce the dual empirical performance measure of the risk estimatorρ ˆ, in reference to the family (Rα), by ρˆ emp ρˆ Bm(y ) := inf{α ∈ (0, 1] : Rα (y ) ≤ 0}, (3.1) emp where Rα (y) := Rα(Fm(y)) is the empirical plug-in risk estimator based on the empirical distri- bution Fm(y) of y. Intuitively speaking, in (3.1) we look for the smallest value of α ∈ (0, 1] that ρˆ makes the secured sample y acceptable in terms of Rα. Note that in (3.1) we implicitly assumed that yρˆ could be used as a sample, at least when the risk of the secured position is being estimated. While this is the case under the correct model specification (i.e. when true risks are added to the position X), it might not be the case in general. Consequently, to account for potential sample bias, model misspecification, non i.i.d. sample prop- erties, etc., some alternative performance measures could be considered as well. In fact, the typical approach is to use the so-called traffic-light approach with certain fixed robust thresholds imposed on the value of Bm(y). We refer to Section4 where this is discussed in reference to regulatory backtesting. Having this in mind, we analyze the connection between the backtesting function introduced emp ρˆ in (3.1) and risk bias. First, the inner part of Equation (3.1), Rα (y ) ≤ 0, is the empirical counterpart of (risk) sufficiency. Consequently, if the risk estimatorρ ˆ is sufficient, one expects ρˆ Bm(y ) not to exceed the reference level α0 which should in turn indicate good performance of the secured position. Second, the minimum in (3.1) is achieved for α for which the empirical equivalent ρˆ of risk unbiasedness is satisfied. Thus, for an unbiased estimator, the value of Bm(y ) should be ESTIMATING AND BACKTESTING RISK 5 close to the reference threshold value α0. Consequently, one would expect that unbiased estimators perform well in backtesting. This two observations show that there is a direct relationship between backtesting performance emp ρˆ and unbiasedness. Theoretically, to compute the bias one could simply compute the value Rα0 (y ), where α0 is the reference risk level. Nevertheless, this quantity is difficult to to interpret as the net size of bias could vary. For instance, the net bias is not scale invariant and it is proportional to the (current) underlying position size or the current market volatility. To overcome this limitation, in (3.1) we switch to identifying the risk level that makes a position acceptable (unbiased). It should be emphasized that this approach is in fact common market practice that is embedded in the regulatory VaR backtesting, and we refer to Section4 for details. Summarizing, the above shows the importance of the concept of performance measures (also called acceptability indices) introduced in Cherny and Madan(2009) and shows that the approach introduced therein is in fact quite generic and already adopted by practitioners; see also Bielecki et al.(2016).

4. The Backtesting of Value-at-risk The classical regulatory backtest for VaR counts the empirical number of overshoots and com- pares them to the expected number of overshoots. More precisely, given the positions yρˆ secured by the estimatorρ ˆ a risk overshoot or a breach occurs at time t when the position is not sufficiently ρˆ secured, i.e. when yt < 0. In particular when estimating VaR it is natural to introduce the average ρˆ breach count Tm(y ), given by m 1 X T (y) := 1 , y ∈ m. (4.1) m m {y<0} R t=1 ρˆ This measure, or the equivalent non-standardized breach count mTm(y ), is the standard statistic used in regulatory backtesting, see BCBS(2009). Now, it is relatively easy to show that emp Tm(y) = inf{α ∈ (0, 1] : V@Rα (y) ≤ 0}, emp where V@Rα (y) := −y(bnαc+1) corresponds to the standard empirical VaR at level α ∈ (0, 1], the value y(k) is the k-th order statistic of the data, and the value bzc denotes the integer part of z ∈ R. In other words, (4.1) is a direct application of (3.1) in the VaR context which indicates direct relation between risk bias and backtesting in the VaR framework. See also Equatin (6) in Bignozzi and Tsanakas(2016) where the link between risk bias and backtesting in the VaR context is outlined. For regulatory backtesting we adapt the traffic-light approach for model assessment. Namely, we use yearly time series (n = 250 historical observations): the estimatorρ ˆ is in the green zone, if ρˆ ρˆ ρˆ Tm(y ) < 0.02, in the yellow zone, if 0.02 ≤ Tm(y ) < 0.04, and in the red zone, if Tm(y ) ≥ 0.04. This corresponds to less than five, between five and nine, and ten or more breaches, respectively, when net breach count is considered; we refer to BCBS(1996) for more details. To show that bias is indeed directly reflected in backtesting we consider two examples in Sec- tion 4.1 and Section 4.2. The presented results illustrate that the usage of biased risk estimators leads to a systematic underestimation of risk. This important defect becomes more pronounced with heavier tails of the underlying distribution, increased confidence level (i.e. decreasing α), or reduced sample size. Also, it turns out that a present bias negatively effects the predictive accuracy as defined in Gneiting(2011). For example we showed in Section 8.3 in Pitera and Schmidt(2018) that the consistent VaR score of the Gaussian unbiased estimator is better than the score of the standard Gaussian plug-in estimator. 6 MARCIN PITERA AND THORSTEN SCHMIDT

4.1. Backtesting under normality. In this section we show that in the Gaussian setting, many popular VaR risk estimators are biased, even in an asymptotic sense. Assume the the observed sample x−(n−1), . . . , xm is a realization from an i.d.d. normal distribution with unknown parameter. For simplicity, we fix α0 = 1%, n = 250, i.e. backtest on a yearly basis. If we would know mean µ and variance σ2 of the underlying normal distribution, then the true risk given by ˆ true −1  V@Rt := − µ + σ Φ (0.01) , t = 1, . . . , m, could be computed. Of course this is only possible in simulated scenarios, which we use later for testing the performance of competitive approaches. Note that due to identical distribution, true risk does not depend on t. −1 Pn If we denote the parameter estimates in back-testing period t byµ ˆt = n i=1 yt−i, and 2 −1 Pn 2 σˆ = (n − 1) i=1(yt−i − µˆt) , we obtain the common plug-in-estimator of the value-at-risk,

ˆ plug-in −1  V@Rt := − µˆt +σ ˆtΦ (0.01) t = 1, . . . , m. (4.2)

In the Gaussian framework the unbiased estimator of value-at-risk can be computed and we refer to Pitera and Schmidt(2018) for details. More precisely, the unbiased estimator is given by ! rn + 1 V@Rˆ u := − µˆ +σ ˆ t−1 (0.01) , (4.3) t t t n n−1 where tn−1 refers to t-student distribution function with n − 1 degrees of freedom. It is not complicated to verify that this estimator is indeed unbiased using independence and the pivotal property of parameter estimatesµ ˆt andσ ˆt. Third, it is natural to consider an empirical quantile as the estimator for value-at-risk. We call ˆ emp 2 such estimator the empirical estimator and denote it by V@Rt . We are now ready to perform backtesting and numerically illustrate the impact of bias on the estimator’s performance. Given a large Gaussian sample, we construct the backtests for the afore- mentioned estimators according to Equation (4.1): for z ∈ {plug-in, u, emp, true}, the average number of exceptions up to time m is given by

m z 1 X T := 1 ˆ z . (4.4) m {xt+V@Rt <0} t=1 For a numerical illustration we keep the window length of historical data, n, fixed and consider m → ∞. For the case of standard normal random variables (i.e. µ = 0, σ = 1), we plot the resulting backtests in Figure1. z z Ideally, the value T = T (m) should converge to theoretical reference value α0 = 1% when m → ∞. However, Figure1 shows that this is neither the case for the plug-in estimator nor the empirical estimator. It holds, however, for the unbiased estimator, and – of course – for the true risk. In fact, under normality, the asymptotic exception rate for the plug-in estimator could be directly computed and is strictly related to estimator bias. Namely, for n = 250, using Central

2To estimate empirical VaR we used the standard quantile function built into R software with default (type 9) setting. ESTIMATING AND BACKTESTING RISK 7

0.014

0.012 T

0.010

Empirical.estimator Normal.estimator 0.008 Normal.unbiased.estimator True.risk

0 25000 50000 75000 100000 length of backtest

Figure 1. Backtesting results under normality for increasing observation length m up to 100 000 days and a rolling window of length n = 250. We present the average number of exceptions T z, ˆ emp ˆ plug-in ˆ u at level α0 = 1%, for V@Rt (empirical), V@Rt (normal), V@Rt (unbiased) and under the ˆ true assumption of full knowledge of the underlying distribution, V@Rt (true). Only the true (yet unknown) risk and the unbiased estimator exception rates are close to the theoretical exception rate.

Limit Theorem arguments and standard prediction interval calculation scheme, we obtain   lim T plug-in(m) = Z − µˆ +σ ˆΦ−1(0.01) < 0 m→∞ P  r  Z − µˆ n −1 = P < Φ (0.01) σˆp1 + 1/n n + 1 r n  = t Φ−1(0.01) ≈ 1.05%, n−1 n + 1 where Z is normal random variable,µ ˆ is its estimated mean, andσ ˆ is its estimated standard deviations, both based on sample from Z of size n. The simulations also show that the asymptotic exception rate of the empirical estimator is larger and oscillates around 1.35%. While the empirical estimator asymptotic exception rate (1.35%) might not look severe it does have a serious monitoring consequence. Under the correct model setting for VaR at level 1%, the individual breach probability should be 1% which leads to probability of reaching the yellow or red traffic-light zone (having more than 4 exceptions in the annual backtest) equal to approximately 10%.3 On the other hand, if the individual breach probability is equal to 1.35%, the probability of reaching yellow or red zone is equal to approximately 25%. In consequence, the monitoring will detect problems with the model twice as often as it should, at least if daily risks (rather than three month average risks) are used for backtesting. Only when we additionally increase the length of the backtesting window, n → ∞, the above quantity converges to 1%, as expected.

4.2. Backtesting under generalized Pareto distributions. In the context of heavy tails, a closed form expression for the unbiased estimator is no longer available and one has to rely on numerical procedures. We introduce such an approach in the following section. We begin by describing the currently existing estimators for generalized Pareto distributions and illustrate their performance.

3 We get 1 − FB(n,p)(4) ≈ 10%, where B(n, p) is the Bernoulli distribution with n = 250 trials and probability of success equal to p = 1%. 8 MARCIN PITERA AND THORSTEN SCHMIDT

We continue in the i.i.d. setting from Section3 and assume that x−(n−1), . . . , x0, x1, . . . , xm is GPD distributed (left tail) with threshold u ∈ R, shape ξ ∈ R, and scale β ≥ 0; we also set p := P (X ≤ u). In practical applications, the threshold u is often considered as known while the parameters ξ and β have to be estimated. Under the assumption of a fixed and known distribution it is well-known how to compute the value-at-risk under GPD, we refer to McNeil et al.(2015) for details. Indeed, basic calculations yield the true risk value   true β −ξ V@Rt := −u + ξ α − 1 .

Denoting by ξˆt and βˆt (with additional t referring to the estimation period) the standard estimators for ξ and β we readily obtain the plug-in estimator for value-at-risk in the GPD case,

ˆ  ˆ  ˆ plug-in βt −ξt V@Rt := −u + α − 1 . ξˆt

When the threshold u needs to be estimated, we would additionally replace u byu ˆt. Moreover, to numerically asses an existing bias in the estimation, we additionally consider the empirical estimator ˆ emp V@Rt , which is of course the same as in the previous section. For simplicity, we consider a fixed parameter set u = −1, ξ = 0.05, β = 0.7, and p = 0.2. Having p = 0.2 means that only 20% of the data lie beyond the threshold u and the observed sample consists only of the data below the threshold. The corresponding periods have to be adjusted accordingly: we consider a learning period of length n = 250 · p = 50 and reference level α0 = 0.01/p = 0.05. For z ∈ {plug-in, emp, true}, we construct the secured position and perform the backtest according to Equation (4.4). The results are shown in Figure2.

0.07

0.06 T

0.05

Empirical.estimator

0.04 GPD.plug.in.estimator True.risk

0 25000 50000 75000 100000 length of backtest

Figure 2. Backtesting results under GPD for increasing observation length m up to 100 000 days and a rolling window of length n = 250. We present the average number of exceptions T z, at level ˆ emp ˆ plug-in α0 = 1%, for V@Rt (empirical), V@Rt (plug-in), and under the assumption of full knowledge ˆ true of the underlying distribution, V@Rt (true). Only the true (and unknown) risk exception rate is close to the theoretical exception rate.

The figure considers a fixed window size, n = 50, together with increasing sample size, i.e. m → ∞. We observe that the bias vanishes for the true value-at-risk, as expected there is no bias once the true distribution is known. On the contrary, both the plug-in estimator and the empirical estimator ˆ plug-in show a clear bias. The asymptotic risk level reached by V@Rt is around 6% (instead of 5%), ˆ emp while for V@Rt it is even close to 7%. The latter means that in 7% of the observed cases the ESTIMATING AND BACKTESTING RISK 9 estimated risk capital is not sufficient to cover the occurred losses which corresponds to a significant underestimation of the present risk.

5. Expected Shortfall backtesting In this section we show that the results presented for VaR also hold true for ES. For brevity, we illustrate the backtesting performance in case of expected shortfall by mimicking the framework introduced in Section4. The duality-based performance metric based on (3.1) can also be used for ES backtesting. For the ES secured sample y, we can compute the empirical mean of the overshooting samples, m 1 X G(y) := 1 , (5.1) m {y(1)+...+y(t)<0} t=1 where y(k) denotes the kth order statistic of y. The performance statistic (5.1) simply measures the cumulative breach count and answers a simple question: how many (worst-case) scenarios do we need to consider to know that the aggregated loss does not exceed the aggregated capital reserve, see Moldenhauer and Pitera(2019) for details. As expected, (5.1) is the empirical counterpart of (3.1), i.e. we have emp G(y) = inf{α ∈ (0, 1] : ESα (y) ≤ 0}, emp where ESα denotes the empirical expected shortfall at level α ∈ (0, 1] estimator. As in Section 4.1 and Section 4.2, we show backtesting results for Gaussian and GPD distributions. In both cases, we replace average exception rate statistic with (5.1). Hence, for every estimator z in scope, we consider m z z 1 X G = G (m) := 1{yz +...+yz <0}, (5.2) m (1) (t) t=1 z where, for each m, the order statistic y(k) corresponds to the kth order statistic of the associated ˆ z z secured position yt = xt +ESt , t = 1, . . . , m, and where ESt denotes the tth day estimated expected shortfall risk using estimator z. Under i.i.d. normality setting discussed in Section 4.1, we consider the same set of estimators, now defined for reference level α0 = 2.5%; learning sample size is the same as before, i.e. we set n = 250. Namely, we consider true risk, Gaussian plug-in, Gaussian unbiased, and empirical estimators given by  −1   −1  true φ(Φ (α0)) ˆ plug-in φ(Φ (α0)) ESt := − µ + σ , ESt := − =µ ˆt +σ ˆt , α0 α0 t−1 −1 P 1 !  φ(Φ (α )) i=t−n xi {x +V@Rˆ emp≤0} ESˆ u := − µˆ + c · σˆ 0 , ESˆ emp := − i t , t t 250 t t Pt−1 α0 1 ˆ emp i=t−n {xi+V@Rt ≤0} where the constant c250 = 1.0077 is obtained using approximation scheme introduced in Example 5.4 in Pitera and Schmidt(2018). Similarly, under the i.i.d. GDP setting discussed in Section 4.2, we consider the empirical estimator as well as the true risk and the plug-in estimators given by true ˆ plug-in ˆ ˆ true V@Rt β + ξu ˆ plug-in V@Rt βt + ξtu ESt := + , ESt := + , 1 − ξ 1 − ξ 1 − ξˆt 1 − ξˆt for conditional reference risk level 0.025/p = 0.125, and conditioned sample size 250p = 50, see McNeil et al.(2015) for details. In both cases we consider the same set of parameters that was used in Section 4.1 and Section 4.2, respectively, and plot values of backtesting statistic Gz. The aggregated results for both normal 10 MARCIN PITERA AND THORSTEN SCHMIDT case and GPD case are presented in Figure3. As in the VaR case, we observe no bias only for the true risk and Gaussian unbiased estimators. In all other cases, significant bias is visible.

0.035

0.16

0.030

0.14 T G 0.025 0.12

Empirical.estimator Empirical.estimator Normal.estimator 0.020 0.10 GPD.plug.in.estimator Normal.unbiased.estimator True.risk True.risk

0 25000 50000 75000 100000 0 25000 50000 75000 100000 length of backtest length of backtest

Figure 3. Backtesting results under normality (left) and GPD (right) for increasing observation length m up to 100 000 days and unconditional rolling window of length 250 and 50, respectively. Empirical mean of the overshoots Gz at level 2.5% and 12.5%, respectively, is presented for selected estimators. Only the true (yet unknown) risk and the unbiased estimator rates are close to the theoretical rates.

6. Risk bias reduction for plug-in estimators In Section4 and Section5 we have illustrated that plug-in estimators are typically biased implying a negative impact on backtesting performance. For a law-invariant risk measure, the plug-in procedure can be formalized as follows: denote the distribution of X by the family Fθ(X): θ ∈ Θ and consider a law-invariant risk measure represented by the function R : F → (−∞, ∞]. The plug-in procedure is given by the following two steps: in the first step we estimate the parameter θˆ; in the second step we treat θˆ like being the true underlying parameter and plug it into formula for the risk measure. Since the risk-measure is law-invariant, this turns out to be R(Fθ(X)) and the plug-in estimator is

ρˆ := R(Fθˆ(X)). Motivated by the findings in the previous section we introduce a general scheme for improving the efficiency of an existing parametric plug-in estimator by reducing its bias. We start from the observation that in the Gaussian case the unbiased estimator given in (4.3) is obtained from the plug-in estimator in (4.2) by appropriately rescaling parameter estimates. Indeed, we transformed the set of estimated parameters using the mapping ! rn + 1 t−1 (α) (ˆµ, σˆ) 7−→ µ,ˆ n−1 σˆ , (6.1) n Φ−1(α) where n ∈ N is the underlying sample size, and then plugged the changed parameter into the formula for the risk measure. More generally, if we estimate the d-dimensional parameter θ, we look for a modification of the estimated parameter from θˆ to a ◦ θˆ, where ◦ is the componentwise multiplication and choose a d d such that the bias is minimized. In other words, assuming that Θ ⊆ R we aim at finding a ∈ R for which the tweaked plug-in estimator

ρˆa := R(Fa◦θˆ(X)) ESTIMATING AND BACKTESTING RISK 11

d has the smallest bias. While the global value of a ∈ R that rendersρ ˆa unbiased might not exist, we minimize the bias locally.

6.1. Local bias minimization. We start by relaxing the unbiasedness/sufficiency criterion. The idea is to assess unbiasedness not on the full parameter set Θ, but on a subset Θ0 ⊆ Θ, which is chosen dependent on the data at hand. For simplicity, we focus on distribution-based risk measures and use the notation introduced in Section 2.1. Given a distribution-based reference risk measure R and a we define the maximal risk bias of the estimatorρ ˆa on Θ0 as

B∗(ˆρa, Θ0) := sup R(Fθ(X +ρ ˆa)). (6.2) θ∈Θ0

In particular, if B∗(ˆρa, Θ) = 0, thenρ ˆa is risk sufficient since R(Fθ(X +ρ ˆa)) ≤ B∗(ˆρa, Θ) = 0, for θ ∈ Θ. Let us now assume that we are given a single parameter θ ∈ Θ and set Θ0 = {θ}. Then, under the i.i.d. framework, we can rewrite (6.2) as

B∗(ˆρa; θ) = R(Fθ(X) ∗ Fθ(ˆρa)), (6.3)

d where ∗ is the convolution operator. Next, we want to find a ∈ R for which the value of B∗(ˆρa; θ) is as close to zero as possible. Namely, we set

∗ aθ := arg min B∗(ˆρa; θ) . (6.4) a∈Rd Now, assuming that θ in (6.4) is equal to the true (unknown) parameter, by substituting the estimatorρ ˆ = R(Fˆ(X)) with the adjusted one,ρ ˆa∗ = R(F ∗ ˆ(X)), will lead to a bias reduction. θ θ aθ ◦θ While in practice the true parameter adjustment in unknown (since the true parameter θ is unknown), one can compute the adjustment (6.4) for θˆ using bootstrap methods. Assuming that the estimated parameter and the true parameter are close to each other, their local adjustments should be similar. In Section 6.2, we introduce a bootstrapping algorithm which could be used to compute a∗ and apply the algorithm in the normal and GPD case. θˆ It should be noted that in (6.4) we consider the absolute value of bias to simplify the evaluation process. Alternatively, one could consider the positive bias penalty function (6.2) and introduce supplementary optimality criterion that minimizes the expected value of estimated risk. Note that such a framework could be also desirable from the economic perspective since in practical setting we want to be secured with minimal capital. For completeness, we refer to Bignozzi and Tsanakas(2016), where a similar approach based on absolute risk bias adjustment has been proposed. However, note that the approach proposed in the aforementioned paper is different from ours as it aims to estimate the bias value B∗(ˆρ; θ) directly using the bootstrap method; the estimated value is later added to estimated risk to reduce the bias. That saying, for simplified scale-location families those two approaches could produce consistent results. In particular, we note that while in Bignozzi and Tsanakas(2016) the GPD case is not studied, the authors consider the t-student distribution framework and show how to minimize bias in this setting. While numerical analysis in the cited paper is not directly linked to backtesting or predictive accuracy analysis as in our case (see Section7), absolute bias reduction oriented studies also indicate improved performance when imposing control on the underlying risk bias (called residual risk therein) e.g. by considering residual add-on or shifting risk measure reference risk level. 12 MARCIN PITERA AND THORSTEN SCHMIDT

6.2. The bootstrapping bias reduction. Since we do not know the value of the true parameter ∗ ˆ θ, one cannot directly compute the value of aθ in (6.4). Nevertheless, we can replace θ by θ and follow the bootsrapping procedure. Our goal is to compute a∗ := a∗ = arg min B (ˆρ ; θˆ) (6.5) θˆ ∗ a a∈Rd by bootstraping and use it to define the boostrapped risk estimator b ρˆ := R(Fa∗◦θˆ(X)). The proposed algorithm is summarized in details in Figure4.

Algorithm 1 (The bootstapped risk estimator). Begin by an estimation step.

(1) Estimate θˆ (e.g. using MLE approach) from X1,...,Xn. The next steps perform the bootstrap. (2) For each i = 1,...,B, simulate an i.i.d. sample Xi of size n from the estimated ˆi i distribution Fθˆ(X) and compute estimator θ for each X . d (3) For any a ∈ R (or a suitable chosen subset A) do the following: ˆ1 ˆB (3a) estimate the distribution Fθˆ(ˆρa) directly from sample (θ ,..., θ ), e.g. by a kernel density estimation.

(3b) Compute the convolution Fθˆ(X) ∗ Fθˆ(ˆρa). Finally, compute the optimal choice of the parameter a: ∗  a (4) Calculate a := arg mina∈A R Fθˆ(X) ∗ Fθˆ(ˆρa) , (5) Set the bootstrapped risk estimator to b ρˆ := R(Fa∗◦θˆ(X)). (6.6)

aIt should be noted that the local bias-minimising rescaling might be non-unique. Generally speaking, it is 2 better to rescale the shape parameters than the location parameters: If Θ = R and the first coordinate is 2 the location parameter (mean), then this can be achieved by choosing A = (1, R) = {(x1, x2) ∈ R : x1 = 1}.

Figure 4. The algorithm for bootstrapping bias reduction.

To illustrate how Algorithm 1 works let us calculate the value of a∗ for two reference VaR cases that were already considered before, i.e. for plug-in normal and plug-in GPD estimators. Inspired by the Gaussian case, (6.1), we apply rescaling only to the scale parameters. We fix the rolling window length to n = 50, use i.i.d. samples, and consider VaR at level 5%. In the normal case, we compute univariate a∗ > 0 and get the adjusted plug-in estimator − µˆ + (a∗ · σˆ)Φ−1(0.05) , for various choices of true µ ∈ R and σ > 0. In the GPD case we consider an exogenous threshold u which we set to u = 0 and consider univariate a∗ > 0 for the adjusted plug-in estimator

∗ ˆ  ˆ  (a ·β) α−ξ − 1 , ξˆ for various exemplary choices of underlying true parameters ξ ∈ R \{0} and β > 0. Note that in the GPD case we are using conditioned sample, i.e. 5% denotes the conditioned VaR level. For example, assuming that initial sample size was 250 and 50 observations were negative, this setting 50 would effectively correspond to unconditional threshold 5% · 250 = 1%. In both cases, we set B = 50 000 for bootstrap sampling, and apply the algorithm detailed in Figure4 on an exemplary representative parameter grid. The results are presented in Figure5. ESTIMATING AND BACKTESTING RISK 13

2.5 2.0 1.040

1.16

2.0 1.5 1.035

1.12 1.5

sd 1.030

1.0 beta

1.0 1.08

1.025 0.5 0.5

1.04 1.020 0.0 0.0 0.5 1.0 1.5 2.0 −0.2 0.0 0.2 0.4 mean xi

Figure 5. The heatplots present local bias adjustment values for Normal (left) and GPD (right). We computed multiplicative rescaler for scale parameters that should be applied to plug-in normal and GPD estimators to make them locally unbiased. The results are presented for sample size n = 50 and VaR at level p = 5%.

Two main conclusions could be made. First, the local bias adjustment for the normal case does not depend on the underlying parameters and is close to the value presented in (6.1), i.e. we have

q n+1 −1 −1 n tn−1(p)/Φ (p) ≈ 1.029. Second, in the GPD case, it could be observed that the bias value adjustment looks like a (mono- tonic) function of ξ parameter. This is alligned with the intuition as the β parameter is used for linear scaling which should not impact (relative) bias adjustment due to positive homogenuity of VaR.

7. Performance of the bias reduction In this section we present two examples of the local bias minimization when the underlying returns are distributed as GPD. In the first case we consider value-at-risk, while in the second we focus on expected shortfall. To asses the performance of the bias reduction, we simulate samples from the GPD families with different parameter settings and compare the performance of the plug-in estimators with its locally unbiased equivalent constructed via Algorithm 1. For completeness, in both cases we also show the results for true risk and empirical estimators. 7.1. Performance metrics. To measure the performance of both VaR and ES estimators, we introduce supplementary performance metrics. Recall that our inputs are risk projections (ˆρt), realised P&Ls (xt), and the corresponding secured positions (yt) = (xt +ρ ˆt). We start with metrics used for VaR projections. First, we take the exception rate statistic T defined in (4.1), i.e. m 1 X T := 1 . m {yt<0} t=1 This is a standard backtesting metric used in regulatory backtesting. The value of T should not be substantially bigger than the VaR reference risk level. Second, we consider the mean score statistic S based on the quantile strictly consistent scoring function s(r, x) := (1{r>x} − α)(r − x), where r is corresponding to quantile (negative of risk) 14 MARCIN PITERA AND THORSTEN SCHMIDT projection, x is denoting realised P&L, and α is the underlying risk level. Namely, noting that r and x could be expressed via secured sample (yt), we define m 1 X S := − (1 − α)y . m {yt<0} t t=1 Metric S is a common elicitability backtesting statistic used for estimator performance comparison, see Gneiting(2011) and Fissler et al.(2015) for details Third, we consider rolling window traffic-light non-green zone backtest statistic NGZ. We fix the length of an additional rolling window, i.e. N = 50. We count the average number of overshoots Ps+N−1 1 in the interval [s, s + N − 1] by T1(s) := t=s {yt<0}. The average Non-Green Zone (NGZ) classifier is defined as m−N 1 X NGZ := 1 , m − N {T1(s)≥zα} s=1 where zα is the exception number 95% confidence threshold that could be explicitly computed. 1 Namely, under the correct model assumption, the sequence ( {yt<0}) forms a Bernoulli process with probability of success equal to α. Consequently, under correct model specification T1(t0) should be linked to a Bernoulli trial of length n with success probability α. For example, for α = 0.05 and n = 50, we get FB(50,0.05)(4) ≈ 0.89 and FB(50,0.05)(5) ≈ 0.96, which would lead to 95% confidence threshold equal to zα = 5. In a perfect setting, NGZ should be close to 1 − FB(n,p)(zα − 1). Fourth, we introduce Diebold-Mariano comparative (t-)test statistic with unbiased estimator being the reference one. The test statistic is given by √ µˆ(d) DM := n · , σˆ(d) m where d = (dt)t=1 is the time series of comparative errors between two estimators,µ ˆ(d) is the sample mean comparative error, andσ ˆ(d) is the sample standard deviation of the comparative error, respectively; see Osband and Reichelstein(1985) for details. In the VaR case, the t-th day ˆ z1 ˆ z2 comparative error is given by dt := s(−V@Rt , xt)−s(−V@Rt , xt), where z1 and z2 denote certain estimators (such as plug-in estimator, empirical estimator, or true risk). Finally, we are introducing standard metrics linked to the mean and standard deviation of the estimated regulatory capital. When other criteria are satisfied, this allows to choose the estimator which requires minimal regulatory capital or has the smallest statistical bias. We define the mean of risk estimator (MR) and standard deviation of risk estimator (SD) by q 1 Pm 1 Pm 2 MR := m t=1 ρˆt and SD := m−1 t=1(ˆρt − MRV) . (7.1) Note that smaller values of MR indicate that (on average) smaller capital reserves are required. Ideally, this metric should be close to the true risk. Since parameters need to be estimated we expect that this model uncertainty is reflected by an increase in MR. MR can also be seen as an approximation of the standard (statistical) estimator bias. Let us now move to the ES case. For simplicity and brevity, and to avoid the discussion about joint-elicitability and (statistical) relevance of the output metrics (see e.g. Fissler et al.(2015)), we decided to consider a simplified framework in which we focus only on backtesting performance; other conclusions should be similar to VaR. Namely, we focus on the averaged cumulative exception rate defined in (5.1), i.e. m 1 X G := 1 . m {y(1)+...+y(i)<0} i=1 ESTIMATING AND BACKTESTING RISK 15

For completeness, we also include results of MR and SD in the ES case. 7.2. Bias reduction for value-at-risk in the GPD framework. We consider three different sets of GPD parameters (u, ξ, β) and estimate VaR using three different (conditional) reference α levels. The parameter sets we picked are

θ0 = (−0.978, 0.212, 0.869),

θ1 = (−2.2, 0.388, 0.545), (7.2)

θ2 = (−0.40028, 1.19, 0.774), while the α confidence thresholds are equal to 5%, 7.5%, and 10%, respectively. In first two cases the learning period is equal to n = 50, while in the last case it is equal to n = 42. The first setting corresponds to the t-student distribution with 3 degrees of freedom and 20% percentage threshold; the second setting corresponds to market data calibrated parameters from S&P500 index – see Table 5 in Gilli and Kellezi(2006); the third setting comes from Table 5 in Moscadelli(2004) and is related to an operational risk framework. Note that in the third set we have ξ > 1 which implies that the first moment is infinite. For simplicity, in all cases, we fix u, estimate β and ξ, and rescale only β when Algorithm 1 is used. For computational purposes we fixed bootstrap sample size to B = 50 000, see Step (2) of Algorithm 1 for details. We consider five different VaR estimators: the empirical estimator, the plug-in GPD estimator, true risk, and two bootstrap-based estimators that are described below. First, for reference, we adjust the GPD plug-in estimator V@Rˆ plug-in by a fixed multiplier a that is computed based on true values of ξ and β. In other words, we compute the true adjustment value corresponding to (6.4) rather than its estimate corresponding to (6.5). It should be noted the (true) adjustment a is not available in practice since the true values of ξ and β are unknown. Nevertheless, this gives us information about adjustment application rationality and could serve as a benchmark for other estimators. We set ˆ boot1 ˆ plug-in V@Rt := −u + a · (V@Rt + u), (7.3) where a is computed using true values of ξ and β in the first step of Algorithm 1; note that adjusting βˆ by a results in (7.3). Here, we slightly modified the objective function in step (4) to allow a small positive bias and reduce the size of a. Namely, in all cases we allowed bias to be equal to 10% of the estimator standard error. For the first two datasets, this corresponds to approximately 1% of the true risk value, while for the third dataset this corresponds to 10% of the true risk value. Note that in the third case we have ξ > 1 which increase the standard error. The correction was done to slightly bring down capital without (considerably) impacting exception thresholds. Second, we introduce an estimator that is computed using the bootstrapping algorithm detailed in Figure4 applied to each sample separately. We set ˆ boot2 ˆ plug-in V@Rt := −u +a ˆt · (V@Rt + u), (7.4) wherea ˆt is the local bias correction for tth day estimates ξˆt and βˆt. Here, we also slightly modified the Algorithm 1 to avoid high sensitivity to outliers and make the output risk projection more rational. Namely, we decided to seta ˆt value to one (i.e. do not apply adjustment) if the estimated ˆ plug-in value of V@Rt was too big in the first place. This was done if the estimated plug-in risk ˆ plug-in was breaching the 10% upper quantile of the aggregated estimated capital sample (V@Rt ). From risk management point of view, this could correspond to the situation when the risk manager decides not to apply the risk bias correction if he thinks that the estimated capital is severely overestimated in the first place. Again, this correction was done to slightly bring down capital 16 MARCIN PITERA AND THORSTEN SCHMIDT without (considerably) impacting exception thresholds. Also, note that high risk projections usually correspond to estimated parameters that are relatively far away from the true parameters in which case the size of the adjustment could differ. We follow a backtesting rolling window approach with total number of simulations equal to N = n + m = 100 000. The results of the simulation are presented in Table1.

Parameters VaR α n TS NGZ DM MR (SD) GPD emp 0.066 0.2475 0.20 -11.8 4.47 (0.92) u = −0.978 plug-in 0.060 0.2426 0.15 -0.4 4.57 (0.78) ξ = 0.212 boot1 0.050 50 0.052 0.2425 0.08 — 4.83 (0.83) β = 0.869 boot2 0.052 0.2424 0.08 2.2 4.83 (0.80) true 0.051 0.2331 0.11 18.7 4.61 (0.00) GPD emp 0.091 0.3140 0.13 -10.7 4.59 (0.71) u = −2.200 plug-in 0.087 0.3101 0.12 -0.5 4.58 (0.57) ξ = 0.388 boot1 0.075 50 0.078 0.3100 0.07 — 4.76 (0.61) β = 0.545 boot2 0.077 0.3097 0.07 4.4 4.76 (0.58) true 0.077 0.3017 0.08 18.2 4.63 (0.00) GPD emp 0.119 7.2509 0.08 -13.3 10.57 (7.16) u = −0.400 plug-in 0.124 7.2003 0.12 0.0 9.08 (4.51) ξ = 1.190 boot1 0.100 42 0.111 7.2003 0.07 — 10.38 (5.19) β = 0.774 boot2 0.111 7.1907 0.07 13.2 10.30 (4.60) true 0.101 7.1175 0.05 12.0 9.82 (0.00)

Table 1. The table presents numerical results the estimation of value-at-risk in three cases of a GPD distribution. It reports the empirical (emp), the plug-in, and the two locally unbiased estimators: the estimator with true adjustment (boot1) and the bootstraping estimator (boot2). The reported values are the cumulative exception rate T , mean scores S, the non-green zone (NGZ), the Diebold-Mariano test statistic with boot1 as competitive approach, and the mean risk value MR (with standard deviation in brackets).

It can be seen that only for the unbiased estimators (and of course for the true VaR) the exception rate (T) and non-green zone (NGZ) frequency is close to the expected/desired values. Indeed, both empirical and GPD plug-in estimators have higher (than expected) exception rate and enter yellow (or red) zone more frequently. Also, note that in all cases the mean score of the unbiased estimators is better compared to empirical and plug-in estimators. Summarizing, the ˆ boot1 ˆ boot2 unbiased estimators V@Rt and V@Rt outperform the empirical and plug-in estimators in all respects. If we had to decide between those two measures, the Diebold-Mariano tests suggest that boot2 estimator has a better performance.

7.3. Bias reduction for expected shortfall in the GPD framework. In this setting we repeat the previous simulation example in the context of expected shortfall. We consider the first two parameter sets from Equation (7.2) since, for θ2, we have ξ > 1 so expected shortfall is no longer finite. The results are presented in Table2. Two remarks should be made. First, the bias adjustment for expected shortfall is large in comparison to the adjustment in the value-at-risk case, computed for the same confidence threshold α ∈ (0, 1). Something we expected since expected shortfall takes into account the whole distribution ESTIMATING AND BACKTESTING RISK 17

Parameters ES α n G MR (SD) GPD emp 0.100 6.14 (1.83) u = −0.978 plug-in 0.078 6.73 (2.01) ξ = 0.212 boot1 0.050 50 0.050 7.88 (2.41) β = 0.869 boot2 0.057 7.66 (2.22) true 0.051 6.70 (0.00) GPD emp 0.136 6.80 (2.40) u = −2.200 plug-in 0.121 7.08 (2.44) ξ = 0.388 boot1 0.075 50 0.079 8.35 (3.08) β = 0.545 boot2 0.092 8.03 (2.65) true 0.077 7.07 (0.00)

Table 2. The table presents numerical results the estimation of expected shortfall in two cases of a GPD distribution. It reports the empirical (emp), the plug-in, the estimator knowing ξ and β (true), the bootstrapping estimator using ξ and β for computing a (true2), and the two locally unbiased estimators: the bootstrapping estimator (boot) and the splitting estimator (splitting). The reported values are the cumulative aggregated exception rate statistic G, the non-green zone (NGZ), and the mean risk value MR (with standard deviation in brackets). tail.4 In particular, the closer the value of ξ to 1, the bigger the bias, and the multiplier is close to 70% for ξ ≈ 0.8. We illustrate this with a comparison plot with VaR and ES bias adjustments computed for all estimated parameter values for the second dataset, see Figure6. Please note that the simulations are in perfect agreement with results in Figure5, i.e. ξ is the main bias determination driver. Second, as expected, the performance (measured in terms of G) for empirical and plug-in estimators is significantly worse in comparison to the unbiased estimators which tend to underestimate the risk. This is aligned with the results presented for the VaR case.

1.6 1.6

1.25

1.6 1.2 1.20 1.2

1.15 1.4 beta beta 0.8 0.8

1.10

1.2

0.4 1.05 0.4

−0.5 0.0 0.5 −0.5 0.0 0.5 xi xi Figure 6. The heatplots present local bias adjustment for GPD plug-in estimator for VaR (left) and ES (right), both at level 7.5%. The underlying sample size in both cases is equal to n = 50. The multiplicative rescalers are computed for all parameter pairs that were estimated for 100 000 simulated data based on Dataset 2 specification.

4In Regulatory framework VaR at 1% was replaced by ES at 2.5% which should essentially result in the same level of bias, at least in the Gaussian setting. 18 MARCIN PITERA AND THORSTEN SCHMIDT

Acknowledgements The first author acknowledges support from the National Science Centre, Poland, via project 2016/23/B/ST1/00479. The second author acknowledges support from the Deutsche Forschungs- gemeinschaft under the grant SCHM 2160/13-1.

References Acerbi, C. and Sz´ekely, B. (2014), ‘Back-testing expected shortfall’, Risk magazine (November). Bartl, D. and Tangpi, L. (2020), ‘Non-asymptotic rates for the estimation of risk measures’, preprint, arXiv:2003.10479 . BCBS (1996), Supervisory framework for the use of ’backtesting’ in conjunction with the internal models approach to market risk capital requirements, Technical report, Basel Committee on Banking Supervision, Bank for International Settlements. BCBS (2009), Fundamental review of the trading book: A revised market risk framework - Consultative document, Technical report, Bank for International Settlements, Basel Committee on Banking Supervision. Bielecki, T. R., Cialenco, I., Drapeau, S. and Karliczek, M. (2016), ‘Dynamic assessment indices’, Stochastics 88(1), 1–44. Bignozzi, V. and Tsanakas, A. (2016), ‘Parameter uncertainty and residual estimation risk’, Journal of Risk and Insurance 83(4), 949–978. Cherny, A. S. and Madan, D. B. (2009), ‘New measures for performance evaluation’, The Review of Financial Studies 22(7), 2571–2606. Cont, R., Deguest, R. and Scandolo, G. (2010), ‘Robustness and sensitivity analysis of risk measurement procedures’, Quantitative Finance 10(6), 593–606. Davis, M. (2016), ‘Verification of internal risk measure estimates’, Statistics & Risk Modeling 33, 67 – 93. Embrechts, P. and Hofert, M. (2014), ‘Statistics and quantitative risk management for banking and insur- ance’, Annual Review of Statistics and Its Application 1, 493–514. Fissler, T., Ziegel, J. F. and Gneiting, T. (2015), ‘Expected shortfall is jointly elicitable with - implications for backtesting’, Risk magazine (December). Francioni, I. and Herzog, F. (2012), ‘Probability-unbiased value-at-risk estimators’, Quantitative Finance 12(5), 755–768. Frank, D. (2016), ‘Adjusting var to correct sample volatility bias’, Risk magazine (October). Gilli, M. and Kellezi, E. (2006), ‘An application of extreme value theory for measuring financial risk’, Computational Economics 27(2-3), 207–228. Gneiting, T. (2011), ‘Making and evaluating point forecasts’, Journal of the American Statistical Association 106(494), 746–762. Jorion, P. (2001), Value at Risk: The New Benchmark for Managing , 3 edn, The McGraw-Hill Professional. Kr¨atschmer, V., Schied, A. and Z¨ahle,H. (2014), ‘Comparative and qualitative robustness for law-invariant risk measures’, Finance and Stochastics 18(2), 271–295. Kr¨atschmer, V. and Z¨ahle,H. (2017), ‘Statistical inference for expectile-based risk measures’, Scandinavian Journal of Statistics 44(2), 425–454. Lauer, A. and Z¨ahle,H. (2016), ‘Nonparametric estimation of risk measures of collective risks’, Statistics & Risk Modeling 32(2), 89–102. Lauer, A. and Z¨ahle,H. (2017), ‘Bootstrap consistency and bias correction in the nonparametric estimation of risk measures of collective risks’, Insurance: Mathematics and Economics 74, 99–108. McNeil, A. J., Frey, R. and Embrechts, P. (2015), Quantitative risk management: concepts, techniques and tools-revised edition, Princeton university press. Moldenhauer, F. and Pitera, M. (2019), ‘Backtesting expected shortfall: a simple recipe?’, Journal of Risk 22(1). Moscadelli, M. (2004), ‘The modelling of operational risk: experience with the analysis of the data collected by the basel committee’, Technical Report 517, Banca d’Italia . ESTIMATING AND BACKTESTING RISK 19

Nolde, N. and Ziegel, J. F. (2017), ‘Elicitability and backtesting: Perspectives for banking regulation’, The annals of applied statistics 11(4), 1833–1874. Osband, K. and Reichelstein, S. (1985), ‘Information-eliciting compensation schemes’, Journal of Public Economics 27(1), 107–115. Pitera, M. and Schmidt, T. (2018), ‘Unbiased estimation of risk’, Journal of Banking & Finance 91, 133–145. Yuen, R., Stoev, S. and Cooley, D. (2020), ‘Distributionally robust inference for extreme value-at-risk’, Insurance: Mathematics and Economics 92, 70–89. Ziegel, J. F. (2016), ‘Coherence and elicitability’, 26, 901 – 918.

Institute of Mathematics, Jagiellonian University,Lojasiewicza6, 30-348 Cracow, Poland Email address: [email protected]

Dep. of Mathematical Stochastics, University of Freiburg, Eckerstr.1, 79104 Freiburg, Germany Email address: [email protected]