Deep X-Valuation Adjustments (XVAs) Analysis

Presented by Bouazza Saadeddine (Crédit Agricole CIB, France, and LaMME, Université d'Evry/Paris-Saclay, France)

Joint work with Lokman Abbas-Turki (LPSM, Sorbonne Université, France) and Stéphane Crépey (LPSM, Université de Paris, France)

Plan of the presentation

1. XVA primer
2. General simulation and learning issues
3. From linear regressors to neural networks
4. Backward learning scheme with over-simulation
5. Numerical benchmark
6. GPU optimizations
7. Work in progress

Motivation

- 2008-09 crisis → major banking reforms aimed at securing the financial system;
- Collateralization and capital requirements were raised;
- Unintended consequence: the need to quantify market incompleteness, based on XVA metrics;
- XVAs: pricing add-ons meant to account for counterparty risk and its capital and funding implications. VA: Valuation Adjustment; X: catch-all letter to be replaced by C (Credit), D (Debt), F (Funding), M (Margin) or K (Capital);
- During the financial crisis, roughly two-thirds of losses attributed to counterparty credit risk were due to CVA losses and only about one-third were due to actual defaults (Basel Committee report, June 2011);
- In January 2014, JP Morgan recorded a $1.5 billion FVA loss;
- Essential to model the future evolution of XVAs.

- X_i and Y_i are resp. the state of defaults and the market risk factors at time t_i, i \in \{0, \dots, n\};
- CVA could be simulated with 1 layer of Nested Monte-Carlo (NMC), assuming analytic MtM:

[Schematic of one-layer nested Monte-Carlo over the time steps, with M(i) inner paths branching from each outer scenario at time step t_i.]

CVA_i := E[ \sum_{j=i}^{n-1} MtM_{j+1}^+ \, 1_{\{t_j < \tau \le t_{j+1}\}} \,|\, X_i, Y_i ]
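For illustration, a minimal sketch of this one-layer NMC estimator for a single outer scenario, assuming the inner-path MtM values and default steps have already been simulated (the array layout and the helper name are illustrative, not the presentation's implementation):

import numpy as np

def cva_nested_mc(mtm_inner, default_step, i, n):
    """One-layer nested Monte-Carlo CVA estimate at time step i, for one outer scenario.

    mtm_inner   : array (M, n+1) of inner-path MtM values MtM_j (assumed analytic);
    default_step: int array (M,) giving, per inner path, the step j+1 such that
                  t_j < tau <= t_{j+1} (any value > n meaning no default before t_n).
    """
    hit = (default_step >= i + 1) & (default_step <= n)               # default in (t_i, t_n]
    losses = np.zeros(mtm_inner.shape[0])
    losses[hit] = np.maximum(mtm_inner[hit, default_step[hit]], 0.0)  # MtM_{j+1}^+ at default
    return losses.mean()                                              # inner average ~ CVA_i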

FVA would need n layers of NMC (\lambda: funding spread):

FVA_i := E[ \sum_{j=i}^{n-1} \lambda_{j+1} ( MtM_{j+1} 1_{\{\tau > t_{j+1}\}} - CVA_{j+1} - FVA_{j+1} )^+ (t_{j+1} - t_j) \,|\, X_i, Y_i ]

i.e. exponential complexity in n, unless interpolation or regression is used for cutting the recursion.

General simulation and learning issues

- (X_i)_{0 \le i \le n} and (Y_i)_{0 \le i \le n} are jointly Markov processes evaluated at 0 = t_0 < t_1 < \cdots < t_n = T;
- Goal: estimate conditional expectations E[\xi_{i,n} \,|\, X_i, Y_i], where \xi_{i,n} := f_i((X_j)_{i \le j \le n}, (Y_j)_{i \le j \le n});
- The pricing probability measure changes every day due to changes in market conditions;
- We are in a setting where X contributes more to the variance of \xi than Y;
- Simulating X is much faster than simulating Y → over-simulate X.

[Schematic of the nested XVA layers and their simulation depth: MtM and IM at the bottom, then CVA, MVA and FVA, then EC, then KVA at the top, each layer requiring its own number of inner paths (M_mtm, M_im, M_cva, M_fva, M_ec, M_kva).]

- Hard to compute because of critical tail events and the presence of multiple XVA layers;
- EC is an Expected Shortfall of default losses and of fluctuations of lower XVAs;
- In Abbas-Turki, Diallo and Crépey (2018), a benchmark approach involving multi-layer NMC and linear regressions, along with GPU optimization techniques, was developed. This benchmark however has an exponential complexity in the number of layers;
- We focus here on an approach based on neural regressions, with linear complexity.

Regression setup

- (X_i)_{0 \le i \le n} and (Y_i)_{0 \le i \le n} are jointly Markov processes evaluated at 0 = t_0 < t_1 < \cdots < t_n = T, and we regress the labels \xi_{i,n} := f_i((X_j)_{i \le j \le n}, (Y_j)_{i \le j \le n});
- Technique well known to the quant finance community (Longstaff and Schwartz (2001));
- Draw i.i.d. samples (X_i^k, Y_i^k, \xi_{i,n}^k)_{1 \le k \le m} of (X_i, Y_i, \xi_{i,n});
- Estimate E[\xi_{i,n} \,|\, X_i, Y_i] by linear regression of (\xi_{i,n}^k)_{1 \le k \le m} against features of (X_i^k, Y_i^k)_{1 \le k \le m} built from a hand-crafted feature mapping, as in the sketch below.
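A minimal sketch of this regression step, assuming the samples are given as arrays; the affine feature map at the end is only a hand-crafted illustration, not the presentation's choice:

import numpy as np

def regress_conditional_expectation(xi, X, Y, phi):
    """Estimate E[xi_{i,n} | X_i, Y_i] by least squares on features phi(X, Y).

    xi : array (m,)     -- sampled labels xi_{i,n}^k
    X  : array (m, dX)  -- sampled default states X_i^k
    Y  : array (m, dY)  -- sampled market risk factors Y_i^k
    phi: callable mapping (X, Y) to a feature matrix of shape (m, p)
    Returns a function (x, y) -> fitted conditional expectation (x, y given as 2-D arrays).
    """
    A = phi(X, Y)                                   # design matrix
    beta, *_ = np.linalg.lstsq(A, xi, rcond=None)   # least-squares coefficients
    return lambda x, y: phi(x, y) @ beta

# example feature map: affine features in (X, Y) -- an illustrative choice only
phi = lambda X, Y: np.hstack([np.ones((len(X), 1)), X, Y])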

Neural networks as regressors

- XVAs exhibit non-trivial dependencies on the (many) risk factors;
- Very hard to manually craft a good enough feature mapping;
- Neural Networks (NNs) offer a way to learn the feature mapping too:

find \hat{\theta}_i \in argmin_{\theta \in \Theta} E[ (\xi_{i,n} - \varphi_\theta(X_i, Y_i))^2 ]

where \varphi_\theta is a NN parametrized by \theta;

- \varphi_{\hat{\theta}_i}(X_i, Y_i) would then be an estimator of E[\xi_{i,n} \,|\, X_i, Y_i], only valid for the given market conditions;
- NNs are more flexible than linear regression, e.g. allowing one to enforce positivity using a ReLU or SoftPlus activation at the output layer (see the sketch below).
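For concreteness, one possible \varphi_\theta: a small PyTorch MLP whose SoftPlus output enforces positivity, as mentioned above. The architecture (widths, depth) is an arbitrary illustration, not the presentation's network:

import torch

def make_positive_regressor(input_dim, hidden=64):
    # phi_theta: maps (X_i, Y_i) features to a positive scalar, e.g. a CVA value
    return torch.nn.Sequential(
        torch.nn.Linear(input_dim, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, hidden), torch.nn.ReLU(),
        torch.nn.Linear(hidden, 1), torch.nn.Softplus(),  # output activation enforcing positivity
    )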

An over-simulation scheme

- Leverage the hierarchy between X and Y: relax the i.i.d. setting and sample more realizations of X than of Y;

- Simulate \nu i.i.d. paths (Y^k)_{1 \le k \le \nu} of Y;
- At t_i and for every 1 \le k \le \nu, simulate \omega i.i.d. realizations (X_i^{k,l})_{1 \le l \le \omega} of X_i given Y_i^k and set \xi_{i,n}^{k,l} := f_i(X_i^{k,l}, Y_i^k);

- This yields a sample (X_i^{k,l}, Y_i^k, \xi_{i,n}^{k,l})_{1 \le k \le \nu, 1 \le l \le \omega} of size m := \nu \omega of (X_i, Y_i, \xi_{i,n});
- The generated sample is not i.i.d.;
- More efficient both in speed, since Y is more costly to simulate than X, and in memory usage.

In the XVA setting, assuming \gamma is the counterparty's default intensity process:
- X_i: default indicator, Y_i: market risk factor processes (rates, FX, default intensities);
- \forall i \in \{0, \dots, n\}, P(X_i = 0 \,|\, (Y_j)_{0 \le j \le i}) = \exp(-\sum_{j=1}^{i} \gamma_j \Delta t);
- Conditional on (Y_j)_{0 \le j \le i}, X_i can be simulated very fast as 1_{\{\sum_{j=1}^{i} \gamma_j \Delta t > \epsilon\}} with \epsilon \sim Exp(1), as in the sketch below.
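A minimal sketch of the over-simulation of X given one Y path, using the exponential-threshold representation above; the function and argument names are illustrative:

import numpy as np

def simulate_default_indicators(gamma, dt, omega, rng=None):
    """Simulate omega default-indicator paths X given one intensity path gamma.

    gamma: array (n+1,) -- default intensity gamma_j along one Y path
    dt   : float        -- time step Delta t
    omega: int          -- over-simulation factor (number of X paths per Y path)
    Returns an int8 array of shape (omega, n+1) with X_i^{k,l} = 1 if default by t_i.
    """
    rng = rng or np.random.default_rng()
    cum_hazard = np.cumsum(gamma[1:] * dt)        # sum_{j=1}^{i} gamma_j * dt, i = 1..n
    eps = rng.exponential(1.0, size=omega)        # one Exp(1) threshold per X path
    X = cum_hazard[None, :] >= eps[:, None]       # default once cumulative hazard crosses eps
    return np.concatenate([np.zeros((omega, 1), bool), X], axis=1).astype(np.int8)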

Learning scheme

At every time step t_i:

1. Simulate (X_i^{k,l}, Y_i^k, \xi_{i,n}^{k,l})_{1 \le k \le \nu, 1 \le l \le \omega} according to the previous over-simulation scheme;
2. Train a NN to regress (\xi_{i,n}^{k,l})_{1 \le k \le \nu, 1 \le l \le \omega} against (X_i^{k,l}, Y_i^k)_{1 \le k \le \nu, 1 \le l \le \omega}, i.e.

find \hat{\theta}_i \in argmin_{\theta \in \Theta} \sum_{k=1}^{\nu} \sum_{l=1}^{\omega} (\xi_{i,n}^{k,l} - \varphi_\theta(X_i^{k,l}, Y_i^k))^2;

3. Use (x, y) \mapsto \varphi_{\hat{\theta}_i}(x, y) as an estimator for E[\xi_{i,n} \,|\, X_i = x, Y_i = y] (training sketch below).
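A minimal sketch of steps 1-3 at one time step, assuming the over-simulated inputs and labels have already been assembled as GPU tensors; the optimizer, batch size and epoch count are placeholders, not the presentation's settings:

import torch

def train_time_step(model, inputs, labels, epochs=50, batch_size=4096, lr=1e-3):
    """Regress labels xi_{i,n}^{k,l} on (X_i^{k,l}, Y_i^k) with a mean-squared loss.

    model : phi_theta, e.g. the positive MLP sketched earlier
    inputs: tensor (nu*omega, d) -- concatenated features of (X_i^{k,l}, Y_i^k)
    labels: tensor (nu*omega, 1) -- cash-flow labels xi_{i,n}^{k,l}
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loader = torch.utils.data.DataLoader(
        torch.utils.data.TensorDataset(inputs, labels),
        batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = torch.mean((model(x) - y) ** 2)  # empirical version of the argmin above
            loss.backward()
            opt.step()
    return model  # (x, y) -> phi_{theta_hat_i}(x, y) estimates E[xi_{i,n} | X_i=x, Y_i=y]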

Backward learning

- It is possible to obtain non-smooth paths \{\varphi_{\hat{\theta}_i}(X_i, Y_i)\}_{0 \le i \le n} even when the labels are smooth;
- The learnings at each time step t_i are done independently of each other;
- Start the learning at T and then, at every time step, reuse the previous solution as an initialization of the training algorithm;
- Local minima associated with each time step will then be close to each other → the paths \{\varphi_{\hat{\theta}_i}(X_i, Y_i)\}_{0 \le i \le n} are now smooth;
- A form of transfer learning, which also helps accelerate the convergence of the learning procedure (see the sketch below).
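A sketch of how the backward pass could be organised, assuming hypothetical per-step helpers simulate_step and the train_time_step routine above; the only point illustrated is the warm start, i.e. reusing \hat{\theta}_{i+1} to initialise the training at t_i:

def backward_learning(models, simulate_step, train_time_step, n):
    """Train time steps t_n, ..., t_0 backwards, warm-starting each step.

    models[i] is the network for time step t_i; simulate_step(i) returns (inputs, labels).
    """
    for i in range(n, -1, -1):
        if i < n:
            # transfer learning: initialise theta_i with the minimiser found at t_{i+1}
            models[i].load_state_dict(models[i + 1].state_dict())
        inputs, labels = simulate_step(i)
        train_time_step(models[i], inputs, labels)
    return models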

Choosing the over-simulation factor

- Assume that simulating Y_i costs \gamma times more than X_i | Y_i in terms of computation time;
- \omega is chosen so as to minimize the variance of the loss \frac{1}{m} \sum_{k=1}^{\nu} \sum_{l=1}^{\omega} f_\theta(X_i^{k,l}, Y_i^k) w.r.t. \omega under a budget constraint \kappa = \nu(\omega + \gamma), where f_\theta is such that f_\theta(X_i, Y_i) = (\xi_{i,n} - \varphi_\theta(X_i, Y_i))^2;
- Defining \sigma_i^2 := E[(f_\theta(X_i^{1,1}, Y_i^1))^2] - E[f_\theta(X_i^{1,1}, Y_i^1) f_\theta(X_i^{1,2}, Y_i^1)] and \rho_i^2 := E[f_\theta(X_i^{1,1}, Y_i^1) f_\theta(X_i^{1,2}, Y_i^1)] - (E[f_\theta(X_i^{1,1}, Y_i^1)])^2, one can show that:

Var( \frac{1}{m} \sum_{k=1}^{\nu} \sum_{l=1}^{\omega} f_\theta(X_i^{k,l}, Y_i^k) ) = \frac{1}{\kappa} [ \sqrt{\gamma}\, \sigma_i \rho_i ( \sqrt{\omega/\omega_0} - \sqrt{\omega_0/\omega} )^2 + ( \sigma_i + \sqrt{\gamma}\, \rho_i )^2 ],  where \omega_0 := \sqrt{\gamma}\, \sigma_i / \rho_i,

so the variance is minimized at \omega = \omega_0 (a pilot estimator of this heuristic is sketched after Figure 1).

Figure 1. Optimal over-simulation factor \sqrt{\gamma}\, \sigma_i / \rho_i at different time steps (panels at t = 9.5, 8.0, 6.5, 5.0, 4.5 and 3.0 years) and across SGD iterations, compared with the heuristic value.
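A sketch of a pilot estimator of the heuristic \omega_0 = \sqrt{\gamma}\, \sigma_i / \rho_i, assuming two conditionally independent inner X draws are available per pilot Y path; the arrays f_vals1 and f_vals2 (holding f_\theta(X^{k,1}, Y^k) and f_\theta(X^{k,2}, Y^k)) and the function name are illustrative:

import numpy as np

def optimal_oversimulation_factor(f_vals1, f_vals2, gamma_cost):
    """Estimate omega_0 = sqrt(gamma) * sigma_i / rho_i from pilot samples.

    f_vals1, f_vals2: arrays (nu,) of f_theta evaluated on two independent X draws
                      sharing the same Y path (k = 1..nu)
    gamma_cost      : relative cost of simulating Y_i versus X_i | Y_i
    """
    e_f2 = np.mean(f_vals1 ** 2)            # E[f(X^{1,1},Y^1)^2]
    e_f12 = np.mean(f_vals1 * f_vals2)      # E[f(X^{1,1},Y^1) f(X^{1,2},Y^1)]
    mean_f = 0.5 * (np.mean(f_vals1) + np.mean(f_vals2))
    sigma2 = e_f2 - e_f12                   # sigma_i^2
    rho2 = e_f12 - mean_f ** 2              # rho_i^2
    return np.sqrt(gamma_cost * sigma2 / max(rho2, 1e-12))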

Finite parameter space case

Assume \Theta is finite, let 0 < \delta < \epsilon and define:

\vartheta^\star := \min_{\theta \in \Theta} E[f_\theta(X, Y)]

\hat{\vartheta}_{\nu,\omega} := \min_{\theta \in \Theta} \frac{1}{m} \sum_{k=1}^{\nu} \sum_{l=1}^{\omega} f_\theta(X_i^{k,l}, Y_i^k)

S_\epsilon := \{ \theta \in \Theta : E[f_\theta(X, Y)] \le \vartheta^\star + \epsilon \}

\hat{S}_{\nu,\omega}^\delta := \{ \theta \in \Theta : \frac{1}{m} \sum_{k=1}^{\nu} \sum_{l=1}^{\omega} f_\theta(X_i^{k,l}, Y_i^k) \le \hat{\vartheta}_{\nu,\omega} + \delta \}

Assume \Theta \setminus S_\epsilon \ne \emptyset and let u : \Theta \setminus S_\epsilon \to S_\epsilon be such that E[f_{u(\theta)}(X, Y)] - E[f_\theta(X, Y)] \le -\epsilon^\star for all \theta \in \Theta \setminus S_\epsilon, for some \epsilon^\star \ge \epsilon, and define g_\theta(X, Y) := f_{u(\theta)}(X, Y) - f_\theta(X, Y).

In particular, \forall \theta \in \Theta \setminus S_\epsilon, E[g_\theta(X, Y)] \le -\epsilon^\star, and we have:

P( \hat{S}_{\nu,\omega}^\delta \not\subset S_\epsilon ) \le |\Theta| \, e^{-\nu \eta_\omega(\delta, \epsilon)}

where \eta_\omega(\delta, \epsilon) := \min_{\theta \in \Theta \setminus S_\epsilon} I_{\theta,\omega}(-\delta), \quad I_{\theta,\omega}(-\delta) = \sup_{t \in R} \{ -t\delta - \log E[ M_\theta(t/\omega \,|\, Y)^\omega ] \} \quad and \quad M_\theta(z \,|\, Y) := E[ e^{z g_\theta(X,Y)} \,|\, Y ].

We have I_{\theta,\omega}(-\delta) > 0 and, for -\delta close to E[g_\theta(X, Y)]:

I_{\theta,\omega}(-\delta) \approx \frac{(\delta + E[g_\theta(X, Y)])^2}{2 ( \frac{1}{\omega} E[Var(g_\theta(X, Y) \,|\, Y)] + Var(E[g_\theta(X, Y) \,|\, Y]) )} \ge \frac{(\epsilon^\star - \delta)^2}{2 ( \frac{1}{\omega} E[Var(g_\theta(X, Y) \,|\, Y)] + Var(E[g_\theta(X, Y) \,|\, Y]) )}

Market and Credit Model for the experiment

- Dealer bank trading derivatives in 10 economies with 8 different clients;
- Simulated short rates for these economies and their exchange rates against a reference currency (Vasicek dynamics for rates, log-normal dynamics with stochastic rates for FX);
- For the bank and for every counterparty, we simulate respectively the credit spread and the default intensities (CIR dynamics);
- → 10 (IR) + 9 (FX) + 9 (spreads) = 28 market risk factors and 8 default indicators;
- 100 time steps and a 10-year horizon (1 time step = 0.1 years);
- 500 IR swaps, 16384 paths for the market factors (with 128 inner paths if using NMC);
- Market processes grouped together in Y and default indicators in X;
- MtMs assumed to be analytic (true for 80%+ of the book).

Figures 2 to 7 below plot, at each of the 100 time steps, the mean and the 5th/95th percentiles of the learned CVA (using default indicators, out-of-sample) against the corresponding nested Monte-Carlo CVA statistics, for increasing over-simulation factors.


Figure 2. Over-simulation factor of 1 (i.e. no over-simulation).


Figure 3. Over-simulation factor of 32.


Figure 4. Over-simulation factor of 64.


Figure 5. Over-simulation factor of 128.


Figure 6. Over-simulation factor of 256.


Figure 7. Over-simulation factor of 512.

One can show that:

CVA_i = E[ \sum_{j=i}^{n-1} MtM_{j+1}^+ \, 1_{\{t_j < \tau \le t_{j+1}\}} \,|\, X_i, Y_i ]
      = E[ \sum_{j=i}^{n-1} MtM_{j+1}^+ ( \exp(-\sum_{m=i+1}^{j} \gamma_m \Delta t) - \exp(-\sum_{m=i+1}^{j+1} \gamma_m \Delta t) ) \,|\, Y_i ] \, 1_{\{\tau > t_i\}}

- Much easier to learn (see the sketch below);
- Cannot be applied to other XVA metrics (e.g. FVA, quantiles or Expected Shortfalls).
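A sketch of the intensity-based label above for one market path, assuming a constant time step and analytic MtMs; the residual factor 1_{\{\tau > t_i\}} is applied outside this function, and the names are illustrative:

import numpy as np

def cva_label_with_intensities(mtm_pos, gamma, dt, i):
    """Label for CVA_i on one Y path, using intensities instead of default indicators.

    mtm_pos: array (n+1,) -- positive parts MtM_j^+ along the path (analytic MtM)
    gamma  : array (n+1,) -- counterparty default intensity gamma_j along the path
    dt     : float        -- constant time step Delta t
    Implements sum_j MtM_{j+1}^+ (exp(-sum_{m=i+1}^{j} gamma_m dt) - exp(-sum_{m=i+1}^{j+1} gamma_m dt)).
    """
    n = len(gamma) - 1
    cum = np.concatenate([[0.0], np.cumsum(gamma[i + 1:n + 1] * dt)])  # partial hazards from t_{i+1}
    surv = np.exp(-cum)                    # conditional survival between t_i and t_j, j = i..n
    default_probs = surv[:-1] - surv[1:]   # P(t_j < tau <= t_{j+1} | tau > t_i, Y), j = i..n-1
    return np.dot(mtm_pos[i + 1:n + 1], default_probs)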


Figure 8. Using now the version with intensities instead of indicators.

Simulations done on GPU

- Data generation and training are part of the end product;
- Monte-Carlo simulations are parallel by nature and well suited to GPUs;
- We wrote CUDA implementations (in Python, through Numba) to perform the Monte-Carlo simulations:
  - Each thread handles one trajectory, i.e. one scenario, of the processes X and Y;
  - Threads do mostly the same operations, only on different random scenarios; the minor divergence is due to default events;
  - Coalesced global memory accesses whenever possible;
  - Constant memory used for the simulation parameters, shared memory used as a buffer.

Accommodating GPU memory limitations

We simulate the paths piece by piece w.r.t. time:
- the 1st kernel launch starts from (X_0, Y_0) and generates paths over [t_0, ..., t_b];
- the (i+1)-th kernel launch starts from (X_{ib}, Y_{ib}) and generates paths over [t_{ib}, ..., t_{(i+1)b}];
- the last kernel launch generates paths up to T, with B = \lceil n/b \rceil launches in total;
- after each kernel launch, the generated piece is transferred from device to host, and the GPU global memory can be dedicated to the upcoming piece;
- the container arrays for the paths of X and Y from t_0 to T are pinned (page-locked) host arrays;
- default events are stored as bits packed in int8 integers: bitwise operations access the default state of a counterparty (see the sketch below).
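A sketch of the bit-packing just mentioned, written in plain NumPy for readability (the actual kernels do the same with bitwise operations in CUDA); the layout, one bit per counterparty within a byte, and the uint8 dtype used here for clarity are illustrative:

import numpy as np

def pack_default_states(defaults):
    """Pack the default indicators of up to 8 counterparties into one byte per (step, path).

    defaults: bool array (num_names, num_steps, num_paths), with num_names <= 8.
    """
    packed = np.zeros(defaults.shape[1:], dtype=np.uint8)
    for name in range(defaults.shape[0]):
        packed |= defaults[name].astype(np.uint8) << np.uint8(name)  # set bit `name` when defaulted
    return packed

def is_defaulted(packed, name):
    """Bitwise read of the default state of counterparty `name` (returns 0 or 1)."""
    return (packed >> np.uint8(name)) & np.uint8(1)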

Leveraging JIT compilation of CUDA kernels

Arrays can be treated as static while parametrizing their sizes, thanks to Python closures and the JIT compilation provided by Numba; this is useful to have the compiler put them in registers when possible:

import math
import numba as nb
from numba import cuda
from numba.cuda.random import xoroshiro128p_dtype, xoroshiro128p_uniform_float32

def compile_cuda_generate_exp1(num_spreads, num_defs_per_path, num_paths, ntpb, stream):
    num_names = num_spreads - 1
    # at this stage, num_names, num_defs_per_path and num_paths are
    # compile-time constants for the following kernel
    @cuda.jit(max_registers=32)
    def _cuda_generate_exp1(out, rng_states):
        block = cuda.blockIdx.x
        block_size = cuda.blockDim.x
        tidx = cuda.threadIdx.x
        pos = tidx + block * block_size
        if pos < num_paths:
            for i in range(num_names):
                for j in range(num_defs_per_path):
                    # out[i, j, pos] contains the (j*num_paths+pos)-th sample of an
                    # exponential random variable of parameter 1, which will later be
                    # used for the i-th component of Ib
                    out[i, j, pos] = -math.log(
                        xoroshiro128p_uniform_float32(rng_states, pos))
    # compiling the kernel
    sig = (nb.float32[:, :, :], nb.from_dtype(xoroshiro128p_dtype)[:])
    cuda_generate_exp1 = _cuda_generate_exp1.compile(sig).configure(
        (num_paths + ntpb - 1) // ntpb, ntpb, stream)
    cuda_generate_exp1._func.get().cache_config(prefer_cache=True)
    # returning the compiled kernel
    return cuda_generate_exp1

CUDA with Numba + PyTorch

- Preparation of the inputs and labels at each time step is done on the GPU (e.g. numerical integration for the XVA labels + aggregation over the counterparties);
- CUDA with Numba and PyTorch interoperability: the (then new) CUDA array interface (__cuda_array_interface__);
- Prepare the inputs/labels on the GPU and let PyTorch use them directly → fewer unnecessary device ↔ host ↔ device copies (see the sketch below);
- Easier to reuse CUDA best practices under PyTorch (e.g. pinned memory).
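A minimal sketch of this interoperability: Numba device arrays expose __cuda_array_interface__, which torch.as_tensor can consume without a host round trip. The API calls are the public Numba/PyTorch ones; the surrounding workflow (shapes, the use of random labels) is purely illustrative and requires a CUDA GPU at runtime:

import numpy as np
import torch
from numba import cuda

# allocate and fill an array directly on the GPU with Numba (e.g. XVA labels)
labels_device = cuda.to_device(np.random.rand(16384, 1).astype(np.float32))

# zero-copy view as a PyTorch CUDA tensor through __cuda_array_interface__
labels_torch = torch.as_tensor(labels_device, device='cuda')

# the tensor can be fed to the training loop directly, no device <-> host copies
assert labels_torch.is_cuda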

Execution times

Over-simulation factor   Learning approach (training)   NMC (inner simulations + reduction)
1                        37 s
32                       54 s
64                       88 s                           5 h 27 min
128                      165 s
256                      317 s
512                      676 s

Table 1. Execution times using an RTX 2070 Super (Turing architecture, laptop version); to be scaled up on a node with 4x V100.

Work in progress

- Wrap up the mathematical analysis of the convergence of the regressions in the infinite but bounded case, based on the work of Dentcheva, Ruszczyński and Shapiro (2009);
- Offline hyper-parameter tuning being unacceptable in our setting: get rid of the learning rate through a probabilistic line search for SGD (Mahsereci and Hennig (2017));
- Explore the use of mixed precision and Tensor Cores;
- Scale the experiments up on a 4x V100 node.

Thank you for your attention