Dept. of Math. University of Oslo Statistical Research Report No. 7 ISSN 0806–3842 July 2008

THE COPULA INFORMATION CRITERION

STEFFEN GRØNNEBERG AND NILS LID HJORT

Abstract. The arguments leading to the classical AIC do not hold for parametric copula models estimated by the pseudo-maximum-likelihood procedure. We derive a proper correction, and name it the Copula Information Criterion (CIC). Like the copula itself, the CIC is invariant to the underlying marginal distributions, but it is fundamentally different from the AIC formula erroneously used by practitioners. We further observe that bias-correction-based information criteria do not exist for a large class of commonly used copulae, such as the Gumbel or Joe copulae. This motivates the investigation of median unbiased information criteria in the fully parametric setting, which could then be ported over to the current, more complex, semi-parametric setting.

Date: 25th July 2008.
Key words and phrases: AIC, copulae.

Contents

1. Introduction and summary
2. Approximation Theorems for the Kullback–Leibler divergence
2.1. Infinite expectation of $q_n$ and $r_n$ for certain copula models
2.2. AIC-like formula
2.3. TIC-like formula
3. Illustration
Appendix A. Mathematical proofs
References

1. Introduction and summary

Let $X_i = (X_{i,(1)}, X_{i,(2)}, \ldots, X_{i,(d)})$ be i.i.d. from some joint distribution, for which we intend to employ one or more parametric copulae. Letting $C_\theta(v)$ be such a copula, with density $c_\theta(v)$ on $[0,1]^d$, in terms of some parameter $\theta$ of length $p$, the pseudo-log-likelihood function is
$$\ell_n(\theta) = \sum_{i=1}^n \log c_\theta(\hat V_i)$$
in terms of the estimated and transformed marginals, say
$$\hat V_i = (\hat V_{i,(1)}, \hat V_{i,(2)}, \ldots, \hat V_{i,(d)}) = \big(F_n^{(1)}(X_{i,(1)}), F_n^{(2)}(X_{i,(2)}), \ldots, F_n^{(d)}(X_{i,(d)})\big),$$

where the traditional choices are
$$F_n^{(k)}(x) = \frac{1}{n+1}\sum_{i=1}^n I\{X_{i,(k)} \le x\}.$$

In particular, the $\hat V_i$ lie strictly inside the unit cube $(0,1)^d$. The maximizer $\hat\theta$ of $\ell_n(\cdot)$ is called the maximum pseudo-likelihood estimator. Often in copula applications one uses

$$\mathrm{AIC}^\star = 2\ell_{n,\max} - 2\,\mathrm{length}(\theta) \tag{1}$$
as a model selection criterion, with $\ell_{n,\max} = \ell_n(\hat\theta)$ being the maximum pseudo-likelihood, carried over from the traditional Akaike information criterion $\mathrm{AIC} = 2\ell^{\#}_{n,\max} - 2\,\mathrm{length}(\theta)$, where $\ell^{\#}_{n,\max}$ is the usual maximum likelihood for a fully parametric model. One computes this $\mathrm{AIC}^\star$ score for each candidate model and in the end chooses the model with the highest score.

This cannot be quite correct, however, as the arguments underlying the derivations of the traditional AIC do not apply here, since the $\ell_n(\cdot)$ at work here is not a proper log-likelihood function for a model. In other words, the $\mathrm{AIC}^\star$ formula above ignores the noise inherent in the transformation step that takes $X_i$ to $\hat V_i$. The $\mathrm{AIC}^\star$ formula would be appropriate only if we could use $U_{i,(k)} = F^\circ_{(k)}(X_{i,(k)})$, with known marginals $F^\circ_{(k)}$.

So this paper is about the appropriate modifications necessary to provide the correct penalisation for complexity. We shall see that the result is the Copula Information Criterion
$$\mathrm{CIC} = 2\ell_{n,\max} - 2(\hat p^* + \hat q^* + \hat r^*) \tag{2}$$
with a formula for $\hat p^* + \hat q^* + \hat r^*$ different from (and more complicated than) merely $\mathrm{length}(\theta)$. These quantities even vary non-trivially with the model parameter, in clear contrast with $\mathrm{length}(\theta)$, which is invariant to the actual value of $\theta$. Surprisingly, we will see that there does not exist any (first order) bias-correction-based information criterion for the maximum pseudo-likelihood estimator for a large class of copulae. This motivates the investigation of a median unbiased information criterion in parametric and semi-parametric estimation. See the further discussion in Section 2.1.

We derive the theoretical basis for the Copula Information Criterion in Section 2. As in the fully parametric case, we get different results depending on whether we assume the parametric model is in fact correct. Subsection 2.2 assumes this and results in an AIC-like formula, while Subsection 2.3 does not and results in a TIC-like¹ formula. Finally, we give a small numerical illustration of the CIC in Section 3 and conclude with some needed mathematical results in Appendix A.
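To fix ideas, here is a minimal numerical sketch of the pseudo-maximum-likelihood procedure and the $\mathrm{AIC}^\star$ score of eq. (1), written in Python under stated assumptions: we pick the bivariate Frank copula used in Section 3, and all function names are our own illustrations rather than any established package's API.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def pseudo_observations(x):
    """Vhat_{i,k} = (rank of X_{i,k} among the k-th coordinates) / (n + 1)."""
    n = x.shape[0]
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1.0
    return ranks / (n + 1.0)

def frank_log_density(v, delta):
    """log c_delta(u, w) for the bivariate Frank copula, delta > 0."""
    u, w = v[:, 0], v[:, 1]
    g1 = 1.0 - np.exp(-delta)
    denom = g1 - (1.0 - np.exp(-delta * u)) * (1.0 - np.exp(-delta * w))
    return np.log(delta) + np.log(g1) - delta * (u + w) - 2.0 * np.log(denom)

def max_pseudo_likelihood(x):
    """Maximize ell_n(delta) = sum_i log c_delta(Vhat_i) over delta."""
    v = pseudo_observations(x)
    res = minimize_scalar(lambda d: -np.sum(frank_log_density(v, d)),
                          bounds=(1e-2, 50.0), method="bounded")
    return res.x, -res.fun  # (delta_hat, ell_{n,max})

# The commonly used, but (as this paper argues) unjustified, criterion:
# delta_hat, ell_max = max_pseudo_likelihood(x)
# aic_star = 2.0 * ell_max - 2.0 * 1.0   # length(theta) = 1 for this family
```

The division by $n + 1$ rather than $n$ keeps the pseudo-observations strictly inside the unit cube, so the log-density is always evaluated at interior points.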

2. Approximation Theorems for the Kullback–Leibler divergence

We start out considering
$$A_n(\theta) = n^{-1}\ell_n(\theta) = n^{-1}\sum_{i=1}^n \log c_\theta(\hat V_i),$$

¹See Claeskens & Hjort (2008) for the definitions and derivations of the AIC and TIC formulae in the fully parametric case.

the pseudo-log-likelihood normalized by sample size. We postulate that the true (but unknown) data-generating distribution has a density $f^\circ$. We then have the unique decomposition
$$f^\circ(x_1, x_2, \ldots, x_d) = c^\circ\big(F_1^\circ(x_1), F_2^\circ(x_2), \ldots, F_d^\circ(x_d)\big)\prod_{i=1}^d f_i^\circ(x_i).$$
This motivates naming $c^\circ$ the true data-generating copula and, more traditionally, naming the $f_i^\circ$ the true data-generating marginal densities. There is convergence almost surely
$$A_n(\theta) \xrightarrow{a.s.} A(\theta) = \int_{[0,1]^d} c^\circ(v)\,\log c_\theta(v)\,dv$$
for each $\theta$. Let $\theta_0$ be the least false value for the family, i.e. the maximizer of $A(\cdot)$, assumed to be unique and an inner point of the parameter space. This is also the minimizer of the Kullback–Leibler divergence
$$\mathrm{KL}(c^\circ, c_\theta) = \int_{[0,1]^d} c^\circ\,\log\frac{c^\circ}{c_\theta}\,dv$$
from the true to the parametric copula. For the following result, which is proved in Genest et al. (1995), let
$$U_n = \frac{\partial A_n(\theta_0)}{\partial\theta} = n^{-1}\,\frac{\partial\ell_n(\theta_0)}{\partial\theta},$$
the normalized pseudo-score function, evaluated at $\theta_0$. Also, let
$$\Sigma = I + \mathrm{Var}\left(\sum_{i=1}^d \int_{[0,1]^d} \frac{\partial\phi(v,\theta_0)}{\partial v_i}\,\big(I\{\xi_i \le v_i\} - v_i\big)\,dC^\circ(v)\right),$$
where $I$ is the information matrix

$$I = E\,\phi(\xi,\theta_0)\,\phi(\xi,\theta_0)^t,$$
where $\phi(\cdot,\theta) = \partial\log c(\cdot,\theta)/\partial\theta$ and $\xi \sim C^\circ$.

Lemma 1. Given the regularity assumptions on $\{c_\theta : \theta\in\Theta\}$ stated in Genest et al. (1995) and somewhat more explicitly in Tsukahara (2005), we have

$$\sqrt n\,U_n \xrightarrow[n\to\infty]{W} U \sim N_p(0,\Sigma).$$
We shall also need the symmetric matrix
$$J = -A''(\theta_0) = -\int_{[0,1]^d} c^\circ(v)\,\frac{\partial^2\log c_{\theta_0}(v)}{\partial\theta\,\partial\theta^t}\,dv,$$
assumed to be of full rank. A useful random process is now
$$H_n(s) = n\big\{A_n(\theta_0 + s/\sqrt n) - A_n(\theta_0)\big\}.$$

It is defined for those $s \in \mathbb R^p$ for which $\theta_0 + s/\sqrt n$ is inside the parameter region; in particular, for any $s \in \mathbb R^p$, $H_n(s)$ is defined for all large $n$.

Lemma 2. We have

$$H_n(s) \xrightarrow[n\to\infty]{W} H(s) = s^t U - \tfrac12\,s^t J s,$$
where $U$ is the limit of $\sqrt n\,U_n$ and where the convergence takes place inside each Skorokhod space $D([-a,a]^p)$.

Proof. Essentially, Taylor expansion demonstrates that
$$H_n(s) = s^t\sqrt n\,U_n - \tfrac12\,s^t J_n s + o_P(1),$$
where $J_n \xrightarrow{a.s.} J$ is the empirical version of $J$. The result then follows from Lemma 1. □

The first consequence of note is the limiting distribution of the maximum pseudo-likelihood estimator. From continuity of the argmax functional,

$$M_n = \mathrm{argmax}(H_n) \xrightarrow[n\to\infty]{W} M = \mathrm{argmax}(H).$$
But this is the same as

$$\sqrt n\,(\hat\theta - \theta_0) \xrightarrow[n\to\infty]{W} J^{-1}U \sim N_p\big(0,\,J^{-1}\Sigma J^{-1}\big).$$
This result duplicates that of Genest et al. (1995) and the slight extension given by Chen & Fan (2005) to allow model mis-specification.

Secondly, we investigate the actual Kullback–Leibler divergence from the true model to that obtained from fitting the parametric family. However, we model the marginal distributions fully non-parametrically through the empirical distribution functions, which are strongly consistent under any possible set of true marginal distributions. As long as we confine ourselves to considering models estimated through the maximum pseudo-likelihood paradigm, we are only interested in the Kullback–Leibler divergence $\mathrm{KL}(c^\circ, c(\cdot,\hat\theta))$ between the true data-generating copula $c^\circ$ and the fitted parametric part of our model $c(\cdot,\hat\theta)$, given by
$$\mathrm{KL}\big(c^\circ, c(\cdot,\hat\theta)\big) = \int_{[0,1]^d} c^\circ\log c^\circ\,dv - \int_{[0,1]^d} c^\circ(v)\,\log c(v,\hat\theta)\,dv.$$
It is rather difficult (but possible) to estimate the first term from data, but we may ignore it, since it is common to all parametric families. For the purposes of model selection it therefore suffices to estimate the second term, which is $A(\hat\theta)$. We now examine

the estimator $A_n(\hat\theta) = n^{-1}\ell_{n,\max}$ vis-à-vis the target $A(\hat\theta)$. Let

$$Z_n = n\{A_n(\hat\theta) - A_n(\theta_0)\} - n\{A(\hat\theta) - A(\theta_0)\}.$$
Some re-arrangement shows that

$$A_n(\hat\theta) - A(\hat\theta) = n^{-1}Z_n + A_n(\theta_0) - A(\theta_0). \tag{3}$$
Also,
$$Z_n = H_n(M_n) + \tfrac12\,n(\hat\theta - \theta_0)^t J(\hat\theta - \theta_0) + o_P(1),$$
in which we define the stochastically significant part as $p_n$, giving rise to

$$p_n := H_n(M_n) + \tfrac12\,n(\hat\theta - \theta_0)^t J(\hat\theta - \theta_0) \xrightarrow[n\to\infty]{W} H(M) + \tfrac12\,U^t J^{-1} U = U^t J^{-1} U =: p.$$
We have
$$p^* = E\,p = E\,U^t J^{-1} U = \mathrm{Tr}(J^{-1}\Sigma) = \mathrm{Tr}\big(J^{-1}I\big) + \mathrm{Tr}\left[J^{-1}\,\mathrm{Var}\left(\sum_{i=1}^d\int_{[0,1]^d}\frac{\partial\phi(v,\theta_0)}{\partial v_i}\,\big(I\{\xi_i\le v_i\}-v_i\big)\,dC^\circ(v)\right)\right].$$

Remark 3. Note that $p^* \ge 0$ as in the standard parametric case.

In the purely parametric case, we have

$$E\,A_n(\theta_0) - A(\theta_0) = 0,$$
and so the standard AIC argument simply ends here (modulo stating a reasonable empirical estimate $\hat p^*$ of $p^*$), giving
$$\mathrm{AIC} = 2\ell_{n,\max} - 2\hat p^*$$
as an asymptotic bias correction for the estimated (relevant part of the) Kullback–Leibler divergence. However, the current situation has

$$E\,A_n(\theta_0) - A(\theta_0) \ne 0.$$
In fact,
$$A_n(\theta_0) - A(\theta_0) = n^{-1}r_n + n^{-1}q_n + \tilde Z_n + o_P(n^{-1}), \tag{4}$$
where

$$r_n = O_P(1),\qquad q_n = O_P(n^{1/2}),\qquad E\,\tilde Z_n = 0.$$
Further,

$$r_n \xrightarrow[n\to\infty]{W} r \qquad\text{and}\qquad q_n/\sqrt n \xrightarrow[n\to\infty]{W} q \sim N(0,\sigma^2).$$

The conclusion is that $A_n(\hat\theta)$ overshoots its target $A(\hat\theta)$ by a random quantity with mean value of the size
$$n^{-1}(p_n^* + q_n^* + r_n^*),$$
where

$$p_n^* = E\,p_n,\qquad r_n^* = E\,r_n,\qquad q_n^* = E\,q_n.$$
There is no simple expression for $p_n^*$, and it is even infinite for some models (Claeskens & Hjort, 2008). The AIC solution is to rather use $p^* = E\,p$, which is finite and is merely $\mathrm{length}(\theta)$ under model conditions. Analogously define

$$p^* = E\,p,\qquad r^* = E\,r,\qquad q^* = E\,q = 0.$$

We will see that there are simple expressions for $r_n^*$ and $q_n^*$, which are invariant with respect to $n$. We will see that $r_n^* = r^*$ and use
$$\mathrm{CIC} = 2\ell_{n,\max} - 2(\hat p^* + \hat q_n^* + \hat r_n^*),$$
analogous to the AIC formula, which under model conditions will be seen to reduce to

$$\mathrm{CIC} = 2\ell_{n,\max} - 2(\hat p^* + \hat r^*).$$
The remainder of this section is devoted to deriving the validity of the decomposition in eq. (4), as well as deriving concrete expressions for $q^*$ and $r^*$. The mathematics behind this decomposition is deferred to the appendix. Define
$$F_\perp^\circ(v) = \big(F_1^\circ(v_1), F_2^\circ(v_2), \ldots, F_d^\circ(v_d)\big)$$
and

$$F_{n,\perp}(v) = \big(F_{n,(1)}(v_1), F_{n,(2)}(v_2), \ldots, F_{n,(d)}(v_d)\big).$$
We will also need the "empirical marginal process", which we define to be

$$\mathbb G_{n,\perp}(v) = \sqrt n\,\big[F_{n,\perp}(v) - F_\perp^\circ(v)\big] = \sqrt n\,\big(F_{n,(1)}(v) - F_1^\circ(v), \ldots, F_{n,(d)}(v) - F_d^\circ(v)\big)$$

$$= \big(\mathbb G_{n,(1)}(v), \ldots, \mathbb G_{n,(d)}(v)\big).$$
If the copula is sufficiently smooth, a Taylor expansion yields
$$A_n(\theta_0) = n^{-1}\sum_{i=1}^n\Big[\log c(F_\perp^\circ(X_i),\theta_0) + \zeta'(F_\perp^\circ(X_i),\theta_0)^t\,\big(\hat V_i - F_\perp^\circ(X_i)\big) + \tfrac12\,\big(\hat V_i - F_\perp^\circ(X_i)\big)^t\,\zeta''(F_\perp^\circ(X_i),\theta_0)\,\big(\hat V_i - F_\perp^\circ(X_i)\big)\Big] + O_P(n^{-3/2}), \tag{5}$$
where
$$\zeta'(v,\theta) = \frac{\partial\log c(v,\theta)}{\partial v}\qquad\text{and}\qquad \zeta''(v,\theta) = \frac{\partial^2\log c(v,\theta)}{\partial v\,\partial v^t}.$$
The $O_P(n^{-3/2})$ term originates from the $R(h) = O(h^{k+1})$ size of the $k$-th remainder in a multivariate Taylor expansion, together with the fact that this implies

$$R\big(F_{n,\perp}(x) - F_\perp^\circ(x)\big) = O_P\big([F_{n,\perp}(x) - F_\perp^\circ(x)]^3\big) = O_P(n^{-3/2}),$$
as $\mathbb G_{n,\perp}(v)$ has a non-degenerate weak process limit (see the proof of Lemma 7). Notice that $O_P(n^{-3/2}) = o_P(n^{-1})$. The first term of eq. (5) has mean equal to $A(\theta_0)$, and can thus be included in $\tilde Z_n$. We can further express the second and third terms of eq. (5) in a more transparent functional form as
$$\frac{1}{\sqrt n}\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,\mathbb G_{n,\perp}(x)\,dF_n(x) + \frac{1}{2n}\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF_n(x).$$
Now notice that
$$\sqrt n\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,\mathbb G_{n,\perp}(x)\,dF_n(x) = n\left[\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t F_{n,\perp}(x)\,dF_n(x) - \int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t F_\perp^\circ(x)\,dF^\circ(x)\right] - n\left[\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t F_\perp^\circ(x)\,dF_n(x) - \int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t F_\perp^\circ(x)\,dF^\circ(x)\right],$$
whose second term is mean zero and can thus be included in $\tilde Z_n$.

The master theorem of this paper is the following, leading to our two information criteria.

Theorem 4. Given the existence of all third order derivatives of $\log c(v,\theta)$ with respect to the components of $v$ for all $v\in(0,1)^d$, the decomposition in eq. (4) is valid for
$$\frac{q_n}{\sqrt n} = \sqrt n\left[\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,F_{n,\perp}(x)\,dF_n(x) - \int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,F_\perp^\circ(x)\,dF^\circ(x)\right] \xrightarrow[n\to\infty]{W} N(0,\sigma^2),$$
$$r_n = \frac12\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF^\circ(x) \xrightarrow[n\to\infty]{W} \frac12\int_{\mathbb R^d}B_{F^\circ,\perp}^\circ(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,B_{F^\circ,\perp}^\circ(x)\,dF^\circ(x) = \frac12\int_{[0,1]^d}B_{C^\circ,\perp}^\circ(v)^t\,\zeta''(v;\theta_0)\,B_{C^\circ,\perp}^\circ(v)\,dC^\circ(v),$$
$$\tilde Z_n = \frac1n\sum_{i=1}^n\log c(F_\perp^\circ(X_i),\theta_0) - A(\theta_0) - \left[\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,F_\perp^\circ(x)\,dF_n(x) - \int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,F_\perp^\circ(x)\,dF^\circ(x)\right],$$
for some $\sigma^2 > 0$, using the notation given in Lemma 7. If $\zeta'(v;\theta_0)^t\mathbf 1$ and $\mathbf 1^t\big\{\zeta''(v;\theta_0)\otimes\Psi(v)\big\}\mathbf 1$ are both $C^\circ$-integrable, we have
$$q_n^* = E\,q_n = \int_{[0,1]^d}\zeta'(v;\theta_0)^t\,(\mathbf 1 - v)\,dC^\circ(v)$$
$$r_n^* = E\,r_n = r^* = E\,r = \frac12\int_{[0,1]^d}\mathbf 1^t\big\{\zeta''(v;\theta_0)\otimes\Psi(v)\big\}\mathbf 1\,dC^\circ(v),$$
in which $\otimes$ is the tensor product and $\Psi(v) = (\psi_{i,j}(v))_{i,j}$ is a symmetric matrix defined through the degree of bivariate quadrant dependence (Joe, 1997) of the bivariate copula projections $C_{i,j}^\circ(u,v) = \Pr\{X_{1,(i)}\le u,\,X_{1,(j)}\le v\}$ as
$$\psi_{i,j}(v) = C_{i,j}^\circ(v_i,v_j) - v_iv_j\quad\text{when } i<j\qquad\text{and}\qquad\psi_{i,i}(v) = v_i - v_i^2,$$
while $i > j$ is defined through symmetry.

Proof. This is a direct application of Lemma 7 and Lemma 10 in the appendix, in conjunction with previous efforts. □
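To make the tensor-product notation concrete, it may help to spell out the $d = 2$ case (our own unpacking, reading $\zeta''\otimes\Psi$ entrywise and using the symmetry of both matrices):
$$\mathbf 1^t\big\{\zeta''(v;\theta_0)\otimes\Psi(v)\big\}\mathbf 1 = \zeta''_{1,1}(v)\,v_1(1-v_1) + \zeta''_{2,2}(v)\,v_2(1-v_2) + 2\,\zeta''_{1,2}(v)\,\big(C_{1,2}^\circ(v_1,v_2) - v_1v_2\big).$$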

We are hence led to the Copula Information Criterion, which estimates $p^*$, $q^*$ and $r^*$ from data. The first version of the Copula Information Criterion is given in Subsection 2.2 and assumes a correct super-model, in the spirit of the original AIC formula. This assumption drastically reduces the complexity of the $p^*$, $q^*$ and $r^*$ formulas and enables estimates which are less variable than in the general case. The second Copula Information Criterion is given in Subsection 2.3 and does not assume a correct super-model. Thus the empirical $p^*$, $q^*$ and $r^*$ versions have to be completely non-parametric, and are thus more variable. This fully non-parametric version is in the spirit of the Takeuchi information criterion (see Claeskens & Hjort, 2008) and is valid even when the copula models considered are non-hierarchical.

Our CIC formula will need efficient numerical approximation of $\Sigma$. The following simple lemma reduces the computation needed to approximate $\Sigma$ through numerical integration and further reduces estimation variance when estimating $\hat\Sigma$. It is implicitly used in Chen & Fan (2005), but we provide a short proof. For vectors $\xi$ and $v$, both in $\mathbb R^d$, define

$$\chi(\xi \le v) = \big(I\{\xi_1 \le v_1\}, \ldots, I\{\xi_d \le v_d\}\big)^t$$
as a multivariate generalization of the indicator function, and notice that we can write $\Sigma$ in vector notation as
$$I + \mathrm{Var}\left(\sum_{i=1}^d\int_{[0,1]^d}\frac{\partial\phi(v,\theta)}{\partial v_i}\,\big(I\{\xi_i\le v_i\}-v_i\big)\,dC_\theta(v)\right) = I + \mathrm{Var}\left(\int_{[0,1]^d}\left(\frac{\partial^2}{\partial\theta\,\partial v^t}\log c(v,\theta)\right)\big(\chi\{\xi\le v\}-v\big)\,dC_\theta(v)\right).$$

Lemma 5. For any pair of copulae $C^\circ$ and $C_\theta$, we have that
$$\mathrm{Var}\left(\int_{[0,1]^d}\left(\frac{\partial^2}{\partial\theta\,\partial v^t}\log c(v,\theta)\right)\big(\chi\{\xi\le v\}-v\big)\,dC_\theta(v)\right) = E\left\{\int_{[0,1]^d}\left(\frac{\partial^2}{\partial\theta\,\partial v^t}\log c(v,\theta)\right)\big(\chi\{\xi\le v\}-v\big)\,dC_\theta(v)\right\}^2,$$
where $\xi \sim C^\circ$ and in which we write $x^2$ for $xx^t$ given a vector $x$.

Proof. We note that through applying Fubini's theorem to
$$E\int_{[0,1]^d}\left(\frac{\partial^2}{\partial\theta\,\partial v^t}\log c(v,\theta)\right)\big(\chi\{\xi\le v\}-v\big)\,dC_\theta(v),$$
we get
$$\int_{[0,1]^d}\frac{\partial\phi(v,\theta)}{\partial v^t}\,\big(E\chi(\xi\le v)-v\big)\,dC_\theta(v) = \int_{[0,1]^d}\frac{\partial\phi(v,\theta)}{\partial v^t}\,\big(C_\perp^\circ(v)-v\big)\,dC_\theta(v),$$
where $C_\perp^\circ(v)$ is the vector of the marginal cumulative distributions of $C^\circ$. As these marginal distributions are all uniform, we have that $C_\perp^\circ(v) = v$, reducing the above integral to zero. □
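As a sanity check of this mean-zero reduction one can simulate. The sketch below reuses frank_log_density from the Section 1 snippet; frank_sample implements the standard conditional-inversion sampler, and the mixed partials are taken by central differences — all of this is our own illustration, not part of the paper's machinery. We take $\xi \sim C^\circ$ to be the independence copula and $C_\theta$ a Frank copula; the Monte-Carlo average of the inner integral should then be near zero.

```python
import numpy as np
rng = np.random.default_rng(1)

def frank_sample(delta, n, rng):
    """Sample from the Frank copula by inverting the conditional distribution."""
    u = rng.uniform(size=n)
    w = rng.uniform(size=n)
    g1 = np.exp(-delta) - 1.0
    v = -np.log1p(w * g1 / (w + np.exp(-delta * u) * (1.0 - w))) / delta
    return np.column_stack([u, v])

def d2_logc_dtheta_dv(v, delta, h=1e-4):
    """Mixed partials d^2 log c / (d delta d v_k) by nested central differences."""
    out = np.empty_like(v)
    for k in range(2):
        vp, vm = v.copy(), v.copy()
        vp[:, k] += h
        vm[:, k] -= h
        dp = (frank_log_density(vp, delta + h) - frank_log_density(vp, delta - h)) / (2 * h)
        dm = (frank_log_density(vm, delta + h) - frank_log_density(vm, delta - h)) / (2 * h)
        out[:, k] = (dp - dm) / (2 * h)
    return out

delta = 5.0
v_in = np.clip(frank_sample(delta, 100_000, rng), 1e-3, 1 - 1e-3)  # keep differences interior
m = d2_logc_dtheta_dv(v_in, delta)                                  # rows of dphi/dv^t
xi = rng.uniform(size=(2_000, 2))                                   # xi ~ C_circ: independence
inner = np.array([np.mean(np.sum(m * ((s <= v_in) - v_in), axis=1)) for s in xi])
print(inner.mean())   # close to zero, as Lemma 5 predicts
```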

2.1. Infinite expectation of $q_n$ and $r_n$ for certain copula models. Copulae with positive tail dependence, such as the Gumbel or Joe copulae, often have infinite $q_n^*$ and $r_n^*$ values, and as $q^* = q_n^*$, this cannot be fixed by simply using $q^*$ instead of $q_n^*$ – which is the AIC solution to this issue. This makes the CIC formula useless in many practically interesting settings. Our asymptotic expansions are still valid, but this infiniteness prevents us from making any bias correction whatsoever.

One possibility for dealing with this non-finiteness of the first moments of $q$ and $r$ could be to develop a Median Unbiased Information Criterion in the style of Uchida & Yoshida (2004) for our setting. However, it seems logical to do this for the standard purely parametric setting first – and not jump right to the current semi-parametric setting. We note that this motivates the investigation of a purely parametric median unbiased information criterion.

2.2. AIC-like formula. This section works under the assumption of a correct super-model, as was the case for the original AIC formula. Although this assumption is often somewhat courageous, it gives natural empirical estimates of $p^*$, $q^*$ and $r^*$ which only vary with $\hat\theta$ in a simple way, leading to less variable estimates. See the parallel discussion of the AIC versus the TIC in Burnham & Anderson (1998, Section 7.3).

Under the assumption $c^\circ = c_{\theta_0}$, we can reduce both the formula for $p^*$ and for $q^*$ through the following theorem. One can find reductions for $r^*$ as well, but not in such a compact form as the formula already given.

Theorem 6. If $c^\circ = c_{\theta_0}$ and if $\zeta'(v;\theta_0)^t\mathbf 1$ and $\mathbf 1^t\big\{\zeta''(v;\theta_0)\otimes\Psi(v)\big\}\mathbf 1$ are $C^\circ$-integrable, we have that
$$p^* = \mathrm{length}(\theta) + \mathrm{Tr}\left[I^{-1}\,E\left(\int_{[0,1]^d}\frac{\partial\phi(v,\theta_0)}{\partial v^t}\,\big(\chi(\xi\le v)-v\big)\,dC(v,\theta_0)\right)^2\right],$$
and $q^* = 0$.

Proof. The assumption $c^\circ = c_{\theta_0}$ validates the information matrix equality $J = I$, giving the familiar-looking
$$p^* = \mathrm{length}(\theta) + \mathrm{Tr}\left[I^{-1}\,\mathrm{Var}\left(\int_{[0,1]^d}\frac{\partial\phi(v,\theta_0)}{\partial v^t}\,\big(\chi(\xi\le v)-v\big)\,dC(v,\theta_0)\right)\right]$$
for $\xi \sim C^\circ$. This can be simplified through Lemma 5 to give
$$p^* = \mathrm{length}(\theta) + \mathrm{Tr}\left[I^{-1}\,E\left(\int_{[0,1]^d}\frac{\partial\phi(v,\theta_0)}{\partial v^t}\,\big(\chi(\xi\le v)-v\big)\,dC(v,\theta_0)\right)^2\right].$$

As for $q^*$, we notice that
$$q^* = \int_{[0,1]^d}\zeta'(v;\theta_0)^t(\mathbf 1 - v)\,dC(v;\theta_0) = \sum_{k=1}^d\int_0^1\cdots\int_0^1\frac{\partial\log c(v;\theta_0)}{\partial v_k}\,c(v;\theta_0)\,(1-v_k)\,dv_k\prod_{i\ne k}dv_i = \sum_{k=1}^d\int_0^1\cdots\int_0^1\frac{\partial c(v;\theta_0)}{\partial v_k}\,(1-v_k)\,dv_k\prod_{i\ne k}dv_i.$$
Partial integration shows
$$\int_0^1\frac{\partial c(v;\theta_0)}{\partial v_k}\,(1-v_k)\,dv_k = c(v;\theta_0)(1-v_k)\Big|_{v_k=0}^{v_k=1} + \int_0^1 c(v;\theta_0)\,dv_k = \int_0^1 c(v;\theta_0)\,dv_k - c(v;\theta_0)\Big|_{v_k=0}, \tag{6}$$
yielding
$$q^* = \sum_{k=1}^d\int_0^1\cdots\int_0^1\left[\int_0^1 c(v;\theta_0)\,dv_k - c(v;\theta_0)\Big|_{v_k=0}\right]\prod_{i\ne k}dv_i = 0,$$
since both terms integrate to one, the copula density having uniform marginals. □
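Theorem 6's conclusion $q^* = 0$ is also easy to probe by simulation under the model. A small sketch, again reusing the illustrative frank_log_density and frank_sample helpers from the earlier snippets, with $\zeta'$ by central differences (our own shortcut):

```python
def zeta1(v, delta, h=1e-5):
    """zeta'(v; theta) = d log c / d v, componentwise central differences."""
    out = np.empty_like(v)
    for k in range(2):
        vp, vm = v.copy(), v.copy()
        vp[:, k] += h
        vm[:, k] -= h
        out[:, k] = (frank_log_density(vp, delta) - frank_log_density(vm, delta)) / (2 * h)
    return out

delta = 5.0
v = np.clip(frank_sample(delta, 500_000, rng), 1e-4, 1 - 1e-4)   # v ~ C_{theta_0}
print(np.mean(np.sum(zeta1(v, delta) * (1.0 - v), axis=1)))      # approx 0 under the model
```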

This motivates the Copula Information Criterion

$$\mathrm{CIC} = 2\ell_{n,\max} - 2(\hat p^* + \hat r^*), \tag{7}$$
where $\hat p^*$ and $\hat r^*$ estimate $p^*$ and $r^*$ respectively. The plug-in estimator
$$\hat r^* = \frac12\int_{[0,1]^d} c(v;\hat\theta)\,\mathbf 1^t\big\{\zeta''(v;\hat\theta)\otimes\Psi(v)\big\}\mathbf 1\,dv$$
is the natural choice for estimating $r^*$, while the natural estimation procedure for $p^*$ is to use
$$\hat p^* = \mathrm{length}(\theta) + \mathrm{Tr}\big(\hat I^-\hat W\big),$$
denoting the generalized inverse of $\hat I$ by $\hat I^-$ and where $\hat I$ is the pseudo-empirical information matrix
$$\hat I = E_{\hat\theta}\,\phi(\tilde\xi,\hat\theta)\,\phi(\tilde\xi,\hat\theta)^t \tag{8}$$
and
$$\hat W = E_{\hat\theta}\left\{\int_{[0,1]^d}\left(\frac{\partial^2}{\partial\theta\,\partial v^t}\log c(v,\hat\theta)\right)\big(\chi\{\tilde\xi\le v\}-v\big)\,dC(v,\hat\theta)\right\}^2, \tag{9}$$
where $\tilde\xi \sim C_{\hat\theta}$. These integrals can easily be evaluated through numerical integration routines such as Monte Carlo simulation. We note that these estimators are somewhat different from the ones suggested by Genest et al. (1995), which are based on using the empirical copula as a plug-in estimate of the expectation operator $E_{\hat\theta}$, making them closer to multivariate rank statistics. This would give
$$\hat I^\star = \int_{u\in[0,1]^d}\phi(u,\hat\theta)\,\phi(u,\hat\theta)^t\,d\hat C(u) = \frac1n\sum_{k=1}^n\phi(\hat\xi^{(k)},\hat\theta)\,\phi(\hat\xi^{(k)},\hat\theta)^t$$
and
$$\hat W^\star = \frac1n\sum_{k=1}^n\left\{\int_{[0,1]^d}\left(\frac{\partial^2}{\partial\theta\,\partial v^t}\log c(v,\hat\theta)\right)\big(\chi\{\hat\xi^{(k)}\le v\}-v\big)\,dC(v,\hat\theta)\right\}^2,$$
for $\hat\xi^{(k)} = F_{n,\perp}(X_k)$. Although these are somewhat less computationally demanding, simulations indicate that they perform worse than the estimators of eqs. (8) and (9) under model conditions. However, their rank structure makes them more robust, and they can thus serve as estimators in middle-ground situations where model conditions are believed to be nearly true.
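As a concrete illustration of eqs. (7)–(9), the following Python sketch assembles the AIC-like CIC for the one-parameter Frank family by Monte-Carlo integration, reusing the helpers introduced in the earlier snippets (frank_log_density, frank_sample, max_pseudo_likelihood, d2_logc_dtheta_dv, rng). The finite-difference derivatives and the boundary clipping are our own simplifications, not part of the paper's theory.

```python
def phi_theta(v, delta, h=1e-4):
    """phi(v; theta) = d log c / d theta, by central differences."""
    return (frank_log_density(v, delta + h) - frank_log_density(v, delta - h)) / (2 * h)

def frank_cdf(v, delta):
    """C_delta(u, w) for the Frank copula."""
    gu, gw = np.expm1(-delta * v[:, 0]), np.expm1(-delta * v[:, 1])
    return -np.log1p(gu * gw / np.expm1(-delta)) / delta

def zeta2_quadratic(v, delta, h=1e-3):
    """1^t {zeta''(v) (x) Psi(v)} 1 = sum_{i,j} zeta''_{ij}(v) psi_{ij}(v) for d = 2."""
    u, w = v[:, 0], v[:, 1]
    f = lambda a, b: frank_log_density(np.column_stack([a, b]), delta)
    z11 = (f(u + h, w) - 2.0 * f(u, w) + f(u - h, w)) / h**2
    z22 = (f(u, w + h) - 2.0 * f(u, w) + f(u, w - h)) / h**2
    z12 = (f(u + h, w + h) - f(u + h, w - h)
           - f(u - h, w + h) + f(u - h, w - h)) / (4.0 * h**2)
    psi12 = frank_cdf(v, delta) - u * w
    return z11 * u * (1 - u) + z22 * w * (1 - w) + 2.0 * z12 * psi12

def cic_aic_like(x, n_mc=20_000, n_outer=2_000):
    """CIC = 2 ell_max - 2 (p_hat* + r_hat*), with eqs. (8)-(9) done by Monte Carlo."""
    delta_hat, ell_max = max_pseudo_likelihood(x)
    v_mc = np.clip(frank_sample(delta_hat, n_mc, rng), 5e-3, 1 - 5e-3)
    r_hat = 0.5 * np.mean(zeta2_quadratic(v_mc, delta_hat))      # plug-in r_hat*
    i_hat = np.mean(phi_theta(v_mc, delta_hat) ** 2)             # eq. (8), p = 1
    m = d2_logc_dtheta_dv(v_mc, delta_hat)
    xi = frank_sample(delta_hat, n_outer, rng)                   # outer xi ~ C_{delta_hat}
    g = np.array([np.mean(np.sum(m * ((s <= v_mc) - v_mc), axis=1)) for s in xi])
    w_hat = np.mean(g ** 2)                                      # eq. (9), p = 1
    p_hat = 1.0 + w_hat / i_hat                                  # length(theta) + Tr(I^- W)
    return 2.0 * ell_max - 2.0 * (p_hat + r_hat)
```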

2.3. TIC-like formula. The CIC formula of eq. (7) was much simplified by the assumption $c^\circ = c_{\theta_0}$. This subsection gives a CIC formula without this assumption. Hence, it is suitable for performing model selection between non-nested competing copula classes, which is often the case of interest. This section provides natural non-parametric estimates $\hat p^*$, $\hat q^*$ and $\hat r^*$.

Let the empirical copula $\hat C_n$ be given by
$$\hat C_n(u_1,\ldots,u_d) = \frac{1}{n+1}\sum_{i=1}^n I\big\{\hat F_{n,1}(X_{i,1})\le u_1, \ldots, \hat F_{n,d}(X_{i,d})\le u_d\big\}.$$
Natural estimators for $q^*$ and $r^*$ are then the plug-in estimators
$$\hat q^* = \int_{[0,1]^d}\zeta'(v;\hat\theta)^t\,(\mathbf 1 - v)\,d\hat C(v)$$
and
$$\hat r^* = \frac12\int_{[0,1]^d}\mathbf 1^t\big\{\zeta''(v;\hat\theta)\otimes\Psi(v)\big\}\mathbf 1\,d\hat C(v),$$
which converge almost surely to their targets through the uniform strong consistency of the empirical copula process and the strong consistency of $\hat\theta$. As for the estimation of $p^*$, we follow Chen & Fan (2005) in using

$$\hat p^* = \mathrm{Tr}\big(\hat J^-\hat\Sigma\big),$$
where
$$\hat J = -\int_{u\in[0,1]^d}\frac{\partial^2\log c(v;\hat\theta)}{\partial\theta\,\partial\theta^t}\,d\hat C_n = -\frac1n\sum_{i=1}^n\frac{\partial^2\log c(\hat\xi^{(i)},\hat\theta)}{\partial\theta\,\partial\theta^t}$$
and
$$\hat\Sigma = \frac1n\sum_{i=1}^n\left\{\phi(\hat\xi^{(i)};\hat\theta)+\sum_{j=1}^d\hat W_j(\hat\xi^{(i)};\hat\theta)\right\}\left\{\phi(\hat\xi^{(i)};\hat\theta)+\sum_{j=1}^d\hat W_j(\hat\xi^{(i)};\hat\theta)\right\}^t$$
with
$$\hat W_j(\hat\xi^{(i)};\hat\theta) = \frac1n\sum_{s=1,\,s\ne i}^n\frac{\partial\phi(v;\hat\theta)}{\partial v_j}\bigg|_{v=\hat\xi^{(s)}}\Big(I\big\{\hat\xi_j^{(i)}\le\hat\xi_j^{(s)}\big\}-\hat\xi_j^{(s)}\Big),$$
again using $\hat\xi^{(k)} = F_{n,\perp}(X_k)$ and $\phi(\cdot,\theta) = \partial\log c(\cdot,\theta)/\partial\theta$.
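A matching sketch of the rank-based (TIC-like) quantities, under the same assumptions and illustrative helpers as before; the double loop is the $O(n^2)$ computation of the $\hat W_j$ terms.

```python
def tic_like_estimates(x, h=1e-4):
    """Rank-based sketches of q_hat*, J_hat and Sigma_hat for a one-parameter
    bivariate family; derivatives by central differences (valid while 1/(n+1) > h)."""
    n = x.shape[0]
    xi = pseudo_observations(x)                  # xi_hat^(k) = F_{n,perp}(X_k)
    delta_hat, _ = max_pseudo_likelihood(x)
    # q_hat*: the integral against dC_hat is an average over the pseudo-observations.
    q_hat = np.mean(np.sum(zeta1(xi, delta_hat) * (1.0 - xi), axis=1))
    # J_hat: minus the mean second theta-derivative of log c at the pseudo-observations.
    ll = lambda d: frank_log_density(xi, d)
    j_hat = -np.mean((ll(delta_hat + h) - 2.0 * ll(delta_hat) + ll(delta_hat - h)) / h**2)
    # Sigma_hat: squared (score + sum_j W_j), averaged over the pseudo-observations.
    score = phi_theta(xi, delta_hat)
    m = d2_logc_dtheta_dv(xi, delta_hat)         # dphi/dv_j at each pseudo-observation
    w = np.zeros(n)
    for j in range(2):
        for i in range(n):
            term = m[:, j] * ((xi[i, j] <= xi[:, j]) - xi[:, j])
            w[i] += (term.sum() - term[i]) / n   # leave out s = i
    sigma_hat = np.mean((score + w) ** 2)
    return q_hat, sigma_hat / j_hat              # (q_hat*, p_hat* = Tr(J_hat^- Sigma_hat))
```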

3. Illustration

Consider the Frank and the Plackett copulae (families B3 and B2 in Joe, 1997, respectively) and denote their cumulative distribution functions by $C_{F,\delta}$ and $C_{P,\delta}$. Figure 1(a)–(d) shows the CIC penalty values $p^* + r^*$ and $p^*$ for the two models with varying $\delta$. It is clear that the $r^*$-term dominates the CIC penalty, but it reflects the degree of positive or negative dependence in the data, which any sensible set of models should agree on. The random noise in the approximated $p^*$ values is due to variation inherent in Monte Carlo integration.

Assume $X \sim N(0,1)$ and $Y \sim N(0,1)$, but that the copula of $(X,Y)$ is a copula mixture of the form $\lambda C_{F,\delta_F} + (1-\lambda)C_{P,\delta_P}$ with $\lambda = 80\%$. We want to use the known unbiasedness of the AIC in the fully parametric case to illustrate that the CIC works as it should. We can do this as follows.

If we restrict attention to parametric models with normal marginals and either a Frank or a Plackett copula, we have
$$f_i(x,y;\delta) = c_i\big(\Phi(x), \Phi(y);\delta\big)\,\varphi(x)\,\varphi(y),$$
using the information that both marginals are known to be standard normal, and where $i\in\{F,P\}$. The true copula is known to be the mixture of the two, with density the corresponding mixture of the two copula densities. Denote this density by $c^\circ$, and let $f^\circ$ be the full data-generating mechanism of $(X,Y)$. We have
$$f^\circ(x,y) = c^\circ\big(\Phi(x),\Phi(y)\big)\,\varphi(x)\,\varphi(y).$$
This means that the Kullback–Leibler divergence between $f^\circ$ and $f_{i,\delta}$ is
$$\mathrm{KL}(f^\circ, f_{i,\delta}) = E\log\frac{f^\circ(X,Y)}{f_{i,\delta}(X,Y)} = E\log\frac{c^\circ(\Phi(X),\Phi(Y))}{c_i(\Phi(X),\Phi(Y);\delta)} = \mathrm{KL}(c^\circ, c_{i,\delta}),$$
implying
$$\Delta\mathrm{KL}(f^\circ) := \mathrm{KL}(f^\circ, f_{F,\delta_F}) - \mathrm{KL}(f^\circ, f_{P,\delta_P}) = \mathrm{KL}(c^\circ, c_{F,\delta_F}) - \mathrm{KL}(c^\circ, c_{P,\delta_P}). \tag{10}$$
Consider the following three formulae.

1. The standard AIC formula $2\ell^{\#}_{n,\max} - 2\,\mathrm{length}(\delta)$, where $\ell^{\#}_{n,\max}$ is the observed maximum likelihood of the full likelihood of $(X,Y)$ under the assumption that $X \sim N(\mu_1,\sigma_1^2)$ and $Y \sim N(\mu_2,\sigma_2^2)$, with either a Frank or a Plackett copula specifying their joint distribution. Denote the observed AIC scores simply

by $\mathrm{AIC}_F$ for the Frank-copula case and $\mathrm{AIC}_P$ for the Plackett-copula case, and let $\Delta\mathrm{AIC} = \mathrm{AIC}_F - \mathrm{AIC}_P$.

2. The wrong, but classically used, AIC-like formula $2\ell_{n,\max} - 2\,\mathrm{length}(\theta)$, where $\ell_{n,\max}$ is the observed maximum pseudo-likelihood for the copula model. Denote the observed (but unjustified) AIC scores by $\mathrm{AIC}^\star_F$ and $\mathrm{AIC}^\star_P$ and let $\Delta\mathrm{AIC}^\star = \mathrm{AIC}^\star_F - \mathrm{AIC}^\star_P$.

3. The CIC formula $2\ell_{n,\max} - 2(\hat p^* + \hat r^*)$ calculated under the assumption of a correctly specified model. Denote the observed CIC scores by $\mathrm{CIC}_F$ and $\mathrm{CIC}_P$ and let $\Delta\mathrm{CIC} = \mathrm{CIC}_F - \mathrm{CIC}_P$.

Equation (10) shows that if the $\mathrm{AIC}^\star$ formula were correct, $\Delta\mathrm{AIC}^\star$ should be approximately equal to $\Delta\mathrm{AIC}$, while if the CIC formula is correct, $\Delta\mathrm{CIC}$ should be approximately equal to $\Delta\mathrm{AIC}$. A simulated sample of $(X,Y)$ with the mixture copula is illustrated in Figure 1(e)–(g), with $n = 2000$. It is not obvious which model is best, as the fit of the PMLE models varies across different parts of the sample space. However, assume that we are interested in which model has the smallest Kullback–Leibler divergence to the true model. Notice that we use the AIC-like formulae, and not the TIC-like formulae; this approximation is typical in model selection practice, as the TIC-like formulae have much higher variability than the AIC-like formulae. We ran 500 simulations as above, and for each simulation calculated the AIC, $\mathrm{AIC}^\star$ and CIC values. Table 1 shows that the CIC on average agrees with the fully parametric AIC value, while the mean of the incorrectly motivated $\mathrm{AIC}^\star$ misses the mean of AIC almost exactly by the average of $-2\Delta(p^* + r^*)$, the correction term which separates $\mathrm{AIC}^\star$ and CIC.
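For completeness, a schematic of the simulation set-up behind Table 1, under heavy assumptions: plackett_sample and the fitting routines in the commented loop are hypothetical placeholders, not implementations from the paper.

```python
from scipy.stats import norm

# Hypothetical stand-in: substitute a genuine Plackett conditional-inversion sampler.
plackett_sample = frank_sample

def sample_mixture(n, lam, delta_f, delta_p, rng):
    """(X, Y) with N(0,1) margins and copula lam*C_{F,delta_f} + (1-lam)*C_{P,delta_p}."""
    pick = rng.uniform(size=n) < lam
    uv = np.where(pick[:, None],
                  frank_sample(delta_f, n, rng),
                  plackett_sample(delta_p, n, rng))
    return norm.ppf(uv)   # the probability transform gives standard normal margins

# Schematic replication loop (aic_full, aic_star and cic_for are hypothetical fitters):
# for b in range(500):
#     xy = sample_mixture(2000, 0.8, delta_f, delta_p, rng)
#     d_aic  = aic_full(xy, "frank") - aic_full(xy, "plackett")
#     d_star = aic_star(xy, "frank") - aic_star(xy, "plackett")
#     d_cic  = cic_for(xy, "frank") - cic_for(xy, "plackett")
```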

[Figure 1 about here: eight panels.]

Figure 1. Plots for the illustration. (a) Values of $p^* + r^*$ in the CIC formula for the Frank copula with varying $\delta$; (b) values of $p^*$ in the CIC formula for the Frank copula with varying $\delta$; (c) values of $p^* + r^*$ in the CIC formula for the Plackett copula with varying $\delta$; (d) values of $p^*$ in the CIC formula for the Plackett copula with varying $\delta$; (e) actual simulated data $(X_i, Y_i)$; (f) marginally transformed observed data $(\hat U_i, \hat V_i)$; (g) simulated sample for the Plackett copula with dependence parameter equal to the PMLE of the original simulated data; (h) simulated sample for the Frank copula with dependence parameter equal to the PMLE of the original simulated data.

                  Min.     1st Qu.   Median    Mean    3rd Qu.    Max.
ΔAIC            −108.80    −26.73    −6.13    −5.28     16.87     84.95
ΔCIC            −122.90    −28.80    −4.65    −5.00     18.14     93.15
ΔAIC⋆           −120.30    −26.23    −2.07    −2.43     20.72     95.72
ΔAIC − ΔCIC      −27.52     −7.42    −0.64    −0.28      6.51     39.26
ΔAIC − ΔAIC⋆     −30.10     −9.99    −3.22    −2.85      3.94     36.69
PMLE δ_F          12.80     13.50    13.77    13.77     14.03     15.04
PMLE δ_P          43.06     47.05    48.74    48.71     50.12     56.13
p*_P + r*_P        2.78      2.83     2.84     2.84      2.85      2.96
p*_F + r*_F        4.00      4.04     4.06     4.06      4.08      4.13
2Δ(p* + r*)       −2.65     −2.50    −2.44    −2.44     −2.39     −2.23

Table 1. Summary statistics for the simulation.

Appendix A. Mathematical proofs

Lemma 7. We have that
$$\frac1n\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF_n(x) = \frac1n\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF^\circ(x) + o_P(n^{-1})$$
and that
$$\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF^\circ(x) \xrightarrow[n\to\infty]{W} \int_{\mathbb R^d}B_{F^\circ,\perp}^\circ(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,B_{F^\circ,\perp}^\circ(x)\,dF^\circ(x) = \int_{[0,1]^d}B_{C^\circ,\perp}^\circ(v)^t\,\zeta''(v;\theta_0)\,B_{C^\circ,\perp}^\circ(v)\,dC^\circ(v),$$

where $B_{F^\circ,\perp}^\circ$ and $B_{C^\circ,\perp}^\circ$ are the vectors of marginal projections of the $F^\circ$- and $C^\circ$-Brownian bridges $B_{F^\circ}^\circ$ and $B_{C^\circ}^\circ$. Further,
$$E\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF^\circ(x) = E\int_{[0,1]^d}B_{C^\circ,\perp}^\circ(v)^t\,\zeta''(v;\theta_0)\,B_{C^\circ,\perp}^\circ(v)\,dC^\circ(v) = \int_{[0,1]^d}\mathbf 1^t\big\{\zeta''(v;\theta_0)\otimes\Psi(v)\big\}\mathbf 1\,dC^\circ(v),$$
in which $\otimes$ is the tensor product and $\Psi(v) = (\psi_{i,j}(v))_{i,j}$ is a symmetric $d\times d$ matrix defined through the degree of bivariate quadrant dependence (Joe, 1997) of the bivariate copula projections $C_{i,j}^\circ(u,v) = \Pr\{X_{1,(i)}\le u,\,X_{1,(j)}\le v\}$ as

$$\psi_{i,j}(v) = C_{i,j}^\circ(v_i,v_j) - v_iv_j\quad\text{when } i<j\qquad\text{and}\qquad\psi_{i,i}(v) = v_i - v_i^2,$$
while $i > j$ is defined through symmetry.

Proof. We first deal with the weak convergence. As the marginal empirical distributions are the marginal projections (which are continuous in Skorokhod space) of the full multivariate empirical distribution, their weak convergence properties are known through the weak convergence properties of the full multivariate empirical process $\mathbb G_n(t)$. By the Donsker theorem for empirical processes indexed by multivariate time (Bickel & Wichura, 1971), we know that $\mathbb G_n(t) \xrightarrow[n\to\infty]{W} B_{F^\circ}^\circ$ in the Skorokhod space $D$, where $B_{F^\circ}^\circ$ is the Brownian bridge with $d$-dimensional time. Hence, $\mathbb G_{n,\perp}(x)$ converges weakly to the vector of marginal projections of $B_{F^\circ}^\circ$ thanks to the general continuous mapping theorem found in e.g. Billingsley (1968). Clearly,

$$\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF_n(x) = \int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF^\circ(x) + \int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,d[F_n - F^\circ](x).$$

We know that $\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)$ has a weak limit in the Skorokhod space, again through the use of the continuous mapping theorem. As it has a weak limit, it is

$O_P(1)$ with respect to the Skorokhod metric. The limit is a continuous functional of the continuous Brownian bridge, which secures the equivalence between its weak convergence with respect to both Skorokhod-type convergence and sup-norm-type convergence (see the comments following Theorem 2.1.3 of

Gaenssler & Stute, 1979). Hence, the variable is also $O_P(1)$ with respect to the sup-norm. The Glivenko–Cantelli theorem states that $F^\circ - F_n = o_P(1)$ with respect to the sup-norm, which implies

$$\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,d[F_n - F^\circ](x) = o_P(1).$$

A new application of the continuous mapping theorem (the integral operator is continuous with respect to the sup-norm) now shows that

Z t 00 ◦ ◦ Gn,⊥(x) ζ (F⊥(x); θ0)Gn,⊥(x) dF (x) Rd has a non-degenerate weak limit of the stated form. We now move on to the expectation statement. Invoke Fubini’s Theorem and notice that

$$E\int_{\mathbb R^d}\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\,dF^\circ(x) = \int_{\mathbb R^d}E\big[\mathbb G_{n,\perp}(x)^t\,\zeta''(F_\perp^\circ(x);\theta_0)\,\mathbb G_{n,\perp}(x)\big]\,dF^\circ(x) = \sum_{i=1}^d\sum_{j=1}^d\int_{\mathbb R^d}\zeta''(F_\perp^\circ(x);\theta_0)_{i,j}\,E\big[\mathbb G_{n,\perp}^{(i)}(x_i)\,\mathbb G_{n,\perp}^{(j)}(x_j)\big]\,dF^\circ(x).$$

The inner expectation is in fact non-varying with respect to $n$. This can be seen by
$$E\,\mathbb G_{n,\perp}^{(i)}(x_i)\,\mathbb G_{n,\perp}^{(j)}(x_j) = nE\left[\frac1{n^2}\sum_{s=1}^n\sum_{t=1}^n I\{X_{s,(i)}\le x_i\}\,I\{X_{t,(j)}\le x_j\} - F_i^\circ(x_i)F_j^\circ(x_j)\right] = \Pr\{X_{1,(i)}\le x_i,\,X_{1,(j)}\le x_j\} - F_i^\circ(x_i)F_j^\circ(x_j) = \psi_{i,j}(F_\perp^\circ(x)).$$
The definition of the tensor product then gives the stated reduction of form. The same steps apply to the $C^\circ$-Brownian bridge. We will see that the expectation of
$$\int_{[0,1]^d}B_{C^\circ,\perp}^\circ(v)^t\,\zeta''(v;\theta_0)\,B_{C^\circ,\perp}^\circ(v)\,dC^\circ(v)$$
can be found through the known discrete case as follows. Assume $i\ne j$ and notice that
$$E\left[\frac1n\sum_{s=1}^n\sum_{t=1}^n I\{X_{s,(i)}\le x_i\}\,I\{X_{t,(j)}\le x_j\} - \Pr\{X_{1,(i)}\le x_i,\,X_{1,(j)}\le x_j\}\right] = \int_0^\infty\Pr\left\{\frac1n\sum_{s=1}^n\sum_{t=1}^n I\{X_{s,(i)}\le x_i\}\,I\{X_{t,(j)}\le x_j\} - \Pr\{X_{1,(i)}\le x_i,\,X_{1,(j)}\le x_j\}\ge y\right\}dy.$$
We would like to show that this integral, which we know is zero, converges to
$$E\Big[B_{C^\circ,\perp}^{\circ\,(i)}(x_i)\,B_{C^\circ,\perp}^{\circ\,(j)}(x_j) - \Pr\{X_{1,(i)}\le x_i,\,X_{1,(j)}\le x_j\}\Big],$$
as $n\to\infty$. To do this, one can bound
$$\gamma_n(y) := \Pr\left\{\frac1n\sum_{s=1}^n\sum_{t=1}^n I\{X_{s,(i)}\le x_i\}\,I\{X_{t,(j)}\le x_j\} - \Pr\{X_{1,(i)}\le x_i,\,X_{1,(j)}\le x_j\}\ge y\right\}$$
uniformly in $n$ by an integrable function, and then apply Lebesgue's theorem on dominated convergence. Theorem 1 of Bickel & Wichura (1971), in conjunction with their inequality (1), shows that for any multivariate empirical Kiefer process $K_{F,m}(s,t)$ we have
$$\Pr\left\{\sup_{0\le s\le 1,\,t}|K_{F,m}(s,t)|\ge b\right\}\le A/b^4$$

for some universal constant $A$. This implies that also $\gamma_n(b)\le A/b^4$, as the average in $\gamma_n$ is a bivariate empirical process. Thus we have the desired convergence. □

We will need the following two extensions of the well-known Lemma 3.9.17 of van der Vaart & Wellner (1996) for dealing with the $q^*$ term in our CIC formulae. These are to our knowledge new, and should be useful for a variety of asymptotic investigations connected with copulae. Note that Fermanian et al. (2004) use the bivariate integration by parts formula that is at the center of Theorem 9, but do not provide the general formula due to Zaremba (1968).

Lemma 8. Let $C(\mathbb R^d)^d$ be the space of continuous $\mathbb R^d$-valued functions on the hypercube, and let $BV_C(\mathbb R^d)$ be the set of continuous functions from $\mathbb R^d$ to $\mathbb R$ that have bounded variation in the sense of Hardy and Krause (see e.g. Zaremba, 1968).

Given a continuous vector-valued map

$$f:\mathbb R^d\to\mathbb R^d,$$
the functional map $\Phi: C(\mathbb R^d)^d\times BV_C(\mathbb R^d)\to\mathbb R$ defined by
$$\Phi(U,V) = \int_{\mathbb R^d}f(x)^t\,U(x)\,dV(x)$$
is Hadamard differentiable with respect to the supremum norm, with differential
$$\dot\Phi_{F_\perp^\circ,F^\circ}(h_1,h_2) = \int_{\mathbb R^d}f(x)^t\,F_\perp^\circ(x)\,dh_2(x) + \int_{\mathbb R^d}f(x)^t\,h_1(x)\,dF^\circ(x),$$
where $h_1, F_\perp^\circ\in C(\mathbb R^d)^d$ and $h_2, F^\circ\in BV_C(\mathbb R^d)$.

Proof. For $h_{1t}\to h_1$ and $h_{2t}\to h_2$, define $U_t = U + th_{1t}$ and $V_t = V + th_{2t}$. Analogous to the proof of Lemma 3.9.17 of van der Vaart & Wellner (1996), we have the decomposition
$$\frac1t\left(\int f(x)^t\,U_t(x)\,dV_t(x) - \int f(x)^t\,U(x)\,dV(x)\right) - \dot\Phi_{F_\perp^\circ,F^\circ}(h_1,h_2) = \int f(x)^t\,h_1\,d(V_t-V) + \int f(x)^t\,(h_{1t}-h_1)\,d(V_t-V).$$

As we assume continuity of both $h_1$ and $h_2$, it is clear that the above difference converges to zero. Note that the continuity restriction is unnecessary, as we could extend the argument of van der Vaart & Wellner (1996), but we will need no greater generality in the following. □

We will deal with $h_2$ functions which are of unbounded variation in the following limit theorems. This is easily solved in the univariate case, as one can apply the well-known integration by parts formula for Stieltjes integrals. This is also the case for the current multivariate setting, but here the integration by parts formula is rather esoteric. To state the result, which allows $h_2$ to be of (multivariate) unbounded variation, we require the following, rather technical, notation of Zaremba (1968). Define $\Delta_j^*$ as the evaluation operator
$$\Delta_j^*\,f(x_1,\ldots,x_d) := f(x_1,\ldots,x_d)\Big|_{x_j=0}^{x_j=1}$$

$$= f(x_1,\ldots,x_{j-1},1,x_{j+1},\ldots,x_d) - f(x_1,\ldots,x_{j-1},0,x_{j+1},\ldots,x_d)$$

and let $\Delta^*_{j(1),\ldots,j(k)}$ be defined as $\Delta^*_{j(1)}\cdots\Delta^*_{j(k)}$, and notice that the $\Delta_j^*$ commute. Let $\varphi(r,\ldots,r+k-1;\,r+k,\ldots,s)$ be an expression which depends only on the partition of the variables $j(r),\ldots,j(s)$ into the sets $\{j(r),\ldots,j(r+k-1)\}$ and $\{j(r+k),\ldots,j(s)\}$. We will let
$$\sum^*_{1,\ldots,d;\,k}\varphi(r,\ldots,r+k-1;\,r+k,\ldots,s)$$
be the sum over all the expressions derived from $\varphi(r,\ldots,r+k-1;\,r+k,\ldots,s)$ by replacing the given partition of $\{j(r),\ldots,j(s)\}$ successively by all the other partitions of this set into a set of $k$ and a set of $s-r+1-k$, each partition being taken exactly once.

This is only meaningful if $0 < k < s-r+1$. If either $k = 0$ or $k = s-r+1$, there is strictly speaking no partition; we then define this sum as being reduced to the single valid term. Further, let $d_{j(1),\ldots,j(k)}V$ indicate that integration applies only to the variables with subscripts $j(1),\ldots,j(k)$, the other variables being kept constant in the process of integration. Thus, for instance,
$$\int_{[0,1]^2}\Delta_3^*\,g(x_1,x_2,x_3)\,d_{1,2}(x_1,x_2,x_3) = \int_{[0,1]^2}g(x_1,x_2,1)\,d(x_1,x_2) - \int_{[0,1]^2}g(x_1,x_2,0)\,d(x_1,x_2).$$
Finally, define
$$F_\perp^{\circ\,-1} = \big(F_1^{\circ\,-1}, F_2^{\circ\,-1},\ldots,F_d^{\circ\,-1}\big),$$
the vector of generalized inverse marginal cumulative distribution functions.

Theorem 9. Assume $h_2$ is continuous, and that $f(x)^t F_\perp^\circ(x)$ is of bounded variation in the sense of Hardy and Krause. Further assume that $f$ is of the form $f(F_\perp^\circ(x))$. Then Lemma 8 is still valid, with
$$\dot\Phi_{F_\perp^\circ,F^\circ}(h_1,h_2) = \int_{[0,1]^d}f(v)^t\,h_1\big(F_\perp^{\circ\,-1}(v)\big)\,dC^\circ(v) + \sum_{k=0}^d(-1)^k\sum^*_{1,\ldots,d;\,k}\int_{[0,1]^k}\Delta^*_{k+1,\ldots,d}\,h_2\big(F_\perp^{\circ\,-1}(v)\big)\,d_{1,\ldots,k}\big(f(v)^t\,v\big).$$
Proof. First notice that
$$\int_{\mathbb R^d}f\big(F_\perp^\circ(x)\big)^t\,h_1(x)\,dF^\circ(x) = \int_{[0,1]^d}f(v)^t\,h_1\big(F_\perp^{\circ\,-1}(v)\big)\,dC^\circ(v),$$
and that
$$\int_{\mathbb R^d}f\big(F_\perp^\circ(x)\big)^t\,F_\perp^\circ(x)\,dh_2(x) = \int_{[0,1]^d}f(v)^t\,v\,dh_2\big(F_\perp^{\circ\,-1}(v)\big).$$
Then apply the integration by parts result of Proposition 2 in Zaremba (1968). □

Lemma 10. We have that
$$q_n/\sqrt n \xrightarrow[n\to\infty]{W} \int_{[0,1]^d}\zeta'(v;\theta_0)^t\,B_{F^\circ,\perp}^\circ\big(F_\perp^{\circ\,-1}(v)\big)\,dC^\circ(v) + \sum_{k=0}^d(-1)^k\sum^*_{1,\ldots,d;\,k}\int_{[0,1]^k}\Delta^*_{k+1,\ldots,d}\,B_{F^\circ}^\circ\big(F_\perp^{\circ\,-1}(v)\big)\,d_{1,\ldots,k}\big(\zeta'(v;\theta_0)^t\,v\big),$$
which is a zero-mean Gaussian variable. Thus

$$q_n = O_P(n^{1/2}).$$

Further, the expectation of $q_n$ is
$$E\,q_n = q_n^* = \int_{[0,1]^d}\zeta'(v;\theta_0)^t\,(\mathbf 1 - v)\,dC^\circ(v),$$
if $\zeta'(v;\theta_0)^t\mathbf 1$ is $C^\circ$-integrable.

Proof. For the limit statements, we first consider the $\mathbb R^{d+1}$-valued stochastic process

$$\sqrt n\left[(F_{n,\perp}, F_n)^t - (F_\perp^\circ, F^\circ)^t\right] = (\mathbb G_{n,\perp}, \mathbb G_n)^t,$$

which clearly converges to the zero-mean Gaussian field $(B_{F^\circ,\perp}^\circ, B_{F^\circ}^\circ)^t$, again by an application of the continuous mapping theorem. We then consider the functional map of Lemma 8 defined by
$$\Phi(U,V) = \int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,U(x)\,dV(x).$$

The point is that

$$q_n = n\left[\Phi(F_{n,\perp}, F_n) - \Phi(F_\perp^\circ, F^\circ)\right],$$
which enables us to apply the functional delta method to $q_n/\sqrt n$. The functional delta theorem (Section 3.9.1 of van der Vaart & Wellner, 1996) together with Theorem 9 now shows that

$$\frac{q_n}{\sqrt n} = \sqrt n\left[\Phi(F_{n,\perp}, F_n) - \Phi(F_\perp^\circ, F^\circ)\right]\xrightarrow[n\to\infty]{W}\dot\Phi_{F_\perp^\circ,F^\circ}\big(B_{F^\circ,\perp}^\circ, B_{F^\circ}^\circ\big)$$
in the notation of Theorem 9. This is the form given in the statement of the lemma. That this limit is a zero-mean Gaussian variable is seen through invoking Lemma 3.9.8 of van der Vaart & Wellner (1996). A trivial consequence of the above limit result is that $q_n = O_P(n^{1/2})$. As for the expectation, notice that

$$E\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,F_{n,\perp}(x)\,dF_n(x) = E\,\frac1n\sum_{i=1}^n\sum_{k=1}^d\zeta'(F_\perp^\circ(X_i);\theta_0)_k\,\frac1n\sum_{j=1}^n I\{X_{jk}\le X_{ik}\}$$
$$= \frac1{n^2}\sum_{k=1}^d\left(\sum_{\substack{j=1,\,i=1\\ i\ne j}}^n E\,\zeta'(F_\perp^\circ(X_i);\theta_0)_k\,I\{X_{jk}\le X_{ik}\} + \sum_{i=1}^n E\,\zeta'(F_\perp^\circ(X_i);\theta_0)_k\right).$$

The observations X1,...,Xn are identically distributed, so this is equal to

$$\sum_{k=1}^d\left(\frac{n(n-1)}{n^2}\,E\,\zeta'(F_\perp^\circ(X_1);\theta_0)_k\,I\{X_{2k}\le X_{1k}\} + \frac{n}{n^2}\,E\,\zeta'(F_\perp^\circ(X_1);\theta_0)_k\right)$$
$$= \sum_{k=1}^d E\,\zeta'(F_\perp^\circ(X_1);\theta_0)_k\,I\{X_{2k}\le X_{1k}\} + \frac1n\sum_{k=1}^d E\left[\zeta'(F_\perp^\circ(X_1);\theta_0)_k - \zeta'(F_\perp^\circ(X_1);\theta_0)_k\,I\{X_{2k}\le X_{1k}\}\right].$$

From the independence of $X_1,\ldots,X_n$, we have that
$$\sum_{k=1}^d E\,\zeta'(F_\perp^\circ(X_1);\theta_0)_k\,I\{X_{2k}\le X_{1k}\} = \sum_{k=1}^d\int_{\mathbb R^d}\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x_1);\theta_0)_k\,I\{x_{2k}\le x_{1k}\}\,dF^\circ(x_2)\,dF^\circ(x_1)$$
$$= \sum_{k=1}^d\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x_1);\theta_0)_k\int_{\mathbb R^d}I\{x_{2k}\le x_{1k}\}\,dF^\circ(x_2)\,dF^\circ(x_1) = \sum_{k=1}^d\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x_1);\theta_0)_k\int_{\mathbb R}I\{y\le x_{1k}\}\,dF_k^\circ(y)\,dF^\circ(x_1)$$
$$= \sum_{k=1}^d\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x_1);\theta_0)_k\,F_k^\circ(x_{1k})\,dF^\circ(x_1) = \sum_{k=1}^d\int_{[0,1]^d}\zeta'(v;\theta_0)_k\,v_k\,dC^\circ(v) = \int_{[0,1]^d}\zeta'(v;\theta_0)^t\,v\,dC^\circ(v).$$
On combining the above calculations, we see that
$$E\,q_n = n\left[E\int_{\mathbb R^d}\zeta'(F_\perp^\circ(x);\theta_0)^t\,F_{n,\perp}(x)\,dF_n(x) - \int_{[0,1]^d}\zeta'(v;\theta_0)^t\,v\,dC^\circ(v)\right] = \int_{[0,1]^d}\zeta'(v;\theta_0)^t\,(\mathbf 1 - v)\,dC^\circ(v).$$

Finally, we know that $|\zeta'(v;\theta_0)^t\,v| \le |\zeta'(v;\theta_0)|^t\,\mathbf 1$, as $v\in[0,1]^d$. Thus it suffices to assume $C^\circ$-integrability of $\zeta'(v;\theta_0)^t\mathbf 1$ for all expectations met in the above calculations to be finite. □

References

Bickel, P. & Wichura, M. (1971). Convergence criteria for multiparameter stochastic processes and some applications. The Annals of Mathematical Statistics 42, 1656–1670.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley.
Burnham, K. P. & Anderson, D. R. (1998). Model Selection and Multimodel Inference. Springer, 2nd ed.
Chen, X. & Fan, Y. (2005). Pseudo-likelihood ratio tests for semiparametric multivariate copula model selection. The Canadian Journal of Statistics 33, 389–414.
Claeskens, G. & Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge University Press.
Fermanian, J.-D., Radulović, D. & Wegkamp, M. (2004). Weak convergence of empirical copula processes. Bernoulli 10, 847–860.
Gaenssler, P. & Stute, W. (1979). Empirical processes: a survey of results for independent and identically distributed random variables. The Annals of Probability 7, 193–243.
Genest, C., Ghoudi, K. & Rivest, L.-P. (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika 82, 543–552.
Joe, H. (1997). Multivariate Models and Dependence Concepts. Chapman & Hall.

Tsukahara, H. (2005). Semiparametric estimation in copula models. The Canadian Journal of Statistics 33, 357–375.
Uchida, M. & Yoshida, N. (2004). Information criteria for small diffusions via the theory of Malliavin–Watanabe. Statistical Inference for Stochastic Processes 7, 35–67.
van der Vaart, A. W. & Wellner, J. A. (1996). Weak Convergence and Empirical Processes. Springer Series in Statistics. Springer.
Zaremba, S. K. (1968). Some applications of multidimensional integration by parts. Annales Polonici Mathematici 21.

Department of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway
E-mail address: [email protected]

Department of Mathematics, University of Oslo, P.O. Box 1053 Blindern, N-0316 Oslo, Norway
E-mail address: [email protected]