
Risk theory

Harri Nyrhinen, University of Helsinki, Fall 2017

Contents

1 Introduction
2 Background from Probability Theory
3 The number of claims
  3.1 Poisson distribution and process
  3.2 Mixed Poisson variable
  3.3 The number of claims of a single policy-holder
  3.4 Mixed Poisson process
4 Total claim amount
5 Viewpoints on claim size distributions
  5.1 Tabulation method
  5.2 Analytical methods
  5.3 On the estimation of the tails of the distribution
6 Calculation and estimation of the total claim amount
  6.1 Panjer method
  6.2 Approximation of the compound distributions
    6.2.1 Limiting behaviour of compound distributions
    6.2.2 Refinements of the normal approximation
    6.2.3 Applications of approximation methods
  6.3 Simulation of compound distributions
    6.3.1 Producing observations
    6.3.2 Estimation
    6.3.3 Increasing efficiency of simulation of small probabilities
  6.4 An upper bound for the tail probability
  6.5 Modelling dependence
    6.5.1 Mixing models
    6.5.2 Copulas
7 Reinsurance
  7.1 Excess of loss (XL)
  7.2 Quota share (QS)
  7.3 Surplus
  7.4 Stop loss (SL)
8 Outstanding claims
  8.1 Development triangles
  8.2 Chain-Ladder method
  8.3 Predicting of the unknown claims
  8.4 Credibility estimates for outstanding claims
9 Solvency in the long run
  9.1 Classical ruin problem
  9.2 Practical long run modelling of the capital development
10 Insurance from the viewpoint of utility theory
  10.1 Utility functions
  10.2 Utility of insurance

1 Introduction

Consider briefly the nature of the insurance industry and the motivation for buying insurance contracts. As an example, think about a collection of houses and the associated risk of fires. For a single house-owner, a fire means a huge economic loss. To hedge against the risk by building up a bank account with a sufficient amount of money does not seem realistic. Usually the problem is solved by means of an insurance contract. This means that each house-owner pays a premium to an insurance company. The premium corresponds roughly to the expected level of the house-owner's losses due to fires in one year. By the contract, the company compensates these losses. Thus the risk has moved to the insurance company and the house-owners are protected against random large losses by means of a deterministic moderate premium. The company typically makes a large number of similar contracts. The law of large numbers can be applied to see that the company is able to manage the compensations by means of moderate premiums. We have already introduced two important cash flows associated with the insurance business, namely, the compensations and the premiums. There are many other cash flows, as illustrated in the following picture.

[Figure: the cash flows of an insurance company — premiums, compensations, returns on the investments, administration costs, and dividends, taxes and new capital.]

The course is focused on the analysis of compensations. Examples of the goals are:
- how should we describe the compensation process
- how should we estimate the solvency of an insurance company.

The course can be viewed as a study of risks associated with non-life insurance companies. The main source for the course is part I of the book

DPP: Daykin, C., Pentikäinen, T. and Pesonen, M. (1994). Practical Risk Theory for Actuaries. Chapman & Hall, London.

The reader is referred to this book especially to get more applied discussion of various topics. More detailed references to appropriate chapters will be given during the course.

2 Background from Probability Theory

The central subject of our interest is the compensation process, called later the claims process. We will consider it as a random variable or as a stochastic process. We list some concepts and facts from probability theory which are assumed to be known.

1. Probability space

A probability space is a triple (Ω, S, P) where Ω is the sample space, S is a sigma-algebra of Ω and P is a probability measure. The sets of S are called events.

2. Random variable

A measurable map ξ : (Ω, S) → (R, B) is called a random variable, where B is the Borel sigma-algebra of R. In the sequel, the measurability of a real-valued function refers to the measurability with respect to the Borel sigma-algebra B.

3. Distribution

The distribution P_ξ of the random variable ξ is the probability measure on (R, B) such that

P_ξ(B) = P(ξ^{−1}(B)) = P(ω ∈ Ω | ξ(ω) ∈ B)

for every B ∈ B. If the random variables ξ and η have the same distribution then we write

ξ =_L η.

4. Distribution function, density function, probability mass function

The distribution function F : R → R of the random variable ξ is defined by

F(x) = P(ξ ≤ x) = P_ξ((−∞, x]).

The function f : R → R is the density (function) of ξ if

F(x) = ∫_{−∞}^x f(t) dt

for every x ∈ R. In this case, ξ is a continuous random variable. If there exists a countable subset {x_1, x_2, ...} of R such that

P(ξ ∈ {x_1, x_2, ...}) = 1

then ξ is discrete. Then the probability mass function g : R → R of ξ is defined by

g(x) = P(ξ = x).

In the sequel we often consider mixtures of continuous and discrete distributions. Then the distribution function has the form

F(x) = ∫_{−∞}^x f(t) dt + Σ_{x_i ≤ x} P(ξ = x_i)    (2.1)

for all x ∈ R.

5. Expectation, variance, higher order moments

The expectation of a random variable ξ is defined by

E(ξ) = ∫_Ω ξ(ω) dP(ω)

under the assumption that E(|ξ|) < ∞. If ξ ≥ 0 almost surely we also allow +∞ as the value of the expectation. Thus E(ξ) is defined for every non-negative random variable ξ. Let h : R → R be a measurable function. If E(|h(ξ)|) < ∞ and F is the distribution function of ξ then

E(h(ξ)) = ∫_{−∞}^∞ h(x) dF(x).

If ξ has the density f then

E(h(ξ)) = ∫_{−∞}^∞ f(x) h(x) dx.

If ξ is discrete and the probability mass function is g then

E(h(ξ)) = Σ_{i=1}^∞ g(x_i) h(x_i),

where it is assumed that Σ_{i=1}^∞ g(x_i) = 1. For the mixture (2.1) of a continuous and a discrete distribution, it holds that

E(h(ξ)) = ∫_{−∞}^∞ f(x) h(x) dx + Σ_{i=1}^∞ P(ξ = x_i) h(x_i).

The nth (origin) moment a_n of ξ is defined by

a_n = E(ξ^n) = ∫_{−∞}^∞ x^n dF(x)

if E(|ξ|^n) < ∞, n = 1, 2, .... Hence, a_1 = E(ξ). We also often write µ = E(ξ) or µ_ξ = E(ξ). The nth central moment µ_n is defined by

µ_n = E((ξ − a_1)^n), n ≥ 2.

The variance of ξ is

σ_ξ^2 = Var(ξ) = µ_2

and the standard deviation is σ_ξ = √µ_2. The skewness γ_ξ is defined by

γ_ξ = E((ξ − a_1)^3)/σ_ξ^3 = µ_3/σ_ξ^3.

6. Moment generating function

The moment generating function M = M_ξ of ξ is a function R → R ∪ {+∞} which is determined by

M_ξ(s) = E(e^{sξ}).

The cumulant generating function c = cξ is a function R → R ∪ {+∞} determined by

cξ(s) = log Mξ(s).

Both of the functions are always defined. The following results hold. a) If the moment generating functions of two random variables coincide and are finite in a non-empty open subset of R then the distributions of the random variables coincide. b) Let ξ and η be independent random variables, that is,

P(ξ ∈ A, η ∈ B) = P(ξ ∈ A)P(η ∈ B) for every A, B ∈ B, denoted by ξ ⊥⊥ η. Then

M_{ξ+η}(s) = M_ξ(s)M_η(s) and c_{ξ+η}(s) = c_ξ(s) + c_η(s)

for every s ∈ R.

c) The moment generating function has derivatives of all orders in the interior of its domain. If s is in that interior then the nth derivative M_ξ^{(n)}(s) is

M_ξ^{(n)}(s) = E(ξ^n e^{sξ}).

In particular, if M_ξ is finite in a neighbourhood of the origin then

M_ξ^{(n)}(0) = E(ξ^n)

for every n ∈ N. Furthermore,

c_ξ'(0) = E(ξ)

and

c_ξ^{(n)}(0) = E((ξ − a_1)^n) = µ_n, n = 2, 3.

If P(ξ ≥ 0) = 1 then always

lim_{s→0−} M_ξ^{(n)}(s) = E(ξ^n).

7. Conditional expectation

Let ξ and η be random variables. Assume that E(ξ) exists and is finite. Let σ(η) be the sigma-algebra generated by η, that is, σ(η) is the smallest sub-sigma-algebra of S with respect to which η is measurable. The conditional expectation of ξ with respect to η is the random variable E(ξ | η) which satisfies

(i) E(ξ | η) is σ(η)-measurable
(ii) E{E(ξ | η)1(η ∈ B)} = E(ξ1(η ∈ B)) for every B ∈ B.

In (ii), 1 is the indicator function, that is, 1(η ∈ B)(ω) = 1 when η(ω) ∈ B and 0 otherwise. It can be shown that E(ξ | η) exists and is unique a.s. (almost surely). Furthermore, there exists a measurable map h : R → R such that E(ξ | η) = h(η). Intuitively, h(y) represents the mean of ξ if η = y. We often write E(ξ | η = y) = h(y). The conditional expectation has the following properties (assuming that the required expectations exist).

a) E(aξ_1 + bξ_2 | η) = aE(ξ_1 | η) + bE(ξ_2 | η) for every a, b ∈ R
b) E{E(ξ | η)} = E(ξ)
c) If f : R → R is measurable then E(f(η)ξ | η) = f(η)E(ξ | η)
d) If ξ ⊥⊥ η then E(ξ | η) = E(ξ)
e) If ξ_1 ≤ ξ_2 then E(ξ_1 | η) ≤ E(ξ_2 | η).

All these results hold a.s. For a given B ∈ B, the conditional probability of B with respect to η is defined by

P(ξ ∈ B | η) = E(1(ξ ∈ B) | η).

Write also P(ξ ∈ B | η = y) = k(y) if P(ξ ∈ B | η) = k(η). Let Fη be the distribution function of η. There exists a family of distribution functions {Fξ|η(· | y) | y ∈ R}, the so called regular conditional distribution of ξ with respect to η, such that

(i) F_{ξ|η}(· | y) is a distribution function for every y ∈ R
(ii) F_{ξ|η}(x | ·) is measurable for every x ∈ R
(iii) P(ξ ≤ x, η ≤ y) = ∫_{u≤y} F_{ξ|η}(x | u) dF_η(u) for every x, y ∈ R.

If h : R → R is measurable and h(ξ) ∈ L^1 then

E(h(ξ) | η = y) = ∫_{−∞}^∞ h(x) dF_{ξ|η}(x | y).

In simple cases, the conditional expectation and probability can be determined in an elementary way. For example, let ξ and η be both discrete. Assume that ξ is concentrated on {x_1, x_2, ...} and η on {y_1, y_2, ...}. Then

P(ξ = xi | η = yj) = P(ξ = xi, η = yj)/P(η = yj),

F_{ξ|η}(x | y_j) = P(ξ ≤ x | η = y_j) = P(ξ ≤ x, η = y_j)/P(η = y_j)

and

E(h(ξ) | η = y_j) = Σ_{i=1}^∞ P(ξ = x_i | η = y_j) h(x_i)

for every i, j = 1, 2, ... and x ∈ R. The conditional expectation can be defined with respect to a collection of random variables, or more generally, with respect to a sub-sigma-algebra of S in the following way. Let F be a sub-sigma-algebra of S. The random variable E(ξ | F) is the conditional expectation of ξ with respect to F if

(i) E(ξ | F) is F-measurable
(ii) E{E(ξ | F)1(A)} = E(ξ1(A)) for every A ∈ F.

The conditional expectation with respect to a random variable η is obtained by taking F = σ(η) where σ(η) = {η^{−1}(B) | B ∈ B}. Also the general conditional expectation exists and is unique if E(ξ) is finite. The properties a), b) and e) above still hold. In addition, if F is a sub-sigma-algebra of G then

E(ξ|F) = E(E(ξ|G)|F).

Property c) takes the form

c') If ζ is an F-measurable random variable then E(ζξ | F) = ζE(ξ | F).

If in particular, η_1, . . . , η_N are random variables and F = σ(η_1, . . . , η_N) is the sigma-algebra generated by these variables then there exists a measurable map h : R^N → R such that E(ξ | σ(η_1, . . . , η_N)) = h(η_1, . . . , η_N). Here R^N is equipped with its Borel sets. The regular conditional distribution also exists in this case. Write in short

E(ξ| σ(η1, . . . , ηN )) = E(ξ| η1, . . . , ηN ).

8. Laws of large numbers

Let ξ_1, ξ_2, ... be independent and identically distributed (i.i.d.) random variables and let a_1 = E(ξ_1) exist and be finite. Then

P( lim_{n→∞} (ξ_1 + ··· + ξ_n)/n = a_1 ) = 1.

This result is known as the strong law of large numbers (SLLN). In addition, for every ε > 0,

lim_{n→∞} P( |(ξ_1 + ··· + ξ_n)/n − a_1| ≥ ε ) = 0.

This is known as the weak law of large numbers (WLLN).

9. Central limit theorem

Let ξ_1, ξ_2, ... be as in section 8. Assume in addition that σ^2 = Var(ξ_1) < ∞. Then

lim_{n→∞} P( (ξ_1 + ··· + ξ_n − n a_1)/(σ√n) ≤ x ) = φ(x)

for every x ∈ R where φ is the distribution function of the standard normal variable,

φ(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt.
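The law of large numbers and the central limit theorem are easy to illustrate numerically. The following sketch in Python (an illustration only; the exponential distribution, the sample size and the seed are arbitrary assumptions) compares the sample mean with a_1 and evaluates the standardized sum appearing in the central limit theorem.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

a1, n = 2.0, 100_000                # assumed true mean and sample size
xi = rng.exponential(scale=a1, size=n)

# SLLN/WLLN: the sample mean should be close to a1 for large n.
print("sample mean:", xi.mean(), "  a1:", a1)

# CLT: the standardized sum is approximately standard normal.
sigma = a1                          # for the exponential distribution, sd = mean
z = (xi.sum() - n * a1) / (sigma * np.sqrt(n))
print("standardized sum:", z)       # typically a value of order 1
```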

3 The number of claims

It is natural to describe the number of claims during a given time interval by means of a random variable whose possible values are non-negative integers. It is also useful to consider the accumulation of the claims as a stochastic process. For wider applied discussions, we refer to DPP, Section 2.

Let K(t) be the number of claims that occurred during the time interval (0, t] in a given insurance portfolio (this means a fixed collection of insurance contracts). Let further K(t, u) be the number of claims that occurred during the time interval (t, u] where 0 ≤ t < u. Thus K(t, u) = K(u) − K(t).

A random variable K is called a counting variable if P(K ∈ {0, 1, 2, ...}) = 1. A stochastic process {K(t) | t ≥ 0} is called a counting process if for each t ≥ 0, K(t) is a random variable on a fixed probability space (Ω, S, P) (which does not depend on t), and the following conditions are satisfied:

(i) K(0) = 0 a.s.
(ii) K(t) is a counting variable for each t ≥ 0
(iii) The realizations of the process are right continuous and have left limits. That is, the map f_ω : [0, ∞) → R, f_ω(t) = K(t)(ω) is right continuous and the limit lim_{h→0+} f_ω(t − h) exists for every ω ∈ Ω (f_ω is a cadlag function)
(iv) P(K(t) − K(t−) = 0 or 1, ∀t > 0) = 1 where K(t−) = lim_{h→0+} K(t − h).

Condition (iii) is technical. What is essential is that K(t) is integer-valued and that the jump size is always 1 (conditions (ii) and (iv)). These properties are natural when the development of the numbers of claims in time is modelled.

[Figure: a realization of the counting process — a right-continuous step function f_ω(t) increasing by one at each claim occurrence time.]

The jump times (the occurrence times of the claims) are random in the model. At each time point, at most one claim can occur. This property can be criticized: for example, in the collision of two cars, a claim occurs at the same time for both of the participants. The problem can be solved by interpreting the collision as one claim. In this way, the applicability of the model can be made better.

3.1 Poisson distribution and process

A random variable K has the Poisson distribution with the parameter λ ≥ 0 if

P(K = k) = e^{−λ} λ^k / k!, k = 0, 1, 2, ....

If λ = 0 then by convention, P(K = 0) = 1. A stochastic process {K(t) | t ≥ 0} is a Poisson process with the intensity λ ≥ 0 if {K(t)} is a counting process and a) K(t, u) has the Poisson distribution with the parameter λ(u−t) for every 0 ≤ t < u

b) for all time points 0 ≤ t_1 < u_1 ≤ t_2 < u_2 ≤ · · · ≤ t_n < u_n, the increments K(t_1, u_1), ..., K(t_n, u_n) are independent.

The Poisson process is perhaps the simplest model to describe the development of the numbers of the claims in time. More flexibility can be obtained by means of the non-homogeneous Poisson process. This is defined in the following way. Let Λ : [0, ∞) → [0, ∞) be an increasing function such that Λ(0) = 0. The process {K(t) | t ≥ 0} is the Poisson process with the intensity function Λ if {K(t)} is a counting process and

a') K(t, u) has the Poisson distribution with the parameter Λ(u) − Λ(t) for every 0 ≤ t < u
b') condition b) holds.

The usual Poisson process is obtained by choosing Λ(t) = λt for every t ≥ 0. Theoretical motivation for the use of the Poisson process in modelling will be given in the sequel. We begin by listing some basic properties of the Poisson distribution.

Theorem 3.1.1. Let K be a Poisson distributed random variable with the parameter λ. Then the moment generating function M_K, the expectation E(K), the variance Var(K) and the skewness γ_K have the forms

M_K(s) = e^{λ(e^s − 1)}, ∀s ∈ R, (3.1.1)

E(K) = Var(K) = λ (3.1.2)

and

γ_K = 1/√λ. (3.1.3)

Proof. The proofs of the claims concerning the moments are left to the reader. Let s ∈ R. Then

M_K(s) = Σ_{k=0}^∞ P(K = k) e^{sk} = Σ_{k=0}^∞ e^{−λ} (λ^k/k!) e^{sk}
       = e^{−λ} Σ_{k=0}^∞ (λe^s)^k/k! = e^{−λ} e^{λe^s} = e^{λ(e^s − 1)}. □

The Poisson distribution is a limit of binomial distributions in the following way.

Lemma 3.1.1. Let ξ_n have the Bin(n, p_n) distribution,

P(ξ_n = k) = (n choose k) p_n^k (1 − p_n)^{n−k}, k = 0, 1, 2, . . . , n.

Assume that lim_{n→∞} n p_n = λ. Then

lim_{n→∞} P(ξ_n = k) = e^{−λ} λ^k / k!, k = 0, 1, 2, ....

Proof. Clearly,

P(ξ_n = k) = [n(n − 1) ··· (n − k + 1)/k!] ((λ + o(1))/n)^k (1 − (λ + o(1))/n)^{n−k},

where o(1) → 0 as n → ∞. This proves the lemma because lim_{n→∞} (1 − (λ + o(1))/n)^n = e^{−λ}. □

Consider next counting processes.

Theorem 3.1.2. Let {K(t)} be a counting process which satisfies

(i) independence of the increments: for all time points 0 ≤ t_1 < u_1 ≤ t_2 < u_2 ≤ · · · ≤ t_n < u_n, the increments K(t_1, u_1), ..., K(t_n, u_n) are independent
(ii) stationarity of the increments: for every t, r ≥ 0, K(r) and K(t + r) − K(t) are equally distributed.

Then there exists λ ≥ 0 such that {K(t) | t ≥ 0} is the Poisson process with the intensity λ.

The conditions of Theorem 3.1.2 are useful when the Poisson process is considered as a candidate for modelling. The following stronger result is more convenient in the sequel.

Theorem 3.1.3. Let {K(t)} be a counting process which satisfies condition (i) of Theorem 3.1.2 and the condition

(ii)’ P(K(r) = 0) = P(K(t + r) − K(t) = 0) for every t, r ≥ 0. Then {K(t) | t ≥ 0} is the Poisson process with the intensity λ = − log P(K(1) = 0). Write in short

pk(t) = P(K(t) = k) for t ≥ 0, k = 0, 1, 2,..., (3.1.4) when the counting process {K(t)} in question is clear from the context. Proof of Theorem 3.1.3. Let r, t ≥ 0. By (i) and (ii)’,

p_0(t + r) = P(K(t + r) = 0) = P(K(t) = 0, K(t, t + r) = 0)
           = P(K(t) = 0) P(K(t, t + r) = 0) = p_0(t) p_0(r).

Let m, n ∈ N. Then

p_0(m/n) = [p_0(1/n)]^m = [p_0(1)]^{m/n}.

Now p_0(t) is decreasing in t so that

p_0(t) = [p_0(1)]^t

for every t ≥ 0. If p_0(1) = 1 then P(K(t) = 0) = 1 for every t ≥ 0. Thus {K(t)} is a Poisson process (the intensity is λ = 0). Assume henceforth that p_0(1) < 1. Suppose for a while that p_0(1) = 0. For any 0 ≤ a < b ≤ 1, we then have

P(K(a, b) ≥ 1) = P(K(b) − K(a) ≥ 1) = 1 − P(K(b) − K(a) = 0)
               = 1 − p_0(b − a) = 1 − [p_0(1)]^{b−a} = 1.

This implies that P(K(1) < ∞) = 0, a contradiction.

By the above observations, we can assume in the sequel that p_0(1) ∈ (0, 1). Write

λ = − log(p_0(1))

so that p_0(t) = e^{−λt} for all t ≥ 0. Let now t > 0 and let k be an arbitrary non-negative integer. Consider the probability p_k(t). Let n ∈ N and n > k. The intervals

I_ν^n = ((ν − 1)t/n, νt/n], ν = 1, . . . , n,

constitute a partition of (0, t], namely, they are pairwise disjoint and (0, t] = I_1^n ∪ · · · ∪ I_n^n. Let A_k^n be the collection of the realizations of {K(t)} which have jumps in exactly k of the intervals I_ν^n,

A_k^n = ∪_{1≤ν_1<...<ν_k≤n} [ ∩_{i=1}^k {K((ν_i − 1)t/n, ν_i t/n) > 0} ∩ ∩_{ν∉{ν_1,...,ν_k}} {K((ν − 1)t/n, νt/n) = 0} ].

Let further B^n be the collection of the realizations which have at least two jumps in some interval I_ν^n,

B^n = ∪_{ν=1}^n {K((ν − 1)t/n, νt/n) ≥ 2}.

By (i) and (ii)', P(A_k^n) is binomial,

P(A_k^n) = P(ξ_n = k),

where ξ_n has the Bin(n, 1 − e^{−λt/n}) distribution. By Lemma 3.1.1,

lim_{n→∞} P(A_k^n) = e^{−λt} (λt)^k / k!. (3.1.5)

Consider now the set B^n as n → ∞. Every counting process has a finite number of jumps in finite time intervals and the jump sizes all equal 1. Because the realizations are right continuous, any such realization lies outside B^n for large n. By dominated convergence,

lim_{n→∞} P(B^n) = 0. (3.1.6)

Clearly,

A_k^n \ B^n ⊆ {K(t) = k} ⊆ A_k^n ∪ B^n

so that

P(A_k^n) − P(B^n) ≤ p_k(t) ≤ P(A_k^n) + P(B^n).

By (3.1.5) and (3.1.6),

p_k(t) = lim_{n→∞} P(A_k^n) = e^{−λt} (λt)^k / k!.

Let now 0 ≤ t < u. We have to show that K(u) − K(t) has the Poisson distribution with the parameter λ(u − t). Let s ∈ R. By (i),

M_{K(u)}(s) = M_{K(t)+K(u)−K(t)}(s) = M_{K(t)}(s) M_{K(u)−K(t)}(s).

By Theorem 3.1.1 and by the first part of the proof,

M_{K(u)−K(t)}(s) = M_{K(u)}(s)/M_{K(t)}(s) = e^{λu(e^s−1)} e^{−λt(e^s−1)} = e^{λ(u−t)(e^s−1)}.

Thus K(u) − K(t) has the Poisson distribution with the parameter λ(u − t). □

The next result gives an alternative way to define the Poisson process. We skip the proof.

Theorem 3.1.3.1. Let ξ, ξ_1, ξ_2, ... be independent exponentially distributed random variables with the parameter λ > 0. Thus

P(ξ ≤ x) = 1 − e^{−λx}, x ≥ 0.

Define the stochastic process {K(t) | t ≥ 0} by

K(t) = sup{k | ξ_1 + ··· + ξ_k ≤ t}, and K(t) = 0 if ξ_1 > t.

Then {K(t)|t ≥ 0} is a Poisson process with the intensity λ. Conversely, if {K(t)|t ≥ 0} is a Poisson process with the intensity λ > 0 and

T_k = inf{t | K(t) ≥ k}, k = 1, 2, ...,

then T_1, T_2 − T_1, T_3 − T_2, ... are independent exponentially distributed random variables with the parameter λ.

We end the section with a generalization of Theorem 3.1.2 where, in essence, the stationarity assumption (ii) is dropped.

Theorem 3.1.4. Let {K(t)} be a counting process and let

pk(t) = P(K(t) = k), k = 0, 1, 2,....

Assume that {K(t)} satisfies condition (i) of Theorem 3.1.2. Suppose that p0(t) ∈ (0, 1] for every t ≥ 0 and that p0 : [0, ∞) → (0, 1] is continuous. Then {K(t)} is a Poisson process with the intensity function Λ where

Λ(t) = − log p0(t) (3.1.7) for every t ≥ 0.

We note that if p_0 is not continuous at t_0 then from the applied point of view, a claim occurs at time t_0 with a positive probability. This is not very natural in non-life insurance.

Proof of Theorem 3.1.4. We only give the proof under the simplified assumption that p_0 is strictly decreasing in [0, ∞) and lim_{t→∞} p_0(t) = 0. Let

p_0^{−1} : (0, 1] → [0, ∞)

be the inverse of p_0. Let µ > 0 be fixed and

τ(t) = p_0^{−1}(e^{−µt}), t ∈ [0, ∞).

Then τ is continuous and strictly increasing. Define the counting process {K∗(t)} by

K∗(t) = K(τ(t)), t ∈ [0, ∞).

Obviously, {K∗(t)} satisfies condition (i) of Theorem 3.1.2. We show that also condition (ii)' of Theorem 3.1.3 is satisfied. Now p_0(τ(t)) = e^{−µt} so that

P(K∗(t) = 0) = P(K(τ(t)) = 0) = e^{−µt}.

For arbitrary t, r ≥ 0,

P(K∗(t + r) = 0) = P(K∗(t) = 0) P(K∗(t + r) − K∗(t) = 0),

so that

P(K∗(t + r) − K∗(t) = 0) = P(K∗(t + r) = 0)/P(K∗(t) = 0) = e^{−µr} = P(K∗(r) = 0).

This is (ii)'. By Theorem 3.1.3, {K∗(t)} is a Poisson process with the intensity

− log P(K∗(1) = 0) = µ.

For given 0 ≤ t1 < t2,

∗ −1 ∗ −1 K(t2) − K(t1) = K (τ (t2)) − K (τ (t1)).

Thus K(t_2) − K(t_1) has the Poisson distribution with the parameter

µ(τ^{−1}(t_2) − τ^{−1}(t_1)) = − log p_0(t_2) + log p_0(t_1) = Λ(t_2) − Λ(t_1). □

When the applicability of the Poisson process as the model for the claim numbers is considered, one has to think about the independence and stationarity of the increments. Theorem 3.1.4 shows that the stationarity is not very problematic. The assumption of independence of the increments is more problematic. In practice, it can be difficult to justify this property. For example, economic cycles may cause consecutive bad or good years. In the next section, we describe a modification of the Poisson process which allows dependence between the increments.
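Theorem 3.1.3.1 also gives a direct way to produce realizations of a Poisson process: generate exponential waiting times and record their cumulative sums as the jump times. The following sketch is a minimal illustration in Python; the intensity, the time horizon, the function name and the seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=2)

def poisson_process_jump_times(lam, t_max):
    """Jump times T_1, T_2, ... of a Poisson process with intensity lam on (0, t_max]."""
    times = []
    t = rng.exponential(1.0 / lam)
    while t <= t_max:
        times.append(t)
        t += rng.exponential(1.0 / lam)
    return np.array(times)

lam, t_max = 3.0, 2.0                     # assumed values for the illustration
jumps = poisson_process_jump_times(lam, t_max)
print("K(t_max) =", len(jumps))           # Poisson(lam * t_max) distributed
print("E[K(t_max)] =", lam * t_max)
```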

3.2 Mixed Poisson variable

Consider the number of claims during a fixed time interval, for example, during one year. The essential feature of the following model is that it is able to describe more variation in the number of claims than the ordinary Poisson distribution. Let Q be a non-negative random variable and let λ > 0 be a constant. Assume that E(Q) = 1. The counting variable K is called a mixed Poisson variable with the parameter (λ, Q) if

F_{K|Q}(k|q) = P(K ≤ k | Q = q) = Σ_{h=0}^k e^{−λq} (λq)^h / h!

for every q ≥ 0 and k = 0, 1, 2, .... The variable Q is called the mixing variable. Heuristically, we can think that at the beginning of the year, the value of the mixing (structure) variable is drawn from the distribution of Q. If it is q then the Poisson parameter in force during the year is λq. As an applied example, the variations obtained in this way may correspond to the number of slippery days in car insurance. Let H be the distribution function of Q,

H(q) = P(Q ≤ q), q ∈ R. The probability mass function of K is determined by

P(K = k) = ∫_0^∞ P(K = k | Q = q) dH(q) = ∫_0^∞ e^{−λq} (λq)^k/k! dH(q), (3.2.1)

k = 0, 1, 2, .... The moment generating function at point s is

M_K(s) = E{E{e^{sK} | Q}} = E(e^{λQ(e^s−1)}) = ∫_0^∞ e^{λq(e^s−1)} dH(q). (3.2.2)

Hence,

M_K(s) = M_Q(λ(e^s − 1)) = M_Q(c(s)), (3.2.3)

where c is the cumulant generating function of the Poisson distribution with the parameter λ.

Theorem 3.2.1. Let K be a mixed Poisson variable with the parameter (λ, Q). Assume that M_Q is finite in a neighbourhood of the origin. Then

E(K) = λ,

σ_K^2 = λ + λ^2 σ_Q^2

and

γ_K = (λ + 3λ^2 σ_Q^2 + λ^3 γ_Q σ_Q^3)/σ_K^3.

The proof is left to the reader. It is often useful to consider the numbers of claims separately in appropriate sub-portfolios of the company. Then the question arises what happens if the sub-portfolios are considered together. The following result focuses on this question.

Let Ki be a mixed Poisson variable with the parameter (λi,Qi), i = 1, 2. Dependence between sub-portfolios will be allowed via the structure variables. We will assume that

P(K_1 = k_1, K_2 = k_2 | Q_1, Q_2) = P(K_1 = k_1 | Q_1) P(K_2 = k_2 | Q_2) (3.2.4)

for every k_1, k_2 = 0, 1, 2, .... Let H be the joint distribution function of Q_1 and Q_2. Then for all Borel sets B_1 and B_2,

P(K_1 = k_1, K_2 = k_2, Q_1 ∈ B_1, Q_2 ∈ B_2) = ∫_{B_1×B_2} e^{−λ_1 q_1} (λ_1 q_1)^{k_1}/k_1! · e^{−λ_2 q_2} (λ_2 q_2)^{k_2}/k_2! dH(q_1, q_2).

Theorem 3.2.2. Let K_i be a mixed Poisson variable with the parameter (λ_i, Q_i) for i = 1, 2, and let K = K_1 + K_2. Assume that (3.2.4) holds for every k_1, k_2 = 0, 1, 2, .... Then K is a mixed Poisson variable with the parameter (λ_1 + λ_2, Q) where

Q = (λ_1 Q_1 + λ_2 Q_2)/(λ_1 + λ_2).

Proof. By (3.2.4),

M_K(s) = E{E{e^{sK} | Q_1, Q_2}} = E(e^{λ_1 Q_1(e^s−1)} e^{λ_2 Q_2(e^s−1)}) = E(e^{(λ_1+λ_2)Q(e^s−1)}).

This and (3.2.3) prove the theorem. □

It is quite common that the structure variable is not directly observable so that the estimation of its distribution is not easy. A practical approach is to find out by statistical methods the 'best one' from an appropriate family of distributions. In some situations, it is sufficient to know a couple of the lowest moments of Q, which helps the estimation problem. The gamma distribution is a popular approximation of the distribution of Q. The counting variable K will then have a generalized negative binomial distribution which is also called the Polya distribution. We study this model in the following example.

Example 3.2.1. Assume that Q has the gamma-(r, α) distribution where α and r are positive constants. The density of Q is

f(x) = (α^r/Γ(r)) e^{−αx} x^{r−1}

for x ≥ 0 where Γ is Euler's gamma function,

Γ(r) = ∫_0^∞ e^{−u} u^{r−1} du.

We assume that E(Q) = 1 which means that we have to take α = r. The distribution function of Q is then

H(q) = ∫_0^q (r^r/Γ(r)) e^{−rx} x^{r−1} dx = (1/Γ(r)) ∫_0^{rq} e^{−x} x^{r−1} dx

for q ≥ 0. We will show that

P(K = k) = [Γ(r + k)/(Γ(r)Γ(k + 1))] (r/(r + λ))^r (λ/(r + λ))^k (3.2.5)

for k = 0, 1, 2, .... In the particular case where r ∈ N, write p = r/(r + λ) to see that

P(K = k) = (r + k − 1 choose k) p^r (1 − p)^k.

This is known as the negative binomial distribution. The proof of (3.2.5) is straightforward by using representation (3.2.1):

P(K = k) = ∫_0^∞ e^{−λq} (λq)^k/k! dH(q) = ∫_0^∞ e^{−λq} (λq)^k/k! · (r^r/Γ(r)) e^{−rq} q^{r−1} dq
         = (r^r λ^k)/(k! Γ(r)) ∫_0^∞ e^{−(r+λ)q} q^{r+k−1} dq.

The last integrand is, up to a multiplicative constant, the density of the gamma-(r + k, r + λ) distribution. By this fact, and because Γ(k + 1) = k!, we conclude that

P(K = k) = (r^r λ^k)/(k! Γ(r)) · Γ(r + k)/(r + λ)^{r+k} = [Γ(r + k)/(Γ(r)Γ(k + 1))] (r/(r + λ))^r (λ/(r + λ))^k.

This is (3.2.5).
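Formula (3.2.5) can be checked numerically: drawing Q from the gamma-(r, r) distribution and then K from the Poisson distribution with the parameter λQ should reproduce the Polya probabilities. A minimal Python sketch, with arbitrary parameter values, is given below.

```python
import numpy as np
from math import lgamma, exp, log

rng = np.random.default_rng(seed=3)

def polya_pmf(k, r, lam):
    """P(K = k) from (3.2.5) for the Polya (generalized negative binomial) distribution."""
    return exp(lgamma(r + k) - lgamma(r) - lgamma(k + 1)
               + r * log(r / (r + lam)) + k * log(lam / (r + lam)))

r, lam, n = 2.0, 1.5, 200_000                    # assumed example values
Q = rng.gamma(shape=r, scale=1.0 / r, size=n)    # gamma-(r, r), so E(Q) = 1
K = rng.poisson(lam * Q)                         # mixed Poisson draws

for k in range(5):
    print(k, (K == k).mean(), polya_pmf(k, r, lam))
```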

3.3 The number of claims of a single policy-holder

In applications, it is often necessary to consider the claims process of a single policy-holder. A good example is the pricing of insurance contracts (which is a topic of tariff theory). The mixed Poisson variable is also here useful. Consider a fixed portfolio. Poisson processes may also be reasonable models for the numbers of claims of every policy-holder in the portfolio (and the Poisson distributions during a fixed time interval). We make this assumption. It is not natural to assume that the Poisson parameter is equal for every policy-holder. For example in car insurance, differences appear because the driving skills of the policy-holders are not equal. We model the portfolio in the following way.

Policy-holder:       1     2     ...   N
Poisson parameter:   λq_1  λq_2  ...   λq_N

Here λ describes the average expectation of the number of claims in the portfolio. The coefficient q_i describes the expectation of the policy-holder i relative to this average. We thus assume that (q_1 + ··· + q_N)/N = 1. Choose a policy-holder from the portfolio at random. Hence, if L denotes this policy-holder then P(L = i) = 1/N for every i. By the total probability,

P(K = k) = Σ_{i=1}^N P(L = i) P(K = k | L = i) = Σ_{i=1}^N (1/N) e^{−λq_i} (λq_i)^k / k!.

Take a random variable Q such that

H(q) := P(Q ≤ q) = #{i | q_i ≤ q}/N

for every q ∈ R. Then

P(K = k) = ∫_0^∞ e^{−λq} (λq)^k/k! dH(q),

k = 0, 1, 2, .... It is seen that K is a mixed Poisson variable. In this application, Q describes the heterogeneity of the portfolio. In the above consideration, the q-coefficients were taken as known. In real life, they are not completely known but have to be approximated.

3.4 Mixed Poisson process

Mixed Poisson distributions are also useful in continuous time considerations. In particular, the increments of the resulting process will no longer be independent. We take a Poisson process as the starting point, but let the intensity be random. Let Q_1, Q_2, ... be the yearly mixing variables. We assume that they are i.i.d. with the common distribution function H. Let λ > 0 be fixed. It will describe the mean of the number of claims in each year.

Let Q be a generic variable which has the same distribution as Q1, and let N be a positive integer. We consider claims during the time interval [0,N]. Under the above notations and assumptions, we call the counting process

{K(t) | t ∈ [0,N]} a mixed Poisson process with the parameter (λ, Q) if conditionally, given that

Q1 = q1,...,QN = qN , the process {K(t) | t ∈ [0,N]} is a Poisson process with the intensity function Λ where

Λ(t) = Σ_{n=1}^{⌊t⌋} λq_n + (t − ⌊t⌋)λq_{⌊t⌋+1}, t ∈ [0, N], (3.4.1)

and where ⌊t⌋ is the integer part of t. The conditional intensity Λ in (3.4.1) increases linearly inside the years. The rate of increase is allowed to vary from year to year. By considering more complicated intensity functions, it would be possible to obtain more complicated counting processes. We will only consider models where (3.4.1) holds. Let's illustrate the finite dimensional distributions of the process by means of an example. Let k_1 ≤ k_2 ≤ k_3 be non-negative integers, t_1, t_2 ∈ (0, 1) with t_1 < t_2, and t_3 ∈ (1, 2). Then

P(K(t_1) = k_1, K(t_2) = k_2, K(t_3) = k_3)

= P(K(t1) = k1,K(t2) − K(t1) = k2 − k1,K(t3) − K(t2) = k3 − k2)

= ∫_{q_1=0}^∞ ∫_{q_2=0}^∞ e^{−λq_1 t_1} (λq_1 t_1)^{k_1}/k_1! · e^{−λq_1(t_2−t_1)} (λq_1(t_2 − t_1))^{k_2−k_1}/(k_2 − k_1)!
  · e^{−λ(q_1(1−t_2)+q_2(t_3−1))} (λ(q_1(1 − t_2) + q_2(t_3 − 1)))^{k_3−k_2}/(k_3 − k_2)! dH(q_1) dH(q_2).

The finite dimensional distributions in turn determine the whole process (at least if the minimal possible sigma-algebra is taken in the definition of the process). It is easy to see that the increments of the process in different years are independent (for example, K(1) − K(1/2) and K(5/3) − K(4/3) are independent). However, there is dependence inside the years. We end the section by showing that the increments of a mixed Poisson process {K(t)} are independent in the sense of condition (i) of Theorem 3.1.2 if and only if {K(t)} is a Poisson process.

Let 0 ≤ t < u ≤ 1. Obviously, K(u) − K(t) is a mixed Poisson variable with the parameter (λ(u − t),Q). By Theorem 3.2.1,

Var(K(u) − K(t)) = λ(u − t) + λ^2(u − t)^2 Var(Q).

In particular,

Var(K(1)) = λ + λ^2 Var(Q),

Var(K(1/2)) = λ/2 + (λ^2/4) Var(Q)

and

Var(K(1) − K(1/2)) = λ/2 + (λ^2/4) Var(Q).

If the increments are independent then

Var(K(1)) = Var(K(1/2)) + Var(K(1) − K(1/2)).

This is true only if Var(Q) = 0. This means that P(Q = 1) = 1 so that {K(t)} is a Poisson process.
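The variance formula Var(K(1)) = λ + λ²Var(Q) used above can also be verified by simulation. The sketch below uses a gamma mixing variable; all parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=4)

lam, r, n = 4.0, 2.0, 300_000         # assumed values; Q ~ gamma-(r, r), so Var(Q) = 1/r
Q = rng.gamma(shape=r, scale=1.0 / r, size=n)
K1 = rng.poisson(lam * Q)             # K(1) given Q is Poisson(lam * Q)

print("simulated Var(K(1)):", K1.var())
print("theoretical value  :", lam + lam**2 * (1.0 / r))
```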

4 Total claim amount

Every occurrence of a claim means that the company has to pay a compensation which basically covers the economic loss of the policy-holder in question. It is natural to model the compensation as a non-negative random variable. We also call it the claim size. We will study the sum of these random variables in a year in a given portfolio. The sum is called the total claim amount (in the year). For wider applied discussions, we refer to DPP, Section 3.

Denote by Z_i the size of the ith claim. If the number of claims that occurred in the year is K then the total claim amount X has the form

X = Z1 + ··· + ZK . (4.1)

Both the number of claims and the claim sizes are random. The understanding and estimation of the total claim amount is a central topic of risk theory. Let S be a distribution function. The random variable in (4.1) is called a compound variable with the parameter (K, S) if

K, Z_1, Z_2, ... are independent (4.2)

and

the distribution of Z_1, Z_2, ... is S. (4.3)

We will assume throughout the section that X is a compound variable even though conditions (4.2) and (4.3) are usually only approximately satisfied. For example, inflation may change the distribution of the claim sizes. We will also assume that P(K > 0) > 0. Let Z be a generic variable which has the distribution function S. Thus

P(Z ≤ z) = P(Z_i ≤ z) = S(z)

for every z ∈ R and i = 1, 2, .... Assume in the sequel that S(0−) = P(Z < 0) = 0. Thus we do not allow negative claim sizes. Then also X is non-negative. Write p_k = P(K = k) for k = 0, 1, 2, .... Let X be a compound variable with the parameter (K, S). If K has the Poisson distribution with the parameter λ then we call X a compound Poisson variable with the parameter (λ, S). Similarly, if K has the mixed Poisson distribution with the parameter (λ, Q) then we call X a compound mixed Poisson variable with the parameter (λ, Q, S).

When the compound variable X is analysed, it is convenient to study separately the number of claims and the claim sizes. To clarify this, let's consider the estimation of the distribution function of X from the data. A natural starting point is to assume that all the past observations come from the same distribution. Then there are not many useful observations because the environment changes all the time. In contrast to this, there are typically a lot of observations of the claim sizes. The estimation of the number of claims is usually otherwise easier. For example, if we agree that it has a Poisson distribution then only one parameter has to be estimated. In summary, the model for the structure of X is of high value in the estimation problem. Let F be the distribution function of X. By the properties of the compound variable,

F(x) = P(X ≤ x) = Σ_{k=0}^∞ p_k S^{k∗}(x) (4.4)

where S^{k∗} is the kth convolution of S,

S^{0∗}(x) = 0 if x < 0, and S^{0∗}(x) = 1 if x ≥ 0,

S^{k∗}(x) = ∫_{−∞}^∞ S^{(k−1)∗}(x − y) dS(y), k = 1, 2, ....

In principle, it is possible to determine F by this equation if the probabilities p_k and the distribution function S are known. This can be very time consuming if the mean of the number of claims is large. Observe also that

P(X ≤ x, K = k) = S^{k∗}(x) p_k

so that

F_{X|K}(x|k) = S^{k∗}(x), x ∈ R, k = 0, 1, 2, ....

In the moment generating function, the number of claims and the claim sizes are separated in the following way.

Theorem 4.1. Let X be a compound variable with the parameter (K, S) and let Z have the distribution S. Let M_K be the moment generating function of K and c_Z the cumulant generating function of Z. Then the moment generating function of X, denoted by M_X, is determined by

M_X(s) = M_K(c_Z(s)), s ∈ R,

where by convention, M_X(s) = ∞ if c_Z(s) = ∞.

Proof. By (4.2) and (4.3),

M_X(s) = Σ_{k=0}^∞ E(e^{sX} 1(K = k)) = Σ_{k=0}^∞ p_k E(e^{s(Z_1+···+Z_k)})
        = Σ_{k=0}^∞ p_k M_Z(s)^k = Σ_{k=0}^∞ p_k e^{k c_Z(s)} = M_K(c_Z(s)). □

If in particular, X is a compound Poisson variable with the parameter (λ, S) then

M_X(s) = e^{λ(M_Z(s)−1)}, (4.5)

and if X is a compound mixed Poisson variable with the parameter (λ, Q, S) then

M_X(s) = M_Q(λ(M_Z(s) − 1)). (4.6)

The lowest moments of X can often be derived by means of the moment generating function. Let a_i be the ith origin moment of Z,

a_i = E(Z^i) = ∫_0^∞ z^i dS(z). (4.7)

Theorem 4.2. Let X be a compound mixed Poisson variable with the parameter (λ, Q, S). Then the mean, the variance and the skewness of X are

µ_X = E(X) = λa_1,

σ_X^2 = Var(X) = λa_2 + λ^2 a_1^2 σ_Q^2

and

γ_X = [λa_3 + 3λ^2 a_1 a_2 σ_Q^2 + λ^3 a_1^3 γ_Q σ_Q^3]/σ_X^3

under the assumption that the right hand side is well defined. The proof is left to the reader. If X is a compound Poisson variable with the parameter (λ, S) then

µ_X = λa_1, (4.8)

σ_X^2 = λa_2, (4.9)

µ_3 = E((X − µ_X)^3) = λa_3 (4.10)

and

γ_X = µ_3/σ_X^3 = a_3/(a_2^{3/2}√λ). (4.11)

We end the section by proving an additivity property of the compound Poisson distribution. By taking Z ≡ 1 in the following result, it is seen that the sum of two independent Poisson variables is also a Poisson variable.

Theorem 4.3. Let Xi be a compound Poisson variable with the parameter (λi,Si) for i = 1, 2. Assume that X1 and X2 are independent. Then X = X1 + X2 is a compound Poisson variable with the parameter (λ1 + λ2,S) where

S(z) = (λ_1 S_1(z) + λ_2 S_2(z))/(λ_1 + λ_2), z ∈ R. (4.12)

Proof. Let M_i be the moment generating function of the claim size associated with X_i,

M_i(s) = ∫_{−∞}^∞ e^{sz} dS_i(z), s ∈ R.

By the independence and by Theorem 4.1,

M_X(s) = E(e^{sX}) = E(e^{sX_1}) E(e^{sX_2}) = e^{λ_1(M_1(s)−1)} e^{λ_2(M_2(s)−1)}.

Thus

M_X(s) = e^{(λ_1+λ_2)(λ_1 M_1(s)/(λ_1+λ_2) + λ_2 M_2(s)/(λ_1+λ_2) − 1)}. (4.13)

A straightforward calculation shows that the moment generating function associated with the distribution function (4.12) is determined by the formula

(λ_1/(λ_1 + λ_2)) M_1(s) + (λ_2/(λ_1 + λ_2)) M_2(s), s ∈ R.

Thus (4.13) and Theorem 4.1 complete the proof. □

It is also useful to consider the total claim amount as a stochastic process in continuous time. Let {K(t)} be a counting process and let Z_1, Z_2, ... be the claim sizes as earlier. Assume that Z_1, Z_2, ... are i.i.d. and that they are independent of {K(t)} in all respects. The total claim amount {X(t) | t ≥ 0} in continuous time is defined by

X(t) = Z1 + ··· + ZK(t). (4.14)

If {K(t)} is a Poisson process with the intensity λ and the claim size distribution is S then {X(t)} is called a compound Poisson process with the parameter (λ, S). It is clear that the increments of every compound Poisson process are independent and stationary in the sense of Theorem 3.1.2 (this means that {X(t)} is a Lévy process). Similarly, if {K(t)} is a mixed Poisson process with the parameter (λ, Q) then {X(t)} is called a compound mixed Poisson process with the parameter (λ, Q, S).
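A compound Poisson variable can be simulated directly from definition (4.1): draw K from the Poisson distribution and add K independent claim sizes. The following Python sketch uses log-normal claim sizes purely as an illustration (all parameter values are assumptions) and compares the simulated mean and variance with (4.8) and (4.9).

```python
import numpy as np

rng = np.random.default_rng(seed=5)

lam = 50.0                            # Poisson parameter (assumed)
mu, sigma = 0.0, 1.0                  # log-normal claim size parameters (assumed)
n_sim = 20_000

K = rng.poisson(lam, size=n_sim)
# Total claim amount X = Z_1 + ... + Z_K for each simulated year.
X = np.array([rng.lognormal(mu, sigma, size=k).sum() for k in K])

a1 = np.exp(mu + sigma**2 / 2)        # E(Z)
a2 = np.exp(2 * mu + 2 * sigma**2)    # E(Z^2)
print("mean:    ", X.mean(), " vs  lam*a1 =", lam * a1)
print("variance:", X.var(),  " vs  lam*a2 =", lam * a2)
```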

5 Viewpoints on claim size distributions

We consider in this section estimation of claim size distributions and related problems. As a starting point, we take a collection of past observations from one insurance line or sub-line (claims associated with fires, car accidents, ...). We assume that inflation and other trends have been eliminated from the data so that it is reasonable to assume that the observations, called Z_1, Z_2, ..., are i.i.d. For wider applied discussions, we refer to DPP, Section 3.

5.1 Tabulation method

If the number of observations is large it might be possible to rely on the data as such. This means that the empirical distribution is used as the estimate of S. Denote the corresponding distribution function by S^e. Let N be the number of observations Z_i. Then by definition,

S^e(z) = #{i ≤ N | Z_i ≤ z}/N (5.1)

for every z ∈ R. There are often only a few large claims, which means that the estimate of the right tail of the distribution may be inaccurate.
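The tabulation method amounts to evaluating the empirical distribution function (5.1). A minimal sketch, with a small hypothetical data set:

```python
import numpy as np

def empirical_cdf(observations, z):
    """S^e(z) = #{i <= N : Z_i <= z} / N for a given data set."""
    obs = np.asarray(observations)
    return np.count_nonzero(obs <= z) / obs.size

# Hypothetical claim size data (assumed values, for illustration only).
claims = np.array([120.0, 35.0, 980.0, 15.0, 430.0, 60.0, 2500.0, 75.0])
print(empirical_cdf(claims, 100.0))   # proportion of claims of size at most 100
```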

5.2 Analytical methods

It is often convenient to work with analytically given distributions rather than the empirical distribution. The idea is to fit S^e to some appropriate mathematical distribution. An additional motivation is caused by the fact that the observations concerning large claims are often sparse. Popular analytical distributions for the estimation are

- Gamma distribution with the parameters r, α where r > 0, α > 0. The density is

s(z) = (α^r/Γ(r)) e^{−αz} z^{r−1}

for z ≥ 0 where

Γ(r) = ∫_0^∞ e^{−u} u^{r−1} du.

- Log-normal distribution with the parameters µ, σ where µ ∈ R, σ > 0. The density is

s(z) = (1/(σz√(2π))) e^{−(log z − µ)^2/(2σ^2)}

for z > 0. It is easy to see that if Y has the N(µ, σ^2) distribution (normal distribution with expectation µ and variance σ^2) then Z = e^Y has the log-normal distribution with the parameters µ and σ.

- Pareto distribution with the parameters α, r where α > 0, r > 0. The density is

s(z) = α r^α z^{−α−1}

for z ≥ r. The distribution function is

S(z) = 1 − (z/r)^{−α}

for z ≥ r.

The parameters are usually determined by some statistical method, for example, by the maximum likelihood or by the moment method. Concerning the right tails, the above three families of distributions represent three different types. The gamma distribution is light tailed, which means that its moment generating function is finite for some positive value of the argument. In contrast to this, the log-normal distribution is heavy tailed. Still all the moments are finite. The Pareto distribution is also heavy tailed. The origin moment a_n is finite only if n < α. From the applied point of view, the Pareto distribution is the most and the gamma distribution the least dangerous one among these three types.

A wide class of Pareto-type distributions is obtained in the following way. A function f : (0, ∞) → (0, ∞) is called regularly varying with index α ∈ R if for every z > 0,

lim_{t→∞} f(tz)/f(t) = z^α.

If α = 0 then f is called slowly varying. Obviously, f is regularly varying with index α if and only if

f(z) = z^α f_0(z), z > 0,

where f_0 is slowly varying.

Theorem 5.1. The function f : (0, ∞) → (0, ∞) is slowly varying if and only if there exist positive constants b and c such that

f(z) = a(z) exp( ∫_b^z e(y)/y dy )

for every z ≥ b where e(z) → 0 and a(z) → c as z → ∞.

The proof can be found in Bingham et al. (1987). The next result is an immediate consequence.

Corollary 5.1. If f : (0, ∞) → (0, ∞) is slowly varying then for every ε > 0, there exists z_ε > 0 such that

z^{−ε} ≤ f(z) ≤ z^{ε}, ∀z ≥ z_ε.

Write in short S̄(z) = 1 − S(z) for z ∈ R. Let α ≥ 0. We say that the (right) tail of Z is regularly varying with index −α if S̄ (restricted to (0, ∞)) is regularly varying with index −α. By Corollary 5.1, we can then associate the power z^{−α} with S̄(z) for large z. The obtained family of distributions is rather wide and extensively studied. In general, the risk associated with the right tail can be roughly described by means of the following characteristics. Let α_S ∈ [0, ∞] be such that

S̄(z) ≈ e^{−α_S z}

for large z. The exact meaning is that

lim sup_{z→∞} z^{−1} log S̄(z) = −α_S.

In words, α_S describes the (exponential) rate of convergence of S̄(z) to zero. It is clear that α_S = 0 for every Pareto distribution. In these circumstances, the polynomial rate of convergence is more useful. This means that for some β_S ∈ [0, ∞],

S̄(z) ≈ z^{−β_S}

for large z, or more precisely, that

lim sup_{z→∞} (log z)^{−1} log S̄(z) = −β_S.

Lemma 5.1. Let S be the distribution function of Z and let Z^+ = max(Z, 0). Then

α_S = sup{s ≥ 0 : E(e^{sZ}) < ∞} (5.2)

and

β_S = sup{s ≥ 0 : E((Z^+)^s) < ∞}. (5.3)

Proof. Write

κ = sup{s ≥ 0 : E(e^{sZ}) < ∞}.

We have to show that αS = κ.

We first prove that α_S ≥ κ. We can assume that κ > 0. Let s ∈ (0, κ) so that E(e^{sZ}) < ∞. By Chebyshev's inequality,

E(e^{sZ}) ≥ E(e^{sZ} 1(Z > z)) ≥ e^{sz} S̄(z).

Hence,

−α_S = lim sup_{z→∞} z^{−1} log S̄(z) ≤ −s,

so that α_S ≥ s. Thus α_S ≥ κ. To get the reverse inequality, we can assume that κ < ∞. Let s > κ and ε > 0 be arbitrary. Then

+∞ = E(e^{sZ}) = ∫_0^∞ P(e^{sZ} > z) dz.

Thus there exists a sequence (z_n) of real numbers such that z_n → ∞ and

P(e^{sZ} > z_n) ≥ z_n^{−1−ε}, n = 1, 2, ....

Write y_n = s^{−1} log z_n. It is seen that

−α_S ≥ lim sup_{n→∞} y_n^{−1} log S̄(y_n) ≥ −(1 + ε)s.

Consequently, α_S ≤ (1 + ε)s and further, α_S ≤ κ. We have proven the first claim of the lemma. The second one is a consequence of this. □

Based on the results of Lemma 5.1, we call α_S the exponential index and β_S the moment index of the distribution S.
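As noted above, the parameters of the analytical distributions are usually estimated by the maximum likelihood or the moment method. For the Pareto distribution with a known threshold r, the maximum likelihood estimate of α has the closed form α̂ = N/Σ log(Z_i/r). A minimal sketch with hypothetical data and threshold (the function name is an arbitrary choice):

```python
import numpy as np

def pareto_alpha_mle(claims, r):
    """Maximum likelihood estimate of the Pareto index alpha for a known threshold r."""
    z = np.asarray(claims, dtype=float)
    z = z[z >= r]                       # only observations in the Pareto range z >= r
    return z.size / np.sum(np.log(z / r))

# Hypothetical large-claim observations and threshold (assumed values).
large_claims = [1.2e6, 3.5e6, 1.1e6, 2.0e6, 7.8e6, 1.6e6]
print(pareto_alpha_mle(large_claims, r=1.0e6))
```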

5.3 On the estimation of the tails of the distribution

The frequency of the occurrence of large claims is often small. On the other hand, large claims cause a serious risk for the companies. Therefore, it is worthwhile to pay some further attention to these claims. The following three examples may be viewed as practical ways to attack the problem.

- Take an enlarged database for the estimation of the large claims. For example, it may be useful to consider the data in an extended time period, or to borrow observations from the country level (if possible).
- Estimate possible claims individually. It might be possible to look at the insurance contracts to see which claim sizes are possible, and also to estimate the probability of the occurrence of such claims.
- Add some hypothetical claims to the data. For example, estimate in one way or another which claim sizes are possible during a 10 or 100 year period.

If the tail of the claim size distribution is estimated separately it is necessary to combine the two distributions, namely, the one which covers small and the one which covers large claims. Let M > 0 and let S_1 and S_2 be distribution functions which describe small and large claims, respectively. Assume that S_1(M) = 1 and S_2(M) = 0. We interpret that S_1 is the conditional distribution of the claim size given that it is at most M. Similarly, we

interpret that S2 is the conditional distribution of the claim size given that it is larger than M. The true distribution function S for all the claims is then obtained by setting

S(z) = p_M S_1(z) + (1 − p_M) S_2(z)

for z ≥ 0 where p_M is the probability that the claim size is at most M (also p_M has to be estimated). Let Z be a random variable with the distribution function S. If B ⊆ (−∞, M] is a Borel set then

P(Z ∈ B | Z ≤ M) = ∫_B dS(z)/S(M) = p_M ∫_B dS_1(z)/(p_M S_1(M)) = ∫_B dS_1(z).

Similarly, if B ⊆ (M, ∞) is a Borel set then

P(Z ∈ B | Z > M) = ∫_B dS_2(z).

In this way, S corresponds to the distribution function S1 in the range z ≤ M and to the distribution function S2 in the range z > M.

Assume as an example that S_2 is the Pareto distribution with the parameters α and r. Then a natural choice is to take M = r. The estimation of α may be partly based on the data and partly on other views. Concerning the small claims, S_1 could be as in (5.1), corresponding to the empirical distribution conditionally on the claim size being at most M. Also p_M could be estimated directly from the data but it could also be adjusted by other views.
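The combination of the two estimated distributions can be implemented directly from the definition S(z) = p_M S_1(z) + (1 − p_M)S_2(z). The sketch below splices an empirical distribution of the small claims with a Pareto tail; the data, the threshold M, the weight p_M and the index α are all hypothetical.

```python
import numpy as np

def spliced_cdf(z, small_claims, M, p_M, alpha):
    """S(z) = p_M * S_1(z) + (1 - p_M) * S_2(z) with empirical S_1 on (-inf, M]
    and a Pareto(alpha, r = M) tail S_2 on (M, inf)."""
    small = np.asarray(small_claims, dtype=float)
    s1 = np.count_nonzero(small <= min(z, M)) / small.size   # S_1(z); equals 1 for z >= M
    s2 = 0.0 if z <= M else 1.0 - (z / M) ** (-alpha)        # S_2(z)
    return p_M * s1 + (1.0 - p_M) * s2

# Hypothetical inputs: observed small claims, threshold M, tail weight and Pareto index.
small_claims = [120.0, 35.0, 980.0, 15.0, 430.0, 60.0, 75.0]
print(spliced_cdf(2000.0, small_claims, M=1000.0, p_M=0.98, alpha=1.5))
```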

6 Calculation and estimation of the total claim amount

Calculation of the distribution function of a compound variable is difficult even in the case where the distributions of the number of claims and the claim size are known. In principle, convolution sum (4.4) can be used for this purpose. Alternatively, it would be possible to invert the moment generating function of Theorem 4.1 (or the characteristic function, which can be expressed by means of the characteristic functions of the number of claims and the claim size). The implementation may need a lot of computer time. We study three other possible ways to attack the problem, namely, a recursive method, approximations, and simulation. Furthermore, we derive upper bounds for tail probabilities of compound distributions.

6.1 Panjer method

We consider in this section a recursive algorithm for the exact calculation of the distribution functions of compound distributions. A background assumption will be that the claim size has only a finite number of possible values, but this assumption can be relaxed. The method applies, for example, to Poisson and Polya distributions. Let X be a compound variable. Denote by F the distribution function of X. Let K be the associated number of claims, and let Z, Z_1, Z_2, ... be the associated claim sizes. Write p_k = P(K = k) for k ∈ N. The algorithm assumes the recursion

p_k = (a + b/k) p_{k−1}, k = 1, 2, ..., (6.1)

where a, b and p_0 are constants. Naturally, we assume that p_0 > 0. It is easy to see that Poisson, binomial and Polya distributions satisfy (6.1). A slightly harder analysis shows that no other distribution satisfies (6.1). Denote by S the distribution function of the claim size. Assume that there exist r ∈ N, c > 0, and non-negative real numbers s_0, s_1, . . . , s_r such that

s_i = P(Z = ic), i = 0, 1, . . . , r, and Σ_{i=0}^r s_i = 1.

By the transform S(z) → S(cz), we end up in the situation where c = 1. If G is the distribution function of the resulting compound distribution then G(z) = F(cz) for each z. Therefore, without loss of generality, we will assume that c = 1. Then X is non-negative and integer valued. Write

fj = P(X = j), j = 0, 1, 2,.... (6.3) University of Helsinki 32

Theorem 6.1.1. (Panjer method). Under the assumptions described above, the probabilities f_0, f_1, f_2, ... are obtained by the recursion

f_0 = p_0 if s_0 = 0, and f_0 = Σ_{i=0}^∞ p_i s_0^i if s_0 > 0,

f_j = (1/(1 − a s_0)) Σ_{i=1}^{min(j,r)} (a + ib/j) s_i f_{j−i}, j = 1, 2, ....

Proof. Clearly, f_0 of the theorem equals P(X = 0). Write

s_0^{0∗} = 1, s_j^{0∗} = 0, j = 1, 2, ...

and

s_j^{k∗} = P(Z_1 + ··· + Z_k = j), j = 0, 1, 2, . . . , k = 1, 2, ....

Let k ≥ 1. By the independence of the Z-variables,

s_j^{k∗} = Σ_{i=0}^j P(Z_1 = i, Z_2 + ··· + Z_k = j − i) = Σ_{i=0}^j s_i s_{j−i}^{(k−1)∗}. (6.4)

Consider first the case where s_j^{k∗} > 0. By symmetry,

E(Z_1 | Σ_{i=1}^k Z_i = j) = (1/k) E(Σ_{i=1}^k Z_i | Σ_{i=1}^k Z_i = j) = j/k.

On the other hand,

E(Z_1 | Σ_{i=1}^k Z_i = j) = Σ_{m=1}^j m P(Z_1 = m | Σ_{i=1}^k Z_i = j)
                         = Σ_{m=1}^j m P(Z_1 = m, Σ_{i=2}^k Z_i = j − m) / P(Σ_{i=1}^k Z_i = j)
                         = Σ_{m=1}^j m s_m s_{j−m}^{(k−1)∗} / s_j^{k∗}.

Combine the results to see that

s_j^{k∗} = (k/j) Σ_{i=1}^j i s_i s_{j−i}^{(k−1)∗}. (6.5)

This also holds if s_j^{k∗} = 0 because then all the terms on the right hand side equal zero. Let j > 0. Then

f_j = Σ_{k=1}^∞ p_k s_j^{k∗} = Σ_{k=1}^∞ (a + b/k) p_{k−1} s_j^{k∗}
    = Σ_{k=1}^∞ a p_{k−1} Σ_{i=0}^j s_i s_{j−i}^{(k−1)∗} + Σ_{k=1}^∞ (b/j) p_{k−1} Σ_{i=1}^j i s_i s_{j−i}^{(k−1)∗}
    = a s_0 Σ_{k=1}^∞ p_{k−1} s_j^{(k−1)∗} + Σ_{i=1}^j (a + ib/j) s_i Σ_{k=1}^∞ p_{k−1} s_{j−i}^{(k−1)∗}
    = a s_0 f_j + Σ_{i=1}^j (a + ib/j) s_i f_{j−i}. □

The distribution function is now obtained by

F(j) = Σ_{i=0}^j f_i, j = 0, 1, 2, ....

Theorem 6.1.1 gives the exact value of the distribution function everywhere. Reasonable approximations are obtained also in the case where the claim size distribution is more general (for example, continuous distributions can be approximated by discrete distributions). A problem of the method is that it requires a lot of computer time when r is large.
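Theorem 6.1.1 translates directly into a short program. The sketch below implements the recursion for the compound Poisson case, where a = 0, b = λ and p_0 = e^{−λ}; the claim size probabilities and λ are assumed example values.

```python
import numpy as np

def panjer_compound_poisson(lam, s, j_max):
    """Probabilities f_0, ..., f_{j_max} of a compound Poisson variable via the
    Panjer recursion (Theorem 6.1.1) with a = 0, b = lam.
    s[i] = P(Z = i), i = 0, ..., r, on the integer grid (c = 1)."""
    s = np.asarray(s, dtype=float)
    r = s.size - 1
    a, b = 0.0, lam
    f = np.zeros(j_max + 1)
    f[0] = np.exp(lam * (s[0] - 1.0))      # equals sum_i p_i * s_0^i in the Poisson case
    for j in range(1, j_max + 1):
        i = np.arange(1, min(j, r) + 1)
        f[j] = np.sum((a + b * i / j) * s[i] * f[j - i]) / (1.0 - a * s[0])
    return f

# Assumed example: lam = 2 claims on average, claim sizes 1, 2 or 3 units.
f = panjer_compound_poisson(lam=2.0, s=[0.0, 0.5, 0.3, 0.2], j_max=20)
print("P(X = 0) =", f[0], "   P(X <= 10) =", f[:11].sum())
```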

6.2 Approximation of the compound distributions Approximation methods are often appropriate for large portfolios. They give tools for quick estimation of compound distributions. They may also give qualitative information about underlying risks.

6.2.1 Limiting behaviour of compound distributions

Consider the limiting behaviour of the total claim amount as the mean of the number of claims tends to infinity. The claim size distribution and the possible mixing variable are kept fixed. The limiting procedure indicates that the obtained estimates are useful mostly for big companies. The normal approximation of the probability P(X ≤ x) is by definition

P(X ≤ x) = P((X − E(X))/σ_X ≤ (x − E(X))/σ_X) ≈ φ((x − E(X))/σ_X),

where φ is the distribution function of the standard normal variable. The following result justifies the approximation for the compound Poisson variable.

Theorem 6.2.1.1. Let X = X_λ have the compound Poisson distribution with the parameter (λ, S). Assume that

a_2 = ∫_0^∞ z^2 dS(z) ∈ (0, ∞).

Then for every x ∈ R,

lim_{λ→∞} P((X_λ − E(X_λ))/σ_{X_λ} ≤ x) = φ(x).

Proof. By Theorem 4.3,

X_λ =_L X_{⌊λ⌋} + ξ_λ,

where ξ_λ and X_{⌊λ⌋} are independent and

ξ_λ =_L X_{λ−⌊λ⌋}.

Hence,

(X_λ − E(X_λ))/σ_{X_λ} =_L (X_{⌊λ⌋} − E(X_{⌊λ⌋}))/σ_{X_{⌊λ⌋}} · σ_{X_{⌊λ⌋}}/σ_{X_λ} + (ξ_λ − E(ξ_λ))/σ_{X_λ}. (6.2.1.1)

By Theorem 4.3,

X_{⌊λ⌋} = η_1 + ··· + η_{⌊λ⌋},

where η_1, . . . , η_{⌊λ⌋} are independent compound Poisson variables with the parameter (1, S). By the central limit theorem,

lim_{λ→∞} P((X_{⌊λ⌋} − E(X_{⌊λ⌋}))/σ_{X_{⌊λ⌋}} ≤ x) = φ(x). (6.2.1.2)

On the other hand,

lim_{λ→∞} σ_{X_{⌊λ⌋}}/σ_{X_λ} = lim_{λ→∞} √(⌊λ⌋/λ) = 1 (6.2.1.3)

and

lim_{λ→∞} E( ((ξ_λ − E(ξ_λ))/σ_{X_λ})^2 ) = lim_{λ→∞} (λ − ⌊λ⌋)a_2/(λa_2) = 0. (6.2.1.4)

Generally, if B, B_λ, C_λ and D_λ are random variables and c and d constants, and

B_λ → B (convergence in distribution),
C_λ → c (convergence in probability),
D_λ → d (convergence in probability),

then C_λ B_λ + D_λ → cB + d in distribution. The claim of the theorem follows from limits (6.2.1.2) - (6.2.1.4) because the L^2-convergence in (6.2.1.4) implies convergence in probability. □

The estimate of Theorem 6.2.1.1 is not valid for compound mixed Poisson variables. The limit also exists in this case but the limiting distribution is not normal.

Theorem 6.2.1.2. Let X = X_λ have the compound mixed Poisson distribution with the parameter (λ, Q, S). Let H be the distribution function of Q. Assume that Var(Q) is finite and that a_2 ∈ (0, ∞). Then

lim_{λ→∞} P(X_λ/E(X_λ) ≤ x) = H(x)

for every continuity point x of H. In other words,

X_λ/E(X_λ) → Q in distribution.

Proof. Let Var(X_λ|Q) be the conditional variance of X_λ with respect to Q,

Var(X_λ|Q) := E((X_λ − E(X_λ|Q))^2 | Q).

Given that Q = q, the (regular) conditional distribution of X_λ is the compound Poisson distribution with the parameter (λq, S). By Theorem 4.2,

Var(X_λ|Q) = λa_2 Q.

Write

a_1 = E(Z) = ∫_0^∞ z dS(z).

Then

lim_{λ→∞} E( (X_λ/E(X_λ) − Q)^2 ) = lim_{λ→∞} (1/(λ^2 a_1^2)) E(Var(X_λ|Q)) = lim_{λ→∞} a_2/(λa_1^2) = 0.

This is L^2-convergence, which implies the convergence in distribution. □

Theorems 6.2.1.1 and 6.2.1.2 indicate that the compound mixed Poisson distribution is more dangerous than the ordinary compound Poisson distribution. Roughly speaking, in the latter case the observations have a tendency to be close to the expectation (the distance is of the order √λ), and in the former case, the distance is of order λ. Furthermore, in Theorem 6.2.1.1,

X_λ/λ → a_1 in distribution,

and in Theorem 6.2.1.2, the corresponding limit is a random variable,

X_λ/λ → a_1 Q in distribution.

6.2.2 Refinements of the normal approximation

Theorem 6.2.1.1 justifies the normal approximation of the compound Poisson distribution for large insurance portfolios. Sharper estimates can often be obtained by taking into account the skewness of the distribution. By (4.11), the skewness is of order 1/√λ so that it tends to zero rather slowly. The skewness of the normal distribution equals zero. In view of Theorem 6.2.1.2, the normal approximation is questionable for compound mixed Poisson distributions. However, applications of the theorem are not immediate because it needs a precise distribution of the mixing variable. Useful information may be limited to a few of the lowest moments. The following methods are sometimes also applied to mixed variables although theoretical justifications can mainly be given for compound Poisson distributions.

Approximation methods can be viewed as (partial) solutions to the following problem. Let X be a compound variable. The goal is to find a function ψ such that ψ(X) is close to the standard normal variable. Additionally, ψ should be simple enough to be of practical use. For example, it could be required that ψ depends on the distribution of X only via a few moments. This is the case in the normal approximation where ψ depends on the expectation and the variance,

ψ(X) = (X − E(X))/σ_X.

NP approximation

Let X_λ be as in Theorem 6.2.1.1. Consider again the limiting behaviour of X_λ as λ → ∞. To save notation, we will write X instead of X_λ. Let X̄ be the standardized variable,

X̄ = (X − E(X))/σ_X.

Assume for simplicity that λ ∈ N and that the distribution of X is non-arithmetic, that is, the distribution is not concentrated on any set of the type {a + nh | n = 0, ±1, ±2, ...}, a, h ∈ R. We will estimate the probability P(X̄ ≤ x) in the limit as λ → ∞.

Assume that the skewness γ_X of X exists and is finite. Write

γ_X = γ/√λ

where γ is the skewness of X when λ = 1. The following heuristic consideration leads to the NP approximation. The reader is also referred to Sundt (1984), Section 9.5. We have

| P(X̄ ≤ x) − φ(x) − (γ/(6√λ))(1 − x^2)φ'(x) | = o(1/√λ)

uniformly for x ∈ R as λ → ∞. The proof can be found in Feller (1971), Chapter XVI. Equivalently,

P(X̄ ≤ x) = φ(x) + (γ/(6√λ))(1 − x^2)φ'(x) + o(γ_X), (6.2.2.1)

where the error term o(γ_X) is uniform over R. The usual normal approximation gives a weaker estimate, namely, P(X̄ ≤ x) = φ(x) + O(γ_X). In the NP approximation, one wants to find a simple transform ϕ such that

The error term is now o(γX). The transform to be used is determined by

ϕ(y) = y + (γX/6)(y² − 1)  (6.2.2.3)

and

ν(x) = −3/γX + √( 9/γX² + 1 + 6x/γX ).

By (4.11), the skewness γX is positive so that ϕ has the inverse on [1, ∞). Thus ν is well defined on [1, ∞). Typically, the approximation is applied to the right tail. The NP approximation of the probability P(X ≥ x) is

P(X ≥ x) ≈ 1 − φ(y), where y = −3/γX + √( 9/γX² + 1 + (6/γX)·(x − E(X))/σX ).

Often the goal is to find x such that

P(X ≥ x) = ε,

where ε > 0 is given. The approximation is then used in the following way. Let yε be such that φ(yε) = 1 − ε. Then P(X̄ ≥ ϕ(yε)) ≈ ε. The desired x is obtained by setting

x̄ = yε + (γX/6)(yε² − 1),   x = E(X) + x̄ σX.

The NP approximation is limited to the right tail of the distribution. The obtained estimates are relatively good if the skewness γX is small. A rule of thumb: the NP approximation should not be used if γX > 1.
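As a small illustration, the following sketch computes the NP estimate of x for a given tail probability ε; the portfolio mean, standard deviation and skewness below are assumed figures, not values from the course material.

    from scipy.stats import norm

    mean, sd, skew = 10_000.0, 500.0, 0.3   # assumed E(X), sigma_X, gamma_X
    eps = 0.01                              # target tail probability

    y = norm.ppf(1 - eps)                   # y_eps with phi(y_eps) = 1 - eps
    x_std = y + skew / 6 * (y**2 - 1)       # NP transform of y_eps
    x = mean + x_std * sd                   # P(X >= x) is approximately eps
    print(x)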

Wilson–Hilferty approximation. The gamma distribution is sometimes used directly as an approximation of the total claim amount. Theorem 6.2.1.2 justifies this in some sense if the mixing variable has a Polya distribution. The Wilson–Hilferty approximation was developed to approximate the gamma distribution. From this point of view, the method could also give useful estimates for the total claim amount. Let X be a compound variable and

X̄ = (X − E(X)) / σX

its standardized version. Similarly, for x ∈ R, we write

x̄ = (x − E(X)) / σX.

The Wilson–Hilferty approximation of the probability P(X ≤ x) is

P(X ≤ x) = P(X̄ ≤ x̄) ≈ φ(W(x̄)), where

W(x̄) = c1 + c2 (x̄ + c3)^{1/3}

and

c1 = 1/(3g) − 3g,  c2 = 3g^{2/3},  c3 = g  and  g = 2/γX.  (6.2.2.4)

Conversely, if we have to determine x such that P(X ≥ x) = ε, then the following steps are taken:

1) determine yε such that φ(yε) = 1 − ε,
2) determine x̄ such that

x̄ = ( (yε − c1)/c2 )³ − c3,

3) take x = E(X) + x̄ σX.

The transformation in step 2 uses a polynomial of degree 3. In the NP approximation, the degree is 2. The accuracies of the two methods are rather close to each other. An advantage of the Wilson–Hilferty approximation is that

G(x̄) = φ(W(x̄)), x̄ ∈ R,

determines a distribution function on the whole of R. An advantage of the NP approximation is its simplicity.

We end the section by observing that the Wilson–Hilferty approximation is a special case of the Haldane approximation. In this latter method, one first determines a parameter h such that the skewness of the variable

Y = (X / E(X))^h

is zero (or is close to zero). The normal approximation is then applied to

Ȳ = (Y − E(Y)) / σY.

The Wilson–Hilferty approximation is obtained by taking h = 1/3. The Haldane approximation needs as an extra step the determination of h, which is not a straightforward task. The obtained estimates are often better than in the Wilson–Hilferty approximation.

6.2.3 Applications of approximation methods

In this section we apply approximation methods to the determination of capital requirements of companies. Let L be a general (stochastic) amount of money. Assume that by certain contracts, an agent has to pay that amount during the forthcoming year. We associate with L a real number which describes the risk. Let p ∈ (0, 1) be fixed. The value at risk at level p, denoted by VaR[L; p], is defined by

VaR[L; p] = F_L⁻¹(p),

where F_L is the distribution function of L and

F_L⁻¹(p) = inf{ x | F_L(x) ≥ p }.

Hence, F_L(VaR[L; p]) ≥ p. In words, VaR[L; p] suffices to pay L with a probability ≥ p, and is the minimal such amount of money. In insurance applications, p is often close to one; for example, we may have p ∈ (0.99, 0.999). Observe also that if F_L is continuous then

F_L⁻¹(p) = inf{ x | F_L(x) = p }  and  F_L(VaR[L; p]) = p.  (6.2.2.5)

It is also natural to consider expectations associated with the right tail of the distribution. The conditional VaR, denoted by CVaR[L; p], is defined by

CVaR[L; p] = E( L − VaR[L; p] | L > VaR[L; p] ).

This describes the severity of the unpleasant event that L cannot be covered by the capital in use. In the following, we take L = X, the total claim amount of the company in the forthcoming year. We will assume that the skewness of X is finite and positive. We will make use of the normal approximation and its variants such that the simplified formulas (6.2.2.5) can be used.
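A minimal sketch of how VaR[L; p] and CVaR[L; p] could be estimated from a simulated sample of L; the lognormal sample below is only a stand-in for an actual loss model.

    import numpy as np

    rng = np.random.default_rng(1)
    L = rng.lognormal(mean=3.0, sigma=1.0, size=100_000)  # stand-in loss sample
    p = 0.995

    L_sorted = np.sort(L)
    k = int(np.ceil(p * len(L))) - 1       # smallest x with empirical F(x) >= p
    var_p = L_sorted[k]
    cvar_p = (L[L > var_p] - var_p).mean() # E(L - VaR | L > VaR)
    print(var_p, cvar_p)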

Capital requirements. Assume that the company has at the beginning of the year the initial capital U0. The total claim amount of the forthcoming year is X. The corresponding premium is P = (1 + v)µX where µX = E(X) is the risk premium and v > 0 is the safety loading. It is also usual terminology to call µX the pure risk premium and v the relative safety loading.

We will consider the ruin probability of the company during one year. Ruin (or bankruptcy) happens if the initial capital together with the premiums does not suffice for the compensations of the claims. The company is allowed to continue in the market if the ruin probability is at most ε, a small number determined by the supervisory authority. The problem is to determine the required initial capital when the other parameters and the structure of the model are given. Mathematically, we have to find the initial capital U0 such that

P(U0 + P − X < 0) = ε,

or that

P(X > U0 + P) = ε.

The answer is

U0 = VaR[X; 1 − ε] − P.

Let φ(yε) = 1 − ε. The normal approximation gives

U0 = µX + yε σX − P

= yε σX − vµX.

The NP approximation gives

U0 + P = µX + ( yε + (γX/6)(yε² − 1) ) σX

so that

U0 = ( yε + (γX/6)(yε² − 1) ) σX − vµX.

Similarly, the minimal capital based on the Wilson–Hilferty approximation is

U0 = ( ((yε − c1)/c2)³ − c3 ) σX − vµX,

where c1, c2 and c3 are as in (6.2.2.4).
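The three capital requirement formulas can be compared numerically. The sketch below uses assumed values of µX, σX, γX, v and ε; it only illustrates the formulas above and is not a calibrated example.

    from scipy.stats import norm

    mu, sigma, gamma = 100.0, 10.0, 0.5   # assumed E(X), sigma_X, gamma_X
    v, eps = 0.05, 0.01                   # safety loading and ruin level
    y = norm.ppf(1 - eps)                 # y_eps

    u_normal = y * sigma - v * mu
    u_np = (y + gamma / 6 * (y**2 - 1)) * sigma - v * mu
    g = 2 / gamma                                        # constants of (6.2.2.4)
    c1, c2, c3 = 1 / (3 * g) - 3 * g, 3 * g**(2 / 3), g
    u_wh = (((y - c1) / c2)**3 - c3) * sigma - v * mu
    print(u_normal, u_np, u_wh)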

Merging of companies. Consider the situation where two companies are merged to form one larger company. Let Xi be the total claim amount associated with company i, i = 1, 2, and let µi, σi and γi be the mean, the standard deviation, and the skewness of Xi. Let further vi be the safety loading associated with company i. Let's consider capital requirements by means of the NP approximation. For company i, before the merging, the requirement is

Ui = ( yε + (γi/6)(yε² − 1) ) σi − vi µi.

After the merging, the requirement for the new company is

U0 = ( yε + (γ/6)(yε² − 1) ) σ − vµ,

where µ, σ and γ are related to the company arising in the merging. We assume that yε > 1 and that σi and γi are finite and strictly positive for i = 1, 2.

Assume that X1 and X2 are independent. We will show that U0 < U1 + U2. In words, the required initial capital after the merging is smaller than the sum of the required initial capitals of the participating companies. For the moments of the new company, we have

µ = µ1 + µ2,

vµ = v1µ1 + v2µ2

and

σ = √(σ1² + σ2²) < σ1 + σ2.

The last equality needs the independence of X1 and X2. Furthermore,

E( (X1 + X2 − µ1 − µ2)³ ) = E( (X1 − µ1)³ ) + E( (X2 − µ2)³ ).

The proof is left to the reader. It follows that

U1 + U2 − U0 = yε (σ1 + σ2 − σ) − (v1µ1 + v2µ2 − vµ) + (1/6)(yε² − 1)(γ1σ1 + γ2σ2 − γσ)
  > (1/6)(yε² − 1) ( E((X1 − µ1)³)/σ1² + E((X2 − µ2)³)/σ2² − γσ )
  > (1/6)(yε² − 1) ( (E((X1 − µ1)³) + E((X2 − µ2)³))/σ² − γσ )
  = (1/6)(yε² − 1) ( E((X1 + X2 − µ1 − µ2)³)/σ² − γσ ) = 0.

Thus U0 < U1 + U2. The relative gain of the merging is

1 − U0/(U1 + U2) ∈ (0, 1).

The same conclusion is obtained by the normal approximation. Then

Ui = yε σi − vi µi, i = 1, 2,

and

U0 = yε √(σ1² + σ2²) − v1µ1 − v2µ2
   = yε ( √(σ1² + σ2²) − σ1 − σ2 ) + U1 + U2

< U1 + U2.

The required capital, based on the normal approximation, gets smaller in the merging even if the total claim amounts are not independent. To see this, write σ12 = Cov(X1,X2). Then

Var(X1 + X2) = σ1² + σ2² + 2σ12.

By Schwarz’s inequality,

σ12 = E( (X1 − µ1)(X2 − µ2) ) ≤ √(E((X1 − µ1)²)) √(E((X2 − µ2)²)) = σ1 σ2,

so that

U0 = yε √(σ1² + σ2² + 2σ12) − v1µ1 − v2µ2
   ≤ yε ( √(σ1² + σ2² + 2σ1σ2) − σ1 − σ2 ) + U1 + U2

= U1 + U2.

The equality only holds if σ12 = σ1σ2. As a conclusion, we can think that the merging has a tendency to reduce capital requirements.

6.3 Simulation of compound distributions

Computers are usually able to produce independent random variables which are uniformly distributed on the interval (0, 1). We call them random numbers. They are not exactly what they should be since they are generated by deterministic algorithms, but they are widely used as reasonably good approximations. As we will see later, it is possible to produce random numbers from any distribution by means of the above mentioned uniformly distributed random variables. This gives a tool to estimate 'statistically' the probabilities and expectations of interest.

Simulation can be used to derive unbiased estimates but it can also give some information about realizations of stochastic processes. The method is also flexible by giving a way to analyse complicated problems. Analytic methods are more restrictive in this sense. On the other hand, analytic approaches may give qualitative information about the phenomenon in question. For example, concrete formulas may show which parts of the model cause the main risks. Simulation only gives numbers or graphical illustrations.

6.3.1 Producing observations

Assume that we have a computer which is able to produce independent T(0, 1)-distributed random numbers, as many as we need. The common distribution function T of the random numbers is determined by

T(x) = 0 if x < 0,  T(x) = x if x ∈ [0, 1],  T(x) = 1 if x > 1.

Theorem 6.3.1.1. Let F be a distribution function, and let the random variable R have the T (0, 1) distribution. Define the random variable RF by

RF = min{x ∈ R | F (x) ≥ R}.

Then RF is F -distributed, that is,

P(RF ≤ x) = F (x), ∀x ∈ R.

Proof. First observe that RF is well defined because of the right continuity of the distribution functions. Let x0 ∈ R. Then a) R ≤ F (x0) ⇒ RF ≤ x0. This follows from the fact that for a given ω ∈ Ω, RF is the minimal real number x such that F (x) ≥ R.

b) R > F (x0) ⇒ R > F (y) for every y ≤ x0 ⇒ RF > x0. This follows from the monotonicity of F . In summary, R ≤ F (x0) ⇔ RF ≤ x0.

Now F (x0) ∈ [0, 1] so that

P(RF ≤ x0) = P(R ≤ F(x0)) = T(F(x0)) = F(x0). □

The distribution function of a compound Poisson variable has no simple analytic representation so that the previous theorem does not apply directly. Random numbers from this distribution can still be produced. Assume that λ and the distribution function S of the claim size are given. In the ith replication, we take the following three steps.

1) Generate a Poisson distributed random variable with the parameter λ. Denote it by Ki.

2) Generate Ki i.i.d. random variables with the distribution function S. Denote them by Z1,...,ZKi .

3) Put Xi = Z1 + ··· + ZKi . By repeating the steps keeping all the random variables independent, we obtain a sample X1,X2,... from the desired compound Poisson distribution. Steps 1) and 2) are based on Theorem 6.3.1.1. A similar method applies to general compound distributions. In the case of the com- pound mixed Poisson distribution, step 1) is implemented by generating first an obser- vation from the distribution of the mixing variable Q. Denote it by Qi. Then a Poisson distributed random variable is generated with the parameter λQi.
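A sketch of the three steps for a compound Poisson variable; the Pareto claim size distribution below is only an example, and the claim sizes are drawn via the inverse transform of Theorem 6.3.1.1.

    import numpy as np

    rng = np.random.default_rng(0)
    lam = 200.0                        # expected number of claims
    alpha, x0 = 2.5, 1.0               # assumed Pareto claim size parameters

    def claim_sizes(n):
        # inverse transform: S(z) = 1 - (x0/z)^alpha, so S^{-1}(u) = x0 (1-u)^(-1/alpha)
        u = rng.uniform(size=n)
        return x0 * (1 - u) ** (-1 / alpha)

    def compound_poisson():
        k = rng.poisson(lam)           # step 1: number of claims
        z = claim_sizes(k)             # step 2: K i.i.d. claim sizes
        return z.sum()                 # step 3: total claim amount

    sample = np.array([compound_poisson() for _ in range(10_000)])
    print(sample.mean(), sample.std())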

6.3.2 Estimation

Consider in general the estimation of the probability α := P(X ∈ A), where A is a Borel set. We assume that the distribution function of X has no simple expression but that it is possible to produce observations by simulation. We build a sample of size N from the distribution of X, that is, we generate N i.i.d. observations such that the common distribution is that of X. Denote them by X1, ..., XN. We estimate α by

α̂ = α̂_N = #{ i | Xi ∈ A } / N.  (6.3.2.1)

Then

α̂ = N⁻¹ Σ_{i=1}^N Ii,  (6.3.2.2)

where Ii = 1(Xi ∈ A). Thus I1, ..., IN are i.i.d. and

P(Ii = 0) = P(Xi ∉ A) = 1 − α,   P(Ii = 1) = P(Xi ∈ A) = α.

This is the binomial distribution Bin(1, α) (or the Bernoulli distribution with parameter α). In particular, E(Ii) = α and Var(Ii) = α(1 − α). Thus

E(α̂) = α  (6.3.2.3)

and

Var(α̂) = σ_α̂² = α(1 − α)/N.  (6.3.2.4)

By (6.3.2.3), α̂ is an unbiased estimator of α. By the law of large numbers,

P( lim_{N→∞} α̂_N = α ) = 1.  (6.3.2.5)

Thus the estimation of α by simulation seems to be possible. The estimates obtained by simulation always contain an estimation error. To control this, let's consider criteria for the determination of the sample size. For large N,

(I1 + ··· + IN − Nα) / √(Nα(1 − α))

is approximately N(0, 1) distributed by the central limit theorem (N(0, 1) denotes the standard normal distribution). The same is true for

(α̂ − α) / σ_α̂.  (6.3.2.6)

The precision p of α̂ is defined as the relative standard deviation,

p = p_N = σ_α̂ / E(α̂) = √( (1 − α)/(αN) ).  (6.3.2.6.1)

An appropriate stopping rule for the simulation is to require that p ≤ p0 where p0 is predetermined. The required sample size is then

N = (1 − α) / (α p0²).

The criterion cannot be used directly because α is not known. In practical simulation, the stopping rule has to be based on the sample. A natural way is to calculate after N replications the observed precision p̂_N by formula (6.3.2.6.1) but with E(α̂) replaced by α̂_N and σ_α̂ replaced by

√( (N⁻¹ Σ_{i=1}^N Ii² − (N⁻¹ Σ_{i=1}^N Ii)²) / N ) = √( α̂_N (1 − α̂_N) / N ).

In the stopping rule, we require that p̂_N ≤ p0. Maybe it is more natural to determine the stopping rule by requiring that

P( |α̂ − α| / α ≥ δ ) < ε,  (6.3.2.7)

where ε and δ are predetermined (small) positive numbers. Hence, we require that the relative error of the estimate is small with a high probability. However, this is not very different from the precision based criterion. Namely, if we take b such that φ(b) = 1 − ε/2 and put p0 = δ/b then

P( |α̂ − α| / α ≥ δ ) = P( |α̂ − α| / α ≥ b p0 )
  = P( |α̂ − α| / σ_α̂ ≥ b )
  ≈ 2(1 − φ(b)) = ε.

Example 6.3.2.1. Let ε = 0.05 and δ = 0.1. Then b ≈ 2 so that we take p0 = 0.05. Suppose that the probability α to be estimated equals 10⁻³. Then we need N replications,

N ≈ (1 − α) / (α p0²) ≈ 400 000.
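The order of magnitude in the example can be checked by running a crude simulation with the precision-based stopping rule; the event below (a standard normal exceeding a fixed threshold, with probability roughly 10⁻³) is just a placeholder for a genuinely interesting event.

    import numpy as np

    rng = np.random.default_rng(2)
    p0 = 0.05                      # target precision
    threshold = 3.09               # P(N(0,1) > 3.09) is roughly 1e-3

    hits, n = 0, 0
    while True:
        n += 10_000                                            # simulate in batches
        hits += int((rng.standard_normal(10_000) > threshold).sum())
        if hits == 0:
            continue
        alpha_hat = hits / n
        p_hat = np.sqrt(alpha_hat * (1 - alpha_hat) / n) / alpha_hat
        if p_hat <= p0:
            break
    print(alpha_hat, n)            # n should be of the order 400 000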

6.3.3 Increasing efficiency of simulation of small probabilities

We observed in Example 6.3.2.1 that a lot of replications are needed to reach the required precision if the probabilities to be estimated are small. It is also worth observing that producing one observation from the distribution of a compound variable is time consuming, in particular, if the expected number of claims is large. The efficiency can often be increased by an appropriate change of measure, as will be studied in this section.

Let ξ, ξ1, ξ2,... be i.i.d. random variables and

Sn = ξ1 + ··· + ξn, n = 1, 2,....

Assume that µξ ∈ (0, ∞). Let a > 0. We study the estimation of the probability

α = α_n := P( Sn/n ≥ (1 + a)µ_ξ )

for large n. The law of large numbers implies that α_n → 0 as n → ∞.

Conjugate distributions. Let P, M_ξ and c_ξ be the distribution, the moment generating function and the cumulant generating function of ξ, respectively. Let t ∈ R be such that c_ξ is finite at t. Define the distribution P_t by

P_t(B) = ∫_B e^{tx − c_ξ(t)} dP(x) = E( e^{tξ − c_ξ(t)} 1(ξ ∈ B) )  (6.3.3.1)

for every B ∈ B. It is easy to see that P_t is really a probability distribution, namely, that P_t(R) = 1. We call P_t the conjugate distribution of P with the parameter t, or the Esscher transform of P with the parameter t. In general terminology, e^{tx − c_ξ(t)} in equation (6.3.3.1) is the Radon-Nikodym derivative of P_t with respect to P,

dP_t/dP (x) = e^{tx − c_ξ(t)}, x ∈ R.

It is positive everywhere so that

dP/dP_t (x) = e^{−tx + c_ξ(t)}.

It follows that

P(B) = ∫_B e^{−tx + c_ξ(t)} dP_t(x) = E_t( e^{−tξ + c_ξ(t)} 1(ξ ∈ B) )  (6.3.3.2)

for every B ∈ B, where E_t means that the distribution of ξ in (6.3.3.2) is P_t. The same transform applies to expectations. Assume that

E(h(ξ)) = ∫_R h(x) dP(x)

exists and is finite. Then

E(h(ξ)) = ∫_R h(x) (dP/dP_t)(x) dP_t(x) = E_t( h(ξ) e^{−tξ + c_ξ(t)} ).

Similarly,

E_t(h(ξ)) = E( h(ξ) e^{tξ − c_ξ(t)} )

if the left hand side is well defined.

Assume now that c_ξ is finite in a neighbourhood of t. Then

E_t(ξ) = E( ξ e^{tξ − c_ξ(t)} ) = ∂/∂s log E(e^{sξ}) |_{s=t} = c'_ξ(t).

Conjugate distributions can sometimes be identified by means of moment generating functions. Let

M_t(s) = E_t( e^{sξ} )  and  c_t(s) = log M_t(s)  for s ∈ R.

Then

M_t(s) = E( e^{(t+s)ξ − c_ξ(t)} ) = M_ξ(t + s) / M_ξ(t)

and

c_t(s) = c_ξ(t + s) − c_ξ(t).

From these formulas, it might be possible to identify Pt. The expectations of ξ under conjugate distributions can be described in the following way. Let S be the minimal closed interval such that

P (S) = P(ξ ∈ S) = 1. Let further D = {s ∈ R | cξ(s) < ∞}.

We will see in the sequel that c_ξ is a convex function. This implies that D is convex (a point or an interval). If S̊ ≠ ∅ and D is open then the function c'_ξ : D → S̊ is a bijection (S̊ means the interior of S). Hence, if x ∈ S̊ is given then there exists a unique conjugate distribution P_t such that

E_t(ξ) = c'_ξ(t) = x.

For the proofs and more information, the reader is referred to Iscoe et al. (1985).

Importance sampling Let α be as in the first part of Section 6.3.3. We will give representations for α as expectations under conjugate distributions. Let first B1,...,Bn be arbitrary Borel sets. Obviously,

P(ξ1 ∈ B1, ..., ξn ∈ Bn) = P(B1) ··· P(Bn)
  = E_t( e^{−tξ1 + c_ξ(t)} 1(ξ1 ∈ B1) ) ··· E_t( e^{−tξn + c_ξ(t)} 1(ξn ∈ Bn) )
  = E_t( e^{−t(ξ1+···+ξn) + n c_ξ(t)} 1(ξ1 ∈ B1, ..., ξn ∈ Bn) ),

where in the last expectation, ξ1, ..., ξn are i.i.d. P_t-distributed random variables. This implies that for an arbitrary Borel set B of Rⁿ,

P((ξ1, ..., ξn) ∈ B) = E_t( e^{−t(ξ1+···+ξn) + n c_ξ(t)} 1((ξ1, ..., ξn) ∈ B) )
  = E_t( e^{−tSn + n c_ξ(t)} 1((ξ1, ..., ξn) ∈ B) ),

where Sn is the sum of n i.i.d. P_t-distributed random variables ξ1, ..., ξn. For a probability of the type P(Sn/n ∈ B), the representation

P( Sn/n ∈ B ) = E_t( e^{−tSn + n c_ξ(t)} 1(Sn/n ∈ B) )

can also be interpreted such that the conjugate change of measure has been made to the sum Sn. In other words, the same distribution of Sn is obtained if we take the conjugate distributions for ξ1, ..., ξn with preserving the independence, or if we take the

conjugate distribution of Sn. This fact is easily seen by means of the moment generating functions. In particular,

α = P( Sn/n ≥ (1 + a)µ_ξ ) = E_t( e^{−tSn + n c_ξ(t)} 1(Sn/n ≥ (1 + a)µ_ξ) ),  (6.3.3.3)

where on the right hand side, the distribution of the variable under the expectation operator is obtained either by making the conjugate change of measure to the ξ-variables or directly to the sum Sn. The equation gives a starting point for the estimation of α by means of P_t distributed random variables. In replication j, we obtain an observation Sn^j from the conjugate distribution of Sn and put

Yj = e^{−t Sn^j + n c_ξ(t)} 1( Sn^j / n ≥ (1 + a)µ_ξ ).

Then Et(Yj) = α. If we have N replications then we estimate

α̂ = α̂_t = (1/N) Σ_{i=1}^N Yi.  (6.3.3.4)

The estimator is unbiased. The variance has the representation

Var_t(α̂_t) = (η_t − α²) / N,

where

η_t = E_t( Yj² ) = E_t( e^{−2tSn + 2n c_ξ(t)} 1(Sn/n ≥ (1 + a)µ_ξ) ).  (6.3.3.5)

A natural goal is to find t which minimizes Var_t(α̂_t). Equivalently, the goal is to minimize η_t. The stopping rule to be used is still of the type p ≤ p0 where p0 is a given target precision and

p = √(Var_t(α̂_t)) / E_t(α̂_t) = √(η_t − α²) / (α √N)

is the precision after N replications. The interpretation is the same as in the ordinary simulation. Small η_t means a small number of replications. The precision p has to be estimated from the sample. A natural estimate is

p̂_N = √(η̂_t − α̂_t²) / (α̂_t √N),

where

α̂_t = (1/N) Σ_{i=1}^N Yi   and   η̂_t = (1/N) Σ_{i=1}^N Yi².

The above estimation method is called importance sampling.
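A compact sketch of the estimator for i.i.d. exponential summands. It uses the standard fact that for ξ ~ Exp(b) the conjugate distribution P_t is again exponential with rate b − t; this closed form, and the particular choice of t, are assumptions made only to keep the example explicit (the optimal choice of t is discussed below).

    import numpy as np

    rng = np.random.default_rng(3)
    b, a, n = 1.0, 0.5, 100          # Exp(b) summands, threshold (1+a)*mu, length of the sum
    mu = 1 / b
    t = a * b / (1 + a)              # makes E_t(xi) equal the target level (1+a)*mu
    c_t = np.log(b / (b - t))        # cumulant generating function of xi at t

    N = 100_000
    S = rng.exponential(scale=1 / (b - t), size=(N, n)).sum(axis=1)   # S_n under P_t
    Y = np.exp(-t * S + n * c_t) * (S / n >= (1 + a) * mu)
    alpha_hat = Y.mean()
    eta_hat = (Y**2).mean()
    precision = np.sqrt(eta_hat - alpha_hat**2) / (alpha_hat * np.sqrt(N))
    print(alpha_hat, precision)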

Some large deviations theory. The solution to the above minimization problem will be based on large deviations theory. We begin by stating some concepts and results of the theory. An initial step of the theory is the so called Cramér's theorem from the year 1938. A lot of developments have been seen much later. One of the main results is the Gärtner-Ellis theorem, a notable generalization of Cramér's theorem (Gärtner (1977), Ellis (1984)).

Let the variables ξ, ξ1, ξ2, ... and S1, S2, ... be as at the beginning of Section 6.3.3. Let the common distribution of the ξ-variables be P as earlier. The convex conjugate c*_ξ of c_ξ is defined by

c*_ξ(v) = sup_{s ∈ R} { sv − c_ξ(s) }.

It has the following useful geometrical interpretation.

[Figure: the graph of c(s) with a tangent line of slope v; the tangent intersects the vertical axis at the point a.]

In the picture, the slope of the tangent of c equals v. The value c*(v) of the convex conjugate is the distance between the origin and the point where the tangent hits the vertical axis. Hence, c*(v) = |a|.

Lemma 6.3.3.1. The functions c_ξ and c*_ξ have the following properties.
(i) Both of the functions are convex.
(ii) c*_ξ(v) ≥ 0 for every v ∈ R.
(iii) c*_ξ(µ_ξ) = 0.
(iv) If c'_ξ(s_v) = v for some s_v ∈ D then c*_ξ(v) = s_v v − c_ξ(s_v).

Proof (main lines only). The convexity of c_ξ follows from Hölder's inequality. The conjugate c*_ξ is convex as a pointwise supremum of convex (actually affine) functions. Further,

c*_ξ(v) ≥ 0 · v − c_ξ(0) = 0

for each v. Because s → sv − c_ξ(s) defines a concave function, it attains its global maximum at each point where the derivative vanishes. Thus (iv) holds. If c'_ξ(0) exists then µ_ξ = c'_ξ(0) so that

c*_ξ(µ_ξ) = 0 · µ_ξ − c_ξ(0) = 0.

Claim (iii) also holds if c'_ξ(0) does not exist. The proof of this fact is omitted. □

The convex conjugate describes probabilities P(Sn/n ∈ ·) in the following way. Theorem 6.3.2.2 (Cramér’s theorem). Let A ⊆ R be open. Then

lim inf_{n→∞} n⁻¹ log P(Sn/n ∈ A) ≥ − inf{ c*_ξ(v) | v ∈ A }.

Let B ⊆ R be closed. Then

lim sup_{n→∞} n⁻¹ log P(Sn/n ∈ B) ≤ − inf{ c*_ξ(v) | v ∈ B }.

The proof and more background can be found in Dembo and Zeitouni (1998). The focus of Cramér's theorem is on the rates of convergence of probabilities. The obtained estimates are usually not very sharp but are useful in other ways. The limit inferior and the limit superior can often be replaced by the limit. For example, if I is an interval then

lim_{n→∞} n⁻¹ log P(Sn/n ∈ I) = − inf{ c*_ξ(v) | v ∈ I }

except perhaps in the case where c*_ξ has a jump at an endpoint of I. By convexity, c*_ξ has at most two such jumps. The limit is often written a bit non-precisely as

P(Sn/n ∈ I) ≈ e^{ −n inf{ c*_ξ(v) | v ∈ I } }.

In the sequel, we also need the following small refinement. Let b ∈ R be arbitrary. Then

lim_{n→∞} n⁻¹ log P(Sn/n ∈ [b, ∞)) = − inf{ c*_ξ(v) | v ∈ [b, ∞) }.  (6.3.3.6)

Asymptotically efficient simulation distribution. Let's turn to the problem of choosing the simulation distribution in an optimal way. We have to estimate the probability

α = P( Sn/n ≥ (1 + a)µ_ξ ),

where a > 0 is fixed and n is large. We will make use of estimator (6.3.3.4) which is based on P_t distributed random variables. The goal is to choose t such that η_t from (6.3.3.5) is minimal in the limit as n → ∞.

Assume that D is open so that c'_ξ(0) = µ_ξ. Assume further that c'_ξ(t_a) = (1 + a)µ_ξ for some t_a ∈ R. By Lemma 6.3.3.1,

c*_ξ((1 + a)µ_ξ) = t_a (1 + a)µ_ξ − c_ξ(t_a).

Let now t ∈ D be arbitrary. The variance is non-negative so that by (6.3.3.5), η_t ≥ α². Thus

lim inf_{n→∞} n⁻¹ log η_t ≥ 2 lim inf_{n→∞} n⁻¹ log α_n = −2 c*_ξ((1 + a)µ_ξ).  (6.3.3.7)

The last equation follows from (6.3.3.6) because c*_ξ is increasing on [µ_ξ, ∞) (this is true because c*_ξ is convex, c*_ξ(µ_ξ) = 0, and c*_ξ(v) ≥ 0 for every v ∈ R). Take now t = t_a. Then

η_{t_a} = E_{t_a}( e^{−2 t_a Sn + 2n c_ξ(t_a)} 1(Sn/n ≥ (1 + a)µ_ξ) ).

Now c'_ξ(0) = µ_ξ < (1 + a)µ_ξ = c'_ξ(t_a) so that by convexity, t_a is positive. We conclude that

η_{t_a} ≤ e^{−2 t_a (1 + a)µ_ξ n + 2n c_ξ(t_a)}
  = e^{−2n ( t_a (1 + a)µ_ξ − c_ξ(t_a) )} = e^{−2n c*_ξ((1 + a)µ_ξ)}.

This shows that

lim sup_{n→∞} n⁻¹ log η_{t_a} ≤ −2 c*_ξ((1 + a)µ_ξ).

Combine this with (6.3.3.7) to see that

lim_{n→∞} n⁻¹ log η_{t_a} = −2 c*_ξ((1 + a)µ_ξ).

It is seen that the best rate of convergence is obtained by choosing t = ta. It can be shown that every other choice of t is strictly worse. The proof and more background can be found in Bucklew et al. (1990).

Simulation of compound Poisson distributions. We will apply the previous result to compound Poisson distributions. Let X = Xλ be as in Theorem 6.2.1.1. We will again consider limits as λ → ∞ while the claim size distribution S is fixed. Assume also that λ is an integer. The previous results apply because X can be regarded as a sum of independent compound Poisson variables where the common parameter of the terms is (1, S). We now consider the probability

P(X ≥ (1 + a)µX ). (6.3.3.8)

In this case, it is convenient to produce observations directly from the conjugate distribution of X. The optimal parameter is determined by the equation

c'_X(t_a) = (1 + a)µ_X.  (6.3.3.9)

By means of generating functions, it is seen that the conjugate distribution of X in question is a compound Poisson distribution with the parameter (λ M_Z(t_a), S_{t_a}), where M_Z is the moment generating function of the claim size distribution and S_{t_a} is the conjugate distribution of S with the parameter t_a. In summary, the estimation of probability (6.3.3.8) can be implemented in the following way.

1) Determine ta from (6.3.3.9).

2) Clear up the conjugate distribution of X with the parameter ta, that is, determine

λ M_Z(t_a) and the conjugate distribution S_{t_a}.

3) Produce observation Xi from the conjugate distribution from Step 2) as explained after Theorem 6.3.1.1. Let

Yi = e^{−t_a Xi + c_X(t_a)} 1( Xi ≥ (1 + a)µ_X )

(µX is the expectation of X under the original distribution). Put

α̂_i = (1/i) Σ_{j=1}^i Yj,   η̂_i = (1/i) Σ_{j=1}^i Yj²   and   p̂_i = √(η̂_i − α̂_i²) / (α̂_i √i).

4) Repeat Step 3) until p̂_i ≤ p0 where p0 is predetermined.
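A sketch of the four steps for exponentially distributed claim sizes. Here M_Z(t) = b/(b − t), the conjugate claim size distribution S_{t_a} is again exponential with rate b − t_a, and c_X(t) = λ(M_Z(t) − 1); these closed forms are assumptions made for the illustration, not part of the general algorithm.

    import numpy as np

    rng = np.random.default_rng(4)
    lam, b, a = 50.0, 1.0, 0.5            # Poisson parameter, Exp(b) claims, threshold level
    mu_X = lam / b

    # Step 1: c_X'(t) = lam * b / (b - t)**2 = (1 + a) * mu_X gives a closed-form t_a
    t_a = b * (1 - 1 / np.sqrt(1 + a))
    M_Z = b / (b - t_a)                   # moment generating function of Exp(b) at t_a
    c_X = lam * (M_Z - 1)                 # cumulant generating function of X at t_a

    # Steps 2-3: conjugate distribution of X is compound Poisson(lam * M_Z, Exp(b - t_a))
    def draw_conjugate():
        k = rng.poisson(lam * M_Z)
        return rng.exponential(scale=1 / (b - t_a), size=k).sum()

    N = 20_000
    X = np.array([draw_conjugate() for _ in range(N)])
    Y = np.exp(-t_a * X + c_X) * (X >= (1 + a) * mu_X)

    alpha_hat = Y.mean()
    eta_hat = (Y**2).mean()
    p_hat = np.sqrt(eta_hat - alpha_hat**2) / (alpha_hat * np.sqrt(N))
    print(alpha_hat, p_hat)               # Step 4 would repeat until p_hat <= p0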

6.4 An upper bound for the tail probability

Let ξ, ξ1, ξ2,... be i.i.d. random variables and

Sn = ξ1 + ··· + ξn, n = 1, 2, ...

as earlier. Assume that µ_ξ ∈ (0, ∞). Let a > 0. We derive an upper bound for the probability

P( Sn/n ≥ (1 + a)µ_ξ ).

Let c_ξ be the cumulant generating function of ξ.

Theorem 6.4.1. Let a > 0 be arbitrary. Then

P( Sn/n ≥ (1 + a)µ_ξ ) ≤ e^{−n c*_ξ((1 + a)µ_ξ)}

for every n ∈ N.

Proof. Let s ≥ 0 be arbitrary. By Chebyshev's inequality,

e^{n c_ξ(s)} = E( e^{s Sn} )
  ≥ E( e^{s Sn} 1(Sn/n ≥ (1 + a)µ_ξ) )
  ≥ e^{s n (1 + a)µ_ξ} P( Sn/n ≥ (1 + a)µ_ξ ).

By choosing the best s, we obtain

P( Sn/n ≥ (1 + a)µ_ξ ) ≤ e^{ −n sup{ s(1 + a)µ_ξ − c_ξ(s) | s ≥ 0 } }.

It suffices to show that

sup{s(1 + a)µξ − cξ(s) | s ≥ 0} = sup{s(1 + a)µξ − cξ(s) | s ∈ R}. (6.4.1)

Namely, the last supremum equals c*_ξ((1 + a)µ_ξ). By Lemma 6.3.3.1,

0 = c*_ξ(µ_ξ) = sup{ sµ_ξ − c_ξ(s) | s ∈ R }.

We assumed that a > 0 so that

sup{s(1 + a)µξ − cξ(s) | s ≤ 0} ≤ sup{sµξ − cξ(s) | s ≤ 0} ≤ 0.

On the other hand,

sup{s(1 + a)µξ − cξ(s) | s ≥ 0} ≥ 0 · (1 + a)µξ − cξ(0) = 0.

Thus (6.4.1) holds. □

Limit (6.3.3.6) shows that the exponent c*_ξ((1 + a)µ_ξ) in Theorem 6.4.1 is the best possible. More precisely, if β > c*_ξ((1 + a)µ_ξ) then

P( Sn/n ≥ (1 + a)µ_ξ ) > e^{−nβ}

for large n.

6.5 Modelling dependence

Often the numbers of claims, the claim sizes, and the total claim amounts of different companies (or different insurance lines inside a company) are modelled as independent random variables, at least as a first approximation. However, common environmental factors may cause some dependence. A good example is the effect of weather conditions on car insurance.

We allowed some dependence in Section 6.2 when the merging of companies was considered. By using the normal approximation, the required initial capital for the new company after the merging was

U0 = yε √(σ1² + σ2² + 2σ12) − v1µ1 − v2µ2,

where σi is the standard deviation of the total claim amount and viµi the safety loading of company i, i = 1, 2. The parameter σ12 is the covariance between the total claim amounts which describes their dependence. In the basic model, independence is assumed so that σ12 = 0. A positive correlation (σ12 > 0) would increase the risk and the capital requirement.

6.5.1 Mixing models

Let X1 and X2 be random variables and let Fi be the distribution function of Xi for i = 1, 2. A possible way to model dependence is to use a mixing random variable θ (or random vector). The variables X1 and X2 are assumed to be conditionally independent, given the value of θ. Let Fθ be the distribution function of θ. Take for every x1, x2 ∈ R,

P(X1 ≤ x1, X2 ≤ x2) = ∫_{−∞}^{∞} F1(x1|y) F2(x2|y) dFθ(y),  (6.5.1)

where {Fi(·|y) | y ∈ R} is the regular conditional distribution function of Xi with respect to θ. The product form of the integrand means conditional independence of X1 and X2. The model should be useful in particular if θ has a reasonable interpretation. For example, dependence between the numbers of claims could be partly described by means of the mixing variables. This was discussed in Theorem 3.2.2 although the setting was slightly different. In longer time horizons, inflation often causes dependence between the claim sizes and hence, it is a candidate for the mixing variable in the sense of (6.5.1).
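A sketch of (6.5.1) in simulation form: below the mixing variable θ is gamma distributed and both claim numbers are conditionally Poisson given θ, which is one simple choice among many.

    import numpy as np

    rng = np.random.default_rng(5)
    N = 100_000
    theta = rng.gamma(shape=10.0, scale=0.1, size=N)   # mixing variable, E(theta) = 1
    K1 = rng.poisson(100.0 * theta)                    # conditionally independent given theta
    K2 = rng.poisson(200.0 * theta)
    print(np.corrcoef(K1, K2)[0, 1])                   # the common theta induces positive correlation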

6.5.2 Copulas

The descriptions by mixing of Section 6.5.1 are not always natural and cannot usually describe all the dependence between random variables. Hence, a more general approach is needed. An appropriate tool for general descriptions is the copula. Let X1 and X2 be random variables and let the corresponding distribution functions be F_{X1} and F_{X2}. Assume for simplicity that both of them are continuous. Let F_{(X1,X2)} be the joint distribution function,

F_{(X1,X2)}(x1, x2) = P(X1 ≤ x1, X2 ≤ x2), x1, x2 ∈ R.

A two-dimensional copula is a function C : [0, 1] × [0, 1] → [0, 1] which satisfies the following three conditions.

(i) C(u1, 0) = C(0, u2) = 0, ∀u1, u2 ∈ [0, 1].
(ii) C(u1, 1) = u1, C(1, u2) = u2, ∀u1, u2 ∈ [0, 1].
(iii) C(v1, v2) − C(u1, v2) − C(v1, u2) + C(u1, u2) ≥ 0, ∀u1 ≤ v1, u2 ≤ v2.

Every copula defines a two-dimensional distribution function whose marginals are uni- formly distributed.

Theorem 6.5.1 (Sklar’s theorem) Let X1 and X2 be as described above. Then there exists a unique copula C such that

F(X1,X2)(x1, x2) = C(FX1 (x1),FX2 (x2)) (6.5.2) for every x1, x2 ∈ R. Furthermore,

C(u1, u2) = P(FX1 (X1) ≤ u1,FX2 (X2) ≤ u2), ∀u1, u2 ∈ [0, 1]. (6.5.3)

The theorem shows that every two-dimensional distribution with given marginals is obtained by means of a copula. The dependence is determined by the copula. We call C of Theorem 6.5.1 the copula of the pair (X1,X2). The proof of the theorem utilizes the following lemma.

Lemma 6.5.1. Let X be random variable and let F be the distribution function of X. Assume that F is continuous. Write

F −1(u) = inf{x|F (x) ≥ u}, u ∈ (0, 1).

Then for every x ∈ R and u ∈ (0, 1),
(i) F(F⁻¹(u)) = u.
(ii) {X ≤ x} ⊆ {F(X) ≤ F(x)}, P({F(X) ≤ F(x)} \ {X ≤ x}) = 0.
(iii) P(X ≤ x) = P(F(X) ≤ F(x)), P(F(X) ≤ u) = P(X ≤ F⁻¹(u)).
(iv) F(X) is uniformly distributed on the interval (0, 1).

Proof. Property (i) follows from the continuity of F. To prove (ii) and (iii), it suffices to consider the case where F(x) ∈ (0, 1). Clearly, {X ≤ x} ⊆ {F(X) ≤ F(x)} and

{F(X) ≤ F(x)} = {X ≤ F⁻¹(F(x))} ∪ {X ∈ (F⁻¹(F(x)), y]}

where y = y(x) = sup{ z | F(z) ≤ F(x) }. Obviously, F(y) = F(x) so that

P(X ∈ (F⁻¹(F(x)), y]) = F(y) − F(F⁻¹(F(x))) = 0.

Furthermore, F −1(F (x)) ≤ x so that P(X ∈ (x, y]) = 0. We have proven (ii). The first claim of (iii) follows from (ii). Further,

P(F(X) ≤ u) = P(F(X) ≤ F(F⁻¹(u)))
  = P(X ≤ F⁻¹(u)) = F(F⁻¹(u)) = u.

This proves the second claim of (iii) and claim (iv). 

Proof of Theorem 6.5.1. It is clear that the function C from (6.5.3) satisfies the requirements (i) and (ii) from the definition of the copula. If u1 ≤ v1 and u2 ≤ v2 then

P(FX1 (X1) ∈ (u1, v1],FX2 (X2) ∈ (u2, v2]) ≥ 0. It is seen that also (iii) is satisfied. Let x1, x2 ∈ R. By Lemma 6.5.1,

F(X1,X2)(x1, x2) = P(X1 ≤ x1,X2 ≤ x2)

= P(FX1 (X1) ≤ FX1 (x1),FX2 (X2) ≤ FX2 (x2))

= C(FX1 (x1),FX2 (x2)). This proves (6.5.2). Let D be an arbitrary copula such that

F_{(X1,X2)}(x1, x2) = D(F_{X1}(x1), F_{X2}(x2)) for every x1, x2 ∈ R. If u1, u2 ∈ (0, 1) then there exist x1, x2 ∈ R such that F_{X1}(x1) = u1 and F_{X2}(x2) = u2. It follows that

D(u1, u2) = F(X1,X2)(x1, x2)

= C(u1, u2).



Properties of copulas Let’s consider some basic properties of the copulas.

Theorem 6.5.2 Let X1 and X2 be random variables with continuous distribution functions FX1 and FX2 . Let C be the copula of (X1,X2) and let g1 and g2 be continuous functions.

(i) If g1 and g2 are increasing functions then the copula of (g1(X1), g2(X2)) is C.
(ii) If g1 and g2 are decreasing functions then the copula of (g1(X1), g2(X2)) is C̄, where

C̄(u1, u2) = C(1 − u1, 1 − u2) + u1 + u2 − 1, ∀u1, u2 ∈ [0, 1].

Proof. Consider (i). Let x1, x2 ∈ R. Write

yi = sup{x|gi(x) ≤ xi}, i = 1, 2.

Then {gi(Xi) ≤ xi} = {Xi ≤ yi} so that

P(g1(X1) ≤ x1, g2(X2) ≤ x2) = P(X1 ≤ y1,X2 ≤ y2)

= C(FX1 (y1),FX2 (y2)) = C(Fg1(X1)(x1),Fg2(X2)(x2)).

This proves (i). The proof of the second claim is left to the reader. 

Observe that the distribution functions of g1(X1) and g2(X2) are not necessarily continuous. Nevertheless, formula (6.5.2) holds when Xi is replaced by gi(Xi) for i = 1, 2. For a given pair (X1, X2), it is often of interest to understand the dependence associated with the tails of the components. As an example, consider the limit

λ := lim_{v→0+} P( X1 > F̄_{X1}⁻¹(v) | X2 > F̄_{X2}⁻¹(v) ),

where F̄_{Xi}(x) = 1 − F_{Xi}(x) and

F̄_{Xi}⁻¹(v) = inf{ x | F̄_{Xi}(x) ≤ v }, i = 1, 2.

Theorem 6.5.3. Assume that F̄_{Xi}(x) > 0 for every x > 0, i = 1, 2. Let C be the copula of (X1, X2). If also the conditions of Theorem 6.5.1 are satisfied then

λ = lim_{v→1−} ( 1 − 2v + C(v, v) ) / (1 − v).

Proof. By Theorem 6.5.2,

P( X1 > F̄_{X1}⁻¹(v), X2 > F̄_{X2}⁻¹(v) )
  = P( F̄_{X1}(X1) ≤ v, F̄_{X2}(X2) ≤ v )
  = C(1 − v, 1 − v) + 2v − 1.

This proves the theorem. □

Examples. We illustrate the theory of copulas by examples. The first three are interesting from the general point of view. The last two give parametric families of copulas. In estimation, a goal could be to find the best copula from such a family.

Example 6.5.1. (Independence copula CI ) Let X1 and X2 be independent so that

F(X1,X2)(x1, x2) = FX1 (x1)FX2 (x2) for every x1, x2 ∈ R. Thus the copula of the pair is determined by

CI(u1, u2) = u1 u2, u1, u2 ∈ [0, 1].

Example 6.5.2. (Frechet upper bound copula CU ) Let X1 = X2 so that

F(X1,X2)(x1, x2) = P(X1 ≤ x1,X1 ≤ x2)

= min(FX1 (x1),FX1 (x2)).

Thus CU (u1, u2) = min(u1, u2). It can be shown that for any copula C,

C(u1, u2) ≤ CU(u1, u2), ∀u1, u2 ∈ [0, 1].

Example 6.5.3. (Frechet lower bound copula CL) Let X1 = −X2 so that

FX2 (x2) = 1 − FX1 (−x2).

Hence,

F(X1,X2)(x1, x2) = P(X1 ≤ x1, −X1 ≤ x2)

= P(X1 ∈ [−x2, x1]) = max(0,FX1 (x1) − FX1 (−x2)) and CL(u1, u2) = max(0, u1 + u2 − 1). It can be shown that for any copula C,

CL(u1, u2) ≤ C(u1, u2), ∀u1, u2 ∈ [0, 1].

Example 6.5.4. (Clayton copula) Let α > 0. Clayton copula is defined by

C(u1, u2) = max( 0, u1^{−α} + u2^{−α} − 1 )^{−1/α}.

Example 6.5.5. (Gaussian copula) The definition is

C(u1, u2) = Hα( φ⁻¹(u1), φ⁻¹(u2) ),

where φ is the distribution function of the standard normal variable and Hα is the distribution function of the two-dimensional normal distribution with the covariance α (the marginal distributions are standard normal). Thus

Hα(x1, x2) = ( 1 / (2π √(1 − α²)) ) ∫_{−∞}^{x1} ∫_{−∞}^{x2} exp( −(t1² − 2α t1 t2 + t2²) / (2(1 − α²)) ) dt1 dt2.
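A sketch of how one could sample from the Gaussian copula and attach given marginals; the exponential marginals and the value of α below are arbitrary choices for the illustration.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    alpha, N = 0.7, 100_000
    cov = np.array([[1.0, alpha], [alpha, 1.0]])

    z = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=N)
    u = norm.cdf(z)                        # (U1, U2) has the Gaussian copula C
    x1 = -np.log(1 - u[:, 0]) / 0.5        # Exp(0.5) marginal via the inverse transform
    x2 = -np.log(1 - u[:, 1]) / 2.0        # Exp(2.0) marginal
    print(np.corrcoef(x1, x2)[0, 1])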

A further reference to Section 6.5: Denuit et al. (2005). University of Helsinki 61

7 Reinsurance

It is usual that insurance companies protect themselves against adverse fluctuations in their total claim amounts. A suitable way to do this is to buy an insurance from another company. This is called reinsurance. Reinsurance affects the ruin probabilities and other quantities of interest. Namely, the interest is typically in the claim amounts after the reinsurance, that is, the part covered by the reinsurer is subtracted. Of course, the insurer pays a premium to the reinsurer. Similarly to the claims, also the premiums are typically considered after the reinsurance.

Let X and Z describe the original total claim amount and the claim size. By a reinsurance contract, these quantities will be split. The total claim amount X will be written as the sum of the share of the original insurer (also called the cedent) and the share of the reinsurer

X = X^ce + X^re

where X^ce = the share of the insurer and X^re = the share of the reinsurer. In some reinsurance types, every claim is split. Then the shares of the insurer and the reinsurer are written similarly to the total claim amount,

Z = Z^ce + Z^re

where Z^ce = the share of the insurer and Z^re = the share of the reinsurer. Our main goal is to clear up the shares X^ce and X^re when X and the reinsurance type are given. We call E(X^ce) the risk premium of the insurer and E(X^re) the risk premium of the reinsurer. In the same way, it is natural to speak about the safety loadings of the insurer and the reinsurer.

7.1 Excess of loss (XL)

In excess of loss reinsurance every claim is split between the insurer and the reinsurer. The contract determines the limit which is the maximum to be paid by the insurer. We call it the XL limit in the sequel. The reinsurer pays the rest of each claim. The contract usually covers all the claims in a portfolio (or in a part of a portfolio) during a fixed time period stipulated in the contract.

Let M > 0 be the XL limit. The shares of a claim of size Z are

Z^ce = min(Z, M) and Z^re = Z − Z^ce.

The contract gives the insurer protection against large single claims but not against an exceptionally large number of claims. Let S be the distribution function of Z. Then the distribution function S^ce of Z^ce is determined by

S^ce(z) = S(z) if z < M,   S^ce(z) = 1 if z ≥ M.

Write a_j^ce(M) = E( (Z^ce)^j ), j = 1, 2, .... Then

a_j^ce(M) = ∫_{[0,∞)} z^j dS^ce(z) = ∫_{[0,M]} z^j dS(z) + M^j (1 − S(M)).

Let a_j^re(M) = E( (Z^re)^j ) be the corresponding moment of the reinsurer. The expectation is

a_1^re(M) = a1 − a_1^ce(M) = ∫_{[0,∞)} z dS(z) − ∫_{[0,M]} z dS(z) − M(1 − S(M))
  = ∫_{(M,∞)} z dS(z) − M(1 − S(M)).

If X is a compound variable and the expectation of the number of claims equals λ, and if all the claims associated with X are reinsured, then

E(X^ce) = λ a_1^ce(M)

and

E(X^re) = λ a_1^re(M).

The XL limit M is often given in euros (or in other currencies). This causes a need to update the limit from year to year because of inflation. Suppose that all the claims increase by the factor 1 + i. That is, the original variable Z is replaced by Z' = (1 + i)Z where i > 0 describes inflation. The question is: how should we change the XL limit and/or the reinsurance premium? A natural change is to replace the limit M by M' = (1 + i)M. Then

E(Z') = (1 + i) a1

and

E(min(Z', M')) = (1 + i) a_1^ce(M).

It is seen that the expectation of the share of the reinsurer is now

E(Z') − E(min(Z', M')) = (1 + i) a_1^re(M).

Hence, it is correct to multiply also the reinsurance premium by the factor 1 + i. Let now the insured total claim amount X be a compound variable with the parameter (K, S), and let Z1, Z2, ... be the corresponding claim sizes. The share of the reinsurer is

X^re = Σ_{i=1}^K Zi^re = Σ_{i=1}^K (Zi − M) 1(Zi > M).

Clearly, X^re is a compound variable with the parameter (K, S^re). This information is not necessarily useful for the reinsurer. The representation contains a lot of zeros which are not interesting and about which the reinsurer does not usually get information. We next derive a representation where the zero claims have been dropped.

Let τ0 = 0 and

τi = inf{ k > τ_{i−1} | Zk > M }, i = 1, 2, ....

Write also ϱi = τi − τ_{i−1} and Z̃i = Z_{τi} − M. Let

K̃ = sup{ i | τi ≤ K } = #{ i ≤ K | Zi > M }.

Then K̃ is the number of such claims which cause payments to the reinsurer. The total claim amount can be written in the form

X^re = Z̃1 + ··· + Z̃_{K̃}.  (7.1.1)

Assume that S(M) ∈ (0, 1). Define the distribution function S̃ by

S̃(z) = ( S(M + z) − S(M) ) / (1 − S(M)), z ≥ 0.

This is the conditional distribution of Z − M given that Z > M. Write

p = 1 − S(M).

Theorem 7.1.1. The regular conditional distribution of K̃ with respect to K is binomial with the parameter (K, p). That is,

P(K̃ = k | K = h) = (h choose k) p^k (1 − p)^{h−k}, h = 0, 1, 2, ..., k = 0, 1, ..., h.  (7.1.1.1)

Furthermore,

P(K̃ = k) = Σ_{h=k}^∞ P(K = h) (h choose k) p^k (1 − p)^{h−k}, k = 0, 1, 2, ...

and E(K̃) = p E(K). The variables Z̃1, Z̃2, ... are i.i.d. with the common distribution function S̃, and the total claim amount X^re is a compound variable with the parameter (K̃, S̃).

Proof. Let h, k ∈ N ∪ {0} and k ≤ h. Then

P(K̃ = k, K = h) = P(K = h, exactly k of the occurred claims are larger than M)
  = P(K = h) (h choose k) p^k (1 − p)^{h−k}.

This proves (7.1.1.1) which implies the stated representation for the probability P(K̃ = k). Further,

E(K̃) = Σ_{k=0}^∞ k P(K̃ = k)
  = Σ_{k=0}^∞ Σ_{h=k}^∞ k P(K = h) (h choose k) p^k (1 − p)^{h−k}
  = Σ_{h=0}^∞ P(K = h) Σ_{k=0}^h k (h choose k) p^k (1 − p)^{h−k}
  = Σ_{h=0}^∞ P(K = h) p h = p E(K).

Consider now the variable Z̃i. Obviously,

P(τ1 = k) = P(Z1 ≤ M, ..., Z_{k−1} ≤ M, Zk > M) = (1 − p)^{k−1} p, k = 1, 2, ....

It is seen that

P(τ1 < ∞) = Σ_{k=1}^∞ P(τ1 = k) = 1.

Thus τ1 is a finite valued random variable. It is seen similarly that P(τi < ∞) = 1 for every i ≥ 1 and that ϱ1, ϱ2, ... are i.i.d. Let z1 ≥ 0. Then

P(Z̃1 ≤ z1) = Σ_{k1=1}^∞ P(ϱ1 = k1, Z̃1 ≤ z1)  (7.1.2)

= Σ_{k1=1}^∞ P(Z1 ≤ M, ..., Z_{k1−1} ≤ M, Z_{k1} ∈ (M, M + z1]).

By the independence of the Z-variables, each term in the sum can be written as

( P(Z_{k1} ∈ (M, M + z1]) / P(Z_{k1} > M) ) P(Z1 ≤ M, ..., Z_{k1−1} ≤ M, Z_{k1} > M)
  = S̃(z1) P(Z1 ≤ M, ..., Z_{k1−1} ≤ M, Z_{k1} > M).

We conclude that

P(Z̃1 ≤ z1) = S̃(z1) Σ_{k1=1}^∞ P(Z1 ≤ M, ..., Z_{k1−1} ≤ M, Z_{k1} > M)
  = S̃(z1) Σ_{k1=1}^∞ P(ϱ1 = k1) = S̃(z1).

It is seen similarly that also Z̃2, Z̃3, ... are S̃-distributed and that all the Z̃-variables are independent. We next show that X^re is a compound variable with the parameters stated in the theorem. Let k ∈ N ∪ {0} and let z1, ..., zm ∈ R. We have to show that

P(K̃ = k, Z̃1 ≤ z1, ..., Z̃m ≤ zm) = P(K̃ = k) S̃(z1) ··· S̃(zm).

It suffices to consider the case where m ≥ k. The left hand side equals

Σ_{h=k}^∞ P(K = h, ϱ1 + ··· + ϱk ≤ h, ϱ1 + ··· + ϱ_{k+1} > h, Z̃1 ≤ z1, ..., Z̃m ≤ zm).  (7.1.3)

It is seen as in (7.1.2) that the requirements concerning the ϱ- and Z̃-variables are determined by the original Z-variables so that the event {K = h} is independent of them. Thus (7.1.3) can be written as

Σ_{h=k}^∞ P(K = h) P(ϱ1 + ··· + ϱk ≤ h, ϱ1 + ··· + ϱ_{k+1} > h, Z̃1 ≤ z1, ..., Z̃m ≤ zm).  (7.1.4)

Let h1, ..., h_{m+1} be such that h1 + ··· + hk ≤ h and h1 + ··· + h_{k+1} > h. We obtain similarly to (7.1.2) that

P(ϱ1 = h1, ..., ϱ_{m+1} = h_{m+1}, Z̃1 ≤ z1, ..., Z̃m ≤ zm)
  = S̃(z1) ··· S̃(zm) P(ϱ1 = h1, ..., ϱ_{m+1} = h_{m+1}).  (7.1.5)

By writing the last probability in each term of sum (7.1.4) as a sum of probabilities of type (7.1.5), we can rewrite (7.1.4) as

S̃(z1) ··· S̃(zm) Σ_{h=k}^∞ P(K = h) P(ϱ1 + ··· + ϱk ≤ h, ϱ1 + ··· + ϱ_{k+1} > h)
  = S̃(z1) ··· S̃(zm) P(K̃ = k). □

Corollary 7.1.2. Let X have the compound mixed Poisson distribution with the parameter (λ, Q, S). Then X^re is a compound mixed Poisson variable with the parameter (λp, Q, S̃).

Proof. By Theorem 7.1.1, it suffices to show that K̃ has the mixed Poisson distribution with the parameter (λp, Q). Let H be the distribution function of Q. By Theorem 7.1.1,

P(K̃ = k) = Σ_{h=k}^∞ ∫_0^∞ e^{−λq} ((λq)^h / h!) dH(q) (h choose k) p^k (1 − p)^{h−k}
  = ∫_0^∞ e^{−λq} ((λpq)^k / k!) Σ_{h=k}^∞ ((λ(1 − p)q)^{h−k} / (h − k)!) dH(q)
  = ∫_0^∞ e^{−λq} e^{λ(1−p)q} ((λpq)^k / k!) dH(q) = ∫_0^∞ e^{−λpq} ((λpq)^k / k!) dH(q). □

Example 7.1.1. Let Z have the exponential distribution with the parameter b so that

S(z) = P(Z ≤ z) = 1 − e^{−bz}, z ≥ 0.

Let M be the XL limit. Then

a_1^ce(M) = ∫_0^M z dS(z) + M(1 − S(M))
  = b⁻¹ − e^{−bM}(M + b⁻¹) + M e^{−bM} = b⁻¹(1 − e^{−bM})

and

a_1^re(M) = a1 − a_1^ce(M) = b⁻¹ e^{−bM}.

Assume that X has the compound Poisson distribution with the parameter (λ, S). Then X^re has the compound Poisson distribution with the parameter (λ e^{−bM}, S̃) where

S̃(z) = ( S(z + M) − S(M) ) / (1 − S(M)) = ( e^{−bM} − e^{−b(z+M)} ) / e^{−bM} = 1 − e^{−bz} = S(z), z ≥ 0.

Also the claim sizes of the reinsurer are exponentially distributed with the parameter b.
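The formulas of the example can be checked by a short simulation; the values of b and M below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(7)
    b, M = 1.0, 2.0
    Z = rng.exponential(scale=1 / b, size=1_000_000)

    Z_ce = np.minimum(Z, M)
    Z_re = Z - Z_ce
    print(Z_ce.mean(), (1 - np.exp(-b * M)) / b)   # a1_ce(M) = (1 - e^{-bM})/b
    print(Z_re.mean(), np.exp(-b * M) / b)         # a1_re(M) = e^{-bM}/b
    # claims exceeding M: the excesses are again Exp(b) distributed
    print(Z_re[Z > M].mean(), 1 / b)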

7.2 Quota share (QS)

In quota share reinsurance, the insurer pays a fixed proportion of each claim and the reinsurer pays the rest. So in the contract, a parameter r ∈ (0, 1) is determined and the shares concerning the claim Z are

Z^ce = rZ,   Z^re = (1 − r)Z.

It is of course equivalent to determine the proportion r for the reinsured total claim amount. Let S be the distribution function of the original claim size. Then

S^ce(z) = P(Z^ce ≤ z) = S(z/r),
S^re(z) = P(Z^re ≤ z) = S(z/(1 − r))

for z ≥ 0. Quota share is simpler than XL. For example, it is natural to split the premiums by using the same proportions as for the claims. A problem is that also small claims will be reinsured.

Let now Xi be the total claim amount of company i, i = 1, 2. We consider the problem of risk sharing where the joint insured total claim amount X1 + X2 will be divided between the companies. The aim is to do this optimally in the sense of minimizing the variances of the total claim amounts of the participating companies.

Theorem 7.2.1. Assume that σXi ∈ (0, ∞), i = 1, 2. Write X = X1 +X2. Let c ∈ [0, 1] and Y1 and Y2 be such that

X = Y1 + Y2 and σY1 = cσX .

Then σ_{Y2} ≥ (1 − c)σX. On the other hand, if Y1* = cX and Y2* = (1 − c)X then

σ_{Y1*} = cσX and σ_{Y2*} = (1 − c)σX.

Theorem 7.2.1 states the following optimality property for quota share reinsurance. Let Y1 and Y2 be the shares of companies 1 and 2 after risk sharing. Assume that

σYi ≤ σX ∈ (0, ∞), i = 1, 2.

Then the companies can make mutual quota share reinsurance contracts such that the resulting share is at least as good as Yi for i = 1, 2. QS gives Pareto optimal solutions to the risk sharing problem: if c ∈ [0, 1], Y1 = cX and Y2 = (1 − c)X, then there exists no risk sharing which gives a better share for both of the companies (and a strictly better share for one of them).

Proof of Theorem 7.2.1. By Schwarz’s inequality,

σ_{Y2}² = σ_{X−Y1}² = σX² + σ_{Y1}² − 2E{ (X − E(X)) (Y1 − E(Y1)) }
  ≥ σX² + σ_{Y1}² − 2 √(E{(X − E(X))²}) √(E{(Y1 − E(Y1))²})
  = σX² + σ_{Y1}² − 2σX σ_{Y1} = (1 − c)² σX².

This proves the first claim. The second claim is obviously true. 

7.3 Surplus

Surplus reinsurance is a kind of refinement of QS such that a part of the small claims is excluded from the contract. For every insured i, the maximum amount Li to be compensated is fixed. This could be, for example, the estimated maximum loss (EML for short). The maximum share of the insurer is determined in the system. Denote this by M. For small insureds, namely, those for which Li ≤ M, the reinsurer pays nothing. If Li > M then the insurer pays the amount r(Li)Z for the claim Z, and the reinsurer pays the rest. Observe that the proportion now depends on the insured, and is taken to be r(Li) = M/Li. In other words, if the claim Z occurs to the insured i then

Z^ce = r(Li) Z,

r(Li) = min(1, M/Li).

From the viewpoint of mathematical convenience, it would be natural to assume that the insureds with equal L-limits are similar in the sense that the associated claim size distributions are equal. It is also necessary to approximate the distribution of the L-values in the portfolio. Write

S(z|L) = P(Z ≤ z | L), z ∈ R.

The interpretation is that S(·|L) is the claim size distribution corresponding to the limit L. Then the distribution function S associated with the portfolio is determined by

S(z) = ∫_{[0,∞)} S(z|L) dG(L),

where G(L) = P(the limit associated with the claim is at most L). The distribution function of the share of the insurer is determined by

S^ce(z) = ∫_{[0,∞)} S(z/r(L) | L) dG(L), z ∈ R.

7.4 Stop loss (SL)

In stop loss reinsurance the total claim amount is divided such that the insurer pays the claims up to a predetermined limit and the reinsurer pays the rest. The participants fix in the contract the SL limit M > 0 which is the maximum amount to be paid by the insurer. If X is the original total claim amount then the share of the insurer is X^ce = min(X, M). Stop loss protects the insurer both against large single claims and against a large number of claims. The risk premium of the reinsurer is

E(X^re) = ∫_0^∞ P(X^re > x) dx = ∫_0^∞ P(X > M + x) dx.

Consider now the situation where the insurer will pay by itself a fixed level P of the total claim amount in the sense that if Y is the total claim amount of the insurer after the reinsurance then E(Y) = P. The following result shows that SL is optimal when the measure to be used is the variance of the total claim amount. Define R : (0, ∞) → R by

R(M) = E(min(X, M)).

Hence, R(M) is the risk premium of the insurer when stop loss is used with the limit M.

Theorem 7.4.1. Let X be the original total claim amount and let P ∈ (0, E(X)) be fixed. Then there exists M such that the share YM of the insurer in the SL contract with limit M satisfies E(YM) = P. Let Y be the share of the insurer after an arbitrary reinsurance such that 0 ≤ Y ≤ X and E(Y) = P. Then

Var(YM) ≤ Var(Y).

The proof of the theorem will be based on the following useful lemma.

Lemma 7.4.2. Let X1 and X2 be random variables with finite expectations. Assume that E(X1) = E(X2). Let Fi be the distribution function of Xi, i = 1, 2. Assume that there exists x0 ∈ R such that

F1(x) ≤ F2(x) for x < x0,

F1(x) ≥ F2(x) for x ≥ x0. University of Helsinki 70

Let h : R → R be a convex function. Then

E(h(X1)) ≤ E(h(X2)). (7.4.6)

Proof of the lemma. Consider first the case where h(x0) = 0 and h(x) ≥ 0 for every x ∈ R. Let y > 0 be arbitrary. By the convexity, there exist a, b ∈ R ∪ {±∞} such that

P(h(X1) > y) = P(X1 < a) + P(X1 > b) = F1(a−) + 1 − F1(b)
  ≤ F2(a−) + 1 − F2(b) = P(h(X2) > y).

Hence,

E(h(X1)) = ∫_0^∞ P(h(X1) > y) dy ≤ ∫_0^∞ P(h(X2) > y) dy = E(h(X2)).

Consider now a general h. By the convexity, there exists c ∈ R such that

h(x) ≥ h(x0) + c(x − x0)

for every x ∈ R (for example, c = h'(x0+)). Write

r(x) = h(x) − h(x0) − c(x − x0), x ∈ R.

By the first part of the proof, E(r(X1)) ≤ E(r(X2)). This completes the proof because E(X1) = E(X2).  Proof of Theorem 7.4.1. Let M > 0 and ∆ > 0. Then

|R(M + ∆) − R(M)| = E( min(X, M + ∆) − min(X, M) )
  = E( (X − M) 1(X ∈ (M, M + ∆]) ) + ∆ P(X > M + ∆)
  ≤ ∆ P(X > M).

It is seen that R is continuous. Further, R(M) → 0 as M → 0+ and R(M) → E(X) as M → ∞. This proves the existence of the required SL limit.

To prove the optimality of YM, choose

X1 = YM, X2 = Y, x0 = M and h(x) = (x − P)²

in the previous lemma. Let x < M. We assumed that Y ≤ X so that

F1(x) = P(YM ≤ x) = P(X ≤ x) ≤ P(Y ≤ x) = F2(x).

On the other hand, F1(M) = 1 so that F2(x) ≤ F1(x) for x ≥ M. Finally, h is convex so that (7.4.6) holds. □

Further references: Sundt (1984).
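A sketch of how the SL limit M with E(min(X, M)) = P could be found numerically: R(M) is estimated from a fixed simulated sample of X and, being continuous and non-decreasing in M, is inverted by bisection. The compound Poisson sample below is only a stand-in for an actual total claim amount model.

    import numpy as np

    rng = np.random.default_rng(8)
    lam, b = 100.0, 1.0
    X = np.array([rng.exponential(1 / b, rng.poisson(lam)).sum() for _ in range(50_000)])

    P = 0.8 * X.mean()                     # target risk premium of the insurer, P < E(X)

    def R(M):                              # R(M) = E(min(X, M)), estimated from the sample
        return np.minimum(X, M).mean()

    lo, hi = 0.0, X.max()
    for _ in range(60):                    # bisection on the monotone function R
        mid = 0.5 * (lo + hi)
        if R(mid) < P:
            lo = mid
        else:
            hi = mid
    M = 0.5 * (lo + hi)
    print(M, R(M), P)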

8 Outstanding claims

Insurance contracts are usually such that the company is liable for compensations concerning the claims which have occurred while the contract has been in force. The notifications and the requests for compensations can come a long time after the occurrences. Even if they have been given, the claims have to be processed by the company before payments. Also, the compensations are sometimes paid as pensions which may continue for decades.

By the above discussion, the company is liable to pay compensations associated with the outstanding claims, that is, with the claims which have already occurred but have not yet been settled. The associated total future compensation is a random liability of the company. The company has to set up a reserve for these compensations. The reserve basically corresponds to the mean of the payments (but is usually supplemented by a safety margin). We call it the loss reserve.

The understanding of the outstanding claims is important from the solvency point of view. Namely, the economic situation of the company can be described as a difference between the assets and the liabilities. The main part of the liabilities often comes from outstanding claims. Also in the pricing of insurance contracts, it is essential that the price corresponds to the claims occurring during the period of the contract. A general consequence of these occurrence based liabilities is that the premiums typically come before the compensations. In the meantime, the company may receive investment income on the available capital.

Some examples illustrate the nature of outstanding claims.

- Car insurance. After an accident, the claim is usually quickly notified. The payment will be made, for example, after the car has been repaired. The magnitude of the outstanding claims remains small.

- Workers' compensation. Some occupational diseases may be latent for decades (for example, slow changes caused by chemicals). In serious accidents, compensations will be paid as pensions. The magnitude of the outstanding claims may be much larger than the yearly paid compensations.

- Incoming XL reinsurance. The reinsurer may be notified about the claim at the time the XL limit is reached. Thus the notification processes are different for the insurer and the reinsurer.

8.1 Development triangles Consider the following grouping of a statistical data, the so called development triangle. University of Helsinki 72

 i \ j |   0        1       ···    I−1       I
-------+------------------------------------------
   0   |  C00      C01      ···   C0,I−1    C0I
   1   |  C10      C11      ···   C1,I−1
  ···  |  ···      ···      ···
  I−1  |  CI−1,0   CI−1,1
   I   |  CI0

The triangle describes the realized payments of the claims up to the end of year I. The occurrence years i = 0, 1, ..., I are listed on the left and the payment years j = 0, 1, ..., I relative to the occurrence years are on the top line. Observation Cij contains payments from the occurrence year i; it is the sum of the payments settled during the years i, i+1, ..., i+j. The payments up to the end of year I are obtained for each occurrence year from the diagonal

CI0,CI−1,1,...,C1,I−1,C0I .

In the estimation problem associated with the outstanding claims, we should fill the triangle into a rectangle. Assume that all the payments take place during L years from the occurrence. Then the problem is to estimate C0L, C1L, ..., CIL. The claim reserve is the sum of these estimates less the compensations already paid. We assume that L ≤ I. Often the first step is to estimate the mean value of the outstanding claims, or more strictly, the conditional mean given the data. We only consider this in the sequel. The estimation is problematic because of the small number of useful observations. Also inflation may cause some bias in the estimates but it could be possible to eliminate this effect from the data. The changes in the development process are probably more difficult to control.

8.2 Chain-Ladder method

Let Cij, i, j = 0, 1,...,I, be random variables which describe the payments concerning the occurrence year i up to the year i + j. Suppose that we have the development triangle as in Section 8.1 which contains the observed values of them. University of Helsinki 73

In the Chain-Ladder method, the final payments CiI are obtained in the following way. Denote

m̂_j = ( Σ_{i=0}^{I−j} Cij ) / ( Σ_{i=0}^{I−j} Ci,j−1 ),  j = 1, ..., I.

Estimate

Ĉ_ir = Ci,I−i Π_{j=I−i+1}^{r} m̂_j,  i = 0, 1, ..., I,  r ≥ I − i + 1.  (8.2.1)

Thus the newest observation is taken from each occurrence year, and it is 'developed' by making use of the observations from earlier years. The Chain-Ladder estimate of the mean of the outstanding claims of the occurrence year i is

Û_i = Ĉ_iI − Ci,I−i,

and the estimate of the mean of the total outstanding claims at the end of year I is

Û = Σ_{i=0}^{I} Û_i.

The following assumptions are natural as a background for the method (Mack (1993)):

1) There exist constants mj such that

E (Cij | Ci0,...,Ci,j−1) = mjCi,j−1, i = 0, 1, . . . , I, j = 1,...,I, 2) for every i 6= j,

(Ci0,...,CiI ) and (Cj0,...,CjI ) are independent.

In words, we assume that the development coefficients mj only depend on the (relative) payment year but not on the occurrence year, and that the occurrence years are mutually independent.
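A sketch of the Chain-Ladder computation on a small development triangle (I = 3); the figures are made up and NaN marks the unknown lower part.

    import numpy as np

    # cumulative payments C_ij; rows = occurrence years, columns = development years
    C = np.array([
        [1000.0, 1500.0, 1700.0, 1750.0],
        [1100.0, 1650.0, 1880.0, np.nan],
        [1200.0, 1800.0, np.nan, np.nan],
        [1300.0, np.nan, np.nan, np.nan],
    ])
    I = C.shape[0] - 1

    m_hat = []
    for j in range(1, I + 1):
        rows = range(0, I - j + 1)          # occurrence years with both columns observed
        m_hat.append(sum(C[i, j] for i in rows) / sum(C[i, j - 1] for i in rows))

    C_full = C.copy()
    for i in range(I + 1):
        for j in range(I - i + 1, I + 1):   # fill the unknown lower part by (8.2.1)
            C_full[i, j] = C_full[i, j - 1] * m_hat[j - 1]

    U_hat = sum(C_full[i, I] - C[i, I - i] for i in range(I + 1))
    print(np.round(m_hat, 4), round(U_hat, 1))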

The parameters mj are called Chain-Ladder coefficients or development coefficients. The estimates of them are quotients of the column sums so that they describe the relative increase of payments. Write D = σ(Ckj, k, j ≥ 0, k + j ≤ I),

Dk = σ(Ckj, 0 ≤ j ≤ I − k), k = 0, 1,...,I and Bk = σ(Cij, 0 ≤ j ≤ k, i + j ≤ I), k = 0, 1,...,I. University of Helsinki 74

Thus, D contains the whole history (from the development triangle), Dk the observations from the occurrence year k (one row in the triangle) and Bk the observations from the relative payment years j ≤ k (the triangle is cut vertically).

Theorem 8.2.1. Assume that P(Ci0 > 0) = 1 for every i ≥ 0. Then under assumptions 1) and 2),

E(CiI | D) = m_{I−i+1} ··· m_I Ci,I−i,  (8.2.2)

and

E(m̂_j m̂_{j+1} ··· m̂_I) = m_j m_{j+1} ··· m_I.  (8.2.3)

Result (8.2.2) motivates using (8.2.1) as the estimate for the final payments. Unbiased estimates for the products m_{I−i+1} ··· m_I can be obtained from (8.2.3).

Proof of Theorem 8.2.1. By 2), E(C_{iI} | D) = E(C_{iI} | D_i). By the iterativity property of the conditional expectation and by 1),

E(C_{iI} | D) = E{E(C_{iI} | C_{i0}, \ldots, C_{i,I−1}) | D_i} = E(m_I C_{i,I−1} | D_i) = m_I E(C_{i,I−1} | D_i) = \cdots
= m_I \cdots m_{I−i+1} E(C_{i,I−i} | D_i) = m_I \cdots m_{I−i+1} C_{i,I−i}.

This proves (8.2.2). By 2),

E(C_{ij} | B_{j−1}) = E(C_{ij} | C_{i0}, \ldots, C_{i,j−1}) = m_j C_{i,j−1}, \quad j ≥ 1,

so that

E(\hat{m}_j | B_{j−1}) = E\left( \frac{\sum_{i=0}^{I−j} C_{i,j}}{\sum_{i=0}^{I−j} C_{i,j−1}} \,\Big|\, B_{j−1} \right) = \left( \sum_{i=0}^{I−j} C_{i,j−1} \right)^{−1} \sum_{i=0}^{I−j} m_j C_{i,j−1} = m_j.

It is also seen that E(\hat{m}_j) = m_j. Consequently,

E(\hat{m}_j \hat{m}_{j+1} \cdots \hat{m}_I) = E\{E(\hat{m}_j \hat{m}_{j+1} \cdots \hat{m}_I | B_{I−1})\} = E\{\hat{m}_j \hat{m}_{j+1} \cdots \hat{m}_{I−1} E(\hat{m}_I | B_{I−1})\} = m_I E\{\hat{m}_j \hat{m}_{j+1} \cdots \hat{m}_{I−1}\} = \cdots
= m_I \cdots m_{j+1} E\{\hat{m}_j\} = m_j \cdots m_I. □

The Chain-Ladder method is largely based on assumption 1). Under that assumption, result (8.2.2) motivates the straightforward use of the data. The next goal is to sharpen the view by considering a more detailed model for the occurrence and payment processes.

8.3 Predicting of the unknown claims

At a fixed time point, we call a claim unknown if it has already occurred but has not yet been notified to the company. The estimates concerning these claims have to be based on statistical methods (if the claim has been notified then the final claim amount concerning that claim could also be estimated individually). We will only consider the prediction of the number of the unknown claims. The results can be used directly in the estimation of the corresponding claim amounts, given that the claim sizes are independent of the notification process. The reporting delay means the difference between the occurrence and the notification times of the claim. The reporting time is the occurrence time plus the reporting delay. Let's consider notifications of the claims that occurred during the time interval (0, d] where d > 0 is fixed. Assume that

1) the occurrences of the claims follow the Poisson process with the intensity function Λ where

Λ(t) = \int_0^t λ(u)\, du, \quad ∀ t ≥ 0,  (8.3.1)

and the map t → λ(t) is strictly positive, bounded on compact intervals, and piecewise continuous and differentiable (in finite intervals, there are at most a finite number of discontinuities and λ is differentiable at the continuity points),

2) the reporting delays have the distribution function G where G(0) = 0,

3) the reporting delays are i.i.d. and independent of the numbers of the claims in all respects.

Let {K(t)} be the Poisson process which describes the occurrences of the claims. For any u ≥ 0, define

Vd(u) = the number of claims which occur during (0, d] and which are reported during (0, u].

Let 0 = t0 < t1 < ··· < tn−1 < tn = ∞. Write

I1 = (t0, t1],...,In−1 = (tn−2, tn−1],In = (tn−1, tn).

Theorem 8.3.1. Assume conditions 1), 2) and 3). Then Vd(ti) − Vd(ti−1) has the Poisson distribution with the parameter

\int_0^d λ(s)(G(t_i − s) − G(t_{i−1} − s))\, ds, \quad i = 1, \ldots, n.  (8.3.1.1)

Furthermore, V_d(t_1) − V_d(t_0), \ldots, V_d(t_n) − V_d(t_{n−1}) are independent.
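Theorem 8.3.1 is easy to check by simulation. The sketch below is only an illustration under additional assumptions not made in the theorem: a constant intensity λ(s) ≡ λ on (0, d] and exponentially distributed reporting delays G(u) = 1 − e^{−δu}. It compares the simulated mean of V_d(t_1) (with t_0 = 0) to the Poisson parameter (8.3.1.1).

```python
import math
import random

# Simulation check of Theorem 8.3.1 under illustrative assumptions:
# constant intensity lam on (0, d] and Exp(delta) reporting delays.
lam, delta, d, t1 = 5.0, 2.0, 1.0, 1.5
random.seed(1)

def simulate_Vd_t1():
    # Homogeneous Poisson occurrences on (0, d] via exponential interarrival times;
    # each claim gets an independent reporting delay and is counted if reported by t1.
    reported = 0
    s = random.expovariate(lam)          # first occurrence time
    while s <= d:
        r = random.expovariate(delta)    # reporting delay
        if s + r <= t1:
            reported += 1
        s += random.expovariate(lam)
    return reported

n_sim = 20000
mean_sim = sum(simulate_Vd_t1() for _ in range(n_sim)) / n_sim

# Parameter (8.3.1.1) with t_0 = 0: integral of lam * G(t1 - s) over (0, d].
param = lam * (d - (math.exp(-delta * (t1 - d)) - math.exp(-delta * t1)) / delta)
print("simulated mean of V_d(t1):", mean_sim)
print("Poisson parameter (8.3.1.1):", param)
```

The two printed numbers should agree up to simulation error, in line with the Poisson distribution claimed by the theorem.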

Remark 8.3.1. We assumed that t_0 = 0 and t_n = ∞. Thus the integrand in (8.3.1.1) is G(t_1 − s) for i = 1 and 1 − G(t_{n−1} − s) if i = n. Similarly, V_d(t_0) = 0 and V_d(t_n) = K(d).

Remark 8.3.2. Theorem 8.3.1 shows that the observed numbers of claims do not give any information about forthcoming notifications (provided that Λ and G are known).

We begin with a result concerning the jump times of the Poisson process.

Lemma 8.3.1. Let {K(t)} be a Poisson process which satisfies assumption 1), and let S_i be the ith jump time of the process, i = 1, 2, .... Then for every 0 ≤ s_1 ≤ \cdots ≤ s_k ≤ d,

P(S_i ≤ s_i, i = 1, \ldots, k \mid K(d) = k) = \frac{k!}{Λ(d)^k} \int_0^{s_1} \cdots \left( \int_{x_{k−2}}^{s_{k−1}} \left( \int_{x_{k−1}}^{s_k} λ(x_1) \cdots λ(x_k)\, dx_k \right) dx_{k−1} \right) \cdots dx_1.

Thus conditionally, given that K(d) = k, the density f of the random vector (S_1, \ldots, S_k) is determined by

f(s_1, \ldots, s_k) = \frac{k!}{Λ(d)^k}\, λ(s_1) \cdots λ(s_k)\, \mathbf{1}(0 ≤ s_1 ≤ \cdots ≤ s_k ≤ d).

Remark 8.3.3. If Y_1, \ldots, Y_k are i.i.d. random variables with the density y → λ(y)/Λ(d) for y ∈ [0, d] then the density of the ordered sample is f of Lemma 8.3.1 (Y_{(1)} is the smallest of the observations, Y_{(2)} the second smallest, etc.). See for example Mikosch (2004), Section 2.1.6.

Proof of Lemma 8.3.1. In the case where λ(t) ≡ λ for each t, the desired result is obtained by a direct integration because then S_1, S_2 − S_1, \ldots, S_k − S_{k−1} are i.i.d. exponentially distributed random variables with the parameter λ. The reader is referred to Karlin and Taylor (1975), p. 126. The result is

P(S_i ≤ s_i, i = 1, \ldots, k \mid K(d) = k) = \frac{k!}{d^k} \int_0^{s_1} \cdots \left( \int_{x_{k−2}}^{s_{k−1}} \left( \int_{x_{k−1}}^{s_k} dx_k \right) dx_{k−1} \right) \cdots dx_1.  (8.3.1.2)

Let now λ be general. Write τ(t) = Λ^{−1}(t), t ≥ 0, and define the process {K^*(t)} by

K∗(t) = K(τ(t)).

Then {K^*(t)} is a homogeneous Poisson process as it was observed in the proof of Theorem 3.1.4. Let S_1^*, S_2^*, \ldots be the associated jump times, S_i^* = τ^{−1}(S_i), i = 1, 2, \ldots. The probability of the claim of the lemma equals

P(S_i^* ≤ τ^{−1}(s_i), i = 1, \ldots, k \mid K^*(τ^{−1}(d)) = k)

which by (8.3.1.2), can be written as

\frac{k!}{Λ(d)^k} \int_0^{τ^{−1}(s_1)} \cdots \left( \int_{x_{k−2}}^{τ^{−1}(s_{k−1})} \left( \int_{x_{k−1}}^{τ^{−1}(s_k)} dx_k \right) dx_{k−1} \right) \cdots dx_1.  (8.3.1.3)

Substitute x_k = τ^{−1}(y_k) in the innermost integral; it is seen that

\int_{x_{k−1}}^{τ^{−1}(s_k)} dx_k = \int_{τ(x_{k−1})}^{s_k} λ(y_k)\, dy_k.

By substituting x_{k−1} = τ^{−1}(y_{k−1}) in the second innermost integral, we obtain

\int_{x_{k−2}}^{τ^{−1}(s_{k−1})} \left( \int_{τ(x_{k−1})}^{s_k} λ(y_k)\, dy_k \right) dx_{k−1} = \int_{τ(x_{k−2})}^{s_{k−1}} \left( \int_{y_{k−1}}^{s_k} λ(y_k)\, dy_k \right) λ(y_{k−1})\, dy_{k−1}.

Continue inductively and make use of the fact that τ(0) = 0 to complete the proof. □

Proof of Theorem 8.3.1. Let Ri be the reporting delay of the ith claim. By the independence and Lemma 8.3.1, for every r1, . . . , rk ∈ R and 0 ≤ s1 ≤ · · · ≤ sk ≤ d,

P(R_1 ≤ r_1, \ldots, R_k ≤ r_k, S_1 ≤ s_1, \ldots, S_k ≤ s_k \mid K(d) = k)
= \frac{k!}{Λ(d)^k}\, G(r_1) \cdots G(r_k) \int_0^{s_1} \cdots \left( \int_{x_{k−2}}^{s_{k−1}} \left( \int_{x_{k−1}}^{s_k} λ(x_1) \cdots λ(x_k)\, dx_k \right) dx_{k−1} \right) \cdots dx_1.

Let k1, . . . , kn ∈ N ∪ {0} be such that k1 + ··· + kn = k. Obviously,

{K(d) = k, V_d(t_1) − V_d(t_0) = k_1, \ldots, V_d(t_n) − V_d(t_{n−1}) = k_n} = {K(d) = k} ∩ \bigcap_{i=1}^{n} \{\#\{j ≤ k \mid S_j + R_j ∈ I_i\} = k_i\}.

We next determine the probability

P(V_d(t_1) − V_d(t_0) = k_1, \ldots, V_d(t_n) − V_d(t_{n−1}) = k_n \mid K(d) = k).  (8.3.2)

Let

A = \{(r_1, \ldots, r_k, s_1, \ldots, s_k) ∈ \mathbb{R}^k × \mathbb{R}^k \mid 0 ≤ s_1 ≤ \cdots ≤ s_k ≤ d,\ \#\{j ≤ k \mid s_j + r_j ∈ I_i\} = k_i,\ i = 1, \ldots, n\}.

Consider the partitions of A based on the order of claims concerning the occurrence and notification. More precisely, let

\{i_{11}, \ldots, i_{1k_1}, \ldots, i_{n1}, \ldots, i_{nk_n}\} = \{1, \ldots, k\}.

The interpretation is that

the i_{11}th, ..., i_{1k_1}th claims are reported in the interval I_1,

the i_{21}th, ..., i_{2k_2}th claims are reported in the interval I_2, and so on. In the following, the symbol \sum' means summing over all these partitions (there are k!/(k_1! \cdots k_n!) such partitions). We obtain

P(V_d(t_1) − V_d(t_0) = k_1, \ldots, V_d(t_n) − V_d(t_{n−1}) = k_n \mid K(d) = k)
= P\left( \sum_{m=1}^{k} \mathbf{1}(S_m + R_m ∈ I_i) = k_i,\ i = 1, \ldots, n \,\Big|\, K(d) = k \right)
= \frac{k!}{Λ(d)^k} \int_{\mathbb{R}^k × \mathbb{R}^k} \mathbf{1}(A)\, λ(s_1) \cdots λ(s_k)\, dG(r_1) \cdots dG(r_k)\, ds_1 \cdots ds_k
= \frac{k!}{Λ(d)^k} \int_{0 ≤ s_1 ≤ \cdots ≤ s_k ≤ d} λ(s_1) \cdots λ(s_k) \sum{}' (G(t_1 − s_{i_{11}}) − G(t_0 − s_{i_{11}})) \cdots (G(t_1 − s_{i_{1k_1}}) − G(t_0 − s_{i_{1k_1}}))
\cdots (G(t_n − s_{i_{n1}}) − G(t_{n−1} − s_{i_{n1}})) \cdots (G(t_n − s_{i_{nk_n}}) − G(t_{n−1} − s_{i_{nk_n}}))\, ds_1 \cdots ds_k.

The integrand is symmetric with respect to s_1, \ldots, s_k. By summing up over all k! of their permutations, we obtain the integral over the rectangle [0, d]^k (the union of the resulting areas of integration is [0, d]^k and the Lebesgue measure of their intersections vanishes).

Thus

P(V_d(t_1) − V_d(t_0) = k_1, \ldots, V_d(t_n) − V_d(t_{n−1}) = k_n \mid K(d) = k)
= \frac{1}{Λ(d)^k} \sum{}' \int_0^d λ(s_{i_{11}})(G(t_1 − s_{i_{11}}) − G(t_0 − s_{i_{11}}))\, ds_{i_{11}} \cdots \int_0^d λ(s_{i_{1k_1}})(G(t_1 − s_{i_{1k_1}}) − G(t_0 − s_{i_{1k_1}}))\, ds_{i_{1k_1}}
\cdots \int_0^d λ(s_{i_{n1}})(G(t_n − s_{i_{n1}}) − G(t_{n−1} − s_{i_{n1}}))\, ds_{i_{n1}} \cdots \int_0^d λ(s_{i_{nk_n}})(G(t_n − s_{i_{nk_n}}) − G(t_{n−1} − s_{i_{nk_n}}))\, ds_{i_{nk_n}}
= \frac{1}{Λ(d)^k} \sum{}' \left( \int_0^d λ(s)(G(t_1 − s) − G(t_0 − s))\, ds \right)^{k_1} \cdots \left( \int_0^d λ(s)(G(t_n − s) − G(t_{n−1} − s))\, ds \right)^{k_n}
= \frac{1}{Λ(d)^k} \frac{k!}{k_1! \cdots k_n!} \left( \int_0^d λ(s)(G(t_1 − s) − G(t_0 − s))\, ds \right)^{k_1} \cdots \left( \int_0^d λ(s)(G(t_n − s) − G(t_{n−1} − s))\, ds \right)^{k_n}.

We conclude that

P(V_d(t_1) − V_d(t_0) = k_1, \ldots, V_d(t_n) − V_d(t_{n−1}) = k_n)
= P(K(d) = k)\, P(V_d(t_1) − V_d(t_0) = k_1, \ldots, V_d(t_n) − V_d(t_{n−1}) = k_n \mid K(d) = k)
= e^{−Λ(d)} \frac{Λ(d)^k}{k!} \frac{1}{Λ(d)^k} \frac{k!}{k_1! \cdots k_n!} \left( \int_0^d λ(s)(G(t_1 − s) − G(t_0 − s))\, ds \right)^{k_1} \cdots \left( \int_0^d λ(s)(G(t_n − s) − G(t_{n−1} − s))\, ds \right)^{k_n}
= e^{−\int_0^d λ(s)(G(t_1 − s) − G(t_0 − s))\, ds} \frac{\left[ \int_0^d λ(s)(G(t_1 − s) − G(t_0 − s))\, ds \right]^{k_1}}{k_1!}
\cdots e^{−\int_0^d λ(s)(G(t_n − s) − G(t_{n−1} − s))\, ds} \frac{\left[ \int_0^d λ(s)(G(t_n − s) − G(t_{n−1} − s))\, ds \right]^{k_n}}{k_n!}.

This proves the claims of the theorem. □

Corollary 8.3.1. Under the assumptions of Theorem 8.3.1, the process {V_d(t) | t ≥ 0} is a Poisson process with the intensity function Λ_d,

Λ_d(t) = \int_0^d λ(s) G(t − s)\, ds, \quad t ≥ 0.

Proof. (The proof is not asked in the examinations.) By Theorem 8.3.1, it suffices to show that {V_d(t) | t ≥ 0} is a counting process. Let S_i be the occurrence time and R_i the reporting delay of the ith claim, as earlier. Then

Vd(t) = #{i | Si ≤ d, Si + Ri ≤ t}. (8.3.3)

Let n ∈ N be fixed. For ω ∈ {K(d) = n}, the corresponding realization of V_d is increasing, right-continuous and integer-valued (the possible indices i in (8.3.3) are 1, ..., n, so that the required properties follow from the definition of V_d(t)). Hence, the realizations of {V_d(t)} have these properties with probability one. It remains to prove that

P(Vd(t) − Vd(t−) ≥ 2 for some t ≥ 0) = 0. (8.3.3.1)

Let T > 0 and N ∈ N be large. Divide the interval [0, T] into N non-overlapping intervals of length T/N. Let the endpoints be u_0, u_1, \ldots, u_N. Then

P(V_d(u_{i+1}) − V_d(u_i) ≥ 2 for some i = 0, 1, \ldots, N − 1)  (8.3.3.2)

≤ N max{P(Vd(ui+1) − Vd(ui) ≥ 2) | i = 0, 1,...,N − 1}. By Theorem 8.3.1,

P(V_d(u_{i+1}) − V_d(u_i) ≥ 2) = 1 − e^{−(Λ(u_{i+1})−Λ(u_i))} − e^{−(Λ(u_{i+1})−Λ(u_i))}(Λ(u_{i+1}) − Λ(u_i)).

Let λ̄ = \sup\{λ(t) \mid t ∈ [0, T]\}. Then |Λ(u_{i+1}) − Λ(u_i)| ≤ λ̄T/N for each i. The map x → 1 − e^{−x} − x e^{−x} is increasing for x > 0 so that

P(V_d(u_{i+1}) − V_d(u_i) ≥ 2) ≤ 1 − e^{−λ̄T/N}(1 + λ̄T/N) ≤ 1 − (1 − λ̄T/N)(1 + λ̄T/N) = (λ̄T/N)^2.

It is seen that (8.3.3.2) is at most (λ̄T)^2/N → 0 as N → ∞. Thus

P(V_d(t) − V_d(t−) ≥ 2 for some t ≤ T) = 0

so that (8.3.3.1) follows. □

Consider now the background assumption 1) of the Chain-Ladder method in the model of Theorem 8.3.1. Assume for simplicity that G(1) = 1 and that the claim sizes all equal

one. Let V = V1(1) be the number of the reported and U = K(1) − V1(1) the number of the unknown claims at time t = 1. Consider the outstanding claims at time t = 1. An assumption of the Chain-Ladder method was that

E(U + V | V ) = mV (8.3.4) where m is a constant. By Theorem 8.3.1,

E(U + V | V) = V + E(U | V) = V + E(U) = V + λ \int_0^1 (1 − G(u))\, du.

Thus (8.3.4) does not hold.

8.4 Credibility estimates for outstanding claims

The Chain-Ladder method gives one way to estimate the outstanding claims. Theorem 8.3.1 gives an alternative in a basic model which does not support the Chain-Ladder estimate. A message of the theorem is that the data can be used to estimate the intensity of the Poisson process and the distribution function of the reporting delays, but the estimate associated with an occurrence year should not depend directly on the observation of that year. On the other hand, assumptions 1), 2) and 3) of Theorem 8.3.1 can be viewed as too strong. We consider in this section a generalization where the occurrences of the claims are described by a mixed Poisson process with the parameter (λ, Q). We still assume 2), and in assumption 3), we also take the mixing variable to be independent of the reporting delays. Consider for simplicity only the claims which occur in the interval [0, 1]. The goal is to estimate the number of outstanding claims at time t ≥ 1. We have then observed V(t) = V_1(t) of them. The number of outstanding claims is denoted by U(t) = U_1(t), U(t) = K(1) − V(t). Intuitively, the observed number of the reported claims now tells something about the mixing variable Q = Q_1. Also U(t) depends on Q so that V(t) could have some prediction power in the model. We take Q as a non-observable variable so that it cannot be used directly in the prediction of U(t). A natural estimate for U(t) is the conditional expectation

E(U(t) | V(t)). This is of the form h(V(t)) where h is a measurable real-valued function. For simplicity, often only affine functions of V(t) are considered as potential estimates for U(t), that is, the estimates will be of the form a + bV(t), a, b ∈ R.

It is well known that the conditional expectation E(U(t) | V(t)) minimizes the mean squared error

E[U(t) − f(V(t))]^2

over all measurable functions f. Based on this fact, it is natural to require a similar optimality property from the constants a and b. Thus, we would like to choose a and b such that

E[U(t) − a − bV(t)]^2  (8.4.1)

is minimized.

Theorem 8.4.1. Assume that Var(V(t)) and Var(U(t)) are finite. Then the global minimum of (8.4.1) is obtained at

a = a_t^* = E(U(t)) − b_t^* E(V(t)), \qquad b = b_t^* = \frac{Cov(V(t), U(t))}{Var(V(t))}.

Proof. For the optimality, the first order partial derivatives of (8.4.1) must vanish,

\frac{∂}{∂a} : −2E(U(t) − a − bV(t)) = 0, \qquad \frac{∂}{∂b} : −2E(V(t)(U(t) − a − bV(t))) = 0.

Equivalently,

a + bE(V(t)) = E(U(t)), \qquad aE(V(t)) + bE(V(t)^2) = E(V(t)U(t)).

The solution gives the stated values a = a_t^* and b = b_t^*. By convexity, these values give the global minimum. □

Write

H_t = \int_0^1 G(t − s)\, ds, \quad t ≥ 1,

where G is as in Theorem 8.3.1.

Theorem 8.4.2. Assume that σ_Q^2 ∈ (0, ∞) and that H_t > 0. Then the global minimum of (8.4.1) is obtained with a = a_t^* and b = b_t^* where

a_t^* = (1 − c_t)E(U(t)), \qquad b_t^* = c_t \frac{1 − H_t}{H_t}, \qquad c_t = \frac{λH_tσ_Q^2}{1 + λH_tσ_Q^2},

where E(U(t)) = λ(1 − Ht). Proof. By Theorem 8.4.1,

a_t^* = E(U(t)) − b_t^* E(V(t)), \qquad b_t^* = \frac{Cov(U(t), V(t))}{Var(V(t))}.

By Theorem 8.3.1, U(t) is a mixed Poisson variable with the parameter (λ(1 − Ht),Q) and V (t) is a mixed Poisson variable with the parameter (λHt,Q). Hence, the mixing variables are equal. By Theorem 3.2.1,

Var(V(t)) = λH_t + λ^2 H_t^2 σ_Q^2.

By Theorem 8.3.1, given Q, the variables U(t) and V(t) are conditionally independent. Thus

E(U(t)V(t)) = E{E(U(t)V(t) | Q)} = E{E(U(t) | Q) E(V(t) | Q)} = E(λ^2(1 − H_t)H_t Q^2) = λ^2 H_t(1 − H_t)E(Q^2).

Further,

Cov(U(t), V(t)) = E(U(t)V(t)) − E(U(t))E(V(t)) = λ^2 H_t(1 − H_t)σ_Q^2.

Hence,

a_t^* = λ(1 − H_t)\left(1 − \frac{λH_tσ_Q^2}{1 + λH_tσ_Q^2}\right) = (1 − c_t)E(U(t))

and

b_t^* = \frac{λ(1 − H_t)σ_Q^2}{1 + λH_tσ_Q^2} = c_t \frac{1 − H_t}{H_t}. □

Theorem 8.4.2 gives for the unknown claims at time t the estimate

U^*(t) = a_t^* + b_t^* V(t) = (1 − c_t)E(U(t)) + c_t \frac{1 − H_t}{H_t} V(t).

Here

E\left(\frac{1 − H_t}{H_t} V(t)\right) = \frac{1 − H_t}{H_t} λH_t = λ(1 − H_t) = E(U(t)).

The estimate is a weighted average of the expectation E(U(t)) and a scaled observation V(t). The parameter c_t is called the credibility factor (or coefficient) and U^*(t) is called the credibility estimate.
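Numerically, the credibility estimate is a one-line computation once λ, σ_Q^2 and H_t are fixed. The following sketch uses invented parameter values and an exponential reporting delay distribution purely for illustration; in practice these quantities would be estimated from the data.

```python
import math

# Credibility estimate U*(t) of Theorem 8.4.2 under invented parameter values.
# H_t = integral of G(t - s) over (0, 1]; here G(u) = 1 - exp(-delta*u) as an assumption.
lam, sigma_Q2, delta, t = 200.0, 0.04, 4.0, 1.25
V_t = 170  # observed number of reported claims (hypothetical)

H_t = 1.0 - (math.exp(-delta * (t - 1.0)) - math.exp(-delta * t)) / delta
c_t = lam * H_t * sigma_Q2 / (1.0 + lam * H_t * sigma_Q2)   # credibility factor
EU_t = lam * (1.0 - H_t)                                    # E(U(t))

U_star = (1.0 - c_t) * EU_t + c_t * (1.0 - H_t) / H_t * V_t
print("H_t =", H_t, "credibility factor c_t =", c_t)
print("credibility estimate U*(t) =", U_star)
```

If the observed V(t) is below its expectation λH_t, the credibility estimate is pulled below E(U(t)), and the credibility factor c_t controls how strongly.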

The credibility factor c_t tells how much we rely on the observation V(t). Its form can be motivated in the following way.

(i) The relative standard deviation of the Poisson variable (the standard deviation divided by the mean) is small if the Poisson parameter is large. This motivates using a large c_t if λH_t is large (if Q = q then we expect that V(t) ≈ λH_t q).

(ii) A large variance of the mixing variable means that the realized Poisson parameters vary considerably from year to year. Hence, it is natural to give a large weight to the observation. That is, we should make use of a large c_t.

The credibility estimate U^*(t) may be viewed as a compromise between the Chain-Ladder estimate and the estimate of Theorem 8.3.1.

Further references to Chapter 8: Rantala (1984), Ruohonen (1988) and Norberg (1993).

9 Solvency in the long run

We have already studied the ruin problem in a short time horizon, mostly during one year. Basically, the problem was:

1) The company has an initial capital U_0 at the beginning of the year.

2) The company runs its insurance business for one year.

3) The problem is to determine or approximate the survival probability of the company, that is, the probability that the company is able to pay the compensations associated with the claims during the year. Survival means that ruin does not occur during the year.

The short time horizon is often sufficient from the insurance supervisory point of view. If the one year ruin probability is small enough (of order 10^{−3} to 10^{−2}, say) then the situation of the policyholders may be viewed as safe enough. If it is too large then it may still be possible to merge the company with a more solvent one. Then the compensations could be paid and the situation would not necessarily be disadvantageous for the receiving company.

We focus in this section on solvency questions in long time horizons. This should be interesting from the company point of view but also from the policyholders' point of view. The problem is basically the same as in points 1-3 above but the time horizon is longer, 10 years, say, or even infinite. The question is: what is the probability that the capital of the company is non-negative at the end of each year during the given time horizon. We study the ruin problem in the classical random walk model for the insurance processes. To be of more applied interest, various further aspects should be taken into consideration. We will give some of them at the end of the chapter.

9.1 Classical ruin problem

Let U0 > 0 be the initial capital of the company and

ξn = the loss in year n, n = 1, 2,...,

Yn = ξ1 + ··· + ξn, the accumulated loss.

A natural interpretation is that ξ_n is the net payout in year n (the claims less the premiums) but this structure is not used in the sequel. The time of ruin T = T(U_0) is defined by

T = \inf\{n \mid Y_n > U_0\},

with T = +∞ if Y_n ≤ U_0 for every n = 1, 2, ....

Hence, the capital of the company will be negative for the first time at time T. Negative capital means that the company is bankrupt.
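For a finite time horizon, the ruin probability P(T ≤ n_0) can always be approximated by straightforward simulation of the accumulated loss Y_n (compare Section 6.3). A minimal sketch, assuming for illustration normally distributed yearly losses with a negative mean, that is, with a positive safety loading:

```python
import random

# Monte Carlo estimate of the finite-horizon ruin probability
# P(Y_n > U_0 for some n <= horizon), with illustrative normal yearly losses.
random.seed(0)
U0, horizon, n_sim = 30.0, 10, 100_000
mu, sigma = -2.0, 10.0   # E(xi) < 0: the premium contains a positive safety loading

ruined = 0
for _ in range(n_sim):
    Y = 0.0
    for _ in range(horizon):
        Y += random.gauss(mu, sigma)   # accumulated loss Y_n
        if Y > U0:                     # capital U_0 - Y_n becomes negative
            ruined += 1
            break

print("estimated ruin probability within", horizon, "years:", ruined / n_sim)
```

Such a brute-force estimate becomes inefficient when the ruin probability is very small; the importance sampling ideas of Section 6.3.3 and the analytic bounds below are then more useful.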

Assume that ξ, ξ_1, ξ_2, ... are i.i.d. random variables. Let c be the cumulant generating function,

c(s) = \log E(e^{sξ}), \quad s ∈ \mathbb{R},

and let D = {s ∈ \mathbb{R} | c(s) < ∞}. Let c^* be the convex conjugate as earlier,

c^*(v) = \sup_{s ∈ \mathbb{R}}\{sv − c(s)\}, \quad v ∈ \mathbb{R}.

Write

s̄ = \sup\{s ∈ \mathbb{R} \mid c(s) < ∞\} ∈ [0, ∞]

and

\underline{x} = \lim_{s→s̄−} \frac{1}{c'(s)}.

The limit exists if s̄ > 0 because c' is increasing by convexity. Define the function h : (\underline{x}, ∞) → \mathbb{R} ∪ {∞} by h(x) = x c^*(1/x).

Lemma 9.1.1. Assume that E(ξ) < 0 and Var(ξ) > 0. Assume further that c(s) ∈ (0, ∞) for some s > 0. Then there exists a unique R ∈ (0, ∞) such that c(R) = 0. If x ∈ (\underline{x}, ∞) then there exists a unique s_x ∈ (0, s̄) such that c'(s_x) = 1/x, and then h'(x) = −c(s_x). Let μ_T = 1/c'(R). Then h(μ_T) = R, h is strictly decreasing on (\underline{x}, μ_T] and strictly increasing on [μ_T, ∞).

The parameter R is called the Lundberg exponent. The assumption E(ξ) < 0 means from the applied point of view that the premium contains a positive safety loading.
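Once the distribution of ξ is specified, the Lundberg exponent can be computed numerically from the equation c(R) = 0. The sketch below assumes, purely for illustration, that the yearly loss is a compound Poisson claim amount with exponential claim sizes less a constant premium; the root is located by bisection and then used to suggest an initial capital via the exponential bound e^{−RU_0} of Theorem 9.1.1 below.

```python
import math

# Numerical computation of the Lundberg exponent R from c(R) = 0 under an
# illustrative model: xi = S - p, where S is the yearly compound Poisson claim
# amount with claim number intensity lam and Exp(beta) claim sizes, and p is
# the yearly premium.  (These modelling choices are assumptions of this sketch.)
lam, beta, p = 10.0, 1.0, 12.0          # E(xi) = lam/beta - p = -2 < 0

def c(s):
    # Cumulant generating function of xi; finite for s < beta.
    return lam * (beta / (beta - s) - 1.0) - p * s

# c(0) = 0, c'(0) = E(xi) < 0 and c(s) grows without bound as s -> beta-, so the
# second zero R lies in (0, beta); locate it by bisection on the sign change.
lo, hi = 1e-9, beta - 1e-9              # c(lo) < 0, c(hi) > 0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if c(mid) < 0.0:
        lo = mid
    else:
        hi = mid
R = 0.5 * (lo + hi)
print("Lundberg exponent R:", R)        # exact value in this example is 1/6

# Initial capital for which the bound exp(-R*U_0) is below a target level eps.
eps = 0.001
print("U_0 with exp(-R*U_0) <= eps:", -math.log(eps) / R)
```

The last line illustrates the practical use of the upper bound (9.1.1): a sufficient initial capital for a given ruin probability level follows directly from R.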

[Figure: graph of the cumulant generating function c(s), with the Lundberg exponent R (where c(R) = 0) and the boundary point s̄ marked on the s-axis.]

Proof of Lemma 9.1.1. We begin by showing that c is strictly convex in D̊ (the interior of D). Let s_0 ∈ D̊. Then c''(s_0) = Var_{s_0}(ξ) ≥ 0 (the variance with respect to the conjugate distribution P_{s_0}). The inequality is strict since otherwise P_{s_0} would concentrate to a single point, and so would the original distribution. Thus c''(s_0) > 0 so that c is strictly convex.

We assumed that E(ξ) < 0 so that c(s) is negative for some s > 0. We also assumed that c(s) > 0 for some s > 0 so that the required parameter R exists. The uniqueness follows from the convexity of c. Now c' is continuous on D̊ so that c'(s) = 0 for some s > 0. If x ∈ (\underline{x}, ∞) then by the strict convexity, there exists a unique s_x ∈ (0, ∞) such that c'(s_x) = 1/x. This connection determines the map x → s_x which is differentiable (because s_x = (c')^{−1}(1/x)). Consider now the function h. The derivative for x ∈ (\underline{x}, ∞) is

h'(x) = \frac{∂}{∂x}\left( x(s_x/x − c(s_x)) \right) = \frac{∂s_x}{∂x} − x c'(s_x)\frac{∂s_x}{∂x} − c(s_x) = −c(s_x).

If x = μ_T then s_x = R and h'(x) = 0. If x ∈ (\underline{x}, μ_T) then s_x > R so that h'(x) < 0. Thus h is strictly decreasing on (\underline{x}, μ_T]. It is seen similarly that h is strictly increasing on [μ_T, ∞). □

Theorem 9.1.1. Let ξ_1, ξ_2, ... be as described above. Then under the assumptions of Lemma 9.1.1,

P(T < ∞) ≤ e^{−RU_0}  (9.1.1)

for every U_0 > 0 and

\lim_{U_0→∞} U_0^{−1} \log P(T < ∞) = −R.  (9.1.2)

Upper bound (9.1.1) is useful because it provides a way to find out the initial capital such that the ruin probability is under the given level. Limit (9.1.2) shows that the exponent R is the best possible in the sense that for any given R' > R,

P(T < ∞) > e^{−R'U_0}

for large U_0. The limiting procedure in (9.1.2) is motivated because a large U_0 indicates a small ruin probability. This type of requirement is natural also when longer time horizons are considered. For finite time horizons, the following analogy to Theorem 9.1.1 holds.

Theorem 9.1.2. Assume the conditions of Theorem 9.1.1. Let x ∈ (\underline{x}, μ_T). Then

P(T ≤ xU_0) ≤ e^{−xc^*(1/x)U_0}  (9.1.3)

for every U0 > 0 and

\lim_{U_0→∞} U_0^{−1} \log P(T ≤ xU_0) = −xc^*(1/x).  (9.1.4)

Proof of Theorem 9.1.1. The event {T = n} can be expressed as

{T = n} = {(ξ1, . . . , ξn) ∈ Bn},

where B_n is a Borel subset of \mathbb{R}^n,

B_n = \{(v_1, \ldots, v_n) ∈ \mathbb{R}^n \mid v_1 ≤ U_0,\ v_1 + v_2 ≤ U_0,\ \ldots,\ v_1 + \cdots + v_{n−1} ≤ U_0,\ v_1 + \cdots + v_n > U_0\}.

Let's make the conjugate change of measure with the parameter R to the distributions of the variables ξ_1, \ldots, ξ_n. The independence of the variables is preserved. It is also possible to take a sequence ξ_1, ξ_2, \ldots of independent P_R-distributed random variables on some probability space. Then for a measurable rectangle A_1 × \cdots × A_n,

P((ξ_1, \ldots, ξ_n) ∈ A_1 × \cdots × A_n) = P(ξ_1 ∈ A_1) \cdots P(ξ_n ∈ A_n) = E_R(e^{−Rξ_1 + c(R)} \mathbf{1}(ξ_1 ∈ A_1)) \cdots E_R(e^{−Rξ_n + c(R)} \mathbf{1}(ξ_n ∈ A_n)).

By Lemma 9.1.1, c(R) = 0. By the independence,

P((ξ_1, \ldots, ξ_n) ∈ A_1 × \cdots × A_n) = E_R(e^{−R(ξ_1 + \cdots + ξ_n)} \mathbf{1}((ξ_1, \ldots, ξ_n) ∈ A_1 × \cdots × A_n)).

Necessarily, a similar representation holds for every Borel subset of \mathbb{R}^n. In particular,

P(T = n) = P((ξ_1, \ldots, ξ_n) ∈ B_n) = E_R(e^{−R(ξ_1 + \cdots + ξ_n)} \mathbf{1}((ξ_1, \ldots, ξ_n) ∈ B_n)) = E_R(e^{−RY_n} \mathbf{1}(T = n)).

Now YT > U0 on {T < ∞} so that

P(T < ∞) = \sum_{n=1}^{∞} P(T = n) = \sum_{n=1}^{∞} E_R(e^{−RY_n} \mathbf{1}(T = n)) = E_R(e^{−RY_T} \mathbf{1}(T < ∞)) ≤ E_R(e^{−RU_0} \mathbf{1}(T < ∞)) = e^{−RU_0} P_R(T < ∞).

This implies (9.1.1). In fact, E_R(ξ_1) = c'(R) > 0 so that by the law of large numbers, P_R(T < ∞) = 1.

Let y > 0 be fixed and N = N(U_0) = ⌊yU_0⌋ + 1. Clearly,

{YN > U0} ⊆ {T < ∞} so that

P(T < ∞) ≥ P(Y_N > U_0) = P\left(\frac{Y_N}{N} > \frac{U_0}{⌊yU_0⌋ + 1}\right) ≥ P\left(\frac{Y_N}{N} > \frac{1}{y}\right).

By Cramér's theorem,

\liminf_{U_0→∞} U_0^{−1} \log P(T < ∞) ≥ \liminf_{U_0→∞} U_0^{−1} \log P\left(\frac{Y_N}{N} > \frac{1}{y}\right) = y \liminf_{U_0→∞} N^{−1} \log P\left(\frac{Y_N}{N} > \frac{1}{y}\right) ≥ −y \inf_{v > 1/y} c^*(v).

Choose y > 0 such that the obtained lower bound is as large as possible. Then we get

\liminf_{U_0→∞} U_0^{−1} \log P(T < ∞) ≥ −\inf_{y>0}\{y \inf(c^*(v) \mid v > 1/y)\} = −\inf_{v>0}\{\inf(yc^*(v) \mid y > 1/v)\} = −\inf_{v>0}(c^*(v)/v) = −\inf_{y>0}(yc^*(1/y)).

One lower bound is obtained by choosing y = μ_T in the last infimum. Then by Lemma 6.3.3.1,

yc^*(1/y) = \frac{c^*(c'(R))}{c'(R)} = \frac{Rc'(R) − c(R)}{c'(R)} = R.

Hence,

\liminf_{U_0→∞} U_0^{−1} \log P(T < ∞) ≥ −R.

By upper bound (9.1.1),

\limsup_{U_0→∞} U_0^{−1} \log P(T < ∞) ≤ −R.

The obtained estimates prove (9.1.2). □

The following picture illustrates the above proof of the lower bound.

[Figure: graph of c(s) with a tangent line of slope 1/y intersecting the s-axis.]

Let 1/y be the slope of a tangent of c. Then the distance from the point where the tangent meets the horizontal axis to the origin is yc^*(1/y). By geometry, the minimal distance equals R, and it is obtained by choosing y such that 1/y = c'(R).

Proof of Theorem 9.1.2. We make a conjugate change of measure with the parameter s_x which is described in Lemma 9.1.1. Then

P(T ≤ xU_0) = E_{s_x}(e^{−s_x Y_T + Tc(s_x)} \mathbf{1}(T ≤ xU_0)).

Obviously, c(sx) ≥ 0 so that

P(T ≤ xU_0) ≤ e^{−s_x U_0 + xU_0 c(s_x)} P_{s_x}(T ≤ xU_0) ≤ e^{−x(s_x/x − c(s_x))U_0} = e^{−xc^*(1/x)U_0}.

This is (9.1.3). Let now y ∈ (0, x) be fixed. For y close enough to x, the strict convexity of c shows that y = 1/c'(s_y) for some s_y > s_x. Now

{Y_{⌊yU_0⌋+1} > U_0} ⊆ {T ≤ xU_0} for large U_0. By Cramér's theorem,

\liminf_{U_0→∞} U_0^{−1} \log P(T ≤ xU_0) ≥ \liminf_{U_0→∞} U_0^{−1} \log P\left(\frac{Y_{⌊yU_0⌋+1}}{⌊yU_0⌋ + 1} > \frac{1}{y}\right) ≥ −y \inf_{v > 1/y} c^*(v).

Now

\inf_{v > 1/y} c^*(v) = c^*(1/y)

for y close to x, since c^* is finite (and hence, continuous by convexity) in a neighbourhood of 1/x. It follows that

\liminf_{U_0→∞} U_0^{−1} \log P(T ≤ xU_0) ≥ −\lim_{y→x−}(yc^*(1/y)) = −xc^*(1/x).

Limit (9.1.4) follows from this and (9.1.3). □

The following law of large numbers is closely related to the above results.

Theorem 9.1.3. Under the assumptions of Lemma 9.1.1,

\lim_{U_0→∞} P(|T/U_0 − μ_T| ≤ ε \mid T < ∞) = 1

for every ε > 0.

Proof. Let x > µT and let sx be as in Lemma 9.1.1. Then

P(xU_0 ≤ T < ∞) = E_{s_x}(e^{−s_x Y_T + Tc(s_x)} \mathbf{1}(xU_0 ≤ T < ∞)).

Now sx < R and c(sx) < 0 so that

P(xU_0 ≤ T < ∞) ≤ e^{−s_x U_0 + xU_0 c(s_x)} P_{s_x}(xU_0 ≤ T < ∞) ≤ e^{−xc^*(1/x)U_0}.

Let h be as in Lemma 9.1.1. Combine the above estimate with Theorem 9.1.2 to conclude that for small ε > 0,

1 = \frac{P(T < (μ_T − ε)U_0)}{P(T < ∞)} + \frac{P(|T/U_0 − μ_T| ≤ ε)}{P(T < ∞)} + \frac{P((μ_T + ε)U_0 < T < ∞)}{P(T < ∞)}
≤ \frac{e^{−h(μ_T − ε)U_0}}{P(T < ∞)} + \frac{P(|T/U_0 − μ_T| ≤ ε)}{P(T < ∞)} + \frac{e^{−h(μ_T + ε)U_0}}{P(T < ∞)}.

The first and the last term tend to zero by Lemma 9.1.1 and Theorem 9.1.1 so that the desired limit follows. □

We end the section by considering the case where the safety loading equals zero.

Theorem 9.1.4. Let ξ, ξ_1, ξ_2, ... be i.i.d. random variables and let {Y_n} be as above. Assume that E(ξ) = 0 and that Var(ξ) ∈ (0, ∞). Let U_0 > 0. Then P(T(U_0) < ∞) = 1. If in addition, c is finite in a neighbourhood of the origin then E(T(U_0)) = ∞.

The first result of the theorem justifies the positive safety loading.

Proof of Theorem 9.1.4. Write σ^2 = Var(ξ). Let U_0 > 0. By the central limit theorem,

\lim_{n→∞} P(Y_n > U_0) = \lim_{n→∞} P\left(\frac{Y_n}{σ\sqrt{n}} > \frac{U_0}{σ\sqrt{n}}\right) = \frac{1}{2}.

Hence,

P(T(U_0) < ∞) ≥ \lim_{n→∞} P(Y_n > U_0) = \frac{1}{2}.  (9.1.5)

Make the contrary assumption that P(T(U_0) < ∞) ≤ a < 1 for every U_0 > 0. Write

V(U_0) = (Y_{T(U_0)} − U_0)\mathbf{1}(T(U_0) < ∞) = \sum_{n=1}^{∞}(Y_n − U_0)\mathbf{1}(T(U_0) = n).

Fix α ∈ (a, 1) and ε > 0. Let further v_0 be such that P(V(U_0) > v_0) < ε. Then

P(T(v_0 + 2U_0) < ∞) = P(T(v_0 + 2U_0) < ∞, V(U_0) ≤ v_0) + P(T(v_0 + 2U_0) < ∞, V(U_0) > v_0) ≤ P(T(v_0 + 2U_0) < ∞, V(U_0) ≤ v_0) + ε.

Now

{T(v_0 + 2U_0) < ∞, V(U_0) ≤ v_0} = \bigcup_{n=1}^{∞} \{T(U_0) = n, Y_n − U_0 ≤ v_0, T(v_0 + 2U_0) < ∞\}
⊆ \bigcup_{n=1}^{∞} \{T(U_0) = n, Y_n − U_0 ≤ v_0, ξ_{n+1} + \cdots + ξ_{n+j} > U_0 for some j ≥ 1\}

so that

P(T(v_0 + 2U_0) < ∞, V(U_0) ≤ v_0) ≤ \sum_{n=1}^{∞} P(T(U_0) = n, Y_n − U_0 ≤ v_0)P(T(U_0) < ∞) = P(T(U_0) < ∞, V(U_0) ≤ v_0)P(T(U_0) < ∞) ≤ a^2.

Hence,

P(T(v_0 + 2U_0) < ∞) ≤ a^2 + ε.

By choosing ε = a(α − a) and writing U_1 = v_0 + 2U_0, it is seen that

P(T (U1) < ∞) ≤ αa.

Repeat the argument to see that there exists U2 such that

P(T(U_2) < ∞) ≤ α^2 a.

Continue in the same way to obtain a contradiction to (9.1.5). Thus P(T (U0) < ∞) = 1. It remains to show that E(T (U0)) = ∞. Clearly,

E(T(U_0)) = \sum_{n=1}^{∞} nP(T = n) = \sum_{n=1}^{∞} P(T = n) \sum_{k=1}^{n} 1 = \sum_{k=1}^{∞} \sum_{n=k}^{∞} P(T = n) = \sum_{k=1}^{∞} P(T ≥ k).

We assumed that E(ξ) = 0 so that there exists h > 0 such that c(s) > 0 and c'(s) > 0 for every s ∈ (0, h). For sufficiently large x, there exists s_x ∈ (0, h) such that c'(s_x) = 1/x. It is seen as in Theorem 9.1.2 that

P(T ≤ xU_0) ≤ e^{−xc^*(1/x)U_0}.

Hence,

E(T(U_0)) = \sum_{k=1}^{∞} (1 − P(T ≤ k − 1)) ≥ \sum_{k=k_0}^{∞} \left(1 − e^{−kc^*(U_0/k)}\right)

for large k_0. To prove the theorem, it suffices to show that

\lim_{x→∞} x\left(1 − e^{−xc^*(U_0/x)}\right) > 0.  (9.1.6)

Let t_x be such that c'(t_x) = U_0/x. Clearly, t_x → 0+ as x → ∞ so that

c(t_x) = c(0) + c'(0)t_x + (c''(0)/2 + o(1))t_x^2 = (c''(0)/2 + o(1))t_x^2.

Now

(c')^{−1}(0) = 0 \quad and \quad \frac{∂(c')^{−1}(y)}{∂y}\Big|_{y=0} = \frac{1}{c''(0)}

so that

t_x = (c')^{−1}(U_0/x) = \left(\frac{1}{c''(0)} + o(1)\right)\frac{U_0}{x}, \quad x → ∞.

It follows that xc^*(U_0/x) → 0 as x → ∞. By l'Hospital's rule and Lemma 9.1.1, limit (9.1.6) equals

\lim_{x→∞} \frac{U_0 e^{−xc^*(U_0/x)}\, \frac{∂}{∂x}\left(xc^*(U_0/x)/U_0\right)}{−1/x^2} = \lim_{x→∞} \frac{c(t_x)e^{−xc^*(U_0/x)}}{1/x^2}

where c'(t_x) = U_0/x. Thus the limit equals U_0^2/(2c''(0)) > 0. □

Further references to Section 9.1: Martin-Löf (1983), Martin-Löf (1986).

9.2 Practical long run modelling of the capital development

In the analysis of Section 9.1, the model can be viewed as oversimplified. At least the following features should be added in order to have a reasonable approximation of reality.

1. Investment income. The companies invest the money they have, which provides an additional source of income. We regard as investment income, for example, interest from the bank account, dividends from the stocks owned by the company, and gains from operations in the financial market.

2. Inflation. Claim sizes and other monetary quantities have a tendency to increase because of inflation.

3. Real growth. The number of policy-holders of a company varies from year to year.

4. Economic cycles. Cycles in the economy affect the insurance claims and the premiums.

Asymptotic estimates for ruin probabilities are nowadays available in general models which describe at least some of the above mentioned features.

Further reference to Section 9.2: DPP, Part II.

10 Insurance from the viewpoint of utility theory

Insurance contracts are typically such that the insured loses money in the mean. That is, if X is the total claim amount of the policy-holder and P the premium then usually P > E(X). This is based on the common fact that the premium contains a positive safety loading (and also a loading for administration costs). From this point of view, buying an insurance looks irrational. Thus we can ask what motivates the insured. More widely, a natural question is which risks the policy-holder wants to safeguard by insurance contracts and which prices are appropriate. We consider these questions from the viewpoint of utility theory.

As an introduction, let's consider the so-called St Petersburg paradox. A coin is tossed until the tail appears for the first time. If this requires n tosses then player A pays 2^{n−1} euros to player B. What is the value of this game for player B? More concretely, if B is going to sell his right to participate in this game, what price does he want? Let V be the (random) amount of money received by player B. Then

E(V) = \sum_{n=1}^{∞} \frac{1}{2^n} \cdot 2^{n−1} = +∞.

If the price were determined as the expectation E(V), it would be ∞. It is plausible that there exists a finite price which would be appropriate for B.

10.1 Utility functions

In utility theory, the value of the wealth of a person or a company is measured by means of a utility function. The wealth as such cannot be viewed as a reasonable measure, in particular when large amounts of money are considered. It is, however, natural to require that the value should be an increasing function of the wealth. It is also rather natural to require that the utility gain from any fixed increase of wealth should be a decreasing function of the (initial) wealth. This is called the law of diminishing marginal utility. We will describe the value of the wealth by a utility function u : \mathbb{R} → \mathbb{R}. Thus the value (or utility) of the wealth v is u(v). By the above discussion, we assume that u is increasing. We also take the law of diminishing marginal utility as a starting point. In precise terms, we will assume that for every t > 0, the function g_t : \mathbb{R} → \mathbb{R},

g_t(v) = u(v + t) − u(v)  (10.1.1)

is decreasing. The value or the utility of a random variable Y is defined as the expectation E(u(Y)), provided that Y is integrable. If an agent can choose between two random variables then he chooses the one which has the higher utility. This principle is called the expected utility hypothesis.

Example 10.1.1. Let u(v) = v^p for v > 0 where p ∈ (0, 1). The game of the St Petersburg paradox has the utility

E(u(V)) = \sum_{n=1}^{∞} \frac{1}{2^n} \cdot (2^{n−1})^p = \frac{1}{2 − 2^p}.

If P is such that u(P) = E(u(V)) then player B considers P as a fair (or neutral) price for the right to participate in the game.

In many situations, the absolute utility is not of direct interest. Instead, it is interesting to order random variables in decision making. For this purpose, it is equivalent to use the utility function u_{a,b},

u_{a,b}(v) = a u(v) + b,

in place of u (a > 0 and b ∈ \mathbb{R} are arbitrary constants).
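The fair price in Example 10.1.1 is easy to evaluate numerically. The short sketch below computes E(u(V)) = 1/(2 − 2^p) and the price P with u(P) = E(u(V)) for a few illustrative values of p; the chosen values of p are arbitrary.

```python
# Fair price of the St Petersburg game for the utility u(v) = v**p, p in (0, 1):
# E(u(V)) = 1 / (2 - 2**p) and P = E(u(V))**(1/p), see Example 10.1.1.
for p in (0.25, 0.5, 0.75, 0.9):
    expected_utility = 1.0 / (2.0 - 2.0 ** p)
    fair_price = expected_utility ** (1.0 / p)
    print(f"p = {p}: E(u(V)) = {expected_utility:.4f}, fair price P = {fair_price:.4f}")
```

In contrast to the infinite expectation E(V), the resulting prices are finite and moderate, which is in line with the resolution of the paradox discussed above.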

Lemma 10.1.1. Assume that u is increasing and that the function gt of (10.1.1) is decreasing for every t > 0. Then u is concave, that is,

u(αv1 + (1 − α)v2) ≥ αu(v1) + (1 − α)u(v2) for every α ∈ (0, 1), v1, v2 ∈ R. (10.2.1)

Proof. We first show that u is continuous. Let v ∈ R be arbitrary. Because u is increasing the limits

u(v−) = \lim_{w→v−} u(w) \quad and \quad u(v+) = \lim_{w→v+} u(w)

exist and u(v−) ≤ u(v+). So we have to show that u(v+) ≤ u(v−). Let w < v and n ∈ \mathbb{N}. By our assumptions, g_{(v−w)/n} is decreasing so that

u(v) = u(w) + \sum_{k=1}^{n}\left[ u\left(w + \frac{k(v − w)}{n}\right) − u\left(w + \frac{(k − 1)(v − w)}{n}\right) \right] ≥ u(w) + n\left[ u\left(v + \frac{v − w}{2n}\right) − u\left(v − \frac{v − w}{2n}\right) \right].

Thus

u\left(v + \frac{v − w}{2n}\right) − u\left(v − \frac{v − w}{2n}\right) ≤ \frac{u(v) − u(w)}{n}.

As n → ∞, the right hand side tends to zero and the left hand side to u(v+) − u(v−). Hence, u(v+) ≤ u(v−).

To prove the concavity, consider arbitrary v, w ∈ \mathbb{R}, w < v. By the assumptions,

u\left(\frac{w + v}{2}\right) − u(w) ≥ u(v) − u\left(\frac{w + v}{2}\right)

so that

u\left(\frac{1}{2}w + \frac{1}{2}v\right) ≥ \frac{1}{2}u(w) + \frac{1}{2}u(v).

Thus (10.2.1) holds for α = 1/2. Let

w = x_0 < x_1 < \cdots < x_{2^n} = v, \qquad x_i − x_{i−1} = \frac{v − w}{2^n}.

Apply the previous result to the pairs (x_0, x_2), (x_1, x_3), \ldots to see that the piecewise linear function through the points (x_i, u(x_i)), i = 0, 1, \ldots, 2^n, is concave. By the continuity of u, the resulting sequence of piecewise linear functions converges pointwise to u. It follows that u is concave. □

10.2 Utility of insurance

We assume throughout the section that the utility functions of the (potential) insured and the company are increasing and concave. We study, under the expected utility hypothesis, under which conditions an insurance contract may result.

Consider an agent (a potential insured) whose wealth is initially a_0 and whose utility function is u. Let X describe the total claim amount of the agent during the next year. Suppose that the company offers a contract for the compensation of X with the premium P. A natural question is: how large can P be in order that the agent is willing to accept the offer? Consider two possible futures:

a) No insurance

The wealth of the agent after one year is a0 − X. The utility is

E(u(a0 − X)).

b) Insurance

The wealth is now a0 − P and the utility after one year

E(u(a0 − P )) = u(a0 − P ). By the concavity of u, Jensen’s inequality applies so that

E(u(a_0 − X)) ≤ u(E(a_0 − X)) = u(a_0 − E(X)).

The agent chooses insurance at least if

P ≤ E(X) since u is increasing. More generally, insurance will be chosen if

u(a0 − P ) ≥ E(u(a0 − X)). (10.2.2)

It is convenient to write this in the form P ≤ π(a0) where

u(a0 − π(a0)) = E(u(a0 − X)).

The quantity π(a0) is called the zero utility premium of the agent. Usually π(a0) > E(X) so that the agent can choose insurance even if the premium contains a positive safety loading.

Consider now the prospects of the company. Let the capital be initially A0 and let the utility function be U. The company is willing to sell a contract if the premium P satisfies

E(U(A0 + P − X)) ≥ U(A0). By Jensen’s inequality,

E(U(A0 + P − X)) ≤ U(A0 + P − E(X)).

Thus the company does not sell if the premium is less than E(X). Let ϱ(A_0) be such that

E(U(A_0 + ϱ(A_0) − X)) = U(A_0).

Then ϱ(A_0) is the zero utility premium of the company. The company sells the contract if P ≥ ϱ(A_0). Both sides agree to the contract if

ϱ(A_0) ≤ P ≤ π(a_0).
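Both zero utility premiums can be computed numerically by root finding once the utility functions and the distribution of X are specified. The sketch below is only an illustration: it assumes exponential utilities for the agent and for the company and approximates X by a lognormal Monte Carlo sample; none of these choices is prescribed by the text above.

```python
import math
import random

# Zero utility premiums pi(a0) and rho(A0) under illustrative assumptions:
# exponential utilities u(v) = 1 - exp(-alpha*v), U(v) = 1 - exp(-A*v),
# and a lognormal claim amount X approximated by a Monte Carlo sample.
random.seed(2)
alpha, A = 0.05, 0.005            # risk aversions of the agent and of the company
a0, A0 = 100.0, 10_000.0          # initial wealth of the agent / capital of the company
X = [math.exp(random.gauss(2.0, 0.8)) for _ in range(20_000)]  # claim sample
EX = sum(X) / len(X)

def u(v):  return 1.0 - math.exp(-alpha * v)
def U(v):  return 1.0 - math.exp(-A * v)

def solve(f, lo, hi, n=60):
    # simple bisection for f(x) = 0, assuming f(lo) and f(hi) have opposite signs
    for _ in range(n):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# pi(a0): u(a0 - pi) = E u(a0 - X);  rho(A0): E U(A0 + rho - X) = U(A0).
Eu_no_ins = sum(u(a0 - x) for x in X) / len(X)
pi_a0 = solve(lambda p: u(a0 - p) - Eu_no_ins, 0.0, 10.0 * EX)
rho_A0 = solve(lambda p: sum(U(A0 + p - x) for x in X) / len(X) - U(A0), 0.0, 10.0 * EX)

print("E(X) =", EX, " pi(a0) =", pi_a0, " rho(A0) =", rho_A0)
print("premiums P with rho(A0) <= P <= pi(a0) exist:", rho_A0 <= pi_a0)
```

In this illustrative setup the agent is assumed to be more risk averse than the company, so the printed interval of agreeable premiums should be non-empty, in line with the discussion above.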

Consider the particular case where U is affine, U(v) = αv + β for every v ∈ \mathbb{R} where α > 0 and β ∈ \mathbb{R} are constants. Then ϱ(A_0) = E(X). If the variations of X can be regarded as small from the viewpoint of the company then U can be approximated by an affine function on the range of X. Then we can expect that ϱ(A_0) is close to E(X) so that there are good prospects to determine the premium such that both sides agree on the contract.

Further references: Sundt (1984), Chapter 11.

References

Bingham, N., C. Goldie, and J. Teugels (1987). Regular variation. Cambridge: Cambridge Univ. Press.

Bucklew, J. A., P. Ney, and J. S. Sadowsky (1990). Monte Carlo simulation and large deviations theory for uniformly recurrent Markov chains. J. Appl. Prob. 27, 44–59.

Dembo, A. and O. Zeitouni (1998). Large Deviations Techniques and Applications (2nd ed.). Berlin: Springer-Verlag.

Denuit, M., J. Dhaene, M. Goovaerts, and R. Kaas (2005). Actuarial theory for dependent risks. Wiley.

Ellis, R. S. (1984). Large deviations for a general class of random vectors. Ann. Probab. 12, 1–12.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications (2nd ed.), Volume II. New York: John Wiley and Sons.

Gärtner, J. (1977). On large deviations from the invariant measure. Theory Prob. Appl. 22, 24–39.

Iscoe, I., P. Ney, and E. Nummelin (1985). Large deviations of uniformly recurrent Markov additive processes. Adv. in Appl. Math. 6, 373–412.

Karlin, S. and H. Taylor (1975). A First Course in Stochastic Processes. Academic Press Inc.

Mack, T. (1993). Distribution-free calculation of the standard error of chain ladder estimates. Astin Bulletin 23, 213–225.

Martin-Löf, A. (1983). Estimates for ruin probabilities. In A. Gut and L. Holst (Eds.), Probability and Mathematical Statistics, pp. 129–139. Uppsala University: Dept. of Mathematics.

Martin-Löf, A. (1986). Entropy, a useful concept in risk theory. Scand. Actuarial J., 223–235.

Mikosch, T. (2004). Non-Life Insurance Mathematics. Berlin: Springer-Verlag.

Norberg, R. (1993). Prediction of outstanding liabilities in non-life insurance. Astin Bulletin 23, 95–115.

Rantala, J. (1984). An application of stochastic control theory to insurance business. PhD Thesis. University of Tampere.

Ruohonen, M. (1988). Proceedings of international congress of actuaries, part 4. Helsinki.

Sundt, B. (1984). An Introduction to Non-life Insurance Mathematics. Karlsruhe: Verlag Vers. GmbH.