
Appendix A Smoothing Techniques

A.1 Smoothing Splines

Splines provide a nonlinear and smooth fit to a unidimensional or multivariate scatter of points. Spline smoothing can be regarded as a nonlinear and non-parametric regression model. We assume that at each point x_i (independent variable) we observe y_i (dependent variable), i = 1, \ldots, n, and we are interested in seeking a nonlinear relationship linking y to x of the form:

y = f(x) + \varepsilon, \qquad (A.1)

for which the objective is to estimate f(.). The task would of course be easier if we knew the general form of the function f(.). In practice, however, this information is very seldom available. Spline smoothing takes f(.) to be a polynomial. One of the most familiar polynomial smoothers is the cubic spline, which corresponds to the case of a piece-wise cubic function, i.e.

f(x) = f_i(x) = a_i + b_i x + c_i x^2 + d_i x^3 \quad \textrm{for } x_i \le x \le x_{i+1}, \qquad (A.2)

for i = 1, \ldots, n-1. In addition, to obtain smoothness the first two derivatives are assumed to be continuous, i.e.

\frac{d^\alpha}{dx^\alpha} f_i(x_i) = \frac{d^\alpha}{dx^\alpha} f_{i-1}(x_i), \qquad (A.3)

for \alpha = 0, 1, and 2. The constraints given by Eq. (A.2) and Eq. (A.3) lead to a smooth function. However, the problem is not closed, and we need extra conditions. The problem is normally simplified by minimising the quantity \sum_{i=1}^{n} (y_i - f(x_i))^2 with a smoothness condition that takes the form of an integral of the second derivative squared. The functional to be minimised is


F = \sum_{k=1}^{n} \left[ y_k - f(x_k) \right]^2 + \lambda \int \left( \frac{d^2 f(x)}{dx^2} \right)^2 dx. \qquad (A.4)

The first part of Eq. (A.4) is a measure of the goodness of fit, whereas the second part provides a measure of the overall smoothness. In the theory of elastic rods the latter term is proportional to the energy of the rod when it is bent under constraints. Note that the functional F(.), Eq. (A.4), can be extended to two dimensions, and the final surface will behave like a smooth plate. The function F in Eq. (A.4) is known as the penalised residual sum of squares. Also, in Eq. (A.4) \lambda represents the smoothing parameter and controls the relative weight given to the roughness penalty and the goodness of fit. It therefore controls the balance between goodness of fit and smoothness. For example, the larger the parameter \lambda the smoother the function f.

Remark Note that if \varepsilon = 0 in Eq. (A.1) the spline simply interpolates the data. This means that the spline solves Eq. (A.4) with \lambda \to 0, which is equivalent to

\min \int \left( f''(x) \right)^2 dx \quad \textrm{subject to } y_k = f(x_k), \; k = 1, \ldots, n. \qquad (A.5)

Equation (A.4) can be extended to higher dimensions by replacing the second derivative roughness measure \int (f'')^2 by higher-order derivatives with respect to all coordinates to ensure consistency with all directions. In this case the quantity to be minimised takes the form

\sum_{j=1}^{n} \left( y_j - f(\mathbf{x}_j) \right)^2 + \lambda \int_{R^m} \sum_{k_1 + \ldots + k_m = k} \left( \frac{\partial^k}{\partial x_1^{k_1} \ldots \partial x_m^{k_m}} f(\mathbf{x}) \right)^2 d\mathbf{x}, \qquad (A.6)

where k is a fixed positive integer. The obtained solution is known as the thin-plate spline. The solution is generally obtained as a linear combination of the \binom{m+k-1}{m} monomials of degree less than k and a set of n radial basis functions (Wahba 1990).

The minimisation of Eq. (A.4), when \lambda is known, yields the cubic spline. The determination of \lambda, however, is more important since it controls the smoothness of the fitting. One way to obtain an appropriate estimate of \lambda is to use cross-validation. The idea behind cross-validation is to have estimates that minimise the effect of omitted observations. If f_{\lambda,k}(x) is an estimate of the spline with parameter \lambda when the kth observation is omitted, the mis-fit at the point x_k is given by \left( y_k - f_{\lambda,k}(x_k) \right)^2. The best value of \lambda is the one that minimises the cross-validation criterion

\sum_{k=1}^{n} w_k \left( y_k - f_{\lambda,k}(x_k) \right)^2,

where w_k, k = 1, \ldots, n, is a set of weights, see Sect. A.1.2.
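As a simple numerical illustration, a cubic smoothing spline can be fitted, for example, with SciPy's UnivariateSpline. Note that its smoothing argument s is an upper bound on the residual sum of squares rather than the \lambda of Eq. (A.4) itself, so the sketch below (with illustrative data) is only a rough stand-in for the penalised formulation.

import numpy as np
from scipy.interpolate import UnivariateSpline

# Noisy observations y_i = f(x_i) + eps_i at n points
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 100)
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

# Cubic smoothing splines (k = 3); s bounds sum (y_i - f(x_i))^2,
# so a larger s plays a role analogous to a larger lambda in Eq. (A.4).
rough = UnivariateSpline(x, y, k=3, s=1.0)    # close to interpolation
smooth = UnivariateSpline(x, y, k=3, s=20.0)  # heavier smoothing

xf = np.linspace(0.0, 10.0, 500)
f_rough, f_smooth = rough(xf), smooth(xf)
print(np.sum((y - rough(x)) ** 2), np.sum((y - smooth(x)) ** 2))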

A.1.1 More on Smoothing Splines

Originally, splines were used as a way to smoothly interpolate the set of points (x_k, y_k), k = 1, \ldots, n, where x_1 < x_2 < \ldots < x_n.

\min \; \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \mu \int_a^b \left( f^{(m)}(x) \right)^2 dx \qquad (A.7)

for some \mu > 0, and over the set of functions with 2m-2 continuous derivatives over [a,b], C^{2m-2}([a,b]).^1 The solution is a piece-wise polynomial of degree 2m-1 inside each interval [x_i, x_{i+1}], i = 1, \ldots, n-1, and of degree m-1 inside the outer intervals [a, x_1] and [x_n, b].

In general, the smoothing spline can be formulated as a regularisation problem (Tikhonov 1963; Morozov 1984). Given a set of points \mathbf{x}_1, \ldots, \mathbf{x}_n in R^d and n numbers y_1, \ldots, y_n, we seek a smooth function f(\mathbf{x}) from R^d into R that best fits the data (\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n). This problem is normally solved by seeking to minimise the functional:

F(f) = \sum_{i=1}^{n} \left( y_i - f(\mathbf{x}_i) \right)^2 + \mu \| L(f) \|^2. \qquad (A.8)

Note that the first part measures the mis-fit and the second part is a penalty measuring the smoothness of the function f(.). The operator L is in general a differential operator, and \mu is the smoothing parameter, which controls the trade-off between the two attributes. By computing F(f + \delta f) - F(f), where \delta f is a small “departure” from f, the stationary solutions of Eq. (A.8) can be shown to satisfy

\mu \, L^* L (f)(\mathbf{x}) = \sum_{i=1}^{n} \left[ y_i - f(\mathbf{x}) \right] \delta(\mathbf{x} - \mathbf{x}_i), \qquad (A.9)

where L^* is the adjoint (Appendix F) of L.

^1 The space over which Eq. (A.7) is defined is normally referred to as the Sobolev space of functions defined over [a,b] with m-1 absolutely continuous derivatives and satisfying \int_a^b \left( f^{(m)}(x) \right)^2 dx < \infty.

Exercise Derive Eq. (A.9).

Hint Using the L_2 norm, for which \| L(f) \|^2 = \langle L(f), L(f) \rangle, we obtain, after discarding the second-order terms in \delta f,

F(f + \delta f) - F(f) = -2 \sum_{i=1}^{n} \left[ (y_i - f(x_i)) \, \delta f(x_i) \right] + 2 \mu \langle L(f), L(\delta f) \rangle.

The solution is then obtained from the stationary points of F, i.e. from the derivative F'(f):

F'(f).v = -2 \sum_{i=1}^{n} \left[ y_i - f(x_i) \right] v(x_i) + 2 \mu \langle L(f), L(v) \rangle = -2 \left\langle \sum_{i=1}^{n} (y_i - f(x_i)) \, \delta(x - x_i), v \right\rangle + 2 \mu \langle L^* L(f), v \rangle.

The solution to Eq. (A.9) can be expressed in the form of an integral as

f(\mathbf{x}) = \frac{1}{\mu} \int G(\mathbf{x}, \mathbf{y}) \left( \sum_{i=1}^{n} (y_i - f(\mathbf{y})) \, \delta(\mathbf{y} - \mathbf{x}_i) \right) d\mathbf{y},

where G(\mathbf{x}, \mathbf{y}) is the Green's function of the operator L^* L (see Sect. A.3 below); hence,

f(\mathbf{x}) = \frac{1}{\mu} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i)) \, G(\mathbf{x}, \mathbf{x}_i) = \sum_{i=1}^{n} \mu_i \, G(\mathbf{x}, \mathbf{x}_i). \qquad (A.10)

The coefficients \mu_j, j = 1, \ldots, n, are computed by applying this expression at \mathbf{x} = \mathbf{x}_j, i.e. y_j = \sum_{i=1}^{n} \mu_i G_{ji}, or

Gμ = y, (A.11)

where \mathbf{G} = (G_{ij}) = \left( G(\mathbf{x}_i, \mathbf{x}_j) \right), \boldsymbol{\mu} = (\mu_1, \ldots, \mu_n)^T and \mathbf{y} = (y_1, \ldots, y_n)^T. Note that Eq. (A.11) can be extended to include a homogeneous solution p(\mathbf{x}) of the partial differential equation (PDE) L^* L, to yield

f(\mathbf{x}) = \sum_{i=1}^{n} \mu_i G(\mathbf{x}, \mathbf{x}_i) + p(\mathbf{x})

with L^* L (p) = 0. Examples of functions p(.) are given below. The popular thin-plate spline corresponds to the case where the differential operator is an extension of that given in Eq. (A.7) to the multidimensional space, i.e.

\| L(f) \|^2 = \sum_{k_1 + \ldots + k_d = m} \frac{m!}{k_1! \ldots k_d!} \int_{R^d} \left( \frac{\partial^m}{\partial x_1^{k_1} \ldots \partial x_d^{k_d}} f(\mathbf{x}) \right)^2 d\mathbf{x}, \qquad (A.12)

and where the functional or energy to be minimised takes the form

F(f) = \sum_{k=1}^{n} (y_k - f(\mathbf{x}_k))^2 + \lambda \| L f \|^2 \qquad (A.13)

for a fixed positive integer m. The function f(\mathbf{x}) used in Eq. (A.12) or Eq. (A.13) is of class C^m, i.e. with (m-1) continuous derivatives, and the mth derivative satisfies \| L(f) \| < \infty. The corresponding L^* L operator is invariant under translation and rotation; hence, the corresponding Green's function G(\mathbf{x}, \mathbf{y}) is a radial function and satisfies

(-1)^m \Delta^m G(\mathbf{x}) = \delta(\mathbf{x}), \qquad (A.14)

whose solution, see e.g. Gelfand and Vilenkin (1964), is a thin-plate spline, i.e.

G(\mathbf{x}) = \begin{cases} \| \mathbf{x} \|^{2m-d} \log \| \mathbf{x} \| & \textrm{for } 2m-d > 0, \textrm{ and } d \textrm{ is even} \\ \| \mathbf{x} \|^{2m-d} & \textrm{when } d \textrm{ is odd.} \end{cases} \qquad (A.15)

The general thin-plate spline is then given by

f(\mathbf{x}) = \sum_{j=1}^{n} \mu_j G\left( \| \mathbf{x} - \mathbf{x}_j \| \right) + p(\mathbf{x}), \qquad (A.16)

where p(\mathbf{x}) is a polynomial of degree m-1 that can be expressed as a linear combination of l = \binom{d+m-1}{d} monomials in R^d of degree less than m, i.e. p(\mathbf{x}) = \sum_{k=1}^{l} \lambda_k p_k(\mathbf{x}). The parameters \mu_j, j = 1, \ldots, n, and \lambda_k, k = 1, \ldots, l, are obtained by taking f(\mathbf{x}_j) = y_j and imposing further conditions on the polynomial p(\mathbf{x}) in order to close the problem. This is a well known radial basis function problem.

Noisy data have not been explicitly mentioned, but the formulation given in Eq. (A.7) takes account of an uncorrelated noise (see Sect. A.3) when the interpolation is not exact. If the noise is autocorrelated, e.g.

y_i = f(\mathbf{x}_i) + \varepsilon_i

for i = 1, \ldots, n, with zero-mean multinormal noise with E\left( \boldsymbol{\varepsilon} \boldsymbol{\varepsilon}^T \right) = \mathbf{W}, where \boldsymbol{\varepsilon} = (\varepsilon_1, \ldots, \varepsilon_n)^T, then the functional to be minimised takes the form of a penalised likelihood:

F(f) = (\mathbf{y} - \mathbf{f})^T \mathbf{W}^{-1} (\mathbf{y} - \mathbf{f}) + \mu \| L(f) \|^2,

where \mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^T. In the case of the thin-plate spline, where \| L(f) \|^2 is given by Eq. (A.12), the solution is similar to Eq. (A.16) and the parameters \boldsymbol{\mu} = (\mu_1, \ldots, \mu_l)^T and \boldsymbol{\lambda} = (\lambda_1, \ldots, \lambda_n)^T are the solution of a linear system of the form:

\begin{pmatrix} \mathbf{G} + n\mu \mathbf{W} & \mathbf{P} \\ \mathbf{P}^T & \mathbf{O} \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{y} \\ \mathbf{0} \end{pmatrix},

where \mathbf{G} = (G_{ij}) = \left( G(\| \mathbf{x}_i - \mathbf{x}_j \|) \right) for i, j = 1, \ldots, n, and \mathbf{P} = (P_{ik}) = (p_k(\mathbf{x}_i)) for i = 1, \ldots, n and k = 1, \ldots, l.

A.1.2 Choice of the Smoothing Parameter

So far the parameter \mu was assumed to be fixed but unknown. One way to deal with the problem would be to choose an arbitrary value based on experience. A more concise way is to compute it from the data using an elegant procedure known as cross-validation, see also Chap. 15. Suppose that one would like to solve the problem given in Eq. (A.7) and would like to have an optimal estimate of \mu. The idea of cross-validation is to take one or more points out of the sample and find the value of \mu that minimises the mis-fit. Suppose in fact that x_k was taken out. Then the spline f_\mu^{(k)}(.) that fits the remaining data minimises the functional:

\frac{1}{n} \sum_{i=1, i \ne k}^{n} \left[ y_i - f(x_i) \right]^2 + \mu \int_a^b \left( f^{(m)}(t) \right)^2 dt. \qquad (A.17)

The overall optimal value of \mu is the one that minimises the overall mis-fit or cross-validation:

c_v(\mu) = \frac{1}{n} \sum_{k=1}^{n} \left( y_k - f_\mu^{(k)}(x_k) \right)^2. \qquad (A.18)

Let us designate by f_\mu(x) the spline function fitted to the whole sample for a given \mu. Let also \mathbf{A}(\mu) = \left( a_{ij}(\mu) \right), i, j = 1, \ldots, n, be the matrix relating \mathbf{y} = (y_1, \ldots, y_n)^T to \mathbf{f}_\mu = \left( f_\mu(x_1), \ldots, f_\mu(x_n) \right)^T, i.e. satisfying \mathbf{A}(\mu) \mathbf{y} = \mathbf{f}_\mu. Then Craven and Wahba (1979) have shown that

y_k - f_\mu^{(k)}(x_k) = \frac{y_k - f_\mu(x_k)}{1 - a_{kk}(\mu)},

and therefore,

c_v(\mu) = \frac{1}{n} \sum_{k=1}^{n} \left( \frac{y_k - f_\mu(x_k)}{1 - a_{kk}(\mu)} \right)^2. \qquad (A.19)

The generalised cross-validation is obtained by substituting a(\mu) = \frac{1}{n} \mathrm{tr}\left( \mathbf{A}(\mu) \right) for a_{kk}(\mu) to yield

c_v(\mu) = n \, \frac{\left\| \left( \mathbf{I} - \mathbf{A}(\mu) \right) \mathbf{y} \right\|^2}{\left[ \mathrm{tr}\left( \mathbf{I} - \mathbf{A}(\mu) \right) \right]^2}. \qquad (A.20)

Then, Eq. (A.19) or Eq. (A.20) is minimised to yield an optimal value of \mu.
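As an illustration of generalised cross-validation, the following Python sketch uses a discrete second-difference roughness penalty in place of the integral in Eq. (A.4); the resulting smoother matrix A(\mu) = (I + \mu D^T D)^{-1} is an assumption standing in for the spline hat matrix, and the data are synthetic.

import numpy as np

rng = np.random.default_rng(1)
n = 80
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

# Second-difference operator D: (D f)_i = f_i - 2 f_{i+1} + f_{i+2}
D = np.diff(np.eye(n), n=2, axis=0)

def smoother_matrix(mu):
    # Penalised least squares: minimise ||y - f||^2 + mu ||D f||^2,
    # giving f = A(mu) y with A(mu) = (I + mu D^T D)^{-1}.
    return np.linalg.inv(np.eye(n) + mu * D.T @ D)

def gcv(mu):
    # Generalised cross-validation score, Eq. (A.20)
    A = smoother_matrix(mu)
    resid = (np.eye(n) - A) @ y
    return n * np.sum(resid ** 2) / np.trace(np.eye(n) - A) ** 2

mus = np.logspace(-4, 4, 50)
scores = [gcv(m) for m in mus]
mu_opt = mus[int(np.argmin(scores))]
print("GCV-optimal mu:", mu_opt)
f_hat = smoother_matrix(mu_opt) @ y   # smoothed values at the sample points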

A.2 Radial Basis Functions

A.2.1 Exact Interpolation

Radial basis functions (RBFs) constitute one of the attractive tools to interpolate and/or smooth scattered data. RBFs were formally introduced and coined by Powell (1987) in exact multivariate interpolation, although the technique was around before that, see e.g. Hardy (1971) and Franke (1982).

Given n distinct points \mathbf{x}_i, i = 1, \ldots, n, in R^d,^2 and n real numbers f_i, i = 1, \ldots, n, these numbers can be regarded as the values at \mathbf{x}_i of a certain unknown function f(\mathbf{x}). The problem of RBF is to find a smooth real function s(\mathbf{x}) satisfying the interpolation conditions:

s(\mathbf{x}_i) = f_i, \quad i = 1, \ldots, n, \qquad (A.21)

and such that the interpolating function is of the form

s(\mathbf{x}) = \sum_{k=1}^{n} \lambda_k \phi_k(\mathbf{x}) = \sum_{k=1}^{n} \lambda_k \phi\left( \| \mathbf{x} - \mathbf{x}_k \| \right). \qquad (A.22)

The functions \phi_k(\mathbf{x}) = \phi\left( \| \mathbf{x} - \mathbf{x}_k \| \right) are known as radial basis functions. The real function \phi(.) is defined over the positive numbers, and \| . \| is any Euclidean norm or Mahalanobis distance. Thus the radial basis function s(\mathbf{x}) is a simple linear combination of the shifted radial basis functions \phi_k(\mathbf{x}).

Examples of Radial Basis Functions

• \phi(r) = r^k, for a positive integer k. The cases k = 1, 2, and 3 correspond, respectively, to the linear, quadratic and cubic RBF.
• \phi(r) = \left( r^2 + c \right)^{1/2} for c > 0, which corresponds to the multiquadratic case.
• \phi(r) = e^{-a r^2} for a > 0 is the Gaussian RBF.
• \phi(r) = r^2 \log r, which is the thin-plate spline.

^2 Generally known as nodes, knots, or centres of interpolation.

• \phi(r) = \left( 1 + r^2 \right)^{-1}, which corresponds to the inverse quadratic case.

Equations (A.21) and (A.22) lead to the following linear system:

Aλ = f, (A.23)

where \mathbf{A} = (a_{ij}) = \left( \phi( \| \mathbf{x}_i - \mathbf{x}_j \| ) \right), i, j = 1, \ldots, n, and \mathbf{f} = (f_1, \ldots, f_n)^T. The more general RBF interpolation problem is obtained by extending Eq. (A.22) to yield

s(\mathbf{x}) = p_m(\mathbf{x}) + \sum_{k=1}^{n} \lambda_k \phi\left( \| \mathbf{x} - \mathbf{x}_k \| \right), \qquad (A.24)

where p_m(\mathbf{x}) is a low-order polynomial of degree at most m in R^d. Apart from interpolation, RBFs constitute an attractive tool that can be used for various other purposes such as minimisation of multivariate functions, see e.g. Powell (1987) for a discussion, filtering and pattern recognition (Carr et al. 1997), and can also be used in PDEs and neural networks (Larsson and Fornberg 2003). Note also that RBFs arise naturally in other fields such as gravitation.^3

Because there are more parameters than constraints in the case of Eq. (A.24), further constraints are imposed, namely,

\sum_{j=1}^{n} \lambda_j p(\mathbf{x}_j) = 0 \qquad (A.25)

for all polynomials p(\mathbf{x}) of degree at most m. Apart from introducing more equations, the system of Eq. (A.25) can be used to measure the smoothness of the RBF (Powell 1990). It also controls the rate of growth at infinity of the non-polynomial part of s(\mathbf{x}) in Eq. (A.24) (Beatson et al. 1999). If (p_1, \ldots, p_l), with l = \binom{m+d}{d} = \frac{(m+d)!}{m! \, d!}, is a basis of the space of algebraic polynomials of degree less than or equal to m in R^d, then Eq. (A.25) becomes

^3 For example, in the N-body problem, the gravitational potential at a point \mathbf{y} takes the form

\phi(\mathbf{y}) = \sum_{i=1}^{N} \frac{\alpha_i}{\| \mathbf{x}_i - \mathbf{y} \|}.

Similarly, the heat equation \frac{\partial h}{\partial t} - \nabla^2 h = 0, with initial condition h(0, \mathbf{x}) = g(\mathbf{x}), has, for t > 0, the solution h(t, \mathbf{x}) = (4\pi t)^{-3/2} \int e^{-\frac{\| \mathbf{x} - \mathbf{y} \|^2}{4t}} g(\mathbf{y}) \, d\mathbf{y}, which looks like Eq. (A.22) when it is discretised.

\sum_{j=1}^{n} \lambda_j p_k(\mathbf{x}_j) = 0, \quad k = 1, \ldots, l. \qquad (A.26)

Note also that p_m(\mathbf{x}) in Eq. (A.24) can be substituted for the combination \sum_{k=1}^{l} \mu_k p_k(\mathbf{x}), which, when combined with Eq. (A.26), yields the following system:

\mathbf{A} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{A} & \mathbf{P} \\ \mathbf{P}^T & \mathbf{O} \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{f} \\ \mathbf{0} \end{pmatrix}, \qquad (A.27)

where \mathbf{P} = (p_{ij}) = (p_j(\mathbf{x}_i)), i = 1, \ldots, n, j = 1, \ldots, l.

The next important question in RBF concerns the invertibility of the systems of Eq. (A.23) and Eq. (A.27). For many choices of \phi(.), the matrix \mathbf{A} in Eq. (A.23) is invertible. For example, for the linear and multiquadratic cases, \mathbf{A} is always invertible for every n and d, provided the points are distinct (Michelli 1986). For the quadratic case, where s(\mathbf{x}) becomes quadratic in \mathbf{x}, \mathbf{A} becomes singular if the number of points n exceeds the dimension of the space of quadratic polynomials, i.e. n > \frac{1}{2}(d+1)(d+2), whereas for the cubic case \mathbf{A} can be singular if d \ge 2 but is always nonsingular in the unidimensional case.^4 Powell (1987) also gives further examples of nonsingularity such as \phi(r) = \left( r^2 + 1 \right)^{-\beta}, (\beta > 0). Consider now the extended interpolation in Eq. (A.24): the matrix \mathbf{A} in Eq. (A.27) is nonsingular only if the columns of \mathbf{P} are linearly independent. Michelli (1986) gives sufficient conditions for the invertibility of the system of Eq. (A.27) based on conditional positivity,^5 i.e. when \phi(r) is conditionally strictly positive. The previous two sufficient conditions allow for various choices of radial basis functions, such as \left( r^2 + a^2 \right)^{-\alpha} for \alpha > 0, and \left( r^2 + a^2 \right)^{\beta} for 0 < \beta < 1. For instance, the functions \phi_1(r) = r^3 and \phi_2(r) = \frac{r^2}{4} \log r have their second derivatives completely monotonic for r > 0. The functions \phi_1(r^2) = r^3 and \phi_2(r^2) = r^2 \log r can also be used as RBFs. The latter case corresponds to the thin-plate spline. Another case of nonsingularity was provided by Powell (1987) and corresponds to \phi(r) = \int_0^\infty e^{-x r^2} \psi(x) dx, where \psi(x) is non-negative with \int_a^b \psi(t) dt > 0 for some constants a and b. There is also another set of functions such as

\phi(r) = \begin{cases} r^{2(m+1)-d} & \textrm{for } d \textrm{ odd and } 2(m+1) > d \\ r^{2(m+1)-d} \log r^{2(m+1)-d} & \textrm{for } d \textrm{ even}, \end{cases}

where m is the largest degree of the polynomial included in s(\mathbf{x}).

Remark The system of Eq. (A.27) can be solved using SVD. Alternatively, one can define the n \times (n-l) matrix \mathbf{Q} whose columns span the orthogonal complement of the columns of \mathbf{P}. Hence \mathbf{P}^T \boldsymbol{\lambda} = \mathbf{0} yields a unique \boldsymbol{\gamma} such that \boldsymbol{\lambda} = \mathbf{Q} \boldsymbol{\gamma}. The first system from Eq. (A.27) then yields \mathbf{Q}^T \mathbf{A} \mathbf{Q} \boldsymbol{\gamma} = \mathbf{Q}^T \mathbf{f}, which is invertible since \mathbf{Q} is full rank and \mathbf{A} is strictly conditionally positive definite of order m+1. The vector \boldsymbol{\mu} is then obtained from \mathbf{P} \boldsymbol{\mu} = \mathbf{f} - \mathbf{A} \mathbf{Q} \boldsymbol{\gamma}. A possible choice of \mathbf{Q} is given by Beatson et al. (2000), namely

\mathbf{Q} = - \begin{pmatrix} p_{1,l+1} & p_{1,l+2} & \ldots & p_{1n} \\ \vdots & & & \vdots \\ p_{l,l+1} & p_{l,l+2} & \ldots & p_{ln} \\ 1 & 0 & \ldots & 0 \\ & \ddots & & \\ 0 & 0 & \ldots & 1 \end{pmatrix},

see also Beatson et al. (1999) for fast fitting and evaluation of RBFs.

^4 In this case the matrix is nonsingular for all \phi(r) = r^{2\alpha+1} (\alpha a positive integer), and the interpolation function s(\mathbf{x}) = \sum_{i=1}^{n} \lambda_i | x - x_i |^{2\alpha+1} is a spline function.

^5 A real function \phi(r) defined on the set of non-negative real numbers is conditionally (strictly) positive definite of order m+1 if for any distinct points \mathbf{x}_1, \ldots, \mathbf{x}_n and scalars \lambda_1, \ldots, \lambda_n satisfying Eq. (A.26) the quadratic form \boldsymbol{\lambda}^T \boldsymbol{\Phi} \boldsymbol{\lambda} = \sum_{ij} \lambda_i \phi\left( \| \mathbf{x}_i - \mathbf{x}_j \| \right) \lambda_j is non-negative (positive). The “conditionally” in the definition refers to Eq. (A.26). The set of conditionally positive definite functions of order m has been characterised by Michelli (1986). If a continuous function \phi(.) defined on the set of non-negative real numbers is such that (-1)^k \frac{d^k}{dr^k} \phi(r) is completely monotonic, then \phi(r^2) is conditionally positive definite of order k. Note that if (-1)^k \frac{d^k}{dr^k} \phi(r) \ge 0 for all positive integers k, then \phi(r) is said to be completely monotonic. The following important result is also given in Michelli (1986): if the continuous and positive function \phi(r) defined on the set of non-negative numbers has its first derivative completely monotonic (not constant), then for any distinct points \mathbf{x}_1, \ldots, \mathbf{x}_n,

(-1)^{n-1} \det\left( \phi\left( \| \mathbf{x}_i - \mathbf{x}_j \|^2 \right) \right) > 0.

Example: Thin-Plate Spline

In this case we have \phi(r) = r^2 \log r and s(\mathbf{x}) = p(\mathbf{x}) + \sum_{k=1}^{n} \lambda_k \phi\left( \| \mathbf{x} - \mathbf{x}_k \| \right) with p(\mathbf{x}) = \mu_0 + \mu_1 x + \mu_2 y. The matrix \mathbf{P} in this case is given by

\mathbf{P} = \begin{pmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ \vdots & \vdots & \vdots \\ 1 & x_n & y_n \end{pmatrix}.

Note that thin-plate (or biharmonic) splines serve to model the deformation of an infinite thin plate (Bookstein 1989) and are C^1 functions that minimise the energy

E(s) = \int_{R^2} \left[ \left( \frac{\partial^2 s}{\partial x^2} \right)^2 + 2 \left( \frac{\partial^2 s}{\partial x \partial y} \right)^2 + \left( \frac{\partial^2 s}{\partial y^2} \right)^2 \right] dx \, dy.
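The following Python sketch assembles and solves the block system of Eq. (A.27) for the two-dimensional thin-plate spline \phi(r) = r^2 \log r with the linear polynomial part p(\mathbf{x}) = \mu_0 + \mu_1 x + \mu_2 y; the data and function names are illustrative only.

import numpy as np

def tps_phi(r):
    # Thin-plate spline radial function phi(r) = r^2 log r, with phi(0) = 0
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

def fit_tps(xy, f):
    # Solve the system of Eq. (A.27): [A P; P^T 0][lambda; mu] = [f; 0]
    n = xy.shape[0]
    r = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    A = tps_phi(r)
    P = np.column_stack([np.ones(n), xy])          # p(x) = mu0 + mu1 x + mu2 y
    M = np.block([[A, P], [P.T, np.zeros((3, 3))]])
    rhs = np.concatenate([f, np.zeros(3)])
    sol = np.linalg.solve(M, rhs)
    return sol[:n], sol[n:]                        # lambda, mu

def eval_tps(xy_new, xy, lam, mu):
    r = np.linalg.norm(xy_new[:, None, :] - xy[None, :, :], axis=-1)
    return tps_phi(r) @ lam + mu[0] + xy_new @ mu[1:]

rng = np.random.default_rng(2)
xy = rng.uniform(0.0, 1.0, size=(30, 2))
f = np.sin(2 * np.pi * xy[:, 0]) * np.cos(2 * np.pi * xy[:, 1])
lam, mu = fit_tps(xy, f)
print(np.allclose(eval_tps(xy, xy, lam, mu), f))   # exact interpolation at the nodes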

A.2.2 RBF and Noisy Data

In the previous section the emphasis was on exact interpolation where the fitted function goes exactly through the data (xi,fi). This corresponds to the case when the data are noise-free. If the data are contaminated with noise, then instead of the condition given by Eq. (A.21) we seek a function s(x) that minimises the functional

\frac{1}{n} \sum_{i=1}^{n} \left[ s(\mathbf{x}_i) - f_i \right]^2 + \rho \| s \|^2, \qquad (A.28)

where the penalty function, given by \rho \| s \|^2, provides a measure of the smoothness of s(\mathbf{x}) and \rho \ge 0 is the smoothing parameter. Equation (A.12) provides an example of such a norm, which is used in thin-plate splines (Cox 1984; Wahba 1979; Craven and Wahba 1979).

Remark If we use the semi-norm in Eq. (A.12) with d = 3 and m = 2, the solution to Eq. (A.28) is given (Wahba 1990) by

s(\mathbf{x}) = p(\mathbf{x}) + \sum_{i=1}^{n} \lambda_i \| \mathbf{x} - \mathbf{x}_i \|,

where p(\mathbf{x}) is a polynomial of degree 1, i.e. p(\mathbf{x}) = \mu_0 + \mu_1 x_1 + \mu_2 x_2 + \mu_3 x_3, and the coefficients are given by the linear system:

\begin{pmatrix} \mathbf{A} - 8 n \pi \rho \mathbf{I} & \mathbf{P} \\ \mathbf{P}^T & \mathbf{O} \end{pmatrix} \begin{pmatrix} \boldsymbol{\lambda} \\ \boldsymbol{\mu} \end{pmatrix} = \begin{pmatrix} \mathbf{f} \\ \mathbf{0} \end{pmatrix},

where \mathbf{A} and \mathbf{P} are defined as in Eq. (A.27).

A.2.3 Relation to PDEs and Other Techniques

In many problems in mathematical physics, one seeks to solve the following PDE:

Lu = f, \qquad (A.29)

in a domain D within R^d under specific boundary conditions, where L is a differential operator. The Green's function^6 G of the operator L is the (generalised) function satisfying

L G(\mathbf{x}, \mathbf{y}) = \delta(\mathbf{x} - \mathbf{y}), \qquad (A.30)

where \delta is the Dirac (or impulse) function. The solution to Eq. (A.29) is then given by the following convolution:

u(\mathbf{x}) = \int_D f(\mathbf{y}) G(\mathbf{x}, \mathbf{y}) \, d\mathbf{y} + p(\mathbf{x}), \qquad (A.31)

where p(\mathbf{x}) is a solution to the homogeneous equation Lp = 0. Note that Eq. (A.31) is to be compared with Eq. (A.24). In fact, if there is an operator L satisfying L\phi(\mathbf{x} - \mathbf{x}_i) = \delta(\mathbf{x} - \mathbf{x}_i) and also L p_m(\mathbf{x}) = 0, then clearly the RBF \phi(r) is the Green's function of the differential operator L and the radial basis function s(\mathbf{x}) given by Eq. (A.24) is the solution to

Lu = \sum_{k=1}^{n} \lambda_k \delta(\mathbf{x} - \mathbf{x}_k).

As such, it is possible to use a PDE solver to solve an interpolation problem (see e.g. Press et al. (1992)). In general, given \phi, the operator L can be determined using filtering techniques from time series.

RBF interpolation is also related to kriging. For example, when p_m(\mathbf{x}) = 0 in Eq. (A.24), the equations are similar to those of kriging, where \phi plays the role of an (isotropic) covariance function. The relation to splines has also been outlined. For example, when the radial function \phi(.) is cubic or thin-plate spline, then we have a spline interpolant function. In this case the function minimises the bending energy of an infinite thin plate in two dimensions, see Poggio and Girosi (1990) for a review. For instance, if \phi(r) = r^{2m+1} (m a positive integer), then the function s(x) = \sum_{i=1}^{n} \lambda_i | x - x_i |^{2m+1} + \sum_{k=1}^{m} \mu_k x^k is a natural spline.

6The Green’s function G depends only on the operator L and has various properties. For example, if L is self-adjoint, then G is symmetric. If L is invariant under translation, then G(x, y) = G(x−y), and if L is also invariant under rotation, then G is a radial function, i.e. G(x, y) = G(x − y). A Smoothing Techniques 465

A.3 Kernel Smoother

This is a kind of local average smoother where a weighted average is obtained around each target point. Unlike linear smoothers, the kernel smoother uses a particular function K(.) known as the kernel. Given the data points (x_j, y_j), j = 1, \ldots, n, for each target point x_i the weighted average \hat{y}_i is obtained by

\hat{y}_i = \sum_{j=1}^{n} K_{ij} y_j, \qquad (A.32)

where the weights are given by

K_{ij} = \left[ \sum_{m=1}^{n} K\left( \frac{x_i - x_m}{b} \right) \right]^{-1} K\left( \frac{x_i - x_j}{b} \right).

Clearly the weights are non-negative and add up to one, i.e. K_{ij} \ge 0, and for each i, \sum_{j=1}^{n} K_{ij} = 1. The kernel function K(.) satisfies the following properties:

(1) K(t) \ge 0, for all t.
(2) \int_{-\infty}^{\infty} K(t) dt = 1.
(3) K(-t) = K(t) for all t.

Hence, K(.) is typically a symmetric probability density function. Note that the parameter b gives a measure of the size of the neighbourhood in the averaging process around each target point x_i. Basically, the parameter b controls the “width” of the function K\left( \frac{x}{b} \right). In the limit b \to 0, we get a Dirac function \delta_0. In this case the smoothed function is identical to the original scatter, i.e. \hat{y}_i = y_i. On the other hand, in the limit b \to \infty we get a uniform weight function, and the smoothed curve reduces to the mean, i.e. \hat{y}_i = \bar{y} = \frac{1}{n} \sum y_i. A familiar example of kernels is given by the Gaussian PDF:

K(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}.
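The following Python sketch implements the weighted-average smoother of Eq. (A.32) with a Gaussian kernel; the bandwidth values are illustrative and chosen only to show the two limiting regimes discussed above.

import numpy as np

def gaussian_kernel(t):
    return np.exp(-0.5 * t ** 2) / np.sqrt(2.0 * np.pi)

def kernel_smooth(x, y, b):
    # Weighted-average (kernel) smoother, Eq. (A.32):
    # raw weights K((x_i - x_j)/b), each row normalised to sum to one.
    W = gaussian_kernel((x[:, None] - x[None, :]) / b)
    K = W / W.sum(axis=1, keepdims=True)
    return K @ y

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0.0, 10.0, 200))
y = np.sin(x) + 0.3 * rng.standard_normal(x.size)

y_small_b = kernel_smooth(x, y, b=0.05)   # b -> 0: close to the raw scatter
y_large_b = kernel_smooth(x, y, b=50.0)   # b -> infinity: close to the overall mean
print(y_large_b[:3], y.mean())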

There are several other kernels used in the literature. The following are examples:

• Box kernel: K(x) = 1_{[-\frac{1}{2}, \frac{1}{2}]}(x).
• Triangle kernel: K(x) = \left( 1 - \frac{|t|}{a} \right) 1_{[-\frac{1}{a}, \frac{1}{a}]}(t) (for some a > 0).
• Parzen kernel:

K(x) = \begin{cases} 1 - 6 \left( \frac{x}{M} \right)^2 + 6 \left( \frac{|x|}{M} \right)^3 & \textrm{for } |x| \le \frac{M}{2} \\ 2 \left( 1 - \frac{|x|}{M} \right)^3 & \textrm{for } \frac{M}{2} \le |x| \le M \\ 0 & \textrm{for } |x| > M. \end{cases}

These kernels can also extend easily to the multidimensional case.

Appendix B Introduction to Probability and Random Variables

B.1 Background

Probability is a branch of mathematics that deals with chance, randomness or uncertainty. When tossing a coin, for example, one talks of the probability of getting head or tail. With an operator receiving phone calls at a telephone switchboard, one also talks about the probability of receiving a given number of phone calls within a given time interval. We also talk about the probability of having rain tomorrow at 13:00 at a given location. Games of chance also constitute other good examples involving probability. Games of chance are very old indeed, and it has been found that cubic dice were used by ancient Egyptians around 2000 BC (DeGroot and Shervish 2002). Probability calculus was popularised around the mid-seventeenth century by Blaise Pascal and Pierre de Fermat, and it was in 1933 that A. N. Kolmogorov axiomatised probability theory using set and measure theory (Kolmogorov 1933).

Despite its common use by most scientists, no unique interpretation of probability exists among scientists and philosophers. There are two main schools of thought:

(i) The frequentist school, led by von Mises (1928) and Reichenbach (1937), holds that the probability p of an event is the relative frequency k/n of the occurrence of that event in an infinite sequence of similar (and independent) trials, i.e.

p = \lim_{n \to \infty} \frac{k}{n},

where k is the number of times that event occurred in n trials.

(ii) The subjective or “Bayesian” school, which holds that the probability of an event is a subjective or personal judgement of the likelihood of that event. This interpretation goes back to Thomas Bayes (1763) and Pierre Simon


Laplace in the early nineteenth century (see Laplace 1951). This school argues that randomness is not an objectively measurable phenomenon but rather a “knowledge” phenomenon, i.e. it regards probability as an epistemological rather than an ontological concept.

Besides these two main schools, there is another one: the classical school, which interprets probability based on the concept of equally likely outcomes. According to this interpretation, when performing a random experiment, one can assign the same probability to events that are equally likely. This interpretation can be useful in practice, although it has a few difficulties such as how to define equally likely events before even computing their probabilities, and also how to define probabilities of events that are not equally likely.

B.2 Sets Theory and Probability

B.2.1 Elements of Sets Theory

Sets and Subsets

Let S be a set of elements s1,s2,...,sn, finite or infinite. Note that in the case of infinite sets one distinguishes two types: (1) countable sets whose elements can be counted using natural numbers 1, 2,...and (2) uncountable sets that are infinite but one cannot count their elements. The set of rational numbers is countable, whereas the set of real numbers is uncountable. A subset A of S, noted A ⊂ S, is a set whose elements belong to S. The empty set Ø and the set S itself are examples of (trivial) subsets.

Operations on Subsets

Given a set S and subsets A, B, and C, one can perform the following operations:

• Union—The union of A and B, noted A \cup B, is the subset containing the elements of A or B. It is clear that A \cup Ø = A, and A \cup S = S. Also, if A \subset B, then A \cup B = B. This definition can be extended to an infinite sequence of subsets A_1, A_2, \ldots to yield \cup_{k=1}^{\infty} A_k.
• Intersection—The intersection of two subsets A and B, noted A \cap B, is the set containing only the common elements of A and B. If no common elements exist, then A \cap B = Ø, and the two subsets are said to be mutually exclusive or disjoint. It can be seen that A \cap S = A and that if D \subset B then D \cap B = D. The definition also extends to an infinite sequence of subsets.
• Complements—The complement of A, noted A^c, is the subset of elements that are not in A. One has (A^c)^c = A; S^c = Ø; A \cup A^c = S and A \cap A^c = Ø.

B.2.2 Definition of Probability

Link to Sets Theory

An experiment involving different outcomes when it is repeated under similar conditions is a random experiment. For example, throwing a die yields in general^1 a random experiment. The outcome of a random experiment is called an event. The set of all possible outcomes of a random experiment is named the sample space S. For the case of the die, S = \{1, 2, 3, 4, 5, 6\}, and any subset of S is an event. For example, A = \{1, 3, 5\} corresponds to the event of odd outcomes. The empty subset Ø corresponds to the impossible event. A sample space is discrete if it is finite or countably infinite. All the operations on subsets mentioned above can be readily transferred to operations between events. For example, two events A and B are disjoint or mutually exclusive if A \cap B = Ø.

Definition/Axioms of Probability

Given a sample space S, a probability is a function defined on the events of S, assigning a number Pr(A) to each event A, satisfying the following properties (axioms):

(1) For any event A, 0 \le Pr(A) \le 1.
(2) Pr(S) = 1.
(3) Pr\left( \cup_{i=1}^{\infty} A_i \right) = \sum_{i=1}^{\infty} Pr(A_i), for any sequence of disjoint events A_1, A_2, \ldots.

Properties of Probability

• Direct consequences

(1) Pr(Ø) = 0.
(2) Pr(A^c) = 1 - Pr(A).
(3) Pr(A \cup B) = Pr(A) + Pr(B) - Pr(A \cap B).
(4) If A \subset B, then Pr(A) \le Pr(B).
(5) If A and B are exclusive, then Pr(A \cap B) = 0.

Exercise Derive the above properties.

Exercise Compute Pr(A \cup B \cup C).

Answer Pr(A) + Pr(B) + Pr(C) - Pr(A \cap B) - Pr(A \cap C) - Pr(B \cap C) + Pr(A \cap B \cap C).

^1 When the die is fair.

• Conditional Probability Given two events A and B, with Pr(B) > 0, the conditional probability of A given B, denoted by Pr(A|B), is defined by

Pr(A|B) = Pr(A∩ B)/Pr(B).

• Independence—Two events A and B are independent if and only if Pr(A \cap B) = Pr(A) Pr(B). This is equivalent to Pr(A|B) = Pr(A). This definition also extends to more than two independent events. As a consequence, one has the following property:

Pr(A|B) = Pr(B|A) Pr(A) / Pr(B).

Note the difference between independent and exclusive/disjoint events.

• Bayes theorem For n events B_1, \ldots, B_n forming a partition of the sample space S, i.e. mutually exclusive events whose union is S, and A any event, then

Pr(B_i | A) = \frac{Pr(B_i) Pr(A | B_i)}{\sum_{j=1}^{n} Pr(B_j) Pr(A | B_j)}.
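As a small numerical illustration of Bayes' theorem with a two-event partition (the numbers are made up):

# Hypothetical example: B1, B2 partition S (e.g. "rain" / "no rain"),
# and A is an observed event (e.g. "forecast says rain").
prior = {"B1": 0.3, "B2": 0.7}            # Pr(B_i)
likelihood = {"B1": 0.8, "B2": 0.1}       # Pr(A | B_i)

evidence = sum(prior[b] * likelihood[b] for b in prior)             # Pr(A) = 0.31
posterior = {b: prior[b] * likelihood[b] / evidence for b in prior}
print(posterior)   # {'B1': 0.774..., 'B2': 0.225...}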

B.3 Random Variables and Probability Distributions

Definition A random variable is a real valued function defined on a sample space S of a random experiment. A random variable is usually noted by a capital letter, e.g. X, Y or Z, and the values it takes by a lower case letter, e.g. x, y or z. Hence a random variable X assigns a value x to each outcome in S. Depending on the sample space, one can have either discrete or continuous random variables. Sometimes we can also have a mixed random variable. Here we mainly describe discrete and continuous random variables.

B.3.1 Discrete Probability Distributions

Let X be a discrete random variable taking discrete values x_1, \ldots, x_k, and p_j = Pr(X = x_j), j = 1, \ldots, k. Then the function

f(x) = Pr(X = x)

is the probability function of X. One immediately sees that \sum_{j=1}^{k} f(x_j) = \sum_{j=1}^{k} p_j = 1. The function F(x) defined by

F(x) = Pr(X \le x) = \sum_{x_i \le x} f(x_i)

is the cumulative distribution function (cdf) of X. The cdf of a discrete random variable is a piece-wise constant function between 0 and 1. Various other characteristics can be defined from X, which are included in the continuous case discussed below.

B.3.2 Continuous Probability Distributions

Definition Let X be a continuous random variable taking values in a continuous subset I of the real axis. The function f(x) defined by

Pr(a \le x \le b) = \int_a^b f(x) dx

for any interval [a,b] in I is the probability density function (pdf) of X. Hence the quantity f(x)dx represents the probability of the event x \le X \le x + dx, i.e. Pr(x \le X \le x + dx) = f(x)dx. The pdf satisfies the following properties:

(1) f(x) \ge 0 for all x.
(2) \int_{-\infty}^{\infty} f(x) dx = 1.

The cumulative distribution function of X is given by

F(x) = \int_{-\infty}^{x} f(u) du.

Remark Let X be a discrete random variable taking values x_1, \ldots, x_k, with probabilities p_1, \ldots, p_k. Designate by \delta_x(.) the Dirac impulse function, i.e. \delta_x(y) = 1 only if y = x, and zero otherwise. Then the probability function f(x) can be written as f(x) = \sum_{j=1}^{k} p_j \delta_{x_j}(x). Hence, by using the rule of integration of a Dirac impulse function, i.e.

\int_I \delta_x(y) g(y) dy = g(x) 1_I(x),

where 1_I(.) is the indicator function of the interval I, X can be analysed as if it were continuous.

Moments of a Random Variable

Let X be a continuous random variable with pdf f(.) and cdf F(.). The quantity

E(X) = \int x f(x) dx

is the expectation or first-order moment of X. Note that for a discrete random variable one obtains, using the above remark, E(X) = \sum x_i p_i. The kth-order moment of X is defined by m_k = E(X^k) = \int x^k f(x) dx. The centred kth-order moment is

\mu_k = E\left( (X - E(X))^k \right) = \int (x - \mu)^k f(x) dx.

The second-order centred moment \mu_2 is the variance, var(X) = \sigma^2, of X, and we have \sigma^2 = E(X^2) - E(X)^2. One can define addition and multiplication of two (or more) random variables over a sample space S and also multiply a random variable by a scalar. The expectation operator is a linear operator over the set of random variables on S, i.e. E(\lambda X + Y) = \lambda E(X) + E(Y). We also have var(\lambda X) = \lambda^2 var(X).

Cumulants

The (non-centred) moments μm, m = 1, 2,..., of a random variable X with pdf f(x), are defined by 

\mu_m = E\left( X^m \right) = \int x^m f(x) dx.

The centred moments are defined with respect to the centred random variable X - E(X). The characteristic function is given by

\phi(s) = E\left( e^{isX} \right) = \int e^{isx} f(x) dx,

and the moment generating function is given by

g(s) = E\left( e^{sX} \right) = \int e^{sx} f(x) dx.

We have, in particular, \mu_m = \frac{1}{i^m} \frac{d^m}{ds^m} \phi(s) \big|_{s=0}. The cumulant of order m of X, \kappa_m, is given by

\kappa_m = \frac{1}{i^m} \frac{d^m}{ds^m} \log\left( \phi(s) \right) \Big|_{s=0}.

For example, the third-order moment is the skewness, which provides a measure of the symmetry of the pdf (with respect to the mean when the centred moment is used),

and \kappa_3 = \mu_3 - 3 \mu_2 \mu_1 + 2 \mu_1^3. For the fourth-order cumulant, also called the kurtosis of the distribution, \kappa_4 = \mu_4 - 4 \mu_3 \mu_1 - 3 \mu_2^2 + 12 \mu_1^2 \mu_2 - 6 \mu_1^4. Note that for zero-mean distributions \kappa_4 = \mu_4 - 3 \mu_2^2. A distribution with zero kurtosis is known as mesokurtic, like the normal distribution. A distribution with positive kurtosis is known as super-Gaussian or leptokurtic. This distribution is characterised by a higher maximum and heavier tails than the normal distribution with the same variance. A distribution with negative kurtosis is known as sub-Gaussian or platykurtic and has a lower peak and lighter tails than the normal distribution with the same variance.
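As a simple numerical illustration, sample skewness and excess kurtosis can be computed from the centred sample moments; the distributions below are chosen only to show leptokurtic and platykurtic behaviour.

import numpy as np

def skew_kurt(x):
    # Skewness and excess kurtosis from centred sample moments
    xc = x - x.mean()
    m2, m3, m4 = (np.mean(xc ** k) for k in (2, 3, 4))
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2 - 3.0          # zero for a Gaussian (mesokurtic)
    return skew, kurt

rng = np.random.default_rng(4)
print(skew_kurt(rng.standard_normal(100000)))   # ~ (0, 0): mesokurtic
print(skew_kurt(rng.laplace(size=100000)))      # positive kurtosis: leptokurtic
print(skew_kurt(rng.uniform(size=100000)))      # negative kurtosis: platykurtic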

B.3.3 Joint Probability Distributions

Let X and Y be two random variables over a sample space S with respective pdfs f_X(.) and f_Y(.). For any x and y, the function f(x,y) defined by

Pr(X \le x; Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u,v) \, du \, dv

is the joint probability density function. The definition can be extended in a similar fashion to p random variables X_1, \ldots, X_p. The vector \mathbf{x} = \left( X_1, \ldots, X_p \right)^T is called a random vector, and its pdf is given by the joint pdf f(\mathbf{x}) of these random variables. Two random variables X and Y are said to be independent if

f(x,y) = f_X(x) f_Y(y), \quad \textrm{for all } x \textrm{ and } y.

The pdfs f_X(.) and f_Y(.) and associated cdfs F_X(.) and F_Y(.) are called marginal pdfs and marginal cdfs of X and Y, respectively. The marginal pdfs and cdfs are linked to the joint cdf via

F_X(x) = F(x, \infty) \quad \textrm{and} \quad f_X(x) = \frac{d}{dx} F_X(x),

and similarly for the second variable. The expectation of any function h(X, Y) is given by

E(h(X,Y)) = \int h(x,y) f(x,y) \, dx \, dy.

The covariance between X and Y is given by

cov(X, Y) = E(XY) − E(X)E(Y).

The correlation between X and Y is given by

\rho_{X,Y} = \frac{cov(X, Y)}{\sqrt{var(X) \, var(Y)}}

and satisfies -1 \le \rho_{X,Y} \le 1. If \rho_{X,Y} = 0, the random variables X and Y are said to be uncorrelated. Two independent random variables are uncorrelated, but the converse is not true. For a random vector \mathbf{x} = \left( X_1, \ldots, X_p \right)^T with joint pdf f(\mathbf{x}) = f(x_1, \ldots, x_p), the joint probability (or cumulative) distribution function is given by

F(x_1, \ldots, x_p) = \int_{-\infty}^{x_1} \ldots \int_{-\infty}^{x_p} f(u_1, \ldots, u_p) \, du_1 \ldots du_p.

The joint pdf is then given by

f(x_1, \ldots, x_p) = \frac{\partial^p F(x_1, \ldots, x_p)}{\partial x_1 \ldots \partial x_p}.

Like the bivariate case, p random variables X_1, \ldots, X_p are independent if the joint cdf F(.) can be factorised into a product of the marginal cdfs as F(x_1, \ldots, x_p) = F_{X_1}(x_1) \ldots F_{X_p}(x_p), and similarly for the joint pdf. Also, we have f_{X_1}(x_1) = \int_{-\infty}^{\infty} \ldots \int_{-\infty}^{\infty} f(\mathbf{x}) \, dx_2 \ldots dx_p, and similarly for the remaining marginal pdfs.

B.3.4 Expectation and of Random Vectors

Let \mathbf{x} = \left( X_1, \ldots, X_p \right)^T be a random vector with pdf f(.) and cdf F(.). The expectation of a function g(\mathbf{x}) is defined by

E\left[ g(\mathbf{x}) \right] = \int g(\mathbf{x}) f(\mathbf{x}) \, d\mathbf{x}.

The mean \boldsymbol{\mu} of \mathbf{x} is obtained when g(.) is the identity, i.e. \boldsymbol{\mu} = \int \mathbf{x} f(\mathbf{x}) d\mathbf{x}. Assuming the random variables X_1, \ldots, X_p have finite variance, the covariance matrix, \boldsymbol{\Sigma}_{xx}, of \mathbf{x} is given by

\boldsymbol{\Sigma}_{xx} = E\left[ (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \right] = E\left( \mathbf{x} \mathbf{x}^T \right) - \boldsymbol{\mu} \boldsymbol{\mu}^T,

with components \left[ \boldsymbol{\Sigma}_{xx} \right]_{ij} = cov\left( X_i, X_j \right). The covariance matrix is symmetric positive semi-definite. Let us now designate by \mathbf{D}_{xx} = diag\left( \sigma_1^2, \ldots, \sigma_p^2 \right) the diagonal matrix containing the individual variances of X_1, X_2, \ldots, X_p; then the correlation matrix \left( \rho_{X_i, X_j} \right) is given by:

E\left[ \mathbf{D}_{xx}^{-1/2} (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T \mathbf{D}_{xx}^{-1/2} \right] = \mathbf{D}_{xx}^{-1/2} \boldsymbol{\Sigma}_{xx} \mathbf{D}_{xx}^{-1/2}.

B.3.5 Conditional Distributions

Let x and y be two random vectors over some state space with joint pdf fx,y(.).The conditional probability density of y given x = x is given by

f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = f_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}) / f_{\mathbf{x}}(\mathbf{x}),

when f_{\mathbf{x}}(\mathbf{x}) \ne 0; otherwise, one takes f_{\mathbf{y}|\mathbf{x}}(\mathbf{x}, \mathbf{y}) = f_{\mathbf{x},\mathbf{y}}(\mathbf{x}, \mathbf{y}). Using this conditional pdf, one can obtain the expectation of any function h(\mathbf{y}) given \mathbf{x} = x, i.e.

E\left( h(\mathbf{y}) | \mathbf{x} = x \right) = \int_{-\infty}^{\infty} h(\mathbf{y}) f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|x) \, d\mathbf{y},

which is a function of x only. As in the two-variable case, \mathbf{x} and \mathbf{y} are independent if f_{\mathbf{y}|\mathbf{x}}(\mathbf{y}|\mathbf{x}) = f_{\mathbf{y}}(\mathbf{y}) or equally f_{\mathbf{x},\mathbf{y}}(.) = f_{\mathbf{x}}(.) f_{\mathbf{y}}(.). In particular, two (zero-mean) random vectors \mathbf{x} and \mathbf{y} are uncorrelated if the covariance matrix vanishes, i.e. E\left( \mathbf{x} \mathbf{y}^T \right) = \mathbf{O}.

B.4 Examples of Probability Distributions

B.4.1 Discrete Case

Bernoulli Distribution

A Bernoulli random variable X takes only two values, 0 and 1, i.e. X has two outcomes: success or failure (true or false), with respective probabilities Pr(X = 1) = p and Pr(X = 0) = q = 1 - p. The pdf of this distribution can be written as f(x) = p^x (1-p)^{1-x}, where x is either 0 or 1. A familiar example of a Bernoulli trial is the tossing of a coin.

Binomial Distribution

A binomial random variable X with parameters n and 0 ≤ p ≤ 1, noted as X ∼ B(n,p), takes n + 1 values 0, 1,...,nwith probabilities

Pr(X = j) = \binom{n}{j} p^j (1-p)^{n-j},

where \binom{n}{j} = \frac{n!}{j!(n-j)!}. Given a Bernoulli trial with probability of success p, a binomial trial B(n,p) consists of n repeated and independent Bernoulli trials. Formally, if X_1, \ldots, X_n are independent and identically distributed (IID) Bernoulli random variables with probability of success p, then \sum_{k=1}^{n} X_k follows a binomial distribution B(n,p). A typical example consists of tossing a coin n times, and the number of heads is a binomial random variable.

Exercise Let X \sim B(n,p), show that \mu = E(X) = np, and \sigma^2 = var(X) = np(1-p). Show that the characteristic function \phi(t) = E(e^{iXt}) is \left( p e^{it} + q \right)^n.

Negative Binomial Distribution

In a series of independent Bernoulli trials, with constant probability of success p,the random variable X representing the number of trials until r successes are obtained is a negative binomial with parameters p and r. The parameter r can take values 1, 2,..., and for each value, we have a distribution, e.g.

Pr(X = j) = \binom{j-1}{r-1} (1-p)^{j-r} p^r,

for j = r, r+1, \ldots. If we are interested in the first success, i.e. r = 1, one gets the geometric distribution.

Exercise Show that the mean and variance of the negative binomial distribution are, respectively, \mu = r/p and \sigma^2 = r(1-p)/p^2.

Poisson Distribution

A Poisson random variable with parameter λ>0 can take all the integer numbers and satisfies

Pr(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, \ldots.

Poisson distributions are typically used to analyse processes involving counts.

Exercise Show that for a Poisson distribution one has E(X) = \lambda = var(X) and \phi(t) = \exp\left( \lambda (e^{it} - 1) \right).

B.4.2 Continuous Distributions

The Uniform Distribution

A continuous uniform random variable over the interval [a,b] has the following pdf:

f(x) = \frac{1}{b-a} 1_{[a,b]}(x),

where 1_I(.) is the indicator of I, i.e. with a value of one inside the interval and zero elsewhere.

Exercise Show that for a uniform random variable X over [a,b], E(X) = (a+b)/2 and var(X) = (a-b)^2/12.

The Normal Distribution

The normal (or Gaussian) distribution, N(\mu, \sigma^2), has the following pdf:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( - \frac{(x-\mu)^2}{2\sigma^2} \right).

Exercise Show that for the above normal distribution E(X) = \mu and var(X) = \sigma^2.

For a normal distribution, the random variable \frac{X - \mu}{\sigma} has zero mean and unit variance and is referred to as the standard normal. The cdf of X is generally noted \Phi(x) = \int_{-\infty}^{x} f(u) du and is known as the error function. The normal distribution is very useful and can be reached in a number of ways. For example, if Y is binomial, Y \sim B(n,p), then \frac{Y - np}{\sqrt{np(1-p)}} approximates the standard normal for large np. The same result holds for \frac{Y - \lambda}{\sqrt{\lambda}} when Y follows a Poisson distribution with parameter \lambda. This result constitutes a particular case of a more general result, namely the central limit theorem (see e.g. DeGroot and Shervish 2002, p. 282).

The Central Limit Theorem Let X_1, \ldots, X_n be a sequence of n IID random variables with mean \mu and variance 0 < \sigma^2 < \infty each; then for every number x

\lim_{n \to \infty} Pr\left( \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} \le x \right) = \Phi(x),

where \Phi(.) is the standard normal cdf, and \bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k. The theorem says that the (properly scaled) sum of a sequence of independent random variables with the same mean and (finite) variance is approximately normal.
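The central limit theorem is easily illustrated by Monte Carlo simulation; the following Python sketch uses uniform variables, and the sample sizes are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n, n_rep = 50, 20000
mu, sigma = 0.5, np.sqrt(1.0 / 12.0)     # mean and standard deviation of U(0,1)

# n_rep independent sample means of n IID uniform variables
xbar = rng.uniform(size=(n_rep, n)).mean(axis=1)
z = (xbar - mu) / (sigma / np.sqrt(n))   # standardised sample means

# Compare Pr(Z <= 1) with the standard normal cdf value Phi(1) ~ 0.8413
print(np.mean(z <= 1.0))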

The Exponential Distribution

The pdf of the exponential distribution with parameter \lambda > 0 is given by

f(x) = \begin{cases} \lambda e^{-\lambda x} & \textrm{if } x \ge 0 \\ 0 & \textrm{otherwise.} \end{cases}

The Gamma Distribution

The pdf of the gamma distribution with parameters \lambda > 0 and \beta > 0 is given by

f(x) = \begin{cases} \frac{\lambda^\beta}{\Gamma(\beta)} x^{\beta-1} e^{-\lambda x} & \textrm{if } x > 0 \\ 0 & \textrm{otherwise,} \end{cases}

where \Gamma(y) = \int_0^\infty e^{-t} t^{y-1} dt, for y > 0. If the parameter \beta is a positive integer, the distribution is known as the Erlang distribution.

Exercise Show that for the above gamma distribution E(X) = \beta/\lambda, and var(X) = \beta/\lambda^2. Show that \phi(t) = (1 - it/\lambda)^{-\beta}.

The Chi-Square Distribution

The chi-square random variable with n degrees of freedom (dof), noted \chi_n^2, has the following pdf:

f(x) = \begin{cases} \frac{2^{-n/2}}{\Gamma(n/2)} x^{\frac{n}{2}-1} e^{-x/2} & \textrm{if } x > 0 \\ 0 & \textrm{otherwise.} \end{cases}

Exercise Show that E(\chi_n^2) = n and var(\chi_n^2) = 2n.

If X_1, \ldots, X_n are independent N(0, 1), the random variable \sum_{k=1}^{n} X_k^2 is distributed as \chi_n^2 with n dof. If X_k \sim N(0, \sigma^2), then the obtained \chi_n^2 follows the \sigma^2 chi-square distribution.

Exercise Find the pdf of the \sigma^2 chi-square distribution.

Answer f(x) = \frac{\sigma^{-n} \, 2^{-n/2}}{\Gamma(n/2)} x^{\frac{n}{2}-1} e^{-\frac{x}{2\sigma^2}} for x > 0.

The Student Distribution

The student T distribution with n dof has the following pdf:

f(x) = \frac{n^{-1/2} \, \Gamma\left( \frac{n+1}{2} \right)}{\Gamma(1/2) \, \Gamma(n/2)} \left( 1 + \frac{x^2}{n} \right)^{-\frac{n+1}{2}}.

If X \sim N(0, 1) and Y \sim \chi_n^2 are independent, then T = \frac{X}{\sqrt{Y/n}} has a student distribution with n dof.

The Fisher–Snedecor Distribution

The Fisher–Snedecor random variable with n and m dof, F_{n,m}, has the following pdf:

f(x) = \begin{cases} \frac{\Gamma\left( \frac{n+m}{2} \right)}{\Gamma(n/2) \Gamma(m/2)} \left( \frac{n}{m} \right)^{n/2} x^{\frac{n}{2}-1} \left( 1 + \frac{nx}{m} \right)^{-\frac{n+m}{2}} & \textrm{if } x > 0 \\ 0 & \textrm{otherwise.} \end{cases}

Exercise Show that E(F_{n,m}) = \frac{m}{m-2} and var(F_{n,m}) = \frac{2 m^2 (n+m-2)}{n (m-2)^2 (m-4)}.

If X \sim \chi_n^2 and Y \sim \chi_m^2 are independent, then F_{n,m} = \frac{X/n}{Y/m} follows a Fisher–Snedecor distribution with n and m dof.

The Multivariate Normal Distribution

A multinormally distributed random vector \mathbf{x}, noted \mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), has the pdf

f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right),

where \boldsymbol{\mu} and \boldsymbol{\Sigma} are, respectively, the mean and the covariance matrix of \mathbf{x}. The characteristic function of this distribution is \phi(\mathbf{t}) = \exp\left( i \boldsymbol{\mu}^T \mathbf{t} - \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t} \right). The multivariate normal distribution is widely used and has some very useful properties that are given below:

• Let \mathbf{A} be a m \times p matrix, and \mathbf{y} = \mathbf{A}\mathbf{x}; then \mathbf{y} \sim N_m(\mathbf{A}\boldsymbol{\mu}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T).
• If \mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}), and rank(\boldsymbol{\Sigma}) = p, then

(\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \sim \chi_p^2.

• Let the random vector \mathbf{x} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}) be partitioned as \mathbf{x}^T = \left( \mathbf{x}_1^T, \mathbf{x}_2^T \right), where \mathbf{x}_1 is q-dimensional (q < p), and similarly for the mean and the covariance matrix, i.e. \boldsymbol{\mu}^T = \left( \boldsymbol{\mu}_1^T, \boldsymbol{\mu}_2^T \right), and

\boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix},

then

(1) the marginal distribution of \mathbf{x}_1 is multinormal N_q(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_{11});
(2) \mathbf{x}_1 and \mathbf{x}_2 are independent if and only if \boldsymbol{\Sigma}_{12} = \mathbf{O};
(3) if \boldsymbol{\Sigma}_{22} is of full rank, then the conditional distribution of \mathbf{x}_1 given \mathbf{x}_2 = x_2 is multinormal with

E\left( \mathbf{x}_1 | \mathbf{x}_2 = x_2 \right) = \boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (x_2 - \boldsymbol{\mu}_2) \quad \textrm{and} \quad var\left( \mathbf{x}_1 | \mathbf{x}_2 \right) = \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}.
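The conditional mean and covariance formulas in property (3) can be evaluated directly; the following Python sketch uses an illustrative 3-dimensional example with q = 1 (the numbers are made up).

import numpy as np

mu = np.array([1.0, 0.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Partition: x1 is the first component, x2 the remaining two
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5, -0.5])               # observed value of x2
cond_mean = mu[:1] + S12 @ np.linalg.solve(S22, x2 - mu[1:])
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
print(cond_mean, cond_cov)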

The Wishart Distribution

The Wishart distribution with n dof and parameter \boldsymbol{\Sigma}, a p \times p symmetric positive semi-definite matrix (essentially a covariance matrix), is the distribution of a p \times p random matrix \mathbf{X} (a matrix whose elements are random variables) with pdf

f(\mathbf{X}) = \begin{cases} \frac{|\mathbf{X}|^{\frac{n-p-1}{2}} \exp\left( -\frac{1}{2} tr\left( \boldsymbol{\Sigma}^{-1} \mathbf{X} \right) \right)}{2^{np/2} \, \pi^{p(p-1)/4} \, |\boldsymbol{\Sigma}|^{n/2} \, \prod_{k=1}^{p} \Gamma\left( \frac{n+1-k}{2} \right)} & \textrm{if } \mathbf{X} \textrm{ is positive definite} \\ 0 & \textrm{otherwise.} \end{cases}

If \mathbf{X}_1, \ldots, \mathbf{X}_n are IID N_p(\mathbf{0}, \boldsymbol{\Sigma}), p \le n, then the p \times p random matrix

\mathbf{W} = \sum_{k=1}^{n} \mathbf{X}_k \mathbf{X}_k^T

has a Wishart probability distribution with n dof.

B.5 Stationary Processes

A (discrete) stochastic process is a sequence of random variables X_1, X_2, \ldots, which is a realisation of some random variable X, noted X_k \sim X. This stochastic process is entirely characterised by specifying the joint probability distribution of any finite set (X_{k_1}, \ldots, X_{k_m}) from the sequence. The sequence is also sometimes called a time series, when the indices t = 1, 2, \ldots are identified with “time”. When one observes a finite realisation x_1, x_2, \ldots, x_n of the previous sequence, one also talks of a finite sample time series. Let \mu_t = E(X_t) and \gamma(t,k) = cov(X_t, X_{t+k}), for t = 1, 2, \ldots, and k = 0, 1, \ldots. The process (or time series) is said to be stationary if \mu_t and \gamma(t,k) are independent of t. In this case one obtains

E(X_t) = \mu \quad \textrm{and} \quad \gamma(k) = \gamma_k = cov(X_t, X_{t+k}).

The function \gamma(.), defined on the integers, is the autocovariance function of the stationary stochastic process. The function

γk ρk = γ0 is the autocorrelation function. We assume that we have a finite sample x1,...,xn, supposed to be an inde- pendent realisation of some random variable X of finite mean and variance. Let 2 x ands be the sample mean and the sample variance, respectively, i.e. x = 1 n 2 = 1 n − 2 n−1 k=1 xk, and s n−1 k=1(xk x) . Note that because the sample is random, these estimators are also random. These estimators satisfy, respectively, E(X) = E(X), and E(s2) = var(X), and for this reason, they are referred to as unbiased estimators of the mean and the variance of X, respectively. Also, the ˆ = 1 { ≤ } function F(x) n # xk,xk x represents an estimator of the cdf F() of X,or empirical distribution function (edf). Given a finite sample x1,...,xn, an estimator of the autocovariance function is given by

\hat{\gamma}_k = \frac{1}{n} \sum_{i=1}^{n-k} (x_i - \bar{x})(x_{i+k} - \bar{x}),

and the estimator of the autocorrelation is \hat{\rho}_k = \hat{\gamma}_k / \hat{\gamma}_0. It can be shown that

var(\hat{\rho}_k) = \frac{1}{n} \sum_{i=1}^{\infty} \left( \rho_i^2 + \rho_{i+k} \rho_{i-k} - 4 \rho_k \rho_i \rho_{i-k} + 2 \rho_i^2 \rho_k^2 \right).

This expression can be simplified further if the autocorrelation decays, e.g. exponentially. The computation of the variance of the sample estimators is useful in defining a confidence interval for the estimators.

Appendix C Stationary Time Series Analysis

This appendix gives a brief introduction to stationary time series analysis for the univariate and multivariate cases.

C.1 Autocorrelation Structure: One-Dimensional Case

A (discrete) time series is a sequence of numbers xt , t = 1, 2 ...,n. In time series exploration and modelling, a time series is considered as a realisation of some stochastic process, i.e. a sequence of random variables. So, conceptually the time series xt , t = 1, 2,... is considered as a sequence of random variables and the corresponding observed series is simply a realisation of these random variables. The time series is said to be (second-order) stationary if the mean is constant and the covariance between any xt and xs is a function of t − s, i.e.

E (xt ) = μ and cov (xt ,xs) = γ(t − s). (C.1)

C.1.1 Autocovariance/Correlation Function

The autocovariance function of a stationary time series xt , t = 1, 2 ..., is defined by

γ(τ)= cov (xt+τ ,xt ) = E (xt+τ − μ)(xt − μ) . (C.2)

It is clear that the variance of the time series is simply \sigma^2 = \gamma(0). The autocorrelation function \rho(.) is given by


\rho(\tau) = \frac{\gamma(\tau)}{\sigma^2}. \qquad (C.3)

Properties of the Autocovariance Function

The autocovariance function \gamma(.) satisfies the following properties:

• |\gamma(\tau)| \le \gamma(0) = \sigma^2.
• \gamma(\tau) = \gamma(-\tau).
• For any p integers \tau_1, \tau_2, \ldots, \tau_p and real numbers a_1, a_2, \ldots, a_p, we have

\sum_{i,j=1}^{p} \gamma(\tau_i - \tau_j) a_i a_j \ge 0, \qquad (C.4)

and the autocovariance function is said to be non-negative definite or positive semi-definite.

Exercise Derive the above properties.

Hint Use the fact that var(\lambda x_t + x_{t+\tau}) \ge 0 for any real \lambda. For the last one, use the fact that var\left( \sum_i a_i x_{\tau_i} \right) \ge 0.

C.1.2 Time Series Models

Let \varepsilon_t, t \ge 1, be a sequence of IID random variables with zero mean and variance \sigma_\varepsilon^2. This sequence is called white noise. The autocovariance of such a process is simply a Dirac pulse, i.e.

γε(τ) = δτ , i.e. one at τ = 0, and zero elsewhere. Although the white noise process is the simplest time series, it remains, however, hypothetical because it does not exist in practice. Climate and other time series are autocorrelated. Simple linear time series models have been formulated to explain this autocorrelation. The models we are reviewing here have been formulated in the early 1970s and are known as autoregressive moving average (ARMA) models (Box and Jenkins 1970; see also Box et al. (1994)).

Some Basic Notations

Given a time series (x_t), where t is either continuous or discrete, various operations can be defined.

• Backward shift or delay operator B—This is defined for discrete time series by

Bxt = xt−1. (C.5)

More generally, for any integer m \ge 1, B^m x_t = x_{t-m}. By analogy, one can define the inverse operator B^{-1}, which is the forward operator. It is clear from Eq. (C.5) that for a constant c, Bc = c. Also for any integers m and n, B^m B^n = B^{m+n}. Furthermore, for any time series (x_t), t = 1, 2, \ldots, we have

\frac{1}{1 - \alpha B} x_t = (1 + \alpha B + \alpha^2 B^2 + \ldots) x_t = x_t + \alpha x_{t-1} + \ldots \qquad (C.6)

whenever |α| < 1.

Remark Consider the mean \bar{x} of a finite time series (x_t), t = 1, \ldots, n. Then,

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{n} \sum_{i=0}^{n-1} B^i x_n = \frac{1}{n} \frac{1 - B^n}{1 - B} x_n.

• Differencing operator ∇=1 − B—This is defined by

∇xt = (1 − B)xt = xt − xt−1. (C.7)

For example, \nabla^2 x_t = (1 - B)^2 x_t = x_t - 2 x_{t-1} + x_{t-2}.

• Seasonal differencing \nabla_k = 1 - B^k—This operator is frequently used to deal with seasonality.
• Gain operator—It is a simple linear multiplication of the time series, i.e. a x_t. The parameter a is referred to as the gain.
• Differencing operator D—For a continuous time series, \{ y(t), a \le t \le b \}, the differencing operator is simply a differentiation D, i.e.

Dy(t) = \frac{dy(t)}{dt} \qquad (C.8)

whenever this differentiation is possible.

• Continuous shift operator—Another useful operator normally encountered in filtering is the shift operator in continuous time series, B^u, defined by

B^u y(t) = y(t - u). \qquad (C.9)

This operator is equivalent to the backward shift operator in discrete time series. It can be shown that

B^u = e^{-uD} = e^{-u \frac{d}{dt}}. \qquad (C.10)

Fig. C.1 Examples of time series of AR(1) models with lag-1 autocorrelation 0.5 (a) and −0.5 (b)

Exercise Derive Eq. (C.10). Hint Use a Taylor expansion of y(t − u).

ARMA Models

• Autoregressive schemes: AR(p)

Autoregressive models of order p are given by

x_t = \phi_1 x_{t-1} + \phi_2 x_{t-2} + \ldots + \phi_p x_{t-p} + \varepsilon_t = \left( \sum_{k=1}^{p} \phi_k B^k \right) x_t + \varepsilon_t. \qquad (C.11)

The white noise \varepsilon_t is only correlated with x_s for s \ge t. When p = 1, one gets the familiar Markov or first-order autoregressive, AR(1), model also known as red noise. Figure C.1 shows an example of generated time series of an AR(1) model with opposite lag-1 autocorrelations. The red noise is a particularly simple model that is frequently used in climate research and constitutes a reasonably good model for many climate processes, see e.g. Hasselmann (1976, 1988), von Storch (1995a,b), Penland and Sardeshmukh (1995), Hall and Manabe (1997), Feldstein (2000) and Wunsch (2003) to name just a few.

• Moving average scheme: MA(q)

Moving average models of order q, MA(q), are defined by

x_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \ldots + \theta_q \varepsilon_{t-q} = \left( 1 + \sum_{k=1}^{q} \theta_k B^k \right) \varepsilon_t. \qquad (C.12)

It is possible to combine both the above models, AR(p) and MA(q), into just one single model, the ARMA model.

• Autoregressive moving average scheme: ARMA(p, q)

It is given by

\left( 1 - \sum_{k=1}^{p} \phi_k B^k \right) x_t = \left( 1 + \sum_{k=1}^{q} \theta_k B^k \right) \varepsilon_t. \qquad (C.13)

The ARMA(p, q) model can also be written in a more compact form as \phi(B) x_t = \theta(B) \varepsilon_t, where \phi(z) = 1 - \sum_{k=1}^{p} \phi_k z^k and \theta(z) = 1 + \sum_{k=1}^{q} \theta_k z^k. Stationarity of the ARMA(p, q) model, Eq. (C.13), requires that the roots of

\phi(z) = 0 \qquad (C.14)

be outside the unit circle, see e.g. Box et al. (1994) for details. Various ways exist to identify possible models for a given time series. For example, the autocorrelation function of an ARMA model is a damped exponential and/or sine wave that could be used as a guide to select models. Another useful measure is the partial autocorrelation function. It exploits the fact that, for example, for an AR(p) model the autocorrelation function can be entirely described by the first p lagged autocorrelations, whose behaviour is described by the partial autocorrelation, a function that cuts off after lag p for the AR(p) model. Alternatively, one can use concepts from information theory (Akaike 1969, 1974) by fitting a whole range of models, computing the residual estimates \hat{\varepsilon} and their variances (the mean squared errors) \hat{\sigma}^2 and then deriving, for example, the Akaike information criterion (AIC) given by

AIC = \log(\hat{\sigma}^2) + \frac{2}{n} (P + 1), \qquad (C.15)

where P is the number of parameters to be estimated. The best model corresponds to the smallest AIC.
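As an illustration of order selection with the AIC of Eq. (C.15), the following Python sketch fits AR(p) models by ordinary least squares to a simulated AR(2) series; the coefficients and sample size are illustrative.

import numpy as np

rng = np.random.default_rng(6)
n = 500
x = np.zeros(n)
for t in range(2, n):                      # simulate an AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

def fit_ar(x, p):
    # Least-squares fit of an AR(p) model: returns coefficients and residual variance
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    y = x[p:]
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ phi
    return phi, np.mean(resid ** 2)

for p in range(1, 6):
    _, s2 = fit_ar(x, p)
    aic = np.log(s2) + 2.0 * (p + 1) / len(x)    # Eq. (C.15) with P = p
    print(p, round(aic, 4))                      # smallest AIC expected near p = 2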

C.2 Power Spectrum

We assume that we have a stationary time series x_t, t = 1, 2, \ldots, with summable autocovariance function \gamma(.), i.e. \sum_k \gamma(k) < \infty. The spectral density function, or power spectrum, f(\omega) is defined by

f(\omega) = \frac{1}{2\pi} \sum_{k=-\infty}^{\infty} \gamma(k) e^{-ik\omega}. \qquad (C.16)

Fig. C.2 Autocorrelation function of AR(1) models with lag-1 autocorrelations 0.5 (a) and −0.5 (b)

Using the symmetry of the autocovariance function, the power spectrum becomes

f(\omega) = \frac{\sigma^2}{2\pi} \left( 1 + 2 \sum_{k=1}^{\infty} \rho(k) \cos k\omega \right). \qquad (C.17)

Remark Similar to the power spectrum, the bispectrum is the Fourier transform of the bicovariance function, and is related to the skewness (e.g. Pires and Hannachi 2021).

Properties of the Power Spectrum

• f(.) is even, i.e. f(-\omega) = f(\omega).
• f(\omega) \ge 0 for all \omega in [-\pi, \pi].
• \gamma(\tau) = \int_{-\pi}^{\pi} e^{i\omega\tau} f(\omega) d\omega = \int_{-\pi}^{\pi} \cos \tau\omega \, f(\omega) d\omega, i.e. the autocovariance function is the inverse Fourier transform of the power spectrum. Note that from the last property, one gets, in particular, the familiar result \sigma^2 = \int_{-\pi}^{\pi} f(\omega) d\omega, i.e. the power spectrum distributes the variance.

Examples

• The power spectrum of a white noise is constant, i.e. f(\omega) = \frac{\sigma^2}{2\pi}.
• For a red noise time series (of zero mean), x_t = \alpha x_{t-1} + \varepsilon_t, the autocorrelation function is \rho(\tau) = \alpha^{|\tau|}, and its power spectrum is f(\omega) = \frac{\sigma^2}{2\pi} \left( 1 - 2\alpha \cos\omega + \alpha^2 \right)^{-1} (Figs. C.2, C.3).

Exercise Derive the relationship between the variance of x_t and that of the innovation \varepsilon_t in the red noise model.

Hint \sigma^2 = \sigma_\varepsilon^2 (1 - \alpha^2)^{-1}.

Exercise
1. Compute the autocorrelation function \rho(.) of the AR(1) model x_t = \phi_1 x_{t-1} + \varepsilon_t.
2. Compute \sum_{k \ge 0} \rho(k).

Fig. C.3 Power spectra of two AR(1) models with lag-1 autocorrelation 0.5 and −0.5

3. Write \rho(\tau) = e^{-\tau/T_0}, and calculate T_0 as a function of \phi_1.
4. Calculate \int_0^\infty e^{-\tau/T_0} d\tau.
5. Reconcile the expression from 2 with that from 4.

Hint
1. For k \ge 1, x_t x_{t-k} = \phi_1 x_{t-1} x_{t-k} + \varepsilon_t x_{t-k} yields \rho(k) = \phi_1^k.
2. \frac{1}{1 - \phi_1}.
3. T_0 = - \frac{1}{\log \phi_1}.
4. T_0.
5. T_0^{-1} = - \log(1 - (1 - \phi_1)) \approx 1 - \phi_1.

General Expression of the ARMA Spectra

A direct method to compute the spectra of ARMA processes is to make use of results from linear filtering as outlined in Sect. 2.6 of Chap. 2.

Exercise Consider the delay operation y_t = B x_t. Find the relation between the Fourier transforms y(\omega) and x(\omega) of y_t and x_t, respectively. Find a similar relationship when y_t = \alpha x_t + \beta B x_t.

Answer y(\omega) = (\alpha + \beta e^{i\omega}) x(\omega).

Let xt , t = 0, 1, 2,..., be a stationary time series, and consider the filtering equation:

y_t = \sum_{k=1}^{p} \alpha_k x_{t-k}.

Using the above exercise, we get y(ω) = A(eiω)x(ω), where

A(z) = \sum_{k=1}^{p} \alpha_k z^k,

where the function a(\omega) = A(e^{i\omega}) is the frequency response function, which is the Fourier transform of the transfer function. Now, the power spectrum of y_t is linked to that of x_t following:

2 fy(ω) =|a(ω)| fx(ω).

The application of this to the ARMA time series model (C.13), see also Chap. 2, yields

$f_x(\omega) = \frac{\sigma_\varepsilon^2}{2\pi}\, \left| \frac{\theta(e^{i\omega})}{\phi(e^{i\omega})} \right|^2. \qquad (C.18)$

In the above equation it is assumed that the roots of $\phi(z)$ lie outside the unit circle (stationarity), and similarly for $\theta(z)$ (invertibility, i.e. $\varepsilon_t$ can be written as a convergent power series in $x_t, x_{t-1}, \ldots$).
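A short sketch of Eq. (C.18), assuming the sign convention $\phi(z) = 1 - \phi_1 z - \ldots - \phi_p z^p$ and $\theta(z) = 1 + \theta_1 z + \ldots + \theta_q z^q$ for the ARMA polynomials (this convention is an assumption, since Eq. (C.13) is not reproduced here), with numpy.

    import numpy as np

    def arma_spectrum(phi, theta, sigma2_eps=1.0, n_freq=512):
        """Spectral density of an ARMA(p,q) model via Eq. (C.18):
        f(w) = sigma_eps^2/(2*pi) * |theta(e^{iw})|^2 / |phi(e^{iw})|^2."""
        w = np.linspace(0.0, np.pi, n_freq)
        z = np.exp(1j * w)
        num = np.abs(1.0 + sum(t * z**(k + 1) for k, t in enumerate(theta)))**2
        den = np.abs(1.0 - sum(p * z**(k + 1) for k, p in enumerate(phi)))**2
        return w, sigma2_eps * num / (2.0 * np.pi * den)

    # e.g. AR(1) with phi_1 = 0.5 recovers the red-noise spectrum of Fig. C.3:
    # w, f = arma_spectrum(phi=[0.5], theta=[])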

C.3 The Multivariate Case

Let xt , t = 1, 2,..., be a multivariate time series where each element xt = xt1,xt2,...xtp is now p-dimensional. We suppose that xt is of mean zero and covariance matrix 0.

C.3.1 Autocovariance Structure

The lagged cross- or autocovariance matrix (τ) is defined by

$\Gamma(\tau) = E\left[ \mathbf{x}_{t+\tau}\, \mathbf{x}_t^T \right]. \qquad (C.19)$

The elements of (τ) are [(τ)]ij = E xt+τ,ixt,j . The diagonal elements are the autocovariances of the individual unidimensional time series forming xt , whereas its off-diagonal elements are the lagged cross-. The lagged covariance matrix has the following properties: • (−τ) = [(τ)]T . • (0) is the covariance matrix 0 of xt . C Stationary Time Series Analysis 491

• $\Gamma(\tau)$ is positive semi-definite in the sense that for any integer $m > 0$ and real vectors $\mathbf{a}_1, \ldots, \mathbf{a}_m$,

$\sum_{i,j=1}^{m} \mathbf{a}_i^T\, \Gamma(i-j)\, \mathbf{a}_j \ge 0. \qquad (C.20)$

Similarly, the lagged cross-correlation matrix

ϒ = −1/2 −1/2 (τ) 0 (τ) 0 ,

 − 1 2 whose elements ρij (τ) are [ϒ(τ)]ij = γij (τ) γii(0)γjj (0) , has similar properties. Furthermore, we have

|ρij (τ)|≤1.

Note that the inequality $\gamma_{ij}(\tau) \le \gamma_{ij}(0)$, for $i \neq j$, is not true in general.

C.3.2 Cross-Spectrum

As for the univariate case, we can define the spectral density matrix $\mathbf{F}(\omega)$ of $\mathbf{x}_t$, $t = 1, 2, \ldots$, for $-\pi \le \omega \le \pi$, as the Fourier transform of the autocovariance matrix:

$\mathbf{F}(\omega) = \frac{1}{2\pi} \sum_{\tau=-\infty}^{\infty} e^{-i\tau\omega}\, \Gamma(\tau) \qquad (C.21)$

whenever $\sum_\tau \|\Gamma(\tau)\| < \infty$, where $\|.\|$ is a matrix norm. For example, if $\sum_\tau |\gamma_{ij}(\tau)| < \infty$, for $i, j = 1, 2, \ldots, p$, then $\mathbf{F}(\omega)$ exists. Unlike the univariate case, however, the spectral density matrix can be complex because $\Gamma(\tau)$ is not symmetric. The diagonal elements of $\mathbf{F}(\omega)$ are real because they represent the power spectra of the individual univariate time series that constitute $\mathbf{x}_t$. The real part of $\mathbf{F}(\omega)$ is the co-spectrum matrix, whereas the imaginary part is the quadrature spectrum matrix. The spectral density matrix has the following properties:
• $\mathbf{F}(\omega)$ is Hermitian, i.e.

$\mathbf{F}(\omega) = \left[\mathbf{F}(\omega)\right]^{*T}$ (equivalently, $\mathbf{F}(-\omega) = \left[\mathbf{F}(\omega)\right]^{*}$),

where $(*)$ represents the complex conjugate.
• The autocovariance matrix is the inverse Fourier transform of $\mathbf{F}(\omega)$, i.e.

$\Gamma(\tau) = \int_{-\pi}^{\pi} \mathbf{F}(\omega)\, e^{i\tau\omega}\, d\omega. \qquad (C.22)$

• $\Gamma_0 = \int_{-\pi}^{\pi} \mathbf{F}(\omega)\, d\omega$, and $2\pi\, \mathbf{F}(0) = \sum_k \Gamma(k)$.
• $\mathbf{F}(\omega)$ is positive semi-definite (Hermitian), i.e. for any complex numbers $c_1, c_2, \ldots, c_p$, we have

$\mathbf{c}^{*T}\, \mathbf{F}(\omega)\, \mathbf{c} = \sum_{i,j=1}^{p} c_i^{*}\, F_{ij}(\omega)\, c_j \ge 0, \qquad (C.23)$

where $\mathbf{c} = (c_1, c_2, \ldots, c_p)^T$. The coherence and phase between $x_{t,i}$ and $x_{t,j}$, $t = 1, 2, \ldots$, for $i \neq j$, are, respectively, given by

$c_{ij}(\omega) = \frac{|F_{ij}(\omega)|^2}{F_{ii}(\omega)\, F_{jj}(\omega)}, \qquad (C.24)$

and

$\phi_{ij}(\omega) = \arctan\left( \frac{\mathrm{Im}(F_{ij}(\omega))}{\mathrm{Re}(F_{ij}(\omega))} \right). \qquad (C.25)$

The coherence, Eq. (C.24), is essentially the squared correlation coefficient between the two time series in the frequency domain. The phase, Eq. (C.25), on the other hand, gives a measure of the time lag between the time series.
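As an illustration of Eqs. (C.24) and (C.25), the sketch below uses SciPy's Welch-type estimators scipy.signal.coherence and scipy.signal.csd (segment averaging, a different smoothing from the lag-window estimators of Sect. C.4); the synthetic lagged series are hypothetical.

    import numpy as np
    from scipy import signal

    # hypothetical example: y lags an AR(1)-like series x by 3 steps
    rng = np.random.default_rng(0)
    n = 4096
    x = signal.lfilter([1.0], [1.0, -0.7], rng.standard_normal(n))
    y = np.roll(x, 3) + 0.5 * rng.standard_normal(n)

    # Welch-type estimates of coherence (Eq. C.24) and cross-spectral phase (Eq. C.25)
    f, cxy = signal.coherence(x, y, fs=1.0, nperseg=256)
    f, pxy = signal.csd(x, y, fs=1.0, nperseg=256)
    phase = np.angle(pxy)   # phase spectrum, related to the lag between x and y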

C.4 Autocorrelation Structure in the Sample Space

C.4.1 Autocovariance/Autocorrelation Estimates

We assume that we have a finite sample of a time series, xt , t = 1, 2 ...n. There are various ways to estimate the autocovariance function γ(). The most widely used estimators are

$\hat\gamma_1(\tau) = \frac{1}{n} \sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x}) \qquad (C.26)$

and

$\hat\gamma_2(\tau) = \frac{1}{n-\tau} \sum_{t=1}^{n-\tau} (x_t - \bar{x})(x_{t+\tau} - \bar{x}). \qquad (C.27)$

We can assume for simplicity that the sample mean is zero. It is clear from Eq. (C.26) and Eq. (C.27) that $\hat\gamma_1(.)$ is slightly biased, with bias of order $\frac{1}{n}$, i.e. asymptotically unbiased, whereas $\hat\gamma_2(.)$ is unbiased. The estimator $\hat\gamma_1(.)$ is, however, consistent, i.e. its variance goes to zero as the sample size goes to infinity, whereas $\hat\gamma_2(.)$ behaves poorly at large lags, its variance becoming very large as the lag approaches the sample size (see e.g. Jenkins and Watts 1968). For a fixed lag, both estimators are asymptotically unbiased, with approximate variances satisfying (Priestly 1981, p. 328)

$\mathrm{var}\, \hat\gamma_1(\tau) \approx O\!\left(\frac{1}{n}\right) \quad \text{and} \quad \mathrm{var}\, \hat\gamma_2(\tau) \approx O\!\left(\frac{1}{n-\tau}\right).$

Similarly, the autocorrelation function can be estimated by

$\hat\rho(\tau) = \frac{\hat\gamma(\tau)}{\hat\sigma^2}, \qquad (C.28)$

where $\hat\gamma(.)$ is an estimator of the autocovariance function, and $\hat\sigma^2 = (n-1)^{-1} \sum_{t=1}^{n} (x_t - \bar{x})^2$ is the sample variance. The sample estimate $\hat\rho_1(\tau)$, $\tau = 0, 1, \ldots, n-1$, is positive semi-definite, see Eq. (C.4), whereas this is not in general true for $\hat\rho_2(.)$, see e.g. Priestly (1981, p. 331). The graph showing the sample autocorrelation function $\hat\rho(\tau)$ versus $\tau$ is normally referred to as the correlogram. A simple and useful significance test for the sample autocorrelation function is based on asymptotic normality under the white noise hypothesis, namely,

$E\left[\hat\rho(\tau)\right] \approx 0 \quad \text{for } \tau \neq 0$, and

$\mathrm{var}\, \hat\rho(\tau) \approx \frac{1}{n} \quad \text{for } \tau \neq 0.$

These approximations can be used to construct confidence intervals for the sample autocorrelation function.
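A minimal numpy sketch (not from the text) of the correlogram with the white-noise bound implied by these approximations; the 95% level, i.e. the factor 1.96, is an illustrative choice.

    import numpy as np

    def correlogram(x, max_lag=40):
        """Sample autocorrelation (Eqs. C.26 and C.28) and approximate
        95% white-noise confidence bound +/- 1.96/sqrt(n)."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        xc = x - x.mean()
        acf = np.array([np.dot(xc[:n - k], xc[k:]) for k in range(max_lag + 1)])
        acf = acf / acf[0]                      # normalise so that rho(0) = 1
        bound = 1.96 / np.sqrt(n)               # white-noise confidence bound
        return acf, bound

    # lags where |acf| exceeds the bound are significantly different from zero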

C.4.2 The Periodogram

Raw Periodogram

We consider again a centred sample of a time series, $x_t$, $t = 1, 2, \ldots, n$, with sample autocovariance function $\hat\gamma(.)$. In spectral estimation we normally consider the Fourier frequencies $\omega_k = \frac{2\pi k}{n}$, for $k = -[\frac{n-1}{2}], \ldots, [\frac{n}{2}]$, where $[x]$ is the integer part of $x$. The frequency $\frac{2\pi}{2\Delta t}$ (radians/time unit), where $\Delta t$ is the sampling interval, is known as the Nyquist frequency.$^1$ The Nyquist frequency represents the highest frequency that can be resolved, and therefore the power spectrum can only be estimated for frequencies less than the Nyquist frequency. The sequence of the following complex vectors:

1 T iωk 2iωk inωk ck = √ e ,e ,...,e n

= ∗T = for k 1, 2,...n, is orthonormal, i.e. ck cl δkl, and therefore, any n- dimensional complex vector x can be expressed as

[ n ] 2 x = αkck, (C.29) =−[ n−1 ] k 2

where $\alpha_k = \mathbf{c}_k^{*T}\mathbf{x}$. The application of Eq. (C.29) to the vector $(x_1, \ldots, x_n)^T$ yields the discrete Fourier transform of the time series, i.e. $\alpha_k = \frac{1}{\sqrt{n}} \sum_{t=1}^{n} x_t\, e^{-it\omega_k}$. The periodogram of the time series is defined as the squared amplitude of the Fourier coefficients, i.e.

$I_n(\omega_k) = \frac{1}{n} \left| \sum_{t=1}^{n} x_t\, e^{-it\omega_k} \right|^2. \qquad (C.30)$

Now, from Eq. (C.29) one easily gets

$(n-1)\,\hat\sigma^2 = \sum_{t=1}^{n} x_t^2 = \sum_{k=-[\frac{n-1}{2}]}^{[\frac{n}{2}]} |\alpha_k|^2 = \sum_{k=-[\frac{n-1}{2}]}^{[\frac{n}{2}]} I_n(\omega_k). \qquad (C.31)$

As for the power spectrum, the periodogram also distributes the sample variance, i.e. the periodogram In(ωk) represents the contribution to the sample variance from the frequency ωk. The periodogram can be seen as an estimator of the power spectrum, Eq. (C.16). In fact, by expanding Eq. (C.30) one gets ⎡ ⎤ n n−1  1 − I (ω ) = ⎣ x2 + x x eikωp + e ikωp ⎦ n p n t t τ t=1 k=1 |t−τ|=k

1 1 Or 2t if the frequency is expressed in (1/time unit). For example, if the sampling time interval 1 is unity, then the Nyquist frequency is 2 . C Stationary Time Series Analysis 495

$= \sum_{k=-(n-1)}^{n-1} \hat\gamma(k)\, \cos(\omega_p k). \qquad (C.32)$

Therefore $\frac{1}{2\pi} I_n(\omega_p)$ is a candidate estimator for the power spectrum $f(\omega_p)$. Furthermore, it can be seen from Eq. (C.32) that $E\left[I_n(\omega_p)\right] = \sum_{k=-(n-1)}^{n-1} E\left[\hat\gamma(k)\right] \cos(\omega_p k)$, i.e.

$E\left[I_n(\omega_p)\right] \approx 2\pi f(\omega_p). \qquad (C.33)$

The periodogram is therefore an asymptotically unbiased estimator of the power spectrum. However, it is not consistent, since its variance does not decrease as the sample size increases. The periodogram is also highly erratic, with sampling fluctuations that do not vanish as the sample size increases, and therefore some smoothing is required.
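The following sketch (not from the text) computes the raw periodogram of Eq. (C.30) with the FFT and a simple running-mean smoothing in the spirit of Eq. (C.34); the rectangular (Daniell-type) spectral window used here is an assumed choice, distinct from the Bartlett and Parzen windows introduced below.

    import numpy as np

    def periodogram(x):
        """Raw periodogram I_n(w_k) of Eq. (C.30) at the Fourier frequencies."""
        x = np.asarray(x, dtype=float)
        n = len(x)
        xc = x - x.mean()
        w = 2.0 * np.pi * np.fft.rfftfreq(n)    # Fourier frequencies in [0, pi]
        I = np.abs(np.fft.rfft(xc))**2 / n      # |sum x_t e^{-it w_k}|^2 / n
        return w, I

    def smoothed_spectrum(x, half_width=5):
        """Consistent estimate: running mean of the raw periodogram over
        2*half_width + 1 neighbouring Fourier frequencies (rectangular,
        Daniell-type spectral window), divided by 2*pi, cf. Eq. (C.34)."""
        w, I = periodogram(x)
        kernel = np.ones(2 * half_width + 1) / (2 * half_width + 1)
        return w, np.convolve(I, kernel, mode="same") / (2.0 * np.pi)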

Periodogram Smoothing

Various ways exist to construct a consistent estimator of the spectral density function. Smoothing is the most widely used way to achieve consistency. The smoothed periodogram is obtained by convolving the (raw) periodogram using a “spectral window” W()as

$\hat{f}(\omega) = \frac{1}{2\pi} \sum_{k=-[\frac{n-1}{2}]}^{[\frac{n}{2}]} W(\omega - \omega_k)\, I_n(\omega_k). \qquad (C.34)$

The spectral window is a symmetric kernel function that integrates to unity and decays at large values. This smoothing is equivalent to a discrete Fourier transform of the weighted autocovariance estimator using a (time domain) lag window λ(.) as

− 1 n 1 f(ω)ˆ = λ(k)γ(k)ˆ cos(ω k). (C.35) 2π p k=−(n−1)

The sum in Eq. (C.35) is normally truncated at the truncation point of the lag window. The spectral window W() is the Fourier transform of the lag window, whose aim is to neglect the contribution, in the sample autocovariance function, from large lags. This means that localisation in time is associated with broadness in the spectral domain and vice versa. Figure C.4 illustrates the relationship between time (or lag) window and spectral window. Various lag/spectral windows exist in the literature. Two examples are given below, namely, the Bartlett (1950) and Parzen (1961) windows: 496 C Stationary Time Series Analysis

Fig. C.4 Illustration of the relationship between time and spectral windows

• Bartlett window, for which the lag window is defined by

$\lambda(\tau) = \begin{cases} 1 - \frac{|\tau|}{M} & \text{for } |\tau| \le M \\ 0 & \text{otherwise}, \end{cases} \qquad (C.36)$

and the corresponding spectral window is   M sin(πMω) 2 W(ω) = . (C.37) n πMω

• Parzen window:

$\lambda(\tau) = \begin{cases} 1 - 6\left(\frac{|\tau|}{M}\right)^2 + 6\left(\frac{|\tau|}{M}\right)^3 & \text{for } |\tau| \le \frac{M}{2} \\ 2\left(1 - \frac{|\tau|}{M}\right)^3 & \text{for } \frac{M}{2} < |\tau| \le M \\ 0 & \text{otherwise}, \end{cases} \qquad (C.38)$

and

$W(\omega) = \frac{6}{\pi M^3} \left( \frac{\sin(M\omega/4)}{\sin(\omega/2)} \right)^4. \qquad (C.39)$

Fig. C.5 Parzen window showing W(ω) in ordinate versus ω in abscissa for different values of the parameter M

Figure C.5 shows an example of the Parzen spectral window for different values of the parameter M. Notice in particular that as M increases the spectral window becomes narrower. Since M can be regarded as a time resolution, it is clear that the variance of the estimate increases with M and vice versa.
Remark There are other ways to estimate the power spectrum such as the maximum entropy method (MEM). The MEM estimator is obtained by fitting an autoregressive model to the time series and then using the model parameters to compute the power spectrum, see e.g. Burg (1972), Ulrych and Bishop (1975), and Priestly (1981).
The cross-covariance and the cross-spectrum can be estimated in a similar way to the sample covariance function and sample spectrum. For example, the cross-covariance between two zero-mean time series samples $x_t$ and $y_t$, $t = 1, \ldots, n$, can be estimated using

$\hat\gamma_{12}(\tau) = \frac{1}{n} \sum_{t=1}^{n-\tau} x_t\, y_{t+\tau} \qquad (C.40)$

for $\tau = 0, 1, \ldots, n-1$, which is then complemented by symmetry, i.e. $\hat\gamma_{12}(-\tau) = \hat\gamma_{21}(\tau)$. Similarly, the cross-spectrum can be estimated using

$\hat{f}_{12}(\omega) = \frac{1}{2\pi} \sum_{k=-M}^{M} \lambda(k)\, \hat\gamma_{12}(k)\, e^{-i\omega k}. \qquad (C.41)$

Appendix D Matrix Algebra and Matrix Function

D.1 Background

D.1.1 Matrices and Linear Operators

Matrices

T T Given two n-dimensional vectors x = (x1,...,nn) and y = (y1,...,yn) and scalar λ, then x + y and λx are also n-dimensional vectors given, respectively, T T by (x1 + y1,...,xn + yn) and (λx1,...,λxn) .ThesetEn of all n-dimensional vectors is called a linear (or vector) space. It is n-dimensional if it is real and 2n- dimensional if it is complex. For the real case, for example, a natural basis of the space is (e1,...,en), where ek contains zero everywhere except at the kth position where it is one. AmatrixX of order n × p is a collection of (real or complex) numbers xij , i = 1,...,n, j = 1,...,p, taking the following form: ⎛ ⎞ x11 x12 ... x1p ⎜ ⎟ ⎜ x21 x22 ... x2p ⎟ X = ⎜ . . . ⎟ = xij . ⎝ . . . ⎠ xn1 xn2 ... xnp

When p = 1, one obtains a n-dimensional vector, i.e. one column of n numbers T x = (x1,...,xn) . When n = p, the matrix is called square. Similar operations can be defined on matrices, i.e. for any two n × p matrices X = xij and Y = yij , and scalar λ,wehaveX + Y = xij + yij and λX = λxij .Thesetofalln × p real matrices is a linear space with dimension np.


Matrices and Linear Operators

Any n×p matrix X is a representation of a linear operator from a linear space Ep into n a linear space En. For example, if the space En is real, then one gets En = R . Let us T denote by xk = (x1k,...,xnk) , and then the matrix is written as X = x1,...,xp . The kth column xk of X represents the image of the kth basis vector ek of Ep, i.e.

Xek = xk.

D.1.2 Operation on Matrices

Transpose

The transpose of an $n \times p$ matrix $\mathbf{X} = (x_{ij})$ is the $p \times n$ matrix $\mathbf{X}^T = (y_{ij})$, where $y_{ij} = x_{ji}$.

Product

The product of n × p and p × q matrices Xand Y, respectively, is the n × q matrix = = = p Z XY, defined by Z zij , with zij k=1 xikykj .

Diagonal

A diagonal matrix is n × n matrix of the form A = xij δij , where δij is the Kronecker symbol. For a n × p matrix A, the main diagonal is given by all the elements aii, i = 1,...,min(n,p).

Trace  × = = n The trace of a square n n matrix X (xij ) is given by tr (X) k=1 xkk.

Determinant

Let X = (xij ) be a p × p matrix, and then, the determinant |X| of X is a multilinear function of the columns of X and is defined by

 *p |π| det(X) =|X|= (−1) xkπ(k), (D.1) π k=1 D Matrix Algebra and Matrix Function 501 where the sum is over all permutations π() of {1, 2,...,p} and |π| is either +1or −1 depending on whether π() is written as the product of an even or odd number of transpositions, respectively. The determinant can also be defined in a recurrent manner as follows. For a scalar x, the determinant is simply x, i.e. det(x) = x. Then, for a p × p matrix X, the determinant is given by   i+j i+j |X|= (−1) xij ij = (−1) xij ij , j i

−(i,j) where ij is the determinant of the (p − 1) × (p − 1) matrix X obtained by deleting the ith line and jth column. The determinant ij is referred to as the i+j of xij , and the term cij = (−1) ij as the cofactor of xij . It can be shown that

p xikcjk =|X|δij , (D.2) k=1 where δij is the Kronecker symbol. The matrix C = (cij ) is the matrix of cofactors of X.

Matrix Inversion

• Conventional inverse
When $|\mathbf{X}| \neq 0$, the square $p \times p$ matrix $\mathbf{X} = (x_{ij})$ is invertible and its inverse $\mathbf{X}^{-1}$ satisfies $\mathbf{X}\mathbf{X}^{-1} = \mathbf{X}^{-1}\mathbf{X} = \mathbf{I}_p$. It is clear from Eq. (D.2) that when $\mathbf{X}$ is invertible the inverse is given by

− 1 X 1 = CT , (D.3) |X|

where C is the matrix of cofactors of X. In what follows, the elements of X−1 are denoted by xij , i, j = 1,...n, i.e. X−1 = (xij ). • Generalised inverse Let X be a n × p matrix, and the generalised inverse of X is the p × n matrix X− satisfying the following properties:

XX− and X−X are symmetric XX−X = X, and X−XX− = X−.

The generalised inverse is unique and is also known as pseudoinverse or Moore– Penrose inverse. • Rank The rank of a n × p matrix X is the number of columns of X or its transpose that are linearly independent. It is the number of rows (and columns) of the largest 502 D Matrix Algebra and Matrix Function

invertible square submatrix of X. We have automatically: rank(X) ≤ min(n, p). The matrix is said to be of full rank if rank(X) = min(n, p).

Symmetry, Orthogonality and Normality

Let X be a real p × p , and then • X is symmetric when XT = X; T T • it is orthogonal (or unitary) when XX = X X = Ip; • it is normal when it commutes with its transpose, i.e. XXT = XT X, when the matrix is complex; • X is Hermitian when X∗T = X. For the complex case, the other two properties remain the same except that the transpose (T ) is replaced by the complex conjugate transpose (∗T ).

Direct Product

Let A = (aij ) and B = (bij ) two matrices of respective order n × p and q × r.The direct product of A and B, noted as A × B or A ⊗ B,isthenq × pr matrix defined by ⎛ ⎞ a11B a12B ... a1pB ⎜ ⎟ ⎜ a21B a22B ... a2pB ⎟ A ⊗ B = ⎜ . . . ⎟ . ⎝ . . . ⎠ an1B an2B ... anp B.

The above product is indeed a left direct product. A direct product is also known as Kronecker product. There is also another type of product between two n × p matrices of the same order A = (aij ) and B = (bij ), and that is the Hadamard product given by

A  B = aij bij .

Positivity

A square $p \times p$ matrix $\mathbf{A}$ is positive semi-definite if $\mathbf{x}^T\mathbf{A}\mathbf{x} \ge 0$ for any p-dimensional vector $\mathbf{x}$. It is positive definite when $\mathbf{x}^T\mathbf{A}\mathbf{x} > 0$ for any non-zero p-dimensional vector $\mathbf{x}$.

Eigenvalues/Eigenvectors

Let A a p×p matrix. The eigenvalues of A are given by the set of complex numbers λ1,...,λp solution to the algebraic polynomial equation:

|A − λIp|=0.

The eigenvectors of u1,...,up of A are the solutions to the eigenvalue problem:

Au = λu, where λ is an eigenvalue. The eignevectors are normally chosen to have unit-length. For any invertible p × p matrix B, the eigenvalues of A and B−1AB are identical.

Some Properties of Square Matrices

Let $\mathbf{A}$ and $\mathbf{B}$ be two $p \times p$ matrices. Then we have:
• $\mathrm{tr}(\alpha\mathbf{A} + \mathbf{B}) = \alpha\,\mathrm{tr}(\mathbf{A}) + \mathrm{tr}(\mathbf{B})$, for any number $\alpha$.
• $\mathrm{tr}(\mathbf{AB}) = \mathrm{tr}(\mathbf{BA})$.
• $\mathrm{tr}(\mathbf{A}) = \mathrm{tr}(\mathbf{P}^{-1}\mathbf{A}\mathbf{P})$ for any nonsingular $p \times p$ matrix $\mathbf{P}$.
• $\mathrm{tr}\left(\mathbf{A}\mathbf{x}\mathbf{x}^T\right) = \mathbf{x}^T\mathbf{A}\mathbf{x}$, where $\mathbf{x}$ is a vector.
• $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$.
• $\det(\mathbf{AB}) = |\mathbf{AB}| = |\mathbf{A}|\,|\mathbf{B}|$.
• $|\mathbf{A} \otimes \mathbf{B}| = |\mathbf{A}|^p\, |\mathbf{B}|^p$ and $\mathrm{tr}(\mathbf{A} \otimes \mathbf{B}) = \mathrm{tr}(\mathbf{A})\,\mathrm{tr}(\mathbf{B})$.
• $\mathrm{tr}(\mathbf{A}) = \sum_{k=1}^{p} \lambda_k$, where $\lambda_1, \ldots, \lambda_p$ are the eigenvalues of $\mathbf{A}$.
• For a symmetric (or, more generally, normal) matrix, the eigenvectors corresponding to different eigenvalues are orthogonal.
• $\mathrm{rank}(\mathbf{A}) = \#\{\lambda_k;\ \lambda_k \neq 0\}$.
• If $\mathbf{A}$ is (real) symmetric, then its eigenvalues $\lambda_1, \ldots, \lambda_p$ and its eigenvectors $\mathbf{P} = (\mathbf{u}_1, \ldots, \mathbf{u}_p)$ are real. If it is positive semi-definite, then its eigenvalues are all non-negative. If the matrix is Hermitian, then its eigenvalues are all real. For both cases, we have $\mathbf{A} = \mathbf{P}\boldsymbol{\Lambda}\mathbf{P}^{*T}$, where $\boldsymbol{\Lambda} = \mathrm{diag}\left(\lambda_1, \ldots, \lambda_p\right)$.
• If $\mathbf{A}$ is normal, i.e. commuting with its Hermitian transpose, then it is diagonalisable and has a complete set of orthogonal eigenvectors.

Singular Value Decomposition (SVD)

The SVD theorem has different forms, see e.g. Golub and van Loan (1996), and Linz and Wang (2003). In its simplest form, any n × p real matrix X, of rank r, can be decomposed as

X = UDVT , (D.4) 504 D Matrix Algebra and Matrix Function

where the $n \times r$ and $p \times r$ matrices $\mathbf{U}$ and $\mathbf{V}$ are orthogonal, i.e. $\mathbf{U}^T\mathbf{U} = \mathbf{V}^T\mathbf{V} = \mathbf{I}_r$, and $\mathbf{D} = \mathrm{diag}(d_1, \ldots, d_r)$, where $d_k > 0$, $k = 1, \ldots, r$, are the singular values of $\mathbf{X}$.
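A quick numerical check of Eq. (D.4) with numpy (not from the text), including the rank-k truncation; the random matrix is purely illustrative.

    import numpy as np

    # Thin SVD of Eq. (D.4): X = U D V^T with U^T U = V^T V = I_r.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 4))

    U, d, Vt = np.linalg.svd(X, full_matrices=False)   # d holds the singular values
    r = np.sum(d > 1e-12)                              # numerical rank of X

    # reconstruction check and best rank-k approximation
    assert np.allclose(X, (U * d) @ Vt)
    k = 2
    X_k = (U[:, :k] * d[:k]) @ Vt[:k, :]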

Theorem of Sums of Products

Let A, B, C and D be p × p, p × q, q × q and q × p matrices, respectively, then

−1 • (A + BCD)−1 = A−1 − A−1B C−1 + DA−1B DA−1 and −1 −1 • |A + BD|=|A||Ip + A BD|=|A||Iq + DA B|, when all necessary inverses exist.

Theorem of Partitioned Matrices

Let A be a block partitioned matrix as   A A A = 11 12 , A21 A22 then we have

| |=| || − −1 |=| || − −1 | A A11 A22 A21A11 A12 A22 A11 A12A22 A21 , when all necessary inverses exist. Furthermore, if A is invertible with inverse denoted by   11 12 − A A A 1 = , A21 A22 then

−1 11 = − −1 A A11 A12A22 A21 , 12 =− 11 −1 =− −1 22 A A A12A22 A11 A12A ,

−1 22 = − −1 A A22 A21A11 A12 , and 21 =− 22 −1 =− −1 −1 A A A21A11 A22 A21A11 . D Matrix Algebra and Matrix Function 505

D.2 Most Useful Matrix Transformations

• LU decomposition For any nonsingular n × n matrix A, there exists some P such that

$\mathbf{P}\mathbf{A} = \mathbf{L}\mathbf{U},$ where $\mathbf{L}$ is a lower triangular matrix with ones in the main diagonal and $\mathbf{U}$ is an upper triangular matrix.
• Cholesky factorisation
For any symmetric positive semi-definite matrix $\mathbf{A}$, there exists a lower triangular matrix $\mathbf{L}$ such that

A = LLT .

• QR decomposition For any m × n matrix A, with m ≥ n say, there exist a m × m Q and a m × n upper triangular matrix R such that

A = QR. (D.5)

The proof of this result is based on Householder transformation and finds a sequence of n unitary matrices Q1,...,Qn such that Qn ...Q1A = R.Ifatstepk,wehave say   Lk | B Qk ...Q1A = , Om−k,k | c|C where Lk is a k ×k upper triangular matrix, then Qk+1 will transform the vector c = T T (ck+1,...,cm) into the vector d = (d,0,...,0) without changing the structure of Lk and the (m −k)× k null matrix Om−k,k. This matrix is known as Householder transformation and has the form:   Ik Ok,m−k Qk+1 = , Om−k,k Pm−k

2 P − = I − − uuT u = c +c ( , ,..., ) where m k m k uT u , where 1 0 0 . Remark The following formula can be useful when expressing matrix products. Consider two p×q and r×q matrices U and V, respectively, with U = u ,...,u  1 q = T q i j and V v1,...,vq . Since the ith and jth element of UV is k=1 ukvk , and 506 D Matrix Algebra and Matrix Function

i j T because ukvk is the ith and jth element of ukvk , one gets

q T = T UV ukvk . k=1

= T = Similarly, if λ1,...,λq is a diagonal matrix, then we also have U V q T k=1 λkukvk .

D.3 Matrix Derivative

D.3.1 Vector Derivative

T Let f(.) be a scalar function of a p-dimensional vector x = x1,...,xp .The partial derivative of f(.)with respect to x is noted ∂f and is defined in the usual k ∂xk way. The derivative of f(.)with respect to x is given by   ∂f ∂f ∂f T =∇f(x) = ,..., (D.6) ∂x ∂x1 ∂xp and is also known as the gradient of f(.)at x. The differential of f() is then written as df = p ∂f dx =∇f(x)T dx, where dx = dx ,...,dx T . K=1 ∂xk k 1 p Examples • For a linear form f(x) = aT x, ∇f(x) = a. T • For a quadratic form f(x) = x Ax, ∇xf = 2Ax.

For a vector function f(x) = f1(x),...,fq (x) , where f1(.),...,fq (.) are scalar functions of x, the gradient in this case is called the Jacobian matrix of f(.) and is given by   T T ∂fj Df(x) = ∇f1(x) ,...,∇fq (x) = (x) . (D.7) ∂xi

D.3.2 Matrix Derivative

Definition Let X = xij = x1,...,xq be a p×q matrix and Y = yij = F (X) a r ×s matrix function of X. We assume that the elements yij of Y are differentiable scalar function with respect to the elements xij of X. We distinguish two cases: D Matrix Algebra and Matrix Function 507

1. Scalar Case = If Y F (X) is a scalar function, then to define the derivative of F() we first use T = T T the vec (.) notation given by vec (X) x1 ,...,xq transforming X into a pq- dimensional vector. The differential of F (X) is then obtained by considering F() as a function of vec (X). One gets the following expression:   ∂F ∂F = . (D.8) ∂X ∂xij

∂F × The derivative ∂X is then a p q matrix. 2. Matrix Case If Y = F (X) is a r × s matrix, where each yij = Fij (X) is a differentiable scalar function of X, the partial derivative of Y with respect to xmn is the r × s matrix:   ∂Y ∂Fij (X) = . (D.9) ∂xmn ∂xmn

The partial derivative of Y with respect to X, based on Eq. (D.9), is the pr × qs matrix given by ⎛ ⎞ ∂Y ... ∂Y ⎜ ∂x11 ∂x1q ⎟ ∂Y ⎜ . . ⎟ = ⎝ . . ⎠ . (D.10) ∂X ..... ∂Y ... ∂Y ∂xp1 ∂xqq

Equation (D.10) also defines the Jacobian matrix DY (X) of the transformation. Another definition of the Jacobian matrix is given in Magnus and Neudecker (1995, p. 173) based on the vec transformation, namely,

∂vecF (X) DF (X) = . (D.11) ∂ (vecX)T

Equation (D.11) is useful to compute the Jacobian matrices using the vec trans- formation of X and Y and then get the Jacobian of a vector function. Note that Eqs. (D.9)or(D.10) can also be written as a Kronecker product:   ∂Y ∂ ∂yij = Y ⊗ = . (D.12) ∂X ∂X ∂X

In this appendix we adopt the componentwise derivative concept as in Dwyer (1967). 508 D Matrix Algebra and Matrix Function

D.3.3 Examples

In the following examples the p × q matrix Jij will denote the matrix whose ith and jth element is one and zero elsewhere, i.e. Jij = δm−i,n−j = δmiδnj , and similarly for the r × s matrix Kαβ . For instance, if X = (xmn), then Y = Jij X is the matrix whose ith line is the jth line of X and zero elsewhere (i.e. ymn = δmixjn), and Z = XJij is the matrix whose j column is the ith column of X and zero elsewhere (i.e. zmn = δjnxmi). The matrices Jij and Kij are essentially identical, but they are obtained differently, see the remark below.

Case of Independent Elements

We assume that the matrix X = xij is composed of pq independent variables. We j will also use interchangeably the notation δij or δ for Kronecker symbol. i  × = = •LetX be a p p matrix, and f (X) tr(X) k xkk. The derivative of f()is ∂ (tr (X)) = δ . Hence, ∂xmn mn

∂ ∂ tr (X) = I = tr XT . (D.13) ∂X p ∂X    • f (X) = tr (AX).Heref (X) = a x , and ∂f = a δk δn = i k ik ki ∂xmn i,k ik m i anm; hence,

∂f = AT . (D.14) ∂X

• g (X) = g (f (X)), where f(.) is a scalar function of X and g(y) is a differentiable scalar function of y. In this case we have

∂g dg ∂f = (f(X)) . ∂X dy ∂X

∂ tr(XA) = tr(XA) T For example, ∂X e e A . • f(X) = det (X) =|X|. For this case, one can use Eq. (D.2), i.e. |X|= j xαj Xαj where Xαj is the cofactor of xαj . Since Xαj is independent of xαk, | | for k = 1,...n, one gets ∂ X = X , and using Eq. (D.3), one gets ∂xαβ αβ

∂|X| − =|X|X T . (D.15) ∂X

Consequently, if g(y) is any real differentiable scalar function of y, then we get D Matrix Algebra and Matrix Function 509

∂ dg − g (|X|) = (|X|) |X|X T . (D.16) ∂X dy

• f (X) = g (H(X)), where g(Y) is a scalar function of matrix Y and H(X) is a matrix function of X, both differentiable. Using a similar argument from the derivative of a scalar function of a vector, one gets   ∂f (X) ∂g ∂yij ∂g ∂ (H(X))ij = (H(X)) = (H(X)) . ∂x ∂y ∂x ∂y ∂x αβ i,j ij αβ i,j ij αβ

Y = Y (X) X For example, if  is any differentiable matrix function of , then ∂|Y(X)| = ∂|Y| ∂yij = ∂yij = ∂YT i,j i,j Yij i,j Yij . That is, ∂xαβ ∂yij ∂xαβ ∂xαβ ∂xαβ ji     T ∂|Y(X)| − ∂Y ∂Y − = tr |Y|Y T =|Y|tr Y 1 . (D.17) ∂xαβ ∂xαβ ∂xαβ

Remark We can also compute the derivative with respect to an element and derivative of an element with respect to a matrix as in the following examples.

∂X ∂XT T •Letf(X) = X, then = Jαβ , and = Jβα = J . ∂xαβ ∂xαβ αβ × = ∂yij = •Forar s matrix Y yij ,wehave ∂Y Kij . ∂[f(X)] •Forf(X) = AXB, one obtains ∂f (X) = AJ B and ij = AT K BT . ∂xαβ αβ ∂X ij n Exercise Compute ∂X . ∂xαβ

n−1 Hint Use a recursive relationship. Write U = ∂XX , and then U = J Xn−1 + n ∂xαβ n αβ ∂Xn−1 − X = J Xn 1 + XU − . By induction, one finds that ∂xαβ αβ n 1

n−1 n−2 2 n−3 n−2 Un = X Jαβ + XJαβ X + X Jαβ X + ...+ X Jαβ X.

Application 1. f(X) = XAX, in this case

∂f (X) = Jαβ AX + XAJαβ . (D.18) ∂xαβ

This could be proven by expressing the ith and jth element [XAX]ij of XAX. = = Application 2. g(X ) tr(f(X)) where f(X ) XAX. Since tr ∂f (X) = tr J AX + XAJ = [AX] + [XA] , hence ∂xαβ αβ αβ βα βα

∂tr (XAX) = (AX + XA)T . (D.19) ∂X 510 D Matrix Algebra and Matrix Function

Application 3.   ∂|XAXT | −1 −1 =|XAXT | XAT XT XAT + XAXT XA . (D.20) ∂X   | T | −1 In particular, if A is symmetric, then ∂ XAX = 2|XAXT | XAT XT XA . ∂X T ∂ XAX T One can use the fact that = Jαβ AX + XAJβα, see also Eq. (D.18), ∂xαβ       ∂ XAXT    ∂ AXT which can be proven by writing ij = ∂xik AXT + x kj . ∂xαβ k ∂xαβ kj ik ∂xαβ    β α T The first sum in the right hand side of this expression is simply k δk δi AX kj , T T which is the (i, j)th element of Jαβ AX (and also the (α, β)th element of Jij XA ). Similarly, the second sum is the (i, j)th element of XAJαβ and, by applying the trace operator, provides the required answer. Exercise Complete the proof of Eq. (D.20).   ∂|XAXT| T T −1 T Hint First use Eq. (D.17), i.e. =|XAX |tr XAX Jαβ AX +XAJβα . ∂xαβ  T −1 Next use the same argument as that used in Eq. (D.19) to get tr XAX Jαβ         T = T −1 T = T −T T AX i XAX AX XAX XA . A similar iα βi αβ reasoning yields       −1 −1 −1 T T T tr XAX XAJαβ = tr XAJβα XAX = XAX XA , αβ which again yields Eq. (D.20). Application 4.

∂|AXB| − =|AXB|AT (AXB) T BT . (D.21) ∂X   ∂ ∂|AXB| In fact, one has [AXB]ij = aiαbβj = AJαβ B . Hence = ∂xαβ  ij ∂xαβ |AXB|tr AJ B (AXB)−1 . The last term equals a B (AXB)−1 =   αβ  i iα βi B (AXB)−1 a , which can be easily recognised as B (AXB)−1 A =  i βi iα βα T −T T A (AXB) B αβ .

−1 • Derivative of the inverse. To compute ∂X , one can use the fact that X−1X = ∂xαβ   ik I , i.e. xikx = δ , which yields after differentiation: ∂x x = p k kj ij k ∂xαβ kj    −1 − xik J , i.e. ∂X X =−X−1J or k αβ kj ∂xαβ αβ D Matrix Algebra and Matrix Function 511

∂ −1 −1 −1 X =−X Jαβ X . (D.22) ∂xαβ

• f(X) = Y = AX−1B −. First, we have ∂ Y =−AX−1J X−1B. ∂xαβ αβ Let us now find the derivative of each element of Y with respect to X.We × × first note that for any two matrices of respective orders n p and q m , β A = aij and B = bij , one has AJαβ = aiαδ and AJαβ B = aiαbβj .    j   ∂yij −1 −1 −1 −1 Now, =−AX Jαβ X B =−AX X B , which is also  ∂xαβ     ij iα  βj −1 T −1 T −1 T −1 T − AX X B = AX Jij X B , that is αi jβ αβ

∂yij − − =−X T AT J BT X T . (D.23) ∂X ij

• f(X) = y = tr X−1A −. One uses the previous argument, i.e. ∂y =       ∂xαβ  −tr X−1J X−1A , which is − X−1 X−1A =− X−T   αβ i iα βi i αi T −T A X iβ . Therefore,

∂ − − − tr X 1A =−X T AT X T . (D.24) ∂X

Alternatively, one can also use the identity tr(X) =|X|tr(X−1) (e.g. Graybill 1969, p. 227).

Case of Symmetric Matrices

The matrices dealt with in the previous examples have independent elements. When, however, the elements are not independent, the rules change. Here we consider the case of symmetric matrices, but there are various other dependencies such as normality, orthogonality etc. Let X = xij be a , and J ij = Jij + Jji − diag Jij , i.e. the matrix with one for the (i, j)th and (j, i)th elements and zero elsewhere. We have ∂X = J .Now,iff(X) is a scalar function of ∂xij ij the symmetric matrix X, then we can start first with the scalar function f(Y) for a general matrix, and we get (e.g. Rogers 1980)    ∂f (X) = ∂f (Y) + ∂f (Y) − ∂f (Y) T diag . ∂X ∂Y ∂Y ∂Y Y=X

The following examples illustrate this change.  • ∂ tr (AX) = a ∂xki = a + a ; therefore, ∂xαβ i,k ik ∂xαβ αβ βα 512 D Matrix Algebra and Matrix Function

∂ tr (AX) = A + AT . (D.25) ∂X

Exercise Show that ∂ AXB = AJ B + BJ B − AJ B δ . ∂xαβ αβ βα αα αβ • Derivative of determinants   ∂ − − |X|=|X| 2X 1 − diag X 1 . (D.26) ∂X

∂ − − − |AXB|=|AXB| AT (AXB) T BT +B (AXB) 1 A−diag B (AXB) 1 A . ∂X (D.27) Exercise Derive Eq. (D.26) and Eq. (D.27). = + T − Hint Apply (D.17) to the transformation Y(X) X1 X1 diag (X), where X is the lower triangular matrix whose elements are x1 = x ,fori ≤ j. Then 1   ij  ij ∂ ∂|Y| ∂yij ji one gets |Y(X)|= = |Y|y J αβ . Keeping in mind ∂xαβ ij ∂yij ∂xαβ ij ij = | | −1J that Y X, the previous expression yields X tr X αβ . To complete the proof, remember that J = J + J − diag J ; hence, tr X−1J = αβ αβ βα  αβ αβ βα + αβ − αβ αβ = −1 − −1 x x x δx 2X diag X αβ . Similarly, Eq. (D.27) is similar to Eq. (D.21) but involves symmetry, i.e. ∂ AXB = AJ B. Therefore, ∂ |AXB|=|AXB|tr AJ B (AXB)−1 . ∂xαβ αβ ∂xαβ αβ • Derivative of a matrix inverse

−1 ∂X −1 −1 =−X J αβ X . (D.28) ∂xαβ

The proof is similar to the non-symmetric case, but using J αβ instead. • Trace involving matrix inverse

∂ − − − − − tr X 1A =−X 1 A + AT X 1 + diag X 1AX 1 . (D.29) ∂X

D.4 Application

D.4.1 MLE of the Parameters of a Multinormal Distribution

Matrix derivatives find straight application in multivariate analysis. The most famil- iar example is perhaps the estimation of a p-dimensional multinormal distribution D Matrix Algebra and Matrix Function 513

N (μ, ) from a given sample of data. Let x1,...,xn be a sample from such a distribution. The likelihood of this sample (see e.g. Anderson 1984)is

$\mathcal{L} = \prod_{t=1}^{n} f(\mathbf{x}_t; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \prod_{t=1}^{n} (2\pi)^{-p/2}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\left( -\frac{1}{2} (\mathbf{x}_t - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_t - \boldsymbol{\mu}) \right). \qquad (D.30)$

The objective is then to estimate $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ by maximising $\mathcal{L}$. It is usually simpler to use the log-likelihood $L = \log\mathcal{L}$, which reads

$L = \log\mathcal{L} = -\frac{np}{2}\log 2\pi - \frac{n}{2}\log|\boldsymbol{\Sigma}| - \frac{1}{2}\sum_{t=1}^{n} (\mathbf{x}_t - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}_t - \boldsymbol{\mu}). \qquad (D.31)$

∂L = The estimates are obtained by solving the system of equations given by ∂μ 0 and ∂L = −1 ∂ O. The first of these yields is (assuming that exists)

n (xt − μ) = 0, (D.32) t=1 which provides the sample mean. For the second, one can use Eqs. (D.16)–(D.26), and Eq. (D.29) for the last term, which can be written as a trace of a matrix product. This yields

− − − − − − − 2 1 − diag  1 − 2 1S 1 + diag  1S 1 1 = O,

−1 −1 −1 −1 which can be simplified to 2 Ip − S − diag  Ip − S = O, i.e.

−1 −1  Ip − S = O, (D.33) yielding the sample covariance matrix S.

D.4.2 Estimation of the Factor Model Parameters

The estimation of the parameters of a factor model can be found in various text books, e.g. Anderson (1984), Mardia et al. (1979). The log-likelihood of the model has basically the same form as Eq. (D.31) except that now  is given by  = +  T , where is a diagonal covariance matrix (see Chap. 10, Eq. (10.11)). Using Eq. (D.16) along with results from Eq. (D.20), we get ∂ log |  T + ∂ |=  T + −T  ∂ |  T + |= 2 . In a similar way we get ∂ log 514 D Matrix Algebra and Matrix Function

 T + −1 ∂ |  T + diag . Furthermore, using Eq. (D.27), one gets ∂ log −1 −1 |=2 T  T + − diag T  T + .

= T ∂ −1 = −T T −T T − Exercise Let H XAX . Show that ∂X tr H S H S H XA H−1SH−1XA.

Hint Let H = (hij ), then using arguments from Eq. (D.24) and Eq. (D.19) one gets

 −1 ∂ − ∂tr(H S) ∂ tr H 1S = h ∂x ∂h ∂x ij αβ ij ij αβ      −T T −T T = −H S H Jαβ AX + XAJβα . ij ij ij   −1 −1 T This is precisely tr −H SH Jαβ AX + XAJβα . Using an argument similar to that presented in Eqs. (D.23)–(D.24), one gets   ∂ − − − − − trH 1S =− H T ST H T XAT + H 1SH 1XA . ∂xαβ αβ

Applying the above exercise and keeping in mind the symmetry of ,see Eq. (D.29), yield

∂ − − − − − tr  1S = 2 −2 1S 1 + diag  1S 1 . ∂

= ∂ −1 =− T −T T −T T Exercise Let H AXB, and show that ∂X tr H S A H S H B .      Hint As before, one finds − H−1SH−1 AJ B =−tr AJ BH−T ST H−T ,  ij  ij αβ ij  αβ  − −T T −T =− −T T −T and this can be written as i aiα BH S H βi BH S H A βα. Using the above result with A = , B = T and X = , keeping in mind the symmetry of  and , one gets   ∂ −1 = T − −1 −1 + −1 −1 ∂ tr S 2 2 S diag  S +diag T −2−1S−1 + diag −1S−1 .

For diagonal, one simply gets   ∂ − − − − − − − tr  1S = −2 1S 1 + diag  1S 1 , that is − diag  1S 1 . ∂ [ ]αα αα

Finally, one gets D Matrix Algebra and Matrix Function 515

  ∂L =−n −1  + − −1 −1 + −1 −1  ∂ 2 2  2 2 S diag S  =− −1  − −1 + −1 −1   n ( 2S) diag S  ∂L =−n T −1 − T −1 + T − −1 −1 + −1 −1 ∂ 2 2 diag 2 2 S  diag S − n T − −1 −1 + −1 −1 2 diag 2 S diag  S ∂L =−n −1 − −1 −1 =−n −1  − −1 ∂ 2 diag diag S 2 diag ( S) . (D.34) Note that if one removes the terms pertaining to symmetry one finds what has been presented in the literature, e.g. Dwyer (1967), Magnus and Neudecker (1995). For example, in Dwyer (1967) symmetry was not explicitly taken into account in the differentiation. The symmetry condition can be considered via Lagrange multipliers (Magnus and Neudecker 1995). It turns out, in fact, that the stationary points of a × ∂f (X) = scalar function f(X) of the symmetric p p matrix X, i.e. ∂X O are also the solutions to (Rogers 1980, th. 101, p. 80)   ∂f (Y) = O, (D.35) ∂Y Y=X     ∂f (Y) = ∂f (Y) where Y is non-symmetric, whenever T , which is ∂Y Y=X ∂Y Y=X straightforward based on the definition of the derivative given in Eq. (D.8), and where the first differentiation reflects positional aspect, whereas the second one refers to the ordinary differentiation. This result simplifies calculation considerably. The stationary solutions to L are given by

−1 ( − S) −1  = O T −1  − −1 = ( S) O (D.36) diag −1 ( − S) −1 = O.

D.4.3 Application to Results from PCA

Matrix derivatives also find application in various other subjects. The eigenvalue problem of EOFs is a straightforward application. An interesting alternative to this eigenvalue problem, which uses the matrix derivative, is provided by the following result (see Magnus and Neudecker 1995, th. 3, p. 355). For a given $p \times p$ positive semi-definite matrix $\boldsymbol{\Sigma}$, of rank r, the minimum of

$\mathrm{tr}\left( (\boldsymbol{\Sigma} - \mathbf{Y})^2 \right), \qquad (D.37)$

where $\mathbf{Y}$ is a positive semi-definite matrix of rank $q \le p$, is obtained for

$\mathbf{Y} = \sum_{k=1}^{q} \lambda_k^2\, \mathbf{v}_k\mathbf{v}_k^T, \qquad (D.38)$

where $\lambda_k^2$ and $\mathbf{v}_k$, $k = 1, \ldots, q$, are the leading eigenvalues and associated eigenvectors of $\boldsymbol{\Sigma}$. The matrix $\mathbf{Y}$ thus defined provides the best approximation to $\boldsymbol{\Sigma}$. Consequently, if $\boldsymbol{\Sigma}$ represents the covariance matrix of some data matrix, then $\mathbf{Y}$ is simply the covariance matrix of the filtered data matrix obtained by removing the contribution from the last $r - q$ eigenvectors of $\boldsymbol{\Sigma}$.
Exercise Show that $\mathbf{Y}$ defined by Eq. (D.38) minimises Eq. (D.37).
Hint Write $\mathbf{Y} = \mathbf{A}\mathbf{A}^T$ and find $\mathbf{A}$.
A $p \times r$ matrix $\mathbf{A}$ is semi-orthogonal$^1$ if $\mathbf{A}^T\mathbf{A} = \mathbf{I}_r$. Another connection to EOFs is provided by the following theorem (Magnus and Neudecker 1995).
Theorem Let $\mathbf{X}$ be an $n \times p$ data matrix. The minimum of the following real valued function:

$\phi(\mathbf{X}) = \mathrm{tr}\left( (\mathbf{X} - \mathbf{Z}\mathbf{A}^T)^T (\mathbf{X} - \mathbf{Z}\mathbf{A}^T) \right) = \|\mathbf{X} - \mathbf{Z}\mathbf{A}^T\|_F^2, \qquad (D.39)$

where the $p \times r$ matrix $\mathbf{A}$ is semi-orthogonal and $\|.\|_F$ stands for the Fröbenius norm, is obtained for $\mathbf{A} = (\mathbf{v}_1, \ldots, \mathbf{v}_r)$ and $\mathbf{Z} = \mathbf{X}\mathbf{A}$, where $\mathbf{v}_1, \ldots, \mathbf{v}_r$ are the eigenvectors associated with the leading eigenvalues $\lambda_1^2, \ldots, \lambda_r^2$ of $\mathbf{X}^T\mathbf{X}$. Further, the minimum is $\sum_{k=r+1}^{p} \lambda_k^2$. In other words, $\mathbf{A}$ is the set of the leading eigenvectors, and $\mathbf{Z}$ the matrix of the associated PCs.
Exercise Find the stationary points of Eq. (D.39).
Hint Use a Lagrange function (see exercise below).
Exercise Show that the Lagrangian function of

min φ(X) s.t. F(X) = O, where F(X) is a symmetric matrix function of X,is

$\phi(\mathbf{X}) - \mathrm{tr}\left( \mathbf{L}\mathbf{F}(\mathbf{X}) \right),$ where $\mathbf{L}$ is a symmetric matrix.
Hint The Lagrangian function is simply $\phi(\mathbf{X}) - \sum_{i,j} l_{ij} \left[\mathbf{F}(\mathbf{X})\right]_{ij}$, where $l_{ij} = l_{ji}$ since $\mathbf{F}(\mathbf{X})$ is symmetric.
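As an illustration of Eqs. (D.37)–(D.38), a minimal numpy sketch (not from the text) of the rank-q, EOF-truncated approximation of a covariance matrix; the data matrix is hypothetical.

    import numpy as np

    def low_rank_approx(S, q):
        """Best rank-q approximation of a covariance matrix S in the sense of
        Eqs. (D.37)-(D.38): keep the q leading eigenvalues/eigenvectors."""
        vals, vecs = np.linalg.eigh(S)           # ascending eigenvalues
        idx = np.argsort(vals)[::-1][:q]         # q leading eigenvalues
        return (vecs[:, idx] * vals[idx]) @ vecs[:, idx].T

    # covariance of a hypothetical data matrix X (n x p), then its rank-2 approximation
    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    S = np.cov(X, rowvar=False)
    Y = low_rank_approx(S, q=2)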

1The set of these matrices is known as Stiefel manifold. D Matrix Algebra and Matrix Function 517

D.5 Common Algorithms for Linear Systems and Eigenvalue Problems

D.5.1 Direct Methods

A number of algorithms exist to solve linear systems of the kind Ax = b and Ax = λx. For the linear system, the m × m matrix A is normally assumed to be nonsingular. A large number of algorithms exists to solve those problems, see e.g. Golub and van Loan (1996). Some of those algorithms are more suited than others particularly for large and/or sparse matrices. Broadly speaking, two main classes of methods exist for solving linear systems and also eigenvalue problems, namely, direct and iterative. Direct methods are based mostly on decompositions. The main direct methods include essentially the SVD, LU and QR decompositions (Golub and van Loan 1996).

D.5.2 Iterative Methods

Case of Eigenvalue Problems

Iterative methods are based on what is known as Krylov subspaces. Given a m × m matrix A and a non-vanishing m-dimensional vector y, the sequence n−1 y, Ay,...,A y is referred to as Krylov sequence. The Krylov subspace Kn(A, y) is defined as the space spanned by a Krylov sequence, i.e.

n−1 Kn(A, y) = Span y, Ay,...,A y . (D.40)

Iterative methods to solve systems of linear equations (see below) or eigenvalue problems are generally referred to as Krylov space solvers. The Krylov sequence $\mathbf{b}, \mathbf{A}\mathbf{b}, \ldots, \mathbf{A}^{n-1}\mathbf{b}$ can be used in the approximation process of the eigenelements, but it is in general ill-conditioned, and an orthonormalisation is required. This is obtained using two main algorithms: the Lanczos and Arnoldi algorithms for Hermitian and non-Hermitian matrices, respectively (Watkins 2007; Saad 2003). The basic idea is to construct an orthonormal basis given by the $m \times n$ matrix $\mathbf{Q}_n = [\mathbf{q}_1, \ldots, \mathbf{q}_n]$ of $\mathcal{K}_n$, which is used to obtain a projection of the operator $\mathbf{A}$ onto $\mathcal{K}_n$, $\mathbf{H}_n = \mathbf{Q}_n^H \mathbf{A}\mathbf{Q}_n$, where $\mathbf{Q}_n^H$ is the conjugate transpose of $\mathbf{Q}_n$, i.e. $\mathbf{Q}_n^{*T}$. The pair $(\lambda, \mathbf{x})$, with $\mathbf{H}_n\mathbf{x} = \lambda\mathbf{x}$, provides an approximate pair of eigenvalue and associated eigenvector$^2$ of $\mathbf{A}$.

2 The number λ and vector Qkx are, respectively, referred to as Ritz value and Ritz vector of A. 518 D Matrix Algebra and Matrix Function

Lanczos Method
The Lanczos method is based on a tridiagonalisation algorithm of a Hermitian matrix (or symmetric for real cases) $\mathbf{A}$, as

$\mathbf{A}\mathbf{Q}_n = \mathbf{Q}_n\mathbf{H}_n, \qquad (D.41)$

where $\mathbf{H}_n = [\mathbf{h}_1, \ldots, \mathbf{h}_n]$ is a tridiagonal matrix. Let us designate by $[\alpha_1, \ldots, \alpha_n]$ the main diagonal of $\mathbf{H}_n$ and $[\beta_1, \ldots, \beta_{n-1}]$ its upper and sub-diagonals. Identifying the jth columns from both sides of Eq. (D.41) yields

βj qj+1 = Aqj − βj−1qj−1 − αj qj . (D.42)

The algorithm then starts from an initial vector q1 (taking q0 = 0) and obtains αj ,βj and qj+1 at each iteration step. (The vectors qi, i = 1,...n, are orthonor- mal.) After k steps, one gets

= + T AQk QkHk βkqk+1ek (D.43)

with $\mathbf{e}_k$ being the k-element vector $(0, \ldots, 0, 1)^T$. The algorithm stops when $\beta_k = 0$.
Arnoldi Method
The Arnoldi algorithm is similar to Lanczos's except that the matrix $\mathbf{H}_n = (h_{ij})$ is an upper Hessenberg matrix, which satisfies $h_{ij} = 0$ for $i \ge j + 2$. As for the Lanczos method, Eq. (D.42) yields $h_{j+1,j}\, \mathbf{q}_{j+1} = \mathbf{A}\mathbf{q}_j - \sum_{i=1}^{j} h_{ij}\mathbf{q}_i$. After k steps, one obtains

$\mathbf{A}\mathbf{Q}_k = \mathbf{Q}_k\mathbf{H}_k + h_{k+1,k}\, \mathbf{q}_{k+1}\mathbf{e}_k^T. \qquad (D.44)$

The above Eq. (D.44) can be written in a compact form as $\mathbf{A}\mathbf{Q}_k = \mathbf{Q}_{k+1}\bar{\mathbf{H}}_k$, where $\bar{\mathbf{H}}_k$ is the obtained $(k+1) \times k$ Hessenberg matrix. The matrix $\mathbf{H}_k$ is obtained from $\bar{\mathbf{H}}_k$ by deleting the last row. Note that the Arnoldi (and also Lanczos) methods are modified versions of the Gram–Schmidt orthogonalisation procedure, with Hessenberg and tridiagonal matrices involved, respectively, in the two methods. Note also that a non-symmetric Lanczos method exists, which yields a non-symmetric tridiagonal matrix (e.g. Parlett et al. 1985).
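A minimal numpy sketch (not from the text) of the Arnoldi iteration leading to Eq. (D.44); the function name and breakdown tolerance are illustrative. Eigenvalues of the leading k × k block of H are the Ritz values mentioned in footnote 2.

    import numpy as np

    def arnoldi(A, b, k):
        """Arnoldi iteration: orthonormal basis Q of the Krylov subspace
        K_k(A, b) and the (k+1) x k Hessenberg matrix H such that
        A Q[:, :k] = Q Hbar, cf. Eq. (D.44)."""
        m = len(b)
        Q = np.zeros((m, k + 1))
        H = np.zeros((k + 1, k))
        Q[:, 0] = b / np.linalg.norm(b)
        for j in range(k):
            v = A @ Q[:, j]
            for i in range(j + 1):              # modified Gram-Schmidt
                H[i, j] = Q[:, i] @ v
                v = v - H[i, j] * Q[:, i]
            H[j + 1, j] = np.linalg.norm(v)
            if H[j + 1, j] < 1e-12:             # invariant subspace found
                return Q[:, :j + 1], H[:j + 1, :j + 1]
            Q[:, j + 1] = v / H[j + 1, j]
        return Q, H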

Case of Linear Systems

The simplest iterative method is the Jacobi iteration, which solves a fixed point problem. It transforms the linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ into $\mathbf{x} = \hat{\mathbf{A}}\mathbf{x} + \hat{\mathbf{b}}$, where $\hat{\mathbf{A}} = \mathbf{I}_m - \mathbf{D}^{-1}\mathbf{A}$ and $\hat{\mathbf{b}} = \mathbf{D}^{-1}\mathbf{b}$, with $\mathbf{D}$ being either the diagonal matrix of $\mathbf{A}$ or simply the identity matrix. The fixed point iterative algorithm is then given by $\mathbf{x}_{n+1} = \hat{\mathbf{A}}\mathbf{x}_n + \hat{\mathbf{b}}$, with a given initial condition $\mathbf{x}_0$. The algorithm converges when the spectral radius of $\hat{\mathbf{A}}$ is less than unity, i.e. $\rho(\hat{\mathbf{A}}) < 1$. The computation of the nth residual vector $\mathbf{r}_n = \mathbf{b} - \mathbf{A}\mathbf{x}_n$ involves the Krylov subspace $\mathcal{K}_{n+1}(\mathbf{A}, \mathbf{r}_0)$. Other methods, like gradient and semi-iterative methods, are included among the Krylov space solvers. From an initial condition $\mathbf{x}_0$, the nth iterate satisfies

$\mathbf{x}_n - \mathbf{x}_0 = p_{n-1}(\mathbf{A})\, \mathbf{r}_0, \qquad (D.45)$

for a polynomial $p_{n-1}(.)$ of degree $n-1$, and belongs to $\mathcal{K}_n(\mathbf{A}, \mathbf{r}_0)$. The problem is then to find a good choice of $\mathbf{x}_n$ in the Krylov space. There are essentially two methods for this (Saad 2003), namely Arnoldi's method (described above), or FOM (Full Orthogonalisation Method), and the GMRES (Generalised Minimum Residual Method) algorithm. The FOM algorithm is based on the above Arnoldi orthogonalisation procedure and looks for $\mathbf{x}_n - \mathbf{x}_0$ within $\mathcal{K}_n(\mathbf{A}, \mathbf{r}_0)$ such that $\mathbf{b} - \mathbf{A}\mathbf{x}_n$ is orthogonal to this Krylov space (Galerkin condition). From the initial residual $\mathbf{r}_0$, and letting $\mathbf{r}_0 = \beta\mathbf{q}_1$, with $\beta = \|\mathbf{r}_0\|_2$, the algorithm yields an equation similar to (D.41), i.e. $\mathbf{Q}_k^T\mathbf{A}\mathbf{Q}_k = \mathbf{H}_k$ and $\mathbf{Q}_k^T\mathbf{r}_0 = \beta\mathbf{e}_1$. The approximate solution at step k is then given by

$\mathbf{x}_k = \mathbf{x}_0 + \mathbf{Q}_k\mathbf{y}_k = \mathbf{x}_0 + \beta\,\mathbf{Q}_k\mathbf{H}_k^{-1}\mathbf{Q}_k^T\mathbf{q}_1 = \mathbf{x}_0 + \beta\,\mathbf{Q}_k\mathbf{H}_k^{-1}\mathbf{e}_1, \qquad (D.46)$

= T = T where e1 Qk q1, i.e. e1 (1, 0,...,0) . Exercise Derive the above approximation (Eq. (D.46)).

Hint At the kth step, the vector xk − x0 belongs to Kk and is therefore of the form T − = T − Qky. The above Galerkin condition means that Qk (b Axk) 0, that is, Qk r0 T T = Qk AQk y 0, and using Eq. (D.41) yields Eq. (D.46).

The GMRES algorithm seeks vectors xn − x0 within Kn(A, r0) such that b − Axn is orthogonal to AKn. This condition implies the minimisation of Axn − b2 (e.g. Saad and Schultz 1985). Variants of these methods and other algorithms exist for particular choices of the matrix A, see e.g. Saad (2003) for more details. An approximate solution at step k, which is in the space x0 + Kk, is given by

∗ xk = x0 + Qkz . (D.47)

Using Eq. (D.44), one gets

∗ ∗ Axk − b = r0 − AQkz = βq1 − Qk+1Hkz , (D.48)

∗ ∗ which yields Qk+1(βe1 − Hkz ). The vector z is then given by

$\mathbf{z}^* = \underset{\mathbf{z}}{\mathrm{argmin}}\, \|\beta\mathbf{q}_1 - \mathbf{Q}_{k+1}\bar{\mathbf{H}}_k\mathbf{z}\|_2 = \underset{\mathbf{z}}{\mathrm{argmin}}\, \|\beta\mathbf{e}_1 - \bar{\mathbf{H}}_k\mathbf{z}\|_2, \qquad (D.49)$

where the last equality holds because $\mathbf{Q}_{k+1}$ is orthonormal ($\|\mathbf{Q}_{k+1}\mathbf{x}\|_2 = \|\mathbf{x}\|_2$), and $\bar{\mathbf{H}}_k$ is the matrix defined below Eq. (D.44).
Remark The Krylov space can be used, for example, to approximate the exponential of a matrix, which is useful particularly for large matrices. Given a matrix $\mathbf{A}$ and a vector $\mathbf{v}$, an approximation of $e^{\mathbf{A}}\mathbf{v}$, using $\mathcal{K}_k(\mathbf{A}, \mathbf{v})$, is given by (Saad 1990)

tA tHk e v ≈ βQke e1, (D.50) with β =v2. Equation (D.50) can be used, for example, to compute the solution of an inhomogeneous system of first-order ODE. Also, and as pointed out by Saad (1990), Eq. (D.50) can be used to approximate the (matrix) integral:

$\mathbf{X} = \int_0^{\infty} e^{u\mathbf{A}}\, \mathbf{b}\mathbf{b}^T\, e^{u\mathbf{A}^T}\, du, \qquad (D.51)$

which provides a solution to the Lyapunov equation $\mathbf{A}\mathbf{X}^T + \mathbf{X}\mathbf{A}^T + \mathbf{b}\mathbf{b}^T = \mathbf{O}$.

Appendix E Optimisation Algorithms

E.1 Background

Optimisation problems are ubiquitous in all fields of science. Various optimisation algorithms exist in the literature, and they depend on whether the first and/or second derivative of the function to be optimised is available. These algorithms also depend on the function to be optimised and the type of constraints. Since minimisation is the opposite of maximisation, we will mainly focus on the former. In general there are four types of objective functions (and constraints): • linear; • quadratic; • smooth nonlinear; • non-smooth. A minimisation problem without constraints is an unconstrained minimisation problem. When the objective function is linear and the constraints are linear inequalities, one has a linear programming, see e.g. Foulds (1981, chap. 2). When the objective function and the constraints are nonlinear, one gets a nonlinear programme. Minimisation algorithms also vary according to the nature of the problem. For instance in the unconstrained case, Newton’s method can be used in the multivariate case when the gradient and the second derivative of the objective function are provided. When only the first derivative is provided, a quasi-Newton method can be applied. When the dimension of the problem is large, conjugate gradient algorithms can be used. In the presence of nonlinear constraints, a whole class of gradient methods, such as reduced and projected gradient methods, can be used. Convex programming methods can be used when we have linear or nonlinear inequalities as constraints. In most cases, a constrained problem can be transformed into an unconstrained problem using Lagrange multipliers. In this appendix we provide a short review of the most widely used optimisation algorithms that are used in atmospheric science.


For a detailed review of the various optimisation problems and algorithms, the reader is referred to Gill et al. (1981). There is in general a large difference between one- and multidimensional problems. The univariate and bivariate minimisation problems are in general not difficult to solve since the function can be plotted and visualised, particularly when the function is smooth. When the first derivative is not provided, methods like the golden section can be used. The problem gets more difficult for many variables when there are multiple minima. In fact, the main obstacle to minimisation in the multivariate case is the problem of local minima. For example, the global minimum can be attained when the function is quadratic:

$f(\mathbf{x}) = \frac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} - \mathbf{b}^T\mathbf{x} + c, \qquad (E.1)$

where $\mathbf{A}$ is a symmetric matrix. The quadratic Eq. (E.1) is a typical example that deserves attention. The gradient of $f(.)$ is $\mathbf{A}\mathbf{x} - \mathbf{b}$, and a necessary condition for optimality is given by $\nabla f(\mathbf{x}) = \mathbf{0}$. The solution to this linear equation provides only a partial answer, however. To get a complete answer, one has to compute the second derivative at the solution of the necessary condition, yielding the Hessian:

$\mathbf{H} = (h_{ij}) = \left( \frac{\partial^2 f}{\partial x_i \partial x_j} \right) = \mathbf{A}. \qquad (E.2)$

Clearly, if A is positive semi-definite, the solution of the necessary condition is a global minimum. In general, however, the function to be minimised is non- quadratic, and more advanced tools have to be applied, and this is the aim of this appendix.

E.2 Single Variable

In general the one-dimensional case is the easiest minimisation problem, particu- larly when the objective function is smooth.

E.2.1 Direct Search

A simple direct search method is based on successive function evaluations, aiming at reducing the length of the interval containing the minimum. The most widely used methods are:
• Dichotomous search—It is based on successively halving the interval containing the minimum. After n iterations, the length $I_n$ of the interval containing the minimum is

$I_n = \frac{1}{2^{n/2}}\, I_0,$

where $I_0 = x_2 - x_1$ if $[x_1, x_2]$ is the initial interval.
• Golden section—It is based on subdividing the initial interval $[x_1, x_2]$ into three subintervals using two extra points $x_3$ and $x_4$, with $x_1 < x_3 < x_4 < x_2$, given at iteration (i) by

$x_3^{(i)} = \frac{\tau - 1}{\tau}\left( x_2^{(i)} - x_1^{(i)} \right) + x_1^{(i)}$
$x_4^{(i)} = \frac{1}{\tau}\left( x_2^{(i)} - x_1^{(i)} \right) + x_1^{(i)},$

where $\tau$ is the Golden number$^1$ $\frac{1+\sqrt{5}}{2}$. There are various other methods such as quadratic interpolation and Powell's method, see Everitt (1987) and Box et al. (1969).
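A minimal Python sketch (not from the text) of the golden section search using the two interior points defined above; the tolerance and the example function are illustrative.

    import math

    def golden_section(f, x1, x2, tol=1e-8):
        """Golden-section search for the minimum of a unimodal function f
        on [x1, x2]."""
        tau = (1.0 + math.sqrt(5.0)) / 2.0
        x3 = x1 + (tau - 1.0) / tau * (x2 - x1)
        x4 = x1 + (x2 - x1) / tau
        f3, f4 = f(x3), f(x4)
        while abs(x2 - x1) > tol:
            if f3 < f4:                  # minimum lies in [x1, x4]
                x2, x4, f4 = x4, x3, f3
                x3 = x1 + (tau - 1.0) / tau * (x2 - x1)
                f3 = f(x3)
            else:                        # minimum lies in [x3, x2]
                x1, x3, f3 = x3, x4, f4
                x4 = x1 + (x2 - x1) / tau
                f4 = f(x4)
        return 0.5 * (x1 + x2)

    # golden_section(lambda x: (x - 2.0)**2, 0.0, 5.0) is approximately 2.0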

E.2.2 Derivative Methods

When the first and perhaps the second derivatives are available, then it is known that the two conditions:

d ∗ = dx f(x ) 0 2 (E.3) d f(x∗)>0 dx2 are sufficient conditions for x∗ to be a minimum of f(). In this case the most widely used method is based on Newton algorithm, also known as Newton–Raphson, and df (x) df (x) aims at computing the zero of dx based on the tangent line at dx . The algorithm reads

$x_{k+1} = x_k - \frac{df(x_k)/dx}{d^2 f(x_k)/dx^2} \qquad (E.4)$

when $d^2 f(x_k)/dx^2 \neq 0$. Note that when the second derivative is not provided, the denominator of Eq. (E.4) can be approximated using a finite difference scheme:

$x_{k+1} = x_k - \frac{x_k - x_{k-1}}{\frac{df}{dx}(x_k) - \frac{df}{dx}(x_{k-1})}\, \frac{df}{dx}(x_k).$

$^1$ It is the limit of $\frac{u_{n+1}}{u_n}$ when $n \to \infty$, where $u_0 = u_1 = 1$ and $u_{n+1} = u_n + u_{n-1}$.

E.3 Direct Multivariate Search

As for the one-dimensional case, there are direct search methods and gradient- based algorithms. Among the most widely used direct search methods, one finds the following:

E.3.1 Downhill Simplex Method

This method is due to Nelder and Mead (1965) and was originally described by Spendley et al. (1962). The method is based on a simplex,2 generally with mutually equidistant vertices, from which a new simplex is formed simply by reflection of the vertex (where the objective function is largest) through the opposite facet, i.e. through the hyperplane formed by the remaining m points (or vertices), to a “lower” vertex where the function is smaller. Details on the method can be found, e.g. in Box et al. (1969) and Press et al. (1992). The method can be useful for a quick search but can become inefficient for large dimensions.

E.3.2 Conjugate Direction/Powell’s Method

Most multivariate minimisation algorithms attempt to find the best search direction along which the function can be minimised. The conjugate direction method is based on minimising a quadratic function and is known as quadratically convergent. Consider the quadratic function:

1 f(x) = xT Gx + bT x + c. (E.5) 2

The directions u and v are said to be conjugate (with respect to G)ifuT Gv = 0. The method is then based on finding a set of mutually conjugate search directions along which minimisation can proceed. Powell (1964) has shown that if a set of search (i) (i) (i) (i) = directions u1 ,...un ,atthe ith iteration, are normalised so that uk Guk 1, = (i) (i) k 1,...,n, then det u1 ,...,un is maximised only when the vectors are (linearly independent) mutually conjugate. This provides a way of finding a new search direction that can replace an existing one. The procedure is therefore to minimise the function along individual lines and proceeds as follows. Starting from

$^2$ A simplex in an m-dimensional space is a polygonal geometric figure with m+1 vertices, or m+1 facets. Triangles and pyramids are examples of simplexes in two- and three-dimensional spaces, respectively.
$\mathbf{x}_0$ and a direction $\mathbf{u}_0$, one minimises the univariate function $f(\mathbf{x}_0 + \lambda\mathbf{u}_0)$ and then replaces $\mathbf{x}_0$ and $\mathbf{u}_0$ by $\mathbf{x}_0 + \lambda\mathbf{u}_0$ and $\lambda\mathbf{u}_0$, respectively. Powell's algorithm runs as follows:

0. Initialise $\mathbf{u}_i = \mathbf{e}_i$, i.e. the canonical basis vectors, $i = 1, \ldots, m$.
1. Initialise $\mathbf{x} = \mathbf{x}_0$.
2. Minimise $f(\mathbf{x}_{i-1} + \lambda\mathbf{u}_i)$, and set $\mathbf{x}_i = \mathbf{x}_{i-1} + \lambda\mathbf{u}_i$, $i = 1, \ldots, m$.
3. Set $\mathbf{u}_i = \mathbf{u}_{i+1}$, $i = 1, \ldots, m-1$, and $\mathbf{u}_m = \mathbf{x}_m - \mathbf{x}_0$.
4. Minimise $f(\mathbf{x}_m + \lambda\mathbf{u}_m)$, set $\mathbf{x}_0 = \mathbf{x}_m + \lambda\mathbf{u}_m$, and then go to 1.
Powell (1964) showed that the procedure yields a set of k mutually conjugate directions after k iterations. The procedure has to be reinitialised with new vectors after every m iterations in order to avoid linear dependence of the obtained directions, see Press et al. (1992) for further details.
Remark The reason for using one-dimensional minimisation is conjugacy. In fact, if $\mathbf{u}_1, \ldots, \mathbf{u}_m$ are mutually conjugate with respect to $\mathbf{G}$, the required minimum is taken to be

m x1 = x0 + αkuk, k=1 = = + where the parameters αk, k 1,...,m, are chosen to minimise f(x1) f(x0 m k=1 αkuk). These coefficients therefore minimise   m 1 f(x ) = α2uT Gu + α uT (Gx + b) + f(x ) (E.6) 1 2 i i i i i 0 0 i=1 based on conjugacy of ui, i = 1,...,m. Hence, the effect of searching along ui 1 2 T + T + is to find αi that minimises 2 αi ui Gui αiui (Gx0 b), and this value of αi is independent of the other terms in Eq. (E.6). It is shown precisely by Fletcher (1972) that for a quadratic function, with particular definite Hessian G, when the search directions ui, i = 1,...,m, are conjugate of each other, then the minimum is found in at most m iterations. Furthermore, (i+1) (i) (i) x = x + α ui is the minimum point in the subspace generated by the initial (1) T = approximation x , and the directions u1,...,ui, and the identities gi+1uj 0, (i+1) j = 1,...,i,, also hold (gi+1 =∇f(x )).

E.3.3 Simulated Annealing

This algorithm is based on concepts from statistical mechanics and makes use of the Boltzmann probability of energy distribution in thermodynamical systems in equilibrium (Metropolis et al. 1953). The method uses Monte Carlo simulation to generate moves and is particularly useful because it can escape local minima. The algorithm can be applied to continuous and discrete problems (Press et al. 1992), see also Hannachi and Legras (1995) for an application to atmospheric low-frequency variability.

E.4 Multivariate Gradient-Based Methods

Unlike direct search methods, gradient-based approaches use the gradient of the objective function. Here we assume that the (smooth) objective function can be approximated by

1 f(x + δx) = f(x) + g(x)T δx + δxT Hδx + o(|δx|2), (E.7) 2 where g(x) =∇f(x), and H = ∂ f(x) are, respectively, the gradient vector ∂xi ∂xj and of f(x). Gradient methods also belong to the class of descent algorithms where the approximation of the desired minimum at various iterations is perturbed in an additive manner as

xm+1 = xm + λu. (E.8)

Descent algorithms are distinguished by the manner in which the search direction u is chosen. Most gradient methods use the gradient as search direction since the gradient ∇f(x) points in the direction where the function increases most rapidly.

E.4.1 Steepest Descent

The steepest descent uses $\mathbf{u} = -\frac{\mathbf{g}}{\|\mathbf{g}\|}$ and chooses $\lambda$ that minimises the univariate objective function:

h(λ) = f(xm + λu). (E.9)

The solution at iteration m + 1 is then given by

$\mathbf{x}_{m+1} = \mathbf{x}_m - \lambda\, \frac{\nabla f(\mathbf{x}_m)}{\|\nabla f(\mathbf{x}_m)\|}. \qquad (E.10)$

Note that Eq. (E.9) is quadratic when Eq. (E.7) is used, in which case the solution is given by3

$\lambda = \frac{\|\nabla f(\mathbf{x}_m)\|^3}{\nabla f(\mathbf{x}_m)^T\, \mathbf{H}\, \nabla f(\mathbf{x}_m)}. \qquad (E.11)$

Note that because of the one-dimensional minimisation at each step, the method can be computationally expensive. Some authors use decreasing step-size selection λ = αk,(0<α<1), for k = 1, 2,...,until the first k where f()has decreased (Cadzow 1996).
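A minimal numpy sketch (not from the text) of steepest descent with the decreasing step-size rule λ = α^k just mentioned, i.e. a simple backtracking strategy; the example quadratic is illustrative.

    import numpy as np

    def steepest_descent(f, grad, x0, alpha=0.5, n_iter=200, tol=1e-8):
        """Steepest descent, Eq. (E.10), accepting the first step
        lambda = alpha**k, k = 1, 2, ..., that decreases f."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iter):
            g = grad(x)
            gnorm = np.linalg.norm(g)
            if gnorm < tol:
                break
            u = -g / gnorm                  # normalised descent direction
            lam = alpha
            while f(x + lam * u) >= f(x) and lam > 1e-16:
                lam *= alpha                # lambda = alpha**k
            x = x + lam * u
        return x

    # example on the quadratic (E.1) with A = diag(1, 10), b = 0, c = 0:
    # f = lambda x: 0.5 * (x[0]**2 + 10 * x[1]**2)
    # grad = lambda x: np.array([x[0], 10 * x[1]])
    # steepest_descent(f, grad, [3.0, 1.0])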

E.4.2 Newton–Raphson Method

This is a generalisation of the one-dimensional Newton–Raphson method and is based on the minimisation of the quadratic form Eq. (E.1), where the search direction is given by

− u =−H 1∇f(x). (E.12)

At the (m + 1)th iteration, the approximation to the minimum is given by

−1 xm+1 = xm − H (xm)∇f(xm). (E.13)

−1 Note that it is also possible to choose xm+1 = xm − λH ∇f(xm), where λ can be found through a one-dimensional minimisation as in the steepest descent. Newton method requires the inverse of the Hessian at each iteration, and this can be quite expensive particularly for large problems. There is also another drawback of the approach, namely, the convergence towards the minimum can be secured only if the Hessian is positive definite. Similarly, the steepest descent is no better since it is known to exhibit a linear convergence, i.e. a slow convergence rate. These drawbacks have led to the development of more advanced and improved algorithms. Among these methods, two main classes of algorithms stand out, namely, the conjugate gradient and the quasi-Newton methods discussed next.

$^3$ One can eliminate the Hessian from Eq. (E.11) by choosing a first guess $\lambda_0$ for $\lambda$ and then using Eq. (E.7) with $\delta\mathbf{x} = -\lambda_0 \frac{\mathbf{g}}{\|\mathbf{g}\|}$, which yields $\lambda = \frac{\lambda_0^2}{2}\, \|\mathbf{g}\|\, \left[ f\!\left(\mathbf{x} - \lambda_0 \frac{\mathbf{g}}{\|\mathbf{g}\|}\right) - f(\mathbf{x}) + \lambda_0\|\mathbf{g}\| \right]^{-1}$.

E.4.3 Conjugate Gradient

It is possible that the descent direction −g =−∇f(x) and the direction to the minimum may be near to orthogonality, which can explain the slow convergence rate of the steepest descent. For a quadratic function, for example, the best search direction is conjugate to that taken at the previous step (Fletcher 1972, th. 1). This is the basic idea of conjugate gradient for which the new search direction is constructed to be conjugate to the gradient of the previous step. The method can be thought of as an association of conjugacy with steepest descent (Fletcher 1972), and is also known as Fletcher–Reeves (or projection) method. From the set of conjugate gradients −gk, k = 1,...,m, a new set of conjugate directions is formed via linear combination as

u_k = −g_{k−1} + Σ_{j=1}^{k−1} α_{jk} u_j,   (E.14)

where α_{jk} = −g_{k−1}^T H u_j / (u_j^T H u_j), j = 1, ..., k − 1, and g_k = ∇f(x_k). Since in a quadratic form, e.g. Eq. (E.5), one has

δg_k = g_{k+1} − g_k = H δx_k = H(x_{k+1} − x_k), and because in a linear (one-dimensional) search δx_k = λ_k u_k, one gets

α_{jk} = −g_{k−1}^T δg_{j−1} / (u_j^T δg_{j−1})   (E.15)

for j = 1, ..., k − 1. Furthermore, α_{jk}, j = 1, ..., k − 2, vanish,⁴ yielding

u_k = −g_{k−1} + (g_{k−1}^T δg_{k−2} / (u_{k−1}^T δg_{k−2})) u_{k−1},

which simplifies to

u_k = −g_{k−1} + (g_{k−1}^T g_{k−1} / (g_{k−2}^T g_{k−2})) u_{k−1},   (E.16)

for k = 1, ..., n, with u_1 = −g_0. For a quadratic function, the algorithm converges in at most n iterations, where n is the problem dimension. For a general function, Eq. (E.16) can be used to update the search direction at every iteration, and in practice u_k is reset to −g_{k−1} after every n iterations.

⁴ After k − 1 one-dimensional searches in (u_1, ..., u_{k−1}), the quadratic form is minimised at x_{k−1}; then g_{k−1} is orthogonal to u_j, j = 1, ..., k − 2, because of the one-dimensional requirement for minimisation in each direction u_m, m = 1, ..., k − 2, i.e. d f(x_{k−2} + α u_j)/dα = g_{k−1}^T u_j = 0. Furthermore, since the vectors u_j are linear combinations of g_i, i = 1, ..., j, the vectors g_j are also linear combinations of u_1, ..., u_j, i.e. g_j = Σ_{i=1}^{j} α_i u_i; hence g_{k−1}^T g_j = 0, j = 1, ..., k − 2.
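A minimal Python sketch of the Fletcher–Reeves update, Eq. (E.16), applied to a quadratic function is given below; the one-dimensional line search is exact in the quadratic case, and the test matrix is an arbitrary illustration.

```python
import numpy as np

def fletcher_reeves(H, b, x0, tol=1e-10):
    """Minimise f(x) = 0.5 x^T H x - b^T x with the Fletcher-Reeves conjugate gradient."""
    x = np.asarray(x0, dtype=float)
    g = H @ x - b                 # gradient of the quadratic
    u = -g                        # first direction u_1 = -g_0
    for _ in range(len(b)):       # at most n iterations for a quadratic function
        if np.linalg.norm(g) < tol:
            break
        lam = -(g @ u) / (u @ H @ u)      # exact one-dimensional minimisation
        x = x + lam * u
        g_new = H @ x - b
        beta = (g_new @ g_new) / (g @ g)  # Fletcher-Reeves coefficient, Eq. (E.16)
        u = -g_new + beta * u
        g = g_new
    return x

H = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(fletcher_reeves(H, b, x0=[0.0, 0.0]))   # solves H x = b
```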

E.4.4 Quasi-Newton Method

The Newton–Raphson direction −H^{-1}g may be thought of as an improvement (or correction) to the steepest descent direction −g = −∇f(x). The quasi-Newton approach attempts to combine the simplicity of the steepest descent with the quadratic convergence rate of the basic second-order Newton method. It is based on approximating the inverse of the Hessian matrix H. In the modified Newton–Raphson, Goldfeld et al. (1966) propose to use the following iterative scheme:

x_{k+1} = x_k − (λ I_n + H)^{-1} g   (E.17)

based on the approximation:

(I_n + λ^{-1} H)^{-1} ≈ I_n − λ^{-1} H + λ^{-2} H² + ⋯.   (E.18)

The most widely used quasi-Newton procedure, however, is the Davidon–Fletcher–Powell method (Fletcher and Powell 1963), sometimes referred to as the variable metric method, which is based on approximating the inverse H^{-1} by an iterative procedure for which the kth iteration reads

x_{k+1} = x_k − λ_k S_k g_k,   (E.19)

where S_k is a sequence that converges to H^{-1} and is given by

S_{k+1} = S_k − (S_k δg_k δg_k^T S_k)/(δg_k^T S_k δg_k) + (δx_k δx_k^T)/(δx_k^T δg_k),   (E.20)

where δg_k = g_{k+1} − g_k and δx_k = x_{k+1} − x_k = −λ_k S_k g_k. Note that there exist in the literature various other formulae for updating S_k, see e.g. Adby and Dempster (1974) and Press et al. (1992). These techniques can be adapted and simplified further depending on the objective function, such as in the case of a sum of squares encountered in least squares regression analysis, see e.g. Everitt (1987).
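The DFP update of Eq. (E.20) can be sketched in a few lines of Python; the line search below is a crude backtracking rule and the test function is an arbitrary quadratic, both illustrative assumptions.

```python
import numpy as np

def dfp(f, grad, x0, max_iter=100, tol=1e-8):
    """Davidon-Fletcher-Powell quasi-Newton method with a simple backtracking line search."""
    x = np.asarray(x0, dtype=float)
    S = np.eye(x.size)                 # initial approximation of H^{-1}
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        u = -S @ g                     # quasi-Newton direction
        lam = 1.0
        while f(x + lam * u) > f(x):   # shrink the step until f decreases
            lam *= 0.5
        dx = lam * u
        x_new = x + dx
        dg = grad(x_new) - g
        # DFP update, Eq. (E.20)
        S = (S - np.outer(S @ dg, S @ dg) / (dg @ S @ dg)
               + np.outer(dx, dx) / (dx @ dg))
        x, g = x_new, grad(x_new)
    return x

H = np.array([[5.0, 2.0], [2.0, 4.0]])
f = lambda x: 0.5 * x @ H @ x
grad = lambda x: H @ x
print(dfp(f, grad, x0=[3.0, -1.0]))    # converges towards the origin
```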

E.4.5 Ordinary Differential Equations-Based Methods

Optimisation techniques based on solving systems of ordinary differential equations have also been proposed and used for some time, although not much in atmospheric science, see e.g. Hannachi et al. (2006); Hannachi (2007). The approach seeks the solution to

min_x F(x),   (E.21)

where x is an n-dimensional real vector, by following trajectories of a system of ordinary differential equations. For instance, we know that if x∗ is a solution to Eq. (E.21), then ∇F(x∗) = 0. Therefore, by integrating the dynamical system

dx/dt = −∇F(x),   (E.22)

starting from a suitable initial condition, one should converge in principle to x∗. This method can be regarded as the continuous version of the steepest descent algorithm. In fact, Eq. (E.22) becomes equivalent to the steepest descent algorithm when dx/dt is approximated by the simple finite difference (x_{t+h} − x_t)/h. The system of ODEs, Eq. (E.22), can be interpreted as the equation describing a particle moving in a potential well given by F(·). Note that Eq. (E.22) can also be replaced by a continuous Newton equation of the form:

dx/dt = −H^{-1}(x) ∇F(x),   (E.23)

where H is the Hessian matrix of F(·) at x. Some of these techniques have been reviewed in Brown (1986), Botsaris (1978), Aluffi-Pentini et al. (1984) and Snyman (1982). It is argued, for example, in Brown (1986) that ordinary differential equation-based methods compare very favourably with conventional Newton and quasi-Newton algorithms; see Hannachi et al. (2006) for an application to simplified EOFs.
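As an illustration, the gradient system of Eq. (E.22) can be integrated numerically with a standard ODE solver; the following Python sketch uses scipy.integrate.solve_ivp on an arbitrary test function, integrating until the trajectory has essentially reached the stationary point.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Test function F(x) = (x1 - 1)^2 + 2*(x2 + 0.5)^2 and its gradient
grad_F = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)])

# Gradient flow dx/dt = -grad F(x), Eq. (E.22)
rhs = lambda t, x: -grad_F(x)

sol = solve_ivp(rhs, t_span=(0.0, 50.0), y0=[5.0, 5.0], rtol=1e-8)
print(sol.y[:, -1])   # end point of the trajectory, close to the minimiser (1, -0.5)
```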

E.5 Constrained Minimisation

E.5.1 Background

Constrained minimisation problems are more subtle than unconstrained problems. We give here a brief review, and for more details the reader is referred to more specialised textbooks on optimisation, see e.g. Gill et al. (1981). A typical (smooth) constrained minimisation problem takes the form:

min_x f(x)
s.t.  g_i(x) = 0,  i = 1, ..., r,   (E.24)
      h_j(x) ≤ 0,  j = 1, ..., m.

When the functions involved in Eq. (E.24) are convex or polynomials, the problem is known under the name of mathematical programming. For instance, if f(·) is quadratic or convex and the constraints are linear, efficient programming procedures exist for the minimisation. In general, most algorithms attempt to transform Eq. (E.24) into an unconstrained problem. This can be done easily, via a change of variable, when the constraints are simple. The following examples illustrate this.

• For constraints of the form x ≥ 0, the change of variable is x = y².
• For a ≤ x ≤ b, one can take x = (a + b)/2 + ((b − a)/2) sin y.

In a number of cases the inequality h(x) ≤ 0 can be handled by introducing a slack variable y, yielding

h(x) + y2 = 0.
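For illustration, a bound-constrained one-dimensional problem can be turned into an unconstrained one with the sine substitution above and then minimised with any unconstrained routine; the objective function and bounds in the following Python sketch are arbitrary choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Minimise f(x) = (x - 3)^2 subject to 0 <= x <= 2
a, b = 0.0, 2.0
f = lambda x: (x - 3.0) ** 2

# Substitution x = (a+b)/2 + (b-a)/2 * sin(y) makes the problem unconstrained in y
x_of_y = lambda y: 0.5 * (a + b) + 0.5 * (b - a) * np.sin(y)
res = minimize_scalar(lambda y: f(x_of_y(y)))
print(x_of_y(res.x))   # approximately 2, the constrained minimiser
```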

Equality constraints can be handled in general by introducing Lagrange multipliers. Under some regularity conditions,⁵ a necessary condition for x∗ to be a constrained local minimum of Eq. (E.24) is the existence of Lagrange multipliers u∗ = (u∗1, ..., u∗r)^T and v∗ = (v∗1, ..., v∗m)^T such that

∇f(x∗) + Σ_{i=1}^{r} u∗i ∇g_i(x∗) + Σ_{j=1}^{m} v∗j ∇h_j(x∗) = 0,
v∗j h_j(x∗) = 0,  j = 1, ..., m,   (E.25)
v∗j ≥ 0,  j = 1, ..., m;

the conditions given by Eq. (E.25) are known as the Kuhn–Tucker optimality conditions and express the stationarity of the Lagrangian:

L(x; u, v) = f(x) + Σ_{i=1}^{r} u_i g_i(x) + Σ_{j=1}^{m} v_j h_j(x)   (E.26)

at x∗ for the optimum values u∗ and v∗. Note that the first vector equation in Eq. (E.25) can be solved by minimising the sum of squares of its elements, i.e. min Σ_{k=1}^{n} (∂L/∂x_k)². In mathematical programming, the system Eq. (E.25) is generally referred to as the dual problem of Eq. (E.24).

⁵ Namely, linear independence between ∇h_j(x∗) and ∇g_i(x∗), i = 1, ..., r, for all j satisfying h_j(x∗) = 0.
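As a simple illustration of Eqs. (E.24)–(E.26), the following Python sketch solves a small equality-constrained problem with scipy.optimize.minimize and then checks the stationarity of the Lagrangian at the reported solution; the problem itself is an arbitrary example.

```python
import numpy as np
from scipy.optimize import minimize

# min f(x) = x1^2 + x2^2  subject to  g(x) = x1 + x2 - 1 = 0
f = lambda x: x[0] ** 2 + x[1] ** 2
g = lambda x: x[0] + x[1] - 1.0

res = minimize(f, x0=[0.0, 0.0], constraints=[{"type": "eq", "fun": g}])
x_star = res.x                       # expected (0.5, 0.5)

# Kuhn-Tucker stationarity: grad f + u * grad g = 0, with grad f = 2x and grad g = (1, 1)
u_star = -2.0 * x_star[0]            # solve the first component for the multiplier
print(x_star, np.allclose(2.0 * x_star + u_star * np.ones(2), 0.0, atol=1e-6))
```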

E.5.2 Approaches for Constrained Minimisation

Lagrangian Method

It is based on minimising, at each iteration, the Lagrangian:

L(x; u, v) = f(x) + u_k^T g(x) + v_k^T h(x),   (E.27)

yielding a minimum x_{k+1} at the next iteration step k + 1. The multipliers u_{k+1} and v_{k+1} are taken to be the optimal multipliers for the linearised constraints:

g(x_{k+1}) + (x − x_{k+1})^T ∇g(x_{k+1}) = 0,
h(x_{k+1}) + (x − x_{k+1})^T ∇h(x_{k+1}) ≤ 0.   (E.28)

This method is based on linearising the constraints about the current point x_{k+1}. More details can be found in Adby and Dempster (1974) and Gill et al. (1981). Note that in most iterative techniques, an initial feasible point can be obtained by minimising Σ_j h_j^+(x) + Σ_i g_i²(x), where h_j^+ denotes the positive part of h_j.

Penalty Function

The basic idea of the penalty approach is as follows. In the search for a constrained minimum of some function, one often encounters the common situation where the constraints are of the form g(x) ≤ 0, and at each iteration the newly formed x has to satisfy these constraints. A simple way to handle this is by forming a linear combination of the elements of g(x), i.e. u^T g(x), known as the penalty function, that accounts for the positive components of g(x). The components of u are zero whenever the corresponding components of g(x) do not violate the constraints (i.e. are non-positive) and large positive otherwise. One then minimises the sum of the original objective function and the penalty function, i.e. the penalised objective function. Minimising the penalised objective function can prevent the search algorithm from choosing directions where the constraints are violated. In general terms, the penalty method is based on sequentially minimising an unconstrained problem of the form:

F(x) = f(x) + Σ_j w_j G(h_j(x), ρ) + Σ_i H(g_i(x), ρ),   (E.29)

where w_j, j = 1, ..., m, and ρ are parameters that can change value during the minimisation, and usually ρ decreases to zero as the iteration number increases. The functions G(·) and H(·) are penalty functions. For example, the function

G(u, ρ) = u²/ρ   (E.30)

is one of the widely used penalties. When inequality constraints are present, and for a fixed ρ, the barrier function G(·) is non-zero in the interior of the feasible region (h_j(x) ≤ 0, j = 1, ..., m) and infinite on its border. This maintains the iterates x_k inside the feasible set, and as ρ → 0 the constrained minimum is approached. Examples of barrier functions in this case include log(−h(x)) and ρ/h²(x). The following penalty function

ρ³ Σ_j w_j / h_j²(x) + (1/ρ) Σ_i g_i²(x)   (E.31)

has also been used for problem Eq. (E.24).
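A minimal Python sketch of the sequential penalty idea in Eq. (E.29) is shown below, using the quadratic penalty G(u, ρ) = u²/ρ for an equality constraint and letting ρ decrease between outer iterations; the test problem and the schedule for ρ are arbitrary illustrations.

```python
import numpy as np
from scipy.optimize import minimize

# min f(x) = (x1 - 2)^2 + (x2 - 1)^2  subject to  g(x) = x1 + 2*x2 - 2 = 0
f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
g = lambda x: x[0] + 2.0 * x[1] - 2.0

x = np.array([0.0, 0.0])
for rho in [1.0, 0.1, 0.01, 0.001]:                      # decreasing penalty parameter
    penalised = lambda x, r=rho: f(x) + g(x) ** 2 / r    # Eq. (E.29) with G(u, rho) = u^2/rho
    x = minimize(penalised, x0=x).x                      # warm-start from the previous solution
print(x, g(x))   # approaches the constrained minimiser (1.6, 0.2), with g(x) -> 0
```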

Gradient Projection

This method is based on finding search directions by projecting the gradient −g = −∇f(x) onto the hyperplane tangent to the feasible set, i.e. the set satisfying the constraints, at the current point x. The inequality constraints (that are not satisfied) and the equality constraints are linearised around the current point, i.e. by considering

Kx = 0,   (E.32)

where K is an (r + l₁) × n matrix, and l₁ is the number of constraints with h_j(x) ≥ 0. Then at each iteration the constraints are linear, and the direction is obtained by projecting −g onto the tangent space to obtain u, i.e.

−g = u + K^T w.   (E.33)

Using Eq. (E.32), one gets

w = −(K K^T)^{-1} K g,   (E.34)

and the negative projected gradient reads

u = −[I_n − K^T (K K^T)^{-1} K] g.   (E.35)

The algorithm goes as follows:

0. Choose x_0 from the feasible set.

1. Linearise the equations g_i(·), i = 1, ..., r, and the currently binding inequalities, i.e. those for which h_j(x_k) ≥ 0, to compute K in Eq. (E.32).
2. Compute the projected gradient u from Eq. (E.35).
3. Minimise the one-dimensional function f(x_k + λ_{k+1} u), and then set x_{k+1}^{(1)} = x_k + λ_{k+1} u. If x_{k+1}^{(1)} is feasible, then set x_{k+1}^{(2)} = x_{k+1}^{(1)}; otherwise use a suitable version of Newton's method applied to (1/ρ) Σ_j h_j²(x) to find a point x_{k+1}^{(2)} on the boundary of the feasible region.
4. If f(x_{k+1}^{(2)}) ≤ f(x_k), set x_{k+1} = x_{k+1}^{(2)} and go to 1. Otherwise go to 3 and solve for λ_{k+1} by generating, e.g., a sequence x_{k+1}^{(t)} = x_k + (1/τ^{t−2}) λ_{k+1} u until f(x_{k+1}^{(t)}) ≤ f(x_k) is satisfied.
5. Iterate steps 1 to 4 until the constrained minimum is obtained. The optimal multipliers v_i corresponding to the binding inequalities and u∗ are given by w in Eq. (E.34).
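The following Python sketch illustrates the projection step of Eqs. (E.33)–(E.35) for a problem with a single linear equality constraint; it is a bare-bones illustration (fixed step size, equality constraint only, quadratic test function chosen arbitrarily), not the full algorithm above.

```python
import numpy as np

# min f(x) = (x1 - 3)^2 + (x2 - 2)^2  subject to the linear constraint x1 + x2 = 1
grad_f = lambda x: np.array([2.0 * (x[0] - 3.0), 2.0 * (x[1] - 2.0)])
K = np.array([[1.0, 1.0]])                   # constraint matrix, Eq. (E.32)

# Projector onto the tangent space {u : K u = 0}, Eq. (E.35)
P = np.eye(2) - K.T @ np.linalg.inv(K @ K.T) @ K

x = np.array([0.0, 1.0])                     # feasible starting point (x1 + x2 = 1)
for _ in range(200):
    u = -P @ grad_f(x)                       # negative projected gradient
    x = x + 0.1 * u                          # small fixed step keeps x on the constraint
print(x)                                     # approaches (1, 0), the constrained minimiser
```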

Other Gradient-Related Methods

Another gradient-related approach is the multiple gradient summation, where the search direction is given by

u = −( ∇f(x_k)/‖∇f(x_k)‖ + Σ_j ∇h_j(x_k)/‖∇h_j(x_k)‖ ),   (E.36)

where the summation is taken over the violated constraints at the current point x_k. Another search method, based on a small-step gradient, is given by

u = −∇f(x_k) − Σ_{j=1}^{m} w_j(x_k) ∇h_j(x_k),   (E.37)

where w_j(x_k) = w if h_j(x_k) > 0 (w is a suitably chosen large constant) and zero otherwise, see Adby and Dempster (1974). The ordinary differential equations-based method can also be used in constrained minimisation in a similar way after the problem has been transformed into an unconstrained minimisation problem, see e.g. Brown (1986) and Hannachi et al. (2006) for the case of simplified EOFs.

Appendix F Hilbert Space

This appendix introduces some concepts of linear vector spaces, metrics and Hilbert spaces.

F.1 Linear Vector and Metric Spaces

F.1.1 Linear Vector Space

A linear vector space X is a set of elements x, y, ... on which one can define an addition x + y between elements x and y of X satisfying, for all elements x, y, and z, the following properties:

• x + y = y + x.
• (x + y) + z = x + (y + z).
• The null element 0, satisfying x + 0 = x, belongs to X.
• The “inverse” −x of x, satisfying x + (−x) = 0, is also in X.

The first two properties are known, respectively, as commutativity and associativity. These properties make X a commutative group. In addition, a scalar multiplication has to be defined on X with the following properties:

• α(x + y) = αx + αy and (α + β)x = αx + βx,
• α(βx) = (αβ)x,
• 1x = x,

for any x and y in X, and scalars α and β.


F.1.2 Metric Space

A metric d(·, ·) defined on a set X is a real-valued function defined over X × X with the following properties:

(i) d(x, y) = d(y, x),
(ii) d(x, y) = 0 if and only if x = y,
(iii) d(x, y) ≤ d(x, z) + d(z, y),

for all x, y and z in X. A set X with a metric d(·, ·) is referred to as a metric space (X, d).

F.2 Norm and Inner Products

F.2.1 Norm

A norm on a linear vector space X, denoted by ‖·‖, is a real-valued function satisfying, for all vectors x and y in X and scalar λ, the following properties:

(i) ‖x‖ ≥ 0, and ‖x‖ = 0 if and only if x = 0,
(ii) ‖λx‖ = |λ| ‖x‖,
(iii) ‖x + y‖ ≤ ‖x‖ + ‖y‖.

A linear vector space with a norm is called a normed space.

F.2.2 Inner Product

An inner product defined on a linear vector space X is a scalar function defined on X × X, denoted by <·, ·>, satisfying, for all vectors x, y and z in X and scalar λ, the following properties:

(i) <λx + y, z> = λ<x, z> + <y, z>,
(ii) <x, y> = (<y, x>)∗, where the superscript (∗) stands for the complex conjugate,
(iii) <x, x> ≥ 0, and <x, x> = 0 if and only if x = 0.

A linear vector space X with an inner product is an inner product space.

F.2.3 Consequences

The existence of a metric and/or a norm leads to defining various topologies. For example, given a metric space (X, d), a sequence {x_n}_{n=1}^∞ of elements in X converges to an element x_0 in X if

lim_{n→∞} d(x_n, x_0) = 0.

Similarly, a sequence {x_n}_{n=1}^∞ of elements in a normed space X, with norm ‖·‖, is said to converge to x_0 in X if

lim_{n→∞} ‖x_n − x_0‖ = 0.

Also, a sequence {x_n}_{n=1}^∞ of elements in an inner product space X converges to an element x_0 in X if

lim_{n→∞} <x_n − x_0, x_n − x_0> = 0.

The existence of an inner product in a linear vector space X allows the definition of orthogonality as follows. Two vectors x and y are orthogonal, denoted by x ⊥ y, if <x, y> = 0.

F.2.4 Properties

1. A normed linear space, with norm ‖·‖, defines a metric space with the metric given by d(x, y) = ‖x − y‖.
2. An inner product space X, with inner product <·, ·>, is a normed linear space with the norm defined by ‖x‖ = <x, x>^{1/2}, and is consequently a metric space.
3. For any x and y in an inner product space X, the following properties hold:
   • |<x, y>| ≤ ‖x‖ ‖y‖,
   • ‖x + y‖² + ‖x − y‖² = 2‖x‖² + 2‖y‖². This is known as the parallelogram identity.
4. Given an n-dimensional linear vector space with an inner product, one can always construct an orthonormal basis (u_1, ..., u_n), i.e. <u_k, u_l> = δ_kl.
5. The limit of the sum of two sequences in an inner product space is the sum of the limits of the sequences. Similarly, the limit of the inner product of two sequences is the inner product of the limits of the corresponding sequences.
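Property 4 is constructive: an orthonormal basis can be built with the Gram–Schmidt procedure. A minimal Python sketch for the Euclidean inner product is given below; the starting vectors are an arbitrary example.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalise linearly independent vectors (Euclidean inner product)."""
    basis = []
    for v in vectors:
        w = v - sum((v @ u) * u for u in basis)   # remove components along previous u_k
        basis.append(w / np.linalg.norm(w))       # normalise
    return basis

u = gram_schmidt([np.array([1.0, 1.0, 0.0]),
                  np.array([1.0, 0.0, 1.0]),
                  np.array([0.0, 1.0, 1.0])])
print(np.round([[a @ b for b in u] for a in u], 10))   # identity matrix: <u_k, u_l> = delta_kl
```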

F.3 Hilbert Space

F.3.1 Completeness

Let (X, d) be a metric space. A sequence {x_n}_{n=1}^∞ of elements in X is a Cauchy sequence if for every real ε > 0 there exists a positive integer N for which d(x_n, x_m) < ε for all m ≥ N and n ≥ N. A metric space is said to be complete if every Cauchy sequence in the metric space converges to an element within the space.

F.3.2 Hilbert Space

A complete inner product space X with the metric:

d(x, y) = <x − y, x − y>^{1/2} = ‖x − y‖

is a Hilbert space. A number of results can be drawn from a Hilbert space. For example, if the sequence {x_n}_{n=1}^∞ of elements in a given Hilbert space X is orthogonal, then the sequence {y_n}_{n=1}^∞, given by y_n = Σ_{k=1}^{n} x_k, converges in X if and only if the scalar series Σ_{k=1}^{∞} ‖x_k‖² converges, see e.g. Kubáčková et al. (1987). A fundamental result in Hilbert spaces concerns the approximation of vectors from the Hilbert space by vectors from subspaces. This result is expressed by the so-called projection theorem, given below (see e.g. Halmos 1951).

Projection Theorem Let U be a Hilbert space and V a Hilbert subspace of U. Let also u be a vector in U but not in V, and v a vector in V. Then there exists a unique vector v̂ in V such that

‖u − v̂‖ = min_{v in V} ‖u − v‖.

Furthermore, the vector v̂ is uniquely determined by the property that <u − v̂, v> = 0 for all v in V. The vector v̂ is called the (orthogonal) projection of u onto V. The concept of Hilbert space finds its natural way into time series and prediction theory, and we provide a few examples below.

Examples of Hilbert Space

• Example 1

Consider the collection U of all (complex) random variables U, with zero mean and finite variance, i.e. E(U) = 0 and E(|U|²) < ∞, defined on some sample space. The following operation, defined for all random variables U and V in U by

<U, V> = E(U V∗), where the superscript (∗) stands for the complex conjugate, defines a scalar product over U and makes U a Hilbert space, see e.g. Priestly (1981, p. 190).

Exercise Show that the above operation is well defined.

Hint Use the fact that Var(λU + V) ≥ 0 for all scalar λ to deduce that <U, V> is well defined.

The theory of Hilbert spaces in stochastic processes and time series started towards the late 1940s (Loève 1948) and was lucidly formulated by Parzen (1959, 1961) in the context of random functions (stochastic processes). The concept of Hilbert space, and in particular the projection theorem, finds natural application in the theory of time series prediction.

• Example 2

Designate by T a subset of the real numbers, and let {X_t, for t in T} be a stochastic process (or random function) satisfying E|X_t|² < ∞ for t in T. Such a stochastic process is said to be second order. Let U be the set of random variables of the form U = Σ_{k=1}^{n} c_k X_{t_k}, where n is a positive integer, c_1, ..., c_n are scalars and t_1, ..., t_n are elements of T. That is, U is the set of all finite linear combinations of the random variables X_t for t in T and is known as the space spanned by the random function {X_t, for t in T}. The inner product <U, V> = E(UV∗) induces an inner product on U. The space U, extended by including all random variables that are limits of sequences in U, i.e. random variables W satisfying

lim_{n→∞} ‖W_n − W‖ = 0

for some sequence {W_n}_{n=1}^∞ in U, is a Hilbert space (see e.g. Parzen 1959).

F.3.3 Application to Prediction

The Univariate Case

Let H be the Hilbert space defined in Example 1 above and {X_t, t = 0, ±1, ±2, ...} a (discrete) stochastic process. Let now H_t be the subset spanned by the sequence X_t, X_{t−1}, X_{t−2}, .... Using the same reasoning as in Example 2 above, H_t is a Hilbert space.

Let now m be a given positive integer; our objective is to estimate X_{t+m} using elements from H_t. This is the classical prediction problem, which seeks an element X̂_{t+m} in H_t satisfying

‖X_{t+m} − X̂_{t+m}‖² = E|X_{t+m} − X̂_{t+m}|² = min_{Y in H_t} ‖X_{t+m} − Y‖².

Hence X̂_{t+m} is simply the orthogonal projection of X_{t+m} onto H_t. From the projection theorem, we get

E[(X_{t+m} − X̂_{t+m}) Y] = 0,

that is, X_{t+m} − X̂_{t+m} is orthogonal to Y, for any random variable Y in H_t. In prediction theory, the set H_t is sometimes referred to as the set of all possible predictors, and the predictor X̂_{t+m} provides the minimum mean square prediction error (see e.g. Karlin and Taylor 1975, p. 464). Since X̂_{t+m} is an element of H_t, the previous orthogonality also yields another orthogonality between X_{t+m} − X̂_{t+m} and X_{t+n} − X̂_{t+n} for n

In probabilistic terms we consider the stochastic process (X_t) observed for t ≤ n, and we seek to “estimate” the random variable X_{n+h}. The conditional probability distribution of the random variable X_{n+h} given I_n = {X_t, t ≤ n} is f_h(x_{n+h}|x_t, t ≤ n) = Pr(X_{n+h} ≤ x | X_t, t ≤ n) = f_h(x). The knowledge of f_h(·) permits the determination of all the conditional properties of X_{n+h}|I_n. The estimate X̂_{n+h} of X_{n+h}|I_n is then chosen as a solution to the minimisation problem:

min_Y E[(X_{n+h} − Y)² | I_n] = min_y ∫ (x − y)² f_h(x) dx.   (F.1)

The solution is then given by

X̂_{n+h} = E(X_{n+h} | X_t, t ≤ n),   (F.2)

and the term ε_{n+h} = X_{n+h} − X̂_{n+h} is known as the forecast error.

Exercise Show that the solution to Eq. (F.1) is given by Eq. (F.2).

Hint Recall the condition ∫ f_h(x) dx = 1 and use a Lagrange multiplier.

An important result emerges when {X_t} is Gaussian, namely, E(X_{n+h} | X_t, t ≤ n) is a linear function of X_t, t ≤ n, and this is the reason behind choosing linear predictors. The general linear predictor is then

X̂_{n+h} = Σ_{k≥0} α_k X_{n−k}.   (F.3)

The predictor X̂_{n+h} is meant to optimally approximate the (unobserved) future value X_{n+h} of the time series. For stationary time series the forecast error ε_{n+h} is also stationary, and its variance σ² = E(ε²_{n+h}) is the forecast error variance.
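As a concrete illustration of Eqs. (F.2)–(F.3), for a zero-mean AR(1) process X_t = φ X_{t−1} + ε_t the conditional expectation is linear, and the h-step predictor and its error variance have closed forms; the following short Python sketch checks them by simulation, with φ and the noise variance chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
phi, sigma2, n, h = 0.7, 1.0, 200_000, 3

# Simulate a long AR(1) series X_t = phi*X_{t-1} + eps_t
eps = rng.normal(scale=np.sqrt(sigma2), size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + eps[t]

# Optimal (linear) h-step predictor: X_hat_{t+h} = phi^h * X_t
x_hat = phi ** h * x[:-h]
err = x[h:] - x_hat

# Theoretical forecast error variance: sigma2 * (1 - phi^(2h)) / (1 - phi^2)
print(err.var(), sigma2 * (1 - phi ** (2 * h)) / (1 - phi ** 2))
```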

The Multivariate Case

Prediction of multivariate time series is more subtle than that of univariate time series, not least because matrices are involved. Matrices have two main features, namely, they do not (in general) commute, and they can be singular without being null. In this appendix a brief review of the multivariate prediction problem is given. For a full discussion on prediction of vector time series, the reader is referred to Doob (1953), Wiener and Masani (1957, 1958), Helsen and Lowdenslager (1958), Rozanov (1967), Masani (1966), Hannan (1970) and Koopmans (1974), and the up-to-date text by Wei (2019).

Let x_t = (X_{t1}, ..., X_{tp})^T denote a p-dimensional second-order (E|x_t|² < ∞) zero-mean random vector. Let also {x_t, t = 0, ±1, ±2, ...} be a second-order vector random function (or stochastic process, or time series), H the Hilbert space spanned by this random function, i.e. the space spanned by X_{t,k}, k = 1, ..., p, t = 0, ±1, ±2, ..., and finally, H_n the Hilbert space spanned by X_{t,k}, k = 1, ..., p, t ≤ n. A p-dimensional random vector y = (Y_1, ..., Y_p) is an element of H_n if each component Y_k, k = 1, ..., p, belongs to H_n. Stated otherwise, H_n can be regarded as composed of random vectors y that are finite linear combinations of elements of the vector random function, of the form:

y = Σ_{k=1}^{m} A_k x_{t_k}

for some integers m, t_1, ..., t_m, and p × p (complex) matrices A_k, k = 1, ..., m. To be consistent with the definition of uncorrelated random vectors, a generalised inner product on H, known as the Gramian matrix (see e.g. Koopmans 1974), is defined by

⟨u, v⟩_p = E(u v∗),

where (∗) stands for the transpose complex conjugate.¹ Note that the norm over H is the trace of the Gramian matrix, i.e.

‖x‖² = Σ_{k=1}^{p} E|X_k|² = tr ⟨x, x⟩_p.

It can be seen that the orthogonality of random vectors is equivalent to non-correlation (as in the univariate case). Let u = (U_1, ..., U_p) be a random vector in H. The projection û = (Û_1, ..., Û_p) of u onto H_n is a random vector whose components are the projections of the associated components of u. The projection theorem yields, for any y in H_n,

⟨u − û, y⟩_p = E[(u − û) y∗] = O.

As in the univariate case, the predictor x̂_{t+m} of x_{t+m} is given by the orthogonal projection of x_{t+m} onto H_t. The prediction error ε_{t+m} = x_{t+m} − x̂_{t+m} is orthogonal to all vectors in H_t. Also, ε_k is orthogonal to ε_l for l ≠ k, i.e.

E(ε_k ε_l^T) = Σ δ_{kl},

where Σ is the covariance matrix of the prediction error ε_k. The prediction error variance tr E(ε_{t+1} ε_{t+1}^T) of the one-step ahead prediction is given in Chap. 8.

¹ That is, the Gramian matrix consists of all the inner products between the individual components of u and v.

Appendix G Systems of Linear Ordinary Differential Equations

This appendix gives the solutions of systems of ordinary differential equations (ODEs) of the form:

dx/dt = Ax + b   (G.1)

with the initial condition x_0 = x(t_0), where A is an m × m real (or complex) matrix and b is an m-dimensional real (or complex) vector. When A is constant, the solution is quite simple, but when it is time-dependent the solution is slightly more elaborate.

G.1 Case of a Constant Matrix A

G.1.1 Homogeneous System

By using the exponential form of matrices:

e^A = Σ_{k≥0} (1/k!) A^k,   (G.2)

which can also be extended to e^{tA} for any scalar t, one gets

d(e^{tA})/dt = e^{tA} A = A e^{tA}.

Remark In general e^{A+B} ≠ e^A e^B. However, if A and B commute, then we get equality.


Using the above result, the solution of

dx/dt = Ax   (G.3)

with initial condition x_0 is

x(t) = e^{tA} x_0.   (G.4)
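The solution (G.4) is straightforward to evaluate numerically with a matrix exponential routine; the following Python sketch uses scipy.linalg.expm on an arbitrary 2 × 2 example and checks it against a direct numerical integration of (G.3).

```python
import numpy as np
from scipy.linalg import expm
from scipy.integrate import solve_ivp

A = np.array([[0.0, 1.0], [-2.0, -0.3]])   # arbitrary example matrix
x0 = np.array([1.0, 0.0])
t = 5.0

x_exact = expm(t * A) @ x0                                   # Eq. (G.4)
x_num = solve_ivp(lambda s, x: A @ x, (0.0, t), x0,
                  rtol=1e-10, atol=1e-12).y[:, -1]           # direct integration of (G.3)
print(x_exact, x_num)                                        # the two agree to high accuracy
```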

Remark The above result can be used to solve the differential equation:

d^m y/dt^m + a_{m−1} d^{m−1}y/dt^{m−1} + ... + a_1 dy/dt + a_0 y = 0   (G.5)

with initial conditions y(t_0), dy(t_0)/dt, ..., d^{m−1}y(t_0)/dt^{m−1}. The above ODE can be transformed into a system similar to Eq. (G.3), with the Frobenius (companion) matrix A given by

A = [  0     1    ...    0        0
       0     0    ...    0        0
       .                 .
       0     0    ...    0        1
     −a_0  −a_1   ...  −a_{m−2}  −a_{m−1} ],

and x(t) = (y(t), dy(t)/dt, ..., d^{m−1}y(t)/dt^{m−1})^T, with initial condition x_0 = x(t_0).

G.1.2 Non-homogeneous System

Here we consider the non-homogeneous case corresponding to Eq. (G.1), and we assume that b = b(t), i.e. b is time-dependent. By noting that dx/dt − Ax = e^{tA} d(e^{−tA} x)/dt, the solution is given by

x(t) = e^{tA} x_0 + ∫_{t_0}^{t} e^{(t−s)A} b(s) ds.   (G.6)

Remark Equation (G.6) can be used to integrate an mth-order non-homogeneous differential equation.

G.2 Case of a Time-Dependent Matrix A

G.2.1 General Case

We consider now the following system of ODEs:

dx/dt = A(t)x   (G.7)

with initial condition x_0. The theory behind the integration of Eq. (G.7) is based on using a set of independent solutions of the differential equation. If x_1(t), ..., x_m(t) is a set of m solutions of Eq. (G.7) with respective initial conditions x_1(t_0), ..., x_m(t_0), assumed to be linearly independent, then the matrix M(t) = (x_1(t), ..., x_m(t)) satisfies the following system of ODEs:

dM/dt = AM.   (G.8)

It turns out that if M(t_0) is invertible, the solution to (G.8) is also invertible.

Remark It can be shown, see e.g. Said-Houari (2015) or Teschl (2012), that the Wronskian W(t) = det(M(t)) satisfies the ODE:

dW/dt = tr(A) W,   (G.9)

or W(t) = W(t_0) exp(∫_{t_0}^{t} tr(A(u)) du). The Wronskian can be used to show that, like M(t_0), M(t) is also invertible. The solution to Eq. (G.7) then takes the form:

x(t) = S(t, t_0) x(t_0),   (G.10)

where S(·, ·) is the propagator of Eq. (G.7) and is given by

S(t, u) = M(t) M^{-1}(u).   (G.11)

These results can be extended to the case of a non-homogeneous system:

dx/dt = A(t)x + b(t),   (G.12)

with initial condition x_0, whose solution takes the form:

x(t) = S(t, t_0) x(t_0) + ∫_{t_0}^{t} S(t, u) b(u) du.   (G.13)

The above solution can again be used to integrate an mth-order non-homogeneous differential equation with varying coefficients. A useful simplification of Eq. (G.11) can be obtained when the matrix A satisfies A(t)A(s) = A(s)A(t) for all t and s. In this case the propagator S(·, ·) takes a simple expression, namely

S(t, s) = exp(∫_{s}^{t} A(u) du).   (G.14)

It is worth mentioning here that Eq. (G.13) can be extended to the case when the term b is a random forcing in relation to time-dependent multivariate autoregressive models.

G.2.2 Particular Case of Periodic Matrix A: Floquet Theory

A particularly important case in physical sciences corresponds to a periodic A(t), with period T, i.e. A(t + T) = A(t). This case is particularly relevant to atmospheric science because of the strong seasonality. The theory of the solution of

ẋ = A(t)x,   (G.15)

with initial condition x_0 = x(t_0), for a periodic m × m matrix A(t), is covered by the so-called Floquet theory (Floquet 1883). The solution takes the form:

x(t) = e^{μt} y(t),   (G.16)

for some periodic function y(t), and therefore need not be periodic. A set of m independent solutions x_1(t), ..., x_m(t) makes what is known as the fundamental matrix X(t), i.e. X(t) = [x_1(t), ..., x_m(t)], and if the initial condition X_0 = X(t_0) is the identity matrix, i.e. X_0 = I_m, then X(t) is called the principal fundamental matrix. It is therefore clear that the solution to Eq. (G.15) is x(t) = X(t) X_0^{-1} x_0, where X(t) is a fundamental matrix. An important result from Floquet theory is that if X(t) is a fundamental matrix so is X(t + T), and that there exists a nonsingular matrix B such that X(t + T) = X(t) B. Using the Wronskian, Eq. (G.9), one gets the determinant of B, i.e. |B| = exp(∫_0^T tr(A(u)) du). Furthermore, the eigenvalues of B, or characteristic multipliers, which can be written as e^{μ_1 T}, ..., e^{μ_m T}, yield the so-called characteristic (or Floquet) exponents μ_1, ..., μ_m.

Remark In terms of the resolvent, see Sect. G.2, the propagator S(t, τ) is the principal fundamental matrix.

The characteristic exponents, which may be complex, are not unique, but the characteristic multipliers are. In addition, the system (or the origin) is asymptotically stable if the characteristic exponents have negative real parts. It can be seen that if u is an eigenvector of B with eigenvalue ρ = e^{μT}, then x(t) = X(t)u is a solution to Eq. (G.15), and that x(t + T) = ρ x(t). The solution then takes the form x(t) = e^{μt} (e^{−μt} x(t)) = e^{μt} y(t), where precisely the vector y(t) = e^{−μt} x(t) is T-periodic.
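As an illustration, the characteristic (Floquet) multipliers can be computed numerically as the eigenvalues of the monodromy matrix B = X(T), obtained by integrating Eq. (G.15) over one period with the identity matrix as initial condition; the periodic matrix A(t) below is an arbitrary example.

```python
import numpy as np
from scipy.integrate import solve_ivp

T = 2.0 * np.pi                                  # period of A(t)
A = lambda t: np.array([[-0.1 + 0.5 * np.cos(t), 1.0],
                        [-1.0, -0.2 + 0.5 * np.sin(t)]])

# Integrate dX/dt = A(t) X over one period, starting from X(0) = I (principal fundamental matrix)
def rhs(t, x_flat):
    X = x_flat.reshape(2, 2)
    return (A(t) @ X).ravel()

sol = solve_ivp(rhs, (0.0, T), np.eye(2).ravel(), rtol=1e-10, atol=1e-12)
B = sol.y[:, -1].reshape(2, 2)                   # monodromy matrix B = X(T)

multipliers = np.linalg.eigvals(B)               # characteristic multipliers e^{mu_k T}
exponents = np.log(multipliers) / T              # Floquet exponents mu_k (principal branch)
print(multipliers, exponents.real)               # negative real parts => asymptotic stability
```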

Appendix H Links for Software Resource Material

An EOF primer by the author can be found here:
https://pdfs.semanticscholar.org/f492/b48483c83f70b8e6774d3cc88bec918ab630.pdf.

A CRAN (R programming language) package for EOFs and EOF rotation by Alan Jassby is here: https://www.rdocumentation.org/packages/wq/versions/0.4.8/topics/eof.

The site of David M. Kaplan provides Matlab codes for EOFs and varimax rotation: https://websites.pmc.ucsc.edu/~dmk/notes/EOFs/EOFs.html.

Mathworks provides codes for PCA, factor analysis and factor rotation using different rotation criteria at:
https://uk.mathworks.com/help/stats/rotatefactors.html.
https://uk.mathworks.com/help/stats/analyze-stock-prices-using-factor-analysis.html.

There are also freely available Matlab source codes of factor analysis at freesourcecode.net:
http://freesourcecode.net/matlabprojects/57962/factor-analysis-by-the-principal-components-method.--in-matlab#.XysoXfJS80o.

Python (and R) PCA and varimax rotation can be found at this site:
https://mathtuition88.com/2019/09/13/python-code-for-pca-rotation-varimax-matrix/.

An R package provided by MetNorway, including EOF, CCA and more, can be found here:

https://rdrr.io/github/metno/esd/man/ERA5.CDS.html.

The site of Imad Dabbura from HMS provides coding implementation in R and Python at: https://imaddabbura.github.io/.

A step-by-step introduction to NN with programming codes in Python is provided by Dr Andy Thomas: “An Introduction to Neural Networks for Beginners” at
https://adventuresinmachinelearning.com/wp-content/uploads/2017/07/.

Mathworks also provides software for recurrent NN used in time series forecasting:
https://uk.mathworks.com/help/deeplearning/.

The following site provides various Matlab codes for self-organising maps:
http://codeforge.com/s/0/self-organizing-map-matlab-code.

The site of Dr Qadri Hamarsheh provides the lecture “Neural Network and Fuzzy Logic: Self-Organizing Map Using Matlab” here:
http://www.philadelphia.edu.jo/academics/qhamarsheh/uploads/Lecture%2016_Self-organizing%20map%20using%20matlab.pdf.

Self-organising Maps Using Python, by James McCaffrey, here:
https://visualstudiomagazine.com/articles/2019/01/01/self-organizing-maps-python.aspx.

The book by Nielsen (2015) provides a hands-on approach to NN (and deep learning) with Python (2.7) here:
http://neuralnetworksanddeeplearning.com/about.html.

The book by Buduma (2017) provides codes for deep learning in Tensorflow at: https://github.com/darksigma/Fundamentals-of-Deep-Learning-Book.

The book by Chollet (2018) provides an exploration of deep learning from scratch with Python codes here: https://www.manning.com/books/deep-learning-with-python.

Random Forest: Simple Implementation with Python: https://holypython.com/rf/random-forest-simple-implementation/.

Random Forest (Easily Explained), with Python, by Shubham Gupta: https://medium.com/@gupta020295/random-forest-easily-explained-4b8094feb90.

An Implementation and Explanation of the Random Forest in Python by Will Koehrsen:
https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76.

Forecasting with Random Forest (Python implementation), by Eric D. Brown: https://pythondata.com/forecasting-with-random-forests/.

Time series forecasting with random forest via time delay embedding (in the R programming language), by Manuel Tilgner:
https://www.statworx.com/at/blog/time-series-forecasting-with-random-forest/.

A Python-based library to evaluate mathematical expressions efficiently for deep learning is found here:
http://deeplearning.net/software/theano/.

Other programming languages.

Yann Lecun provides a set of software packages in Lush at:
http://yann.lecun.com/ex/downloads/index.html.

A toolkit for recurrent NN applied to language modelling is given by Tomas Mikolov at: http://www.fit.vutbr.cz/~imikolov/rnnlm/.

A recurrent NN library for LSTM, multidimensional RNN, and more, can be found here: https://sourceforge.net/projects/rnnl/.

A Matlab 5 SOM toolbox by Juha Vesanto et al. can be found here:
http://www.cis.hut.fi/projects/somtoolbox/.

The following site provides links to a number of software packages on deep learning:
http://deeplearning.net/software_links/.

References

Absil P-A, Mahony R, Sepulchre R (2010) Optimization on manifolds: Methods and applications. In: Diehl M, Glineur F, Michiels EJ (eds) Recent advances in optimizations and its application in engineering. Springer, pp 125–144 Achlioptas D (2003) Database-friendly random projections: Johnson-Lindenstrauss with binary coins. J Comput Syst Sci 66:671–687 Ackoff RL (1989) From data to wisdom. J Appl Syst Anal 16:3–9 Adby PR, Dempster MAH (1974) Introduction to optimization methods. Chapman and Hall, London Aires F, Rossow WB, Chédin A (2002) Rotation of EOFs by the independent component analysis: toward a solution of the mixing problem in the decomposition of geophysical time series. J Atmospheric Sci 59:111–123 Aires F, Chédin A, Nadal J-P (2000) Independent component analysis of multivariate time series: application to the tropical SST variability. J Geophys Res 105(D13):17437–17455 Akaike H (1969) Fitting autoregressive models for prediction. Ann Inst Stat Math 21:243–247 Akaike H (1974) A new look at the statistical model identification. IEEE Trans Auto Control 19:716–723 Allen MR, Smith LA (1997) Optimal filtering in singular spectrum analysis. Phys Lett A 234:419– 423 Allen MR, Smith LA (1996) Monte Carlo SSA: Detecting irregular oscillations in the presence of colored noise. J Climate 9:3373–3404 Aluffi-Pentini F, Parisi V, Zirilli F (1984) Algorithm 617: DAFNE: a differential-equations algorithm for nonlinear equations. Trans Math Soft 10:317–324 Amari S-I (1990) Mathematical foundation of neurocomputing. Proc IEEE 78:1443–1463 Ambaum MHP, Hoskins BJ, Stephenson DB (2001) Arctic oscillation or North Atlantic oscilla- tion? J Climate 14:3495–3507 Ambaum MHP, Hoskins BJ, Stephenson DB (2002) Corrigendum: Arctic oscillation or North Atlantic oscillation? J Climate 15:553 Ambrizzi T, Hoskins BJ, Hsu H-H (1995) Rossby wave propagation and teleconnection patterns in the austral winter. J Atmos Sci 52:3661–3672 Ambroise C, Seze G, Badran F, Thiria S (2000) of self-organizing maps for cloud classification. Neurocomputing 30:47–52. ISSN: 0925–2312 Anderson JR, Rosen RD (1983) The latitude-height structure of 40–50 day variations in atmo- spheric angular momentum. J Atmos Sci 40:1584–1591 Anderson TW (1963) Asymptotic theory for principle component analysis. Ann Math Statist 34:122–148


Anderson TW (1984) An introduction to multivariate statistical analysis, 2nd edn. Wiley, New York Angell JK, Korshover J (1964) Quasi-biennial variations in temperature, total ozone, and tropopause height. J Atmos Sci 21:479–492 Ångström A (1935) Teleconnections of climatic changes in present time. Geografiska Annaler 17:242–258 Annas S, Kanai T, Koyama S (2007) Principal component analysis and self-organizing map for visualizing and classifying fire risks in forest regions. Agricul Inform Res 16:44–51. ISSN: 1881–5219 Asimov D (1985) The grand tour: A tool for viewing multidimensional data. SIAM J Sci Statist Comp 6:128–143 Adachi K, Trendafilov N (2019) Some inequalities contrasting principal component and factor analyses solutions. Jpn J Statist Data Sci. https://doi.org/10.1007/s42081-018-0024-4 Astel A, Tsakouski S, Barbieri P, Simeonov V (2007) Comparison of self-organizing maps classification approach with cluster and principal components analysis for large environmental data sets. Water Research 41:4566–4578. ISSN: 0043-1354 Bach F, Jorda M (2002) kernel independent component analysis. J Mach Learn Res 3:1–48 Bagrov NA (1959) Analytical presentation of the sequences of meteorological patterns by means of the empirical orthogonal functions. TSIP Proceeding 74:3–24 Bagrov NA (1969) On the equivalent number of independent data (in Russian). Tr Gidrometeor Cent 44:3–11 Baker CTH (1974) Methods for integro-differential equations. In: Delves LM, Walsh J (eds) Numerical solution of integral equations. Oxford University Press, Oxford Baldwin MP, Gray LJ, Dunkerton TJ, Hamilton K, Haynes PH, Randel WJ, Holton JR, Alexander MJ, Hirota I, Horinouchi T, Jones DBA, Kinnersley JS, Marquardt C, Sao K, Takahas M (2001) The Quasi-biennial oscillation. Rev Geophys 39:179–229 Baldwin MP, Stephenson DB, Jolliff IT (2009) Spatial weighting and iterative projection methods for EOFs. J Climate 22:234–243 Barbosa SM, Andersen OB (2009) Trend patterns in global sea surface temperature. Int J Climatol 29:2049–2055 Barlow HB (1989) Unsupervised learning. Neural Computation 1:295–311 Barnett TP (1983) Interaction of the monsoon and Pacific trade wind system at international time scales. Part I: The equatorial case. Mon Wea Rev 111:756–773 Barnston AG, Liveze BE (1987) Classification, seasonality, and persistence of low-frequency atmospheric circulation patterns. Mon Wea Rev 115:1083–1126 Barnett TP (1984a) Interaction of the monsoon and the Pacific trade wind systems at interannual time scales. Part II: The tropical band. Mon Wea Rev 112:2380–2387 Barnett TP (1984b) Interaction of the monsoon and the Pacific trade wind systems at interannual time scales. Part III: A partial anatomy of the Southern Oscillation. Mon Wea Rev 112:2388– 2400 Barnett TP, Preisendorfer R (1987) Origins and levels of monthly and seasonal forecast skill for United States srface air temperatures determined by canonical correlation analysis. Mon Wea Rev 115:1825–1850 Barnston AG, Ropelewski CF (1992) Prediction of ENSO episodes using canonical correlation analysis. J Climate 5:1316–1345 Barreiro M, Marti AC, Masoller C (2011) Inferring long memory processes in the climate network via ordinal pattern analysis. Chaos 21:13,101. https://doi.org/10.1063/1.3545273 Bartholomew DJ (1987) Latent variable models and factor analysis. Charles Griffin, London Bartlett MS (1939) The standard errors of discriminant function coefficients. J Roy Statist Soc Suppl. 6:169–173 Bartlett MS (1950) Periodogram analysis and continuous spectra. 
Biometrika 37:1–16 Bartlett MS (1955) An introduction to stochastic processes. Cambridge University Press, Cam- bridge References 555

Basak J, Sudarshan A, Trivedi D, Santhanam MS (2004) Weather data mining using independent component analysis. J Mach Lear Res 5:239–253 Basilevsky A, Hum PJ (1979) Karhunen-Loève analysis of historical time series with application to Plantation birth in Jamaica. J Am Statist Ass 74:284–290 Basilevsky A (1983) Applied matrix algebra in the statistical science. North Holland, New York Bauckhage C, Thurau C (2009) Making archetypal analysis practical. In: Pattern recognition, Lecture Notes in Computer Science, vol 5748. Springer, Berlin, Heidelberg, pp 272–281. https://doi.org/10.1007/978-3-642-03798-6-28 Bayes T (1763) An essay towards solving a problem in the doctrine of chances. Phil Trans 53:370 Beatson RK, Cherrie JB, Mouat CT (1999) Fast fitting of radial basis functions: Methods based on preconditioned GMRES iteration. Adv Comput Math 11:253–270 Beatson RK, Light WA, Billings S (2000) Fast solution of the radial basis function interpolation equations: Domain decomposition methods. SIAM J Sci Comput 200:1717–1740 Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data represen- tation. Neural Comput 15:1373–1396 Bellman R (1961) Adaptive control processes: A guide tour. Princeton University Press, Princeton Bell AJ, Sejnowski TJ (1995) An information-maximisation approach to blind separation and blind deconvolution. Neural Computing 7:1004–1034 Bell AJ, Sejnowski TJ (1997) The “independent components” of natural scenes are edge filters. Vision Research 37:3327–3338 Belouchrani A, Abed-Meraim K, Cardoso J-F, Moulines E (1997) A blind source separation technique using second order . IEEE Trans Signal Process 45:434–444 Bentler PM, Tanaka JS (1983) Problems with EM algorithms for ML factor analysis. Psychome- trika 48:247–251 Berthouex PM, Brown LC (1994) Statistics for environmental engineers. Lewis Publishers, Boca Raton Bishop CM (1995) Neural networks for pattern recognition. Clarendon Press, Oxford, 482 p. Bishop CM (2006) Pattern recognition and machine learning. Springer series in information science and statistics. Springer, New York, 758 p. Bjerknes J (1969) Atmospheric teleconnections from the equatorial Pacific. Mon Wea Rev 97:163– 172 Björnsson H, Venegas SA (1997) A manual for EOF and SVD analyses of climate data. Report No 97-1, Department of Atmospheric and Oceanic Sciences and Centre for Climate and Global Change Research, McGill University, p 52 Blumenthal MB (1991) Predictability of a coupled ocean-atmosphere model. J Climate 4:766–784 Bloomfield P, Davis JM (1994) Orthogonal rotation of complex principal components. Int J Climatol 14:759–775 Bock H-H (1986) in the framework of . In: Degens P, Hermes H-J, Opitz O (eds) Studien Zur Klasszfikation. INDEKS-Verlag, Frankfurt, pp 247– 258 Bock H-H (1987) On the interface between cluster analysis, principal component analysis, and multidimensional scaling. In: Bozdogan H, Kupta AK (eds) Multivariate statistical modelling and data analysis. Reidel, Boston Boers N, Donner RV, Bookhagen B, Kurths J (2014) Complex network analysis helps to identify impacts of the El Niño Southern Oscillation on moisture divergence in South America. Clim Dyn. https://doi.org/10.1007/s00382-014-2265-7 Bolton RJ, Krzanowski WJ (2003) Projection pursuit clutering for exploratory data analysis. J Comput Graph Statist 12:121–142 Bonnet G (1965) Theorie de linformation−sur l’interpolation optimale d’une fonction aléatoire èchantillonnée. 
C R Acad Sci Paris 260:297–343 Bookstein FL (1989) Principal warps: thin plate splines and the decomposition of deformations. IEEE Trans Pattern Anal Mach Intell 11:567–585 Borg I, Groenen P (2005) Modern multidimensional scaling. Theory and applications, 2nd edn. Springer, New York 556 References

Boser BE, Guyon IM, Vapnik VN (1992) A training algorithm for optimal margin classifier. In: Haussler D (ed) Proceedings of the 5th anuual ACM workshop on computational learning theory. ACM Press, pp 144–152 Pittsburgh. Botsaris CA, Jacobson HD (1976) A Newton-type curvilinear search method for optimisation. J Math Anal Applics 54:217–229 Botsaris CA (1978) A class of methods for unconstrained minimisation based on stable numerical integration techniques. J Math Anal Applics 63:729–749 Botsaris CA (1979) A Newton-type curvilinear search method for constrained optimisation. J Math Anal Applics 69:372–397 Botsaris CA (1981) Constrained optimisation along geodesics. J Math Anal Applics 79:295–306 Box MJ, Davies D, Swann WH (1969) Non-linera optimization techniques. Oliver and Boyd, Edinburgh Box GEP, Jenkins MG, Reinsel CG (1994) Time series analysis: forecasting and control. Prentice Hall, New Jersey Box GEP, Jenkins MG (1970) Time series analysis. Forecasting and control. Holden-Day, San Fracisco (Revised and published in 1976) Branstator G, Berner J (2005) Linear and nonlinear Signatures in the planetary wave dynamics of an AGCM: Phase space tendencies. J Atmos Sci 62:1792–1811 Breakspear M, Brammer M, Robinson PA (2003) Construction of multivariate surrogate sets from nonlinear data using the wavelet transform. Physica D 182:1–22 Breiman L (2001) Random forests. Machine Learning 45:5–32 Bretherton CS, Smith C, Wallace JM (1992) An intercomparison of methods for finding coupled patterns in climate data. J Climate 5:541–560 Bretherton CS, Widmann M, Dymnykov VP, Wallace JM, Bladé I (1999) The effective number of spatial degrees of freedom of a time varying field. J Climate 12:1990–2009 Brillinger DR, Rosenblatt M (1967) Computation and interpretation of k-th order spectra. In: Harris B (ed) Spectral analysis of time series. Wiley, New York, pp 189–232 Brillinger DR (1981) Time series-data: analysis and theory. Holden-Day, San-Francisco Brink KH, Muench RD (1986) Circulation in the point conception-Santa Barbara channel region. J Geophys Res C 91:877–895 Brockwell PJ, Davis RA (1991) Time series: theory and methods, 2nd edn. Springer, New York Brockwell PJ, Davis RA (2002) Introduction to time series and forecasting. Springer, New York Brown AA (1986) Optimisation methods involving the solution of ordinary differential equations. Ph.D. thesis, the Hatfield polytechnic, available from the British library Brownlee J (2018) Statistical methods for machine learning. e-learning. ISBN-10. https://www. unquotebooks.com/get/ebook.php?id=386nDwAAQBAJ Broomhead DS, King GP (1986a) Extracting qualitative dynamics from experimental data. Physica D 20:217–236 Broomhead DS, King GP (1986b) On the qualitative analysis of experimental dynamical systems. In: Sarkar S (ed) Nonlinear phenomena and chaos. Adam Hilger, pp 113–144 Buduma N (2017) Fundamentals of deep learning, 1st edn. O’Reilly, Beijing Bürger G (1993) Complex principal oscillation pattern analysis. J Climate 6:1972–1986 Burg JP (1972) The relationship between maximum entropy spectra and maximum likelihood spectra. Geophysics 37:375–376 Cadzow JA, Li XK (1995) Blind deconvolution. Digital Signal Process J 5:3–20 Cadzow JA (1996) Blind deconvolution via cumulant extrema. IEEE Signal Process Mag (May 1996), 24–42 Cahalan RF, Wharton LE, Wu M-L (1996) Empirical orthogonal functions of monthly precipitation and temperature ever over the united States and homogeneous Stochastic models. 
J Geophys Res 101(D21): 26309–26318 Capua GD, Runge J, Donner RV, van den Hurk B, Turner AG, Vellore R, Krishnan R, Coumou D (2020) Dominant patterns of interaction between the tropics and mid-latitudes in boreal summer: Causal relationships and the role of time scales. Weather Climate Discuss. https:// doi.org/10.5194/wcd-2020-14. References 557

Cardoso J-F (1989) Source separation using higher order moments. In: Proc. ICASSP’89, pp 2109– 2112 Cardoso J-F (1997) Infomax and maximum likelihood for source separation. IEEE Lett Signal Process 4:112–114 Cardoso J-F, Souloumiac A (1993) Blind beamforming for non-Gaussian signals. IEE Proc F 140:362–370 Cardoso J-F, Hvam Laheld B (1996) Equivalent adaptive source separation. IEEE Trans Signal Process 44:3017–3030 Carr JC, Fright RW, Beatson KR (1997) Surface interpolation with radial basis functions for medical imaging. IEEE Trans Med Imag 16:96–107 Carreira-Perpiñán MA (2001) Continuous latent variable models for dimensionality reduction and sequential data reconstruction. Ph.D. dissertation. Department of Computer Science, University of Sheffield Carroll JB (1953) An analytical solution for approximating simple structure in factor analysis. Psychometrika 18:23–38 Caroll JD, Chang JJ (1970) Analysis of individual differences in multidimensional scaling via an n-way generalization of ’Eckart-Young’ decomposition. Psychometrika 35:283–319 Cassano EN, Glisan JM, Cassano JJ, Gutowski Jr. WJ, Seefeldt MW (2015) Self-organizing map analysis of widespread temperature extremes in Alaska and Canada. Clim Res 62:199–218 Cassano JJ, Cassano EN, Seefeldt MW, Gutowski WJ, Glisan JM (2016) Synoptic conditions during wintertime temperature extremes in Alaska. J Geophys Res Atmos 121:3241–3262. https://doi.org/10.1002/2015JD024404 Causa A, Raciti F (2013) A purely geometric approach to the problem of computing the projection of a point on a simplex. JOTA 156:524–528 Cavazos T, Comrie AC, Liverman DM (2002) Intraseasonal variability associated with wet monsoons in southeast Arizona. J Climate 15:2477–2490. ISSN: 0894-8755 Chan JCL, Shi J-E (1997) Application of projection-pursuit principal component analysis method to climate studies. Int J Climatol 17(1):103–113 Charney JG, Devore J (1979) Multiple equilibria in the atmosphere and blocking. J Atmos Sci 36:1205–1216 Chatfield C (1996) The analysis of time series. An introduction 5th edn. Chapman and Hall, Boca Raton Chatfield C, Collins AJ (1980) Introduction to multivariate analysis. Chapman and Hall, London Chatfield C (1989) The analysis of time series: An introduction. Chapman and Hall, London, 241 p Chekroun MD, Kondrashov D (2017) Data-adaptive harmonic spectra and multilayer Stuart- Landau models. Chaos 27:093110 Chen J-M, Harr PA (1993) Interpretation of extended empirical orthogonal function (EEOF) analysis. Mon Wea Rev 121:2631–2636 Chen R, Zhang W, Wang X (2020) Machine learning in tropical cyclone forecast modeling: A Review. Atmosphere 11:676. https://doi.org/10.3390/atmos11070676 Cheng X, Nitsche G, Wallace MJ (1995) Robustness of low-frequency circulation patterns derived from EOF and rotated EOF analysis. J Climate 8:1709–1720 Chernoff H (1973) The use of faces to represent points in k-dimensional space graphically. J Am Stat Assoc 68:361–368 Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21:5–30. https://doi. org/10.1016/j.acha.2006.04.006 Chollet F (2018) Deep learning with Python. Manning Publications, New York, 361 p Christiansen B (2009) Is the atmosphere interesting? A projection pursuit study of the circulation in the northern hemisphere winter. J Climate 22:1239–1254 Cleveland WS, McGill R (1984) The many faces of a scatterplot. J Am Statist Assoc 79:807–822 Cleveland WS (1993) Visualising data. 
Hobart Press, New York Comon P, Jutten C, Herault J (1991) Blind separation of sourcesi, Part ii: Problems statement. Signal Process 24:11–20 Comon P (1994) Independent component analysis, a new concept? Signal Process 36:287–314 558 References


Spearman C (1904b) The proof and measurement of association between two things. Am J Psy 15:72, and 202 Spence I, Garrison RF (1993) A remarkable scatterplot. Am Statistician 47:12–19 Spendley W, Hext GR, Humsworth FR (1962) Sequential applications of simplex designs in optimization and evolutionary operations. Technometrics 4:441–461 Stewart D, Love W (1968) A general canonical correlation index. Psy Bull 70:160–163 Steinschneiders S, Lall U (2015) Daily precipitation and tropical moisture exports across the east- ern United States: An application of archetypal analysis to identify spatiotemporal structure. J Climate 28:8585–8602 Stephenson G (1973) Mathematical methods for science students, 2nd edn. Dover Publication, Mineola, 526 p Stigler SM (1986) The History of Statistics: The Measurement of Uncertainty Before 1900. Harvard University Press, Cambridge, MA Stommel H (1948) The westward intensification of wind-driven ocean currents. EOS Trans Amer Geophys Union 29:202–206 Stone M, Brooks RJ (1990) Continuum regression: cross-validation sequentially constructed prediction embracing ordinary least squares, partial least squares and principal components regression. J Roy Statist Soc B52:237–269 Su Z, Hu H, Wang G, Ma Y, Yang X, Guo F (2018) Using GIS and Random Forests to identify fire drivers in a forest city, Yichun, China. Geomatics Natural Hazards Risk 9:1207–1229. https:// doi.org/10.1080/19475705.2018.1505667 Subashini A, Thamarai SM, Meyyappan T (2019) Advanced weather forecasting Prediction using deep learning. Int J Res Appl Sci Eng Tech IJRASET 7:939–945. www.ijraset.com Sura P, Hannachi A (2015) Perspectives of non-Gaussianity in atmospheric synoptic and low- frequency variability. J Cliamte 28:5091–5114 Swenson ET (2015) Continuum power CCA: A unified approach for isolating coupled modes. J Climate 28:1016–1030 Takens F (1981) Detecting strange attractors in turbulence. In: Rand D, Young LS (eds) Dynamical systems and turbulence, warwick 1980. Lecture Notes in Mathematics, vol 898. Springer, New York, pp 366–381 Talley LD (2008) Freshwater transport estimates and the global overturning circulation: shallow, deep and throughflow components. Progress Ocenaography 78:257–303 Taylor GI (1921) Diffusion by continuous movement. Proc Lond Math Soc 20(2):196–212 Telszewski M, Chazottes A, Schuster U, Watson AJ, Moulin C, Bakker DCE, Gonzalez-Davila M, Johannessen T, Kortzinger A, Luger H, Olsen A, Omar A, Padin XA, Rios AF, Steinhoff T, Santana-Casiano M, Wallace DWR, Wanninkhof R (2009) Estimating the monthly pCO2 distribution in the North Atlantic using a self-organizing neural network. Biogeosciences 6:1405–1421. ISSN: 1726–4170 Tenenbaum JB, de Silva V, Langford JC (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290:2319–2323 TerMegreditchian MG (1969) On the determination of the number of independent stations which are equivalent to prescribed systems of correlated stations (in Russian). Meteor Hydrol 2:24–36 Teschl G (2012) Ordinary differential equations and dynamical systems. Graduate Studies in Mathematics, vol 140, Amer Math Soc, Providence, RI, 345pp Thacker WC (1996) Metric-based principal components: data uncertainties. Tellus 48A:584–592 Thacker WC (1999) Principal predictors. Int J Climatol 19:821–834 Tikhonov AN (1963) Solution of incorrectly formulated problems and the regularization method. 
Sov Math Dokl 4:1035–1038 Theiler J, Eubank S, Longtin A, Galdrikian B, Farmer JD (1992) Testing for nonlinearity in time series: the method of surrogate data. Physica D 58:77–94 Thiebaux HJ (1994) Statistical data analyses for ocean and atmospheric sciences. Academic Press Thomas JB (1969) An introduction to statistical communication theory. Wiley Thomson RE, Emery WJ (2014) Data analysis methods in physical oceanography, 3rd edn. Elsevier, Amsterdam, 716 p 578 References

Thompson DWJ, Wallace MJ (1998) The arctic oscillation signature in wintertime geopotential height and temperature fields. Geophys Res Lett 25:1297–1300 Thompson DWJ, Wallace MJ (2000) Annular modes in the extratropical circulation. Part I: Month- to-month variability. J Climate 13:1000–1016 Thompson DWJ, Wallace JM, Hegerl GC (2000) Annular modes in the extratropical circulation, Part II: Trends. J Climate 13:1018–1036 Thurstone LL (1940) Current issues in factor analysis. Psychological Bulletin 37:189–236 Thurstone LL (1947) Multiple factor analysis. The University of Chicago Press, Chicago Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288 Tippett MK, DelSole T, Mason SJ, Barnston AG (2008) Regression based methods for finding coupled patterns. J Climate 21:4384–4398 Tipping ME, Bishop CM (1999) Probabilistic principal components. J Roy Statist Soc B 61:611– 622 Toumazou V, Cretaux J-F (2001) Using a Lanczos eigensolver in the computation of empirical orthogonal functions. Mon Wea Rev 129:1243–1250 Torgerson WS (1952) Multidimensional scaling I: Theory and method. Psychometrika 17:401–419 Torgerson WS (1958) Theory and methods of scaling. Wiley, New York Trenberth KE, Jones DP, Ambenje P, Bojariu R, Easterling D, Klein Tank A, Parker D, Rahimzadeh F, Renwick AJ, Rusticucci M, Soden B, Zhai P (2007) Observations: surface and atmospheric climate change. In: Solomon S, Qin D, Manning M, et al. (eds) Climate Change (2007) The physical science basis. Contribution of working Group I to the fourth assessment report of the intergovernmental panel on climate change. Cambridge University Press, p 235–336 Trenberth KE, Shin W-TK (1984) Quasi-biennial fluctuations is sea level pressures over the Northern Hemisphere. Mon Wea Rev 111:761–777 Trendafilov NT (2010) Stepwise estimation of common principal components. Comput Statist Data Anal 54:3446–3457 Trendafilov NT, Jolliffe IT (2006) Projected gradient approach to the numerical solution of the SCoTLASS. Comput Statist Data Anal 50:242–253 Tsai YZ, Hsu K-S, Wu H-Y, Lin S-I, Yu H-L, Huang K-T, Hu M-C, Hsu S-Y (2020) Application of random forest and ICON models combined with weather forecasts to predict soil temperature and water content in a greenhouse. Water 12:1176 Tsonis AA, Roebber PJ (2004) The architecture of the climate network. Phys A 333:497–504. https://doi.org/10.1016/j.physa.2003.10.045 Tsonis AA, Swanson KL, Roebber PJ (2006) What do networks have to do with climate? Bull Am Meteor Soc 87:585–595. https://doi.org/10.1175/BAMS-87-5-585 Tsonis AA, Swanson KL, Wang G (2008) On the role of atmospheric teleconnections in climate. J Climate 21(2990):3001 Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN (2014) On dynamic mode decompo- sition: Theory and applications. J Comput Dyn 1:391–421. https://doi.org/10.3934/jcd.2014.1. 391 Tucker LR (1966) Some mathematical notes on three-mode factor analysis. Psychometrika 31:279– 311 Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA Tukey PA, Tukey JW (1981) Preparation, prechosen sequences of views. In: Barnett V (ed) Interpreting multivariate data. Wiley, Chichester, pp 189–213 Tyler DE (1982) On the optimality of the simultaneous redundancy transformations. Psychome- trika 47:77–86 Ulrych TJ, Bishop TN (1975) Maximum entropy spectral analysis and autoregressive decomposi- tion. 
Rev Geophys Space Phys 13:183–200 Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2010) Independent exploratory factor analysis with application to atmospheric science data. J Appl Stat 37:1847–1862 Unkel S, Trendafilov NT, Hannachi A, Jolliffe IT (2011) Independent component analysis for three- way data with an application from atmospheric science. J Agr Biol Environ Stat 16:319–338 References 579

Uppala SM, Kallberg PW, Simmons AJ, Andrae U, Bechtold VDC, Fiorino M, Gibson JK, Haseler J, Hernandez A, Kelly GA, Li X, Onogi K, Saarinen S, Sokka N, Allan RP, Andersson E, Arpe K, Balmaseda MA, Beljaars ACM, Berg LVD, Bidlot J, Bormann N, Caires S, Chevallier F, Dethof A, Dragosavac M, Fisher M, Fuentes M, Hagemann S, Hólm E, Hoskins BJ, Isaksen L, Janssen PAEM, Jenne R, Mcnally AP, Mahfouf J-F, Morcrette J-J, Rayner NA, Saunders RW, Simon P, Sterl A, Trenberth KE, Untch A, Vasiljevic D, Viterbo P, Woollen J (2005) The ERA-40 re-analysis. Q J Roy Meteorol Soc 131:2961–3012 van den Dool HM, Saha S, Johansson Å(2000) Empirical orthogonal teleconnections. J Climate 13:1421–1435 van den Dool HM (2011) An iterative projection method to calculate EOFs successively without use of the covariance matrix. In: Science and technology infusion climate bulletin NOAA’s National Weather Service. 36th NOAA annual climate diagnostics and prediction workshop, Fort Worth, TX, 3–6 October 2011. www.nws.noaa.gov/ost/climate/STIP/36CDPW/36cdpw- vandendool.pdf van den Wollenberg AL (1977) Redundancy analysis: an alternative to canonical correlation analysis. Psychometrika 42:207–219 Vasicek O (1976) A test for normality based on sample entropy. J R Statist Soc B 38:54–59 Vautard R, Ghil M (1989) Singular spectrum analysis in nonlinear dynamics, with applications to paleoclimatic time series. Physica D 35:395–424 Vautard R, Yiou P, Ghil M (1992) Singular spectrum analysis: A toolkit for short, noisy chaotic signals. Physica D 58:95–126 Venables WN, Ripley BD (1994) Modern applied statistics with S-plus. McGraw-Hill, New York Vesanto J, Alhoniemi E (2000) Clustering of the self-organizing map. IEEE Trans Neural Net 11:586–600 Vesanto J (1997) Using the SOM and local models in time series prediction. In Proceedings of workshop on self-organizing maps (WSOM’97), Espo, Finland, pp 209–214 Vinnikov KY, Robock A, Grody NC, Basist A (2004) Analysis of diurnal and seasonal cycles and trends in climate records with arbitrary observations times. Geophys Res Lett 31. https://doi. org/10.1029/2003GL019196 Vilibic´ I, et al (2016) Self-organizing maps-based ocean currents forecasting system. Sci Rep 6:22924. https://doi.org/10.1038/srep22924 von Mises R (1928) Wahrscheinlichkeit, Statistik und Wahrheit, 3rd rev. edn. Springer, Vienna, 1936; trans. as Probability, statistics and truth, 1939. W. Hodge, London von Storch H (1995a) Spatial patterns: EOFs and CCA. In: von Storch H, Navarra A (eds) Analysis of climate variability: Application of statistical techniques. Springer, pp 227–257 von Storch J (1995b) Multivariate statistical modelling: POP model as a first order approximation. In: von Storch H, Navarra A (eds) Analysis of climate variability: application of statistical techniques. Springer, pp 281–279 von Storch H, Zwiers FW (1999) Statistical analysis in climate research. Cambridge University Press, Cambridge von Storch H, Xu J (1990) Principal oscillation pattern analysis of the tropical 30- to 60-day oscillation. Part I: Definition of an index and its prediction. Climate Dynamics 4:175–190 von Storch H, Bruns T, Fisher-Bruns I, Hasselmann KF (1988) Principal oscillation pattern analysis of the 30- to 60-day oscillation in a general circulation model equatorial troposphere. J Geophys Res 93:11022–11036 von Storch H, Bürger G, Schnur R, Storch J-S (1995) Principal ocillation patterns. A review. 
J Climate 8:377–400 von Storch H, Baumhefner D (1991) Principal oscillation pattern analysis of the tropical 30- to 60-day oscillation. Part II: The prediction of equatorial velocity potential and its skill. Climate Dynamics 5:1–12 Wahba G (1979) Convergence rates of “Thin Plate” smoothing splines when the data are noisy. In: Gasser T, Rosenblatt M (eds) Smoothing techniques for curve estimation. Lecture notes in mathematics, vol 757. Springer, pp 232–246 580 References

Wahba G (1990) Spline models for observational data SIAM. Society for Industrial and Applied Mathematics, Philadelphia, PA, 169 p Wahba G (2000) Smoothing splines in nonparametric regression. Technical Report No 1024, Department of Statistics, University of Wisconsin. https://www.stat.wisc.edu/sites/default/files/ tr1024.pdf Walker GT (1909) Correlation in seasonal variation of climate. Mem Ind Met Dept 20:122 Walker GT (1923) Correlation in seasonal variation of weather, VIII, a preliminary study of world weather. Mem Ind Met Dept 24:75–131 Walker GT (1924) Correlation in seasonal variation of weather, IX. Mem Ind Met Dept 25:275–332 Walker GT, Bliss EW (1932) World weather V. Mem Roy Met Soc 4:53–84 Wallace JM (2000) North Atlantic Oscillation/annular mode: Two paradigms–one phenomenon. QJR Meteorol Soc 126:791–805 Wallace JM, Dickinson RE (1972) Empirical orthogonal representation of time series in the frequency domain. Part I: Theoretical consideration. J Appl Meteor 11:887–892 Wallace JM (1972) Empirical orthogonal representation of time series in the frequency domain. Part II: Application to the study of tropical wave disturbances. J Appl Meteor 11:893–900 Wallace JM, Gutzler DS (1981) Teleconnections in the geopotential height field during the Northern Hemisphere winter. Mon Wea Rev 109:784–812 Wallace JM, Smith C, Bretherton CS (1992) Singular value decomposition of wintertime sea surface temperature and 500-mb height anomalies. J Climate 5:561–576 Wallace JM, Thompson DWJ (2002) The Pacific Center of Action of the Northern Hemisphere annular mode: Real or artifact? J Climate 15:1987–1991 Walsh JE, Richman MB (1981) Seasonality in the associations between surface temperatures over the United States and the North Pacific Ocean. Mon Wea Rev 109:767–783 Wan EA (1994) Time series prediction by using a connectionist network with internal delay lines. In: Weigend AS, Gershenfeld NA (eds) Time series prediction: forecasting the future and understanding the past. Addison-Wesley, Boston, MA, pp 195–217 Wang D, Arapostathis A, Wilke CO, Markey MK (2012) Principal-oscillation-pattern analysis of gene expression. PLoS ONE 7 7:1–10. https://doi.org/10.1371/journal.pone.0028805 Wang Y-H, Magnusdottir G, Stern H, Tian X, Yu Y (2014) Uncertainty estimates of the EOF- derived North Atlantic Oscillation. J Climate 27:1290–1301 Wang D-P, Mooers CNK (1977) Long coastal-trapped waves off the west coast of the United States, summer (1973) J Phys Oceano 7:856–864 Wang XL, Zwiers F (1999) Interannual variability of precipitation in an ensemble of AMIP climate simulations conducted with the CCC GCM2. J Climate 12:1322–1335 Watkins DS (2007) The matrix eigenvalue problem: GR and Krylov subspace methods. SIAM, Philadelphia Watt J, Borhani R, Katsaggelos AK (2020) Machine learning refined: foundation, algorithms and applications, 2nd edn. Cambridge University Press, Cambridge, 574 p Weare BC, Nasstrom JS (1982) Examples of extended empirical orthogonal function analysis. Mon Wea Rev 110:481–485 Wegman E (1990) Hyperdimensional data analysis using parallel coordinates. J Am Stat Assoc 78:310–322 Wei WWS (2019) Multivariate time series analysis and applications. Wiley, Oxford, 518 p Weideman JAC (1995) Computing the Hilbert transform on the real line. Math Comput 64:745–762 Weyn JA, Durran DR, Caruana R (2019) Can machines learn to predict weather? Using deep learning to predict gridded 500-hPa geopotential heighjt from historical weather data. J Adv Model Earth Syst 11:2680–2693. 
https://doi.org/10.1029/2019MS001705 Werbos PJ (1990) Backpropagation through time: What it does and how to do it. Proc IEEE, 78:1550–1560 Whittle P (1951) Hypothesis testing in time series. Almqvist and Wicksell, Uppsala Whittle P (1953a) The analysis of multiple stationary time series. J Roy Statist Soc B 15:125–139 Whittle P (1953b) Estimation and information in stationary time series. Ark Math 2:423–434 References 581

Whittle P (1983) Prediction and regulation by linear least-square methods, 2nd edn. University of Minnesota, Minneapolis Widrow B, Stearns PN (1985) Adaptive signal processing. Prentice-Hall, Englewood Cliffs, NJ Wikle CK (2004) Spatio-temporal methods in climatology. In: El-Shaarawi AH, Jureckova J (eds) UNESCO encyclopedia of life support systems (EOLSS). EOLSS Publishers, Oxford, UK. Available: https://pdfs.semanticscholar.org/e11f/f4c7986840caf112541282990682f7896199. pdf Wiener N, Masani L (1957) The prediction theory of multivariate stochastic processes, I. Acta Math 98:111–150 Wiener N, Masani L (1958) The prediction theory of multivariate stochastic processes, II. Acta Math 99:93–137 Wilkinson JH (1988) The algebraic eigenvalue problem. Clarendon Oxford Science Publications, Oxford Wilks DS (2011) Statistical methods in the atmospheric sciences. Academic Press, San Diego Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the Koopman operator: extending dynamic mode decomposition. J Nonlin Sci 25:1307–1346 Wiskott L, Sejnowski TJ (2002) Slow feature analysis: unsupervised learning of invariances. Neural Comput 14:715–770 Wise J (1955) The autocorrelation function and the spectral density function. Biometrika 42:151– 159 Woollings T, Hannachi A, Hoskins BJ, Turner A (2010) A regime view of the North Atlantic Oscillation and its response to anthropogenic forcing. J Climate 23:1291–1307 Wright RM, Switzer P (1971) Numerical classification applied to certain Jamaican eocene numuulitids. Math Geol 3:297–311 Wunsch C (2003) The spectral description of climate change including the 100 ky energy. Clim Dyn 20:353–363 Wu C-J (1996) Large optimal truncated low-dimensional dynamical systems. Discr Cont Dyn Syst 2:559–583 Xinhua C, Dunkerton TJ (1995) Orthogonal rotation of spatial patterns derived from singular value decomposition analysis. J Climate 8:2631–2643 Xu J-S (1993) The joint modes of the coupled atmosphere-ocean system observed from 1967 to 1986. J Climate 6:816–838 Xue Y, Cane MA, Zebiak SE, Blumenthal MB (1994) On the prediction of ENSO: A study with a low order Markov model. Tellus 46A:512–540 Young GA, Smith RL (2005) Essentials of statistical inference. Cambridge University Press, New York, 226 p. ISBN-10: 0-521-54866-7 Young FW (1987) Multidimensional scaling: history, theory and applications. Lawrence Erlbaum, Hillsdale, New Jersey Young FW, Hamer RM (1994) Theory and applications of multidimensional scaling. Eribaum Associates, Hillsdale, NJ Young G, Householder AS (1938) Discussion of a set of points in terms of their mutual . Psychometrika 3:19–22 Young G, Householder AS (1941) A note on multidimensional psycho-physical analysis. Psy- chometrika 6:331–333 Yu Z-P, Chu P-S, Schroeder T (1997) Predictive skills of seasonal to annual rainfall variations in the U.S. Affiliated Pacific Islands: Canonical correlation analysis and multivariate principal component regression approaches. J Climate 10:2586–2599 Zveryaev II, Hannachi AA (2012) Interannual variability of Mediterranean evaporation and its relation to regional climate. Clim Dyn. https://doi.org/10.1007/s00382-011-1218-7 Zveryaev II, Hannachi AA (2016) Interdecadal changes in the links between Mediterranean evaporation and regional atmospheric dynamics during extended cold season. Int J Climatol. https://doi.org/10.1002/joc.4779 Zeleny, M (1987) Management support systems: towards integrated knowledge management. Human Syst Manag 7:59–70 582 References

Zhang G, Patuwo BE, Hu MY (1997) Forecasting with artificial neural networks: The state of the art. Int J Forecast 14:35–62 Zhu Y, Shasha D (2002) Statstream: Statistical monitoring of thousands of data streams in real time. In: VLDB, pp 358–369. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.19.8732 Index



Band pass filter, 108 Canonical correlation, 339 Bandwidth, 256 Canonical correlation analysis (CCA), 4, 14 Baroclinic structures, 125 Canonical correlation patterns, 341 Baroclinic waves, 135 Canonical covariance analysis (CCOVA), 344 Barotropic models, 56 Canonical covariance pattern, 345 Barotropic quasi-geostrophic model, 141 Canonical variates, 254, 339 Barotropic vorticity, 440 Caotic time series, 48 Bartlett’s factor score, 229 Carbone dating, 2 Basis functions, 35, 320, 357 Categorical data, 203 Bayesian framework, 411 Categorical predictors, 435 Bayes theorem, 470 Cauchy distribution, 424 Bernoulli distribution, 475 Cauchy principal value, 101 Beta-plane, 142 Cauchy sequence, 538 Between-groups sums-of-squares, 254 Cauchy’s integral formula, 101 Betweenness, 68 Causal interactions, 68 Bias, 419 Caveats, 168 Bias parameter, 421, 423 Centered, 256 Bias-variance trade-off, 47 Centering, 23 Biennial cycles, 132 Centering operator, 341 Bimodal, 259 Central England temperature, 149 Bimodal behaviour, 440 Central limit theorem (CLT), 45, 247, 277 Bimodality, 213, 259, 311, 314 Centroid, 167, 205 Binary split, 435 Chaotic attractor, 147 Binomial distribution, 475–476 Chaotic system, 147 Bi-orthogonality, 178 Characteristic multipliers, 546 Bi-quartimin criterion, 231 Chernoff, H., 17 Bivariate cumulant, 250 Chi-square, 342, 385, 436 Blind deconvolution, 267, 271 Chi-squared distance, 253 Blind separation, 4 Chi-square distribution, 478 Blind source separation (BSS), 266, 268 Cholesky decomposition, 320, 397 Bloch’s wave, 373 Cholesky factorisation, 505 Blocked flow, 313 Circular, 154 Bloc-Toeplitz, 159 Circular autocorrelation matrix, 154 Bock’s procedure, 254 Circular covariance matrix, 154 Boltzmann H-function, 245, 271 Circulation patterns, 445 Bootstrap, 45, 46, 49, 437 Circulation regime, 314 Bootstrap blocks, 49 Classical MDS, 206 Botstrap, 47 Classical scaling, 207–209 Botstrap resampling, 48 Classical scaling problem, 204 Boundary condition, 329 Classification, 3, 434 Boundary currents, 405 Climate analysis, 16 Box plots, 17 Climate change, 442 Branch, 434 Climate change signal, 184 Break phases, 214 Climate dynamics, 3 Broadband, 115 Climate extremes, 444, 445 Broad-banded waves, 114 Climate forecast system (CFS), 443 B-slpines, 322, 326 Climate Modeling Intercomparison Project Bubble, 430 (CMIP), 368 Burg, 196 Climate models, 2, 445 Climate modes, 3 Climate networks, 67, 68, 169, 281 C Climate prediction, 440 Calculus of variations, 273 Climatic extreme events, 68 Canberra, 203 Climatic sub-processes, 68 Index 585

Climatological covariance, 183 Conjugate information, 136 Climatology, 23, 160 Connection, 68 Climbing algorithms, 247 Constrained minimization, 76, 530 Closeness, 68 Contingency table, 203 Clouds, 243 Continuous AR(1), 222 Cluster analysis, 254 Continuous CCA, 352 Clustering, 4, 260, 397, 416, 429 Continuous curves, 321 Clustering index, 254 Continuous predictor, 435 Clustering techniques, 242 Continuum, 5 Clusters, 243, 253 Continuum power regression, 388 CMIP5, 369, 383 Convective processes, 118 CMIP models, 387 Convex combination, 398 Cocktail-party problem, 268 Convex hull, 397, 398 Codebook, 430 Convex least square problems, 398, 400 Co-decorrelation time matrix, 177 Convex set, 398 Coding principle, 280 Convolution, 26, 38, 107, 266 Coherence, 492 Convolutional, 422 Coherent structures, 33, 296 Convolutional linear (integral) operator, 169 , 293 Convolutional NN, 440, 442 Colinearity, 82 Convolving fIlter, 267 Combined field, 338 Coriolis parameter, 310 Common EOF analysis, 368 Correlation, 16 Common EOFs, 383 Correlation coefficient, 16, 21, 193 Common factors, 220, 221 Correlation integral, 178 Common PC, 383 Correlation matrix, 24, 61 Communality, 222 , 293 Communities, 306 Co-spectrum, 29, 114 Compact support, 173 Costfunction, 402, 435 Competitive process, 431 Coupled, 2 Complex conjugate, 26, 29, 106, 187, 191 Coupled pattern, 33, 337 Complex conjugate operator, 96 Coupled univariate AR(1), 138 Complex covariance matrix, 95 Covariability, 201, 338 Complex data matrix, 97 Covariance, 16, 21, 24 Complex EOFs (CEOFs), 4, 13, 94, 95 Covariance function, 36 Complex frequency domain, 97 Covariance matrix, 24, 25, 40, 53, 54, 58, 61, Complexified fileds, 35, 95, 96 88 Complexified multivariate signal, 106 Covariance matrix spectra, 49 Complexified signal, 106 Covariance matrix spectrum, 57 Complexified time series, 99, 100 Critical region, 229 Complexity, 1–3, 11, 55, 67, 241, 265 Cross-correlations, 93 Complex network, 169 Cross-covariance function, 27 Complex nonlinear dynamical system, 2 Cross-covariance matrix, 106, 134, 137, 339, Complex principal components, 94, 96 341, 392 Composition operator, 138 Cross-entropy, 417 Comprehensive Ocean-Atmosphere Data Set Cross-spectra, 98, 106 (COADS), 169 Cross-spectral analysis, 93 Conditional distribution, 223 Cross-spectral covariances, 94 Conditional expectations, 189, 228 Cross-spectral matrix, 97 Conditional probability, 129, 470 Cross-spectrum, 28, 100, 497 Condition number, 42 Cross-spectrum matrix, 28, 29, 98, 105, 113, Confidence interval, 109 114 Confidence limits, 41, 46, 58 Cross-validation (CV), 45–47, 238, 325, 330, Conjugate directions, 525 362, 437, 454, 458 Conjugate gradient, 422, 528 Cross-validation score, 391 586 Index

Cubic convergence, 283 Delay operator, 485 Cubic spline, 19, 453, 454 Delays, 148 Cumulant, 249, 250, 267, 275, 472–473 Delay space, 149 Cumulative distribution function, 20, 251, 302, Dendrogram, 448 419, 471 Descent algorithm, 83, 187, 209, 331, 526 Currents, 405 Descent numerical algorithm, 236 Curse of dimensionality, 9 Descriptive data mining, 3 Curvilinear coordinate, 321 Determinant, 500 Curvilinear trajectory, 85 Determinantal equation, 153 Cyclone frequencies, 61 Deterministic, 175 Cyclo-stationarity, 132, 373 Detrending, 17 Cyclo-stationary, 100, 125, 132, 370 Diagonalisable, 40, 66, 96 Cyclo-stationary EOFs, 367, 372 Diagonal matrix, 40 Cyclo-stationary processes, 372 Dicholomus search, 522–523 Differenced data, 58 Differencing operator, 58 D Differentiable function, 209 Damp, 164 Differential entropy, 244, 271 Damped oscillators, 138 Differential manifold, 400 Damped system, 135 Differential operator, 464 Damping, 56 Diffierentiable, 242 Damping times, 138 Diffusion, 56, 310 Data-adaptive harmonic decomposition Diffusion map, 304 (DAHD), 169 Diffusion process, 56–58 Data analysis, 11 Digital filter, 124 Data assimilation, 426 Dimensionality reduction, 138, 429 Database, 3 Dimension reduction, 3, 11, 38 Data image, 448 , 27, 103 Data-Information-Knowledge-Wisdom Direct product, 502 (DIKW), 10 Discontinuous spectrum, 380 Data mapping, 11 Discrepancy measures, 259 Data matrix, 22, 25, 36, 38, 41, 42, 45, 58, 61, Discrete categories, 433 63, 158 Discrete fourier transform, 104, 107 Data mining, 3, 4 Discretised Laplacian, 362 Data space, 223 Discriminant analysis, 254, 295 Davidson–Fletcher–Powell, 529 Discriminant function, 423 Dawson’s integral, 103 Disjoint, 469 Deaseasonalised, 17 Disorder, 244 Decadal modes, 317 Dispersive, 114, 164 Decay phase, 164 Dispersive waves, 115 Decision making, 4 Dissimilarities, 202 Decision node, 434 Dissimilarity matrix, 202, 205, 448 Decision trees, 433 Distance matrix, 207 Deconvolution, 4, 266, 267 Distortion errors, 210 Decorrelating matrix, 276 Distribution ellipsoid, 63 Decorrelation time, 171, 172, 176, 178, 180, Domain dependence, 71 241 Double centered dissimilarity matrix, 206 Degeneracy, 168, 169 Double diagonal operator, 401 Degenerate, 91, 110, 152, 165, 168 Doubly periodic, 373 Degenerate eigenvalue, 155 Downscaling, 38, 374, 445, 451 Degrees of freedom (dof), 2, 11, 50 Downward propagating signal, 93, 110 Delay coordinates, 148, 167, 317 Downward signal propagating, 113 Delayed vector, 158 Downwelling current patterns, 447 Delay embedding, 169 Dual form, 396 Index 587

Duality, 176 Entropy index, 248, 250, 255 Dynamical mode decomposition (DMD), 138 Envelope, 103, 397 Dynamical reconstruction, 147, 157 EOF rotation, 55 Dynamical systems, 86, 88, 118, 138, 146, Epaneshnikov kernel, 248 147, 169 Equiprobable, 9 ERA-40 reanalyses, 167, 263 ERA-40 zonal mean zonal wind, 113 E Error covariance matrix, 175, 189 Earth System Model, 369 E-step, 227 East Atlantic pattern, 345 , 203 Easterly, 93 European Centre for Medium Range Weather Eastward propagation, 125 Forecasting (ECMWF), 92 Eastward shift, 49 European Re-analyses (ERA-40), 92, 212 East-west dipolar structure, 345 Expalined variance, 109 ECMWF analyses, 125 Expansion coefficients, 38 Edgeworth expansion, 275 Expansion functions, 35 Edgeworth polynomial expansion, 250 Expectation (E), 227 Effective number of d.o.f, 53 Expectation maximisation (EM), 226 Effective numbers of spatial d.o.f, 53 Expectation operator, 251, 262 Effective sample size, 46, 50 Explained variance, 41, 53, 238 Efficiency, 247 Exploratory data analysis (EDA), 3 E-folding, 138 Exploratory factor analysis (EFA), 233, 239, E-folding time, 50, 51, 123, 127, 173, 310 290 Eigenanalysis, 35 Exponential distribution, 477–478 Eigenmode, 111 Exponential family, 402 Eigenspectrum, 96, 382 Exponential smoothing, 18 Eigenvalue, 37, 39, 503 Exponential smoothing filter, 27 Eigenvalue problems, 34, 134 Extended EOFs, 35, 94, 139, 146, 316 Eigenvector, 39, 503 Extremes, 397, 398 Ekman dissipation, 310 Elbow, 405 Ellipsoid of the distribution, 58 F Ellipsoids, 62 Factor analysis (FA), 4, 12, 46, 219, 224 Elliptical, 311 Factor loading matrix, 234 Elliptical distributions, 61 Factor loading patterns, 233 Elliptically contoured distributions, 375 Factor loadings, 220, 221, 230, 237 Elliptical region, 65 Factor model, 223, 228, 233, 238 Ellipticity, 375 Factor model parameters, 513–515 El-Niño, 33, 55, 157, 293, 405 Factor rotation, 73 El-Niño Southern Oscillation (ENSO), 2, 11, Factors, 219, 223 33, 132, 391, 412 Factor scores, 229, 237 EM algorithm, 238 Factor-scores matrix, 220 Embeddings, 148, 151, 202, 204, 211 Fastest growing modes, 135 Embedding space, 148 FastICA, 282, 283 Empirical distribution function (edf), 20, 21 FastICA algorithm, 282 Empirical orthogonal functions (EOFs), 13, 22, Fat spectrum, 149 34, 38 Feasible set, 86 Empirical orthogonal teleconnection (EOT), 67 Feature analysis, 199 Emptiness paradox, 9 Feature extraction, 3, 11, 416 Emptyness, 243 Feature space, 296, 297, 306, 392, 422 Empty space phenomena, 243 Feedback matrix, 119, 121, 122, 130, 146 Empty space phenomenon, 6, 9, 10 Filter, 174 Energy, 456, 463 Filtered data matrix, 63 Entropy, 196, 232, 243, 244, 271, 278, 436 Filtered time series, 99 588 Index

Filtering, 112, 166, 342 Fröbenius norm, 217, 230, 285, 391, 398 Filtering problem, 177 Fröbenius product, 230 Filter matrix, 276 Fröbenius structure, 139 Filter patterns, 178 Full model, 228 Finear filter, 26 Full rank, 54, 187 Finite difference scheme, 361 Funcional EOFs, 321 First-order auto-regressive model, 56 Functional analysis, 300 First-order Markov model, 179 Functional CCA, 353 First-order optimality condition, 386 Functional EOF, 319 First-order spatial autoregressive process, 56 Functional PCs, 322 First-order system, 135 Fundamental matrix, 546 , 243, 245 Fisher’s linear discrimination function, 254 Fisher–Snedecor distribution, 479 G Fitted model, 228 Gain, 27 Fixed point, 394 Gamma distribution, 478 Fletcher-Powell method, 226 Gaussian, 19, 63, 129, 192, 243 Fletcher–Reeves, 528 Gaussian grid, 44, 328 Floquet theory, 546 Gaussianity, 375 Floyd’s algorithm, 211 Gaussian kernel, 301, 393 Fluctuation-dissipation relation, 129 Gaussian mixture, 214 Forecast, 172, 185 Gaussian noise, 221 Forecastability, 172, 197 General circulation models (GCMs), 387 Forecastable component analysis (ForeCA), Generalised AR(1), 139 196 Generalised eigenvalue problem, 61, 66, 177, Forecastable patterns, 196 324, 327, 357, 361, 396 Forecasting, 38, 416, 422 Generalised inverse, 501 Forecasting accuracy, 130 Generalised scalar product, 190 Forecasting models, 179 Generating kernels, 300 Forecasting uncertainty, 442 Geodesic distances, 211 Forecast models, 179 Geometric constraints, 70, 71 Forecast skill, 185 Geometric moments, 167 Forward-backward, 421 Geometric properties, 72 Forward stepwise procedure, 257 Geometric sense, 62 Fourier analysis, 17 Geopotential height, 62, 66, 68, 125, 180, 188, Fourier decomposition, 107 260, 262, 382, 445 Fourier series, 104, 373 Geopotential height anomalies, 181 Fourier spectrum, 103 Geopotential height re-analyses, 157 Fourier transform (FT), 27, 48, 99, 102, 125, Gibbs inequality, 273 176, 183, 187, 192, 267, 494 Gini index, 435 Fourth order cumulant, 170 Global scaling, 209 Fourth order moment, 75 Gobal temperature, 284 Fractal dimensions, 149 Golden section, 523 Fredholm eigen problem, 37 Goodness-of-fit, 209, 259 Fredholm equation, 320 Gradient, 85, 242, 386 Fredholm homogeneous integral equation, 359 Gradient ascent, 282 Frequency-band, 97 Gradient-based algorithms, 283 Frequency domain, 94, 97, 176 Gradient-based approaches, 526 Frequency response function, 29, 104, 108, Gradient-based method, 268 189, 191, 267 Gradient methods, 247 Frequency-time, 103 Gradient optimisation algorithms, 256 Friction, 93 Gradient types algorithms, 282 Friedman’s index, 251 , 301 Fröbenius matrix norm, 233 Gram-Schmidt orthogonalization, 397 Index 589

Grand covariance matrix, 160, 165, 338 Hybrid, 185 Grand data matrix, 161 Hyperbolic tangent, 421, 440 Grand tour, 12 Hypercube, 6, 8 Green function, 119 Hyperspaces, 241 Greenhouse, 450 Hypersphere, 5, 7 Greenland blocking, 146 Hypersurface, 86, 296 Green’s function, 129, 457, 464 Hypervolumes, 6, 243 Growing phase, 135 Hypothesis of independence, 45 Growth phase, 164 Growth rates, 127, 138 Gulf Stream, 405 I Gyres, 405 ICA rotation, 286 Ill-posed, 344 Impulse response function, 27 H Independence, 266, 470 Hadamard, 285, 401 Independent and identically distributed (IID), Hadamard product, 328, 502 50, 224 HadCRUT2, 183 Independent component analysis (ICA), 55, HadGEM2-ES, 369 266 HadISST, 332 Independent components, 63, 268 Hadley Centre ice and sea surface temperature Independent principal components, 286 (HadISST), 290 Independent sample size, 50 Hamiltonian systems, 136 Independent sources, 265, 293 , 169 Indeterminacy, 352 Heavy tailed distributions, 252 India monsoon rainfall, 66 Hellinger distance, 259 Indian Ocean dipole (IOD), 58, 412 Henderson filter, 18 Indian Ocean SST anomalies, 58 Hermite polynomials, 249, 252, 410 Inference, 3 Hermitian, 29, 96, 98, 137, 492, 502 Infomax, 280, 282, 283 Hermitian covariance matrix, 106, 109 Info-max approach, 278 Hermitian matrix, 109 Information, 244 Hessian matrix, 526 Information capacity, 280 Hexagonal, 430 Information-theoretic approaches, 270 Hidden, 4 Information theory, 243, 244, 271 Hidden dimension, 12 Initial condition, 140 Hidden factors, 269 Initial random configurations, 209 Hidden variables, 220 Inner product, 36, 401, 536 Hierarchical clustering, 448 Insolation, 199 High dimensionality, 3, 11 Instability, 71 Higher-order cumulants, 284 Instantaneous frequency, 102, 112, 113 Higher order moments, 266, 267 Integrability, 190, 191 High-order singular value decomposition, 293 Integrable functions, 299 Hilbert canonical correlation analysis (HCCA), Integral operator, 299, 300 137 Integrated power, 177 Hilbert EOFs, 95, 97, 100, 109, 113, 161 Integro-differential equations, 320, 326, 357, Hilbert filter, 102 359 Hilbert PC, 113 Integro-differential operator, 328 Hilbert POPs (HPOPs), 136 Integro-differential system, 360 Hilbert singular decomposition, 101 Interacting molecules, 1 Hilbert space, 36, 138, 190, 344 Interacting space/time scales, 2 Hilbert transform, 97, 101, 102, 105, 107, 109, Inter-dependencies, 67, 69 136, 145 Interesting, 242, 243 Homogeneous diffusion processes, 56 Interesting features, 15 Hovmoller diagram, 31 Interestingness, 243–245 590 Index

Interesting structures, 243, 264 Kelvin wave, 100, 162 Intergovernmental Panel for Climate Change Kernel, 169, 300, 465 (IPCC), 183 Kernel CCA, 395 Intermediate water, 322 Kernel density estimate, 280 Interpoint distance, 201, 368, 430 Kernel density estimation, 255 Interpoint distance matrix, 449 Kernel EOF, 297 Interpolated, 17 Kernel function, 19, 37, 299 Interpolation, 189 Kernel matrix, 301, 397 Interpolation covariance matrix, 194 Kernel MCA, 392 Interpolation error, 190, 241 Kernel methods, 280 Interpolation filter, 190 Kernel PCA, 297 Interpretation, 11 Kernel PDF, 259 Inter-tropical convergence zone (ITCZ), 147 Kernel POPs, 317 Intraseasonal time scale, 132 Kernel smoother, 256, 260, 280, 465 Intrinsic mode of variability, 58 Kernel smoothing, 19 Invariance, 72 Kernel transformation, 301 Invariance principle, 237 Kernel trick, 299 Inverse Fourier transform, 48 k-fold CV, 47 Inverse mapping, 307 Khatri-Rao, 291 Invertibility, 191 K-L divergence, 273, 274 Invertible, 395 k-means, 254, 305, 397 Invertible linear transfomation, 121 k-means clustering, 399 Irish precipitation, 391 k-nearest neighbors, 212 ISOMAP, 211 Kohonen network, 429 Isomap, 212, 411 Kolmogorov formula, 175 Isopycnal, 322 Kolmogorov-Smirnov distance, 253 Isotropic, 105, 283 Kolmogorov-Wiener approach, 174 Isotropic kernels, 393 Koopman operator, 138 Isotropic turbulence, 53 Kriging, 464 Isotropic uniqueness, 237 Kronecker matrix product, 291 Isotropy, 253 Kronecker symbol, 133 Iterative methods, 43 Krylov method, 138 Iterative process, 260 Krylov subspace, 40, 43 Kullback-Leibler distance, 259 Kullback-Leibler (K-L) divergence, 272, 277 J Kuroshio, 405 Jacobian, 272, 278, 283 Kuroshio current, 317 Jacobian operator, 310 Kurtosis, 53, 170, 259, 264, 267, 281, 282 JADE, 284 Japanese reanalyses, 314 Japan Meteorological Agency, 383 L Johnson-Lindenstrauss Lemma, 368–370 Lag-1 autocorrelations, 379 Joint distribution, 223 Lagged autocorrelation matrix, 150 Joint entropy, 280 Lagged autocovariance, 50 Joint probability density, 269 Lagged covariance matrix, 94, 98, 99, 159, Joint probability density function, 473 175, 180 JRA-55, 314 Lagrange function, 516 Jump in the spectrum, 387 Lagrange multiplier, 39, 76, 255, 273, 283, 340, 395 Lagrangian, 77, 331, 396 K Lagrangian function, 327 Karhunen–Loéve decomposition, 155 Lagrangian method, 532 Karhunen–Loéve equation, 373 Lagrangian multipliers, 358 Karhunen–Loéve expansion, 36, 37, 91 Lanczos, 40, 42 Index 591

Lanczos method, 517–518 Local averages, 11 La-Niña, 33, 405 Local concepts, 11 Lapalce probability density function, 270 Localized kernel, 302 Laplace-Beltrami differential operator, 304 Local linear embedding, 211 Laplacian, 305, 310, 360, 361 Logistic, 419 , 304, 317 Logistic function, 279, 280 Laplacian operator, 327, 361 Logistic regression, 417 Laplacian spectral analysis, 317 Log-likelihood, 224, 238 Large scale atmosphere, 264 Long-memory, 173, 179, 180 Large scale atmospheric flow, 295 Long short-term memory (LSTM), 440 Large scale flow, 384 Long-term statistics, 1 Largescale processes, 167 Long term trends, 180 Large scale teleconnections, 445 Lorenz, E.N., 147 Latent, 220 Lorenz model, 440 Latent heat fluxes, 383 Lorenz system, 157 Latent patterns, 4 Low frequency, 35, 180 Latent space, 223 Low-frequency modes, 184 Latent variable, 11, 12, 56 Low-frequency patterns, 194, 446 Lattice, 431 Low-frequency persistent components, 199 Leading mode, 44 Low-level cloud, 445 Leaf, 434 Low-level Somali jet, 212 Learning, 416, 425 Low-order chaotic models, 310 Leas square, 82 Low-order chaotic systems, 148 Least Absolute Shrinkage and Selection Lyapunov equation, 520 Operator (LASSO), 82 Lyapunov function, 263 Least square, 64, 165, 174 Least squares regression, 388 Leave-one-out CV, 47 M Leave-one-out procedure, 391 Machine learning, 3, 415 Legendre polynomials, 251 Madden-Julian oscillation (MJO), 91, 132, Leptokurtic, 270, 473 146, 164, 184 Likelihood, 385 Mahalanobis distance, 459 Likelihood function, 385 Mahalanobis metrics, 203 Likelihood ratio statistic, 228 Mahalanobis signal, 183 Lillieford test, 59 Manifold, 86, 211, 223, 295, 303 Linear convergence, 283 Map, 22 Linear discriminant analysis, 296 Marginal density function, 269 Linear filter, 29, 102, 189, 210, 266, 267 Marginal distribution, 223 Linear growth, 125 Marginal pdfs, 473 Linear integral operator, 299 Marginal probability density, 279, 280 Linear inverse modeling (LIM), 129 Marginal probability density functions, 65 Linearisation, 135 Markov chains, 305 Linearised physical models, 71 Markovian time series, 173 Linear operator, 26, 500 Markov process, 118 Linear programming, 521 Matching unit, 430 Linear projection, 241, 243 Mathematical programming, 531 Linear space, 295 Matlab, 23, 24, 26, 40, 77, 109, 161, 345, 377 Linear superposition, 55 Matrix derivative, 506–512 Linear transformation, 268 Matrix inversion, 342 Linear trend, 193 Matrix norm, 216 Linkage, 448 Matrix of coordinates, 205 Loading coefficients, 54 Matrix optimisation, 398 Loadings, 38, 54, 74, 75, 223 Matrix optimisation problem, 63 Loading vectors, 372 Maximization, 227 592 Index

Maximization problem, 74 MPI-ESM-MR, 369 Maximum covariance analysis (MCA), 337, M-step, 227 344 Multichannel SSA (MSSA), 157 Maximum entropy, 274 Multi-colinearity, 343 Maximum likelihood, 46, 62 Multidimensional scaling (MDS), 4, 201, 242, Maximum likelihood method (MLE), 224 254 Maximum variance, 38, 40 Multilayer perceptron (MLP), 423 Mean sea level, 383 Multimodal, 243 Mean square error, 37, 47 Multimodal data, 249 Mediterranean evaporation, 66, 345 Multinormal, 58 Mercer kernel, 37 Multinormal distribution, 130 Mercer’s theorem, 299 Multinormality, 46 Meridional, 94 Multiplicative decomposition, 259 Meridional overturning circulation, 199 Multiplicity, 154 Mesokurtic, 473 Multiquadratic, 424 Metric, 202 Multispectral images, 256 Mid-tropospheric level, 68 Multivariate filtering problem, 29 Minimum-square error, 175 Multivariate Gaussian distribution, 8 Minkowski distance, 202 Multivariate normal, 62, 245 Minkowski norm, 217 Multivariate normal distribution, 9 Mis-fit, 458 Multivariate normal IID, 362 Mixed-layer, 322 Multivariate POPs, 138 Mixing, 42, 55, 412 Multivariate random variable, 23, 24 Mixing matrix, 268, 269, 276 Multivariate spectrum matrix, 199 Mixing problem, 55, 284, 375 Multivariate t-distribution, 61 Mixing property, 375 Multivarite AR(1), 219 Mixture, 268 Mutual information, 64, 273–275, 278 Mixture model, 238 Mutually exclusive, 469 Mode analysis, 64 Model evaluation, 384 Model simulations, 2 N Mode mixing, 373 Narrow band, 103 Modes, 56 Narrow band pass filter, 98 Modes of variability, 38 Narrow frequency, 103 Modularity, 306 National Center for Environmental Prediction Modularity matrix, 306 (NCEP), 383 Moisture, 164, 445 National Oceanic and Atmospheric Moment matching, 53 Administration (NOAA), 411 Moments, 250, 259, 269 N-body problem, 460 Momentum, 136 NCEP-NCAR reanalysis, 68, 146, 233, 260, Monomials, 299 446 Monotone regression, 208 NCEP/NCAR, 31, 184, 310 Monotonicity, 430 Negentropy, 245, 259, 274, 277, 282, 284 Monotonic order, 385 Neighborhood graph, 212 Monotonic transformation, 210, 375 Nested period, 374 Monsoon, 114, 345 Nested sets, 448 Monte Carlo, 45, 46, 48, 343, 379, 525 Nested sigmoid architecture, 426 Monte Carlo approach, 260 Networks, 67, 68 Monte-Carlo bootstrap, 49 Neural-based algorithms, 283 Monthly mean SLP, 77 Neural network-based, 284 Moore–Penrose inverse, 501 Neural networks (NNs), 276, 278, 302, 415, Most predictable patterns, 185 416 Moving average, 18, 150 Neurons, 419 Moving average filter, 27 Newton algorithm, 283 Index 593

Newton–Raphson, 529 Null space, 137 Newton–Raphson method, 527 Nyquist frequency, 114, 494 Noise, 118 Noise background, 149 Noise covariance, 139 Noise floor, 382 O Noise-free dynamics, 124 Objective function, 84 Noise variability, 53 Oblimax, 232 Non-alternating algorithm, 400 Oblimin, 231 Nondegeneracy, 205 Oblique, 74, 81, 230 Nondifferentiable, 83 Oblique manifold, 401 Non-Gaussian, 267 Oblique rotation, 76, 77, 231 Non-Gaussian factor analysis, 269 Occam’s rasor, 13 Non-Gaussianity, 53, 264 Ocean circulation, 445 Non-integer power, 389 Ocean current forecasting, 445 Nonlinear, 296 Ocean current patterns, 446 Nonlinear association, 12 Ocean currents, 101 Nonlinear dynamical mode (NDM), 410 Ocean fronts, 322 Nonlinear features, 38 Ocean gyres, 170 Nonlinear interactions, 2 Oceanic fronts, 323 Nonlinearity, 3, 12 Ocean temperature, 321 Nonlinear manifold, 213, 247, 295, 410 Ocillating phenomenon, 91 Nonlinear mapping, 299 Offset, 419 Nonlinear MDS, 212 OLR, 161, 162, 164 Nonlinear ow regimes, 49 OLR anomalies, 160 Nonlinear PC analysis, 439 One-mode component analysis, 64 Nonlinear programme, 521 One-step ahead prediction, 175, 185 Nonlinear smoothing, 19 Operator, 118 Nonlinear system of equations, 263 Optimal decorrelation time, 180 Nonlinear units, 422 Optimal interpolation, 411 Nonlocal, 10 Optimal lag between two fields, 349–350 Non-locality, 115 Optimal linear prediction, 179 Non-metric MDS, 208 Optimally interpolated pattern (OIP), 189, 191 Non-metric multidimensional scaling, 430 Optimally persistent pattern (OPP), 176, 178, Non-negative matrix factorisation (NMF), 403 180, 183, 185, 241 Non-normality, 269, 288 Optimisation algorithms, 521 Non-parametric approach, 280 Optimization criterion, 76 Nonparametric estimation, 277 Order statistics, 271 Non parametric regression, 257, 453 Ordinal MDS, 208 Non-quadratic, 76, 83, 209, 275 Ordinal scaling, 210 Nonsingular affine transformation, 246 Ordinary differential equations (ODEs), 85, Normal, 502 147, 263, 530, 543 Normal distribution, 245, 477 Orthogonal, 74, 81, 230, 502 Normalisation constraint, 81 Orthogonal complement, 86, 206 , 40 Orthogonalization, 397 Normal modes, 119, 123, 130, 134, 138 Orthogonal rotation, 74, 75, 77 North Atlantic Oscillation (NAO), 2, 9, 31, 33, Orthomax-based criterion, 284 42, 49, 56, 68, 77, 83, 234, 259, 284, Orthonormal eigenfunctions, 37 288, 289, 293, 311, 312, 332, 381, 387, Oscillatory, 122 391, 446 Outgoing long-wave radiation (OLR), 146, 345 North Pacific Gyre Oscillation (NPGO), 293 Outlier, 17, 250, 260, 271 North Pacific Oscillation, 233, 234 Out-of-bag (oob), 437 Null hypothesis, 48, 56, 57, 228, 288, 342 Overfitting, 395 594 Index

P Polar vortex, 93, 167, 259, 291 Pacific decadal oscillation (PDO), 293, 412 Polynomial equation, 153 Pacific-North American (PNA), 2, 33, 68, 127, Polynomial fitting, 19 259, 263 Polynomial kernels, 302 Pacific patterns, 83 Polynomially, 296 Pairwise distances, 201 Polynomial transformation, 299 Pairwise similarities, 202 Polytope, 403, 404 Parabolic density function, 248 POP analysis, 219 Paradox, 10 POP model, 179 Parafac model, 291 Positive semi-definite, 177, 206, 207, 216, 502 Parsimony, 13 Posterior distribution, 227 Partial least squares (PLS) regression, 388 Potential vorticity, 309 Partial phase transform, 390 Powell’s algorithms, 525 Partial whitening, 388, 390 Power law, 286 Parzen lagged window, 184 Power spectra, 196, 267 Parzen lag-window, 185 Power spectrum, 48, 100, 110, 156, 172–174, Parzen window, 182 180, 199, 487 Pattern recognition, 295, 416 Precipitation, 445 Patterns, 3, 22 Precipitation predictability, 440 Pattern simplicity, 72 Predictability, 184, 199 Patterns of variability, 91 Predictable relationships, 54 Pdf estimation, 416 Predictand, 338, 363 Pearson correlation, 281 Prediction, 3, 189, 416, 421 Penalised, 354 Prediction error, 174 Penalised likelihood, 457 Prediction error variance, 174, 185 Penalised objective function, 532 Predictive data mining, 3 Penalized least squares, 344 Predictive Oscillation Patterns (PrOPs), 185 Penalty function, 83, 532 Predictive skill, 170 Perceptron convergence theorem, 417 Predictor, 338, 363 Perfect correlation, 354 Pre-image, 394 Periodic signal, 149, 151, 155, 180 Prewhitening, 268 Periodogram, 180, 187, 192, 495 Principal axes, 58, 63 Permutation, 95, 159, 376 Principal component, 39 Permutation matrix, 160, 170 Principal component analysis (PCA), 4, 13, 33 Persistence, 157, 171, 185, 350, 440 Principal component regression (PCR), 388 Persistent patterns, 142, 172 Principal coordinate analysis, 202 Petrie polygon, 403 Principal coordinate matrix, 206 Phase, 27, 96, 492 Principal coordinates, 206–208, 215 Phase changes, 113 Principal fundamental matrix, 546 Phase functions, 110 Principal interaction pattern (PIP), 119, 139 Phase propagation, 113 Principal oscillation pattern (POP), 15, 94, 95, Phase randomization, 48 119, 126 Phase relationships, 97, 100 Principal prediction patterns (PPP), 343 Phase shift, 95, 150, 151 Principal predictors, 351 Phase space, 169 Principal predictors analysis (PPA), 338 Phase speeds, 112, 157 Principal regression analysis (PRA), 338 Physical modes, 55, 56 Principal trend analysis (PTA), 199 Piece-wise, 19 Principlal component transformation, 54 Planar entropy index, 250 Prior, 223 Planetary waves, 310 Probabilistic archetype analysis, 402 Platykurtic, 271 Probabilistic concepts, 17 Platykurtotic, 473 Probabilistic framework, 45 Poisson distribution, 476 Probabilistic models, 11, 50 Polar decomposition, 217 Probabilistic NNs, 424 Index 595

Probability, 467 Quadratic trend, 379 Probability-based approach, 149 Quadrature, 91, 110, 114, 149, 150, 155, 167, Probability-based method, 46 192 Probability density function (pdf), 2, 11, 23, Quadrature function, 107 58, 65, 196, 245, 471 Quadrature spectrum, 29 Probability distribution, 219, 243 Quantile, 46 Probability distribution function, 255 Quantile-quantile, 58 Probability function, 470 QUARTIMAX, 75 Probability matrix, 404 Quartimax, 230 Probability space, 213 QUARTIMIN, 76, 77, 81 Probability vector, 398 Quartimin, 231 Product-moment correlation coefficient, 16 Quas-biennial periodicity, 93 Product of spheres, 401 Quasi-biennial oscillation (QBO), 91, 101, Profile likelihood, 363 113, 145 Profiles, 321 Quasi-geostrophic, 315 Progressive waves, 114, 115 Quasi-geostrophic model, 135, 309 Projected data, 251 Quasi-geostrophic vorticity, 135 Projected gradient, 83, 86, 283, 533 Quasi-Newton, 422, 437 Projected gradient algorithm, 284 Quasi-Newton algorithm, 141, 425 Projected matrix, 369 Quasi-stationary signals, 367 Projected/reduced gradient, 85 Projection index, 242, 246 Projection methods, 367 R Projection operators, 86, 206 Radial basis function networks, 442 Projection pursuit (PP), 242, 269, 272, 280, Radial basis functions (RBFs), 321, 325, 357 293 Radial coordinate, 375 Projection theorem, 538 Radial function, 457 Propagating, 91, 97 Radiative forcing, 2, 118 Propagating disturbances, 97, 107, 117 Rainfall, 447 Propagating features, 168 Rainfall extremes, 445 Propagating patterns, 91, 95, 96, 122, 135, 166 Raleigh quotient, 177, 183, 395, 396 Propagating planetary waves, 72 Random error, 219 Propagating signal, 110 Random experiment, 469 Propagating speed, 93 Random forest (RF), 433, 450 Propagating structures, 94, 95, 118, 145, 157 Random function, 190, 353 Propagating wave, 162 Randomness, 244 Propagation, 113, 145 Random noise, 192, 220 Propagator, 130, 546 Random projection, 369 Prototype, 397, 430 Random , 368 Prototype vectors, 448 Random samples, 49 Proximity, 201, 430 Random variable, 24, 224, 470 Pruning, 436, 437 Random vector, 154, 474 Pseudoinverse, 501 Rank, 501 Rank correlation, 21 Rank order, 210 Q Ranndom projection (RP), 368 QR decomposition, 505–506 Rational eigenfunctions, 109 Quadratic equation, 154 RBF networks, 424 Quadratic function, 82 Reanalysis, 2, 77, 131, 445 Quadratic measure, 53 Reconstructed attractor, 148 Quadratic nonlinearities, 296 Reconstructed variables, 164 Quadratic norm, 53 Rectangular, 430 Quadratic optimisation problem, 38 Recurrence networks, 68, 169 Quadratic system of equations, 263 Recurrence relationships, 153 596 Index

Recurrences matrix, 169 Rotationally invariant, 250, 251 Recurrent, 422 Rotation criteria, 74 Recurrent NNs, 421 , 73, 74, 115, 223, 284 Recursion, 259 Roughness measure, 454 Red-noise, 49, 51, 52, 152 R-square, 315, 348 Red spectrum, 125, 135 Runge Kutta scheme, 263 Reduced gradient, 242 Running windows, 50 Redundancy, 274, 278 Redundancy analysis (RDA), 337, 348 Redundancy index, 347 S Redundancy reduction, 279 Salinity, 321 Regimes, 3 Sample covariance matrix, 237 Regime shift, 315 Sample-space noise model, 227 Regime transitions, 169 Sampling errors, 178 Regression, 3, 38, 67, 82, 184, 221, 315, 337, Sampling fluctuation, 342 381, 395, 416, 422, 434 Sampling problems, 71 Regression analysis, 257 Scalar product, 140 Regression curve, 209 Scalar product matrix, 207 Regression matrix, 66, 338, 347 Scaled SVD, 365 Regression matrix A, 363 Scaling, 25, 54, 62, 207, 396, 399 Regression models, 4 Scaling problem, 62, 238, 338 Regularisation, 325, 390, 395, 396 Scandinavian pattern, 234, 293, 330 Regularisation parameter, 391 Scandinavian teleconnection pattern, 288 Regularisation problem, 330, 455 Scores, 37 Regularised EOFs, 331 Scree plot, 404 Regularised Lagrangian, 396 Sea level, 374 Regularization parameters, 344 Sea level pressure (SLP), 21, 31, 33, 41, 68, Regularized CCA (RCCA), 343 115, 180, 194, 212, 284, 314, 315 Replicated MDS, 210 Sea saw, 13 Reproducing kernels, 300 Seasonal cycle, 42, 160 Resampling, 46 Seasonality, 17, 93 Residual, 6, 36, 44, 87 Sea surface temperature (SST), 11, 33, 55, 132, Residual sum of squares (RSS), 344, 398, 404 180, 284, 293, 383, 391, 404, 445 Resolvent, 129, 546 Second kind, 359 Response, 27 Second order centered moment, 24 Response function, 102 Second-order differential equations, 86 Response variables, 351 Second-order Markov model, 179 RGB colours, 256 Second order moments, 259, 266 Ridge, 313, 395 Second-order stationary, 132 Ridge functions, 257, 258 Second-order stationary time series, 196 Ridge regression, 344, 391, 395 Second-order statistics, 375 Riemannian manifold, 400 Self-adjoint, 37, 299, 300 Robustness, 438 Self-interactions, 68 Rominet patterns, 3 Self-organisation, 425, 429 Root node, 434 Self-organising maps (SOMs), 416, 429 Rosenblatt perceptron, 417 Semi-annual oscillation, 162 Rossby radii, 310 Semi-definite matrices, 63 Rossby wave, 115, 263 Semi-difinite positivity, 226 Rossby wave breaking, 146 Semi-, 64 Rotated EOF (REOF), 4, 13, 35, 73, 141 Sensitivity to initial as well as boundary Rotated factors, 72, 230 conditions, 2 Rotated factor scores, 234 Sen surface height, 445 Rotated principal components, 73 Sentivity to initial contitions, 147 Rotation, 72, 73, 229 Sequentially, 386 Index 597

Serial correlation, 50, 171
Shannon entropy, 37, 243, 244
Skew orthogonal projection, 404
Short-memory, 180
Siberian high, 381, 383
Sigmoid, 279, 421, 423
Sigmoid function, 278
Signal patterns, 178
Signal-to-noise maximisation, 177
Signal-to-noise ratio, 44, 254
Significance, 343
Significance level, 59
Similarity, 201
Similarity coefficient, 203
Similarity matrix, 207, 217
Simple structure, 72
Simplex, 260, 402, 403, 524
Simplex method, 524
Simplex projection, 408
Simplex vertices, 404
Simplicity, 72, 73
Simplicity criteria, 81
Simplicity criterion, 73, 74
Simplified Component Technique-LASSO (SCoTLASS), 82
Simplified EOFs (SEOFs), 82
Simulated, 3
Simulated annealing, 250
Simulations, 33
Singular, 175
Singular covariance matrices, 151
Singular system analysis (SSA), 146
Singular value, 26, 42, 74, 161, 345, 350
Singular value decomposition (SVD), 26, 40, 42, 96, 503–504
Singular vectors, 109, 122, 149, 160, 206, 341, 342, 389, 393
Skewness, 26, 264, 288
Skewness modes, 262
Skewness tensor, 263
Skew-symmetric, 217
Sliding window, 148
SLP anomalies, 42, 49, 213, 233, 285, 331, 381
S-mode, 22
Smooth EOFs, 319, 326, 332
Smooth functions, 319
Smoothing, 18, 19
Smoothing constraint, 258
Smoothing kernel, 19
Smoothing parameter, 303, 326, 362
Smoothing problem, 258
Smoothing spectral window, 192
Smoothing spline, 354
Smooth maximum covariance analysis (SMCA), 319, 358
Smoothness, 256, 352
Smoothness condition, 453
Smooth time series, 352
Sampling with replacement, 47
Sneath's coefficient, 203
Smoothing condition, 355
Southern Oscillation, 33
Southern Oscillation mode, 196
Spacetime orthogonality, 70
Sparse systems, 40
Spatial derivative, 111
Spatial weighting, 44
Spearman's rank correlation coefficient, 21
Spectral analysis, 344, 391
Spectral clustering, 302, 304, 306, 317
Spectral decomposition theorem, 300
Spectral density, 156, 196
Spectral density function, 26, 27, 98, 124, 186
Spectral density matrix, 175, 185, 187, 190–192
Spectral domain, 97
Spectral domain EOFs, 99
Spectral entropy, 197
Spectral EOFs, 98
Spectral methods, 328
Spectral peak, 125
Spectral radius, 217
Spectral space, 373
Spectral window, 187
Spectrum, 54, 58, 66, 137, 148, 160, 167, 180, 375, 380
Sphered, 256
Sphering, 283
Spherical cluster, 302
Spherical coordinates, 326, 327
Spherical geometry, 44, 361
Spherical harmonics, 118, 310
Spherical RBFs, 360
Sphering, 25, 54
Splines, 19, 321, 326, 354, 423
Spline smoothing, 258
Split, 436
Splitting, 434
Splitting node, 434
Squared residuals, 210
Square integrable, 36
Square root of a symmetric matrix, 25
Square root of the sample covariance matrix, 256
Squashing, 279
Squashing functions, 421
Surrogate data, 48

SST anomalies, 60
Stability analysis, 135
Standard error, 45
Standard normal distribution, 10, 46
Standing mode, 169
Standing oscillation, 135
Standing waves, 124, 168, 169
State space, 38
Stationarity, 121, 370
Stationarity conditions, 121
Stationary, 26, 94, 154
Stationary patterns, 91
Stationary points, 283
Stationary solution, 88, 310
Stationary states, 311, 313
Stationary time series, 98, 156
Statistical downscaling, 365
Steepest descent, 425, 526
Steepness, 421
Stepwise procedure, 386
Stochastic climate model, 310
Stochastic integrals, 99
Stochasticity, 12
Stochastic , 305, 398
Stochastic model, 122
Stochastic modelling, 222
Stochastic process, 36, 37, 190, 352, 480
Stochastic system, 118
Storm track, 61, 180
Stratosphere, 91, 448
Stratospheric activity, 291
Stratospheric flow, 93
Stratospheric warming, 146
Stratospheric westerlies, 93
Stratospheric westerly flow, 93
Stratospheric zonal wind, 125
Streamfunction, 127, 132, 141, 263, 310, 311
Stress function, 208, 209
Structure removal, 256
Student distribution, 478
Subantarctic mode water, 323
Sub-Gaussian, 271, 280
Subgrid processes, 44
Subgrid scales, 118
Subjective/Bayesian school, 467
Submatrix, 159
Subscale processes, 118
Substructures, 168
Subtropical/subpolar gyres, 317
Sudden stratospheric warming, 291
Summer monsoon, 34, 66, 212, 404
Sum-of-squares, 254
Sum of the squared correlations, 351
Super-Gaussian, 270, 473
Supervised, 416
Support vector machine (SVM), 422
Support vector regression, 442
Surface temperature, 62, 183, 369, 445
Surface wind, 33
Surrogate, 48
Surrogate data, 45
Swiss-roll, 211
Symmetric, 502
Synaptic weight, 430
Synchronization, 68
Synoptic, 167
Synoptic eddies, 180
Synoptic patterns, 445
Synoptic weather, 444
System's memory, 171

T
Tail, 9, 250
Training dataset, 48
Tangent hyperbolic, 282
Tangent space, 86
t-distribution, 375
Teleconnection, 2, 33, 34, 49, 67, 284, 345, 446
Teleconnection pattern, 31
Teleconnectivity, 66, 67
Tendencies, 136, 302, 310
Terminal node, 434
Ternary plot, 404
Tetrahedron, 403
Thermohaline circulation, 321
Thin-plate, 463
Thin-plate spline, 424, 454
Third-order moment, 262
Thompson's factor score, 229
Three-way data, 291
Tikhonov regularization, 344
T-mode, 22
Toeplitz, 156, 159
Toeplitz covariance matrix, 152
, 149
Topographic map, 429
Topological neighbourhood, 431
Topology, 431
Topology-preserving projection, 448
T-optimals, 179
Tori, 211
Trace, 500
Training, 46, 416, 425
Training set, 47, 391
Trajectory, 147
Trajectory matrix, 149

Transfer entropy, 281
Transfer function, 27, 104, 105, 125, 419, 424
Transition probability, 305
Transpose, 500
Trapezoidal rule, 192
Travelling features, 124
Travelling waves, 117, 145
Tree, 448
Trend, 377
Trend EOF (TEOF), 375
Trend pattern, 381
Trends, 3, 180
Triangular matrix, 397
Triangular metric inequality, 205
Triangular truncation, 310
Tridiagonal, 153
Tropical cyclone forecast, 444
Tropical cyclone frequency, 442
Tropical Pacific SST, 440
Tucker decomposition, 293
Tukey's index, 248
Tukey two-dimensional index, 248
Tukey window, 183
Two-sample EOF, 383
Two-way data, 291

U
Uncertainty, 45, 49, 244
Unconstrained problem, 76, 186
Uncorrelatedness, 270
Understanding-context independence, 9
Uniform distribution, 196, 244, 251, 271
Uniformly convergent, 37
Uniform random variables, 21
Unimodal, 259, 311
Uniqueness, 222
Unit, 419, 421
Unitary rotation matrix, 115
Unit circle, 135
Unit gain filter, 107
Unit-impulse response, 266
Unit sphere, 386
Unresolved waves, 310
Unstable modes, 135
Unstable normal modes, 135
Unsupervised, 416
Upwelling, 447

V
Validation, 3
Validation set, 47
Variational problem, 455
VARIMAX, 73, 77, 81
Varimax, 115
Vector autoregressive, 138
Vertical modes, 322
Visual inspection, 241
Visualization, 11
Visualizing proximities, 201
Volcanoes, 2
Vortex, 167
Vortex area, 167
Vorticity, 135

W
Water masses, 322
Wave amplitude, 113
Wavelet transform, 103
Wave life cycle, 113
Wavenumber, 71, 111, 115
Weather forecasting, 440
Weather predictability, 184
Weather prediction, 34, 451
Weighted covariance matrix, 281
Weighted Euclidean distances, 211
Weight matrix, 278
Welch, 196
Westerly flows, 34, 93
Westerly jets, 92
Western boundary currents, 55
Western current, 406
Westward phase tilt, 125
Westward propagating Rossby waves, 157
Whitened, 389
Whitening transformation, 364
White noise, 56, 131, 149, 197, 267
Wind-driven gyres, 322
Wind fields, 212
Window lag, 160, 167
Winning neuron, 430
Wishart distribution, 385, 480
Working set, 400
Wronskian, 545

X
X11, 17

Y
Young-Householder decomposition, 206
Yule-Walker equations, 175, 185

Z
Zero frequency, 107, 177
Zero-skewness, 170
Zonally symmetric, 92
Zonal shift of the NAO, 49
Zonal velocity, 184
Zonal wavenumber, 126
Zonal wind, 92, 184
z-transform, 267