
Physica D 230 (2007) 65–71 www.elsevier.com/locate/physd

Statistical predictability in the atmosphere and other dynamical systems

Richard Kleeman∗

Courant Institute of Mathematical Sciences, 251 Mercer Street, New York, NY 10012, USA

Available online 24 July 2006. Communicated by C.K.R.T. Jones.

Abstract

Ensemble predictions are an integral part of routine weather and climate prediction because of the sensitivity of such projections to the specification of the initial state. In many discussions it is tacitly assumed that ensembles are equivalent to probability distribution functions (p.d.f.s) of the random variables of interest. In general, for vector valued random variables this is not the case (not even approximately), since practical ensembles do not adequately sample the high dimensional state spaces of dynamical systems of practical relevance. In this contribution we place these ideas on a rigorous footing using concepts derived from Bayesian analysis and information theory. In particular, we show that ensembles must imply a coarse graining of state space and that this coarse graining implies loss of information relative to the converged p.d.f. To cope with the needed coarse graining in the context of practical applications, we introduce a hierarchy of entropic functionals. These measure the information content of multivariate marginal distributions of increasing order. For fully converged distributions (i.e. p.d.f.s) these functionals form a strictly ordered hierarchy. As one proceeds up the hierarchy with ensembles instead, however, increasingly coarser partitions are required by the functionals, which implies that the strict ordering of the p.d.f. based functionals breaks down. This breakdown is symptomatic of the necessarily limited sampling by practical ensembles of high dimensional state spaces and is unavoidable for most practical applications. In the second part of the paper the theoretical machinery developed above is applied to the practical problem of mid-latitude weather prediction. We show that the functionals derived in the first part all decline essentially linearly with time, and there appears in fact to be a fairly well defined cut off time (roughly 45 days for the model analyzed) beyond which initial condition information is unimportant to statistical prediction.
© 2006 Elsevier B.V. All rights reserved.

Keywords: Predictability; Information theory; Statistical prediction; Dynamical systems

∗ Tel.: +1 212 998 3233; fax: +1 212 995 4121. E-mail address: [email protected].
0167-2789/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.physd.2006.06.005

1. A Bayesian perspective on predictability

The Bayesian perspective of mathematical statistics (see, for example, [2]) posits prior and posterior probability distributions for random variables of interest: before the acquisition of particular data concerning a random variable, one is assumed to have a prior distribution derived from all previous observations. Subsequently the new data acquired modifies this distribution to a posterior distribution. The extent to which this new distribution "differs" from the original prior is a measure of the usefulness of the newly acquired data. The functional usually deployed (see, for example, [4]) to measure such a difference or utility is the relative entropy D(p_post ‖ p_prior), which for discrete distributions can be written¹

    D(p ‖ q) ≡ Σ_{x∈H} p(x) ln(p(x)/q(x))    (1.1)

where H is a countable index set and we are using p(x) as the posterior and q(x) as the prior discrete distributions.

¹ Note that to make this definition well defined we assume two further things: firstly, that the summands are taken as non-zero only when p(x) ≠ 0, and secondly, that q(x) = 0 only when p(x) = 0. The second condition is equivalent to saying that if the probability of a particular event was zero in the (infinite) past it will always be zero in the future.

If we consider a particular partitioning of R^n, our state space, then this discrete form can be written easily as a Riemann sum by writing the probability of a particular partition as the product of the local probability density and the partition volume. In the usual infinite limit this then becomes the continuous density distribution form:

    D(f ‖ g) ≡ ∫_{R^n} f(z) ln(f(z)/g(z)) dz

where f and g are the limiting probability density functions corresponding with p and q respectively. This transition between discrete and continuous forms will be important in our subsequent discussion below. Note that this transition does not occur for the absolute entropy H(p) = −Σ_{x∈H} p(x) ln p(x), because the volume element in the Riemann sum does not cancel and so one is left with an infinite "renormalization" constant when one takes the limit.

The ideal statistical prediction problem consists in determining an initial condition distribution using observations and then using a good dynamical model to project this distribution forward in time. In terms of the Bayesian perspective, the intuitively obvious prior distribution for this process is the "climatological" or "equilibrium" distribution associated with the particular dynamical system. This is clearly the best prior in the absence of information concerning the initial conditions. It also has the advantage that, since the posterior prediction distribution usually converges asymptotically to the prior, the utility of statistical predictions approaches zero asymptotically, which coincides with intuition. From a practical perspective, the relative entropy measures the utility of the statistical prediction under the assumption that the model from which it is derived is perfect. In the realistic case of imperfect models, the degree to which perfect model utility corresponds with real utility is determined by the realism of the dynamical model (more discussion on this may be found in [7,9]).

Often our dynamical system will be subject to external forcing. In the case that this is periodic (as it is for the climate system) we choose our prior to have the same phase with respect to the periodic forcing as the posterior at the time of interest. Such a convention is the one commonly adopted in the climate community.

2. Coarse graining, ensembles and marginal entropy

An excellent introductory reference for information theory is the book of Cover and Thomas [4]. In Appendix A we review the properties of relative entropy relevant to this contribution. More background can be found in that book as well as in [11], which is a somewhat more mathematical paper by the present author and others. The material presented in this section relies heavily on information theoretic and advanced statistical concepts. Less technically minded readers will find a summary at section end.

For any practical dynamical system of significant dimensionality, the integration of the corresponding Fokker–Planck equation for the pdf becomes computationally problematical. In such a situation typically a Monte Carlo approach is taken. More precisely, initial conditions are sampled from an assumed distribution and then each sample member is integrated forward in time. A sample only of the pdf is therefore available at all times. What is the relationship between this "ensemble" and the pdf? We approach this question in a (hopefully) intuitively transparent fashion. A first step towards an understanding of this issue is the concept of a state space partitioning.

2.1. Partitions

Suppose we have a state space of dimension n and define a partition Γ of R^n as a complete and non-overlapping coverage of R^n by a collection of M subsets {Γ_1, Γ_2, ..., Γ_M}. More precisely, for any x ∈ R^n there exists a j such that x ∈ Γ_j, and in addition Γ_i ∩ Γ_k = ∅ for all i ≠ k.

Now associated with each partition member is a probability which is given by

    p_i = ∫_{Γ_i} f(x) dx

where the underlying pdf is f. Clearly, in the limit that the volume of each partition member approaches zero, p_i approaches f(x_i*)Vol(Γ_i) where x_i* is some element of Γ_i. Also, using the p_i and the analogous q_i, we can define the discrete relative entropy D_Γ(p ‖ q) with respect to Γ using Eq. (1.1), and the Riemann sum formalism shows that this approaches the continuous D(f ‖ g) in the above limit.

A partition Λ is said to be a refinement of Γ if every Γ_i contains at least one Λ_j. We write Γ ⪯ Λ and it follows easily that

Theorem 2.1. If Γ ⪯ Λ then D_Γ ≤ D_Λ where the discrete relative entropies have the same underlying continuous pdfs.

Proof. The result follows easily from the definition of refinement and Theorem 16.1.2 of [4]. □

The straightforward interpretation of this result is that the coarsening of a particular partitioning results in a drop in the relative entropy, since we are discarding the information on finer scales. Note that as this refinement process approaches the limit discussed above, the relative entropy approaches the continuous value monotonically from below. Partitions and ensembles have an obvious statistical connection:

2.2. Ensembles

An ensemble E is a set of K points in R^n, and one can naturally define a bin count n(Γ_i, E) to be an integer valued function on Γ which specifies the number of ensemble members which are members of a particular Γ_i. It is obvious that the bin count f_i ≡ n(Γ_i, E) serves as a basis for estimating p_i; however it is equally clear that this estimate, which we denote by p̂_i, is just that and has an uncertainty associated with it. In fact we can conceptually write down the probability P(p̂) that this estimate is actually equal to p ≡ (p_1, p_2, ..., p_M). In [9] we deduced using elementary Bayesian arguments that this should be given by

    P(p̂) = Φ_{f⁺}(p̂),    f⁺ ≡ (f_1 + 1, f_2 + 1, ..., f_M + 1)    (2.1)

where Φ_{f⁺} is the multivariate Dirichlet distribution (see [1]). The most likely p̂_i or first moment of this distribution, which we denote by p̂_i, serves as the "best" sample estimator for p_i and is given by

    p̂_i = (f_i + 1)/(K + M)    (2.2)

where M is the number of partition elements in Γ. Notice that this differs from the naive choice of f_i/K and also is always non-zero. Now the uncertainty involved in this sample estimate implies an expected information loss: if we use the above estimate when in fact p_i is different then, by the information theoretic interpretation of relative entropy, the loss of information is D_Γ(p ‖ p̂). Since we have a probability distribution that any particular s is actually the correct p, we are able therefore to calculate the expected information loss associated with our particular estimator of p:

    EL(p̂) = ∫ P(s) D(s ‖ p̂) ds.

Note that one can evaluate this explicitly using Eq. (2.1), and one may also show that it is minimized by using the best estimator from Eq. (2.2). In addition, in general this information loss increases as the partition of state space is refined, since the sample size in each partition element Γ_i decreases and so the sampling error involved in estimating p_i increases. For a partition that is "too fine" this loss can approach the estimated relative entropy D_Γ(p̂ ‖ q̂), meaning that our estimate of the information content of the ensemble is completely unreliable. In order to avoid this one must choose a partition which has many bin counts n(Γ_i, E) ≫ 1 both for the prediction and prior (climatological) ensembles. One may define an effective relative entropy with respect to a particular partition Γ and ensembles E_p and E_q as

    D_Γ^eff(E_p, E_q) = max{D_Γ(p̂ ‖ q̂) − EL(p̂) − EL(q̂), 0}.    (2.3)

In the section below we use the usual relative entropy without the expected information loss removed; however this equation is worth bearing in mind.

2.3. Marginal entropies

The practical method for the construction of ensembles involves the repeated integration of a dynamical model using many different initial conditions. Such a situation implies that, except for very low order dynamical systems, we are restricted to sample sizes of at most around 10^5, since that many integrations over time periods of practical interest is typically extremely computationally expensive. Some thought shows that this implies some rather severe restrictions on the estimation of multivariate pdfs and associated entropic functionals. Thus, for example, if one is interested in a state space of dimension n and in retaining 10 divisions of data per dimension, then such a partition will have 10^n members, and so to avoid the sampling loss of information discussed in the previous subsection we must restrict ourselves to n < 5. One could, of course, partially avoid this issue by reducing the number of divisions per dimension. This amounts to coarsening our partition, however, and as we saw in Theorem 2.1 above this inevitably implies a reduction in available information. It is clear that the finite size of the available ensemble implies a fundamental restriction on the amount of information available about multivariate pdfs. This conceptual situation is shown schematically in Fig. 1.

[Fig. 1. The effect on information of different partition refinements for a particular ensemble.]

This situation naturally motivates the study of marginal distributions, since here this so-called curse of dimensionality can sometimes be avoided. As an example, consider a bi-variate marginal distribution: if m divisions per dimension are retained then the partition in the relevant two dimensional subspace has m² elements. Thus to avoid sampling loss with practical samples here we require perhaps m < 150. Often m = 100 is sufficient with many distributions (Gaussian for example) to obtain very close to complete convergence of discrete entropic functionals to their continuous limits. Motivated by the above we introduce the concept of marginal entropy:

Definition. Suppose we have n random variables X_i with corresponding multivariate distribution p(X_1, X_2, ..., X_n) and all possible marginal distributions p(X_{j1}, X_{j2}, ..., X_{jm}) of order m < n with j_k ≤ n (and distinct); then the marginal relative entropy of order m is defined as

    D^m(p ‖ q) ≡ (1/C(n, m)) Σ_{j1<j2<···<jm} D(p(X_{j1}, X_{j2}, ..., X_{jm}) ‖ q(X_{j1}, X_{j2}, ..., X_{jm}))    (2.4)

where C(n, m) is the usual binomial coefficient. The marginal entropy is thus the average relative entropy of all possible marginal distributions of order m. Marginal entropies have been used in statistical physics, most particularly in connection with liquids, where correlations between molecules are of significance (see, for example, [6] and [12]).

Marginal relative entropies can be shown to satisfy an inequality hierarchy:

Theorem 2.2. Marginal relative entropies with respect to n random variables X_i and the same partition Γ satisfy the following chain of inequalities

    D^1(p ‖ q) ≤ D^2(p ‖ q) ≤ ··· ≤ D^n(p ‖ q) = D(p ‖ q)    (2.5)

where for notational ease we are dropping the partition subscript Γ.

Proof. Use the notation D(Y_1, Y_2, ..., Y_k) to denote the relative entropy of p(Y_1, Y_2, ..., Y_k) and q(Y_1, Y_2, ..., Y_k) (note that the order of random variables here is immaterial). The chain rule of relative entropy shows that

    D(Y_1, Y_2, ..., Y_k) ≤ D(Y_1, Y_2, ..., Y_k, Y_{k+1})

which also shows that

    D(Y_1, Y_2, ..., Y_k) ≥ (1/k){D(Y_2, ..., Y_k) + D(Y_1, Y_3, ..., Y_k) + ··· + D(Y_1, Y_2, ..., Y_{k−1})}.

If this inequality is applied term by term to the sum C(n, k) D^k(p ‖ q), and repetitions of the smaller order relative entropies are collected, we obtain

    C(n, k) D^k(p ‖ q) ≥ ((n − k + 1)/k) C(n, k − 1) D^{k−1}(p ‖ q)

or, using the properties of C(n, k),

    D^k(p ‖ q) ≥ D^{k−1}(p ‖ q). □

As a final observation, if instead of considering partitions to estimate ensemble information content one were to fit particular distributions to the ensemble (for example Gaussians and their generalizations), then one is left with the rather difficult issue of calculating sampling loss, since clearly such a fitted distribution is uncertain. That this is an important problem can be seen when one calculates the entropy estimate from such a fitted form. If the ensemble is small enough to imply a "coarse" partition in n dimensional space, then experience shows that the relative entropy calculated from the fitted distribution is often much larger than the value obtained from any viable partitioning as discussed above. This would seem to imply that in such a case the sampling loss from the fitted distribution may be quite high. The large uncertainty associated with the fitted distribution for this scenario will also mean that the numerical fitting problem is ill-conditioned, since there will be many almost equally likely distributions that "fit" the ensemble.
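The hierarchy of Theorem 2.2 is straightforward to check numerically for small discrete distributions. The following sketch is illustrative code written for this discussion (none of it comes from the paper's own computations): it evaluates Eq. (1.1) and the order-m average of Eq. (2.4) for a trivariate example on a fixed partition and verifies D^1 ≤ D^2 ≤ D^3 = D.

```python
import itertools
import math

import numpy as np

def relative_entropy(p, q):
    """Discrete relative entropy D(p || q) of Eq. (1.1), in nats.
    Summands are taken as zero wherever p(x) = 0."""
    p, q = np.asarray(p).ravel(), np.asarray(q).ravel()
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def marginal_relative_entropy(p, q, m):
    """Order-m marginal relative entropy D^m(p || q) of Eq. (2.4):
    the average of D over all C(n, m) order-m marginal distributions."""
    n = p.ndim
    total = 0.0
    for kept in itertools.combinations(range(n), m):
        dropped = tuple(a for a in range(n) if a not in kept)
        total += relative_entropy(p.sum(axis=dropped), q.sum(axis=dropped))
    return total / math.comb(n, m)

# Two arbitrary strictly positive trivariate distributions on a 4x4x4 partition.
rng = np.random.default_rng(0)
p = rng.random((4, 4, 4)); p /= p.sum()
q = rng.random((4, 4, 4)); q /= q.sum()

d1, d2, d3 = (marginal_relative_entropy(p, q, m) for m in (1, 2, 3))
# Theorem 2.2: D^1 <= D^2 <= D^3, and the order-n functional is D itself.
assert d1 <= d2 <= d3 and abs(d3 - relative_entropy(p, q)) < 1e-12
```

Note that the ordering holds here only because the partition is held fixed across orders; as the text emphasizes, it breaks down once lower orders are granted finer partitions.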

This hierarchy has the natural interpretation that more information is apparent as the higher order multivariate behaviour of the distribution is taken into account.

It is very important to realise that this inequality chain only holds for a fixed partition Γ of a particular n dimensional state space. On the other hand, some reflection shows that finer and finer partitions are possible without much sampling loss as the order of the marginal entropy is reduced. We discuss this issue further in the next section when we consider a practical example from atmospheric science.

To summarize: practical calculations in the field of statistical predictability of realistic dynamical systems imply that we must consider ensembles of possible predictions rather than pdfs. The ensembles represent sample estimates of desired distributions. One view on this sample estimation is obtained by partitioning the state space and counting the number of ensemble members passing through each partition element. It is clear that as the order of the multivariate distribution increases, any estimate must rely on coarser partitioning per dimension added. This problem is often called the "curse of dimensionality". Of course, if one is only interested in say uni-variate or bi-variate distributions then this is not usually a problem. Motivated by this fundamental practical difficulty, we have introduced a natural hierarchy of so called "marginal entropies" (see Eq. (2.4)) which measure the information content of marginal distributions of increasing order. For a given partitioning of state space they form a strict inequality hierarchy (Eq. (2.5)), reflecting the increase in information content as higher order (marginal) distributions are considered. Some reflection however also shows that the lower order marginal distributions can be more "precisely" viewed using finer partitions of state space, and Theorem 2.1 then shows that the finer partitions have greater information content. This means that the strict hierarchy will be broken by choosing finer partitions for lower order marginal entropies. One might choose to do this in order to maximize available information.

3. Mid-latitude weather predictability

3.1. Basic model configuration

We now apply the machinery of the previous sections to the problem of weather predictability. We restrict our analysis to the mid-latitudes, since current atmospheric models are thought to simulate the major characteristics of the circulation here reasonably well. The tropical regions are heavily influenced by moist convection, which is commonly thought to be only fairly crudely simulated in current generation weather models.

We used the openly available University of Hamburg PUMA code (documented in [10]), in the version in which radiation and convection are replaced by a temperature relaxation term. The model was configured to have a horizontal resolution of spectral T42 (around 2.8 × 2.8 degrees) and 5 vertical levels. Qualitatively, the simulation of synoptic variability in the storm tracks during the Northern winter (the season used in all experiments below) was in good agreement with observations.

Initial conditions were assumed to be drawn from a Gaussian distribution which had mean fields taken at random from an extended integration of the model; variances an order of magnitude less than climatology; and a homogeneous horizontal spatial decorrelation scale of 1000 km (vertical correlations were assumed zero). Such a distribution implies initial conditions that have greater uncertainty than is normal in a typical operational forecasting situation.

Attention was focused on the North American and Atlantic storm track region and a domain of 90°W–0° in longitude and 20°N–65°N in latitude. A reduced state space consisting of stream-function EOFs was used. The first ten such patterns explain around 95% of the variance in our restricted domain, which was dominated by large scale synoptic variations associated with baroclinic instability. Given this, we chose these 10 patterns to analyze predictability, and so they were taken as the basis elements of our reduced state space.
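The reduction described above is a standard EOF (principal component) truncation. The sketch below is purely illustrative — the synthetic array merely stands in for stream-function anomaly fields on the paper's domain, and none of the numbers are taken from the PUMA output — but it shows the operation: an SVD of the anomaly matrix yields the patterns, the leading ten of which define the reduced state vector.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 500 "daily" fields on 300 grid points, built with a
# low-dimensional structure so that a few patterns dominate the variance.
fields = rng.standard_normal((500, 20)) @ rng.standard_normal((20, 300))

anomalies = fields - fields.mean(axis=0)            # remove the time mean
_, s, vt = np.linalg.svd(anomalies, full_matrices=False)

k = 10                                              # retain ten EOFs
eofs = vt[:k]                                       # spatial patterns
pcs = anomalies @ eofs.T                            # reduced state vector
explained = (s[:k] ** 2).sum() / (s ** 2).sum()     # fraction of variance
```

The rows of `eofs` are orthonormal, so `pcs` gives the coordinates of each field in the reduced state space on which the partitions of the next subsection are built.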

3.2. State-space partitioning, ensembles and marginal entropies

A partitioning strategy in state-space is somewhat arbitrary but should provide a clear interpretation of the resulting information measures. Guided by this, we chose our partitions here to have equal prediction ensemble members in each dimension of state-space. This approach is widely used in practical contexts, where such partitions are referred to as (for example) quartiles, deciles and so on. In a future publication (see [8]) we will explore the sensitivity of our results to this partitioning choice.

The efficiency of our chosen model means that ensembles of size considerably larger than those commonly used operationally were possible. Here we used a 9600 member sample and integrated each member for 90 days, under the (correct) assumption that at such a time the prediction and equilibrium ensembles should be statistically indistinguishable. In terms of the discussion in Section 1, we identify the prior distribution with an equilibrium or climatological ensemble and the posterior distribution with the prediction ensemble. Since these ensembles become statistically indistinguishable after a sufficient time (see below), the relative entropy between them declines with time until a residual due to sampling remains.

From the analysis of the previous section it is clear that we need to choose our bin count per partition to be "fairly large", i.e. of order 5–10, in order to avoid significant sampling loss. In addition, finer partitions are possible for lower order marginal entropies, since marginalization of distributions implies consolidation of partitions in the direction of the eliminated/integrated dimensions. Guided by these considerations, we restricted our attention to marginal entropies of order less than 6, since beyond this the required partitioning per dimension is very coarse. For the remaining marginal entropies (orders 1–5) we chose the number of partitions per dimension to be 1024, 32, 10, 6 and 4 respectively. Notice that the partitioning for orders 4 and 5 is quite coarse. Based on experience with idealized multivariate distributions (Gaussian and Gamma), we would expect only orders 1–3 to have nearly converged relative entropies. In other words, we might expect to see the coarse graining effects of Theorem 2.1 for orders 4 and 5. On the other hand, one might expect to see the effects of the hierarchy of Theorem 2.2 for orders 1–3.

3.3. Results

The time evolution of the first five marginal entropies is shown in Fig. 2. Broadly speaking, there is a consistent decline in relative entropy for the first 45 days which is approximately linear in nature. Following this time the relative entropy has a very small value consistent with a residual sampling error.⁴ These results suggest the rather surprising conclusion that there may actually be a cut-off time in weather prediction beyond which initial condition information is completely irrelevant.

[Fig. 2. Marginal entropy evolution.]

The ordering of the marginal entropies for short prediction leads is more or less consistent with our expectation discussed above: the order 2 marginal exceeds the order 1 marginal, presumably because of the hierarchy shown in Theorem 2.2 above, since both are close to converged to their continuous value with the partitioning chosen. This ordering is reversed for orders 3–5, presumably because of the coarse-graining effects of Theorem 2.1. The second effect described appears most important for short prediction times, probably because it is then that coarse graining issues are most prominent, since at such times the prediction ensemble has considerably less spread than the equilibrium. Notice that the uni-variate entropy always lags, consistent with our explanations.

The results shown here have important implications for the problem of atmospheric climate prediction. They suggest that predictions which extend beyond a month and a half no longer gain any benefit from the inclusion of initial condition data and must rely for their skill entirely on boundary conditions, which are derived mainly from the ocean conditions.

The almost linear decline in predictability noted is rather striking and may be a fundamental property of mid-latitude geophysical turbulence. These matters, and a more detailed analysis of the meteorological results including methodological robustness, are examined in a manuscript shortly to be submitted to an atmospheric journal [8].

⁴ Remember Eq. (2.3) from the previous section.

Acknowledgements

The author wishes to thank Greg Eyink from Johns Hopkins University for a stimulating discussion at the UCLA/IPAM 2005 data assimilation meeting concerning the material presented here. He would also like to thank the organizers of the UCLA/IPAM meeting (where this material was first presented) for organizing a very useful and enjoyable meeting. The work of Prof. Klaus Fraedrich and co-workers in making the PUMA atmospheric model available is also gratefully acknowledged. This work was supported by the CMG and ATM programs of NSF with grant numbers 0417728 and 0430889 respectively.

Appendix A. Properties of relative entropy

An excellent introduction to the information theoretic concepts used in this contribution can be found in the book of Cover and Thomas [4] and the mathematical presentation [11]. Here we present some relevant results from these references. Proofs for the first two theorems can be found there. The third theorem's proof is presented here in detail, as its precise form is of some physical relevance and it is only briefly sketched in [4]. The final theorem is often stated but not proved in the literature, so we provide a proof.

The continuous form of the relative entropy satisfies a number of important properties which we now discuss in detail:

Theorem A.1. Suppose we have two probability densities f and g; then D(f ‖ g) ≥ 0, with equality if and only if f = g almost everywhere (i.e. at points where f ≠ 0).

Simply put, this confirms that if the prior and posterior p.d.f.s differ significantly then (positive) utility has been obtained by the observation/prediction process, no matter what form this difference takes.

Theorem A.2. Suppose we define a general non-linear transformation of our state space F: R^n → R^n which is non-degenerate, i.e. det(J(F)) ≠ 0 where J is the Jacobian; then the relative entropy of the transformed probability densities is left invariant.

This result is rather important practically, since non-linear transformations of state space are common, particularly in meteorology (consider the standard transformation of the vertical coordinate from geometric height to sigma), and invariance of predictability measures under such a transformation would seem desirable. Note that absolute entropy does not satisfy non-linear invariance, although linear invariance is satisfied (see [11] for more detail).

Consider now the time evolution of probability densities. We are able to prove a generalized second law of thermodynamics:

Theorem A.3. Suppose we have two probability densities F = F(x, t, x′, t′) and G = G(x, t, x′, t′) with x, x′ ∈ R^n and t, t′ ∈ R, and let us assume that the following causality condition holds for the associated conditional densities:

    F(x, t | x′, t′) = G(x, t | x′, t′)  where t ≥ t′    (A.1)

then the associated marginal distributions f(x, t), f(x′, t′) and g(x, t), g(x′, t′) satisfy

    D_t(f ‖ g) ≤ D_{t′}(f ‖ g)    (A.2)

where

    D_t(f ‖ g) ≡ ∫_{R^n} f(x, t) ln(f(x, t)/g(x, t)) dx

and by definition

    f(x, t) ≡ ∫_{R^{n+1}} F(x, t, x′, t′) dx′ dt′
    f(x′, t′) ≡ ∫_{R^{n+1}} F(x, t, x′, t′) dx dt

and similarly for G and g.

Proof. Consider the joint distributions F(x, t, x′, t′) and G(x, t, x′, t′); then the chain rule for relative entropy (see [4] page 23) shows that

    D_{tt′}(F ‖ G) = D_{t′}(f ‖ g) + D_{t|t′}(F ‖ G)    (A.3)

where D_{t|t′} is the so-called conditional relative entropy:

    D_{t|t′}(F ‖ G) ≡ ∫_{R^n} dx′ f(x′, t′) ∫_{R^n} dx F(x, t | x′, t′) ln(F(x, t | x′, t′)/G(x, t | x′, t′))

which is the expected relative entropy of the conditional distributions. It is obvious, in view of this form and the first theorem of this section, that all terms in Eq. (A.3) are non-negative in general. In addition we can use the chain rule to also establish that

    D_{tt′}(F ‖ G) = D_t(f ‖ g) + D_{t′|t}(F ‖ G).

Combining this with Eq. (A.3) and using the causality condition, which ensures that in fact D_{t|t′}(F ‖ G) = 0, we obtain

    D_{t′}(f ‖ g) = D_t(f ‖ g) + D_{t′|t}(F ‖ G)

which establishes the desired result, since the last term is non-negative. □

We refer to Eq. (A.1) as a causality condition for the following reason. Suppose we have a dynamical system and we specify the state variables to have value x′ at time t′. It seems reasonable to suppose (in the absence of outside influences) that the probability that the state vector will be x at a later time t should be uniquely specified. This is certainly the case for the Fokker–Planck equation and the included Liouville equation which governs deterministic dynamical systems. If this conditional is unique, its value will be the same whether we are considering the equilibrium (g) or transient (f) behaviour of the system; hence the causality requirement stated. It is interesting to note that Theorem A.3 is a standard result in the study of asymptotic solutions of the Fokker–Planck equation (see [5] page 61).

It is interesting in the context of Theorem A.3 to consider a subspace of state space. Clearly now the conditionals in (A.1) need not be unique, since their value will depend also on the variables in the complement of the subspace, which we have not specified. One may expect then that the relative entropy of subspaces will not necessarily satisfy the temporal monotonicity condition of Eq. (A.2). We can interpret this behaviour as information flow to the subspace from its complement. Under certain interesting conditions both absolute and relative entropy are actually conserved (we consider the latter case):

Theorem A.4. Suppose we have a dynamical system given by

    ∂u_i/∂t = A_i(u),  i = 1, ..., N    (A.4)

where A_i is a differentiable vector function which we assume satisfies the (Liouville) condition

    Σ_{i=1}^{N} ∂A_i/∂u_i = 0;

then, if f and g are two solutions of the corresponding probability density evolution equation

    f_t + Σ_{i=1}^{N} ∂(A_i f)/∂u_i = 0,

then for all times t and t′

    D(f(t) ‖ g(t)) = D(f(t′) ‖ g(t′)).

Proof. We have

    (f ln(f/g))_t = f_t ln(f/g) + f_t − g_t (f/g)
                  = f_t (ln(f/g) + 1) − g_t (f/g)
                  = −∇·(A f)(ln(f/g) + 1) + ∇·(A g)(f/g)
                  = −∇·(A f ln(f/g))

where we are using the Liouville condition for the last step. This shows that the function f ln(f/g) satisfies a flux conservation equation, and hence that its integral over all space must be conserved. □

It is worth observing that a closed Hamiltonian dynamical system will satisfy the Liouville condition. Many inviscid fluid systems also satisfy this particular condition. If however a stochastic term is added to the right hand side of the dynamical system in Eq. (A.4) to represent neglected or fine scale motions, then the probability density evolution equation becomes the more general Fokker–Planck equation with non-zero diffusion term. The relative entropy in this case can then be shown to strictly decline with time (see [5] pages 61–63). Such a situation is reminiscent of Boltzmann's paradox of statistical mechanics (see [3]), where entropy only increases (and irreversibility appears) when a macroscopic or coarse-grained view of the molecular dynamics is taken. Physically, what is happening in both cases is that as time increases information is lost from the coarse grained scales to the fine scales, but over all scales information is conserved.

References

[1] M. Abramowitz, I.A. Stegun (Eds.), Handbook of Mathematical Functions, ninth printing ed., Dover, New York, 1972.
[2] J. Bernardo, A. Smith, Bayesian Theory, John Wiley and Sons, 1994.
[3] L. Boltzmann, Lectures on Gas Theory, Dover, March 1995.
[4] T.M. Cover, J.A. Thomas, Elements of Information Theory, Wiley, New York, NY, 1991.
[5] C.W. Gardiner, Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences, in: Springer Series in Synergetics, vol. 13, Springer, 2004.
[6] H.S. Green, C.A. Hurst, Order-disorder Phenomena, in: Monographs in Statistical Physics and Thermodynamics, vol. 5, Interscience, London, New York, 1964.
[7] R. Kleeman, Measuring dynamical prediction utility using relative entropy, J. Atmospheric Sci. 59 (2002) 2057–2072.
[8] R. Kleeman, Limits to statistical weather predictability, J. Atmospheric Sci. (submitted for publication).
[9] R. Kleeman, A.J. Majda, Predictability in a model of geostrophic turbulence, J. Atmospheric Sci. 62 (2005) 2864–2879.
[10] L.M. Leslie, K. Fraedrich, A new general circulation model: Formulation and preliminary results in a single and multiprocessor environment, Clim. Dynam. 13 (1997) 35–43.
[11] A.J. Majda, R. Kleeman, D. Cai, A mathematical framework for quantifying predictability through relative entropy, Methods Appl. Anal. 9 (2002) 425–444.
[12] L. Onsager, Information Cascade. Unpublished notes (12:163, Mathematics), Onsager Archive, NTNU Library, Trondheim, Norway. http://www.ub.ntnu.no/formidl/hist/tekhist/tek5/eindex.htm.