
Applications of IMA Annual Program Year Workshop Applications in Biology, Dynamics, and March 5-9, 2007

Information Geometry (IG) and Algebraic Statistics (AS)

Giovanni Pistone (DIMAT Politecnico di Torino) http://staff.polito.it/giovanni.pistone/

1 Example: new cancer incidence and gender

• D. Geiger, C. Meek, and B. Sturmfels. On the toric algebra of graphical models. Ann. Statist., 34:1463–1492, 2006.
• F. Rapallo. Toric statistical models: Parametric and binomial representations. AISM, 2007. DOI 10.1007/s10463-006-0079-z
• G. Consonni and G. Pistone. Bayesian analysis of contingency tables with possibly zero-probability cells. Technical Report 0703123, arXiv, 2007

The following table reports data for different types of cancer, separated by gender, for Alaska in the year 1989 (Simonoff 2003).

Type of cancer   Female   Male   Total
Lung                 38     90     128
Melanoma             15     15      30
Ovarian              18      *      18
Prostate              *    111     111
Stomach               0      5       5
Total                71    221     292

Clearly cells (3, 2) and (4, 1) are structural zeros, while we regard the zero count corresponding to the combination (Stomach, Female) = (5, 1) as a possibly zero-probability cell.

2 Quasi-independence

• A typical assumption that is of interest in this case is that of quasi-independence (QI), corresponding to the standard independence assumption for all sub-tables, excluding those having a structural zero.
• For this hypothesis, Simonoff (2003) finds a p-value between 2% and 3%, depending on the method employed. The data thus seem to provide significant evidence against the QI-model, although this evidence is not very strong.
• Let $I = \{1, 2, 3, 4, 5\}$, $J = \{1, 2\}$ denote the sets of levels for the rows and columns respectively, and consider the two-way table with cells in the set $A = I \times J \setminus \{(3, 2), (4, 1)\}$, i.e. with cells (3, 2) and (4, 1) missing.

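• Under the QI-model the un-normalized cell probabilities $q_{ij}$ are given by

\[ q_{ij} = \rho_i \psi_j, \quad (i, j) \in A \qquad \text{(QI-model)} \]

• If $q_{ij} > 0$ for all $(i, j) \in A$, then

\[ \log q_{ij} = \alpha_i + \beta_j, \quad (i, j) \in A, \]

with $\alpha_i = \log \rho_i$, $\beta_j = \log \psi_j \in \mathbb{R}$.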
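As a rough illustration of the QI-model on the cancer table, the sketch below fits $m_{ij} = \rho_i \psi_j$ on the cells of $A$ by iterative proportional fitting and computes a Pearson statistic. The choice of fitting algorithm, the function name `fit_quasi_independence`, and the use of the asymptotic chi-square approximation are assumptions made for this example; they are not necessarily the methods used by Simonoff.

```python
import math

# Observed counts on the cell set A; the structural zeros (3, 2) and (4, 1) are absent.
counts = {(1, 1): 38, (1, 2): 90, (2, 1): 15, (2, 2): 15,
          (3, 1): 18, (4, 2): 111, (5, 1): 0, (5, 2): 5}

def fit_quasi_independence(counts, sweeps=200):
    """Iterative proportional fitting of m_ij = rho_i * psi_j on the cells of A."""
    fitted = {cell: 1.0 for cell in counts}
    for _ in range(sweeps):
        for axis in (0, 1):  # 0: match row totals, 1: match column totals
            for level in {cell[axis] for cell in counts}:
                cells = [c for c in counts if c[axis] == level]
                scale = sum(counts[c] for c in cells) / sum(fitted[c] for c in cells)
                for c in cells:
                    fitted[c] *= scale
    return fitted

fitted = fit_quasi_independence(counts)
pearson = sum((counts[c] - fitted[c]) ** 2 / fitted[c] for c in counts)
# 8 free cells minus the 6 independent parameters of the QI-model leave 2 degrees
# of freedom; the chi-square survival function with 2 d.o.f. is exp(-x / 2).
p_value = math.exp(-pearson / 2)
print(f"Pearson X^2 = {pearson:.2f}, asymptotic p-value = {p_value:.3f}")
```

Under these assumptions the Pearson p-value comes out just below 3%, in line with the range quoted above; other statistics, such as the likelihood ratio, give somewhat different values.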

3 Exponential and extended exponential model

• The design matrix M and a matrix K, whose columns k1 and k2 are orthogonal to the columns of M, are:

cell   α1 α2 α3 α4 α5 β1 β2  |  k1  k2
11      1  0  0  0  0  1  0  |   1   0
21      0  1  0  0  0  1  0  |  −1  −1
31      0  0  1  0  0  1  0  |   0   0
51      0  0  0  0  1  1  0  |   0   1
12      1  0  0  0  0  0  1  |  −1   0
22      0  1  0  0  0  0  1  |   1   1
42      0  0  0  1  0  0  1  |   0   0
52      0  0  0  0  1  0  1  |   0  −1

• If $q_{ij} > 0$ for all $(i, j) \in A$, the QI-model is equivalent to the implicit binomial model

\[ q_{11} q_{22} - q_{21} q_{12} = 0, \qquad q_{51} q_{22} - q_{21} q_{52} = 0. \]

The above equations are the standard conditions for independence in the two 2 × 2 tables with rows {1, 2} and {2, 5}, respectively. This is equivalent to the independence of the sub-table {1, 2, 5} × {1, 2}.
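The relation between the design matrix, the orthogonal vectors and the binomials can be checked numerically. The following sketch rebuilds M from the cell labels, verifies that k1 and k2 are orthogonal to every column of M, and evaluates the two binomials at a point of the QI-model. The parameter values `rho` and `psi` are arbitrary illustrative choices.

```python
cells = [(1, 1), (2, 1), (3, 1), (5, 1), (1, 2), (2, 2), (4, 2), (5, 2)]

# Row of M for cell (i, j): indicator of alpha_i followed by indicator of beta_j.
M = [[int(i == a) for a in range(1, 6)] + [int(j == b) for b in (1, 2)]
     for i, j in cells]
K = {"k1": [1, -1, 0, 0, -1, 1, 0, 0],
     "k2": [0, -1, 0, 1, 0, 1, 0, -1]}

def transpose_times(matrix, vector):
    """Return matrix^T vector as a list with one entry per column of the matrix."""
    return [sum(row[c] * x for row, x in zip(matrix, vector))
            for c in range(len(matrix[0]))]

orthogonal = all(v == 0 for k in K.values() for v in transpose_times(M, k))

# A point of the QI-model with arbitrary positive parameters (illustrative values).
rho = {1: 0.7, 2: 1.3, 3: 2.0, 4: 0.4, 5: 0.9}
psi = {1: 1.5, 2: 0.6}
q = {(i, j): rho[i] * psi[j] for i, j in cells}
binomials = [q[1, 1] * q[2, 2] - q[2, 1] * q[1, 2],
             q[5, 1] * q[2, 2] - q[2, 1] * q[5, 2]]
print(orthogonal, binomials)
```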

4 Maximal model

• The maximal design matrix Mmax and the maximal model in monomial form are computable with Computer Algebra Software.

The matrix Mmax, with the corresponding monomial form of each cell:

cell   ζ1 ζ2 ζ3 ζ4 ζ5 ζ6 ζ7    monomial form
11      0  0  0  0  1  0  1    q11 = ζ5ζ7
21      0  0  1  0  0  0  1    q21 = ζ3ζ7
31      1  0  0  0  0  0  0    q31 = ζ1
51      0  0  0  1  0  0  1    q51 = ζ4ζ7
12      0  0  0  0  1  1  0    q12 = ζ5ζ6
22      0  0  1  0  0  1  0    q22 = ζ3ζ6
42      0  1  0  0  0  0  0    q42 = ζ2
52      0  0  0  1  0  1  0    q52 = ζ4ζ6

• CoCoATeam. CoCoA: a system for doing Computations in Commutative Algebra. Available at http://cocoa.dima.unige.it.
• 4ti2 team. 4ti2 – a software package for algebraic, geometric and combinatorial problems on linear spaces. Available at www.4ti2.de.
• The number of instances (also called feasible sets) for the QI-model is 87.
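A small consistency check of the maximal model can be sketched as follows: every column of Mmax should be a non-negative integer vector orthogonal to k1 and k2, and the monomial form should satisfy the two binomial equations even when some ζ vanish. The boundary point chosen here (ζ7 = 0) and the numerical values are illustrative assumptions.

```python
cells = ["11", "21", "31", "51", "12", "22", "42", "52"]
M_max = [(0, 0, 0, 0, 1, 0, 1), (0, 0, 1, 0, 0, 0, 1), (1, 0, 0, 0, 0, 0, 0),
         (0, 0, 0, 1, 0, 0, 1), (0, 0, 0, 0, 1, 1, 0), (0, 0, 1, 0, 0, 1, 0),
         (0, 1, 0, 0, 0, 0, 0), (0, 0, 0, 1, 0, 1, 0)]
k1 = (1, -1, 0, 0, -1, 1, 0, 0)
k2 = (0, -1, 0, 1, 0, 1, 0, -1)

# Each column of M_max is a vector t indexed by the cells; check t . k_j = 0.
columns = list(zip(*M_max))
orthogonal = all(sum(t * k for t, k in zip(col, kj)) == 0
                 for col in columns for kj in (k1, k2))

def monomial_point(zeta):
    """Evaluate q(x) = prod_j zeta_j^{S_j(x)}, skipping factors with exponent 0."""
    q = {}
    for cell, row in zip(cells, M_max):
        value = 1.0
        for z, e in zip(zeta, row):
            if e:
                value *= z ** e
        q[cell] = value
    return q

# Boundary point: zeta_7 = 0 annihilates the cells 11, 21 and 51.
q = monomial_point([2.0, 3.0, 0.5, 1.5, 0.8, 1.2, 0.0])
support = [cell for cell in cells if q[cell] > 0]
binomials = (q["11"] * q["22"] - q["21"] * q["12"],
             q["51"] * q["22"] - q["21"] * q["52"])
print(orthogonal, support, binomials)
```

The support obtained at this boundary point keeps cell 31 while dropping the rest of the first column, a configuration that needs ζ1 to be a parameter separate from ζ7.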

5 Bayesian analysis

• The model which imposes no restriction on the cell probabilities, save the zero-probability cells (3, 2) and (4, 1), is called the Structural Zero model SZ. The number of SZ-instances is equal to $2^8 - 1 = 255$.
• Only two of the above SZ-instances are logically consistent with the observed data: the one giving a positive probability to all eight free cells, and the one giving zero probability to cell (5, 1) only. We label these instances SZ0 and SZ1.
• There exists only one logically consistent instance of the quasi-independence model, i.e. that having all positive cell probabilities (except for the two cells corresponding to structural zeros), which we label QI0.

• The models SZ0 and SZ1 are nested, both in the sense of the maximal parameterization and of the supports (faces), so that their a-priori Dirichlet distributions on the parameters should be related. ZI could be parameterized by QI0 and an orthogonal component.

• The two models SZ0 and SZ1 are different exponential models and an a-priori on their union disintegrates into a discrete a-priori for the model and a conditional Dirichlet given the model.
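The counts of instances quoted in this section can be reproduced by brute-force enumeration. In the sketch below an instance is identified with its support, and "logically consistent" is taken to mean that every cell with a positive observed count belongs to the support; both conventions are assumptions made for the purpose of the example.

```python
from itertools import combinations, product

counts = {"11": 38, "21": 15, "31": 18, "51": 0,
          "12": 90, "22": 15, "42": 111, "52": 5}
free_cells = list(counts)
observed = {cell for cell, n in counts.items() if n > 0}

# SZ-model: any nonempty subset of the eight free cells is the support of an instance.
sz_instances = [frozenset(s) for r in range(1, len(free_cells) + 1)
                for s in combinations(free_cells, r)]
sz_consistent = [s for s in sz_instances if observed <= s]

# QI-model: supports generated by the maximal monomial form (zeta indices per cell).
monomials = {"11": (5, 7), "21": (3, 7), "31": (1,), "51": (4, 7),
             "12": (5, 6), "22": (3, 6), "42": (2,), "52": (4, 6)}
qi_instances = {frozenset(cell for cell, idx in monomials.items()
                          if all(z[j - 1] for j in idx))
                for z in product((0, 1), repeat=7)} - {frozenset()}
qi_consistent = [s for s in qi_instances if observed <= s]

print(len(sz_instances), len(sz_consistent), len(qi_consistent))
```

The two consistent SZ supports correspond to SZ0 and SZ1, and the single consistent QI support is the full set of free cells, i.e. QI0.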

6 Statistical models as differentiable manifolds

• Statistical models have a very rich mathematical structure. We can approach this from many points of view, among them Convex Analysis and Differential Geometry.
• The Differential Geometry approach is the oldest: statistical models are Riemannian manifolds, according to C. R. Rao. Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37:81–89, 1945.
• More recently, it has been found that there are other manifold structures of interest, called by Amari α-Bundles and α-Connections.
• Most of the constructions in the literature are restricted to parametric statistical models. However, there are applications where infinite-dimensional statistical models appear, e.g. in Mathematical Finance, the set of martingale measures on the Wiener space.
• A general coordinate-free construction is important at the conceptual level, because in such a framework the “big picture” comes out more clearly, as in S. Lang. Differential and Riemannian manifolds, volume 160 of Graduate Texts in Mathematics. Springer-Verlag, New York, third edition, 1995.

7 Exponential statistical manifolds

The theory of exponential statistical manifolds modeled on Orlicz spaces has been developed in joint work with C. Sempi, M.-P. Rogantin, P. Gibilisco, A. Cena, D. Imparato, B. Trivellato (1993-2007).
• G. Pistone and C. Sempi. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Statist., 23(5):1543–1561, October 1995.
• G. Pistone and M. P. Rogantin. The exponential statistical manifold: mean parameters, orthogonality and space transformations. Bernoulli, 5(4):721–760, August 1999.
• P. Gibilisco and G. Pistone. Connections on non-parametric statistical manifolds by Orlicz space geometry. IDAQP, 1(2):325–347, 1998.
• A. Cena and G. Pistone. Exponential statistical manifold. AISM, 59:27–56, 2007. DOI 10.1007/s10463-006-0096-y.

The main focus is on finding a rigorous functional setting for the IG as described in
• S. Amari and H. Nagaoka. Methods of information geometry. American Mathematical Society, Providence, RI, 2000. Translated from the 1993 Japanese original by Daishi Harada.

8 Statistical varieties

In the case of a finite state space, or in the case of special distributions, IG has an interesting interface with AS. This interface is both conceptually interesting and computationally useful.
• P. Diaconis and B. Sturmfels. Algebraic algorithms for sampling from conditional distributions. Ann. Statist., 26(1):363–397, 1998.
• G. Pistone and H. P. Wynn. Finitely generated cumulants. Statist. Sinica, 9(4):1029–1052, October 1999.
• G. Pistone, E. Riccomagno, and H. P. Wynn. Algebraic Statistics: Computational Commutative Algebra in Statistics. Chapman&Hall, 2001.
• D. Geiger, C. Meek, and B. Sturmfels. On the toric algebra of graphical models. Ann. Statist., 34:1463–1492, 2006.

The key notion is that of exponential statistical model, which has long been known to have special features, both analytic and algebraic. The study of general infinite-dimensional exponential models suggests how to avoid unnatural parameterizations of the finite state space models. In turn, the finite state space case suggests how to deal with limit cases.

9 Outline

Exponential statistical model
• Basic construction.
• Exponential and mixture connections.
• Models, divergence, tangent bundle.
Finite state space or special distributions
Extended exponential model
Tables with zero cells

10 IG as a special Banach manifold

Given a general probability space $(X, \mathcal{X}, \mu)$,

• $\mathcal{M}_>$ is the set of all densities which are positive $\mu$-a.s.
• $\mathcal{M}_>$ is thought of as the “maximal” regular statistical model.
• We want to assign a manifold structure to this maximal model in such a way that each specific statistical model can be considered as a “sub-manifold” of $\mathcal{M}_>$. We will see that the notion of sub-manifold to be used is not obvious.
• The model space for the manifold, locally at each $p \in \mathcal{M}_>$, is a subspace of centered random variables of a suitable Orlicz space. Orlicz spaces are constructed in analogy with Lebesgue spaces, by imposing conditions of the form $E_p[\Phi(u)] = \int \Phi(u)\, p\, d\mu < +\infty$ for a suitable function $\Phi$. If $\Phi(x) = |x|^a$, then the Lebesgue $L^a$ spaces are obtained. A reference on Orlicz spaces is:
• M. M. Rao and Z. D. Ren. Applications of Orlicz spaces, volume 250 of Monographs and Textbooks in Pure and Applied Mathematics. Marcel Dekker Inc., New York, 2002.

11 Orlicz spaces

• The Young function $\Phi(x) = \cosh x - 1$ is used instead of the equivalent and more commonly used $e^{|x|} - |x| - 1$.
• $\Psi$ denotes its conjugate Young function, or the equivalent $(1 + |y|) \log(1 + |y|) - |y|$.
• A random variable $u$ belongs to the vector space $L^{\Phi}(p)$ if $E_p[\Phi(\alpha u)] < +\infty$ for some $\alpha > 0$.
• The closed unit ball of $L^{\Phi}(p)$ consists of all $u$'s such that $E_p[\Phi(u)] \le 1$.
• The open unit ball $B(0, 1)$ consists of those $u$'s such that $\alpha u$ is in the closed unit ball for some $\alpha > 1$.
• The Banach space $L^{\Phi}(p)$ is not separable, unless the sample space is finite. In this sense it is an unnatural choice.
• However, $L^{\Phi}(p)$ is natural for statistics, because for each $u \in L^{\Phi}(p)$ the Laplace transform of $u$ is well defined in a neighborhood of 0 and the one-dimensional exponential model $p(\theta) \propto e^{\theta u}$ is well defined around 0 (and vice versa).
• The space $L^{\Psi}(p)$ is separable and it is the pre-dual of $L^{\Phi}(p)$, with pairing $E_p[uv]$. For $1 < a < +\infty$, $L^{\Phi}(p) \subset L^{a}(p) \subset L^{\Psi}(p)$.
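On a finite sample space the norm whose closed unit ball is described above can be computed directly. The sketch below assumes the Luxemburg form of the norm, i.e. the smallest $k > 0$ with $E_p[\Phi(u/k)] \le 1$, and locates it by bisection; the density `p` and the random variable `u` are illustrative values.

```python
import math

def young(x):
    """Young function Phi(x) = cosh(x) - 1."""
    return math.cosh(x) - 1.0

def luxemburg_norm(u, p, tol=1e-12):
    """Smallest k > 0 such that E_p[Phi(u / k)] <= 1, located by bisection."""
    def modular(k):
        return sum(pi * young(ui / k) for ui, pi in zip(u, p))
    # Phi(x) <= 1 whenever |x| <= 1, so u / max|u| lies in the closed unit ball.
    hi = max(abs(ui) for ui in u)
    if hi == 0.0:
        return 0.0
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if modular(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

p = [0.2, 0.5, 0.3]      # illustrative density on three points
u = [1.0, -0.5, 2.0]     # illustrative random variable
norm = luxemburg_norm(u, p)
print(norm, sum(pi * young(ui / norm) for ui, pi in zip(u, p)))
```

At the computed norm the modular $E_p[\Phi(u/k)]$ equals 1 up to the bisection tolerance, which places $u/\|u\|$ on the boundary of the closed unit ball.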

12 Moment functional

• For each $p \in \mathcal{M}_>$, the moment functional is $M_p(u) = E_p[e^u]$.

• $M_p(0) = 1$; otherwise, for each $p$-centered $u \neq 0$, $M_p(u) > 1$.

• $M_p$ is convex and lower semi-continuous, and its proper domain
\[ \operatorname{dom}(M_p) = \{ u \in L^{\Phi}(p \cdot \mu) : M_p(u) < \infty \} \]
is a convex set which contains the open unit ball $B_p(0, 1)$ of $L^{\Phi}(p)$.

Th $M_p$ is infinitely Gâteaux-differentiable in the interior of its proper domain, the $n$th derivative at $u \in \operatorname{dom}(M_p)^{\circ}$ in the direction $v \in L^{\Phi}(p)$ being
\[ \left. \frac{d^n}{dt^n} M_p(u + tv) \right|_{t=0} = E_p[v^n e^u]. \]

Th $M_p$ is bounded, infinitely Fréchet-differentiable and analytic on the open unit ball of $L^{\Phi}(p)$; the $n$th derivative at $u \in B(0, 1)$ evaluated in $(v_1, \dots, v_n) \in L^{\Phi}(p) \times \cdots \times L^{\Phi}(p)$ is
\[ D^n M_p(u)(v_1, \dots, v_n) = E_p[v_1 \cdots v_n e^u]. \]
In particular,
\[ D^n M_p(0)(v_1, \dots, v_n) = E_p[v_1 \cdots v_n]. \]
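On a finite sample space the derivative formulas for the moment functional can be checked with finite differences. The following sketch compares central differences of $t \mapsto M_p(u + tv)$ with $E_p[v e^u]$ and $E_p[v^2 e^u]$; the density and the vectors `u`, `v` are illustrative values, and the step size `h` is an arbitrary choice.

```python
import math

p = [0.2, 0.5, 0.3]      # illustrative density on a three-point sample space
u = [0.4, -0.1, -0.3]
v = [1.0, -2.0, 0.5]

def expectation(f):
    return sum(fi * pi for fi, pi in zip(f, p))

def moment(w):
    """M_p(w) = E_p[exp(w)]."""
    return expectation([math.exp(wi) for wi in w])

def shifted(t):
    """t -> M_p(u + t v)."""
    return moment([ui + t * vi for ui, vi in zip(u, v)])

h = 1e-4
first_numeric = (shifted(h) - shifted(-h)) / (2 * h)
second_numeric = (shifted(h) - 2 * shifted(0.0) + shifted(-h)) / h ** 2
first_exact = expectation([vi * math.exp(ui) for ui, vi in zip(u, v)])
second_exact = expectation([vi ** 2 * math.exp(ui) for ui, vi in zip(u, v)])
print(first_numeric, first_exact)
print(second_numeric, second_exact)
```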

13 Connected component

• We associate to each density $p$ a space of $p$-centered random variables: scores, estimating functions, ... It is technically crucial to discuss how the relevant spaces depend on the density $p$.
• Given $p, q \in \mathcal{M}_>$, the exponential model $p(\theta) \propto p^{1-\theta} q^{\theta}$, $0 \le \theta \le 1$, connects the two given densities as end points of a curve. This curve need not be continuous, so we ask for more.

D We say that $p, q \in \mathcal{M}_>$ are connected by an open exponential arc if there exist $r \in \mathcal{M}_>$, $u \in L_0^{\Phi}(r)$ and an open interval $I$ that contains 0, such that $p(t) \propto e^{tu} \cdot r$, $t \in I$, is an exponential model containing both $p$ and $q$.

Th Let $p$ and $q$ be densities connected by an open exponential arc. Then the Banach spaces $L^{\Phi}(p)$ and $L^{\Phi}(q)$ are equal as vector spaces and their norms are equivalent.

Th For all $q$ that are connected to $p$ by an open exponential arc, the Orlicz space of centered random variables at $q$ is
\[ L_0^{\Phi}(q) = \left\{ u \in L^{\Phi}(p) \;\middle|\; E_p\!\left[ \frac{q}{p}\, u \right] = 0 \right\} \]

so that $E_p[ku] = 0$ for $k = \frac{q}{p} - 1$ and every $u \in L^{\Phi}(p)$ centered at both $p$ and $q$.

14 Cumulant functional

• For each $p \in \mathcal{M}_>$ and each $u$ in the set $S_p$ of $p$-centered random variables in the interior of the proper domain of $M_p$, the cumulant functional is $K_p(u) = \log M_p(u)$.

• $K_p$ is infinitely Gâteaux-differentiable.

• $K_p$ is bounded, infinitely Fréchet-differentiable and analytic on the open unit ball of $L_0^{\Phi}(p)$.

• The mapping
\[ e_p : S_p \to \mathcal{M}_>, \qquad u \mapsto e^{u - K_p(u)}\, p \]
is a parameterization of a subset of $\mathcal{M}_>$.

• If $e_p(S_p) = \mathcal{E}(p)$, the corresponding chart is
\[ s_p : \mathcal{E}(p) \ni q \mapsto \log\frac{q}{p} - E_p\!\left[ \log\frac{q}{p} \right] \in S_p. \]

• If $q = e_p(u)$, then $K_p(u) = E_p\!\left[ \log\frac{p}{q} \right]$ and
\[ DK_p(u)(v) = E_q[v], \qquad D^2 K_p(u)(v, w) = \operatorname{Cov}_q(v, w). \]
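The parameterization $e_p$, the chart $s_p$ and the identities for $K_p$ can be illustrated on a three-point sample space. The sketch below is only a numerical check under illustrative values for `p`, `u` and `v`; the helper names are hypothetical.

```python
import math

p = [0.2, 0.5, 0.3]      # illustrative reference density

def expectation(f, density):
    return sum(fi * di for fi, di in zip(f, density))

def centered(f, density):
    mean = expectation(f, density)
    return [fi - mean for fi in f]

def cumulant(u):
    """K_p(u) = log E_p[exp(u)]."""
    return math.log(expectation([math.exp(ui) for ui in u], p))

def e_p(u):
    """Parameterization u -> exp(u - K_p(u)) p."""
    k = cumulant(u)
    return [math.exp(ui - k) * pi for ui, pi in zip(u, p)]

def s_p(q):
    """Chart q -> log(q/p) - E_p[log(q/p)]."""
    return centered([math.log(qi / pi) for qi, pi in zip(q, p)], p)

u = centered([0.3, -1.0, 0.8], p)
q = e_p(u)
round_trip = s_p(q)                                   # should reproduce u
kl_p_q = expectation([math.log(pi / qi) for pi, qi in zip(p, q)], p)

v = centered([1.0, 0.0, -2.0], p)
h = 1e-5
dK_numeric = (cumulant([ui + h * vi for ui, vi in zip(u, v)])
              - cumulant([ui - h * vi for ui, vi in zip(u, v)])) / (2 * h)
print(cumulant(u), kl_p_q, dK_numeric, expectation(v, q))
```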

15 Exponential manifold

D For every $p \in \mathcal{M}_>$, the maximal exponential model at $p$ is defined to be the family of densities
\[ \mathcal{E}(p) := \left\{ e^{u - K_p(u)}\, p : u \in \operatorname{dom}(K_p)^{\circ} \right\} \subseteq \mathcal{M}_>. \]

Th The maximal exponential model at $p$, $\mathcal{E}(p)$, is equal to the set of all densities $q \in \mathcal{M}_>$ connected to $p$ by an open exponential arc.

Th The collection of charts $\{(\mathcal{E}(p), s_p) : p \in \mathcal{M}_>\}$ is an affine $C^{\infty}$ atlas on $\mathcal{M}_>$. The transition maps are
\[ s_{p_2} \circ e_{p_1} : u \mapsto u + \log\frac{p_1}{p_2} - E_{p_2}\!\left[ u + \log\frac{p_1}{p_2} \right]. \]

• The chart domains are either equal or disjoint, because they correspond to the connected components (in the sense of open arcs).
• $\mathcal{E}(p)$ is convex. reference measure.

• The derivative of the transition map $s_{p_2} \circ e_{p_1}$ is
\[ L_0^{\Phi}(p_1) \ni u \mapsto u - E_{p_2}[u] \in L_0^{\Phi}(p_2), \]
which is an isomorphism, because $p_1$ and $p_2$ are connected by an open exponential arc.

16 Divergence

• For each non-negative density $q$ and each $p \in \mathcal{M}_>$, the divergence is defined by $D(q\|p) = E_q\!\left[ \log\frac{q}{p} \right]$.

• Restricting first to the manifold $\mathcal{M}_> \times \mathcal{M}_>$ and then to the proper domain, we can work on $\mathcal{E}(p) \times \mathcal{E}(p)$ and compute the representative in that chart, with $u_i = s_p(q_i)$:
\[ D(q_1\|q_2) = E_{q_1}[u_1 - u_2] - K_p(u_1) + K_p(u_2). \]
Then the divergence, as a real function on $\mathcal{E}(p) \times \mathcal{E}(p)$, is infinitely Gâteaux-differentiable.

• Its partial derivatives are
\[ D_1 D(q_1\|q_2) \cdot v = \operatorname{Cov}_{q_1}[u_1 - u_2, v], \qquad D_2 D(q_1\|q_2) \cdot v = E_{q_2}[v] - E_{q_1}[v]. \]

• In particular we can write
\[ D(q\|p) = E_p\!\left[ \left( \frac{q}{p} - 1 \right) u \right] - K_p(u), \]
which shows that $D(q\|p)$, as a function of $\frac{q}{p} - 1$, is the conjugate of the convex function $K_p$. Equivalently,
\[ D(q\|p) + D(p\|q) = E_p\!\left[ \left( \frac{q}{p} - 1 \right) u \right] = E_q[u]. \]
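The chart expression of the divergence and the symmetrized identity can be verified numerically on a finite sample space. This is a sketch with illustrative inputs; `exponential_point` is a hypothetical helper that centers a vector at $p$ and returns the chart coordinate, the cumulant and the density.

```python
import math

p = [0.2, 0.5, 0.3]      # illustrative reference density

def expectation(f, density):
    return sum(fi * di for fi, di in zip(f, density))

def exponential_point(raw):
    """Center raw at p and return (u, K_p(u), q = e_p(u))."""
    mean = expectation(raw, p)
    u = [r - mean for r in raw]
    k = math.log(expectation([math.exp(ui) for ui in u], p))
    q = [math.exp(ui - k) * pi for ui, pi in zip(u, p)]
    return u, k, q

def divergence(a, b):
    """D(a||b) = E_a[log(a/b)] for strictly positive densities."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

u1, k1, q1 = exponential_point([0.3, -1.0, 0.8])
u2, k2, q2 = exponential_point([-0.5, 0.2, 0.1])

# D(q1||q2) = E_q1[u1 - u2] - K_p(u1) + K_p(u2)
chart_formula = expectation([a - b for a, b in zip(u1, u2)], q1) - k1 + k2
# D(q1||p) + D(p||q1) = E_q1[u1]
symmetrized = divergence(q1, p) + divergence(p, q1)
print(divergence(q1, q2), chart_formula)
print(symmetrized, expectation(u1, q1))
```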

17 The exponential geometry

The previous theory is intended to capture the essence, and to generalize the idea, of the curved exponential model as defined by
• B. Efron. Defining the curvature of a statistical problem (with applications to second-order efficiency). The Annals of Statistics, 3:1189–1242, 1975. (with discussion)
• B. Efron. The geometry of exponential families. Ann. Statist., 6(2):362–376, 1978.
• A. P. Dawid. Discussion of a paper by Bradley Efron. The Annals of Statistics, 3:1231–1234, 1975.
• A. P. Dawid. Further comments on a paper by Bradley Efron. The Annals of Statistics, 5:1249, 1977.

From the work of S.-I. Amari, we know that there is a second geometry on probabilities, whose geodesics are mixtures. This structure is a connection on a special vector bundle of the $\mathcal{M}_>$-manifold. A related manifold on the set $\mathcal{M}_1$ of normalized random variables is defined by charts of the form
\[ q \mapsto \frac{q}{p} - 1. \]
Locally, the $q$'s will have finite KL-divergence $D(q\|p)$.

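18 Mixture manifold 1

• We enlarge $\mathcal{M}_>$ by considering the sets
\[ \mathcal{M}_{\ge} = \left\{ p \in L^1(\mu) : p \ge 0, \int p\, d\mu = 1 \right\}, \qquad \mathcal{M}_1 = \left\{ p \in L^1(\mu) : \int p\, d\mu = 1 \right\}. \]

• For each $p \in \mathcal{M}_{\ge}$, we consider the subset of $L_0^{\Psi}(p)$ defined by:
\[ {}^{*}\mathcal{E}(p) = \left\{ q \in P : \frac{q}{p} \in L^{\Psi}(p) \right\}. \]
On such a domain, we define the charts
\[ \eta_p : {}^{*}\mathcal{E}(p) \to L_0^{\Psi}(p), \qquad q \mapsto \frac{q}{p} - 1, \]
and the associated parameterizations
\[ L_0^{\Psi}(p) \ni u \mapsto (u + 1)\, p \in {}^{*}U_p. \]
The collection of sets $\{{}^{*}\mathcal{E}(p)\}_{p \in \mathcal{M}_1}$ is a covering of $\mathcal{M}_1$.
• If $p \in \mathcal{M}_>$, then $\mathcal{E}(p) \subset {}^{*}\mathcal{E}(p)$.
• If $p_1, p_2 \in \mathcal{E}(p)$, then ${}^{*}\mathcal{E}(p_1) = {}^{*}\mathcal{E}(p_2)$.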

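19 Mixture manifold 2

Th The set of charts $\{({}^{*}\mathcal{E}(q), \eta_q) : q \in \mathcal{E}(p)\}$ is an affine $C^{\infty}$-atlas on ${}^{*}\mathcal{E}(p)$, so it defines a manifold modeled on the Banach space $L_0^{\Psi}(p)$.

• For each pair $p_1, p_2 \in \mathcal{E}(p)$ the transition map is
\[ \eta_{p_2} \circ \eta_{p_1}^{-1} : L_0^{\Psi}(p_1) \to L_0^{\Psi}(p_2), \qquad u \mapsto u\, \frac{p_1}{p_2} + \frac{p_1}{p_2} - 1. \]

Th For each $q \in \mathcal{M}_{\ge}$, the divergence $D(q\|p)$ is finite if and only if $q \in {}^{*}\mathcal{E}(p)$:
\[ D(q\|p) = E_p\!\left[ \frac{q}{p} \log\frac{q}{p} \right] < \infty \iff q \in {}^{*}\mathcal{E}(p). \]

Th For each density $p \in \mathcal{M}_{\mu}$, the inclusion $j : \mathcal{E}(p) \hookrightarrow {}^{*}\mathcal{E}(p)$ is of class $C^{\infty}$.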
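The transition map between two mixture charts can be checked directly on a finite sample space, where all integrability conditions hold trivially. In this sketch the densities `p1`, `p2` and the coordinate `u` are illustrative values, with `u` chosen centered at `p1` and larger than −1 so that the parameterization returns a positive density.

```python
p1 = [0.2, 0.5, 0.3]
p2 = [0.4, 0.4, 0.2]
u = [0.5, -0.4, 1 / 3]     # E_p1[u] = 0.1 - 0.2 + 0.1 = 0

def eta(q, p):
    """Mixture chart centered at p: q -> q/p - 1."""
    return [qi / pi - 1 for qi, pi in zip(q, p)]

def eta_inverse(u, p):
    """Associated parameterization: u -> (u + 1) p."""
    return [(ui + 1) * pi for ui, pi in zip(u, p)]

q = eta_inverse(u, p1)
direct = eta(q, p2)                                   # eta_{p2}(eta_{p1}^{-1}(u))
formula = [ui * a / b + a / b - 1 for ui, a, b in zip(u, p1, p2)]
mean_under_p2 = sum(d * b for d, b in zip(direct, p2))
print(direct, formula, mean_under_p2)
```

The image is centered at `p2`, as required for a coordinate in the model space at `p2`.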

20 Exponential models

• Let $V$ be a subspace of $L_0^{\Phi}(p)$. We call exponential model based on $V$ the “flat manifold”
\[ \mathcal{E}_V(p) = \left\{ e^{u - K_p(u)}\, p \;\middle|\; u \in V \cap S_p \right\}. \]

• Let
\[ V^{\perp} = \left\{ k \in L_0^{\Psi}(p) \;\middle|\; E_p[ku] = 0,\ u \in V \right\} \]
be the orthogonal space of $V$. If $V$ is closed, then
\[ q \in \mathcal{E}_V(p) \iff E_p\!\left[ k \log\frac{q}{p} \right] = 0, \quad k \in V^{\perp}. \]

• Note that $V + V^{\perp}$ is not a splitting of the space $L_0^{\Phi}(p)$. In fact, the proper notion of statistical model appears to be different from what is usually termed a sub-manifold, because, in general, there is no orthogonal splitting of subspaces in $L^{\Phi}(p)$. The proper splitting space consists of the orthogonal space of the space of the model in the pre-dual space $L_0^{\Psi}(p)$.
• If we take $k = \frac{q}{p} - 1$, $q \in {}^{*}\mathcal{E}(p)$, then the orthogonality becomes a special case of the Csiszár Pythagorean theorem
\[ D(q\|p) + D(p\|q) = 0 \]

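21 Tangent space

• Let $p(\theta)$, $\theta \in I$, be a one-dimensional statistical model. If $p(0) = p$ and $p(\theta) \in \mathcal{E}(p)$, then $p(\theta) = e^{u(\theta) - K_p(u(\theta))}\, p$. The tangent vector at $\theta$ is computed in the exponential chart centered at $p$ as
\[ T_{p(\theta)}\, p(\cdot) = \frac{d}{d\theta} u(\theta) = \frac{d}{d\theta} \log\frac{p(\theta)}{p} + E_{p(\theta)}\!\left[ \frac{d}{d\theta} u(\theta) \right] \]
and, at $\theta = 0$,
\[ T_p\, p(\cdot) = \left. \frac{d}{d\theta} u(\theta) \right|_{\theta=0} = \left. \frac{d}{d\theta} \log\frac{p(\theta)}{p} \right|_{\theta=0}. \]
• The exponential model $e^{\theta u'(0) - K_p(\theta u'(0))}\, p$, $\theta \in I$, is the tangent exponential model.
• The statistical model $\{ q \in {}^{*}\mathcal{E}(p) \mid E_q[u'(0)] = 0 \}$ is the orthogonal mixture model.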

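22 Classical exponential models

• A classical exponential model is
\[ p(x; \theta) = \exp\!\left( \sum_{i=1}^{d} \theta_i T_i(x) - \Psi(\theta) \right), \quad \theta \in \Theta, \]
where $\Psi$ is the (logarithm of the) Laplace transform of $T_{*}\mu$, the distribution of $T$, and $\Theta$ is its domain.

• If $H = \operatorname{Span}(1, T_1, \dots, T_d)$, the membership of $p_{\theta}$ in the model is equivalent to
\[ \log p(x; \theta) = \sum_{i=1}^{d} \theta_i T_i(x) - \Psi(\theta) \in H. \]
• The coordinate in the exponential manifold is
\[ s(p_{\theta}) = \sum_{i=1}^{d} \theta_i \left( T_i(x) - \frac{\partial}{\partial \theta_i} \Psi(\theta) \right) \]
and the cumulant function is
\[ K(u) = \Psi(\theta) - \sum_{i=1}^{d} \theta_i \frac{\partial}{\partial \theta_i} \Psi(\theta). \]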

23 Sufficiency

• The supporting probability space can be reduced by sufficiency. In fact, if $\mathcal{T} = \sigma(T_1, \dots, T_d)$ is the sufficient σ-algebra, it suffices to consider the smaller probability space $(T(X), \mathcal{T}, \mu|_{\mathcal{T}})$.

• Assume that all the $T_i$'s take a finite number of values. Then, the space of measurable functions which are functions of $T$ is the algebra generated by $T_1, \dots, T_d$. If a smaller number of generators can be found, this will lead to a reduction of the dimension of the model space.
• The computation of the sufficient algebra, of a generating set of this algebra, and of a linear basis are all relevant, because every statistic based on the likelihood will be a member of this algebra.
• More specifically, assume that the original sample space is finite and that all values of the $T_i$ are rational. Then, we can map each probability of the original model to a finite subset $D$ of $\mathbb{Q}^d$ to obtain the equivalent exponential model
\[ p(t; \theta) = \exp\!\left( \sum_{i=1}^{d} \theta_i t_i - \Psi(\theta) \right) T_{*}\mu(t), \quad t \in D. \]

24 Design theory

• In algebraic Design Theory, the points of a finite state space (or treatment space) are labelled with points in $\mathbb{Q}^d$. In turn, the sample space is described as the 0-dimensional algebraic variety of an ideal $I$ of the ring $R = \mathbb{Q}[x_1, \dots, x_d]$. The algebra of real functions on the sample space is described by the quotient ring $R/I$ and has a hierarchical linear basis of monomials. Other labels have proved to be useful, in particular the $n$-th roots of unity.
• This same approach should lead to a useful algebraic presentation of the sufficient algebra of exponential models on a finite sample space.
• G. Pistone, E. Riccomagno, and H. P. Wynn. Algebraic Statistics: Computational Commutative Algebra in Statistics. Chapman&Hall, 2001.
• G. Pistone and M. Rogantin. Indicator function and complex coding for mixed fractional factorial designs (revised 2006). Technical Report 17, Dipartimento di Matematica, Politecnico di Torino, 2006.
• H. Maruri-Aguilar, R. Notari, and E. Riccomagno. On the description and identifiability analysis of mixture designs. (Accepted for publication in Statistica Sinica, 2007.)

25 Example: Exponential versus toric

• $\mathcal{X}$ is a finite sample space and $p$ is a given probability density. In particular, we consider a multi-way contingency table identified by a collection of factors $X = \{X_1, \dots, X_F\}$. If $I_f$ denotes the set of levels for the factor $X_f$, $f = 1, \dots, F$, the state space is a product space, i.e. $\mathcal{X} = \times_{f=1}^{F} I_f$.
• A log-linear model assumes that $p(x) > 0$ and that $\log p(x)$ belongs to a linear subspace $H$ of $\mathbb{R}^{\mathcal{X}}$. If $H$ is spanned by $\{T_1, \dots, T_s\}$, where the $T_j$'s are integer valued functions, we can write the log-linear model as
\[ \log p(x) = \sum_{j=1}^{s} (\log \zeta_j)\, T_j(x). \]
• The unnormalized probability $q$ satisfies the same equation in monomial form,
\[ q(x) = \zeta_1^{T_1(x)} \cdots \zeta_s^{T_s(x)}, \quad \zeta_j \ge 0, \quad j = 1, \dots, s, \]
where the parameters $\zeta_1, \dots, \zeta_s$ are subject to non-negativity constraints only.

26 Example: orthogonality

• A third expression of the same model can be derived by elimination of the indeterminates $\zeta_1, \dots, \zeta_s$ in the monomial parameterization equation. In fact, if $M = [T_1(x) \cdots T_s(x)]_{x \in Q}$ is the design matrix, and the orthogonal space of its range is generated by the integer valued vectors with zero sum $K = [k_1 \cdots k_r]$, then
\[ \prod_{x} q(x)^{k_j(x)} = \prod_{x} \left( \zeta_1^{T_1(x)} \cdots \zeta_s^{T_s(x)} \right)^{k_j(x)} = \zeta_1^{T_1 \cdot k_j} \cdots \zeta_s^{T_s \cdot k_j} = 1, \]
where $T_i \cdot k_j = \sum_x T_i(x) k_j(x) = 0$.

• As the sum of the elements of each $k_j$, $j = 1, \dots, r$, is zero, the sums of the elements of the positive part $k_j^{+}$ and of the negative part $k_j^{-}$ are equal, so that we can write
\[ \prod_{x} q(x)^{k_j^{+}(x)} - \prod_{x} q(x)^{k_j^{-}(x)} = 0, \quad j = 1, \dots, r. \]
• It follows that the toric model implies a set of $r$ binomial and homogeneous equations in the un-normalized probabilities $q(x)$, $x \in Q$.
– Given a finite state log-linear model and all its limit points, a specific set of configurations of zero-probability cells arises. This set cannot be recovered by setting to zero some parameters in a generic toric parametric representation.
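The passage from an orthogonal integer vector to its binomial can be sketched as a small string-building routine. The function name `binomial` and the cell labels are assumptions made for the example; the two vectors are the columns k1 and k2 of the cancer-table example, listed in the cell order 11, 21, 31, 51, 12, 22, 42, 52.

```python
def binomial(k, names):
    """Return the binomial prod q^{k+} - prod q^{k-} of an integer vector k as a string."""
    def monomial(exponents):
        factors = [n if e == 1 else f"{n}^{e}"
                   for n, e in zip(names, exponents) if e > 0]
        return "*".join(factors) or "1"
    positive = [max(e, 0) for e in k]       # positive part k+
    negative = [max(-e, 0) for e in k]      # negative part k-
    if sum(positive) != sum(negative):
        raise ValueError("k must have zero sum for the binomial to be homogeneous")
    return f"{monomial(positive)} - {monomial(negative)}"

names = ["q11", "q21", "q31", "q51", "q12", "q22", "q42", "q52"]
k1 = [1, -1, 0, 0, -1, 1, 0, 0]
k2 = [0, -1, 0, 1, 0, 1, 0, -1]
print(binomial(k1, names))
print(binomial(k2, names))
```

The two strings reproduce the binomials of the implicit QI-model given earlier.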

27 Example: instances

• However, there exists a “maximal” parametric toric representation that gives a full parameterization of the extended model.
1. All toric models compatible with the implicit binomial model are characterized by a string of exponents of the $T$'s, which is a non-negative integer vector orthogonal to the vectors $[k_1 \dots k_r]$.
2. The lattice of non-negative integer vectors $t \in \mathbb{N}_{+}^{Q}$ such that the condition $t \cdot k_j = 0$ holds for each $j = 1, \dots, r$ has a finite number of generators that can be computed with symbolic software.

3. If the generators are $S_1, \dots, S_u$, then the “maximal” toric model is
\[ q(x) = \zeta_1^{S_1(x)} \cdots \zeta_u^{S_u(x)}, \quad x \in Q. \]
• If, for instance, $\zeta_1 = 0$, the support of the resulting probability will be the set $Q_1 = \{x \in Q : S_1(x) = 0\}$. On such a restricted support, the model will be again toric,
\[ q(x) = \zeta_2^{S_2(x)} \cdots \zeta_u^{S_u(x)}, \quad x \in Q_1, \]
or exponential if all the other parameters $\zeta_2, \dots, \zeta_u$ are assumed to be strictly positive.
• In this sense, we say that each toric model is a union of exponential models with different supports. Each one of these models is called an instance of the model.
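For the cancer-table example, the difference between a generic and the maximal toric representation can be made concrete by enumerating supports. The sketch below takes the rows of the design matrix M as the generic exponents and the rows of Mmax as the maximal ones, and lets each parameter be either zero or positive. The convention that a factor with exponent 0 equals 1 even when its parameter vanishes is an assumption of the example.

```python
from itertools import product

cells = ["11", "21", "31", "51", "12", "22", "42", "52"]
# Generic representation: rows of M (parameters rho_1..rho_5, psi_1, psi_2).
generic = dict(zip(cells, [
    (1, 0, 0, 0, 0, 1, 0), (0, 1, 0, 0, 0, 1, 0), (0, 0, 1, 0, 0, 1, 0),
    (0, 0, 0, 0, 1, 1, 0), (1, 0, 0, 0, 0, 0, 1), (0, 1, 0, 0, 0, 0, 1),
    (0, 0, 0, 1, 0, 0, 1), (0, 0, 0, 0, 1, 0, 1)]))
# Maximal representation: rows of Mmax (parameters zeta_1..zeta_7).
maximal = dict(zip(cells, [
    (0, 0, 0, 0, 1, 0, 1), (0, 0, 1, 0, 0, 0, 1), (1, 0, 0, 0, 0, 0, 0),
    (0, 0, 0, 1, 0, 0, 1), (0, 0, 0, 0, 1, 1, 0), (0, 0, 1, 0, 0, 1, 0),
    (0, 1, 0, 0, 0, 0, 0), (0, 0, 0, 1, 0, 1, 0)]))

def instances(exponents):
    """Nonempty supports of q(x) = prod_j zeta_j^{S_j(x)} when each zeta_j is 0 or > 0."""
    n_params = len(next(iter(exponents.values())))
    supports = set()
    for positive in product((False, True), repeat=n_params):
        support = frozenset(x for x, row in exponents.items()
                            if all(pos or e == 0 for pos, e in zip(positive, row)))
        if support:
            supports.add(support)
    return supports

generic_supports = instances(generic)
maximal_supports = instances(maximal)
print(len(generic_supports), len(maximal_supports),
      generic_supports < maximal_supports)
```

Under these conventions the maximal representation yields the 87 instances of the QI-model, while the generic one reaches only a proper subset of them.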
