<<

Likelihood Geometry

June Huh and Bernd Sturmfels

Introduction

Maximum likelihood estimation (MLE) is a fundamental computational problem in statistics, and it has recently been studied with some success from the perspective of . In these notes we give an introduction to the geometry behind MLE for algebraic statistical models for discrete data. As is customary in algebraic statistics [LiAS], we shall identify such models with certain algebraic subvarieties of high-dimensional complex projective spaces. The article is organized into four sections. The first three sections correspond to the three lectures given at Levico Terme. The last section will contain proofs of new results. In Sect. 1,westartoutwithplanecurves,andweexplainhowtoidentifythe relevant punctured Riemann surfaces. We next present the definitions and basic results for likelihood geometry in Pn.Theorems1.6 and 1.7 are concerned with the likelihood correspondence, the sheaf of differential one-forms with logarithmic poles, and the topological Euler characteristic. The ML degree of generic complete intersections is given in Theorem 1.10.Theorem1.15 shows that the likelihood fibration behaves well over strictly positive data. Examples of and Segre varieties are discussed in detail. Our treatment of linear spaces in Theo- rem 1.20 will appeal to readers interested in matroids and hyperplane arrangements. Section 2 begins leisurely, with the question Does watching soccer on TV cause hair loss? [MSS]. This leads us to conditional independence and low rank matrices.

J. Huh (!) Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA e-mail: [email protected] B. Sturmfels Department of Mathematics, University of California, Berkeley, CA 94720, USA e-mail: [email protected]

A. Conca et al., Combinatorial Algebraic Geometry, Lecture Notes 63 in Mathematics 2108, DOI 10.1007/978-3-319-04870-3__3, ©SpringerInternationalPublishingSwitzerland2014 64 J. Huh and B. Sturmfels

We study likelihood geometry of determinantal varieties, culminating in the duality theorem of Draisma and Rodriguez [DR]. The ML degrees in Theorems 2.2 and 2.6 were computed using the software Bertini [Bertini], underscoring the benefits of using numerical algebraic geometry for MLE. After a discussion of mixture models, highlighting the distinction between rank and nonnegative rank, we end Sect. 2 with a review of recent results in [ARSZ]ontensorsofnonnegativerank2. Section 3 starts out with toric models [PS,§1.22]andgeometricprogramming [BoydVan,§4.5].Theorem3.2 identifies the ML degree of a toric variety with the Euler characteristic of the complement of a hypersurface in a torus. Theorem 3.7 furnishes the ML degree of a variety parametrized by generic polynomials. Theo- rem 3.10 characterizes varieties of ML degree 1 and it reveals a beautiful connection to the A-discriminant of [GKZ]. We introduce the ML bidegree and the sectional ML degree of an arbitrary in Pn,andweexplainhowthesetwoare related. Section 3 ends with a study of the operations of intersection, projection, and restriction in likelihood geometry. This concerns the algebro-geometric meaning of the distinction between sampling zeros and structural zeros in statistical modeling. In Sect. 4 we offer precise definitions and technical explanations of more advanced concepts from algebraic geometry, including logarithmic differential forms, Chern–Schwartz–MacPherson classes, and schön very affine varieties. This enables us to present complete proofs of various results, both old and new, that are stated in the earlier sections. We close the introduction with a disclaimer regarding our overly ambitious title. There are many important topics in the statistical study of likelihood inference that should belong to “Likelihood Geometry” but are not covered in this article. Such topics include Watanabe’s theory of singular Bayesian integrals [Wat], differential geometry of likelihood in information geometry [AN], and real algebraic geometry of Gaussian models [Uhl]. We regret not being able to talk about these topics and many others. Our presentation here is restricted to the setting of [LiAS,§2.2], namely statistical models for discrete data viewed as projective varieties in Pn.

1 First Lecture

Let us begin our discussion with likelihood on algebraic curves in the complex 2 2 projective plane P .Wefixasystemofhomogeneouscoordinatesp0;p1;p2 on P . 2 The set of real points in P with sign.p0/ sign.p1/ sign.p2/ is identified with the open triangle D D

3 2 .p0;p1;p2/ R p0;p1;p2 >0 and p0 p1 p2 1 : D 2 W C C D ˚ « Given three positive integers u0; u1; u2,thecorrespondinglikelihoodfunctionis

pu0 pu1 pu2 ` .p ;p ;p / 0 1 2 : u0;u1;u2 0 1 2 u u u D .p0 p1 p2/ 0 1 2 C C C C Likelihood Geometry 65

This defines a rational function on P2,anditrestrictstoaregularfunctiononP2 , where nH

2 .p0 p1 p2/ P p0p1p2.p0 p1 p2/ 0 H D W W 2 W C C D ˚ « is our arrangement of four distinguished lines. The likelihood function `u0;u1;u2 is positive on the triangle 2,itiszeroontheboundaryof2,anditattainsits maximum at the point

1 .p0; p1; p2/ .u0; u1; u2/: (1) O O O D u0 u1 u2 C C

The corresponding point .p0 p1 p2/ is the only critical point of the function O WO WO 2 `u0;u1;u2 on the four-dimensional real manifold P .Toseethis,weconsiderthe logarithmic derivative nH

u0 u0 u1 u2 u1 u0 u1 u2 u2 u0 u1 u2 dlog.`u0;u1;u2 / C C ; C C ; C C : D p0 ! p0 p1 p2 p1 ! p0 p1 p2 p2 ! p0 p1 p2 Â C C C C C C Ã

We note that this equals .0; 0; 0/ if and only if .p0 p1 p2/ is the point .p0 p1 W W O WO W p2/ in (1). O Let X be a smooth curve in P2 defined by a homogeneous polynomial f.p0;p1;p2/.Thiscurveplaystheroleofastatisticalmodel,andourtaskis to maximize the likelihood function `u0;u1;u2 over its set X 2 of positive real points. To compute that maximum algebraically, we examine\ the set of all critical points of `u0;u1;u2 on the complex curve X .Thatsetofcriticalpointsisthe likelihood locus.UsingLagrangeMultipliersfromCalculus,weseethatitconsistsnH of all points of X such that dlog.`u ;u ;u / lies in the plane spanned by df and nH 0 1 2 .1; 1; 1/ in C3.Thus,ourtaskistostudythesolutionsinP2 of the equations nH 111 f.p ;p ;p / 0 and det u0 u1 u2 0: (2) 0 1 2 0 p0 p1 p2 1 D @f @f @f D B @p0 @p1 @p2 C @ A Suppose that X has degree d.Then,afterclearingdenominators,thesecond equation has degree d 1.ByBézout’sTheorem,weexpectthelikelihoodlocus C to consist of d.d 1/ points in P2 .Thisisindeedwhathappenswhenf is a generic polynomialC of degree d. nH We define the maximum likelihood degree (or ML degree)ofourcurveX to be the cardinality of the likelihood locus for generic choices of u0; u1; u2.Thusageneral plane curve of degree d has ML degree d.d 1/.However,forspecialcurves,the ML degree can be smaller. C 66 J. Huh and B. Sturmfels

Theorem 1.1. Let X be a smooth curve of degree d in P2,anda #.X / the number of its points on the distinguished arrangement. Then theD ML degree\ H of X equals d 2 3d a. ! C This is a very special case of Theorem 1.7 which identifies the ML degree with the signed Euler characteristic of X .Forageneralcurveofdegreed in P2,we have a 4d,andsod 2 3d a nHd.d 1/ as predicted. However, the number a of pointsD in X can! drop:C D C \ H Example 1.2. Consider the case d 1 of lines. A generic line has ML degree 2. D The line X V.p0 cp1/ has ML degree 1 provided c 0; 1 .Thespecialline D C 62f g X V.p0 p1/ has ML degree 0:(2)hasnosolutionsonX unless u0 u1 0. D C nH C D In the three cases, X is the Riemann sphere P1 with four, three, or two points removed. nH } Example 1.3. Consider the case d 2 of quadrics. A general quadric has ML degree 6.TheHardy–Weinberg curveD,whichplaysafundamentalroleinpopulation genetics, is given by

2p0 p1 2 f.p0;p1;p2/ det 4p0p2 p : D p 2p D ! 1 Â 1 2Ã The curve has only three points on the distinguished arrangement:

X .1 0 0/; .0 0 1/; .1 2 1/ : \ H D W W W W W! W Hence the ML degree of the˚ Hardy–Weinberg curve equals 1.Thismeansthatthe« maximum likelihood estimate (MLE) is a rational function of the data. Explicitly, the MLE equals 1 .p ; p ; p / .2u u /2 ;2.2u u /.u 2u /; .u 2u /2 : 0 1 2 2 0 1 0 1 1 2 1 2 O O O D 4.u0 u1 u2/ C C C C C C $ (3)% In applications, the Hardy–Weinberg curve arises via its parametric representa- tion

2 p0.s/ s D p1.s/ 2s.1 s/ (4) D ! 2 p2.s/ .1 s/ D ! Here the parameter s is the probability that a biased coin lands on tails. If we toss that same biased coin twice, then the above formulas represent the following probabilities:

p0.s/ probability of 0 heads D p1.s/ probability of 1 head D p2.s/ probability of 2 heads D Likelihood Geometry 67

Suppose now that the experiment of tossing the coin twice is repeated N times. We record the following counts, where N u0 u1 u2 is the sample size of our repeated experiment: D C C

u0 number of times 0 heads were observed D u1 number of times 1 head was observed D u2 number of times 2 heads were observed D The MLE problem is to estimate the unknown parameter s by maximizing

u0 u1 u2 u1 2u0 u1 u1 2u2 `u ;u ;u p0.s/ p1.s/ p2.s/ 2 s C .1 s/ C : 0 1 2 D D ! The unique solution to this optimization problem is

2u0 u1 s C : O D 2u0 2u1 2u2 C C

Substituting this expression into (4)givestheestimator p0.s/;p1.s/;p2.s/ for the three probabilities in our model. The resulting rational functionO coincidesO O with (3). $ % } The ML degree is also defined when the given curve X P2 is not smooth, but " it counts critical points of `u only in the regular locus of X.Hereisanexampleto illustrate this.

Example 1.4. AgeneralcubiccurveX in P2 has ML degree 12.Supposenowthat X is a cubic which meets transversally but has one isolated singular point in H P2 .Ifthesingularpointisanode then the ML degree of X is 10,andifthe singularnH point is a cusp then the ML degree of X is 9.TheMLdegreesarefoundby saturating the equations in (2)withrespecttothehomogenousidealofthesingular point. } Moving beyond likelihood geometry in the plane, we shall introduce our objects in any dimension. We fix the complex Pn with coordinates p0;p1;:::;pn,representingprobabilities.Wesummarizetheobserveddataina n 1 vector u .u0; u1;:::;un/ N C ,whereui is the number of samples in state D 2 i.ThelikelihoodfunctiononPn given by u equals

pu0 pu1 pun ` 0 1 n : u ### u u u D .p0 p1 pn/ 0 1 n C C###C C C!!!C The unique critical point of this rational function on Pn is the data point itself:

.u0 u1 un/: W W###W 68 J. Huh and B. Sturmfels

Moreover, this point is the global maximum of the likelihood function `u on the probability simplex n.Throughout,weidentifyn with the set of all positive real points in Pn. The linear forms in `u define an arrangement of n 2 distinguished H C hyperplanes in Pn.Thedifferentialofthelogarithmofthelikelihoodfunctionis the vector of rational functions

u0 u1 un u dlog.`u/ ; ;:::; C .1; 1; : : : ; 1/: (5) D p0 p1 pn ! p # $ % C n n Here p i 0 pi and u i 0 ui .Thevector(5)representsasectionofthe C D C D sheaf of differentialD one-forms on PDn that have logarithmic singularities along . This sheaf isP denoted P H

1 " n .log. //: P H n Our aim is to study the restriction of `u to a closed subvariety X P .Wewill assume that X is defined over the real numbers, irreducible, and not contained in .LetXsing denote the singular locus of X,and Xreg denote X Xsing. When X H n serves as a statistical model, the goal is to maximize the rational function `u on the semialgebraic set X n.Tosolvethisproblemalgebraically,wedetermineall \ critical points of the log-likelihood function log.`u/ on the complex variety X.Here we must exclude points that are singular or lie in . H Definition 1.5. The maximum likelihood degree of X is the number of complex critical points of the function `u on Xreg ,forgenericdatau.Thelikelihood nH correspondence X is the universal family of these critical points. To be precise, L n n X is the closure in P P of L %

.p; u/ p Xreg and dlog.`u/ vanishes at p : W 2 nH ˚ n n n n « We sometimes write Pp Pu for P P to highlight that the first factor is the probability space, with coordinates% p,whilethesecondfactoristhedataspace,% with coordinates u.Thefirstpartofthefollowingresultappearsin[Huh1,§2].A precursor was [HKS, Proposition 3].

Theorem 1.6. The likelihood correspondence X of any irreducible subvariety X n L n n in Pp is an irreducible variety of dimension n in the product Pp Pu .Themap n % n pr1 X Pp is a projective bundle over Xreg ,andthemappr2 X Pu is genericallyW L ! finite-to-one. nH W L ! n See Sect. 4 for a proof. The degree of the map pr2 X Pu to data space is the ML degree of X.ThisnumberhasatopologicalinterpretationasanEulerW L ! characteristic, provided suitable assumptions on X are being made. The relationship between the homology of a manifold and critical points of a suitable function on it is the topic of Morse theory. Likelihood Geometry 69

The study of ML degrees was started in [CHKS,§2]bydevelopingthe 1 connection to the sheaf "X .log. // of differential one-forms on X with logarith- mic poles along .Itwasshownin[H CHKS,Theorem20]thattheMLdegreeofX equals the signedH topological Euler characteristic

. 1/dim X #.X /; ! # nH provided X is smooth and the intersection X defines a normal crossing divisor H \ in X Pn.Amajordrawbackofthatearlyresultwasthatthehypothesesareso restrictive that they essentially never hold for varieties X that arise from statistical models used in practice. From a theoretical point view, this issue can be addressed by passing to a resolution of singularities. However, in spite of existing algorithms for resolution in characteristic zero, these algorithms do not scale to problems of the sizes of interest in algebraic statistics. Thus, whatever computations we wish to do should not be based on resolution of singularities. The following result due to [Huh1]givesthesametopologicalinterpretationof the ML degree. The hypotheses here are much more realistic and inclusive than those in [CHKS,Theorem20]. Theorem 1.7. If the very affine variety X is smooth of dimension d,thentheML degree of X equals the signed topologicaln EulerH characteristic of . 1/d #.X /. ! # nH The term very affine variety refers to a closed subvariety of some algebraic torus m n .C"/ .OurambientspaceP is a very affine variety because it has a closed embedding nH

n n 1 p0 pn P .C"/ C ;.p0 pn/ ;:::; : nH !! W###W 7!! p p $ C C % The study of such varieties is foundational for tropical geometry.Thespecialcase when X is a Riemann surface with a punctures, arising from a curve in P2,was seen in TheoremnH 1.1.WeremarkthatTheorem1.7 can be deduced from works of Gabber–Loeser [Gabber-Loeser]andFranecki–Kapranov[Franecki-Kapranov]on perverse sheaves on algebraic tori. The smoothness hypothesis is essential for Theorem 1.7 to hold. If X is singular then, generally, neither X nor Xreg has its signed Euler characteristic equal to the ML degree of X. VarietiesnH X thatnH demonstrate this are the two singular cubic curves in Example 1.4.

Conjecture 1.8. For any projective variety X Pn of dimension d,notcontained in , Â H . 1/d #.X / MLdegree .X/: ! # nH & In particular, the signed topological Euler characteristic . 1/d #.X / is nonneg- ative. ! # nH 70 J. Huh and B. Sturmfels

Analogous conjectures can be made in the slightly more general setting of [Huh1]. In particular, we conjecture that the inequality

. 1/d #.V / 0 ! # & m holds for any closed d-dimensional subvariety V .C"/ .  Remark 1.9. We saw in Example 1.2 that the ML degree of a projective variety X can be 0. In all situations of statistical interest, the variety X Pn intersects the " open simplex n in a subset that is Zariski dense in X.Ifthatintersectionissmooth then MLdegree.X/ 1.Infact,arguingasin[CHKS,Proposition11],itcanbe shown that for smooth& X,

MLdegree .X/ #.bounded regions of X /: & RnH Here a bounded region is a connected component of the semialgebraic set X RnH whose classical closure is disjoint from the distinguished hyperplane V.p / in Pn. If X is singular then the number of bounded regions of X canC exceed RnH MLdegree .X/.Forinstance,letX P2 be the cuspidal cubic curve defined by " 2 3 .p0 p1 p2/.7p0 9p1 2p2/ .3p0 5p1 4p2/ : C C ! ! D C C

The real part XR consists of 8 bounded and 2 unbounded regions, but the ML degree of X is 7.TheboundedregionthatcontainsthecuspnH .13 17 31/ has no W W! other critical points for `u. } In what follows we present instances that illustrate the computation of the ML degree. We begin with the case of generic complete intersections.Supposethat X Pn is a complete intersection defined by r generic homogeneous polynomials " g1;:::;gr of degrees d1;d2;:::;dr .

Theorem 1.10. The ML degree of X equals Dd1d2 dr ,where ### D d i1 d i2 d ir : (6) D 1 2 ### r i1 i2 ir n r C CX!!!C Ä $ Proof. By Bertini’s Theorem, the generic complete intersection X is smooth in Pn. All critical points of the likelihood function `u on X lie in the dense open subset X .Considerthefollowing.r 2/ .n 1/-matrix with entries in the polynomial nH C % C ring RŒp0;p1;:::;pn:

u0 u1 un ### 2 p0 p1 pn 3 @g1 @g1 ### @g1 p0 p1 pn u @p0 @p1 ### @pn 6 @g2 @g2 @g2 7 : (7) 6 p0 p1 pn 7 J.p/ D 6 @p0 @p1 ### @pn 7 Ä Q ' 6 : : : : 7 6 : : :: : 7 6 : : : 7 6 @gr @gr @gr 7 6 p0 @p p1 @p pn @p 7 6 0 1 ### n 7 4 5 Likelihood Geometry 71

Let Y denote the determinantal variety in Pn given by the vanishing of its .r 2/ .r 2/ minors. The codimension of Y is at most n r,whichisageneralupperC % boundC for ideals of maximal minors, and hence the dimension! of Y is at least r.Our genericity assumptions ensure that the matrix J.p/ has maximal row rank r 1 Q for all p X.Henceapointp X lies in Y if and only if the vector u is in theC row span of J.p/2 .Moreover,byTheorem2 1.6, Q

.Xreg / Y X Y nH \ D \ is a finite subset of Pn,anditscardinalityisthedesiredMLdegreeofX. Since X has dimension n r,weconcludethatY has the maximum possible codimension, namely n r,andthattheintersectionof! X with the determinantal variety Y is proper. We! note that Y is Cohen–Macaulay, since Y has maximal codimension n r,andidealsofminorsofgenericmatricesareCohen–Macaulay. Bézout’s Theorem! implies

MLdegree.X/ degree.X/ degree.Y / d1 dr degree.Y /: D # D ### # The degree of the determinantal variety Y equals the degree of the determinantal variety given by generic forms of the same row degrees. By the Thom–Porteous– Giambelli formula, this degree is the complete homogeneous symmetric function of degree codim.Y / n r evaluated at the row degrees of the matrix. Here, the row D ! degrees are 0; 1; d1;:::;dr ,andthevalueofthatsymmetricfunctionispreciselyD. We conclude that degree.Y / D.HencetheMLdegreeofthegenericcomplete D intersection X .g1;:::;gr / equals D d1d2 dn. D V # ### ut Example 1.11 (r 1). Agenerichypersurfaceofdegreed in Pn has ML degree D d D d d 2 d 3 d n: # D C C C###C Example 1.12 (r 2;n 3). Aspacecurvethatisthegenericintersectionoftwo D D surfaces of degree d and e in P3 has ML degree de d 2e de2. C C } Remark 1.13. It was shown in [HKS,Theorem5]that(6)isanupperboundfor the ML degree of any variety X of codimension r that is defined by polynomials of degree d1;:::;dr .Infact,thesameistrueundertheweakerhypothesisthatX is cut out by polynomials of degrees d1 dr dr 1 ds,soX need not be a complete intersection. However, the&###& hypothesis& codimC &###&.X/ r is essential in order for MLdegree.X/ (6)tohold.ThatcodimensionhypothesiswasforgottenwhenD this upper bound wasÄ cited in [LiAS,Theorem2.2.6]andin[PS,Theorem3.31]. Hence these two book references are not correct as stated. Here is a simple counterexample. Let n 3 and d1 d2 d3 2. Then the bound (6)istheBézoutnumber8D,andthisisalsothecorrectMLD D D degree for a general complete intersection of three quadrics in P3.NowletX be ageneralrationalnormalcurveinP3.ThecurveX is defined by three quadrics, 72 J. Huh and B. Sturmfels namely, the 2 2-minors of a 2 3-matrix filled with general linear forms in % % p0;p1;p2;p3.SinceX is a Riemann sphere with 15 punctures, Theorem 1.7 tells us that MLdegree.X/ 13,andthisexceedstheboundof8. D } We now come to a variety that is ubiquitous in statistics, namely the model of independence for two binary random variables [LiAS,§1.1].Thismodelis represented by Segre’s quadric surface X in P3.Bythiswemeanthesurfacedefined by the 2 2-determinant: % 3 X V.p00p11 p01p10/ P : D ! " The surface X is isomorphic to P1 P1,soitissmooth,andwecanapply Theorem 1.7 to find the ML degree. In% other words, we seek to determine the Euler characteristic of the open complex surface X where nH 3 p P p00p01p10p11.p00 p01 p10 p11/ 0 : H D 2 W C C C D ˚ 1 1 « To this end, we write X P P with coordinates .x0 x1/; .y0 y1/ .Our D % W W surface is parametrized by pij xi yj ,andhence D $ % 1 1 X P P x0x1y0y1.x0 x1/.y0 y1/ 0 nH D 1 % n C 1 C D P x0x1.x0 x1/ 0 P y0y1.y0 y1/ 0 D $nf % ˚C D g % nf C « D g 2-sphere three points 2-sphere three points : D $ nf g% % $ nf g % $ % $ % Since the Euler characteristic is additive and multiplicative,

#.X / . 1/ . 1/ 1: nH D ! # ! D This means that the map u p from the data to the MLE is a rational function in each coordinate. The following7!O “word problem for freshmen” is aimed at finding that function. Example 1.14. Do this exercise: A biologist friend of yours wishes to test whether two binary random variables are independent. She collects data and records the matrix of counts

u u u 00 01 : D u u  10 11à How to ascertain whether u lies close to the independence model

X V.p00p11 p01p10/‹ D ! Astatisticianwhorecentlystartedworkinginherlabexplainsthat,asthefirststep in the analysis of her data, the biologist should calculate the maximum likelihood estimate (MLE) Likelihood Geometry 73

p00 p01 p O O : O D p10 p11 Â O O Ã Can you help your friend by supplying the formula for p as a rational function in u? The solution to this word problem is as follows. TheO MLE is the rank 1 matrix

1 u0 p C u 0 u 1 : (8) O D .u /2 u # C C Â 1 Ã CC C $ % We illustrate the concepts introduced above by deriving this well-known formula. The likelihood correspondence X of X V.p00p11 p01p10/ is the subvariety L D ! of X P3 defined by % T U .p00;p01;p10;p11/ 0; (9) # D where U is the matrix

0 u10 u11 0 u00 u01 ! ! C u11 u01 u00 u10 00 U 0 C ! ! 1 : D u11 u10 0 u01 u00 0 B C ! ! C B 00u01 u11 u00 u10C @ ! ! C A We urge the reader to derive (9)fromDefinition1.5 using a computer algebra system. Note that the determinant of U vanishes identically. In fact, for generic uij,the matrix U has rank 3,soitskernelisspannedbyasinglevector.Thecoordinatesof that vector are given by Cramer’s rule, and we find them to be equal to the rational functions in (8). The locus where the function u p is undefined consists of those u where the matrix rank of U drops below 3.Acomputationshowsthattherankof7!O U drops to 2 on the variety

V.u00 u10; u01 u11/ V.u00 u01; u10 u11/; C C [ C C and it drops to 0 on the point V.u00 u01; u10 u11; u01 u11/.Inparticular,the C C C likelihood function `u given by that point u has infinitely many critical points in the quadric X. } We note that all coefficients of the linear forms that define the exceptional loci 3 in Pu for the independence model are positive. This means that data points u with all coordinates positive can never be exceptional. We will prove in Sect. 4 that this n n usually holds. Let pr1 X Pp and pr2 X Pu be the projections from the likelihood correspondenceW L to!p-space and uW-spaceL ! respectively. We are interested in the fibers of pr2 over positive points u. 74 J. Huh and B. Sturmfels

n 1 n Theorem 1.15. Let u R C ,andletX P be an irreducible variety such that 2 >0 " no singular points of any intersection X pi 0 lies in the hyperplane at infinity p 0 .Then \f D g f C D g (1) the likelihood function `u on X has only finitely many critical points in Xreg ; 1 nH (2) if the fiber pr2$ .u/ is contained in Xreg,thenitslengthequalstheMLdegree of X. The hypothesis concerning “no singular point” will be satisfied for essentially all statistical models of interest. Here is an example which shows that this hypothesis is necessary.

Example 1.16. We consider the smooth cubic curve X in P2 that is defined by

3 f .p0 p1 p2/ p0p1p2: D C C C

The ML degree of the curve X is 3.EachintersectionX pi 0 is a triple \f1 D g point that lies on the line at infinity p 0 .Thefiberpr2$ .u/ of the likelihood fibration over the positive point u f.1C 1D 1/gis the entire curve X. D W W } If u is not positive in Theorem 1.15,thenthefiberofpr2 over u may have positive dimension. We saw an instance of this at the end of Example 1.14.Suchresonance loci have been studied extensively when X is a linear subspace of Pn.See[CDFV] and references therein. The following cautionary example shows that the length of the scheme-theoretic n fiber of X Pu over special points u in the open simplex n may exceed the ML degree ofL X!.

Example 1.17. Let X be the curve in P2 defined by the ternary cubic

2 3 f p2.p1 p2/ .p0 p2/ : D ! C ! This curve intersects in eight points, has ML degree 5,andhasacuspidal singularity at H

P .1 1 1/: WD W W

The prime ideal in RŒp0;p1;p2; u0; u1; u2 for the likelihood correspon- dence X is minimally generated by five polynomials, having degrees .3; 0/; .2;L 2/; .3; 1/; .3; 1/; .3; 1/.Theyareobtainedbysaturatingthetwoequations in (2) with respect to p0p2 p0 p1;p2 p1 . h i\h ! ! i The scheme-theoretic fiber of pr1 over a general point of X is a reduced line in the u-plane, while the fiber of pr1 over P is the double line

2 2 L .u0 u1 u2/ P .2u0 u1 u2/ 0 : WD W W 2 W ! ! D ˚ « Likelihood Geometry 75

The reader is invited to verify the following assertions using a computer algebra system: 2 1 (a) If u is a general point of Pu,thenpr2$ .u/ consists of 5 reduced points in Xreg . 1nH (b) If u is a general point on the line L,thenthelocusofcriticalpointspr2$ .u/ consists of four reduced points in Xreg and the reduced point P . (c) If u is the point .1 1 1/ L,thenprnH1.u/ is a zero-dimensional scheme of W W 2 2$ length 6.ThisschemeconsistsofthreereducedpointsinXreg and P counted with multiplicity 3. nH In particular, the fiber in (c) is not algebraically equivalent to the general fiber (a). This example illustrates one of the difficulties classical geometers had to face when formulating the “principle of conservation of numbers”. See [Fulton,Chap.10]for amoderntreatment. } It is instructive to examine classical varieties from projective geometry from the likelihood perspective. For instance, we may study the in its Plücker embedding. Grassmannians are a nice test case because they are smooth, so that Theorem 1.7 applies.

Example 1.18. Let X G.2; 4/ denote the Grassmannian of lines in P3.Inits D Plücker embedding in P5,thisGrassmannianisthequadrichypersurfacedefinedby

p12p34 p13p24 p14p23 0: (10) ! C D

As in (7), the critical equations for the likelihood function `u are the 3 3-minors of %

u12 u13 u14 u23 u24 u34 p p p p p p : (11) 2 12 13 14 23 24 34 3 p12p34 p13p24 p14p23 p14p23 p13p24 p12p34 4 ! ! 5 By Theorem 1.6,thelikelihoodcorrespondence X is a five-dimensional subvariety L of P5 P5.Thecohomologyclassofthissubvarietycanberepresentedbythe bidegree% of its ideal:

5 4 3 2 2 3 4 BX .p; u/ 4p 6p u 6p u 6p u 2pu : (12) D C C C C

This is the multidegree,inthesenseof[ch3:MS,§8.5],of X with respect to L the natural Z2-grading on the polynomial ring RŒp; u.Wecanuse[ch3:MS, Proposition 8.49] to compute the bidegree from the prime ideal of X .Itsleading coefficient 4 is the ML degree of X.Itstrailingcoefficient2 is the degreeL of X.The polynomials BX .p; u/ will be studied in Sect. 3. The prime ideal of X is computed from the equations in (10)and(11)by saturation with respectL to .Itisminimallygeneratedbythefollowingeight H polynomials in RŒp; u: 76 J. Huh and B. Sturmfels

(a) one polynomial of degree .2; 0/,namelythePlückerquadric, (b) six polynomials of degree .1; 1/,givenby2 2-minors of %

p12 p34 p13 p24 p14 p23 ! ! ! and u12 u34 u13 u24 u14 u23 Â ! ! ! Ã

p12 p13 p23 p12 p14 p24 p13 p14 p34 p23 p24 p34 C C C C C C C C ; u12 u13 u23 u12 u14 u24 u13 u14 u34 u23 u24 u34 Â C C C C C C C C Ã (c) one polynomial of degree .2; 1/,forinstance

2u24p12p34 2u34p13p24 .u23 u24 u34/p14p24 C C C C 2 .u13 u14 u34/p .u12 2u13 u14 u24/p24p34: ! C C 24 ! C C ! For a fixed positive data vector u >0,thesesixpolynomialsin(b)reducetothree linear equations, and these cut out a plane P2 inside P5.Tofindthefourcritical points of `u on X G.2; 4/,wemustthenintersectthetwoconics(a)and(c)in D that plane P2. .m/ 1 The ML degree of the Grassmannian G.r; m/ in P r $ is the signed Euler char- m acteristic of the manifold G.r; m/ obtained by removing r 1 distinguished hyperplane sections. It would be verynH interesting to find a generalC formula for this ML degree. At present, we only know that the ML degree of G.2;$ % 5/ is 26,andthat the ML degree of G.2; 6/ is 156.ByTheorem1.7,thesenumbersgivetheEuler characteristic of G.2; m/ for m 6. nH Ä } We end this lecture with a discussion of the delightful case when X is a linear subspace of Pn,andtheopenvarietyX is the complement of a hyperplane arrangement.Inthiscontext,followingVarchenko[nH Varchenko], the likelihood function `u is known as the master function,andthestatementofTheorem1.7 was first proved by Orlik and Terao in [Orlik-Terao]. We assume that X has dimension d, is defined over R,anddoesnotcontainthevector1 .1; 1; : : : ; 1/.Wecanregard n D1 X as a .d 1/-dimensional linear subspace of R C .Theorthogonalcomplement C X ? with respect to the standard dot product is a linear space of dimension n d n 1 ! in R C .ThelinearspaceX ? 1 spanned by X ? and the vector 1 has dimension n 1 C n d 1 in R C ,andhencecanbeviewedassubspaceofcodimensiond in n! C Pu .Inournextformula,theoperation? is the Hadamard product or coordinatewise product. n n Proposition 1.19. The likelihood correspondence X in P P is defined by L %

p X and u p?.X? 1/: (13) 2 2 C

The prime ideal of X is obtained from these constraints by saturation with respect to . L H Likelihood Geometry 77

Proof. If all pi are non-zero then u p?.X 1/ says that 2 ? C u u u u=p 0 ; 1 ;:::; n WD p0 p1 pn $ % lies in the subspace X ? 1.Equivalently,thevectorobtainedbyaddingamultiple of .1; 1; : : : ; 1/ to u=p Cis perpendicular to X.Wecantakethatvectortobethe differential (5). Hence (13)expressestheconditionthatp is a critical point of `u on X. ut The intersection X is an arrangement of n 2 hyperplanes in X Pd .For special choices of the\ subspaceH X,itmayhappenthattwoormorehyperplanesC ' coincide. Taking p 0 as the hyperplane at infinity, we view X f C D g \ H as an arrangement of n 1 hyperplanes in the affine space Rd .Aregionofthis arrangement is bounded Cif it is disjoint from p 0 . f C D g Theorem 1.20. The ML degree of X is the number of bounded regions of the real affine hyperplane arrangement X in Rd .Thebidegreeofthelikelihood \ H correspondence X is the h-polynomial of the broken circuit complex of the rank d 1matroid associatedL with X . C \ H We need to explain the second assertion. The hyperplane arrangement X \ H consists of the intersections of the n 2 hyperplanes in with X Pd .We C d 1 H ' regard these as hyperplanes through the origin in R C .TheydefineamatroidM of rank d 1 on n 2 elements. We identify these elements with the variables C C x1;x2;:::;xn 2.ForeachcircuitC of M let mC . i C xi /=xj where j is the C D 2 smallest index such that xj C .Thebroken circuit complex of M is the simplicial 2 Q complex with Stanley–Reisner ring RŒx1;:::;xn 2= mC C circuit of M .See [ch3:MS,§1.1]forStanley–Reisnerbasics.TheHilbertseriesofthisgradedringC h W i has the form

d h0 h1z hd z C C###C : .1 z/d 1 ! C What is being claimed in Theorem 1.20 is that the bidegree of X equals L d d 1 2 d 2 d n d BX .p; u/ .h0u h1pu $ h2p u $ hd p / p $ (14) D C C C###C #

Equivalently, this is the class of X in the cohomology ring L n n n 1 n 1 H ".P P Z/ ZŒp; u= p C ; u C : % I D h i

There are several (purely combinatorial) definitions of the invariants hi of the matroid M .Forinstance,theyarecoefficientsofthefollowingspecializationof the characteristic polynomial:

d d 1 d 1 d #M .q 1/ q h0q hd 1q $ . 1/ $ h1q . 1/ h0 : (15) C D # ! $ C###C ! C ! ( Á 78 J. Huh and B. Sturmfels

Theorem 1.20 was used in [Huh0]toproveaconjectureofDawson,statingthat the sequence h0;h1;:::;hd is log-concave, when M is representable over a field of characteristic zero. The first assertion in Theorem 1.20 was proved by Varchenko in [Varchenko]. For definitions and characterizations of the characteristic polynomial #,andmanypoint- ers to matroid basics, we refer to [OTBook]. A proof of the second assertion was given by Denham et al. in a slightly different setting [Denham-Garrousian-Schulze, Theorem 1]. We give a proof in Sect. 4 following [Huh1,§3].Theramificationlocus n of the likelihood fibration pr2 X Pu is known as the entropic discriminant [SSV]. W L !

Example 1.21. Let d 2 and n 4,soX is a plane in P4,definedbytwolinear forms D D

c10p0 c11p1 c12p2 c13p3 c14p4 0; C C C C D (16) c20p0 c21p1 c22p2 c23p3 c24p4 0: C C C C D Following Theorem 1.20,weviewX as an arrangement of five lines in the affine plane \ H

2 p X p0 p1 p2 p3 p4 0 C : f 2 W C C C C 6D g'

Hence, for generic cij,theMLdegreeofX is equal to 6,thenumberofbounded regions of this arrangement. The condition u p?.X? 1/ in Proposition 1.19 translates into 2 C

u0 u1 u2 u3 u4 p0 p1 p2 p3 p4 rank 2 3 3: (17) c10p0 c11p1 c12p2 c13p3 c14p4 Ä 6 7 6c20p0 c21p1 c22p2 c23p3 c24p47 4 5 The 4 4-minors of this 4 5-matrix, together with the two linear forms defining X, % % 4 form a system of equations that has six solutions in P ,forgenericcij.Allsolutions have real coordinates. In fact, there is one solution in each bounded region of X . 4 4 nH The likelihood correspondence X is the fourfold in P P given by the Eqs. (16) and (17). L % We now illustrate the second statement in Theorem 1.20.Supposethatthereal numbers cij are generic, so M is the uniform matroid of rank three on six elements. The Stanley–Reisner ring of the broken circuit complex of M equals

RŒx1;x2;x3;x4;x5;x6= x2x3x4;x2x3x5;x2x3x6;:::;x4x5x6 : h i The Hilbert series of this graded algebra is

2 2 h0 h1z h2z 1 3z 6z C C C C : .1 z/3 D .1 z/3 ! ! Likelihood Geometry 79

We conclude that the bidegree (14)ofthelikelihoodcorrespondence X equals L 4 3 2 2 BX .p; u/ 6p 3p u p u : D C C

For special choices of the coefficients cij in (16), some triples of lines in the arrangement X may meet in a point. For such matroids, the ML degree drops from 6 to some\ integerH between 0 and 5.Werecommenditasanexercisetothe reader to explore these cases. For instance, can you find explicit cij so that the ML degree of X equals 3? What are the prime ideal and the bidegree of X in that case? How can the ML degree of X be 0 or 1? L } It would be interesting to know which statistical model X in Pn defines the n n likelihood correspondence X which is a complete intersection in P P . When X L % is a linear subspace of Pn,thisquestioniscloselyrelatedtotheconceptoffreeness of a hyperplane arrangement. Proposition 1.22. If the hyperplane arrangement X in X is free, then the \ H n n likelihood correspondence X is an ideal-theoretic complete intersection in P P . L % Proof. For the definition of freeness see §1 in the paper [CDFV]byCohen, Denman, Falk and Varchenko. The proposition is implied by their [CDFV,Theo- rem 2.13] and [CDFV, Corollary 3.8]. ut Using Theorem 1.20,thisprovidesalikelihoodgeometryproofofTerao’s theorem that the characteristic polynomial of a free arrangement factors into integral linear forms [Terao].

2SecondLecture

In our newspaper we frequently read about studies aimed at proving that a behavior or food causes a certain medical condition. We begin the second lecture with an introduction to statistical issues arising in such studies. The “medical question” we wish to address is Does Watching Soccer on TV Cause Hair Loss? We learned this amusing example from [MSS,§1]. In a fictional study, 296 British subjects aged between 40 and 50 were inter- viewed about their hair length and how many hours per week they watch soccer (a.k.a. “football”) on TV. Their responses are summarized in the following contin- gency table of format 3 3: % lots of hair medium hair little hair 2 h 51 45 33 U 2Ä–6 h 28 30 29 D 6 h 0 15 27 38 1 & @ A 80 J. Huh and B. Sturmfels

For instance, 29 respondents reported having little hair and watching between 2 and 6 hofsocceronTVperweek.Basedonthesedata,arethesetworandomvariables independent, or are we inclined to believe that watching soccer on TV and hair loss are correlated? On first glance, the latter seems to be the case. Indeed, being independent means that the data matrix U should be close to a rank 1 matrix. However, all 2 2-minors of U are strictly positive, indeed by quite a margin, and this suggests% a positive correlation. However, this interpretation is deceptive. A much better explanation of our data can be given by identifying a certain hidden random variable. That hidden variable is gender.Indeed,supposethatamongtherespondents126 were males and 170 were females. Our data matrix U is then the sum of the male table and the female table, maybe as follows:

3915 48 36 18 U 41220 24 18 9 : (18) D 0 1 C 0 1 72135 863 @ A @ A Both of these tables have rank 1,henceU has rank 2.Hence,theappropriatenull hypothesis H0 for analyzing our situation is not independence but it is conditional independence:

H0 Soccer on TV and Hair Loss are Independent given Gender. W And, based on the data U ,wemostdefinitelydonotrejectthatnullhypothesis. The key feature of the matrix U above was that it has rank 2.Wenowdefinelow rank matrix models in general. Consider two discrete random variables X and Y having m and n states respectively. Their joint probability distribution is written as an m n-matrix %

p11 p12 p1n ### p21 p22 p2n P 0 : : ###: : 1 D : : :: : B C Bp p p C B m1 m2 mnC @ ### A whose entries are nonnegative and sum to 1.Herepij represents the probability that X is in state i and Y is in state j .Theofallprobabilitydistributionsisthestandard simplex mn 1 of dimension mn 1.Wewrite r for the manifold of rank r $ ! M matrices in mn 1. $ The matrices P in 1 represent independent distributions. Mixtures of r M independent distributions correspond to matrices in r .Asalwaysinapplied algebraic geometry, we can make any problem that involvesM semi-algebraic sets progressively easier by three steps: Likelihood Geometry 81

•disregardinequalities, •replacerealnumberswithcomplexnumbers, •replaceaffinespacebyprojectivespace.

In our situation, this leads us to replacing r with its Zariski closure in complex mn 1 M projective space P $ .ThisZariskiclosureistheprojectivevariety r of complex V m n matrices of rank r.Notethat r is singular along r 1.Thecodimensionof % Ä V V $ r is .m r/.n r/. It is a non-trivial exercise to write the degree of r in terms of m;V n; r.Hint:[! ch3:MS! ,Example15.2]. V Suppose now that i.i.d. samples are drawn from an unknown joint distribution on our two random variables X and Y .Wesummarizetheresultingdataina contingency table

u11 u12 u1n ### u21 u22 u2n U 0 : : ###: : 1 : D : : :: : B C Bu u u C B m1 m2 mnC @ ### A The entries of the matrix U are nonnegative integers whose sum is u . The likelihood function for the contingency table U is the followingCC function on mn 1: $ m n u uij P CC p : 7! u u u ij 11 12 mn! i 1 j 1 ### YD YD Assuming fixed sample size, this is the likelihood of observing the data U given an unknown probability distribution P in mn 1.Inwhatfollowswesuppressthe multinomial coefficient. Furthermore, we regard$ the likelihood function as a rational mn 1 function on P $ ,sowewrite

m n uij i 1 j 1 pij `U D u D : D Q pQCC CC We wish to find a low rank probability matrix P that best explains the data U . Maximum likelihood estimation means solving the following optimization problem:

Maximize `U .P / subject to P r .(19) 2 M The optimal solution P is a rank r matrix. This is the maximum likelihood estimate O for U . For r 1,theindependencemodel,themaximumlikelihoodestimateP is O obtained fromD the data matrix U by the following formula, already seen for m n 2 in (8). Multiply the vector of row sums with the vector of column sums andD divideD by the sample size: 82 J. Huh and B. Sturmfels

u1 C 1 u2 C P 0 : 1 u 1 u 2 u n : (20) O D .u /2 # : # C C ### C CC B C Bu C $ % B m C @ CA Statisticians, scientists and engineers refer to such a formula as an “analytic solution”. In our view, it would be more appropriate to call this an “algebraic solution”. After all, we are here using algebra not analysis. Our algebraic solution for r 1 reveals the following points: D •TheMLEP is a rational function of the data U . O •ThefunctionU P is an algebraic function of degree 1. 7! O •TheMLdegreeoftheindependencemodel1 equals 1. V We next discuss the smallest case when the ML degree is larger than 1. Example 2.1. Let m n 3 and r 2.OurMLEproblemistomaximize D D D

u11 u12 u13 u21 u22 u23 u31 u32 u33 u `U .p11 p12 p13 p21 p22 p23 p31 p32 p33 /=p CC D CC subject to the constraints P 0 and rank.P / 2,whereP .pij/ is a 3 3-matrix of unknowns. The equations& that characterizeD the critical pointsD of this optimization% problem are

p11p22p33 p11p23p32 p12p21p33 det.P / $ $ 0 D p12p23p31 p13p21p32 p13p22p31 D C C $ and the vanishing of the 3 3-minors of the following 3 9-matrix: % %

u11 u12 u13 u21 u22 u23 u31 u32 u33 p p p p p p p p p 2 11 12 13 21 22 23 31 32 33 3 p11a11p12a12p13a13 p21a21p22a22p33a33 p31a31p32a32p33a33 4 5 @det.P / where aij is the cofactor of pij in P .Forrandompositivedatauij, D @pij these equations have ten solutions with rank.P / 2 in P8 .HencetheML D nH degree of 2 is 10.Ifweregardtheuij as unknowns, then saturating the above V determinantal equations with respect to 1 yields the prime ideal of the 8 8H [ V likelihood correspondence 2 P P .SeeExample4.8 for the bidegree and LV " % other enumerative invariants of the eight-dimensional variety 2 . LV } Recall from Definition 1.5 that the ML degree of a statistical model (or a projective variety) is the number of critical points of the likelihood function for generic data. Theorem 2.2. The known values for the ML degrees of the determinantal varieties r are V Likelihood Geometry 83

.m; n/ .3; 3/ .3; 4/ .3; 5/ .4; 4/ .4; 5/ .4; 6/ .5; 5/ D r 11111111 D r 2102658 191 843 3119 6776 D r 3111191 843 3119 61326 D r 41116776 D r 51 D The numbers 10 and 26 were computed back in 2004 using the symbolic software Singular,andtheywerereportedin[HKS,§5].Theboldfacenumberswere found in 2012 in [HRS]usingthenumericalsoftwareBertini.Inwhatfollows we shall describe some of the details.

Remark 2.3. Each determinantal variety r is singular along the smaller variety V r 1.Hence,theveryaffinevariety r is singular for r 2,soTheorem1.7 V $ V nH & does not apply. Here, p pij 0 .AccordingtoConjecture1.8,theML degree above providesH a lowerDf CC bound forD theg signed topological Euler characteristic Q of r .Thedifferencebetweenthetwonumbersreflectthenatureofthesingular V nH locus r 1 inside r .Forplanecurvesthathavenodesandcusps,we encounteredV $ n thisH issue inV ExamplesnH 1.4 and 1.17. We begin with a geometric description of the likelihood correspondence. An m % n-matrix P is a regular point in r if and only if rank.P / r.Thetangentspace V 2 m n D TP is a subspace of dimension rn rm r in C % .ItsorthogonalcomplementTP? has dimension .m r/.n r/. C ! ! ! mn 1 The partial derivatives of the log-likelihood function log.`U / on P $ are

@log.`U / uij u CC : @pij D pij ! p CC

Proposition 2.4. An m n-matrix P of rank r is a critical point for log.`U / on r % V if and only if the linear subspace TP? contains the matrix

uij u CC pij ! p i 1;:::;m Ä CC ' jD 1;:::;n D In order to get to the numbers in Theorem 2.2,thegeometricformulationwas replaced in [HRS]withaparametricrepresentationoftherankconstraints.The following linear algebra formulation worked well for non-trivial computations. Assume m n.LetP1;R1;L1 and ƒ be matrices of unknowns of formats r r, r .n r/, .mÄ r/ r,and.n r/ .m r/.Set % % ! ! % ! % !

P1 P1R1 R1 L L1 Im r ;P ; and R ; D ! $ D L1P1 L1P1R1 D In r  à Â! $ à $ % 84 J. Huh and B. Sturmfels where Im r and In r are identity matrices. In the next statement we use the symbol ? for the$ Hadamard$ (entrywise) product of two matrices that have the same format. Proposition 2.5. Fix a general m n data matrix U .Thepolynomialsystem % P?.R ƒ L/T u P U # # C CC # D consists of mn equations in mn unknowns. For generic U ,ithasfinitelymany complex solutions .P1;L1;R1;ƒ/.Them n-matrices P resulting from these % solutions are precisely the critical points of the likelihood function `U on the determinantal variety r . V We next present the analogue to Theorem 2.2 for symmetric matrices

2p11 p12 p13 p1n ### p12 2p22 p23 p2n 0 ### 1 P p13 p23 2p33 p3n : D B : : : ###: : C B : : : :: : C B C B C B p1n p2n p3n 2pnnC @ ### A Such matrices, with nonnegative coordinates pij that sum to 1,representjoint probability distributions for two identically distributed random variables with n states. The case n 2 and r 1 is the Hardy–Weinberg curve, which we discussed in detail in ExampleD 1.3. D Theorem 2.6. The known values for ML degrees of symmetric matrices of rank at most r (mixtures of r independent identically distributed random variables) are

n 234 5 6 D r 111111 D r 21637 270 2341 D r 3137 1394 ‹ D r 41270 ‹ D r 512341 D At present we do not know the common value of the ML degree for n 6 and r 3; 4.InwhatfollowswetakeacloserlookatthemodelforsymmetricD3 3- matricesD of rank 2. %

Example 2.7. Let n 3 and r 2,soX is a cubic hypersurface in P5.The D D 5 5 likelihood correspondence X is a five-dimensional subvariety of P P having bidegree L %

5 4 3 2 2 3 4 BX .p; u/ 6p 12p u 15p u 12p u 3pu : D C C C C Likelihood Geometry 85

The bihomogeneous prime ideal of X is minimally generated by 23 polynomials, namely: L •Onepolynomialofbidegree.3; 0/;thisisthedeterminantofP . •Threepolynomialsofdegree.1; 1/.Thesecomefromtheunderlyingtoricmodel rank.P / 1 .AssuggestedinProposition3.5,theyarethe2 2-minors of f D g %

2p0 p1 p2 p1 2p3 p4 p2 p4 2p5 C C C C C C : 2u0 u1 u2 u1 2u3 u4 u2 u4 2u5 Â C C C C C C Ã •Onepolynomialofdegree.2; 1/, •threepolynomialofdegree.2; 2/, •ninepolynomialsofdegree.3; 1/, •sixpolynomialsofdegree.3; 2/. It turns out that this ideal represents an expression for the MLE P in terms of O radicals in U . We shall work this out for one numerical example. Consider the data matrix U with

u11 10; u12 9; u13 1; u22 21; u23 3; u33 7: D D D D D D For this choice, all six critical points of the likelihood function are real and positive:

p11 p12 p13 p22 p23 p33 log `U .p/ 0:1037 0:3623 0:0186 0:3179 0:0607 0:1368 82:18102 ! 0:1084 0:2092 0:1623 0:3997 0:0503 0:0702 84:94446 ! 0:0945 0:2554 0:1438 0:3781 0:4712 0:0810 84:99184 ! 0:1794 0:2152 0:0142 0:3052 0:2333 0:0528 85:14678 ! 0:1565 0:2627 0:0125 0:2887 0:2186 0:0609 85:19415 ! 0:1636 0:1517 0:1093 0:3629 0:1811 0:0312 87:95759 ! The first three points are local maxima in 5 and the last three points are local minima. These six points define an algebraic field extension of degree 6 over Q.One might expect that the Galois group of these six points over Q is the full symmetric group S6.Ifthiswerethecasethentheabovecoordinatescouldnotbewrittenin radicals. However, that expectation is wrong. The Galois group of the likelihood 5 fibration pr2 X P given by the 3 3 symmetric problem is a subgroup of S6 W L ! U % isomorphic to the solvable group S4. To be concrete, for the data above, the minimal polynomial for the MLE p33 equals O

9528773052286944p6 4125267629399052p5 713452955656677p4 33 ! 33 C 33 3 2 63349419858182p 3049564842009p 75369770028p33 ! 33 C 33 ! 744139872 0: C D 86 J. Huh and B. Sturmfels

We solve this equation in radicals as follows:

16427 1 2 66004846384302 2 p33 227664 12 % % !2 19221271018849 !2 D 14779904193C 2 ! 14779904193! 2 C1 % % !1! !3; 211433981207339$ ! 211433981207339% 2 C 2 $ % where % is a primitive third root of unity, !2 94834811=3,and 1 D 3 5992589425361 5992589425361 2 97163 !2 150972770845322208 % 150972770845322208 % 40083040181952 !1; 2D 5006721709 212309132509! 212309132509 2C 2409 !3 1248260766912 4242035935404 % 4242035935404 % !2 20272573168 !1!2 D $158808750548335C 2 17063004159! 2 % 17063004159! 2 ! % % !1! : ! 76885084075396$ 2 C 422867962414678 ! 422867962414678% 2 $ % The explanation for the extra symmetry stems from the duality theorem below. It furnishes an involution on the set of six critical points that allows us to express them in radicals. } The tables in Theorems 2.2 and 2.6 suggest that the columns will always be symmetric. This fact was conjectured in [HRS]andsubsequentlyprovedbyDraisma and Rodriguez in [DR].

Theorem 2.8. Fix m n and consider the determinantal varieties i for either general or symmetricÄ matrices. Then the ML degrees for rank r andV for rank m r 1 coincide. ! C In fact, the main result in [DR]establishesthefollowingmoreprecisestatement. Given a data matrix U of format m n,wewrite "U for the m n-matrix whose .i; j / entry equals % %

uij ui u j # C # C : .u /3 CC Theorem 2.9. Fix m n and U an m n-matrix with strictly positive integer Ä % entries. There exists a bijection between the complex critical points P1;P2;:::;Ps of the likelihood function `U on r and the complex critical points Q1;Q2;:::;Qs V of `U on m r 1 such that V $ C

P1 ?Q1 P2 ?Q2 Ps ?Qs "U : D D###D D Thus, this bijection preserves reality, positivity, and rationality. The key to computing the ML degree tables and to formulating the duality conjectures in [HRS], was the use of numerical algebraic geometry. The software Bertini allowed for the computation of thousands of instances in which the formula of Theorem 2.9 was confirmed. Bertini is numerical software, based on homotopy continuation, for finding all complex solutions to a system of polynomial equations (and much more). The software is available at [Bertini]. The developers, Daniel Bates, Jonathan Likelihood Geometry 87

Hauenstein, Andrew Sommese, Charles Wampler, have just completed a new textbook [BHSW]onthemathematicsbehindBertini. For the past two decades, algebraic geometers have increasingly employed computational methods as a tool for their research. However, these computations have almost always been symbolic (and hence exact). They relied on Gröbner- based software such as Singular or Macaulay2.Algebraistsoftenfeelacertain discomfort when asked to trust a numerical computation. We encourage discussion about this issue, by raising the following question. Example 2.10. In the rightmost column of Theorem 2.6,itisassertedthatthe solution to a certain enumerative geometry problem is 2341. Which of these would you trust most: •theoutputofasymboliccomputation? •theoutputofanumericalcomputation? •aproofwrittenbyanalgebraicgeometer? In the authors’ view, it always pays off to be critical and double-check all computations, regardless of how they were carried out. And, this applies to all three of the above. } One of the big advantages of numerical algebraic geometry over Gröbner bases when it comes to MLE is the separation between Preprocessing and Solving.For n any particular variety X P ,suchasX r ,wepreprocessbysolvingthe " D V likelihood equations once, for a generic data set U0 chosen by us. The coordinates of U0 may be complex (rather than real) numbers. We can chose them with stable numerics in mind, so as to compute all critical points up to high accuracy. This step can take a long time, but the output is highly reliable. After solving the equations once, for that generic U0,allsubsequentcomputa- tions for any other data set U are very fast. In particular, the computation is fully parallelizable. If we have m processors at our disposal, where m MLdegree .X/, then each processor can track one of the paths. To be precise, homotopyD continuation starts from the critical points of `U0 and transform them into the critical points of `U .Geometricallyspeaking,forfixedX,thehomotopyamountstowalkingonthe n sheets of the likelihood fibration pr2 X Pu . To illustrate this point, here are theW L timings! (in seconds) that were reported in [HRS]forthedeterminantalvarietiesX r .Thosecomputationswerecarried out in Bertini on a 64-bit Linux clusterD withV 160 processors. The first row is the preprocessing time for solving the equations once. The second row is the time needed to solve any subsequent instance:

.m; n; r/ .4; 4; 2/ .4; 4; 3/ .4; 5; 2/ .4; 5; 3/ .5; 5; 2/ .5; 5; 4/ Preprocessing 257 427 1938 2902 348555 146952 Solving 4 4 20 20 83 83 88 J. Huh and B. Sturmfels

This table suggests that combining numerical algebraic geometry with existing tools from computational statistics might lead to a viable tool for certifiably solving MLE problems. We are now at the point where it is essential to offer a disclaimer. The low rank model r does not correctly represent the notion of conditional independence. The M model we should have used instead is the mixture model Mixr .Bydefinition,Mixr is the set of probability distributions P in mn 1 that are convex combinations of $ r independent distributions, each taken from 1.Equivalently,themixturemodel M Mixr consists of all matrices

P A ƒ B; (21) D # # where A is a nonnegative m r-matrix whose rows sum to 1, ƒ is a nonnegative r r diagonal matrix whose entries% sum to 1,andB is a nonnegative r n-matrix whose% % columns sum to 1.Theformula(21) expresses Mixr as the image of a trilinear map between polytopes:

r r & .m 1/ r 1 .n 1/ mn 1 ;.A;ƒ;B/ P: W $ % $ % $ ! $ 7! The following result is well-known; see e.g. [LiAS,Example4.1.2].

Proposition 2.11. Our low rank model r is the Zariski closure of the mixture M model Mixr in the probability simplex mn 1.If r 2 then Mixr r .If $ Ä D M r 3 then Mixr ¨ r . & M The point here is the distinction between the rank and the nonnegative rank of anonnegativematrix.Matricesin r have rank r and matrices in Mixr have M Ä nonnegative rank r.Thuselementsof r Mixr are matrices whose nonnegative rank exceeds its rank.Ä M n Example 2.12. The following 4 4-matrix has rank 3 but nonnegative rank 4: % 1100 1 0110 P 0 1 D 8 # 0011 B C B1001C @ A This is the slack matrix of a regular square. It is an element of 3 Mix3. M n } Engineers and scientists care more about Mixr than r .Inmanyapplications, nonnegative rank is more relevant than rank. The reasonM can be seen in (18). In such alow-rankdecomposition,wedonotwantthefemaletableorthemaletabletohave a negative entry. This raises the following important questions: How to maximize the likelihood function `U over Mixr ? What are the algebraic degrees associated with that optimization problem? Likelihood Geometry 89

Statisticians seek to maximize the likelihood function `U on Mixr by using the r r expectation-maximization (EM) algorithm in the space .m 1/ r 1 .n 1/ of parameters .A; ƒ; B/.Ineachiteration,theEMalgorithmstrictlydecreasesthe$ % $ % $ Kullback–Leibler divergence from the current model point P &.A;ƒ;B/ to the 1 D empirical distribution u U .ThehopeinrunningtheEMalgorithmforgiven CC # data U is that it converges to the global maximum P on Mix .Forapresentation O r of the EM algorithm for discrete algebraic models see [PS,§1.3].Astudyofthe geometry of this algorithm for the mixture model Mixr is undertaken in [KRS]. If the EM algorithm converges to a point that lies in the interior of the parameter polytope, and is non-singular with respect to &,thenthatpointwillbeamongthe critical points on r .ThesearecharacterizedbyProposition2.4.However,since Mix is properly containedM in ,itfrequentlyhappensthatthetrueMLE P lies r r O on the boundary of Mix .Inthatcase,M P is not a critical point of ` on ,meaning r O U r that .P;U/is not in the likelihood correspondence on .SuchpointswillneverbeM O r found by the method described above. V In order to address this issue, we need to identify the divisors in the variety mn 1 r P $ that appear in the algebraic boundary of Mixr .Bythiswemeanthe V " irreducible components W1;W2;:::;Ws of the Zariski closure of @Mixr .Eachof these Wi has codimension 1 in r .OncetheWi are identified, one would need to V examine their ML degree, and also the ML degree of the various strata Wi Wi 1 \###\ s in which `U might attain its maximum. At present we do not have this information even in the smallest non-trivial case m n 4 and r 3. D D D Example 2.13. We illustrate this issue by describing one of the components W of the algebraic boundary for the mixture model Mix3 when m n 4.Considerthe equation D D

p11 p12 p13 p14 0a12 a13 0b12 b13 b14 p21 p22 p23 p24 0a22 a23 0 1 0 1 b21 0b23 b24 p p p p D a 0a # 0 1 31 32 33 34 31 33 b b b 0 Bp p p p C Ba a 0 C 31 32 33 B 41 42 43 44C B 41 42 C @ A @ A @ A This parametrizes a 13-dimensional subvariety W of the hypersurface 3 V D det.P / 0 in P15.ThevarietyW is a component in the algebraic boundary f D g of Mix3.Toseethis,wechoosetheaij and bij to be positive, and we note that P lies outside Mix3 when precisely one of the 0 entries gets replaced by '.Theprime ! ideal of W in QŒp11;:::;p44 is obtained by eliminating the 17 unknowns aij and bij from the 16 scalar equations. A direct computation with Macaulay 2 shows that the variety W is Cohen–Macaulay of codimension-2.BytheHilbert–Burch Theorem, it is defined by the 4 4-minors of the 4 5-matrix. This following specific matrix representation was suggested% to us by Aldo% Conca and Matteo Varbaro:

p11 p12 p13 p14 0 p21 p22 p23 p24 0 0 1 : p31 p32 p33 p34 p34.p11p22 p12p21/ B ! C Bp41 p42 p43 p44 p41.p12p24 p14p22/ p44.p11p22 p12p21/C @ ! C ! A 90 J. Huh and B. Sturmfels

Tte algebraic boundary of Mix3 consists of precisely 304 irreducible components, namely the 16 coordinate hyperplanes and 288 hypersurfaces that are all isomorphic to W .Thisisprovedin[KRS]. In that paper, it is also shown that the ML degree of Wequals633. } The definition of rank varieties and mixture models extends to m-dimensional tensors P of arbitrary format d1 d2 dm.WerefertoLandsberg’sbook % %###% [Land]foranintroductiontotensorsandtheirrank.Now, r is the variety of tensors V of borderrank r,themodel r is the set of all probability distributions in r ,and Ä M V the model Mixr is the subset of tensors of nonnegative rank r. Unlike in the matrix case m 2,themixturemodelforborderrankr 2 isÄ already quite interesting when mD 3.Westatetwotheoremsthatcharacterizeourobjects.Theset-theoreticD version of& Theorem 2.14 is due to Landsberg and Manivel [LM]. The ideal-theoretic statement was proved more recently by Raicu [Rai].

Theorem 2.14. The variety 2 is defined by the 3 3-minors of all flattenings of P . V % Here, flattening means picking any subset A of Œn 1; 2; : : : ; n with 1 Df g Ä A n 1 and writing the tensor P as an ordinary matrix with i A di rows and j jÄ ! 2 j A dj columns. 62 Q TheoremQ 2.15. The mixture model Mix2 is the subset of supermodular distribu- tions in 2. M This theorem was proved in [ARSZ]. Being supermodular means that P satisfies anaturalfamilyofquadraticbinomialinequalities.Weexplaintheseform D 3; d1 d2 d3 2. D D D Example 2.16. We consider 2 2 2 tensors. Since secant lines of the Segre variety 1 1 1 7 % % 7 P P P fill all of P ,wehavethat 2 P and 2 7.Themixturemodel % % V D M D Mix2 is an interesting, full-dimensional, closed, semi-algebraic subset of 7.By 7 definition, Mix2 is the image of a 2-to-1 map & .1/ 7 analogous to (21). W ! The branch locus is the 2 2 2-hyperdeterminant, which is a hypersurface in P7 of degree 4 and ML degree 13% .% The analysis in [ARSZ,§2]representsthemodelMix2 as the union of four toric cells.Oneofthesetoriccellsisthesetoftensorssatisfying

p111p222 p112p221 p111p222 p121p212 p111p222 p211p122 & & & p112p222 p122p212 p121p222 p122p221 p211p222 p212p221 (22) & & & p111p122 p112p121 p111p212 p112p211 p111p221 p121p211 & & &

Anonnegative2 2 2-tensor P in 7 is supermodular if it satisfies these inequal- % % ities, possibly after label swapping 1 2. We visualize Mix2 by restricting to the $ three-dimensional subspace H given by p111 p222;p112 p221;p121 p212 D D D and p211 p122.TheintersectionH 7 is a tetrahedron, and we consider D \ H Mix2 inside that tetrahedron. The restricted model H Mix2 is shown on the left\ in Fig. 1.Itconsistsoffourtoriccellsasshownontherightside.Theboundary\ Likelihood Geometry 91

Fig. 1 A three-dimensional slice of the seven-dimensional model of 2 2 2 tensors of nonnega- tive rank 2.Eachtoriccellisboundedby3 quadrics and contains a vertex% % of the tetrahedron Ä is given by three quadratic surfaces, shown in red, green and blue, and which are obtained from either the first or the second row in (22) by restriction to H . The boundary analysis suggested in Example 2.13 turns out to be quite simple in the present example. All boundary strata of the model Mix2 are varieties of ML degree 1. One such boundary stratum for Mix2 is the five-dimensional toric variety

7 X V.p112p222 p122p212;p111p122 p112p121;p111p222 p121p212/ P : D ! ! ! " As a preview for what is to come, we report its ML bidegree and its sectional ML degree:

7 6 5 2 4 3 3 4 2 5 BX .p; u/ p 2p u 3p u 3p u 3p u 3p u ; D 7 C 6 C 5 2 C 4 3C 3 4C 2 5 (23) SX .p; u/ p 14p u 30p u 30p u 15p u 3p u : D C C C C C In the next section, we shall study the class of toric varieties and the class of varieties having ML degree 1. Our variety X lies in the intersection of these two important classes. }

3ThirdLecture

In our third lecture we start out with the likelihood geometry of embedded toric varieties. Fix a .d 1/ .n 1/ integer matrix A .a0;a1;:::;an/ of rank d 1 that has .1; 1; : : : ;C 1/ as% its lastC row. This matrix definesD an effective action ofC the d n torus .C"/ on projective space P :

d n n a0 a1 an .C"/ P P ;t.p0 p1 pn/ .t Q p0 t Q p1 t Q pn/: % !! % W W###W 7!! # W # W###W # 92 J. Huh and B. Sturmfels

Here ai is the column vector ai with the last entry 1 removed. We also fix Q n 1 c .c0;c1;:::;cn/ .C"/ C ; D 2 n n d viewed as a point in P .LetXc be the closure in P of the orbit .C"/ c. This is a projective toric variety of dimension d,definedbythepair.A; c/.The# ideal that defines Xc is the familiar toric ideal IA as in [LiAS,§1.3],butwith p .p0;:::;pn/ replaced by D p p p p=c 0 ; 1 ;:::; n : (24) D c c c  0 1 n à Example 3.1. Fix d 2 and n 3.Thematrix D D 0301 A 0031 D 0 1 1111 @ A specifies the following family of toric surfaces of degree three in P3:

3 3 2 3 3 Xc .c0 c1x c2x c3x1x2/ .x1;x2/ .C / V.c3 p0p1p2 c0c1c2 p3/: D f W 1 W 2 W W 2 " gD # ! #

Of course, the prime ideal of any particular surface Xc is the principal ideal generated by

p p p p 3 0 1 2 3 : c c c ! c 0 1 2 Â 3 Ã

How does the ML degree of Xc depend on the parameter c .c0;c1;c2;c3/ 4 D 2 .C"/ ? } We shall express the ML degree of the toric variety Xc in terms of the d complement of a hypersurface in the torus .C"/ .Thepair.A; c/ define the sparse Laurent polynomial

a0 a1 an f.x/ c0 x Q c1 x Q cn x Q : D # C # C###C # n Theorem 3.2. The ML degree of the d-dimensional toric variety Xc P is equal to . 1/d times the Euler characteristic of the very affine variety " ! d Xc x .C"/ f.x/ 0 : (25) nH ' 2 W 6D ˚ « For generic c,theMLdegreeagreeswiththedegreeofXc,whichisthenormalized volume of the d-dimensional lattice polytope conv.A/ obtained as the convex hull of the columns of A. Likelihood Geometry 93

Proof. We first argue that the identification (25)holds.Themap

a0 a1 an x p .c0 x Q c1 x Q cn x Q / 7!! D # W # W###W # d n defines an injective group homomorphism from .C"/ into the dense torus of P .Its d image is equal to the dense torus of Xc,sowehaveanisomorphismbetween.C"/ and the dense torus of Xc.Underthisisomorphism,theaffineopenset f 0 d f 6D g in .C"/ is identified with the affine open set p0 pn 0 in the dense f Cd ###C 6D g torus of Xc.ThelatterispreciselyXc .Since.C"/ is smooth, we see that Xc is smooth, so our first assertion followsnH from Theorem 1.7.ThesecondassertionnH is a consequence of the description of the likelihood correspondence X via linear L c sections of Xc that is given in Proposition 3.5 below. ut Example 3.3. We return to the cubic surface Xc in Example 3.1.Forageneral parameter vector c,theMLdegreeofXc is 3.Forinstance,thesurfaceV.p0p1p2 3 3 ! p3/ P has ML degree 3.However,theMLdegreeofXc drops to 2 whenever the plane" curve defined by

3 3 f.x1;x2/ c0 c1x c2x c3x1x2 D C 1 C 2 C 2 has a singularity in .C"/ .Forinstance,thishappensforc .1 1 1 3/.The 3 3 D W W W! corresponding surface V.27p0p1p2 p / P has ML degree 2. C 3 " } The isomorphism (25)hasaniceinterpretationintermsofConvexOptimization. Namely, it implies that maximum likelihood estimation for toric varieties is equiv- alent to global minimization of posynomials,andhencetothemostfundamental case of Geometric Programming.Wereferto[BoydVan,§4.5]foranintroduction to posynomials and geometric programming. n 1 We write for the one-norm on R C ,wesetb Au,andweassumethat j#j n 1 D c .c0;c1;:::;cn/ is in R>0C .Maximumlikelihoodestimationfortoricmodelsis theD problem

pu Maximize subject to p Xc n: (26) p u 2 \ j jj j

ai Setting pi ci x Q as above, this problem becomes equivalent to the geometric program D #

u f.x/j j d Minimize subject to x R>0: (27) xb 2

u b By construction, f.x/j j=x is a posynomial whose Newton polytope contains the d origin. Such a posynomial attains a unique global minimum on the open orthant R>0. This can be seen by convexifying as in [BoydVan,§4.5.3].Thisglobalminimum of (27)correspondstothesolutionof(26), which exists and is unique by Birch’s Theorem [PS,Theorem1.10]. 94 J. Huh and B. Sturmfels

Example 3.4. Consider the geometric program for the surfaces in Example 3.1,with

0301 A 0031 and u .0; 0; 0; 1/: D 0 1 D 1111 @ A The problem (27)istofindtheglobalminimum,overallpositivex .x1;x2/,of the function D

f.x1;x2/ 1 1 2 1 1 2 c0x1$ x2$ c1x1 x2$ c2x1$ x2 c3: x1x2 D C C C

3 This is equivalent to maximizing p3=p subject to p V.c3 p0p1p2 c0c1c2 3 C 2 # ! # p / 3. 3 \ } n n We now describe the toric likelihood correspondence Xc in P P associated L % with the pair .A; c/.ThisisthelikelihoodcorrespondenceofthetoricvarietyXc " Pn defined above. n Proposition 3.5. On the open subset .Xc / P ,thetoriclikelihoodcorrespon- nH % dence X is defined by the 2 2-minors of the 2 .d 1/-matrix L c % % C p=c AT # : (28) u=c AT  # à Here the notation p=c is as in (24). In particular, for any fixed data vector u, the critical points of `u are characterized by a linear system of equations in p restricted to Xc. Proof. This is an immediate consequence of Birch’s Theorem [PS,Theorem1.10]. ut Example 3.6. The Hardy–Weinberg curve of Example 1.3 is the subvariety Xc 2 2 D V.p1 4p0p2/ in the projective plane P . As a toric variety, this plane curve is given by!

012 A and c .1; 2; 1/: D 210 D Â Ã 2 2 The likelihood correspondence of Xc is the surface in P P given by %

2p0 p1 p1 2p2 2p0 p1 det det C C 0: (29) p1 2p2 D u1 2u2 2u0 u1 D Â Ã Â C C Ã Note that the second determinant equals the determinant of the 2 2-matrix (28) % times 4.Saturating(29) with respect to p0 p1 p2 reveals two further equations of degree .1; 1/: C C Likelihood Geometry 95

2.u1 2u2/p0 .2u0 u1/p1 and .u1 2u2/p1 2.2u0 u1/p2: C D C C D C For fixed u,theseequationshaveauniquesolutioninP2,givenbytheformulain(3). } Toric varieties are rational varieties that are parametrized by monomials. We now examine those varieties that are parametrized by generic polynomials. Understand- ing these is useful for statistics since many widely used models for discrete data are given in the form

f ‚ n; W ! where ‚ is a d-dimensional polytope and f is a polynomial map. The coordinates f0;f1;:::;fn are polynomial functions in the parameters ( .(1;:::;(d / D satisfying f0 f1 fn 1.Suchmodelsincludethemixturemodelsin Proposition 2.11C,phylogeneticmodels,Bayesiannetworks,hiddenMarkovmodels,C###C D and many others arising in computational biology [PS]. The model specified by the polynomials f0;:::;fn is the semialgebraic set n f.‚/ n. We study its Zariski closure X f.‚/ in P .Findingitsequations is hard" and interesting. D

Theorem 3.7. Let f0;f1;:::;fn be polynomials of degrees b0;b1;:::;bn satisfying d fi 1.TheMLdegreeofthevarietyX is at most the coefficient of z in the generatingD function P .1 z/d ! : .1 zb0/.1 zb1/ .1 zbn/ ! ! ### !

Equality holds when the coefficients of f0;f1;:::;fn are generic relative to fi 1. D Proof.P This is the content of [CHKS,Theorem1]. ut Example 3.8. We examine the case of quartic surfaces in P3.Letd 2;n 3,pick D D random affine quadrics f1;f2;f3 in two unknowns and set f0 1 f1 f2 f3. This defines a map D ! ! !

2 3 3 f C C P : W ! " The ML degree of the image surface X f.C2/ in P3 is equal to 25 since D .1 z/2 ! 1 6z 25z2 88z3 .1 2z/4 D C C C C### ! The rational surface X is a Steiner surface (or Roman surface). Its singular locus consists of three lines that meet in a point P .Tounderstandthegraphoff ,we 96 J. Huh and B. Sturmfels

2 2 2 observe that the linear span of f0;f1;f2;f3 in CŒx; y has a basis 1; L ;M ;N f g f g where L; M; N represent lines in C2.Letl denote the line through M N parallel to L, m the line through L N parallel to M ,andn the line through L \M parallel \ \ to N .ThemapC2 X is a bijection outside these three lines, and it maps each ! line 2-to-1 onto one of the lines in Xsing.ThefiberoverthespecialpointP on X consists of three points, namely, l m, l n and m n.Ifthequadricf0 were also \ \ \ picked at random, rather than as 1 f1 f2 f3,thenwewouldstillgetaSteiner ! ! ! surface X P3.However,nowtheMLdegreeofX increases to 33. " On the other hand, if we take X to be a general quartic surface in P3,soX is a smooth K3 surface of Picard rank 1,thenX has ML degree 84.Thisisthe formula in Example 1.11 evaluated at n 3 and d 4.HereX is the generic D D nH quartic surface in P3 with five plane sections removed. The number 84 is the Euler characteristic of that open K3 surface. In the first case, X is singular, so we cannot apply Theorem 1.7 directly to nH our Steiner surface X in P3.However,wecanworkintheparameterspaceand 2 consider the smooth very affine surface C V.f0f1f2f3/.Thenumber25 is the Euler characteristic of that surface. n It is instructive to verify Conjecture 1.8 for our three quartic surfaces in P3.We found

#.X / 38 > 25 MLdegree .X/; nH D D #.X / 49 > 33 MLdegree .X/; nH D D #.X / 84 84 MLdegree .X/: nH D D D The Euler characteristics of the three surfaces were computed using Aluffi’s method [AluJSC]. } We now turn to the following question: which projective varieties X have ML degree one? This question is important for likelihood inference because a model having ML degree one means that the MLE p is a rational function in the data u. It is known that Bayesian networks and decomposableO graphical models enjoy this property, and it is natural to wonder which other statistical models are in this class. The answer to this question was given by the first author in [Huh2]. We shall here present the result of [Huh2]fromaslightlydifferentangle. Our point of departure is the notion of the A-discriminant, as introduced and studied by Gel’fand, Kapranov and Zelevinsky in [GKZ]. We fix an r m integer % matrix A .a1;a2;:::;am/ of rank r which has .1; 1; : : : ; 1/ in its row space. The Zariski closureD of

a1 a2 am m 1 r .t t t / P $ t .C"/ W W###W 2 W 2 ˚ m 1 « is an .r 1/-dimensional toric variety YA in P $ .Wehereintentionallychanged the notation! relative to that used for toric varieties at the beginning of this section. The reason is that d and n are always reserved for the dimension and embedding dimension of a statistical model. Likelihood Geometry 97

The dual variety YA" is an irreducible variety in the dual projective space m 1 .P $ /_ whose coordinates are x .x1 x2 xm/.Weidentifypointsx m 1 D W W###W in .P $ /_ with hypersurfaces

r a1 a2 am t .C"/ x1 t x2 t xm t 0 : (30) 2 W # C # C###C # D ˚ m 1 « The dual variety YA" is the Zariski closure in .P $ /_ of the locus of all hyper- surfaces (30)thataresingular.Typically,YA" is a hypersurface. In that case, YA" is defined by a unique (up to sign) irreducible polynomial A ZŒx1;x2;:::;xm. 2 The homogeneous polynomial A is called the A-discriminant.Manyclassical discriminants and resultants are instances of A.Soaredeterminantsandhyperde- terminants. This is the punch line of the book [GKZ]. 3210 Example 3.9. Let m 4;r 2,andA .Theassociatedtoricvariety D D D 0123 is the twisted cubic curve  Ã

2 3 3 YA .1 t t t / t C P : D W W W j 2 " ˚ « 3 The variety YA" that is dual to the curve YA is a surface in .P /_.ThesurfaceYA" parametrizes all planes that are tangent to the curve YA.Theserepresentunivariate cubics

2 3 x1 x2t x3t x4t C C C that have a double root. Here the A-discriminant is the classical discriminant

2 2 3 3 2 2 A 27x x 18x1x2x3x4 4x1x 4x x4 x x : D 1 4 ! C 3 C 2 ! 2 3 3 The surface YA" in P defined by this equation is the discriminant of the univariate cubic. } Theorem 3.10. Let X Pn be a projective variety of ML degree 1.Each  coordinate pi of the rational function u p is an alternating product of linear O 7!O forms in u0; u1;:::;un. The paper [Huh2]givesanexplicitconstructionofthemapu p as a Horn uniformization.Aprecursorwas[Kapranov]. We explain this construction.7!O The point of departure is a matrix A as above. We now take A to be any non-zero homogenous polynomial that vanishes on the dual variety YA" of the toric variety YA.IfYA" is a hypersurface then A is the A-discriminant. First, we write A as a Laurent polynomial by dividing it by one of its monomials: 1 b0 b1 bn A 1 c0 x c1 x cn x : (31) monomial # D ! # ! # !###! # 98 J. Huh and B. Sturmfels

This expression defines an m .n 1/ integer matrix B .b0;:::;bn/ satisfying % C D AB 0.Second,wedefineX to be the rational subvariety of Pn that is given parametricallyD by p i bi ci x for i 0; 1; : : : ; n: (32) p0 p1 pn D # D C C###C

The defining ideal of X is obtained by eliminating x1;:::;xm from the equations above. Then X has ML degree 1,and,byHuh[Huh2], every variety of ML degree 1 arises in this manner.

Example 3.11. The following curve in P3 happens to be a variety of ML degree 1:

2 X V 9p1p2 8p0p3 ;p 12.p0 p1 p2 p3/p3 : D ! 0 ! C C C This curve comes from$ the discriminant of the univariate cubic in Example% 3.9:

3 3 2 2 1 2 x2x3 4 x2 4 x3 1 x2 x3 A 1 2 2 2 2 : monomial # D ! 3 x1x4 ! !27 x1 x4 ! !27 x1x4 ! 27 x1 x4 $ % $ % $ % $ % We derived the curve X from the four parenthesized monomials via the for- mula (32). The maximum likelihood estimate for this model is given by the products of linear forms

3 3 2 2 2 x2x3 4 x2 4 x3 1 x2 x3 p0 p1 2 p2 2 p3 2 2 O D 3 x1x4 O D!27 x1 x4 O D!27 x1x4 O D 27 x1 x4 where

x1 u0 u1 2u2 2u3 x2 u0 3u2 2u3 D! ! ! ! D C C x3 u0 3u1 2u3 x4 u0 2u1 u2 2u3 D C C D! ! ! ! These expressions are the alternating products of linear forms promised in Theo- rem 3.10. } We now give the formula for pi in general. This is the Horn uniformization of [GKZ,§9.3]. O

Corollary 3.12. Let X Pn be the variety of ML degree 1 with parametriza- tion (32)derivedfromascaled" A-discriminant (31). The coordinates of the MLE function u p are 7!O m n bkj pk ck . bijui / : O D # j 1 i 0 YD XD Likelihood Geometry 99

It is not obvious (but true) that p0 p1 pn 1 holds in the formula O CO C###CO D above. In light of its monomial parametrization, our variety X is toric in Pn .In n nH general, it is not toric in P ,duetoappearancesofthefactor.p0 p1 pn/ in equations for X.Interestingly,therearenumerousinstanceswhenthisfactordoesC C###C not appear and X is toric also in Pn. One toric instance is the independence model X V.p00p11 p01p10/,whose MLE was derived in Example 1.14. What is the matrixD A in this! case? We shall answer this question for a slightly larger example, which serves as an illustration for decomposable graphical models. Example 3.13. Consider the conditional independence model for three binary variables given by the graph —– —– .Weclaimthatthisgraphicalmodelis derived from ( ( (

a00 a10 a01 a11 b00 b01 b10 b11 c0 c1 d 11111111111 x1 1 0 0 0 0 0 0100 A y00 0 1 1 0 0 0 00101: D z B 00001100100C B C w B 00000011010C B C @ A The discriminant of the corresponding family of hypersurfaces

4 .x; y; z; w/ .C"/ .a00 a10/x .a01 a11/y .b00 b01/z 2 j C C C C C ˚ .b10 b11/w c0xz c1yw d 0 C C C C C D equals «

A c0c1d a01b10c0 a11b10c0 a01b11c0 a11b11c0 D ! ! ! ! a00b00c1 a10b00c1 a00b01c1 a10b01c1: ! ! ! !

We divide this A-discriminant by its first term c0c1d to rewrite it in the form (31) with n 7.TheparametrizationofX P7 given by (32)canbeexpressedas D "

aij bjk pij k # for i;j;k 0; 1 : (33) D cj d 2f g # This is indeed the desired graphical model —– —– with implicit representation ( ( ( 7 X V p000p101 p001p100 ;p010p111 p011p110 P : D ! ! " The linear forms used$ in the Horn uniformization of Corollary%3.12 are

aij uij bjk u jk cj u j d u D C D C D C C D CCC 100 J. Huh and B. Sturmfels

Substituting these expressions into (33), we obtain

uij u jk pij k C # C for i;j;k 0; 1 : O D u j u 2f g C C # CCC This is the formula in Lauritzen’s book [Lau]forMLEofdecomposablegraphical models. } We now return to the likelihood geometry of an arbitrary d-dimensional projec- tive variety X in Pn,asalwaysdefinedoverR and not contained in .Wedefinethe H n n ML bidegree of X to be the bidegree of its likelihood correspondence X P P . This is a binary form L " %

d d 1 d n d BX .p; u/ .b0 p b1 p $ u bd u / p $ ; D # C # C###C # # where b0;b1;:::;bd are certain positive integers. By definition, BX .p; u/ is the 2 multidegree [ch3:MS,§8.5]oftheprimeidealof X ,withrespecttothenaturalZ - L grading on the polynomial ring RŒp; u RŒp0;:::;pn; u0;:::;un.Equivalently, D the ML bidegree BX .p; u/ is the class defined by X in the cohomology ring L n n n 1 n 1 H ".P P Z/ ZŒp; u= p C ; u C : % I D h i We already saw some examples, for the Grassmannian G.2; 4/ in (12), for arbitrary linear spaces in (14), and for a toric model of ML degree 1 in (23). We note that the bidegree BX .p; u/ can be computed conveniently using the command multidegree in Macaulay2. To understand the geometric meaning of the ML bidegree, we introduce a second n polynomial. Let Ln i be a sufficiently general linear subspace of P of codimension i,anddefine $

si MLdegree .X Ln i /: D \ $ We define the sectional ML degree of X to be the polynomial

d d 1 d n d SX .p; u/ .s0 p s1 p $ u sd u / p $ ; D # C # C###C # # Example 3.14. The sectional ML degree of the Grassmannian G.2; 4/ in (10) equals

5 4 3 2 2 3 4 SX .p; u/ 4p 20p u 24p u 12p u 2pu : D C C C C 5 Thus, if H1;H2;H3 denote generic hyperplanes in P ,thenthethreefoldG.2; 4/ \ H1 has ML degree 20,thesurfaceG.2; 4/ H1 H2 has ML degree 24,andthe \ \ 4 curve G.2; 4/ H1 H2 H3 has ML degree 12.Lastly,thecoefficient2 of pu \ \ \ is simply the degree of G.2; 4/ in P5. } Likelihood Geometry 101

Conjecture 3.15. The ML bidegree and the sectional ML degree of any projective variety X Pn,notlyingin , are related by the following involution on binary forms of degree" n: H

u SX .p; u p/ p SX .p; 0/ BX .p; u/ # ! ! # ; D u p ! u BX .p; u p/ p BX .p; 0/ SX .p; u/ # C C # : D u p C This conjecture is a theorem when X is smooth and its boundary is schön. See Theorem 4.6 below. In that case,n theH ML bidegree is identified, by Huh [Huh1,Theorem2],withtheChern–Schwartz–MacPherson(CSM)classofthe constructible function on Pn that is 1 on X and 0 elsewhere. Aluffi proved in nH [Alu,Theorem1.1]thattheCSMclassofanlocallyclosedsubsetofPn satisfies such a log-adjunction formula.OurformulainConjecture3.15 is precisely the homogenization of Aluffi’s involution. The combination of [Alu,Theorem1.1] and [Huh1,Theorem2]provesConjecture3.15 in cases such as generic complete intersections (Theorem 1.10)andarbitrarylinearspaces(Theorem1.20). In the latter case, it can also be verified using matroid theory. Conjecture 3.15 says that this holds for any X,indicatingadeeperconnectionbetweenlikelihood correspondences and CSM classes. We note that BX .p; u/ and SX .p; u/ always share the same leading term and the same trailing term, and this is compatible with our formulas. Both polynomials start and end like

MLdegree .X/ pn degree .X/ pcodim.X/udim.X/: # C###C # We now illustrate Conjecture 3.15 by verifying it computationally for a few more examples.

Example 3.16. Let us examine some cubic fourfolds in P5.IfX is a generic hypersurface of degree 4 in P5 then its sectional ML degree and ML bidegree satisfy the conjectured formula:

5 4 3 2 2 3 4 SX .p; u/ 1364p 448p u 136p u 32p u 3pu ; D 5 C 4 C 3 2 C 2 3 C 4 BX .p; u/ 1364p 341p u 81p u 23p u 3pu : D C C C C Of course, in algebraic statistics, we are more interested in special hypersurfaces that are statistically meaningful. One such instance was seen in Example 2.7. The mixture model for two identically distributed ternary random variables is the fourfold X P5 defined by "

2p11 p12 p13 det p 2p p 0: (34) 0 12 22 23 1 D p13 p23 2p33 @ A 102 J. Huh and B. Sturmfels

The sectional ML degree and the ML bidegree of this determinantal fourfold are

5 4 3 2 2 3 4 SX .p; u/ 6p 42p u 48p u 21p u 3pu D 5 C 4 C 3 2 C 2 3 C 4 BX .p; u/ 6p 12p u 15p u 12p u 3pu : D C C C C

For the toric fourfold X V.p11p22p33 p12p13p23/,MLbidegreeandsectional ML degree are D !

5 4 3 2 2 3 4 BX .p; u/ 3p 3p u 3p u 3p u 3pu ; D 5 C 4 C 3 2 C 2 3C 4 SX .p; u/ 3p 12p u 18p u 12p u 3pu : D C C C C

Now, taking X V.p11p22p33 p12p13p23/ instead, the leading coefficient 3 changes to 2. D C } Remark 3.17. Conjecture 3.15 is true when Xc is a toric variety with c generic, as in Theorem 3.2.HerewecanuseProposition3.5 to infer that all coefficients of BX are equal to the normalized volume of the lattice polytope conv.A/.Insymbols,for generic c,wehave

d n i i BX .p; u/ degree .Xc/ p $ u : c D # i 0 XD It is now an exercise to transform this into a formula for the sectional ML degree

SXc .p; u/. In general, it is hard to compute generators for the ideal of the likelihood correspondence. Example 3.18. The following submodel of (34)wasfeaturedprominentlyin[HKS, §1]:

12p0 3p1 2p2 det 3p 2p 3p 0: (35) 0 1 2 3 1 D 2p2 3p3 12p4 @ A This cubic threefold X is the secant variety of a rational normal curve in P4,andit represents the mixture model for a binomial random variable (tossing a biased coin four times). It takes several hours in Macaulay2 to compute the prime ideal of 4 4 the likelihood correspondence X P P .Thatidealhas20 minimal generators one in degree .1; 1/,oneindegreeL ".3; 0/%,fiveindegree.3; 1/,tenindegree.4; 1/ and three in degree .3; 2/. After passing to a Gröbner basis, we use the formula in [ch3:MS,Definition8.45]tocomputethebidegreeof X : L 4 3 2 2 3 BX .p; u/ 12p 15p u 12p u 3pu : D C C C Likelihood Geometry 103

We now intersect X with random hyperplanes in P4,andwecomputetheML degrees of the intersections. Repeating this experiment many times reveals the sectional ML degree of X:

4 3 2 2 3 SX .p; u/ 12p 30p u 18p u 3pu : D C C C The two polynomials satisfy our transformation rule, thus confirming Conjec- ture 3.15.WenotethatConjecture1.8 also holds for this example: using Aluffi’s method [AluJSC], we find #.X / 13. nH D! } Our last topic is the operation of restriction and deletion. This is a standard tool for complements of hyperplane arrangements, as in Theorem 1.20.Itwas developed in [Huh1]forarbitraryveryaffinevarieties,suchasX .Wemotivate this by explaining the distinction between structural zeros and samplingnH zeros for contingency tables in statistics [BFH,§5.1.1]. Returning to the “hair loss due to TV soccer” example from the beginning of Sect. 2, let us consider the following questions. What is the difference between the data set

lots of hair medium hair little hair 2 h 15 0 9 U 2Ä–6 h 20 24 12 D 6 h 0 10 12 6 1 & @ A and the data set

lots of hair medium hair little hair 2 h 10 0 5 U 2Ä–6 h 936‹ Q D 6 h 0 7981 & @ A How should we think about the zero entries in row 1 and column 2 of these two contingency tables? Would the rank 1 model 1 or the rank 2 model 2 be more appropriate? M M The first matrix U has rank 2 and it can be completed to a rank 1 matrix by replacing the zero entry with 18.Thus,themodel 1 fits perfectly except for the structural zero in row 1 and column 2. It seems thatM this zero is inherent in the structure of the problem: planet Earth simply has no people with medium hair length who rarely watch soccer on TV. The second matrix U also has rank two, but it cannot be completed to rank 1. Q The model is a perfect fit. The zero entry in U appeared to be an artifact of the 2 Q particular groupM that was interviewed in this study. This is a sampling zero.Itarose because, by chance, in this cohort nobody happened to have medium hair length and watch soccer on TV rarely. We refer to the book of Bishop et al. [BFH,Chap.5]for an introduction. 104 J. Huh and B. Sturmfels

We now consider an arbitrary projective variety X Pn,servingasourstatistical model. Suppose that structural zeros or sampling zeros occur in the last coordinate un.Following[Rapallo,Theorem4],wemodelstructuralzerosbytheprojection n 1 )n.X/.ThismodelisthevarietyinP $ that is the closure of the image of X under the rational map

n n 1 )n P Ü P $ ;.p0 p1 pn 1 pn/ .p0 p1 pn 1/: W W W###W $ W 7!! W W###W $ Which projective variety is a good representation for sampling zeros? We propose that sampling zeros be modeled by the intersection X pn 0 .Thisisnowtobe n 1 \f D g regarded as a subvariety in P $ .Inthismanner,bothstructuralzerosandsampling n 1 n 1 zeros are modeled by closed subvarieties of P $ .InsidethatambientP $ ,our standard arrangement consists of n 1 hyperplanes. Usually, none of these H C hyperplanes contains X pn 0 or )n.X/. It would be desirable\f to expressD g the (sectional) ML degree of X in terms of those of the intersection X pn 0 and the projection )n.X/.Asanalternativetothe \f D g n 1 n ML degree of the projection )n.X/ into P $ ,hereisaquantityinP that reflects the presence of structural zeros even more accurately. We denote by

MLdegree .X un 0/ j D the number of critical points p .p0 p1 pn 1 pn/ of `u in Xreg O D O WO W###WO$ WO nH for those data vectors u .u0; u1;:::;un 1;0/ whose first n coordinates ui are positive and generic. D $ Conjecture 3.19. The maximum likelihood degree satisfies the inductive formula

MLdegree .X/ MLdegree .X pn 0 / MLdegree .X un 0/; (36) D \f D g C j D provided X and X pn 0 are reduced, irreducible, and not contained in their respective . \f D g H We expect that an analogous formula will hold for the sectional ML degree SX .p; u/.Theintuitionbehindequation(36)isasfollows.Asthedatavectoru n moves from a general point in Pu to a general point on the hyperplane un 0 ,the 1 f D g corresponding fiber pr2$ .u/ of the likelihood fibration splits into two clusters. One cluster has size MLdegree .X un 0/ and stays away from .Theotherclustermoves j D n H onto the hyperplane pn 0 in Pp,whereitapproachesthevariouscriticalpoints f D g of `u in that intersection. This degeneration is the perfect scenario for a numerical homotopy, e.g. in Bertini,asdiscussedinSect.2.Thesehomotopiesarecurrently being studied for determinantal varieties by Elizabeth Gross and Jose Rodriguez [GR]. The formula (36)hasbeenverifiedcomputationallyformanyexamples.Also, Conjecture 3.19 is known to be true in the slightly different setting of [Huh1], under acertainsmoothnessassumption.Thisisthecontentof[Huh1, Corollary 3.2]. Likelihood Geometry 105

Example 3.20. Fix the space P8 of 3 3-matrices as in Sect. 2.Fortherank2 variety % X 2,theformula(36)reads 10 5 5.Fortherank1 variety X 1,itreads 1 D0V 1. D C D V D C } Example 3.21. If X is a generic .d; e/-curve in P3,then

2 2 MLdegree .X/ d e de de and X p3 0 .d e distinct points/: D C C \f D gD # Computations suggest that

2 2 2 2 MLdegree .X u3 0/ d e de and MLdegree .)3.X// d e de : j D D C D C To derive the second equality geometrically, one may argue as follows. Both curves 3 2 1 2 2 X P and )3.X/ P have degree de and genus 2 .d e de / 2de 1. " " 1 C ! C Subtracting this from the expected genus 2 .de 1/.de 2/ of a plane curve of 1 ! ! degree de,wefindthat)3.X/ has 2 d.d 1/e.e 1/ nodes. Example 1.4 suggests that each node decreases the ML degree of! a plane! curve by 2.Assumingthistobet the case, we conclude

2 2 MLdegree .)3.X// de.de 1/ d.d 1/e.e 1/ d e de : D C ! ! ! D C Here we are using that a general plane curve of degree de has ML degree de.de 1/. C } This example suggests that, in favorable circumstances, the following identity would hold:

MLdegree .X un 0/ MLdegree .)n.X//: (37) j D D However, this is certainly not true in general. Here is a particularly telling example:

Example 3.22. Suppose that X is a generic surface of degree d in P3.Then

MLdegree .X/ d d 2 d 3; D C C2 MLdegree .X p3 0 / d d ; \f D g D C3 MLdegree .X u3 0/ d ; j D D MLdegree .)3.X// 1: D n Indeed, for most hypersurfaces X P ,thesamewillhappen,since)n.X/ n 1 " D P $ . } As a next step, one might conjecture that (37)holdswhenthemapisbirational and the center .0 0 1/ of the projection does not lie on the variety X.But this also fails: W###W W

Example 3.23. Let X be the twisted cubic curve in P3 defined by the 2 2-minors of % 106 J. Huh and B. Sturmfels

p0 p1 p2 2p0 p2 9p3 p0 6p1 8p2 C ! ! C ! C : 2p0 p2 9p3 p0 6p1 8p2 7p0 p1 2p2 Â ! C ! C C C Ã

The ML degree of X is 13 3 10,andX intersects p3 0 in three distinct D C f D g points. The projection of the curve X into P2 is a cuspidal cubic, as in Example 1.4. We have

MLdegree .X u3 0/ 10 and MLdegree .)3.X// 9: j D D D It is also instructive to compare the number 13 #.X / with the number 11 one gets in Theorem 3.7 for the special twisted cubicD! curvenH with d 1, n 3 and D D b0 b1 b2 b3 3.Therearemanymysteriesstilltobeexploredinlikelihood D D D D geometry, even within P3. }

4CharacteristicClasses

We start by giving an alternative description of the likelihood correspondence which reveals its intimate connection with the theory of Chern classes on possibly noncompact varieties. An important role will be played by the Lie algebra and n 1 cotangent bundle of the algebraic torus .C"/ C .Thissectiontiesourdiscussion to the work of Aluffi [AluJSC, AluLectures, Alu]andHuh[Huh1, Huh0, Huh2]. In particular, we introduce and explain Chern–Schwartz–MacPherson (CSM) classes. And, most importantly, we present proofs for Theorems 1.6, 1.7, 1.15,and1.20. Let X Pn be a closed and irreducible subvariety of dimension d,notcontained in our distinguished arrangement of n 2 hyperplanes, C n n .p0 p1 pn/ P p0 p1 pn p 0 ;p pi : H D W W###W 2 j # ### # C D g C D i 0 ˚ XD

Let 'i denote the restriction of the rational function pi =p to X .Theclosed embedding C nH

n 1 ' X .C"/ C ;'.'0;:::;'n/; W nH !! D shows that the variety X is very affine.Letx be a smooth point of X .We define nH nH

n 1 n 1 *x TxX T'.x/.C"/ C g T1.C"/ C (38) W !! !! WD 1 to be the derivative of ' at x followed by that of left-translation by '.x/$ .Hereg n 1 is the Lie algebra of the algebraic torus .C"/ C .Inlocalcoordinates.x1;:::;xd / around the smooth point x,thelinearmap*x is represented by the logarithmic Jacobian matrix Likelihood Geometry 107

@ log ' i ;0 i n; 1 j d: @xj ! Ä Ä Ä Ä

The linear map *x in (38)isinjectivebecause' is injective. We write q0;:::;qn n 1 for the coordinate functions on the torus .C"/ C .ThesefunctionsdefineaC-linear basis of the dual Lie algebra g_ corresponding to differential forms

0 n 1 1 n 1 dlog.q0/; : : : ; dlog.qn/ H .C"/ C ;". /n 1 g_ C C : 2 C! C ' ' ( Á We fix this choice of basis of g_,andweidentifyP.g_/ with the space of data n vectors Pu :

n n 1 g_ ui dlog.qi / u .u0;:::;un/ C C : ' # j D 2 i 0 n XD o Consider the vector bundle homomorphism defined by the pullback of differential forms

n * "1 ;.x;u/ u dlog.' /.x/: (39) _ gX_reg Xreg i i W H !! H 7!! # n n i 0 XD

Here g is the trivial vector bundle over Xreg modeled on the vector space X_reg nH nH g_.Theinducedlinearmap*x_ between the fibers over a smooth point x is dual to the injective linear map *x TxX g.Therefore* is surjective and ker.* / is W !! _ _ avectorbundleoverXreg .Thisvectorbundlehaspositiverankn d 1,and hence its projectivizationn isH nonempty. ! C

n Proof of Theorem 1.6. Under the identification P.g_/ P ,theprojectivebundle ' u P.ker * _/ corresponds to the following constructible subset of dimension n:

n n n X .Xreg / P Pp P : L \ nH % u  % u ( Á n Therefore its Zariski closure X is irreducible of dimension n,andpr1 X Pp L W L !n is a projective bundle over Xreg .Thelikelihoodvibrationpr2 X Pu is generically finite-to-one becausen theH domain and the range are algebraicW L varieties! of the same dimension. ut Our next aim is to prove Theorem 1.15.Forthiswefixaresolutionof singularities 108 J. Huh and B. Sturmfels

1 ) .Xreg / / X $ nH Q

)   n Xreg / X / P ; nH where ) is an isomorphism over X ,thevarietyX is smooth and projective, reg Q 1 nH and the complement of ) .Xreg / is a simple normal crossing divisor in X with $ nH Q irreducible components D1;:::;Dk.Each'i lifts to a rational function on X which 1 n 1 Q is regular on )$ .X /.Ifu .u0;:::;un/ is an integer vector in Z C ,thenthese functions satisfy nH D

n

ordD .`u/ ui ordD .'i /: (40) j D # j i 0 XD n 1 n 1 If u C C Z C then ordDj .`u/ is the complex number defined by the Eq. (40) 2 n for j 1; : : : ; k.WewriteHi pi 0 and H p 0 for the n 2 hyperplanesD in . WD f D g C WD f C D g C H Lemma 4.1. Suppose that X Hi is smooth along H ,andletDj be a divisor \ C in the boundary of X such that ).D / .Thenthefollowingthreestatements Q j hold: Â H

positive if ).Dj / Hi , (1) If ).Dj / ª H then ordDj .'i / is  C ( zero if ).Dj / ª Hi . positive if ).Dj / ª Hi , (2) If ).Dj / H then ordDj .'i / is  C ! ( nonnegative if ).Dj / Hi .  (3) In each of the above two cases, ordDj .'i / is non-zero for at least one index i.

Proof. Write Hi0 and H 0 for the pullbacks of Hi and H to X respectively. Note C 1C that ordDj .)".Hi0// is positive if Dj is contained in )$ .Hi0/ and otherwise zero. Since

ordDj .'i / ordDj .)".Hi0// ordDj .)".H 0 //; D ! C this proves the first and second assertion, except for the case when ).Dj / Hi  \ H .Inthiscase,ourassumptionthatHi0 is smooth along H 0 shows that ).Dj / C C  Xreg and the order of vanishing of Hi0 along ).Dj / is 1.Therefore

ordDj .'i / ordDj .)".H 0 // 1 0: ! D C ! & The third assertion of Lemma 4.1 is derived by the following set-theoretic reasoning: Likelihood Geometry 109

•If).Dj / ª H ,then).Dj / Hi for some i because ).Dj / is irreducible. C Â Â H n •If).Dj / H ,then).Dj / ª Hi for some i because i 0 Hi . Â C D D; T ut From Lemma 4.1 and Eq. (40)wededucethefollowingresult.InLemmas4.2 and 4.3 we retain the hypothesis from Lemma 4.1 which coincides with that in Theorem 1.15. n 1 Lemma 4.2. If ).Dj / and u R>0C is strictly positive, then ordDj .`u/ is nonzero. Â H 2 Consider the sheaf of logarithmic differential one-forms "1 .log D/,whereD is X 1 Q the sum of the irreducible components of )$ . /.Ifu is an integer vector, then the corresponding likelihood function ` on X definesH a global section of this sheaf: u Q n 0 1 dlog.`u/ ui dlog.'i / H X;" .log D/ : (41) D # 2 Q X i 0 Q XD $ % n 1 n 1 If u C C Z C then we define the global section dlog.`u/ by the above expression2 (41).n n 1 Lemma 4.3. If u R>0C is strictly positive, then dlog.`u/ does not vanish on ) 1. /. 2 $ H 1 Proof. Let x ) . / and D1;:::;Dl the irreducible components of D contain- 2 $ H ing x,withlocalequationsg1;:::;gl on a small neighborhood G of x. Clearly, l 1.Bypassingtoasmallerneighborhoodifnecessary,wemayassumethat "1&.log D/ trivializes over G,and X Q

l

dlog.`u/ ordD .`u/ dlog.gj / ; D j # C j 1 XD where is a regular 1-form. Since the dlog.gj / form part of a free basis of a trivialization of "1 .log D/ over G,Lemma4.2 implies that dlog.` / is nonzero X u 1 Qn 1 on )$ . / if u R C . H 2 >0 ut Proof of Theorem 1.7. In the notation above, the logarithmic Poincaré–Hopf theo- rem states

c "1 .log D/ . 1/d # X ) 1. / : d X $ X Q D ! # Q n H Z Q $ % $ % See [AluLectures,Sect.3.4]forexample.IfX is smooth, then Lemma 4.3 shows that, for generic u,thezero-schemeoftheEq.(nH41)isequaltothelikelihoodlocus 110 J. Huh and B. Sturmfels

x X dlog.`u/.x/ 0 : 2 nH j D Since the likelihood locus˚ is a zero-dimensional scheme« of length equal to the ML degree of X,thelogarithmicPoincaré–HopftheoremimpliesTheorem1.7. ut Proof of Theorem 1.15. Suppose that the likelihood locus x Xreg f 2 nH j dlog.` /.x/ 0 contains a curve. Let C and C denote the closures of that u Q curve in X andDX grespectively. Let ) . / be the pullback of the divisor X of Q " X.Ifu n 1 then Lemma 4.3 impliesH that ) . / C is rationally equivalentH \ to R>0C " Q zero in X2.ItthenfollowsfromtheProjectionFormulathatH # C is also rationally Q H # equivalent to zero in Pn.Butthisisimpossible.Thereforethelikelihoodlocusdoes not contain a curve. This proves the first part of Theorem 1.15. For the second part, we first show that pr 1.u/ is contained in X for a strictly 2$ nH positive vector u.Thismeansthereisnopair.x; u/ X with x which is a limit of the form 2 L 2 H

.x; u/ lim .xt ; ut /; xt Xreg ; dlog.`ut /.xt / 0: D t 0 2 nH D ! If there is such a sequence .x ; u /,thenwecantakeitslimitoverX to find a point t t Q x X such that dlog.`u/.x/ 0,butthiswouldcontradictLemma4.3. Q 2 Q Q D 1 Now suppose that the fiber pr2$ .u/ is contained in Xreg,andhenceinXreg . 1 nH By Theorem 1.6,thisfiberpr2$ .u/ is contained the smooth variety . X /reg. 1 L Furthermore, by the first part of Theorem 1.15,pr2$ .u/ is a zero-dimensional subscheme of . X /reg.Theassertiononthelengthofthefibernowfollowsfroma standard result onL intersection theory on Cohen–Macaulay varieties. More precisely, we have

1 MLdegree.X/ .U1 ::: Un/ X .U1 ::: Un/. X /reg deg.pr2$ .u//; D # # L D # # L D n where the Ui are pullbacks of sufficiently general hyperplanes in Pu containing u, and the two terms in the middle are the intersection numbers defined in [Fulton, Definition 2.4.2]. The fact that . X /reg is Cohen–Macaulay is used in the last equality [Fulton,Example2.4.8]. L ut Remark 4.4. If X is a curve, then the zero-scheme of the Eq. (41)iszero- dimensional for generic u,evenifX is singular. Furthermore, the length of this zero-scheme is at least as large as MLnH degree of X.Therefore

1 #.X / # X )$ . / MLdegree .X/: ! nH &! Q n H & This proves that Conjecture 1.8 holds$ for d %1. D Next we give a brief description of the Chern–Schwartz–MacPherson (CSM) class. For a gentle introduction we refer to [AluLectures]. The group C.X/ of constructible functions on a complex X is a subgroup of the group Likelihood Geometry 111 of integer valued functions on X.Itisgeneratedbythecharacteristicfunctions 1Z of all closed subvarieties Z of X.Iff X Y is a morphism between complex algebraic varieties, then the pushforwardW of! constructible functions is the homomorphism

1 f C.X/ C.Y/; 1Z y # f $ .y/ Z ;yY : " W !! 7!! 7!! \ 2 ( $ % Á If X is a compact complex manifold, then the characteristic class of X is the Chern class of the tangent bundle c.TX/ ŒX H .X Z/.Ageneralizationtopossibly singular or noncompact varieties is\ provided2 by" theI Chern–Schwartz–MacPherson class, whose existence was once a conjecture of Deligne and Grothendieck. In the next definition, we write C for the functor of constructible functions from the category of complete complex algebraic varieties to the category of abelian groups. Definition 4.5. The CSM class is the unique natural transformation

cSM C H W !! " such that cSM.1X / c.TX/ ŒX H .X Z/ when X is smooth and complete. D \ 2 " I The uniqueness follows from the naturality, the resolution of singularities over C, and the requirement for smooth and complete varieties. We highlight two properties of the CSM class which follow directly from Definition 4.5: 1. The CSM class satisfies the inclusion–exclusion relation

cSM.1U U / cSM.1U / cSM.1U / cSM.1U U / H .X Z/: (42) [ 0 D C 0 ! \ 0 2 " I 2. The CSM class captures the topological Euler characteristic as its degree:

#.U / cSM.1U / Z: (43) D 2 ZX

Here U and U 0 are arbitrary constructible subsets of a complete variety X. What kind of information on a constructible subset is encoded in its CSM class? In likelihood geometry, U is a constructible subset in the complex projective space n n n 1 P ,andweidentifycSM.1U / with its image in H .P ; Z/ ZŒp= p C .Thus " D h i cSM.1U / is a polynomial of degree n is one variable p.Tobeconsistentwiththe Ä earlier sections, we introduce a homogenizing variable u,andwewritecSM.1U / as abinaryformofdegreen in .p; u/. The CSM class of U carries the same information as the sectional Euler characteristic

n n i i #sec.1U / #.U Ln i / p $ u : D \ $ # i 0 XD 112 J. Huh and B. Sturmfels

n Here Ln i is a generic linear subspace of codimension i in P .Indeed,itwasproved $ by Aluffi in [Alu,Theorem1.1]thatcSM.1U / is the transform of #sec.1U / under a linear involution on binary forms of degree n in .p; u/.Infact,ourinvolutionin Conjecture 3.15 is nothing but the signed version of the Aluffi’s involution. This is explained by the following result.

Theorem 4.6. Let X Pn be closed subvariety of dimension d that is not contained in .Iftheveryaffinevariety" X is schön then, up to signs, the ML bidegree equalsH the CSM class and the sectionalnH ML degree equals the sectional Euler characteristic. In symbols,

n d n d cSM.1X / . 1/ $ BX . p; u/ and #sec.1X / . 1/ $ SX . p; u/: nH D ! # ! nH D ! # ! Proof. The first identity is a special case of [Huh1,Theorem2],hereadaptedtoPn minus n 2 hyperplanes, and the second identity follows from the first by way of [Alu,Theorem1.1].C ut To make sense of the statement in Theorem 4.6,weneedtorecallthedefinition of schön.ThistermwascoinedbyTevelevinhisstudyoftropicalcompactifications n 1 [Tevelev]. Let U be an arbitrary closed subvariety of the algebraic torus .C"/ C . In our application, U X .WeconsidertheclosuresU of U in various D nH n 1 (not necessarily complete) normal toric varieties Y with dense torus .C"/ C . The closure U is complete if and only if the support of the fan of Y contains the tropicalization of U [Tevelev,Proposition2.3].WesaythatU is a tropical compactification of U if it is complete and the multiplication map

n 1 m .C"/ C U Y; .t; x/ t x W % !! 7!! # is flat and surjective. Tropical compactifications exist, and they are obtained from toric varieties Y defined by sufficiently fine fan structures on the tropicalization of U [Tevelev,§2].TheveryaffinevarietyU is called schön if the multiplication is smooth for some tropical compactification of U .Equivalently,U is schön if the multiplication is smooth for every tropical compactification of U ,byTevelev [Tevelev,Theorem1.4]. Two classes of schön very affine varieties are of particular interest. The first is the class of complements of essential hyperplane arrangements. The second is the class of nondegenerate hypersurfaces. What we need from the schön hypothesis is the existence of a simple normal crossings compactification which admits suffi- ciently many differential one-forms which have logarithmic singularities along the boundary. For complements of hyperplane arrangements, such a compactification is provided by the wonderful compactification of De Concini and Procesi [DP]. For nondegenerate hypersurfaces, and more generally for nondegenerate complete intersections, the needed compactification has been constructed by Khovanskii [Hovanskii]. Likelihood Geometry 113

We illustrate this in the setting of likelihood geometry by a d-dimensional linear subspace of X Pn.TheintersectionofX with distinguished hyperplanes of Pn " H is an arrangement of n 2 hyperplanes in X Pd ,definingamatroidM of rank d 1 on n 2 elements.C ' C C Proposition 4.7. If X is a linear space of dimension d then the CSM class of X nH in Pn is

d i d i n d i cSM.1X / . 1/ hi u $ p $ C : nH D ! i 0 XD where the hi are the signed coefficients of the shifted characteristic polynomial in (15).

Proof. This holds because the recursive formula for a triple of arrangement complements

cSM.1U / cSM.1U 1U / cSM.1U / cSM.1U /; 1 D ! 0 D ! 0 agrees with the usual deletion-restriction formula [OTBook,Theorem2.56]:

#M .q 1/ #M .q 1/ #M .q 1/: 1 C D C ! 0 C Here our notation is as in [Huh1,§3].Wenowuseinductiononthenumberof hyperplanes. ut Proof of Theorem 1.20. The very affine variety X is schön when X is linear. Hence the asserted formula for the ML bidegree ofnHX follows from Theorem 4.6 and Proposition 4.7. ut Rank constraints on matrices are important both in statistics and in algebraic geometry, and they provide a rich source of test cases for the theory developed here. We close our discussion with the enumerative invariants of three hypersurfaces defined by 3 3-determinants. It would be very interesting to compute these formulas for larger% determinantal varieties. Example 4.8. We record the ML bidegree,theCSM class,thesectional ML degree, and the sectional Euler characteristic for three singular hypersurfaces seen earlier in this paper. These examples were studied already in [HKS]. The classes we present n n n are elements of H ".Pp Pu / and of H ".Pp Z/ respectively, and they are written as binary forms in .p; u/%as before. I

•The3 3 determinantal hypersurface in P8 (Example 2.1)has % 8 7 6 2 5 3 4 4 BX .p; u/ 10p 24p u 33p u 38p u 39p u D C C C C 33p3u5 12p2u6 3pu7; C C C 114 J. Huh and B. Sturmfels

8 7 6 2 5 3 4 4 cSM.1X / 11p 26p u 37p u 44p u 45p u nH D! C ! C ! 33p3u5 12p2u6 3pu7; C ! C 8 7 6 2 5 3 4 4 SX .p; u/ 11p 182p u 436p u 518p u 351p u D C C C C 138p3u5 30p2u6 3pu7; C C C 8 7 6 2 5 3 4 4 #sec.1X / 11p 200p u 470p u 542p u 357p u nH D! C ! C ! 138p3u5 30p2u6 3pu7: C ! C •The3 3 symmetric determinantal hypersurface in P5 (Example 2.7)has % 5 4 3 2 2 3 4 BX .p; u/ 6p 12p u 15p u 12p u 3pu ; D C C C C 5 4 3 2 2 3 4 cSM.1X / 7p 14p u 19p u 12p u 3pu ; nH D ! C ! C 5 4 3 2 2 3 4 SX .p; u/ 6p 42p u 48p u 21p u 3pu ; D C C C C 5 4 3 2 2 3 4 #sec.1X / 7p 48p u 52p u 21p u 3pu : nH D ! C ! C •ThesecantvarietyoftherationalnormalcurveinP4 (Example 3.18)has

4 3 2 2 3 BX .p; u/ 12p 15p u 12p u 3pu ; D C C C 4 3 2 2 3 cSM.1X / 13p 19p u 12p u 3pu ; nH D! C ! C 4 3 2 2 3 SX .p; u/ 12p 30p u 18p u 3pu ; D C C C 4 3 2 2 3 #sec.1X / 13p 34p u 18p u 3pu : nH D! C ! C

In all known examples, the coefficients of BX .p; u/ are less than or equal to the absolute value of the corresponding coefficients of cSM.1X /,andsimilarlyfor nH SX .p; u/ and #sec.1X /.Thatthisinequalityholdsforthefirstcoefficientis Conjecture 1.8 whichn relatesH the ML degree of a singular X to the signed Euler characteristic of the very affine variety X . nH } Acknowledgements We thank Paolo Aluffi and Sam Payne for helpful communications, and the Mathematics Department at KAIST, Daejeon, for hosting both authors in May 2013. Bernd Sturmfels was supported by NSF (DMS-0968882) and DARPA (HR0011-12-1-0011).

References

[ARSZ] E. Allmann, J. Rhodes, B. Sturmfels, P. Zwiernik, Tensors of non- negative rank two, in Linear Algebra and its Applications.Special Issue on Statistics. http://www.sciencedirect.com/science/article/ pii/S0024379513006812 Likelihood Geometry 115

[AluJSC] P. Aluffi, Computing characteristic classes of projective schemes. J. Symb. Comput. 35,3–19(2003) [AluLectures] P.Aluffi,Characteristicclassesofsingularvarieties,in Topics in Cohomological Studies of Algebraic Varieties. Trends in Mathe- matics (Birkhäuser, Basel, 2005), pp. 1–32 [Alu] P. Aluffi, Euler characteristics of general linear sections and polynomial Chern classes. Rend. Circ. Mat. Palermo. 62,3–26 (2013) [AN] S. Amari, H. Nagaoka, in Methods of Information Geometry. Translations of Mathematical Monographs, vol. 191 (American Mathematical Society, Providence, 2000) [Bertini] D.J. Bates, J.D. Hauenstein, A.J. Sommese, C.W. Wampler, Bertini: Software for Numerical Algebraic Geometry (2006). www.nd.edu/~sommese/bertini [BHSW] D.J. Bates, J.D. Hauenstein, A.J. Sommese, C.W. Wampler, Numerically Solving Polynomial Systems with Bertini (Society for Industrial and Applied Mathematics, Philelphia, 2013). http:// www.ec-securehost.com/SIAM/SE25.html [BFH] Y. Bishop, S. Fienberg, P. Holland, Discrete Multivariate Analy- sis: Theory and Practice (Springer, New York, 1975) [BoydVan] S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, Cambridge, 2004) [CHKS] F. Catanese, S. Ho¸sten, A. Khetan, B. Sturmfels, The maximum likelihood degree. Am. J. Math. 128,671–697(2006) [CDFV] D. Cohen, G. Denham, M. Falk, A. Varchenko, Critical points and resonance of hyperplane arrangements. Can. J. Math. 63,1038– 1057 (2011) [DP] C. De Concini, C. Procesi, Wonderful models of subspace arrangements. Selecta Math. New Ser. 1,459–494(1995) [Denham-Garrousian-Schulze] G. Denham, M. Garrousian, M. Schulze, A geometric deletion- restriction formula. Adv. Math. 230,1979–1994(2012) [DR] J. Draisma, J. Rodriguez, Maximum likelihood duality for determinantal varieties. Int. Math. Res. Not. [arXiv:1211.3196]. http://imrn.oxfordjournals.org/content/ early/2013/07/03/imrn.rnt128.full.pdf [LiAS] M. Drton, B. Sturmfels, S. Sullivant, in Lectures on Algebraic Statistics. Oberwolfach Seminars, vol. 39 (Birkhäuser, Basel, 2009) [Franecki-Kapranov] J. Franecki, M. Kapranov, The Gauss map and a noncompact Riemann-Roch formula for constructible sheaves on semiabelian varieties. Duke Math. J. 104,171–180(2000) [Fulton] W. Fulton, in Intersection Theory,2ndedn.Ergebnisseder Mathematik und ihrer Grenzgebiete. A Series of Modern Surveys in Mathematics, vol. 2 (Springer, Berlin, 1998) [Gabber-Loeser] O. Gabber, F. Loeser, Faisceaux pervers l-adiques sur un tore. Duke Math. J. 83,501–606(1996) [GKZ] I.M. Gel’fand, M. Kapranov, A. Zelevinsky, Discriminants, Resultants, and Multidimensional Determinants (Birkhäuser, Boston, 1994) [GR] E. Gross, J. Rodriguez, Maximum likelihood geometry in the presence of data zeros, http://front.math.ucdavis.edu/1310.4197 [HRS] J. Hauenstein, J. Rodriguez, B. Sturmfels, Maximum likelihood for matrices with rank constraints. J. Algebr. Stat. (to appear) [arXiv:1210.0198] 116 J. Huh and B. Sturmfels

[HKS] S. Ho¸sten, A. Khetan, B. Sturmfels, Solving the likelihood equations. Found. Comput. Math. 5,389–407(2005) [Hovanskii] A. Hovanski˘ı, Newton polyhedra and toroidal varieties. Akademija Nauk SSSR. Funkcional’nyi Analiz i ego Priloženija 11,56–64(1977) [Huh1] J. Huh, The maximum likelihood degree of a very affine variety. Compositio Math. 149,1245–1266(2013) [Huh0] J. Huh, h-vectors of matroids and logarithmic concavity.Preprint. http://arxiv.org/abs/1201.2915.[arXiv:1201.2915] [Huh2] J. Huh, Varieties with maximum likelihood degree one. J. Algeb. Stat. (to appear). [arXiv:1301.2732] [Kapranov] M. Kapranov, A characterization of A-discriminantal hypersur- faces in terms of the logarithmic Gauss map. Math. Ann. 290, 277–285 (1991) [KRS] K. Kubjas, E. Robeva, B. Sturmfels, Fixed points of the EM algorithm and nonnegative rank boundaries, http://arxiv.org/abs/ 1312.5634 (in preparation) [Land] J.M. Landsberg, Tensors: Geometry and Applications.Graduate Studies in Mathematics, vol. 128 (American Mathematical Soci- ety, Providence, 2012) [LM] J.M. Landsberg, L. Manivel, On ideals of secant varieties of Segre varieties. Found. Comput. Math. 4,397–422(2004) [Lau] S. Lauritzen, Graphical Models (Oxford University Press, Oxford, 1996) [ch3:MS] E. Miller, B. Sturmfels, in Combinatorial Commutative Algebra. Graduate Texts in Mathematics, vol. 227 (Springer, New York, 2004) [MSS] D. Mond, J. Smith, D. van Straten, Stochastic factorizations, sandwiched simplices and the topology of the space of expla- nations. R. Soc. Lond. Proc. Ser. A Math. Phys. Eng. Sci. 459, 2821–2845 (2003) [OTBook] P. Orlik, H. Terao, in Arrangements of Hyperplanes.Grundlehren der Mathematischen Wissenschaften, vol. 300 (Springer, Berlin, 1992) [Orlik-Terao] P. Orlik, H. Terao, The number of critical points of a product of powers of linear functions. Inventiones Math. 120,1–14(1995) [PS] L. Pachter, B. Sturmfels, Algebraic Statistics for Computational Biology (Cambridge University Press, Cambridge, 2005) [Rai] C. Raicu, Secant varieties of Segre–Veronese varieties. Algebra Number Theory 6,1817–1868(2012) [Rapallo] F. Rapallo, Markov bases and structural zeros. J. Symb. Comput. 41,164–172(2006) [SSV] R. Sanyal, B. Sturmfels, C. Vinzant, The entropic discriminant. Adv. Math. 244,678–707(2013) [Terao] H. Terao, Generalized exponents of a free arrangement of hyper- planes and the Shepherd-Todd-Brieskorn formula. Invent. Math. 63,159–179(1981) [Tevelev] J. Tevelev, Compactifications of subvarieties of tori. Am. J. Math. 129,1087–1104(2007) [Uhl] C. Uhler, Geometry of maximum likelihood estimation in Gaus- sian graphical models. Ann. Stat. 40,238–261(2012) Likelihood Geometry 117

[Varchenko] A. Varchenko, Critical points of the product of powers of linear functions and families of bases of singular vectors. Compositio Math. 97,385–401(1995) [Wat] S. Watanabe, in Algebraic Geometry and Statistical Learning Theory.MonographsonAppliedandComputationalMathemat- ics, vol. 25 (Cambridge University Press, Cambridge, 2009)