ESTIMATION IN EXPONENTIAL FAMILIES WITH UNKNOWN NORMALIZING CONSTANT

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF STATISTICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Sumit Mukherjee May 2014

© 2014 by Sumit Mukherjee. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution- Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/cf854qf4476

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Persi Diaconis, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Sourav Chatterjee

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Amir Dembo

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Exponential families of probability measures have been a subject of considerable interest in statistics, both in theoretical and applied areas. One of the problems that frequently arises in such models is that the normalizing constant is not known in closed form; moreover, numerical computation of the normalizing constant is infeasible because the size of the underlying space is huge. As such, carrying out inferential procedures becomes challenging. The main objects of study in this thesis are specific examples of exponential families on graphs and permutations, where the normalizing constant is hard to compute. Using large deviation results, asymptotic estimates of the normalizing constant are obtained, and methods to estimate the normalizing constant are developed. In the case of graphs, this analysis gives some insight into the phenomenon of “degeneracy” observed in empirical studies in the social science literature. In the case of permutations, the analysis is used to show the consistency of the pseudo-likelihood estimator.

Acknowledgments

One of the best things that happened to me at Stanford was the chance to interact with Persi. Working with Persi was a thoroughly enjoyable experience, more so because the process of learning never stopped being fun. Persi’s approach to any problem fascinates me. He has been a constant source of knowledge, support and encouragement throughout my Ph.D.

I had the opportunity to work with Amir during my five years at Stanford, and I learnt a lot from him. His door was always open to students, and it is possible I may have overused the privilege. He was always willing to answer any question that I might have, no matter how easy it might be.

I had the chance to interact a lot with Sourav regarding my research, more so when he joined Stanford. He always had useful comments and suggestions regarding my research, and for that I’m grateful to him.

I thank Susan and David for helpful comments and suggestions during my Ph.D. proposal. I would also like to thank Guenther for writing me a teaching recommendation.

During my time at Stanford I have had the privilege to take courses under Amir Dembo, Persi Diaconis, Iain Johnstone, Andrea Montanari, Art Owen, Joe Romano, David Siegmund and Jonathan Taylor.

During my Ph.D. I collaborated with A. Basak and B. Bhattacharya, and my thesis benefited a lot from the resulting interactions.

I would like to thank Caroline, Cindy, Ellen, Heather, Helen, Nora, and Regina for all their help, and for making it very easy to concentrate on the academic side of things.

As far as the environment of the department goes, this is a great department to be in. My batch-mates Anirban, Alex, Max, Matan, Michael and Yong have been extremely helpful and were there for me whenever I needed them. My fellow graduate students have been wonderful to be around, especially Amir, Austen, Bhaswar, Dennis, Gourab, Josh, Kinjal, Murat, Nike, Qingyuan, Rahul, Shilin, Stefan, Will and Zhen. I would like to thank the entire Stanford Statistics community for making the last five years special.

The single most important reason for my interest in the areas of statistics and probability is the Indian Statistical Institute, where I earned both my undergraduate and masters degrees in Statistics. I’m grateful to my professors at ISI for introducing me to this field. I also benefited a lot from my discussions with my friends at ISI, and I’m thankful to them for all their help and co-operation over the years.

None of this would have been possible without my parents, and there is not enough space here to mention all the things they have done for me. And last, but by no means least, I would like to thank my wife Kaustari for always staying by my side, and being the wonderful person that she is.

Contents

Abstract

Acknowledgments

1 Introduction

2 Tools required
2.1 Graph Limits
2.2 Permutation Limits
2.3 Understanding permutation limits
2.4 Large deviations theory

3 Previous work
3.1 Methods
3.2 Theoretical work

4 Exponential family on sparse graphs
4.1 Introduction
4.2 Statement of main results
4.3 The large deviation principle
4.4 Proofs of main results
4.5 A particular example
4.6 Fitting the model
4.7 Proof of the large deviation principle

5 Exponential family on permutations
5.1 Introduction
5.2 Statement of main results
5.3 The large deviation principle
5.4 Proofs of main results
5.5 Approximations for small θ
5.6 A particular example
5.7 Analysis of the 1970 draft lottery data
5.8 Proof of the large deviation principle

6 Conclusion

References

Chapter 1

Introduction

Exponential families of probability measures have been a topic of considerable research in statistics, both in theoretical and applied areas. The concept of exponential families is credited to J. Pitman, G. Darmois, and B. Koopman in 1935-36. The large literature on exponential families is developed in the book length treatments of Barndorff-Nielsen ([2]) and Brown ([48]). More recent literature can be found in Jordan-Wainwright ([65]). This thesis will consider exponential families on finite spaces. The definition of exponential family considered in this thesis is given below.

Definition 1.1. Let $\mathcal{X}$ be a finite space, and let $T : \mathcal{X} \mapsto \mathbb{R}^d$ be a vector valued statistic on $\mathcal{X}$. Then for any $\theta \in \mathbb{R}^d$ one has
$$\sum_{x \in \mathcal{X}} e^{\theta' T(x)} < \infty,$$
and so the function $Z(\cdot) : \mathbb{R}^d \mapsto \mathbb{R}$ defined by
$$Z(\theta) := \log \Big[ \sum_{x \in \mathcal{X}} e^{\theta' T(x)} \Big]$$
is finite for all $\theta$.

For $x \in \mathcal{X}$, $\theta \in \mathbb{R}^d$ define $p_\theta(x) = e^{\theta' T(x) - Z(\theta)}$. Then for every $\theta$ the sum $\sum_{x \in \mathcal{X}} p_\theta(x)$ equals 1, and so $p_\theta(\cdot)$ is a probability measure on $\mathcal{X}$. This is defined to be an exponential family on $\mathcal{X}$, with natural parameter $\theta$ and sufficient statistic $T$.

The function $Z(\theta)$ plays a very important role in the analysis of exponential families. It is commonly referred to as the log normalizing constant in the statistics literature. One of the most important properties of $Z(\theta)$ is the following:
$$\mathbb{E}(T_i) = \frac{\partial Z(\theta)}{\partial \theta_i}, \qquad \mathrm{Cov}(T_i, T_j) = \frac{\partial^2 Z(\theta)}{\partial \theta_i \partial \theta_j},$$
where the mean and covariance of $T$ are computed under $p_\theta$.

Another interesting property pertaining to statistical inference is the following: suppose one observes $X \sim p_\theta$ for some $\theta$, and it is required to estimate $\theta$ by the Maximum Likelihood Estimate (MLE), defined by
$$\hat\theta := \arg\max_{\theta \in \mathbb{R}^d} e^{\theta' T(X) - Z(\theta)} = \arg\max_{\theta \in \mathbb{R}^d} \{\theta' T(X) - Z(\theta)\}.$$
It readily follows that in this setting the MLE $\hat\theta_{ML}$ solves the equation $T(X) = \nabla Z(\theta)$. Thus for such models the MLE is easy to compute, provided one has some control on $Z(\theta)$.

Thus the analysis of the model $p_\theta$ becomes a lot simpler when the quantity $Z(\theta)$ is available in closed form, and is readily amenable to algebraic calculations. However, there are many examples of exponential families where the function $Z(\theta)$ is not expressible in closed form. Moreover, if the underlying space $\mathcal{X}$ has a large cardinality, it is usually not feasible to compute $Z(\theta)$ numerically. Without knowledge of the normalizing constant, Bayesian methods can also be difficult to implement and analyze; for example, the usual definition of the posterior distribution for $\theta$ involves the unknown normalizing constant. As such, the analysis of such models becomes challenging.
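As an illustrative numerical sketch (not part of the original development; the toy space and all names are hypothetical), the following Python snippet builds a small exponential family on a finite space, checks the gradient identity $\mathbb{E}(T_i) = \partial Z/\partial \theta_i$ by finite differences, and solves the MLE equation $T(X) = \nabla Z(\theta)$ by gradient ascent on the concave log-likelihood.

import numpy as np

# Hypothetical finite space: X = {0,1}^3, with statistic T(x) = (sum of coordinates, x1*x2)
space = [np.array(x) for x in np.ndindex(2, 2, 2)]
Ts = np.array([[x.sum(), x[0] * x[1]] for x in space], dtype=float)

def logZ(theta):
    # Z(theta) = log sum_x exp(theta' T(x)); computable here since |X| = 8
    return np.log(np.exp(Ts @ theta).sum())

def mean_T(theta):
    # E_theta(T) under p_theta(x) = exp(theta' T(x) - Z(theta))
    p = np.exp(Ts @ theta - logZ(theta))
    return p @ Ts

theta = np.array([0.3, -0.5])
eps = 1e-6
grad = np.array([(logZ(theta + eps * e) - logZ(theta - eps * e)) / (2 * eps)
                 for e in np.eye(2)])
assert np.allclose(grad, mean_T(theta), atol=1e-5)   # E(T_i) = dZ/dtheta_i

# Solve T(X) = grad Z(theta) for a hypothetical observed statistic lying in the
# interior of the convex hull of {T(x)}; the ascent direction is T(X) - E_theta(T).
t_obs = np.array([1.5, 0.25])
th = np.zeros(2)
for _ in range(5000):
    th += 0.1 * (t_obs - mean_T(th))
print("MLE:", th, "fitted mean:", mean_T(th))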

Some examples of spaces where such exponential families arise naturally are given below:

(a) Graphs

Suppose there are $n$ researchers in a field of active research, denoted by $\{s_1, \cdots, s_n\}$, and given any pair of researchers $(s_i, s_j)$ it is known whether they collaborate or not. A very convenient way to encode this data is through a graph on $n$ vertices labelled by $[n] := \{1, 2, \cdots, n\}$, with node $i$ representing researcher $s_i$. Put an edge between nodes $(i, j)$ if the researchers $s_i$ and $s_j$ collaborate. In this setting take $\mathcal{X} = \mathcal{G}_n$ to be the space of all simple undirected graphs on $n$ vertices labelled by $[n]$, and the observed graph $G$ is just one possible realization from $\mathcal{G}_n$. In this example $|\mathcal{G}_n| = 2^{\binom{n}{2}}$ is large, even for moderate values of $n$.

If there is reason to believe that there are some researchers who collaborate more than others, then that will create a lot of examples of stars. For example, a two star is a simple graph with 3 vertices and 2 edges, depicted in Figure 1.1.

Figure 1.1: A two star

Thus the two star is just the complete bi-partite graph $K_{1,2}$. More generally, a $k$-star denotes the graph $K_{1,k}$, which has $k$ edges coming out of one vertex. By this definition, an edge is a one star.

If $s_i$ is a popular researcher with collaborators $\{s_{i_1}, s_{i_2}, \cdots, s_{i_k}\}$, then any triple $(i_a, i, i_b)$ with $1 \le a < b \le k$ gives a two star. Thus a researcher with $k$ collaborators produces $\binom{k}{2}$ two stars. One way to capture this behavior is by modeling $G$ as a random sample from an exponential family on the space of graphs, using a two star term as a sufficient statistic. As a starting point, one can consider a two parameter exponential family of the form

$$p_\theta(G) = e^{\theta_1 E(G) + (\theta_2/n) T_2(G) - Z_n(\theta_1, \theta_2)},$$

where $G \in \mathcal{G}_n$ is a graph on $n$ vertices, and $(E(G), T_2(G))$ are the number of edges and two stars in $G$ respectively. Here the parameter $\theta_1$ represents a baseline effect, and the number of two stars should be high if $\theta_2$ is large and positive. This model will be referred to as the edge two star model, and was first analyzed in [59] in 2004 by J. Park and M. Newman. These authors used (non-rigorous) mean field techniques from statistical physics to obtain estimates of the normalizing constant. In 2011 S. Chatterjee and P. Diaconis ([12]) obtained an asymptotic estimate of the log normalizing constant for exponential families on graphs with sub-graph counts as sufficient statistics. In particular their results apply to the edge two star model as well. The main tool for obtaining this estimate is a Large Deviation Principle (LDP) for dense graphs, which was obtained in [14] by S. Chatterjee and S.R.S. Varadhan. The main results of [12] and [14] are described in chapter 3. The quantifier “dense” is used in an asymptotic sense, and means that the number of edges in a typical graph on $n$ vertices is $O(n^2)$, and a typical degree is $O(n)$. However, no analogous theory of exponential families exists for sparse graphs, where a typical graph has $O(n)$ edges, and a typical vertex has degree $O(1)$. Arguably, most observed social networks are better modeled as sparse graphs: as an example, it is usually the case that most people who collaborate do not collaborate with most other people.

Coupled with exponential families on graphs, there is the problem of “degeneracy” of the model, which is described in more detail in chapter 4. The work of Chatterjee and Diaconis in [12] gives one explanation for degeneracy. However it does not explain the observation of [62] that some modified versions of sub-graph counts do not cause degeneracy.

(b) Permutations

Given a permutation $\pi \in S_n$ (i.e. a permutation of $[n] = \{1, 2, \cdots, n\}$), suppose it is desired to test whether the permutation $\pi$ was chosen uniformly at random. One such set up can be the following: $(Y_1, \cdots, Y_n)$ are mutually independent from a continuous but unknown distribution, and the problem is to test whether they are identically distributed as well. Let $\pi \in S_n$ denote the permutation defined by $Y_{(i)} = Y_{\pi(i)}$, where $Y_{(1)} < Y_{(2)} < \cdots < Y_{(n)}$ are the order statistics of $Y$. Thus the permutation $\pi$ sorts $Y$ in ascending order. Then under the null hypothesis $\pi$ has a uniform distribution on $S_n$. If the alternative is that the $(Y_1, \cdots, Y_k)$'s are stochastically larger, this will favor permutations with higher values of $(\pi^{-1}(1), \cdots, \pi^{-1}(k))$.

Another example where large permutations are relevant is the lottery, where the number of tickets bought is usually large, and the winners are decided uniformly at random. In this case the main interest is in the first few values of the permutation $(\pi(1), \cdots, \pi(k))$, and not in the entire permutation. An important example of a permutation being used to decide the fate of human lives is the 1970 Draft Lottery for the Vietnam War, when the U.S. Government used a random permutation of size 366 to decide the relative dates at which people (among U.S. citizens born between 1944 and 1950) had to join the army for the war, based on their birthdays. 366 cylindrical capsules were put in a large box, one for each day of the year. The people who were born on the first chosen date had to join the war first, those born on the second chosen date had to go next, and so on. There were widespread allegations ([30]) that the chosen permutation was not uniformly random. The Spearman's rank correlation between the birthdays and lottery numbers was computed to be $-0.226$, which is significantly different from 0 at the 0.001 level of significance. Since, after the permutation is drawn, it is very easy to come up with features which are not typical of a uniformly random permutation, a principled approach has to be taken while analyzing such data.

One way to achieve this is to consider a class of parametric models, so that the direction of possible alternatives is determined before the data is observed. If a permutation is not uniformly chosen, then it might be the case that there is a specific permutation $\sigma$ towards which the sampling mechanism has a bias, and permutations close to $\sigma$ have a higher probability of being selected. One class of models on permutations that captures this behavior is the class of Mallows' models, given by
$$p_\theta(\pi) = e^{-\theta\, d(\pi, \sigma) - Z_n(\theta, \sigma)},$$

where $\pi \in S_n$ is a permutation, and $d(\cdot, \cdot)$ is a distance function on the space of permutations. For $\theta$ large and positive, permutations away from $\sigma$ have small probability compared to those close to $\sigma$. The hypothesis of uniformity in this setting is equivalent to the hypothesis that $\theta = 0$. Carrying out estimation and tests of hypotheses in such models can be difficult for a general $d(\cdot, \cdot)$, as the normalizing constant can be hard to approximate. For the particular choice where $d(\pi, \sigma)$ is the Kendall's tau metric (the minimum number of pairwise inversions needed to convert $\pi$ into $\sigma$), the normalizing constant is known explicitly (see for example [26, (2.9)]). See also the related work of Shannon Starr ([63]), where the author analyzes the same model and obtains an asymptotic estimate of the normalizing constant using a different approach.

This thesis covers examples of exponential families on graphs and permutations, and explores them in some detail. The observed data will be one large graph/permutation, modeled as a random observation from an exponential family. The thesis will develop large deviation results, and will use these results to obtain asymptotic estimates of the log normalizing constant. Large deviations theory gives a very natural framework to obtain such estimates via the celebrated Varadhan's Lemma. The necessary tools are introduced in chapter 2, and the program is explained in chapter 3, with the example of exponential families on dense graphs as carried out in [12].

In chapter 4 a large deviation principle for the empirical degree distribution of sparse Erdős-Rényi graphs is established, and is used to analyze a class of exponential families on graphs. This analysis also gives an insight into the phenomenon of “degeneracy”, which has been observed in some exponential families in the social sciences literature. It also gives a way to fit these models to data, which is carried out using an example. In chapter 5 large deviation results for a uniformly random permutation are used to analyze a class of one parameter exponential families on permutations. This analysis, apart from providing the asymptotics of the log normalizing constant, is used to prove the consistency of the pseudo-likelihood estimator. The tools developed are used to analyze the draft lottery data for uniformity. A new proof of the large deviation principle on permutations is also given.

The outline of the thesis is as follows:

Chapter 2 introduces the main theoretical tools used in this thesis, namely dense graph limits, permutation limits, and large deviations theory. Chapter 3 describes two existing methods used in practice for inferential procedures with unknown normalizing constants: the pseudo-likelihood method introduced by J. Besag in 1974-75 ([4],[3]), and the Markov Chain Maximum Likelihood Estimate method (MCMLE) introduced by C. Geyer in his Ph.D. thesis in 1990 ([34]). It then describes the work of Bhamidi-Sly-Bresler ([5]) which analyzes rates of local Markov chains for ERGMs on dense graphs, and the work of Chatterjee-Diaconis ([12]) which (among other things) computes the asymptotics of the log normalizing constant for a class of ERGMs on dense graphs. Chapters 4 and 5 comprise the main body of the thesis, which derive the required large deviations, and analyze exponential families on sparse graphs and permutations respectively. Finally, chapter 6 concludes the thesis with a brief summary of the work done, and possible scope for future work.

Chapter 2

Tools required

2.1 Graph Limits

The theory of graph limits was developed by Lovász and coauthors in a series of papers (see [9],[10],[?] and the references therein), and has received phenomenal attention over the last few years. Graph limit theory connects various topics such as graph homomorphisms, Szemerédi's regularity lemma, quasirandom graphs, graph testing and extremal graph theory, and has even found applications in statistics and related areas [12]. For a detailed exposition of the theory of graph limits refer to Lovász [?]. The basic definitions about convergence of graph sequences are given below.

If $F$ and $G$ are two graphs, then set
$$t(F, G) := \frac{|\hom(F, G)|}{|V(G)|^{|V(F)|}},$$
where $|\hom(F, G)|$ denotes the number of homomorphisms of $F$ into $G$. In fact, $t(F, G)$ denotes the probability that a uniformly random mapping $\phi : V(F) \to V(G)$ defines a graph homomorphism.

The basic definition is that a sequence $G_n$ of graphs converges if $t(F, G_n)$ converges for every graph $F$. There is a natural limit object in the form of a function $W \in \mathcal{W}$, where $\mathcal{W}$ is the space of all measurable functions from $[0,1]^2$ into $[0,1]$ that satisfy $W(x, y) = W(y, x)$ for all $x, y$. Conversely, every such function arises as the limit of an appropriate graph sequence. This limit object determines all the limits of subgraph densities: if $H$ is a simple graph with $V(H) = [k] = \{1, 2, \ldots, k\}$, let

$$t(H, W) = \int_{[0,1]^k} \prod_{(i,j) \in E(H)} W(x_i, x_j)\, dx_1\, dx_2 \cdots dx_k.$$

A sequence of graphs $\{G_n\}_{n \ge 1}$ is said to converge to $W$ if for every finite simple graph $H$,
$$\lim_{n \to \infty} t(H, G_n) = t(H, W). \tag{2.1}$$
These limit objects, that is, elements of $\mathcal{W}$, are called graph limits or graphons. A finite simple graph $G$ on $[n]$ can also be represented as a graphon in a natural way: define $f^G(x, y) = 1\{(\lceil nx \rceil, \lceil ny \rceil) \in E(G)\}$, that is, partition $[0,1]^2$ into $n^2$ squares of side length $1/n$, and define $f^G(x, y) = 1$ on the $(i, j)$-th square if $(i, j) \in E(G)$ and 0 otherwise. Thus the space $\mathcal{W}$ contains an embedding of labelled graphs of all possible sizes. Observe that $t(H, f^G) = t(H, G)$ for every simple graph $H$, and therefore the constant sequence $G$ converges to the graph limit $f^G$.

The notion of convergence in terms of subgraph densities outlined above can be captured by the cut-distance, defined as
$$d_\Box(f, g) := \sup_{S, T \subset [0,1]} \Big| \int_{S \times T} [f(x, y) - g(x, y)]\, dx\, dy \Big|$$
for $f, g \in \mathcal{W}$. It is possible that there exist two functions $f, g \in \mathcal{W}$ such that $t(H, f) = t(H, g)$ for every $H$, but $f \ne g$. As an example, let $\sigma : [0,1] \mapsto [0,1]$ be a measure preserving map which is not the identity, and let $f(x, y) = g_\sigma(x, y) := g(\sigma x, \sigma y)$. It is then easy to check that $t(H, f) = t(H, g)$ for any $H$. However $d_\Box(f, g) = 0$ need not hold, as that would imply $g = g_\sigma$ almost surely. To see an explicit example, let $G$ be a simple graph on 3 vertices labelled $\{1, 2, 3\}$ with an edge between $(1, 2)$, and let $g = f^G$ be the graphon of $G$. If $\sigma(x) = (x + .3) \pmod 1$, then $\sigma$ is a measure preserving map, and $g_\sigma$ is the graphon of a graph on the same 3 vertices with an edge between $(2, 3)$. In particular, $g$ and $g_\sigma$ are not the same almost surely.

This has the unfortunate effect that the limiting object for a sequence of graphs is unique only up to a measure preserving map $\sigma$. To define the limiting object uniquely, consider an equivalence relation on $\mathcal{W}$ as follows: $f \sim g$ whenever $f(x, y) = g_\sigma(x, y)$ for some measure preserving bijection $\sigma : [0,1] \mapsto [0,1]$. Denote by $\tilde g$ the closure of the orbit $\{g_\sigma\}$ in $(\mathcal{W}, d_\Box)$. The quotient space is denoted by $\widetilde{\mathcal{W}}$, and is associated with the following natural metric:
$$\delta_\Box(\tilde f, \tilde g) := \inf_\sigma d_\Box(f, g_\sigma) = \inf_\sigma d_\Box(f_\sigma, g) = \inf_{\sigma_1, \sigma_2} d_\Box(f_{\sigma_1}, g_{\sigma_2}).$$
Then the space $(\widetilde{\mathcal{W}}, \delta_\Box)$ is compact [9], and the metric $\delta_\Box$ is commonly referred to as the cut-metric.

One of the main results in graph limit theory is that a sequence of graphs $\{G_n\}_{n \ge 1}$ converges to a limit $W \in \mathcal{W}$ in the sense defined in (2.1) if and only if $\delta_\Box(\tilde f^{G_n}, \tilde W) \to 0$ [9, Theorem 3.8]. More generally, a sequence $\{\tilde W_n\}_{n \ge 1}$ converges to $\tilde W \in \widetilde{\mathcal{W}}$ if and only if $\delta_\Box(\tilde W_n, \tilde W) \to 0$.

2.2 Permutation Limits

Analogous to the theory of graph limits, Hoppen et al. [40] developed the theory of permutation limits. For $\pi \in S_n$ and $\tau \in S_k$, $\tau$ is a sub-permutation of $\pi$ if there exist $1 \le i_1 < \cdots < i_k \le n$ such that $\tau(r) < \tau(s)$ if and only if $\pi(i_r) < \pi(i_s)$. Note that this is possible only if $n \ge k$. As an example, the permutation $\tau = (132)$ is a sub-permutation of $\pi = (7126354)$ for the choice $i_1 = 3, i_2 = 4, i_3 = 6$. Indeed, one has $(\pi(i_1), \pi(i_2), \pi(i_3)) = (\pi(3), \pi(4), \pi(6)) = (2, 6, 5)$, which is in the same relative order as $(132)$. In the combinatorics literature, sub-permutations are more commonly referred to as patterns, and they are widely studied (refer to Bóna [7], the recent paper of Janson et al. [42] and the references therein). This thesis follows the setup of Hoppen et al. [40] (see also Král' and Pikhurko [44] and Glebov et al. [36]), and refers to patterns as sub-permutations because of the similarity to the notion of sub-graphs, the counterpart object in graph limit theory.

Note that in the above example, the choice of indices $(i_1, i_2, i_3)$ is not unique. Indeed, one could have chosen $i_1 = 2, i_2 = 6, i_3 = 7$, for which one has $(\pi(i_1), \pi(i_2), \pi(i_3)) = (\pi(2), \pi(6), \pi(7)) = (1, 5, 4)$, which has the same relative ordering as $(1, 3, 2)$ and $(2, 6, 5)$. Let $\Lambda(\tau, \pi)$ denote the number of copies of the sub-permutation $\tau$ in the permutation $\pi$, i.e. the number of choices $1 \le i_1 < i_2 < \cdots < i_k \le n$ satisfying the relative order condition above. As an example, for the choice $\tau = (2, 1)$, $\Lambda(\tau, \pi)$ equals the number of inversions in $\pi$, i.e. the number of pairs $(r, s)$ such that $1 \le r < s \le n$ and $\pi(r) > \pi(s)$.
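The count $\Lambda(\tau, \pi)$ is simple to compute by brute force for small $n$; the following Python sketch does so, and checks on the example permutation from the text that $\Lambda((2,1), \pi)$ equals the number of inversions.

from itertools import combinations

def Lambda(tau, pi):
    # number of index sets i_1 < ... < i_k whose values in pi have the relative order tau
    k, n = len(tau), len(pi)
    count = 0
    for idx in combinations(range(n), k):
        vals = [pi[i] for i in idx]
        ranks = [sorted(vals).index(v) + 1 for v in vals]   # pattern of vals
        if ranks == list(tau):
            count += 1
    return count

pi = (7, 1, 2, 6, 3, 5, 4)          # the permutation (7126354) from the text
print(Lambda((1, 3, 2), pi))
inv = sum(pi[r] > pi[s] for r in range(7) for s in range(r + 1, 7))
assert Lambda((2, 1), pi) == inv    # copies of (2,1) are exactly inversions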

The density of a permutation $\tau \in S_k$ in a permutation $\pi \in S_n$ is
$$t(\tau, \pi) = \begin{cases} \binom{n}{k}^{-1} \Lambda(\tau, \pi) & \text{if } k \le n, \\ 0 & \text{if } k > n. \end{cases}$$

Thus $t(\tau, \pi)$ denotes the probability that the sub-permutation $\pi \circ \phi$ is the same as $\tau$, where $\phi$ is a uniformly random one-one function from $[k]$ to $[n]$. An infinite sequence $(\pi_n)_{n \in \mathbb{N}}$ of permutations is convergent as $n \to \infty$ if $t(\tau, \pi_n)$ converges for every permutation $\tau$. Every convergent sequence of permutations can be associated with an analytic object, referred to as a permuton, which is a probability measure $\mu$ on $([0,1]^2, \mathcal{B}([0,1]^2))$ with uniform marginals, where $\mathcal{B}([0,1]^2)$ is the sigma-algebra of the Borel sets of $[0,1]^2$. Given any such permuton $\mu$, one can construct a sequence of random permutations as follows:

For any integer $n$, sample $n$ independent points $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ in $[0,1]^2$ from $\mu$. Set $\pi(i) = j$ if there exists $l \in [n]$ such that $X_l = X_{(i)}, Y_l = Y_{(j)}$. Here $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ are the order statistics of $(X_1, X_2, \cdots, X_n)$. Another way to define the same $\pi$ as above is as follows: let $\sigma_x \in S_n$ be the permutation that sorts $X$ in increasing order, i.e. $X_{\sigma_x(1)} < X_{\sigma_x(2)} < \cdots < X_{\sigma_x(n)}$. This is the same as defining $\sigma_x$ through the relation $X_{\sigma_x(i)} = X_{(i)}$. Similarly define $\sigma_y$, and note that the $\pi$ defined above satisfies $\sigma_y(\pi(i)) = \sigma_x(i)$ for all $i \in [n]$. Thus one can define $\pi = \sigma_y^{-1} \circ \sigma_x$ as the $\mu$-random permutation of size $n$.

For a permuton $\mu$ and $\tau \in S_k$, define $t(\tau, \mu)$ as the probability that a $\mu$-random permutation of size $k$ is $\tau$. More precisely, one can write

$$t(\tau, \mu) = \mathbb{P}\Big( \big(X_{(r)} - X_{(s)}\big)\big(Y_{[r]} - Y_{[s]}\big)\big(\tau(r) - \tau(s)\big) < 0 \ \text{ for all } 1 \le r < s \le k \Big), \tag{2.2}$$
where $Y_{[r]}$ denotes the $Y$-coordinate of the sampled point whose $X$-coordinate is $X_{(r)}$.

This is the generalization of sub-permutation counts for permutations to general permutons. For the choice $\tau = (2, 1)$, (2.2) becomes
$$t(\tau, \mu) = \mathbb{P}\big((X_1 - X_2)(Y_1 - Y_2) < 0\big),$$
and for the choice $\tau = (1, 2, 3)$, (2.2) becomes
$$t(\tau, \mu) = \mathbb{P}\big((X_1 - X_2)(Y_1 - Y_2) > 0,\ (X_2 - X_3)(Y_2 - Y_3) > 0,\ (X_1 - X_3)(Y_1 - Y_3) > 0\big).$$
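The construction of a $\mu$-random permutation, together with a Monte Carlo estimate of $t(\tau, \mu)$, can be sketched in a few lines of Python. The permuton below is the uniform measure, for which every pattern of size $k$ has density $1/k!$; the sampler is an assumed stand-in for any permuton.

import numpy as np

rng = np.random.default_rng(2)

def mu_random_perm(sampler, n):
    # pi = sigma_y^{-1} o sigma_x, as in the construction above
    X, Y = sampler(n)
    sigma_x = np.argsort(X)
    rank_y = np.empty(n, dtype=int)
    rank_y[np.argsort(Y)] = np.arange(n)      # rank_y = sigma_y^{-1} (0-based)
    return tuple(rank_y[sigma_x] + 1)         # 1-based permutation

def uniform_sampler(n):                       # mu = Lebesgue measure on the unit square
    return rng.random(n), rng.random(n)

def t_tau_mc(sampler, tau, reps=20000):
    # estimates t(tau, mu): probability that a mu-random permutation of size k equals tau
    k = len(tau)
    return sum(mu_random_perm(sampler, k) == tau for _ in range(reps)) / reps

print(t_tau_mc(uniform_sampler, (2, 1)))      # approximately 1/2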

The main result in the permutation limit theory of [40] is that for every convergent sequence $(\pi_n)_{n \in \mathbb{N}}$ of permutations, there exists a unique permuton $\mu$ such that
$$t(\tau, \mu) = \lim_{n \to \infty} t(\tau, \pi_n) \tag{2.3}$$
for every permutation $\tau$. The permuton $\mu$ is defined as the limit of the sequence $(\pi_n)_{n \in \mathbb{N}}$. On the other hand, a sequence of $\mu$-random permutations $(\pi_n)_{n \in \mathbb{N}}$ converges to $\mu$ in this sense, in probability. Note that unlike the graph limit object, the limiting permuton is unique, and there is no need to further quotient the space of permutons. The reason for this is that a permuton still keeps track of the labels. To be more precise, let $f(x, y)$ be a bivariate density on the unit square with uniform marginals, and let $\sigma$ be a measure preserving bijection on $[0,1]$. Then the function $f(\sigma(x), \sigma(y))$ is again a bivariate density on the unit square with uniform marginals, and is not necessarily the same as $f(x, y)$.

As in graph limit theory, the above notion of permutation convergence can be metrized by embedding all permutations into the space $\mathcal{M}$ of probability measures on $[0,1]^2$ with uniform marginals, equipped with the topology of weak convergence. To this end, define for any $\pi \in S_n$ a probability measure $\mu_\pi \in \mathcal{M}$ by $d\mu_\pi := f_\pi(x, y)\, dx\, dy$, where $f_\pi(x, y) = n\, 1\{\pi(\lceil nx \rceil) = \lceil ny \rceil\}$ is the density of $\mu_\pi$ with respect to Lebesgue measure. As in graph limit theory, $\mu_\pi$ has the following interpretation: partition $[0,1]^2$ into $n^2$ squares of side length $1/n$, and define $f_\pi(x, y) = n$ for all $(x, y)$ in the $(i, j)$-th square if $\pi(i) = j$, and 0 otherwise. As an example, the measure $\mu_\pi$ corresponding to the permutation $\pi = (1, 3, 2)$ has the density depicted in Figure 2.1, where the shaded region has density 3 and the white region has density 0.

Figure 2.1: Permuton for (1, 3, 2)

The space $\mathcal{M}$ equipped with the topology of weak convergence of measures is compact and metrizable. The map $\pi \to \mu_\pi$ is one-one, and hence $\mathcal{M}$ contains an embedding of permutations of all sizes. Analogous to the result in graph limits, Hoppen et al. [40, Lemma 5.3] showed that a sequence of permutations $\pi_n$ converges if and only if there is a measure $\mu \in \mathcal{M}$ such that the corresponding sequence of measures $\mu_{\pi_n}$ converges weakly to $\mu \in \mathcal{M}$. More generally, any sequence of measures $\{\mu_n\}$ in $\mathcal{M}$ converges weakly to a measure $\mu$ if and only if $t(\tau, \mu_n) \to t(\tau, \mu)$ for all permutations $\tau$.

2.3 Understanding permutation limits

Since the concept of permutation limits is relatively new, this section is devoted to explaining some basic results about permutation limits. To begin, first note that the convergence of a sequence of permutations $\pi_n$ is defined even if the size of the permutations does not go to $\infty$. However this is not the interesting case, as any convergent permutation sequence whose sizes stay bounded is eventually constant ([40, Claim 2.4]). Thus without loss of generality one can focus on the case where the size goes to $\infty$.

The following definition gives another representation of π as a measure on the unit square. This representation will be more suitable for applications to exponential families on permutations.

Definition 2.1. Given a permutation $\pi \in S_n$, let $\nu_\pi$ denote the discrete probability measure on $[0,1]^2$ defined by the empirical measure of the points $(i/n, \pi(i)/n)$, i.e.
$$\nu_\pi := \frac{1}{n} \sum_{i=1}^n \delta_{(i/n,\, \pi(i)/n)}.$$

Note that both marginals of $\nu_\pi$ are discrete uniform on the set $\{i/n : i = 1, 2, \cdots, n\}$. Since the marginals are not uniform on $[0,1]$, $\nu_\pi$ is not an element of $\mathcal{M}$, but any weak limit of the sequence $\nu_{\pi_n}$ is in $\mathcal{M}$ if the size of the permutation sequence goes to $\infty$.

Note that if the size of the permutation $\pi$ is large, the two measures $\mu_\pi$ and $\nu_\pi$ are close in the weak topology. To see this, let $\pi \in S_n$, and let $F_{\mu_\pi}$ and $F_{\nu_\pi}$ represent the bivariate distribution functions of $\mu_\pi$ and $\nu_\pi$ respectively. Then it follows that
$$d_\infty(\mu_\pi, \nu_\pi) := \sup_{0 \le x, y \le 1} |F_{\mu_\pi}(x, y) - F_{\nu_\pi}(x, y)| \le \frac{2}{n}.$$
Indeed, recall that both $\mu_\pi$ and $\nu_\pi$ can be defined by partitioning the unit square into $n^2$ boxes, such that exactly $n$ boxes receive a mass of $1/n$. Also the choice of the $n$ boxes is such that every row and every column has exactly one box. Thus any vertical line through $x$ can intersect exactly one box in this partition which has positive probability, and so the above difference can be at most $1/n + 1/n$.

This shows that if $\pi_n$ is a sequence of permutations whose sizes go to $\infty$, then
$$\mu_{\pi_n} \stackrel{w}{\to} \mu \iff \nu_{\pi_n} \stackrel{w}{\to} \mu.$$

For the rest of the section, the notation $\pi_n$ will mean that $\pi_n$ is a permutation of size $n$, i.e. $\pi_n \in S_n$. In this case, an equivalent definition of permutation convergence is the following:

Definition 2.2. A sequence of permutations $\pi_n$ is said to converge to a measure $\mu$ on $[0,1]^2$ if the corresponding sequence of measures $\nu_{\pi_n}$ converges weakly to $\mu$. Any such $\mu$ obtained as a limit of $\nu_{\pi_n}$ is necessarily in $\mathcal{M}$, as each marginal converges weakly to $U[0,1]$.

The next list makes some basic observations about permutation limits.

(a) If $\pi_n$ converges to the law of $(X, Y)$, then $\pi_n^{-1}$ converges to the law of $(Y, X)$. This follows from the observation that if $(X, Y) \sim \nu_{\pi_n}$, then $(Y, X) \sim \nu_{\pi_n^{-1}}$.

(b) If $\pi_n$ converges to the law of $(X, Y)$, then setting $\sigma_n(i) = n + 1 - \pi_n(i)$ one has that $\sigma_n$ converges to the law of $(X, 1 - Y)$. This follows from the observation that if $(X, Y) \sim \nu_{\pi_n}$, then $(X, 1 - Y + 1/n) \sim \nu_{\sigma_n}$.

(c) If πn is a random permutation chosen uniformly from Sn, then νπn converges weakly to the Lebesgue measure on [0, 1]2. CHAPTER 2. TOOLS REQUIRED 16

(d) Even if $\nu_{\pi_n} \stackrel{w}{\to} \mu_1$ and $\nu_{\sigma_n} \stackrel{w}{\to} \mu_2$, the sequence $\nu_{\pi_n \circ \sigma_n}$ need not converge. Thus permutation convergence does not respect the group structure of permutations.

To see this, let $\pi_n$ be a uniformly random permutation on $S_n$, and let $\sigma_n = \pi_n$ for $n$ odd, and $\pi_n^{-1}$ for $n$ even. Then $\pi_n$ and $\sigma_n$ both converge to the uniform measure on the unit square in probability, but the sequence $\pi_n \circ \sigma_n$ does not converge. This is because along the even sequence $\pi_n \circ \sigma_n$ is the identity permutation of size $n$, which converges weakly to the uniform distribution on the diagonal $x = y$. However, along the odd sequence $\pi_n \circ \pi_n$ converges to the uniform distribution on $[0,1]^2$ by a direct calculation.

(e) As an example of a natural statistic on permutations which is not continuous with respect to this topology, consider the statistic $N(\pi_n)/n$, where $N(\pi_n)$ is the number of fixed points of $\pi_n$. This is a bounded function on $S_n$, and equals $\nu_{\pi_n}\{(x, x) : 0 \le x \le 1\}$, the mass put on the diagonal by the measure $\nu_{\pi_n}$. If $\pi_n$ is the identity permutation on $S_n$ and $\sigma_n(i) = (i + 1) \pmod n$, then both $\pi_n$ and $\sigma_n$ converge to the uniform measure on the diagonal, but $N(\pi_n)/n = 1$ whereas $N(\sigma_n)/n = 0$.

2.4 Large deviations theory

This section gives a basic introduction to how large deviations theory can be applied to obtain asymptotic estimates of log normalizing constants, and is based on the book “Large Deviations Techniques and Applications” ([20]) by Dembo-Zeitouni. For more on this subject, refer to [20] and the references therein.

Let $(\mathcal{X}, \mathcal{B})$ be a measure space, and let $\mathcal{X}$ be equipped with a topology such that all open (and closed) sets are measurable. Let $I : \mathcal{X} \mapsto [0, \infty]$ be a lower semi-continuous function, i.e. the set $\Psi_I(\alpha) := \{x : I(x) \le \alpha\}$ is closed for all $\alpha \in [0, \infty)$. Such an $I$ is called a rate function. It is called a good rate function if the set $\Psi_I(\alpha)$ is compact for all $\alpha < \infty$.

For every $n \ge 1$ let $\mathbb{P}_n$ be a probability measure on $(\mathcal{X}, \mathcal{B})$. This sequence of probability measures is said to satisfy a large deviation principle with speed $a_n$ and rate function $I$ if for all $A \in \mathcal{B}$,
$$-\inf_{x \in A^\circ} I(x) \le \liminf_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n(A) \le \limsup_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n(A) \le -\inf_{x \in \bar{A}} I(x).$$

An equivalent representation of the above condition is the following:

(a) Upper bound: for every $\alpha < \infty$ and every measurable set $A$ such that $A \subset \Psi_I(\alpha)^c$, one has
$$\limsup_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n(A) \le -\alpha.$$

(b) Lower bound: for any $x$ such that $I(x) < \infty$ and $x \in A^\circ$, one has
$$\liminf_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n(A) \ge -I(x).$$

This representation is sometimes easier to use while proving the large deviation principle. Sometimes the conditions above do not hold in full generality. In particular, assume that the lower bound holds for all measurable sets $A$, and the upper bound condition holds for all compact $A \subset \Psi_I(\alpha)^c$ for some $\alpha < \infty$. Then the sequence of probability measures $\mathbb{P}_n$ is said to satisfy a weak large deviation principle with the rate function $I(\cdot)$. The next definition gives a way to strengthen a weak large deviation principle to a full large deviation principle.

Definition 2.3. A sequence of probability measures $\mathbb{P}_n$ is said to be exponentially tight if for all $\alpha < \infty$ there exists a compact set $K_\alpha \subset \mathcal{X}$ such that
$$\limsup_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n(K_\alpha^c) < -\alpha.$$

In words, this means that under $\mathbb{P}_n$ most of the probability lies on compact sets, even at an exponential scale. By [20, Lemma 1.2.18] it follows that if a sequence of probability measures $\mathbb{P}_n$ is exponentially tight and satisfies a weak large deviation principle with a rate function $I(\cdot)$, then $\mathbb{P}_n$ satisfies a full large deviation principle with the same rate function $I(\cdot)$, and moreover $I(\cdot)$ is a good rate function.

Another notion which will be useful later is the concept of exponential equivalence of two sequences of probability measures $\mathbb{P}_{1,n}$ and $\mathbb{P}_{2,n}$ on a metric space $(\mathcal{X}, d)$.

Definition 2.4. Let $\mathbb{P}_{1,n}$ and $\mathbb{P}_{2,n}$ be two sequences of probability measures on a metric space $(\mathcal{X}, d)$. Assume that there exists a sequence of probability spaces $(\Omega, \mathcal{F}_n, \mathbb{P}_n)$ containing two families of $\mathcal{X}$ valued random variables $X_{1,n}$ and $X_{2,n}$, such that the law of $X_{i,n}$ under $\mathbb{P}_n$ is $\mathbb{P}_{i,n}$, for $i = 1, 2$. Also assume that for every $\delta > 0$ the set $\{d(X_{1,n}, X_{2,n}) > \delta\} \in \mathcal{F}_n$ and satisfies
$$\lim_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n\big(d(X_{1,n}, X_{2,n}) > \delta\big) = -\infty.$$
Then the two sequences of laws $\mathbb{P}_{1,n}$ and $\mathbb{P}_{2,n}$ are said to be exponentially equivalent.

In words, exponential equivalence means there is a very efficient coupling between the two sequences of probability measures. Exponential equivalence can be used to transfer large deviation results from one of the sequences to the other using [20, Theorem 4.2.13], which is stated below:

Theorem 2.5. If P1,n and P2,n are exponentially equivalent, and P1,n satisfies a large deviation principle with a good rate function I(.), then P2,n satisfies the same large deviation principle.

Another tool for proving a weak large deviation principle is the following theorem, which is a simplified version of [20, Theorem 4.1.11].

Theorem 2.6. Let $\mathcal{A}$ denote a base for the topology of $\mathcal{X}$, and for every $A \in \mathcal{A}$ assume that the limit
$$L_A := -\lim_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}_n(A)$$
exists. Then the sequence of probability measures $\mathbb{P}_n$ satisfies a weak large deviation principle with the rate $I(\cdot)$ defined by
$$I(x) := \sup_{A \in \mathcal{A}:\, x \in A} L_A.$$

The final result for establishing a large deviation principle required for this thesis is the following lemma ([20, Lemma 4.1.5(a)]), which gives the transformation of a large deviation principle when the underlying topological space is replaced by a bigger space.

Lemma 2.7. Let $\mathcal{X}$ be a closed subset of $\mathcal{Y}$ such that $\mathbb{P}_n(\mathcal{X}) = 1$ for all $n \ge 1$, and suppose that $\mathcal{X}$ is equipped with the topology induced by the topology of $\mathcal{Y}$. If $\mathbb{P}_n$ satisfies a large deviation principle with speed $a_n$ and rate function $I : \mathcal{X} \mapsto [0, \infty]$, then $\mathbb{P}_n$ satisfies a large deviation principle with speed $a_n$ and rate function $I' : \mathcal{Y} \mapsto [0, \infty]$, where $I'(y) = I(y)$ if $y \in \mathcal{X}$, and $I'(y) = +\infty$ otherwise.

All the techniques introduced above help in proving some form of large deviation principle. The main reason for introducing large deviations in the context of exponential families is the following celebrated result due to S.R.S. Varadhan ([20, Theorem 4.3.1]), which gives a way of computing normalizing constants in terms of an optimization problem.

Theorem 2.8. Let $\mathbb{P}_n$ satisfy a large deviation principle with a good rate function $I(\cdot)$, and let $\phi : \mathcal{X} \mapsto \mathbb{R}$ be any continuous function such that
$$\limsup_{n \to \infty} \frac{1}{a_n} \log \int_{\mathcal{X}} e^{\gamma a_n \phi(x)}\, d\mathbb{P}_n(x) < \infty \tag{2.4}$$
for some $\gamma > 1$. Then one has
$$\lim_{n \to \infty} \frac{1}{a_n} \log \int_{\mathcal{X}} e^{a_n \phi(x)}\, d\mathbb{P}_n(x) = \sup_{x \in \mathcal{X}} \{\phi(x) - I(x)\}.$$

The decay condition (2.4) holds trivially when $\phi$ is bounded. This theorem reduces the asymptotics of the log normalizing constant of an exponential family to an optimization problem. The optimization problem itself can be hard, and so the above result is useful for estimation of normalizing constants only when the optimization problem can be solved numerically. An explicit example of how this theory has been applied to compute normalizing constants in [12] will be explained in chapter 3.

If $\phi$ is lower semi-continuous, then equality might not hold in Varadhan's Lemma. In this case, the lower bound still holds ([20, Lemma 4.3.4]), and the rate function need not be good.

Lemma 2.9. Let $\mathbb{P}_n$ satisfy a large deviation principle with speed $a_n$ and rate function $I(\cdot)$, and let $\phi : \mathcal{X} \mapsto \mathbb{R}$ be lower semi-continuous. Then
$$\liminf_{n \to \infty} \frac{1}{a_n} \log \int_{\mathcal{X}} e^{a_n \phi(x)}\, d\mathbb{P}_n(x) \ge \sup_{x \in \mathcal{X}} \{\phi(x) - I(x)\}.$$

An analogous upper bound holds if $\phi$ is upper semi-continuous, $I(\cdot)$ is a good rate function, and condition (2.4) holds ([20, Lemma 4.3.6]).

Lemma 2.10. Let $\mathbb{P}_n$ satisfy a large deviation principle with a good rate function $I(\cdot)$, and let $\phi : \mathcal{X} \mapsto \mathbb{R}$ be any upper semi-continuous function such that (2.4) holds. Then one has
$$\limsup_{n \to \infty} \frac{1}{a_n} \log \int_{\mathcal{X}} e^{a_n \phi(x)}\, d\mathbb{P}_n(x) \le \sup_{x \in \mathcal{X}} \{\phi(x) - I(x)\}.$$

Remark 2.11. Note that even though [20, Lemma 4.3.6] requires $\phi$ to be finite everywhere on $\mathcal{X}$, the proof goes through as long as $\phi \in [-\infty, \infty)$.

The final result of this chapter is the following lemma ([20, Lemma 6.2.13]), which gives the answer to an optimization problem in terms of the Kullback-Leibler divergence.

Definition 2.12. Let $(\mathcal{X}, \mathcal{F})$ be a measure space, and let $\mu, \nu$ be two probability measures on $\mathcal{X}$. The Kullback-Leibler divergence between $\mu$ and $\nu$, denoted by $D(\mu||\nu)$, is defined to be $\infty$ if $\mu$ is not absolutely continuous with respect to $\nu$. If $\mu \ll \nu$, then denoting the Radon-Nikodym derivative of $\mu$ with respect to $\nu$ by $f$, the Kullback-Leibler divergence is defined by
$$D(\mu||\nu) = \int_{\mathcal{X}} \log f\, d\mu = \int_{\mathcal{X}} f \log f\, d\nu.$$

Note that D(µ||ν) is not symmetric, and it is possible that D(µ||ν) = ∞ and D(ν||µ) < ∞.
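For finite spaces the divergence is a one-line computation; the following Python sketch (with hypothetical toy distributions) implements Definition 2.12 and illustrates the asymmetry just mentioned.

import numpy as np

def D(mu, nu):
    # Kullback-Leibler divergence for finite distributions; infinite when mu is
    # not absolutely continuous with respect to nu
    mu, nu = np.asarray(mu, float), np.asarray(nu, float)
    if np.any((nu == 0) & (mu > 0)):
        return np.inf
    m = mu > 0
    return float(np.sum(mu[m] * np.log(mu[m] / nu[m])))

mu = [0.5, 0.5, 0.0]
nu = [0.2, 0.4, 0.4]
print(D(mu, nu), D(nu, mu))   # D(mu||nu) is finite while D(nu||mu) = inf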

Lemma 2.13. For a Polish space $\mathcal{X}$, let $B(\mathcal{X})$ denote the set of all bounded measurable functions from $\mathcal{X}$ to $\mathbb{R}$. Then for any two probability measures $\mu, \nu$ on $\mathcal{X}$ (equipped with the Borel sigma algebra), one has
$$D(\mu||\nu) = \sup_{\phi \in B(\mathcal{X})} \Big\{ \int \phi\, d\mu - \log \int e^\phi\, d\nu \Big\}.$$

Chapter 3

Previous work

3.1 Methods

The two most common methods used in statistics for inference in models with unknown normalizing constants are the pseudo-likelihood and the Markov Chain Maximum Likelihood Estimate (MCMLE). The pseudo-likelihood method was introduced by J. Besag in [3],[4] in 1974-75. The MCMLE was introduced by C. Geyer in his Ph.D. thesis ([34]) in 1990. Both these methods bypass the direct computation of the normalizing constant using clever techniques, and as such are computationally feasible for a large class of problems where direct computation of the normalizing constant fails. Before beginning this section, recall the definition of the edge two star model (introduced in the introduction). The mathematical form of this model is
$$p_\theta(G) = e^{\theta_1 E(G) + (\theta_2/n) T_2(G) - Z_n(\theta_1, \theta_2)},$$
where $E(G)$ and $T_2(G)$ are the number of edges and two stars in the labelled graph $G$ on $n$ vertices. This model will be used for illustrative purposes to explain methods and theory where applicable.

(a) Pseudo-likelihood


Instead of considering the joint likelihood, which involves the unknown normalizing constant, it might be easier to consider the conditional distribution of any one random variable given the rest. As an example, in the edge two star model the conditional distribution of any edge $G(i, j)$ given all the other edges is Bernoulli with parameter
$$\frac{e^{\theta_1 + (\theta_2/n) \sum_{k \ne i,j} (G(i,k) + G(j,k))}}{1 + e^{\theta_1 + (\theta_2/n) \sum_{k \ne i,j} (G(i,k) + G(j,k))}},$$
which is much easier to think about. The pseudo-likelihood is defined to be the product of all such one dimensional conditional distributions, one for each random variable. If the conditional distributions are simple, then the pseudo-likelihood will be simpler than the likelihood. As a result one can maximize the pseudo-likelihood analytically or numerically to obtain an estimate of the parameter, called the pseudo-likelihood estimate. As an example, in the edge two star model the pseudo-likelihood is given by
$$\prod_{i < j} \frac{e^{\{\theta_1 + (\theta_2/n) \sum_{k \ne i,j} (G(i,k) + G(j,k))\}\, G(i,j)}}{1 + e^{\theta_1 + (\theta_2/n) \sum_{k \ne i,j} (G(i,k) + G(j,k))}}.$$
The pseudo-likelihood estimate is the global maximizer in $\theta$ of the above quantity, when a unique maximizer exists. Computing the pseudo-likelihood above requires computing $\sum_{k \ne i,j} (G(i,k) + G(j,k))$ for all pairs $(i, j)$, which can be done in time polynomial in $n$. Also the final optimization is a two dimensional convex optimization problem (the log pseudo-likelihood is concave in $\theta$), and so is easy to carry out numerically. Thus the computation of the pseudo-likelihood estimate is numerically tractable.
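Since the pseudo-likelihood above is exactly a logistic regression of the edge indicators on the covariates $s_{ij} = \sum_{k \ne i,j} (G(i,k) + G(j,k))$, it can be maximized by gradient ascent on the concave log pseudo-likelihood. The following Python sketch does this on a hypothetical random graph; it is an illustration of the method, not the thesis's fitting procedure.

import numpy as np

rng = np.random.default_rng(3)
n = 40
G = rng.integers(0, 2, size=(n, n)); G = np.triu(G, 1); G = G + G.T

deg = G.sum(axis=1)
iu, ju = np.triu_indices(n, 1)
s = deg[iu] + deg[ju] - 2 * G[iu, ju]       # sum_{k != i,j} (G(i,k) + G(j,k))
x = G[iu, ju].astype(float)                 # responses G(i,j)
Z = np.column_stack([np.ones(len(s)), s / n])

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

theta = np.zeros(2)
for _ in range(20000):                      # gradient ascent on the concave log-PL
    theta += 0.5 * Z.T @ (x - sigmoid(Z @ theta)) / len(x)
print("pseudo-likelihood estimate:", theta)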

One important concern is whether there are any performance guarantees for the pseudo-likelihood estimate, and whether it is possible to give bounds on how large the error will be. Very little is known about the theoretical properties of the pseudo-likelihood estimator. See however the works of [11],[16],[43],[51], among others, where the consistency of the pseudo-likelihood estimator was shown rigorously in specific examples. In chapter 5 the consistency of the pseudo-likelihood estimator will be shown for a class of exponential families on permutations, using ideas similar to [11].

(b) MCMLE

This method approximates the unknown log normalizing constant by MCMC. Suppose one wants to estimate the log partition function at $\theta = (\theta_1, \theta_2)$, and $\theta_0 \in \mathbb{R}^2$ is any fixed parameter. Then setting $T(G) := (E(G), T_2(G)/n)$, one can write
$$e^{Z_n(\theta) - Z_n(\theta_0)} = e^{-Z_n(\theta_0)} \sum_{G \in \mathcal{G}_n} e^{(\theta - \theta_0)' T(G)}\, e^{\theta_0' T(G)} = \mathbb{E}_{p_{\theta_0}} e^{(\theta - \theta_0)' T}.$$
Thus if $\{y^{(k)}\}_{k \ge 1} \subset \mathcal{G}_n$ is an irreducible, positive recurrent, aperiodic Markov chain with stationary distribution $p_{\theta_0}(\cdot)$, then by the Markov chain strong law of large numbers,
$$\frac{1}{N} \sum_{k=1}^N e^{(\theta - \theta_0)' T(y^{(k)})} \stackrel{a.s.}{\to} \mathbb{E}_{p_{\theta_0}} e^{(\theta - \theta_0)' T} = e^{Z_n(\theta) - Z_n(\theta_0)}.$$
Thus by sampling efficiently from one particular parameter configuration $\theta_0$, one can obtain an estimate of the difference $Z_n(\theta) - Z_n(\theta_0)$. Since the MLE in this model satisfies
$$\hat\theta_{ML} := \arg\max_{\theta \in \mathbb{R}^2} \big\{ e^{\theta' T(G) - [Z_n(\theta) - Z_n(\theta_0)]} \big\},$$
one can obtain an approximation to the MLE as
$$\hat\theta_{MCMLE} := \arg\max_{\theta \in \mathbb{R}^2} e^{\theta' T(G)} \Big[ \frac{1}{N} \sum_{k=1}^N e^{(\theta - \theta_0)' T(y^{(k)})} \Big]^{-1}.$$

Though the above method seems to work for all $\theta$ starting from one single $\theta_0$, in practice it produces reasonable estimates only when $\theta$ is close to $\theta_0$ ([35]). Thus to estimate the normalizing constant at a $\theta$ far away from $\theta_0$, an iterative scheme is used which progresses in small steps from $\theta_0$ to $\theta$. If the Markov chain mixes slowly, then at each step of this iterative procedure one should (in principle) run the Markov chain for a very long time, thus causing problems with numerical feasibility.
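A minimal sketch of the core MCMLE computation, assuming that (approximate) draws $y^{(k)}$ from $p_{\theta_0}$ are already available from some sampler (e.g. a long Glauber dynamics run), is the following Python function; the log-sum-exp trick is used for numerical stability.

import numpy as np

def log_Z_diff(samples_T, theta, theta0):
    # samples_T: N x d array with rows T(y^(k)); returns an estimate of
    # Z_n(theta) - Z_n(theta0) = log E_{p_theta0} exp((theta - theta0)' T)
    w = samples_T @ (np.asarray(theta) - np.asarray(theta0))
    m = w.max()
    return m + np.log(np.mean(np.exp(w - m)))

Plugging this estimate into the MLE criterion above, one then maximizes $\theta' T(G)$ minus the estimated difference over $\theta$ to obtain $\hat\theta_{MCMLE}$.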

3.2 Theoretical work

(a) Rates of Markov chains in ERGMs

One of the main papers dealing with the theoretical aspects of exponential families on graphs is the work of Bhamidi et al. in [5], where the authors analyze the rate of mixing of Glauber dynamics for a wide class of graph models. Stating their results requires the following definitions.

Definition 3.1. The mixing time $\tau_{mix}$ of a Markov chain is defined as the number of steps required in order to guarantee that the Markov chain, starting from an arbitrary state, is within total variation distance $e^{-1}$ of the stationary distribution.

A Markov chain on the space of graphs $\mathcal{G}_n$ is called local if at every step at most $o(n)$ many edges are updated.

As an example of a local Markov chain, consider the Glauber dynamics which at every stage chooses one pair $(i, j)$ with $1 \le i < j \le n$ uniformly at random (there are $\binom{n}{2}$ such pairs at each stage), and replaces $G(i, j)$ by an independent draw from the conditional distribution of $G(i, j)$ given the other edges. For the edge two star model, this conditional distribution is Bernoulli with parameter
$$\frac{e^{\theta_1 + (\theta_2/n) \sum_{k \ne i,j} (G(i,k) + G(j,k))}}{1 + e^{\theta_1 + (\theta_2/n) \sum_{k \ne i,j} (G(i,k) + G(j,k))}},$$
as noted above. At each step of this Markov chain at most one edge is changed, and so this chain is local.

For the edge two star model with $\theta_2 > 0$, their result says the following ([5, Theorems 5, 6]). Fix the parameter configuration $(\theta_1, \theta_2) \in \mathbb{R} \times (0, \infty)$, and define the function $\Psi_\theta : (0, 1) \mapsto \mathbb{R}$ by
$$\Psi_\theta(p) := \frac{e^{\theta_1 + 2\theta_2 p}}{1 + e^{\theta_1 + 2\theta_2 p}}.$$
If the equation $\Psi_\theta(p) = p$ has a unique solution $p_*$, and this solution satisfies $\Psi_\theta'(p_*) < 1$, then the mixing time of the Glauber dynamics is $\Theta(n^2 \log n)$, and hence polynomial in $n$. Such parameter configurations are referred to as being in the high temperature region, in keeping with terminology from statistical physics. As an example, for the edge two star model this happens for $\theta_1 = -1, \theta_2 = 1$. In this case the equation $\Psi_\theta(p) = p$ reduces to
$$p = \frac{e^{2p - 1}}{1 + e^{2p - 1}},$$
which can be shown to have the unique solution $p = 1/2$.

If on the other hand there are two solutions $p_*$ of $\Psi_\theta(p) = p$ satisfying $\Psi_\theta'(p_*) < 1$, then the mixing time of the Glauber dynamics is $e^{\Omega(n)}$, and so at least exponential in $n$. In fact the mixing time is exponential for any local Markov chain, and not just the Glauber dynamics. Such parameter configurations are referred to as being in the low temperature region. As an example, for the parameter configuration $\theta_2 = 4, \theta_1 = -4$ the equation $\Psi_\theta(p) = p$ reduces to
$$p = \frac{e^{8p - 4}}{1 + e^{8p - 4}},$$
which by a simple analysis has exactly three solutions $\{p_1 < 1/2 < p_2\}$, where $p_1 + p_2 = 1$.

It should be mentioned here that the above two domains of parameters (the high and low temperature regions) do not cover the whole parameter space, and no conclusion is made about the mixing time for the remaining parameter configurations. In fact there are some configurations for which at least one of the solutions $p_*$ of $\Psi_\theta(p) = p$ satisfies $\Psi_\theta'(p_*) = 1$. Such configurations are referred to as critical points. As an example, consider the parameter configuration $\theta_2 = 2, \theta_1 = -2$. In this case the equation $\Psi_\theta(p) = p$ reduces to
$$p = \frac{e^{4p - 2}}{1 + e^{4p - 2}},$$
which has the unique solution $p = 1/2$, with $\Psi_\theta'(1/2) = 1$.
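The classification of a parameter configuration can be checked numerically by locating all solutions of $\Psi_\theta(p) = p$ and evaluating $\Psi_\theta'$ there. A grid-based Python sketch for the three configurations above (the tolerances are a judgment call, not part of [5]) is:

import numpy as np

def psi(p, t1, t2):
    return 1.0 / (1.0 + np.exp(-(t1 + 2.0 * t2 * p)))

def dpsi(p, t1, t2):                   # psi'(p) = 2 theta_2 psi (1 - psi)
    q = psi(p, t1, t2)
    return 2.0 * t2 * q * (1.0 - q)

def fixed_points(t1, t2, grid=2000001, tol=1e-5):
    # near-zero local minima of |psi(p) - p|; this also catches tangential
    # (critical) fixed points, where psi(p) - p does not change sign
    p = np.linspace(0.0, 1.0, grid)
    g = np.abs(psi(p, t1, t2) - p)
    loc = (g[1:-1] <= g[:-2]) & (g[1:-1] <= g[2:]) & (g[1:-1] < tol)
    return sorted(set(round(float(q), 4) for q in p[1:-1][loc]))

for (t1, t2) in [(-1.0, 1.0), (-4.0, 4.0), (-2.0, 2.0)]:
    fps = fixed_points(t1, t2)
    print((t1, t2), [(q, round(dpsi(q, t1, t2), 3)) for q in fps])
# (-1,1): one root with psi' < 1 (high temperature); (-4,4): three roots
# (low temperature); (-2,2): unique root 1/2 with psi'(1/2) = 1 (critical)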

(b) ERGM on dense graphs

In [14] the authors established a large deviation result for the dense Erdős-Rényi graph with parameter $p \in (0, 1)$. Recall from chapter 2 that $\widetilde{\mathcal{W}}$ is the space of equivalence classes of $\mathcal{W}$, equipped with the cut metric $\delta_\Box$. Let $\tilde{\mathbb{P}}_{n,1/2}$ be the probability measure induced on $\widetilde{\mathcal{W}}$ through the map $G \mapsto \tilde f^G$, when $G$ is an Erdős-Rényi graph on $n$ vertices with parameter $1/2$. The following theorem gives a large deviation principle for $\tilde{\mathbb{P}}_{n,1/2}$ ([14, Theorem 2.3]).

Theorem. The sequence $\tilde{\mathbb{P}}_{n,1/2}$ satisfies a large deviation principle on $\widetilde{\mathcal{W}}$ with the good rate function $I_{CV}$, defined as
$$I_{CV}(\tilde f) := \frac{1}{2} \int_{[0,1]^2} \big[ f(x, y) \log f(x, y) + (1 - f(x, y)) \log(1 - f(x, y)) \big]\, dx\, dy + \frac{1}{2} \log 2.$$

It is easy to check that $I_{CV}(\cdot)$ takes the same value for all $f$ in the same equivalence class, and so $I_{CV}$ is well defined. Using this, in [12] the authors show that in a wide class of dense ERGMs with sub-graph counts as sufficient statistics, a typical sample converges in the sense of the cut metric to a finite mixture of Erdős-Rényi graphs.

As an example, suppose it is required to estimate the log normalizing constant of the edge two star model on $\mathcal{G}_n$. Recall that the edge two star model has p.m.f. $e^{\theta_1 E(G) + (\theta_2/n) T_2(G) - Z_n(\theta_1, \theta_2)}$, where
$$e^{Z_n(\theta_1, \theta_2)} = \sum_{G \in \mathcal{G}_n} e^{\theta_1 E(G) + (\theta_2/n) T_2(G)} = 2^{\binom{n}{2}}\, \mathbb{E}_{1/2}\, e^{n^2 \phi(G)}$$
with
$$\phi(G) := \frac{\theta_1}{2} \cdot \frac{2 E(G)}{n^2} + \frac{\theta_2}{2} \cdot \frac{2 T_2(G)}{n^3}.$$
Here the expectation is with respect to the Erdős-Rényi model with parameter $1/2$. Thus the computation of the log normalizing constant is equivalent to the computation of exponential moments under the Erdős-Rényi model. Also note that the function $\phi(G)$ is, up to an asymptotically negligible error, the same as
$$\frac{\theta_1}{2} \cdot \frac{|\hom(T_1, G)|}{n^2} + \frac{\theta_2}{2} \cdot \frac{|\hom(T_2, G)|}{n^3} = \frac{\theta_1}{2} t(T_1, \tilde f^G) + \frac{\theta_2}{2} t(T_2, \tilde f^G),$$
where $T_1$ is an edge and $T_2$ is a two star. By [9, Theorem 3.8] the map $\tilde f \in \widetilde{\mathcal{W}} \mapsto t(H, \tilde f)$ is continuous for any simple graph $H$, and so the function $\phi$ is continuous on $\widetilde{\mathcal{W}}$. Since $\phi$ is bounded as well, an application of Varadhan's lemma gives that

$$\lim_{n \to \infty} \frac{Z_n(\theta_1, \theta_2)}{n^2} = \sup_{\tilde f \in \widetilde{\mathcal{W}}} \Big\{ \frac{\theta_1}{2} t(T_1, \tilde f) + \frac{\theta_2}{2} t(T_2, \tilde f) - I(\tilde f) \Big\}.$$

The above optimization is over a space of functions, and can be difficult to solve in general. However, when the sub-graph counts are all $k$-stars it suffices to maximize over constant functions ([12, Theorem 6.4]). Thus setting $f(x, y) \equiv p$, the above optimization reduces to
$$\frac{1}{2} \sup_{p \in [0,1]} \big[ \theta_1 p + \theta_2 p^2 - p \log p - (1 - p) \log(1 - p) + \log 2 \big].$$

The above optimization problem has finitely many global maxima $\{p_1, p_2, \cdots, p_k\}$, and a random graph $G$ from the edge two star model is close to at least one of the $p_i$'s, in the sense that
$$\min_{1 \le i \le k} \delta_\Box(\tilde f^G, \tilde p_i) \stackrel{p}{\to} 0,$$
where $\tilde p_i$ denotes the constant graphon equal to $p_i$. This roughly means that a random graph $G$ from this model behaves as a finite mixture of Erdős-Rényi graphs with parameters $p_i$, when the number of vertices is large. Since the optimization problem is one dimensional, the $p_i$'s can be obtained numerically.

Differentiating with respect to $p$, it follows that any optimizing $p_i$ satisfies the equation
$$p = \frac{e^{\theta_1 + 2\theta_2 p}}{1 + e^{\theta_1 + 2\theta_2 p}}.$$
Not surprisingly, this is the same equation as obtained by [5].

(c) Mallows’ model with Kendall’s Tau

Of all the metrics that one can use in the Mallows' model, the most widely used is the Kendall's tau, defined as follows: given two permutations $\pi, \sigma \in S_n$, define $d(\pi, \sigma)$ to be the minimum number of pairwise inversions needed to transform $\pi^{-1}$ into $\sigma^{-1}$. If $\sigma = e$ is taken to be the identity, then $d(\pi, \sigma)$ becomes $\mathrm{Inv}(\pi)$, the number of inversions in the permutation $\pi$, i.e. the number of pairs $(i, j)$ such that $i < j$ and $\pi(i) > \pi(j)$.

Taking $\sigma = e$ as above, one can define an exponential family on permutations of the form
$$e^{-(\theta/n)\, \mathrm{Inv}(\pi) - Z_n(\theta)},$$
where $Z_n(\theta)$ is the log normalizing constant. In 2004 Diaconis and Ram ([26, (2.9)]) gave the following explicit answer for the normalizing constant:
$$e^{Z_n(\theta)} = \prod_{i=1}^n \frac{e^{-i\theta/n} - 1}{e^{-\theta/n} - 1}.$$

Using this formula it is easy to check that
$$\lim_{n \to \infty} \frac{Z_n(\theta) - Z_n(0)}{n} = \int_0^1 \log \frac{1 - e^{-\theta x}}{\theta x}\, dx.$$
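This limit is easy to confirm numerically; a short Python check (a sketch, with an arbitrary choice of $\theta$ and a midpoint-rule quadrature) compares $(Z_n(\theta) - Z_n(0))/n$ computed from the product formula, using $Z_n(0) = \log n!$, against the integral.

import numpy as np

def Zn_diff(theta, n):
    # Z_n(theta) - Z_n(0) from the product formula; Z_n(0) = log n!
    i = np.arange(1, n + 1)
    return np.sum(np.log(np.expm1(-i * theta / n) / np.expm1(-theta / n)) - np.log(i))

def limit(theta, m=200000):
    # midpoint rule for the integral of log((1 - e^{-theta x}) / (theta x)) over (0,1)
    x = (np.arange(m) + 0.5) / m
    return np.mean(np.log(-np.expm1(-theta * x) / (theta * x)))

theta = 2.0
print(Zn_diff(theta, 5000) / 5000, limit(theta))   # the two values nearly agree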

In 2011, S. Starr used a different argument to show that
$$\lim_{n \to \infty} \frac{Z_n(\theta) - Z_n(0)}{n} = \sup_{\mu \in \mathcal{M}} \Big\{ -\frac{\theta}{2} (\mu \times \mu)(h) - D(\mu || u) \Big\},$$
where $\mathcal{M}$ is the set of all probability measures on the unit square $[0,1]^2$ with uniform marginals, $h$ is the function on $([0,1]^2)^2$ given by $h((x_1, y_1), (x_2, y_2)) = 1_{(x_1 - x_2)(y_1 - y_2) < 0}$, $u$ is the uniform distribution on the unit square, and $D(\cdot||\cdot)$ is the Kullback-Leibler divergence between two probability measures as defined in chapter 2. The function $h$ can be thought of as mimicking the number of inversions of a permutation for a general permuton (measure on the unit square with uniform marginals). Solving this optimization problem, Starr showed that the unique maximum is attained at the measure with the following density with respect to the Lebesgue measure on the unit square:
$$u_\theta(x, y) = \frac{(\theta/2) \sinh(\theta/2)}{\big[ e^{\theta/4} \cosh(\theta[x - y]/2) - e^{-\theta/4} \cosh(\theta[x + y - 1]/2) \big]^2}.$$

Plugging in this density, it is not hard to check directly that one gets the same limit as obtained by the explicit calculation. Further, this calculation of Starr shows ([63, Theorem 1.1]) that if $\pi$ is a random permutation from this model, then $\pi$ converges in probability to the measure with density $u_\theta$ with respect to Lebesgue measure, where $u_\theta$ is as given above.
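As a small sanity check on the reconstructed formula for $u_\theta$ (an illustrative numerical sketch only), one can verify on a grid that $u_\theta$ integrates to 1 and has approximately uniform marginals, as any permuton density must.

import numpy as np

def u_theta(x, y, th):
    num = (th / 2.0) * np.sinh(th / 2.0)
    den = (np.exp(th / 4.0) * np.cosh(th * (x - y) / 2.0)
           - np.exp(-th / 4.0) * np.cosh(th * (x + y - 1.0) / 2.0)) ** 2
    return num / den

th, m = 1.5, 2000
x = (np.arange(m) + 0.5) / m                 # midpoint grid on (0,1)
X, Y = np.meshgrid(x, x)
U = u_theta(X, Y, th)
print(U.mean())                              # total mass, approximately 1
print(np.abs(U.mean(axis=0) - 1).max())      # deviation of the x-marginal from uniform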

Recall from chapter 2 the notion of convergence of permutations, which says the following: given a permutation $\pi \in S_n$, one can define a probability measure $\mu_\pi$ on $[0,1]^2$ by the following density with respect to Lebesgue measure:
$$f_\pi(x, y) = n\, 1\{\pi(\lceil nx \rceil) = \lceil ny \rceil\}.$$
A sequence of permutations is said to converge to a measure $\mu$ if the corresponding sequence of measures converges weakly to $\mu$. Any such limiting object is an element of $\mathcal{M}$, the set of all probability measures on $[0,1]^2$ with uniform marginals.

To show the convergence, Starr worked with $\nu_\pi$ as opposed to the $\mu_\pi$ introduced by [40], but as argued in chapter 2 the weak limits of $\mu_\pi$ and $\nu_\pi$ are the same.

Chapter 4

Exponential family on sparse graphs

4.1 Introduction

Exponential families are frequently used in the social science literature to model real life networks. For some references, see [33], [37], [39], [41], [56], [62], [66] and the references within. Such models are usually referred to as ERGMs in the social science community, an abbreviation for Exponential Random Graph Models. Starting with [37] in 2003, it has been noted in the social science literature that ERGMs with sub-graph counts don't behave in a nice manner in terms of sampling and estimation procedures. This phenomenon is typically referred to as degeneracy, but there is no universally accepted definition of degeneracy. This chapter will adopt a notion similar to [37],[62], where the degeneracy of a model is attributed to the sufficient statistics of the model. That is, the model will be deemed non-degenerate iff it behaves “nicely” for all choices of the parameter values. Thus under this notion, a degenerate model is caused by one or more degenerate statistics, and so the term degenerate will be used both for the model as well as for the statistic.

One of the features of degeneracy is that such models place most of their mass on a very small sub-collection of graphs, which are either too sparse or too dense. Thus an MCMC sample from such a model almost invariably gives either a very sparse graph or a very dense graph. Another feature of such models is that small changes in the parameter can cause a large change in the underlying model; as such, parameter estimates obtained from such models are usually not stable. The intuitive idea behind this reasoning is that in such models the neighboring edges are highly correlated. This causes a cascading effect through the graph, and so the model ends up putting most of its mass on very sparse or very dense graphs. Thus, in a sense, such models capture “too much information”.

It was subsequently noted in [62] in 2006 that not all ERGMs exhibit degeneracy in empirical studies. In fact, in this paper the authors argue that using modified versions of sub-graph counts can reduce this problem to a large extent. The modifications are specifically aimed at reducing correlations between edges, and empirical studies seem to confirm this intuition.

The next definition gives the notation needed to introduce some of the examples from [62] which are non-degenerate at an empirical level.

Definition 4.1. Let $\mathcal G_n$ denote the space of all simple labelled undirected graphs on $n$ vertices. For any $G \in \mathcal G_n$ let $d(G) = (d_1(G), \cdots, d_n(G))$ denote the labeled degree sequence of $G$, i.e. $d_i(G)$ is the degree of vertex $i$. Also let $E(G) := \frac{1}{2}\sum_{j=1}^{n} d_j(G)$ denote the number of edges in $G$.

For $0 \le i \le n-1$, let $h_i(G) := \#\{1 \le j \le n : d_j(G) = i\}$ denote the number of vertices of degree $i$. Summing over $i$ gives $\sum_{i=0}^{n-1} h_i(G) = n$, since the sum is over all the vertices of $G$. The quantity $h(G) := \{h_i(G)\}_{i=0}^{n-1}$ will be referred to as the degree frequency vector.

Recall that a $k$-star is the complete bipartite graph $K_{1,k}$ with $k$ edges and $k+1$ vertices. For any $k \ge 2$, let $T_k(G)$ denote the number of copies of the $k$-star in $G$. The counting scheme is such that all copies of the $k$-star are considered, and not just the induced ones. For example, by this definition $T_2 = 3$ for a triangle. This counting scheme gives the following simple formulae for $T_k(G)$ in terms of the degrees $d(G)$, as well as the degree frequency vector $h(G)$:

$$T_k(G) = \sum_{j=1}^{n}\binom{d_j(G)}{k} = \sum_{i=0}^{n-1} h_i(G)\binom{i}{k}.$$

The explanation for the above equality is as follows:

For any vertex $j$, there are $\binom{d_j(G)}{k}$ $k$-stars with $j$ as the center vertex. Adding over $j$ gives the total number of $k$-stars. The second equality follows by rearranging the first sum.

(a) Geometrically weighted degree statistic: The geometrically weighted degree statistic has the form

$$\mathrm{gwd}_\alpha(G) := \sum_{i=0}^{n-1} e^{-\alpha i}\, h_i(G),$$

where $\alpha > 0$ is a fixed parameter. The geometrically decaying weights ensure that the contribution of vertices with large degree is negligible. Thus as the degrees of the graph increase the statistic does not grow too fast, and the cascading effect of this statistic is reduced.

(b) The alternating k-star: For a fixed parameter λ > 1, the alternating k-star is defined as

$$\mathrm{aks}_\lambda(G) := \sum_{k=2}^{n-1}\frac{(-1)^k}{\lambda^{k-2}}\, T_k(G),$$

where the $T_k$'s are the $k$-star counts defined above. In this case again the geometrically decaying weights ensure that the cascading effects of the higher star counts are reduced. Also, because of the alternating signs, the cascading effect of consecutive terms is cancelled to a large extent.

As a comment, note that using the formula for Tk(G) in terms of the degree frequency vector h(G), the alternating k-star statistic can be written as

$$\mathrm{aks}_\lambda(G) = \lambda^2\sum_{i=0}^{n-1}\Big[\Big(1-\frac{1}{\lambda}\Big)^i - 1 + \frac{i}{\lambda}\Big] h_i(G) = \lambda^2\,\mathrm{gwd}_\alpha(G) - n\lambda^2 + 2\lambda E(G)$$

with $e^{-\alpha} = 1 - 1/\lambda$, i.e. $\alpha = -\log(1-1/\lambda)$. Thus the two statistics $\mathrm{gwd}_\alpha$ and $\mathrm{aks}_\lambda$ are connected by a simple formula, and both are functions of the degree frequency vector $h$.
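To make the statistics concrete, the following is a minimal Python sketch (not from the thesis; the function names and the NumPy representation are illustrative choices) computing $T_k$, $\mathrm{gwd}_\alpha$ and $\mathrm{aks}_\lambda$ from a labelled degree sequence via the identities above:

```python
import numpy as np
from math import comb

def degree_frequency(degrees, n):
    """Degree frequency vector h, where h[i] = #{vertices of degree i}."""
    return np.bincount(degrees, minlength=n)[:n]

def k_stars(h, k):
    """T_k(G) = sum_i h_i * C(i, k): all (not just induced) k-stars."""
    return sum(int(hi) * comb(i, k) for i, hi in enumerate(h))

def gwd(h, alpha):
    """Geometrically weighted degree statistic sum_i e^{-alpha i} h_i."""
    i = np.arange(len(h))
    return float(np.exp(-alpha * i) @ h)

def aks(h, lam):
    """Alternating k-star via lam^2 * sum_i [(1 - 1/lam)^i - 1 + i/lam] h_i."""
    i = np.arange(len(h))
    return float(lam ** 2 * (((1 - 1 / lam) ** i - 1 + i / lam) @ h))
```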

One can restrict attention to statistics of the form $\sum_{i=0}^{n-1} f(i)h_i(G)$, where $f : \mathbb N_0 := \mathbb N \cup \{0\} \mapsto \mathbb R$, and ask when this statistic is well behaved. The choice of this statistic is justified by the observation that all three examples are of this form. As an illustration, the $k$-star, $\mathrm{gwd}_\alpha$ and $\mathrm{aks}_\lambda$ correspond to $f(i) = \binom{i}{k}$, $e^{-\alpha i}$ and $\lambda^2[(1-1/\lambda)^i - 1 + i/\lambda]$ respectively. The results of this chapter give a clean sufficient condition for non-degeneracy as follows:

$$\limsup_{i\to\infty}\frac{|f(i)|}{i} < \infty. \qquad (4.1)$$

The reason for this is that under the condition (4.1) the statistic $\sum_{i=0}^{n-1} f(i)h_i(G)/n$ turns out to be continuous with respect to a suitable topology, to be described in the next section.

In particular, (4.1) holds for $\mathrm{gwd}_\alpha$ for $\alpha > 0$, as $f(i) = e^{-\alpha i}$ is bounded, and for $\mathrm{aks}_\lambda$ for $\lambda > 1$, as $f(i) = \lambda^2[(1-1/\lambda)^i - 1 + i/\lambda]$ is dominated by the linear term. On the other hand, the number of $k$-stars does not satisfy (4.1).

The main tool for these results is the analysis of sparse graphs, as opposed to dense graphs as in [5] and [12]. Recall that in a dense graph the typical number of edges is $O(n^2)$ and a typical degree is $O(n)$, as opposed to sparse graphs, where a typical graph has $O(n)$ edges and the typical degree of a vertex is $O(1)$. One reason it is interesting to model sparse graphs is that most real life networks seem to be sparse. Another reason is that the dense graph theory does not provide a good explanation for why the modified versions of sub-graph counts mentioned above ($\mathrm{aks}_\lambda$ and $\mathrm{gwd}_\alpha$) are non-degenerate, whereas the plain sub-graph count statistics are degenerate.

This chapter will show that the star counts $T_k$ (which are examples of sub-graph counts) are bad (see Theorem 4.8), and the modified sub-graph counts defined above ($\mathrm{gwd}_\alpha$ and $\mathrm{aks}_\lambda$) are good (see Theorem 4.5). This framework will also give a sufficient criterion for establishing non-degeneracy for a class of graph statistics. For a unified treatment of all three examples, consider a two parameter exponential family on $\mathcal G_n$ of the form

$$Q_{n,\theta,\beta,f}(G) := \Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)} e^{\theta\sum_{i=0}^{n-1} h_i(G)f(i) - Z_n(\theta,\beta,f)} \qquad (4.2)$$

where f : N∪{0} 7→ R, and Zn(θ, β, f) is the (unknown) log normalizing constant. This is a two parameter exponential family with sufficient statistics

$$\Big[E(G),\ \sum_{i=0}^{n-1} h_i(G)f(i)\Big].$$

For $f \equiv 0$ this model reduces to the Erdős-Rényi model with parameter $\beta/n$, which puts most of its mass on sparse graphs. Thus for reasonable choices of $f(.)$, the same should be true for all values of $\theta$.

It should be noted at this point that $Q_{n,\theta,\beta,f}$ is not the same as the $\beta$-model studied by Chatterjee-Diaconis-Sly in [13], even though both models have the degree sequence as sufficient statistics. The $\beta$-model is an exponential family on $\mathcal G_n$ whose sufficient statistic is of the form

$$\sum_{j=1}^{n}\beta_j d_j(G),$$

where $\beta = (\beta_1,\cdots,\beta_n)$ is an $n$ dimensional parameter. In [13] the authors worked in the dense graph regime and showed that if the components of the parameter vector $\beta$ stay uniformly bounded, then all entries of $\beta$ can be simultaneously estimated consistently. In a similar manner, proposition 4.18 shows the slightly weaker result that one can estimate the value of the function $f$ at every fixed $i$, under a condition slightly weaker than (4.1).

For analyzing the model of (4.2) it suffices to study the degree sequence. The following definition encodes the entire degree sequence as one probability measure on non-negative integers.

Definition 4.2. Given the labelled degrees $(d_1(G),\cdots,d_n(G))$ of a graph, the empirical distribution of the degree sequence is defined by $\mu_n := \frac{1}{n}\sum_{j=1}^{n}\delta_{d_j(G)}$, i.e. $\mu_n$ is the measure which puts mass $1/n$ at each of the observed degrees $d_j(G)$, and is a probability measure on $\mathbb N_0 := \mathbb N \cup \{0\}$.

An equivalent definition of µn can be given in terms of the degree frequency vector h(G) as follows:

$\mu_n$ is the probability measure which puts mass $h_i(G)/n$ at $i$, for $0 \le i \le n-1$. With this definition, any statistic of the form $\sum_{i=0}^{n-1} f(i)h_i(G)$ can be written as $n\mu_n[f]$, where $\mu[f]$ denotes the mean of $f$ with respect to the measure $\mu$ (when it exists), i.e.

$$\mu[f] := \sum_{i=0}^{\infty}\mu(i)f(i).$$

Note that $\mu_n$ does depend on $G$, but this dependence will not be made explicit for simplicity of notation.

4.2 Statement of main results

Stating the main results requires the introduction of a few definitions.

Definition 4.3. Let

$$\mathcal S := \Big\{\mu \in \mathcal P(\mathbb N_0) : \bar\mu := \sum_{i=1}^{\infty} i\,\mu(i) < \infty\Big\}.$$

Thus S is the set of all probability measures on N0 with finite mean.

Equip $\mathcal S$ with the following topology:

$$\nu_n, \nu \in \mathcal S, \quad \nu_n \to \nu \ \text{ if } \ \nu_n \xrightarrow{w} \nu \ \text{ and } \ \bar\nu_n \to \bar\nu.$$

By Scheffe’s theorem, convergence in S is also equivalent to convergence in the metric

∞ X ||µ − ν|| mean = i|µ(i) − ν(i)|. i=1

Note in passing that $\mathcal S$ is not compact with respect to weak convergence, and hence not compact with respect to the metric $\|\mu-\nu\|_{\mathrm{mean}}$. For example, the measures $\nu_n = \frac{1}{n}\delta_n + (1-\frac{1}{n})\delta_0$ have $\bar\nu_n = 1$ for all $n$, and so belong to $\mathcal S$. However, the weak limit of $\nu_n$ is $\delta_0$, for which $\bar\delta_0 = 0$. Thus the sequence $\nu_n$ does not converge in the sense of the metric $\|.\|_{\mathrm{mean}}$.

The reason $\mathcal S$ turns out to be an appropriate space for the analysis of $\mu_n$ is that the mean of $\mu_n$ (which is $2E(G)/n$) is bounded with high probability under the Erdős-Rényi$(\beta/n)$ model. The convergence defined above is stronger than weak convergence, and in spirit demands uniform integrability as well.

The following definition gives a sufficient condition on the function $f$ under which the model $Q_{n,\theta=1,\beta,f}$ is well behaved.

Definition 4.4. Let F denote the set of all functions f from N0 to R such that

$$\limsup_{i\to\infty}\frac{f(i)}{i} < \infty.$$

This restriction essentially means that $\max(f,0)$ is allowed to grow at most linearly for large $i$. For example, the functions $f(i) = \log(1+i)$, $f(i) = i(-1)^i$ and $f(i) = -i!$ are in $\mathcal F$, whereas $f(i) = i\log(1+i)$ and $f(i) = i^2$ are not. An intuitive explanation for the above sufficient condition is that if $f$ is exactly linear, then the statistic $\sum_{i=0}^{n-1} f(i)h_i(G)$ is equivalent to the number of edges $E(G)$. Thus if $|f|$ grows at most linearly, the model will not be too far from an Erdős-Rényi model, and hence should be reasonably well behaved.

For any such $f$ there exist finite positive constants $C_1, C_2$ such that $f(i) \le C_1 + C_2 i$ for all $i \in \mathbb N_0$. Thus if $f \in \mathcal F$ and $\mu \in \mathcal S$, then

$$\mu[f] := \sum_{i=0}^{\infty}\mu(i)f(i) \le C_1 + C_2\bar\mu < \infty.$$

In particular, µ[f] is well defined, and lies in [−∞, ∞).

For $f \in \mathcal F$, $u \ge 0$, define an exponential family on $\mathbb N_0$ with probability mass function

$$\sigma_{u,f}(i) = \frac{1}{i!}\, u^i e^{f(i) - Z(u,f)},$$

where $Z(u,f)$ is the log normalizing constant, i.e.

$$Z(u,f) := \log\Big(\sum_{i=0}^{\infty}\frac{1}{i!}\, u^i e^{f(i)}\Big).$$

It can be readily checked that the assumption $f \in \mathcal F$ implies $Z(u,f) < \infty$, and moreover $\sigma_{u,f}$ has finite mean, i.e. $\sigma_{u,f} \in \mathcal S$. Let $\Omega_f \subset \mathcal S$ denote the set of all probability measures of the form $\sigma_{u,f}$ for $u \ge 0$. Also let $m(u,f) := \bar\sigma_{u,f}$ denote the mean of $\sigma_{u,f}$.
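Since $f \in \mathcal F$ makes the summand decay super-exponentially, $Z(u,f)$, $m(u,f)$ and the pmf itself can be approximated by truncating the series. A minimal sketch in Python, assuming $u > 0$ and that the truncation point imax (an ad hoc assumption) is large enough that the neglected tail is negligible:

```python
import numpy as np
from math import lgamma

def sigma_pmf(u, f, imax=200):
    """Truncated pmf sigma_{u,f}(i) proportional to u^i e^{f(i)} / i!,
    together with Z(u, f) and the mean m(u, f); f is a callable on N_0."""
    i = np.arange(imax + 1)
    log_terms = (i * np.log(u) + np.array([f(int(k)) for k in i])
                 - np.array([lgamma(k + 1) for k in i]))
    c = log_terms.max()                      # log-sum-exp for stability
    Z = c + np.log(np.exp(log_terms - c).sum())
    pmf = np.exp(log_terms - Z)
    return pmf, Z, float(i @ pmf)
```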

Finally, for $f \in \mathcal F$, $\beta > 0$ let $J(f,\beta)$ denote the value of the following optimization problem:

$$J(f,\beta) := \sup_{u\ge0}\Big\{Z(u,f) - m(u,f)\log u + \frac{m(u,f)}{2}\log(m(u,f)\beta) - \frac{m(u,f)+\beta}{2}\Big\}.$$

Since the definition of $J(f,\beta)$ involves an optimization over the non-negative reals $u$, numerical computation of $J(f,\beta)$ is very easy to do.
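For instance, a one-dimensional grid search for $J(f,\beta)$ might look as follows, reusing sigma_pmf from the sketch above (the grid range is an assumption and should be widened until the maximizer is interior):

```python
def J_grid(f, beta, u_grid=None):
    """Grid search for J(f, beta) over non-negative u (a sketch)."""
    if u_grid is None:
        u_grid = np.linspace(1e-3, 20.0, 4000)
    best = -np.inf
    for u in u_grid:
        _, Z, m = sigma_pmf(u, f)
        val = Z - m * np.log(u) + 0.5 * m * np.log(m * beta) - 0.5 * (m + beta)
        best = max(best, val)
    return best
```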

The first main result of this chapter is the following theorem, which gives the asymptotics of the log normalizing constant under the assumption that $f \in \mathcal F$. Recall from (4.2) the definition of the probability mass function of $Q_{n,\theta,\beta,f}$ on $\mathcal G_n$, which is

$$Q_{n,\theta,\beta,f}(G) = \Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)} e^{\theta\sum_{i=0}^{n-1} h_i(G)f(i) - Z_n(\theta,\beta,f)}.$$

Theorem 4.5. For $f \in \mathcal F$, let $G$ be a random graph sampled from the exponential family $Q_{n,\theta=1,\beta,f}$, with $Q_{n,\theta,\beta,f}$ as in (4.2). Then

(a) $$\lim_{n\to\infty}\frac{1}{n} Z_n(1,\beta,f) = J(f,\beta),$$ with $J(f,\beta)$ as in definition 4.4.

(b) The supremum in the definition of $J(f,\beta)$ is finite, and is attained on a finite set of non-negative reals $\{u_1,u_2,\cdots,u_k\}$ (depending on $f,\beta$), and

$$\min_{1\le i\le k}\|\mu_n - \sigma_{u_i,f}\|_{\mathrm{mean}} \xrightarrow{p} 0,$$

where $\mu_n$ is the empirical degree distribution of $G$.

Part (b) of the above theorem says that the empirical degree distribution $\mu_n$ roughly behaves like a mixture of $\{\sigma_{u_i,f}\}_{i=1}^{k}$. An immediate corollary of the above theorem is the following result, which gives a sufficient condition for a statistic to be well behaved for all values of the parameter.

Corollary 4.6. For $|f| \in \mathcal F$, consider the exponential family $Q_{n,\theta,\beta,f}$. Then

$$\lim_{n\to\infty}\frac{1}{n} Z_n(\theta,\beta,f) = J(\theta f,\beta),$$

and $J(\theta f,\beta)$ is finite for all $\theta \in \mathbb R$. Further, there exists a finite set of non-negative reals $\{u_1,u_2,\cdots,u_k\}$ (depending on $f,\beta,\theta$) such that

$$\min_{1\le i\le k}\|\mu_n - \sigma_{u_i,\theta f}\|_{\mathrm{mean}} \xrightarrow{p} 0.$$

Remark 4.7. The above corollary shows that if |f| grows at most linearly, then the corresponding model Qn,θ,β,f is well behaved for both positive and negative θ, in the sense that the empirical degree distribution stabilizes for large n.

Another interesting conclusion is that none of the limit points of the degree distribution is a Poisson distribution, as $\sigma_{u,\theta f}$ is not a Poisson distribution unless $f$ is identically $0$ or linear, in which case the model itself is a sparse Erdős-Rényi model (possibly with a different parameter). On the other hand, the empirical degree distribution of a sparse Erdős-Rényi graph converges to a Poisson distribution. Thus, at least in terms of the empirical degree distribution, ERGMs on sparse graphs do not behave like Erdős-Rényi graphs. Also, in the case of sparse ERGMs it is possible to estimate multiple parameters consistently from a single large graph. In particular, see proposition 4.17, which constructs consistent estimates for $(\beta,\theta)$ when $f$ is known, and proposition 4.18, which constructs consistent estimates for the function $f$.

The next theorem shows that some condition on $f$ is required for the model $Q_{n,\theta,\beta,f}$ to be well behaved. In particular, if $f(i) = \binom{i}{k}$ (and so $\sum_{i=0}^{n-1} f(i)h_i = T_k$, the number of $k$-stars), then the model undergoes a drastic change when $\theta$ crosses the origin.

Theorem 4.8. For $f(i) = \binom{i}{k}$, consider the exponential family $Q_{n,\theta,\beta,f}(.)$ as in (4.2), and let $\beta > 0$ be arbitrary.

(a) If $\theta > 0$ then $$\lim_{n\to\infty}\frac{1}{n} Z_n(\theta,\beta,f) = \infty.$$

(b) If $\theta < 0$ then $$\lim_{n\to\infty}\frac{1}{n} Z_n(\theta,\beta,f) = J(\theta f,\beta) \in (-\infty,0).$$

The supremum in the definition of $J(\theta f,\beta)$ is achieved on a finite set of non-negative reals $\{u_1,\cdots,u_k\}$, and

$$\min_{1\le i\le k}\|\mu_n - \sigma_{u_i,\theta f}\|_{\mathrm{mean}} \xrightarrow{p} 0.$$

Remark 4.9. Theorem 4.8 shows that the behavior of the model $Q_{n,\theta,\beta,f}$ changes drastically at the origin. For $\theta < 0$ the degree distribution stabilizes, and any limit point of the degree distribution is of the form $\sigma_{u,\theta f}$ for some $u \ge 0$. Also, most of the mass of this model is on graphs where the number of edges is $O(n)$. Thus in this regime the model is well behaved, and the log normalizing constant divided by $n$ converges to a finite negative number.

On the other hand, for $\theta > 0$ the log normalizing constant scales faster than $n$, and so the limit blows up. This is because most of the mass of the model has now shifted to graphs with more than $O(n)$ edges, and so the degree of a typical vertex blows up. In this regime the model is not well behaved.

Thus the limiting normalizing constant jumps from a negative number for θ < 0 to ∞ for θ > 0, bypassing all positive finite values, and there seems to be no uniform scaling in this model which prevents this behavior. Note that for θ < 0 the function θf is in F as defined in 4.4, whereas for θ > 0 this is no longer true.

Even though Theorem 4.5 and Theorem 4.8 characterize the set of all possible limit points of the limiting degree distribution (when the limit does not blow up), they fall short of establishing weak convergence of the degree distribution, nor do they give a closed form expression for the optimizing value(s) of $u$ in general. If however there is a unique maximizer to the corresponding optimization problem, then weak convergence readily follows.

The outline of the rest of the chapter is as follows: Section 3 introduces the large deviation principle for the empirical degree distribution of an Erdős-Rényi graph. Section 4 proves the main results of this chapter using the large deviation principle. Section 5 explains, with an example, how to apply the results of this chapter to estimate normalizing constants. Section 6 proves the existence of consistent estimators for both parameters in the model $Q_{n,\theta,\beta,f}$, and gives one way to fit this model. Finally, section 7 is devoted to proving the large deviation principle stated in section 3.

4.3 The large deviation principle

The main tool for proving the results of the previous section is a large deviation principle for $\mu_n$ on $\mathcal S$, equipped with the topology induced by $\|.\|_{\mathrm{mean}}$ (see definition 4.3). To see how large deviations come into the picture, note that the normalizing constant of the model $Q_{n,\theta,\beta,f}$ can be written as

$$e^{Z_n(\theta,\beta,f)} = \mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]},$$

where $P_{n,\beta}$ is the Erdős-Rényi model with parameter $\beta/n$. By Varadhan's Lemma, this reduces the problem to studying the large deviations of $\mu_n$ under the Erdős-Rényi$(\beta/n)$ model. It should be noted here that the above large deviations problem with respect to the topology of weak convergence (convergence in distribution) has already been studied in [27, Corollary 2.2] and in [8, Theorem 1.8]. In the second paper the authors prove a large deviations result for the whole graph under the topology of local weak convergence, and the present large deviation result for the weak topology follows as a simple consequence of their general result. Using their results one can show that any bounded $f(i)$ works (and so $\mathrm{gwd}_\alpha$ is non-degenerate), but one cannot directly handle $f(i)$ growing with $i$ (and so no conclusion is reached for $\mathrm{aks}_\lambda$).

In contrast, this chapter develops large deviations for just the empirical degree distribution $\mu_n$, but with a stronger topology which demands convergence in mean as well, and uses this to show that both $\mathrm{gwd}_\alpha$ and $\mathrm{aks}_\lambda$ are non-degenerate, extending the sufficient condition for non-degeneracy to $f$ growing at most linearly. The exact cut-off for non-degeneracy seems to be $O(i\log i)$, which should be doable by a finer calculation. The intuitive reason for this cut-off is that the dominating term in the Poisson probability mass function is $1/i!$, which decays like $e^{-i\log i}$. However this version is not carried out, as the calculations with $O(i)$ are much cleaner, and I cannot think of an interesting example in between $O(i)$ and $O(i^2)$ to merit the finer analysis.

The following definition introduces the rate function for the large deviations principle, henceforth denoted by l.d.p. for convenience.

Definition 4.10. Define the function ISP : S 7→ (−∞, ∞] by

$$I_{SP}(\mu) := \sum_{i=0}^{\infty}\mu(i)\log(i!\,\mu(i)) - \frac{\bar\mu}{2}\log(\bar\mu\beta) + \frac{\bar\mu+\beta}{2} = D(\mu\|p_\beta) + \frac{1}{2}(\bar\mu-\beta) + \frac{\bar\mu}{2}\log\beta - \frac{\bar\mu}{2}\log\bar\mu,$$

where $D(.\|.)$ is the Kullback-Leibler divergence. Recall that $\bar\mu = \sum_{i=1}^{\infty} i\mu(i)$ is the mean under the measure $\mu$, and $p_\beta$ is the Poisson distribution with parameter $\beta$.
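For numerical experiments, $I_{SP}$ can be evaluated directly from its first expression for a finitely supported $\mu$; a sketch, assuming $\mu$ is given as a probability vector over $\{0,1,\dots\}$ and using the $0\log 0 = 0$ convention:

```python
import numpy as np
from math import lgamma

def I_SP(mu, beta):
    """Rate function I_SP(mu) = sum_i mu(i) log(i! mu(i))
    - (mu_bar/2) log(mu_bar * beta) + (mu_bar + beta)/2."""
    i = np.arange(len(mu))
    mu_bar = float(i @ mu)
    mask = mu > 0
    lg = np.array([lgamma(k + 1) for k in i])
    s = float((mu[mask] * (np.log(mu[mask]) + lg[mask])).sum())
    mid = 0.0 if mu_bar == 0 else 0.5 * mu_bar * np.log(mu_bar * beta)
    return s - mid + 0.5 * (mu_bar + beta)
```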

It will be shown in Lemma 4.12 and Corollary 4.14 that $I_{SP}$ is non-negative and a good rate function (i.e. its level sets are compact). The large deviation principle for $\mu_n$ under the Erdős-Rényi model $P_{n,\beta}$ is now stated below. As a convention, the infimum over an empty set is taken to be $\infty$.

Theorem 4.11. For any $A \subset \mathcal S$,

$$\limsup_{n\to\infty}\frac{1}{n}\log P_{n,\beta}(\mu_n \in A) \le -\inf_{\mu\in A} I_{SP}(\mu), \qquad (4.3)$$
$$\liminf_{n\to\infty}\frac{1}{n}\log P_{n,\beta}(\mu_n \in A) \ge -\inf_{\mu\in A^\circ} I_{SP}(\mu). \qquad (4.4)$$

In particular, $\mu_n$ satisfies a large deviation principle in $\mathcal S$ with the good rate function $I_{SP}$.

The l.d.p. follows from (4.3) and (4.4), both of which will be proved in section 7.

Note that it is possible to assign probability to any A ⊂ S, as the probability Pn,β puts mass only on a finite set in S.

A direct application of the above large deviations result computes the log normalizing constant in terms of an optimization problem over $\mathcal S$, a set of probability distributions. The remainder of the section is devoted to studying the properties of the rate function. These properties will be used to reduce the above optimization problem to a one-dimensional optimization over the non-negative reals.

Lemma 4.12. Let $I_{SP}$ be as defined in Definition 4.10. Then for any $\alpha \in \mathbb R$, the set $I_\alpha := \{\mu \in \mathcal S : I_{SP}(\mu) \le \alpha\}$ is compact, i.e. for any sequence $\nu_n \in I_\alpha$ there exists a subsequence $\nu_{n_k}$ which converges in $I_\alpha$.

Proof. To begin, first note that $D(.\|.)$ is lower semi-continuous with respect to the weak topology, and so lower semi-continuous with respect to $\|.\|_{\mathrm{mean}}$. Also, by definition $\mu \mapsto \bar\mu$ is continuous with respect to $\|.\|_{\mathrm{mean}}$, and so it follows that $I_{SP}$ is lower semi-continuous, and hence $I_\alpha$ is closed. Thus to show compactness it suffices to show that there exists a subsequence $\nu_{n_k}$ of the original sequence which converges in $\mathcal S$.

Proceeding to show this, first note that

$$\log i! = \sum_{k=1}^{i}\log k \ge \int_{0}^{i}\log x\, dx = i\log i - i,$$

and so

$$\sum_{i=0}^{\infty}\mu(i)\log i! \ge \sum_{i=0}^{\infty} i\log i\,\mu(i) - \bar\mu \ge \bar\mu\log\bar\mu - \bar\mu, \qquad (4.5)$$

where the last inequality follows by Jensen's inequality on noting that $x\log x$ is convex.

Consider next $\nu \in \mathcal S$ defined by $\nu(i) := 2^{-(i+1)}$. It follows that

$$\sum_{i=0}^{\infty}\mu(i)\log\mu(i) = D(\mu\|\nu) + \sum_{i=0}^{\infty}\mu(i)\log\nu(i) \ge -(\bar\mu+1)\log 2. \qquad (4.6)$$

Thus combining (4.5) and (4.6) gives

$$I_{SP}(\mu) \ge \frac{1}{2}\bar\mu\log\bar\mu - \bar\mu\Big(\log 2 + \frac{1+\log\beta}{2}\Big) + \frac{\beta}{2} - \log 2 = g_1(\bar\mu), \qquad (4.7)$$

where $g_1(x) := \frac{1}{2}x\log x - x\big(\log 2 + \frac{1+\log\beta}{2}\big) + \frac{\beta}{2} - \log 2$. Since $g_1(x)$ is continuous and diverges to $\infty$ as $x \to \infty$, it follows that $g_1(\bar\mu) \le \alpha$ implies $\bar\mu \le C$ for some $C < \infty$.

By a first moment argument using Markov's inequality it follows that $I_\alpha$ is tight in the weak topology, and so by Prohorov's theorem there exists a subsequence $\nu_{n_k}$ which converges weakly in $\mathcal P(\mathbb N_0)$ to some $\nu$.

To complete the proof one has to show that $\bar\nu < \infty$ and $\bar\nu_{n_k} \to \bar\nu$. For this it suffices to show that the laws $\nu_{n_k}$ are uniformly integrable, which will follow if it can be shown that $\sup_{\mu\in I_\alpha}\sum_{i=0}^{\infty} i\log i\,\mu(i) < \infty$. To check the last fact, note that by (4.5) and (4.6), for any $\mu \in I_\alpha$,

$$\sum_{i=1}^{\infty} i\log i\,\mu(i) \le \bar\mu + \sum_{i=0}^{\infty}\mu(i)\log i! = I_{SP}(\mu) + \frac{\bar\mu-\beta}{2} + \frac{\bar\mu}{2}\log(\bar\mu\beta) - \sum_{i=0}^{\infty}\mu(i)\log\mu(i)$$
$$\le I_{SP}(\mu) + \frac{\bar\mu-\beta}{2} + \frac{\bar\mu}{2}\log(\bar\mu\beta) + (\bar\mu+1)\log 2 \le \alpha + \sup_{0\le x\le C} g_2(x) < \infty,$$

where $g_2(x) := \frac{x-\beta}{2} + \frac{x}{2}\log(x\beta) + (x+1)\log 2$.

The second lemma develops tools to be used in section 4 to derive theorem 4.5 and theorem 4.8.

Lemma 4.13. Let f ∈ F . Then

(a) The function $\mu \mapsto \mu[f]$ is upper semi-continuous from $\mathcal S$ to $\mathbb R$.

(b) $\inf_{\mu\in\mathcal S}\{I_{SP}(\mu) - \mu[f]\}$ is finite and equals $-J(f,\beta)$, with $J(f,\beta)$ as in definition 4.4. Further, the infimum in this definition is attained over a finite set of non-negative reals $\{u_1,\cdots,u_k\}$, and any optimizing $u$ satisfies the relation $u = \sqrt{\beta\,\bar\sigma_{u,f}}$.

(c) For any open $U \subset \mathcal S$ containing $\{\sigma_{u_1,f},\cdots,\sigma_{u_k,f}\}$,

$$\inf_{\mu\in U^c}\{I_{SP}(\mu) - \mu[f]\} > \inf_{\mu\in\mathcal S}\{I_{SP}(\mu) - \mu[f]\}.$$

Proof. Since $f \in \mathcal F$, there exist finite positive constants $C_1, C_2$ such that $f(i) \le C_1 + C_2 i$ for all $i \ge 0$.

(a) Let $\nu_k \to \nu$ in the metric $\|.\|_{\mathrm{mean}}$. Then for any $N \ge 1$ one can bound $\nu_k[f]$ by

$$\sum_{i=0}^{N}\nu_k(i)f(i) + C_1\sum_{i=N+1}^{\infty}\nu_k(i) + C_2\sum_{i=N+1}^{\infty} i\,\nu_k(i) \le \sum_{i=0}^{N}\nu_k(i)f(i) + (C_1+C_2)\|\nu_k-\nu\|_{\mathrm{mean}} + C_1\sum_{i=N+1}^{\infty}\nu(i) + C_2\sum_{i=N+1}^{\infty} i\,\nu(i).$$

Taking limits as $k \to \infty$ gives

$$\limsup_{k\to\infty}\nu_k[f] \le \sum_{i=0}^{N}\nu(i)f(i) + C_1\sum_{i=N+1}^{\infty}\nu(i) + C_2\sum_{i=N+1}^{\infty} i\,\nu(i).$$

Since $\bar\nu < \infty$, letting $N \to \infty$ gives

$$\limsup_{k\to\infty}\nu_k[f] \le \nu[f],$$

and so $\mu \mapsto \mu[f]$ is upper semi-continuous.

(b) For the particular choice $\mu = \delta_0$ it is easy to check that $I_{SP}(\delta_0) - \delta_0[f] = \beta/2 - f(0) =: \alpha < \infty$, and so it suffices to minimize $I_{SP}(\mu) - \mu[f]$ over $\mu$ such that $I_{SP}(\mu) - \mu[f] \le \alpha + 1$. Also, by (4.7), $I_{SP}(\mu) - \mu[f] \ge g_3(\bar\mu)$, where $g_3(x) := g_1(x) - C_1 - C_2 x$ is continuous and diverges to $\infty$ as $x \to \infty$. Thus $g_3(\bar\mu) \le \alpha + 1$ implies $\bar\mu \le C_3$ for some $C_3 < \infty$. But then the minimizing set is contained in

$$\{\mu : I_{SP}(\mu) \le \mu[f] + \alpha + 1\} \subset \{\mu : I_{SP}(\mu) \le C_1 + C_2 C_3 + \alpha + 1\} = I_{C_1+C_2C_3+\alpha+1},$$

which is compact by Lemma 4.12. Thus the infimum of the lower semi-continuous function $I_{SP}(\mu) - \mu[f]$ is achieved on a non-empty compact set $A \subset \mathcal S$. Let $\mu \in A$ be any point where the minimum is attained, and let $\nu \in \mathcal S$ be arbitrary. By convexity of $\mathcal S$, $(1-t)\mu + t\nu \in \mathcal S$ for any $t \in [0,1]$, and so with $u := \sqrt{\bar\mu\beta}$,

$$\frac{\partial}{\partial t}\Big[I_{SP}((1-t)\mu+t\nu) - (1-t)\mu[f] - t\nu[f]\Big]\Big|_{t=0} \ge 0$$
$$\Leftrightarrow \sum_{i=0}^{\infty}\Big(1 + \log\mu(i) + \log i! - \frac{i}{2}(1+\log\bar\mu) - \frac{i}{2}\log\beta + \frac{i}{2} - f(i)\Big)(\nu(i)-\mu(i)) \ge 0$$
$$\Leftrightarrow \sum_{i=0}^{\infty}\Big(\log\mu(i) + \log i! - i\log u - f(i)\Big)(\nu(i)-\mu(i)) \ge 0$$
$$\Leftrightarrow D(\mu\|\sigma_{u,f}) + D(\nu\|\mu) \le D(\nu\|\sigma_{u,f}),$$

where $\sigma_{u,f}$ is as in definition 4.4. Since this holds for all $\nu \in \mathcal S$, setting $\nu = \sigma_{u,f}$ gives $D(\mu\|\sigma_{u,f}) = 0$, and so $\mu = \sigma_{u,f}$. Thus $A \subset \Omega_f$, and further any $\sigma_{u,f} \in A$ satisfies $u = \sqrt{\beta\,\bar\sigma_{u,f}}$.

Also, compactness of $A$ forces the set of minimizing $u$ in the definition of $J(f,\beta)$ to be a compact subset of $[0,\infty)$. Finally, since a non-constant analytic function on a bounded domain cannot have infinitely many minimizers, the set of minimizers in $u$ must be finite. This completes the proof of part (b).

(c) If $\inf_{\mu\in U^c}\{I_{SP}(\mu) - \mu[f]\} = \infty$ then there is nothing to show. Otherwise, by a similar argument as in part (b), to minimize $I_{SP}(\mu) - \mu[f]$ over $U^c$ it is sufficient to minimize over $I_\alpha \cap U^c$ for some $\alpha < \infty$. Since $U^c$ is closed, $I_\alpha \cap U^c$ is compact and so the infimum over $U^c$ is attained. But since all the global minimizers lie in $U$, none of them is in $U^c$, and it follows that $\inf_{\mu\in U^c}\{I_{SP}(\mu) - \mu[f]\} > \inf_{\mu\in\mathcal S}\{I_{SP}(\mu) - \mu[f]\}$.

As an immediate consequence of Lemma 4.13, the following corollary shows that the rate function $I_{SP}$ is indeed non-negative, with a unique global minimum at $p_\beta$.

Corollary 4.14. The unique global minimizer of ISP over S is at pβ, with ISP(pβ) = 0.

Proof. Choosing $f$ to be the identically $0$ function, it follows by part (b) of Lemma 4.13 that the minimum of $I_{SP}(\mu)$ over $\mathcal S$ is attained over the class of Poisson distributions $p_u$ for $u \ge 0$. Also

$$I_{SP}(p_u) = \frac{1}{2}(\beta - u + u\log u - u\log\beta) = \frac{1}{2}D(p_u\|p_\beta),$$

and so the unique global minimum occurs at $p_\beta$, with $I_{SP}(p_\beta) = 0$.

4.4 Proofs of main results

The first lemma of this section gives an upper bound which will be used in proving both theorems.

Lemma 4.15. Let $U \subset \mathcal S$ be open. Then for any $f \in \mathcal F$,

$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\big[e^{n\mu_n[f]}1_{\mu_n\in U^c}\big] \le \sup_{\mu\in U^c}\{\mu[f] - I_{SP}(\mu)\}.$$

Proof. To begin, first note that

$$\mathbb E_{P_{n,\beta}}\big[e^{n\mu_n[f]}1_{\mu_n\in U^c}\big] = \mathbb E_{P_{n,\beta}}\, e^{nT(\mu_n)},$$

where $T : \mathcal S \mapsto [-\infty,\infty)$, defined by $T(\mu) := \mu[f]$ if $\mu \in U^c$ and $-\infty$ otherwise, is upper semi-continuous, as $\mu \mapsto \mu[f]$ is upper semi-continuous by part (a) of Lemma 4.13. The proof of the lemma can now be completed by invoking Lemma 2.10. Subject to checking (2.4) for some $\gamma > 1$, an application of Lemma 2.10 along with Theorem 4.11 gives

$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{nT(\mu_n)} \le \sup_{\mu\in\mathcal S}\{T(\mu) - I_{SP}(\mu)\} = \sup_{\mu\in U^c}\{\mu[f] - I_{SP}(\mu)\},$$

which is the desired conclusion. To check that (2.4) indeed holds, note that $f(i) \le C_1 + C_2 i$ for all $i \ge 0$ (as $f \in \mathcal F$), and so with $\gamma = 2$,

$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{2\sum_{i=0}^{n-1} f(i)h_i(G)} \le 2C_1 + \limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{2C_2\sum_{i=0}^{n-1} i\,h_i(G)} = 2C_1 + \limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{4C_2 E(G)}.$$

Since $E(G)$ follows a Binomial distribution with parameters $\binom{n}{2}$ and $\beta/n$, the second term above can be easily computed to be $\frac{\beta}{2}(e^{4C_2}-1)$, which is finite.

The proof of the above lemma uses Remark 2.11, which says that $T$ is allowed to take the value $-\infty$, as is the case here.

Proof of Theorem 4.5. (a) Since $\limsup_{i\to\infty}|f(i)|/i < \infty$, it readily follows by part (a) of Lemma 4.13 that the map $\mu \mapsto \mu[f]$ is continuous with respect to $\|.\|_{\mathrm{mean}}$. Note that the normalizing constant can be expressed as

$$e^{Z_n(\theta,\beta,f)} = \mathbb E_{P_{n,\beta}}\, e^{\theta\sum_{i=0}^{n-1} f(i)h_i(G)} = \mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]}.$$

Also, since $f \in \mathcal F$, (2.4) holds (as verified during the proof of Lemma 4.15), and so Varadhan's lemma (Theorem 2.8) is applicable. Thus an application of Varadhan's Lemma along with the large deviation result of Theorem 4.11 gives

$$\lim_{n\to\infty}\frac{1}{n} Z_n(\theta,\beta,f) = \sup_{\mu\in\mathcal S}\{\theta\mu[f] - I_{SP}(\mu)\}.$$

By part (b) of Lemma 4.13, the supremum in the r.h.s. above is finite and equals J(θf, β), and the set of optimizing u in the definition of J(θf, β) is finite.

Denoting this set by {u1, u2, ··· , uk}, let

$$U := \Big\{\mu \in \mathcal S : \min_{1\le i\le k}\|\mu - \sigma_{u_i,\theta f}\|_{\mathrm{mean}} < \varepsilon\Big\},$$

where $\varepsilon > 0$ is fixed. Then $U$ is open, and so by Lemma 4.15,

$$\limsup_{n\to\infty}\frac{1}{n}\log Q_{n,\theta,\beta,f}(\mu_n \in U^c) \le \limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\big[e^{n\theta\mu_n[f]}1_{\mu_n\in U^c}\big] - \lim_{n\to\infty}\frac{1}{n} Z_n(\theta,\beta,f)$$
$$\le \sup_{\mu\in U^c}\{\theta\mu[f] - I_{SP}(\mu)\} - \sup_{\mu\in\mathcal S}\{\theta\mu[f] - I_{SP}(\mu)\}.$$

The last quantity above is negative by part (c) of Lemma 4.13, and so the conclusion follows.

Proof of Theorem 4.8. As before,

$$e^{Z_n(\theta,\beta,f)} = \sum_{G\in\mathcal G_n}\Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)} e^{n\theta\mu_n[f]} = \mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]},$$

where $f(i) = \binom{i}{k}$ for $i \ge 0$.

(a) For part (a), setting $r_n := \lfloor n^{1/k} M\rfloor$ for $M > 0$, note that $r_n \le n-1$ for all large $n$. Also recall that

$$n\mu_n[f] = \sum_{j=1}^{n}\binom{d_j}{k},$$

and so

$$\mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]} \ge e^{\theta\binom{r_n}{k}}\, P_{n,\beta}(d_1 = r_n) = e^{\theta\binom{r_n}{k}}\binom{n-1}{r_n}\Big(\frac{\beta}{n}\Big)^{r_n}\Big(1-\frac{\beta}{n}\Big)^{n-1-r_n}$$

(since $d_1$, the degree of vertex $1$, has a Binomial distribution with parameters $(n-1,\beta/n)$), and so

$$\liminf_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]} \ge \theta\frac{M^k}{k!} - \frac{\beta}{2}.$$

Since this holds for all M > 0, the conclusion follows on letting M → ∞.

(b) For part (b), since $\theta < 0$, the function $\theta f \in \mathcal F$ (since $\theta f \le 0$), and so by Lemma 4.15 with $U = \emptyset$,

$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]} \le \sup_{\mu\in\mathcal S}\{\theta\mu[f] - I_{SP}(\mu)\}.$$

For the lower bound, first note that for any $M \in \mathbb N$,

$$\mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]} \ge \mathbb E_{P_{n,\beta}}\big[e^{n\theta\mu_n[f]}1_{\max_{1\le j\le n} d_j(G)\le M}\big] = \mathbb E_{P_{n,\beta}}\big[e^{n\theta\sum_{i=0}^{M}\mu_n(i)f(i)}1_{\max_{1\le j\le n} d_j(G)\le M}\big] \ge \mathbb E_{P_{n,\beta}}\big[e^{n\theta\sum_{i=0}^{M}\mu_n(i)f(i)}\big]\, P_{n,\beta}(d_1\le M)^n,$$

where the last step follows by the FKG inequality, on noting that $\theta < 0$ and the function $G \mapsto \sum_{i=0}^{M}\mu_n(i)f(i)$ is non-decreasing on the space of graphs $\mathcal G_n$. Since $\mu \mapsto \theta\sum_{i=0}^{M}\mu(i)f(i)$ is bounded and continuous with respect to $\|.\|_{\mathrm{mean}}$, it

follows by Theorem 4.11 and an application of Varadhan's lemma that

$$\liminf_{n\to\infty}\frac{1}{n}\log\mathbb E_{P_{n,\beta}}\, e^{n\theta\mu_n[f]} \ge \sup_{\mu\in\mathcal S}\Big\{\theta\sum_{i=0}^{M}\mu(i)f(i) - I_{SP}(\mu)\Big\} + \log p_\beta[0,M],$$

where $p_\beta[0,M]$ is the probability that a Poisson random variable with parameter $\beta$ is at most $M$. Finally, note that $\theta\mu[f] \le \theta\sum_{i=0}^{M}\mu(i)f(i)$, and so the r.h.s. of the above inequality is bounded below by $\sup_{\mu\in\mathcal S}\{\theta\mu[f] - I_{SP}(\mu)\} + \log p_\beta[0,M]$. The lower bound follows on letting $M \to \infty$ and noting that $p_\beta[0,M] \to 1$. Combining the upper and lower bounds gives

$$\lim_{n\to\infty}\frac{1}{n} Z_n(\theta,\beta,f) = \sup_{\mu\in\mathcal S}\{\theta\mu[f] - I_{SP}(\mu)\}.$$

The rest of the proof follows exactly the same argument as in part (b) of Theorem 4.5, and is not repeated here.

4.5 A particular example

This section uses the theory developed in the previous sections to analyze a particular ERGM on sparse graphs. The sufficient statistic for this model is the number of edges, and the number of isolated vertices in the graph. The isolated vertices term can also be viewed as a penalty term which prefers or dislikes isolated vertices, depending on the sign of the associated parameter θ. This model is probably the simplest model that can be handled using Theorem 4.5.

Let $h_0$ be the number of isolated vertices. As before, consider the probability mass function given by

$$Q_{n,\theta,\beta,f}(G) = \Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)} e^{\theta h_0(G) - Z_n(\theta,\beta,f)},$$

where $f(i) = 1_{i=0}$. This model has two parameters, the edge parameter $\beta > 0$ and the sparse penalty parameter $\theta \in \mathbb R$. Note that even in this simple model, the normalizing constant seems intractable. Since $f$ is bounded, it readily follows by Theorem 4.5 that

$$\lim_{n\to\infty}\frac{Z_n(\theta,\beta,f)}{n} = J(\theta f,\beta),$$

where

$$J(\theta f,\beta) = \sup_{u\ge0}\Big\{Z(u,\theta f) - m(u,\theta f)\log u + \frac{m(u,\theta f)}{2}\log(m(u,\theta f)\beta) - \frac{m(u,\theta f)+\beta}{2}\Big\},$$

and $Z(u,\theta f)$ and $m(u,\theta f)$ are the log normalizing constant and the mean, respectively, of the probability mass function $\sigma_{u,\theta f}$ on the non-negative integers given by

$$\sigma_{u,\theta f}(i) \propto \frac{1}{i!}\, u^i e^{\theta f(i)}.$$

For this particular choice of $f$, a direct calculation reveals that

$$e^{Z(u,\theta f)} = e^\theta + e^u - 1, \qquad m(u,\theta f) = \frac{u e^u}{e^u + e^\theta - 1},$$

and so $J(\theta f,\beta)$ can be computed very easily by a one-dimensional grid search.
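Using the closed forms above, the grid search reduces to a few vectorized lines; a sketch (the grid range is an ad hoc assumption and should cover all maximizers):

```python
import numpy as np

def J_isolated(theta, beta, u_grid=np.linspace(1e-4, 15.0, 30000)):
    """J(theta*f, beta) for f(i) = 1_{i=0}, via e^Z = e^theta + e^u - 1
    and m = u e^u / (e^u + e^theta - 1)."""
    u = u_grid
    Z = np.log(np.exp(theta) + np.exp(u) - 1.0)
    m = u * np.exp(u) / (np.exp(u) + np.exp(theta) - 1.0)
    vals = Z - m * np.log(u) + 0.5 * m * np.log(m * beta) - 0.5 * (m + beta)
    return float(vals.max()), float(u[vals.argmax()])
```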

To understand what the degree distribution looks like under this model, one needs to understand the set of maximizers in $u$ of the above optimization problem. Since any maximizer $u$ also satisfies $u = \sqrt{m(u,\theta f)\beta}$, substituting this in the expression for $m(u,\theta f)$ and canceling $u$ gives

$$u = \frac{\beta e^u}{e^u + e^\theta - 1} = \frac{a e^u}{e^u + b} =: h_{a,b}(u),$$

where $a = \beta > 0$ and $b = e^\theta - 1 > -1$. (It can be argued rigorously that $u = 0$ is not a maximizer of $J(\theta f,\beta)$, and so the above cancellation is valid.)

The following simple lemma analyzes the roots of the equation $u = h_{a,b}(u)$ for $a > 0$, $b > -1$.

Lemma 4.16. For $a > 0$, $b > -1$ and $b \ne 0$, consider the function $h_{a,b}(u) = \frac{a e^u}{e^u+b}$ for $u > 0$.

(a) The equation ha,b(u) = u has either one or three roots (counting multiplicity).

(b) If either b < 0 or a < 4 then the equation ha,b(u) = u has exactly one root.

Proof. (a) Since $h_{a,b}(0) = \frac{a}{1+b} > 0$ and $\lim_{u\to\infty} h_{a,b}(u) = a < \infty$, the given equation has at least one root, and the number of roots is odd. Differentiation gives

$$h'_{a,b}(u) - 1 = \frac{a b e^u}{(e^u+b)^2} - 1.$$

Thus $h'_{a,b}(u) = 1$ is a quadratic equation in $e^u$, and so can have at most two real roots; by Rolle's theorem the equation $h_{a,b}(u) = u$ then has at most three real roots, thus concluding the proof of part (a).

(b) It can be checked that $h'_{a,b}(u) < 1$ if either $a < 4$ or $b < 0$, and so $h_{a,b}(u) - u$ is monotone decreasing and so has exactly one root.
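The roots of $h_{a,b}(u) = u$ are easy to locate numerically by bracketing sign changes; a sketch (the grid range is an assumption; by Lemma 4.16 at most three roots can be found):

```python
import numpy as np
from scipy.optimize import brentq

def fixed_points(a, b, u_max=50.0, n_grid=200000):
    """Roots of h_{a,b}(u) = u, with h_{a,b}(u) = a e^u / (e^u + b)."""
    g = lambda u: a * np.exp(u) / (np.exp(u) + b) - u
    us = np.linspace(1e-9, u_max, n_grid)
    signs = np.sign(g(us))
    return [brentq(g, us[j], us[j + 1])
            for j in np.flatnonzero(signs[:-1] != signs[1:])]
```

For example, with $a = \beta = 5.8$ and $b = e^\theta - 1 = 17$ this should recover the three stationary points quoted in case (i) below.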

From Lemma 4.16, the equation $u = h_{a,b}(u)$ has either one or three roots. This gives rise to two sub-cases:

(i) If the equation $h_{a,b}(u) = u$ has one or three roots, but exactly one of these roots is the global optimizer $u_0$, then by part (b) of Theorem 4.5 the degree distribution converges to $\sigma_{u_0,\theta f}$. In particular, by part (b) of Lemma 4.16 this happens if either $\theta < 0$ or $\beta < 4$. An example of this case is Figure 4.1(a), where the plot of the objective is given for $\beta = 5.8$ and $e^\theta = 18$. The optimization problem has a unique global maximum at $u_0 = 5.37782$. It also has a local maximum at $0.524199$ and a local minimum at $2.68391$, but they play no role in determining the limiting degree distribution.

(ii) If $h_{a,b}(u) = u$ has exactly three roots $u_1 < u_2 < u_3$ and $u_1, u_3$ are both global optimizers, then the degree distribution converges to a mixture of $\sigma_{u_1,\theta f}$ and $\sigma_{u_3,\theta f}$.

Figure 4.1: (a) Unique global maximum, (b) Non-unique global maximum.

An approximate example of this is Figure 4.1(b), where the plot of the objective is given for $\beta = 5.77$ and $e^\theta = 18.9$. The approximate points of global maximum are $u_1 = 0.476086$ and $u_3 = 5.29422$, and $u_2 = 2.88435$ is a local minimum.

Thus even in this simple model there is a “phase transition”, namely for some values of the parameters (β, θ) the degree distribution has a unique limit, whereas for other parameter values the limiting degree distribution is a mixture distribution. Also, even though a phase transition is established, the exact phase transition boundary for this problem (i.e. the parameter values for which the limit is a mixture distribution) has not been characterized in this chapter, and is a topic for future research.

To explore this phenomenon of phase transition, the parameter configuration $\beta = 5.77$, $\theta = \log(18.9)$ is investigated closely. A sample is drawn from this parameter configuration using Glauber dynamics with a systematic scan. The number of vertices $n$ is chosen to be $1000$, and the number of scans is chosen to be $20$. The resulting empirical degree distribution is plotted below in figure 4.2(a). The pmf

$$\sigma_{u,\theta f}(0) = \frac{e^\theta}{e^\theta + e^u - 1}, \qquad \sigma_{u,\theta f}(i) = \frac{u^i}{i!\,(e^\theta + e^u - 1)} \ \text{ for } i \ge 1,$$

is also plotted side by side in figure 4.2(b), for the value $u = 5.29422$.
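The sampler just described can be sketched as follows; this is an illustrative systematic-scan Glauber dynamics for the isolated-vertices model, not the exact code behind the figures (initialization, scan order and the random seed are assumptions), and it is slow in pure Python for $n = 1000$:

```python
import numpy as np

def glauber_sample(n, beta, theta, n_scans=20, seed=0):
    """Systematic-scan Glauber dynamics for
    Q(G) ~ (beta/n)^E (1 - beta/n)^(C(n,2) - E) e^{theta h_0(G)}."""
    rng = np.random.default_rng(seed)
    adj = np.zeros((n, n), dtype=bool)
    deg = np.zeros(n, dtype=int)
    base = np.log(beta / (n - beta))  # log-odds of an edge under Erdos-Renyi(beta/n)
    for _ in range(n_scans):
        for i in range(n):
            for j in range(i + 1, n):
                old = int(adj[i, j])
                di, dj = deg[i] - old, deg[j] - old  # degrees without slot (i, j)
                # adding the edge destroys one isolated vertex per degree-0 endpoint
                logit = base - theta * ((di == 0) + (dj == 0))
                new = int(rng.random() < 1.0 / (1.0 + np.exp(-logit)))
                deg[i] += new - old
                deg[j] += new - old
                adj[i, j] = adj[j, i] = bool(new)
    return deg  # degree sequence of the sampled graph
```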

Figure 4.2: (a) Observed degree distribution, (b) Predicted degree distribution with $u_3$.

From figure 4.2 the two distributions seem quite close, and so the theoretical prediction seems to match the sample behavior. Also, repeated sampling from this scheme always produced a graph whose degree distribution looks like $\sigma_{u_3,\theta f}$. Thus the mixing probability of $\sigma_{u_1,\theta f}$ seems to be small.

The exact same analysis works for any model of the form $Q_{n,\theta,\beta,f}$ with $|f| \in \mathcal F$. One possible difference is that for a general $f$ the functions $Z(u,\theta f)$ and $m(u,\theta f)$ might not be computable in closed form, and need to be estimated numerically as well. For an approximate numerical evaluation of $Z(u,f)$ for a given $u$, let $C < \infty$ be such that $|f(i)| \le Ci$ for all $i \ge 1$. Then the function $e^{f(i)}\frac{u^i}{i!}$ is bounded by $\frac{(ue^C)^i}{i!}$, which decays rapidly for $i > ue^C$. Thus for a numerical evaluation of $Z(u,f)$ it suffices to evaluate the sum over $0 \le i \le 2ue^C$. Given a value of $u$ and knowledge of the constant $C$, one can efficiently restrict the range of summation. A similar analysis works for the function $m(u,f)$.

4.6 Fitting the model

This section gives a way to estimate the parameters in the model $Q_{n,\theta,\beta,f}$ introduced in (4.2). First assume that the function $f$ is known, and the problem is to estimate the parameters $(\theta,\beta)$. If $f$ is exactly linear, i.e. there exist constants $a, b$ such that $f(i) = a + bi$, then the model $Q_{n,\theta,\beta,f}$ is the same as Erdős-Rényi with parameter

$$p = \frac{1}{1 + e^{-\theta b}\,\frac{n-\beta}{\beta}} \approx \frac{\beta e^{\theta b}}{n}.$$

This model is asymptotically not identifiable along the curve where $\beta e^{\theta b}$ is constant, and so joint estimation of both parameters $(\theta,\beta)$ is not possible. If $f$ is not linear, then there exist $i_1 \ge 1$, $i_2 \ge 1$ such that

$$\frac{f(i_1)-f(0)}{i_1} \ne \frac{f(i_2)-f(0)}{i_2}.$$

The following proposition shows that under this condition consistent estimation of both the parameters is possible under this model.

Proposition 4.17. Let $i_1 \ge 1$, $i_2 \ge 1$ be such that

$$\frac{f(i_1)-f(0)}{i_1} \ne \frac{f(i_2)-f(0)}{i_2}.$$

Let $\hat\theta_n, \hat u_n$ be the unique solution to the equations

$$\log\Big[\frac{i_1!\,h_{i_1}(G)}{h_0(G)}\Big] = \theta(f(i_1)-f(0)) + i_1\log u,$$
$$\log\Big[\frac{i_2!\,h_{i_2}(G)}{h_0(G)}\Big] = \theta(f(i_2)-f(0)) + i_2\log u.$$

Then $\hat\theta_n$ is consistent for $\theta$. Further, the estimator $\hat\beta_n := \frac{n\hat u_n^2}{2E(G)}$ is consistent for $\beta$.

Proof. One first needs to check that $\hat\theta_n$ and $\hat u_n$ are well defined. This will be the case if the linear equations have a unique solution, which happens iff

$$\frac{f(i_1)-f(0)}{i_1} \ne \frac{f(i_2)-f(0)}{i_2}.$$

By part (b) of Corollary 4.6 it follows that there exists a finite set $\{u_1, u_2, \cdots, u_k\}$ with $u_l > 0$ such that any limit point of the measure $\mu_n$ is of the form $\sigma_{u_l,\theta f}$ for some $l$, $1 \le l \le k$. This implies that there exists a random variable $U_n$ taking values in $\{u_1, u_2, \cdots, u_k\}$ such that for all $i \ge 0$ one has

$$\frac{h_i(G)}{n} - \frac{U_n^i}{i!}\, e^{\theta f(i) - Z(U_n,\theta f)} = o_P(1). \qquad (4.8)$$

Taking ratios, this implies by Slutsky's theorem that

$$\frac{i_1!\,h_{i_1}(G)}{h_0(G)} = U_n^{i_1} e^{\theta(f(i_1)-f(0))} + o_P(1), \qquad \frac{i_2!\,h_{i_2}(G)}{h_0(G)} = U_n^{i_2} e^{\theta(f(i_2)-f(0))} + o_P(1).$$

Since the random variable $U_n^i e^{\theta(f(i)-f(0))}$ is bounded away from $0$ for each fixed $i$, it follows that

$$\log\Big[\frac{i_1!\,h_{i_1}(G)}{h_0(G)}\Big] = \theta(f(i_1)-f(0)) + i_1\log U_n + o_P(1),$$
$$\log\Big[\frac{i_2!\,h_{i_2}(G)}{h_0(G)}\Big] = \theta(f(i_2)-f(0)) + i_2\log U_n + o_P(1).$$

By the definition of $\hat\theta_n, \hat u_n$ this implies that

$$\hat\theta_n(f(i_1)-f(0)) + i_1\log\hat u_n = \theta(f(i_1)-f(0)) + i_1\log U_n + o_P(1),$$
$$\hat\theta_n(f(i_2)-f(0)) + i_2\log\hat u_n = \theta(f(i_2)-f(0)) + i_2\log U_n + o_P(1).$$

Also, the given condition implies that the eigenvalues of the matrix

$$\begin{pmatrix} f(i_1)-f(0) & i_1 \\ f(i_2)-f(0) & i_2 \end{pmatrix}$$

are bounded away from $0$, and consequently

$$\hat\theta_n - \theta = o_P(1), \qquad \hat u_n - U_n = o_P(1).$$

This shows that $\hat\theta_n$ is consistent for $\theta$, and that $\hat u_n$ is close to the random variable $U_n$. Also, convergence in the metric $\|.\|_{\mathrm{mean}}$ implies

$$m(U_n,\theta f) - \frac{2E(G)}{n} = o_P(1)$$

as well, which readily implies

$$\hat\beta_n = \frac{n\hat u_n^2}{2E(G)} = \frac{U_n^2}{m(U_n,\theta f)} + o_P(1) = \beta + o_P(1),$$

where one uses part (b) of Lemma 4.13, which says that $U_n^2 = \beta m(U_n,\theta f)$ holds almost surely. This shows consistency of $\hat\beta_n$ for $\beta$.

One thing to note in the above argument is that $i_1, i_2$ are fixed whereas $n$ becomes large. Thus in applying this approach one should use small $i$'s: the values $h_i(G)/n$ decay rapidly with $i$, and small probabilities are harder to estimate and more unstable. Another comment is that even though estimates of $(\theta,\beta)$ can be obtained from any pair $(i_1,i_2)$ as in proposition 4.17, a better thing to do is to consider the equations

$$\log\Big[\frac{i!\,h_i(G)}{h_0(G)}\Big] = \theta(f(i)-f(0)) + i\log u \qquad (4.9)$$

simultaneously for $i \ge 1$, and come up with estimates for $(\theta,\beta)$. As an example, one could fit a linear regression $y(i) = \theta z(i) + [\log u]\,x(i)$ with response $y(i) = \log\frac{i!\,h_i(G)}{h_0(G)}$ and explanatory variables $z(i) = f(i)-f(0)$ and $x(i) = i$.

In particular, for the sparse penalty model of section 5 the system of equations in (4.9) becomes

$$\log\Big[\frac{i!\,h_i(G)}{h_0(G)}\Big] = -\theta + i\log u.$$

Thus one can estimate $\theta, u$ by fitting a linear equation with response $y(i)$ and explanatory variable $x(i)$. For the graph $G$ drawn from this model in section 5 with parameters $\beta = 5.77$, $e^\theta = 18.9$, the response values $y(i)$ are computed, and a linear equation is fitted to the points $(x(i), y(i))$ for $1 \le i \le 10$. By the method of least squares, the equation which best fits these points is $y(i) = -2.9495 + 1.6713\,x(i)$, which gives $\hat\theta_n = 2.9495$ and $\hat u_n = e^{1.6713}$.

Since the true value of $\theta$ is $\log(18.9) = 2.9392$, this gives a reasonably good estimate for $\theta$. Further, from the data we have $2E(G)/n = 4.886$, and so the estimated value of $\beta$ is $\hat\beta_n = 5.7905$. Since the true value of $\beta$ is $5.77$, the estimate $\hat\beta_n$ is not far off. Thus this gives a very simple procedure to estimate both parameters under this model.
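The regression-based fit just described might be sketched as follows (illustrative, not the thesis's code; it assumes the degree frequency vector h, the edge count E and the number of vertices n are available, and it only uses rows with $h_i > 0$):

```python
import numpy as np
from math import lgamma

def fit_isolated_model(h, E, n, i_max=10):
    """Least-squares fit of log[i! h_i / h_0] = -theta + i log u, then
    beta_hat = n u_hat^2 / (2E), as in the sparse penalty example."""
    i = np.array([k for k in range(1, i_max + 1) if h[k] > 0])
    y = np.array([lgamma(k + 1) + np.log(h[k] / h[0]) for k in i])
    A = np.column_stack([np.ones(len(i)), i.astype(float)])  # intercept, slope
    (c0, c1), *_ = np.linalg.lstsq(A, y, rcond=None)
    theta_hat, u_hat = -c0, np.exp(c1)
    return theta_hat, n * u_hat ** 2 / (2 * E)
```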

If there is no reasonable guess for the function f, then one can think of estimating the whole function f as well. In the model Qn,θ,β,f with f unknown, the two models

Qn,θ,β,f and Qn,1,1,f˜ are asymptotically unidentifiable, where

$$\tilde f(i) = \theta[f(i)-f(0)] + (i/2)\log\beta.$$

To ensure identifiability, it is assumed that f(0) = 0, θ = β = 1. Under this assump- tion, the next proposition reconstructs the whole function f.

Proposition 4.18. Let $f \in \mathcal F$ be such that $f(0) = 0$, consider the model $Q_{n,\theta=1,\beta=1,f}$ as defined in (4.2), and let $\hat u_n := \sqrt{\frac{2E(G)}{n}}$. Then the function $\hat f_n : \mathbb N_0 \mapsto \mathbb R$ defined by

$$\hat f_n(i) = \log\Big[\frac{i!\,h_i(G)}{h_0(G)}\Big] - i\log\hat u_n$$

satisfies $\hat f_n(i) \xrightarrow{p} f(i)$.

Proof. By part (b) of Theorem 4.5 there exists a random variable $U_n$ taking values in a finite set $\{u_1,\cdots,u_k\}$ such that (4.8) holds. As in the proof of proposition 4.17, by Slutsky's theorem this gives

$$\frac{i!\,h_i(G)}{h_0(G)} - e^{f(i)} U_n^i = o_P(1),$$

which on taking logs gives

$$\hat f_n(i) + i\log\hat u_n = \log\Big[\frac{i!\,h_i(G)}{h_0(G)}\Big] = f(i) + i\log U_n + o_P(1).$$

Thus to complete the proof it suffices to show that $\hat u_n - U_n = o_P(1)$. To prove this, first note that convergence in the metric $\|.\|_{\mathrm{mean}}$ implies

$$\frac{2E(G)}{n} - m(U_n,f) = o_P(1).$$

Since by part (b) of Lemma 4.13 one has $U_n^2 = m(U_n,f)$ (recall $\beta = 1$ here), it readily follows that

$$\hat u_n = \sqrt{\frac{2E(G)}{n}} = \sqrt{m(U_n,f)} + o_P(1) = U_n + o_P(1),$$

thus completing the proof.
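Proposition 4.18 translates into a one-line plug-in estimator; a sketch (function name and range are illustrative), defined only at those $i$ with $h_i(G) > 0$:

```python
import numpy as np
from math import lgamma

def estimate_f(h, E, n, i_max=10):
    """fhat(i) = log[i! h_i / h_0] - i log(uhat), with uhat = sqrt(2E/n)."""
    u_hat = np.sqrt(2 * E / n)
    return {i: lgamma(i + 1) + np.log(h[i] / h[0]) - i * np.log(u_hat)
            for i in range(1, i_max + 1) if h[i] > 0}
```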

As an application of proposition 4.18, again consider the sparse penalty model of section 5. To bring the model into the setting of the proposition, the correct choice of $f$ is $f(i) = -\theta + (i/2)\log\beta$ for $i \ge 1$ (and $f(0) = 0$). Figure 4.3 shows a plot of the true $f$ versus its estimate, for $1 \le i \le 10$.

Figure 4.3: Red: Estimated function, Blue: True function.

The plot shows that the estimate is quite close to the true function. This method can be used to estimate the function $f$ at values $i$ which are small compared to the number of vertices $n$. One problem with estimating the function for larger $i$'s is that unless $n$ is really huge, some of the counts $h_i(G)$ can be $0$ for large $i$. As an example, in the sampled graph $G$ used for estimating the function $f$ above, $h_{14}(G)$ happened to be $0$. As such the estimate $\hat f_n(14)$ became undefined, and so no estimate of $f(14)$ could be obtained using this approach.

4.7 Proof of the large deviation principle

The main tool for the proof is an estimate of the number of graphs with a given sparse degree sequence. Enumeration of graphs with a given degree sequence has a fairly long history in the combinatorial literature. For some references, see the works of McKay [52], McKay-Wormald [53], [54], [55], and the references therein. This chapter uses the results of [52] to derive the rate function for the large deviation principle.

The proof of Theorem 4.11 is carried out according to the following strategy: Three auxiliary lemmas will be proved first, and then the theorem derived as a consequence of the lemmas.

The first of the three lemmas uses graph counting estimates, and helps in guessing the rate function.

Lemma 4.19. Let $N(h)$ denote the number of simple graphs with given degree frequency vector $h = (h_0,\cdots,h_{n-1})$. Then

(a) For any $h$,

$$N(h) \le \frac{(2E)!}{E!\,2^E\prod_{i=0}^{n-1}(i!)^{h_i}} \times \frac{n!}{\prod_{i=0}^{n-1} h_i!},$$

where $E = \frac{1}{2}\sum_{i=0}^{n-1} i\,h_i$.

(b) If $h_i = 0$ for all $i > M$ with $M < \infty$ (i.e. all the degrees are at most $M$), then

$$N(h) \ge C\,\frac{(2E)!}{E!\,2^E\prod_{i=0}^{n-1}(i!)^{h_i}} \times \frac{n!}{\prod_{i=0}^{n-1} h_i!}$$

for some constant $C := C(M) \in (0,\infty)$.

Proof. Note that $h$ determines the ordered degree sequence $d' := (d_{(1)} \ge d_{(2)} \ge \cdots \ge d_{(n)})$ uniquely. Thus if $A(d')$ denotes the number of graphs with ordered degree sequence $d'$, then the number of graphs with degree frequency $h$ is $A(d') \times \frac{n!}{\prod_{i=0}^{n-1} h_i!}$. It thus remains to estimate $A(d')$.

(a) The upper bound follows trivially from the representation $A(d') = P(d')\,\frac{(2E)!}{E!\,2^E\prod_{j=1}^{n} d_{(j)}!}$ with $0 \le P(d') \le 1$ (see [52]), on noting that $\prod_{j=1}^{n} d_{(j)}! = \prod_{i=0}^{n-1}(i!)^{h_i}$.

(b) Since the result holds when $d_{(1)} = 0$ (the entire degree sequence is $0$), w.l.o.g. assume $E \ge d_{(1)} \ge 1$. The assumption $h_i = 0$ for all $i > M$ is equivalent to $d_{(1)} \le M$. Setting

$$\lambda := \frac{1}{4E}\sum_{j=1}^{n} d_{(j)}(d_{(j)}-1), \qquad \hat\Delta := 2 + d_{(1)} + \frac{3}{2}d_{(1)}^2$$

as in [52, Theorem 4.6], note that $\hat\Delta \le 2 + M + 2M^2$, and

$$\lambda \le \frac{\sum_{j=1}^{n} d_j(G)^2}{2\sum_{j=1}^{n} d_j(G)} \le \frac{M}{2}.$$

The conclusion then follows from [52, Theorem 4.6].

Part (a) of the second lemma shows that any $\mu \in \mathcal S$ is close to its truncation in the sense of the metric $\|.\|_{\mathrm{mean}}$. Part (b) of this lemma uses the celebrated Erdős-Gallai criterion to show that any probability measure supported on finitely many non-negative integers is close to the empirical degree distribution of some graph. The Erdős-Gallai criterion determines whether a given integer sequence is the ordered degree sequence of a simple graph, and is stated below:

Theorem. A sequence of non-negative integers $(d_{(1)} \ge d_{(2)} \ge \cdots \ge d_{(n)})$ corresponds to the degree sequence of some simple undirected graph $G$ on $n$ vertices iff $\sum_{j=1}^{n} d_{(j)}$ is even, and for every $r \ge 1$,

$$\sum_{i=1}^{r} d_{(i)} \le r(r-1) + \sum_{i=r+1}^{n}\min(d_{(i)}, r).$$

The following definition is required before stating the lemma.

Definition 4.20. Let Hn ⊂ S denote the set of all possible probability measures µn as G varies over Gn, the set of all simple labelled graphs. Note that the measure µn and the degree frequency vector h are in 1-1 correspondence with each other.

Lemma 4.21. (a) Given $\nu \in \mathcal S$ such that $I_{SP}(\nu) < \infty$, there exists a sequence $\nu_M \in \mathcal S$ satisfying the following conditions:

$$I_{SP}(\nu_M) \xrightarrow{M\to\infty} I_{SP}(\nu), \qquad \|\nu_M - \nu\|_{\mathrm{mean}} \xrightarrow{M\to\infty} 0, \qquad \nu_M(i) = 0 \ \text{ for } i > M.$$

(b) For any $M \ge 2$ and non-negative vector $(y_0,\cdots,y_M)$ such that $y_0 > 0$ and $\sum_{i=0}^{M} y_i = 1$, there exists $\mu_n \in H_n$ such that

$$|\mu_n(i) - y_i| \le \frac{M}{n} \ \text{ for } 0 \le i \le M, \qquad \mu_n(i) = 0 \ \text{ for } i > M.$$

Proof. (a) Define $\nu_M \in \mathcal S$ by

$$\nu_M(i) := \frac{\nu(i)}{\sum_{j=0}^{M}\nu(j)} \ \text{ for } 0 \le i \le M, \qquad \nu_M(i) := 0 \ \text{ for } i > M.$$

Clearly $\nu_M$ is well defined as soon as $\sum_{i=0}^{M}\nu(i) > 0$, which is true for all large $M$.

Also $\lim_{M\to\infty}\nu_M(i) = \nu(i)$, and

$$\bar\nu_M = \frac{\sum_{i=1}^{M} i\,\nu(i)}{\sum_{i=0}^{M}\nu(i)} \to \frac{\sum_{i=1}^{\infty} i\,\nu(i)}{\sum_{i=0}^{\infty}\nu(i)} = \bar\nu,$$

and so $\nu_M \to \nu$ in $\|.\|_{\mathrm{mean}}$. Finally, to check that $I_{SP}(\nu_M) \to I_{SP}(\nu)$ it suffices to check that $D(\nu_M\|p_\beta) \to D(\nu\|p_\beta) < \infty$, which follows on noting that

$$D(\nu_M\|p_\beta) = C_M\sum_{i=0}^{M}\nu(i)\log\frac{\nu(i)}{p_\beta(i)} + \log C_M,$$

with $C_M := \big(\sum_{i=0}^{M}\nu(i)\big)^{-1} \to 1$.

(b) If y0 = 1 then set h by h0 = n and hi = 0 for i > 0. Thus w.l.o.g. assume

0 < y0 < 1. Define a candidate degree frequency h as follows:

If $\sum_{i=1}^{M} i\lfloor ny_i\rfloor$ is even, then set

$$h_i := \lfloor ny_i\rfloor \ \text{ for } 1 \le i \le M, \qquad h_0 := n - \sum_{i=1}^{M} h_i, \qquad h_i := 0 \ \text{ for } i > M.$$

If $\sum_{i=1}^{M} i\lfloor ny_i\rfloor$ is odd, then set

$$h_i := \lfloor ny_i\rfloor \ \text{ for } 2 \le i \le M, \qquad h_1 := \lfloor ny_1\rfloor + 1, \qquad h_0 := n - \sum_{i=1}^{M} h_i, \qquad h_i := 0 \ \text{ for } i > M.$$

Since $y_0 < 1$, one has

$$1 + \sum_{i=1}^{M}\lfloor ny_i\rfloor \le 1 + n(1-y_0) < n$$

for all large $n$, and so $h_0$ as defined above is positive. Also, by construction $\sum_{i=1}^{M} i\,h_i$ is even, and

$$|\mu_n(i) - y_i| \le \frac{1}{n} \ \text{ for } 1 \le i \le M, \qquad |\mu_n(0) - y_0| \le \frac{M}{n}.$$

To complete the proof, it remains to check that the $h$ defined above is indeed a valid degree frequency, i.e. the corresponding ordered degree sequence

$$\big(M,\cdots,M,\ M-1,\cdots,M-1,\ \cdots,\ 0,\cdots,0\big)$$

satisfies the Erdős-Gallai criterion, where $i$ appears $h_i$ times, for $0 \le i \le M$.

Denoting the above sequence by $(d_{(1)} \ge d_{(2)} \ge \cdots \ge d_{(n)})$, one needs to check that for all $1 \le r \le n$,

$$\sum_{i=1}^{r} d_{(i)} \le r(r-1) + \sum_{i=r+1}^{n}\min(d_{(i)}, r). \qquad (4.10)$$

Note that the l.h.s. of (4.10) is bounded by $rM$, and so (4.10) holds trivially for $r > M$. Also, for $1 \le r \le M$ the l.h.s. of (4.10) is bounded by $M^2$, whereas the r.h.s. is at least

$$\sum_{i=1}^{M}\lfloor ny_i\rfloor - M,$$

which goes to infinity as $n$ grows, since $y_i > 0$ for some $i \ne 0$. Thus (4.10) holds for all $r \le M$ for all $n$ large enough, and so the proof is complete.

The third and final lemma uses Lemma 4.19 and Lemma 4.21 to formalize the claim that it suffices to assume that the maximum degree is "not too large".

Lemma 4.22. For any set $A \subseteq \mathcal S$,

$$\lim_{\delta\to0}\limsup_{n\to\infty}\frac{1}{n}\log\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>n\delta\}} N(\mu_n) \le -\inf_{\mu\in A} I_{SP}(\mu), \qquad (4.11)$$
$$\lim_{M\to\infty}\liminf_{n\to\infty}\frac{1}{n}\log\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>M\}} N(\mu_n) \ge -\inf_{\mu\in A^\circ} I_{SP}(\mu), \qquad (4.12)$$

where

$$N(\mu_n) := \frac{(2E)!}{E!\,2^E\prod_{i=0}^{n-1}(i!)^{h_i}} \times \frac{n!}{\prod_{i=0}^{n-1} h_i!}\,\Big(\frac{\beta}{n}\Big)^{E}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E}.$$

Proof. To begin, note that all the quantities in the definition of $N(.)$ can be expressed in terms of $\mu_n$, and so $N(.)$ is indeed a function of $\mu_n$. Towards proving (4.11), first note that Stirling's approximation gives

$$|\log n! - n\log n + n| = 0 \ \text{ if } n = 0, \qquad = 1 \ \text{ if } n = 1, \qquad \le C_1\log n \ \text{ for } n \ge 2,$$

for some $C_1 < \infty$. Using this along with the assumption $\mu_n(j) = 0$ for $j > n\delta$ gives

$$\Big|\frac{1}{n}\log N(\mu_n) + I_{SP}(\mu_n)\Big| \le C_2\Big(\frac{\log n}{n} + \delta + \max_{h:\,h_j=0,\ j>n\delta}\frac{1}{n}\sum_{j}\log(h_j\vee1)\Big)$$

for some $C_2 = C_2(\beta) < \infty$. By the Arithmetic Mean-Geometric Mean inequality the maximum in the last term on the r.h.s. occurs when all the non-zero $h_j$'s are equal, giving the bound $C_2(\log n/n + \delta + \delta\log(1/\delta))$, and so

$$\lim_{\delta\to0}\limsup_{n\to\infty}\frac{1}{n}\log\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>n\delta\}} N(\mu_n) \le -\lim_{\delta\to0}\liminf_{n\to\infty}\min_{\{\mu\in H_n\cap A:\ \mu(j)=0,\ j>n\delta\}} I_{SP}(\mu) \le -\inf_{\mu\in A} I_{SP}(\mu),$$

where the last step uses the fact that the infimum is taken over a larger collection of $\mu$'s.

Turning to the proof of (4.12), fix $\varepsilon > 0$ arbitrary. By a similar argument as above, the assumption $\mu_n(j) = 0$ for $j > M$ gives

$$\Big|\frac{1}{n}\log N(\mu_n) + I_{SP}(\mu_n)\Big| \le \frac{(C_2+M)\log n}{n},$$

and so

$$\lim_{M\to\infty}\liminf_{n\to\infty}\frac{1}{n}\log\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>M\}} N(\mu_n) \ge -\lim_{M\to\infty}\limsup_{n\to\infty}\min_{\{\mu\in H_n\cap A:\ \mu(j)=0,\ j>M\}} I_{SP}(\mu),$$

and so it suffices to prove that given $\varepsilon > 0$ and $M_0 < \infty$ arbitrary, there exists $M > M_0$ such that

$$\limsup_{n\to\infty}\min_{\{\mu\in H_n\cap A:\ \mu(j)=0,\ j>M\}} I_{SP}(\mu) \le \inf_{\mu\in A^\circ} I_{SP}(\mu) + \varepsilon. \qquad (4.13)$$

If $\inf_{\mu\in A^\circ} I_{SP}(\mu) = \infty$ then there is nothing to prove. So w.l.o.g. fix $\mu \in A^\circ$ such that $I_{SP}(\mu) < \infty$. Also, since $A^\circ$ is open, w.l.o.g. it can be assumed that $\mu(0) > 0$. Indeed, the sequence $\eta\delta_0 + (1-\eta)\mu$ converges to $\mu$ in $\|.\|_{\mathrm{mean}}$ as $\eta \to 0$, and

$$\lim_{\eta\to0} I_{SP}(\eta\delta_0 + (1-\eta)\mu) = I_{SP}(\mu).$$

Fixing $\mu \in A^\circ$ with $\mu(0) > 0$, let $\mu_M$ be the corresponding sequence as constructed in part (a) of Lemma 4.21. Thus by Lemma 4.21 and the openness of $A^\circ$ there exists $M > M_0$ such that $I_{SP}(\mu_M) \le I_{SP}(\mu) + \varepsilon$ and $\mu_M \in A^\circ$. Since $\mu_M(0) > 0$, by part (b) of Lemma 4.21 there exist $\sigma_n \in \mathcal S \cap H_n$ such that $\sigma_n(i) \to \mu_M(i)$ for $0 \le i \le M$, and $\sigma_n(i) = 0$ for $i > M$. But this readily gives

$$\|\sigma_n - \mu_M\|_{\mathrm{mean}} \xrightarrow{n\to\infty} 0, \qquad I_{SP}(\sigma_n) \xrightarrow{n\to\infty} I_{SP}(\mu_M).$$

Finally, since $\sigma_n \to \mu_M \in A^\circ$ with $A^\circ$ open, for all $n$ large enough $\sigma_n \in A^\circ \cap H_n$, and so

$$\limsup_{n\to\infty}\min_{\{\mu\in H_n\cap A:\ \mu(j)=0,\ j>M\}} I_{SP}(\mu) \le \lim_{n\to\infty} I_{SP}(\sigma_n) = I_{SP}(\mu_M) \le I_{SP}(\mu) + \varepsilon,$$

from which (4.13) follows on taking the infimum over $\mu \in A^\circ$.

Proof of Theorem 4.11. Fix $\delta > 0$ and note that $d_j(G)$ has a Binomial distribution with parameters $n-1$ and $\beta/n$, and so

$$P_{n,\beta}\Big(\max_{1\le j\le n} d_j(G) > n\delta\Big) \le n P_{n,\beta}(d_1 > n\delta) \le n\sum_{r>n\delta}\binom{n-1}{r}\Big(\frac{\beta}{n-1}\Big)^r \le n\Big(\frac{\beta}{n-1}\Big)^{n\delta} 2^{n-1} \le n\,(2\beta^\delta)^n\, n^{-n\delta},$$

for all large $n$. Also, since $E(G)$ has a Binomial distribution with parameters $\binom{n}{2}$ and $\beta/n$, Hoeffding's inequality gives

$$P_{n,\beta}(E(G) > nK) \le e^{-n/\delta},$$

where $K = K_\delta < \infty$. Note that the ordered degree sequences $d'$ are in one-one correspondence with the degree frequency vectors $h$ (recall from Lemma 4.19 that $N(h)$ denotes the number of graphs in $\mathcal G_n$ with degree frequency vector $h$), and so for any set $A \subset \mathcal S$,

$$P_{n,\beta}(\mu_n \in A) \le P_{n,\beta}\big(\mu_n \in A,\ E(G) \le Kn,\ h_i(G) = 0 \text{ for } i > n\delta\big) + P_{n,\beta}\Big(\max_{1\le j\le n} d_j(G) > n\delta\Big) + P_{n,\beta}(E(G) > Kn)$$
$$\le \sum_{\{G\in\mathcal G_n:\ \mu_n\in A,\ E(G)\le Kn,\ \mu_n(j)=0,\ j>n\delta\}}\Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)} + n\,(2\beta^\delta)^n\, n^{-n\delta} + e^{-n/\delta}.$$

The contribution of the second and third terms on the r.h.s. above is at most $2e^{-n/\delta}$ for large $n$, which on taking logs, dividing by $n$, and taking limits as $n \to \infty$ followed by $\delta \to 0$ gives $-\infty$; these two terms can therefore be ignored. Proceeding to bound the first term, note that any graph in the given range of summation can be constructed as follows: first choose the number of edges $E$, then choose a valid degree frequency vector $h$ compatible with $E$, and finally choose a graph with this degree sequence.

Since $E \le Kn$, there are at most $Kn$ ways to choose the number of edges. Given the value of $E$, choosing the degree frequency vector $h$ is equivalent to choosing the ordered degree sequence $d' := (d_{(1)} \ge d_{(2)} \ge \cdots \ge d_{(n)})$. Thus $d'$ is a partition of $2E$, and since the number of partitions of an integer $n$ is bounded by $e^{3\sqrt n}$ for all large $n$ (for a proof of this classical result see [28] or [38]), the number of possible ordered degree sequences $d'$ which are partitions of $2E$ is bounded by $e^{3\sqrt{2Kn}}$ for all large $n$. Finally, given the degree frequency vector $h$, a graph with this degree sequence can be chosen in $N(h)$ ways, where $N(.)$ is as in Lemma 4.19. Using the bound of part (a) of Lemma 4.19 and combining gives the following upper bound on the first term above:

$$nK\, e^{3\sqrt{2Kn}}\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>n\delta\}} N(\mu_n),$$

where the last step uses the definition of $N(\mu_n)$ from Lemma 4.22. Taking logs, dividing by $n$ and letting $n \to \infty$ gives

$$\limsup_{n\to\infty}\frac{1}{n}\log P_{n,\beta}(\mu_n \in A) \le \limsup_{n\to\infty}\frac{1}{n}\log\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>n\delta\}} N(\mu_n).$$

Letting $\delta \to 0$ and using (4.11), (4.3) follows. To prove the lower bound, fix $M < \infty$ and note that for any $A$,

$$P_{n,\beta}(\mu_n \in A) \ge P_{n,\beta}\big(\mu_n \in A,\ \mu_n(i) = 0 \text{ for } i > M\big) = \sum_{\{G\in\mathcal G_n:\ \mu_n\in A,\ \mu_n(i)=0,\ i>M\}}\Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)}.$$

The r.h.s. above is larger than

$$\sup_{\{\mu_n\in H_n\cap A:\ \mu_n(i)=0,\ i>M\}} N(h)\,\Big(\frac{\beta}{n}\Big)^{E(G)}\Big(1-\frac{\beta}{n}\Big)^{\binom{n}{2}-E(G)}.$$

By part (b) of Lemma 4.19 this is bounded below by

$$C(M)\max_{\{\mu_n\in H_n\cap A:\ \mu_n(i)=0,\ i>M\}} N(\mu_n),$$

where $N(\mu_n)$ is as in Lemma 4.22. As before, taking logs, dividing by $n$ and taking limits gives

$$\liminf_{n\to\infty}\frac{1}{n}\log P_{n,\beta}(\mu_n \in A) \ge \liminf_{n\to\infty}\frac{1}{n}\log\max_{\{\mu_n\in H_n\cap A:\ \mu_n(j)=0,\ j>M\}} N(\mu_n).$$

On letting M → ∞ and using (4.12), (4.4) follows.

Corollary 4.23. If $A$ is open,

$$\lim_{n\to\infty}\frac{1}{n}\log P_{n,\beta}(\mu_n \in A) = -\inf_{\mu\in A} I_{SP}(\mu).$$

Proof. Follows trivially from (4.3) and (4.4) of Theorem 4.11.

Chapter 5

Exponential family on permutations

5.1 Introduction

Analysis of permutation data has a fairly long history in statistics. One of the earlier papers in this area is the work of Mallows ([49]) in 1957, where the author proposed a class of non-uniform models for permutations. Using this approach, in 1978 Feigin and Cohen ([29]) analyzed the nature of agreement between several judges in a contest. In 1985, Critchlow ([17]) gave some examples where Mallows' model gives a good fit to ranking data. See also the works of Fligner and Verducci ([31], [32]), and Critchlow, Fligner and Verducci ([18]), which deal with various aspects of permutation models, as well as the book length treatment of Marden [50], which covers both theoretical and applied aspects of permutation modeling. Permutation modeling has also received some recent attention in the Computer Science literature; see for example [6], [15], [46], [47], and the references within. Of all the models on permutations, possibly the most famous and recurring model is the Mallows model with Kendall's Tau. One of the reasons for this is that for this model the normalizing constant is known explicitly, and so analyzing this model becomes a lot simpler. Analysis of other models is not easy to carry out because of the usual


difficulties with unknown normalizing constants. This chapter will focus on such models, and provide a starting point for their rigorous analysis. One of the ultimate goals of this theory is to develop estimation theory and tests of hypotheses for such models, so that if a permutation is believed to be non-random, coming from a particular non-uniform distribution, one can carry out tests of hypotheses to check the validity of such claims.

Recall from the introduction the definition of Mallows' models, which are of the form $e^{-\theta d(\pi,\sigma) - Z_n(\theta,\sigma)}$, where $d(\cdot,\cdot)$ is a metric on $S_n$, and $\sigma$ is a fixed permutation in $S_n$. By a metric $d(\cdot,\cdot)$ is meant a non-negative function on $S_n \times S_n$ satisfying the following conditions:

$$d(\pi,\sigma) \ge 0 \text{ with equality iff } \pi = \sigma, \qquad d(\pi,\sigma) = d(\sigma,\pi), \qquad d(\pi,\sigma) \le d(\pi,\tau) + d(\sigma,\tau).$$

Another reasonable restriction on $d(\cdot,\cdot)$ is right invariance, i.e.

$$d(\pi,\sigma) = d(\pi\tau, \sigma\tau) \quad \text{for all } \pi,\sigma,\tau \in S_n.$$

The justification for this is that $\tau$ can be thought of as an arbitrary relabeling of the original objects $\{1,2,\cdots,n\}$, and the distance between two permutations should be invariant under any such relabeling ([22, Chapter 6]). All the metrics $d(\cdot,\cdot)$ considered in this thesis are right invariant.

The above class of models is known as Mallows' models. Some of the common choices of right invariant metric $d(\cdot,\cdot)$ are the following ([22, Ch. 5-6]).

(a) Spearman’s Foot Rule CHAPTER 5. EXPONENTIAL FAMILY ON PERMUTATIONS 76

Pn i=1 |π(i) − σ(i)|

(b) Spearman’s Rank correlation

Pn 2 i=1(π(i) − σ(i))

(c) Hamming Distance

#{1 ≤ i ≤ n : π(i) =6 σ(i)}

(d) Kendall’s Tau

Minimum number of pairwise adjacent transpositions which converts π−1 into σ−1.

(e) Cayley’s distance

Minimum number of adjacent transpositions which converts π into σ=n− number of cycles in πσ−1.

(f) Ulam’s distance Number of deletion-insertion operations to convert π into σ=n− Length of the longest increasing subsequence in σπ−1.

See [22, Ch. 5-6] for more details on these models. To be precise, the Spearman's rank correlation term is the square of a metric, but this version is used more frequently as it is right invariant. If $d(\cdot,\cdot)$ is right invariant, then the normalizing constant is free of $\sigma$, as
$$\sum_{\pi \in S_n} e^{-\theta d(\pi,\sigma)} = \sum_{\pi \in S_n} e^{-\theta d(\pi\sigma^{-1}, e)} = \sum_{\pi \in S_n} e^{-\theta d(\pi, e)},$$

where $e$ is the identity permutation. Also, if $\pi$ is a sample from the probability mass function $e^{-\theta d(\pi,\sigma) - Z_n(\theta)}$, then $\pi \circ \sigma^{-1}$ is a sample from the probability mass function $e^{-\theta d(\pi,e) - Z_n(\theta)}$. Thus if $\sigma$ is known, without loss of generality, by a relabeling it can be assumed that $\sigma$ is the identity. If there is no reasonable guess for $\sigma$ then the question is harder, as now the task is to estimate the parameter $\sigma$ from an observation $\pi$ of equal size. This can be hard and possibly impossible, unless the function $\pi \mapsto d(\pi,\sigma)$ decays rapidly away from $\sigma$. To avoid such assumptions, this chapter will assume $\sigma$ is known and equals the identity. In particular, for the 1970 draft lottery example there is a very natural choice for the location parameter $\sigma$; this will be discussed further in section 7.

In all these models $\theta = 0$ corresponds to the uniform distribution, and the model shows neither attraction nor repulsion towards $e$. On the other hand, if $\theta$ is positive and large, then a random pick from this model is biased towards $e$.

5.2 Statement of main results

It is hard to study all Mallows’ models in a unified manner. The main reason is that some of these metrics (such as (a) and (b) above) are Euclidean type metrics, whereas the other four examples depend on the group structure of Sn. As a starting point, consider an exponential family of the form

$$Q_{n,f,\theta}(\pi) = e^{\theta \sum_{i=1}^n f(i/n,\, \pi(i)/n) - Z_n(f,\theta)}, \tag{5.1}$$
where $f$ is a continuous function on the unit square. In particular, if $f(x,y) = -|x-y|$ then
$$\sum_{i=1}^n f(i/n, \pi(i)/n) = -\frac{1}{n}\sum_{i=1}^n |i - \pi(i)|,$$

which is a scaled version of the Footrule (see (a) in the list above). For the choice $f(x,y) = -(x-y)^2$,
$$\sum_{i=1}^n f(i/n, \pi(i)/n) = -\frac{1}{n^2}\sum_{i=1}^n (i - \pi(i))^2$$
is a scaled version of Spearman's rank correlation statistic (see (b) in the list above). A simple calculation shows that the right hand side above is the same as
$$-\frac{(n+1)(2n+1)}{3n} + \frac{2}{n^2}\sum_{i=1}^n i\pi(i),$$
and so the same model would have been obtained by setting $f(x,y) = xy$.

The first main result of this chapter is the following theorem, which computes the limiting value of $\frac{1}{n}(Z_n(f,\theta) - Z_n(f,0))$. Since $Z_n(f,0) = Z_n(0) = \log n!$ for all choices of $f$, this gives an approximation for $Z_n(f,\theta)$. The theorem also shows what a random permutation from this model looks like for large $n$. Recall from chapter 2 the notion of convergence of permutations: given a permutation $\pi \in S_n$, one can define a probability measure $\mu_\pi$ on $[0,1]^2$ by the following density with respect to Lebesgue measure:

$$f_\pi(x,y) = n\,\mathbf{1}\{(x,y) : \pi(\lfloor nx \rfloor) = \lfloor ny \rfloor\}.$$

A sequence of permutations is said to converge to a measure $\mu$ if the corresponding sequence of measures converges weakly to $\mu$. Any such limiting object is an element of $\mathcal{M}$, the set of all probability measures on $[0,1]^2$ with uniform marginals.
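To make this notion concrete, the following is a minimal numerical sketch (an illustration, not part of the thesis): it bins the points $\{(i/n, \pi(i)/n)\}_{i=1}^n$ of a permutation into a $k \times k$ grid, which is how the empirical measures associated with permutations are visualized in sections 6 and 7 below. The function name and grid size are illustrative choices.

```python
import numpy as np

def permutation_histogram(pi, k=10):
    """k x k bin frequencies of the points {(i/n, pi(i)/n)}: a discrete
    proxy for the measure mu_pi associated with the permutation pi."""
    n = len(pi)
    i = np.arange(1, n + 1)
    # histogram2d returns raw counts; normalize to a probability matrix
    H, _, _ = np.histogram2d(i / n, pi / n, bins=k, range=[[0, 1], [0, 1]])
    return H / n

# Example: the identity permutation concentrates mass on the diagonal x = y.
print(permutation_histogram(np.arange(1, 1001), k=5).round(2))
```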

The statement of the following theorem uses the Kullback-Leibler divergence, which was defined in chapter 2.

Theorem 5.1. For any function $f \in C$, the set of all continuous functions on the unit square, consider the probability model $Q_{n,f,\theta}(\pi)$ as defined in (5.1). Then

(a)
$$\lim_{n\to\infty} \frac{Z_n(f,\theta) - Z_n(0)}{n} = \sup_{\mu \in \mathcal{M}} \{\theta\mu[f] - D(\mu\|u)\},$$
where $u$ is the uniform distribution on the unit square, and $\mu[f] := \int f\,d\mu$.

(b) Under this model the sequence of permutations $\pi = \pi_n$ converges in probability to $\mu_{f,\theta} \in \mathcal{M}$, where $\mu_{f,\theta}$ is the unique maximizer in part (a).

The above theorem gives an approximation of the log normalizing constant in terms of an optimization problem over $\mathcal{M}$, which is an infinite dimensional space. In general, such an optimization can be hard to carry out. The next theorem gives an iterative algorithm for computing the density of $\mu_{f,\theta}$ with respect to Lebesgue measure. Intuitively, the algorithm starts with the function $e^{\theta f(x,y)}$ and alternately scales it along the $x$ and $y$ marginals to produce uniform marginals in the limit.

Theorem 5.2. (a) Given $f \in C$, there exist functions $a_{f,\theta}(\cdot),\, b_{f,\theta}(\cdot) \in L^1[0,1]$, unique almost surely with respect to Lebesgue measure, satisfying the following property: the function $g_{f,\theta}(x,y) := e^{\theta f(x,y) + a_{f,\theta}(x) + b_{f,\theta}(y)}$ is a density on $[0,1]^2$ with respect to Lebesgue measure and has uniform marginals. Further, $g_{f,\theta}$ is the density of the maximizer $\mu_{f,\theta}$ of Theorem 5.1, and consequently
$$\sup_{\mu\in\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\} = -\int_0^1 [a_{f,\theta}(x) + b_{f,\theta}(x)]\,dx.$$

(b) Consider the maps $S_1, S_2 : C \mapsto C$ given by
$$S_1(\phi)(x,y) := \frac{\phi(x,y)}{\int_0^1 \phi(x,z)\,dz}, \qquad S_2(\phi)(x,y) := \frac{\phi(x,y)}{\int_0^1 \phi(z,y)\,dz},$$
and define the sequence $\phi_k \in C$ starting with
$$\phi_0(x,y) = \frac{e^{\theta f(x,y)}}{\int_{[0,1]^2} e^{\theta f(x,y)}\,dx\,dy}$$

and sequentially given by

$$\phi_{2k+1} = S_1(\phi_{2k}), \qquad \phi_{2k+2} = S_2(\phi_{2k+1}), \qquad k \ge 0.$$

If νk denotes the probability measure on the unit square given by

$$\nu_k(A) := \int_A \phi_k(x,y)\,dx\,dy,$$

then

$$\lim_{k\to\infty} D(\nu_k\|\mu_{f,\theta}) = 0,$$
and
$$\lim_{k\to\infty}\{\theta\nu_k[f] - D(\nu_k\|u)\} = \sup_{\mu\in\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\}.$$

Remark 5.3. The algorithm in part (b) of Theorem 5.2 is the Iterative Proportional Fitting Procedure (IPFP) for bivariate probability measures on the unit square. IPFP originated in the discrete setting in the work of Deming and Stephan ([21]) in 1940. The method is described below in brief. For more on this, see [19], [45], [60], [61], and the references therein.

One of the many equivalent ways of stating the problem is as follows: given a non-negative $m \times n$ matrix $A_0$ and non-negative vectors $u \in \mathbb{R}^m$, $v \in \mathbb{R}^n$, it is required to find positive diagonal matrices $D_1, D_2$ of size $m \times m$ and $n \times n$ respectively, such that the matrix $D_1 A_0 D_2$ has row sums $u$ and column sums $v$. For simplicity assume that $m = n$ (so $A_0$ is a square matrix), and $u = v = \mathbf{1}_n$. In this case the IPFP algorithm proceeds by alternately scaling the matrix by its row and column sums. More precisely, it is given by

$$A_{2k+1}(i,j) = \frac{A_{2k}(i,j)}{\sum_{l=1}^n A_{2k}(i,l)}, \qquad A_{2k+2}(i,j) = \frac{A_{2k+1}(i,j)}{\sum_{l=1}^n A_{2k+1}(l,j)}.$$
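As a concrete illustration (not part of the thesis), the following is a minimal sketch of the discrete iteration above in the case $u = v = \mathbf{1}_n$, where alternate row and column scaling drives the matrix towards a doubly stochastic one:

```python
import numpy as np

def ipfp(A0, num_iters=50):
    """Discrete IPFP (Sinkhorn) iteration: alternately rescale the rows and
    columns of a positive matrix so that all row and column sums approach 1
    (the u = v = 1_n case described above)."""
    A = A0.astype(float).copy()
    for _ in range(num_iters):
        A /= A.sum(axis=1, keepdims=True)  # row scaling: row sums become 1
        A /= A.sum(axis=0, keepdims=True)  # column scaling: column sums become 1
    return A

# Toy usage: a positive 3x3 matrix converges to a doubly stochastic matrix.
A = ipfp(np.array([[2.0, 1.0, 1.0], [1.0, 3.0, 1.0], [1.0, 1.0, 4.0]]))
print(A.sum(axis=0), A.sum(axis=1))  # both close to [1, 1, 1]
```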

It was shown by Sinkhorn [61] in 1967 that positivity of the entries of $A_0$ is a sufficient condition for the convergence of IPFP. In 1968, Kullback ([45]) posed the problem in the continuous setting as follows:

Given a bivariate density $a_0(x,y)$ on $\mathbb{R}^2$ and densities $u(x)$ and $v(y)$ on $\mathbb{R}$, the problem is to find functions $d_1(x)$ and $d_2(y)$ such that $a_0(x,y)d_1(x)d_2(y)$ is a density on $\mathbb{R}^2$ with marginals $u(x)$ and $v(y)$. For this chapter, one can assume that the original density $a_0$ is supported on the unit square, and the given marginals are $u(x) = v(y) = 1$ on the interval $[0,1]$. Thus the problem reduces to finding densities on the unit square of the form $a_0(x,y)d_1(x)d_2(y)$ with uniform marginals, which is exactly what part (a) of Theorem 5.2 achieves. A complete rigorous proof for the IPFP under various conditions in the continuous setting was finally given by Rüschendorf ([60]) in 1995.

Remark 5.4. Since $g_{f,\theta}(x,y) = e^{\theta f(x,y) + a_{f,\theta}(x) + b_{f,\theta}(y)}$ has uniform marginals, the functions $a_{f,\theta}(\cdot), b_{f,\theta}(\cdot)$ solve the joint integral equations
$$\int_0^1 e^{\theta f(x,z) + a_{f,\theta}(x) + b_{f,\theta}(z)}\,dz = 1, \qquad \int_0^1 e^{\theta f(z,y) + a_{f,\theta}(z) + b_{f,\theta}(y)}\,dz = 1, \qquad \text{for all } x,y \in [0,1].$$

By Theorem 5.2,
$$\lim_{n\to\infty}\frac{Z_n(f,\theta) - Z_n(0)}{n} = -\int_0^1 [a_{f,\theta}(x) + b_{f,\theta}(x)]\,dx.$$

For the limiting normalizing constant in the Mallows' model with the Footrule or with Spearman's rank correlation, one needs to take $f(x,y) = -|x-y|$ and $f(x,y) = -(x-y)^2$ (or $f(x,y) = xy$) respectively. Even though analytic computation of $a_{f,\theta}(\cdot), b_{f,\theta}(\cdot)$ might be difficult, the algorithm of Theorem 5.2 can be used for a numerical evaluation of these functions. Section 6 carries this out to compute approximations for the limiting measure for a particular $\theta$, as well as the limiting log normalizing constant as a function of $\theta$, for the Spearman's rank correlation model. From the pictures it seems that the density of the limiting measure is a fairly good approximation to the histogram of $\mu_\pi$ with $n = 10000$, thus validating the weak convergence result of Theorem 5.1 and the approximation algorithm of Theorem 5.2. For an example of a real-life permutation of smaller size ($n = 366$), see section 7 of this chapter, which analyzes the draft lottery data.

Another approach to estimation in such models is to estimate the parameter $\theta$ without estimating the normalizing constant. The following theorem constructs an explicit $\sqrt{n}$-consistent estimator for $\theta$ for the class of models considered in Theorem 5.1. This estimate is similar in spirit to Besag's pseudo-likelihood estimator (see chapter 3 for an introduction to pseudo-likelihood). In the usual definition, the product of the conditional distributions of every variable given the rest of the variables is taken as the pseudo-likelihood. Since in a permutation, given $\{\pi(j), j \ne i\}$, there is only one value that $\pi(i)$ can take, it does not make sense to look at the conditional distribution of $(\pi(i) \mid \pi(j), j \ne i)$. In this case a meaningful thing to consider is the distribution of $(\pi(i), \pi(j) \mid \pi(k), k \ne i,j)$, which gives the pseudo-likelihood as

$$\prod_{1 \le i < j \le n} Q_{n,f,\theta}(\pi(i), \pi(j) \mid \pi(k),\, k \ne i,j).$$

The pseudo-likelihood estimate $\hat\theta_n$ is obtained by maximizing the above expression. Taking the log of the pseudo-likelihood and differentiating with respect to $\theta$ gives

$$\sum_{1 \le i < j \le n}\Big[y(i,i,\pi) + y(j,j,\pi) - y(i,j,\pi) - y(j,i,\pi)\Big]\,\frac{e^{\theta y(i,j,\pi)+\theta y(j,i,\pi)}}{e^{\theta y(i,i,\pi)+\theta y(j,j,\pi)} + e^{\theta y(i,j,\pi)+\theta y(j,i,\pi)}},$$
where $y(i,j,\pi) := f(i/n, \pi(j)/n)$.

Theorem 5.5. (a) In the setting of Theorem 5.1, if

$$f(x_1,y_1) + f(x_2,y_2) - f(x_1,y_2) - f(x_2,y_1) \equiv 0, \tag{5.2}$$

then $Q_{n,f,\theta}$ is the same as $\mathbb{P}_n$, the uniform measure on $S_n$, and so there are no consistent estimates for $\theta$.

(b) If (5.2) does not hold, then there exist consistent estimates for $\theta$. In fact, the following is a construction of a $\sqrt{n}$-consistent estimate for $\theta$:

Let $y(i,j,\pi) = f(i/n, \pi(j)/n)$, $N_n := n(n-1)/2$, and let $P_n(\pi,\theta)$ denote the expression
$$\frac{1}{N_n}\sum_{1 \le i < j \le n}\Big[y(i,i,\pi) + y(j,j,\pi) - y(i,j,\pi) - y(j,i,\pi)\Big]\,\frac{e^{\theta y(i,j,\pi)+\theta y(j,i,\pi)}}{e^{\theta y(i,i,\pi)+\theta y(j,j,\pi)} + e^{\theta y(i,j,\pi)+\theta y(j,i,\pi)}}.$$

The equation $P_n(\pi,\theta) = 0$ has a unique root in $\theta$ with probability tending to 1. Denoting this root by $\hat\theta_n$, one has that $\sqrt{n}(\hat\theta_n - \theta)$ is $O_p(1)$ under $Q_{n,f,\theta}$.
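Since the estimating equation $P_n(\pi,\theta) = 0$ is one-dimensional and $P_n$ is decreasing in $\theta$ (see the proof below), it can be solved numerically by any bracketing root-finder. The following is a hedged sketch, not thesis code, assuming a NumPy/SciPy environment; the function name and bracketing interval are illustrative choices.

```python
import numpy as np
from scipy.optimize import brentq

def pseudo_score(theta, pi, f):
    """P_n(pi, theta) of Theorem 5.5(b): y(i,j) = f(i/n, pi(j)/n), summed
    over pairs i < j with the conditional-swap weight."""
    n = len(pi)
    grid = np.arange(1, n + 1) / n
    y = f(grid[:, None], pi[None, :] / n)     # y[i, j] = f(i/n, pi(j)/n)
    diag = np.diag(y)
    c = diag[:, None] + diag[None, :]         # c_ij = y(i,i) + y(j,j)
    d = y + y.T                               # d_ij = y(i,j) + y(j,i)
    w = np.exp(theta * d) / (np.exp(theta * c) + np.exp(theta * d))
    iu = np.triu_indices(n, k=1)              # pairs with i < j
    return ((c - d) * w)[iu].sum() / (n * (n - 1) / 2)

# Hypothetical usage with the Spearman-type choice f(x, y) = x*y, given an
# observed permutation pi (a NumPy array of 1..n); the bracket [-50, 50]
# is illustrative and must contain a sign change:
# theta_hat = brentq(pseudo_score, -50, 50, args=(pi, lambda x, y: x * y))
```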

Remark 5.6. Condition (5.2) is equivalent to the existence of two functions $\phi$ and $\psi$ such that $f(x,y) = \phi(x) + \psi(y)$. This will be made precise during the proof of Theorem 5.5. In this case it is easy to see that $\sum_{i=1}^n f(i/n, \pi(i)/n)$ is a constant free of $\pi$, and so the probability distribution is the same as the uniform distribution $\mathbb{P}_n$ and does not depend on $\theta$ at all. Thus there is no way to give any reasonable estimate for $\theta$ in this case.

All the results above relate to the model $Q_{n,f,\theta}$ as defined in (5.1). However, the tools used to prove these results are quite robust, and can handle other models as well. As an illustration of this, in section 4 Proposition 5.7 computes the limiting normalizing constant of the Mallows' model with Kendall's Tau as its metric ((d) in the original list of metrics).

Proposition 5.7. Consider the Mallows' model on $S_n$ with Kendall's Tau as the metric, defined by
$$M_{n,\theta}(\pi) = e^{-\frac{\theta}{n}\sum_{i<j} 1_{\pi(i)>\pi(j)} - Z_n(\theta)},$$
where $Z_n(\theta)$ is the appropriate normalizing constant. Then, with $h : T^2 \mapsto \mathbb{R}$ defined by
$$h((x_1,y_1),(x_2,y_2)) := 1_{(x_1-x_2)(y_1-y_2)<0},$$
one has
$$\lim_{n\to\infty}\frac{Z_n(\theta) - Z_n(0)}{n} = \sup_{\mu\in\mathcal{M}}\Big\{-\frac{\theta}{2}(\mu\times\mu)(h) - D(\mu\|u)\Big\}.$$

Remark 5.8. As pointed out in chapter 3, the above limit for the log normalizing constant for the Mallows' model with Kendall's Tau was already proved in [63] by a different argument, where the author also derived the weak limit of $\mu_\pi$ by solving a partial differential equation (see [63, Theorem 1.1]). This proposition shows that the general techniques developed in this chapter can handle this case as well.

Since in this case the partition function $Z_n(\theta)$ is explicitly known from [26], the construction of a consistent estimator is easy. In fact, the maximum likelihood estimate is computable numerically, and it is very easy to check that it is $\sqrt{n}$-consistent and asymptotically normal. The pseudo-likelihood is possibly not well behaved in this case, as the conditional distribution of $(\pi(i), \pi(j))$ can change a lot if $i$ and $j$ are far apart and $\pi(i)$ and $\pi(j)$ are swapped.

Even though the Mallows’ model with Kendall’s Tau is not in the setting of Theo- rem 5.1, estimation of the log normalization constant is still possible. This is because the function µ 7→ (µ × µ)(h) is continuous on M with respect to weak topology, and is the natural extension for the number of inversions of a permutation to a general probability measure in M . The continuity of this function also follows from [40, Lemma 5.3], but Proposition 5.7 contains a proof for the sake of completeness. Thus to explore other non uniform models on permutations, one needs to understand the continuous real valued functionals on M . Any continuous function on M can be used as a sufficient statistic for an exponential family on permutations, which can be analyzed using this technique. Any function which is continuous with respect to weak topology on [0, 1]2 is continuous on M . As an example, if f is any continuous function on [0, 1]4, then the function µ 7→ [µ × µ](f) is continuous. For an example of a natural function on permutations which is not continuous, let N(π) denote the number of fixed points of π. Then the function π 7→ N(π)/n is not continuous on M . Indeed, if π denotes the identity permutation on Sn and σ denotes the permutation given by σ(i) = i + 1( mod n), then both π and σ converge to the same measure (which is the uniform measure on the diagonal x = y), but N(π) = n CHAPTER 5. EXPONENTIAL FAMILY ON PERMUTATIONS 85

and N(σ) = 0.

Another interesting problem is to investigate whether there is a different topology which is also useful for this problem. This is because, as pointed out in chapter 2, the topology of weak convergence does not seem to reflect the group structure of $S_n$. Recall from chapter 2 that even if $\pi_n$ and $\sigma_n$ converge to $\mu$ and $\nu$ respectively, $\pi_n\sigma_n$ need not converge. A better topology might take the composition action into account, thus producing different continuous functionals.

The outline of this chapter is as follows: Section 3 describes a large deviation result in Theorem 5.9, which is the main tool for proving the results of this chapter. This theorem is an easy consequence of [64, Theorem 1]. However, section 8 contains an independent and more direct proof of Theorem 5.9 using permutation limit theory, which might be of independent interest. The proof is carried out using Theorem 2.6, by choosing a suitable base for the weak topology.

As an application of Theorem 5.9, section 4 carries out the proofs of Theorems 5.1, 5.2, 5.5, and Proposition 5.7. Section 5 contains some approximations for $a_{f,\theta}$ for $\theta \approx 0$, under the additional assumption that $f$ is symmetric.

Section 6 explores the particular exponential family with Spearman's rank correlation as sufficient statistic, and illustrates the weak convergence of $\pi$ by comparing the theoretical prediction of Theorem 5.1 with a random sample from this distribution. The density of the limiting measure under this model is plotted on a discrete grid using a discretized version of the algorithm proposed in Theorem 5.2. This gives an approximation to the limiting measure $\mu_{f,\theta}$. To compare this with $\mu_\pi$, a random permutation is obtained from this model with $n = 10000$. The algorithm used to draw from this model is a Swendsen-Wang type algorithm, adapted from [1]. The histogram of the corresponding measure has a pattern very similar to the limiting density plotted on the grid, thus illustrating the weak convergence result of

Theorem 5.1.

Section 7 analyzes the draft lottery data of 1970 using tools developed in this chapter. Recall from the introduction of this thesis the example of the draft lottery, which was used to determine the relative order in which male U.S. citizens would join the army based on their birthdays.

Finally section 8 gives the proof of Theorem 5.9 using permutation limit theory.

5.3 The large deviation principle

The main tool for proving the results of this chapter is a large deviation theorem for $\mu_\pi$ with respect to weak convergence on $\mathcal{M}$, where $\pi \sim \mathbb{P}_n$, the uniform probability measure on $S_n$. This result is stated below.

Theorem 5.9. If $\pi \sim \mathbb{P}_n$, the uniform measure on $S_n$, the sequence $\mu_\pi$ satisfies a large deviation principle on $\mathcal{M}$ with the good rate function $I(\mu) := D(\mu\|u)$. More precisely, for any set $A \subset \mathcal{M}$,

$$-\inf_{\mu\in A^o} D(\mu\|u) \le \liminf_{n\to\infty}\frac{1}{n}\log\mathbb{P}_n(A) \le \limsup_{n\to\infty}\frac{1}{n}\log\mathbb{P}_n(A) \le -\inf_{\mu\in \bar A} D(\mu\|u),$$
where $A^o$ and $\bar A$ denote the interior and closure of $A$ respectively.

It was shown in section 3 of chapter 2 that
$$d_\infty(\mu_\pi, \nu_\pi) = \sup_{0\le x,y\le 1}|F_{\mu_\pi}(x,y) - F_{\nu_\pi}(x,y)| \le \frac{2}{n}.$$

Thus the two sequences $\mu_\pi$ and $\nu_\pi$ are exponentially equivalent, and so it suffices to study the large deviations of either of them. It should be pointed out here that the large deviation of $\nu_\pi$ follows from [64, Theorem 1], which says the following:

Theorem. Let $\Sigma$ be a Polish space, and let $\{x_{i,n}, 1 \le i \le n\}$ be a triangular array of elements in $\Sigma$ satisfying the following condition:
$$\frac{1}{n}\sum_{i=1}^n \delta_{x_{i,n}} \xrightarrow{w} \mu,$$
where $\mu$ is a probability measure on $\Sigma$. If $\pi \sim \mathbb{P}_n$, then the sequence
$$\frac{1}{n}\sum_{i=1}^n \delta_{(x_{i,n},\, x_{\pi(i),n})}$$
satisfies a large deviation principle on the space of probability measures on $\Sigma\times\Sigma$ with the topology of weak convergence, and rate function given by
$$I_{JT}(\nu) := D(\nu\|\mu\times\mu) \text{ if } \nu_1 = \nu_2 = \mu, \text{ and } \infty \text{ otherwise},$$
where $\nu_1$ and $\nu_2$ are the marginals of $\nu$.

The large deviation result for $\nu_\pi$ follows from this general result on taking $\Sigma = [0,1]$ and $x_{i,n} = i/n$. In this case
$$\frac{1}{n}\sum_{i=1}^n \delta_{x_{i,n}} \xrightarrow{w} U[0,1].$$
Since under $u$ the marginals are i.i.d. $U[0,1]$, the above rate function matches the rate function of Theorem 5.9. Section 8 of this chapter gives a direct proof of Theorem 5.9 which is simpler than the proof of [64], and might be of independent interest. The main ingredient of this proof is the permutation limit theory developed in [40] and outlined in chapter 2.

The next section proves the main results of this chapter using Theorem 5.9.

5.4 Proofs of main results

The following proposition derives the large deviation of $\nu_\pi$ from that of $\mu_\pi$.

Proposition 5.10. If $\pi \sim \mathbb{P}_n$, the uniform probability measure on $S_n$, the sequences of probability measures $\{\nu_\pi\}$ and $\{\mu_\pi\}$ are exponentially equivalent. Consequently $\nu_\pi$ satisfies a large deviation principle on the space of probability measures on $[0,1]^2$ with respect to the weak topology, with the good rate function $I(\cdot)$ given by

$$I(\mu) := \begin{cases} D(\mu\|u) & \text{if } \mu\in\mathcal{M}, \\ \infty & \text{otherwise.} \end{cases}$$

Proof. Since the set of all probability measures on $[0,1]^2$ is compact, the set $\mathcal{M}$ is compact as well. An application of Lemma 2.7 and the large deviation result for $\mu_\pi$ (Theorem 5.9) gives that under $\mathbb{P}_n$, the sequence $\mu_\pi$ satisfies a large deviation principle on the space of probability measures on $[0,1]^2$ with the rate function $I$. To complete the proof note that the two sequences $\{\mu_\pi\}$ and $\{\nu_\pi\}$ are exponentially equivalent, and so by Theorem 2.5, $\nu_\pi$ satisfies the same large deviation principle as $\mu_\pi$.

Theorem 5.1 now follows from Proposition 5.10 as follows.

Proof of Theorem 5.1. (a) Note that

$$e^{Z_n(f,\theta) - Z_n(0)} = \frac{1}{n!}\sum_{\pi\in S_n} e^{\theta\sum_{i=1}^n f(i/n,\,\pi(i)/n)} = \mathbb{E}_{\mathbb{P}_n}\, e^{n\theta\nu_\pi[f]},$$

where $Z_n(0) = \log n!$, and $\mu[f] = \int_{[0,1]^2} f\,d\mu$ denotes the mean of $f$ with respect to $\mu$. Since the function $\mu \mapsto \theta\mu[f]$ is bounded and continuous, an application of Varadhan's Lemma (Theorem 2.8) gives the desired conclusion.

(b) The function $\mu \mapsto \theta\mu[f] - D(\mu\|u)$ is strictly concave (on the set where it is finite) and upper semi-continuous on the compact set $\mathcal{M}$, and so the global maximum is attained at a unique $\mu_{f,\theta} \in \mathcal{M}$. The proof now follows from arguments similar to the proof of Lemma 4.15 in chapter 4. For any open set $U$ containing $\mu_{f,\theta}$, define a function $T : \mathcal{M} \mapsto [-\infty,\infty)$ by

$$T(\mu) = \theta\mu[f] \ \text{ if } \mu\in U^c, \qquad -\infty \ \text{ otherwise}.$$

Then
$$\frac{1}{n}\log Q_{n,f,\theta}(\nu_\pi\in U^c) = \frac{1}{n}\log\mathbb{E}_{\mathbb{P}_n} e^{nT(\nu_\pi)} - \frac{1}{n}\big[Z_n(f,\theta) - Z_n(0)\big].$$
Since $T$ is upper semi-continuous and bounded above, (2.4) holds trivially, and so

by Lemma 2.10 along with the large deviation result for $\nu_\pi$,
$$\limsup_{n\to\infty}\frac{1}{n}\log\mathbb{E}_{\mathbb{P}_n} e^{nT(\nu_\pi)} \le \sup_{\mu\in U^c\cap\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\}.$$

This, along with part (a), gives
$$\limsup_{n\to\infty}\frac{1}{n}\log Q_{n,f,\theta}(\nu_\pi\in U^c) \le \sup_{\mu\in U^c\cap\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\} - \sup_{\mu\in\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\}.$$

The quantity on the right hand side above is negative, as the supremum over the compact set $U^c\cap\mathcal{M}$ is attained, and the global maximizer $\mu_{f,\theta}$ is not in $U^c$ by choice. This proves the weak convergence of $\{\nu_\pi\}$ to $\mu_{f,\theta}$ in probability.

Proof of Theorem 5.2. (a) Since $\theta f(\cdot)$ is integrable with respect to $du$, by [19, Corollary 3.2] there exist functions $a_{f,\theta}(\cdot), b_{f,\theta}(\cdot) \in L^1[0,1]$ such that

$$d\mu_{a,b} = g_{a,b}\,dx\,dy := e^{\theta f(x,y) + a_{f,\theta}(x) + b_{f,\theta}(y)}\,dx\,dy \in \mathcal{M}.$$

The proof that $\mu_{a,b} = \mu_{f,\theta}$ is by way of contradiction. Suppose this is not true. Since $\mu_{f,\theta}$ is the unique global minimizer of $I_{f,\theta}(\mu) := D(\mu\|u) - \theta\mu[f]$, setting

$$h(\alpha) := I_{f,\theta}((1-\alpha)\mu_{a,b} + \alpha\mu_{f,\theta}),$$

it must be that $h(\alpha)$ has a global minimum at $\alpha = 1$. Also

$$I_{f,\theta}(\mu_{f,\theta}) \le I_{f,\theta}(u) = -\theta u[f] < \infty,$$

which forces $D(\mu_{f,\theta}\|u) < \infty$. Thus letting $\phi_{f,\theta} := \frac{d\mu_{f,\theta}}{du}$ gives

$$h'(0) = \int_T (\phi_{f,\theta}(x,y) - g_{a,b}(x,y))(\log g_{a,b}(x,y) - \theta f(x,y))\,du$$
$$= \int_T (\phi_{f,\theta}(x,y) - g_{a,b}(x,y))(a_{f,\theta}(x) + b_{f,\theta}(y))\,du$$
$$= \mathbb{E}_{\mu_{f,\theta}}[a_{f,\theta}(X) + b_{f,\theta}(Y)] - \mathbb{E}_{\mu_{a,b}}[a_{f,\theta}(X) + b_{f,\theta}(Y)] = 0,$$

where the last equality follows from the fact that both $\mu_{f,\theta}$ and $\mu_{a,b}$ have the same uniform marginals. But $h$ is convex, which forces $\alpha = 0$ to be a global minimum of $h(\cdot)$ as well. Thus $h(0) = h(1)$, a contradiction to the uniqueness of $\arg\max_{\mu\in\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\}$ proved in part (b) of Theorem 5.1. Thus $d\mu_{f,\theta} = d\mu_{a,b} = e^{\theta f(x,y) + a_{f,\theta}(x) + b_{f,\theta}(y)}\,dx\,dy$. Finally, the almost sure uniqueness of $a_{f,\theta}(\cdot)$ and $b_{f,\theta}(\cdot)$ follows from the uniqueness of the optimizing measure $\mu_{f,\theta}$. Using part (a) of Theorem 5.1, the last claim of part (a) then follows by a simple calculation.

(b) Since $\phi_0$ is continuous and strictly positive, it follows by induction that $\phi_k$ is well defined and strictly positive, and is of the form $e^{\theta f(x,y) + a_k(x) + b_k(y)}$ with $a_k, b_k \in C[0,1]$. Also, since $\phi_k(\cdot,\cdot)$ integrates to 1 along either the $x$ or the $y$ marginal for $k \ge 1$, it follows by Fubini's theorem that $\nu_k$ is a probability measure for each $k$. It then follows from [60, Theorem 3.1] that

$$\lim_{k\to\infty} D(\nu_k\|\mu_{f,\theta}) = 0, \tag{5.3}$$

$$\lim_{k\to\infty} D(\nu_k\|u) = D(\mu_{f,\theta}\|u). \tag{5.4}$$

To see that the theorem is applicable, one can check that [60, Condition (B3)] holds in this case. Indeed, using the notation of that paper, one has
$$h(x,y) = \phi_0(x,y) = \frac{e^{\theta f(x,y)}}{\int_{[0,1]^2} e^{\theta f(x,y)}\,dx\,dy}, \qquad r_1(x) \equiv 1,$$
and so the condition follows as $e^{\theta f(x,y)}$ is a continuous function on the compact set $[0,1]^2$.

Thus one has
$$\lim_{k\to\infty}\{\theta\nu_k[f] - D(\nu_k\|u)\} = \theta\mu_{f,\theta}[f] - D(\mu_{f,\theta}\|u) = \sup_{\mu\in\mathcal{M}}\{\theta\mu[f] - D(\mu\|u)\},$$
where the first equality follows from (5.3) and (5.4) on noting that convergence in Kullback-Leibler divergence implies weak convergence, and the second equality follows from Theorem 5.1. This completes the proof of the theorem.

Before proving Theorem 5.5, a general lemma is stated which constructs $\sqrt{n}$-consistent estimates of $\theta$ in one-parameter families. The idea of this proof is taken from [11].

Lemma 5.11. For every $\theta \in \mathbb{R}$, $n \in \mathbb{N}$, let $R_{n,\theta}$ be a probability measure on $(\mathcal{X}_n, \mathcal{F}_n)$, and let $\theta_0 \in \mathbb{R}$ be fixed. Let $P_n(x,\theta)$ be a function which is measurable in $x$ for fixed $\theta$, and absolutely continuous in $\theta$, such that the following two conditions hold:

(a) There exists $C = C(\theta_0) < \infty$ such that
$$\mathbb{E}_{R_{n,\theta_0}} P_n(x,\theta_0)^2 \le \frac{C}{n}. \tag{5.5}$$

(b) There exists a strictly positive continuous function $\lambda : \mathbb{R} \mapsto \mathbb{R}$ such that
$$\lim_{n\to\infty} R_{n,\theta_0}\big(P_n'(x,\theta) \le -\lambda(\theta),\ \forall\theta\in\mathbb{R}\big) = 1, \tag{5.6}$$

where the derivative is with respect to $\theta$ (this exists almost surely by absolute continuity). Then, with probability tending to one, the equation $P_n(x,\theta) = 0$ has a unique root $\hat\theta_n$ in $\theta$. Further, $\sqrt{n}(\hat\theta_n - \theta_0)$ is $O_p(1)$ under $R_{n,\theta_0}$.

Proof. For any sequence $M_n \to \infty$ such that $M_n = o(\sqrt{n})$,

$$R_{n,\theta_0}\big(|P_n(x,\theta_0)| > M_n/\sqrt{n}\big) \le \frac{n}{M_n^2}\,\mathbb{E}_{R_{n,\theta_0}} P_n(x,\theta_0)^2 \le \frac{C}{M_n^2} \to 0.$$

This along with (5.6) gives that $R_{n,\theta_0}(A_n) \to 1$, where

$$A_n := \{x\in\mathcal{X}_n : |P_n(x,\theta_0)| \le M_n/\sqrt{n},\ P_n'(x,\theta) \le -\lambda(\theta)\ \forall\theta\in\mathbb{R}\}.$$

Restricting attention to $x \in A_n$, for any $\delta > 0$,

$$P_n(x,\theta_0+\delta) = P_n(x,\theta_0) + \int_{\theta_0}^{\theta_0+\delta} P_n'(x,\theta)\,d\theta \le \frac{M_n}{\sqrt{n}} - \delta\inf_{\theta\in[\theta_0,\theta_0+\delta]}\lambda(\theta) < 0$$
for all large $n$. Similarly it can be shown that $P_n(x,\theta_0-\delta) > 0$ for all $x\in A_n$, for all large $n$. Also note that $P_n(x,\theta)$ is strictly decreasing in $\theta$ for $x\in A_n$, and so by continuity of $\theta\mapsto P_n(x,\theta)$ there exists a unique $\hat\theta_n$ satisfying $P_n(x,\hat\theta_n) = 0$, with

$\theta_0 - \delta < \hat\theta_n < \theta_0 + \delta$ for all large $n$. Since $\delta > 0$ is arbitrary, this proves that $\hat\theta_n$ is consistent for $\theta_0$. To prove $\sqrt{n}$-consistency, note that

$$\frac{M_n}{\sqrt{n}} \ge |P_n(x,\theta_0)| = |P_n(x,\theta_0) - P_n(x,\hat\theta_n)| \ge \Big|\int_{\hat\theta_n}^{\theta_0}\lambda(\theta)\,d\theta\Big| \ge \Big[\inf_{|\theta-\theta_0|\le\delta}\lambda(\theta)\Big]\,|\hat\theta_n - \theta_0|,$$
and so $R_{n,\theta_0}(\sqrt{n}\,|\hat\theta_n - \theta_0| \le KM_n) \to 1$, where $K := [\inf_{|\theta-\theta_0|\le\delta}\lambda(\theta)]^{-1} < \infty$. Since this holds for any $M_n\to\infty$, it follows that $\sqrt{n}(\hat\theta_n - \theta_0)$ is $O_p(1)$ under $R_{n,\theta_0}$, concluding the proof of the lemma.

Proof of Theorem 5.5. (a) If (5.2) holds, then fixing $y\in[0,1]$ and setting $H(x,y) := f(x,y) - f(x,0)$,

$$H(x_1,y) - H(x_2,y) = f(x_1,y) - f(x_1,0) - f(x_2,y) + f(x_2,0) = 0,$$

and so $H(x,y)$ depends only on $y$. Denote $H(x,y)$ by $\psi(y)$. Then $f(x,y)$ can be written as
$$f(x,y) = f(x,0) + \psi(y) = \phi(x) + \psi(y),$$

where $\phi(x) := f(x,0)$. Using this representation gives

$$\sum_{i=1}^n f(i/n,\pi(i)/n) = \sum_{i=1}^n \phi(i/n) + \sum_{i=1}^n \psi(i/n) =: x_n$$

for some deterministic sequence of reals $x_n$ free of $\pi$. Thus the distribution $Q_{n,f,\theta}$ is the same as $\mathbb{P}_n$, and consequently no consistent estimator for $\theta$ exists.

(b) It suffices to check the two conditions of Lemma 5.11 with $R_{n,\theta} = Q_{n,f,\theta}$, $\mathcal{X}_n = S_n$. For checking (5.5), an exchangeable pair is constructed.

Consider the following exchangeable pair of permutations $(\pi, \pi')$ on $S_n$: pick $\pi$ from $Q_{n,f,\theta}$. To construct $\pi'$, first pick a pair $(I,J)$ uniformly from the set of all $\binom{n}{2}$ pairs $\{(i,j) : 1\le i<j\le n\}$, and replace $(\pi(I), \pi(J))$ by an independent pick from the conditional distribution $(\pi(I), \pi(J) \mid \pi(k), k\ne I,J)$. By a simple calculation, the probabilities turn out to be

$$(\pi'(I),\pi'(J)) = (\pi(I),\pi(J)) \ \text{ w.p. } \ Q_{n,f,\theta}(\pi'(I)=\pi(I),\,\pi'(J)=\pi(J)\mid\pi(k),k\ne I,J) = \frac{e^{\theta y(I,I,\pi)+\theta y(J,J,\pi)}}{e^{\theta y(I,I,\pi)+\theta y(J,J,\pi)} + e^{\theta y(I,J,\pi)+\theta y(J,I,\pi)}},$$
$$(\pi'(I),\pi'(J)) = (\pi(J),\pi(I)) \ \text{ w.p. } \ Q_{n,f,\theta}(\pi'(I)=\pi(J),\,\pi'(J)=\pi(I)\mid\pi(k),k\ne I,J) = \frac{e^{\theta y(I,J,\pi)+\theta y(J,I,\pi)}}{e^{\theta y(I,I,\pi)+\theta y(J,J,\pi)} + e^{\theta y(I,J,\pi)+\theta y(J,I,\pi)}}.$$

Set $\pi'(i) = \pi(i)$ for all $i \ne I,J$. It can be readily checked that $(\pi, \pi')$ is indeed an exchangeable pair. Also, defining

$$W(\pi) := \sum_{i=1}^n f(i/n, \pi(i)/n), \quad \text{and} \quad F(\pi,\pi') := W(\pi) - W(\pi'),$$

one can check from the construction of $(\pi, \pi')$ that

$$\mathbb{E}_{Q_{n,f,\theta}}[F(\pi,\pi')\mid\pi] = W(\pi) - \mathbb{E}_{Q_{n,f,\theta}}[W(\pi')\mid\pi] = P_n(\pi,\theta),$$

where $P_n(\pi,\theta)$ is as defined in the statement of Theorem 5.5. Also,

$$\begin{aligned}
\mathbb{E}_{Q_{n,f,\theta}} P_n(\pi,\theta)^2 &= \mathbb{E}_{Q_{n,f,\theta}}\, P_n(\pi,\theta)\,\mathbb{E}_{Q_{n,f,\theta}}[F(\pi,\pi')\mid\pi] \\
&= \mathbb{E}_{Q_{n,f,\theta}}\, P_n(\pi,\theta) F(\pi,\pi') \\
&= \mathbb{E}_{Q_{n,f,\theta}}\, P_n(\pi',\theta) F(\pi',\pi) \\
&= -\mathbb{E}_{Q_{n,f,\theta}}\, P_n(\pi',\theta) F(\pi,\pi') \\
&= \tfrac{1}{2}\,\mathbb{E}_{Q_{n,f,\theta}}\, (P_n(\pi,\theta) - P_n(\pi',\theta)) F(\pi,\pi'),
\end{aligned}$$

where the third line uses the exchangeability of $(\pi,\pi')$, the fourth line uses the antisymmetry of $F$, and the last line is obtained by adding the second and fourth lines together and dividing by 2. This readily implies

$$\mathbb{E}_{Q_{n,f,\theta}} P_n(\pi,\theta)^2 = \mathbb{E}_{Q_{n,f,\theta}} V_n(\pi), \tag{5.7}$$

where $V_n(\pi) = \frac{1}{2}\,\mathbb{E}_{Q_{n,f,\theta}}\big[(P_n(\pi,\theta) - P_n(\pi',\theta)) F(\pi,\pi')\mid\pi\big]$.

Letting $\pi^{ij}$ denote $\pi$ with the entries $\pi(i)$ and $\pi(j)$ swapped, $V_n(\pi)$ can be written as

$$\frac{1}{2N_n}\sum_{1\le i<j\le n}\Big[P_n(\pi,\theta) - P_n(\pi^{ij},\theta)\Big]\Big[y(i,i,\pi)+y(j,j,\pi)-y(i,j,\pi)-y(j,i,\pi)\Big]\,\frac{e^{\theta y(i,j,\pi)+\theta y(j,i,\pi)}}{e^{\theta y(i,i,\pi)+\theta y(j,j,\pi)}+e^{\theta y(i,j,\pi)+\theta y(j,i,\pi)}},$$
where the last factor is the conditional swap probability displayed above, and in particular lies in $[0,1]$.

Also note that for any $(i,j)$ one has

$$|P_n(\pi,\theta) - P_n(\pi^{ij},\theta)| \le \frac{2nM}{N_n}, \quad \text{where } M := 4\sup_T|f|.$$

This, along with the above representation, gives $|V_n(\pi)| \le \frac{nM^2}{N_n}$, which, along with (5.7), completes the proof of (5.5) with $C = 3M^2$.

Proceeding to check (5.6), note that with $y(i,j,\pi) = f(i/n,\pi(j)/n)$ as in the statement of the theorem, setting $c_{ij} := y(i,i,\pi) + y(j,j,\pi)$ and $d_{ij} := y(i,j,\pi) + y(j,i,\pi)$ gives

$$-P_n'(\pi,\theta) = \frac{1}{N_n}\sum_{1\le i<j\le n}(c_{ij} - d_{ij})^2\,\frac{e^{\theta(c_{ij}+d_{ij})}}{(e^{\theta c_{ij}} + e^{\theta d_{ij}})^2} \ge \frac{e^{-|\theta|M}}{8N_n}\sum_{1\le i<j\le n}(c_{ij} - d_{ij})^2,$$

where the last inequality uses the fact that $|c_{ij}| \le M/2$, $|d_{ij}| \le M/2$ by definition of $M$. Since the function $g : [0,1]^4 \mapsto \mathbb{R}$ defined by

$$g((x_1,y_1),(x_2,y_2)) := \big[f(x_1,y_1) + f(x_2,y_2) - f(x_1,y_2) - f(x_2,y_1)\big]^2$$

is continuous, it follows that

$$\frac{1}{n^2}\sum_{i,j=1}^n (c_{ij} - d_{ij})^2 = \frac{1}{n^2}\sum_{i,j=1}^n g\big((i/n,\pi(i)/n),(j/n,\pi(j)/n)\big) = (\nu_\pi\times\nu_\pi)(g)$$
$$\xrightarrow{p} \int_{[0,1]^4}\big[f(x_1,y_1) + f(x_2,y_2) - f(x_1,y_2) - f(x_2,y_1)\big]^2\,d\mu_{f,\theta}(x_1,y_1)\,d\mu_{f,\theta}(x_2,y_2),$$
where the sum has been extended to all ordered pairs since the diagonal terms vanish ($c_{ii} = d_{ii}$).

To check the last convergence, note that $\nu_\pi \xrightarrow{w} \mu_{f,\theta}$ in probability by part (b) of Theorem 5.1, which readily gives $\nu_\pi\times\nu_\pi \xrightarrow{w} \mu_{f,\theta}\times\mu_{f,\theta}$ in probability as well. Finally note that, since (5.2) fails, the last integral equals some positive real number $\alpha > 0$, and so (5.6) holds with $\lambda(\theta) = e^{-M|\theta|}\alpha/5$. Thus both conditions of Lemma 5.11 hold, and the conclusion follows.

Proof of Proposition 5.7. First it will be shown that $\mu \mapsto -\theta[\mu\times\mu](h)/2$ is continuous with respect to the weak topology on $\mathcal{M}$. Since $\mathcal{M}$ is separable, it suffices to work with sequences, and it suffices to check the following:

$$\mu_k\in\mathcal{M},\ \mu_k\xrightarrow{w}\mu \ \Rightarrow\ (\mu_k\times\mu_k)(x_1\le x_2,\, y_1\le y_2) \to (\mu\times\mu)(x_1\le x_2,\, y_1\le y_2).$$

But this follows from the fact that the boundary of the set $\{x_1\le x_2,\, y_1\le y_2\}$ is a subset of $\{x_1 = x_2,\ 0\le y_1,y_2\le 1\}\cup\{0\le x_1,x_2\le 1,\ y_1 = y_2\}$, and $\mathbb{P}(X_1 = X_2) = 0$ where $X_1, X_2$ are i.i.d. with distribution $U[0,1]$. Thus $\mu \mapsto -\theta[\mu\times\mu](h)/2$ is continuous on $\mathcal{M} \supset \{\mu : I(\mu) < \infty\}$.

Now, a computation similar to the one in the proof of Theorem 5.1 gives

$$e^{Z_n(\theta) - Z_n(0)} = \frac{1}{n!}\sum_{\pi\in S_n} e^{-\frac{\theta}{n}\sum_{1\le i<j\le n} 1_{\pi(i)>\pi(j)}} = \mathbb{E}_{\mathbb{P}_n}\, e^{-\frac{n\theta}{2}(\nu_\pi\times\nu_\pi)(h) + O(1)},$$

where $h((x_1,y_1),(x_2,y_2)) = 1_{(x_1-x_2)(y_1-y_2)<0}$ is as in the statement of the proposition, on noting that $\sum_{1\le i<j\le n} 1_{\pi(i)>\pi(j)} = \frac{n^2}{2}(\nu_\pi\times\nu_\pi)(h) + O(n)$.

The conclusion then follows by an application of Varadhan's Lemma along with Proposition 5.10, on noting that the proof of Varadhan's Lemma goes through as long as the function $\mu \mapsto -\theta(\mu\times\mu)(h)/2$ is continuous on the set $\{I(\mu) < \infty\}$, which is true. (See also Remark 2.11 following Lemma 2.10.)

5.5 Approximations for small θ

This section uses heuristic methods to explore the model of Theorem 5.1 for θ ≈ 0.

Let $f$ be a symmetric continuous function on the unit square, and consider the model $Q_{n,f,\theta}$ as introduced in Theorem 5.1. Then by Theorem 5.2, if $\pi$ is randomly chosen according to this measure, $\nu_\pi$ converges weakly to a measure on $[0,1]^2$ with density of the form $e^{\theta f(x,y)}h_{f,\theta}(x)h_{f,\theta}(y)$ with respect to Lebesgue measure, where $h_{f,\theta}(x) = e^{a_{f,\theta}(x)}$. Also note that $h_{f,0}(x)\equiv 1$, which gives $a_{f,0}(x)\equiv 0$. Now, assuming that $h_{f,\theta}$ is smooth in $\theta$, a two-term Taylor approximation (with primes denoting derivatives in $\theta$ at $\theta = 0$) gives $h_{f,\theta}(x) = 1 + \theta h_{f,0}'(x) + \frac{\theta^2}{2}h_{f,0}''(x) + O(\theta^3)$, and so

$$e^{\theta f(x,y)}h_{f,\theta}(x)h_{f,\theta}(y) = 1 + \theta\big[f(x,y) + h_{f,0}'(x) + h_{f,0}'(y)\big] + \frac{\theta^2}{2}\big[f(x,y)^2 + h_{f,0}''(x) + h_{f,0}''(y)\big]$$
$$\qquad + \theta^2\big[f(x,y)h_{f,0}'(x) + f(x,y)h_{f,0}'(y) + h_{f,0}'(x)h_{f,0}'(y)\big] + O(\theta^3).$$

Integrating with respect to $y$, the uniform marginal condition gives

$$1 = 1 + \theta h_{f,0}'(x) + \theta\int_0^1 f(x,y)\,dy + \theta C + O(\theta^2), \tag{5.8}$$

and so $h_{f,0}'(x) = -\int_0^1 f(x,y)\,dy - C$, where $C = \int_0^1 h_{f,0}'(y)\,dy$. Integrating (5.8) with respect to $x$ gives $2C = -\int_{[0,1]^2} f(x,y)\,dx\,dy$, and so
$$h_{f,0}'(x) = -\int_0^1 f(x,y)\,dy + \frac{1}{2}\int_{[0,1]^2} f(x,y)\,dx\,dy. \tag{5.9}$$

Keeping track of the coefficient of $\theta^2$ in (5.8) and equating it to 0 gives

$$0 = h_{f,0}''(x) + \int_0^1\big[f(x,y)^2 + h_{f,0}''(y)\big]\,dy + 2h_{f,0}'(x)\int_0^1\big[f(x,y) + h_{f,0}'(y)\big]\,dy + 2\int_0^1 f(x,y)h_{f,0}'(y)\,dy.$$

On using (5.9) this gives

$$h_{f,0}''(x) - 2h_{f,0}'(x)^2 = -\int_0^1\big[f(x,y)^2 + 2f(x,y)h_{f,0}'(y) + h_{f,0}''(y)\big]\,dy.$$

Since $h_{f,0}'(x)$ is known explicitly from (5.9), one can solve for $h_{f,0}''(x)$ in terms of the function $f$. In this way, keeping track of the coefficients of $\theta$, it is possible to get analytic formulas for $h_{f,\theta}(\cdot)$.

It should be noted that the treatment of this section is not rigorous, and is an attempt to provide a way to heuristically approximate the function $h_{f,\theta}$ for small $\theta$. If the one-term approximation of (5.8) is used, then one has

$$a_{f,\theta}(x) = \log h_{f,\theta}(x) = \log(1 + \theta h_{f,0}'(x)) + O(\theta^2) = \theta h_{f,0}'(x) + O(\theta^2),$$
and so
$$a_{f,\theta}(x) = -\theta\int_0^1 f(x,y)\,dy + \frac{\theta}{2}\int_{[0,1]^2} f(x,y)\,dx\,dy + O(\theta^2).$$

This gives the first order asymptotics of $a_{f,\theta}(x)$ for $\theta \approx 0$, in the sense that

$$\lim_{\theta\to 0}\frac{a_{f,\theta}(x)}{\theta} = -\int_0^1 f(x,y)\,dy + \frac{1}{2}\int_{[0,1]^2} f(x,y)\,dx\,dy.$$

Using this approximation and Theorem 5.2, the following approximation is obtained for the log normalizing constant:

$$\frac{1}{n}\big[Z_n(f,\theta) - Z_n(0)\big] = \theta\int_{[0,1]^2} f(x,y)\,dx\,dy + O(\theta^2) + o_n(1).$$

This last approximation is unsurprising from exponential family theory, as a one-term Taylor expansion gives $Z_n(f,\theta) - Z_n(0) = \theta Z_n'(f,0) + O(\theta^2)$ with

$$\frac{Z_n'(f,0)}{n} = \frac{1}{n}\,\mathbb{E}_{\mathbb{P}_n}\sum_{i=1}^n f\Big(\frac{i}{n}, \frac{\pi(i)}{n}\Big) \xrightarrow{n\to\infty} \int_{[0,1]^2} f(x,y)\,dx\,dy$$

under the uniform measure on $S_n$.

5.6 A particular example

Recall that the Spearman’s rank correlation is a metric on the space of permutations given by n 2 X 2 ||π − σ||2 = (π(i) − σ(i)) . i=1 This is possibly the second most used model on permutations after the Mallows’ model with Kendall’s Tau. Even then, the normalizing constant of this model, as well as the typical behavior for most permutation statistics of interest for a random permutation from this model is not known.

As observed in the introduction of this chapter, the Spearman's rank correlation model is obtained by setting $f(x,y) = -(x-y)^2$ or $f(x,y) = xy$ in the model of Theorem 5.1. This section will work with the choice $f(x,y) = xy$. Setting $h_{f,\theta}(x) := e^{a_{f,\theta}(x)}$ as in section 5 and using the uniform marginal condition gives

$$1 = \int_0^1 e^{\theta xy}h_{f,\theta}(x)h_{f,\theta}(y)\,dy = h_{f,\theta}(x)\sum_{k=0}^\infty C_k(\theta)\frac{x^k\theta^k}{k!}$$

with $C_k(\theta) := \int_0^1 y^k h_{f,\theta}(y)\,dy$, and so
$$h_{f,\theta}(x) = \Big(\sum_{k=0}^\infty C_k(\theta)\frac{x^k\theta^k}{k!}\Big)^{-1}.$$

Another integration with respect to $x$ gives

$$\sum_{k=0}^\infty \frac{\theta^k C_k(\theta)^2}{k!} = 1.$$

However, an analytic solution for $a_{f,\theta}(\cdot)$ seems intractable and is not attempted here.
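Although no closed form is attempted, the uniform-marginal condition $h_{f,\theta}(x)\int_0^1 e^{\theta xy}h_{f,\theta}(y)\,dy = 1$ can be solved numerically. The following sketch (an illustration, not thesis code) uses a damped fixed-point iteration on a grid; the geometric-mean damping is a standard stabilization for this symmetric iteration and is an implementation choice here, not something prescribed by the text.

```python
import numpy as np

# Damped fixed-point iteration for h_{f,theta} with f(x,y) = x*y, based on
# the uniform-marginal condition 1 = h(x) * int_0^1 exp(theta*x*y) h(y) dy.
k, theta = 400, 20.0
x = (np.arange(1, k + 1) - 0.5) / k          # midpoint grid on [0, 1]
K = np.exp(theta * np.outer(x, x))           # kernel exp(theta*x*y)
h = np.ones(k)
for _ in range(500):
    update = 1.0 / (K @ h / k)               # quadrature of the y-integral
    h = np.sqrt(h * update)                  # geometric-mean damping
# Sanity check: h(x) * int exp(theta*x*y) h(y) dy should be close to 1.
print(np.abs(h * (K @ h / k) - 1).max())
```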

Instead, figure 5.1(a) plots the density $g_{f,\theta}(x,y) := e^{\theta xy + a_{f,\theta}(x) + a_{f,\theta}(y)}$ corresponding to the limiting measure $\mu_{f,\theta}$, which is the weak limit of $\mu_\pi$ under this model. The plot is on a discrete grid of size $k\times k$ with $k = 1000$. The values of the function are computed by iterative scaling of the row and column sums of a $k\times k$ matrix $A$, starting with $A(i,j) = e^{(\theta/k^2)ij}$, where $\theta = 20$. This is the discretized version of the algorithm of Theorem 5.2.

Figure 5.1: (a) Density of the limiting measure, (b) Histogram with $n = 10000$.

From the figure it is easy to see that $g_{f,\theta}$ has higher values on the diagonal $x = y$, which also follows from the fact that for $\theta > 0$ the identity permutation has the largest probability under this model. The function $g_{f,\theta}(\cdot,\cdot)$ is symmetric about the diagonal $x = y$, which is obvious since $f(\cdot,\cdot)$ is symmetric. Another way to see this is by noting that if $\pi$ converges to a probability measure on $[0,1]^2$ with limiting density $g_{f,\theta}(x,y)$, then $\pi^{-1}$ converges to a measure on $[0,1]^2$ with limiting density $g_{f,\theta}(y,x)$. But since
$$\sum_{i=1}^n i\pi(i) = \sum_{i=1}^n i\pi^{-1}(i),$$
the laws of $\pi$ and $\pi^{-1}$ are the same under $Q_{n,f,\theta}$, and so $\pi^{-1}$ has the limiting density $g_{f,\theta}(x,y)$ as well, thus giving $g_{f,\theta}(x,y) = g_{f,\theta}(y,x)$.

The function is also symmetric about the other diagonal $x + y = 1$. A similar reasoning as above justifies this: define $\sigma\in S_n$ by $\sigma(i) := n + 1 - \pi^{-1}(n + 1 - i)$ and note that if $\pi$ converges to a probability measure on $[0,1]^2$ with density $g_{f,\theta}(x,y)$, then $\sigma$ converges to a probability measure on $[0,1]^2$ with density $g_{f,\theta}(1-y, 1-x)$. But since

$$\sum_{i=1}^n i\pi(i) = \sum_{i=1}^n (n+1-i)(n+1-\pi(i)) = \sum_{i=1}^n i\sigma(i),$$

it follows that under $Q_{n,f,\theta}$ the distribution of $\pi$ is the same as the distribution of $\sigma$. Thus $\sigma$ has limiting density $g_{f,\theta}(x,y)$ as well, which implies $g_{f,\theta}(x,y) = g_{f,\theta}(1-y, 1-x)$, and so $g_{f,\theta}$ is symmetric about the line $x + y = 1$. To compare how close a random permutation from this model is to the limit, a permutation of size $n = 10000$ is drawn from this model via MCMC. The algorithm used to simulate from this model is adapted from [1], and is explained below:

1. Start with π chosen uniformly at random from Sn.

2. Given $\pi$, simulate $\{U_i\}_{i=1}^n$ mutually independent with $U_i$ uniform on $[0, e^{(\theta/n^2)i\pi(i)}]$.

3. Given $U$, let $b_j := \max\{(n^2/(\theta j))\log U_j,\ 1\}$. Then $1 \le b_j \le n$. Choose an index $i_1$ uniformly at random from the set $\{j : b_j \le 1\}$, and set $\pi(i_1) = 1$. Remove this index from $[n]$ and choose an index $i_2$ uniformly from $\{j : b_j \le 2\} - \{i_1\}$, and set $\pi(i_2) = 2$. In general, having defined $\{i_1,\cdots,i_{l-1}\}$, remove them from $[n]$, and choose $i_l$ uniformly from $\{j : b_j \le l\} - \{i_1, i_2, \cdots, i_{l-1}\}$, and set $\pi(i_l) = l$. [That this step can always be carried out completely was proved in [25].]

4. Iterate between steps 2 and 3 until convergence.
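The following is a hedged Python sketch of steps 1-4 above for $f(x,y) = xy$ and $\theta > 0$; it is an illustration of the described sampler, not the thesis code, and variable names (and the use of a NumPy generator) are implementation choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spearman(n, theta, sweeps=10):
    """Auxiliary-variable sampler sketch for Q_{n,f,theta} with f(x,y) = x*y,
    following steps 1-4 above; assumes theta > 0."""
    idx = np.arange(1, n + 1)
    pi = rng.permutation(idx)                                # step 1
    for _ in range(sweeps):
        u = rng.uniform(0, np.exp(theta * idx * pi / n**2))  # step 2
        b = np.maximum(n**2 * np.log(u) / (theta * idx), 1)  # step 3: b_j
        new_pi = np.zeros(n, dtype=int)
        free = set(range(n))
        for l in range(1, n + 1):                            # assign values 1..n
            # nonempty by the result of [25] cited in step 3
            eligible = [j for j in free if b[j] <= l]
            j = eligible[rng.integers(len(eligible))]
            free.remove(j)
            new_pi[j] = l
        pi = new_pi
    return pi
```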

The above iteration is run 10 times to obtain a single permutation $\pi$, and then the frequency histogram of the points $\{(i/n, \pi(i)/n)\}_{i=1}^n$ is computed with $k\times k$ bins, where $k = 10$. The mesh plot of the frequency histogram is given above. The pattern of the histogram in Figure 5.1(b) is very similar to the function plotted in Figure 5.1(a), showing that the probability assigned by the random permutation $\pi$ has a pattern similar to that of the limiting density $g_{f,\theta}(x,y)$. The histogram has been drawn with $k^2$ squares, each of side $0.1$, as $k = 10$.

Given a discrete approximation $\hat g(x,y)$ on a $k\times k$ grid, an approximation $\hat\mu_k$ for the limiting measure $\mu_{f,\theta}$ can be obtained as follows:

Partition $[0,1]^2$ into $k^2$ squares each of side $1/k$, and let $\hat\mu_k$ be the measure on $[0,1]^2$, absolutely continuous with respect to Lebesgue measure, whose density is constant on the $(i,j)$th square and equal to $\hat g(i/k, j/k)$.

As $k$ goes to $\infty$, it can be shown that $\hat\mu_k$ converges weakly to $\mu_{f,\theta}$. In fact, $\hat\mu_k$ is a discrete approximation to the measure $\nu_k$ of Theorem 5.2. Thus, from a heuristic point of view, $\theta\int_{[0,1]^2} xy\,d\hat\mu_k - D(\hat\mu_k\|u)$ can be taken as an approximation to the limiting log normalizing constant $\theta\int_{[0,1]^2} xy\,d\mu_{f,\theta} - D(\mu_{f,\theta}\|u)$. Using the above approximation along with part (a) of Theorem 5.1 gives an approximation to $\frac{1}{n}[Z_n(\theta) - Z_n(0)]$ as

$$\frac{\theta}{k^4}\sum_{i,j=1}^k ij\,\hat g(i/k, j/k) \;-\; \frac{1}{k^2}\sum_{i,j=1}^k \hat g(i/k, j/k)\,\log\hat g(i/k, j/k).$$
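The following sketch (an illustration under the stated discretization, not the thesis code) implements this approximation end to end: it runs the discretized IPFP of Theorem 5.2 on a $k\times k$ grid and then evaluates $\theta\hat\mu_k[f] - D(\hat\mu_k\|u)$, which for $f(x,y) = xy$ agrees with the displayed expression.

```python
import numpy as np

def limiting_log_Z(theta, f, k=100, iters=20):
    """Discretized Theorems 5.1-5.2: approximate
    lim (1/n)[Z_n(f,theta) - Z_n(0)] = theta*mu[f] - D(mu||u)
    at the IPFP fixed point on a k x k grid."""
    x = (np.arange(1, k + 1) - 0.5) / k
    F = f(x[:, None], x[None, :])
    P = np.exp(theta * F)
    P /= P.sum()                                 # probability matrix on the grid
    for _ in range(iters):                       # IPFP: uniform marginals
        P /= k * P.sum(axis=1, keepdims=True)    # row sums -> 1/k
        P /= k * P.sum(axis=0, keepdims=True)    # column sums -> 1/k
    mu_f = (P * F).sum()                         # theta-linear statistic mu[f]
    kl = (P * np.log(P * k * k)).sum()           # D(mu||u) on the grid
    return theta * mu_f - kl

# Example with the Spearman choice f(x, y) = x*y at theta = 20:
print(limiting_log_Z(20.0, lambda x, y: x * y))
```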

Figure 5.2 gives a plot of $\theta$ versus $\lim_{n\to\infty}\frac{1}{n}[Z_n(f,\theta) - Z_n(0)]$, where the limiting value is estimated using the above approximation. For this plot $k$ has been chosen to be 100, and the range of $\theta$ has been taken to be $[-500, 500]$. The number of iterations of the iterative algorithm for each $\theta$ has been taken as 20, as the algorithm seems to converge very fast. The curve passes through $(0,0)$, as for $\theta = 0$ one has $Z_n(\theta) = Z_n(0) = \log n!$. It also goes to $\pm\infty$ as $\theta$ goes to $\pm\infty$, as expected.

The above method can be used to approximate the limiting log normalizing constant for any model on permutations in the setting of Theorem 5.1.

Figure 5.2: Plot of $\theta$ versus $\lim_{n\to\infty}\frac{1}{n}[Z_n(\theta) - Z_n(0)]$.

5.7 Analysis of the 1970 draft lottery data

This section analyzes the 1970 draft lottery data using the methods developed in this chapter. The data for this lottery is taken from http://www.sss.gov/LOTTER8.HTM (thanks to Max Grazier G'Sell for help with acquiring the data). This lottery was used to determine the relative order in which male U.S. citizens born between 1944 and 1950 would join the army, based on their birthdays. As an example, September 14th was the first chosen day, which means that people born on this date had to join first.

Assume that the 366 days of the year are chronologically numbered, i.e. January 1 is day 1 and December 31 is day 366. Then the data can be represented as a permutation of size 366, where $\pi(i)$ represents the $i$th day chosen in the lottery. The lottery was carried out in a somewhat flawed manner, as follows: 366 capsules were made, one for each day of the year. The January capsules were put in a box first and mixed among themselves. The February capsules were then put in the box, and the capsules for the first two months were mixed. This was carried on until the December capsules were put in the box and all the capsules were mixed. As a result of this procedure, the January capsules were mixed 12 times, the February capsules 11 times, and the December capsules just once; most of the capsules for the later months stayed near the top and ended up being drawn early in the lottery. The resulting permutation $\pi$ thus seems to have a bias towards the permutation $(n, n-1, \cdots, 1)$, and so the permutation $\tau = n + 1 - \pi$ should be biased towards the identity.

Thus the question of interest is to test whether the permutation $\tau$ is chosen uniformly at random from $S_{366}$, against the alternative hypothesis that $\tau$ has a bias towards the identity permutation.

For $\tau\in S_n$ with $n = 366$, one can construct the measure $\nu_\tau = \frac{1}{n}\sum_{i=1}^n \delta_{(i/n,\,\tau(i)/n)}$ as before. If $\tau$ is indeed drawn from the uniform distribution on $S_n$, then the measure $\nu_\tau$ should be close to the uniform distribution on the unit square. The bivariate histogram of the measure $\nu_\tau$ is drawn with $10\times 10$ bins in figure 5.3(a) below.

To compare this with the uniform distribution on $S_n$, a uniformly random permutation $\sigma$ is chosen from $S_n$, and the histogram of the measure $\nu_\sigma$ is drawn in figure 5.3(b) with the same number of bins as above.

Figure 5.3: (a) Draft lottery, (b) Uniformly random permutation.

From figure 5.3 it seems that the heights of the bins in the second picture are somewhat more uniform than in the first.

(a) To check the hypothesis of uniformity more formally, in [30] the author carried out a test based on the Spearman's rank correlation statistic and rejected the null hypothesis of uniformity. This analysis is equivalent to assuming that the permutation $\tau$ arises from a Mallows' model with location parameter the identity and Spearman's rank correlation statistic as the metric. The probability mass function of this model is proportional to

$$e^{-(\theta/n^2)\sum_{i=1}^n (\tau(i) - i)^2}.$$

Under this family, the distribution of $\tau$ is uniform on $S_n$ iff $\theta = 0$, and the correct alternative to consider is the one-sided alternative $\theta > 0$, which says that $\tau$ is biased towards the identity permutation.

Since the null hypothesis $\theta = 0$ is rejected, it might be of interest to see whether there is another value of $\theta$ for which the model better fits the data. To investigate this, the value of $\theta$ is estimated using the pseudo-likelihood estimator $\hat\theta_n$ of Theorem 5.5. By a direct computation it turns out that $\hat\theta_n = 1.46$. To test whether this value of $\theta$ gives a good fit to the given data, an independent random permutation $\hat\tau$ is drawn from this model with $\theta = 1.46$. The same auxiliary variable algorithm of Andersen-Diaconis from the previous section is used to draw the sample. The histogram of $\hat\tau$ is given below in figure 5.4(b) with $10\times 10$ bins, alongside the histogram for the observed permutation $\tau$.

Figure 5.4: (a) Draft lottery, (b) Sample from the fitted rank correlation model.

The bivariate histograms of the observed $\nu_\tau$ and the simulated $\nu_{\hat\tau}$ seem a better match than the histograms of $\nu_\tau$ and $\nu_\sigma$ in figure 5.3, where $\sigma$ was a permutation drawn uniformly at random. Thus the non-uniform model seems a better fit to the data than the uniform distribution.

(b) Another natural method of estimation in the type of models considered in Theorem 5.1 is the following, which can be viewed as an approximation to the maximum likelihood estimate.

The MLE for such models is the solution to the optimization problem

$$\arg\sup_\theta\Big\{\frac{\theta}{n}\sum_{i=1}^n f(i/n,\tau(i)/n) - \frac{1}{n}\big[Z_n(\theta) - Z_n(0)\big]\Big\}.$$

Instead, one can replace the quantity $\frac{1}{n}[Z_n(\theta) - Z_n(0)]$ above by its limiting value from Theorem 5.1, and estimate that limit using Theorem 5.2. This gives an approximation to the likelihood, and maximizing this gives an estimate $\hat\theta_{LD}$. Since the limiting function is convex as well, this estimate is well defined, and it is not hard to show that $\hat\theta_{LD}$ is consistent. The exact rate of consistency requires a finer analysis of the second derivative of $Z_n(\theta)$, and is not carried out here.
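A hedged sketch of $\hat\theta_{LD}$, reusing the limiting_log_Z routine sketched earlier in this chapter: maximize the approximate per-coordinate log-likelihood $\theta\cdot\frac{1}{n}\sum_i f(i/n,\tau(i)/n) - \lim_n\frac{1}{n}[Z_n(f,\theta) - Z_n(0)]$ over a grid of $\theta$ values. The grid range is an illustrative choice, not from the thesis.

```python
import numpy as np

def theta_LD(tau, f, thetas=np.linspace(-5, 5, 201)):
    """Approximate-MLE of this section: plug the large-deviation limit of
    (1/n)[Z_n(f,theta) - Z_n(0)] (via limiting_log_Z above) into the
    likelihood and maximize over a grid of theta values."""
    n = len(tau)
    stat = f(np.arange(1, n + 1) / n, tau / n).mean()  # (1/n) sum f(i/n, tau(i)/n)
    ll = np.array([t * stat - limiting_log_Z(t, f) for t in thetas])
    return thetas[ll.argmax()]

# Hypothetical usage for the draft lottery permutation tau with f(x,y) = x*y:
# print(theta_LD(tau, lambda x, y: x * y))
```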

For the draft lottery data, with Spearman's rank correlation as the metric, the estimate $\hat\theta_{LD}$ equals 1.12. This is somewhat different from the pseudo-likelihood estimate obtained before, which gave an estimate of 1.46 for $\theta$ in this model.

Since these two values are not very close, one can ask which of them gives a better fit to the data. To answer this question, a sample of 5000 values is drawn from this model for each of $\theta = 1.12$ and $\theta = 1.46$, and the histograms of the statistic $n^{-3}\sum_{i=1}^n (i - \tau(i))^2$ are plotted side by side in figure 5.5. The observed value from the draft lottery data is $0.1291$, represented by the green line. From figure 5.5 it seems that the pseudo-likelihood estimate gives a better fit to the data. Another comment is that both histograms are approximately bell shaped, suggesting that the statistic $\sum_{i=1}^n (i - \tau(i))^2$ has a limiting distribution under this model, for all values of $\theta$.

Figure 5.5: Blue: Pseudo-likelihood, Red: Large deviation.

5.8 Proof of the large deviation principle

First a few auxiliary lemmas will be proved, and then Theorem 5.9 will be derived as a consequence of these lemmas. The following two definitions are needed for stating the first lemma.

Definition 5.12. For $k\in\mathbb{N}$, partition the unit square $T := [0,1]^2$ into $k^2$ squares $\{T_{rs}\}_{r,s=1}^k$, with
$$T_{rs} := \{(x,y)\in T : \lceil kx\rceil = r,\ \lceil ky\rceil = s\} \quad \text{for } 2\le r,s\le k,$$
$$T_{1s} := \{(x,y)\in T : \lceil kx\rceil \le 1,\ \lceil ky\rceil = s\} \quad \text{for } 2\le s\le k,$$
$$T_{r1} := \{(x,y)\in T : \lceil kx\rceil = r,\ \lceil ky\rceil \le 1\} \quad \text{for } 2\le r\le k,$$
$$T_{11} := \{(x,y)\in T : \lceil kx\rceil \le 1,\ \lceil ky\rceil \le 1\}.$$

This definition ensures that $\{T_{rs}\}$ is a partition of $T$ into disjoint sets. The separate definitions were necessary so that the boundaries $\{x = 0\}$ and $\{y = 0\}$ are not left out of the partition. It should be noted that all the sets $T_{rs}$ above are $\mu$-continuity sets for any $\mu\in\mathcal{M}$. This readily follows from noting that the boundary of $T_{rs}$ is contained in

$$\Big\{(x,y): x = \tfrac{r}{k}\Big\}\cup\Big\{(x,y): x = \tfrac{r-1}{k}\Big\}\cup\Big\{(x,y): y = \tfrac{s}{k}\Big\}\cup\Big\{(x,y): y = \tfrac{s-1}{k}\Big\},$$
which has probability 0 under any $\mu\in\mathcal{M}$.

Definition 5.13. Let $\mathcal{A}_{k,n}$ denote the set of non-negative integer valued $k\times k$ matrices whose $r$th row sum equals $\lfloor nr/k\rfloor - \lfloor n(r-1)/k\rfloor$ and whose $s$th column sum equals $\lfloor ns/k\rfloor - \lfloor n(s-1)/k\rfloor$, i.e.

$$\mathcal{A}_{k,n} := \Big\{m\in\mathbb{N}_0^{k^2} : \sum_{s=1}^k m_{rs} = \Big\lfloor\frac{nr}{k}\Big\rfloor - \Big\lfloor\frac{n(r-1)}{k}\Big\rfloor,\ \sum_{r=1}^k m_{rs} = \Big\lfloor\frac{ns}{k}\Big\rfloor - \Big\lfloor\frac{n(s-1)}{k}\Big\rfloor\Big\},$$

where $\mathbb{N}_0 = \mathbb{N}\cup\{0\}$. Note that any $m\in\mathcal{A}_{k,n}$ satisfies $\sum_{r,s=1}^k m_{rs} = n$.

For any $k\times k$ matrix $m$, denote by $m_{r\cdot}$ and $m_{\cdot s}$ the $r$th row sum and $s$th column sum respectively, i.e.

$$m_{r\cdot} := \sum_{s=1}^k m_{rs}, \qquad m_{\cdot s} := \sum_{r=1}^k m_{rs}.$$

In particular, if $\pi\in S_n$, let $M(\pi)$ denote the $k\times k$ matrix where $M_{rs}(\pi) := |\{1\le i\le n : (i/n, \pi(i)/n)\in T_{r,s}\}|$.

If π is random, M(π) is a random matrix. The first lemma gives the distribution of M(π) when π ∼ Pn.

Lemma 5.14. The distribution of $M(\pi)$ is given by

$$\mathbb{P}_n(M(\pi) = m) = \frac{\prod_{r=1}^k m_{r\cdot}!\ \prod_{s=1}^k m_{\cdot s}!}{n!\ \prod_{r,s=1}^k m_{rs}!}$$
if $m\in\mathcal{A}_{k,n}$, and 0 otherwise, where $m_{r\cdot} = \sum_{s=1}^k m_{rs}$ and $m_{\cdot s} = \sum_{r=1}^k m_{rs}$.
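For small matrices this Fisher-Yates probability can be evaluated directly; the following minimal sketch (illustrative, and assuming the given $m$ is a valid configuration in $\mathcal{A}_{k,n}$) simply evaluates the displayed formula:

```python
from math import factorial
import numpy as np

def fisher_yates_pmf(m):
    """P_n(M(pi) = m) of Lemma 5.14 for a k x k integer matrix m with
    n = m.sum(): row-sum and column-sum factorials over n! times the
    cell factorials. Assumes m is a valid configuration."""
    m = np.asarray(m)
    num = 1
    for r in m.sum(axis=1):
        num *= factorial(int(r))
    for s in m.sum(axis=0):
        num *= factorial(int(s))
    den = factorial(int(m.sum()))
    for cell in m.flat:
        den *= factorial(int(cell))
    return num / den

print(fisher_yates_pmf([[1, 1], [1, 1]]))  # k = 2, n = 4
```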

Proof. Since
$$M_{r,s}(\pi) = \Big|\Big\{1\le i\le n : \Big\lceil\frac{ki}{n}\Big\rceil = r,\ \Big\lceil\frac{k\pi(i)}{n}\Big\rceil = s\Big\}\Big|,$$
it follows that

$$\sum_{s=1}^k M_{r,s}(\pi) = \Big|\Big\{1\le i\le n : \Big\lceil\frac{ki}{n}\Big\rceil = r\Big\}\Big| = \Big\lfloor\frac{nr}{k}\Big\rfloor - \Big\lfloor\frac{n(r-1)}{k}\Big\rfloor.$$

Similarly

$$\sum_{r=1}^k M_{r,s}(\pi) = \Big|\Big\{1\le i\le n : \Big\lceil\frac{k\pi(i)}{n}\Big\rceil = s\Big\}\Big| = \Big\lfloor\frac{ns}{k}\Big\rfloor - \Big\lfloor\frac{n(s-1)}{k}\Big\rfloor,$$
and so any valid configuration $m$ lies in $\mathcal{A}_{k,n}$. Fixing a particular configuration $m\in\mathcal{A}_{k,n}$, the number of permutations $\pi$ compatible with this configuration can be computed as follows:

The $r$th row strip contains $m_{r\cdot}$ indices $i$, which can be allocated to the $k$ boxes $\{T_{r,s}\}_{s=1}^k$ in $m_{r\cdot}!/\prod_{s=1}^k m_{rs}!$ ways so that box $T_{r,s}$ receives $m_{rs}$ indices. Taking a product over $r$, the number of ways to distribute the indices over the boxes is

$$\frac{\prod_{r=1}^k m_{r\cdot}!}{\prod_{r,s=1}^k m_{rs}!}.$$

Similarly, the number of ways to distribute the targets $\{\pi(i)\}$ such that box $T_{r,s}$ receives $m_{rs}$ targets is

$$\frac{\prod_{s=1}^k m_{\cdot s}!}{\prod_{r,s=1}^k m_{rs}!}.$$

Finally, after the above distribution, box $T_{r,s}$ has $m_{rs}$ indices and $m_{rs}$ targets, which can then be matched freely, and so the total number of matchings compatible with any such distribution of indices and targets is

$$\prod_{r,s=1}^k m_{rs}!.$$

Combining, the total number of permutations $\pi$ satisfying $M(\pi) = m$ for a given specification $m$ is

$$\frac{\prod_{r=1}^k m_{r\cdot}!\ \prod_{s=1}^k m_{\cdot s}!}{\prod_{r,s=1}^k m_{rs}!}.$$

Since the total number of permutations is $n!$, the proof of the claim is complete.

Remark 5.15. Note that in the above lemma the row and column sums of the matrix $M$ are free of $\pi$. The distribution of $M$ is a multivariate generalization of the hypergeometric distribution, commonly known as the Fisher-Yates distribution. This distribution also arises in statistics when testing for independence in a two-way table, as in the works of Diaconis-Efron ([23],[24]), and its asymptotics are well known.

Before proceeding, the following definitions are needed. The first definition gives a base for the weak topology on $\mathcal{M}$.

Definition 5.16. For any $\mu\in\mathcal{M}$ define $P_{k,\mu}\in[0,1]^{k^2}$ by setting $P_{k,\mu}(r,s) := \mu(T_{r,s})$.

Since $T_{rs}$ is a $\mu$-continuity set, the map $\mu\mapsto P_{k,\mu}$ is continuous on $\mathcal{M}$ with respect to weak convergence. One can now define a base for the weak topology on $\mathcal{M}$ as follows: fix $k\in\mathbb{N}$, $\varepsilon > 0$, $\mu_0\in\mathcal{M}$, and define the set

$$\mathcal{M}[k,\mu_0](\varepsilon) := \{\mu\in\mathcal{M} : \|P_{k,\mu} - P_{k,\mu_0}\|_\infty < \varepsilon\},$$
where
$$\|P_{k,\mu} - P_{k,\mu_0}\|_\infty := \max_{1\le r,s\le k}|P_{k,\mu}(r,s) - P_{k,\mu_0}(r,s)|.$$

Since $P_{k,\cdot}$ is continuous, the set $\mathcal{M}[k,\mu_0](\varepsilon)$ is open in $\mathcal{M}$.

Proposition 5.17. The collection

$$\mathcal{M}_0 := \{\mathcal{M}[k,\mu_0](\varepsilon) : k\in\mathbb{N},\ \varepsilon > 0,\ \mu_0\in\mathcal{M}\}$$
is a base for the weak topology on $\mathcal{M}$.

Proof. One needs to verify that given any $\mu_0$ and an open set $U$ containing $\mu_0$, there is an element $U_0$ from the collection $\mathcal{M}_0$ such that $\mu_0\in U_0\subset U$. If not, then in particular the set $\mathcal{M}[k,\mu_0](1/k)$ is not contained in $U$ for any $k$, and so there exists $\mu_k\in\mathcal{M}[k,\mu_0](1/k)\cap U^c$. Then for any function $f$ which is continuous on the unit square, one has

$$|\mu_0[f] - \mu_k[f]| \le \sup_{[0,1]^2}|f|\cdot\|P_{k,\mu_k} - P_{k,\mu_0}\|_\infty + 2\sup_{|x_1-x_2|,\,|y_1-y_2|\le 1/k}|f(x_1,y_1) - f(x_2,y_2)|,$$
which goes to 0 as $k$ goes to $\infty$. Thus $\mu_k$ converges weakly to $\mu_0$, and since $U$ is open, one has that $\mu_k\in U$ for all large $k$. This contradicts the assumption that $\mu_k\notin U$, completing the proof.

This reduces the analysis of measures to the analysis of $k\times k$ matrices for a large but fixed $k$. The second definition introduces a base on a subset of real-valued matrices with the usual topology.

Definition 5.18. Let $\mathcal{A}_k$ denote the set of all $k\times k$ matrices with non-negative entries such that each row and column sum is $1/k$, i.e.

$$\mathcal{A}_k := \{x\in[0,1]^{k^2} : x_{r\cdot} = 1/k,\ x_{\cdot s} = 1/k\}.$$

Using the same notation as in the above definition, define a set $\mathcal{V}[k,\mu_0](\varepsilon)\subset\mathcal{A}_k$ as

$$\mathcal{V}[k,\mu_0](\varepsilon) := \{x\in\mathcal{A}_k : \|x - P_{k,\mu_0}\|_\infty < \varepsilon\}.$$

Since $M(\pi)\in\mathcal{A}_{k,n}$ is an integer valued matrix, not all configurations in $\mathcal{V}[k,\mu_0](\varepsilon)$ can be attained by setting $x = M(\pi)/n$. Define $\mathcal{V}_n[k,\mu_0](\varepsilon)$ to be the set of all $m\in\mathcal{A}_{k,n}$ such that $m/n\in\mathcal{V}[k,\mu_0](\varepsilon)$. More precisely,

$$\mathcal{V}_n[k,\mu_0](\varepsilon) := \mathcal{A}_{k,n}\cap n\mathcal{V}[k,\mu_0](\varepsilon) = \Big\{m\in\mathcal{A}_{k,n} : \Big\|\frac{1}{n}m - P_{k,\mu_0}\Big\|_\infty < \varepsilon\Big\}.$$

The following lemma gives an estimate of the probability that M(π) ∈ Vn[k, µ0](ε).

Lemma 5.19.

$$\lim_{n\to\infty}\frac{1}{n}\log\mathbb{P}_n(M(\pi)\in\mathcal{V}_n[k,\mu_0](\varepsilon)) = -2\log k - \inf_{x\in\mathcal{V}[k,\mu_0](\varepsilon)} H(x),$$

where $H(x) := \sum_{r,s=1}^k x_{rs}\log x_{rs}$.

Proof. For the proof, first assume that

$$\lim_{n\to\infty}\min_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)} H(m/n) = \inf_{x\in\mathcal{V}[k,\mu_0](\varepsilon)} H(x). \tag{5.10}$$

The proof of (5.10) is deferred to the end of the lemma. For the lower bound, note that

$$\mathbb{P}_n(M(\pi)\in\mathcal{V}_n[k,\mu_0](\varepsilon)) \ge \max_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)}\mathbb{P}_n(M(\pi)=m) = \max_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)}\frac{\prod_{r=1}^k m_{r\cdot}!\ \prod_{s=1}^k m_{\cdot s}!}{n!\ \prod_{r,s=1}^k m_{rs}!},$$

where the second step uses Lemma 5.14. Now, Stirling's formula gives that there exists $C<\infty$ such that

$$|\log n! - n\log n + n| \begin{cases} = 0 & \text{if } n = 0, \\ = 1 & \text{if } n = 1, \\ \le C\log n & \text{if } n\ge 2, \end{cases}$$

and so

$$\frac{1}{n}\log\mathbb{P}_n(M(\pi)\in\mathcal{V}_n[k,\mu_0](\varepsilon)) \ge -2\log k - \min_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)} H(m/n) - \frac{C_k\log n}{n}$$
for some constant $C_k<\infty$. Taking limits using (5.10) completes the proof of the lower bound.

For the upper bound note that

$$\mathbb{P}_n(M(\pi)\in\mathcal{V}_n[k,\mu_0](\varepsilon)) \le \binom{n+k^2-1}{k^2-1}\max_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)}\mathbb{P}_n(M(\pi)=m) \le (n+k^2)^{k^2}\max_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)}\mathbb{P}_n(M(\pi)=m),$$

since any valid configuration $m$ is a non-negative integer solution of $\sum_{r,s=1}^k m_{rs} = n$. Thus, proceeding as before, it follows that

$$\frac{1}{n}\log\mathbb{P}_n(M(\pi)\in\mathcal{V}_n[k,\mu_0](\varepsilon)) \le -2\log k - \min_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)} H(m/n) + \frac{C_k'\log n}{n}$$

for some other $C_k'<\infty$, which on taking limits using (5.10) completes the proof of the upper bound.

It thus remains to prove (5.10). To this effect, let $m^{(n)}$ denote the minimizing configuration on the l.h.s. of (5.10). Then $m^{(n)}/n$ is a sequence in the compact set $\{x : x_{rs}\ge 0,\ \sum_{r,s=1}^k x_{rs} = 1\}$, and any convergent subsequence converges to a point in the closure of $\mathcal{V}[k,\mu_0](\varepsilon)$. Thus

$$\liminf_{n\to\infty}\min_{m\in\mathcal{V}_n[k,\mu_0](\varepsilon)} H(m/n) \ge \inf_{x\in\overline{\mathcal{V}[k,\mu_0](\varepsilon)}} H(x) = \inf_{x\in\mathcal{V}[k,\mu_0](\varepsilon)} H(x),$$
where the last equality follows since $H(\cdot)$ is continuous, completing the proof of the lower bound in (5.10).

Proceeding to prove the upper bound, it suffices to show that for any $x\in V[k,\mu_0](\varepsilon)$ there exists a sequence $m^{(n)}\in V_n[k,\mu_0](\varepsilon)$ such that $m^{(n)}/n$ converges to x as n → ∞. To this end, let µ ∈ M be such that $P_{k,\mu}=x$ (it is easy to check that such a µ exists for any $x\in A_k$). By [40, Lemma 4.2] and [40, Lemma 5.3] there exists a sequence of permutations $\{\sigma_n\}$, with $\sigma_n$ of size n, such that $\nu_{\sigma_n}$ converges weakly to µ; setting $m^{(n)} = M(\sigma_n)$, one has $m^{(n)}\in A_{k,n}$ and $m^{(n)}/n \to P_{k,\mu} = x$. Also, the set

$$W_k := \{x\in[0,1]^{k^2} : \|x - P_{k,\mu_0}\|_\infty < \varepsilon\}$$

is open, and since $x\in W_k$ it follows that $m^{(n)}/n \in W_k$ for all large n. Since $V_n[k,\mu_0](\varepsilon) = nW_k \cap A_{k,n}$, this shows $m^{(n)}\in V_n[k,\mu_0](\varepsilon)$ for all large n, whence $\min_{m\in V_n[k,\mu_0](\varepsilon)} H(m/n) \le H(m^{(n)}/n) \to H(x)$; taking the infimum over $x\in V[k,\mu_0](\varepsilon)$ completes the proof of (5.10).
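The exact formula from Lemma 5.14 used in the lower bound above can be verified by brute-force enumeration for tiny n. The sketch below (an illustration only, with the same block convention as before) checks it against all 6! permutations with k = 2.

    import math
    from collections import Counter
    from itertools import permutations

    import numpy as np

    def block_matrix(perm, k):
        n = len(perm)
        m = np.zeros((k, k), dtype=int)
        for i, v in enumerate(perm):
            m[i * k // n, v * k // n] += 1
        return m

    n, k = 6, 2
    counts = Counter(tuple(block_matrix(p, k).ravel()) for p in permutations(range(n)))
    for flat, c in counts.items():
        m = np.array(flat).reshape(k, k)
        num = (np.prod([math.factorial(t) for t in m.sum(axis=1)])
               * np.prod([math.factorial(t) for t in m.sum(axis=0)]))
        den = math.factorial(n) * np.prod([math.factorial(t) for t in flat])
        assert abs(c / math.factorial(n) - num / den) < 1e-12  # matches Lemma 5.14
    print("product formula verified for n = 6, k = 2")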

The next lemma derives another technical estimate using Lemma 5.19. This lemma will be used to prove Theorem 5.9.

Lemma 5.20. For any set $\mathcal{M}[k,\mu_0](\varepsilon)$ (see Definition 5.16), one has

$$\lim_{n\to\infty}\frac{1}{n}\log \mathbb{P}_n\big(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)\big) = -\inf_{\mu\in\mathcal{M}[k,\mu_0](\varepsilon)} D(P_{k,\mu}\,\|\,P_{k,u}),$$

where D is the discrete Kullback–Leibler divergence, i.e.

$$D(x\,\|\,y) := \sum_{r,s=1}^k x_{rs}\log\frac{x_{rs}}{y_{rs}}.$$

Proof. First note that

$$\Big\|P_{k,\mu_\pi} - \frac{1}{n}M(\pi)\Big\|_\infty \le \frac{2}{n}.$$

Indeed, each square $T_{rs}$ has four boundary edges, each of which meets exactly one row/column of the $n\times n$ partition of the unit square, so the two quantities can differ only if some point of the permutation lies on one of these rows/columns. Since each such square of the $n\times n$ partition has mass $1/n$ under $\mu_\pi$, the maximum difference is at most $2/n$. Thus for any $\delta\in(0,\varepsilon)$ and all n large enough,

$$\mathbb{P}_n\big(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)\big) \ge \mathbb{P}_n\big(M(\pi)\in V_n[k,\mu_0](\varepsilon-\delta)\big).$$

Using Lemma 5.19 together with the inequality above gives

$$\liminf_{n\to\infty}\frac{1}{n}\log \mathbb{P}_n\big(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)\big) \ge -2\log k - \inf_{x\in V[k,\mu_0](\varepsilon-\delta)} H(x).$$

Letting δ ↓ 0 gives

$$\liminf_{n\to\infty}\frac{1}{n}\log \mathbb{P}_n\big(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)\big) \ge -2\log k - \inf_{x\in V[k,\mu_0](\varepsilon)} H(x). \qquad (5.11)$$

A similar argument, using $\mathbb{P}_n(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)) \le \mathbb{P}_n(M(\pi)\in V_n[k,\mu_0](\varepsilon+\delta))$ for all large n, gives

$$\limsup_{n\to\infty}\frac{1}{n}\log \mathbb{P}_n\big(M(\pi)\in V_n[k,\mu_0](\varepsilon+\delta)\big) \le -2\log k - \inf_{x\in V[k,\mu_0](\varepsilon+\delta)} H(x),$$

from which, letting δ ↓ 0,

$$\limsup_{n\to\infty}\frac{1}{n}\log \mathbb{P}_n\big(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)\big) \le -2\log k - \inf_{x\in V[k,\mu_0](\varepsilon)} H(x). \qquad (5.12)$$

Combining (5.11) and (5.12) gives

$$\lim_{n\to\infty}\frac{1}{n}\log \mathbb{P}_n\big(\mu_\pi\in\mathcal{M}[k,\mu_0](\varepsilon)\big) = -2\log k - \inf_{x\in V[k,\mu_0](\varepsilon)} H(x),$$

using the continuity of H(·). Finally, note that $P_{k,\mu}\in V[k,\mu_0](\varepsilon)$ for any $\mu\in\mathcal{M}[k,\mu_0](\varepsilon)$, and conversely for any $x\in V[k,\mu_0](\varepsilon)$ there exists a measure µ such that $P_{k,\mu}=x$. Since

$$-2\log k - H(x) = -D(x\,\|\,P_{k,u}),$$

the result follows.
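The identity −2 log k − H(x) = −D(x‖P_{k,u}) used in the last step requires only that the entries of x sum to 1, since every entry of P_{k,u} equals 1/k²; a short numerical check (an illustration only):

    import numpy as np

    rng = np.random.default_rng(2)
    k = 5
    x = rng.dirichlet(np.ones(k * k)).reshape(k, k)  # any positive probability matrix
    H = np.sum(x * np.log(x))                        # H(x) = sum x_{rs} log x_{rs}
    D = np.sum(x * np.log(x * k**2))                 # D(x || P_{k,u}), all P_{k,u} entries = 1/k^2
    assert np.isclose(-2 * np.log(k) - H, -D)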

Definition 5.21. Given any measure µ ∈ M and k ∈ ℕ, construct a measure $\tilde P_{k,\mu}$ as follows: $\tilde P_{k,\mu}$ gives mass $\mu(T_{rs})$ to the square $T_{rs}$ for $1\le r,s\le k$, and on every $T_{rs}$ the measure $\tilde P_{k,\mu}$ is uniform. With this definition, it can be readily checked that

$$D(\tilde P_{k,\mu}\,\|\,\tilde P_{k,u}) = D(P_{k,\mu}\,\|\,P_{k,u}).$$

Proof of Theorem 5.9. Since $M_0$ is a base for the weak topology on M, by Lemma 5.20 and Theorem 2.6 it follows that $\mathbb{P}_n$ satisfies a weak LDP with the rate function

$$I(\mu) = \sup_{\mathcal{M}[k,\mu_0](\varepsilon)\ni\mu}\ \inf_{\nu\in\mathcal{M}[k,\mu_0](\varepsilon)} D(\tilde P_{k,\nu}\,\|\,\tilde P_{k,u}).$$

Also, since M is compact, the full LDP holds with the good rate function I(·).

It thus remains to prove that I(µ) = D(µ‖u). To this end, first note that $\mu\in\mathcal{M}[k,\mu](1/k)$ for every k, and so

$$I(\mu) \ge \liminf_{k\to\infty}\ \inf_{\nu\in\mathcal{M}[k,\mu](1/k)} D(\tilde P_{k,\nu}\,\|\,\tilde P_{k,u}) = \liminf_{k\to\infty} D(\tilde P_{k,\nu_k}\,\|\,\tilde P_{k,u}),$$

where $\nu_k$ denotes any minimizer of $\nu\mapsto D(\tilde P_{k,\nu}\,\|\,\tilde P_{k,u})$ over $\mathcal{M}[k,\mu](1/k)$. But then $\tilde P_{k,\nu_k}$ converges weakly to µ as k → ∞. Indeed, for any continuous function f on $[0,1]^2$,

$$|\tilde P_{k,\nu_k}[f] - \mu[f]| \le \sup_{[0,1]^2}|f|\ \|P_{k,\nu_k} - P_{k,\mu}\|_\infty + 2\sup_{|x_1-x_2|,|y_1-y_2|\le 1/k} |f(x_1,y_1) - f(x_2,y_2)|.$$

This converges to 0: the membership $\nu_k\in\mathcal{M}[k,\mu](1/k)$ guarantees that

$$\|P_{k,\nu_k} - P_{k,\mu}\|_\infty \le \frac{1}{k},$$

and the second term vanishes by the uniform continuity of f; so $\tilde P_{k,\nu_k}$ converges weakly to µ. The lower semi-continuity of D(·‖·) then implies I(µ) ≥ D(µ‖u), proving the lower bound.

For the upper bound, note that the first supremum is over all $\mathcal{M}[k,\mu_0](\varepsilon)$ containing µ, and so trivially

$$I(\mu) \le \sup_{k\ge 1} D(\tilde P_{k,\mu}\,\|\,\tilde P_{k,u}).$$

Also note that

$$D(\mu\|u) = \sup_{f\in B(T)}\Big\{\int_T f\,d\mu - \log\int_T e^f\,du\Big\},$$

$$D(\tilde P_{k,\mu}\,\|\,\tilde P_{k,u}) = \sup_{f\in B_k(T)}\Big\{\int_T f\,d\mu - \log\int_T e^f\,du\Big\},$$

where B(T) denotes the set of all bounded measurable functions on T, and $B_k(T)$ denotes the subset of B(T) consisting of functions constant on every $T_{rs}$, $1\le r,s\le k$; both identities follow from Lemma 2.13. (For $f\in B_k(T)$ one has $\int_T f\,d\mu = \int_T f\,d\tilde P_{k,\mu}$, which is why µ may replace $\tilde P_{k,\mu}$ in the second identity.) Consequently $\sup_{k\ge 1} D(\tilde P_{k,\mu}\,\|\,\tilde P_{k,u}) \le D(\mu\|u)$, completing the proof of the upper bound.
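The approximation behind this last step can also be watched numerically: along a nested (dyadic) sequence of partitions, D(P̃_{k,µ}‖P̃_{k,u}) is non-decreasing in k and approaches D(µ‖u) = ∫ g log g for a measure µ with density g. The sketch below is an illustration only; the density 1 + a(2x − 1)(2y − 1), which has uniform marginals, is a hypothetical choice, and all integrals are midpoint-rule approximations on a fine grid.

    import numpy as np

    N = 512                                  # fine grid, divisible by every k below
    t = (np.arange(N) + 0.5) / N             # midpoints of the grid cells
    X, Y = np.meshgrid(t, t, indexing="ij")
    g = 1 + 0.9 * (2 * X - 1) * (2 * Y - 1)  # a density on [0,1]^2 with uniform marginals
    cell = g / N**2                          # approximate mass of each grid cell

    print("D(mu||u) ~", np.sum(cell * np.log(g)))
    for k in [2, 4, 8, 16]:                  # nested dyadic partitions
        mass = cell.reshape(k, N // k, k, N // k).sum(axis=(1, 3))  # mu(T_rs)
        print(k, np.sum(mass * np.log(k**2 * mass)))  # D(P~_{k,mu} || P~_{k,u})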

Chapter 6

Conclusion

This thesis explains how large deviations theory can be used to analyze exponential families whose normalizing constants are not available in closed form. The main idea is that once a large deviation principle for a sequence of probability measures Pn has been established, one can analyze exponential families built with Pn as the base measure, provided the sufficient statistics of these families are continuous in the topology of the large deviation principle. The asymptotics of the log normalizing constant are then characterized in terms of an optimization problem, usually over an abstract space. This method can be used to numerically estimate the normalizing constant whenever that optimization problem can be reduced to a numerically solvable one.

This program is illustrated with the example of exponential families on dense graphs in Chapter 3. Chapters 4 and 5 develop large deviation results for sparse graphs and permutations respectively, and carry out the same program for exponential families on these spaces. Apart from giving the asymptotics of the normalizing constant, this analysis yields an improved understanding of the models themselves.

In the case of exponential families on sparse graphs, large deviations theory gives some understanding of the phenomenon of degeneracy, along with a sufficient condition for checking non-degeneracy. The computation of the normalizing constant is reduced to a one-dimensional optimization problem which is very easy to carry out in practice, as demonstrated with a simple example. The results are also used to show the existence of consistent estimators, and to give a simple way to fit such models.

In the case of exponential families on permutations, the large deviation principle is used to establish consistency of pseudo-likelihood estimators. It also yields a recursive algorithm for estimating the log normalizing constant, and this estimation is carried out for a particular exponential family. The tools developed in that chapter are used to analyze the 1970 draft lottery data.

The main challenge in carrying out this program for more examples is that establishing the requisite large deviation principle is a non-trivial problem, and the proof techniques change substantially from example to example. Even the identification of the space and topology for the large deviation principle is not always obvious. For example, in the case of permutations the necessary large deviation principle was established on the set of probability measures on the unit square, which at first sight has nothing to do with permutations.

It should also be pointed out that large deviations theory captures only the leading-order asymptotics of the log normalizing constant, and fails to capture finer detail. This causes a problem when the rate function is not convex and has more than one global minimum on the set of interest. In that case one can only conclude that any limiting distribution is supported on these minimizers; no conclusion can be drawn about the mixing probability attached to each of them. If such information is needed, a finer analysis of the normalizing constant is necessary. This has been carried out in [57, 58], where a particular exponential family on dense graphs is analyzed to give finer results.

One important question raised by this thesis is that even though multiple parameters can be estimated in sparse ERGMs, the rates of consistency of the different parameter estimates need not be the same, so estimation of multiple parameters in such exponential families can still be a tricky problem. As an example, the dense graph exponential family results in [12] suggest that consistent estimation of multiple parameters might not be possible. In [57] it is shown that joint estimation of both parameters is indeed possible in a two-parameter exponential family; in fact, explicit consistent estimators are provided for both parameters. In this case the rates of consistency of the two parameters are not the same, which is the main reason for the apparent non-estimability of multiple parameters: one parameter can be estimated with far greater precision than the other, so estimating both remains a hard problem in practice. The exponential family on sparse graphs should be analyzed from a similar perspective, to see whether something can be said about the rates of consistency of the estimates. Continuing the same program, a more general large deviation result for sparse graphs could be used to analyze a wider class of exponential families than is done in this thesis; the work of [8] provides a very good starting point for such an analysis.

A possible direction to pursue in the area of permutations is to study the limiting measures arising from these permutation models more carefully; an explicit form of the limit would give a better prediction for the normalizing constant. The analysis of this thesis also leaves out some statistics on permutations that are of interest, and a more general, or altogether different, theory of large deviations may be needed to handle those examples.

References

[1] H. Andersen and P. Diaconis. Hit and run as a unifying device. Journal de la Société Française de Statistique, 148(4):5–28, 2007.

[2] O. Barndorff-Nielsen. Information and Exponential Families in Statistical Theory. Wiley, 1978.

[3] J. Besag. Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society, Series B (Methodological), 36(2):192–236, 1974.

[4] J. Besag. Statistical Analysis of Non-Lattice Data. Journal of the Royal Statistical Society, Series D (The Statistician), 24(3):179–195, 1975.

[5] S. Bhamidi, G. Bresler, and A. Sly. Mixing time of exponential random graphs. Annals of Applied Probability, 21(6):2146–2170, 2011.

[6] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[7] M. Bóna. The copies of any permutation pattern are asymptotically normal. Available at http://arxiv.org/abs/0712.2792, 2007.

[8] C. Bordenave and P. Caputo. Large deviations of empirical neighborhood distribution in sparse random graphs. Available at http://arxiv.org/abs/1308.5725, 2013.


[9] C. Borgs, J. Chayes, L. Lovász, V. Sós, and K. Vesztergombi. Convergent sequences of dense graphs I: Subgraph frequencies, metric properties and testing. Advances in Mathematics, 219(6):1801–1851, 2008.

[10] C. Borgs, J. Chayes, L. Lovász, V. Sós, and K. Vesztergombi. Convergent sequences of dense graphs II: Multiway cuts and statistical physics. Annals of Mathematics, 176(1):151–219, 2012.

[11] S. Chatterjee. Estimation in spin glasses: A first step. The Annals of Statistics, 35(5):1931–1946, 2007.

[12] S. Chatterjee and P. Diaconis. Estimating and understanding exponential random graph models. The Annals of Statistics, 41(5):2428–2461, 2013.

[13] S. Chatterjee, P. Diaconis, and A. Sly. Random graphs with a given degree sequence. Annals of Applied Probability, 21(4):1400–1435, 2011.

[14] S. Chatterjee and S. R. S. Varadhan. The large deviation principle for the Erdős–Rényi random graph. European Journal of Combinatorics (special issue on Homomorphisms and Limits), 32(7):1000–1017, 2011.

[15] H. Chen, S. R. K. Branavan, R. Barzilay, and D. Karger. Content modeling using latent permutations. Journal of Artificial Intelligence Research, 36(1):129–163, 2009.

[16] F. Comets. On Consistency of a Class of Estimators for Exponential Families of Markov Random Fields on the Lattice. The Annals of Statistics, 20(1):455–468, 1992.

[17] D. Critchlow. Metric Methods for Analyzing Partially Ranked Data, volume 34 of Lecture Notes in Statistics. Springer-Verlag, New York, 1985.

[18] D. Critchlow, M. Fligner, and J. Verducci. Probability models on rankings. Journal of Mathematical Psychology, 35(3):294–318, 1991.

[19] I. Csiszár. I-divergence geometry of probability distributions and minimization problems. The Annals of Probability, 3(1):146–158, 1975.

[20] A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications, second edition, volume 38 of Applications of Mathematics. Springer-Verlag, Berlin, 1998.

[21] W. Deming and F. Stephan. On a Least Squares Adjustment of a Sampled Frequency Table When the Expected Marginal Totals are Known. Annals of Mathematical Statistics, 11(4):427–444, 1940.

[22] P. Diaconis. Group Representations in Probability and Statistics, volume 11 of Lecture Notes–Monograph Series. Institute of Mathematical Statistics, Hayward, CA, 1988.

[23] P. Diaconis and B. Efron. Probabilistic-geometric theorems arising from the analysis of contingency tables. In Contributions to the Theory and Application of Statistics: A Volume in Honor of Herbert Solomon. Academic Press, 1983.

[24] P. Diaconis and B. Efron. Testing for independence in a two-way table: New interpretations of the Chi-square statistic. The Annals of Statistics, 13(3):845–913, 1985.

[25] P. Diaconis, R. Graham, and S. Holmes. Statistical problems involving permutations with restricted positions, volume 36 of Lecture Notes–Monograph Series, pages 195–222. Institute of Mathematical Statistics, Beachwood, OH, 2001.

[26] P. Diaconis and A. Ram. Analysis of systematic scan Metropolis algorithms using Iwahori–Hecke algebra techniques. Michigan Mathematical Journal, 48(1):157–190, 2000.

[27] K. Doku-Amponsah and P. Mörters. Large deviation principles for empirical measures of colored random graphs. Annals of Applied Probability, 20(6):1989–2021, 2010.

[28] P. Erdős. On an elementary proof of some asymptotic formulas in the theory of partitions. The Annals of Mathematics, 43(3):437–450, 1942.

[29] P. Feigin and A. Cohen. On a model of concordance between judges. Journal of the Royal Statistical Society. Series B (Methodological), 40(2):203–213, 1978.

[30] S. Fienberg. Randomization and social affairs: The 1970 draft lottery. Science, 171(3968):255–261, 1971.

[31] M. Fligner and J. Verducci. Distance based ranking models. Journal of the Royal Statistical Society. Series B (Methodological), pages 859–869, 1986.

[32] M. Fligner and J. Verducci. Multistage ranking models. Journal of the American Statistical Association, 83(403), 1988.

[33] O. Frank and D. Strauss. Markov graphs. Journal of the American Statistical Association, 81(395):832–842, 1986.

[34] C. Geyer. Likelihood and Exponential Families. Available at http://www.stat.umn.edu/geyer/thesis/th.pdf, 1990.

[35] C. Geyer and E. Thompson. Constrained Monte Carlo Maximum Likelihood for Dependent Data. Journal of the Royal Statistical Society, Series B (Methodological), 54(3):657–699, 1992.

[36] R. Glebov, A. Grzesik, T. Klimošová, and D. Král'. Finitely forcible graphons and permutons. Available at http://arxiv.org/abs/1307.2444, 2013.

[37] M. Handcock. Assessing degeneracy in statistical models of social networks. Technical report, Center for Statistics and the Social Sciences, University of Washington, Seattle. Available at http://www.csss.washington.edu/Papers/wp39.pdf, 2003.

[38] G. Hardy and S. Ramanujan. Asymptotic Formulae in Combinatory Analysis. Proceedings of the London Mathematical Society, Series 2, 17:75–115, 1918.

[39] P. Holland and S. Leinhardt. An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373):33–65, 1981.

[40] C. Hoppen, Y. Kohayakawa, C. Moreira, B. Ráth, and R. M. Sampaio. Limits of permutation sequences. Journal of Combinatorial Theory, Series B, 103(1):93–113, 2013.

[41] D. R. Hunter and M. S. Handcock. Inference in Curved Exponential Family Models for Networks. Journal of Computational and Graphical Statistics, 15(3):565–583, 2006.

[42] S. Janson, B. Nakamura, and D. Zeilberger. On the asymptotic statistics of the number of occurrences of multiple permutation patterns. Available at http://arxiv.org/abs/1312.3955, 2013.

[43] J. Jensen and J. Møller. Pseudolikelihood for exponential family models of spatial point processes. Annals of Applied Probability, 1(3):445–461, 1991.

[44] D. Král' and O. Pikhurko. Quasirandom permutations are characterized by 4-point densities. Geometric and Functional Analysis, 23(2):570–579, 2013.

[45] S. Kullback. Probability densities with given marginals. The Annals of Mathe- matical Statistics, 39(4):1236–1243, 1968.

[46] M. Lapata. Automatic Evaluation of Information Ordering: Kendall's Tau. Computational Linguistics, 32(4):471–484, 2006.

[47] G. Lebanon and J. Lafferty. Cranking: Combining rankings using conditional probability models on permutations. In Proceedings of the 19th International Conference on Machine Learning, pages 363–370, 2002.

[48] L. Lovász. Large Networks and Graph Limits. American Mathematical Society, 2012.

[49] C. L. Mallows. Non-null ranking models. Biometrika, 44:114–130, 1957.

[50] J. Marden. Analyzing and Modeling Rank Data, 1st edition. Chapman & Hall, 1995.

[51] S. Mase. Consistency of the Maximum Pseudo-Likelihood Estimator of Continuous State Space Gibbsian Processes. The Annals of Applied Probability, 5(3):603–612, 1995.

[52] B. McKay. Asymptotics for symmetric 0-1 matrices with prescribed row sums. Ars Combinatoria, 19A:15–25, 1985.

[53] B. McKay, I. Wanless, and N. Wormald. Asymptotic enumeration of graphs with a given bound on the maximum degree. Combinatorics, Probability and Computing, 11:373–392, 2002.

[54] B. McKay and N. Wormald. Asymptotic enumeration by degree sequence of graphs of high degree. European Journal of Combinatorics, 11:565–580, 1990.

[55] B. McKay and N. Wormald. Asymptotic enumeration by degree sequence of graphs with degrees $o(n^{1/2})$. Combinatorica, 11(4):369–382, 1991.

[56] M. Morris, D. Hunter, and M. Handcock. Specification of exponential-family random graph models: Terms and computational aspects. Journal of Statistical Software, 24(4), 2008.

[57] S. Mukherjee. Consistent estimation in the two star exponential random graph model. Available at http://arxiv.org/abs/1310.4526, 2013.

[58] S. Mukherjee. Phase transition in the two star exponential random graph model. Available at http://arxiv.org/abs/1310.4164, 2013.

[59] J. Park and M. E. J. Newman. Solution of the 2-star model of a network. Physical Review E, 70:066146, 2004.

[60] L. Rüschendorf. Convergence of the Iterative Proportional Fitting Procedure. The Annals of Statistics, 23(4):1160–1174, 1995.

[61] R. Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums. The American Mathematical Monthly, 74(4):402–405, 1967.

[62] T. Snijders, P. Pattison, G. Robins, and M. Handcock. New Specifications for Exponential Random Graph Models. Sociological Methodology, 36(1):99–153, 2006.

[63] S. Starr. Thermodynamic limit for the Mallows model on $S_n$. Journal of Mathematical Physics, 50(9), 2009.

[64] J. Trashorras. Large deviations for symmetrised empirical measures. Journal of Theoretical Probability, 21(2):397–412, 2008.

[65] M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[66] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.