Configuring models with fixed sequences

Daniel B. Larremore Santa Fe Institute

June 21, 2017 NetSci

[email protected] @danlarremore Brief note on references: This talk does not include references to literature, which are numerous and important. Most (but not all) references are included in the arXiv paper: arxiv.org/abs/1608.00607

Stochastic models, sets, and distributions

• a generative model is just a recipe: choose parameters→make the network

• a stochastic generative model is also just a recipe: choose parameters→draw a network

• since a single stochastic generative model can generate many networks, the model itself corresponds to a set of networks.

• and since the generative model itself is some combination or composition of random variables, a random graph model is a set of possible networks, each with an associated probability, i.e., a distribution.

this talk: configuration models: uniform distributions over networks w/ fixed deg. seq. Why care about random graphs w/ fixed degree sequence?

Since many networks have broad or peculiar degree sequences, these random graph distributions are commonly used for:

Hypothesis testing: Can a particular network’s properties be explained by the degree sequence alone?

Modeling: How does the affect the epidemic threshold for disease transmission?

Null model for , : Compare an empirical graph with (possibly) to the ensemble of random graphs with the same degrees. e a loopy c d

simple simple loopy graphs multigraphs graphs

Stub Matching tono multiedgesdraw from theno config. self-loops model

~k = 1, 2, 2, 1 b { }

3 3 4 3 4 3 4 4 multigraphs

2 5 2 5 2 5 2 5 1 1 6 6 1 6 1 6

3 3 the 4standard algorithm:4 3 4

4 loopy and/or 5 5 draw2 from5 the distribution2 5 by sequential2 “Stub Matching”2 1 1 6 6 1 6 1 6

1. initialize each node n with kn half-edges or stubs. 2. choose two stubs uniformly at random and join to form an edge. graph isomorph. vertex-labeled stub-labeled e a loopy multigraphs c d

simple simple loopy graphs multigraphs graphs

no multiedges no self-loops Stub Matching to draw from the config. model

b

3 3 4 3 4 3 4 4 multigraphs

2 5 2 5 2 5 2 5 1 1 draw #1 6 6 1 6 1 6

3 3 4 4 3 4

4 loopy and/or 5 5 2 5 2 5 2 2 1 1 6 6 1 6 1 6 draw #2

graph isomorph. vertex-labeled stub-labeled a b c d loopy multigraphs simple a b c d

Are these two different networks? or the same network? loopy multigraphs

simple Are stubs distinguishable or not? simple The rest of this talk: the answer matters. loopy graphs graphs multigraphs multigraphs simple loopy graphs graphs multigraphs no multiedges no self-loops multigraphs loopy and/or

no multiedges no self-loops graph isomorph. vertex-labeled stub-labeled loopy and/or

graph isomorph. vertex-labeled stub-labeled The distribution according to stub-matching a b c d loopy multigraphs

When we draw a graph usingsimple stub matching, this is the set of graphs that we uniformly sample.

8 of the graphs are simple, while the other 7 simple loopy graphs have self-loops or multiedges. graphs multigraphs multigraphs We therefore say that stub matching uniformly samples space of stub-labeled loopy multigraphs. no multiedges no self-loops

Note, however, that this is notloopy and/or a uniform sample over adjacency matrices (rows).

graph isomorph. vertex-labeled stub-labeled The importance of uniform distributions remove vertex labels remove stub labels

a b c d

loopy multigraphs simple e a loopy multigraphs c d

simple simple loopy graphs multigraphs graphs simple loopy graphs graphs multigraphs

no multiedges no self-loops multigraphs

b no multiedges no self-loops 3 3 4 3 4 3 4 4 multigraphs

2 5 2 5 2 5 2 5 loopy and/or 1 1 6 6 1 6 1 6

3 3 4 4 3 4

4 loopy and/or 5 5 graph isomorph. vertex-labeled stub-labeled 2 5 2 5 2 2 1 1 6 6 1 6 1 6

goal: provably uniform samplinggraph isomorph. for vertex-labeledall eight spaces: stub-labeled loopy{0,1} x {0,1} x {stub-,vertex-} Choosing a space for your configuration model

Question 1: loops? Question 3: vertex- or stub-labeled? stub-labeled These configurations are . . . • two graphs • one graph, drawn two ways • one valid; one nonsensical simple loopy (skip Q3)

• three graphs loopy multigraph • one graph, drawn three ways multigraph • one valid; two nonsensical Question 2: multiedges?

vertex-labeled

example: Are loops reasonable? Would a make sense? [tennis matches: no | author citations: yes] Sampling from configuration models

stub matching samples uniformly from stub-labeled loopy multigraphs

for other spaces, define a Markov chain over the “graph of graphs” G →each vertex is a graph, and directed edges are “double-edge swaps”

swap this way or the other way

NB: Sampling is easy. Provably uniform sampling is not! Markov chains for uniform sampling

Prove that: • the transition matrix is doubly stochastic (G is regular) • the chain is irreducible (G is strongly connected) • the chain is aperiodic (G is aperiodic; gcd of all cycles is one)

Straightforward for stub-labeled loopy multigraphs. Choose two edges uniformly at random and swap them. Accept all swaps and treat each resulting graph as a sample from the U distribution. (Each node in G has degree m-choose-2.)

Easy for stub-labeled multigraphs. Choose two edges uniformly at random and swap them. Reject swaps that create a self-loop and resample the current graph. (Think of any “rejected swap” as a self-loop in G.)

Easy for simple graphs. Choose two edges uniformly at random and swap them. Reject swaps that create a self-loop or multiedge and resample the current graph. (Again, treat “rejected swaps” as a self loops in G.) Markov chains for uniform sampling

For vertex-labeled graphs, we inherit the strong connectedness of G as well as its aperiodicity.

However, ensuring that the Markov chain has a uniform distribution as its stationary distribution requires that we adjust transition probabilities.

a b unadjusted adjusted transitions transitions P = 1/3, 2/3 P = 1/2,1/2

These asymmetric modifications to transition probabilities depend on the number of self-loops and multiedges in the current state. decrease outflow (and increase resampling) Intuition: of graphs with multiedges or self-loops. Stub-labeled loopy graphs: not connected

counterexample: no double-edge swap connects these two graphs!

but see Nishimura 2017 (arxiv:1701.04888) - The connectivity of graphs of graphs with self-loops and a given degree sequence a b c d

Do {stub labels, self-loops, multiedges} matter for how we sample CMs? yes loopy multigraphs simple showed that these spaces are far from introduced (and just outlined) equivalent, even in thermodynamic lim. provably uniform sampling methods. a b c d

Do {stub labels, self-loops, multiedges} matter in applications of CMs? next… loopy multigraphssimple loopy graphs graphs multigraphs simple →hypothesis testing →null model for modularity multigraphs

nosimple multiedges no self-loops loopy graphs graphs multigraphs loopy and/or multigraphs

graph isomorph. vertex-labeled stub-labeled no multiedges no self-loops loopy and/or

graph isomorph. vertex-labeled stub-labeled Hypothesis testing

Do barn swallows tend to associate with other swallows of similar color?

Data: bird interactions, bird colors.

Compute color [correlation over edges] Choose a graph space for barn swallows

QuestionQuestion 1: loops? 1: loops? QuestionQuestion 3: vertex- 3: vertex- or stub-labeled? or stub-labeled? stub-labeledstub-labeled Nonsensical TheseThese configurations configurations are . .are . . . . vertex-labeled • two •graphs two graphs • one •graph, one graph, drawn drawn two ways two ways • one •valid; one valid; one nonsensical one nonsensical simplesimple loopyloopy (skip Q3)(skip Q3)

[Why? If we interacted• three today• threegraphs and graphs yesterday, a randomization in loopyloopy multigraphmultigraph which my today interacts• one •graph, onewith graph, yourdrawn drawnyesterday three threeways waysis nonsensical!] Reasonable multigraphmultigraph • one •valid one ;valid two ;nonsensical two nonsensical [inQuestion 2: multiedges? our data,Question 2: multiedges? in fact!] vertex-labeledvertex-labeled

This should be modeled as a vertex-labeled multigraph. Assortative pairing of barn swallows

Stub-labeled Vertex-labeled 5 5 note: for simple graphs and statistics based on

aphs the graph , r 34 stub-labeled ≡ vert34 ex-labeled Sanity check: 2 2 Density p = 0.001 p = 0.001 should be = for simple 1 1 Simple g 0 0 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 r r 5 5 aphs r NONE of these is centered at zero. 234 234 Density p = 0.608 p = 0.852 Correct space is meaningfully different. Multig 01 01 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 −0.6 −0.4 −0.2 0.0 0.2 0.4 0.6 r r Uniform sampling means we can compare empirical value to null distribution to draw scientific conclusions.

The choice of graph space matters—careful choice & sampling can flip conclusions! Community Detection

Are there groups of vertices that tend to associate with each other more than we expect by chanceC ?

Data: collaborations among geometers.

Maximize modularity, e.g.

calculate alignment scores D

1 convert to alignment indicators

0.5 0 A 600 remove short aligned regions 1 extract highly variable regions

a(t) 0 56789400 1 234

0 200 1 1 alignment position t

0 0

NGDYKEKVSNNLRAIFNKIYENLNDPKLKKHYQKDAPNY

NGDYKKKVSNNLKTIFKKIYDALKDTVKETYKDDPNY B 6 NGDYKEKVSNNLRAIFKKIYDALEDTVKETYKDDPNY

13 16 6 13 16 Coauthorship communities (vertex-labeled multigraph)

expected number of edges in a 1 random degree-preserving null model Modularity 0.9 communities 0.8 generic Q

0.7 specifically, in the and

stub-labeled loopy multigraph CM 0.6 Q

Generic Modularity 0.5

NMI between Eq(6) and Eq(8) partitions 0.4 Similarityof 2 3 4 5 6 7 8 9 10 numbernumber ofof communities communities

expected number of edges in any random degree-preserving null model

same community detection algorithm, same initial state, different results Advanced edge swaps

a reversing a directed triangle b connectivity preserving edge swap c 3 edge swap

a reversing a directed triangle b connectivity preserving edge swap c 3 edge swap

a reversing a directed triangle b connectivity preserving edge swap c 3 edge swap

required for graph-of-graphs irreducibility in directed networks useful if you wish to sample only networks that have a fixed number of connected components other swaps have been proposed, e.g. to improve mixing time

Proofs, samplers, the history of the configuration model, and applications in the paper The point: graph spaces & stub labels matter, in theory and in practice.

Recognizing this exposes a number of unrecognized & unsolved problems.

Provably uniform sampling methods exist—some have existed for decades! Bailey Fosdick Joel Nishimura Johan Ugander

Configuring Random Graph Models with Fixed Degree Sequences Fosdick, Larremore, Nishimura, Ugander. To Appear in SIAM Review. arxiv.org/abs/1608.00607

danlarremore.com/configuration models Thank you @danlarremore [email protected]