arXiv: arXiv:1610.05857

GRAPHICAL MODELS FOR ZERO-INFLATED SINGLE EXPRESSION

, By Andrew McDavid∗ Raphael Gottardo† ‡ Noah Simon§ and Mathias Drton‡

Department of Biostatistics and Computational Biology∗, University of Rochester Medical Center; Rochester, New York. Vaccine and Infectuous Disease Division†, Fred Hutchinson Cancer Research Center, Department of Statistics‡ and Department of Biostatistics§, University of Washington; Seattle, Washington. Bulk experiments relied on aggregations of thou- sands of cells to measure the average expression in an organism. Ad- vances in microfluidic and droplet sequencing now permit expression profiling in single cells. This study of cell-to-cell variation reveals that individual cells lack detectable expression of transcripts that appear abundant on a population level, giving rise to zero-inflated expression patterns. To infer gene co-regulatory networks from such data, we propose a multivariate Hurdle model. It is comprised of a mixture of singular Gaussian distributions. We employ neighborhood selection with the pseudo-likelihood and a group lasso penalty to select and fit undirected graphical models that capture conditional independences between . The proposed method is more sensi- tive than existing approaches in simulations, even under departures from our Hurdle model. The method is applied to data for T follicu- lar helper cells, and a high-dimensional profile of mouse dendritic cells. It infers network structure not revealed by other methods; or in bulk data sets. An R implementation is available at https: //github.com/amcdavid/HurdleNormal.

1. Introduction. Graphical models have been used to synthesize high- throughput gene expression experiments into understandable, canonical forms [Dobra et al., 2004, Markowetz and Spang, 2007]. Although inferring causal relationships between genes is perhaps the ultimate goal of such analysis, arXiv:1610.05857v3 [stat.ME] 14 Mar 2018 causal models may be difficult to estimate with observational data, and ex- perimental manipulation of specific genes has remained costly, and largely inimitable to high-throughput biology. Many analyses have thus focused on undirected graphical models (also known as Markov random fields) that capture the conditional independences present between gene expression lev- els. The graph determining such a model describes each gene’s statistical predictors: each gene is optimally predicted using only its neighbors in the graph. With gene expression studies serving as key motivation, a host of dif- 1 2 MCDAVID ET AL. ferent approaches have been developed for structure learning and parameter estimation in undirected graphical models [Drton and Maathuis, 2017]. Characterization of the conditional independences between genes answers a variety of scientific questions. It can help falsify models of gene regula- tion, since statistical dependence is expected, given causal dependence. In immunology, polyfunctional immune cells, which simultaneously and non- independently express multiple cytokines, are useful predictors of vaccine response [Precopio et al., 2007]. Simultaneous expression or co-expression of cellular surface markers potentially define new cellular phenotypes [Lin et al., 2015], so expanding the “dictionary” of co-expression allows pheno- typic refinements. Graphical models allow one to study such co-expression at the level of direct interactions.

1.1. Single cell gene expression. Established technology determines gene expression levels by assaying bulk aggregates of cells assayed through mi- croarrays or RNA sequencing. Although graphical modeling of the resulting data has seen profitable applications, see e.g. Li et al. [2015], there is an inheritant limitation to what can be inferred from expression levels that are averages across hundreds or thousands of individual cells, as we discuss in Section 2. In contrast, recent microfluidic and molecular barcoding advances have enabled the measurement of the minute quantities of mRNA present in single cells. This new technology provides a unique resolution of gene co- expression and has the potential to facilitate more interpretable conclusions from multivariate data analysis and, in particular, graphical modeling. At the same time, single cell expression experiments bring about new sta- tistical challenges. Indeed, a distinctive feature of single cell gene expression data—across methods and platforms—is the bimodality of expression values [Finak et al., 2015, Marinov et al., 2014, Shalek et al., 2014]. Genes can be ‘on’, in which case a positive expression measure is recorded, or they can be ‘off’, in which case the recorded expression is zero or negligible. Although the cause of this zero-inflation remains unresolved, its properties are of intrinsic interest [Kim and Marioni, 2013]. It has been argued that the zero-inflation represents censoring of expression below a substantial limit of detection, yet comparison of in silico signal summation from many single cells, to the sig- nal measured in biological sums of cells suggest that the limit of detection is negible [McDavid et al., 2013]. Moreover, the empirical distribution of the log-transformed counts appears rather different than would be expected from censoring: the distribution of the log-transformed, positive values is generally symmetric. Yet the presence of bimodality in technically repli- cated experiments (“Pool/split” experiments) implicates the involvement of SINGLE CELL EXPRESSION GRAPHICAL MODELS 3 technical factors [Marinov et al., 2014]. Zero-inflation is seen, in particular, in a single cell gene expression ex- periment we analyze in Section 6. The experiment concerns T follicular helper (Tfh) cells, which are a class of CD4+ lymphocytes. B-cells that se- crete antibodies require Tfh cell co-stimulation to become active [Ma et al., 2012]. Tfh cells are defined, and identified both through their location in the B-cell germinal centers, as well as their production of high levels of the CXCR5, PD1 and BcL-6. In the experiment we consider, Tfh cells were identified from CD4+CXCR5+PD1+ cells from lymph node biopsy. Fig- ure 1 shows the pairwise expression distribution of four Tfh marker genes 20 (P < 10− compared to non-Tfh lymph node T-cells, which are not shown). Although the expression of these genes could help discriminant Tfh from non-Tfh cells, the strength of linear relationships within Tfh cells (upper panels) varies. To identify coexpressing subsets of cells or to clarify the con- ditional relationship between genes, estimating the multivariate dependence structure of expression within Tfh cells is necessary. Figure 1 illustrates the issue of zero-inflation. The data are clearly poorly modeled by the linear regression models whose fit is shown in the lower panels of the figure.

1.2. Modeling zero-inflation. In order to accommodate the distributional features observed in single cell gene expression, we propose a joint probabil- ity density function f(y) of the form

T T 1 T m (1) log f(y) = vy Gvy + vy Hy y Ky C(G, H, K), y R , − 2 − ∈ for the dominating measure obtained by adding a Dirac mass at zero to the Lebesgue measure. The vector y Rm comprises the expression levels of m ∈ genes in a single cell, and the vector v 0, 1 m is defined through element- y ∈ { } wise indicators of non-zero expression, so [vy]i = I yi=0 for i = 1, . . . , m. In the specification from (1), both binary and continuous{ 6 } versions of gene ex- pression are sufficient statistics, and interactions thereof are parametrized, with G, H and K being matrices of interaction parameters. Zeros in these interaction matrices indicate conditional independences (and, thus, absence of edges in a graph for a graphical model). Specifically, the ith and jth co- ordinate are conditionally independent if and only if all interaction matrices have their (i, j) and (j, i) entries zero [Lauritzen, 1996, Theorem 3.9]. As we discuss in more detail in Section 3, the model given by (1), which we refer to as the Hurdle model, can be shown to be equivalent to a finite mixture model of singular Gaussian distributions. In light of the observed symmetry in the positive single cell expression levels, linking the modeling of zero-inflation with Gaussian parameters for nonzero observations is both 4 MCDAVID ET AL.

Corr: Corr: Corr:

BcL6 −0.0214 0.224 −0.0144

● ● ● ● ● ●● ● ● ● ● ●● ● ●●●●● ● ● ●● ●●●● ● Corr: Corr: 0.0916 0.257 CCR7

● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

● ●●●●●●● ● ● ● ●●●●●●●●●●●●●● ● ●●●● ● ● ●●●●●●●●●●●●●●●● ● ●●●●● ● ●●● ●●●●●●●●●●●●●●●●●●● ● ●●●●● ● ●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●● ● ● ●● ●● ●●●●●●●●●●● ● ● ● ● ●● ●●●●●●●●●● ● ●● ● ● ●●●●● ● ● ● ● ●●●● ● ● ● ● ● ● Corr:

Fyn 0.116

● ●●●●●● ●

● ● ● ● ● ● ● ● ● ● ● ● ● ●●●● ● ● ● ●● ● ● ●●●●●●●● ● ●●●●●●● ● ● ● ● ●●●●●●●●●●●●● ● ●●●●●●●● ● ●●●●●●● ● ●●●●●●●● ● ●●●● ●● ● ● ● ●●● ● ● ● ● ●●● ● ●● ●● ● ● ●●● ● ●● ● ● ● ● ● IL7R

● ●● ●●●●●●●●●●●●●●●●●●●●●●●●●●●

● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ●●●●●●●●● ●

BcL6 CCR7 Fyn IL7R Fig 1: Scatter plots of inverse cycle threshold (40-Ct) measurements from a quantitative PCR (qPCR)-based single cell gene expression experiment (lower panels). The cycle threshold (Ct) is the PCR cycle at which a pre- defined fluorescence threshold is crossed, so a larger inverse cycle threshold corresponds to greater log-expression [McDavid et al., 2013]. Measurements that failed to cross the threshold after 40 cycles are coded as 0. Marginal expression in Tfh (CXCR5+PD1+) cells of Tfh marker genes is illustrated in the kernel-density estimates along the diagonal. The lower panels show the linear relationships between pairs of genes. SINGLE CELL EXPRESSION GRAPHICAL MODELS 5 natural and convenient. This said, it is an interesting topic for future work to develop more refined models of the continuous expression arising when genes are ‘on’. We will base statistical inference in the Hurdle model on so-called neigh- borhood selection, where the neighborhood of each gene is inferred via penal- ized regression methods [Meinshausen and B¨uhlmann, 2006]. Neighborhood selection is a state-of-the-art method for estimation and inference in po- tentially high-dimensional graphical models; see the review in Section 3.4 of Drton and Maathuis [2017]. The main challenge in our setting is determining how to calibrate signal in the binary versus the continuous part. We solve this problem using an anisometric group-lasso penalty (Section 4).

1.3. Outline. The remainder of the paper is structured as follows. Sec- tion 2 discusses the parameter targeted in single cell gene expression ex- periments, and why it is not accessible from traditional bulk experiments. Section 3 develops the parametric Hurdle model for single cell gene ex- pression, as specified in (1), and discusses conditional independence in this setting. Section 4 gives a detailed account of estimation of graphical models using neighborhood selection via penalized regression. Section 5 provides a simulation study that demonstrates the benefits of our approach. In Sec- tion 6, we analyze the aforementioned experiment on Tfh cells. Since the data set contains selected gene profiles that were available for both single- and several-cell aggregates, we are able to highlight the refined inferences that can be obtained from single cell data. In Section 7, we analyze data on mouse dendritic cells, which are of far higher dimensionality than the Tfh cell data. Our analyses show in particular that modeling the zero-inflation may uncover distinct networks compared to existing approaches. We con- clude with a discussion in Section 8, where we highlight interesting problems for future research, in particular, in graphical modeling. Section 9 contains an appendix with supplementary material and expanded derivations.

2. Single cell versus bulk expression experiments. Protocols for bulk gene expression experiments, such as for Illumina TrueSeq, call for 100 nanograms of total mRNA, hence require hundreds to thousands of cells. On the one hand, this biological “summation” over many of cells is expected to yield sharper inference on the mean expression level of each gene. However, it can also be expected to distort any conditional (in-)dependences present between genes. m Let Y1,..., Yn be iid random vectors taking values in R , with Yi rep- resenting the copy numbers of m transcripts present in the ith single cell. Now suppose the n cells are aggregated, and the total expression is measured 6 MCDAVID ET AL. using a linear quantification that reports values proportional to the input counts of mRNA. The expression observed in this bulk experiment is then

n Z Y , ∝ i Xi with the constant of proportionality typically a semi-empirical normalization factor, such as TPM (Transcripts Per Kilobase Million) or FPKM (Frag- ments Per Kilobase Million). Although most bulk experiments are designed to test for differences in mean expression due to experimental treatments and lack extensive replication within a condition, stochastic profiling [Janes et al., 2010] experiments have provided iid replicates of Z suitable for esti- mating higher order moments. However, when the distribution of Yi obeys some conditional independence relationships, in general the distribution of Z does not obey these same relationships. For example, take m = 3 and suppose that the Yi are iid samples from a tri-variate distribution supported on 0, 1 3. Let [Y ,Y ,Y ] be a random { } 1 2 3 vector following this distribution, and let pijk = P (Y1 = i, Y2 = j, Y3 = k) be the joint probabilities. Then Y1 and Y3 are conditionally independent given Y (in symbols, Y Y Y ) if and only if the two matrices (p ) 2 1 ⊥⊥ 3 | 2 i0k ik and (pi1k)ik have rank 1 [Drton et al., 2009, Prop. 3.1.4]. Yet even summing over only n = 2 cells, the random vector Z = Y + Y [Z ,Z ,Z ] taking 1 2 ≡ 1 2 3 values in 0, 1, 2 3 generally does not have Z Z Z . { } 1 ⊥⊥ 3 | 2 When the Yi are multivariate Normal, the conditional independence struc- ture is preserved under convolution. Unfortunately for non-Gaussian distri- butions this does not generally hold. As noted in our introduction, single cell gene expression is generally bimodal and zero-inflated, so not plausibly described by a multivariate Normal distribution. Therefore, even though for large enough n the distribution of the bulk experiment Z might approach multivariate (log-)normality, the networks estimated from graphical model- ing of bulk data will not reflect conditional independences that hold among expression levels in single cells.

3. Hurdle models. Univariate Hurdle models arise from modification of a density through excision of points in the support and assignment of posi- tive masses to these points. Targeting zero-inflation, our excision point is the origin. Let vy = I y=0 be the indicator function for a non-zero value of the observation y. Then{ 6 the} Hurdle model derived from a Normal distribution SINGLE CELL EXPRESSION GRAPHICAL MODELS 7 with mean ξ and precision τ 2 has density

(2) f(y) = exp v 1/2 log τ 2/ (2π) + log p/(1 p) ξ2τ 2/2 y − − +yξτ 2 y2τ 2/2 + log(1 p)    −  − with respect to the measure λ0 that is the sum of the Lebesgue measure and a Dirac mass at zero. Here, P (V = 1) = p (0, 1) is a mixing weight y ∈ representing the chance of observing a non-zero value. Varying p, ξ and τ 2, one obtains an exponential family with sufficient statistic y, y2/2, and v , − y and associated natural parameters h = ξτ 2, k = τ 2, and

g = 1/2 log τ 2/ (2π) + log p/(1 p) ξ2τ 2/2. − − 3.1. Multivariate Hurdle models. A plausible model for the joint distri- bution of a random vector Y = [Y1,...,Ym] representing single cell gene expression puts positive mass on every one of the 2m coordinate subspaces (recall Figure 1), including the origin when all genes are ‘off’ and the en- tire space Rm when all genes are ‘on’. Assigning positive mass to the co- ordinate subspaces generalizes the univariate construction from (2). As it is easiest to construct this model conditionally, we introduce the vector T T V = [V1,...,Vm] [I y1=0 ,...,I ym=0 ] that indicates the non-zero co- ≡ { 6 } { 6 } ordinates of Y. Throughout, our notation suppresses the dependence of V on Y. We emphasize that specification of the distribution of the multivari- ate Bernoulli random vector V simply amounts to specification of a 2m probability table. m v For any vector v = [v1, . . . , vm] 0, 1 , define the subspace R = m ∈ { } Rvi where we set R0 = 0 . So, Rv is the coordinate subspace cor- i=1 { } responding to the non-zero entries of v. Similarly, define PD(v) to be the Q cone of m m symmetric matrices that have non-zero entries only in rows × and columns indexed by i with vi = 1, and for which the submatrix given by these rows and columns is positive definite. Now suppose that the con- ditional distribution of Y given V is multivariate Normal and, specifically,

(3) (Y V = v) (µ(v), Σ(v)) | ∼ N with mean vector µ(v) Rv and covariance matrix Σ(v) PD(v). The nor- ∈ ∈ mal distribution in (3) is singular (see Section 9.2 for details) and supported on the subspace Rv. In the applications we have in mind the dimension m will be large enough so that it is infeasible to accurately estimate a general 2m probability table for the distribution of V, and a collection of 2m mean vectors and covariance 8 MCDAVID ET AL. matrices for the conditional distribution of Y. We thus proceed to formu- late a more parsimonious pairwise interaction model. While of far lower dimension, the pairwise model allows one to capture interesting conditional (in-)dependences. First, we assume V to follow an Ising model with joint probabilities

(4) p(v) P (V = v) exp vT Gv , v 0, 1 m, ≡ ∝ ∈ { }  m m where G is a symmetric interaction matrix in R × . Second, we assume that the conditional normal distribution of Y given V = v has log-density

T 1 T v (5) log f(y V = v) = v Hy y Ky C0(H, K), y R , | − 2 − ∈ with respect to Lebesgue measure restricted to the subspace Rv. In (5), H and K are two m m interaction matrices that do not vary with v, × and C0(H, K) is a normalization constant. The matrix K is symmetric and m m positive definite, but H may be arbitrary from R × . Putting the two pieces from (4) and (5) together, the joint density of Y with respect to the product m measure λ0 simplifies to (6) T T 1 T m f(y) = exp v Gv + v Hy y Ky C(G, H, K) , y R . − 2 − ∈   We recognize an exponential family with three interaction matrices G, H and K as natural parameters and the three statistics vvT , vyT , and yyT sufficient. Let (V) be the m m diagonal matrix with (i, i) entry equal to I ≡ I × m Vi. Then for any vector x R the product x is the vector that has the ∈ I ith coordinate replaced by zero for all indices i with Yi = Vi = 0. Similarly, multiplying from left and right to a matrix zeros out all but the principal I submatrix determined by this set of indices. Using this notation, the pairwise Hurdle model from (6) corresponds to the particular choice of

(7) µ(v) = K −Hv, Σ(v) = K − I I I I for the mean vectors and covariance matrices in the conditional  specification from (5). In (7), A− denotes the Moore-Penrose pseudoinverse of a matrix A. From the perspective of (7), the pairwise Hurdle model is a mixture of 2m singular Gaussian distributions whose mean vectors and covariance matrices are derived from one precision matrix K and an interaction matrix H. The notation we used in the conditional specification of the multivariate Hurdle model follows Lauritzen [1996], who describes conditional Gaussian SINGLE CELL EXPRESSION GRAPHICAL MODELS 9

(CG) models with inhomogeneous, non-singular precision K(v) that can depend on the discrete set of covariates in arbitrary, positive-definite fash- ion. These models have been considered more recently by Lee and Hastie [2013] and Cheng et al. [2013]. Our formulation differs from the traditional CG models by involving singular distributions with means and covariance matrices that exhibit structured inhomogeneity.

3.2. Conditional distributions identify interaction parameters. The nor- malizing constant C in equation (6) is a difficult to compute sum of 2m terms. This is expected as already the distributions in the Ising model from (4) have an intractable normalization constant for moderately large m. Fortunately, the univariate full conditional distributions obtained from (6) have tractable normalizing constants and identify the parameters from a given row/column of the interaction matrices G = (gab), H = (hab), and K = (kab). Fix a coordinate b, and define its complement A = 1, . . . , m b . { }\{ } Consider now the density f(y) from (6) as a function of only yb, i.e., yA = [yi : i A] is fixed, and write f[b A] for the conditional density of yb given ∈ | 2 yA. Then noting that viyi = yi and vi = vi, we have

1 2 (8) log f[b A](y) = vbg[b A] + ybh[b A] yb k[b A] C[b A], yb R, | | | − 2 | − | ∈ where C[b A] does not depend on yb and |

(9) g[b A] = gbb + 2gbAvA + hbAyA, | T (10) h[b A] = hbb + hAbvA kbAyA, | − k[b A] = kbb. | The conditional density f[b A] is thus a univariate Hurdle density as specified | in (2) with natural parameters g[b A], h[b A], and k[b A]. The three natural parameters| are obtained| from| linear predictors that depend on a design matrix constructed from yA and vA. For example, we may write gba g[b A] = gbb + Xa | hba a A   X∈ for Xa = [va, ya]. The linear predictor for h[b A] can be written analogously. | We note that if the data include additional nuisance covariates W0 that describe each experimental unit then these can be included by augmenting the linear predictor to

T gba (11) g[b A] = W0 gb0 + gbb + Xa | hba a A   X∈ 10 MCDAVID ET AL. with gb0 being the parameters capturing the effects of the covariates. From this perspective, the conditional distribution in (8) defines a vector gener- alized linear model, parametrized by three natural parameters g[b A], h[b A] | | and k[b A], the first two of which are modeled as a linear function of the expression| of other genes.

3.3. Conditional independence graphs. The dependence structure of the random vector Y = [Y1,...,Ym] may be summarized in its conditional in- dependence graph. This is an undirected graph = ( , ) with vertex set G V E = 1, . . . , m and an edge set that is determined by the conditional V { } E independences in Y. More precisely, the edges in are those two-element E sets a, b for which Y and Y are conditionally dependent given the { } ⊂ V a b remaining variables, i.e., Y a,b . In our case, Y has a density f as in (6). The dominating measure isV\{ a product} measure, and f is positive and contin- uous. Hence, the Hammersley-Clifford theorem assures that the conditional independence graph of Y has an edge a, b if and only if the four possible { } ab interactions are zero, so

(12) gab = hab = hba = kab = 0; see Lauritzen [1996, Chapter 3]. This fact is also evident from the form of the conditional distributions detailed in (8), (9), and (10). It motivates the neighborhood selection procedure developed in the next section.

4. Neighborhood estimation via penalized regression. In the sin- gle cell experiments to which we envision applying this method, the number of cell replicates, n, is larger than the sample sizes seen in typical bulk mRNA experiments. However, it is still often the case that the number of genes m is larger than the number of cell replicates. We are thus in a setting that benefits from application of methods from ‘high-dimensional statistics’; though emerging technologies are increasing available sample sizes.

4.1. Related work. Under scenarios in which n, m while satisfying → ∞ that n > Cdφ(log m)ψ, where C, φ and ψ are constants that depend on the model and d is the maximum vertex degree of the conditional independence graph, penalized regression has been shown to consistently identify the graph of multivariate Normal models [Meinshausen and B¨uhlmann,2006], of Ising (auto-logistic) models [Ravikumar et al., 2010] and of exponential family graphical models [Yang et al., 2014, Chen et al., 2015]. While this paper was in preparation, Tansey et al. [2015] further extended this line of work to general vector space graphical models that include the multivariate Hur- dle model as a special case. However, the standard (isometric) group-lasso SINGLE CELL EXPRESSION GRAPHICAL MODELS 11 they propose for estimation of the conditional independence graph does not account for heterogeneity in the scaling of predictors in the conditional dis- tributions. The anisometric group-lasso we propose in the following section yields drastic improvements in finite samples.

4.2. Anisometric penalty. Throughout this section, we fix an index b and consider the conditional distribution Yb given the other variables in Y for A = 1, . . . , m b . For any a A, define the parameter vector A { }\{ } ∈ θa = [gba, hba, hab, kba]. By (12), Yb Ya YA a if and only if θa = 0. ⊥⊥ | \{ } Let θ = [θ : a A], and let a ∈ T (13) Pλ(θ) = λ θa θa a A X∈ q be the group lasso penalty for tuning parameter λ 0. Maximization of the ≥ penalized conditional log-likelihood function

log f[b A](y) Pλ(θ) | − can lead to a solution that is sparse in parameter blocks, that is, some of the subvectors θa are zero. The penalty is equivalent to placing a sequence of independent, multivariate Laplace priors on blocks of θ and reporting the MAP [Eltoft et al., 2006]. Viewed as a prior, the standard group-lasso penalty from (13) implicitly assumes that each variable in each block has a similar effect size. This may be reasonable if the variables in each block are measured in comparable units, but is problematic otherwise. For example, if covariate X1 is measured in meters, while covariate X2 in centimeters, then the distribution of effect sizes for X2 would be 100-times more dispersed than the distribution of effect sizes for X1. In penalized GLMs, this is typically enforced “at run time” by ensuring covariates are on comparable scales, or Z-scoring each column of the design matrix if no intrinsic scale exists. In our setting of a vector regression, terms from linear predictor g[b A] | and linear predictor h[b A] end up together in blocks, and these coefficients are not necessarily comparable,| as one specifies log-odds of E(V V = 0) b| A while the other specifies conditional expectations of E(Y Y ). Re-scaling b| A does not resolve this, since the same design matrix Xa = [Va,Ya] is used in each linear predictor, and in any case, re-scaling generally alters the solution [Simon and Tibshirani, 2012]. Instead, we propose replacing the isometric `2 norm in the sum in (13) so that the penalty is

T (14) PH,λ(θ) = λ θa Haaθa. a A X∈ q 12 MCDAVID ET AL.

Here, H diag (H ) is a block-diagonal, positive-definite matrix that al- ≡ aa lows terms from the linear predictors to have different scales of penalty. It also accounts for correlation between components of θa, since columns of the design are correlated due to both va and ya appearing as predictors. If prior information existed, the matrix H could be chosen accordingly, with interpretation as a multivariate Laplace prior. Absent prior informa- tion, setting H equal to the Fisher information under a null model θa = 0 for all a results in variable selection approximately equal to conducting score tests, with exact equivalence holding under a null hypothesis of θa = 0 for all a; see Proposition 1 in Section 9.

4.3. Computation. In Algorithm 1, we outline the proposed neighbor- hood selection, allowing for possible nuisance covariates W. The nuisance covariates W might just be an intercept column, but generally could be any cell-level covariate deemed relevant. The smooth and concave function in line 7 can be maximized using any Newton-like algorithm (e.g., BFGS). The objective in line 10 is a sum of a concave, smooth function and a structured concave function and can be efficiently solved using proximal gradient as- cent [Parikh and Boyd, 2014]. In particular, one may exploit the fact that although the proximal operator 1 prox (x) = argmax x u 2 + uT H u γ u γ k − k2 a aa a a A X∈ q is not available in the familiar form of a soft-thresholding operator as in the isometric group-lasso, the proximal operator of the anisometric group-lasso can be efficiently found via a line search after one-time pre-calculation of the singular value decomposition of Haa [Foygel and Drton, 2010]. Throughout the inner-loop, warm starts are exploited for θˆ as λ varies. Active set heuris- tics using the strong rules of Tibshirani et al. [2012] yield computational gains for sparse solutions with large m. The algorithm yields, for each node, a sequence of neighborhoods over a sequence of tuning parameters Λ. These neighborhoods need not be consistent, in the sense that for some element of Λ it could be that b Ne(a) but a / Ne(b). We resolve that by adopting an ∈ ∈ “or” rule. In the accompanying software1, the algorithm is written in a com- bination of R and C++. Timings for the proposed method and competitors (described further in Section 5) are shown in Figure 3.

5. Simulations. We consider a series of simulations under several sets of underlying i) graph topologies, ii) parametric models, iii) sample sizes and

1Available https://github.com/amcdavid/HurdleNormal SINGLE CELL EXPRESSION GRAPHICAL MODELS 13

n m n q Data: Expression matrix Y R × , nuisance covariates W R × , penalty path ∈ ∈ Λ. 2q+1 Working parameters: Unpenalized nuisance parameters θ0 R , edge 4(m 1) ∈ parameters θ R − . ∈ Result: Neighborhoods ne(i, λ), 1 i m, λ Λ ≤ ≤ ∈ 1 for b 1, . . . , m do ∈ { } 2 A 1, . . . , m b ; ← { }\{ } 3 X W, YA, VA ; ← T T 4 θ0 gbb, gb0, hbb, hb0, kbb ; ←  T  5 θ gbA, hbA, h , kbA ; ←  Ab  6 Let log f[b A] (θ0, θ) return the log-density (8) evaluated at [θ0, θ] with covariate  |  matrix X. 7 ¯ θ0 argmaxθ0 log f[b A](θ0, θ = 0) ; ← 2 ¯ | 8 H log f[b A] θ0, 0 ; ← ∇ | 9 for λ Λ do ˆ∈ ˆ  10 [θ0, θ] argmaxθ ,θ log f[b A](θ0, θ) PH,λ(θ); ← 0 | − 11 Let ne(b, λ) contain vertex a whenever any of gˆbA, hˆAb, hˆbA, kˆAb = 0. 6 12 end 13 end Algorithm 1: Neighborhood selection iv) number of vertices. We summarize the considered setups here and defer details (including the choice of graph topology) to the supplementary mate- rial in Section 9.5. The number of observations n varies from 100 to 12500. In the chain graph topology, the number of vertices varies from m = 16 to m = 128, while in the e. coli graph topology, m = 500. The paramet- ric models include the pairwise hurdle model (6), the hurdle model under contamination by t8 noise, a logistic/Ising model and a Gaussian/logistic censoring model specified in Equation (15) in Table 1. The pairwise hurdle model is said to be complete if for each edge present in the graph, all of the corresponding entries in each of the three interaction matrices are non-zero. The pairwise hurdle model is said to be G-minimal when H and K are di- agonal matrices and only G contains non-zero off-diagonal entries. In this case, the G-minimal model is equivalent to a logistic/Ising model.

5.1. Methods compared and default tunings. Six methods were examined to test graph structure inference, and are described in Table 1. The Hur- dle models are fit using the accompanying software HurdleNormal version 0.98.2, while the Logistic, Gaussian and NPN models are fit using the R package glmnet version 2.0-5 (via the autoGLM function in HurdleNormal). The Aracne method is fit using package netbenchmark version 1.6.0. For methods 1-5, neighborhoods are stitched together using an “or” rule, i.e., 14 MCDAVID ET AL. vertices a and b are adjacent if either b ne(a) or a ne(b). ∈ ∈ In Figure 4 various fixed tunings are shown. In the oracle tuning, the graph with maximum sensitivity subject to FDR < 10% is shown. This tuning is not available in practice, but shows the maximum achievable per- formance of each method. With the BIC tuning, we employ the Bayesian Information Criteria on the pseudo-likelihood

ˆ BICλ = 2 log f[b A](θb,λ) + θb,λ 0 log n, − | k k b X∈V where θ is the penalized solution at penalty λ for vertex b, θ is the b,λ k b,λk0 number of non-zero entries, and θˆb,λ is the (unpenalized) maximum pseudo- likelihood estimate for the non-zero entries. The BIC solution is the one that minimizes BICλ. This tuning is available for methods 1-5. In the case of the the Aracne method the BIC is unavailable as no likelihood is defined.

5.2. Results. 30 simulation replicates sufficed to bound the simulation- induced Monte Carlo standard error of the mean < 5 10 3 for FDR and × − < .02 for the sensitivity. The simulations show that mis-specified estimation procedures perform poorly when model (6) is the data generating distribution. When an FDR- controlling oracle is available, the anisometric Hurdle model can dominate other methods in edge-sensitivity (Figure 4A-B). However, when the Hur- dle model is over-parameterized as in the G-sparse scenarios, the minimal Logistic model is superior, though the anisometric `1 penalty partially ame- liorates this gap. In very simple chain-graph scenarios, it is neigh-impossible to recover a network using 10-cell data. The e. coli network provides a counter example where 10-cell data nearly equals the performance available from single cell data. This may be due to the hub-and-spoke nature of the e. coli network, so the effect of marginalization by convolution tends to only add more connections between the hub and its neighborhood. The e. coli data and chain-graphs suggest that collecting single cell data, and estimat- ing graph structure with a method that accommodates zero inflation can accurately discover a wide variety of network topologies. More seriously, ignoring zero-inflation confounds use of information cri- teria to tune network size (panel C). On the other hand, the Hurdle model is robust to a variety of model departures, including contamination with t8- distributed errors (labelled with “t”), and data generation under a Gaussian- Logistic censoring model. When the full solution path is examined (Figure 5), a practioner who reported only the top few edges would often suffer from a large number of false positives when using methods not designed for SINGLE CELL EXPRESSION GRAPHICAL MODELS 15

Graph topologies and parametric models 1. G-minimal chain graphs, with tri-diagonal G-interaction matrix with off-diagonal entries set to 1, and diagonal H, K. In this case, an Ising/logistic model is minimally complete. 2. G-H-K-complete chain graphs, with off diagonal G = .2,H = .75,K = .4. The − − proposed model is thus minimally complete. 3. e. coli-networks: 500 edges from a semi-empirical e. coli network and pairwise hurdle likelihood. 50% of edge weights are G-minimal, 25% K-minimal and 25% complete. 4. 10-cell versions of 1-3. The 10-cell observation Y(10) is generated as Y(10) = 10 Yi log2 i=1 2 /10 and Y is generated as under model 1-3. 5. 1-3 withP non-zero observations contaminated with t8 noise. 6. 1-3 with the following latent Gaussian/logistic selection model:

Y˜ (µ, K), ∼ N P V˜j Y˜ = y˜ = logit(a + by˜j ), | (15)  Y = Y˜ V.˜

Methods 1. Aracne [Margolin et al., 2006]: connects genes with significant pairwise mutual information and applies pruning rules to suppress indirect effects.

2. Gaussian: neighborhood selection with `1-penalized linear regression [Meinshausen and B¨uhlmann,2006].

3. Logistic: neighborhood selection with `1-penalized logistic regression[Ravikumar et al., 2010].

4. NPN: neighborhood selection with `1-penalized linear regression on Gaussian- quantile transformed responses [Liu et al., 2009]. 5. Hurdle (isometric): neighborhood selection with model (6) and isometric group- lasso penalty. 6. Hurdle (anisometric): neighborhood selection with model (6) and anisometric group-lasso penalty. Table 1 Overview of simulation scenarios and methods compared. 16 MCDAVID ET AL. zero-inflated data. For example, with n = 100 in the e. coli network, all methods, aside from the Hurdle have FDR exceeding 20%. The simulations also suggest that perfect recovery of gene networks is impractical at realistic sample sizes, even with a correctly specified model, motivating a form of meta-analysis on estimated graphs, discussed further in Section 7.2.

6. T follicular helper cells. Our simulations show that depending on the data generating scenario, the Hurdle method may substantially out- perform, or at least mimic the performance of other candidate methods. We next sought to see if methods would tend towards consensus in biologically- derived single cell and 10-cell data, or if it were possible that the Hurdle method might offer unique insights. We considered co-expression networks in Tfh cells measured in eight healthy donors. 65 genes were selected for profiling via qPCR on the basis of their role in Tfh signaling and differentia- tion, generally with sparse expression across single cells (overall probability of expression 27%). 465 single cell, and 187 10-cell replicates were taken. Figure 6 shows networks of approximately 24 edges estimated using Hur- dle, Gaussian (with centered data, see Section 9) and Logistic, and Gaussian model using 10-cell aggregates. The size of the network is a compromise be- tween stability selected [Shah and Samworth, 2013] sizes of each procedure, which varies from 11 edges (Hurdle) to 32 edges (Gaussian). Normalized Hamming distances between the four methods, the Aracne method and the Gaussian model fit on the “raw”, uncentered data are reported in Table 2. The Hurdle and Gaussian models are most similar, while the logistic and Gaussian 10-cell network are quite distinct. The Gaus- sian(raw) model on untransformed data is similar to the logistic model, as distance of non-zero expression values from the origin is large compared to the variation among the non-zero values. In the Hurdle network, the transcription factors NFATC1 (Nuclear fac- tor of activated T-cells) and BCL6, and the signaling molecule CD154 and chemokine receptor CCR3 are hubs. NFATC1 has been found to promote transcription of cytokines IL21 [Hermann-Kleiter and Baier, 2010] and sig- naling molecule CD154 [Pham et al., 2005], while BCL6 serves as a transcrip- tional repressor, and is one of the canonical markers constitutively expressed in Tfh cells. CTLA4 which has been described to inhibit inflammation, interacts negatively with inflammatory activator JAK3. The disconnected component of CCR3-CCR4-BTLA-SELL-TNFSF4 may hint at plasticity between Tfh cells and the related T-cell lineages Th1 and Th2. CCR3 and CCR4 are canonical markers of Th2 cells, while TNFSF4 (coding for OX40L) promotes Th2 development de Jong et al. [2002]. Thus co-expression of these SINGLE CELL EXPRESSION GRAPHICAL MODELS 17

Gaussian (10) Gaussian Gaussian(raw) Hurdle logistic Aracne 1.00 0.92 0.92 1.00 1.00 Gaussian(10) 1.00 1.00 1.00 1.00 Gaussian 0.92 0.65 1.00 Gaussian(raw) 1.00 0.39 Hurdle 1.00 Table 2 Hamming Distance Dissimilarities Number of edges between networks of size 24 estimated through various methods. The Gaussian(10) model is a Gaussian model estimated on 10-cell replicates,  while the Gaussian(raw) data is estimated on single cells without centering the data. The remaining models are described in Section 5. genes may suggest cells transitioning between Tfh and Th1 or Th2 states. In the Gaussian network, though NFATC1, BCL6 and CD154 remain highly connected, CD27 now has highest degree and serves as a hub to receptors CXCR4, IL2Rb, IL2Rg, as well as ITGB2, NFATC1 and FYN. CD3e, the backbone responsible for transducing the T-cell receptor signal is connected with co-receptor CD4, CD154, IL2Rg, Fyn and ANP32B. The negative interactions between BTLA and CTLA4 are absent. The logistic network consists primarily of negative interactions. The strongly negative BCL6–BLIMP1 edge is consistent with previously described antag- onism between these genes [Johnston et al., 2009]. Interestingly, this edge is absent in the other networks. Networks found by applying the Aracne and Gaussian(raw) methods are shown in the supplementary material.

7. Mouse dendritic cells. Shalek et al. [2014] exposed bone marrow- derived dendritic cells, from mus musculus, to lipopolysaccharide (LPS). LPS is a toxic compound secreted and structurally utilized by gram-negative bacteria and induces a cascade of changes in a cell’s expression profile through several pathways. Cells were sampled after 0, 1, 2, 4, and 6 hours post-exposure. We estimated transcription networks using 4431 transcripts expressed in at least 20% of 65 cells sampled 2 hours after LPS exposure, at which interval transcription is expected to be undergoing a variety of dy- namic changes. Rather than attempting to perform model selection on this limited sample size, we consider highly sparse (< .01% sparsity) networks of 700 edges, chosen to provide tractable visualization and illustration of the method. The BIC tunings (discussed subsequently) are decidedly larger.

7.1. Selected networks. In a Gaussian model, the network is star-shaped, with Mx1, Ccl17, Tax1bp3 and Ccl3 as hubs all with degrees 15, though ≥ none are directly inter-connected (Figure 7). In all, 2.5% of non-isolated vertices contribute 50% of the edges in the network. With the exception of 18 MCDAVID ET AL.

Tax1bp3, these hub genes are all immune-signaling related. In the Hurdle model (Figure 8), the graph is more chain-like, with maxi- mum degree 12: 7% of nodes provide 50% of the edges. The strongest hub, Mgl2 (also known as Cd301b), has been recently described to be involved in uptake and presentation of glycosylated antigens, such as LPS, by den- dritic cells [Denda-Nagai et al., 2010]. A sub-connected set of genes coding for MHC-II antigen presentation (H2ab1, H2eb1, H2aa) is the densest sub- component, and interconnected to Mgl2 as well as Fabp5. Increased expres- sion of Fabp5 has been shown to increase expression of cytokines Il7 and Il18, hence is also involved in immune cell stimulation [Adachi et al., 2012]. Many of the neighbors of Mgl2,H2ab1, H2eb1, H2aa and Fabp5 are neigh- bors of the hub genes in the Gaussian graph, whereas Mx1, Ccl17 and Ccl3 are sparsely connected in the Hurdle network. Tax1bp3 is absent. Using BIC, both the Gaussian and Logistic models yield networks with more than 25,000 edges, while the Hurdle selects a network of roughly 12,000 edges. The additional flexibility available in the Hurdle for modeling inter- node relationships may permit sparser graphs to describe the conditional dependence relationships. We also observe that the Hurdle synthesizes signal from both Gaussian and Logistic networks. For sufficiently rich network sizes, the Gaussian and Hurdle and Logistic and Hurdle networks share 21% and 1% of possible edges, respectively, compared to only .08% of possible 6 edges between the Gaussian and Logistic networks (binomial test p < 10− ).

7.2. Graphical geneset edge enrichment. We consider how well the 700 edge networks recapitulate known relationships between genes using pre- viously described functional annotations. The Consortium [2015] provides a database of categories to which genes may be annotated if experimentally or computationally they are involved in a biological process. We note that networks may exhibit intraconnection within GO categories, and that some pairs of categories may exhibit preferential interconnection. Each pair (i, j) of GO categories—including self-pairs—induces a coloring of vertices, coloring the vertices belonging to category i color ci and category j color cj. Vertices that do not belong to either i or j remain uncolored. Iterating through the 39872/2 pairs of categories, we test for edge enrichment between colors. Suppose in the inferred graph of 700 edges, nij edges connect ci-colored vertices to cj vertices. If the colored vertices were completely connected with ni vertices of color ci and nj vertices of color cj, then there would be m = n n edges among them (with the obvious adjustment ij i × j made for self-edges when i = j). We now define an enrichment statistic as SINGLE CELL EXPRESSION GRAPHICAL MODELS 19 the hypergeometric tail probability

t = P (N > n ; 700, m , 4431 4430/2), ij ij ij ij × which is the probability of drawing nij colored balls, given 700 draws from urn containing 4431 4430/2 balls of which m are colored. × ij This results in nearly 16 million enrichment statistics on the pairs of categories, which follow a complicated dependence structure under the series of null hypotheses that the observed edges being connected independent of coloring. The top 200 (smallest in magnitude) enrichment statistics t(k), k < 200 are compared to their distribution P (t∗) under a Erdos-Renyi random graph model, yielding a Monte Carlo p-value for each order statistic. A pair of colors (i, j) with rank r < 200 is declared significant if P (t < t ) < ij ij (∗rij ) .05 and P (t(r) < t(∗r)) < .05 for all r < rij, that is, it is significant at 5% and all smaller order statistics are also significant.

7.2.1. Hurdle graphs tend to include intra-category enrichment. In the Gaussian model, more than 100 pairs of categories (colors) are significantly enriched at an FDR of less than 10%, however in these pairs, only 6 corre- spond to intra-category enrichment (Figure 10). These are: response to salt stress, potassium channel regulator activity, extracellular exosome and three genesets containing genes with significant time-course differential expression in the original experiment. In the Hurdle model (Figure 11), 13 of 57 sig- nificantly enriched pairs form intra-connections, including defense response to Gram-negative bacteria, and cell-cell adhesion and several modules in- volving extracellular secretion via the . Also of particular note, genes annotated to the activation of innate immune response are di- rectly connected to RNA PolII transcription factors, as well as “detection of lipopolysaccharide”–“endoplasmic reticulumGolgi intermediate compart- ment.” Both of the modules are absent from the Gaussian network. This suggests that the more appropriate Hurdle model manages to identify tran- scription factor-induced expression changes in these regulated genes, a direct method by which one gene would induce expression changes in another. No significant enrichment was found in the logistic model.

8. Discussion. Graphical models estimated from single cell data are distinct from networks estimated from bulk data, or even repeated stochas- tic samples. In simulations, the Hurdle model with anisometric penalty has much greater sensitivity compared to available methods, while in the two data sets here, it yields substantially different network estimates com- pared to Gaussian and Logistic models on these zero-inflated data. When 20 MCDAVID ET AL. enrichment of gene ontology categories is considered between vertices in transcriptome-wide data, the enrichment uncovered with the Hurdle model is consistent with identifying direct effects of transcription factors on genes undergoing dynamic regulation due to LPS exposure. In our work, we have utilized methods for sparse neighborhood selection. However, the zero-inflated parametric model explored here is not limited to this framework, and could serve as a basis for many network inference tech- niques, including mutual information-based techniques, or to parametrize families of directed networks. Although measuring transcriptome-wide data allows conditional estima- tion of direct effects between genes, non-mRNA factors may also greatly affect gene expression. In this sense, important variables have still been marginalized over, and in the case of the Tfh data, indeed, most of the transcriptome has been marginalized over. Extensions that adapt graphical model selection to clustering and/or factor analytic models would likely be useful and allow greater biological insight with these data sets.

9. Supplementary material.

9.1. Data processing. The method, and code to reproduce results in this paper is available as an R package at https://github.com/amcdavid/ HurdleNormal.

In all models and data sets, the cellular detection rate j Iyij >0 [Finak et al., 2015] was used as an unpenalized adjustment covariate in W as de- P scribed in Algorithm 1. In the Tfh data, a separate, unpenalized intercept was fit for each donor, as well. For the Gaussian and Hurdle models, positive values were conditionally centered

0 if vij = 0, y˜ij = + (yij y¯ else, − j + wherey ¯j is the average in a gene over positive values. This made Vj and Yj marginally orthogonal, speeding up the convergence of the optimization algorithm and reducing the leverage of zeros in the Gaussian model. The “Gaussian(raw)” model was also fit to the untransformed data, but not always discussed as it gave similar results as the Logistic model. The graph stability (via repeated 50% sample splitting) was used to esti- mate the network size. At 60% stability, the number of selected edges ranged from 11 (Hurdle) to 32 (Gaussian). Background noise in the mouse dendritic cells (mDC) data set was thresh- olded as described previously [Finak et al., 2015], and filtered for low- SINGLE CELL EXPRESSION GRAPHICAL MODELS 21 expression and cluster-disrupted cells. Figure 12 shows the Bayesian infor- mation criterion for the fitted path. An interior minimum fails to occur in the solution path for three of the methods.

9.2. Singular normal distributions. A random vector Y has singular Nor- mal distribution (µ, Σ) [Rao, 1973] with mean µ and covariance Σ with N rank r < m if the following holds for a matrix U with UT Σ = 0: a) UT Y = UT µ almost surely, and b) Y has a density

r/2 (2π)− T (16) f(y) = exp (y µ) Σ−(y µ)/2 , (det+ Σ)1/2 {− − − } with respect to Lebesgue measure restricted to the hyperplane UT Y = UT µ. Here det+ is the pseudo-determinant (product of non-zero eigenvalues) and Σ− is a pseudo-inverse, such as the Moore-Penrose inverse. In the case that Σ is zero outside a positive-definite submatrix of size r r, U can be × chosen to be a diagonal selection matrix consisting of zeros and ones, and r m r Y has a density with respect to the measure λ δ − , which is the case ⊗ 0 treated here.

9.3. Normalizing the joint density. The expression (17) T T 1 T m f(y) = exp v Gv + v Hy y Ky C(G, H, K) , y R , − 2 − ∈   that was given in (6) is a normalizable density. Let K+ = K − and I I rewrite (5) as  1 log f(y V = v) = vT Hy yT Ky | − 2 1 = vT Hy vT HK+Hv + vT HT K+KK+Hv yT Ky − − 2 Using the notation from (7) and applying (16), the normalizing constant of the density in (1) is found to be given by

1/2 T T + 1 C(G, H, K) = log exp v Gv + h K −h/2 det K . I I 2π I I v 0,1 m    ∈{X} h  i

9.4. The anisometric penalty is a score test of θa = 0 for all a. 22 MCDAVID ET AL.

2 ∂ log f[b A](y) Proposition 1. Let H = | be the conditional information. ∂θiθj 1 Suppose H and thus also its inverseh H− isi block-diagonal. Then the aniso- metric group lasso penalty is equivalent to a score test of the null hypothesis that θ = 0 vs. the alternative that a pre-specified subvector θ = 0. a 6 Proof: Let c = a, b and suppose that θ = 0. From the KKT condi- V\{ } c tions, θa = 0 is an optimum if and only if

T 1 2 H− < λ , ∇a aa ∇a

∂ log f[b A](y) where a = | is the a-subvector of the conditional log-likelihood ∇ ∂θa gradient. Taking λ2 to be an appropriate quantile from a χ2-distribution with dim(Haa) degrees of freedom results yields a score test.

9.5. Simulation details.

9.5.1. Graphs and parametric alternatives. In the G-minimal and com- plete scenarios, the underlying graph is a (perhaps incomplete) chain, with either 1.5% of nodes connected (Figure 4a) or 5% (Figures 4b and 5). In the e. coli scenario, the underlying graph is a 500-vertex subgraph sampled from a network described in Gama-Castro et al. [2011] and available from GeneNetWeaver [Schaffter et al., 2011]. In the G parametric alternative, given the underlying graph, the data are derived from model (6), restated in (17), with only the G interaction matrix set to non-zero. In this case, al- though the specified conditional independences hold exactly, an auto-logistic (Ising) model is minimally complete, while the multivariate Hurdle model is over-parametrized. In the complete parametric alternative, given the un- derlying graph, all three interaction matrices G, H and K are non-zero simultaneously in the appropriate entries.

9.5.2. Generative models. The Hurdle generative model, and deviations from it are considered. In the exact case, observations are generated through Gibbs sampling from model (6) using the full conditional distributions avail- able in (8). Samples from conditional distributions are generated simply as Bernoulli and Normal random variates. A 2000 iteration burn-in phase, and sample thinning was employed. Thinned samples exhibited only mild auto-correlation. In the contaminated case, a matrix of exact variates Y are sampled, and onto them (given Y = 0) is added t -distributed noise. So ij 6 8 the final variates remain zero-inflated, but are heavier-tailed than a Normal distribution. In the selection case, a matrix Y˜ of latent, non-zero-inflated SINGLE CELL EXPRESSION GRAPHICAL MODELS 23

Gaussian variates are sampled that follow the graphical model implied by the K-interaction matrix. These are zero-inflated through a selection model

P V˜ Y = y = logit(a + b y ), j| j j j  Y = Y˜ V.˜

The parameters aj and bj are chosen to keep P (V˜j) away from the boundary values 0 and 1. Lastly, in some cases, we consider in-silico 10-cell replicates. Given a desired sample size n, draw 10n observations Y from model (6), and let the observed data Y(10) follow 10 (10) Yi Y = log2 2 /10. Xi=1 Acknowledgments. RG was funded by a grant from the Bill and Melinda Gates foundation, the Vaccine and Immunology Statistical Center (VISC), OPP1151646. AM and RG acknowledge funding through grant R01 EB008400 from the National Institute of Biomedical Imaging and Bioengineering, US National Institutes of Health. MD was partially supported by grant DMS 1561814 from the US National Science Foundation. AM thanks Daniel Lu for comments on the networks in reported section 6. References. Yasuhiro Adachi, Sumie Hiramatsu, Nobuko Tokuda, Kazem Sharifi, Majid Ebrahimi, Ari- ful Islam, Yoshiteru Kagawa, Linda Koshy Vaidyan, Tomoo Sawada, Kimikazu Hamano, and Yuji Owada. Fatty acid-binding 4 (FABP4) and FABP5 modulate cytokine production in the mouse thymic epithelial cells. Histochemistry and Cell Biology, 138 (3):397–406, 2012. Shizhe Chen, Daniela M. Witten, and Ali Shojaie. Selection and estimation for mixed graphical models. Biometrika, 102(1):47–64, nov 2015. Jie Cheng, Elizaveta Levina, and Ji Zhu. High-dimensional mixed graphical models. arXiv preprint arXiv:1304.2810, apr 2013. Esther C de Jong, Pedro L Vieira, Pawel Kalinski, Joost H N Schuitemaker, Yuetsu Tanaka, Eddy a Wierenga, Maria Yazdanbakhsh, and Martien L Kapsenberg. Micro- bial compounds selectively induce Th1 cell-promoting or Th2 cell-promoting dendritic cells in vitro with diverse th cell-polarizing signals. Journal of Immunology, 168(4): 1704–1709, 2002. Kaori Denda-Nagai, Satoshi Aida, Kengo Saba, Kiwamu Suzuki, Saya Moriyama, Sarawut Oo-puthinan, Makoto Tsuiji, Akiko Morikawa, Yosuke Kumamoto, Daisuke Sugiura, Akihiko Kudo, Yoshihiro Akimoto, Hayato Kawakami, Nicolai V. Bovin, and Tat- suro Irimura. Distribution and function of macrophage galactose-type C-type lectin 2 (MGL2/CD301b): Efficient uptake and presentation of glycosylated antigens by den- dritic cells. Journal of Biological Chemistry, 285(25):19193–19204, 2010. 24 MCDAVID ET AL.

Adrian Dobra, Chris Hans, Beatrix Jones, Joseph R. Nevins, Guang Yao, and Mike West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90:196–212, 2004. Mathias Drton and Marloes Maathuis. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017. Mathias Drton, Bernd Sturmfels, and Seth Sullivant. Lectures on algebraic statistics, volume 39 of Oberwolfach Seminars. Birkh¨auserVerlag, Basel, 2009. Torbjørn Eltoft, Taesu Kim, and Te Won Lee. On the multivariate Laplace distribution. IEEE Signal Processing Letters, 13(5):300–303, 2006. Greg Finak, Andrew McDavid, Masanao Yajima, Jingyuan Deng, Vivian Gersuk, Alex K. Shalek, Chloe K. Slichter, Hannah W. Miller, M. Juliana McElrath, Martin Prlic, Peter S. Linsley, and Raphael Gottardo. MAST: a flexible statistical framework for assessing transcriptional changes and characterizing heterogeneity in single-cell RNA sequencing data. Genome Biology, 16(1):278, dec 2015. Rina Foygel and Mathias Drton. Exact block-wise optimization in group lasso and sparse group lasso for linear regression. Arxiv preprint arXiv:1010.3320, pages 1–19, 2010. Socorro Gama-Castro, Heladia Salgado, Martin Peralta-Gil, Alberto Santos-Zavaleta, Luis Muniz-Rascado, Hilda Solano-Lira, Ver´onicaJimenez-Jacinto, Verena Weiss, Jair S. Garc´ıa-Sotelo, Alejandra L´opez-Fuentes, Liliana Porr´on-Sotelo, Shirley Alquicira- Hern´andez, Alejandra Medina-Rivera, Irma Mart´ınez-Flores, Kevin Alquicira- Hern´andez,Ruth Mart´ınez-Adame, C´esarBonavides-Mart´ınez, Juan Miranda-R´ıos, Araceli M. Huerta, Alfredo Mendoza-Vargas, Leonardo Collado-Torres, Blanca Taboada, Leticia Vega-Alvarado, Maricela Olvera, Leticia Olvera, Ricardo Grande, En- rique Morett, and Julio Collado-Vides. RegulonDB version 7.0: Transcriptional regula- tion of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic Acids Research, 39(SUPPL. 1), 2011. Natascha Hermann-Kleiter and Gottfried Baier. NFAT pulls the strings during CD4+ T helper cell effector functions, 2010. Kevin a Janes, Chun-Chao Wang, Karin J Holmberg, Kristin Cabral, and Joan S Brugge. Identifying single-cell molecular programs by stochastic profiling. Nature Methods, 7 (4):311–317, 2010. Robert J Johnston, Amanda C Poholek, Daniel DiToro, Isharat Yusuf, Danelle Eto, Bur- ton Barnett, Alexander L Dent, Joe Craft, and Shane Crotty. Bcl6 and Blimp-1 are reciprocal and antagonistic regulators of T follicular helper cell differentiation. Science, 325, 2009. Jong Kyoung Kim and John C Marioni. Inferring the kinetics of stochastic gene expression from single-cell RNA-sequencing data. Genome Biology, 14(1), 2013. Steffen Lauritzen. Graphical Models. Oxford University Press, Oxford, 1st edition, 1996. Jason D Lee and Trevor J Hastie. Structure Learning of Mixed Graphical Models. In AISTATS 16, volume 31, pages 388–396, Scottsdale, AZ, USA, 2013. Yupeng Li, Stephanie A. Pearl, and Scott A. Jackson. Gene networks in plant biology: Approaches in reconstruction and analysis. Trends in Plant Science, 20(10):664–675, 2015. Lin Lin, Greg Finak, Kevin Ushey, Chetan Seshadri, Thomas R Hawn, Nicole Frahm, Thomas J Scriba, Hassan Mahomed, Willem Hanekom, Pierre-Alexandre Bart, Giuseppe Pantaleo, Georgia D Tomaras, Supachai Rerks-Ngarm, Jaranit Kaewkung- wal, Sorachai Nitayaphan, Punnee Pitisuttithum, Nelson L Michael, Jerome H Kim, L Robb, Robert J O’Connell, Nicos Karasavvas, Peter Gilbert, Stephen C De Rosa, M Juliana McElrath, and Raphael Gottardo. COMPASS identifies T-cell subsets correlated with clinical outcomes. Nature Biotechnology, 33(6):610–616, 2015. SINGLE CELL EXPRESSION GRAPHICAL MODELS 25

Han Liu, John Lafferty, and Larry Wasserman. The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs. Journal of Machine Learning Research, 10(10):2295–2328, 2009. Cindy S. Ma, Elissa K. Deenick, Marcel Batten, and Stuart G. Tangye. The origins, function, and regulation of T follicular helper cells. Journal of Experimental Medicine, 209(7):1241–1253, 2012. Adam A Margolin, Ilya Nemenman, Katia Basso, Chris Wiggins, Gustavo Stolovitzky, Riccardo Dalla Favera, and Andrea Califano. ARACNE: an algorithm for the recon- struction of gene regulatory networks in a mammalian cellular context. BMC Bioinfor- matics, 7(Suppl 1):S7, mar 2006. Georgi K Marinov, Brian A Williams, Ken McCue, Gary P Schroth, Jason Gertz, Richard M Myers, and Barbara J Wold. From single-cell to cell-pool transcriptomes: stochasticity in gene expression and RNA splicing. Genome Research, 24(3):496–510, mar 2014. Florian Markowetz and Rainer Spang. Inferring cellular networks: a review. BMC Bioin- formatics, 8(6), 2007. Andrew McDavid, Greg Finak, Pratip K Chattopadyay, Maria Dominguez, Laurie Lam- oreaux, Steven S Ma, Mario Roederer, and Raphael Gottardo. Data exploration, quality control and testing in single-cell qPCR-based gene expression experiments. Bioinfor- matics, 29(4):461–467, 2013. Nicolai Meinshausen and Peter B¨uhlmann.High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3):1436–1462, 2006. Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Opti- mization, 1(3):123–231, 2014. Lan V. Pham, Archito T. Tamayo, Linda C. Yoshimura, Yen Chiu Lin-Lee, and Richard J. Ford. Constitutive NF-kappaB and NFAT activation in aggressive B-cell lymphomas synergistically activates the CD154 gene and maintains lymphoma cell survival. Blood, 106(12):3940–3947, 2005. Melissa L. Precopio, Michael R. Betts, Janie Parrino, David A. Price, Emma Gostick, David R. Ambrozak, Tedi E. Asher, Daniel C. Douek, Alexandre Harari, Giuseppe Pantaleo, Robert Bailer, Barney S. Graham, Mario Roederer, and Richard A. Koup. Immunization with vaccinia virus induces polyfunctional and phenotypically distinctive CD8(+) T cell responses. Journal of Experimental Medicine, 204(6):1405–1416, 2007. C. Radhakirshna Rao. Linear Statistical Inference and its Applications. John Wiley & Sons, Ltd., 2nd edition, 1973. Pradeep Ravikumar, Martin J. Wainwright, and John D. Lafferty. High-dimensional Ising model selection using `1-regularized logistic regression. The Annals of Statistics, 38(3): 1287–1319, jun 2010. Thomas Schaffter, Daniel Marbach, and Dario Floreano. GeneNetWeaver: In silico bench- mark generation and performance profiling of network inference methods. Bioinformat- ics, 27(16):2263–2270, 2011. Rajen D. Shah and Richard J. Samworth. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1):55–80, jan 2013. Alex K. Shalek, Rahul Satija, Joe Shuga, John J. Trombetta, Dave Gennert, Diana Lu, Peilin Chen, Rona S. Gertner, Jellert T. Gaublomme, Nir Yosef, Schraga Schwartz, Brian Fowler, Suzanne Weaver, Jing Wang, Xiaohui Wang, Ruihua Ding, Raktima Raychowdhury, Nir Friedman, Nir Hacohen, Hongkun Park, Andrew P. May, and Aviv Regev. Single-cell RNA-seq reveals dynamic paracrine control of cellular variation. Nature, 510(7505):263–269, 2014. 26 MCDAVID ET AL.

Noah Simon and Robert Tibshirani. Standardization and the group lasso penalty. Statis- tica Sinica, 22(3):983–1001, 2012. Wesley Tansey, Oscar Hernan Madrid Padilla, Arun Sai Suggala, and Pradeep Ravikumar. Vector-space Markov random fields via exponential families. Proceedings of The 32nd International Conference on Machine Learning, 37:684–692, 2015. The Gene Ontology Consortium. Gene Ontology Consortium: going forward. Nucleic Acids Research, 43(D1):D1049–D1056, 2015. Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J. Tibshirani. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 74 (2):245–266, 2012. Eunho Yang, Y Baker, P Ravikumar, G Allen, and Z Liu. Mixed graphical models via exponential families. In AISTATS 17, volume 33, Reykjavik, Iceland, 2014.

E-mail: andrew [email protected] E-mail: [email protected]

E-mail: [email protected] E-mail: [email protected] SINGLE CELL EXPRESSION GRAPHICAL MODELS 27

0.075

0.050 BcL6

0.025

0.000

● ● ● ● ● ● ●●● ● ● ●●●●●●●● ● 20 ● ●● ●●● ●●● ● ● ●●●●●●●●●●●●●●●●● ● ●●●●●●●●● ●● ● ●● ●●●●●●●●●●●●●● ● ● ● ● ● ● ●●●●● ● ● ● ●●●●●●●●● ● ● ●●● ●●●●●●● ● ● ● ● ●●●●●●●●●● ● ● ● ● ●●● ●●● ● ● ● ● ● ● ● ●●●● ● ● ● ● ●● ●● 15 ● ● ● ● ● ● ●

10 CXCR5 5

0 ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●● ● ●● ● ● ● ● 20 ● ● ●● ● ●● ● ●● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ●●●● ●● ● ●●●●●●●●●● ●●● ●●●●● ● ●●● ● ● ●●●●●●● ● ● ● ● ● ● ● ● ●● ●● ●●●●● ● ● ●●●●●●●● ● ● ●● ● ● ● ●●● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● 15 ● ● ● ● ● ● ● ●

10 PDCD1 5

0 ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● ● ● ● ●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●● 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 BcL6 CXCR5 PDCD1

Fig 2: Scatter plots of inverse cycle threshold (40-Ct) measurements y from a quantitative PCR (qPCR)-based single cell gene expression experiment (lower panels). The cycle threshold (Ct) is the PCR cycle at which a pre- defined fluorescence threshold is crossed, so a larger inverse cycle threshold corresponds to greater log-expression [McDavid et al., 2013]. Measurements that failed to cross the threshold after 40 cycles are coded as 0. The upper panels show mosaic plots of each pair of contigency tables that can be formed from the indicator functions [vy]i = I yi=0 . On the lower panels, the linear regression on positive pairs of observations{ 6 } is indicated in blue, while the conditional mean values are indicated in red. 28 MCDAVID ET AL.

256

16 Time(s)

1

32 128 512 Number of nodes

Aracne Gaussian NPN Hurdle Hurdle Logistic (Anisometric) (Isometric)

Fig 3: Average timings for graph estimation algorithms as a function of the number of nodes SINGLE CELL EXPRESSION GRAPHICAL MODELS 29

A FDR oracle tuning: dimensional scaling B Sample size complete G-sparse ecoli 1.00 0.3

0.75 0.75

0.2 # Cells 0.50 0.50 1 10 Sensitivity 0.1 0.25 0.25

0.00 0.00 0.0 25 50 75 100 125 25 50 75 100 125 0 500 1000 1500 2000 2500 M N C Consistency and BIC tuning complete complete, selection complete, t 1.00 234 4343 3 4 3 3 33 2 34 3 4 3 444 1 22 3 22 3 234 2 2 2 1 0.75 1 1 1 2 2 2 1 2 0.50 3 1 1 1 1 0.25 1 1 2 1 1 1 0.00

ecoli ecoli, selection G-sparse 1.00 2 22234 331 4

0.75 3 1 1 3 1 0.50 3 1 12 234 3 1 34 3 2 1 Sensitivity 3 0.25 2 2 2 1 3 2 1 3 3 2 21 21 0.00 123 1 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 G-sparse, t 23434 234 33 4 4 1.00 2 1 Method

0.75 2 a Aracne 2 a Hurdle (Anisometric) 0.50 1 a Gaussian 1 11 0.25 a Logistic 0.00 a NPN 0.00 0.25 0.50 0.75 1.00 FDR a Hurdle (Isometric)

Fig 4: Dimensional (a) and sample size scaling (b) of six different network in- ference algorithms applied to simulated data under oracle FDR tuning. Data are generated from the multivariate hurdle model (6) under chain graphs (a) and e. coli graph (b). Panel (c) shows network selection consistency of vari- ous methods using the Bayesian Information Criterion under various models described in table 1. The paths trace out the changes in FDR and sensitivity as the sample size increases geometrically from 100 (1) to 12,500 (4). 30 MCDAVID ET AL.

Sensitivity vs FDR

complete, n=100 complete, n=500 ecoli, n=100 1.00 1.00 0.75 0.75 0.10 0.50 0.50 0.05 0.25 0.25 0.00 0.00 0.00 ecoli, n=500 ecoli, n=2500 G-sparse, n=100 1.00 0.3 0.2 0.75 0.2 0.50 0.1 0.1 0.25 Sensitivity 0.0 0.0 0.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 G-sparse, n=500 1.00 0.75 Aracne Gaussian NPN 0.50 Method Hurdle Logistic Hurdle 0.25 (Anisometric) (Isometric) 0.00 0.00 0.25 0.50 0.75 1.00 False Discovery Rate

Fig 5: Sensitivity vs. FDR for solution paths from methods and scenarios described in Table 1. The symbol indicates the tuning selected by BIC. ⊕ SINGLE CELL EXPRESSION GRAPHICAL MODELS 31

Hurdle Logistic

MAPK1● MAPK1● CXCL13● B3GAT1● CXCL13● B3GAT1●

● ● GZMA ITGA4● GZMA ITGA4● JAK1● JAK1●

IL2Ra● FoxO1● IL2Ra● FoxO1● CD28● CD28●

IL12Rb2● IL12Rb2● ● ● ● GATA3 GATA3 ● ● ● JAK3 Blimp1● CD195 JAK3 Blimp1● CD195

IL12Rb1● IL12Rb1● CTLA4● CTLA4● ● ● CD38 ● CD38 ● IFNg RoRgT● IFNg RoRgT●

● ● FoxP3 CCR7● FoxP3 CCR7●

● ● STAT5B ● STAT5B ● BATF ● ● Ptger4● TNFRSF18 Ptger4● BATFTNFRSF18 Fyn● FoxO3● Fyn● FoxO3●

● ● ANP32B● BCL2L1 ANP32B● BCL2L1

● ● CCL5● CD95● IL21R● CXCR4 CCL5● CD95● IL21R● CXCR4 ICOS● ICOS●

CCR4● BCL2● CCR4● BCL2● CXCR3● CXCR3● ● ● TNFSF4● CCR3 TNFSF4● CCR3 IRF4● IRF4●

● ● ● ● IL7R● MAPK3 TNFRSF4 IL7R● MAPK3 TNFRSF4 BTLA● IL2Rb● BTLA● IL2Rb● Tbet● Tbet●

● ● ● TGFb1 ● TGFb1 ● CD27 ● CD27 ● ● SELL ITGB2● TNFSF13B SELL ITGB2● TNFSF13B

● ● IL21 ● IL21 ● Lck● TNFa Lck● TNFa CASP1● CASP1●

● ● ● ● ● BcL6 CD4● NFATC1 BcL6 CD4 NFATC1 CD154● CD154● ITGAL● ITGAL● CD3E● CD3E● IL2Rg● IL2Rg●

ITGB1● ITGB1●

Gaussian Gaussian10

MAPK1● MAPK1● CXCL13● B3GAT1● CXCL13● B3GAT1●

● ● GZMA ITGA4● GZMA ITGA4● JAK1● JAK1●

IL2Ra● FoxO1● IL2Ra● FoxO1● CD28● CD28●

IL12Rb2● IL12Rb2● GATA3● GATA3● ● ● ● ● JAK3 Blimp1● CD195 JAK3 Blimp1● CD195

IL12Rb1● IL12Rb1● CTLA4● CTLA4● ● ● CD38 ● CD38 ● IFNg RoRgT● IFNg RoRgT●

● ● FoxP3 CCR7● FoxP3 CCR7●

● ● STAT5B ● STAT5B ● ● ● BATF ● Ptger4● BATF Ptger4 TNFRSF18 TNFRSF18 ● Fyn● FoxO3● Fyn● FoxO3

ANP32B● BCL2L1● ANP32B● BCL2L1●

● ● CCL5● CD95● IL21R● CXCR4 CCL5● CD95● IL21R● CXCR4 ICOS● ICOS●

CCR4● BCL2● CCR4● BCL2● CXCR3● CXCR3●

● ● TNFSF4● CCR3 TNFSF4● CCR3 IRF4● IRF4●

● ● IL7R● MAPK3● TNFRSF4 IL7R● MAPK3● TNFRSF4 BTLA● IL2Rb● BTLA● IL2Rb● Tbet● Tbet●

● ● ● TGFb1 ● TGFb1 ● CD27 ● CD27 ● ● SELL ITGB2● TNFSF13B SELL ITGB2● TNFSF13B

● ● IL21 ● IL21 ● Lck● TNFa Lck● TNFa CASP1● CASP1●

● ● ● ● ● BcL6 CD4 NFATC1 CD4● NFATC1 BcL6 CD154● CD154● ITGAL● ITGAL● CD3E● CD3E● IL2Rg● IL2Rg● ITGB1● ITGB1●

Fig 6: Networks of 22 edges estimated through neighborhood selection un- der the Aracne, Hurdle, logistic, Gaussian model (single cells) and Gaussian model (10 cell aggregates) in T follicular helper cells. Brown hues indicate estimated negative dependences, while blue-green hues indicate positive de- pendences. The edge width and saturation are larger for stronger estimated dependences. 32 MCDAVID ET AL.

GAS5● COPS5 ●IK LAMP2● WDR61● ARF3● ● 2410001C21RIKBNIP2 HN1 EIF5A● HNRNPABSRP19 ● ●CDK2AP2●PECI ● ● ● ● NCKAP1LCAPZA1AK135744AK139528● SRP54B● ●● ● 2900064A13RIK● NUCKS1 MED8 SNRNP27 GOLGA7● ● STX4A● GDI2 ● UBE2D3 ●ILF2 ● ● PSMA4 ECH1 ● TMEM49 SUMO1● ● AP3S1 ● RAB11A CEPT1 ERGIC2 ● ● G6PDX● ●YWHAH ● ● ENO1● GPI1 ● YWHAE ●PDLIM5 EIF3LPGK1 ● TNIP3 ● ● ● UQCRC1 ● GOT1 LDHA ● ●IFI30 ● CAR4 AK019262● PLIN2● PSME4 TCF25 ● ● EMB● ● ● DDX39DERL2GUSB PYCARDBHLHE40ACSL4 RILPL2CXCL1 CD14 TNFSF15●●PON3● PDCD6 MALAT1 ● ● ● 1810029B16RIK● ● ● ●MRPL30● ● ● GNS ●AI662270 ● AK212965 ● ● NHP2 FCGR2BRAB31● ● SLAMF9● ●●GUK1 RTN4●SRSF12 WSB1● NDUFS7SEC24A● CCNL1● ● RENBPCKAP2LCLK1● NAA20 ●●UBE2H 1600029D21RIK AHCYL1 ●APBB1IP● ●MIF ● TNFAIP8 PION● ● ● ●AKIRIN1 ● RWDD1SPP1 ● PDIA4●ADAM9 PSME2 ● ● ● PTPN6 ● PPP1R12A● 2810422J05RIK ●OAT EIF3D BCLAF1● TBXAS1 ● 2310036O22RIK● ● NDUFV2● EAPP● SH3KBP1● LGMN ● WTAP RAB32 GTF2BH2−T10 NARS● ●NDUFB10 SMPDL3A ● ACOT9 ● VPS4B ● ● ● ● UGP2 ● EFHD2 MBNL2NUDT9 ● ● ● DNAJA1 ● DPYSL2●●● MARCKSL1 GADD45B MTSS1 ●2610001J05RIK ● PPIH ● ATP6V0D2 ● MPV17 HBXIP● KLHDC4 TECR AK202494●IDH1 P2RY14● OBFC2A CTSH2410042D21RIK ●UBE2M ● AP3D1 DCAF13 PPT1 STXBP2●● ● GPNMB● ● PLA2G7 ● ● ● ● RNASE6 ● ●2310004N24RIK●PTPN12● ● ● ● 2400001E08RIK ●RBM7PI4KA ● ● IL10RB AKAP13AK217941 CNIHSEC13 EEF1B2 ●● S100A1 PSMD4 ● RAB14 ● ● RBM39 ●● ● ● FIS1 CASP7 ● ● ATP5C1● TPT1 PEBP1● TRA2B IFI27L2A ● ● SUCLG1 FBXO33GRN ● ● ●SRSF3ATF3 ● ● AIP EIF4E2PEA15A● ● ● ● TRP53 ●● FAM120AATP6AP1● ● ● CFDP1 SRSF5STX3● ZC3H12C MORF4L1● UBA1PDIA6ITGAMPOLR2E ANXA7CTNNA1 4933434E20RIK● CDC42EP2 ● SGK1 ARL6IP1● WDR33SOD1DPH5● ●●● ● ● NUPR1 TM7SF4 ● ● H47● ● ●● ● CLN3 CXCL2GRB2● ● ● CSNK1A1DNAJB11 2510006D16RIK● ADAM8PMP22● ● ●VPS24IL1F9 LTA4H ● ●● TTC35 PAPOLA ASAP1●● PSMC3 MNDALCSDE1● ●PLAUR● ● CEBPB LAT2● ● ● LMO4 ● ● ● ● IDH3B ● ● ● RAD21 CETN3DPEP2● ● POLR2K AK177731 TBCB SMOX● SEMA4A ● ● ● SCPEP1● 4930412F15RIK TUBA1C VDAC2 ● ● SYF2ATF4● ● CAST ● ● ● BUD31● DDX3X ● RSU1● VDAC1 ● ●GM6377 ARHGDIA ● ● ● MKIAA0694 TAX1BP3 CD36 ● ●NDUFA9M6PR● CHMP2B● NMT1 CCL22● ● SC4MOL ● ● UNC13D● ATP6V1E1●ZYX LRRC33● ● ● FBL SFT2D1COPE1 SLC25A3 ● ● RNF19B SERF2 SAR1A ATP6V1B2 ● ● ● CCDC28BCASC4 ● NEAT1 MAGT1● 4930506M07RIK●TERF1 ● ● PKM2● MRPS18C ● ● TMEM165 PSMA7● PLEKHO2CHI3L3 ● RPL22L1 PAFAH1B1●LSM12MPP6● ● SARS UBL3 CSNK2BGSTM1● ● ●H3F3A ●● LRRFIP1DENRCCL4 ●MORF4L2H2−M2 ● ● GM15645 ● NFKBIE ●AK185681 ●AIMP1● PNP ● ●2210403K04RIK● ● ● ● TAGLN2 ● ● ● ● ●FAM133BH2−DMB1 BAZ1A MDH20610007L01RIK● BLVRA ● H2−T23 ● ●CLAST2DYNC1LI1 ● CD164PPP4R2 ● CNN2● ● SAMHD1 ●1110002B05RIK PIGS PSENENAY096003 ACTN1● ●● ETV3 IL1A ● ● FKBP1A ● XBP1 ANXA3● ● ● CFP ● PFKPPRDX6 ● NDUFB9● CD200R4 PEPDIDI1 TUBA1B● ● ● RNPS1 ● ● ● ● TGIF1● ● ● ● ● ●TNF● CCRL2● PCBP1 ● ETFDH GNG10 SIDT2 CCL3 ● EIF3M NAPSA ● DRP1 PSMC2 ●IRF5 SEC11C ● CORO1A ● ARPC1BPSMD11 ● TLR2● PPP3R1● ARGLU1 ● ● IL12B CD74● FABP5 ● ZEB2EIF3A● ● TSPAN31IDH3G●● RNH1 ● ● ●SETD8● MGL2● IRF1● PARP14GPR137B−PSPPP1CA ● ● ● FRG1 SEL1L PRPF38AUFC1● ● TANK ● SWAP70● ● ● ●HMGB1 FAM96A● ● ● ● ● CLEC12A ● CLEC4ECCNG2 ●CAT ● NCF2 SPOP QDPR 9030625A04RIK● CCT2 ●● IL1R2 ● ENSA● TMEM39A●● TPM4● ● ●CASP11● SMC4 ● TUBB5● FKBP2● ● IRG1TIMM17A ● IL11RA1 MAPKSP1 COPZ1 SAMSN1JAK2● ●● CCT3● NDUFB6 ●CD40● BC028528 ECHS1● POLR1D● UNC93B1 NUCB2 ● H2−AB1 RNASET2B 6330577E15RIKIFITM2 ● CXCL3 ● ●EIF1AX ● ● H2−AA ● ● ● IL6IFI203 NFKB1 ● CAPZA2 ● TSPAN3● NAGA ●HNRNPH1 ● CCDC109BSQSTM1CCL17 CCT5●● ●HMGB2CORO1B RAP1GDS1 ● ● MX1 CLIC4 FN1 ● ●● SDHC ● ●● SLA CD274 2810474O19RIK● CYP51 ● ●DAGLBCAPZB●CD200R1MCFD2● STMN1 SGTA●CNBP SKAP2 ●PIK3R5 ● ● H1F0 ●LY86 CD84 ● ● ● ● ● ● ● ● ● BIRC3 LASS2HSD17B11● ●PLEKHB2 ●G530011O06RIK● BCL2L11 ODC1 PDCD6IP HERC64833439L19RIK CYB5R4 ● ● ● TPM3● ● FAM20CRARS ● ●FAM102B● ● GMFG ● TPR ACO2 RABGEF1CKS1B H2−EB1● CCDC21● CD69● REXO2 ●HVCN1 CD300LF● ● ELOVL1 ● ● ●STAT2● ● ● ● RBPJ CSF1R● SLC15A3● NONO ● ● ●COX7A2LTRPS1 TFEC●GM2382ARHGAP31●UBE2J2PLP2 ● ● B3GNT2 CAPNS1 ISG15DSCR3●● ● ● ●● CD24A CTSCPSMD8● EIF2S1CAV1 ● CCT7 ● ●●EHD4 ● ● ● ● ● PSMD1● ERP44 ● ●DAZAP2 CD53 CD47 TRPV2 CBX3 YBX1HMGN2 SPCS3 HK2 ● CISH BCL10● KCNN4● ● ●RNASET2A● UAP1L1 ● ● SLC38A2● PTP4A2● EIF2A ● ● CLPTM1L● DEK SDC4● ●NDUFA4 ●OXCT1 DCTN3IER3IP1 ● ● ● CSF2RA ARL8B●DRAM2 SLC2A6● ● ●● SIN3B ● ENSMUSG00000073777● ● ● INHBA YKT6 MRPL52SUMO2MBNL1 FAM195B 2700060E02RIK● PSMB2 ● EDF1 ● ● TPD52● ●●● ● ● ● ● B630005N14RIK● CHURC1 ● ALAS1 ZFAND6 MDH1 BAG1 IFRD1 ● CD81● ● ● MRFAP1RNF114 CD83 ● PGD ● GLTP ●●ISCU ● SARNP●● ● ● CCT6A ● ● COMMD3 PSMA1 MGEA5● ● SP140 ● STX7SLBP● ● PCNA CCNG1 AK194371 NIPSNAP3BC1QB1110004F10RIK● ● DEGS1● ● ● PCNP ● ●TRF ●SEPX1 ● ● ● CCR2● SH3GLB1 ● EIF3H APRT ● ●GSR RAB8BTREM2● SMARCA5●2900097C17RIKCTSL ● ● ● GNAI3●GM14005● ●● RAN ATP6V1H SLIRP● TRIM30A COPS6 ● MKIAA0650SBDS● FLRT3● TMEM50A● DNAJA2 ● ● ●EAR11SMAP2● ● 1110008F13RIK●TMED10 MKLN1 ● ● KIDINS220NECAP2 SNX5●● ● ●● ABHD4●GSTP1 GHITM SET SLC38A6 LYNIL2RG ● ● ● NT5C3● CCL2● ● HERPUD1 ● ● ● SBNO1 HSD17B12● BCAP31 ● GLUD1● LEPROT● NR3C1 SURF4 ● ●TCEA1● PRDX2 PFDN5● ● ● IFI204● DAD1 ADCY7● HEXB ETFA SNRPF ITCH● ● ITGAX ● ● ● ● 2400003C14RIK ● TIMM23 MED13L●● FCRLS ● VCP VPS26A TAF9● ● HDAC1● ● TRAM1 ● TNFSF9 S100A13 HDGF● ● ● ● HNRNPC YTHDC1● EPRS● CKS2● ● chemokinesis RASSF4 NCK1● ●ST13● ERGIC3● GAK● LPS weakly LPS ● PSMC1EEA1 ● ●CWC15● RNA MAPKAPK3 FAM32ACOPG CMTM6● ● ● ● activation PAM Uncategorized

Fig 7: Core Gaussian model networks in LPS-treated mouse dendritic cells. Hub genes are shown in red. Vertex colors indicate gene ontology member- ship. Disconnected subgraphs with two vertices are suppressed. SINGLE CELL EXPRESSION GRAPHICAL MODELS 33

ATP5A1 SSR4●●PPIB ●● ●PF4● GBP3●● UBL3●●● SNRPB FDPS● ● LIPA●LPL ● IFT20 ● ● MGEA5CAPZA2 ● ● RQCD1 ● FYB ●● ● WWP2ECE1 CCL22● ●PTPN2● ACAT1 ● GRN● ● ● ● ● ● ● ● ACTL6A ●CPNE2● ● CNN2 ● ● ● ● ●● ● TAB2 RDX LZTFL1 ● 4930412F15RIKAK217941● ●PLD4●BATF ●●●● SDHBCAR4 ● ● HERC6 RPL37● ● ● ● ● CSF1● COQ7 ● ● ● ARHGDIB RENBP ● UGDH ●GORASP2 ● TPT1 ● ●F2RL2 ● ● ● ●DNASE1L1 ● H3F3A● ● PPT1 ● ● CNOT2● ● RPL26P2RY14 ● ● TRAF1● NCEH1ANPEP ● ● ● ● NDUFB9NUDT9 ● CYLD ● ADSS● SPNA2CHMP1B●● PSMA7● ● SSBP1● ● ● ANKRD17● HNRNPAB● S100A4 ● ●CD38● ●● ● STAU1 ● ● RFFL● ● ● ● TUBA1C CD69 ● CD14● SLFN8 ●● CMPK1 ● ● ● ● ● ● ● ● PGLS ● TUSC2●MIFCXCL10 XAF1● CDC42 ● CASP8 GNB2● ETNK1 ● ● ● PYHIN1 ● CYTB ● ● ● ● ●● FTL1 ●● ● ●AK141672 ● PBRM1 ●● ● ● ●PFDN2 ATPASE6● ● EIF2A ● ● ●● SYAP1 ● ●CCT7 PDGFBVDAC3 RNF114 ● ●ID3SDC4● NDUFB10● AK140265 ● COX2● ● ● CSE1L● ●● ●● ● ● RAB11A● ● ● ● ● ●ILF2 ITGAL● ● GU332589● ●LGMN●● MT−ND4 ● ● ● ● RTN4 FBXO30 IRF9 ● ● PCNA● CUL3 CD84 ● ● ●● ●● ● LDHA HMOX1 ● ●● ●FAM20C PILRA ● VCPIP1 ● ●● ● EIF4E TET2 HNRNPM ● ● ● ● ● KCNN4 ● CYC1●●GM6377● ● ● RAB22A● ● ● ● IL10RB HNRNPCETFA PKM2 ● DIP2B ●● ● ●● ●ENO1 ● ● ● ● ● ● ● ● ●FN1 ● ● UBE2D3 ● ● PRDX2 ●● ● ● PRDX3 ● ● FAM129ARSU1● ●GGH● ● RAN● ● ● PSMA1 ●CCT3● ● ● GNG12CLASP2 ● ● BRD7 ●GSN IRGM1● ● ● ●● ST3GAL1 ●GCH1● ● ATP6V1C1 ● INHBA ● ● STMN1 ● NCKAP1L● ● ● ●PLD3 ● ● ACLY ● ●● ● ● ●● ●NANS ● CAPZA1 PRCP● ● ●ATXN3 ● ● ● ● ● ● ● ● ● P4HB● ● ● ● ● ● ACAA2EXOSC9● GLRX3● ●● ● ● CYB5R1 PSMB7● ● ● ITGB2 NLRP3● ● RPS15A● ● ● ● ● ● EDF1 MYL12B TIPARP ADH5 MAPK14 ITGAX BCL2A1ADNAJB9 PSMA4● ●● CHMP1A NAIP2 ● ●RAP1A● ● ● ● ● ●● ● ● ●EZR● RPL18RPL14●RPS18 ●● ● ● HSD17B4● TOLLIP● TUBB2C● OAS1A● VPS4B ● ● ● ● ● ● ● DUSP3●PEBP1 TUFM ● ● SERINC1 MAFF NT5C3● ● TMEM43● ●● ●● ● ●● TRIAP1 ● ● ●UBC ● N6AMT2 TPI1●GPI1● ● ● ● ●TRF SNF8 TUBB6CLN3 ● ● ● CLAST2 NDUFA4 ● ● ● ● VPS24 WDR92 ● TMEM173ATF2 ● ●● PTPRA● ● TMEM49 GAKTNFSF9 ● ● NFKBIZ● PSMA2 ● ●● ●● ● ●● PIK3R5 ● CD53LILRB4 ● SC4MOL ● ●A130040M12RIK PITPNB ●● CLOCK ● ●NFKB1 ● ● ● ● ● ● ● ● ● ● SPNB2● ● ● EIF2C3● ● TOMM70A● ● PSMB3● ● SPINT1● ● ST13VPS26A KLF3● ● BCL10 ● ● CD300A● NT5C● ●●● ● ●STAT5A PABPN1 ● PGK1 CCL4 ● GANAB PTPN6 EIF2AK2 MFGE8 ● DYNC1H1TNFSF13 ● EFR3A ATP6V1H ● ● ● ● ● PACSIN2 ● HMGB2 ● ●● TCN2 ● PSMB8● ● ● ATP6AP1NUPR1 ● PSMB1 ● SNX3 ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ANXA7 KPNB1● ● ● ● CORO1B●HSPB11● ITGA4●● ● ●HPRT● ●● ● ● ● ●TNF ARL1● ● ● ●ACTR1B ECHS1 GRB2CD36●CHI3L3 RPL8 ● H2−T23 CCL3DENR ● ● ● ● ● ● ● ● ● CKLF ● ● ● SENP2 MTDH CSK RAB9● ●9030625A04RIKPTP4A2 GM14005●● ● AGA ● ● ●BCL2A1C● ● UQCR10● ● ● ● TNFAIP8 ● ● PSMD8● ● ● ●● ● CAPZB ● ● EIF2S1● ● ●STUB1 MARCKSL1 EBI3 ● DAD1● ZUFSP YBX1 ● ● ● ● ● ●● ● ● ● NUP98● ● BFAR NUCB2 PPP1R12A● RPS20CDKN1A EIF3K ● ● ●● CALU● ●● ● RNASET2B ● ● CTSB ● ● ARF4 ● SCARB1 ● ● PDIA6ATP6V0D2CTSD ● MPP6● ● DARS● ● ● ● ● ● ● CD63● MS4A6D ● ● GADD45B ● ● ● ● TCP1 UNC13D ● ● STARD5 BHLHE40 ● ● ● ● SEPX1 ● ● GNS ●PYCARD UBA1 CCL6 ● ● GNAQ TM7SF4 COX5A● CLIC4EIF3E LXN ●ACSL4 ● ● ● RAB5A● ● ● LGALS1 ● ●PBLD1 ● BCOR● ●● ● ●● ● ● ● ANXA3 CRK LYZ2 ● ● SEPP1 ZBTB8OS ● ● ● ● ● ● ● ● ● PSMB2 ● RNASE6 STRN3 RAB2A LYZ1● ● ● ● ● ● ● ● ● ● TSPAN3 ● CORO1A EIF5A TRIM5● ORAI2 ● ● SLC2A6AB346227 VAMP7 H2−AB1● FABP5● CCR2AA467197 ● ● ● ● CCND2 NCSTN● ●● ● ● ● ● ● ● ● ● ● STX8●● ● ● ● CTSL TUBB5 ● ● USP18● ● ● ●SAMHD1● ● NUP153 ● IRAK3● ● ● UBE2MIFITM2 TUBA1B● H2−AAJAK2● ● ROBLD3● GOLGA7 ● SLFN5 ● ● SMC4● H2−EB1● ● ● TRIM36 ● ● SMC2 CD74● NAGA● C1QB ECH1 ● ● PSMD1●TMEM205 HMGB1● ● ● ● CKB ● C1QC● ●DHX15 HDAC1 FIBP ● ● ● ● ●TMEM39A MGAT2 ● ● ● TMF1● ● PPP1CA IL1A● MGL2● ● IL12B● H2−Q7● ● ● ATP5K● ● SLC31A1 ● ● CD40 ● KLK1B11 ● PHB ● REEP5 ● ● TANK ● ● DICER1 ● ●TXNDC17 ● ● ● ● ATP6V1B2 ● ● TSPOCST3● ● JAG1 ● ● LDHB ● VPS29 ● ● ● ● ● ●CCT5 ●● ● RNH1 ● MX1● RRAS CASP1 ● ● ● NEDD8 ● ● H1F0● ●●PRDX6 ● PDCD6 ●● ● STAMBP●FBL ● ● ● ● ● SRP14● ● CSF1R● ● PSMA3 STAT2 ● TGFBI SND1 USO1 ● ● ● ● ● ● ● ● ● ● ● ● ●MT2 SUCLG1 ● ● PSMC6 ● CCL17 MT1 ● ● ● ANXA4 ●ATP5C1 ● ITM2B● ● ● DPYSL2 ● FKBP2EAR11 ● ● GOT1● CXCL3 ● ● CSNK2B ● ● ● PSMD12● ● ● ● GALNT7 ● ● ● SRPR● ● ● SLC2A1 COX5B● ●● SC5D●● YWHAQ● ● ●COMMD1 ● ● ● BLOC1S1● ●PSMD10 BPNT1 ● ● ●EIF2S3X ● ● ● ● ● H2−M2 ● ● RNF19BGLB1 RHOQ ● ● ● ● ● ●TNIP1 TREM2 ● ● ACP1●● ● C920009B18RIK TOR1AIP1DCTN2 ● SURF4● ● PLEC GRK4● ● ● ● ● chemokinesis ● ● ● IL6TXNDC5● ● ENOPH1 ● ● PARK7● ● CCL2● LPS ●VPS35● FAM110B● ● PDIA3 ● ● ● ● ●● RNA ● ● TXN1● COPS6 ● TNFRSF1A● SGPL1 ●● endosome ●● ● ● activation PAM Uncategorized

Fig 8: Core Hurdle model networks estimated in LPS-treated mouse den- dritic cells. Hub genes are shown in red. Vertex colors indicate gene ontology membership. Disconnected subgraphs with two vertices are suppressed. 34 MCDAVID ET AL.

A B

C D

Fig 9: Overview of geneset edge enrichment analysis. 1. Vertices A and C belong to the blue category, while vertex B belongs to the red category. Vertex D belongs to neither. 2. There is nij = 1 blue-red intra-connection, while mij = 2 are possible given the 4 edges. 3. The enrichment statistic is the hypergeometric tail probability tij = 2 4 (2)(2) P (N = 2; 4, 2, 6) = 6 = .4 (4) 4. The significance of the blue-red enrichment statistic would be ascer- tained by sampling from the null Erdos-Renyi model over all possible pairs of categories. SINGLE CELL EXPRESSION GRAPHICAL MODELS 35

response to organic cyclic compound ●

cytosol ●

positive regulation of transcription macrophage chemotaxis from RNA polymerase II promoter in ● response to hypoxia ● nucleus positive regulation of cellular ● extravasation ● positive regulation of leukocyte cytoplasm migration ● ● positive regulation of T cell activation ● plasma membrane peptidase activator activity ● ● extracellular exosomepositive regulation of macrophagenegative regulation of protein ● positive regulation of ●chemotaxis ● cytokine productionlocalization to cell surface mitochondrion integrin activation● ● ● ● ● positive regulation of thymocyte endodermal cell differentiation apoptotic process mercury ion binding positive regulation of peptidyl−tyrosinereplication fork peptide cross−linking● cellular● response to interleukin−1 phosphorylationpositive regulation● of transcription ● ● ● ● from RNA polymerasenegative II promoter regulation in of Wnt signaling ● response to pathwaystress poly(A) RNA binding wound healing glial cell migration ● ● ●negative regulation of myoblast ● ● differentiation ● negative regulation of intrinsic apoptotic signaling pathway in response ● thrombospondin receptor activity cell−substrate junction assembly ● to DNA damage by p53 class mediator leukocyte chemotaxis ● positive regulation of chemokine● (C−X−C ● protein complex assembly extracellular space ●positive regulation of fibroblast motif) ligand 2 production cell activation proliferation ● ● negative regulation● of T cell ● ● antigen processing anddifferentiation presentation of positive regulation of osteoclast positive regulationexternal of sidepathway− ofexogenous plasma membrane peptide antigen● via MHC class differentiation restricted SMAD protein phosphorylation● II ● negative regulation of mature B cell ● positive regulation● of T cell regulation of cell shape apoptotic process response to salt stress differentiation ● ● ● ● positive regulation of gene expression ●

heme binding ●

chemokinesis LPS weakly LPS potassium channel regulator activity RNA endosome ● activation PAM Uncategorized

Fig 10: Modules enriched at FDR 10% using graphical geneset edge en- ≤ richment in mouse dendritic cells under Gaussian model. 36 MCDAVID ET AL.

DNA damage response, by p53 class mediator positive regulation of interleukin−8 biosynthetic process ● phagocytosis ● ● androgen metabolic process positive regulation of inflammatory response ● ● hydrolase activity, acting on glycosyl trans−Golgi network transport vesiclebonds heterotypic cell−cell adhesion detoxification of copper ion ● ● ● ● defense response to Gram−negative estrogen metabolic process bacterium negative regulation of viral entry into nitric oxide mediated signal ● host cell ● transduction secretory granule Golgi cis cisterna ● ● ● defense response to Gram−positive ● bacterium cellular zinc ion homeostasis rough endoplasmic reticulum lumen ● ● ● regulation of cell adhesion lung development Golgi stack ● microvillus ● ● ●

endoplasmic reticulum lumen detection of lipopolysaccharide cytoplasmic BAF−type complex ● tubulin complex ● ● ● nuclear inclusion body ● ●

endoplasmic reticulum−Golgi intermediate mitotic condensation Golgi organization compartment cytosolic large ribosomal subunit ● response to tumor necrosis factor ● ● ● double−stranded RNA binding ● ● positive regulation of programmed cell Leydig cell differentiation death cellular response to interleukin−4 ● positive regulation of protein ● localization to plasma membrane ● ● prostaglandin biosynthetic process extracellular exosome positive regulation of interleukin−2 ● biosynthetic process ● negative regulation of DNA damage ● response, signal transduction by p53 immunoglobulin mediated immune response cellular response to oxidative stress chemokine activity class mediator ● ● ● ● negative regulation of myoblast differentiation negative regulation of maturepositive B cell regulation of cytokine apoptotic process secretion syntaxin binding ● ● ● positive regulation of osteoclast ● differentiation ● negative regulation of intrinsic negative regulation of intracellular activation of innate immune response apoptotic signaling pathway in response estrogen receptor signaling pathway to DNA damage by p53 class mediator RNA polymerase II activating ● natural killer cell differentiation ● transcription factor binding positive regulation of chemokine (C−X−C● ● ● motif) ligand 2 production ● chemokinesis tRNA splicing, via endonucleolytic positive regulation of phosphorylation LPS ● cleavage and ligation ● RNA ● endosome ● activation PAM Uncategorized

Fig 11: Modules enriched at FDR 10% using graphical geneset edge en- ≤ richment in mouse dendritic cells under Hurdle model. SINGLE CELL EXPRESSION GRAPHICAL MODELS 37

Gaussian Gaussian(raw)

60000 60000

40000 40000

20000 20000

0 0

Hurdle Logistic (Anisometric) 30000 100000 BIC Score 75000 20000 50000 10000 25000

0 0 0 0 5000 10000150002000025000 5000 10000150002000025000

edges

Fig 12: Bayesian information criterion on mDC data set