<<

Escaping the curse of dimensionality in Bayesian model based clustering

Noirrit Kiran Chandra,∗ Antonio Canale†and David B. Dunson‡

Abstract

In many applications, there is interest in clustering very high-dimensional data. A common strat- egy is first stage reduction followed by a standard clustering algorithm, such as k-means. This approach does not target dimension reduction to the clustering objective, and fails to quantify uncertainty. Bayesian approaches provide an appealing alternative, but often have poor performance in high-, producing too many or too few clusters. This article provides an explanation for this behavior through studying the clustering posterior in a non-standard setting with fixed sample size and increasing dimensionality. We show that the finite sample posterior tends to either assign every observation to a different cluster or all observations to the same cluster as dimension grows, depending on the kernels and prior specification but not on the true data-generating model. To find models avoiding this pitfall, we define a Bayesian oracle for clustering, with the oracle clustering pos- terior based on the true values of low-dimensional latent variables. We define a class of latent mixtures for Bayesian clustering that have equivalent behavior to this oracle as dimension grows. The proposed

arXiv:2006.02700v3 [stat.ME] 23 Jun 2021 approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNAseq.

Keywords: Big data; Clustering; Dirichlet process; Exchangeable partition probability function; High dimensional; Latent variables; Mixture model.

∗Department of Statistics and Data Sciences, The University of Texas at Austin, TX [email protected] †Dipartimento di Scienze Statistiche, Universit`adegli Studi di Padova, Padova, Italy [email protected] ‡Departments of Statistical Science and Mathematics, Duke University, Durham, NC [email protected]

1 1. INTRODUCTION

T High-dimensional data yi = (yi1, . . . , yip) for i = 1, . . . , n, with p  n have become commonplace, and there is routinely interest in clustering observations {1, . . . , n} into groups. We are particularly motivated by single-cell RNA sequencing (scRNASeq) data, which can be useful in disentangling carcinogenic processes. A key focus is inferring different cell types by clustering cells based on their high-dimensional gene expression data. Although there are a variety of alternatives in the literature (see Kiselev et al. 2019, for a review), we are particularly motivated to consider a Bayesian approach due to the potential for propagating uncertainty in inferring cell types to subsequent analyses. Bayesian clustering is typically based on mixture models of the form:

k X yi ∼ f, f(y) = πhK(y; θh), (1) h=1

T where k is the number of components, π = (π1, . . . , πk) are probability weights, K(y; θh) is the density of the data within component h, and the number of clusters in data y1, . . . , yn corresponds

p to the number of occupied components kn ≤ k. When p is large and yi ∈ R , a typical approach chooses K(y; θh) as a multivariate Gaussian density with a constrained and parsimonious covariance (see Bouveyron and Brunet-Saumard 2014, for a review). Examples include matrices that are diagonal (Banfield and Raftery 1993), block diagonal (Galimberti and Soffritti 2013) or have a factor analytic representation (Ghahramani et al. 1996). To avoid sensitivity to a pre-specified k, one can place a prior on k to induce a mixture of finite mixture model (Miller and Harrison 2018; Fr¨uhwirth-Schnatter et al. 2020). Alternatively, a Bayesian nonparametric approach lets k = ∞, which allows kn to increase without bound as n increases. Under a Dirichlet process (Ferguson 1973) kn increases at a log rate in n, while for a Pitman-Yor process (Pitman and Yor 1997) the rate is a power law. When p is very large, problems arise in following these approaches. The posterior distribution of kn can concentrate on large values (Celeux et al. 2018); often the posterior mode of kn is even equal to n so that each subject is assigned to its own singleton cluster. Celeux et al. (2018) conjectured that this aberrant behavior is mainly due to slow mixing of Markov chain Monte Carlo samplers.

2 Fr¨uhwirth-Schnatter (2006) combat this problem with a specific prior elicitation criterion; this can be successful for p ≈ 100, but calibration of hyperparameters is a delicate issue and scaling to p > 1, 000 is problematic. Alternatively, one may attempt to cluster in lower dimensions via variable selection in clustering (Tadesse et al. 2005; Kim et al. 2006) or by introducing both global and variable-specific clustering indices for each subject, so that only certain variables inform global cluster allocation (Dunson 2009). However, we find these approaches complicated, and to not address the fundamental question of what is causing the poor performance of Bayesian clustering for large p. To fill this gap, we provide theory showing that as p → ∞ with n fixed, the posterior tends to assign probability one to a trivial clustering - either with kn = 1 and all subjects in one cluster or with kn = n and every subject in a different cluster. In a related result for classification, Bickel and Levina (2004) showed that when p increases at a faster rate than n, the Fisher’s linear discriminant rule is equivalent to randomly assigning future observations to the existing classes.

Our result has no relationship with the literature showing that the posterior for kn can diverge in the limit as n → ∞ for nonparametric Bayes procedures even when the data are generated from model (1) with the true k finite (Miller and Harrison 2014; Cai et al. 2020). Indeed, our result holds for finite n regardless of the true data generating model, and has fundamentally different implications - in particular, that one needs to be extremely careful in specifying the kernel K(y; θ) and prior for θ in the large p context. Otherwise, the true posterior can produce degenerate clustering results that have nothing to do with true structure in the data. A key question is whether it is possible to define models that can circumvent this pitfall? We show that the answer is yes if clustering is conducted on the level of low-dimensional latent variables

ηi underlying yi. When the dimension of ηi is small relative to p, yi provides abundant information about the lower-dimensional ηi and the curse of dimensionality can be turned into a blessing. This motivates a novel notion of a Bayesian oracle for clustering. The oracle has knowledge of the latent ηis and defines a Bayesian mixture model for clustering based on the ηis; the resulting oracle clustering posterior is thus free of the curse of dimensionality. We propose a particular latent mixture model structure, which can be shown to satisfy this oracle property and additionally leads to straightforward

3 computation.

2. LIMITING BEHAVIOR OF HIGH-DIMENSIONAL BAYESIAN CLUSTERING

Under a general Bayesian framework, model (1) becomes

X iid yi ∼ f, f(y) = πhK (y; θh) , θh ∼ P0, {πh} ∼ Q0, h≥1 where {πh} ∼ Q0 denotes a suitable prior for the mixture weights. Examples include stick-breaking (Sethuraman 1994) constructions or a k-dimensional Dirichlet distribution with the dimension k given a prior following a mixture of finite mixture approach.

Let ci ∈ {1,..., ∞} denote the cluster label for subject i (for i = 1, . . . , n), with kn = #{c1, . . . , cn} denoting the number of clusters represented in the sample. Conditionally on ci = h, we can write Pk yi | ci = h ∼ K (yi; θh). Assume that nj is the size of the jth cluster with j=1 nj = n. The posterior probability of observing the partition Ψ induced by the clusters c1, . . . , cn conditionally on the data

Y = {y1, . . . , yn} is

Π(Ψ) × Q R Q K (y ; θ) dP (θ) h≥1 i:ci=h i 0 Π(Ψ | Y) = P 0 Q R Q , (2) 0 Π(Ψ ) × K (y ; θ) dP (θ) Ψ ∈P h≥1 i:ci=h i 0 where P is the space of all possible partitions of n data points into clusters. The numerator of (2) is the product of the prior probability of Ψ multiplied by a product of the marginal likelihoods of the observations within each cluster. The denominator is a normalizing constant consisting of an enormous sum over P. Assuming exchangeability, the prior probability of any partition of n subjects into kn groups depends only on n1, . . . , nkn and kn through an exchangeable partition probability function. The latter is available in closed form for popular choices of Q0, including the Dirichlet process, Pitman-Yor process and certain mixture of finite mixtures. The posterior (2) forms the basis for Bayesian inferences on clusterings in the data, while providing a characterization of uncertainty. We are particularly interested in how this posterior behaves in the

T case in which yi = (yi1, . . . , yip) are high-dimensional so that p is very large. To study this behavior theoretically, we consider the limiting case as p → ∞ while keeping n fixed. This setting is quite

4 appropriate in our motivating applications to genomics, as there is essentially no limit to the number of variables one can measure on each study subject, while the number of study subjects often being quite modest. In such settings with enormous p and modest n, we would like the true posterior distribution in (2) to provide a realistic characterization of clusters in the data. In fact, this is commonly not the case, and as p increases the posterior distribution can have one of two trivial degenerate limits. In particular, depending on the choice of kernel density K(·; θ) and the base measure P0 for the θh’s, the posterior assigns probability one to either the kn = 1 clustering that places all subjects in the same cluster or the kn = n clustering that places all subjects in different clusters. This result is formalized in the following theorem.

p Theorem 1. Let y1, . . . , yn denote p-variate random vectors with joint probability measure P0. Let 0 0 Ψ denote the partition induced by the cluster labels c1, . . . , cn, and let c1, . . . , cn denote a new set of 0 cluster labels obtained from c1, . . . , cn by merging an arbitrary pair of clusters, with Ψ the related partition. Assume Q0(πh > 0 for all h = 1, . . . , n) > 0. If

Q R Q K (y ; θ) dP (θ) h≥1 i:ci=h i 0 lim sup Q R Q = 0 p→∞ 0 K (yi; θ) dP0(θ) h≥1 i:ci=h

p p in P0-probability, then limp→∞ Π(c1 = ··· = cn | Y) = 1 in P0-probability. Else if

Q R Q K (y ; θ) dP (θ) h≥1 i:ci=h i 0 lim inf Q R Q = ∞ p→∞ 0 K (yi; θ) dP0(θ) h≥1 i:ci=h

p p in P0-probability, then limp→∞ Π(c1 6= · · · 6= cn | Y) = 1 in P0-probability.

The condition on Q0 is equivalent to saying kn has positive prior mass on 1, . . . , n, which is extremely mild and holds for essentially any prior in the literature, including the Dirichlet process, Pitman-Yor process and suitable mixture of finite mixtures that do not pre-specify k < n. Theorem 1 has disturbing implications in terms of the behavior of posterior distributions for Bayesian clustering in large p settings. To obtain insights into this result, we consider an important special case corresponding

5 to a location-scale mixture of multivariate Gaussian kernels:

iid X iid yi ∼ f, f(y) = πhNp(y; µh, Σh), (µh, Σh) ∼ P0, (3) h≥1

where P0 is a suitable prior on the kernel parameters and Np(µ, Σ) denotes the p-dimensional mul- tivariate normal density with mean µ and covariance matrix Σ. To study the effect of over and

under-parametrization, we consider two settings, respectively in Corollary 1 and 2. Let IW(ν0, Λ0) de-

note an inverse-Wishart distribution with degrees of freedom ν0 and scale matrix Λ0. Let λmin(A) and

T λmax(A) be the smallest and largest eigenvalues of a positive definite matrix A and Y = [y1, . . . , yn] be the complete n × p data matrix. Assume

T 2 p (A0) lim infp→∞ λmin(YY )/p > 0 and kyik ≤ Mp for M > 0 in P0-probability.

Condition (A0) is extremely mild ensuring that the data are non-atomic and is satisfied for any continuous distribution with finite second order moments.

iid ind −1 Corollary 1. In model (3), assume (A0) with Σh ∼ IW(ν0, Λ0) and µh | Σh ∼ Np(µ0, κ0 Σh), 2 with kµ0k = O(p), κ0 = O(1), ν0 = p + c for some fixed constant c ≥ 0, kΛ0k2 = O(1) and p kΛ0k2/λmin(Λ0) = O(1). Then Π(c1 = ··· = cn | Y) → 1 in P0-probability.

Corollary 2. In model (3), assume (A0) with Σh = Σ across all clusters, and let Σ ∼ IW(ν0, Λ0) iid −1 2 and µh | Σ ∼ Np(µ0, κ0 Σ), with kµ0k = O(p), κ0 = O(1), ν0 > p − 1 such that limp→∞ ν0/p > 1, p and kΛ0k2 = O(1) with kΛ0k2/λmin(Λ0) = O(1). Then Π(c1 6= · · · 6= cn | Y) → 1 in P0-probability.

Corollary 1 and 2 show that, for mixtures of Gaussians, we can obtain directly opposite aberrant limiting behavior of the posterior depending on the kernel and prior for the kernel parameters but not

on the clustering prior Q0. Corollary 1 considers the case in which we allow flexible cluster-specific means and dispersion matrices, under typical conjugate multivariate normal inverse Wishart priors. This case can be viewed as overly-complex as p increases, and to combat this complexity, the Bayesian Ockham razor (Jefferys and Berger 1992) automatically assigns probability one to grouping all n individuals into the same cluster, effectively simplifying the model. At the other extreme, covered by Corollary 2, we assume a relatively simplistic model structure in which all the mixture components

6 have a common covariance. In the limiting case as p → ∞, this model forces all individuals to be assigned to their own clusters. These results hold regardless of the true data-generating model, and in particular the true clustering structure. These theoretical results demonstrate that in high dimensions it is crucial to choose a good com- promise between parsimony and flexibility in Bayesian model-based clustering. Otherwise, the true posterior distribution of clusterings in the data can have effectively no relationship whatsoever with true clustering structure in the data. Although we focus on the limiting case as p → ∞, we conjecture that this behavior can ‘kick in’ quickly as p increases, based on intuition built through our proofs and through comprehensive simulation experiments.

3. LATENT FACTOR MIXTURE

To overcome the problems discussed in Section 2, we propose a general class of latent factor mixture models defined as

∞ X yi ∼ f(yi; ηi, ψ), ηi ∼ πhK(ηi; θh), (4) h=1

T where ηi = (ηi1, . . . , ηid) are d-dimensional latent variables, d < n is fixed and not growing with p, f(·; ηi, ψ) is the density of the observed data conditional on the latent variables and measurement parameters ψ, and K(·; θ) is a d-dimensional kernel density. Under (4), the high dimensional data being collected are assumed to provide error-prone mea- surements of an unobserved lower-dimensional set of latent variables ηi on subject i. As a canonical example, we focus on a linear Gaussian measurement model with a mixture of Gaussians for the latent factors:

∞ X yi ∼ Np(Ληi, Σ), ηi ∼ πhNd(µh, ∆h), {πh} ∼ Q0, (5) h=1 where Σ is a p × p diagonal matrix and Λ is a p × d matrix of factor loadings. The key idea is to incorporate all the cluster-specific parameters at the latent data level instead of the observed data level to favor parsimony. The latent variables are supported on a lower-dimensional hyperplane, and we

7 map from this hyperplane to the observed data level through multiplication by a factor loadings matrix and then adding Gaussian noise. The model is highly flexible at the latent variable level, allowing

differences across clusters in the mean through µh and the shape, size, and orientation through ∆h. Figure 1 provides an illustration of how the model works in a toy case; even if the clusters are well separated in the lower-dimensional space, they may be overlapping at the observed data level.

Figure 1: Illustration of the proposed latent factor model: plots of observed yi’s and latent ηi’s for p = 3 and d = 2 with three clusters. The left panel represents the observed p-dimensional data and the right panel latent variables supported on a lower d-dimensional plane.

The proposed model is fundamentally different from the popular mixture of factor analyzers of Ghahramani et al. (1996), which defines a mixture of multivariate Gaussians at the p-dimensional observed data level having cluster-specific means and covariance matrices, with the dimension of the covariances reduced via a factor model. In contrast, we are effectively learning a common affine space within which we can define a simple location-scale mixture of Gaussians. Our approach not only massively reduces the effective number of parameters for large p, but also provides a successful compromise between the two extreme cases of Section 2.

3.1 Prior Specifications

In order to accommodate very high-dimensional data, with p  n, it is important to reduce the effective number of parameters in the p × d loadings matrix Λ. There is a rich literature on sparse

8 factor modeling using a variety of shrinkage or sparsity priors for Λ; for example, refer to Bhattacharya and Dunson (2011) and the references therein. Although a wide variety of shrinkage priors for Λ are appropriate, we focus on a Dirichlet-Laplace prior (Bhattacharya et al. 2015), as it is convenient both computationally and theoretically. On a p-dimensional vector θ, the Dirichlet-Laplace prior with parameter a, denoted by DL(a), can be specified in the following hierarchical manner

ind 2 2 iid θj | φ, τ ∼ N(0, ψjφj τ ), ψj ∼ Exp(1/2), φ ∼ Dir(a, . . . , a), τ ∼ Ga(dqa, 1/2), (6)

where θj is the j-th element of θ, φ is a vector of the same length as θ, Exp(a) is an exponential distribution with mean 1/a, Dir(a1, . . . , ap) is the p-dimensional Dirichlet distribution, and Ga(a, b) is the gamma distribution with mean a/b and variance a/b2. We let vec(Λ) ∼ DL(a). We then choose

−2 iid inverse-gamma priors for the residual variances, σj ∼ Ga(aσ, bσ).

For the prior Q0 on the cluster weights {πh}, for convenience in computation, we use a stick- breaking prior (Ishwaran and James 2001) derived from a Dirichlet process (Ferguson 1973), which has concentration parameter α impacting the induced prior on the number of clusters. To allow greater data adaptivity, we choose a Ga(aα, bα) prior for α. We assign the cluster-specific means and covariances {µh, ∆h} independent multivariate normal inverse-Wishart priors with location µ0, precision parameter κ0, inverse scale matrix ∆0 and degrees of freedom ν0. Our hierarchical Bayesian model for the ηis can be equivalently be represented as

ind iid ηi | µi, ∆i ∼ Nd(µi, ∆i), µi, ∆i | G ∼ G, G ∼ DP(α, G0), α ∼ Ga(aα, bα),

where G0 = NIW(µ0, ∆0, κ0, ν0). In practice, the latent variable dimension d is unknown. Potentially we could put a prior on d and implement a reversible-jump type algorithm, which may lead to inefficient computation. We prefer an empirical Bayes approach that estimates d in a preliminary step. We use the augmented implicitly restarted Lanczos bidiagonalization algorithm (Baglama and Reichel 2005) to obtain approximate singular values and eigenvectors, and choose the smallest db explaining at least 95% of the variability in the data. Since we place a shrinkage prior on Λ, we err on the side of over-estimating d and allow

9 the prior to shrink away unnecessary factors. The left and right singular values are used to initialize the Λ and ηi’s in our Markov chain Monte Carlo implementation.

For all the simulation experiments of the next section and the application, we choose µ0 = 0 and

2 2 2 ∆0 = ξ Id for a scalar ξ > 0. We set ξ = 20, κ0 = 0.001, ν0 = db+ 50, aα = bα = 0.1 as the hyper-parameters of the DP mixture prior; aσ = 1, bσ = 0.3 as the hyper-parameters of the prior on the residual variances, and set a = 0.5 as the Dirichlet-Laplace parameter.

3.2 Posterior

For posterior computation we use a Gibbs sampler defined by the following steps.

T T 2 2 2 Step 1 Letting λj denote the jth row of Λ, η = [η1, . . . , ηn] , Dj = τ diag(ψj1φj1, . . . , ψjdφjd) and (j) T y = (y1j, . . . , ynj) , for j = 1, . . . , p sample

n −1 −2 T −1 T −2 (j) −1 −2 T −1o (λj | −) ∼ Nd (Dj + σj η η) η σj y , (Dj + σj η η) .

  1 P Step 2 Update the ∆h’s from the inverse-Wishart distributions IW ψh, νh whereη ¯h = ηi, b b nh i:ci=h νbh = ν0 + nh and

2 X T κ0nh T ψbh = ξ Id + (ηi − η¯h)(ηi − η¯h) + η¯hη¯h . κ0 + nh i:ci=h

Due to conjugacy, the location parameters µh’s can be integrated out of the model.

Step 3 Sample the latent factors, for i = 1, . . . , n, from

n −1 o (ηi | −) ∼ Nd Ωhρh, Ωh + Ωh(κbh,−i∆h) Ωh ,

P 1 P where nh,−i = 1(cj = h), κh,−i = κ0 + nh,−i,η ¯h,−i = ηi, µh,−i = j6=i b nh,−i j:cj =h,j6=i b nh,−iη¯h,−i T −1 −1 −1 T −1 −1 , ρh = Λ Σ Yi + ∆ µh,−i and Ω = Λ Σ Λ + ∆ . nh,−i+κ0 h b h h

10 Step 4 Sample the cluster indicator variables c1, . . . , cn with probabilities

  R nh,−i Nd(ηi; µh, ∆h)dΠ(µh, ∆h | η−i) for h ∈ {cj}j6=i, Π(ci = h | −) ∝ (7)  R α Nd(ηi; µh, ∆h)dΠ(µh, ∆h) for all j 6= i.

where η−i is the set of all ηj, j = 1, . . . , n except ηi. Due to conjugacy the above integrals are analytically available.

Step 5 Let r be the number of unique ci’s. Following West (1992), first generate ϕ ∼ Beta(α + 1, n),  evaluate π/(1 − π) = (aα + r − 1)/ n(bα − log ϕ) and generate

  Ga(α + r, bα − log ϕ) with probability π, α | ϕ, r ∼  Ga(α + r − 1, bα − log ϕ) with probability 1 − π.

2 n Pn T 2 o Step 6 For j = 1, . . . , p sample σj from Ga aσ + n/2, bσ + i=1(yij − λj ηi) /2 .

Step 7 Update the hyper-parameters of the Dirichlet-Laplace prior through:

(i) For j = 1, . . . , p and h = 1, . . . d sample ψejh independently from an inverse-Gaussian

iG(τφjh/ λjh , 1) distribution and set ψjh = 1/ψejh.

(ii) Sample the full conditional posterior distribution of τ from a generalized inverse Gaussian n P o giG dp(1 − a), 1, 2 j,h λjh /φjh distribution.

(iii) To sample φ | Λ, draw Tjh independently with Tjh ∼ giG(a − 1, 1, 2 λjh ) and set P φjh = Tjh/T with T = jh Tjh.

This simple Gibbs sampler sometimes gets stuck in local modes; a key bottleneck is the exploration Step 4. Therefore, we adopt the split-merge MCMC procedure proposed by Jain and Neal (2004); the authors note that the Gibbs sampler is useful in moving singleton samples between clusters while the split-merge algorithm makes major changes. Hence, we randomly switch between Gibbs and split- merge updates. The split-merge algorithm makes smart proposals by performing restricted Gibbs scans of the same form as in (7).

11 From the posterior samples of ci’s, we compute summaries following Wade and Ghahramani (2018). Our point estimate is the partition visited by the Markov chain Monte Carlo sampler that minimizes the posterior expectation of the Binder loss exploiting the posterior similarity matrix obtained from the different sampled partitions. The sampling algorithm can be easily modified for other priors on Λ having a conditionally Gaussian representation, with Step 7 modified accordingly. For example, we could use horseshoe (Carvalho et al. 2009), increasing shrinkage priors (Bhattacharya and Dunson 2011; Legramanti et al. 2020; Schiavon et al. 2021), or the fast factor analysis prior (Roˇckov´aand George 2016). Similarly, alternative priors for {πh}, such as Pitman and Yor (1997) or Miller and Harrison (2018), can be adopted with minor modifications in Steps 4 and 5.

4. PROPERTIES OF THE LATENT MIXTURE FOR BAYESIAN CLUSTERING METHOD

4.1 Bayes Oracle Clustering Rule

In this section, we first define a Bayes oracle clustering rule. We assume the oracle has knowledge of the exact values of the latent variables {η0i}. Given th is knowledge, the oracle can define a simple Bayesian mixture model to induce a posterior clustering of the data, which is not affected by the high-dimensionality of the problem. This leads to a distribution over the space of partitions defined in the following definition.

Definition 1. Let η0 = {η01, . . . , η0n} be the true values of the unobserved latent variables correspond- ing to each data point.

Π(Ψ) × R Q Q K (η ; θ ) dG (θ) h≥1 i:ci=h 0i h 0 Π(Ψ | η0) = P 0 R Q Q . (8) 0 Π(Ψ ) × 0 K (η0i; θh) dG0(θ) Ψ ∈P h≥1 i:ci=h

Probability (8) expresses the oracles’ uncertainty in clustering. This is a gold standard in being free of the curse of dimensionality through using the oracles’ knowledge of the true values of the latent variables. Under the framework of Section 3, the high-dimensional measurements on each subject provide information on these latent variables, with the clustering done on the latent variable level. Ideally, we would get closer to the oracle partition probability under the proposed method as p

12 increases, turning the curse of dimensionality into a blessing. We show that this is indeed the case in Section 4.2. We assume the oracle uses a location mixture of Gaussians with a common covariance matrix. We

assume the following mixture distribution on η0i’s, independent non-informative Jeffreys prior for the

common covariance, and prior Q0 on the mixture probabilities

∞ iid X iid d+1 −1 − 2 ηi ∼ πhNd(µh, ∆), µh | ∆ ∼ Nd(0, κ0 ∆), ∆ ∝ |∆| , {πh} ∼ Q0. (9) h=1

For d < n, the oracle rule is well defined for the Jeffreys prior on ∆.

4.2 Main Results

In this section, we show that the posterior probability on the space of partitions induced by the proposed model converges to the oracle probability as p → ∞ under appropriate conditions on the

2 2 data generating process and the prior. We assume that the σj ’s are the same having true value σ0. p Our result is based on the following assumptions on P0, the joint distribution of y1, . . . , yn:

ind 2 (C1) yi ∼ Np(Λ0η0i, σ0Ip), for each i = 1, . . . , n;

1 T (C2) limp→∞ Λ0 Λ0 − M = 0 where M is a d × d positive-definite matrix; p 2 2 2 2 2 2 (C3) σL < σ0 < σU where σL and σU are known constants;

(C4) kη0ik = O(1) for each i = 1, . . . , n.

Condition (C1) corresponds to the conditional likelihood of yi given ηi being correctly specified, (C2) ensures that the loading matrix does not grow too rapidly with p and the data contain increasing information on the latent factors as p increases. Related but much stronger conditions appear in the factor modeling (Fan et al. 2008, 2011) and massive covariance estimation literature (Pati et al.

2014). We allow the columns of Λ0 to be non-orthogonal with varying average squared values which

is expected in high-dimensional studies. Condition (C3) bounds the variance of the observed yi and (C4) is a weak assumption ensuring that the latent variables do not depend on n or p. Additionally, we assume that the latent dimension d is known.

Although we use a stick-breaking prior on the mixture probabilities {πh} in Section 3.1, we derive

13 the following result for an arbitrary prior Q0 for wider applicability. We assume the inverse-gamma

2 2 2 prior on residual variance σ to be restricted to the compact set [σL, σU ]. In Lemma 1 we derive sufficient conditions for the posterior probability on the space of partitions to converge to the oracle probability for p → ∞.

T √ T (p) h (p) (p)i −1 T 1/2 Lemma 1. Let η = [η1, . . . , ηn] , ζ = ζ1 , . . . , ζn = ( p log p) (Λ Λ) η and, for any δ > 0, √ Tn −1 Bp,δ = i=1{Λ, ηi : ( p log p) kΛηi − Λ0η0ik < δ}. Assume for any δ > 0

¯ p Π(Bp,δ | Y) → 0 P0-a.s. (10)

where B¯p,δ is the complement of Bp,δ. Let E(· | Y) denote expectation with respect to the posterior distribution of the parameters given data Y and Π(Ψ | ζ(p)) be the posterior probability of partition Ψ

(p) n (p) o with η0 replaced by ζ in (8). Then, limp→∞ E Π(Ψ | ζ ) | Y = Π(Ψ | η0).

Remark 1. Lemma 1 holds for any prior Q0 on the mixture probabilities. Our focus is on verifying that the finite n clustering posterior avoids the high dimensionality pitfalls discussed in Section 2, and we provide no results on the unrelated goal of obtaining greater accuracy in inferring ‘true’ clusters as n increases.

In the following theorem, we show that condition (10) holds for the proposed method and hence we avoid the large p pitfall. The proof is in the Supplementary Materials.

Theorem 2. Let Bp,δ be as defined in Lemma 1 and B¯p,δ be its complement set. Then, under (C1)- ¯ p (C4) and model (5), Π(Bp,δ | Y) → 0 P0-a.s. for any δ > 0.

5. SIMULATION STUDY

We perform a simulation study to analyze the performance of the proposed method in clustering high dimensional data. The sampler introduced in Section 3.2 is available from the GitHub page of the first author. We compare with a Dirichlet process mixture of Gaussian model with diagonal covariance matrix implemented in R package BNPmix (Corradin et al. 2021), a nonparametric mixture of infinite factor analyzers implemented in R package IMIFA (Murphy et al. 2019), and a two-stage approach that

14 Figure 2: Simulation results: Distributions of the adjusted Rand indices (upper plot) and estimated number of clusters (lower plot) in 20 replicated experiments. Horizontal dashed lines denote the true number of clusters. The scenarios, reported in each row, are labelled as Lamb for the model of Section 3, MFA for mixture of factor analyzers, and SpCount for the sparse count scenario.

15 performs an approximate sparse principal component analysis of the high dimensional data to reduce dimensionality from p to d—with d the minimum number of components explaining at least 95% of the variability consistently with Section 3.1—and then applies k-means on the principal components, with k chosen using (?). For the high-dimensional simulation settings we considered, both the mixture of Gaussians and the mixture of factor analyzers showed high instability, including lack of convergence, memory errors, extremely long running times and other issues. For these reasons we report a comparison with the pragmatic two-stage approach only. To test the accuracy of the estimated clustering relative to the true clustering, we compute the adjusted Rand index (Rand 1971). We generated data under: [1] our model, [2] mixture of sparse factor analyzers, and [3] mixture of log transformed zero inflated Poisson counts. [1]-[2] have latent dimension 20, while for [3] the data are discrete and highly non-Gaussian within clusters mimicking the data of Section 6. For [1]–[3] we vary true number of clusters k0 ∈ {10, 15, 25}, with 2/3 of these ‘main’ clusters having the same probability and the remaining 1/3 having together the same probability of a single main cluster. This is challenging, as many methods struggle unless there are a small number of close to equal weight clusters that are well separated. The dimension p varies in p = {1, 000, 2, 500} while the sample size n is n = 2, 000. Data visualization plots using McInnes et al. (2018) are in the Supplementary Materials. For each configuration, we perform 20 independent replications. We run our sampler for 6, 000 iterations discarding the first 1, 000 as burn in and taking one draw every five to reduce autocorrelation. Prior elicitation follows the default specification of Section 3.1. On average a single run under these settings took between 40 and 50 minutes on a iMac with 4.2 GHz Quad-Core Intel Core i7 processor and 32GB DDR4 RAM. Figure 2 reports the distribution of the 20 replicates of the adjusted Rand index and mean estimated number of clusters. The proposed method is uniformly superior in each scenario obtaining high adjusted Rand indices, accurate clustering results, and less variability across replicates. In the second scenario, our method yields relatively low Rand index for k0 = 25. This is expected due to model misspecification and the large number of clusters.

16 6. APPLICATION IN SCRNASEQ CELL LINE DATASET

In this section, we analyze the GSE81861 cell line dataset (Li et al. 2017) to illustrate the proposed method. The dataset profiles 630 cells from 7 cell lines using the Fluidigm based single cell RNA-seq protocol (See et al. 2018). The dataset includes 83 A549 cells, 65 H1437 cells, 55 HCT116 cells, 23 IMR90 cells, 96 K562 cells, and 134 GM12878 cells, 174 H1 cells, and 57,241 genes. The cell- types are known and hence the data provide a useful benchmark to assess performance in clustering high-dimensional data. Following standard practice in single cell data analysis, we apply data pre-processing. Cells with low read counts are discarded, as we lack reliable gene expression measurements for these cells, and data are normalized following Lun et al. (2016). We remove non-informative genes using M3Drop (Andrews and Hemberg 2018). After this pre-processing phase, we obtain a final dataset with n = 531 cells and p = 7,666 genes. We implement the proposed method using our default prior, collecting 10, 000 iterations after a burn-in of 5, 000 and keeping one draw in five. As comparison, we apply the two stage procedure of the previous section and the popular Seurat (Butler et al. 2018) pipeline which performs quality control, normalization, highly variable gene selection, and clustering. Graphical representations of the different clustering results are reported in the Supplementary Materials. The proposed method, the two stage approach, and Seurat achieve adjusted Rand indices of 0.977, 0.734 and 0.805 and yield 12, 10, and 8 clusters, respectively. Seurat is reasonably accurate but splits the H1 cell-type into two clusters, while the two-stage approach is dramatically worse. An appealing aspect of our approach is posterior uncertainty quantification. The 95% credible interval for the adjusted Rand index is [0.900, 0.985] and the posterior probability of having between 11 and 13 clusters is 0.98%. This suggests that the posterior distribution is highly concentrated, which is consistent with our simulations. The posterior similarity matrix reported in the first panel of Figure 3—also reporting the related dendrogram obtained by using complete linkage—clearly shows that the majority of the observations have a high posterior probability of being assigned to a specific cluster and negligible probability of being assigned to an incorrect cluster. Figure 3 also shows micro clusters leading to over-estimation of the number of cell types. Two cells of cluster A549 are put in singleton

17 GM12878 IMR90 GM12878 GM12878 IMR90 GM12878 GM12878 IMR90 GM12878 IMR90 GM12878 GM12878 IMR90 GM12878 GM12878 IMR90 GM12878 IMR90 GM12878 GM12878 IMR90 GM12878 A549 IMR90 A549 IMR90 A549 A549 IMR90 A549 A549 IMR90 A549 IMR90 A549 K562 IMR90 K562 K562 IMR90 K562 IMR90 K562 HCT116 IMR90 HCT116 HCT116 IMR90 HCT116 IMR90 HCT116 H1437 H1437 H1437 H1437 H1437 H1437 H1437 IMR90 IMR90 H1437 H1437 IMR90 H1437 H1 H1 H1437 H1 H1437 H1 H1 H1437 H1 H1 H1437 H1 IMR90 H1 H1 IMR90 H1 H1 IMR90 H1 IMR90 H1 H1 A549 H1 H1 H1437 H1 H1437 H1 A549

Figure 3: Posterior similarity matrix obtained from the Markov chain Monte Carlo samples of the proposed method: left panel reports the similarity matrix for the full cell line dataset along with the dendrogram obtained using complete linkage; row names report the true cluster names; right panel zooms the center of the left panel;

clusters. Similarly cluster IMR90 is divided into two clusters of size 4 and 19 with negligible posterior probability of being merged. Finally cluster H1437 is split into four clusters with the main one comprising 35 of 47 observations and the smallest one comprising just one observation. Introduction of such micro-clusters has minimal impact on the adjusted Rand index, which is clearly much higher for the proposed method than the competitors, and is a minor practical issue that is not unexpected given the tendency of Dirichlet process mixture models to favor some tiny or even singleton clusters. These clusters may even be ‘real’ given that the cell type labels are not a perfect gold standard.

7. DISCUSSION

Part of the appeal of Bayesian methods is the intrinsic penalty for model complexity or ‘Bayesian Ockham razor’ (Jefferys and Berger 1992), which comes through integrating the likelihood over the prior in obtaining the marginal likelihood. If one adds unnecessary parameters, then the likelihood

18 is integrated over a larger region, which tends to reduce the marginal likelihood. In clustering prob- lems, one relies on the Bayesian Ockham razor to choose the appropriate compromise between the two extremes of too many clusters and over-fitting and too few clusters and under-fitting. Often in low-dimensional problems, this razor is effective and one obtains a posterior providing a reasonable representation of uncertainty in clustering data into groups of relatively similar observations. However, a key contribution of this article is showing that this is fundamentally not the case in high-dimensional problems, and one can obtain nonsensical results using seemingly reasonable priors. Perhaps our most interesting result is the degenerate behavior in the p → ∞ case for the true posterior on clusterings, regardless of the true data generating model. This negative result provided motivation for our simple latent factor mixture model, which addresses the large p pitfall by clustering on the latent variable level. Another interesting theoretical result is our notion of a Bayesian oracle for clustering; to our knowledge, there is not a similar concept in the literature. We hope that this new oracle can be used in evaluating other Bayesian procedures for high-dimensional clustering, providing competitors to the simple model we propose. The proposed work is a first step towards addressing pitfalls of Bayesian approaches to high- dimensional clustering. There are several important aspects that remain to be addressed. The first is the issue of fast computation in very large p clustering problems. We have taken a relatively simple Markov chain Monte Carlo approach that works adequately well in our motivating application in which one has the luxury of running the analysis for many hours for each new dataset. However, there is certainly a need for much faster algorithms that can exploit parallel and distributed processing; for example, running Markov chain Monte Carlo for different subsets of the variables in parallel and combining the results. Potentially building on the algorithms of Wang and Dunson (2013) and Scott et al. (2016) may be promising in this regard. Another critical issue in Bayesian and other model-based clustering is the issue of model mis- specification. In particular, clusters may not have Gaussian shapes or may not even match with more elaborate kernels, such as skewed Gaussians (Azzalini 2013). There is an evolving literature for addressing this problem in low dimensions using mixtures of mixtures (Fr¨uhwirth-Schnatter et al. 2020) and generalized Bayes methods (e.g. Miller and Dunson 2019). However, it is not clear that

19 current methods of this type can scale up to the types of dimensions encountered in our motivating genomic applications. One possibility is to rely on variational Bayes methods, such as variational autoencoders (Kingma et al. 2014); however, then one faces challenging issues in terms of obtaining reliable algorithms having theoretical guarantees, generalizability and reproducibility.

ACKNOWLEDGMENTS

This work was partially funded by grants R01-ES027498 and R01-ES028804 from the National Institute of Environmental Health Sciences of the United States Institutes of National Health and by the University of Padova under the STARS Grant.

APPENDIX

Proofs of Section 2

Theorem 1. Consider the ratio of posterior probabilities:

Π(Ψ | Y) . (A.1) Π(Ψ0 | Y)

p If this ratio converges to zero for all c1, . . . , cn in P0-probability as p → ∞, then any partition nested p into another partition is more likely a posteriori implying Π(c1 = ··· = cn | Y) = 1 in P0-probability so that all subjects are grouped in the same cluster with probability one. Conversely if the ratio p converges to +∞, then Π(c1 6= · · ·= 6 cn | Y) = 1 in P0-probability and each subject is assigned to their own cluster with probability one.

Without loss of generality, assume that c1, . . . , cn define kn clusters of sizes n1, . . . , nkn and that 0 0 0 0 c = ci for ci ∈ {1, . . . , kn − 2} and c = kn − 1 for ci ∈ {kn − 1, kn}, with n , . . . , n 0 the cluster i i 1 kn−1 0 sizes under the partition induced by the ci. In general, ratio (A.1) can be expressed as

Qkn R Q Π(Ψ) K (yi; θ) dP0(θ) × h=1 i:ci=h . (A.2) 0 Qkn−1 R Q Π(Ψ ) 0 K (yi; θ) dP0(θ) h=1 i:ci=h

The left hand side of (A.2) can be expressed as the ratio between the EPPFs. Since by assumption there is a positive prior probability for any partition in P, this ratio is finite and does not depend

20 on p or the data generating distribution. Thus, by induction and under the assumptions on the right factor of (A.2) we conclude the proof.

0 0 Corollary 1. Define c1, . . . , cn and c1, . . . , cn consistently with the proof of Theorem 1. Then, consider the ratio of the marginal likelihoods

Qkn R Q −1 Np (yi; µh, Σh)Np(µh; µ0, κ Σh)IW (Σh; ν0, Λ0) d(µh, Σh) h=1 i:ci=h 0 . (A.3) Qkn−1 R Q −1 0 Np (yi; µh, Σh)Np(µh; µ0, κ Σh)IW (Σh; ν0, Λ0) d(µh, Σh) h=1 i:ci=h 0

The numerator of (A.3) is

   p ν0  kn  ν0+nh    Y  1 Γp( ) κ0 2 |Λ0| 2  2 , n p/2 ν0 ν0+nh π h Γp( ) κ0 + nh h=1  2 Ψ nhκ0 Ψ Ψ T 2   Λ0 + S + (¯y − µ0)(¯y − µ0)   h κ0+nh h h 

   T withy ¯Ψ = n−1 P y , SΨ = P y − y¯Ψ y − y¯Ψ , and Γ (·) being the multivariate h h i:ci=h i h i:ci=h i h i h p . Obtaining a corresponding expression for the denominator, the ratio (A.3) becomes

 ν +n   ν +n  0 kn−1 0 kn ( 0 )p/2 Γp 2 Γp 2 κ0(κ0 + n ) × kn−1  ν +n0  0 kn−1 ν0  (κ0 + nkn−1)(κ0 + nkn ) Γp 2 Γp 2

ν +n0 0 kn−1 ν 0 2 0 0 n κ0 0 0 2 Ψ kn−1 Ψ Ψ T |Λ | Λ + S + 0 (¯y − µ )(¯y − µ ) 0 0 kn−1 κ +n kn−1 0 kn−1 0 0 kn−1 × . (A.4) ν0+nh Qkn Ψ nhκ0 Ψ Ψ T 2 Λ0 + S + (¯y − µ0)(¯y − µ0) h=kn−1 h κ0+nh h h

We first study the limit of the first factor of (A.4). From Lemma A.2, we have

       ν0+nkn ν0+nkn−1  1  Γp 2 Γp 2  lim log + log = 0.  0  ν0  p→∞ p ν0+n Γ  Γ kn−1 p 2   p 2 

We now study the limit of the remaining part of (A.4). Note that, if we replace each observation yi

−1/2 with yei = Λ0 (yi − µ0), assumption (A0) is still valid for yei’s. Moreover, |Λ0| terms get canceled out from (A.4). Hence, without loss of generality we can assume µ0 = 0 and Λ0 = Ip. Without loss of

21 generality we can also assume that y Ph−1 , . . . , yPh are in cluster h. We define 1+ j=1 nj j=1 nj

 T Ψ Y(h) = y Ph−1 , . . . , yPh , 1+ j=1 nj j=1 nj to be the sub-data matrix corresponding to the h-th cluster in partition Ψ. Exploiting lower rank factorization results on matrix determinants, we have

n κ T T 1 T I + SΨ + h 0 y¯Ψy¯Ψ = I + Y Ψ Y Ψ − Y Ψ 1 1T Y Ψ p h h h p (h) (h) (h) nh nh (h) nh + κ0 nh + κ0

1 n T o−1 T T = 1 − 1T Y Ψ I + Y Ψ Y Ψ Y Ψ 1 I + Y Ψ Y Ψ , nh (h) p (h) (h) (h) nh p (h) (h) nh + κ0 where the symbol |A| or |a| is to be interpreted as the determinant of the matrix A or the absolute value of the scalar a, respectively. Then, the second factor of (A.4) simplifies to

ν +n0 0 kn−1  −1  2 1 T Ψ0  Ψ0T Ψ0  Ψ0T Ψ0T Ψ0 1 − 0 1 0 Y I + Y Y Y 1 0 I + Y Y n +κ n (kn−1) p (kn−1) (kn−1) (kn−1) n p (kn−1) (kn−1) kn−1 0 kn−1 kn−1 . ν0+nh   −1  2 Qkn 1 T Ψ ΨT Ψ ΨT ΨT Ψ 1 − 1n Y Ip + Y Y Y 1nh Ip + Y Y h=kn−1 nh+κ0 h (h) (h) (h) (h) (h) (h)

−1 n T o T p Using Lemma A.4, limp→∞ Y(h) Ip + Y(h)Y(h) Y(h) − Inh = 0 in P0-probability and 2

1 n o−1 κ lim 1 − 1T Y I + Y T Y Y T 1 = 0 in p-probability. (A.5) nh (h) p (h) (h) (h) nh P0 p→∞ nh + κ0 κ0 + nh

Taking the logarithm of the second and third factor of (A.4) and rearranging it using the previous

22 result

ν +n0 0 kn−1 −1 2 1 T Ψ0 n Ψ0T Ψ0 o Ψ0T 1 − 0 1 0 Y I + Y Y Y 1 0 n +κ n (kn−1) p (kn−1) (kn−1) (kn−1) n kn−1 0 kn−1 kn−1 log ν0+nh n o−1 2 Qkn 1 T Ψ ΨT Ψ ΨT 1 − 1n Y Ip + Y Y Y 1nh h=kn−1 nh+κ0 h (h) (h) (h) (h) ν +n0 0 kn−1 p/2 Ψ0T Ψ0 2 ( 0 ) Ip + Y Y κ0(κ0 + n ) (kn−1) (kn−1) + log kn−1 + log . (A.6) ν0+n (κ0 + nk −1)(κ0 + nk ) h n n Qkn ΨT Ψ 2 Ip + Y Y h=kn−1 (h) (h)

Since ν0 = p + c, in conjunction with (A.5) we have sum of the limits of the first and second terms in p (A.6) is 0 in P0-probability. We finally study the last summand of (A.5) and particularly

ν +n0 0 kn−1 Ψ0T Ψ0 2 Ip + Y Y 1 (kn−1) (kn−1) lim log . (A.7) p→∞ p ν0+nh Qkn ΨT Ψ 2 Ip + Y Y h=kn−1 (h) (h)

Since I + Y ΨT Y Ψ = I + Y Ψ Y ΨT for any partition Ψ, following the same arguments of (A.18) p (h) (h) nh (h) (h) from Lemma A.5 in the Supplementary Materials, we have

kn ν0+nh Ψ0T Ψ0 Y ΨT Ψ 2 T Ip + Y Y = Ip + Y Y × In − ZZ (kn−1) (kn−1) (h) (h) kn−1 h=kn−1 where Z = {I + Y Ψ Y ΨT }−1/2Y Ψ Y ΨT {I + Y Ψ Y ΨT }−1/2, and (A.7) reduces to nkn−1 (kn−1) (kn−1) (kn−1) (kn) nkn (kn) (kn)

0 T T ν + n nkn Ψ Ψ nkn−1 Ψ Ψ 0 kn−1 T log In + Y Y + log In + Y Y + log In − ZZ . 2p kn−1 (kn−1) (kn−1) 2p kn (kn) (kn) 2p kn−1

From (A0), it can be deduced that the limits of the first two terms in the last expression are 0 in p P0-probability as p → ∞. Invoking Lemma A.5 and the fact that ν0 = p + c, we have

0 ν0 + n lim sup kn−1 log I − ZZT < 0, nkn−1 p→∞ 2p

23 and henceforth (A.7) is negative. This lead to

Qkn R Q −1 Np (yi; µh, Σh)Np(µh; µ0, κ Σh)IW (Σh; ν0, Λ0) d(µh, Σh) lim sup log h=1 i:ci=h 0 = 0, Qkn−1 R Q −1 p→∞ 0 Np (yi; µh, Σh)Np(µh; µ0, κ Σh)IW (Σh; ν0, Λ0) d(µh, Σh) h=1 i:ci=h 0

p and hence for p → ∞ all the data points are cluster together in P0-probability thanks to Theorem 1.

0 0 Corollary 2. Define c1, . . . , cn and c1, . . . , cn consistently with the proof of Theorem 1. Then, consider the ratio of the marginal likelihoods

R Qkn R Q −1 Np (yi; µh, Σ) Np(µh; µ0, κ Σ)dµhIW (Σ; ν0, Λ0) dΣ h=1 i:ci=h 0 . (A.8) R Qkn−1 R Q −1 0 Np (yi; µh, Σ) Np(µh; µ0, κ Σ)dµhIW (Σ; ν0, Λ0) dΣ h=1 i:ci=h 0

The numerator of (A.8) is

ν0+n p − 2 kn kn ν0+n   2   np Γ ( ) Y κ0 X Ψ nhκ0 Ψ Ψ T − p 2 ν0/2 Λ0 + Sh + (¯yh − µ0)(¯yh − µ0) π 2 |Λ0| , nh + κ0 nh + κ0 Γp(ν0/2) h=1 h=1

T Ψ 1 P Ψ P  Ψ  Ψ wherey ¯ = yi and S = yi − y¯ yi − y¯ . Hence, obtaining a corresponding h nh i:ci=h h i:ci=h h h expression for the denominator, ratio (A.8) becomes

ν0+n   2 n 0 0 0 0 o p Pkn−1 Ψ nhκ0 Ψ Ψ T ( 0 ) 2 Λ0 + h=1 Sh + n0 +κ (¯yh − µ0)(¯yh − µ0) κ0(κ0 + n )  h 0  kn−1   . (A.9) (κ + n )(κ + n )  n o  0 kn−1 0 kn  Pkn Ψ nhκ0 Ψ Ψ T  Λ0 + S + (¯y − µ0)(¯y − µ0) h=1 h nh+κ0 h h

First note that for nkn , nkn−1 ≥ 1

0 κ0(κ0 + n ) kn−1 < 1. (A.10) (κ0 + nkn−1)(κ0 + nkn )

Similar to Corollary 1, we can assume without loss of generality µ0 to be a p-dimensional vector of

24 zero and Λ0 = Ip. Note that,

kn   n k 2 X Ψ nhκ0 Ψ ΨT X T X nh Ψ ΨT Sh + y¯h y¯h = yiyi − y¯h y¯h nh + κ0 nh + κ0 h=1 i=1 h=1 T n k     X T X 1 X X = yiyi −  yi  yi . (A.11) nh + κ0 i=1 h=1 i:ci=h i:ci=h

  Also, without loss of generality and similarly to Corollary 1, we can assume y Ph−1 , . . . , yPh 1+ j=1 nj j=1 nj    T    Jn Pkn Ψ nhκ0 Ψ Ψ T Ψ Ψ Jn1 kn are in cluster h of Ψ. Then S + y¯ y¯ = Y In − Jn Y , where Jn = diag ,..., h=1 h nh+κ0 h h n1+κ0 nkn +κ0 is an n × n order block diagonal matrix and Jr is the r × r order square matrix with all elements being

Ψ 1. Clearly, Jn is a positive semi-definite matrix of rank kn. Henceforth, exploiting the lower rank factorization structure, each determinant in (A.9) can be simplified as

kn   X Ψ nhκ0 Ψ ΨT T  Ψ Ip + Sh + y¯h y¯h = Ip + Y In − Jn Y nh + κ0 h=1

T Ψ1/2 T −1 T Ψ1/2 = Ip + Y Y In − Jn Y (Ip + Y Y ) Y Jn .

Hence (A.9) reduces to

  ν0+n Ψ01/2 Ψ01/2 2 ( 0 )p/2  I − J Y (I + Y TY )−1Y TJ  κ0(κ0 + n )  n n p n  kn−1 . (A.12) Ψ1/2 Ψ1/2 (κ0 + nkn−1)(κ0 + nkn ) T −1 T  In − Jn Y (Ip + Y Y ) Y Jn 

T −1 T p From Lemma A.3 in the Supplementary Materials, limp→∞ Y (Ip + Y Y ) Y − In = 0 in P0- 2 Ψ probability. Therefore, from the construction of Jn ,

kn 1 Ψ1/2 T −1 T Ψ1/2 Ψ Y lim In − Jn Y (Ip + Y Y ) Y Jn = In − Jn = Inh − Jnh , (A.13) p→∞ nh + κ0 h=1

p T in P0-probability. Notably Jr = 1r1r where 1r is the r-dimensional vector of ones, implying that

25

1 κ0 Ir − Jr = for any positive integer r. Substituting this in (A.13), we have nr+κ0 nr+κ0

kn κ Ψ1/2 T −1 T Ψ1/2 Y 0 p lim In − Jn Y (Ip + Y Y ) Y Jn = , in P0-probability p→∞ nh + κ0 h=1 and therefore,

01 01 Ψ /2 T −1 T Ψ /2 In − Jn Y (Ip + Y Y ) Y Jn (κ + n )(κ + n ) lim = 0 kn−1 0 kn in p-probability. 1 1 0 P0 p→∞ Ψ /2 T −1 T Ψ /2 κ0(κ0 + n ) In − Jn Y (Ip + Y Y ) Y Jn kn−1

Thus if we take the log of (A.12) multiplied by p−1 and study its limit we have

  Ψ01/2 Ψ01/2  0 I − J Y (I + Y TY )−1Y TJ  1 κ0(κ0 + n ) n + ν n n p n  lim inf log kn−1 + 0 log p→∞ Ψ1/2 Ψ1/2 2 (κ0 + nkn−1)(κ0 + nkn ) 2p T −1 T  In − Jn Y (Ip + Y Y ) Y Jn  0 ! 1 κ0(κ0 + n ) n + ν = log kn−1 × 1 − lim sup 0 > 0. 2 (κ0 + nkn−1)(κ0 + nkn ) p→∞ p

Since n is fixed with p, the above limit follows from (A.10) and the assumption on ν0. Thus we have

R Qkn R Q −1 Np (yi; µh, Σ) Np(µh; µ0, κ Σ)dµhIW (Σ; ν0, Λ0) dΣ lim inf h=1 i:ci=h 0 = ∞, p→∞ R Qkn−1 R Q −1 0 Np (yi; µh, Σ) Np(µh; µ0, κ Σ)dµhIW (Σ; ν0, Λ0) dΣ h=1 i:ci=h 0

p and hence for p → ∞ each data point is clustered separately in P0-probability thanks to Theorem 1.

Proofs of Section 4 √ (p) −1 T −1/2 T (p) Lemma 1. Let ζ0 = ( p log p) (Λ Λ) Λ Λ0η0, then Π(Ψ | η0) = Π(Ψ | ζ0 ). Then,

1 1 1 (p) (p) T −1/2 √ ζi − ζ0i ≤ (Λ Λ) Λ × √ kΛηi − Λ0η0ik ≤ √ kΛηi − Λ0η0ik. (A.14) p log p 2 p log p p log p

From (9) we see that the numerator in the right hand side of (8) can be simplified as

n d − 2 kn kn   2   Y κ0 X h nh h hT C × Π(Ψ) × × Sη0 + η¯0 η¯0 , nh + κ0 nh + 1 h=1 h=1

26 Pn h 1 P h P h h T where nh = I(ci = h),η ¯ = η0i, S = (η0i − η¯ )(η0i − η¯ ) and C is a i=1 0 nh i:ci=h η0 i:ci=h 0 0 positive quantity constant across all Ψ0 ∈ P. Hence it is clear that Π(Ψ | η) is a continuous function of η. Since the function is bounded (being a probability function), the continuity is also uniform. Also note that, for the particular choice of Gaussian kernel and base measure in (9), the oracle partition probability (8) is unchanged if η is multiplied by a full-rank square matrix and therefore

(p) (p) (p) Π(Ψ | ζ0 ) = Π(Ψ | η0). Therefore, for any  > 0 there exists δ > 0 such that ζ0 − ζ < δ implies

(p) (p) (p) that Π(Ψ | ζ ) − Π(Ψ | ζ0 ) = Π(Ψ | ζ ) − Π(Ψ | η0) < . Again,

    (p) (p) (p) E Π(Ψ | ζ ) − Π(Ψ | η0) Y = E Π(Ψ | ζ ) − Π(Ψ | ζ ) Bp,δ, Y Π(Bp,δ | Y) 0   (p) ¯ ¯ + E Π(Ψ | ζ ) − Π(Ψ | η0) Bp,δ, Y Π(Bp,δ | Y). (A.15)

Due to continuity, δ can be chosen sufficiently small such that the term inside the first expectation in the right hand side of (A.15) is smaller than arbitrarily small  > 0. Now for any δ > 0, the second term in the right hand side of (A.15) goes to 0 as Π(B¯p,δ | Y) → 0 as p → ∞ by assumption. n o (p) Therefore, for arbitrarily small  > 0, E Π(Ψ | ζ ) − Π(Ψ | η0) Y <  for large enough p. Hence the proof.

REFERENCES

Andrews, T. S. and Hemberg, M. (2018), “M3Drop: Dropout-based for scRNASeq,” Bioinformatics, 35, 2865–2867.

Azzalini, A. (2013), The Skew-Normal and Related Families, Institute of Mathematical Statistics Monographs, Cambridge University Press.

Baglama, J. and Reichel, L. (2005), “Augmented Implicitly Restarted Lanczos Bidiagonalization Meth- ods,” SIAM Journal on Scientific Computing, 27, 19–42.

Banfield, J. D. and Raftery, A. E. (1993), “Model-Based Gaussian and Non-Gaussian Clustering,” Biometrics, 49, 803–821.

27 Bhattacharya, A. and Dunson, D. B. (2011), “Sparse Bayesian Infinite Factor Models,” Biometrika, 98, 291–306.

Bhattacharya, A., Pati, D., Pillai, N. S., and Dunson, D. B. (2015), “Dirichlet-Laplace Priors for Optimal Shrinkage,” Journal of the American Statistical Association, 110, 1479–1490.

Bickel, P. J. and Levina, E. (2004), “Some Theory for Fisher’s Linear Discriminant Function, ‘naive Bayes’, and Some Alternatives When There Are Many More Variables Than Observations,” Bernoulli, 10, 989–1010.

Bouveyron, C. and Brunet-Saumard, C. (2014), “Model-based Clustering of High-dimensional Data: A Review,” Computational Statistics & Data Analysis, 71, 52 – 78.

Butler, A., Hoffman, P., Smibert, P., Papalexi, E., and Satija, R. (2018), “Integrating Single-Cell Tran- scriptomic Data Across Different Conditions, Technologies, and Species,” Nature Biotechnology, 36, 411–420.

Cai, D., Campbell, T., and Broderick, T. (2020), “Finite mixture models are typically inconsistent for the number of components,” arXiv:2007.04470.

Carvalho, C. M., Polson, N. G., and Scott, J. G. (2009), “Handling sparsity via the horseshoe,” in Artificial Intelligence and Statistics, PMLR, pp. 73–80.

Celeux, G., Kamary, K., Malsiner-Walli, G., Marin, J.-M., and Robert, C. P. (2018), “Computa- tional Solutions for Bayesian Inference in Mixture Models,” in Handbook of Mixture Analysis, eds. Fr¨uhwirth-Schnatter, S., Celeux, G., and Robert, C. P., Boca Raton, FL: CRC Press, chap. 5, pp. 77–100.

Corradin, R., Canale, A., and Nipoti, B. (2021), “BNPmix: an R package for Bayesian nonparametric modelling via Pitman-Yor mixtures,” Journal of Statistical Software, in press.

Dunson, D. B. (2009), “Nonparametric Bayes Local Partition Models for Random Effects,” Biometrika, 96, 249–262.

28 Fan, J., Fan, Y., and Lv, J. (2008), “High Dimensional Covariance Matrix Estimation Using a Factor Model,” Journal of Econometrics, 147, 186 – 197.

Fan, J., Liao, Y., and Mincheva, M. (2011), “High-Dimensional Covariance Matrix Estimation in Approximate Factor Models,” The Annals of Statistics, 39, 3320–3356.

Ferguson, T. S. (1973), “A Bayesian Analysis of Some Nonparametric Problems,” Annals of Statistics, 1, 209–230.

Fr¨uhwirth-Schnatter, S. (2006), Finite Mixture and Markov Switching Models, Springer Science & Business Media.

Fr¨uhwirth-Schnatter, S., Malsiner-Walli, G., and Gr¨un,B. (2020), “Dynamic mixtures of finite mix- tures and telescoping sampling,” arXiv preprint arXiv:2005.09918.

Galimberti, G. and Soffritti, G. (2013), “Using Conditional Independence for Parsimonious Model- Based Gaussian Clustering,” Statistics and Computing, 23, 625–638.

Ghahramani, Z., Hinton, G. E., et al. (1996), “The EM Algorithm for Mixtures of Factor Analyzers,” Tech. rep., Technical Report CRG-TR-96-1, University of Toronto.

Ghosal, S. and Van Der Vaart, A. (2017), Fundamentals of Nonparametric Bayesian Inference, Cam- bridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.

Ishwaran, H. and James, L. F. (2001), “Gibbs sampling methods for stick-breaking priors,” Journal of the American Statistical Association, 96, 161–173.

Jain, S. and Neal, R. M. (2004), “A Split-Merge Markov Chain Monte Carlo Procedure for the Dirichlet Process Mixture Model,” Journal of Computational and Graphical Statistics, 13, 158–182.

Jefferys, W. H. and Berger, J. O. (1992), “Ockham’s Razor and Bayesian Analysis,” American Scien- tist, 80, 64–72.

Kim, S., Tadesse, M. G., and Vannucci, M. (2006), “Variable Selection in Clustering via Dirichlet Process Mixture Models,” Biometrika, 93, 877–893.

29 Kingma, D. P., Mohamed, S., Rezende, D. J., and Welling, M. (2014), “Semi-supervised learning with deep generative models,” in Advances in neural information processing systems, pp. 3581–3589.

Kiselev, V. Y., Andrews, T. S., and Hemberg, M. (2019), “Challenges in unsupervised clustering of single-cell RNA-seq data,” Nature Reviews Genetics, 20, 273–282.

Legramanti, S., Durante, D., and Dunson, D. B. (2020), “Bayesian cumulative shrinkage for infinite factorizations,” Biometrika, 107, 745–752.

Li, H., Courtois, E. T., Sengupta, D., Tan, Y., Chen, K. H., Goh, J. J. L., Kong, S. L., Chua, C., Hon, L. K., Tan, W. S., et al. (2017), “Reference Component Analysis of Single-Cell Transcriptomes Elucidates Cellular Heterogeneity in Human Colorectal Tumors,” Nature Genetics, 49, 708.

Lun, A. T., Bach, K., and Marioni, J. C. (2016), “Pooling Across Cells to Normalize Single-Cell RNA Sequencing Data with Many Zero Counts,” Genome Biology, 17, 75.

McInnes, L., Healy, J., Saul, N., and Großberger, L. (2018), “UMAP: Uniform Manifold Approxima- tion and Projection for Dimension Reduction,” Journal of Open Source Software, 3, 861.

Miller, J. W. and Dunson, D. B. (2019), “Robust Bayesian Inference via Coarsening,” Journal of the American Statistical Association, 114, 1113–1125, pMID: 31942084.

Miller, J. W. and Harrison, M. T. (2014), “Inconsistency of Pitman-Yor process mixtures for the number of components,” The Journal of Research, 15, 3333–3370.

— (2018), “Mixture Models with a Prior on the Number of Components,” Journal of the American Statistical Association, 113, 340–356.

Murphy, K., Viroli, C., and Gormley, I. C. (2019), IMIFA: Infinite Mixtures of Infinite Factor Analysers and Related Models, R package version 2.1.1.

Pati, D., Bhattacharya, A., Pillai, N. S., and Dunson, D. (2014), “Posterior Contraction in Sparse Bayesian Factor Models for Massive Covariance Matrices,” The Annals of Statistics, 42, 1102–1130.

30 Pitman, J. and Yor, M. (1997), “The Two-Parameter Poisson-Dirichlet Distribution Derived from a Stable Subordinator,” Annals of Probability, 25, 855–900.

Rand, W. M. (1971), “Objective Criteria for the Evaluation of Clustering Methods,” Journal of the American Statistical Association, 66, 846–850.

Ročková, V. and George, E. I. (2016), "Fast Bayesian factor analysis via automatic rotations to sparsity," Journal of the American Statistical Association, 111, 1608–1622.

Rudelson, M. and Vershynin, R. (2013), "Hanson-Wright Inequality and Sub-Gaussian Concentration," Electronic Communications in Probability, 18, 9.

Schiavon, L., Canale, A., and Dunson, D. (2021), “Generalized infinite factorization models,” arXiv:2103.10333.

Scott, S. L., Blocker, A. W., Bonassi, F. V., Chipman, H. A., George, E. I., and McCulloch, R. E. (2016), “Bayes and Big Data: The Consensus Monte Carlo Algorithm,” International Journal of Management Science and Engineering Management, 11, 78–88.

See, P., Lum, J., Chen, J., and Ginhoux, F. (2018), "A Single-Cell Sequencing Guide for Immunologists," Frontiers in Immunology, 9, 2425.

Sethuraman, J. (1994), "A constructive definition of Dirichlet priors," Statistica Sinica, 4, 639–650.

Tadesse, M. G., Sha, N., and Vannucci, M. (2005), "Bayesian Variable Selection in Clustering High-Dimensional Data," Journal of the American Statistical Association, 100, 602–617.

Vershynin, R. (2012), Introduction to the non-asymptotic analysis of random matrices, Cambridge University Press, p. 210–268.

Wade, S. and Ghahramani, Z. (2018), "Bayesian Cluster Analysis: Point Estimation and Credible Balls (with Discussion)," Bayesian Analysis, 13, 559–626.

Wang, X. and Dunson, D. B. (2013), “Parallelizing MCMC via Weierstrass Sampler,” arXiv:1312.4605.

West, M. (1992), Hyperparameter estimation in Dirichlet process mixture models, Duke University ISDS Discussion Paper #92-A03.

ADDITIONAL THEORETICAL RESULTS

In this Supplementary Material we denote by $\|x\|$ the Euclidean norm of a vector $x$ and by $\|X\|_2$ the spectral norm of a matrix $X$. Furthermore, $s_{\min}(X)$ and $s_{\max}(X)$ denote the smallest and largest eigenvalues of $(X^{\mathrm T}X)^{1/2}$, that is, the smallest and largest singular values of $X$. For a positive-definite matrix $X$, $\lambda_{\min}(X)$ and $\lambda_{\max}(X)$ denote its smallest and largest eigenvalues, respectively.

Lemma A.2. Let $\Gamma_p(\cdot)$ be the multivariate gamma function, let $\nu_0 = p + c$ for some constant $c \ge 0$, and let $\ell$ and $m$ be non-negative integers not varying with $p$. Then $\lim_{p\to\infty} \frac{1}{p}\log\bigl\{\Gamma_p\bigl(\tfrac{\nu_0+\ell}{2}\bigr)/\Gamma_p\bigl(\tfrac{\nu_0+m}{2}\bigr)\bigr\} = 0$.

Proof. Without loss of generality assuming ` > m, we have

    ν0+`−j+1 Q` ν0+j ν0+` p Γ Γ Γp( ) Y 2 j=m+1 2 2 = = . (A.16) ν0+m     Γp( ) ν0+m−j+1 Q` ν0+j−p 2 j=1 Γ 2 j=m Γ 2

Note that the denominator of the rightmost expression in (A.16) does not depend on $p$, since $\nu_0 - p = c$ is constant by assumption. Applying Stirling's approximation to the numerator, we get

$$\frac{\Gamma_p\!\left(\frac{\nu_0+\ell}{2}\right)}{\Gamma_p\!\left(\frac{\nu_0+m}{2}\right)} = \frac{1}{\prod_{j=m+1}^{\ell}\Gamma\!\left(\frac{\nu_0+j-p}{2}\right)} \times \prod_{j=m+1}^{\ell}\left\{\sqrt{2\pi\,\frac{\nu_0+j-1}{2}}\left(\frac{\nu_0+j-1}{2e}\right)^{\frac{\nu_0+j-1}{2}}E_j\right\}$$
$$= \frac{1}{\prod_{j=m+1}^{\ell}\Gamma\!\left(\frac{\nu_0+j-p}{2}\right)} \times \prod_{j=m+1}^{\ell}\left\{\sqrt{2\pi e}\left(\frac{\nu_0}{2e}\right)^{\frac{\nu_0+j+1}{2}}\left(1+\frac{j-1}{\nu_0}\right)^{\frac{\nu_0+j+1}{2}}E_j\right\},$$
where the terms $E_j = O(\log p)$ arise from the error in Stirling's approximation. Using the result $\lim_{x\to\infty}(1+c/x)^x = e^c$, it can be seen that
$$\lim_{p\to\infty}\prod_{j=m+1}^{\ell}\left\{\sqrt{2\pi e}\left(\frac{\nu_0}{2e}\right)^{\frac{\nu_0+j+1}{2}}\left(1+\frac{j-1}{\nu_0}\right)^{\frac{\nu_0+j+1}{2}}\right\} = (2\pi e)^{\frac{\ell-m}{2}}\left(\frac{\nu_0}{2e}\right)^{\frac14(\ell-m)(2\nu_0+m+\ell+3)}e^{\frac12(\ell-m)(\ell+m-1)},$$

which is a finite quantity. Hence the proof.
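As a purely illustrative aside (not part of the original argument), the ratio of multivariate gamma functions in (A.16) can be evaluated numerically with `scipy.special.multigammaln`; the constants $c$, $\ell$ and $m$ below are arbitrary choices of ours.

```python
from scipy.special import multigammaln  # log of the multivariate gamma function Gamma_p(a)

c, ell, m = 2, 2, 0  # illustrative values; Lemma A.2 allows any fixed c >= 0 and non-negative integers ell, m

for p in [50, 200, 1000, 5000]:
    nu0 = p + c
    # log Gamma_p((nu0 + ell)/2) - log Gamma_p((nu0 + m)/2), normalized by p as in the lemma statement
    log_ratio = multigammaln((nu0 + ell) / 2, p) - multigammaln((nu0 + m) / 2, p)
    print(p, log_ratio / p)
```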

Lemma A.3. For any $n\times p$ matrix $Y$ satisfying (A0), $\lim_{p\to\infty}\bigl\|Y(I_p+Y^{\mathrm T}Y)^{-1}Y^{\mathrm T}-I_n\bigr\|_2 = 0$ in $\mathbb P_0^p$-probability.

Proof. Letting $Y = UDV^{\mathrm T}$ be the singular value decomposition of $Y$, we have $Y(I_p+Y^{\mathrm T}Y)^{-1}Y^{\mathrm T} = U\,\mathrm{diag}\bigl\{\tfrac{d_1^2}{1+d_1^2},\dots,\tfrac{d_n^2}{1+d_n^2}\bigr\}U^{\mathrm T}$, where $d_1,\dots,d_n$ are the singular values of $Y$ in descending order. From (A0) we have $\liminf_{p\to\infty} d_i^2/p > 0$, which implies that $d_i^2/(1+d_i^2)\to 1$ for all $i=1,\dots,n$. Since $d_i^2/(1+d_i^2)\le 1$, it follows that $\lim_{p\to\infty}\bigl\|Y(I_p+Y^{\mathrm T}Y)^{-1}Y^{\mathrm T}-I_n\bigr\|_2 = 0$ in $\mathbb P_0^p$-probability.
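The behavior described in Lemma A.3 can be illustrated numerically; the sketch below uses i.i.d. standard normal rows, which is an arbitrary choice of ours (not part of the lemma) that satisfies (A0) with high probability, and exploits the push-through identity $Y(I_p+Y^{\mathrm T}Y)^{-1}Y^{\mathrm T} = YY^{\mathrm T}(I_n+YY^{\mathrm T})^{-1}$ to avoid forming $p\times p$ matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
for p in [100, 1000, 10000]:
    Y = rng.standard_normal((n, p))        # illustrative data matrix with growing dimension p
    G = Y @ Y.T                            # n x n Gram matrix
    M = G @ np.linalg.inv(np.eye(n) + G)   # equals Y (I_p + Y'Y)^{-1} Y' by the push-through identity
    print(p, np.linalg.norm(M - np.eye(n), 2))   # spectral-norm distance to I_n shrinks as p grows
```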

Lemma A.4. Let $\widetilde Y$ be an $\widetilde n\times p$ matrix formed by arbitrarily selecting $\widetilde n$ rows of $Y = [y_1,\dots,y_n]^{\mathrm T}$, where $1\le\widetilde n\le n$. If $Y$ satisfies (A0), then $\lim_{p\to\infty}\bigl\|\widetilde Y(I_p+\widetilde Y^{\mathrm T}\widetilde Y)^{-1}\widetilde Y^{\mathrm T}-I_{\widetilde n}\bigr\|_2 = 0$ in $\mathbb P_0^p$-probability.

Proof. Letting $Y = UDV^{\mathrm T}$ be the singular value decomposition of $Y$, we have $\widetilde Y = \widetilde U D V^{\mathrm T}$, where $\widetilde U$ consists of the rows of $U$ corresponding to the rows of $Y$ used to form $\widetilde Y$. Using Pati et al. (2014, Lemma 1.1(iii) of the Supplementary section), we have $s_{\min}(\widetilde Y^{\mathrm T}) \ge s_{\min}(V)\,s_{\min}(D^{\mathrm T})\,s_{\min}(\widetilde U^{\mathrm T})$. Since $s_{\min}(V) = 1$ and $s_{\min}(\widetilde U^{\mathrm T}) = s_{\min}(U^{\mathrm T}) = 1$, this gives $s_{\min}(\widetilde Y^{\mathrm T}) \ge s_{\min}(D^{\mathrm T}) = s_{\min}(Y^{\mathrm T})$. Therefore $\widetilde Y$ also satisfies (A0), and applying Lemma A.3 to $\widetilde Y$ concludes the proof.

Lemma A.5. Let $Y$ be an $n\times p$ matrix satisfying (A0), and let $Y_i = [y_{j_{i,1}},\dots,y_{j_{i,n_i}}]^{\mathrm T}$, $i=1,2$, be an arbitrary partition of the data matrix into two sub-matrices with $n_1+n_2=n$. Then $\limsup_{p\to\infty} s_{\max}(Z) < 1$ in $\mathbb P_0^p$-probability, where $Z = (I_{n_1}+Y_1Y_1^{\mathrm T})^{-1/2}Y_1Y_2^{\mathrm T}(I_{n_2}+Y_2Y_2^{\mathrm T})^{-1/2}$.

Proof. From (A0) we have $\|YY^{\mathrm T}\|_2 = O(p)$ and $\liminf_{p\to\infty}\lambda_{\min}(YY^{\mathrm T})/p > 0$, which implies that

$$0 < \liminf_{p\to\infty}\bigl|(I_n+YY^{\mathrm T})/p\bigr| \le \limsup_{p\to\infty}\bigl|(I_n+YY^{\mathrm T})/p\bigr| = O(1) \quad\text{in } \mathbb P_0^p\text{-probability},\tag{A.17}$$

where $|\cdot|$ denotes the determinant. Following the proof of Lemma A.4, we see that $Y_i$ also satisfies (A0), and therefore (A.17) also holds with $Y$ replaced by $Y_i$ for $i=1,2$. Noting that
$$I_n + YY^{\mathrm T} = \begin{pmatrix} I_{n_1}+Y_1Y_1^{\mathrm T} & Y_1Y_2^{\mathrm T}\\ Y_2Y_1^{\mathrm T} & I_{n_2}+Y_2Y_2^{\mathrm T}\end{pmatrix},$$
and using standard matrix factorization results, we have

$$\Bigl|\frac1p(I_n+YY^{\mathrm T})\Bigr| = \Bigl|\frac1p(I_{n_1}+Y_1Y_1^{\mathrm T})\Bigr|\,\Bigl|\frac1p(I_{n_2}+Y_2Y_2^{\mathrm T})\Bigr|\,\bigl|I_{n_1}-ZZ^{\mathrm T}\bigr|.\tag{A.18}$$

Again, in $\mathbb P_0^p$-probability,

$$\limsup_{p\to\infty} s_{\max}^2(Z) \le \limsup_{p\to\infty}\bigl\|Y_1^{\mathrm T}(I_{n_1}+Y_1Y_1^{\mathrm T})^{-1}Y_1\bigr\|_2\,\bigl\|Y_2^{\mathrm T}(I_{n_2}+Y_2Y_2^{\mathrm T})^{-1}Y_2\bigr\|_2 \le 1.\tag{A.19}$$

For (A.17) to hold, all the terms on the right-hand side of (A.18) must be bounded away from 0. As $|I_{n_1}-ZZ^{\mathrm T}| = \prod_{j=1}^{n_1}\{1-s_j^2(Z)\}$, the inequality in (A.19) must be strict. Thus we conclude the proof.
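Both the determinant factorization (A.18) and the strict bound on $s_{\max}(Z)$ can be checked numerically; the following sketch uses an arbitrary illustrative split with i.i.d. Gaussian rows (our choice, not part of the lemma).

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

rng = np.random.default_rng(1)
n1, n2, p = 3, 3, 2000
Y1, Y2 = rng.standard_normal((n1, p)), rng.standard_normal((n2, p))
Y, n = np.vstack([Y1, Y2]), n1 + n2

A = np.eye(n1) + Y1 @ Y1.T
C = np.eye(n2) + Y2 @ Y2.T
Z = inv_sqrt(A) @ (Y1 @ Y2.T) @ inv_sqrt(C)

lhs = np.linalg.det((np.eye(n) + Y @ Y.T) / p)
rhs = np.linalg.det(A / p) * np.linalg.det(C / p) * np.linalg.det(np.eye(n1) - Z @ Z.T)
print(lhs, rhs)                                   # the two sides of (A.18) coincide
print(np.linalg.svd(Z, compute_uv=False).max())   # s_max(Z) is strictly below 1
```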

Lemma A.6. For prior (6), $\lim_{p\to\infty}\frac1p\lambda_{\min}(\Lambda^{\mathrm T}\Lambda) = \lim_{p\to\infty}\frac1p\lambda_{\max}(\Lambda^{\mathrm T}\Lambda) = v_1$ for some $v_1 > 0$, $\Pi$-a.s.

Proof. Note that $\Lambda^{\mathrm T}\Lambda = \tau^2 T^{\mathrm T}T$, where the $(i,j)$-th element of $T$ is $t_{ij} = e_{ij}\phi_{ij}$ with $e_{ij}\overset{iid}{\sim}\mathrm{DE}(1)$, $\phi\sim\mathrm{Dir}(a,\dots,a)$ and $\tau\sim\mathrm{Ga}(pda,1/2)$. Therefore $\tau\overset{d}{=}\sum_{i=1}^{p}\sum_{j=1}^{d}\tau_{ij}$ and $\phi\overset{d}{=}\bigl(\sum_{i=1}^p\sum_{j=1}^d\gamma_{ij}\bigr)^{-1}(\gamma_{11},\dots,\gamma_{pd})$, where $\gamma_{ij}\overset{iid}{\sim}\mathrm{Ga}(a,1/2)$ and $\tau_{ij}\overset{iid}{\sim}\mathrm{Ga}(a,1/2)$. Hence $T$ can be written as $T = \Gamma^{-1}\widetilde T$, where $\tilde t_{ij} = e_{ij}\gamma_{ij}$ and $\Gamma = \sum_{i=1}^p\sum_{j=1}^d\gamma_{ij}$. Thus $\lambda_i(\Lambda^{\mathrm T}\Lambda) = (\tau/\Gamma)^2\lambda_i(\widetilde T^{\mathrm T}\widetilde T)$ for all $i=1,\dots,d$. Now, by the strong law of large numbers, $\bigl\|\frac1p\widetilde T^{\mathrm T}\widetilde T - v_1 I_d\bigr\|_F\to 0$ as $p\to\infty$, where $\|\cdot\|_F$ is the Frobenius norm of a matrix and $v_1 = \mathrm{Var}(e_{ij}\gamma_{ij})$. Hence, for any $i=1,\dots,d$, $\lim_{p\to\infty}\lambda_i(\widetilde T^{\mathrm T}\widetilde T)/p = v_1$. Also, $\lim_{p\to\infty}\tau/(pd) = \mathrm E(\tau_{ij})$ and $\lim_{p\to\infty}\Gamma/(pd) = \mathrm E(\gamma_{ij}) = \mathrm E(\tau_{ij})$, which implies that $\lim_{p\to\infty}\tau/\Gamma = 1$ $\Pi$-a.s. Hence the proof.
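A brief Monte Carlo sketch of the limit in Lemma A.6, simulating the construction described in the proof ($e_{ij}\sim\mathrm{DE}(1)$, $\phi\sim\mathrm{Dir}(a,\dots,a)$, $\tau\sim\mathrm{Ga}(pda,1/2)$); the hyperparameter $a$ and the latent dimension $d$ below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
a, d = 0.5, 4                     # illustrative hyperparameter and latent dimension
v1 = 2.0 * (4 * a + 4 * a**2)     # Var(e * gamma) = E(e^2) E(gamma^2) for e ~ DE(1), gamma ~ Ga(a, rate 1/2)

for p in [500, 5000, 50000]:
    e = rng.laplace(scale=1.0, size=(p, d))            # e_ij ~ DE(1)
    gam = rng.gamma(shape=a, scale=2.0, size=(p, d))   # gamma_ij ~ Ga(a, rate 1/2)
    phi = gam / gam.sum()                              # phi ~ Dir(a, ..., a) over all p*d entries
    tau = rng.gamma(shape=p * d * a, scale=2.0)        # tau ~ Ga(pda, rate 1/2)
    Lam = tau * e * phi                                # lambda_ij = tau * e_ij * phi_ij
    eigs = np.linalg.eigvalsh(Lam.T @ Lam / p)
    print(p, eigs.min(), eigs.max(), "target v1 =", v1)   # extreme eigenvalues approach v1 as p grows
```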

A. PROOF OF THEOREM 2 AND ASSOCIATED RESULTS

To prove Theorem 2, we consider an adaptation of Theorem 6.39 in Ghosal and Van Der Vaart (2017) in which, in place of having an increasing sample size, we assume an increasing data dimension with fixed sample size. This notion is consistent with the idea that more and more variables are measured

on each study subject. We introduce the following notation. Let $\vartheta = (\Lambda,\eta,\sigma)$ with $\eta = [\eta_1,\dots,\eta_n]^{\mathrm T}$ and $\vartheta\in\Theta_p$. Let $\mathbb P_\vartheta^p$ and $\mathbb P_0^p$ be the joint distributions of the data $y_1,\dots,y_n$ given $\vartheta$ and $\vartheta_0$, respectively, with $\vartheta_0 = (\Lambda_0,\eta_0,\sigma_0)$. We also denote the expectation of a function $g$ with respect to $\mathbb P_0^p$ and $\mathbb P_\vartheta^p$ by $\mathbb P_0^p g$ and $\mathbb P_\vartheta^p g$, respectively. Let $p_0^p$ and $p_\vartheta^p$ be the densities of $\mathbb P_0^p$ and $\mathbb P_\vartheta^p$ with respect to the Lebesgue measure. Finally, define the Kullback-Leibler (KL) divergence and the $r$-th order positive KL-variation between $p_0^p$ and $p_\vartheta^p$, respectively, as
$$\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p) = \int\log\frac{p_0^p}{p_\vartheta^p}\,d\mathbb P_0^p \quad\text{and}\quad V_r^+(\mathbb P_0^p,\mathbb P_\vartheta^p) = \int\left\{\left(\log\frac{p_0^p}{p_\vartheta^p}-\mathrm{KL}\right)^{\!+}\right\}^{r}d\mathbb P_0^p,$$
where $f^+$ denotes the positive part of a function $f$.
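For concreteness, the two functionals can be approximated by Monte Carlo; the sketch below does so for a pair of isotropic Gaussians (the dimension, means and variances are arbitrary illustrative choices) and compares the KL estimate with the closed form used later in (A.20).

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

p = 20
mu0, sig0 = np.zeros(p), 1.0        # "true" parameters (illustrative)
mu, sig = 0.3 * np.ones(p), 1.2     # "model" parameters (illustrative)

P0 = mvn(mean=mu0, cov=sig0**2 * np.eye(p))
Pth = mvn(mean=mu, cov=sig**2 * np.eye(p))

y = P0.rvs(size=100000, random_state=3)
log_ratio = P0.logpdf(y) - Pth.logpdf(y)            # log(p_0 / p_theta) evaluated under P_0

kl_mc = log_ratio.mean()                            # Monte Carlo estimate of KL(P_0, P_theta)
v2_plus = (np.maximum(log_ratio - kl_mc, 0.0) ** 2).mean()   # Monte Carlo estimate of V_2^+(P_0, P_theta)
kl_closed = 0.5 * (p * np.log(sig**2 / sig0**2) + p * (sig0**2 / sig**2 - 1)
                   + np.sum((mu - mu0) ** 2) / sig**2)
print(kl_mc, kl_closed, v2_plus)
```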

Theorem 3. Suppose that for some $r\ge 2$ and $c>0$ there exist measurable sets $B_p\subset\Theta_p$ with $\liminf\Pi(B_p) > 0$ such that:

(I) $\sup_{\vartheta\in B_p}\frac1p\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p)\le c$ and $\sup_{\vartheta\in B_p}\frac{1}{p^r}V_r^+(\mathbb P_0^p,\mathbb P_\vartheta^p)\to 0$;

(II) for sets $\widetilde\Theta_p\subset\Theta_p$ there exists a sequence of test functions $\phi_p$ such that $\phi_p\to 0$ $\mathbb P_0^p$-a.s. and $\int_{\widetilde\Theta_p}\mathbb P_\vartheta^p(1-\phi_p)\,d\Pi(\vartheta)\le e^{-Cp}$ for some $C>0$;

(III) letting $A_p = \bigl\{\frac1p\bigl|\int\bigl(\log\frac{p_0^p}{p_\vartheta^p}-\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p)\bigr)\,d\widetilde\Pi_p(\vartheta)\bigr| < \tilde\epsilon\bigr\}$, with $\widetilde\Pi_p$ the renormalized restriction of $\Pi$ to the set $B_p$, we have $\mathbb 1(\bar A_p)\to 0$ $\mathbb P_0^p$-a.s. for any $\tilde\epsilon > 0$.

Then $\Pi(\widetilde\Theta_p\mid\mathbf Y)\to 0$ $\mathbb P_0^p$-a.s.

Condition (I) ensures that the assumed model is not too far from the true data-generating model and controls the variability of the log-likelihood around its mean; in the Lamb model the number of parameters grows with $p$, and hence the assumption on $V_r^+$ is instrumental. The conditions on $\phi_p$ in (II) ensure the existence of a sequence of consistent test functions for $H_0: \mathbb P = \mathbb P_0^p$ whose type-II error diminishes to 0 exponentially fast on the critical region. Condition (III) is a technical condition required to bound the numerator of $\Pi(\widetilde\Theta_p\mid\mathbf Y)$. The proof of Theorem 3 follows along the lines of the proof of Theorem 6.39 of Ghosal and Van Der Vaart (2017). Theorem 3 is a general result stating sufficient conditions for posterior consistency as $p\to\infty$; we use it to prove Theorem 2.

Proof of Theorem 2. We verify conditions (I)-(III) of Theorem 3. Theorems 4 and 5 below jointly imply that, for the Lamb model, there exists a sequence of sets $B_p^\epsilon$ such that condition (I) is satisfied for any $c>0$. Theorem 6 ensures the existence of a sequence of test functions satisfying (II), and finally Theorem 7 proves (III). Hence the proof.

Theorem 4. For any $\epsilon>0$ define $B_p^\epsilon = \{\vartheta\in\Theta_p: p^{-1}\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p)\le\epsilon\}$. Then, under the settings of Section 4, $\liminf\Pi(B_p^\epsilon) > 0$.

Proof. Let $P_0$ and $P$ be $p$-variate normal distributions, $P = \mathrm N_p(\mu,\Sigma)$ and $P_0 = \mathrm N_p(\mu_0,\Sigma_0)$. Their Kullback-Leibler divergence is $\mathrm{KL}(P_0,P) = \frac12\bigl\{\log\frac{|\Sigma|}{|\Sigma_0|} + \mathrm{tr}(\Sigma^{-1}\Sigma_0) + (\mu-\mu_0)^{\mathrm T}\Sigma^{-1}(\mu-\mu_0) - p\bigr\}$, which, under the settings of Section 4, simplifies to

$$\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p) = \frac12\left\{np\log\frac{\sigma^2}{\sigma_0^2} + np\left(\frac{\sigma_0^2}{\sigma^2}-1\right) + \frac{1}{\sigma^2}\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2\right\},\tag{A.20}$$

where $\mu_i = \Lambda\eta_i$ and $\mu_{0i} = \Lambda_0\eta_{0i}$. Now,

$$\Pi\bigl\{p^{-1}\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p) < \epsilon\bigr\} = \Pi\left[\frac12\left\{n\log\frac{\sigma^2}{\sigma_0^2} + n\left(\frac{\sigma_0^2}{\sigma^2}-1\right) + \frac{1}{p\sigma^2}\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2\right\} < \epsilon\right]$$
$$\ge \Pi\left\{\log\frac{\sigma^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma^2} - 1 \le \frac{\epsilon}{n},\; \frac{1}{\sigma^2}\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2 < \epsilon p\right\}.$$

Note that for any $x>0$, $\log x \le x-1$, and therefore $\log\frac{\sigma^2}{\sigma_0^2} + \frac{\sigma_0^2}{\sigma^2} - 1 \le \bigl(\frac{\sigma}{\sigma_0}-\frac{\sigma_0}{\sigma}\bigr)^2$, implying that

$$\Pi\bigl\{p^{-1}\mathrm{KL}(\mathbb P_0^p,\mathbb P_\vartheta^p) < \epsilon\bigr\} \ge \Pi\left\{\left(\frac{\sigma_0}{\sigma}-\frac{\sigma}{\sigma_0}\right)^2 \le \frac{\epsilon}{n},\; \frac{1}{\sigma^2}\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2 < \epsilon p\right\}$$
$$\ge \Pi\left\{\left(\frac{\sigma_0}{\sigma}-\frac{\sigma}{\sigma_0}\right)^2 \le \frac{\epsilon}{n}\right\}\,\Pi\left\{\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2 < \epsilon\sigma_L^2 p\right\},$$

where the second inequality holds thanks to condition (C3). The first factor above is positive under our

proposed prior on $\sigma$. Now consider the second factor and note that, for each $i=1,\dots,n$,
$$\|\mu_i-\mu_{0i}\|^2 = \bigl\|\Lambda\{\eta_i-(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\eta_{0i}\}\bigr\|^2 + \eta_{0i}^{\mathrm T}\{\Lambda_0^{\mathrm T}\Lambda_0-\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\}\eta_{0i}.$$
By the triangle inequality,

$$\frac1p\bigl\|\Lambda_0^{\mathrm T}\Lambda_0-\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\bigr\|_2 \le \frac1p\bigl\|\Lambda_0^{\mathrm T}\Lambda_0-M\bigr\|_2 + \frac1p\bigl\|M-\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\bigr\|_2.\tag{A.21}$$

The first term on the right hand side of (A.21) goes to 0 as p → ∞ by (C2). Let us define the matrix

$B = (\Lambda^{\mathrm T}\Lambda)^{-1/2}\Lambda^{\mathrm T}$, with $\|B\|_2 = 1$, and $\widetilde\Lambda_0 = \Lambda_0 M^{-1/2}$, where $M^{1/2}$ is the Cholesky factor of $M$. From Vershynin (2012, Theorem 5.39) it follows that, for any $0<\epsilon<1$ and large enough $p$, $1-\epsilon \le \frac{1}{\sqrt p}\|\widetilde\Lambda_0\|_2 \le 1+\epsilon$. Again, from Lemma 1.1 of the Supplementary section of Pati et al. (2014), we have $1-\epsilon \le \frac{1}{\sqrt p}\|B\widetilde\Lambda_0\|_2 \le 1+\epsilon$ and $1-\epsilon \le \frac{1}{\sqrt p}s_{\min}(B\widetilde\Lambda_0) \le 1+\epsilon$. Therefore $\lim_{p\to\infty}\lambda_i\{\widetilde\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\widetilde\Lambda_0\}p^{-1} = 1$ for all $i=1,\dots,d$. Now $\frac1p\|M-\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\|_2 \le \|M\|_2\,\frac1p\|I_d-\widetilde\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\widetilde\Lambda_0\|_2$, and therefore the second term on the right-hand side of (A.21) also goes to 0 as $p\to\infty$. Subsequently, we have $\lim_{p\to\infty}\frac1p\eta_{0i}^{\mathrm T}\{\Lambda_0^{\mathrm T}\Lambda_0-\Lambda_0^{\mathrm T}\Lambda(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\}\eta_{0i} = 0$ for all $i=1,\dots,n$. Now (C2) and Lemma A.6 jointly imply that $\|(\Lambda^{\mathrm T}\Lambda)^{-1}\Lambda^{\mathrm T}\Lambda_0\|_2 = O(1)$ $\Pi$-a.s. Therefore, for standard normal priors on the latent variables, $\liminf_{p\to\infty}\Pi\bigl\{\sum_{i=1}^n(\mu_i-\mu_{0i})^{\mathrm T}(\mu_i-\mu_{0i}) < \epsilon\sigma_L^2 p\bigr\} > 0$. From the permanence of the KL property of mixture priors (Ghosal and Van Der Vaart 2017, Proposition 6.28), we conclude that the second factor is also positive under the proposed mixture prior on the latent variables, completing the proof.
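The orthogonal decomposition of $\|\mu_i-\mu_{0i}\|^2$ used at the beginning of the proof can be verified numerically; the matrices below are arbitrary illustrative draws, not draws from the actual priors.

```python
import numpy as np

rng = np.random.default_rng(4)
p, d = 200, 3
Lam, Lam0 = rng.standard_normal((p, d)), rng.standard_normal((p, d))   # illustrative loadings
eta, eta0 = rng.standard_normal(d), rng.standard_normal(d)             # illustrative latent variables

mu, mu0 = Lam @ eta, Lam0 @ eta0
P = Lam @ np.linalg.solve(Lam.T @ Lam, Lam.T)   # orthogonal projector onto the column space of Lambda

lhs = np.sum((mu - mu0) ** 2)
term1 = np.sum((Lam @ (eta - np.linalg.solve(Lam.T @ Lam, Lam.T @ Lam0 @ eta0))) ** 2)
term2 = eta0 @ (Lam0.T @ Lam0 - Lam0.T @ P @ Lam0) @ eta0
print(lhs, term1 + term2)   # the two sides agree, confirming the decomposition
```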

Theorem 5. On the set $B_p^\epsilon$ defined in Theorem 4, we have $V_r^+(\mathbb P_0^p,\mathbb P_\vartheta^p) = o(p^r)$ for $r=2$.

Proof. For $r=2$, $V_r^+(\mathbb P_0^p,\mathbb P_\vartheta^p) \le \int\bigl(\log\frac{p_0^p}{p_\vartheta^p}\bigr)^2 d\mathbb P_0^p - \bigl(\int\log\frac{p_0^p}{p_\vartheta^p}\,d\mathbb P_0^p\bigr)^2$. Conditionally on $\vartheta$, the observations $y_1,\dots,y_n$ are independent. Therefore,

$$V_r^+(\mathbb P_0^p,\mathbb P_\vartheta^p) \le \sum_{j=1}^n\left[\int\left\{\log\frac{p_{0j}(y_j)}{p_{\vartheta j}(y_j)}\right\}^2 p_{0j}(y_j)\,dy_j - \left\{\int\log\frac{p_{0j}(y_j)}{p_{\vartheta j}(y_j)}\,p_{0j}(y_j)\,dy_j\right\}^2\right],\tag{A.22}$$

where $p_{0j}(y_j) = \prod_{i=1}^p\mathrm N(y_{ji};\mu_{0ji},\sigma_0^2)$ and $p_{\vartheta j}(y_j) = \prod_{i=1}^p\mathrm N(y_{ji};\mu_{ji},\sigma^2)$, with $\mu_{0j} = \Lambda_0\eta_{0j}$ and $\mu_j = \Lambda\eta_j$. We first show the result for a single term inside the summation in (A.22); since $\|\eta_{0i}\| = O(1)$ and $n$ is fixed, the general result readily follows. For simplicity, we drop the subscript $j$ henceforth. Consider

 2  2 ( 2  2) p0(yi) σ 1 yi − µ0i yi − µi log = log − −  pϑ(yi) σ0 2 σ0 σ ( )2 ( ) 1 y − µ 2 y − µ 2 σ y − µ 2 y − µ 2 σ = i 0i − i i + log2 − i 0i − i i log . 4 σ0 σ σ0 σ0 σ σ0

37 Note that,

2  2 ( 2  2) 2 !  2 yi − µ0i yi − µi  2 σ0 σ0 µi − µ0i  − = zi 1 − 2 − 2zi(µ0i − µi) + σ0 σ  σ σ σ  !2 ! σ2 µ − µ 2 µ − µ 4 σ2 σ = z4 1 − 0 + 4z2σ2 0i i + i 0i − 2z3 1 − 0 0 (µ − µ ) i σ2 i 0 σ σ i σ2 σ 0i i ! µ − µ 3 µ − µ 2 σ2 − 2z σ 0i i + 2z2 0i i 1 − 0 i 0 σ i σ σ2

iid where zi = (yi − µ0i)/σ0 and zi ∼ N(0, 1). Therefore,

( 2  2) 2 !  2 yi − µ0i yi − µi σ0 µi − µ0i Eyi − = 1 − 2 + and σ0 σ σ σ

2 2 ( 2  2) 2 !  2  4 yi − µ0i yi − µi σ0 2 µ0i − µi µi − µ0i Eyi − = 3 1 − 2 + 4σ0 + σ0 σ σ σ σ ! µ − µ 2 σ2 + 2 0i i 1 − 0 . σ σ2

Hence,

  Z  2  2 2 ! p0(yi) µ0i − µi  2 1 σ0 σ  log p0(yi)dyi = × σ0 + 1 − 2 − log pϑ(yi) σ  2 σ σ0  2 2 !  4 2 ! σ σ0 1 µ0i − µi 3 σ0 2 σ − log 1 − 2 + + 1 − 2 + log σ0 σ 4 σ 4 σ σ0

2 2 Z 2 ( 2 2 ) 2 ! p0(yi) σ σ0 + (µ0i − µi) 1 2 σ 1 σ0 log p0(yi)dyi = log + 2 − = log + 1 − 2 pϑ(yi) σ0 2σ 2 σ0 4 σ    4  2 2 ! 2 ! 1 µ0i − µi µ0i − µi  σ 1 σ0  σ σ0 + + × log − 1 − 2 − log 1 − 2 , 4 σ σ  σ0 2 σ  σ0 σ

38 leading to

p "Z  2 Z 2# X p0(yi) p0(yi) V +( p, p ) ≤ log p (y )dy − log p (y )dy r P0 Pϑ p (y ) 0 i i p (y ) 0 i i i=1 ϑ i ϑ i 2   2 ! 2 ! p  2 p σ  σ σ  X µ0i − µi = 1 − 0 + σ2 − 2 log + 1 − 0 × . (A.23) 2 σ2 0 σ σ2 σ  0  i=1

Note that

$$\sum_{i=1}^p(\mu_{0i}-\mu_i)^2 = \sum_{i=1}^p\bigl(\lambda_{0i}^{\mathrm T}\eta_0-\lambda_i^{\mathrm T}\eta\bigr)^2 = \eta_0^{\mathrm T}\Lambda_0^{\mathrm T}\Lambda_0\eta_0 + \eta^{\mathrm T}\Lambda^{\mathrm T}\Lambda\eta - 2\eta_0^{\mathrm T}\Lambda_0^{\mathrm T}\Lambda\eta.\tag{A.24}$$

Now $\eta_0^{\mathrm T}\Lambda_0^{\mathrm T}\Lambda_0\eta_0 \le \|\Lambda_0\|_2^2\|\eta_0\|^2$ and therefore, by conditions (C2) and (C4), $\eta_0^{\mathrm T}\Lambda_0^{\mathrm T}\Lambda_0\eta_0 = O(p)$. Also, from Lemma A.6, $\frac1p\|\Lambda\|_2^2 \le c$ for large enough $p$ and some $c>0$, and therefore $\eta^{\mathrm T}\Lambda^{\mathrm T}\Lambda\eta \le \|\Lambda\|_2^2\|\eta\|^2 = O(p)\|\eta\|^2$. From the proof of Theorem 4 we see that, on the set $B_p^\epsilon$, $\|\eta\|$ is bounded, and the cross term in (A.24) is controlled by the Cauchy-Schwarz inequality. We have thus shown that every term in (A.24), and hence in (A.23), is almost surely $O(p)$ for large enough $p$, so that $V_2^+(\mathbb P_0^p,\mathbb P_\vartheta^p) = o(p^2)$. Hence the proof.
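A quick Monte Carlo check of the per-coordinate variance entering (A.23), as reconstructed above (the specific numerical values are arbitrary):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
mu0i, mui, sig0, sig = 0.7, -0.2, 1.0, 1.3    # illustrative values

y = rng.normal(mu0i, sig0, size=2_000_000)
log_ratio = norm.logpdf(y, mu0i, sig0) - norm.logpdf(y, mui, sig)

var_mc = log_ratio.var()
var_closed = 0.5 * (1 - sig0**2 / sig**2) ** 2 + (sig0**2 / sig**2) * ((mu0i - mui) / sig) ** 2
print(var_mc, var_closed)   # Monte Carlo variance matches the closed-form expression
```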

Theorem 6. Define the test function $\phi_p = \mathbb 1\bigl\{\bigl|\frac{1}{\sqrt{np}\,\sigma_0}\bigl\|\sum_{i=1}^n(y_i-\Lambda_0\eta_{0i})\bigr\| - 1\bigr| > \tau\bigr\}$ for testing $H_0: y_1,\dots,y_n\sim\mathbb P_0^p$ versus $H_1: H_0$ is false, where $\tau$ is a positive real number, and define the set $\widetilde\Theta_p = \bar B_{p,\delta}$. Then there exists a constant $C>0$ such that $\phi_p\to 0$ $\mathbb P_0^p$-a.s. and $\int_{\widetilde\Theta_p}\mathbb P_\vartheta^p(1-\phi_p)\,d\Pi(\vartheta)\le e^{-Cp}$.

Proof. Let $\mu_i = \Lambda\eta_i$ and $\mu_{0i} = \Lambda_0\eta_{0i}$. Under $H_0$, $\frac{1}{\sqrt n\,\sigma_0}\sum_{i=1}^n(y_i-\Lambda_0\eta_{0i})\sim\mathrm N_p(0,I_p)$, and therefore $\frac{1}{\sqrt{np}\,\sigma_0}\sum_{i=1}^n(y_i-\Lambda_0\eta_{0i})\overset{d}{=}\omega/\sqrt p$, where $\omega\sim\mathrm N_p(0,I_p)$. From Rudelson and Vershynin (2013, Theorem 2.1), for some $c>0$ and any $\tau>0$, $\mathbb P_0^p\phi_p = \Pr\bigl(\bigl|\frac{1}{\sqrt p}\|\omega\|-1\bigr| > \tau\bigr) \le 2\exp(-pc\tau^2)$. Since $\sum_{p=1}^\infty\mathbb P_0^p\phi_p < \infty$, the Borel-Cantelli lemma implies $\phi_p\to 0$ $\mathbb P_0^p$-a.s.

When $H_0$ is not true, i.e., under $\mathbb P_\vartheta^p$, $y_i\overset{d}{=}\sigma\varphi_i+\Lambda\eta_i$ with $\varphi_i\overset{iid}{\sim}\mathrm N_p(0,I_p)$ for some $\vartheta\ne(\Lambda_0,\eta_0,\sigma_0)$, and therefore

$$\mathbb P_\vartheta^p(1-\phi_p) \le \Pr\left\{\frac{1}{\sqrt{pn}\,\sigma_0}\Bigl\|\sum_{i=1}^n(\sigma\varphi_i+\Lambda\eta_i-\Lambda_0\eta_{0i})\Bigr\| < 1+\tau\right\}$$
$$\le \Pr\left\{\frac{1}{\sqrt{pn}\,\sigma_0}\Bigl\|\sum_{i=1}^n(\Lambda\eta_i-\Lambda_0\eta_{0i})\Bigr\| - 1 - \tau - \frac{\sigma}{\sigma_0} \le \frac{\sigma}{\sigma_0}\left(\frac{1}{\sqrt{np}}\Bigl\|\sum_{i=1}^n\varphi_i\Bigr\| - 1\right)\right\}.\tag{A.25}$$

Notably, for $\vartheta\in\widetilde\Theta_p$, $\frac{1}{\sqrt{pn}\,\sigma_0}\|\sum_{i=1}^n(\Lambda\eta_i-\Lambda_0\eta_{0i})\|$ is unbounded above as $p$ increases, while $\sigma/\sigma_0$ is bounded thanks to (C3). Letting $C_p = \frac{1}{\sqrt{pn}\,\sigma_0}\|\sum_{i=1}^n(\Lambda\eta_i-\Lambda_0\eta_{0i})\| - 1 - \tau - \frac{\sigma}{\sigma_0}$, we have $\liminf_{p\to\infty}C_p > 0$. Therefore, from Rudelson and Vershynin (2013, Theorem 2.1), for $\vartheta\in\widetilde\Theta_p$, $\mathbb P_\vartheta^p(1-\phi_p)\le 2\exp(-pnc\,C_p^2)$. Hence the proof.
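The concentration of $\|\omega\|/\sqrt p$ around 1 used in both directions of the argument can be visualized with a few lines (a purely illustrative simulation; the threshold $\tau$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
tau, reps = 0.05, 5000
for p in [100, 1000, 4000]:
    omega = rng.standard_normal((reps, p))                        # omega ~ N_p(0, I_p), repeated draws
    dev = np.abs(np.linalg.norm(omega, axis=1) / np.sqrt(p) - 1)  # | ||omega|| / sqrt(p) - 1 |
    print(p, (dev > tau).mean())   # empirical exceedance probability decays rapidly in p
```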

Theorem 7. Let $\widetilde\Pi_p$ be the renormalized restriction of $\Pi$ to the set $B_p^\epsilon$ defined in Theorem 4. Then $\mathbb 1\{\bar A_p\}\to 0$ $\mathbb P_0^p$-a.s.

Proof. If we can show that $\sum_{p=1}^\infty\mathbb P_0^p(\bar A_p) < \infty$, then by the Borel-Cantelli lemma $\mathbb P_0(\limsup_p\bar A_p) = 0$, and hence $\mathbb 1\{\bar A_p\}\to 0$ $\mathbb P_0^p$-a.s. Now

$$\mathbb P_0^p(\bar A_p) = \mathbb P_0^p\left[\frac1p\left|\int\sum_{i=1}^n\left\{\frac{1}{\sigma^2}\|y_i-\mu_i\|^2 - \frac{1}{\sigma_0^2}\|y_i-\mu_{0i}\|^2 - \frac{1}{\sigma^2}\|\mu_i-\mu_{0i}\|^2 - p\left(\frac{\sigma_0^2}{\sigma^2}-1\right)\right\}d\widetilde\Pi_p\right| > 2\tilde\epsilon\right].$$

Notably, under $\mathbb P_0^p$, $y_i\overset{d}{=}\sigma_0\varphi_i+\mu_{0i}$, where $\varphi_i\overset{iid}{\sim}\mathrm N_p(0,I_p)$. Therefore,

$$\mathbb P_0^p(\bar A_p) = \Pr\left[\frac1p\left|\int\sum_{i=1}^n\left\{\left(\frac{\sigma_0^2}{\sigma^2}-1\right)(\|\varphi_i\|^2-p) - 2\,\frac{\sigma_0}{\sigma^2}\,\varphi_i^{\mathrm T}(\mu_i-\mu_{0i})\right\}d\widetilde\Pi_p\right| > 2\tilde\epsilon\right]$$
$$\le \Pr\left\{\frac1p\left|\sum_{i=1}^n(\|\varphi_i\|^2-p)\int\left(\frac{\sigma_0^2}{\sigma^2}-1\right)d\widetilde\Pi_p\right| > \tilde\epsilon\right\} + \Pr\left\{\frac2p\left|\int\sum_{i=1}^n\frac{\sigma_0}{\sigma^2}\,\varphi_i^{\mathrm T}(\mu_i-\mu_{0i})\,d\widetilde\Pi_p\right| > \tilde\epsilon\right\}.\tag{A.26}$$

Let us consider the first term of (A.26). Notably,

$$\Pr\left\{\frac1p\left|\sum_{i=1}^n(\|\varphi_i\|^2-p)\int\left(\frac{\sigma_0^2}{\sigma^2}-1\right)d\widetilde\Pi_p\right| > \tilde\epsilon\right\} \le \Pr\left\{\frac1p\left|\sum_{i=1}^n(\|\varphi_i\|^2-p)\right|\times\int\left|\frac{\sigma_0^2}{\sigma^2}-1\right|d\widetilde\Pi_p > \tilde\epsilon\right\}.\tag{A.27}$$

From (C3), $\sigma$ lies in a compact interval. Hence the integral on the right-hand side of (A.27) is bounded above by some positive constant, say $C_{\sigma,1}$. Therefore,

 !    1 n Z σ2 1 n ˜ X 2 0 ˜ X 2 −pCσ,2 Pr  (kϕik − p) − 1 dΠp > ˜ ≤ Pr  (kϕik − p) >  ≤ 2e . p σ2 p C i=1 i=1 σ,1 for some positive constant Cσ,2 > 0. The second inequality in the above equation follows from Rudelson and Vershynin (2013, Theorem 2.1). Clearly

∞  n !  X 1 X Z σ2 Pr (kϕ k2 − p) 0 − 1 dΠ˜ > ˜ < ∞. (A.28) p i σ2 p  p=1 i=1

Now we consider the second term of (A.26). Writing $\varphi_i = (\varphi_{i1},\dots,\varphi_{ip})^{\mathrm T}$ (similarly, $\mu_i$ and $\mu_{0i}$ are also $p$-dimensional vectors), we have

$$\Pr\left\{\frac2p\left|\int\sum_{i=1}^n\frac{\sigma_0}{\sigma^2}\varphi_i^{\mathrm T}(\mu_i-\mu_{0i})\,d\widetilde\Pi_p\right| > \tilde\epsilon\right\} = \Pr\left\{\frac2p\left|\sum_{i=1}^n\sum_{j=1}^p\varphi_{ij}\int\frac{\sigma_0}{\sigma^2}(\mu_{ij}-\mu_{0ij})\,d\widetilde\Pi_p\right| > \tilde\epsilon\right\}$$
$$\le 2\exp\left[-\frac{p^2\tilde\epsilon^2}{8\sigma_0^2\sum_{i=1}^n\sum_{j=1}^p\bigl\{\mathrm E_{\widetilde\Pi_p}\bigl(\frac{1}{\sigma^2}(\mu_{ij}-\mu_{0ij})\bigr)\bigr\}^2}\right],$$

where $\mathrm E_{\widetilde\Pi_p}$ denotes the expectation with respect to the probability measure $\widetilde\Pi_p$; the above inequality follows from sub-Gaussian concentration bounds for Gaussian linear combinations. Now,

$$\sum_{i=1}^n\sum_{j=1}^p\left\{\mathrm E_{\widetilde\Pi_p}\left(\frac{1}{\sigma^2}(\mu_{ij}-\mu_{0ij})\right)\right\}^2 \le \sum_{i=1}^n\mathrm E_{\widetilde\Pi_p}\left(\frac{1}{\sigma^4}\|\mu_i-\mu_{0i}\|^2\right) \quad\text{(by Jensen's inequality)}$$
$$= \mathrm E_{\widetilde\Pi_p}\left(\frac{1}{\sigma^4}\right)\times\sum_{i=1}^n\mathrm E_{\widetilde\Pi_p}\|\mu_i-\mu_{0i}\|^2.\tag{A.29}$$

Since we consider independent priors on $\sigma$, $\Lambda$ and $\eta_i$, (A.29) follows from its preceding step. Note that,

on the set $B_p^\epsilon$,
$$n\log\frac{\sigma^2}{\sigma_0^2} + n\left(\frac{\sigma_0^2}{\sigma^2}-1\right) + \frac{1}{p\sigma^2}\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2 < 2\epsilon.\tag{A.30}$$

From the inequality $\log x \le x-1$, applied with $x = \sigma_0^2/\sigma^2$, we see that $n\log\frac{\sigma^2}{\sigma_0^2} + n\bigl(\frac{\sigma_0^2}{\sigma^2}-1\bigr) \ge 0$. Therefore, for $\vartheta\in B_p^\epsilon$, combining (A.30) with (C3) gives $\frac1p\sum_{i=1}^n\|\mu_i-\mu_{0i}\|^2 < 2\epsilon\sigma_U^2$, and hence $\frac1p\sum_{i=1}^n\mathrm E_{\widetilde\Pi_p}\|\mu_i-\mu_{0i}\|^2 < 2\epsilon\sigma_U^2$. Also, thanks to (C3), $\mathrm E_{\widetilde\Pi_p}(1/\sigma^4)$ is bounded above. Hence the quantity in (A.29) is bounded above by a constant multiple of $p$, and consequently

$$\sum_{p=1}^\infty\Pr\left\{\frac2p\left|\int\sum_{i=1}^n\frac{\sigma_0}{\sigma^2}\varphi_i^{\mathrm T}(\mu_i-\mu_{0i})\,d\widetilde\Pi_p\right| > \tilde\epsilon\right\} < \infty.\tag{A.31}$$

Combining (A.28) and (A.31), we conclude that $\sum_{p=1}^\infty\mathbb P_0^p(\bar A_p) < \infty$. Hence the proof.

B. ADDITIONAL FIGURES

B.1 Figures Associated to Section 5

Figures S1-S6 report the UMAP (McInnes et al. 2018) plots of the simulated datasets of Section 5, corresponding to the replicate with median adjusted Rand index (Rand 1971). In each figure, the upper and lower panels show the true clustering and the estimated clustering obtained by the Lamb model, respectively. Each figure’s caption specifies the true number of clusters (k0) and the dimension (p).
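For readers wishing to reproduce this type of display, a minimal sketch based on the umap-learn and scikit-learn packages is given below; the data matrix and the label vectors are random placeholders standing in for the actual simulated data and cluster estimates.

```python
import numpy as np
import umap                                     # from the umap-learn package
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(7)
y = rng.standard_normal((300, 1000))            # placeholder (n x p) data matrix
true_labels = rng.integers(0, 10, size=300)     # placeholder true cluster labels
est_labels = rng.integers(0, 10, size=300)      # placeholder estimated cluster labels

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(y)   # 2-d UMAP coordinates
print("adjusted Rand index:", adjusted_rand_score(true_labels, est_labels))
# embedding[:, 0] and embedding[:, 1] can then be scatter-plotted, colored by each clustering
```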

Figure S1: k0 = 10, p = 1,000.

Figure S2: k0 = 10, p = 2,500.

Figure S3: k0 = 15, p = 1,000.

Figure S4: k0 = 15, p = 2,500.

Figure S5: k0 = 25, p = 1,000.

Figure S6: k0 = 25, p = 2,500.

B.2 Figures Associated to Section 6

Figure S7 reports the UMAP representations of the different clusterings in the application discussed in Section 6.


Figure S7: UMAP plots of the cell line dataset. Clusterings corresponding to the true cell types, the Lamb estimate, the PCA-KM estimate and the Seurat estimate are plotted clockwise. Different panels use different color legends.
