Selective Inference for Hierarchical Clustering
Lucy L. Gao†,∗ Jacob Bien◦, and Daniela Witten ‡
† Department of Statistics and Actuarial Science, University of Waterloo ◦ Department of Data Sciences and Operations, University of Southern California ‡ Departments of Statistics and Biostatistics, University of Washington
September 23, 2021
Abstract Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clus- tering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a differ- ence in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
Keywords: post-selection inference, hypothesis testing, difference in means, type I error arXiv:2012.02936v2 [stat.ME] 22 Sep 2021
∗Corresponding author: [email protected]
1 1 Introduction
Testing for a difference in means between groups is fundamental to answering research questions across virtually every scientific area. Classical tests control the type I error rate when the groups are defined a priori. However, it is increasingly common for researchers to instead define the groups via a clustering algorithm. In the context of single-cell RNA- sequencing data, researchers often cluster the cells to identify putative cell types, then test for a difference in mean gene expression between the putative cell types in that same data set (Hwang et al. 2018, Zhang et al. 2019). In fact, the resulting inferential challenges have been described as a “grand challenge” in the field (L¨ahnemannet al. 2020). Similar issues arise in the field of neuroscience (Kriegeskorte et al. 2009). In this paper, we develop a valid test for a difference in means between two clusters estimated from the data. We consider the following model for n observations of q features:
2 X ∼ MN n×q(µ, In, σ Iq), (1)
n×q 2 where µ ∈ R , with rows µi, is unknown, and σ > 0 is known. For G ⊆ {1, 2, . . . , n}, let 1 X 1 X µ¯ ≡ µ and X¯ ≡ X , (2) G |G| i G |G| i i∈G i∈G which we refer to as the mean of G and the empirical mean of G in X, respectively. Given a realization x ∈ Rn×q of X, we first apply a clustering algorithm C to obtain C(x), a ˆ ˆ partition of {1, 2, . . . , n}. We then use x to test, for a pair of clusters C1, C2 ∈ C(x),
ˆ ˆ ˆ ˆ H{C1,C2} :µ ¯ =µ ¯ versus H{C1,C2} :µ ¯ 6=µ ¯ . (3) 0 Cˆ1 Cˆ2 1 Cˆ1 Cˆ2
It is tempting to simply apply a Wald test of (3), with p-value given by