Equitability, Interval Estimation, and Statistical Power

Statistical Science 2020, Vol. 35, No. 2, 202–217
https://doi.org/10.1214/19-STS719
© Institute of Mathematical Statistics, 2020

Yakir A. Reshef¹, David N. Reshef¹, Pardis C. Sabeti² and Michael Mitzenmacher²

Yakir A. Reshef is Ph.D. candidate, School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA (e-mail: [email protected]). David N. Reshef is Ph.D. candidate, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA (e-mail: [email protected]). Pardis C. Sabeti is Professor, Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, Massachusetts 02138, USA (e-mail: [email protected]). Michael Mitzenmacher is Professor, School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts 02138, USA (e-mail: [email protected]).
¹ Co-first author.
² Co-last author.

Abstract. Emerging high-dimensional data sets often contain many nontrivial relationships, and, at modern sample sizes, screening these using an independence test can sometimes yield too many relationships to be a useful exploratory approach. We propose a framework to address this limitation centered around a property of measures of dependence called equitability. Given some measure of relationship strength, an equitable measure of dependence is one that assigns similar scores to equally strong relationships of different types. We formalize equitability within a semiparametric inferential framework in terms of interval estimates of relationship strength, and we then use the correspondence of these interval estimates to hypothesis tests to show that equitability is equivalent under moderate assumptions to requiring that a measure of dependence yield well-powered tests not only for distinguishing nontrivial relationships from trivial ones but also for distinguishing stronger relationships from weaker ones. We then show that equitability, to the extent it is achieved, implies that a statistic will be well powered to detect all relationships of a certain minimal strength, across different relationship types in a family. Thus, equitability is a strengthening of power against independence that enables exploration of data sets with a small number of strong, interesting relationships and a large number of weaker, less interesting ones.

Key words and phrases: Equitability, measure of dependence, statistical power, independence test, semiparametric inference.

1. INTRODUCTION

Suppose we have a data set that we would like to explore to find associations of interest. A commonly taken approach that makes minimal assumptions about the structure in the data is to compute a measure of dependence, that is, a statistic whose population value is zero exactly in cases of statistical independence, on all possible pairs of variables. The score of each variable pair can be evaluated against a null hypothesis of statistical independence, and variable pairs with significant scores can be kept for follow-up (Storey and Tibshirani, 2003, Emilsson et al., 2008, Sun and Zhao, 2014, Rachel Wang, Waterman and Huang, 2014). There is a wealth of measures of dependence from which to choose for this task (Hoeffding, 1948, Breiman and Friedman, 1985, Kraskov, Stögbauer and Grassberger, 2004, Gretton et al., 2005a, Székely, Rizzo and Bakirov, 2007, Székely and Rizzo, 2009, Reshef et al., 2011, Gretton et al., 2012, Heller, Heller and Gorfine, 2013, Sugiyama and Borgwardt, 2013, Heller et al., 2016, Lopez-Paz, Hennig and Schölkopf, 2013, Rachel Wang, Waterman and Huang, 2014, Jiang, Ye and Liu, 2015, Reshef et al., 2016, Zhang, 2016, Wang, Jiang and Liu, 2017, Romano et al., 2018).

While this approach works well in some settings, it can be limited by the size of modern data sets. In particular, as data sets grow in dimensionality and sample size, the above approach often results in lists of significant relationships that are too large to allow for meaningful follow-up of every identified relationship, even after correction for multiple hypothesis testing. For example, in the gene expression data set analyzed in Heller et al. (2016), several measures of dependence reliably identified, at a false discovery rate of 5%, thousands of significant relationships amounting to between 65 and 75 percent of the variable pairs in the data set. Given the extensive manual effort that is usually necessary to better understand each of these results, further characterizing all of them is impractical.

A tempting way to deal with this challenge is to rank all the variable pairs in a data set according to the test statistic used (or according to p-value) and to examine only a small number of pairs with the most extreme values (Faust and Raes, 2012, Turk-Browne, 2013). However, this idea has a pitfall: while a measure of dependence guarantees nonzero scores to dependent variable pairs, the magnitude of these nonzero scores can depend heavily on the type of dependence in question, thereby skewing the top of the list toward certain types of relationships over others (Faust and Raes, 2012, Sun and Zhao, 2014). For example, if some measure of dependence $\hat{\phi}$ systematically assigns higher scores to, say, linear relationships than to nonlinear relationships, then using $\hat{\phi}$ to rank variable pairs in a large data set could cause noisy linear relationships in the data set to crowd out strong nonlinear relationships from the top of the list. The natural result would be that the human examining the top-ranked relationships would never see the nonlinear relationships, and they would not be discovered (Speed, 2011, Sun and Zhao, 2014).

The consistency guarantee of measures of dependence is therefore not strong enough to solve the data exploration problem posed here. What is needed is a way not just to identify as many relationships of different kinds as possible in a data set, but also to identify a small number of strongest relationships of different kinds.

Here we propose and formally characterize equitability, a framework for meeting this goal. In previous work, equitability was informally described as the extent to which a measure of dependence assigns similar scores to equally noisy relationships, regardless of relationship type (Reshef et al., 2011). Given that this informal definition has led to substantial follow-up work (Murrell, Murrell and Murrell, 2016, Reshef et al., 2016, Ding et al., 2017, Wang, Jiang and Liu, 2017, Romano et al., 2018), the concept of equitability merits a unifying framework. In this paper, we therefore formalize equitability in terms of interval estimates of relationship strength and use the correspondence between confidence intervals and hypothesis tests to tie it to the notion of statistical power. Our formalization shows that equitability essentially amounts to an assessment of the degree to which a measure of dependence can be used to perform conservative semiparametric inference based on extremum quantiles. In this sense, it is a natural application of ideas from statistical decision theory to measures of dependence.

Intuitively, our proposal is simply to quantify the extent to which a measure of dependence can be used to estimate an effect size rather than just to reject a null of independence. More formally, given a measure of dependence $\hat{\phi}$, a benchmark set $Q$ of relationship types, and some quantification of relationship strength defined on $Q$, we construct an interval estimate of the relationship strength from the value of $\hat{\phi}$ that is valid over $Q$. We then use the sizes of these intervals to quantify the utility of $\hat{\phi}$ as an estimate of effect size on $Q$, and we define an equitable statistic to be one that yields narrow interval estimates. As we explain, this property can be viewed as a natural generalization of one of the "fundamental properties" described by Rényi in his framework for measures of dependence (Rényi, 1959). It can also be viewed as a weakening of the notion of consistency of an estimator.

After defining equitability, we connect it to statistical power using a variation on the standard equivalence of interval estimation and hypothesis testing. Specifically, we show that under moderate assumptions an equitable statistic is one that yields tests for distinguishing finely between relationships of two different strengths that may both be nontrivial. This result gives us a way to understand equitability as a natural strengthening of the traditional requirement of power against independence, which asks that a statistic be useful only for detecting deviations from strict independence (i.e., distinguishing zero relationship strength from nonzero relationship strength). As we discuss, this view of equitability is related to the concept of separation rate in the minimax hypothesis testing literature (Baraud, 2002, Fromont, Lerasle and Reynaud-Bouret, 2016, Arias-Castro, Pelletier and Saligrama, 2018).

Finally, motivated by the connection between equitability and power, we define an additional property, the detection threshold of an independence test, which is the minimal relationship strength $x$ such that the test is well powered to detect all relationships with strength at least $x$ at some fixed sample size, across different relationship types in $Q$. This is analogous to the commonly analyzed notion of testing rate (Ingster, 1987, Lepski and Spokoiny, 1999, Baraud, 2002, Ingster and Suslina, 2003), which has been studied in detail for independence testing both in the statistics literature (Ingster, 1989, Yodé, 2011, Zhang, 2016) as well as the computer science and information theory literature (Paninski, 2008, Acharya, Daskalakis and Kamath, 2015). Traditionally, testing rate for independence testing problems has been defined in terms of some distance (usually total variation distance) between the alternatives in question and independence. Here we define the property generically in terms of an arbitrary notion of relationship strength (e.g., $R^2$ of a noisy functional relationship) and show that high
