
STAR: Statistical Tests with Auditable Results
A System for Tamper-proof Hypothesis Testing

Sacha Servan-Schreiber (MIT CSAIL), Olga Ohrimenko (Microsoft Research), Tim Kraska (MIT CSAIL), Emanuel Zgraggen (MIT CSAIL)

arXiv:1901.10875v2 [cs.CR] 23 Oct 2019

ABSTRACT
We present STAR: a novel system aimed at solving the complex issue of "p-hacking" and false discoveries in scientific studies. STAR provides a concrete way of ensuring the application of false discovery control procedures in hypothesis testing, using mathematically provable guarantees, with the goal of reducing the risk of data dredging. STAR generates an efficiently auditable certificate which attests to the validity of each statistical test performed on a dataset. STAR achieves this by using several cryptographic techniques which are combined specifically for this purpose. Under the hood, STAR uses a decentralized set of authorities (e.g., research institutions), secure computation techniques, and an append-only ledger, which together enable auditing of scientific claims by third parties and match real-world trust assumptions. We implement and evaluate a construction of STAR using the Microsoft SEAL encryption library and the SPDZ multi-party computation protocol. Our experimental evaluation demonstrates the practicality of STAR in multiple real-world scenarios as a system for certifying scientific discoveries in a tamper-proof way.

1 INTRODUCTION
According to a 2016 Nature survey, over 70% of researchers failed to reproduce published results of other scientists and over 50% failed to reproduce their own published results [2]. The "Replication Crisis", plaguing almost all scientific domains, has serious and far-reaching consequences for the continued progress of scientific discovery. Unfortunately, solutions addressing the problem are few and often ineffective for two reasons: 1) current solutions either fail to take into account real-world trust assumptions (e.g., by trusting researchers to carefully apply false discovery control protocols) or 2) are overly restrictive (e.g., by requiring independent replication of results prior to publishing or preregistration of hypotheses). Moreover, these solutions fail to take into account modern approaches to data analysis, specifically the abundance of existing data and the means by which to explore it, and impose overly stringent requirements.

The replication crisis is, in part, a direct result of these problems and of what is formally known as the Multiple Comparisons Problem (MCP). With every hypothesis tested over a dataset (using any type of statistical testing procedure), there is a small probability of a chance, i.e., false positive, discovery with no real basis in the population being studied. With every additional statistical test performed on the data, the chance of encountering such a random correlation increases. This can be intentionally exploited to "fabricate" significant discoveries and, if done systematically, is referred to as "HARKing" [35], "p-hacking" [25] or "data dredging".

While a variety of statistical techniques exist to control for the MCP by setting a threshold on the false discovery rate (FDR), i.e., the ratio of false positives to true positives over a sequence of hypotheses [4, 17], there is surprisingly almost no support to ensure that researchers and analysts actually use them. Rather, individual research groups rely on often varying guidelines and trust their group members to abide by the control procedures correctly, which, unfortunately, rarely works in the real world. This is because 1) making even one simple mistake in the application of the control procedure can result in a false discovery and 2) there is no means of guaranteeing that each researcher carefully applied the control procedure (or didn't intentionally deviate from the procedure to get a "significant" (false) discovery). Things get even worse when the same data is analyzed by several institutions or teams, since guarding against false discoveries requires a coordinated effort. It is currently close to impossible to reliably employ statistical procedures, such as the Bonferroni [17] method, that guard against p-hacking across collaborators. It only requires one member to "misuse" the data (intentionally or otherwise), and detecting, let alone recovering from, such incidents is next to impossible.

This problem is perhaps further exacerbated by the pressure on PhD students and PIs to publish [40], by "publication bias" [15] as papers with significant results are more likely to be published, and by the increasing trend to share and make datasets publicly available for any researcher to use in studies. Therefore, after examining the state of affairs, it is perhaps not surprising that the scientific community is plagued by false discoveries [3, 27, 28, 30].

To illustrate this problem concretely, consider a publicly available dataset such as MIMIC III [32]. This dataset contains de-identified health data associated with ≈ 40,000 critical care patients. MIMIC III has already been used in various studies [23, 26, 38] and it is probably one of the most (over)analyzed clinical datasets and therefore prone to "dataset decay" [46]. As such, any new discovery made on MIMIC runs the risk of being a false discovery. Even if a particular group of researchers follows a proper FDR control protocol, there is no control over what happens across different groups, and tracking hypotheses at a global scale poses many challenges of its own. It is therefore hard to judge the validity of any insight derived from such a dataset [46].

A solution to guarantee the validity of insights commonly used in clinical trials, preregistration of hypotheses [12], falls short in these scenarios since the data is collected upfront without knowing what kind of analysis will be done later on. Perhaps more promising is the use of a hold-out dataset. The MIMIC author, for example, could have released only 30K patient records as an "exploration" dataset and held back 10K records as a "validation" dataset. The exploration dataset can then be used in arbitrary ways to find interesting hypotheses. However, before any publication is made by a research group using the dataset, all hypotheses must be (re)tested over the validation dataset.

Unfortunately, in order to use the validation dataset more than once, we run into the same problem: every hypothesis over the validation dataset has to be tracked and controlled for. Furthermore, the data owner (the MIMIC author in this case) needs to provide this hypothesis validation service. This is both a burden for the data owner as well as a potential risk. Researchers need to trust the data owner to apply the necessary control procedures and to objectively evaluate their hypotheses, which, unfortunately, does not always align with real-world incentive structures.

The above example illustrates the motivation behind the need for a system that addresses these problems. With STAR, our goal is to create a system that guarantees the validity of statistical test outcomes and allows readers (and/or reviewers) of publications to audit them for correctness, all without introducing unnecessary burdens on data providers and researchers. By using cryptographic techniques to certify outcomes of statistical tests and by introducing a decentralized authority, we eliminate the risk of data dredging (intentional and otherwise) by researchers using a dataset. STAR can be used in various settings, including cases where the data is public and only the hold-out data is fed into STAR (as in the example above), settings where a few research groups collaborate on combined data, or even within single teams, where lab managers can opt to use STAR as a way to prevent unintentional false discoveries, assign accountability, and foster reproducibility.

1.1 Contributions
• We present a novel system for preventing p-hacking using cryptographic techniques which provide (mathematical) guarantees on the validity of each tested hypothesis during analysis, even in settings where researchers are not trusted to apply control procedures correctly, while also ensuring full auditability of all results obtained through STAR.
• We implement and evaluate STAR on four widely used statistical tests (Student's t-test, Pearson's correlation, Chi-squared and the ANOVA F-test) to demonstrate the applicability of STAR to real world scenarios.
• We describe how a common false discovery control procedure known as α-investing can be applied with STAR to provide full control over the data analysis phase in a certifiable manner.

To the best of our knowledge, STAR is the first cryptographic solution to the problem of p-hacking, providing guarantees (in the form of tamper-proof certificates) on the validity of all insights gleaned from a dataset. We believe that STAR is the first system to address the long-standing problem of discovery certification across scientific domains, and it achieves this with minimal overhead on researchers and data providers.

2 DESIGN
Our design of STAR is motivated by the following observations. The data owner cannot release a dataset D to the researchers directly, since doing so creates a possibility for p-hacking (e.g., researchers can run tests privately and report only favorable results without controlling for false discoveries). Therefore, any system which addresses p-hacking must "hide" the raw data from researchers. With this in mind, we begin by describing several initial design ideas and discuss their limitations. This will serve as a motivation for why we choose the design and construction described in § 2.2 and § 5. We then describe some foreseeable use cases of our design in § 3.

2.1 Strawman Designs
The Trusted Authority Scheme. The simplest scheme is to assume a trusted authority (3rd party) which has full access to the unencrypted dataset. Researchers then perform statistical tests on the dataset by querying the authority, who executes the computations on their behalf and returns only the result of the test.

Unfortunately, there are two immediate problems with such a design. The solution requires that the data owner, the researchers, and the auditors trust the authority when it comes to storing the dataset, correctly executing statistical tests, and applying the control procedure.

Figure 1: High-level overview of STAR. Data owners upload datasets to the network of computing servers. Researchers proceed to compute statistical tests over the dataset using secure computation that is logged on a public log. An auditor can examine the log to determine the validity of a given discovery.

While this can work in certain isolated cases, real-world parties often have conflicting incentives that make finding such an authority challenging. For example, consider a drug company releasing a dataset for public use: researchers may not fully trust the company to evaluate each hypothesis objectively. Moreover, there is no way to verify the actions of the authority, i.e., to ensure that each statistical test was computed correctly, all p-values reported, etc., meaning that it must be trusted blindly.

The Secure Enclave Scheme. A more viable alternative to the above scheme is to have a trusted secure processor [45] with a strict interface which only allows for the evaluation of statistical tests. In such a scheme, the data owner encrypts the dataset using the enclave's public key, which allows the enclave to execute statistical tests by first decrypting the dataset in secure memory and then evaluating the statistical test. The enclave can then reveal the result along with a digital signature authenticating the computation. However, such a design suffers from similar problems as the design involving a single trusted authority: namely, it is now the secure enclave and the machine on which it runs that have to be trusted to guarantee that computations are performed correctly [11, 41].

The Multi-party Computation Scheme. The secure enclave approach motivates the idea of distributing trust using multiple parties. For example, instead of having one entity in control of the data, a dataset can be shared among a set of parties which can collectively compute over their shares but cannot individually access the shared data [6, 7, 43]. This takes care of the guaranteed-output problem and provides the means for ensuring all computations are accounted for, provided some threshold number of parties follow the protocol correctly. While on the surface this may appear to be a good solution, there are several barriers when it comes to practicality. Computing statistical tests over such a shared dataset requires general multi-party computation, which quickly becomes impractical due to the high network overheads needed to evaluate functions using the secret-shared data [7, 14, 33]. In general, multi-party computation is only practical when it comes to evaluating simple functions with few shared inputs [14, 33]. This motivates our main design, which uses minimal multi-party computation and other secure computation techniques to create a practical scheme.

2.2 STAR Framework
With the initial design attempts described above, we are ready to describe the design of STAR, which addresses all the problems mentioned in the strawman ideas. We desire a general solution which fits well with real-world trust assumptions and incentives. The design of STAR achieves these requirements by using two cryptographic building blocks. Specifically, we use secure computation and a distributed ledger (covered in the following section), which are combined in a special way to achieve the desired goal: building a system for conducting auditable hypothesis testing while remaining practical. Figure 1 provides a design overview of the system.

At a high level, the data owner encrypts a dataset D and releases the (encrypted) dataset publicly or to a set of researchers. The encryption scheme used is "special" in the sense that encrypted values can be used as inputs to functions, and the functions evaluated over the encrypted data, without knowledge of the secret key needed to decrypt the values. Such schemes are called homomorphic encryption schemes. Researchers use the encrypted dataset, denoted [D], to compute statistical tests (encoded as arithmetic circuits) using the homomorphic properties of the scheme. However, one problem remains: once the statistical test is evaluated, the result is still encrypted, and researchers do not have the secret key needed to decrypt it. One trivial solution is to send the result to the data owner, who then uses the secret key to decrypt the result. However, this requires the data owner to be online in order to process such decryption requests and, furthermore, places full trust in the data owner to correctly report results, which, as mentioned above, we would like to avoid. Moreover, this can be especially problematic in scenarios where the dataset is comprised of multiple data sources, since there is only one secret key.

In STAR, we distribute the power to decrypt ciphertexts to a set of parties (servers) which each hold a share of the secret key but, crucially, cannot decrypt on their own using their share of the key. Parties can decrypt if and only if they "agree" to do so collectively, thus preventing any subset of rogue parties from decrypting results on their own and colluding with researchers.

When the parties decrypt a statistical test result, they also keep a public log of the action, which guarantees that each revealed statistic is recorded in a tamper-proof and auditable way. This latter requirement ensures that all statistical tests executed on [D] are recorded in sequence, and makes it possible for anyone to verify whether false discovery control procedures were applied correctly. We elaborate on these important requirements in subsequent sections.

3 USE CASES
We describe several distinct and highly relevant scenarios where STAR is directly applicable. We hope these scenarios provide evidence for both the utility of STAR and the relevance of the system for addressing false discoveries.

First scenario. A data owner wishes to release a dataset for public research use but wants to prevent the data from being overanalyzed and false discoveries being extracted from the data as a result. As is usually done in such cases, the data owner partitions the dataset into an exploration and a validation dataset. The exploration dataset is released to the public without any restrictions on how statistical tests should be computed. The validation dataset is encrypted in such a way that results computed on the dataset are only made available through STAR: researchers wishing to execute tests on the validation dataset must do so by requesting the parties in STAR to reveal the result.

Using the exploration dataset, researchers independently form hypotheses about the data. Prior to publishing the results, they validate each hypothesis over the validation dataset via STAR. This guarantees that all validated hypotheses remain auditable and all computed p-values are accounted for in a tamper-proof, timestamped sequence on the log maintained by the decrypting parties. The publication can then be validated by auditors (e.g., reviewers) to ensure that the researchers' hypotheses were accepted (resp. rejected) in a way that controls for false discoveries and that the insights therein are valid. This ensures that the publication's claims are statistically significant (within the alleged error bounds) and guarantees that no p-hacking or other forms of data dredging occurred during the analysis.

Second scenario. STAR also extends to settings where multiple data owners contribute to a collective dataset and want to ensure that their data is not misused. Consider a scenario where two separate research groups (that potentially don't trust each other with their data) have datasets consisting of similar attributes. For example, consider the Food and Drug Administration (FDA) in the United States and the Chinese Food and Drug Administration (CFDA) in China. Assume that both administrations have data about a new drug claiming to cure Alzheimer's disease. Both the FDA and CFDA use their own respective datasets to explore the data and release their respective datasets in encrypted form such that tests can only be computed through STAR. Once both organizations have found interesting hypotheses in their own datasets, they use the other organization's data as a validation dataset by testing hypotheses through STAR. In other words, each organization uses its own data to determine interesting insights and proceeds to validate the other's hypotheses using the other organization's data. Since all validations are logged in STAR, results remain auditable. Moreover, the data of each organization remains private (e.g., the FDA does not learn the CFDA's data, except for the result of a statistical test).

Third scenario. A data owner wishes to release sensitive data for research purposes but is required to comply with stringent regulatory demands for privacy (e.g., as required by the European GDPR regulation [47]). Even allowing researchers to compute statistical tests can be a violation under such privacy laws. This results in potentially useful data being left outside the reach of the research community. Using STAR, however, it becomes possible to accommodate privacy needs by only allowing researchers to test for statistical significance (i.e., not even revealing the p-value of a statistical test, which we describe in a following section). STAR can be extended to only reveal whether or not a given hypothesis should be accepted (resp. rejected) while simultaneously controlling for false discoveries.

4 TECHNICAL DETAILS
We now turn to describing the security properties we require and explain how STAR can be used to certify the correct application of discovery control procedures.

4.1 Security Goals
In order to adequately present and analyze the guarantees of STAR, we must first establish a set of security goals. At a high level, STAR aims to achieve the following goals, which guarantee correctness of the system. We outline the goals in this section and formalize them in § 7.

Second scenario. STAR also extends to setting where mul- Correctness. Statistical tests computed on the encrypted tiple data owners contribute to a collective dataset and want dataset [D] must be correct, i.e, the p-value of a test com- to ensure that their data is not misused. Consider a scenario puted in STAR has to be equivalent to the p-value computed where two separate research groups (that potentially don’t through standard statistical packages (e.g., MATLAB and trust each other with their data) have datasets consisting of SciPy) over the unencrypted D. 4 Confidentiality. The only information revealed about D researchers are interested in abusing the system with the is the results of statistical tests which ensures that false goal of extracting as many insights as possible and avoiding discoveries are fully controlled for and no hypothesis can applying the false discovery control procedure. To provide be tested outside of STAR’s interface. In other words, STAR accountability and prevent other forms of attacks, we require does not “leak” data about the D when statistical tests are a secure public key infrastructure which identifies individual evaluated (more than the result of the statistical test). researchers by their public key and assume that all messages exchanged between researchers and computing servers are Access control. In many cases, it is desirable to provide digitally signed during protocol execution. access to only a set of approved researchers. We require that STAR have a mechanism in place for enforcing access control, though it is not an inherent requirement. If an access control 4.3 Controlling False Discoveries policy is set, then statistical tests on D can only be executed by the set of approved researchers. An important first step in realizing the construction of STAR is understanding how false discovery control procedures are Auditability. For all hypotheses evaluated through STAR, used in the data analysis phase. 
In order to certify discoveries, the resulting p-values can be checked for their statistical it is necessary to use the (ordered) sequence of p-values significance with respect to the false discovery control proce- computed over dataset. The sequence is then used to bound dure used. Crucially, anyone can audit the correctness the false discovery rate (FDR) during or after analysis. of a statistical test and can do so using only publicly While standard procedures such as Bonferroni [17] must available information and without interacting with re- be applied a posteriori of the data analysis (once all hy- searchers or the computing parties. potheses have been tested), in STAR, we desire a method for controlling the FDR in a streaming fashion, ideally with- 4.2 Threat Model out knowledge or restriction on future tests. Fortunately, STAR provides provable guarantees to the four properties there is a common false discovery control procedure known outlined in § 4.1 under the following threat model. Our se- as α-investing [20] which achieves just that. The α-investing curity assumptions are with respect to the data owner(s), procedure is a standard choice for controlling false discov- computing parties (e.g., different institutions and research eries in practice when the total number of hypotheses is labs), and researchers evaluating their hypotheses through unknown ahead of time (e.g., in exploration settings) [51, 52]. STAR. We describe the intuitive version of the procedure below and we refer the reader to [20, 51] for more formal definitions Data Owner. We make the necessary assumption that the and proofs which are outside the scope to understanding data owner does not collude with the researchers who run this work. the statistical tests. 
This assumption is required given that a The α-investing procedure works by assigning, to each malicious owner has unfettered access to the (unencrypted) hypothesis, a “budget” α from an initial “α-wealth.” If the p- dataset D and can thereby trivially avoid executing tests value of the null hypothesis being considered is above some through STAR, only “testing” results known to be signifi- α the null hypothesis is accepted and some budget is lost, oth- cant ahead of time without applying the necessary control erwise the null is rejected and the available budget increases. procedure. Choosing α depends on the “investment strategy” [51] and is ultimately left as a choice for the analyst. We present a Computing Servers. Computing parties are collectively common investment strategy in Procedure 1 which allows in possession of the decryption key, however, none of the for a (theoretically) unbounded number of statistical tests to parties individually have access to the unencrypted data D. be executed over the dataset. The dataset remains unknown to the servers provided at least More formally, for the jth null hypothesis, Hj , being tested, one of the servers follows protocol correctly. If the parties are Hj is assigned a budget αj > 0. Let pj denote the p-value asso- also maintaining the distributed log on which statistical test ciated with the result of the statistical test used to determine results are recorded, we assume that a majority of the parties the statistical significance of Hj . The null is rejected if pj ≤ αj , are honest when it comes to maintaining the correctness of otherwise it is accepted. In the case that Hj is rejected, the the log. testing procedure obtains a “return” on investment γ ≤ α. On the other hand, if the H is accepted, α /(1 − α ) wealth Researchers. We assume that researchers are malicious and j j j is subtracted from the remaining α-wealth. 
Therefore, the interested in gaining as much information on D as possi- remaining wealth, denoted W (j), after the jth statistical test ble (for the purpose of bypassing control procedures and is set according to: falsely certifying discoveries). To this end, we assume that 5 Publication ( Publication W (j) + γ if pj ≤ αj , p-value W (j + 1) = α (1) p-value p-value p-value p-value p-value p-value p-value W (j) − j if p > α p-value 1−αj j j test computation log auditor

Procedure 1: InvestmentRule(α, β, γ) [51]
1  W(0) = α;                                           // initial wealth
2  foreach i ← 1, 2, ... do
3      αi = min(α, W(i−1)(1−β) / (1 + W(i−1)(1−β)));
4      pi ← test(Hi);                                  // p-value for hypothesis Hi
5      if pi < αi then
6          W(i) = W(i−1) + γ;
7      else
8          W(i) = W(i−1) − αi/(1−αi);

Figure 2: Auditor verifies the validity of reported p-values by examining the values stored on the log and ensuring that the α-investing procedure was applied correctly.

In STAR, the parameters W(0), α, β, and γ can be set as static constants or determined by more complex mechanisms, as seen fit. The important takeaway in understanding STAR is that, given all p-values computed on D up to the current test, it is easy to determine whether any given hypothesis should be accepted (resp. rejected) based on the corresponding p-value, in order to bound the number of false discoveries to α overall. We illustrate the process in the following example.

Illustrative example: Consider three hypotheses H1, H2, H3, with p-values p1 = 0.0012, p2 = 0.2331, and p3 = 0.0049, respectively. Set the initial alpha-wealth W(0) = 0.05, fix γ = W(0)/4, and let β = 0.25. After testing hypothesis H1, we obtain p1 < α1, so H1 is rejected and we obtain a "return on investment." Now W(1) = W(0) + γ = 0.0625. For the second test, we obtain p2 > α2, so H2 is accepted and the remaining wealth is W(2) = W(1) − α2/(1 − α2) ≈ 0.016. For the third test, we get p3 < α3, so H3 is rejected and W(3) = W(2) + γ ≈ 0.028. Examining the sequence of p-values, it is trivial to verify the validity of discoveries H1 and H3 by reapplying Procedure 1 with all three computed p-values.

4.4 Certification of Hypotheses
The goal of STAR is to certify, in a tamper-proof way, that the insights gained from the available data are valid. Taking a bird's-eye view: for every test computed through STAR, a publicly auditable certificate ψ is produced which attests to the application of the α-investing procedure during the hypothesis testing process. Formally, we define a certificate of a statistical test T as a signed tuple (ρ, σ, τ), where ρ is the test index (i.e., the test's order among all the tests that have been executed so far on D), σ is the identifier of the statistical test function T, and τ is the result of executing T(D), i.e., the test statistic. Given that the certificate contains both the test index ρ and the resulting p-value, it is easy to verify (in conjunction with previous certificates) whether the false discovery control procedure was applied correctly, by (re)applying Procedure 1 and verifying the conclusion researchers came to during analysis. Note that the statistical tests do not need to be re-evaluated in order to verify the correctness of results; the p-value only has to be computed once.

5 CONSTRUCTION
We are now ready to describe the system in detail. We start by describing the cryptographic tools used in realizing STAR and then formalize the different components of STAR. We note that the cryptographic techniques used as building blocks are well studied and fairly standard, but they are combined in a novel and carefully thought-out manner to achieve practical efficiency and meet the goals outlined in § 2.

5.1 Preliminaries
STAR is constructed using the following building blocks: a tamper-proof ledger for recording all statistical tests executed over a dataset in an auditable fashion, somewhat homomorphic encryption for evaluating statistical tests over encrypted data, and multi-party computation for securely decrypting results and, in some cases, evaluating parts of the statistical test circuit. Table 1 provides a summary of the cryptographic building blocks used and their role in STAR.

Tamper-proof Ledger. We rely on a tamper-proof ledger abstraction which we call TestLog. The ledger records statistical tests executed on the dataset such that each test result and the order of test execution is publicly available and remains immutable once recorded. In STAR, TestLog can be maintained by a central authority, by the computing servers themselves using a consensus protocol (e.g., Paxos [36]), or by independent parties (e.g., using a public blockchain based on a decentralized consensus protocol such as Nakamoto consensus [39]). Similar to the ledger abstraction used in [10], our only requirement is that TestLog is correct and available.
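To make the tamper-evidence requirement concrete, the following toy sketch models TestLog as a hash-chained append-only list. This is our own minimal illustration; as described above, a deployment would instead rely on a consensus protocol or a public blockchain.

```python
# Toy TestLog: each entry (rho, sigma, tau) is chained to its
# predecessor by a SHA-256 digest, so any retroactive modification of a
# recorded result breaks the chain and is detectable. Illustration only.
import hashlib
import json

class TestLog:
    def __init__(self):
        # Genesis entry: the signed message (0, None, None) posted at Setup.
        self.entries = [{"rho": 0, "sigma": None, "tau": None, "prev": "0" * 64}]

    def _digest(self, entry):
        return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

    def append(self, sigma, tau):
        """Record a test identifier and its (decrypted) result."""
        prev = self._digest(self.entries[-1])
        entry = {"rho": len(self.entries), "sigma": sigma, "tau": tau, "prev": prev}
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute the hash chain; returns False if any entry was altered."""
        return all(
            self.entries[i]["prev"] == self._digest(self.entries[i - 1])
            for i in range(1, len(self.entries))
        )

log = TestLog()
log.append("t-test", 0.0012)
log.append("pearson", 0.2331)
assert log.verify()
log.entries[1]["tau"] = 0.0001   # tampering with a recorded p-value...
assert not log.verify()          # ...is detected by the chain check
```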

Property | Cryptographic Building Block(s) | Role in STAR
Access Control | Digital signatures & encryption | Access to data is restricted to the STAR protocol: the data is encrypted and hence inaccessible to researchers directly. Moreover, access to the system is controlled via digital signatures to thwart system abuse by any single researcher in the form of a denial-of-service attack.
Hypothesis tracking | Homomorphic encryption & multi-party computation | Computation results are revealed via consensus of the computing servers, and remain hidden otherwise. Researchers cannot test hypotheses outside of STAR because the data remains encrypted and secret at all times.
Auditability | Tamper-proof ledger & controlled decryption | Every statistical test computation is recorded in a publicly readable log at decryption time, making test results (and the application of false discovery control procedures) auditable by researchers and third parties.

Table 1: Summary of cryptographic building blocks and their role in realizing the main properties of STAR.
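The "controlled decryption" entries in Table 1 rest on the idea that no single server holds the whole secret key. A toy additive secret-sharing sketch conveys the intuition; this is an illustration only (STAR shares the key of a somewhat homomorphic scheme, not a one-time-pad key, and the names here are ours):

```python
# Toy additive secret sharing: the key is split among k servers so that
# all shares are needed to decrypt; any proper subset learns nothing.
import secrets

P = 2**61 - 1  # prime modulus for the toy scheme

def share(secret, k):
    """Split `secret` into k additive shares modulo P."""
    parts = [secrets.randbelow(P) for _ in range(k - 1)]
    parts.append((secret - sum(parts)) % P)
    return parts

def reconstruct(parts):
    return sum(parts) % P

key = secrets.randbelow(P)
result = 1234                       # plaintext test result
ciphertext = (result + key) % P     # toy "encryption" of the result

server_shares = share(key, k=3)     # one share per computing server
# Only the collective of all servers can recover the key and decrypt:
recovered = (ciphertext - reconstruct(server_shares)) % P
assert recovered == result
```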

Somewhat Homomorphic Encryption. Somewhat Homomorphic Encryption (SHE) is a special form of asymmetric encryption where a limited number of computations can be performed over encrypted values without knowledge of the decryption key. In other words, SHE enables the evaluation of a limited class of functions on encrypted inputs to obtain an encrypted output. Unlike fully homomorphic encryption [22], where arbitrary functions can (at least theoretically) be evaluated on encrypted inputs, SHE can only evaluate a handful of multiplication gates, but it does not incur the high computational overhead of fully homomorphic encryption [42].

Multi-party Computation. Multi-party computation (MPC) is a set of techniques for computing over secret data that is shared amongst a set of parties. The data is encoded and shared in such a way as to prevent individual parties (or even a subset of parties) from obtaining any information about the (secret) data, while still allowing functions to be evaluated over the encoded data. The importance of MPC is that functions can be evaluated by the parties while ensuring that no party learns the input values. In STAR we use the SPDZ [14, 34] protocol for evaluating functions in a multi-party setting. SPDZ is the current state-of-the-art when it comes to performing such computations and is highly efficient in practice. We provide additional details about the protocol in § 6.

5.2 STAR protocol

We are now ready to describe the construction of STAR. The functionality of STAR can be split into three distinct phases: Setup, ComputeTest, and PerformAudit.

Setup. To initialize the system, the data owner encrypts a dataset [D] and shares partial decryption keys with the set of computing servers {S1, . . . , Sk} such that collectively they may decrypt computation results. The data owner then initializes the public log TestLog and posts to it a signed message (0, ⊥, ⊥) that corresponds to the counter of the tests to be executed on D, set to zero. In addition to releasing the encrypted dataset, the data owner releases metadata pertaining to D deemed sufficient for researchers to form hypotheses on the dataset (e.g., attribute metadata). We formalize the metadata requirements in § 7.

ComputeTest. A researcher uses the encrypted data to evaluate a statistical test σ over a subset of attributes in D. Once the researcher obtains the encrypted result, she sends it to the servers {S1, . . . , Sk}, which engage in the multi-party protocol to decrypt the result. In doing so, the servers make the result publicly available on TestLog. Recall that the test result cannot be recovered by any of the parties individually; all servers must reveal their shares in order to recover the result. This specification ensures that each test result is made publicly available to all the servers and researchers at the end of the computation. The decrypted result, in conjunction with the existing data written to TestLog, is the test certificate consisting of the tuple ψ = (ρ, σ, τ).

PerformAudit. A test with certificate ψ = (ρ, σ, τ) can be audited for correctness by any entity with access to TestLog. An auditor retrieves all certificates up to ρ from TestLog and uses this information to ensure the α-investing procedure was applied correctly when claiming a result is statistically significant. Note that auditing the statistical test results can be done entirely "offline" using only information recorded on TestLog; this makes the auditing procedure non-interactive.

Procedure 2: PerformAudit(ψρ = (ρ, σ, τ))
1: W(ρ − 1) ← wealth remaining after applying InvestmentRule(α, β, γ) up to the p-value of the ρ-th test
2: αρ ← min(α, W(ρ−1)(1−β) / (1 + W(ρ−1)(1−β)))   // see § 4.3
3: Accept ψ as valid if and only if pρ < αρ

6 STATISTICAL TESTS IN STAR

In this section we describe the statistical tests we implement in STAR and their usage in hypothesis testing. We describe four very common statistical tests: Student's t-test, Pearson's correlation test, the Chi-squared test, and the ANOVA F-test. While not exhaustive, this set of statistical tests forms a basis for quantitative analysis and covers many cases: from reasoning about population means to analyzing differences between sets of categorical data. These four tests are also widely used in practice and accounted for more than 70% of all clinical trials in a pool of 1,828 medical studies from a variety of journals [16].

With our selection of tests we aim to demonstrate the versatility and practicality of STAR in real world applications. Our selection captures tests used on both continuous and categorical data. Many other statistical tests are close variants of the statistical tests we describe in this paper and can be implemented in STAR as well.

6.1 Dataset Characteristics and Notation

Before diving into the equations, we first describe how the data is structured in STAR and how computations are performed over encoded data, and introduce some additional notation.

Metadata requirements. First, we formalize which information about the data is revealed (resp. hidden) from researchers at setup time. To allow researchers to form hypotheses on [D], the data owner is required to release attribute metadata about the dataset. This includes, for example, the number of attributes, the type and domain size of each attribute, and independence from other attributes. This helps researchers determine what test to use and what hypothesis to test for. We require that the metadata be such that 1) it is possible to define a hypothesis on D and 2) it contains the number of rows and columns in the dataset, denoted by n and m. Moreover, since some tests only work on attributes that are independent of each other, we require the independence of attributes (if any) to be contained therein. Finally, we note that we implicitly set a bound on the attribute domain size (e.g., we assume each value can be represented by a 32-bit integer). This is necessary for the correct selection of the parameters at system setup time.

Computing p-values from test statistics. We make an important observation which enables us to more efficiently apply FDR control procedures in STAR. Each statistical test produces a test statistic from which p-values are derived using degrees of freedom (a function of the dataset size). Since the p-value can be computed "in the clear" using the revealed test statistic, we do not need to compute the p-value using secure computation, which drastically reduces the complexity overhead (obtaining p-values requires lookup tables in practice; an expensive operation to perform over encrypted data). We therefore focus on describing the computation of only the test statistic, and leave the mapping from test statistic to an exact p-value implicit, as it can be trivially computed. We note that in certain situations revealing the test statistic (and as a result the p-value) can be too much information, since researchers may be able to reconstruct the dataset after observing a sufficient number of queries. We therefore also describe how STAR can be adapted to only reveal statistical significance, which bypasses p-value computations altogether.

6.2 STAR for the holdout dataset

While we describe STAR as a general scheme for evaluating statistical tests on a dataset, in practice we envision the dataset D to be split into an exploration dataset Dexp and a validation dataset Dval. The data owner encrypts the dataset Dval and releases the encrypted validation dataset along with the (unencrypted) exploration dataset. This way, researchers can explore different hypotheses using Dexp and validate their results through STAR using [Dval]. Indeed, this is the way we envision STAR being deployed in real world settings, as it mirrors current "best practices", but in a provably certifiable way.

6.3 Hypothesis Testing

We are now ready to describe the four statistical tests implemented in STAR and their usage in hypothesis testing. We refer the reader to [5] for additional details on the statistical tests and their applications.

Student's t-test is used to compare the means of two independent samples. The null hypothesis stipulates that there is no difference between the two distributions [5]. Let x and y be two independent samples from D. Denote the means of x and y as x̄ and ȳ, and the standard deviations as sx and sy. The test statistic, t, is then computed according to:

    t ≜ (x̄ − ȳ) / (sp · √(2/n))    (2)

where sp = √( ((n−1)s²x + (n−1)s²y) / (2n − 2) ).

Pearson's correlation test is used to measure the linear correlation between two continuous and independent variables. The test statistic, r, lies in the range [−1, 1], where either extreme corresponds to negative (resp. positive) correlation between the variables [5]. Let x and y be two independent continuous samples in D. Denote the means of x and y as x̄ and ȳ. Pearson's correlation coefficient is computed as follows:

    r ≜ Σⁿᵢ₌₁ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σⁿᵢ₌₁ (xᵢ − x̄)²) · √(Σⁿᵢ₌₁ (yᵢ − ȳ)²) )    (3)

Chi-squared test is used to determine whether there exists a significant difference between the expected and observed frequencies in a set of observations from mutually exclusive categories. The test evaluates the "goodness of fit" between a set of expected values and observed values. For a collection of n observations classified into m mutually exclusive categories, where each observed value is denoted by xᵢ for i = 1, 2, . . . , m, we denote the probability that a value falls into the i-th category by ηᵢ, such that Σᵐᵢ₌₁ ηᵢ = 1. The Chi-squared statistic is given by:

    χ² ≜ Σᵐᵢ₌₁ (xᵢ − eᵢ)² / eᵢ    (4)

where eᵢ = nηᵢ.

ANOVA F-test is commonly used to determine the fit of a statistical model to a given dataset. Let xᵢ for i = 1, . . . , m be independent samples from D. Denote the mean of xᵢ by x̄ᵢ and the mean across all samples by x̄. Denote the j-th value of the i-th sample by xᵢ,ⱼ. The test statistic, F, is then computed according to:

    F ≜ ( Σᵐᵢ₌₁ n(x̄ᵢ − x̄)² / (m − 1) ) / ( Σᵐᵢ₌₁ Σⁿⱼ₌₁ (xᵢ,ⱼ − x̄ᵢ)² / (n − m) )    (5)

6.4 Statistical Test as an Arithmetic Circuit

We now turn to describing how the statistical tests are computed in STAR. The first obstacle in the way of a practical construction is that somewhat homomorphic encryption is only viable for circuits with low multiplicative depth. With practical implementations of homomorphic encryption, it is only possible to evaluate a few (preferably fewer than 3) levels of multiplication gates where both inputs are encrypted, though many additions and multiplications by public constants are supported. Observe that each test statistic follows a pattern: it requires only a few multiplications per term in the summation, followed by a single division. This makes it possible to efficiently evaluate the numerator and denominator terms of each statistical test equation using at most 2-3 multiplication operations. However, when it comes to the final division gate (where both the numerator and denominator are encrypted), there is no homomorphic operation to perform it outright. Instead, existing methods require a recursive approach (converting the division gate to a series of multiplications) which requires a significant number of consecutive encrypted multiplications and bars the way towards any practical implementation. While theoretically we could use fully homomorphic encryption to evaluate division gates using an iterative approach, such an approach would significantly blow up the computational complexity (by many orders of magnitude), making the construction impractical for real-world settings.

Instead, we opt for a hybrid approach: we use the decryption parties (the parties to whom the shares of the secret key get distributed at system setup time) to "boost" the computational power of the encryption scheme and evaluate the division gate. We use multi-party computation techniques to evaluate division gates in the arithmetic circuit of the statistical test. Since the statistical tests we use to describe STAR only require one division per evaluation, at the very end of the computation, we can evaluate the numerator and denominator locally using somewhat homomorphic encryption and then perform the division with the help of the computing parties. This leads to a practical solution for evaluating statistical tests precisely because the bulk of the test can be evaluated locally and thus requires minimal additional interaction from the parties (where network latency is a bottleneck). If the majority of the circuit is evaluated locally, the parties are only needed to evaluate division gates and thus only need to receive two ciphertexts (numerator and denominator), which they divide prior to revealing the result. This provides the best of both worlds when it comes to practical efficiency. Figure 3 illustrates the process of evaluating a statistical test in STAR, where the test is represented as an arithmetic circuit.

[Figure 3 diagram: a statistical test represented as an arithmetic circuit; the addition and multiplication gates over ciphertexts are evaluated locally, while the final division gate over secret shares is evaluated using multi-party computation.]
Figure 3: Evaluation of a statistical test circuit in STAR. Numerator and denominator terms of the test statistic are computed locally by the researcher using the homomorphic properties of the encryption scheme. Computing servers then evaluate the division gate securely using multi-party computation and reveal the result.

Ciphertext to Share conversion. One caveat, however, is that in order for the parties to perform multi-party computation, the inputs (e.g., the numerator and denominator for a division gate) must be secret shared to all the parties.
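To make the sharing step concrete, the following is a generic additive secret sharing sketch (illustrative only; SPDZ uses authenticated shares over a finite field, and the helper names here are ours):

```python
import random

PRIME = 2 ** 61 - 1  # a public prime modulus; all arithmetic is mod PRIME

def share(secret, k):
    # Split `secret` into k additive shares: k-1 uniformly random values
    # plus one correcting share so that all shares sum to the secret.
    shares = [random.randrange(PRIME) for _ in range(k - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    # Any proper subset of shares is uniformly distributed; only the sum
    # of all k shares reveals the secret.
    return sum(shares) % PRIME
```

Here all k shares are required for reconstruction, matching the requirement that no subset of servers can recover a test result on its own.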
Normally, this is done by a trusted party that splits the plaintext inputs into shares, which are then distributed to each party individually such that any threshold number of parties can pitch in their shares and collectively recover the secret value. In STAR, however, these shares must be created and distributed on-the-fly from the encrypted inputs. To achieve this, we use a multi-party protocol which securely converts an encryption into a set of shares held by the parties. This is achieved using a protocol that takes as input a ciphertext encrypted under a public key and results in the parties holding secret shares of the previously encrypted value. Such protocols have been well studied and are described in [13].

7 SECURITY ANALYSIS

We now analyze the different security properties outlined in § 2. The security of STAR hinges on the somewhat homomorphic encryption scheme and the multi-party computation framework used. We opt for the SPDZ multi-party computation framework because it is the current state-of-the-art when it comes to performing efficient MPC computations. We use SPDZ to evaluate the division gates in the statistical test circuit and to control the revelation of statistical test results. Security hinges on the fact that, given an encrypted dataset, researchers are unable to recover the underlying data to perform covert statistical tests outside of STAR, i.e., without requesting a decryption from the parties.

Correctness. STAR follows the plaintext execution of statistical tests by evaluating an arithmetic circuit of the tests over the encrypted dataset, as presented in § 6. The only difference between a plaintext evaluation and the encrypted evaluation is in the accuracy of the results, since STAR guarantees O(f) bits of precision, which follows directly from the parameters used in instantiating the encryption scheme and the SPDZ parties. The precision parameter, f, can be set to provide an arbitrary number of decimal places of precision. The default is usually set to 2^40, which guarantees at least 12 decimal places of precision and is more than sufficient for p-value calculation in most situations¹.

Confidentiality. The aim of STAR is to enforce p-value calculation in a truthful manner. It achieves this goal by hiding the content of D and revealing only metadata on D sufficient to form hypotheses and compute statistical significance. The data owner only needs to reveal the size of the dataset and attribute metadata on the contents of D (e.g., characteristics of each attribute, domain size, and independence from other attributes)². The metadata is succinctly captured as a function Ls : D → {0, 1}*, that takes as input the dataset and returns the number of rows and attributes n, m and other metadata decided on by the data owner. At setup time, this function captures the exact "leakage" of D, as nothing beyond that is revealed. There is additional leakage of information on D that results from the statistical test computation; it is captured by the function Lc : (D × σ) → σ(D), which is the result of the statistical test σ computed on D. Confidentiality of the dataset then follows directly from the fact that the only information revealed by computing a statistical test (represented as an arithmetic circuit) over the encrypted dataset [D] is the output σ(D) (i.e., the result of the statistical test σ) and nothing else. If this is deemed too much, the leakage can be reduced to revealing just significance (we explain this in the following subsection), in which case only a single bit of information is leaked when performing a statistical test through STAR.

Access Control. Though anyone can perform a computation over [D], the computing servers can enforce "rate limits" on researchers or refuse to reveal results to researchers who attempt to abuse the system. This is an important feature for avoiding situations where a malicious researcher exhausts the α-investing procedure's alpha-wealth and prevents other researchers from testing their hypotheses. While describing the exact access control policies that could be enforced is outside the scope of this work, we emphasize that such policies can be useful in real-world situations and can be integrated into the design of STAR by default. For example, the system can easily be made to enforce daily quotas on the number of statistical tests that any single researcher can evaluate over a dataset.

Auditability. All computations are performed securely over the encrypted dataset [D], where we assume at least one of the computing servers follows the protocol honestly. Thus, every resulting decryption is written to TestLog, which guarantees auditability of all the statistical test results computed on D. For every certificate ψ = (ρ, σ, τ), the sequence of resulting computations numbered 1 to ρ − 1 can be used to certify the validity of ψ by auditing the application of the α-investing procedure, as shown in § 4.3. It is important to note that auditing does not require interaction with the computing parties and can be done fully "offline", using only publicly available information.

¹Many p-value calculation tables are rounded to fewer than 5 decimal places [5].
²For example, metadata for an "age" attribute may be the set {AttrType: Age, NumericRange: 0-110}. We stress, however, that the metadata can be made general and independent of the actual values found in D, when deemed of no importance to forming hypotheses.

7.1 1-bit computation leakage

In some situations, leaking the result of the statistical test may be deemed to reveal too much about the dataset D. For example, there may be situations where researchers can exploit the information revealed by a statistical test to make an educated guess about the significance of a yet-uncomputed statistical test (effectively cheating the control procedure). If this is deemed a risk, then the computation leakage can be reduced to Lc : (D × σ) → {0, 1}, where the only output of a computation through STAR is a single bit denoting statistical significance and nothing else. This can be achieved by comparing the encrypted result τ with a significance threshold t using secure computation (right after evaluating the division gate, see Figure 3) and only revealing the result of this comparison. This potentially sacrifices the utility of the result, since nothing beyond statistical significance is revealed, but it provides the minimum leakage possible (a single bit). We note that the α-investing procedure still works (and can be audited as before) because the procedure only requires the result of this threshold comparison, and not the p-value itself.

8 EXPERIMENTAL EVALUATION

We now turn to demonstrating the applicability of STAR as a system for providing auditable results for statistical tests in an efficient way. Our goal is to show that STAR imposes minimal overhead on the data owner and researchers. Moreover, since we envision a deployment of STAR where the computing servers are far apart (e.g., universities in the United States and in Europe), network latency has the potential to incur significant performance overheads, which could be a problem in practice, and we must therefore evaluate carefully. Finally, we wish to evaluate the effectiveness of the α-investing procedure for controlling the FDR, to illustrate why this procedure is an ideal pairing for STAR.

We outline the goals of our evaluation as follows:
• Thoroughly evaluate STAR on a variety of datasets to measure the overall runtime for all four statistical tests. Specifically, we wish to show that STAR can perform in realistic deployment scenarios where the datasets are sufficiently large in size and the servers used in the final steps of the computation are far apart from one another.
• Illustrate the effectiveness of the α-investing procedure when it comes to controlling false discoveries under different dataset characteristics, to show why STAR would be effective at preventing p-hacking when used as described.
• Measure the impact that network latency, the number of computing servers, and other parameters introduce, and their effects on the practicality of STAR.
• Measure the overhead imposed on the data owner at system setup and on researchers during the evaluation of statistical tests.

8.1 Implementation and Environment

We implement STAR in C++ using the open source homomorphic encryption library SEAL (v3.3.0) [42] and the SPDZ MPC engine (https://github.com/KULeuven-COSIC/SCALE-MAMBA). The implementation of the statistical tests computes the equations described in § 6: the SEAL library is used by the data owner to encrypt the data, researchers use the encrypted dataset to evaluate statistical tests, and the final result is computed and revealed by parties running the SPDZ protocol. SEAL does not currently support threshold decryption as described in [8, 29]; we therefore simulate the evaluation of the share-conversion protocol (see Figure 3) using the SPDZ parties and a single decryption key by emulating the ciphertext-to-share conversion protocol run by the computing parties.

The homomorphic evaluation of the statistical tests was conducted on a single machine with an Intel Xeon E5 (Sandy Bridge) @ 2.60GHz processor (30 cores per machine) and 120GB of RAM, running Ubuntu 18.04 LTS. The cost of running the machine was estimated at $914.98/month. Each computation party was instantiated on a machine with Intel Xeon E5 (Sandy Bridge) @ 2.60GHz processors (8 cores per machine) and 15GB of RAM, running Ubuntu 18.04 LTS. The cost of running each party was estimated at $109.79/month.

Unless otherwise stated, each result is the average over five separate trial runs (for each combination of parameters).

8.2 Datasets

We evaluate each statistical test on synthetic and real-world datasets. However, we note that runtime performance is not impacted by the distribution of the underlying data. Since the data itself is encrypted, all computations run in the same amount of time because the evaluation of the statistical tests is done over the ciphertexts. We evaluate STAR on one real-world dataset, which we make the same size as one of the synthetic datasets to illustrate this fact. The characteristics of the datasets used in our evaluation are summarized in Table 2.

The real-world dataset was obtained from the UC Irvine Machine Learning Repository [1]. For experiments involving Student's t-test, Pearson's correlation test, and the ANOVA F-test (which require continuous data), we use the Abalone dataset, consisting of a total of nine continuous measurements and 4,177 rows. We truncate the dataset down to 1,000 rows to provide a comparison baseline with our synthetic dataset cont_1k. We do this to illustrate the fact that the only factor influencing the computational overhead is the size of the dataset; the synthetic datasets do not advantage STAR in any way.

server config. 1: 3 parties   server config. 2: 3 parties   server config. 3: 3 parties   server config. 4: 4 parties   server config. 5: 5 parties
Figure 4: Server configurations used in the evaluation of STAR.
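For reference, the test statistics of § 6.3 (Equations 2-5), which the experiments below time over encrypted data, can be computed in the clear as follows. This is a plaintext sketch with our own helper names, useful for checking STAR's results; it is not part of the system itself:

```python
import math

def t_statistic(x, y):
    # Student's t-test for two independent, equal-size samples (Eq. 2).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx2 = sum((v - xbar) ** 2 for v in x) / (n - 1)  # sample variances
    sy2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    sp = math.sqrt(((n - 1) * sx2 + (n - 1) * sy2) / (2 * n - 2))
    return (xbar - ybar) / (sp * math.sqrt(2.0 / n))

def pearson_r(x, y):
    # Pearson correlation coefficient (Eq. 3).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xbar) ** 2 for a in x)) * \
          math.sqrt(sum((b - ybar) ** 2 for b in y))
    return num / den

def chi_squared(observed, probs):
    # Chi-squared goodness-of-fit statistic (Eq. 4), with e_i = n * eta_i.
    n = sum(observed)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, probs))

def anova_f(samples):
    # One-way ANOVA F statistic (Eq. 5) for m groups of n values each;
    # the within-group term is divided by the total count minus m.
    m, n = len(samples), len(samples[0])
    means = [sum(s) / n for s in samples]
    grand = sum(sum(s) for s in samples) / (m * n)
    between = sum(n * (mu - grand) ** 2 for mu in means) / (m - 1)
    within = sum((v - mu) ** 2
                 for s, mu in zip(samples, means) for v in s) / (m * n - m)
    return between / within
```

In each case the numerator and denominator are the two values STAR evaluates homomorphically before the single secure division.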

Synthetic datasets. We generate three continuous synthetic datasets containing 1,000, 5,000 and 10,000 rows, respectively, with random real-valued entries sampled from a normal distribution. We use these datasets to evaluate Student's t-test, Pearson's correlation test, and the ANOVA F-test. For evaluating the Chi-squared test, we generate three categorical datasets containing 8 mutually exclusive categories. We evaluate the Chi-squared test on subsets of 2, 5, and 8 mutually exclusive categories to demonstrate the impact of varying the number of selected attributes over which the test is performed.

Dataset    # rows   # attrs.   Data type
abalone    1000*    8          continuous
cat_1k     1000     8          categorical
cat_5k     5000     8          categorical
cat_10k    10000    8          categorical
cont_1k    1000     3          continuous
cont_5k    5000     3          continuous
cont_10k   10000    3          continuous
Table 2: Summary of dataset characteristics.
* We truncate this dataset to 1000 rows to illustrate the fact that computation time is entirely independent of the data distribution (as it should be, given that the data is encrypted).

8.3 Experiment Setup

To ensure our experiments emulate a potential deployment scenario (e.g., where servers are maintained by universities in different geographic regions), we run our experiments with five different configurations. We set up servers in three different locations in the USA (South Carolina, Virginia, Oregon) as well as Germany, Brazil, and Japan, with 3-5 parties depending on the configuration. We report each of the five configurations in Figure 4. Our choice of server configurations allows us to evaluate the effects of network latency in a deployed scenario. The first configuration consists of all three servers located in the same datacenter (South Carolina), which results in the parties being connected on the local area network (LAN) with "ideal" network latency. We use this configuration as a reference point to analyze the impact of settings where servers are further apart.

8.4 Parameters

To ensure all tests are evaluated in a way that enables comparisons between results, we fix the parameters ahead of time to satisfy the requirements imposed by all four statistical tests and datasets. Specifically, we set the fixed-point scaling factor f = 40, which provides up to 12 decimal places of precision for each computation and is sufficient for computing precise p-values. We set the statistical security parameter κ = 40, which provides 40 bits of statistical security during multi-party computations and is the default choice [14]. In words, this parameter ensures that the probability of a computation leaking information about a secret value is less than 1/2^κ, which is negligible in κ.

8.5 Results

We describe the results in four parts which roughly correspond to the goals outlined at the beginning of this section.

System setup. Across all experiments, the setup time (i.e., the time required to encode and encrypt the data) remained below 32 minutes per dataset. We report the setup time for each dataset in Figure 8. We stress that this is a one-time operation needed only at setup: after the dataset is encrypted, the data owner goes offline and does not need to interact with researchers or perform any further computation.

Computation runtime. Since we implement STAR using the SEAL homomorphic encryption library and the SPDZ multi-party computation engine, we split the runtime evaluation into two phases: offline and online. The offline phase is where researchers evaluate the statistical test arithmetic circuit using homomorphic encryption to obtain the numerator and denominator terms. In the online phase, the researchers provide the evaluated numerator and denominator ciphertexts to the computing parties, which then engage in a multi-party computation to evaluate the division and reveal the statistical test result. We report the runtime of each statistical test in Figure 5.
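The offline audit underlying the discovery-certification results (Figures 6 and 7) can be replayed in a few lines. The per-test threshold below follows the formula in Procedure 2; the wealth update is a stand-in for the InvestmentRule of § 4.3, which is not reproduced in this section, so we pass it in as a hypothetical callback:

```python
def alpha_rho(wealth, alpha=0.05, beta=0.5):
    # Per-test significance level from Procedure 2:
    #   alpha_rho = min(alpha, W(rho-1)(1-beta) / (1 + W(rho-1)(1-beta)))
    w = wealth * (1 - beta)
    return min(alpha, w / (1 + w))

def perform_audit(p_values, update_wealth, w0=0.05, alpha=0.05, beta=0.5):
    # Replay TestLog offline: recompute the wealth trajectory and record,
    # for each test rho, whether p_rho < alpha_rho held at that point.
    # `update_wealth(wealth, alpha_rho, rejected)` stands in for the
    # InvestmentRule of Section 4.3 (not reproduced here).
    wealth, verdicts = w0, []
    for p in p_values:
        a = alpha_rho(wealth, alpha, beta)
        rejected = p < a
        verdicts.append(rejected)
        wealth = update_wealth(wealth, a, rejected)
    return verdicts
```

Because every p-value (or significance bit) is on TestLog, any third party can run this replay without contacting the computing servers.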

[Figure 5 plot: six panels (Student's t-test; Pearson's correlation test; F-test; Chi-squared test with 2, 5, and 8 attributes) showing computation time (mm:ss, 00:00-10:00) for server configurations 1-5 on the continuous (abalone, cont_1k, cont_5k, cont_10k) and categorical (cat_1k, cat_5k, cat_10k) datasets.]
Figure 5: Total computation time (mm:ss) required for each test as a function of the computing server configuration and dataset selection. Offline runtime is presented in solid color; online runtime is cross-hatched.

[Figure 6 plot: average false discovery rate at 100%, 75%, 50%, and 25% null hypotheses, with and without α-investing.]
Figure 6: Comparison of average FDR with and without the α-investing control procedure on synthetic datasets with varying percentage of null hypotheses. Green line indicates the α = 0.05 threshold.

[Figure 7 plot: remaining α-investing budget over 0-60 statistical tests performed, for datasets with 100%, 75%, 50%, and 25% null hypotheses.]
Figure 7: Remaining α-investing budget as a function of the number of tests computed over synthetic datasets with varying percentage of null hypotheses.

[Figure 8 plot: dataset encryption time (mm:ss) for abalone, cont_1k, cont_5k, cont_10k, cat_1k, cat_5k, and cat_10k.]
Figure 8: Data owner setup time (mm:ss) for each dataset used in the evaluation.

The runtime of computing statistical tests ranged from a few seconds to approximately 9 minutes on the larger datasets. Even though this is several orders of magnitude of overhead compared to computing a statistical test directly over the unencrypted data (e.g., using SciPy), it is not unreasonably high when considering practical circumstances. Researchers in the real world are likely to only evaluate a few statistical tests on a dataset per day, which STAR is adept at handling, even with many researchers using the system. The results also illustrate the overhead of performing multi-party computation. In Figure 5 we see that the online time accounted for a significant fraction of the computation time, even though it is only used for evaluating the division gate and revealing the result. This confirms that evaluating the full statistical test over shared data using multi-party computation would incur an unreasonably high overhead on the runtime for a practical application.

Discovery certification. To demonstrate how STAR performs at its main goal of discovery certification, we evaluate the α-investing procedure for controlling false discoveries. For evaluating the FDR with and without the α-investing control procedure, we generate one hundred synthetic datasets with values ranging between 0 and 100 across 64 attributes and 1,000 rows. For each dataset we perform 64 "multiple comparisons" using each of the three tests over a subset of 2 attributes. We repeat this process for four dataset collections, where each collection contains a different percentage of true null hypotheses (e.g., 100% implies completely random data, so all "discoveries" are false). We report the results in Figure 6. The results show that the α-investing procedure indeed bounds the expected FDR to the specified alpha (α = 0.05), even when all discoveries in the data are false. We additionally report the α-budget remaining after each test (following Procedure 1) in Figure 7. We use the same generated datasets and set the initial wealth W(0) = 0.05 and parameters β = 0.5, γ = 0.0125. As expected, the budget is never fully depleted, but it quickly decreases in the case of datasets with a large portion of null hypotheses.

9 RELATED WORK

Much of the related work surrounding STAR either focuses on private computations using homomorphic encryption or on non-cryptographic methods for preventing p-hacking, such as pre-registration of hypotheses. To our knowledge, there has been no prior work using cryptographic techniques for certifying the validity of statistical tests in a provable and fully auditable way.

Sornette et al. [44] describe an approach for demonstrating the validity of stock market bubble predictions a posteriori. Their method relies on hashing documents containing financial forecasts, which get released after the forecasted date has passed. Coupled with a trusted entity for storing the hashes, their method provides a means of validating the historical accuracy of predictions in an unbiased way. While their solution isn't directly applicable to hypothesis testing, the main goal of their work is comparable to ours.

Dwork et al. [18, 19] propose a method that constrains an analyst's access to a holdout dataset as follows. A trusted party keeps a holdout dataset and answers up to m hypotheses using a differentially private algorithm. Here, m depends on the size of the holdout data and the generalization error that one is willing to tolerate. Hence, compared to our decentralized approach, this method assumes trust in a single party that (1) does not release the dataset to the researchers and (2) stops answering queries after m hypothesis queries. Moreover, the method introduces random noise into the results, which can potentially alter the outcome of the computed test.

Lauter et al. [37] and Zhang et al. [50] propose the use of homomorphic encryption to compute statistical tests on encrypted genomic data. However, the constructions are not intended for validating statistical testing procedures but rather for guarding the privacy of patient data. Furthermore, Lauter et al. make several simplifications, such as not performing encrypted division (rather, performing arithmetic division in the clear), imposing assumptions on the data, etc., making their solution less general compared to the methods for computing statistical tests in STAR.

Homomorphic encryption and multi-party computation techniques have been used for outsourcing machine learning tasks over private data [9, 24, 48, 49]. Our work, however, crucially relies on the decryption functionality being decentralized in order to control and keep track of data exposure.

Other related work surrounds non-cryptographic techniques to guard against false discoveries in statistical analysis. This includes procedures such as the α-investing procedure [20, 51] and other methods for bounding the number of false discoveries [4, 17]. When applied properly, these techniques prevent p-hacking in an idealized scenario, but they provide no guarantee that the methods were applied correctly, which is why the system we describe becomes a necessity.

Finally, we note that preserving the privacy of dataset values when releasing results of statistical tests [21, 31] is orthogonal to our work.

10 CONCLUSIONS

We present STAR, a system that certifies hypothesis testing using cryptographic techniques and a decentralized certifying authority to eliminate avenues for p-hacking and other forms of data dredging. STAR computes statistical tests over encrypted data by combining somewhat homomorphic encryption with multi-party computation in a way that enables efficient and practical evaluation yet provides full auditability of results. We show that STAR supports a wide range of statistical tests used in practice for hypothesis testing, and we evaluate four of the most common tests used by researchers. Our experiments demonstrate the feasibility of such a system for real-world settings, with only a small computational overhead imposed on researchers evaluating tests. As such, we believe that STAR is a viable solution for preventing p-hacking in a provable way, promoting accountability and reproducibility in scientific studies. With STAR, data can be released for public use with the guarantee that the insights extracted from the data are correct and that all necessary false discovery control procedures were applied. STAR is the first system which addresses p-hacking and fosters reproducibility by providing a certificate-of-correctness and enabling post hoc auditing of all actions taken by researchers.

REFERENCES

[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[2] M. Baker. 1,500 scientists lift the lid on reproducibility. Nature News, 533(7604):452, 2016.
[3] C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531, 2012.
[4] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289-300, 1995.
[5] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to probability, volume 1. Athena Scientific, Belmont, MA, 2002.
[6] D. Bogdanov. Foundations and properties of Shamir's secret sharing scheme. Research seminar in cryptography, University of Tartu, Institute of Computer Science, 1, 2007.
[7] D. Bogdanov, S. Laur, and J. Willemson. Sharemind: A framework for fast privacy-preserving computations. In European Symposium on Research in Computer Security, pages 192-206. Springer, 2008.
[8] D. Boneh, R. Gennaro, S. Goldfeder, A. Jain, S. Kim, P. M. Rasmussen, and A. Sahai. Threshold cryptosystems from threshold fully homomorphic encryption.
[19] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117-126. ACM, 2015.
[20] D. P. Foster and R. A. Stine. α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429-444, 2008.
[21] M. Gaboardi, H. Lim, R. Rogers, and S. Vadhan. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2111-2120, 2016.
[22] C. Gentry. A fully homomorphic encryption scheme. Stanford University, 2009.
[23] M. M. Ghassemi, S. E. Richter, I. M. Eche, T. W. Chen, J. Danziger, and L. A. Celi. A data-driven approach to optimized medication dosing: a focus on heparin. Intensive care medicine, 40(9):1332-1339, 2014.
[24] T. Graepel, K. Lauter, and M. Naehrig. ML confidential: Machine learning on encrypted data. In International Conference on Information Security and Cryptology (ICISC), 2013.
[25] M. L. Head, L. Holman, R. Lanfear, A. T. Kahn, and M. D. Jennions. The extent and consequences of p-hacking in science. PLoS biology, 13(3):e1002106, 2015.
[26] K. E. Henry, D. N. Hager, P. J. Pronovost, and S. Saria. A targeted real-time early warning score (trewscore) for septic shock. Science translational medicine, 7(299):299ra122-299ra122, 2015.
[27] J. P. Ioannidis. Contradicted and initially stronger effects in highly cited clinical research. Jama, 294(2):218-228, 2005.
[28] J. P. Ioannidis.
Why most published research findings are false. PLoS morphic encryption. In Annual International Cryptology Conference, medicine, 2(8):e124, 2005. pages 565–596. Springer, 2018. [29] A. Jain, P. M. Rasmussen, and A. Sahai. Threshold fully homomorphic [9] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser. Machine learning classi- encryption. IACR Cryptology ePrint Archive, 2017:257, 2017. fication over encrypted data. In NDSS, volume 4324, page 4325, 2015. [30] L. K. John, G. Loewenstein, and D. Prelec. Measuring the prevalence [10] E. Cecchetti, F. Zhang, Y. Ji, A. Kosba, A. Juels, and E. Shi. Solidus: of questionable research practices with incentives for truth telling. Confidential distributed ledger transactions via PVORM. In Proceedings Psychological science, 23(5):524–532, 2012. of the 2017 ACM SIGSAC Conference on Computer and Communications [31] A. Johnson and V. Shmatikov. Privacy-preserving data exploration Security, pages 701–717, 2017. in genome-wide association studies. In Proceedings of the 19th ACM [11] A. R. Choudhuri, M. Green, A. Jain, G. Kaptchuk, and I. Miers. Fairness SIGKDD International Conference on Knowledge Discovery and Data in an unfair world: Fair multiparty computation from public bulletin Mining, KDD ’13, pages 1079–1087, 2013. boards. In Proceedings of the 2017 ACM SIGSAC Conference on Computer [32] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, and Communications Security, pages 719–728. ACM, 2017. B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. Mimic-iii, a freely [12] A. Cockburn, C. Gutwin, and A. Dix. Hark no more: on the preregis- accessible critical care database. Scientific data, 3:160035, 2016. tration of chi experiments. In Proceedings of the 2018 CHI Conference [33] M. Keller, E. Orsini, and P. Scholl. Mascot: faster malicious arithmetic on Human Factors in Computing Systems, page 141. ACM, 2018. secure computation with oblivious transfer. In Proceedings of the 2016 [13] R. Cramer, I. 
Damgård, and J. B. Nielsen. Multiparty computation from ACM SIGSAC Conference on Computer and Communications Security, threshold homomorphic encryption. In International Conference on the pages 830–842. ACM, 2016. Theory and Applications of Cryptographic Techniques, pages 280–300. [34] M. Keller, V. Pastro, and D. Rotaru. Overdrive: making spdz great again. Springer, 2001. In Annual International Conference on the Theory and Applications of [14] I. Damgård, M. Keller, E. Larraia, V. Pastro, P. Scholl, and N. P. Smart. Cryptographic Techniques, pages 158–189. Springer, 2018. Practical covertly secure mpc for dishonest majority–or: breaking the [35] N. L. Kerr. Harking: Hypothesizing after the results are known. Per- spdz limits. In European Symposium on Research in Computer Security, sonality and Social Review, 2(3):196–217, 1998. pages 1–18. Springer, 2013. [36] L. Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25, 2001. [15] K. Dickersin, S. Chan, T. Chalmersx, H. Sacks, and H. Smith Jr. Publi- [37] K. Lauter, A. López-Alt, and M. Naehrig. Private computation on cation bias and clinical trials. Controlled clinical trials, 8(4):343–353, encrypted genomic data. In International Conference on Cryptology 1987. and Information Security in Latin America, pages 3–27. Springer, 2014. [16] J.-B. Du Prel, B. Röhrig, G. Hommel, and M. Blettner. Choosing statis- [38] L. Mayaud, P. S. Lai, G. D. Clifford, L. Tarassenko, L. A. G. Celi, and tical tests: part 12 of a series on evaluation of scientific publications. D. Annane. Dynamic data during hypotensive episode improves mor- Deutsches Ärzteblatt International, 107(19):343, 2010. tality predictions among patients with sepsis and hypotension. Critical [17] O. J. Dunn. Multiple comparisons among means. Journal of the Ameri- care medicine, 41(4):954, 2013. can Statistical Association, 56(293):52–64, 1961. [39] S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2008. [18] C. Dwork, V. Feldman, M. Hardt, T. 
Pitassi, O. Reingold, and A. Roth. http://www.bitcoin.org/bitcoin.pdf. Guilt-free data reuse. Communications of the ACM, 60(4):86–93, 2017. [40] U. S. Neill. Publish or perish, but at what cost? The Journal of clinical investigation, 118(7):2368–2368, 2008. 15 [41] R. Pass, E. Shi, and F. Tramer. Formal abstractions for attested execution [47] P. Voigt and A. Von dem Bussche. The eu general data protection reg- secure processors. In Annual International Conference on the Theory ulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International and Applications of Cryptographic Techniques, pages 260–289. Springer, Publishing, 2017. 2017. [48] D. Wu and J. Haven. Using homomorphic encryption for large scale [42] Microsoft SEAL (release 3.3). https://github.com/Microsoft/SEAL, 2019. statistical analysis. Technical report, Technical Report: cs. stanford. Microsoft Research, Redmond, WA. edu/people/dwu4/papers/FHESI Report. pdf, 2012. [43] A. Shamir. How to share a secret. Communications of the ACM, [49] P. Xie, M. Bilenko, T. Finley, R. Gilad-Bachrach, K. E. Lauter, and 22(11):612–613, 1979. M. Naehrig. Crypto-nets: Neural networks over encrypted data. CoRR, [44] D. Sornette, R. Woodard, M. Fedorovsky, S. Reimann, H. Woodard, and abs/1412.6181, 2014. W.-X. Zhou. The financial bubble experiment: advanced diagnostics [50] Y. Zhang, W. Dai, X. Jiang, H. Xiong, and S. Wang. Foresee: Fully and forecasts of bubble terminations. arXiv preprint arXiv:0911.0454, outsourced secure genome study based on homomorphic encryption. 2009. 15:S5, 12 2015. [45] P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia. [51] Z. Zhao, L. De Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska. A formal foundation for secure remote execution of enclaves. In Controlling false discoveries during interactive data exploration. 
In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Proceedings of the 2017 ACM International Conference on Management Communications Security, pages 2435–2450. ACM, 2017. of Data, pages 527–540. ACM, 2017. [46] W. H. Thompson, J. Wright, P. G. Bissett, and R. A. Poldrack. Dataset [52] J. Zhou, D. Foster, R. Stine, and L. Ungar. Streaming feature selection decay: the problem of sequential analyses on open datasets. bioRxiv, using alpha-investing. In Proceedings of the eleventh ACM SIGKDD page 801696, 2019. international conference on Knowledge discovery in , pages 384–393. ACM, 2005.
