
STAR: Statistical Tests with Auditable Results
A System for Tamper-proof Hypothesis Testing

Sacha Servan-Schreiber (MIT CSAIL), Olga Ohrimenko (Microsoft Research), Tim Kraska (MIT CSAIL), Emanuel Zgraggen (MIT CSAIL)

arXiv:1901.10875v2 [cs.CR] 23 Oct 2019

ABSTRACT
We present STAR: a novel system aimed at solving the complex issue of "p-hacking" and false discoveries in scientific studies. STAR provides a concrete way of ensuring the application of false discovery control procedures in hypothesis testing, using mathematically provable guarantees, with the goal of reducing the risk of data dredging. STAR generates an efficiently auditable certificate which attests to the validity of each statistical test performed on a dataset. STAR achieves this by using several cryptographic techniques which are combined specifically for this purpose. Under the hood, STAR uses a decentralized set of authorities (e.g., research institutions), secure computation techniques, and an append-only ledger, which together enable auditing of scientific claims by third parties and match real-world trust assumptions. We implement and evaluate a construction of STAR using the Microsoft SEAL encryption library and the SPDZ multi-party computation protocol. Our experimental evaluation demonstrates the practicality of STAR in multiple real-world scenarios as a system for certifying scientific discoveries in a tamper-proof way.

1 INTRODUCTION
According to a 2016 Nature survey, over 70% of researchers failed to reproduce published results of other scientists and over 50% failed to reproduce their own published results [2]. The "Replication Crisis", plaguing almost all scientific domains, has serious and far-reaching consequences for the continued progress of scientific discovery. Unfortunately, solutions addressing the problem are few and often ineffective for two reasons: 1) current solutions either fail to take into account real-world trust assumptions (e.g., by trusting researchers to carefully apply false discovery control protocols) or 2) are overly restrictive (e.g., by requiring independent replication of results prior to publishing or preregistration of hypotheses). Moreover, these solutions fail to take into account modern approaches to data analysis, specifically the abundance of existing data and the means by which to explore it, and impose overly stringent requirements.

The replication crisis is, in part, a direct result of these problems and of what is formally known as the Multiple Comparisons Problem (MCP). With every hypothesis tested over a dataset (using any type of statistical testing procedure), there is a small probability of a chance, i.e., false positive, discovery with no real basis in the population being studied. With every additional statistical test performed on the data, the chance of encountering such a random correlation increases. This can be intentionally exploited to "fabricate" significant discoveries and, if done systematically, is referred to as "HARKing" [35], "p-hacking" [25] or "data dredging".

While a variety of statistical techniques exist to control for the MCP by setting a threshold on the false discovery rate (FDR), i.e., the ratio of false positives to true positives over a sequence of hypotheses [4, 17], there is surprisingly almost no support to ensure that researchers and analysts actually use them. Rather, individual research groups rely on often varying guidelines and trust their group members to abide by the control procedures correctly, which, unfortunately, rarely works in the real world. This is because 1) making even one simple mistake in the application of the control procedure can result in a false discovery and 2) there is no means of guaranteeing that each researcher carefully applied the control procedure (or didn't intentionally deviate from the procedure to get a "significant" (false) discovery). Things get even worse when the same data is analyzed by several institutions or teams, since guarding against false discoveries requires a coordinated effort. It is currently close to impossible to reliably employ statistical procedures, such as the Bonferroni [17] method, that guard against p-hacking across collaborators. It only requires one member to "misuse" the data (intentionally or otherwise), and detecting, let alone recovering from, such incidents is next to impossible.

This problem is perhaps further exacerbated by the pressure on PhD students and PIs to publish [40], by "publication bias" [15] as papers with significant results are more likely to be published, and by the increasing trend to share and make datasets publicly available for any researcher to use in studies. Therefore, after examining the state of affairs, it is perhaps not surprising that the scientific community is plagued by false discoveries [3, 27, 28, 30].

To illustrate this problem concretely, consider a publicly available dataset such as MIMIC III [32]. This dataset contains de-identified health data associated with ≈ 40,000 critical care patients. MIMIC III has already been used in various studies [23, 26, 38] and it is probably one of the most (over)analyzed clinical datasets and therefore prone to "dataset decay" [46]. As such, any new discovery made on MIMIC runs the risk of being a false discovery. Even if a particular group of researchers follows a proper FDR control protocol, there is no control over what happens across different groups, and tracking hypotheses at a global scale poses many challenges of its own. It is therefore hard to judge the validity of any insight derived from such a dataset [46].

A solution to guarantee the validity of insights commonly used in clinical trials, preregistration of hypotheses [12], falls short in these scenarios since the data is collected upfront without knowing what kind of analysis will be done later on. Perhaps more promising is the use of a hold-out dataset. The MIMIC author, for example, could have released only 30K patient records as an "exploration" dataset and held back 10K records as a "validation" dataset. The exploration dataset can then be used in arbitrary ways to find interesting hypotheses. However, before any publication is made by a research group using the dataset, all hypotheses must be (re)tested over the validation dataset.

Unfortunately, in order to use the validation dataset more than once, we run into the same problem: every hypothesis over the validation dataset has to be tracked and controlled for. Furthermore, the data owner (the MIMIC author in this case) needs to provide this hypothesis validation service. This is both a burden for the data owner as well as a potential risk. Researchers need to trust the data owner to apply the necessary control procedures and to objectively evaluate their hypotheses, which, unfortunately, does not always align with real-world incentive structures.

The above example illustrates the motivation behind the need for a system that addresses these problems. With STAR, our goal is to create a system that guarantees the validity of statistical test outcomes and allows readers (and/or reviewers) of publications to audit them for correctness, all without introducing unnecessary burdens on data providers and researchers. By using cryptographic techniques to certify outcomes of statistical tests and by introducing a decentralized authority, we eliminate the risk of data dredging (intentional and otherwise) by researchers using a dataset. STAR can be used in various settings, including cases where the data is public and only the hold-out data is fed into STAR (as in the example above), settings where a few research groups collaborate on combined data, or even within single teams, where lab managers can opt to use STAR as a way to prevent unintentional false discoveries, assign accountability, and foster reproducibility.

1.1 Contributions
• We present a novel system for preventing p-hacking using cryptographic techniques which provide (mathematical) guarantees on the validity of each tested hypothesis during analysis, even in settings where researchers are not trusted to apply control procedures correctly, while also ensuring full auditability of all results obtained through STAR.
• We implement and evaluate STAR on four widely used statistical tests (Student's t-test, Pearson's correlation, Chi-squared and the ANOVA F-test) to demonstrate the applicability of STAR to real world scenarios.
• We describe how a common false discovery control procedure known as α-investing can be applied with STAR to provide full control over the data analysis phase in a certifiable manner.

To the best of our knowledge, STAR is the first cryptographic solution to the problem of p-hacking, providing guarantees (in the form of tamper-proof certificates) on the validity of all insights gleaned from a dataset. We believe that STAR is the first system to address the long-standing problem of discovery certification across scientific domains, and it achieves this with minimal overhead on researchers and data providers.

2 DESIGN
Our design of STAR is motivated by the following observations. The data owner cannot release a dataset D to the researchers directly, since doing so creates a possibility for p-hacking (e.g., researchers can run tests privately and report only favorable results without controlling for false discoveries). Therefore, any system which addresses p-hacking must "hide" the raw data from researchers. With this in mind, we begin by describing several initial design ideas and discuss their limitations. This will serve as a motivation for why we choose the design and construction described in § 2.2 and § 5. We then describe some foreseeable use cases of our design in § 3.

2.1 Strawman Designs
The Trusted Authority Scheme. The simplest scheme is to assume a trusted authority (3rd party) which has full access to the unencrypted dataset. Researchers then perform statistical tests on the dataset by querying the authority, who executes the computations on their behalf and returns only the result of the test.

Unfortunately, there are two immediate problems with such a design. The solution requires that the data owner, the researchers, and the auditors trust the authority when it comes to storing the dataset, correctly executing statistical tests, and applying the control procedure.

Figure 1: High-level overview of STAR. Data owners upload datasets to the network of computing servers. Researchers proceed to compute statistical tests over the dataset using secure computation that is logged on a public log. An auditor can examine the log to determine the validity of a given discovery.

While this can work in certain isolated cases, real-world parties often have conflicting incentives that make finding such an authority challenging. For example, consider a drug company releasing a dataset for public use: researchers may not fully trust the company to evaluate each hypothesis objectively. Moreover, there is no way to verify the actions of the authority, i.e., to ensure that each statistical test was computed correctly, all p-values reported, etc., meaning that it must be trusted blindly.

The Secure Enclave Scheme. A more viable alternative to the above scheme is to have a trusted secure processor [45] with a strict interface which only allows for the evaluation of statistical tests. In such a scheme, the data owner encrypts the dataset using the enclave's public key, which allows the enclave to execute statistical tests by first decrypting the dataset in secure memory and then evaluating the statistical test. The enclave can then reveal the result along with a digital signature authenticating the computation. However, such a design suffers from similar problems as the design involving a single trusted authority: namely, it is now the secure enclave and the machine on which it runs that have to be trusted to guarantee that computations are performed correctly [11, 41].

The Multi-party Computation Scheme. The secure enclave approach motivates the idea of distributing trust using multiple parties. For example, instead of having one entity in control of the data, a dataset can be shared among a set of parties which can collectively compute over their shares but cannot individually access the shared data [6, 7, 43]. This takes care of the guaranteed-output problem and provides the means for ensuring all computations are accounted for, provided some threshold number of parties follow the protocol correctly. While on the surface this may appear to be a good solution, there are several barriers when it comes to practicality. Computing statistical tests over such a shared dataset requires general multi-party computation, which quickly becomes impractical due to the high network overheads needed to evaluate functions using the secret-shared data [7, 14, 33]. In general, multi-party computation is only practical when it comes to evaluating simple functions with few shared inputs [14, 33]. This motivates our main design, which uses minimal multi-party computation and other secure computation techniques to create a practical scheme.

2.2 STAR Framework
With the initial design attempts described above, we are ready to describe the design of STAR, which addresses all the problems mentioned in the strawman ideas. We desire a general solution which fits well with real-world trust assumptions and incentives. The design of STAR achieves these requirements by using two cryptographic building blocks. Specifically, we use secure computation and a distributed ledger (covered in the following section), which are combined in a special way to achieve the desired goal: building a system for conducting auditable hypothesis testing while remaining practical. Figure 1 provides a design overview of the system.

At a high level, the data owner encrypts a dataset D and releases the (encrypted) dataset publicly or to a set of researchers. The encryption scheme used is "special" in the sense that encrypted values can be used as inputs to functions, and the functions evaluated over the encrypted data, without knowledge of the secret key needed to decrypt the values. Such schemes are called homomorphic encryption schemes. Researchers use the encrypted dataset, denoted [D], to compute statistical tests (encoded as arithmetic circuits) using the homomorphic properties of the scheme. However, one problem remains: once the statistical test is evaluated, the result is still encrypted, and researchers do not have the secret key needed to decrypt it. One trivial solution is to send the result to the data owner, who then uses the secret key to decrypt the result. However, this requires the data owner to be online in order to process such decryption requests and, furthermore, places full trust in the data owner to correctly report results, which, as mentioned above, we would like to avoid. Moreover, this can be especially problematic in scenarios where the dataset is comprised of multiple data sources, since there is only one secret key.

In STAR, we distribute the power to decrypt ciphertexts to a set of parties (servers) which each hold a share of the secret key but, crucially, cannot decrypt on their own using their share of the key. Parties can decrypt if and only if they "agree" to do so collectively, thus preventing any subset of rogue parties from decrypting results on their own and colluding with researchers.

When the parties decrypt a statistical test result, they also keep a public log of the action, which guarantees that each revealed statistic is recorded in a tamper-proof and auditable way. This latter requirement ensures that all statistical tests executed on [D] are recorded in sequence, and makes it possible for anyone to verify whether false discovery control procedures were applied correctly. We elaborate on these important requirements in subsequent sections.

3 USE CASES
We describe several distinct and highly relevant scenarios where STAR is directly applicable. We hope these scenarios provide evidence for both the utility of STAR and the relevance of the system for addressing false discoveries.

First scenario. A data owner wishes to release a dataset for public research use but wants to prevent the data from being overanalyzed and false discoveries being extracted from the data as a result. As is usually done in such cases, the data owner partitions the dataset into an exploration and a validation dataset. The exploration dataset is released to the public without any restrictions on how statistical tests should be computed. The validation dataset is encrypted in such a way that results computed on the dataset are only made available through STAR: researchers wishing to execute tests on the validation dataset must do so by requesting the parties in STAR to reveal the result.

Using the exploration dataset, researchers independently form hypotheses about the data. Prior to publishing the results, they validate each hypothesis over the validation dataset via STAR. This guarantees that all validated hypotheses remain auditable and all computed p-values are accounted for in a tamper-proof, timestamped sequence on the log maintained by the decrypting parties. The publication can then be validated by auditors (e.g., reviewers) to ensure that the researchers' hypotheses were accepted (resp. rejected) in a way that controls for false discoveries and that the insights therein are valid. This ensures that the publication's claims are statistically significant (within the alleged error bounds) and guarantees that no p-hacking or other forms of data dredging occurred during the analysis.

Second scenario. STAR also extends to settings where multiple data owners contribute to a collective dataset and want to ensure that their data is not misused. Consider a scenario where two separate research groups (that potentially don't trust each other with their data) have datasets consisting of similar attributes. For example, consider the Food and Drug Administration (FDA) in the United States and the Chinese Food and Drug Administration (CFDA) in China. Assume that both administrations have data about a new drug claiming to cure Alzheimer's disease. Both the FDA and CFDA use their own respective datasets to explore the data and release their respective datasets in encrypted form such that tests can only be computed through STAR. Once both organizations have found interesting hypotheses in their own datasets, they use the other organization's data as a validation dataset by testing hypotheses through STAR. In other words, each organization uses its own data to determine interesting insights and proceeds to validate the other's hypotheses using the other organization's data. Since all validations are logged in STAR, results remain auditable. Moreover, the data of each organization remains private (e.g., the FDA does not learn the CFDA's data, except for the result of a statistical test).

Third scenario. A data owner wishes to release sensitive data for research purposes but is required to comply with stringent regulatory demands for privacy (e.g., as required by the European GDPR regulation [47]). Even allowing researchers to compute statistical tests can be a violation under such privacy laws. This results in potentially useful data being left outside the reach of the research community. Using STAR, however, it becomes possible to accommodate privacy needs by only allowing researchers to test for statistical significance (i.e., not even revealing the p-value of a statistical test, which we describe in a following section). STAR can be extended to only reveal whether or not a given hypothesis should be accepted (resp. rejected) while simultaneously controlling for false discoveries.

4 TECHNICAL DETAILS
We now turn to describing the security properties we require and explain how STAR can be used to certify the correct application of discovery control procedures.

4.1 Security Goals
In order to adequately present and analyze the guarantees of STAR, we must first establish a set of security goals. At a high level, STAR aims to achieve the following goals, which guarantee correctness of the system. We outline the goals in this section and formalize them in § 7.

Second scenario. STAR also extends to setting where mul- Correctness. Statistical tests computed on the encrypted tiple data owners contribute to a collective dataset and want dataset [D] must be correct, i.e, the p-value of a test com- to ensure that their data is not misused. Consider a scenario puted in STAR has to be equivalent to the p-value computed where two separate research groups (that potentially don’t through standard statistical packages (e.g., MATLAB and trust each other with their data) have datasets consisting of SciPy) over the unencrypted D. 4 Confidentiality. The only information revealed about D researchers are interested in abusing the system with the is the results of statistical tests which ensures that false goal of extracting as many insights as possible and avoiding discoveries are fully controlled for and no hypothesis can applying the false discovery control procedure. To provide be tested outside of STAR’s interface. In other words, STAR accountability and prevent other forms of attacks, we require does not “leak” data about the D when statistical tests are a secure public key infrastructure which identifies individual evaluated (more than the result of the statistical test). researchers by their public key and assume that all messages exchanged between researchers and computing servers are Access control. In many cases, it is desirable to provide digitally signed during protocol execution. access to only a set of approved researchers. We require that STAR have a mechanism in place for enforcing access control, though it is not an inherent requirement. If an access control 4.3 Controlling False Discoveries policy is set, then statistical tests on D can only be executed by the set of approved researchers. An important first step in realizing the construction of STAR is understanding how false discovery control procedures are Auditability. For all hypotheses evaluated through STAR, used in the data analysis phase. 
In order to certify discoveries, the resulting p-values can be checked for their statistical it is necessary to use the (ordered) sequence of p-values significance with respect to the false discovery control proce- computed over dataset. The sequence is then used to bound dure used. Crucially, anyone can audit the correctness the false discovery rate (FDR) during or after analysis. of a statistical test and can do so using only publicly While standard procedures such as Bonferroni [17] must available information and without interacting with re- be applied a posteriori of the data analysis (once all hy- searchers or the computing parties. potheses have been tested), in STAR, we desire a method for controlling the FDR in a streaming fashion, ideally with- 4.2 Threat Model out knowledge or restriction on future tests. Fortunately, STAR provides provable guarantees to the four properties there is a common false discovery control procedure known outlined in § 4.1 under the following threat model. Our se- as α-investing [20] which achieves just that. The α-investing curity assumptions are with respect to the data owner(s), procedure is a standard choice for controlling false discov- computing parties (e.g., different institutions and research eries in practice when the total number of hypotheses is labs), and researchers evaluating their hypotheses through unknown ahead of time (e.g., in exploration settings) [51, 52]. STAR. We describe the intuitive version of the procedure below and we refer the reader to [20, 51] for more formal definitions Data Owner. We make the necessary assumption that the and proofs which are outside the scope to understanding data owner does not collude with the researchers who run this work. the statistical tests. 
This assumption is required given that a The α-investing procedure works by assigning, to each malicious owner has unfettered access to the (unencrypted) hypothesis, a “budget” α from an initial “α-wealth.” If the p- dataset D and can thereby trivially avoid executing tests value of the null hypothesis being considered is above some through STAR, only “testing” results known to be signifi- α the null hypothesis is accepted and some budget is lost, oth- cant ahead of time without applying the necessary control erwise the null is rejected and the available budget increases. procedure. Choosing α depends on the “investment strategy” [51] and is ultimately left as a choice for the analyst. We present a Computing Servers. Computing parties are collectively common investment strategy in Procedure 1 which allows in possession of the decryption key, however, none of the for a (theoretically) unbounded number of statistical tests to parties individually have access to the unencrypted data D. be executed over the dataset. The dataset remains unknown to the servers provided at least More formally, for the jth null hypothesis, Hj , being tested, one of the servers follows protocol correctly. If the parties are Hj is assigned a budget αj > 0. Let pj denote the p-value asso- also maintaining the distributed log on which statistical test ciated with the result of the statistical test used to determine results are recorded, we assume that a majority of the parties the statistical significance of Hj . The null is rejected if pj ≤ αj , are honest when it comes to maintaining the correctness of otherwise it is accepted. In the case that Hj is rejected, the the log. testing procedure obtains a “return” on investment γ ≤ α. On the other hand, if the H is accepted, α /(1 − α ) wealth Researchers. We assume that researchers are malicious and j j j is subtracted from the remaining α-wealth. 
Therefore, the interested in gaining as much information on D as possi- remaining wealth, denoted W (j), after the jth statistical test ble (for the purpose of bypassing control procedures and is set according to: falsely certifying discoveries). To this end, we assume that 5 Publication ( Publication W (j) + γ if pj ≤ αj , p-value W (j + 1) = α (1) p-value p-value p-value p-value p-value p-value p-value W (j) − j if p > α p-value 1−αj j j test computation log auditor

Procedure 1: InvestmentRule(α, β, γ) [51]
1  W(0) = α;                                           // initial wealth
2  foreach i ← 1, 2, ... do
3      αi = min(α, W(i−1)(1−β) / (1 + W(i−1)(1−β)));
4      pi ← test(Hi);                                  // p-value for hypothesis Hi
5      if pi < αi then
6          W(i) = W(i−1) + γ;
7      else
8          W(i) = W(i−1) − αi/(1−αi);

Figure 2: Auditor verifies the validity of reported p-values by examining the values stored on the log and ensuring that the α-investing procedure was applied correctly.

In STAR, the parameters W(0), α, β, and γ can be set as static constants or determined by more complex mechanisms, as seen fit. The important takeaway in understanding STAR is that, given all p-values computed on D up to the current test, it is easy to determine whether any given hypothesis should be accepted (resp. rejected) based on the corresponding p-value, in order to bound the number of false discoveries to α overall. We illustrate the process in the following example.

Illustrative example: Consider three hypotheses H1, H2, H3, with p-values p1 = 0.0012, p2 = 0.2331, and p3 = 0.0049, respectively. Set the initial alpha-wealth W(0) = 0.05, fix γ = W(0)/4, and let β = 0.25. After testing hypothesis H1, we obtain p1 < α1, so H1 is rejected and we obtain a "return on investment." Now W(1) = W(0) + γ = 0.0625. For the second test, we obtain p2 > α2, so H2 is accepted and the remaining wealth is W(2) = W(1) − α2/(1 − α2) ≈ 0.016. For the third test, we get p3 < α3, so H3 is rejected and W(3) = W(2) + γ ≈ 0.028. Examining the sequence of p-values, it is trivial to verify the validity of discoveries H1 and H3 by reapplying Procedure 1 with all three computed p-values.

4.4 Certification of Hypotheses
The goal of STAR is to certify, in a tamper-proof way, that the insights gained from the available data are valid. Taking a bird's-eye view: for every test computed through STAR, a publicly auditable certificate ψ is produced which attests to the application of the α-investing procedure during the hypothesis testing process. Formally, we define a certificate of a statistical test T as a signed tuple (ρ, σ, τ), where ρ is the test index (i.e., the test's order among all the tests that have been executed so far on D), σ is the identifier of the statistical test function T, and τ is the result of executing T(D), i.e., the test statistic. Given that the certificate contains both the test index ρ and the resulting p-value, it is easy to verify (in conjunction with previous certificates) whether the false discovery control procedure was applied correctly, by (re)applying Procedure 1 and verifying the conclusion researchers came to during analysis. Note that the statistical tests do not need to be re-evaluated in order to verify the correctness of results; the p-value only has to be computed once.

5 CONSTRUCTION
We are now ready to describe the system in detail. We start by describing the cryptographic tools used in realizing STAR and then formalize the different components of STAR. We note that the cryptographic techniques used as building blocks are well studied and fairly standard, but they are combined in a novel and carefully thought-out manner to achieve practical efficiency and meet the goals outlined in § 2.

5.1 Preliminaries
STAR is constructed using the following building blocks: a tamper-proof ledger for recording all statistical tests executed over a dataset in an auditable fashion, somewhat homomorphic encryption for evaluating statistical tests over encrypted data, and multi-party computation for securely decrypting results and, in some cases, evaluating parts of the statistical test circuit. Table 1 provides a summary of the cryptographic building blocks used and their role in STAR.

Tamper-proof Ledger. We rely on a tamper-proof ledger abstraction which we call TestLog. The ledger records statistical tests executed on the dataset such that each test result and the order of test execution is publicly available and remains immutable once recorded. In STAR, TestLog can be maintained by a central authority, by the computing servers themselves using a consensus protocol (e.g., Paxos [36]), or by independent parties (e.g., using a public blockchain based on a decentralized consensus protocol such as Nakamoto consensus [39]). Similar to the ledger abstraction used in [10], our only requirement is that TestLog is correct and available.
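To make the tamper-evidence requirement concrete, the following toy sketch models TestLog as a hash-chained append-only list. This is our own minimal illustration; as described above, a deployment would instead rely on a consensus protocol or a public blockchain.

```python
# Toy TestLog: each entry (rho, sigma, tau) is chained to its
# predecessor by a SHA-256 digest, so any retroactive modification of a
# recorded result breaks the chain and is detectable. Illustration only.
import hashlib
import json

class TestLog:
    def __init__(self):
        # Genesis entry: the signed message (0, None, None) posted at Setup.
        self.entries = [{"rho": 0, "sigma": None, "tau": None, "prev": "0" * 64}]

    def _digest(self, entry):
        return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()

    def append(self, sigma, tau):
        """Record a test identifier and its (decrypted) result."""
        prev = self._digest(self.entries[-1])
        entry = {"rho": len(self.entries), "sigma": sigma, "tau": tau, "prev": prev}
        self.entries.append(entry)
        return entry

    def verify(self):
        """Recompute the hash chain; returns False if any entry was altered."""
        return all(
            self.entries[i]["prev"] == self._digest(self.entries[i - 1])
            for i in range(1, len(self.entries))
        )

log = TestLog()
log.append("t-test", 0.0012)
log.append("pearson", 0.2331)
assert log.verify()
log.entries[1]["tau"] = 0.0001   # tampering with a recorded p-value...
assert not log.verify()          # ...is detected by the chain check
```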

Property | Cryptographic Building Block(s) | Role in STAR
Access Control | Digital signatures & encryption | Access to data is restricted to the STAR protocol: the data is encrypted and hence inaccessible to researchers directly. Moreover, access to the system is controlled via digital signatures to thwart system abuse by any single researcher in the form of a denial-of-service attack.
Hypothesis tracking | Homomorphic encryption & multi-party computation | Computation results are revealed via consensus of the computing servers, and remain hidden otherwise. Researchers cannot test hypotheses outside of STAR because the data remains encrypted and secret at all times.
Auditability | Tamper-proof ledger & controlled decryption | Every statistical test computation is recorded in a publicly readable log at decryption time, making test results (and the application of false discovery control procedures) auditable by researchers and third parties.

Table 1: Summary of cryptographic building blocks and their role in realizing the main properties of STAR.
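The "controlled decryption" entries in Table 1 rest on the idea that no single server holds the whole secret key. A toy additive secret-sharing sketch conveys the intuition; this is an illustration only (STAR shares the key of a somewhat homomorphic scheme, not a one-time-pad key, and the names here are ours):

```python
# Toy additive secret sharing: the key is split among k servers so that
# all shares are needed to decrypt; any proper subset learns nothing.
import secrets

P = 2**61 - 1  # prime modulus for the toy scheme

def share(secret, k):
    """Split `secret` into k additive shares modulo P."""
    parts = [secrets.randbelow(P) for _ in range(k - 1)]
    parts.append((secret - sum(parts)) % P)
    return parts

def reconstruct(parts):
    return sum(parts) % P

key = secrets.randbelow(P)
result = 1234                       # plaintext test result
ciphertext = (result + key) % P     # toy "encryption" of the result

server_shares = share(key, k=3)     # one share per computing server
# Only the collective of all servers can recover the key and decrypt:
recovered = (ciphertext - reconstruct(server_shares)) % P
assert recovered == result
```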

Somewhat Homomorphic Encryption. Somewhat Homomorphic Encryption (SHE) is a special form of asymmetric encryption where a limited number of computations can be performed over encrypted values without knowledge of the decryption key. In other words, SHE enables the evaluation of a limited class of functions on encrypted inputs to obtain an encrypted output. Unlike fully homomorphic encryption [22], where arbitrary functions can (at least theoretically) be evaluated on encrypted inputs, SHE can only evaluate a handful of multiplication gates, but it does not incur the high computational overhead of fully homomorphic encryption [42].

Multi-party Computation. Multi-party computation (MPC) is a set of techniques for computing over secret data that is shared amongst a set of parties. The data is encoded and shared in such a way as to prevent individual parties (or even a subset of parties) from obtaining any information about the (secret) data, while still allowing functions to be evaluated over the encoded data. The importance of MPC is that functions can be evaluated by the parties while ensuring that no party learns the input values. In STAR we use the SPDZ [14, 34] protocol for evaluating functions in a multi-party setting. SPDZ is the current state-of-the-art when it comes to performing such computations and is highly efficient in practice. We provide additional details about the protocol in § 6.

5.2 STAR protocol

We are now ready to describe the construction of STAR. The functionality of STAR can be split into three distinct phases: Setup, ComputeTest, and PerformAudit.

Setup. To initialize the system, the data owner encrypts a dataset [D] and shares partial decryption keys with the set of computing servers {S1, . . . , Sk} such that collectively they may decrypt computation results. The data owner then initializes the public log TestLog and posts to it a signed message (0, ⊥, ⊥) that corresponds to the counter of the tests to be executed on D, set to zero. In addition to releasing the encrypted dataset, the data owner releases metadata pertaining to D deemed sufficient for researchers to form hypotheses on the dataset (e.g., attribute metadata). We formalize the metadata requirements in § 7.

ComputeTest. A researcher uses the encrypted data to evaluate a statistical test σ over a subset of attributes in D. Once the researcher obtains the encrypted result, she sends it to the servers {S1, . . . , Sk}, which engage in the multi-party protocol to decrypt the result. In doing so, the servers make the result publicly available on TestLog. Recall that the test result cannot be recovered by any of the parties individually; all servers must reveal their shares in order to recover the result. This specification ensures that each test result is made publicly available to all the servers and researchers at the end of the computation. The decrypted result, in conjunction with the existing data written to TestLog, is the test certificate consisting of the tuple ψ = (ρ, σ, τ).

PerformAudit. A test with certificate ψ = (ρ, σ, τ) can be audited for correctness by any entity with access to TestLog. An auditor retrieves all certificates up to ρ from TestLog and uses this information to ensure the α-investing procedure was applied correctly when claiming a result is statistically significant. Note that auditing the statistical test results can be done entirely "offline" using only information recorded on TestLog; this makes the auditing procedure non-interactive.

Procedure 2: PerformAudit(ψρ = (ρ, σ, τ))
1: W(ρ − 1) ← wealth remaining after applying InvestmentRule(α, β, γ) up to the p-value of the ρ-th test
2: αρ ← min(α, W(ρ−1)(1−β) / (1 + W(ρ−1)(1−β)))   // see § 4.3
3: Accept ψ as valid if and only if pρ < αρ

6 STATISTICAL TESTS IN STAR

In this section we describe the statistical tests we implement in STAR and their usage in hypothesis testing. We describe four very common statistical tests: Student's t-test, Pearson's correlation test, the Chi-squared test, and the ANOVA F-test. While not exhaustive, this set of statistical tests forms a basis for quantitative analysis and covers many cases: from reasoning about population means to analyzing differences between sets of categorical data. These four tests are also widely used in practice and accounted for more than 70% of all clinical trials in a pool of 1,828 medical studies from a variety of journals [16].

With our selection of tests we aim to demonstrate the versatility and practicality of STAR in real world applications. Our selection captures tests used on both continuous and categorical data. Many other statistical tests are close variants of the statistical tests we describe in this paper and can be implemented in STAR as well.

6.1 Dataset Characteristics and Notation

Before diving into the equations, we first describe how the data is structured in STAR and how computations are performed over encoded data, and introduce some additional notation.

Metadata requirements. First, we formalize which information about the data is revealed (resp. hidden) from researchers at setup time. To allow researchers to form hypotheses on [D], the data owner is required to release attribute metadata about the dataset. This includes, for example, the number of attributes, the type and domain size of each attribute, and independence from other attributes. This helps researchers determine what test to use and what hypothesis to test for. We require that the metadata be such that 1) it is possible to define a hypothesis on D and 2) it contains the number of rows and columns in the dataset, denoted by n and m. Moreover, since some tests only work on attributes that are independent of each other, we require the independence of attributes (if any) to be contained therein. Finally, we note that we implicitly set a bound on the attribute domain size (e.g., we assume each value can be represented by a 32-bit integer). This is necessary for the correct selection of the parameters at system setup time.

Computing p-values from test statistics. We make an important observation which enables us to more efficiently apply FDR control procedures in STAR. Each statistical test produces a test statistic from which p-values are derived using degrees of freedom (a function of the dataset size). Since the p-value can be computed "in the clear" using the revealed test statistic, we do not need to compute the p-value using secure computation, which drastically reduces the complexity overhead (obtaining p-values requires lookup tables in practice; an expensive operation to perform over encrypted data). We therefore focus on describing the computation of only the test statistic, and leave the mapping from test statistic to an exact p-value implicit, as it can be trivially computed. We note that in certain situations revealing the test statistic (and as a result the p-value) can be too much information, since researchers may be able to reconstruct the dataset after observing a sufficient number of queries. We therefore also describe how STAR can be adapted to only reveal statistical significance, which bypasses p-value computations altogether.

6.2 STAR for the holdout dataset

While we describe STAR as a general scheme for evaluating statistical tests on a dataset, in practice we envision the dataset D to be split into an exploration dataset Dexp and a validation dataset Dval. The data owner encrypts the dataset Dval and releases the encrypted validation dataset along with the (unencrypted) exploration dataset. This way, researchers can explore different hypotheses using Dexp and validate their results through STAR using [Dval]. Indeed, this is the way we envision STAR being deployed in real world settings, as it mirrors current "best practices", but in a provably certifiable way.

6.3 Hypothesis Testing

We are now ready to describe the four statistical tests implemented in STAR and their usage in hypothesis testing. We refer the reader to [5] for additional details on the statistical tests and their applications.

Student's t-test is used to compare the means of two independent samples. The null hypothesis stipulates that there is no difference between the two distributions [5]. Let x and y be two independent samples from D. Denote the means of x and y as x̄ and ȳ, and the standard deviations as sx and sy. The test statistic, t, is then computed according to:

    t ≜ (x̄ − ȳ) / (sp · √(2/n))    (2)

where sp = √( ((n−1)s²x + (n−1)s²y) / (2n − 2) ).

Pearson's correlation test is used to measure the linear correlation between two continuous and independent variables. The test statistic, r, lies in the range [−1, 1], where either extreme corresponds to negative (resp. positive) correlation between the variables [5]. Let x and y be two independent continuous samples in D. Denote the means of x and y as x̄ and ȳ. Pearson's correlation coefficient is computed as follows:

    r ≜ Σⁿᵢ₌₁ (xᵢ − x̄)(yᵢ − ȳ) / ( √(Σⁿᵢ₌₁ (xᵢ − x̄)²) · √(Σⁿᵢ₌₁ (yᵢ − ȳ)²) )    (3)

Chi-squared test is used to determine whether there exists a significant difference between the expected and observed frequencies in a set of observations from mutually exclusive categories. The test evaluates the "goodness of fit" between a set of expected values and observed values. For a collection of n observations classified into m mutually exclusive categories, where each observed value is denoted by xᵢ for i = 1, 2, . . . , m, we denote the probability that a value falls into the i-th category by ηᵢ, such that Σᵐᵢ₌₁ ηᵢ = 1. The Chi-squared statistic is given by:

    χ² ≜ Σᵐᵢ₌₁ (xᵢ − eᵢ)² / eᵢ    (4)

where eᵢ = nηᵢ.

ANOVA F-test is commonly used to determine the fit of a statistical model to a given dataset. Let xᵢ for i = 1, . . . , m be independent samples from D. Denote the mean of xᵢ by x̄ᵢ and the mean across all samples by x̄. Denote the j-th value of the i-th sample by xᵢ,ⱼ. The test statistic, F, is then computed according to:

    F ≜ ( Σᵐᵢ₌₁ n(x̄ᵢ − x̄)² / (m − 1) ) / ( Σᵐᵢ₌₁ Σⁿⱼ₌₁ (xᵢ,ⱼ − x̄ᵢ)² / (n − m) )    (5)

6.4 Statistical Test as an Arithmetic Circuit

We now turn to describing how the statistical tests are computed in STAR. The first obstacle in the way of a practical construction is that somewhat homomorphic encryption is only viable for circuits with low multiplicative depth. With practical implementations of homomorphic encryption, it is only possible to evaluate a few (preferably fewer than 3) levels of multiplication gates where both inputs are encrypted, though many additions and multiplications by public constants are supported. Observe that each test statistic follows a pattern: it requires only a few multiplications per term in the summation, followed by a single division. This makes it possible to efficiently evaluate the numerator and denominator terms of each statistical test equation using at most 2-3 multiplication operations. However, when it comes to the final division gate (where both the numerator and denominator are encrypted), there is no homomorphic operation to perform it outright. Instead, existing methods require a recursive approach (converting the division gate to a series of multiplications) which requires a significant number of consecutive encrypted multiplications and bars the way towards any practical implementation. While theoretically we could use fully homomorphic encryption to evaluate division gates using an iterative approach, such an approach would significantly blow up the computational complexity (by many orders of magnitude), making the construction impractical for real-world settings.

Instead, we opt for a hybrid approach: we use the decryption parties (the parties to whom the shares of the secret key get distributed at system setup time) to "boost" the computational power of the encryption scheme and evaluate the division gate. We use multi-party computation techniques to evaluate division gates in the arithmetic circuit of the statistical test. Since the statistical tests we use to describe STAR only require one division per evaluation, at the very end of the computation, we can evaluate the numerator and denominator locally using somewhat homomorphic encryption and then perform the division with the help of the computing parties. This leads to a practical solution for evaluating statistical tests precisely because the bulk of the test can be evaluated locally and thus requires minimal additional interaction from the parties (where network latency is a bottleneck). If the majority of the circuit is evaluated locally, the parties are only needed to evaluate division gates and thus only need to receive two ciphertexts (numerator and denominator), which they divide prior to revealing the result. This provides the best of both worlds when it comes to practical efficiency. Figure 3 illustrates the process of evaluating a statistical test in STAR, where the test is represented as an arithmetic circuit.

[Figure 3 diagram: a statistical test represented as an arithmetic circuit; the addition and multiplication gates over ciphertexts are evaluated locally, while the final division gate over secret shares is evaluated using multi-party computation.]
Figure 3: Evaluation of a statistical test circuit in STAR. Numerator and denominator terms of the test statistic are computed locally by the researcher using the homomorphic properties of the encryption scheme. Computing servers then evaluate the division gate securely using multi-party computation and reveal the result.

Ciphertext to Share conversion. One caveat, however, is that in order for the parties to perform multi-party computation, the inputs (e.g., the numerator and denominator for a division gate) must be secret shared to all the parties.
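To make the sharing step concrete, the following is a generic additive secret sharing sketch (illustrative only; SPDZ uses authenticated shares over a finite field, and the helper names here are ours):

```python
import random

PRIME = 2 ** 61 - 1  # a public prime modulus; all arithmetic is mod PRIME

def share(secret, k):
    # Split `secret` into k additive shares: k-1 uniformly random values
    # plus one correcting share so that all shares sum to the secret.
    shares = [random.randrange(PRIME) for _ in range(k - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    # Any proper subset of shares is uniformly distributed; only the sum
    # of all k shares reveals the secret.
    return sum(shares) % PRIME
```

Here all k shares are required for reconstruction, matching the requirement that no subset of servers can recover a test result on its own.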
Normally, this is done by a trusted party that splits the plaintext inputs into shares, which are then distributed to each party individually such that any threshold number of parties can pitch in their shares and collectively recover the secret value. In STAR, however, these shares must be created and distributed on-the-fly from the encrypted inputs. To achieve this, we use a multi-party protocol which securely converts an encryption into a set of shares held by the parties. This is achieved using a protocol that takes as input a ciphertext encrypted under a public key and results in the parties holding secret shares of the previously encrypted value. Such protocols have been well studied and are described in [13].

7 SECURITY ANALYSIS

We now analyze the different security properties outlined in § 2. The security of STAR hinges on the somewhat homomorphic encryption scheme and the multi-party computation framework used. We opt for the SPDZ multi-party computation framework because it is the current state-of-the-art when it comes to performing efficient MPC computations. We use SPDZ to evaluate the division gates in the statistical test circuit and to control the revelation of statistical test results. Security hinges on the fact that, given an encrypted dataset, researchers are unable to recover the underlying data to perform covert statistical tests outside of STAR, i.e., without requesting a decryption from the parties.

Correctness. STAR follows the plaintext execution of statistical tests by evaluating an arithmetic circuit of the tests over the encrypted dataset, as presented in § 6. The only difference between a plaintext evaluation and the encrypted evaluation is in the accuracy of the results, since STAR guarantees O(f) bits of precision, which follows directly from the parameters used in instantiating the encryption scheme and the SPDZ parties. The precision parameter, f, can be set to provide an arbitrary number of decimal places of precision. The default is usually set to 2^40, which guarantees at least 12 decimal places of precision and is more than sufficient for p-value calculation in most situations¹.

Confidentiality. The aim of STAR is to enforce p-value calculation in a truthful manner. It achieves this goal by hiding the content of D and revealing only metadata on D sufficient to form hypotheses and compute statistical significance. The data owner only needs to reveal the size of the dataset and attribute metadata on the contents of D (e.g., characteristics of each attribute, domain size, and independence from other attributes)². The metadata is succinctly captured as a function Ls : D → {0, 1}*, that takes as input the dataset and returns the number of rows and attributes n, m and other metadata decided on by the data owner. At setup time, this function captures the exact "leakage" of D, as nothing beyond that is revealed. There is additional leakage of information on D that results from the statistical test computation; it is captured by the function Lc : (D × σ) → σ(D), which is the result of the statistical test σ computed on D. Confidentiality of the dataset then follows directly from the fact that the only information revealed by computing a statistical test (represented as an arithmetic circuit) over the encrypted dataset [D] is the output σ(D) (i.e., the result of the statistical test σ) and nothing else. If this is deemed too much, the leakage can be reduced to revealing just significance (we explain this in the following subsection), in which case only a single bit of information is leaked when performing a statistical test through STAR.

Access Control. Though anyone can perform a computation over [D], the computing servers can enforce "rate limits" on researchers or refuse to reveal results to researchers who attempt to abuse the system. This is an important feature for avoiding situations where a malicious researcher exhausts the α-investing procedure's alpha-wealth and prevents other researchers from testing their hypotheses. While describing the exact access control policies that could be enforced is outside the scope of this work, we emphasize that such policies can be useful in real-world situations and can be integrated into the design of STAR by default. For example, the system can easily be made to enforce daily quotas on the number of statistical tests that any single researcher can evaluate over a dataset.

Auditability. All computations are performed securely over the encrypted dataset [D], where we assume at least one of the computing servers follows the protocol honestly. Thus, every resulting decryption is written to TestLog, which guarantees auditability of all the statistical test results computed on D. For every certificate ψ = (ρ, σ, τ), the sequence of resulting computations numbered 1 to ρ − 1 can be used to certify the validity of ψ by auditing the application of the α-investing procedure, as shown in § 4.3. It is important to note that auditing does not require interaction with the computing parties and can be done fully "offline", using only publicly available information.

¹Many p-value calculation tables are rounded to fewer than 5 decimal places [5].
²For example, metadata for an "age" attribute may be the set {AttrType: Age, NumericRange: 0-110}. We stress, however, that the metadata can be made general and independent of the actual values found in D, when deemed of no importance to forming hypotheses.

7.1 1-bit computation leakage

In some situations, leaking the result of the statistical test may be deemed to reveal too much about the dataset D. For example, there may be situations where researchers can exploit the information revealed by a statistical test to make an educated guess about the significance of a yet-uncomputed statistical test (effectively cheating the control procedure). If this is deemed a risk, then the computation leakage can be reduced to Lc : (D × σ) → {0, 1}, where the only output of a computation through STAR is a single bit denoting statistical significance and nothing else. This can be achieved by comparing the encrypted result τ with a significance threshold t using secure computation (right after evaluating the division gate, see Figure 3) and only revealing the result of this comparison. This potentially sacrifices the utility of the result, since nothing beyond statistical significance is revealed, but it provides the minimum leakage possible (a single bit). We note that the α-investing procedure still works (and can be audited as before) because the procedure only requires the result of this threshold comparison, and not the p-value itself.

8 EXPERIMENTAL EVALUATION

We now turn to demonstrating the applicability of STAR as a system for providing auditable results for statistical tests in an efficient way. Our goal is to show that STAR imposes minimal overhead on the data owner and researchers. Moreover, since we envision a deployment of STAR where the computing servers are far apart (e.g., universities in the United States and in Europe), network latency has the potential to incur significant performance overheads, which could be a problem in practice, and we must therefore evaluate carefully. Finally, we wish to evaluate the effectiveness of the α-investing procedure for controlling the FDR, to illustrate why this procedure is an ideal pairing for STAR.

We outline the goals of our evaluation as follows:
• Thoroughly evaluate STAR on a variety of datasets to measure the overall runtime for all four statistical tests. Specifically, we wish to show that STAR can perform in realistic deployment scenarios where the datasets are sufficiently large in size and the servers used in the final steps of the computation are far apart from one another.
• Illustrate the effectiveness of the α-investing procedure when it comes to controlling false discoveries under different dataset characteristics, to show why STAR would be effective at preventing p-hacking when used as described.
• Measure the impact that network latency, the number of computing servers, and other parameters introduce, and their effects on the practicality of STAR.
• Measure the overhead imposed on the data owner at system setup and on researchers during the evaluation of statistical tests.

8.1 Implementation and Environment

We implement STAR in C++ using the open source homomorphic encryption library SEAL (v3.3.0) [42] and the SPDZ MPC engine (https://github.com/KULeuven-COSIC/SCALE-MAMBA). The implementation of the statistical tests computes the equations described in § 6: the SEAL library is used by the data owner to encrypt the data, researchers use the encrypted dataset to evaluate statistical tests, and the final result is computed and revealed by parties running the SPDZ protocol. SEAL does not currently support threshold decryption as described in [8, 29]; we therefore simulate the evaluation of the share-conversion protocol (see Figure 3) using the SPDZ parties and a single decryption key by emulating the ciphertext-to-share conversion protocol run by the computing parties.

The homomorphic evaluation of the statistical tests was conducted on a single machine with an Intel Xeon E5 (Sandy Bridge) @ 2.60GHz processor (30 cores per machine) and 120GB of RAM, running Ubuntu 18.04 LTS. The cost of running the machine was estimated at $914.98/month. Each computation party was instantiated on a machine with Intel Xeon E5 (Sandy Bridge) @ 2.60GHz processors (8 cores per machine) and 15GB of RAM, running Ubuntu 18.04 LTS. The cost of running each party was estimated at $109.79/month.

Unless otherwise stated, each result is the average over five separate trial runs (for each combination of parameters).

8.2 Datasets

We evaluate each statistical test on synthetic and real-world datasets. However, we note that runtime performance is not impacted by the distribution of the underlying data. Since the data itself is encrypted, all computations run in the same amount of time because the evaluation of the statistical tests is done over the ciphertexts. We evaluate STAR on one real-world dataset, which we make the same size as one of the synthetic datasets to illustrate this fact. The characteristics of the datasets used in our evaluation are summarized in Table 2.

The real-world dataset was obtained from the UC Irvine Machine Learning Repository [1]. For experiments involving Student's t-test, Pearson's correlation test, and the ANOVA F-test (which require continuous data), we use the Abalone dataset, consisting of a total of nine continuous measurements and 4,177 rows. We truncate the dataset down to 1,000 rows to provide a comparison baseline with our synthetic dataset cont_1k. We do this to illustrate the fact that the only factor influencing the computational overhead is the size of the dataset; the synthetic datasets do not advantage STAR in any way.

server config. 1: 3 parties   server config. 2: 3 parties   server config. 3: 3 parties   server config. 4: 4 parties   server config. 5: 5 parties
Figure 4: Server configurations used in the evaluation of STAR.
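For reference, the test statistics of § 6.3 (Equations 2-5), which the experiments below time over encrypted data, can be computed in the clear as follows. This is a plaintext sketch with our own helper names, useful for checking STAR's results; it is not part of the system itself:

```python
import math

def t_statistic(x, y):
    # Student's t-test for two independent, equal-size samples (Eq. 2).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx2 = sum((v - xbar) ** 2 for v in x) / (n - 1)  # sample variances
    sy2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
    sp = math.sqrt(((n - 1) * sx2 + (n - 1) * sy2) / (2 * n - 2))
    return (xbar - ybar) / (sp * math.sqrt(2.0 / n))

def pearson_r(x, y):
    # Pearson correlation coefficient (Eq. 3).
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    den = math.sqrt(sum((a - xbar) ** 2 for a in x)) * \
          math.sqrt(sum((b - ybar) ** 2 for b in y))
    return num / den

def chi_squared(observed, probs):
    # Chi-squared goodness-of-fit statistic (Eq. 4), with e_i = n * eta_i.
    n = sum(observed)
    return sum((x - n * p) ** 2 / (n * p) for x, p in zip(observed, probs))

def anova_f(samples):
    # One-way ANOVA F statistic (Eq. 5) for m groups of n values each;
    # the within-group term is divided by the total count minus m.
    m, n = len(samples), len(samples[0])
    means = [sum(s) / n for s in samples]
    grand = sum(sum(s) for s in samples) / (m * n)
    between = sum(n * (mu - grand) ** 2 for mu in means) / (m - 1)
    within = sum((v - mu) ** 2
                 for s, mu in zip(samples, means) for v in s) / (m * n - m)
    return between / within
```

In each case the numerator and denominator are the two values STAR evaluates homomorphically before the single secure division.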

Synthetic datasets. We generate three continuous synthetic datasets containing 1,000, 5,000 and 10,000 rows, respectively, with random real-valued entries sampled from a normal distribution. We use these datasets to evaluate Student's t-test, Pearson's correlation test, and the ANOVA F-test. For evaluating the Chi-squared test, we generate three categorical datasets containing 8 mutually exclusive categories. We evaluate the Chi-squared test on subsets of 2, 5, and 8 mutually exclusive categories to demonstrate the impact of varying the number of selected attributes over which the test is performed.

Dataset    # rows   # attrs.   Data type
abalone    1000*    8          continuous
cat_1k     1000     8          categorical
cat_5k     5000     8          categorical
cat_10k    10000    8          categorical
cont_1k    1000     3          continuous
cont_5k    5000     3          continuous
cont_10k   10000    3          continuous
Table 2: Summary of dataset characteristics.
* We truncate this dataset to 1000 rows to illustrate the fact that computation time is entirely independent of the data distribution (as it should be, given that the data is encrypted).

8.3 Experiment Setup

To ensure our experiments emulate a potential deployment scenario (e.g., where servers are maintained by universities in different geographic regions), we run our experiments with five different configurations. We set up servers in three different locations in the USA (South Carolina, Virginia, Oregon) as well as Germany, Brazil, and Japan, with 3-5 parties depending on the configuration. We report each of the five configurations in Figure 4. Our choice of server configurations allows us to evaluate the effects of network latency in a deployed scenario. The first configuration consists of all three servers located in the same datacenter (South Carolina), which results in the parties being connected on the local area network (LAN) with "ideal" network latency. We use this configuration as a reference point to analyze the impact of settings where servers are further apart.

8.4 Parameters

To ensure all tests are evaluated in a way that enables comparisons between results, we fix the parameters ahead of time to satisfy the requirements imposed by all four statistical tests and datasets. Specifically, we set the fixed-point scaling factor f = 40, which provides up to 12 decimal places of precision for each computation and is sufficient for computing precise p-values. We set the statistical security parameter κ = 40, which provides 40 bits of statistical security during multi-party computations and is the default choice [14]. In words, this parameter ensures that the probability of a computation leaking information about a secret value is less than 1/2^κ, which is negligible in κ.

8.5 Results

We describe the results in four parts which roughly correspond to the goals outlined at the beginning of this section.

System setup. Across all experiments, the setup time (i.e., the time required to encode and encrypt the data) remained below 32 minutes per dataset. We report the setup time for each dataset in Figure 8. We stress that this is a one-time operation needed only at setup: after the dataset is encrypted, the data owner goes offline and does not need to interact with researchers or perform any further computation.

Computation runtime. Since we implement STAR using the SEAL homomorphic encryption library and the SPDZ multi-party computation engine, we split the runtime evaluation into two phases: offline and online. The offline phase is where researchers evaluate the statistical test arithmetic circuit using homomorphic encryption to obtain the numerator and denominator terms. In the online phase, the researchers provide the evaluated numerator and denominator ciphertexts to the computing parties, which then engage in a multi-party computation to evaluate the division and reveal the statistical test result. We report the runtime of each statistical test in Figure 5.
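The offline audit underlying the discovery-certification results (Figures 6 and 7) can be replayed in a few lines. The per-test threshold below follows the formula in Procedure 2; the wealth update is a stand-in for the InvestmentRule of § 4.3, which is not reproduced in this section, so we pass it in as a hypothetical callback:

```python
def alpha_rho(wealth, alpha=0.05, beta=0.5):
    # Per-test significance level from Procedure 2:
    #   alpha_rho = min(alpha, W(rho-1)(1-beta) / (1 + W(rho-1)(1-beta)))
    w = wealth * (1 - beta)
    return min(alpha, w / (1 + w))

def perform_audit(p_values, update_wealth, w0=0.05, alpha=0.05, beta=0.5):
    # Replay TestLog offline: recompute the wealth trajectory and record,
    # for each test rho, whether p_rho < alpha_rho held at that point.
    # `update_wealth(wealth, alpha_rho, rejected)` stands in for the
    # InvestmentRule of Section 4.3 (not reproduced here).
    wealth, verdicts = w0, []
    for p in p_values:
        a = alpha_rho(wealth, alpha, beta)
        rejected = p < a
        verdicts.append(rejected)
        wealth = update_wealth(wealth, a, rejected)
    return verdicts
```

Because every p-value (or significance bit) is on TestLog, any third party can run this replay without contacting the computing servers.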

[Figure 5 plot: six panels (Student's t-test; Pearson's correlation test; F-test; Chi-squared test with 2, 5, and 8 attributes) showing computation time (mm:ss, 00:00-10:00) for server configurations 1-5 on the continuous (abalone, cont_1k, cont_5k, cont_10k) and categorical (cat_1k, cat_5k, cat_10k) datasets.]
Figure 5: Total computation time (mm:ss) required for each test as a function of the computing server configuration and dataset selection. Offline runtime is presented in solid color; online runtime is cross-hatched.

[Figure 6 plot: average false discovery rate at 100%, 75%, 50%, and 25% null hypotheses, with and without α-investing.]
Figure 6: Comparison of average FDR with and without the α-investing control procedure on synthetic datasets with varying percentage of null hypotheses. Green line indicates the α = 0.05 threshold.

[Figure 7 plot: remaining α-investing budget over 0-60 statistical tests performed, for datasets with 100%, 75%, 50%, and 25% null hypotheses.]
Figure 7: Remaining α-investing budget as a function of the number of tests computed over synthetic datasets with varying percentage of null hypotheses.

[Figure 8 plot: dataset encryption time (mm:ss) for abalone, cont_1k, cont_5k, cont_10k, cat_1k, cat_5k, and cat_10k.]
Figure 8: Data owner setup time (mm:ss) for each dataset used in the evaluation.

The runtime of computing statistical tests ranged from a few seconds to approximately 9 minutes on the larger datasets. Even though this is several orders of magnitude of overhead compared to computing a statistical test directly over the unencrypted data (e.g., using SciPy), it is not unreasonably high when considering practical circumstances. Researchers in the real world are likely to only evaluate a few statistical tests on a dataset per day, which STAR is adept at handling, even with many researchers using the system. The results also illustrate the overhead of performing multi-party computation. In Figure 5 we see that the online time accounted for a significant fraction of the computation time, even though it is only used for evaluating the division gate and revealing the result. This confirms that evaluating the full statistical test over shared data using multi-party computation would incur an unreasonably high overhead on the runtime for a practical application.

Discovery certification. To demonstrate how STAR performs at its main goal of discovery certification, we evaluate the α-investing procedure for controlling false discoveries. For evaluating the FDR with and without the α-investing control procedure, we generate one hundred synthetic datasets with values ranging between 0 and 100 across 64 attributes and 1,000 rows. For each dataset we perform 64 "multiple comparisons" using each of the three tests over a subset of 2 attributes. We repeat this process for four dataset collections, where each collection contains a different percentage of true null hypotheses (e.g., 100% implies completely random data, so all "discoveries" are false). We report the results in Figure 6. The results show that the α-investing procedure indeed bounds the expected FDR to the specified alpha (α = 0.05), even when all discoveries in the data are false. We additionally report the α-budget remaining after each test (following Procedure 1) in Figure 7. We use the same generated datasets and set the initial wealth W(0) = 0.05 and parameters β = 0.5, γ = 0.0125. As expected, the budget is never fully depleted, but it quickly decreases in the case of datasets with a large portion of null hypotheses.

9 RELATED WORK

Much of the related work surrounding STAR either focuses on private computations using homomorphic encryption or on non-cryptographic methods for preventing p-hacking, such as pre-registration of hypotheses. To our knowledge, there has been no prior work using cryptographic techniques for certifying the validity of statistical tests in a provable and fully auditable way.

Sornette et al. [44] describe an approach for demonstrating the validity of stock market bubble predictions a posteriori. Their method relies on hashing documents containing financial forecasts, which get released after the forecasted date has passed. Coupled with a trusted entity for storing the hashes, their method provides a means of validating the historical accuracy of predictions in an unbiased way. While their solution isn't directly applicable to hypothesis testing, the main goal of their work is comparable to ours.

Dwork et al. [18, 19] propose a method that constrains an analyst's access to a holdout dataset as follows. A trusted party keeps a holdout dataset and answers up to m hypotheses using a differentially private algorithm. Here, m depends on the size of the holdout data and the generalization error that one is willing to tolerate. Hence, compared to our decentralized approach, this method assumes trust in a single party that (1) does not release the dataset to the researchers and (2) stops answering queries after m hypothesis queries. Moreover, the method introduces random noise into the results, which can potentially alter the outcome of the computed test.

Lauter et al. [37] and Zhang et al. [50] propose the use of homomorphic encryption to compute statistical tests on encrypted genomic data. However, the constructions are not intended for validating statistical testing procedures but rather for guarding the privacy of patient data. Furthermore, Lauter et al. make several simplifications, such as not performing encrypted division (rather, performing arithmetic division in the clear), imposing assumptions on the data, etc., making their solution less general compared to the methods for computing statistical tests in STAR.

Homomorphic encryption and multi-party computation techniques have been used for outsourcing machine learning tasks over private data [9, 24, 48, 49]. Our work, however, crucially relies on the decryption functionality being decentralized in order to control and keep track of data exposure.

Other related work surrounds non-cryptographic techniques to guard against false discoveries in statistical analysis. This includes procedures such as the α-investing procedure [20, 51] and other methods for bounding the number of false discoveries [4, 17]. When applied properly, these techniques prevent p-hacking in an idealized scenario, but they provide no guarantee that the methods were applied correctly, which is why the system we describe becomes a necessity.

Finally, we note that preserving the privacy of dataset values when releasing results of statistical tests [21, 31] is orthogonal to our work.

10 CONCLUSIONS

We present STAR, a system that certifies hypothesis testing using cryptographic techniques and a decentralized certifying authority to eliminate avenues for p-hacking and other forms of data dredging. STAR computes statistical tests over encrypted data by combining somewhat homomorphic encryption with multi-party computation in a way that enables efficient and practical evaluation yet provides full auditability of results. We show that STAR supports a wide range of statistical tests used in practice for hypothesis testing, and we evaluate four of the most common tests used by researchers. Our experiments demonstrate the feasibility of such a system for real-world settings, with only a small computational overhead imposed on researchers evaluating tests. As such, we believe that STAR is a viable solution for preventing p-hacking in a provable way, promoting accountability and reproducibility in scientific studies. With STAR, data can be released for public use with the guarantee that the insights extracted from the data are correct and that all necessary false discovery control procedures were applied. STAR is the first system which addresses p-hacking and fosters reproducibility by providing a certificate-of-correctness and enabling post hoc auditing of all actions taken by researchers.

REFERENCES

[1] A. Asuncion and D. Newman. UCI machine learning repository, 2007.
[2] M. Baker. 1,500 scientists lift the lid on reproducibility. Nature News, 533(7604):452, 2016.
[3] C. G. Begley and L. M. Ellis. Drug development: Raise standards for preclinical cancer research. Nature, 483(7391):531, 2012.
[4] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), pages 289-300, 1995.
[5] D. P. Bertsekas and J. N. Tsitsiklis. Introduction to probability, volume 1. Athena Scientific, Belmont, MA, 2002.
[6] D. Bogdanov. Foundations and properties of Shamir's secret sharing scheme. Research seminar in cryptography, University of Tartu, Institute of Computer Science, 1, 2007.
[7] D. Bogdanov, S. Laur, and J. Willemson. Sharemind: A framework for fast privacy-preserving computations. In European Symposium on Research in Computer Security, pages 192-206. Springer, 2008.
[8] D. Boneh, R. Gennaro, S. Goldfeder, A. Jain, S. Kim, P. M. Rasmussen, and A. Sahai. Threshold cryptosystems from threshold fully homomorphic encryption.
[19] C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 117-126. ACM, 2015.
[20] D. P. Foster and R. A. Stine. α-investing: a procedure for sequential control of expected false discoveries. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(2):429-444, 2008.
[21] M. Gaboardi, H. Lim, R. Rogers, and S. Vadhan. Differentially private chi-squared hypothesis testing: Goodness of fit and independence testing. In Proceedings of The 33rd International Conference on Machine Learning, volume 48, pages 2111-2120, 2016.
[22] C. Gentry. A fully homomorphic encryption scheme. Stanford University, 2009.
[23] M. M. Ghassemi, S. E. Richter, I. M. Eche, T. W. Chen, J. Danziger, and L. A. Celi. A data-driven approach to optimized medication dosing: a focus on heparin. Intensive care medicine, 40(9):1332-1339, 2014.
[24] T. Graepel, K. Lauter, and M. Naehrig. ML confidential: Machine learning on encrypted data. In International Conference on Information Security and Cryptology (ICISC), 2013.
[25] M. L. Head, L. Holman, R. Lanfear, A. T. Kahn, and M. D. Jennions. The extent and consequences of p-hacking in science. PLoS biology, 13(3):e1002106, 2015.
[26] K. E. Henry, D. N. Hager, P. J. Pronovost, and S. Saria. A targeted real-time early warning score (trewscore) for septic shock. Science translational medicine, 7(299):299ra122-299ra122, 2015.
[27] J. P. Ioannidis. Contradicted and initially stronger effects in highly cited clinical research. Jama, 294(2):218-228, 2005.
[28] J. P. Ioannidis.
Why most published research findings are false. PLoS morphic encryption. In Annual International Cryptology Conference, medicine, 2(8):e124, 2005. pages 565–596. Springer, 2018. [29] A. Jain, P. M. Rasmussen, and A. Sahai. Threshold fully homomorphic [9] R. Bost, R. A. Popa, S. Tu, and S. Goldwasser. Machine learning classi- encryption. IACR Cryptology ePrint Archive, 2017:257, 2017. fication over encrypted data. In NDSS, volume 4324, page 4325, 2015. [30] L. K. John, G. Loewenstein, and D. Prelec. Measuring the prevalence [10] E. Cecchetti, F. Zhang, Y. Ji, A. Kosba, A. Juels, and E. Shi. Solidus: of questionable research practices with incentives for truth telling. Confidential distributed ledger transactions via PVORM. In Proceedings Psychological science, 23(5):524–532, 2012. of the 2017 ACM SIGSAC Conference on Computer and Communications [31] A. Johnson and V. Shmatikov. Privacy-preserving data exploration Security, pages 701–717, 2017. in genome-wide association studies. In Proceedings of the 19th ACM [11] A. R. Choudhuri, M. Green, A. Jain, G. Kaptchuk, and I. Miers. Fairness SIGKDD International Conference on Knowledge Discovery and Data in an unfair world: Fair multiparty computation from public bulletin Mining, KDD ’13, pages 1079–1087, 2013. boards. In Proceedings of the 2017 ACM SIGSAC Conference on Computer [32] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, and Communications Security, pages 719–728. ACM, 2017. B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark. Mimic-iii, a freely [12] A. Cockburn, C. Gutwin, and A. Dix. Hark no more: on the preregis- accessible critical care database. Scientific data, 3:160035, 2016. tration of chi experiments. In Proceedings of the 2018 CHI Conference [33] M. Keller, E. Orsini, and P. Scholl. Mascot: faster malicious arithmetic on Human Factors in Computing Systems, page 141. ACM, 2018. secure computation with oblivious transfer. In Proceedings of the 2016 [13] R. Cramer, I. 
Damgård, and J. B. Nielsen. Multiparty computation from ACM SIGSAC Conference on Computer and Communications Security, threshold homomorphic encryption. In International Conference on the pages 830–842. ACM, 2016. Theory and Applications of Cryptographic Techniques, pages 280–300. [34] M. Keller, V. Pastro, and D. Rotaru. Overdrive: making spdz great again. Springer, 2001. In Annual International Conference on the Theory and Applications of [14] I. Damgård, M. Keller, E. Larraia, V. Pastro, P. Scholl, and N. P. Smart. Cryptographic Techniques, pages 158–189. Springer, 2018. Practical covertly secure mpc for dishonest majority–or: breaking the [35] N. L. Kerr. Harking: Hypothesizing after the results are known. Per- spdz limits. In European Symposium on Research in Computer Security, sonality and Social Review, 2(3):196–217, 1998. pages 1–18. Springer, 2013. [36] L. Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25, 2001. [15] K. Dickersin, S. Chan, T. Chalmersx, H. Sacks, and H. Smith Jr. Publi- [37] K. Lauter, A. López-Alt, and M. Naehrig. Private computation on cation bias and clinical trials. Controlled clinical trials, 8(4):343–353, encrypted genomic data. In International Conference on Cryptology 1987. and Information Security in Latin America, pages 3–27. Springer, 2014. [16] J.-B. Du Prel, B. Röhrig, G. Hommel, and M. Blettner. Choosing statis- [38] L. Mayaud, P. S. Lai, G. D. Clifford, L. Tarassenko, L. A. G. Celi, and tical tests: part 12 of a series on evaluation of scientific publications. D. Annane. Dynamic data during hypotensive episode improves mor- Deutsches Ärzteblatt International, 107(19):343, 2010. tality predictions among patients with sepsis and hypotension. Critical [17] O. J. Dunn. Multiple comparisons among means. Journal of the Ameri- care medicine, 41(4):954, 2013. can Statistical Association, 56(293):52–64, 1961. [39] S. Nakamoto. Bitcoin: A peer-to-peer electronic cash system, 2008. [18] C. Dwork, V. Feldman, M. Hardt, T. 
Pitassi, O. Reingold, and A. Roth. http://www.bitcoin.org/bitcoin.pdf. Guilt-free data reuse. Communications of the ACM, 60(4):86–93, 2017. [40] U. S. Neill. Publish or perish, but at what cost? The Journal of clinical investigation, 118(7):2368–2368, 2008. 15 [41] R. Pass, E. Shi, and F. Tramer. Formal abstractions for attested execution [47] P. Voigt and A. Von dem Bussche. The eu general data protection reg- secure processors. In Annual International Conference on the Theory ulation (gdpr). A Practical Guide, 1st Ed., Cham: Springer International and Applications of Cryptographic Techniques, pages 260–289. Springer, Publishing, 2017. 2017. [48] D. Wu and J. Haven. Using homomorphic encryption for large scale [42] Microsoft SEAL (release 3.3). https://github.com/Microsoft/SEAL, 2019. statistical analysis. Technical report, Technical Report: cs. stanford. Microsoft Research, Redmond, WA. edu/people/dwu4/papers/FHESI Report. pdf, 2012. [43] A. Shamir. How to share a secret. Communications of the ACM, [49] P. Xie, M. Bilenko, T. Finley, R. Gilad-Bachrach, K. E. Lauter, and 22(11):612–613, 1979. M. Naehrig. Crypto-nets: Neural networks over encrypted data. CoRR, [44] D. Sornette, R. Woodard, M. Fedorovsky, S. Reimann, H. Woodard, and abs/1412.6181, 2014. W.-X. Zhou. The financial bubble experiment: advanced diagnostics [50] Y. Zhang, W. Dai, X. Jiang, H. Xiong, and S. Wang. Foresee: Fully and forecasts of bubble terminations. arXiv preprint arXiv:0911.0454, outsourced secure genome study based on homomorphic encryption. 2009. 15:S5, 12 2015. [45] P. Subramanyan, R. Sinha, I. Lebedev, S. Devadas, and S. A. Seshia. [51] Z. Zhao, L. De Stefani, E. Zgraggen, C. Binnig, E. Upfal, and T. Kraska. A formal foundation for secure remote execution of enclaves. In Controlling false discoveries during interactive data exploration. 
In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Proceedings of the 2017 ACM International Conference on Management Communications Security, pages 2435–2450. ACM, 2017. of Data, pages 527–540. ACM, 2017. [46] W. H. Thompson, J. Wright, P. G. Bissett, and R. A. Poldrack. Dataset [52] J. Zhou, D. Foster, R. Stine, and L. Ungar. Streaming feature selection decay: the problem of sequential analyses on open datasets. bioRxiv, using alpha-investing. In Proceedings of the eleventh ACM SIGKDD page 801696, 2019. international conference on Knowledge discovery in , pages 384–393. ACM, 2005.
