BINARY CONSISTENCY TEST 1

DEBIT: A Simple Consistency Test For Binary Data

James A. J. Heathers¹, Nicholas J. L. Brown²

1. Northeastern University, Boston MA

2. University of Groningen, The Netherlands

Correspondence to: [email protected] ​ Bouvé College of Health Sciences

Northeastern University, 360 Huntington Avenue, Boston MA 02115.

Abstract

Scientific papers occasionally report group membership coded as a binary variable (i.e., 0 or 1), and present the mean and standard deviation calculated from it in a table of descriptive statistics. This is redundant, as the mean and standard deviation are not independent. This manuscript demonstrates this redundancy and uses it as a simple error detection test, termed here DEBIT (DEscriptive BInary Test), to investigate the accuracy of published descriptive statistics. Salient features of deploying the test are discussed, as are some anonymized examples in which published tables of descriptives appear to fail DEBIT.

Keywords: error detection, metascience ​


DEBIT: A Simple Consistency Test For Binary Data

Within science, there is a pressing concern that methods and results may be underreported or under-specified. The inaccessibility of raw data during review (Mayernik, ​ Callaghan, Leigh, Tedds, & Worley, 2015) or in perpetuity (Vines et al., 2014), and the ​ ​ ​ underreporting of methodological details and/or irreproducibility of methods (Prinz, Schlange, & ​ Asadullah, 2011), are contributors to the present crisis in reproducibility and replicability ​ (Camerer et al., 2016, 2018). In this environment, it is unusual to consider information that is ​ overreported—that is, details included in scientific work although they contribute no additional ​ information. However, overreported metrics have the potential to form error detection tests (i.e., they can be used to investigate inconsistencies or errors in scientific publications), as they contain redundant elements that ought necessarily to be in agreement.

Scientific papers commonly report measures of central tendency and dispersion to define sample characteristics. As this is ubiquitous, papers will occasionally (and unnecessarily) report both the mean and standard deviation of a binary variable, most typically from a group which has yes/no membership, coded numerically as 0 and 1. Examples include dichotomous age delineations (e.g., participants aged 25 or below, versus participants aged 26 or above), participants meeting a group membership criterion (e.g., depressed patients with BDI scores of 14 or above vs. non-depressed or recovering patients with BDI scores of 13 or below; Beck, Steer, & Brown, 1996), sex (male vs. female), geographical location (urban vs. rural), and so on.

In text, these variables are often reported as a single percentage (e.g., “A sample of 70 participants (47.1% female) was collected for analysis”) or as one or more raw numbers (e.g., “n = 85 urban and n = 35 rural participants returned the survey”). For inclusion in statistical analyses of proportionality between groups as categories (e.g., using the chi-square test of independence), or for use as variables in a regression, these group memberships are often assigned a binary value (0 vs. 1) from which the mean and standard deviation are calculated. It is unclear why or how this reporting standard has arisen, because the standard deviation is entirely determined by the mean and cell sizes. It is trivial to demonstrate the redundancy of the standard deviation, and thus to determine that it contains no additional information about the sample of interest.

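Consider a sample with two groups (cell sizes a and b), which are assigned the binary values of 0 and 1 respectively, with an overall sample size of N; hence:

\[ N = a + b, \qquad \bar{x} = \frac{b}{N} \]

If this is the case, the standard deviation is a simple function of a and b:

\[ \sigma = \sqrt{\frac{a\,(0 - \bar{x})^2 + b\,(1 - \bar{x})^2}{N}} = \sqrt{\frac{ab\,(a + b)}{N^3}} \]

As a + b = N,

\[ \sigma = \sqrt{\frac{ab}{N^2}} = \frac{\sqrt{ab}}{N} \]

As the sample standard deviation is invariably used, this can be modified to:

\[ s = \sqrt{\frac{ab}{N(N - 1)}} \]

This relationship can also be expressed in terms of the overall mean using the previous identity:

\[ s = \sqrt{\frac{N\,\bar{x}\,(1 - \bar{x})}{N - 1}} \]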

As a consequence, any sample of binary variables has a mean and standard deviation that are precisely described by the two cell sizes, and as these mean/standard deviation pairs are commonly reported, we can treat them as a window to check the consistency of the presented figures. As with other methods of checking internal consistency, this identity is absolute, as it is when test statistics are back-calculated (e.g., with statcheck: Nuijten, Hartgerink, van Assen, Epskamp, & Wicherts, 2016) or in the GRIM test, a method for using granularity to determine whether reported means are possible (Brown & Heathers, 2016). Unlike SPRITE (Heathers, Anaya, van der Zee, & Brown, 2018), a method for reconstructing potential datasets by iteratively changing individual constituent values, the SD of a binary variable requires no estimation and is exact rather than probabilistic. For instance, if we report a sample (N = 280) with sex coded as a binary variable, with 127 male participants and 153 female participants, the standard deviation is exactly equal to 0.4987..., as:

\[ s = \sqrt{\frac{127 \times 153}{280 \times 279}} = \sqrt{\frac{19431}{78120}} \approx 0.4987 \]

For simplicity, we have termed this observation the DEscriptive BInary Test (DEBIT¹).
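The identity can be sketched in a few lines of Python. This is a minimal illustration and not the authors' Matlab implementation; the function names `binary_sd` and `rounds_to` are hypothetical, and the half-open rounding interval is an assumption about how a reported value was rounded.

```python
import math


def binary_sd(n_zero: int, n_one: int) -> float:
    """Sample SD (N - 1 denominator) of n_zero zeros and n_one ones."""
    n = n_zero + n_one
    if n < 2:
        raise ValueError("need at least two observations")
    return math.sqrt(n_zero * n_one / (n * (n - 1)))


def rounds_to(value: float, reported: float, decimals: int = 2) -> bool:
    """True if value falls in the half-open rounding interval around reported."""
    half = 0.5 * 10 ** -decimals
    return reported - half <= value < reported + half


# Worked example from the text: 127 male and 153 female participants.
sd = binary_sd(127, 153)
print(f"{sd:.4f}")          # SD implied by the two cell sizes
print(rounds_to(sd, 0.50))  # is a reported SD of 0.50 consistent?
```

For the example in the text, the first line printed is 0.4987, so a reported SD of 0.50 would be consistent with these cell sizes at two decimal places.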

While the test is absolute if the numbers are perfectly specified, there are exceptions to this besides basic typographical or statistical errors:

(1) Rounding

In a sufficiently large sample, reporting any number to a small number of decimal places (usually 2) or as a percentage may return a range of possible cell sizes. Assume a sample of N = 2,500, which is described as 17% patient population and 83% healthy participants (with the numbers of participants in each condition designated P and H, respectively), with a reported SD of 0.37. Allowing for numbers very close to the rounding limits, the check for consistency is now whether an SD of ≥0.365 to <0.375 is produced by any possible solution where N = P + H = 2500 (i.e., from ~16.5% to ~17.5% P and from ~82.5% to ~83.5% H). Hence, a range of possible solutions may fit the reported data; in this case, the reported SD is reproduced by 396 ≤ P ≤ 422.

(2) Unreported exclusions

Real-world datasets frequently have incomplete items, which are presumably more common when data are collected unsupervised, under duress, under time pressure, when resistance is provoked by the content of the questions (e.g., if on deeply personal matters), or with methods which do not compel a response (e.g., a mailed paper survey, or an automated survey which does not mandate that every question be answered to achieve completion). These deficiencies are, in general, more likely when research is completed on a large scale or at multiple testing sites. In this case, cell sizes or subsamples of individual items may not be implied by the overall sample size reported.

¹ The name BInary DEscriptive Test (BIDET) was rejected after consulting a small focus group of colleagues.

(3) Altered data

If a data set has been altered in certain ways—such as the accidental omission of part of the data due to an incorrect operation in a text editor or incorrect filtering in a statistical package, or if the descriptive statistics have simply been fabricated—then sample sizes, means, and standard deviations may not align as they should. (Note, however, that if the raw data for one or ​ ​ more variables have been fabricated, the descriptive statistics for these variables will be internally consistent if correctly reported.)
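The rounding case (1) above can be checked by brute force. The following sketch enumerates every possible patient count P for N = 2,500 and keeps those whose sample SD rounds to the reported 0.37, optionally also requiring the percentage to round to 17%. The function name `consistent_counts` and the half-open rounding intervals are assumptions made for illustration.

```python
import math


def consistent_counts(n, reported_sd, reported_pct=None, sd_decimals=2):
    """Counts p (0..n) whose sample SD, and optionally percentage, round to the reported values."""
    half = 0.5 * 10 ** -sd_decimals
    hits = []
    for p in range(n + 1):
        sd = math.sqrt(p * (n - p) / (n * (n - 1)))
        if not (reported_sd - half <= sd < reported_sd + half):
            continue
        if reported_pct is not None:
            pct = 100 * p / n
            # assumes the percentage was rounded to a whole number
            if not (reported_pct - 0.5 <= pct < reported_pct + 0.5):
                continue
        hits.append(p)
    return hits


sd_only = consistent_counts(2500, 0.37)
# The SD is symmetric in P and N - P, so keep only the smaller group here.
lower = [p for p in sd_only if p <= 1250]
print(min(lower), max(lower))

both = consistent_counts(2500, 0.37, reported_pct=17)
print(min(both), max(both))
```

Under these assumptions the SD constraint alone gives 396 ≤ P ≤ 422 (plus the mirror-image counts N − P), and adding the 17% constraint narrows the range to 413 ≤ P ≤ 422.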

Visualising DEBIT

As these errors differ in magnitude (e.g., while a rounding error might produce an n1/n2 pair that is slightly incorrect, other errors may be of larger magnitude), and the relevant numbers are presented en masse in tables of descriptives, a simple visualization can help to simplify the task of inspecting data. For any sample size of interest, the overall N can be split into every possible n1/n2 pair (x-axis), expressed either in absolute terms or as a proportion. These points can be plotted against the corresponding standard deviations (y-axis), which traces out a curve. If a reported point does not lie on this curve, it is necessarily incorrect.

An example: In a hypothetical medium-sized sample of participants (women over the age of 75; N = 100), a small number answer a binary question positively (“are you still in the workforce?”; yes, n1 = 15) and a larger sub-sample answer the same question negatively (“are you still in the workforce?”; no, n2 = 85). Figure 1 displays this visualization with SDs at 0.28, 0.36, 0.44, and 0.52.


Figure 1. A simple DEBIT graph, with potential values seen at x=0.15

The four points in Figure 1 represent different classes of solutions. SD = 0.36 is correct for the sample/cell sizes, and as such it lies on the line at (15, 0.36). SD = 0.52 is trivially incorrect, and no test is required to make this determination, as it is impossible at any point on the curve: for n = 100, the maximum possible SD of a binary variable is 0.50 + ε, where ε approaches 0 as n increases. SD = 0.28 and SD = 0.44 are both incorrect, and potentially provide useful additional information. These solutions may themselves fall into one of two categories: they may either correspond to a different mean (e.g., if SD = 0.44, n1 = 25 and n1 = 26 are both solutions), or they may not exist as per the GRIM test (Brown & Heathers, 2016), as these SDs are also affected by the relationship between granularity and the level of rounding possible in the solutions. In this case, if we report all figures to 2 d.p., SD = 0.28 is impossible, as it lies between n1 = 8 (SD = 0.27) and n1 = 9 (SD = 0.29). As previously mentioned, rounding may also lead to a lack of precision, for which DEBIT returns a range of solutions rather than a single point. These solutions can be visualized within a box on the graph above between the minimum and maximum values of n1 on the x-axis, and the rounding limits of the SD on the y-axis. If there is a single x-axis value, the box appears as a vertical line between the rounding limits of the SD.
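The four classes of solution can be reproduced numerically for N = 100. This sketch lists, for each reported SD, the cell sizes n1 (taken as the smaller group, since the curve is symmetric) whose sample SD rounds to that value at two decimal places; the helper names are hypothetical.

```python
import math


def sd_for_count(n, n1):
    """Sample SD of a binary variable with n1 ones among n observations."""
    return math.sqrt(n1 * (n - n1) / (n * (n - 1)))


def counts_matching(n, reported_sd, decimals=2):
    """Cell sizes n1 <= n/2 whose SD rounds to reported_sd."""
    half = 0.5 * 10 ** -decimals
    return [
        n1
        for n1 in range(n // 2 + 1)
        if reported_sd - half <= sd_for_count(n, n1) < reported_sd + half
    ]


for reported in (0.28, 0.36, 0.44, 0.52):
    print(reported, counts_matching(100, reported))
```

An empty list means that no cell size produces the reported SD: 0.52 lies above the maximum of the curve, and 0.28 falls in the gap between n1 = 8 and n1 = 9.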

If several results are returned for the same table, paper, or larger body of work that describes the same sample of size N, the x-axis can be normalised and expressed as n1 / N (i.e., the mean), and each corresponding point can be plotted separately. This is particularly useful, as one can assess even an extensive table of descriptive statistics at a glance. Consider the following series of accurate mean/SD pairs, which includes one deliberate mistake that is not apparent on visual inspection of the table.

Table 1: Descriptive statistics (N = 163)

Variable                          Mean    SD
Male/Female                       0.53    0.50
White/Non-white                   0.44    0.50
College-educated                  0.77    0.42
Income over $65,000               0.19    0.35
Demographic (urban/rural)         0.34    0.47
Contacted prior to 2015           0.93    0.25
Participants in previous study    0.12    0.33

When these numbers are graphed, the mischaracterized row (mean = 0.19, SD = 0.35) becomes clear²:

Figure 2: DEBIT graph of values in Table 1. The incorrect value lies off the line at [0.19,0.35].

While the origin of the above mistake might be unknown, the possibilities for the actual nature of the sample are confined entirely to the following:

² The present manuscript was prepared using Google Docs, making it difficult to embed vector graphics. As a result, the widths of values which have a narrow x-value range are reduced; however, widening the line for visibility drastically changes the accuracy of the graph, which is necessary to maintain from Figure 2 onwards. If any features of the graph are indistinct, readers are invited to consult the attached graphs in PDF format or reproduce the figures from the code provided.

● the mean being misreported (hence, the SD is correct, and the mean should be ~0.14);

● the SD being misreported (hence, the mean is correct, and the SD should be ~0.39);

● both being misreported (in which case the true values are indeterminate);

● the sample size being misreported (for example, with M = 0.19 and SD = 0.35, there is a solution with N = 15).

Having established these observations, DEBIT can be deployed in the published literature at large. Where a table fails DEBIT, further investigation of the reported results in context may reveal more information about the nature of the problem. The cases below also give further insight into salient details of the test and into possible internal features of the datasets behind the summary statistics.
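Table 1 can be re-checked programmatically. The sketch below assumes every row describes all N = 163 participants, that means and SDs were rounded to two decimal places, and that the sample SD was used; the function names are hypothetical. It flags rows where no integer cell size reproduces both the reported mean and the reported SD, then prints the SD implied by the flagged row's mean and the means implied by its SD.

```python
import math

N = 163  # sample size stated in the caption of Table 1

# (reported mean, reported SD) pairs transcribed from Table 1
TABLE_1 = {
    "Male/Female": (0.53, 0.50),
    "White/Non-white": (0.44, 0.50),
    "College-educated": (0.77, 0.42),
    "Income over $65,000": (0.19, 0.35),
    "Demographic (urban/rural)": (0.34, 0.47),
    "Contacted prior to 2015": (0.93, 0.25),
    "Participants in previous study": (0.12, 0.33),
}


def in_rounding_interval(value, reported, decimals=2):
    half = 0.5 * 10 ** -decimals
    return reported - half <= value < reported + half


def sample_sd(n, k):
    """Sample SD of a binary variable with k ones among n observations."""
    return math.sqrt(k * (n - k) / (n * (n - 1)))


def counts_for_mean(n, mean):
    """Cell sizes k whose proportion k / n rounds to the reported mean."""
    return [k for k in range(n + 1) if in_rounding_interval(k / n, mean)]


def counts_for_sd(n, sd):
    """Cell sizes k whose sample SD rounds to the reported SD."""
    return [k for k in range(n + 1) if in_rounding_interval(sample_sd(n, k), sd)]


def passes_debit(n, mean, sd):
    """True if at least one cell size matches both the mean and the SD."""
    return any(in_rounding_interval(sample_sd(n, k), sd) for k in counts_for_mean(n, mean))


failing = [name for name, (m, s) in TABLE_1.items() if not passes_debit(N, m, s)]
print(failing)

mean, sd = TABLE_1["Income over $65,000"]
print([round(sample_sd(N, k), 2) for k in counts_for_mean(N, mean)])  # SD implied by the mean
print([round(k / N, 2) for k in counts_for_sd(N, sd)])                # means implied by the SD
```

Under these assumptions only the "Income over $65,000" row is flagged. The last list contains two means because the curve is symmetric about 0.5.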

Consistencies and Inconsistencies

Example 1: Paper #1

From Table 1 of Paper #1, DEBIT can analyse 14 mean/SD pairs, returning nine which are impossible (note: if mean/SD pairs are close on the graph, they appear as a homogeneous block). Alternative solutions exist if the SDs are correct and the means are not. The full DEBIT graph is given in Figure 3.

This graph reveals a simple heuristic: sometimes, values with the same mean return different SDs, and vice versa. Both cases are present in this dataset; identical means are seen as vertically arranged points (marked with a single arrow), and identical SDs are seen as horizontally arranged points (marked with a double arrow). For the case of the means, DEBIT may not be required to determine if a mistake has been made, as it can be seen at a glance that the values cannot co-exist. Note that this is only true if the values are relatively far apart, as they are in this example; two slightly divergent means may return the same SD due to rounding. This is not the case for SD values, which may be shared between means that are either (a) reflected symmetrically around the central axis (i.e., a mean of 0.1 will have the same associated SD as 0.9), or (b) adjacent to each other, due to rounding to two decimal places. Indeed, in the central portion of this DEBIT graph, where SD approaches 0.5 + ε, there are >100 values where SD = 0.50 (2 d.p.).

The data shown in Figure 3 appear to exist in more than one manuscript. The authors are aware of the errors and have received this manuscript, and the discrepancies are unresolved at the time of writing.

Figure 3: DEBIT graph of Paper #1.

Example 2: Paper #2

Descriptive statistics were supplied which list eight variables measured on a binary scale (i.e., with a range of 0 to 1), concerning demographic factors. The DEBIT graph (Figure 4) shows six of these diverging from possible solutions.

Figure 4: DEBIT graph of Paper #2

This graph shows several divergent values, with two being particularly divergent (marked with single arrows). An analyst might be tempted to conclude that these are quite substantially incorrect, but this is not the case; values of this kind would arise if the mean and SD were computed over group-level means rather than over individual observations. A direct analogy: consider the descriptive statistics for N = 1000 children drawn from 8 schools, with sex coded as a binary variable. Assuming schools with a non-selective intake, the sex means by school will be approximately balanced [0.47, 0.53, 0.49, 0.50, etc.], and the SD of these means will be extremely low (≈0.04), much lower than the SD for the whole sample (≈0.50).

The other variables present, however, appear to be derived from the whole sample and therefore fail DEBIT. As above, these figures also appear to be used in more than one published manuscript; the authors are aware of the errors and have received this manuscript, and the discrepancies are unresolved at the time of writing.
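The school analogy can be illustrated with invented numbers. In this sketch the eight per-school counts are hypothetical, chosen only so that each school is roughly balanced; it contrasts the SD of the eight school-level means with the SD of the same binary variable computed over all 1,000 children.

```python
import math
import statistics

# Hypothetical data: 8 schools of 125 children each, number of girls per school.
school_size = 125
girls_per_school = [59, 66, 61, 63, 60, 65, 62, 64]

# Summary computed over school-level means (8 data points).
school_means = [g / school_size for g in girls_per_school]
sd_of_school_means = statistics.stdev(school_means)

# Summary computed over individual children (1000 binary data points).
n = school_size * len(girls_per_school)
n_girls = sum(girls_per_school)
sd_individual = math.sqrt(n_girls * (n - n_girls) / (n * (n - 1)))

print(round(statistics.mean(school_means), 2), round(sd_of_school_means, 3))
print(round(n_girls / n, 2), round(sd_individual, 3))
```

Both summaries share a mean of 0.50, but only the individual-level SD lies on the DEBIT curve. A pair taken from the group-level summary would sit far below the curve without any reporting error having occurred.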

As in our previous work (Brown & Heathers, 2016), identifying details for both Papers #1 and #2 are not presented, and will remain so until the underlying issues can be privately resolved. Our reasons are as before: First, DEBIT is an exploratory technique, and should be regarded as such. Second, all of the above may result from entirely mundane analytical oversights. Third, these results are not an exposé but an attempt to generate discussion around an issue we feel worthy of investigation, and to produce a tool that might assist pre- and post-publication review of presented results.

Counter-Example #1: Johnson (2005)

Johnson (2005) examined departures from sentencing guidelines given the social context, nature of sentence, ethnicity, etc. of the defendant. This paper was selected at random, purely because of the substantial number of binary variables reported, which are supplied on p. 778 (Table 1). All variables pass DEBIT and no errors are observed. However, this table includes a large overall sample (N > 100,000) with a dramatically smaller subsample (N = 60). Without this qualification for the appropriate mean/SD pairs, three values emerge as non-congruent. With the correct curve, these values are revealed to be accurately described. To use DEBIT successfully, an interested analyst may have to consult the related text carefully to ensure that binary mean/SD values are compared to the appropriate sample sizes if the table is unlabelled. This may be of peripheral importance when sample sizes are similar, as DEBIT curves converge as N increases.
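The dependence of the DEBIT curve on N can be made explicit by writing the sample SD as the large-sample value multiplied by a correction factor. The following is a reconstruction from the sample-SD identity, with x̄ denoting the reported mean; the two numerical factors use the sample sizes quoted above and are offered as an illustration only.

```latex
\[
s_N(\bar{x}) = \sqrt{\frac{N}{N-1}}\;\sqrt{\bar{x}\,(1-\bar{x})},
\qquad
\sqrt{\tfrac{60}{59}} \approx 1.0084,
\qquad
\sqrt{\tfrac{100000}{99999}} \approx 1.000005
\]
```

On this account, the curves for N = 60 and N = 100,000 differ by at most about 0.004 in SD (at x̄ = 0.5), which is enough to move a value across a rounding boundary. This considers only the SD curve and not the granularity of the means, which also differs between the two sample sizes.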

Figure 5: The DEBIT graph of Johnson (2005). No errors are observed.

Conclusion

Most error detection tests deal with a fairly specific set of circumstances. This includes GRIM and its variants and SPRITE, but also others; for example, Simonsohn’s (2013) test is applicable to cases where standard deviations between groups selected from the same population are too similar; Carlisle’s (2017) application of the Stouffer-Fisher method is applicable to tables of descriptives reporting p values for the differences between groups at baseline in a randomized trial; and the “Förster test” (Anonymous, 2012) is applicable to one-way ANOVAs with suspiciously consistent linearity. However, in the right circumstances, these tests can reveal additional features (and, occasionally, serious inconsistencies) in the data they analyse. DEBIT should be seen in this light: as a specialized observation of a particular feature of analysis that is not necessarily employable in any given paper but that can, when correctly deployed in the correct context, give valuable insight into the accuracy of the data underlying commonly presented numerical values.

An instantiation of the DEBIT test (Matlab), the figures presented above, the code required to reproduce them, and this manuscript are publicly available at https://osf.io/6ajmd/. ​ ​


References

Beck, A. T., Steer, R. A., & Brown, G. K. (1996). Manual for Beck Depression Inventory-II. San Antonio, TX: Psychological Corporation.

Brown, N. J. L., & Heathers, J. A. J. (2016). The GRIM Test: A simple technique detects numerous anomalies in the reporting of results in psychology. Social Psychological and Personality Science, 8, 363–369. http://dx.doi.org/10.1177/1948550616673876

Camerer, C., Dreber, A., Holzmeister, F., Ho, T. H., Huber, J., Johannesson, M., … Wu, H. (2018). Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015. http://dx.doi.org/10.31235/osf.io/4hmb6

Camerer, C. F., Dreber, A., Forsell, E., Ho, T.-H., Huber, J., Johannesson, M., … Wu, H. (2016). Evaluating replicability of laboratory experiments in economics. Science, 351(6280), 1433–1436. http://dx.doi.org/10.1126/science.aaf0918

Heathers, J. A. J., Anaya, J., van der Zee, T., & Brown, N. J. L. (2018). Recovering data from summary statistics: Sample Parameter Reconstruction via Iterative TEchniques (SPRITE). http://dx.doi.org/10.7287/peerj.preprints.26968

Johnson, B. D. (2005). Contextual disparities in guidelines departures: Courtroom social contexts, guidelines compliance, and extralegal disparities in criminal sentencing. Criminology, 43, 761–796. https://doi.org/10.1111/j.0011-1348.2005.00023.x

Mayernik, M. S., Callaghan, S., Leigh, R., Tedds, J., & Worley, S. (2015). Peer review of datasets: When, why, and how. Bulletin of the American Meteorological Society, 96, 191–201. http://dx.doi.org/10.1175/BAMS-D-13-00083.1

Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2016). The prevalence of statistical reporting errors in psychology (1985-2013). Behavior Research Methods, 48, 1205–1226. http://dx.doi.org/10.3758/s13428-015-0664-2

Prinz, F., Schlange, T., & Asadullah, K. (2011). Believe it or not: how much can we rely on published data on potential drug targets? Nature Reviews. Drug Discovery, 10, 712–713. http://dx.doi.org/10.1038/nrd3439-c1

Vines, T. H., Albert, A. Y. K., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., . . . Rennison, D. J. (2014). The availability of research data declines rapidly with article age. Current Biology, 24, 94–97. http://dx.doi.org/10.1016/j.cub.2013.11.014