Scaling Up Fact-Checking Using the Wisdom of Crowds

Jennifer Allen1*, Antonio A. Arechar1,2,3*, Gordon Pennycook4, David G. Rand1,5,6†

Misinformation on social media has become a major focus of research and concern in recent years. Perhaps the most prominent approach to combating misinformation is the use of professional fact-checkers. This approach, however, is not scalable: Professional fact-checkers cannot possibly keep up with the volume of misinformation produced every day. Furthermore, not everyone trusts fact-checkers, with some arguing that they have a liberal bias. Here, we explore a potential solution to both of these problems: leveraging the “wisdom of crowds” to identify misinformation at scale using politically-balanced groups of laypeople. Using a set of 207 news articles flagged for fact-checking by an internal Facebook algorithm, we compare the accuracy ratings given by (i) three professional fact-checkers after researching each article and (ii) 1,128 Americans from Amazon Mechanical Turk after simply reading the headline and lede sentence. We find that the average rating of a small politically-balanced crowd of laypeople is as correlated with the average fact-checker rating as the fact-checkers’ ratings are correlated with each other. Furthermore, the layperson ratings can predict whether the majority of fact-checkers rated a headline as “true” with high accuracy, particularly for headlines where all three fact-checkers agree. We also find that layperson cognitive reflection, political knowledge, and Democratic Party preference are positively related to agreement with fact-checker ratings; and that informing laypeople of each headline’s publisher leads to a small increase in agreement with fact-checkers. Our results indicate that crowdsourcing is a promising approach for helping to identify misinformation at scale.

Teaser: When rating articles’ accuracy, a small politically-balanced crowd of laypeople yields high agreement with fact-checkers.

Forthcoming in Science Advances

First Version: October 2, 2020
This Version: August 8, 2021

1Sloan School of Management, Massachusetts Institute of Technology
2Center for Research and Teaching in Economics, CIDE
3Centre for Decision Research and Experimental Economics, CeDEx
4Hill/Levene Schools of Business, University of Regina
5Institute for Data, Systems, and Society, Massachusetts Institute of Technology
6Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology
*Joint first authors
†Corresponding author: [email protected]

INTRODUCTION

The spread of misinformation on social media – including blatantly false political “fake news”, misleading hyper-partisan news, and other forms of inaccurate content – has become a major matter of societal concern and focus of academic research in recent years (1). In particular, there is a great deal of interest in understanding what can be done to reduce the reach of online misinformation.

One of the most prominent approaches to combating misinformation, which technology companies such as Facebook and Twitter are currently employing (2, 3) and which has received considerable attention within academia (for a review, see (4)), is the use of professional fact-checkers to identify and label false or misleading claims. Fact-checking has the potential to greatly reduce the proliferation and impact of misinformation in at least two different ways. First, fact-checking can be used to inform users about inaccuracies. Debunking false claims typically reduces incorrect belief (5) (early concerns about potential “backfire effects” (6, 7) have not been supported by subsequent work; for a review see (8)). In the context of social media specifically, putting warning labels on content that has been contested by fact-checkers substantially reduces sharing intentions (9–11). Second, social media platforms can use fact-checking to influence the likelihood that particular pieces of content are shown to users. Using ranking algorithms to demote content that is contested by fact-checkers can dramatically reduce the number of users who are exposed (12).

As it is typically implemented, however, fact-checking has substantial problems with both scalability and trust (10) and therefore falls far short of realizing its potential. Professional fact-checking is a laborious process that cannot possibly keep pace with the enormous amount of content posted on social media every day. Not only does this lack of coverage dramatically reduce the impact of corrections, it also has the potential to increase belief in, and sharing of, misinformation that fails to get checked via the “implied truth effect”: People may infer that lack of warning implies that a claim has been verified (10). Furthermore, even when fact-check warnings are successfully applied to misinformation, their impact may be reduced by lack of trust. For example, according to a Poynter study, 70% of Republicans and 50% of Americans overall think that fact-checkers are biased and distrust fact-checking corrections (13). Thus, despite fact-checking’s potential, these drawbacks have substantially hindered its effectiveness.

Here we investigate a potential solution to both of these problems: harnessing the “wisdom of crowds” (14, 15) to make fact-checking scalable and protect it from allegations of bias. Unlike professional fact-checkers, who are in short supply, it is easy (and inexpensive) to recruit large numbers of laypeople to rate headlines, thereby allowing scalability. And by recruiting laypeople from across the political spectrum, it is easy to create layperson ratings that are politically balanced – and thus cannot be accused of having liberal bias. But would such layperson ratings actually generate useful insight into the accuracy of the content being rated? Professional fact-checkers have a great deal of training and expertise that enables them to assess information quality (16). Conversely, the American public has a remarkably low level of media literacy.
For example, in a 2018 Pew poll, a majority of Americans could not even reliably distinguish factual statements from opinions (17). Furthermore, layperson judgments may be unduly influenced by partisanship and politically motivated reasoning rather than actual veracity. Thus, there is reason to be concerned that laypeople would not be effective in classifying the accuracy of news articles.

This, however, is where the power of the wisdom of crowds comes into play. A large literature shows that even if the ratings of individual laypeople are noisy and ineffective, aggregating their responses can lead to highly accurate crowd judgments. For example, the judgment of a diverse, independent group of laypeople has been found to outperform the judgment of a single expert across a variety of domains, including guessing tasks, medical diagnoses, and predictions of corporate earnings (14, 15, 18, 19). Most closely related to the current paper, crowd ratings of the trustworthiness of news publishers were very highly correlated with the ratings of professional fact-checkers (20, 21).

These publisher-level ratings, however, may have limited utility for fighting online misinformation. First, there is a great deal of heterogeneity in the quality of content published by a given outlet (22). Thus, using publisher-level ratings may lead to a substantial number of false negatives and false positives. In other words, publisher-level ratings are too coarse to reliably classify article accuracy. This interferes with the effectiveness of both labeling and down-ranking problematic content. Second, publisher trust ratings are largely driven by familiarity: People overwhelmingly distrust news publishers that they are unfamiliar with (20, 21). Thus, using publisher-level ratings unfairly punishes publishers that produce accurate content but are either new or niche outlets. This is highly problematic given that much of the promise of the internet and social media as positive societal forces comes from reducing the barrier to entry for new and specialized content producers.

To address these shortfalls and increase the utility of fact-checking, we ask whether the wisdom of crowds is sufficiently powerful to allow laypeople to successfully tackle the substantially harder, and much more practically useful, problem of rating the veracity of individual articles. Prior efforts at using the crowd to identify misinformation have typically focused on allowing users to flag content that they encounter on platform and believe is problematic, and then algorithmically leveraging this user flagging activity (23). Such approaches, however, are vulnerable to manipulation: Hostile actors can engage in coordinated attacks – potentially using bots – to flood the reporting system with misleading responses, for example engaging in coordinated flagging of actually accurate information that is counter to their political ideology. However, this danger is largely eliminated by using a rating system in which random users are invited to provide their opinions about a specific piece of content (as in, for example, election polling), or simply hiring laypeople to rate content (as is done with content moderation). When the crowd is recruited in this manner, it is much more difficult for the mechanism to be infiltrated by a coordinated attack, as the attackers would have to be invited in large numbers to participate and suspicious accounts could be screened out when selecting which users to invite to rate content.
Thus, in contrast to most prior work on misinformation identification, we explore a crowdsourcing approach in which participants are recruited from an online labor market and asked to rate randomly selected articles. We investigate how well laypeople perform when making judgments based on only the headline and the lede of the article, rather than reading the full article and/or conducting their own research into the article’s veracity. We do so for two reasons. First, we are focused on scalability and thus are seeking to identify an approach that involves a minimum amount of time per rating. Directly rating headlines and ledes is much quicker than reading full articles and doing research. Second, this approach protects against articles that have inaccurate or sensational headlines but accurate texts. Given that most users do not read past the headline of articles on social media (24), it is the accuracy of the headline (rather than the full article) that is most important for preventing exposure to misinformation online. Finally, in an effort to optimize crowd performance, we also explore the impact of (i) identifying the article’s publisher (via domain of the URL, e.g. breitbart.com (22)), which could either be informative or work to magnify partisan bias, and (ii) selecting layperson raters based on individual difference characteristics (25, 26) that we find to be correlated with truth discernment.

A particular strength of our approach involves stimulus selection. Past research has demonstrated that laypeople can discern truth from falsehoods on stimulus sets curated by experimenters (25, 27); for example, Bhuiyan et al. (2020) found that layperson ratings were highly correlated with experts’ for claims about scientific topics that had a “high degree of consensus among domain experts, as opposed to political topics in which the potential for stable ground truth is much more challenging” (28). However, the crowd’s performance will obviously depend on the particular statements that participants are asked to evaluate. For example, discerning the truth of “the earth is a cube” versus “the earth is round” will lead to a different result from “the maximum depth of the Pacific Ocean is 34,000 feet” versus “the maximum depth of the Pacific Ocean is 36,000 feet”. In each case, the first statement is false while the second statement is true – but the first comparison is far easier than the second. Thus, when evaluating how well crowds can help scale up fact-checking programs, it is essential to use a set of statements that – unlike past work – accurately represent the misinformation detection problem facing social media companies.

To that end, Facebook provided us with a set of articles for evaluation. The articles were sampled from content posted on Facebook in an automated fashion to upsample for content that (i) involved civic topics or health-related information, (ii) was predicted by Facebook's internal models to be more likely to be false or misleading (using signals such as comments on posts expressing disbelief, false news reports, pages that have shared misinformation in the past), and/or (iii) was widely shared (i.e. viral). Using these articles, we can directly assess the crowd’s ability to fact-check a representative set of potentially problematic content that social media platforms would have directed to professional fact-checkers.

In our experiment, we recruited 1,128 American laypeople from Amazon Mechanical Turk.
Each participant was presented with the headline and lede of 20 articles (sampled randomly from a full set of 207 article URLs); half of the participants were also shown the domain of the article’s publisher (i.e., the source). Participants rated each article on seven dimensions related to accuracy, which we averaged to construct an aggregate accuracy rating (Cronbach’s α = .96), as well as providing a categorical rating of true, misleading, false, or can’t tell. We then compared the layperson headline+lede ratings to ratings generated by three professional fact-checkers doing detailed research on the veracity of each article. For further details, see Methods.

RESULTS

To provide a baseline for evaluating layperson performance, we begin by assessing the level of agreement among the professional fact-checkers. The average correlation across articles between the three fact-checkers’ aggregate accuracy ratings was r = .62 (range = .53 - .81, ps < .001). Considering the categorical ratings, all three fact-checkers gave the same rating for 49.3% of articles, two out of three fact-checkers gave the same rating for 42.0% of articles, and all three fact-checkers gave different ratings for 8.7% of articles. The Fleiss kappa for the categorical ratings is .38 (p < .001).

On the one hand, these results demonstrate considerable agreement among the fact-checkers: The aggregate accuracy rating correlation is quite high, and at least two out of three fact-checkers’ categorical ratings agreed for over 90% of the articles. At the same time, however, the aggregate accuracy rating correlation is far from 1, and the categorical ratings’ kappa statistic only indicates “fair” agreement. This disagreement among fact-checkers is not unique to the fact-checkers we recruited for this study. For example, a study comparing the scores given by FactCheck.org and Politifact using ordinal rating scales found a correlation of r = .66 (29), while another study found inter-fact-checker agreement as low as .47 when comparing the ratings of well-known fact-checking organizations (30). Furthermore, as described in SI Section 9, ratings generated by a set of 4 professional journalists fact-checking the same set of articles used in our study had an average correlation of r = .67.

This level of variation in the fact-checker ratings has important implications for fact-checking programs, such as emphasizing the importance of not relying on ratings from just a single fact-checker for certifying veracity, and highlighting that “truth” is often not a simple black-and-white classification problem. Moreover, this is an even larger issue for political news: Among the 109 political URLs, the correlation among the fact-checkers was only r = .56, compared to r = .69 among the 98 non-political URLs. With this in mind, we use the variation among experts as a benchmark against which to judge the performance of the layperson ratings, to which we now turn.

In our first analysis, we examine how the correlation between the layperson aggregate accuracy ratings and the fact-checker aggregate accuracy ratings varies with the size of the crowd (i.e. the number of layperson ratings per article, k, as smaller – and thus more scalable – crowds can often approximate the performance of larger crowds (31–33)); see Figure 1a. We begin by comparing the Source and No-Source conditions. We find that the correlation between the laypeople and the fact-checkers is consistently higher in the Source condition, although the difference is comparatively small (increase of between .03 and .06, Pearson’s r, for the correlation), and only becomes statistically significant for higher values of k (p < .05 for k ≥ 24).
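To make this crowd-size analysis concrete, the following is a minimal sketch (not the authors’ analysis code) of one way to bootstrap politically-balanced crowds of size k and correlate their mean aggregate rating with the fact-checkers’ mean rating; the data layout and column names (article_id, party, rating) are assumptions made purely for illustration.

```python
# Minimal sketch of the crowd-size analysis: bootstrap politically-balanced crowds of
# size k and correlate their mean aggregate rating with the mean fact-checker rating.
# The data layout and column names are illustrative assumptions, not the study's.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def crowd_fc_correlation(lay: pd.DataFrame, fc: pd.Series, k: int,
                         n_boot: int = 1000, seed: int = 0):
    """Mean and 95% CI of the across-article Pearson correlation between a
    politically-balanced crowd of size k (k/2 Democrats and k/2 Republicans,
    sampled with replacement per article) and the average fact-checker rating."""
    rng = np.random.default_rng(seed)
    articles = fc.index.to_numpy()
    corrs = np.empty(n_boot)
    for b in range(n_boot):
        crowd_means = []
        for a in articles:
            ratings_a = lay[lay.article_id == a]
            dems = ratings_a.loc[ratings_a.party == "D", "rating"].to_numpy()
            reps = ratings_a.loc[ratings_a.party == "R", "rating"].to_numpy()
            sample = np.concatenate([rng.choice(dems, k // 2, replace=True),
                                     rng.choice(reps, k // 2, replace=True)])
            crowd_means.append(sample.mean())
        corrs[b], _ = pearsonr(crowd_means, fc.loc[articles].to_numpy())
    lo, hi = np.percentile(corrs, [2.5, 97.5])
    return corrs.mean(), lo, hi

# Example usage (hypothetical inputs): crowd_fc_correlation(lay_ratings, fc_means, k=10)
```

Sweeping k over even values from 2 to 26 and plotting the resulting means and intervals would correspond to the crowd-size curves shown in Figure 1.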

Figure 1. Correlation across articles between (i) politically-balanced layperson aggregate accuracy ratings based on reading the headline and lede and (ii) average fact-checker research-based aggregate accuracy ratings, as a function of the number of layperson ratings per article. Laypeople are grouped by condition (Source vs. No Source). For results showing up to 100 laypeople per article, see SI Figure S7; for results using the average of the correlations with a single fact-checker, rather than the correlation with the average of the fact-checker ratings, see SI Figure S9. Panels show results for (a) All articles, (b) Non-Political articles, (c) Political articles. The dashed line indicates the average Pearson correlation between fact-checkers (all articles, r = .62; non-political articles, r = .69; political articles, r = .56).

To assess how well the layperson judgments correlate with the fact-checkers’ in absolute terms, we use the correlation between the fact-checker ratings as a benchmark. Perhaps surprisingly, even with a small number of layperson ratings per article, the correlation between the laypeople and the fact-checkers is not significantly lower than the correlation among the fact-checkers. Specifically, starting at k = 8 for the Source condition and k = 12 for the No-Source condition, we find that the correlation between laypeople and the fact-checkers does not significantly differ from the average correlation between the fact-checkers (Source condition: k = 8, r = .58, 95% CI = .51 - .65, p = .2; No-Source condition: k = 12, r = .57, 95% CI = .50 - .63, p = .08). Furthermore, in the Source condition, the correlation between the laypeople and the fact-checkers actually gets significantly higher than the correlation between the fact-checkers once k becomes sufficiently large (k = 22, r = .66, 95% CI = .61 - .71; p < .05 for all k ≥ 22). Examining political versus non-political URLs separately (Figure 1b,c), we see that the layperson judgments agree with the fact-checkers to the same extent for different types of news. Taken together with the observation that there is substantially less agreement among the fact-checkers for political news, this means that the crowd performs particularly well relative to the intra-fact-checker benchmark for political news. Our results therefore indicate that a relatively small number of laypeople can produce an aggregate judgment, given only the headline and lede of an article, that approximates the individual judgments of professional fact-checkers – particularly for political headlines.

The advantage of these analyses is that they provide an apples-to-apples comparison between laypeople and fact-checkers by using aggregate accuracy ratings for both groups. In practice, however, it is often desirable to draw actionable conclusions about whether or not any particular article is true, which these analyses do not. To this end, our second analysis approach uses the layperson aggregate accuracy ratings to build a classifier that determines whether or not each article is true based on the fact-checker categorical ratings. To assess the performance of this classifier, we compare its classification with the modal fact-checker categorical rating (an article is labeled “1” if the modal fact-checker response is “True” and “0” otherwise; see Supplementary Information, SI, Section 5 for analysis where an article is labeled “1” if the modal fact-checker response is “False” and “0” otherwise). We then evaluate the area-under-the-curve (AUC), which measures accuracy while accounting for differences in base rates and is a standard measure of model performance in fields such as machine learning (34). The estimate of the AUC when considering all articles asymptotes with a crowd of around 26 at .86 for the Source condition and .83 for the No-Source condition (n = 26, Source condition: AUC = .86, 95% CI = .84 - .87, No-Source condition, AUC = .83, 95% CI = .81 - .85). This can be interpreted as meaning that when shown a randomly selected true article and a randomly selected not-true (i.e. false, misleading, or can’t tell) article, a crowd of 26 laypeople will rate the true article higher than the not-true article roughly 85% of the time. This overall AUC analysis, however, does not take into account heterogeneity across articles in the level of fact-checker agreement.
Pragmatically we should expect – and normatively we should demand – less predictive power from layperson judgments for articles where there was disagreement among the fact-checkers. Indeed, the AUC is substantially higher when considering the articles in which there was a consensus among the three fact-checkers (n = 102, asymptoting with an AUC of .90 - .92; Figure 2A) compared to the articles for which there was disagreement among the fact-checkers (n = 105, asymptoting at AUC of .74 - .78; Figure 2B). Thus, the layperson ratings do a very good job of classifying the articles for which there is a relatively clear correct answer and do worse on articles that fact-checkers also struggle to classify. While of course it is a priori impossible to tell which articles will generate consensus or disagreement among fact-checkers, these results suggest that perfect crowd prediction might even be impossible (given the sort of headlines that are flagged for fact-checking by Facebook) due to limitations of the fact-checkers’ agreement and difficulty of the task. It is also encouraging to note that despite the lower levels of consensus among fact-checkers for political vs. non-political headlines, the crowd’s AUC performance is not significantly worse for political headlines than for non-political ones (see SI Section 6).
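For concreteness, the following minimal sketch shows how such an AUC can be computed from a crowd’s mean ratings and the modal fact-checker labels; the arrays are toy values and the variable names are ours, not the study’s.

```python
# Minimal sketch of the AUC analysis: score a crowd's mean aggregate rating against a
# binary label derived from the modal fact-checker rating. Toy values, not study data.
import numpy as np
from sklearn.metrics import roc_auc_score

crowd_mean = np.array([5.8, 2.1, 4.9, 3.0, 6.2, 3.9])   # crowd mean per article (1-7 scale)
fc_modal = np.array(["True", "False", "True", "Misleading", "True", "Can't tell"])

is_true = (fc_modal == "True").astype(int)   # 1 if the modal fact-checker rating is "True", else 0
auc = roc_auc_score(is_true, crowd_mean)     # chance a random true article outscores a random non-true one
print(f"AUC = {auc:.2f}")
```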

Figure 2. Classifying articles as true versus non-true based on layperson aggregate Likert ratings. (a, b) AUC scores as a function of the number of layperson ratings per article and source condition. AUC is calculated using a model in which the average layperson aggregate Likert rating is used to predict the modal fact-checker categorical rating, where the fact-checker rating is coded as “1” if the modal rating is “True” and “0” otherwise. For full ROC curves using a politically-balanced crowd of size 26, see SI Section 14. (c) Out-of-sample accuracy for ratings from a politically-balanced crowd of size 26 given source information, calculated separately for unanimous and non-unanimous headlines, as the proportion of unanimous headlines in the sample increases. (d) Cutoff for classifying an article as “True” as the proportion of unanimous headlines in the sample increases.

We further explore the ability of crowd ratings to classify articles by simulating how the accuracy of the layperson ratings varies based on the share of headlines with unanimous fact-checker agreement in the stimulus set. Crucially, in these analyses the threshold used for classifying an article as true must be the same for unanimous versus non-unanimous headlines (given that in real applications, fact-checker unanimity will not be known ex ante); for simplicity, we focus on the Source condition in these analyses. As expected, we find that as the share of unanimous articles increases, the accuracy of the crowd increases, going from an out-of-sample accuracy of 74.2% with a set of all non-unanimous articles to 82.6% with a set of all unanimous articles (Figure 2C). Interestingly, although the optimal aggregate accuracy scale cutoff decreases as the share of unanimous headlines increases, it remains within a fairly restricted range (Figure 2D). Regardless of stimulus make-up, the optimal model is somewhat conservative, in that the crowd must be slightly above the aggregate accuracy scale midpoint for an article to be rated as true.

Finally, we examine how individual differences among laypeople relate to agreement with fact-checker ratings, and whether it is possible to substantially improve the performance of the crowd by altering its composition. In particular, we focus on three individual differences which have been previously associated with truth discernment: partisanship, political knowledge, and cognitive reflection (the tendency to engage in analytic thinking rather than relying on intuition). For each individual difference, we collapse across Source and No-Source conditions, and begin by fixing a crowd size of k = 26 (at which point crowd performance mostly asymptotes). We then examine (i) the correlation between layperson and fact-checker aggregate Likert ratings and (ii) the AUC for predicting whether the fact-checkers’ modal categorical rating is “true” (Figure 3).

Figure 3. Comparing crowds with different layperson compositions to a baseline, politically-balanced crowd. (a) Pearson correlations between the average aggregate accuracy rating of a crowd of size 26 and the average aggregate accuracy rating of the fact-checkers. (b) AUC for the average aggregate accuracy rating of a crowd of size 26 predicting whether the modal fact-checker categorical rating is “true”. For both (a) and (b), we compare the baseline to a crowd of only Democrats vs. only Republicans; a politically-balanced crowd of participants who scored above the median on the CRT vs. at or below the median on the CRT; and a politically-balanced crowd of participants who score above the median on political knowledge vs. at or below the median on political knowledge. Means and confidence intervals generated using bootstraps with 1000 iterations; see SI Section 2 for details. For analysis comparing political to non-political headlines, see SI Section 13.

As expected, we see clear differences. Democrats are significantly more aligned with fact-checkers than Republicans (correlation, p < .001; AUC, p < .001); participants who are higher on political knowledge are significantly more aligned with fact-checkers than participants who are lower on political knowledge (correlation, p < .001; AUC, p < .001); and participants who are higher on cognitive reflection are significantly more aligned with fact-checkers than participants who are lower on cognitive reflection (correlation, p = .01; AUC, p = .02). The knowledge and reflection results held among both Democrats and Republicans; see SI Section 7.

Strikingly, however, restricting to the better-performing half of participants for each individual difference does not lead to a significant increase in performance over the baseline crowd when examining correlation with fact-checkers’ average aggregate accuracy rating (Democrats vs. Baseline: p = .35, High CRT vs. Baseline: p = .74, High PK vs. Baseline: p = .57) or AUC when predicting modal fact-checker categorical rating (Democrats vs. Baseline: p = .18, High CRT vs. Baseline: p = .59, High PK vs. Baseline: p = .60). While perhaps surprising, this pattern often arises in wisdom-of-crowds phenomena. The existence of uncorrelated observations from low performers amplifies the high-performer signal by canceling out noise. Thus, while it is important that any given crowd includes some high performers, it is not necessary to exclude low performers to achieve good performance when the crowd is sufficiently large. Using only higher truth-discernment participants does, however, reduce the crowd size needed to reach high agreement with the fact-checkers (see SI Section 8) – that is, crowds that are high on cognitive reflection, political knowledge, or preference for the Democratic Party reach their asymptotic agreement with fact-checkers at substantially smaller crowd sizes. For example, for all three individual differences, restricting to the high performers roughly halves the number of participants needed for the correlation between laypeople and the fact-checkers to reach the average correlation between the fact-checkers. Thus, selecting better-performing raters may boost scalability.

DISCUSSION

Here we have shown that crowdsourcing can help identify misinformation at scale. We find that, after judging merely the headline and lede of an article, a small, politically-balanced crowd of laypeople can match the performance of fact-checkers researching the full article. Interestingly, selecting raters whose characteristics make them more likely to agree with the fact-checkers (e.g. more deliberative, higher political knowledge, more liberal) or providing more information (the article’s publisher) leads to only minimal improvements in the crowd’s agreement with the fact-checkers. This indicates the robustness of the crowd wisdom that we observe, and suggests that our results are not particularly sensitive to our specific participant population and task design – and thus that our results are likely to generalize.

Together, our findings suggest that crowdsourcing could be a powerful tool for scaling fact-checking on social media. That these positive results were achieved using small groups of untrained laypeople without research demonstrates the viability of a fact-checking pipeline that incorporates crowdsourcing. Such an approach need not rely on users volunteering their time to provide ratings, but could be scalable even with social media platforms compensating raters. For example, in our study each rating took on average 35.7s, raters were paid $0.15/minute (1.2x greater than the federal minimum wage in the U.S.), and we achieved good performance with crowds of size 10. Thus, the crowdsourcing approach used here produced useful quality ratings at a cost of roughly $0.90 per article.

Our results also have practical implications for the manner in which crowdsourcing is implemented. In particular, we advocate for using the continuous crowdsourced accuracy ratings as a feature in newsfeed ranking, proportionally downranking articles according to their scores. A continuous feature incorporates the signal in the crowd’s ratings while guarding against errors that accompany sharp cutoffs of “true” vs. “false”. Additionally, downranking has the benefit of lowering the probability that a user encounters misinformation at all, guarding against the illusory truth effect by which familiar falsities seem more true after repetition (9, 35). While corrections to misinformation have generally been shown to be effective (5, 36), that efficacy is dependent on the manner of correction and the possibility of a familiarity “backfire” effect cannot be ruled out (9) (although there has been consistent evidence against it, see (8)). Preventing the spread of misinformation by limiting exposure is a proactive way to fight fake news. Furthermore, work on accuracy prompts/nudges indicates yet another benefit of crowdsourcing: simply asking users to rate the accuracy of content primes the concept of accuracy and makes them more discerning in their subsequent sharing (37, 38).

The promise of crowd ratings that we observe here does not mean that professional fact-checkers are no longer necessary. Rather, we see crowdsourcing as just one component of a misinformation detection system that incorporates machine learning, layperson ratings, and expert judgments. While machine-learning algorithms are scalable and have been shown to be somewhat effective for detecting fake news, they also are domain specific and thus susceptible to failure in our rapidly changing information environment (39–43).
Additionally, the level of disagreement between fact-checkers raises concerns about systems that 1) privilege the unilateral decisions of a single fact-checker or 2) use a single fact-checker’s ratings as “ground truth” in supervised machine-learning models, as is often done (see also ref (44)). We see the integration of crowdsourced ratings as helping to address the shortcomings of these other methods: Using crowd ratings as training inputs can help artificial intelligence adapt more quickly, and supplement and extend professional fact-checking. Allowing a greater number of articles to be assessed and labeled will directly expand the impact of fact-checking, as well as reducing the chance that unlabeled inaccurate claims will be believed more because they lack a warning (i.e. the “implied truth effect”) (10).

In addition to these practical implications for fighting misinformation, our results have substantial theoretical relevance. First, we contribute to the ongoing debate regarding the role of reasoning in susceptibility to misinformation. Some have argued that people engage in “identity protective cognition” (45–48), such that more reasoning leads to greater polarization rather than greater accuracy. In contrast to this argument, our findings of greater cognitive reflection and political knowledge being associated with higher fact-checker agreement support the “classical reasoning” account whereby reasoning leads to more accurate judgments. In particular, our results importantly extend prior evidence for classical reasoning due to our use of an ecologically valid stimulus set. It is possible that prior evidence of a link between cognitive sophistication and truth discernment (9, 20, 25, 49, 50) was induced by experimenters selecting headlines that were obviously true versus false. Here we show that the same finding is obtained when using a set of headlines where veracity is much less cut and dried, and which represents an actual sample of misleading content (rather than being hand-picked by researchers). Relatedly, our individual differences results extend prior work on partisan differences in truth discernment, as we find that Republicans show substantially less agreement with professional fact-checkers than Democrats in our naturally-occurring article sample – even for non-political articles (see SI Section 13).

Second, our results contribute to the literature on source effects. While a large literature has shown that people preferentially believe information from trusted sources (for a review, see (51)), this work has usually focused on information shared by people who are more or less trusted. In contrast, numerous recent studies have examined the impact of the publisher associated with a news headline (either by hiding versus revealing the actual publisher, or by experimentally manipulating which publisher is associated with a given article), and have surprisingly found little effect (22, 26, 52). To explain this lack of effect, it has been theorized that information about the publisher only influences accuracy judgments insomuch as there is a mismatch between publisher trustworthiness and headline plausibility (22). In the absence of such a mismatch, publisher information is redundant. But, for example, learning that an implausible headline comes from a trusted source should increase the headline’s perceived accuracy.
Our observation that providing publisher information increases agreement between laypeople and fact-checkers (albeit by a fairly small amount) supports this theory, because our stimulus set involves mostly implausible headlines. Thus, by this theory, we would expect our experiment to be a situation where publisher information would indeed be helpful, and this is what we observe.

Third, our results contribute to work on applying the wisdom of crowds in political contexts. In particular, we add to the growing body of work suggesting that aggregating judgments can substantially improve performance even in highly politicized contexts. In addition to the studies examining trust in news publishers described above (20, 21), research has shown that crowdsourcing produces more accurate judgments than individual decision-making across a variety of partisan issues including climate change, immigration, and unemployment (53, 54). This result holds whether the network is politically balanced or homogeneous, although evidence has shown that a politically balanced group produces higher-quality results than a homogeneous one (53–55). Together, this body of work demonstrates the broad power of crowdsourcing despite systematic polarization in attitudes amongst members of the crowd. Our results also show the limits of a recent finding that cognitively diverse groups made of a mix of intuitive and analytical thinkers perform better than crowds of only more analytical thinkers (56). We do not observe this effect in our data, where mixed groups were no more effective – and if anything, slightly less effective – than groups of only more analytic thinkers.

Finally, there are various limitations of our study and important directions for future research. First, we note that our results should not be interpreted as evidence that individual participants identify false information with high reliability. Even when the crowd performance was good, individual participants often systematically misjudged headline veracity. Second, although we show that small crowds of laypeople perform surprisingly well at identifying (mis)information in our experiment, it is possible that layperson crowds could do even better – or that even smaller crowds could achieve similar levels of performance. For example, in the name of scalability, we had the laypeople rate only the headline and lede. It is possible that having them instead read the full article, and/or do research on the claims, could lead to even more accurate ratings. If so, however, the improvement gained would have to be weighed against the increased time required (and thus decreased scalability). Another route to improved performance could be to investigate more complex algorithms for weighting the crowd responses (31, 33, 57) rather than simply averaging the ratings as we do here. Furthermore, the crowd ratings of headline accuracy could be integrated with other signals, such as publisher quality (20, 21) and existing machine-learning algorithms for misinformation detection.

Third, future work should assess how our results generalize to other contexts. A particular strength of our approach is that the articles we analyzed were selected in an automated fashion by an internal Facebook algorithm, and thus are more representative of articles that platforms need to classify than are the researcher-selected articles used in most previous research.
However, we were not able to audit Facebook's article selection process, and there was no transparency regarding the algorithm used to select the articles. Thus, the articles we analyze here may not actually be representative of misinformation on Facebook or of the content that platforms would use crowds to evaluate. It is possible that biases in the algorithm or the article selection process may have biased our results in some way. For example, certain types of misleading or inaccurate headlines may be under-represented. Alternatively, it is possible that Facebook misled us and purposefully provided a set of articles that were artificially easy to classify (in an effort to cast their crowdsourcing efforts in a positive light). While we cannot rule out this possibility, we did replicate our results by analyzing a previously published dataset of researcher-selected headlines (25) and found that the crowd performed substantially better than for the articles provided by Facebook (see SI Section 16), suggesting that the article set from Facebook was at least more difficult to classify than article sets often used in academic research.

That being said, it is critical for future research to apply the approach used here to a wide range of articles in order to assess the generalizability of our findings. It is possible that under circumstances with rapidly evolving facts, such as in the case of the COVID-19 news environment, results for both the crowd and fact-checkers would differ (although studies with researcher-selected true and false COVID-19 headlines find high crowd performance, e.g. (38)).

A related question involves the generalizability of the crowd itself. Our sample was not nationally representative, and it is possible that a representative sample would not perform as well. Similarly, highly inattentive crowds would likely show worse performance. However, our goal here was not to assess the accuracy of judgments of the nation as a whole. Instead, the question was whether it was possible to use laypeople to inexpensively scale fact-checking. In this regard, our results are unambiguously positive. Social media platforms could even simply hire workers from Amazon Mechanical Turk to rate articles, and this would allow for low-cost fact-checking. It remains unclear, however, how these results would generalize to other countries and cultures. Cross-cultural replications are an essential direction for future work. Relatedly, a key feature of the American partisan context that we examine here is that the two relevant factions (Democrats and Republicans) are roughly equally balanced in frequency. As a result, one side’s opinion did not heavily outweigh the other’s when creating average crowd ratings. It is essential for crowdsourcing methods to develop ways to extract signal from the crowd without allowing majorities to certify untruths about marginalized groups (e.g. ethnic minorities).

Fourth, one might be concerned about partisan crowds purposely altering their responses to try to “game the system” and promote content that they know to be false or misleading but is aligned with their political agenda. However, most Americans do not actually care that much about politics (58), and thus are unlikely to be overly motivated to distort their responses. Furthermore, to the extent that partisans do bias their responses, it is likely that they will do so in a symmetric way, such that when creating politically balanced ratings the bias cancels out.
Accordingly, research shows that informing users that their responses will influence what content is shown to social media users does not substantially affect crowd performance for identifying trustworthy news publishers (21). However, it is important to keep in mind that this work focused on settings in which users are presented with specific pieces of content to rate (e.g. Facebook's "Community Review" (59)). It is unclear how such findings would generalize to designs where participants can choose which pieces of news to rate (e.g. Twitter’s “BirdWatch” (60)), which could be more vulnerable to coordinated attacks. In sum, the experiment presented here indicates the promise of using the wisdom of crowds to scale fact-checking on social media. We believe that, in combination with other measures like detection algorithms, trained experts, and accuracy prompts, crowdsourcing can be a valuable asset in combating the spread of misinformation on social media.

METHODS

Data and materials are available online (https://osf.io/hts3w/). Participants provided informed consent, and our studies were deemed exempt by MIT’s Committee On the Use of Humans as Experimental Subjects, protocol number 1806400195.

Layperson Participants. Between 2/9/2020 and 2/11/2020, we recruited 1,246 U.S. residents from Amazon Mechanical Turk. Of those, 118 participants did not finish the survey and were thus excluded, leaving our final sample with 1,128 participants. They had a mean age of 35.18 years; 38.48% were female; 65.86% had completed at least a Bachelor’s degree; 58.33% had an income of less than $50,000; and 54.29% indicated a preference for the Democratic party over the Republican party. The median completion time for the full study was 15:48 minutes, and the median time spent completing the article assessment portion of the study was 7:51 minutes. Participants were paid a flat fee of $2.25 for their participation. We chose not to give financial rewards for producing answers that agreed with the professional fact-checkers, because one of the benefits of the crowdsourced approach is the ability to avoid claims of liberal bias. If the laypeople were incentivized to agree with the fact-checkers, this would make the crowd susceptible to the same complaints of bias made by some about the fact-checkers.

Professional Fact-checkers. Between 10/27/2019 and 1/21/2020, we hired three professional fact-checkers from the freelancing site Upwork to fact-check our set of articles. These fact-checkers, whom we selected after an extensive vetting process and initial screening task, had substantial expertise and experience, with a combined total of over 2,000 hours of fact-checking experience logged on Upwork. Specifically, we had the fact-checkers first complete an initial assessment task, in which they fact-checked 20 of the 207 articles in our set. They were asked to each independently conduct research online to support their evaluations, spending up to 30 minutes on each article. They had the opportunity to work at their own pace but most delivered this initial task within a week. We then checked their responses to confirm that they were thorough and displayed a mastery of the task, including giving individualized feedback and engaging in discussion when there was substantial disagreement between the fact-checkers. Interestingly, this discussion revealed real, reasoned disagreements, rather than misunderstandings or sloppiness; see SI Section 15 for details. Once this initial trial was completed satisfactorily, we had the fact-checkers independently evaluate the remainder of the articles (without any communication or discussion amongst themselves about the articles). Furthermore, to demonstrate that our results are not driven by idiosyncrasies of these particular fact-checkers, in SI Section 9 we replicate our main analyses from Figure 1 using ratings generated by a different set of 4 professional journalists who had just completed a prestigious fellowship for mid-career journalists and had extensive experience reporting on U.S. politics (61). The average rating of these 4 fact-checkers correlated strongly with the average rating of the Upwork fact-checkers (r = .81).

Materials. Participants were each asked to rate the accuracy of 20 articles, drawn at random from a set of 207 articles; professional fact-checkers rated all 207 articles. Our goal is to assess how effective crowdsourcing would be for meeting the misinformation identification challenge faced by social media platforms. Therefore, it is important that the set of articles we use be a good representation of the articles that platforms are trying to classify (rather than, for example, articles that we made up ourselves). To that end, Facebook provided us with a set of 796 URLs for evaluation. The articles were sampled from content posted on Facebook in an automated fashion to upsample for content that (i) involved civic topics or health-related information, (ii) was predicted by Facebook's internal models to be more likely to be false or misleading (using signals such as comments on posts expressing disbelief, false news reports, pages that have shared misinformation in the past), and/or (iii) was widely shared (i.e. viral). Because we were specifically interested in the effectiveness of layperson assessments based on only reading the headline and lede, we excluded 299 articles because they did not contain a claim of fact in their headline or lede (as determined by four research assistants). We also excluded 34 URLs because they were no longer functional. Of the remaining 463 articles, we randomly selected a subset of 207 to use for our study. The list of URLs can be found in SI Section 1.

In terms of URL descriptives, of the 207 URLs, Facebook’s topic classification algorithms labeled 109 as being political, 43 as involving crime and tragedy, 22 as involving social issues, 17 as involving health; and fewer than 15 of all other topic categories (URLs could be classified as involving more than one topic). Furthermore, using MediaBiasFactCheck.com to classify the quality of the source domains for the 207 URLs, 46 were rated Very High or High, 13 were rated Mostly Factual, 75 were rated Mixed, 23 were rated Low, Very Low, or Questionable Source, 12 were rated as Satire, and 38 were not rated by MediaBiasFactCheck.com. Using the quality scores provided by NewsGuard (between 0 and 100), the mean source quality was 67.7 (sd 28.2), and the median source quality score was 75; 26 URL domains were not rated by NewsGuard.

Procedure. The layperson participants and the professional fact-checkers completed different, but similar, surveys. In both cases, the survey presented respondents with a series of articles and asked them to assess each article’s central claim. For each article, respondents were first asked to make a categorical classification, choosing between the options “True”, “Misleading”, “False”, and “Not sure”. Second, respondents were asked to provide more fine-grained ratings designed to assess the objective accuracy of the articles using a series of 7-point Likert scales. Specifically, they were asked the extent to which the article 1) described an event that actually happened, 2) was true, 3) was accurate, 4) was reliable, 5) was trustworthy, 6) was objective, and 7) was written in an unbiased way. These 7 responses were averaged to construct an aggregate Likert rating (Cronbach’s α = 0.96). Our main text analyses focus on the layperson aggregate Likert ratings as they are much more fine-grained; see SI Section 4 for analyses using the layperson categorical classification ratings, which are qualitatively similar but (as expected) somewhat weaker. Respondents completed the task independently without communication or discussion with each other.

The layperson versus fact-checker surveys differed in how respondents were asked to evaluate the articles. Fact-checkers were presented with the URL of each article and asked to read the full article and conduct research to evaluate what they assessed to be the article’s central claim. In addition to the ratings described above, the fact-checkers were also asked to provide any relevant evidence from their research that justified their assessment. Laypeople, on the other hand, were only shown the headline and lede sentence of each article, not the full article – and they were not asked to do any research or provide any evidence/sources for their assessments, but rather to rely on their own judgment. (Given that mean time per article for laypeople was 35.7s, median 23.5s, it is extremely unlikely that many people took it upon themselves to nonetheless research the headlines.) Furthermore, to test whether knowledge of the article’s source influenced layperson assessments (22), laypeople were randomly assigned to either a No-Source condition (just shown headline and lede) or a Source condition (also shown the source domain of the article, e.g. “breitbart.com”).

The layperson versus fact-checker surveys also differed in the number of articles respondents were asked to rate. Each fact-checker rated all 207 articles, while each layperson rated 20 randomly selected articles. On average, each article was rated by 100 laypeople (min. 79; max. 137). After rating the 20 articles, the layperson participants completed the cognitive reflection test (62), a political knowledge test, and a series of demographic questions.
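As a concrete illustration of how the aggregate Likert rating and its internal consistency can be computed from the seven items, the following is a minimal sketch using a hypothetical response matrix; it is not the authors’ code, and the variable names are ours.

```python
# Minimal sketch: averaging the seven 7-point Likert items into an aggregate rating
# and computing Cronbach's alpha. The response matrix below is hypothetical.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_ratings, n_items) matrix of Likert responses."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each individual item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
base = rng.integers(1, 8, size=(500, 1))        # shared "latent" rating per response
noise = rng.integers(-1, 2, size=(500, 7))      # small item-level noise
likert = np.clip(base + noise, 1, 7)            # 500 simulated ratings x 7 items
aggregate_rating = likert.mean(axis=1)          # aggregate Likert rating per response
print(round(cronbach_alpha(likert), 2))         # alpha is high when items are strongly correlated
```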

Analysis. A main question of interest for our study was how the average layperson ratings varied based on the number of laypeople included. To assess this question, as well as to achieve politically balanced layperson ratings, we used the following bootstrapping procedure. First, we classified each participant as “Democrat” versus “Republican” based on their response to a question in the demographics about which political party they preferred (6-point scale from “Strong Democrat” to “Strong Republican”; values of 1 to 3 were classified as Democrat, while values of 4 to 6 were classified as Republican, such that no participants were excluded for being Independents). Then, for each value of k layperson ratings per article (from k = 2 to k = 26), we performed 1,000 repetitions of the following procedure. For each article, we randomly sampled (with replacement) k/2 Democrats and k/2 Republicans. This gave us 1,000 different crowds of size k for each of the 207 articles. For each crowd, we averaged the responses to create a politically balanced layperson rating for each article. We then computed (i) the correlation across articles between this politically-balanced layperson average rating and the average of the fact-checkers’ aggregate Likert ratings; and (ii) an AUC produced by using the politically-balanced layperson average ratings for each article to predict whether or not the modal fact-checker categorical rating for that article was “True” (binary variable: 0 = modal response was “False”, “Misleading”, or “Couldn’t be determined”, 1 = “True”). We then report the average value of each of these two outcomes across repetitions, as well as the 95% confidence interval.

In addition, for a crowd of size k = 26, we calculated the out-of-sample accuracy of a model that used the politically-balanced layperson ratings to predict the fact-checkers’ modal binary “Is True” rating (described above). For each of 1,000 different crowds, we performed 20 trials in which we split the data 80/20 into training and test sets, respectively, and calculated the accuracy on the test set of the optimal threshold found in the training set. We then averaged across the 1,000 crowds and 20 trials per crowd to calculate the out-of-sample accuracy. A more detailed description of this methodology can be found in SI Sections 2 and 3.
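To make the out-of-sample accuracy step concrete, here is a minimal sketch of the 80/20 split-and-threshold procedure for a single crowd’s mean ratings, using simulated inputs; the variable names (crowd_mean, is_true) are ours and hypothetical, and this is an illustration rather than the authors’ actual analysis code.

```python
# Illustrative sketch of the out-of-sample accuracy step: repeatedly split articles 80/20,
# pick the cutoff that maximizes training accuracy, and score that cutoff on the held-out
# test articles. Inputs are simulated, not the study data.
import numpy as np

def out_of_sample_accuracy(crowd_mean, is_true, n_trials=20, seed=0):
    rng = np.random.default_rng(seed)
    n = len(crowd_mean)
    cutoffs = np.linspace(1, 7, 61)              # candidate thresholds on the 1-7 aggregate scale
    accs = []
    for _ in range(n_trials):
        idx = rng.permutation(n)
        train, test = idx[: int(0.8 * n)], idx[int(0.8 * n):]
        train_acc = [((crowd_mean[train] > c).astype(int) == is_true[train]).mean()
                     for c in cutoffs]
        best_cutoff = cutoffs[int(np.argmax(train_acc))]
        pred_test = (crowd_mean[test] > best_cutoff).astype(int)
        accs.append((pred_test == is_true[test]).mean())
    return float(np.mean(accs))

# Toy usage: 207 simulated articles whose crowd means run higher when the modal rating is "True"
rng = np.random.default_rng(1)
is_true = rng.integers(0, 2, 207)
crowd_mean = np.clip(3.5 + 1.5 * is_true + rng.normal(0, 1, 207), 1, 7)
print(round(out_of_sample_accuracy(crowd_mean, is_true), 3))
```

In the paper’s analysis, this step is additionally repeated over 1,000 bootstrapped, politically-balanced crowds of size 26, and the resulting accuracies are averaged.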

ACKNOWLEDGEMENTS

The authors gratefully acknowledge Paul Resnick’s generosity in sharing the journalist fact-checker ratings that his team collected; comments and feedback from Kevin Aslet, Adam Berinsky, Nathan Persily, Paul Resnick, Zeve Sanderson, Josh Tucker, Yunhao Zhang, and the Community Review team at Facebook; and funding from the William and Flora Hewlett Foundation, the John Templeton Foundation, and the Reset project of Omidyar Group’s Luminate Project Limited.

Competing Interests: J.A. has a significant financial interest in Facebook. Other research by D.G.R. is funded by a gift from Google. The other authors declare no competing interests.

Author Contributions: D.G.R., G.P., and A.A. conceived of the research; A.A., D.G.R., and G.P. designed the study; A.A. conducted the study; J.A. analyzed the results; J.A. and D.G.R. wrote the paper, with input from G.P. and A.A.

Data Availability: All data and materials needed to recreate the analysis in this paper can be found on the OSF site: https://osf.io/hts3w/.

REFERENCES

1. D. M. J. Lazer, M. A. Baum, Y. Benkler, A. J. Berinsky, K. M. Greenhill, F. Menczer, M. J. Metzger, B. Nyhan, G. Pennycook, D. Rothschild, M. Schudson, S. A. Sloman, C. R. Sunstein, E. A. Thorson, D. J. Watts, J. L. Zittrain, The science of fake news. Science. 359, 1094–1096 (2018).

2. Facebook, Facebook’s Third-Party Fact-Checking Program. Facebook, (available at https://www.facebook.com/journalismproject/programs/third-party-fact-checking).

3. Twitter, Updating our approach to misleading information. Twitter, (available at https://blog.twitter.com/en_us/topics/product/2020/updating-our-approach-to-misleading-information.html).

4. S. Nieminen, L. Rapeli, Fighting Misperceptions and Doubting Journalists’ Objectivity: A Review of Fact-checking Literature. Political Studies Review. 17, 296–309 (2019).

5. T. Wood, E. Porter, The Elusive Backfire Effect: Mass Attitudes’ Steadfast Factual Adherence. Political Behavior. 41, 135–163 (2019).

6. B. Nyhan, J. Reifler, When Corrections Fail: The Persistence of Political Misperceptions. Political Behavior. 32, 303–330 (2010).

7. S. Lewandowsky, U. K. H. Ecker, C. M. Seifert, N. Schwarz, J. Cook, Misinformation and its correction: Continued influence and successful debiasing. Psychol. Sci. Public Interest. 13, 106–131 (2012).

8. B. Swire-Thompson, J. DeGutis, D. Lazer, Searching for the Backfire Effect: Measurement and Design Considerations. J. Appl. Res. Mem. Cogn. (2020), doi:10.1016/j.jarmac.2020.06.006.

9. G. Pennycook, T. D. Cannon, D. G. Rand, Prior exposure increases perceived accuracy of fake news. J. Exp. Psychol. Gen. 147, 1865 (2018).

10. G. Pennycook, A. Bear, E. T. Collins, D. G. Rand, The Implied Truth Effect: Attaching Warnings to a Subset of Fake News Headlines Increases Perceived Accuracy of Headlines Without Warnings. Manage. Sci. (2020), doi:10.1287/mnsc.2019.3478.

11. W. Yaqub, O. Kakhidze, M. L. Brockman, N. Memon, S. Patil, in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Association for Computing Machinery, New York, NY, USA, 2020), CHI ’20, pp. 1–14.

12. T. Lyons, Hard Questions: What’s Facebook's Strategy for Stopping False News? Facebook (2018), (available at https://about.fb.com/news/2018/05/hard-questions-false-news/).

13. D. Flamini, Most Republicans don’t trust fact-checkers, and most Americans don’t trust the media - Poynter. Poynter (2019), (available at https://www.poynter.org/ifcn/2019/most-republicans-dont-trust-fact-checkers-and-most-americans-dont-trust-the-media/).

14. F. Galton, Vox Populi. Nature. 75, 450–451 (1907).

15. J. Surowiecki, The Wisdom Of Crowds (Anchor Books, 2005).

16. L. Graves, Anatomy of a Fact Check: Objective Practice and the Contested Epistemology of Fact Checking. Commun Cult Crit. 10, 518–537 (2017).

17. A. Mitchell, Distinguishing between factual and opinion statements in the news (Pew Research Center, 2018).

18. M. W. Kattan, C. O’Rourke, C. Yu, K. Chagin, The Wisdom of Crowds of Doctors: Their Average Predictions Outperform Their Individual Ones. Med. Decis. Making. 36, 536–540 (2016).

19. Z. Da, X. Huang, Harnessing the Wisdom of Crowds. Manage. Sci. (2019), doi:10.1287/mnsc.2019.3294.

20. G. Pennycook, D. G. Rand, Fighting misinformation on social media using crowdsourced judgments of news source quality. Proc. Natl. Acad. Sci. U. S. A. 116, 2521–2526 (2019).

21. Z. Epstein, G. Pennycook, D. Rand, Will the Crowd Game the Algorithm? Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (2020), doi:10.1145/3313831.3376232.

22. N. Dias, G. Pennycook, D. G. Rand, Emphasizing publishers does not effectively reduce susceptibility to misinformation on social media. Harvard Kennedy School Misinformation Review. 1 (2020) (available at https://misinforeview.hks.harvard.edu/article/emphasizing-publishers-does-not-reduce-misinformation/).

23. J. Kim, B. Tabibian, A. Oh, B. Schölkopf, M. Gomez-Rodriguez, Leveraging the Crowd to Detect and Reduce the Spread of Fake News and Misinformation. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining - WSDM ’18 (2018), doi:10.1145/3159652.3159734.

24. M. Gabielkov, A. Ramachandran, A. Chaintreau, A. Legout, in Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science (Association for Computing Machinery, New York, NY, USA, 2016), SIGMETRICS ’16, pp. 179–192.

25. G. Pennycook, D. G. Rand, Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning. Cognition. 188, 39–50 (2019).

26. G. Pennycook, D. G. Rand, Who falls for fake news? The roles of bullshit receptivity, overclaiming, familiarity, and analytic thinking. J. Pers. 88, 185–200 (2020).

27. G. Pennycook, D. G. Rand, The Psychology of Fake News. Trends Cogn. Sci. 25, 388–402 (2021).

28. M. M. Bhuiyan, A. X. Zhang, C. M. Sehat, T. Mitra, Investigating Differences in Crowdsourced News Credibility Assessment. Proceedings of the ACM on Human-Computer Interaction. 4 (2020), pp. 1–26.

29. M. A. Amazeen, Checking the Fact-Checkers in 2008: Predicting Political Ad Scrutiny and Assessing Consistency. Journal of Political Marketing. 15, 433–464 (2016).

30. C. Lim, Checking how fact-checkers check. Research & Politics. 5, 2053168018786848 (2018).

31. D. G. Goldstein, R. P. McAfee, S. Suri, in Proceedings of the fifteenth ACM conference on Economics and computation (Association for Computing Machinery, New York, NY, USA, 2014), EC ’14, pp. 471–488.

32. R. M. Hogarth, A note on aggregating opinions. Organ. Behav. Hum. Perform. 21, 40–46 (1978).

33. A. E. Mannes, J. B. Soll, R. P. Larrick, The wisdom of select crowds. J. Pers. Soc. Psychol. 107, 276–299 (2014).

34. J. Huang, C. X. Ling, Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17, 299–310 (2005).

35. L. K. Fazio, N. M. Brashier, B. K. Payne, E. J. Marsh, Knowledge does not protect against illusory truth. J. Exp. Psychol. Gen. 144, 993–1002 (2015).

36. U. K. H. Ecker, S. Lewandowsky, M. Chadwick, Can Corrections Spread Misinformation to New Audiences? Testing for the Elusive Familiarity Backfire Effect. Cognitive Research: Principles and Implications. 5, 1–25 (2020).

37. G. Pennycook, Z. Epstein, M. Mosleh, A. A. Arechar, D. Eckles, D. G. Rand, Shifting attention to accuracy can reduce misinformation online. Nature (2021), doi:10.1038/s41586-021-03344-2.

38. G. Pennycook, J. McPhetres, Y. Zhang, J. G. Lu, D. G. Rand, Fighting COVID-19 Misinformation on Social Media: Experimental Evidence for a Scalable Accuracy-Nudge Intervention. Psychol. Sci. 31, 770–780 (2020).

39. N. J. Conroy, V. L. Rubin, Y. Chen, Automatic deception detection: Methods for finding fake news. Proc. Assoc. Info. Sci. Tech. 52, 1–4 (2015).

40. V. Pérez-Rosas, B. Kleinberg, A. Lefevre, R. Mihalcea, in Proceedings of the 27th International Conference on Computational Linguistics (2018), pp. 3391–3401.

41. V. L. Rubin, Y. Chen, N. J. Conroy, Deception detection for news: Three types of fakes. Proc. Assoc. Info. Sci. Tech. 52, 1–4 (2015).

42. K. Shu, A. Sliva, S. Wang, J. Tang, H. Liu, Fake News Detection on Social Media: A Data Mining Perspective. ACM SIGKDD explorations newsletter. 19, 22–36 (2017).

43. N. Ruchansky, S. Seo, Y. Liu, in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (Association for Computing Machinery, New York, NY, USA, 2017), CIKM ’17, pp. 797–806.

44. J. E. Uscinski, R. W. Butler, The Epistemology of Fact Checking. Crit. Rev. 25, 162–180 (2013).

45. S. A. Sloman, N. Rabb, Thought as a determinant of political opinion. Cognition. 188, 1–7 (2019).

46. G. Charness, C. Dave, Confirmation bias with motivated beliefs. Games Econ. Behav. 104, 1–23 (2017).

47. D. M. Kahan, E. Peters, M. Wittlin, P. Slovic, L. L. Ouellette, D. Braman, G. Mandel, The polarizing impact of science literacy and numeracy on perceived climate change risks. Nat. Clim. Chang. 2, 732–735 (2012).

48. D. M. Kahan, Ideology, Motivated Reasoning, and Cognitive Reflection: An Experimental Study. Judgment and Decision making. 8, 407–424 (2012).

49. B. Bago, D. G. Rand, G. Pennycook, Fake news, fast and slow: Deliberation reduces belief in false (but not true) news headlines. J. Exp. Psychol. Gen. 149, 1608–1613 (2020).

50. M. V. Bronstein, G. Pennycook, A. Bear, D. G. Rand, T. D. Cannon, Belief in Fake News is Associated with Delusionality, Dogmatism, Religious Fundamentalism, and Reduced Analytic Thinking. J. Appl. Res. Mem. Cogn. 8, 108–117 (2019).

51. C. Pornpitakpan, The Persuasiveness of Source Credibility: A Critical Review of Five Decades’ Evidence. J. Appl. Soc. Psychol. 34, 243–281 (2004).

52. M. Jakesch, M. Koren, A. Evtushenko, M. Naaman, The Role of Source, Headline and Expressive Responding in Political News Evaluation (2018), doi:10.2139/ssrn.3306403.

53. J. Becker, E. Porter, D. Centola, The wisdom of partisan crowds. Proc. Natl. Acad. Sci. U. S. A. 116, 10717–10722 (2019).

54. D. Guilbeault, J. Becker, D. Centola, Social learning and partisan bias in the interpretation of climate trends. Proc. Natl. Acad. Sci. U. S. A. 115, 9714–9719 (2018).

55. F. Shi, M. Teplitskiy, E. Duede, J. A. Evans, The wisdom of polarized crowds. Nat Hum Behav. 3, 329–336 (2019).

56. S. Keck, W. Tang, Enhancing the Wisdom of the Crowd With Cognitive-Process Diversity: The Benefits of Aggregating Intuitive and Analytical Judgments. Psychol. Sci. 31, 1272–1282 (2020).

57. D. Prelec, A Bayesian truth serum for subjective data. Science. 306, 462–466 (2004).

58. P. E. Converse, Assessing the Capacity of Mass Electorates. Annu. Rev. Polit. Sci. 3, 331–353 (2000).

59. S. Fischer, Exclusive: Facebook adding part-time fact-checking contractors. Axios (2019), (available at https://www.axios.com/facebook-fact-checking-contractors-e1eaeb8b-54cd-4519-8671-d81121ef1740.html).

60. Twitter, Introducing Birdwatch, a community-based approach to misinformation, (available at https://blog.twitter.com/en_us/topics/product/2021/introducing-birdwatch-a-community-based-approach-to-misinformation.html).

61. P. Resnick, A. Alfayez, J. Im, E. Gilbert, Informed Crowds can Effectively Identify Misinformation (2021).

62. S. Frederick, Cognitive Reflection and Decision Making. J. Econ. Perspect. 19, 25–42 (2005).

Supporting Information for:

Scaling Up Fact-Checking Using the Wisdom of Crowds

Jennifer Allen, Antonio A. Arechar, Gordon Pennycook, David G. Rand*

*[email protected]

1. List of URLs With Fact-Checker Ratings
2. Methods: Bootstrapping Many Crowds
3. Methods: Out-of-Sample Accuracy
4. Robustness: Predicting Fact-Checker Binary Ratings with Laypeople Binary Ratings
5. Robustness: Predicting Fact-Checker “Is False” Ratings
6. Robustness: AUC, Political vs. Non-Political Headlines
7. Robustness: Political Knowledge and Cognitive Reflection by Party
8. Robustness: Performance as a Function of Crowd Size: Political Party, Cognitive Reflection, and Political Knowledge
9. Robustness: Replicating Correlational Analysis with Alternate Set of Fact-Checkers
10. Robustness: Replicating Correlational Analysis with Larger Crowds
11. Robustness: AUC Analysis, Excluding “Couldn’t Be Determined” Ratings
12. Robustness: Correlational Analysis with a Single Fact-Checker
13. Robustness: Individual Differences for Political vs. Non-Political Headlines
14. ROC Curves for Predicting Fact-Checker Binary Ratings with Laypeople Continuous Ratings
15. Qualitative Examination of Disagreement Among Fact-Checkers
16. Generalization to a Researcher-Selected Dataset

1. List of URLs With Fact-checker ratings Modal Binary Average Fact- Fact-Checker Checker Rating Accuracy (1 = True, 0 = Rating URL Not True) (1 - 7 Scale) http://alexschadenberg.blogspot.com/2018/10/sick-kids-hospital-toronto-will.html 0 3.67 http://coolcatapproves.com/funny/australia-doesnt-exist-and-people-who-live-there-are- actors-paid-by-nasa-flat-earthers-claim/ 0 2.14 http://feedytv.com/florida-declares-state-emergency-toxic-red-tide-outbreak.html 1 5.52 http://fortune.com/2016/09/01/medical-marijuana-gun/ 1 7.00 http://fox4kc.com/2018/09/28/mom-sues-after-son-doesnt-make-varsity-soccer-team/ 1 6.95 http://healthimpactnews.com/2011/dr-russell-blaylock-warns-dont-get-the-flu-shot-it- promotes-alzheimers/ 0 1.05 http://nypost.com/2017/09/28/department-of-justice-demands-facebook-account- information-of-anti-trump-activists/ 1 4.95 http://nypost.com/2018/09/27/two-men-tell-senate-that-they-not-kavanaugh-assaulted- ford/ 1 6.43 http://nypost.com/2018/10/09/de-blasio-signs-bill-allowing-third-gender-on-birth- certificates/ 1 7.00 http://pittsburgh.cbslocal.com/2018/10/27/heavy-police-presence-near-synagogue-in- squirrel-hill/ 1 7.00 http://rare.us/rare-news/around-the-world/kids-cant-tell-time/ 0 4.90 http://rare.us/rare-politics/rare-liberty/police-state/a-court-has-ruled-that-police-can- execute-your-dog-if-it-moves-or-barks/ 0 3.62 http://time.com/5390884/nike-sales-go-up-kaepernick-ad/ 0 5.33 http://tiphero.com/dont-dress-up-your-chickens/ 0 3.67 http://www.55meals.com/did-you-know-your-energy-drinks-contain-bull-urine-semen/ 0 1.57 http://www.breakingnews247.net/5b872f00c5639/massive-alligator-found-in-browns- mills-new-jersey.html 0 1.76 http://www.collective-evolution.com/2017/02/13/how-sunscreen-could-be-causing-skin- cancer-not-the-sun/ 0 3.71 http://www.higherperspectives.com/one-glass-red-wine-1577145867.html 0 1.95 http://www.icarizona.com/2018/09/arizona-may-end-up-with-former.html 0 3.57 http://www.ladbible.com/community/interesting-a-giant-500ft-asteroid-is-heading-for- earth-at-20000mph-20180826 0 4.67 http://www.ladbible.com/entertainment/music-fans-flip-the-finger-at-machine-gun-kelly- and-get-in-his-face-20180920 1 6.29 http://www.ladbible.com/entertainment/uk-boyfriend-runs-away-after-his-girlfriend- catches-a-brides-bouquet-20180904 1 6.00 http://www.latimes.com/local/lanow/la-me-ln-thousand-oaks-20181107-story.html 1 6.90 http://www.msnbc.com/rachel-maddow-show/trump-crafts-new-excuse-massive-budget- deficit-forest-fires 1 4.57 http://www.pretty52.com/entertaining/tv-and-film-the-one-tree-hill-cast-are-reuniting-for- a-one-hour-special-20180927 1 6.19 http://www.social-consciousness.com/2017/03/donald-trump-says-pedophiles-will-get- death-penalty.html 0 2.43 http://www.theguardian.com/us-news/2018/nov/10/trump-baltics-balkans-mixup-le- monde-belleau-cemetery-paris 1 5.19 http://www.tmz.com/2018/09/18/sesame-street-bert-ernie-gay-couple-confirmed-writer- speculation-over/ 1 5.76 http://www.tmz.com/2018/11/01/president-trump-immigrant-caravan-throwing-rocks- same-firearms/ 0 2.29 http://www.wdbj7.com/content/news/Farmers-Almanac-predicts-a--491821281.html 1 5.29 http://www.wect.com/2018/10/17/breaking-state-trooper-shot-killed-suspect-custody/ 1 6.90 http://www.worldstarhiphop.com/videos/video.php?v=wshhddDiUTw9SDG7wvd7 0 1.57 https://abc13.com/4038817/ 1 6.90 https://abc13.com/4554120/ 1 6.95 https://abcnews.go.com/Politics/blame-abc-news-finds-17-cases-invoking- trump/story?id=58912889 1 3.67 
https://abcnews.go.com/Politics/desantis-floridians-monkey-electing-african-american- democrat-governor/story?id=57476957 0 6.43 https://abcnews.go.com/Politics/trump-kicks-off-week-tweet-calling-media- true/story?id=58827743 1 6.43 https://abcnews.go.com/Politics/trump-plans-end-birthright-citizenship-babies-born- citizens/story?id=58845684 1 6.76 https://abcnews.go.com/US/daycare-owner-drugged-kids-tied-car-seats- police/story?id=57932436 1 6.81 https://abcnews.go.com/US/man-accused-groping-woman-flight-trump-grab- women/story?id=58681265 1 6.95 https://americanmilitarynews.com/2018/08/china-hacked-hillary-clintons-email-server- and-took-nearly-all-her-emails-report-says/ 0 3.62 https://americanmilitarynews.com/2018/10/armys-nco-promotion-criteria-fails-to-gauge- leadership-qualities-study-says/ 1 7.00 https://americanmilitarynews.com/2018/10/disabled-and-retired-vets-to-see-largest-cost- of-living-raise-in-six-years/ 1 6.67 https://americanmilitarynews.com/2018/10/guatemala-captured-100-isis-terrorists- president-reveals-ahead-of-migrant-caravan-arrival/ 0 4.48 https://babylonbee.com/news/bill-clinton-allegations-of-sexual-misconduct-should- disqualify-a-man-from-public-office 0 1.86 https://babylonbee.com/news/joel-osteen-launches-line-pastoral-wear-sheeps-clothing/ 0 1.86 https://babylonbee.com/news/socialist-leaders-clarify-we-only-want-socialism-for- everyone-else/ 0 1.52 https://conservativedailypost.com/savage-claims-ford-deeply-tied-to-deep-state/ 0 3.00 https://conventionofstates.com/news/ginsburg-can-t-remember-14th-amendment-gets- pocket-constitution-from-the-audience 0 3.81 https://dailycaller.com/2018/10/15/elizabeth-warren-less-native-american-dna/ 0 4.10 https://dailycaller.com/2018/10/30/bus-mexico-migrant-caravan-border/ 0 3.95 https://educateinspirechange.org/health/experienced-butcher-admits-see-cancer-pork-just- cut-still-sell-customers/ 1 5.43 https://endoftheageheadlines.wordpress.com/2018/10/24/deep-state-sending-explosive- packages-to-themselves-in-hopes-of-stopping-red-wave/ 0 2.24 https://kutv.com/news/local/woman-who-alleges-mtc-president-raped-her-filmed- testifying-about-rape-in-church 1 6.67 https://lawandcrime.com/civil-rights/gop-removes-sole-polling-place-from-famous- hispanic-majority-city-in-kansas/ 1 5.05 https://legalinsurrection.com/2018/09/maxine-waters-suggests-knocking-off-trump-then- going-after-pence/ 0 4.14 https://meaww.com/queen-invites-meghan-markle-mother-doria-christmas-sandringham 0 4.67 https://neonnettle.com/news/4335-obama-at-bilderberg-the-us-must-surrender-to-the- new-world-order- 0 1.00 https://news.jamaicans.com/kanye-west-bob-marleys-spirit-flows/ 1 6.29 https://news.unclesamsmisguidedchildren.com/obama-says-benghazi-is-a-wild- conspiracy-theory/ 0 2.90 https://news.vice.com/en_us/article/3kek75/another-kavanaugh-accuser-is-taking-to- -authorities 1 6.52 https://news.vice.com/en_us/article/nepwng/some-texans-had-to-wait-so-long-to-vote- they-gave-up-a-lawsuit-is-trying-to-give-them-a-second-chance 1 6.76 https://patriotjournal.org/democrat-fbi-closet/ 0 2.67 https://patriotjournal.org/sanctuary-state-vote-trump/ 0 2.52 https://patriotjournal.org/video-train-south-border/ 0 3.86 https://prepforthat.com/christine-blasey-fords-yearbook-seems-to-show-high-school- racism/ 0 3.24 https://realfarmacy.com/surgeon-mammogram/ 0 4.19 https://rewire.news/ablc/2018/10/11/supreme-court-native-americans-november/ 1 5.38 https://rwnofficial.com/dems-huge-secret-hillary-caught-in-special-relationship-with- stormy-daniels/ 0 2.43 
https://shareblue.com/russia-president-vladimir-putin-annual-address-americas-downfall/ 0 2.67 https://socialsecurityworks.org/2018/04/12/politicians-steal-social-security/ 0 2.24 https://thefederalistpapers.org/opinion/california-bans-feeding-homeless-without-police- supervision 1 4.33 https://thefederalistpapers.org/us/liberal-professor-says-hurricane-victims-deserved-got- supporting-republicans 0 3.86 https://thefreethoughtproject.com/watch-cop-chokes-innocent-man-calling-water- company-corruption/ 0 2.57 https://tribunist.com/news/band-performs-skit-about-shooting-police-during-halftime-of- football-game/ 0 3.10 https://twitter.com/realDonaldTrump/status/1040217897703026689 0 3.67 https://washingtonpress.com/2018/10/28/pittsburgh-jewish-leaders-just-banned-trump- from-their-city-until-he-meets-their-demands/ 0 3.05 https://wokesloth.com/fox-news-poll/stefanie/ 0 1.71 https://worldnewsdailyreport.com/canadians-face-major-donut-shortage-after-first-day-of- cannabis-legalization/ 0 1.76 https://worldnewsdailyreport.com/cops-beat-up-teen-after-bank-teller-mistakes-his- erection-for-a-pistol/ 0 1.86 https://worldnewsdailyreport.com/man-accused-of-raping-a-cow-claims-it-is-the- reincarnation-of-his-dead-wife/ 0 1.57 https://worldnewsdailyreport.com/morgue-worker-arrested-after-giving-birth-to-a-dead- mans-baby/ 0 1.86 https://worldnewsdailyreport.com/pregnant-teen-seeks-13-paternity-tests-after-gangbang- with-football-team/ 0 1.86 https://worldnewsdailyreport.com/teen-on-female-viagra-crashes-into-building-while- masturbating-to-gear-shift/ 0 1.76 https://worldnewsdailyreport.com/teenager-sues-his-parents-for-250000-for-naming-him- gaylord/ 0 2.05 https://worldnewsdailyreport.com/woman-arrested-for-training-squirrels-to-attack-her-ex- boyfriend/ 0 1.95 https://worldnewsdailyreport.com/woman-claims-she-is-the-daughter-of-marilyn-monroe- and-jfk/ 0 1.71 https://www.12up.com/posts/6209256-report-steelers-hoping-eagles-renew-their-interest- in-potential-trade-for-le-veon-bell 1 6.33 https://www.aol.com/article/news/2015/12/09/study-links-e-cigarettes-to-incurable- disease-called-popcorn-lu/21281029/ 1 6.81 https://www.axios.com/rod-rosenstein-resign-justice-department-trump-cf761f4c-fca3- 4794-92d4-a56c9e32ff43.html 1 5.38 https://www.bet.com/celebrities/news/2016/12/07/kim-kardashian-kanye-west- marriage.html?cid=facebook 0 2.76 https://www.billboard.com/articles/columns/chart-beat/8094819/cardi-b-beyonce-five- hits-top-10-rb-hip-hop-songs-chart 1 6.00 https://www.breitbart.com/big-journalism/2018/08/18/cnn-accused-intimidating-paul- manafort-jury/ 0 3.14 https://www.breitbart.com/border/2018/10/30/armed-migrants-in-caravan-opened-fire-on- mexican-cops-say-authorities/ 0 4.19 https://www.breitbart.com/entertainment/2018/10/22/bette-midler-world-under-siege- from-murderers-like-trump/ 1 4.81 https://www.breitbart.com/midterm-election/2018/10/18/nancy-pelosi-collateral-damage/ 0 2.48 https://www.breitbart.com/politics/2018/10/14/cnn-poll-biden-leads-field-2020- democratic-hopefuls/ 1 5.10 https://www.breitbart.com/politics/2018/11/01/orourke-campaign-exposed-in-undercover- video-for-assisting-honduran-migrants-nobody-needs-to-know/ 0 4.00 https://www.breitbart.com/the-media/2018/10/26/nolte-nbc-news-hid-info-wouldve- cleared-kavanaugh-avenatti-rape-allegations/ 1 4.10 https://www.breitbart.com/video/2018/10/18/biden-there-is-no-evidence-of-widespread- voter-fraud-in-the-american-electoral-process/ 1 7.00 https://www.breitbart.com/video/2018/10/22/obama-unlike-some-i-state-facts-i-dont- 
believe-in-just-making-stuff-up/ 1 6.81 https://www.buzzfeed.com/aliciabarron/this-reeses-machine-will-swap-out-your- halloween-candy-for 1 6.81 https://www.cbsnews.com/video/forecast-tropical-storm-erika-could-hit-florida-as- hurricane/ 1 6.33 https://www.chicksonright.com/youngconservatives/2018/10/16/democrats-eat-crow-fbi- raids-city-offices-headed-by-san-juan-mayor-trump-was-right-all-along/ 0 4.43 https://www.chicksonright.com/youngconservatives/2018/10/24/time-to-be-polite-is-over- heated-fox-co-hosts-confront-juan-williams-on-air-if-youre-going-to-sit-there-and-call- your-colleague-a-liar/ 0 3.52 https://www.christianpost.com/news/china-trying-to-rewrite-the-bible-force-churches- sing-communist-anthems-227664/ 1 6.43 https://www.christianpost.com/news/white-house-hosts-100-evangelical-leaders-state- like-dinner-this-is-spiritual-warfare-227044/ 1 6.43 https://www.cnbc.com/2018/10/19/saudi-arabia-admits-journalist-jamal-khashoggi-was- killed-after-a-fight-broke-out-in-consulate.html 1 6.86 https://www.cnn.com/2018/10/16/us/first-female-commander-us-army-trnd/index.html 1 6.86 https://www.cnn.com/2018/10/24/us/florida-middle-girls-allegedly-wanted-to-kill- classmates/index.html 1 6.81 https://www.concealedcarry.com/news/armed-citizens-are-successful-95-of-the-time-at- active-shooter-events-fbi/ 0 3.57 https://www.dailymail.co.uk/health/article-6176151/No-evidence-having-high-levels-bad- cholesterol-causes-heart-disease.html 0 4.62 https://www.dailymail.co.uk/news/article-6285989/Canada-kicks-muted-pot-party-1st- G7-nation-OK-recreational-cannabis.html 1 6.57 https://www.dailymail.co.uk/news/article-6373429/Caitlyn-Jenners-Malibu-house-burns- California-wildfires-rage-control.html 1 6.52 https://www.dailywire.com/news/34778/medical-website-indulges-trans-community-new- term-hank-berrien 0 3.90 https://www.dailywire.com/news/34945/blowin-wind-hospital-security-guard-fired-hank- berrien 1 6.76 https://www.dailywire.com/news/35040/controversial-reporter-jemele-hill-out-espn-part- emily-zanotti 1 5.86 https://www.dailywire.com/news/35097/la-and-ny-overrun-topless-women-seeking- equality-amanda-prestigiacomo 1 5.10 https://www.dailywire.com/news/35631/linda-sarsour-calls-dehumanization-jews-report- ryan-saavedra 0 3.67 https://www.dailywire.com/news/36198/watch-beto-orourke-pins-himself-corner-over- drunk-ryan-saavedra 1 6.48 https://www.dailywire.com/news/37070/anti-kavanaugh-protester-uses-her-two-kids-she- hank-berrien 1 4.62 https://www.dailywire.com/news/37398/migrant-caravan-swells-5000-resumes-advance- us-joseph-curl 1 5.19 https://www.dailywire.com/news/37552/migrant-caravan-marching-us-borders-swells- 14000-joseph-curl 0 4.86 https://www.dailywire.com/news/37685/epa-greenhouse-gas-emissions-dropped-nearly-3- joseph-curl 0 4.86 https://www.dailywire.com/news/37900/nobody-needs-know-orourke-campaign-appears- ryan-saavedra 0 4.33 https://www.dailywire.com/news/38153/breaking-voter-fraud-allegedly-found-deep-blue- ryan-saavedra 0 4.19 https://www.dailywire.com/news/38230/broward-county-elections-supervisor-be-forced- james-barrett 1 5.00 https://www.forbes.com/sites/rachellebergstein/2016/11/10/new-balance-gets-political- and-throws-support-behind-trump/ 1 6.52 https://www.foxnews.com/politics/mob-chants-threats-outside-tucker-carlsons-dc-home 1 6.52 https://www.hannity.com/media-room/free-money-cory-booker-unveils-plan-to-give- poor-americans-50000/ 0 4.43 https://www.hatetriots.com/2018/10/nancy-pelosi-shouted-out-of-restaurant.html 0 3.33 
https://www.healthnutnews.com/60-lab-studies-now-confirm-cancer-link-to-a-vaccine- you-probably-had-as-a-child/ 0 2.00 https://www.healthy-holistic-living.com/instant-noodles-inflammation-dementia.html 0 2.52 https://www.iflscience.com/editors-blog/pieces-of-a-ufo-fell-from-the-sky-and-landed-in- remote-cambodian-village/ 1 5.81 https://www.iflscience.com/health-and-medicine/a-cancer-kill-switch-has-been-found-in- the-body-and-researchers-are-already-hard-at-work-to-harness-it/ 1 7.00 https://www.iflscience.com/plants-and-animals/drone-footage-reveals-over-100-whales- trapped-in-secret-underwater-jails/ 1 6.90 https://www.investors.com/politics/editorials/u-s-has-3-5-million-more-registered-voters- than-live-adults-a-red-flag-for-electoral-fraud/ 0 4.10 https://www.joyscribe.com/all-of-the-harry-potter-films-are-officially-coming-to-netflix/ 0 5.19 https://www.jta.org/2018/10/27/top-headlines/trump-blames-deaths-pittsburgh- synagogue-lack-armed-guards 1 6.81 https://www.judicialwatch.org/press-room/press-releases/judicial-watch-uncovers-more- classified-material-on-hillary-clintons-unsecure-email-system/ 1 5.67 https://www.lapd.com/blog/lapd-officer-shot-point-blank-range-and-not-peep-aclu-blm- or-assemblymember-weber 1 3.90 https://www.lifenews.com/2018/08/20/oprah-winfrey-promotes-shout-your-abortion- movement-where-women-brag-about-their-abortions/ 0 3.19 https://www.lifenews.com/2018/08/27/125-women-take-abortion-pills-to-kill-their- babies-to-protest-pro-life-laws/ 0 2.48 https://www.lifenews.com/2018/08/28/united-methodist-church-proposes-new-position- statement-saying-we-support-abortion/ 0 3.86 https://www.lifenews.com/2018/09/14/chelsea-clinton-says-it-would-be-un-christian-to- protect-babies-from-abortion/ 0 3.43 https://www.lifenews.com/2018/09/17/joe-biden-calls-trump-supporters-the-dregs-of- society/ 0 3.57 https://www.lifezette.com/2018/10/kavanaugh-turned-down-scads-of-gofundme-dollars- blasey-ford-hits-paydirt/ 0 4.14 https://www.lifezette.com/2018/11/trump-challenges-dems-on-kavanaugh-accusers- admission-that-she-made-it-all-up-where-are-you-on-this/ 0 4.10 https://www.lifezette.com/2018/11/winner-ocasio-cortez-insists-of--for-all-you- just-pay-for-it/ 1 4.05 https://www.lovebscott.com/steve-harveys-talk-show-canceled-nbc-will-replaced-kelly- clarkson-show 1 6.90 https://www.metrotimes.com/table-and-bar/archives/2018/10/18/detroit-judge-tosses- gardening-while-black-case-brought-by-three-white-women 1 6.43 https://www.movieguide.org/news-articles/netflix-animated-series-dedicates-an-entire- episode-to-promote-planned-parenthood-and-killing-babies.html 0 3.38 https://www.msn.com/en-us/news/us/jewish-leaders-tell-trump-hes-not-welcome-in- pittsburgh-until-he-denounces-white-nationalism/ar-BBP2lEL 1 6.05 https://www.myleaderpaper.com/news/accidents/desloge-man-dies-in-two-vehicle- accident-west-of-festus/article_f2b6b228-dbdd-11e8-8b87-abb2d7153695.html 1 7.00 https://www.nbcnews.com/news/us-news/meijer-pharmacist-denies-michigan-woman- miscarriage-medication-citing-religious-beliefs-n921711 1 7.00 https://www.nbcnews.com/news/us-news/retired-firefighter-who-fired-shotgun-black- teen-asking-directions-gets-n935611 1 6.90 https://www.nbcnews.com/politics/donald-trump/least-impressive-sex-i-ever-had-stormy- daniels-tells-all-n910566 1 6.10 https://www.nbcnews.com/pop-culture/celebrity/burt-reynolds-charismatic-star-1970s- blockbusters-dies-82-n907206 1 7.00 https://www.nbcnews.com/tech/social-media/far-right-group-takes-victory-lap-social- media-after-violence-n920506 1 5.05 
https://www.nbcnews.com/video/cincinnati-police-tased-11-year-old-girl-accused-of- shoplifting-1313262659520?v=raila& 1 6.95 https://www.nbcnews.com/video/sessions-says-he-will-recuse-himself-from-any-clinton- investigations-851626564002?v=railb 1 7.00 https://www.newsandguts.com/white-house-leaves-pence-condemn-attacks-clinton- obama-cnn/ 0 2.38 https://www.newsandguts.com/white-house-uses-doctored-video-support-claim-reporter/ 1 5.86 https://www.newsweek.com/donald-trump-jr-says-joe-biden-went-too-far-calling-trump- voters-dregs-1123294 1 6.05 https://www.newsweek.com/donald-trump-speak-anti-lgbt-hate-groups-annual-event- first-president-683927 1 3.52 https://www.newsweek.com/trump-ceo-pay-wages-tax-cuts-1076795 1 5.86 https://www.newsweek.com/witches-hex-brett-kavanaugh-amy-kremer-scary- conservatives-ritual-trump-1168948 0 2.81 https://www.newyorker.com/news/news-desk/the-five-year-old-who-was-detained-at-the- border-and-convinced-to-sign-away-her-rights 1 5.90 https://www.nsfnews.com/5b8ea8e312074/jackson-man-arrested-for-hacking-a-college- computer-and-returning-all-funds-to-students-since-2010.html 0 2.05 https://www.nytimes.com/2018/05/22/us/politics/georgia-primary-abrams-results.html 1 6.24 https://www.pbs.org/newshour/politics/trump-says-it-will-be-hard-to-unify-country- without-a-major-event 1 6.95 https://www.pbs.org/newshour/politics/watch-kanye-west-in-white-house-visit-says- maga-hat-gives-him-power 1 5.62 https://www.plantbasednews.org/post/no-one-should-be-doing-keto-diet-leading- cardiologist 0 4.62 https://www.politico.com/states/florida/story/2017/12/15/experts-browards-elections- chief-broke-law-in-destroying-ballots-150258 1 6.19 https://www.politicususa.com/2018/10/26/texas-voting-machines-beto-orourke-ted- cruz.html 0 3.29 https://www.rawstory.com/2018/11/woman-pretends-persecuted-trump-supporter-scams- conservatives-thousands-dollars/ 0 3.71 https://www.rd.com/food/fun/coffee-black-psychopath-study/ 1 6.24 https://www.rollingstone.com/politics/politics-news/brian-kemp-leaked-audio-georgia- voting-745711/ 1 6.00 https://www.scarymommy.com/parkland-shooting-victim-sculpture/ 1 6.05 https://www.thedailybeast.com/elizabeth-warren-releases-native-american-dna-test-results 1 5.62 https://www.thedailybeast.com/ivanka-trumps-gurus-say-their-techniques-can-end-war- and-make-you-fly 0 5.71 https://www.thedenverchannel.com/news/chris-watts-case/chris-watts-reaches-plea-deal- to-avoid-death-penalty-in-deaths-of-pregnant-wife-2-daughters 1 6.90 https://www.theepochtimes.com/joy-disbelief-as-korean-families-separated-by-war-meet- after-65-years_2628430.html 1 6.95 https://www.theepochtimes.com/threat-of-yellowstone-volcano-eruption-increases-to- high-says-usgs_2700008.html 1 6.38 https://www.theepochtimes.com/top-ranking-democrats-silent-on-keith-ellison-abuse- claims_2669521.html 0 4.33 https://www.thegatewaypundit.com/2018/09/former-scalia-law-clerk-drops-pictures-and- evidence-that-blows-christine-fords-case-wide-open/ 0 2.62 https://www.thehollywoodgossip.com/2018/07/meghan-markle-and-kate-middleton-due- to-give-birth-on-the-same-d/ 0 3.90 https://www.thenewamerican.com/usnews/crime/item/24549-nypd-source-weiner-laptop- has-enough-evidence-to-put-hillary-away-for-life 0 2.48 https://www.theroot.com/trump-rants-like-racist-grandpa-in-speech-to-maga-negro- 1830045693 0 2.71 https://www.thisisinsider.com/hot-cocoa-rolls-pillsbury-2018-11 1 5.95 https://www.thisisinsider.com/michael-buble-quits-music-sons-cancer-battle-2018-10 0 4.48 
https://www.unilad.co.uk/tv/bert-and-ernie-are-a-gay-couple-confirms-sesame-street- writer/ 0 5.33 https://www.unilad.co.uk/tv/making-a-murderer-part-2-to-be-officially-released-on- october-19th-2018/ 1 6.57 https://www.usatoday.com/story/news/nation-now/2018/08/18/no-prison-time-ex- houston-doctor-who-raped-heavily-sedated-patient/1031665002/ 1 6.57 https://www.usatoday.com/story/news/politics/2018/08/30/federal-pay-freeze-trump- cancels-2-1-percent-pay-raise-federal-workers-citing-budget-deficit/1145355002/ 1 6.86 https://www.vice.com/en_us/article/43ejmj/generation-z-is-skipping-college-for-trade- school 1 5.33 https://www.washingtonpost.com/technology/2018/11/08/white-house-shares-doctored- video-support-punishment-journalist-jim-acosta/ 1 6.57 https://www.washingtontimes.com/news/2018/oct/14/ted-wheeler-portland-mayor-stands- decision-allow-a/ 1 5.19 https://www.westernjournal.com/ct/cnn-fails-blunder-kavanaugh/ 1 4.81 https://www.westernjournal.com/ct/huge-feinstein-investigated/ 0 3.67 https://www.westernjournal.com/ct/trump-black-liberal-reporter-racist-question/ 1 3.57 https://www.westernjournal.com/ct/watch-slow-mo-video-catches-angry-acosta-shoving- woman-white-house/ 0 1.76 https://www.westernjournal.com/obama-urges-lives-better-non-americans/ 1 6.00 https://www.yahoo.com/entertainment/orange-new-black-cancelled-netflix- 085607387.html 1 6.52 https://www.yahoo.com/lifestyle/low-carb-diet-keto-could-134200344.html 1 6.29 https://www.yahoo.com/news/trump-helped-parents-hide-money-tax-returns-york- 195327106.html 1 6.62 https://www.zeptha.com/cotton-swab-soaked-in-alcohol-and-placed-in-your-navel/ 0 1.62

2. Methods: Bootstrapping Many Crowds

In order to simulate the performance of crowds of different sizes, we performed the bootstrapping procedure described below. Note that we sampled the layperson judgments independently for each article, rather than keeping the same crowd for the entire set of 207 articles. We did this because we collected only 20 ratings per layperson, similar to the implementation likely to be used by platforms, in which laypeople would rate only a subset of all content. Simulations were performed in R using the purrr, foreach, and doParallel packages; code can be found on our OSF site: https://osf.io/hts3w/.

K = the maximum size of the crowd (K = 26)
N = the total number of articles in the set (N = 207)
B = the total number of bootstraps for each article (B = 1000)

Di = the set of Democratic layperson judgments for article i
Ri = the set of Republican layperson judgments for article i

fi = the average continuous fact-checker rating for article i
mi = the modal binary fact-checker rating for article i (0 = Not True, 1 = True)

For k = 2...K:
    For b = 1...B:
        For i = 1...N:
            Di,b,k = sample with replacement k/2 responses from Di
            Ri,b,k = sample with replacement k/2 responses from Ri
            µi,b,k = average of {Di,b,k, Ri,b,k}
        ρk,b = Pearson correlation between the average layperson ratings {µ1,b,k … µN,b,k} and the fact-checker ratings {f1 … fN}
        ak,b = AUC of a model using the average layperson ratings {µ1,b,k … µN,b,k} to predict the binary fact-checker ratings {m1 … mN}
    rk = average Pearson correlation across all bootstraps {ρk,1 … ρk,B}
    AUCk = average AUC across all bootstraps {ak,1 … ak,B}
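Purely as an illustration, the procedure above could be translated into Python roughly as follows. The original simulations were run in R (see above); the data structures here (dicts of per-article rating arrays) and the use of scikit-learn's roc_auc_score are assumptions made for this sketch, not the code on the OSF site.

import numpy as np
from sklearn.metrics import roc_auc_score  # assumed convenience, not the original tooling

def crowd_performance(D, R, f, m, K=26, B=1000, seed=0):
    """D, R: dicts mapping article index -> array of Democrat / Republican Likert ratings;
    f: array of average continuous fact-checker ratings; m: array of binary (Is True) ratings."""
    rng = np.random.default_rng(seed)
    n = len(f)
    r_k, auc_k = {}, {}
    for k in range(2, K + 1, 2):                      # even crowd sizes, k/2 from each party
        rhos, aucs = [], []
        for _ in range(B):
            mu = np.empty(n)
            for i in range(n):
                draw = np.concatenate([rng.choice(D[i], k // 2, replace=True),
                                       rng.choice(R[i], k // 2, replace=True)])
                mu[i] = draw.mean()
            rhos.append(np.corrcoef(mu, f)[0, 1])     # correlation with fact-checker average
            aucs.append(roc_auc_score(m, mu))         # AUC predicting the binary fact-checker rating
        r_k[k], auc_k[k] = float(np.mean(rhos)), float(np.mean(aucs))
    return r_k, auc_k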

3. Methods: Out-of-Sample Accuracy

To estimate the out-of-sample accuracy of a model that uses the crowd’s aggregate ratings to predict the fact-checkers’ modal binary rating, as a function of the percentage of unanimous headlines in the dataset, we performed the procedure described below. We first calculated the average layperson rating for a crowd of size 26 for each article, averaged across 1,000 bootstrapped simulations. We then split the headlines into 80/20 train/test sets, found the cutoff that maximized the weighted average of accuracy on unanimous and non-unanimous headlines in the training set, and used that cutoff to calculate the weighted out-of-sample accuracy on the test set. This procedure was implemented in Python using the pandas and numpy packages. Code can be found on our OSF site: https://osf.io/hts3w/.

S = the number of train/test split trials
C = the set of possible cutoffs, taken from the average fact-checker rating of each of the 207 articles
p = the proportion of unanimous headlines in the sample
max_cp = the optimal cutoff based on the training data when a proportion p of the sample is unanimous
L = the set of laypeople ratings, where lj is the rating for headline j
F = the set of fact-checker ratings, where fj is the binary (Is True) rating for headline j
U = the set of articles with unanimous fact-checker agreement
NU = the set of articles with non-unanimous fact-checker agreement

For i in 1...S:
    Split the headlines 80/20 into train/test sets:
        Ui,train = set of unanimous headlines for training
        NUi,train = set of non-unanimous headlines for training
        Ui,test = set of unanimous headlines for testing
        NUi,test = set of non-unanimous headlines for testing
    For p in 0...1 by 0.01:
        For c in C:
            Acctrain,c,p,i = p * (1/|Ui,train|) Σj∈Ui,train 1((lj > c) == fj) + (1 - p) * (1/|NUi,train|) Σj∈NUi,train 1((lj > c) == fj)
        max_cp = argmaxc Acctrain,c,p,i
        Acctest,p,i = p * (1/|Ui,test|) Σj∈Ui,test 1((lj > max_cp) == fj) + (1 - p) * (1/|NUi,test|) Σj∈NUi,test 1((lj > max_cp) == fj)
Acctest,p = average of {Acctest,p,1 … Acctest,p,S}
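The following Python sketch illustrates this procedure under assumed inputs: crowd (the bootstrapped average layperson rating per article for a crowd of size 26), truth (the modal binary fact-checker rating), unanimous (whether all three fact-checkers agreed), fc_mean (the average fact-checker ratings used as candidate cutoffs), and p (the assumed proportion of unanimous headlines). It is illustrative only; the actual analysis code is on the OSF site.

import numpy as np

def weighted_accuracy(crowd, truth, unanimous, cutoff, p):
    """Weighted average of accuracy on unanimous and non-unanimous articles for a given cutoff."""
    pred = crowd > cutoff
    acc_u = (pred[unanimous] == truth[unanimous]).mean()
    acc_nu = (pred[~unanimous] == truth[~unanimous]).mean()
    return p * acc_u + (1 - p) * acc_nu

def stratified_split(unanimous, rng, frac=0.8):
    """80/20 split performed separately within unanimous and non-unanimous articles."""
    idx_u = rng.permutation(np.flatnonzero(unanimous))
    idx_nu = rng.permutation(np.flatnonzero(~unanimous))
    cut_u, cut_nu = int(frac * len(idx_u)), int(frac * len(idx_nu))
    return (np.concatenate([idx_u[:cut_u], idx_nu[:cut_nu]]),
            np.concatenate([idx_u[cut_u:], idx_nu[cut_nu:]]))

def out_of_sample_accuracy(crowd, truth, unanimous, fc_mean, p, n_trials=20, seed=0):
    """Average test-set weighted accuracy of the best training-set cutoff, over n_trials splits."""
    rng = np.random.default_rng(seed)
    cutoffs = np.unique(fc_mean)   # candidate cutoffs: the average fact-checker ratings
    accs = []
    for _ in range(n_trials):
        train, test = stratified_split(unanimous, rng)
        best = max(cutoffs, key=lambda c: weighted_accuracy(
            crowd[train], truth[train], unanimous[train], c, p))
        accs.append(weighted_accuracy(crowd[test], truth[test], unanimous[test], best, p))
    return float(np.mean(accs))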

4. Robustness: Predicting Fact-Checker Binary Ratings with Laypeople Binary Ratings

As a robustness test, we performed the same bootstrapping procedure as specified in Section S2 to calculate the AUC of a model that uses the laypeople’s categorical ratings, rather than their average continuous ratings, to predict the categorical ratings of the fact-checkers. We binarize the crowd’s ratings in the same way as the fact-checkers’ (1 = True, 0 = Not True) and use the proportion of “True” ratings in the crowd to predict the fact-checkers’ ratings. As one might expect, the results are similar to, but slightly worse than, the model that uses the (more sensitive) continuous crowd ratings. The AUC for the model on the unanimous set asymptotes at .90 for the source condition and .85 for the no-source condition, slightly lower than the .92 and .90 from the continuous model, respectively. Performance on the non-unanimous set shows a similar pattern: .74 for the source condition and .71 for the no-source condition, vs. .78 and .74 for the continuous model, respectively. Unsurprisingly, the binary model does worse than the continuous model for smaller crowds, with an average AUC that is 0.05 - 0.1 points lower for a crowd of size 2. With larger crowds, however, the gap narrows to about .02 - .05. Overall, the results still show that even a binary rating can be a useful predictor when aggregated across many people.
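A minimal sketch of this binarized variant, under the same kind of assumed inputs as above (per-article arrays of 0/1 layperson ratings and the binary fact-checker ratings), might look as follows; scikit-learn's AUC is used for convenience and is not necessarily the original tooling.

import numpy as np
from sklearn.metrics import roc_auc_score  # assumed convenience, not the original tooling

def binary_crowd_auc(crowd_categorical, fc_is_true):
    """crowd_categorical: one array of 0/1 layperson ratings (1 = rated "True") per article."""
    share_true = np.array([a.mean() for a in crowd_categorical])  # proportion of the crowd rating "True"
    return roc_auc_score(fc_is_true, share_true)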

Figure S1. Classifying articles as true versus non-true based on layperson aggregate categorical ratings. (a) (b) AUC scores as a function of the number of layperson ratings per article and source condition. AUC is calculated using a model in which the layperson rating is coded as 1 for True and 0 otherwise and the share of “True” layperson ratings is used to predict the modal fact-checker categorical rating, coded in the same way.

5. Robustness: Predicting Fact-Checker “Is False” Ratings

While predicting whether a URL is “True” vs. “False” or “Misleading” is the more relevant task for platforms combating misinformation, there are also cases in which distinguishing outright “False” URLs from “Misleading” or “True” ones is useful. Thus, we extend the bootstrapping procedure described in Section S2 to use the crowd’s average Likert-scale accuracy ratings to predict the fact-checkers’ categorical ratings, with “False” coded as 1 and all other options coded as 0. The results, shown in Figure S2, are qualitatively similar to those of the same model predicting “Is True”. However, performance is slightly worse, with the AUC asymptoting at .85 - .86 for unanimous URLs and .74 - .75 for non-unanimous URLs, compared to .90 - .92 for unanimous and .74 - .78 for non-unanimous URLs in the “Is True” prediction case. The results from the source condition do not significantly differ from the no-source condition.

Figure S2. Classifying articles as false versus non-false based on layperson aggregate continuous ratings. (a) (b) AUC scores as a function of the number of layperson ratings per article and source condition. AUC is calculated using a model in which the layperson average Likert-scale accuracy ratings are used to predict the modal fact-checker categorical rating, with responses coded as 1 for False and 0 otherwise.

6. Robustness: AUC, Political vs. Non-Political Headlines

We also examine the AUC of a model predicting the modal fact-checker rating, with “True” coded as 1 and 0 otherwise, separately for political and non-political headlines. We do not see a significant difference between the performance of the crowd on political vs. non-political headlines in either condition.

Figure S3. Classifying articles as true versus non-true based on layperson aggregate continuous ratings. (a) (b) AUC scores as a function of the number of layperson ratings per article and source condition. Panels show results for (a) Non-Political articles, (b) Political articles.

7. Robustness: Political Knowledge and Cognitive Reflection by Party

We also examine whether the finding that crowds with high levels of political knowledge and cognitive reflection outperform their lower-scoring counterparts holds among both Democrats and Republicans. Figure S4 shows that high-CRT and high-political-knowledge crowds outperform their lower-scoring counterparts for politically-balanced, all-Democrat, and all-Republican crowds. Note that we examine the performance of a crowd of size 10, rather than size 26 as in Figure 3 of the main text, because restricting participants to a single party cuts the pool of potential respondents in half.

Figure S4. Comparing the performance of high vs. low CRT and high vs. low political knowledge crowds for politically balanced, all Democrat, and all Republican populations, respectively. Panel A shows the correlation between the fact-checkers and a crowd of size 10, while Panel B shows the AUC of a model using the crowd’s average continuous rating to predict the modal fact-checker categorical response with responses coded as 1 for True and 0 otherwise.

8. Robustness: Performance as a Function of Crowd Size: Political Party, Cognitive Reflection, and Political Knowledge

In Figure 3 of the main text, we show the results of subsetting on partisanship, cognitive reflection, and political knowledge using a crowd of size 26, collapsing across the source and no-source conditions. Here we show performance as a function of k, the size of the layperson crowd, using the same procedure as in Figure 3 but simulating the performance of crowds of size 2 through 26. As can be seen, (1) Democrat, (2) high-CRT, and (3) high-political-knowledge crowds outperform the balanced crowd at small values of k, but this advantage diminishes as the crowd size grows.

Figure S5. (a-c) Correlation between average fact-checker rating and crowd and (d-f) AUC predicting modal fact- checker rating (1 = True, 0 = Not True) as a function of k, the number of layperson ratings per article. All panels compare performance to the baseline politically-balanced crowd, shown in red. Panels (a) and (d) show performance of an all Democrat vs. all Republican crowd. Panels (b) and (e) show performance for a politically-balanced crowd of participants who scored above the median on the CRT vs. at or below the median on the CRT; Panels (c) and (f) show a politically-balanced crowd of participants who score above the median on political knowledge vs. at or below the median on political knowledge.

9. Robustness: Replicating Correlational Analysis with Alternate Set of Fact-Checkers

To demonstrate that our results are not an artifact of the particular Upwork fact-checkers we used, and to support the validity of their ratings, we also obtained ratings from a set of 4 journalist fact-checkers whom a colleague had recruited to fact-check the same set of articles. These fact-checkers were professional journalists who had just completed a prestigious fellowship for mid-career journalists and had extensive experience reporting on U.S. politics. The journalists had an average inter-fact-checker correlation of .67 (similar to the .62 correlation among our Upwork fact-checkers), and the average of their ratings was highly correlated with the average rating of our Upwork fact-checkers (r = .81). Furthermore, Figure S6 below shows that replicating our main analysis (Figure 1 of our paper) using the average rating from the 4 journalists (instead of the Upwork fact-checkers) yields qualitatively similar results. This demonstrates that our key result - that a relatively small number of laypeople can achieve similar correlation with the fact-checkers as the fact-checkers show with each other - is qualitatively robust to a different set of clearly qualified fact-checkers.

Figure S6. Correlation across articles between (i) politically-balanced layperson aggregate accuracy ratings based on reading the headline and lede and (ii) the average of 4 professional journalist fact-checkers’ research-based accuracy ratings, as a function of the number of layperson ratings per article. Laypeople are grouped by condition (Source vs. No Source). Panels show results for (a) All articles, (b) Non-Political articles, (c) Political articles. The dashed line indicates the average Pearson correlation between the fact-checkers.

10. Robustness: Replicating Correlational Analysis with Larger Crowds

We repeat the correlational analysis of Figure 1 with crowds of size up to 100. As can be seen, the correlation between the fact-checkers and the crowd improves only marginally beyond 25 ratings, largely asymptoting at larger crowd sizes.

Figure S7: Correlation across articles between (i) politically-balanced layperson aggregate accuracy ratings based on reading the headline and lede and (ii) average fact-checker research-based aggregate accuracy ratings, as a function of the number of layperson ratings per article (up to 100). Laypeople are grouped by condition (Source vs. No Source). Panels show results for (a) All articles, (b) Non-Political articles, (c) Political Articles.

11. Robustness: AUC Analysis, Excluding “Couldn’t Be Determined” ratings

We believe that an article should only be classified as “true” if there is evidence in favor of its being true - and therefore that “Couldn’t Be Determined” counts against being “true”. However, one could argue that “Couldn’t Be Determined” ratings should be excluded from the analysis, since fact-checkers could not make a judgment for those articles. We therefore repeat our AUC analysis, excluding ratings of “Couldn’t Be Determined” by the fact-checkers. The results are extremely similar to those presented in the main text, since only a handful of articles changed rating.

Figure S8: Classifying articles as true versus non-true based on layperson aggregate Likert ratings. AUC scores as a function of the number of layperson ratings per article and source condition for articles with (a) unanimous fact-checker ratings and (b) non-unanimous ratings, excluding ratings of “Couldn’t Be Determined”.

12. Robustness: Correlational Analysis with a Single Fact-Checker

One potential methodological concern is that we are comparing the correlation between averages of laypeople and fact-checkers to the average correlation between three individual fact-checkers. To address this potential measurement artifact, we plot the correlation between the average crowd rating and a single fact-checker’s ratings (rather than the average of all three fact-checkers). The results are qualitatively similar to Figure 1 in our manuscript, although the number of laypeople it takes to match fact-checker performance is higher than when averaging the three fact-checkers’ responses.

Figure S9: Average correlation across articles between (i) politically-balanced layperson aggregate accuracy ratings based on reading the headline and lede and (ii) a single fact-checker’s research-based accuracy ratings, as a function of the number of layperson ratings per headline.

13. Robustness: Individual Differences for Political vs. Non-Political Headlines

We recreate the output of Figure 3, examining the performance of crowds with different layperson characteristics separately for political and non-political headlines. We find similar results for both: Democrats outperform Republicans, high-CRT crowds outperform low-CRT crowds, and high-political-knowledge crowds outperform low-political-knowledge crowds.

Figure S10: Comparing crowds with different layperson compositions to a baseline, politically-balanced crowd for non-political vs. political headlines. (a) (b) Pearson correlations between the average aggregate accuracy rating of a crowd of size 26 and the average aggregate accuracy rating of the fact-checkers for non-political vs. political headlines, respectively. (c) (d) AUC for the average aggregate accuracy rating of a crowd of size 26 predicting whether the modal fact-checker categorical rating is “True” for non-political vs. political headlines, respectively.

14. ROC Curves for Predicting Fact-Checker Binary Ratings with Laypeople Continuous Ratings

We provide the ROC curves for our AUC analysis, in which we used the crowd’s average Likert-scale ratings to predict the fact-checkers’ modal binary ratings (with “True” coded as 1 and 0 otherwise), annotated with the 1 - 7 Likert score cutoffs. While these curves are informative, we believe they underscore our argument for applying a continuous truth rating in newsfeed ranking rather than a binary cutoff.

Figure S11: ROC curves for a model that uses the average Likert rating of a crowd of size 26 to predict the fact-checker modal binary ratings (with “True” coded as 1, and all other answers coded as 0). Panel (a) shows the performance for the entire set of articles, and Panels (b) and (c) show performance for articles with unanimous and non-unanimous fact-checker agreement, respectively.
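For reference, an ROC curve of this kind can be traced out from the crowd's average Likert ratings and the fact-checkers' modal binary ratings along the lines of the following sketch (assumed inputs; scikit-learn is used for convenience and is not necessarily the original tooling).

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score  # assumed convenience, not the original tooling

def crowd_roc(crowd_mean, fc_is_true):
    """ROC for the crowd's average Likert rating as a score for the modal binary fact-checker rating."""
    fpr, tpr, cutoffs = roc_curve(fc_is_true, crowd_mean)  # cutoffs are thresholds on the crowd's Likert scale (roc_curve prepends a sentinel)
    return fpr, tpr, cutoffs, roc_auc_score(fc_is_true, crowd_mean)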

15. Qualitative Examination of Disagreement Among Fact-Checkers

In this section, we provide more insight into the nature of the disagreement among the fact-checkers. In particular, we provide the categorical assessment, average accuracy rating, and rationale given by each fact-checker for 3 URLs from the set of 20 URLs that we had each fact-checker initially rate to demonstrate their competency. These 3 URLs are examples where the fact-checkers disagreed about whether or not the focal claim was true. As part of the initial competency assessment, for URLs where there was disagreement, we gave the other fact-checkers’ rationales to the fact-checker who disagreed with the modal assessment. Thus, we also include the disagreeing fact-checker’s response (sometimes lightly edited for conciseness). Our assessment of these texts (and the others in the initial set of 20) is that there was legitimate disagreement between the fact-checkers.

DOES ELIZABETH WARREN HAVE STATISTICALLY LESS NATIVE AMERICAN BLOOD THAN AVERAGE WHITE AMERICAN?

Massachusetts Sen. Elizabeth Warren released the results from a DNA test Monday proving that she has statistically less Native American DNA than the average American white person.

https://dailycaller.com/2018/10/15/elizabeth-warren-less-native-american-dna/

FC1: Misleading; 3.57. The article's title was updated to reflect the updated information that came out, but the first sentence apparently wasn't updated. The entire article is a bit of a mess because of the updated information.

FC2: Misleading; 2.00. Nuances, nuances. All is much more complicated and even sources they cite are saying that. The article is pure clickbait article, however they have added an edit mentioning complexity of the whole thing. Problem is that the article is still showing a shallow understanding of it.

FC3: True; 6.71. It’s nice that this piece has a correction as it should, and that helps with the trustworthiness and reliability scores. The only gripe I can offer is incidentally addressed by that correction: the distinction between heritage as calculated from generation versus ancestry by proportion of genetic material. The piece directs the reader on where to learn more about the nuances at play, but conflating the two remains reasonable in this situation as Elizabeth Warren’s report did not provide the percentage figure for genetic material, necessitating a calculation of heritage based off of generations and the comparison of that to the distribution of genetic material you can find nationally. The nuances at play aren’t an issue even in the absence of elaboration, but the piece even directs acknowledges the nuances and directs the reader as to where they can learn more about it.

FC3 response:

So the issue raised by the two other fact checkers in this case were that, regardless of correction, the article makes a false assertion that the DNA report Warren released proved she had less Native American genetic material than the average white American — this led to their classing of the article as misleading/false.

In my work on this article, I also honed in on the piece’s claims regarding Warren’s genetic material and heritage and how it relates to a subset of the population. There was three questions to answer here: (1) how much Native American genetic material does Warren have; (2) how much Native American heritage does Warren have; and (3) how does Warren’s DNA and heritage compare to the white American subset.

DNA. The piece itself did not provide a figure for Warren’s proportion of Native American genetic material. As a result, I had to seek such a figure in the DNA report, produced by one Dr. Bustamante, which Warren published. (See: http://templatelab.com/bustamante-report-2018/). Bustamante reported that: (1) at a 99% confidence level, about 95% of Warren’s genetic material was identified as European; (2) the proportion of Warren’s actual European genetic material was likely higher than 95%; (3) at a 99% confidence level, 5 DNA segments were identified as Native American, of a combined about 12,300,000 bases and about 25.6 centiMorgans; (4) at a 99% confidence level, the unassignable DNA segments were of a combined about 267,650,000 bases of about 366 centiMorgans; (5) a reference human has 46 chromosomes in 23 pairs, with a combined about 6,469,660,000 bases and of about 7,190 centiMorgans. It should be noted that the number of bases would indicate the quantity of genetic material discussed, rather than measurements in centiMorgans indicating that. (See: https://www.genome.gov/genetics-glossary/Centimorgan). It should also be noted that the provided figures were all approximations and that the overall figure, itself an approximation, was for a reference human rather than for Warren as an individual as Warren’s individual figure was not provided by Bustamante. Ultimately, we can calculate that Bustamante reported the proportion of Warren’s genetic material assignable as Native American at the 99% confidence level as compared to a reference human at about 0.19%.

HERITAGE. ‘Heritage’ is a less firm term than ‘proportion of genetic material’. For example, while heritage could refer to an individual’s inherited genetic material, it could also refer to an individual’s ancestors across preceding generations. In this way, and as discusses in article referenced by The Daily Caller, an individual’s heritage in terms of ancestors and generations may differ from an individual’s heritage in terms of inherited genetic material. For example, an individual with an unadmixed British father and an unadmixed Chinese mother, or a heritage in terms of ancestors of 50% Chinese, may in actuality only have 40% of their genetic material be assignable as Chinese as a result of biological processes. Additionally, it isn’t possible to conclude with certainty just from analyzing an individual’s genetic material when or how many times genetic material of a certain group, say Chinese genetic material, was introduced into that individual’s lineage. This, again, is due to biological processes. Bustamante does, however, conclude from his analysis that an unadmixed Native American ancestor entered Warren’s lineage about 8 generations, or between 6 and 10 generations, prior. Warren would have 64 ancestors in generation 6, 256 ancestors in generation 8, and 1024 ancestors in generation 10. Calculating heritage in terms of ancestors using Bustamante’s conclusion, Warren’s Native American heritage would therefore be as high as 1.562% but as low as 0.097%.

WHITES. There isn’t even a supreme definition of who qualifies as ‘white’, let alone a supreme figure for how many white Americans there are or a supreme approximation of their genetic makeup. Bustamante’s report never uses the word “white” but rather European, and The Daily Wire never elaborated on their use of the word “white” with a definition, except to reference a study which discusses “European Americans”. The US Census Bureau, for example, operates with “white” defined as having ancestors from Europe, the Middle East, or North Africa.“American”, in contrast, can reasonably be assumed to mean individuals who are US citizens. Even making the assumption that those Americans referred to as “white” by The Daily Wire are those Americans with European ancestry, not all Americans self-identify as “white”. As an example, even a ‘black’ or a ‘Chinese’ American with a known European ancestor, say Portuguese, from the distant past, say three centuries prior, may nonetheless decline to self-identify as ‘white’ in a legal, social, or research context. With this in mind, there moreover is not an abundance of research to indicate the proportion of Native American genetic material in the ‘average white American’. The aforementioned study referenced by The Daily Wire provides an average proportion figure of Native American genetic material for the study participants who self-identified as European Americans of about 0.18%. However, the same study provides an average proportion figure of European genetic material for the study participants who self-identified as African of about 24%, underscoring that the 0.18% figure only corresponds to those who self-identified as European rather than those who were found to have European genetic material. (See: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4289685/pdf/main.pdf). Unfortunately, the study does not directly provide nor provide the data needed to calculate the proportion of Native American genetic material in the average American found to have European genetic material. As a result, Warren’s here-calculated proportion of 0.19%. can’t even fairly be compared to the study’s proportion of 0.18%, which only describes individuals who self-identified as European and excludes those individuals who did not self-identify as European but who nonetheless were found to have European genetic material. On top of all that, the study does not define its participants as ‘American’ based on whether they are US citizens but rather based on where they live in the US.

The breakdown I’ve just provided demonstrates that while the Washington Post’s clarification of heritage as defined by genetic material versus heritage as defined by ancestors was warranted, there is no evidence in the Daily Wire article, in the Washington Post article, or in my own limited literature review to disprove the claim that “[Warren] has statistically less Native American DNA than the average American white person”. Moreover, and even more importantly, that claim has so much wiggle room in how the terms “white”, especially, and “American” may be defined that there are no grounds to class it as false. Consider also that the central point of the article, which is that in the context of Warren’s claims of Native American heritage she demonstrated herself to have a minuscule amount of Native American heritage, stands. Here we have a case where the study used to supposedly ‘debunk’ the claim is itself unsuitable to the task, the claim affords itself more than enough wiggle room anyway, and the central point of the article is left standing. That’s why, in good conscience, I can only class this piece as True while acknowledging its inaccuracies rather than class it as Misleading, much less False.

Toxic, treasonous media pushing “white supremacist” hoax and hit lists of Trump supporters in desperate scheme to drive America into civil war

It’s now obvious the malicious, toxic media is pushing a “white supremacist” hoax in a desperate scheme to drive America into a civil war. The entire left-wing media has now become nothing more than a hate machine that’s spreading its “daily hate” to radicalize left-wing Americans into an unprecedented level of hatred, insanity and violence.

https://www.naturalnews.com/2019-08-07-toxic-treasonous-media-pushing-white-supremacist-hoax.html

FC1: False; 1.29. There are elements of truth in many of the points mentioned. Overall, though, the individual points are highly misleading and the article as a whole is false.

FC2: False; 1.00. There are many different things in this article, many conspiracy theories and a lot of racism. I don't think it even deserves real debunking.

FC3: True; 3.86. My assessment of “true” for the piece isn’t a confirmation of the opinion or analysis in this piece, but rather just of the facts provided and of the consistency between title and body: the opinion and conclusions may be criticized, but the promise of the title is delivered on and the facts provided are, with one minor exception, worthy of characterization as accurate. This piece doesn’t seem to offer strict falsehoods or act in bad faith, but it is heavy with opinion, interpretation, and argument to an extreme degree: it’s not a news piece, and it’s certainly a biased piece.
One of the central points of the article, for example, is that the media is pushing a hoax that mass shootings are motivated by white supremacy. The facts it provides in support of this claim are offered in good faith and appear to be the product of some due diligence. Consider the work of Daniel Greenfield with the David Horowitz Freedom Center, who arrives at a figure showing that whites committed only a minority of mass shootings in 2019: he used an accepted pool of data, the product of a volunteer and crowdsourced endeavor, and a particular definition. A different definition, such as that used by Mother Jones, produces a very different figure contending the opposite, but both can be considered accurate and made in good faith given the data and criteria at hand. To make an accurate assessment, which no one has, a research project would have to be undertaken that surveys law enforcement across the country at the federal, state, and local levels (a massive undertaking) and uses a particular definition for what’s being considered. In short, I’m not prepared to call that evidence inaccurate or misleading. Another case in the piece which I’m not prepared to call inaccurate or misleading is where it claims that Joe Biden said he was going to send armed federal agents to confiscate guns: in the respective interview, he says that he’s going to ‘come for the guns’ of those with assault weapons, then says he’ll do so through a gun buyback program rather than through confiscation, then says he can’t confiscate because there isn’t a law on the books by which he can do so. Ultimately, there’s reasonable wiggle room to interpret that, in a scenario where there’s a mandatory gun buyback program, for example, those who resist will ultimately face law enforcement.
These kinds of incidents, where true statements or events are heavily interpreted or argued to a far-reaching conclusion, appear throughout the piece. Indeed, there are many opinions, or conclusions, throughout the piece which stand alone or aren’t arrived at through evidence in the piece: take the conclusions that the media want America to fall into civil war and be destroyed, for example. Those aren’t facts you can simply call correct or incorrect; rather, they’re matters of opinion, of political analysis by the author, and, to whatever degree, of rhetoric. The claim about a ‘hit list’ is similar, in that the interpretation is within the realm of rational argument, but it has to be arrived at through interpretation. It’s not misleading, but rather within the realm of analysis and not dependent on factual errors. There’s only one incident I found of simple factual inaccuracy rather than a matter of heavy interpretation, opinion, or argument, and that’s in the case of the ‘death camp’ posters, where the inaccuracy is that the posters were supposedly posted across New York City when actually they were posted in a Long Island town many miles away.
This piece is characterized by heavy opinion, interpretation, and argument, and sometimes by standalone conclusions or even calls to action. When it comes to the facts featured, and without making judgment on the opinion or analysis, the facts themselves were accurate, with one relatively minor exception regarding the geographic distribution of the posters.

FC3 response to FC1 and FC2 rationales:

This piece was not an easy one to assess, and that should be apparent in light of the lengthy remarks I provided with my assessment of it.

[Furthermore] the piece does not claim that ‘white supremacy’ is a hoax; there’s nothing in the piece to suggest a claim that there aren’t white supremacists. Rather, the claim of the piece is that the media is perpetrating a hoax in asserting that white supremacy is driving mass shootings, and that this claim by the media is untrue. The piece goes on to support that point with evidence. I won’t make a judgment on the opinion or the analysis, but I don’t have any significant factual errors to point to. There is also the question of whether the piece’s claim that the media is trying to drive the country to civil war is a conspiracy theory. I judge such claims in the piece to be rhetoric or analysis, and outside of my mandate except insofar as they count towards the piece’s low scores in Bias, Subjectivity, and Trustworthiness. Had the piece included falsehoods as evidence to back up such ‘conspiracy theory’ claims, I may have had grounds to class it as False, but where the piece goes that far, it does not bring falsehoods with it. I take care to exclude emotion from these assessments, and neither opinion nor analysis are grounds for classing this piece as False in the absence of material inaccuracies. Moreover, I cannot class this piece as Misleading when it delivers exactly what it says on the tin and doesn’t equivocate, et cetera. The piece doesn’t pretend to be something it’s not and doesn’t cross any ethical lines. When you cut out the emotional knee-jerk and separate the fat, which is the analysis and opinion, from the meat, which is the factual claims, the most I can do is class the piece as True while marking it rock-bottom for Bias, Subjectivity, and Trustworthiness. I stand by my classification.

Ginsburg Can't Remember 14th Amendment, Gets Pocket Constitution from the Audience

Some of our Supreme Court justices care more about politics and logical gymnastics than the text of the Constitution.

https://conventionofstates.com/news/ginsburg-can-t-remember-14th-amendment-gets-pocket-constitution-from-the-audience

FC1: Misleading; 3.14. The article's title is misleading, while the actual article is correct, although written in a biased manner. From the video, it doesn't appear Ginsburg forgot the amendment so much as she wanted to cite it correctly.

FC2: Misleading; 3.00. Justice Ruth Bader Ginsburg indeed needed a text of the Constitution in front of her eyes; however, that is probably more related to her health, her age, and the point she was making. She definitely knows the subject well, as later in the same conversation she referred multiple times to the 14th Amendment. Besides that, she has been giving lectures to high school students on this very topic. So the article is very biased, uses the event to ridicule the Justice, and has a call to action just under the story.

FC3: True; 5.29. Factually true, but of course biased and even calling to action. It didn’t clarify that she wanted to find her pocket constitution to have on hand before she began answering the question and that she clearly knew where she was going before she got her hands on one, and the title is rather sensational when really it would be fairer to her to say that she couldn’t remember the text verbatim. Still, I can’t call this inaccurate simply for being unflattering. Furthermore, the fact that the body truthfully delivers on the title makes it all the harder to call this misleading; I wouldn’t call it misleading, but it could give a kinder context.

FC3 response: This particular assessment was not an easy one to make, and the reasons why are echoed in the remarks of the other fact-checkers. I ultimately had to choose between classing this piece as Misleading or True because, while containing no factual errors, the piece was sensational, biased, and deeply unflattering. It’s fair to say that the title is sensational and unflattering, but it is factually true and the context is provided in the article. I’m understanding of the classifications as Misleading, because I too had to reconcile the sensational hook with the factual accuracy, but ultimately the article provides the context and delivers on what it sets out to do. I can’t class a piece as misleading simply for being unkind, unflattering, charged, or sensational; that would be an inappropriate, emotionally driven judgment rather than an ethical and objective one. I understand, though respectfully disagree with, those who classed it as Misleading. I stand by my classification of the piece as True, though it wasn’t the easiest decision to reach.

16. Generalization to a Researcher-Selected Dataset
We also applied the same crowdsourcing approach to an alternative dataset from Pennycook & Rand (2019) in order to explore the generalizability of our findings to different stimulus sets. The alternative dataset contained 30 researcher-selected headlines, half true and half false, balanced between pro-Democrat, pro-Republican, and neutral. The crowd was composed of 800 participants recruited from Amazon Mechanical Turk. We used the same methodology as in Figure 2 of the main text, using the crowd's ratings to classify headlines as True or Not True. We find results substantively similar to those presented in the main text, with an AUC of .95 (95% CI: [.85, 1]) for a crowd of size 26. In fact, crowd performance was better on the researcher-selected stimulus set than on the Facebook-provided one, suggesting that the set of headlines provided to us by Facebook was substantially more difficult to rate.
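For concreteness, the sketch below is a minimal illustration of this kind of analysis; it is not the authors' code, and the data layout, variable names, and use of scikit-learn are assumptions. It scores each headline by the mean Likert accuracy rating of a politically-balanced crowd of size 26 and computes the AUROC against a binary fact-checker label ("True" = 1, "Not True" = 0).

```python
# Minimal, illustrative sketch (hypothetical data format and names):
# score each headline by the mean Likert rating of a politically-balanced
# crowd, then compute AUROC against binary fact-checker labels.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def crowd_auc(dem_ratings, rep_ratings, fc_true, crowd_size=26):
    """dem_ratings / rep_ratings: headline id -> list of Likert ratings from
    Democrats / Republicans; fc_true: headline id -> 1 if fact-checkers rated
    the headline "True", else 0."""
    scores, labels = [], []
    for headline, label in fc_true.items():
        # Draw a politically-balanced crowd: half Democrats, half Republicans.
        dem = rng.choice(dem_ratings[headline], size=crowd_size // 2, replace=False)
        rep = rng.choice(rep_ratings[headline], size=crowd_size // 2, replace=False)
        scores.append(np.mean(np.concatenate([dem, rep])))  # average crowd rating
        labels.append(label)
    return roc_auc_score(labels, scores)
```

A confidence interval such as the reported 95% CI could then be approximated by repeating the crowd draws and bootstrapping over headlines.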

Figure S12: AUROC for a model that uses the average Likert rating of a politically-balanced crowd to predict fact-checker ratings of “True” (1) vs. “Not True” (0).