
INFERRED STATISTICS AND ECOLOGICAL VALIDITY IN BAYESIAN REASONING

Christopher B. Arnold

A Dissertation

Submitted to the Graduate College of Bowling Green State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

May 2018

Committee:

Richard Anderson, Advisor

Tong Sun, Graduate Faculty Representative

Mary Hare

Michael Zickar


ABSTRACT

Richard B. Anderson, Advisor

Research on Bayesian reasoning has indicated that people perform poorly, usually providing non-normative responses. However, the problems used to assess Bayesian reasoning have been criticized for lacking ecological validity. In order to allow computation of a normative response, numerical information has been explicitly provided, and responses have been shown to demonstrate base rate neglect. However, previous research has shown people to be highly attentive to base rates learned from experience (Nelson, Biernat, & Manis, 1990; Mastropasqua, Crupi, & Tentori, 2010). I hypothesize that performance is improved when the relationship between variables can be visualized. Although normative Bayesian judgment is determined using a mathematical formula, in most situations people make decisions without such explicit information and may develop effective heuristics for making likelihood judgments. The present study asked participants to make judgments using inferred base rates as well as inferred diagnostic information, a design which has not been used in any previous research. Because all of the statistics are inferred from experience, it was expected that likelihood judgments would be made relatively intuitively and efficiently, avoiding base rate neglect and Bayesian conservatism. Performance on the inferred statistics problem was compared within participants to performance on problems using natural frequencies and probabilities. Previous research has not compared performance in the same sample using inferred and explicit statistics. Contrary to my hypothesis, participants did not perform better when using inferred statistics than when using probabilities or natural frequencies. The present research implies that poor performance on Bayesian reasoning tasks is not the result of the way that information is presented, nor of conflict between participants’ beliefs and provided statistics. Instead, it suggests that people are able to evaluate provided statistics in a way that is equivalent to judgments made using inferred statistics, and that participants may use heuristics when provided with probabilities or natural frequencies. Although the results provided evidence for individual differences in Bayesian reasoning, numeracy did not predict performance.


Dedicated to adjunct faculty members.


ACKNOWLEDGMENTS

I’d like to thank my advisor, Dr. Richard B. Anderson, for taking me as a student and guiding me through my prelim and dissertation. His assistance and dedication to my academic growth have been invaluable and more than I could have hoped to receive. Furthermore, I would like to thank the other members of my committee, Dr. Mary Hare, Dr. Michael Zickar, and Dr. Tong Sun, for their willingness to work with me and for pushing me to be my best. I’d also like to thank my parents, without whose undying support this never would have been possible and whom I can never repay for everything they’ve done. Finally, I’d like to recognize my friends for always being there for me and for their influence in helping me find my way.


TABLE OF CONTENTS

Page

INTRODUCTION……………………………………………………………………... 1

Numerical Format…………………………………………………………..….. 6

Dual-Process Reasoning………………………………………………..……… 8

Heuristics...……………………………………………………………………... 9

Causality…………………………………………………………………...….. 15

Individual Differences………………………………………………………… 16

Cognitive Ability…………..…………………………………………………… 18

Presentation Format………..…………………………………………………… 19

Hypotheses...…………………………………………………………………… 23

EXPERIMENT 1..………………………………………………………………………. 26

Method…………………………………………………………………………… 26

Participants…..………………………………………………………….. 26

Design…………………………………………………………………... 26

Individual Difference Measures………………………………………... 27

Procedure...……………………………………………………………... 28

Results……………………………………………………………………….…. 32

Subjective Base Rates and Subjective Diagnostic Validity.………..…... 32

Inferred Probability of Residing in a Particular U.S. State...……..…..... 34

Individual Differences………..………………………………………... 35

Discussion………………………………………………………………………. 38

EXPERIMENT 2..…………………………………………………………………..…. 40

Method…………………………………………………………………..……… 40

Participants…..………………………………………………………….. 40

Design…………………………………………………………………... 41

Individual Difference Measures………………………………………... 41

Procedure...……………………………………………………………... 42

Results……………………………………………………………………….…. 43

Discussion…………………………………………………………………….… 48

GENERAL DISCUSSION……………………………………………………………... 51

REFERENCES…………………………………………………………………………. 57

APPENDIX A: HSRB FORM………………………………………………………….. 61


LIST OF TABLES

Table Page

1 Experiment 1 diagnostic estimates ……………..……………………….…….. 33

2 Experiment 1 correlation matrix…………………………….………………… 36

3 Experiment 2 diagnostic estimates ……………..……………………….…….. 43

4 Experiment 2 correlation matrix………………………...…….……………….. 48


LIST OF FIGURES

Figure Page

1 Competency check questions…… ……………..……………………….……… 29

2 Example stimuli for Experiment 1..………………………….………………… 31

3 Mean Deviation from Bayes, by numeracy score, for each condition in

Experiment 1……………..……………………….…………………………….. 37

4 Example stimuli for Experiment 2…………………………….……………….. . 42

5 Distribution of Bayesian-optimal responses, by subjective population ratio, in the

Experiment 2 Massachusetts-Texas inferred estimate condition...………..…… 44

6 Distribution of Deviation from Bayes scores, by subjective population ratio, in the

Experiment 2 Massachusetts-Texas inferred estimate condition……………… 45

7 Mean deviation from Bayes, by numeracy score, for each condition in

Experiment 2…………………………….…………………………………….. 47


INTRODUCTION

Bayesian reasoning has a long and storied history in the field of cognitive psychology, going back to at least the 1960s. Bayesian reasoning is defined by Brase and Hill (2015) as the general process of using new information to calculate the revised likelihood that an event of a known prior base rate will occur. In essence, Bayesian reasoning is the act of adjusting beliefs in accordance with the introduction of new information and assumes inherent uncertainty in making judgments. Although people may be very confident in most of their judgments, Bayesian reasoning requires some uncertainty. If no uncertainty existed then there would be no need to revise judgments based on new information.

Bayesian reasoning is seen to have two core components. The first of these is the base rate, or prior beliefs. Rather than being completely oblivious to the background of the topic, people are expected to rely on initial expectations when making judgments. The base rate serves as a baseline probability, which will be relied upon solely in the absence of other information, and as a framework in which later information will be considered.

The second component of Bayesian reasoning is the diagnostic information. Diagnostic information is any information which provides evidence toward or against the occurrence of an event. Although base rates provide a foundation on which to make judgments, judgment can be enhanced by taking into account the specifics of the situation. Outcomes that are more likely to occur in one scenario than another are diagnostic; knowing that the outcome occurred should allow greater confidence in the likelihood of its more probable scenario.

Although judgment research is not always framed as being about Bayesian reasoning, many fields inherently cover these topics, including attitude formation and social cognition. For an everyday example, consider the weather. If we were not to rely on meteorologists, it would still be possible to make judgments about the likelihood of rainfall. Depending on where you live, rain occurs more or less frequently; this can be thought of as the base rate. Living in Florida, the likelihood that a randomly selected day will have rain is higher than in the deserts of Nevada, which have a lower base rate of rain. It is not equally likely to rain each day, however. The presence of dark clouds may be seen as one piece of diagnostic information indicating that rain will occur. Although dark clouds do not guarantee rain, and rain is possible on a day with clear skies, rain is more likely to occur given the former context than the latter. If a particular town receives rainfall on 40% of days, a resident should adjust their estimate upwards, to above 40%, when estimating the likelihood of rain on a day with dark clouds overhead. Other factors, such as the amount of wind or the sound of thunder, may also provide information about the likelihood of rain, although these factors may not be equally informative. Information that is more strongly diagnostic should have a greater impact on likelihood estimates than less diagnostic information.
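The rain scenario above can be sketched numerically. The base rate (40%) comes from the text; the specific likelihoods of seeing dark clouds on rainy versus dry days are invented here purely for illustration:

```python
# Hypothetical likelihoods: dark clouds hang over 80% of rainy days but
# only 20% of dry days; the town's base rate of rain is 40%.
p_rain = 0.40
p_clouds_given_rain = 0.80
p_clouds_given_dry = 0.20

# Bayes' rule: weight each likelihood by its base rate, then normalize.
p_rain_given_clouds = (p_clouds_given_rain * p_rain) / (
    p_clouds_given_rain * p_rain + p_clouds_given_dry * (1 - p_rain)
)
print(round(p_rain_given_clouds, 3))  # 0.727, well above the 40% base rate
```

As the text predicts, the estimate moves upward from the 40% base rate once the diagnostic cue is observed.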

Early Bayesian research explored this using scenarios such as drawing marbles from one of two jars. Each of these jars would contain marbles of a variety of colors, but with a different distribution in each jar. Participants were then told the outcome of randomly selecting a marble from one of these two jars and asked to estimate the likelihood that it was selected from each jar. The color of the marble was meant to be diagnostic: for instance, if the first jar has a higher proportion of red marbles than the second jar, then drawing a red marble would suggest that the first jar had been selected.

Performance on these tasks could be measured by looking at whether or not participants provided responses consistent with Bayes’ formula. In the formula, A denotes the scenario of interest (e.g., having selected a specific jar), B denotes the alternative scenario, and e denotes the outcome (e.g., drawing a red marble). Bayes’ formula is as follows:

P(A|e) = [P(e|A) × P(A)] / [P(e|A) × P(A) + P(e|B) × P(B)]

In simpler terms, Bayes’ formula divides the likelihood that the outcome occurred due to the scenario of interest by the total likelihood that the outcome would occur under either scenario, weighted by the base rates. Bayes’ formula provides a mathematical model for integrating all of the information to reach a normative response. This normative response is seen as the correct answer, and divergences from this response are considered breaches of rational human judgment.
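The two-scenario form of Bayes’ formula can be written as a short function. This is an illustrative sketch, not part of the study; the jar proportions used in the example are made up:

```python
def posterior(p_a, p_e_given_a, p_e_given_b):
    """P(A | e) when A and B are the only two scenarios, so P(B) = 1 - P(A)."""
    numerator = p_e_given_a * p_a
    return numerator / (numerator + p_e_given_b * (1 - p_a))

# Jar example with invented numbers: jar A is 70% red, jar B is 30% red,
# and each jar is equally likely to be chosen (base rate .5 each).
print(round(posterior(0.5, 0.7, 0.3), 3))  # 0.7
```

With equal base rates the posterior tracks the likelihood ratio alone; uneven base rates, as in the problems below, pull the answer away from that intuition.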

Over time Bayesian reasoning tasks have become standardized in the form of a few typical word problems. One of them is the taxicab problem (Tversky & Kahneman, 1982):

“Two cab companies operate in a given city, the Blue and the Green (according to the color of cab they run). 85% of the cabs in the city are Blue, and the remaining 15% are Green. A cab was involved in a hit-and-run at night. A witness later identified the cab as a Green cab. The court tested the witness’ ability to distinguish between Blue and Green cabs under nighttime visibility conditions. It found the witness able to identify each color correctly about 80% of the time, but confusing it with the other color about 20% of the time. What do you think are the chances that the errant cab was indeed Green, as the witness claimed?”

These Bayesian word problems have the advantage that they explicitly provide all of the information necessary to reach the normative Bayesian solution: the base rate, the false positive rate, and the false negative rate. This item asks for the probability that the cab was green (the scenario of interest, A) given that it was reported as green (the outcome, e). Using the numerical information provided within the problem, Bayes’ formula becomes

P(A|e) = (.8 × .15) / [(.8 × .15) + (.2 × .85)]

which can be further simplified to

P(A|e) = .12 / (.12 + .17)

Comparing the numbers in the denominator reveals a surprising truth: it is more likely that the witness reported the cab as green mistakenly (i.e., a false positive) than as an accurate assessment (i.e., a true hit). Because of the disproportionate base rate, in which blue taxis are much more prevalent, the true likelihood that the taxi is green is only 41.4%, despite the witness being correct 80% of the time in assessing both green taxis and blue taxis. The mammography problem (Gigerenzer & Hoffrage, 1995) provides an even more extreme example of this scenario.

The probability of breast cancer is 1% for a woman at age forty who participates in routine screening [base-rate]. If a woman has breast cancer, the probability is 80% that she will get a positive mammography [hit-rate]. If a woman does not have breast cancer, the probability is 9.6% that she will also get a positive mammography [false-alarm rate]. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer? _%.

Here the base rate is even more one-sided, with ninety-nine times as many women being cancer-free as having breast cancer. This causes the likelihood that a woman receives a positive mammogram despite not having breast cancer to dwarf the likelihood of both having breast cancer and receiving a positive mammogram. Although a positive mammogram is indicative of breast cancer, Bayes’ formula shows that under eight percent of women who receive a positive mammogram actually have breast cancer. This result may seem underwhelming, but it is important to keep in mind that it is many times greater than the initial one percent likelihood of breast cancer. Furthermore, although most positive mammogram results occur due to a false alarm, extreme confidence can be had in a negative mammogram, as the likelihood of breast cancer given a negative mammogram is a minuscule .0022.
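Both worked problems use the same two-hypothesis posterior, so they can be checked with one small function. This is a sketch for verification only; the numbers are those stated in the problems above:

```python
def posterior_given_evidence(prior, hit_rate, false_alarm_rate):
    """P(hypothesis | evidence) for a binary hypothesis."""
    p_evidence = hit_rate * prior + false_alarm_rate * (1 - prior)
    return hit_rate * prior / p_evidence

# Taxicab problem: 15% of cabs are Green, witness is 80% accurate either way.
print(round(posterior_given_evidence(0.15, 0.80, 0.20), 3))   # 0.414

# Mammography problem: 1% base rate, 80% hit rate, 9.6% false-alarm rate.
print(round(posterior_given_evidence(0.01, 0.80, 0.096), 3))  # 0.078

# Negative mammogram: the likelihoods become P(negative | cancer) = .20
# and P(negative | no cancer) = 1 - .096 = .904.
print(round(posterior_given_evidence(0.01, 0.20, 0.904), 4))  # 0.0022
```

The three outputs match the 41.4%, under-eight-percent, and .0022 figures discussed in the text.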

Research has consistently shown deficits in Bayesian reasoning ability, but not always for the same reason. Early research, which asked participants to make successive likelihood approximations (e.g., Peterson & Miller, 1965; Edwards, 1968), concluded that participants were conservative in their judgments, adjusting their estimates less than is justified by the introduction of new diagnostic information. Put another way, people anchored their estimates too strongly to their prior beliefs, or the base rate. However, research conducted using Bayesian word problems has consistently demonstrated base rate neglect, the process of ignoring or undervaluing base rate information. Whether through conservatism or base rate neglect, participants fail to effectively integrate new information with a known base rate, leading the field of psychology to adopt the position advocated by Kahneman and Tversky (1972) that “in his evaluation of evidence, man is apparently not a conservative Bayesian: he is not Bayesian at all” (p. 450).

Given the prevalence of non-normative Bayesian responses, there has been interest in understanding the techniques that people use to formulate likelihood estimates. Cohen and Staub (2015) discovered a variety of different strategies in use. The model which provided the best fit was the additive model, more so than the multiplicative model or a Bayesian model. The additive model involves combining two numbers from the task, typically by subtracting the false positive rate from the true positive rate. Although Cohen and Staub (2015) found high levels of between-subject variability, individual participants were consistent in their strategy across trials, evidence that they were not answering randomly or haphazardly.
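The gap between the additive strategy and the normative answer can be made concrete with the mammography numbers. This sketch assumes the additive model takes exactly the form described above (true positive rate minus false positive rate); Cohen and Staub (2015) fit several variants:

```python
def additive_estimate(hit_rate, false_alarm_rate):
    # Additive strategy: difference of two numbers stated in the problem.
    return hit_rate - false_alarm_rate

def bayes_estimate(prior, hit_rate, false_alarm_rate):
    # Normative Bayesian posterior, for comparison.
    numerator = hit_rate * prior
    return numerator / (numerator + false_alarm_rate * (1 - prior))

# Mammography problem numbers: the heuristic ignores the base rate
# entirely, so it lands far from the normative answer.
print(round(additive_estimate(0.80, 0.096), 3))     # 0.704
print(round(bayes_estimate(0.01, 0.80, 0.096), 3))  # 0.078
```

Because the additive strategy never touches the base rate, it produces the same estimate whether 1% or 50% of women have breast cancer, which is one way base rate neglect can arise from a consistently applied rule.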

Before looking at the additive model as a potential mechanism by which people perform Bayesian reasoning, it is important to note an additional finding of Cohen and Staub (2015). Participants were asked to estimate their performance, and most stated that their performance was either somewhat or very poor. Participants’ low levels of confidence suggest that although they used a strategy consistently, they did not actually believe in its efficacy. As such, it is unclear whether different strategies relate to meaningful individual differences or whether, due to low task comprehension, a strategy is adopted pseudo-randomly and maintained only through an inability to formulate a better strategy. It may be that people are overwhelmed by the amount of new information presented in Bayesian reasoning tasks.

Numerical Format

Perhaps the most studied variant of Bayesian reasoning problems concerns the difference in performance between probabilities, expressed as percentages, and natural frequencies. The examples given previously were presented using probabilities. For instance, the mammogram problem states that 1% of relevant women have breast cancer. Although mathematically equivalent, frequencies differ from probabilities by having a reference number. For example, instead of stating that 99% of women do not have cancer, as a frequency it would say that 99 out of 100 women do not have cancer. Instead of stating a single-event likelihood, frequencies show on a population level what proportion of people fit into a certain class (e.g., being cancer-free). It is important to distinguish between relative frequencies and natural frequencies. Relative frequencies are direct transformations of probabilities, but natural frequencies include nested-set relations between variables and contain base rate information (Hoffrage, Gigerenzer, Krauss, & Martignon, 2002). Consider the following example of the mammogram problem, presented using natural frequencies.

10 out of every 1000 women at age 40 who participate in routine screening have breast cancer. 8 out of every 10 women with breast cancer will get a positive mammography. 95 out of every 990 women without breast cancer will also get a positive mammography. Here is a new representative sample of women at age 40 who got a positive mammography in a routine screening. How many of these women do you expect to actually have breast cancer?

Using natural frequencies, base rate information is integrated into the statement of diagnostic validity. By stating the number of women who will receive a positive mammogram out of ten women with breast cancer, in comparison with a sample of 990 women without breast cancer, the low prior probability of breast cancer is accounted for.

Performance improves drastically when information is presented using natural frequencies rather than percentages (Hoffrage & Gigerenzer, 1998). However, there is debate as to why natural frequencies improve performance, with two competing hypotheses: ease of computational processing and evolutionary information processing.

One hypothesis is that natural frequencies provide greater ease of computational processing. In order to calculate a normative Bayesian response it is necessary to find the likelihood of the given outcome (e.g., a positive mammogram) occurring under each condition. Natural frequencies, which include base rate information, provide this information automatically. Using probabilities, application of Bayes’ formula would require participants to multiply (.99 × .096) and (.01 × .8), but when the information is provided using natural frequencies it is possible to skip this step and instead read in the question itself that in a sample of 1000 women, 95 will receive a false positive and eight a true positive. A simple comparison of these two numbers makes apparent that positive mammograms are false positives more often than true hits and, therefore, a woman who receives a positive mammogram probably does not have cancer. According to this argument, natural frequencies do not improve Bayesian reasoning so much as limit the number of computational errors, by making some of the calculations for the participant.
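The computational shortcut can be shown side by side. Both routes below use the mammography numbers from the natural-frequency version of the problem; the point is that the frequency format hands the solver the two joint counts directly:

```python
# Probability format: the solver must first derive each joint probability.
true_pos  = 0.01 * 0.80    # P(cancer) * P(positive | cancer)
false_pos = 0.99 * 0.096   # P(no cancer) * P(positive | no cancer)
print(round(true_pos / (true_pos + false_pos), 3))  # 0.078

# Natural-frequency format: the joint counts are stated in the problem
# itself (per 1000 women screened), so no multiplication is needed.
true_pos_count, false_pos_count = 8, 95
print(round(true_pos_count / (true_pos_count + false_pos_count), 3))  # 0.078
```

The final division is identical; the frequency format simply removes the two multiplications where errors could occur.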

A competing hypothesis states that people have evolutionarily adapted to work with natural frequencies. The concept of probability is relatively new, and probabilities are not readily apparent in the environment. As such, the brain has not developed to deal with such numbers, and people have conceptual difficulties in understanding how to process the information. Natural frequencies, on the other hand, are a simple counting statistic. Instead of thinking in terms of probabilities, early humans may have learned to try hunting in a certain location because they were successful on four out of six previous attempts, for example. Natural frequencies can be more naturally learned and updated than probabilities, which exist as a more abstract mathematical concept. Thus, a second argument is that poor performance can be attributed to an artificial presentation format: people do not store or process information as probabilities, and natural frequencies increase the ease of understanding the data.

I propose that people use mental representations to form imprecise approximations when making Bayesian likelihood judgments. Normative Bayesian judgments require complex calculations, but such a process demands intensive cognitive resources. In typical decision-making scenarios numerical information is not available or accessible, so people may have developed heuristics to help evaluate event likelihoods. If people are not accustomed to making likelihood judgments using explicit calculations, numerical information may lead to suboptimal performance. People are likely to be more comfortable, and to exhibit better performance, when they are able to use internal schemas to make rough estimates. As such, Bayesian word problems, which present quantitative information for unfamiliar scenarios, may exaggerate the extent to which people are irrational Bayesian reasoners. I expect that people would exhibit better, though not fully Bayesian-optimal, performance when solving problems for which they can use pre-existing mental representations to aid in reaching a likelihood estimate.

Dual-Process Reasoning

Cognitive psychologists have proposed a dual-process model of reasoning (Evans, 2003). System 1 processing is viewed as simple, unconscious, and automatic. Although System 1 processing is not thorough, it holds the advantage of requiring few cognitive resources. In contrast, System 2 processing is explicit, thorough, and requires conscious effort. System 2 processing is necessary for abstract thinking and computation, whereas System 1 uses learned associations to quickly make judgments. System 2 is believed to have evolved more recently than System 1, which includes innate behaviors (Evans, 2003).

Application of Bayes’ formula requires the ability to consider hypothetical situations and engage in the cognitively demanding conscious process of performing numerical calculations. These behaviors require use of System 2 processing (Evans, 2003), making it optimal for determining the normative response. However, due to the cognitive load required by System 2 processing, as well as its relatively recent evolution, humans have adapted to use System 1 processes to make most decisions. These processes, labelled fast and frugal (Gigerenzer & Goldstein, 1996), serve to make decisions quickly and efficiently. By using only one or two pieces of information, people may reach decisions that are normatively irrational, but which provide sufficient accuracy. Given that people do not have the cognitive resources and computational capabilities to use System 2 processing for all decisions, fast and frugal reasoning helps optimize the decisions that people can realistically make.

Heuristics

Fast and frugal reasoning has been explored in relation to Bayesian reasoning. Martignon, Vitouch, Takezawa, and Forster (2003) were able to develop a model that exhibited high performance when presented with a Bayesian task using natural frequencies. Unlike traditional Bayesian reasoning tasks, which require likelihood estimation, the Martignon et al. (2003) model asked for a binary decision, as in the case of whether or not to administer surgery. It also differs in that, rather than providing diagnostic validity information for a single cue, as in the mammogram problem, seven different cues and contingency information were provided. It is not clear to what extent their model mimics human processes, but it demonstrates that fast and frugal processes may be valid when making decisions in a Bayesian manner. Unfortunately, because of these discrepancies in the task, Martignon et al. (2003) does not show that fast and frugal processes are effective in obtaining a numerical likelihood estimate when presented with a single piece of base rate information and diagnostic information, as has been the norm in research.

Given the prevalence of System 1 processing, research needs to be conducted that allows it to act. That means designing a task in which the information is not abstract, but instead one for which people have associations to automatically refer to.

One scenario that examines Bayesian reasoning without providing all of the information is the lawyer-engineer problem. In this problem, participants are given the following description (Kahneman & Tversky, 1973):

Jack is a 45-year-old man. He is married and has four children. He is generally conservative, careful, and ambitious. He shows no interest in political and social issues and spends most of his free time on his many hobbies which include home carpentry, sailing, and mathematical puzzles.

Participants are then asked to evaluate the likelihood that Jack is either a lawyer or an engineer. The description provided is much more representative of people’s schemas about an engineer than a lawyer, and so it is viewed as more likely that Jack is an engineer. Participants in this problem displayed base rate neglect, with little difference in estimates whether participants were told that Jack was drawn from a population with 70 engineers and 30 lawyers or vice versa (Kahneman & Tversky, 1973). Although, given the lack of clarity regarding the strength of the diagnostic information, it is impossible to formulate a normative answer, the lack of responsiveness to base rate information indicates a departure from normative Bayesian reasoning. In a clearer case of over-adherence to representativeness, Kahneman and Tversky (1973) found that although participants relied on base rate information when no description was given, if an unhelpful (i.e., undiagnostic) description was given, participants estimated that it was equally likely that the described person was a lawyer as an engineer.

One reason that people may have shown base rate neglect in the lawyer-engineer problem is their expectations about the relevance of the information provided. In film, unlikely events will often be foreshadowed and character traits will be revealed through subtle cues. Schwarz, Strack, Hilton, and Naderer (1991) argue that although cognitive psychologists may view certain information as undiagnostic or irrelevant, people expect that the information would not have been mentioned if they were not meant to glean something from it. Schwarz et al. (1991) found that although the problem is treated as a purely statistical one, participants do not approach it that way. Base rate neglect was significantly reduced on the lawyer-engineer problem when participants were told to treat the problem as a statistician would, as opposed to as a researcher more generally. Thus, although the normative response is reached by adherence to Bayes’ formula, it is likely that participants make their own inferences about the relevance of different pieces of information.

Regardless of the concern for base rate neglect, there is some evidence that people effectively use inferred diagnostic information. Mastropasqua, Crupi, and Tentori (2010) asked participants to state the likelihood that someone meeting certain diagnostic criteria (e.g., usually applies eye makeup) was male or female. Then, by asking what proportion of men and women meet the criteria, Mastropasqua et al. (2010) were able to reverse-engineer the Bayesian formula. That is to say, instead of providing all of the information necessary for Bayes’ formula, participants’ estimates of the diagnostic validity were used to reach a Bayesian-optimal estimate for each participant. There was a high correlation between participants’ likelihood estimates and the normative estimates, given the perceived validity of the diagnostic criteria. This provides evidence that people are Bayesian when making likelihood judgments using inferred diagnostic information. However, Mastropasqua et al. (2010) specified that the population contained an equal number of men and women, effectively eliminating the need to consider base rate information. The proposed study will add to Mastropasqua et al. (2010) by examining performance using inferred diagnostic information along with an uneven base rate.
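The reverse-engineering step can be illustrated with a short sketch. The function rebuilds the normative posterior from a participant’s own diagnostic estimates; the specific estimate values below are hypothetical, not data from Mastropasqua et al. (2010):

```python
def inferred_posterior(p_cue_given_female, p_cue_given_male, p_female=0.5):
    # Normative P(female | cue) computed from a participant's own estimates
    # of the diagnostic criterion; the base rate is fixed at .5, mirroring
    # the equal-population instruction in Mastropasqua et al. (2010).
    numerator = p_cue_given_female * p_female
    return numerator / (numerator + p_cue_given_male * (1 - p_female))

# Hypothetical estimates from one participant: 60% of women and 2% of
# men usually apply eye makeup.
print(round(inferred_posterior(0.60, 0.02), 3))  # 0.968
```

Comparing each participant’s stated likelihood against their own normative value computed this way is what allowed the authors to assess Bayesian coherence without ever presenting explicit statistics.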

Less research has been conducted in which the base rate is implied, perhaps because of the difficulty of such a task. Whereas representativeness can be used to explain why people may give normatively incorrect responses to the lawyer-engineer problem, availability serves as a better explanation of perceptions of base rates. According to the availability heuristic, people make judgments based on how easily instances of different events come to mind. For instance, Tversky and Kahneman (1973) found that participants estimated that more words exist with the letter “r” as the first letter (e.g., rook) than as the third letter (e.g., more), although the opposite is true. This phenomenon is credited to the manner in which words are typically organized, by first letter, as in a dictionary. As such, words beginning with any particular letter, including “r”, can be retrieved more easily than words with that letter in the third slot. In practice, people make assumptions about the base rate based on the availability of competing alternatives.

As a more real-world example, consider the safety of flying as compared to driving. You are more likely to survive a car accident than a plane crash, but because of the exceeding rarity of plane crashes you are more likely to die on the way to the airport than in the air, not that either event is very likely. The common perception that driving is safer than flying could be seen as yet another case of base rate neglect, ignoring the base rates (i.e., the likelihood of an accident) in favor of the diagnostic information (i.e., the likelihood of death given an accident). A better explanation would account for people’s beliefs about the likelihood of each event. Plane crashes, when they do occur, garner international exposure, whereas car accidents are unlikely to make even the local news. Because of this, people likely overestimate the relative frequency of plane crashes, causing planes to seem more dangerous than they are in reality.

The availability heuristic makes judgment tasks particularly difficult to study. As demonstrated above, the availability heuristic may lead to vast departures from normative responding. Oftentimes this can be attributed to media exposure, but personal experience can also drastically affect availability. Someone who has lost a relative in a car crash is likely to be acutely aware of the risks involved with driving. To account for this, research examining the availability heuristic has traditionally made comparisons to nonexistent entities for which memories are unavailable (e.g., blue and green taxi companies). If it is expected that participants should use base rate information when making judgments then, to strengthen external validity, it is necessary to look at scenarios in which this information is previously known.

Nelson, Biernat, and Manis (1990) conducted a study looking at everyday base rates.

Nelson et al. (1990) showed participants photos of men and women, asking the participants to make estimates about the heights of the people in the photographs. No numerical information was provided, but it was expected that participants would use the photographs (i.e. diagnostic information) along with prior beliefs about mean height differences between the sexes (i.e. base rate information) in making judgments. In accordance with Bayesian theory, Nelson et al. (1990) found evidence that people utilized sex information when making height judgments, especially when the photograph displayed a sitting subject, a condition in which the diagnostic information was weaker. This effect was present even when participants were provided with a monetary incentive for performance or were told not to stereotype. In a second experiment, Nelson et al.

(1990) manipulated the photographs so that every picture of a male was paired with a picture of a

female of equal height, eliminating any differences in height distribution. Even when participants

were told this they still estimated the men as being taller, on average, than the women. It seems

that, far from being neglected, inferred prior beliefs are incredibly persistent and

affect judgments even when explicit instruction is given to ignore them.

The present study looks to build external validity by asking participants to make

Bayesian judgments in a scenario more similar to everyday decision making. Whereas previous research

has presented some or all of the information numerically, the present study will require

participants to make inferences to determine both the base rate information and diagnostic

validity. By using inferred, rather than explicitly stated, base rates and diagnostic validities the

present study will ask participants to make judgments using information that has been stored

automatically. The drawback to this approach is that, due to differing beliefs as to the base rate

and diagnostic validity, it is not possible to calculate a single normative response. To ascertain

this information, participants will also be asked to provide personal estimates of the base rate and the diagnostic validity. Applying Bayes’ formula to the estimates will provide the normative response for each participant.
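As a sketch of that computation (the function name and the illustrative numbers are mine, not taken from the study), the normative response for a participant can be derived from their subjective population ratio and party-proportion estimates via Bayes' rule:

```python
def normative_estimate(pop_ratio, p_party_small, p_party_large):
    """Bayesian-optimal likelihood (on the study's 0-100 scale) that a
    randomly drawn party member lives in the less populous state.

    pop_ratio     -- participant's estimate of how many times larger the
                     big state is (e.g. 4 means "four times the population")
    p_party_small -- estimated proportion (0-1) of the smaller state's
                     residents registered with the target party
    p_party_large -- the same estimate for the larger state
    """
    members_small = 1.0 * p_party_small        # party members per unit of small-state population
    members_large = pop_ratio * p_party_large  # party members per that same unit
    return 100.0 * members_small / (members_small + members_large)

# Hypothetical participant: "Texas is 4x Massachusetts; 60% of Massachusetts
# and 40% of Texas residents are Democrats" -> about 27.3
```

Each participant's deviation score is then simply the difference between their subjective likelihood estimate and this value.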

Using this information the present study hopes to provide clarity to the debate as to why natural frequencies improve performance. I believe that using inferred likelihoods will require participants to make judgments in a typical manner, as they have evolved to do. Thus, if natural frequencies have an evolutionary advantage, representing information in the form in which it is naturally stored, then performance with inferred statistics should be equivalent to performance with natural frequencies. However, if performance using inferred statistics is significantly better than with

natural frequencies, this may be seen as evidence that although natural frequencies increase the

transparency of the solution they still represent an unnatural and artificial model of information

processing that people are not adapted to dealing with.

One reason that base rate neglect is so persistent in the lawyer-engineer problem is that

the description provided (i.e. the diagnostic information) is much more salient than the base rate

information. The base rate is provided using simple numerical information, whereas the

description is vivid and lifelike. These two pieces of information require different cognitive

processes to interpret. The base rate information necessitates the use of analytical, System 2, processing. The participant must thoughtfully consider how the number of participants in each group compare to each other and the level of uncertainty necessitated. In contrast, the description can be processed quickly and automatically through the use of the representativeness heuristic.

Without any great effort a participant can conclude that the description is more fitting of an engineer than a lawyer and therefore that it is more likely to have been based on an engineer. The lawyer-engineer problem may not actually demonstrate base rate neglect so much as the

dominance of System 1 processing when in competition with System 2.

Causality

One reason that people struggle to effectively perform Bayesian reasoning is that they

have difficulty conceptualizing unlikely causes for events. Take the taxicab problem as an

example. It is fairly straightforward to imagine how seeing a green taxi would lead a witness to

report they saw a green taxi, but it is much less intuitive why a participant might see a green taxi

and report that it was blue. Although realistically a person may have difficulty discriminating

between blue and green taxis due to poor lighting, among other reasons, participants may think that logically if a witness reports that the taxi was green then there is no reason to second-guess the statement, as nothing should cause the witness to report a color other than the taxi's actual color.

Fernbach, Darlow, and Sloman (2011) asked participants to make both predictive and diagnostic likelihood judgments and found different patterns of the use of alternative causes.

Participants effectively utilized the alternative likelihood when estimating the likelihood of an outcome given the event (e.g. the likelihood that a taxi will be seen as green if it is, in actuality, green) but neglected the alternative when making the Bayesian diagnostic estimate (e.g. predicting the likelihood that a taxi that was seen as green was, in actuality, green).

In another study exploring the effect of causality judgments on likelihood estimates,

Krynski and Tenenbaum (2007) manipulated Bayesian word problems to provide an alternative explanation for the result, such as the possibility of a benign cyst, which is incorrectly diagnosed as indicative of cancer. It was found that by highlighting a plausible reason for the misdiagnosis people were more attentive to the potential for false positives and provided more normative responses.

Individual Differences

Some research has been conducted looking at who performs better on Bayesian reasoning tasks. For instance, Brase, Fiddick, and Harries (2006) provided a Bayesian reasoning task to different samples to determine if performance differed as a function of characteristics of the sample. In a series of experiments, Brase et al. (2006) found that students at a top-tier university performed better than students at a mid-tier university, honors students performed better than non-honors students at the same university, and performance increased when monetary compensation was provided. From this it was concluded that performance on the Bayesian reasoning task was a function of both the cognitive ability and the motivation of the participants.

However, Brase et al. (2006) did not directly assess any individual differences, instead showing how the sample, broadly speaking, may influence results.

Consistent with Brase et al.’s (2006) findings that participants assumed to be higher in ability performed better, Stanovich and West (1998a) demonstrated in a series of experiments that cognitive ability, as measured through Scholastic Aptitude Test scores, performance on the

Raven Advanced Progressive Matrices (Raven, 1962), and the Comprehension subscale of the

Nelson-Denny Reading Test (Brown, Bennett, & Hanna, 1981) predicted deviation from normative reasoning. Separate experiments examined an inductive preferences task, which asked participants to make a series of decisions based on strong prior evidence toward one conclusion and one piece of contrary anecdotal evidence (e.g. a car that has high reliability ratings, but with which a friend had a bad experience), and a more traditional Bayesian likelihood estimation task. Stanovich and West (1998a) found that increased cognitive ability was significantly related to greater preference for the initial, aggregate information over the later, more anecdotal, evidence (i.e. lower levels of base rate neglect). However, there were not consistent individual differences in performance on the Bayesian reasoning tasks. Perhaps something about the problem format occludes the normative response, preventing increased cognitive ability from enhancing performance.

Stanovich and West (1998a) also found a second predictor of performance in addition to cognitive ability — thinking dispositions. A thinking dispositions composite score was created by summing scores on an actively open-minded thinking scale and a counterfactual thinking scale, while subtracting scores for absolutism, dogmatism, and paranormal beliefs. Overall, the thinking dispositions composite measures whether or not a person is open-minded, with a worldview grounded in science, in contrast to being rigid in their beliefs. Although thinking dispositions did

not predict reasoning performance as strongly as cognitive ability, a consistent, positive, effect

was present (Stanovich & West, 1998a). This suggests that people are not cognitively incapable

of Bayesian reasoning, but rather that although the task is cognitively demanding, performance is

impacted by the level of exertion. People who engage in scientific thinking are more likely to

understand the various potential causes of an event and make nuanced decisions.

Cognitive Ability

To determine directly whether or not cognitive ability impacted performance on a

Bayesian reasoning task Stanovich and West (1998b) looked at performance in a sample of

undergraduate students. Stanovich and West (1998b) found that participants who viewed the

base rate to be relevant and successfully integrated the base rate information with the diagnostic

information scored higher than average on a series of cognitive ability measures. It appears likely that one

reason that performance is poor on Bayesian reasoning tasks is that many participants have

difficulty comprehending how the different pieces of information that are presented relate to

each other.

In favor of the hypothesis that performance is poor due to low comprehension is the

extensive research demonstrating that participants provide more Bayesian-optimal responses

when the information is presented using natural frequencies, rather than single-event

probabilities (e.g. Hoffrage & Gigerenzer, 1998). Natural frequencies make the relation between base rate information and within-condition likelihoods more transparent, decreasing the cognitive load of the task. However, there are individual differences in the extent to which presentation format affects performance. Although a natural frequency format led to high

performance in all participants, Chapman and Liu (2009) found that participants high in

numeracy, defined as “the ability to process basic probability and numerical concepts” (Peters et

al., 2006), benefitted more from the use of a natural frequency format. This result is surprising as

those high in numeracy are expected to be able to perform the necessary calculations regardless

of the way that information is presented. Chapman and Liu (2009) concluded that, regardless of presentation format, those with low numeracy lack the ability to perform Bayesian analyses.

Presentation Format

Previous research has found that people employ consistent strategies when solving

Bayesian reasoning problems. Cohen and Staub (2015) found that participants, across different problems, combined numerical information in discernible and consistent ways. Similarly,

Arnold and Anderson (2017) found that across trials of a repeated Bayesian reasoning task there

was a high level of within-participant consistency in levels of conservativism, with most

participants either consistently over-updating or under-updating. However, previous research has

not compared performance within-participants on tasks where all of the information provided is

numerical and where some or all of the information is inferred. The proposed study will be the

first to make this comparison.

The proposed study seeks to determine the extent to which performance using inferred

statistics is consistent with performance on traditional, numerical, Bayesian reasoning tasks.

Although there is reason to expect that there will be strong internal consistency, differences

should also be expected between inferred and numerical information. The participants in Cohen

and Staub’s (2015) study did not believe that their responses were the normative response. It is

possible that numerical information is not conducive to comprehension and that errors in

reasoning, such as base rate neglect, will be less prevalent using inferred statistics.

It is also important to consider who will benefit the most from the use of inferred

statistics. Participants who are low in cognitive ability are expected to have more difficulty

understanding probabilities and successfully applying Bayes’ formula. Participants who are low

in cognitive ability will be less likely to reach the normative response using deliberate System 2

processes. However, if it is possible to engage System 1 processes I expect that cognitive ability

will not impact performance. Because of this, cognitive ability should be a nonfactor when

inferred statistics, based on learned associations, are used.

The proposed study aims to build a Bayesian reasoning problem using inferred, rather

than explicit, base rate and diagnostic information. Much research has been conducted in which

both base rate and diagnostic information is provided numerically and although research has

looked at inferred base rates (Nelson et al., 1990) and inferred diagnostic information

(Kahneman & Tversky, 1973), no previous research has required participants to infer both

components. Previous research suggests that inferred information is processed differently than

explicitly stated information. The proposed study seeks to explore performance when the entirety

of the information is inferred.

For the base rate information, participants will be provided with two potential states.

Although it is highly unlikely that any participant knows the exact populations of states, they

should automatically understand the difference between a highly populated state (e.g. California) and a state with a relatively low population (e.g. Vermont). Previous research has provided base rate information numerically, but the inability to connect it to existing knowledge may limit comprehension. For instance, although it should be clear that you are more likely to randomly

select an engineer the higher the concentration of engineers in a group, participants may struggle

to think about the numbers as anything but numbers. By providing a background for the base rate information it is more easily digested and, thus, less likely to be neglected.

In order to promote a probabilistic interpretation, the problem should ideally present the

diagnostic information as being related to, but not the cause of, the relevant outcome. The

present study does so by using political affiliations. National elections bring to the forefront the

concept of “red states” and “blue states”, which can be seen as a piece of diagnostic information.

Take the state of Texas, as an example. Republican presidential candidates regularly win the

state of Texas, so it stands to reason that most Texans are also Republicans. However, it is not

the case that someone is Republican because they live in Texas, as there are clearly Republicans

in other states and few would argue that there are not Democrats in the state of Texas. By using

naturally occurring groups a framework exists in which people can conceptualize the relation

between variables. I believe that in this scenario participants will be able to conceptualize the

nested sets relation and understand that although someone from Texas is likely to be Republican,

being a Texan does not dictate that the person must be Republican.

Now consider the following question: If you were to select a random person from the national list of registered Democrats is it more likely that that person lives in Texas or in New

Hampshire? Although this question does not provide any numerical information, it has all of the components of a typical Bayesian word problem. The base rates of the two groups under consideration are extremely lopsided, with Texas having a much higher population than New

Hampshire. The fact that overall so many more people live in Texas than New Hampshire should dominate the decision-making process and make Texas the correct answer, despite the fact that it

may be viewed as a stronghold of Republican politics. If participants were to display base rate neglect, as has been found using other questions, then they would say that a Democrat is more likely to be selected from whichever state has a higher proportion of Democrats within its borders.
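A rough back-of-the-envelope calculation illustrates why the base rate should dominate here. The population figures are approximate census round numbers and the party proportions are my own illustrative assumptions, deliberately chosen to be unfavorable to Texas:

```python
# Approximate populations (2010 census, rounded) and ASSUMED Democrat
# proportions; the proportions are illustrative, not measured values.
tx_pop, nh_pop = 25_000_000, 1_300_000
tx_dem, nh_dem = 0.40, 0.55   # assume Texas leans Republican, New Hampshire Democratic

tx_democrats = tx_pop * tx_dem   # 10,000,000
nh_democrats = nh_pop * nh_dem   # about 715,000

# Even with a much lower Democrat proportion, Texas's sheer population
# makes a randomly drawn Democrat far more likely to be Texan:
p_texas = tx_democrats / (tx_democrats + nh_democrats)
```

Under these assumptions the probability that the sampled Democrat is from Texas exceeds 90%, so a participant answering "New Hampshire" would be neglecting the base rate.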

This same scenario can easily be played out with a variety of different states. For instance, the scenario could easily be changed to ask for the likelihood of selecting a Republican, perhaps by comparing California (a highly populated blue state) and Alaska (a red state with a low population). The scenario can even include swing states, such as Florida or Ohio, or states of varying populations but the same political leanings. Although a savvy participant may complicate the problem by taking into consideration additional factors such as voter turnout levels or the proportion of independent voters, this may also potentially occur in other Bayesian problems.

For instance, someone making a judgment in the mammogram problem may believe that high-risk women are more likely to receive a mammogram. It is expected that participants will consider only the population of the states (i.e. the base rate) and the state's voting tendencies in presidential elections (i.e. the diagnostic validity). Varying the states chosen makes it possible to assess the extent to which participants are responsive to shifts in these statistics.

Using states and political affiliations the present study attempts to extend the field of

Bayesian reasoning by presenting a scenario more similar to that which might be encountered in everyday life. A common criticism of Bayesian research is that the problems are unrealistic.

Although people are unlikely to ever have to make judgments about a person’s home state based on their political affiliation, this is more representative of the manner in which information is naturally categorized.

In addition, the present study adds to the literature by using an online sample. Sample demographics can potentially lead to large differences in performance (Brase et al., 2006). Most

of the prior research has used undergraduate student samples. However, college students are a

relatively small and homogenous population. This may affect performance on a Bayesian

reasoning task. For instance, college students are likely familiar with taking tests and may have

different attitudes toward new information. Amazon’s Mechanical Turk (MTurk), on the other hand, recruits participants from a wide range of backgrounds and educational levels, allowing for a more representative study of typical behavior, while still providing high-quality data (Buhrmester, Kwang, & Gosling, 2011; Casler, Bickel, & Hackett, 2013; Hauser &

Schwarz, 2016). By using MTurk the proposed study looks to generalize to a larger population than typical Bayesian research.

Hypotheses

Hypothesis 1: Performance will be superior with inferred statistics than with probabilities or natural frequencies. The main goal of the proposed study is to compare

Bayesian performance using inferred statistics to performance when base rate and diagnostic information is provided using either percentages or natural frequencies. It is expected that inferred statistics are more compatible with the dominant, more automatic System 1 processes and that this will lead to maximal performance. Because humans have adapted to making judgments automatically they will have less difficulty integrating the different factors when the information is inferred, allowing them to use heuristics to estimate relative frequencies. Previous research has shown that people are more likely to successfully utilize information when it is inferred (Nelson et al., 1990; Mastropasqua et al., 2010). The proposed study will be the first to explore how multiple pieces of inferred information are integrated.

Hypothesis 2: Level of education and numeracy will correlate more strongly with

performance when people are given probabilities or natural frequencies than when they use inferred statistics. Attaining a degree in higher education requires persistence and high cognitive ability, both of which have been found to impact performance on Bayesian reasoning tasks.

Higher education also encourages an open-minded thinking disposition. It is expected that higher levels of educational attainment will increase the likelihood that participants will consider the multiple possible reasons for a particular outcome, will have the ability to comprehend the problem and perform the calculations, and will have the willpower to carry out the necessary procedures. This will lead to improved performance on Bayesian word problems that use probabilities or natural frequencies, especially the more cognitively demanding probabilities. It is expected that this advantage will disappear when using inferred statistics, as people are likely to be equally adept at System 1 processing regardless of their education level or numeracy ability.

Hypothesis 3: Individual differences will predict level of conservativism. Arnold and

Anderson (2017) found that participants in a repeated Bayesian likelihood estimation task were consistent in their level of conservativism. Similarly, Cohen and Staub (2015) found that participants developed strategies which were then used consistently across a series of similar Bayesian reasoning tasks. It is expected that this within-participant consistency will persist across presentation formats.

Hypothesis 4: Performance will be superior with natural frequencies than with probabilities.

The present study looks to replicate previous research, which has found that people show greater divergence from Bayesian-optimal responding when information is presented using probabilities as opposed to natural frequencies (e.g. Hoffrage & Gigerenzer, 1998).

Hypothesis 5: Participants will exhibit mean base rate neglect across conditions. The present study looks to replicate previous research, which has found that participants exhibit base rate neglect (e.g. Kahneman & Tversky, 1973). It is expected that participants, regardless of condition, will make estimates that diverge from the base rate more than is Bayesian-optimal.


EXPERIMENT 1

Method

Participants. Four-hundred and twenty-five participants were recruited using Amazon’s

Mechanical Turk. Participation was restricted to workers who had an approval rating of at least 95% on 100 or more previous tasks and who were based in the United States. All participants who completed the survey were paid $0.50.

MTurk workers who responded to the task were given a link to a survey on Qualtrics.

Any participants who did not display competency on the first page of the survey were excluded from analyses. Participants were also excluded if they failed to pass an attention check item or if they gave the same estimate in each condition (typically 50). This resulted in a final effective sample of 343 participants.

The average age of the participants was 36.68 years (SD = 11.58). The sample was primarily

White (81.1%) and 52.5% of participants identified as female. Approximately two-thirds

(67.4%) had at least an associate’s degree.

Design. Experiment 1 examined the factor of information presentation format. There were three levels to this factor — inferred statistics, probability, and natural frequency. The factor varied within participants. Stimulus parameters other than information presentation format were counterbalanced as follows. For each participant, the inferred statistics condition referred to an Alabama/California comparison or a Massachusetts/Texas comparison. Also for each participant, the probability and natural frequency conditions (which cited generic state names

"State A" and "State B") incorporated data consistent with an Alabama/California and a

Massachusetts/Texas comparison, respectively, or with a Massachusetts/Texas and an

Alabama/California comparison, respectively. The dependent variable was the deviation of

participants’ subjective likelihood estimates from the Bayesian-optimal response.

Individual Difference Measures. Numeracy was measured using two items from the

Berlin Numeracy Scale (Cokely et al., 2012). These items were “Imagine we are throwing a five- sided die 50 times. On average, out of these 50 throws how many times would this five-sided die show an odd number (1, 3 or 5)?” and “Imagine we are throwing a loaded die (6 sides). The probability that the die shows a 6 is twice as high as the probability of each of the other numbers.

On average, out of these 70 throws how many times would the die show the number 6?” The other two items were not included because they required Bayesian reasoning to solve. Responses were coded dichotomously as either correct or incorrect, resulting in possible scores between zero and two.
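For reference, the normative answers to these two items follow from simple expected-value arithmetic. This is a sketch of the reasoning, not part of the survey materials:

```python
# Item 1: a fair five-sided die thrown 50 times; three of the five faces
# (1, 3, 5) are odd, so the expected count is 50 * 3/5.
item1 = 50 * (3 / 5)   # 30 throws

# Item 2: a loaded six-sided die where P(6) = 2p and each other face has
# probability p, so 2p + 5p = 1 and P(6) = 2/7; over 70 throws the
# expected count is 70 * 2/7.
item2 = 70 * (2 / 7)   # 20 throws
```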

Participants were classified dichotomously based on whether they provided diagnostic estimates similar to those used in the probability and natural frequency conditions. Participants who made an inferred estimate regarding Massachusetts and Texas were classified as providing similar estimates if they provided a subjective population ratio within two of the objective ratio and if their estimates of the proportion of Democrats in Massachusetts and Texas were both within fifteen points of the proportion used in the other conditions. Participants who made an inferred estimate regarding Alabama and California were classified as providing similar estimates if they provided a subjective population ratio within three of the objective ratio and if their estimates of the proportion of Republicans in Alabama and California were both within fifteen points of the proportion used in the other conditions. Overall, 112 participants were classified as similar, with an almost equal number having provided an inferred estimate for each state comparison.
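The classification rule can be sketched as follows. The function and its argument names are mine; the tolerances come from the text (a ratio tolerance of 2 for Massachusetts/Texas, 3 for Alabama/California, and 15 percentage points for the party-proportion estimates), and the objective ratio in the example is hypothetical:

```python
def is_similar(est_ratio, obj_ratio, ratio_tol, est_props, used_props,
               prop_tol=15):
    """True if a participant's subjective population ratio is within
    ratio_tol of the objective ratio AND each of their party-proportion
    estimates (in percentage points) is within prop_tol of the value
    used in the probability / natural frequency conditions."""
    ratio_ok = abs(est_ratio - obj_ratio) <= ratio_tol
    props_ok = all(abs(est - used) <= prop_tol
                   for est, used in zip(est_props, used_props))
    return ratio_ok and props_ok

# Massachusetts/Texas pairing with a hypothetical objective ratio of 4:
# is_similar(5, 4, 2, [60, 45], [60, 45]) -> True
# is_similar(9, 4, 2, [60, 45], [60, 45]) -> False (ratio off by 5)
```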

Procedure. The first page of the survey asked participants about their beliefs regarding the states of Alabama, California, Massachusetts, and Texas. Because participants who do not hold pre-existing opinions about differences in state population cannot use this information to infer differential base rates, participants were shown the pairs of states to be used in the survey and asked “Do you believe that [State 1] or [State 2] has a higher population?” In each pairing the state that is first in alphabetical order was listed first. Participants were asked to state how confident they were about the relative state population on a scale of -100 (absolute certainty that

State 1 has a higher population) to 100 (absolute certainty that State 2 has a higher population).

Competency was operationalized as picking the correct answer with a confidence of at least |50|. Participants who displayed competency for only one pairing were later asked to make an inferred estimate based on that pairing. Participants who did not display competency for either pairing were not included in analyses. For each state, participants were also asked “Which party is the state more likely to vote for?” and asked to state their level of confidence on a four-point scale with options of “Not At All”, “Slightly”, “Moderately”, and “Very”. Only participants who said that they were at least slightly confident about every state were considered to demonstrate competency and included in analyses. See Figure 1.

Figure 1. Competency check questions.

Following the competency check questions, participants were asked to complete three

Bayesian likelihood estimation tasks. One of these was presented using inferred statistics, one using probabilities, and one using natural frequencies. To counteract order effects, the order in which the questions were displayed was randomized.

For the inferred statistics question participants were shown one of two comparisons. In order to be consistent with the other Bayesian likelihood estimation tasks, in which base rate information is in direct competition with diagnostic information, each comparison featured a 30

more populous state (Texas or California) and a less populous state of the opposite political

leaning (Massachusetts or Alabama). Participants were told to imagine that they picked a random

person from the registry for the party that is more dominant in the less populous state. They were then told that it is known this person is from one of the two states and asked to estimate the likelihood that the person is from the less populous state. For instance, here is the prompt for the

Texas/Massachusetts example: “Imagine that you were to randomly select a name from the list of all registered Democrats in Massachusetts and Texas. What is the likelihood that the Democrat you picked lives in Massachusetts, rather than Texas? Zero indicates absolute certainty that he lives in Texas, 100 indicates absolute certainty that he lives in Massachusetts, and 50 means that you believe both options are equally likely. Please enter your response as a numeral.”

For the probability format question participants were asked to make a similar estimate,

but using generically named states with the information provided using statistics. In reality, the

information was designed to mimic the objective information for one of the pairs of states (i.e.

Alabama and California or Massachusetts and Texas), with the less populous state being labeled

as “State A” and the more populous state labeled as “State B”. Diagnostic information (i.e.

political affiliations) was determined by looking at the results of the 2016 presidential election

(Politico, 2016), with results rounded to the nearest 5%. The base rate information (i.e. relative

populations) was determined by looking at information from the 2010 U.S. census, again

rounded to the nearest 5%.

The natural frequencies format used the same information to determine estimates, making the Bayesian-optimal response mathematically equivalent. However, all participants were shown a different comparison in the natural frequencies condition than in the probability condition. In other words, if the probability format used numbers mimicking Alabama and California, the natural frequencies condition mimicked Massachusetts and Texas, and vice versa. Figure 2 provides an example of the stimuli that a participant may have seen.

Example of the Judgment Stimulus for One Participant (Conditions Presented in Random Order)

Inferred Statistics: Imagine that you were to randomly select a name from the list of all registered Democrats in Massachusetts and Texas. What is the likelihood that the Democrat you picked lives in Massachusetts, rather than Texas? 100 indicates absolute certainty that he lives in Massachusetts, zero indicates absolute certainty that he lives in Texas, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____

Probabilities: Imagine that you were to randomly select a name from the list of all registered Democrats in State A and State B. 20% of citizens live in State A, with the other 80% living in State B. 60% of the citizens of State A are registered Democrats and 45% of the citizens of State B are registered Democrats. What is the likelihood that the Democrat you picked lives in State A, rather than State B? 100 indicates absolute certainty that he lives in State A, zero indicates absolute certainty that he lives in State B, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____

Natural Frequencies: Imagine that you were to randomly select a name from the list of all registered Republicans in State A and State B. 100 out of every 1000 citizens live in State A. 65 out of every 100 citizens of State A are registered Republicans and 315 out of every 900 citizens of State B are registered Republicans. What is the likelihood that the Republican you picked lives in State A, rather than State B? 100 indicates absolute certainty that he lives in State A, zero indicates absolute certainty that he lives in State B, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____

Figure 2. Example stimuli for Experiment 1.
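Because both explicit formats encode the same statistics, the normative answers they imply can be checked with a few lines of arithmetic. The sketch below (the function names are mine, not from the study) applies Bayes' rule to the Figure 2 parameters:

```python
def posterior_from_probabilities(base_rate, diag_a, diag_b):
    """Bayesian posterior that the person is from State A, given the
    base rate of State A and the proportion of each state's citizens
    registered with the relevant party."""
    hit = base_rate * diag_a
    return hit / (hit + (1 - base_rate) * diag_b)

def posterior_from_frequencies(registered_a, registered_b):
    """Same posterior computed from natural frequencies: the counts of
    registered party members in each state, per common population unit."""
    return registered_a / (registered_a + registered_b)

# Probability format in Figure 2 (mimicking Massachusetts/Texas):
# 20% base rate; 60% vs. 45% registered Democrats.
print(posterior_from_probabilities(0.20, 0.60, 0.45))  # 0.25

# Natural-frequency format (mimicking Alabama/California): per 1000
# citizens, 100 live in State A, so 65 registered Republicans there
# versus 315 among State B's 900 citizens.
print(posterior_from_frequencies(65, 315))  # ~0.1711
```

With the Figure 2 numbers, the probability version yields 0.25 and the natural-frequency version 65/380 ≈ 0.1711, matching the two normative responses reported in the Results.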

Following completion of the three Bayesian likelihood estimation tasks, participants were asked questions relating to the base rate and diagnostic information. To determine the perceived base rate, participants were shown the following prompt: "[Texas] has a higher population than [Massachusetts]. How many times bigger do you believe the population of [Texas] is than the population of [Massachusetts]? For instance, if you believe the population of [Texas] is double the population of [Massachusetts] enter 2 or if you believe the population of [Texas] is five times that of [Massachusetts] enter 5." In order to determine diagnostic validity, participants were asked to estimate the percentage of people in each state that are registered with the party (e.g., the Democratic party) which was referred to in the earlier inferred Bayesian likelihood estimation task.

Numeracy was measured using two items from the Berlin Numeracy Scale (Cokely et al., 2012). These items were "Imagine we are throwing a five-sided die 50 times. On average, out of these 50 throws how many times would this five-sided die show an odd number (1, 3 or 5)?" and "Imagine we are throwing a loaded die (6 sides). The probability that the die shows a 6 is twice as high as the probability of each of the other numbers. On average, out of these 70 throws how many times would the die show the number 6?" The other two items were not included because they required Bayesian reasoning to solve. Responses were coded dichotomously as either correct or incorrect, resulting in possible scores between zero and two. Also included on this page was an attention check item asking participants to enter "41" as their answer. Participants who failed to do so were excluded from analyses.

Finally, after completion of the numeracy items, participants were asked demographic questions. Following completion of this page all participants were provided a code to verify completion of the survey, which could be entered on MTurk to receive compensation.

Results

Subjective Base Rates and Subjective Diagnostic Validity. Participants showed high levels of awareness of the relative populations of the states and their political leanings. The only state for which more than 10% of participants were only slightly confident of the political affiliation was Massachusetts. This may actually reflect greater political aptitude because, although a recent Gallup poll found Massachusetts to be considered the most liberal state in America (DeLuca, 2015), it has recently elected Republican governors. Even in this most uncertain of conditions, more than half of participants were very confident about Massachusetts' party of preference.

Participants' estimates of the relative populations of the states and the political leanings of the states are displayed in Table 1. Overall, participants showed sensitivity to differences between states, despite some discrepancies, such as understating the presence of the minority party. Although participants made estimates about the relative population of states using ratios, in the first two columns I have converted this information to the base rate probability of residing in the less populous state (e.g., a 4:1 ratio is a 20% probability).

Table 1. Experiment 1 diagnostic estimates

                                          Objective        Mean subjective   SD      Median subjective   t         p
                                          probability (%)  estimate (%)              estimate (%)
% Massachusettsan (Texas/Massachusetts)   20               20.48             8.71    20                    1.01    .311
% Alabaman (California/Alabama)           10               17.42             7.87    16.67                17.47    <.001
% Democrat (Texas)                        45               33.89             14.71   35                  -13.98    <.001
% Democrat (Massachusetts)                60               60.98             14.95   64                     .72    .473
% Republican (California)                 35               31.90             16.54   30                   -3.474   .001
% Republican (Alabama)                    65               65.23             18.90   70                     .23    .819
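The ratio-to-probability conversion described above can be expressed directly (a minimal sketch; the function name is mine, not from the study):

```python
def ratio_to_base_rate(ratio):
    """Convert a 'larger state is N times more populous' estimate into
    the probability of residing in the less populous state."""
    return 1 / (ratio + 1)

print(ratio_to_base_rate(4))  # 0.2  (a 4:1 ratio -> 20%)
print(ratio_to_base_rate(9))  # 0.1  (California vs. Alabama, true ratio 9:1)
```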

Participants estimated on average that 20.48% (SD = 8.71) of the combined population of Massachusetts and Texas resided in Massachusetts. This does not differ significantly from the true value, used in the non-inferred conditions, of 20%, t(352) = 1.01, p = .311. On average, participants estimated that, of the combined population of Alabama and California, 17.42% (SD = 7.87) of citizens are from Alabama. This is a significant overestimate of the relative population of Alabama compared to the reality that California has a population nine times as large as that of Alabama, t(342) = 17.47, p < .001. Because of limit effects, both estimates displayed a skewed distribution, with a median estimate lower than the mean. The median estimate for California and Alabama was that California is more populous by a magnitude of five, and the median estimate for Texas and Massachusetts was, accurately, four.

Inferred Probability of Residing in a Particular U.S. State. Performance in Bayesian reasoning was determined by comparing participants' subjective responses to the normative Bayesian response on the residency inference tasks. In the probability format and natural frequencies format, the normative response is the same for all participants: 17.11% for the comparison mimicking Alabama and California and 25% for the comparison mimicking Massachusetts and Texas. This is because in these conditions all participants saw the same diagnostic information, leaving no variance in the Bayesian-optimal response. However, for the question using inferred statistics the normative Bayesian response was calculated individually for each participant using the estimates they provided of the base rate (i.e., relative state populations) and the diagnostic validity (i.e., proportions of people in each state affiliated with the relevant political party). In the inferred statistics condition no diagnostic information was provided, and as such two people with different beliefs may provide different probability estimates that are both Bayesian-optimal. It is therefore necessary to account for differences in subjective beliefs when measuring performance accuracy.

Performance accuracy was operationally defined as the participant's subjective likelihood estimate minus the normative Bayesian response, referred to as deviation from Bayes (DFB). On this scale, zero indicates optimal performance. Positive values indicate that the participant overestimated the likelihood that the person was from the less populous state, displaying base rate neglect (i.e., aggressive updating). Negative values indicate that the participant made estimates which were too close to the initial base rate (i.e., conservative updating).
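The DFB measure defined above amounts to the following computation (a sketch with an illustrative response value; the function name is mine, not from the study):

```python
def deviation_from_bayes(subjective_estimate, base_rate, diag_a, diag_b):
    """Deviation from Bayes (DFB): the subjective likelihood estimate
    (0-100 scale) minus the normative Bayesian posterior for the same
    inputs. Positive = base rate neglect (aggressive updating);
    negative = conservative updating."""
    hit = base_rate * diag_a
    bayes = 100 * hit / (hit + (1 - base_rate) * diag_b)
    return subjective_estimate - bayes

# A hypothetical participant who answers 50 on the Massachusetts/Texas
# problem (normative answer 25) overshoots the posterior by 25 points:
print(deviation_from_bayes(50, 0.20, 0.60, 0.45))  # 25.0
```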

The average deviation from Bayes in the inferred statistics condition was 28.69 (SD = 25.10), with lower mean DFB scores in the natural frequencies condition (M = 22.53, SD = 22.98) and the probability condition (M = 18.51, SD = 23.73). Post-hoc paired samples t-tests found all differences to be statistically significant. See Figure 3. The larger DFB in the inferred statistics condition fails to support Hypothesis 1, but does not account for differences in diagnostic beliefs between conditions.

The results of only those participants who provided similar subjective estimates were very similar to those of the larger sample. As before, the largest mean DFB score was found in the inferred statistics condition (M = 32.63, SD = 20.91), with lower mean DFBs in the probability (M = 13.72, SD = 20.85) and natural frequency (M = 16.30, SD = 20.63) conditions. A one-factor repeated measures ANOVA found a significant effect of information format, F(2, 222) = 38.38, p < .001.

Individual Differences. To assess whether participants were consistent in their responses, correlations were run to see whether a participant's DFB in one condition was predictive of their DFB in other conditions. The inferred DFB was not significantly correlated with the probability DFB (r = .06, n = 341, p = .280) or the natural frequency DFB (r = .04, n = 340, p = .460). However, the probability DFB was significantly correlated with the natural frequency DFB (r = .32, n = 342, p < .001).

Forty-nine percent of participants answered the first numeracy item correctly and 31.5% answered the second numeracy item correctly; 23.0% of participants answered both items correctly, resulting in an average score of .80 (SD = .79). Numeracy scores were not correlated with level of education (r = .05, n = 341, p = .384).

Numeracy was examined as a predictor of DFB. Numeracy displayed a significant negative correlation with DFB using probabilities (r = -.27, n = 343, p < .001) and natural frequencies (r = -.21, n = 342, p < .001) and was non-significantly correlated with DFB using inferred statistics (r = -.05, n = 341, p = .364). Level of education showed a similar, though weaker, pattern of results. Level of education was significantly correlated with DFB using probabilities (r = -.16, n = 341, p = .004), but not with natural frequencies (r = -.08, n = 340, p = .144) or inferred statistics (r = -.05, n = 339, p = .371). These results are consistent with Hypothesis 2. Given the general tendency for people to make aggressive estimates, negative correlations indicate a reduction in aggressive responding. Although inferred statistics displayed the weakest correlation between DFB and numeracy, this difference was not statistically significant (z = -.80, p = .212). See Table 2.

Table 2. Experiment 1 correlation matrix

                        N     M      SD      1        2         3         4
1 DFB - Inferred        341   25.10  25.10
2 DFB - Probability     343   23.70  23.70    0.06
3 DFB - Nat. Freq.      342   22.93  22.93    0.04     0.32**
4 Education Level       341    4.27   1.31   -0.05    -0.16**   -0.08
5 Numeracy              343    0.81   0.79   -0.05    -0.27**   -0.21**    0.05

** p < .01.

In addition, the effects of condition, numeracy, and the interaction between condition and numeracy on DFB were tested using the univariate general linear model. When looking at all participants, condition, F(2, 1017) = 19.10, p < .001, numeracy, F(2, 1017) = 17.88, p < .001, and the interaction of the two, F(4, 1017) = 2.54, p = .039, were all significantly related to DFB. When the sample was narrowed to those participants who provided similar inputs, condition, F(2, 327) = 28.90, p < .001, and numeracy, F(2, 327) = 11.93, p < .001, continued to have a significant effect. However, the interaction between numeracy and condition, F(4, 327) = 0.69, p = .600, was not significantly related to DFB. See Figure 3.

Figure 3. Mean Deviation from Bayes, by numeracy score, for each condition in Experiment 1.


Discussion

Experiment 1 found that participants showed greater divergence from Bayesian-optimal responding in the inferred statistics condition than in traditional presentation formats. This contrasts with Hypothesis 1, which predicted superior performance using inferred statistics. This result suggests that poor performance in Bayesian reasoning tasks is not caused by information presentation format, but instead represents true deficits in Bayesian reasoning ability.

Hypothesis 2 received partial support. Although numeracy displayed a weaker relation with DFB using inferred statistics than probability or natural frequency information this effect was not statistically significant. Thus, it is not clear how information presentation format influences the relation between numeracy and performance. Hypothesis 3 also received partial support. DFB scores using probability and natural frequency information were significantly correlated with each other, but not with DFB using inferred statistics. This indicates some level of within-participant consistency, though the lower relation between inferred and non-inferred statistics may indicate that there are differences in how the types of information are evaluated.

Hypothesis 4, which predicted that performance would be superior using natural frequency than probability information, was not supported. Instead, Experiment 1 found the lowest mean DFB scores in the probability condition. It is not clear why this occurred given the extensive previous research indicating superior performance using natural frequencies.

Hypothesis 5, however, replicated previous research and found extensive base rate neglect. The mean DFB scores were positive in all conditions, indicating that participants’ estimates diverged from the base rate more than is Bayesian-optimal.

Although Experiment 1 provided initial evidence for differences in performance using inferred statistics, as opposed to probabilities or natural frequencies, it also lacked consistency in the diagnostic beliefs that were the inputs to performance. Although this was controlled for by looking at participants who provided similar estimates, it is possible that the differences in beliefs are representative of the amount of certainty that is typical. For instance, participants may have underestimated how many times more populous California is than Alabama due to a difficulty imagining extremely disproportionate base rates. To control for this, Experiment 2 was conducted, using the diagnostic beliefs from Experiment 1 to construct the problems in the probability and natural frequencies conditions. By doing so it was possible to ensure that these problems were representative of typical levels of uncertainty people experience when making inferred decisions.


EXPERIMENT 2

Because participants' beliefs about the relative populations of states and the political affiliation of each state did not match the numbers provided in the probabilities and natural frequencies conditions, a second experiment was conducted. This experiment sought to address concerns about differences in beliefs by using the subjective estimates provided in Experiment 1 to construct new likelihood estimation tasks using probabilities and natural frequencies. In doing so, Experiment 2 ensures that participants are making judgments using beliefs of a typical strength, and provides stronger evidence of how performance using inferred statistics compares to performance using probabilities or natural frequencies under typical levels of uncertainty.

Method

Participants. Four hundred and seventy participants were recruited using MTurk. As in Experiment 1, participants were paid $0.50 for completing the survey. Participation was restricted to workers located in the United States with an approval rating of 95% or higher on a minimum of 100 previous tasks, and who did not participate in Experiment 1.

Responses were excluded from analyses using the same criteria as in Experiment 1: if they did not pass an attention check item, if they provided the same estimate on all likelihood estimation tasks, or if they did not display competency in their beliefs about the relative populations or political leanings, as defined in Experiment 1. The only difference is that in Experiment 1, participants who displayed competency in one state population comparison, but not both, were included. In Experiment 2 only participants who displayed competency in both state population comparisons were included in analyses.

The final sample consisted of 354 participants. The average age, in years, was 38.21 (SD = 12.05). A slight majority (52.5%) of the participants identified as female and 81.1% identified as White.

Design. Experiment 2, as in Experiment 1, examined the factor of information presentation format. There were three levels to this factor: inferred statistics, probability, and natural frequency. The factor varied within participants. Stimulus parameters other than information presentation format were counterbalanced as follows. For each participant, the inferred statistics condition included both an Alabama/California comparison and a Massachusetts/Texas comparison. Also for each participant, the probability and natural frequency conditions (which cited generic state names "State A" and "State B") incorporated data consistent with an Alabama/California and a Massachusetts/Texas comparison, respectively, or with a Massachusetts/Texas and an Alabama/California comparison, respectively. The dependent variable was the deviation of participants' subjective likelihood estimates from the Bayesian-optimal response.

Individual Difference Measures. As in Experiment 1, numeracy was measured using two items from the Berlin Numeracy Scale (Cokely et al., 2012).

As in Experiment 1, participants were dichotomously categorized based on whether or not they provided subjective diagnostic estimates similar to those used in the non-inferred conditions. Participants were categorized as similar if they provided subjective population ratios within two of the test value for both comparisons and subjective political beliefs within fifteen points for all four. Ninety-eight participants were classified as providing similar estimates using this categorization.

Procedure. Experiment 2 used a nearly identical procedure to that of Experiment 1. There were two key differences. Most importantly, the probability and natural frequency problems were altered to use numbers relating to the median inferred diagnostic beliefs from Experiment 1. Median subjective estimates were used, rather than means, due to the skewed distributions. These numbers were rounded to the nearest five percent. For instance, California was previously estimated to be five times larger than Alabama, an effective base rate of 16.67%; a rate of 15% was used instead. The second difference in procedure is that in Experiment 1 each participant provided only one inferred likelihood estimate, either for Massachusetts and Texas or for Alabama and California, but not for both. In Experiment 2 all participants made an inferred likelihood estimate for both pairs of states. Figure 4 shows example stimuli.

Example of the Judgment Stimulus Set for One Participant (probability and natural frequencies conditions presented in random order)

Inferred Statistics (Massachusetts/Texas): "Imagine that you were to randomly select a name from the list of all registered Democrats in Massachusetts and Texas. What is the likelihood that the Democrat you picked lives in Massachusetts, rather than Texas? 100 indicates absolute certainty that he lives in Massachusetts, zero indicates absolute certainty that he lives in Texas, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____"

Inferred Statistics (Alabama/California): "Imagine that you were to randomly select a name from the list of all registered Republicans in Alabama and California. What is the likelihood that the Republican you picked lives in Alabama, rather than California? 100 indicates absolute certainty that he lives in Alabama, zero indicates absolute certainty that he lives in California, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____"

Probabilities: "Imagine that you were to randomly select a name from the list of all registered Democrats in State A and State B. 20% of citizens live in State A, with the other 80% living in State B. 65% of the citizens of State A are registered Democrats and 35% of the citizens of State B are registered Democrats. What is the likelihood that the Democrat you picked lives in State A, rather than State B? 100 indicates absolute certainty that he lives in State A, zero indicates absolute certainty that he lives in State B, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____"

Natural Frequencies: "Imagine that you were to randomly select a name from the list of all registered Republicans in State A and State B. 150 out of every 1000 citizens live in State A. 105 out of every 150 citizens of State A are registered Republicans and 255 out of every 850 citizens of State B are registered Republicans. What is the likelihood that the Republican you picked lives in State A, rather than State B? 100 indicates absolute certainty that he lives in State A, zero indicates absolute certainty that he lives in State B, and 50 means that you believe both states are equally likely. Please enter your response as a numeral. _____"

Figure 4. Example stimuli for Experiment 2.
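For reference, applying Bayes' rule to the Figure 4 parameters gives the normative answers implied by these stimuli (my arithmetic as a sketch; the text does not report these values directly):

```python
# Probability format: 20% base rate; 65% of State A and 35% of
# State B citizens are registered Democrats.
prob_posterior = (0.20 * 0.65) / (0.20 * 0.65 + 0.80 * 0.35)

# Natural-frequency format: per 1000 citizens, 105 of State A's 150
# and 255 of State B's 850 are registered Republicans.
freq_posterior = 105 / (105 + 255)

print(round(100 * prob_posterior, 2))  # 31.71
print(round(100 * freq_posterior, 2))  # 29.17
```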

Results

The pattern of subjective diagnostic estimates in Experiment 2 was very similar to the estimates provided in Experiment 1. Of the six diagnostic estimates, five had an identical median response between studies. Although participants estimated the relative size of populations as a ratio, these values have been converted to list the base rate probability of belonging to the less populous state (e.g., a 4:1 ratio is 20%). See Table 3.

Table 3. Experiment 2 diagnostic estimates

                                          Previous median  Median subjective  Mean subjective  SD      t       p
                                          estimate (%)     estimate (%)       estimate (%)
% Massachusettsan (Texas/Massachusetts)   20               16.67              18.81            7.67    -2.93   .004
% Alabaman (California/Alabama)           16.67            16.67              17.26            7.30     1.53   .128
% Democrat (Texas)                        35               35                 34.84            15.97     .19   .847
% Democrat (Massachusetts)                64               64                 60.43            16.21   -4.15   <.001
% Republican (California)                 30               30                 33.06            17.36    3.32   .001
% Republican (Alabama)                    70               70                 64.94            17.94   -5.31   <.001

Fifty-two percent of participants answered the first numeracy item correctly and 31.6% answered the second numeracy item correctly. The average total score on the numeracy measure was 0.84 (SD = .79).

As a means of verifying that the inferred statistics condition was equivalent to the probability and natural frequency conditions, despite having variability in the Bayesian-optimal response, estimates were graphically compared based on participants' estimates of the relative populations of states. Figure 5 shows how the Bayesian-optimal response differed as a function of subjective beliefs about how many times more populous Texas is than Massachusetts (which had a previous median estimate of four). As participants increased their estimates of how much more populous Texas is than Massachusetts, the Bayesian-optimal response tended to get smaller. The reference line represents the Bayesian-optimal solution given the diagnostic estimates used in the probability and natural frequencies conditions. The fact that it does not differ significantly from the estimates made given a subjective population ratio between two and six provides further evidence of the equivalence of conditions, and the fact that it nearly overlaps the median line for a subjective population ratio of four demonstrates that the results are not biased by participants' subjective political beliefs.

Figure 5. Distribution of Bayesian-optimal responses, by subjective population ratio, in the Experiment 2 Massachusetts-Texas inferred estimate condition.


However, as estimates of the base rate changed, participants also adjusted their subjective estimates of the likelihood that the chosen person was from the smaller state. As can be seen in Figure 6, although the Bayesian-optimal response differed as a function of subjective population beliefs, the DFB scores were similar. As such, I can say that although the inferred statistics condition has variability in the Bayesian-optimal response, unlike the probability and natural frequency conditions, this did not affect the results, and DFB scores can be compared across conditions.

Figure 6. Distribution of Deviation from Bayes scores, by subjective population ratio, in the Experiment 2 Massachusetts-Texas inferred estimate condition.

Because there was not a theoretical difference between the two inferred estimates made by each participant, the two DFB scores were averaged to reach an overall inferred DFB. The mean DFB was positive using inferred statistics, probabilities, and natural frequencies, indicating a tendency of participants to show base rate neglect and make aggressive likelihood estimates regardless of condition. The lowest DFB score was from the natural frequency condition (M = 20.75, SD = 22.11). This was significantly lower than the DFB in the probability condition (M = 28.85, SD = 22.89) and the inferred statistics condition (M = 30.04, SD = 20.68). Among participants who provided similar inputs, the lowest mean DFB was in the natural frequencies condition (M = 22.33, SD = 22.00), with higher DFB values in the probability (M = 27.95, SD = 20.94) and inferred statistics (M = 26.73, SD = 16.65) conditions.

These results, as in Experiment 1, fail to support Hypothesis 1, which predicted that performance would be superior when making inferred estimates. In contrast, the lowest DFB values were found when making natural frequency estimates, and the inferred statistics condition and probability condition produced similar results.

Numeracy was significantly correlated with level of education (r = .16, n = 353, p = .003). Numeracy was not significantly correlated with DFB using inferred statistics (r = -.06, n = 352, p = .220), probabilities (r = .01, n = 354, p = .903), or natural frequencies (r = -.04, n = 353, p = .505). Level of education was not significantly correlated with any DFB score, with the largest correlation existing between it and the inferred Massachusetts/Texas DFB (r = -.09, n = 353, p = .078).

As in Experiment 1, the effects of condition, numeracy, and the interaction between condition and numeracy on DFB were tested using the univariate general linear model. When looking at all participants, condition, F(2, 1050) = 17.08, p < .001, was significantly related to DFB. Numeracy, F(2, 1050) = 1.16, p = .314, and the interaction between numeracy and condition, F(4, 1050) = 0.68, p = .607, were not significantly related to DFB, though. Among participants who provided similar inputs, condition, F(2, 285) = 1.82, p = .164, numeracy, F(2, 285) = 0.85, p = .429, and the interaction between numeracy and condition, F(4, 285) = 1.07, p = .370, all failed to achieve statistical significance. See Figure 7.

Figure 7. Mean deviation from Bayes, by numeracy score, for each condition in Experiment 2.

These results fail to provide support for Hypothesis 2, which predicted that numeracy and level of education would more strongly affect performance using probabilities and natural frequencies than using inferred statistics. Numeracy and level of education were not predictive of DFB in the probability or natural frequencies conditions. Although there is not a clear relationship between numeracy or level of education and DFB in the inferred condition, given the lower correlations in the other conditions I cannot say that numeracy is less important when estimates are made using inferred diagnostic information.

To look for individual differences in responding, correlations were conducted between the various DFB scores. All correlations were positive and statistically significant, with the weakest correlation existing between the inferred statistics and probability conditions (r = .122, n = 352, p = .022). The correlation between the two inferred DFB scores (r = .37, n = 352, p < .001) was stronger than that between any two conditions, indicating greater within-condition consistency. See Table 4. This provides support for Hypothesis 3, as participants' DFB on one problem was predictive of their DFB on another problem, with the strongest relation existing when the condition remained constant.

Table 4. Experiment 2 correlation matrix

                        N     M      SD      1        2        3        4
1 DFB - Inferred        352   30.04  20.68
2 DFB - Probability     354   28.68  23.09    0.12*
3 DFB - Nat. Freq.      353   20.75  22.11    0.27**   0.12*
4 Education Level       353    4.33   1.27   -0.01     0.02    -0.08
5 Numeracy              354    0.84   0.79   -0.07     0.01    -0.04     0.16**

* p < .05. ** p < .01.

Discussion

Contrary to Hypothesis 1, which predicted superior performance using inferred statistics, Experiment 2 found that participants showed similar DFB scores in the inferred statistics and probability conditions, with lower DFB scores in the natural frequency condition. In conjunction with the results of Experiment 1, this provides strong evidence that poor performance in Bayesian reasoning tasks is not caused by information presentation format, but instead represents true deficits in Bayesian reasoning ability.

Hypothesis 2 was not supported. Neither numeracy nor level of education was predictive of performance in any of the three conditions. This suggests that mathematical ability was not important for likelihood estimation and that participants may rely on heuristics in all conditions rather than attempt to calculate the Bayesian-optimal response. Hypothesis 3, however, was supported. All DFB scores were significantly correlated, demonstrating within-participant consistency across conditions.

Hypothesis 4, which predicted that performance would be superior using natural frequency information rather than probability information, was supported in Experiment 2. As hypothesized, the mean DFB when using natural frequency information was significantly lower than when using probability information. This replicates previous research demonstrating superior performance when information is presented with natural frequencies. Hypothesis 5, which predicted positive DFB scores, was also supported, replicating previous research which has found extensive evidence for base rate neglect.



GENERAL DISCUSSION

In this research paper I have described two separate but related studies that were conducted to examine performance in Bayesian reasoning when relying entirely on inferred diagnostic information, a procedure which has not previously been used. In Experiment 1, participants made three estimates regarding the likelihood that a random person of a given party lived in one of two possible states. This study provided some evidence of differences in performance using inferred statistics, but served the primary function of providing typical diagnostic beliefs to be used in Experiment 2. Experiment 2 followed a similar procedure, addressing the potential confound of differences in initial beliefs by using the median subjective diagnostic estimates to create new probability and natural frequency likelihood estimation tasks.

Contrary to the primary hypothesis, performance was not improved through the use of inferred statistics. In both studies the largest deviation from Bayesian-optimal responding was found when people used inferred statistics, although this did not differ significantly from performance in the probability format condition in Experiment 2. This suggests that the low performance typically exhibited in the probability format condition may not be the result of difficulty understanding the task or discrepancies between personal beliefs and provided statistics, but instead is related to the way that people process information. As the first study to look at performance using entirely inferred statistics, the present study is the first to be able to clearly demonstrate this conclusion.

One possible explanation for the poor performance shown when using inferred statistics is that political affiliations are a more salient feature than the differential population sizes of different states due to the ability to apply it to stereotypes of a single individual. If this was true, use of the representativeness heuristic could explain the base rate neglect that was found in this 52

study. Given that the diagnostic and base rate information pointed toward different states,

participants likely defaulted to the diagnostic information, which they could apply directly to

the individual being judged. Doing so would result in subjective likelihood estimates that are more

closely anchored to the diagnostic information than is Bayesian-optimal and, thus, positive DFB

scores.
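To make the direction of this bias concrete, the sketch below applies Bayes' rule to hypothetical numbers (not the actual stimuli used in these experiments) and computes a DFB-style deviation score, under the assumption that DFB is the signed difference between a subjective estimate and the Bayesian-optimal posterior:

```python
def bayes_posterior(base_rate, p_party_a, p_party_b):
    """P(State A | party) from P(State A), P(party | A), and P(party | B)."""
    numerator = base_rate * p_party_a
    return numerator / (numerator + (1 - base_rate) * p_party_b)

# Hypothetical stimuli: State A holds 20% of the combined population,
# but 80% of its residents share the target party vs. 40% in State B.
posterior = bayes_posterior(0.20, 0.80, 0.40)  # Bayesian-optimal answer: 1/3

# A judgment anchored on the diagnostic 80% figure overshoots the posterior:
subjective = 0.60
dfb = subjective - posterior  # positive DFB indicates base rate neglect
```

Under these made-up values, the diagnostic information alone suggests State A, but the base rate pulls the normative answer down to one third; an estimate anchored near the diagnostic figure therefore yields a positive DFB.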

Another explanation is that, following Murphy (2003), people consolidate their beliefs

and behave according to the belief that the most likely outcome is necessarily the correct

outcome. Arnold and Anderson (2017) found evidence for this in the participants’ subjective

future likelihood estimates in which most participants made judgments about the likelihood of a

future roll based on either absolute certainty or uncertainty, rather than integrating dual

possibilities. In the current study it is possible that participants committed this same error in

saying that people of a particular affiliation are more typical of one state, so someone of that

political affiliation must be from that state. The non-inferred conditions may have exhibited better performance because they did not elicit such strong beliefs.

One major implication of these results is that people make likelihood judgments using probabilities and natural frequencies in a way that is equivalent to how they make these judgments using inferred beliefs. If participants did not understand how to interpret and make decisions using probabilities, they should have shown worse performance than when using inferred statistics. Instead, the similar levels of performance suggest that people manipulate and use probability and natural frequency information the same way that they would when relying entirely on subjective judgments. It is even possible that participants attempted to estimate the relevant probabilities when providing likelihood estimates in the inferred statistics condition, though there is no way to confirm this. Given my belief that participants rely on heuristics when

making inferred judgments, the equivalence of conditions suggests that participants apply these

heuristics to provided information. As such, this research shows that the poor performance

typically found in Bayesian likelihood tasks cannot be attributed to people altering their decision-making process significantly due to an artificial information presentation format.

A potential confound in the present study is that the inferred statistics condition presented the names of particular states, whereas the non-inferred conditions presented the states as “State

A” and “State B”. Although this introduces a modification beyond the independent variable of information presentation format, it was necessary given the research design. The inferred statistics condition required stimuli for which participants held existing beliefs; it would be impossible for participants to infer the population of a fictional state, for instance. In the non-inferred conditions the use of generic monikers was important to guarantee that participants were attending to the numerical information provided. If participants had been informed which states they were comparing, they might have relied on their inferred estimates rather than making a likelihood estimate using the probability or natural frequency information. As such, I believe that this slight divergence between conditions is justified. However, it is still necessary to consider the possible impact of this confound on the results. It is not clear how the difference in naming procedures would influence participants, but the use of generic monikers in the non-inferred conditions may have made the problem feel more abstract than when the names of specific states were provided. This should have, if anything, hindered performance in the non-inferred conditions. The fact that participants did not display superior performance in the inferred statistics condition despite this confound provides additional evidence that people are capable of evaluating provided statistics in the same manner in which they evaluate inferred statistics.

Also contrary to the hypothesized results, numeracy and level of education did not

differentially relate to performance depending on the problem format. It was expected that these

would be less predictive of performance using inferred statistics, but no clear pattern emerged:

in Experiment 2, numeracy and level of education displayed their lowest correlations with

performance using probabilities, which was expected to be the most mathematically taxing format.

Given the near-zero correlations between numeracy and performance, it seems that either people

inherently have the ability to perform likelihood calculations or do not process the information

mathematically, even when the information provided is numerical. Given the large divergences

between subjective estimates and Bayesian-optimal responses, the latter seems more likely.

Another concern is that the short numeracy measure used in the present study may not have captured the skills necessary for Bayesian likelihood calculations. Somebody may be highly educated without being mathematically inclined, or may do well on the numeracy task but have difficulty reasoning about the particular problems being asked.

The present studies provided evidence for individual differences in Bayesian reasoning.

Consistent with Arnold and Anderson (2017), evidence of individual differences in belief updating was found in the current studies, as performance in one information presentation format

was predictive of performance in other formats. This is the first study to compare performance

within-participants across formats. Numeracy and level of education were not significantly

predictive of performance, but future research should continue to look for potential predictors of

Bayesian reasoning performance.

Another important implication of this study is that some of the criticism of Bayesian

likelihood estimation tasks (e.g., Koehler, 1996) may be overstated. By using inferred statistics

the present study introduced a more ecologically valid method of assessing Bayesian reasoning performance. The positive relation in performance between presentation formats, as well as the poor performance using inferred statistics, provides evidence that errors in Bayesian reasoning are not an artifact of problem design but are instead a real and tangible phenomenon. However, estimating the likelihood that someone of a particular political party is from one of two states is an atypical task, so future research should explore performance using inferred statistics regarding a more typical decision.

Interestingly, although previous research has demonstrated superior performance using natural frequencies over probabilities, this result was replicated in Experiment 2 but not

Experiment 1, providing partial support for Hypothesis 4. As both experiments used a near-identical procedure, it is not clear why this pattern of results emerged. The two experiments did differ in the statistics used for the probability and natural frequency conditions, so one possibility is that the numbers provided in Experiment 1 were easier to understand as probabilities. However, post-hoc analyses did not find a significant effect of the state comparison being made on performance in either the probability or natural frequency conditions of

Experiment 1. Future research should manipulate the statistics provided to better understand the situations in which people show the greatest deviations from Bayesian-optimal responding.
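One reason natural frequencies are typically easier is that they replace a multi-step probability calculation with a single ratio of counts. The sketch below, using made-up numbers rather than the experiments' stimuli, shows the two presentation formats encoding the same information and yielding the same Bayesian answer:

```python
# Probability format: three normalized quantities must be combined via Bayes' rule.
p_state_a = 0.20          # base rate, P(State A)
p_party_given_a = 0.80    # diagnostic hit rate, P(party | State A)
p_party_given_b = 0.40    # diagnostic false-alarm rate, P(party | State B)
prob_answer = (p_state_a * p_party_given_a) / (
    p_state_a * p_party_given_a + (1 - p_state_a) * p_party_given_b
)

# Natural frequency format: the same information as raw counts.
# Of 1,000 people, 200 live in State A (160 of them share the party);
# 800 live in State B (320 of them share the party).
freq_answer = 160 / (160 + 320)  # one ratio of counts; no base rate term needed
```

Because the counts already carry the base rate (200 vs. 800), the frequency version collapses to a single division, which may be why errors concentrate in the probability format.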

Another possibility is that the effect of presentation format depends on the extremity of the initial values. Hoffrage and Gigerenzer (1998) used four different questions when examining the superiority of natural frequencies over probabilities. Although performance was superior using natural frequencies in all four problems, it is worth noting that the problem that exhibited the worst performance using probabilities displayed the highest performance when natural frequency information was provided. Hoffrage and Gigerenzer (1998) did not report the statistics used in each problem, so it is not clear why there was this

effect of problem on performance. Although the superiority of probability information found in

Experiment 1 appears to be an anomalous result, future research should nonetheless explore whether the extremity of statistical information can negate or exacerbate the superiority of natural frequencies.

Both experiments provided strong support for Hypothesis 5, which predicted that participants would demonstrate base rate neglect, as seen in positive DFB values. The mean DFB was positive in every condition of both experiments, and overall more than 70% of estimates exhibited some level of base rate neglect. Base rate neglect may thus be an inherent aspect of the way people use heuristics to make likelihood judgments. Kahneman and

Tversky (1972) may be justified in their proclamation that man is not Bayesian at all.

Overall, the two experiments conducted here provide evidence that people update their beliefs using heuristics, with an overemphasis on diagnostic information at the expense of base rate information. The present research is important as it is the first to look at Bayesian reasoning performance using inferred diagnostic and base rate information, showing that the use of non-

inferred statistics is not to blame for poor performance. Bayesian likelihood estimation tasks

using provided statistics appear to be a valid way to assess Bayesian reasoning, consistent with

tasks using inferred statistics. Future research should look at performance using inferred statistics

in other situations.

REFERENCES

Arnold, C. B., & Anderson, R. B. (2017). Under-updaters and over-updaters in belief updating.

Unpublished manuscript, Department of Psychology, Bowling Green State University,

Bowling Green, U.S.A.

Brase, G. L., & Hill, W. T. (2015). Good fences make for good neighbors but bad science: a

review of what improves Bayesian reasoning and why. Frontiers in Psychology, 6, 340.

http://dx.doi.org/10.3389/fpsyg.2015.00340

Brase, G. L., Fiddick, L., & Harries, C. (2006). Participant recruitment methods and statistical

reasoning performance. The Quarterly Journal of Experimental Psychology, 59(5), 965-

976. http://dx.doi.org/10.1080/02724980543000132

Brown, J., Bennett, J., & Hanna, G. (1981). The Nelson-Denny Reading Test. Lombard, IL:

Riverside Publishing Co.

Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source

of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6(1), 3-5.

http://dx.doi.org/10.1177/1745691610393980

Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants

and data gathered via Amazon’s MTurk, social media, and face-to-face behavioral

testing. Computers in Human Behavior, 29(6), 2156-2160.

http://dx.doi.org/10.1016/j.chb.2013.05.009.

Chapman, G. B., & Liu, J. (2009). Numeracy, frequency, and Bayesian reasoning. Judgment and

Decision Making, 4(1), 34-40.

Cohen, A. L., & Staub, A. (2015). Within-subject consistency and between-subject variability in

Bayesian reasoning strategies. Cognitive Psychology, 81, 26-47.

http://dx.doi.org/10.1016/j.cogpsych.2015.08.001

Cokely, E. T., Galesic, M., Schulz, E., Ghazal, S., & Garcia-Retamero, R. (2012). Measuring risk

literacy: The Berlin numeracy test. Judgment and Decision Making, 7(1), 25.

DeLuca, N. (2015, February 7). Massachusetts is the most liberal state in the country. American

Inno. Retrieved from https://www.americaninno.com/boston/massachusetts-is-the-most-

liberal-state-in-the-country-gallup-poll/

Edwards, W. (1968). Conservatism in human information processing. Formal Representation of

Human Judgment, 17, 51.

Evans, J. S. B. (2003). In two minds: Dual-process accounts of reasoning. Trends in Cognitive

Sciences, 7(10), 454-459. http://dx.doi.org/10.1016/j.tics.2003.08.012

Fernbach, P. M., Darlow, A., & Sloman, S. A. (2011). Asymmetries in predictive and diagnostic

reasoning. Journal of Experimental Psychology: General, 140(2), 168.

http://dx.doi.org/10.1037/a0022100.

Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: models of

bounded rationality. Psychological Review, 103(4), 650.

Gigerenzer, G., & Hoffrage, U. (1995). How to improve Bayesian reasoning without instruction:

frequency formats. Psychological Review, 102(4), 684-704.

Hauser, D. J., & Schwarz, N. (2016). Attentive Turkers: MTurk participants perform better on

online attention checks than do subject pool participants. Behavior Research Methods,

48(1), 400-407. http://dx.doi.org/10.3758/s13428-015-0578-z

Hoffrage, U., & Gigerenzer, G. (1998). Using natural frequencies to improve diagnostic

inferences. Academic Medicine, 73(5), 538-540.

Hoffrage, U., Gigerenzer, G., Krauss, S., & Martignon, L. (2002). Representation facilitates

reasoning: what natural frequencies are and what they are not. Cognition, 84(3), 343-352.

Hoffrage, U., Krauss, S., Martignon, L., & Gigerenzer, G. (2015). Natural frequencies improve

Bayesian reasoning in simple and complex inference tasks. Frontiers in Psychology, 6.

http://dx.doi.org/10.3389/fpsyg.2015.01473.

Kahneman, D., & Tversky, A. (1972). Subjective probability: A judgment of representativeness.

Cognitive Psychology, 3(3), 430-454.

Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review,

80(4), 237.

Koehler, J. J. (1996). The base rate fallacy reconsidered: Descriptive, normative, and

methodological challenges. Behavioral and Brain Sciences, 19(1), 1-17.

Krynski, T. R., & Tenenbaum, J. B. (2007). The role of causality in judgment under uncertainty.

Journal of Experimental Psychology: General, 136(3), 430-450.

http://dx.doi.org/10.1037/0096-3445.136.3.430

Martignon, L., Vitouch, O., Takezawa, M., & Forster, M. R. (2003). Naive and yet enlightened:

From natural frequencies to fast and frugal decision trees. Thinking: Psychological

perspectives on reasoning, judgment and decision making, 189-211.

Mastropasqua, T., Crupi, V., & Tentori, K. (2010). Broadening the study of inductive reasoning:

Confirmation judgments with uncertain evidence. Memory & Cognition, 38(7), 941-950.

http://dx.doi.org/10.3758/MC.38.7.941

Murphy, G. L. (2003). The downside of categories. Trends in Cognitive Sciences, 7(12), 513-514.

Nelson, T. E., Biernat, M. R., & Manis, M. (1990). Everyday base rates (sex stereotypes): Potent

and resilient. Journal of Personality and Social Psychology, 59(4), 664.

Peterson, C. R., & Miller, A. J. (1965). Sensitivity of subjective probability revision. Journal of

Experimental Psychology, 70(1), 117.

Politico. (2016). [Graph illustration of the 2016 presidential election results by state]. Politico.

Retrieved from https://www.politico.com/mapdata-2016/2016-

election/results/map/president/

Raven, J. C. (1962). Advanced Progressive Matrices (Set II). London: H. K. Lewis & Co.

Schwarz, N., Strack, F., Hilton, D., & Naderer, G. (1991). Base rates, representativeness, and the

logic of conversation: The contextual relevance of “irrelevant” information. Social

Cognition, 9(1), 67-84.

Stanovich, K. E., & West, R. F. (1998a). Individual differences in rational thought. Journal of

Experimental Psychology: General, 127(2), 161.

Stanovich, K. E., & West, R. F. (1998b). Who uses base rates and P(D/∼H)? An analysis of

individual differences. Memory & Cognition, 26(1), 161-179.

Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and

probability. Cognitive Psychology, 5(2), 207-232.

Tversky, A. & Kahneman, D. (1982). Evidential impact of base rates. In D. Kahneman, P. Slovic

& A. Tversky (eds), Judgment under uncertainty: Heuristics and biases (pp. 153-160).

Cambridge: Cambridge University Press.


APPENDIX A: HSRB FORM