Too good to be true: when overwhelming evidence fails to convince

Lachlan J. Gunn,1,∗ François Chapeau-Blondeau,2,† Mark D. McDonnell,3,‡ Bruce R. Davis,1,§ Andrew Allison,1,¶ and Derek Abbott1,∗∗

1 School of Electrical and Electronic Engineering, The University of Adelaide, Adelaide SA 5005, Australia
2 Laboratoire Angevin de Recherche en Ingénierie des Systèmes (LARIS), University of Angers, 62 avenue Notre Dame du Lac, 49000 Angers, France
3 School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes SA 5095, Australia††

arXiv:1601.00900v1 [stat.AP] 5 Jan 2016

Is it possible for a large sequence of measurements or observations, which support a hypothesis, to counterintuitively decrease our confidence? Can unanimous support be too good to be true? The assumption of independence is often made in good faith; however, rarely is consideration given to whether a systemic failure has occurred. Taking this into account can cause certainty in a hypothesis to decrease as the evidence for it becomes apparently stronger. We perform a probabilistic Bayesian analysis of this effect with examples based on (i) archaeological evidence, (ii) weighing of legal evidence, and (iii) cryptographic primality testing. We find that even with surprisingly low systemic failure rates, high confidence is very difficult to achieve and, in particular, we find that certain analyses of cryptographically-important numerical tests are highly optimistic, underestimating their false-negative rate by as much as a factor of 2^80.

INTRODUCTION

In a number of branches of science, it is now well-known that deleterious effects can conspire to produce a benefit or desired positive outcome. A key example where this manifests is in the field of stochastic resonance [1–3], where a small amount of random noise can surprisingly improve system performance, provided some aspect of the system is nonlinear. Another celebrated example is that of Parrondo's Paradox, where individually losing strategies combine to provide a winning outcome [4, 5].

Loosely speaking, a small amount of 'bad' can produce a 'good' outcome. But is the converse possible? Can too much 'good' produce a 'bad' outcome? In other words, can we have too much of a good thing?

The answer is affirmative—when improvements are made that result in a worse overall outcome, this situation is known as Verschlimmbesserung [6] or disimprovement. Whilst this converse paradigm is less well known in the literature, a key example is the Braess Paradox, where an attempt to improve traffic flow by adding bypass routes can counterintuitively result in worse traffic congestion [7–9]. Another example is the truel, where three gunmen fight to the death—it turns out that under certain conditions the weakest gunman surprisingly reduces his chances of survival by firing a shot at either of his opponents [10]. These phenomena can be broadly considered to fall under the class of anti-Parrondo effects [11, 12], where the inclusion of winning strategies fails.

In this paper, for the first time, we perform a Bayesian mathematical analysis to explore the question of multiple confirmatory measurements or observations, showing when they can—surprisingly—disimprove confidence in the final outcome. We choose the striking example that increasing confirmatory identifications in a police line-up or identity parade can, under certain conditions, reduce our confidence that a perpetrator has been correctly identified.

Imagine that as a court case drags on, witness after witness is called. Let us suppose thirteen witnesses have testified to having seen the defendant commit the crime. Witnesses may be notoriously unreliable, but the sheer magnitude of the testimony is apparently overwhelming. Anyone can make a misidentification, but intuition tells us that, with each additional witness in agreement, the chance of them all being incorrect will approach zero. Thus one might naïvely believe that the weight of as many as thirteen unanimous confirmations leaves us beyond reasonable doubt.

However, this is not necessarily the case, and more confirmations can surprisingly disimprove our confidence that the defendant has been correctly identified as the perpetrator. This type of possibility was recognised intuitively in ancient times. Under ancient Jewish law [13], one could not be unanimously convicted of a capital crime—it was held that the absence of even one dissenting opinion among the judges indicated that there must remain some form of undiscovered exculpatory evidence.

Such approaches are greatly at odds with standard practice in engineering, where measurements are often taken to be independent. When this is so, each new measurement tends to lend support to the outcome with which it most concords. An important question, then, is to distinguish between the two types of decision problem: those where additional measurements truly lend support, and those for which increasingly consistent evidence either fails to add or actively reduces confidence. Otherwise, it is only later, when the results come under scrutiny, that unexpectedly good results are questioned; Mendel's plant-breeding experiments provide a good example of this [14, 15], his results matching their predicted values sufficiently well that their authenticity has been mired in controversy since the early 20th century.

The key ingredient is the presence of a hidden failure state that changes the measurement response. This change may be a priori quite rare—in the applications that we shall discuss, it ranges from 10^-1 to 10^-19—but when several observations are aggregated, the a posteriori probability of the failure state can increase substantially, and even come to dominate the a posteriori estimate of the measurement response. We shall show that including error rates changes the information-fusion rule in a measurement-dependent way. Simple linear superposition no longer holds, resulting in non-monotonicity that leads to these counterintuitive effects.

This paper is constructed as follows. First, we introduce an example of a hypothetical archaeological find: a clay pot from the Roman era. We consider multiple confirmatory measurements that decide whether the pot was made in Britain or Italy. Via a Bayesian analysis, we then show that due to failure states, our confidence in the pot's origin does not improve for large numbers of confirmatory measurements. We begin with this example of the pot due to its simplicity, and because it captures the essential features of the problem in a clear manner.

Second, we build on this initial analysis and extend it to the problem of the police identity parade, showing that our confidence that a perpetrator has been identified surprisingly declines as the number of unanimous witnesses becomes large. We use this mathematical framework to revisit a specific point of ancient Jewish law—we show that it does indeed have a sound basis, even though it grossly challenges our naïve expectation.

Third, we finish with a final example to show that our analysis has broader implications and can be applied to electronic systems of interest to engineers. We choose the example of a cryptographic system, and show that a surprisingly small bit error rate can result in a larger-than-expected reduction in security.

Our analyses, ranging from cryptography to criminology, provide examples of how rare failure modes can have a counterintuitive effect on the achievable level of confidence.

A HYPOTHETICAL ROMAN POT

Let us begin with a simple scenario: the identification of the origin of a clay pot that has been dug from British soil. Its design identifies it as being from the Roman era, and all that remains is to determine whether it was made in Roman-occupied Britain or whether it was brought from Italy by travelling merchants. Suppose that we are fortunate and that a test is available to distinguish between the clay from the two regions; clay from one area—let us suppose that it is Britain—contains a trace element which can be detected by laboratory tests with an error rate pe = 0.3. This is clearly excessive, and so we run the test several times. After k tests have been made on the pot, the number of errors will be binomially distributed, E ∼ Bin(k, pe). If the two origins, Britain and Italy, are a priori equally likely, then the most probable origin is the one suggested by the greatest number of samples.

Now imagine that several manufacturers of pottery deliberately introduced large quantities of this element during their production process, and that therefore it will be detected with 90% probability in their pots, which make up pc = 1% of those found; of these, half are of British origin. We call pc the contamination rate. This is the hidden failure state to which we alluded in the introduction. Then, after the pot tests positive several times, we will become increasingly certain that it was manufactured in Britain. However, as more and more test results are returned from the laboratory, all positive, it will become more and more likely that the pot was manufactured with this unusual process, eventually causing the probability of British origin, given the evidence, to fall to 50%. This is the essential paradox of the system with hidden failure states—overwhelming evidence can itself be evidence of uncertainty, and thus be less convincing than more ambiguous data.

Formal model

Let us now proceed to formalise the problem above. Suppose we have two hypotheses, H0 and H1, and a series of measurements X = (X1, X2, ..., Xn). We define a variable F ∈ ℕ that determines the underlying measurement distribution, p_{X|F,Hi}(x). We may then use Bayes' law to find

    P[H_i | X] = \frac{P[X | H_i] \, P[H_i]}{P[X]},    (1)

which can be expanded by conditioning with respect to F, yielding

    P[H_i | X] = \frac{\sum_f P[X | H_i, F = f] \, P[H_i, F = f]}{\sum_{f, H_k} P[X | H_k, F = f] \, P[H_k, F = f]}.    (2)

In our examples there are a number of simplifying conditions—there are only two hypotheses and two measurement distributions, reducing Eqn. 2 to

    P[H_i | X] = \left( 1 + \frac{\sum_{f=0}^{1} P[X | H_{1-i}, F = f] \, P[H_{1-i}, F = f]}{\sum_{f=0}^{1} P[X | H_i, F = f] \, P[H_i, F = f]} \right)^{-1}.    (3)

Computation of these a posteriori probabilities thus requires knowledge of two distributions: the measurement distributions P[X|Hk, F], and the state probabilities P[Hi, F]. Having tabulated these, we may substitute them into Eqn. 3, yielding the a posteriori probability for each hypothesis. In this paper, the measurement distributions P[X|Hi, F = f] are all binomial; however, this is not the case in general.

Analysis of the pot origin distribution

In the case of the pot, the hypotheses and measurement distributions—the origin and contamination, respectively—are shown in Table I.

    P[F, Hi]                        Origin: Italy (H0)      Britain (H1)
    Contaminated  Y (F = 0)         ½ pc = 0.005            ½ pc = 0.005
                  N (F = 1)         ½ (1 − pc) = 0.495      ½ (1 − pc) = 0.495

    P[Positive result | F, Hi]      Origin: Italy (H0)      Britain (H1)
    Contaminated  Y (F = 0)         0.9                     0.9
                  N (F = 1)         pe = 0.3                1 − pe = 0.7

TABLE I. The model parameters for the case of the pot, for use in Eqn. 3, with a contamination rate pc = 10^-2. The a priori distribution of the origin is identically 50% for both Britain and Italy, whether or not the pot's manufacturing process has contaminated the results. As a result, the two columns of P[F, Hi] are identical. The columns of the measurement distribution, shown right, differ from one another, thereby giving the test discriminatory power. When the pot has been contaminated, the probability of a positive result is identical for both samples, rendering the test ineffective. Each cell of the left-hand table represents an outcome; each cell of the right-hand table represents a distribution.

Each measurement is Bernoulli-distributed, and the number of positive results is therefore described by a binomial distribution, with the probability mass function

    P[X = x] = \binom{n}{x} p^x (1 - p)^{n - x}

after n trials, the probability p being taken from the measurement-distribution section of Table I.

Substituting these probability masses into Eqn. 2, we see in Figure 1 that as more and more tests return positive results, we become increasingly certain of its British heritage, but an unreasonably large number of positive results will indicate contamination and so yield a reduced level of certainty.

[Figure 1: P[British origin] versus the number of unanimously-positive tests (0–30), for contamination rates pc = 0, 10^-3, 10^-2, and 10^-1.]

FIG. 1. Probability that the pot is of British origin given n tests, all coming back positive, for a variety of contamination rates pc and a 30% error rate. In the case of the pot above, with pc = 10^-2, we see a peak at n = 5, after which the level of certainty falls back to 0.5 as it becomes more likely that the pot originates at a contaminating factory. When pc = 0, this is the standard Bayesian analysis where failure states are not considered. We see therefore that even small contamination rates can have a large effect on the global behaviour of the testing methodology.

It is worth taking a moment, however, to briefly discuss the effects of weakening certain conditions; in particular, we consider two cases: that where the rate of contamination depends upon the origin of the pot, and that where the results after contamination are also origin-dependent.

Where the rate of contamination depends upon the origin, evidence of contamination provides some small evidence of where the pot came from. Thus if 80% of contaminated pots are of British origin, then Figure 1 will eventually converge to 0.8 rather than 0.5.

If the probability of a positive test is dependent upon the origin even when contaminated, then the behaviour of the test protocol changes qualitatively. Supposing that in the contaminated case the test is substantially less accurate, the probability of British origin will drop towards 0.5 as before, due to the increased likelihood of contamination; eventually, however, it will rise again towards 1.0 as sufficient data becomes available to make use of the less probative experiment that we now know to be taking place.
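
To make the calculation concrete, the short program below is an illustrative sketch rather than code from the study: it evaluates Eqn. 3 for the case in which all n tests return positive, using the parameters of Table I. The function name posterior_british and the choice to print up to 30 tests are our own; the numbers (pe = 0.3, pc = 0.01, a 90% detection rate when contaminated, and a 50/50 prior on origin) come from the text.

    #include <math.h>
    #include <stdio.h>

    /* Posterior probability of British origin (H1) after n tests that all
     * came back positive, following Eqn. 3.  With every test positive, the
     * binomial likelihood reduces to p^n for each (hypothesis, state) pair. */
    static double posterior_british(int n, double pe, double pc)
    {
        /* Joint priors P[Hi, F] from Table I. */
        double prior_contaminated = 0.5 * pc;         /* per origin, F = 0 */
        double prior_clean        = 0.5 * (1.0 - pc); /* per origin, F = 1 */

        /* P[positive | Hi, F] from Table I. */
        double p_pos_contaminated = 0.9;       /* either origin, F = 0 */
        double p_pos_italy_clean   = pe;       /* H0, F = 1 */
        double p_pos_britain_clean = 1.0 - pe; /* H1, F = 1 */

        double britain = pow(p_pos_contaminated, n) * prior_contaminated
                       + pow(p_pos_britain_clean, n) * prior_clean;
        double italy   = pow(p_pos_contaminated, n) * prior_contaminated
                       + pow(p_pos_italy_clean, n) * prior_clean;

        return britain / (britain + italy);
    }

    int main(void)
    {
        for (int n = 0; n <= 30; n++)
            printf("%2d unanimously-positive tests: P[British] = %.3f\n",
                   n, posterior_british(n, 0.3, 0.01));
        return 0;
    }

With pc = 10^-2 the printed posterior peaks at five unanimous tests and then falls back towards 0.5, matching the behaviour shown in Figure 1.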

THE RELIABILITY OF IDENTITY PARADES

We initially described the scenario of a court case, in which witness after witness testifies to having seen the defendant commit the crime of which he is accused. But in-court identifications are considered unreliable, and in reality, if identity is in dispute, then the identification is made early in the investigation under controlled conditions [16]. At some point, whether before or after being charged, the suspect has most likely been shown to each witness amongst a number of others, known as fillers, who are not under suspicion. Each witness is asked to identify the true perpetrator, if present, amongst the group.

This process, known as an identity parade or line-up, is an experiment intended to determine whether the suspect is in fact the same person as the perpetrator. It may be performed only once, or repeated many times with many witnesses. As human memory is inherently uncertain, the process will include random error; if the experiment is not properly carried out then there may also be systematic error, and this is the problem that concerns us in this paper.

Having seen how a unanimity of evidence can create uncertainty in the case of the unidentified pot, we now apply the same analysis to the case of an identity parade. If the perpetrator is not present—that is to say, if the suspect is innocent—then in an unbiased parade the witness should be unable to choose the suspect with a probability greater than chance. Ideally, they would decline to make a selection; however, this does not always occur in practice [16, 17], and forms part of the random error of the procedure. If the parade is biased—whether intentionally or unintentionally—for example because (i) the suspect is somehow conspicuous [18], (ii) the staff running the parade direct the witness towards him, (iii) by chance he happens to resemble the perpetrator more closely than the fillers, or (iv) the witness holds a bias, for example because they have previously seen the suspect [16], then an innocent suspect may be selected with a probability greater than chance. This is the hidden failure state that underlies this example; we assume in our analysis that this is completely binary—either the parade is completely unbiased or it is highly biased against the suspect.

In recent decades, a number of experiments [17, 19] have been carried out in order to establish the reliability of this process. Test subjects are shown the commission of a simulated crime, whether in person or on video, and asked to locate the perpetrator amongst a number of people. In some cases the perpetrator will be present, and in others not. The former allows estimation of the false-negative rate of the process—the rate at which the witness fails to identify the perpetrator when present—and the latter the false-positive rate—the rate at which an innocent suspect will be mistakenly identified. Let us denote by pfn the false-negative rate; this is equal to the proportion of subjects who failed to correctly identify the perpetrator when he was present, and was found in [17] to be 48%.

Estimating the false-positive rate is complicated by the fact that only one suspect is present in the line-up—when the suspect is innocent, an eyewitness who incorrectly identifies a filler as being the perpetrator has nonetheless correctly rejected the innocent suspect, despite their error. For the purposes of our analysis, we assume that the witness selects at random in this case, and therefore divide the 80% perpetrator-absent selection rate of [17] by the number of participants L = 6, yielding a false-positive rate of pfp = 0.133.

Let us now suppose that there is a small probability pc that the line-up is conducted incorrectly—for example, volunteers have been chosen who fail to adequately match the description of the perpetrator—leading to identification of the suspect 90% of the time, irrespective of his guilt. For the sake of analysis we assume that if this occurs, it will occur for all witnesses, though in practice the police might perform the procedure correctly for some witnesses and not others. The probability of the suspect being identified for each of the cases is shown in Table II.

    P[F, Hi]                        Suspect is: Innocent (H0)   Guilty (H1)
    Biased parade  Y (F = 0)        ½ pc = 0.005                ½ pc = 0.005
                   N (F = 1)        ½ (1 − pc) = 0.495          ½ (1 − pc) = 0.495

    P[Identification | F, Hi]       Suspect is: Innocent (H0)   Guilty (H1)
    Biased parade  Y (F = 0)        0.9                         0.9
                   N (F = 1)        pfp = 0.13                  1 − pfn = 0.52

TABLE II. The model parameters for the hypothetical identity parade. In a similar fashion to the first example, we assume a priori a 50% probability of guilt. In this case, the measurement distributions are substantially asymmetric with respect to innocence and guilt, unlike Table I.

If we assume a 50% prior probability of guilt, and independent witnesses, the problem is now identical to that of identifying the pot. The probability of guilt, given the unanimous parade results, is shown in Figure 2 as a function of the number of unanimous witnesses.
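
As noted above, the calculation has the same form as the pot example; the brief sketch below (our own illustration, not code from the study) substitutes the Table II parameters into Eqn. 3 for n unanimous identifications. The function name and the choice to print the first 25 witness counts are assumptions made for the example; the rates pfp = 0.133, 1 − pfn = 0.52, the 90% identification rate in a biased parade, and the 50% prior come from the text.

    #include <math.h>
    #include <stdio.h>

    /* Posterior probability of guilt after n unanimous identifications,
     * using Eqn. 3 with the identity-parade parameters of Table II. */
    static double posterior_guilt(int n, double pc)
    {
        const double p_biased = 0.9;    /* P[identification | biased parade]        */
        const double p_fp     = 0.133;  /* P[identification | fair, innocent]       */
        const double p_hit    = 0.52;   /* P[identification | fair, guilty], 1-pfn  */

        double guilty   = pow(p_biased, n) * 0.5 * pc
                        + pow(p_hit,    n) * 0.5 * (1.0 - pc);
        double innocent = pow(p_biased, n) * 0.5 * pc
                        + pow(p_fp,     n) * 0.5 * (1.0 - pc);

        return guilty / (guilty + innocent);
    }

    int main(void)
    {
        for (int n = 0; n <= 25; n++)
            printf("%2d unanimous witnesses: P[guilty] = %.3f\n",
                   n, posterior_guilt(n, 0.01));
        return 0;
    }

With pc = 0.01 the posterior peaks at three unanimous witnesses, short of 95%, consistent with Figure 2 and with the discussion that follows.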

[Figure 2: probability of guilt versus the number of unanimously-positive identifications (0–25), for pc = 0, 10^-4, 10^-3, and 10^-2; the vertical scale runs from 0.5 to 1.0, with 0.95 marked.]

FIG. 2. Probability of guilt given varying numbers of unanimous line-up identifications, assuming a 50% prior probability of guilt and identification accuracies given by [17]. Of note is that, for the case plotted here where the witnesses are unanimous, with a failure rate pc = 0.01 it is impossible to reach 95% certainty in the guilt of the suspect, no matter how many witnesses have been found.

We see that after a certain number of unanimously positive identifications the probability of guilt diminishes. Even with only one in ten thousand line-ups exhibiting this bias towards the suspect, the peak probability of guilt is reached with only five unanimous witnesses, completely counter to intuition—in fact, with this rate of failure, ten identifications in agreement provide less evidence of guilt than three. We see also that even with a 50% prior probability of guilt, a 1% failure rate renders it impossible to achieve 95% certainty if the witnesses are unanimous.

This tendency to be biased towards a particular member of the line-up when an error occurs has been noted [16, paragraph 4.31] prior to the more rigorous research stimulated by the advent of DNA testing, leading us to suspect that our sub-1% contamination rates are probably overly optimistic.

ANCIENT JUDICIAL PROCEDURE

The acknowledgement of this type of phenomenon is not entirely new; indeed, the adage "too good to be true" dates to the sixteenth century [20, good, P5.b]. Moreover, its influence on judicial procedure was visible in Jewish law even in the classical era; until the Romans ultimately removed the right of the Sanhedrin to confer death sentences, a defendant unanimously condemned by the judges would be acquitted [13, Sanhedrin 18b], the Talmud stating "If the Sanhedrin unanimously find guilty, he is acquitted. Why? — Because we have learned by tradition that sentence must be postponed till the morrow in hope of finding new points in favour of the defence".

The value of this rule becomes apparent when we consider that the Sanhedrin was composed, for ordinary capital offenses, of 23 members [13, Sanhedrin 2a]. In our line-up model, this many unanimous witnesses would indicate a probability of guilt scarcely better than chance, suggesting that the inclusion of this rule should have a substantial effect.

We show the model parameters for the Sanhedrin decision in Table III, which we use to compute the probability of guilt in Figure 3 for various numbers of judges condemning the defendant. We see that the probability of guilt falls as the judges approach unanimity; however, excluding unanimous decisions substantially reduces the probability of false conviction.

    P[F, Hi]                             Suspect is: Innocent (H0)   Guilty (H1)
    Biased proceedings  Y (F = 0)        ½ pc = 0.005                ½ pc = 0.005
                        N (F = 1)        ½ (1 − pc) = 0.495          ½ (1 − pc) = 0.495

    P[Identification | F, Hi]            Suspect is: Innocent (H0)   Guilty (H1)
    Biased proceedings  Y (F = 0)        0.95                        0.95
                        N (F = 1)        pfp = 0.14                  1 − pfn = 0.75

TABLE III. The model parameters for the Sanhedrin trial. Again, we assume an a priori 50% probability of guilt. However, the measurement distributions are the results of [21, model (2)] for juries; in contrast to the case of the identity parade, the false-negative rate is far lower. Despite the trial being conducted by judges, we choose to use the jury results, as the judges' tendency towards conviction is not reflected in the highly risk-averse rabbinic legal tradition.

[Figure 3: probability of guilt versus the number of condemning judges (0–23), for contamination rates pc = 10^-4, 10^-3, and 10^-2, with the range of votes that can secure a conviction shaded.]

FIG. 3. Probability of guilt as a function of judges in agreement out of 23—the number used by the Sanhedrin for most capital crimes—for various contamination rates pc. We assume as before that half of defendants are guilty, and use the estimated false-positive and false-negative rates of juries from [21, model (2)], 0.14 and 0.25 respectively. We arbitrarily assume that a 'contaminated' trial will result in a positive vote 95% of the time. The panel of judges numbers 23, with conviction requiring a majority of two and at least one dissenting opinion [13, Sanhedrin]; the majority of two means that the agreement of at least 13 judges is required in order to cast a sentence of death, to a maximum of 22 votes in order to satisfy the requirement of a dissenting opinion. These necessary conditions for a conviction by the Sanhedrin are shown as the pink region in the graph.

It is worth stressing that the exact shapes of the curves in Figure 3 are unlikely to be entirely correct; communication between the judges will prevent their verdicts from being entirely independent, and the false-positive and false-negative rates will be very much dependent upon the evidentiary standard required to bring charges, the strength of the contamination when it does occur, and the accepted burden of proof of the day. However, it is nonetheless of qualitative interest that, with reasonable parameters, this ancient law can be shown to have a sound statistical basis.
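
Unlike the parade example, the Sanhedrin vote is not restricted to unanimity, so the full binomial likelihood is needed. The sketch below (our own illustration under the assumptions stated in the Figure 3 caption, with helper and function names of our choosing) computes the posterior probability of guilt when m of the 23 judges condemn, using the Table III parameters.

    #include <math.h>
    #include <stdio.h>

    #define JUDGES 23

    /* Binomial probability mass: C(n, m) p^m (1 - p)^(n - m). */
    static double binom_pmf(int n, int m, double p)
    {
        double log_coeff = lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1);
        return exp(log_coeff + m * log(p) + (n - m) * log(1.0 - p));
    }

    /* Posterior probability of guilt when m of 23 judges condemn,
     * using Eqn. 3 with the Table III parameters. */
    static double posterior_guilt(int m, double pc)
    {
        const double p_biased = 0.95;  /* vote to condemn, biased proceedings */
        const double p_fp     = 0.14;  /* vote to condemn, fair, innocent     */
        const double p_hit    = 0.75;  /* vote to condemn, fair, guilty       */

        double guilty   = binom_pmf(JUDGES, m, p_biased) * 0.5 * pc
                        + binom_pmf(JUDGES, m, p_hit)    * 0.5 * (1.0 - pc);
        double innocent = binom_pmf(JUDGES, m, p_biased) * 0.5 * pc
                        + binom_pmf(JUDGES, m, p_fp)     * 0.5 * (1.0 - pc);

        return guilty / (guilty + innocent);
    }

    int main(void)
    {
        for (int m = 0; m <= JUDGES; m++)
            printf("%2d condemning judges: P[guilty] = %.3f\n",
                   m, posterior_guilt(m, 0.01));
        return 0;
    }

With pc = 10^-2 the posterior climbs with the number of condemning judges until near-unanimity and then drops sharply at 23 of 23, the behaviour shown in Figure 3.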

THE RELIABILITY OF CRYPTOGRAPHIC SYSTEMS

We now consider a different example, drawn from cryptography. An important operation in many protocols is the generation and verification of prime numbers; the security of some protocols depends upon the primality of a number that may be chosen by an adversary. In this case, one may test whether it is a prime, whether by brute force or by using another test such as the Rabin-Miller test [22, p. 176]. As the latter is probabilistic, we repeat it until we have achieved the desired level of security—in [22], a probability 2^-128 of accepting a composite as prime is considered acceptable. However, a naïve implementation cannot achieve this level of security, as we will demonstrate.

The reason is that despite it being proven that each iteration of the Rabin-Miller test will reject a composite number with probability at least 0.75, a real computer may fail at any time. The chance of this occurring is small; however, it turns out that the probability of a stray cosmic ray flipping a bit in the machine code, causing the test to accept composite numbers, is substantially greater than 2^-128.
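
For reference (a one-line consequence of the stated bound, not an additional result): with each iteration rejecting a composite with probability at least 3/4, the residual acceptance probability after k independent iterations is at most 4^-k, so reaching the 2^-128 target requires

    4^{-k} \le 2^{-128} \iff k \ge 64

iterations, which is why Figure 5 plots up to 64 iterations.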

Code changes caused by memory errors

Data provided by Google [23] suggests that a given memory module has approximately an 8% probability of suffering an error in any given year, independent of capacity. Assuming a 4 GB module, this results in approximately a λ = 10^-19 probability that any given bit will be flipped in any given second. We will make the assumption that, in the machine code for the primality-testing routine, there exists at least one bit that, if flipped, will cause all composite numbers—or some class of composite numbers known to the adversary—to be accepted as prime. As an example of how this could happen, consider the function shown in Figure 4, which implements a brute-force factoring test.

    #include <math.h>

    /* Brute-force factoring test: returns 1 if a factor is found,
     * and 0 otherwise (0 is taken to mean 'prime'). */
    int trialdivision(long to_test)
    {
        long i;
        long threshold;

        if (to_test % 2 == 0)
        {
            return 1;
        }

        threshold = (long)sqrt(to_test);

        for (i = 3; i <= threshold; i += 2)
        {
            if (to_test % i == 0)
            {
                return 1;
            }
        }

        return 0;
    }

FIG. 4. A function that tests for primality by attempting to factorise its input by brute force.

Assuming that the input is odd, the function will reach one of two return statements, returning zero or one. The C compiler GCC compiles these two return statements to

    45 0053 B8010000  movl  $1, %eax
    45      00
    46 0058 EB14      jmp   .L3

and

    56 0069 B8000000  movl  $0, %eax
    56      00

respectively. That is to say, it stores the return value as an immediate into the EAX register and then jumps to the cleanup section of the function, labelled .L3. The store instructions on lines 45 and 56 have machine-code values B801000000 and B800000000 for return values of one and zero respectively. These differ by only one bit, and therefore can be transformed into one another by a single bit error. If the first instruction is turned into the second, this will cause the function to return zero for any odd input, thus always indicating that the input is prime.
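
As a rough check on these figures (our own back-of-envelope arithmetic, under the assumption that the 8% annual module error rate of [23] is spread uniformly over bits and over time): a 4 GB module holds 2^35 bits and a year is roughly 3.15 × 10^7 s, so

    \lambda \approx \frac{0.08}{2^{35} \times 3.15 \times 10^{7}\,\mathrm{s}} \approx 7 \times 10^{-20}\,\mathrm{s}^{-1} \approx 10^{-19}\,\mathrm{s}^{-1},

and after a month in memory (T ≈ 2.6 × 10^6 s) the probability that one particular bit has flipped is

    p_f \approx \lambda T \approx 2.6 \times 10^{-13},

the value used in the following subsection.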

The effect of memory errors on confidence

At cryptographically interesting sizes—on the order of 2^2000—roughly one in a thousand numbers is prime [22, p. 173]. We might calculate the model parameters as before—for interest's sake, we have done so in Table IV—and calculate the confidence in a number's primality after a given number of tests. However, this is not particularly useful, for two reasons: first, the rejection probability of 75% is a lower bound, and for randomly chosen numbers is a substantial underestimate; second, we do not always choose numbers at random, but rather may need to test those provided by an adversary. In this case, we must assume that they have tried to deceive us by providing a composite number, and would instead like to know the probability that they will be successful. The Bayesian estimator in this case would provide only a tautology of the type: 'given the data and the fact that the number is composite, the number is composite'.

    P[F, Hi]                                 Number is: Prime (H0)       Composite (H1)
    Always-positive result  Y (F = 0)        0.001 pc = 0.001 × 10^-13   0.999 pc = 0.999 × 10^-13
                            N (F = 1)        0.001 (1 − pc) ≈ 0.001      0.999 (1 − pc) ≈ 0.999

    P[Acceptance | F, Hi]                    Number is: Prime (H0)       Composite (H1)
    Always-positive result  Y (F = 0)        1.0                         1.0
                            N (F = 1)        1.0                         0.25

TABLE IV. Model parameters for the Rabin-Miller test on random 2000-bit numbers. However, we have no choice but to assume the lower bound on the composite-number rejection rate, and so this model is inappropriate. Furthermore, in an adversarial setting the attacker may intentionally choose a difficult-to-detect composite number, rendering the prior distribution optimistic.

Let us suppose that the machine containing the code is rebooted every month, and that the Rabin-Miller code remains in memory for the duration of this period; then, neglecting other potential errors that could affect the test, at the time of the reboot the probability that the bit has flipped is now pf = 2.6 × 10^-13; this event we denote AF. Let k be the number of iterations performed; the probability of accepting a composite number is at most 4^-k, and we assume that the adversary has chosen a composite number such that this is the true probability of acceptance. We denote by AR the event that the number is accepted by the correctly-operating algorithm.

When hardware errors are taken into account, the probability of accepting a composite number is no longer 4^-k, but

    p_{fa} = P[A_F \cup A_R]                        (4)
           = P[A_F] + P[A_R] - P[A_F, A_R].         (5)

Since A_F and A_R are independent,

    p_{fa} = P[A_F] + P[A_R] - P[A_F] P[A_R]        (6)
           = 4^{-k} (1 - p_f) + p_f                 (7)
           \geq p_f.                                (8)

No matter how many iterations k of the algorithm are performed, this is substantially greater than the 2^-128 security level that is predicted by probabilistic analysis of the algorithm alone, thus demonstrating that algorithmic analyses which do not take into account the reliability of the underlying hardware can be highly optimistic. The false-acceptance rate as a function of the number of test iterations and time in memory is shown in Figure 5.

[Figure 5: false-acceptance probability (from 1 down to 2^-128) versus the number of Rabin-Miller iterations (0–64), for code resident in memory for 1 ns, 1 s, 1 d, and 1 mo, and for the fault-free case.]

FIG. 5. The acceptance rate as a function of time in memory and the number of Rabin-Miller iterations under the single-error fault model described in this paper. An acceptance rate of 2^-128 is normally chosen; however, without error correction this cannot be achieved. The false-acceptance rate after k iterations is given by pfa[k] = 4^-k (1 − pf) + pf, where pf is the probability that a fault has occurred that causes a false acceptance 100% of the time. We estimate pf to be equal to 10^-19 T, where T is the length of time in seconds that the code has been in memory.
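
The floor in Eqn. 8 is easy to see numerically. The sketch below (an illustration of the single-bit fault model, with names of our own choosing) prints the false-acceptance rate of Eqn. 7 for increasing iteration counts, using pf = λT with λ = 10^-19 per second and the code resident for one month.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const double lambda  = 1e-19;           /* bit-flip rate per second     */
        const double month   = 30.0 * 86400.0;  /* seconds the code is resident */
        const double p_fault = lambda * month;  /* ~2.6e-13, the event A_F      */

        for (int k = 0; k <= 64; k += 8) {
            double p_algorithm = pow(4.0, -k);                     /* P[A_R] */
            double p_fa = p_algorithm * (1.0 - p_fault) + p_fault; /* Eqn. 7 */
            printf("k = %2d iterations: log2(pfa) = %6.1f\n", k, log2(p_fa));
        }
        return 0;
    }

Beyond roughly twenty iterations the fault term dominates and the printed rate flattens near 2^-42 rather than continuing down to 2^-128, which is the one-month behaviour plotted in Figure 5.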

A real cryptographic system will include many such checks in order to make sure that an attacker has not chosen weak values for various parameters, and a failure of any of these may result in the system being broken, so our calculations are somewhat optimistic.

Error-correcting-code equipped (ECC) memory will substantially reduce the risk of this type of fault, and for regularly-accessed regions of code—accessed multiple times per second—will approach the 2^-128 level. A single parity bit, as used in at least some CPU level-one instruction caches [24], requires two bit-flips to induce an undetected error. Suppose the parity is checked every R seconds; then the probability of an undetected bit-flip in any given second is

    \lambda' = \frac{(\lambda R)^2}{R} = \lambda^2 R.    (9)

For code that is accessed even moderately often, this will come much closer to 2^-128. For example, if R = 100 ms then this results in a false-acceptance rate of 2^-108 after one month, much closer to the 2^-128 level of security promised by analysis of the algorithm alone. The stronger error-correction codes used by the higher-level caches and main memory will detect virtually all such errors—with two-bit detection capability, the rate of undetected bit-flips will be at most

    \lambda' = \lambda^3 R^2,    (10)

and even with a check rate of only once per 100 ms, the rate of memory errors is essentially zero, increasing the false-acceptance rate by only about one part in 10^14 above the 2^-128 level that would be achieved in a perfect operating environment.
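
To see where the 2^-108 figure comes from (again our own arithmetic under the single-fault model): with λ = 10^-19 s^-1 and R = 0.1 s, Eqn. 9 gives λ' = λ²R = 10^-39 s^-1, so after a month in memory (T ≈ 2.6 × 10^6 s) the probability of an undetected flip is

    p_f' \approx \lambda' T \approx 2.6 \times 10^{-33} \approx 2^{-108},

which, substituted for pf in Eqn. 7, sets the new false-acceptance floor.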

DISCUSSION

This phenomenon is interesting in that it is commonly known and applied heuristically, and trivial examples such as the estimation of coin bias [25, section 2.1] have been well analysed—see the appendix for a brief discussion—but these rare failure states are rarely, if ever, considered when a statistical approach to decision-making is applied to an entire system. Real systems that attempt to counter failure modes producing consistent data tend to focus upon the detection of particular failures rather than the mere fact of consistency. Sometimes there is little choice—a casino that consistently ejected gamblers on a winning streak would soon find itself without a clientele—however we have demonstrated that in many cases the level of consistency needed to cast doubt on the validity of the data is surprisingly low.

If this is so, then we must reconsider the use of thresholding as a decision mechanism when there is the potential for such failure modes to exist, particularly when the consequences of an incorrect decision are large. When the decision rule takes the form of a probability threshold, it is necessary to deduce an upper threshold as well, such as was shown in Figure 3, in order to avoid capturing the region indicative of a systemic failure.

That this phenomenon was accounted for in ancient Jewish legal practice indicates a surprising level of intuitive statistical sophistication in this ancient law code; though predating by millennia the statistical tools needed to perform a rigorous analysis, our simple model of the judicial panel indicates that the requirement of a dissenting opinion would have provided a substantial increase in the probability of guilt required to secure a conviction.

Applied to cryptographic systems, we see that even the minuscule probability that one particular bit in the system's machine code will be flipped due to a memory error over the course of a month, rendering the system insecure, is approximately 2^80 times larger than the risk predicted by algorithmic analysis. This demonstrates the importance of strong error correction in modern cryptographic systems that strive for a failure rate on the order of 2^-128, a level of certainty that appears to be otherwise unachievable without active mitigation of the effect.

The use of naturally-occurring memory errors for DNS hijacking [26] has previously been demonstrated, as has the ability of a user to disturb protected addresses by writing to adjacent cells [27]; however, little consideration has been given to the possibility that this type of fault might occur simply by chance, implying that security analyses which assume reliable hardware are substantially flawed when applied to consumer systems lacking error-corrected memory.

We have considered only a relatively simple case, in which there are only two levels of contamination. However, in practical situations we might expect any of a wide range of failure modes varying continuously. We have described a simple case in the appendix, where a coin may be biased—or not—towards either heads or tails with any strength; were one to apply this to the case of an identity parade, for example, one would find a probability that the suspect is indeed the perpetrator, as before, but taking into account that there may well be slight biases that nudge the witnesses towards or away from the suspect, not merely catastrophic ones. The result is heavily dependent upon the distribution of the bias, and lacking sufficient data to produce such a model we have chosen to eschew the complexity of the continuous approach and focus on a simple two-level model. An example of this approach is shown in [28, p. 117].

A related concept to that which we have discussed is the Duhem-Quine hypothesis [28, p. 6]; this is the idea that an experiment inherently tests hypotheses as a group—not merely the phenomenon that we wish to examine, but also, for example, the correct function of the experimental apparatus, and that only the desired independent variables are being changed. Our thesis is a related one, namely that in practical systems the failure of these auxiliary hypotheses, though unlikely, results in a significant reduction in confidence when it occurs, an effect which has traditionally been ignored.

CONCLUSION

We have analysed the behaviour of systems that are subject to systematic failure, and demonstrated that with relatively low failure rates, large sample sizes are not required in order that unanimous results start to become indicative of systematic failure. We have investigated the effect of this phenomenon upon identity parades, and shown that even with only a 1% rate of failure, confidence begins to decrease after only three unanimous identifications, failing to reach even 95%.

We have also applied our analysis of the phenomenon to cryptographic systems, investigating the effect by which confidence in the security of a parameter fails to increase with further testing due to potential failures of the underlying hardware. Even with a minuscule failure rate of 10^-13 per month, this effect dominates the analysis and is thus a significant determining factor in the overall level of security, increasing the probability that a maliciously-chosen parameter will be accepted by a factor of more than 2^80.

Hidden failure states such as these reduce confidence far more than intuition leads one to believe, and must be more carefully considered than is the case today if the lofty targets that we set for ourselves are to be achieved in practice.

COMPETING INTERESTS

We have no competing interests.

AUTHOR CONTRIBUTIONS

LJG drafted the manuscript. LJG and DA devised the concept. LJG, FC-B, MDM, BRD, AA, and DA carried out analyses and checking. All authors contributed to proofing the manuscript. All authors gave final approval for publication.

FUNDING

Lachlan J. Gunn is a Visiting Scholar at the University of Angers, France, supported by an Endeavour Research Fellowship from the Australian Government. Mark D. McDonnell is supported by an Australian Research Fellowship (DP1093425) and Derek Abbott is supported by a Future Fellowship (FT120100351), both from the Australian Research Council (ARC).

Analysis of a biased coin

It is worth adding a brief discussion of a simple and well-known problem that has some relation to what we have discussed, namely the question of whether or not a coin is biased. We follow the Bayesian approach given in [25]. They use Bayes' law in its proportional form,

    P[Q | \{\mathrm{data}\}] \propto P[\{\mathrm{data}\} | Q] \, P[Q],    (11)

where Q is the probability that a coin toss will yield heads. Various prior distributions P[Q] can be chosen, a matter that we will discuss momentarily.

As the coin tosses are independent, the data can be boiled down to a binomial random variable X ∼ Bin(n, Q), where n is the number of coin tosses made. Substituting the binomial probability mass function into Eqn. 11, they find that

    P[Q | X] \propto Q^X (1 - Q)^{n - X} \, P[Q].    (12)

As the number of samples n increases, this becomes increasingly peaked around the value Q = X/n, this 'peaking' effect limited by the shape of P[Q]. As the number of samples increases, the Q^X (1 − Q)^{n−X} part of the expression eventually comes to dominate the shape of the posterior distribution P[Q|X], and we have no choice but to believe that the coin genuinely does have a bias close to X/n.

In the examples previously discussed, we have assumed that bias is very unlikely; in the coin example, this corresponds to a prior distribution P[Q] that is strongly clustered around Q = 0.5; in this case, a very large number of samples will be necessary in order to conclusively reject the hypothesis that the coin is unbiased or nearly so. However, eventually this will occur, and the posterior distribution will change; when this occurs, the system has visibly failed—a casino using the coin will decide that they are not in fact playing the game that they had planned, and must cease before their loss becomes catastrophic. This is much like in the case of the Sanhedrin—if too many judges agree, the system has failed, and should not be considered reliable.
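
The appendix describes this only qualitatively; the sketch below makes it concrete with an assumed discretised prior (our choice, since the text specifies no particular P[Q]): 99% of the mass on a fair coin and the remainder spread over a grid of possible biases. It prints how the posterior probability of the coin being fair collapses only once a long enough run of heads has been observed.

    #include <math.h>
    #include <stdio.h>

    #define NQ 101  /* grid of candidate head-probabilities 0.00, 0.01, ..., 1.00 */

    int main(void)
    {
        double q[NQ], prior[NQ];

        /* Assumed prior: strongly clustered on Q = 0.5, as in the text. */
        for (int i = 0; i < NQ; i++) {
            q[i] = i / (double)(NQ - 1);
            prior[i] = 0.01 / NQ;       /* small probability of any bias */
        }
        prior[50] += 0.99;              /* 99% mass on a fair coin       */

        /* Posterior after observing n tosses, all heads (X = n). */
        for (int n = 0; n <= 60; n += 10) {
            double norm = 0.0, p_fair = 0.0;
            for (int i = 0; i < NQ; i++) {
                double like = pow(q[i], n);   /* Q^X (1-Q)^(n-X) with X = n */
                norm += like * prior[i];
                if (i == 50)
                    p_fair = like * prior[i];
            }
            printf("%2d heads in a row: P[fair coin] = %.4f\n", n, p_fair / norm);
        }
        return 0;
    }

Under this prior the posterior mass shifts decisively away from the fair coin within roughly ten to fifteen consecutive heads, the point at which, in the casino analogy of the text, the system has visibly failed.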

∗ [email protected]
† [email protected]
‡ [email protected]
§ [email protected]
¶ [email protected]
∗∗ [email protected]
†† School of Electrical and Electronic Engineering, The University of Adelaide 5005, Adelaide, Australia

[1] R. Benzi, A. Sutera, and A. Vulpiani, "The mechanism of stochastic resonance," Journal of Physics A: Mathematical and General, vol. 14, no. 11, pp. L453–L457, 1981.
[2] M. D. McDonnell, N. G. Stocks, C. E. M. Pearce, and D. Abbott, Stochastic Resonance: From Suprathreshold Stochastic Resonance to Stochastic Signal Quantization. Cambridge University Press, 2008.
[3] M. D. McDonnell and D. Abbott, "What is stochastic resonance? Definitions, misconceptions, debates, and its relevance to biology," PLoS Computational Biology, 2009, art. e1000348.
[4] G. P. Harmer and D. Abbott, "Game theory: losing strategies can win by Parrondo's paradox," Nature, vol. 402, p. 864, 1999.
[5] D. Abbott, "Asymmetry and disorder: a decade of Parrondo's paradox," Fluctuation and Noise Letters, vol. 9, no. 1, pp. 129–156, 2010.
[6] M. N. Mead, "Columbia program digs deeper into arsenic dilemma," Environ. Health Perspect., vol. 113, no. 6, pp. A374–A377, 2005.
[7] D. Braess, "Über ein Paradoxon aus der Verkehrsplanung," Unternehmensforschung, vol. 12, pp. 258–268, 1969.
[8] H. Kameda, E. Altman, T. Kozawa, and Y. Hosokawa, "Braess-like paradoxes in distributed computer systems," IEEE Transactions on Automatic Control, vol. 45, no. 9, pp. 1687–1691, 2000.
[9] Y. A. Korilis, A. A. Lazar, and A. Orda, "Avoiding the Braess paradox in noncooperative networks," Journal of Applied Probability, vol. 36, no. 1, pp. 211–222, 1999.
[10] A. P. Flitney and D. Abbott, "Quantum two- and three-person duels," J. Opt. B: Quantum Semiclass. Opt., vol. 6, no. 8, pp. S860–S866, 2004.
[11] G. P. Harmer, D. Abbott, P. G. Taylor, and J. M. R. Parrondo, "Brownian ratchets and Parrondo's games," Chaos, vol. 11, no. 3, pp. 705–714, 2001.
[12] S. N. Ethier and J. Lee, "Parrondo's paradox via redistribution of wealth," Electron. J. Probab., vol. 17, 2012, art. 20.
[13] I. Epstein, Ed., The Babylonian Talmud. Soncino Press, 1961.
[14] R. A. Fisher, "Has Mendel's work been rediscovered?" Annals of Science, vol. 1, no. 2, pp. 115–137, 1936.
[15] A. Franklin, Ending the Mendel-Fisher Controversy. University of Pittsburgh Press, 2008.
[16] P. Devlin, C. Freeman, J. Hutchinson, and P. Knights, Report to the Secretary of State for the Home Department of the Departmental Committee on Evidence of Identification in Criminal Cases. HMSO, London, 1976.
[17] R. A. Foster, T. M. Libkuman, J. W. Schooler, and E. F. Loftus, "Consequentiality and eyewitness person identification," Applied Cognitive Psychology, vol. 8, no. 2, pp. 107–121, 1994.
[18] M. S. Wogalter and D. B. Marwitz, "Suggestiveness in photospread lineups: similarity induces distinctiveness," Applied Cognitive Psychology, vol. 6, no. 5, pp. 443–452, 1992.
[19] R. S. Malpass and P. G. Devine, "Eyewitness identification: lineup instructions and the absence of the offender," Journal of Applied Psychology, vol. 66, no. 4, pp. 482–489, 1981.
[20] Oxford English Dictionary. Oxford University Press, http://oed.com/, accessed 2015-10-06.
[21] B. D. Spencer, "Estimating the accuracy of jury verdicts," Journal of Empirical Legal Studies, vol. 4, no. 2, pp. 305–329, 2007.
[22] N. Ferguson, B. Schneier, and T. Kohno, Cryptography Engineering: Design Principles and Practical Applications. Wiley, Indianapolis, USA, 2010.
[23] B. Schroeder, E. Pinheiro, and W.-D. Weber, "DRAM errors in the wild: A large-scale field study," in Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '09, Seattle, WA, USA, 2009, pp. 193–204.
[24] Advanced Micro Devices, "AMD Opteron processor product datasheet," http://support.amd.com/TechDocs/23932.pdf, 2007, accessed 2015-10-12.
[25] D. S. Sivia and J. Skilling, Data Analysis: A Bayesian Tutorial. Oxford University Press, 2006.
[26] A. Dinaberg, "Bitsquatting: DNS hijacking without exploitation," in Proceedings of BlackHat Security, Las Vegas, USA, 2011.
[27] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee, C. Wilkerson, K. Lai, and O. Mutlu, "Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors," in Proc. IEEE 41st Annual International Symposium on Computer Architecture, Minneapolis, MN, USA, 2014, pp. 361–372.
[28] L. Bovens and S. Hartmann, Bayesian Epistemology. Oxford University Press, 2004.