Pattern recognition and forensic identification: The presumption of scientific accuracy and other falsehoods

IR Coyle, D Field and P Wenderoth*

Decision-making in forensic contexts where patterns (such as fingerprints) are compared involves processes of perception and cognition which are notoriously fallible in many circumstances. The known or potential rate of error in those scientific methods of forensic identification which have long been accepted by the courts is often higher than would generally be perceived, despite the presumption of accuracy of such techniques. In this article, the authors argue that perceptual and cognitive errors in such forensic identification evidence are overwhelmingly due to the misuse and profound lack of understanding of basic epistemological and statistical principles. To avoid miscarriages of justice, these principles need to be understood and safeguards employed so that the legal process is not contaminated by pseudoscience.

INTRODUCTION

Few things can be more damning to the prospects of a defendant than the unqualified pronouncement of an authoritative expert that there is forensic evidence directly linking the accused to a crime. Before the First Fleet sailed into Sydney Harbour, Lord Mansfield issued the following warning, the basic thrust of which is still apposite today:

The fact that an expert witness has impressive scientific qualifications does not by that fact alone make his opinion on matters of human nature and behaviour within the limits of normality any more helpful than that of the jurors themselves. But there is a danger that they may think it does.1

Forensic evidence comes in many forms. To name but a few of the tools of trade of forensic scientists, it may involve comparison of latent fingerprints found at the crime scene with exemplar prints either obtained pursuant to a forensic order or extant in some database; orthodontic analysis of bite marks on a victim; DNA evidence; hair and fibre matching; or other emerging techniques based on the anthropometric or biomedical characteristics of humans. Whatever their genesis, all of these techniques have one thing in common: the potential for human error.

The decision that evidence found at a crime scene can be matched to a particular suspect is made by a human. While mathematical algorithms processed with computational power almost beyond comprehension may reduce to a manageable number the comparisons that need to be made, the ultimate decision is always made by a human. This startlingly simple observation has far-reaching consequences that have not been fully appreciated or accommodated by the legal system. Decision-making, whether in a forensic context or otherwise, is as much a function of the processes of the human mind as of the technology on which such decisions are founded. It is as meaningless to try to isolate the two as it is to try to determine what caused a motor vehicle accident by considering only the vehicles involved and the traffic conditions, whilst ignoring the decisions made by the drivers.

Decision-making in forensic contexts, in which patterns of various types are compared, involves processes of perception and cognition. Perception and cognition are notoriously fallible in many

* Ian Coyle: Visiting Professorial Fellow, Forensic Psychologist, Forensic Ergonomist and Forensic Psychopharmacologist, Bond University Centre for Forensic Excellence; Principal Consultant Safetysearch Forensic Consultants, Gold Coast, Queensland. David Field: Associate Professor of Law, Director, Bond University Centre for Forensic Excellence. Peter Wenderoth: Professor of Psychology, Macquarie University. The authors would like to thank Professor Don Thomson and the Honourable Tim Carmody for their helpful comments on the manuscript. 1 Folkes v Chadd (1782) 99 ER 589.

circumstances. This has long been recognised with respect to the identification of an accused person as the perpetrator of a crime by way of eyewitness testimony, and legal safeguards have evolved to limit the consequences of error when such evidence is adduced, albeit that these safeguards are often useless.2 It is therefore curious that such safeguards are, in the main, conspicuously absent when forensic experts assert that patterns they have observed in evidence (whether fingerprints, bite marks or DNA) match the characteristics of a suspect. This appears to be due in large measure to the misuse and profound lack of understanding of basic epistemological and statistical principles not only among lawyers, but also among scientists and other forensic “experts”. This has led, inter alia, to the legal doctrine of the presumption of scientific accuracy. When a scientific instrument may be said to belong to:

[a] class of instruments of a scientific or technical character, which by general experience [are] known to be trustworthy, and are so notorious that the court requires no evidence to the effect that they do fall into such class, before allowing the presumption in question to operate with regard to readings made thereon,3

a court will, at common law, be entitled to take what is called “judicial notice” of its reliability. This means that the results or readings derived from such an instrument may be relied on in evidence when this is relevant to the outcome of a case. This doctrine is founded on the notion that some tests are so notorious in their accuracy that it would require statistical improbability on a vast scale for them to be wrong in any particular case.

Unfortunately, the same presumption of accuracy seems to have become applied to so-called scientific testing procedures which rely not upon the use of instruments, but upon the bare application of human judgment, albeit skilled and experienced judgment. Whatever may be presumed in law, errors in decision-making in such tests undoubtedly do occur. These errors are common in the area of forensic identification, and are so ubiquitous as to require a fundamental reassessment of the way such evidence is received by the courts. A convenient place to start when considering this proposition is with biometric identification, since this has the longest history.

BIOMETRIC IDENTIFICATION

Biometric measurement goes back to the 19th century, when Alphonse Bertillon categorised a series of measurements of the human body (forearm length, hand width etc) which were used to describe an individual. Literally tens of thousands of such measurements were collected in England, America and France and used to obtain convictions, typically of habitual criminals. Despite misgivings, the most significant of which was the report of the Royal Commission in 1898 in England, the system continued to be used. Then there occurred the case of Mr Will West. In 1903, Mr West was incarcerated in Leavenworth prison, Kansas, in the United States. His Bertillon measurements were taken and were identical, on 15 matching points of comparison, to those of another inmate who had been admitted to Leavenworth two years earlier and was still there. This was the genesis of the requirement to have 16 matches or points of comparison that migrated, through a process of unscientific osmosis, to the system of fingerprint matching.
Fingerprint examination replaced the Bertillon system during the early part of the 20th century. Since then, the criteria for obtaining a “match” have varied. In most of Europe, a fingerprint match requires 16 points; in Greece it is 10; and in Turkey eight points are required. In the United States and Australia, no specific criteria are used – the analyst simply forms an opinion that a latent and “exemplar” print (ie one taken from the suspect) match. There is no empirical or statistical basis for these thresholds: none whatsoever. As Thompson and Cole noted:

2 Coyle IR, Field D and Miller G, “The Blindness of the Eye-witness” (2008) 82 ALJ 471. 3 Porter v Kolodzeil [1962] VR 75 at 78.

Latent Print Examiners (LPEs) have no scientific basis to estimate the probability of a random match between two impressions, and they present no statistics in connection with their testimony. If they find sufficient consistent detail they simply declare a positive identification or individualization, claiming the potential donor for the mark has been reduced to one and only one area of friction ridge skin to the exclusion of all other friction ridge skin in the world.4

And so LPEs in Australia and North America simply state that the latent print matched or did not match the exemplar print, and ignore the vexatious issue that they might not be correct 100% of the time when making such judgments. LPEs routinely assert that latent and exemplar prints can be matched by properly trained examiners with no realistic chance of incorrect matching (ie making a false positive error). This is a comforting thought for an accused. It is also wrong.

Errors in fingerprint analysis have been known since the 1920s. Typically these have been shrugged off as being due to poor training, poor supervision or difficulties in matching poor quality latent impressions with exemplars in a database. The cases of Brandon Mayfield in the United States and Shirley McKie in Scotland have conclusively demonstrated that these arguments will not fly.5 In both of these cases, the most experienced LPEs in North America and the United Kingdom conclusively and comprehensively identified the wrong person.

What processes of decision-making are involved when a forensic scientist compares latent and exemplar evidence of whatever type? The short answer is that no one knows. Because forensic scientists making decisions as to the similarity or otherwise of such evidence produce almost no documentation, it is very difficult (if not impossible) to determine, post facto, what led them to their conclusion. Since there is no standard in North America or Australia as to the number of indicia that must match before a positive identification is declared, it seems clear that LPEs must consider the overall pattern of the latent print as well as individual minutiae, but how this is done is not clear. Some clues can be gleaned from computer algorithms based on multivariate statistical techniques such as factor analysis and principal component analysis,6 but we have no idea if humans adopt a decision-making heuristic based on these or similar approaches in forensic pattern matching. It is entirely possible that the ultimate decision-making heuristic can be reduced to “X looks like Y”.
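The following sketch (illustrative only, with randomly generated stand-ins for minutiae measurements, and assuming the scikit-learn library is available) shows the kind of dimension reduction that principal component analysis performs; it does not purport to describe how any examiner or operational fingerprint system actually reasons.

```python
# Illustrative sketch only: principal component analysis (PCA) compresses
# many correlated measurements into a few orthogonal components. The data
# below are invented stand-ins for minutiae measurements.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical database: 500 exemplar prints, 80 minutiae measurements each.
# Correlation is induced so that a few components explain most variance.
latent_factors = rng.normal(size=(500, 5))      # 5 underlying "factors"
loadings = rng.normal(size=(5, 80))             # how factors map to minutiae
exemplars = latent_factors @ loadings + 0.1 * rng.normal(size=(500, 80))

pca = PCA(n_components=5)
scores = pca.fit_transform(exemplars)

# With well-behaved data of this kind, a handful of orthogonal components
# captures almost all of the variance in the 80 raw measurements.
print(pca.explained_variance_ratio_.cumsum())
```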
It is arguable that forensic scientists who make a decision based on such a simplistic decision-making heuristic, without any other explicitly reasoned argument, have made their decision on the basis of degrees of consistency between the crime scene and exemplar evidence. The concept of degrees of consistency is inherent in verbal descriptors such as “unable to exclude”, “matches to a reasonable degree of medical certainty” etc, typically employed by “expert” witnesses to describe the congruence between latent and exemplar forensic identification evidence. In the context of eyewitness identification evidence, the High Court has explicitly rejected the notion of degrees of consistency.

In Martin v Osborne, Dixon J, as he then was, observed:

If an issue is to be proved by circumstantial evidence, facts subsidiary to or connected with the main fact must be established from which the conclusion follows as a rational inference. In the inculpation of an accused person the evidentiary circumstances must bear no other explanation.7

In Plomp v The Queen, Dixon CJ, citing this observation, acknowledged the difficulty in stating this rule, which he opined has “not been overcome by employing the expression ‘more consistent’ as if there were degrees of consistency”.8 This line of reasoning was affirmed in Pitkin v The Queen, where Deane J (Toohey and McHugh JJ concurring) observed:

There are not, as Dixon CJ observed, degrees of consistency and, if a reasonable jury ought to have found that an inference or hypothesis consistent with innocence was open on the evidence, then it ought

4 Thompson WC and Cole SA, “Psychological Aspects of Forensic Identification Evidence” in Costanzo M, Krauss D and Pezdek K (eds), Expert Psychological Testimony for the Courts (Routledge, New York, 2006) p 38. 5 Thompson WC and Cole SA, “Lessons from the Brandon Mayfield Case” (2005) 29 The Champion 32. 6 Jolliffe IT, Principal Component Analysis (Springer, New York, 1986). 7 Martin v Osborne (1936) 55 CLR 367 at 375. 8 Plomp v The Queen (1963) 110 CLR 234 at 243.

to have given the appellant the benefit of the doubt necessarily created by that circumstance.9

This case involved the eyewitness identification of the defendant on the basis that a witness, having observed the theft of a lady’s bag, selected a photograph at a subsequent photo-identification and stated: “This looks like the person that I seen take the lady’s handbag.” In quashing the original verdict, their Honours observed:

Under our system of administering criminal justice, a person is not to be convicted of a serious crime on the sole basis of a verbal ambiguity.10

To reiterate, much forensic identification evidence deals with “degrees of consistency”, albeit not expressed in these terms, and ambiguities are inherent in many of the verbal descriptors used to anchor forensic identification evidence. Equally, it is the authors’ view that scientific ambiguity of a high order is a prominent feature of forensic identification evidence. These ambiguities arise from different sources.

SOURCES AND TYPES OF ERROR IN FORENSIC IDENTIFICATION

Potential sources of error in declaring that items have a common source include fraud, incompetence, instrumentation and technological errors11 (although these are dependent ultimately on human errors) and fundamental methodological errors that are inherent in the field in question.12 Although fraud, egregious incompetence and technological errors are of enormous concern in forensic evidence,13 these issues are not considered here. Rather, this article is concerned with fundamental epistemological issues arising from psychological processes inextricably linked to the function of the mind that are relevant to all types of forensic identification evidence.

There are two types of errors an individual can make in arriving at a decision, whether this involves forensic identification or any other sort of decision: a false positive and a false negative. Within the context of this article, a false positive error (also referred to as a Type 1 error or false alarm) involves the incorrect acceptance of a hypothesis that two objects match. A false negative error (also referred to as a Type 2 error or miss) involves the incorrect rejection of a hypothesis that two objects match. These types of errors, and their associated error rates, are related. They also have different consequences in different fields. For example, a biometric scanning system based on iris scans that had a high false negative rate would be unacceptable as a screening system for admission to an unremarkable commercial building, even if it had a zero false positive error rate, since many customers would be denied access. However, the same system might be considered appropriate for a sensitive military installation.

A sophisticated and pervasive methodology, Receiver Operating Characteristic (ROC)14 analysis, has been developed from the simple premise that there are four fundamental decisions that can be made in matching or identifying objects: true positive, false positive, true negative and false negative.15 ROC analysis enables the accuracy of decision-making to be determined, as well as the effects of

9 Pitkin v The Queen (1995) 80 A Crim R 302 at 306; 69 ALJR 612. 10 Pitkin v The Queen (1995) 80 A Crim R 302 at 306; 69 ALJR 612. 11 An example here would be the failure of a facial recognition algorithm to provide a matching facial image from a database to enable comparison with a suspect. 12 Dror IE and Charlton D, “Why Experts Make Errors” (2006) 56(4) Journal of Forensic Identification 600; Coyle IR, Field D and Starmer G, “An Inconvenient Truth: Errors in Breath Alcohol Analysis Arising from Statistical Uncertainty” (2009) 41(2) Australian Journal of Forensic Sciences 1. 13 Field D, Coyle IR, Starmer GA, Miller G and Wilson P, “Trust Me – I’m an Expert” (2009) 41(2) Australian Journal of Forensic Sciences 113. 14 Developed in the physical sciences, this technique was originally called Signal Detection Theory (SDT), a name it retains; “ROC” is used interchangeably with “SDT”. To avoid confusion, the term ROC is used throughout this article. 15 Green DM and Swets JA, Signal Detection Theory and Psychophysics (John Wiley & Sons, New York, 1966).

response bias on accuracy, in a plethora of fields. In the context of forensic identification evidence, accuracy, defined as the area under the ROC curve, measures the observer’s ability to correctly match “field” and exemplar evidence. The ROC curve combines the concepts of sensitivity (the true positive fraction) and specificity (the true negative fraction) into a single measure of accuracy. The false positive fraction is the complement of specificity (1 - specificity). The area under the curve (AUC) is defined as the diagnostic accuracy of a test or methodology. It ranges from 0 to 1.0 (perfect); an area of 0.5 indicates that the observers are guessing. According to Swets,16 AUC values above 0.9 indicate “high accuracy”, values of 0.7-0.9 indicate “useful for some purposes”, and values of 0.5-0.7 indicate “poor accuracy”.17 ROC analysis is routinely used to determine drug safety/efficacy, detection of threats in a military environment, fundamental studies in perception, and assessment of the utility of psychometric tests. Relatively recently it has begun to be applied to forensic identification.18

While it is inherent in ROC analysis that accuracy, sensitivity and specificity are interrelated, from a legal perspective the probability of a false positive is of the utmost concern, since it can result in an innocent party being found guilty. Of course, focusing on this type of error leaves open the prospect that more false negative errors will occur, with the result that some villains may escape justice, but it is axiomatic in our legal system that this is as it should be. Bearing this in mind, it is obvious that the probative value of any forensic procedure which purports to be able to identify a suspect by matching impressions found at a crime scene with exemplar impressions is restricted by the prejudicial probability that a false positive has occurred. Or rather, it should be obvious, but scientists involved in forensic identification have largely ignored the effects of false positives on the probative/prejudicial value of their evidence. Even worse, in many cases they blithely refuse to accept that such a thing as a false positive exists in their specific domain. And even if they do acknowledge this inconvenient truth, many forensic scientists produce statistical arguments which purport to render this problem insignificant. These arguments are specious.
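To make the ROC machinery concrete, the following sketch computes the AUC for a hypothetical examiner who rates 1,000 print pairs on a confidence scale. All numbers are invented for illustration and scikit-learn is assumed to be available; the interpretive bands applied at the end are Swets' bands quoted above.

```python
# A minimal sketch of ROC analysis using hypothetical rating data:
# 1,000 print pairs, half genuinely matching, rated on a continuous
# confidence scale. None of these numbers come from the article.
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)

truth = np.repeat([1, 0], 500)                  # 1 = same source
ratings = np.where(truth == 1,
                   rng.normal(3.5, 1.0, 1000),  # "signal" distribution
                   rng.normal(2.0, 1.0, 1000))  # "noise" distribution

auc = roc_auc_score(truth, ratings)             # area under the ROC curve
fpr, tpr, thresholds = roc_curve(truth, ratings)

# Swets' rough interpretive bands, as quoted in the text.
if auc > 0.9:
    band = "high accuracy"
elif auc > 0.7:
    band = "useful for some purposes"
else:
    band = "poor accuracy"
print(f"AUC = {auc:.2f} ({band})")
```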

USING STATISTICS LIKE A DRUNKARD USES A LAMP POST: FOR SUPPORT, NOT ILLUMINATION

Curiously, expert LPE witnesses are banned from using probabilities in their testimony in many jurisdictions. Thompson and Cole point out that:

[a] 1979 Resolution of the International Association for Identification, the main professional organization for LPEs in North America, stated, “Any member, officer, or certified latent print examiner who provides oral or written reports, or gives testimony of possible or probable, or likely friction ridge identification shall be deemed to be engaged in conduct unbecoming such member, officer or certified latent print examiner”.19

Although this rule had its origins in noble intentions (being designed to encourage LPEs to give evidence only when they are convinced of the accuracy of their conclusion), it has had an inimical effect. Results are often expressed in terms that imply scientific certainty (ie a probability of 1.0 or 100%). This is indefensible, since it implies that it is possible to prove a hypothesis whereas, as a matter of logic, it is only possible to disprove a null hypothesis.20 The problem does not end there.

16 Swets JA, “Measuring the Accuracy of Diagnostic Systems” (1988) 240 Science 1285. 17 Technically, an AUC of 0.5 cannot indicate a ROC curve since this is only a chance response and thus there is no response curve per se. However, for practical purposes in the context of the arguments espoused herein this can be ignored and an AUC of 0.5 may be considered equivalent to a diagnostic accuracy of 50%. 18 Phillips VL, Saks MJ and Peterson JL, “The Application of Signal Detection Theory to Decision-making in Forensic Science” (2001) 46(2) Journal of Forensic Sciences 294. 19 Thompson and Cole, n 4, pp 45-46. 20 The null hypothesis is a hypothesis of no difference; it is formulated for the express purpose of being capable of rejection. If rejected, the alternative hypothesis may be accepted. Suppose, eg, that one wished to test the hypothesis that all humans born with hands had five fingers. The only way this could be completely proven would be to observe every human on the planet. However, one could disprove the opposite or null hypothesis, to a specific degree of certainty or probability, by observing a sufficiently large sample of humans, which would then imply that the original hypothesis was correct. The degree of certainty in such a process is never absolute; it may approach a probability of 1.0 but it can never obtain this level of certainty unless all members of a class (in this case, humans) are observed. In fact, this example is not far fetched or fanciful since polydactyly (having more than five digits on either hand or foot) clearly exists, albeit that it is rare.

While some forensic disciplines use qualitative assessments of certainty such as “source attributable to reasonable medical certainty”,21 in others practitioners assert that, when comparing two items, they can identify characteristics or patterns of characteristics that are unique. That is, they claim to have narrowed down the source of potential donors of the evidence found at a crime scene to one, and only one, individual or object. This is often referred to as “individualisation”. This is so for fingerprint analysis, firearm and tool mark analysis,22 and barefoot morphology, the examination of the impressions of the weight-bearing areas of the human foot.23 The degrees of “certainty” are indicated by verbal descriptors. As an example, the degrees of certainty, which are misleadingly referred to as “confidence intervals” by practitioners of barefoot morphology, are as follows:

• “Insufficient Detail” – when there is not enough detail or clarity.
• “Support” – agreement or disagreement of details, such as the overall size and the location of the toe pads, but a lack of sufficient quantity and/or clarity.
• “Strong Support” – agreement or disagreement of all the detail, such as overall size, shape and location of the toe pads, contour of the metatarsal ridge, and the contour of the ball of the foot, but with a lack of sharp detail.
• “Did Make” – agreement of all detail, such as the overall size, shape and location of the toe pads, contour of the metatarsal ridge and the contour of the ball of the foot, sharp edge detail, in combination with random accidental characteristics (damage to the foot, flexion creases etc).
• “Did Not Make” – contains clear detail that shows without doubt that the impression was not made by the individual in question.

While there is a strong logical argument that the “Did Not Make” category has a scientific basis, since one missing element, or an additional one such as polydactyly, can disprove a hypothesis, the category “Did Make” is also expressed in absolute terms. This, for the reasons advanced earlier, cannot be supported from a logical or statistical perspective. To compound this error, the other qualitative categories are, in reports submitted to the court by practitioners of this technique, set out in a continuous scale whereby the difference between each category is indicated as being identical. That is, the qualitative categories are assumed to be scored using an interval scale of measurement.24 This is mathematically impossible, since many of the individual elements of the footprint impression that are aggregated, through some unexplained mental process, are not measured on an interval scale. Rather, the crime scene impressions may be simply described as “large”, “curved” etc. It is mathematically impossible to sum these individual components of the pattern, or indeed to perform the other mathematical operations required to produce equal differences between the various categories used to define the “confidence intervals” employed in this technique. Indeed, the term “confidence interval” is nonsensical here because it implies a standardised normal distribution, but the mathematical assumptions underlying this distribution are violated when comparisons are made on the basis of judgments such as “greater”, “curved” or “present/not present”.
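The scale-of-measurement point can be made concrete with a few lines of code. The category labels are taken from the list above; the two numeric codings are deliberately arbitrary inventions, which is precisely the point.

```python
# A small illustration of the measurement-scale argument: assigning
# consecutive integers to ordinal verbal categories silently assumes
# equal spacing that the underlying judgments cannot support.
categories = ["Insufficient Detail", "Support", "Strong Support", "Did Make"]

# Coding A: consecutive integers (the implicit assumption criticised above).
coding_a = {c: i for i, c in enumerate(categories)}      # 0, 1, 2, 3

# Coding B: any other order-preserving assignment is equally defensible,
# because the judgments are ordinal, not interval.
coding_b = dict(zip(categories, [0, 1, 5, 50]))

reports = ["Support", "Support", "Strong Support", "Did Make"]

mean_a = sum(coding_a[r] for r in reports) / len(reports)
mean_b = sum(coding_b[r] for r in reports) / len(reports)

# The "average degree of support" changes radically with the arbitrary
# coding, so arithmetic on these categories carries no scientific meaning.
print(mean_a, mean_b)   # 1.75 versus 14.25
```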
21 The American Board of Forensic Odontology has promulgated this definition. 22 See Thompson and Cole, n 4, pp 44-45. 23 Yamashita AB, “Forensic Barefoot Morphology Comparison” (2007) 49 Canadian Journal of Criminology and Criminal Justice 647; Kennedy RB, Pressman IS, Chen S, Petersen PH and Pressman AE, “Statistical Analysis of Barefoot Impressions” (2003) 48(1) Journal of Forensic Science 55. 24 The four scales of measurement, in ascending order of sophistication, are: nominal, ordinal, interval and ratio. In a nominal scale, one item is simply different from another, eg male or female. In an ordinal scale, one item may be said to be lesser or greater than another but not by a defined amount, eg a sergeant has a higher rank than a private. In an interval scale, one item can be defined as being greater or lesser than another by a defined amount, eg 101 km/h is greater than 100 km/h by the same amount as 61 km/h is greater than 60 km/h. A ratio scale has the same properties as an interval scale with the additional property that it incorporates a true zero point as its origin. The mathematical operations that are admissible increase with the increasing sophistication of measurement scales.

In short, the use of the term “confidence interval” and the graphical presentation of the categories used to support or disconfirm the hypothesis that a particular individual “did make” the footprint pattern are not only nonsensical, but they give the grossly misleading impression of scientific accuracy when none necessarily exists. Similar problems exist with other qualitative scales routinely used in forensic identification.

What of the probabilities routinely quoted by experts when presenting the results of their analysis? Often these refer to comparisons between exemplar specimens such as fingerprints or weight-bearing patterns of the feet. For example, suppose that the probability of a random match between individuals based on the exemplar studies of Kennedy and colleagues in barefoot morphology is, as they claim, less than one in a hundred million.25 Then suppose that the false positive rate for any particular examiner, based on repeated trials, is, say, 5% when comparing samples found at a crime scene with exemplar prints. The combined probability of error is found by the sum of these error rates, ie the additive law of probabilities applies. As Koehler26 noted, if experts make false positive errors when comparing impressions found in situ with exemplar impressions, then that is the rate-limiting factor which determines and controls the match report. Studies of jurors in the United States, however, show that this basic law of probability is not understood; jurors give more weight to the low probability of random matches, which is dwarfed by the false positive rate.27
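A worked version of the hypothetical figures above shows how completely the examiner's false positive rate dominates the random match probability:

```python
# Worked example using the figures hypothesised in the text: a vanishingly
# small random match probability (RMP) is swamped by even a modest examiner
# false positive rate (FPR).
rmp = 1e-8   # claimed probability of a random match (< 1 in 100 million)
fpr = 0.05   # assumed examiner false positive rate over repeated trials

# Probability that a reported "match" is erroneous from either source.
# The text applies the additive law; for rare, effectively exclusive
# events the exact union differs only negligibly.
additive = rmp + fpr
exact_union = 1 - (1 - rmp) * (1 - fpr)

print(f"additive: {additive:.6f}")      # ~0.050000
print(f"exact:    {exact_union:.6f}")   # ~0.050000: the FPR dominates
```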
While there is good reason to be deeply pessimistic about the actual rate of false positive errors in forensic identification, there is even more reason to be highly sceptical about the claims advanced by experts of various persuasions that their discipline does not suffer from this defect. For example, LPEs, whilst grudgingly accepting that some fingerprint identifications have been wrong, argue that after nearly a century of adversarial challenge there have been relatively few false positive errors exposed regarding latent fingerprint identification. Does this prove that false positive errors are few and far between in LPE forensic evidence presented to courts? No. It merely demonstrates the unlikelihood of such errors being exposed. In fact, when fingerprint comparison is presumed to be accurate unless proved otherwise, experience has shown that the process of rebuttal can be fraught with difficulty.

While there have been few properly controlled studies that provide objective evidence as to the false positive rate among LPEs, one such study is both noteworthy and disturbing. Dror and Charlton28 presented expert LPEs from across the world with latent and exemplar prints taken from actual criminal cases. Half of the prints had been categorised as individualisations whilst the other half had been excluded. Using a within-subjects experimental design, the same prints were presented to the same experts, many years after they had originally assessed them. Two-thirds of the experts made inconsistent decisions; ie they disagreed with themselves. The percentage of inconsistent decisions ranged from 12% when there was no overtly potentially biasing contextual information, to 16.6% when contextually biasing information (such as telling the participants that the suspect was in police custody at the time of the crime or that the suspect had confessed to the crime) was provided. The false positive rate (as determined by a rejection of an initial individualisation which was subsequently disconfirmed) was 10.4%.

This study is particularly noteworthy since the experimental design employed negates the arguments routinely advanced to support the contention that false positives in fingerprint identification are only possible if the examiners are not properly trained and not subjected to ongoing proficiency evaluation. Another study of false positive rates, using ROC analysis, was conducted with 32 forensic odontologists who were asked to determine whether or not four sets of photographs actually represented bite marks and, based on comparison of dental casts used in actual

25 Kennedy et al, n 23 at 62. 26 Koehler JJ, “Fingerprint Error Rates and Proficiency Test: What Are They and Why Do They Matter?” (2008) 59 Hastings Law Journal 101. 27 Koehler JJ, Chia A and Lindsey JS, “The Random Match Probability (RMP) in DNA Evidence: Irrelevant and Prejudicial?” (1995) 35 Jurimetrics 201. 28 Dror and Charlton, n 12.

cases, how certain they were that each set of teeth had made each bite mark.29 ROC analysis resulted in an accuracy score (AUC) of 0.86 with a 95% confidence interval ranging from 0.83 to 0.91. In other words, the average error rate was 14%. While the false positive rate was 0.02 or lower for ratings of “probable to reasonable medical certainty”, the results of this study need to be interpreted with considerable caution owing to the small number of comparisons used and the decision to declare whether a false positive had occurred on the basis of the evidence given by the original examining dentist in a court case.

An obvious factor limiting the accuracy of comparison of evidence obtained at a crime scene with exemplar evidence is the often incomplete nature of the former. For example, with respect to fingerprint evidence, it is usually considered that there are 80 minutiae that can be compared between exemplar samples; with barefoot morphology it is argued that there are 200 different measurements of the weight-bearing patterns of the feet that can be taken.30 Usually not all of these indicia can be reliably observed from crime scene evidence. Accordingly, the comparison of such evidence with exemplar specimens is based on a subset of the potentially available comparisons. This has profound consequences when considering the claims for probability of matches derived from multivariate statistical models of exemplar evidence. Usually these models are based on principal component analysis or factor analysis;31 these multivariate statistical techniques enable the underlying components or factors to be identified from the multitude of possible combinations of individual indicia. These factors are a mathematical construct which, in essence, groups together measurements of individual indicia. A prosaic example is the use of factor analysis to determine underlying personality factors on the basis of responses to a large number of individual questions, as is common in psychometric testing.32 Insofar as all the underlying factors derived from these analyses are mathematically independent,33 the probability of, say, five such factors all agreeing can be determined by application of the multiplicative law of probabilities. This is the method used to calculate the probability of getting “heads” 10 times in a row by tossing an evenly weighted coin.

However, applying these statistical techniques is fraught with difficulties when comparing crime scene and exemplar evidence. This is analogous to trying to ascertain an individual’s underlying personality factors when they have only completed 20% of a personality questionnaire. Faced with this difficulty, forensic identification often proceeds from a far less rigorous foundation. Consider the 16-point comparison standard for LPE adopted in most of Europe. Applying simple combinatorial mathematics, it may be determined that there are 3,160 possible pairwise comparisons between minutiae, based on a total of 80 minutiae found in exemplar fingerprints. In the face of this fact, what is the probative value in stating that 16 points of comparison match?
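Both calculations referred to above can be checked with the Python standard library (a verification sketch, not part of the original analysis):

```python
# A minimal check of the two calculations referred to above.
from math import comb

# Ten heads in a row with a fair coin: the multiplicative law for
# independent events.
p_ten_heads = 0.5 ** 10
print(p_ten_heads)            # 0.0009765625, ie about 1 in 1024

# Pairwise comparisons available between 80 minutiae: C(80, 2).
print(comb(80, 2))            # 3160, as stated in the text
```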
There is, fortunately, a relatively simple means of providing statistical guidance to those charged with assessing these sorts of situations, although to the authors’ knowledge it has never been used in any forensic identification case. Applying the cumulative binomial distribution, and assuming that the examiner is only guessing when making each decision, the probability of matching 16 or fewer comparisons out of a possible 80 in fingerprint analysis is 2.93E-08. Counter-intuitively, in this example the cumulative probability goes down as the accuracy of each individual comparison goes up; with a probability

29 Arheart KL and Pretty IA, “Results of the 4th ABFO Bitemark Workshop – 1999” (2001) 124 Forensic Science International 104. 30 Kennedy et al, n 23. 31 Principal component analysis and factor analysis are based on common underlying mathematical premises such that discussion of the two techniques is often conflated in statistical texts. 32 Cattell RB and Krug SE, “The Number of Factors in the 16PF: A Review of the Evidence with Special Emphasis on Methodological Problems” (1986) 46 Educational and Psychological Measurement 509. 33 This assumes that the final factor rotation is orthogonal (ie mathematically independent).

of 0.9 of making a correct match for each point of comparison made, the cumulative probability of only getting 16 matches is 5.13E-49, ie 48 zeros after the decimal point followed by 513.34 This is scarcely impressive. In other forensic identification disciplines, the situation is much worse. Returning to the example of barefoot morphology, the cumulative probability of only getting 10 or fewer matches out of a possible 200 points of comparison is 7.87E-175 when the probability of making each comparison accurately is 0.9. This latter example is used since it is precisely this example that has been advanced in cases involving forensic identification via barefoot morphology.

What is so different between forensic identification evidence and other evidence that enables this minute probability to be regarded as indicating “Strong Support” for the hypothesis that the crime scene evidence matches the exemplar evidence obtained from a suspect? Clearly, other factors such as the overall pattern must be considered by analysts but, in the absence of all the data used in deriving principal component/factor analysis models, this cannot be done except intuitively and, as has been previously pointed out, there is no information on how this intuitive matching of patterns is done. Nor, in the absence of ROC analysis, is there anything to suggest that this process, per se, has anything like the same accuracy as comparisons between exemplar evidence. One thing is, however, made abundantly clear by the statistical analyses set out herein. There is no scientific foundation to support the claim that categorisations such as “Did Not Make”, “Strong Support” and the like are validated by evidence based on the very small number of comparisons between elements or minutiae or patterns of whatever type typically cited in forensic identification reports.
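For readers wishing to verify the cumulative binomial figures quoted above, the following sketch uses scipy (an assumed dependency; the article's own footnote points to an online binomial calculator). The printed values should correspond to the probabilities reported in the text.

```python
# Reproducing the cumulative binomial probabilities quoted in the text.
from scipy.stats import binom

# P(16 or fewer of 80 comparisons "match") if the examiner is guessing
# (p = 0.5): the text reports 2.93E-08.
print(binom.cdf(16, n=80, p=0.5))

# The same tail probability when each comparison is 90% accurate
# (p = 0.9) is far smaller still: the text reports 5.13E-49.
print(binom.cdf(16, n=80, p=0.9))

# Barefoot morphology: 10 or fewer matches out of 200 possible points
# of comparison at p = 0.9: the text reports 7.87E-175.
print(binom.cdf(10, n=200, p=0.9))
```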
OBSERVER EFFECTS

Sir George Jessel in Lord Abinger v Ashton, commenting on expert evidence, expressed the following view:

An expert is not like an ordinary witness, who hopes to get his expenses, but he is employed and paid in the sense of gain, being employed by the person who calls him. Now it is natural that his mind, however honest he may be, should be biased in favour of the person employing him, and accordingly we do find such bias … Undoubtedly there is a natural bias to do something serviceable for those who employ you and adequately remunerate you.35

Irrespective of financial gain, other psychological processes inevitably play a large part in the decisions made by expert witnesses of whatever persuasion, as was obliquely recognised in this opinion by the reference to doing “something serviceable”. These processes have long been recognised and are repeatedly demonstrated. The “placebo effect” is perhaps the best known example of the way in which observers interact with the object, process or person being observed so as to affect the observation. Even animals are affected by observers’ expectations and change their behaviours accordingly, as was demonstrated by the case of “Clever Hans”, an Arabian stallion in the late 19th and early 20th century who could, so it appeared, count, add and subtract,36 and could still perform these feats when his trainer was absent. While Hans was undoubtedly clever, he could not count; rather, he responded to subtle changes in the facial expressions of those observing him as he was tapping out his responses to mathematical problems with his hoof. If this result has been seen to occur with animals, it is also likely to occur with that most suggestible of animals – humans.

In an exhaustive and seminal review, Risinger and colleagues37 have enumerated these and other problems of expectation and suggestion in forensic science. These include “observer effects”,

34 See http://www.stattrek.com/Tables/Binomial.aspx. 35 Lord Abinger v Ashton (1873) 17 LR Eq 358 at 374 (emphasis added). 36 Sebeok TA, The Clever Hans Phenomenon: Communication with Horses, Whales, Apes and People (New York Academy of Sciences, 1970). 37 Risinger DM, Saks M, Thompson WC and Rosenthal R, “The Daubert/Kumho Implications of Observer Effects in Forensic Science: Hidden Problems of Expectation and Suggestion” (2002) 90(1) California Law Review 1.

“decision thresholds”, “anchoring effects”, “role effects”, “conformity effects” and “experimenter effects”. These problems are insidious. They are, in many respects, more troublesome than fraudulent conduct, since they can lead otherwise competent and honest scientists into offering sincere conclusions that are inaccurate. While detailed analysis of all these factors is beyond the scope of this article, it is apposite to consider these effects as they apply to perception and cognition under the conditions of ambiguity and subjectivity that apply in many cases of forensic identification.

Under conditions of ambiguity and subjectivity, it has been repeatedly demonstrated that decision thresholds change so that, in response to identical stimuli, response biases arising from expectancies or reinforcing effects can unconsciously affect accuracy of prediction. Suppose that observers have to look at a dark visual field and determine whether a very dim light, at or below their visual threshold, is on or off. Then suppose that when they are correct in the sense of obtaining a true positive (ie when they say “yes” when a light is on) they are rewarded with, say, $10. This is a classic experiment in perceptual psychology; overwhelmingly, observers respond to this sort of situation by saying “yes” more frequently, but their perceptual capacity remains unaltered. They have simply become more willing to say “yes” because of reinforcement contingencies.

Reinforcement comes in many non-monetary forms, as the case of Clever Hans demonstrates. In a forensic context, these may include praise from an investigator for “getting it right” or relief from anxiety/pressure by “getting it done”; in the larger context, a forensic scientist may be regarded as a subject in an experiment, the setting of which is a forensic laboratory. As Risinger et al noted:

The beliefs and expectancies of superiors, coworkers and external personnel are manifest in their behaviour toward the forensic scientist “subject”, in turn affecting the behaviour of these “subjects” – their observations, recordings, computations and interpretations – not to mention the additional impact role and conformity effects may have.38

The weight of research and epistemological principles renders it impractical – if not immoral – to ignore these and other elements of the observer effects.
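The criterion shift described in the dim-light experiment can be expressed in signal detection terms. In the following sketch all numbers are illustrative and scipy is assumed to be available: sensitivity (d′) is held fixed while the response criterion is relaxed, as a payoff for saying "yes" would encourage.

```python
# Sketch of the payoff effect above in signal detection terms: rewarding
# "yes" responses shifts the observer's criterion, raising both hit and
# false alarm rates while sensitivity (d') is unchanged.
from scipy.stats import norm

d_prime = 1.0          # fixed perceptual sensitivity
for label, criterion in [("neutral payoff", 0.5), ("reward for 'yes'", -0.5)]:
    hit_rate = norm.sf(criterion - d_prime)   # P("yes" | light on)
    fa_rate = norm.sf(criterion)              # P("yes" | light off)
    print(f"{label}: hits={hit_rate:.2f}, false alarms={fa_rate:.2f}")

# Both rates rise under the lenient criterion, yet
# d' = z(hit rate) - z(false alarm rate) stays at 1.0 in each case.
```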
Referring to the mandate arising from Kumho Tire Co v Carmichael,39 to evaluate the reliability of expert opinion evidence whenever “their factual basis, data, principles, methods or their application are called sufficiently into question”, Risinger et al put it thus:

For what could more centrally call into question the methodology by which a particular conclusion was reached than the uncontrolled presence of the precursors of the various observer effects, which render it impossible to say with confidence whether or not the conclusion is merely an artefact of these conditions?40

A similar implied warning against the dangers of simply accepting what an “expert” has to say, without more closely examining what has led them to say it, was issued by Heydon JA (as he then was) in Makita (Australia) Pty Ltd v Sprowles:

[T]he expert’s evidence must explain how the “field of specialised knowledge” in which the witness is expert by reason of “training, study or experience”, and on which the opinion is “wholly or substantially based”, applies to the facts assumed or observed so as to produce the opinion propounded.41

This arguably applies not just to the “specialised knowledge” which the expert possesses, but to the process by which it has been applied to the facts of the case in hand, in order to leave the court in no doubt that no errors – human, statistical or methodological – have been allowed to compromise the conclusion(s) reached.42

38 Risinger et al, n 37 at 21. 39 Kumho Tire Co v Carmichael 526 US 137 (1999) at 149. 40 Risinger et al, n 37 at 54. 41 Makita (Australia) Pty Ltd v Sprowles (2001) 52 NSWLR 705 at 744. 42 For a recent example of two experts duelling it out over their respective choices of statistical models for the creation of a population database from which certain DNA comparisons could be made in an issue involving paternity, see Bropho v The State of Western Australia [No 2] [2009] WASCA 94.


CONCLUSIONS

With the foregoing arguments in mind, it is but a small step to regard a courtroom as the setting for an experiment in which the forensic scientist “subject” is reinforced, both subtly and overtly, on the basis of the opinions proffered. The consequences that proceed from this recognition are unambiguous. Unreliable forensic identification evidence must not be allowed to contaminate criminal trials, since to do so will make it more likely that sloppy, invalid and epistemologically flimsy “expert” evidence will be nurtured by the legal system. Somewhat bleakly, Edmond et al have argued that, in the Australian context:

extant admissibility jurisprudence and traditional safeguards associated with expert opinion evidence and the adversarial system might not adequately protect those accused of committing criminal acts when they are confronted with incriminating expert identification evidence.43

This is well put. In R v Tang, a case concerning the admissibility of forensic identification evidence based on facial mapping, Spigelman CJ referred to the issue of the reliability of such evidence as “extraneous”.44 However, the concept of reliability of scientific evidence has attracted more direct judicial comment in a number of appeals and inquiries. Over a quarter of a century ago, in a case in which the essential issue was whether or not a psychologist might testify as to the likely emotional effect on a young man of being told by his girlfriend that she had been unfaithful to him, and that her expected child was not his, the English appeal judge Lawton LJ observed:

Before a court can assess the value of an opinion it must know the facts upon which it is based … It is wrong to leave the other side to elicit the facts by cross-examination.45

This elementary precaution is, it is submitted, even more appropriate when considering the scientific principles upon which such an opinion is based. It has long been held that the simplest way to destroy an opponent’s expert evidence is to destroy the factual platform upon which it is constructed. Why should that not also be true of the epistemological underpinnings or the mathematical and statistical constructs which have led to that same opinion?

There has been no shortage of similarly expressed sentiments from Australian judges. In R v Anderson, Winneke P, in the Victorian Court of Appeal, pointed out that:

an opinion is only as good as the factual or scientific basis upon which it is expressed; and if no such basis is given or, if given, can be seen to be speculative or irrelevant to the opinion expressed, then the opinion will be worthless.46

His Honour might equally have added bogus or flawed statistical techniques to the list of factors which can render so-called “expert” evidence “worthless”. The argument for rejecting unreliable methodology is perhaps more obvious when the specialist area in which it is being employed is itself somewhat “marginal” in terms of its acceptance by mainstream science, but logically and ethically it ought to be equally applicable (and is perhaps in more urgent need of reconsideration) in those areas of forensic science such as fingerprinting and grouping which have long since passed into the daylight of judicial acceptance.

One of the bluntest judicial warnings against being blinded by pseudo-science was issued in the context of a newly emerging, and somewhat suspect, branch of forensic investigation, namely forensic odontology. Undeterred by the Queensland Court of Appeal’s trenchant rejection47 of bite mark

43 Edmond G, Biber K, Kemp R and Porter G, “Law’s Looking Glass: Expert Identification Evidence Derived from Photographic and Video Images” (2009) 20(3) Current Issues in Criminal Justice 338 at 338. 44 R v Tang (2006) 65 NSWLR 681 at [137]; 161 A Crim R 377. 45 R v Turner [1975] QB 834 at 840. 46 R v Anderson (2000) 1 VR 1 at 25; 111 A Crim R 19. 47 In R v Carroll (1985) 19 A Crim R 410.

comparisons as a respected method of identifying a criminal offender, the Northern Territory DPP tried again in Lewis v The Queen,48 only to have such evidence rejected as “subjective”. During the course of this process, Maurice J confirmed the thesis which underlies this article when he opined that:

The inability to articulate the principal tenets that need to be understood, to describe in ordinary language the methods used and the reasons that point to a particular conclusion, these are the hallmarks of unreliable science and the not-so-qualified expert.49

This point was more recently and cogently affirmed in a Canadian appellate case concerning the admissibility of barefoot morphology evidence. In R v Dimitrov, the court noted that “novel scientific theories or techniques are subject to ‘special scrutiny’”50 and that the burden is on the party putting forth the expert to “establish its reliability on the balance of probabilities”. Rejecting the admissibility of barefoot morphology evidence on the grounds of “significant issues as to the reliability of the evidence”,51 the court referred to earlier decisions of the Canadian Supreme Court and commented as follows:

In R v J-LJ (2000) 148 CCC (3d) 487 (SCC) at paras 34 and 28 respectively, the Supreme Court of Canada noted that the admissibility of expert evidence is highly case specific and that the trial judge is to take seriously the role of “gatekeeper”. The court set out the following factors that should be considered in determining the threshold reliability: (1) whether the theory or technique can be and has been tested; (2) whether the theory or technique has been subjected to peer review and publication; (3) the known or potential rate of error or the existence of standards; and (4) whether the theory or technique used has been generally accepted within the scientific community.52

It is not only with respect to novel methods of forensic identification that the known or potential rate of error, from all sources, must be considered when determining the admissibility or otherwise of expert evidence; it must be considered for all methods of forensic identification, including those that are presumed to be accurate at law. As has been demonstrated above, the known or potential rate of error in those scientific methods of forensic identification long accepted by the courts is often higher than practitioners of the various methods accept. Further, the statistical basis for the verbal descriptors used to define degrees of consistency or confidence is startlingly weak; for the overwhelming majority of forensic identification techniques, it is non-existent or mathematically invalid. To reiterate, it is mathematically invalid to refer to categories based on verbal descriptors such as “strongly supports”, “supports” and the like as if the difference between these categories is equal when the underlying data are based on a nominal scale of measurement. In addition, there is often no statistical justification for forensic scientists asserting that their evidence provides “strong support” for the claim that latent and exemplar evidence matches; rather, these claims are usually ex cathedra statements.
To paraphrase the observations of Dixon CJ in Plomp v The Queen: degrees of consistency or confidence that are not founded on a scientific basis are not consistent, nor do they afford any reasonable degree of confidence on which to base a decision.53 It is trite to state that, before admitting scientific evidence, the courts must be satisfied that such evidence is, indeed, scientific. The problem is making this decision in the face of confident assertions from “experts” that are often unwarranted. From a scientific perspective, one obvious way of dealing with this problem is to ask fundamental questions concerning the scientific method employed by forensic experts, of whatever persuasion. As a simple prophylactic to unwarranted confidence, if not outright ignorance, forensic experts should be challenged with the following questions:

(1) What is the diagnostic accuracy of the procedure or method when comparing latent and exemplar specimens?

48 Lewis v The Queen (1987) 88 FLR 104; 29 A Crim R 267. 49 Lewis v The Queen (1987) 88 FLR 104 at 124 (emphasis added); 29 A Crim R 267. 50 R v Dimitrov (2003) 68 OR (3d) 641 at [37]. 51 R v Dimitrov (2003) 68 OR (3d) 641 at [56]. 52 R v Dimitrov (2003) 68 OR (3d) 641 at [38] (emphasis added). 53 Plomp v The Queen (1963) 110 CLR 234 at 243.

(2) What is the false positive error rate of the procedure or method when comparing latent and exemplar specimens?
(3) What statistical bases are there for the verbal anchors used to describe the confidence of the expert vis-à-vis diagnostic accuracy?
(4) What is the test-retest error rate between and within experts when comparing latent and exemplar specimens under conditions of double-blind testing?

It is predicted that the answers to these obvious questions, which are fundamental to establishing the epistemological basis of forensic identification evidence, will prove illuminating, if not alarming, to judicial gatekeepers.
