Center for Statistics and Applications in Forensic Evidence (CSAFE): Presentations and Proceedings

12-11-2018

Forensic Statistics and the Assessment of Probative Value

Hal Stern University of California, Irvine


Part of the Forensic Science and Technology Commons

Recommended Citation
Stern, Hal, "Forensic Statistics and the Assessment of Probative Value" (2018). CSAFE Presentations and Proceedings. 20. https://lib.dr.iastate.edu/csafe_conf/20

This Presentation is brought to you for free and open access by the Center for Statistics and Applications in Forensic Evidence at Iowa State University Digital Repository. It has been accepted for inclusion in CSAFE Presentations and Proceedings by an authorized administrator of Iowa State University Digital Repository. For more information, please contact [email protected].

Disciplines: Forensic Science and Technology

Comments: Posted with permission of CSAFE.

This presentation is available at Iowa State University Digital Repository: https://lib.dr.iastate.edu/csafe_conf/20

Forensic Statistics and the Assessment of Probative Value

OSAC Meeting Phoenix, AZ December 11, 2018

Hal Stern
Department of Statistics
University of California, Irvine
[email protected]

Interesting times in forensic science

Evaluation of forensic evidence
• Forensic examinations cover a range of questions
– timing of events
– cause/effect
– source conclusions
• Focus here on source conclusions
– topics addressed (e.g., need to assess uncertainty, logic of the likelihood ratio) are relevant beyond source conclusions
• The task of interest for purposes of this presentation: assess two items of evidence, one from a known source and one from an unknown source, to determine if the two samples come from the same source
– Bullet casing from test fire of suspect's gun
– Bullet casing from the crime scene

The Daubert standard
• Daubert standard (Daubert v. Merrell Dow Pharmaceuticals, 1993) governs admission of scientific expert testimony in federal courts
– judge as gatekeeper
– conclusions should be the product of applying a scientific methodology
– relevant factors for judge to consider
• Has the technique been tested in actual field conditions (and not just in a laboratory)?
• Has the technique been subject to peer review and publication?
• What is the known or potential rate of error?
• Do standards exist for the control of the technique's operation?
• Has the technique been generally accepted within the relevant scientific community?
– applies to all expert evidence (Kumho Tire Co. v. Carmichael, 1999)

• Frye standard (Frye v. United States, 1923)
– general acceptance in the relevant scientific community is the standard
– applicable to novel scientific evidence

FRE Rule 702 (post-Daubert)
A witness who is qualified as an expert by knowledge, skill, experience, training, or education may testify in the form of an opinion or otherwise if:
a) the expert's scientific, technical, or other specialized knowledge will help the trier of fact to understand the evidence or to determine a fact in issue;
b) the testimony is based on sufficient facts or data;
c) the testimony is the product of reliable principles and methods; and
d) the expert has reliably applied the principles and methods to the facts of the case.

Logic of forensic examinations
• Examine two samples to identify similarities and differences
• Assess similarities and differences to see if they are expected (or likely) under the same source hypothesis
• Assess similarities and differences to see if they are expected (or likely) under the different source hypothesis

Evaluation and interpretation of forensic evidence

• Approaches

– Expert assessment based on experience, training, use of accepted methods. Typically summarized by a categorical conclusion (e.g., identification / exclusion / inconclusive)

– Two-stage procedure (see, e.g., Parker and Holford in the 1960s) • similarity (binary decision based on distance/score) • identification (likelihood of coincidental match)

– Likelihood ratio (sometimes known as the Bayes factor)

Satisfying Daubert / FRE 702

• Application of any of these approaches should be supported by evidence regarding how well (how reliably) they perform
• Examples:
– Studies of reliability and validity of measurements (e.g., chemical composition of glass)
– Peer-reviewed studies of techniques/models
– Studies of reliability and validity of examiner conclusions
• Important to also recall that the approach needs to be "reliably applied … to the facts of the case" (e.g., N.C. v. McPhaul, 2017)

Forensic Evidence as Expert Opinion
• Status quo in pattern disciplines (fingerprints, shoe prints, firearms, toolmarks, questioned documents, etc.)
• Examiner analyzes evidence based on
– Experience
– Training
– Use of accepted methods in the field
• Assessment of the evidence reflects examiner's expert opinion
• Conclusions typically reported as categorical conclusions
– Identification, Exclusion, Inconclusive
– Multi-category scales (some support, strong support, very strong support, …)

Forensic Evidence as Expert Opinion
• Occasionally conclusions are expressed as statements about the hypotheses rather than the evidence, e.g., "based on the evidence, the author of the known samples …"
– Wrote the questioned sample
– Highly probable wrote the questioned sample
– Probably wrote the questioned sample
– Indications may have written the questioned sample
– with similar statements on the negative side
• This is logically problematic
– It is a statement about the likelihood of a hypothesis ("same source") after viewing the evidence
– But, as we will see later, this conclusion must also reflect in part the examiner's a priori (pre-evidence) opinion about the hypothesis

Forensic Evidence as Expert Opinion
• What does it take to establish that testimony is
– "based on sufficient facts or data"
– "the product of reliable principles and methods"
• Note that the use of the word "reliable" in the legal sense (trustworthy) differs from its technical use in statistics
• In measurement / assessment, statisticians focus on a number of related concepts in thinking about "reliability":
– Would the same analyst draw the same conclusion in a new examination of the evidence (repeatability)
– Would different analysts draw the same conclusion given the same evidence (reproducibility)
– Repeatability and reproducibility are both components of reliability
– Do analysts get the right answer in studies where the ground truth is available (accuracy / validity)

Reliability of Measurements: An Example from Handwriting
• 5 forensic document examiners (FDEs) rated 123 signatures in terms of difficulty to simulate on a 5-point scale (easy - fairly easy - medium - difficult - very difficult)

• Assessing reproducibility (similarity of assessments by two different examiners) − Correlation of ratings of each pair of FDEs (.62 - .75) − Statistical model (intraclass correlation coefficient) (.65)
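A minimal sketch of the kind of reproducibility summary described above: a Pearson correlation between two examiners' ratings of the same signatures. The ratings below are hypothetical (the study's actual data are not reproduced here), and the study itself used an intraclass correlation model rather than simple pairwise correlations.

```python
# Pearson correlation between two examiners' difficulty ratings
# (1 = easy ... 5 = very difficult) as a simple reproducibility
# summary. All ratings below are fabricated for illustration.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

fde1 = [1, 2, 2, 3, 4, 5, 3, 2, 4, 5]  # examiner 1 (hypothetical)
fde2 = [2, 2, 3, 3, 4, 4, 3, 1, 5, 5]  # examiner 2 (hypothetical)
print(round(pearson(fde1, fde2), 2))   # 0.85
```

An intraclass correlation coefficient, as used in the study, additionally penalizes systematic differences in level between examiners, which a Pearson correlation ignores.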

• Assessing repeatability (similarity of assessments by the same examiner at two different times) … a very small study with only 7 signatures
− Correlation of ratings (range from .40 - .88)
− Statistical model estimates .68

Forensic Evidence as Expert Opinion

• PCAST report called for assessment of
– Foundational validity of a forensic science discipline
– Validity as applied in a particular case
• Foundational validity
– A method can in principle be reliable (in the legal sense)
– PCAST advocated for multiple "black box" studies
• Validity as applied
– Proficiency testing (this person can do the task)
– Case report establishing it has been applied appropriately in this case
• PCAST report has been controversial

Forensic Evidence as Expert Opinion

• Example of a (PCAST-style) "black box" study
– Having examples with known "ground truth" allows estimation of error rates
– Ulery et al. (2011) "black box" study of latent print decisions
• false positive rate was 0.1%
• false negative rate was 7.5%
– There are limitations in this and any study (similarity to case work, case environment?)
– Same group carried out a series of "white box" studies in fingerprints to assess
• Reliability of different steps in the examination process (e.g., marking of minutiae)

Forensic Evidence as Expert Opinion

• Reliability and validity are likely to depend on characteristics of the evidence, e.g.,
– quality of latent print
– complexity of a signature
• Studies should address this and would allow statements like "for evidence of this type …"

Forensic Evidence as Expert Opinion

Forensic Evidence as Expert Opinion

• A few final remarks on forensic evidence as expert opinion
– Information on reliability and accuracy for forensic analyses is extremely helpful and will likely be increasingly requested
– As per FRE 702, there is also a need to address application of the method or technique in the current case (e.g., N.C. v. McPhaul, 2017)
– There will always be unique situations without relevant empirical studies (e.g., did this typewriter produce this note)
• Not necessarily a problem as long as the lack of relevant empirical evidence is acknowledged

The Two-Stage Approach
• Stage 1 - Similarity
– Statistical test or procedure to determine if the two samples "are indistinguishable", "can't be distinguished", "match", etc.

• Stage 2 - Identification
– Assessment of the probability that two samples from different sources would be found indistinguishable

• Used in assessment of trace evidence (like glass)

• Conceptually many other disciplines appear to act in this way

The Two-Stage Approach
• Stage 1 - Similarity
– Statistical test or procedure to determine if the two samples "can't be distinguished", etc.
– This is natural (and easy!) when the evidence is a discrete characteristic or a categorical trait (e.g., DNA alleles)
– When data are quantitative measurements there is a loss of information in summarizing them by a binary decision
• "Can't be distinguished" might mean an exact match of measures
• "Can't be distinguished" might mean samples just miss being significantly different

The Two-Stage Approach
• Stage 1 – Example (Curran et al. 1997)
– Comparing two glass samples (crime scene, suspect) based on aluminum (Al) concentration measurements
– Five Al measurements from crime scene sample (.751, .659, .746, .772, .722)
– Five Al measurements from suspect sample (.752, .739, .695, .741, .715)
– Perform a statistical test of the hypothesis that the two samples come from populations having the same mean Al concentration
– p-value of .70 → no reason to reject the hypothesis of equal means
– Thus might conclude that the two samples are "indistinguishable"
• Technical issues
– Typically test many elements
– Test relies on assumptions about the measurements (e.g., normal distribution)
– Testing one element at a time ignores information in the correlations
– But we do not focus on these …
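A textbook version of the Stage 1 test can be sketched on the aluminum concentrations from the Curran et al. example. The sketch below uses a pooled two-sample t statistic; this is an assumption about the test variant, and its p-value need not match the .70 reported on the slide, which may come from a different procedure. Either way, the test statistic is tiny and the equal-means hypothesis is not rejected.

```python
# Stage 1 similarity test: pooled two-sample t test on the Al
# concentrations from the Curran et al. example (standard library
# only). A textbook sketch, not necessarily the original procedure.
from math import sqrt

crime = [0.751, 0.659, 0.746, 0.772, 0.722]
suspect = [0.752, 0.739, 0.695, 0.741, 0.715]

def pooled_t(x, y):
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)
    ssy = sum((v - my) ** 2 for v in y)
    sp2 = (ssx + ssy) / (nx + ny - 2)      # pooled variance
    se = sqrt(sp2 * (1 / nx + 1 / ny))     # std. error of mean difference
    return (mx - my) / se

t = pooled_t(crime, suspect)               # ≈ 0.07
# Two-sided .05 critical value for t with 8 degrees of freedom is
# 2.306; |t| is far below it, so the samples are "indistinguishable".
print(abs(t) < 2.306)
```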
The Two-Stage Approach
• Stage 1 – Conceptual issues
– The role of the hypothesis being tested
• Statistical tests of this form do not treat the two hypotheses (equal means, unequal means) symmetrically
• One hypothesis (equal means) is assumed true unless the data rejects it
• A possible concern with this approach is that the "default" position is incriminating
– Relying on a binary decision ("distinguishable / indistinguishable") requires specifying a threshold or cutoff
• Choice of threshold impacts the error or misclassification rate
• A "low" threshold makes it easy to reject the hypothesis of equal means (too many "distinguishable" decisions) and thus risks a failure to identify a matching suspect
• A "high" threshold makes it easy to accept the hypothesis of equal means (too many "indistinguishable" decisions) and thus risks incriminating an innocent person
– Separating the analysis into two stages is not ideal

The Impact of Threshold
[Figure: error rates as a function of threshold; black = type I error (false exclusion), red = type II error (false inclusion). A high threshold gives low type I error and high type II error; lowering the threshold increases type I error and lowers type II error.]

The Two-Stage Approach

• Stage 2 - Identification
– Assessment of the probability that two samples from different sources would be found indistinguishable
– Requires a database of samples from relevant sources
– Answer will depend on the representativeness of the available data
– Stage 2 information is rarely provided at present
• This is a problem … evidence presented is that two glass samples "cannot be distinguished" without further information

The Two-Stage Approach
• Stage 2 – How might it be done?
– Figure below shows the distribution of mean refractive index across many different windows
– Suppose we have a control sample (crime scene) measuring 1.522 (red line)
– For each possible source in the population (i.e., the figure), we can find the probability that a sample from that source would be found "indistinguishable" from the control
– Total these up to get probability of a coincidental match

– Having reliable information about the population of possible sources is critical (and challenging!!)

The likelihood ratio (LR)
• A current focus of much attention in forensic science is the likelihood ratio
• The LR is a statistical concept seen as a potential unifying logic for evaluation and interpretation of forensic evidence
• The LR already plays a role outside forensics in …
– Statistical inference (hypothesis tests)
– Evaluating evidence provided by medical diagnostic tests
• Europe has moved decisively in this direction (ENFSI Guideline)

The likelihood ratio (LR)
• E = evidence

• Hs = "same source" proposition (two samples have the same source)
• Hd = "different source" proposition (two samples have different sources)
• Bayes' Theorem

Pr(Hs | E) / Pr(Hd | E) = [Pr(E | Hs) / Pr(E | Hd)] × [Pr(Hs) / Pr(Hd)]

"a posteriori" odds in favor of the same source hypothesis = likelihood ratio (or Bayes factor) × "a priori" odds in favor of the same source hypothesis

• Details: role of task-relevant contextual information, terminology (LR vs Bayes factor) A critical difference: Pr(E | Hs) vs Pr(Hs | E)

• Pr(E | Hs) is an assessment of how likely the evidence is if the two samples (crime scene and suspect) have the same source
– If E is matching blood types, then we are likely to say this probability is approximately 1!
– If E is matching DNA profiles, then we are likely to say this probability is approximately 1!
– If E are two fingerprints with similar features and no clear differences, then we think this probability is high … but difficult to assign a numerical value

• Pr(Hs | E) is an assessment of how much we believe the same source hypothesis based on the observed evidence
– Previous formula shows that it is not possible to provide this probability without first having specified the "pre-evidence" Pr(Hs) and Pr(Hd)
– But we don't necessarily want our forensic experts to form pre-evidence opinions about this

The likelihood ratio (LR)

• LR = Pr(E | Hs) / Pr(E | Hd)
• A quantitative summary of the evidence
• The numerator asks
– How likely is the evidence if Hs is true
• The denominator asks
– How likely is the evidence if Hd is true
• Key point is that one must consider (at least) two competing hypotheses
• Important Caveat: There is evidence that lay audiences (and others) struggle to understand probabilities and the LR -- more on this later

How the LR works - DNA

A DNA profile
• Assume:
– crime scene sample (known to be from a single source)
– sample from a suspect
– matching DNA profiles
• Evidence E is two matching profiles
• Numerator of LR: probability of observing matching profiles with these values if there is a single source
• Denominator of LR: probability of observing matching profiles with these values if there are two different sources (also known as the "random match probability")

How the LR works - DNA

• How likely is a coincidental match?
• Can determine this for each marker given allele frequencies in the population

TH01 allele:  4     5     6     7     8     9     9.3   10    11
Frequency:    .001  .001  .266  .160  .135  .199  .200  .038  .001

Pr(TH01 = 7,9) = 2 × .160 × .199 = .064

• LR based only on TH01 = 1/.064 = 15.6
• Can compute for multiple markers and combine
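The TH01 calculation above can be sketched in a few lines. The allele frequencies are the ones tabulated above; the multi-marker combination at the end uses made-up LR values purely to show the multiplication step.

```python
# LR for a single STR marker using the TH01 allele frequencies
# above. For a heterozygous genotype (a, b) the random match
# probability is 2 * p_a * p_b; with Pr(E | Hs) taken as 1, the
# LR is its reciprocal. Independent markers multiply.
th01 = {"4": .001, "5": .001, "6": .266, "7": .160, "8": .135,
        "9": .199, "9.3": .200, "10": .038, "11": .001}

def heterozygote_lr(freqs, a, b):
    rmp = 2 * freqs[a] * freqs[b]   # random match probability
    return 1.0 / rmp

lr = heterozygote_lr(th01, "7", "9")
print(round(lr, 1))   # 15.7 (the slide rounds the RMP first, giving 15.6)

combined = 1.0
for marker_lr in [lr, 20.0, 12.5]:  # last two values are made up
    combined *= marker_lr            # independent markers multiply
```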

Example from Dawid and Thomas at https://plus.maths.org/content/os/issue55/features/dnacourt/index

How the LR works - DNA

“the gold • Can compute the numerator and denominator standard” probabilities because: • underlying biology is well understood • biological theory provides a probability model for the evidence • population databases are available to provide the numbers required by the probability model • method peer-reviewed by the scientific community • Note that even here there are questions • still subjective elements in calling alleles • increasingly sensitive techniques can lead to inadvertent contamination • assessment of DNA mixtures is challenging How the LR might work - trace evidence

• "Trace" evidence (e.g., glass fragments)
• May have broken glass at crime scene and glass fragments on suspect
• Evidence E comprises measurements of chemical concentrations of elements or other characteristics of the glass (e.g., refractive index)
• Can we construct a likelihood ratio for evidence of this type? Perhaps …

[Figure: measurements vary across the population of windows; measurements are relatively constant within a fragment.]

Data from Lund and Iyer, 2017

How the LR might work - trace evidence
• Have a well-defined and reliably measured characteristic (e.g., chemical concentration, refractive index)
• Requires a probability model to describe variation within a single sample (e.g., normal distribution?)
• Requires defining a relevant "different source" population (e.g., other windows)
• Requires a probability model to describe variation across different sources (note the complex shape)
• Terminology note: measurements here are continuous; models give likelihoods instead of probabilities
• There are published examples of how this might work:
– Aitken and Lucy (2004) – glass
– Carriquiry, Daniels and Stern (2000 technical report) – bullet lead

How the LR might work - trace evidence

• Challenges
– Many of the required pieces (described on the previous slide) are not available
– Depending on the evidence type may also need to account for
• Manufacturing process
• Distribution process
• Probabilities of transfer of evidence
– Assessing the "population" is hard (the neighborhood, the city, the world)
– Likelihood ratio can be quite sensitive to assumptions (Lund and Iyer, 2017)
• More on this later

Likelihood ratios for pattern evidence?

• Have a mark or impression at the crime scene ("unknown" or "questioned") and a mark or impression from a potential source ("known")
• Many examples
– Latent print examination
– Shoeprints and tire tracks
– Questioned documents
– Firearms
– Toolmarks

Likelihood ratios for pattern evidence?
• Even defining the evidence E is challenging
– The data are very high dimensional
– There is considerable flexibility in defining the number and types of features to look at (e.g., fingerprint minutiae, matching striae in ballistics)
– Typically E is taken to include observed similarities and differences
• As with trace evidence, formal evaluation here requires that we study two different types of variation
– Require information about the variation expected in repeated impressions from the same source (e.g., distortion of fingerprints)
– Require information about variation expected in impressions from different items in the population (i.e., the "coincidental match")
– May also need information about manufacturing, distribution, wear (e.g., for shoes)

Likelihood ratios for pattern evidence?

• How do we measure Pr(E | Hs) and Pr(E | Hd)?
• This is a very hard problem
– The data are so complex that it is hard to actually enumerate or assign probabilities (or likelihoods)
• Approaches:
– Probability models for features (Neumann et al., 2015) based on a distortion model and nearest non-match data

– Subjective likelihood ratios? (permitted by the ENFSI Guideline)
– Score-based likelihood ratios

ENFSI & Likelihood Ratios
• ENFSI has officially endorsed likelihood ratios (see the ENFSI Guideline for Evaluative Reporting in Forensic Science)
• Guideline cites four requirements for evaluative reporting: balance, logic, robustness, transparency
• Some key statements from the Guideline:
– Evaluate findings (evidence) with respect to competing hypotheses/propositions
– Evaluation should use probability as a measure of uncertainty
– Evaluation should be based on the assignment of a likelihood ratio
• According to the Guideline, probabilities in the likelihood ratio are ideally based on published data, but experience, subjective assessments, and case-specific surveys can be used as long as justified
– The use of experience-based or subjective probabilities has been viewed a bit more skeptically in the U.S.

ENFSI & Likelihood Ratios
• LRs reported as numbers or as verbal equivalents

Value of LR          Verbal equivalent: "The forensic findings …
1                    do not support one proposition over the other
2 – 10               provide weak support for the same source proposition relative to the different source proposition
10 – 100             provide moderate support for the same source proposition relative …
100 – 1,000          provide moderately strong support for the same source proposition relative …
1,000 – 10,000       provide strong support for the same source proposition relative …
10,000 – 1 million   provide very strong support for the same source proposition relative …
1 million +          provide extremely strong support for the same source proposition relative …
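The thresholds in the verbal-equivalence table above can be encoded as a simple lookup. The binning below is one reading of the table (it treats any LR below 2 as "no support" and leaves boundary conventions unspecified), not an official ENFSI implementation.

```python
# Map a numeric LR (>= 1) to an ENFSI-style verbal equivalent,
# following the table above. Small LRs (< 1) would be handled by
# applying the same scale to 1/LR with the propositions swapped.
SCALE = [
    (2, "do not support one proposition over the other"),
    (10, "weak support"),
    (100, "moderate support"),
    (1_000, "moderately strong support"),
    (10_000, "strong support"),
    (1_000_000, "very strong support"),
    (float("inf"), "extremely strong support"),
]

def verbal_equivalent(lr):
    for upper, phrase in SCALE:
        if lr < upper:
            return phrase

print(verbal_equivalent(15.7))   # moderate support
print(verbal_equivalent(5000))   # strong support
```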

– Similar statements apply to small LRs (values less than one), which support the different source proposition
• Verbal equivalents are less precise, but may be easier to understand

Score-based likelihood ratios
• The idea:
– Replace the evidence E by a "score" S summarizing difference/similarity of the two samples

– Fit a probability distribution to the scores of known matches (Pr(S | Hs))

– Fit a probability distribution to the scores of known non-matches (Pr(S | Hd))
– Score-based likelihood ratio for observed score S is the ratio of these two probabilities
• An alternative:
– Consider a score threshold (or multiple thresholds) and examine misclassification rates

Score-based likelihood ratios
• Example: FRStat software for latent prints (Swofford et al., 2018)
– Figure below shows "score" distributions for known-matching pairs of prints (light shade) and known-non-matching pairs of prints (dark shade)
– Separate pictures are provided depending on the number of minutiae identified by the analyst

Lund & Iyer (2017) – sensitivity of LRs
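The score-based construction described above can be sketched as follows. The scores are fabricated and the two reference distributions are fit as simple normals; real systems such as FRStat use much larger reference collections and different modeling choices.

```python
# Score-based LR sketch: fit normal distributions to similarity
# scores from known-match and known-non-match pairs, then take the
# ratio of the two densities at the observed score. All numbers
# below are fabricated for illustration.
from math import exp, pi, sqrt
from statistics import mean, stdev

def normal_pdf(x, mu, sigma):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

match_scores = [7.9, 8.4, 8.8, 9.1, 7.5, 8.2, 9.0, 8.6]      # known same source
nonmatch_scores = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.7, 3.3]   # known different source

def score_lr(s):
    num = normal_pdf(s, mean(match_scores), stdev(match_scores))
    den = normal_pdf(s, mean(nonmatch_scores), stdev(nonmatch_scores))
    return num / den

print(score_lr(8.0) > 1)   # score near the match distribution: LR > 1
print(score_lr(3.0) < 1)   # score near the non-match distribution: LR < 1
```

Note that a score-based LR summarizes the evidence only through S, so it inherits whatever information the score discards, which is one of the "challenges of their own" mentioned later.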


• Can fit different models to each set of data (e.g., black, red, green)
• Each gives the probability of getting a 4
• LRs vary and it is important to examine this variation
• But need to ensure we are focusing on "plausible" models
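The sensitivity point above can be illustrated with a toy version of the idea: two different models fit to the same reference data assign noticeably different probabilities to the observed value 4, so any LR built on them moves with the modeling choice. The data and both model choices below are fabricated and are not the datasets or models used by Lund & Iyer.

```python
# Sensitivity sketch in the spirit of Lund & Iyer (2017): fit a
# Poisson and a geometric model to the same (fabricated) counts
# and compare the probability each assigns to observing a 4.
from math import exp, factorial
from statistics import mean

data = [2, 3, 4, 3, 5, 2, 4, 3]    # fabricated reference counts
lam = mean(data)                    # Poisson fit (method of moments)
p = 1 / (1 + lam)                   # geometric fit, support 0, 1, 2, ...

poisson_p4 = exp(-lam) * lam ** 4 / factorial(4)
geometric_p4 = p * (1 - p) ** 4

print(round(poisson_p4, 2), round(geometric_p4, 2))   # 0.18 vs 0.08
```

Both fits are defensible summaries of the same eight observations, yet the implied probabilities differ by more than a factor of two, which is exactly why examining variation across plausible models matters.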

Likelihood ratio summary (1 of 2)
• Likelihood ratios are a (the?) logical way to evaluate and summarize forensic evidence
• Advantages
– Explicitly compares two relevant hypotheses/propositions
– Provides a quantitative summary of the evidence
– Avoids arbitrary match/non-match decisions when faced with continuous data
– Can potentially accommodate a wide range of factors (e.g., manufacturing, distribution, wear)
– Provides a mapping from a specified set of assumptions to a quantitative summary of the evidence
• This has the potential to enhance the transparency of the evidence assessment process

Likelihood ratio summary (2 of 2)
• Challenges
– Requires assumptions about the distribution of evidence
– Need to define the relevant reference distribution to define the denominator
– Can be difficult to account for all relevant factors (manufacturing, distribution, wear)
– LR is not uniquely determined (depends on features, probability model, population data available)
– Conveying LRs to the "trier of fact" is difficult
• Note: Score-based likelihood ratios are easier (not easy!) to develop but come with their own challenges

Expressing source conclusions (Thompson et al., 2018)

Conclusions
• Any approach to assessing the probative value of forensic evidence should:
– Account for the two (or more) competing hypotheses about how the evidence (data) were generated
– Be explicit about the reasoning and assumptions on which the assessment is based
– Have relevant empirical support for the reasoning and assumptions
– Include an assessment of the level of uncertainty associated with the assessment
• Focused here primarily on statistical issues associated with the examination / evaluation of evidence. Note that the language used in reports, testimony, and opening/closing statements is also critical.
• Contact: [email protected]