Further Development of SNP Panels for Forensics
The author(s) shown below used Federal funds provided by the U.S. Department of Justice and prepared the following final report:

Document Title: Further Development of SNP Panels for Forensics
Author(s): Kenneth K. Kidd
Document No.: 249548
Date Received: December 2015
Award Number: 2010-DN-BX-K225

This report has not been published by the U.S. Department of Justice. To provide better customer service, NCJRS has made this federally funded grant report available electronically.

Opinions or points of view expressed are those of the author(s) and do not necessarily reflect the official position or policies of the U.S. Department of Justice.

NIJ Final Technical Report
December 1, 2010 to April 30, 2014
Further Development of SNP Panels for Forensics
NIJ Grant# 2010-DN-BX-K225

Submitted by Kenneth K. Kidd (PI)
Professor of Genetics, Yale University School of Medicine
Email: [email protected]
Telephone: 203-785-2654

NOTE: Portions of this report are taken from 10 published papers, 3 submitted manuscripts, a few manuscripts in preparation, along with various poster presentations and slide talks--all supported by this grant--that are listed in the sections of this report describing “Dissemination of Results” and “Publications & Poster Presentations”. Funding for this project ended on November 30, 2013; no cost extensions allowed time for the external reviews to be completed and final revisions of this report.

ABSTRACT

Single nucleotide polymorphisms (SNPs) are a largely untapped DNA resource for forensic applications. Because of the extensive databases and experience that have been established with the CODIS markers, SNPs are unlikely to replace STRPs for applications involving individual identification. However, SNPs offer many advantages over STRPs for inferring the ancestry of a DNA sample, for identifying close biological relatives, and eventually for inferring phenotypes like eye color.
In the current project we built on our previous work and have provided improved SNP panels for forensic applications by (1) enhancing our developing ancestry informative SNP (AISNP) panel, (2) identifying a panel of mini-haplotypes and a preliminary set of micro-haplotypes for inference of lineage (clan and extended family) relationships (LISNPs), and (3) providing population genetic background evidence for many Phenotype Informative markers (PISNPs).

This document is a research report submitted to the U.S. Department of Justice. This report has not been published by the Department. Opinions or points of view expressed are those of the author(s) and do not necessarily reflect the official position or policies of the U.S. Department of Justice.

We identified candidate SNPs useful for ancestry inference and lineage identification by screening our accumulated datasets, large web-accessible SNP datasets (e.g. 1000 Genomes, CEPH Human Genome Diversity Panel), and the published literature. We iteratively applied several statistical methods (heatmaps, pairwise Fst, PCA, and STRUCTURE) to extract the “best” subset of ancestry inference SNPs (AISNPs), favoring the SNPs that differentiated most between the populations. The best current set of 55 AISNPs identified during this project was made public on the FROG knowledge base (http://frog.med.yale.edu) in January 2013; it is also described at length in a manuscript that has been submitted for publication.

LISNPs, or Lineage Informative SNPs, are haplotyped loci with two or more SNPs defining multiple alleles (haplotypes). Haplotypes can provide the multiple low frequency alleles that are optimal for familial and lineage assignment of a DNA sample. We began this project by identifying small haplotypes that are less than 10,000 basepairs in extent and that have multiple haplotype alleles at common frequencies.
Our published work on the concept of mini-haplotypes (Pakstis et al., 2012) provides proof of principle. We also developed a panel of 25 essentially unlinked mini-haplotypes for which a manuscript is in preparation.

New sequencing methodology that gives high throughput sequencing of individual DNA molecules of 200 or more basepairs led us to shift our priorities during the last year of the project. We emphasized finding micro-haplotypes (microhaps): two or more SNPs within a segment of no more than 200 basepairs that define at least 3 haplotypes and display high heterozygosity in most populations. By the end of our project we had identified a set of 28 microhaps, each defined by 2 to 3 SNPs, which we have studied on 54 of our populations. Because of the high information content of these markers, the random match probabilities based on them range from 2.93 x 10^-13 to 1.03 x 10^-18 in the populations studied, values comparable to those for the IISNP panel we published.

Microhaps typed by sequencing have another desirable capability: identification of mixtures with the potential to quantify the components, i.e., to disentangle mixtures in a quantitative way. Because one person carries at most two haplotypes at a locus, the presence of three or more different sequences is clear evidence that DNA from more than one person contributed to the sample. Thus, microhaps can become powerful markers to identify and quantify components of mixtures.

Carefully validated IISNP, AISNP, LISNP, and eventually PISNP panels offer great utility for forensic applications. Our work has contributed to all of these areas and documents the value of SNPs in forensics. This project underscores that continued work on SNP panels offers much more than a theoretical possibility of better SNPs for better forensics. By making the data public through the ALFRED and FROG databases, we also hope others will be able to build on what we have done.
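The two microhap properties described in the abstract, a very small random match probability and the ability to flag mixtures, can be illustrated with a minimal Python sketch. It is not the computation used in this project. It assumes Hardy-Weinberg proportions within a locus and independence between loci, and the function names, haplotype frequencies, and sequences are hypothetical values chosen only for illustration.

```python
from itertools import combinations
from math import prod


def locus_match_probability(hap_freqs):
    """Chance that two unrelated people share a genotype at one locus.

    Assumes Hardy-Weinberg proportions: a homozygote i/i has frequency
    p_i^2 and a heterozygote i/j has frequency 2*p_i*p_j. The match
    probability is the sum of the squared genotype frequencies.
    """
    if abs(sum(hap_freqs) - 1.0) > 1e-9:
        raise ValueError("haplotype frequencies must sum to 1")
    homozygotes = sum(p ** 4 for p in hap_freqs)
    heterozygotes = sum((2 * p * q) ** 2 for p, q in combinations(hap_freqs, 2))
    return homozygotes + heterozygotes


def panel_match_probability(panel):
    """Multiply per-locus values; assumes the loci are independent (unlinked)."""
    return prod(locus_match_probability(freqs) for freqs in panel)


def min_contributors(haplotypes_seen):
    """Lower bound on contributors at one locus.

    Each person carries at most two haplotypes, so n distinct sequences
    require at least ceil(n / 2) people.
    """
    distinct = len(set(haplotypes_seen))
    return (distinct + 1) // 2


if __name__ == "__main__":
    # Hypothetical two-locus panel; the frequencies are illustrative only.
    toy_panel = [[0.4, 0.3, 0.2, 0.1], [0.5, 0.3, 0.2]]
    print(panel_match_probability(toy_panel))
    # Three distinct reads at one locus imply more than one contributor.
    print(min_contributors(["ACG", "ATG", "GCG"]))
```

The count of distinct sequences gives only a lower bound on the number of contributors; quantifying the components of a mixture would additionally need read counts per sequence, which this sketch does not model.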
Table of Contents

Abstract
Table of Contents
Executive Summary
  Background, Rationale, Goals
  Strategy and methods
  Ancestry Inference SNPs (AISNPs)
  Lineage Informative SNPs (LISNPs)
  Phenotype Informative SNPs (PISNPs)
Introduction
  Background, Rationale
  Goals
Methods
Progress on identifying Ancestry Inference SNPs (AISNPs)
Progress on identifying Lineage Informative SNPs (LISNPs)
Progress on identifying Phenotype Informative SNPs (PISNPs)
Conclusions
Dissemination of Results
Response to two external reviewers’ comments
Publications & Poster Presentations of this Project
References
Appendix

Executive Summary

Background, Rationale, Goals

The need for panels of SNPs (single nucleotide polymorphisms) for forensic applications is documented by a large number of publications in the literature. When DNA is badly damaged, when speed is essential, and when ease of typing and interpretation are critical, SNPs can perform better than STRPs, though SNPs are unlikely to replace STRPs entirely for individual identification. However, for inference of ancestry of a DNA sample, for inference of phenotype, and for identification of close relatives, SNPs can provide much more accuracy than the CODIS STR markers.

NIJ has supported our SNP research with three grants (2004-DN-BX-K025, 2007-DN-BX-K197, 2010-DN-BX-K225). The first two projects focused on developing two optimized panels of SNPs.
The first panel was a Low Fst / High heterozygosity panel of SNPs for Individual Identification (IISNPs) that should be acceptable in the courts as reliable and highly probative. The second panel was a High Fst panel of SNPs that are Ancestry Informative (AISNPs) and provide a robust investigative tool.

In the third grant, the subject of the current technical report, we built on our previous work and have provided improved SNP panels for forensic applications by (1) enhancing our developing AISNP panel, (2) identifying a panel of mini-haplotypes and a preliminary set of micro-haplotypes for inference of lineage (clan and extended family) relationships (LISNPs), and (3) providing population genetic background for many Phenotype Informative markers (PISNPs). In all of these studies, the goal has been to identify those SNPs likely to be good for the specific objective and to evaluate them on a large number of population samples. Our rationale for studying a global sample of populations is the highly diverse ancestries represented in the U.S. population.

Strategy and methods

At the beginning of this funding period we had an accumulated dataset of ~3000 SNPs with data on most of 54 populations in our population resource. During this project we have added several hundred new SNPs to the dataset and extended the nearly complete dataset to ~4000 markers on 58 populations. In addition, we have incorporated additional populations into the dataset for the more informative of the SNPs for ancestry and phenotype. In those extensions, and in the filling-in of markers not previously typed on some populations, we have not included the panel of IISNPs developed under previous funding (Pakstis et al., 2010), although we did conduct additional analyses and write a paper on those SNPs (Kidd et al., 2012).
Our strategy for this funding period evolved as the science evolved. We initially worked (1) to identify the best markers for ancestry inference and (2) to characterize the population genetics of some of the SNPs that were implicated, usually only by association studies, with phenotypes.
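The contrast drawn in this summary between a low Fst / high heterozygosity panel (IISNPs) and a high Fst panel (AISNPs) can be made concrete with a short Python sketch. It uses a common textbook form of Fst for a biallelic SNP, the variance of the allele frequency across populations divided by p(1 - p) at the mean frequency, with populations weighted equally. This is an assumption for illustration and not necessarily the estimator used in this project; the function names and allele frequencies are hypothetical.

```python
from statistics import mean, pvariance


def expected_heterozygosity(p):
    """Expected heterozygosity of a biallelic SNP with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def fst(freqs):
    """Fst as variance of allele frequency over p_bar * (1 - p_bar).

    freqs holds the frequency of one allele in each population;
    populations are weighted equally (an assumption of this sketch).
    """
    p_bar = mean(freqs)
    if p_bar in (0.0, 1.0):
        return 0.0  # SNP is monomorphic everywhere, so there is no differentiation
    return pvariance(freqs) / (p_bar * (1.0 - p_bar))


def summarize(freqs):
    """Return (Fst, average expected heterozygosity) for one SNP."""
    return fst(freqs), mean(expected_heterozygosity(p) for p in freqs)


if __name__ == "__main__":
    # Hypothetical frequencies in four populations; illustrative only.
    iisnp_like = [0.48, 0.52, 0.50, 0.47]  # similar everywhere, near 0.5
    aisnp_like = [0.05, 0.95, 0.10, 0.90]  # strongly differentiated
    print(summarize(iisnp_like))
    print(summarize(aisnp_like))
```

A SNP whose frequency is close to 0.5 in every population scores low on Fst and high on heterozygosity, which suits individual identification; a SNP whose frequency differs sharply between populations scores high on Fst, which suits ancestry inference.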