
9/11/15

DNA Phenotyping: Predicting Ancestry and Physical Appearance from Forensic DNA

Ellen McRae Greytak, PhD
Director of Bioinformatics
Parabon NanoLabs, Inc.

© 2015 Parabon NanoLabs, Inc. All rights reserved. snapshot.parabon-nanolabs.com

Forensic Applications of DNA Phenotyping

Predict a person’s ancestry and/or appearance (“phenotype”) from his or her DNA
Generate investigative leads when DNA doesn’t match a database (e.g., CODIS)
Gain additional information (e.g., pigmentation, detailed ancestry) about unidentified remains

Phenotype Prediction

Step 1: Estimate Ancestry


Ancestry by Statistical Clustering
Tens of thousands of SNPs across the genome
Compare to background subjects with known ancestry
Very precise estimates of ancestry across as many populations as are defined in the background
Can detect even low levels of admixture

Ancestry by Statistical Clustering
Thousands of background subjects from hundreds of populations (aggregated from multiple studies for worldwide coverage)

Ancestry by Statistical Clustering
Seven continental groups: Africa, Middle East, Europe, Central Asia, East Asia, Oceania, and America

[Figure: world map, longitude −180˚ to 180˚ and latitude −60˚ to 60˚ (Li et al. 2008)]


Ancestry by Statistical Clustering
DNA from an unknown subject is partitioned into 7 parts according to its proportional similarity to each of the continental groups
[Figure: stacked bar chart, 0% to 100%; legend: America, Oceania, East Asia, Europe, Middle East, Africa]
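One generic way to partition a subject's DNA among background groups is an expectation-maximization (EM) estimate of mixture proportions. The sketch below is an illustration under stated assumptions, not the method used by Snapshot: it assumes the reference-allele frequency of every SNP is already known for each group, and the function name `estimate_ancestry`, the two groups, and all frequencies are hypothetical.

```python
def estimate_ancestry(genotypes, ref_freqs, n_iter=200):
    """Estimate ancestry proportions for one subject by EM.

    genotypes: list of 0/1/2 counts of the reference allele at each SNP.
    ref_freqs: ref_freqs[k][j] is the reference-allele frequency of SNP j
               in background group k (assumed known, strictly between 0 and 1).
    Returns one proportion per group; the proportions sum to 1.
    """
    n_groups = len(ref_freqs)
    n_snps = len(genotypes)
    q = [1.0 / n_groups] * n_groups  # start from equal proportions
    for _ in range(n_iter):
        new_q = [0.0] * n_groups
        for j, g in enumerate(genotypes):
            # Probability of drawing each allele under the current mixture.
            p_ref = sum(q[k] * ref_freqs[k][j] for k in range(n_groups))
            p_alt = 1.0 - p_ref
            for k in range(n_groups):
                f = ref_freqs[k][j]
                # Expected number of this SNP's two allele copies from group k.
                if g > 0:
                    new_q[k] += g * q[k] * f / p_ref
                if g < 2:
                    new_q[k] += (2 - g) * q[k] * (1.0 - f) / p_alt
        q = [v / (2.0 * n_snps) for v in new_q]
    return q


# Two hypothetical groups with opposite allele frequencies at 10 SNPs.
group_a = [0.9] * 5 + [0.1] * 5
group_b = [0.1] * 5 + [0.9] * 5

unadmixed = [2] * 5 + [0] * 5  # matches group A's common allele everywhere
admixed = [1] * 10             # one copy of each allele at every SNP

q_unadmixed = estimate_ancestry(unadmixed, [group_a, group_b])
q_admixed = estimate_ancestry(admixed, [group_a, group_b])
print([round(v, 3) for v in q_unadmixed])
print([round(v, 3) for v in q_admixed])
```

With seven background groups instead of two, the returned list would play the role of the seven-part partition described on the slide.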

Regional Ancestry Inference
Within each of the 7 continental populations, we can narrow the source of an individual’s ancestry
[Figure: stacked bar chart for Europe, 0% to 100%; legend: Southwest, Southeast, South, Northwest, Northeast, Central-West, Central-East, Caucasus]

Complex Ancestry Example
Real results from an unknown individual given to us during a validation test
54% East Asian ancestry
39% Native American ancestry
7% European ancestry
[Figure: stacked bar chart, 0% to 100%; legend: America, Oceania, East Asia, Central Asia, Europe, Middle East, Africa]


Regional Ancestry Inference
[Figure: regional ancestry bar charts, 0% to 100%, for East Asia, America, and Europe]
Conclusion: This individual is half Japanese and half Latino

Phenotype Prediction

Step 2: Data Mining and Predictive Modeling


Association Testing
Use data mining of genotype+phenotype (G+P) data to identify those SNPs that have the strongest predictive power for phenotype

Subject  Eye Color  SNP1      SNP2      SNP3      SNP4      …  SNP1,000,000
                    Genotype  Genotype  Genotype  Genotype     Genotype
1                   A/G       C/C       A/A       G/G       …  T/T
2                   A/A       T/T       A/G       A/G       …  T/T
3        Hazel      G/G       C/T       G/G       A/A       …  T/T
4                   A/A       T/T       G/G       G/G       …  C/T
…        …          …         …         …         …         …  …
3,000    Blue       G/G       T/T       A/A       G/G       …  C/T
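A single-SNP association test over a table like the one above can be sketched with a Pearson chi-square statistic on allele counts. This is a textbook test used here for illustration; the source does not say which statistic is used. The six subjects, the two SNPs, and the names `allele_counts` and `chi_square` are hypothetical.

```python
def allele_counts(genotypes):
    """Count alleles in genotype strings such as "A/G"."""
    counts = {}
    for gt in genotypes:
        for allele in gt.split("/"):
            counts[allele] = counts.get(allele, 0) + 1
    return counts


def chi_square(genotypes, has_trait):
    """Pearson chi-square statistic for allele counts vs. a yes/no trait."""
    cases = allele_counts(g for g, t in zip(genotypes, has_trait) if t)
    controls = allele_counts(g for g, t in zip(genotypes, has_trait) if not t)
    n_cases = sum(cases.values())
    n_controls = sum(controls.values())
    if n_cases == 0 or n_controls == 0:
        return 0.0  # no contrast between groups to test
    total = n_cases + n_controls
    stat = 0.0
    for allele in sorted(set(cases) | set(controls)):
        column = cases.get(allele, 0) + controls.get(allele, 0)
        for observed, row in ((cases.get(allele, 0), n_cases),
                              (controls.get(allele, 0), n_controls)):
            expected = row * column / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical toy data: three blue-eyed subjects, three others.
blue_eyes = [True, True, True, False, False, False]
snps = {
    "SNP1": ["G/G", "G/G", "G/G", "A/A", "A/A", "A/G"],
    "SNP2": ["C/T", "C/C", "T/T", "C/T", "C/C", "T/T"],
}

# Rank SNPs from strongest to weakest association with the trait.
ranked = sorted(snps, key=lambda name: chi_square(snps[name], blue_eyes),
                reverse=True)
print(ranked)
```

A genome-wide scan would apply the same ranking to every SNP column and would also need to correct for the very large number of tests.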


Interaction Analysis
Single SNP association testing may not capture the whole story
Many traits are influenced not only by individual SNPs but by non-additive (epistatic) interactions among SNPs

Ritchie et al. 2001 – AJHG 69:138-147

Interaction Analysis
Looking for high-order interactions (e.g., 3, 4, and 5 factors) on a genome-wide scale results in a combinatorial explosion of possible models
§ 8,333,250,000,000,000,000,000,000,000 (≈ 8.3 × 10^27) possible 5-way interactions among 1 million SNPs
Most investigators can only search among candidate SNPs
Parabon has developed software that uses a distributed evolutionary search algorithm to explore the massive space of possible interactions
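The size of the search space follows from the binomial coefficient: the number of unordered 5-SNP sets among 1 million SNPs is C(1,000,000, 5). A short check, using only the Python standard library:

```python
import math

n_snps = 1_000_000

# Number of distinct k-way SNP combinations for each interaction order.
for order in (2, 3, 4, 5):
    print(order, f"{math.comb(n_snps, order):.3e}")

five_way = math.comb(n_snps, 5)
print(five_way)
```

Each additional factor multiplies the count by roughly n/k, which is why exhaustive search stops being practical well before 5-way interactions.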

Face Morphology
Face shape is not a single trait but a combination of many variables
As with other traits, models are built from genotype+phenotype data, but now the phenotype is a 3-D image of the individual’s face


Face Space
Each face is described by (x,y,z) coordinates at thousands of points, many of which are correlated with one another
Use dimensionality reduction to capture most of this variation in a smaller number of independent variables
Each face can now be represented by a set of variables in “face space”
Perform mining and predictive modeling as with other traits
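Principal component analysis (PCA) is one common form of dimensionality reduction for correlated coordinates like these. The sketch below finds only the leading component, by power iteration, on a tiny made-up data set; the source does not state which reduction method is used, and the function name and the five one-point "faces" are hypothetical. In practice a library SVD routine would be the usual choice.

```python
import math


def first_principal_component(faces, n_iter=100):
    """Return (mean, direction, scores) for the leading principal component.

    faces: list of equal-length lists, each a face flattened to
           [x1, y1, z1, x2, y2, z2, ...].
    """
    n, d = len(faces), len(faces[0])
    mean = [sum(face[j] for face in faces) / n for j in range(d)]
    centered = [[face[j] - mean[j] for j in range(d)] for face in faces]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting direction
    for _ in range(n_iter):
        # One power-iteration step: v <- X^T (X v), then renormalize.
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    scores = [sum(row[j] * v[j] for j in range(d)) for row in centered]
    return mean, v, scores


# Hypothetical data: five "faces" of one landmark each, varying along
# a single direction, so one variable should capture all the variation.
direction = [0.6, 0.8, 0.0]
faces = [[10 + t * direction[0], 20 + t * direction[1], 30 + t * direction[2]]
         for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]

mean, pc1, scores = first_principal_component(faces)
print([round(x, 3) for x in pc1])
print([round(s, 3) for s in scores])
```

Each entry of `scores` is that face's coordinate along the component, which is the kind of variable a "face space" representation would then model like any other trait.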

Face Space
[Figure: effect of each principal component, PC1 through PC5, on face shape]

Create a Predictive Model
Goal: combine SNPs, sex, and ancestry into a mathematical model that can be used for prediction on new samples
Our approach: Advanced machine learning – automatically weights variables and discovers nonlinear interactions among variables
§ Parameters are optimized using evolutionary search
Automated pipeline that can quickly generate new models when new data become available
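To make "parameters are optimized using evolutionary search" concrete, here is a minimal (1+λ) evolution strategy: mutate the current best parameter vector, keep the best child only if it improves. It is a generic sketch, not the distributed algorithm the slides refer to. The function `evolve`, the toy loss (standing in for a cross-validated prediction error), and all settings are assumptions for illustration.

```python
import random


def evolve(loss, start, sigma=0.3, offspring=10, generations=200, seed=0):
    """Minimize loss(params) with a simple (1+lambda) evolution strategy."""
    rng = random.Random(seed)
    best, best_loss = list(start), loss(start)
    for _ in range(generations):
        # Mutate the current best parameters with Gaussian noise.
        children = [[x + rng.gauss(0.0, sigma) for x in best]
                    for _ in range(offspring)]
        champion = min(children, key=loss)
        champion_loss = loss(champion)
        if champion_loss < best_loss:  # elitism: never accept a worse model
            best, best_loss = champion, champion_loss
    return best, best_loss


def toy_loss(params):
    """Hypothetical stand-in for a model's cross-validated error."""
    return (params[0] - 3.0) ** 2 + (params[1] + 1.0) ** 2


initial = [0.0, 0.0]
best_params, best_loss = evolve(toy_loss, initial)
print([round(p, 2) for p in best_params], round(best_loss, 4))
```

Because each child is evaluated independently, the inner loop is the natural place to distribute work across machines.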


DNA Phenotyping

Step 3: Evaluate accuracy


Evaluating Accuracy
To evaluate accuracy, we need to make predictions on subjects with known phenotypes so we can compare predicted vs. actual
These subjects need to be new to the model so that the accuracy is truly representative of what we would see on new, unknown subjects
We want to make lots of these out-of-sample predictions so we are confident in our accuracy
However, the more subjects we remove from the model, the lower its predictive power
Solution: cross-validation

Cross-Validation
[Figure: dataset split into a 10% testing set and a 90% training set]


Cross-Validation
[Figure: loop repeated x10: data mining and predictive modeling on the 90% training set; make predictions on the 10% testing set; reveal true values and calculate accuracy]

Cross-Validation
This means we are performing data mining and predictive modeling 10 times for each phenotype
However, we now have out-of-sample predictions on every single subject in our dataset
At the end, we build a final model using all of the data, and the cross-validation accuracy approximates the accuracy of this final model
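The 10-fold procedure can be sketched in a few lines. The key property is that fitting happens inside each fold, on the training subjects only, so every prediction is out-of-sample. The one-parameter model (`fit_slope`, `predict_slope`) is a hypothetical stand-in for the mining and modeling steps, and the data are made up.

```python
import random


def k_fold_predictions(xs, ys, fit, predict, k=10, seed=0):
    """Return one out-of-sample prediction per subject using k-fold CV."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k disjoint testing sets
    predictions = [None] * len(xs)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in order if i not in held_out]
        # Mining and modeling see only the training subjects in this fold.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, xs[i])
    return predictions


def fit_slope(xs, ys):
    """Least-squares slope through the origin (stand-in for a real model)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def predict_slope(slope, x):
    return slope * x


# Hypothetical data: 30 subjects whose phenotype is exactly 2 * x.
xs = [float(i) for i in range(1, 31)]
ys = [2.0 * x for x in xs]

preds = k_fold_predictions(xs, ys, fit_slope, predict_slope)
mean_abs_error = sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)
print(mean_abs_error)

# Final model: refit on all of the data once accuracy has been estimated.
final_model = fit_slope(xs, ys)
```

If feature selection were run once on the full dataset before splitting, the held-out subjects would have influenced the model and the accuracy estimate would be optimistic.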

Phenotype Prediction

Step 4: Predict on new samples


Snapshot DNA Phenotyping System
Applicable to subjects from any ethnic background or an admixed background
Can use any Illumina genome-wide SNP chip
Optimally requires 2.5 ng of DNA, but good call rates have been obtained with <1 ng

Sample Results 1 – Eye Color
Predicted Value = 1.638
Eye Color Consistency Values
Blue 39.0%
Green 59.4%
Hazel 20.6%
Brown 0.7%
Black 0.0%
Green (61% confidence)
Green or Blue (79.4% confidence)
NOT Brown or Black (99.3% confidence)
[Figure: scale from 0% to 100% marked at 7.4%; plot of Predicted Values (1 to 5) against True Eye Color (Blue, Green, Hazel, Brown, Black), with this subject at 1.638]

Sample Results 1 – Skin Color
Predicted Value = 1.687
Skin Color Consistency Values
Very Fair 90.1%
Fair 22.0%
Light Olive 1.8%
Dark Olive 2.9%
Dark 0.0%
Very Fair (78% confidence)
Very Fair or Fair (97.1% confidence)
NOT Light Olive, Dark Olive, or Dark (97.1% confidence)
[Figure: scale from 0% to 100% marked at 6.9%; plot of Predicted Values (1 to 5) against True Skin Color (Very Fair, Fair, Light Olive, Dark Olive, Dark), with this subject at 1.687]


Sample Results 1 – Hair Color
Predicted Value = 2.560
Hair Color Consistency Values
Red 84.8%
Blond 97.0%
Brown 0.4%
Black 0.0%
Blond (15.2% confidence)
Blond or Red (99.6% confidence)
NOT Brown or Black (99.6% confidence)
[Figure: scale from 0% to 100% marked at 15.4%; plot of Predicted Values (1.0 to 4.0) against True Hair Color (Red, Blond, Brown, Black), with this subject at 2.56]
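The slides do not say how the confidence figures are derived from the consistency values. One relationship that reproduces every figure reported for Sample 1 is: confidence in a set of categories equals 100% minus the largest consistency value among the categories left out. The check below treats that as an observation about these numbers only, not as the documented method; the function name `confidence` is hypothetical.

```python
def confidence(consistency, predicted):
    """100 minus the largest consistency value among excluded categories."""
    excluded = [value for name, value in consistency.items()
                if name not in predicted]
    return 100.0 - max(excluded)


# Consistency values (percent) as reported for Sample 1.
eye = {"Blue": 39.0, "Green": 59.4, "Hazel": 20.6, "Brown": 0.7, "Black": 0.0}
skin = {"Very Fair": 90.1, "Fair": 22.0, "Light Olive": 1.8,
        "Dark Olive": 2.9, "Dark": 0.0}
hair = {"Red": 84.8, "Blond": 97.0, "Brown": 0.4, "Black": 0.0}

# (consistency table, predicted categories, confidence reported on the slide)
checks = [
    (eye, {"Green"}, 61.0),
    (eye, {"Green", "Blue"}, 79.4),
    (eye, {"Green", "Blue", "Hazel"}, 99.3),   # NOT Brown or Black
    (skin, {"Very Fair"}, 78.0),
    (skin, {"Very Fair", "Fair"}, 97.1),
    (hair, {"Blond"}, 15.2),
    (hair, {"Blond", "Red"}, 99.6),            # NOT Brown or Black
]

for table, predicted, reported in checks:
    print(sorted(predicted), round(confidence(table, predicted), 1), reported)
```

Read this way, a low single-category confidence such as Blond at 15.2% reflects how plausible the nearest alternative (Red at 84.8%) remains, while the broader exclusions stay above 99%.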

Sample Results 1 – Face Shape
Compare the Predicted Face Shape for this individual to the Average Predicted Face Shape for subjects with the same sex and ancestry
X: narrow face, particularly jaw and nose
Y: long face – higher midface; longer chin
Z: more prominent nose and cheekbones; less prominent chin
[Figure: face-shape difference maps for X, Y, and Z with a + / – scale]

Sample 1 Results – Composite
Apply predicted pigmentation to face

Sample 1 Predicted vs. Actual

Snapshot Workflow

Workflow of a Parabon® Snapshot™ Investigation

Unidentified Remains / DNA Evidence
DNA Evidence Is Collected and Sent to Crime Lab

Crime Lab
Crime Lab Extracts DNA and Produces STR Profile (a.k.a. “DNA Fingerprint”)
STR Profile Checked Against DNA Database(s)
Match Found? Yes / No
Snapshot Composite Ordered

DNA Service Labs
Unidentified DNA Is Sent to Service Lab (DNA Extracted If Needed): DNA Evidence or Extracted DNA, 50 pg – 2 ng
Lab Produces SNP Profile (a.k.a. “DNA Blueprint”)
Genotype Data Is Sent to Parabon
NOTE: STR Profiles Do Not Contain Sufficient Genetic Information to Produce a SNP Genotype

Parabon NanoLabs
Parabon Analyzes Genotype Data and Produces Snapshot Report
Parabon Predicts Physical Traits: Hair Color, Eye Color, Skin Color, Freckling, Face Shape, Ancestry, Kinship
Investigator Uses Snapshot Report To: Generate Leads, Exclude Suspects, Identify Remains

STR Profile (“DNA Fingerprint”)
07,15 | 10,18 | 12,09
16,10 | 18,16 | 13,09
02,05 | 12,07 | 06,19
03,14 | 15,27 | 18,28
      | 12,29 |

SNP Genotype (“DNA Blueprint”)
CC CC CT CC GG CC GG
GG AA AA AG CC CC TT
CC AA AA TT CC TT GG
CT AA TT AA CC AG CT
AA CT AG CC CC AA CC

Ready To Learn More? Contact Us For A Free Consultation: http://Parabon-NanoLabs.com/Snapshot

Challenges Encountered in Casework
Small DNA quantities reduce the genotype call rate
§ We use a machine learning algorithm that allows for missing data (many do not)
§ For each sample, we recalculate the cross-validation results and confidence intervals for that set of SNPs
DNA degradation results in short fragment lengths
§ We are working with laboratory techniques to accommodate fragmented samples
Mixtures confuse both genotype calling and phenotype prediction
§ We are developing new computational methods to deconvolute mixtures


Extended Kinship Inference


Kinship Inference with Snapshot
[Figure: family tree centered on “self”, with a Relationship legend; rows listed from oldest to youngest generation]
great-great-grandfather, great-great-grandmother
great-grandfather, great-grandmother, great-great-uncle / aunt
grandfather, grandmother, great-uncle / aunt, first cousin twice removed
father, mother, uncle / aunt, first cousin once removed, second cousin once removed
brother / sister, self / twin, first cousin, second cousin, third cousin
niece / nephew, son / daughter, first cousin once removed, second cousin once removed, third cousin once removed
grand-niece / nephew, grandson / granddaughter, first cousin twice removed, second cousin twice removed, third cousin twice removed
Relationship: Identical Twins, Parent-Offspring, Full Siblings, 2nd Degree Relatives, 3rd Degree Relatives, 4th Degree Relatives, 5th Degree Relatives, 6th Degree Relatives, 7th Degree Relatives

Kinship Inference with Snapshot
Developed a new method that uses machine learning to predict relatedness between two

Kinship                                         Absolute Accuracy   Accuracy Within One Degree
Siblings                                        100%                100%
Parent                                          100%                100%
2nd-Degree (Grandparent, Uncle, Half-Sibling)   100%                100%
3rd-Degree (First cousin, Great-grandparent)    94%                 100%
4th-Degree (First cousin once-removed)          77%                 98%
5th-Degree (Second cousin)                      45%                 95%
6th-Degree (Second cousin once-removed)         57%                 93%
Unrelated                                       97.5%               99.5%
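For context on why accuracy falls with distance: relatives of degree d are expected to share about (1/2)^d of their DNA, so adjacent degrees differ by ever smaller absolute amounts. The sketch below computes that expectation and a naive nearest-degree rule. It is not the machine learning method described above, and the function names and the cutoff at 7 degrees are assumptions for illustration.

```python
import math


def expected_sharing(degree):
    """Expected fraction of DNA shared by relatives of the given degree."""
    return 0.5 ** degree


def nearest_degree(shared_fraction, max_degree=7):
    """Degree whose expected sharing is closest on a log2 scale.

    Returns None when the sharing is too low to place within max_degree,
    which this naive rule treats as unrelated.
    """
    if shared_fraction <= 0.0:
        return None
    degree = round(-math.log2(shared_fraction))
    if degree > max_degree:
        return None
    return max(0, degree)  # 0 corresponds to identical twins / self


for d in range(1, 7):
    print(d, f"{expected_sharing(d):.4%}")

print(nearest_degree(0.27))    # close to the 25% expected for 2nd degree
print(nearest_degree(0.001))   # below the 7th-degree range
```

Observed sharing scatters around these expectations, and the gap between 5th and 6th degree is only about 1.6 percentage points, which fits the pattern in the table: exact-degree accuracy drops for distant relatives while accuracy within one degree stays above 90%.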
