Consumer Genomics to Genomic Medicine: Role of Discrete Algorithms
Laxmi Parida
IBM Computational Biology Center
IBM TJ Watson Research Center, New York, USA

Consumer Genomics
Delivering genomic results directly to consumers
www.ibm.com/genographic
www.nationalgeographic.com/genographic
• Over 450,000 public participants
• 100,000 participants through PI
• Non-medical applications

Migratory Map Representation

Predicting the past (Andreas Dress)
• Recombining loci: ARG (ancestral recombinations graph), Griffiths and Marjoram, 1996
• Neighbor-joining method; tree courtesy of Saitou Naruya, 2002

Quantify what can be inferred?
• Recombining loci: ??  100 %

The Origin of Population Genomics
Brilliant Blunders
1859, 1860

Mathematical Population Genetics
• Darwinian selection & Mendelian genetics are actually complementary: Sewall Wright, Ronald Fisher, JBS Haldane
• Population Genetics: “deals with (statistical) analysis of the inheritance & prevalence of genes in populations”
• Neutral Theory (Kimura, King, Jukes): 1968-80
  - Most of the observed genetic variation is selectively neutral
  - Importance of stochastic factors by random genetic drift
• Kingman’s coalescent, 1970s-80s: retrospective coalescent, “explicitized the tree”
• 1990s: availability of genomes; Population Genetics => Genomics

Genome Variations
• Human Genome Project (draft 2000; completed 2003), genomics.energy.gov: more variations than expected
• International SNP Consortium (launched 2000); The HapMap Project (2003), www.hapmap.org; snp.cshl.org: SNP variations
• 1000 Genomes Project (launched June 2008), www.1000genomes.org: deep catalog of human genetic variation
• Personal Genome Project, www.personalgenomes.org
“Nothing in Biology Makes Sense Except in the Light of Evolution.” -- Theodosius Dobzhansky

Retrospective: Coalescence
(track flow of ancestral genetic material for the extants only)
Wright-Fisher population:
1. Constant population
2. Non-overlapping generations
3. Panmictic
LCA / MRCA. LCA:
1. Common ancestor (CA)
2. Least among CAs
The Phylogeny: single parent per node vs. multiple parents per node
[Figure: genealogy drawn with the past at the top and the present at the bottom]

Bound the infinite structure: root
MRCA/GMRCA: [Grand] Most Recent Common Ancestor

Motivation question: Is ARG even reconstructible? How much?
(JCB 10; BMC Bioinformatics 11)

The Random Graphs Framework
Forbidden Structure
Graph definition
• Infinite number of vertices arranged in finite-sized rows
• Edges introduced via a random process between adjacent rows
Probability space of such graphs
• Space is non-enumerable (no bijective map to natural numbers)
• Uniform probability measure

Question: LCA?
Ancestor without ancestry paradox

Annotated graph: Mendelian model

Annotated Edges
Retrospective Coalescence; Marginal Genealogies
[Figure: illustration of the theorem, with edge annotations such as {1, 2, 3}]

Topological Definition of MRCA: Least Common Ancestor with Ancestry (LCAA)

Random Graphs: Probability Space
• Space is non-enumerable (no bijective map to natural numbers)
• Uniform probability measure
• Probability of some event F(h) for a fixed depth h, then take the limit

Minimal Descriptor: non-redundant core
L Parida, P F Palamara, A Javed, A minimal descriptor of an ancestral recombinations graph, BMC Bioinformatics, 2011.

Minimal Descriptor of an ARG
1. Structure-preserving (marginal genealogies are identical to those of G)
2. Samples-preserving (genetic variation patterns in the samples are identical to those of G)
[Figure: t-coalescent nodes]

Minimal descriptor (MD) results
Theorem 1: An unbounded ARG always has a bounded minimal descriptor.
Theorem 2: The reduced MD ARG is unique.
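
A minimal sketch of two ideas from the slides above: edge annotations that define marginal genealogies, and a pruning step that is structure-preserving. Everything concrete here is an assumption made for illustration: the toy graph, the names `TOY_ARG`, `marginal_edges`, and `drop_empty_edges`, and the pruning rule itself. It is not the minimal-descriptor construction from the cited BMC Bioinformatics paper.

```python
# Hypothetical toy ARG (not taken from the slides): each child maps to its
# parent edges, and each edge is annotated with the set of loci whose
# sample-ancestral material it carries. Node "r" is a recombination node.
TOY_ARG = {
    "a": [("x", {1, 2})],
    "b": [("r", {1, 2})],
    "c": [("y", {1, 2})],
    "r": [("x", {1}), ("y", {2})],
    "x": [("z", {1, 2})],
    "y": [("z", {1, 2})],
    "z": [("w", set())],  # this edge carries nothing ancestral to the samples
}
SAMPLES = ["a", "b", "c"]


def marginal_edges(arg, locus, samples):
    """Return the (child, parent) edges that carry `locus` above the samples."""
    edges, seen, stack = set(), set(), list(samples)
    while stack:
        child = stack.pop()
        if child in seen:
            continue
        seen.add(child)
        for parent, loci in arg.get(child, []):
            if locus in loci:
                edges.add((child, parent))
                stack.append(parent)
    return edges


def drop_empty_edges(arg):
    """Remove edges whose annotation is empty (an assumed, very simple pruning)."""
    return {child: [(p, loci) for p, loci in ups if loci]
            for child, ups in arg.items()}


pruned = drop_empty_edges(TOY_ARG)
for locus in (1, 2):
    before = marginal_edges(TOY_ARG, locus, SAMPLES)
    after = marginal_edges(pruned, locus, SAMPLES)
    print(locus, sorted(before), "unchanged by pruning:", before == after)
```

In this toy graph the two loci give different marginal genealogies (sample b reaches x at locus 1 but y at locus 2), and dropping the unannotated edge leaves both genealogies intact, which is the sense of "structure-preserving" used in the sketch.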

Minimal descriptor of an ARG: Implications
• What is the model?
• Theoretical upper bound on the reconstructed structure
  Embarrassingly simple; models data that don’t matter
• The lean mdARG can be a benchmark
• How small is the core? Substantially (empirically observed)
• Importance sampling?
• Mindless compression? No: structure & samples preserving
• Observation: no gapped segments in binary MD ARGs

Quantify what can be inferred?
• Recombining loci: ~65 %  100 %

Reconstruction (IRiS pipeline)
[Figure: pipeline diagram with components DSR Algorithm, RECOMATRIX, and RECOTYPE; labels: network, recombinations, calibration]
References: JCB 08; BMC Bioinformatics 09; Hum Gen 11; MBE 11; Bioinformatics 11; PLoS Comp Bio 10

The Central Problem: DSR Algorithm (JCB ’08)
1. Identify SNP block patterns
2. Compute segments with no evidence of recombination within each segment
3. Construct trees
[Figure: samples a-e over segment 1 and segment 2; Tree 1 with leaves {c,d}, a, {b,e}; Tree 2 with leaves d, e, a, {b,c}]

Minimize recombination events (Occam’s Razor Principle)
The DSR Algorithm is a bottom-up tree merger in which each node is assigned Dominant, Subdominant, or Recombinant.
[Figure: step-by-step merger of Tree 1 and Tree 2, with nodes marked D, S, or R]

Parameters: devil in the details ...
• SNP patterns
• Granularity
• Haplotype clusters
• Ancestral states
• Sequential Markovian

IRiS: Reconstructing ARG at genomic scales (Bioinformatics 11)
RECOTYPE, RECOMATRIX

Edges & Ages

IRiS: researcher.ibm.com/project/2303

Five regions on X chromosome spanning 2 Mb; 1255 SNPs; 1318 samples; 33 populations
Custom array of SNPs (NOT tagSNPs, i.e. SNPs chosen based on LD)
• Africa: 1. Yoruba, 2. Maasai, 3. Luhya, 4. Chad, 5. African American
• North Africa-Middle East: 6. Lebanese, 7. Kuwaiti, 8. Iranian, 9. Egyptian, 10. Moroccan
• Europe: 11. NW European, 12. British, 13. Dutch, 14. Basque, 15. Gypsies, 16. Tuscan, 17. Romanian, 18. Chechen, 19. Russian
• Central Asia: 20. Tatar, 21. Altaian, 22. Uighur
• South East Asia: 23. Gujarati, 24. Tamil (CN), 25. Tamil (NTN), 26. Kalita
• East Asia: 27. Adi, 28. Tibetan, 29. Laotian, 30. Ati, 31. Chinese, 32. Japanese
• South America: 33. Mexican
[Map: the 33 populations plotted by number]

Multi-Dimensional Scaling (MDS) on ARG

Possible corridor out of Africa: Northern route through Middle East vs. Southern route through Arabia (Bab-el-Mandeb)
• Using recombinational diversity (Nei’s diversity on recotypes)
• Does not match SNP diversity
[Figure: reconstructed ARG with Southern Migration and Northern Migration marked]
https://researcher.ibm.com/researcher/view_project.php?id=2303

Summary of implications:
1.