Design and Analysis Issues in Family-Based Association
Emerging Challenges in Statistical Genetics
Duncan Thomas
University of Southern California

Human Genetics in the Big Science Era
• “Big Data” – large n and large p and complexity, e.g., NIH Biomedical Big Data Initiative (RFA-HG-14-020)
• Large n: challenge for computation and data storage, but not conceptual
• Large p: many data mining approaches, few grounded in statistical principles
• Sparse penalized regression & hierarchical modeling from Bayesian and frequentist perspectives
• Emerging –omics challenges

Genetics: from Fisher to GWAS
• Population genetics & heritability
  – Mendel / Fisher / Haldane / Wright
• Segregation analysis
  – Likelihoods on complex pedigrees by peeling: Elston & Stewart
• Linkage analysis (PCR / microsats / SNPs)
  – Multipoint: Lander & Green
  – MCMC: Thompson
• Association
  – TDT, FBATs, etc: Spielman, Laird
  – GWAS: Risch & Merikangas
  – Post-GWAS: pathway mining, next-gen sequencing

Association: From hypothesis-driven to agnostic research
[Figure: candidate genes, candidate pathways, GWAS (ht-SNPs), hierarchical models, ontologies, pathway mining]

MRC BSU SGX Plans
Objectives:
  – Integrating structural and prior information for sparse regression analysis of high dimensional data
  – Clustering models for exposure-disease associations
  – Integrating network information
  – Penalised regression and Bayesian variable selection
  – Mechanistic models of cellular processes
  – Statistical computing for large scale genomics data
Targeted areas of impact:
  – gene regulation and immunological response
  – biomarker based signatures
  – targeting of treatment
  – gene-environment interactions and biological pathways

Sparse Regression & Hierarchical Modeling
• Small n large p problem
• Variable selection, model averaging
• Data mining approaches
• Penalized regression
  – Ridge regression
  – LASSO
  – Group LASSO, elastic net, group bridge, etc:
    • Combine variable selection and shrinkage
    • Allow for multi-level models (SNPs / genes / pathways)
  – Evolutionary Stochastic Search
(ESS++, GUESS)
• Hierarchical models
  – Incorporate external information

Profile regression approach (regression on latent clusters c)
• Opportunity to add external info to the profile probabilities

ESS++ & GUESS
Bottolo & Richardson, Bayesian Anal 2010
Bottolo et al, Bioinformatics 2011
Bottolo et al, PLoS Genet 2013
• Metropolis Coupled MCMC (MC3)
  – Multiple chains at different temperatures with swapping (“parallel tempering”) (Geyer 1991)
  – “Evolutionary MC” combines MCMC with Genetic Algorithms (e.g., crossover operators)
• GPU implementation
• Allows application to GWAS data on multiple phenotypes (large p small n)

Basic Hierarchical Models
• Level 1 (outcomes Y_i, variables X_i = (X_p)), or
• (meta-analysis context) −ln(p_p) ~ χ(β_p)
• Level 2 (coefficients β_p, external info Z_k)
  – π_p = probability variable p has any effect
  – μ_p = expected effect size if nonzero
• Or model variances or covariances

Hierarchical Model for Rare Variants
• BRI or iBMU (Quintana & Conti, 2011, 2012, 2013)
• BRCA rare variants in WECARE

Hierarchical Model for Rare Variants (Quintana & Conti, Hum Hered 2012)

Multi-level for SNPs within Genes in DNA Damage Response Pathways
Duan & Thomas, Int J Genom 2013: Art. 406217

WECARE Study Results (Duan & Thomas, Int J Genom 2013: Art. 406217)
[Figure: results by number of genes and number of SNPs]

WECARE Study Results (Duan & Thomas, Int J Genom 2013: Art. 406217)
[Figure: results for genes and SNPs]

“Post-GWAS”
• Fine mapping, meta-analyses, annotation, epidemiologic characterization (e.g., GxE), risk models, functional studies
• Integrative genomics: across –omics platforms, phenotypes, species
  – Great opportunity for hierarchical modeling!
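As a rough illustration of the two-level idea on the Basic Hierarchical Models slide, the sketch below takes a first-stage estimate for one variable and computes the posterior probability that its effect is nonzero, together with the posterior mean effect. The normal mixture prior, the logistic and linear links from external information z to π_p and μ_p, and every name and parameter here (`a0`, `a1`, `b0`, `b1`, `tau2`) are assumptions chosen for illustration; none of them is taken from the slides or the cited papers.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def second_level_prior(z, a0, a1, b0, b1):
    """Assumed Level 2 links: external info z sets pi_p and mu_p."""
    pi_p = expit(a0 + a1 * z)   # prior probability of any effect
    mu_p = b0 + b1 * z          # prior expected effect size if nonzero
    return pi_p, mu_p


def posterior_effect(beta_hat, se, pi_p, mu_p, tau2):
    """Posterior under an assumed prior: beta_p = 0 with probability 1 - pi_p,
    otherwise beta_p ~ N(mu_p, tau2); beta_hat ~ N(beta_p, se^2)."""
    s2 = se ** 2
    like_null = normal_pdf(beta_hat, 0.0, s2)
    like_alt = normal_pdf(beta_hat, mu_p, s2 + tau2)
    post_prob = pi_p * like_alt / (pi_p * like_alt + (1.0 - pi_p) * like_null)
    # Conditional on a nonzero effect, the estimate is shrunk toward mu_p.
    shrink = tau2 / (tau2 + s2)
    cond_mean = mu_p + shrink * (beta_hat - mu_p)
    return post_prob, post_prob * cond_mean


if __name__ == "__main__":
    pi_p, mu_p = second_level_prior(z=1.0, a0=-2.0, a1=1.0, b0=0.0, b1=0.2)
    print(posterior_effect(beta_hat=0.6, se=0.2, pi_p=pi_p, mu_p=mu_p, tau2=0.25))
```

In this sketch the external information only moves the prior; a variable with a favorable z needs less first-stage evidence to reach the same posterior probability.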
• Machine learning for network discovery (gene-gene, protein-protein interactions)
• Bayesian networks (Friedman, Science 2004; Jia & Zhao, Hum Genet 2014)
  – Using informative priors on network structure (Mukherjee & Speed, PNAS 2008)

Algorithm for Learning Pathway Structure (ALPS, Baurley et al 2010)
Millstein et al, AJHG 2006
• Extension of logic regression (Kooperberg & Ruczinski 2005)
[Figure: genes and exposures Z, joint effects X (parameters θ), disease Y (parameters β), network topology Λ with prior topology Λ0 and parameter φ]
• $X_p(\Lambda) = f(X_{pa(p)}(\Lambda), \theta_{p\Lambda})$
• $g[E(Y \mid X, \Lambda)] = X_{P,\Lambda}\,\beta$
• $p(\Lambda) \sim \exp[-\varphi\, d(\Lambda, \Lambda_0)]$
• Allows inference on model structure and on predictions p(Y|Z) allowing for uncertainty about model structure

Meta-Analysis & Mega-Analysis
• Why? Power
• How? Consortia, meta-analysis
• Epi 101 concerns (Lango Allen et al, Nature 2010):
  – Inappropriate controls
  – Heterogeneity
  – Lack of risk factor data, harmonization
  – Can genomic control overcome all? (Clayton 2005)
• Countless hours on teleconferences & travel to meetings!
• All to discover another p loci with RRs < 1.2 accounting for <x% of the “missing heritability”??

Next Generation Sequencing
• Limitations of GWAS
  – Small effect sizes
  – Missing heritability
  – Predominantly non-coding
• NGS can assay entire genome (~50M polymorphic sites out of 3.3B BPs)
• Noisy data, depending on depth of coverage
• Design issues
  – Sample size vs.
coverage, constrained by cost
  – DNA pooling, bar coding, pools of pools
  – Family-based, two-phase designs

Next-Gen Sequencing Designs
• Two-phase designs for targeted follow-up of GWAS hits
  – Subsampling jointly on disease and associated variant(s)
  – Joint analysis of GWAS and sequence data
• Family-based designs for genome-wide sequencing (Thomas et al, Front Genet 2013)
  – Sampling of families and members based on phenotypes
  – Analyses that exploit co-segregation of variants with phenotypes and use all phenotype data (including subjects not sequenced)
• Sampling of “unrelateds” based on population genetics (Kang & Marjoram, Genet Epidemiol 2012)
• Optimal design using DNA pools (Liang et al, Genet Epidemiol 2012)
  – Bayesian analysis of unobserved individual-level associations from pool data
  – Trade-off between numbers of subjects (N), numbers of pools (P), and depth of sequencing (R)
  – Optimal design where cost depends on P, R, N

Some Other “Big Data” Problems
• Pharmaco- (toxico-) genomics
• Exposome
• Microbiome
• Connectome

Pharmacogenetics
• From individual PBPK modeling to Bayesian population PBPK models (PK-BUGS)
  – Combining differential equation and statistical models
  – Davidian & Gallant, J Pharmacokinet Biopharm 1992
  – Wakefield, JASA 1996
  – Gelman, Bois & Jiang, JASA 1996
  – Lunn & Best, J Pharmacokinet Pharmacodyn 2002
• Incorporation of genetic variability in rates
  – Conti et al., Hum Hered 2003
• Extension to genomics
  – Weinshilboum & Wang, Annu Rev Genom Hum Genet 2006
  – Nebert et al, Drug Metab Rev 2008

Differential equation models for pharmacokinetics
Cortessis & Thomas, IARC #157, 2003
• Consider a single intermediate metabolite, M, created at rate λE, removed at rate μM: $dM/dt = \lambda E - \mu M$
• Individuals’ metabolic rates λ_i and μ_i depend lognormally on genes G1 and G2 respectively
• Steady state solution is $M_i = (\lambda_{G_{i1}} / \mu_{G_{i2}})\, E_i$
• Multistep processes with linear kinetics: M_K(E, G) is E times a product over steps k = 1, …, K of ratios of the rates λ_k(G) and μ_k(G) [exact formula garbled in extraction]
• Biomarkers lognormal on M
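The single-metabolite model above can be checked numerically. The sketch below integrates dM/dt = λE − μM with a plain Euler scheme and compares the result with the closed-form steady state (λ/μ)E. The function names, step size, time horizon, and parameter values are illustrative assumptions, not part of the original slides.

```python
def steady_state(lam, mu, exposure):
    """Closed-form steady state of dM/dt = lam * E - mu * M."""
    return lam * exposure / mu


def simulate_metabolite(lam, mu, exposure, m0=0.0, dt=0.001, t_end=50.0):
    """Euler integration of dM/dt = lam * E - mu * M, starting from M(0) = m0."""
    m = m0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        m += dt * (lam * exposure - mu * m)
    return m


if __name__ == "__main__":
    lam, mu, exposure = 2.0, 0.5, 3.0   # hypothetical rates and exposure
    print("simulated:", simulate_metabolite(lam, mu, exposure))
    print("closed form:", steady_state(lam, mu, exposure))
```

Because the kinetics are linear, the simulated level approaches the steady state exponentially at rate μ, so a horizon of many multiples of 1/μ is enough for the two values to agree closely.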
• Disease risk depends on M: $\mathrm{logit}\,\Pr(Y_i = 1) = \beta_0 + \beta_1 M_i(E_i, G_{i1}, G_{i2})$
• Extensions:
  – Michaelis-Menten kinetics for removal [formula garbled in extraction]
  – SDEs for single molecule kinetics
[Figure: pathway diagram with nodes G_i1, G_i2, λ_i, μ_i, E_i, M_i, B_i, Y_i]

Use of Biomarkers
• Measurements of hypothesized intermediates can substantially improve ability to fit pathway-based models
• But need to be concerned about “reverse causation” in case-control designs
  – disease Y or its treatment affects intermediate variable M or its measurement B|M
• Approaches:
  – Mendelian randomization
  – Imputation of M | G, E, Y=0
  – Full Bayesian (sampling M)
• Two-phase sampling for B measurements:
  – Conditional on Y, G, E
• Pooling of biomarker samples for greater efficiency (Weinberg & Umbach 1999, 2011)
[Figure: diagram with nodes G, E, M, B, Y]

The Exposome
• Chris Wild (CEBP 2005): “Comprehensive description of lifelong exposure history”
• Implementation via metabolomics (MS) (Wild 2010; Rappaport 2012; Chadeau-Hyam 2011)
• EWAS (Patel, 2010)
• Problems
  – Temporal variation
  – Reverse causation
  – Confounding by population structure
  – Metabolite measurement error
• Gene environment wide interaction studies (GEWIS)
  – Two-step (Murcray 2010; Gauderman 2013; Lewinger 2013)
  – EB or BMA (Mukherjee et al, 2008, 2012; Li & Conti, 2009)
• Epigenetic mediation (Huang et al, Ann Appl Stat 2013)

The Microbiome
• 10× as many microbial as human cells
• 3 million microbial vs 25,000 human genes
• NIH Human Microbiome Project: 35 billion DNA reads from 300 subjects in 15 body sites: 2.3 TB data!
(Cho & Blaser, Nat Rev Genet 2012)
• Associations of broad taxonomic units or specific species with exposures or disease
• Community measures like diversity (entropy) or resilience (Li et al., PLoS One 2012)
• Host-microbiome interactions (Kinross et al, Genome Med 2011)

The Connectome (NIH BRAIN Initiative: RFA-MH-14-217)
• Functional MRI of the brain
• ENIGMA Consortium: 12,000 individuals from 100+ centers with functional MRI (Thompson et al, Brain Imag Behav 2014)
• High-dimensional phenotypes x GWAS
• Associations with volumes straightforward
• Neuronal connections (Prasad et al, 2014)
  – MCMC clustering of connections between partitions of 500 regions related to disease
  – Bell’s number: B(5) = 52, B(10) = 115,975, … , B(500) = 1.6×10^843

The Future of Computing
Schadt et al, Nat Rev Genet 2010
“Sequencing DNA, RNA, the epigenome, the metabolome and the proteome from numerous cells in millions of individuals, and sequencing environmentally collected samples routinely from thousands of locations a day, will take us into the exabyte [10^18] scales of data in the next 5–10 years.”

Computational Advances
• Cluster, grid, or cloud computing,