Limiting Bias in Biological Data Analysis by Pooling Information
Total Page:16
File Type:pdf, Size:1020Kb
UNIVERSITY OF CALIFORNIA, SAN DIEGO Limiting Bias in Biological Data Analysis by Pooling Information A Dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Bioinformatics and Systems Biology by Kunal Bhutani Committee in charge: Professor Nicholas J. Schork, Chair Professor Vineet Bafna, Co-Chair Professor Vikas Bansal Professor Lawrence Goldstein Professor Olivier Harismendy Professor Diane Nugent 2017 Copyright Kunal Bhutani, 2017 All rights reserved. The Dissertation of Kunal Bhutani is approved, and it is acceptable in quality and form for publication on micro- film and electronically: Co-Chair Chair University of California, San Diego 2017 iii DEDICATION To my parents. iv EPIGRAPH In the depth of winter, I finally learned that within me there lay an invincible summer. | Albert Camus v TABLE OF CONTENTS Signature Page................................... iii Dedication...................................... iv Epigraph......................................v Table of Contents.................................. vi List of Figures................................... ix List of Tables....................................x Acknowledgements................................. xi Vita......................................... xii Abstract of the Dissertation............................ xiii Chapter 1 Introduction.............................1 1.1 Genomics and the Identification of Somatically Acquired Variations...........................2 1.1.1 Somatic Mutations in Cancer............3 1.1.2 Induced Pluripotent Stem Cells...........4 1.2 Transcriptomics and Transcriptome-wide Association Studies7 1.3 Metabolomics and Li-Fraumeni Syndrome.........8 Chapter 2 Whole Genome Mutational Burden Analysis of Three Pluripotency Induction Methods.........................9 2.1 Abstract............................9 2.2 Introduction.......................... 10 2.3 Results............................. 11 2.3.1 Identification of single nucleotide variants...... 11 2.3.2 Identification of Insertions and Deletions...... 13 2.3.3 Identification of Structural Variants......... 13 2.3.4 Assessing pathogenicity of unique variants..... 14 2.4 Discussion........................... 16 2.5 Methods............................ 18 2.5.1 Retroviral Reprogramming.............. 18 2.5.2 Sendai Virus Reprogramming............ 19 2.5.3 mRNA Reprogramming................ 19 2.5.4 Cell Culture...................... 20 2.5.5 DNA Extraction and Sequencing.......... 20 2.5.6 Realignment of Illumina Reads and Recalibration. 21 vi 2.5.7 SNVs and Indel Variant Calling........... 21 2.5.8 SNVs and Indel Validation and Filtering...... 22 2.5.9 SNVs and Indel Variant Annotation......... 22 2.5.10 ENCODE Transcription Factor Binding Sites Anno- tation......................... 23 2.5.11 Identification of Integration Sites.......... 23 2.5.12 Permutation Testing................. 24 2.5.13 ANOVA Contrasts.................. 24 2.5.14 Comparison of Combined Annotation Dependent De- pletion Score Distributions.............. 25 2.5.15 BioNano High Molecule Weight DNA Extraction.. 25 2.5.16 BioNano DNA Labeling............... 26 2.5.17 BioNano Data collection............... 26 2.5.18 BioNano De Novo Assembly............. 27 2.5.19 BioNano Structural Variation Calls......... 27 2.5.20 BioNano Identification of Cell Line Specific Calls. 28 2.6 Figures and Tables...................... 29 2.7 Acknowledgements...................... 40 Chapter 3 Modeling prediction error improves power of transcriptome-wide association studies......................... 41 3.1 Abstract............................ 41 3.2 Introduction.......................... 42 3.3 Results............................. 44 3.3.1 Uncertainty in TWAS................. 44 3.3.2 Simulation study................... 45 3.3.3 Application to seven diseases............. 47 3.4 Discussion........................... 52 3.5 Methods............................ 53 3.5.1 Uncertainty in TWAS................. 53 3.5.2 Bayesian two-stage regression model......... 55 3.5.3 Multiple imputation................. 56 3.5.4 Regression calibration................ 56 3.5.5 Simulation study................... 58 3.5.6 GWAS processing................... 59 3.5.7 Reference expression processing........... 60 3.5.8 GEO data....................... 62 3.6 Acknowledgements...................... 74 Chapter 4 Longitudinal metabolome, microbiome, and transcriptome profil- ing of a germline TP53 mutation carrier............. 75 4.1 Abstract............................ 75 4.2 Introduction.......................... 76 4.3 Results............................. 77 vii 4.3.1 Whole Genome Sequencing.............. 77 4.3.2 Longitudinal Profiling................ 79 4.4 Discussion........................... 83 4.5 Methods............................ 85 4.5.1 Whole Genome Sequencing.............. 85 4.5.2 Metabolome...................... 86 4.5.3 Microbiome...................... 88 4.5.4 Transcriptome..................... 88 4.6 Figures and Tables...................... 88 4.7 Acknowledgements...................... 98 Chapter 5 Closing Remarks.......................... 99 Bibliography.................................... 101 viii LIST OF FIGURES Figure 1.1: Overall Virmid workflow.......................5 Figure 2.1: Experimental and computational design for identifying variants caused by reprogramming...................... 29 Figure 2.2: Characterization of variants caused by reprogramming method.. 30 Figure 2.3: A 228.8 kb deletion at Xp22.11 in sample M3 detected by BioNano genome mapping........................... 31 Figure S2.1: Kernel density estimation of coverage and VAF of variants unique to cell lines............................... 34 Figure S2.2: De-novo assemblies of the fibroblast control and one of each iPSC type from nanochannel mapping data................ 35 Figure S2.3: Integration sites of Retrovirus in the three replicates....... 36 Figure S2.4: The distribution of CADD scores for each sample.......... 37 Figure 3.1: Simulation Results.......................... 63 Figure 3.2: Impact of expression normalization on trained models of transcrip- tional regulation........................... 65 Figure 3.3: TWAS results for Crohn's Disease................. 66 Figure S3.1: Graphical model for BAY-TS.................... 67 Figure S3.2: Precision-recall curves for the proposed methods over four simulated cis-regulatory architectures and four trials............. 68 Figure S3.3: Quantile-quantile plots for the frequentist methods across four simulated cis-regulatory architectures and four trials....... 69 Figure S3.4: Power vs. variance ratio across four simulated cis-regulatory archi- tectures................................ 70 Figure S3.5: Mean variance ratio across the diseases............... 71 Figure 4.1: Pedigree of Nuclear Family and Overview of Sample Collection for Patient................................ 89 Figure 4.2: De novo mutation Profiling of the offspring............ 90 Figure 4.3: Metabolome profiling reveals metabolite-dependent instrumentation noise................................. 91 Figure 4.4: High variation in Microbiome profiling............... 92 Figure 4.5: Outliers detected by Transcriptome profiling........... 93 Figure S4.1: Residual mutational signatures of de-novo variants in offspring.. 95 Figure S4.2: Metabolomic profiling of cancer-associated biomarkers...... 96 Figure S4.3: Outliers in simulated real-time monitoring of oncogene transcriptome 97 Figure S4.4: Graphical model for Metabolite Levels............... 97 ix LIST OF TABLES Table 2.1: P-values from the permutation-based ANOVA test for variant type differences across the three reprogramming methods......... 32 Table 2.2: High confidence variants in coding regions............. 33 Table 2.3: Functional impacts as calculated by ANOVA contrasts using a One versus All approach.......................... 33 Table S2.1: Statistics of molecules and de novo assembled genome....... 38 Table S2.2: Coding mutations identified by MuTect............... 39 Table 3.1: TWAS results for Bayesian two-stage regression model....... 64 Table S3.1: Replication of PrediXcan TWAS genes by our imputed genotypes and BAY-TS.............................. 67 Table S3.2: OLS-E results using RLog learned models of transcription..... 72 Table S3.3: MI results using RLog learned models of transcription....... 72 Table S3.4: RC results using RLog learned models of transcription....... 73 Table 4.1: Low instrumentation variance metabolites.............. 94 x ACKNOWLEDGEMENTS Thanks to all the friends, family, collaborators, committee members, students, teachers, and strangers that made this journey worthwhile. Chapter 2, in full, is a reprint of the material as it appears in Nature Communi- cations 2016. Kunal Bhutani, Kristopher L Nazor, Roy Williams, Ha Tran, Heng Dai, Zeljko Dzakula, Edward H Cho, Andy WC Pang, Mahendra Rao, Han Cao, Nicholas J Schork, Jeanne F Loring. The dissertation author was the primary investigator and author of this paper. Chapter 3, in part is currently being prepared for submission for publication of the material. Kunal Bhutani, Abhishek Sarkar, Yongjin Park, Manolis Kellis, Nicholas J Schork. The dissertation author was the primary investigator and author of this paper. Chapter 4, in part is currently being prepared for submission for publication of the material. Kunal Bhutani, Victoria Magnuson, Alexandra Buckley, Danjuma Quarless, Laura Goetz, Nicholas J Schork. The dissertation author was the primary investigator and author of this paper. xi VITA 2011 B.