Computational Elucidation of the Regulatory SNPs in the Non-Coding Regions of the Human Genome
AN ABSTRACT OF THE DISSERTATION OF

Yao Yao for the degree of Doctor of Philosophy in Computer Science presented on February 16, 2021.

Title: Computational Elucidation of the Regulatory SNPs in the Non-Coding Regions of the Human Genome

Abstract approved: Stephen Ramsey

We describe a series of novel computational models, CERENKOV (Computational Elucidation of the REgulatory NonKOding Variome) and its successors CERENKOV2, CERENKOV3, and Convolutional CERENKOV3, for discriminating regulatory single nucleotide polymorphisms (rSNPs) from non-regulatory SNPs within non-coding genetic loci. The CERENKOV models are designed for recognizing rSNPs in the context of a post-analysis of a genome-wide association study (GWAS); they include a novel accuracy scoring metric (average rank, or AVGRANK) and a novel cross-validation strategy (locus-based sampling) that both correctly account for the "sparse positive bag" nature of the GWAS post-analysis rSNP recognition problem. We trained and validated the CERENKOV series models using a set of reference SNPs whose composition is based on selection criteria (linkage disequilibrium and minor allele frequency) that we designed to ensure relevance to GWAS post-analysis. The CERENKOV models are based on a machine-learning algorithm (gradient boosted decision trees) incorporating various SNP annotation features derived from genomic, epigenomic, phylogenetic, and chromatin data. CERENKOV2 includes features based on the geometry of the annotation features in data-space, and the CERENKOV3 models include features derived from SNP clustering, molecular networks, and convolutional output on genomic signals.
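The abstract names AVGRANK and locus-based sampling without defining them at this point. As a rough illustration only, the sketch below assumes that AVGRANK is the mean within-locus rank of the true rSNPs (rank 1 being the highest-scoring SNP in its locus) and that locus-based sampling means keeping all SNPs of a locus in the same cross-validation fold. The function names `assign_locus_folds` and `avgrank`, the greedy fold balancing, and the tie handling are hypothetical choices made for this example, not the dissertation's implementation.

```python
from collections import defaultdict


def assign_locus_folds(locus_ids, n_folds):
    """Assign each SNP a fold index so that SNPs in one locus share a fold.

    Assumption: loci are placed greedily, largest first, into the fold
    that currently holds the fewest SNPs.
    """
    sizes = defaultdict(int)
    for locus in locus_ids:
        sizes[locus] += 1
    fold_sizes = [0] * n_folds
    locus_to_fold = {}
    for locus in sorted(sizes, key=lambda name: (-sizes[name], name)):
        fold = fold_sizes.index(min(fold_sizes))
        locus_to_fold[locus] = fold
        fold_sizes[fold] += sizes[locus]
    return [locus_to_fold[locus] for locus in locus_ids]


def avgrank(locus_ids, labels, scores):
    """Mean within-locus rank of the positive SNPs (lower is better).

    Assumption: a positive's rank is 1 plus the number of SNPs in the
    same locus with a strictly higher score, so ties favor the positive.
    """
    by_locus = defaultdict(list)
    for locus, label, score in zip(locus_ids, labels, scores):
        by_locus[locus].append((score, label))
    ranks = []
    for members in by_locus.values():
        for score, label in members:
            if label == 1:
                ranks.append(1 + sum(1 for s, _ in members if s > score))
    if not ranks:
        raise ValueError("no positive SNPs to rank")
    return sum(ranks) / len(ranks)


# Toy data: three loci, each containing exactly one labeled rSNP.
loci = ["A", "A", "A", "B", "B", "C"]
labels = [1, 0, 0, 0, 1, 1]
scores = [0.9, 0.4, 0.1, 0.8, 0.3, 0.5]

folds = assign_locus_folds(loci, n_folds=2)
metric = avgrank(loci, labels, scores)
```

In this toy data the rSNP in locus B is outscored by one neighbor, so its rank is 2 while the other two rSNPs rank first. A global measure such as AUROC pools all SNPs across loci, whereas a within-locus rank of this kind compares each rSNP only with the SNPs in its own locus.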
We compared the validation performance of CERENKOV to nine other methods for rSNP recognition (including GWAVA, RSVP, DeltaSVM, DeepSEA, EIGEN, and DANQ), and found that CERENKOV's validation performance is the strongest of all the classifiers that we tested, by both traditional global rank-based measures (AUPRC, AUROC) and AVGRANK. From the performance comparison between CERENKOV and its successors, we found that rSNP recognition performance benefits from data-space geometry, SNP clustering, and molecular network-derived features.

©Copyright by Yao Yao
February 16, 2021
All Rights Reserved

Computational Elucidation of the Regulatory SNPs in the Non-Coding Regions of the Human Genome

by
Yao Yao

A DISSERTATION
submitted to
Oregon State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Presented February 16, 2021
Commencement June 2021

Doctor of Philosophy dissertation of Yao Yao presented on February 16, 2021.

APPROVED:

Major Professor, representing Computer Science

Head of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my dissertation will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my dissertation to any reader upon request.

Yao Yao, Author

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my advisor Dr. Stephen Ramsey for the continuous support of my PhD study and related research, and for his patience, motivation, and immense knowledge. His guidance helped me throughout the research and writing of this thesis. I could not have imagined having a better advisor and mentor for my PhD study.

Besides my advisor, I would like to thank the rest of my thesis committee: Dr. Glencora Borradaile, Dr. Xiaoli Fern, and Dr. Amir Nayyeri, for their insightful comments and encouragement, and also for the questions that motivated me to widen my research from various perspectives.
I also thank Dr. Harold Bae for his service as a GCR. My sincere thanks also go to my colleagues and friends for their support and help. Special thanks to Tanjin Xu, Jun He, Zheng Liu, Satpreet Singh, Qi Wei, Deqing Qu, Steven Carrell, Finn Womack, Meghamala Sinha, Brandy Nagamine, and Janice Blouse for creating the best memories throughout my graduate life in Dryden Hall.

Last but not least, I would like to thank my parents, my aunts, my uncles, and my grandmas for supporting me spiritually throughout the writing of this thesis and my life in general. Great thanks to my girlfriend Suzy for her deep love and for standing behind me all the time.

TABLE OF CONTENTS

1 Introduction
  1.1 Genome-Wide Association Studies
  1.2 Single-Nucleotide Polymorphisms
  1.3 The rSNP Detection Problem
  1.4 Previous Approaches
    1.4.1 rSNV-Unsupervised Approaches
    1.4.2 rSNV-Supervised Approaches
  1.5 Limitations of Previous Approaches
  1.6 Our Approaches
    1.6.1 CERENKOV
      1.6.1.1 Overview
      1.6.1.2 New Performance Measure: AVGRANK
      1.6.1.3 Locus-Based Sampling
    1.6.2 CERENKOV2
      1.6.2.1 Overview
      1.6.2.2 The Importance of Data-Space Geometry
      1.6.2.3 Data-Space Geometric Features for rSNP Recognition
    1.6.3 CERENKOV3
      1.6.3.1 Hypothesis
      1.6.3.2 Overview
    1.6.4 Convolutional CERENKOV3
      1.6.4.1 Hypothesis
      1.6.4.2 Convolutional Neural Network
      1.6.4.3 Genomic Signals
      1.6.4.4 Convolutional Variational Autoencoder
      1.6.4.5 Overview of Convolutional CERENKOV3
  1.7 Dissertation Organization and Structure
2 Materials and Methods
  2.1 CERENKOV
    2.1.1 The OSU17 Reference SNP Set
    2.1.2 CERENKOV Features
      2.1.2.1 Features Extracted from UCSC Genome Browser
      2.1.2.2 Features Extracted from Ensembl
      2.1.2.3 GTEx Feature
      2.1.2.4 DNA Shape Feature
    2.1.3 Features for the Other Classifiers in Comparison
      2.1.3.1 GWAVA
      2.1.3.2 DeltaSVM
      2.1.3.3 RSVP
      2.1.3.4 DeepSEA
      2.1.3.5 DANQ
      2.1.3.6 EIGEN, CADD, DANN, fitCons
    2.1.4 Machine Learning
      2.1.4.1 Random Forest
      2.1.4.2 Gradient Boosted Decision Trees and Cross-Validation
      2.1.4.3 Tuning CERENKOV
    2.1.5 t-SNE and Statistical Testing
  2.2 CERENKOV2
    2.2.1 The OSU18 Reference SNP Set
    2.2.2 Adjustment to the Non-geometric Features
    2.2.3 Computing the Geometric Features
    2.2.4 Machine Learning
  2.3 CERENKOV3
    2.3.1 Adjustment to the Reference SNP Set and Annotation Features
    2.3.2 A Clustering-Derived Feature: Locus Sizes
    2.3.3 Construction of Molecular Networks
      2.3.3.1 Detailed Procedure for Obtaining SNP-Gene Edges
      2.3.3.2 Detailed Procedure for Obtaining Gene-Gene Edges
    2.3.4 Network-Derived Features
    2.3.5 Machine Learning Pipeline and Hyperparameter Tuning
  2.4 Convolutional CERENKOV3
    2.4.1 Construction of Genomic Signal Tracks
    2.4.2 Adjustment to the Reference SNP Set and Cross-Validation
    2.4.3 Machine Learning
      2.4.3.1 Building Blocks of Convolutional CERENKOV3 Models
      2.4.3.2 Construction of the Ensemble Classifiers
      2.4.3.3 Construction of the Convolutional VAE-Embedded Classifiers
3 Results
  3.1 Metric Comparison between AUPRC and AVGRANK
  3.2 Performance Comparison
    3.2.1 CERENKOV vs. Nine Previous Approaches
    3.2.2 CERENKOV2 vs. CERENKOV
    3.2.3 CERENKOV3 vs. GWAVA, CERENKOV, CERENKOV2
    3.2.4 Convolutional CERENKOV3 vs. CERENKOV3
      3.2.4.1 Convolutional CERENKOV3 Ensemble vs. CERENKOV3
      3.2.4.2 VAE-Embedded Convolutional CERENKOV3 vs. CERENKOV3
  3.3 Feature Analysis
    3.3.1 CERENKOV Feature Importance
    3.3.2 Analysis of the Intralocus Radius Distributions in CERENKOV2
    3.3.3 Analysis of the Intralocus Radius Likelihood Ratios in CERENKOV2
    3.3.4 CERENKOV2 Feature Importance
    3.3.5 Analysis of the Newly Engineered Features in CERENKOV3
4 Conclusion and Discussion
Bibliography

LIST OF FIGURES

1.1 Regional association plot of SNP rs16946931 (±400 kbp)
1.2 Distributions of intralocus radii computed using Pearson distance applied to OSU18 SNPs' feature data (see Sec. 2.2.1), conditioned on the type of reference SNP (rSNP or cSNP) for the intralocus radius calculation
1.3 The geometric idea behind the intra-SNP distance features that are used in CERENKOV2
1.4 Representation of a simple, fictional transcription factor network
1.5 The underlying mechanism of node2vec
1.6 A typical convolutional neural network architecture for medical image classification
1.7 1-D deep convolutional neural network structure
1.8 Two PhyloP score tracks (based on vertebrate species genomic sequence alignments) of two SNPs, rs6663784 and rs3737717
1.9 Illustration of variational autoencoder model with the multivariate Gaussian assumption
2.1 Class-wise frequency histograms of SNPs vs. locus size
2.2 Data sources and types of relations used to construct the molecular networks in CERENKOV3
2.3 Pipeline of CERENKOV3 machine learning approach
2.4 Structure of LeNet-5 CNN
2.5 Structure of VGG16 CNN
2.6 Structure of Convolutional CERENKOV3 ensemble classifier
2.7 Structure of convolutional VAE-embedded ensemble classifier
3.1 The AUPRC and AVGRANK performance measures are functionally distinct
3.2 Validation performance of CERENKOV improves upon nine methods
3.3 Performance of GWAVA, CERENKOV and CERENKOV2 on the OSU18 reference SNP set, by three performance measures
3.4 Performance of GWAVA, CERENKOV, CERENKOV2 and CERENKOV3 on the OSU18 reference SNP set, by three performance measures