COMPUTATIONAL METHODS FOR THE FUNCTIONAL ANALYSIS OF DNA SEQUENCE VARIANTS

by

Lucas Santana dos Santos

BS, Universidade Federal de Minas Gerais, 2008
MS, University of Pittsburgh, 2012

Submitted to the Graduate Faculty of the School of Medicine in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Pittsburgh
2017

UNIVERSITY OF PITTSBURGH
SCHOOL OF MEDICINE

This dissertation was presented by Lucas Santana dos Santos

It was defended on April 4, 2017 and approved by

Richard Duerr, MD, Medicine
Vanathi Gopalakrishnan, PhD, Associate Professor, Biomedical Informatics
Xia Jiang, PhD, Associate Professor, Biomedical Informatics
Dissertation Director: Panayiotis Benos, PhD, Professor, Computational and Systems Biology

Copyright © by Lucas Santana dos Santos 2017

COMPUTATIONAL METHODS FOR THE FUNCTIONAL ANALYSIS OF DNA SEQUENCE VARIANTS

Lucas Santana dos Santos, PhD
University of Pittsburgh, 2017

Complex diseases, such as cancer and inflammatory bowel disease, are caused by a combination of genetic and environmental factors. The advent of next-generation sequencing (NGS) technology allowed the genome-wide investigation of the underlying genetic causes of complex disorders. Analysis of the large amount of data generated by NGS is computationally intensive and requires new computational methods. One of the current problems in genomic data analysis is the lack of computational methods for functional annotation of DNA sequence variants (DSVs), especially regulatory DNA sequence variants (rDSVs). In recent years, rDSVs have been shown to be the primary cause of complex diseases, a finding supported by the fact that functional regulatory sites are more polymorphic than coding regions and that rDSVs vastly outnumber coding variants. In addition, genome-wide association studies (GWAS) of complex traits have shown that the SNPs with the strongest association signals lie outside known genes, in non-coding regions of the genome.
This dissertation contributes to a solution for the lack of computational methods for the analysis of DNA sequence variants. Two novel computational methods for the analysis of DSVs are proposed here: 1) an algorithm, called is-miRSNP, that predicts the effects of DSVs on miRNA binding, and 2) a pipeline for the functional annotation of DSVs using NGS data. The is-miRSNP algorithm uses a binding-energy approach to predict the effects of DSVs on miRNA binding. The algorithm is flexible enough to process large amounts of data and can be easily integrated into existing pipelines. Experiments using a manually curated set of experimentally validated DSV-miRNA pairs showed that is-miRSNP outperforms the most popular existing methods. The pipeline for functional annotation of DSVs uses existing state-of-the-art computational methods. It has been applied to an effector memory T cell RNA-Seq dataset related to inflammatory bowel disease and has identified biologically relevant genes and isoforms that are differentially expressed upon treatment with prostaglandin E2 (PGE2). Important pathways and biologically relevant DSVs were also identified and recovered. These methods have the potential to help clinicians and researchers analyze and interpret genomic datasets, and might in the future help the development of new diagnostic methods and treatments.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
1.0 INTRODUCTION
    1.1 THE PROBLEM
        1.1.1 Predicting and prioritizing rDSVs that affect miRNA binding sites
        1.1.2 Functional evaluation and prioritizing of DSVs in genomic datasets
    1.2 THE APPROACH
        1.2.1 THESIS
    1.3 SIGNIFICANCE
    1.4 DISSERTATION OVERVIEW
2.0 PREDICTION OF MIRNA BINDING SITES
    2.1 BACKGROUND
    2.2 METHODS
        2.2.1 Computation of miRNA binding energy background distributions
        2.2.2 Computation of the log-ratio background distributions
        2.2.3 Scoring the effect of a DSV in miRNA binding
        2.2.4 Validation dataset
        2.2.5 Performance comparison
    2.3 RESULTS
        2.3.1 Background distributions
        2.3.2 Performance and comparison to existing tools
        2.3.3 Discussion
3.0 PIPELINE FOR PRIORITIZATION AND FUNCTIONAL EVALUATION OF DNA SEQUENCE VARIANTS
    3.1 BACKGROUND
    3.2 METHODS
        3.2.1 RNA-Seq analysis
            3.2.1.1 Data quality-control
            3.2.1.2 Adapter trimming and read filtering and trimming
            3.2.1.3 Read mapping
            3.2.1.4 Estimation of gene and isoform expression
            3.2.1.5 Differential gene expression analysis
        3.2.2 Variant identification from RNA-Seq
        3.2.3 GWAS data analysis
        3.2.4 Functional Annotation of DSVs
            3.2.4.1 Functional annotation of exonic and splicing DSVs
            3.2.4.2
            3.2.4.3 Functional annotation of intronic, promoters
4.0 ANALYSIS OF INFLAMMATORY BOWEL DISEASE DATASET
    4.1 BACKGROUND
    4.2 METHODS
        4.2.1 Cell isolation, purification and stimulation
        4.2.2 RNA sequencing
        4.2.3 Analysis of gene expression
        4.2.4 Evaluation of cis effects of IBD-associated SNPs within differentially expressed gene regions
    4.3 RESULTS
        4.3.1 The effects of PGE2 dominate the differential gene expression observed in activated effector memory T cells treated with IL-23+IL- mediators
        4.3.2 Effect of PGE2 on the transcriptome of activated effector memory T cells
        4.3.3 Effects of IL-23 and IL- memory T cells
        4.3.4 Pathway analysis reveals important relationships to IBD
        4.3.5 SNPs with potential functional role in IBD
        4.3.6 Effects of PGE2 on expression of isoforms in genes inside IBD-associated regions
    4.4 DISCUSSION
5.0 CONCLUSIONS, LIMITATIONS AND FUTURE WORK
    5.1 CONCLUSIONS
    5.2 LIMITATIONS
    5.3 FUTURE WORK