COMPUTATIONAL METHODS FOR THE FUNCTIONAL ANALYSIS OF DNA SEQUENCE VARIANTS
by
Lucas Santana dos Santos
BS, Universidade Federal de Minas Gerais, 2008
MS, University of Pittsburgh, 2012
Submitted to the Graduate Faculty of
the School of Medicine in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
University of Pittsburgh
2017 UNIVERSITY OF PITTSBURGH
SCHOOL OF MEDICINE
This dissertation was presented
by
Lucas Santana dos Santos
It was defended on
April 4, 2017
and approved by
Richard Duerr, MD, Medicine
Vanathi Gopalakrishnan, PhD, Associate Professor, Biomedical Informatics
Xia Jiang, PhD, Associate Professor, Biomedical Informatics
Dissertation Director: Panayiotis Benos, PhD, Professor, Computational and Systems Biology
ii Copyright © by Lucas Santana dos Santos
2017
iii COMPUTATIONAL METHODS FOR THE FUNCTIONAL ANALYSIS OF DNA SEQUENCE VARIANTS
Lucas Santana dos Santos, PhD
University of Pittsburgh, 2017
Complex diseases, such as cancer and inflammatory bowel disease, are caused by a combination of genetic and environmental factors. The advent of next-generation sequencing (NGS) technology allowed the genome-wide investigation of the underlying genetic causes of complex disorders. Analysis of the large amount of data generated by NGS is computationally intensive and require new computational methods. One of the current problems in genomic data analysis is the lack of computational methods for functional annotation of DNA sequence variants (DSVs), especially regulatory DNA sequence variants (rDSVs). In recent years, rDSVs have been shown to be the primary cause of complex diseases, supported by the fact that functional regulatory sites are more polymorphic than coding regions, and that rDSVs vastly outnumber coding variants.
Also, GWAS studies of complex traits have shown that SNPs with the strongest association signals lie outside known genes in non-coding regions of the genome.
This dissertation contributes to a solution for the lack of computational methods for the analysis of DNA sequence variants. Two novel computational methods for the analysis of DSVs are proposed here: 1) an algorithm, called is-miRSNP,
DSVs on miRNA binding, 2) a pipeline for the functional annotation of DSVs using NGS. The is-miRSNP algorithm uses a binding-energy approach for the prediction of DSVs effects on miRNA binding. The algorithm is flexible enough to process large amounts of data and can be easily integrated in existing pipelines. Experiments using a manually curated set of experimentally validated DSVs-miRNA showed that is-miRSNP outperforms all most popular
iv existing methods. The pipeline for functional annotation of functional DSVs utilizes state-of-the- art existing computational methods. The pipeline has been applied to an effector memory T cell
RNA-Seq dataset that is related to inflammatory bowel disease and has identified biologically relevant genes and isoforms that are differentially expressed upon treatment with Prostaglandin
E2. Important pathways and biologically relevant DSVs were also identified and recovered.
These methods have the potential to help clinicians and researchers analyze and interpret genomic datasets, and might in the future help the development of new diagnostics methods and treatments.
v TABLE OF CONTENTS
ACKNOWLEDGMENTS ...... XI
1.0 INTRODUCTION ...... 1
1.1 THE PROBLEM ...... 4
1.1.1 Predicting and prioritizing rDSVs that affect miRNA binding sites ...... 5
1.1.2 Functional evaluation and prioritizing of DSVs in genomic datasets ...... 7
1.2 THE APPROACH ...... 8
1.2.1 THESIS ...... 11
1.3 SIGNIFICANCE ...... 12
1.4 DISSERTATION OVERVIEW ...... 13
2.0 PREDICTION OF MIRNA BINDING SITES ...... 14
2.1 BACKGROUND ...... 14
2.2 METHODS ...... 17
2.2.1 Computation of miRNA binding energy background distributions ...... 18
2.2.2 Computation of the log-ratio background distributions ...... 20
2.2.3 Scoring the effect of a DSV in miRNA binding ...... 21
2.2.4 Validation dataset ...... 24
2.2.5 Performance comparison ...... 24
2.3 RESULTS ...... 25
vi 2.3.1 Background distributions ...... 25
2.3.2 Performance and comparison to existing tools ...... 26
2.3.3 Discussion...... 31
3.0 PIPELINE FOR PRIORITIZATION AND FUNCTIONAL EVALUATION OF
DNA SEQUENCE VARIANTS ...... 33
3.1 BACKGROUND ...... 33
3.2 METHODS ...... 35
3.2.1 RNA-Seq analysis ...... 37
3.2.1.1 Data quality-control ...... 37
3.2.1.2 Adapter trimming and read filtering and trimming ...... 37
3.2.1.3 Read mapping ...... 38
3.2.1.4 Estimation of gene and isoform expression ...... 38
3.2.1.5 Differential gene expression analysis ...... 38
3.2.2 Variant identification from RNA-Seq ...... 39
3.2.3 GWAS data analysis ...... 40
3.2.4 Functional Annotation of DSVs ...... 40
3.2.4.1 Functional annotation of exonic and splicing DSVs ...... 41
3.2.4.2 ...... 41
3.2.4.3 Functional annotation of intronic, promoters ... 41
4.0 ANALYSIS OF INFLAMMATORY BOWEL DISEASE DATASET ...... 43
4.1 BACKGROUND ...... 43
4.2 METHODS ...... 46
4.2.1 Cell isolation, purification and stimulation ...... 46
vii 4.2.2 RNA sequencing ...... 47
4.2.3 Analysis of gene expression ...... 47
4.2.4 Evaluation of cis effects of IBD-associated SNPs within differentially
expressed gene regions ...... 48
4.3 RESULTS ...... 49
4.3.1 The effects of PGE2 dominate the differential gene expression observed
in activated effector memory T cells treated with IL-23+IL-
mediators ...... 49
4.3.2 Effect of PGE2 on the transcriptome of activated effector memory T
cells...... 51
4.3.3 Effects of IL-23 and IL-
memory T cells...... 52
4.3.4 Pathway analysis reveals important relationships to IBD ...... 53
4.3.5 SNPs with potential functional role in IBD ...... 57
4.3.6 Effects of PGE2 on expression of isoforms in genes inside IBD-associated
regions ...... 75
4.4 DISCUSSION ...... 76
5.0 CONCLUSIONS, LIMITATIONS AND FUTURE WORK ...... 79
5.1 CONCLUSIONS ...... 79
5.2 LIMITATIONS ...... 80
5.3 FUTURE WORK ...... 81
5.3.1 Expand is-miRSNP to predict effect of INDELS ...... 81
5.3.2 Investigate is-miRSNP background distributions ...... 82
viii 5.3.3 Expand is-miRSNP prediction to other organisms ...... 82
5.3.4 Expand pipeline for functional prioritization of DSVs to integrate
different omics data types ...... 83
5.3.5 Add variant calling step to RNA-Seq analysis ...... 83
BIBLIOGRAPHY ...... 84
ix LIST OF TABLES
Table 1 Detailed is-rSNP results for DSVs in the validation set...... 27
Table 2. Performance comparison of is-miRSNP and other 4 prediction tools...... 29
Table 3. Summary table of differentially expressed genes ...... 49
Table 4. Summary table of differentially expressed isoforms ...... 51