Bioinformatics Tools for the Analysis of Gene-Phenotype Relationships Coupled with a Next Generation Chip-Sequencing Data Processing Pipeline

Bioinformatics Tools for the Analysis of Gene-Phenotype Relationships Coupled with a Next Generation ChIP-Sequencing Data Processing Pipeline Erinija Pranckeviciene Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the Doctorate in Philosophy degree in Cellular and Molecular Medicine Department of Cellular and Molecular Medicine Faculty of Medicine University of Ottawa c Erinija Pranckeviciene, Ottawa, Canada, 2015 Abstract The rapidly advancing high-throughput and next generation sequencing technologies facilitate deeper insights into the molecular mechanisms underlying the expression of phenotypes in living organisms. Experimental data and scientific publications following this technological advance- ment have rapidly accumulated in public databases. Meaningful analysis of currently available data in genomic databases requires sophisticated computational tools and algorithms, and presents considerable challenges to molecular biologists without specialized training in bioinformatics. To study their phenotype of interest molecular biologists must prioritize large lists of poorly characterized genes generated in high-throughput experiments. To date, prioritization tools have primarily been designed to work with phenotypes of human diseases as defined by the genes known to be associated with those diseases. There is therefore a need for more prioritization tools for phenotypes which are not related with diseases generally or diseases with which no genes have yet been associated in particular. Chromatin immunoprecipitation followed by next generation sequencing (ChIP-Seq) is a method of choice to study the gene regulation processes responsible for the expression of cellular phenotypes. Among publicly available computational pipelines for the processing of ChIP-Seq data, there is a lack of tools for the downstream analysis of composite motifs and preferred binding distances of the DNA binding proteins. This thesis is aimed to address the gap existing in the tools available to process high-throughput ChIP-Seq data to provide rapid analysis and interpretation of large lists of poorly characterized genes. Additionally, programs for the analysis of preferred binding distances of transcription factors were integrated into the pipeline for expedited results. A gene prioritization algorithm linking genes to non-disease phenotypes described by meaningful keywords was developed. This algorithm can be used to process candidate genetic targets of a transcription factor produced by a computational pipeline for ChIP-Seq data analysis. Contents Abstract i List of Figures vi List of Tables viii Abbreviations xi Acknowledgements xiii Preface xv 1 Introduction 1 1.1 ChIP-Seq Data Analysis................................5 1.1.1 Computational Processing of Sequenced Reads...............6 1.1.2 Available Computational Pipelines with Integrated Computational Proce- dures to Analyze ChIP-Seq Data....................... 11 1.2 Analysis of Gene Lists with the Enrichment of Gene Annotations......... 13 1.3 Gene Prioritization................................... 15 1.3.1 Current State of Phenotypic Resources for Phenotype Definitions..... 16 1.3.2 Approaches that are Used by Gene Prioritization Algorithms....... 19 1.3.2.1 Methods Defining the Phenotype by Training Genes....... 20 1.3.2.2 Methods Defining the Phenotype by Keywords.......... 22 1.3.2.3 Methods Prioritizing Genes for Disease Phenotypes....... 22 1.3.3 Inference of Relationships between Genes and Phenotypes......... 22 1.3.4 Concluding Remarks.............................. 26 1.4 Objectives, Contributions and Outline of this Thesis................ 27 2 Optimized ChIP-Sequencing Data Analysis Pipeline 30 2.1 Materials and Methods. Elements of ChIP-Seq Data Processing Pipeline..... 31 2.1.1 Command Line Tools............................. 32 2.1.1.1 Preprocessing Task......................... 32 2.1.1.2 Task of Alignment to a Reference Genome............ 33 2.1.1.3 Task of Peak Calling......................... 34 2.1.1.4 Estimation of Sufficiency of Coverage............... 37 2.1.2 Integrated Tools................................ 38 ii Contents iii 2.1.2.1 Characterization the Genome-Wide Binding of the Transcription Factor................................. 38 2.1.2.2 Task of Association of Peaks to Genes............... 40 2.1.2.3 Task of Motif Analysis in Peaks.................. 41 2.1.2.4 Analysis of Transcription Factor Composite Sequences...... 42 2.2 Conclusions....................................... 44 3 An Algorithm for the Prioritization of Candidate Genes 45 3.1 Materials and Methods: Components of the Prioritization Algorithm....... 47 3.1.1 Description of Input to the Gene Prioritization Algorithm......... 49 3.1.2 Computation of Gene-Phenotype Relationships Based on Literature... 50 3.1.3 Computations of Gene-Phenotype Relationships Based on Training Genes 53 3.1.4 Ranking and Prioritization of Candidate Genes............... 55 3.1.5 Availability of Gene Prioritization Algorithm................ 57 3.2 Mathematical Framework Underlying the Literature-Based Prioritization Method 58 3.2.1 The Fuzzy Set................................. 59 3.2.2 Fuzzy Binary Relation............................. 60 3.2.2.1 General Definition.......................... 61 3.2.2.2 Definition of Similarity and Inclusion FBR............ 62 3.2.3 Modeling the Relationships between Phenotype and Gene Functions by FBR 64 3.2.3.1 Definitions of FBR between MeSH terms and GO annotations. 65 3.2.3.2 Maximum Composition Operation................. 66 3.2.3.3 Inference of Relationships...................... 70 3.2.3.4 Thresholds in α-cuts......................... 71 3.3 Validation of Gene Prioritization Algorithms.................... 72 3.3.1 The benchmark................................. 73 3.3.2 Criteria to estimate performance of the prioritization algorithms..... 73 3.3.3 Method of Prioritization Performance Assessment Used in the Literature- Based Part of the Algorithm......................... 74 3.3.4 Method of Cross-Validation Prioritization Performance Assessment Used in the Algorithm Part Based on Training Genes............... 75 3.3.5 Benchmark Performance in Prioritization Algorithms............ 76 3.4 Application of the Developed Prioritization Algorithm............... 84 3.4.1 Cell Fusion Example.............................. 85 3.4.2 Cancer Related Genes............................. 90 3.5 Discussion. Advantages and Limitations of the Literature-Based Prioritization Algorithm........................................ 94 3.5.1 Factors Affecting the Prioritization...................... 94 3.5.1.1 Most Studied Topics......................... 94 3.5.1.2 Well Studied Phenotypes and Genes................ 95 3.5.1.3 Current State of Knowledge on Genes.............. 96 3.5.1.4 Good Candidates not on the Test List............... 98 3.5.2 Advantages................................... 99 3.5.2.1 Gene-Phenotype Relationships are Inferred without Reading the Full Text of the Articles....................... 99 3.5.2.2 Gene-Phenotype Links Can Be Interpreted in Natural Language 100 3.5.2.3 Specific and Rare Relationships are Promoted.......... 101 Contents iv 3.5.2.4 Physical and Chemical Domains Related to Gene Functions Are Identified............................... 102 3.5.3 Limitations................................... 103 3.5.3.1 Topics Most Studied in MEDLINE................. 103 3.5.3.2 Greater-Studied Genes Being Better Annotated......... 104 3.5.3.3 Heterogeneity of Gene Annotations in Genes with Different Roles 105 3.6 Conclusion....................................... 106 4 ChIP-Seq Data Analysis of BCL11B and TAL1 Transcription Factors 107 4.1 Pipeline Tools Applied to BCL11B ChIP-Seq data................. 109 4.2 Preferred Distances of TF Co-location in TAL1 Genome-Wide Binding Regions in ECF Cells...................................... 116 4.3 Conclusions....................................... 117 5 General Discussion 119 5.1 Overview of Findings.................................. 120 5.2 Examining Evidence of Gene-Phenotype Relationships............... 123 5.2.1 Application Testing.............................. 125 5.3 Evaluation of Performance............................... 126 5.4 Integrative Analysis of ChIP-Seq Data........................ 130 5.5 Future Directions.................................... 135 5.6 Conclusion....................................... 136 References 138 Appendix A. Supplementary Material for Chapter 3 169 A.1 Data sources Used to Create a Local Database for the Algorithm......... 171 A.2 Contents of the Local Database of the Prioritization Algorithm.......... 172 A.3 Parameters of Literature-Based Prioritization Algorithm.............. 173 A.4 Comparison of Literature-Based Prioritization Algorithm with CANDID Tool.. 175 A.5 Details of the \Respiration" Phenotype...................... 177 A.6 Details of the \DNA Repair" Phenotype...................... 178 A.7 Details of the \Autophagy" Phenotype....................... 179 A.8 Details of the \Vasoconstriction" Phenotype................... 181 A.9 User Manual...................................... 183 Appendix B. Supplementary Material for Chapter 2 190 B.10 Dependencies of the ChIP-Seq Data Analysis Tools................. 190 B.11 Characterization Genome Wide Binding of the Transcription Factor....... 192 B.12 Association of Peaks to Genes............................. 195 B.13 Motif Analysis in Peaks................................ 196 B.14 The Fasta from Bed Tool..............................

Bioinformatics Tools for the Analysis of Gene-Phenotype Relationships Coupled with a Next Generation Chip-Sequencing Data Processing Pipeline

CISD2 (NM 001008388) Human Tagged ORF Clone Product Data

Analysis of Trans Esnps Infers Regulatory Network Architecture

Title a New Centrosomal Protein Regulates Neurogenesis By

Anti-ARL4A Antibody (ARG41291)

The Study of Two Transmembrane Autophagy Proteins and the Autophagy Receptor, P62

Supplementary Information Integrative Analyses of Splicing in the Aging Brain: Role in Susceptibility to Alzheimer’S Disease

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Cytotoxic Effects and Changes in Gene Expression Profile

9. Atypical Dusps: 19 Phosphatases in Search of a Role

Genomic Approaches to Reproductive Disorders

Targeting PH Domain Proteins for Cancer Therapy

Gfh 2016 Tagungsband Final.Pdf