Open Wychung-Dissertation.Pdf

The Pennsylvania State University
The Graduate School

UNCOVERING HIDDEN GENOMIC FEATURES USING COMPUTATIONAL APPROACHES

A Dissertation in
Computer Science and Engineering
by
Wen-Yu Chung

© 2009 Wen-Yu Chung

Submitted in Partial Fulfillment of the Requirements for the Degree of
Doctor of Philosophy

May 2009

The dissertation of Wen-Yu Chung was reviewed and approved∗ by the following:

Webb Miller
Professor of Computer Science and Engineering and Biology
Dissertation Co-Advisor, Chair of Committee

Anton Nekrutenko
Assistant Professor of Biochemistry and Molecular Biology
Dissertation Co-Advisor

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

Padma Raghavan
Professor of Computer Science and Engineering

Réka Albert
Associate Professor of Physics

∗Signatures are on file in the Graduate School.

Abstract

Modern genetic studies are heavily dependent on analyses of whole genome sequences that have only become available in the past decade. Technologies such as microarrays and next-generation sequencing can associate quantitative expression patterns of genes with their genomic sequences and allow the study of changes at the genome-wide level or the comparison of multiple genomes. Sequences plus expression information allow us to capture an extensive and realistic overview of any given genome. Novel mathematical and computational methods are essential for managing and mining information from these large-scale data sets. I have undertaken three projects that try to answer the following biological questions using computational approaches: (1) how do duplicate genes diverge in a co-expression network? (2) how many vertebrate genes are there with alternative open reading frames? (3) how can we delineate whole genome expression patterns using new sequencing technology?
Within each project, I have developed computational methods and applied them to targeted data sets, demonstrating the feasibility and power of these new bioinformatic approaches and addressing questions of biological significance.

Table of Contents

List of Figures
List of Tables
Acknowledgments

Chapter 1 Introduction
1.1 Global interactions and constraints
1.2 Dissertation outline

Chapter 2 Rapid and asymmetric divergence of duplicate genes in the human gene coexpression network
2.1 Background
2.2 Results and discussion
2.2.1 Description of the network
2.2.2 Differences between duplicate genes and singletons
2.2.3 Duplicate genes rapidly lose shared coexpressed partners
2.2.4 Acquisition of new coexpressed partners by duplicate genes
2.2.5 Asymmetric expression divergence of duplicate genes
2.2.6 Robustness of the network
2.3 Conclusion
2.4 Methods
2.4.1 Network construction
2.4.2 Identification of duplicate genes
2.4.3 Permutation tests
2.4.4 Asymmetry analysis
2.4.5 Robustness analysis of the network

Chapter 3 A first look at ARFome: dual-coding genes in mammalian genomes
3.1 Introduction
3.2 Results and discussion
3.2.1 Dual coding is virtually impossible by chance
3.2.2 Defining mammalian ARFs
3.2.3 Analysis of nucleotide substitutions suggests functionality of ARFs
3.2.4 What may be the potential function of ARF-encoded proteins?
3.2.5 Conclusions
3.3 Materials and methods
3.3.1 CCRT algorithm
3.3.2 Codon model for overlapping reading frames

Chapter 4 Transcriptome profiling by next-generation sequencing technology
4.1 Background
4.2 Results and discussion
4.2.1 Sequencing result and quality
4.2.2 Mapping reads to the mouse transcriptomes
4.2.3 Identifying novel splice forms
4.3 Conclusion

Chapter 5 Conclusion
5.1 Summary
5.2 Future research interests
Appendix A Supplementary materials for Chapter 2
Appendix B Supplementary materials for Chapter 3
Bibliography

List of Figures

2.1 Degree distribution of the studied network (T ≥ 7 and R ≥ 0.7). The degree distribution of the studied network

2.2 The relationship between clustering coefficient c and node degree k for (A) all genes, (B) ubiquitously expressed genes, and (C) nonubiquitously expressed genes. Each point represents an average value for 100 genes.

2.3 The number of duplicate genes and singletons in every 500 genes ranked by degree. Duplicate genes are marked by triangles and singletons are marked by circles. The genes with the highest degree are shown at the left side of the figure.

2.4 The schematic representation of duplicate gene evolution (A) prior to duplication event, (B) immediately after duplication, (C, D, E) after some time following gene duplication. The ancestral singleton gene is shown with a crossed line, duplicate genes are in black, shared ancestral partners are in grey, unique ancestral partners are in stripes, and unique acquired partners are in white; ns, n1, and n2 are the numbers of partners for a singleton, first duplicate, and second duplicate, respectively; n12 is the number of shared partners for a duplicate gene pair.

2.5 The change in the fraction of shared partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. Dashed line indicates the fraction of shared partners averaged among 1000 randomly selected pairs of singletons (random selection process was repeated 1000 times).

2.6 The change in the total number of coexpressed partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. The lower dashed line is the average number of partners for a singleton and the upper dashed line is twice the average number of partners for a singleton.

2.7 Asymmetric divergence in gene expression.
(A) Plot of degree of one gene versus degree of another gene for 1,547 duplicate gene pairs with KS < 2 (inset shows pairs with both degrees below 200). (B) The same plot after numerical simulation of symmetric divergence with equal probability of loss and gain of coexpressed partners (P = 0.5). (C) The relationship between the difference in degree and time since duplication (measured by KS) for a pair of duplicate genes. Each point represents an average value for 40 duplicate gene pairs.

2.8 The results of in silico perturbations of the network. The effect of random removal of genes (error) on (A) the relative size of a giant cluster and (B) the average shortest path length. The effect of degree-based removal of genes (attack) on (C) the relative size of a giant cluster and (D) the average shortest path length (inset shows the fraction of edges removed). Singletons are marked by circles, duplicate genes by triangles, and all genes by squares.

3.1 Three known examples of mammalian dual-coding genes. (A) A transcript of the Gnas1 gene contains two reading frames and produces two structurally unrelated proteins, XLαs and ALEX, by differential utilization of translation start sites. (B) A newly transcribed XBP1 mRNA can only produce protein XBP1U from ORF A. Removal of a 26-bp spacer (yellow rectangle) joins the beginning of ORF A with ORF B and translates into a different product called XBP1S. (C) Ink4a generates two splice variants that use different reading frames within exon E2 to produce the proteins p16Ink4a and p19ARF.

3.2 mRNAs from human and mouse are aligned. Mouse mRNAs are indicated by lowercase letters. Each of the two mRNAs contains an annotated coding region (white boxes). Our algorithm looks for ARFs (black boxes) that are shifted one (shown) or two nucleotides relative to the annotated frame. The locations of the ARFs must be conserved between the species. Specifically, the ARFs in the two species must overlap for at least 500 bp.
4.1 An example of the quality score distribution. The x-axis is the position along the read (absolute for fixed-length reads, such as Illumina/Solexa and SOLiD, or a percentage of the length for Roche/454 Life Sciences reads) and the y-axis is the base-calling score. The distribution shows that base-calling scores dropped below 20 after read position 28.

4.2 A hypothetical example of the strategy used to obtain novel exon junctions. (A) Gene A had four exons, E1, E2, E3, and E4. Dashed lines connect all possible exon-exon combinations that respect exon order. (B) Two transcripts, T1 and T2, were alternative splicing variants of gene A. T1 had E1, E2, and E3. T2 had E1, E2, and E4. (C) For gene A, the junctions between E1 and E3, E1 and E4, and E3 and E4 were novel junctions, which were not from known transcripts. (D) 20 bp on either side of every possible junction were taken and joined to form junction sequences. For gene A, there were 6 possible junctions in total, of which 3 were known junctions and 3 were novel junctions.

4.3 Examples of paired-end reads mapped to novel junctions. (A) Invalid mapping: one end of a paired-end read mapped at J13; the other end mapped at E2. It is not reasonable to have a splicing form of this kind. (B-D) All valid mappings. (B) Valid mapping: one end of a paired-end read mapped at a novel junction (J13); the other end mapped at an exon (E3). This indicates a novel splicing form of E1 and E3. Another possible situation is the other end mapped at an exon that is not part of the junction (e.g., E4). This indicates a novel splicing form of E1, E3, and E4.
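
The junction strategy in the caption of Figure 4.2 can be illustrated with a minimal Python sketch. It is not the dissertation's implementation. The exon sequences are made up, and the names `junction_sequences` and `known_junctions` are hypothetical helpers introduced only for illustration. Only the 20 bp flank, the four exons, and the two transcripts T1 and T2 come from the caption.

```python
from itertools import combinations

FLANK = 20  # bp taken from each side of a junction, as stated in the caption


def known_junctions(transcripts):
    """Collect exon pairs that are adjacent in at least one known transcript."""
    known = set()
    for exons in transcripts.values():
        known.update(zip(exons, exons[1:]))
    return known


def junction_sequences(exon_seqs, exon_order, flank=FLANK):
    """Build one junction sequence per order-respecting exon pair."""
    junctions = {}
    for upstream, downstream in combinations(exon_order, 2):
        left = exon_seqs[upstream][-flank:]    # last `flank` bp of the upstream exon
        right = exon_seqs[downstream][:flank]  # first `flank` bp of the downstream exon
        junctions[(upstream, downstream)] = left + right
    return junctions


# Hypothetical gene A: four exons with made-up 30 bp sequences (assumption)
exon_order = ["E1", "E2", "E3", "E4"]
exon_seqs = {name: base * 30 for name, base in zip(exon_order, "ACGT")}
transcripts = {"T1": ["E1", "E2", "E3"], "T2": ["E1", "E2", "E4"]}

all_junctions = junction_sequences(exon_seqs, exon_order)
known = known_junctions(transcripts)
novel = sorted(pair for pair in all_junctions if pair not in known)
```

With these inputs, `all_junctions` holds the six order-respecting exon pairs and `novel` holds the three pairs that neither T1 nor T2 supports, matching the counts given in the caption.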

Gene Prediction: the End of the Beginning Comment Colin Semple

Predicting Clinical Response to Treatment with a Soluble Tnf-Antagonist Or Tnf, Or a Tnf Receptor Agonist

Association of Gene Ontology Categories with Decay Rate for Hepg2 Experiments These Tables Show Details for All Gene Ontology Categories

Reconstructing Contiguous Regions of an Ancestral Genome

MIRA-Assisted Microarray Analysis, a New Technology for The

Duplication, Deletion, and Rearrangement in the Mouse and Human Genomes

BIOINFORMATICS ISCB NEWS Doi:10.1093/Bioinformatics/Btp280

Open Thesisformatted Final.Pdf

ENCODE Analysis Working Group and Data Analysis Centre Rick Myers

439: PALM: Probabilistic Area Loss Minimization for Protein Sequence

Comparative Genomics

The Myth of Junk DNA