The Pennsylvania State University The Graduate School

UNCOVERING HIDDEN GENOMIC FEATURES USING

COMPUTATIONAL APPROACHES

A Dissertation in Computer Science and Engineering by Wen-Yu Chung

© 2009 Wen-Yu Chung

Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

May 2009

The dissertation of Wen-Yu Chung was reviewed and approved∗ by the following:

Webb Miller Professor of Computer Science and Engineering and Biology Dissertation Co-Advisor, Chair of Committee

Anton Nekrutenko Assistant Professor of Biochemistry and Molecular Biology Dissertation Co-Advisor

Raj Acharya Professor of Computer Science and Engineering Head of the Department of Computer Science and Engineering

Padma Raghavan Professor of Computer Science and Engineering

Réka Albert Associate Professor of Physics

∗Signatures are on file in the Graduate School.

Abstract

Modern genetic studies are heavily dependent on analyses of whole genome sequences that have only become available in the past decade. Technologies such as microarrays and next-generation sequencing can associate quantitative expression patterns of genes with their genomic sequences and allow the study of changes at the genome-wide level or the comparison of multiple genomes. Sequences plus expression information allow us to capture an extensive and realistic overview of any given genome. Novel mathematical and computational methods are essential for managing and mining information from these large-scale data sets. I have undertaken three projects that try to answer the following biological questions using computational approaches: (1) how do duplicate genes diverge in a co-expression network? (2) how many vertebrate genes have alternative open reading frames? (3) how can we delineate whole genome expression patterns using new sequencing technology? Within each project, I have developed computational methods and applied them to targeted data sets, demonstrating the feasibility and power of these new bioinformatic approaches and addressing questions of biological significance.

Table of Contents

List of Figures vi

List of Tables ix

Acknowledgments x

Chapter 1 Introduction 1
1.1 Global interactions and constraints ...... 1
1.2 Dissertation outline ...... 3

Chapter 2 Rapid and asymmetric divergence of duplicate genes in the human coexpression network 4
2.1 Background ...... 4
2.2 Results and discussion ...... 6
2.2.1 Description of the network ...... 6
2.2.2 Differences between duplicate genes and singletons ...... 9
2.2.3 Duplicate genes rapidly lose shared coexpressed partners ...... 9
2.2.4 Acquisition of new coexpressed partners by duplicate genes ...... 12
2.2.5 Asymmetric expression divergence of duplicate genes ...... 13
2.2.6 Robustness of the network ...... 15
2.3 Conclusion ...... 17
2.4 Methods ...... 19
2.4.1 Network construction ...... 19
2.4.2 Identification of duplicate genes ...... 21
2.4.3 Permutation tests ...... 21
2.4.4 Asymmetry analysis ...... 21
2.4.5 Robustness analysis of the network ...... 22

Chapter 3 A first look at ARFome: dual-coding genes in mammalian genomes 23
3.1 Introduction ...... 23
3.2 Results and discussion ...... 25
3.2.1 Dual coding is virtually impossible by chance ...... 25
3.2.2 Defining mammalian ARFs ...... 26
3.2.3 Analysis of nucleotide substitutions suggests functionality of ARFs ...... 26
3.2.4 What may be the potential function of ARF-encoded proteins? ...... 29
3.2.5 Conclusions ...... 30
3.3 Materials and methods ...... 32
3.3.1 CCRT algorithm ...... 32
3.3.2 Codon model for overlapping reading frames ...... 33

Chapter 4 Transcriptome profiling by next-generation sequencing technology 36
4.1 Background ...... 36
4.2 Results and discussion ...... 38
4.2.1 Sequencing result and quality ...... 38
4.2.2 Mapping reads to the mouse transcriptomes ...... 40
4.2.3 Identifying novel splice forms ...... 42
4.3 Conclusion ...... 44

Chapter 5 Conclusion 46
5.1 Summary ...... 46
5.2 Future research interests ...... 47

Appendix A Supplementary materials for Chapter 2 49

Appendix B Supplementary materials for Chapter 3 57

Bibliography 65

List of Figures

2.1 Degree distribution of the studied network (T ≥ 7 and R ≥ 0.7). ...... 7

2.2 The relationship between clustering coefficient c and node degree k for (A) all genes, (B) ubiquitously expressed genes, and (C) nonubiquitously expressed genes. Each point represents an average value for 100 genes. ...... 8

2.3 The number of duplicate genes and singletons in every 500 genes ranked by degree. Duplicate genes are marked by triangles and singletons are marked by circles. The genes with the highest degree are shown at the left side of the figure. ...... 10

2.4 The schematic representation of duplicate gene evolution (A) prior to duplication event, (B) immediately after duplication, (C, D, E) after some time following gene duplication. The ancestral singleton gene is shown with a crossed line, duplicate genes are in black, shared ancestral partners are in grey, unique ancestral partners are in stripes, and unique acquired partners are in white; ns, n1 and n2 are the numbers of partners for a singleton, first duplicate, and second duplicate, respectively; n12 is the number of shared partners for a duplicate gene pair. ...... 11

2.5 The change in the fraction of shared partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. Dashed line indicates the fraction of shared partners averaged among 1000 randomly selected pairs of singletons (random selection process was repeated 1000 times). ...... 12

2.6 The change in the total number of coexpressed partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. The lower dashed line is the average number of partners for a singleton and the upper dashed line is twice the average number of partners for a singleton. ...... 14

2.7 Asymmetric divergence in gene expression. (A) Plot of degree of one gene versus degree of another gene for 1,547 duplicate gene pairs with KS < 2 (inset shows pairs with both degrees below 200). (B) The same plot after numerical simulation of symmetric divergence with equal probability of loss and gain of coexpressed partners (P = 0.5). (C) The relationship between the difference in degree and time since duplication (measured by KS) for a pair of duplicate genes. Each point represents an average value for 40 duplicate gene pairs. ...... 15

2.8 The results of in silico perturbations of the network. The effect of random removal of genes (error) on (A) the relative size of a giant cluster and (B) the average shortest path length. The effect of degree-based removal of genes (attack) on (C) the relative size of a giant cluster and (D) the average shortest path length (inset shows the fraction of edges removed). Singletons are marked by circles, duplicate genes by triangles, and all genes by squares. ...... 17

3.1 Three known examples of mammalian dual-coding genes. (A) A transcript of the Gnas1 gene contains two reading frames and produces two structurally unrelated proteins, XLαs and ALEX, by differential utilization of translation start sites. (B) A newly transcribed XBP1 mRNA can only produce XBP1U from ORF A. Removal of a 26-bp spacer (yellow rectangle) joins the beginning of ORF A with ORF B and translates into a different product called XBP1S. (C) Ink4a generates two splice variants that use different reading frames within exon E2 to produce the proteins p16Ink4a and p19ARF. ...... 24

3.2 mRNAs from human and mouse are aligned. Mouse mRNAs are indicated by lowercase letters. Each of the two mRNAs contains an annotated coding region (white boxes). Our algorithm looks for ARFs (black boxes) that are shifted one (shown) or two nucleotides relative to the annotated frame. The locations of the ARFs must be conserved between the species. Specifically, the ARFs in the two species must overlap for at least 500 bp. ...... 28

4.1 An example of the quality score distribution. The x-axis is the position along the read (absolute position for fixed-length reads, such as Illumina/Solexa and SOLiD reads, or percentage of read length, such as for Roche/454 Life Sciences reads) and the y-axis is the base-calling score. The quality score distribution showed that the base-calling scores dropped below 20 after read position 28. ...... 39

4.2 A hypothetical example of the strategy used to obtain novel exon junctions. (A) Gene A had four exons, E1, E2, E3 and E4. Dashed lines connect all possible exon-exon combinations that respect exon order. (B) Two transcripts, T1 and T2, were alternatively spliced variants from gene A. T1 had E1, E2 and E3. T2 had E1, E2 and E4. (C) For gene A, junctions between E1 and E3, E1 and E4, and E3 and E4 were novel junctions, which were not from known transcripts. (D) 20 bp on either side of every possible junction were taken and joined to form junction sequences. For gene A, there were 6 possible junctions in total, of which 3 were known junctions and 3 were novel junctions. ...... 41

4.3 Examples of paired reads mapped to novel junctions. (A) Invalid mapping: one end of a paired-end read mapped at J13; the other end mapped at E2. It is not reasonable to have a splice form of this kind. (B-D) All valid mappings. (B) Valid mapping: one end of a paired-end read mapped at a novel junction (J13); the other end mapped at an exon (E3). This indicates a novel splice form of E1 and E3. Another possible situation is that the other end mapped at an exon that is not part of the junction (e.g., E4). This indicates a novel splice form of E1, E3 and E4. (C) Both ends were mapped at junctions. One end of a paired-end read mapped at a novel junction (J13); the other end also mapped at a novel junction (J34). This indicates a novel splice form of E1, E3 and E4. (D) Both ends were mapped at exons. With paired-end information, we were able to detect novel junctions when reads were mapped within exons and the exons composed a new splice form. ...... 43

List of Tables

2.1 Description of the studied network and the differences between duplicate genes and singletons...... 7

3.1 List of ARF-containing genes identified using a high-stringency approach...... 27

4.1 Summary of novel transcripts analysis...... 39

Acknowledgments

To my advisors, my family, and my friends. Without them, nothing is possible. Special thanks to: Samir Wadhawan, Jianbin He, Dan Blankenberg, Guruprasad Ananda, Benjamin Dickins, Greg Von Kuster, Erika Kvikstad, Chungoo Park, Yogeshwar Kelka, Hiroki Goto, Melissa Wilson, Kateryna Makova, Chuan-Yi Tang, Wen-Hsiung Li, and Der-Tsai Lee.

Chapter 1

Introduction

1.1 Global interactions and constraints

Applying theories and methods from computer science and statistics to sequence analyses is essential for genomic research. Sequence similarity search [1, 2, 3] and gene prediction [4] algorithms are two examples that have gained huge popularity and are primary tools for the analysis of genomic data. Furthermore, molecular evolutionary methods, instantiated in sequence substitution models and phylogenetic trees, have helped to explain the dynamic processes operating on biological sequences. Yet novel methodologies are needed for managing and mining information as the amount and types of molecular data increase.

Network analysis of duplicate gene expression patterns

Gene duplication is known to be one of the most common mechanisms of genome evolution [5]. It is believed that duplication provides raw material that contributes to genetic diversity. The fate of genes after duplication is largely captured by two scenarios: duplicate genes subdivide the functions of the parental gene (subfunctionalization) or acquire new functions (neofunctionalization). Previous studies [6, 7] have focused on utilizing sequence or expression data, while the expression profile of duplicate genes in gene-coexpression networks has been largely unexplored.

Here I investigate the divergence between pairs of duplicate genes using human tissue-specific microarray data. I follow the methods of the scale-free network model [8] to infer the characteristics of single- and multiple-copy genes. Several measures are often used to describe a network, including degrees, clustering coefficients, and the lengths of the shortest paths. The connectivity of a gene (referred to as its degree) in a coexpression network is regarded as an indication of its functionality. A high degree indicates an important role for a gene in the network. The clustering coefficient evaluates the links among a given gene's neighbors; its value lies between zero (no connections among the neighbors; a star-like graph) and one (every neighbor is connected to every other neighbor; a completely connected component).

A system that retains its connectivity after individual nodes and edges are removed is said to be robust. Two methods are commonly used to assess the robustness of a system: nodes are removed randomly (to assess error tolerance) or by the ranks of their degrees (to assess attack tolerance). The damage is estimated as the decrease in the size of the largest surviving cluster of connected nodes and the increase in the average shortest path between nodes. Randomly removing nodes and their edges does little harm to a robust system. However, removing nodes in the order of their connectivity destroys the network, which manifests as the disintegration of the network into many small clusters and as an increase in the average shortest path. These measures summarize the interactions of genes and yield a more realistic view of genome evolution and function.
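The measures described above can be illustrated with a minimal sketch, assuming an undirected network stored as a Python dictionary that maps each node to the set of its neighbors. The function names and the toy graph are hypothetical and are not taken from the analyses in this dissertation.

```python
from collections import deque


def degree(graph, node):
    """Number of neighbors (coexpressed partners) of a node."""
    return len(graph[node])


def clustering_coefficient(graph, node):
    """Fraction of possible links among a node's neighbors that exist."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # convention used here for nodes with fewer than two neighbors
    links = sum(1 for u in neighbors for v in neighbors
                if u < v and v in graph[u])
    return 2.0 * links / (k * (k - 1))


def largest_cluster_size(graph, removed=frozenset()):
    """Size of the largest connected cluster after deleting `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:  # breadth-first search over surviving nodes
            node = queue.popleft()
            size += 1
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        best = max(best, size)
    return best


# Toy network: hub "a" linked to b, c, d, and e, plus one link between b and c.
toy = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
    "e": {"a"},
}

print(degree(toy, "a"))                          # 4
print(largest_cluster_size(toy))                 # 5
print(largest_cluster_size(toy, removed={"e"}))  # 4 (error: a low-degree node)
print(largest_cluster_size(toy, removed={"a"}))  # 2 (attack: the hub)
```

In this toy case, deleting a peripheral node barely changes the largest cluster, whereas deleting the hub fragments the network, which is the contrast between error tolerance and attack tolerance.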

Overlapping reading frames in mammalian genomes

DNA encodes amino acids using codons. Because codons comprise three nucleotides, every stretch of DNA sequence can theoretically have three possible reading frames in each transcriptional direction. Each of the reading frames may be translated into polypeptides that contain different amino acid sequences. In general, one protein-coding gene has one reading frame that in turn gives rise to a single polypeptide. However, in organisms with restricted genome sizes (for example, viruses and some bacteria), overlapping reading frames occur at a high frequency [9, 10]. This phenomenon was not believed to be widespread in higher eukaryotes, but three genes are known for which the existence of alternative reading frames has been experimentally confirmed (Gnas, Xbp1, and Ink4a) [11, 12, 13]. Codependency between codons of overlapping protein-coding regions imposes a unique set of evolutionary constraints, making it a costly arrangement. Yet in cases of tightly coexpressed interacting proteins, dual coding may be advantageous: expression timing and level can be controlled in both with less regulatory overhead. Here I show that although dual coding is not likely to have arisen by chance, several human transcripts contain overlapping coding regions. Newly developed statistical techniques are used to identify evolutionarily conserved dual-coding regions. The results emphasize that the skepticism surrounding eukaryotic dual coding is unwarranted: rather than being artifacts, overlapping reading frames are often hallmarks of fascinating biology.
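To make the notion of reading frames concrete, the following minimal sketch splits a short DNA string into codons for the three offsets on each strand. The helper names and the example sequence are hypothetical; this is an illustration only and is not the method used in Chapter 3.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def codons_in_frame(seq, offset):
    """Complete codons of `seq` when reading starts at position `offset`."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq):
    """Codons for the three frames on each strand, keyed by (strand, offset)."""
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[(strand, offset)] = codons_in_frame(strand_seq, offset)
    return frames


frames = six_frames("ATGGCCTAA")
print(frames[("+", 0)])  # ['ATG', 'GCC', 'TAA']
print(frames[("+", 1)])  # ['TGG', 'CCT']
print(frames[("-", 0)])  # ['TTA', 'GGC', 'CAT']
```

A shift of one nucleotide changes every codon, so two overlapping frames read the same nucleotides as entirely different codon series.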

Whole genome expression analysis

Gene activities across the complete genome are measured to provide a global view of cellular function. Whole genome expression patterns (i.e., the transcriptome) can be detected and quantified using microarrays, Expressed Sequence Tags (ESTs) or new high-throughput sequencing technologies. Such analyses are often performed using samples from different developmental stages or cell types, or from samples subject to different infections or experimental treatments. Contrasting the results obtained from these samples allows researchers to identify differential expression patterns and to single out important genes or transcripts from a large set. Though model organisms such as yeast, worms, and mice have seemingly complete transcriptome atlases [14, 15, 16], many novel splice forms are yet to be detected.

1.2 Dissertation outline

This dissertation consists of three projects concerned with duplicate gene evolution, dual coding, and expression profiling. Each chapter discusses one of these projects and has its own introduction and summary. In chapter 2, I investigate the evolution of duplicate genes in a gene coexpression network to answer the question: how do duplicate genes diverge and influence the shape of the network? A coexpression network is built using tissue-specific microarray data. I present a model of the divergence of duplicate genes and analyze the connectivity of single- and multiple-copy genes in the network. In chapter 3, I explore the occurrences of nucleotide sequences containing two open reading frames. Novel approaches are implemented to detect evolutionarily conserved dual-coding genes in mammalian genomes. Advances in sequencing technologies provide cheaper, faster and higher throughput data. In chapter 4, I describe a stepwise procedure to analyze data from next-generation sequencing technology in order to detect novel splice forms. Chapter 5 draws conclusions from the results of these various projects and describes ideas for future directions.

Chapter 2

Rapid and asymmetric divergence of duplicate genes in the human gene coexpression network

Published as W.-Y. Chung, R. Albert, I. Albert, A. Nekrutenko, and K. Makova. Rapid and asymmetric divergence of duplicate genes in the human gene coexpression network. BMC Bioinformatics, 7(1):46, 2006.

2.1 Background

Approximately half of human genes are members of duplicate gene families [17], and such genes might play an important role in the robustness of organisms against mutations (reviewed in [18, 19, 20]). Do most duplicate genes retain the functions of their parental singleton gene? Or do they diverge after duplication? And how rapidly does this divergence occur? Several models predicting preservation of both duplicate gene copies (e.g., gene function conservation, subfunctionalization, neofunctionalization, and subneofunctionalization) have been proposed [21, 22, 23] and reviewed [24, 25]; however, their relative prevalence in the fates of duplicate genes is presently unknown. Originally, these questions were addressed by analysis of the protein-coding sequences of duplicate genes. Namely, the pattern of nonsynonymous vs. synonymous substitutions between duplicate genes was used to predict divergence in function (e.g., [25, 26, 27]). However, protein-coding sequences possess only partial information about gene evolution and function. The availability of genome-wide mRNA expression data allows one to study another important aspect of duplicate gene evolution, namely divergence in gene expression after duplication (e.g., [7, 28, 29, 30]).

The divergence of duplicate genes in gene coexpression networks, where connectivity is based on similarity in gene expression patterns (e.g., [31, 32, 33]), represents yet another facet of duplicate gene evolution that awaits detailed investigation. Gene coexpression networks as well as many other biological networks (e.g., metabolic and protein-protein interaction networks) were shown to be scale-free [31, 32, 34, 35]: the topology of these networks is dominated by a relatively small number of highly connected nodes, also called hubs [36, 37]. Scale-free networks were found to be tolerant against random removal of nodes, but particularly vulnerable to preferential removal of hubs [38].
Studies of the evolutionary origins of scale-free biological networks suggested that gene duplication can lead to both network growth and preferential attachment and can thereby result in a scale-free topology [37, 39, 40, 41]. Thus, duplicate genes are likely to be major players in the evolution of biological networks, and investigation of their divergence in these networks is of great importance. So far, the divergence of duplicate genes has only been examined in yeast transcriptional regulation and protein-protein interaction networks [25, 42, 43, 44, 45]; it has not been explored in networks of more complex organisms, e.g., mammals.

In mammals, where genome-wide transcriptional regulation and protein-protein interaction data are limited [46, 47], coexpression networks provide an alternative for investigation of duplicate gene divergence at the systems biology level (protein-protein interaction and transcription regulatory links are a subset of links in gene coexpression networks). Coexpression and functional relationship of genes are expected to be positively correlated. Indeed, clustering of mRNA expression data has been successfully used for grouping genes similar in function [48, 49], and a global correlation was found between gene expression and protein-protein interaction data [50, 51]. Additionally, thousands of coexpression connections between genes were found to be evolutionarily conserved among distant organisms [33], again suggesting a strong link between similar expression pattern and functional relatedness. However, some individual links between genes might not represent direct functional relationships due to the noisiness of microarray data and network transitivity.

In the present study we build a human gene coexpression network based on human tissue-specific microarray data.
We examine the divergence of duplicate genes in this network by addressing the following questions: (1) are duplicate genes or singletons represented more frequently among network hubs; (2) how rapidly do duplicate genes lose shared parental partners; (3) how quickly do they acquire new coexpressed partners; (4) is the divergence in the gene coexpression network symmetric or asymmetric between two duplicate genes in a pair; and (5) do duplicate genes and singletons play different roles in maintaining the robustness of this network.

2.2 Results and discussion

2.2.1 Description of the network

To build the gene coexpression network, we used the mRNA expression data that provide information about ∼45,000 transcripts assayed in 79 human tissues [52]. We mapped probe sets to genes (see Methods) and as a result obtained a data set with one-to-one probe set to gene correspondence. This data set consisted of 14,342 genes, including 261 tissue-specific and 3,460 ubiquitously expressed genes. Two genes (represented by nodes) were connected by an edge if (1) both of them were simultaneously expressed in at least T common tissues, and (2) the Pearson correlation coefficient of their logarithmically transformed (with base 2) expression values was greater than or equal to R [31]. Nine networks were constructed depending on the combination of T and R (T ≥ 5, T ≥ 7, or T ≥ 9; and R ≥ 0.5, R ≥ 0.7, or R ≥ 0.9). Here, in addition to the Pearson correlation coefficient, we used a threshold of the minimal number of common tissues in which both genes are expressed. Relying on the correlation coefficient alone could lead to non-biological artifacts, e.g., artifactual similarities based on non-expression or expression in a few tissues only. Thus, by adding this additional criterion we obtain a meaningful correlation coefficient as it is calculated from at least five data points and only for tissues in which both genes are expressed (AD > 200).

To characterize the global topology of these networks, we used several graph measures [8]. First, the average node degree ⟨k⟩ reflected the average number of genes expressed together with a given gene. Second, the average shortest path length ⟨d⟩ specified the average number of edges required to travel from one gene to any other gene. Third, the average clustering coefficient ⟨c⟩ measured the connectivity of the neighborhood of a gene.
With increases in T and R, the number of genes in the main cluster and the average number of genes coexpressed with a given gene decreased, while the average shortest path length increased (see Supplementary Materials on page 49).

Additionally, we investigated the node degree distribution P(k) describing the frequency of genes with k coexpressed genes. For five (out of nine) networks, this distribution approximated a power law distribution (Figure 2.1 and Supplementary Materials on page 51), a characteristic of scale-free networks [36]. We used the network with T ≥ 7 and R ≥ 0.7 for further examination, since for these thresholds the degree distribution had a power law tail (Figure 2.1) and we still retained a large number of genes for a statistical analysis (however, our main conclusions hold for all thresholds examined). This network contained 12,897 nodes (all located in the main cluster) with the average degree of 132.4 (Table 2.1). The density (the number of observed connections divided by the number of possible connections) of the present network is 0.0103, which is comparable to the value of 0.0057 obtained for a human gene coexpression network consisting of ∼9,000 genes and confirmed by at least three microarray data sets [49].

Figure 2.1. Degree distribution of the studied network (T ≥ 7 and R ≥ 0.7). Annotation in the plot: P(k) ∼ k^(−1.13).
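The two edge criteria described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the analyses: the input layout (a dictionary mapping each gene to a list of per-tissue expression values), the function names, and the toy values are hypothetical, and a pair with zero variance is simply treated as uncorrelated.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: constant profiles are treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def build_network(expression, min_tissues=7, min_r=0.7, expressed_above=200):
    """Return the set of edges (gene pairs) passing the T and R thresholds."""
    edges = set()
    for g1, g2 in combinations(sorted(expression), 2):
        v1, v2 = expression[g1], expression[g2]
        # Tissues in which both genes are expressed (signal above the cutoff).
        common = [i for i in range(len(v1))
                  if v1[i] > expressed_above and v2[i] > expressed_above]
        if len(common) < min_tissues:
            continue
        x = [math.log2(v1[i]) for i in common]
        y = [math.log2(v2[i]) for i in common]
        if pearson(x, y) >= min_r:
            edges.add((g1, g2))
    return edges


# Toy data with five tissues; a lower tissue threshold is used for illustration.
toy_expression = {
    "g1": [400, 800, 1600, 3200, 100],
    "g2": [300, 600, 1200, 2400, 5000],
    "g3": [3200, 1600, 800, 400, 900],
    "g4": [500, 90, 80, 1000, 70],
}
print(build_network(toy_expression, min_tissues=3))  # {('g1', 'g2')}
```

In the toy data, g1 and g3 share four expressed tissues but are anticorrelated, and g4 is expressed in too few tissues in common with any other gene, so only the g1-g2 edge is retained.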

Table 2.1. Description of the studied network and the differences between duplicate genes and singletons.

Gene categories    n        ⟨k⟩       ⟨d⟩     ⟨c⟩
All genes          12897    132.40    2.64    0.16
Duplicate genes    6507     120.11    n/a     0.14
Singletons         6390     144.92    n/a     0.17

n: number of nodes in the giant cluster; ⟨k⟩: average degree; ⟨d⟩: average shortest path length; ⟨c⟩: average clustering coefficient.
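As a quick consistency check on the reported values, the density follows from the average degree because each edge contributes to two degrees, and the average degree of all genes is the size-weighted mean of the two gene categories. The sketch below uses the node counts and the average degrees for all genes and duplicate genes from Table 2.1; the function names are hypothetical.

```python
def density_from_average_degree(n_nodes, avg_degree):
    """Density = E / (n(n-1)/2) with <k> = 2E/n, hence <k> / (n - 1)."""
    return avg_degree / (n_nodes - 1)


def implied_group_mean(n_all, mean_all, n_known, mean_known):
    """Mean of the remaining group, given the overall and one group mean."""
    n_other = n_all - n_known
    return (n_all * mean_all - n_known * mean_known) / n_other


density = density_from_average_degree(12897, 132.40)
singleton_mean = implied_group_mean(12897, 132.40, 6507, 120.11)

print(round(density, 4))         # 0.0103
print(round(singleton_mean, 1))  # 144.9
```

The first value agrees with the density of 0.0103 quoted above, and the second with the average singleton degree of 144.9 cited in Section 2.2.2.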

Interestingly, we found a complex relationship between clustering coefficient c and degree k: c increased steadily with increasing k for k ≲ 300, then it slowly decreased (Figure 2.2A and Supplementary Materials on page 54). The increasing relationship was more pronounced since it represented a larger sample of nodes. This implied that genes with a moderately high number of coexpressed genes usually had highly connected neighbors. This observation was unexpected, as other scale-free networks display either negative or no correlation between clustering coefficient and degree [37]. Initially we suspected that the relationship observed here could be explained by a large number of ubiquitously expressed genes that have a high probability of being clustered among themselves and with other genes. However, a largely positive correlation between c and k was observed for either ubiquitously expressed or non-ubiquitously expressed genes (Figures 2.2B and 2.2C), suggesting that this is a general property of the studied network.

[Figure 2.2: panels A, B, and C; x-axis: degree (k); y-axis: clustering coefficient (c)]

Figure 2.2. The relationship between clustering coefficient c and node degree k for (A) all genes, (B) ubiquitously expressed genes, and (C) nonubiquitously expressed genes. Each point represents an average value for 100 genes.

2.2.2 Differences between duplicate genes and singletons

A total of 11,512 duplicate genes were identified among 22,103 Ensembl (NCBI build 34) known and novel proteins (see Methods). The studied network consisted of 6,390 singletons and 6,507 duplicate genes. Interestingly, while genes in the two categories had similar degree distributions (see Supplementary Materials on page 56), duplicate genes had a lower average node degree and thus were less connected in the network than singletons, although the difference was small (120.1 vs. 144.9; Table 2.1; t = -7.74; P < 0.001 as assessed by permutation test). Ranking the nodes by degree indicated that among highly connected genes the proportion of singletons was higher than the proportion of duplicates (Figure 2.3). Thus, the effect of increased copy number might be more severe for genes with numerous coexpressed partners in the network and, as a result, duplications of such genes might have a lower propensity to become fixed in a population as compared with duplications of genes with few connections [42]. Interestingly, the average clustering coefficient was significantly lower for duplicates than for singletons (Table 2.1; t = -9.86; P < 0.001 as assessed by permutation test). This suggests a lower likelihood of duplication fixation for a gene that has a tightly connected neighborhood.

In this network, the percentage of duplicate gene pairs with at least one Gene Ontology [53] term overlap was higher among pairs connected by a link vs. unconnected pairs (97% vs. 86%). This is a much higher percentage than that observed for singletons either linked (22%) or unlinked (15%). Thus, duplicate genes, especially if they are linked, had greater functional similarity. Among the strongest hubs for duplicates, the proteins participating in nucleotide, nucleic acid, ATP, and protein binding were overrepresented (determined from the Gene Ontology terms). In addition to these categories, mitochondrion, signal transduction, and membrane proteins were overrepresented among singleton hubs. Interestingly, similarly to duplicate hubs, duplicates with the lowest number of links were involved in protein and ATP binding. However, singletons with a small number of links had different functions: e.g., receptor, transcription factor and transcription regulation activity.
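The comparisons between duplicates and singletons above are assessed by permutation tests. The general idea can be sketched as below, assuming a two-sided test on the difference in group means with labels reshuffled at random. This is a generic illustration with hypothetical names and toy data; the test statistic, number of permutations, and other details of the actual procedure are given in Methods and may differ.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_perm=1000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (hits + 1) / (n_perm + 1)


# Toy degrees: clearly separated groups versus identical groups.
low = list(range(0, 10))
high = list(range(10, 20))
p_separated = permutation_test(high, low)
p_identical = permutation_test([1, 1, 1], [1, 1, 1])
print(p_separated < 0.05)  # True
print(p_identical)         # 1.0
```

With 1000 permutations the smallest attainable p-value is 1/1001, which is why results of this kind are reported as P < 0.001 rather than as an exact value.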

2.2.3 Duplicate genes rapidly lose shared coexpressed partners

We investigated the dynamics of loss and gain of coexpressed partners (genes expressed together with a given gene, henceforth called partners) between two duplicate genes constituting a pair. We denoted the number of partners of one and the other duplicate gene in a pair as n1 and n2, respectively, and the number of partners shared between the two duplicate genes as n12 (Figure 2.4). We assumed that immediately after duplication, each duplicate gene was expressed together with n1 = n2 = n12 other genes (Figure 2.4B). With time, duplicate genes lose shared partners and acquire new ones. Here we assume that shared partners in a duplicate pair are inherited from a parental gene.

[Figure 2.3: axis annotations "more connections" and "less connections"]

Figure 2.3. The number of duplicate genes and singletons in every 500 genes ranked by degree. Duplicate genes are marked by triangles and singletons are marked by circles. The genes with the highest degree are shown at the left side of the figure.

[Figure 2.4: panel A, nS; panel B, n1 = n2 = n12 = nS; panel C, n1 + n2 − n12 < nS; panel D, n1 + n2 − n12 ≈ nS; panel E, n1 + n2 − n12 > nS]

Figure 2.4. The schematic representation of duplicate gene evolution (A) prior to duplication event, (B) immediately after duplication, (C, D, E) after some time following gene duplication. The ancestral singleton gene is shown with a crossed line, duplicate genes are in black, shared ancestral partners are in grey, unique ancestral partners are in stripes, and unique acquired partners are in white; ns, n1 and n2 are the numbers of partners for a singleton, first duplicate, and second duplicate, respectively; n12 is the number of shared partners for a duplicate gene pair.

We discovered that duplicate genes lose shared partners rapidly with evolutionary time. We calculated the fraction of shared partners among all partners for each duplicate gene pair, n12/(n1 + n2 − n12), and used the synonymous rate per site, KS, as a proxy of evolutionary time since gene duplication (Figure 2.5). For this analysis we used the 698 independent duplicate gene pairs (i.e., each gene was present only once in this data set, see Methods) for which both genes were present in the network and KS was less than 2. Our initial observation was that the fraction of shared partners for duplicate genes within each pair was usually low: it was <20% for 666 out of 698 duplicate pairs studied (the highest fraction of shared partners for a duplicate gene pair was 68%). A significant negative correlation was observed between n12/(n1 + n2 − n12) and KS (R = -0.66, P < 0.003). The fraction of shared partners for a duplicate pair was on average 6.6% after only ∼50 million years (MY) since duplication (this corresponds to KS = 0.13 and requires an assumption that human and Old World monkeys diverged ∼25 MY ago and the sequence divergence between them is ∼7% [54]). At KS ≈ 2, this fraction approached 1.9%, and the partners for the two duplicate genes in a pair were as different as those for a pair of unrelated singletons (Figure 2.5).
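The fraction n12/(n1 + n2 − n12) is the number of shared partners divided by the size of the union of the two partner sets. The sketch below computes it for a toy pair and also shows one way to read the time calibration quoted above, under the assumption that both KS between duplicates and the divergence between species accumulate along two lineages at the same rate. The function names and toy partner sets are hypothetical.

```python
def shared_partner_fraction(partners_1, partners_2):
    """n12 / (n1 + n2 - n12), with partners given as Python sets."""
    shared = partners_1 & partners_2
    union = partners_1 | partners_2  # |union| = n1 + n2 - n12
    if not union:
        return 0.0  # convention used here for a pair without any partners
    return len(shared) / len(union)


def approximate_age_my(ks, divergence=0.07, split_my=25.0):
    """Rough time since duplication, assuming a constant substitution rate.

    If species that split `split_my` MY ago differ by `divergence`, then a
    pairwise distance of `ks` corresponds to ks / divergence * split_my MY.
    """
    return ks / divergence * split_my


first = {"g1", "g2", "g3", "g4"}   # n1 = 4
second = {"g3", "g4", "g5"}        # n2 = 3, n12 = 2
print(shared_partner_fraction(first, second))  # 0.4
print(round(approximate_age_my(0.13), 1))      # 46.4
```

Under this reading, KS = 0.13 corresponds to roughly 46 MY, which is of the same order as the ∼50 MY quoted in the text.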

Several factors could have affected our results. At low KS, the fraction of shared partners could have been underestimated because the youngest duplicate genes were excluded from the analysis due to the lack of unique microarray probes (see Methods). At high KS, some of the shared partners could have been acquired independently (i.e., convergently) by each gene in a duplicate pair and not inherited from a parental gene. And finally, our assumption of identical expression profiles for two daughter duplicate genes immediately after duplication might not be valid in all cases. Indeed, sometimes the duplication unit is known to partially or completely exclude the promoter of the parental gene and hence the two daughter genes might substantially differ in their expression [55].

Figure 2.5. The change in the fraction of shared partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. Dashed line indicates the fraction of shared partners averaged among 1000 randomly selected pairs of singletons (random selection process was repeated 1000 times).
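The fraction of shared partners used in this section, n12/(n1 + n2 − n12), is the size of the intersection of the two partner sets divided by the size of their union. The following is a minimal sketch of that calculation; the function name and the toy gene identifiers are illustrative and are not taken from the original Perl and C programs.

```python
def shared_partner_fraction(partners_1, partners_2):
    """Return n12 / (n1 + n2 - n12) for two sets of coexpressed partners."""
    n1, n2 = len(partners_1), len(partners_2)
    n12 = len(partners_1 & partners_2)      # partners shared by both duplicates
    total = n1 + n2 - n12                   # all distinct partners of the pair
    return n12 / total if total else 0.0    # a pair without partners shares nothing


# Toy example (made-up identifiers): 4 and 5 partners, 2 of them shared.
gene_1 = {"g1", "g2", "g3", "g4"}
gene_2 = {"g3", "g4", "g5", "g6", "g7"}
print(shared_partner_fraction(gene_1, gene_2))
```

Returning 0.0 for a pair with no partners at all is a convenience choice made here so that the function is defined for every pair.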

2.2.4 Acquisition of new coexpressed partners by duplicate genes

We analyzed the same set of 698 independent duplicate gene pairs and explored the change in the total number of partners of each duplicate gene pair (as indicated by n1 + n2 − n12) with evolutionary time. Here we assumed that the average degree of a parental singleton gene before duplication is equal to the average degree of a singleton in the contemporary network. According to this assumption, immediately after gene duplication, the total number of partners of a duplicate gene pair is equal to that of a parental singleton gene, i.e., n1 + n2 − n12 = ns (Figure 2.4B). Following duplication, as the two genes diverge in their expression profiles, their partners can be classified into three groups (Figure 2.4): (1) partners inherited from a parental singleton gene and still shared between the two genes in a pair (shared ancestral partners); (2) partners inherited from the parental singleton gene but present now only in one of the two duplicates (unique ancestral partners); and (3) new partners acquired independently by one of the duplicates (unique acquired partners). The present study does not allow us to differentiate between unique ancestral and unique acquired partners directly, but we can make indirect inferences about their relative numbers. In our data set, shared partners constitute a small fraction among the partners of a duplicate gene pair: on average lower than 6.6% (see above). If, on average, following duplication, n1 + n2 − n12 < ns (Figure 2.4C), this indicates loss of ancestral partners by duplicate gene pairs. If n1 + n2 − n12 ≈ ns (Figure 2.4D), this can be explained by the presence of a small fraction of shared partners and a large fraction of unique ancestral partners, i.e., all of the original partners of parental genes might still be retained by a duplicate pair (although some of the partners for a duplicate gene pair might be unique acquired partners, in this case such acquisition is compensated by a loss of ancestral partners). If, however, n1 + n2 − n12 > ns (Figure 2.4E), the excess of the average number of partners for duplicate pairs over that for singletons can be explained by acquisition of new partners.

The average number of partners for duplicate gene pairs in our data set was significantly (∼57%) greater than that for singleton genes (227.9 vs. 144.9; t = 10.57; P < 0.001, significance assessed by permutation test), suggesting acquisition of new partners by duplicate gene pairs. This suggests that on average more than one third of partners of a duplicate gene pair were acquired after duplication and not inherited from a parental singleton gene. Such gain of new partners was rapid: even at low KS (KS = 0.13 for the youngest 40 gene pairs), members of a duplicate gene pair were already expressed together with 291.4 genes (on average), while a singleton gene was expressed together with 144.9 genes (on average). Depending on KS, we observed some variation (and a statistically non-significant decline) in the average total number of partners for a duplicate gene pair (Figure 2.6). Importantly, at any time point examined (0.13 < KS < 2), n1 + n2 − n12 was greater than ns.
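The two percentages quoted above follow from the reported averages (227.9 partners per duplicate pair and 144.9 per singleton). The first ratio is the excess relative to a singleton. The second is the share of a pair's partners that exceeds the assumed ancestral number; it is a lower bound on the acquired fraction only under the assumption, stated earlier in this section, that the parental gene had the degree of a contemporary singleton.

```latex
\frac{227.9 - 144.9}{144.9} = \frac{83.0}{144.9} \approx 0.57,
\qquad
\frac{227.9 - 144.9}{227.9} = \frac{83.0}{227.9} \approx 0.36 > \frac{1}{3}.
```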

2.2.5 Asymmetric expression divergence of duplicate genes

Two duplicate genes in a pair usually diverged asymmetrically in the network and this asymmetry was acquired quickly after duplication. For 1,547 independent duplicate gene pairs with KS < 2 (including pairs with only one copy in the network) we drew a scatter plot with the numbers of partners for the two duplicate genes as the X and Y coordinates (Figure 2.7A). The assignment of a duplicate gene from each pair to either X or Y was random. Note that the plot predominantly reflected unique partners, since the proportion of shared partners was low (see above). Our simulations showed that if the divergence in gene expression were symmetric, we would expect a positive correlation between the numbers of partners for two duplicate genes in a pair (Figure 2.7B). However, in reality, we found a negative correlation (Figure 2.7A, Spearman's rank correlation coefficient rs = −0.19, P < 0.001), indicating that the two duplicate genes usually had different numbers of partners. Interestingly, 849 out of 1,547 duplicate gene pairs were located on either the horizontal or vertical axis, indicating that one gene had some partners in the network while the other one had none. Additionally, we observed that the difference in degree between duplicate genes in a pair was not related to KS (Figure 2.7C). Thus, the asymmetry in expression divergence was established early and was maintained throughout the evolutionary time examined.

Figure 2.6. The change in the total number of coexpressed partners with evolutionary time (measured by KS). Each point represents an average value for 40 duplicate gene pairs. The lower dashed line is the average number of partners for a singleton and the upper dashed line is twice the average number of partners for a singleton.

[Figure 2.7 panels: (A) rs = −0.189, P < 0.001; (B) rs = 0.975, P < 0.001; axes in A and B: degree of gene 1 and degree of gene 2; (C)]

Figure 2.7. Asymmetric divergence in gene expression. (A) Plot of degree of one gene versus degree of another gene for 1,547 duplicate gene pairs with KS < 2 (inset shows pairs with both degrees below 200). (B) The same plot after numerical simulation of symmetric divergence with equal probability of loss and gain of coexpressed partners (P = 0.5). (C) The relationship between the difference in degree and time since duplication (measured by KS) for a pair of duplicate genes. Each point represents an average value for 40 duplicate gene pairs.

2.2.6 Robustness of the network

To study the role of duplicate genes vs. singletons in the robustness of this coexpression network, we computationally perturbed the network by random removal of nodes (error) and degree-based removal of nodes (attack, or the preferential removal of the most highly connected genes [38]). Error and attack were performed separately on three categories of genes: singletons, duplicate genes, and all genes taken together. Thus, a total of six experiments were performed. In each experiment, we removed nodes in 10%, 20%, 30%, 40%, and 50% increments calculated from the total number of nodes in the network. The relative size S (the fraction of nodes in the giant connected cluster after node removal) and the average path length ⟨d⟩ of the largest connected cluster were measured at each increment (Figure 2.8). A decrease in S and an increase in ⟨d⟩ indicate network breakdown.

Random removal of duplicate genes, singletons, or all genes had minimal effect on the network (Figures 2.8A and 2.8B). The size of the main cluster did not decrease beyond the reduction expected due to node removal and the average path length remained approximately constant, indicating that most unremoved nodes stayed connected. This error tolerance was expected due to the high connectivity and scale-free nature of the network [38] and to the similarity in degree distribution of the three categories of genes (Figures 2.1 and 2.4).

The network also appeared to be resilient to the degree-based removal of either singletons or duplicate genes. Attack on singletons was expected to lead to a faster crash of the network than attack on duplicate genes because of the higher average degree of singletons (Table 2.1) and their higher proportion among hubs (Figure 2.3). At the 40% and 50% increments, the degree-based removal of duplicate genes indeed yielded a slightly smaller average path length (⟨d⟩dup = 2.92 vs. ⟨d⟩sin = 2.96 at 40%; ⟨d⟩dup = 2.79 vs. ⟨d⟩sin = 2.86 at 50%; Figure 2.8D), thus providing marginal support for this expectation. However, contrary to the expectation, attack on singletons and attack on duplicate genes led to similarly minimal decreases in the relative sizes of the main cluster (Figure 2.8C).

In contrast with attack on either duplicate genes or singletons, attack on all genes severely damaged the network (Figures 2.8C and 2.8D). After 50% of all genes were removed by attack, the network broke down into many small clusters and as a result the relative size of the main cluster was only 0.30 as compared with 0.50 after error (Figures 2.8A and 2.8B). Additionally, this led to an approximately two-fold increase in the average shortest path length as compared with the effect of error (⟨d⟩error = 2.66 vs. ⟨d⟩attack = 4.91, both after removal of 50% of nodes; Figures 2.8A and 2.8D).

Why did the network break down so rapidly after we attacked duplicate genes and singletons combined? We hypothesized that this could be due to the removal of a large number of hubs and edges. Indeed, although the same number of genes was removed in each experiment, more strong hubs were removed by attack on all genes (e.g., 20% of the strongest hubs for the 20% increment; Figure 2.8D) than on either duplicates or singletons separately (e.g., only ∼10% of the strongest hubs in each case for the 20% increment; Figure 2.8D). Similarly, more edges were eliminated by attack on all genes (e.g., ∼84% of edges for the 20% increment; inset in Figure 2.8D) than on either duplicates or singletons (∼62% and ∼68%, respectively, for the 20% increment; inset in Figure 2.8D).

[Figure 2.8 panels: (A, B) error; (C, D) attack; y-axes: relative size of the main cluster (S) and avg. shortest path length (⟨d⟩); x-axis: increments of nodes removed]

Figure 2.8. The results of in silico perturbations of the network. The effect of random removal of genes (error) on (A) the relative size of a giant cluster and (B) the average shortest path length. The effect of degree-based removal of genes (attack) on (C) the relative size of a giant cluster and (D) the average shortest path length (inset shows the fraction of edges removed). Singletons are marked by circles, duplicate genes by triangles, and all genes by squares.
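The error and attack experiments described in this section can be illustrated on a toy graph. The sketch below is not the original implementation: the function names, the dictionary-of-sets graph representation, and the five-node example network are assumptions made for illustration. It removes a given number of nodes from a candidate category either at random (error) or in order of decreasing degree (attack), then reports S, normalised by the original number of nodes, and ⟨d⟩ within the giant cluster.

```python
import random
from collections import deque


def giant_component(adj):
    """Return the node set of the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in comp:
                    comp.add(v)
                    queue.append(v)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def average_path_length(adj, nodes):
    """Mean breadth-first distance over all pairs of distinct nodes in `nodes`."""
    total, pairs = 0, 0
    for source in nodes:
        dist, queue = {source: 0}, deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def perturb(adj, candidates, n_remove, mode, rng=random):
    """Remove n_remove candidates at random ('error') or by degree ('attack')."""
    if mode == "attack":
        ranked = sorted(candidates, key=lambda g: len(adj[g]), reverse=True)
        victims = ranked[:n_remove]
    else:
        victims = rng.sample(sorted(candidates), n_remove)
    gone = set(victims)
    pruned = {u: nbrs - gone for u, nbrs in adj.items() if u not in gone}
    giant = giant_component(pruned)
    return len(giant) / len(adj), average_path_length(pruned, giant)


# Toy network (made up): one hub "h" with four neighbours, plus an edge a-b.
toy = {"h": {"a", "b", "c", "d"}, "a": {"h", "b"}, "b": {"h", "a"},
       "c": {"h"}, "d": {"h"}}
print(perturb(toy, toy.keys(), 1, "attack"))   # the hub is removed
print(perturb(toy, ["c"], 1, "error"))         # a peripheral node is removed
```

Restricting `candidates` to singletons, to duplicate genes, or to all genes reproduces the three categories of the six experiments; ties in degree are broken by input order here, which is an arbitrary choice.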

2.3 Conclusion

The analysis of duplicate genes in the human gene coexpression network allowed us to draw the following conclusions. First, in agreement with analysis of yeast duplicate genes (e.g., [7, 29, 44, 45]), our observations suggest that human duplicate genes quickly lose similarity in gene expression profiles. As a result, except for immediately after duplication, they cannot be considered redundant parts in the network. This might explain why the network was similarly tolerant to attack and error on either duplicate genes or singletons: rapidly after duplication, duplicate genes diverge in their expression profiles, reaching a level of similarity only slightly higher than that observed for random singletons. Since only 79 tissues were examined in our study, we cannot exclude the possibility that some additional links among duplicate genes could be revealed in other tissues and/or under different physiological conditions.

Second, the acquisition of new coexpressed partners was found to be rapid and to play a prominent role in the evolution of duplicate genes. This process might lead to attainment of new functions and operate as an engine creating diversity at the phenotypic level that is vital for adaptation. The importance of addition of new interactions was also pointed out in the yeast transcriptional regulation network, although there a net loss of interactions was observed [45]. The proportion of unique acquired partners is difficult to determine precisely in the present study, because we cannot directly distinguish between unique ancestral and unique acquired partners. For instance, as mentioned above, acquisition of new partners can be compensated by loss of ancestral partners and in this way will not be reflected in the total number of partners for a duplicate pair. In the future, experiments using expression information for an outgroup should allow one to differentiate between the two classes of unique partners and to provide more precise estimates of their numbers.

Although more evidence is necessary, asymmetry might represent a common scenario in the functional divergence of duplicate genes. Yeast, fruit fly, nematode, and human duplicate genes exhibit asymmetric divergence at the amino acid level [27, 56, 57]. Additionally, asymmetric divergence of protein-protein interactions was found for yeast duplicate genes [43].

In summary, this study provides an example of the use of duplicate genes to investigate the evolution of a gene coexpression network. By investigating the divergence of duplicate genes in the network, the speed and pattern of divergence within a network can be assessed. An alternative approach is to compare orthologous genes in the networks of different organisms [31]. Unlike the analysis of orthologs, where divergence is determined by speciation time, utilization of paralogs provides an opportunity to inspect a range of divergence, since duplications usually occur at various times in the evolution of a lineage (unless a duplication is due to a polyploidization event). A paralogous approach has an additional advantage of requiring sequence and expression information from just one genome. Utilization of paralogous and orthologous information is expected to provide complementary information and bring us closer to understanding network evolution.

2.4 Methods

2.4.1 Network construction

To map probe sets to genes, the exemplar and consensus sequences for the U133A and GNF1H arrays [52] were used as queries to search against the longest transcripts of known and novel genes retrieved from Ensembl (NCBI build 34) using BLAST [58] with $E = 10^{-20}$. The criteria of acceptable alignments were as described in [59]. Briefly, the alignment was accepted if (1) the identity was higher than 94% and the length was greater than either 99 bp or 90% of the length of the query, or (2) the identity was 100% and the length was greater than 49 bp. There were three cases: (1) a single probe set hit a single gene (9,381 genes); (2) multiple probe sets hit a single gene (4,961 genes and 13,071 probe sets); (3) a single probe set hit multiple genes (18,718 genes and 4,377 probe sets). All genes and probe sets in case 1 were considered. In case 2, the probe set with the highest expression value (measured by average difference, or AD) was selected, similar to [30]. All genes and probe sets from case 3 were deleted due to potential cross-hybridization. As a result, we obtained a data set of 14,342 genes and probe sets with one-to-one correspondence. Another data set (9,056 genes) represented a subset of the previous one from which the probe sets with suboptimal design (with s, x, r, i, f, g suffixes) were excluded. However, only the original data set is discussed henceforth since it contained a larger number of genes and the results obtained from the two data sets were similar.

Following [52], genes with AD > 200 in a particular tissue were considered to be expressed in this tissue. The AD values were logarithmically transformed (with base equal to 2). Tissue-specific genes were defined as those expressed only in one tissue, and ubiquitously expressed genes were defined as those expressed in at least 78 out of 79 tissues. A series of Perl and C programs were written to conduct the study.
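The alignment acceptance rules and the three probe-set cases described above can be sketched as follows. This is an illustration rather than the original code: the function names and the toy hits are hypothetical, "greater than either 99 bp or 90% of the query" is read here as a logical OR, case 3 is read as discarding every gene hit by a multi-gene probe set, and `expression` stands for whatever per-probe-set summary of AD was used to rank probe sets, which the text does not specify.

```python
from collections import defaultdict


def accept_alignment(identity, aln_len, query_len):
    """Acceptance rules from the text; identity in percent, lengths in bp."""
    rule_1 = identity > 94.0 and (aln_len > 99 or aln_len > 0.9 * query_len)
    rule_2 = identity == 100.0 and aln_len > 49
    return rule_1 or rule_2


def one_to_one_map(hits, expression):
    """Reduce (probe_set, gene) hits to one probe set per gene.

    hits: iterable of (probe_set, gene) pairs from accepted alignments.
    expression: dict probe_set -> expression summary (hypothetical input).
    """
    genes_of = defaultdict(set)
    for probe, gene in hits:
        genes_of[probe].add(gene)

    # Case 3: a probe set hitting several genes is dropped with those genes.
    dropped = set()
    for genes in genes_of.values():
        if len(genes) > 1:
            dropped |= genes

    probes_of = defaultdict(list)
    for probe, genes in genes_of.items():
        if len(genes) == 1:
            gene = next(iter(genes))
            if gene not in dropped:
                probes_of[gene].append(probe)

    # Case 1 keeps the only probe set; case 2 keeps the most highly expressed.
    return {gene: max(probes, key=lambda p: expression[p])
            for gene, probes in probes_of.items()}


# Toy hits (made up): g1 is case 1, g2 is case 2, g3 and g4 are case 3.
hits = [("p1", "g1"), ("p2", "g2"), ("p3", "g2"), ("p4", "g3"), ("p4", "g4")]
expression = {"p1": 10.0, "p2": 5.0, "p3": 8.0, "p4": 7.0}
mapping = one_to_one_map(hits, expression)
print(mapping)
```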
Two nodes were connected if they both were expressed in at least T common tissues with the Pearson correlation coefficient (calculated among these T common tissues) greater than R. We used an adjacency matrix A to store the topology of the network. The matrix stored binary symbols: "aij = 1" indicated the existence of an edge between nodes i and j and "aij = 0" indicated its absence. The matrix was symmetric with the diagonal equal to zero because the direction of edges was not inferred and self-loops were not allowed (simple graph). We focused on the genes located in the main (giant) cluster, in which every gene was connected to every other gene by at least one path. Genes that formed small and isolated clusters were regarded as outside of the main network. The clustering coefficient, c, of a node i with ki neighbors is defined as the ratio between the number of edges among the nodes adjacent to i and the maximum possible number of such edges, ki(ki − 1)/2 [60]. The clustering coefficient approaches 1 if the neighbors of a node are connected to each other. The shortest path length was calculated according to the Floyd-Warshall all-pairs shortest-paths algorithm [61].

Transitivity is a property of networks that are based on correlation coefficients of gene expression values [31]. If gene A is correlated in expression with gene B and gene B is correlated in expression with gene C, then gene A might be correlated with gene C. However, as mentioned by Jordan et al. [31], the level of such transitive correlation is unknown. In the network investigated in the present study, a link was defined by two parameters: the number of tissues in which the two genes are expressed and the correlation coefficient of expression values. This led to decreased transitivity of the network. Indeed, the average clustering coefficient (c) of the network is only 0.16 (under high transitivity c is expected to approach 1). This can be explained by the following example.
Let gene A be expressed in 20 tissues with expression values $\{e^A_1, e^A_2, \ldots, e^A_{20}\}$, gene B be expressed in the first 10 of these 20 tissues with expression values $\{e^B_1, e^B_2, \ldots, e^B_{10}\}$, and gene C be expressed in the other 10 of these 20 tissues with expression values $\{e^C_{11}, e^C_{12}, \ldots, e^C_{20}\}$. If the correlation coefficient between $\{e^A_1, e^A_2, \ldots, e^A_{10}\}$ and $\{e^B_1, e^B_2, \ldots, e^B_{10}\}$ is higher than 0.7, then genes A and B form a link. Similarly, if the correlation coefficient between $\{e^A_{11}, e^A_{12}, \ldots, e^A_{20}\}$ and $\{e^C_{11}, e^C_{12}, \ldots, e^C_{20}\}$ is higher than 0.7, genes A and C form a link. However, in this example genes B and C do not form a link because they are expressed in different tissues. Thus, transitivity is not an inherent feature of all genes in the network.

The 79 tissues studied by Su et al. [52] include six tissues that overlap with several other tissues in the data set. Each of these six tissues represents a more inclusive set of cells (usually an organ, e.g., the whole brain) as compared with its parts also present in the data (e.g., parts of brain). When we generated a separate network without these six tissues, only approximately 88% of connections were the same between this network and the original one. Moreover, 84% of connections stayed constant when we removed any six tissues at random (repeated 10 times). This indicates that each tissue possessed unique expression information, so we did not exclude any of them from the study.
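The link rule and the non-transitivity example above can be reproduced with a few lines of code. The sketch below is illustrative only: the function names are hypothetical, the expression values are invented so that the correlations are trivially high, and T = 10 is an assumed value for the tissue threshold (the threshold R = 0.7 is the one used in the example).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:          # constant profile: no defined correlation
        return 0.0
    return sxy / sqrt(sxx * syy)


def linked(expr_1, expr_2, min_tissues, min_r):
    """Edge rule: at least min_tissues common tissues and correlation > min_r.

    expr_*: dict tissue -> log2 expression, holding only the tissues in which
    the gene is considered expressed.
    """
    common = sorted(expr_1.keys() & expr_2.keys())
    if len(common) < min_tissues:
        return False
    r = pearson([expr_1[t] for t in common], [expr_2[t] for t in common])
    return r > min_r


# Gene A is expressed in tissues 1-20, gene B in 1-10, gene C in 11-20.
gene_a = {t: float(t) for t in range(1, 21)}
gene_b = {t: 2.0 * t + 1.0 for t in range(1, 11)}
gene_c = {t: t + 0.5 for t in range(11, 21)}

T, R = 10, 0.7  # T is an assumed value; R follows the example in the text
print(linked(gene_a, gene_b, T, R), linked(gene_a, gene_c, T, R),
      linked(gene_b, gene_c, T, R))
```

Genes B and C fail the rule at the first step because they have no tissue in common, regardless of how well each correlates with gene A.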

2.4.2 Identification of duplicate genes

Duplicate genes among 22,291 protein sequences of known and novel genes in Ensembl (NCBI build 34) were identified according to Gu et al. [6]. Briefly, each protein was used as a query to search against all other proteins using FASTA [3] with E = 10. The alignments were retained if: (1) the alignment length (L) was over 80% of the longer sequence, and (2) the identity (I) was ≥ 0.3 if L was over 150 amino acids, or $I \geq 0.06 + 4.8\,L^{-0.32(1 + \exp(-L/1000))}$ otherwise. We deleted proteins if they formed a hit due to the presence of a repetitive element of the same family. A single-linkage clustering algorithm was carried out to assemble duplicate genes into families. As a result, 11,512 duplicate genes were assigned to 2,865 families. The complete protein-coding gene sequences were re-aligned using CLUSTALW [2]. The synonymous and nonsynonymous substitution rates per site (KS and KA, respectively) were calculated using the YN00 module [62] of PAML [63] implemented in PERL.

To identify the set of independent duplicate gene pairs we proceeded as follows. First, within each gene family we sorted gene pairs by KS in ascending order and selected the pair with the lowest KS (pairs with KS < 0.05 were excluded). Next, within each gene family, we selected other independent pairs (with genes that had not yet been selected) sequentially with increasing KS. This resulted in 4,997 independent gene pairs. Only independent duplicate gene pairs were considered for the analysis of divergence of duplicate genes; however, all duplicate genes were considered for the analysis of robustness and of the differences between duplicate genes and singletons.
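Two steps of this procedure lend themselves to a short illustration: the length-dependent identity cut-off and the greedy selection of independent pairs within a family. The sketch below is a reading of the description above, not the original code; the function names and the toy family are hypothetical.

```python
from math import exp


def min_identity(aln_len):
    """Identity cut-off for an alignment of aln_len amino acids."""
    if aln_len > 150:
        return 0.30
    return 0.06 + 4.8 * aln_len ** (-0.32 * (1 + exp(-aln_len / 1000)))


def independent_pairs(pairs, min_ks=0.05):
    """Greedy selection of independent pairs within one gene family.

    pairs: iterable of (gene_a, gene_b, ks). Pairs are visited in order of
    increasing KS; a pair is kept only if neither gene has been used yet.
    """
    used, chosen = set(), []
    for gene_a, gene_b, ks in sorted(pairs, key=lambda p: p[2]):
        if ks < min_ks or gene_a in used or gene_b in used:
            continue
        chosen.append((gene_a, gene_b, ks))
        used.update((gene_a, gene_b))
    return chosen


# Toy family of four genes (made-up KS values).
family = [("a", "b", 0.02), ("a", "c", 0.3), ("b", "c", 0.4),
          ("b", "d", 0.6), ("c", "d", 0.9)]
print(independent_pairs(family))
print(round(min_identity(50), 3), round(min_identity(100), 3), min_identity(200))
```

The cut-off is stricter for short alignments and approaches 0.30 as L approaches 150, so the two branches of the rule join almost continuously.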

2.4.3 Permutation tests

The permutation test was used to assess the statistical significance of the difference in the network measures between duplicate genes and singletons. First, we removed the original labels and randomly relabeled genes, keeping the numbers of genes of the two categories consistent with the original data set. Second, we calculated the 2-sample t-statistic [64]. Third, we repeated the process 1000 times and built a null distribution of the t-statistic. Finally, we compared the observed t value (true labeling) with its null distribution to determine the P value.
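The four steps above map directly onto a short routine. The following sketch is a generic label-permutation test and is not the original program: the function names and the toy degree values are made up, the unpooled (Welch) form of the t statistic is an assumption because the text does not say which form was used, and the P value is computed two-sided, which is also an assumption.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(x, y):
    """Two-sample t statistic with unpooled variances (assumed form)."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Return the observed t and a two-sided permutation P value."""
    rng = random.Random(seed)
    observed = t_statistic(x, y)
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                       # random relabeling
        t = t_statistic(pooled[:len(x)], pooled[len(x):])
        if abs(t) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm


# Toy degrees (made up) for two clearly different groups of genes.
group_1 = [10, 12, 11, 13, 12, 14, 11, 13]
group_2 = [1, 2, 3, 2, 1, 3, 2, 2]
observed_t, p_value = permutation_p_value(group_1, group_2)
print(observed_t, p_value)
```

With 1000 permutations the smallest non-zero P value that can be resolved is 0.001, which is consistent with results being reported as P < 0.001.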

2.4.4 Asymmetry analysis

To simulate symmetric divergence in gene expression, we followed the method developed by Wagner [43]. Briefly, the number of lost partners was randomly generated from the binomial distribution B(n1 + n2 − 2n12, P). We tested three possible scenarios: divergence by loss of function (P = 0), divergence by gain of function (P = 1), and equal probability of loss and gain of function (P = 0.5). To test each of the three scenarios we proceeded as follows. The ancestral number of partners for each pair was approximated by the sum of the current number of shared partners (n12) and of a random number (nl) generated from the binomial distribution B(n1 + n2 − 2n12, P). The number of gained partners (ng) after duplication was approximated by n1 + n2 − 2n12 − nl. Within each duplicate gene pair, the lost connections were assessed by nl1 ∼ B(nl, 0.5) and nl2 = nl − nl1 for the first and second duplicate copies, respectively. The reconstructed numbers of coexpression partners from the symmetric divergence model were (n12 + nl) − nl1 + ng1 and (n12 + nl) − (nl − nl1) + (ng − ng1) for the first and second gene copies, respectively.
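The reconstruction above can be written out step by step. The sketch below follows the formulas as given; it is not the original simulation code, and its names are hypothetical. One detail is an assumption: the text does not state how ng1 is drawn, so it is sampled here from B(ng, 0.5) by analogy with nl1.

```python
import random


def binomial(n, p, rng):
    """Draw from B(n, p) as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate_pair(n1, n2, n12, p, rng):
    """Return simulated degrees of the two copies under symmetric divergence."""
    unique = n1 + n2 - 2 * n12                   # partners held by one copy only
    n_lost = binomial(unique, p, rng)            # nl
    n_gained = unique - n_lost                   # ng
    n_lost_1 = binomial(n_lost, 0.5, rng)        # nl1; nl2 = nl - nl1
    n_gained_1 = binomial(n_gained, 0.5, rng)    # ng1 (assumed distribution)
    ancestral = n12 + n_lost
    degree_1 = ancestral - n_lost_1 + n_gained_1
    degree_2 = ancestral - (n_lost - n_lost_1) + (n_gained - n_gained_1)
    return degree_1, degree_2


# A strongly asymmetric observed pair (made up): 300 partners vs. none.
rng = random.Random(1)
for p in (0.0, 0.5, 1.0):
    print(p, simulate_pair(300, 0, 0, p, rng))
```

By construction the two simulated degrees always sum to n1 + n2, and each unique partner ends up with either copy with probability 0.5, so an observed pair with very unequal degrees is replaced by a pair with similar degrees. This is why the symmetric model yields a positive correlation between the degrees of the two copies.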

2.4.5 Robustness analysis of the network

The degree-based and random node removals were performed separately for duplicate genes, singletons, and duplicates and singletons combined (a total of six experiments). In each experiment, nodes were deleted in five increments (f): 10% (1,290 genes), 20% (2,580 genes), 30% (3,770 genes), 40% (5,158 genes) and 50% (6,449 genes). The same number of nodes was removed in each experiment. For instance, by attacking 10% of duplicates, we removed 1,290 (10% of 12,897) duplicate genes with the highest number of connections. Similarly, by attacking 10% of all genes, we removed 1,290 of the genes with the highest number of connections. Two quantities were used to assess the damage to the network: S, the fraction of nodes in the giant connected cluster after node removal (the relative size of the giant cluster), and ⟨d⟩, the average shortest path length between any two nodes in this cluster. If nodes stay connected except for those that are removed (minimal damage), S = 1 − f. For random removal of nodes from a scale-free network, S ≈ 1 − f and ⟨d⟩ stays approximately constant until a considerable fraction of nodes is removed [38]. However, degree-based removal leads to a fast decay of S and an increase in ⟨d⟩ until the network breaks down into small isolated clusters. Thus, the average shortest path length tends to first increase (the overall system is still functioning but it takes longer to travel from one node to another) and to decrease later, after a certain number of nodes is removed [38].

Chapter 3

A first look at ARFome: dual-coding genes in mammalian genomes

Published as W.-Y. Chung, S. Wadhawan, R. Szklarczyk, S. K. Pond, and A. Nekrutenko. A first look at arfome: dual-coding genes in mammalian genomes. PLoS Comput Biol, 3(5):e91, 2007.

3.1 Introduction

Any stretch of DNA contains six reading frames and can potentially code for multiple proteins. Situations in which two partially overlapping reading frames code for functional polypeptides (dual coding) are quite common in bacteriophages and viruses (e.g., φX174, HIV-1, hepatitis C, or influenza A), where constraints on the genome size are strict. On the other hand, dual coding in vast eukaryotic genomes was reported to be scarce and restricted to short regions with secondary reading frames having poor phylogenetic conservation [65]. Yet, three known human genes (GNAS1, XBP1, and INK4a; Figure 3.1) defy this pattern by having long, well-conserved dual-coding regions (e.g., the dual-coding region in XBP1 is conserved from worms to mammals [66]). In addition, the three cases exemplify some of the most striking biological phenomena and invite us to look at dual coding in greater detail. In GNAS1, a single transcript simultaneously produces the alpha subunit of G-protein from the main reading frame, and a completely different protein, ALEX, using a +1 frame [67]. A transcript of XBP1 can produce only a single protein at a time and uses the endonuclease IRE1 to switch between two overlapping reading frames [68]. INK4a generates two alternative transcripts that use different reading frames of a constitutive exon for translation to tumor suppressor proteins p16INK4a and p14ARF [69]. Although GNAS1, XBP1, and INK4a are drastically different, there are striking parallels in the way they function. Products of the main and alternative reading frames perform related tasks, either by binding and regulating each other (GNAS1 and XBP1), or by complementing each other in performing a common function (INK4a) [70, 71, 72].

Dual coding is a costly arrangement because it limits the flexibility of amino acid composition [73]. A silent change in one frame is almost always guaranteed to be amino acid changing in the other. Although counterintuitive, this codependency may in fact lead to an increase of the apparent substitution rate when two frames become locked in an evolutionary race of compensatory changes. A chief example of this is the mammalian GNAS1 locus, where the overlapping reading frames accumulate substitutions so fast that primate and rodent sequences become virtually unalignable [74]. Yet despite this cost, the dual coding in GNAS1, XBP1, and INK4a is preserved throughout mammalian taxa [12, 74]. Are overlapping reading frames a new avenue for encoding functionally linked proteins?

Figure 3.1. Three known examples of mammalian dual-coding genes. (A) A transcript of the Gnas1 gene contains two reading frames and produces two structurally unrelated proteins, XLαs and ALEX, by differential utilization of translation start sites. (B) A newly transcribed XBP1 mRNA can only produce protein XBP1U from ORF A. Removal of a 26-bp spacer (yellow rectangle) joins the beginning of ORF A with ORF B and translates into a different product called XBP1S. (C) Ink4a generates two splice variants that use different reading frames within exon E2 to produce the proteins p16Ink4a and p19ARF.

3.2 Results and discussion

3.2.1 Dual coding is virtually impossible by chance

Before describing our analyses, we define terms used in this paper. A dual-coding gene contains two frames read in the same direction: canonical (annotated as protein coding in literature and/or databases) and alternative. The alternative reading frame (ARF) is shifted forward one or two nucleotides relative to the canonical frame (+1 and +2 ARFs, respectively). To identify dual-coding genes, we used a comparative strategy, because all presently known alternative reading frames are conserved in multiple species. For example, ARFs in Gnas1, XBP1, and INK4A are conserved in all sequenced mammals [72, 74, 75].

To reliably find new dual-coding genes, we must determine how likely they are to occur by chance. Simulations designed to answer this question show that dual coding is statistically unlikely, suggesting that if overlapping coding regions are detected in orthologous sequences, they have a high chance of being truly functional. To determine a length threshold for identification of dual-coding regions (what is the longest dual-coding region that can arise by chance?), we conducted the following experiment. First, we generated alignments between 14,159 orthologous canonical reading frames from human and mouse transcripts (sequences, canonical frame boundaries, and orthology assignments were obtained from the Ensembl database at http://www.ensembl.org). We chose these two species because they have the highest number of annotated transcripts. Next, we disassembled all 14,159 human/mouse alignments into codon columns. By randomly picking codon columns from the previous step, we generated 10,000 simulated alignments with 5,000 columns each. Finally, we scanned simulated alignments for the presence of ARFs and built a length distribution (see Supplementary Materials on page 57). Only 0.1% of +1 ARFs were ≥500 bp, while none of the +2 ARFs extended beyond this threshold (the longest was 492 bp in the simulation).
A possible weakness of this approach is the assumption of codon independence, for it is well known that protein-coding regions possess Markovian properties [4]. To address this issue, we conducted codon-based phylogenetic parametric simulations, which do not break open reading frames (ORFs), and estimated codon frequencies from gene alignments with at least three taxa, which contained conserved, long +1 ARFs. Only 0.3% of simulated alignments preserved ARFs with 500 or more nucleotides (see Supplementary Materials on page 58). Thus, both simulations suggest that only a negligible number of random dual-coding regions will reach 500 bp, and we set this length as the threshold for defining ARFs in orthologous coding regions.
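The codon-column resampling experiment described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure from the Supplementary Materials: an ARF is approximated here as the longest stretch of a shifted frame that contains no stop codon in either sequence, codon columns are assumed to be gap-free, and the five toy columns, the alignment size, and the 60-bp cut-off are invented so that the example runs quickly.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}


def resample_alignment(columns, n_columns, rng):
    """Build a simulated alignment by sampling codon columns with replacement.

    columns: list of (human_codon, mouse_codon) pairs taken from real alignments.
    """
    picked = [rng.choice(columns) for _ in range(n_columns)]
    human = "".join(h for h, _ in picked)
    mouse = "".join(m for _, m in picked)
    return human, mouse


def longest_shared_open_run(human, mouse, shift):
    """Longest stretch (bp) of the shifted frame with no stop in either sequence."""
    best = run = 0
    for i in range(shift, len(human) - 2, 3):
        if human[i:i + 3] in STOPS or mouse[i:i + 3] in STOPS:
            run = 0
        else:
            run += 3
            best = max(best, run)
    return best


# Toy codon columns (made up); a real run would use columns from all alignments.
columns = [("GCT", "GCT"), ("AAG", "GAG"), ("CCG", "CCG"),
           ("TTA", "TTG"), ("GAT", "GAC")]
rng = random.Random(0)
lengths = []
for _ in range(200):
    human, mouse = resample_alignment(columns, 100, rng)
    lengths.append(longest_shared_open_run(human, mouse, shift=1))
fraction_long = sum(n >= 60 for n in lengths) / len(lengths)
print(max(lengths), fraction_long)
```

Because columns are drawn independently, this scheme carries the codon-independence assumption discussed above; the parametric simulations mentioned in the text were introduced to relax it.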

3.2.2 Defining mammalian ARFs

Using 500 bp as the lower bound, we identified 149 ARFs that were conserved in human and mouse. An example is shown in Figure 3.2 (see Supplementary Materials on pages 59 and 60 for procedure steps and detection of ARFs from multiple alignments). Although all 149 candidate ARFs were conserved in the two species and were longer than the empirically derived threshold, some could still be false positives. For example, the amino acid sequence of the canonical protein may dictate a specific codon composition, which in turn may shape the nucleotide sequence of the canonical frame such that an ARF can be relatively long simply as an artifact of the codon usage pattern (e.g., having low-complexity regions, or avoiding problem codons; see Supplementary Materials on page 62). To remove potential false positives, we developed the codon column replacement test (CCRT; see Materials and Methods). CCRT estimates how likely a given alignment is to contain an ARF by chance. If an ARF has a CCRT score of ≤5%, it is considered a reliable prediction. From the total of 149 ARFs, 66 satisfied this criterion. To make our final set even more conservative, we considered only those of the 66 ARFs that were conserved in at least one other species (rat and/or dog) in addition to human and mouse. The conservation requirement reduced the final set to 40 ARF-containing transcripts, which we examined in detail (Table 3.1). Note that our criteria are very conservative because (1) a number of true ARFs may be shorter than 500 bp (261 bp and 210 bp in XBP1 and Ink4A, respectively) and (2) transcript data for dog and rat are incomplete, which may have led to the exclusion of some true ARFs. Genomic locations of the ARFs are provided in Supplementary Materials on page 64 and can be visualized as a custom track at the University of California Santa Cruz Genome Browser [76] (a link is provided at http://nekrut.bx.psu.edu). Supplementary Materials on page 63 lists the assignment of ARF-containing genes to Gene Ontology categories.
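The length criterion used above can be illustrated with a short sketch. This is not the pipeline used in the study: it assumes a single, gap-free coding sequence whose first base is in the annotated frame, and it only reports the longest stop-free stretch in a shifted frame. The cross-species conservation requirement is not modeled, and the names `longest_open_stretch`, `is_candidate_arf`, and `min_len` are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_open_stretch(cds, shift):
    """Return (start, length in nt) of the longest stop-free run of codons
    in the frame shifted by `shift` nucleotides (1 or 2) from the CDS frame."""
    seq = cds.upper()
    best_start, best_len = shift, 0
    run_start = shift
    # Walk over complete codons of the shifted frame.
    for pos in range(shift, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            if pos - run_start > best_len:
                best_start, best_len = run_start, pos - run_start
            run_start = pos + 3
    # Close the final run at the end of the last complete codon.
    last_end = shift + ((len(seq) - shift) // 3) * 3
    if last_end - run_start > best_len:
        best_start, best_len = run_start, last_end - run_start
    return best_start, best_len

def is_candidate_arf(cds, shift=1, min_len=500):
    """Apply the 500 bp lower bound (assumed here to be in nucleotides)."""
    return longest_open_stretch(cds, shift)[1] >= min_len
```

In the study itself the stretch must additionally be conserved between human and mouse, so a check of this kind would be applied to both aligned sequences and to their overlap.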

3.2.3 Analysis of nucleotide substitutions suggests functionality of ARFs

Table 3.1. ARF-containing genes identified using a high-stringency approach.

Number  GenBank gi Number  Gene               CCRT Score  Length (aa)  Divergence(a)  κ(b), βSTOP(c)
1       53831993           SF3A1              0.0039      195          0.09           * *
2       4758467            GRP50              0.0335      183          0.18
3       4503680            FCGBP              0.0467      187          0.20           * *
4       18201912           FOXN1              0.0018      258          0.15           *
5       27436942           RXRβ               0.0039      168          0.09           *
6       62954773           CSMD2              0.0085      239          0.11           *
7       31342353           ZNF598             0.0183      247          0.19           *
8       14165285           RHOBTB2            0.0011      226          0.10           *
9       24041034           NOTCH2             0.0334      210          0.13           * *
10      6513852            PCDH8              0.0087      173          0.12           * *
11      37655178           AP3B2              0.0200      205          0.11           *
12      109891936          DLGAP4             0.0417      172          0.10           * *
13      48762935           CSRP3              0.0040      175          0.09           *
14      4758955            BZRAP1             0.0081      176          0.16           * *
15      48255896           SEMA6C             0.0248      169          0.14
16      38348329           LANCL3             0.0008      181          0.14           *
17      52856410           CXXC1              0.0132      174          0.10           * *
18      4557256            ADCY8              0.0010      178          0.10           * *
19      38176156           SPATA2             0.0027      198          0.15           *
20      37537685           ZSCAN21            0.0026      227          0.19
21      122114640          ZNF3               0.0019      221          0.14           *
22      31317254           NLGN2              0.0001      171          0.17           *
23      58257667           KIAA0802           0.0019      204          0.18           *
24      27436945           LMNA               0.0019      169          0.11           *
25      34147467           CCDC120            0.0204      234          0.14           *
26      28559070           DNMT3A             0.0009      178          0.09           * *
27      13376631           ZC3H12A            0.0180      171          0.19           *
28      53832025           IQSEC2             0.0441      279          0.10           * *
29      18378730           BBX                0.0114      221          0.11           *
30      113423421          Predicted protein  0.0125      169          0.22           * *
31      21071079           FBXL7              0.0000      172          0.11           *
32      14017860           KIAA1822           0.0128      167          0.17           *
33      6649056            TMEM2              0.0006      193          0.16           * *
34      18379331           WAC                0.0299      187          0.08           *
35      113204605          RBAK               0.0089      179          0.21           * *
36      117189905          MINK1              0.0305      224          0.09           * *
37      52145308           LING01             0.0026      180          0.08           *
38      45433544           KIAA0460           0.0079      177          0.10           *
39      56790298           PSD                0.0464      218          0.10           * *
40      57165354           LPHN1              0.0054      206          0.10           *

(a) Nucleotide divergence between human and mouse in the ARF region.
(b) Asterisks indicate that κ1,2 is not significantly different from κ3 at the 5% level.
(c) Asterisks indicate that βSTOP = 0 could not be rejected at the 5% level.

Figure 3.2. mRNAs from human and mouse are aligned. Mouse mRNAs are indicated by lowercase letters. Each of the two mRNAs contains an annotated coding region (white boxes). Our algorithm looks for ARFs (black boxes) that are shifted one (shown) or two nucleotides relative to the annotated frame. The locations of the ARFs must be conserved between the species. Specifically, the ARFs in the two species must overlap for at least 500 bp.

Previous studies of ARF-containing genes showed that the region of overlap between canonical and alternative reading frames evolves under unique sets of constraints. If both proteins (encoded by canonical and alternative frames) are functional and maintained by purifying selection, the codependency between codon positions would manifest itself in a nucleotide substitution pattern that is sharply different from the one expected in single coding regions [12, 74]. The difference in patterns can be used to test whether the dual-coding genes identified in our study are real. We developed two new approaches for the analysis of nucleotide substitutions (a codon substitution model for overlapping reading frames and a transition/transversion ratio test) to narrow the list of potential dual-coding genes to 15 high-confidence candidates. The codon model estimates five substitution rates for the overlapping reading frames by considering all 64 possible codon contexts for each one-nucleotide codon substitution in a given frame, and weighting each context based on its relative frequency in the extant sequences (see Materials and Methods).

One of the rates, βSTOP, which measures the propensity of substitutions in one frame toward introduction of stop codons in the other frame, is especially useful for testing the reliability of ARF predictions. This quantity measures the admissibility of stop codon-inducing contexts in the evolutionary past of the sample and is zero or near zero in functional ARFs. For example, when applied to biochemically characterized ARFs in Gnas1 and XBP1, the hypothesis of βSTOP being exactly zero cannot be rejected (p = 0.5 from likelihood ratio test). For 34 candidates, the hypothesis βSTOP = 0 could not be rejected. From a series of parametric simulations we estimated that at p = 0.05, the test fails to reject the null hypothesis for 6% of the datasets that were simulated using a single reading frame model.

To confirm our results using an independent nucleotide-based approach (as opposed to the codon-based test described earlier), we applied the transition-to-transversion (κ) ratio test to make inferences about the biological significance of ARFs. The test is based on the following reasoning: in most standard protein-coding regions (with only one reading frame), κ at the third codon position (κ3) is significantly higher than at the first and second codon positions (κ12), so that κ12 < κ3 [77]. This is because most substitutions at the third codon position are synonymous, whereas in the first codon position all but eight substitutions are nonsynonymous, and all substitutions in the second codon position are nonsynonymous. By contrast, in overlapping reading frames, codon positions are codependent. For example, in a +1 ARF, the third codon positions correspond to the first codon positions of the canonical frame. Thus, almost every change in the third codon position of the ARF is guaranteed to change amino acids encoded in the canonical frame. However, if the ARF encodes a truly functional product, purifying selection would resist such changes, and the condition κ12 < κ3 would not hold. This gives us the opportunity to test the functionality of the ARFs in our dataset by contrasting two hypotheses: H0: κ12 = κ3 (the ARF does encode a functional polypeptide) and HA: κ12 < κ3 (the ARF does not encode a functional polypeptide). To perform this test, we used a maximum likelihood framework to test κ12 and κ3 for equality [78]. Application of the test to our list of dual-coding genes identified 18 candidates. Intersecting the results of the two tests yielded 15 dual-coding genes as high-confidence candidates.

The small number of species used in this study (four; a currently unavoidable limitation given the low annotation quality of mammalian genomes) limits the statistical power of our analyses and explains why the other candidates did not pass this test. Similar analyses of Gnas1 and XBP1 genes used eight or more sequences [12, 74]. Adding more sequences, which should be possible in the near future, will increase the number of high-confidence candidates.
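The intuition behind comparing κ12 with κ3 can be made concrete by simply counting transitions and transversions at each codon position of a pairwise alignment. The sketch below is only a descriptive proxy and is not the maximum likelihood test of [78]: raw count ratios are not rate-ratio estimates, and the function names and the assumption of an in-frame, gap-free alignment are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(a, b):
    """A transition stays within purines or within pyrimidines."""
    return (a in PURINES and b in PURINES) or (a in PYRIMIDINES and b in PYRIMIDINES)

def ts_tv_by_codon_position(seq1, seq2):
    """Count differences at codon positions 1, 2 and 3 of an in-frame
    pairwise alignment. Returns {position: [transitions, transversions]}."""
    counts = {1: [0, 0], 2: [0, 0], 3: [0, 0]}
    for i, (a, b) in enumerate(zip(seq1.upper(), seq2.upper())):
        # Skip identical sites and anything that is not an unambiguous base.
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        position = i % 3 + 1
        counts[position][0 if is_transition(a, b) else 1] += 1
    return counts

def crude_ratios(counts):
    """Pooled ratio for positions 1+2 and ratio for position 3.
    A ratio is None when no transversions were observed."""
    ts12 = counts[1][0] + counts[2][0]
    tv12 = counts[1][1] + counts[2][1]
    ts3, tv3 = counts[3]
    ratio = lambda ts, tv: ts / tv if tv else None
    return ratio(ts12, tv12), ratio(ts3, tv3)
```

With only a few aligned sequences, such counts are small and noisy, which is consistent with the limited statistical power discussed above.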

3.2.4 What may be the potential function of ARF-encoded proteins?

Although experimental confirmation of protein expression and genetic studies will ultimately answer this question, analysis of current literature provided us with clues to potential ARF functions. For example, one of the candidates is adenylate cyclase (ADCY8; Table 3.1), a membrane-bound enzyme that catalyses the formation of cyclic AMP from ATP [79]. A 534 bp ARF is located in the 5’-end of the ADCY8 transcript. The corresponding region of the canonical peptide has two distinct functions: it interacts with Ca2+/calmodulin and binds to the catalytic subunit of protein phosphatase 2A (PP2a) [80]. Such multitasking is one of the features of dual-coding genes, where separate functions are performed by products of canonical frames and ARFs [71, 72, 81]. Two nucleotide substitutions affecting the amino acid sequence of ADCY8, W38A and S66D (produced by mutagenesis), have conspicuous effects on ARF structure and calmodulin binding. W38A creates a stop in the ARF and disrupts calmodulin binding, but has no effect on association with PP2a. On the other hand, S66D does not disrupt the ARF and has no effect on either calmodulin or PP2a binding [82]. Because in at least two instances products of the ARF bind to the product of the canonical frame (i.e., Gnas1 [70] and XBP1 [71]), we speculate that the polypeptide encoded by the ARF may mediate the binding of calmodulin by ADCY8. In fact, ADCY8 has a number of unidentified protein interaction partners from yeast two-hybrid screen experiments, one of which may be the ARF-encoded polypeptide [80].

Another gene in our set, Misshapen/Nck-related Kinase (MINK1; see Table 3.1), is involved in a number of functions related to cell spreading, fiber formation, and cell-matrix adhesion. MINK1 regulates the Jun kinase pathway (JNK) [83], is involved in thymocyte selection, and interacts with a large number of proteins controlling cytoskeletal organization, cell cycle, and apoptosis [84].
The MINK1 protein contains three functional domains (N-terminal kinase, intermediate, and C-terminal germinal center kinase) and exists as five distinct isoforms translated from alternatively spliced transcripts. All five transcripts contain an intact ARF, which covers the entire length of the intermediate domain. The extreme multifunctionality of MINK1 suggests that the ARF-encoded protein may be responsible for some of the functions. In addition, the intermediate region of the protein is the most variable in cross-species comparisons [85]. This provides additional support for the functionality of MINK1’s ARF: regions containing overlapping reading frames encoding functional proteins are likely to evolve faster than single-coding regions [12, 74].

Retinoid X receptor beta (RXRβ; see Table 3.1) is a member of the retinoid X nuclear receptors that control transcription of multiple genes. In mice, RXRβ binds to the enhancer controlling major histocompatibility class I genes [86]. It is the only gene in our set in which the existence of the ARF was reported in the literature as an alternative N-terminus generated via alternative splicing [87], although this gene failed to pass our transition-to-transversion ratio test. Analysis of transcripts available for this gene shows that this was caused by the skipping of the second coding exon. Because the length of the skipped exon is not a multiple of three, this event switches the reading frame downstream of the splicing point. To recover the phase of the reading frame past the splicing point, the translation must be initiated at the ARF start codon. Because both transcripts (with and without a second exon) have identical 5’ ends, it is likely that the ARF is translated from the full-length transcript.

3.2.5 Conclusions

Maintenance of dual-coding regions is evolutionarily costly and their occurrence by chance is statistically improbable. Therefore, an ARF that is conserved in multiple species is highly likely to be functional. Historically, dual-coding regions were largely overlooked as they violated the accepted views of eukaryotic gene organization. For example, although the fact that XBP1 produces two proteins was known for years, only one of them was considered biologically important. Confirmation of the function of the second protein came only recently, when three groups described its roles [71, 81, 88].

Dual coding is also difficult to confirm experimentally and computationally. For example, one cannot use expressed sequence tags (ESTs) to confirm expression of ARFs because in the cases described here, the same transcript expresses both proteins via the use of alternative translation starts. Initiation codon context and protein structure predictions are not guaranteed to confirm or refute ARF functionality either: the most impressive example of dual coding, Gnas1, has poorly defined Kozak motifs [89] and produces proline-rich polypeptides without clearly defined secondary structure elements [67]. However, analyses of confirmed dual-coding regions allowed us to highlight unique properties and to use them in a genome-wide scan that identified 40 candidates.

Is this too many or too few? We emphasize that our criteria were set to be very strict to eliminate the noise. Therefore, the seemingly small number of candidates is likely just a subset of a larger ARFome. First, some ARFs are shorter than the stringent length threshold of 500 bp that we have set to eliminate most false positives. For example, the length of the dual-coding region in human XBP1 is 261 bp [90], and is 210 bp in human INK4a [69].
Second, because only four species were included in the analyses of nucleotide substitutions, some dual-coding regions failed the codon-based and transition/transversion ratio tests due to the lack of statistical power. As the annotation quality of other mammalian genomes increases, it will be possible to add more sequences to our analyses. Third, we required ARFs to be conserved in multiple species. A recent study has demonstrated that many dual-coding regions are specific to a narrow phylogenetic group (i.e., primates [65]) and would not be detected by the current implementation of our method. None of the 40 genes identified in our study overlaps with Liang and Landweber’s dataset [65], as these authors primarily focused on short dual-coding regions arising from alternative splicing events. Finally, our approach assumes that the two proteins encoded by the dual-coding region evolve under a purifying selection regime, as in all presently known mammalian dual-coding genes. This assumption was shown not to hold for some dual-coding regions of bacterial genomes [10]. Thus, 40 candidates is likely an underestimate. Improving annotation of additional mammalian species will allow us to conduct lower-stringency scans to define the size of the ARFome.

Our study provides a robust statistical framework for detection and computational validation of dual-coding regions. This methodology will work equally well in genome-wide screens (this study) and in situations in which an ARF in a single gene needs to be evaluated. Take another look at your gene; you might find an unexpectedly simple explanation, a second protein from the alternative reading frame, for experimental results that are otherwise difficult to interpret.

3.3 Materials and methods

3.3.1 CCRT algorithm

CCRT estimates how likely an alignment is to contain an ARF by chance. The algorithm works as follows. Consider an alignment of human and mouse protein-coding regions similar to that shown in Figure 3.2. It contains two reading frames: canonical (ORF, white) and alternative (ARF, black). The objective of CCRT is to test whether or not the ARF is an artifact of the nucleotide composition imposed by the ORF. CCRT takes two inputs: the alignment just described and a codon column frequency table. The codon column frequency table is similar to a codon usage table, but instead of codons it contains alignments of codons from at least two species (in our case, human and mouse). The codon column frequency table is generated by first aligning all possible orthologous protein-coding regions between two (or more) species, splitting these alignments into individual codon alignments, and counting the frequency of each codon alignment. For this study, the table was constructed by aligning ∼9,000 orthologous protein-coding regions from human and mouse (alignments can be downloaded from http://nekrut.bx.psu.edu).

Given an alignment and the codon column frequency table, CCRT generates multiple simulated alignments (in this study we used 10,000 replicates) by replacing the original codon columns of the alignment with ones drawn from the codon column frequency table so that the amino acid translation is preserved in the ORF. The probability of drawing a codon alignment from the codon column frequency table is proportional to its frequency. The ORF translations of all simulated alignments are identical to the ORF translation of the original alignment, but are guaranteed to be different at the nucleotide level. Finally, each simulated alignment is translated in the ARF, and the number of alignments with the full-length ARF is recorded. This number, as a proportion of all replicates, serves as the empirical p-value. A low p-value (<5%) indicates that a small fraction of simulated alignments contain ARFs, and therefore the ARF is not an artifact of the nucleotide composition imposed by the ORF and can be considered a true ARF.
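The resampling procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the alignment is given as a list of gap-free codon columns covering only the ARF region, the ARF is taken to be "full length" when no species has a stop codon in the shifted frame, and a column whose translation is missing from the table is left unchanged. The names `ccrt_p_value`, `column_counts`, and `has_stop` are hypothetical.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def has_stop(seq, shift):
    """True if the frame shifted by `shift` nucleotides contains a stop codon."""
    return any(CODON_TABLE[seq[i:i + 3]] == "*" for i in range(shift, len(seq) - 2, 3))

def ccrt_p_value(columns, column_counts, shift=1, replicates=10000, seed=0):
    """columns: list of codon columns, e.g. [("GCT", "GCC"), ...] (one codon per species).
    column_counts: dict mapping a codon column to its observed count.
    Returns the fraction of simulated alignments that keep the shifted frame open."""
    rng = random.Random(seed)
    # Group table entries by the amino acid column they encode in the ORF.
    by_translation = {}
    for col, count in column_counts.items():
        key = tuple(CODON_TABLE[c] for c in col)
        cols, weights = by_translation.setdefault(key, ([], []))
        cols.append(col)
        weights.append(count)
    hits = 0
    for _ in range(replicates):
        sampled = []
        for col in columns:
            key = tuple(CODON_TABLE[c] for c in col)
            choices, weights = by_translation.get(key, ([col], [1]))
            # Draw a replacement column with probability proportional to its frequency.
            sampled.append(rng.choices(choices, weights=weights, k=1)[0])
        n_species = len(columns[0])
        seqs = ["".join(col[s] for col in sampled) for s in range(n_species)]
        if not any(has_stop(seq, shift) for seq in seqs):
            hits += 1
    return hits / replicates
```

A returned value below 0.05 would correspond to the reliability criterion used in the Results; the fixed `seed` is included only to make the sketch reproducible.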

3.3.2 Codon model for overlapping reading frames

Consider an alignment of N codon sequences on S codons, which encodes two overlapping reading frames. We present the case in which the frames are shifted by one nucleotide relative to one another, but other cases can be handled by straightforward modifications. We refer to the two reading frames as F0 (frame 0) and F1 (frame +1). We also make use of the following notation: $\pi^{ab}_{ij}$ denotes the frequency of dinucleotide ij in codon positions a and b (relative to F0), and $\pi^{c}_{k}$ denotes the frequency of nucleotide k in the c-th codon position. These quantities are estimated by observed counts from a given alignment.

First, we define the model for codon evolution in F0. We discriminate four types of codon substitutions: SS (synonymous in both frames), SN (synonymous in F0 and nonsynonymous in F1), NS (nonsynonymous in F0 and synonymous in F1), and NN (nonsynonymous in both frames). We model the process of character substitution using a Markov process operating on codons and defined by the instantaneous rate matrix Q. Following the common practice of allowing nonzero rates for single instantaneous nucleotide substitutions only, we assign substitution rates α to all one-nucleotide SS substitutions, β01 to SN substitutions, β10 to NS substitutions, and β11 to NN substitutions.

In addition, we introduce another rate, βSTOP, for all those substitutions that introduce a stop codon in one of the two frames. Because the evolution at a given position in F0 depends on the flanking nucleotides (two upstream and one downstream), we condition the substitutions at a codon in F0 on the values of the relevant nucleotides, compute transition probabilities for each of the 64 possibilities, and weight over the frequency distributions $\pi^{12}$ and $\pi^{3}$.

Formally, the instantaneous rate of substituting a nonstop codon x = x1x2x3 with a nonstop codon y = y1y2y3 in F0, conditioned on the values of the two upstream nucleotides u1u2 and the downstream nucleotide d1, is

\[
q^{F0}_{xy} \mid u_1, u_2, d_1 =
\begin{cases}
0, & \text{multiple substitutions required in } x \to y,\\
R_{x_k y_k}\,\alpha\,\pi^{k}_{y_k}, & \text{SS substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{01}\,\pi^{k}_{y_k}, & \text{SN substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{10}\,\pi^{k}_{y_k}, & \text{NS substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{11}\,\pi^{k}_{y_k}, & \text{NN substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{STOP}\,\pi^{k}_{y_k}, & \text{a stop codon is introduced in F1.}
\end{cases}
\tag{3.1}
\]

Conditioning on u1, u2, d1 is necessary to determine whether a substitution in F0 results in a synonymous or a nonsynonymous change in F1. Rnm denotes the rate of substitution for nucleotides n and m relative to that of A −→ G. We set Rnm = Rmn to ensure time reversibility. One can check that for any triplet u1, u2, d1, the equilibrium distribution of the Markov process defined by this rate matrix is

\[
\pi_{x_1 x_2 x_3} = \frac{\pi^{1}_{x_1}\,\pi^{2}_{x_2}\,\pi^{3}_{x_3}}{1 - \sum_{ijk \text{ is a stop codon}} \pi^{1}_{i}\,\pi^{2}_{j}\,\pi^{3}_{k}}
\tag{3.2}
\]
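Equation 3.2 is simply the product of position-specific nucleotide frequencies, renormalized after removing the probability mass that falls on stop codons. A small numerical sketch, assuming the standard genetic code and hypothetical names (`equilibrium_codon_frequencies`, `pi1`, `pi2`, `pi3`), is given below.

```python
from itertools import product

STOP_CODONS = ("TAA", "TAG", "TGA")  # standard genetic code assumed

def equilibrium_codon_frequencies(pi1, pi2, pi3):
    """pi1, pi2, pi3: dicts mapping A, C, G, T to the nucleotide frequency
    at codon positions 1, 2 and 3. Returns frequencies of the 61 sense codons."""
    # Probability mass assigned to stop codons by the product of marginals.
    stop_mass = sum(pi1[c[0]] * pi2[c[1]] * pi3[c[2]] for c in STOP_CODONS)
    freqs = {}
    for a, b, c in product("ACGT", repeat=3):
        codon = a + b + c
        if codon in STOP_CODONS:
            continue
        freqs[codon] = pi1[a] * pi2[b] * pi3[c] / (1.0 - stop_mass)
    return freqs
```

For uniform nucleotide frequencies, each of the 61 sense codons receives frequency 1/61, which gives a quick sanity check of the normalization.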

Second, we describe an analogous rate matrix $q^{F1}_{xy} \mid u_1, d_1, d_2$ for F1. This rate matrix is conditioned on one upstream nucleotide u1 and two downstream nucleotides d1, d2.

\[
q^{F1}_{xy} \mid u_1, d_1, d_2 =
\begin{cases}
0, & \text{multiple substitutions required in } x \to y,\\
R_{x_k y_k}\,\alpha\,\pi^{k}_{y_k}, & \text{SS substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{01}\,\pi^{k}_{y_k}, & \text{SN substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{10}\,\pi^{k}_{y_k}, & \text{NS substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{11}\,\pi^{k}_{y_k}, & \text{NN substitution in the $k$-th codon position},\\
R_{x_k y_k}\,\beta_{STOP}\,\pi^{k}_{y_k}, & \text{a stop codon is introduced in F0.}
\end{cases}
\tag{3.3}
\]

Transition matrices T(t) for the processes are matrix exponentials of Qt, for the appropriate rate matrix Q. For computational tractability, we assume that the evolution at codon c can be adequately described by computing the expectation over flanking upstream and downstream nucleotides. Specifically, if $L^{F0}_{c} \mid u_1, u_2, d_1$ is the phylogenetic likelihood at codon c in frame F0, conditioned on the flanking nucleotides, then the unconditional likelihood can be computed as

\[
L^{F0}_{c} = \sum_{\substack{(u_1,u_2)\in\{AA,\ldots,TT\}\\ d_1\in\{A,C,G,T\}}} \Pr\{(u_1,u_2)\}\,\Pr\{d_1\}\; L^{F0}_{c} \mid u_1, u_2, d_1
\tag{3.4}
\]

An analogous calculation can be performed for frame F1. Finally, we define the joint likelihood of the entire dataset (omitting the first and the last codons in F0) as

\[
L = \prod_{c=2}^{S-1} L^{F0}_{c}\, L^{F1}_{c}
\tag{3.5}
\]

Parameter estimates such as branch lengths and substitution rates can be obtained by maximizing the likelihood as a function of model parameters with standard numerical optimization techniques. Due to the structure of the genetic code, most of the possible single-nucleotide substitutions lead to nonsynonymous changes in at least one of the reading frames (Supplementary Materials on page 61). To evaluate the evolutionary regime in a multiple reading frame alignment, we test the null hypothesis βSTOP = 0 to evaluate whether the introduction of premature stop codons is disallowed. The test defines a one-sided constraint on a single parameter, and the significance can be evaluated using the likelihood ratio test with the approximate distribution of the test statistic.

Chapter 4

Transcriptome profiling by next-generation sequencing technology

4.1 Background

The extensive activities of the genome may be better understood by detecting the global expression pattern of its genes, or the transcriptome. Previously, this type of study was performed using EST analysis or microarray technology. However, these platforms were limited in the breadth and depth of their ability to diagnose whole genome expression [91]. RNA-Seq is a newly developed approach to study transcriptomes using next-generation sequencing technology. Massively parallel sequencing of cDNA provides a more comprehensive way to detect expression profiles than other methods. Further, the next-generation sequencing technology is faster and cheaper and provides deeper sequencing coverage than traditional capillary methods. The new technology achieves high throughput by sequencing fragmented molecules in parallel. The sequencing results are the collection of these fragmented molecules, rather than the consensus base calling of the Sanger system. Thus, it is able to achieve higher data throughput and is an ideal way to detect rare alleles and/or genes expressed at a low level.

The new approach is also able to detect novel exons and alternative splicing forms of transcripts, which cannot be accomplished using ESTs or microarrays. Several projects, such as the human and mouse transcriptomes, have shown the effectiveness of this process. Mouse transcriptomes from adult brain, liver and skeletal muscle tissues were reported using ∼50 million reads of length 25 nt [16]. Most of the reads were found to fall within exon regions (over 90% of mappable reads), indicating high sensitivity. The paper also provided a quantified expression level, RPKM, which is the number of reads per kilobase of exon model per million mapped reads. Three RPKM was estimated to correspond to one transcript per liver cell. However, prior knowledge, such as the starting cell number, is needed to calculate this number.
Direct cross-experiment and cross-platform comparisons of expression levels have yet to be established. On the positive side, RNA-Seq is highly reproducible: the correlation coefficients among technical replicates and between RNA-Seq and microarrays are high (R2 ≥ 0.96 and 0.62, respectively). Human transcriptomes [92, 93] were revealed by sequencing reads from various tissues and cell lines. More genes were detected by RNA-Seq than by microarrays, suggesting that RNA-Seq is a more sensitive and comprehensive method. Alternative splicing was a ubiquitous event: about 92-94% of human genes were subject to alternative splicing. Splicing forms could be analyzed and documented in detail [93].

Another research area of high potential is personal genomics [94]. The price of sequencing a mammalian genome has dropped 10- to 100-fold [95]. With high-throughput sequencing results and a reference genome available, researchers are able to study in depth the genetic differences within individuals and populations. Single-nucleotide polymorphism and heterozygosity studies, as well as epigenomic projects, have been launched either in independent labs or in collaboration with sequencing centers. The goal is to provide personalized medicine targeted to an individual’s genetic traits.

Challenges, such as short read length (25 to 45 nt) and base-calling errors, need to be addressed [15]. De novo assembly is virtually impossible to complete if repeat regions are longer than the read length [96]. Short read lengths therefore mean fewer uniquely mappable reads and more contigs (non-overlapping DNA fragments) when reads are mapped to genomic sequences. In general, a higher coverage is necessary to determine the mapping result [96, 97, 98]. The error characteristics of each commercially available machine are yet to be defined. Only a few intrinsic error features have been described. For example, Illumina/Solexa machines have higher error rates (represented by a decrease in quality scores) toward the end of the reads.
Roche/454 Life Science has trouble identifying the exact number of contiguous nucleotides when the same base is repeated. No clear error profile is provided for SOLiD, but the mapping result usually shows a higher number of errors than for the previous two machines. Though the quality scores provided with reads are claimed to be Phred-equivalent, a detailed comparison is vital for further use of these technologies. Moreover, few software tools are available to manage the amount of data generated (typically several gigabytes per experiment) and to conduct analyses reliably.

Before longer reads are available, paired-end reads, those sequenced from both sides of a fragmented molecule, might be a practical approach to overcome some of the limitations imposed by short read lengths. Compared with single-end reads, paired-end reads double the sequencing read length and provide more stringent information in mapping, reducing the probability of random alignments. Additionally, paired-end reads can help to close the gaps between DNA segments that are separated by repeats. Different scales of the distance between paired ends (ranging from 200-600 bp to ∼2 kb) are available [94] and can be used in complex studies such as those aimed at identifying genetic variations in the genome.

Here I looked into the transcriptomes of two important stages in mouse brain development. Mouse cortex tissues from embryonic day 18 (E18) and postnatal day 7 (P7) were collected and subjected to high-throughput sequencing using the Illumina/Solexa machine (a collaboration with Dr. Hong Ma and Dr. Gong Chen at the Department of Biology at the Pennsylvania State University). The primary aim was to identify genes with differential expression at the two stages. Specifically, I was interested in detecting genes with novel alternative splicing forms.
I developed a workflow that starts by evaluating the sequencing quality, then maps reads to transcriptomes, and finally reports a list of novel splicing forms. I describe the analyses of paired-end reads in more detail, because mapping and imposing pair relations on the exons were less trivial. A set of tools for short read analyses is available on Galaxy (http://galaxy.psu.edu).
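The RPKM measure introduced earlier in this section [16] reduces to a one-line calculation once read counts are available. The sketch below follows the definition quoted above (reads per kilobase of exon model per million mapped reads); the function and argument names are hypothetical, and the numbers in the usage line are invented purely for illustration.

```python
def rpkm(read_count, exon_model_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    read_count: reads mapped to the exon model of one gene
    exon_model_length_bp: summed exon length of that gene, in base pairs
    total_mapped_reads: all mapped reads in the experiment
    """
    kilobases = exon_model_length_bp / 1e3
    millions_of_reads = total_mapped_reads / 1e6
    return read_count / (kilobases * millions_of_reads)

# Illustrative numbers only: 300 reads on a 2 kb exon model, 3 million mapped reads.
example_value = rpkm(300, 2000, 3_000_000)
```

Because the value is normalized by both gene length and sequencing depth, it can be compared between genes within one experiment; as noted above, comparisons across experiments and platforms need additional care.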

4.2 Results and discussion

4.2.1 Sequencing result and quality

The sequencing produced 2.9 to 4.5 million reads of length 35 nt (Table 4.1). The experiments were performed using both single- and paired-end read sequencing. Before mapping, the quality score distribution (an example is shown in Figure 4.1) was used to examine the overall sequencing quality. In fixed-length read data, the quality scores of each site were summarized and shown in the order corresponding to the location in the read. The same tool can also handle reads of varied lengths, for example, reads from Roche/454 Life Science instruments. If the reads have varied lengths, only reads longer than 100 nt are used and are condensed to have 20 data points. Each data point represents information averaged within an interval equal to 5% of the read length. The score distribution yields a glimpse of whether trimming reads is necessary, and if so, which sites should be trimmed or marked as absent before mapping. Another tool, which generates histograms of the lengths of high-score segments, gives an outline of meaningful mappable lengths (data not shown). One caveat here is that the quality scores were not transformed to Phred scores. This is because each machine has its own quality measurement and not all of them are publicly available. Moreover, Illumina/Solexa and Phred scores are virtually identical if scores are higher than 20. Hence, in this study, I treated the quality score as it is and did not convert it to a Phred score.

Table 4.1. Summary of novel transcripts analysis.

                                    single-end reads          paired-end reads
                                    E18          P7           E18          P7
Total reads                         2,956,444    3,619,970    4,536,964    4,019,273

Before excluding reads mapped in genomic sequences
Reads mapped at novel junctions     11,126       11,066       1,196        1,302
Number of novel junctions           4,862        5,498        297          338
Genes                               2,922        3,244        273          303
Total genes                         4,535 (E18 and P7)        524 (E18 and P7)

After excluding reads mapped in genomic sequences
Reads mapped at novel junctions     2,302        2,689        621          746
Number of novel junctions           1,800        2,177        151          189
Genes                               1,367        1,596        143          172
Total genes                         2,394 (E18 and P7)        296 (E18 and P7)

Figure 4.1. An example of the quality score distribution. The x-axis is the position in the read (absolute position for fixed-length reads, such as Illumina/Solexa and SOLiD reads, or percentage of read length, such as for Roche/454 Life Science reads) and the y-axis is the base-calling score. The quality score distribution showed that the base-calling scores dropped below 20 after read position 28.
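The per-site summary and the condensing of variable-length reads described in this section can be sketched as follows. This is not the Galaxy tool itself: it uses the mean as the per-site summary and a cutoff of 20, both of which are assumptions made for illustration, and all function names are hypothetical.

```python
def mean_quality_by_position(quality_rows):
    """Mean score at each site for fixed-length reads.
    quality_rows: list of equal-length lists of quality scores, one per read."""
    n_reads = len(quality_rows)
    read_length = len(quality_rows[0])
    return [sum(row[i] for row in quality_rows) / n_reads for i in range(read_length)]

def first_low_quality_position(mean_scores, threshold=20):
    """1-based position of the first site whose mean score is below `threshold`,
    or None if every site passes. A candidate point for trimming."""
    for position, score in enumerate(mean_scores, start=1):
        if score < threshold:
            return position
    return None

def condense_to_bins(scores, bins=20):
    """Average one variable-length read into `bins` data points, each covering
    an interval of 1/bins of the read (5% when bins=20).
    Assumes the read has at least `bins` scores."""
    n = len(scores)
    condensed = []
    for b in range(bins):
        chunk = scores[b * n // bins:(b + 1) * n // bins]
        condensed.append(sum(chunk) / len(chunk))
    return condensed
```

Once variable-length reads are condensed to the same number of data points, they can be summarized with the same per-position routine as fixed-length reads.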

4.2.2 Mapping reads to the mouse transcriptomes

There are several tools [99, 100, 101] targeted at short-read mapping. Here I used RMAP [101]. It returns uniquely mapped reads and discards multiply mapped reads (also called multi-reads). Paired-end reads were split into two ends per pair and mapping was conducted separately on each of the ends. Pair relations were restored afterwards. Mapping was performed using the first 27 nt of each read and allowing at most three mismatches. A summary of the mapping is shown in Table 4.1.

The target sequences for the transcriptome analysis are not the whole-genome sequences but transcripts. In this study, I devised a collection of exon junctions rather than transcripts for two reasons: first, I was interested in finding novel transcripts, so target sequences must contain new exon junctions; second, using exon junctions, rather than transcripts, reduces the redundant exon sequences in the target data set because transcripts of a gene usually share some of the exons. The reads were then mapped to these junctions to directly detect splicing form expression. The method followed [16] and was modified to incorporate paired-end reads. Figure 4.2 illustrates known and novel alternative splice forms, and an assembled junction segment.

Gene cluster and transcript annotations were obtained from the UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables). The data provided exon locations in the genome and exon-transcript-gene associations. All possible pairs of exons that did not violate the direction of transcription were included. Novel junctions were defined as the pairs of exons that are not documented in the UCSC table. The junction segments were concatenated using 20 bp from either side of the exon boundaries, 40 bp in total. If the exon was shorter than 20 bp, the full-length sequence was included. Forty base pairs was chosen to maximize significant mapping results because of the short read length (27 nt) used in the analysis.
Overlapping exons were treated as separate exons if their boundaries were different, i.e., the resulting junction segments were different. Single-exon genes were excluded. In analyzing paired-end reads, full-length exons were included along with novel junction segments. None of the exon junctions were duplicated in the target data set. One can infer the expression of two exons together in a novel pattern if the two ends of a read were mapped to the two exons separately, and the combination of the two exons was not part of any known transcript. This is because the ends of a pair were sequenced from one segment, thus indicating that the two exons were part of the same transcript.

[Figure 4.2, panels A-D: gene A (GA) with exons E1-E4; transcripts T1 and T2; assembled junction segment J24.]

Figure 4.2. A hypothetical example of the strategy used to obtain novel exon junctions. (A) Gene A had four exons: E1, E2, E3, and E4. Dashed lines connect all exon-exon combinations that respect the exon order. (B) Two transcripts, T1 and T2, were alternative splicing variants of gene A. T1 had E1, E2, and E3. T2 had E1, E2, and E4. (C) For gene A, the junctions between E1 and E3, E1 and E4, and E3 and E4 were novel junctions, i.e., not found in known transcripts. (D) For every possible junction, 20 bp from either side were joined to form the junction sequence. For gene A, there were 6 possible junctions in total, of which 3 were known and 3 were novel.
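The hypothetical example of Figure 4.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function names `junction_segment` and `enumerate_junctions` are hypothetical, exons are identified by name and listed in transcription order, and a junction counts as known when its two exons are adjacent in at least one annotated transcript.

```python
from itertools import combinations

FLANK = 20  # bp taken from each side of the exon boundary (40 bp in total)


def junction_segment(upstream_seq, downstream_seq, flank=FLANK):
    """Join the last `flank` bp of the upstream exon to the first `flank` bp
    of the downstream exon. Slicing returns the full exon when it is shorter
    than `flank`, which matches the rule for short exons."""
    return upstream_seq[-flank:] + downstream_seq[:flank]


def enumerate_junctions(exon_order, transcripts):
    """Split all order-respecting exon pairs into known and novel junctions.

    exon_order  -- exon names in the direction of transcription
    transcripts -- dict mapping transcript name to its ordered exon names
    """
    known_pairs = set()
    for exons in transcripts.values():
        # Consecutive exons of an annotated transcript form a known junction.
        known_pairs.update(zip(exons, exons[1:]))
    all_pairs = list(combinations(exon_order, 2))
    known = [p for p in all_pairs if p in known_pairs]
    novel = [p for p in all_pairs if p not in known_pairs]
    return known, novel


# Gene A from Figure 4.2: T1 = E1, E2, E3 and T2 = E1, E2, E4.
known, novel = enumerate_junctions(
    ["E1", "E2", "E3", "E4"],
    {"T1": ["E1", "E2", "E3"], "T2": ["E1", "E2", "E4"]},
)
print("known:", known)
print("novel:", novel)
```

For gene A this lists E1-E2, E2-E3, and E2-E4 as known junctions and E1-E3, E1-E4, and E3-E4 as novel ones, in agreement with panel C.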

4.2.3 Identifying novel splice forms

After mapping, I assigned the mapped reads to exons or exon junctions. The single-end read analysis was straightforward: a read mapping to a junction segment indicated expression of the two exons. In the paired-end read analysis, one needs to examine the pair relations before accepting the reads as valid results. Two criteria were imposed: first, both ends of a read pair must be uniquely mapped and located in the same gene; second, one end must map to the positive strand and the other to the negative strand. Normally the distance between paired ends (200 to 600 bp, resulting from the sequencing process) is a useful feature for eliminating erroneous mappings. However, sequencing was performed on fragmented transcripts, whereas the mapping locations were genomic locations. Because the genomic distance between the two ends can span introns of unknown size, I did not apply this restriction. Most of the reads mapped to known junctions (data not shown), and only those mapped to novel junctions were selected for further analysis.

The two ends of a paired-end read can map to either junctions or exons. Figure 4.3 shows the mapping scenarios for paired-end reads. Two situations in the mapping results were considered in finding novel exon junctions: first, at least one end of the pair mapped to a novel junction, while the other end mapped to a known junction, a novel junction, or within an exon; second, both ends mapped within two exons that, when checked against the UCSC annotation, were not found together in any known transcript. The first situation is a direct indication of a novel splice form; the second is an indirect signal but is as legitimate as the first. I also found some paired-end reads mapped to junctions or exons in ways that are unlikely to reflect true splice forms.
One such example is when one end mapped to a junction and the other end to a third exon that lies between the two exons forming that junction (see Figure 4.3A). Another example is when the two ends of a pair mapped to overlapping exons. One might argue that the second case indicates a potential novel exon; however, to be conservative at this point, this type of mapping was considered invalid. All irregularly mapped paired-end reads were therefore excluded.

I identified a number of splice forms that were not annotated in the database (Table 4.1). Among these were genes that play important roles in the brain, such as FBXO2 (maintaining neurons in a postmitotic state), App (amyloid beta (A4) precursor protein, functioning in neural plasticity and Alzheimer's disease), Aplp1 (amyloid beta (A4) precursor-like protein 1, functioning in synaptic maturation during cortical development), BPTF (gene regulation, related to Alzheimer's disease), and SAT2 (an amino acid transporter). Few of these novel junctions were shared between the two stages, which indicates that these novel alternative splice forms are specific to either the E18 or the P7 stage. Expression restricted to particular developmental stages, like expression restricted to particular tissues, might be one of the reasons they were not documented in the transcript annotation.

[Figure 4.3, panels A-D: a gene with exons E1-E4, showing where the two ends of a read pair map.]

Figure 4.3. Examples of paired-end reads mapped to novel junctions. (A) Invalid mapping: one end of a paired-end read mapped at J13; the other end mapped at E2. A splice form of this kind is not plausible. (B-D) Valid mappings. (B) One end of a paired-end read mapped at a novel junction (J13); the other end mapped at an exon (E3). This indicates a novel splice form of E1 and E3. Alternatively, the other end may map at an exon that is not part of the junction (e.g., E4), which indicates a novel splice form of E1, E3, and E4. (C) Both ends mapped at junctions: one end mapped at a novel junction (J13) and the other end at another novel junction (J34). This indicates a novel splice form of E1, E3, and E4. (D) Both ends mapped at exons. With paired-end information, we were able to detect novel junctions when reads mapped within exons and the exons composed a new splice form.

The same data set was mapped against genomic sequences independently, and the expression level of each gene was estimated by the sum of the reads aligned to that gene. The results were compared to understand the expression profiles at the two stages. Genes known for their importance during brain development, such as neurogenesis-related and synaptic genes, were confirmed by the analysis. A large percentage of the reads that mapped to junctions also mapped to genomic sequences (∼75% and ∼42% of single- and paired-end reads, respectively).
This indicates that the read length currently generated by next-generation sequencing technology is too short to guarantee unique mapping. Pair information reduces ambiguous or random mapping. However, to obtain the expression level of transcripts accurately, each read can be counted only once in the study. Thus, in the final result, reads mapped to both junctions and genomic sequences were excluded (Table 4.1). Reads with at least one end mapped to a novel junction cannot be mapped ambiguously in the genomic sequences, because the end mapped to the junction cannot be mapped anywhere in the genomic sequences. Reads with both ends mapped within exons in the junction study were expected to be uniquely mapped in the genomic mapping.
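The pair-level rules of this section can be summarized in a short Python sketch. It is a simplified illustration rather than the pipeline used here: the `EndHit` record and the `classify_pair` function are hypothetical names, exons are represented by their rank within a gene, and the check for ends mapped to overlapping exons is omitted because it would require exon coordinates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EndHit:
    gene: str
    strand: str              # "+" or "-"
    exons: Tuple[int, ...]   # (i,) for an exon hit, (i, j) for a junction hit
    unique: bool = True      # whether the end mapped to a single location


def classify_pair(a, b, known_junctions, known_transcripts):
    """Return 'novel', 'known', or 'invalid' for the two mapped ends of a pair."""
    # Both ends uniquely mapped, in the same gene, on opposite strands.
    if not (a.unique and b.unique) or a.gene != b.gene or a.strand == b.strand:
        return "invalid"
    junctions = [h.exons for h in (a, b) if len(h.exons) == 2]
    exon_hits = [h.exons[0] for h in (a, b) if len(h.exons) == 1]
    # Irregular case (Figure 4.3A): the other end lies in an exon strictly
    # between the two exons that form the junction.
    for i, j in junctions:
        if any(i < e < j for e in exon_hits):
            return "invalid"
    # Situation 1: at least one end on a junction that is not annotated.
    if junctions:
        if any(j not in known_junctions for j in junctions):
            return "novel"
        return "known"
    # Situation 2: both ends within exons; novel if no annotated transcript
    # contains both exons.
    e1, e2 = exon_hits
    if e1 == e2:
        return "known"
    together = any(e1 in t and e2 in t for t in known_transcripts)
    return "known" if together else "novel"


# Annotation of the hypothetical gene: T1 = E1, E2, E3 and T2 = E1, E2, E4.
KNOWN_JUNCTIONS = {(1, 2), (2, 3), (2, 4)}
KNOWN_TRANSCRIPTS = [(1, 2, 3), (1, 2, 4)]


def demo(exons_a, exons_b):
    """Classify a pair whose ends hit the given exons or junctions of gene A."""
    a = EndHit("A", "+", exons_a)
    b = EndHit("A", "-", exons_b)
    return classify_pair(a, b, KNOWN_JUNCTIONS, KNOWN_TRANSCRIPTS)
```

With this annotation, `demo((1, 3), (2,))` reproduces the invalid case of Figure 4.3A, while `demo((1, 3), (3,))`, `demo((1, 3), (3, 4))`, and `demo((3,), (4,))` correspond to the valid cases in panels B, C, and D.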

4.3 Conclusion

Short read lengths proved to be an issue, but not a critical impediment, in whole-genome mapping. Higher coverage is needed to overcome this downside. A combination of traditional capillary sequencing (which yields longer reads, typically 500 to 800 bp) with next-generation sequencing technology may be a practical approach for de novo assembly. Ultimately, longer reads combined with parallel sequencing would make sequencing a mammalian genome affordable for the general public. The error characteristics of next-generation sequencing technology have yet to be established; this will require in-depth comparison in re-sequencing studies.

RNA-Seq provides a comprehensive method to understand transcriptomes and detect novel exons and/or alternative splice forms. Here I have demonstrated a way to analyze single- and paired-end reads for the detection of novel splice forms. Though the number of reads was smaller than in other transcriptome studies (15 million vs. ∼50 million reads), we were able to find expression of over 15,000 genes, and among these, over 1,600 genes were differentially expressed between the E18 and P7 stages. Hundreds of novel splice forms were predicted. Hence we provide a transcriptome atlas that is a first step toward a full understanding of mouse brain development. A complete picture of brain development would require studies on finer time scales and in combination with epigenomic information.

At present, there is a profound need for an integrated set of tools for quality control, filtering, aligning, and interpreting short-read data. We developed a series of tools for the Galaxy system (http://galaxy.psu.edu) that at present includes tools for surveying and filtering data and for transcriptome, metagenomic, and re-sequencing analyses. For the first time, biomedical researchers are able to directly upload sequences and base quality scores, review and filter the data, align reads against appropriate databases, interpret these alignments, and visualize the results, all within a single publicly available portal.

Chapter 5

Conclusion

5.1 Summary

Computational and statistical approaches are critical for today’s high-throughput biology. In this dissertation, I integrated various types of data and applied novel methods to unravel the complexity of biology. In this final chapter, I present short summaries and unexplored directions for each of the research projects.

In chapter 2, I discussed duplicate gene evolution using information from sequences and expression data. The data indicated that pairs of duplicate genes evolve quickly and asymmetrically: the two genes in a pair have few common coexpression partners and substantial degree differences shortly after their emergence (chapter 2.2.4). As a result of this rapid divergence, redundant copies are not available to play a significant role in system robustness (chapter 2.2.6). The network maintains its connections via nodes that have no sequence similarity.

Though my approach is sufficient for understanding duplicate gene evolution in general, young duplicate gene pairs and the detailed behavior of duplicate genes within a family are yet to be investigated. Gene families can shed light on the emergence and influence of gene duplication and on network growth. In addition, the assumption of exactly matched functionality just after gene duplication may not hold in all cases, so including regulatory regions in the analyses is the next step toward a fuller picture of duplication.

In chapter 3, I identified forty dual-coding genes that were previously not known in eukaryotic genomes. First, conservation of both reading frames in closely related mammalian genomes was required. Sufficient length of the alternative reading frame and a unique nucleotide composition were the next indications of true dual-coding genes (chapter 3.2.1). Interdependent reading frames impose additional evolutionary constraints and require modifications of conventional substitution models. Thus candidate genes were subjected to tests that were adapted from nucleotide and codon substitution models (chapter 3.2.2).

The next step toward understanding the complete ARFome in mammalian genomes is to apply the same methodology to potential reading frames in the opposite transcriptional direction. Also, in this study, dual-coding genes were expected to follow the purifying selection regime; other evolutionary processes may contribute to the rise of overlapping frames and are worth exploring. Protein mass spectrometry will provide direct experimental verification of the translation of dual-coding genes.

In chapter 4, I analyzed mouse brain transcriptomes and detected hundreds of novel splice forms. The result demonstrated the feasibility of using short-read data from next-generation sequencing technology. Though the short read lengths and unknown error characteristics increase ambiguous read mapping, the problem is minimized with the aid of high throughput and paired-end data (chapter 4.2.3).

More experiments at different brain developmental stages are critical to fully understand the role of important genes or isoforms in neural development. Merging exon-exon junction mapping with genomic mapping will provide a complete and clear assessment of gene expression. Such data can also be used to identify novel exons, for example as regions that have high read coverage.

5.2 Future research interests

Computational biology is an exciting research area, and the challenges lie in improving algorithms and developing tools to integrate and describe molecular data. I present three directions that are of high interest.

Data integration and topology comparison. Various types of high-throughput whole-genome interaction data are available for model organisms, such as yeast two-hybrid, microarray, and ChIP-Seq data. Integrating data across platforms and species is the first step in systems biology studies. It is critical but algorithmically challenging to develop methods to align and analyze network topologies. One approach is to identify subgraphs (for example, n-cliques, trees, and loops) as independent functional units. Complexity grows exponentially with the number of nodes and edges considered in the subgraphs, so developing efficient methods will be a demanding task.

Discrete to continuous attributes. Most studies describe nodes and edges as having two states: on or off. But biological data are more complicated than this binary description permits. For example, multiple types of nodes (mRNAs, proteins, or enzymes) and edges (induction or inhibition) are essential in gene regulatory networks. In addition, a network is a dynamic system, and to describe its changes one needs to consider the effects of time. The unit of time can be seconds (in metabolic pathways), days (in developmental stages), or millions of years (in gene duplication). Additional variables produce more realistic models, but estimating values for more parameters is time-consuming.

Models to predict and detect missing components. Mathematical models, such as Bayesian networks, can be used to identify missing components in a graph and to predict a system’s behavior, including the outcome of in silico perturbation. Experimental falsification or validation would be fed back to the computer model and used to update it. With only partial information, missing components are hard to identify, but they are important and influence system functioning. Methodologies are yet to be developed and put to the test.

Appendix A

Supplementary materials for Chapter 2

The numerical description of the networks.

[Table: one row per network, with columns for the thresholds T and R, the number of nodes in the giant cluster, the average degree, the average shortest path length, and the average clustering coefficient. The numerical entries are not recoverable from the extracted text.]

Each row represents a network that was generated for a particular pair of thresholds (T and R).

Degree distributions of networks generated from combinations of thresholds: T (tissue) and R (Pearson correlation coefficient). The degree distribution for the network with T ≥ 7 and R ≥ 0.7 is shown in Figure 2.1.

[Figure, panels A-H: log-log plots of frequency (log10 P(k)) against degree (log10 k). The titles of panels A-C are not legible in the extracted text. (D) T ≥ 5 and R ≥ 0.7, P(k) ~ k^-1.28. (E) T ≥ 9 and R ≥ 0.7, P(k) ~ k^-1.20. (F) T ≥ 5 and R ≥ 0.9, P(k) ~ k^-1.51. (G) T ≥ 7 and R ≥ 0.9, P(k) ~ k^-1.64. (H) T ≥ 9 and R ≥ 0.9, P(k) ~ k^-1.57.]
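Exponents of the form P(k) ~ k^-γ can be estimated in several ways. The text does not state which procedure produced the values quoted for these panels, so the following Python sketch only illustrates one common choice, an ordinary least-squares fit on log-log axes; the function names are hypothetical and the input is assumed to be a plain list of node degrees.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Map each observed degree k > 0 to its relative frequency P(k)."""
    counts = Counter(d for d in degrees if d > 0)
    n = sum(counts.values())
    return {k: c / n for k, c in counts.items()}


def fit_power_law_exponent(degrees):
    """Estimate gamma in P(k) ~ k^-gamma as the negated least-squares slope
    of log10 P(k) regressed on log10 k."""
    pk = degree_distribution(degrees)
    xs = [math.log10(k) for k in pk]
    ys = [math.log10(p) for p in pk.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope


# Toy degree sequence whose frequencies fall off exactly as k^-2.
toy_degrees = [1] * 16 + [2] * 4 + [4]
gamma = fit_power_law_exponent(toy_degrees)
print(f"estimated exponent: {gamma:.2f}")
```

A least-squares fit on binned log-log data is simple but sensitive to the sparse tail of the distribution, so it should be read as a rough estimate only.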

Scatter plots of clustering coefficient c against node degree k for (A) all genes, (B) ubiquitously expressed genes, and (C) nonubiquitously expressed genes.

[Figure, panels A-F: scatter plots with degree on the horizontal axis and clustering coefficient on the vertical axis.]

The degree distribution of the studied network (T ≥ 7 and R ≥ 0.7) for (A) duplicate genes and (B) singletons.

[Figure, panels A and B: log-log plots of frequency (log10 P(k)) against degree (log10 k).]

Appendix B

Supplementary materials for Chapter 3

Figure S1. Distribution of lengths of maximal ARFs detected in 10,000 simulated alignments.

Figure S2. Distribution of lengths of maximal ARFs, based on 35,000 parametric simulations from codon model fits to orthologous gene alignments from three or four species (39 gene fits in total, each with at least 500 bp, sampled equiprobably). Only 0.29% of simulated alignments had open ARFs with 500 or more nucleotides.

Figure S3. The number of possible dual-coding genes (in parentheses) and the corresponding criteria.

Figure S4. The discovery and definition of conserved dual-coding regions from multispecies alignments. The orthologous transcripts from four species were first aligned and then translated using the second reading frame. Hence, additional start and stop codons appeared in the translation. For each of the species, an uninterrupted segment of peptides was identified (the dotted line with arrowheads at both ends), and the first start codon was marked. The region between the closest start-stop codons was defined as the ARF region. From the same set of transcripts, the regions from the beginning to the first stop codon in any one of the species, and from the last stop codon to the end of the transcript, were defined as flanking ORF regions.
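The idea of an uninterrupted segment bounded by start and stop codons in a shifted frame can be illustrated for a single sequence. The Python sketch below is a simplification and not the procedure used in chapter 3, which works on multispecies alignments and requires conservation across species; the function name `longest_arf` is hypothetical, the sequence is assumed to be uppercase DNA, and only stretches closed by a stop codon are reported.

```python
STOPS = {"TAA", "TAG", "TGA"}


def longest_arf(seq, frame=1):
    """Return (start, end) of the longest ATG-to-stop stretch in `frame`.

    `start` is the index of the first ATG after the previous stop codon and
    `end` is the index of the stop codon that closes the stretch, so
    seq[start:end] is the open region without its stop. Returns None when
    no closed stretch exists.
    """
    best = None
    start = None  # first ATG seen since the last stop codon
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in STOPS:
            if start is not None and (best is None or i - start > best[1] - best[0]):
                best = (start, i)
            start = None
        elif codon == "ATG" and start is None:
            start = i
    return best


# Toy sequence: in frame +1 it reads ATG GCC GCC GCC TAA ATG GCC TGA.
toy = "C" + "ATG" + "GCC" * 3 + "TAA" + "ATG" + "GCC" + "TGA" + "G"
region = longest_arf(toy, frame=1)
print(region, toy[region[0]:region[1]])
```

In the toy sequence the first stretch (12 nt) is longer than the second (6 nt), so the first one is returned.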

Table S1. Proportion of substitution types (in percent) in each codon position of F0 and F1, averaged over all possible nucleotide contexts.

Frame  Codon position  SS    SN    NS    NN    Stop
F0     1               3.6   0.7   61.7  25.4  8.5
F0     2               0     0     4.0   87.2  8.8
F0     3               0     65.3  0     25.6  9.1
F1     1               0     0     4.0   88.2  7.8
F1     2               0     65.3  0     25.6  9.1
F1     3               3.6   0.7   61.0  25.1  9.6

Table S2. Proportion (in percent) of prefix and suffix codons (out of 3,721 possibilities) that, for a given middle codon, do not induce a stop codon in the +1 reading frame. Brighter colors indicate less tolerated codons.

First/Second  A           C           G           T
A             AAA/86.89   ACA/86.89   AGA/86.89   ATA/41.31
              AAC/86.89   ACC/86.89   AGC/86.89   ATC/86.89
              AAG/86.89   ACG/86.89   AGG/86.89   ATG/64.10
              AAT/86.89   ACT/86.89   AGT/86.89   ATT/86.89
C             CAA/100.00  CCA/100.00  CGA/100.00  CTA/47.54
              CAC/100.00  CCC/100.00  CGC/100.00  CTC/100.00
              CAG/100.00  CCG/100.00  CGG/100.00  CTG/73.77
              CAT/100.00  CCT/100.00  CGT/100.00  CTT/100.00
G             GAA/93.44   GCA/93.44   GGA/93.44   GTA/44.42
              GAC/93.44   GCC/93.44   GGC/93.44   GTC/93.44
              GAG/93.44   GCG/93.44   GGG/93.44   GTG/68.93
              GAT/93.44   GCT/93.44   GGT/93.44   GTT/93.44
T             TAA/STOP    TCA/100.00  TGA/STOP    TTA/47.54
              TAC/100.00  TCC/100.00  TGC/100.00  TTC/100.00
              TAG/STOP    TCG/100.00  TGG/100.00  TTG/73.77
              TAT/100.00  TCT/100.00  TGT/100.00  TTT/100.00
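The entries of Table S2 can be checked by direct enumeration. The Python sketch below rests on one assumed reading of the caption: the prefix and suffix each range over the 61 sense codons (61 × 61 = 3,721 pairs), and the +1 frame codons that overlap the middle codon are formed from the last two bases of the prefix plus the first base of the middle codon, and from the last two bases of the middle codon plus the first base of the suffix. The function name `tolerated_percent` is hypothetical.

```python
from itertools import product

BASES = "ACGT"
STOPS = {"TAA", "TAG", "TGA"}
# The 61 sense codons (all 64 triplets minus the three stop codons).
SENSE_CODONS = [
    "".join(c) for c in product(BASES, repeat=3) if "".join(c) not in STOPS
]


def tolerated_percent(middle):
    """Percent of (prefix, suffix) sense-codon pairs for which neither +1
    frame codon overlapping `middle` is a stop codon."""
    total = 0
    tolerated = 0
    for prefix, suffix in product(SENSE_CODONS, repeat=2):
        total += 1
        left = prefix[1:] + middle[0]    # +1 frame codon spanning prefix/middle
        right = middle[1:] + suffix[0]   # +1 frame codon spanning middle/suffix
        if left not in STOPS and right not in STOPS:
            tolerated += 1
    return 100.0 * tolerated / total


for codon in ("AAA", "ATA", "ATG", "CAA", "GTA"):
    print(codon, f"{tolerated_percent(codon):.2f}")
```

Under this reading, a middle codon beginning with A excludes the eight prefixes ending in TA or TG (53 of 61 remain), and a middle codon ending in TA excludes the 32 suffixes beginning with A or G (29 of 61 remain), which gives 53 × 29 / 3,721 = 41.31% for ATA, as in the table.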

Table S3. GO categories of the 40 candidate genes.

#   gi #       Gene               RefSeq ID     GO categories
1   53831993   SF3A1              NM_005877     GO:0003723,GO:0000389,GO:0000398,GO:0006464,GO:0008380,GO:0005634,GO:0005681
2   4758467    GRP50              NM_004224     GO:0004872,GO:0008502,GO:0007165,GO:0007186,GO:0007267,GO:0005887,GO:0016020
3   4503680    FCGBP              NM_003890
4   18201912   FOXN1              NM_003593     GO:0003700,GO:0003704,GO:0043565,GO:0006350,GO:0006357,GO:0006952,GO:0007275,GO:0008544,GO:0009887,GO:0005634,GO:0030216,GO:0050673
5   27436942   RXRb               NM_021976     GO:0003700,GO:0003707,GO:0003713,GO:0004886,GO:0005496,GO:0005515,GO:0008270,GO:0016439,GO:0043565,GO:0046872,GO:0006350,GO:0006355,GO:0005634,GO:0005515
6   62954773   CSMD2              AB212622      GO:0016020,GO:0016021
7   31342353   ZNF598             NM_178167     GO:0003676,GO:0005515,GO:0008270,GO:0046872,GO:0005622
8   14165285   RHOBTB2            NM_015178     GO:0000166,GO:0005515,GO:0005525,GO:0007264,GO:0005622
9   24041034   NOTCH2             NM_024408     GO:0003706,GO:0004872,GO:0005509,GO:0005515,GO:0046982,GO:0001709,GO:0006350,GO:0006355,GO:0006916,GO:0006917,GO:0007050,GO:0007219,GO:0007399,GO:0008285,GO:0009887,GO:0016049,GO:0019827,GO:0030097,GO:0030154,GO:0046579,GO:0050793,GO:0005634,GO:0005887,GO:0009986,GO:0016020,GO:0005515,GO:0002011,GO:0030326,GO:0043065,GO:0007368,GO:0042060
10  6513852    PCDH8              AF061573      GO:0005509,GO:0005515,GO:0007155,GO:0007156,GO:0007267,GO:0005886,GO:0005887,GO:0016331,GO:0001756
11  37655178   AP3B2              NM_004644     GO:0005215,GO:0005488,GO:0006892,GO:0006897,GO:0015031,GO:0005905,GO:0030137
12  109891936  DLGAP4             NM_014902     GO:0007267,GO:0016020
13  48762935   CSRP3              NM_003476     GO:0008270,GO:0046872,GO:0007519,GO:0030154,GO:0005634,GO:0005515,GO:0030018,GO:0002026,GO:0006874,GO:0048738
14  4758955    BZRAP1             NM_004758     GO:0030156,GO:0008150,GO:0005737,GO:0005739
15  48255896   SEMA6C             NM_030913     GO:0004872,GO:0007399,GO:0030154,GO:0016020,GO:0016021,GO:0007275
16  38348329   LANCL3             NM_198511
17  52856410   CXXC1              NM_014593     GO:0003677,GO:0005515,GO:0008270,GO:0016563,GO:0045322,GO:0046872,GO:0006350,GO:0006355,GO:0005634,GO:0016607,GO:0016363
18  4557256    ADCY8              NM_001115     GO:0000287,GO:0008294,GO:0006171,GO:0007242,GO:0007611,GO:0005624,GO:0005886,GO:0016021
19  38176156   SPATA2             NM_006038     GO:0003674,GO:0007283,GO:0030154,GO:0005737
20  37537685   ZSCAN21            NM_145914     GO:0003700,GO:0008270,GO:0016563,GO:0046872,GO:0006350,GO:0006355,GO:0005622,GO:0005634,GO:0005634,GO:0003677
21  122114640  ZNF3               NM_032924     GO:0003676,GO:0008270,GO:0046872,GO:0006355,GO:0005622,GO:0005634,GO:0003700,GO:0006350,GO:0030154,GO:0045321
22  31317254   NLGN2              NM_020795     GO:0042043,GO:0007416,GO:0016337,GO:0045217,GO:0016021,GO:0045211,GO:0045202
23  58257667   KIAA0802           AB018345
24  27436945   LMNA               NM_170707     GO:0005198,GO:0005515,GO:0005634,GO:0005882,GO:0005634,GO:0006998,GO:0005638
25  34147467   CCDC120            NM_033626
26  28559070   DNMT3A             NM_175630     GO:0003677,GO:0003886,GO:0005515,GO:0008168,GO:0008270,GO:0016740,GO:0046872,GO:0006306,GO:0006349,GO:0000791,GO:0005634,GO:0005737,GO:0016363,GO:0003677,GO:0003886,GO:0008168,GO:0008270,GO:0016740,GO:0006306,GO:0006349,GO:0000791,GO:0005634,GO:0005737,GO:0016363,GO:0005515,GO:0000122,GO:0007283,GO:0003682,GO:0000792,GO:0006346,GO:0005720
27  13376631   ZC3H12A            NM_025079
28  53832025   IQSEC2             NM_015075     GO:0005086,GO:0032012,GO:0005622
29  18378730   BBX                NM_020235
30  113423421  Predicted protein  XM_926122
31  21071079   FBXL7              NM_012304     GO:0004842,GO:0005515,GO:0006511,GO:0000151
32  14017860   KIAA1822           AB058725      GO:0016021
33  6649056    TMEM2              AF137030      GO:0005044,GO:0001620
34  18379331   WAC                NM_100486
35  113204605  RBAK               NM_021163     GO:0003676,GO:0008270,GO:0016564,GO:0046872,GO:0006355,GO:0005622,GO:0005634
36  117189905  MINK1              NM_153827     GO:0000166,GO:0004674,GO:0005083,GO:0005524,GO:0016740,GO:0006468,GO:0006950,GO:0007243,GO:0007254,GO:0007275,GO:0045060
37  52145308   LINGO1             NM_032808     GO:0016740
38  45433544   KIAA0460           NM_015203
39  56790298   PSD                NM_002779     GO:0004871,GO:0005086,GO:0007165,GO:0032012,GO:0005575,GO:0005622
40  57165354   LPHN1              NM_001008701  GO:0004872,GO:0004930,GO:0005529,GO:0016524,GO:0007165,GO:0007218,GO:0016020,GO:0016021

Table S4. Genomic coordinates of the 40 candidate genes.

#   gi #       Gene      Position
1   53831993   SF3A1     chr22:29066245-29068236
2   4758467    GRP50     chrX:150099804-150100352
3   4503680    FCGBP     chr19:45091552-45092458
4   18201912   FOXN1     chr17:23875883-23881964
5   27436942   RXRbeta   chr6:33274168-33276170
6   62954773   CSMD2     chr1:33862705-33874588
7   31342353   ZNF598    chr16:1991583-1993660
8   14165285   RHOBTB2   chr8:22918858-22920601
9   24041034   Notch2    chr1:120313786-120349540
10  6513852    PCDH8     chr13:52319345-52319863
11  37655178   AP3B2     chr15:81129664-81132688
12  109891936  DLGAP4    chr20:34493698-34494213
13  48762935   CSRP3     chr11:19160798-19170516
14  4758955    BZRAP1    chr17:53750639-53754726
15  48255896   SEMA6C    chr1:149372252-149373545
16  38348329   LANCL3    chrX:37316077-37316616
17  52856410   CXXC1     chr18:46063056-46064320
18  4557256    ADCY8     chr8:132121215-132121748
19  38176156   SPATA2    chr20:47955607-47956197
20  37537685   ZSCAN21   chr7:99499551-99500171
21  122114640  ZNF3      chr7:99506773-99507435
22  31317254   NLGN2     chr17:7259601-7260113
23  58257667   KIAA0802  chr18:8815386-8815898
24  27436945   LMNA      chr1:154371609-154372756
25  34147467   CCDC120   chrX:48811892-48812587
26  28559070   DNMT3A    chr2:25320271-25322437
27  13376631   ZC3H12A   chr1:37721192-37721704
28  53832025   IQSEC2    chrX:53280407-53282330
29  18378730   BBX       chr3:108992777-109006989
30  113423421            chr12:79179924-79185213
31  21071079   FBXL7     chr5:15981072-15981551
32  14017860   KIAA1822  chr14:99193155-99196459
33  6649056    TMEM2     chr9:73494904-73505558
34  18379331   WAC       chr10:28864495-28919728
35  113204605  RBAK      chr7:5071193-5071729
36  117189905  MINK1     chr17:4735561-4736830
37  52145308   LRRN6A    chr15:75693455-75693994
38  45433544   KIAA0460  chr1:148711467-148711997
39  56790298   PSD       chr10:104166187-104166726
40  57165354   LPHN1     chr19:14132416-14134627

Bibliography

[1] Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403–410, 1990.

[2] JD Thompson, DG Higgins, and TJ Gibson. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 22(22):4673–4680, 1994.

[3] WR Pearson. Flexible sequence similarity searching with the fasta3 program package. Methods Mol Biol, 132:185–219, 2000.

[4] Christopher Burge. Identification of genes in human genomic dna. PhD thesis, Department of Mathematics, Stanford University, Stanford, CA 94305, March 1997.

[5] Susumu Ohno. Evolution by gene duplication. Springer-Verlag, New York, 1970.

[6] Z Gu, A Cavalcanti, FC Chen, P Bouman, and WH Li. Extent of gene duplication in the genomes of drosophila, nematode, and yeast. Mol Biol Evol, 19(3):256–262, 2002.

[7] KD Makova and WH Li. Divergence in the spatial pattern of gene expression between human duplicate genes. Genome Res, 13(7):1638–1645, 2003.

[8] R Albert and AL Barabasi. Statistical mechanics of complex networks. Reviews of Modern Physics, 74:47–96, 2002.

[9] B G Barrell, G M Air, and C A 3rd Hutchison. Overlapping genes in bacteriophage phix174. Nature, 264(5581):34–41, Nov 1976. ISSN 0028-0836 (Print).

[10] Igor B Rogozin, Alexey N Spiridonov, Alexander V Sorokin, Yuri I Wolf, I King Jordan, Roman L Tatusov, and Eugene V Koonin. Purifying and directional selection in overlapping prokaryotic genes. Trends Genet, 18(5):228–232, 2002 May. ISSN 0168-9525 (Print).

[11] Samir Wadhawan, Benjamin Dickins, and Anton Nekrutenko. Wheels within wheels: clues to the evolution of the gnas and gnal loci. Mol Biol Evol, 25(12): 2745–2757, 2008 Dec. ISSN 1537-1719 (Electronic).

[12] Anton Nekrutenko and Jianbin He. Functionality of unspliced xbp1 is required to explain evolution of overlapping reading frames. Trends Genet, 22(12):645–648, 2006 Dec. ISSN 0168-9525 (Print).

[13] Radek Szklarczyk, Jaap Heringa, Sergei Kosakovsky Pond, and Anton Nekrutenko. Rapid asymmetric evolution of a dual-coding tumor suppressor ink4a/arf locus contradicts its function. Proc Natl Acad Sci U S A, 104(31):12807–12812, Jul 2007. ISSN 0027-8424 (Print).

[14] Ugrappa Nagalakshmi, Zhong Wang, Karl Waern, Chong Shou, Debasish Raha, Mark Gerstein, and Michael Snyder. The transcriptional landscape of the yeast genome defined by rna sequencing. Science, 320(5881):1344–1349, 2008 Jun 6. ISSN 1095-9203 (Electronic).

[15] LaDeana W Hillier, Gabor T Marth, Aaron R Quinlan, David Dooling, Ginger Fewell, Derek Barnett, Paul Fox, Jarret I Glasscock, Matthew Hickenbotham, Weichun Huang, Vincent J Magrini, Ryan J Richt, Sacha N Sander, Donald A Stewart, Michael Stromberg, Eric F Tsung, Todd Wylie, Tim Schedl, Richard K Wilson, and Elaine R Mardis. Whole-genome sequencing and variant discovery in c. elegans. Nat Methods, 5(2):183–188, 2008 Feb. ISSN 1548-7105 (Electronic).

[16] Ali Mortazavi, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. Mapping and quantifying mammalian transcriptomes by rna-seq. Nat Methods, advance online publication, 2008. ISSN 1548-7105. URL http://dx.doi.org/10.1038/nmeth.1226.

[17] WH Li, Z Gu, H Wang, and A Nekrutenko. Evolutionary analyses of the human genome. Nature, 409(6822):847–849, 2001.

[18] A Wagner. Distributed robustness versus redundancy as causes of mutational robustness. Bioessays, 27(2):176–188, 2005.

[19] X Gu. Evolution of duplicate genes versus genetic robustness against null mutations. Trends Genet, 19(7):354–356, 2003.

[20] EV Koonin. Paralogs and mutational robustness linked through transcriptional reprogramming. Bioessays, 27(9):865–868, 2005.

[21] DC Krakauer and MA Nowak. Evolutionary preservation of redundant duplicated genes. Semin Cell Dev Biol, 10(5):555–559, 1999.

[22] A Wagner. Energy constraints on the evolution of gene expression. Mol Biol Evol, 22(6):1365–1374, 2005.

[23] A Force, M Lynch, FB Pickett, A Amores, YL Yan, and J Postlethwait. Preservation of duplicate genes by complementary, degenerative mutations. Genetics, 151(4):1531–1545, 1999.

[24] Jianzhi Zhang. Evolution by gene duplication: an update. Trends in Ecology and Evolution, 18(6):292–298, June 2003.

[25] X He and J Zhang. Rapid subfunctionalization accompanied by prolonged and substantial neofunctionalization in duplicate gene evolution. Genetics, 169(2):1157–1164, 2005.

[26] FA Kondrashov, IB Rogozin, YI Wolf, and EV Koonin. Selection in the evolution of gene duplications. Genome Biol, 3(2):RESEARCH0008, 2002.

[27] M Kellis, BW Birren, and ES Lander. Proof and evolutionary analysis of ancient genome duplication in the yeast saccharomyces cerevisiae. Nature, 428(6983):617–624, 2004.

[28] A Wagner. Decoupled evolution of coding region and mrna expression patterns after gene duplication: implications for the neutralist-selectionist debate. Proc Natl Acad Sci U S A, 97(12):6579–6584, 2000.

[29] Z Gu, D Nicolae, HH Lu, and WH Li. Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet, 18(12):609–613, 2002.

[30] L Huminiecki and KH Wolfe. Divergence of spatial gene expression profiles following species-specific gene duplications in human and mouse. Genome Res, 14(10A):1870–1879, 2004.

[31] IK Jordan, L Marino-Ramirez, YI Wolf, and EV Koonin. Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol, 21(11):2058–2070, 2004.

[32] S Bergmann, J Ihmels, and N Barkai. Similarities and differences in genome-wide expression data of six organisms. PLoS Biol, 2(1):E9, 2004.

[33] JM Stuart, E Segal, D Koller, and SK Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643):249–255, 2003.

[34] H Jeong, B Tombor, R Albert, ZN Oltvai, and AL Barabasi. The large-scale organization of metabolic networks. Nature, 407(6804):651–654, 2000.

[35] H Jeong, SP Mason, AL Barabasi, and ZN Oltvai. Lethality and centrality in protein networks. Nature, 411(6833):41–42, 2001.

[36] AL Barabasi and R Albert. Emergence of scaling in random networks. Science, 286(5439):509–512, 1999.

[37] AL Barabasi and ZN Oltvai. Network biology: understanding the cell’s functional organization. Nat Rev Genet, 5(2):101–113, 2004.

[38] R Albert, H Jeong, and AL Barabasi. Error and attack tolerance of complex networks. Nature, 406(6794):378–382, 2000.

[39] A Bhan, DJ Galas, and TG Dewey. A duplication growth model of gene expression networks. Bioinformatics, 18(11):1486–1493, 2002.

[40] R Pastor-Satorras, E Smith, and RV Sole. Evolving protein interaction networks through gene duplication. J Theor Biol, 222(2):199–210, 2003.

[41] SA Teichmann and MM Babu. Gene regulatory network growth by duplication. Nat Genet, 36(5):492–496, 2004.

[42] A Wagner. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol, 18(7):1283–1292, 2001.

[43] A Wagner. Asymmetric functional divergence of duplicate genes in yeast. Mol Biol Evol, 19(10):1760–1768, 2002.

[44] S Maslov, K Sneppen, KA Eriksen, and KK Yan. Upstream plasticity and downstream robustness in evolution of molecular networks. BMC Evol Biol, 4:9, 2004.

[45] AM Evangelisti and A Wagner. Molecular evolution in the yeast transcriptional regulation network. J Exp Zoolog B Mol Dev Evol, 302(4):392–411, 2004.

[46] M. J. Buck and J. D. Lieb. ChIP-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83:349–360, 2004. URL http://dx.doi.org/10.1016/j.ygeno.2003.11.004.

[47] D Figeys. Combining different ’omics’ technologies to map and validate protein-protein interactions in humans. Brief Funct Genomic Proteomic, 2(4):357–365, 2004.

[48] MB Eisen, PT Spellman, PO Brown, and D Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 95(25):14863–14868, 1998.

[49] HK Lee, AK Hsu, J Sajdak, J Qin, and P Pavlidis. Coexpression analysis of human genes across many microarray data sets. Genome Res, 14(6):1085–1094, 2004.

[50] H Ge, Z Liu, GM Church, and M Vidal. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet, 29(4):482–486, 2001.

[51] P Kemmeren, NL van Berkum, J Vilo, T Bijma, R Donders, A Brazma, and FC Holstege. Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol Cell, 9(5):1133–1143, 2002.

[52] AI Su, T Wiltshire, S Batalov, H Lapp, KA Ching, D Block, J Zhang, R Soden, M Hayakawa, G Kreiman, MP Cooke, JR Walker, and JB Hogenesch. A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci U S A, 101(16):6062–6067, 2004.

[53] M Ashburner, CA Ball, JA Blake, D Botstein, H Butler, JM Cherry, AP Davis, K Dolinski, SS Dwight, JT Eppig, MA Harris, DP Hill, L Issel-Tarver, A Kasarskis, S Lewis, JC Matese, JE Richardson, M Ringwald, GM Rubin, and G Sherlock. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 25(1):25–29, 2000.

[54] S Yi, DL Ellsworth, and WH Li. Slow molecular clocks in old world monkeys, apes, and humans. Mol Biol Evol, 19(12):2191–2198, 2002.

[55] V Katju and M Lynch. The structure and early evolution of recently arisen gene duplicates in the Caenorhabditis elegans genome. Genetics, 165(4):1793–1803, 2003.

[56] GC Conant and A Wagner. Asymmetric sequence divergence of duplicate genes. Genome Res, 13(9):2052–2058, 2003.

[57] P Zhang, Z Gu, and WH Li. Different evolutionary patterns between young duplicate genes in the human genome. Genome Biol, 4(9):R56, 2003.

[58] SF Altschul, TL Madden, AA Schaffer, J Zhang, Z Zhang, W Miller, and DJ Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17):3389–3402, 1997.

[59] L Huminiecki, AT Lloyd, and KH Wolfe. Congruence of tissue expression profiles from Gene Expression Atlas, SAGEmap and TissueInfo databases. BMC Genomics, 4(1):31, 2003.

[60] DJ Watts and SH Strogatz. Collective dynamics of ’small-world’ networks. Nature, 393(6684):440–442, 1998.

[61] TH Cormen, CE Leiserson, RL Rivest, and C Stein. Introduction to algorithms. The MIT Press, 2001.

[62] Z Yang and R Nielsen. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol, 17(1):32–43, 2000.

[63] Z Yang. PAML: a program package for phylogenetic analysis by maximum likelihood. Comput Appl Biosci, 13(5):555–556, 1997.

[64] RL Ott. An introduction to statistical methods and data analysis. Duxbury Press, 1993.

[65] Han Liang and Laura F Landweber. A genome-wide study of dual coding regions in human alternatively spliced genes. Genome Res, 16(2):190–196, 2006 Feb. ISSN 1088-9051 (Print).

[66] Marcella Calfon, Huiqing Zeng, Fumihiko Urano, Jeffery H Till, Stevan R Hubbard, Heather P Harding, Scott G Clark, and David Ron. IRE1 couples endoplasmic reticulum load to secretory capacity by processing the XBP-1 mRNA. Nature, 415(6867):92–96, 2002 Jan 3. ISSN 0028-0836 (Print).

[67] M Klemke, R H Kehlenbach, and W B Huttner. Two overlapping reading frames in a single exon encode interacting proteins–a novel way of gene usage. EMBO J, 20(14):3849–3860, 2001 Jul 16. ISSN 0261-4189 (Print).

[68] H Yoshida, T Matsui, A Yamamoto, T Okada, and K Mori. XBP1 mRNA is induced by ATF6 and spliced by IRE1 in response to ER stress to produce a highly active transcription factor. Cell, 107(7):881–891, 2001 Dec 28. ISSN 0092-8674 (Print).

[69] D E Quelle, F Zindy, R A Ashmun, and C J Sherr. Alternative reading frames of the INK4a tumor suppressor gene encode two unrelated proteins capable of inducing cell cycle arrest. Cell, 83(6):993–1000, 1995 Dec 15. ISSN 0092-8674 (Print).

[70] Kathleen Freson, Jaak Jaeken, Monique Van Helvoirt, Francis de Zegher, Christine Wittevrongel, Chantal Thys, Marc F Hoylaerts, Jos Vermylen, and Chris Van Geet. Functional polymorphisms in the paternally expressed XLalphas and its cofactor ALEX decrease their mutual interaction and enhance receptor-mediated cAMP formation. Hum Mol Genet, 12(10):1121–1130, 2003 May 15. ISSN 0964-6906 (Print).

[71] Hiderou Yoshida, Masaya Oku, Mie Suzuki, and Kazutoshi Mori. pXBP1(U) encoded in XBP1 pre-mRNA negatively regulates unfolded protein response activator pXBP1(S) in mammalian ER stress response. J Cell Biol, 172(4):565–575, 2006 Feb 13. ISSN 0021-9525 (Print).

[72] Norman E Sharpless. INK4a/ARF: a multifunctional tumor suppressor locus. Mutat Res, 576(1-2):22–38, 2005 Aug 25. ISSN 0027-5107 (Print).

[73] P K Keese and A Gibbs. Origins of genes: "big bang" or continuous creation? Proc Natl Acad Sci U S A, 89(20):9489–9493, 1992 Oct 15. ISSN 0027-8424 (Print).

[74] Anton Nekrutenko, Samir Wadhawan, Paula Goetting-Minesky, and Kateryna D. Makova. Oscillating evolution of a mammalian locus with overlapping reading frames: An XLalphas/ALEX relay. PLoS Genetics, 1(2), 2005. URL http://dx.doi.org/10.1371%2Fjournal.pgen.0010018.

[75] Martin Schroder and Randal J Kaufman. The mammalian unfolded protein response. Annu Rev Biochem, 74:739–789, 2005. ISSN 0066-4154 (Print).

[76] W James Kent. BLAT–the BLAST-like alignment tool. Genome Res, 12(4):656–664, 2002. ISSN 1088-9051 (Print).

[77] Wen-Hsiung Li. Molecular evolution. Sinauer, Sunderland, Massachusetts, 1997.

[78] Sergei L Kosakovsky Pond, Simon D W Frost, and Spencer V Muse. HyPhy: hypothesis testing using phylogenies. Bioinformatics, 21(5):676–679, 2005 Mar 1. ISSN 1367-4803 (Print).

[79] Dermot M F Cooper. Regulation and organization of adenylyl cyclases and cAMP. Biochem J, 375(Pt 3):517–529, 2003 Nov 1. ISSN 1470-8728 (Electronic).

[80] Andrew J Crossthwaite, Antonio Ciruela, Timothy F Rayner, and Dermot M F Cooper. A direct interaction between the N terminus of adenylyl cyclase AC8 and the catalytic subunit of protein phosphatase 2A. Mol Pharmacol, 69(2):608–617, 2006 Feb. ISSN 0026-895X (Print).

[81] Xiaohua Shen, Ronald E Ellis, Kenjiro Sakaki, and Randal J Kaufman. Genetic interactions due to constitutive and inducible gene regulation mediated by the unfolded protein response in C. elegans. PLoS Genet, 1(3):e37, 2005 Sep. ISSN 1553-7404 (Electronic).

[82] Karen E Smith, Chen Gu, Kent A Fagan, Biao Hu, and Dermot M F Cooper. Residence of adenylyl cyclase type 8 in caveolae is necessary but not sufficient for regulation by capacitative Ca(2+) entry. J Biol Chem, 277(8):6025–6031, 2002 Feb 22. ISSN 0021-9258 (Print).

[83] Yuanming Hu, Cindy Leo, Simon Yu, Betty C B Huang, Hank Wang, Mary Shen, Ying Luo, Sarkiz Daniel-Issakani, Donald G Payan, and Xiang Xu. Identification and functional characterization of a novel human misshapen/Nck interacting kinase-related kinase, hMINK beta. J Biol Chem, 279(52):54387–54397, 2004 Dec 24. ISSN 0021-9258 (Print).

[84] Kunbin Qu, Yanmei Lu, Nan Lin, Rajinder Singh, Xiang Xu, Donald G Payan, and Dong Xu. Computational and experimental studies on human misshapen/NIK-related kinase MINK-1. Curr Med Chem, 11(5):569–582, 2004 Mar. ISSN 0929-8673 (Print).

[85] I Dan, N M Watanabe, T Kobayashi, K Yamashita-Suzuki, Y Fukagaya, E Kajikawa, W K Kimura, T M Nakashima, K Matsumoto, J Ninomiya-Tsuji, and A Kusumi. Molecular cloning of MINK, a novel member of mammalian GCK family kinases, which is up-regulated during postnatal mouse cerebral development. FEBS Lett, 469(1):19–23, 2000 Mar 3. ISSN 0014-5793 (Print).

[86] K Hamada, S L Gleason, B Z Levi, S Hirschfeld, E Appella, and K Ozato. H-2RIIBP, a member of the nuclear hormone receptor superfamily that binds to both the regulatory element of major histocompatibility class I genes and the estrogen response element. Proc Natl Acad Sci U S A, 86(21):8289–8293, 1989 Nov. ISSN 0027-8424 (Print).

[87] K Fleischhauer, J H Park, J P DiSanto, M Marks, K Ozato, and S Y Yang. Isolation of a full-length cDNA clone encoding a N-terminally variant form of the human retinoid X receptor beta. Nucleic Acids Res, 20(7):1801, 1992 Apr 11. ISSN 0305-1048 (Print).

[88] Boaz Tirosh, Neal N Iwakoshi, Laurie H Glimcher, and Hidde L Ploegh. Rapid turnover of unspliced XBP-1 as a factor that modulates the unfolded protein response. J Biol Chem, 281(9):5852–5860, 2006 Mar 3. ISSN 0021-9258 (Print).

[89] M Kozak. Extensively overlapping reading frames in a second mammalian gene. EMBO Rep, 2(9):768–769, 2001 Sep. ISSN 1469-221X (Print).

[90] Kazutoshi Mori. Frame switch splicing and regulated intramembrane proteolysis: key words to understand the unfolded protein response. Traffic, 4(8):519–528, 2003 Aug. ISSN 1398-9219 (Print).

[91] John C Marioni, Christopher E Mason, Shrikant M Mane, Matthew Stephens, and Yoav Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res, 18(9):1509–1517, 2008 Sep. ISSN 1088-9051 (Print).

[92] Marc Sultan, Marcel H Schulz, Hugues Richard, Alon Magen, Andreas Klingenhoff, Matthias Scherf, Martin Seifert, Tatjana Borodina, Aleksey Soldatov, Dmitri Parkhomchuk, Dominic Schmidt, Sean O’Keeffe, Stefan Haas, , Hans Lehrach, and Marie-Laure Yaspo. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, 321(5891):956–960, 2008 Aug 15. ISSN 1095-9203 (Electronic).

[93] Eric T Wang, Rickard Sandberg, Shujun Luo, Irina Khrebtukova, Lu Zhang, Christine Mayr, Stephen F Kingsmore, Gary P Schroth, and Christopher B Burge. Alternative isoform regulation in human tissue transcriptomes. Nature, 456(7221):470–476, 2008 Nov 27. ISSN 1476-4687 (Electronic).

[94] David R Bentley, Shankar Balasubramanian, Harold P Swerdlow, Geoffrey P Smith, John Milton, Clive G Brown, Kevin P Hall, Dirk J Evers, Colin L Barnes, Helen R Bignell, Jonathan M Boutell, Jason Bryant, Richard J Carter, R Keira Cheetham, Anthony J Cox, Darren J Ellis, Michael R Flatbush, Niall A Gormley, Sean J Humphray, Leslie J Irving, Mirian S Karbelashvili, Scott M Kirk, Heng Li, Xiaohai Liu, Klaus S Maisinger, Lisa J Murray, Bojan Obradovic, Tobias Ost, Michael L Parkinson, Mark R Pratt, Isabelle M J Rasolonjatovo, Mark T Reed, Roberto Rigatti, Chiara Rodighiero, Mark T Ross, Andrea Sabot, Subramanian V Sankar, Aylwyn Scally, Gary P Schroth, Mark E Smith, Vincent P Smith, Anastassia Spiridou, Peta E Torrance, Svilen S Tzonev, Eric H Vermaas, Klaudia Walter, Xiaolin Wu, Lu Zhang, Mohammed D Alam, Carole Anastasi, Ify C Aniebo, David M D Bailey, Iain R Bancarz, Saibal Banerjee, Selena G Barbour, Primo A Baybayan, Vincent A Benoit, Kevin F Benson, Claire Bevis, Phillip J Black, Asha Boodhun, Joe S Brennan, John A Bridgham, Rob C Brown, Andrew A Brown, Dale H Buermann, Abass A Bundu, James C Burrows, Nigel P Carter, Nestor Castillo, Maria Chiara E Catenazzi, Simon Chang, R Neil Cooley, Natasha R Crake, Olubunmi O Dada, Konstantinos D Diakoumakos, Belen Dominguez-Fernandez, David J Earnshaw, Ugonna C Egbujor, David W Elmore, Sergey S Etchin, Mark R Ewan, Milan Fedurco, Louise J Fraser, Karin V Fuentes Fajardo, W Scott Furey, David George, Kimberley J Gietzen, Colin P Goddard, George S Golda, Philip A Granieri, David E Green, David L Gustafson, Nancy F Hansen, Kevin Harnish, Christian D Haudenschild, Narinder I Heyer, Matthew M Hims, Johnny T Ho, Adrian M Horgan, Katya Hoschler, Steve Hurwitz, Denis V Ivanov, Maria Q Johnson, Terena James, T A Huw Jones, Gyoung-Dong Kang, Tzvetana H Kerelska, Alan D Kersey, Irina Khrebtukova, Alex P Kindwall, Zoya Kingsbury, Paula I Kokko-Gonzales, Anil Kumar, Marc A Laurent, Cynthia T Lawley, Sarah E Lee, Xavier Lee, Arnold K Liao, Jennifer A Loch, Mitch Lok, Shujun Luo, Radhika M Mammen, John W Martin, Patrick G McCauley, Paul McNitt, Parul Mehta, Keith W Moon, Joe W Mullens, Taksina Newington, Zemin Ning, Bee Ling Ng, Sonia M Novo, Michael J O’Neill, Mark A Osborne, Andrew Osnowski, Omead Ostadan, Lambros L Paraschos, Lea Pickering, Andrew C Pike, Alger C Pike, D Chris Pinkard, Daniel P Pliskin, Joe Podhasky, Victor J Quijano, Come Raczy, Vicki H Rae, Stephen R Rawlings, Ana Chiva Rodriguez, Phyllida M Roe, John Rogers, Maria C Rogert Bacigalupo, Nikolai Romanov, Anthony Romieu, Rithy K Roth, Natalie J Rourke, Silke T Ruediger, Eli Rusman, Raquel M Sanches-Kuiper, Martin R Schenker, Josefina M Seoane, Richard J Shaw, Mitch K Shiver, Steven W Short, Ning L Sizto, Johannes P Sluis, Melanie A Smith, Jean Ernest Sohna Sohna, Eric J Spence, Kim Stevens, Neil Sutton, Lukasz Szajkowski, Carolyn L Tregidgo, Gerardo Turcatti, Stephanie Vandevondele, Yuli Verhovsky, Selene M Virk, Suzanne Wakelin, Gregory C Walcott, Jingwen Wang, Graham J Worsley, Juying Yan, Ling Yau, Mike Zuerlein, Jane Rogers, James C Mullikin, Matthew E Hurles, Nick J McCooke, John S West, Frank L Oaks, Peter L Lundberg, David Klenerman, Richard Durbin, and Anthony J Smith. Accurate whole human genome sequencing using reversible terminator chemistry. Nature, 456(7218):53–59, 2008 Nov 6. ISSN 1476-4687 (Electronic).

[95] Andreas von Bubnoff. Next-generation sequencing: the race is on. Cell, 132(5):721–723, 2008 Mar 7. ISSN 1097-4172 (Electronic).

[96] Eric S. Lander and Michael S. Waterman. Genomic mapping by fingerprinting random clones: A mathematical analysis. Genomics, 2(3):231–239, 1988. ISSN 0888-7543. URL http://www.sciencedirect.com/science/article/B6WG1-4DYM8X6-7P/2/31dad4cadb98506162f88af3969fc4f3.

[97] Nava Whiteford, Niall Haslam, Gerald Weber, Adam Prugel-Bennett, Jonathan W Essex, Peter L Roach, Mark Bradley, and Cameron Neylon. An analysis of the feasibility of short read sequencing. Nucleic Acids Res, 33(19):e171, 2005. ISSN 1362-4962 (Electronic).

[98] MJ Chaisson, D Brinza, and PA Pevzner. De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res, 2009 Jan 13. ISSN 1088-9051 (Print).

[99] Heng Li, Jue Ruan, and Richard Durbin. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res, 18(11):1851–1858, 2008 Nov. ISSN 1088-9051 (Print).

[100] Hao Lin, Zefeng Zhang, Michael Q Zhang, Bin Ma, and Ming Li. ZOOM! Zillions of oligos mapped. Bioinformatics, 24(21):2431–2437, 2008 Nov 1. ISSN 1460-2059 (Electronic).

[101] Andrew D Smith, Zhenyu Xuan, and Michael Q Zhang. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics, 9:128, 2008. ISSN 1471-2105 (Electronic).

Vita

Wen-Yu Chung

Wen-Yu Chung was born on November 27, 1977, in Kaohsiung, Taiwan. She received her B.S. and M.S. degrees in Computer Science from National Tsing-Hua University in Hsin-Chu, Taiwan. From 2002 to 2003, she was a research assistant in Dr. Der-Tsai Lee’s lab at Academia Sinica in Taipei, Taiwan. She joined the Pennsylvania State University in the fall of 2003. Her research interests include computational biology, systems biology, data mining, and graph theory.