A Thesis

entitled

Mega-scale Investigation of Codon Bias in Vertebrates

by

Maryam Nabiyouni

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the

Master of Science Degree in Biomedical Sciences

Dr. Alexei Fedorov, Committee Chair

Dr. Robert Blumenthal, Committee Member

Dr. Sadik Khuder, Committee Member

Dr. James Willey, Committee Member

Dr. Patricia R. Komuniecki, Dean, College of Graduate Studies

University of Toledo

August 2011

Abstract

Although synonymous codon usage in mammals has been investigated for decades, the interpretation of the observed results remains controversial. Selectionism cannot explain the strong regularities in codon bias, while neutralism cannot account for the unusually high frequency of G or C (~60%) in the third codon positions across all mammals. We performed a genome-wide computational analysis of synonymous codon usage with respect to local genomic GC-content, GC-content in the first and second codon positions, and overall genome length and GC-content. In this study, the Codon Bias Index (CBI) is used to measure the codon bias in individual genes. The presented data were gathered from multiple available databases, including the Codon Usage Database, Animal Genome Size Database, BioGPS, and Exon-Intron Database. Our results suggest that local GC-content is a major contributor to the non-randomness of codon usage in mammals. Based on these results, we propose a unified hypothesis for the origin of GC-rich and GC-poor isochores and codon bias in mammals and vertebrates.


Acknowledgments

I would like to express my gratitude to my advisor, Dr. Alexei Fedorov, for his patience, motivation, enthusiasm, and knowledge. I am also thankful to the other members of my thesis committee, Dr. Robert Blumenthal, Dr. Sadik Khuder, and Dr. James Willey, for their guidance.

I would also like to thank my labmates in the Bioinformatics, Proteomics and Genomics (BPG) computer laboratory at the University of Toledo, especially Dr. Ashwin Prakash for providing me with the expression level data used in this thesis, as well as Andrew McSweeny, Dr. Samuel Shepard, and Peter Bazeley, who gave me great advice in computer programming. I am also thankful to Joanne Gray in the BPG program for all her support.

Finally, I am most indebted to my family and friends, especially my parents, for their love and support throughout my life.


Table of Contents

Abstract
Acknowledgments
Table of Contents
Codon Bias Phenomenon Overview
Literature
Statement of the Question
Investigation of Codon Bias in Vertebrates
Discussion
Conclusions
Appendices
Appendix A: Source Code for CBI-Calculator.pl
Appendix B: Source Code for GC-Content-Calculator.pl
References

1. Codon Bias Phenomenon Overview

Genetic information is carried from the chromosomes to the ribosomes through mRNA. Then, the “ribonucleotides are read by translational machinery into triplets called codons” (Ross and Orlowski, 1982). Out of 64 possible codons, UAA, UAG, and UGA signal the termination of translation and the end of a polypeptide chain, while the remaining 61 codons specify amino acids. Since 61 codons code for 20 amino acids, most amino acids are encoded by more than one codon.

Different codons that are translated into the same amino acid are called synonymous codons. Indeed, two amino acids are each encoded by one codon, nine are encoded by two synonymous codons, one amino acid is encoded by three synonymous codons, five are encoded by four synonymous codons, and three amino acids are encoded by six synonymous codons. For instance, while the amino acid Phenylalanine (Phe) is encoded only by UUU and UUC, Serine is encoded by UCU, UCC, UCA, UCG, AGU and AGC.

Table 1.1 illustrates the genetic code, including codons and their corresponding amino acids.
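The degeneracy classes described above can be checked with a short script. The following is a minimal sketch, assuming the standard genetic code written in the NCBI ordering (bases T, C, A, G, with T standing in for U); the names `CODON_TABLE` and `degeneracy_classes` are illustrative and do not come from the thesis programs.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks a stop codon.
# DNA letters are used here, so T corresponds to U in the mRNA codons of the text.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy_classes(table=CODON_TABLE):
    """Map 'number of synonymous codons' to 'number of amino acids in that class'."""
    codons_per_aa = Counter(aa for aa in table.values() if aa != "*")
    return dict(sorted(Counter(codons_per_aa.values()).items()))


if __name__ == "__main__":
    sense_codons = sum(1 for aa in CODON_TABLE.values() if aa != "*")
    print("sense codons:", sense_codons)
    print("degeneracy classes:", degeneracy_classes())
```

The class sizes sum to 20 amino acids and the codon counts sum to the 61 sense codons, which is a quick consistency check on the figures in the text.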


Table 1.1 The genetic code. Green represents start and red represents stop codons (Joseph and Schild, 2010).

Now, one might be interested to know whether particular synonymous codons are used to encode amino acids with identical frequencies in different regions of the same genome. We now know that synonymous codons are used with diverse frequencies within different organisms. Some codons such as UUG, CCA, and GCA are almost entirely absent in organisms like Thermus thermophilus (Plotkin and Kudla, 2010). Such differences in codon frequencies exist in coding regions, and some suggest that they are universal among all organisms (Bulmer, 1988; Desai, 2004). This phenomenon is called Codon Usage Bias or Codon Bias (CB) (Hershberg and Petrov, 2008). Bias in the usage of codons might be due to a variety of factors such as the GC-content, preference for codons with G or C at the third nucleotide position (GC3) (Lafay et al., 1999), a leading strand richer in G+T than a lagging strand (Lafay et al., 1999), and horizontal gene transfer, which introduces chromosome segments of non-native base composition (Mozer et al., 1999). Table 1.2 demonstrates the existence of codon bias in different species.

Table 1.2 Relative frequencies of leucine synonymous codons in four species (Codon Usage Bias, science.jrank.org)


The codon bias phenomenon is clearly demonstrated in the above table. It illustrates that the CTG codon accounts for about half of the encoded leucines in one species, while its usage drops about fivefold in another.

Variation in codon usage within the coding regions of different genes from the same species, or between species, suggests a critical role for codon bias. Understanding the codon usage bias phenomenon and the forces that give rise to it may help us understand other biological processes such as gene regulation and evolution.

In this study, I will investigate the frequency of synonymous codon usage in different species. In particular, I will focus on the creation and maintenance of codon bias and its variation among genes. I will also study different regions of the human genome with respect to their GC and AT content along with their level of gene expression.

Finally, I will present my hypothesis on the origin of genomic GC-rich regions, commonly called GC isochores.


2. Literature

2.1. Available online tools

Numerous studies on the creation and maintenance of codon usage bias indicate its important effect on gene expression, translation, and evolution. Several algorithms and formulas have been created to describe codon bias.

2.1.1. Codon Adaptation Index Calculator

In 2008, Pere Puigbo and colleagues created a program called CAIcal, which calculates the Codon Adaptation Index (CAI) of nucleotide sequences. CAI is a universal measure of codon bias (Carbone et al., 2003) that compares the relative codon usage of a gene to the codon usage of highly expressed genes (Peden, 1997). Highly expressed genes are chosen from a large pool of genes such as those encoding ribosomal proteins, elongation factors, proteins involved in glycolysis, histone proteins in eukaryotes, and outer membrane proteins in prokaryotes (Carbone et al., 2003). This tool is free to download and is available online at http://genomes.urv.es/CAIcal/.


Figure 2.1. The CAIcal web-server is used to measure the Codon Adaptation Index (CAI)

The CAI web-server computes codon adaptation index in the input sequences and measures the similarities between the synonymous codon usage of an input sequence (a gene) and the synonymous codon frequency of a reference set like highly expressed genes.


2.1.2 CodonO

CodonO is another online tool for the analysis of synonymous codon usage bias within and across genomes, which is available online at http://sysbio.cvm.msstate.edu/CodonO/.

There are many pre-loaded genetic codes available in the settings tool bar including the universal and vertebrate mitochondrial genetic code. This web-server uses information from chosen genetic codes to perform different computational methods and measures synonymous codon usage bias.

Figure 2.2. CodonO is used to analyze synonymous codon usage bias

CodonO provides users with options like measuring the codon usage bias of available pre-loaded prokaryotic and eukaryotic genomes through the Run CodonO on Organism tool bar. Moreover, they can choose to upload their own sequence files or submit sequences individually.

2.1.3 CodonW

CodonW, which is supported by the mobyle@pasteur website, is another online tool freely available at http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::codonw.

Figure 2.3. CodonW is used to measure Codon Bias and Codon Adaptation Indices

CodonW, like the previous tools, analyzes the codon usage bias of input sequences. Users can apply the default universal genetic code, or they can change it to other available options such as the Vertebrate and Yeast Mitochondrial genetic codes. This program calculates several indices, which helps users investigate the codon usage of input sequences in a customized manner. For example, the Codon Adaptation Index (CAI) of different pre-loaded genomes like Escherichia coli, Bacillus subtilis, and more can be calculated using the following formula (Peden, 1997).

$$\mathrm{CAI} = \exp\left(\frac{1}{L}\sum_{k=1}^{L}\ln W_k\right)$$

$W_k$ is the ratio of the usage of each codon to the usage of all synonymous codons for the same amino acid within each sequence.

$$W_k = \frac{X_k}{Y_k}$$

$X_k$ = Frequency of the particular codon coding for a specific amino acid within the sequence

$Y_k$ = Total frequency of that amino acid within the sequence

$L$ = Total number of codons within the sequence

$k$ = Position of the codon within the sequence, e.g. 1st, 2nd, … codon

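For instance, in the following sequence, the CAI is calculated as follows, starting with the weights for the amino acid Isoleucine:

AUC GAC UAC AUC GAA AUA
Ile Asp Tyr Ile Glu Ile

For the Isoleucine codon AUC, with $L = 6$:

$$W_k = \frac{X_k}{Y_k} = \frac{2}{3}$$

$$\mathrm{CAI} = \exp\left(\frac{1}{6}\left[\ln\tfrac{2}{3} + \ln 1 + \ln 1 + \ln\tfrac{2}{3} + \ln 1 + \ln\tfrac{1}{3}\right]\right) \approx 0.73$$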
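The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the within-sequence formulation given in this section (the weight of each codon is its count divided by the count of its amino acid in the same sequence); it is not the CodonW implementation. The names `EXAMPLE_CODE` and `cai_within_sequence` are hypothetical, and the codon table is deliberately limited to the codons of the example.

```python
import math
from collections import Counter

# Codon -> amino acid, restricted to the codons of the worked example
# (an assumption made to keep the sketch short).
EXAMPLE_CODE = {
    "AUC": "Ile", "AUA": "Ile",
    "GAC": "Asp", "UAC": "Tyr", "GAA": "Glu",
}


def cai_within_sequence(codons, code=EXAMPLE_CODE):
    """CAI = exp((1/L) * sum(ln W_k)) with W_k = X_k / Y_k, as defined in the text."""
    codon_counts = Counter(codons)                # X_k: count of each codon
    aa_counts = Counter(code[c] for c in codons)  # Y_k: count of each amino acid
    log_sum = sum(
        math.log(codon_counts[c] / aa_counts[code[c]]) for c in codons
    )
    return math.exp(log_sum / len(codons))


if __name__ == "__main__":
    sequence = "AUC GAC UAC AUC GAA AUA".split()
    print(round(cai_within_sequence(sequence), 2))
```

Using exact fractions (2/3 and 1/3) rather than the rounded weights gives a value that rounds to the 0.73 reported above.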

In addition to CAI, CodonW enables its users to measure several indices such as the Effective Number of Codons, GC content of the gene, GC content of the 3rd codon position, base composition at synonymous third codon positions, total number of synonymous and non-synonymous codons, and many more. The following shows how the Codon Bias Index (CBI) and Frequency of Optimal Codons (FOP) are calculated.

The Codon Bias Index (CBI) is calculated to illustrate the strength of codon bias using the formula below (Bennetzen and Hall, 1982).

$$\mathrm{CBI} = \frac{N_{opt} - N_{rand}}{N_{tot} - N_{rand}}$$

$N_{opt}$ = Number of optimal codons in the sequence

$N_{rand}$ = Expected number of optimal codons under random choice among synonymous codons (for instance, if two synonymous codons code for one amino acid, then each occurrence of that amino acid contributes 0.5 to $N_{rand}$)

$N_{tot}$ = Total number of codons

CBI can be as low as -1 and as high as +1. If a sequence contains mostly rare codons, its CBI will be closer to -1. Random sequences will have a CBI value close to 0, and sequences that contain mostly optimal codons show a CBI value close to +1.


For example, the codon bias index is calculated below for the following human sequence. UUC and AGC are optimal codons based on the human codon usage table, and the rest are non-optimal codons.

UUC UUU AGC AAU GGG UGG
Phe Phe Ser Asn Gly Trp

$$\mathrm{CBI} = \frac{2 - (0.5 + 0.5 + 0.15 + 0.5 + 0.25)}{5 - (0.5 + 0.5 + 0.15 + 0.5 + 0.25)} = 0.03$$

Tryptophan, Methionine, and Aspartic Acid are excluded since they did not have or show codon bias in yeast (Bennetzen and Hall, 1982) and human.
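The arithmetic of this example can be written out as a short function. The sketch below takes the three quantities of the CBI formula as inputs and plugs in the values used in the example above (two optimal codons, five counted codons, and the per-codon random expectations listed in the text); the function name `codon_bias_index` is illustrative.

```python
def codon_bias_index(n_opt, n_rand, n_tot):
    """CBI = (N_opt - N_rand) / (N_tot - N_rand)."""
    if n_tot == n_rand:
        raise ValueError("CBI is undefined when N_tot equals N_rand")
    return (n_opt - n_rand) / (n_tot - n_rand)


if __name__ == "__main__":
    # Random expectations for Phe, Phe, Ser, Asn, Gly as given in the example;
    # Trp is excluded, so five codons are counted.
    random_expectations = [0.5, 0.5, 0.15, 0.5, 0.25]
    n_rand = sum(random_expectations)
    print(round(codon_bias_index(n_opt=2, n_rand=n_rand, n_tot=5), 2))
```

A sequence made entirely of optimal codons gives +1 with this function, in line with the range described for the index.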

Frequency of Optimal Codons (FOP) is the ratio of the optimal codons to the rest of the codons (regardless of their encoded amino acids) within a sequence.

FOP = Number of Optimal Codons / Number of Non-optimal Codons (Ikemura, 1985; Peden, 1997)

Optimal codons are the codons most often used to encode a specific amino acid in the organism of interest, based on the data from the Codon Usage Database.

The following is an example of measuring the frequency of optimal codons. UUC and AGC are the optimal codons.

UUC UUU AGC AAU GGG
Phe Phe Ser Asn Gly

$$\mathrm{FOP} = \frac{2}{3} = 0.66$$


2.1.4 Galaxy

Galaxy is another free web-server that can be used to measure indices like the Codon Adaptation Index (CAI). Galaxy can be found at http://main.g2.bx.psu.edu/.

Figure 2.4. Galaxy is used to measure Codon Adaptation Index

Similar to the previous online tools, users can upload their sequence of interest. The EMBOSS link on this page has many pre-loaded algorithms that can be used to investigate different aspects of an input sequence, including the CAI.


2.1.5 JCat

According to Andreas Grote and colleagues, the JCat software was developed in 2005 to understand the mechanism of adaptation of the codon usage of a target gene to its potential expression host. This program enables users to adapt the codon usage of their gene of interest to most sequenced prokaryotic and eukaryotic organisms. This method can help researchers advance the modeling of protein production. JCat, like some of the previous web-servers, provides its users with both graphical and numerical values for the Codon Adaptation Index (CAI) of input sequences. Also, users can choose all genes of a specific organism and adapt them to another organism like Escherichia coli. This program is freely available at http://www.jcat.de/.

Figure 2.5. JCat is used to measure adaptation level of target genes to their potential hosts.


The numerous online programs developed for measuring the codon adaptation index, codon usage bias, and other related indices show the importance of such factors in the biological processes of various organisms. Diverse, crucial roles have been attributed to codon usage by well-known scientists such as Toshimichi Ikemura and others. Many have also tried to explain the origin and maintenance of this phenomenon. The upcoming sections highlight these roles and explanations.

2.2. Factors that Create and Maintain Codon Usage Bias

Several explanations have been suggested for the creation and maintenance of codon usage bias. Among the many proposed ideas, a balance between different forces, especially selection, mutation, and drift, seems to be widely accepted (Hershberg and Petrov, 2008; Prat Y et al., 2009). These forces are not mutually exclusive. In fact, codon usage is probably maintained by a balance between selection, mutational pressures, and drift (Duret, 2002). Present evidence in Escherichia coli suggests that non-optimal codons can be maintained within genes because of conflicting selective pressures (Duret, 2002). Also, in eukaryotic genes, some regulatory elements involved in splicing or mRNA stability are located in the coding sequences and exons (Blencowe BJ, 2000).

As stated previously, the mutational or neutral explanations, as well as the selection explanations are two of the most accepted responses the scientific community has had for the observation of codon usage bias (Hershberg and Petrov, 2008).


The mutational or neutral explanation suggests that codon bias is created by a higher mutation rate in some nucleotides or dinucleotides than in others. In other words, non-randomness in the mutational patterns creates differences in usage frequencies among synonymous codons. Since these mutations do not influence the fitness of organisms, they are called neutral mutations (Plotkin and Kudla, 2010). There are multiple reasons for different substitution and mutation patterns across genomic regions (Duret, 2002). First, the mutation rate differs based on the nature of genomic sequences. Hypermutability of CpG di-nucleotides (CpG is shorthand for -C-phosphate-G- and denotes a cytosine adjacent to a guanine in a linear DNA sequence) is one of the most frequently used examples of the mutational explanation, since the cytosine in a CpG provides a favorable site for methylation by methyltransferase enzymes. Later, the methylated cytosine can mutate to thymine via deamination (Funk and Person, 1969; Lutsenko and Bhagwat, 1999). Second, insertions and deletions are more likely to happen in intergenic regions and introns (Duret, 2002). Third, transition (mutation of a purine into another purine or a pyrimidine into another pyrimidine) and transversion (mutation of a purine into a pyrimidine or a pyrimidine into a purine) substitution patterns vary among different codon positions and non-coding regions (Duret, 2002).
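The transition and transversion definitions above translate directly into a small classifier. The following is a minimal sketch; the function name `substitution_type` is illustrative and is not part of the thesis programs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref, alt):
    """Classify a single-nucleotide substitution as a transition or a transversion."""
    # Treat RNA input (U) as DNA (T) so both alphabets are accepted.
    ref = ref.upper().replace("U", "T")
    alt = alt.upper().replace("U", "T")
    valid = PURINES | PYRIMIDINES
    if ref not in valid or alt not in valid:
        raise ValueError("nucleotides must be one of A, C, G, T (or U)")
    if ref == alt:
        raise ValueError("reference and alternative nucleotides are identical")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    # Deamination of methylated cytosine to thymine, as in the CpG example
    print(substitution_type("C", "T"))
    print(substitution_type("A", "C"))
```

The C to T change produced by deamination of methylated cytosine is classified as a transition, since both nucleotides are pyrimidines.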

Also, because neutral processes do not discriminate among synonymous codons, they are widely accepted as an explanation for specific variations in codon usage among higher eukaryotes (Plotkin and Kudla, 2010). Strong mutational patterns are even seen at codon positions 1, 2, and 3 between polymorphism and fixation data (Zhao and Jiang, 2010). Fixation shows a powerful bias towards increasing the rarest codons and decreasing the most frequently used codons. Such data suggest that codon equilibrium has not yet been achieved (Zhao and Jiang, 2010). The significantly more frequent amino acids were also observed to be the smaller and simpler ones, which are more metabolically ancient than other amino acids (Kotlar and Lavner, 2006). Recently, the creation of genetic codes and their respective amino acids was reconstructed by applying 60 different criteria (Trifonov, 2004). The chronological reconstruction of amino acids suggests that amino acids like Cysteine, Tryptophan, and Phenylalanine are among the latest amino acids that have increased since the last universal ancestor (Brooks and Fresco, 2002). In addition, since mutational patterns differ among organisms, the usage of synonymous codons differs as well (Chen, 2004). The level of genomic GC content is one of the most important parameters studied by proponents of the mutational explanation. Even though many scientists accept that mutational forces have played an important role in the creation of codon usage bias, many believe that “mutation pressures can not explain why the more frequent codons (also called preferred or optimal codons) are those that are recognized by more abundant tRNA molecules” (Hershberg and Petrov, 2008; Kotlar and Lavner, 2006). The mutational or neutral explanation is unable to account for some other phenomena as well. For instance, within-genome variation in codon bias and the correlation between codon bias and high levels of gene expression are not explained by the mutational explanation (Gouy and Gautier, 1982; Plotkin and Kudla, 2010).

On the other hand, according to the selectionist explanation, codon bias is created and maintained by natural selection since it contributes to the efficiency and accuracy of protein production (Hershberg and Petrov, 2008; Smith and Eyre-Walker, 2001; Duret, 2002). “Efficient elongation of a transcript might increase its protein yield, or it may provide a global benefit to the cell by increasing the number of ribosomes that are available to translate other messages” (Plotkin and Kudla, 2010). Accurate translation, in turn, can benefit the cell by decreasing the cost of protein production (Plotkin and Kudla, 2010). However, for some amino acids, pressure for translation efficiency would lead to a different codon choice than pressure for translation accuracy (Shah and Gilchrist, 2010). Interestingly, synonymous codons that are used at higher frequency correspond to the more abundant tRNA molecules. Also, optimal and preferred codons are mostly used in highly expressed genes in Escherichia coli, Saccharomyces cerevisiae (Bennetzen and Hall, 1982), worms, flies and plants (Prat Y et al., 2009), and vertebrates (Hershberg and Petrov, 2008). Codon usage bias showed a positive correlation with highly expressed genes across eukaryotes ranging from unicellular protists to vertebrates, and it was less correlated with the poorly expressed genes in the same organisms (Subramanian, 2008). Even viral species that use rare codons relative to their hosts apply synonymous codon changes in order to increase their protein production in host cells (Clade N.M. et al., 2008). So, according to the selectionist explanation, “genes using the codons that are recognized by more abundant tRNA molecules may be translated more efficiently and with fewer mistakes than genes that use less frequent codons”. As a result, selection might play a crucial role in the usage of more abundant codons (Hershberg and Petrov, 2008; Prat Y et al., 2009). Such positive correlations between frequent codons and the more plentiful tRNAs were discovered previously in Escherichia coli and Saccharomyces cerevisiae, and codon-choice patterns were concluded to be evolutionarily conserved (Ikemura, 1985). Moreover, selection of optimal codons might be due to the necessity of faster elongation during translation, increasing the cellular concentration of free ribosomes, and decreasing the chance of erroneous incorporation of nascent amino acids (Hershberg and Petrov, 2008). However, it has been shown that in yeast, initiation is rate limiting for protein production, so it determines the amount of protein produced from each message regardless of ribosome density or codon adaptation (Plotkin and Kudla, 2010). Suggested reasons for the creation, maintenance, and evolution of codon bias include the following: 1) stability of the codon-anticodon complex (Grosjean and Fiers, 1982; Topal and Fresco, 1993), 2) overall G+C content of transcripts (Sinclair and Choy, 2002; Wu and Sair, 1991; Prat Y et al., 2009), 3) mRNA secondary structure and turnover (Anderson and Kurland, 1995; Carlini et al., 2001; Duan et al., 2003; Hasegawa et al., 1979), 4) micro RNA binding or the presence of exonic splicing enhancers (Parmley and Hurst, 2007), 5) protein length (Prat Y et al., 2009) and negative and positive selection on patterns of intragenic codon usage (Lin K., 2002), and 6) biases associated with non-protein structure-dependent gene regulation (Desai et al., 2004).

2.3. G+C Composition and Third Codon Position

Although the overall genome G+C composition of vertebrates ranges between 40% and 45%, it is interesting that almost all optimal codons end in either G or C nucleotides. So, there must be a force maintaining the G+C richness of the third codon position (Ikemura, 1985). Moreover, since most synonymous codons allow GC ↔ AT substitution at their last codon position, the overall GC bias of genomic sequences is highly correlated with GC3 (third codon position GC content) in prokaryotic and plant genomes (Palidwor et al., 2010). So, high G/C in the third position is often observed with increasing local GC-rich composition, and low G/C in the third codon position is seen in AT-rich regions. However, for reasons that are still unclear, AGG (arginine) and TTG (leucine), contrary to other G/C-ending codons, show overall usage that decreases with increasing GC bias (Palidwor et al., 2010).

In general, high GC content increases the level of gene expression in mammalian cells (Nackley et al., 2006). As mentioned earlier, hypermutability of CpG sites applies forces that tend to reduce genomic GC content; however, it is proposed that fixation can suppress the potential rapid loss of CpG-containing codons, thus preventing a remarkable change of codons during short evolutionary periods (Zhao and Jiang, 2010). It is observed that polymorphism is suppressed at the CpG di-nucleotides in CpG islands (Zhao and Jiang, 2010; Smith and Eyre-Walker, 2001). These islands are usually found in promoter regions, and maintain high CpG and GC content regardless of mutational forces (Zhao and Jiang, 2010). Even though the exact relationship between G+C content and the expression level of genes is not clear, codons with high GC3 are often surrounded by G+C-rich regions. For instance, the human insulin gene contains 80% GC3 and is located in a G+C-rich area where not only exons, but also introns and flanking regions have a G+C-rich composition. In contrast, the human gamma interferon gene, which contains only 44% GC3, is located in an A+T-rich genomic neighborhood (Ikemura, 1985).

Even though local GC composition and GC presence in the third codon position have been linked with codon bias, there are studies linking AT richness and the presence of A+T in the second (AT2) and third (AT3) codon positions to the structure and function of the produced proteins. For example, mRNA with AT richness below 50% is seen to produce bulky RNA structures more often than sequences with a higher AT percentage (Pluhar, 2006). Also, certain protein functions have been related to the AT richness of coding sequences. For instance, proteins with multiple transmembrane passages are observed more often in regions with AT below 50%, while structural proteins with a high glycine and proline content, like collagen, are produced in regions with higher AT3 than AT2 (Pluhar, 2006).

3. Statement of the Question

Although numerous studies have investigated codon bias in prokaryotic species, controversies about codon bias in eukaryotic species (especially Homo sapiens) motivated us to perform this study. The association of the codon bias index of thousands of eukaryotic genes with their GC composition, level of gene expression, genome size, and codon positions has been investigated in this study. However, suggesting the origin of GC-isochores in eukaryotic genomes and their correlation with the codon bias phenomenon is the gist of this work.

In contrast to most prior studies, only real sequences are analyzed here and pseudogenes have not been investigated in any parts of this study.


4. Investigation of Codon Bias in Vertebrates

4.1. Materials

All human intron-containing genes were studied. The 19,855 FASTA-formatted human coding sequences and their local intronic regions were obtained from the human Exon-Intron Database, release 37 (Shepelev and Fedorov, 2006).

Scientific names of the organisms investigated in this study, as well as their genome size (shown by C-value) and GC content information, were obtained from the Animal Genome Size Database (http://www.genomesize.com/) and the Codon Usage Database website (http://www.kazusa.or.jp/codon/).

Conserved protein sequences were studied using the Protein BLAST application from the National Center for Biotechnology Information webpage. Amino acid sequences were investigated using the non-redundant protein sequences (nr) database and the blastp (protein-protein BLAST) algorithm with default settings. Sequences with a total BLAST bit score of 45 or less were removed from further investigation.

Expression levels of human genes were obtained from the BioGPS database.


4.2. Methods

The following Perl programs were used in my data analysis.

CBI-Calculator.pl

CBI-Calculator measures the Codon Bias Index (CBI) of input sequences and modifies the input file containing FASTA-formatted sequences by adding the calculated CBI values to the end of the information line. The procedure first reads the input sequences as triplets (codons) and measures their random frequencies. Then, each codon is compared against a list of optimal codons (human codon usage table, Nature 409, 860 - 921 (February 15th, 2001)) and a list of excluded codons (stop codons, as well as codons encoding methionine, tryptophan, and aspartic acid, which did not show significant codon bias in previous studies) in order to count the optimal codons present in the input sequence and to ignore codons that lack codon bias, respectively. Finally, the Codon Bias Index (CBI) is calculated using the following formula (Bennetzen and Hall, 1982).

$$\mathrm{CBI} = \frac{N_{opt} - N_{rand}}{N_{tot} - N_{rand}}$$

$N_{opt}$ = Number of optimal codons in the sequence

$N_{rand}$ = Expected number of optimal codons under random choice among synonymous codons (for instance, if two synonymous codons code for one amino acid, then each occurrence of that amino acid contributes 0.5 to $N_{rand}$)

$N_{tot}$ = Total number of codons
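The procedure described above can be sketched in Python as follows. This is an illustration of the steps only, not a translation of the CBI-Calculator.pl source in Appendix A. It assumes one optimal codon per amino acid, so that each counted codon contributes 1/(number of synonymous codons) to N_rand; the result can therefore differ slightly from hand calculations that use rounded expectations. The function names and the two-codon optimal set in the usage example (TTC and AGC, the only optimal codons named in the text) are placeholders; a full optimal-codon list would be supplied in practice.

```python
from collections import Counter
from itertools import product

# Standard genetic code in NCBI order (bases T, C, A, G); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
DEGENERACY = Counter(CODON_TABLE.values())  # synonymous codons per amino acid
EXCLUDED = {"*", "M", "W", "D"}             # stops, Met, Trp, Asp (as in the text)


def cbi(sequence, optimal_codons):
    """Return the CBI of a coding sequence, or None if no codon can be counted."""
    seq = sequence.upper().replace("U", "T")
    n_opt = n_rand = n_tot = 0.0
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        aa = CODON_TABLE.get(codon)
        if aa is None or aa in EXCLUDED:
            continue  # skip ambiguous triplets and excluded codons
        n_tot += 1
        n_rand += 1 / DEGENERACY[aa]  # assumption: one optimal codon per amino acid
        if codon in optimal_codons:
            n_opt += 1
    if n_tot == 0:
        return None
    return (n_opt - n_rand) / (n_tot - n_rand)


def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def annotate_fasta(lines, optimal_codons):
    """Append the CBI value to the end of each information (header) line."""
    out = []
    for header, seq in read_fasta(lines):
        value = cbi(seq, optimal_codons)
        tag = "NA" if value is None else f"{value:.3f}"
        out.append(f"{header} CBI={tag}")
        out.append(seq)
    return out


if __name__ == "__main__":
    example = [">gene1", "UUCUUUAGC", "AAUGGGUGG"]
    # Placeholder optimal set: only the two optimal codons named in the text.
    for line in annotate_fasta(example, {"TTC", "AGC"}):
        print(line)
```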


GC-Content-Calculator.pl

The GC-Content-Calculator Perl script is used to measure the G+C richness of input sequences. It generates tables containing the names of genes along with the number of Adenine (A), Thymine (T), Cytosine (C), and Guanine (G) nucleotides present and their GC content. Since the input file contained intronic regions from the entire human genome (hs36pl.int.EID database), the program first groups intronic sequences by their corresponding genes. Next, the adenine, thymine, cytosine, and guanine counts of each intron sequence are determined and the GC content (percentage) is calculated.
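The counting step described above can be sketched as follows. This is a minimal Python illustration rather than the GC-Content-Calculator.pl source in Appendix B; it assumes that introns arrive as (gene name, sequence) pairs, and parsing of the Exon-Intron Database file format is deliberately left out. The function names are hypothetical.

```python
from collections import Counter


def nucleotide_composition(sequence):
    """Count A, T, C, G and compute GC content as a percentage of A+T+C+G."""
    counts = Counter(sequence.upper())
    a, t, c, g = (counts[base] for base in "ATCG")
    total = a + t + c + g  # ambiguous symbols such as N are ignored
    gc_percent = 100.0 * (g + c) / total if total else 0.0
    return {"A": a, "T": t, "C": c, "G": g, "GC%": gc_percent}


def gc_by_gene(introns):
    """Pool intron sequences per gene, then compute the composition of each pool."""
    pooled = {}
    for gene, seq in introns:
        pooled.setdefault(gene, []).append(seq)
    return {
        gene: nucleotide_composition("".join(seqs))
        for gene, seqs in pooled.items()
    }


if __name__ == "__main__":
    introns = [("geneA", "GGCC"), ("geneA", "AATT"), ("geneB", "ATATNN")]
    for gene, row in gc_by_gene(introns).items():
        print(gene, row)
```

Pooling all introns of a gene before counting gives one local GC value per gene, which is the quantity plotted against CBI in the results.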

Figure 4.1. Example of Nucleotide composition of human coding sequences and their neighboring introns in different genes, obtained using GC-Content-Calculator.pl


4.3. Results

Investigation of the human gene distribution according to codon bias index is the main result of this work. As the following graphs illustrate, the CBI values of most human genes range between -0.1 and +0.5 (see Figure 4.2). The subsequent graphs present even more important results. Figures 4.3 and 4.4 show almost linear relationships between the CBI values of coding sequences and their local GC content calculated for all introns of the same gene. To investigate the association of GC-isochores and CBI values, we studied the human and chicken genomes as GC-isochore-containing genomes and compared them with the zebrafish genome (which does not contain GC-isochores).

Figure 4.2. Distribution of human genes by CBI values of their coding sequences


Figure 4.3. Distribution of human genes by CBI values and intronic GC content

Each dot represents a single human gene


Figure 4.4. Large magnification of figure 4.3.

Each dot represents a single human gene


Figure 4.5. Distribution of Gallus gallus (Chicken) genes by CBI values and intronic GC content

Each dot represents a single Gallus gallus gene


Figure 4.6. Large magnification of figure 4.5.

Each dot represents a single Gallus gallus gene


Figure 4.7. Distribution of Danio rerio (Zebrafish) genes by CBI values and intronic GC content

Each dot represents a single Danio rerio gene


Figure 4.8. Large magnification of figure 4.7.

Each dot represents a single Danio rerio gene


Figure 4.9. Distribution of median expression of human genes by CBI values.

Each dot represents the median expression level of all genes within that range of CBI values (e.g. the dot for 0.1 represents all genes with CBI value ≥ 0.1 and < 0.2). Gene expression level was measured in GCRMA units and was obtained from the BioGPS (Affymetrix) database.


Figure 4.10. Distribution of mean expression of human genes by CBI values.

Each dot represents the mean expression level of all genes within that range of CBI values (e.g. the dot for 0.1 represents all genes with CBI value ≥ 0.1 and < 0.2). Gene expression level was measured in GCRMA units and was obtained from the BioGPS (Affymetrix) database.

According to Figures 4.9 and 4.10, gene expression only correlates with extremely high CBI values (CBI > 0.5). Also, since a few highly expressed genes can misleadingly affect the mean values, both mean and median expression levels were investigated.
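The binning used for Figures 4.9 and 4.10 can be illustrated with a short sketch. It assumes that each gene is available as a (CBI, expression) pair and uses bins of width 0.1 that include their lower bound, as described in the figure legends; the function name `expression_by_cbi_bin` and the example values are hypothetical and are not data from this study.

```python
import math
from collections import defaultdict
from statistics import mean, median


def expression_by_cbi_bin(genes, width=0.1):
    """Group (cbi, expression) pairs into CBI bins; return {bin_start: (median, mean)}."""
    bins = defaultdict(list)
    for cbi, expression in genes:
        # Rounding before floor avoids floating-point artifacts at bin edges
        # (for example 0.3 / 0.1 evaluating to slightly less than 3).
        index = math.floor(round(cbi / width, 9))
        bins[round(index * width, 10)].append(expression)
    return {
        start: (median(values), mean(values))
        for start, values in sorted(bins.items())
    }


if __name__ == "__main__":
    # Made-up values: one highly expressed gene inflates the mean of its bin
    # while leaving the median almost unchanged.
    genes = [(0.12, 10.0), (0.18, 20.0), (0.19, 300.0), (0.55, 50.0)]
    for start, (med, avg) in expression_by_cbi_bin(genes).items():
        print(f"CBI bin starting at {start}: median={med}, mean={avg}")
```

Reporting both statistics per bin makes the effect of a few highly expressed genes visible, which is the reason both were examined here.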


Genomic GC Content of numerous species was studied here and the following results were obtained.

Table 4.1. GC content of the entire genome and coding sequences of vertebrates, obtained from the Codon Usage Database and Animal Genome Size Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall genomic and overall coding GC content are presented as “Genomic” and “Coding”. All GC content values are percentages.

| Scientific name | Common name | C-val (pg) | Genomic | Coding | 1st | 2nd | 3rd |
|---|---|---|---|---|---|---|---|
| Anas platyrhynchos | wild duck | 1.44 | | 52.52 | 54.67 | 41.60 | 61.28 |
| Anguilla japonica | Japanese eel | | | 54.04 | 54.42 | 41.25 | 66.45 |
| Bos taurus | Cattle | 3.70 | 41.70 | 53.66 | 56.64 | 41.89 | 62.46 |
| Callithrix jacchus | Common Marmoset | 3.43 | | 51.99 | 53.15 | 41.70 | 61.11 |
| Canis familiaris | dog | 2.80 | 41.00 | 53.16 | 55.35 | 41.92 | 62.22 |
| Carassius auratus | Gold fish | 2.00 | 37.74 | 50.47 | 52.77 | 40.06 | 58.59 |
| Chlorocebus aethiops | Vervet Monkey | 4.48 | | 50.67 | 53.37 | 41.60 | 57.04 |
| Cyprinus carpio | Common carp | 1.70 | | 49.77 | 53.34 | 39.32 | 56.67 |
| Danio rerio | Zebrafish | 1.80 | 38.60 | 50.24 | 53.92 | 40.85 | 55.95 |
| Dicentrarchus labrax | European seabass | 0.78 | | 51.71 | 52.94 | 39.25 | 62.94 |
| Eptatretus burgeri | Inshore Hagfish | 2.98 | | 47.94 | 49.46 | 39.35 | 55.00 |
| Equus caballus | horse | 3.15 | | 52.70 | 54.89 | 40.84 | 62.38 |
| Felis catus | cat | 2.91 | 42.57 | 52.74 | 54.31 | 41.29 | 62.62 |
| Fundulus heteroclitus | Common mummichog | 1.50 | | 54.44 | 54.93 | 41.48 | 66.93 |
| Gallus gallus | Chicken | 1.25 | 50.00 | 51.38 | 55.01 | 41.37 | 57.78 |
| Gorilla gorilla | Gorilla | 4.16 | | 51.89 | 54.45 | 41.73 | 59.48 |
| Homo sapiens | Human | 3.50 | 41.00 | 52.27 | 55.72 | 42.54 | 58.55 |
| Ictalurus punctatus | Channel catfish | 1.00 | | 49.46 | 51.40 | 40.06 | 56.93 |
| Macaca mulatta | Macaque | 3.59 | 42.00 | 52.93 | 54.05 | 42.09 | 62.65 |
| Mus musculus | Mouse | 3.25 | 42.00 | 52.03 | 55.22 | 42.28 | 58.60 |
| Odorrana grahami | Chinese frog | | | 43.11 | 49.67 | 35.09 | 44.56 |
| Oncorhynchus mykiss | Rainbow trout | 2.60 | 43.62 | 53.20 | 53.74 | 41.37 | 64.50 |
| Oreochromis niloticus | Nile Tilapia | 1.20 | | 52.01 | 53.14 | 41.49 | 61.40 |
| Pan paniscus | Bonobo | | | 49.85 | 50.97 | 40.68 | 57.90 |
| Pan troglodytes | Chimpanzee | 3.76 | 42.00 | 54.48 | 56.46 | 43.94 | 63.04 |
| Paralichthys olivaceus | Olive flounder | 0.71 | | 51.69 | 52.41 | 41.52 | 61.13 |
| Petromyzon marinus | Sea lamprey | 2.44 | | 59.39 | 57.57 | 45.75 | 74.85 |
| Plethodon shermani | | | | 48.32 | 51.49 | 40.07 | 53.40 |
| Pongo pygmaeus | Bornean Orangutan | 3.60 | 41.30 | 50.66 | 53.10 | 41.47 | 57.43 |
| Rana catesbeiana | Bullfrog | 6.63 | | 47.24 | 52.53 | 39.57 | 49.62 |
| Salmo salar | Atlantic salmon | 3.10 | 43.93 | 54.19 | 54.44 | 41.82 | 66.31 |
| Sparus aurata | Gilt-head bream | 0.95 | | 53.43 | 53.54 | 40.43 | 66.33 |
| Sus scrofa | Pig | 3.00 | 42.48 | 54.72 | 56.73 | 42.44 | 64.99 |
| Takifugu rubripes | Puffer fish | 0.40 | | 53.53 | 54.07 | 41.50 | 65.02 |
| Tetraodon nigroviridis | Green spotted pufferfish | 0.51 | | 55.78 | 54.38 | 42.39 | 70.56 |
| Xenopus laevis | Clawed frogs | 3.09 | 42.00 | 46.96 | 52.76 | 39.93 | 48.19 |
| Xenopus tropicalis | Frog | | | 47.47 | 53.12 | 39.94 | 49.37 |

C-value and genome size can be used interchangeably and it is measured in picograms

(pg); it refers to the amount of DNA in a haploid nucleus.


Table 4.2. GC content of the entire genome and of coding sequences of invertebrates, obtained from the Codon Usage Database and the Animal Genome Size Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall genomic and overall coding GC content are presented as “Genomic” and “Coding”. All GC content values are percentages; a dash marks a value that is not given.

Scientific name                Common name               C-val (pg)  Genomic  Coding  1st    2nd    3rd
Aedes aegypti                  Mosquito (dengue fever)   0.81        -        50.65   53.67  40.14  58.14
Anopheles gambiae              Mosquito (malaria)        0.27        -        55.82   56.90  41.84  68.73
Apis mellifera                 Honey bee                 0.27        34       43.53   47.60  38.80  44.20
Bombyx mori                    Moth                      0.52        -        48.12   52.29  41.19  50.89
Branchiostoma floridae         Lancelet                  0.59        -        56.54   55.45  44.55  69.61
Caenorhabditis elegans         Nematode                  0.1         36       42.93   50.00  39.00  39.78
Ciona intestinalis             Sea squirt                -           -        44.87   50.13  41.93  42.54
Drosophila melanogaster        Fruit fly                 0.18        -        53.86   55.79  41.48  64.32
Strongylocentrotus purpuratus  Purple sea urchin         0.89        -        52.45   62.61  48.26  46.50
Tribolium castaneum            Beetle                    0.21        33       47.15   47.71  37.43  56.33


Table 4.3. GC content of the entire genome and of coding sequences of plants, obtained from the Codon Usage Database and the Animal Genome Size Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall genomic and overall coding GC content are presented as “Genomic” and “Coding”. All GC content values are percentages; a dash marks a name or value that is not given.

Scientific name            Common name  Genomic  Coding  1st    2nd    3rd
-                          thale-cress  36       44.59   50.84  40.54  42.38
Brassica napus             -            -        47.66   52.32  41.58  49.09
Chlamydomonas reinhardtii  -            64       66.30   64.80  47.90  86.21
Glycine max                soybean      -        45.95   52.39  39.75  45.69
Oryza sativa               rice         -        55.26   58.19  45.97  61.61
Physcomitrella patens      -            -        50.75   54.96  42.05  55.26
Sorghum bicolor            -            -        53.91   57.40  43.17  61.18
Triticum aestivum          -            -        56.02   59.81  43.24  65.01
Vitis vinifera             Fruit crop   -        44.24   50.18  39.88  42.65
Zea mays                   Corn         -        54.98   57.58  43.37  64.00

Table 4.4. GC content of the entire genome and of coding sequences of fungi, obtained from the Codon Usage Database and the Animal Genome Size Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall genomic and overall coding GC content are presented as “Genomic” and “Coding”. All GC content values are percentages; a dash marks a name or value that is not given.

Scientific name                 Genomic  Coding  1st    2nd    3rd
Ashbya gossypii ATCC 10895      -        52.75   55.28  40.40  62.57
Aspergillus fumigatus           -        54.19   57.09  43.96  61.52
Candida glabrata CBS 138        -        40.44   45.73  35.99  39.61
Debaryomyces hansenii CBS767    -        37.45   42.51  35.12  34.72
Encephalitozoon cuniculi GB-M1  -        47.52   49.08  36.59  56.89
Magnaporthe grisea              -        56.32   57.22  44.74  67.00
-                               -        56.07   57.90  45.19  65.13
Penicillium chrysogenum         -        53.84   57.34  43.92  60.25
Saccharomyces cerevisiae        38.3     39.77   44.58  36.64  38.10
Schistosoma mansoni             -        37.31   46.89  39.13  25.91
Schizosaccharomyces pombe       36       39.80   48.05  38.21  33.14
Ustilago maydis                 -        56.48   58.27  46.37  64.80
Yarrowia lipolytica             -        54.60   58.04  41.57  64.20


Table 4.5. GC content of the entire genome and of coding sequences of protozoa, obtained from the Codon Usage Database and the Animal Genome Size Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall genomic and overall coding GC content are presented as “Genomic” and “Coding”. All GC content values are percentages; a dash marks a value that is not given.

Scientific name                   Genomic  Coding  1st    2nd    3rd
Babesia bovis                     -        44.32   49.37  38.93  44.67
Bigelowiella natans               -        48.13   54.25  39.90  50.25
Cryptosporidium parvum            -        33.28   40.35  34.78  24.72
Dictyostelium discoideum          25.7     28.62   37.52  33.45  14.89
Entamoeba histolytica             -        30.86   44.46  34.59  13.53
Giardia intestinalis              -        51.90   53.64  42.99  59.06
Guillardia theta                  -        30.76   33.73  29.41  29.13
Leishmania braziliensis           -        60.34   62.13  48.39  70.50
Leishmania infantum               -        62.32   63.26  49.12  74.58
Leishmania major                  -        63.38   63.63  50.30  76.20
Leishmania major strain Friedlin  -        62.14   61.94  49.76  74.73
Paramecium tetraurelia            -        31.71   37.51  30.33  27.29
Plasmodium falciparum             19.4     27.59   37.98  27.93  16.85
Plasmodium falciparum 3D7         -        23.80   31.95  22.16  17.28
Plasmodium vivax                  -        43.06   46.82  33.29  49.05
Tetrahymena thermophila           -        32.53   38.64  31.25  27.69
Theileria annulata                -        41.36   45.13  32.23  46.71
Toxoplasma gondii                 -        56.41   60.03  47.52  61.68
Trichomonas vaginalis             -        45.29   53.17  38.16  44.56
Trypanosoma brucei                -        50.73   57.21  43.62  51.35
Trypanosoma cruzi                 -        54.11   57.94  44.04  60.36


Table 4.6. GC content of coding sequences of bacteria, obtained from the Codon Usage Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall coding GC content is shown as “Coding”. All GC content values are percentages.

Scientific name                                   Coding  1st    2nd    3rd
Acidiphilium cryptum JF-5                         67.89   69.27  49.77  84.63
Acidothermus cellulolyticus 11B                   66.76   70.56  50.46  79.26
Acidovorax sp. JS42                               66.72   68.28  46.72  85.16
Acinetobacter baumannii ATCC 17978                40.13   52.62  37.14  30.63
Acinetobacter sp. ADP1                            41.52   53.49  37.14  33.93
Actinobacillus succinogenes 130Z                  45.93   53.06  37.77  46.96
Aeromonas hydrophila subsp. hydrophila ATCC 7966  62.81   64.70  42.28  81.44
Aeropyrum pernix K1                               56.97   59.58  43.89  67.45
Agrobacterium tumefaciens                         56.88   61.55  44.90  64.18
Aquifex aeolicus VF5                              43.58   50.44  32.37  47.93
Azoarcus sp. EbN1                                 65.08   67.38  46.61  81.25
Bacillus cereus                                   35.76   47.70  33.68  25.89
Bacteroides thetaiotaomicron VPI-5482             43.91   50.15  36.51  45.07
Bartonella quintana str. Toulouse                 40.40   53.12  38.57  29.52
Chlamydia trachomatis                             42.09   48.99  44.92  32.37
Dehalococcoides ethenogenes 195                   49.62   56.06  39.75  53.04
Desulfitobacterium hafniense Y51                  48.48   54.78  38.10  52.55
Escherichia coli                                  47.30   53.83  40.61  47.45
Idiomarina loihiensis L2TR                        47.49   56.83  38.63  47.01
Jannaschia sp. CCS1                               62.61   65.35  46.45  76.02
Lactococcus lactis subsp. lactis Il1403           36.18   48.63  34.57  25.34
Listeria monocytogenes                            38.40   49.78  35.79  29.63
Methanococcus maripaludis C7                      34.22   44.67  32.16  25.82
Mycoplasma synoviae 53                            28.79   40.27  29.60  16.51
Neisseria gonorrhoeae FA 1090                     53.99   57.81  41.03  63.13
Nocardia farcinica IFM 10152                      71.04   71.75  50.64  90.73
Oenococcus oeni PSU-1                             38.87   47.33  34.81  34.46
Pyrococcus abyssi GE5                             45.14   50.29  34.86  50.28
Salmonella enterica                               48.05   54.64  41.41  48.09
Shigella flexneri 2a str. 301                     51.74   58.80  40.85  55.58
Staphylococcus aureus subsp. aureus NCTC 8325     33.51   45.48  32.37  22.68
Streptococcus suis 98HAH33                        41.85   51.87  35.14  38.53
Vibrio vulnificus                                 45.52   52.75  38.00  45.79
Zymomonas mobilis subsp. mobilis ZM4              47.57   55.93  41.76  45.02


Table 4.7. GC content of coding sequences of archaea, obtained from the Codon Usage Database. 1st, 2nd, and 3rd refer to the nucleotide positions in a codon. Overall coding GC content is shown as “Coding”. All GC content values are percentages.

Scientific name                    Coding  1st    2nd    3rd
Archaeoglobus fulgidus DSM 4304    49.36   53.00  36.68  58.4
Nanoarchaeum equitans Kin4-M       31.2    39.76  28.87  24.97
Picrophilus torridus DSM 9790      37.08   41.64  33.2   36.39
Pyrobaculum aerophilum str. IM2    51.9    55.36  40.73  59.61
Pyrococcus abyssi GE5              45.14   50.29  34.86  50.28
Pyrococcus furiosus DSM 3638       41.12   49.51  34.56  39.29
Sulfolobus acidocaldarius DSM 639  37.48   44.72  34.24  33.49


To look for possible correlations, the genome size and genomic GC content of multiple eukaryotic species were compared. Genome size is expressed as the C-value, the amount of DNA in a haploid nucleus, measured in picograms (pg); the two terms are used interchangeably here. Figure 4.11 reveals no association between overall genomic GC content and C-value among the studied eukaryotic species. Only organisms for which both overall genomic GC content and genome size were available are shown in Figure 4.11.

Figure 4.11. Distribution of eukaryotic species by genome size and genomic GC content. The C-value, half the amount of DNA in a diploid nucleus, is measured in picograms (pg). Each dot represents a single eukaryotic species.


Next, the GC content of the different codon positions was compared across all examined eukaryotic species; a positive correlation was observed in every comparison.

Figure 4.12. GC composition in the first and second codon positions among eukaryotic species. Each dot represents one species. The line represents the trend.

Figure 4.13. GC composition in the first and third codon positions among eukaryotic species. Each dot represents one species. The line represents the trend.

Figure 4.14. GC composition in the second and third codon positions among eukaryotic species. The red circle marks Strongylocentrotus purpuratus; its coding sequence GC content showed the strongest deviation from the trend.


Almost all examined eukaryotic species showed a positive correlation between the GC content of the various codon positions. A few exceptions were observed; the strongest was the sea urchin (circled in Figure 4.14). Although Strongylocentrotus purpuratus (sea urchin) initially showed a higher GC content in the first codon position than the other species, inspection of its codon usage shows that this G+C richness is due to an extraordinarily high frequency of glycine (Gly) codons (GGU = 56.1, GGC = 23.7, GGA = 57.4, GGG = 8.9, per thousand codons). Interestingly, the GC content of its coding sequences decreases to a level comparable to that of other studied eukaryotes when glycine codons are excluded (Tables 4.8 and 4.9).

Table 4.8. GC content (%) in Strongylocentrotus purpuratus, including Gly

                 CDS    1st    2nd    3rd
Including Gly    52.45  62.61  48.26  46.5

Table 4.9. GC content (%) in Strongylocentrotus purpuratus, excluding Gly

                 CDS    1st    2nd    3rd
Excluding Gly    48.84  56.4   39.42  50.7
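The effect of leaving out glycine codons can be illustrated with a short Python sketch. It is not the program used to produce Tables 4.8 and 4.9; the function name `gc_by_position` and the choice to work from a raw coding sequence (rather than from codon frequency tables) are assumptions made for the example.

```python
GLYCINE_CODONS = ("GGT", "GGC", "GGA", "GGG")


def gc_by_position(cds, excluded_codons=()):
    """Return the GC percentage at codon positions 1, 2 and 3.

    cds: coding sequence as a DNA string; an incomplete trailing codon is ignored.
    excluded_codons: codons to leave out of the calculation (e.g. glycine codons).
    """
    excluded = {codon.upper() for codon in excluded_codons}
    cds = cds.upper()
    gc_counts = [0, 0, 0]
    kept = 0
    for start in range(0, len(cds) - len(cds) % 3, 3):
        codon = cds[start:start + 3]
        if codon in excluded:
            continue
        kept += 1
        for position, base in enumerate(codon):
            if base in "GC":
                gc_counts[position] += 1
    if kept == 0:
        raise ValueError("no codons left after exclusion")
    return tuple(100.0 * count / kept for count in gc_counts)


if __name__ == "__main__":
    toy_cds = "GGTATGGGCAAA"  # codons: GGT ATG GGC AAA
    print(gc_by_position(toy_cds))                  # all codons
    print(gc_by_position(toy_cds, GLYCINE_CODONS))  # glycine codons removed
```

Because every glycine codon starts with GG, removing them lowers the GC content of the first and second positions, which is the direction of change seen between Tables 4.8 and 4.9.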


Then the GC content of the different codon positions was compared across all examined prokaryotic species; again, a positive correlation was observed in every comparison.

Figure 4.15. GC composition in the first and second codon positions among prokaryotic species. Each dot represents one species from Table 4.6. Line represents a trend.

Figure 4.16. GC composition in the first and third codon positions among prokaryotic species. Each dot represents one species from Table 4.6. Line represents a trend.

Figure 4.17. GC composition in the second and third codon positions among prokaryotic species. Each dot represents one species from Table 4.6. Line represents a trend.


Table 4.10. Slope and correlation coefficient of studied eukaryotic and prokaryotic species

                         Eukaryotic species    Prokaryotic species
Codon position           Slope    Correl*      Slope    Correl*
1st versus 2nd position  0.7483   0.945853     0.6367   0.919929
1st versus 3rd position  1.852    0.864969     2.2444   0.949108
2nd versus 3rd position  2.3098   0.80191      2.9934   0.854631

*Correlation coefficient
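For readers who wish to reproduce quantities of the kind reported in Table 4.10, the following Python sketch computes a least-squares slope and a Pearson correlation coefficient for paired values. It is an illustration only: the function name `slope_and_correlation` is hypothetical, and the text does not state which codon position was treated as the independent variable or which tool produced the trend lines, so those choices are assumptions.

```python
from math import sqrt


def slope_and_correlation(xs, ys):
    """Return (least-squares slope of y on x, Pearson correlation coefficient)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long series with at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("a series with zero variance has no defined slope or correlation")
    return sxy / sxx, sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # GC content of the 1st and 3rd codon positions for five vertebrates from Table 4.1
    # (cattle, dog, zebrafish, human, mouse); the x/y assignment is an assumption.
    gc1 = [56.64, 55.35, 53.92, 55.72, 55.22]
    gc3 = [62.46, 62.22, 55.95, 58.55, 58.60]
    slope, r = slope_and_correlation(gc1, gc3)
    print(f"slope={slope:.4f} r={r:.4f}")
```

With only five species the result is not expected to match Table 4.10, which was computed over all studied species.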


Some papers have reported an association between the codon bias of a gene and its evolutionary history, claiming that ancient genes show stronger codon bias (Prat et al., 2009). To test this statement, we aligned all human coding sequences against the entire set of zebrafish coding sequences using the BLAST program (default parameters). This analysis yielded three samples of human genes: genes with no sequence similarity to zebrafish genes (mammalian-specific genes, shown as New Genes in Table 4.18), genes with high similarity to zebrafish genes, and genes with weak similarity to zebrafish genes (a BLAST bit score of 20 was the cutoff in this part of the study).

Table 4.18. Distribution of human genes. Blue bars show mammalian-specific genes, red bars show human genes that are highly similar to zebrafish genes, and green bars show genes with low similarity between human and zebrafish.
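The three-way grouping described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: it assumes BLAST results in the standard 12-column tabular format (bit score in the last column), the function name `classify_by_best_hit` is hypothetical, and the `high` threshold separating high from weak similarity is a placeholder, because only the cutoff of 20 bits is given in the text.

```python
def classify_by_best_hit(query_ids, tabular_lines, cutoff=20.0, high=200.0):
    """Assign each query gene to 'new', 'low' or 'high' by its best bit score.

    query_ids: all human gene identifiers that were searched.
    tabular_lines: lines of 12-column tabular BLAST output (bit score is column 12).
    cutoff: bit score below which a gene is treated as having no similarity.
    high: hypothetical bit score at or above which similarity is called high.
    """
    best = {}
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, bit_score = fields[0], float(fields[11])
        best[query] = max(bit_score, best.get(query, 0.0))

    classes = {}
    for query in query_ids:
        score = best.get(query, 0.0)  # genes without any hit score 0
        if score < cutoff:
            classes[query] = "new"
        elif score >= high:
            classes[query] = "high"
        else:
            classes[query] = "low"
    return classes


if __name__ == "__main__":
    hits = [
        "geneA\tzf1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t250.0",
        "geneB\tzf2\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.01\t35.5",
    ]
    print(classify_by_best_hit(["geneA", "geneB", "geneC"], hits))
```

A gene that is absent from the BLAST output, such as `geneC` in the example, falls into the "new" group.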


5. Discussion

Examination of codon bias in different organisms (especially prokaryotes) in numerous prior studies confirms its importance for understanding multiple processes. Although many researchers have tried to discover the origin of differences in codon usage, to the author’s knowledge the resulting hypotheses remain largely controversial. This study suggests a novel explanation for the observed results.

In order to investigate codon bias in the human genome, we examined real DNA sequences of the entire set (19,855) of intron-containing human genes. In contrast to some prior studies (Anthony J.F. et al., 2008), we examined only real genes and not pseudogenes, since many pseudogenes lack the binding sites required for transcription and translation. Consequently, many pseudogenes accumulate mutations and do not show codon bias. After considering different publicly available algorithms, we decided to develop our own program to calculate CBI, since the available tools were either exclusive to prokaryotes or not user-friendly.

To confirm the accuracy and reliability of our results, a GC content calculator program was created. It was first tested by measuring the GC content of five sample coding sequences with negative CBI values, after which their neighboring intronic sequences were examined (Fig. 4.1). As predicted, our results were compatible with an earlier report (Zeberg 2002) showing location-dependent GC content: GC-rich exons usually neighbor GC-rich introns.


Using the Codon Bias Index calculator, we found that the majority of human genes have CBI values between -0.1 and +0.5 (Fig. 4.2), which shows that optimal codons are still highly used in most human genes.
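To make the scale of these values concrete, the calculation performed by the program in Appendix A can be summarised as CBI = (N_opt - N_rand) / (N_tot - N_rand), where N_opt is the number of optimal codons in the sequence, N_rand is the sum of the random-usage frequencies of its codons, and N_tot is the number of codons excluding Met, Trp, Asp, and stop codons. The Python sketch below restates that calculation; the function name `codon_bias_index` is hypothetical, and the example uses only the Phe and Lys entries of the Appendix A tables, so it is a reduced illustration and not a replacement for the full program.

```python
def codon_bias_index(cds, optimal, random_freq, excluded):
    """Compute CBI = (N_opt - N_rand) / (N_tot - N_rand) for one coding sequence.

    optimal: set of optimal codons (lower case).
    random_freq: dict codon -> frequency expected under random synonymous usage.
    excluded: set of codons left out of the total codon count.
    """
    cds = cds.lower()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    n_opt = sum(1 for codon in codons if codon in optimal)
    n_rand = sum(random_freq.get(codon, 0.0) for codon in codons)
    n_tot = sum(1 for codon in codons if codon not in excluded)
    if n_tot == n_rand:
        raise ValueError("CBI is undefined when N_tot equals N_rand")
    return (n_opt - n_rand) / (n_tot - n_rand)


# Reduced tables: only phenylalanine and lysine codons, values as in Appendix A.
OPTIMAL = {"ttc", "aag"}
RANDOM_FREQ = {"ttt": 0.5, "ttc": 0.5, "aaa": 0.5, "aag": 0.5, "atg": 0.0}
EXCLUDED = {"atg", "taa", "tag", "tga", "tgg", "gat", "gac"}

if __name__ == "__main__":
    for name, seq in [("only optimal", "ttcttcaagaag"),
                      ("random-like", "tttttcaaaaag"),
                      ("no optimal", "tttaaa")]:
        print(name, codon_bias_index(seq, OPTIMAL, RANDOM_FREQ, EXCLUDED))
```

A sequence that uses only optimal codons scores 1, a sequence that uses optimal codons as often as expected by chance scores 0, and a sequence that avoids them scores below 0.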

Next, the intronic GC composition of all human intron-containing genes and its correlation with CBI were studied (Fig. 4.3 and 4.4). As illustrated, the intronic GC composition of human genes was positively correlated with their CBI values. We suggested that these results are due to the existence of GC isochores (GC-rich genomic regions), which are present only in warm-blooded organisms (Hughes, 1999). To test this hypothesis, we applied the same analysis to chicken, another warm-blooded organism (Fig. 4.5 and 4.6). Interestingly, the GC composition of chicken intron-containing genes (12,695 genes) also showed a positive correlation with their CBI values.

Furthermore, to validate our results, zebrafish intron-containing genes (22,331 genes) were analyzed (Fig. 4.7 and 4.8). We expected the positive correlation between the GC composition of intron-containing genes and their CBI values to disappear in zebrafish because GC isochores are absent in cold-blooded organisms. Indeed, the observations agreed with this expectation, supporting our hypothesis.

Later, the association between the expression level of human intron-containing genes and their codon bias index was explored (Fig. 4.9 and 4.10). According to these figures, only very strong codon bias (CBI > 0.5) is associated with high gene expression levels.

Next, we examined the relationship between the GC composition of different eukaryotic species and their genome size (C-value) (Fig. 4.11), and we found that the two factors were not linearly correlated. Our results agree with numerous studies (Palidwor et al. 2010; Zhao 2010; Prat 2009; Carbone 2003; Lin 2003; Duret 2002; Smith 2001; Ikemura 1985) that previously considered the genomic GC content of various eukaryotic and prokaryotic organisms. Based on these observations and a previous report (Prat 2009) describing a correlation between genomic GC content and the level of gene expression (and hence higher GC richness in coding regions), we suggest that GC richness might be positively influenced by the size of coding regions but is not related to the size of the genome. Also, since organisms with a wide range of genome sizes show similar GC compositions, we suggest that genomic GC content is conserved among eukaryotes (Fig. 4.11).

Later, the positions of guanine and cytosine nucleotides within codons were investigated. Although many investigators (Duret 2002; Lin 2003; Pluhar 2006) studied the effect of GC/AT in the third codon position (presented as GC3/AT3), our approach is novel in studying the correlations between all three positions. We observed that in both eukaryotic and prokaryotic species the GC contents of the first, second, and third codon positions were linearly correlated: GC1 increased with GC2 (Fig. 4.12 and 4.15), GC1 increased with GC3 (Fig. 4.13 and 4.16), and GC2 and GC3 increased together (Fig. 4.14 and 4.17). In both super-kingdoms, the comparison of the first and second codon positions gave the shallowest slope and the comparison of the second and third positions gave the steepest slope, while the second-versus-third comparison had the lowest correlation coefficient (Table 4.10). We suggest that, in mammals, the presence of GC in one codon position positively affects the GC composition of the other codon positions. Also, overall genomic GC content dictates the GC richness of coding sequences, while selective forces play a secondary role. Among eukaryotic species, Strongylocentrotus purpuratus (sea urchin), like some other organisms, was initially thought to be an exception, since its first-position GC content (GC1) was much higher than expected (Table 4.8). However, since its amino acid pool is highly enriched in glycine (especially GGU and GGA codons), removing these codons decreased its GC content to the expected level (Table 4.9). A high frequency of GC3-rich codons was previously reported in sea urchin (Mann K et al. 2010), which supports the validity and reliability of our method.

Finally, we tried to explore the forces that create a higher GC3 composition in mammalian species (almost 60% according to the Codon Usage Database) than the whole-genome GC content of the same species (about 40%). Since codons that contain G or C in the last position showed a higher frequency than other synonymous codons, we believe that the ribosomal machinery is more accurate when working on GC3-containing codons. Consequently, selection can explain the high GC composition in the third codon position even though it is unable to clarify gene regulation.


6. Conclusions

We suggest the existence of a common ancestor of all vertebrates with the following characteristics and subsequent history:

◦ an extremely high genomic GC composition of about 70%

◦ translational machinery adapted to high GC content

◦ mutational forces disrupted the high genomic GC composition and lowered it to the current level (41% in human)

◦ the translational machinery was influenced by the lower GC content

◦ tRNA concentrations changed, since they depended on GC3

◦ the drop in GC3 content happened at a much lower rate


7. Appendices

Appendix A: Source Code for CBI-Calculator.pl

#!/usr/bin/perl
use strict;
use warnings;

# Usage: perl CBI-Calculator.pl sequences.fasta
# Each FASTA record is expected to hold its sequence on a single line.
open(my $fh, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
$/ = "\n>";    # read one FASTA record at a time

# Optimal codons
my %optimal = map { $_ => 1 } qw(ttc ctg atc gtg ccc acc gcc tac cac cag
                                aag gag tgc agc aga agg ggc);

# Codons excluded from the amino acid count (Met, Trp, Asp and stop codons)
my %excluded = map { $_ => 1 } qw(atg taa tag tga tgg gat gac);

# Frequency of each codon expected under random synonymous codon usage
my %RandomFreq = (
    TTT => 0.5,  TTC => 0.5,  TTA => 0.16, TTG => 0.16,
    CTT => 0.16, CTC => 0.16, CTA => 0.16, CTG => 0.16,
    ATT => 0.33, ATC => 0.33, ATA => 0.33, ATG => 0.0,
    GTT => 0.25, GTC => 0.25, GTA => 0.25, GTG => 0.25,
    TCT => 0.16, TCC => 0.16, TCA => 0.16, TCG => 0.16,
    CCT => 0.25, CCC => 0.25, CCA => 0.25, CCG => 0.25,
    ACT => 0.25, ACC => 0.25, ACA => 0.25, ACG => 0.25,
    GCT => 0.25, GCC => 0.25, GCA => 0.25, GCG => 0.25,
    TAT => 0.5,  TAC => 0.5,  TAA => 0.0,  TAG => 0.0,
    CAT => 0.5,  CAC => 0.5,  CAA => 0.5,  CAG => 0.5,
    AAT => 0.5,  AAC => 0.5,  AAA => 0.5,  AAG => 0.5,
    GAT => 0.0,  GAC => 0.0,  GAA => 0.5,  GAG => 0.5,
    TGT => 0.5,  TGC => 0.5,  TGA => 0.0,  TGG => 0.0,
    CGT => 0.16, CGC => 0.16, CGA => 0.16, CGG => 0.16,
    AGT => 0.16, AGC => 0.16, AGA => 0.16, AGG => 0.16,
    GGT => 0.25, GGC => 0.25, GGA => 0.25, GGG => 0.25,
);

while (my $record = <$fh>) {
    chomp $record;                # strip the trailing "\n>" separator
    $record =~ s/^>//;            # the first record still starts with ">"
    my ($header, $sequence) = split /\n/, $record;    # separate header from sequence
    next unless defined $sequence;

    my ($count, $NumAminoAcid, $SumValue) = (0, 0, 0);

    # Read the sequence codon by codon; an incomplete trailing codon is skipped
    foreach my $triplet ($sequence =~ /(.{3})/g) {
        my $codon = lc $triplet;
        $SumValue += $RandomFreq{uc $triplet} if exists $RandomFreq{uc $triplet};
        $count++ if $optimal{$codon};                 # optimal codon found
        $NumAminoAcid++ unless $excluded{$codon};     # exclude some triplets
    }

    # CBI = (optimal codons - random expectation) / (counted codons - random expectation)
    my $denominator = $NumAminoAcid - $SumValue;
    my $CBI = $denominator != 0
        ? sprintf("%.3f", ($count - $SumValue) / $denominator)    # three decimals
        : 'NA';

    print ">$header, CBI value:$CBI\n$sequence\n\n";
}
close($fh);

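Appendix B: Source Code for GC-Content-Calculator.pl

#!/usr/bin/perl
use strict;
use warnings;

open(my $fh, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
$/ = "\n>";    # read one FASTA record at a time

my ($gene, $CBI, $CDS) = ('', '', '');
my ($OldGene, $OldCBI, $OldCDS);
my %n = (a => 0, t => 0, g => 0, c => 0);    # nucleotide counts of the current gene

# Print the counts and GC percentage accumulated for one gene
sub report {
    my $total = $n{a} + $n{t} + $n{g} + $n{c};
    my $result = $total ? sprintf("%.2f", ($n{g} + $n{c}) / $total * 100) : 'NA';
    print "$OldGene\t$OldCBI\t$n{a}\t$n{t}\t$n{g}\t$n{c}\t$result\t$OldCDS\n";
}

while (my $record = <$fh>) {
    chomp $record;                # strip the trailing "\n>" separator
    $record =~ s/^>//;            # the first record still starts with ">"
    my @line = split /\n/, $record;
    next unless defined $line[1];

    if ($line[0] =~ /^ gene=(.+),CBI:(.+),CDS-Length:(.+),,.+/) {
        ($gene, $CBI, $CDS) = ($1, $2, $3);
    }

    # A new gene name means the previous gene is complete: print it and reset
    if (defined $OldGene && $gene ne $OldGene) {
        report();
        %n = (a => 0, t => 0, g => 0, c => 0);
    }

    # Count nucleotides, ignoring case and any other characters
    foreach my $nt (split //, lc $line[1]) {
        $n{$nt}++ if exists $n{$nt};
    }

    ($OldGene, $OldCBI, $OldCDS) = ($gene, $CBI, $CDS);
}
report() if defined $OldGene;    # print the last gene
close($fh);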


8. References

1. Anderson, S.G., Kurland, C.G., 1995. Genomic evolution drives the evolution of the translation system. Biochem. Cell Biol. 73, 775–787.

2. Baldauf S. L., 2008. An overview of the phylogeny and diversity of eukaryotes. Journal of Systematics and Evolution. 46(3):263-273.

3. Bennetzen JL, Hall BD. 1982. Codon selection in yeast. The Journal of Biological Chemistry. 257:3026-31.

4. Blencowe BJ. 2000. Exonic splicing enhancers: mechanism of action, diversity and role in human genetic diseases. Trends Biochem Sci. 25:106-110.

5. Brooks D.J, Fresco J.R. 2002. Increased frequency of cysteine, tyrosine, and phenylalanine residues since the last universal ancestor. Molecular and Cellular Proteomics. 1:125-131.

6. Carbone A, Zinovyev A, Kepes F. 2003. Codon adaptation index as a measure of dominating codon bias. Bioinformatics. 19:2005-15.

7. Carlini, D.B., Chen, Y., Stephan, W., 2001. The relationship between third-codon position nucleotide content, codon bias, mRNA secondary structure and gene expression in the drosophilid alcohol dehydrogenase genes Adh and Adhr. Genetics 159, 623–633.

8. Chen SL, Lee W, Hottes AK, Shapiro L, McAdams HH. 2004. Codon usage between genome is constrained by genome wide mutational processes. Proc. Natl. Acad. Sci. USA 101:3480-85.

9. Cladel N.M. et al. 2008. CRPV genomes with synonymous codon optimizations in the CRPV E7 gene show phenotypic differences in growth and altered immunity upon E7 vaccination. PLOS One. 3(8):e2947.

10. Codon Usage Bias, Codon Usage Bias - Translational Patterns, Compositional Patterns, Conclusions, Escherichia coli, Table 1., Codon, Table 2., (bacterium), (yeast), (fruit fly)

11. Desai D, Zhang K, Barik S, Srivastava A, Bolander ME, Sarkar G. 2004. Intragenic codon bias in a set of mouse and human genes. Journal of Theoretical Biology 230:215-25.

12. Duan, J., Wainright, M.S., Comeron, J.M., Saitou, N., Sanders, A.R., Gelernter, J., Gejman, P.V., 2003. Synonymous mutations in the human dopamine receptor D2 (DRD2) affect mRNA stability and synthesis of the receptor. Hum. Mol. Genet. 12, 205–216.

13. Duret L. 2002. Evolution of synonymous codon usage in metazoans. Current Opinion in Genetics and Development. 12:640-649.

14. Funk F, Person S. 1969. Cytosine to Thymine Transitions from Decay of Cytosine-5-3H in Bacteriophage S13. Science. 166:1629-31.

15. Gouy M, Gautier C. 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10:7055-74.

16. Grosjean, H., Fiers, W., 1982. Professional codon usage in prokaryotic genes: the optimal codon-anticodon interaction energy and the selective codon usage in efficiently expressed genes. Gene 18, 199-209.

17. Grote A, Hiller K, Scheer M, Munch R, Nortemann B, Hempel D.C, Jahn D. 2005. JCat: a novel tool to adapt codon usage of a target gene to its potential expression host. Nucleic Acid Research. 33:526-31.

18. Hasegawa, M., Yasunaga, Y., Miata, T., 1979. Secondary structure of MS2 phage RNA and bias in codon word usage. Nucleic Acids Res. 7, 2073–2079.

19. Hershberg R, Petrov DA. 2008. Selection on codon bias. Annu. Rev. Genet. 42:287-99.

20. Hughes S., Zelus D., Mouchiroud D. 1999. Warm-Blooded Isochore Structure in Nile Crocodile and Turtle. Mol. Biol. Evol. 16:1521-1527.

21. Ikemura T. 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2(1):13-34.

22. Joseph R., Schild Rudolf. 2010. Origins, Evolution, and Distribution of Life in the Cosmos: Panspermia, Genetics, Microbes, and Viral Visitors From the Stars. Journal of Cosmology, Vol 7, 1616-1670.

23. Kotlar D, Lavner Y. 2006. The action of selection on codon bias in the human genome is related to frequency, complexity, and chronology of amino acids. BMC Genomics. 7:67-78.

24. Lafay, B., Lloyd, A.T., McLean, M.J., Devine, K.M., Sharp, P.M. and Wolfe, K.H. 1999. Proteome composition and codon usage in spirochaetes: species-specific and DNA strand-specific mutational biases. Nucleic Acids Res., 27:1642–1649.

25. Lin K., Tan S.B., Kolatkar P.R., Epstein R.J., 2003. Nonrandom intragenic variations in patterns of codon bias implicate a sequential interplay between transitional genetic drift and functional amino acid selection. J Mol Evol. 57:538-545.

26. Lutsenko E, Bhagwat A S. 1999. Principal causes of hot spots for cytosine to thymine mutations at sites of cytosine methylation in growing cells. A model, its experimental support and implications. Mutat Res. 437:11–20.

27. Mann K., Wilt F.H., Poustka A.J., 2010. Proteomic analysis of sea urchin (Strongylocentrotus purpuratus) spicule matrix. Proteome Science 8:33.

28. Moszer, I., Rocha, E.P.C. and Danchin, A. 1999. Codon usage and lateral gene transfer in Bacillus subtilis. Curr. Opin. Microbiol., 2:524–528.

29. Nackley AG. et al. 2006. Human catechol-O-methyltransferase haplotypes modulate protein expression by altering mRNA secondary structure. Science. 314:1930-1933.

30. Palidwor GA, Perkins TJ, Xia X. 2010. A general model of codon bias due to GC mutational bias. PLOS. 5(10): e13431.

31. Parmley J.L., Hurst L.D. 2007. How do synonymous mutations affect fitness? BioEssays 29: 515-19.

32. Plotkin JB, Kudla G. 2010. Synonymous but not the same: the causes and consequences of codon bias. Nature. 12:32-42.

33. Pluhar W. 2006. AT2-AT3-profiling: A new look at synonymous codon usage. Journal of Theoretical Biology. 243:308-321.

34. Prat Y. et al. 2009. Codon usage is associated with the evolutionary age of genes in metazoan genomes. BMC Evolutionary Biology. 9:285-297.

35. Ross JF, Orlowski M. 1982. "Growth-rate-dependent adjustment of ribosome function in chemostat-grown cells of the fungus Mucor racemosus". J. Bacteriol. 149 (2): 650–3. PMC 216554. PMID 6799491.

36. Shah P, Gilchrist M. 2010. Effect of correlated tRNA abundances on translation errors and evolution of codon usage bias. Plos Genet. 6, e1001128.

37. Sharp MT, Li W. 1987. The codon adaptation index a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acid Res. 15:1281-95.

38. Sinclair, G., Choy, F.Y., 2002. Synonymous codon usage bias and the expression of human glucocerebrosidase in the methylotrophic yeast, Pichia pastoris. Protein Expr. Purif. 26, 96–105.

39. Smith N.G.C., Eyre-Walker A. 2001. Synonymous codon bias is not caused by mutation bias in G+C-rich genes in humans. Mol. Biol. Evol. 18(6):982-986.

40. Subramanian S. 2008. Nearly neutrality and the evolution of codon usage bias in eukaryotic genomes. Genetics 178: 2429-32.

41. Topal, M.D., Fresco, J.R., 1993. Base pairing and fidelity in codon-anticodon interaction. Nature 263, 289–293.

42. Trifonov E.N. 2004. The triplet code from first principles, Journal of Biomolecular Structure and Dynamics, 22:1-11.

43. Wu, L.F., Saier, M.H., 1991. Differences in codon usage among gene encoding proteins of different function in Rhodobacter capsulatus. Res. Microbiol. 142, 943–949.

44. Zhao Z, Jiang C. 2010. Features of recent codon evolution: A comparative polymorphism-fixation study. Journal of Biomedical and Biotechnology.

45. Zeberg B., 2002, Shannon information theoretic computation of synonymous codon usage biases in coding regions of human and mouse genomes. Genome Res. 12(6):944-55.