A Thesis

entitled

Mega-scale Bioinformatics Investigation of Codon Bias in Vertebrates

by

Maryam Nabiyouni

Submitted to the Graduate Faculty as partial fulfillment of the requirements for the Master of Science Degree in Biomedical Sciences

Dr. Alexei Fedorov, Committee Chair
Dr. Robert Blumenthal, Committee Member
Dr. Sadik Khuder, Committee Member
Dr. James Willey, Committee Member
Dr. Patricia R. Komuniecki, Dean, College of Graduate Studies

University of Toledo
August 2011

Abstract

Although synonymous codon usage in mammals has been investigated for decades, the interpretation of the observed results remains controversial. Selectionism cannot explain the strong regularities in codon bias, while neutralism cannot account for the unusually high frequency of G or C nucleotides (~60%) in the third codon positions across all mammals. We performed a genome-wide computational analysis of synonymous codon usage with respect to local genomic GC-content, GC-content in the first and second codon positions, and overall genome length and GC-content. In this study, the Codon Bias Index (CBI) is used to measure the codon bias of individual genes. The data presented were gathered from several publicly available databases, including the Codon Usage Database, the Animal Genome Size Database, BioGPS, and the Exon/Intron Database. Our results suggest that local GC-content is a major contributor to the non-randomness of codon usage in mammals. Based on these results, we propose a unified hypothesis for the origin of GC-rich and GC-poor isochores and of codon bias in mammals and vertebrates.

Acknowledgments

I would like to express my gratitude to my advisor, Dr. Alexei Fedorov, for his patience, motivation, enthusiasm, and knowledge. I am also thankful to the other members of my thesis committee, Dr. Robert Blumenthal, Dr. Sadik Khuder, and Dr. James Willey, for their guidance.
I would also like to thank my labmates in the Bioinformatics Proteomics and Genomics (BPG) computer laboratory at the University of Toledo, especially Dr. Ashwin Prakash for providing the gene expression level data used in this thesis, as well as Andrew McSweeny, Dr. Samuel Shepard, and Peter Bazeley, who gave me great advice on computer programming. I am also thankful to Joanne Gray in the BPG program for all her support. Finally, I am most indebted to my family and friends, especially my parents, for their love and support throughout my life.

Table of Contents

Abstract
Acknowledgments
Table of Contents
Codon Bias Phenomenon Overview
Literature
Statement of the Question
Investigation of Codon Bias in Vertebrates
Discussion
Conclusions
Appendices
Appendix A: Source Code for CBI-Calculator.pl
Appendix B: Source Code for GC-Content-Calculator.pl
References

1. Codon Bias Phenomenon Overview

Genetic information is carried from the chromosomes to the ribosomes by mRNA. Then, the “ribonucleotides are read by translational machinery into nucleotide triplets called codons” (Ross and Orlowski, 1982). Of the 64 possible codons, UAA, UAG, and UGA signal the termination of translation and the end of a polypeptide chain, while the remaining 61 codons specify amino acids. Since 61 codons code for 20 amino acids, most amino acids are encoded by more than one codon. Different codons that are translated into the same amino acid are called synonymous codons. Specifically, two amino acids are each encoded by a single codon, nine are encoded by two synonymous codons, one is encoded by three, five are encoded by four, and three are encoded by six. For instance, phenylalanine (Phe) is encoded only by UUU and UUC, whereas serine (Ser) is encoded by UCU, UCC, UCA, UCG, AGU, and AGC. Table 1.1 shows the genetic code, including the codons and their corresponding amino acids.

Table 1.1 The genetic code. Green represents start and red represents stop codons (Joseph and Schild, 2010).

One might ask whether synonymous codons are used to encode amino acids with identical frequencies in different regions of the same genome. It is now known that synonymous codons are used with different frequencies in different organisms. Some codons, such as UUG, CCA, and GCA, are almost entirely absent in organisms like Thermus thermophilus (Plotkin and Kudla, 2010).
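The degeneracy of the genetic code described above can be tallied directly. The following is a minimal Python sketch, not part of the thesis's own Perl programs; it assumes the standard genetic code written in the conventional U, C, A, G codon order, and the names GENETIC_CODE and degeneracy_classes are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, one letter per codon, with codons enumerated in
# U, C, A, G order for the first, second, and third positions.
# "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 64 codons to its amino acid (or "*").
GENETIC_CODE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def degeneracy_classes(code):
    """Return {number of synonymous codons: number of amino acids}."""
    codons_per_aa = Counter(aa for aa in code.values() if aa != "*")
    return Counter(codons_per_aa.values())


stop_codons = sorted(c for c, aa in GENETIC_CODE.items() if aa == "*")
classes = degeneracy_classes(GENETIC_CODE)

print("Stop codons:", ", ".join(stop_codons))
for n_codons in sorted(classes):
    print(f"{classes[n_codons]} amino acid(s) encoded by {n_codons} codon(s)")
```

Counting from a codon table rather than by hand guards against arithmetic slips: the classes must account for exactly 61 sense codons and 20 amino acids.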
Differences in synonymous codon frequencies are found in coding regions, and some authors suggest that they are universal among organisms (Bulmer, 1988; Desai, 2004). This phenomenon is called codon usage bias, or codon bias (CB) (Hershberg and Petrov, 2008). Bias in the usage of codons might be due to a variety of factors, such as GC-content, a preference for codons with G or C at the third nucleotide position (GC3) (Lafay et al., 1999), a leading strand richer in G+T than the lagging strand (Lafay et al., 1999), and horizontal gene transfer, which introduces chromosome segments of non-native base composition (Mozer et al., 1999). Table 1.2 demonstrates the existence of codon bias in different species.

Table 1.2 Relative frequencies of leucine synonymous codons in four species (Codon Usage Bias, science.jrank.org).

The codon bias phenomenon is clearly demonstrated in Table 1.2: the CTG codon accounts for about half of the encoded leucines in Escherichia coli, whereas its usage is about five-fold lower in Saccharomyces cerevisiae. Variation in codon usage among the coding regions of different genes from the same species, or between species, suggests a critical role for codon bias. Understanding codon usage bias and the forces that produce it may help us understand other biological processes, such as gene regulation and evolution.

In this study, I will investigate the frequency of synonymous codon usage in different species. In particular, I will focus on the creation and maintenance of codon bias and its variation among human genes. I will also study different regions of the human genome with respect to their GC and AT content, along with their level of gene expression. Finally, I will present my hypothesis on the origin of genomic GC-rich regions, commonly called GC isochores.

2. Literature

2.1. Available online tools

Numerous studies on the creation and maintenance of codon usage bias indicate its important effect on gene expression, translation, and evolution.
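Before turning to the individual tools, the two simplest statistics referred to in Chapter 1, GC3 and the relative frequencies of synonymous codons, can be illustrated with a short Python sketch. It is not taken from any of the tools described below or from the thesis's own programs; the function names and the toy coding sequence are hypothetical and serve only to show the arithmetic.

```python
from collections import Counter

# The six synonymous codons for leucine (DNA alphabet).
LEUCINE_CODONS = ("TTA", "TTG", "CTT", "CTC", "CTA", "CTG")


def split_codons(cds):
    """Split an in-frame coding sequence into codons, dropping any
    incomplete trailing codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return [cds[i:i + 3] for i in range(0, usable, 3)]


def gc3(cds):
    """Fraction of codons with G or C at the third position."""
    codons = split_codons(cds)
    if not codons:
        return 0.0
    return sum(codon[2] in "GC" for codon in codons) / len(codons)


def relative_frequencies(cds, synonymous_codons):
    """Share of each synonymous codon among all codons of that family."""
    counts = Counter(c for c in split_codons(cds) if c in synonymous_codons)
    total = sum(counts.values())
    if total == 0:
        return {codon: 0.0 for codon in synonymous_codons}
    return {codon: counts[codon] / total for codon in synonymous_codons}


# Hypothetical toy sequence, not real data: ATG CTG CTG TTA CTC CTG GCC TAA
toy_cds = "ATGCTGCTGTTACTCCTGGCCTAA"

print("GC3:", gc3(toy_cds))
for codon, freq in relative_frequencies(toy_cds, LEUCINE_CODONS).items():
    print(codon, round(freq, 2))
```

Applied to all coding sequences of a genome rather than to a single toy sequence, the same counting yields relative frequencies of the kind reported for leucine in Table 1.2.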
Several algorithms and formulas have been created to describe codon bias.

2.1.1. Codon Adaptation Index Calculator

In 2008, Pere Puigbo and colleagues created a program called CAIcal, which calculates the Codon Adaptation Index (CAI) of nucleotide sequences. CAI is a universal measure of codon bias (Carbone et al., 2003) that compares the relative codon usage of a gene to the codon usage of highly expressed genes (Peden, 1997). Highly expressed genes are chosen from a large pool of genes, such as those encoding ribosomal proteins, elongation factors, proteins involved in glycolysis, histone proteins in eukaryotes, and outer membrane proteins in prokaryotes (Carbone et al., 2003). This tool is free to download and is available online at http://genomes.urv.es/CAIcal/.

Figure 2.1. The CAIcal web-server is used to measure Codon Adaptation Index (CAI)

The CAIcal web-server computes the codon adaptation index of the input sequences, measuring the similarity between the synonymous codon usage of an input sequence (a gene) and the synonymous codon frequencies of a reference set, such as highly expressed genes.

2.1.2. CodonO

CodonO is another online tool for the analysis of synonymous codon usage bias within and across genomes, available online at http://sysbio.cvm.msstate.edu/CodonO/. Many pre-loaded genetic codes are available in the settings tool bar, including the universal and vertebrate mitochondrial genetic codes. The web-server uses the chosen genetic code to apply different computational methods and measure synonymous codon usage bias.

Figure 2.2. CodonO is used to analyze synonymous codon usage bias

CodonO provides users with options such as measuring the codon usage bias of the available pre-loaded prokaryotic and eukaryotic genomes through the Run CodonO on Organism tool bar. Moreover, they can choose to upload their own sequence files