Predictions on and Analysis of Viral Proteins Encoded by Overlapping Genes

PREDICTIONS ON AND ANALYSIS OF VIRAL PROTEINS ENCODED BY OVERLAPPING GENES Mahvash Khosravi Submitted to the faculty of the Bioinformatics Graduate Program in partial fulfillment of the requirements for the degree Master of Science in the School of Informatics, Indiana University August 2007 Accepted by the Faculty of Indiana University, in partial fulfillment of the requirements for the degree of Master of Science in Bioinformatics. Master’s Thesis Committee ___________________________ A. Keith Dunker, PhD, Chair ___________________________ Pedro Romero, PhD ___________________________ Vladimir N. Uversky, PhD ii Dedicated to the memory of my mother iii Acknowledgements I wish to express my gratitude to my thesis advisor Dr. A. Keith Dunker for his guidance, support and providing me the opportunity to work on this project. I would like to thank Dr. Pedro Romero for his advice and valuable input through the course of this research, as well as serving on my thesis committee. His guidance and encouragement is greatly appreciated. Also, I would like to thank Dr. Vladimir Uversky for serving on my thesis committee. Finally I would like to express my gratitude and love for my family. I am indebted to my husband Dr. Hossein Hariri for his endless love, dedication, patience and encouragement. Thank you Hossein. My deepest appreciation and love go to our beautiful daughters, our sweet engineer Ghazal and dear Homa who makes us so proud. iv Abstract Overlapping genes are adjacent genes that share a portion of their coding sequence. Such genes are often observed in the compact genomes of viruses, prokaryotes, and mitochondria. Overlapping genes are also seen in human and other mammalian genomes. Gene overlapping is a phenomenon to minimize genomic size and maximize encoding capacity. Overlapping genes produce different proteins. A major task in the post genomic era is the large-scale study of the structures and functions of proteins. Proteins play crucial roles in virtually all biological processes. In general it is assumed that 3-D structure determines the function of proteins, but many proteins or region of proteins may function in the absence of 3-D structure. The term “disordered” is used to describe these proteins. A large number of studies has shown that biological functions depend on both ordered and disordered proteins. Natively disordered regions are common and play essential roles in many proteins, especially, with regard to activities involved in signaling and regulation. The goal of this research was the analysis of the ordered and disordered tendencies of viral proteins encoded by overlapping genes. Our hypothesis is that, in a pair of proteins or protein regions encoded by overlapping genes, at least one of the pair is disordered (or unstructured). Our hypothesis is based on the observation that structural proteins require highly specific amino acid sequences, while unstructured (disordered) sequences are essentially unconstrained. Thus, given a structural protein and its associated mRNA sequence, any sequence derived from an overlapping reading frame seems highly unlikely to have a sequence pattern commensurate with a structural protein; on the other hand, a sequence pattern consistent with a disordered protein seems much v more likely. We performed studies on the protein products of overlapping gene sequences, tested the hypothesis and addressed the following two questions: First do the proteins encoded by overlapping genes have opposite order-disorder content, that is, does the ordered part of one of the overlapping proteins correspond to a disordered part in the other overlapping protein? Second, does the encoded protein in the overlapping regions have more disordered amino acids than the non-overlapping regions? Using our database of overlapping viral genes and the protein predictor PONDR VL3, we predicted the order-disorder of amino acids in the sequence of 97 viral protein samples. An analysis of the results supported our hypothesis and indicated that the ordered amino acids are mostly associated with non-overlapping regions while disordered amino acids are more prevalent in overlapping regions. In the overlapping regions for 52 protein pairs, we showed that most of the amino acid pairs facing each other on the protein sequences had at least one disorder for most cases. Out of 52 pairs, there were 3 protein pairs where there were no disordered amino acids and 22 protein pairs where there were no ordered amino acids on either sequence. The fraction of ordered pairs in the pool of overlapping regions of 52 protein pairs was 0.28. The non-overlapping region of 97 proteins had predominantly ordered proteins. The fraction of ordered amino acids in the pool of non-overlapping regions was determined to be 0.77. vi Table of Contents List of Tables …………………………………………………………………………...viii List of Tables ………………………………………………………………………….....ix I. Introduction………………………………………………………………………. 1 II. Background………………………………………………………………………. 2 Overlapping genes…………………………………………………………… 2 Protein structure-function paradigm………………………………………… 4 Protein folding……………………………………………………………… 6 Secondary structure of protein………………………………………………. 8 Beta bends………………………………………………………………….. 11 Tertiary structure of protein………………………………………………… 11 Natively disordered proteins – the new paradigm………………………….. 12 Characterization of natively disordered proteins…………………………… 14 NMR spectroscopy…………………………………………………………. 14 X-ray crystallography………………………………………………………. 14 Circular dichroism (CD) spectroscopy………………………………………15 Protease sensitivity…………………………………………………………. 15 Frequency of natively disordered proteins…………………………………. 16 Function of disordered proteins…………………………………………….. 17 Molecular recognition……………………………………………………… 18 Protein modification……………………………………………………….. 18 Entropic chain activities…………………………………………………… 18 Amino acid compositions of natively disordered proteins………………… 18 Prediction of natively disordered proteins………………………………… 19 Objective……………………………………………………………………. 20 III. Materials and Methods…………………………………………………………. 22 Software……………………………………………………………………. 22 Data analysis………………………………………………………………. 23 Data management…………………………………………………………. 24 Database design, construction and implementation……………………….. 32 VIRUS data schema……………………………………………………….. 33 VIRUS database construction……………………………………………… 37 Database implementation………………………………………………….. 39 Populating and query the database………………………………………… 40 Database queries…………………………………………………………… 40 Prediction of proteins encoded by overlapping genes…………………….. 43 Extraction of proteins encoded by overlapping genes……………………. 44 IV. Results…………………………………………………………………………. 46 Overlapping regions of protein pairs……………………………………… 46 Non-overlapping regions of proteins……………………………………… 50 Bootsrapping……………………………………………………………….. 51 V. Discussion……………………………………………………………………… 53 VI. Conclusions……………………………………………………………………. 58 VII. Recommendations for future work……………………………………………. 60 VIII. References …………………………………………………………………….. 61 vii List of Tables Table 1. List of software used for this research project………………………………… 23 Table 2. Dataset of Viral Proteins………………………………………………………. 24 Table 3. Description of Information on Viruses in Our Dataset…………………………32 viii List of Figures Figure 1. A Polypeptide chain with four amino acid residues…………………………… 6 Figure 2. Rigid CO-NH bond and rotation of Cα-C and Cα-N bonds……………………. 7 Figure 3. Ramachandran plot for 1000 nonglycine residues in eight proteins ……........... 8 Figure 4. Alpha helix…………………………………………………………………….. 9 Figure 5. Beta sheet………………………………………………………………………10 Figure 6. Beta bend…………………………………………………………………….. 11 Figure 7. Three dimensional protein structural motif ….................................................. 12 Figure 8. 3-D structure of Calcineurin…………………………………………………. 17 Figure 9. Schema for VIRUS database…………………………………………………. 35 Figure 10. Physical Schema Of DNA_VIRUS Database………………………………. 39 Figure 11. Examples of queries…………………………………………………………. 41 Figure 12. Example of fasta format of protein sequences………………………………. 43 Figure 13. A sample output……………………………………………………………… 45 Figure 14. Results of protein prediction in Excel Spreadsheet…………………………. 48 Figure 15. Percentages of order and disorder calculated by Excel Spreadsheet……….. 48 Figure 16. Analysis of Order-Disorder for Overlap Proteins ………………………….. 49 Figure 17. Analysis of Overlap Proteins with Disorder Breakdown…………………… 50 Figure 18. Results of protein prediction for non-overlap region ………………………. 51 Figure 19. Fraction of Ordered Amino Acids in the Non-overlap Region of Proteins… 52 Figure 20. Overlapping protein pairs with more than 50% disorder……………………. 55 Figure 21. Fraction of Ordered Amino Acids in the Non-overlap Region of Proteins… 56 Figure 22. Fraction of Disordered Amino Acids in the Entire Sequence of Proteins…. 57 ix I. Introduction Overlapping genes are adjacent genes that share a portion of their coding sequence. They are often observed in compact genome of viruses, prokaryotes, and mitochondria. Overlapping genes also occur in mammalian genomes. Overlapping genes encode different protein products using the same nucleic acid sequence. Proteins play crucial roles in virtually all biological processes. For many years it was thought that 3-D structure of proteins determine their functions [1,2,3]. But there are proteins or region of proteins which lack 3-D structure, yet such proteins and regions function in the absence of any specific fixed structure [4,5]. These proteins, called natively disordered proteins, have many important

Predictions on and Analysis of Viral Proteins Encoded by Overlapping Genes

Long Overlapping Genes in Pseudomonas Aeruginosa

Grapevine Virus Diseases: Economic Impact and Current Advances in Viral Prospection and Management1

Changes to Virus Taxonomy 2004

A Novel Short L-Arginine Responsive Protein-Coding Gene (Laob)

Origin, Adaptation and Evolutionary Pathways of Fungal Viruses

Recent Advances on Detection and Characterization of Fruit Tree Viruses Using High-Throughput Sequencing Technologies

Viral Diversity in Tree Species

In Plant Physiology

A-Lovisolo.Vp:Corelventura

The Design of Maximally Compressed Coding Sequences

Wo 2007/056463 A2

Tamada-Text R3 HK-TT