Bioinformatics Applications Through Visualization of Variations on Protein
Bioinformatics applications through visualization of variations on protein structures, comparative functional genomics, and comparative modeling for protein structure studies

A dissertation presented by
Alper Uzun
to
The Department of Biology

In partial fulfillment of the requirements for the degree of Doctor of Philosophy in the field of Biology

Northeastern University
Boston, Massachusetts
July 2009

©2009 Alper Uzun
ALL RIGHTS RESERVED

ABSTRACT OF DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biology in the Graduate School of Arts and Sciences of Northeastern University, July, 2009

Abstract

The three-dimensional structure of a protein provides important information for understanding and answering many biological questions in molecular detail. Sequenced genes and related genomic information are accumulating rapidly in biological databases. As the volume of protein sequences, protein structures, and DNA sequences grows exponentially, it becomes increasingly important to integrate biological data and to develop bioinformatics tools. At the same time, the number of known protein sequences is much larger than the number of experimentally solved protein structures, because experimental methods cannot always be applied and structures are not available for all protein sequences. Comparative protein modeling helps close this gap for protein sequences with unknown structures by constructing a three-dimensional model of a given protein sequence based on its sequence similarity to one or more known structures.
I present here several applications of comparative protein modeling: developing servers for mapping nsSNPs onto comparative protein models; studying the comparative functional genomics of HMGN3a and SMARCAL1 together with multiple sequence analysis; developing comparative models for use with other techniques in order to find active sites; and understanding the possible functional properties of proteins when substitutions occur in a given protein. Part of the presented research focused on nonsynonymous SNPs. Understanding the functional consequences of nonsynonymous changes and predicting potential causes and the molecular basis of diseases involves integrating information from multiple heterogeneous sources, including sequence data, structure data, and pathway relations between proteins. In order to visualize nsSNPs on protein structures and analyze them, a web server, Structure SNP (StSNP), was developed. It provides the ability to analyze and compare human nsSNP(s) in protein structures, protein complexes and protein–protein interfaces, where nsSNP and structure data on protein complexes are available in PDB. In the second part of the research, a comparative functional genomics analysis of HMGN3a and SMARCAL1 was carried out, combining comparative modeling with multiple protein sequence analysis across mammals. Our results showed a high degree of structural conservation of HMGN3a and SMARCAL1 in the mammalian species studied. In the third part of the research, several comparative models were built from different species in order to find active site residues by the THEMATICS method. In the last part of the research, multiple sequence alignment studies of APEX1 and DNA polymerase beta and comparative models of γ-tubulins are presented.

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my advisor, Dr.
Valentin Ilyin, for his guidance, suggestions and patience during the course of my work. Dr. Valentin Ilyin provided me with the necessary assistance and supervision during my graduate work, which will always be appreciated. I would like to thank the members of my committee, Dr. William Detrich, Dr. Kostia Bergman, Dr. Veronica Godoy, and Dr. Mary Jo Ondrechen. I am thankful for the valuable collaboration of Dr. William Detrich, Dr. Mary Jo Ondrechen, Dr. Phyllis Strauss, and Dr. Erdogan Memili. I would also like to thank the Biology Department at Northeastern University, which always supported me morally and financially during the PhD years.

I would like to thank Dr. Scott Mohr from Boston University for his encouragement and for introducing me to the field of bioinformatics. Special thanks go to Dr. Kostia Bergman and Dr. William Detrich, whose support and guidance let me take the initial steps into the field of bioinformatics. During my graduate study I had an internship at the Broad Institute of MIT and Harvard University; I would especially like to thank Dr. Jill Mesirov and Dr. Michael Reich, who not only supervised me but also kindly gave their full support along the years, even after the internship. I would like to thank my colleagues from the Ilyin lab, Alex Abyzov, Chesley Leslin, and Haifeng Weng. I am especially thankful to Alex Abyzov for patiently answering my every question.

My friends outside the lab were also very supportive and sincere. I would like to thank Ahmet Ozcan, who was like a brother to me; Taner Kaya, a genuine friend along the years, for sharing the tough times and the fun times; Murat Erdem and Karin Schon, who are always supportive and sincere; and Ata Murat Kaynar and Dilsun Kaynar, who, although they later moved to Pittsburgh, constantly show their sincere love and friendship even from a distance.
I am grateful to my mom and dad for their everlasting trust, emotional support, encouragement, and unconditional love. I spent many years away from them; I understand and appreciate their sacrifice. Since I am an only child, their patience and understanding mean a lot to me.

TABLE OF CONTENTS

Abstract
Acknowledgements
Table of Contents
List of Abbreviations
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Mapping and modeling nsSNPs on protein structures
Chapter 3. Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis
Chapter 4. Comparative model structures and active site predictions
Chapter 5. Studies on multiple sequence alignment and comparative modeling
Conclusion
Tables
Figures
References

List of Abbreviations

Abbreviation  Proper name
ALDH2         Aldehyde dehydrogenase-2
ANOLEA        Atomic non-local environment assessment
APEX1         Apurinic/apyrimidinic endonuclease 1
Arg           Arginine
AspAT         Aspartate aminotransferase
ATP           Adenosine-5'-triphosphate
BER           Base excision repair
BLAST         Basic local alignment search tool
BLOSUM        Blocks of amino acid substitution matrix
dbSNP         SNP database
DFIRE         Distance-scaled, finite ideal-gas reference
DHAP          Dihydroxyacetone phosphate
DNA           Deoxyribonucleic acid
EGA           Embryonic genome activation
GAP           D-glyceraldehyde 3-phosphate
Gln           Glutamine
GST           Glutathione S-transferase
g-T1          Gamma tubulin 1
g-T2          Gamma tubulin 2
GV            Germinal vesicle
HARP          HepA-related protein
HGVbase       Human genome variation database
HMGN          High mobility group nucleosomal
HPPK          6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase
Ile           Isoleucine
JSNP          Japanese Single Nucleotide Polymorphism
KEGG          Kyoto encyclopedia of genes and genomes
Lys           Lysine
MEGA          Molecular evolutionary genetics analysis
Met           Methionine
mRNA          Messenger ribonucleic acid
MSA           Multiple sequence alignment
MZT           Maternal to zygotic transition
NCBI          National Center for Biotechnology Information
nsSNP         Nonsynonymous single nucleotide polymorphism
PCR           Polymerase chain reaction
PDB           Protein data bank
pol-β         DNA polymerase beta
PROQ          Protein quality predictor
ProsaII       Protein structure analysis II
PSI_BLAST     Position specific iterative BLAST
RMSD          Root mean square deviation
SEDB          Structural exon database
Ser           Serine
SIOD          Schimke immunoosseous dysplasia
SMARCAL1      SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a member 1
SNF2N         Helicase like domain
SNP           Single nucleotide polymorphism
StSNP         Structure SNP
SWI/SNF       SWItching and sucrose non-fermenting in yeast
THEMATICS     Theoretical microscopic titration curves
TIM           Triosephosphate isomerase
TRIP7         TRAF-interacting protein 7
UPGMA         Unweighted pair group method with arithmetic mean

List of Figures

Figure 1. Growth of released protein structures per year.
Figure 2. Schematic of StSNP web server.
Figure 3. Data generation in StSNP.
Figure 4. nsSNPs and Glutathione S Transferase.
Figure 5. nsSNPs and Aldehyde dehydrogenase-2.
Figure 6. Phylogenetic tree for HMGN3a.
Figure 7. MSA of HMGN3a.
Figure 8. Four distinctive domains of SMARCAL1.
Figure 9. Phylogenetic tree for SMARCAL1.
Figure 10. First HARP domain of SMARCAL1.
Figure 11. MSA of the first HARP domain in SMARCAL1.
Figure 12. Second HARP domain of SMARCAL1.
Figure 13. MSA of second HARP domain in SMARCAL1.
Figure 14. SNF2N domain of SMARCAL1.
Figure 15. MSA of SNF2N domain in SMARCAL1.
Figure 16. Phylogenetic tree of helicase C terminal domain in SMARCAL1.
Figure 17. MSA of the helicase C-terminal domain in SMARCAL1.
Figure 18. Disparity Index test of HMGN3a.
Figure 19. Disparity Index test of SMARCAL1.
Figure 20. MSA of helicase like and helicase domains with modeled protein structure.
Figure 21. MSA of APEX1 and pol-β.
Figure 22. A view of model structures of g-T1 and g-T2 at the positions 303, 205.

List of Tables

Table 1. Representing query and modeling options for resources.
Table 2. Summary