Bioinformatics applications through visualization of variations on protein structures, comparative functional genomics, and comparative modeling for protein structure studies
A dissertation presented
by
Alper Uzun
to
The Department of Biology
In partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in the field of
Biology
Northeastern University Boston, Massachusetts July 2009
©2009 Alper Uzun ALL RIGHTS RESERVED
Bioinformatics applications through visualization of variations on protein structures, comparative functional genomics, and comparative modeling for protein structure studies
by
Alper Uzun
ABSTRACT OF DISSERTATION
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Biology in the Graduate School of Arts and Sciences of Northeastern University, July, 2009
Abstract
The three-dimensional structure of a protein provides important information for understanding and answering many biological questions in molecular detail. Sequenced genes and related genomic information are accumulating rapidly in biological databases. As the volume of protein sequences, protein structures and DNA sequences grows exponentially, it is important to integrate biological data and to develop bioinformatics tools. At the same time, the number of known protein sequences is much larger than the number of experimentally solved protein structures, because experimental methods cannot always be applied and structures are therefore not available for all protein sequences. Comparative protein modeling helps close this gap for protein sequences with unknown structures by constructing a three-dimensional model of a given protein sequence based on its sequence similarity to one or more known structures. I present here several applications of comparative protein modeling: developing servers for mapping nsSNPs onto comparative protein models; studying the comparative functional genomics of HMGN3a and SMARCAL1 together with multiple sequence analysis; building comparative models for use with other techniques in order to find active sites; and understanding the possible functional consequences of substitutions in a given protein.
Part of the presented research focused on nonsynonymous SNPs (nsSNPs). Understanding the functional consequences of nonsynonymous changes and predicting potential causes and the molecular basis of diseases involve the integration of information from multiple heterogeneous sources, including sequence data, structure data and pathway relations between proteins. To visualize nsSNPs on protein structures and analyze them, a web server, Structure SNP (StSNP), was developed. It provides the ability to analyze and compare human nsSNP(s) in protein structures, protein complexes and protein–protein interfaces, where nsSNP and structure data on protein complexes are available in the PDB. In the second part of the research, a comparative functional genomics analysis of HMGN3a and SMARCAL1 was carried out, combining comparative modeling with multiple protein sequence analysis across mammals. Our results showed a high degree of structural conservation of HMGN3a and SMARCAL1 in the mammalian species studied. In the third part of the research, several comparative models were built from different species in order to find active site residues by the THEMATICS method. In the last part of the research, multiple sequence alignment studies of APEX1 and DNA polymerase beta and comparative models of γ-tubulins are presented.
Acknowledgements
First and foremost, I would like to express my sincere gratitude to my advisor, Dr. Valentin Ilyin, for his guidance, suggestions and patience during the course of my work. Dr. Ilyin provided me with the necessary assistance and supervision during my graduate work, which will always be appreciated. I would like to thank the members of my committee, Dr. William Detrich, Dr. Kostia Bergman, Dr. Veronica Godoy and Dr. Mary Jo Ondrechen. I am thankful for the valuable collaboration of Dr. William Detrich, Dr. Mary Jo Ondrechen, Dr. Phyllis Strauss and Dr. Erdogan Memili.
I would also like to thank the Biology Department at Northeastern University, which always supported me morally and financially during my PhD years. I would like to thank Dr. Scott Mohr from Boston University for his encouragement and for introducing me to the field of bioinformatics. Special thanks are also due to Dr. Kostia Bergman and Dr. William Detrich, whose support and guidance allowed me to take my first steps in the field of bioinformatics.
During my graduate study I held an internship at the Broad Institute of MIT and Harvard. I would especially like to thank Dr. Jill Mesirov and Dr. Michael Reich, not only for supervising me but also for kindly giving me their full support over the years, even after the internship.
I would like to thank my colleagues from the Ilyin lab, Alex Abyzov, Chesley Leslin and Haifeng Weng. I am especially thankful to Alex Abyzov for patiently answering my every question.
My friends outside the lab were also very supportive and sincere. I would like to thank Ahmet Ozcan, who was like a brother to me; Taner Kaya, a genuine friend over the years, for sharing the tough times and the fun times; Murat Erdem and Karin Schon, who were always supportive and sincere; and Ata Murat Kaynar and Dilsun Kaynar, who, although they later moved to Pittsburgh, constantly showed their sincere love and friendship even from a distance.
I am grateful to my mom and dad for their everlasting trust, emotional support, encouragement and unconditional love. I spent many years away from them; I understand and appreciate their sacrifice. Since I am an only child, their patience and understanding mean a lot to me.
TABLE OF CONTENTS
Abstract
Acknowledgements
Table of Contents
List of Abbreviations
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Mapping and modeling nsSNPs on protein structures
Chapter 3. Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis
Chapter 4. Comparative model structures and active site predictions
Chapter 5. Studies on multiple sequence alignment and comparative modeling
Conclusion
Tables
Figures
References
List of Abbreviations
Abbreviations Proper name
ALDH2 Aldehyde dehydrogenase-2
ANOLEA Atomic non-local environment assessment
APEX1 Apurinic/apyrimidinic endonuclease 1
Arg Arginine
AspAT Aspartate aminotransferase
ATP Adenosine-5'-triphosphate
BER Base excision repair
BLAST Basic local alignment search tool
BLOSUM Blocks of amino acid substitution matrix
dbSNP SNP database
DFIRE Distance-scaled, finite ideal-gas reference
DHAP Dihydroxyacetone phosphate
DNA Deoxyribonucleic acid
EGA Embryonic genome activation
GAP D-glyceraldehyde 3-phosphate
Gln Glutamine
GST Glutathione S-transferase
g-T1 Gamma tubulin 1
g-T2 Gamma tubulin 2
GV Germinal vesicle
HARP HepA-related protein
HGVbase Human genome variation database
HMGN High mobility group nucleosomal
HPPK 6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase
Ile Isoleucine
JSNP Japanese Single Nucleotide Polymorphism
KEGG Kyoto encyclopedia of genes and genomes
Lys Lysine
MEGA Molecular evolutionary genetics analysis
Met Methionine
mRNA Messenger ribonucleic acid
MSA Multiple sequence alignment
MZT Maternal to zygotic transition
NCBI National Center for Biotechnology Information
nsSNP Nonsynonymous single nucleotide polymorphism
PCR Polymerase chain reaction
pol-β DNA polymerase beta
PROQ Protein quality predictor
ProsaII Protein structure analysis II
PSI_BLAST Position specific iterative BLAST
RMSD Root mean square deviation
SEDB Structural exon database
Ser Serine
SIOD Schimke immunoosseous dysplasia
SMARCAL1 SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a member 1
SNF2N Helicase like domain
SNP Single nucleotide polymorphism
StSNP Structure SNP
SWI/SNF SWItching and sucrose non-fermenting in yeast
THEMATICS Theoretical microscopic titration curves
TIM Triosephosphate isomerase
TRIP7 TRAF-interacting protein 7
UPGMA Unweighted pair group method with arithmetic mean
List of Figures
Figure 1. Growth of released protein structures per year.
Figure 2. Schematic of StSNP web server.
Figure 3. Data generation in StSNP.
Figure 4. nsSNPs and Glutathione S Transferase.
Figure 5. nsSNPs and Aldehyde dehydrogenase-2.
Figure 6. Phylogenetic tree for HMGN3a.
Figure 7. MSA of HMGN3a.
Figure 8. Four distinctive domains of SMARCAL1.
Figure 9. Phylogenetic tree for SMARCAL1.
Figure 10. First HARP domain of SMARCAL1.
Figure 11. MSA of the first HARP domain in SMARCAL1.
Figure 12. Second HARP domain of SMARCAL1.
Figure 13. MSA of second HARP domain in SMARCAL1.
Figure 14. SNF2N domain of SMARCAL1.
Figure 15. MSA of SNF2N domain in SMARCAL1.
Figure 16. Phylogenetic tree of helicase C terminal domain in SMARCAL1.
Figure 17. MSA of the helicase C-terminal domain in SMARCAL1.
Figure 18. Disparity Index test of HMGN3a.
Figure 19. Disparity Index test of SMARCAL1.
Figure 20. MSA of helicase like and helicase domains with modeled protein structure.
Figure 21. MSA of APEX1 and pol-β.
Figure 22. A view of model structures of g-T1 and g-T2 at the positions 303, 205.
List of Tables
Table 1. Representing query and modeling options for resources.
Table 2. Summary of resources.
Table 3. Organisms and protein accession numbers used in MSA of SMARCAL1.
Table 4. Organisms and protein accession numbers used in MSA of HMGN3a.
Table 5. Predicted clusters for orthologous structures for three templates.
Table 6. Predicted active site clusters for additional orthologous sets for the templates.
Table 7. Predicted active site clusters for nine homologous sets.
Chapter 1. Introduction
Comparative modeling of protein structures
The three-dimensional structure of a protein provides important information for understanding and answering many biological questions in molecular detail. Sequenced genes and related genomic information are accumulating rapidly in biological databases. As of June 2009, more than 58,000 experimentally determined protein structures had been deposited in the Protein Data Bank (PDB) [1] (see Figure 1). Because experimental methods cannot always be applied, protein structures are not available for all protein sequences. Computational methods and tools can be useful for predicting these structures.
Comparative protein structure modeling (or homology modeling) techniques have been developed to build a three-dimensional model of a given protein sequence on the basis of its alignment to at least one known structure, called a template, with which it shares sequence similarity. For a protein structure to be predicted by comparative modeling, two conditions have to be met [2-4].
1) The sequence to be modeled must have measurable similarity to another sequence of known structure. More than 50% sequence similarity, for example, is a good basis for modeling [5-7]. Accuracy decreases as the percentage identity falls below approximately 30% (the so-called ‘twilight zone’) [8].
2) It must be possible to compute an accurate alignment between the protein sequence and the template structure. High accuracy comparative models are usually based on more than 50% sequence identity to their templates [5].
The degree of sequence similarity between the protein sequence and the template is an
important predictor of the model accuracy. Higher sequence similarity (such as more than
50%) to the template is a sign of a more accurate model [2-7].
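As a rough illustration of how these thresholds might be applied, the sketch below computes the percent identity of a pairwise alignment and labels it with the regimes mentioned above. The function names and the choice to ignore gap columns are assumptions made for this example, and the sketch uses identity (exact matches) rather than similarity.

```python
def percent_identity(aligned_target: str, aligned_template: str) -> float:
    """Percent identity over the non-gap columns of a pairwise alignment.

    Both strings must come from the same alignment (equal length, '-' for gaps).
    Ignoring gap columns is an assumption of this sketch.
    """
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def modeling_regime(identity: float) -> str:
    """Label an identity value using the 50% and 30% thresholds from the text."""
    if identity > 50.0:
        return "high accuracy expected"
    if identity >= 30.0:
        return "intermediate"
    return "twilight zone"


if __name__ == "__main__":
    identity = percent_identity("ACDE-FG", "ACDQKFG")
    print(round(identity, 1), modeling_regime(identity))
```

The labels are only a reading aid; as discussed in the text, sequence identity is one predictor of model accuracy among several.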
Comparative modeling starts from the target sequence, which serves as the query sequence in the search for template(s). In general, it consists of four main steps:
1) Identifying experimentally solved structure(s) that can be used as template(s) for modeling the protein sequence.
2) Accurately aligning the protein sequence (target) to the template(s).
3) Building the three-dimensional model on the basis of the alignment(s).
4) Evaluating the quality of the model [2, 4, 9].
The search for templates for the query sequence (step 1) can be carried out with three classes of methods:
a) Pairwise comparison methods, such as the Basic Local Alignment Search Tool (BLAST) [10] and FASTA [11], which align the protein sequence with all the sequences in the database of known structures.
b) Profile-sequence alignment methods, such as PSI_BLAST [12], which depend on profiles derived from multiple sequence alignments in order to increase the accuracy and sensitivity of the template search.
c) Methods that use a combination of sequence and structure considerations to detect similarities between sequences and structures. The protein sequence is threaded through a library of 3-D profiles or folds, and each threading is evaluated with a scoring function. Superfamily [13] and GenThreader [14] are examples of this class.
I now compare these three classes briefly. Pairwise sequence comparison methods are the least sensitive and are best used to detect close homologs; profile-based methods are usually capable of recognizing homologs sharing only approximately 25% sequence identity; and threading methods can sometimes recognize common folds even in the absence of any statistically significant sequence similarity [15, 16]. During my Ph.D. research I mainly used the first class in studies that depended on comparative modeling.
Aligning the protein sequence and the template is the second step of modeling. It is a critical step in the model-building process because it largely determines the accuracy of the resulting model. A direct correlation has been shown between the sequence identity of a pair of protein structures and the deviation of the Cα atoms of their common core [9, 17]. The more similar two sequences are, the more closely their structures can be expected to correspond [9].
Since this is a vital step in modeling, an erroneous alignment will lead to the construction of an incorrect model. Decreasing sequence identity, alignment errors and the incorrect modeling of large insertions are major sources of inaccuracy. As the percentage identity falls below approximately 30% (the so-called ‘twilight zone’) [8], model quality estimation on the basis of sequence identity becomes unreliable [2, 8, 9, 18-20].
In the third step of comparative modeling, the protein sequence-template alignment maps the protein sequence onto the template structure. This mapping is used to build the 3-D model of the protein sequence. There are several methods of building the model:
1) Assembling the model from a small number of rigid bodies which are gathered from
the aligned protein structures [3].
2) Modeling by segment matching or coordinate reconstruction [21, 22]. This method can
construct main-chain and side-chain atoms. It can also model unaligned regions [23].
3) Loop modeling. The protein sequence to be modeled may contain inserted residues that have no corresponding regions in the template(s). Since the template provides no structural information about these inserted residues, loop modeling may be needed at this point. There are two main classes of loop modeling methods:
a) database search, which scans the available databases of protein structures for segments that fit the anchor core regions [24, 25];
b) conformational search, which depends on the optimization of a scoring function [26, 27].
Some methods combine these two approaches [28, 29].
4) Side chain modeling. Two simplifications are commonly applied when modeling side chains. The first is that amino acid replacement leaves the backbone unchanged (145), so the backbone can be fixed during the search for the best side chain conformations. The second is that most side chains in high-resolution crystallographic structures can be represented by a limited number of conformers that comply with stereochemical and energetic constraints [30]. On this basis, libraries of side-chain rotamers have been derived [31-33]. Rotamers on a fixed backbone are often used when all the side chains need to be modeled on a given backbone. This approach overcomes the combinatorial explosion associated with a full conformational search of all the side chains, and is applied by some comparative modeling and protein design approaches [3, 34].
The last step in comparative modeling is assessing the accuracy of the models. Two important factors influence the ability to predict accurate models: the extent of structural conservation between the protein sequence and the template, and the correctness of the alignment [4, 35, 36]. Models based on templates with more than 50% sequence identity are generally very accurate; they can exhibit a Cα root mean square deviation (RMSD) of approximately 1 Å from the experimental structure.
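To make the Cα RMSD measure concrete, the following minimal sketch computes it for two equal-length lists of Cα coordinates. It assumes that the model and the experimental structure have already been superimposed and that the atoms are listed in matching order; the function name and the toy coordinates are illustrative only.

```python
import math
from typing import Sequence, Tuple

Coordinate = Tuple[float, float, float]


def ca_rmsd(model: Sequence[Coordinate], reference: Sequence[Coordinate]) -> float:
    """Root mean square deviation between paired C-alpha coordinates (in Angstrom).

    Assumes the two structures are already superimposed; no fitting is done here.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared_sum = 0.0
    for (xm, ym, zm), (xr, yr, zr) in zip(model, reference):
        squared_sum += (xm - xr) ** 2 + (ym - yr) ** 2 + (zm - zr) ** 2
    return math.sqrt(squared_sum / len(model))


if __name__ == "__main__":
    # Toy coordinates: every atom is displaced by exactly 1 Angstrom along z.
    model_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    reference_atoms = [(0.0, 0.0, 1.0), (3.8, 0.0, 1.0)]
    print(ca_rmsd(model_atoms, reference_atoms))
```

In practice the optimal superposition would be computed first, which is outside the scope of this sketch.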
Sequence identity between the protein sequence and the template is not the only parameter for estimating the difficulty or the quality of a comparative model [35-37]. The quality of the alignment between the two sequences also depends on the number and the similarity distribution of the sequences in the multiple sequence alignment [38]. Given the unprecedented growth of both structural and sequence databases, improvements in the quality of comparative models seem to be largely due to the increased availability of sequences and structures homologous to the protein of interest [35, 38]. Several programs are available on the web for testing the accuracy of 3-D models, including DFIRE [39], VERIFY3D [40], ANOLEA [41], PROQ [42] and ProsaII [43].
MODELLER [4] and SWISS-MODEL [44] are commonly used programs for building models, and both are open to public use.
An introduction to multiple sequence alignment
The discovery that the sequences of different organisms are often related is highly significant for evolutionary research and analysis in biology. Many genes are represented in highly conserved forms in a wide range of organisms, and similar genes are conserved across broadly divergent species. Usually they perform very similar functions, but an alteration caused by mutation or rearrangement can change the function of a gene. Such alterations and similarities can be seen through simultaneous alignment of the sequences of the genes or proteins.
With such an alignment we can see the patterns of change among species, among different tissue types, or between the genomes of diseased and healthy individuals. In this sense, multiple sequence alignment (MSA) is a very powerful tool for analyzing and understanding the functional and structural properties of genes and proteins [45, 46].
Multiple sequence alignments are broadly used in computational analysis of protein sequences, comparative structure modeling, phylogenetic analysis, functional site prediction, and sequence database searching [45-47]. The purpose of constructing a multiple sequence alignment is to arrange residues with an inferred common evolutionary origin, or with functional and structural similarities, in the same column position across a set of sequences. Multiple sequence alignment provides position-specific information about conservation, correlation, and residue usage for several applications [46-48]. The reliability of the information acquired from an MSA therefore depends on the quality of the alignment. In recent years, the precise and rapid construction of MSAs has been under intensive research, and several methods and programs have been developed to improve the quality of the alignments [48].
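The position-specific conservation information mentioned above can be illustrated with a simple per-column score. The sketch below reports, for each alignment column, the fraction of non-gap residues that match the most common residue. This particular score, the function name and the toy alignment are assumptions chosen for illustration; they are not the measures used by any specific alignment program.

```python
from collections import Counter
from typing import List, Sequence


def column_conservation(alignment: Sequence[str]) -> List[float]:
    """Fraction of non-gap residues equal to the most common residue, per column."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps carry no residue information
        if not residues:
            scores.append(0.0)
            continue
        most_common_count = Counter(residues).most_common(1)[0][1]
        scores.append(most_common_count / len(residues))
    return scores


if __name__ == "__main__":
    toy_alignment = ["MKV", "MRV", "M-V"]  # hypothetical three-sequence alignment
    print(column_conservation(toy_alignment))
```

Columns scoring close to 1.0 are candidates for conserved positions, although the score says nothing about the biochemical similarity of the substituted residues.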
One of the most commonly used programs for multiple sequence alignment is CLUSTAL, which has been in use since 1988 [49, 50]. CLUSTALW [51] is a more recent version of CLUSTAL. The “W” stands for weighting, which refers to the ability of the program to assign weights to the program parameters and sequences [52]. CLUSTALW is designed to provide an adequate alignment of a large number of closely related sequences and a reliable indication of the domain structure of those sequences. The program also has options for adding one or more additional sequences, with weights, or an alignment to an existing alignment [52, 53].
Once an alignment has been made, a phylogenetic tree can be built by the neighbor-joining method or by the Unweighted Pair Group Method with Arithmetic mean (UPGMA). UPGMA guide trees in particular help to speed up the alignment of extremely large data sets. The predicted trees can also be displayed by various programs [46, 52].
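The UPGMA procedure can be sketched in a few lines: the two closest clusters are merged repeatedly, and the distance from the merged cluster to every other cluster is the size-weighted average of the original distances. The implementation below is a minimal illustration under the assumption that pairwise distances are supplied as a dictionary keyed by `frozenset` pairs; it returns only the tree topology as nested tuples and does not compute branch lengths. The names and the toy distances are hypothetical.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree topology as nested tuples.

    names: list of leaf labels.
    distances: dict mapping frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}  # cluster -> number of leaves it contains
    dist = dict(distances)               # work on a copy; merged clusters are added
    while len(sizes) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda pair: dist[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = (a, b)
        for other in sizes:
            # Size-weighted arithmetic mean of the distances to the merged members.
            dist[frozenset((merged, other))] = (
                size_a * dist[frozenset((a, other))]
                + size_b * dist[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes))


if __name__ == "__main__":
    toy = {
        frozenset(("A", "B")): 2.0,
        frozenset(("A", "C")): 6.0,
        frozenset(("B", "C")): 6.0,
    }
    print(upgma(["A", "B", "C"], toy))
```

With the toy distances, A and B are joined first because they are the closest pair, and C is attached afterwards.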
Nonsynonymous single nucleotide polymorphism
Single nucleotide polymorphisms (SNPs) represent one of the most common forms of genetic variation in a population [54, 55]. SNPs are DNA sequence variations that occur when a single nucleotide (A, T, C or G) in the genome sequence is altered. For example, a SNP might alter the DNA sequence “…CGTGATTACGATTA…” to “…CGTGTTTACGATTA…”. For a variation to be considered a SNP, it must occur in at least 1% of the population [54, 56]. SNPs constitute about 90% of all human genetic variation and occur with very high frequency, with estimates ranging from about 1 in 1000 bases to 1 in 100 to 300 bases. More than 99% of human DNA sequence is the same across individuals, but variations in DNA sequence may have a major impact on how humans respond to disease, environmental factors and drugs [54]. A SNP or set of SNPs may not be directly responsible for a disease, but the sheer number of SNPs means that they can also be used to locate genes that influence such traits [57]. This makes SNPs highly valuable for biomedical research and for developing pharmaceutical products, computational methods and bioinformatics tools.
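The definition of a SNP given here, a single altered nucleotide observed in at least 1% of the population, can be illustrated with a short sketch. The code compares two equal-length sequences position by position and applies the 1% frequency threshold. The sequences, allele counts and function names are illustrative assumptions, not data from dbSNP.

```python
from typing import List, Tuple


def single_nucleotide_differences(reference: str, variant: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, variant base) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length; indels are not SNPs")
    return [
        (index + 1, ref_base, var_base)
        for index, (ref_base, var_base) in enumerate(zip(reference, variant))
        if ref_base != var_base
    ]


def meets_snp_frequency(minor_allele_count: int, total_alleles: int, threshold: float = 0.01) -> bool:
    """True if the minor allele frequency reaches the 1% threshold from the text."""
    if total_alleles <= 0:
        raise ValueError("total_alleles must be positive")
    return minor_allele_count / total_alleles >= threshold


if __name__ == "__main__":
    # Illustrative sequences differing by a single A-to-T substitution.
    print(single_nucleotide_differences("CGTGATTACGATTA", "CGTGTTTACGATTA"))
    print(meets_snp_frequency(3, 200))
```

A length mismatch raises an error on purpose, because insertions and deletions are a different class of variation from single nucleotide substitutions.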
As of May 2009, the public SNP database (dbSNP, build 130) [58] contains 17.8 million SNP candidates, of which 6.5 million have been validated, that is, experimentally verified. Nonsynonymous SNPs (nsSNPs) are SNPs located within the open reading frame of a gene that result in an alteration of the amino acid sequence of the encoded protein. They may directly or indirectly affect the function of the protein alone or its interactions in a multi-protein complex, increasing or decreasing the activity of the metabolic pathway [54, 59]. nsSNPs have been linked to a wide variety of diseases because they can affect protein function, alter DNA and transcription factor binding sites, reduce protein solubility and destabilize protein structures [59]. Understanding the functional consequences of nonsynonymous changes and predicting potential causes and the molecular basis of diseases therefore involve the integration of information from multiple heterogeneous sources, including sequence data, structure data and pathway relations between proteins.
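Whether a coding SNP is nonsynonymous can be decided by translating the codon before and after the substitution. The sketch below builds the standard genetic code table and classifies a single-base change within one codon. The function name is an assumption, and the codons in the usage example are chosen only to illustrate a glutamate-to-lysine and an isoleucine-to-valine change of the kind discussed in this dissertation; they are not taken from a database record.

```python
from itertools import product

# Standard genetic code, listed with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon: str, position: int, new_base: str):
    """Classify a single-base change inside a codon.

    position is 0-based within the codon. Returns (kind, reference amino acid,
    variant amino acid). A change to a stop codon ('*') is reported as
    nonsynonymous in this simplified sketch.
    """
    codon = codon.upper()
    new_base = new_base.upper()
    if len(codon) != 3 or position not in (0, 1, 2) or new_base not in BASES:
        raise ValueError("expected a 3-base codon, a position 0-2 and a DNA base")
    mutated = codon[:position] + new_base + codon[position + 1:]
    reference_aa = CODON_TABLE[codon]
    variant_aa = CODON_TABLE[mutated]
    kind = "synonymous" if reference_aa == variant_aa else "nonsynonymous"
    return kind, reference_aa, variant_aa


if __name__ == "__main__":
    print(classify_coding_snp("GAA", 0, "A"))  # glutamate codon to lysine codon
    print(classify_coding_snp("GAA", 2, "G"))  # glutamate codon to glutamate codon
    print(classify_coding_snp("ATC", 0, "G"))  # isoleucine codon to valine codon
```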
SNP information is currently collected in several databases, including dbSNP, the Human Genome Variation Database (HGVbase) [60], the Japanese Single Nucleotide Polymorphism (JSNP) database [61] and the HapMap Project [54]. A number of studies and resources have begun to explore the effects of nsSNPs on the tertiary structure of proteins and their functionality; SNPs3D [56], PolyPhen [62], TopoSNP [63], ModSNP [64], LS-SNP [65], SNPeffect [66], MutDB [67, 68], Snap [69] and StSNP [70] have all been released for public use. A brief description of the available resources for SNP analysis is presented in Tables 1 and 2. Note that these are reference tables rather than comparison tables: the field is in its infancy and all resources are currently evolving, each with its own strengths.
Chapter 2. Mapping nsSNPs on protein structures.
StSNP; a web server for mapping and modeling nsSNPs on protein structures
StSNP is a web-based server that provides the ability to analyze and compare human nsSNP(s) in protein structures, protein complexes and protein–protein interfaces, where nsSNP and structure data on protein complexes are available in the PDB, along with analysis of the metabolic data within a given pathway. nsSNPs usually do not inactivate protein function completely, since such a mutation would most likely be lethal. Instead, nsSNPs change protein activity to some degree, either directly or indirectly through interactions with other proteins in the pathway; such information therefore has to be considered together. StSNP was developed for this purpose. It draws on information from different sources and provides ‘on the fly’ comparative modeling of the wild-type and mutated proteins (when an appropriate structural template is available), along with real-time analysis and visualization of structures and sequences [71], to assist researchers in the visual inspection of the possible effects of nsSNPs on protein structure. Users can analyze data in different formats with StSNP. Search options include keyword, NCBI protein accession number, PDB ID and NCBI nsSNP ID, which help users quickly retrieve targeted information.
Design and implementation sources
In general, the internal database structure has been inherited from the Structural Exon Database (SEDB) [72]. StSNP was implemented using a MySQL database running on a Linux server, with Perl scripts used for all data retrieval and output (Figure 2). StSNP utilizes three major data sources:
(1) Protein sequences from NCBI,
(2) The reference and nsSNP locations from NCBI’s dbSNP,
(3) Structures and sequences from the PDB.
Every protein sequence has a pre-calculated list of structural modeling templates, found by BLAST and stored in a database for quick retrieval. The actual alignment of the protein sequence and the PDB sequence is computed with the Smith–Waterman algorithm [73], using similarity-specific scoring matrices from BLOSUM30 to BLOSUM90 [74]. Pathway information is taken from KEGG [75, 76], human gene/protein information is gathered from NCBI’s Entrez Gene [77], and the comparative modeling phase is done by MODELLER. The modeling part of StSNP is interactive: it allows the user to choose a template from the list, select particular mutations to be modeled, calculate the model, and subsequently visualize the superimposition of the models and the template in the Friend applet. Additionally, structurally similar proteins/models can be analyzed simultaneously in the Friend applet for structural correlation of nsSNP locations, using the TOPOFIT structure alignment method [78, 79].
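For readers unfamiliar with the Smith–Waterman algorithm, the following minimal sketch shows the local alignment recurrence and traceback. It is not the StSNP implementation: it replaces the BLOSUM matrices with a simple match/mismatch score and uses a linear gap penalty, both of which are simplifying assumptions, and the example sequences are arbitrary.

```python
def smith_waterman(seq_a: str, seq_b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Local alignment with a simple scoring scheme.

    Returns (best score, aligned part of seq_a, aligned part of seq_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best_score, best_cell = 0, (0, 0)

    def substitution(i: int, j: int) -> int:
        return match if seq_a[i - 1] == seq_b[j - 1] else mismatch

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                0,
                score[i - 1][j - 1] + substitution(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best_score:
                best_score, best_cell = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_cell
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        if score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return best_score, "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


if __name__ == "__main__":
    print(smith_waterman("AAAGATTACAAA", "CCGATTACCC"))
```

Replacing the `substitution` helper with a lookup in a BLOSUM matrix would bring the sketch closer to the scoring described above.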
StSNP currently (June 2009) contains 107,028 nsSNPs, 31,834 protein sequences, 12,741
genes and 30,517 protein structures.
Web server features
StSNP has several types of search options, including search by Protein ID, PDB ID or keyword, all of which integrate nsSNP-related information. For example, the Protein ID search displays the known nsSNP(s) for the protein, while the PDB ID search provides a list of similar Protein IDs with nsSNP(s). Both searches provide a link to pathway information if the data are available. The resulting report pages provide the user with options for model template selection. Only templates satisfying the following two criteria are shown: the nsSNP(s) must lie within the alignment of the protein sequence with the template, and the sequence similarity of the alignment must be at least 30%. The modeling step lets the user choose which nsSNPs to map, and after completion the user can instantly visualize the models with the Friend applet. StSNP also has several browsing and search capabilities, for example searching for available structures by protein length and percent similarity, or by a specifically chosen reference and nonsynonymous residue within a particular chromosome. The features of StSNP have been designed with graphics, plots and easily readable tables, with the end user in mind.
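The two template selection criteria can be expressed as a simple filter. The sketch below is a hypothetical illustration, not code from StSNP: the `TemplateHit` record, its field names and the template identifiers are invented for the example, the 30% value is read as a minimum, and a template is kept if at least one nsSNP position falls inside its alignment range.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class TemplateHit:
    """Hypothetical record for one candidate template of a protein sequence."""
    template_id: str
    align_start: int   # first residue of the protein covered by the alignment (1-based)
    align_end: int     # last residue covered by the alignment (inclusive)
    similarity: float  # percent sequence similarity of the alignment


def usable_templates(
    hits: Iterable[TemplateHit],
    nssnp_positions: Iterable[int],
    min_similarity: float = 30.0,
) -> List[TemplateHit]:
    """Keep templates that cover at least one nsSNP and meet the similarity cutoff."""
    positions = list(nssnp_positions)
    return [
        hit
        for hit in hits
        if hit.similarity >= min_similarity
        and any(hit.align_start <= pos <= hit.align_end for pos in positions)
    ]


if __name__ == "__main__":
    candidates = [
        TemplateHit("tmplA", 1, 200, 45.0),  # covers the nsSNPs, similar enough
        TemplateHit("tmplB", 1, 100, 60.0),  # similar enough, but covers no nsSNP
        TemplateHit("tmplC", 1, 200, 25.0),  # covers the nsSNPs, below the cutoff
    ]
    print([hit.template_id for hit in usable_templates(candidates, [105, 147])])
```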
Examples of use 1
Glutathione S Transferase
Glutathione S Transferase (GST, Protein ID: NP_000843) is a family of multifunctional enzymes involved in cellular detoxification of xenobiotics and reactive endogenous compounds of oxidative metabolism. nsSNPs of GST were mapped onto the protein structure, as shown in Figure 3 [80]. The output page reports the available reference and nonsynonymous residues for the protein with their rs numbers, the amino acid properties of the variations, and a picture of the alignment of the protein sequence with the template, including the nsSNP locations. An rs number, or reference SNP ID, is an identification tag assigned by NCBI to a group or cluster of SNPs that map to an identical location [81]. In this example, all nsSNPs are located inside the alignment and are thus available for mapping onto PDB ID 1aqv chain B. The next step is to choose the nsSNPs for modeling. All the known nsSNPs associated with GST (I105V, T110S, A114V, D147Y and L176M) have been modeled in this example and are presented in Figure 4A.
A black circle denotes where isoleucine has changed to valine at position 105. The role of the functional I105V GSTP1 polymorphism in the pathogenesis of methamphetamine abuse has been studied, with researchers noting that individuals with the G allele (valine) are expected to have decreased GST detoxification [80]. The mapping of this nsSNP onto the protein structure (Figure 4A) shows that I105V is in direct contact with glutathione and could potentially have a strong effect on GST activity or on its binding affinity for glutathione. The results section also provides the user with a link to glutathione metabolism in order to view other members of the pathway (Figure 4B).
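Substitutions written in the form I105V (reference residue, position, variant residue) can be applied to a protein sequence programmatically before the mutated sequence is passed to a modeling program. The sketch below parses this notation and checks that the reference residue matches the sequence. The function name and the short toy sequence are assumptions for illustration; the sequence is not the GST sequence.

```python
import re
from typing import Iterable

# One-letter reference residue, 1-based position, one-letter variant residue.
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_mutations(sequence: str, mutations: Iterable[str]) -> str:
    """Return a copy of the sequence with the given point substitutions applied."""
    residues = list(sequence)
    for mutation in mutations:
        parsed = MUTATION_PATTERN.match(mutation)
        if parsed is None:
            raise ValueError(f"cannot parse mutation: {mutation}")
        reference, position, variant = parsed.group(1), int(parsed.group(2)), parsed.group(3)
        if not 1 <= position <= len(residues):
            raise ValueError(f"position out of range: {mutation}")
        if sequence[position - 1] != reference:
            raise ValueError(f"reference residue does not match sequence: {mutation}")
        residues[position - 1] = variant
    return "".join(residues)


if __name__ == "__main__":
    toy_sequence = "MKTIIALSY"  # hypothetical nine-residue sequence
    print(apply_mutations(toy_sequence, ["I4V", "S8T"]))
```

Checking the reference residue guards against applying a substitution to a sequence with different numbering, for example an isoform or a template chain.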
Examples of use 2
Aldehyde Dehydrogenase-2
Aldehyde Dehydrogenase-2 (ALDH2, Protein ID: NP_000681) is illustrated in Figure 5. ALDH2 is involved in the oxidation of acetaldehyde at the physiological concentrations found when a person consumes alcohol. Worldwide, the Lys504 allele has its highest prevalence (30–50%) in Asian populations [82]. In this example, glutamate is replaced by lysine at position 504 (Glu504Lys), a substitution that has been demonstrated to essentially eliminate ALDH2 activity [83]. These examples show how a quick search in StSNP, in conjunction with structural mapping of the nsSNP locations, provides structural support for the medical studies mentioned here and may facilitate the design of future experiments.
Discussions and Conclusions
StSNP provides practical, user-friendly access to the wealth of information related to nsSNPs by seamlessly connecting various databases into one pipeline. Key functional and structural information, along with the known pathways in which the proteins are involved, has been linked together to give users several advantages over other current resources:
(a) the sequence, structure and pathway information have all been cross-referenced, which enables a user to quickly query and visualize the inter-related nsSNP data;
(b) a graphical display of the nsSNPs provides a user with the location of the nsSNP(s) in terms of primary sequence, and whether such nsSNP(s) can be modeled;
(c) the modeling options provide the user with a choice of which nsSNPs to map, and make it possible to visualize which nsSNPs could potentially have deleterious effects on a protein’s function;
(d) the modeled protein structures are automatically loaded in Friend, where they can be easily viewed, compared and analyzed;
(e) finally, StSNP will be updated on a regular basis following the updates of its major sources, dbSNP, PDB, KEGG and others.
Chapter 3. Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis.
Stages of early embryonic development
Early embryonic development in general is initiated when mature oocytes (MII) are fertilized by spermatozoa. Maternal factors, such as mRNAs, microRNAs and proteins stored in the oocyte, provide the means of support for the first few days of development.
The transition from a maternal to a zygotic control of development, called maternal to zygotic transition (MZT), and the activation of the embryonic genome involve chromatin structural modifications that take place during the first few embryonic cell cycles [84].
Embryonic genome activation (EGA) sets the stage for later development [85, 86].
Changes in chromatin structure have been characterized throughout the transition from transcriptional incompetence to the minor activation of the zygotic genome at the 1-cell stage and through the major genome activation at the 2-cell stage in murine embryos [87]. In bovine embryos, EGA occurs at the 8- to 16-cell stage with extensive programming of gene expression. However, the regulation of chromatin remodeling during EGA still remains a mystery [88].
Description of Chromatin remodeling
Chromatin remodeling is an extensive process occurring during early embryogenesis. An essential property of the embryonic chromatin structure is to prevent the access of the transcriptional machinery to all of the promoters in the genome [88]. The expression of some genes may be mediated by chromatin remodeling proteins. Chromatin remodeling complexes may change the overall pattern of expression of mammalian genes, allowing transcription factors and signaling pathways to produce different genomic transcriptional responses to common signals [89]. This is particularly important for preimplantation embryos starting cell differentiation cascades that will lead to tissue and organogenesis.
These changes in chromatin structure generate activation of the transcriptional machinery and gene expression occurring during early embryo development, leading to a unique chromatin structure capable of maintaining totipotency during embryogenesis and differentiation during postimplantation development [86, 88].
High mobility group nucleosomal protein family
The High Mobility Group Nucleosomal (HMGN) protein family is the only group of nuclear proteins that bind to the 147-base pair long nucleosome core particle with no sequence specificity [90]. HMGN proteins are present in the nuclei of all mammalian and most vertebrate cells at approximately 10% of the abundance of histones [91]. They bind as homodimers to the nucleosome and cause chromatin modifications that facilitate and enhance several DNA-dependent activities, such as transcription, replication and DNA repair. This protein family is composed of 3 members: HMGN1 (also known as HMG-14), HMGN2 (also known as HMG-17), and the most recently discovered HMGN3, initially named TRIP7 for its ability to bind the thyroid hormone receptor [92].
In the mouse, HMGN1 and HMGN2 have been detected throughout oogenesis and preimplantation development and are progressively down-regulated throughout the entire embryo, except in cell types undergoing active differentiation [93]. Since a reduction in the levels of HMGN1 and HMGN2 mRNA also occurs during myogenesis in the rat, this decrease suggests that down-regulation of HMGN mRNA may be associated with tissue differentiation [94]. Depletion of HMGN1 and HMGN2 in one- or two-cell embryos delays subsequent embryonic divisions. Cells derived from HMGN1-/- mice have an altered transcription profile and are hypersensitive to stress [93]. Experimental manipulations of the intracellular levels of HMGN1 in X. laevis embryos cause specific developmental defects at the post-blastula stages. Furthermore, HMGN proteins regulate the expression of specific genes during X. laevis development [95]. Several lines of evidence implicate HMGN1 and HMGN2 in transcriptional regulation. Chromatin containing genes that are actively being transcribed has two to three times more HMGN1 and HMGN2 than total chromatin [88, 93].
Description of HMGN3
The human HMGN3 transcript produces two splice variants: HMGN3a, the long isoform
with 99 amino acids, and HMGN3b, with 77 amino acids, which arises from a truncation of the fifth exon. Although no HMGN3b protein has been identified in the rat and cow,
ESTs with high identity to it suggest that this splice variant may also exist in these
species. HMGN3a constitutes a family of relatively low molecular weight non-histone
components of about 100 amino acid residues. The cow, mouse, and rat HMGN3a
proteins share more than 81% identity with the human HMGN3a protein [92]. The role of
HMGN3a has not been studied in mammalian development. Although the exact function
of HMGN3a during early embryonic development has not been determined, its role in
facilitating chromatin modifications and enhancing transcription, replication, and DNA
repair is critical for early embryo development [92].
SWI/SNF protein family
Another important mechanism in the regulation of chromatin structure in the early embryo is mediated by nucleosome repositioning factors, which are ATP-dependent chromatin-remodeling enzymes. Nucleosome repositioning factors use energy released by ATP hydrolysis to alter histone-DNA contacts and reposition nucleosomes to create chromatin environments that are either open or compact. These factors do not rely on sequence-specific DNA binding sites, but rather are recruited onto promoter regions by specific transcription factors. Nucleosome repositioning factors typically exist as multi-subunit protein complexes, like the SWI/SNF (from SWItching and Sucrose Non-Fermenting in yeast) ATP-dependent chromatin remodeling complex [96]. SWI/SNF complexes are thought to regulate transcription of certain genes by altering the chromatin structure around them with their helicase and ATPase activities [97, 98]. In mammals, each SWI/SNF complex has one of two distinct ATPases, SMARCA2 or SMARCA4, as its catalytic subunit [99]. Both ATPases have important developmental functions. In primates, expression of both subunits remains constant and low throughout embryogenesis until the blastocyst stage [100]. In mouse embryos, Smarca4 transcripts remain at stable levels throughout preimplantation development, while Smarca2 transcripts remain low until the blastocyst stage, when their mRNA levels increase [101]. In porcine embryos, SMARCA2 transcripts are most abundant in germinal vesicle (GV) stage oocytes and decline progressively during embryo development to the blastocyst stage [102]. Mutant mice lacking the Smarca4 gene die at preimplantation, while the Smarca2-null mouse mutant is viable and shows a mild overgrowth phenotype [103, 104].
Description of SMARCAL1
Another member of the SWI/SNF family of proteins involved in chromatin remodeling is
SMARCA1 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a member 1), considered a global transcription activator and also called
SNF2L1. Like other SWI/SNF members, the SMARCA1 protein has a helicase ATP-binding domain. However, since the rest of its motifs diverge from other members of the SWI/SNF family, it has been classified in the ISWI (for Imitation SWItch) subfamily of ATPases, together with SMARCA5. Decreasing levels of SMARCA5 were found during Rhesus monkey embryogenesis from GV oocytes until the blastocyst stage. The same study reported low levels of SMARCA1 throughout all stages of embryogenesis except for the 8-cell stage [100].
Members of the SNF2 subfamily of SWI/SNF proteins are characterized by their seven motifs (I, Ia, II, III, IV, V and VI) [105]. SMARCAL1 (SWI/SNF related, matrix associated, actin dependent regulator of chromatin, subfamily a-like 1) is one of the SNF2 members and shows high sequence similarity to the E. coli RNA polymerase-binding protein HepA [105]. Recent reports have linked mutations in the SMARCAL1 gene with Schimke immunoosseous dysplasia (SIOD), a human autosomal recessive disorder with the diagnostic features of spondyloepiphyseal dysplasia (a descriptive term for a group of disorders with primary involvement of the vertebrae and epiphyseal centers, resulting in short-trunk disproportionate dwarfism), renal dysfunction, and T-cell immunodeficiency [106-108]. The ability of SMARCAL1 to interact primarily with nucleosomes was demonstrated using protein interaction microarrays. SMARCAL1 transcripts are ubiquitously expressed in different human and mouse tissues, suggesting a role in normal cellular functions or housekeeping activities, such as transcriptional regulation [105]. Although no studies have reported the expression of SMARCAL1 during early embryogenesis in mammals, our collaborator Dr. Memili and his group previously detected a 7-fold increase of the SMARCAL1 mRNA in 8-cell bovine embryos as compared with MII oocytes by using oligonucleotide microarray gene expression analysis and Real Time PCR validation [86]. Additionally, studies on the SWI/SNF complex associated factor SMARCC1 (also called SRG3 and BAF155), a core subunit of the SWI/SNF complex, have highlighted the importance of the ATPase
subunits and the whole complex during embryogenesis. In the absence of Smarcc1,
mouse embryonic development ceased during preimplantation stages, indicating that
Smarcc1, as well as the chromatin-remodeling process, plays an essential role in early
mouse development [109]. SMARCC1 mRNA was found in high levels in GV stage
Rhesus monkey oocytes and at very low levels throughout early embryogenesis but was higher later at the hatched blastocyst stage [100, 105].
Comparative functional genomics analyses of HMGN3a across mammals
I used HMGN3a sequences from seven mammalian species (the number of species depends on the availability of data in current databases) to construct the HMGN3a phylogenetic tree (Figure 6). The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. All positions containing gaps and missing data were eliminated from the dataset. There were a total of 95 positions in the final dataset, of which 18 were parsimony informative. A site is called parsimony-informative if it contains at least two types of nucleotides or amino acids, and at least two of them occur with a minimum frequency of two. The most significant observation in the multiple sequence alignment of HMGN3a was the insertion of an alanine in the fifth exon of the Bos taurus protein (highlighted in red in Figure 7).
Several substitutions in the bovine sequence were shared by other mammals in the alignment. Macaca mulatta and Canis familiaris HMGN3a proteins have longer
sequences with regions not shared with the other species. We focused on the regions of
the protein shared by all species. Other alanine substitutions are also marked with stars
in the alignment (Figure 7).
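The parsimony-informative criterion defined above can be illustrated with a short sketch. This is not the MEGA 4 implementation used in this work; the function names and the toy alignment are hypothetical and serve only to show how the definition, combined with the removal of columns containing gaps or missing data, is applied column by column.

```python
from collections import Counter

GAP_CHARS = "-?"  # assumed symbols for gaps and missing data


def is_parsimony_informative(column):
    """True if at least two residue types each occur at least twice."""
    counts = Counter(column)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_informative_sites(alignment):
    """Count informative columns after complete deletion of gapped columns."""
    informative = 0
    for column in zip(*alignment):
        if any(ch in GAP_CHARS for ch in column):
            continue  # column removed, as described in the text
        if is_parsimony_informative(column):
            informative += 1
    return informative


# Toy alignment (hypothetical sequences, not the HMGN3a data)
toy = [
    "MKAAE",
    "MKAAE",
    "MRATE",
    "MRA-D",
]
print(count_informative_sites(toy))
```

In the toy alignment only the second column qualifies: it contains two residue types (K and R) that each occur twice. The fourth column is discarded because it contains a gap, and the last column has a second residue type that occurs only once.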
Comparative functional genomics analyses of SMARCAL1 across mammals
SMARCAL1 has four conserved domains (Figure 8). The first and the second are two
HARP (HepA-related protein) domains, each approximately 60 residues long, with
single-stranded DNA-dependent ATPase activity. The third conserved domain is a helicase-like
domain named the SNF2 N-terminal domain, and the fourth is a helicase C-terminal domain
[105].
I used the complete SMARCAL1 protein sequences from 9 mammalian species (the number of
species depends on the availability of data in current databases) to construct the
phylogenetic tree (Figure 9). The percentage of replicate trees in which the associated
taxa clustered together in the bootstrap analysis (500 replicates) is shown next to the
branches. All positions containing gaps and missing data were eliminated from the
dataset. There were a total of 856 positions in the final dataset, of which 198 were
parsimony informative. I used each one of the four SMARCAL1 conserved domains to
build separate multiple sequence alignments and construct separate phylogenetic trees for
each domain (Figures 10, 12, 14 and 16). Phylogenetic analysis shows that while Homo sapiens, Pan troglodytes and Macaca mulatta cluster together, Equus caballus, Canis familiaris and Bos taurus occupy relatively distant positions in the tree. Rattus norvegicus and Mus musculus separate these organisms in the tree. When the first and the second HARP domains of SMARCAL1 are compared, there is also an easily identifiable separation between the group of Canis familiaris, Rattus norvegicus and Mus musculus and the group of Pan troglodytes, Homo sapiens and Macaca mulatta. Equus caballus is closer to the second group in the first HARP domain. Monodelphis domestica (gray short-tailed opossum) is the most distant species among the 9 mammals in the tree. For the first HARP domain, the positions at which substitutions occur are highlighted in yellow (Figure 11). Monodelphis domestica was the most distantly related mammal with respect to this domain. Substitutions were observed in 24 positions. On the
4th substitution, glutamate, a medium-size acidic amino acid, is replaced by alanine, a small hydrophobic amino acid, in Bos taurus. On the 8th substitution, Bos taurus, Equus caballus, and Canis familiaris have a serine, which is replaced by asparagine in Pan troglodytes and Homo sapiens, and by arginine in Macaca mulatta and Rattus norvegicus. Additionally, Mus musculus has a histidine and Monodelphis domestica has a lysine at this position. On the 10th substitution, Bos taurus, Equus caballus, Canis familiaris, and Monodelphis domestica have an alanine, a small hydrophobic amino acid, whereas Pan troglodytes, Homo sapiens, and Macaca mulatta have aspartate, a medium-size acidic amino acid. Both Rattus norvegicus and Mus musculus have phenylalanine at this position. These substitutions along the HARP sequence may suggest that the distribution of acidic residues is conserved. Although Bos taurus is the only species that has alanine instead of glutamate at the 4th position, this may still suggest that the acidic property at that position is well conserved across species. For the second HARP domain, there were
34 positions with amino acid substitutions in at least one of the species studied. These
substitutions are highlighted in the alignment (Figure 13). Again Monodelphis domestica
was the most distant species for this domain.
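The bootstrap percentages shown next to the branches can be read as the share of replicate trees in which a given group of taxa appears as a clade. The sketch below illustrates only this counting step. The representation of a tree as a set of clades, the function name, and the four toy replicates are assumptions made for illustration; they are not the output of the 500-replicate analysis reported here.

```python
def bootstrap_support(clade, replicate_trees):
    """Percentage of replicate trees that contain the given clade.

    Each tree is represented as a set of clades, and each clade as a
    frozenset of taxon names (an assumed, simplified representation).
    """
    clade = frozenset(clade)
    hits = sum(1 for tree in replicate_trees if clade in tree)
    return 100.0 * hits / len(replicate_trees)


# Toy replicates (hypothetical, for illustration only)
rodents = {"Rattus norvegicus", "Mus musculus"}
replicates = [
    {frozenset(rodents), frozenset({"Homo sapiens", "Pan troglodytes"})},
    {frozenset(rodents)},
    {frozenset({"Homo sapiens", "Pan troglodytes"})},
    {frozenset(rodents), frozenset({"Homo sapiens", "Macaca mulatta"})},
]
print(bootstrap_support(rodents, replicates))
```

In this toy case the rodent clade is present in three of the four replicates, so its support is 75%. In a real analysis each replicate tree is inferred from an alignment whose columns have been resampled with replacement.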
In the phylogenetic tree analysis of the HARP1 domain, significantly higher bootstrap values were observed for Rattus norvegicus, Mus musculus, Pan troglodytes and Homo sapiens. In the second HARP domain, high bootstrap values were conserved only in Rattus norvegicus and Mus musculus. For both domains, Monodelphis domestica was the most distant mammal among the 9 species. When we compared the first and second HARP domains of SMARCAL1, there was also an easily identifiable separation between the group of Canis familiaris, Rattus norvegicus and Mus musculus and the group of Pan troglodytes, Homo sapiens, and Macaca mulatta. Equus caballus was observed closer to the second group in the first HARP domain.
The phylogenetic tree of SNF2N, the third domain of SMARCAL1, has a composition similar to the first two phylogenetic trees (Figure 14), but the most significant difference is that Canis familiaris is closer to Bos taurus.
For the last domain of SMARCAL1, one of the clearest observations is the lowering of bootstrap values between Bos taurus and GLEAN_20241 compared with the other phylogenetic trees. In addition, the phylogenetic and sequence analysis detected 11 parsimony-informative sites, and 46 of the sites are conserved among the species (Figure 15). Positions with insertions and deletions are marked with stars. The first insertion comprises 3 additional amino acids (glutamate, leucine, and lysine) present only in the Equus caballus protein. There is a deletion of the amino acid arginine, which is present in all species except for the NCBI bovine sequence; GLEAN_20241, however, does not have the deletion. The amino acid threonine is also absent in both Rattus norvegicus and Mus musculus. The bovine NCBI sequence showed significant mutations of the third conserved domain, marked in red on the alignment. However, the sequenced official gene set for this protein (Bovine Genome Database http://racerx00.tamu.edu/bovine) shows a
higher homology to all species, differing in only 2 amino acids from the horse and human
protein. These findings indicate sequencing errors in the currently available bovine
SMARCAL1 protein. These errors will likely be corrected with the completion of the
bovine genome annotation effort. The bovine helicase C-terminal domain protein shows a
deletion (marked with a star) and several substitutions highlighted in red (Figure 17) that
do not exist in GLEAN_20241. These observations point to the need for an update in
the SMARCAL1 protein sequence currently available at NCBI. In addition to our analysis, we applied the disparity index, ID [110], which measures the observed difference in evolutionary patterns for a pair of sequences. The disparity index for HMGN3a (Figure 18) did not show any significant pairs of species. The disparity index for each domain of SMARCAL1 is presented in Figure 19. In the first HARP domain (Figure 19A), 6 pairs of species (Bos taurus-Rattus norvegicus, Pan troglodytes-Rattus norvegicus, Homo sapiens-Rattus norvegicus, Macaca mulatta-Rattus norvegicus, Canis familiaris-Rattus norvegicus, and Mus musculus-Homo sapiens) were considered significant. The disparity index did not reveal differences in evolutionary patterns for the second HARP domain (Figure 19B). There were 9 significant pairs in the SNF2 N-terminal domain disparity index (Bos taurus-Equus caballus, Bos taurus-Pan troglodytes, Bos taurus-Homo sapiens, Bos taurus-Macaca mulatta, Bos taurus-Rattus norvegicus, Bos taurus-Monodelphis domestica, Equus caballus-Pan troglodytes, Equus caballus-Homo sapiens, Equus caballus-Macaca mulatta) (Figure 19C). In the disparity index for the helicase C-terminal domain, only the pair Bos taurus-Canis familiaris was significant (Figure 19D).
Analysis of the conserved/non-conserved regions on the comparative modeled structure of SMARCAL1
Since template protein structures are available only for the helicase-like and helicase domains of SMARCAL1, a comparative model was built covering only these domains. The sequence identity between the template and the target sequence was 24%. Based on the multiple sequence alignments, all non-conserved residues were mapped onto the modeled structure (Figure 20). The SNF2N and helicase C domains contain nucleotide-binding and ATP-binding residues. There are 9 residues responsible for nucleotide binding and 8 residues for ATP binding, which were retrieved from the literature [111, 112]. These locations were mapped onto the modeled structure. The analysis showed that all ATP-binding residues lie in the conserved regions. Although Threonine781 is among the residues responsible for nucleotide binding, it falls in a non-conserved region of the protein sequence. The multiple alignment shows that at this location only one species (Mus musculus) has a variation, a proline (Figure 12). This substitution changes the side chain polarity, hydrophobicity and size at that position.
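The step of locating non-conserved residues in a multiple sequence alignment and expressing them as residue numbers of the modeled protein can be sketched as follows. This is a minimal illustration rather than the procedure carried out in Friend: the function name is hypothetical, the toy sequences are invented, and a column is treated as conserved only when every sequence carries the same residue, which is a stricter definition than may have been used in the analysis.

```python
def nonconserved_positions(alignment, reference_index=0, first_residue=1):
    """Residue numbers, in the reference sequence, of non-conserved columns.

    A column counts as conserved only if all sequences share the same
    residue and none has a gap (an assumed, strict definition).
    """
    reference = alignment[reference_index]
    positions = []
    residue_number = first_residue - 1
    for col, ref_char in enumerate(reference):
        if ref_char == "-":
            continue  # no reference residue to map onto the model
        residue_number += 1
        if len({seq[col] for seq in alignment}) > 1:
            positions.append(residue_number)
    return positions


# Toy alignment (hypothetical); the reference is numbered from residue 10
toy = ["MKT-LE", "MKTALE", "MRTALD"]
print(nonconserved_positions(toy, reference_index=0, first_residue=10))
```

The returned residue numbers can then be used to color or select the corresponding residues in a structure viewer. Numbering follows the reference sequence, so alignment columns where the reference has a gap are skipped.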
Methods of comparative functional genomics analyses of HMGN3a and SMARCAL1 across mammals
Protein sequences of SMARCAL1 were retrieved from NCBI by performing a protein BLAST search against the mammalian database using Bos taurus SMARCAL1 (NP_788839) as the query protein. Sequence data were manipulated with the Friend software, a bioinformatics application designed for simultaneous analysis and visualization of multiple structures and sequences of proteins, DNA or RNA. A multiple sequence alignment of the nine mammalian SMARCAL1 protein sequences listed in Table 3 was created using Clustal W within Friend. We defined conserved regions based on the domains listed in the Pfam database [113], which catalogs conserved amino acid sequence regions. The same steps were applied for constructing the HMGN3a phylogenetic tree, for which we used the only available Reference Sequence protein for Bos taurus (NP_001029676). Since the availability of mammalian HMGN3a sequences is limited, we excluded Monodelphis domestica from the HMGN3a phylogenetic analyses, which were conducted in MEGA 4 [110]. The 7 mammalian species used in our analysis of HMGN3a are shown in Table 4. The Maximum Parsimony method was used for inferring the evolutionary history when creating the phylogenetic trees for both SMARCAL1 and HMGN3.
Methods of comparative modeling of SMARCAL1
Because a suitable template was available only for SMARCAL1, a comparative model was created for this protein alone. Chain A of PDB entry 1z63, which shares 24% sequence identity with the SMARCAL1 protein sequence of Bos taurus, was used as the template. The template covered residues 422 to 869, which span two domains of SMARCAL1: the helicase-like and helicase domains.
Comparative modeling was performed with MODELLER 9v1 [4]. Structural analysis was done in Friend, and the model picture was created with Chimera [114].
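A sequence identity figure such as the one quoted for the template can be computed from a pairwise alignment as sketched below. The function name and the toy sequences are hypothetical, and the choice of denominator (columns in which both sequences have a residue) is an assumption; other conventions, such as dividing by the length of the shorter sequence, give somewhat different values, and the convention used by MODELLER is not restated here.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where both sequences have a residue."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == "-" or b == "-":
            continue  # gapped columns are not compared
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy pairwise alignment (hypothetical, not SMARCAL1 or 1z63)
target = "MKT-LEAG"
template = "MRTALE-G"
print(round(percent_identity(target, template), 1))
```

In the toy alignment six columns are compared and five of them are identical.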
Discussions and conclusions
In the analysis, the bovine HMGN3a and SMARCAL1 showed a high degree of homology in all studied mammals. This high structural conservation highlights the importance of chromatin remodeling in the regulation of gene expression, particularly during early embryonic development. Understanding the interactions between these proteins and their roles could improve our understanding of epigenetics in reproduction and disease. Appropriate models for the study of chromatin remodeling proteins are essential to understanding this process, particularly in the case of diseases like SIOD,
caused by a mutation in the SMARCAL1 gene. The greater similarities of the HMGN3a
and SMARCAL1 proteins in human and bovine species may suggest that more attention
should be paid to a bovine model in the study of chromatin remodeling [88].
Chapter 4. Comparative model structures and active site predictions.
Description of Theoretical Microscopic Titration Curves
THEMATICS (Theoretical Microscopic Titration Curves) is a computational predictor of
the active sites of enzymes from protein structure [115-119]. For the corresponding
protein, the electrical potential function is computed by using Finite Difference Poisson-
Boltzmann methods. Then the predicted titration curves for all of the ionisable residues in
the protein structure are calculated [116]. The shapes of the predicted titration curves are
analyzed for identification of residues with elongated, non-sigmoidal titration behavior
[116]. A cluster of two or more such anomalous residues in physical proximity is a highly
reliable predictor of the active site of the protein [115, 116].
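For orientation, the typical sigmoidal behavior against which anomalous curves are judged is that of a single, non-interacting ionisable group, which follows the Henderson-Hasselbalch equation. The sketch below computes only this ideal reference curve. It is not the THEMATICS calculation, which derives the curves from the computed electrical potential function; the function names and the pKa value are illustrative assumptions.

```python
def protonated_fraction(ph, pka):
    """Henderson-Hasselbalch fraction of an isolated site that is protonated."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def mean_charge(ph, pka, is_acid):
    """Mean net charge of an isolated site as a function of pH.

    Acidic groups go from 0 (protonated) to -1 (deprotonated);
    basic groups go from +1 (protonated) to 0 (deprotonated).
    """
    f = protonated_fraction(ph, pka)
    return f - 1.0 if is_acid else f


# Illustrative basic site with an assumed pKa of 6.5
for ph in (4.5, 5.5, 6.5, 7.5, 8.5):
    print(ph, round(mean_charge(ph, 6.5, is_acid=False), 3))
```

For such an ideal site the charge passes through its half-way value at pH equal to the pKa, and nearly the whole transition is completed within about two pH units on either side. A residue whose computed curve stays partially charged over a much wider pH range is the kind of elongated, non-sigmoidal case that the method looks for.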
A major strength of THEMATICS is that it requires only the three-dimensional structure of the query protein as input. The query protein does not need to have any similarity in sequence or in structure to any previously characterized protein [115, 116]. However, the method also has a disadvantage: a three-dimensional structure of the protein must exist. This raises the question of whether an experimentally determined structure is required or a theoretical model structure is sufficient [116].
Building comparative model structures for THEMATICS
To answer this question, comparative model structures were built from sequence homology [7]. The present work shows that THEMATICS can predict active site locations in comparative model structures [7, 115]. The procedure starts with an experimentally determined template structure [1, 120] and the sequence of the query protein. Sequence alignments and comparative structure modeling [6, 121] were performed using the integrated application Friend, which interfaces with ClustalW and with MODELLER. For the modeling itself, pairwise alignments were done in MODELLER. The titration curves are calculated for all of the ionisable residues in each of the template and comparative model structures [116]. The curves are then analyzed to select those that deviate most from the typical sigmoidal shape. Most of the curves possess the characteristic sigmoidal shape, with a sharp fall-off in charge in the region around the midpoint, as predicted by the Henderson-Hasselbalch equation [116]. Only a small fraction (about 3 to 7%) of the ionisable residues deviate from the typical behavior. THEMATICS identifies the deviant ones [116]. A search is then performed for a cluster of residues with deviant titration behavior that are in physical proximity. A residue is deemed to belong to a cluster if it is a nearest neighbor, or is within 7 Å, of another cluster member [116]. Since these clusters are highly reliable predictors of active site location in the protein structure, they are called THEMATICS positives [115].
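The distance part of the clustering rule can be sketched as a simple single-linkage grouping. This is an illustration under stated assumptions, not the THEMATICS code: each residue is reduced to one representative coordinate, only the 7 Å criterion is implemented (the nearest-neighbor criterion is omitted), and the function name, residue labels and coordinates are hypothetical.

```python
import math


def cluster_residues(coords, cutoff=7.0):
    """Group residues so that each lies within `cutoff` of another member.

    coords maps a residue label to an (x, y, z) tuple in angstroms.
    Only clusters with two or more members are returned.
    """
    unvisited = set(coords)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster = {seed}
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            near = {r for r in unvisited
                    if math.dist(coords[current], coords[r]) <= cutoff}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        if len(cluster) >= 2:
            clusters.append(sorted(cluster))
    return sorted(clusters)


# Hypothetical residues with deviant titration curves (toy coordinates)
deviant = {
    "A": (0.0, 0.0, 0.0),
    "B": (5.0, 0.0, 0.0),
    "C": (10.0, 0.0, 0.0),
    "D": (30.0, 0.0, 0.0),
}
print(cluster_residues(deviant))
```

Residues A and C are 10 Å apart, yet they fall into the same cluster because each is within 7 Å of B. Residue D has no neighbor within the cutoff and is therefore not reported.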
Comparative protein modeling for THEMATICS
In this part comparative models and their related results for THEMATICS will be presented.
Triosephosphate isomerase (TIM) orthologs
The conversion of D-glyceraldehyde 3-phosphate (GAP) to dihydroxyacetone phosphate (DHAP) is catalyzed by triosephosphate isomerase (TIM). The x-ray crystal structure of TIM from chicken (PDB ID: 1tph), with a resolution of 1.8 Å, was obtained from the Protein Data Bank. Since TIM is active as a dimer, the calculations are performed on the dimer.
The first of the four structures homologous to the chicken TIM structure 1tph is built from the sequence for Schistosoma japonicum, with 60% sequence identity in the pairwise alignment and a 0.16 Å RMSD for the model structure. The second model is built from the sequence for Enterococcus faecalis, with 40.2% sequence identity and a 0.29 Å RMSD with respect to the template structure. The third model is built from the sequence of Bartonella henselae, with 38.7% identity and an RMSD of 0.31 Å, and the last model is built from the sequence of Mycoplasma genitalium, with 33% identity and an RMSD of 1.73 Å. These structures are all obtained with MODELLER and are summarized in Table 5.
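The RMSD values quoted for the models measure the average distance between equivalent atoms of the model and the template. A minimal sketch of the calculation is given below. It assumes that the two structures have already been superimposed and that the atom lists are in one-to-one correspondence (for example, equivalent alpha carbons); the function name and the toy coordinates are hypothetical, and the superposition step itself is not shown.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched coordinate lists.

    Both lists hold (x, y, z) tuples in angstroms, in the same atom order,
    and are assumed to be superimposed already.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Toy coordinates (hypothetical): every atom is displaced by 1 angstrom
template_atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_atoms = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(template_atoms, model_atoms))
```

Because the deviations are squared before averaging, a few badly placed atoms can dominate the value, which is worth keeping in mind when comparing models built at lower sequence identity.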
Table 5 gives the THEMATICS result for the active site cluster for each template
structure and the orthologous model structures. Known active site residues are shown in
boldface and “second shell” residues (those immediately adjacent to known active site
residues but not considered to be in the active site) are underlined. For the TIM structure
from chicken (1tph), four neighboring residues with anomalous titration behavior are
identified as the active site cluster. Two of these residues, H95 and E165, are well
established by experiment as catalytically active residues [116, 122, 123].
Two other residues, C126 and Y164, are located in the active site cleft but any possible
catalytic role for these residues has not been investigated experimentally. Upon alignment
of the sequences and superposition of the structures, it is confirmed that all four of these
residues are conserved, both in the sequence and in the spatial arrangement of the active
site cleft, in all of the four model structures. Sequence alignment across a wider range of
species again reveals high conservation of all four of these residues. THEMATICS
locates the known active site residues on the models. In addition, the other two positions,
C126 and Y164, may serve as a guide for future work to determine whether these two
residues also belong to the active site or play a supporting role.
6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) orthologs
6-Hydroxymethyl-7,8-dihydropterin pyrophosphokinase (HPPK) is a monomeric pyrophosphate transferase. Its crystal structure from E. coli (PDB ID: 1hka), at 1.5 Å resolution, was obtained from the Protein Data Bank. Four models homologous to the E. coli structure 1hka are built using the MODELLER software from the sequences for the following organisms: Vibrio vulnificus (with 63% sequence identity with E. coli and 0.34 Å RMSD), Vibrio parahaemolyticus (with 57% sequence identity and 0.22 Å RMSD), Pseudomonas aeruginosa (with 51% sequence identity and 0.36 Å RMSD) and Pseudomonas putida (with 48% sequence identity and 0.50 Å RMSD). The residues of the predicted active site cluster are all conserved across the four species for which the model structures are built and are also
generally well conserved across bacterial kinases. When the four model structures are
superimposed onto the template E. coli structure, the positions of these residues are
conserved in the active site pocket with similar orientations. For the HPPK case,
THEMATICS identifies the same cluster for all four of the model structures as for the E.
coli template structure (see Table 5).
Aspartate aminotransferase (AspAT) orthologs
The structure of the pyridoxamine 5'-phosphate dependent enzyme aspartate aminotransferase from E. coli at 2.2 Å resolution (PDB ID: 1amr) is used as the template. Its fold is a unique aminotransferase fold. AspAT is active as a homodimer and the calculations are performed on the dimer structure. Using the MODELLER software, four model structures homologous to the AspAT template from E. coli are constructed from the sequences for the following organisms: Vibrio cholerae (with 62% pairwise identity and 1.52 Å RMSD), Oryza sativa (with 44% identity and 0.64 Å RMSD), Neisseria meningitidis (41% identity and 1.28 Å RMSD), and Clostridium perfringens (22% identity and 3.67 Å RMSD). For all four models and for the template, THEMATICS
finds the active site cluster, although the list of identified residues is a little different for
each species for the AspAT case (see Table 5).
Additional examples of comparative protein models for application of THEMATICS
Tables 6-7 give THEMATICS results for additional sets of enzymes with a significant variety of different folds and chemical functions. The comparative structures range from
93% to 22% sequence identity with templates. Table 6 gives the active site cluster
predicted by THEMATICS for eight sets of orthologous proteins. Results are given for
the eight templates and a total of 31 comparative model structures.
In these examples (in Table 6), the homologues are presumed to have the same function
as the template. Table 7 gives the active site cluster predicted by THEMATICS for nine
more sets of proteins, including nine templates and 36 comparative models. In the
examples given in Table 7, there may be variation in function among some of the
members of the homologous sets.
In most cases, active sites in the comparative model structures that are similar to those of
the corresponding templates are located by THEMATICS. There are also some examples
with low sequence identity where the predicted active site cluster is similar to the
template. However, in a couple of cases involving distant homologues, the predicted
active site is quite different from that of the template.
Discussion and conclusions
THEMATICS successfully locates the active sites for the comparative models. In some
cases, there is some minor variation in the list of important residues, but the catalytically
active residues almost always seem to be properly identified. In a couple of cases involving remote homologues, the predicted active site residues are quite different from
those of the template. This may happen because the function of the remote homologue is
different, or because the quality of the comparative model may not be adequate for
THEMATICS to predict the active site residues. On the other hand, for almost all of the
cases studied, the comparative model structures are good enough to obtain an accurate active site prediction from THEMATICS.
Chapter 5. Studies on multiple sequence alignment and comparative modeling.
Multiple sequence analyses of APEX1 and pol-β
Abasic (apurinic/apyrimidinic, AP) sites are one of the most common lesions in DNA. It
has been estimated that approximately 10,000 AP sites are formed in each mammalian
cell per day under normal physiological conditions [124-127]. They can occur in DNA as a result of spontaneous hydrolysis of the N-glycosylic bond or the removal of altered bases by DNA glycosylases [124]. They are potentially mutagenic and lethal lesions that can prevent normal DNA replication and transcription [127]. The cell has systems
to recognize and repair such sites; the base excision repair (BER) pathway is specifically responsible for the repair of alkylation and oxidative DNA damage.
Apurinic/apyrimidinic endonuclease 1 (APEX1) cleaves the phosphodiester backbone 5’ to the AP site [2–4]. The cleavage, which is a key step in the BER pathway, is followed by nucleotide insertion and removal of the downstream deoxyribose moiety, performed most often by DNA polymerase beta (pol-β) [5]. The fact that nucleotide insertion
requires cleavage of the AP site suggests interaction of the two enzymes. While several
biochemical studies indicate interaction between the two proteins, the details of the
interaction remain unknown.
One of the collaborative projects in my Ph.D. research focused on
predicting the most likely protein-protein interface between APEX1 and pol-β by
applying a new methodology [126]. This methodology relies on the assumption,
validated by experimental evidence, that both proteins must bind to DNA in order to
interact. Analysis of the simulated protein behavior in water suggests how protein
interaction might be coupled to conformational changes in DNA polymerase beta [126].
Moreover, multiple sequence alignment of APEX1 and pol-β in related organisms
identified a set of correlated mutations of specific residues at the predicted interfaces
[126]. Methods and results are presented below.
Multiple Sequence Analysis of APEX1 and pol-β Proteins
BLAST searches against the non-redundant protein sequence database were performed
using human APEX1 (accession number NP_001632) and pol-β (accession number
NP_002681) as query sequences. Fourteen eukaryotic organisms were identified in which
both proteins were available. The selected sequences were aligned
using the CLUSTALW program.
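The organism selection step can be sketched as an intersection of two hit tables. This is a minimal illustration, not the procedure actually used: it assumes the BLAST hits have already been reduced to one accession per organism, the function name `organisms_with_both` is hypothetical, and every organism name and accession other than the two human accessions quoted above is a placeholder.

```python
def organisms_with_both(hits_a, hits_b):
    """Map each organism found in both hit tables to its pair of accessions."""
    common = sorted(set(hits_a) & set(hits_b))
    return {org: (hits_a[org], hits_b[org]) for org in common}


# Placeholder hit tables: organism name -> accession of the selected hit.
# Only the two human accessions come from the text; the rest are invented.
apex1_hits = {
    "Homo sapiens": "NP_001632",
    "Organism A": "ACC_A1",
    "Organism B": "ACC_B1",
}
polb_hits = {
    "Homo sapiens": "NP_002681",
    "Organism A": "ACC_A2",
    "Organism C": "ACC_C2",
}

paired = organisms_with_both(apex1_hits, polb_hits)
for organism, (apex1_acc, polb_acc) in paired.items():
    print(organism, apex1_acc, polb_acc)
```

Only organisms present in both tables are kept, so each retained organism contributes one sequence to each of the two alignments.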
Results for correlated mutations of the interface residues
If APEX1 and pol-β evolved to form a molecular complex so that the specificity of their
interaction optimized the function of the BER pathway, then it may be expected that the
network of inter-residue contacts constrains the protein sequences. This suggests that
changes accumulated during the evolution of one of the interacting proteins would be
compensated by changes in the other [16]. Consistent with this expectation, correlated mutations in the predicted
interface regions of APEX1 and pol-β were observed across a variety of species. Multiple
sequence analyses of the two proteins show correlated mutations at the interface of the
two proteins in the 3’-complex (Figure 21). In particular, Arg221 of APEX1 and Gln31 of
pol-β, which interact in the 3’-complex with pol-β in the closed conformation, were
changed in five organisms to Lys and Arg, respectively. In four of these organisms there
was also correlated variation of Ser275 in APEX1 and Ser109 in pol-β, but these residues
did not interact in the predicted complex. In addition, in S. purpuratus there was one
more coordinated change in interacting residues: Gly225 of APEX1 was mutated to Ser
and Ile33 of pol-β was mutated to Met.
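The idea of a correlated mutation can be illustrated by comparing one alignment column from each protein across the same organisms. The sketch below is a simplified illustration under stated assumptions, not the analysis behind Figure 21: the function name `correlated_changes`, the organism labels, and the toy residue pattern are invented, and "changed" simply means "differs from the reference organism".

```python
def correlated_changes(column_a, column_b, reference):
    """Classify organisms by whether each column differs from the reference."""
    ref_a, ref_b = column_a[reference], column_b[reference]
    result = {"both": [], "only_a": [], "only_b": [], "neither": []}
    for organism in sorted(set(column_a) & set(column_b)):
        if organism == reference:
            continue
        changed_a = column_a[organism] != ref_a
        changed_b = column_b[organism] != ref_b
        if changed_a and changed_b:
            key = "both"
        elif changed_a:
            key = "only_a"
        elif changed_b:
            key = "only_b"
        else:
            key = "neither"
        result[key].append(organism)
    return result


# Toy columns: residue found in each organism at one aligned position.
# Organisms and pattern are invented; only the residue types echo the
# Arg-to-Lys and Gln-to-Arg example in the text.
apex1_column = {"human": "R", "org1": "K", "org2": "K", "org3": "R", "org4": "R"}
polb_column = {"human": "Q", "org1": "R", "org2": "R", "org3": "Q", "org4": "R"}

result = correlated_changes(apex1_column, polb_column, reference="human")
print(result)
```

Organisms listed under `both` are candidates for coordinated change at the two positions; organisms under `only_a` or `only_b` show variation in one protein without a matching change in the other.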
Comparative protein structure modeling of E. focardii γ-tubulins
Description of γ-tubulins
As part of a study characterizing γ-tubulin from the psychrophilic Antarctic ciliate
Euplotes focardii, comparative models for γ-tubulins were built. γ-Tubulin is a low
abundance protein that is localized to the pericentriolar material. It is important in the
nucleation and polar orientation of microtubules [128]. Microtubule assembly is nucleated by organizing centers, which include centrioles, basal bodies, and other structures [128, 129]. Both centrioles and basal bodies require γ-tubulin for their assembly and maintenance. Microtubule assembly is entropically driven, predominantly via hydrophobic interactions, and therefore environmental temperature plays an important role both in vitro and in vivo [129-131].
Comparative protein structure modeling of E. focardii γ-tubulins
In this study, the sequences of γ-T1 and γ-T2 of E. focardii were modeled [129].
Comparative models of the two E. focardii γ-tubulins were obtained with
MODELLER (version 9v1) and the Friend interface. The 3.0 Å structure of human γ-tubulin containing bound GTP (Protein Data Bank entry 1z5w) was used as the template for comparative modeling. Structural alignments between the template and modeled sequences were performed with TOPOFIT, and the models were analyzed with the Friend software. The percentage similarities between modeled and template sequences were
68.36% for γ-T1 and 69.28% for γ-T2, and the length of the alignment was 433 residues for both models. Based on these values, it is estimated that the accuracies of the modeled structures of γ-T1 and γ-T2 approach 3 Å [129, 132].
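A percentage over an alignment is a ratio of scored positions to aligned positions. The sketch below computes percent identity, which is a simpler stand-in for the percent similarity reported here, since similarity also needs a definition of which residue pairs count as similar and none is given. The function name `percent_identity` and the toy sequences are hypothetical. As a back-calculation that is not reported in the source, 68.36% and 69.28% of 433 aligned residues correspond to about 296 and 300 positions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, aligned length) for two gapped sequences.

    Columns where either sequence has a gap are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    aligned = 0
    matches = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue
        aligned += 1
        if res_a == res_b:
            matches += 1
    if aligned == 0:
        raise ValueError("no aligned positions")
    return 100.0 * matches / aligned, aligned


# Toy alignment, not a tubulin sequence.
identity, length = percent_identity("ACDE-G", "ACDFHG")
print(identity, length)

# Back-calculated counts for the reported values (an assumption, see above).
print(round(100 * 296 / 433, 2), round(100 * 300 / 433, 2))
```

Excluding gapped columns from the denominator is one common convention; dividing by the full alignment length or by the shorter sequence length would give different percentages.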
The comparative models show that the substitutions of proline for threonine at position 297 of γ-T2 and for serine at position 303 of γ-T1 do not significantly alter the conformation of the H9-S8 loop, although they are likely to restrict its mobility.
However, the Pro303 substitution of γ-T1 eliminates the bent hydrogen bond that forms between Ser303 and Asn205 in γ-T2 (see Figure 22 A and B).
Conclusion
The three-dimensional structure of a protein provides important information for understanding and answering many biological questions. Combining this information with other available sources offers new extensions and perspectives. The development of the StSNP web server is a good example of uniting available sources for scientific exploration. Since variations in DNA sequence may have a major impact on how humans respond to disease, environmental factors, and drugs [54], it is also important to visualize and analyze nsSNPs on protein structures. However, as described previously, experimental structures are not currently available for every protein sequence in the databases. Comparative structure modeling provides models for sequences with unknown structures. Mapping nsSNPs onto these models not only provides deeper information about substitutions and their possible structural effects, but also gives insight into where nsSNPs may lie in the three-dimensional structure of a protein whose structure is unknown. Combining this information with pathway data makes it possible to propose interactions among proteins and pathways that may be involved in disease, and so helps in understanding the underlying structural and physiological mechanisms. Thus, the first steps have been taken in the development of a resource for mapping nsSNPs onto protein structures, providing structural insight into the effects of nsSNPs on proteins, such as stability, functionality, protein–protein interactions and other structurally related issues. As a web server in a rapidly evolving area of research, StSNP is designed to evolve with other related resources; future directions include a more detailed analysis of the SNP, predictions of the functional/biological implications of the SNP(s), and the use of image map technology from the KEGG API for more interactive data retrieval.
StSNP creates the basis for further studies involving the metabolic pathways and the disease(s) associated with a particular SNP.
The comparative functional genomics analysis of HMGN3a and SMARCAL1 across mammals included both comparative protein structure modeling and multiple sequence alignment analysis. The results suggest high levels of structural conservation of these proteins and highlight the importance of chromatin remodeling in the regulation of gene expression, particularly during early mammalian embryonic development. The greater similarity of the human and bovine HMGN3a and SMARCAL1 proteins may suggest the cow as a valuable model to study chromatin remodeling at the onset of mammalian
development. Understanding the roles of chromatin remodeling proteins during
embryonic development emphasizes the importance of epigenetics and could shed light
on the underlying mechanisms of early mammalian development.
Comparative protein structure modeling was also used in a research
project on locating active site residues with the THEMATICS method. The
aim of the study was to show that THEMATICS can predict active site locations
in comparative model structures. For proteins whose sequences and templates were
available for comparative modeling, the results showed that THEMATICS performed
successfully. More than 40 comparative models from several different species were analyzed with
THEMATICS. This study also showed the importance of
using accurate models, since model quality affects end results such as the accuracy of the predicted active
sites.
APEX1 and pol-β are involved in DNA repair. In this study, the
interacting interface residues of these proteins were determined with the aid of
molecular dynamics applications. In a further step of the study, multiple sequence alignments were
performed across species, which identified coordinated mutations of specific residues at the predicted interfaces.
The research projects I worked on during this dissertation were published in six peer-reviewed journal articles, and the data and related information presented in this dissertation were obtained from these publications [70, 88, 116, 126, 129, 133].
Tables
Table 1. Representing query and modeling options for resources. Table is reproduced from Uzun et al. [70].
Table 2. Summary of resources. Table shows (pages 59 and 60) the differences and the similarities of the resources for their search options and background information (number of nsSNP information for each database is from 2007, current update of StSNP is presented in the introduction part). Table is reproduced from Uzun et al. [70].
Table 3. Organisms and protein accession numbers used in MSA of SMARCAL1. Table is reproduced from Uzun et al. [88].
Table 4. Organisms and protein accession numbers used in multiple sequence alignment of HMGN3a. Table is reproduced from Uzun et al. [88].
Table 5. Predicted clusters for orthologous structures for three templates. For each model structure, % pairwise identity with the template and the RMSD value in Å are given. THEMATICS results for the active site cluster are given with known active site residues shown in boldface and second shell residues underlined. Sequence numbers for the models are adjusted to match those of the template structures. Table is reproduced from Shehadi et al. [116].
Table 6. Predicted active site clusters for additional orthologous sets for the templates. Known active site residues are shown in bold. Second shell residues are underlined. For the models, residues aligned with a known active site residue in the template are shown in bold; those aligned with a second shell template residue are underlined. Table is reproduced from Shehadi et al. [116]
Table 7. Predicted active site clusters for nine homologous sets (pages 66 and 67). For the templates, known active site residues are shown in bold. Second shell residues are underlined. For the models, residues aligned with a known active site residue in the template are shown in bold; those aligned with a second shell template residue are underlined. Table is reproduced from Shehadi et al. [116].
Figures
Figure 1. Growth of released protein structures per year. The graph displays the number of searchable structures per year in the PDB since 1976. Red bars represent the number of structures accumulated per year. The number of structures available prior to 1985 was less than 195.
Figure 2. Schematic of StSNP web server. StSNP is an interactive web server, which utilizes several heterogeneous data sources.
Figure 3. Data generation in StSNP. (A) Main query page, (B) Formatted data for nsSNPs along with graphical alignment representation, (C) nsSNP(s) selection for modeling, (D) Output page, and (E) Visualization in the Friend applet.
Figure 4A. nsSNPs and Glutathione S Transferase. Glutathione S Transferase is shown with nsSNP locations displayed in ball and stick representation, with I105V marked with a black circle. The reference residues are shown in blue, nonsynonymous residues in red and the substrate glutathione is displayed in space fill representation (yellow). The query for the example was Protein ID NP_000843 and template PDB ID 1aqv chain B. B. The Results section also provides a user with a link to glutathione metabolism in order to view other members found in the pathway. Figure 4A is reproduced from Uzun et al. [70].
Figure 5. nsSNPs and Aldehyde dehydrogenase-2. Aldehyde dehydrogenase-2 is shown with nsSNP locations displayed in ball and stick representation, with E504K marked with a black circle. The reference residues are shown in blue, nonsynonymous residues in red and the substrate NAD is displayed in space fill representation (green). The query for the example was Protein ID NP_000681 and template PDB ID 1ag8 chain A. The backbones of the models with nsSNPs and with reference residues are shown in different colors (purple and green). Figure is reproduced from Uzun et al. [70].
Figure 6. Phylogenetic tree for HMGN3a. Phylogenetic tree of evolutionary relationships of HMGN3a in 7 mammalian taxa using the Maximum Parsimony method. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. All positions containing gaps and missing data were eliminated from the dataset. Figure is reproduced from Uzun et al. [88].
Figure 7. MSA of HMGN3a. Highlighted regions show substitutions in at least 1 of the 7 species. The alignment includes both the official bovine HMGN3a gene model GLEAN_08006, and the bovine NCBI HMGN3a protein (NP_001029676.1). The insertion of alanine in the fifth exon of the bovine protein is marked in red. Figure is reproduced from Uzun et al. [88].
Figure 8. Four distinctive domains of SMARCAL1. The starting and ending residue numbers are 245–299, 342–396, 437–727, and 741–818. Figure is reproduced from Uzun et al. [88].
Figure 9. Phylogenetic tree for SMARCAL1. Phylogenetic tree of evolutionary relationships of the complete SMARCAL1 protein in 9 mammalian taxa, using the Maximum Parsimony method. The bootstrap consensus tree inferred from 500 replicates is taken to represent the evolutionary history of the taxa analyzed. All positions containing gaps and missing data were eliminated from the dataset. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].
Figure 10. First HARP domain of SMARCAL1. SMARCAL1 first HARP domain phylogenetic tree with the highest parsimony (length = 46). The consistency index is 0.95, the retention index is 0.94, and the composite index is 0.90 for all sites and parsimony-informative sites. There were a total of 57 positions in the final dataset, of which 20 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].
Figure 11. MSA of first HARP domain in SMARCAL1. Substitutions in at least one species are highlighted. Numbers on top of the alignment show significant substitutions which are mentioned in the text. Figure is reproduced from Uzun et al. [88].
Figure 12. Second HARP domain of SMARCAL1. SMARCAL1 second HARP domain phylogenetic tree with the highest parsimony (length = 53). The consistency index is 0.84, the retention index is 0.85, and the composite index is 0.77 for all sites. There were a total of 62 positions in the final dataset, of which 17 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].
Figure 13. MSA of second HARP domain in SMARCAL1. Highlighted regions show where substitutions occur. Figure is reproduced from Uzun et al. [88].
Figure 14. SNF2N domain of SMARCAL1. SMARCAL1 SNF2N domain phylogenetic tree with the highest parsimony (length = 145). The consistency index is 0.81, the retention index is 0.79, and the composite index is 0.70 for all sites. There were a total of 290 positions in the final dataset, out of which 39 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches. Figure is reproduced from Uzun et al. [88].
Figure 15. MSA of SNF2N domain in SMARCAL1. Multiple sequence alignment of the SNF2N domain in SMARCAL1. Highlighted regions show where the substitutions occur. The multiple substitutions marked in red in the Bos taurus sequence may be due to sequencing errors since the corrected model for this protein (GLEAN_20241, in blue) added to this portion of the alignment only differs from the human sequence in 2 amino acids. Figure is reproduced from Uzun et al. [88].
Figure 16. Phylogenetic tree of helicase C-terminal domain in SMARCAL1. SMARCAL1 helicase C-terminal domain phylogenetic tree with the highest parsimony (length = 52). The consistency index is 0.86, the retention index is 0.81, and the composite index is 0.77 for all sites. There were a total of 78 positions in the final dataset, out of which 11 were parsimony informative. The percentage of replicate trees in which the associated taxa clustered together in the bootstrap test (500 replicates) is shown next to the branches.
Figure 17. MSA of the helicase C-terminal domain in SMARCAL1. Highlighted regions show where the substitutions occur. Figure is reproduced from Uzun et al. [88].
Figure 18. Disparity Index test of HMGN3a. Probability of rejecting the null hypothesis that HMGN3a sequences have evolved with the same pattern of substitution, as judged from the extent of differences in base composition biases between sequences (Disparity Index test). A Monte Carlo test (1000 replicates) was used to estimate the P-values, which are shown below the diagonal. P-values smaller than 0.05 are considered significant. The estimates of the disparity index per site are shown for each sequence pair above the diagonal. There were a total of 95 positions in the final dataset. None of the P-values were smaller than 0.05. All positions containing gaps and missing data were eliminated from the dataset. Figure is reproduced from Uzun et al. [88]. Black colored numbers: Probability computed (must be < 0.05 for hypothesis rejection at 5% level), Blue colored numbers: Disparity Index.
Figure 19. Disparity Index test of SMARCAL1. Probability of rejecting the null hypothesis that the sequences of the SMARCAL1 conserved domains have evolved with the same pattern of substitution, as judged from the extent of differences in base composition biases between sequences (Disparity Index test). A Monte Carlo test (1000 replicates) was used to estimate the P-values, which are shown below the diagonal. P-values smaller than 0.05 are considered significant. The estimates of the disparity index per site are shown for each sequence pair above the diagonal. All positions containing gaps and missing data were eliminated from the dataset. A. First HARP domain: there were a total of 57 positions in the final dataset. B. Second HARP domain: there were a total of 62 positions in the final dataset. C. SNF2N domain: there were a total of 290 positions in the final dataset. D. Helicase C-terminal domain: there were a total of 78 positions in the final dataset. Figure is reproduced from Uzun et al. [88]. Black colored numbers: Probability computed (must be < 0.05 for hypothesis rejection at 5% level [yellow background]), Blue colored numbers: Disparity Index.
Figure 20. MSA of helicase-like and helicase domains with modeled protein structure. Based on the domains in the multiple sequence alignment, residues are colored in green and blue. Green represents the helicase-like domain and blue represents the helicase domain. ATP binding residues are shown in red in ball and stick representation. Nucleotide binding residues are colored in magenta. The arrows show the residue located in the non-conserved region on the structure and in the MSA, which is indicated in yellow. Figure is reproduced from Uzun et al. [88].
Figure 21. MSA of APEX1 and pol-β. Only the alignment for fragments of interacting regions in the 3’-complexes (with open and closed conformations of pol-β) is shown. Residues at the interfaces are in bold; neighboring residues are in normal font. Interacting residues include residues from segments #1, #2, and #3 and adjacent residues, found at the interface only in the complex with the open conformation of pol-β. Adjacent residues are termed (where possible) AR. Correlated mutations of interacting residues are highlighted in cyan and orange. Other variations in interacting residues are highlighted in red. Figure is reproduced from Abyzov et al. [126].
Figure 22. A view of the model structures of γ-T1 and γ-T2 at positions 303 and 205. (A) γ-T1 contains a proline at position 303, in contrast to the serine of γ-T2 (B). The proline substitution in γ-T1 eliminates the hydrogen bond that forms between Ser303 and Asn205 in γ-T2. Figure is reproduced from Marziale et al. [129].
References 1. Berman, H.M., et al., The Protein Data Bank. Nucleic Acids Res, 2000. 28(1): p. 235-42. 2. Baker, D. and A. Sali, Protein structure prediction and structural genomics. Science, 2001. 294(5540): p. 93-6. 3. Blundell, T.L., et al., Knowledge-based prediction of protein structures and the design of novel molecules. Nature, 1987. 326(6111): p. 347-52. 4. Marti-Renom, M.A., et al., Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct, 2000. 29: p. 291- 325. 5. Jacobson, M. and A. Sali, Comparative Protein Structure Modeling and its Applications to Drug Discovery. Annual Reports in Medicinal Chemistry, 2004. 39: p. 259-276. 6. Sali, A., Modeling mutations and homologous proteins. Curr Opin Biotechnol, 1995. 6(4): p. 437-51. 7. Sali, A., 100,000 protein structures for the biologist. Nat Struct Biol, 1998. 5(12): p. 1029-32. 8. Rost, B., Twilight zone of protein sequence alignments. Protein Eng, 1999. 12(2): p. 85-94. 9. Bordoli, L., et al., Protein structure homology modeling using SWISS- MODEL workspace. Nat Protoc, 2009. 4(1): p. 1-13. 10. Altschul, S.F., et al., Basic local alignment search tool. J Mol Biol, 1990. 215(3): p. 403-10. 11. Pearson, W.R., Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol, 1990. 183: p. 63-98. 12. Altschul, S.F., et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 1997. 25(17): p. 3389-402. 13. Gough, J., et al., Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. J Mol Biol, 2001. 313(4): p. 903-19. 14. Jones, D.T., GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. J Mol Biol, 1999. 287(4): p. 797-815. 15. Lindahl, E. and A. Elofsson, Identification of related proteins on family, superfamily and fold level. J Mol Biol, 2000. 295(3): p. 613-25. 16. John, B. and A. 
Sali, Detection of homologous proteins by an intermediate sequence search. Protein Sci, 2004. 13(1): p. 54-62. 17. Chothia, C. and A.M. Lesk, The relation between the divergence of sequence and structure in proteins. Embo J, 1986. 5(4): p. 823-6. 18. Canutescu, A.A., A.A. Shelenkov, and R.L. Dunbrack, Jr., A graph-theory algorithm for rapid protein side-chain prediction. Protein Sci, 2003. 12(9): p. 2001-14.
19. Rohl, C.A., et al., Modeling structurally variable regions in homologous proteins with rosetta. Proteins, 2004. 55(3): p. 656-77. 20. Soto, C.S., et al., Loop modeling: Sampling, filtering, and scoring. Proteins, 2008. 70(3): p. 834-43. 21. Bystroff, C. and D. Baker, Prediction of local structure in proteins using a library of sequence-structure motifs. J Mol Biol, 1998. 281(3): p. 565-77. 22. Unger, R., et al., A 3D building blocks approach to analyzing and predicting structure of proteins. Proteins, 1989. 5(4): p. 355-73. 23. Chinea, G., et al., The use of position-specific rotamers in model building by homology. Proteins, 1995. 23(3): p. 415-21. 24. Chothia, C. and A.M. Lesk, Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol, 1987. 196(4): p. 901-17. 25. Jones, T.A. and S. Thirup, Using known substructures in protein model building and crystallography. Embo J, 1986. 5(4): p. 819-22. 26. Bruccoleri, R.E. and M. Karplus, Prediction of the folding of short polypeptide segments by uniform conformational sampling. Biopolymers, 1987. 26(1): p. 137-68. 27. Shenkin, P.S., et al., Predicting antibody hypervariable loop conformation. I. Ensembles of random conformations for ringlike structures. Biopolymers, 1987. 26(12): p. 2053-85. 28. Deane, C.M. and T.L. Blundell, CODA: a combined algorithm for predicting the structurally variable regions of protein models. Protein Sci, 2001. 10(3): p. 599-612. 29. van Vlijmen, H.W. and M. Karplus, PDB-based protein loop prediction: parameters for selection and methods for optimization. J Mol Biol, 1997. 267(4): p. 975-1001. 30. Janin, J. and C. Chothia, Role of hydrophobicity in the binding of coenzymes. Appendix. Translational and rotational contribution to the free energy of dissociation. Biochemistry, 1978. 17(15): p. 2943-8. 31. Mendes, J., et al., Improved modeling of side-chains in proteins with rotamer-based methods: a flexible rotamer model. Proteins, 1999. 37(4): p. 530-43. 32. 
Tuffery, P., et al., A new approach to the rapid determination of protein side chain conformations. J Biomol Struct Dyn, 1991. 8(6): p. 1267-89. 33. Xiang, Z. and B. Honig, Extending the accuracy limits of prediction for side-chain conformations. J Mol Biol, 2001. 311(2): p. 421-30. 34. Desjarlais, J.R. and T.M. Handel, Side-chain and backbone flexibility in protein core design. J Mol Biol, 1999. 290(1): p. 305-18. 35. Ginalski, K., Comparative modeling for protein structure prediction. Curr Opin Struct Biol, 2006. 16(2): p. 172-7. 36. Kryshtafovych, A., et al., Progress over the first decade of CASP experiments. Proteins, 2005. 61 Suppl 7: p. 225-36. 37. Tress, M., et al., Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins, 2005. 61 Suppl 7: p. 27-45.
38. Cozzetto, D. and A. Tramontano, Relationship between multiple sequence alignments and quality of protein comparative models. Proteins, 2005. 58(1): p. 151-7. 39. Zhou, H. and Y. Zhou, Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci, 2002. 11(11): p. 2714-26. 40. Luthy, R., J.U. Bowie, and D. Eisenberg, Assessment of protein models with three-dimensional profiles. Nature, 1992. 356(6364): p. 83-5. 41. Melo, F. and E. Feytmans, Assessing protein structures with a non-local atomic interaction energy. J Mol Biol, 1998. 277(5): p. 1141-52. 42. Wallner, B. and A. Elofsson, Can correct protein models be identified? Protein Sci, 2003. 12(5): p. 1073-86. 43. Sippl, M.J., Boltzmann's principle, knowledge-based mean fields and protein folding. An approach to the computational determination of protein structures. J Comput Aided Mol Des, 1993. 7(4): p. 473-501. 44. Arnold, K., et al., The SWISS-MODEL workspace: a web-based environment for protein structure homology modelling. Bioinformatics, 2006. 22(2): p. 195-201. 45. Notredame, C., Recent evolutions of multiple sequence alignment algorithms. PLoS Comput Biol, 2007. 3(8): p. e123. 46. Mount, D., Bioinformatics: sequence and genome analysis. 2nd Edition ed. 2004, New York Cold Spring Harbour Laboratory Press. 47. Pei, J., Multiple protein sequence alignment. Curr Opin Struct Biol, 2008. 18(3): p. 382-6. 48. Wallace, I.M., G. Blackshields, and D.G. Higgins, Multiple sequence alignments. Curr Opin Struct Biol, 2005. 15(3): p. 261-6. 49. Higgins, D.G. and P.M. Sharp, CLUSTAL: a package for performing multiple sequence alignment on a microcomputer. Gene, 1988. 73(1): p. 237-44. 50. Higgins, D.G. and P.M. Sharp, Fast and sensitive multiple sequence alignments on a microcomputer. Comput Appl Biosci, 1989. 5(2): p. 151-3. 51. Thompson, J.D., D.G. Higgins, and T.J. 
Gibson, CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res, 1994. 22(22): p. 4673-80. 52. Larkin, M.A., et al., Clustal W and Clustal X version 2.0. Bioinformatics, 2007. 23(21): p. 2947-8. 53. Higgins, D.G., J.D. Thompson, and T.J. Gibson, Using CLUSTAL for multiple sequence alignments. Methods Enzymol, 1996. 266: p. 383-402. 54. The International HapMap Project. Nature, 2003. 426(6968): p. 789-96. 55. Sachidanandam, R., et al., A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 2001. 409(6822): p. 928-33.
56. Wang, Z. and J. Moult, SNPs, protein structure, and disease. Hum Mutat, 2001. 17(4): p. 263-70.
57. Stoneking, M., Single nucleotide polymorphisms. From the evolutionary past. Nature, 2001. 409(6822): p. 821-2.
58. Sherry, S.T., et al., dbSNP: the NCBI database of genetic variation. Nucleic Acids Res, 2001. 29(1): p. 308-11.
59. Chasman, D. and R.M. Adams, Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. J Mol Biol, 2001. 307(2): p. 683-706.
60. Fredman, D., et al., HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources. Nucleic Acids Res, 2002. 30(1): p. 387-91.
61. Hirakawa, M., et al., JSNP: a database of common gene variations in the Japanese population. Nucleic Acids Res, 2002. 30(1): p. 158-62.
62. Sunyaev, S., V. Ramensky, and P. Bork, Towards a structural basis of human non-synonymous single nucleotide polymorphisms. Trends Genet, 2000. 16(5): p. 198-200.
63. Stitziel, N.O., et al., topoSNP: a topographic database of non-synonymous single nucleotide polymorphisms with and without known disease association. Nucleic Acids Res, 2004. 32(Database issue): p. D520-2.
64. Yip, Y.L., et al., The Swiss-Prot variant page and the ModSNP database: a resource for sequence and structure information on human protein variants. Hum Mutat, 2004. 23(5): p. 464-70.
65. Karchin, R., et al., LS-SNP: large-scale annotation of coding non-synonymous SNPs based on multiple information sources. Bioinformatics, 2005. 21(12): p. 2814-20.
66. Reumers, J., et al., SNPeffect: a database mapping molecular phenotypic effects of human non-synonymous coding SNPs. Nucleic Acids Res, 2005. 33(Database issue): p. D527-32.
67. Dantzer, J., et al., MutDB services: interactive structural analysis of mutation data. Nucleic Acids Res, 2005. 33(Web Server issue): p. W311-4.
68. Han, A., et al., SNP@Domain: a web resource of single nucleotide polymorphisms (SNPs) within protein domain structures and sequences. Nucleic Acids Res, 2006. 34(Web Server issue): p. W642-4.
69. Li, S., et al., Snap: an integrated SNP annotation platform. Nucleic Acids Res, 2007. 35(Database issue): p. D707-10.
70. Uzun, A., et al., Structure SNP (StSNP): a web server for mapping and modeling nsSNPs on protein structures with linkage to metabolic pathways. Nucleic Acids Res, 2007. 35(Web Server issue): p. W384-92.
71. Abyzov, A., et al., Friend, an integrated analytical front-end application for bioinformatics. Bioinformatics, 2005. 21(18): p. 3677-8.
72. Leslin, C.M., A. Abyzov, and V.A. Ilyin, Structural exon database, SEDB, mapping exon boundaries on multiple protein structures. Bioinformatics, 2004. 20(11): p. 1801-3.
73. Smith, T.F. and M.S. Waterman, Comparison of biosequences. Adv Appl Math, 1981. 2: p. 482-9.
74. Henikoff, S. and J.G. Henikoff, Performance evaluation of amino acid substitution matrices. Proteins, 1993. 17(1): p. 49-61.
75. Kanehisa, M., A database for post-genome analysis. Trends Genet, 1997. 13(9): p. 375-6.
76. Kanehisa, M. and S. Goto, KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res, 2000. 28(1): p. 27-30.
77. Pruitt, K.D. and D.R. Maglott, RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res, 2001. 29(1): p. 137-40.
78. Ilyin, V.A., A. Abyzov, and C.M. Leslin, Structural alignment of proteins by a novel TOPOFIT method, as a superimposition of common volumes at a topomax point. Protein Sci, 2004. 13(7): p. 1865-74.
79. Leslin, C.M., A. Abyzov, and V.A. Ilyin, TOPOFIT-DB, a database of protein structural alignments based on the TOPOFIT method. Nucleic Acids Res, 2007. 35(Database issue): p. D317-21.
80. Hashimoto, T., et al., A functional glutathione S-transferase P1 gene polymorphism is associated with methamphetamine-induced psychosis in Japanese population. Am J Med Genet B Neuropsychiatr Genet, 2005. 135B(1): p. 5-9.
81. Marth, G.T., et al., The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics, 2004. 166(1): p. 351-72.
82. Goedde, H.W., et al., Population genetic studies on aldehyde dehydrogenase isozyme deficiency and alcohol sensitivity. Am J Hum Genet, 1983. 35(4): p. 769-72.
83. Li, Y., et al., Mitochondrial aldehyde dehydrogenase-2 (ALDH2) Glu504Lys polymorphism contributes to the variation in efficacy of sublingual nitroglycerin. J Clin Invest, 2006. 116(2): p. 506-11.
84. Memili, E., T. Dominko, and N.L. First, Onset of transcription in bovine oocytes and preimplantation embryos. Mol Reprod Dev, 1998. 51(1): p. 36-41.
85. Memili, E. and N.L. First, Control of gene expression at the onset of bovine embryonic development. Biol Reprod, 1999. 61(5): p. 1198-207.
86. Misirlioglu, M., et al., Dynamics of global transcriptome in bovine matured oocytes and preimplantation embryos. Proc Natl Acad Sci U S A, 2006. 103(50): p. 18905-10.
87. Thompson, E.M., E. Legouy, and J.P. Renard, Mouse embryos do not wait for the MBT: chromatin and RNA polymerase remodeling in genome activation at the onset of development. Dev Genet, 1998. 22(1): p. 31-42.
88. Uzun, A., et al., Functional genomics of HMGN3a and SMARCAL1 in early mammalian embryogenesis. BMC Genomics, 2009. 10: p. 183.
89. Olave, I., et al., Identification of a polymorphic, neuron-specific chromatin remodeling complex. Genes Dev, 2002. 16(19): p. 2509-17.
90. Shirakawa, H., et al., Targeting of high mobility group-14/-17 proteins in chromatin is independent of DNA sequence. J Biol Chem, 2000. 275(48): p. 37937-44.
91. Bustin, M. and R. Reeves, High-mobility-group chromosomal proteins: architectural components that facilitate chromatin function. Prog Nucleic Acid Res Mol Biol, 1996. 54: p. 35-100.
92. West, K.L., et al., HMGN3a and HMGN3b, two protein isoforms with a tissue-specific expression pattern, expand the cellular repertoire of nucleosome-binding proteins. J Biol Chem, 2001. 276(28): p. 25959-69.
93. Mohamed, O.A., M. Bustin, and H.J. Clarke, High-mobility group proteins 14 and 17 maintain the timing of early embryonic development in the mouse. Dev Biol, 2001. 229(1): p. 237-49.
94. Crippa, M.P., J.M. Nickol, and M. Bustin, Developmental changes in the expression of high mobility group chromosomal proteins. J Biol Chem, 1991. 266(5): p. 2712-4.
95. Korner, U., et al., Developmental role of HMGN proteins in Xenopus laevis. Mech Dev, 2003. 120(10): p. 1177-92.
96. Banine, F., et al., SWI/SNF chromatin-remodeling factors induce changes in DNA methylation to promote transcriptional activation. Cancer Res, 2005. 65(9): p. 3542-7.
97. Fry, C.J. and C.L. Peterson, Chromatin remodeling enzymes: who's on first? Curr Biol, 2001. 11(5): p. R185-97.
98. Pollard, K.J. and C.L. Peterson, Chromatin remodeling: a marriage between two families? Bioessays, 1998. 20(9): p. 771-80.
99. Lusser, A. and J.T. Kadonaga, Chromatin remodeling by ATP-dependent molecular machines. Bioessays, 2003. 25(12): p. 1192-200.
100. Zheng, P., et al., Expression of genes encoding chromatin regulatory factors in developing rhesus monkey oocytes and preimplantation stage embryos: possible roles in genome activation. Biol Reprod, 2004. 70(5): p. 1419-27.
101. LeGouy, E., et al., Differential preimplantation regulation of two mouse homologues of the yeast SWI2 protein. Dev Dyn, 1998. 212(1): p. 38-48.
102. Magnani, L. and R.A. Cabot, Developmental arrest induced in cleavage stage porcine embryos following microinjection of mRNA encoding Brahma (Smarca 2), a chromatin remodeling protein. Mol Reprod Dev, 2007. 74(10): p. 1262-7.
103. Gebuhr, T.C., S.J. Bultman, and T. Magnuson, Pc-G/trx-G and the SWI/SNF connection: developmental gene regulation through chromatin remodeling. Genesis, 2000. 26(3): p. 189-97.
104. Reyes, J.C., et al., Altered control of cellular proliferation in the absence of mammalian brahma (SNF2alpha). Embo J, 1998. 17(23): p. 6979-91.
105. Coleman, M.A., J.A. Eisen, and H.W. Mohrenweiser, Cloning and characterization of HARP/SMARCAL1: a prokaryotic HepA-related SNF2 helicase protein from human and mouse. Genomics, 2000. 65(3): p. 274-82.
106. Dahiya, R., S. Cleveland, and C.A. Megerian, Spondyloepiphyseal dysplasia congenita associated with conductive hearing loss. Ear Nose Throat J, 2000. 79(3): p. 178-82.
107. Boerkoel, C.F., et al., Mutant chromatin remodeling protein SMARCAL1 causes Schimke immuno-osseous dysplasia. Nat Genet, 2002. 30(2): p. 215-20.
108. Spranger, J., A. Winterpacht, and B. Zabel, The type II collagenopathies: a spectrum of chondrodysplasias. Eur J Pediatr, 1994. 153(2): p. 56-65.
109. Sun, F., et al., Expression of SRG3, a chromatin-remodelling factor, in the mouse oocyte and early preimplantation embryos. Zygote, 2007. 15(2): p. 129-38.
110. Tamura, K., et al., MEGA4: Molecular Evolutionary Genetics Analysis (MEGA) software version 4.0. Mol Biol Evol, 2007. 24(8): p. 1596-9.
111. Muthuswami, R., et al., A eukaryotic SWI2/SNF2 domain, an exquisite detector of double-stranded to single-stranded DNA transition elements. J Biol Chem, 2000. 275(11): p. 7648-55.
112. Theis, K., et al., Crystal structure of UvrB, a DNA helicase adapted for nucleotide excision repair. Embo J, 1999. 18(24): p. 6899-907.
113. Finn, R.D., et al., Pfam: clans, web tools and services. Nucleic Acids Res, 2006. 34(Database issue): p. D247-51.
114. Pettersen, E.F., et al., UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem, 2004. 25(13): p. 1605-12.
115. Ondrechen, M.J., J.G. Clifton, and D. Ringe, THEMATICS: a simple computational predictor of enzyme function from structure. Proc Natl Acad Sci U S A, 2001. 98(22): p. 12473-8.
116. Shehadi, I.A., et al., Active site prediction for comparative model structures with THEMATICS. J Bioinform Comput Biol, 2005. 3(1): p. 127-43.
117. Shehadi, I.A., H. Yang, and M.J. Ondrechen, Future directions in protein function prediction. Mol Biol Rep, 2002. 29(4): p. 329-35.
118. Tong, W., et al., Enhanced performance in prediction of protein active sites with THEMATICS and support vector machines. Protein Sci, 2008. 17(2): p. 333-41.
119. Wei, Y., et al., Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics, 2007. 8: p. 119.
120. Westbrook, J., et al., The Protein Data Bank and structural genomics. Nucleic Acids Res, 2003. 31(1): p. 489-91.
121. Fiser, A., R.K. Do, and A. Sali, Modeling of loops in protein structures. Protein Sci, 2000. 9(9): p. 1753-73.
122. Lodi, P.J. and J.R. Knowles, Neutral imidazole is the electrophile in the reaction catalyzed by triosephosphate isomerase: structural origins and catalytic implications. Biochemistry, 1991. 30(28): p. 6948-56.
123. Zhang, Z., et al., Crystal structure of recombinant chicken triosephosphate isomerase phosphoglycolo-hydroxamate complex at 1.8-A resolution. Biochemistry, 1994. 33(10): p. 2830-7.
124. Podlutsky, A.J., et al., Human DNA polymerase beta initiates DNA synthesis during long-patch repair of reduced AP sites in DNA. Embo J, 2001. 20(6): p. 1477-82.
125. Lindahl, T. and B. Nyberg, Rate of depurination of native deoxyribonucleic acid. Biochemistry, 1972. 11(19): p. 3610-8.
126. Abyzov, A., et al., An AP endonuclease 1-DNA polymerase beta complex: theoretical prediction of interacting surfaces. PLoS Comput Biol, 2008. 4(4): p. e1000066.
127. Boiteux, S. and M. Guillet, Abasic sites in DNA: repair and biological consequences in Saccharomyces cerevisiae. DNA Repair (Amst), 2004. 3(1): p. 1-12.
128. Shu, H.B. and H.C. Joshi, Gamma-tubulin can both nucleate microtubule assembly and self-assemble into novel tubular structures in mammalian cells. J Cell Biol, 1995. 130(5): p. 1137-47.
129. Marziale, F., et al., Different roles of two gamma-tubulin isotypes in the cytoskeleton of the Antarctic ciliate Euplotes focardii: remodelling of interaction surfaces may enhance microtubule nucleation at low temperature. Febs J, 2008. 275(21): p. 5367-82.
130. Detrich, H.W., 3rd, et al., Brain and egg tubulins from antarctic fishes are functionally and structurally distinct. J Biol Chem, 1992. 267(26): p. 18766-75.
131. Detrich, H.W., 3rd, K.A. Johnson, and S.P. Marchese-Ragona, Polymerization of Antarctic fish tubulins at low temperatures: energetic aspects. Biochemistry, 1989. 28(26): p. 10085-93.
132. Hoffman, D.C., et al., Macronuclear gene-sized molecules of hypotrichs. Nucleic Acids Res, 1995. 23(8): p. 1279-83.
133. Shehadi, I., et al., THEMATICS is Effective for Active Site Prediction in Comparative Model Structures. APBC, 2004. 1: p. 209-15.